
ExpertEval: Can AI Match the Judgment of Seasoned Professionals?

2026-05-06

Toward Evaluating Real-World, High-Stakes Productivity

The field has grown adept at constructing benchmarks that AI systems can pass, yet passing is not the same as performing. We pose a more exacting question: can AI systems produce professional work product that a credentialed domain expert would sign off on? ExpertEval is a large-scale evaluation infrastructure spanning Medicine, Finance, and Law, three domains where the quality of reasoning is judged not by accuracy metrics but by the severity of its real-world consequences. Every case is authored by practicing professionals with verifiable domain credentials and paired with fine-grained rubrics that encode expert judgment, including the characteristic failure modes that separate competent analysis from dangerously fluent confabulation.

UniPat Evaluation Team

contact@unipat.ai

ExpertEval encodes the tacit standards of seasoned professionals into structured rubrics, transforming years of domain judgment into reproducible, transparent evaluation criteria.

Expert-Authored · Rubric-Driven · Failure-Aware

Every rubric criterion traces back to authoritative primary sources, and characteristic professional errors are encoded as Critical Negatives that carry substantial penalties. The result: we measure not whether a model can produce an answer, but whether that answer would survive peer review by a credentialed practitioner.

Beyond Pattern Matching

Professional domains impose reasoning under genuine epistemic uncertainty, where novel constraint configurations arise that no training distribution has adequately represented. We construct queries that faithfully reproduce this cognitive load: multi-system organ failure complicated by pharmacokinetic drug conflicts, cross-default cascades propagating through distressed capital structures, and multi-jurisdictional legal fact patterns requiring simultaneous navigation of parallel criminal and civil liability tracks.

Auditable Ground Truth

Every rubric criterion that invokes external knowledge is anchored to an authoritative primary source with a traceable URL: clinical practice guidelines, regulatory filings, statutory provisions, judicial interpretations. This design choice transforms evaluation from an exercise in subjective expert consensus into independently auditable, reproducible fact-verification, ensuring that any qualified reviewer, operating without access to our annotators, would converge on materially identical scores.

Failure-Mode Awareness

Productive intelligence is defined as much by what it avoids as by what it achieves. We encode Critical Negatives, domain-specific professional pitfalls that carry disproportionate real-world cost relative to their surface plausibility, as penalty rubric items with negative weight. This ensures the scoring surface achieves maximal discriminative power precisely at the failure boundary where fluent but materially dangerous answers reside, the region most consequential for deployment safety.

  • 3,213 Expert Cases: To our knowledge, one of the largest expert-annotated corpora assembled to date for professional-grade reasoning evaluation
  • 207 Scenarios: Granular professional scenarios spanning clinical decision-making, financial structuring, and legal analysis
  • 21.83 Avg Rubrics: Multi-dimensional scoring rubrics per case, with differential weights and critical negatives

Corpus Composition

Three domains selected for the severity of their professional stakes and the irreducibility of their reasoning complexity

Domain Distribution
  • Medicine 1,146 cases · 43 scenarios
  • Finance 1,147 cases · 58 scenarios
  • Law 920 cases · 106 scenarios
Three professional domains with 3,213 expert-annotated cases in total

Medicine

1,146 cases · 43 scenarios
Perioperative ICU Management · Multi-organ Failure Triage · Pharmacokinetic Conflict · Immunosuppression Sequencing · Sepsis Bundle Escalation · Coagulopathy Management · Ventilator Weaning Protocol · … 36 more

Finance

1,147 cases · 58 scenarios
Distressed Asset Valuation · M&A Deal Structuring · Derivatives Pricing · Capital Structure Analysis · Cross-default Propagation · Credit Risk Modeling · Portfolio Stress Testing · … 51 more

Law

920 cases · 106 scenarios
Cross-border Labor Dispute · Data Privacy Compliance · Criminal Sentencing Threshold · Multi-jurisdiction Liability · Securities Fraud Litigation · Regulatory Enforcement · Contract Interpretation · … 99 more

Principled Evaluation Design

A four-principle rubric methodology that transforms subjective expert judgment into deterministic, independently reproducible scoring

The central methodological challenge in expert-level evaluation is not data volume but scoring surface design. A rubric that merely asks "is the answer correct?" collapses the rich internal structure of professional reasoning into a degenerate binary signal. We adopt four design principles that, taken together, ensure our rubrics faithfully encode the full structure of expert judgment, its verifiability constraints, and its domain-characteristic failure modes.

1

Specificity

Vague evaluative language is categorically excluded. Every criterion must specify quantifiable, deterministic assessment standards: a drug dosage threshold in mg/kg, a statutory penalty range with sentencing tiers, a recovery rate calculation with explicit numerator and denominator. This constraint eliminates evaluator subjectivity at the criterion level and guarantees that two independent assessors, operating without communication, will converge on materially identical scores.

2

Verifiability

Every criterion that depends on domain knowledge external to the case itself is anchored to an authoritative primary source via a stable URL, including clinical practice guidelines published by professional societies, SEC/CSRC regulatory filings, and statutory provisions with article-level citation. This transforms rubric-based evaluation from an exercise in "expert opinion" (inherently non-reproducible) into independently auditable fact-verification against the same source material available to any qualified practitioner.

3

Structured Rubrics

Rubric criteria are organized into coherent reasoning chains. When a prerequisite criterion is unmet (e.g., a model fails to correctly identify the governing legal entity in a cross-border dispute), all downstream criteria whose validity logically depends on that determination are automatically invalidated and score zero. This prevents inflated scores arising from partially correct but inferentially unsound reasoning.

4

Critical Negatives

We systematically encode domain-characteristic professional pitfalls as negative-weight rubric items. A model that recommends lipid emulsion infusion in a patient with severe hypertriglyceridemia (TG > 5.6 mmol/L), or that ignores structural subordination when computing offshore bond recovery rates, receives substantial score penalties, mirroring the disproportionate real-world cost these errors carry in clinical practice and financial advisory, respectively.
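The interaction between dependency chains and Critical Negatives can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the scoring code used for ExpertEval: the `Criterion` class, the `score_rubric` function, and in particular the normalisation (divide by total positive weight, floor at zero) are hypothetical choices made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    cid: int            # rubric row number
    weight: float       # > 0 earns credit; < 0 is a Critical Negative penalty
    deps: tuple = ()    # ids of prerequisite criteria (the "Deps" column)


def score_rubric(criteria, judged):
    """Return a normalised score in [0, 1].

    `judged` maps criterion id -> True when the grader found that item in the
    response (for a Critical Negative, True means the pitfall was committed).
    Dependencies are assumed to form an acyclic graph.
    """
    by_id = {c.cid: c for c in criteria}
    cache = {}

    def counts(cid):
        # A criterion counts only if it is met AND every prerequisite counts.
        if cid not in cache:
            met = bool(judged.get(cid, False))
            cache[cid] = met and all(counts(d) for d in by_id[cid].deps)
        return cache[cid]

    earned = sum(c.weight for c in criteria if c.weight > 0 and counts(c.cid))
    penalty = sum(c.weight for c in criteria if c.weight < 0 and judged.get(c.cid, False))
    max_score = sum(c.weight for c in criteria if c.weight > 0)
    # Assumption: normalise by total positive weight and floor at zero.
    return max(0.0, (earned + penalty) / max_score)


# Toy rubric: criterion 2 depends on criterion 1; criterion 24 is a penalty item.
rubric = [
    Criterion(1, 3.0),
    Criterion(2, 6.0, deps=(1,)),
    Criterion(3, 5.0),
    Criterion(24, -3.0),
]

all_met = score_rubric(rubric, {1: True, 2: True, 3: True})
broken_chain = score_rubric(rubric, {1: False, 2: True, 3: True})
with_pitfall = score_rubric(rubric, {1: True, 2: True, 3: True, 24: True})
print(all_met, round(broken_chain, 3), round(with_pitfall, 3))  # 1.0 0.357 0.786
```

In the `broken_chain` run, criterion 2 is judged as met but earns nothing because its prerequisite failed, which is the invalidation behavior described under Structured Rubrics.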


Empirical Validation

We validate the dataset's empirical utility across three complementary evaluation settings, demonstrating that expert-authored rubrics yield a training signal of sufficient quality to measurably shift model reasoning behavior, and a scoring surface of sufficient granularity to discriminate among frontier-class systems.

Experiment I: Training Signal Quality

We evaluate whether expert-annotated rubrics provide a training signal of sufficient quality to produce measurable improvements in professional reasoning capability. We employ a stratified hold-out design: 90 questions (30 per domain, uniformly sampled across scenario types) constitute the test set, with the remaining ~2,500 cases allocated to supervised fine-tuning of Qwen3.5-35B-A3B. Two training regimes are compared: standard SFT using ReAct-style information synthesis trajectories, and a Heavy variant incorporating 8-sample response aggregation to increase training signal density.
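One way to realise a stratified hold-out of this kind is sketched below. It is an illustration only: the field names `id`, `domain`, and `scenario`, the round-robin sampling over scenarios, and the toy corpus are assumptions for the example, not a description of the actual ExpertEval pipeline.

```python
import random
from collections import defaultdict


def stratified_holdout(cases, per_domain=30, seed=0):
    """Split cases into (test, train), drawing per_domain test cases per domain.

    Within a domain, scenarios are visited round-robin so that test cases are
    spread as evenly as possible across scenario types.
    """
    rng = random.Random(seed)
    by_domain = defaultdict(lambda: defaultdict(list))
    for case in cases:
        by_domain[case["domain"]][case["scenario"]].append(case)

    test = []
    for domain in sorted(by_domain):
        pools = [by_domain[domain][s] for s in sorted(by_domain[domain])]
        for pool in pools:
            rng.shuffle(pool)          # random case order inside each scenario
        rng.shuffle(pools)             # random scenario order
        picked = []
        while len(picked) < per_domain and any(pools):
            for pool in pools:
                if pool and len(picked) < per_domain:
                    picked.append(pool.pop())
        test.extend(picked)

    test_ids = {c["id"] for c in test}
    train = [c for c in cases if c["id"] not in test_ids]
    return test, train


# Toy corpus: 3 domains x 5 scenarios x 20 cases (hypothetical sizes).
cases = [
    {"id": f"{d}-{s}-{i}", "domain": d, "scenario": s}
    for d in ("Medicine", "Finance", "Law")
    for s in range(5)
    for i in range(20)
]
test, train = stratified_holdout(cases)
print(len(test), len(train))  # 90 210
```

Fixing the seed keeps the split reproducible, which matters when the same test set is reused to compare several training regimes.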

Training Improvement
  • Baseline: 54.76%
  • SFT: 66.22%
  • SFT, Heavy: 68.94%
Baseline → SFT: +11.46 percentage points · SFT → Heavy: +2.72 percentage points

Model Ranking
  • Ours (SFT, Heavy): 68.94
  • Claude-Opus-4.6: 66.88
  • Ours (SFT): 66.22
  • GPT-5.4-High: 66.20
  • Qwen3.6-Plus: 63.32
  • Gemini-3.1-Pro-Preview: 63.00
  • GLM-5: 59.27
  • Kimi-K2.5: 58.24
  • Doubao-Seed-2.0-Pro: 56.22
  • Ours (Baseline): 54.76
  • Qwen3-Deep-Research: 48.52
Legend: Ours · Other Models (w/ tools)
Key finding: SFT on expert rubric data lifts the base model from 54.76% to 66.22% (+11.46 percentage points), placing it ahead of Qwen3.6-Plus (63.32%) and Gemini-3.1-Pro-Preview (63.00%) and marginally ahead of GPT-5.4-High (66.20%), all of which are substantially larger systems with tool-use capabilities. The Heavy variant with 8-sample aggregation reaches 68.94%, the highest score on the held-out test set and above Claude-Opus-4.6 (66.88%). A 35B sparse mixture-of-experts model, fine-tuned on our expert-annotated data, thus matches or exceeds far larger frontier systems on this 90-question test set, which suggests that training signal quality, rather than model scale alone, is a binding constraint.
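Because the gains above are differences between percentages, it helps to separate the absolute gain (percentage points) from the relative gain. The short sketch below recomputes both from the reported scores; the helper name `gain` is introduced only for this illustration.

```python
# Reported held-out scores (percent) from the Training Improvement figures.
scores = {"Baseline": 54.76, "SFT": 66.22, "SFT, Heavy": 68.94}


def gain(before, after):
    """Return (absolute gain in percentage points, relative gain in percent)."""
    pp = after - before
    rel = 100 * pp / before
    return round(pp, 2), round(rel, 1)


sft_gain = gain(scores["Baseline"], scores["SFT"])
heavy_gain = gain(scores["SFT"], scores["SFT, Heavy"])
print(sft_gain)    # (11.46, 20.9)
print(heavy_gain)  # (2.72, 4.1)
```

The +11.46 figure is therefore an absolute improvement; relative to the baseline score it corresponds to roughly a 21% increase.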

Experiment II: Cross-Benchmark Generalization

A natural question arises: does training on our rubric-annotated data improve performance only on our own test distribution, or does the acquired reasoning capability generalize to entirely independent evaluation instruments? We evaluate on the One-Million Bench, a large-scale external benchmark constructed by a separate research group with no overlap in annotation methodology. Four metrics are reported: Expert Score and Pass Rate, each evaluated in both English and Chinese.

Training Improvement: Expert Score
  • EN: Baseline 45.0 · SFT 55.0 · SFT, Heavy 64.8
  • CN: Baseline 44.0 · SFT 52.1 · SFT, Heavy 61.3

Training Improvement: Pass Rate (%)
  • EN: Baseline 18.5 · SFT 27.5 · SFT, Heavy 43.5
  • CN: Baseline 16.0 · SFT 26.5 · SFT, Heavy 42.0

Model Ranking (EN, Expert Score)
  • Ours (SFT, Heavy): 64.8
  • Claude-Opus-4.6: 63.0
  • GPT-5.4-High: 59.2
  • Ours (SFT): 55.0
  • Doubao-Seed-2.0-Pro: 51.8
  • Qwen3.6-Plus: 49.0
  • Gemini-3.1-Pro-Preview: 44.6
  • Kimi-K2.5: 41.2
  • GLM-5: 41.1

Model Ranking (CN, Expert Score)
  • Claude-Opus-4.6: 64.5
  • Ours (SFT, Heavy): 61.3
  • GPT-5.4-High: 58.7
  • Ours (SFT): 52.1
  • Qwen3.6-Plus: 52.1
  • Doubao-Seed-2.0-Pro: 50.5
  • Gemini-3.1-Pro-Preview: 46.9
  • Kimi-K2.5: 43.7
  • GLM-5: 41.1
The training signal generalizes. On an entirely independent benchmark with no methodological overlap, our Heavy-SFT model achieves the highest Expert Score in English (64.8, ahead of Claude-Opus-4.6 at 63.0) and ranks second in Chinese (61.3, behind Claude-Opus-4.6 at 64.5). Pass Rate improvements are especially striking: in English the rate rises from 18.5% at baseline to 43.5% after Heavy SFT, and in Chinese from 16.0% to 42.0%. This suggests that expert rubric data teaches not merely surface-level accuracy but the structured reasoning discipline required to satisfy professional evaluation criteria across independent assessment frameworks.

Experiment III: Deep Research Capability

As a final evaluation axis, we assess performance on the DeepResearch Bench, a benchmark specifically engineered to test multi-step information synthesis and extended reasoning under compositional constraints. This evaluation setting most closely mirrors the task structure our dataset is designed to optimize for: long-horizon reasoning chains requiring tool-calling sequences, multi-source evidence integration, and the sustained application of professional-grade judgment across complex decision trees.

DeepResearch Bench
  • Ours (SFT): 47.63
  • OpenAI: 46.98
  • Perplexity: 42.25
  • Grok: 40.24

DeepResearch Bench II
  • Ours (SFT): 50.45
  • GPT-o3: 45.40
  • Gemini-3-Pro: 44.60
  • Doubao: 40.99
  • Qwen3-Max: 39.25
  • Grok: 39.23
  • Perplexity: 38.58
State-of-the-art on both evaluation rounds. Our system achieves 47.63 on DeepResearch Bench (vs. OpenAI at 46.98) and 50.45 on DeepResearch Bench II (vs. GPT-o3 at 45.40 and Gemini-3-Pro at 44.60). Notably, the margin over the next-best system widens from 0.65 points in the first round to 5.05 points in the second, more challenging round, suggesting that the structured reasoning patterns internalized from expert rubric training pay off more as task complexity increases. This is the pattern one would expect from genuine capability acquisition rather than superficial pattern memorization.

Representative Cases

To convey the difficulty calibration and domain specificity of our evaluation instrument, we present three illustrative cases, one from each professional domain, selected to exhibit the characteristic reasoning structures, dependency topologies, and failure modes that our rubric design is engineered to capture.

01

Medicine — Perioperative Critical Care

Medicine Perioperative ICU Management
I am an attending physician in hepatobiliary-pancreatic surgery and SICU. I am currently managing a patient on postoperative day 8 after a standard Whipple procedure (pancreaticoduodenectomy), who has developed severe new-onset complications. Requesting MDT (multidisciplinary team) consultation for the next steps. Patient: Female, 62 years old, 55 kg. Operated for pancreatic head ductal adenocarcinoma. History of severe refractory rheumatoid arthritis (RA) for 12 years, on long-term Abatacept and low-dose steroids. Last Abatacept IV infusion was 3 days before surgery. Current status: Abdominal drainage output surged to 850 mL of turbid dark-red fluid with tissue debris; drain amylase 45,000 U/L. Vitals: invasive arterial BP 82/45 mmHg, HR 135, temp 40.2°C. Norepinephrine at 0.6 µg/kg/min. Hb dropped from 105 to 68 g/L. PLT 12×10⁹/L, WBC 1.1×10⁹/L; ferritin 38,500 µg/L, TG 7.2 mmol/L, sCD25 22,000 U/mL. Suspected secondary macrophage activation syndrome (MAS). Rheumatology and hematology recommend immediate therapeutic plasma exchange (TPE) plus high-dose Abatacept. The surgical team objects — massive pancreatic fistula with highly corrosive enzymes makes aggressive immunosuppression extremely risky. Anesthesiology refuses open re-exploration due to PLT 12 and severe capillary leak. Please provide specific guidance on: (1) pharmacokinetic conflict between Abatacept (large-molecule fusion protein) and TPE — design a precise 48-hour timeline; (2) optimal non-open bedside/IR interventions for suspected pseudoaneurysm/sentinel bleed with 850 mL/day pancreatic enzyme-rich drainage; (3) enteral/parenteral nutrition sequencing strategy via existing nasojejunal tube given high-output fistula, extreme inflammatory consumption, and hepatorenal impairment.
Rubrics
29 rubrics

| # | Weight | Type | Description | Deps |
|---|---|---|---|---|
| 1 | 3.0 | Extraction | Identify that Abatacept is a large-molecule fusion protein with a very small volume of distribution (Vd ≤ 0.1 L/kg). | |
| 2 | 6.0 | Reasoning | Calculate that a single TPE session non-selectively removes >60% of intravascular Abatacept (specifically 63%–78%). | 1 |
| 3 | 5.0 | Reasoning | Specify that Abatacept must be administered after TPE completion (within 30–60 minutes post-TPE). | |
| 4 | 8.0 | Reasoning | Recognize that concurrent severe capillary leak syndrome causes additional third-space Abatacept loss, significantly shortening its effective plasma half-life. | |
| 5 | 5.0 | Reasoning | Require that TPE replacement fluid consist entirely or predominantly of fresh frozen plasma (FFP). | |
| 6 | 5.0 | Reasoning | Diagnose the GI hemorrhage as late postoperative hemorrhage (Late PPH) / sentinel bleed caused by corrosive pancreatic enzyme erosion of target vessels. | |
| 7 | 2.0 | Reasoning | Precisely localize the most likely culprit vessel of pseudoaneurysm rupture as the gastroduodenal artery (GDA) stump or common/proper hepatic artery. | 6 |
| 8 | 2.0 | Reasoning | Propose "sandwich technique" for transcatheter arterial embolization (TAE): simultaneous proximal and distal embolization of the parent artery. | |
| 9 | 5.0 | Reasoning | Set platelet count ≥ 50×10⁹/L as the absolute safety threshold before TAE. | |
| 10 | 2.0 | Reasoning | Propose a provocation test (e.g., intra-arterial papaverine injection) to unmask intermittent bleeding and localize the culprit vessel when DSA shows no active extravasation. | |
| 11 | 5.0 | Reasoning | Require urgent pre-/intraoperative infusion of fibrinogen concentrate or cryoprecipitate, with a target fibrinogen level ≥ 2.0 g/L. | |
| 12 | 2.0 | Style | Suggest direct application of thrombin or fibrin sealant around the suspected fistula site via the existing abdominal double-lumen drain for adjunctive physical sealing. | |
| 13 | 5.0 | Reasoning | Mandate 100% carbohydrate-based non-protein calories in PN (strictly lipid-free formulation). | |
| 14 | 8.0 | Reasoning | Set total caloric target at "permissive underfeeding" strategy (15–25 kcal/kg/d) during the acute hyper-inflammatory phase. | |
| 15 | 2.0 | Reasoning | Prescribe trophic enteral feeding initiated at an ultra-low infusion rate of 10–20 mL/h. | |
| 16 | 5.0 | Reasoning | Specify that enteral formula delivered via nasojejunal tube must be a pre-digested, easily absorbed semi-elemental or short-peptide low-residue formulation. | 15 |
| 17 | 5.0 | Reasoning | Require prophylactic high-dose IV thiamine (vitamin B1) supplementation before refeeding after prolonged catabolic depletion to prevent refeeding syndrome. | |
| 18 | 5.0 | Reasoning | Warn that enteral nutrition at norepinephrine ≥ 0.5 µg/kg/min carries a high risk of non-occlusive mesenteric ischemia (NOMI). | |
| 19 | 2.0 | Reasoning | Mandate that abdominal double-lumen drain management use only continuous low-negative-pressure suction (e.g., −10 to −20 cmH₂O); high-pressure irrigation is strictly prohibited. | |
| 20 | 5.0 | Reasoning | Require IV infusion of 20% or 25% concentrated human albumin to leverage hyperosmotic colloid oncotic pressure for mobilizing third-space fluid ("autologous volume resuscitation"). | |
| 21 | 2.0 | Reasoning | Recommend continuous low-dose vasopressin (pitressin) infusion to augment blood pressure while simultaneously constricting the splanchnic vascular bed and reducing portal pressure. | |
| 22 | 2.0 | Style | Suggest assessment of extravascular lung water (EVLW) or venous congestion via transpulmonary thermodilution (PiCCO) or ultrasound VExUS scoring to precisely guide volume removal. | |
| 23 | 2.0 | Reasoning | Recommend early initiation of continuous renal replacement therapy (CRRT) with an initial net ultrafiltration target of 100–200 mL/h under vasopressor support. | |
| 24 | −3.0 | Critical Negative | Recommending Abatacept administration before or during TPE. | |
| 25 | −3.0 | Critical Negative | Using pure albumin as TPE replacement fluid in a patient with active hemorrhage and consumptive coagulopathy (PLT 12). | |
| 26 | −3.0 | Critical Negative | Selecting regional citrate anticoagulation (RCA) as the first-line anticoagulation strategy for blood purification in a patient with severe hepatic impairment (TBil 65) and high-lactate shock. | |
| 27 | −3.0 | Critical Negative | Initiating a second TPE cycle within <24 hours of Abatacept administration (e.g., at 12–18 hours), before adequate drug exposure is achieved. | |
| 28 | −3.0 | Critical Negative | Administering any IV lipid emulsion during severe hypertriglyceridemia (TG 7.2 mmol/L). | |
| 29 | −3.0 | Critical Negative | Using hydroxyethyl starch (HES) for fluid resuscitation in acute kidney injury with severe capillary leak. | |

This case exemplifies the core challenge of clinical multidisciplinary reasoning: the pharmacokinetic conflict between therapeutic plasma exchange and biologic agent administration, the surgical contraindications imposed by consumptive coagulopathy (PLT 12 × 10⁹/L), and the nutritional sequencing dilemma under high-output pancreatic fistula, all operating under simultaneous time pressure with interdependent decision nodes. The Critical Negatives are particularly diagnostic: recommending Abatacept before TPE completion, or administering IV lipid emulsion at TG 7.2 mmol/L, represent errors that a junior resident might plausibly and confidently commit but that carry potentially fatal hemodynamic consequences.

02

Finance — Distressed Asset Analysis

Finance Distressed Asset Analysis
Vanke's (000002.SZ) USD bonds have dropped to 42 cents. S&P downgraded the rating from B− to CCC+ last month with a negative outlook: "further downgrade to selective default (SD) is only a matter of time." Last week our AMC special opportunities desk acquired $50M face value of the 2027 4.25% USD bonds at an average price of 42 cents, cost $21M, the largest single bet of our desk this year. Debt structure: ~RMB 320B total interest-bearing debt. Onshore bank loans ~180B (65% or ~117B secured with land/construction mortgages; 35% or ~63B unsecured, pari passu with bonds). Onshore public bonds ~68B (all unsecured). Offshore USD bonds ~42B equivalent (unsecured, issued by Cayman entity). Revenue 375B but net loss of 19.8B, the first annual loss in 30 years of listing; operating cash flow negative 8.5B. Our valuation team's project-by-project reappraisal: Shenzhen projects (18%, 75.6B book) at 70% = 52.9B; Tier-1/2 non-Shenzhen (35%, 147B) at 60% = 88.2B; Tier-3/4 (47%, 197.4B) at 45% = 88.8B; total revaluation ~229.9B vs. 420B book (45% haircut). Cross-default is our largest risk variable. Tasks: (1) Calculate unsecured creditor recovery rates under "orderly restructuring" vs. "disorderly liquidation" scenarios; (2) Assess whether the December-maturing "21 Vanke 04" (4.5B) will be repaid and cross-default impact; (3) Evaluate three Shenzhen SOE intervention modes (equity injection, guarantee, coordinated extension) and their impact on different creditor tiers; (4) Answer whether 42 cents provides adequate margin of safety, IRR ≥ 15% conditions, position-sizing triggers and stop-loss signals.
Rubrics
35 rubrics

| # | Weight | Type | Description | Deps |
|---|---|---|---|---|
| 1 | 1.0 | Extraction | Report that Vanke's total interest-bearing debt is approximately RMB 320B (onshore bank loans ~180B + onshore bonds ~68B + offshore USD bonds ~42B + other ~30B). | |
| 2 | 1.0 | Extraction | Report that Vanke's under-construction and completed inventory has a book value of ~420B, with market revaluation at approximately 240B–280B. | |
| 3 | 4.0 | Reasoning | Calculate secured creditor claims at ~117B (180B bank loans × 65%). | 1 |
| 4 | 4.0 | Reasoning | Under "orderly restructuring," compute distributable assets for unsecured creditors = revaluation − secured claims ≈ 240B–280B − 117B = 123B–163B. | 2, 3 |
| 5 | 3.0 | Reasoning | Total unsecured claims ~173B (unsecured bank loans 63B + onshore bonds 68B + offshore USD bonds 42B); note the 63B unsecured loans are pari passu. | 1 |
| 6 | 5.0 | Reasoning | Orderly restructuring recovery rate = (229.9B − 117B) / 173B ≈ 65% (internal revaluation) or (260B − 117B) / 173B ≈ 83% (sell-side median). | 4, 5 |
| 7 | 5.0 | Reasoning | Under "disorderly liquidation," estimate a significantly lower recovery rate (deeper inventory liquidation discount at 40%–50%, yielding only 30%–50% recovery). | 2, 5 |
| 8 | 2.0 | Extraction | Identify that offshore USD bondholders face "structural subordination": the Cayman issuer is 2–3 legal layers removed from onshore assets, imposing 6–12 months of additional recovery time. | |
| 9 | 6.0 | Reasoning | Structural subordination discount: offshore USD bonds at 85%–90% of the 65% base recovery → actual recovery ~55–59 cents; 42 cents offers ~13–17 cents (31%–40%) margin of safety. | 6, 7, 8 |
| 10 | 2.0 | Extraction | Identify "21 Vanke 04" (RMB 4.5B, maturing December 2025) as the nearest trigger: non-payment would constitute an onshore bond default, satisfying the cross-default clause ("financial indebtedness unpaid above USD 100M equivalent"). | |
| 11 | 3.0 | Extraction | An onshore bond default would trigger cross-default acceleration of all offshore USD bonds (~42B, ~USD 5.8B), entitling all USD bondholders to demand immediate repayment. | 10 |
| 12 | 4.0 | Reasoning | Quantify post-cross-default price impact on USD bonds (e.g., from 42 to 25–30 cents). | 11 |
| 13 | 3.0 | Extraction | Shenzhen Metro (~28% stake) would most likely intervene via credit guarantee for onshore bonds (rather than direct equity injection); a guarantee does not constitute state-asset leakage and resolves the near-term liquidity crisis through maturity extension. | |
| 14 | 4.0 | Reasoning | SOE credit guarantee mainly benefits onshore creditors (recovery 65% → 80%–90%), limited benefit to offshore USD bonds (55–59 → 60–65 cents); structural subordination not eliminated by guarantee. | 13 |
| 15 | 6.0 | Reasoning | Clearly answer whether 42 cents provides adequate margin of safety. | |
| 16 | 3.0 | Reasoning | Define position-sizing triggers: add to position if USD bonds fall below 35 cents and Shenzhen Metro signals explicit intervention (e.g., guarantee announcement), increasing to USD 80M face value; stop-loss: liquidate if onshore default occurs with no SOE rescue and price breaches 25 cents. | 15 |
| 17 | 1.0 | Extraction | Report that Vanke posted a net loss of approximately RMB 19.8B in 2024, the first annual loss since listing. | |
| 18 | 1.0 | Extraction | Report that onshore non-standard debt has been overdue by more than 90 days (RMB 3.8B) but has not yet triggered an onshore bond default event. | |
| 19 | 1.0 | Extraction | Report that the offshore USD bonds (2027 maturity, 4.25% coupon) currently trade in the 35–42 cents range. | |
| 20 | 5.0 | Reasoning | Calculate P&L for the USD 21M position under different recovery rate scenarios. | 6, 7, 9 |
| 21 | 3.0 | Extraction | Provide city-tier differentiated inventory discount: Shenzhen projects (18%, 75.6B book) at 70% = 52.9B; Tier-1/2 non-Shenzhen (35%, 147B) at 60% = 88.2B; Tier-3/4 (47%, 197.4B) at 45% = 88.8B; total revaluation ~229.9B. | |
| 22 | 1.0 | Extraction | Report that onshore bonds (e.g., "20 Vanke 06") trade at approximately 72–78 cents on the dollar. | |
| 23 | 2.0 | Extraction | Identify that the S&P downgrade to CCC+ triggers a triple negative feedback loop: (1) banks withdraw or refuse to roll over credit, accelerating cash burn; (2) asset buyers exploit distress to bid below revaluation; (3) refinancing costs spike, shutting all capital market access. | |
| 24 | 2.0 | Style | Provide a recovery rate calculation model in the appendix. | |
| 25 | 3.0 | Reasoning | Position recommendation: hold at 42 cents (expected recovery 55–59 cents, 31%–40% return); add below 35 if SOE signals emerge; stop-loss at 25 cents if onshore default + no bailout. | 15, 16 |
| 26 | −5.0 | Critical Negative | Ignoring cross-default clause: treating onshore and offshore bond defaults as independent events. | |
| 27 | −5.0 | Critical Negative | Ignoring structural subordination: assigning identical recovery rates to offshore and onshore unsecured creditors. | |
| 28 | −5.0 | Critical Negative | Using book value (420B) instead of market revaluation (240B–280B) to compute recovery rates. | |
| 29 | 4.0 | Reasoning | Margin of safety at 42 cents: expected recovery 55–59 cents vs. 42-cent entry yields ~13–17 cents profit (31%–40% return), sufficient to absorb structural subordination discount and time-value erosion. | 9 |
| 30 | 2.0 | Extraction | Forming an ad hoc committee requires assembling at least 25%–33% of USD bondholders (~10.5B–13.9B face value) to obtain an effective blocking position against restructuring proposals. | |
| 31 | 2.0 | Extraction | Filing a winding-up petition in the Cayman court can serve as a negotiation pressure tool to accelerate SOE intervention, but risks triggering disorderly liquidation, making it a double-edged sword. | |
| 32 | 3.0 | Reasoning | Offshore USD bonds (42B) represent only 24.3% of total unsecured claims (173B); the pari passu pro-rata share is inherently limited, further compressed by structural subordination. | 5 |
| 33 | 3.0 | Reasoning | Operating cash flow of negative 8.5B implies ~700M monthly cash burn; without external capital injection, remaining cash reserves (~20B) can sustain operations for approximately 28 months. | |
| 34 | −4.0 | Critical Negative | Excluding 63B unsecured bank loans from the unsecured creditor pool: using only 68B + 42B = 110B instead of the correct 173B. | |
| 35 | −3.0 | Critical Negative | Failing to consider homebuyer pre-sale priority: under "Bao Jiao Lou" policy, homebuyers may rank ahead of all financial creditors. | |

The finance case demands simultaneous mastery of credit analysis, legal capital structure, and portfolio risk management: computing recovery rates under competing restructuring scenarios, modeling the structural subordination discount applicable to Cayman-issued offshore bonds, evaluating differential SOE intervention modes and their tier-specific creditor implications, and synthesizing these into actionable position-sizing and stop-loss recommendations. The Critical Negatives are especially revealing: ignoring cross-default propagation mechanics, substituting book value for market revaluation in recovery analysis, or treating offshore and onshore unsecured creditors as symmetric are precisely the errors that a model fluent in financial language would generate with high confidence, and that a trained analyst would immediately flag as disqualifying.
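The recovery arithmetic behind this case can be reproduced directly from the figures in the prompt and rubric. The sketch below follows the orderly-restructuring path using the internal revaluation; the function name `offshore_recovery` and its parameter names are introduced for illustration, and the 85%–90% structural subordination factors are taken from the rubric, not derived independently.

```python
def offshore_recovery(revaluation, secured, unsecured, subordination_factor):
    """Recovery per unit of face value for offshore unsecured bonds.

    All monetary inputs are in RMB billions. Unsecured creditors share what is
    left after secured claims; offshore holders then take a further haircut
    for structural subordination.
    """
    base = (revaluation - secured) / unsecured
    return base * subordination_factor


revaluation = 229.9        # internal project-by-project reappraisal
secured = 180 * 0.65       # secured bank loans, ~117
unsecured = 63 + 68 + 42   # unsecured loans + onshore bonds + offshore USD bonds

base_recovery = (revaluation - secured) / unsecured
low = offshore_recovery(revaluation, secured, unsecured, 0.85)
high = offshore_recovery(revaluation, secured, unsecured, 0.90)

entry_price = 0.42         # purchase price per unit of face value
face_value = 50.0          # USD millions
pnl_low = face_value * (low - entry_price)
pnl_high = face_value * (high - entry_price)

print(f"base recovery: {base_recovery:.1%}")
print(f"offshore recovery: {low:.1%} to {high:.1%}")
print(f"P&L on USD 50M face: {pnl_low:.1f}M to {pnl_high:.1f}M")
```

Replacing `unsecured` with 110 (dropping the 63B of unsecured bank loans) or `revaluation` with the 420B book value shows how quickly two of the listed Critical Negatives inflate the apparent recovery.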

03

Law — Cross-border Labor Dispute

Law Cross-border Labor Dispute
TechGlobal Inc., a Delaware-incorporated tech company, established a wholly-owned subsidiary, Shanghai TaiKe Technology Co., Ltd., in Pudong New Area, Shanghai in 2018. Shanghai TaiKe has independent legal personality, with registered capital of $5M. Mr. Li joined Shanghai TaiKe in March 2019 as General Manager and legal representative, signing an open-ended employment contract at RMB 80,000/month. He reports to TechGlobal Inc.'s APAC VP. In August 2024, an anonymous tip alleged Mr. Li received supplier kickbacks. Internal investigation confirmed: between Jan 2023 and Jun 2024, Mr. Li received RMB 350,000 in procurement kickbacks deposited to his personal account. On October 15, 2024, TechGlobal Inc.'s headquarters directly issued a termination notice to Mr. Li, signed by the APAC VP with TechGlobal Inc.'s seal, citing "serious violation of company rules and damage to company interests." Mr. Li disputes the termination: (1) his contract is with Shanghai TaiKe; the parent company has no authority to terminate; (2) insufficient evidence of kickbacks; (3) even if kickbacks occurred, proper legal procedures were not followed. Required analysis: (1) Employment relationship determination: legal relationship between Mr. Li, Shanghai TaiKe, and TechGlobal Inc.; (2) Labor arbitration feasibility: proper respondent, jurisdiction, and calculation of illegal termination compensation; (3) Criminal risk assessment: whether accepting RMB 350K in kickbacks constitutes "bribery by non-state personnel" and potential penalties; (4) Criminal complaint feasibility: whether Mr. Li can file criminal charges against TechGlobal or Shanghai TaiKe.
Rubrics (click to expand/collapse)
35 rubrics
| # | Weight | Type | Description | Deps |
|---|---|---|---|---|
| 1 | 3.0 | Extraction | Identify that Shanghai TaiKe is a legally established subsidiary with independent legal personality. | |
| 2 | 4.0 | Reasoning | Establish that a subsidiary and parent are two independent legal entities, each bearing independent civil liability. | [1] |
| 3 | 3.0 | Extraction | Confirm that Mr. Li signed an employment contract with Shanghai TaiKe, establishing an employment relationship between them. | |
| 4 | 4.0 | Extraction | Identify that Mr. Li's employer is Shanghai TaiKe, not TechGlobal Inc. | [3] |
| 5 | 6.0 | Reasoning | Conclude that TechGlobal Inc., as the parent company, is not Mr. Li's employer and has no authority to directly terminate his employment contract. | [2, 4] |
| 6 | 5.0 | Reasoning | Confirm that TechGlobal Inc.'s direct termination of Mr. Li's employment contract constitutes unlawful termination. | [5] |
| 7 | 5.0 | Reasoning | Determine that the proper respondent in labor arbitration should be Shanghai TaiKe, not TechGlobal Inc. | [4] |
| 8 | 3.0 | Extraction | State that labor arbitration jurisdiction lies with the labor dispute arbitration commission at the place of contract performance or the employer's domicile. | |
| 9 | 4.0 | Reasoning | Determine that the competent arbitration body is the Shanghai Pudong New Area Labor and Personnel Dispute Arbitration Commission. | [8] |
| 10 | 4.0 | Extraction | State that Mr. Li may claim compensation for unlawful termination of the employment contract. | [6] |
| 11 | 5.0 | Reasoning | Calculate illegal termination compensation at 2× the economic compensation standard. | [10] |
| 12 | 5.0 | Reasoning | Calculate the economic compensation as RMB 480,000 (monthly salary 80K × 6 years of service). | |
| 13 | 6.0 | Reasoning | Calculate total illegal termination compensation: RMB 960,000 (economic compensation 80K × 6 years × 2). | [11, 12] |
| 14 | 5.0 | Reasoning | Identify that accepting 350K in kickbacks may constitute "bribery by non-state personnel" (非国家工作人员受贿罪). | |
| 15 | 3.0 | Extraction | State that the subject of bribery by non-state personnel is an employee of a company, enterprise, or other entity. | |
| 16 | 4.0 | Reasoning | Confirm that Mr. Li, as General Manager of Shanghai TaiKe, satisfies the subject element of bribery by non-state personnel. | [15] |
| 17 | 3.0 | Extraction | State that the prosecution threshold for bribery by non-state personnel is RMB 30,000 or more. | |
| 18 | 4.0 | Reasoning | Confirm that RMB 350K exceeds the prosecution threshold (30K), warranting criminal liability. | [17] |
| 19 | 4.0 | Extraction | State the sentencing range for "relatively large amount" (60K–1M) in bribery by non-state personnel: up to 3 years imprisonment or criminal detention, plus a fine. | |
| 20 | 3.0 | Extraction | State the sentencing range for "especially large amount" (1M or above) in bribery by non-state personnel: 3–10 years imprisonment, plus a fine. | |
| 21 | 5.0 | Reasoning | Determine that RMB 350K falls under "relatively large amount" (60K–1M); potential sentence: ≤ 3 years imprisonment or detention with fine. | [19, 20] |
| 22 | 4.0 | Reasoning | Correctly conclude that the kickback behavior does not constitute embezzlement (职务侵占罪). | |
| 23 | 4.0 | Reasoning | Analyze whether Mr. Li can file a criminal complaint against TechGlobal Inc. or Shanghai TaiKe, and reach a negative conclusion. | |
| 24 | 3.0 | Extraction | State that the statute of limitations for labor arbitration is one year, commencing from the date the rights holder knew or should have known of the infringement. | |
| 25 | 3.0 | Reasoning | Confirm that Mr. Li's labor arbitration claim has not exceeded the limitation period. | [24] |
| 26 | 4.0 | Reasoning | Advise strategic coordination between labor arbitration and criminal defense: exercise caution when discussing kickback facts in arbitration proceedings. | |
| 27 | 3.0 | Reasoning | Recommend that Mr. Li seek to hold Shanghai TaiKe liable for unlawful termination compensation in labor arbitration. | [7, 13] |
| 28 | 3.0 | Reasoning | Advise Mr. Li of the criminal prosecution risk and recommend pursuing mitigating factors in criminal proceedings (e.g., voluntary return of illicit gains, guilty plea and acceptance of penalty). | [14, 18] |
| 29 | 1.0 | Style | The legal opinion is structured with three main sections: "Case Facts," "Legal Analysis," and "Conclusions and Recommendations." | |
| 30 | 1.0 | Style | Each legal opinion is supported by the corresponding statutory provisions. | |
| 31 | −5.0 | Critical Negative | Erroneously concluding that TechGlobal Inc. can directly terminate Mr. Li's employment contract. | |
| 32 | −5.0 | Critical Negative | Erroneously listing TechGlobal Inc. as the respondent in labor arbitration. | |
| 33 | −5.0 | Critical Negative | Erroneously concluding that accepting kickbacks constitutes embezzlement (职务侵占罪). | |
| 34 | −5.0 | Critical Negative | Erroneously classifying RMB 350K as "especially large amount" instead of "relatively large amount." | |
| 35 | −5.0 | Critical Negative | Erroneously concluding that Mr. Li can file a criminal complaint against the company. | |
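
The compensation figures in criteria 11 to 13 can be reproduced with a few lines of arithmetic. The sketch below is illustrative only: it assumes a start date of March 1, 2019 (the case gives only the month), and it assumes the rounding convention that the rubric's 6-year figure implies, namely that a remainder of six months or more counts as a full year and a shorter remainder as half a year. It follows the rubric's figures and does not model any cap or adjustment beyond what the rubric states. The names `service_years` and `MONTHLY_SALARY_RMB` are hypothetical.

```python
from datetime import date

MONTHLY_SALARY_RMB = 80_000  # from the case facts


def service_years(start: date, end: date) -> float:
    """Years of service for compensation purposes (assumed rounding rule).

    A remainder of six months or more counts as one full year;
    a shorter non-zero remainder counts as half a year.
    """
    months = (end.year - start.year) * 12 + (end.month - start.month)
    if end.day < start.day:
        months -= 1  # the final month was not completed
    full_years, remainder = divmod(months, 12)
    if remainder == 0:
        return float(full_years)
    return full_years + (1.0 if remainder >= 6 else 0.5)


# Assumption: the exact start day in March 2019 is not given; the 1st is used.
years = service_years(date(2019, 3, 1), date(2024, 10, 15))

economic_compensation = MONTHLY_SALARY_RMB * years            # criterion 12
unlawful_termination_compensation = 2 * economic_compensation  # criteria 11, 13

print(years, economic_compensation, unlawful_termination_compensation)
# 6.0 480000.0 960000.0
```

Five years and seven months of service rounds up to six years under the assumed rule, which is how the rubric arrives at RMB 480,000 and then RMB 960,000.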

This case probes the intersection of corporate law, employment law, and criminal law within a single integrated fact pattern. The rubric structure is particularly instructive: the parent/subsidiary legal independence determination (criteria 1–5) cascades into both the unlawful-termination finding and the identification of the arbitration respondent, and the unlawful-termination finding in turn cascades into the compensation calculation (criteria 10–13), forming a multi-level reasoning chain where an error at the root invalidates the entire downstream analysis. Meanwhile, the criminal liability assessment operates on a parallel inferential track where misclassifying the offense type (embezzlement vs. bribery by non-state personnel) or the statutory amount tier produces categorically erroneous sentencing outcomes with no self-correcting mechanism. Each of these characteristic failure modes is independently encoded as a Critical Negative with a weight of −5.0.
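
To make the cascade concrete, the Deps column of the rubric can be read as a directed graph and traversed from any criterion to find everything that rests on it. The sketch below is a minimal illustration, not the evaluation pipeline itself: the `DEPS` mapping is transcribed from the table above, the `downstream` helper is hypothetical, and how the scorer actually treats downstream criteria when a dependency fails is not specified here.

```python
from collections import defaultdict, deque

# criterion -> criteria it depends on (transcribed from the Deps column)
DEPS = {
    2: [1], 4: [3], 5: [2, 4], 6: [5], 7: [4], 9: [8], 10: [6],
    11: [10], 13: [11, 12], 16: [15], 18: [17], 21: [19, 20],
    25: [24], 27: [7, 13], 28: [14, 18],
}


def downstream(root: int, deps: dict[int, list[int]] = DEPS) -> list[int]:
    """Return every criterion that directly or transitively depends on `root`."""
    # Invert the mapping: prerequisite -> criteria that build on it.
    children = defaultdict(set)
    for criterion, prerequisites in deps.items():
        for prerequisite in prerequisites:
            children[prerequisite].add(criterion)

    # Breadth-first traversal over the inverted edges.
    seen: set[int] = set()
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for child in children[node]:
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return sorted(seen)


print(downstream(4))   # [5, 6, 7, 10, 11, 13, 27]
print(downstream(17))  # [18, 28]
```

Misidentifying the employer (criterion 4) puts seven further criteria at risk, from the termination-authority conclusion through to the final arbitration recommendation, whereas an error on the prosecution threshold (criterion 17) stays within the criminal track.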


What This Enables

Three methodological implications of expert-authored evaluation infrastructure for the broader field

Diagnostic Benchmarking at Professional Grade

Prevailing benchmarks measure surface-level task completion, collapsing the internal structure of professional reasoning into aggregate accuracy scores that obscure the distinction between genuine competence and superficial fluency. Our rubric architecture, with differentially weighted criteria and Critical Negatives, enables evaluation that diagnostically separates a model producing a plausible-sounding answer from one that reasons through the correct causal chain with appropriate failure-mode awareness. This is the difference between a system that passes a standardized medical examination and one to which a practicing clinician would delegate patient care.
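
As a rough illustration of how differentially weighted criteria and Critical Negatives separate two superficially similar responses, the sketch below scores responses against the weights of the legal case shown earlier. The aggregation rule (sum of satisfied positive weights plus triggered penalties, divided by the total positive weight) is an assumption made for illustration; the exact scoring formula is not described in this section, and `rubric_score` is a hypothetical name.

```python
# Weights transcribed from the legal case rubric (criteria 1-30 positive, 31-35 negative).
POSITIVE = {
    1: 3, 2: 4, 3: 3, 4: 4, 5: 6, 6: 5, 7: 5, 8: 3, 9: 4, 10: 4,
    11: 5, 12: 5, 13: 6, 14: 5, 15: 3, 16: 4, 17: 3, 18: 4, 19: 4, 20: 3,
    21: 5, 22: 4, 23: 4, 24: 3, 25: 3, 26: 4, 27: 3, 28: 3, 29: 1, 30: 1,
}
NEGATIVE = {31: -5, 32: -5, 33: -5, 34: -5, 35: -5}


def rubric_score(met: set[int], triggered: set[int]) -> float:
    """Assumed aggregation: (earned weight + penalties) / total positive weight."""
    raw = sum(POSITIVE[i] for i in met) + sum(NEGATIVE[i] for i in triggered)
    return raw / sum(POSITIVE.values())


# Response A satisfies every positive criterion and triggers no Critical Negative.
score_a = rubric_score(set(POSITIVE), set())

# Response B is equally fluent but names TechGlobal Inc. as the arbitration
# respondent: it misses criteria 7 and 27 and triggers Critical Negative 32.
score_b = rubric_score(set(POSITIVE) - {7, 27}, {32})

print(f"A: {score_a:.3f}  B: {score_b:.3f}")
```

Under this assumed rule the single respondent error costs Response B both the missed criteria and the penalty, so the gap between the two scores reflects the professional severity of the mistake rather than the fraction of text that was correct.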

Expert Signal as Training Data

Our experiments establish that rubric-annotated expert data functions not merely as an evaluative instrument but as a generative training signal: supervised fine-tuning on this corpus produces measurable and consistent gains not only on our own held-out test distribution but on entirely independent evaluation benchmarks constructed by separate research groups. The structured reasoning patterns encoded in expert rubrics (dependency awareness, failure-mode avoidance, source-grounded justification) appear to transfer as a domain-general capability improvement, suggesting that expert annotation quality may be a more efficient lever for capability gain than data volume or model scale.

Toward Domain-Specific Alignment

Generic RLHF and instruction-tuning paradigms optimize for user satisfaction, a useful but fundamentally imprecise proxy for the kind of professional correctness that high-stakes domains require. Our dataset opens a methodological path toward domain-specific alignment, where the reward signal is defined not by aggregate human preference ratings but by the structured, verifiable judgment of credentialed practitioners operating under the professional standards and liability frameworks that govern real-world practice.


Built by Domain Experts, for Domain Intelligence

ExpertEval represents a deliberate methodological departure from the scale-driven paradigm that dominates contemporary benchmark design. Rather than pursuing breadth through automated generation or crowd-sourced annotation, we invest in depth through expert authorship, producing 3,213 cases where every query and every rubric criterion reflects the professional judgment of a credentialed practitioner with verifiable domain expertise.

The empirical results validate the hypothesis that motivates this design philosophy: a 35B sparse mixture-of-experts model, trained on our expert-annotated corpus, achieves state-of-the-art performance across held-out expert evaluation, cross-benchmark generalization, and deep research tasks — consistently outperforming frontier systems with orders of magnitude more parameters and substantially larger training budgets. We interpret this as convergent evidence that the quality of training signal, rather than the quantity of training data or the scale of model parameters, constitutes the binding constraint on professional-grade AI capability.

3,213 Cases · 207 Scenarios · 21.83 Avg Rubrics

Contact

For questions, collaborations, or access requests, reach us at:
contact@unipat.ai