ExpertEval: Can AI Match the Judgment of Seasoned Professionals?
2026-05-06
Toward Evaluating Real-World, High-Stakes Productivity
The field has grown adept at constructing benchmarks that AI systems can pass, yet passing is not the same as performing. We pose a more exacting question: can AI systems produce professional work-product that a credentialed domain expert would sign off on? ExpertEval is a large-scale evaluation infrastructure spanning Medicine, Finance, and Law, three domains where the quality of reasoning is judged not by accuracy metrics but by the severity of its real-world consequences. Every case is authored by practicing professionals with verifiable domain credentials and paired with fine-grained rubrics that encode expert judgment, including the characteristic failure modes that separate competent analysis from dangerously fluent confabulation.
Beyond Pattern Matching
Professional domains impose reasoning under genuine epistemic uncertainty, where novel constraint configurations arise that no training distribution has adequately represented. We construct queries that faithfully reproduce this cognitive load: multi-system organ failure complicated by pharmacokinetic drug conflicts, cross-default cascades propagating through distressed capital structures, and multi-jurisdictional legal fact patterns requiring simultaneous navigation of parallel criminal and civil liability tracks.
Auditable Ground Truth
Every rubric criterion that invokes external knowledge is anchored to an authoritative primary source with a traceable URL: clinical practice guidelines, regulatory filings, statutory provisions, judicial interpretations. This design choice transforms evaluation from an exercise in subjective expert consensus into independently auditable, reproducible fact-verification, ensuring that any qualified reviewer, operating without access to our annotators, would converge on materially identical scores.
Failure-Mode Awareness
Productive intelligence is defined as much by what it avoids as by what it achieves. We encode Critical Negatives, domain-specific professional pitfalls that carry disproportionate real-world cost relative to their surface plausibility, as penalty rubric items with negative weight. This ensures the scoring surface achieves maximal discriminative power precisely at the failure boundary where fluent but materially dangerous answers reside, the region most consequential for deployment safety.
Corpus Composition
Three domains selected for the severity of their professional stakes and the irreducibility of their reasoning complexity
- Medicine 1,146 cases · 43 scenarios
- Finance 1,147 cases · 58 scenarios
- Law 920 cases · 106 scenarios
Principled Evaluation Design
A four-principle rubric methodology that transforms subjective expert judgment into deterministic, independently reproducible scoring
The central methodological challenge in expert-level evaluation is not data volume but scoring surface design. A rubric that merely asks "is the answer correct?" collapses the rich internal structure of professional reasoning into a degenerate binary signal. We adopt four design principles that, taken together, ensure our rubrics faithfully encode the full structure of expert judgment, its verifiability constraints, and its domain-characteristic failure modes.
Specificity
Vague evaluative language is categorically excluded. Every criterion must specify quantifiable, deterministic assessment standards: a drug dosage threshold in mg/kg, a statutory penalty range with sentencing tiers, a recovery rate calculation with explicit numerator and denominator. This constraint eliminates evaluator subjectivity at the criterion level and guarantees that two independent assessors, operating without communication, will converge on materially identical scores.
Verifiability
Every criterion that depends on domain knowledge external to the case itself is anchored to an authoritative primary source via a stable URL, including clinical practice guidelines published by professional societies, SEC/CSRC regulatory filings, and statutory provisions with article-level citation. This transforms rubric-based evaluation from an exercise in "expert opinion" (inherently non-reproducible) into independently auditable fact-verification against the same source material available to any qualified practitioner.
Structured Rubrics
Rubric criteria are organized into coherent reasoning chains. When a prerequisite criterion is unmet (e.g., a model fails to correctly identify the governing legal entity in a cross-border dispute), all downstream criteria whose validity logically depends on that determination are automatically invalidated and score zero. This prevents inflated scores arising from partially correct but inferentially unsound reasoning.
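The cascade rule can be sketched in a few lines. This is a minimal illustration, not the authors' scoring code: `Criterion` and `effective_passes` are hypothetical names, the dependency graph is assumed to be acyclic, and invalidation is assumed to propagate transitively. The sample weights and dependencies are copied from the first six rows of the Law rubric presented under Representative Cases.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    cid: int
    weight: float
    deps: tuple = ()  # ids of prerequisite criteria (hypothetical field name)


def effective_passes(criteria, raw_met):
    """Map each criterion id to True only if the judge marked it as met
    AND every prerequisite (followed transitively) also counts as met."""
    by_id = {c.cid: c for c in criteria}
    memo = {}

    def ok(cid):
        if cid not in memo:
            crit = by_id[cid]
            memo[cid] = bool(raw_met.get(cid, False)) and all(ok(d) for d in crit.deps)
        return memo[cid]

    return {c.cid: ok(c.cid) for c in criteria}


# Rows 1-6 of the Law rubric: (id, weight, dependencies).
law_rows = [
    Criterion(1, 3.0),
    Criterion(2, 4.0, (1,)),
    Criterion(3, 3.0),
    Criterion(4, 4.0, (3,)),
    Criterion(5, 6.0, (2, 4)),
    Criterion(6, 5.0, (5,)),
]

# Judge verdicts before the cascade: only criterion 1 is missed.
raw = {1: False, 2: True, 3: True, 4: True, 5: True, 6: True}
passed = effective_passes(law_rows, raw)
score = sum(c.weight for c in law_rows if passed[c.cid])
print(passed, score)
```

With criterion 1 missed, criteria 2, 5, and 6 are zeroed even though the judge marked them as met, so only criteria 3 and 4 contribute: 7.0 of a possible 25.0.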
Critical Negatives
We systematically encode domain-characteristic professional pitfalls as negative-weight rubric items. A model that recommends lipid emulsion infusion in a patient with severe hypertriglyceridemia (TG > 5.6 mmol/L), or that ignores structural subordination when computing offshore bond recovery rates, receives substantial score penalties, faithfully mirroring the disproportionate real-world cost these specific errors carry in clinical practice and financial advisory, respectively.
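As a rough sketch of how a negative-weight item changes a score, the snippet below sums the weights of every item the judge marks as present. The aggregation formula is an assumption: the text does not say how raw points are normalized, so dividing by the total positive weight is only one plausible choice, and `rubric_score` is a hypothetical helper. The four items are the nutrition-related rows (13, 14, 17, 28) of the Medicine rubric presented under Representative Cases.

```python
def rubric_score(items, judged_present):
    """items: list of (id, weight) pairs.
    judged_present: ids the judge found in the answer.
    Positive-weight items add points when satisfied; negative-weight
    items (Critical Negatives) subtract points when triggered."""
    earned = sum(w for i, w in items if i in judged_present)
    max_positive = sum(w for _, w in items if w > 0)
    # Normalizing by the positive total is an assumption, not the source's formula.
    return earned, earned / max_positive


nutrition_items = [
    (13, 5.0),   # lipid-free parenteral nutrition
    (14, 8.0),   # permissive underfeeding target
    (17, 5.0),   # prophylactic IV thiamine
    (28, -3.0),  # Critical Negative: IV lipid emulsion at TG 7.2 mmol/L
]

safe_answer = rubric_score(nutrition_items, {13, 14, 17})
unsafe_answer = rubric_score(nutrition_items, {14, 17, 28})
print(safe_answer, unsafe_answer)
```

The unsafe answer loses twice: it forfeits the 5 points for the lipid-free formulation and pays the 3-point penalty, ending at 10 of 18 raw points.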
Empirical Validation
We validate the dataset's empirical utility across three complementary evaluation settings, demonstrating that expert-authored rubrics yield a training signal of sufficient quality to measurably shift model reasoning behavior, and a scoring surface of sufficient granularity to discriminate among frontier-class systems.
Experiment I: Training Signal Quality
We evaluate whether expert-annotated rubrics provide a training signal of sufficient quality to produce measurable improvements in professional reasoning capability. We employ a stratified hold-out design: 90 questions (30 per domain, uniformly sampled across scenario types) constitute the test set, with the remaining ~2,500 cases allocated to supervised fine-tuning of Qwen3.5-35B-A3B. Two training regimes are compared: standard SFT using ReAct-style information synthesis trajectories, and a Heavy variant incorporating 8-sample response aggregation to increase training signal density.
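A stratified hold-out of this kind might be drawn as follows. This is a sketch under stated assumptions: the field names `id`, `domain`, and `scenario` are hypothetical, and "uniformly sampled across scenario types" is interpreted as a round-robin over shuffled scenario buckets. When a domain has more scenario types than test slots, some scenarios necessarily go unrepresented in the test set.

```python
import random
from collections import defaultdict


def stratified_holdout(cases, per_domain=30, seed=0):
    """Split cases into (train, test), drawing per_domain test cases from
    each domain and spreading them as evenly as possible over scenarios."""
    rng = random.Random(seed)
    by_domain = defaultdict(lambda: defaultdict(list))
    for case in cases:
        by_domain[case["domain"]][case["scenario"]].append(case)

    test = []
    for domain in sorted(by_domain):
        # One shuffled pool per scenario type, visited in random order.
        pools = []
        for scenario in sorted(by_domain[domain]):
            pool = list(by_domain[domain][scenario])
            rng.shuffle(pool)
            pools.append(pool)
        rng.shuffle(pools)

        picked = []
        while len(picked) < per_domain and any(pools):
            for pool in pools:
                if pool and len(picked) < per_domain:
                    picked.append(pool.pop())
        test.extend(picked)

    test_ids = {case["id"] for case in test}
    train = [case for case in cases if case["id"] not in test_ids]
    return train, test


# Synthetic toy corpus: 3 domains x 10 scenarios x 10 cases each.
toy_cases = [
    {"id": f"{d}-{i}", "domain": d, "scenario": i % 10}
    for d in ("medicine", "finance", "law")
    for i in range(100)
]
train, test = stratified_holdout(toy_cases)
print(len(train), len(test))
```

Fixing the seed keeps the split reproducible, which matters when the same 90 questions are reused to compare training regimes.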
[Figures: Training Improvement; Model Ranking]
Experiment II: Cross-Benchmark Generalization
A natural question arises: does training on our rubric-annotated data improve performance only on our own test distribution, or does the acquired reasoning capability generalize to entirely independent evaluation instruments? We evaluate on the One-Million Bench, a large-scale external benchmark constructed by a separate research group with no overlap in annotation methodology. Four metrics are reported: Expert Score and Pass Rate, each evaluated in both English and Chinese.
[Figures: Training Improvement, Expert Score (EN, CN); Training Improvement, Pass Rate (EN, CN); Model Ranking (EN, CN)]
Experiment III: Deep Research Capability
As a final evaluation axis, we assess performance on the DeepResearch Bench, a benchmark specifically engineered to test multi-step information synthesis and extended reasoning under compositional constraints. This evaluation setting most closely mirrors the task structure our dataset is designed to optimize for: long-horizon reasoning chains requiring tool-calling sequences, multi-source evidence integration, and the sustained application of professional-grade judgment across complex decision trees.
[Figures: DeepResearch Bench; DeepResearch Bench II]
Representative Cases
To convey the difficulty calibration and domain specificity of our evaluation instrument, we present three illustrative cases, one from each professional domain, selected to exhibit the characteristic reasoning structures, dependency topologies, and failure modes that our rubric design is engineered to capture.
Medicine — Perioperative Critical Care
Rubrics
| # | Weight | Type | Description | Deps |
|---|---|---|---|---|
| 1 | 3.0 | Extraction | Identify that Abatacept is a large-molecule fusion protein with a very small volume of distribution (Vd ≤ 0.1 L/kg). | |
| 2 | 6.0 | Reasoning | Calculate that a single TPE session non-selectively removes >60% of intravascular Abatacept (specifically 63%–78%). | [1] |
| 3 | 5.0 | Reasoning | Specify that Abatacept must be administered after TPE completion (within 30–60 minutes post-TPE). | |
| 4 | 8.0 | Reasoning | Recognize that concurrent severe capillary leak syndrome causes additional third-space Abatacept loss, significantly shortening its effective plasma half-life. | |
| 5 | 5.0 | Reasoning | Require that TPE replacement fluid consist entirely or predominantly of fresh frozen plasma (FFP). | |
| 6 | 5.0 | Reasoning | Diagnose the GI hemorrhage as late postoperative hemorrhage (Late PPH) / sentinel bleed caused by corrosive pancreatic enzyme erosion of target vessels. | |
| 7 | 2.0 | Reasoning | Precisely localize the most likely culprit vessel of pseudoaneurysm rupture as the gastroduodenal artery (GDA) stump or common/proper hepatic artery. | [6] |
| 8 | 2.0 | Reasoning | Propose "sandwich technique" for transcatheter arterial embolization (TAE) — simultaneous proximal and distal embolization of the parent artery. | |
| 9 | 5.0 | Reasoning | Set platelet count ≥ 50×10⁹/L as the absolute safety threshold before TAE. | |
| 10 | 2.0 | Reasoning | Propose a provocation test (e.g., intra-arterial papaverine injection) to unmask intermittent bleeding and localize the culprit vessel when DSA shows no active extravasation. | |
| 11 | 5.0 | Reasoning | Require urgent pre-/intraoperative infusion of fibrinogen concentrate or cryoprecipitate, with a target fibrinogen level ≥ 2.0 g/L. | |
| 12 | 2.0 | Style | Suggest direct application of thrombin or fibrin sealant around the suspected fistula site via the existing abdominal double-lumen drain for adjunctive physical sealing. | |
| 13 | 5.0 | Reasoning | Mandate 100% carbohydrate-based non-protein calories in PN — strictly lipid-free formulation. | |
| 14 | 8.0 | Reasoning | Set total caloric target at "permissive underfeeding" strategy (15–25 kcal/kg/d) during the acute hyper-inflammatory phase. | |
| 15 | 2.0 | Reasoning | Prescribe trophic enteral feeding initiated at an ultra-low infusion rate of 10–20 mL/h. | |
| 16 | 5.0 | Reasoning | Specify that enteral formula delivered via nasojejunal tube must be a pre-digested, easily absorbed semi-elemental or short-peptide low-residue formulation. | [15] |
| 17 | 5.0 | Reasoning | Require prophylactic high-dose IV thiamine (vitamin B1) supplementation before refeeding after prolonged catabolic depletion to prevent refeeding syndrome. | |
| 18 | 5.0 | Reasoning | Warn that enteral nutrition at norepinephrine ≥ 0.5 µg/kg/min carries a high risk of non-occlusive mesenteric ischemia (NOMI). | |
| 19 | 2.0 | Reasoning | Mandate that abdominal double-lumen drain management use only continuous low-negative-pressure suction (e.g., −10 to −20 cmH₂O); high-pressure irrigation is strictly prohibited. | |
| 20 | 5.0 | Reasoning | Require IV infusion of 20% or 25% concentrated human albumin to leverage hyperosmotic colloid oncotic pressure for mobilizing third-space fluid ("autologous volume resuscitation"). | |
| 21 | 2.0 | Reasoning | Recommend continuous low-dose vasopressin (pitressin) infusion to augment blood pressure while simultaneously constricting the splanchnic vascular bed and reducing portal pressure. | |
| 22 | 2.0 | Style | Suggest assessment of extravascular lung water (EVLW) or venous congestion via transpulmonary thermodilution (PiCCO) or ultrasound VExUS scoring to precisely guide volume removal. | |
| 23 | 2.0 | Reasoning | Recommend early initiation of continuous renal replacement therapy (CRRT) with an initial net ultrafiltration target of 100–200 mL/h under vasopressor support. | |
| 24 | −3.0 | Critical Negative | Recommending Abatacept administration before or during TPE. | |
| 25 | −3.0 | Critical Negative | Using pure albumin as TPE replacement fluid in a patient with active hemorrhage and consumptive coagulopathy (PLT 12). | |
| 26 | −3.0 | Critical Negative | Selecting regional citrate anticoagulation (RCA) as the first-line anticoagulation strategy for blood purification in a patient with severe hepatic impairment (TBil 65) and high-lactate shock. | |
| 27 | −3.0 | Critical Negative | Initiating a second TPE cycle within <24 hours of Abatacept administration (e.g., at 12–18 hours), before adequate drug exposure is achieved. | |
| 28 | −3.0 | Critical Negative | Administering any IV lipid emulsion during severe hypertriglyceridemia (TG 7.2 mmol/L). | |
| 29 | −3.0 | Critical Negative | Using hydroxyethyl starch (HES) for fluid resuscitation in acute kidney injury with severe capillary leak. |
This case exemplifies the core challenge of clinical multidisciplinary reasoning: the pharmacokinetic conflict between therapeutic plasma exchange and biologic agent administration, the surgical contraindications imposed by consumptive coagulopathy (PLT 12 × 10⁹/L), and the nutritional sequencing dilemma under high-output pancreatic fistula, all operating under simultaneous time pressure with interdependent decision nodes. The Critical Negatives are particularly diagnostic: recommending Abatacept before TPE completion, or administering IV lipid emulsion at TG 7.2 mmol/L, represent errors that a junior resident might plausibly and confidently commit but that carry potentially fatal hemodynamic consequences.
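The 63%–78% removal range in criterion 2 can be reproduced with the standard one-compartment approximation for plasma exchange, in which the fraction of an intravascular solute remaining after exchanging a volume V_e against a plasma volume V_p is exp(−V_e / V_p). The rubric does not state its derivation, so the mapping below is an assumption: that the drug is effectively confined to plasma (consistent with the small volume of distribution in criterion 1), that no redistribution occurs during the session, and that the range corresponds to exchanging 1.0 to 1.5 plasma volumes. `fraction_removed` is a hypothetical helper name.

```python
import math


def fraction_removed(plasma_volumes_exchanged: float) -> float:
    """One-compartment estimate of the fraction of an intravascular drug
    cleared by a single plasma exchange session (no redistribution)."""
    return 1.0 - math.exp(-plasma_volumes_exchanged)


for pv in (1.0, 1.5):
    print(f"{pv:.1f} plasma volumes exchanged -> {fraction_removed(pv):.0%} removed")
```

Under these assumptions the two endpoints come out at about 63% and 78%, which is why dosing before or during the session (Critical Negative 24) discards most of the administered drug.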
Finance — Distressed Asset Analysis
Rubrics
| # | Weight | Type | Description | Deps |
|---|---|---|---|---|
| 1 | 1.0 | Extraction | Report that Vanke's total interest-bearing debt is approximately RMB 320B (onshore bank loans ~180B + onshore bonds ~68B + offshore USD bonds ~42B + other ~30B). | |
| 2 | 1.0 | Extraction | Report that Vanke's under-construction and completed inventory has a book value of ~420B, with market revaluation at approximately 240B–280B. | |
| 3 | 4.0 | Reasoning | Calculate secured creditor claims at ~117B (180B bank loans × 65%). | [1] |
| 4 | 4.0 | Reasoning | Under "orderly restructuring," compute distributable assets for unsecured creditors = revaluation − secured claims ≈ 240B–280B − 117B = 123B–163B. | [2, 3] |
| 5 | 3.0 | Reasoning | Total unsecured claims ~173B (unsecured bank loans 63B + onshore bonds 68B + offshore USD bonds 42B) — note the 63B unsecured loans are pari passu. | [1] |
| 6 | 5.0 | Reasoning | Orderly restructuring recovery rate = (229.9B − 117B) / 173B ≈ 65% (internal revaluation) or (260B − 117B) / 173B ≈ 83% (sell-side median). | [4, 5] |
| 7 | 5.0 | Reasoning | Under "disorderly liquidation," estimate a significantly lower recovery rate (deeper inventory liquidation discount at 40%–50%, yielding only 30%–50% recovery). | [2, 5] |
| 8 | 2.0 | Extraction | Identify that offshore USD bondholders face "structural subordination" — the Cayman issuer is 2–3 legal layers removed from onshore assets, imposing 6–12 months of additional recovery time. | |
| 9 | 6.0 | Reasoning | Structural subordination discount: offshore USD bonds at 85%–90% of the 65% base recovery → actual recovery ~55–59 cents — 42 cents offers ~13–17 cents (31%–40%) margin of safety. | [6, 7, 8] |
| 10 | 2.0 | Extraction | Identify "21 Vanke 04" (RMB 4.5B, maturing December 2025) as the nearest trigger — non-payment would constitute an onshore bond default, satisfying the cross-default clause ("financial indebtedness unpaid above USD 100M equivalent"). | |
| 11 | 3.0 | Extraction | An onshore bond default would trigger cross-default acceleration of all offshore USD bonds (~42B, ~USD 5.8B), entitling all USD bondholders to demand immediate repayment. | [10] |
| 12 | 4.0 | Reasoning | Quantify post-cross-default price impact on USD bonds (e.g., from 42 to 25–30 cents). | [11] |
| 13 | 3.0 | Extraction | Shenzhen Metro (~28% stake) would most likely intervene via credit guarantee for onshore bonds (rather than direct equity injection) — a guarantee does not constitute state-asset leakage and resolves the near-term liquidity crisis through maturity extension. | |
| 14 | 4.0 | Reasoning | SOE credit guarantee mainly benefits onshore creditors (recovery 65% → 80%–90%), limited benefit to offshore USD bonds (55–59 → 60–65 cents) — structural subordination not eliminated by guarantee. | [13] |
| 15 | 6.0 | Reasoning | Clearly answer whether 42 cents provides adequate margin of safety. | |
| 16 | 3.0 | Reasoning | Define position-sizing triggers: add to position if USD bonds fall below 35 cents and Shenzhen Metro signals explicit intervention (e.g., guarantee announcement), increasing to USD 80M face value; stop-loss: liquidate if onshore default occurs with no SOE rescue and price breaches 25 cents. | [15] |
| 17 | 1.0 | Extraction | Report that Vanke posted a net loss of approximately RMB 19.8B in 2024 — the first annual loss since listing. | |
| 18 | 1.0 | Extraction | Report that onshore non-standard debt has been overdue by more than 90 days (RMB 3.8B) but has not yet triggered an onshore bond default event. | |
| 19 | 1.0 | Extraction | Report that the offshore USD bonds (2027 maturity, 4.25% coupon) currently trade in the 35–42 cents range. | |
| 20 | 5.0 | Reasoning | Calculate P&L for the USD 21M position under different recovery rate scenarios. | [6, 7, 9] |
| 21 | 3.0 | Extraction | Provide city-tier differentiated inventory discount: Shenzhen projects (18%, 75.6B book) at 70% = 52.9B; Tier-1/2 non-Shenzhen (35%, 147B) at 60% = 88.2B; Tier-3/4 (47%, 197.4B) at 45% = 88.8B — total revaluation ~229.9B. | |
| 22 | 1.0 | Extraction | Report that onshore bonds (e.g., "20 Vanke 06") trade at approximately 72–78 cents on the dollar. | |
| 23 | 2.0 | Extraction | Identify that the S&P downgrade to CCC+ triggers a triple negative feedback loop: (1) banks withdraw or refuse to roll over credit, accelerating cash burn; (2) asset buyers exploit distress to bid below revaluation; (3) refinancing costs spike, shutting all capital market access. | |
| 24 | 2.0 | Style | Provide a recovery rate calculation model in the appendix. | |
| 25 | 3.0 | Reasoning | Position recommendation: hold at 42 cents (expected recovery 55–59 cents, 31%–40% return); add below 35 if SOE signals emerge; stop-loss at 25 cents if onshore default + no bailout. | [15, 16] |
| 26 | −5.0 | Critical Negative | Ignoring cross-default clause — treating onshore and offshore bond defaults as independent events. | |
| 27 | −5.0 | Critical Negative | Ignoring structural subordination — assigning identical recovery rates to offshore and onshore unsecured creditors. | |
| 28 | −5.0 | Critical Negative | Using book value (420B) instead of market revaluation (240B–280B) to compute recovery rates. | |
| 29 | 4.0 | Reasoning | Margin of safety at 42 cents: expected recovery 55–59 cents vs. 42-cent entry yields ~13–17 cents profit (31%–40% return) — sufficient to absorb structural subordination discount and time-value erosion. | [9] |
| 30 | 2.0 | Extraction | Forming an ad hoc committee requires assembling at least 25%–33% of USD bondholders (~10.5B–13.9B face value) to obtain an effective blocking position against restructuring proposals. | |
| 31 | 2.0 | Extraction | Filing a winding-up petition in the Cayman court can serve as a negotiation pressure tool to accelerate SOE intervention — but risks triggering disorderly liquidation, making it a double-edged sword. | |
| 32 | 3.0 | Reasoning | Offshore USD bonds (42B) represent only 24.3% of total unsecured claims (173B) — the pari passu pro-rata share is inherently limited, further compressed by structural subordination. | [5] |
| 33 | 3.0 | Reasoning | Operating cash flow of negative 8.5B implies ~700M monthly cash burn — without external capital injection, remaining cash reserves (~20B) can sustain operations for approximately 28 months. | |
| 34 | −4.0 | Critical Negative | Excluding 63B unsecured bank loans from the unsecured creditor pool — using only 68B + 42B = 110B instead of the correct 173B. | |
| 35 | −3.0 | Critical Negative | Failing to consider homebuyer pre-sale priority — under "Bao Jiao Lou" policy, homebuyers may rank ahead of all financial creditors. |
The finance case demands simultaneous mastery of credit analysis, legal capital structure, and portfolio risk management: computing recovery rates under competing restructuring scenarios, modeling the structural subordination discount applicable to Cayman-issued offshore bonds, evaluating differential SOE intervention modes and their tier-specific creditor implications, and synthesizing these into actionable position-sizing and stop-loss recommendations. The Critical Negatives are especially revealing: ignoring cross-default propagation mechanics, substituting book value for market revaluation in recovery analysis, or treating offshore and onshore unsecured creditors as symmetric are precisely the errors that a model fluent in financial language would generate with high confidence, and that a trained analyst would immediately flag as disqualifying.
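The recovery arithmetic in rubric rows 3 to 6, 9, and 21 can be retraced step by step. The sketch below uses only figures stated in the table (RMB billions, prices in cents); the variable names are illustrative, and the unrounded intermediate values differ slightly from the rubric's rounded ones (for example 229.95 versus 229.9).

```python
# All inputs are copied from the Finance rubric table (RMB billions).
BOOK_VALUE = 420.0
INVENTORY_TIERS = [  # (share of book value, realizable fraction of book)
    (0.18, 0.70),  # Shenzhen projects
    (0.35, 0.60),  # Tier-1/2 non-Shenzhen
    (0.47, 0.45),  # Tier-3/4
]
BANK_LOANS = 180.0
SECURED_SHARE = 0.65
ONSHORE_BONDS = 68.0
OFFSHORE_BONDS = 42.0
ENTRY_PRICE_CENTS = 42.0

# Row 21: city-tier revaluation of inventory.
revaluation = sum(BOOK_VALUE * share * keep for share, keep in INVENTORY_TIERS)

# Rows 3 and 5: secured claims, then the full unsecured pool
# (unsecured bank loans are included, per Critical Negative 34).
secured = BANK_LOANS * SECURED_SHARE
unsecured = (BANK_LOANS - secured) + ONSHORE_BONDS + OFFSHORE_BONDS

# Row 6: base recovery under orderly restructuring.
base_recovery = (revaluation - secured) / unsecured

# Row 9: structural subordination leaves offshore holders 85%-90% of base.
offshore_low = 100 * base_recovery * 0.85
offshore_high = 100 * base_recovery * 0.90

print(f"revaluation = {revaluation:.2f}B, unsecured pool = {unsecured:.0f}B")
print(f"base recovery = {base_recovery:.1%}")
print(f"offshore recovery = {offshore_low:.1f} to {offshore_high:.1f} cents")
print(f"margin vs entry = {offshore_low - ENTRY_PRICE_CENTS:.1f} "
      f"to {offshore_high - ENTRY_PRICE_CENTS:.1f} cents")
```

Swapping `revaluation` for the 420B book value, or dropping the 63B of unsecured bank loans from the pool, reproduces the inflated recoveries that Critical Negatives 28 and 34 penalize.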
Law — Cross-border Labor Dispute
Rubrics
| # | Weight | Type | Description | Deps |
|---|---|---|---|---|
| 1 | 3.0 | Extraction | Identify that Shanghai TaiKe is a legally established subsidiary with independent legal personality. | |
| 2 | 4.0 | Reasoning | Establish that a subsidiary and parent are two independent legal entities, each bearing independent civil liability. | [1] |
| 3 | 3.0 | Extraction | Confirm that Mr. Li signed an employment contract with Shanghai TaiKe, establishing an employment relationship between them. | |
| 4 | 4.0 | Extraction | Identify that Mr. Li's employer is Shanghai TaiKe, not TechGlobal Inc. | [3] |
| 5 | 6.0 | Reasoning | Conclude that TechGlobal Inc., as the parent company, is not Mr. Li's employer and has no authority to directly terminate his employment contract. | [2, 4] |
| 6 | 5.0 | Reasoning | Confirm that TechGlobal Inc.'s direct termination of Mr. Li's employment contract constitutes unlawful termination. | [5] |
| 7 | 5.0 | Reasoning | Determine that the proper respondent in labor arbitration should be Shanghai TaiKe, not TechGlobal Inc. | [4] |
| 8 | 3.0 | Extraction | State that labor arbitration jurisdiction lies with the labor dispute arbitration commission at the place of contract performance or the employer's domicile. | |
| 9 | 4.0 | Reasoning | Determine that the competent arbitration body is the Shanghai Pudong New Area Labor and Personnel Dispute Arbitration Commission. | [8] |
| 10 | 4.0 | Extraction | State that Mr. Li may claim compensation for unlawful termination of the employment contract. | [6] |
| 11 | 5.0 | Reasoning | Calculate illegal termination compensation at 2× the economic compensation standard. | [10] |
| 12 | 5.0 | Reasoning | Calculate the economic compensation as RMB 480,000 (monthly salary 80K × 6 years of service). | |
| 13 | 6.0 | Reasoning | Calculate total illegal termination compensation: RMB 960,000 (economic compensation 80K × 6 years × 2). | [11, 12] |
| 14 | 5.0 | Reasoning | Identify that accepting 350K in kickbacks may constitute "bribery by non-state personnel" (非国家工作人员受贿罪). | |
| 15 | 3.0 | Extraction | State that the subject of bribery by non-state personnel is an employee of a company, enterprise, or other entity. | |
| 16 | 4.0 | Reasoning | Confirm that Mr. Li, as General Manager of Shanghai TaiKe, satisfies the subject element of bribery by non-state personnel. | [15] |
| 17 | 3.0 | Extraction | State that the prosecution threshold for bribery by non-state personnel is RMB 30,000 or more. | |
| 18 | 4.0 | Reasoning | Confirm that RMB 350K exceeds the prosecution threshold (30K), warranting criminal liability. | [17] |
| 19 | 4.0 | Extraction | State the sentencing range for "relatively large amount" (60K–1M) in bribery by non-state personnel: up to 3 years imprisonment or criminal detention, plus a fine. | |
| 20 | 3.0 | Extraction | State the sentencing range for "especially large amount" (1M or above) in bribery by non-state personnel: 3–10 years imprisonment, plus a fine. | |
| 21 | 5.0 | Reasoning | Determine that RMB 350K falls under "relatively large amount" (60K–1M) — potential sentence: ≤ 3 years imprisonment or detention with fine. | [19, 20] |
| 22 | 4.0 | Reasoning | Correctly conclude that the kickback behavior does not constitute embezzlement (职务侵占罪). | |
| 23 | 4.0 | Reasoning | Analyze whether Mr. Li can file a criminal complaint against TechGlobal Inc. or Shanghai TaiKe, and reach a negative conclusion. | |
| 24 | 3.0 | Extraction | State that the statute of limitations for labor arbitration is one year, commencing from the date the rights holder knew or should have known of the infringement. | |
| 25 | 3.0 | Reasoning | Confirm that Mr. Li's labor arbitration claim has not exceeded the limitation period. | [24] |
| 26 | 4.0 | Reasoning | Advise strategic coordination between labor arbitration and criminal defense — exercise caution when discussing kickback facts in arbitration proceedings. | |
| 27 | 3.0 | Reasoning | Recommend that Mr. Li seek to hold Shanghai TaiKe liable for unlawful termination compensation in labor arbitration. | [7, 13] |
| 28 | 3.0 | Reasoning | Advise Mr. Li of the criminal prosecution risk and recommend pursuing mitigating factors in criminal proceedings (e.g., voluntary return of illicit gains, guilty plea and acceptance of penalty). | [14, 18] |
| 29 | 1.0 | Style | The legal opinion is structured with three main sections: "Case Facts," "Legal Analysis," and "Conclusions and Recommendations." | |
| 30 | 1.0 | Style | Each legal opinion is supported by the corresponding statutory provisions. | |
| 31 | −5.0 | Critical Negative | Erroneously concluding that TechGlobal Inc. can directly terminate Mr. Li's employment contract. | |
| 32 | −5.0 | Critical Negative | Erroneously listing TechGlobal Inc. as the respondent in labor arbitration. | |
| 33 | −5.0 | Critical Negative | Erroneously concluding that accepting kickbacks constitutes embezzlement (职务侵占罪). | |
| 34 | −5.0 | Critical Negative | Erroneously classifying RMB 350K as "especially large amount" instead of "relatively large amount." | |
| 35 | −5.0 | Critical Negative | Erroneously concluding that Mr. Li can file a criminal complaint against the company. |
This case probes the intersection of corporate law, employment law, and criminal law within a single integrated fact pattern. The rubric structure is particularly instructive: the parent/subsidiary legal independence determination cascades into arbitration respondent identification, which in turn cascades into compensation quantum calculation, forming a multi-level reasoning chain where an error at the root invalidates the entire downstream analysis. Meanwhile, the criminal liability assessment operates on a parallel inferential track where misclassifying the offense type (embezzlement vs. bribery by non-state personnel) or the statutory amount tier produces categorically erroneous sentencing outcomes with no self-correcting mechanism. Each of these characteristic failure modes is independently encoded as a Critical Negative with substantial penalty weight.
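The two quantitative determinations in this rubric, the compensation quantum (rows 11 to 13) and the amount tier (rows 17 to 21), reduce to simple arithmetic. The sketch below is illustrative only: the function names are hypothetical, the thresholds are copied from the rubric rows rather than from any statute, and nothing here should be read as a statement of current law. The rubric does not say how amounts between the 30K prosecution threshold and the 60K tier floor are classified, so the code reports that band as unspecified.

```python
# Thresholds in RMB, as stated in rubric rows 17, 19, and 20.
PROSECUTION_THRESHOLD = 30_000
RELATIVELY_LARGE_MIN = 60_000
ESPECIALLY_LARGE_MIN = 1_000_000


def unlawful_termination_compensation(monthly_salary, years_of_service, multiplier=2):
    """Return (economic compensation, unlawful-termination compensation),
    following rows 11-13: one month of salary per year of service, doubled."""
    economic = monthly_salary * years_of_service
    return economic, economic * multiplier


def classify_amount(amount):
    """Place a kickback amount into the tiers named in the rubric."""
    if amount >= ESPECIALLY_LARGE_MIN:
        return "especially large amount"
    if amount >= RELATIVELY_LARGE_MIN:
        return "relatively large amount"
    if amount >= PROSECUTION_THRESHOLD:
        return "above prosecution threshold (tier not specified in rubric)"
    return "below prosecution threshold"


print(unlawful_termination_compensation(80_000, 6))
print(classify_amount(350_000))
```

For Mr. Li's figures this yields RMB 480,000 and RMB 960,000, and places the RMB 350K in the "relatively large amount" tier, matching rows 12, 13, and 21.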
What This Enables
Three methodological implications of expert-authored evaluation infrastructure for the broader field
Diagnostic Benchmarking at Professional Grade
Prevailing benchmarks measure surface-level task completion, collapsing the internal structure of professional reasoning into aggregate accuracy scores that obscure the distinction between genuine competence and superficial fluency. Our rubric architecture, with differentially weighted criteria and Critical Negatives, enables evaluation that diagnostically separates a model producing a plausible-sounding answer from one that reasons through the correct causal chain with appropriate failure-mode awareness. This is the difference between a system that passes a standardized medical examination and one to which a practicing clinician would delegate patient care.
Expert Signal as Training Data
Our experiments establish that rubric-annotated expert data functions not merely as an evaluative instrument but as a generative training signal: supervised fine-tuning on this corpus produces measurable and consistent gains not only on our own held-out test distribution but on entirely independent evaluation benchmarks constructed by separate research groups. The structured reasoning patterns encoded in expert rubrics (dependency awareness, failure-mode avoidance, source-grounded justification) appear to transfer as a domain-general capability improvement, suggesting that expert annotation quality may be a more efficient lever for capability gain than data volume or model scale.
Toward Domain-Specific Alignment
Generic RLHF and instruction-tuning paradigms optimize for user satisfaction, a useful but fundamentally imprecise proxy for the kind of professional correctness that high-stakes domains require. Our dataset opens a methodological path toward domain-specific alignment, where the reward signal is defined not by aggregate human preference ratings but by the structured, verifiable judgment of credentialed practitioners operating under the professional standards and liability frameworks that govern real-world practice.
Built by Domain Experts, for Domain Intelligence
ExpertEval represents a deliberate methodological departure from the scale-driven paradigm that dominates contemporary benchmark design. Rather than pursuing breadth through automated generation or crowd-sourced annotation, we invest in depth through expert authorship, producing 3,213 cases where every query and every rubric criterion reflects the professional judgment of a credentialed practitioner with verifiable domain expertise.
The empirical results validate the hypothesis that motivates this design philosophy: a 35B sparse mixture-of-experts model, trained on our expert-annotated corpus, achieves state-of-the-art performance across held-out expert evaluation, cross-benchmark generalization, and deep research tasks — consistently outperforming frontier systems with orders of magnitude more parameters and substantially larger training budgets. We interpret this as convergent evidence that the quality of training signal, rather than the quantity of training data or the scale of model parameters, constitutes the binding constraint on professional-grade AI capability.
Contact
For questions, collaborations, or access requests, reach us at:
contact@unipat.ai