ExpertEval: Can AI Match the Judgment of Seasoned Professionals?

2026-05-06

Toward Evaluating Real-World, High-Stakes Productivity

The field has grown adept at constructing benchmarks that AI systems can pass, yet passing is not the same as performing. We pose a more exacting question: can AI systems produce professional work product that a credentialed domain expert would sign off on? ExpertEval is a large-scale evaluation infrastructure spanning Medicine, Finance, and Law, three domains where the quality of reasoning is judged not by accuracy metrics but by the severity of its real-world consequences. Every case is authored by practicing professionals with verifiable domain credentials and paired with fine-grained rubrics that encode expert judgment, including the characteristic failure modes that separate competent analysis from dangerously fluent confabulation.

UniPat Evaluation Team

contact@unipat.ai

ExpertEval encodes the tacit standards of seasoned professionals into structured rubrics, transforming years of domain judgment into reproducible, transparent evaluation criteria.

Expert-Authored · Rubric-Driven · Failure-Aware

Every rubric criterion traces back to authoritative primary sources, and characteristic professional errors are encoded as Critical Negatives that carry substantial penalties. The result: we measure not whether a model can produce an answer, but whether that answer would survive peer review by a credentialed practitioner.

Beyond Pattern Matching

Professional domains impose reasoning under genuine epistemic uncertainty, where novel constraint configurations arise that no training distribution has adequately represented. We construct queries that faithfully reproduce this cognitive load: multi-system organ failure complicated by pharmacokinetic drug conflicts, cross-default cascades propagating through distressed capital structures, and multi-jurisdictional legal fact patterns requiring simultaneous navigation of parallel criminal and civil liability tracks.

Auditable Ground Truth

Every rubric criterion that invokes external knowledge is anchored to an authoritative primary source with a traceable URL: clinical practice guidelines, regulatory filings, statutory provisions, judicial interpretations. This design choice transforms evaluation from an exercise in subjective expert consensus into independently auditable, reproducible fact-verification, ensuring that any qualified reviewer, operating without access to our annotators, would converge on materially identical scores.

Failure-Mode Awareness

Productive intelligence is defined as much by what it avoids as by what it achieves. We encode Critical Negatives, domain-specific professional pitfalls that carry disproportionate real-world cost relative to their surface plausibility, as penalty rubric items with negative weight. This ensures the scoring surface achieves maximal discriminative power precisely at the failure boundary where fluent but materially dangerous answers reside, the region most consequential for deployment safety.

3,213

Expert Cases

To our knowledge, one of the largest expert-annotated corpora assembled to date for professional-grade reasoning evaluation

207

Scenarios

Granular professional scenarios spanning clinical decision-making, financial structuring, and legal analysis

21.83

Avg Rubrics

Multi-dimensional scoring rubrics per case, with differential weights and critical negatives

Corpus Composition

Three domains selected for the severity of their professional stakes and the irreducibility of their reasoning complexity

Domain Distribution

Medicine 1,146 cases · 43 scenarios

Finance 1,147 cases · 58 scenarios

Law 920 cases · 106 scenarios

Three professional domains with 3,213 expert-annotated cases in total

Medicine

1,146 cases · 43 scenarios

Perioperative ICU Management · Multi-organ Failure Triage · Pharmacokinetic Conflict · Immunosuppression Sequencing · Sepsis Bundle Escalation · Coagulopathy Management · Ventilator Weaning Protocol · … 36 more

Finance

1,147 cases · 58 scenarios

Distressed Asset Valuation · M&A Deal Structuring · Derivatives Pricing · Capital Structure Analysis · Cross-default Propagation · Credit Risk Modeling · Portfolio Stress Testing · … 51 more

Law

920 cases · 106 scenarios

Cross-border Labor Dispute · Data Privacy Compliance · Criminal Sentencing Threshold · Multi-jurisdiction Liability · Securities Fraud Litigation · Regulatory Enforcement · Contract Interpretation · … 99 more

Principled Evaluation Design

A four-principle rubric methodology that transforms subjective expert judgment into deterministic, independently reproducible scoring

The central methodological challenge in expert-level evaluation is not data volume but scoring surface design. A rubric that merely asks "is the answer correct?" collapses the rich internal structure of professional reasoning into a degenerate binary signal. We adopt four design principles that, taken together, ensure our rubrics faithfully encode the full structure of expert judgment, its verifiability constraints, and its domain-characteristic failure modes.

1. Specificity

Vague evaluative language is categorically excluded. Every criterion must specify quantifiable, deterministic assessment standards: a drug dosage threshold in mg/kg, a statutory penalty range with sentencing tiers, a recovery rate calculation with explicit numerator and denominator. This constraint eliminates evaluator subjectivity at the criterion level and guarantees that two independent assessors, operating without communication, will converge on materially identical scores.

2. Verifiability

Every criterion that depends on domain knowledge external to the case itself is anchored to an authoritative primary source via a stable URL, including clinical practice guidelines published by professional societies, SEC/CSRC regulatory filings, and statutory provisions with article-level citation. This transforms rubric-based evaluation from an exercise in "expert opinion" (inherently non-reproducible) into independently auditable fact-verification against the same source material available to any qualified practitioner.

3. Structured Rubrics

Rubric criteria are organized into coherent reasoning chains. When a prerequisite criterion is unmet (e.g., a model fails to correctly identify the governing legal entity in a cross-border dispute), all downstream criteria whose validity logically depends on that determination are automatically invalidated and score zero. This prevents inflated scores arising from partially correct but inferentially unsound reasoning.

4. Critical Negatives

We systematically encode domain-characteristic professional pitfalls as negative-weight rubric items. A model that recommends lipid emulsion infusion in a patient with severe hypertriglyceridemia (TG > 5.6 mmol/L), or that ignores structural subordination when computing offshore bond recovery rates, receives substantial score penalties, mirroring the disproportionate real-world cost these specific errors carry in clinical practice and financial advisory respectively.
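To make the interaction between dependency chains and Critical Negatives concrete, here is a minimal scoring sketch. The post does not publish the exact aggregation formula, so the normalization used here (earned weight divided by total positive weight, floored at zero) is an assumption, as are the names `Criterion` and `score_case`. The weights and dependencies for rubrics 1 to 5 are taken from the law case shown later; rubric 99 is a hypothetical Critical Negative added for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Criterion:
    cid: int
    weight: float                     # a negative weight marks a Critical Negative
    deps: list[int] = field(default_factory=list)


def score_case(criteria: list[Criterion], met: set[int]) -> float:
    """Return a normalized score in [0, 1] for one case (assumed formula).

    `met` holds the ids the grader marked as triggered; for a Critical
    Negative, "triggered" means the model committed the error.
    Dependencies are assumed to form an acyclic graph.
    """
    by_id = {c.cid: c for c in criteria}
    counted: dict[int, bool] = {}

    def counts(cid: int) -> bool:
        # A criterion counts only if it is met and every prerequisite counts.
        if cid not in counted:
            counted[cid] = cid in met and all(counts(d) for d in by_id[cid].deps)
        return counted[cid]

    earned = sum(c.weight for c in criteria if counts(c.cid))
    max_positive = sum(c.weight for c in criteria if c.weight > 0)
    return max(0.0, earned) / max_positive


law_subset = [
    Criterion(1, 3.0), Criterion(2, 4.0, [1]), Criterion(3, 3.0),
    Criterion(4, 4.0, [3]), Criterion(5, 6.0, [2, 4]),
    Criterion(99, -3.0),              # hypothetical Critical Negative
]

# Rubric 3 is missed, so rubrics 4 and 5 are invalidated even though "met".
print(score_case(law_subset, {1, 2, 4, 5}))        # 0.35
# Same answer, but the Critical Negative is also triggered.
print(score_case(law_subset, {1, 2, 4, 5, 99}))    # 0.2
```

In this sketch a single missed prerequisite removes 10 of the 20 available points, which is the score-deflating behavior the Structured Rubrics principle describes.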

Empirical Validation

We validate the dataset's empirical utility across three complementary evaluation settings, demonstrating that expert-authored rubrics yield a training signal of sufficient quality to measurably shift model reasoning behavior, and a scoring surface of sufficient granularity to discriminate among frontier-class systems.

Experiment I: Training Signal Quality

We evaluate whether expert-annotated rubrics provide a training signal of sufficient quality to produce measurable improvements in professional reasoning capability. We employ a stratified hold-out design: 90 questions (30 per domain, uniformly sampled across scenario types) constitute the test set, with the remaining ~2,500 cases allocated to supervised fine-tuning of Qwen3.5-35B-A3B. Two training regimes are compared: standard SFT using ReAct-style information synthesis trajectories, and a Heavy variant incorporating 8-sample response aggregation to increase training signal density.
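A stratified hold-out of this kind can be sketched as follows. This is an illustration under assumptions, not the split script used for the experiment: the record fields `id`, `domain`, and `scenario` and the function name `stratified_holdout` are hypothetical, and "uniformly sampled across scenario types" is interpreted here as a round-robin draw over shuffled scenarios within each domain.

```python
import random
from collections import defaultdict


def stratified_holdout(cases, per_domain=30, seed=0):
    """Split cases into (test, train), drawing `per_domain` test cases per domain.

    Within each domain the draw cycles over scenarios, so scenario types
    are covered as evenly as the data allows.
    """
    rng = random.Random(seed)
    by_domain = defaultdict(lambda: defaultdict(list))
    for case in cases:
        by_domain[case["domain"]][case["scenario"]].append(case)

    test = []
    for domain in sorted(by_domain):
        pools = [by_domain[domain][s] for s in sorted(by_domain[domain])]
        for pool in pools:
            rng.shuffle(pool)         # random case order inside a scenario
        rng.shuffle(pools)            # random scenario order inside a domain
        picked = 0
        while picked < per_domain and any(pools):
            for pool in pools:
                if pool and picked < per_domain:
                    test.append(pool.pop())
                    picked += 1

    test_ids = {case["id"] for case in test}
    train = [case for case in cases if case["id"] not in test_ids]
    return test, train
```

Fixing the seed keeps the test set stable across training runs, which matters when the same 90 questions are used to compare the Baseline, SFT, and Heavy variants.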

Training Improvement

Baseline: 54.76%
SFT: 66.22%
SFT, Heavy: 68.94%

Baseline → SFT: +11.46 percentage points · SFT → Heavy: +2.72 percentage points

Model Ranking

Ours (SFT, Heavy): 68.94
Claude-Opus-4.6: 66.88
Ours (SFT): 66.22
GPT-5.4-High: 66.20
Qwen3.6-Plus: 63.32
Gemini-3.1-Pro-Preview: 63.00
GLM-5: 59.27
Kimi-K2.5: 58.24
Doubao-Seed-2.0-Pro: 56.22
Ours (Baseline): 54.76
Qwen3-Deep-Research: 48.52

Legend: Ours · Other Models (w/ tools)

Key finding: SFT on expert rubric data lifts the base model from 54.76% to 66.22% (+11.46 percentage points), placing it ahead of Qwen3.6-Plus (63.32%) and Gemini-3.1-Pro-Preview (63.00%) and marginally ahead of GPT-5.4-High (66.20%), all of which are substantially larger systems with tool-use capabilities. The Heavy variant with 8-sample aggregation reaches 68.94%, the highest score on the held-out test set and above Claude-Opus-4.6 (66.88%). A 35B sparse mixture-of-experts model, trained exclusively on our expert-annotated data, thus matches or outperforms far larger frontier systems on this test set, which suggests that training signal quality, rather than model scale alone, is a binding constraint.

Experiment II: Cross-Benchmark Generalization

A natural question arises: does training on our rubric-annotated data improve performance only on our own test distribution, or does the acquired reasoning capability generalize to entirely independent evaluation instruments? We evaluate on the One-Million Bench, a large-scale external benchmark constructed by a separate research group with no overlap in annotation methodology. Four metrics are reported: Expert Score and Pass Rate, each evaluated in both English and Chinese.

Training Improvement: Expert Score

EN: Baseline 45.0 · SFT 55.0 · SFT, Heavy 64.8
CN: Baseline 44.0 · SFT 52.1 · SFT, Heavy 61.3

Training Improvement: Pass Rate

EN: Baseline 18.5 · SFT 27.5 · SFT, Heavy 43.5
CN: Baseline 16.0 · SFT 26.5 · SFT, Heavy 42.0

Model Ranking: EN (Expert Score)

Ours (SFT, Heavy): 64.8
Claude-Opus-4.6: 63.0
GPT-5.4-High: 59.2
Ours (SFT): 55.0
Doubao-Seed-2.0-Pro: 51.8
Qwen3.6-Plus: 49.0
Gemini-3.1-Pro-Preview: 44.6
Kimi-K2.5: 41.2
GLM-5: 41.1

Legend: Ours · Other Models

Model Ranking: CN (Expert Score)

Claude-Opus-4.6: 64.5
Ours (SFT, Heavy): 61.3
GPT-5.4-High: 58.7
Ours (SFT): 52.1
Qwen3.6-Plus: 52.1
Doubao-Seed-2.0-Pro: 50.5
Gemini-3.1-Pro-Preview: 46.9
Kimi-K2.5: 43.7
GLM-5: 41.1

Legend: Ours · Other Models

The training signal generalizes. On an independent benchmark with no methodological overlap, our Heavy-SFT model achieves the highest Expert Score in English (64.8, ahead of Claude-Opus-4.6 at 63.0) and ranks second in Chinese (61.3, trailing Claude-Opus-4.6 at 64.5). Pass Rate improvements are especially striking: in English the rate rises from 18.5% at baseline to 43.5% after Heavy SFT, and in Chinese from 16.0% to 42.0%. This indicates that expert rubric data teaches not merely surface-level accuracy but the structured reasoning discipline required to fully satisfy professional evaluation criteria across independent assessment frameworks.

Experiment III: Deep Research Capability

As a final evaluation axis, we assess performance on the DeepResearch Bench, a benchmark specifically engineered to test multi-step information synthesis and extended reasoning under compositional constraints. This evaluation setting most closely mirrors the task structure our dataset is designed to optimize for: long-horizon reasoning chains requiring tool-calling sequences, multi-source evidence integration, and the sustained application of professional-grade judgment across complex decision trees.

DeepResearch Bench

Ours (SFT): 47.63
OpenAI: 46.98
Perplexity: 42.25
Grok: 40.24

Legend: Ours · Other Models

DeepResearch Bench II

Ours (SFT): 50.45
GPT-o3: 45.40
Gemini-3-Pro: 44.60
Doubao: 40.99
Qwen3-Max: 39.25
Grok: 39.23
Perplexity: 38.58

Legend: Ours · Other Models

State-of-the-art on both evaluation rounds. Our system achieves 47.63 on DeepResearch Bench (vs. OpenAI at 46.98) and 50.45 on DeepResearch Bench II (vs. GPT-o3 at 45.40 and Gemini-3-Pro at 44.60). The margin over the runner-up widens from 0.65 points in the first round to 5.05 points in the second, more challenging round, although the two rounds compare against different sets of systems. This pattern is consistent with the structured reasoning patterns internalized from expert rubric training paying off more as task complexity increases, the behavior one would expect from genuine capability acquisition rather than superficial pattern memorization.

Representative Cases

To convey the difficulty calibration and domain specificity of our evaluation instrument, we present three illustrative cases, one from each professional domain, selected to exhibit the characteristic reasoning structures, dependency topologies, and failure modes that our rubric design is engineered to capture.

01 · Medicine: Perioperative Critical Care

Scenario: Medicine · Perioperative ICU Management

I am an attending physician in hepatobiliary-pancreatic surgery and SICU. I am currently managing a patient on postoperative day 8 after a standard Whipple procedure (pancreaticoduodenectomy), who has developed severe new-onset complications. Requesting MDT (multidisciplinary team) consultation for the next steps. Patient: Female, 62 years old, 55 kg. Operated for pancreatic head ductal adenocarcinoma. History of severe refractory rheumatoid arthritis (RA) for 12 years, on long-term Abatacept and low-dose steroids. Last Abatacept IV infusion was 3 days before surgery. Current status: Abdominal drainage output surged to 850 mL of turbid dark-red fluid with tissue debris; drain amylase 45,000 U/L. Vitals: invasive arterial BP 82/45 mmHg, HR 135, temp 40.2°C. Norepinephrine at 0.6 µg/kg/min. Hb dropped from 105 to 68 g/L. PLT 12×10⁹/L, WBC 1.1×10⁹/L; ferritin 38,500 µg/L, TG 7.2 mmol/L, sCD25 22,000 U/mL. Suspected secondary macrophage activation syndrome (MAS). Rheumatology and hematology recommend immediate therapeutic plasma exchange (TPE) plus high-dose Abatacept. The surgical team objects — massive pancreatic fistula with highly corrosive enzymes makes aggressive immunosuppression extremely risky. Anesthesiology refuses open re-exploration due to PLT 12 and severe capillary leak. Please provide specific guidance on: (1) pharmacokinetic conflict between Abatacept (large-molecule fusion protein) and TPE — design a precise 48-hour timeline; (2) optimal non-open bedside/IR interventions for suspected pseudoaneurysm/sentinel bleed with 850 mL/day pancreatic enzyme-rich drainage; (3) enteral/parenteral nutrition sequencing strategy via existing nasojejunal tube given high-output fistula, extreme inflammatory consumption, and hepatorenal impairment.

Rubrics (29)

Columns: # · Weight · Type · Description · Deps (prerequisite rubric numbers, in square brackets)

1 3.0 Extraction Identify that Abatacept is a large-molecule fusion protein with a very small volume of distribution (Vd ≤ 0.1 L/kg).

2 6.0 Reasoning Calculate that a single TPE session non-selectively removes >60% of intravascular Abatacept (specifically 63%–78%). [1]

3 5.0 Reasoning Specify that Abatacept must be administered after TPE completion (within 30–60 minutes post-TPE).

4 8.0 Reasoning Recognize that concurrent severe capillary leak syndrome causes additional third-space Abatacept loss, significantly shortening its effective plasma half-life.

5 5.0 Reasoning Require that TPE replacement fluid consist entirely or predominantly of fresh frozen plasma (FFP).

6 5.0 Reasoning Diagnose the GI hemorrhage as late postoperative hemorrhage (Late PPH) / sentinel bleed caused by corrosive pancreatic enzyme erosion of target vessels.

7 2.0 Reasoning Precisely localize the most likely culprit vessel of pseudoaneurysm rupture as the gastroduodenal artery (GDA) stump or common/proper hepatic artery. [6]

8 2.0 Reasoning Propose "sandwich technique" for transcatheter arterial embolization (TAE) — simultaneous proximal and distal embolization of the parent artery.

9 5.0 Reasoning Set platelet count ≥ 50×10⁹/L as the absolute safety threshold before TAE.

10 2.0 Reasoning Propose a provocation test (e.g., intra-arterial papaverine injection) to unmask intermittent bleeding and localize the culprit vessel when DSA shows no active extravasation.

11 5.0 Reasoning Require urgent pre-/intraoperative infusion of fibrinogen concentrate or cryoprecipitate, with a target fibrinogen level ≥ 2.0 g/L.

12 2.0 Style Suggest direct application of thrombin or fibrin sealant around the suspected fistula site via the existing abdominal double-lumen drain for adjunctive physical sealing.

13 5.0 Reasoning Mandate 100% carbohydrate-based non-protein calories in PN — strictly lipid-free formulation.

14 8.0 Reasoning Set total caloric target at "permissive underfeeding" strategy (15–25 kcal/kg/d) during the acute hyper-inflammatory phase.

15 2.0 Reasoning Prescribe trophic enteral feeding initiated at an ultra-low infusion rate of 10–20 mL/h.

16 5.0 Reasoning Specify that enteral formula delivered via nasojejunal tube must be a pre-digested, easily absorbed semi-elemental or short-peptide low-residue formulation. [15]

17 5.0 Reasoning Require prophylactic high-dose IV thiamine (vitamin B1) supplementation before refeeding after prolonged catabolic depletion to prevent refeeding syndrome.

18 5.0 Reasoning Warn that enteral nutrition at norepinephrine ≥ 0.5 µg/kg/min carries a high risk of non-occlusive mesenteric ischemia (NOMI).

19 2.0 Reasoning Mandate that abdominal double-lumen drain management use only continuous low-negative-pressure suction (e.g., −10 to −20 cmH₂O); high-pressure irrigation is strictly prohibited.

20 5.0 Reasoning Require IV infusion of 20% or 25% concentrated human albumin to leverage hyperosmotic colloid oncotic pressure for mobilizing third-space fluid ("autologous volume resuscitation").

21 2.0 Reasoning Recommend continuous low-dose vasopressin (pitressin) infusion to augment blood pressure while simultaneously constricting the splanchnic vascular bed and reducing portal pressure.

22 2.0 Style Suggest assessment of extravascular lung water (EVLW) or venous congestion via transpulmonary thermodilution (PiCCO) or ultrasound VExUS scoring to precisely guide volume removal.

23 2.0 Reasoning Recommend early initiation of continuous renal replacement therapy (CRRT) with an initial net ultrafiltration target of 100–200 mL/h under vasopressor support.

24 −3.0 Critical Negative Recommending Abatacept administration before or during TPE.

25 −3.0 Critical Negative Using pure albumin as TPE replacement fluid in a patient with active hemorrhage and consumptive coagulopathy (PLT 12).

26 −3.0 Critical Negative Selecting regional citrate anticoagulation (RCA) as the first-line anticoagulation strategy for blood purification in a patient with severe hepatic impairment (TBil 65) and high-lactate shock.

27 −3.0 Critical Negative Initiating a second TPE cycle within <24 hours of Abatacept administration (e.g., at 12–18 hours), before adequate drug exposure is achieved.

28 −3.0 Critical Negative Administering any IV lipid emulsion during severe hypertriglyceridemia (TG 7.2 mmol/L).

29 −3.0 Critical Negative Using hydroxyethyl starch (HES) for fluid resuscitation in acute kidney injury with severe capillary leak.

This case exemplifies the core challenge of clinical multidisciplinary reasoning: the pharmacokinetic conflict between therapeutic plasma exchange and biologic agent administration, the surgical contraindications imposed by consumptive coagulopathy (PLT 12×10⁹/L), and the nutritional sequencing dilemma under high-output pancreatic fistula, all operating under simultaneous time pressure with interdependent decision nodes. The Critical Negatives are particularly diagnostic: recommending Abatacept before TPE completion, or administering IV lipid emulsion at TG 7.2 mmol/L, represent errors that a junior resident might plausibly and confidently commit but that carry potentially fatal hemodynamic consequences.
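The 63%–78% removal range in rubric 2 is consistent with the standard one-compartment approximation for plasma exchange, in which the fraction of a purely intravascular solute removed after exchanging x plasma volumes is 1 − e^(−x). The post does not state how the range was derived, so linking it to this formula, and to sessions of 1.0 to 1.5 plasma volumes, is an assumption; the function name `fraction_removed` is hypothetical. The sketch only illustrates why dosing before or during TPE (Critical Negative 24) wastes most of the drug.

```python
import math


def fraction_removed(plasma_volumes_exchanged: float) -> float:
    """Fraction of an intravascular solute cleared in one TPE session.

    One-compartment approximation: ignores redistribution from tissue and
    any additional third-space loss (for example from capillary leak).
    """
    return 1.0 - math.exp(-plasma_volumes_exchanged)


for volumes in (1.0, 1.5):
    print(f"{volumes} plasma volumes: {fraction_removed(volumes):.0%} removed")
```

Because Abatacept has a very small volume of distribution (rubric 1), the intravascular assumption is a reasonable first approximation for it, which is why the rubric places the dose after TPE completion.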

02 · Finance: Distressed Asset Analysis

Vanke's (000002.SZ) USD bonds have dropped to 42 cents. S&P downgraded the rating from B− to CCC+ last month with a negative outlook: "further downgrade to selective default (SD) is only a matter of time." Last week our AMC special opportunities desk acquired $50M face value of the 2027 4.25% USD bonds at an average price of 42 cents, cost $21M, the largest single bet of our desk this year. Debt structure: ~RMB 320B total interest-bearing debt. Onshore bank loans ~180B (65% or ~117B secured with land/construction mortgages; 35% or ~63B unsecured, pari passu with bonds). Onshore public bonds ~68B (all unsecured). Offshore USD bonds ~42B equivalent (unsecured, issued by Cayman entity). Revenue 375B but net loss of 19.8B, the first annual loss in 30 years of listing; operating cash flow negative 8.5B. Our valuation team's project-by-project reappraisal: Shenzhen projects (18%, 75.6B book) at 70% = 52.9B; Tier-1/2 non-Shenzhen (35%, 147B) at 60% = 88.2B; Tier-3/4 (47%, 197.4B) at 45% = 88.8B; total revaluation ~229.9B vs. 420B book (45% haircut). Cross-default is our largest risk variable. Tasks: (1) Calculate unsecured creditor recovery rates under "orderly restructuring" vs. "disorderly liquidation" scenarios; (2) Assess whether the December-maturing "21 Vanke 04" (4.5B) will be repaid and cross-default impact; (3) Evaluate three Shenzhen SOE intervention modes (equity injection, guarantee, coordinated extension) and their impact on different creditor tiers; (4) Answer whether 42 cents provides adequate margin of safety, IRR ≥ 15% conditions, position-sizing triggers and stop-loss signals.

Rubrics (35)

Columns: # · Weight · Type · Description · Deps (prerequisite rubric numbers, in square brackets)

1 1.0 Extraction Report that Vanke's total interest-bearing debt is approximately RMB 320B (onshore bank loans ~180B + onshore bonds ~68B + offshore USD bonds ~42B + other ~30B).

2 1.0 Extraction Report that Vanke's under-construction and completed inventory has a book value of ~420B, with market revaluation at approximately 240B–280B.

3 4.0 Reasoning Calculate secured creditor claims at ~117B (180B bank loans × 65%). [1]

4 4.0 Reasoning Under "orderly restructuring," compute distributable assets for unsecured creditors = revaluation − secured claims ≈ 240B–280B − 117B = 123B–163B. [2, 3]

5 3.0 Reasoning Total unsecured claims ~173B (unsecured bank loans 63B + onshore bonds 68B + offshore USD bonds 42B) — note the 63B unsecured loans are pari passu. [1]

6 5.0 Reasoning Orderly restructuring recovery rate = (229.9B − 117B) / 173B ≈ 65% (internal revaluation) or (260B − 117B) / 173B ≈ 83% (sell-side median). [4, 5]

7 5.0 Reasoning Under "disorderly liquidation," estimate a significantly lower recovery rate (deeper inventory liquidation discount at 40%–50%, yielding only 30%–50% recovery). [2, 5]

8 2.0 Extraction Identify that offshore USD bondholders face "structural subordination" — the Cayman issuer is 2–3 legal layers removed from onshore assets, imposing 6–12 months of additional recovery time.

9 6.0 Reasoning Structural subordination discount: offshore USD bonds at 85%–90% of the 65% base recovery → actual recovery ~55–59 cents — 42 cents offers ~13–17 cents (31%–40%) margin of safety. [6, 7, 8]

10 2.0 Extraction Identify "21 Vanke 04" (RMB 4.5B, maturing December 2025) as the nearest trigger — non-payment would constitute an onshore bond default, satisfying the cross-default clause ("financial indebtedness unpaid above USD 100M equivalent").

11 3.0 Extraction An onshore bond default would trigger cross-default acceleration of all offshore USD bonds (~42B, ~USD 5.8B), entitling all USD bondholders to demand immediate repayment. [10]

12 4.0 Reasoning Quantify post-cross-default price impact on USD bonds (e.g., from 42 to 25–30 cents). [11]

13 3.0 Extraction Shenzhen Metro (~28% stake) would most likely intervene via credit guarantee for onshore bonds (rather than direct equity injection) — a guarantee does not constitute state-asset leakage and resolves the near-term liquidity crisis through maturity extension.

14 4.0 Reasoning SOE credit guarantee mainly benefits onshore creditors (recovery 65% → 80%–90%), limited benefit to offshore USD bonds (55–59 → 60–65 cents) — structural subordination not eliminated by guarantee. [13]

15 6.0 Reasoning Clearly answer whether 42 cents provides adequate margin of safety.

16 3.0 Reasoning Define position-sizing triggers: add to position if USD bonds fall below 35 cents and Shenzhen Metro signals explicit intervention (e.g., guarantee announcement), increasing to USD 80M face value; stop-loss: liquidate if onshore default occurs with no SOE rescue and price breaches 25 cents. [15]

17 1.0 Extraction Report that Vanke posted a net loss of approximately RMB 19.8B in 2024 — the first annual loss since listing.

18 1.0 Extraction Report that onshore non-standard debt has been overdue by more than 90 days (RMB 3.8B) but has not yet triggered an onshore bond default event.

19 1.0 Extraction Report that the offshore USD bonds (2027 maturity, 4.25% coupon) currently trade in the 35–42 cents range.

20 5.0 Reasoning Calculate P&L for the USD 21M position under different recovery rate scenarios. [6, 7, 9]

21 3.0 Extraction Provide city-tier differentiated inventory discount: Shenzhen projects (18%, 75.6B book) at 70% = 52.9B; Tier-1/2 non-Shenzhen (35%, 147B) at 60% = 88.2B; Tier-3/4 (47%, 197.4B) at 45% = 88.8B — total revaluation ~229.9B.

22 1.0 Extraction Report that onshore bonds (e.g., "20 Vanke 06") trade at approximately 72–78 cents on the dollar.

23 2.0 Extraction Identify that the S&P downgrade to CCC+ triggers a triple negative feedback loop: (1) banks withdraw or refuse to roll over credit, accelerating cash burn; (2) asset buyers exploit distress to bid below revaluation; (3) refinancing costs spike, shutting all capital market access.

24 2.0 Style Provide a recovery rate calculation model in the appendix.

25 3.0 Reasoning Position recommendation: hold at 42 cents (expected recovery 55–59 cents, 31%–40% return); add below 35 if SOE signals emerge; stop-loss at 25 cents if onshore default + no bailout. [15, 16]

26 −5.0 Critical Negative Ignoring cross-default clause — treating onshore and offshore bond defaults as independent events.

27 −5.0 Critical Negative Ignoring structural subordination — assigning identical recovery rates to offshore and onshore unsecured creditors.

28 −5.0 Critical Negative Using book value (420B) instead of market revaluation (240B–280B) to compute recovery rates.

29 4.0 Reasoning Margin of safety at 42 cents: expected recovery 55–59 cents vs. 42-cent entry yields ~13–17 cents profit (31%–40% return) — sufficient to absorb structural subordination discount and time-value erosion. [9]

30 2.0 Extraction Forming an ad hoc committee requires assembling at least 25%–33% of USD bondholders (~10.5B–13.9B face value) to obtain an effective blocking position against restructuring proposals.

31 2.0 Extraction Filing a winding-up petition in the Cayman court can serve as a negotiation pressure tool to accelerate SOE intervention — but risks triggering disorderly liquidation, making it a double-edged sword.

32 3.0 Reasoning Offshore USD bonds (42B) represent only 24.3% of total unsecured claims (173B) — the pari passu pro-rata share is inherently limited, further compressed by structural subordination. [5]

33 3.0 Reasoning Operating cash flow of negative 8.5B implies ~700M monthly cash burn — without external capital injection, remaining cash reserves (~20B) can sustain operations for approximately 28 months.

34 −4.0 Critical Negative Excluding 63B unsecured bank loans from the unsecured creditor pool — using only 68B + 42B = 110B instead of the correct 173B.

35 −3.0 Critical Negative Failing to consider homebuyer pre-sale priority — under "Bao Jiao Lou" policy, homebuyers may rank ahead of all financial creditors.

The finance case demands simultaneous mastery of credit analysis, legal capital structure, and portfolio risk management: computing recovery rates under competing restructuring scenarios, modeling the structural subordination discount applicable to Cayman-issued offshore bonds, evaluating differential SOE intervention modes and their tier-specific creditor implications, and synthesizing these into actionable position-sizing and stop-loss recommendations. The Critical Negatives are especially revealing: ignoring cross-default propagation mechanics, substituting book value for market revaluation in recovery analysis, or treating offshore and onshore unsecured creditors as symmetric are precisely the errors that a model fluent in financial language would generate with high confidence, and that a trained analyst would immediately flag as disqualifying.
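The recovery arithmetic rewarded by rubrics 3, 5, 6, and 9 can be reproduced in a few lines. The sketch below is illustrative only: every figure comes from the case text and rubrics above (RMB billions), while the function name `recovery_rate` and the variable names are hypothetical. The last lines show why Critical Negatives 28 and 34 are penalized: both shortcuts push the implied recovery above 100%.

```python
# All amounts in RMB billions, taken from the case text and rubrics.
PROJECTS = [
    ("Shenzhen", 75.6, 0.70),
    ("Tier-1/2 non-Shenzhen", 147.0, 0.60),
    ("Tier-3/4", 197.4, 0.45),
]
BOOK_VALUE = 420.0
SECURED = 180.0 * 0.65               # secured bank loans, ~117B (rubric 3)
UNSECURED_POOL = 63.0 + 68.0 + 42.0  # unsecured loans + onshore bonds + USD bonds


def recovery_rate(asset_value: float, secured: float, unsecured: float) -> float:
    """Asset value left after secured claims, per unit of unsecured claims."""
    return (asset_value - secured) / unsecured


revaluation = sum(book * rate for _, book, rate in PROJECTS)
base = recovery_rate(revaluation, SECURED, UNSECURED_POOL)

# Structural subordination: offshore bonds recover 85%-90% of the base rate.
offshore_low, offshore_high = 0.85 * base, 0.90 * base

print(f"revaluation: {revaluation:.2f}B")
print(f"base unsecured recovery: {base:.1%}")
print(f"offshore recovery: {offshore_low:.1%} to {offshore_high:.1%} (entry at 42 cents)")

# Shortcuts flagged as Critical Negatives inflate recovery past 100%.
book_value_error = recovery_rate(BOOK_VALUE, SECURED, UNSECURED_POOL)
narrow_pool_error = recovery_rate(revaluation, SECURED, 68.0 + 42.0)
print(f"book-value error: {book_value_error:.0%}; 110B-pool error: {narrow_pool_error:.0%}")
```

Using the unrounded revaluation, the offshore range lands at roughly 55 to 59 cents on the dollar, in line with rubric 9.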

03 · Law: Cross-border Labor Dispute

TechGlobal Inc., a Delaware-incorporated tech company, established a wholly-owned subsidiary, Shanghai TaiKe Technology Co., Ltd., in Pudong New Area, Shanghai in 2018. Shanghai TaiKe has independent legal personality, with registered capital of $5M. Mr. Li joined Shanghai TaiKe in March 2019 as General Manager and legal representative, signing an open-ended employment contract at RMB 80,000/month. He reports to TechGlobal Inc.'s APAC VP. In August 2024, an anonymous tip alleged Mr. Li received supplier kickbacks. Internal investigation confirmed: between Jan 2023 and Jun 2024, Mr. Li received RMB 350,000 in procurement kickbacks deposited to his personal account. On October 15, 2024, TechGlobal Inc.'s headquarters directly issued a termination notice to Mr. Li, signed by the APAC VP with TechGlobal Inc.'s seal, citing "serious violation of company rules and damage to company interests." Mr. Li disputes the termination: (1) his contract is with Shanghai TaiKe, so the parent company has no authority to terminate; (2) insufficient evidence of kickbacks; (3) even if kickbacks occurred, proper legal procedures were not followed. Required analysis: (1) Employment relationship determination: legal relationship between Mr. Li, Shanghai TaiKe, and TechGlobal Inc.; (2) Labor arbitration feasibility: proper respondent, jurisdiction, and calculation of illegal termination compensation; (3) Criminal risk assessment: whether accepting RMB 350K in kickbacks constitutes "bribery by non-state personnel" and potential penalties; (4) Criminal complaint feasibility: whether Mr. Li can file criminal charges against TechGlobal or Shanghai TaiKe.

Rubrics (35)

Columns: # · Weight · Type · Description · Deps (prerequisite rubric numbers, in square brackets)

1 3.0 Extraction Identify that Shanghai TaiKe is a legally established subsidiary with independent legal personality.

2 4.0 Reasoning Establish that a subsidiary and parent are two independent legal entities, each bearing independent civil liability. [1]

3 3.0 Extraction Confirm that Mr. Li signed an employment contract with Shanghai TaiKe, establishing an employment relationship between them.

4 4.0 Extraction Identify that Mr. Li's employer is Shanghai TaiKe, not TechGlobal Inc. [3]

5 6.0 Reasoning Conclude that TechGlobal Inc., as the parent company, is not Mr. Li's employer and has no authority to directly terminate his employment contract. [2, 4]

6 5.0 Reasoning Confirm that TechGlobal Inc.'s direct termination of Mr. Li's employment contract constitutes unlawful termination. [5]

7 5.0 Reasoning Determine that the proper respondent in labor arbitration should be Shanghai TaiKe, not TechGlobal Inc. [4]

8 3.0 Extraction State that labor arbitration jurisdiction lies with the labor dispute arbitration commission at the place of contract performance or the employer's domicile.

9 4.0 Reasoning Determine that the competent arbitration body is the Shanghai Pudong New Area Labor and Personnel Dispute Arbitration Commission. [8]

10 4.0 Extraction State that Mr. Li may claim compensation for unlawful termination of the employment contract. [6]

11 5.0 Reasoning Calculate illegal termination compensation at 2× the economic compensation standard. [10]

12 5.0 Reasoning Calculate the economic compensation as RMB 480,000 (monthly salary 80K × 6 years of service).

13 6.0 Reasoning Calculate total illegal termination compensation: RMB 960,000 (economic compensation 80K × 6 years × 2). [11, 12]

14 5.0 Reasoning Identify that accepting 350K in kickbacks may constitute "bribery by non-state personnel" (非国家工作人员受贿罪).

15 3.0 Extraction State that the subject of bribery by non-state personnel is an employee of a company, enterprise, or other entity.

16 4.0 Reasoning Confirm that Mr. Li, as General Manager of Shanghai TaiKe, satisfies the subject element of bribery by non-state personnel. [15]

17 3.0 Extraction State that the prosecution threshold for bribery by non-state personnel is RMB 30,000 or more.

18 4.0 Reasoning Confirm that RMB 350K exceeds the prosecution threshold (30K), warranting criminal liability. [17]

19 4.0 Extraction State the sentencing range for "relatively large amount" (60K–1M) in bribery by non-state personnel: up to 3 years imprisonment or criminal detention, plus a fine.

20 3.0 Extraction State the sentencing range for "especially large amount" (1M or above) in bribery by non-state personnel: 3–10 years imprisonment, plus a fine.

21 5.0 Reasoning Determine that RMB 350K falls under "relatively large amount" (60K–1M) — potential sentence: ≤ 3 years imprisonment or detention with fine. [19, 20]

22 4.0 Reasoning Correctly conclude that the kickback behavior does not constitute embezzlement (职务侵占罪).

23 4.0 Reasoning Analyze whether Mr. Li can file a criminal complaint against TechGlobal Inc. or Shanghai TaiKe, and reach a negative conclusion.

24 3.0 Extraction State that the statute of limitations for labor arbitration is one year, commencing from the date the rights holder knew or should have known of the infringement.

25 3.0 Reasoning Confirm that Mr. Li's labor arbitration claim has not exceeded the limitation period. [24]

26 4.0 Reasoning Advise strategic coordination between labor arbitration and criminal defense — exercise caution when discussing kickback facts in arbitration proceedings.

27 3.0 Reasoning Recommend that Mr. Li seek to hold Shanghai TaiKe liable for unlawful termination compensation in labor arbitration. [7, 13]

28 3.0 Reasoning Advise Mr. Li of the criminal prosecution risk and recommend pursuing mitigating factors in criminal proceedings (e.g., voluntary return of illicit gains, guilty plea and acceptance of penalty). [14, 18]

29 1.0 Style The legal opinion is structured with three main sections: "Case Facts," "Legal Analysis," and "Conclusions and Recommendations."

30 1.0 Style Each legal opinion is supported by the corresponding statutory provisions.

31 −5.0 Critical Negative Erroneously concluding that TechGlobal Inc. can directly terminate Mr. Li's employment contract.

32 −5.0 Critical Negative Erroneously listing TechGlobal Inc. as the respondent in labor arbitration.

33 −5.0 Critical Negative Erroneously concluding that accepting kickbacks constitutes embezzlement (职务侵占罪).

34 −5.0 Critical Negative Erroneously classifying RMB 350K as "especially large amount" instead of "relatively large amount."

35 −5.0 Critical Negative Erroneously concluding that Mr. Li can file a criminal complaint against the company.
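The arithmetic behind rubric items 11–13 and 17–21 can be sketched in a few lines of Python. The salary, dates, and amount bands come from the case and the rubric above; the function names, the day of month assumed for the start date, and the rounding rule (a remainder of six months or more counts as a full year of service) are assumptions made for illustration. The sketch mirrors the rubric's figures and does not model any statutory cap.

```python
from datetime import date

MONTHLY_SALARY_RMB = 80_000    # from the case facts
START = date(2019, 3, 1)       # "March 2019"; the day of month is an assumption
END = date(2024, 10, 15)       # date of the termination notice


def service_years(start: date, end: date) -> float:
    """Years of service for severance purposes.

    Assumed rounding: a remainder of six months or more counts as a
    full year, a shorter remainder counts as half a year.
    """
    months = (end.year - start.year) * 12 + (end.month - start.month)
    if end.day < start.day:
        months -= 1
    full, rem = divmod(months, 12)
    if rem == 0:
        return float(full)
    return full + (1.0 if rem >= 6 else 0.5)


def amount_tier(amount_rmb: int):
    """Map an amount to the bands quoted in rubric items 19 and 20.

    Returns None for amounts below the bands the rubric quotes.
    """
    if amount_rmb >= 1_000_000:
        return "especially large"
    if amount_rmb >= 60_000:
        return "relatively large"
    return None


years = service_years(START, END)                 # 5 years 7 months rounds to 6
economic_compensation = MONTHLY_SALARY_RMB * years
unlawful_termination_compensation = 2 * economic_compensation

print(years, economic_compensation, unlawful_termination_compensation)
print(amount_tier(350_000), 350_000 >= 30_000)    # tier, and threshold check
```

Under these assumptions the sketch reproduces the rubric's RMB 480,000 and RMB 960,000 figures and places RMB 350,000 in the "relatively large amount" band.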

This case probes the intersection of corporate law, employment law, and criminal law within a single integrated fact pattern. The rubric structure is particularly instructive: the parent/subsidiary legal independence determination cascades into arbitration respondent identification, which in turn cascades into compensation quantum calculation, forming a multi-level reasoning chain where an error at the root invalidates the entire downstream analysis. Meanwhile, the criminal liability assessment operates on a parallel inferential track where misclassifying the offense type (embezzlement vs. bribery by non-state personnel) or the statutory amount tier produces categorically erroneous sentencing outcomes with no self-correcting mechanism. Each of these characteristic failure modes is independently encoded as a Critical Negative with substantial penalty weight.
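To make the cascade concrete, the Deps column of the rubric can be read as a directed graph and walked from any root item. The following is a minimal sketch: the dependency lists are copied from the rubric table, while the `downstream` helper and the idea of treating every transitive dependent as "at risk" are illustrative assumptions, not a description of how ExpertEval's scorer actually consumes the Deps field.

```python
from collections import defaultdict, deque

# Dependency lists copied from the rubric's Deps column (item -> prerequisites).
DEPS = {
    2: [1], 4: [3], 5: [2, 4], 6: [5], 7: [4], 9: [8], 10: [6],
    11: [10], 13: [11, 12], 16: [15], 18: [17], 21: [19, 20],
    25: [24], 27: [7, 13], 28: [14, 18],
}


def downstream(root: int, deps: dict = DEPS) -> set:
    """Return every rubric item that depends, directly or transitively, on root."""
    children = defaultdict(list)
    for item, parents in deps.items():
        for parent in parents:
            children[parent].append(item)

    seen = set()
    queue = deque([root])
    while queue:                      # breadth-first walk over dependents
        node = queue.popleft()
        for child in children[node]:
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


# Misidentifying the employer (item 4) puts the whole labor-law chain at risk.
print(sorted(downstream(4)))   # [5, 6, 7, 10, 11, 13, 27]
# The criminal-law track is a separate chain.
print(sorted(downstream(17)))  # [18, 28]
```

The two walks never intersect, which matches the description of the labor and criminal analyses as parallel inferential tracks.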

What This Enables

Three methodological implications of expert-authored evaluation infrastructure for the broader field

Diagnostic Benchmarking at Professional Grade

Prevailing benchmarks measure surface-level task completion, collapsing the internal structure of professional reasoning into aggregate accuracy scores that obscure the distinction between genuine competence and superficial fluency. Our rubric architecture, with differentially weighted criteria and Critical Negatives, enables evaluation that diagnostically separates a model producing a plausible-sounding answer from one that reasons through the correct causal chain with appropriate failure-mode awareness. This is the difference between a system that passes a standardized medical examination and one to which a practicing clinician would delegate patient care.
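A small sketch illustrates the point, using six items taken from the legal rubric shown earlier. The aggregation rule here (sum the weights of triggered items, floor at zero, divide by the maximum positive weight) is an assumption for illustration, since the scoring formula is not spelled out in this post, and the two answers are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    number: int
    weight: float  # negative weights are Critical Negatives
    label: str


# Subset of the legal rubric; numbers and weights match the table.
SUBSET = [
    Criterion(5, 6.0, "parent has no authority to terminate"),
    Criterion(7, 5.0, "respondent is Shanghai TaiKe"),
    Criterion(13, 6.0, "compensation totals RMB 960,000"),
    Criterion(21, 5.0, "RMB 350K is a 'relatively large amount'"),
    Criterion(32, -5.0, "lists TechGlobal Inc. as respondent"),
    Criterion(34, -5.0, "classifies 350K as 'especially large amount'"),
]


def hit_rate(triggered: set) -> float:
    """Unweighted share of positive criteria met (aggregate-accuracy view)."""
    positives = [c for c in SUBSET if c.weight > 0]
    return sum(c.number in triggered for c in positives) / len(positives)


def rubric_score(triggered: set) -> float:
    """Weighted score with Critical Negative penalties (assumed rule)."""
    max_score = sum(c.weight for c in SUBSET if c.weight > 0)
    raw = sum(c.weight for c in SUBSET if c.number in triggered)
    return max(0.0, raw) / max_score


# Hypothetical answer X names the wrong respondent (misses 7, triggers 32).
answer_x = {5, 13, 21, 32}
# Hypothetical answer Z omits the amount tier but commits no critical error.
answer_z = {5, 7, 13}

for name, answer in (("X", answer_x), ("Z", answer_z)):
    print(name, f"{hit_rate(answer):.2f}", f"{rubric_score(answer):.2f}")
```

Both answers meet three of the four positive criteria, so an unweighted hit rate cannot tell them apart; under the assumed rule, the weighted score separates the answer that commits a characteristic professional error from the one that merely leaves something out.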

Expert Signal as Training Data

Our experiments establish that rubric-annotated expert data functions not merely as an evaluative instrument but as a generative training signal: supervised fine-tuning on this corpus produces measurable and consistent gains not only on our own held-out test distribution but on entirely independent evaluation benchmarks constructed by separate research groups. The structured reasoning patterns encoded in expert rubrics (dependency awareness, failure-mode avoidance, source-grounded justification) appear to transfer as a domain-general capability improvement, suggesting that expert annotation quality may be a more efficient lever for capability gain than data volume or model scale.

Toward Domain-Specific Alignment

Generic RLHF and instruction-tuning paradigms optimize for user satisfaction, a useful but fundamentally imprecise proxy for the kind of professional correctness that high-stakes domains require. Our dataset opens a methodological path toward domain-specific alignment, where the reward signal is defined not by aggregate human preference ratings but by the structured, verifiable judgment of credentialed practitioners operating under the professional standards and liability frameworks that govern real-world practice.

Built by Domain Experts, for Domain Intelligence

ExpertEval represents a deliberate methodological departure from the scale-driven paradigm that dominates contemporary benchmark design. Rather than pursuing breadth through automated generation or crowd-sourced annotation, we invest in depth through expert authorship, producing 3,213 cases where every query and every rubric criterion reflects the professional judgment of a credentialed practitioner with verifiable domain expertise.

The empirical results validate the hypothesis that motivates this design philosophy: a 35B sparse mixture-of-experts model, trained on our expert-annotated corpus, achieves state-of-the-art performance across held-out expert evaluation, cross-benchmark generalization, and deep research tasks — consistently outperforming frontier systems with orders of magnitude more parameters and substantially larger training budgets. We interpret this as convergent evidence that the quality of training signal, rather than the quantity of training data or the scale of model parameters, constitutes the binding constraint on professional-grade AI capability.

3,213 Cases · 207 Scenarios · 21.83 Avg Rubrics

Contact

For questions, collaborations, or access requests, reach us at:
contact@unipat.ai
