AI-AgriBench uses an LLM-as-a-Judge evaluation pipeline: subject models generate answers to
agricultural questions, and specialized judge models score those answers along four metrics from 0–100.
The pipeline supports multiprocessing, checkpointing, and resumable outputs for large-scale runs.
How Scoring Works
For each question, we distinguish between subject models (the models under test) and
judge models (LLMs that score their answers). The judge sees:
- The user’s agricultural question
- The gold (expert) answer from AI-AgriBench
- The subject model’s response
- A detailed scoring rubric encoded in the prompt
The judge must respond with JSON only, containing four scores between 0 and 100:
accuracy, relevance, completeness, and
conciseness.
The evaluation pipeline is designed around simple JSON/JSONL interfaces:
- Input. A JSON/JSONL file with at least: id, question, gold_answer (or self_answer), and one field per subject model (e.g., gpt-4o-mini, qwen2.5-72b) containing that model's response.
- Processing. The pipeline flattens this structure into per-model evaluation items and dispatches them to one or more judge backends with multiprocessing and robust JSON parsing.
- Output. A JSONL file where each line corresponds to a single (id, subject_model, judge_model) triple and includes: id, question, gold_answer, subject_model, model_response, judge_model, the four scores, and metadata (timestamps, raw judge output, etc.).
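The flattening and resume logic can be sketched as follows. This is a minimal illustration, not the pipeline's actual implementation; the function names (`flatten_items`, `completed_triples`) are hypothetical, but the field names follow the input/output interface described above.

```python
import json
from pathlib import Path

def flatten_items(records, subject_models):
    """Turn one record per question into per-model evaluation items.
    `records` are parsed JSON objects from the input file; field names
    follow the interface described above."""
    for rec in records:
        for model in subject_models:
            if model in rec:
                yield {
                    "id": rec["id"],
                    "question": rec["question"],
                    "gold_answer": rec.get("gold_answer") or rec.get("self_answer"),
                    "subject_model": model,
                    "model_response": rec[model],
                }

def completed_triples(output_path):
    """Scan an existing output JSONL and collect finished
    (id, subject_model, judge_model) triples, so an interrupted run can
    resume and skip work that is already checkpointed."""
    done = set()
    path = Path(output_path)
    if path.exists():
        for line in path.read_text().splitlines():
            if line.strip():
                row = json.loads(line)
                done.add((row["id"], row["subject_model"], row["judge_model"]))
    return done
```

Because the output file is append-only JSONL, resuming is just a matter of filtering the flattened items against `completed_triples` before dispatching to the judges.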
Each answer is scored on four metrics, with a detailed rubric embedded in the judge prompt (the full prompt is shown below):
- Accuracy. Alignment with expert consensus and the gold answer. This includes correct terminology (disease/pest names, nutrient forms), factual correctness of diagnostic conclusions, and appropriateness of management recommendations. Completely correct, expert-aligned answers score 100; severely incorrect or misleading answers score near 0.
- Relevance. How well the answer stays on topic and addresses the user's agricultural question. Answers that drift into unrelated agronomy, ignore the main decision, or miss critical points are penalized.
- Completeness. Whether the answer covers the key steps, caveats, and conditions needed for a farmer or advisor to act safely and effectively, rather than giving partial or fragmentary advice.
- Conciseness. Whether the answer is focused, avoids unnecessary digressions, and communicates the required information efficiently.
You are now required to rate a model's response to an agriculture-related question.
Based on the gold answer and the user's question, you need to score the model's answer according to the following four scoring criteria.
<User Query>{user_query}</User Query>
<Gold Answer>{gold_answer}</Gold Answer>
<Model Response>{model_response}</Model Response>
{Score Criteria}
Accuracy Definition: Accuracy evaluates whether the agricultural facts, species identification, diagnostic conclusions, and management recommendations provided by the model align with the expert's response. Emphasis is placed on: 1. Correctness of professional terminology (e.g., precise naming of diseases, pests, or invasive species). 2. Accuracy of key details (e.g., descriptions of lesion characteristics, pest behaviors, or plant symptoms). 3. Logical coherence in describing causal relationships (e.g., disease transmission pathways, pest infestation mechanisms). 4. Appropriateness and effectiveness of the proposed management strategies or interventions.
- 100 points: All agricultural facts, terminologies, diagnostic conclusions, and management recommendations are completely correct, comprehensive, and fully aligned with expert consensus.
- 75 points: Minor inaccuracies or omissions in terminology, descriptive details, or management advice exist, but the core diagnostic conclusions and recommended management practices remain accurate and effective.
- 50 points: Noticeable factual errors, misidentifications (species/disease/pests), or suboptimal management suggestions. However, the response still demonstrates partial accuracy or correctness in key aspects.
- 25 points: Major inaccuracies, such as significant confusion between diseases, pests, or plants, flawed causal logic, or incorrect management practices that could lead to ineffective or detrimental outcomes.
- 0 points: Entirely incorrect, scientifically invalid, or significantly misleading claims without any alignment with expert consensus.
Relevance Definition: This measures how closely the model's response matches the scope and focus of expert answers, ensuring it stays on-topic and avoids tangential information. Responses that digress into unrelated agricultural knowledge or overlook critical points tied to the user's query are considered less relevant.
- 100 points: The response perfectly mirrors the expert answer and directly addresses the query, using precise terminology and only including question-relevant information.
- 75 points: The answer is mostly aligned with the expert response and user query, with only minor tangents or slight omissions in details.
- 50 points: The response contains noticeable deviations or omissions compared to the expert answer, with several off-topic or less relevant points.
- 25 points: Significant misalignment with the expert answer and the query is evident. The response includes major irrelevant or incorrect content.
- 0 points: The answer is entirely off-topic, failing to reflect the expert response or address the user query.
Completeness Definition: Whether the model's answer covers all key information points mentioned in expert answers to fully address the user's inquiry. If the model omits critical steps or precautions highlighted in expert answers, it is deemed incomplete. Emphasis is placed on: 1. Professional Terminology: Uses precise terms (e.g., names of diseases, pests, invasive species). 2. Key Details: Includes comprehensive descriptions (e.g., lesion characteristics, pest behaviors, plant symptoms). 3. Logical Causal Relationships: Fully explains connections (e.g., disease transmission, pest infestation mechanisms). 4. Management Recommendations: Details all necessary strategies and precautions.
- 100 points: Covers all key points from the gold answer.
- 75 points: Misses 1-2 minor details but addresses core aspects.
- 50 points: The response contains noticeable deviations or omissions compared to the expert answer.
- 25 points: Omits a major component (e.g., management recommendations).
- 0 points: Fails to address any key elements of the query.
Conciseness Definition: Whether the answer provides actionable guidance that directly addresses the user's core needs, delivering a concise and unambiguous conclusion and specific recommendations without extraneous technical details. The response should adhere to Occam's Razor by avoiding unnecessary complexity and focusing only on what is essential for understanding whether intervention is necessary and what exact steps (if any) need to be taken.
- 100 points: The answer is succinct, clear, and directly addresses the user's concerns. It offers straightforward, practical guidance that is fully aligned with the visible evidence without any unnecessary details. It embodies the principle of Occam's Razor.
- 75 points: The answer is generally concise and practical, offering useful advice. However, it may include some extraneous details or slight ambiguity that only minimally detracts from its overall clarity and directness.
- 50 points: The answer contains relevant information but is overly theoretical or detailed. Extra technical content obscures the key actionable recommendations, making the response less concise and direct.
- 25 points: The answer is largely indirect or abstract, with a significant amount of unnecessary information. The lack of clarity in actionable guidance leaves the user uncertain about whether any intervention is needed.
- 0 points: The answer fails to provide practical or actionable recommendations and is cluttered with superfluous details, completely missing the concise, straightforward approach required by Occam's Razor.
Please only output the scores without any other content. You should output JSON with four keys: accuracy, relevance, completeness, and conciseness. An example is shown below:
{ "accuracy": 75, "relevance": 50, "completeness": 75, "conciseness": 50 }
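Although judges are instructed to output JSON only, in practice replies occasionally arrive wrapped in prose or a code fence, which is why the pipeline needs robust JSON parsing. A minimal sketch of such a parser is below; `parse_judge_scores` is an illustrative name, not the pipeline's actual function, and the clamping behavior is an assumption about how out-of-range scores might be handled.

```python
import json
import re

METRICS = ("accuracy", "relevance", "completeness", "conciseness")

def parse_judge_scores(raw_text):
    """Extract the four 0-100 scores from a judge reply. Tolerates replies
    that wrap the JSON object in prose or a Markdown code fence.
    Returns None when no usable JSON is found."""
    match = re.search(r"\{.*?\}", raw_text, flags=re.DOTALL)
    if not match:
        return None
    try:
        data = json.loads(match.group(0))
    except json.JSONDecodeError:
        return None
    scores = {}
    for key in METRICS:
        if key not in data:
            return None
        # Clamp to the documented 0-100 range in case a judge drifts outside it.
        scores[key] = max(0, min(100, int(data[key])))
    return scores
```

Items whose judge replies cannot be parsed can then be retried or logged with the raw output preserved in the metadata.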
Subject Models
AI-AgriBench evaluates a diverse set of language models and systems to assess performance on agricultural
question answering. The benchmark includes both direct model evaluations and RAG-based systems that combine
retrieval with language models.
These models are evaluated directly on the benchmark questions without additional retrieval or context augmentation:
- gemini-3-pro-preview
- gemini-2.5-flash
- kimi-k2-thinking
- gpt-5.1
- gpt-5-mini
- GPT-4o
- GPT-4o-mini
- Claude 4.1 Opus
- Claude 3.7 Sonnet
- deepseek/deepseek-v3
- qwen/qwen2.5-72b-instruct
- mistral/mistral-large-2411
CropWizard is our retrieval-augmented generation (RAG) pipeline that combines document retrieval from the
CropWizard corpus with LLMs to generate answers. In our current evaluations, CropWizard uses:
These systems are evaluated on the same questions as direct-chatbot baselines, making it possible to compare
retrieval-augmented and non-retrieval setups under identical judging conditions.
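Given the per-(id, subject_model, judge_model) output rows, comparing retrieval-augmented and direct setups reduces to averaging each metric per subject model across questions and judges. A minimal sketch, assuming rows shaped like the output JSONL described earlier (the function name is illustrative):

```python
from collections import defaultdict

METRICS = ("accuracy", "relevance", "completeness", "conciseness")

def mean_scores_by_model(rows):
    """Average each metric per subject model across all
    (question, judge) pairs in the parsed output JSONL rows."""
    sums = defaultdict(lambda: {m: 0.0 for m in METRICS})
    counts = defaultdict(int)
    for row in rows:
        model = row["subject_model"]
        counts[model] += 1
        for m in METRICS:
            sums[model][m] += row[m]
    return {
        model: {m: sums[model][m] / counts[model] for m in METRICS}
        for model in counts
    }
```

Because every system, RAG or direct, is judged on the same questions with the same rubric, these per-model means are directly comparable.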
Judge Models and Backends
To evaluate generated answers, we use multiple independent judge models to reduce bias and increase robustness:
- Claude Opus 4.5
- Gemini-3-Pro-Preview
- Kimi-K2-thinking
- GPT-5.1
By default, responses are scored by the first three judge models (Claude Opus 4.5, Gemini-3-Pro-Preview, and Kimi-K2-thinking). When one of these judges is itself being evaluated as a subject model, it is replaced by the fourth model (GPT-5.1), so a subject model is never used to judge itself.
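The self-judging rule above amounts to a small substitution over the judge list. A sketch, with illustrative model identifiers (match them to your backend's actual names):

```python
# Default judge panel and the backup used to avoid self-judging.
# Identifiers are illustrative, not the pipeline's canonical names.
DEFAULT_JUDGES = ["claude-opus-4.5", "gemini-3-pro-preview", "kimi-k2-thinking"]
BACKUP_JUDGE = "gpt-5.1"

def select_judges(subject_model):
    """Return the three judges for a subject model, swapping in the
    backup judge wherever the subject would otherwise judge itself."""
    return [BACKUP_JUDGE if j == subject_model else j for j in DEFAULT_JUDGES]
```

Since the panel always has exactly three members, scores remain comparable across subject models regardless of which substitution occurred.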
Contamination-Aware Splits
LLMs are potentially exposed to all public documents during pretraining, so the benchmark, which is derived from public land-grant university publications, may overlap with some models' training data. Publications dated after a model's training cutoff are likely to be excluded from pretraining; the cutoffs are listed below. For the Unbiased leaderboard below, we therefore use an (unavoidably smaller) benchmark of 146 QA pairs derived from documents published after September 30, 2024. Some of the most recent LLMs have even later training cutoffs, and there are too few publications after those dates, so they are excluded from the Unbiased benchmark results.
To study training-data contamination explicitly, AI-AgriBench includes:
- A pre-September 30, 2024 split, which largely overlaps with typical model training corpora and approximates "seen" knowledge.
- A smaller but critical post-September 30, 2024 split (146 QA pairs), designed to fall outside most model training windows and stress-test generalization beyond memorized content.
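Materializing the two splits is a straightforward partition on the source document's publication date. A minimal sketch, assuming each QA pair carries a `pub_date` field as a `datetime.date` (the field name is an assumption about the dataset schema):

```python
from datetime import date

# Cutoff separating the "seen" and post-cutoff splits, per the text above.
CUTOFF = date(2024, 9, 30)

def split_by_cutoff(qa_pairs, cutoff=CUTOFF):
    """Partition QA pairs into (pre, post) splits by the publication date
    of their source document. `pub_date` is an assumed schema field."""
    pre, post = [], []
    for qa in qa_pairs:
        (post if qa["pub_date"] > cutoff else pre).append(qa)
    return pre, post
```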
Comparing performance across these splits helps us distinguish how much of a model's success comes from
memorization versus genuine reasoning over agricultural knowledge.