AI-AgriBench uses an LLM-as-a-Judge evaluation pipeline: subject models generate answers to
agricultural questions, and specialized judge models score those answers along four metrics from 0–100.
The pipeline supports multiprocessing, checkpointing, and resumable outputs for large-scale runs.
How Scoring Works
For each question, we distinguish between subject models (the models under test) and
judge models (LLMs that score their answers). The judge sees:
- The user’s agricultural question
- The gold (expert) answer from AI-AgriBench
- The subject model’s response
- A detailed scoring rubric encoded in the prompt
The judge must respond with JSON only, containing four scores between 0 and 100:
accuracy, relevance, completeness, and
conciseness.
The evaluation pipeline is designed around simple JSON/JSONL interfaces:
- Input. A JSON/JSONL file with at least: id, question, gold_answer (or self_answer), and one field per subject model (e.g., gpt-4o-mini, qwen2.5-72b) containing that model's response.
- Processing. The pipeline flattens this structure into per-model evaluation items and dispatches them to one or more judge backends with multiprocessing and robust JSON parsing.
- Output. A JSONL file where each line corresponds to a single (id, subject_model, judge_model) triple and includes: id, question, gold_answer, subject_model, model_response, judge_model, the four scores, and metadata (timestamps, raw judge output, etc.).
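The flattening and resume logic can be sketched as follows. This is a minimal illustration, not the pipeline's actual implementation; the function names (`flatten_items`, `completed_triples`) are hypothetical, but the field names follow the input/output interface described above.

```python
import json
from pathlib import Path

def flatten_items(records, subject_models):
    """Turn one record per question into per-model evaluation items.
    `records` are parsed JSON objects from the input file; field names
    follow the interface described above."""
    for rec in records:
        for model in subject_models:
            if model in rec:
                yield {
                    "id": rec["id"],
                    "question": rec["question"],
                    "gold_answer": rec.get("gold_answer") or rec.get("self_answer"),
                    "subject_model": model,
                    "model_response": rec[model],
                }

def completed_triples(output_path):
    """Scan an existing output JSONL and collect finished
    (id, subject_model, judge_model) triples, so an interrupted run can
    resume and skip work that is already checkpointed."""
    done = set()
    path = Path(output_path)
    if path.exists():
        for line in path.read_text().splitlines():
            if line.strip():
                row = json.loads(line)
                done.add((row["id"], row["subject_model"], row["judge_model"]))
    return done
```

Because the output file is append-only JSONL, resuming is just a matter of filtering the flattened items against `completed_triples` before dispatching to the judges.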
Each answer is scored on four metrics, with a detailed rubric embedded in the judge prompt (the full prompt is shown below):
- Accuracy. Alignment with expert consensus and the gold answer. This includes correct terminology (disease/pest names, nutrient forms), factual correctness of diagnostic conclusions, and appropriateness of management recommendations. Completely correct, expert-aligned answers score 100; severely incorrect or misleading answers score near 0.
- Relevance. How well the answer stays on topic and addresses the user's agricultural question. Answers that drift into unrelated agronomy, ignore the main decision, or miss critical points are penalized.
- Completeness. Whether the answer covers the key steps, caveats, and conditions needed for a farmer or advisor to act safely and effectively, rather than giving partial or fragmentary advice.
- Conciseness. Whether the answer is focused, avoids unnecessary digressions, and communicates the required information efficiently.
You are now required to rate a model's response to an agriculture-related question.
Based on the gold answer and the user's question, you need to score the model's answer according to the following four scoring criteria.
<User Query>{user_query}</User Query>
<Gold Answer>{gold_answer}</Gold Answer>
<Model Response>{model_response}</Model Response>
{Score Criteria}
Accuracy Definition: Accuracy evaluates whether the agricultural facts, species identification, diagnostic conclusions, and management recommendations provided by the model align with the expert's response. Emphasis is placed on: 1. Correctness of professional terminology (e.g., precise naming of diseases, pests, or invasive species). 2. Accuracy of key details (e.g., descriptions of lesion characteristics, pest behaviors, or plant symptoms). 3. Logical coherence in describing causal relationships (e.g., disease transmission pathways, pest infestation mechanisms). 4. Appropriateness and effectiveness of the proposed management strategies or interventions.
- 100 points: All agricultural facts, terminologies, diagnostic conclusions, and management recommendations are completely correct, comprehensive, and fully aligned with expert consensus.
- 75 points: Minor inaccuracies or omissions in terminology, descriptive details, or management advice exist, but the core diagnostic conclusions and recommended management practices remain accurate and effective.
- 50 points: Noticeable factual errors, misidentifications (species/disease/pests), or suboptimal management suggestions. However, the response still demonstrates partial accuracy or correctness in key aspects.
- 25 points: Major inaccuracies, such as significant confusion between diseases, pests, or plants, flawed causal logic, or incorrect management practices that could lead to ineffective or detrimental outcomes.
- 0 points: Entirely incorrect, scientifically invalid, or significantly misleading claims without any alignment with expert consensus.
Relevance Definition: This measures how closely the model's response matches the scope and focus of expert answers, ensuring it stays on-topic and avoids tangential information. Responses that digress into unrelated agricultural knowledge or overlook critical points tied to the user's query are considered less relevant.
- 100 points: The response perfectly mirrors the expert answer and directly addresses the query, using precise terminology and only including question-relevant information.
- 75 points: The answer is mostly aligned with the expert response and user query, with only minor tangents or slight omissions in details.
- 50 points: The response contains noticeable deviations or omissions compared to the expert answer, with several off-topic or less relevant points.
- 25 points: Significant misalignment with the expert answer and the query is evident. The response includes major irrelevant or incorrect content.
- 0 points: The answer is entirely off-topic, failing to reflect the expert response or address the user query.
Completeness Definition: Whether the model's answer covers all key information points mentioned in expert answers to fully address the user's inquiry. If the model omits critical steps or precautions highlighted in expert answers, it is deemed incomplete. Emphasis is placed on: 1. Professional Terminology: Uses precise terms (e.g., names of diseases, pests, invasive species). 2. Key Details: Includes comprehensive descriptions (e.g., lesion characteristics, pest behaviors, plant symptoms). 3. Logical Causal Relationships: Fully explains connections (e.g., disease transmission, pest infestation mechanisms). 4. Management Recommendations: Details all necessary strategies and precautions.
- 100 points: Covers all key points from the gold answer.
- 75 points: Misses 1-2 minor details but addresses core aspects.
- 50 points: The response contains noticeable deviations or omissions compared to the expert answer.
- 25 points: Omits a major component (e.g., management recommendations).
- 0 points: Fails to address any key elements of the query.
Conciseness Definition: Whether the answer provides actionable guidance that directly addresses the user's core needs, delivering a concise and unambiguous conclusion and specific recommendations without extraneous technical details. The response should adhere to Occam's Razor by avoiding unnecessary complexity and focusing only on what is essential for understanding whether intervention is necessary and what exact steps (if any) need to be taken.
- 100 points: The answer is succinct, clear, and directly addresses the user's concerns. It offers straightforward, practical guidance that is fully aligned with the visible evidence without any unnecessary details. It embodies the principle of Occam's Razor.
- 75 points: The answer is generally concise and practical, offering useful advice. However, it may include some extraneous details or slight ambiguity that only minimally detracts from its overall clarity and directness.
- 50 points: The answer contains relevant information but is overly theoretical or detailed. Extra technical content obscures the key actionable recommendations, making the response less concise and direct.
- 25 points: The answer is largely indirect or abstract, with a significant amount of unnecessary information. The lack of clarity in actionable guidance leaves the user uncertain about whether any intervention is needed.
- 0 points: The answer fails to provide practical or actionable recommendations and is cluttered with superfluous details, completely missing the concise, straightforward approach required by Occam's Razor.
Please only output the scores without any other content. You should output JSON with four keys: accuracy, relevance, completeness, and conciseness. An example is shown below:
{ "accuracy": 75, "relevance": 50, "completeness": 75, "conciseness": 50 }
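Although judges are instructed to output JSON only, in practice replies occasionally arrive wrapped in prose or a code fence, which is why the pipeline needs robust JSON parsing. A minimal sketch of such a parser is below; `parse_judge_scores` is an illustrative name, not the pipeline's actual function, and the clamping behavior is an assumption about how out-of-range scores might be handled.

```python
import json
import re

METRICS = ("accuracy", "relevance", "completeness", "conciseness")

def parse_judge_scores(raw_text):
    """Extract the four 0-100 scores from a judge reply. Tolerates replies
    that wrap the JSON object in prose or a Markdown code fence.
    Returns None when no usable JSON is found."""
    match = re.search(r"\{.*?\}", raw_text, flags=re.DOTALL)
    if not match:
        return None
    try:
        data = json.loads(match.group(0))
    except json.JSONDecodeError:
        return None
    scores = {}
    for key in METRICS:
        if key not in data:
            return None
        # Clamp to the documented 0-100 range in case a judge drifts outside it.
        scores[key] = max(0, min(100, int(data[key])))
    return scores
```

Items whose judge replies cannot be parsed can then be retried or logged with the raw output preserved in the metadata.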
Subject Models
AI-AgriBench evaluates a diverse set of language models and systems to assess performance on agricultural
question answering. The benchmark includes both direct model evaluations and RAG-based systems that combine
retrieval with language models.
These models are evaluated directly on the benchmark questions without additional retrieval or context augmentation:
- gemini-3-pro-preview
- gemini-2.5-flash
- kimi-k2-thinking
- gpt-5.1
- gpt-5-mini
- GPT-4o
- GPT-4o-mini
- Claude 4.1 Opus
- Claude 3.7 Sonnet
- deepseek/deepseek-v3
- qwen/qwen2.5-72b-instruct
- mistral/mistral-large-2411
CropWizard is our retrieval-augmented generation (RAG) pipeline that combines document retrieval from the
CropWizard corpus with LLMs to generate answers. In our current evaluations, CropWizard uses:
These systems are evaluated on the same questions as direct-chatbot baselines, making it possible to compare
retrieval-augmented and non-retrieval setups under identical judging conditions.
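Given the per-(id, subject_model, judge_model) output rows, comparing retrieval-augmented and direct setups reduces to averaging each metric per subject model across questions and judges. A minimal sketch, assuming rows shaped like the output JSONL described earlier (the function name is illustrative):

```python
from collections import defaultdict

METRICS = ("accuracy", "relevance", "completeness", "conciseness")

def mean_scores_by_model(rows):
    """Average each metric per subject model across all
    (question, judge) pairs in the parsed output JSONL rows."""
    sums = defaultdict(lambda: {m: 0.0 for m in METRICS})
    counts = defaultdict(int)
    for row in rows:
        model = row["subject_model"]
        counts[model] += 1
        for m in METRICS:
            sums[model][m] += row[m]
    return {
        model: {m: sums[model][m] / counts[model] for m in METRICS}
        for model in counts
    }
```

Because every system, RAG or direct, is judged on the same questions with the same rubric, these per-model means are directly comparable.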
Judge Models and Backends
To evaluate generated answers, we use multiple independent judge models to reduce bias and increase robustness:
- Claude Opus 4.5
- Gemini-3-Pro-Preview
- Kimi-K2-thinking
- GPT-5.1
By default, responses are scored by the first three judge models (Claude Opus 4.5, Gemini-3-Pro-Preview, and Kimi-K2-thinking). When one of these judges is itself being evaluated as a subject model, it is replaced by the fourth model (GPT-5.1), so a subject model is never used to judge itself.
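The self-judging rule above amounts to a small substitution over the judge list. A sketch, with illustrative model identifiers (match them to your backend's actual names):

```python
# Default judge panel and the backup used to avoid self-judging.
# Identifiers are illustrative, not the pipeline's canonical names.
DEFAULT_JUDGES = ["claude-opus-4.5", "gemini-3-pro-preview", "kimi-k2-thinking"]
BACKUP_JUDGE = "gpt-5.1"

def select_judges(subject_model):
    """Return the three judges for a subject model, swapping in the
    backup judge wherever the subject would otherwise judge itself."""
    return [BACKUP_JUDGE if j == subject_model else j for j in DEFAULT_JUDGES]
```

Since the panel always has exactly three members, scores remain comparable across subject models regardless of which substitution occurred.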
Contamination-Aware Splits
LLMs are potentially exposed to all public documents during pretraining, so the benchmark, which is derived from public land-grant university publications, may overlap with some models' training data. Publications dated after a model's training cutoff are likely to be excluded from pretraining; the cutoffs are listed below. For the Unbiased leaderboard below, we therefore use an (unavoidably smaller) benchmark of 146 QA pairs derived from documents published after September 30, 2024. Some of the most recent LLMs have even later training cutoffs, and there are too few publications after those dates, so they are excluded from the Unbiased benchmark results.
To study training-data contamination explicitly, AI-AgriBench includes:
- A pre-September 30, 2024 split, which largely overlaps with typical model training corpora and approximates "seen" knowledge.
- A smaller but critical post-September 30, 2024 split (146 QA pairs), designed to fall outside most model training windows and stress-test generalization beyond memorized content.
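Materializing the two splits is a straightforward partition on the source document's publication date. A minimal sketch, assuming each QA pair carries a `pub_date` field as a `datetime.date` (the field name is an assumption about the dataset schema):

```python
from datetime import date

# Cutoff separating the "seen" and post-cutoff splits, per the text above.
CUTOFF = date(2024, 9, 30)

def split_by_cutoff(qa_pairs, cutoff=CUTOFF):
    """Partition QA pairs into (pre, post) splits by the publication date
    of their source document. `pub_date` is an assumed schema field."""
    pre, post = [], []
    for qa in qa_pairs:
        (post if qa["pub_date"] > cutoff else pre).append(qa)
    return pre, post
```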
Comparing performance across these splits helps us distinguish how much of a model's success comes from
memorization versus genuine reasoning over agricultural knowledge.