AI-AgriBench

A domain-grounded benchmark for reliable agricultural question answering

Built from real extension bulletins, university materials, and open-access research.

Why AI-AgriBench?

AI-AgriBench is designed to answer a simple but critical question: "Is the agronomic advice from large language models and AI-based agricultural advisory tools accurate?" Instead of testing models on generic benchmarks, AI-AgriBench evaluates them on realistic agronomic questions grounded in extension knowledge.
This version of AI-AgriBench (v0.5) focuses on evaluating technical agronomic understanding, one key piece of the overall evaluation of advisory services. It helps users assess how reliable a chatbot (built on a language model) or an Ag advisory service is in its grasp of core technical concepts. This version of the benchmark does NOT cover field-level decisions such as seed (hybrid) selection, input product choices, or decisions based on farm-specific data. Real-world farmer questions involving such decisions will be included in future versions of AI-AgriBench. Other capabilities to be added in the future include image-based QA, decisions based on data such as weather forecasts or price predictions, and reasoning about open-ended user questions.

Grounded knowledge sources

We build from extension publications from US land-grant universities. This keeps the benchmark anchored in the same documents farmers and advisors actually use.

Actionable, farmer-oriented Q&A

Questions focus on real decisions: diagnosing nutrient deficiencies, timing fungicides, setting seeding rates, managing weeds, and responding to weather events. Answers are multi-paragraph, practical, and written in plain language.

Expert-Curated Q&A Data Set

Our benchmark is curated by experienced domain experts. 951 questions were reviewed, each by one to three experts; many were manually edited, and 416 high-quality Q&A pairs were retained after expert review.

Contamination-aware evaluation

To avoid bias from LLMs that may have been exposed during pretraining to the source documents used for question-answer extraction, we extract a subset of the QA pairs from documents published AFTER September 30, 2024 and evaluate several LLMs on this subset. These results are marked with a double asterisk (**) in the leaderboard and should be compared with the corresponding results for the same model on the full data set.

Who is AI-AgriBench for?

Farmers & advisors

Understand where current models work well - and where they are not yet trustworthy.

AgTech Advisory Services

Demonstrate the quality of proprietary Ag advisory services using standard, widely accepted benchmarks.

Other AgTech Companies

Evaluate whether advisory tools are safe and useful before deployment to growers and agronomists.

Funders, Investors, Policy Makers

Evaluate the effectiveness of potential commercial or research Ag advisory services.

Results and Takeaways

Some key conclusions that can be drawn from the data in the leaderboard are as follows:

  1. Frontier models, along with several of the Ag-focused advisory services built on them, saturate this benchmark: several score above 95% and over half score above 90% in Accuracy. This indicates that text-based agronomic knowledge is robust in many modern LLMs. Other capabilities, such as image-based QA and data-driven decisions with reasoning and tools, may prove more challenging.

  2. Adding RAG for agriculture-specific knowledge retrieval to a frontier LLM, as in CropWizard and Extension Bot, does not significantly improve accuracy compared with the same LLM without RAG.

  3. Completeness and Relevance scores are also high, in the 90% range, across the top performers. Conciseness suffers, however, indicating that many of the models tend to produce longer and perhaps more detailed answers than the length recommended by experts, as captured in the ground truth data set.

  4. While Accuracy, Relevance, and Completeness are strongly correlated among top-performing models, Conciseness is clearly anti-correlated with Completeness for several frontier systems. This suggests a systematic tendency for stronger models to trade brevity for thoroughness, even when expert-curated ground truth answers are relatively concise.

  5. There is little variation across the 9 categories of agronomic topics, with models scoring high on every topic.

  6. Models evaluated on the post-cutoff dataset (marked **) perform closely in line with their full-dataset results. (Recall that the full benchmark data set is drawn from extension documents published both before and after the training cutoffs, whereas the post-cutoff data set contains only documents published after the cutoff date.) The important takeaway is that the models can successfully answer agronomic questions not drawn from their pretraining data; i.e., the results are not biased by exposure to benchmark data sources during pretraining. A stronger conclusion would be that memorization of pretraining data is unlikely to be a primary driver of performance and that the models are generalizing agronomic concepts rather than recalling benchmark-specific content; more careful experimentation is needed to confirm this.

Instructions for Joining the Leaderboard

To submit results to the AI-AgriBench leaderboard, follow these steps:

  1. Download the questions JSON: To download the test set questions in JSON format, click on the "Click Here to Join the Leaderboard" button at the top of the Leaderboard page and fill out the form. The file contains question IDs, questions, and associated metadata. Each entry includes fields such as qna_id, question, crop_group, topic_categories, and split (pre-September 30, 2024 or post-September 30, 2024).
  2. Run your model: Generate answers for all questions in the test set using your model or system.
  3. Format responses: Format your responses according to our JSON protocol. Each response should include the question ID and your model's answer.
  4. Email submissions: Email the JSON with the responses to Ansh Ankul and Vikram Adve.
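As a sketch, a submission file can be assembled from a mapping of question IDs to generated answers. The exact field names of the JSON protocol are not specified here, so `qna_id` (borrowed from the benchmark's record schema) and `answer` are assumptions:

```python
import json

def build_submission(model_answers):
    """Assemble a leaderboard submission from a {qna_id: answer_text} mapping.

    Field names are illustrative: `qna_id` matches the benchmark's record
    schema, and `answer` stands in for whatever the official protocol names
    the response field.
    """
    return [{"qna_id": qid, "answer": ans}
            for qid, ans in sorted(model_answers.items())]

submission = build_submission({
    "qna_2116": "Start with a soil test, then ...",
    "qna_0042": "Scout fields weekly and ...",
})
print(json.dumps(submission, indent=2))
```

Sorting by `qna_id` keeps repeated runs diff-friendly; the actual ordering requirement, if any, would come from the official protocol.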

For detailed evaluation guidelines, see the Evaluation section below. All submissions will be evaluated using our standardized judge models and scoring rubric.

TECHNICAL DETAILS

A comprehensive guide to the AI-AgriBench methodology, pipeline, and evaluation framework.

AI-AgriBench is constructed through four major stages:

Step 1 - Document Filtering

We begin with the CropWizard corpus (400K+ extension PDFs). Filtering occurs in two parts: hard filtering removes low-quality scans, boilerplate pages, and non-agronomic documents; semantic filtering segments and embeds documents (Qwen3-Embedding-8B), then keeps only chunks relevant to key agricultural topics using LLM-generated and curated seed queries.

Step 2 - Q&A Generation (YourBench)

Filtered chunks are turned into realistic, farmer-focused question-answer pairs using YourBench with GPT-4o-mini (arXiv, HuggingFace). The process includes chunking & summaries, Q&A generation with strict grounding rules, and structured output with provenance links.

Step 3 - Q&A Data Set Filtering and Categorization

Raw Q&As are cleaned and organized through deduplication (using embeddings + FAISS), complexity filtering, premise filtering, and categorization (assigning each item to a crop group and topic category; discarding off-topic items).

Step 4 - Benchmark Review and Editing

Domain experts from production farming backgrounds review and validate Q&A pairs, manually edit content to meet quality standards, and ensure the benchmark reflects real-world agricultural expertise. The final benchmark comprises 416 expert-validated Q&A pairs.

AI-AgriBench is constructed through a multi-stage pipeline that transforms 400K+ real agricultural PDFs into a high-quality benchmark of grounded, actionable question-answer pairs. Each stage is designed to preserve fidelity to the underlying documents, avoid hallucinations, and ensure the final benchmark reflects real production agriculture.

We begin with the CropWizard document corpus, which contains 400,000+ PDFs from land-grant extension publications, university materials, and open-access research. Only the PDFs are used to ensure grounding in original agricultural knowledge.

1.1 Hard Filtering

This quality-screening stage eliminates documents that would be unsuitable for LLM-based extraction. Examples of removed content include:

  • Low-content PDFs (scanned covers, empty pages, newsletter headers)
  • Unreadable scans or pages dominated by OCR noise
  • Material unrelated to production agriculture (e.g., youth programs, policy memos)
  • Documents too short or too visually complex for meaningful text extraction

Hard filtering ensures that no computational resources are wasted on unusable material and that downstream stages operate on documents containing genuine agronomic content.

1.2 Semantic Filtering

After hard filtering, documents are divided into hierarchical text chunks and embedded using Qwen3-Embedding-8B. We then perform retrieval against a curated set of agricultural "seed queries" that define the nine major production themes of the benchmark:

  • Crop Management
  • Pests & Pest Management
  • Weeds & Weed Management
  • Crop Nutrition & Fertility
  • Water & Irrigation
  • Soil Health
  • Crop Marketing
  • Farm Finance
  • Climate & Weather Risk

Seed queries are generated using GPT-4.1 over the entire corpus, transformed into question-style prompts, clustered to remove redundancy, and manually reviewed to build a high-quality retrieval set. Only chunks that semantically match these queries are kept. This ensures the pipeline focuses exclusively on agriculturally relevant knowledge and avoids irrelevant or tangential text.
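The chunk-retention rule can be sketched in a few lines, using toy 2-D vectors in place of Qwen3-Embedding-8B embeddings and an assumed similarity threshold of 0.6 (not the pipeline's actual setting):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def semantic_filter(chunk_vecs, seed_vecs, threshold=0.6):
    """Keep a chunk if its best match against any seed query clears the
    threshold. Both arguments map ids to embedding vectors; the threshold
    value is illustrative."""
    return [cid for cid, cv in chunk_vecs.items()
            if max(cosine(cv, sv) for sv in seed_vecs.values()) >= threshold]
```

In production, this max-over-seeds retrieval would run against a vector index rather than a Python loop, but the keep/drop decision is the same.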

Output of Step 1: A curated collection of relevant, high-quality text chunks ready for Q&A generation.

In this stage, we convert filtered text into realistic, farmer-oriented question-answer pairs using YourBench with GPT-4o-mini. YourBench is an open-source framework for generating zero-shot benchmarks from documents, enabling automated pipelines for ingestion, summarization, and question generation (arXiv, HuggingFace). HuggingFace has adopted YourBench for benchmarking customer RAG systems. This framework enforces document grounding, multi-hop reasoning, and strict prompt constraints to ensure outputs remain faithful to the underlying text.

Filtered PDFs are converted to structured Markdown. YourBench automatically divides content into manageable, semantically coherent segments that stay within model context limits while preserving paragraph boundaries and factual continuity.

For each chunk, an LLM-generated "global summary" provides a high-level context of the document. This summary, combined with the local text, allows the model to generate multi-hop questions that reference broader agronomic themes while remaining grounded in the provided content.

GPT-4o-mini is prompted to generate question-answer pairs using the chunk + summary as its only source. The generation prompt enforces strict rules:

  • The question must reflect real-world farmer or advisor concerns (e.g., diagnosing symptoms, choosing inputs, timing operations).
  • The answer must be 2-4 paragraphs long, practical, and actionable.
  • No external knowledge is allowed - all content must come from the input chunk.
  • No academic framing ("the study shows…") or region-specific references.

This produces production-oriented, field-level Q&As aligned with how extension services communicate with growers.
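The constraints above can be pictured as a prompt template. The rule wording below is an illustrative paraphrase, not the production prompt:

```python
def build_generation_prompt(chunk: str, global_summary: str) -> str:
    """Combine the local chunk and the global document summary with the
    grounding rules. The rule text paraphrases the constraints listed
    above and is not the exact prompt used in the pipeline."""
    rules = "\n".join([
        "- Frame the question as a real farmer or advisor concern.",
        "- Write a practical, actionable answer of 2-4 paragraphs.",
        "- Use ONLY the provided text; no external knowledge.",
        "- Avoid academic framing and region-specific references.",
    ])
    return (
        f"Document summary:\n{global_summary}\n\n"
        f"Source text:\n{chunk}\n\n"
        f"Generate one question-answer pair following these rules:\n{rules}"
    )
```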

All generated Q&As are written to a structured dataset.json file containing:

  • The question and answer
  • Chunk citations
  • Document ID links
  • Token and cost metadata

This ensures traceability back to the original extension source.

Output of Step 2: A large pool of grounded, farmer-oriented Q&As with structured provenance.

The Q&A pool contains duplicates, overly complex items, and questions not adequately supported by the text. The curation stage cleans and structures the dataset into a final evaluation benchmark.

3.1 Deduplication

Questions are embedded using the MiniLM-L6-v2 model and compared using cosine similarity via FAISS (Facebook AI Similarity Search). A threshold-based search removes near-duplicate questions and answers, retaining the earliest occurrence for stability and generating a full duplicate report for auditing.
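As a toy sketch of the keep-earliest rule (a plain O(n²) cosine scan in place of MiniLM-L6-v2 embeddings and a FAISS index, with an assumed 0.9 threshold):

```python
import math

def _cos(u, v):
    """Cosine similarity; assumes nonzero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def dedup_keep_earliest(embeddings, threshold=0.9):
    """Scan items in order; keep one only if it is not a near-duplicate of
    an already-kept item, and record (duplicate_idx, kept_idx) pairs for
    the audit report. The threshold value is illustrative."""
    kept, report = [], []
    for i, vec in enumerate(embeddings):
        match = next((j for j in kept
                      if _cos(embeddings[j], vec) >= threshold), None)
        if match is None:
            kept.append(i)
        else:
            report.append((i, match))
    return kept, report
```

With FAISS the inner loop becomes a batched nearest-neighbor query over normalized vectors, but the retention policy (earliest occurrence wins) is unchanged.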

3.2 Complexity Filtering

A dedicated LLM classifier labels each Q&A as KEEP or REMOVE. Removed items include:

  • Overly technical or research-heavy questions ("meta-analyses", "experimental design")
  • Engineering-focused content (e.g., drone algorithm design)
  • Excessive jargon inappropriate for grower communication
  • Questions that are not meaningful in advisory contexts

3.3 Premise Filtering

This step removes Q&As where the question or answer is not fully supported by the underlying chunk. This improves factual grounding and reduces hallucination risk by ensuring every benchmark item has verifiable provenance in the extension document.

3.4 Categorization

A specialized LLM classifier identifies the specific crop name (when present) and assigns each Q&A to one of several crop groups. Crop groups categorize questions by the type of crop being discussed. The crop groups are:

  • Midwestern_Row_Crops - Corn, Wheat, Soybean, Sorghum
  • Tree_Crops - Fruits and nuts from orchard or plantation trees
  • Commercial_Vegetables - Large-scale vegetables for commercial sale
  • Southern_Row_Crops - Rice, Cotton, Peanuts, Tobacco
  • Small_Fruits - Berries, grapes, other small fruits
  • Northern_Crops - Canola, Barley, Potatoes, Dry Beans
  • Herbs - Culinary/medicinal herbs for commercial production
  • Discard - Vague, incomplete, or not about production farming

In our crop-group categorization analysis, we evaluated 65,538 Q&A pairs, of which 26,279 (40.1%) were retained as non-Discard items and 39,259 (59.9%) were labeled as Discard (vague, incomplete, or not about production farming). Note that the difference between the 84,385 Q&A pairs evaluated in topic categorization and the 65,538 pairs evaluated in crop-group categorization reflects different pipeline runs and filtering processes at different stages.

| Rank | Crop Group | Description | Q&A Count | Percentage |
|------|------------|-------------|-----------|------------|
| 1 | Discard | Vague/incomplete/not about production farming | 39,259 | 59.90% |
| 2 | Midwestern_Row_Crops | Corn, Wheat, Soybean, Sorghum | 10,508 | 16.03% |
| 3 | Tree_Crops | Fruits and nuts from orchard or plantation trees | 3,775 | 5.76% |
| 4 | Commercial_Vegetables | Large-scale vegetables for commercial sale | 3,697 | 5.64% |
| 5 | Southern_Row_Crops | Rice, Cotton, Peanuts, Tobacco | 3,550 | 5.42% |
| 6 | Small_Fruits | Berries, grapes, other small fruits | 2,501 | 3.82% |
| 7 | Northern_Crops | Canola, Barley, Potatoes, Dry Beans | 2,035 | 3.11% |
| 8 | Herbs | Culinary/medicinal herbs for commercial production | 213 | 0.33% |
| | Total | | 65,538 | 100% |

Each Q&A is additionally labeled with 1-3 agronomic topic categories. The 7 topic categories used in the final dataset are:

  • Crop Management Decisions
  • Agricultural Sustainability
  • Crop Nutrition
  • Pests and Pest Management
  • Water and Irrigation Management
  • Weeds and Weed Management
  • Agricultural Weather Risk

In our topic categorization analysis, we evaluated 84,385 Q&A pairs and retained 45,172 records after filtering. The following table shows the distribution of topic categories (note that totals exceed 45,172 due to multi-label assignment, where each Q&A can have 1-3 topic labels):

| Topic Category | Q&A with this label |
|----------------|---------------------|
| Soils_and_Soil_Health | 17,509 |
| Other_Agronomic_Practices | 16,744 |
| Crop_Nutrition_and_Fertility_Management | 14,602 |
| Water_Management_and_Irrigation | 10,908 |
| Discard | 10,748 |
| Pests_and_Pest_Management | 10,318 |
| Seed_Hybrid_Rootstock_Selection | 9,722 |
| Weather_and_Weather_Risks | 6,920 |
| Diseases_and_Disease_Management | 6,094 |
| Weeds_and_Weed_Management | 5,926 |

LLMs are potentially exposed to all public documents during pretraining, so the benchmark, which is derived from public land-grant university publications, might be contaminated with pretraining data. Publications dated after the training cutoffs are likely to be excluded from pretraining; the cutoffs are listed below. We use an (unavoidably smaller) benchmark dataset of 146 QA pairs derived from documents published after September 30, 2024 for the Unbiased leaderboard below. Some of the most recent LLMs have even later training cutoffs, and there are insufficient publications after those dates, so they are excluded from the Unbiased benchmark results.

To study training-data contamination, all documents and Q&As are divided into:

  • pre-September 30, 2024 split - likely included in model training corpora
  • post-September 30, 2024 split - outside training cutoff windows (146 QA pairs)

This enables evaluation of whether models rely on memorization or true reasoning.
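The split assignment itself is a one-line date comparison; the `pre_cutoff`/`post_cutoff` label names below are illustrative (the released schema uses its own `split` values):

```python
from datetime import date

# Cutoff used for the contamination-aware split (from the benchmark description).
CUTOFF = date(2024, 9, 30)

def assign_split(pub_date: date) -> str:
    """Label a document relative to the September 30, 2024 cutoff.
    Label names here are illustrative, not the dataset's field values."""
    return "post_cutoff" if pub_date > CUTOFF else "pre_cutoff"
```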

Output of Step 3: A fully cleaned, deduplicated, categorized, contamination-aware benchmark of ~45K high-quality Q&As.

To ensure the highest quality and practical relevance of the benchmark, we engage domain experts from production farming backgrounds to review and validate Q&A pairs. This human-in-the-loop stage adds critical quality control and ensures that the benchmark reflects real-world agricultural expertise.

We recruited a pool of 31 production farming experts with diverse expertise across crop types, regions, and agricultural practices; 23 of them completed the review process. Each reviewer's background and specialization were carefully documented to enable informed question assignment based on domain knowledge.

Questions were assigned to reviewers based on their specific areas of expertise, ensuring that each Q&A pair is evaluated by someone with relevant domain knowledge. The review process involves:

  • Quality assessment: Reviewers evaluate whether each Q&A pair is good, needs revision (CBF - Could Be Fixed), or should be discarded (Not-Good).
  • Manual editing: For Q&A pairs that are close but fixable, reviewers manually edit and improve the content to meet quality standards. Only Q&A pairs with GOOD or CBF ratings were edited.
  • Domain validation: Reviewers verify that questions reflect real production farming scenarios and that answers are accurate and actionable.

Reviewer Coverage: We assigned 951 questions for review. The distribution of reviewers per question was:

  • Questions with 3 reviewers: 248
  • Questions with 2 reviewers: 415
  • Questions with 1 reviewer: 286

This resulted in an average of approximately 2 reviewers per question (range: 1-3), with a total of 1,868 individual expert responses across all 951 questions.

Quality Breakdown: Across the 951 questions reviewed, the distribution of quality labels was:

  • Only Good: 416
  • Only CBF (Could Be Fixed): 146
  • Only Good and CBF: 38
  • Only CBF and Not-Good: 72
  • Only Not-Good: 72
  • Good and Not-Good: 66
  • Good, CBF, and Not-Good: 19

100 Q&A pairs, all rated GOOD or CBF, were manually edited by reviewers.

Through careful evaluation and manual refinement, the final benchmark comprises 416 high-quality Q&A pairs that have been validated by domain experts and meet our strict quality criteria for production agriculture.

Output of Step 4: A curated benchmark of 416 expert-validated Q&A pairs ready for evaluation and deployment.

Pipeline Statistics

The following table shows record counts at each stage of the pipeline. Note that counts may vary across different pipeline runs and views (e.g., topic categorization vs. crop-group categorization), and some intermediate counts are not explicitly reported in our documentation.

| Stage | Output File / View | Records Remaining |
|-------|--------------------|-------------------|
| YourBench Q&A generation + ingestion | dataset.json | 84,385 |
| Step 1 - Preprocessing | step1_preprocessing/cleaned_qna.json | 84,385 |
| Step 2 - Deduplication | step2_deduplication/cleaned_qna.json | 65,538 (18,847 duplicates removed) |
| Step 3 - Topic categorization (multi-label) | step3_categorization/categorized_qas.json | 54,790 |
| Step 4 - Complexity filter | step4_complexity_filter/filtered_qna.json | 46,662 (8,128 removed) |
| Step 5 - Premise mismatch filter | step5_premise_mismatch_filter/filtered_qna.json | 43,271 |
| Expert review and quality assurance | step4_expert_review/final_benchmark.json | 951 questions assigned to 31 experts → 416 Good questions retained |

AI-AgriBench relies on carefully designed prompts at multiple stages of the pipeline: generation, filtering, and categorization.

The core AI-AgriBench dataset consists of cleaned, categorized question-answer pairs derived from extension and research PDFs. Each record is linked back to its source documents and carries crop, topic, and complexity labels.

Source Documents

400K+ agricultural PDFs from the CropWizard corpus, including land-grant extension bulletins, university publications, and open-access articles.

Raw knowledge base

Curated Q&A Pairs

84,385 Q&A pairs were evaluated in topic categorization, with 43,271 retained after filtering. In the crop-group categorization view, 65,538 pairs were evaluated, of which 26,279 were retained as non-Discard items.

Benchmark items

Coverage

Multi-crop and multi-topic coverage across row crops, tree crops, vegetables, small fruits, and herbs, with labels for crop group and production topic.

Diverse agronomic scenarios

Record Schema (Core Benchmark)

Each Q&A in the benchmark follows a standardized JSON structure:

{
  "qna_id": "qna_2116",
  "question": "How can I determine the right amount of nitrogen fertilizer to apply based on my soil's biological activity?",
  "answer": "To find the appropriate nitrogen fertilizer amount for your crops, start by conducting a soil test...",
  "document_id": "005314",
  "citations": [
    "Strong association occurred between soil-test biological activity and net N mineralization."
  ],
  "crop_name": "Corn",
  "crop_group": "Midwestern_Row_Crops",
  "topic_categories": ["Crop_Nutrition", "Agricultural_Sustainability"],
  "complexity_label": "KEEP",
  "split": "pre_2024_08"
}

Field Descriptions

  • qna_id: Unique identifier for the Q&A pair.
  • question: Farmer-oriented, open-ended question grounded in extension content.
  • answer: Multi-paragraph, actionable answer generated and constrained by the source chunk.
  • document_id: Identifier linking back to the originating PDF in CropWizard.
  • citations: Optional supporting text spans from the original document.
  • crop_name: Specific crop name if present (e.g., Corn, Rice, Garlic) or NA.
  • crop_group: Crop group label (e.g., Midwestern_Row_Crops, Tree_Crops, Commercial_Vegetables).
  • topic_categories: One to three labels from topic categories such as Crop_Nutrition or Weeds_and_Weed_Management.
  • complexity_label: Whether the Q&A was retained (KEEP) or removed by the complexity filter.
  • split: Time-based split indicator (pre_2024_08 vs post_2024_08) for contamination-aware evaluation.
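A small validation sketch over this schema; the required-field set and allowed values are read off the example record above, while treating `citations` and `crop_name` as optional is an assumption:

```python
REQUIRED_FIELDS = {
    "qna_id", "question", "answer", "document_id",
    "crop_group", "topic_categories", "complexity_label", "split",
}

def validate_record(rec: dict) -> list:
    """Return a list of schema problems for one benchmark record
    (an empty list means the record passes these illustrative checks)."""
    errors = []
    missing = REQUIRED_FIELDS - rec.keys()
    if missing:
        errors.append(f"missing fields: {sorted(missing)}")
    cats = rec.get("topic_categories", [])
    if not (isinstance(cats, list) and 1 <= len(cats) <= 3):
        errors.append("topic_categories must be a list of 1-3 labels")
    if rec.get("split") not in {"pre_2024_08", "post_2024_08"}:
        errors.append("split must be pre_2024_08 or post_2024_08")
    return errors
```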

Topic Categories

Pests and Pest Management

Insect identification, pest thresholds, IPM strategies, and pesticide programs.

Diseases and Disease Management

Disease identification, symptom diagnosis, disease prevention, and treatment strategies.

Weeds and Weed Management

Weed identification, herbicide programs and timings, resistance management, and control strategies.

Crop Nutrition and Fertility Management

Soil testing, fertilizer rates, split applications, nutrient placement, and deficiency diagnosis.

Soils and Soil Health

Soil structure, cover crops, reduced tillage, compaction management, and soil improvement practices.

Seed, Hybrid, and Rootstock Selection

Variety selection, seed quality, hybrid characteristics, rootstock choices, and planting decisions.

Horticultural and Agronomic Practices

Planting dates, seeding rates, row spacing, harvest timing, storage, and general crop management practices.

Water Management and Irrigation

Irrigation scheduling, drought mitigation, moisture monitoring, water quality, and water conservation.

Weather and Weather Risks

Frost, heat waves, hail, flooding, and short-term weather-related operational risks and mitigation strategies.

AI-AgriBench uses an LLM-as-a-Judge evaluation pipeline: subject models generate answers to agricultural questions, and specialized judge models score those answers along four metrics from 0–100. The pipeline supports multiprocessing, checkpointing, and resumable outputs for large-scale runs.

How Scoring Works

For each question, we distinguish between subject models (the models under test) and judge models (LLMs that score their answers). The judge sees:

  • The user’s agricultural question
  • The gold (expert) answer from AI-AgriBench
  • The subject model’s response
  • A detailed scoring rubric encoded in the prompt

The judge must respond with JSON only, containing four scores between 0 and 100: accuracy, relevance, completeness, and conciseness.

The evaluation pipeline is designed around simple JSON/JSONL interfaces:

  • Input. A JSON/JSONL file with at least: id, question, gold_answer (or self_answer), and one field per subject model (e.g., gpt-4o-mini, qwen2.5-72b) containing that model’s response.
  • Processing. The pipeline flattens this structure into per-model evaluation items and dispatches them to one or more judge backends with multiprocessing and robust JSON parsing.
  • Output. A JSONL file where each line corresponds to a single (id, subject_model, judge_model) triple and includes: id, question, gold_answer, subject_model, model_response, judge_model, the four scores, and metadata (timestamps, raw judge output, etc.).
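The flattening step above can be sketched as follows; field names mirror the input/output description, and everything else is illustrative:

```python
def flatten(records, subject_models, judge_models):
    """Expand each input record into one evaluation item per
    (id, subject_model, judge_model) triple, matching the JSONL
    output shape described above."""
    items = []
    for rec in records:
        for sm in subject_models:
            for jm in judge_models:
                items.append({
                    "id": rec["id"],
                    "question": rec["question"],
                    "gold_answer": rec["gold_answer"],
                    "subject_model": sm,
                    "model_response": rec[sm],
                    "judge_model": jm,
                })
    return items
```

Each item is independent, which is what makes the multiprocessing dispatch and resumable checkpointing straightforward.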

Each answer is scored on four metrics, with a detailed rubric embedded in the judge prompt (detailed prompts are shown below):

  • Accuracy. Alignment with expert consensus and the gold answer. This includes correct terminology (disease/pest names, nutrient forms), factual correctness of diagnostic conclusions, and appropriateness of management recommendations. Completely correct, expert-aligned answers score 100; severely incorrect or misleading answers score near 0.
  • Relevance. How well the answer stays on topic and addresses the user’s agricultural question. Answers that drift into unrelated agronomy, ignore the main decision, or miss critical points are penalized.
  • Completeness. Whether the answer covers the key steps, caveats, and conditions needed for a farmer or advisor to act safely and effectively, rather than giving partial or fragmentary advice.
  • Conciseness. Whether the answer is focused, avoids unnecessary digressions, and communicates the required information efficiently.
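To turn per-judge rows into leaderboard numbers, one simple option (an assumption, not necessarily the official aggregation) is a plain mean per subject model:

```python
from collections import defaultdict
from statistics import mean

METRICS = ("accuracy", "relevance", "completeness", "conciseness")

def aggregate(rows):
    """Average each 0-100 metric over all judges and questions for every
    subject model. `rows` follow the JSONL output shape described earlier."""
    scores = defaultdict(lambda: defaultdict(list))
    for row in rows:
        for m in METRICS:
            scores[row["subject_model"]][m].append(row[m])
    return {sm: {m: mean(vals) for m, vals in per_metric.items()}
            for sm, per_metric in scores.items()}
```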

Subject Models

AI-AgriBench evaluates a diverse set of language models and systems to assess performance on agricultural question answering. The benchmark includes both direct model evaluations and RAG-based systems that combine retrieval with language models.

These models are evaluated directly on the benchmark questions without additional retrieval or context augmentation:

  • gemini-3-pro-preview
  • gemini-2.5-flash
  • kimi-k2-thinking
  • gpt-5.1
  • gpt-5-mini
  • GPT-4o
  • GPT-4o-mini
  • Claude 4.1 Opus
  • Claude 3.7 Sonnet
  • deepseek/deepseek-v3
  • qwen/qwen2.5-72b-instruct
  • mistral/mistral-large-2411

CropWizard is our retrieval-augmented generation (RAG) pipeline that combines document retrieval from the CropWizard corpus with LLMs to generate answers. In our current evaluations, CropWizard uses:

  • gpt-5
  • gpt-5-mini

These systems are evaluated on the same questions as direct-chatbot baselines, making it possible to compare retrieval-augmented and non-retrieval setups under identical judging conditions.

Judge Models and Backends

To evaluate generated answers, we use multiple independent judge models to reduce bias and increase robustness:

  • Claude Opus 4.5
  • Gemini3-Pro-Preview
  • Kimi-K2-thinking
  • GPT5.1

By default, we use the first 3 judge models (Claude Opus 4.5, Gemini3-Pro-Preview, and Kimi-K2-thinking) to evaluate a subject model's responses. However, when one of the judge models itself is being evaluated as a subject model, we replace that judge model with the fourth model (GPT5.1) in the list above. This way, a subject model is never used to judge itself.
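The rotation rule can be sketched directly from the description above:

```python
DEFAULT_JUDGES = ["Claude Opus 4.5", "Gemini3-Pro-Preview", "Kimi-K2-thinking"]
BACKUP_JUDGE = "GPT5.1"

def judges_for(subject_model: str) -> list:
    """Return the three-judge panel for a subject model, swapping in the
    backup judge whenever the subject would otherwise judge itself."""
    if subject_model in DEFAULT_JUDGES:
        return [j for j in DEFAULT_JUDGES if j != subject_model] + [BACKUP_JUDGE]
    return list(DEFAULT_JUDGES)
```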

Contamination-Aware Splits


To study training-data contamination explicitly, AI-AgriBench includes:

  • A pre-September 30, 2024 split, which largely overlaps with typical model training corpora and approximates "seen" knowledge.
  • A smaller but critical post-September 30, 2024 split (146 QA pairs), designed to fall outside most model training windows and stress-test generalization beyond memorized content.

Comparing performance across these splits helps us distinguish how much of a model's success comes from memorization versus genuine reasoning over agricultural knowledge.

Frequently Asked Questions

Most popular LLM benchmarks focus on multiple-choice questions, short contexts, and clean textbook-style content. Real agriculture looks very different. Extension bulletins are long, noisy PDFs with complex technical content. And the questions that matter - "What are the key factors affecting nitrogen uptake in corn?", "How does soil pH influence nutrient availability?" - require deep agronomic understanding and rarely have four neat answer options.

AI-AgriBench is built to reflect that reality. It evaluates models on open-ended Q&A over real extension content, with multi-step reasoning and domain-specific terminology.

Our data comes from the CropWizard document corpus, which includes 400K+ publications from extension websites of 55+ U.S. land-grant universities, similar documents from other universities, and open-access research publications from agricultural journals. We only use the PDF documents from this corpus.

AI-AgriBench v0.5 is designed to evaluate technical agronomic understanding—specifically how well chatbots and agricultural advisory services grasp core technical concepts. This is an important component of evaluating advisory quality, but it does not represent the full range of real-world farmer decision-making.

The current version does not include field-level decision scenarios such as seed or hybrid selection, input product choices, or recommendations that depend on farm-specific data like soil test results, field history, or local weather conditions. These types of practical, context-dependent farmer questions are planned for inclusion in future versions of AI-AgriBench.

Future releases will also incorporate image-based question answering, data-driven recommendations using weather forecasts and market prices, and evaluation of models' ability to reason through open-ended, exploratory agricultural questions.

See the Future Improvements section below for more details on future plans.

See the Instructions for Joining the Leaderboard section above for detailed submission guidelines.

The initial AI-AgriBench release focuses on text-based Q&A over extension documents. We are actively exploring multimodal extensions (e.g., pairing images of crop symptoms with text) for future versions.

We use a structured generation pipeline (YourBench) where GPT-4o-mini receives both local chunk content and global summaries. A strict prompt enforces that questions are framed from a farmer's perspective and answers are multi-paragraph, practical, and based only on the source text, with no external knowledge.

AI-AgriBench is built directly from real agricultural extension knowledge; questions are actionable and farmer-oriented; the pipeline is transparent and reproducible; and we explicitly account for training-data contamination via time-based splits.
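The time-based contamination split mentioned above can be sketched as follows. The document schema, field names, and cutoff date here are illustrative assumptions, not the benchmark's actual metadata:

```python
from datetime import date

def time_split(docs: list[dict], training_cutoff: date) -> tuple[list[dict], list[dict]]:
    """Partition documents into pre-cutoff (possibly seen during model
    training) and post-cutoff (contamination-safe) sets.

    Each doc is assumed to carry a 'published' date field (illustrative schema).
    """
    seen, unseen = [], []
    for doc in docs:
        (seen if doc["published"] <= training_cutoff else unseen).append(doc)
    return seen, unseen

docs = [
    {"id": "bulletin-a", "published": date(2021, 5, 1)},
    {"id": "bulletin-b", "published": date(2024, 8, 15)},
]
# Questions built from post-cutoff documents cannot have leaked into
# the training data of a model with this cutoff.
seen, unseen = time_split(docs, training_cutoff=date(2023, 10, 1))
```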

Commercial use of AI-AgriBench requires explicit permission and may be subject to licensing terms. Please reach out via the contact form for more information.

Yes. The code used to construct the benchmark and run evaluations is available on GitHub: AIFARMS/AI-AgriBench. It includes preprocessing scripts, deduplication tools, classification prompts, and evaluation utilities.

Future Improvements

AI-AgriBench v0.5 represents our initial release focused on text-based agronomic understanding. We are actively working on new releases to expand the benchmark's scope, improve evaluation rigor, and better reflect real-world agricultural advisory scenarios.

1. Multimodal Capabilities

The current benchmark focuses exclusively on text-based question answering (QA). Future versions will incorporate image-based QA—evaluation of a subject model's ability to diagnose crop diseases, identify pests, assess nutrient deficiencies, and recognize weed species from field photographs. This will test visual understanding capabilities that are critical for real-world agricultural diagnostics.

2. Enhanced Evaluation Metrics

While the current Accuracy metric provides an overall assessment, it bundles several aspects of answer quality into a single score. For example, it does not always capture fine-grained factual correctness, particularly for individual facts embedded in longer answers. Planned improvements include:

  • Granular accuracy breakdowns: Decomposing Accuracy into sub-components such as species identification accuracy, diagnostic correctness, recommendation appropriateness, and terminology precision, enabling more actionable diagnostic insights for improving the subject models.
  • Fact-level verification: Developing methods to verify individual factual claims within responses, allowing detection of partially correct answers that mix accurate and inaccurate information.
  • Additional specialized metrics: Exploring metrics for safety (e.g., detection of potentially harmful recommendations) and temporal relevance (whether advice appropriately accounts for seasonality or timing).
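One simple way to frame fact-level verification is to score what fraction of reference key facts appear in a response. The sketch below uses naive case-insensitive substring matching purely as a placeholder for a real claim-verification step (e.g., an NLI model or LLM judge); the function name and matching rule are assumptions:

```python
def fact_recall(response: str, key_facts: list[str]) -> float:
    """Fraction of reference key facts found in the response.

    Naive substring matching stands in for a real claim-verification
    model; a partially correct answer gets partial credit.
    """
    text = response.lower()
    hits = sum(1 for fact in key_facts if fact.lower() in text)
    return hits / len(key_facts) if key_facts else 0.0

resp = "Apply fungicide at tasseling (VT); scout weekly for gray leaf spot."
facts = [
    "apply fungicide at tasseling",
    "scout weekly",
    "rotate modes of action",
]
score = fact_recall(resp, facts)  # 2 of 3 reference facts are present
```

A per-fact score like this makes it possible to detect answers that mix accurate and inaccurate claims, which a single pass/fail Accuracy judgment cannot.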

3. Field-Level Decision Making

Version 0.5 primarily evaluates technical agronomic knowledge. Future releases will incorporate more realistic decision-making scenarios, including:

  • Farm-specific decision contexts: Questions that require seed or hybrid selection, input product choices, and recommendations conditioned on farm-specific data such as soil test results, field history, and local conditions.
  • Data-driven recommendations: Evaluation of models' ability to integrate real-time data sources such as weather forecasts, market prices, soil sensor readings, and satellite imagery to generate actionable advice.
  • Multi-factor reasoning: Assessment of how models balance competing considerations such as economic cost, environmental impact, and timing constraints when making complex agricultural decisions.

Team

CDA Team

Vikram Adve, Talon Becker, Dennis Bowman, Elizabeth Wahle, John Reid, Ansh Ankul, Jixin Li, Chi Gui, Sol Robinson

Steering Committee Members

Tami Craig Schilling, Pratik Desai, Sachi Desai, Gershom Kutliroff, Jonathan Lehe, J. Mark Locklear, Sara Malvar, Lakshmi Pedapudi, Bradley Van De Woestyne, David Warren

Agronomy Experts (Reviewers)

Reviewer names will be listed here as permissions are obtained.

Acknowledgments

This work was funded by the AIFARMS AI Institute (which is sponsored by the USDA National Institute of Food and Agriculture) and by the Center for Digital Agriculture. It was also supported in part by membership fees paid by consortium member companies.

We thank the agronomists who manually reviewed and edited a large number of QA pairs for inclusion in the benchmark. Special recognition goes to the CropWizard project team for providing access to their document corpus and technical details of the benchmark evaluation pipeline.

Contact

We welcome questions, feedback, and collaboration opportunities from researchers, practitioners, and organizations interested in AI-AgriBench. Whether you're looking to submit model results, request dataset access, or explore research collaborations, we'd love to hear from you.

To receive updates about AI-AgriBench, join our mailing list below.