Agentic AI for Scientific Discovery: Benchmarks, Frameworks, and Applications
Zonglin Yang
MiroMind
Ruochen Li
(representing Xinya Du)
University of Texas at Dallas
Xinya Du
University of Texas at Dallas
Chandan Reddy
Virginia Tech
Detailed Tutorial Description
Overview
The rise of large language models (LLMs) has introduced a paradigm shift in how AI can contribute to science. Beyond serving as static predictors, LLMs can function as agents that actively generate, refine, and evaluate hypotheses. This tutorial provides a structured overview of how agentic AI can accelerate the scientific discovery process, grounded in recent advances in benchmarks, frameworks, and applications.
Motivation
Traditional machine learning excels at prediction but falls short in hypothesis-driven discovery, where novelty, interpretability, and iterative reasoning are essential. The promise of agentic AI lies in closing this gap. By structuring the discovery process into two complementary phases, we highlight how AI can play an active role in advancing science:
- Hypothesis Generation – AI agents propose candidate hypotheses by retrieving inspirations, composing associations, and ranking them for plausibility.
- Feedback and Refinement – Hypotheses are iteratively improved using diverse feedback signals, including data fit, reasoning consistency, symbolic decomposition, and benchmark performance.
This cycle mirrors the way human scientists move from initial ideas to refined, testable hypotheses, but accelerates it through automated reasoning and structured agentic workflows.
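The two-phase cycle above can be made concrete with a minimal sketch. The code below is purely illustrative and is not taken from any framework in the reading list: a `propose` step stands in for LLM-driven hypothesis generation by perturbing the current hypothesis, and the loop keeps a candidate only when a data-fit feedback signal improves, mirroring Phase I and Phase II on a toy linear-law rediscovery task (all names and the target law are invented for illustration).

```python
import random

# Toy "ground truth" the agent tries to rediscover (hypothetical): y = 3x + 2.
data = [(x, 3 * x + 2) for x in range(10)]

def fitness(hyp, data):
    """Feedback signal (Phase II): negative mean squared error of (a, b) on data."""
    a, b = hyp
    return -sum((a * x + b - y) ** 2 for x, y in data) / len(data)

def propose(parent, rng):
    """Phase I stand-in: perturb the current hypothesis to get a new candidate.
    (A real agent would query an LLM for a revised hypothesis here.)"""
    a, b = parent
    return (a + rng.uniform(-1, 1), b + rng.uniform(-1, 1))

def discover(rounds=500, seed=0):
    rng = random.Random(seed)
    best = (rng.uniform(-5, 5), rng.uniform(-5, 5))  # initial coarse idea
    for _ in range(rounds):
        cand = propose(best, rng)
        # Refinement: accept the candidate only if the feedback signal improves.
        if fitness(cand, data) > fitness(best, data):
            best = cand
    return best

a, b = discover()
print(f"recovered y = {a:.2f}x + {b:.2f}")
```

In a full agentic system, `propose` and `fitness` would be far richer (literature-grounded generation, symbolic checks, simulated experiments), but the accept-if-feedback-improves loop is the same skeleton.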
Tutorial Outline
- Introduction to Agentic AI in Science
- From prediction to discovery
- Defining “agentic AI” and distinguishing it from static LLM use
- Motivating examples
- Phase I: Hypothesis Generation
- Inspiration retrieval and knowledge recombination
- From qualitative hypotheses to symbolic formulations
- Ranking strategies and novelty assessment
- Phase II: Feedback and Refinement
- Iterative optimization using feedback signals
- Data-driven evaluation, symbolic decomposition, and reasoning consistency checks
- Hierarchical refinement from coarse ideas to fine-grained hypotheses
- Benchmarks for Scientific Discovery
- Limitations of existing datasets (memorization vs. reasoning)
- Principles for robust benchmark design
- Recent benchmarks for equations, hypotheses, and surfaces
- Frameworks for Agentic Discovery
- Decomposition strategies, memory mechanisms, and feedback loops
- Integration of evolutionary search and reinforcement learning
- Examples of agentic workflows
- Applications Across Sciences
- Social sciences (open-domain hypothesis generation)
- Natural sciences (equation discovery, symbolic modeling)
- Broader applications in AI for science
- Challenges and Future Directions
- Reliability, interpretability, reproducibility
- Balancing creativity and validity
- Toward hybrid AI–science collaborations
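Several outline items above (equation discovery, feedback loops, evolutionary search) fit together in a single toy sketch. The example below is a hypothetical, minimal evolutionary search over fixed basis functions; random Gaussian mutation stands in for an LLM proposal step, and all names and the target equation are invented for illustration rather than drawn from any listed framework.

```python
import math
import random

# Toy target the search should rediscover (hypothetical): y = 2*x^2 - x.
xs = [i * 0.5 for i in range(-4, 5)]
ys = [2 * x * x - x for x in xs]

# Fixed basis functions; a hypothesis is one coefficient per basis term.
basis = [lambda x: x, lambda x: x * x, lambda x: math.sin(x)]

def mse(coeffs):
    """Data-driven feedback signal: mean squared error on the dataset."""
    err = 0.0
    for x, y in zip(xs, ys):
        pred = sum(c * f(x) for c, f in zip(coeffs, basis))
        err += (pred - y) ** 2
    return err / len(xs)

def mutate(coeffs, rng, sigma=0.3):
    """Stand-in for an LLM proposal step: perturb one parent hypothesis."""
    return [c + rng.gauss(0, sigma) for c in coeffs]

def evolve(pop_size=30, generations=150, seed=1):
    rng = random.Random(seed)
    pop = [[rng.uniform(-3, 3) for _ in basis] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=mse)                 # rank candidates by feedback signal
        survivors = pop[: pop_size // 2]  # elitist selection
        pop = survivors + [mutate(rng.choice(survivors), rng)
                           for _ in range(pop_size - len(survivors))]
    return min(pop, key=mse)

best = evolve()
print("coefficients (x, x^2, sin x):", [round(c, 2) for c in best])
print("MSE:", round(mse(best), 4))
```

In LLM-guided variants of this loop (as surveyed in the Frameworks section), the `mutate` step would instead prompt a language model to propose edited equation programs, while the selection-by-feedback structure stays the same.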
Target Audience
Researchers and practitioners in machine learning, NLP, and AI for science who are interested in symbolic reasoning, agentic frameworks, and automated discovery. The tutorial is accessible to those with general familiarity with LLMs and does not require deep domain expertise.
Learning Outcomes
Participants will gain:
- An understanding of the two-phase cycle of agentic scientific discovery.
- Exposure to recent benchmarks for evaluating reasoning beyond memorization.
- Insight into frameworks that integrate decomposition, evolutionary search, and feedback mechanisms.
- Awareness of applications across disciplines and the challenges they expose.
- A forward-looking perspective on building reliable, interpretable science-focused agents.
Reading List
Introduction
- Towards Scientific Discovery with Generative AI: Progress, Opportunities, and Challenges (AAAI’25)
- A Survey on Large Language Models for Scientific Research
- Transformer-Based Planning for Symbolic Regression (NeurIPS’23)
- SNIP: Bridging Mathematical Symbolic and Numeric Realms with Unified Pre-training (ICLR’24)
- Evaluating Large Language Models in Scientific Discovery
Pre-experiment Phase
- Large Language Models for Automated Open-domain Scientific Hypotheses Discovery (ACL’24) [Github]
- MOOSE-Chem: Large Language Models for Rediscovering Unseen Chemistry Scientific Hypotheses (ICLR’25) [Github]
- MOOSE-Chem2: Exploring LLM Limits in Fine-Grained Scientific Hypothesis Discovery via Hierarchical Search (NeurIPS’25) [Github]
- ResearchBench: Benchmarking LLMs in Scientific Discovery via Inspiration-Based Task Decomposition
Experiment-guided Phase (Efficient Experimentation)
- LLM-SR: Scientific Equation Discovery via Programming with Large Language Models (ICLR’25 Oral)
- LLM-SRBench: A New Benchmark for Scientific Equation Discovery with Large Language Models (ICML’25 Oral)
- Data-Efficient Symbolic Regression via Foundation Model Distillation
- SURFACEBENCH: Can Self-Evolving LLMs Find the Equations of 3D Scientific Surfaces?
- Decompose, Adapt, and Evolve: Towards Efficient Scientific Equation Discovery with Large Language Models (NeurIPS’25 Mathematical Reasoning and AI workshop)
Experiment-guided Phase (Costly Experimentation)
- MOOSE-Chem3: Toward Experiment-Guided Hypothesis Ranking via Simulated Experimental Feedback (NeurIPS’25 AI4S workshop) [Github]
- Accelerating Materials Design via LLM-Guided Evolutionary Search
- MLR-Copilot: Autonomous Machine Learning Research Based on Large Language Model Agents
- LMR-BENCH: Evaluating LLM Agent’s Ability on Reproducing Language Modeling Research
- Learning to Generate Research Ideas with Dynamic Control (AAAI’25 AI4Research workshop)