Our research methodology is centred around two core automated systems: an AI scientist for generating new scientific research and an automated reviewer for rigorous evaluation. These systems work in concert to explore the potential of AI in accelerating scientific discovery.
The AI Scientist
The AI Scientist is an agentic system designed to autonomously conduct machine learning research. We present results for two modes: a template-based system that extends human-provided code and a more open-ended template-free system that operates with much less prior guidance. The detailed prompts used for each system are provided in Supplementary Information sections A.1.1 and A.2.6. More results and analyses of the papers generated by each system are provided in Supplementary Information sections B.1, C.1, C.2, D.1 and D.2.
Foundational technologies
Both versions are built upon autoregressive LLMs3,4,5, which learn to generate text by modelling the conditional probability of a new token given preceding tokens. Through vast data and model scaling, LLMs exhibit human-like abilities, including reasoning and code generation. Agentic patterns49, such as few-shot prompting50 and self-reflection51, are leveraged by The AI Scientist to improve performance and reliability. For code generation, the template-based system uses the state-of-the-art open-source coding assistant Aider52, which is designed to implement features, fix bugs or refactor code in existing codebases. To go further and effectively use more test-time compute, the template-free system uses LLMs to power a tree search without relying on Aider.
Template-based AI Scientist
The system is provided with a starting code template that reproduces a simple training run from a popular algorithm on a standard benchmark (for example, training a small transformer53 on the works of Shakespeare). Its workflow unfolds in three phases:
1. Idea generation: The process begins with a simple experiment defined by a human-provided code template. The system then enters an iterative loop of idea generation and refinement using LLMs as a mutation operator. In each iteration, it proposes a batch of new research ideas that are variations or extensions of existing ideas in its growing archive. Each idea is a structured object containing a descriptive title, a summary of the core hypothesis, a detailed experimental plan, and self-assessed scores for interestingness (1–10 scale), novelty (1–10 scale) and feasibility (1–10 scale). This iterative growth of an idea archive was inspired by open-endedness algorithms that maintain a diverse collection of artefacts20,54. To enforce novelty, each proposed idea is automatically checked against the scientific literature through the Semantic Scholar API31; ideas with high semantic similarity to existing works are discarded. The system is prompted to act as an ‘ambitious AI PhD student who is looking to publish a paper that will contribute significantly to the field’. For the novelty assessment, the system conducts up to ten rounds of literature search queries, and in each round, the system can refine its search based on previous results.
2. Experiment execution: Once a promising idea is selected from the archive, the system devises a multi-step experimental plan with up to five experiments. It then executes this plan sequentially using Aider to modify the codebase. A key feature of this phase is its robustness to runtime errors. The system automatically detects execution failures, captures the error logs and invokes an instance of the Aider agent52 to perform automated debugging. The Aider agent is prompted with the failing code and the error message, and it then generates a patch, with up to four reattempt cycles per experiment. The corrected code is then used to rerun the experiment with a timeout of 7,200 s per experiment. All experimental outcomes, including metrics, generated plots and observations, are logged in an experimental journal. This journal serves as a form of memory and informs the subsequent steps in the experimental plan.
3. Manuscript generation: Upon completing the experimental phase, the system synthesizes the findings into a full scientific paper. To do so, it uses Aider to populate a standard conference LaTeX template. Aider writes sections, including the introduction, methods, results and conclusion. The results section is written by analysing the experimental journal, summarizing key findings and embedding the generated figures. To situate the work within the broader scientific context, the system constructs a related work section by querying the Semantic Scholar API for relevant literature (up to 20 search rounds) and generating summaries for each cited paper. The manuscript undergoes several passes of automated editing and refinement to improve clarity and coherence. Finally, the system compiles the LaTeX source and automatically corrects any compilation errors (up to five correction rounds) to produce a final PDF.
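The robustness loop in the experiment-execution phase (run, capture errors, invoke a debugger, retry up to four times, with a 7,200 s timeout) can be sketched as follows. The function and parameter names are illustrative, and `debug_fn` is a hypothetical stand-in for the invoked Aider debugging agent, which patches the failing script in place:

```python
import subprocess
import sys


def run_experiment(script_path, debug_fn, max_retries=4, timeout_s=7200):
    """Run an experiment script, invoking an LLM debugger on failure.

    `debug_fn` stands in for the Aider debugging agent: it receives the
    failing script path and the captured error log, and is expected to
    patch the script in place before the next attempt.
    """
    for attempt in range(max_retries + 1):
        try:
            result = subprocess.run(
                [sys.executable, script_path],
                capture_output=True, text=True, timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            return {"status": "timeout", "attempts": attempt + 1}
        if result.returncode == 0:
            return {"status": "success", "attempts": attempt + 1,
                    "stdout": result.stdout}
        if attempt < max_retries:
            debug_fn(script_path, result.stderr)  # LLM-generated patch
    return {"status": "failed", "attempts": max_retries + 1}
```

In the actual system, each successful run would additionally append its metrics and observations to the experimental journal.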
Template-free AI Scientist
To overcome the limitations of a fixed starting codebase, we developed a template-free version capable of more open-ended discovery. We use OpenAI’s o3 for idea generation and code critique during experiments due to its strong reasoning capabilities, Anthropic’s Claude Sonnet 4 for code generation, OpenAI’s GPT-4o for cost-efficient vision-language tasks and OpenAI’s o4-mini for cost-efficient reasoning during the review stage. This version introduces several key enhancements.
Generalized idea generation
The ideation process used by the system is more abstract and not tethered to an initial code implementation. It begins by generating high-level research proposals that resemble the abstract of a scientific paper. These proposals articulate a research problem, propose a new method and hypothesize the expected outcomes. To ensure the proposals are both grounded and new, this process is tightly integrated with a literature review module that queries external academic databases to identify knowledge gaps and avoid rediscovering existing work. The system uses structured prompts to guide idea generation and reflection rounds to refine proposals based on the literature search results (see Supplementary Information section A.2.6 for prompts).
Experiment progress manager
Real-world scientific experimentation typically proceeds through distinct stages, from initial feasibility assessments to detailed ablation analyses. To emulate this structured approach, we introduced an experiment progress manager to coordinate four clearly defined stages of scientific experimentation: (1) start with a preliminary investigation to test basic viability, (2) tune the hyperparameters for optimization, (3) execute the main research agenda and (4) conclude with ablation studies to understand the contribution of different components. Each stage has explicit stopping criteria. Stage 1 concludes when a basic working prototype has successfully executed. Stage 2 ends when the experiments stabilize, as indicated by convergence in training curves and successful execution across at least two datasets. Stages 3 and 4 conclude when the allocated computational budget is exhausted. Each stage conducts its own tree search, the specifics of which are detailed in the next section. Each node has a maximum experiment runtime of 1 h. At the end of each stage, an LLM-based evaluator assesses all leaf nodes and selects the most promising one to serve as the root for the next stage of exploration, thus effectively pruning less promising research avenues.
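The four stages and their stopping criteria could be sketched as a simple state machine. Stage names, the `state` keys and the criterion checks below are illustrative assumptions, not the actual implementation:

```python
# Hypothetical stage-progression sketch; names and criteria are
# illustrative, mirroring the stopping rules described above.
STAGES = [
    "preliminary_investigation",   # stage 1: basic viability
    "hyperparameter_tuning",       # stage 2: stabilize training
    "main_research_agenda",        # stage 3: core experiments
    "ablation_studies",            # stage 4: component analysis
]


def stage_complete(stage, state):
    """Per-stage stopping criteria."""
    if stage == "preliminary_investigation":
        return state["prototype_runs"]  # a working prototype executed
    if stage == "hyperparameter_tuning":
        # Converged training curves and success on at least two datasets.
        return state["curves_converged"] and state["datasets_succeeded"] >= 2
    # Stages 3 and 4 end when the compute budget is exhausted.
    return state["budget_remaining"] <= 0


def next_stage(stage, state):
    """Advance to the next stage once the current one is complete."""
    idx = STAGES.index(stage)
    if stage_complete(stage, state) and idx + 1 < len(STAGES):
        return STAGES[idx + 1]
    return stage
```

In the real system, the transition between stages would also involve the LLM-based evaluator selecting the best leaf node as the next root.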
Parallelized agentic tree search for experimentation
To manage the complexity of open-ended research, the sequential workflow of the template-based version of The AI Scientist is replaced with a parallelized agentic tree. Figure 3a is an overview of the approach and Fig. 3b shows a tree generated by an actual run. By default, it uses Claude Sonnet 4 for code generation. We provide results for different LLM model choices in Fig. 1b.
Each experimental node within the agentic tree search undergoes the following execution cycle. First, Claude Sonnet 4 generates both a concrete experimentation plan and the associated Python code to implement the experiment. The generated code is immediately executed in a Python interpreter. If the execution encounters an error, the error message is recorded and the node is marked as buggy, ending the current execution cycle for that node. If the execution succeeds, the experiment proceeds to the plotting phase.
The system is prompted to save all relevant experimental outputs (training and validation metrics, losses and so on) into structured numpy files. In the plotting phase, The AI Scientist reads these stored results, together with the experiment code, and generates visualizations that summarize and illustrate the findings. These visualizations are subsequently passed to a vision-language model (VLM) for critique. Any issues flagged by the VLM (such as unclear labels, missing legends or misleading visualizations) result in the node being marked as buggy, and this feedback is recorded for future debugging. Nodes that successfully execute and pass the VLM review without issue are designated as non-buggy.
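The structured numpy result files could look like the following. The filename and key names are assumptions for illustration, not the exact schema the system is prompted to use:

```python
import numpy as np

# Illustrative sketch of a structured result file; the key names are
# assumptions, not the actual schema the system is prompted to write.
results = {
    "train_loss": np.array([2.31, 1.87, 1.52]),
    "val_loss": np.array([2.40, 1.95, 1.70]),
    "val_accuracy": np.array([0.41, 0.55, 0.63]),
}
np.save("experiment_results.npy", results, allow_pickle=True)

# A later plotting step reloads the dictionary before generating figures:
loaded = np.load("experiment_results.npy", allow_pickle=True).item()
```

Storing everything in one dictionary-valued file keeps the plotting phase decoupled from the experiment code that produced the metrics.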
Each node is defined as a collection comprising an experiment script (for example, a Python file), a textual description of the high-level plan implemented in the script, an execution error trace (if applicable), experiment runtime, performance metrics recorded during the experiment, code critique from OpenAI o3 after running the script, a visualization script, file paths to the generated figures, feedback from a VLM on those figures and the final status of the node (either buggy or non-buggy).
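For illustration, the node record described above could be represented as a small dataclass. The field names below are assumptions, not the actual implementation's schema:

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ExperimentNode:
    """Sketch of the per-node record; field names are illustrative."""
    experiment_script: str                 # e.g. path to a Python file
    plan: str                              # high-level plan the script implements
    error_trace: Optional[str] = None      # execution error trace, if any
    runtime_s: float = 0.0                 # experiment runtime
    metrics: dict = field(default_factory=dict)   # recorded performance metrics
    code_critique: str = ""                # critique from the reasoning model
    plot_script: str = ""                  # visualization script
    figure_paths: list = field(default_factory=list)
    vlm_feedback: str = ""                 # VLM feedback on the figures
    is_buggy: bool = False                 # final node status

    @property
    def status(self) -> str:
        return "buggy" if self.is_buggy else "non-buggy"
```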
At each iteration, the system selects several nodes from the existing tree to expand in parallel. With a predefined probability, a buggy node is chosen (thus prioritizing error resolution and debugging); otherwise, a non-buggy node is selected for further refinement and improvement. When choosing between non-buggy nodes, the system uses a best-first search strategy guided by GPT-4o, which evaluates candidates based on factors like performance metrics, training dynamics and the quality of the plots generated. The selected nodes are expanded by creating a new child node. The system attempts debugging if the parent node was buggy or refines and improves upon the previous experiment if the parent was non-buggy. Claude Sonnet 4 is used to generate the plan and experiment code for each new child node, after which all new nodes are executed concurrently in parallel, which greatly accelerates the exploration process. In addition to buggy and non-buggy nodes, the system uses specialized node variants tailored to specific experimental needs:
- Hyperparameter nodes systematically explore alternative hyperparameter configurations during stage 2. The system maintains records of previously tested hyperparameters to prevent redundant experiments. Errors encountered during hyperparameter tuning trigger the creation of corresponding debug nodes.
- Ablation nodes run ablation studies during stage 4, assessing the importance of the various components or assumptions underlying the experiment. Like hyperparameter nodes, previously tested ablation conditions are tracked to avoid repetition, and debugging nodes are created in response to any errors encountered.
- Replication nodes execute replicates of their parent experiments using different random seeds. Typically, several replication nodes are created to enable the calculation of statistical measures (mean and s.d.) of experimental outcomes, which enhances the robustness of the results.
- Aggregation nodes are special nodes created to consolidate and visualize the combined results of replication nodes. Unlike other node types, aggregation nodes do not conduct new experiments but simply generate a Python script to aggregate and summarize previous results. The script produces figures that explicitly show mean and s.d.
The structured design of the experimental stages and tailored node types facilitates systematic exploration across all stages. Unlike some LLM agents that rigidly follow predefined, fine-grained workflow graphs, The AI Scientist adopts a looser structure that guides the entire empirical research cycle, thus enabling flexible system behaviour while maintaining coherence across iterative stages. See Supplementary Information sections A.2.6 and A.2.9 for the prompts and detailed hyperparameters, respectively.
VLM integration
This system incorporates VLMs using GPT-4o to analyse and provide feedback on visual data. During experimentation, the plots generated are fed to a VLM, which is prompted to act as a scientist and critique them. For example, it might flag nonsensical axes or issues in the quality of generated examples or suggest clearer ways to present the data. This feedback is used to generate new experimental nodes in the tree search aimed at addressing the identified issues. During manuscript preparation, the VLM assesses the alignment between figures and their corresponding captions to ensure that a caption accurately describes the plot and highlights the key takeaways, thus improving the overall quality and clarity of the paper. The VLM reviews include detailed analyses of figure content, caption accuracy and integration with the main text (see Supplementary Information section A.2.6 for prompts).
Generalized dataset access
To broaden its research capabilities, the system is prompted to dynamically integrate datasets from public repositories by formulating queries to the HuggingFace Hub55. A set of ten example datasets available on HuggingFace is listed in the prompt, and the system can automatically generate the data-loading code needed to use a selected dataset in its experiments. This approach partially relaxes the constraint of working with a fixed, predefined set of datasets by allowing human scientists to easily update the candidate list. For datasets not available on HuggingFace, human scientists can download them from public data repositories (for example, open-access archives), store them locally, and add usage instructions to the prompt. These locally stored datasets can then be used alongside HuggingFace datasets by The AI Scientist (see Supplementary Information section A.2.6 for prompts).
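A minimal sketch of this mechanism is below: a human-editable candidate list and a helper that emits the data-loading code for a chosen dataset. The candidate names and the code template are illustrative assumptions, not the actual prompt contents:

```python
# Human-editable candidate list, analogous to the example datasets
# included in the prompt (names here are illustrative).
CANDIDATE_DATASETS = ["ag_news", "imdb", "squad"]


def make_loader_code(dataset_name: str, split: str = "train") -> str:
    """Generate data-loading code for a dataset on the candidate list.

    In the real system, the LLM itself writes this code; the fixed
    template here is a simplified stand-in.
    """
    if dataset_name not in CANDIDATE_DATASETS:
        raise ValueError(f"{dataset_name} is not in the candidate list")
    return (
        "from datasets import load_dataset\n"
        f'data = load_dataset("{dataset_name}", split="{split}")\n'
    )
```

Updating `CANDIDATE_DATASETS` is all a human scientist needs to do to expose new datasets to the system.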
Enhanced manuscript writing
The template-free system moves away from the incremental Aider-based approach to direct LaTeX generation using a reasoning model such as OpenAI’s o156 followed by reflection51. The system first aggregates experimental results from several stages into compound figures using a dedicated plot-aggregation step. The manuscript-writing process includes specific prompts for different workshop formats (for example, the ICBINB workshop focusing on negative results), with detailed guidelines for each section, including the title, abstract, introduction, methods, experiments and conclusions. The system undergoes several reflection cycles, each time incorporating feedback from LaTeX linters and VLM reviews to improve figure quality and text–figure alignment (see our code and Supplementary Information section A.2.6 for prompts and full details).
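The reflection cycles described above amount to a loop of gathering critique and revising until no issues remain. The sketch below uses hypothetical `critics` (stand-ins for the LaTeX linter and the VLM figure review) and a `revise_fn` stand-in for the writing model:

```python
def reflect_and_revise(draft, critics, revise_fn, max_cycles=3):
    """Sketch of the reflection loop for manuscript writing.

    Each cycle gathers feedback from the critics (e.g. a LaTeX linter
    and a VLM figure review, both stand-ins here) and asks the writing
    model (`revise_fn`) to revise the draft. Stops early once no critic
    reports an issue.
    """
    for _ in range(max_cycles):
        feedback = [msg for critic in critics for msg in critic(draft)]
        if not feedback:
            break
        draft = revise_fn(draft, feedback)
    return draft
```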
The complete generation process for the template-free system typically takes from several hours to over 15 h, depending on problem complexity.
The Automated Reviewer
To assess the quality of the AI-generated research, we built an automated reviewer using o4-mini57. This component was designed to emulate the peer-review process of a top-tier machine learning conference by adhering to the official NeurIPS reviewer guidelines. The agent processes the PDF of a manuscript to produce a structured review, including numerical scores for soundness, presentation and contribution, along with a list of strengths and weaknesses and a preliminary accept or reject decision (Supplementary Information section A.3). All prompts used for The Automated Reviewer are provided in Supplementary Information section A.3.1.
Review process
The Automated Reviewer follows a multistage process. First, the system is prompted with the role: ‘You are an AI researcher who is reviewing a paper that was submitted to a prestigious ML venue.’ The review prompt provides the paper content along with detailed NeurIPS reviewer guidelines and asks for a structured JSON response, including a summary, strengths, weaknesses, questions, limitations, ethical concerns and numerical scores (soundness, presentation, contribution, overall score 1–10 and confidence level). To improve robustness, five independent reviews are generated for each paper and aggregated into a single meta-review, with an LLM taking the role of an area chair to find consensus among the individual reviews; this meta-review serves as the final assessment.
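The aggregation step could be sketched as follows. Averaging the scores and taking a majority vote here are an illustrative stand-in for the LLM area chair that finds consensus among the five reviews:

```python
from statistics import mean


def meta_review(reviews):
    """Aggregate independent reviews into a meta-review.

    Score averaging and a majority vote on the decision are simplified
    stand-ins for the LLM acting as an area chair.
    """
    scores = [r["overall"] for r in reviews]
    accepts = sum(r["decision"] == "accept" for r in reviews)
    return {
        "mean_overall": mean(scores),
        "decision": "accept" if accepts > len(reviews) / 2 else "reject",
    }
```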
Validation
We benchmarked The Automated Reviewer against human decisions using ICLR data from the publicly available OpenReview dataset33. The Automated Reviewer achieved a balanced accuracy comparable to that of humans (69% versus 66%; see Supplementary Information section A.3.2 for details) and a higher F1 score than inter-human group agreement (0.62 versus 0.49) in the NeurIPS 2021 consistency experiment34, for which roughly 10% of submissions were randomly selected and sent to two independent review committees, thus providing a real-world benchmark of inter-reviewer consistency (Table 1). These results indicate that LLM-based agents can provide valuable feedback that aligns with the opinion of the average human expert. We note that the ICLR and NeurIPS paper pools contain different sets of submissions, so there is a distribution shift and the comparison is not exact. However, ICLR is the only major machine learning conference that releases all accept and reject decisions, and the NeurIPS 2021 experiment is the only modern human consistency experiment, so this is the only comparison available.
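For binary accept/reject decisions, the two metrics reported above have standard definitions, sketched below (a textbook computation, not the evaluation code used in the study):

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall for binary accept(1)/reject(0) labels."""
    def recall(cls):
        idx = [i for i, t in enumerate(y_true) if t == cls]
        return sum(y_pred[i] == cls for i in idx) / len(idx)
    return (recall(0) + recall(1)) / 2


def f1_score(y_true, y_pred):
    """Harmonic mean of precision and recall for the accept(1) class."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

Balanced accuracy is preferred over plain accuracy here because conference decisions are heavily imbalanced toward rejection.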
Ethics approval
This study received ethics approval from the University of British Columbia Behavioral Research Ethics Board (Protocol No. H24-02652). The research was conducted in full cooperation with the ICLR conference leadership and the relevant workshop organizers. In accordance with the approved protocol, human participants (peer reviewers) were informed that a small number of submissions to the workshop were AI-generated, although not which specific papers. Participants had the option to opt out of reviewing any potentially AI-generated manuscripts. All AI-generated submissions were withdrawn following the review process, regardless of the outcome.
