Russell, S. & Norvig, P. Artificial Intelligence: A Modern Approach (Pearson, 2021).
Turing, A. M. Computing machinery and intelligence. Mind 59, 433–460 (1950).
Tracy, M., Cerdá, M. & Keyes, K. M. Agent-based modeling in public health: current applications and future directions. Annu. Rev. Public Health 39, 77–94 (2018).
Sridharan, P. & Ghosh, M. Gene expression and agent-based modeling improve precision prognosis in breast cancer. Sci. Rep. 15, 17059 (2025).
Wei, J. et al. Chain-of-thought prompting elicits reasoning in large language models. Adv. Neural Inf. Process. Syst. 35, 24824–24837 (2022).
Christiano, P. F. et al. Deep reinforcement learning from human preferences. In Advances in Neural Information Processing Systems Vol. 30 (eds Guyon, I. et al.) (Curran Associates, 2017).
Wang, Y. et al. Reinforcement learning for reasoning in large language models with one training example. Preprint at arXiv https://doi.org/10.48550/arXiv.2504.20571 (2025).
DeepSeek-AI et al. DeepSeek-V3.2: Pushing the frontier of open large language models. Preprint at arXiv https://doi.org/10.48550/arXiv.2512.02556 (2025).
Rastogi, A. et al. Magistral. Preprint at arXiv https://doi.org/10.48550/arXiv.2506.10910 (2025).
LeCun, Y., Bengio, Y. & Hinton, G. Deep learning. Nature 521, 436–444 (2015).
Devlin, J., Chang, M.-W., Lee, K. & Toutanova, K. BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies Vol. 1 (eds Burstein, J., Doran, C. & Solorio, T.) 4171–4186 (Association for Computational Linguistics, 2019).
BigScience Workshop et al. BLOOM: a 176B-parameter open-access multilingual language model. Preprint at arXiv https://doi.org/10.48550/arXiv.2211.05100 (2022).
Ji, Z. et al. Survey of hallucination in natural language generation. ACM Comput. Surv. 55, 248:1–248:38 (2023).
Kalai, A. T., Nachum, O., Vempala, S. S. & Zhang, E. Why language models hallucinate. Preprint at arXiv https://doi.org/10.48550/arXiv.2509.04664 (2025).
Jayaraman, P., Desman, J., Sabounchi, M., Nadkarni, G. N. & Sakhuja, A. A primer on reinforcement learning in medicine for clinicians. NPJ Digit. Med. 7, 337 (2024).
Sutton, R. S. & Barto, A. G. Reinforcement Learning: An Introduction (The MIT Press, 2020).
Ouyang, L. et al. Training language models to follow instructions with human feedback. In Proceedings of the 36th International Conference on Neural Information Processing Systems (eds Koyejo, S. et al.) 27730–27744 (Curran Associates, 2022).
Casper, S. et al. Open problems and fundamental limitations of reinforcement learning from human feedback. Trans. Mach. Learn. Res. https://openreview.net/pdf?id=bx24KpJ4Eb (2023).
Skalse, J., Howe, N. H. R., Krasheninnikov, D. & Krueger, D. Defining and characterizing reward hacking. In Proceedings of the 36th International Conference on Neural Information Processing Systems (eds Koyejo, S. et al.) 9460–9471 (Curran Associates, 2022).
Uesato, J. et al. Solving math word problems with process- and outcome-based feedback. Preprint at arXiv https://doi.org/10.48550/arXiv.2211.14275 (2022).
Lightman, H. et al. Let’s verify step by step. In Proceedings of the 12th International Conference on Learning Representations (eds Kim, B. et al.) 39578–39601 (2024).
Guo, D. et al. DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning. Nature 645, 633–638 (2025).
Bai, Y. et al. Constitutional AI: harmlessness from AI feedback. Preprint at arXiv https://doi.org/10.48550/arXiv.2212.08073 (2022).
Novikov, A. et al. AlphaEvolve: a coding agent for scientific and algorithmic discovery. Preprint at arXiv https://doi.org/10.48550/arXiv.2506.13131 (2025).
Gibney, E. DeepMind unveils ‘spectacular’ general-purpose science AI. Nature 641, 827–828 (2025).
Zhang, J., Hu, S., Lu, C., Lange, R. & Clune, J. Darwin Gödel Machine: open-ended evolution of self-improving agents. Preprint at arXiv https://doi.org/10.48550/arXiv.2505.22954 (2025).
Huang, J. & Chang, K. C.-C. Towards reasoning in large language models: a survey. In Proceedings of the Findings of the Association for Computational Linguistics: ACL 2023 (eds Rogers, A., Boyd-Graber, J. & Okazaki, N.) 1049–1065 (Association for Computational Linguistics, 2023).
Hendrycks, D. et al. Measuring mathematical problem solving with the MATH dataset. In Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks Vol. 1 (eds Vanschoren, J. & Yeung, S.) (2021).
Rajani, N. F., McCann, B., Xiong, C. & Socher, R. Explain yourself! Leveraging language models for commonsense reasoning. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (eds Korhonen, A., Traum, D. & Màrquez, L.) 4932–4942 (Association for Computational Linguistics, 2019).
Taylor, R. et al. Galactica: a large language model for science. Preprint at arXiv https://doi.org/10.48550/arXiv.2211.09085 (2022).
Wang, L. et al. Parameter-efficient fine-tuning in large language models: a survey of methodologies. Artif. Intell. Rev. 58, 227 (2025).
Fu, Y., Peng, H., Sabharwal, A., Clark, P. & Khot, T. Complexity-based prompting for multi-step reasoning. In Proceedings of the 11th International Conference on Learning Representations (eds Liu, Y. et al.) (2023).
Liu, J. et al. What makes good in-context examples for GPT-3? In Proceedings of Deep Learning Inside Out (DeeLIO 2022): The 3rd Workshop on Knowledge Extraction and Integration for Deep Learning Architectures (eds Agirre, E., Apidianaki, M. & Vulić, I.) 100–114 (Association for Computational Linguistics, 2022).
Zhang, Z., Zhang, A., Li, M. & Smola, A. Automatic chain of thought prompting in large language models. In Proceedings of the 11th International Conference on Learning Representations (eds Liu, Y. et al.) (2023).
Yao, S. et al. Tree of thoughts: deliberate problem solving with large language models. In Proceedings of the 37th Conference on Neural Information Processing Systems Vol. 36 (eds Oh, A. et al.) 11809–11822 (Curran Associates, 2023).
Besta, M. et al. Graph of thoughts: solving elaborate problems with large language models. AAAI 38, 17682–17690 (2024).
Shojaee, P. et al. The illusion of thinking: understanding the strengths and limitations of reasoning models via the lens of problem complexity. Preprint at arXiv https://doi.org/10.48550/arXiv.2506.06941 (2025).
Goyal, S. et al. Think before you speak: training language models with pause tokens. In Proceedings of the 12th International Conference on Learning Representations (eds Kim, B. et al.) 27896–27923 (2024).
Jin, Q., Dhingra, B., Liu, Z., Cohen, W. & Lu, X. PubMedQA: a dataset for biomedical research question answering. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP) (eds Inui, K., Jiang, J., Ng, V. & Wan, X.) 2567–2577 (Association for Computational Linguistics, 2019).
Cobbe, K. et al. Training verifiers to solve math word problems. Preprint at arXiv https://doi.org/10.48550/arXiv.2110.14168 (2021).
Wang, P. et al. Math-Shepherd: verify and reinforce LLMs step-by-step without human annotations. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics Vol. 1 (eds Ku, L.-W., Martins, A. & Srikumar, V.) 9426–9439 (Association for Computational Linguistics, 2024).
Shinn, N., Cassano, F., Gopinath, A., Narasimhan, K. & Yao, S. Reflexion: language agents with verbal reinforcement learning. In Proceedings of the 37th International Conference on Neural Information Processing Systems Vol. 36 (eds Oh, A. et al.) 8634–8652 (Curran Associates, 2023).
Gou, Z. et al. CRITIC: large language models can self-correct with tool-interactive critiquing. In Proceedings of the 12th International Conference on Learning Representations (eds Kim, B. et al.) 57734–57811 (2024).
Madaan, A. et al. Self-refine: iterative refinement with self-feedback. In Proceedings of the 37th International Conference on Neural Information Processing Systems Vol. 36 (eds Oh, A. et al.) 46534–46594 (Curran Associates, 2023).
Crosby, M., Rovatsos, M. & Petrick, R. Automated agent decomposition for classical planning. In Proceedings of the International Conference on Automated Planning and Scheduling Vol. 23 (eds Borrajo, D. et al.) 46–54 (2013).
Huang, X. et al. Understanding the planning of LLM agents: a survey. Preprint at arXiv https://doi.org/10.48550/arXiv.2402.02716 (2024).
Zhou, D. et al. Least-to-most prompting enables complex reasoning in large language models. In Proceedings of the 11th International Conference on Learning Representations (eds Liu, Y. et al.) (2023).
Xu, B. et al. ReWOO: decoupling reasoning from observations for efficient augmented language models. Preprint at arXiv https://doi.org/10.48550/arXiv.2305.18323 (2023).
Yao, S. et al. ReAct: synergizing reasoning and acting in language models. In Proceedings of the 11th International Conference on Learning Representations (eds Liu, Y. et al.) (2023).
Shen, Y. et al. HuggingGPT: solving AI tasks with ChatGPT and its friends in Hugging Face. In Proceedings of the 37th Conference on Neural Information Processing Systems Vol. 36 (eds Oh, A. et al.) 38154–38180 (Curran Associates, 2023).
Prasad, A. et al. ADaPT: as-needed decomposition and planning with language models. In Proceedings of Findings of the Association for Computational Linguistics: NAACL 2024 (eds Duh, K., Gomez, H. & Bethard, S.) 4226–4252 (Association for Computational Linguistics, 2024).
Liu, B. et al. LLM + P: empowering large language models with optimal planning proficiency. Preprint at arXiv https://doi.org/10.48550/arXiv.2304.11477 (2023).
Feng, P. et al. AGILE: a novel reinforcement learning framework of LLM agents. In Proceedings of the 38th International Conference on Neural Information Processing Systems Vol. 37 (eds Globerson, A. et al.) 5244–5284 (Curran Associates, 2024).
Chang, C. C. et al. Second-generation PLINK: rising to the challenge of larger and richer datasets. Gigascience 4, 7 (2015).
Pedregosa, F. et al. Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011).
Hutter, F., Kotthoff, L. & Vanschoren, J. (eds). Automated Machine Learning: Methods, Systems, Challenges 151–160 (Springer International Publishing, 2019).
Hernandez, J. G., Saini, A. K., Ghosh, A. & Moore, J. H. The tree-based pipeline optimization tool: tackling biomedical research problems with genetic programming and automated machine learning. Patterns 6, 101314 (2025).
Himmelstein, D. S. et al. Systematic integration of biomedical knowledge prioritizes drugs for repurposing. eLife 6, e26726 (2017).
Swanson, K., Wu, W., Bulaong, N. L., Pak, J. E. & Zou, J. The Virtual Lab of AI agents designs new SARS-CoV-2 nanobodies. Nature 646, 716–723 (2025).
Schick, T. et al. Toolformer: language models can teach themselves to use tools. In Proceedings of the 37th International Conference on Neural Information Processing Systems Vol. 36 (eds Oh, A. et al.) 68539–68551 (Curran Associates, 2023).
Lu, P. et al. Chameleon: plug-and-play compositional reasoning with large language models. In Proceedings of the 37th International Conference on Neural Information Processing Systems Vol. 36 (eds Oh, A. et al.) 43447–43478 (Curran Associates, 2023).
Patil, S. G., Zhang, T., Wang, X. & Gonzalez, J. E. Gorilla: large language model connected with massive APIs. In Proceedings of the 38th International Conference on Neural Information Processing Systems Vol. 37 (eds Globerson, A. et al.) 126544–126565 (Curran Associates, 2024).
Lewis, P. et al. Retrieval-augmented generation for knowledge-intensive NLP tasks. In Proceedings of the 34th International Conference on Neural Information Processing Systems (eds Larochelle, H. et al.) 9459–9474 (Curran Associates, 2020).
Petroni, F. et al. How context affects language models’ factual predictions. In Proceedings of the Automated Knowledge Base Construction (eds McCallum, A. et al.) (2020).
Fan, W. et al. A survey on RAG meeting LLMs: towards retrieval-augmented large language models. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (eds Baeza-Yates, R. & Bonchi, F.) 6491–6501 (Association for Computing Machinery, 2024).
Jeong, M., Sohn, J., Sung, M. & Kang, J. Improving medical reasoning through retrieval and self-reflection with retrieval-augmented large language models. Bioinformatics 40, i119–i129 (2024).
Lu, J. et al. MemoChat: tuning LLMs to use memos for consistent long-range open-domain conversation. Preprint at arXiv https://doi.org/10.48550/arXiv.2308.08239 (2023).
Zhong, W., Guo, L., Gao, Q., Ye, H. & Wang, Y. MemoryBank: enhancing large language models with long-term memory. In Proceedings of the AAAI Conference on Artificial Intelligence (eds Wooldridge, M., Dy, J. & Natarajan, S.) 19724–19731 (2024).
Park, J. S. et al. Generative agents: interactive simulacra of human behavior. Preprint at arXiv https://doi.org/10.48550/arXiv.2304.03442 (2023).
Li, Y. et al. ChatDoctor: a medical chat model fine-tuned on a large language model meta-AI (LLaMA) using medical domain knowledge. Cureus 15, e40895 (2023).
Rasmussen, P., Paliychuk, P., Beauvais, T., Ryan, J. & Chalef, D. Zep: a temporal knowledge graph architecture for agent memory. Preprint at arXiv https://doi.org/10.48550/arXiv.2501.13956 (2025).
Edge, D. et al. From local to global: a graph RAG approach to query-focused summarization. Preprint at arXiv https://doi.org/10.48550/arXiv.2404.16130 (2025).
Zhang, Z. et al. A survey on the memory mechanism of large language model-based agents. ACM Trans. Inf. Syst. 43, 155:1–155:47 (2025).
Yan, B. et al. Beyond self-talk: a communication-centric survey of LLM-based multi-agent systems. Preprint at arXiv https://doi.org/10.48550/arXiv.2502.14321 (2025).
Qian, C. et al. ChatDev: communicative agents for software development. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics Vol. 1 (eds Ku, L.-W., Martins, A. & Srikumar, V.) 15174–15186 (Association for Computational Linguistics, 2024).
Hong, S. et al. MetaGPT: meta programming for a multi-agent collaborative framework. In Proceedings of the 12th International Conference on Learning Representations (eds Kim, B. et al.) 23247–23275 (2024).
Zhuge, M. et al. GPTSwarm: language agents as optimizable graphs. In Proceedings of the 41st International Conference on Machine Learning Vol. 235 (eds Salakhutdinov, R. R. et al.) 62743–62767 (2024).
Google Cloud. Agent2Agent (A2A) Protocol. https://a2a-protocol.org/latest/ (2025).
Borghoff, U. M., Bottoni, P. & Pareschi, R. Human-artificial interaction in the age of agentic AI: a system-theoretical approach. Front. Hum. Dyn. 7, 1579166 (2025).
Hua, W. et al. Interactive speculative planning: enhance agent efficiency through co-design of system and user interface. In Proceedings of the 13th International Conference on Learning Representations (eds Yue, Y. et al.) 14256–14283 (2025).
Hou, X., Zhao, Y., Wang, S. & Wang, H. Model Context Protocol (MCP): landscape, security threats, and future research directions. Preprint at arXiv https://doi.org/10.48550/arXiv.2503.23278 (2025).
Kuehl, M. et al. BioContextAI is a community hub for agentic biomedical systems. Nat. Biotechnol. 43, 1755–1757 (2025).
Yang, J. et al. SWE-agent: agent-computer interfaces enable automated software engineering. In Proceedings of the 38th International Conference on Neural Information Processing Systems Vol. 37 (eds Globerson, A. et al.) 50528–50652 (Curran Associates, 2024).
Ferber, D. et al. Development and validation of an autonomous artificial intelligence agent for clinical decision-making in oncology. Nat. Cancer 6, 1337–1349 (2025).
Tang, X. et al. MedAgents: large language models as collaborators for zero-shot medical reasoning. In Proceedings of the Findings of the Association for Computational Linguistics: ACL 2024 (eds Ku, L.-W., Martins, A. & Srikumar, V.) 599–621 (Association for Computational Linguistics, 2024).
Tu, T. et al. Towards conversational diagnostic artificial intelligence. Nature 642, 442–450 (2025).
Li, S. et al. SciLitLLM: How to adapt LLMs for scientific literature understanding. In Proceedings of the 13th International Conference on Learning Representations (eds Yue, Y. et al.) 56025–56048 (2025).
Wang, Y. et al. Biomedical information retrieval with positive-unlabeled learning and knowledge graphs. ACM Trans. Intell. Syst. Technol. (2024).
Yang, Z., Dabre, R., Tanaka, H. & Okazaki, N. SciCap+: a knowledge augmented dataset to study the challenges of scientific figure captioning. J. Nat. Lang. Process. 31, 1140–1165 (2024).
Zhang, S. et al. A multimodal biomedical foundation model trained from fifteen million image–text pairs. NEJM AI 2, AIoa2400640 (2025).
Qi, B. et al. Large language models as biomedical hypothesis generators: a comprehensive evaluation. In Proceedings of the 1st Conference on Language Modeling (eds Artzi, Y. et al.) (2024).
Gottweis, J. et al. Towards an AI co-scientist. Preprint at arXiv https://doi.org/10.48550/arXiv.2502.18864 (2025).
Zhang, Y. et al. A comprehensive large-scale biomedical knowledge graph for AI-powered data-driven biomedical research. Nat. Mach. Intell. 7, 602–614 (2025).
Huang, K. et al. Automated hypothesis validation with agentic sequential falsifications. In Proceedings of the 42nd International Conference on Machine Learning Vol. 267 (eds Singh, A. et al.) 25372–25437 (PMLR, 2025).
O’Donoghue, O. et al. BioPlanner: automatic evaluation of LLMs on protocol planning in biology. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (eds Bouamor, H., Pino, J. & Bali, K.) 2676–2694 (Association for Computational Linguistics, 2023).
Roohani, Y. et al. BioDiscoveryAgent: an AI agent for designing genetic perturbation experiments. In Proceedings of the 13th International Conference on Learning Representations (eds Yue, Y. et al.) 26417–26466 (2025).
Liu, S. et al. DrugAgent: automating AI-aided drug discovery programming through LLM multi-agent collaboration. In Proceedings of the 2nd AI4Research Workshop: Towards a Knowledge-Grounded Scientific Research Lifecycle (eds Wang, Q. et al.) (2024).
Ma, M. D. et al. Orchestrating tool ecosystem of drug discovery with intention-aware LLM agents. In Towards Agentic AI for Science: Hypothesis Generation, Comprehension, Quantification, and Validation (eds Koutra, D. et al.) (2025).
Tang, X. et al. CellForge: agentic design of virtual cell models. Preprint at arXiv https://doi.org/10.48550/arXiv.2508.02276 (2025).
Turcan, A., Huang, K., Li, L. & Zhang, M. J. TusoAI: agentic optimization for scientific methods. Preprint at arXiv https://doi.org/10.48550/arXiv.2509.23986 (2025).
Huang, K. et al. Biomni: a general-purpose biomedical AI agent. Preprint at bioRxiv https://doi.org/10.1101/2025.05.30.656746 (2025).
Lu, C. et al. The AI Scientist: towards fully automated open-ended scientific discovery. Preprint at arXiv https://doi.org/10.48550/arXiv.2408.06292 (2024).
Yamada, Y. et al. The AI Scientist-v2: workshop-level automated scientific discovery via agentic tree search. Preprint at arXiv https://doi.org/10.48550/arXiv.2504.08066 (2025).
Ferrag, M. A., Tihanyi, N. & Debbah, M. From LLM reasoning to autonomous AI agents: a comprehensive review. Preprint at arXiv https://doi.org/10.48550/arXiv.2504.19678 (2025).
Yehudai, A. et al. Survey on evaluation of LLM-based agents. Preprint at arXiv https://doi.org/10.48550/arXiv.2503.16416 (2025).
Geva, M. et al. Did Aristotle use a laptop? A question answering benchmark with implicit reasoning strategies. Trans. Assoc. Comput. Linguist. 9, 346–361 (2021).
Chan, J. S. et al. MLE-bench: evaluating machine learning agents on machine learning engineering. In Proceedings of the 13th International Conference on Learning Representations (eds Yue, Y. et al.) 50466–50494 (2025).
Li, Y. et al. Competition-level code generation with AlphaCode. Science 378, 1092–1097 (2022).
Jimenez, C. E. et al. SWE-bench: can language models resolve real-world GitHub issues? In Proceedings of the 12th International Conference on Learning Representations (eds Kim, B. et al.) 54107–54157 (2024).
Chen, Z. et al. ScienceAgentBench: toward rigorous assessment of language agents for data-driven scientific discovery. In Proceedings of the 13th International Conference on Learning Representations (eds Yue, Y. et al.) 96934–96990 (2025).
Tian, M. et al. SciCode: a research coding benchmark curated by scientists. In Proceedings of the 38th Conference on Neural Information Processing Systems Datasets and Benchmarks Track Vol. 111 (eds Globerson, A. et al.) 30624–30650 (Curran Associates, 2024).
Srivastava, A. et al. Beyond the imitation game: quantifying and extrapolating the capabilities of language models. Trans. Mach. Learn. Res. https://openreview.net/pdf?id=uyTL5Bvosj (2023).
Jin, D. et al. What disease does this patient have? A large-scale open domain question answering dataset from medical exams. Appl. Sci. 11, 6421 (2021).
Pal, A., Umapathi, L. K. & Sankarasubbu, M. MedMCQA: a large-scale multi-subject multi-choice dataset for medical domain question answering. In Proceedings of the Conference on Health, Inference, and Learning Vol. 174 (eds Flores, G. et al.) 248–260 (PMLR, 2022).
Lou, R. et al. AAAR-1.0: assessing AI’s potential to assist research. In Proceedings of the 42nd International Conference on Machine Learning Vol. 267 (eds Singh, A. et al.) 40361–40383 (PMLR, 2025).
Wadden, D. et al. Fact or fiction: verifying scientific claims. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP) (eds Webber, B., Cohn, T., He, Y. & Liu, Y.) 7534–7550 (Association for Computational Linguistics, 2020).
Laurent, J. M. et al. LAB-Bench: measuring capabilities of language models for biology research. Preprint at arXiv https://doi.org/10.48550/arXiv.2407.10362 (2024).
Bragg, J. et al. AstaBench: rigorous benchmarking of AI agents with a scientific research suite. Preprint at arXiv https://doi.org/10.48550/arXiv.2510.21652 (2025).
Akhtar, M. et al. Croissant: a metadata format for ML-ready datasets. In Proceedings of the 8th Workshop on Data Management for End-to-End Machine Learning (eds Hulsebos, M., Interlandi, M. & Shankar, S.) 1–6 (Association for Computing Machinery, 2024).
Holmes, J. H. et al. Why is the electronic health record so challenging for research and clinical care? Methods Inf. Med. 60, 32–48 (2021).
Chen, Y. & Esmaeilzadeh, P. Generative AI in medical practice: in-depth exploration of privacy and security challenges. J. Med. Internet Res. 26, e53008 (2024).
European Commission. Regulation (EU) 2016/679 of the European Parliament and of the Council of 27 April 2016 on the protection of natural persons with regard to the processing of personal data and on the free movement of such data, and repealing Directive 95/46/EC (General Data Protection Regulation). https://gdpr-info.eu/ (2016).
U.S. Congress. Health Insurance Portability and Accountability Act of 1996 42 U.S.C. 201 note. https://www.congress.gov/bill/104th-congress/house-bill/3103 (1996).
Science and Technology Policy Office. Blueprint for an AI bill of rights: making automated systems work for the American people. https://www.govinfo.gov/app/details/GOVPUB-PREX23-PURL-gpo193638 (2022).
Das, B. C., Amini, M. H. & Wu, Y. Security and privacy challenges of large language models: a survey. ACM Comput. Surv. 57, 152:1–152:39 (2025).
Chen, Z., Xiang, Z., Xiao, C., Song, D. & Li, B. AgentPoison: red-teaming LLM agents via poisoning memory or knowledge bases. In Proceedings of the 38th International Conference on Neural Information Processing Systems Vol. 37 (eds Globerson, A. et al.) 130185–130213 (Curran Associates, 2024).
Buniello, A. et al. The NHGRI-EBI GWAS Catalog of published genome-wide association studies, targeted arrays and summary statistics 2019. Nucleic Acids Res. 47, D1005–D1012 (2019).
Benson, D. A. et al. GenBank. Nucleic Acids Res. 41, D36–D42 (2013).
Husom, E. J., Goknil, A., Shar, L. K. & Sen, S. The price of prompting: profiling energy use in large language models inference. Preprint at arXiv https://doi.org/10.48550/arXiv.2407.16893 (2024).
Maliakel, P. J., Ilager, S. & Brandic, I. Investigating energy efficiency and performance trade-offs in LLM inference across tasks and DVFS settings. Preprint at arXiv https://doi.org/10.48550/arXiv.2501.08219 (2025).
Jiang, P., Sonne, C., Li, W., You, F. & You, S. Preventing the immense increase in the life-cycle energy and carbon footprints of LLM-powered intelligent chatbots. Engineering 40, 202–210 (2024).
Li, P., Yang, J., Islam, M. A. & Ren, S. Making AI less ‘thirsty’. Commun. ACM 68, 54–61 (2025).
Zhang, H., Ning, A., Prabhakar, R. B. & Wentzlaff, D. LLMCompass: enabling efficient hardware design for large language model inference. In Proceedings of the 2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA) (eds Vega, A. et al.) 1080–1096 (IEEE, 2024).
Barocas, S., Hardt, M. & Narayanan, A. Fairness and Machine Learning: Limitations and Opportunities (The MIT Press, 2023).
Chang, C. T. et al. Red teaming ChatGPT in medicine to yield real-world insights on model behavior. NPJ Digit. Med. 8, 149 (2025).
Chen, R. J. et al. Algorithmic fairness in artificial intelligence for medicine and healthcare. Nat. Biomed. Eng. 7, 719–742 (2023).
Omar, M. et al. Sociodemographic biases in medical decision making by large language models. Nat. Med. 31, 1873–1881 (2025).
OECD. Health Data Governance for the Digital Age: Implementing the OECD Recommendation on Health Data Governance (OECD Publishing, 2022).
Zhang, C. et al. A survey on federated learning. Knowl.-Based Syst. 216, 106775 (2021).
Li, R., Romano, J. D., Chen, Y. & Moore, J. H. Centralized and federated models for the analysis of clinical data. Annu. Rev. Biomed. Data Sci. 7, 179–199 (2024).
Pan, M. Z. et al. Why do multiagent systems fail? In Proceedings of the ICLR 2025 Workshop on Building Trust in Language Models and Applications (eds Goldblum, M. et al.) (2025).
Matsumoto, N. et al. ESCARGOT: an AI agent leveraging large language models, dynamic graph of thoughts, and biomedical knowledge graphs for enhanced reasoning. Bioinformatics 41, btaf031 (2025).
Romano, J. D. et al. The Alzheimer’s Knowledge Base: a knowledge graph for Alzheimer disease research. J. Med. Internet Res. 26, e46777 (2024).
Lobentanzer, S. et al. A platform for the biomedical application of large language models. Nat. Biotechnol. 43, 166–169 (2025).
Lobentanzer, S. et al. Democratizing knowledge representation with BioCypher. Nat. Biotechnol. 41, 1056–1059 (2023).
Zhou, J. et al. Large language models in biomedicine and healthcare. NPJ Artif. Intell. 1, 44 (2025).
Gulcehre, C. et al. Reinforced Self-Training (ReST) for language modeling. Preprint at arXiv https://doi.org/10.48550/arXiv.2308.08998 (2023).
Gabriel, I., Keeling, G., Manzini, A. & Evans, J. We need a new ethics for a world of AI agents. Nature 644, 38–40 (2025).
Lee, H.-P. et al. The impact of generative AI on critical thinking: self-reported reductions in cognitive effort and confidence effects from a survey of knowledge workers. In Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems (eds Yamashita, N. et al.) 1–22 (Association for Computing Machinery, 2025).
Del Rio-Chanona, R. M., Ernst, E., Merola, R., Samaan, D. & Teutloff, O. AI and jobs. A review of theory, estimates, and evidence. Preprint at arXiv https://doi.org/10.48550/arXiv.2509.15265 (2025).
Becker, J., Rush, N., Barnes, E. & Rein, D. Measuring the impact of early-2025 AI on experienced open-source developer productivity. Preprint at arXiv https://doi.org/10.48550/arXiv.2507.09089 (2025).
SIMA Team et al. Scaling instructable agents across many simulated worlds. Preprint at arXiv https://doi.org/10.48550/arXiv.2404.10179 (2024).
Gao, S. et al. Democratizing AI scientists using ToolUniverse. Preprint at arXiv https://doi.org/10.48550/arXiv.2509.23426 (2025).
Qu, Y. et al. CRISPR-GPT for agentic automation of gene-editing experiments. Nat. Biomed. Eng. https://doi.org/10.1038/s41551-025-01463-z (2025).
Bran, A. M. et al. Augmenting large language models with chemistry tools. Nat. Mach. Intell. 6, 525–535 (2024).
Wang, H. et al. SpatialAgent: an autonomous AI agent for spatial biology. Preprint at bioRxiv https://doi.org/10.1101/2025.04.03.646459 (2025).
Ghafarollahi, A. & Buehler, M. J. ProtAgents: protein discovery via large language model multi-agent collaborations combining physics and machine learning. Digit. Discov. 3, 1389–1409 (2024).
Yuksekgonul, M. et al. Optimizing generative AI by backpropagating language model feedback. Nature 639, 609–616 (2025).
Yang, Y. et al. TwinMarket: a scalable behavioral and social simulation for financial markets. In Proceedings of the ICLR 2025 Workshop on World Models: Understanding, Modelling and Scaling (eds Yang, M. et al.) (2025).
Hu, S., Lu, C. & Clune, J. Automated design of agentic systems. In Proceedings of the 13th International Conference on Learning Representations (eds Yue, Y. et al.) 21344–21377 (2025).
Yuan, S. et al. EvoAgent: towards automatic multi-agent generation via evolutionary algorithms. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies Vol. 1 (eds Chiruzzo, L., Ritter, A. & Wang, L.) 6192–6217 (Association for Computational Linguistics, 2025).
Gao, S. et al. Empowering biomedical discovery with AI agents. Cell 187, 6125–6151 (2024).
Ahdritz, G. et al. OpenFold: retraining AlphaFold2 yields new insights into its learning mechanisms and capacity for generalization. Nat. Methods 21, 1514–1524 (2024).
