Hugging Face Releases ml-intern: An Open-Source AI Agent that Automates the LLM Post-Training Workflow
Hugging Face has released ml-intern, an open-source AI agent designed to automate end-to-end post-training workflows for large language models (LLMs). Built on the company’s smolagents framework, the tool can autonomously perform literature review, dataset discovery, training script execution, and iterative evaluation — tasks that typically require significant manual effort from ML researchers and engineers.
What ml-intern Does
The agent operates as a continuous loop that mirrors the workflow of an ML researcher. It begins by browsing arXiv and Hugging Face Papers, reading methodology sections and traversing citation graphs to identify relevant datasets and techniques. It then searches the Hugging Face Hub for referenced datasets, inspects their quality, and reformats them for training. When local compute is unavailable, the agent can launch jobs via Hugging Face Jobs. After each training run, it reads evaluation outputs, diagnoses failures — such as reward collapse in RLHF pipelines — and retrains until benchmark performance improves.
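The loop described above can be sketched in plain Python. Everything here is an illustrative stub (ml-intern's actual API is not shown in the source); the function names, hyperparameters, and scores are invented to show the control flow, not the tool's real behavior.

```python
# Hypothetical sketch of the retrain-until-improved loop.
# Every function body is a stand-in for the real agent's behavior.

def review_literature(benchmark):
    # Real agent: browse arXiv / HF Papers, traverse citation graphs.
    return {"technique": "sft", "lr": 2e-5}

def discover_datasets(plan):
    # Real agent: search the Hugging Face Hub, inspect dataset quality.
    return ["example/dataset"]

def run_training(model, data, plan):
    # Real agent: execute a training script, locally or via HF Jobs.
    return model + "-tuned"

def evaluate(model, benchmark):
    # Real agent: read evaluation outputs (e.g. GPQA accuracy).
    return 0.32 if model.endswith("-tuned") else 0.10

def diagnose_and_revise(model, plan, data):
    # Real agent: detect failures such as reward collapse and adjust.
    return plan, data

def post_train(base_model, benchmark, target, max_iters=5):
    plan = review_literature(benchmark)
    data = discover_datasets(plan)
    ckpt, score = base_model, evaluate(base_model, benchmark)
    for _ in range(max_iters):
        ckpt = run_training(ckpt, data, plan)
        score = evaluate(ckpt, benchmark)
        if score >= target:
            break
        plan, data = diagnose_and_revise(ckpt, plan, data)
    return ckpt, score

ckpt, score = post_train("Qwen3-1.7B-Base", "GPQA", target=0.30)
```

The essential design point is that evaluation gates the loop: the agent keeps diagnosing and retraining until the benchmark target is met or the iteration budget runs out.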
The entire monitoring stack relies on Trackio, a Hub-native experiment tracker positioned as an open-source alternative to Weights & Biases.
Performance on PostTrainBench
ml-intern was evaluated against PostTrainBench, a benchmark introduced by researchers at the University of Tübingen and the Max Planck Institute. The benchmark tests an agent’s ability to post-train a base model within a strict 10-hour window on a single H100 GPU.
In the official launch demo, ml-intern took the Qwen3-1.7B base model—which scores a baseline of roughly 8.5% on GPQA—and pushed it to 32% in under 10 hours. The agent's progress was fast, crossing the 27.5% mark in just over three hours.
This result is notable against the existing state of the art. Hugging Face's data shows the agent outperforming Claude Code, which scores 22.99% on the same task. While the broader PostTrainBench paper reported a high of 33% using the larger Gemma-3-4B, ml-intern's ability to extract 32% from the 1.7B-parameter Qwen model demonstrates a data efficiency that manual researchers often struggle to match in such a short timeframe.
Technical Approaches: Synthetic Data and GRPO
Two technical strategies that ml-intern demonstrated in published demos are worth highlighting for practitioners.
Synthetic data generation: In a healthcare-domain test, the agent assessed available medical datasets, determined their quality was insufficient for reliable fine-tuning, and wrote a script to generate synthetic training examples focused on edge cases including medical hedging language and multilingual emergency response scenarios. It then upsampled this data to augment the training distribution before evaluating on HealthBench.
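The upsampling step can be illustrated with a minimal, self-contained sketch: duplicate rare edge-case examples (hedging language, multilingual emergency scenarios) so they carry more weight in the training mix. The corpus, the `edge` flag, and the 3x factor below are invented for illustration; the source does not publish the agent's actual script or ratios.

```python
import random

def upsample_edge_cases(examples, is_edge_case, factor=3, seed=0):
    """Repeat edge-case examples `factor` times, then shuffle the mix."""
    augmented = []
    for ex in examples:
        augmented.extend([ex] * (factor if is_edge_case(ex) else 1))
    random.Random(seed).shuffle(augmented)
    return augmented

corpus = [
    {"text": "Take 200 mg twice daily.", "edge": False},
    {"text": "This may indicate X, but consult a physician.", "edge": True},
    {"text": "¿Dónde está la sala de emergencias?", "edge": True},
]
mix = upsample_edge_cases(corpus, lambda ex: ex["edge"], factor=3)
```

Here two edge cases become six of the seven training examples, shifting the distribution toward the scenarios the agent judged under-represented.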
Autonomous RLHF via GRPO: In a math-domain test, the agent implemented a Group Relative Policy Optimization (GRPO) training script — a reinforcement learning technique that replaces PPO's separate learned value model with group-normalized rewards, substantially reducing memory overhead. The agent launched training on A100 GPUs, monitored reward curves, and ran ablations to isolate effective components before finalizing the checkpoint.
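The core of GRPO is easy to show in isolation. For each prompt, the policy samples a group of completions, a reward function scores them, and each completion's advantage is its reward normalized within the group — no critic network required, which is the memory saving over PPO. This standalone sketch (not ml-intern's actual script) computes those group-relative advantages:

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Advantage of each completion: (r_i - group mean) / group std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# e.g. four sampled solutions to one math prompt: 1 = correct, 0 = wrong
rewards = [1.0, 0.0, 1.0, 0.0]
advs = group_relative_advantages(rewards)
```

Correct completions receive positive advantages and incorrect ones negative, so the policy update pushes probability mass toward answers that beat their own group's average.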
Key Takeaways
- Autonomous Research Loop: The agent replicates the full machine learning workflow, from performing literature reviews on arXiv and traversing citation graphs to autonomously executing training runs and diagnosing failures.
- Significant Reasoning Gains: In less than 10 hours, the agent pushed a Qwen3-1.7B model's scientific reasoning score on the GPQA benchmark from 8.5% to 32%, outperforming Claude Code's 22.99% on the same benchmark.
- Advanced Training Strategies: Beyond simple fine-tuning, ml-intern can generate high-quality synthetic data for edge cases and implement complex techniques like Group Relative Policy Optimization (GRPO) to optimize math performance.
- Native Ecosystem Integration: Built on the smolagents framework, the tool natively integrates with Hugging Face Jobs for compute and uses Trackio for open-source experiment tracking.