How it works
To become better at ML, you need to develop intuition for what experiments will work well so you can be more efficient about which experiments to run. This site helps you practice that intuition using real research papers.
We take papers people have already published, present the experimental setup, and you try to predict the results. If you're new to ML, you can focus on fundamental papers and build a strong base; if you're an expert, you can test your intuition on papers you haven't seen before.
Topics vs. Custom Search
- Choosing a topic from the dropdown is fast: we've already prepared those questions, so they load immediately.
- If you want to test yourself on a specific subfield that's missing, use custom search. These searches take longer (about 30–60 seconds) because we create the questions in real time: we fetch up to 100 relevant arXiv papers, shuffle them, then generate a small buffer of question/answer pairs on the fly.
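The custom-search flow described above can be sketched as a small pipeline. This is a minimal illustration, not the site's actual code: `papers` stands in for the arXiv search results, and `generate_qa` is a hypothetical placeholder for the real-time LLM question-generation call.

```python
import random

def build_question_buffer(papers, generate_qa, buffer_size=5, max_papers=100):
    """Sketch of the custom-search pipeline: cap the candidate list,
    shuffle it, then generate Q/A pairs until the buffer is full."""
    pool = list(papers)[:max_papers]   # keep at most 100 relevant papers
    random.shuffle(pool)               # randomize which papers get used
    buffer = []
    for paper in pool:
        if len(buffer) >= buffer_size:
            break
        qa = generate_qa(paper)        # placeholder for the LLM call
        if qa is not None:             # skip papers that yield no usable question
            buffer.append(qa)
    return buffer
```

Generating only a small buffer (rather than questions for all 100 papers) keeps the wait to the quoted 30–60 seconds while still drawing from a broad, shuffled pool.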
Domains
I originally made this for machine learning; however, the "custom search" feature should work for any field that has published papers on arXiv, including Math, CS, Stats, Physics, etc.
Anti-slop mitigations
The questions and answers on this website are generated by GPT 5.2. To avoid hallucinations and make sure the answers are correct, I asked Opus 4.5 the same questions and only included those it answered with a perfect score "open book" (using the paper as a reference). For beginner questions, where there is no paper to reference, I used both Opus 4.5 and Gemini 3 Pro, and only included questions that both of the other models answered perfectly.
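The cross-model filter above boils down to a unanimity check. A minimal sketch, where `graders` stands in for the verifier models (e.g. one open-book Opus 4.5 judge for paper-backed questions, or Opus 4.5 plus Gemini 3 Pro for beginner questions) — all names here are hypothetical stand-ins, not the real pipeline:

```python
def keep_question(question, reference_answer, graders):
    """A generated question survives only if every verifier model
    reproduces the reference answer exactly (a "perfect score")."""
    return all(grade(question) == reference_answer for grade in graders)
```

Requiring every grader to agree trades recall for precision: some fine questions get discarded, but a question that slips past multiple independent models is much less likely to be a hallucination.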
In addition to addressing AI slop, we must also mitigate human-generated slop. Not all papers are correct: they may report false results due to errors in the code, poorly tuned baselines, or many other issues. To mitigate this, the verification prompt for each question includes a check for anything in the paper suggesting the results could be incorrect, such as non-rigorous methods, poorly tuned baselines, or a lack of ablation studies. I removed all papers this check flagged as potentially wrong. A stronger check would be to add web/Twitter search to see whether there is any discussion, or other papers, suggesting the paper is incorrect; I have not implemented that for this first version.
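The paper-level filter described above is just a removal pass over the pool. In this sketch, `looks_unsound` is a hypothetical stand-in for the LLM verification prompt, not a real API:

```python
def drop_flagged_papers(papers, looks_unsound):
    """Remove any paper the verifier flags as potentially wrong
    (non-rigorous methods, poorly tuned baselines, missing ablations)
    so no questions are generated from it."""
    return [paper for paper in papers if not looks_unsound(paper)]
```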
To prove I looked at the data and did many questions myself, here's a funny bug from one of the many iterations: initially the "Reward Hacking" section included a paper about theoretical math (Calabi–Yau manifolds), because the arXiv keyword search was too broad and the paper's author was Paul Hacking.
Even with these mitigations, it is still possible that some questions are slightly incorrect. The only error I've found so far is a beginner question that asked how an algorithm designed to be robust to noise compared to the baseline on a noisy example. The problem was that the exact parameters stated in the question produced too little noise, so when I actually coded up the experiment and ran it, the new algorithm showed no improvement. So from what I can tell, any errors are likely to be directionally correct (you're not going to learn wrong information by studying with experimentoring), but the experiments in the beginner questions might have hyperparameters that are not set quite right. I've only seen this problem in the beginner questions, since the advanced questions have the paper's experiments as a reference.
Feedback & Contact
Found a bug or have a feature request? Reach out on X at @rosmine or email: [email protected].