SWE Tea - Year 2

swe-tea-year-2.png

Year one post: https://malloc.dog/blog/2025/01/05/swe-tea---year-1/

Full SWE Tea papers list: https://malloc.dog/swetea

I was pleasantly surprised to find that Casper and I had marched towards paper #100 in SWE Tea. We read some good (and difficult!) papers this year, with the standouts being:

1. Sanitizers

  • Valgrind
    • Nethercote, Nicholas, and Julian Seward. “Valgrind: A Framework for Heavyweight Dynamic Binary Instrumentation.” Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI ’07), ACM, 2007.
    • While the paper itself isn't that revolutionary, it's an excellent overview of the main problems in sanitization and the models a sanitizer needs to follow: shadow memory, quarantine zones, and free lists all get discussed. It also covers dynamic binary recompilation, an approach that later sanitizers ditched in favor of adding instrumentation points at build time via LLVM, followed by an additional compiler optimization pass. (A toy shadow-memory sketch follows below.)
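
To make the shadow-memory and quarantine ideas concrete, here's a toy Python model. It's a sketch of the general technique, not Valgrind's actual implementation (Memcheck uses a compressed two-level shadow table and tracks validity per bit); all the names and numbers here are made up.

```python
# Toy model of Memcheck-style shadow memory: every "application" byte gets one
# shadow state recording whether it is addressable and whether its value is
# defined. Not Valgrind's real data structure, just the idea.
UNADDR, UNDEF, DEFINED = 0, 1, 2

class ShadowHeap:
    def __init__(self, size, quarantine_len=4):
        self.shadow = [UNADDR] * size       # one shadow entry per byte
        self.quarantine = []                # freed blocks parked before reuse
        self.quarantine_len = quarantine_len

    def malloc(self, addr, n):
        # Newly allocated bytes are addressable but hold undefined values.
        for i in range(addr, addr + n):
            self.shadow[i] = UNDEF

    def store(self, addr):
        if self.shadow[addr] == UNADDR:
            raise MemoryError(f"invalid write at {addr:#x}")
        self.shadow[addr] = DEFINED

    def load(self, addr):
        if self.shadow[addr] == UNADDR:
            raise MemoryError(f"invalid read at {addr:#x}")
        if self.shadow[addr] == UNDEF:
            print(f"warning: use of uninitialised value at {addr:#x}")

    def free(self, addr, n):
        # Poison the block and quarantine it so a use-after-free hits
        # unaddressable memory instead of a freshly reused allocation.
        for i in range(addr, addr + n):
            self.shadow[i] = UNADDR
        self.quarantine.append((addr, n))
        if len(self.quarantine) > self.quarantine_len:
            self.quarantine.pop(0)          # oldest block becomes reusable

heap = ShadowHeap(64)
heap.malloc(0x10, 8)
heap.load(0x10)        # warns: read of uninitialised value
heap.store(0x10)
heap.free(0x10, 8)
# heap.load(0x10)      # would raise: use-after-free caught via shadow state
```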

2. DC Power Oversubscription

  • Medium Voltage Power Capping
    • Sakalkar, Varun, Vasileios Kontorinis, David Landhuis, et al. “Data Center Power Oversubscription with a Medium Voltage Power Plane and Priority-Aware Capping.” Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems, ACM, March 9, 2020, 497–511. https://doi.org/10.1145/3373376.3378533.
    • Data center power oversubscription is a bit of a niche topic, but this paper is a strong guide to its evolution. Managing power at the medium-voltage level rather than at the rack level expands the resource pool and the degrees of freedom for optimization. Their focus is statistical multiplexing: a bin-packing problem where you schedule power across many machines rather than relying on each node's Linux scheduler. (A toy capping sketch follows below.)
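
For a feel of what priority-aware capping looks like mechanically, here's a toy Python sketch: one shared power domain with a budget, and workloads that get throttled lowest-priority-first when aggregate draw exceeds it. The budget, workload names, and numbers are invented; the real system actuates caps on machines, not on abstract named jobs.

```python
# Toy priority-aware power capping over one shared (oversubscribed) power domain.
from dataclasses import dataclass

@dataclass
class Workload:
    name: str
    priority: int      # higher = more important, capped last
    draw_w: float      # current power draw in watts
    floor_w: float     # lowest draw we can throttle it down to

def cap_to_budget(workloads, budget_w):
    """Shed power from the lowest-priority workloads until the domain fits."""
    excess = sum(w.draw_w for w in workloads) - budget_w
    caps = {w.name: w.draw_w for w in workloads}
    if excess <= 0:
        return caps                              # oversubscribed but under budget
    for w in sorted(workloads, key=lambda w: w.priority):
        shed = min(excess, w.draw_w - w.floor_w)
        caps[w.name] = w.draw_w - shed
        excess -= shed
        if excess <= 0:
            break
    return caps

jobs = [
    Workload("search-serving",  priority=3, draw_w=400, floor_w=380),
    Workload("batch-analytics", priority=1, draw_w=500, floor_w=200),
    Workload("ml-training",     priority=2, draw_w=450, floor_w=300),
]
print(cap_to_budget(jobs, budget_w=1200))  # batch absorbs the 150 W deficit
```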

3. RL Training

  • SFT vs RL
    • Chu, Tianzhe, Yuexiang Zhai, Jihan Yang, et al. “SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-Training.” arXiv:2501.17161. Preprint, arXiv, January 28, 2025. https://doi.org/10.48550/arXiv.2501.17161.
    • Compares supervised fine-tuning with RL. This paper sets the stage for how RL took over LLM post-training, and why RL had to be mixed in to get better generalization out of these models. (A toy contrast of the two update rules follows below.)
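
As a back-of-the-envelope illustration of the two update rules (not the paper's experiments), here's a toy single-step "policy" in Python: SFT does cross-entropy against one demonstrated answer, while RL samples answers and reinforces whichever ones a reward function accepts. The answers, rewards, and learning rates are invented.

```python
# Toy contrast of SFT and RL updates on a softmax over four candidate answers.
# Purely illustrative; real post-training operates on token sequences.
import numpy as np

rng = np.random.default_rng(0)
logits = np.zeros(4)                      # 4 candidate answers, uniform to start
softmax = lambda z: np.exp(z - z.max()) / np.exp(z - z.max()).sum()

def sft_step(logits, demo, lr=0.5):
    # Supervised fine-tuning: push probability mass onto the demonstrated answer.
    p = softmax(logits)
    grad = p.copy(); grad[demo] -= 1.0    # gradient of cross-entropy w.r.t. logits
    return logits - lr * grad

def rl_step(logits, reward_fn, lr=0.5, n=32):
    # RL: sample answers, reinforce whichever ones the reward function likes.
    p = softmax(logits)
    actions = rng.choice(len(logits), size=n, p=p)
    grad = np.zeros_like(logits)
    for a in actions:
        g = -p.copy(); g[a] += 1.0        # gradient of log pi(a) w.r.t. logits
        grad += reward_fn(a) * g
    return logits + lr * grad / n

# SFT concentrates mass on the single demonstration (answer 2)...
l = logits
for _ in range(20):
    l = sft_step(l, demo=2)
# ...while RL reinforces any answer the reward function accepts (answers 1 or 2).
l2 = logits
for _ in range(20):
    l2 = rl_step(l2, reward_fn=lambda a: 1.0 if a in (1, 2) else 0.0)
print(softmax(l).round(2), softmax(l2).round(2))
```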

4. LLM Post-Training

  • Magistral
    • Mistral-AI, Abhinav Rastogi, Albert Q. Jiang, et al. “Magistral.” arXiv:2506.10910. Preprint, arXiv, June 12, 2025. https://doi.org/10.48550/arXiv.2506.10910.
    • I pick this paper and not the series of GRPO papers mostly because it's the first one that offers a holistic view of the training pipeline. They use a form of GRPO for RL stability¹ and explain their trainers, generators, and verifiers, the infrastructure that runs the policy. Generators and verifiers perform async rollouts, while the trainers update the model weights from each batch at every training phase. It's a good overview of how all the pieces fit together. (A minimal GRPO sketch follows below.)
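
For reference, here's a minimal sketch of the group-relative advantage and clipped loss that GRPO-style training uses, with the KL term dropped per the footnote. The function names and numbers are mine, not Magistral's exact recipe; in their pipeline these computations live inside the trainers, fed by async generator/verifier rollouts.

```python
# Minimal GRPO-style advantage + clipped policy loss (no KL penalty).
import numpy as np

def grpo_advantages(rewards):
    """Group-relative advantages: normalise each rollout's reward against the
    other rollouts sampled for the same prompt."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-6)

def grpo_loss(logprobs_new, logprobs_old, advantages, clip_eps=0.2):
    """Clipped policy-gradient loss over a group of rollouts (no KL term)."""
    ratio = np.exp(logprobs_new - logprobs_old)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -np.minimum(unclipped, clipped).mean()

# One prompt, four rollouts scored 0/1 by a verifier:
rewards = [1, 0, 0, 1]
adv = grpo_advantages(rewards)            # correct rollouts get positive advantage
loss = grpo_loss(np.array([-1.0, -1.2, -0.9, -1.1]),
                 np.array([-1.1, -1.1, -1.0, -1.0]), adv)
print(adv.round(2), round(loss, 3))
```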

5. Dishonorable Mentions

These papers were kinda duds:

  • Erlingsson, Úlfar, Marcus Peinado, Simon Peter, and Mihai Budiu. “Fay: Extensible Distributed Tracing from Kernels to Clusters.” Proceedings of the Twenty-Third ACM Symposium on Operating Systems Principles (SOSP), ACM, 2011.
    • Unremarkable results, outshined by Zipkin at this point
  • Gu, Juncheng, Youngmoon Lee, Yiwen Zhang, Mosharaf Chowdhury, and Kang G. Shin. “Efficient Memory Disaggregation with Infiniswap.” Proceedings of the 14th USENIX Symposium on Networked Systems Design and Implementation (NSDI), USENIX, 2017.
    • While the idea was cool, the evaluation section was extremely lacking.

Footnotes:

1. They did nix the KL penalty in their GRPO objective.

Posted: 2025-12-27
Filed Under: tech