On Far Memory

/assets/blog/2025/07/08/on-far-memory/2025-07-08_22-03-42_screenshot.png

Rest in peace, dear friend.

For the past two months, SWE Tea has been doing far memory papers, specifically:

- Lagar-Cavilla, Andres, Junwhan Ahn, Suleiman Souhlal, Neha Agarwal, Radoslaw Burny, Shakeel Butt, Jichuan Chang, et al. “Software-Defined Far Memory in Warehouse-Scale Computers.” In Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, 317–30. Providence RI USA: ACM, 2019. https://doi.org/10.1145/3297858.3304053.
- Ruan, Zhenyuan, Malte Schwarzkopf, Marcos K. Aguilera, and Adam Belay. “AIFM: High-Performance, Application-Integrated Far Memory.” In 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20), 315–32, 2020. https://www.usenix.org/conference/osdi20/presentation/ruan.
- Dragojević, Aleksandar, Dushyanth Narayanan, Orion Hodson, and Miguel Castro. “FaRM: Fast Remote Memory.” In Proceedings of the 11th USENIX Conference on Networked Systems Design and Implementation, 401–14. NSDI’14. USA: USENIX Association, 2014.
- Gu, Juncheng, Youngmoon Lee, Yiwen Zhang, Mosharaf Chowdhury, and Kang G. Shin. “Efficient Memory Disaggregation with Infiniswap,” n.d.

We were curious about what happens when memory stretches, aka what Optane was hinting at before Intel smothered it. Directly addressable NVM is appealing, but it’s expensive to deploy and painful to support operationally. Far memory has started creeping back into relevance under the real constraints of AI training clusters, k8s chaos, and the simple fact that most apps still spend half their RAM doing nothing useful.

Still, the concept of a memory ladder is intriguing. Imagine a memory hierarchy: cache, RAM, slow RAM, someone else’s RAM, flash. This makes your programs elastic: rather than worrying about OOMs, your application expands and spills memory over to other machines as needed.
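To make the ladder concrete, here is a rough sketch of the tiers with order-of-magnitude latencies. The numbers are illustrative only; they vary a lot by hardware and interconnect.

```python
# Rough, illustrative latencies for each rung of the ladder.
# These are order-of-magnitude guesses, not measurements.
MEMORY_LADDER = [
    ("L1/L2 cache",             "~1-10 ns"),
    ("local DRAM",              "~100 ns"),
    ("compressed DRAM (zswap)", "~1-5 us"),    # decompress on fault
    ("remote DRAM (RDMA)",      "~2-10 us"),   # one NIC round trip
    ("local NVMe flash",        "~100 us"),
]

for tier, latency in MEMORY_LADDER:
    print(f"{tier:26s} {latency}")
```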

In fact, Lagar-Cavilla specifically calls this out:

/assets/blog/2025/07/08/on-far-memory/2025-07-08_21-22-24_screenshot.png

The papers we read in SWE Tea present different takes on the idea, but it roughly breaks down like this:

| Authors | Where Far Memory Is | Application Changes? | Hardware Changes? | Blurb |
|---|---|---|---|---|
| Lagar-Cavilla | Local machine | No, uses Linux swap | No | Compresses "cold" sections of memory, GC mark-and-sweep style, via zswap. |
| Gu | Remote machines | No, uses Linux swap | Yes | Needs InfiniBand/RoCE, but proactively swaps out cold pages to remote memory for you. |
| Dragojević | Remote machines | Yes, lightly | Yes (?) | Needs RDMA, but gives you transactional memory: you can lock specific remote regions of far memory and influence placement across machines. |
| Ruan | Remote machines | Yes, heavily | No | Sets up an alternative runtime where everything runs as green threads and applications switch threads while waiting on far memory. |
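The first row is the easiest to picture. Here is a minimal sketch in the spirit of the zswap approach: find pages that haven’t been touched in a while, compress them in place, decompress on access. The `Page` class, the page size constant, and the cold threshold are all made up for illustration; the real thing lives in the kernel.

```python
import time
import zlib

PAGE_SIZE = 4096          # bytes; illustrative
COLD_AFTER_SECONDS = 120  # "cold" threshold; arbitrary for this sketch

class Page:
    def __init__(self, data: bytes):
        self.data = data          # uncompressed payload, or None once compressed
        self.compressed = None    # zlib blob while cold
        self.last_access = time.monotonic()

    def read(self) -> bytes:
        # Any access promotes the page back to uncompressed DRAM.
        if self.compressed is not None:
            self.data = zlib.decompress(self.compressed)
            self.compressed = None
        self.last_access = time.monotonic()
        return self.data

def compress_cold_pages(pages: list[Page]) -> int:
    """Sweep phase: compress anything that hasn't been touched recently.
    Returns roughly how many bytes of DRAM were reclaimed."""
    reclaimed = 0
    now = time.monotonic()
    for p in pages:
        if p.compressed is None and now - p.last_access > COLD_AFTER_SECONDS:
            p.compressed = zlib.compress(p.data)
            reclaimed += len(p.data) - len(p.compressed)
            p.data = None
    return reclaimed
```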

The appeal of far memory is obvious: most boxes in warehouse-scale computers have terrible utilization rates, hovering around 40-60%. That means nearly half of the fleet's RAM is sitting idle, held in reserve for bursty situations. We overprovision boxes because we need to handle peak capacity, but the effect is a ton of boxes sitting cold. Far memory lets you pool those unused resources rather than marooning them.

Not all marooned memory is the same. The distinction between cold and stranded memory is important. Cold memory is idle because we overbudgeted. Stranded memory is stuck because the workload’s shape doesn’t match the box, such as a CPU-heavy task on a RAM-heavy node. In practice, our runtime would probably have to handle them separately: cold memory is a result of application access patterns, while stranded memory stems from a mismatch in resource brokerage.

By letting the runtime stretch across the ladder, we free developers to worry less about how things are allocated and focus on application- and business-logic-specific code[1]. Conversely, by proactively moving cold bits of memory off local DRAM, we get fewer OOMs and more local DRAM for hot paths.

Right now, every service boundary becomes a wall where memory can't be shared. We overprovision boxes, and an OOM in one service isn't helped by the fact that another sits underused. Fragmentation makes this worse: every pod gets its own resource island, with hard edges[2].

What we want, in effect, is a kind of distributed monolith: not a return to tight coupling, but a unified runtime that spans machines and abstracts memory boundaries. Rather than hard service partitions and duplicated state, memory becomes a pooled substrate. Programs allocate as usual; the runtime determines placement and scope. Elasticity becomes a property of the system itself, not something bolted on through orchestration.

Erlang is a decent north star here. It already solves some of this: lightweight processes that share nothing, location-transparent message passing, and a scheduler that doesn't much care which node a process lands on.

So you can imagine a system like that, but instead of just messaging between actors, you also get memory that moves. Hot pages stay local, and cold stuff gets pushed to someone else's DRAM. Your code is blind to this, but you get the option to annotate your memory the way you annotate your types: remotable, hot, compressible[3].
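No runtime offers these annotations today. If one did, it might look something like Python's `typing.Annotated` with placement hints; `Hot`, `Compressible`, and `Remotable` below are invented markers, not a real API.

```python
from dataclasses import dataclass
from typing import Annotated

# Hypothetical placement hints. None of these exist in any real runtime;
# they just record intent the way a type annotation records shape.
class Hot: ...           # keep in local DRAM, never demote
class Compressible: ...  # fine to compress in place when cold
class Remotable: ...     # fine to ship to a neighbor's DRAM

@dataclass
class Session:
    user_id: Annotated[int, Hot]                  # touched on every request
    render_cache: Annotated[bytes, Compressible]  # big, cold, recomputable
    audit_log: Annotated[list[str], Remotable]    # append-only, rarely read
```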

For example, take a normal Python web app with the classic Django setup: some views, some Celery jobs, a few bloated ORM queries. Today, when it gets hit with a traffic spike, it falls over because Postgres can't keep up and the app server OOMs under the load of a few accidental .all() calls pulling 100k rows.

In our speculative runtime, you still write your views the same way: Python functions, with some async sprinkled in. The runtime pulls itself across a few extra machines when memory pressure warrants it. The large queryset now spills automatically into far memory by compressing cold objects, maybe even sending them off to a neighbor’s DRAM[4]. Meanwhile, the scheduler notices that your Celery workers are mostly idle, so it starts running them closer to the memory they’re accessing.
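To make the spilling concrete, here is a toy, explicit version of what the runtime would ideally do behind your back. `SpillingList`, its chunk size, and the commented-out Django usage are all inventions for illustration; the whole point of the speculative runtime is that you would never write this yourself.

```python
import pickle
import zlib

class SpillingList:
    """Keep a hot tail of recent rows as plain objects; compress older
    chunks so they stop eating local DRAM (and could, in principle, be
    shipped to another box instead)."""

    CHUNK = 10_000  # rows per compressed chunk; arbitrary

    def __init__(self):
        self._hot = []    # recent, uncompressed rows
        self._cold = []   # zlib-compressed pickled chunks

    def append(self, row):
        self._hot.append(row)
        if len(self._hot) >= self.CHUNK:
            self._cold.append(zlib.compress(pickle.dumps(self._hot)))
            self._hot = []

    def __iter__(self):
        for blob in self._cold:
            yield from pickle.loads(zlib.decompress(blob))
        yield from self._hot

# Hypothetical usage against a Django queryset:
# rows = SpillingList()
# for obj in SomeModel.objects.all().iterator():
#     rows.append(obj)
```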

All of this, in the ideal case, is completely transparent to the developer. Obviously, this assumes a runtime that's opinionated and intrusive enough to take control: think something like PyPy meets Erlang meets a modern autoscaler.

AIFM points out the obvious tax here: kernel swap moves memory at 4KB page granularity. If your objects are small, read amplification bites hard. Page granularity works fine when your app’s memory layout is a neat array, but our access patterns are rarely that aligned. Touching one object means pulling in a page full of unrelated garbage. That’s fine if it’s local. But if you’re crossing a NIC, pointer chases turn into a 4KB tax. Worse yet, we now have to worry about page table state on the NIC, which itself has limited memory and may have to periodically pull page table entries back in from DRAM.
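The arithmetic is unforgiving even with generous assumptions. The numbers below (64-byte objects, a ~5 us one-sided read, a ten-hop chase) are illustrative, not measurements:

```python
PAGE = 4096            # bytes moved per swap fault
OBJECT = 64            # bytes you actually wanted (a small node in a linked structure)
NIC_ROUNDTRIP_US = 5   # rough one-sided RDMA read latency; illustrative

amplification = PAGE / OBJECT     # 64x: bytes moved vs. bytes wanted
useful_fraction = OBJECT / PAGE   # ~1.6% of each fetched page gets used

# A ten-hop pointer chase where every hop misses to far memory:
chase_cost_us = 10 * NIC_ROUNDTRIP_US   # ~50 us spent purely on round trips

print(f"{amplification:.0f}x amplification, "
      f"{useful_fraction:.1%} useful bytes, "
      f"{chase_cost_us} us per chase")
```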

NUMA tried to solve the same shape of problem. If your allocator places related things far apart, every memory access is a hop. Far memory just makes that hop more expensive. You miss once and pay the NIC roundtrip[5].

Compressing local cold memory using the swap abstraction is a reasonable stepping stone, since schlepping data off to remote machines in modern environments (post-PRISM) requires you to encrypt and CRC the data, which adds further CPU overhead. For now, we compress cold pages. But if we squint a little, we can see what comes next: a runtime that turns memory into a shared, fluid substrate.
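As a sketch of where those cycles go, here is roughly what packing a cold region for a remote box might involve, using the stdlib's zlib for compression and CRC and the cryptography package's Fernet for encryption. The KMS comment is an assumption, and Fernet already authenticates its tokens, so the explicit CRC is belt-and-suspenders; the point is simply that every step burns CPU, and you want to compress before you encrypt because ciphertext doesn't compress.

```python
import zlib
from cryptography.fernet import Fernet  # pip install cryptography

key = Fernet.generate_key()  # in practice, per-node keys from a KMS (assumption)
fernet = Fernet(key)

def pack_for_remote(region: bytes) -> tuple[bytes, int]:
    """Compress, checksum, then encrypt a cold region before shipping it."""
    compressed = zlib.compress(region, 1)  # cheap level: CPU is the scarce budget
    crc = zlib.crc32(compressed)
    return fernet.encrypt(compressed), crc

def unpack_from_remote(blob: bytes, crc: int) -> bytes:
    compressed = fernet.decrypt(blob)
    assert zlib.crc32(compressed) == crc, "corruption on the wire"
    return zlib.decompress(compressed)
```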

Thanks to Casper for reading the first draft and providing feedback.

Footnotes:

[1]

This is actually how developers at Facebook write PHP code: they write functions in Hack (Facebook's PHP dialect) and yeet them into the generic runtime (HHVM), which handles all the allocation, memory, JITing, etc. for them.

[2]

Practically speaking, I've never seen people try to tune their pod requests vs. limits to maximize bin-packing on a single box, and most of the time you don't want to either. It's usually cheaper to just get more capacity than to chase maximal packing.

[3]

If you're Lakos pilled, you'll be nodding along. If you're not, read https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2020/p2126r0.pdf.

[4]

This turns out to be far more complicated in practice, since sending off to remote memory means you'll need to encrypt, and if you need to encrypt, you might as well compress.

[5]

The simplest kv case of Pilaf still suffered from multiple RDMA reads per key: Mitchell, Christopher, Yifeng Geng, and Jinyang Li. “Using One-Sided RDMA Reads to Build a Fast, CPU-Efficient Key-Value Store,” n.d.

Posted: 2025-07-08
Filed Under: tech