Turbocharging Pelikan Cache on Google Cloud’s latest Arm-based T2A VMs
Momento exceeded throughput and latency goals for its serverless cache by 25% with Google's latest Arm-based T2A VMs.
Update: We released a deep dive on this topic. Check it out if you want to get even more granular.
At Momento, we strive to be the world’s fastest cache—both in developer productivity and consistent latencies. Momento is open-core and built on Pelikan—the open sourced caching engine powering Twitter’s cache for over a decade and recently rewritten entirely in Rust. We also support RocksDB and KeyDB, a fully open-sourced database backed by Snap and a fast drop-in alternative to Redis. We chose these foundational open source components due to their reputation for fast, reliable caching at incredible scale.
Our obsession with speed extends to picking the best VMs on behalf of our customers for performance, capabilities, and price. We work closely with Twitter on tuning Pelikan and making it available to our customers at scale with a single API call and zero configuration. While it is uncommon for an early-stage startup to invest in deep processor-level optimizations to reduce cost and latency, we refused to compromise on our commitment to pick the best hardware for our customers’ workloads.
Optimizing for Tau T2A VM
When we learned about the Tau T2A VM family coming to Google Cloud, we couldn’t wait to take it for a test drive. We assumed that optimizing for Tau T2A VMs would be an expensive investment that would pay for itself only at scale. Turns out, we were wrong!
The transition to T2A was meaningfully less work than we anticipated. Pelikan worked on T2A right away, and we used our existing tuning processes to optimize it for the Arm-based VMs.
Before we started, we set objective criteria for assessing VM families. Momento cares about high throughput at low latency, and we’re particularly focused on optimizing tail latencies. We measured the queries per second (qps) we could sustain while keeping 99.9th percentile latency (p999) within each of two Service Level Objectives (SLOs): 1 millisecond (1ms) and 2 milliseconds (2ms).
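To make the p999 criterion concrete, here is a minimal sketch of how a benchmark run could be checked against the two SLOs. It assumes latencies are collected as a plain list of milliseconds and uses the nearest-rank percentile definition; the function names and sample data are hypothetical and are not taken from Momento's or Pelikan's tooling.

```python
def p999(samples_ms):
    """Nearest-rank 99.9th percentile of a list of latencies in milliseconds."""
    if not samples_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(samples_ms)
    # Integer ceiling of n * 999 / 1000 avoids floating-point rank errors.
    rank = (len(ordered) * 999 + 999) // 1000  # 1-based rank
    return ordered[rank - 1]


def meets_slo(samples_ms, slo_ms):
    """True if the run's p999 latency is within the SLO."""
    return p999(samples_ms) <= slo_ms


# Hypothetical run: 10,000 requests, 11 of which hit a 1.5 ms tail.
samples = [0.2] * 9989 + [1.5] * 11

for slo in (1.0, 2.0):
    print(f"{slo} ms SLO met: {meets_slo(samples, slo)}")
```

In this made-up run only 0.11% of requests are slow, yet that is enough to miss the 1ms SLO while still meeting the 2ms one, which is why the throughput a VM can sustain differs so much between the two objectives.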
For the T2A VMs to be cost neutral, we needed Pelikan to hit 260kqps at 1ms p999 SLO and 700kqps at 2ms p999 SLO on T2A-Standard-16 VMs.
Within a week of getting early access to T2A VMs (and working closely with Brian Martin at Twitter), we were able to exceed our goals by 25%. We hit 320kqps for our 1ms p999 SLO, and we achieved over one million queries per second at our 2ms p999 SLO!
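As a rough check of the arithmetic, the margin over each goal can be computed directly from the figures above. This is only a sketch: the 2ms result is reported as "over one million," so one million is used here as a lower bound, and the helper name is hypothetical.

```python
def headroom_pct(achieved_qps, target_qps):
    """Percentage by which achieved throughput exceeds the target."""
    return (achieved_qps / target_qps - 1.0) * 100.0


targets = {"1ms": 260_000, "2ms": 700_000}
# The 2ms figure is a lower bound; the post only says "over one million".
achieved = {"1ms": 320_000, "2ms": 1_000_000}

for slo, target in targets.items():
    print(f"{slo} p999 SLO: {headroom_pct(achieved[slo], target):.1f}% above goal")
```

By this calculation the 1ms result lands in the neighborhood of the quoted 25%, and the 2ms result is well beyond it.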
It gets better. A T2A-Standard-48 VM has 3x the cores of the T2A-Standard-16 and can handle even more queries per second. Additionally, we found that DRAM is about 10% cheaper on the T2A VMs, enabling us to keep larger working sets in memory at the same price point.
The week of engineering effort it took to deliver these results on the Tau T2A VMs was a worthwhile investment. We spent it running head-to-head benchmarks between VM families on Google Cloud, comparing price:performance, and tuning our engine to work well on the Tau T2A VMs. Surprisingly, almost all of the optimizations we made (core pinning, IRQ affinity, etc.) resulted in improvements across both Arm-based and x86-based VMs. Our initial investment paid dividends across our caching fleet, even at our current scale.
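For readers unfamiliar with core pinning and IRQ affinity, the sketch below shows the general idea on Linux: reserve a few CPUs for interrupt handling and pin the worker process to the rest. The split (2 of 16 CPUs for IRQs) and all helper names are assumptions made for illustration; the post does not describe the layout Momento actually used.

```python
import os


def split_cpus(total_cpus, irq_cpus):
    """Hypothetical layout: the first `irq_cpus` CPUs handle IRQs, the rest run workers."""
    all_cpus = list(range(total_cpus))
    return all_cpus[:irq_cpus], all_cpus[irq_cpus:]


def cpu_mask_hex(cpus):
    """Hex bitmask with one bit per CPU id, the form used by /proc/irq/<n>/smp_affinity.

    Limited to CPU ids below 32 to keep the sketch simple.
    """
    mask = 0
    for cpu in cpus:
        if not 0 <= cpu < 32:
            raise ValueError("this sketch only handles CPU ids 0-31")
        mask |= 1 << cpu
    return format(mask, "x")


def pin_current_process(cpus):
    """Pin the calling process to the given CPUs; a no-op where the call is unavailable."""
    if hasattr(os, "sched_setaffinity"):  # Linux only
        os.sched_setaffinity(0, set(cpus))


irq_cpus, worker_cpus = split_cpus(total_cpus=16, irq_cpus=2)
print("IRQ mask:", cpu_mask_hex(irq_cpus))        # mask to write to smp_affinity (needs root)
print("worker mask:", cpu_mask_hex(worker_cpus))  # CPUs left for the cache workers
```

Calling pin_current_process(worker_cpus) inside the server process would apply the worker half of the layout. Nothing here is specific to Arm or x86, which fits the observation that these optimizations helped on both.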
Deciding when to invest in Arm
It can be tough to decide when to invest in evaluating a new processor, whether it is from AMD, Intel, or Arm-based. We have three observations that may serve as a framework for you.
First and foremost, don’t see this as an irreversible decision. Google Cloud’s diverse platform makes evaluation and adoption a two-way door. If your applications work well on Arm today but a new x86 processor comes out tomorrow with better performance or features, you can easily benchmark and change.
Second, build a test harness that enables you to assess cost and performance. Most fast-moving enterprises end up with technical debt on cost optimizations. Stepping back to evaluate a new processor can yield cost savings and performance optimizations on your existing platforms, even if you decide not to adopt the new processor. Because we had a test harness for performance evaluation, we could assess new VM families with minimal effort from the engineering team.
Third, you don’t have to go all in! You can run some of your applications on Arm and others on x86. In fact, you can even run a single application on a mixed fleet of Arm and x86 VMs.
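The cost side of such a harness can be quite small. The following sketch compares VM types by throughput per dollar and computes the throughput a candidate VM needs in order to be cost neutral against a baseline. The VM names, prices, and throughput numbers are invented for illustration and are not Google Cloud pricing or Momento's measurements.

```python
from dataclasses import dataclass


@dataclass
class BenchResult:
    vm_type: str             # hypothetical label
    hourly_price_usd: float  # hypothetical price
    qps_at_slo: float        # throughput sustained within the latency SLO


def qps_per_dollar(result):
    """Throughput delivered per dollar of hourly spend."""
    return result.qps_at_slo / result.hourly_price_usd


def break_even_qps(baseline, candidate_price_usd):
    """Throughput a candidate must sustain to match the baseline's qps per dollar."""
    return qps_per_dollar(baseline) * candidate_price_usd


def rank_by_value(results):
    """Best price:performance first."""
    return sorted(results, key=qps_per_dollar, reverse=True)


baseline = BenchResult("x86-16 (hypothetical)", 1.00, 400_000)
candidate = BenchResult("arm-16 (hypothetical)", 0.75, 360_000)

print("break-even qps:", break_even_qps(baseline, candidate.hourly_price_usd))
for r in rank_by_value([baseline, candidate]):
    print(f"{r.vm_type}: {qps_per_dollar(r):,.0f} qps per dollar-hour")
```

With these made-up numbers the candidate needs 300,000 qps to be cost neutral and clears that bar, so it ranks first even though its raw throughput is lower. Feeding real benchmark output into a structure like this is what makes re-evaluating a new VM family cheap.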
We see a ton of momentum building up in Arm. The Momento team is excited to have an opportunity to partner with Google Cloud on optimizations that improve our price:performance and help us deliver the best possible cache to our customers.