
Building System Initiative has been a process of iterating on architecture and approaches. As we’ve tried a series of persistence backends, rewritten services, and brought new technologies into our stack, one of the most enduring and important technologies has been NATS.

We brought NATS into our system over 4 years ago as a mechanism to deliver updates to the frontend. We wanted to use a persistent WebSocket connection that could push events such as function execution results and new values for model state. NATS provided a nicely decoupled and scalable approach: business logic in the server-side backend could publish an update for a workspace on a NATS subject, and the API server could separately subscribe to these events and republish them to connected WebSocket clients. These updates then get pushed to the frontend and let us update whatever we need to provide a fully reactive system. This solution is still used in our codebase today, virtually unchanged!
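The decoupling described above can be sketched with a toy, in-memory stand-in for the message bus. This is an illustration under assumptions, not NATS client code and not the code in our repository: the subject layout `si.workspace.<id>.event` and the `Bus` and `ApiServer` classes are hypothetical names invented for the example. The one piece of real NATS behavior it mirrors is subject matching, where `*` matches exactly one dot-separated token and `>` matches one or more trailing tokens.

```python
from collections import defaultdict


def subject_matches(pattern: str, subject: str) -> bool:
    """NATS-style matching: '*' is one token, '>' is one or more trailing tokens."""
    p_tokens = pattern.split(".")
    s_tokens = subject.split(".")
    for i, p in enumerate(p_tokens):
        if p == ">":
            # '>' must be the last token and needs at least one token left to match.
            return i == len(p_tokens) - 1 and len(s_tokens) > i
        if i >= len(s_tokens):
            return False
        if p != "*" and p != s_tokens[i]:
            return False
    return len(p_tokens) == len(s_tokens)


class Bus:
    """Toy in-memory message bus (hypothetical, not a NATS client)."""

    def __init__(self):
        self._subs = []  # list of (pattern, callback) pairs

    def subscribe(self, pattern, callback):
        self._subs.append((pattern, callback))

    def publish(self, subject, payload):
        # The publisher never learns who received the message, if anyone did.
        for pattern, callback in self._subs:
            if subject_matches(pattern, subject):
                callback(subject, payload)


class ApiServer:
    """Forwards workspace events to the clients connected for that workspace."""

    def __init__(self, bus):
        self.clients = defaultdict(list)  # workspace id -> client inboxes
        bus.subscribe("si.workspace.*.event", self._on_event)

    def connect(self, workspace_id):
        inbox = []  # a plain list stands in for a WebSocket connection
        self.clients[workspace_id].append(inbox)
        return inbox

    def _on_event(self, subject, payload):
        workspace_id = subject.split(".")[2]
        for inbox in self.clients.get(workspace_id, []):
            inbox.append(payload)


bus = Bus()
api = ApiServer(bus)
alice = api.connect("ws1")
bob = api.connect("ws2")

# Business logic publishes an update for workspace "ws1" only.
bus.publish("si.workspace.ws1.event", {"kind": "function_result", "ok": True})
```

After the publish, only the inbox connected to `ws1` holds the event. The publishing side and the WebSocket side share nothing but the subject name, which is the property that has let this design stay in place for so long.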

In retrospect, NATS was refreshingly easy to use in our stack for a few reasons. First, there is a fully capable Docker image which allowed us to get up and running without any configuration or tuning. In development, we all run this container on our workstations, listening on localhost without needing encryption, authentication, migrations, etc. When we first adopted NATS, we used its stateless behavior, now better known as Core NATS, because JetStream didn’t yet exist. This meant that destroying and redeploying our development NATS containers had no real impact on a running system, a trait that fosters consequence-free learning and speeds up development.

As we continued to build and iterate on our system, we found ourselves in need of a work queue pattern, where a service subscribes to a queue of work requests and those requests are serviced by a pool of worker instances. After experimenting with several approaches (including a couple of background job solutions), we settled on queue groups, a built-in capability: when subscribers supply a common “queue group” name, the NATS server load balances messages across all subscribers using that group name in an “at-most-once” fashion. As a bonus, if no subscribers are present, a request sent to one of the queue group subjects gets a “no responders” error back right away. Remember that this is part of Core NATS, so it’s better to think of the system as stateless or live and lean into that. We certainly did, and for quite some time.
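To make the queue group behavior concrete, here is a minimal in-memory sketch of the delivery rule. It is a toy model under assumptions rather than NATS itself or our service code: `ToyBroker`, `NoResponders`, the `jobs.run` subject, and the `workers` group name are all hypothetical. It shows the two properties mentioned above: each message goes to a single member of the group, and a request with nobody listening fails immediately.

```python
import random
from collections import defaultdict


class NoResponders(Exception):
    """Raised by the toy broker when a request has nobody to handle it."""


class ToyBroker:
    """In-memory illustration of queue-group delivery (not a NATS client)."""

    def __init__(self):
        # subject -> queue group name -> list of handlers
        self._groups = defaultdict(lambda: defaultdict(list))

    def queue_subscribe(self, subject, group, handler):
        self._groups[subject][group].append(handler)

    def request(self, subject, payload):
        groups = self._groups.get(subject)
        if not groups:
            # Nobody is listening: fail fast instead of waiting for a timeout.
            raise NoResponders(subject)
        # Each group gets the message once, on one member picked at random.
        # (Simplification: this returns every group's reply as a list.)
        return [random.choice(members)(payload) for members in groups.values()]


broker = ToyBroker()
handled = defaultdict(list)  # worker name -> jobs it processed


def make_worker(name):
    def handle(job):
        handled[name].append(job)
        return name
    return handle


# Three worker instances join the same queue group on the same subject.
for name in ("worker-a", "worker-b", "worker-c"):
    broker.queue_subscribe("jobs.run", "workers", make_worker(name))

for job in range(9):
    broker.request("jobs.run", job)

# Collect every job that any worker handled, to check nothing ran twice.
all_jobs = sorted(job for jobs in handled.values() for job in jobs)
```

Which worker handles which job varies from run to run, but `all_jobs` always contains each job exactly once. Nothing is stored anywhere, so a job sent while no worker is subscribed is simply rejected, which is the “stateless or live” mindset in practice.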

Over a year ago, in the fall of 2023, we started taking a closer look at some of the other features and capabilities a modern NATS system provides. It was delightful to discover that a lot of work had gone into the persistence engine called JetStream, which unlocks many more communication patterns. JetStream has you configure a named “bucket” of message, subject, consumption, and retention policies, which lets you model common patterns such as persistent work queues with retry, caches with time-to-live, event sourcing, and more. The JetStream architecture also provides key/value store (think: Redis) and object store (think: S3) APIs to add even more possibilities. We started small, using JetStream for durable work queues, but we’ve since used the key/value store API a couple of times and are eyeing a couple of use cases for the object store API. We’ve also been using a home-brew solution to propagate large graph data structures to a subset of our services over JetStream, chunking binary data in a way very similar to how the object store works. As a result, we’re easily pushing gigabytes to terabytes of data per week through NATS with this one subsystem alone.
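The chunking idea can be shown in a few lines. What follows is a minimal sketch under assumptions, not our actual wire format: the message fields (`object_id`, `index`, `total`, `sha256`, `payload`), the function names, and the 64 KiB chunk size are all hypothetical choices made for the example. The point is only that a large binary blob is split into bounded messages, each carrying enough metadata for a receiver to reorder, validate, and reassemble them.

```python
import hashlib


def make_messages(object_id, data, chunk_size=64 * 1024):
    """Split `data` into chunk messages that can be published one by one."""
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    digest = hashlib.sha256(data).hexdigest()
    return [
        {
            "object_id": object_id,
            "index": i,            # position of this chunk
            "total": len(chunks),  # how many chunks to expect
            "sha256": digest,      # digest of the whole object
            "payload": chunk,
        }
        for i, chunk in enumerate(chunks)
    ]


def reassemble(messages):
    """Rebuild the original bytes, rejecting incomplete or corrupted sets."""
    if not messages:
        raise ValueError("no chunks received")
    ordered = sorted(messages, key=lambda m: m["index"])
    total = ordered[0]["total"]
    if [m["index"] for m in ordered] != list(range(total)):
        raise ValueError("missing or duplicate chunks")
    data = b"".join(m["payload"] for m in ordered)
    if hashlib.sha256(data).hexdigest() != ordered[0]["sha256"]:
        raise ValueError("digest mismatch")
    return data


# A 256,000-byte stand-in for a serialized graph.
graph = bytes(range(256)) * 1000
messages = make_messages("graph-1", graph)

# Chunks may arrive in any order; reassembly sorts them by index.
restored = reassemble(list(reversed(messages)))
```

Here the blob becomes four messages, and `restored` equals the original bytes even though the chunks were handed over in reverse order. Dropping any one message makes `reassemble` raise instead of returning partial data.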

As we moved System Initiative closer to a wider availability release, we were keen to have a robust and well-run NATS deployment for our production SaaS environment. After all, by this point, NATS had become the connecting fabric of our system and was becoming a more significant part of our data and messaging persistence solution. We engaged Synadia in July 2024, and they deployed and now manage a NATS cluster for our production environment. Our partnership allowed us to plan, scale, and load test our environment before announcing our general availability in September. In the process, we’ve learned different ways of setting up and configuring NATS to run more reliably, faster, or both. With Synadia’s guidance, we also learned alternative ways of implementing some of our communication patterns, including why you might not want to create and destroy thousands of durable JetStream consumers in short bursts (a fun story for a podcast one day).

Today, NATS is the medium of communication, coordination, and collaboration between all of our services. As we begin 2025, it’s clear that our journey with NATS will continue to broaden and deepen. Because the source code for System Initiative is available under the Apache 2.0 license, you can check out how we use NATS. If you have any questions or want to check out System Initiative, you can join our Discord.

NATS at System Initiative

Fletcher Nichol, Principal Engineer
