What Is an In-Memory Data Grid?

And why did I spend nearly ten years running inference on them?

For about a decade, from 2011 to 2021, almost all real-time AI inference workloads ran on a category of enterprise software you may never have heard of. Fraud detection in payments, anomaly detection in information security, flagging arbitrage opportunities in capital markets, and recommending ad buys at enormous scale: the vast majority of business systems implementing these functions ran on something called an in-memory data grid, or IMDG.

If you’ve never heard of an IMDG, you are not alone. IMDGs never had a flagship open-source project you could download and run with, like message buses had Kafka or analytics engines had Spark (though there was briefly Apache Ignite, which has now metamorphosed into something more tractable: Ignite now calls itself a fast database). IMDG vendors did show up at conferences with stickers and swag and presented case studies (indeed, I presented case studies at QCon, the London In-Memory Computing Summit, and a host of other conferences) and most people thought “wow, that technology is remarkable, but it also sounds complicated and expensive.” All three of those adjectives—remarkable, complicated, expensive—were true. IMDGs lived inside Fortune 500 architectures and quietly carried workloads that the marquee databases of the era—Oracle, Postgres, Cassandra, Mongo—could not. By the time the broader industry started talking publicly about “real-time AI,” IMDG vendors had been running real-time AI in production for over a decade.

For ten years, as an IMDG SME, I’ve been answering one question more than any other: what is an IMDG? This is the explainer I wish I’d had.

The one-sentence version

An in-memory data grid is a distributed system that holds your working data set in the RAM of a cluster of machines and lets your application code execute at the site of the data that it operates on instead of pulling the data across a network to your code.

That second half is the part that gets lost. Most people hear “in-memory” and think “cache.” That’s not wrong, but caching is the least interesting thing an IMDG does. The interesting thing is that it inverts the usual relationship between application and database. In a traditional inference architecture, your application fetches data from a database or cache (which you may now call a feature store), and uses that data to intelligently effect a business function: flag a transaction as fraudulent, recommend an item to a customer, determine if a user’s sudo request is anomalous and a security threat to be investigated. In an IMDG architecture, your app instead ships a small piece of code to the grid or triggers execution of a piece of code that is already installed in the grid. And the grid runs that code against the data that’s already sitting in memory on the nodes where it lives. If it’s a trigger to execute pre-installed code—like classifier or scoring function code—the network data movement is close to zero. The trigger payload is less than 100 bytes.
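To make the inversion concrete, here is a minimal sketch. Everything in it is an assumption made for illustration: the ToyNode class, the JSON wire format, and the fraud_score rule are hypothetical stand-ins, not any vendor's API. It compares the bytes that cross the wire when the application pulls the data with the bytes that cross when it sends a trigger for pre-installed code.

```python
import json
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class ToyNode:
    """One hypothetical grid node: resident objects plus pre-installed functions."""
    store: Dict[str, dict] = field(default_factory=dict)
    functions: Dict[str, Callable[[dict], float]] = field(default_factory=dict)

    def fetch(self, key: str) -> bytes:
        # Traditional shape: serialize the whole object and ship it to the caller.
        return json.dumps(self.store[key]).encode()

    def invoke(self, trigger: bytes) -> bytes:
        # Grid shape: the trigger names a function and a key; the data stays put.
        request = json.loads(trigger)
        score = self.functions[request["fn"]](self.store[request["key"]])
        return json.dumps({"score": score}).encode()


def fraud_score(customer: dict) -> float:
    # Hypothetical scoring rule: largest recent amount relative to the mean.
    amounts = customer["recent_amounts"]
    return max(amounts) / (sum(amounts) / len(amounts))


node = ToyNode()
node.store["acct-1"] = {"recent_amounts": [20.0] * 999 + [5000.0]}
node.functions["fraud_score"] = fraud_score

# Shape 1: pull the data to the code.
fetched = node.fetch("acct-1")
pulled_score = fraud_score(json.loads(fetched))

# Shape 2: send a small trigger to the data.
trigger = json.dumps({"fn": "fraud_score", "key": "acct-1"}).encode()
reply = node.invoke(trigger)
pushed_score = json.loads(reply)["score"]

print("bytes moved when pulling data:", len(fetched))
print("bytes moved when triggering code:", len(trigger) + len(reply))
```

Both shapes compute the same score. The only difference is how much data crosses the network to get there.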

That’s a different shape of system. And it’s a shape that matters when you have a real-time decisioning problem that the traditional shape can’t execute within the time budget you have—maybe 20ms to score fraud after a credit card swipe.

Why “in-memory” stopped being the point

When the first IMDGs appeared in the mid-2000s—GigaSpaces XAP, Coherence, GemFire, eventually Hazelcast and others—the headline feature that the trade press picked up was “your data is in RAM, so reads are microseconds instead of milliseconds.” Now, that was a big deal in 2007. Disks were slow—rotationally slow—and putting your hot data in RAM was a competitive advantage. But it’s now commoditized and it’s not why IMDGs are interesting.

Redis will hold your data in RAM for a fraction of the cost of an IMDG license. So will memcached. So will, for that matter, any modern database with a buffer pool. The “in-memory” half of “in-memory data grid” is table stakes now and was never the point. Vendors just couldn’t articulate in the early 2000s what the real point was: the grid part. Distributed grid computing. Co-located with data. The ability to do real work on data without moving the data.

If your problem is “we need a fast cache in front of Postgres,” you do not need an IMDG. You need Redis. You will be happier with Redis. Buying an IMDG instead of Redis means your “simple” caching layer requires vendor professional services and a four-day training course. Don’t do that.

If your problem is “we have a business decisioning need that cannot be expressed as a series of database queries because the data is too large to move and the logic is too complex to express in SQL”—that is when an IMDG starts to look like the right tool. And there are more of those problems out there than people realize, because the industry has trained engineers to assume the answer is always “throw it at Spark” or “throw it at a data warehouse.” That works for offline decisioning, but for real-time decisioning with a latency budget of twenty milliseconds? Spark is laughing at you.

Objects, not records

IMDGs reveal their uniqueness through the vocabulary the IMDG world uses. NoSQL key-value databases store records. IMDGs store objects. The word choice is intentional, and it points at one of the architectural commitments that makes the category what it is.

IMDG data is stored in memory in a language-neutral binary layout that was engineered, from the start, to be cheap to rehydrate as a native object in whatever language your code happens to be written in. On any IMDG worth its salt, you can put() a Python object using a key-value interface and get() a Java object back. The grid doesn’t store Python bytes or Java bytes, it stores a portable binary representation, and each client or co-located code fragment invoked by the server runtime reconstitutes its own native object from that representation (Apache Ignite’s Binary Objects documentation is the clearest public write-up of how these formats work; every IMDG has its own variant).
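As a rough illustration of the idea, and not of any vendor's actual layout, here is a toy portable format sketched in Python. The type tags, the field ordering, and the Customer class are all assumptions I made up for the example. The point is only that the stored bytes describe fields and types rather than a Python or Java object, so any language with a matching decoder can rehydrate its own native object.

```python
import struct
from dataclasses import dataclass

# Hypothetical type tags for a toy wire format.
TAG_INT, TAG_FLOAT, TAG_STR = 1, 2, 3


def encode(fields: dict) -> bytes:
    """Pack named fields into a big-endian, language-neutral byte string."""
    out = [struct.pack(">H", len(fields))]
    for name, value in fields.items():
        name_bytes = name.encode("utf-8")
        out.append(struct.pack(">H", len(name_bytes)) + name_bytes)
        if isinstance(value, int):
            out.append(struct.pack(">Bq", TAG_INT, value))
        elif isinstance(value, float):
            out.append(struct.pack(">Bd", TAG_FLOAT, value))
        elif isinstance(value, str):
            data = value.encode("utf-8")
            out.append(struct.pack(">BI", TAG_STR, len(data)) + data)
        else:
            raise TypeError(f"unsupported field type: {type(value).__name__}")
    return b"".join(out)


def decode(buf: bytes) -> dict:
    """Walk the byte string and rebuild the field dictionary."""
    (count,) = struct.unpack_from(">H", buf, 0)
    pos = 2
    fields = {}
    for _ in range(count):
        (name_len,) = struct.unpack_from(">H", buf, pos)
        pos += 2
        name = buf[pos:pos + name_len].decode("utf-8")
        pos += name_len
        (tag,) = struct.unpack_from(">B", buf, pos)
        pos += 1
        if tag == TAG_INT:
            (value,) = struct.unpack_from(">q", buf, pos)
            pos += 8
        elif tag == TAG_FLOAT:
            (value,) = struct.unpack_from(">d", buf, pos)
            pos += 8
        elif tag == TAG_STR:
            (size,) = struct.unpack_from(">I", buf, pos)
            pos += 4
            value = buf[pos:pos + size].decode("utf-8")
            pos += size
        else:
            raise ValueError(f"unknown type tag: {tag}")
        fields[name] = value
    return fields


@dataclass
class Customer:
    account: str
    income: int
    risk: float


# What the grid holds is the portable blob, not a pickled Python object.
blob = encode({"account": "acct-1", "income": 85000, "risk": 0.07})

# Each reader rehydrates its own native object from the same bytes.
customer = Customer(**decode(blob))
print(customer)
```

A Java or C++ reader would implement the same decode walk and build its own class from the same bytes.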

This matters more than you might think. Inference code that runs co-located with grid data is frequently a mix of languages—Java for executing routine business operations like updates, Python for ML model code, and sometimes C++ when the latency budget is brutal. Cross-language object exchange via standard mechanisms is expensive. Indeed, it’s so expensive that we have an entire ecosystem of such representations—Google’s Protobuf, Apache Avro, MessagePack—that trade one kind of expense for another. The grid-native binary formats were designed specifically to avoid that cost: a local memory lookup plus rehydration into a native language object typically completes in about half a microsecond. That’s the budget that makes co-located compute viable.

The “record vs. object” distinction is a tell. When you hear a vendor talk about records, they’re selling you a database. When you hear them talk about objects, they’re selling you a data storage and execution environment.

Partition-aware placement: data that knows where it belongs

Beyond co-located processing, the architectural commitment that most clearly separates an IMDG from a distributed cache is that the grid understands logical relationships between pieces of data and uses those relationships to decide where data physically lives.

Consider fraud detection. The customer object—keyed by primary account number, carrying income, credit risk, geographic priors, behavioral history, etc.—is logically the parent of every transaction object the customer has executed in the past year. A classifier evaluating whether a new transaction is fraudulent needs both the customer context and the recent transaction stream. In a conventional architecture, the classifier issues a query for the customer, issues a second query for the transaction history, waits for both round trips, and then runs its logic. Each round-trip burns latency you may not have.

In an IMDG, you declare to the grid that customer objects and their child transaction objects are affinity-colocated. The grid then guarantees that every transaction belonging to a given customer lives in the same memory image as the customer object itself. Your classifier code, when it executes on the node that owns that customer’s partition, finds both the parent and the children already resident in local RAM. Zero network transfer. The imperative logic—read customer, iterate over transactions in a massively parallel way, compute features, run the model—runs against data that is, from the code’s point of view, just there.

In the IMDG universe, this is called partition-aware placement, and it naturally encodes the master-detail and parent-child relationships that show up in essentially every real business domain. NoSQL key-value databases don’t do this; they treat every key as independent and let the hash function scatter related data across the cluster. Distributed caches don’t do this for the same reason. It is, more than any single other feature, the thing that lets an IMDG support imperative, stateful, multi-object business logic at the millisecond latencies the category claims.
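Here is a minimal sketch of the placement rule, under stated assumptions: ToyGrid, the partition count of 8, and the affinity_key parameter are hypothetical names invented for this example, and real grids expose the same idea through their own APIs. The partition is chosen by hashing the affinity key rather than the object's own key, so child transactions follow their parent customer.

```python
import hashlib
from collections import defaultdict

PARTITIONS = 8  # hypothetical partition count


def partition_for(affinity_key: str) -> int:
    """Map an affinity key to a partition with a stable hash."""
    digest = hashlib.sha256(affinity_key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % PARTITIONS


class ToyGrid:
    """Toy single-process model of a partitioned store."""

    def __init__(self):
        self.partitions = defaultdict(dict)

    def put(self, key, value, affinity_key=None):
        # Placement follows the affinity key when one is given, else the key itself.
        partition = partition_for(affinity_key or key)
        self.partitions[partition][key] = value
        return partition

    def run_local(self, affinity_key, fn):
        # Run fn against the partition that owns the affinity key.
        return fn(self.partitions[partition_for(affinity_key)])


grid = ToyGrid()
customer_partition = grid.put("cust-42", {"type": "customer", "avg_spend": 50.0})

# Each transaction declares its parent customer as its affinity key.
txn_partitions = {
    grid.put(
        f"txn-{i}",
        {"type": "txn", "customer": "cust-42", "amount": amount},
        affinity_key="cust-42",
    )
    for i, amount in enumerate([20.0, 35.0, 900.0])
}


def flag_outliers(local_store):
    # Parent and children are both in local_store; no second lookup is needed.
    customer = local_store["cust-42"]
    return sorted(
        key
        for key, obj in local_store.items()
        if obj["type"] == "txn" and obj["amount"] > 10 * customer["avg_spend"]
    )


flagged = grid.run_local("cust-42", flag_outliers)
print(customer_partition, txn_partitions, flagged)
```

Without the affinity key, each transaction would hash on its own key and could land on any partition, which is the scatter behavior described above for key-value stores.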

Scatter, gather, and the nodes that get to stay idle

Once you have objects in a portable binary format and you have related objects co-located in the same memory image, the rest of the architecture follows. This is the part that made IMDGs the dominant substrate for real-time inference during the multi-core era—roughly the decade between when Intel and AMD stopped chasing single-thread frequency and when GPUs took over the inference conversation.

Here’s how a single inference request actually executes on a grid. The client triggers some code—a classifier, a feature extractor, a scoring function, whatever—along with an implicit query that describes which data the code needs to run against. The grid consults its hash function to determine exactly which nodes hold that data. It then broadcasts the code to only those nodes. On each of them, the code runs in parallel against the local partition, using as many CPU cores as the node has available. Results stream back to the client, which assembles the final answer. This pattern is called scatter-gather, and it is the workhorse idiom of every IMDG.
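The request flow above can be sketched in a few lines. This is a single-process toy, not a grid client. The cluster dictionary, the owner_of directory (a stand-in for the grid's hash function), and the max_amount function are all hypothetical. Threads stand in for nodes doing work in parallel.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical cluster: node name -> objects resident in that node's RAM.
cluster = {
    "node-a": {"acct-1": [12.0, 40.0], "acct-2": [5.0]},
    "node-b": {"acct-3": [900.0, 15.0]},
    "node-c": {"acct-4": [7.5]},
    "node-d": {"acct-5": [60.0, 61.0]},
}

# Stand-in for the grid's hash function: which node owns which key.
owner_of = {key: node for node, store in cluster.items() for key in store}


def max_amount(local_store, keys):
    """Code shipped to a node; it touches only that node's local data."""
    return {key: max(local_store[key]) for key in keys}


def scatter_gather(fn, keys):
    # Scatter: group the requested keys by owning node.
    by_node = {}
    for key in keys:
        by_node.setdefault(owner_of[key], []).append(key)
    if not by_node:
        return {}, []
    # Run fn only on the nodes that own some of the data, in parallel.
    with ThreadPoolExecutor(max_workers=len(by_node)) as pool:
        futures = {
            node: pool.submit(fn, cluster[node], node_keys)
            for node, node_keys in by_node.items()
        }
        # Gather: merge the partial results on the client side.
        merged = {}
        for future in futures.values():
            merged.update(future.result())
    return merged, sorted(by_node)


result, participants = scatter_gather(max_amount, ["acct-1", "acct-3", "acct-2"])
print(result)        # partial results merged into one answer
print(participants)  # node-c and node-d never saw this request
```

Note what the participants list leaves out: two of the four nodes did no work at all for this request.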

The hash function doing the routing matters. Most production IMDGs use rendezvous hashing, invented by Thaler and Ravishankar at the University of Michigan in 1996 (their original technical report, CSE-TR-316-96, predates the more famous consistent hashing paper from MIT by a year). Rendezvous hashing lets every client, independently and without coordination, agree on exactly which subset of nodes owns any given key. Apache Ignite uses it. Other IMDGs use variants. The shared property that matters is precision: given an implicit query, the grid can name the exact set of nodes that need to participate, and just as importantly, the exact set of nodes that can stay out of it.
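For readers who have not met it, the core of rendezvous (highest-random-weight) hashing fits in a dozen lines. This is a minimal sketch, not Ignite's or anyone else's implementation. The choice of SHA-256 as the weight function and the replicas parameter are my assumptions. Every client scores each (node, key) pair and picks the highest scores, so any two clients with the same node list get the same answer without talking to each other.

```python
import hashlib


def weight(node: str, key: str) -> int:
    """Pseudo-random score for a (node, key) pair, identical on every client."""
    digest = hashlib.sha256(f"{node}|{key}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def owners(key: str, nodes, replicas: int = 1):
    """Return the nodes that own key: the highest-weight node(s) win."""
    ranked = sorted(nodes, key=lambda node: weight(node, key), reverse=True)
    return ranked[:replicas]


nodes = ["node-a", "node-b", "node-c", "node-d"]

# Two clients that hold the node list in different orders still agree.
client_one = owners("acct-1", nodes, replicas=2)
client_two = owners("acct-1", list(reversed(nodes)), replicas=2)
print(client_one, client_one == client_two)
```

Any node that is not in the returned list can stay out of the request entirely, which is the precision property described above.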

That second part—stay out of it—is what unlocks massive parallelism. Nodes excluded from a given scatter-gather incur no CPU cost from it. Their cores remain fully available to service other inference requests at the same time. A grid with a few dozen nodes can be running thousands of inference requests in parallel, each one engaging only the subset of nodes that hold the data the inference code needs to make a decision to flag a fraudulent transaction or recommend a product to a customer, with no interference between them. This is how IMDGs supported millions of concurrent inference operations on commodity hardware during the years when “real-time AI” meant CPU-based inference and the whole industry was trying to figure out how to use sixteen cores effectively.

The lengths these systems went to in order to keep cores busy were sometimes astonishing. A favorite example: Hazelcast is a JVM product, but the data science world lives in Python, and Python’s global interpreter lock is famously hostile to parallelism within a single process. So Hazelcast engineered around this by spawning multiple CPython subprocesses on each cluster node and routing pipeline data through them in parallel, with the level of parallelism configurable per Python stage. The JVM hosts the grid and orchestrates the work; CPython does the actual ML inference; multiple CPython processes per node escape the GIL by sidestepping it entirely. The fact that a serious enterprise platform shipped this much engineering specifically to avoid letting a Python interpreter monopolize a single core tells you everything about the moral universe IMDGs lived in. An idle core was a sin against all that is right and just in the world.

GPUs eventually changed the shape of the inference problem—but only for one part of it. A GPU is a machine for doing regular, dense arithmetic on large tensors, and a neural network is exactly that kind of workload: predictable, branch-free, embarrassingly parallel. Tree-based models are not. A gradient-boosted ensemble or a random forest makes a prediction by walking thousands of decision trees, and every node in every tree is a data-dependent branch—compare a feature to a threshold, go left or right, repeat. That irregular, branch-heavy control flow is the precise opposite of what GPU hardware is built to accelerate, which is why tree ensembles still run best on CPUs. And tree ensembles are not a fading niche: for tabular, feature-store-shaped data—fraud, risk, recommendation, the broad middle of enterprise machine learning—gradient-boosted trees remain the workhorse, frequently outperforming deep learning on exactly the data IMDGs were built to hold. Before the GPU era, IMDGs were the gold standard for real-time AI inference across the board. For the large class of models that GPUs never absorbed, they still have no obvious successor at the same latency.
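To see why this control flow is irregular, here is a minimal sketch of tree-ensemble prediction. The TreeNode class, the two tiny trees, and the feature meanings are hypothetical. Real gradient-boosting libraries use flattened arrays and many more trees, but the traversal logic is the same: each step depends on the value just compared.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class TreeNode:
    feature: int = 0          # index of the feature to test
    threshold: float = 0.0    # split point
    left: Optional["TreeNode"] = None
    right: Optional["TreeNode"] = None
    value: float = 0.0        # prediction, used only at leaves


def predict_tree(node: TreeNode, x: Sequence[float]) -> float:
    # Each iteration is a data-dependent branch: the next node is unknown
    # until this comparison has been made.
    while node.left is not None and node.right is not None:
        node = node.left if x[node.feature] <= node.threshold else node.right
    return node.value


def predict_ensemble(trees, x: Sequence[float]) -> float:
    # A boosted ensemble sums the leaf values of every tree.
    return sum(predict_tree(tree, x) for tree in trees)


# Hypothetical features: x[0] = transaction amount, x[1] = distance from home.
tree_one = TreeNode(
    feature=0, threshold=100.0,
    left=TreeNode(value=0.1),
    right=TreeNode(
        feature=1, threshold=50.0,
        left=TreeNode(value=0.3),
        right=TreeNode(value=0.9),
    ),
)
tree_two = TreeNode(
    feature=1, threshold=10.0,
    left=TreeNode(value=-0.05),
    right=TreeNode(value=0.2),
)

score = predict_ensemble([tree_one, tree_two], [250.0, 80.0])
print(score)
```

Two inputs that differ in one feature can take completely different paths through every tree, so there is no fixed arithmetic schedule to batch.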

IMDGs were cloud-native before “cloud-native” was a thing

IMDGs have one more architectural property that matters enormously, and it isn’t obvious from anything I’ve described so far: IMDGs are inherently elastic. They were elastic before “elasticity” became a marketing term. And the reasons are mechanical and baked into the IMDG design itself, not an aspirational bolt-on.

The first reason is the simpler one. Data that lives in RAM doesn’t have to be read off a disk before being put on the wire. When you add a node to a cluster, the partitions that move to that node move at the speed the network can carry them, with no I/O wait on the source side. When you remove a node, the partitions it owned get redistributed at the same speed. Cluster expansion and contraction are network-bound, not disk-bound, and they happen in minutes rather than hours. This is a property that conventional databases, even fast ones, struggle to match—they spend most of a rebalance pulling data off disk before putting it on the wire.

The second reason is more subtle, and it’s the answer to a question I often heard as the Principal Architect of a leading IMDG vendor: of all the consistent hashing algorithms that exist, why did almost every production IMDG converge on rendezvous? The algorithm is older than consistent hashing by a year. It’s not the most famous. It’s not the one Akamai used. Why this one?

The answer is in the proof. In the original Thaler/Ravishankar tech report, there’s a probabilistic argument (not long, surprisingly elegant) that rendezvous hashing optimizes two properties simultaneously. The first one is obvious: the hash distributes data approximately uniformly across cluster nodes. The second is the property that matters: it provably minimizes the number of data movement operations required to restore approximately uniform distribution after a topology change. When you add or remove a node, rendezvous hashing rebalances by moving the minimum expected number of partitions, within defined confidence bounds. Other hash functions get distribution right; rendezvous gets re-distribution right.
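You can watch the second property in a small simulation. This is a sketch under my own assumptions: SHA-256 as the weight function, 10,000 hypothetical partitions, and a cluster growing from five nodes to six. The modulo scheme is included only as a naive baseline; it is not consistent hashing. With rendezvous hashing, a partition changes owner only if the new node outscores its current owner, so the expected share that moves when going from n to n + 1 nodes is 1/(n + 1), and every partition that moves goes to the new node.

```python
import hashlib


def weight(node: str, key: str) -> int:
    digest = hashlib.sha256(f"{node}|{key}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def rendezvous_owner(key: str, nodes) -> str:
    # Highest-random-weight: the node with the top score owns the key.
    return max(nodes, key=lambda node: weight(node, key))


def modulo_owner(key: str, nodes) -> str:
    # Naive baseline: hash the key, take it modulo the node count.
    digest = hashlib.sha256(key.encode()).digest()
    return nodes[int.from_bytes(digest[:8], "big") % len(nodes)]


def moved_fraction(owner_fn, keys, before, after) -> float:
    moved = sum(1 for key in keys if owner_fn(key, before) != owner_fn(key, after))
    return moved / len(keys)


keys = [f"partition-{i}" for i in range(10_000)]
before = [f"node-{i}" for i in range(5)]
after = before + ["node-5"]

rendezvous_moved = moved_fraction(rendezvous_owner, keys, before, after)
modulo_moved = moved_fraction(modulo_owner, keys, before, after)

# Partitions that rendezvous relocated to anywhere other than the new node.
relocated_elsewhere = [
    key
    for key in keys
    if rendezvous_owner(key, before) != rendezvous_owner(key, after)
    and rendezvous_owner(key, after) != "node-5"
]

print(f"rendezvous moved {rendezvous_moved:.1%} (ideal is 1/6)")
print(f"modulo moved     {modulo_moved:.1%}")
print("moved to an old node:", len(relocated_elsewhere))
```

The modulo baseline reshuffles most of the cluster for the same one-node change, which is the difference between getting distribution right and getting re-distribution right.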

This is the property that makes IMDGs viable in a cloud. Consider the Williams-Sonoma Group of Companies, one of the largest new customers Hazelcast landed during my time there. WSGC comprises a whole series of bougie home goods brands, from Williams-Sonoma itself to Pottery Barn to West Elm. WSGC ran Hazelcast for real-time recommendations, customer 360, and related workloads. In retail, traffic on Cyber Monday might exceed traffic on a Tuesday in March by a factor of a thousand. So, we built automation that could scale a Hazelcast cluster in progressive stages, from five nodes to a far larger number that I’m not at liberty to disclose, with a stabilization period after each expansion.

The stabilization periods existed because of a subtle interaction with the network topology of the era. When the cluster was actively migrating partitions to a newly added node, some scatter-gather operations would require an extra network hop: a request would arrive at a node that no longer owned the relevant data, and the request would have to chase the partition to its new home. On 10 gigabit Ethernet, which was all we had at the time, that extra hop was expensive enough to matter. More importantly, every CPU cycle the cluster spent computing how to handle a transitional topology was a CPU cycle it wasn’t spending on inference. Rebalance time was time the cluster wasn’t doing its job.

Rendezvous hashing minimizes that time. Fewer partitions moving per topology change means shorter stabilization windows, means faster overall scale-out, means the cluster spends more of its life serving traffic and less of its life rebalancing.

So, the architectural picture, finally, looks like this. Multi-core processors gave us cheap parallelism on each node. Scatter-gather over partition-aware-placed data let us use that parallelism efficiently across the cluster. Rendezvous hashing let the cluster grow and shrink without paying for the privilege. And cloud infrastructure made elasticity into a deployable primitive that customers wanted to use. IMDGs landed in the sweet spot where all four of these forces met. Multi-core became a thing. The cloud became a thing. Real-time AI became a thing. And for about a decade, the technology that quietly stitched them together was the in-memory data grid.

IMDGs aren’t a panacea

Having spent nearly a decade architecting systems built on these things and evangelizing them to the world, I’ll tell you where they lose. Because if I don’t, that’s only half the story. And good engineers tell the whole story.

They lose to Redis on simple caching. If you are evaluating an IMDG purely for caching and your data fits the Redis data model, buy Redis. I held principal-level positions at two IMDG companies and still hold substantial equity stakes in them. And even I’ll tell you that Redis is a better choice here.

They lose to purpose-built databases on point lookups at extreme scale. If your workload is “look up a user profile by ID, two hundred thousand times a second, with predictable latency under a millisecond,” there are systems that will do this better than an IMDG will, at a fraction of the operational cost. I work at one of them now—Aerospike. IMDG vendors have spent years trying to be good at this workload, and they’re fine at it, but “fine” is not “purpose-built” and storage tiering in IMDGs introduces latency unpredictability that HMA (hybrid memory architecture) systems like Aerospike don’t have.

They lose to Spark and Flink on batch and streaming analytics where latency budgets are forgiving. If you have ten minutes to produce an answer, you do not need the data in RAM. You need the right tool for batch, and the right tool for batch is not an IMDG.

Where IMDGs win

They win when four conditions line up: the working set is large but bounded, the inference computation is non-trivial, the latency budget is tight, and the model used for inference doesn’t compete with the IMDG for RAM. If you have a 10 million parameter ANN model, that will not meaningfully interfere with the in-memory storage image on the nodes. If you have a 30 billion parameter model, that belongs on its own server, in its own memory image, not competing with the IMDG data store. When all these conditions hold, nothing else in the market is competitive.
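The arithmetic behind that fourth condition is worth doing once. This sketch rests on two assumptions of mine that the text above does not state: 4 bytes per parameter (32-bit floats) and a hypothetical node with 256 GiB of RAM. Adjust both for your own hardware and model precision.

```python
BYTES_PER_PARAMETER = 4        # assumption: 32-bit float weights
NODE_RAM_BYTES = 256 * 2**30   # assumption: a hypothetical 256 GiB node


def model_bytes(parameters: int, bytes_per_parameter: int = BYTES_PER_PARAMETER) -> int:
    """Memory needed just to hold the model weights."""
    return parameters * bytes_per_parameter


small_model = model_bytes(10_000_000)       # 40,000,000 bytes, about 40 MB
large_model = model_bytes(30_000_000_000)   # 120,000,000,000 bytes, about 120 GB

small_share = small_model / NODE_RAM_BYTES
large_share = large_model / NODE_RAM_BYTES

print(f"10M-parameter model: {small_model / 1e6:.0f} MB, {small_share:.3%} of node RAM")
print(f"30B-parameter model: {large_model / 1e9:.0f} GB, {large_share:.1%} of node RAM")
```

Under these assumptions the small model is noise next to the data image, while the large one would take a big share of every node that hosted it.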

The workloads where I’ve seen this play out: real-time risk and exposure calculation in trading systems, fraud scoring on payment streams, telecom QoS analysis, and large-scale personalization and recommendation engines. The common thread is that the system needs to think, not just look things up, and it needs to think fast enough that a human-perceptible event is waiting for the answer.

A note on what these things cost

IMDGs developed a reputation for being ridiculously expensive, and the reputation is entirely earned.

The reason is that everything about an IMDG is optimized for performance under load, and performance optimization at that level is expensive to engineer and expensive to license. The rendezvous hashing choice is a tell, again. It’s a meaningfully more complex algorithm to implement correctly than the alternatives—which is part of why it sat in the literature from 1996 without much industrial uptake. Consistent hashing came out a year later and got most of the attention because it was easier to implement and good enough for the use cases that mattered at the time, like CDN cache placement. Rendezvous didn’t win on simplicity. It won on the specific property that mattered for elastic clusters under active load: minimum disruption during topology changes. If you needed that property, you used rendezvous. If you didn’t, you used something easier. The IMDG vendors universally needed it, and they paid the engineering cost to ship it, and they charged for the result.

This pattern repeats across the architecture. The portable binary object format is more complex than language-native serialization. Partition-aware placement is more complex than letting the hash function scatter data uniformly. Co-located compute with affinity routing is more complex than a remote-procedure-call architecture. Each of these decisions traded engineering complexity and licensing dollars for performance under realistic production conditions. Customers who needed that performance paid. Customers who didn’t bought something cheaper, and they were right to.

This is why I enthusiastically endorse Redis for caching workloads and Snowflake for offline inference. Both are excellent products that solve their problems better than an IMDG would, at a fraction of the cost. If your problem is “cache hot data in front of a database,” buy Redis. If your problem is “run inference on a feature table once a day,” buy Snowflake. Don’t pay for an IMDG to do these jobs. You’ll be unhappy with the bill, and the IMDG won’t be especially good at either job anyway.

But if your problem in 2016 was “score this transaction as fraud or not-fraud in ten milliseconds, using both the customer’s behavioral history and the current network of related transactions, while doing the same for ten thousand other transactions arriving in the same second”—to paraphrase Ghostbusters, who else were you going to call? The list of technologies that could actually do that job, at that latency, at that concurrency, on the hardware of the era, was very short. Indeed, that list had only one line item: “IMDGs.”

The end of the IMDG era and the bright future ahead

Things have changed now. And that list of technologies has a lot more line items than one. For one thing, networks have gotten dramatically faster. Over 400GbE, rebalancing no longer requires the minimal data movement guarantees that rendezvous hashing provides. Nor, for that matter, is pulling a lot of data from a fast key-value store like Aerospike to an external inference cluster particularly expensive. The categories that succeeded IMDGs—including uniquely fast NoSQL databases with predictably low latency and throughput guarantees even under uncertainty, which is how Aerospike and its competitors position themselves—were built in part by people who learned what the IMDGs were doing right and figured out how to do it for a fraction of the operational cost. That progression is healthy. It’s also why my portfolio of IMDG experience is more relevant than ever.

In real-time AI, the mark of a good architect is the ability to recognize that a problem is unsolvable with the standard set of tools we all know—Kafka, Spark, Cassandra, Redis, Flink—and the confidence to leave those tools behind and adopt wholly new technologies, however exotic they might at first seem. This is something I have some experience with. For a lot of our big financial services customers like Deutsche Bank, J.P. Morgan Chase, and HSBC, when I did an initial proof-of-concept and met with the customer’s DevOps and MLOps teams, they were somewhat mystified by IMDGs. These things were exotic. After all, what does “objects/sec rehydrated on node 16 = 482,773” mean when it appears on your AppDynamics telemetry panel? That’s not a statistic a database or a message bus emits. But after a year, with mentoring and coaching through collaborative work with our professional services and TAM teams, those same folks who were once mystified had no problem building automation to dynamically expand and contract multiple clusters at the touch of a button.

The lesson is that what seems exotic and risky can easily become tractable if you invest the resources to understand it. And getting good vendor professional services is a great way to do that. If you capture the knowledge they transfer to you in an auditable, living document on a Confluence wiki or wherever, the exotic becomes tractable and then mundane. This is how IMDGs went from “scary” to “scale out automatically” for HSBC’s DevOps and MLOps teams.

An IMDG elegy

In-memory data grids let you do real computation, in parallel, on related data, fast enough that a human-perceptible event was waiting for the answer. For about a decade, that capability didn’t exist anywhere else at the latencies the business needed. The war stories elsewhere on this site are what that looked like in practice.