One of the biggest limitations in AI development in 2026, if not the biggest, is properly managing context and memory. It's hard to leverage intelligence effectively if the system can't remember what it has learned. We have to start thinking of the context window as a CPU register, not a hard drive, and move memory into auxiliary systems that aren't lost each time a new chat is created.
This year, prompt engineering has taken a backseat and evolved into 'context engineering': deciding what goes into the window, when, in what shape, and what gets removed and when. This requires weighing the system prompt, the user prompt, guardrails, structured output, short- and long-term memory, and external memory tooling (the focus of this article).
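To make the assembly problem concrete, here's a minimal sketch of a context assembler working under a token budget. The component names, the priority scheme, and the tiktoken-based counting are my own assumptions for illustration, not a description of any particular framework.

```python
# A minimal sketch of context assembly under a token budget.
# Components are ordered highest priority first; lower-priority pieces
# are dropped whole rather than truncated mid-thought. Illustrative only.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def assemble_context(components: list[tuple[str, str]], budget: int) -> str:
    """components: (label, text) pairs, ordered highest priority first."""
    window, used = [], 0
    for label, text in components:
        cost = len(enc.encode(text))
        if used + cost > budget:
            continue  # skip this piece entirely; keep room for later ones
        window.append(f"## {label}\n{text}")
        used += cost
    return "\n\n".join(window)

context = assemble_context(
    [
        ("System prompt", "You are a careful coding assistant."),
        ("User prompt", "Why is the build failing?"),
        ("Long-term memory", "User prefers TypeScript; project uses pnpm."),
        ("Retrieved docs", "pnpm workspaces require a pnpm-workspace.yaml file."),
    ],
    budget=8_000,
)
```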
Over the past couple of years, some folks believed that if we simply kept increasing the size of the context window, it would eventually be large enough to solve the memory problem by sheer brute force. However, even with the advent of 1-2 million token context windows, we are seeing the same issues, especially with large sets of documents. Long-context models also introduce new problems of their own, and seem to be hitting a ceiling of sorts. The question is no longer "can it fit?" but "can the model attend to it?"
- LLMs attend best to the start and end of context and have more trouble with material in the middle. This is sometimes called the "lost-in-the-middle" problem.
- When you let the model iteratively rewrite its own running summary, details erode turn over turn and domain insight is sacrificed for conciseness. This is sometimes known as "context collapse".
- Finally, there's context poisoning: verbose tool outputs and exploratory reads pollute the window and degrade later reasoning. It's by far the most common cause of "the model got dumber" or "the model is drifting" failure modes.
So, now that we are aware of the problems, what are the emerging solutions?
- Retrieval Augmented Generation (RAG) - RAG dynamically fetches relevant chunks and documents from an external knowledge store and injects them into the prompt just before generation, so the model can answer with current information rather than relying solely on static training data. Traditionally this was a vector-embedding search over a vector DB, but these days a hybrid approach is more common: semantic search results are combined with keyword search results and reranked, plus an agentic step to determine whether retrieval would be advantageous at all (see the sketch after this list).
- Markdown Memory - This is mainly how Claude Code and Cursor operate for agentic programming: you externalize state into the file structure itself. A main MD file (claude.md, agents.md, etc.) holds stable facts about the project, such as architecture, best practices, and what to do and what to never do. Reference files live as separate MD files linked from the main one, so the model can decide what's necessary. Skill files are a more recent improvement: they describe when and how to execute certain tools, and are only loaded into context when deemed necessary. Some people (Karpathy) take this to the next level by creating an Obsidian vault (or similar PKM environment) that the AI is allowed to curate itself to maintain its own memory. More on that later.
- Subagents - Instead of doing 20 file reads in the main session, spawn a subagent with its own context window and have it return a summary, rather than loading all of that material directly into context. This technique has been highly impactful on the agentic coding front: repo exploration, test runs, doc research, and security audits are all fantastic subagent use cases in the development process.
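Here is a minimal sketch of the hybrid retrieval step from the RAG item above, assuming chromadb for the semantic leg and rank_bm25 for the keyword leg; the toy corpus and reciprocal-rank-fusion merge are illustrative choices, not a prescription.

```python
# A sketch of hybrid retrieval: semantic search (chromadb) plus keyword
# search (rank_bm25), merged with reciprocal rank fusion. Toy data.
import chromadb
from rank_bm25 import BM25Okapi

docs = [
    "The build uses pnpm workspaces.",
    "Auth tokens are rotated every 24 hours.",
    "Deploys run through GitHub Actions.",
]
ids = [f"doc{i}" for i in range(len(docs))]

collection = chromadb.Client().create_collection("notes")
collection.add(documents=docs, ids=ids)
bm25 = BM25Okapi([d.lower().split() for d in docs])

def hybrid_search(query: str, k: int = 2) -> list[str]:
    # Semantic leg: embedding similarity via chroma's default embedder.
    semantic = collection.query(query_texts=[query], n_results=k)["ids"][0]
    # Keyword leg: BM25 over whitespace tokens.
    scores = bm25.get_scores(query.lower().split())
    keyword = [ids[i] for i in sorted(range(len(docs)), key=lambda i: -scores[i])[:k]]
    # Reciprocal rank fusion: reward ids that rank well in either list.
    fused: dict[str, float] = {}
    for ranking in (semantic, keyword):
        for rank, doc_id in enumerate(ranking):
            fused[doc_id] = fused.get(doc_id, 0.0) + 1 / (60 + rank)  # 60 is the usual RRF constant
    return sorted(fused, key=fused.get, reverse=True)

print(hybrid_search("how often do tokens rotate?"))
```

In a full pipeline, a reranker and an agentic "is retrieval even needed?" check would sit on either side of this function.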
Agents themselves use plenty of their own tricks to save on context. One is scheduled compaction: when the context fills, the system summarizes the older portion into a structured note and then continues, with tool results usually pruned first. There can also be dedicated notes files that the agent carries across tasks. ACE-style playbooks are showing promise for long-term memory: instead of summarizing the past into something shorter, you grow a structured playbook of learned strategies and rules, updated incrementally rather than rewritten wholesale. Auto-evolving prompts like this seem promising, but have yet to gain much of a foothold in industry or research.
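A minimal sketch of scheduled compaction, assuming an OpenAI-style chat API; the character threshold, the summarizer prompt, and the choice to prune tool messages first are all illustrative assumptions.

```python
# A sketch of scheduled compaction: when the transcript grows past a
# threshold, prune verbose tool outputs first, then fold the older half
# into one structured note. Thresholds and prompts are illustrative.
from openai import OpenAI

client = OpenAI()

def compact(messages: list[dict], max_chars: int = 40_000) -> list[dict]:
    if sum(len(m["content"]) for m in messages) <= max_chars:
        return messages
    # Step 1: tool results get pruned first, since they bloat fastest.
    for m in messages:
        if m["role"] == "tool" and len(m["content"]) > 500:
            m["content"] = m["content"][:500] + " [truncated]"
    # Step 2: summarize the older half into a single structured note.
    # messages[0] is assumed to be the system prompt and is kept as-is.
    head, tail = messages[1:len(messages) // 2], messages[len(messages) // 2:]
    summary = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": "Summarize this transcript into a structured note "
                       "(decisions made, open tasks, key facts):\n\n"
                       + "\n".join(f"{m['role']}: {m['content']}" for m in head),
        }],
    ).choices[0].message.content
    return [messages[0], {"role": "system", "content": f"[Compacted history]\n{summary}"}] + tail
```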
MCP (Model Context Protocol) serves as a universal plug-and-play system for connecting LLMs directly to other applications, such as Gmail or Postgres. The model can pull in whatever it needs at the moment it needs it, rather than having everything loaded into context up front. MCP servers expose tools (functions the model can call), prompts (reusable prompt templates), and resources (data the model can read and use). In many ways it's the connective tissue joining the previously mentioned approaches: you can build a RAG system as an MCP server, expose a subagent as an MCP tool, or wrap your Obsidian vault in MCP (which is what the Obsidian crowd is doing now).
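For a sense of the shape of it, here's a minimal MCP server using the official Python SDK's FastMCP helper; the memory-search tool body and the resource URI are stand-in assumptions, not a real backend.

```python
# A minimal MCP server exposing one tool and one resource, using the
# official Python SDK (pip install "mcp[cli]"). The search_memory body
# is a stub; a real server would query an actual store.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("memory-server")

@mcp.tool()
def search_memory(query: str) -> str:
    """Search the external memory store for notes matching the query."""
    return f"(stub) top notes for: {query}"

@mcp.resource("memory://project-brief")
def project_brief() -> str:
    """Stable project facts the model can read on demand."""
    return "Architecture: monorepo; style: functional core, imperative shell."

if __name__ == "__main__":
    mcp.run()  # speaks MCP over stdio by default
```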
Speaking of Obsidian again: it's almost always wired up via MCP when used as an AI memory tool like this. There's a lot of hype around it, and not just because of the guy pushing it; it genuinely solves several problems in this space. The resulting markdown files are interpretable and human-parsable, you get a feature-rich graph structure for free (an auditable search layer, basically), and it's stored locally, not at the mercy of a faceless corporation. The LLM maintains an active personal wiki it can use to solve both very general and very specific tasks. It's the markdown memory program pushed to its logical extreme, with a healthy dose of RAG principles applied differently: the agent decides when to retrieve, rather than retrieving on every prompt. Traversing the wiki's link graph lets the agent hop from retrieved pages to related ones, especially when powered by structured queries (see the sketch below).
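A minimal sketch of that link-graph traversal over a vault of markdown files; the [[wikilink]] regex and the one-hop expansion are assumptions about how such a tool might work, not Obsidian's own API.

```python
# A sketch of link-graph traversal over a vault of markdown notes:
# parse [[wikilinks]] into a graph, then expand a hit to its neighbors.
# The regex and hop-based expansion are illustrative assumptions.
import re
from pathlib import Path

WIKILINK = re.compile(r"\[\[([^\]|#]+)")  # matches [[Page]], [[Page|alias]], [[Page#heading]]

def build_graph(vault: Path) -> dict[str, set[str]]:
    graph: dict[str, set[str]] = {}
    for note in vault.glob("**/*.md"):
        graph[note.stem] = set(WIKILINK.findall(note.read_text(encoding="utf-8")))
    return graph

def expand(graph: dict[str, set[str]], hits: set[str], hops: int = 1) -> set[str]:
    """Grow a set of retrieved pages by following outgoing links."""
    frontier = set(hits)
    for _ in range(hops):
        frontier |= {link for page in frontier for link in graph.get(page, set())}
    return frontier

graph = build_graph(Path("vault"))
print(expand(graph, {"Project Atlas"}, hops=2))
```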
Keep a close eye on the GraphRAG stuff. It combines keyword matching and semantic RAG search to find anchor entities in a knowledge graph, which can then be traversed. This yields explainable retrieval chains (page to entity to page, and so on) that identify subgraphs to send to the LLM's context window. Major developments on this front have come from Microsoft, whose approach builds community summaries at indexing time; these support efficient global search (deriving main themes, for example) as well as detailed local, entity-specific questions, like which project a certain worker is connected to. The big question with semantic vs. keyword vs. graph search is how to stack them. Some have tried running them in parallel or having an agent pick one, but stacking them as described, keyword/semantic matching to find anchors and then graph traversal, is what works best so far.
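A minimal sketch of that anchor-then-traverse pattern with networkx; the toy graph and the naive string-match anchoring stand in for real keyword/semantic search.

```python
# A sketch of the anchor-then-traverse GraphRAG pattern: find anchor
# entities (here by naive string match, standing in for keyword and
# semantic search), then pull the surrounding subgraph into context.
import networkx as nx

g = nx.Graph()
g.add_edges_from([
    ("Ada", "Project Atlas"), ("Ada", "Payments Team"),
    ("Project Atlas", "Postgres"), ("Payments Team", "Stripe"),
])

def retrieve_subgraph(query: str, radius: int = 1) -> nx.Graph:
    anchors = [n for n in g.nodes if n.lower() in query.lower()]
    nodes = set(anchors)
    for a in anchors:
        nodes |= set(nx.ego_graph(g, a, radius=radius).nodes)
    return g.subgraph(nodes)

sub = retrieve_subgraph("What project is Ada connected to?")
print(list(sub.edges))  # explainable chain: Ada -> Project Atlas -> ...
```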
One more thing I haven't mentioned yet, but that I think is important and relevant: LoRA. Rather than managing context, LoRA (Low-Rank Adaptation) is a method for finetuning LLMs without retraining all of the billions of parameters in the model. Basically, you freeze the original weights and inject small trainable "adapter" matrices alongside them, which approximate the weight update you'd otherwise need. You end up training maybe 0.1-1% of the parameters while getting almost the full benefit of full fine-tuning. If everything above has been about short- and medium-term memory, this is where long-term memory comes in. It's good for teaching style, format, domain-specific reasoning patterns, or tone; less good for raw factual recall, which is more prone to hallucination.
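A minimal sketch with Hugging Face's peft library; the base model and hyperparameters are arbitrary illustrations, not recommendations.

```python
# A minimal LoRA setup with Hugging Face peft: freeze the base model and
# train small low-rank adapters alongside the attention projections.
# Model choice and hyperparameters are arbitrary illustrations.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B")
config = LoraConfig(
    r=8,                 # rank of the adapter matrices
    lora_alpha=16,       # scaling factor for the adapter update
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, config)
model.print_trainable_parameters()  # reports the small trainable fraction
```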
Some have commented on the limitations of these systems, especially RAG, highlighting that the same question can be asked from several different angles (time-based, entity-based, categorical, existential, etc.), and different systems are better or worse at different question types. Other noted problems are the lack of an embodied world model for interpreting results, and trouble connecting loosely related entities (the Adam who likes coffee may or may not be the Adam you know from work).
Additional work to watch is DeepSeek's engram paper, which encodes recent memory in the first few layers of the network. A lot of research is now dedicated to moving memory out of the attention matrix toward something more like human memory, which works hierarchically, selectively, and through reconstruction. Plenty of labs are working on this; it's just a very complicated problem.