Rethinking Data Ingestion as a DAG
When your data pipeline becomes the bottleneck for your entire product, you have two choices: optimize incrementally or rethink the architecture entirely.
Our AI agent is powered by a unified knowledge graph built from code repositories, documentation, and project management data. In practice, that means ingesting and keeping hundreds of thousands of files in sync across many external systems. When ingestion started taking hours, incremental optimizations were no longer enough. We stepped back from the implementation details and redefined ingestion in terms of its core constraints, dependencies, failure modes, and resource profiles.
The outcome: ingestion time for a production-scale knowledge base dropped from several hours to minutes.
The ceiling we hit
Our ingestion pipeline was a Python service built around async workflows. It held up early on, but as the system grew, its limits became harder to ignore. We optimized the obvious paths and added incremental ingestion so that only changed files were reprocessed. That helped, but only temporarily.

The deeper issue wasn't inefficiency; it was the architecture. At higher throughput, any blocking operation inside an async workflow serializes execution: a synchronous library call, a slow API request, or a CPU-heavy step all push work back onto the event loop and limit concurrency. Meanwhile, the workload itself was becoming more complex, with stages that are I/O-bound, compute-heavy, or constrained by external systems. Treating all of this as a single monolithic process made it increasingly difficult to tune or reason about performance. Users were waiting hours for their knowledge bases to sync, which wasn't acceptable for a system meant to stay up to date.
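The dynamic is the same in any single-threaded event loop, not just Python's asyncio. A minimal Node.js/TypeScript sketch, purely illustrative rather than our production code, shows how one synchronous step serializes otherwise concurrent work:

```typescript
// Purely illustrative: one synchronous, CPU-heavy step blocks the event loop,
// so "concurrent" async work ends up serialized behind it.

function cpuHeavyChunking(iterations: number): number {
  // Stand-in for synchronous parsing or chunking work.
  let acc = 0;
  for (let i = 0; i < iterations; i++) acc += Math.sqrt(i);
  return acc;
}

async function processFile(id: number): Promise<void> {
  await new Promise((resolve) => setTimeout(resolve, 50)); // I/O: yields to the event loop
  cpuHeavyChunking(50_000_000);                            // CPU-bound: blocks every other task
  console.log(`file ${id} processed`);
}

async function main(): Promise<void> {
  const start = Date.now();
  // These promises are nominally concurrent, but the synchronous chunking step
  // runs one file at a time, so total time grows linearly with file count.
  await Promise.all([1, 2, 3, 4].map((id) => processFile(id)));
  console.log(`elapsed: ${Date.now() - start}ms`);
}

main();
```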
[Diagram: single-threaded async (even with async, operations serialize under load)]
Thinking in terms of dependencies
What started as a discussion about avoiding event loop blocking led to a more fundamental realization: we were addressing symptoms instead of questioning whether the underlying model still made sense. Document ingestion is not a single linear operation. It consists of multiple stages with defined ordering and dependencies. Certain steps must complete before others can begin, while independent work can run concurrently. Failures tend to be localized to specific stages and should be handled there rather than forcing a full restart. Looking at ingestion through this lens made it clear that the system behaved less like a pipeline and more like a DAG.
Once we framed the problem this way, the requirements became clearer. The system needed to represent dependencies directly, retry work at the appropriate level of granularity, and control concurrency without routing all execution through a single process. We evaluated several orchestration frameworks, including Temporal, Inngest, and a few others, before choosing BullMQ for its straightforward job graph model and predictable execution semantics. That gave us the structural foundation we needed without introducing unnecessary complexity.
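As a rough sketch of what this looks like with BullMQ's flow API (the stage and queue names here are illustrative, not our actual ones), a document's stages can be declared as a parent-child tree in which each parent runs only after its children complete:

```typescript
import { FlowProducer } from 'bullmq';

// Illustrative stage names; the real pipeline has more stages and richer payloads.
const flows = new FlowProducer({ connection: { host: 'localhost', port: 6379 } });

// In BullMQ flows, children must complete before their parent runs, which gives
// us the dependency ordering directly: fetch -> chunk -> enrich -> index.
await flows.add({
  name: 'index-document',
  queueName: 'indexing',
  data: { documentId: 'doc-123' },
  children: [
    {
      name: 'enrich-document',
      queueName: 'enrichment',
      data: { documentId: 'doc-123' },
      children: [
        {
          name: 'chunk-document',
          queueName: 'chunking',
          data: { documentId: 'doc-123' },
          children: [
            { name: 'fetch-document', queueName: 'fetching', data: { documentId: 'doc-123' } },
          ],
        },
      ],
    },
  ],
});
```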
[Diagram: before, a single execution model where a rate limit hit fails the entire flow; after, a job graph where only the failed node is retried]
The job queue architecture
We rebuilt the ingestion service in TypeScript using NestJS and structured it around job queues rather than a single execution path. NestJS provided a solid foundation for dependency injection and modular architecture, making it easier to organize the different stages as isolated, testable services. The most important change was not the language or framework choice but how the work was decomposed.
Ingestion was split into distinct stages based on how each stage interacts with the system. Work dominated by network I/O runs independently from compute-heavy tasks. Operations constrained by external limits are handled separately. Each stage runs with its own concurrency settings, allowing throughput to scale without letting one type of work overwhelm the rest. This separation was one of the biggest improvements. I/O-bound stages no longer compete with compute-bound work, and throttled operations can be constrained without slowing down the entire pipeline.
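In BullMQ terms, each stage gets its own worker with its own tuning. The sketch below uses illustrative queue names, concurrency values, and placeholder stage implementations, but it shows the shape: an I/O-bound stage runs wide, a compute-heavy stage runs narrow, and an externally constrained stage is throttled at the worker level.

```typescript
import { Worker } from 'bullmq';

const connection = { host: 'localhost', port: 6379 };

// Placeholder stage implementations; the real ones live in their own NestJS services.
const fetchDocument = async (documentId: string) => { /* pull content from the source system */ };
const chunkDocument = async (documentId: string) => { /* split content into chunks */ };
const enrichDocument = async (documentId: string) => { /* call the enrichment service */ };

// I/O-bound stage: high concurrency, since most time is spent waiting on the network.
new Worker('fetching', async (job) => fetchDocument(job.data.documentId), {
  connection,
  concurrency: 50,
});

// Compute-heavy stage: low concurrency so it doesn't starve everything else.
new Worker('chunking', async (job) => chunkDocument(job.data.documentId), {
  connection,
  concurrency: 4,
});

// Stage constrained by an external API: rate-limited at the worker level.
new Worker('enrichment', async (job) => enrichDocument(job.data.documentId), {
  connection,
  concurrency: 10,
  limiter: { max: 100, duration: 60_000 }, // at most 100 jobs per minute
});
```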
From the beginning, we treated different types of data differently. Documents follow a full enrichment flow for semantic search. Code takes a different path. Running LLM-based enrichment on code is expensive and often produces limited value compared to structured analysis. Instead, code is parsed into an abstract syntax tree and reduced to symbols such as functions, classes, and imports. These are indexed for fast text-based lookup rather than semantic search.
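To make the symbol-extraction step concrete, here is one way to do it for TypeScript sources, sketched with the TypeScript compiler API; this is an illustration rather than the exact parser we run, and the real pipeline covers multiple languages and more symbol kinds.

```typescript
import * as ts from 'typescript';

interface CodeSymbol {
  kind: 'function' | 'class' | 'import';
  name: string;
}

// Reduce a source file to its top-level symbols: functions, classes, and imports.
function extractSymbols(fileName: string, source: string): CodeSymbol[] {
  const sourceFile = ts.createSourceFile(fileName, source, ts.ScriptTarget.Latest, true);
  const symbols: CodeSymbol[] = [];

  const visit = (node: ts.Node): void => {
    if (ts.isFunctionDeclaration(node) && node.name) {
      symbols.push({ kind: 'function', name: node.name.text });
    } else if (ts.isClassDeclaration(node) && node.name) {
      symbols.push({ kind: 'class', name: node.name.text });
    } else if (ts.isImportDeclaration(node)) {
      symbols.push({ kind: 'import', name: node.moduleSpecifier.getText(sourceFile) });
    }
    ts.forEachChild(node, visit);
  };

  visit(sourceFile);
  return symbols; // indexed for fast text lookup, not embedded for semantic search
}
```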
Processing a document now produces a set of dependent jobs rather than a single linear script. Each stage emits output that feeds into the next stage in the graph. When a failure occurs, only the affected stage is retried rather than restarting the entire flow. This structure provides finer-grained retries, clearer visibility into where work is blocked, and isolation between stages with different failure characteristics. The system is easier to debug and easier to scale, not because it is more complex, but because its complexity is explicit and controlled.
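A minimal sketch of how one stage's output feeds the next in BullMQ (queue names and payloads are illustrative): a child job returns a value, and the dependent parent job reads the values of all completed children before doing its own work.

```typescript
import { Worker, Job } from 'bullmq';

const connection = { host: 'localhost', port: 6379 };

// A child stage returns its output as the job's return value...
new Worker(
  'chunking',
  async (job: Job) => {
    const chunks = [`chunk of ${job.data.documentId}`]; // placeholder for real chunking
    return { chunks };
  },
  { connection },
);

// ...and the dependent parent stage reads the values of its completed children.
// If this stage fails, only this node is retried (per its attempts/backoff opts);
// the already-completed chunking results are not recomputed.
new Worker(
  'indexing',
  async (job: Job) => {
    const childValues = await job.getChildrenValues();
    const chunks = Object.values(childValues).flatMap((value: any) => value.chunks as string[]);
    return { indexed: chunks.length }; // write chunks to the index here
  },
  { connection },
);
```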
[Diagram: job queue architecture]
Migrating with confidence through Test-Driven Translation
Rewriting a production ingestion pipeline carries real risk. We needed to verify that the new system behaved identically to the old one before switching traffic. We used Claude Code to generate a comprehensive test suite for the existing Python service, capturing how documents were chunked, what metadata was produced, and how edge cases were handled. These tests froze the contract of the old pipeline.
Instead of rewriting everything at once, we migrated incrementally. Individual stages were translated to TypeScript and validated against the same test cases before moving on to the next. This allowed us to compare outputs side by side and catch discrepancies early. Over time, more of the pipeline moved onto the new system. Once the full flow was in place, we ran the same test suite against the new implementation, giving us a clear signal that the migration preserved behavior without introducing regressions. This approach made the migration far less risky and left us with a durable regression suite for future changes.
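The parity checks themselves were ordinary characterization tests. A simplified sketch of one (the fixture paths, module names, and test runner shown here are illustrative):

```typescript
import { describe, expect, it } from 'vitest';
import { readFileSync } from 'node:fs';
import { chunkDocument } from '../src/chunking/chunk-document'; // new implementation (illustrative path)

// Fixtures captured from the legacy Python pipeline: the input document plus the
// exact chunks and metadata it produced. Case names and paths are illustrative.
const cases = ['markdown-doc', 'nested-headings', 'empty-file', 'unicode-content'];

describe('chunking parity with the legacy pipeline', () => {
  for (const name of cases) {
    it(`produces identical chunks for ${name}`, () => {
      const input = readFileSync(`fixtures/${name}/input.txt`, 'utf8');
      const expected = JSON.parse(readFileSync(`fixtures/${name}/expected.json`, 'utf8'));

      // The new implementation must match the frozen contract exactly.
      expect(chunkDocument(input)).toEqual(expected);
    });
  }
});
```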
[Diagram: test-driven translation (test generation from production, new patterns, + regression suite)]
Load testing, bottlenecks, and observability
Once the system was live, we load-tested it with production-scale data. Performance improved, but not to the degree we expected, which made it clear that architecture alone wasn't enough: we needed visibility into how the system behaved under real load. Because each stage ran in isolation, it became obvious where time was actually being spent. The bottleneck showed up downstream, in a part of the pipeline constrained by write throughput and external limits. Isolating that work and adjusting its concurrency prevented it from blocking document processing and immediately improved end-to-end latency.

As we continued testing, the same pattern surfaced repeatedly: external systems impose constraints that are unavoidable and shift as usage grows. A scalable ingestion system doesn't eliminate these limits; it absorbs them without cascading failures. Observability made that possible. Being able to see queue depth, stage-level latency, and failure rates in isolation allowed us to tune the system deliberately and gave us confidence that new bottlenecks would be visible as usage increased.
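Much of that visibility comes cheaply once each stage is its own queue. A minimal sketch of sampling per-stage depth and failures (queue names are illustrative, and the console output stands in for a real metrics backend):

```typescript
import { Queue, QueueEvents } from 'bullmq';

const connection = { host: 'localhost', port: 6379 };
const stages = ['fetching', 'chunking', 'enrichment', 'indexing']; // illustrative stage names

for (const name of stages) {
  // Sample per-stage queue depth so backlogs show up in isolation.
  const queue = new Queue(name, { connection });
  setInterval(async () => {
    const counts = await queue.getJobCounts('waiting', 'active', 'delayed', 'failed');
    console.log(`[${name}]`, counts); // stand-in for exporting to a metrics backend
  }, 15_000);

  // Stage-level failure events, tagged by queue, instead of one opaque pipeline error.
  const events = new QueueEvents(name, { connection });
  events.on('failed', ({ jobId, failedReason }) => {
    console.error(`[${name}] job ${jobId} failed: ${failedReason}`);
  });
}
```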
The result
What used to take 4-6 hours now completes in under 10 minutes. This change matters because ingestion is foundational. When syncing takes hours, users work with stale information. When failures go unnoticed, trust erodes. Fast and reliable ingestion is table stakes for any knowledge platform teams can depend on. The architecture we landed on isn't exotic: it's a job queue, a DAG, and thoughtful decomposition. We didn't need clever solutions. We needed something reliable, scalable, and easy to reason about. As we continue to onboard large enterprise customers, new bottlenecks will appear and the architecture will evolve. But the principle remains the same: the best infrastructure is the kind users never have to think about.