Rethinking Data Ingestion as a DAG
When your data pipeline becomes the bottleneck for your entire product, you have two choices: optimize incrementally or rethink the architecture entirely.
Our AI agent is powered by a unified knowledge graph built from code repositories, documentation, and project management data. In practice, that means ingesting and keeping hundreds of thousands of files in sync across many external systems. When ingestion started taking hours, incremental optimizations were no longer enough. We stepped back from the implementation details and redefined ingestion in terms of its core constraints, dependencies, failure modes, and resource profiles.
The outcome: ingestion time for a production-scale knowledge base dropped from several hours to minutes.
The ceiling we hit
Our ingestion pipeline was a Python service built around async workflows. It held up early on, but as the system grew, its limits became harder to ignore. We optimized the obvious paths and added incremental ingestion so that only changed files were reprocessed. That helped, but only temporarily.

The deeper issue wasn't inefficiency; it was the architecture. At higher throughput, any blocking operation inside an async workflow serializes execution: a synchronous library call, a slow API request, or a CPU-heavy step all push work back onto the event loop and limit concurrency. Meanwhile, the workload itself was becoming more complex, with stages that are I/O-bound, compute-heavy, or constrained by external systems. Treating all of this as a single monolithic process made it increasingly difficult to tune or reason about performance. Users were waiting hours for their knowledge bases to sync, which wasn't acceptable for a system meant to stay up to date.
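The dynamic is the same in any single-threaded event loop, not just Python's asyncio. A minimal Node.js/TypeScript sketch, purely illustrative rather than our production code, shows how one synchronous step serializes otherwise concurrent work:

```typescript
// Purely illustrative: one synchronous, CPU-heavy step blocks the event loop,
// so "concurrent" async work ends up serialized behind it.

function cpuHeavyChunking(iterations: number): number {
  // Stand-in for synchronous parsing or chunking work.
  let acc = 0;
  for (let i = 0; i < iterations; i++) acc += Math.sqrt(i);
  return acc;
}

async function processFile(id: number): Promise<void> {
  await new Promise((resolve) => setTimeout(resolve, 50)); // I/O: yields to the event loop
  cpuHeavyChunking(50_000_000);                            // CPU-bound: blocks every other task
  console.log(`file ${id} processed`);
}

async function main(): Promise<void> {
  const start = Date.now();
  // These promises are nominally concurrent, but the synchronous chunking step
  // runs one file at a time, so total time grows linearly with file count.
  await Promise.all([1, 2, 3, 4].map((id) => processFile(id)));
  console.log(`elapsed: ${Date.now() - start}ms`);
}

main();
```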
[Diagram: single-threaded async (even with async, operations serialize under load)]
Thinking in terms of dependencies
What started as a discussion about avoiding event loop blocking led to a more fundamental realization: we were addressing symptoms instead of questioning whether the underlying model still made sense. Document ingestion is not a single linear operation. It consists of multiple stages with defined ordering and dependencies. Certain steps must complete before others can begin, while independent work can run concurrently. Failures tend to be localized to specific stages and should be handled there rather than forcing a full restart. Looking at ingestion through this lens made it clear that the system behaved less like a pipeline and more like a DAG.
Once we framed the problem this way, the requirements became clearer. The system needed to represent dependencies directly, retry work at the appropriate level of granularity, and control concurrency without routing all execution through a single process. We evaluated several orchestration frameworks, including Temporal, Inngest, and a few others, before choosing BullMQ for its straightforward job graph model and predictable execution semantics. That gave us the structural foundation we needed without introducing unnecessary complexity.
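As a rough sketch of what this looks like with BullMQ's flow API (the stage and queue names here are illustrative, not our actual ones), a document's stages can be declared as a parent-child tree in which each parent runs only after its children complete:

```typescript
import { FlowProducer } from 'bullmq';

// Illustrative stage names; the real pipeline has more stages and richer payloads.
const flows = new FlowProducer({ connection: { host: 'localhost', port: 6379 } });

// In BullMQ flows, children must complete before their parent runs, which gives
// us the dependency ordering directly: fetch -> chunk -> enrich -> index.
await flows.add({
  name: 'index-document',
  queueName: 'indexing',
  data: { documentId: 'doc-123' },
  children: [
    {
      name: 'enrich-document',
      queueName: 'enrichment',
      data: { documentId: 'doc-123' },
      children: [
        {
          name: 'chunk-document',
          queueName: 'chunking',
          data: { documentId: 'doc-123' },
          children: [
            { name: 'fetch-document', queueName: 'fetching', data: { documentId: 'doc-123' } },
          ],
        },
      ],
    },
  ],
});
```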
[Diagram: before, a single execution model where a rate limit hit fails the entire flow; after, a job graph where only the failed node is retried]
The job queue architecture
We rebuilt the ingestion service in TypeScript using NestJS and structured it around job queues rather than a single execution path. NestJS provided a solid foundation for dependency injection and modular architecture, making it easier to organize the different stages as isolated, testable services. The most important change was not the language or framework choice but how the work was decomposed.
Ingestion was split into distinct stages based on how each stage interacts with the system. Work dominated by network I/O runs independently from compute-heavy tasks. Operations constrained by external limits are handled separately. Each stage runs with its own concurrency settings, allowing throughput to scale without letting one type of work overwhelm the rest. This separation was one of the biggest improvements. I/O-bound stages no longer compete with compute-bound work, and throttled operations can be constrained without slowing down the entire pipeline.
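In BullMQ terms, each stage gets its own worker with its own tuning. The sketch below uses illustrative queue names, concurrency values, and placeholder stage implementations, but it shows the shape: an I/O-bound stage runs wide, a compute-heavy stage runs narrow, and an externally constrained stage is throttled at the worker level.

```typescript
import { Worker } from 'bullmq';

const connection = { host: 'localhost', port: 6379 };

// Placeholder stage implementations; the real ones live in their own NestJS services.
const fetchDocument = async (documentId: string) => { /* pull content from the source system */ };
const chunkDocument = async (documentId: string) => { /* split content into chunks */ };
const enrichDocument = async (documentId: string) => { /* call the enrichment service */ };

// I/O-bound stage: high concurrency, since most time is spent waiting on the network.
new Worker('fetching', async (job) => fetchDocument(job.data.documentId), {
  connection,
  concurrency: 50,
});

// Compute-heavy stage: low concurrency so it doesn't starve everything else.
new Worker('chunking', async (job) => chunkDocument(job.data.documentId), {
  connection,
  concurrency: 4,
});

// Stage constrained by an external API: rate-limited at the worker level.
new Worker('enrichment', async (job) => enrichDocument(job.data.documentId), {
  connection,
  concurrency: 10,
  limiter: { max: 100, duration: 60_000 }, // at most 100 jobs per minute
});
```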
From the beginning, we treated different types of data differently. Documents follow a full enrichment flow for semantic search. Code takes a different path. Running LLM-based enrichment on code is expensive and often produces limited value compared to structured analysis. Instead, code is parsed into an abstract syntax tree and reduced to symbols such as functions, classes, and imports. These are indexed for fast text-based lookup rather than semantic search.
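To make the symbol-extraction step concrete, here is one way to do it for TypeScript sources, sketched with the TypeScript compiler API; this is an illustration rather than the exact parser we run, and the real pipeline covers multiple languages and more symbol kinds.

```typescript
import * as ts from 'typescript';

interface CodeSymbol {
  kind: 'function' | 'class' | 'import';
  name: string;
}

// Reduce a source file to its top-level symbols: functions, classes, and imports.
function extractSymbols(fileName: string, source: string): CodeSymbol[] {
  const sourceFile = ts.createSourceFile(fileName, source, ts.ScriptTarget.Latest, true);
  const symbols: CodeSymbol[] = [];

  const visit = (node: ts.Node): void => {
    if (ts.isFunctionDeclaration(node) && node.name) {
      symbols.push({ kind: 'function', name: node.name.text });
    } else if (ts.isClassDeclaration(node) && node.name) {
      symbols.push({ kind: 'class', name: node.name.text });
    } else if (ts.isImportDeclaration(node)) {
      symbols.push({ kind: 'import', name: node.moduleSpecifier.getText(sourceFile) });
    }
    ts.forEachChild(node, visit);
  };

  visit(sourceFile);
  return symbols; // indexed for fast text lookup, not embedded for semantic search
}
```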
Processing a document now produces a set of dependent jobs rather than a single linear script. Each stage emits output that feeds into the next stage in the graph. When a failure occurs, only the affected stage is retried rather than restarting the entire flow. This structure provides finer-grained retries, clearer visibility into where work is blocked, and isolation between stages with different failure characteristics. The system is easier to debug and easier to scale, not because it is more complex, but because its complexity is explicit and controlled.
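A minimal sketch of how one stage's output feeds the next in BullMQ (queue names and payloads are illustrative): a child job returns a value, and the dependent parent job reads the values of all completed children before doing its own work.

```typescript
import { Worker, Job } from 'bullmq';

const connection = { host: 'localhost', port: 6379 };

// A child stage returns its output as the job's return value...
new Worker(
  'chunking',
  async (job: Job) => {
    const chunks = [`chunk of ${job.data.documentId}`]; // placeholder for real chunking
    return { chunks };
  },
  { connection },
);

// ...and the dependent parent stage reads the values of its completed children.
// If this stage fails, only this node is retried (per its attempts/backoff opts);
// the already-completed chunking results are not recomputed.
new Worker(
  'indexing',
  async (job: Job) => {
    const childValues = await job.getChildrenValues();
    const chunks = Object.values(childValues).flatMap((value: any) => value.chunks as string[]);
    return { indexed: chunks.length }; // write chunks to the index here
  },
  { connection },
);
```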
[Diagram: job queue architecture]
Migrating with confidence through Test-Driven Translation
Rewriting a production ingestion pipeline carries real risk. We needed to verify that the new system behaved identically to the old one before switching traffic. We used Claude Code to generate a comprehensive test suite for the existing Python service, capturing how documents were chunked, what metadata was produced, and how edge cases were handled. These tests froze the contract of the old pipeline.
Instead of rewriting everything at once, we migrated incrementally. Individual stages were translated to TypeScript and validated against the same test cases before moving on to the next. This allowed us to compare outputs side by side and catch discrepancies early. Over time, more of the pipeline moved onto the new system. Once the full flow was in place, we ran the same test suite against the new implementation, giving us a clear signal that the migration preserved behavior without introducing regressions. This approach made the migration far less risky and left us with a durable regression suite for future changes.
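The parity checks themselves were ordinary characterization tests. A simplified sketch of one (the fixture paths, module names, and test runner shown here are illustrative):

```typescript
import { describe, expect, it } from 'vitest';
import { readFileSync } from 'node:fs';
import { chunkDocument } from '../src/chunking/chunk-document'; // new implementation (illustrative path)

// Fixtures captured from the legacy Python pipeline: the input document plus the
// exact chunks and metadata it produced. Case names and paths are illustrative.
const cases = ['markdown-doc', 'nested-headings', 'empty-file', 'unicode-content'];

describe('chunking parity with the legacy pipeline', () => {
  for (const name of cases) {
    it(`produces identical chunks for ${name}`, () => {
      const input = readFileSync(`fixtures/${name}/input.txt`, 'utf8');
      const expected = JSON.parse(readFileSync(`fixtures/${name}/expected.json`, 'utf8'));

      // The new implementation must match the frozen contract exactly.
      expect(chunkDocument(input)).toEqual(expected);
    });
  }
});
```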
[Diagram: test-driven translation (test generation from production, new patterns, + regression suite)]
Load testing, bottlenecks, and observability
Once the system was live, we load-tested it with production-scale data. Performance improved, but not to the degree we expected, which made it clear that architecture alone wasn't enough: we needed visibility into how the system behaved under real load. Because each stage ran in isolation, it became obvious where time was actually being spent. The bottleneck showed up downstream, in a part of the pipeline constrained by write throughput and external limits. Isolating that work and adjusting its concurrency prevented it from blocking document processing and immediately improved end-to-end latency.

As we continued testing, the same pattern surfaced repeatedly: external systems impose constraints that are unavoidable and shift as usage grows. A scalable ingestion system doesn't eliminate these limits; it absorbs them without cascading failures. Observability made that possible. Being able to see queue depth, stage-level latency, and failure rates in isolation allowed us to tune the system deliberately and gave us confidence that new bottlenecks would be visible as usage increased.
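Much of that visibility comes cheaply once each stage is its own queue. A minimal sketch of sampling per-stage depth and failures (queue names are illustrative, and the console output stands in for a real metrics backend):

```typescript
import { Queue, QueueEvents } from 'bullmq';

const connection = { host: 'localhost', port: 6379 };
const stages = ['fetching', 'chunking', 'enrichment', 'indexing']; // illustrative stage names

for (const name of stages) {
  // Sample per-stage queue depth so backlogs show up in isolation.
  const queue = new Queue(name, { connection });
  setInterval(async () => {
    const counts = await queue.getJobCounts('waiting', 'active', 'delayed', 'failed');
    console.log(`[${name}]`, counts); // stand-in for exporting to a metrics backend
  }, 15_000);

  // Stage-level failure events, tagged by queue, instead of one opaque pipeline error.
  const events = new QueueEvents(name, { connection });
  events.on('failed', ({ jobId, failedReason }) => {
    console.error(`[${name}] job ${jobId} failed: ${failedReason}`);
  });
}
```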
The result
What used to take 4-6 hours now completes in under 10 minutes. This change matters because ingestion is foundational. When syncing takes hours, users work with stale information. When failures go unnoticed, trust erodes. Fast and reliable ingestion is table stakes for any knowledge platform teams can depend on. The architecture we landed on isn't exotic: it's a job queue, a DAG, and thoughtful decomposition. We didn't need clever solutions. We needed something reliable, scalable, and easy to reason about. As we continue to onboard large enterprise customers, new bottlenecks will appear and the architecture will evolve. But the principle remains the same: the best infrastructure is the kind users never have to think about.