
Integrating 14 data sources without losing your mind

Sylvester S, Founder & CEO
Nov 18, 2024 · 6 min read

The unglamorous reality of data pipeline work: mismatched schemas, rate limits, silent failures, and the tooling decisions that made it manageable.

The project brief was straightforward: unify data from 14 different sources into a single analytics warehouse so the finance and operations teams could report from one place. Six months, four of us, and more grey hairs than expected. Here's what nobody tells you about multi-source data integration.

Schema mismatches are the rule, not the exception

Across 14 sources, we encountered:

  • dates formatted in 6 different ways
  • currency values in local units with no denomination field
  • customer IDs that were the same entity but different strings across systems ("Acme Corp", "ACME", "Acme Corporation Ltd")
  • two sources that used the same field name for completely different concepts

The resolution framework we built: every source gets its own schema layer in the warehouse — raw data, untouched, as it arrived. A second layer applies source-specific transformations: normalise dates, resolve currency, standardise entity IDs. A third layer unifies across sources against a canonical schema. This three-layer model means that when a source changes its schema (and they always do), we change one transformation layer, not twenty downstream queries.
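As a sketch of what the source-specific transformation layer does, here is roughly how one date-normalisation step might look. The format list and function name are illustrative, not our actual code:

```python
from datetime import datetime

# Hypothetical list of formats observed for one source. Order matters:
# unambiguous formats go first, ambiguous ones (day/month vs month/day)
# only if that source is known to use them.
DATE_FORMATS = [
    "%Y-%m-%d",     # 2024-11-18
    "%d/%m/%Y",     # 18/11/2024
    "%b %d, %Y",    # Nov 18, 2024
]

def normalise_date(raw: str) -> str:
    """Try each known format; fail loudly if none match rather than guess."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"Unrecognised date format: {raw!r}")
```

Raising on an unknown format, rather than passing the value through, is deliberate: a loud failure in the transformation layer is far cheaper than a silent NULL three layers downstream.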

Rate limits will find you at the worst time

Three of our 14 sources had documented rate limits. Seven had undocumented rate limits we discovered by hitting them during the initial historical backfill. Two had rate limits that were inconsistently applied — working fine on Tuesday, throttling aggressively on Thursday for no obvious reason. The defensive measures that worked:

  • Implement exponential backoff with jitter on every API call — not just the documented rate-limited ones
  • Separate historical backfill jobs from ongoing sync jobs — run backfills at off-peak hours with conservative concurrency
  • Store raw API responses before any transformation — if a transform fails, you can replay without re-fetching
  • Monitor API quota consumption per source, not just errors — approaching a rate limit is more useful information than hitting one
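The first point can be sketched as a small retry wrapper. Function and parameter names here are illustrative; in practice this wrapped our HTTP client calls:

```python
import random
import time

def with_backoff(call, max_retries=6, base_delay=1.0, cap=60.0):
    """Retry a zero-argument `call` with full-jitter exponential backoff.

    `call` should raise on a retryable failure (e.g. an HTTP 429).
    """
    for attempt in range(max_retries):
        try:
            return call()
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of retries: surface the failure
            # Full jitter: sleep a random amount up to the capped
            # exponential delay, so parallel workers don't retry in sync.
            delay = min(cap, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0, delay))
```

The jitter is the part people skip and regret: without it, every worker that got throttled at the same moment retries at the same moment, and you hit the limit again in lockstep.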

Idempotency is not optional

A pipeline job that runs twice should produce the same result as a pipeline job that runs once. This sounds obvious. It's surprisingly hard to guarantee when sources have mutable records — a sales order that gets updated after it was first synced, a customer record that changes. Our approach: every record gets a source-system timestamp and a sync timestamp. The transformation layer uses source-system timestamp as the authoritative 'last updated' and upserts on a stable composite key.

If you build nothing else into your pipeline, build idempotency. We had two unplanned job reruns during the project — one from a bug, one from an infrastructure outage. Because our pipeline was idempotent, both replays produced correct results with no manual intervention.
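The upsert rule reduces to something like this. It's shown against an in-memory dict for clarity — in the warehouse it was a merge on the composite key — and the field names are hypothetical:

```python
# Keep an incoming record only if its source-system timestamp is at
# least as new as what we already hold. Replaying a batch is a no-op.

def upsert(store: dict, record: dict) -> None:
    """Upsert keyed on the stable composite key (source, source_id);
    last writer wins by source-system timestamp."""
    key = (record["source"], record["source_id"])
    existing = store.get(key)
    if existing is None or record["source_updated_at"] >= existing["source_updated_at"]:
        store[key] = record
```

The `>=` (rather than `>`) makes replaying the exact same record a harmless overwrite, while a stale record from an old batch can never clobber a newer one.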

The tooling that saved us

We ran Airflow for orchestration, dbt for transformations, and Fivetran for the 4 sources that had native connectors. For the remaining 10, we wrote custom Python extractors that ran as Airflow tasks. The key decision that saved the most time: using dbt for all transformations, not Python. SQL is more transparent, easier to audit, and simpler to hand off to the analytics team who needed to maintain it after we left.

Data quality monitoring is half the project

Pipelines break silently. A source API changes its response format and your extractor starts writing NULLs. A rate limit kicks in and a job partially succeeds. An upstream system changes a lookup value your transformation depends on. Without monitoring, these failures accumulate silently until someone notices a dashboard metric that 'looks wrong.'

We implemented Great Expectations tests on every table — row count assertions, null rate thresholds, referential integrity checks, and value range validations. Failed tests page the on-call engineer and halt downstream pipeline stages until resolved. It added two weeks to the project. It's caught eleven real data quality issues in the eight months since launch.
