
Integrating 14 data sources without losing your mind

Sylvester S, Founder & CEO
Nov 18, 2024 · 6 min read

The unglamorous reality of data pipeline work: mismatched schemas, rate limits, silent failures, and the tooling decisions that made it manageable.

The project brief was straightforward: unify data from 14 different sources into a single analytics warehouse so the finance and operations teams could report from one place. Six months, four of us, and more grey hairs than expected. Here's what nobody tells you about multi-source data integration.

Schema mismatches are the rule, not the exception

Across 14 sources, we encountered:

  • dates formatted in 6 different ways
  • currency values in local units with no denomination field
  • customer IDs that were the same entity but different strings across systems ("Acme Corp", "ACME", "Acme Corporation Ltd")
  • two sources that used the same field name for completely different concepts

The resolution framework we built: every source gets its own schema layer in the warehouse — raw data, untouched, as it arrived. A second layer applies source-specific transformations: normalise dates, resolve currency, standardise entity IDs. A third layer unifies across sources against a canonical schema. This three-layer model means that when a source changes its schema (and they always do), we change one transformation layer, not twenty downstream queries.
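As a sketch of what the source-specific transformation layer does, here is roughly how one date-normalisation step might look. The format list and function name are illustrative, not our actual code:

```python
from datetime import datetime

# Hypothetical list of formats observed for one source. Order matters:
# unambiguous formats go first, ambiguous ones (day/month vs month/day)
# only if that source is known to use them.
DATE_FORMATS = [
    "%Y-%m-%d",     # 2024-11-18
    "%d/%m/%Y",     # 18/11/2024
    "%b %d, %Y",    # Nov 18, 2024
]

def normalise_date(raw: str) -> str:
    """Try each known format; fail loudly if none match rather than guess."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"Unrecognised date format: {raw!r}")
```

Raising on an unknown format, rather than passing the value through, is deliberate: a loud failure in the transformation layer is far cheaper than a silent NULL three layers downstream.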

Rate limits will find you at the worst time

Three of our 14 sources had documented rate limits. Seven had undocumented rate limits we discovered by hitting them during the initial historical backfill. Two had rate limits that were inconsistently applied — working fine on Tuesday, throttling aggressively on Thursday for no obvious reason. The defensive measures that worked:

  • Implement exponential backoff with jitter on every API call — not just the documented rate-limited ones
  • Separate historical backfill jobs from ongoing sync jobs — run backfills at off-peak hours with conservative concurrency
  • Store raw API responses before any transformation — if a transform fails, you can replay without re-fetching
  • Monitor API quota consumption per source, not just errors — approaching a rate limit is more useful information than hitting one
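The first point can be sketched as a small retry wrapper. Function and parameter names here are illustrative; in practice this wrapped our HTTP client calls:

```python
import random
import time

def with_backoff(call, max_retries=6, base_delay=1.0, cap=60.0):
    """Retry a zero-argument `call` with full-jitter exponential backoff.

    `call` should raise on a retryable failure (e.g. an HTTP 429).
    """
    for attempt in range(max_retries):
        try:
            return call()
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of retries: surface the failure
            # Full jitter: sleep a random amount up to the capped
            # exponential delay, so parallel workers don't retry in sync.
            delay = min(cap, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0, delay))
```

The jitter is the part people skip and regret: without it, every worker that got throttled at the same moment retries at the same moment, and you hit the limit again in lockstep.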

Idempotency is not optional

A pipeline job that runs twice should produce the same result as a pipeline job that runs once. This sounds obvious. It's surprisingly hard to guarantee when sources have mutable records — a sales order that gets updated after it was first synced, a customer record that changes. Our approach: every record gets a source-system timestamp and a sync timestamp. The transformation layer uses source-system timestamp as the authoritative 'last updated' and upserts on a stable composite key.

If you build nothing else into your pipeline, build idempotency. We had two unplanned job reruns during the project — one from a bug, one from an infrastructure outage. Because our pipeline was idempotent, both replays produced correct results with no manual intervention.
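The upsert rule reduces to something like this. It's shown against an in-memory dict for clarity — in the warehouse it was a merge on the composite key — and the field names are hypothetical:

```python
# Keep an incoming record only if its source-system timestamp is at
# least as new as what we already hold. Replaying a batch is a no-op.

def upsert(store: dict, record: dict) -> None:
    """Upsert keyed on the stable composite key (source, source_id);
    last writer wins by source-system timestamp."""
    key = (record["source"], record["source_id"])
    existing = store.get(key)
    if existing is None or record["source_updated_at"] >= existing["source_updated_at"]:
        store[key] = record
```

The `>=` (rather than `>`) makes replaying the exact same record a harmless overwrite, while a stale record from an old batch can never clobber a newer one.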

The tooling that saved us

We ran Airflow for orchestration, dbt for transformations, and Fivetran for the 4 sources that had native connectors. For the remaining 10, we wrote custom Python extractors that ran as Airflow tasks. The key decision that saved the most time: using dbt for all transformations, not Python. SQL is more transparent, easier to audit, and simpler to hand off to the analytics team who needed to maintain it after we left.

Data quality monitoring is half the project

Pipelines break silently. A source API changes its response format and your extractor starts writing NULLs. A rate limit kicks in and a job partially succeeds. An upstream system changes a lookup value your transformation depends on. Without monitoring, these failures accumulate silently until someone notices a dashboard metric that 'looks wrong.'

We implemented Great Expectations tests on every table — row count assertions, null rate thresholds, referential integrity checks, and value range validations. Failed tests page the on-call engineer and halt downstream pipeline stages until resolved. It added two weeks to the project. It's caught eleven real data quality issues in the eight months since launch.
