The insurance company had a compliance problem disguised as a data problem.
Regulators were asking for data lineage reports — where does each number in the actuarial filings come from? Which source systems feed the reserve calculations? Who has access to policyholder PII?
The company couldn’t answer these questions. Not because the data didn’t exist, but because nobody had documented how it flowed through the organization. Data lived in Hadoop clusters, Oracle databases, Hive tables, and Spark jobs — and the only person who understood how it all connected was one senior engineer who was about to retire.
We had six months to build a data governance framework before the next regulatory audit.
Why Governance Matters in Insurance
Insurance is one of the most data-intensive regulated industries. Every pricing decision, every claims estimate, and every reserve calculation traces back to data. When the data is wrong or its provenance is unclear, the consequences are severe:
- Regulatory fines for inaccurate filings
- Mispriced risk from inconsistent data definitions (what counts as a “claim” varies across departments)
- Audit failures when you can’t demonstrate data lineage
- Operational inefficiency when analysts spend 40% of their time finding, cleaning, and validating data instead of analyzing it
The DMBOK 2 framework (Data Management Body of Knowledge by DAMA International) gave us a structured approach. We didn’t need to implement all 11 knowledge areas at once — we focused on what mattered most for the audit: Data Governance, Metadata Management, and Data Quality, plus data lineage, which DMBOK 2 treats as part of metadata management.
The Tool Stack
Collibra served as the data catalog and governance hub. It’s the industry standard for enterprise data governance — not the cheapest tool, but the one regulators recognize. Every data asset, business term, data owner, and policy lives in Collibra.
Informatica handled data integration and data quality profiling. Its PowerCenter module already managed some of the company’s ETL jobs, so extending it to include data quality rules was a natural fit.
Alteryx was used for data quality remediation workflows. When Informatica flagged a data quality issue, Alteryx workflows cleaned, standardized, and reconciled the data before it re-entered the pipeline.
What We Built
Data Catalog
The first step was visibility. We cataloged every data asset across the organization:
- 800+ tables across Hadoop, Oracle, and Hive
- 50+ ETL jobs documented with source-to-target mappings
- 120+ business terms defined in a business glossary (what does “incurred loss” mean? “earned premium”? “case reserve”?)
Each table in Collibra was linked to its business terms, its data owner, its classification (PII, confidential, public), and its upstream/downstream dependencies.
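In Collibra these links live in the catalog itself, but the shape of a single entry is easy to sketch. The dataclass below is purely illustrative — the field names are mine, not Collibra's actual asset model — and shows the minimum a catalog record needs to answer the auditors' questions: who owns it, how sensitive it is, and what it connects to.

```python
from dataclasses import dataclass, field

@dataclass
class CatalogEntry:
    """Illustrative model of one cataloged table.

    Field names are hypothetical, not Collibra's actual schema.
    """
    name: str
    system: str                     # e.g. "Oracle", "Hive"
    owner: str                      # accountable data steward
    classification: str             # "PII", "confidential", or "public"
    business_terms: list = field(default_factory=list)
    upstream: list = field(default_factory=list)    # tables this one reads from
    downstream: list = field(default_factory=list)  # tables that read from it

# Example entry (all asset and steward names are made up)
policy = CatalogEntry(
    name="policy_master",
    system="Oracle",
    owner="underwriting_steward",
    classification="PII",
    business_terms=["earned premium", "case reserve"],
    upstream=["crm_customers"],
    downstream=["reserve_calc_staging"],
)
```

An access-request reviewer can then answer "who touches PII?" by filtering entries on `classification` — the same query that took days before the catalog existed.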
Data Lineage
This was the regulatory requirement. For any number in a filing, an auditor should be able to trace it back through every transformation to the original source record.
We built lineage at three levels:
- Technical lineage — Automated extraction from Informatica ETL jobs. Every column-level transformation was captured: which source columns feed which target columns, what joins, filters, and aggregations are applied.
- Business lineage — Higher-level flow diagrams showing how business concepts (e.g., “net premium”) are derived from raw data. These were built in Collibra and reviewed by business stakeholders.
- Impact analysis — When a source system changes, which downstream reports are affected? Collibra’s impact analysis feature made this query instantaneous instead of requiring a week of manual investigation.
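Under the hood, impact analysis is just a reachability query over the lineage graph: start at the changed source and walk every downstream edge. The sketch below shows the idea with a toy graph — the asset names are invented, and Collibra's real implementation is of course richer — but the traversal is the same.

```python
from collections import deque

# Toy lineage graph: each key feeds the assets in its value list.
# Asset names are illustrative, not the client's actual tables.
lineage = {
    "oracle.policy_master": ["hive.policy_clean"],
    "hive.policy_clean": ["hive.reserve_calc", "hive.premium_summary"],
    "hive.reserve_calc": ["report.actuarial_filing"],
    "hive.premium_summary": ["report.pricing_dashboard"],
}

def impacted_assets(source: str) -> set:
    """Breadth-first walk downstream: everything reachable from `source`."""
    seen, queue = set(), deque([source])
    while queue:
        node = queue.popleft()
        for child in lineage.get(node, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen

# A change to policy_master touches every downstream table and report:
affected = impacted_assets("oracle.policy_master")
```

Running the same walk against the upstream edges answers the auditor's question in reverse: given a number in a filing, which source records does it trace back to?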
Data Quality Monitoring
We implemented data quality rules across five dimensions (aligned with DMBOK 2):
| Dimension | Example Rule | Action on Failure |
|---|---|---|
| Completeness | Policy records must have non-null policyholder ID | Flag for review, block from downstream |
| Accuracy | Premium amounts must be within 3σ of historical average | Alert data steward |
| Consistency | “Active” policy status must match across CRM and policy admin | Trigger Alteryx reconciliation |
| Timeliness | Claims data must arrive within 24 hours of submission | Escalate to data owner |
| Uniqueness | No duplicate policy numbers within the same product line | Deduplicate via Alteryx |
Informatica ran these rules daily. Results were published to a Collibra data quality scorecard — a single dashboard showing the health of every critical data asset.
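In production these were Informatica Data Quality rules, but three of the five dimensions are simple enough to sketch in a few lines of Python. The records and historical premiums below are synthetic, and the function names are mine — this is an illustration of the rule logic, not the deployed implementation.

```python
import statistics

# Synthetic policy records (not client data). Note the null
# policyholder ID, the outlier premium, and the duplicate policy ID.
records = [
    {"policy_id": "P-001", "policyholder_id": "H-10", "premium": 1200.0},
    {"policy_id": "P-002", "policyholder_id": None,   "premium": 1150.0},
    {"policy_id": "P-003", "policyholder_id": "H-12", "premium": 9800.0},
    {"policy_id": "P-001", "policyholder_id": "H-13", "premium": 1180.0},
]
history = [1100.0, 1150.0, 1200.0, 1250.0, 1180.0]  # historical premiums

def completeness_failures(rows):
    """Completeness: policyholder ID must be non-null."""
    return [r["policy_id"] for r in rows if r["policyholder_id"] is None]

def accuracy_failures(rows, past):
    """Accuracy: premium must lie within 3 sigma of the historical mean."""
    mu, sigma = statistics.mean(past), statistics.stdev(past)
    return [r["policy_id"] for r in rows if abs(r["premium"] - mu) > 3 * sigma]

def uniqueness_failures(rows):
    """Uniqueness: no duplicate policy numbers."""
    seen, dupes = set(), []
    for r in rows:
        if r["policy_id"] in seen:
            dupes.append(r["policy_id"])
        seen.add(r["policy_id"])
    return dupes

incomplete = completeness_failures(records)       # P-002 flagged
inaccurate = accuracy_failures(records, history)  # P-003 flagged
duplicated = uniqueness_failures(records)         # P-001 flagged
```

Each failure list maps to the "Action on Failure" column above: completeness failures are blocked downstream, accuracy failures alert the steward, and duplicates are routed to an Alteryx deduplication workflow.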
Data Stewardship
We established a data governance council with representatives from actuarial, claims, underwriting, IT, and compliance. Each department nominated a data steward responsible for:
- Approving business term definitions
- Resolving data quality issues in their domain
- Reviewing and approving access requests for sensitive data
- Participating in quarterly governance reviews
This wasn’t a technology decision — it was an organizational one. Without clear ownership, no amount of tooling matters.
The Hard Parts
Getting business buy-in. Data governance sounds like bureaucracy. We framed it as “you’ll spend less time arguing about numbers in meetings” — which turned out to be the most compelling pitch.
Scope control. The temptation is to catalog everything. We started with the 50 most critical data assets (those feeding regulatory reports) and expanded from there. Boiling the ocean is the fastest way to kill a governance initiative.
Legacy system documentation. The Hadoop cluster had tables created by people who no longer worked at the company, with no documentation and cryptic column names. Reverse-engineering these required interviewing long-tenured employees and reading Spark job source code line by line.
Results
- Audit readiness: Full data lineage from source to filing for all 50 critical data assets
- Data quality score: Improved from 72% to 94% across five dimensions in six months
- Incident response: Time to assess impact of a source system change reduced from 5 days to 2 hours
- Analyst productivity: 30% reduction in time spent on data finding and validation
- Regulatory confidence: Passed the audit with zero critical findings
The Stack
| Layer | Tool | Why |
|---|---|---|
| Data Catalog | Collibra | Industry standard, regulator-recognized |
| Business Glossary | Collibra | Single source of truth for business terms |
| Data Lineage | Collibra + Informatica | Automated technical lineage, manual business lineage |
| Data Quality | Informatica Data Quality | Rules engine, profiling, scorecards |
| Remediation | Alteryx | Visual data cleansing and reconciliation workflows |
| Source Systems | Hadoop, Oracle, Hive, Spark | Existing enterprise data infrastructure |
| Framework | DMBOK 2 (DAMA International) | Structured approach to governance maturity |
Simba Hu helps companies make better decisions with data and AI — from strategy to implementation. Based in Tokyo, serving clients globally. Book a strategy call or visit simbahu.com.