The insurance company had a compliance problem disguised as a data problem.
Regulators were asking for data lineage reports — where does each number in the actuarial filings come from? Which source systems feed the reserve calculations? Who has access to policyholder PII?
The company couldn’t answer these questions. Not because the data didn’t exist, but because nobody had documented how it flowed through the organization. Data lived in Hadoop clusters, Oracle databases, Hive tables, and Spark jobs — and the only person who understood how it all connected was one senior engineer who was about to retire.
We had six months to build a data governance framework before the next regulatory audit.
Why Governance Matters in Insurance
Insurance is one of the most data-intensive regulated industries. Every pricing decision, every claims estimate, and every reserve calculation traces back to data. When the data is wrong or its provenance is unclear, the consequences are severe:
- Regulatory fines for inaccurate filings
- Mispriced risk from inconsistent data definitions (what counts as a “claim” varies across departments)
- Audit failures when you can’t demonstrate data lineage
- Operational inefficiency when analysts spend 40% of their time finding, cleaning, and validating data instead of analyzing it
The DMBOK 2 framework (Data Management Body of Knowledge by DAMA International) gave us a structured approach. We didn’t need to implement all 11 knowledge areas at once — we focused on what mattered most for the audit: Data Governance, Metadata Management, and Data Quality, plus data lineage, which DMBOK 2 treats as part of metadata management.
The Tool Stack
Collibra served as the data catalog and governance hub. It’s the industry standard for enterprise data governance — not the cheapest tool, but the one regulators recognize. Every data asset, business term, data owner, and policy lives in Collibra.
Informatica handled data integration and data quality profiling. Its PowerCenter module already managed some of the company’s ETL jobs, so extending it to include data quality rules was a natural fit.
Alteryx was used for data quality remediation workflows. When Informatica flagged a data quality issue, Alteryx workflows cleaned, standardized, and reconciled the data before it re-entered the pipeline.
What We Built
Data Catalog
The first step was visibility. We cataloged every data asset across the organization:
- 800+ tables across Hadoop, Oracle, and Hive
- 50+ ETL jobs documented with source-to-target mappings
- 120+ business terms defined in a business glossary (what does “incurred loss” mean? “earned premium”? “case reserve”?)
Each table in Collibra was linked to its business terms, its data owner, its classification (PII, confidential, public), and its upstream/downstream dependencies.
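In Collibra these links live in the catalog itself, but the shape of a single entry is easy to sketch. The dataclass below is purely illustrative — the field names are mine, not Collibra's actual asset model — and shows the minimum a catalog record needs to answer the auditors' questions: who owns it, how sensitive it is, and what it connects to.

```python
from dataclasses import dataclass, field

@dataclass
class CatalogEntry:
    """Illustrative model of one cataloged table.

    Field names are hypothetical, not Collibra's actual schema.
    """
    name: str
    system: str                     # e.g. "Oracle", "Hive"
    owner: str                      # accountable data steward
    classification: str             # "PII", "confidential", or "public"
    business_terms: list = field(default_factory=list)
    upstream: list = field(default_factory=list)    # tables this one reads from
    downstream: list = field(default_factory=list)  # tables that read from it

# Example entry (all asset and steward names are made up)
policy = CatalogEntry(
    name="policy_master",
    system="Oracle",
    owner="underwriting_steward",
    classification="PII",
    business_terms=["earned premium", "case reserve"],
    upstream=["crm_customers"],
    downstream=["reserve_calc_staging"],
)
```

An access-request reviewer can then answer "who touches PII?" by filtering entries on `classification` — the same query that took days before the catalog existed.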
Data Lineage
This was the regulatory requirement. For any number in a filing, an auditor should be able to trace it back through every transformation to the original source record.
We built lineage at three levels:
- Technical lineage — Automated extraction from Informatica ETL jobs. Every column-level transformation was captured: which source columns feed which target columns, what joins, filters, and aggregations are applied.
- Business lineage — Higher-level flow diagrams showing how business concepts (e.g., “net premium”) are derived from raw data. These were built in Collibra and reviewed by business stakeholders.
- Impact analysis — When a source system changes, which downstream reports are affected? Collibra’s impact analysis feature made this query instantaneous instead of requiring a week of manual investigation.
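Under the hood, impact analysis is just a reachability query over the lineage graph: start at the changed source and walk every downstream edge. The sketch below shows the idea with a toy graph — the asset names are invented, and Collibra's real implementation is of course richer — but the traversal is the same.

```python
from collections import deque

# Toy lineage graph: each key feeds the assets in its value list.
# Asset names are illustrative, not the client's actual tables.
lineage = {
    "oracle.policy_master": ["hive.policy_clean"],
    "hive.policy_clean": ["hive.reserve_calc", "hive.premium_summary"],
    "hive.reserve_calc": ["report.actuarial_filing"],
    "hive.premium_summary": ["report.pricing_dashboard"],
}

def impacted_assets(source: str) -> set:
    """Breadth-first walk downstream: everything reachable from `source`."""
    seen, queue = set(), deque([source])
    while queue:
        node = queue.popleft()
        for child in lineage.get(node, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen

# A change to policy_master touches every downstream table and report:
affected = impacted_assets("oracle.policy_master")
```

Running the same walk against the upstream edges answers the auditor's question in reverse: given a number in a filing, which source records does it trace back to?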
Data Quality Monitoring
We implemented data quality rules across five dimensions (aligned with DMBOK 2):
| Dimension | Example Rule | Action on Failure |
|---|---|---|
| Completeness | Policy records must have non-null policyholder ID | Flag for review, block from downstream |
| Accuracy | Premium amounts must be within 3σ of historical average | Alert data steward |
| Consistency | “Active” policy status must match across CRM and policy admin | Trigger Alteryx reconciliation |
| Timeliness | Claims data must arrive within 24 hours of submission | Escalate to data owner |
| Uniqueness | No duplicate policy numbers within the same product line | Deduplicate via Alteryx |
Informatica ran these rules daily. Results were published to a Collibra data quality scorecard — a single dashboard showing the health of every critical data asset.
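In production these were Informatica Data Quality rules, but three of the five dimensions are simple enough to sketch in a few lines of Python. The records and historical premiums below are synthetic, and the function names are mine — this is an illustration of the rule logic, not the deployed implementation.

```python
import statistics

# Synthetic policy records (not client data). Note the null
# policyholder ID, the outlier premium, and the duplicate policy ID.
records = [
    {"policy_id": "P-001", "policyholder_id": "H-10", "premium": 1200.0},
    {"policy_id": "P-002", "policyholder_id": None,   "premium": 1150.0},
    {"policy_id": "P-003", "policyholder_id": "H-12", "premium": 9800.0},
    {"policy_id": "P-001", "policyholder_id": "H-13", "premium": 1180.0},
]
history = [1100.0, 1150.0, 1200.0, 1250.0, 1180.0]  # historical premiums

def completeness_failures(rows):
    """Completeness: policyholder ID must be non-null."""
    return [r["policy_id"] for r in rows if r["policyholder_id"] is None]

def accuracy_failures(rows, past):
    """Accuracy: premium must lie within 3 sigma of the historical mean."""
    mu, sigma = statistics.mean(past), statistics.stdev(past)
    return [r["policy_id"] for r in rows if abs(r["premium"] - mu) > 3 * sigma]

def uniqueness_failures(rows):
    """Uniqueness: no duplicate policy numbers."""
    seen, dupes = set(), []
    for r in rows:
        if r["policy_id"] in seen:
            dupes.append(r["policy_id"])
        seen.add(r["policy_id"])
    return dupes

incomplete = completeness_failures(records)       # P-002 flagged
inaccurate = accuracy_failures(records, history)  # P-003 flagged
duplicated = uniqueness_failures(records)         # P-001 flagged
```

Each failure list maps to the "Action on Failure" column above: completeness failures are blocked downstream, accuracy failures alert the steward, and duplicates are routed to an Alteryx deduplication workflow.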
Data Stewardship
We established a data governance council with representatives from actuarial, claims, underwriting, IT, and compliance. Each department nominated a data steward responsible for:
- Approving business term definitions
- Resolving data quality issues in their domain
- Reviewing and approving access requests for sensitive data
- Participating in quarterly governance reviews
This wasn’t a technology decision — it was an organizational one. Without clear ownership, no amount of tooling matters.
The Hard Parts
Getting business buy-in. Data governance sounds like bureaucracy. We framed it as “you’ll spend less time arguing about numbers in meetings” — which turned out to be the most compelling pitch.
Scope control. The temptation is to catalog everything. We started with the 50 most critical data assets (those feeding regulatory reports) and expanded from there. Boiling the ocean is the fastest way to kill a governance initiative.
Legacy system documentation. The Hadoop cluster had tables created by people who no longer worked at the company, with no documentation and cryptic column names. Reverse-engineering these required interviewing long-tenured employees and reading Spark job source code line by line.
Results
- Audit readiness: Full data lineage from source to filing for all 50 critical data assets
- Data quality score: Improved from 72% to 94% across five dimensions in six months
- Incident response: Time to assess impact of a source system change reduced from 5 days to 2 hours
- Analyst productivity: 30% reduction in time spent on data finding and validation
- Regulatory confidence: Passed the audit with zero critical findings
The Stack
| Layer | Tool | Why |
|---|---|---|
| Data Catalog | Collibra | Industry standard, regulator-recognized |
| Business Glossary | Collibra | Single source of truth for business terms |
| Data Lineage | Collibra + Informatica | Automated technical lineage, manual business lineage |
| Data Quality | Informatica Data Quality | Rules engine, profiling, scorecards |
| Remediation | Alteryx | Visual data cleansing and reconciliation workflows |
| Source Systems | Hadoop, Oracle, Hive, Spark | Existing enterprise data infrastructure |
| Framework | DMBOK 2 (DAMA International) | Structured approach to governance maturity |
Simba Hu helps companies make better decisions with data and AI — from strategy to implementation. Based in Tokyo, serving clients globally. Book a strategy call or visit simbahu.com.