The e-commerce company was growing fast. Revenue had tripled in two years. But the data infrastructure hadn’t kept up. KPI reporting was a manual process — a data analyst spent two full days every week pulling numbers from five different platforms, pasting them into Google Sheets, and emailing PDFs to department heads.

By the time leadership saw the numbers, the data was already a week old. And nobody trusted it because the analyst’s formulas broke every time someone added a new product category.

We replaced the entire workflow with an automated ETL pipeline. Here’s how.

The Data Landscape

Data lived in five systems:

  • Shopify — Orders, products, customers, inventory
  • Google Analytics 4 — Traffic, conversion funnels, acquisition channels
  • Meta Ads Manager — Ad spend, impressions, ROAS by campaign
  • Stripe — Payments, refunds, subscription revenue
  • Zendesk — Support tickets, resolution time, customer satisfaction

Each system had its own API, its own data format, and its own definition of “revenue.” Shopify counted gross revenue. Stripe counted net revenue after fees. The Google Sheet tried to reconcile them and got it wrong half the time.

Why GCP

The company was already on Google Workspace. BigQuery’s serverless model meant no infrastructure to manage — no clusters to scale, no instances to patch. Pay per query, scale to zero when idle.

GCP Cloud Functions gave us event-driven ETL without managing servers. Each function handles one data source, runs on a schedule, and costs fractions of a cent per execution.

The Pipeline

Extraction

We wrote one Cloud Function per data source. Each function:

  1. Authenticates with the source API using secrets stored in Secret Manager
  2. Pulls incremental data since the last successful run (using watermark timestamps)
  3. Writes raw JSON to a Cloud Storage staging bucket
  4. Logs success/failure to Cloud Logging

The functions are triggered by Cloud Scheduler — Shopify and Stripe every 15 minutes, GA4 and Meta daily, Zendesk hourly.
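The watermark logic in step 2 can be sketched in plain Python. This is an illustration, not the production code: FakeSource stands in for a vendor API, and the dict-backed state and staging stores stand in for wherever the watermark and raw JSON actually live (the real functions authenticate via Secret Manager and write to Cloud Storage).

```python
from datetime import datetime, timezone

class FakeSource:
    """Stand-in for a vendor API (Shopify, Stripe, etc.)."""
    def __init__(self, rows):
        self.rows = rows  # list of (updated_at ISO string, payload)

    def fetch_updated_since(self, watermark):
        # ISO-8601 strings in the same format compare correctly as strings.
        return [payload for ts, payload in self.rows if ts > watermark]

def extract_incremental(source, state, staging):
    """Pull records newer than the last successful run, then advance the watermark."""
    # 1. Read the watermark left by the previous run (epoch on first run).
    watermark = state.get("last_success", "1970-01-01T00:00:00+00:00")

    # 2. Ask the source only for records updated after the watermark.
    records = source.fetch_updated_since(watermark)

    # 3. Write the raw batch to the staging area.
    run_ts = datetime.now(timezone.utc).isoformat()
    staging[f"raw/{run_ts}.json"] = records

    # 4. Advance the watermark only after the write succeeds, so a failed
    #    run is retried from the same point (at-least-once delivery).
    state["last_success"] = run_ts
    return len(records)

source = FakeSource([("2024-01-01T00:00:00+00:00", {"id": 1}),
                     ("2024-06-01T00:00:00+00:00", {"id": 2})])
state, staging = {}, {}
first = extract_incremental(source, state, staging)   # pulls both rows
second = extract_incremental(source, state, staging)  # nothing new since last run
```

The key design choice is step 4: updating the watermark only after a successful write means a crashed run re-pulls the same window instead of silently dropping it.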

Transformation

A second set of Cloud Functions picks up raw data from the staging bucket and transforms it into analysis-ready tables:

  • Currency normalization — All amounts converted to USD at the day’s exchange rate
  • Revenue reconciliation — A single “revenue” column that matches the accounting definition: gross revenue minus refunds minus discounts, before payment processor fees
  • Customer identity resolution — Matching Shopify customer IDs to Stripe subscription IDs to Zendesk ticket requester emails
  • Attribution modeling — Assigning revenue to marketing channels using last-click and first-click models from GA4
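The reconciliation and currency rules reduce to a small function. This is a rough sketch: the field names and the exchange rates are illustrative, not the production schema.

```python
# Example exchange rates for a single day; in production these come from a
# daily-refreshed rates table.
RATES_TO_USD = {"USD": 1.0, "JPY": 0.0067, "EUR": 1.08}

def reconciled_revenue_usd(order):
    """Revenue per the accounting definition: gross minus refunds minus
    discounts, before payment processor fees, normalized to USD."""
    net_local = order["gross"] - order["refunds"] - order["discounts"]
    # Processor fees (e.g. Stripe's cut) are deliberately NOT subtracted:
    # that is the single definition every dashboard agreed on.
    return round(net_local * RATES_TO_USD[order["currency"]], 2)

order = {"gross": 10000, "refunds": 1500, "discounts": 500, "currency": "JPY"}
revenue = reconciled_revenue_usd(order)  # (10000 - 1500 - 500) * 0.0067
```

Encoding the definition once, in code, is what eliminated the "whose revenue number is right" argument between departments.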

Transformed data lands in BigQuery as partitioned tables (by date) with clustering on product category and channel.
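The attribution step above can be sketched as follows, assuming touchpoints arrive as (timestamp, channel) pairs per converting order — a simplification of the GA4 data, for illustration only.

```python
def attribute(touchpoints, revenue, model="last_click"):
    """Assign all of an order's revenue to one channel per the chosen model."""
    ordered = sorted(touchpoints)  # sort by timestamp
    channel = ordered[0][1] if model == "first_click" else ordered[-1][1]
    return {channel: revenue}

touches = [("2024-03-01", "organic"),
           ("2024-03-04", "meta_ads"),
           ("2024-03-05", "email")]
last = attribute(touches, 120.0)                        # credits 'email'
first = attribute(touches, 120.0, model="first_click")  # credits 'organic'
```

Running both models side by side is deliberate: last-click flatters bottom-of-funnel channels, first-click flatters discovery channels, and the marketing mart shows both.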

Data Marts

We built four BigQuery data marts — one per department:

  • Sales mart — Revenue, AOV, conversion rate, repeat purchase rate by product/category/channel/cohort
  • Marketing mart — CAC, ROAS, LTV:CAC ratio, attribution by campaign
  • Operations mart — Fulfillment time, inventory turnover, stockout rate, return rate
  • Support mart — Ticket volume, first response time, resolution time, CSAT by category

Each mart is a materialized view that refreshes automatically when upstream tables update.
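For illustration, here are two of the sales-mart metrics computed in pure Python; in production these are SQL aggregations inside the materialized views, and the function and field names here are hypothetical.

```python
def sales_mart_metrics(orders):
    """orders: list of (customer_id, order_total) tuples."""
    revenue = sum(total for _, total in orders)
    aov = revenue / len(orders)  # average order value

    # Repeat purchase rate: share of customers with more than one order.
    counts = {}
    for customer, _ in orders:
        counts[customer] = counts.get(customer, 0) + 1
    repeat_rate = sum(1 for c in counts.values() if c > 1) / len(counts)

    return {"aov": round(aov, 2), "repeat_purchase_rate": round(repeat_rate, 2)}

metrics = sales_mart_metrics([("a", 50.0), ("b", 30.0), ("a", 20.0), ("c", 100.0)])
# AOV = 200 / 4 orders; repeat rate = 1 of 3 customers
```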

Dashboards

Domo was the BI layer. The company had an existing Domo license, and its non-technical team preferred the drag-and-drop interface to SQL-heavy alternatives.

We built:

  • A daily executive scorecard (revenue, orders, traffic, conversion — one page)
  • Department deep-dives with drill-through capability
  • Automated alerts: email notification when conversion rate drops below threshold, when ad spend exceeds daily budget, or when support ticket volume spikes
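The alert logic reduces to simple threshold comparisons. Here is a sketch with example thresholds (the real values are configured per client, and delivery is handled by the BI layer's email notifications):

```python
# Example thresholds, not the production configuration.
THRESHOLDS = {
    "conversion_rate_min": 0.02,  # alert if conversion rate falls below 2%
    "ad_spend_max": 500.0,        # alert if daily ad spend exceeds $500
    "ticket_volume_max": 80,      # alert on a support ticket spike
}

def check_alerts(metrics):
    """Return the list of alert messages triggered by today's metrics."""
    alerts = []
    if metrics["conversion_rate"] < THRESHOLDS["conversion_rate_min"]:
        alerts.append("conversion rate below threshold")
    if metrics["ad_spend"] > THRESHOLDS["ad_spend_max"]:
        alerts.append("ad spend over daily budget")
    if metrics["ticket_volume"] > THRESHOLDS["ticket_volume_max"]:
        alerts.append("support ticket volume spike")
    return alerts

alerts = check_alerts({"conversion_rate": 0.015,
                       "ad_spend": 620.0,
                       "ticket_volume": 40})
```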

Infrastructure as Code

The entire pipeline was defined in Terraform:

  • Cloud Functions (source code, triggers, environment variables)
  • Cloud Scheduler jobs (cron expressions)
  • BigQuery datasets, tables, and access controls
  • Cloud Storage buckets with lifecycle policies
  • IAM roles and service accounts

This meant the pipeline could be replicated for any new e-commerce client in under an hour — clone the repo, change the API credentials, run terraform apply.

Results

  • Reporting time: 2 days/week manual work → fully automated (roughly 800 analyst-hours/year saved)
  • Data freshness: Weekly snapshots → 15-minute refresh for revenue data
  • Cost: ~$120/month total GCP spend (Cloud Functions + BigQuery + Storage)
  • Accuracy: Revenue discrepancy between departments eliminated entirely
  • Scalability: Same pipeline handled 10x order volume during holiday peak without modification

The Stack

Layer            Tool                          Why
Extraction       GCP Cloud Functions           Serverless, event-driven, per-source isolation
Orchestration    Cloud Scheduler               Cron-based triggers, no server management
Staging          Cloud Storage                 Raw JSON landing zone, lifecycle policies
Warehouse        BigQuery                      Serverless, pay-per-query, partitioning/clustering
Transformation   Cloud Functions + SQL         Lightweight transforms in Python, heavy joins in BigQuery SQL
Data Marts       BigQuery materialized views   Auto-refresh, department-specific schemas
Dashboards       Domo                          Non-technical user preference, drag-and-drop
IaC              Terraform                     Reproducible infrastructure, version controlled

Simba Hu helps companies make better decisions with data and AI — from strategy to implementation. Based in Tokyo, serving clients globally. Book a strategy call or visit simbahu.com.