The e-commerce company was growing fast. Revenue had tripled in two years. But the data infrastructure hadn’t kept up. KPI reporting was a manual process — a data analyst spent two full days every week pulling numbers from five different platforms, pasting them into Google Sheets, and emailing PDFs to department heads.

By the time leadership saw the numbers, the data was already a week old. And nobody trusted it because the analyst’s formulas broke every time someone added a new product category.

We replaced the entire workflow with an automated ETL pipeline. Here’s how.

The Data Landscape

Data lived in five systems:

  • Shopify — Orders, products, customers, inventory
  • Google Analytics 4 — Traffic, conversion funnels, acquisition channels
  • Meta Ads Manager — Ad spend, impressions, ROAS by campaign
  • Stripe — Payments, refunds, subscription revenue
  • Zendesk — Support tickets, resolution time, customer satisfaction

Each system had its own API, its own data format, and its own definition of “revenue.” Shopify counted gross revenue. Stripe counted net revenue after fees. The Google Sheet tried to reconcile them and got it wrong half the time.

Why GCP

The company was already on Google Workspace. BigQuery’s serverless model meant no infrastructure to manage — no clusters to scale, no instances to patch. Pay per query, scale to zero when idle.

GCP Cloud Functions gave us event-driven ETL without managing servers. Each function handles one data source, runs on a schedule, and costs fractions of a cent per execution.

The Pipeline

Extraction

We wrote one Cloud Function per data source. Each function:

  1. Authenticates with the source API using secrets stored in Secret Manager
  2. Pulls incremental data since the last successful run (using watermark timestamps)
  3. Writes raw JSON to a Cloud Storage staging bucket
  4. Logs success/failure to Cloud Logging

The functions are triggered by Cloud Scheduler — Shopify and Stripe every 15 minutes, GA4 and Meta daily, Zendesk hourly.
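The watermark logic in step 2 can be sketched in plain Python. This is an illustration, not the production code: FakeSource stands in for a vendor API, and the dict-backed state and staging stores stand in for wherever the watermark and raw JSON actually live (the real functions authenticate via Secret Manager and write to Cloud Storage).

```python
from datetime import datetime, timezone

class FakeSource:
    """Stand-in for a vendor API (Shopify, Stripe, etc.)."""
    def __init__(self, rows):
        self.rows = rows  # list of (updated_at ISO string, payload)

    def fetch_updated_since(self, watermark):
        # ISO-8601 strings in the same format compare correctly as strings.
        return [payload for ts, payload in self.rows if ts > watermark]

def extract_incremental(source, state, staging):
    """Pull records newer than the last successful run, then advance the watermark."""
    # 1. Read the watermark left by the previous run (epoch on first run).
    watermark = state.get("last_success", "1970-01-01T00:00:00+00:00")

    # 2. Ask the source only for records updated after the watermark.
    records = source.fetch_updated_since(watermark)

    # 3. Write the raw batch to the staging area.
    run_ts = datetime.now(timezone.utc).isoformat()
    staging[f"raw/{run_ts}.json"] = records

    # 4. Advance the watermark only after the write succeeds, so a failed
    #    run is retried from the same point (at-least-once delivery).
    state["last_success"] = run_ts
    return len(records)

source = FakeSource([("2024-01-01T00:00:00+00:00", {"id": 1}),
                     ("2024-06-01T00:00:00+00:00", {"id": 2})])
state, staging = {}, {}
first = extract_incremental(source, state, staging)   # pulls both rows
second = extract_incremental(source, state, staging)  # nothing new since last run
```

The key design choice is step 4: updating the watermark only after a successful write means a crashed run re-pulls the same window instead of silently dropping it.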

Transformation

A second set of Cloud Functions picks up raw data from the staging bucket and transforms it into analysis-ready tables:

  • Currency normalization — All amounts converted to USD at the day’s exchange rate
  • Revenue reconciliation — A single “revenue” column that matches the accounting definition: gross revenue minus refunds minus discounts, before payment processor fees
  • Customer identity resolution — Matching Shopify customer IDs to Stripe subscription IDs to Zendesk ticket requester emails
  • Attribution modeling — Assigning revenue to marketing channels using last-click and first-click models from GA4
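The reconciliation and currency rules reduce to a small function. This is a rough sketch: the field names and the exchange rates are illustrative, not the production schema.

```python
# Example exchange rates for a single day; in production these come from a
# daily-refreshed rates table.
RATES_TO_USD = {"USD": 1.0, "JPY": 0.0067, "EUR": 1.08}

def reconciled_revenue_usd(order):
    """Revenue per the accounting definition: gross minus refunds minus
    discounts, before payment processor fees, normalized to USD."""
    net_local = order["gross"] - order["refunds"] - order["discounts"]
    # Processor fees (e.g. Stripe's cut) are deliberately NOT subtracted:
    # that is the single definition every dashboard agreed on.
    return round(net_local * RATES_TO_USD[order["currency"]], 2)

order = {"gross": 10000, "refunds": 1500, "discounts": 500, "currency": "JPY"}
revenue = reconciled_revenue_usd(order)  # (10000 - 1500 - 500) * 0.0067
```

Encoding the definition once, in code, is what eliminated the "whose revenue number is right" argument between departments.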

Transformed data lands in BigQuery as partitioned tables (by date) with clustering on product category and channel.
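The attribution step above can be sketched as follows, assuming touchpoints arrive as (timestamp, channel) pairs per converting order — a simplification of the GA4 data, for illustration only.

```python
def attribute(touchpoints, revenue, model="last_click"):
    """Assign all of an order's revenue to one channel per the chosen model."""
    ordered = sorted(touchpoints)  # sort by timestamp
    channel = ordered[0][1] if model == "first_click" else ordered[-1][1]
    return {channel: revenue}

touches = [("2024-03-01", "organic"),
           ("2024-03-04", "meta_ads"),
           ("2024-03-05", "email")]
last = attribute(touches, 120.0)                        # credits 'email'
first = attribute(touches, 120.0, model="first_click")  # credits 'organic'
```

Running both models side by side is deliberate: last-click flatters bottom-of-funnel channels, first-click flatters discovery channels, and the marketing mart shows both.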

Data Marts

We built four BigQuery data marts — one per department:

  • Sales mart — Revenue, AOV, conversion rate, repeat purchase rate by product/category/channel/cohort
  • Marketing mart — CAC, ROAS, LTV:CAC ratio, attribution by campaign
  • Operations mart — Fulfillment time, inventory turnover, stockout rate, return rate
  • Support mart — Ticket volume, first response time, resolution time, CSAT by category

Each mart is a materialized view that refreshes automatically when upstream tables update.
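For illustration, here are two of the sales-mart metrics computed in pure Python; in production these are SQL aggregations inside the materialized views, and the function and field names here are hypothetical.

```python
def sales_mart_metrics(orders):
    """orders: list of (customer_id, order_total) tuples."""
    revenue = sum(total for _, total in orders)
    aov = revenue / len(orders)  # average order value

    # Repeat purchase rate: share of customers with more than one order.
    counts = {}
    for customer, _ in orders:
        counts[customer] = counts.get(customer, 0) + 1
    repeat_rate = sum(1 for c in counts.values() if c > 1) / len(counts)

    return {"aov": round(aov, 2), "repeat_purchase_rate": round(repeat_rate, 2)}

metrics = sales_mart_metrics([("a", 50.0), ("b", 30.0), ("a", 20.0), ("c", 100.0)])
# AOV = 200 / 4 orders; repeat rate = 1 of 3 customers
```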

Dashboards

Domo was the BI layer. The company had an existing Domo license, and its non-technical team preferred the drag-and-drop interface to SQL-heavy alternatives.

We built:

  • A daily executive scorecard (revenue, orders, traffic, conversion — one page)
  • Department deep-dives with drill-through capability
  • Automated alerts: email notification when conversion rate drops below threshold, when ad spend exceeds daily budget, or when support ticket volume spikes
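The alert logic reduces to simple threshold comparisons. Here is a sketch with example thresholds (the real values are configured per client, and delivery is handled by the BI layer's email notifications):

```python
# Example thresholds, not the production configuration.
THRESHOLDS = {
    "conversion_rate_min": 0.02,  # alert if conversion rate falls below 2%
    "ad_spend_max": 500.0,        # alert if daily ad spend exceeds $500
    "ticket_volume_max": 80,      # alert on a support ticket spike
}

def check_alerts(metrics):
    """Return the list of alert messages triggered by today's metrics."""
    alerts = []
    if metrics["conversion_rate"] < THRESHOLDS["conversion_rate_min"]:
        alerts.append("conversion rate below threshold")
    if metrics["ad_spend"] > THRESHOLDS["ad_spend_max"]:
        alerts.append("ad spend over daily budget")
    if metrics["ticket_volume"] > THRESHOLDS["ticket_volume_max"]:
        alerts.append("support ticket volume spike")
    return alerts

alerts = check_alerts({"conversion_rate": 0.015,
                       "ad_spend": 620.0,
                       "ticket_volume": 40})
```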

Infrastructure as Code

The entire pipeline was defined in Terraform:

  • Cloud Functions (source code, triggers, environment variables)
  • Cloud Scheduler jobs (cron expressions)
  • BigQuery datasets, tables, and access controls
  • Cloud Storage buckets with lifecycle policies
  • IAM roles and service accounts

This meant the pipeline could be replicated for any new e-commerce client in under an hour — clone the repo, change the API credentials, run terraform apply.

Results

  • Reporting time: 2 days/week manual work → fully automated (roughly 800 analyst-hours/year saved)
  • Data freshness: Weekly snapshots → 15-minute refresh for revenue data
  • Cost: ~$120/month total GCP spend (Cloud Functions + BigQuery + Storage)
  • Accuracy: Revenue discrepancy between departments eliminated entirely
  • Scalability: Same pipeline handled 10x order volume during holiday peak without modification

The Stack

Layer            Tool                          Why
Extraction       GCP Cloud Functions           Serverless, event-driven, per-source isolation
Orchestration    Cloud Scheduler               Cron-based triggers, no server management
Staging          Cloud Storage                 Raw JSON landing zone, lifecycle policies
Warehouse        BigQuery                      Serverless, pay-per-query, partitioning/clustering
Transformation   Cloud Functions + SQL         Lightweight transforms in Python, heavy joins in BigQuery SQL
Data Marts       BigQuery materialized views   Auto-refresh, department-specific schemas
Dashboards       Domo                          Non-technical user preference, drag-and-drop
IaC              Terraform                     Reproducible infrastructure, version controlled

Simba Hu helps companies make better decisions with data and AI — from strategy to implementation. Based in Tokyo, serving clients globally. Book a strategy call or visit simbahu.com.