Module 10: Analytics Engineering Capstone Project

Capstone Intro

1. The Big Picture: What You’re Building

You’ll build an analytics pipeline that transforms raw data, consisting of:

  1. Registrations (when users sign up)
  2. Sessions (when users visit the site)
  3. Transactions (when users make a purchase)

…into insights that help answer key business questions like:

  • “How many registrations do we have over time?”
  • “Which channels drive the most registrations and purchases?”
  • “How many transactions are completed, and how quickly are orders delivered?”

To achieve this, you’ll use:

  • BigQuery as your data warehouse
  • dbt to transform and model data (including incremental models, seed files, and testing)
  • Version control via GitHub to manage and track changes to your code
  • dbt Cloud to orchestrate scheduled jobs and send email notifications
  • Looker Studio to visualize and present the final insights to stakeholders

2. Set Up Your Environment

What You’ll Do:

  1. Get Access to BigQuery
    • Store your raw data in BigQuery (you’ll have three main tables: registrations, sessions, and transactions).
  2. Install & Configure dbt (and dbt Cloud)
    • dbt will allow you to create SQL-based transformations, manage them in a structured workflow, and schedule daily/weekly jobs using dbt Cloud.
  3. Set Up Version Control on GitHub
    • Initialize a GitHub repository for your dbt project.
    • Commit and push your code regularly to track changes.
    • If working in a team, create branches for new features or changes, and use pull requests to merge code.
  4. Prepare Looker Studio
    • You will later connect Looker Studio to your curated/aggregated data in BigQuery to create dashboards.
  5. Plan for Notifications
    • In dbt Cloud, configure email (or Slack) notifications so you’ll know when a scheduled job fails or succeeds.

Why It Matters:

  • Having a robust environment ensures your data is secure, queries run efficiently, and your transformations are version-controlled.
  • Using dbt Cloud lets you schedule and automate jobs, plus get notified if something goes wrong—essential for production-grade pipelines.

3. Understand the Data

Data Overview:

  1. Registrations
    • Columns: visit_id, user_id, registered_at
  2. Sessions
    • Columns: visit_id, visit_start_time, visit_end_time, device_type, browser, pageview_count, spend_type, attributed_channel, attributed_subchannel, etc.
  3. Transactions
    • Columns: order_id, visit_id, user_id, order_created_at, shipping_carrier, shipping_method, estimated_delivery_date, delivered_at, etc.

What You’ll Do:

  • Identify how these tables link together (e.g., visit_id is shared by sessions, registrations, and transactions, while user_id links registrations to transactions); see the join sketch after this list.
  • Note any potential data quality issues (e.g., missing data, duplicate IDs).
  • Determine which columns contain key metrics or attributes (channel, device type, timestamps).
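
As a quick illustration, here is a hedged BigQuery sketch of how the three raw tables could be joined; the raw_data dataset name is a placeholder for wherever your raw tables actually live.

  -- Hedged sketch: the `raw_data.*` dataset and table names are placeholders.
  -- One row per visit, showing who (if anyone) registered and how many orders it produced.
  select
      s.visit_id,
      s.attributed_channel,
      s.device_type,
      r.user_id          as registered_user_id,
      count(t.order_id)  as orders_in_visit
  from raw_data.sessions            as s
  left join raw_data.registrations  as r
      on s.visit_id = r.visit_id
  left join raw_data.transactions   as t
      on s.visit_id = t.visit_id
  group by 1, 2, 3, 4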

Why It Matters:

  • You need a clear understanding of how the data is structured before you start modeling it.
  • Making sure you know how tables relate will help you design the dbt transformations effectively.

4. Create Staging Models (and Seed Files) in dbt

The staging layer (models typically prefixed with stg_) will be the foundation of your project. You’ll build a staging model for each raw table, plus leverage dbt seed files if you have CSV-based reference data (like channel mappings or device categorization).

  1. Staging Models:
    • stg_sessions
    • stg_registrations
    • stg_transactions
  2. Seed Files:
    • For example, a CSV that contains reference info for shipping carriers or marketing channels.
    • You’ll place these CSV files in your dbt project’s seeds/ folder; running dbt seed loads each one into a warehouse table that models can reference via ref() (see the sketch below).
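
For example, assuming a hypothetical seeds/channel_mapping.csv with columns channel_code and channel_group, running dbt seed loads it into the warehouse, and any model can then join to it via ref() like a regular table:

  -- Hypothetical seed: seeds/channel_mapping.csv (channel_code, channel_group).
  -- Loaded with `dbt seed`, then referenced like any other model.
  select
      s.visit_id,
      s.attributed_channel,
      m.channel_group
  from {{ ref('stg_sessions') }}         as s
  left join {{ ref('channel_mapping') }} as m
      on s.attributed_channel = m.channel_code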

What You’ll Do:

  • Rename columns for consistency (e.g., registered_at to registration_timestamp), as sketched in the staging model after this list.
  • Normalize timestamps (e.g., ensure all date/time columns share the same format/time zone).
  • Handle duplicates if necessary (e.g., remove or mark them).
  • Test your staging models (using dbt test) to ensure data integrity—e.g., check for primary key uniqueness.
  • Commit and push each change to GitHub, and open pull requests for review if collaborating.
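
To make this concrete, here is a minimal sketch of what stg_registrations might look like; the source named 'raw' is an assumption about how you declare sources in your project, and the cast assumes registered_at arrives as a string or datetime.

  -- models/staging/stg_registrations.sql (sketch only)
  -- Assumes a source named 'raw' with a 'registrations' table declared in a sources .yml file.
  with source as (

      select * from {{ source('raw', 'registrations') }}

  ),

  renamed as (

      select
          visit_id,
          user_id,
          -- Rename and normalize to a consistent timestamp type.
          cast(registered_at as timestamp) as registration_timestamp,
          row_number() over (
              partition by visit_id, user_id
              order by registered_at
          ) as row_num
      from source

  )

  -- Keep one row per (visit_id, user_id) in case the raw feed contains duplicates.
  select
      visit_id,
      user_id,
      registration_timestamp
  from renamed
  where row_num = 1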

Why It Matters:

  • The staging layer ensures that all source data is standardized and ready for further transformations.
  • Seed files give you a quick way to add small reference datasets without needing external sources.
  • Testing your staging data makes your pipeline more robust and reliable.

5. Build Fact & Dimension Tables (Including Incremental Models)

After your staging tables are set, you’ll create fact and dimension tables. Where appropriate, especially for large datasets, use dbt’s incremental materialization to process only new or updated records rather than reprocessing the entire dataset every time.

Potential Dimension Tables:

  • dim_user: Stores user attributes (user_id, the date they first registered, etc.).
  • dim_device: Normalizes device_type and browser.
  • dim_channel: Defines each marketing channel (paid, organic, subchannel details), possibly sourced from seed files.

Potential Fact Tables:

  • fact_sessions: Session-level data (visit ID, start/end time, channel, device, etc.).
  • fact_registrations: Registration events (visit ID, user ID, registration timestamp).
  • fact_transactions: Transaction details (order ID, user ID, timestamps, shipping info).

What You’ll Do:

  1. Join the staged data in dbt using common keys (visit_id, user_id).
  2. Create dimension tables that reference unique keys and store descriptive attributes.
  3. Create fact tables that store metrics (pageview_count, shipping timestamps) and reference dimension IDs.
  4. For large tables, configure them as incremental models so dbt only processes new or changed records after the initial full load (see the sketch after this list).
  5. Continue to test (e.g., ensuring no null values in primary keys, referential integrity to dimension tables).
  6. Push each new model to GitHub, creating commits for each significant update.
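
As a reference point for step 4, here is a hedged sketch of fact_transactions as an incremental model; the unique_key and the order_created_at filter are assumptions about how new orders arrive, so adapt them to your actual load pattern.

  -- models/marts/fact_transactions.sql (sketch only)
  -- After the first full build, only rows newer than the latest loaded order are processed.
  {{
      config(
          materialized = 'incremental',
          unique_key = 'order_id'
      )
  }}

  select
      order_id,
      visit_id,
      user_id,
      order_created_at,
      shipping_carrier,
      shipping_method,
      estimated_delivery_date,
      delivered_at
  from {{ ref('stg_transactions') }}

  {% if is_incremental() %}
  -- On incremental runs, only pick up orders created after the latest one already loaded.
  where order_created_at > (select max(order_created_at) from {{ this }})
  {% endif %}

On the first run (or with dbt run --full-refresh) the filter is skipped and the whole table is built; on later runs dbt merges only the new rows, keyed on order_id.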

Why It Matters:

  • Fact/dim schemas let you easily slice and dice data for reporting.
  • Incremental models can drastically improve run times and optimize costs if your dataset is large.
  • By version-controlling your models, you can track exactly which transformations changed over time—and revert if something breaks.

6. Develop Data Marts (Aggregated Views)

Sometimes you’ll need aggregations—like daily registrations or transaction counts. Building separate data mart tables or views can speed up dashboard queries and simplify analysis.

Examples:

  • Daily Registrations by Channel (daily_registrations_by_channel).
  • Transactions by Shipping Carrier (transactions_by_carrier).

What You’ll Do:

  1. Create aggregate tables in dbt that group data by day, channel, device, etc. (a model sketch follows this list).
  2. Include key metrics (e.g., total registrations, total sessions, on-time deliveries).
  3. Test these aggregates against your fact tables to ensure data accuracy.
  4. Continue to commit changes to GitHub so you can review your modeling progress.
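
A hedged sketch of daily_registrations_by_channel is below; it assumes fact_registrations carries registration_timestamp and visit_id, and that channel attribution lives on fact_sessions, so adjust the join to match your actual models.

  -- models/marts/daily_registrations_by_channel.sql (sketch only)
  -- One row per day and channel with a registration count.
  select
      date(r.registration_timestamp)  as registration_date,
      s.attributed_channel,
      count(distinct r.user_id)       as registrations
  from {{ ref('fact_registrations') }} as r
  left join {{ ref('fact_sessions') }} as s
      on r.visit_id = s.visit_id
  group by 1, 2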

Why It Matters:

  • Aggregated views make it easy to quickly answer common business questions (e.g., “How many registrations per day?”) without scanning entire fact tables.
  • Testing and version control ensure reliability in your final dashboards.

7. Visualize in Looker Studio

This is where your data pipeline becomes actionable. You’ll connect Looker Studio to your final tables in BigQuery to create interactive dashboards.

What You’ll Do:

  1. Connect Looker Studio to your curated/aggregated data in BigQuery.
  2. Build Dashboards that showcase:
    • Registrations Over Time: A line chart to show daily or weekly registration trends.
    • Device & Browser Breakdown: Pie or bar charts of session metrics.
    • Channel Analysis: Which marketing channels lead to the most registrations/transactions.
    • Conversion Funnel: Visualize the funnel from sessions → registrations → transactions.
    • Shipping & Delivery: Track orders by carrier, delivery times, on-time performance.
  3. Add filters and date pickers so users can interact with the data, e.g., view specific date ranges or channels.
  4. Publish your dashboards and share them with relevant stakeholders.

Why It Matters:

  • Clear, insightful dashboards help non-technical teams (like executives or marketing) quickly understand how the business is performing.
  • Using Looker Studio makes your data pipeline feel “real”: you’re delivering actionable insights, not just raw tables.

8. Orchestrate & Schedule dbt Jobs (Testing & Notifications)

Your pipeline should run automatically on a daily or weekly cadence. You’ll set this up in dbt Cloud:

What You’ll Do:

  1. Create a Job in dbt Cloud that runs your models on a schedule (e.g., every day at midnight).
  2. Enable Testing so dbt runs data tests before or after the transformations to catch data quality issues (e.g., null values, duplicates, broken referential integrity); a custom test sketch follows this list.
  3. Configure Email Notifications to alert you (or the team) when a job fails or if tests don’t pass.
  4. If you have large tables, make sure to leverage incremental settings appropriately to keep runtime efficient.
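
Generic tests (unique, not_null, relationships) are declared next to your models in .yml files, and you can also add singular tests as plain SQL files under tests/; here is a hedged sketch of one such test, which fails if any transaction points at a visit that has no session.

  -- tests/assert_transactions_have_sessions.sql (sketch only)
  -- A singular dbt test: it passes when this query returns zero rows.
  select
      t.order_id
  from {{ ref('fact_transactions') }} as t
  left join {{ ref('fact_sessions') }} as s
      on t.visit_id = s.visit_id
  where s.visit_id is null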

Why It Matters:

  • Automated scheduling ensures fresh data is always available for your dashboards.
  • Testing and notifications make your pipeline robust: you’ll know instantly if something breaks.
  • This step simulates a production-level data workflow, which is what you’d see in real-world data teams.

9. Answering Business Questions (Final Deliverable Examples)

By the end, you’ll have a set of dashboards that answer real business questions:

  1. Registrations & Sessions Over Time
    • A line chart with daily registrations; KPIs for total sessions vs. registrations.
  2. Device & Channel Performance
    • Breakdowns by device type (desktop, mobile) and attribution channel (paid, organic).
    • Identify which channels or devices convert best.
  3. Conversion Funnel
    • Visualizing how many sessions lead to registrations, and how many of those lead to transactions (a funnel query sketch appears at the end of this section).
  4. Order & Delivery Analytics
    • Show on-time vs. delayed deliveries, average shipping time by carrier or shipping method.

These insights can then be shared with the CEO, marketing teams, or operations to make data-driven decisions and optimize the user journey.
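
For the conversion funnel in particular, a hedged BigQuery sketch over the fact tables could look like the query below; the fully qualified table names are placeholders, and counting per visit (rather than per user) is an assumption to align with how the business defines conversion.

  -- Funnel sketch: sessions -> registrations -> transactions, counted per visit.
  -- `your_project.analytics` is a placeholder for your curated BigQuery dataset.
  select
      count(distinct s.visit_id) as sessions,
      count(distinct r.visit_id) as registrations,
      count(distinct t.visit_id) as visits_with_purchase
  from `your_project.analytics.fact_sessions` as s
  left join `your_project.analytics.fact_registrations` as r
      on s.visit_id = r.visit_id
  left join `your_project.analytics.fact_transactions` as t
      on s.visit_id = t.visit_id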


10. Final Deliverables

By following these steps, you’ll end up with:

  1. dbt Project
    • A structured set of staging models, fact/dimension models, and aggregated data marts.
    • All code and transformations stored and version-controlled in GitHub.
  2. Documented Transformations & Testing
    • Use dbt’s documentation features, or a README in GitHub, to explain each table and transformation.
    • Include data tests that ensure each step meets your data quality standards.
    • Outline assumptions or data cleaning rules.
  3. Looker Studio Dashboards
    • One or more dashboards with interactive filters and visualizations.
    • Clear labels, consistent color schemes, and easily interpretable insights.
  4. dbt Cloud Orchestration
    • Automated daily/weekly jobs that run your transformations (including incremental models).
    • Email (or Slack) notifications to alert you on failures or test errors.
  5. Demo/Presentation
    • Show how data flows from BigQuery (raw) → dbt (cleaned, modeled, tested) → dbt Cloud (scheduled jobs, notifications) → Looker Studio (visual insights).
    • Highlight how each key metric or visualization ties back to real business objectives.

Wrap-Up

As a student, you’ll learn how to:

  • Set up a cloud-based data pipeline in BigQuery
  • Model and transform data using dbt with incremental models and seed files
  • Version control your project with GitHub
  • Write and maintain data tests to ensure quality
  • Automate and schedule runs via dbt Cloud (with notifications)
  • Present insights via Looker Studio dashboards

This is a comprehensive, end-to-end experience that closely mirrors how modern data teams operate in the real world. Embrace each step, commit your work frequently, test your transformations, and most importantly—enjoy translating raw data into valuable business insights!