Interactive Lesson: Normalization Ride Share

Normalization & Denormalization Exercise – Ride Share Analytics

🚗 Ride Share Data Normalization

Master data modeling through real-world ride sharing scenarios

Step 1 of 5
0
Unnormalized
1
First Normal Form
2
Second Normal Form
3
Third Normal Form
Denormalization

📊 Unnormalized Data – The Messy Reality

This is how data often appears in spreadsheets – with multiple values in cells, repeating column groups, and mixed entity data:

🚨 Major Problems in This Table:
  • Non-atomic values: Multiple passengers in one cell
  • Repeating groups: Stop_1, Stop_2, Stop_3 columns
  • Mixed entities: Driver, vehicle, and ride data all jumbled together
  • Calculated fields stored: Total_Fare stored instead of calculated
Non-Atomic Values
Repeating Groups
Trip_Details Driver_Info Vehicle_Info Passengers Stop_1 Stop_2 Stop_3 Fare_Breakdown Total_Fare
R001
2024-01-15
14:30
Pool Ride Multiple Values!
Maria Garcia
555-0123
4.8★
2 years exp Multiple Values!
Toyota Camry 2022
ABC-123
John Smith (555-9876), Sarah Johnson (555-4321) List in Cell! 123 Main St
10 min
Downtown Plaza
15 min
Airport T2
Base: $25
Pool discount: -$5
Peak: +$8
Tip: $4
$32.00
R002
2024-01-15
16:45
Standard Multiple Values!
Alex Chen
555-0456
4.9★
3 years exp Multiple Values!
Honda Accord 2023
XYZ-789
Mike Wilson (555-1111) Format Mix! City Hall
5 min
Shopping Mall

Base: $12
Tip: $2
$14.00
R003
2024-01-16
08:00
Premium Multiple Values!
Maria Garcia
555-0123
4.8★
2 years exp Multiple Values!
Toyota Camry 2022
ABC-123
Emma Davis (555-2222), Lisa Brown (555-3333), Tom Lee (555-4444) List in Cell! Hotel Grand
8 min
Convention Ctr
12 min
Tech Campus
10 min
Base: $45
Premium: +$10
Tip: $8
$63.00

1️⃣ First Normal Form (1NF)

Eliminate repeating groups and ensure every cell contains exactly one atomic value.

✅ What We Fixed:
  • Split compound cells into separate columns (Trip_ID, Date, Time separated)
  • Created separate rows for each passenger (no lists in cells)
  • Eliminated repeating stop columns (Stop_1, Stop_2, Stop_3)
  • Each cell now contains exactly ONE value

Ride_Passengers Table (1NF)

Ride_ID Date Time Ride_Type Driver_Name Driver_Phone Driver_Rating Vehicle_Model License_Plate Passenger_Name Passenger_Phone Base_Fare Tip
R001 2024-01-15 14:30 Pool Maria Garcia 555-0123 4.8 Toyota Camry 2022 ABC-123 John Smith 555-9876 25.00 4.00
R001 2024-01-15 14:30 Pool Maria Garcia 555-0123 4.8 Toyota Camry 2022 ABC-123 Sarah Johnson 555-4321 25.00 4.00
R002 2024-01-15 16:45 Standard Alex Chen 555-0456 4.9 Honda Accord 2023 XYZ-789 Mike Wilson 555-1111 12.00 2.00

Ride_Stops Table (1NF) – Separated Repeating Groups

Ride_ID Stop_Sequence Location Duration_Minutes
R001 1 123 Main St 10
R001 2 Downtown Plaza 15
R001 3 Airport T2 0
R002 1 City Hall 5
R002 2 Shopping Mall 0
⚠️ New Problem Created: Now we have massive redundancy! The same ride and driver information is repeated for each passenger. This violates the DRY (Don’t Repeat Yourself) principle.

2️⃣ Second Normal Form (2NF)

Remove partial dependencies – ensure all non-key attributes depend on the ENTIRE primary key, not just part of it.

🔑 Understanding Keys:
  • In our 1NF table, the primary key was composite: (Ride_ID, Passenger_Name)
  • But Driver_Name only depends on Ride_ID, not on Passenger_Name
  • This is a partial dependency – we need to fix it!

🚗 Rides Table

Ride_ID (PK) Date Time Ride_Type Driver_Name Base_Fare Total_Tip
R001 2024-01-15 14:30 Pool Maria Garcia 25.00 4.00
R002 2024-01-15 16:45 Standard Alex Chen 12.00 2.00
R003 2024-01-16 08:00 Premium Maria Garcia 45.00 8.00

👥 Ride_Passengers Table

Ride_ID (FK) Passenger_Name Passenger_Phone
R001 John Smith 555-9876
R001 Sarah Johnson 555-4321
R002 Mike Wilson 555-1111
✅ What We Achieved:
  • Eliminated partial dependencies
  • Ride information stored once per ride (not per passenger)
  • Reduced redundancy significantly
⚠️ Still a Problem: Driver information (name, phone, rating) depends on Driver_Name, not on Ride_ID. This is a transitive dependency!

3️⃣ Third Normal Form (3NF)

Remove transitive dependencies – non-key attributes should depend ONLY on the primary key, not on other non-key attributes.

🔍 Transitive Dependencies Found:
  • Driver_Phone depends on Driver_Name (not Ride_ID)
  • Driver_Rating depends on Driver_Name (not Ride_ID)
  • Vehicle details depend on License_Plate (not Driver)
  • These create update anomalies – updating a driver’s phone requires updating multiple rides!

🚗 Rides Table (Final)

Ride_ID Driver_ID Date Time Ride_Type Base_Fare Total_Tip
R001 D001 2024-01-15 14:30 Pool 25.00 4.00
R002 D002 2024-01-15 16:45 Standard 12.00 2.00

👤 Drivers Table

Driver_ID Name Phone Rating Years_Experience Vehicle_ID
D001 Maria Garcia 555-0123 4.8 2 V001
D002 Alex Chen 555-0456 4.9 3 V002

🚙 Vehicles Table

Vehicle_ID Make Model Year License_Plate
V001 Toyota Camry 2022 ABC-123
V002 Honda Accord 2023 XYZ-789

👥 Passengers Table

Passenger_ID Name Phone
P001 John Smith 555-9876
P002 Sarah Johnson 555-4321
🎉 Fully Normalized Database!
  • ✅ No redundancy – each fact stored exactly once
  • ✅ No update anomalies – change driver phone in one place
  • ✅ No insert anomalies – can add drivers without rides
  • ✅ No delete anomalies – deleting a ride doesn’t lose driver info
  • ✅ Referential integrity via foreign keys

⚡ Strategic Denormalization for Analytics

While normalization is perfect for transactional systems, analytics often benefits from controlled denormalization.

🤔 The Analytics Challenge:

To answer “What was the average fare by driver rating tier last month?”, a normalized database requires:

  • 4-5 table JOINs
  • Complex aggregations
  • Slow query performance on large datasets

📊 Denormalized Fact Table for Analytics

Date_Key Ride_ID Driver_Name Driver_Rating_Tier Vehicle_Category Hour_of_Day Day_of_Week Is_Weekend Ride_Type Passenger_Count Total_Stops Distance_Miles Duration_Minutes Base_Fare Tip_Amount Total_Revenue
20240115 R001 Maria Garcia Premium Sedan 14 Monday False Pool 2 3 15.3 40 25.00 4.00 29.00
20240115 R002 Alex Chen Elite Sedan 16 Monday False Standard 1 2 5.2 15 12.00 2.00 14.00
🎯 Denormalization Benefits for Analytics:
  • Pre-calculated fields: Driver_Rating_Tier, Is_Weekend ready for grouping
  • No JOINs needed: All dimensions in one table
  • Fast aggregations: Optimized for SUM, AVG, COUNT operations
  • Business-friendly: Column names match business terminology
  • Time-series ready: Date_Key enables partitioning

Normalized Query (OLTP)

SELECT 
  CASE 
    WHEN d.rating >= 4.8 THEN 'Elite'
    WHEN d.rating >= 4.5 THEN 'Premium'
    ELSE 'Standard'
  END as rating_tier,
  AVG(r.base_fare + r.total_tip) as avg_revenue
FROM rides r
JOIN drivers d ON r.driver_id = d.driver_id
JOIN vehicles v ON d.vehicle_id = v.vehicle_id
JOIN (
  SELECT ride_id, COUNT(*) as passenger_count
  FROM ride_passengers
  GROUP BY ride_id
) p ON r.ride_id = p.ride_id
WHERE r.date >= '2024-01-01'
GROUP BY rating_tier;

Execution Time: 847ms

Denormalized Query (OLAP)

SELECT 
  driver_rating_tier,
  AVG(total_revenue) as avg_revenue
FROM ride_analytics_fact
WHERE date_key >= 20240101
GROUP BY driver_rating_tier;

Execution Time: 23ms
⚡ Performance Improvement: 37x faster!

This is why data warehouses and analytics platforms often use denormalized star schemas or data marts.

⚖️ The Trade-off:
  • ✅ Blazing fast analytics queries
  • ❌ More storage space required
  • ❌ ETL processes needed to maintain consistency
  • ❌ Not suitable for real-time transactional updates