Interactive Lesson: Normalization Ride Share
🚗 Ride Share Data Normalization
Master data modeling through real-world ride sharing scenarios
📊 Unnormalized Data – The Messy Reality
This is how data often appears in spreadsheets – with multiple values in cells, repeating column groups, and mixed entity data:
- Non-atomic values: Multiple passengers in one cell
- Repeating groups: Stop_1, Stop_2, Stop_3 columns
- Mixed entities: Driver, vehicle, and ride data all jumbled together
- Calculated fields stored: Total_Fare stored instead of calculated
Trip_Details | Driver_Info | Vehicle_Info | Passengers | Stop_1 | Stop_2 | Stop_3 | Fare_Breakdown | Total_Fare |
---|---|---|---|---|---|---|---|---|
R001 2024-01-15 14:30 Pool Ride Multiple Values! |
Maria Garcia 555-0123 4.8★ 2 years exp Multiple Values! |
Toyota Camry 2022 ABC-123 |
John Smith (555-9876), Sarah Johnson (555-4321) List in Cell! | 123 Main St 10 min |
Downtown Plaza 15 min |
Airport T2 – |
Base: $25 Pool discount: -$5 Peak: +$8 Tip: $4 |
$32.00 |
R002 2024-01-15 16:45 Standard Multiple Values! |
Alex Chen 555-0456 4.9★ 3 years exp Multiple Values! |
Honda Accord 2023 XYZ-789 |
Mike Wilson (555-1111) Format Mix! | City Hall 5 min |
Shopping Mall – |
– – |
Base: $12 Tip: $2 |
$14.00 |
R003 2024-01-16 08:00 Premium Multiple Values! |
Maria Garcia 555-0123 4.8★ 2 years exp Multiple Values! |
Toyota Camry 2022 ABC-123 |
Emma Davis (555-2222), Lisa Brown (555-3333), Tom Lee (555-4444) List in Cell! | Hotel Grand 8 min |
Convention Ctr 12 min |
Tech Campus 10 min |
Base: $45 Premium: +$10 Tip: $8 |
$63.00 |
1️⃣ First Normal Form (1NF)
Eliminate repeating groups and ensure every cell contains exactly one atomic value.
- Split compound cells into separate columns (Trip_ID, Date, Time separated)
- Created separate rows for each passenger (no lists in cells)
- Eliminated repeating stop columns (Stop_1, Stop_2, Stop_3)
- Each cell now contains exactly ONE value
Ride_Passengers Table (1NF)
Ride_ID | Date | Time | Ride_Type | Driver_Name | Driver_Phone | Driver_Rating | Vehicle_Model | License_Plate | Passenger_Name | Passenger_Phone | Base_Fare | Tip |
---|---|---|---|---|---|---|---|---|---|---|---|---|
R001 | 2024-01-15 | 14:30 | Pool | Maria Garcia | 555-0123 | 4.8 | Toyota Camry 2022 | ABC-123 | John Smith | 555-9876 | 25.00 | 4.00 |
R001 | 2024-01-15 | 14:30 | Pool | Maria Garcia | 555-0123 | 4.8 | Toyota Camry 2022 | ABC-123 | Sarah Johnson | 555-4321 | 25.00 | 4.00 |
R002 | 2024-01-15 | 16:45 | Standard | Alex Chen | 555-0456 | 4.9 | Honda Accord 2023 | XYZ-789 | Mike Wilson | 555-1111 | 12.00 | 2.00 |
Ride_Stops Table (1NF) – Separated Repeating Groups
Ride_ID | Stop_Sequence | Location | Duration_Minutes |
---|---|---|---|
R001 | 1 | 123 Main St | 10 |
R001 | 2 | Downtown Plaza | 15 |
R001 | 3 | Airport T2 | 0 |
R002 | 1 | City Hall | 5 |
R002 | 2 | Shopping Mall | 0 |
2️⃣ Second Normal Form (2NF)
Remove partial dependencies – ensure all non-key attributes depend on the ENTIRE primary key, not just part of it.
- In our 1NF table, the primary key was composite: (Ride_ID, Passenger_Name)
- But Driver_Name only depends on Ride_ID, not on Passenger_Name
- This is a partial dependency – we need to fix it!
🚗 Rides Table
Ride_ID (PK) | Date | Time | Ride_Type | Driver_Name | Base_Fare | Total_Tip |
---|---|---|---|---|---|---|
R001 | 2024-01-15 | 14:30 | Pool | Maria Garcia | 25.00 | 4.00 |
R002 | 2024-01-15 | 16:45 | Standard | Alex Chen | 12.00 | 2.00 |
R003 | 2024-01-16 | 08:00 | Premium | Maria Garcia | 45.00 | 8.00 |
👥 Ride_Passengers Table
Ride_ID (FK) | Passenger_Name | Passenger_Phone |
---|---|---|
R001 | John Smith | 555-9876 |
R001 | Sarah Johnson | 555-4321 |
R002 | Mike Wilson | 555-1111 |
- Eliminated partial dependencies
- Ride information stored once per ride (not per passenger)
- Reduced redundancy significantly
3️⃣ Third Normal Form (3NF)
Remove transitive dependencies – non-key attributes should depend ONLY on the primary key, not on other non-key attributes.
- Driver_Phone depends on Driver_Name (not Ride_ID)
- Driver_Rating depends on Driver_Name (not Ride_ID)
- Vehicle details depend on License_Plate (not Driver)
- These create update anomalies – updating a driver’s phone requires updating multiple rides!
🚗 Rides Table (Final)
Ride_ID | Driver_ID | Date | Time | Ride_Type | Base_Fare | Total_Tip |
---|---|---|---|---|---|---|
R001 | D001 | 2024-01-15 | 14:30 | Pool | 25.00 | 4.00 |
R002 | D002 | 2024-01-15 | 16:45 | Standard | 12.00 | 2.00 |
👤 Drivers Table
Driver_ID | Name | Phone | Rating | Years_Experience | Vehicle_ID |
---|---|---|---|---|---|
D001 | Maria Garcia | 555-0123 | 4.8 | 2 | V001 |
D002 | Alex Chen | 555-0456 | 4.9 | 3 | V002 |
🚙 Vehicles Table
Vehicle_ID | Make | Model | Year | License_Plate |
---|---|---|---|---|
V001 | Toyota | Camry | 2022 | ABC-123 |
V002 | Honda | Accord | 2023 | XYZ-789 |
👥 Passengers Table
Passenger_ID | Name | Phone |
---|---|---|
P001 | John Smith | 555-9876 |
P002 | Sarah Johnson | 555-4321 |
- ✅ No redundancy – each fact stored exactly once
- ✅ No update anomalies – change driver phone in one place
- ✅ No insert anomalies – can add drivers without rides
- ✅ No delete anomalies – deleting a ride doesn’t lose driver info
- ✅ Referential integrity via foreign keys
⚡ Strategic Denormalization for Analytics
While normalization is perfect for transactional systems, analytics often benefits from controlled denormalization.
To answer “What was the average fare by driver rating tier last month?”, a normalized database requires:
- 4-5 table JOINs
- Complex aggregations
- Slow query performance on large datasets
📊 Denormalized Fact Table for Analytics
Date_Key | Ride_ID | Driver_Name | Driver_Rating_Tier | Vehicle_Category | Hour_of_Day | Day_of_Week | Is_Weekend | Ride_Type | Passenger_Count | Total_Stops | Distance_Miles | Duration_Minutes | Base_Fare | Tip_Amount | Total_Revenue |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
20240115 | R001 | Maria Garcia | Premium | Sedan | 14 | Monday | False | Pool | 2 | 3 | 15.3 | 40 | 25.00 | 4.00 | 29.00 |
20240115 | R002 | Alex Chen | Elite | Sedan | 16 | Monday | False | Standard | 1 | 2 | 5.2 | 15 | 12.00 | 2.00 | 14.00 |
- Pre-calculated fields: Driver_Rating_Tier, Is_Weekend ready for grouping
- No JOINs needed: All dimensions in one table
- Fast aggregations: Optimized for SUM, AVG, COUNT operations
- Business-friendly: Column names match business terminology
- Time-series ready: Date_Key enables partitioning
Normalized Query (OLTP)
SELECT CASE WHEN d.rating >= 4.8 THEN 'Elite' WHEN d.rating >= 4.5 THEN 'Premium' ELSE 'Standard' END as rating_tier, AVG(r.base_fare + r.total_tip) as avg_revenue FROM rides r JOIN drivers d ON r.driver_id = d.driver_id JOIN vehicles v ON d.vehicle_id = v.vehicle_id JOIN ( SELECT ride_id, COUNT(*) as passenger_count FROM ride_passengers GROUP BY ride_id ) p ON r.ride_id = p.ride_id WHERE r.date >= '2024-01-01' GROUP BY rating_tier; Execution Time: 847ms
Denormalized Query (OLAP)
SELECT driver_rating_tier, AVG(total_revenue) as avg_revenue FROM ride_analytics_fact WHERE date_key >= 20240101 GROUP BY driver_rating_tier; Execution Time: 23ms
This is why data warehouses and analytics platforms often use denormalized star schemas or data marts.
- ✅ Blazing fast analytics queries
- ❌ More storage space required
- ❌ ETL processes needed to maintain consistency
- ❌ Not suitable for real-time transactional updates