After loading data into the target
After extracting and before loading into the target
Simultaneously with extraction
It's always the last step in ETL
Extract, Load, Transform
Extract, Loop, Transfer
Evaluate, Load, Transfer
Enrich, Load, Transform
ETL is always faster than ELT
ELT never involves transformations
In ELT, the transformation happens after loading into the target
ELT is used only for on-prem systems
There's a modern cloud DWH supporting push-down transformations
No transformations are required
Data is never large or complex
A legacy on-prem data warehouse requires cleaned data upfront
The warehouse is scalable and can handle transformations internally
It avoids the need to load data
It never involves coding
It's always cheaper
Extract, Transfer, Loop
Evaluate, Transform, Load
Extract, Transform, Load
Enrich, Test, Leverage
You can skip extracting data
Flexibility to transform data later as requirements change
It guarantees faster processing always
Less storage required
Whether the source is a CSV file or not
The color of the server rack
The number of IT staff
The ability of the target system to efficiently transform data
The target systems lacked compute resources for on-the-fly transformations
Cloud warehouses didn't exist
It was a legal requirement
ETL stands for easier technical logic
It never transforms data
It's cheaper than ELT
Transformation bottlenecks can occur in a separate ETL engine
It's always real-time only
Skip loading data
Avoid extracting data altogether
Perform no transformations
Perform transformations post-load
Generate code for developers
Deliver trustworthy data for analysis
Replace database admins
Only handle metadata
Provide fresh data at regular intervals
Avoid loading any data
Ensure data never changes
Randomize results
It has no code
It never stores raw data
The raw data is already in the target, so transformations can be adjusted later
It's done offline
They forbid any transformations
Storage and compute are decoupled, allowing flexible transforms after load
They run on-prem only
It reduces data volume
The target system must handle potentially huge volumes of raw data
Less data flexibility
Always slower than ETL
Transformations happen before loading
Random choice
The operating system flavor
The capabilities of the target platform and data integration needs
The programmer's mood
On-prem databases only
No transformations needed
Very small datasets only
Cloud-based analytics environments that scale compute as needed
They have infinite flexibility
Transformations are baked in before load, so changes require re-engineering
ETL is always real-time
They store no data
The data platform capabilities and business needs
Random picking from a hat
Government mandate
The file extension of source data
Only copying files manually
Transforming data fully
Retrieving data from source systems
Always pulling from APIs only
Always extracting full data sets
Only extracting new or changed data since the last run
Extracting in random order
Extracting once a year
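A minimal sketch of the incremental approach named above, assuming a source table with an `updated_at` column (a hypothetical `orders` table) and a local state file that holds the last successful run's watermark:

```python
# Incremental extraction sketch: pull only rows changed since the last run.
import json
import sqlite3
from datetime import datetime, timezone

STATE_FILE = "last_run.json"  # stores the watermark between runs

def read_last_run(default="1970-01-01T00:00:00"):
    try:
        with open(STATE_FILE) as f:
            return json.load(f)["last_run"]
    except FileNotFoundError:
        return default

def extract_incremental(conn: sqlite3.Connection):
    last_run = read_last_run()
    rows = conn.execute(
        "SELECT id, amount, updated_at FROM orders WHERE updated_at > ?",
        (last_run,),
    ).fetchall()
    # Advance the watermark only after the extract succeeds.
    with open(STATE_FILE, "w") as f:
        json.dump({"last_run": datetime.now(timezone.utc).isoformat()}, f)
    return rows
```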
Ignoring them
Always failing extraction
Using only binary dumps
Detecting new or dropped columns and adjusting the extraction logic accordingly
SQL queries to select data
Only binary logs always
Manual user input
Sound signals
Only CSV output
No authentication
Pagination and rate limits
No error handling
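A sketch of paginated API extraction that backs off on HTTP 429 rate-limit responses; the endpoint, page parameters, and headers describe a hypothetical API, not any particular one:

```python
import time
import requests

def extract_api(base_url: str, token: str, page_size: int = 100):
    records, page = [], 1
    headers = {"Authorization": f"Bearer {token}"}
    while True:
        resp = requests.get(
            base_url,
            headers=headers,
            params={"page": page, "per_page": page_size},
            timeout=30,
        )
        if resp.status_code == 429:  # rate limited: wait, then retry the same page
            time.sleep(int(resp.headers.get("Retry-After", 5)))
            continue
        resp.raise_for_status()
        batch = resp.json()
        if not batch:                # empty page means we've reached the end
            break
        records.extend(batch)
        page += 1
    return records
```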
Always streaming them directly to target
Reading files, possibly decompressing, and parsing their structure
Only working if files are in XML format
Ignoring headers
Storing and rotating credentials securely
Skipping credentials
Hardcoding passwords in code
Using plaintext in logs
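A sketch of keeping credentials out of code and logs by reading them from environment variables (a secrets manager works the same way); the variable names and connection string are illustrative:

```python
import os

DB_USER = os.environ["ETL_DB_USER"]          # fails fast if not configured
DB_PASSWORD = os.environ["ETL_DB_PASSWORD"]  # never print or log this value
DB_HOST = os.getenv("ETL_DB_HOST", "localhost")

def connection_string() -> str:
    # Log only non-sensitive parts (host, database), never the full string.
    return f"postgresql://{DB_USER}:{DB_PASSWORD}@{DB_HOST}/analytics"
```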
Giving up
Extracting full dumps always
Not scheduling extracts
Using incremental extracts, caching, or asynchronous requests
Never verifying row counts
Comparing expected row counts, checksums, or timestamps
Storing data in random order
Ignoring logs
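A sketch of verifying an extract against the source with a row count and a simple checksum; the `orders` table is illustrative:

```python
import hashlib

def checksum(rows) -> str:
    h = hashlib.sha256()
    for row in rows:
        h.update(repr(row).encode("utf-8"))
    return h.hexdigest()

def verify_extract(source_conn, extracted_rows):
    src_count = source_conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
    if src_count != len(extracted_rows):
        raise ValueError(
            f"Row count mismatch: source={src_count}, extracted={len(extracted_rows)}"
        )
    # Store the checksum with the batch so later loads can be audited against it.
    return checksum(extracted_rows)
```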
Extracting multiple times concurrently
Changing source schema mid-extract
Using transaction isolation or read replicas
Ignoring concurrency
Recording schema, timestamps, and source versions
Changing data
Always storing metadata in CSV
Not recommended
Slowing down extraction
Removing errors
Allowing troubleshooting if something goes wrong
Automatically validating data
Running extracts manually each time
Using a scheduler or orchestration tool (e.g., Airflow)
Random intervals
DNS configuration
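A minimal sketch of scheduling an extract with an orchestrator such as Airflow; the DAG id, schedule, and callable are placeholders, and the `schedule` argument is called `schedule_interval` in older Airflow releases:

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def run_extract():
    print("extracting...")  # call the real extraction logic here

with DAG(
    dag_id="daily_extract",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",   # run once per day
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract", python_callable=run_extract)
```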
Not extracting at all
Changing target system
Only using full extracts
Implementing retries, backoff strategies, and caching intermediate results
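A sketch of retrying a flaky source call with exponential backoff; the exception types, attempt count, and delays are illustrative choices:

```python
import time

def with_retries(func, max_attempts: int = 5, base_delay: float = 1.0):
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except (ConnectionError, TimeoutError) as exc:
            if attempt == max_attempts:
                raise                                     # give up after the final attempt
            delay = base_delay * 2 ** (attempt - 1)       # 1s, 2s, 4s, 8s, ...
            print(f"Attempt {attempt} failed ({exc}); retrying in {delay:.0f}s")
            time.sleep(delay)
```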
Always converting everything to CSV first
No parsing logic
Appropriate parsing libraries or logic for each format
Ignoring file formats
Simplify extraction without coding connectors from scratch
Introduce more complexity
Only handle JSON
Replace target systems
Downloading data twice
Never compressing data
Compressing and possibly filtering data at source before transfer
Using plain text only
Data is always perfect
Issues are caught early before transformation
Slows the pipeline deliberately
Replaces transformation
Sorting data automatically
Removing duplicates always
Avoiding schema changes
Knowing when data was pulled, which is useful for incremental extracts
Implementing retries, backoff, and proper error handling
Ignoring all errors
Doing all transforms during extraction
Using no tools
Just file renaming
No changes to data
Cleansing, standardizing, and applying business rules
Always sorting by primary key
Extraction
Loading
Archiving
Transformation
Aggregating sales by region, calculating derived metrics
Only copying data unchanged
Sorting source files by name
Not applicable to transformations
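A sketch of the aggregation example above, summarising sales by region and deriving an average order value with pandas; the data and column names are illustrative:

```python
import pandas as pd

sales = pd.DataFrame({
    "region":  ["EU", "EU", "US", "US"],
    "revenue": [100.0, 250.0, 300.0, 150.0],
    "orders":  [2, 5, 3, 1],
})

by_region = sales.groupby("region", as_index=False).agg(
    total_revenue=("revenue", "sum"),
    total_orders=("orders", "sum"),
)
# Derived metric computed from the aggregates.
by_region["avg_order_value"] = by_region["total_revenue"] / by_region["total_orders"]
```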
External ETL engines only
The database or data warehouse's compute to run SQL queries for transformation
No tables
Python scripts always
Always discarding old data
Never tracking historical changes
Implementing logic to handle updates to dimension attributes over time
Only adding new columns
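An in-memory sketch of a Type 2 slowly changing dimension update, assuming a dimension keyed by `customer_id` with `valid_from`, `valid_to`, and `is_current` columns; a warehouse implementation would express the same steps as UPDATE/INSERT (or MERGE) statements:

```python
from datetime import date

def apply_scd2(dim_rows, key, new_attrs, today=None):
    today = today or date.today()
    for row in dim_rows:
        if row["customer_id"] == key and row["is_current"]:
            if all(row.get(k) == v for k, v in new_attrs.items()):
                return dim_rows                  # no attribute change: nothing to do
            row["valid_to"] = today              # close out the old version
            row["is_current"] = False
    dim_rows.append({                            # insert the new current version
        "customer_id": key, **new_attrs,
        "valid_from": today, "valid_to": None, "is_current": True,
    })
    return dim_rows
```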
Integrate reference data and enrich facts
Create duplicates
Remove keys
Slow down the pipeline intentionally
No load needed
Ignoring errors
Faster extraction
Issues are caught before loading into the final target
Running on full production data initially
Testing with sample datasets and checking intermediate outputs
Never using logs
Removing parallelism
Distributing large-scale transformations across multiple nodes
Avoiding data backup
Running them multiple times doesn't corrupt or double data
They only run once
Data is always encrypted
Using different code each run
Increase data volume
Hide meaningful insights
Produce summarized metrics for reporting
Replace the need for extraction
Data remains unreadable
Data is correctly interpreted and stored
Only numeric fields are processed
Transformation always fails
Keeping track of changes so you can roll back if needed
Never updating transformations
Always using the first version
Storing it in a random folder
Moving transforms to the extraction phase
Doing transforms in memory outside DB
Executing transformations inside the target database/warehouse
Removing transformations entirely
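A sketch of the push-down idea: the transformation runs as SQL on the target's own compute. SQLite stands in for the warehouse here, and `raw_sales` is assumed to have been loaded earlier:

```python
import sqlite3

conn = sqlite3.connect("warehouse.db")
conn.executescript("""
    DROP TABLE IF EXISTS sales_by_region;
    CREATE TABLE sales_by_region AS
    SELECT region,
           SUM(revenue) AS total_revenue,
           COUNT(*)     AS order_count
    FROM raw_sales
    GROUP BY region;
""")
conn.commit()
```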
You get doubled records
Errors always occur
Data becomes corrupted
The result remains consistent without duplication or loss
Modularizing logic into functions or templates
Writing each transform from scratch every time
Never commenting code
Encoding logic in binary files
Ignoring new columns
Updating transformation logic to accommodate new/removed fields
Only working with fixed schema
Dropping all transformed data
Adding unnecessary joins
Converting all data to strings
Partitioning data and parallelizing transformations
Removing indexing
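A sketch of partitioning records and transforming the partitions in parallel with `multiprocessing`; the currency-conversion transform is a placeholder:

```python
from multiprocessing import Pool

def transform_chunk(chunk):
    return [{**row, "amount_usd": row["amount"] * row["fx_rate"]} for row in chunk]

def partition(records, n_parts):
    return [records[i::n_parts] for i in range(n_parts)]

if __name__ == "__main__":
    records = [{"amount": 10.0, "fx_rate": 1.1}] * 100_000
    with Pool(processes=4) as pool:
        chunks = pool.map(transform_chunk, partition(records, 4))
    transformed = [row for chunk in chunks for row in chunk]
```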
No logs, just guesswork
Always reverting to ETL from ELT
Disabling error messages
Detailed logging, sample test datasets, and stepping through logic
No further changes can ever be made
They run without any testing
They should be documented for lineage and maintainability
Transformed data is never loaded
Only reading data from sources
Moving processed (or raw in ELT) data into the target system
Just renaming files
Ignoring the target system entirely
Inserting large volumes of data in fewer operations
Removing all constraints
Using single-row inserts
Always slowing down
Only append new data, never update
Full refresh every time
Delete all data before load
Insert new records and update existing ones based on keys
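A sketch of the insert-or-update (upsert) pattern using SQLite's `ON CONFLICT` clause (SQLite 3.24+); warehouses usually express the same logic with a `MERGE` statement. Table and column names are illustrative:

```python
import sqlite3

conn = sqlite3.connect("warehouse.db")
conn.execute("""CREATE TABLE IF NOT EXISTS customers (
    customer_id INTEGER PRIMARY KEY, name TEXT, city TEXT)""")

incoming = [(1, "Ada", "London"), (2, "Grace", "New York")]
conn.executemany(
    """INSERT INTO customers (customer_id, name, city) VALUES (?, ?, ?)
       ON CONFLICT(customer_id) DO UPDATE SET
           name = excluded.name,
           city = excluded.city""",
    incoming,
)
conn.commit()
```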
Always creating new indexes mid-load
No need to consider indexes
Dropping indexes before large bulk loads and recreating them afterward for faster performance
Storing indexes in CSV
Reduce contention and improve load times
Make loads fail
Not affect anything
Always slow queries
Data is always in partial states
Atomicity, so the load either fully commits or rolls back on failure
Infinite loops
Guessing
No verification at all
Checking row counts, checksums, comparing against expected values
Only checking the first row
Processing different data subsets in parallel
Only loading one partition
Forcing sequential writes
Removing indexes always
Slow down loads
Not be beneficial
Introduce errors
Exploit native optimizations for faster loading
Ignoring failed loads
Implementing restart logic or partial reload from checkpoints
Always dropping target tables
Never loading again
Always doing full loads
Incremental is never allowed
Deciding whether to rebuild the entire dataset or just apply changes
Only applicable to CSV files
Loaded data matches quality expectations and no corruption occurred
The source changed format
No transformations happened
Data is hidden
Slowing down loads
Replacing monitoring
Deleting logs
Alerting stakeholders to take action if needed
Only one thread loads data
Distributing workload to prevent bottlenecks and improve efficiency
Always loading to one table
Using less hardware
Ignoring it
Always stopping the pipeline
Loading it into a "late" or "delta" partition and merging later
Converting it to JSON
Either the entire load completes successfully or no changes are made
Partial updates are always visible
Load always breaks
Redundant data always
Increase storage usage
Slow down queries
Avoid transformations
Manage storage efficiently and maintain relevant data
No transactions
Using database transactions or snapshot isolation
Always loading twice
Manual fixing after load
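A sketch of an atomic load wrapped in a database transaction, so it either fully commits or rolls back; `fact_sales` is illustrative and sqlite3's connection context manager stands in for whatever client library the target uses:

```python
import sqlite3

def load_batch(conn: sqlite3.Connection, rows):
    try:
        with conn:   # commits on success, rolls back automatically on error
            conn.executemany(
                "INSERT INTO fact_sales (order_id, region, revenue) VALUES (?, ?, ?)",
                rows,
            )
    except sqlite3.DatabaseError:
        # Nothing was committed, so the batch can be retried safely.
        raise
```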
No transformations ever
Destroying metadata
Flexible, late transformations within the target system
Slower queries
Data is ready for downstream analytics and reporting
Data is never used
Must always re-extract
Pipeline automatically deletes data
Work only in ELT mode
Provide GUI-based interfaces for building ETL pipelines
Have no connectors
Only run on mainframes
Oracle 7
Excel macros
Snowflake, BigQuery, Redshift
Telnet sessions
Managing dependencies and scheduling ETL/ELT tasks
Only storing passwords
Not related to ETL
Creating BI dashboards
Require on-prem installation
Handle only transformations
Are all open-source
Provide managed extraction and loading connectors
Forcing a GUI approach
No flexibility
Custom logic and integration not provided by off-the-shelf tools
Avoiding code versioning
Slower deployments
Portability and consistent runtime environments
Windows-only execution
No improvements
No tests are run
ETL is manual
Only deployment to production
Automatic testing, building, and deploying ETL scripts
Only syntax
Cloud always cheaper
Deployment model, scalability, and cost structure
No difference
Allowing access to data from multiple sources without physical movement
Forcing data copies
Eliminating ETL entirely
Converting all data to XML
Only looking at cost
Ignoring support and community
Only focusing on user interface
Considering features, support, scalability, and TCO
Create identical performance always
Replace need for testing
Identify which tool handles your scale and data complexity best
No real benefit
Only small datasets
Large-scale, distributed data processing
Avoiding parallelism
Real-time alerts only
No code environments
Only security teams
GUI-only approaches
DevOps principles for continuous integration and delivery of data pipelines
Providing lineage and metadata to understand data origins
Increasing manual work
Preventing transformations
Only helping with network configuration
Eliminating storage
Making data static
Enabling real-time streaming data into ETL pipelines
Only for JSON files
One giant monolithic ETL
Breaking ETL steps into smaller services that communicate via APIs or queues
No transformations at all
Only batch processes
Irrelevant to ETL
Always synchronous
Allows precise timing of ETL/ELT tasks and handling dependencies
Removes logging
Tracking changes and enabling rollbacks
Hiding code
Always causing conflicts
All tools perform the same
Only cost matters
Pick the tool with the prettiest UI
Ensuring the chosen solution scales and meets functional requirements
Tools must be C++ only
Everyone must learn new languages always
Choosing a tool that matches the team's expertise increases efficiency
Skill sets don't matter
Only the extraction code
The final BI dashboard
Each step (extraction, transformation, loading) for slow operations
Splitting data into chunks processed concurrently
Running everything in a single thread
Avoiding partitioning
Only processing after hours
Making the code complex without performance gains
Slowing down transformations
Allowing different segments of data to be processed simultaneously
Increasing memory use arbitrarily
Always doubling file size
Enabling columnar reads and better compression
Requiring more I/O
Removing column types
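A sketch of writing and reading a columnar Parquet file with pandas (requires the optional pyarrow or fastparquet dependency); reading only the needed columns is what delivers the I/O savings:

```python
import pandas as pd

df = pd.DataFrame({"region": ["EU", "US"], "revenue": [350.0, 450.0]})
df.to_parquet("sales.parquet", compression="snappy")

# Columnar layout lets the reader skip every column it wasn't asked for.
revenue_only = pd.read_parquet("sales.parquet", columns=["revenue"])
```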
CPU load
Memory usage
Network bandwidth usage and possibly I/O times
Data integrity
Streaming data instead of loading entire datasets into memory
Not processing data at all
Using only one core
Ignoring memory usage
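A sketch of processing a large CSV in fixed-size chunks instead of loading the whole file into memory; the file path and `revenue` column are illustrative:

```python
import csv

def total_revenue(path: str, chunk_size: int = 10_000) -> float:
    total, chunk = 0.0, []
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            chunk.append(row)
            if len(chunk) >= chunk_size:
                total += sum(float(r["revenue"]) for r in chunk)
                chunk.clear()                # release memory before the next chunk
        if chunk:                            # remaining partial chunk
            total += sum(float(r["revenue"]) for r in chunk)
    return total
```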
Data never repeats
The same transformations are applied multiple times to the same subset of data
Cache is always slower
You want to increase I/O
Removing indexes
Using only SELECT *
Avoiding statistics
Creating appropriate indexes and using statistics for better query plans
Performing transformations closer to where data resides
Copying data multiple times
Always ETLing locally
Increasing network latency
Reducing CPU usage at peak times only
Forcing users to wait longer
Ensuring less contention with analytical queries
Disabling increments
Hide performance issues
Eliminate all code
Enforce static loads
Identify where optimization is needed
Always increasing processing time
Only moving changed data instead of the entire dataset each time
Deleting data first
Always sorting
It reduces I/O by scanning only needed columns
It's always row-based
Slows down queries
Removes transformations
Introduce random delays
Remove logs
Foresee when scaling resources or optimization is needed
Eliminate the need for testing
Failing immediately
Only working for transformations
Handling transient network or resource issues without manual intervention
Adding more steps
Streamlining the pipeline and reducing overhead
Slowing down performance
Storing duplicates
Removing logs
Making code unreadable
Identifying slow functions or steps to optimize
No impact
Performing transformations where data resides, minimizing I/O
Always copying data multiple times
Adding more network hops
Ignoring transformations
Batch-only scenarios
Storing data offline
Slowing data updates
Data that needs near real-time availability
No tests needed
Running tests once a year
Quickly detecting if recent changes negatively impact ETL speed
No effect on performance
Data always fails
Ignoring source issues
Problems are caught early, preventing corrupt downstream data
Only helps after loading
Ignoring nulls
Removing duplicates, correcting invalid formats, and filling missing values
Only sorting data alphabetically
Not related to quality
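A sketch of those cleansing steps with pandas, dropping duplicates, normalising formats, and filling missing values; the column names are illustrative:

```python
import pandas as pd

def cleanse(df: pd.DataFrame) -> pd.DataFrame:
    df = df.drop_duplicates(subset=["customer_id"])
    df["email"] = df["email"].str.strip().str.lower()        # standardise format
    df["country"] = df["country"].fillna("UNKNOWN")           # fill missing values
    df["signup_date"] = pd.to_datetime(df["signup_date"], errors="coerce")
    return df
```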
Increase ambiguity
Create more errors
Hide metadata
Ensure consistent terms and codes across the dataset
Tracking the origin and transformations applied to data
Ignoring source origins
Only versioning code
Deleting metadata
No retrospective analysis
Only current snapshot views
Comparing past states and understanding data evolution
Ignoring business rules
Confusing analysts
Quantifying the level of trust in the data
Replacing ETL processes
Only working with numeric fields
Discarding all data
Stopping the pipeline entirely
Isolating problematic rows for later inspection without halting the entire process
Eliminating transformations
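A sketch of that quarantine (dead-letter) pattern: rows that fail validation are set aside with their error for later inspection while valid rows continue; the `amount` rule is illustrative:

```python
def split_valid_invalid(rows):
    valid, quarantined = [], []
    for row in rows:
        try:
            row["amount"] = float(row["amount"])
            if row["amount"] < 0:
                raise ValueError("negative amount")
            valid.append(row)
        except (KeyError, ValueError) as exc:
            quarantined.append({"row": row, "error": str(exc)})  # keep for review
    return valid, quarantined
```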
Everyone sees everything
No security
Data is never transformed
Only authorized personnel can change pipelines or view sensitive data
Removing schema info
Allowing understanding of schema evolution, lineage, and data dictionary
Slowing ETL
Only storing logs
Mismatches that cause errors or incorrect mappings
Any data loading
Incremental extracts
Using metadata
Only the ETL developer can run checks
Requires special hardware
Data always perfect
Analysts can define and run their own validation rules
No expectations on data delivery
Data always arrives late
Clear targets and accountability for ETL performance
Removing all checks
Identifying trends in data issues over time
Ignoring improvements
Reducing metadata
No historical analysis
No standards
Adhering to best practices for data management and quality
Always encrypting data
Abandoning lineage
Data is always late
No compliance
Data is lost
Compliance with internal standards and external regulations
Stagnation
Ignoring feedback
Regularly reviewing metrics, addressing issues, and refining processes
Deleting error logs
Having random missing fields
Only using half the data
No relevance to ETL
All expected data elements are present
Data values contradict each other
Data does not conflict logically (e.g., end date after start date)
More duplicates
Slower loads
A trusted reference dataset or source of truth
Random guesses
No baseline
Logs only
Less visibility
Removing lineage
Ensuring policies, lineage, and quality standards are consistently applied
Data is always in plaintext
Data is slower
Sensitive information isn't exposed to eavesdroppers
Faster downloads only
Interception of sensitive data by unauthorized parties
Writing logs
Any transformations
PII is displayed openly
Sensitive data is protected and less exposed to unauthorized views
Only numeric data allowed
Protect sensitive columns while allowing partial data usage
Increase plaintext exposure
Remove all keys
Disable extraction
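A sketch of hashing and masking a sensitive column so downstream users can still join or count on it without seeing raw values; in practice the salt would come from a secrets manager, and the field names are illustrative:

```python
import hashlib

SALT = b"load-this-from-a-secrets-manager"

def pseudonymise(value: str) -> str:
    return hashlib.sha256(SALT + value.encode("utf-8")).hexdigest()

def mask_email(email: str) -> str:
    user, _, domain = email.partition("@")
    return f"{user[:1]}***@{domain}"

record = {"email": "jane.doe@example.com"}
record["email_hash"] = pseudonymise(record["email"])  # joinable surrogate
record["email"] = mask_email(record["email"])         # "j***@example.com"
```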
Everyone can edit pipelines
No logs needed
Data always public
Only authorized users can modify or run pipelines, enhancing security
Storing all PII unencrypted
Ignoring user requests
Implementing deletion or anonymization upon request
Adding more duplicates
No record of access
Less transparency
Compliance is ignored
Accountability and traceability for security and compliance
Faster access
Exposure of passwords in code or logs
Any encryption
Data from loading
Transforming data in-place in the secure environment
Decrypting data everywhere
Always using local disks
No updates to policies
Data remains unprotected
Ongoing adherence to security standards and regulations
Inconsistent governance
Data goes over public internet unprotected
Slower transfers only
Data moves within a secure, isolated environment
Replacing encryption
Strict controls, encryption, and auditing access to patient data
Ignoring patient privacy
No logging
Publishing data publicly
One person does everything
Different roles have limited, distinct permissions (e.g., dev vs. ops)
Only one admin for all
Less control
Reducing exposure if credentials are compromised
Always harder to manage
No benefit
Ignoring best practices
Giving all accounts full admin rights
No authentication needed
Accounts only get the minimum permissions needed to do their job
Disabling credentials
Storing passwords in logs
Printing PII openly
No filtering needed
Removing or masking sensitive information from logs
Keeping all data forever
Deleting data randomly
Removing old or unneeded data according to defined schedules
Ignoring compliance
Proper testing
Exposure of real sensitive data to dev/test environments
Any compliance
Realistic scenarios
Removing encryption
Verifying ETL processes follow all security, privacy, and data handling regulations
Increasing unauthorized access
If storage is compromised, attackers get ciphertext instead of plaintext
Keys are in plaintext near data
Ignoring all keys
Data is never accessed
Only running once a day
Continuously processing incoming data as it arrives
Ignoring source changes
Always slower
Only full dumps
No changes
Inserts, updates, and deletes in source data to apply incrementally
Random errors
Only batch loads
Capturing database changes in near real-time
Only file extractions
Manual transformations
Always batch processing
No real-time updates
Only single-thread reads
Data is ingested as events, allowing continuous transformations
Both are identical
Micro-batching never buffers data
Micro-batching processes small batches at intervals, while streaming processes events immediately
Streaming is offline
Sorting all data by hand
Dropping late data
Ignoring timestamps
Event-time processing logic, possibly using watermarks
Data is processed twice
Handling duplicates and idempotency so data isn’t double-counted
Stopping after one record
Storing no context
Only batch mode
Remembering previous events’ data to compute aggregates or handle joins
Discarding all history
Grouping events into manageable intervals for aggregation
Ignoring event timestamps
Only running monthly
Storing data in CSV
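A sketch of a tumbling-window aggregation that groups streaming events into fixed 60-second buckets keyed by event time; the event tuples are illustrative:

```python
from collections import defaultdict

def tumbling_window_sums(events, window_seconds: int = 60):
    """events: iterable of (event_time_in_epoch_seconds, value) tuples."""
    windows = defaultdict(float)
    for event_time, value in events:
        window_start = int(event_time // window_seconds) * window_seconds
        windows[window_start] += value
    return dict(windows)

# tumbling_window_sums([(0, 1.0), (30, 2.0), (65, 5.0)]) -> {0: 3.0, 60: 5.0}
```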
Overwhelming downstream systems
The pipeline adapts to variations in data flow rate without crashing
Data always lost
Ignoring slow consumers
Always raw only
Strict ELT
Lower latency now vs. the flexibility to reprocess raw data later
No schema requirements
Only traditional RDBMS is allowed
Query patterns, latency needs, and volume of data
Ignoring format
Detecting if the system can keep up with incoming data
Ignoring delays
Only checking batch jobs
Slowing down on purpose
No checkpoints
Always restarting from scratch
Checkpointing and replaying events from a certain offset
Ignoring state
No schema changes allowed
Handling new fields or removed fields dynamically without stopping the stream
Batch reload
Only fixed schemas forever
Defining when a window of events is considered complete despite late arrivals
Only working with batch data
Sorting all events by arrival time only
Removing timestamps
No real-time capability
Only batch steps
Leveraging built-in abstractions for stateful and windowed transformations
Disabling transformations
Immediate response to anomalies
No remediation
Only batch corrections
Always waiting for all late events
Ignoring event times
Deciding how long to wait for late data before producing results
Only using batch mode
Only manual updates
Automated testing and rolling out changes without stopping the stream
Offline processing only
Fewer updates
No testing needed
Always testing full pipeline only
Validating small pieces of code in isolation to catch errors early
Only running after production
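A sketch of a unit test for a single transformation function, runnable with pytest; the function itself is a hypothetical example of the kind of logic worth isolating:

```python
def normalise_country(code: str) -> str:
    return code.strip().upper()

def test_normalise_country():
    assert normalise_country(" de ") == "DE"
    assert normalise_country("us") == "US"
```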
Each component works alone
All steps (extract, transform, load) work together as intended
Slower code
No data is moved
Testing logic without depending on actual external systems
Always hitting production DB
Only GUI tests
Ignoring dependencies
No changes allowed
Data is always static
Recent code changes haven’t broken previously working functionality
Ignoring old tests
Only testing one record
Removing performance metrics
Breaking the pipeline deliberately
Measuring how the system performs under high data volumes
Only known scenarios
Production data leaks
Edge cases and scenarios without risking real sensitive data
Slower transformations only
Manual deployment always
Automated testing before changes go live, preventing regressions
No tests ever run
Only testing after production issues
The pipeline recovers from transient failures without manual intervention
Always fail permanently
Only manual restarts
No visibility
Only command-line logs
A visual representation of job runs, statuses, and dependencies
Always slow loading
Ignoring issues
Delayed responses
No escalation
Stakeholders are promptly notified to fix or investigate
Making logs unreadable
Easier parsing, searching, and analyzing issues
Slowing queries
No rollbacks possible
Lost history of changes
Changes are tracked, allowing revert to previous stable versions
Only manual edits
Accountability and traceability of operations
Only complexity
No benefits
Automatic fixing of errors
Pipeline fails under load
Only one scenario tested
The pipeline can handle growing data volumes or more concurrency
No need for infrastructure changes
Ignoring discrepancies
Slower tests
Manual calculations
Quickly verifying data integrity and completeness
Testing new logic on a small subset of data before full rollout
Deploying everywhere at once
No difference from normal deploys
No issues
Resource constraints and scaling needs
Only network errors
Faster ETL by magic
Only real-time snapshots
Removing historical data
Observing how performance or quality metrics change over time
Ignoring patterns
Credentials aren't committed in code and there are no obvious vulnerabilities
More exposure
Only syntax checks
Stagnation of pipelines
Only manual reviews once a year
Using insights from tests/monitoring to refine and optimize ETL/ELT continuously
Removing alerts