After loading data into the target
After extracting and before loading into the target
Simultaneously with extraction
It's always the last step in ETL
Extract, Load, Transform
Extract, Loop, Transfer
Evaluate, Load, Transfer
Enrich, Load, Transform
ETL is always faster than ELT
ELT never involves transformations
In ELT, the transformation happens after loading into the target
ELT is used only for on-prem systems
There's a modern cloud DWH supporting push-down transformations
No transformations are required
Data is never large or complex
A legacy on-prem data warehouse requires cleaned data upfront
The warehouse is scalable and can handle transformations internally
It avoids the need to load data
It never involves coding
It's always cheaper
Extract, Transfer, Loop
Evaluate, Transform, Load
Extract, Transform, Load
Enrich, Test, Leverage
You can skip extracting data
Flexibility to transform data later as requirements change
It guarantees faster processing always
Less storage required
Whether the source is a CSV file or not
The color of the server rack
The number of IT staff
The ability of the target system to efficiently transform data
The target systems lacked compute resources for on-the-fly transformations
Cloud warehouses didn't exist
It was a legal requirement
ETL stands for easier technical logic
It never transforms data
It's cheaper than ELT
Transformation bottlenecks can occur in a separate ETL engine
It's always real-time only
Skip loading data
Avoid extracting data altogether
Perform no transformations
Perform transformations post-load
Generate code for developers
Deliver trustworthy data for analysis
Replace database admins
Only handle metadata
Provide fresh data at regular intervals
Avoid loading any data
Ensure data never changes
Randomize results
It has no code
It never stores raw data
The raw data is already in the target, so transformations can be adjusted later
It's done offline
They forbid any transformations
Storage and compute are decoupled, allowing flexible transforms after load
They run on-prem only
It reduces data volume
The target system must handle potentially huge volumes of raw data
Less data flexibility
Always slower than ETL
Transformations happen before loading
Random choice
The operating system flavor
The capabilities of the target platform and data integration needs
The programmer's mood
On-prem databases only
No transformations needed
Very small datasets only
Cloud-based analytics environments that scale compute as needed
They have infinite flexibility
Transformations are baked in before load, so changes require re-engineering
ETL is always real-time
They store no data
The data platform capabilities and business needs
Random picking from a hat
Government mandate
The file extension of source data
Only copying files manually
Transforming data fully
Retrieving data from source systems
Always pulling from APIs only
Always extracting full data sets
Only extracting new or changed data since the last run
Extracting in random order
Extracting once a year
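A minimal sketch of the incremental approach named above, assuming a source table with an `updated_at` column (a hypothetical `orders` table) and a local state file that holds the last successful run's watermark:

```python
# Incremental extraction sketch: pull only rows changed since the last run.
import json
import sqlite3
from datetime import datetime, timezone

STATE_FILE = "last_run.json"  # stores the watermark between runs

def read_last_run(default="1970-01-01T00:00:00"):
    try:
        with open(STATE_FILE) as f:
            return json.load(f)["last_run"]
    except FileNotFoundError:
        return default

def extract_incremental(conn: sqlite3.Connection):
    last_run = read_last_run()
    rows = conn.execute(
        "SELECT id, amount, updated_at FROM orders WHERE updated_at > ?",
        (last_run,),
    ).fetchall()
    # Advance the watermark only after the extract succeeds.
    with open(STATE_FILE, "w") as f:
        json.dump({"last_run": datetime.now(timezone.utc).isoformat()}, f)
    return rows
```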
Ignoring them
Always failing extraction
Using only binary dumps
Detecting new or dropped columns and adjusting the extraction logic accordingly
SQL queries to select data
Only binary logs always
Manual user input
Sound signals
Only CSV output
No authentication
Pagination and rate limits
No error handling
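A sketch of paginated API extraction that backs off on HTTP 429 rate-limit responses; the endpoint, page parameters, and headers describe a hypothetical API, not any particular one:

```python
import time
import requests

def extract_api(base_url: str, token: str, page_size: int = 100):
    records, page = [], 1
    headers = {"Authorization": f"Bearer {token}"}
    while True:
        resp = requests.get(
            base_url,
            headers=headers,
            params={"page": page, "per_page": page_size},
            timeout=30,
        )
        if resp.status_code == 429:  # rate limited: wait, then retry the same page
            time.sleep(int(resp.headers.get("Retry-After", 5)))
            continue
        resp.raise_for_status()
        batch = resp.json()
        if not batch:                # empty page means we've reached the end
            break
        records.extend(batch)
        page += 1
    return records
```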
Always streaming them directly to target
Reading files, possibly decompressing, and parsing their structure
Only working if files are in XML format
Ignoring headers
Storing and rotating credentials securely
Skipping credentials
Hardcoding passwords in code
Using plaintext in logs
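A sketch of keeping credentials out of code and logs by reading them from environment variables (a secrets manager works the same way); the variable names and connection string are illustrative:

```python
import os

DB_USER = os.environ["ETL_DB_USER"]          # fails fast if not configured
DB_PASSWORD = os.environ["ETL_DB_PASSWORD"]  # never print or log this value
DB_HOST = os.getenv("ETL_DB_HOST", "localhost")

def connection_string() -> str:
    # Log only non-sensitive parts (host, database), never the full string.
    return f"postgresql://{DB_USER}:{DB_PASSWORD}@{DB_HOST}/analytics"
```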
Giving up
Extracting full dumps always
Not scheduling extracts
Using incremental extracts, caching, or asynchronous requests
Never verifying row counts
Comparing expected row counts, checksums, or timestamps
Storing data in random order
Ignoring logs
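A sketch of verifying an extract against the source with a row count and a simple checksum; the `orders` table is illustrative:

```python
import hashlib

def checksum(rows) -> str:
    h = hashlib.sha256()
    for row in rows:
        h.update(repr(row).encode("utf-8"))
    return h.hexdigest()

def verify_extract(source_conn, extracted_rows):
    src_count = source_conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
    if src_count != len(extracted_rows):
        raise ValueError(
            f"Row count mismatch: source={src_count}, extracted={len(extracted_rows)}"
        )
    # Store the checksum with the batch so later loads can be audited against it.
    return checksum(extracted_rows)
```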
Extracting multiple times concurrently
Changing source schema mid-extract
Using transaction isolation or read replicas
Ignoring concurrency
Recording schema, timestamps, and source versions
Changing data
Always storing metadata in CSV
Not recommended
Slowing down extraction
Removing errors
Allowing troubleshooting if something goes wrong
Automatically validating data
Running extracts manually each time
Using a scheduler or orchestration tool (e.g., Airflow)
Random intervals
DNS configuration
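A minimal sketch of scheduling an extract with an orchestrator such as Airflow; the DAG id, schedule, and callable are placeholders, and the `schedule` argument is called `schedule_interval` in older Airflow releases:

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def run_extract():
    print("extracting...")  # call the real extraction logic here

with DAG(
    dag_id="daily_extract",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",   # run once per day
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract", python_callable=run_extract)
```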
Not extracting at all
Changing target system
Only using full extracts
Implementing retries, backoff strategies, and caching intermediate results
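A sketch of retrying a flaky source call with exponential backoff; the exception types, attempt count, and delays are illustrative choices:

```python
import time

def with_retries(func, max_attempts: int = 5, base_delay: float = 1.0):
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except (ConnectionError, TimeoutError) as exc:
            if attempt == max_attempts:
                raise                                     # give up after the final attempt
            delay = base_delay * 2 ** (attempt - 1)       # 1s, 2s, 4s, 8s, ...
            print(f"Attempt {attempt} failed ({exc}); retrying in {delay:.0f}s")
            time.sleep(delay)
```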
Always converting everything to CSV first
No parsing logic
Appropriate parsing libraries or logic for each format
Ignoring file formats
Simplify extraction without coding connectors from scratch
Introduce more complexity
Only handle JSON
Replace target systems
Downloading data twice
Never compressing data
Compressing and possibly filtering data at source before transfer
Using plain text only
Data is always perfect
Issues are caught early before transformation
Slows the pipeline deliberately
Replaces transformation
Sorting data automatically
Removing duplicates always
Avoiding schema changes
Knowing when data was pulled, which is useful for incremental extracts
Implementing retries, backoff, and proper error handling
Ignoring all errors
Doing all transforms during extraction
Using no tools
Just file renaming
No changes to data
Cleansing, standardizing, and applying business rules
Always sorting by primary key
Extraction
Loading
Archiving
Transformation
Aggregating sales by region, calculating derived metrics
Only copying data unchanged
Sorting source files by name
Not applicable to transformations
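A sketch of the aggregation example above, summarising sales by region and deriving an average order value with pandas; the data and column names are illustrative:

```python
import pandas as pd

sales = pd.DataFrame({
    "region":  ["EU", "EU", "US", "US"],
    "revenue": [100.0, 250.0, 300.0, 150.0],
    "orders":  [2, 5, 3, 1],
})

by_region = sales.groupby("region", as_index=False).agg(
    total_revenue=("revenue", "sum"),
    total_orders=("orders", "sum"),
)
# Derived metric computed from the aggregates.
by_region["avg_order_value"] = by_region["total_revenue"] / by_region["total_orders"]
```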
External ETL engines only
The database or data warehouse's compute to run SQL queries for transformation
No tables
Python scripts always
Always discarding old data
Never tracking historical changes
Implementing logic to handle updates to dimension attributes over time
Only adding new columns
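An in-memory sketch of a Type 2 slowly changing dimension update, assuming a dimension keyed by `customer_id` with `valid_from`, `valid_to`, and `is_current` columns; a warehouse implementation would express the same steps as UPDATE/INSERT (or MERGE) statements:

```python
from datetime import date

def apply_scd2(dim_rows, key, new_attrs, today=None):
    today = today or date.today()
    for row in dim_rows:
        if row["customer_id"] == key and row["is_current"]:
            if all(row.get(k) == v for k, v in new_attrs.items()):
                return dim_rows                  # no attribute change: nothing to do
            row["valid_to"] = today              # close out the old version
            row["is_current"] = False
    dim_rows.append({                            # insert the new current version
        "customer_id": key, **new_attrs,
        "valid_from": today, "valid_to": None, "is_current": True,
    })
    return dim_rows
```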
Integrate reference data and enrich facts
Create duplicates
Remove keys
Slow down the pipeline intentionally
No load needed
Ignoring errors
Faster extraction
Issues are caught before loading into the final target
Running on full production data initially
Testing with sample datasets and checking intermediate outputs
Never using logs
Removing parallelism
Distributing large-scale transformations across multiple nodes
Avoiding data backup
Running them multiple times doesn't corrupt or double data
They only run once
Data is always encrypted
Using different code each run
Increase data volume
Hide meaningful insights
Produce summarized metrics for reporting
Replace the need for extraction
Data remains unreadable
Data is correctly interpreted and stored
Only numeric fields are processed
Transformation always fails
Keeping track of changes so you can roll back if needed
Never updating transformations
Always using the first version
Storing it in a random folder
Moving transforms to the extraction phase
Doing transforms in memory outside DB
Executing transformations inside the target database/warehouse
Removing transformations entirely
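A sketch of the push-down idea: the transformation runs as SQL on the target's own compute. SQLite stands in for the warehouse here, and `raw_sales` is assumed to have been loaded earlier:

```python
import sqlite3

conn = sqlite3.connect("warehouse.db")
conn.executescript("""
    DROP TABLE IF EXISTS sales_by_region;
    CREATE TABLE sales_by_region AS
    SELECT region,
           SUM(revenue) AS total_revenue,
           COUNT(*)     AS order_count
    FROM raw_sales
    GROUP BY region;
""")
conn.commit()
```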
You get doubled records
Errors always occur
Data becomes corrupted
The result remains consistent without duplication or loss
Modularizing logic into functions or templates
Writing each transform from scratch every time
Never commenting code
Encoding logic in binary files
Ignoring new columns
Updating transformation logic to accommodate new/removed fields
Only working with fixed schema
Dropping all transformed data
Adding unnecessary joins
Converting all data to strings
Partitioning data and parallelizing transformations
Removing indexing
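A sketch of partitioning records and transforming the partitions in parallel with `multiprocessing`; the currency-conversion transform is a placeholder:

```python
from multiprocessing import Pool

def transform_chunk(chunk):
    return [{**row, "amount_usd": row["amount"] * row["fx_rate"]} for row in chunk]

def partition(records, n_parts):
    return [records[i::n_parts] for i in range(n_parts)]

if __name__ == "__main__":
    records = [{"amount": 10.0, "fx_rate": 1.1}] * 100_000
    with Pool(processes=4) as pool:
        chunks = pool.map(transform_chunk, partition(records, 4))
    transformed = [row for chunk in chunks for row in chunk]
```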
No logs, just guesswork
Always reverting to ETL from ELT
Disabling error messages
Detailed logging, sample test datasets, and stepping through logic
No further changes can ever be made
They run without any testing
They should be documented for lineage and maintainability
Transformed data is never loaded
Only reading data from sources
Moving processed (or raw in ELT) data into the target system
Just renaming files
Ignoring the target system entirely
Inserting large volumes of data in fewer operations
Removing all constraints
Using single-row inserts
Always slowing down
Only append new data, never update
Full refresh every time
Delete all data before load
Insert new records and update existing ones based on keys
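A sketch of the insert-or-update (upsert) pattern using SQLite's `ON CONFLICT` clause (SQLite 3.24+); warehouses usually express the same logic with a `MERGE` statement. Table and column names are illustrative:

```python
import sqlite3

conn = sqlite3.connect("warehouse.db")
conn.execute("""CREATE TABLE IF NOT EXISTS customers (
    customer_id INTEGER PRIMARY KEY, name TEXT, city TEXT)""")

incoming = [(1, "Ada", "London"), (2, "Grace", "New York")]
conn.executemany(
    """INSERT INTO customers (customer_id, name, city) VALUES (?, ?, ?)
       ON CONFLICT(customer_id) DO UPDATE SET
           name = excluded.name,
           city = excluded.city""",
    incoming,
)
conn.commit()
```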
Always creating new indexes mid-load
No need to consider indexes
Dropping indexes before large bulk loads and recreating them afterward for faster performance
Storing indexes in CSV
Reduce contention and improve load times
Make loads fail
Not affect anything
Always slow queries
Data is always in partial states
Atomicity, so the load either fully commits or rolls back on failure
Infinite loops
Guessing
No verification at all
Checking row counts, checksums, comparing against expected values
Only checking the first row
Processing different data subsets in parallel
Only loading one partition
Forcing sequential writes
Removing indexes always
Slow down loads
Not be beneficial
Introduce errors
Exploit native optimizations for faster loading
Ignoring failed loads
Implementing restart logic or partial reload from checkpoints
Always dropping target tables
Never loading again
Always doing full loads
Incremental is never allowed
Deciding whether to rebuild the entire dataset or just apply changes
Only applicable to CSV files
Loaded data matches quality expectations and no corruption occurred
The source changed format
No transformations happened
Data is hidden
Slowing down loads
Replacing monitoring
Deleting logs
Alerting stakeholders to take action if needed
Only one thread loads data
Distributing workload to prevent bottlenecks and improve efficiency
Always loading to one table
Using less hardware
Ignoring it
Always stopping the pipeline
Loading it into a "late" or "delta" partition and merging later
Converting it to JSON
Either the entire load completes successfully or no changes are made
Partial updates are always visible
Load always breaks
Redundant data always
Increase storage usage
Slow down queries
Avoid transformations
Manage storage efficiently and maintain relevant data
No transactions
Using database transactions or snapshot isolation
Always loading twice
Manual fixing after load
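A sketch of an atomic load wrapped in a database transaction, so it either fully commits or rolls back; `fact_sales` is illustrative and sqlite3's connection context manager stands in for whatever client library the target uses:

```python
import sqlite3

def load_batch(conn: sqlite3.Connection, rows):
    try:
        with conn:   # commits on success, rolls back automatically on error
            conn.executemany(
                "INSERT INTO fact_sales (order_id, region, revenue) VALUES (?, ?, ?)",
                rows,
            )
    except sqlite3.DatabaseError:
        # Nothing was committed, so the batch can be retried safely.
        raise
```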
No transformations ever
Destroying metadata
Flexible, late transformations within the target system
Slower queries
Data is ready for downstream analytics and reporting
Data is never used
Must always re-extract
Pipeline automatically deletes data
Work only in ELT mode
Provide GUI-based interfaces for building ETL pipelines
Have no connectors
Only run on mainframes
Oracle 7
Excel macros
Snowflake, BigQuery, Redshift
Telnet sessions
Managing dependencies and scheduling ETL/ELT tasks
Only storing passwords
Not related to ETL
Creating BI dashboards
Require on-prem installation
Handle only transformations
Are all open-source
Provide managed extraction and loading connectors
Forcing a GUI approach
No flexibility
Custom logic and integration not provided by off-the-shelf tools
Avoiding code versioning
Slower deployments
Portability and consistent runtime environments
Windows-only execution
No improvements
No tests are run
ETL is manual
Only deployment to production
Automatic testing, building, and deploying ETL scripts
Only syntax
Cloud always cheaper
Deployment model, scalability, and cost structure
No difference
Allowing access to data from multiple sources without physical movement
Forcing data copies
Eliminating ETL entirely
Converting all data to XML
Only looking at cost
Ignoring support and community
Only focusing on user interface
Considering features, support, scalability, and TCO
Create identical performance always
Replace need for testing
Identify which tool handles your scale and data complexity best
No real benefit
Only small datasets
Large-scale, distributed data processing
Avoiding parallelism
Real-time alerts only
No code environments
Only security teams
GUI-only approaches
DevOps principles for continuous integration and delivery of data pipelines
Providing lineage and metadata to understand data origins
Increasing manual work
Preventing transformations
Only helping with network configuration
Eliminating storage
Making data static
Enabling real-time streaming data into ETL pipelines
Only for JSON files
One giant monolithic ETL
Breaking ETL steps into smaller services that communicate via APIs or queues
No transformations at all
Only batch processes
Irrelevant to ETL
Always synchronous
Allows precise timing of ETL/ELT tasks and handling dependencies
Removes logging
Tracking changes and enabling rollbacks
Hiding code
Always causing conflicts
All tools perform the same
Only cost matters
Pick the tool with the prettiest UI
Ensuring the chosen solution scales and meets functional requirements
Tools must be C++ only
Everyone must learn new languages always
Choosing a tool that matches the team's expertise increases efficiency
Skill sets don't matter
Only the extraction code
The final BI dashboard
Each step (extraction, transformation, loading) for slow operations
Splitting data into chunks processed concurrently
Running everything in a single thread
Avoiding partitioning
Only processing after hours
Making the code complex without performance gains
Slowing down transformations
Allowing different segments of data to be processed simultaneously
Increasing memory use arbitrarily
Always doubling file size
Enabling columnar reads and better compression
Requiring more I/O
Removing column types
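A sketch of writing and reading a columnar Parquet file with pandas (requires the optional pyarrow or fastparquet dependency); reading only the needed columns is what delivers the I/O savings:

```python
import pandas as pd

df = pd.DataFrame({"region": ["EU", "US"], "revenue": [350.0, 450.0]})
df.to_parquet("sales.parquet", compression="snappy")

# Columnar layout lets the reader skip every column it wasn't asked for.
revenue_only = pd.read_parquet("sales.parquet", columns=["revenue"])
```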
CPU load
Memory usage
Network bandwidth usage and possibly I/O times
Data integrity
Streaming data instead of loading entire datasets into memory
Not processing data at all
Using only one core
Ignoring memory usage
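A sketch of processing a large CSV in fixed-size chunks instead of loading the whole file into memory; the file path and `revenue` column are illustrative:

```python
import csv

def total_revenue(path: str, chunk_size: int = 10_000) -> float:
    total, chunk = 0.0, []
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            chunk.append(row)
            if len(chunk) >= chunk_size:
                total += sum(float(r["revenue"]) for r in chunk)
                chunk.clear()                # release memory before the next chunk
        if chunk:                            # remaining partial chunk
            total += sum(float(r["revenue"]) for r in chunk)
    return total
```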
Data never repeats
The same transformations are applied multiple times to the same subset of data
Cache is always slower
You want to increase I/O
Removing indexes
Using only SELECT *
Avoiding statistics
Creating appropriate indexes and using statistics for better query plans
Performing transformations closer to where data resides
Copying data multiple times
Always ETLing locally
Increasing network latency
Reducing CPU usage at peak times only
Forcing users to wait longer
Ensuring less contention with analytical queries
Disabling increments
Hide performance issues
Eliminate all code
Enforce static loads
Identify where optimization is needed
Always increasing processing time
Only moving changed data instead of the entire dataset each time
Deleting data first
Always sorting
It reduces I/O by scanning only needed columns
It's always row-based
Slows down queries
Removes transformations
Introduce random delays
Remove logs
Foresee when scaling resources or optimization is needed
Eliminate the need for testing
Failing immediately
Only working for transformations
Handling transient network or resource issues without manual intervention
Adding more steps
Streamlining the pipeline and reducing overhead
Slowing down performance
Storing duplicates
Removing logs
Making code unreadable
Identifying slow functions or steps to optimize
No impact
Performing transformations where data resides, minimizing I/O
Always copying data multiple times
Adding more network hops
Ignoring transformations
Batch-only scenarios
Storing data offline
Slowing data updates
Data that needs near real-time availability
No tests needed
Running tests once a year
Quickly detecting if recent changes negatively impact ETL speed
No effect on performance
Data always fails
Ignoring source issues
Problems are caught early, preventing corrupt downstream data
Only helps after loading
Ignoring nulls
Removing duplicates, correcting invalid formats, and filling missing values
Only sorting data alphabetically
Not related to quality
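A sketch of those cleansing steps with pandas, dropping duplicates, normalising formats, and filling missing values; the column names are illustrative:

```python
import pandas as pd

def cleanse(df: pd.DataFrame) -> pd.DataFrame:
    df = df.drop_duplicates(subset=["customer_id"])
    df["email"] = df["email"].str.strip().str.lower()        # standardise format
    df["country"] = df["country"].fillna("UNKNOWN")           # fill missing values
    df["signup_date"] = pd.to_datetime(df["signup_date"], errors="coerce")
    return df
```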
Increase ambiguity
Create more errors
Hide metadata
Ensure consistent terms and codes across the dataset
Tracking the origin and transformations applied to data
Ignoring source origins
Only versioning code
Deleting metadata
No retrospective analysis
Only current snapshot views
Comparing past states and understanding data evolution
Ignoring business rules
Confusing analysts
Quantifying the level of trust in the data
Replacing ETL processes
Only working with numeric fields
Discarding all data
Stopping the pipeline entirely
Isolating problematic rows for later inspection without halting the entire process
Eliminating transformations
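A sketch of that quarantine (dead-letter) pattern: rows that fail validation are set aside with their error for later inspection while valid rows continue; the `amount` rule is illustrative:

```python
def split_valid_invalid(rows):
    valid, quarantined = [], []
    for row in rows:
        try:
            row["amount"] = float(row["amount"])
            if row["amount"] < 0:
                raise ValueError("negative amount")
            valid.append(row)
        except (KeyError, ValueError) as exc:
            quarantined.append({"row": row, "error": str(exc)})  # keep for review
    return valid, quarantined
```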
Everyone sees everything
No security
Data is never transformed
Only authorized personnel can change pipelines or view sensitive data
Removing schema info
Allowing understanding of schema evolution, lineage, and data dictionary
Slowing ETL
Only storing logs
Mismatches that cause errors or incorrect mappings
Any data loading
Incremental extracts
Using metadata
Only the ETL developer can run checks
Requires special hardware
Data always perfect
Analysts can define and run their own validation rules
No expectations on data delivery
Data always arrives late
Clear targets and accountability for ETL performance
Removing all checks
Identifying trends in data issues over time
Ignoring improvements
Reducing metadata
No historical analysis
No standards
Adhering to best practices for data management and quality
Always encrypting data
Abandoning lineage
Data is always late
No compliance
Data is lost
Compliance with internal standards and external regulations
Stagnation
Ignoring feedback
Regularly reviewing metrics, addressing issues, and refining processes
Deleting error logs
Having random missing fields
Only using half the data
No relevance to ETL
All expected data elements are present
Data values contradict each other
Data does not conflict logically (e.g., end date after start date)
More duplicates
Slower loads
A trusted reference dataset or source of truth
Random guesses
No baseline
Logs only
Less visibility
Removing lineage
Ensuring policies, lineage, and quality standards are consistently applied
Data is always in plaintext
Data is slower
Sensitive information isn't exposed to eavesdroppers
Faster downloads only
Interception of sensitive data by unauthorized parties
Writing logs
Any transformations
PII is displayed openly
Sensitive data is protected and less exposed to unauthorized views
Only numeric data allowed
Protect sensitive columns while allowing partial data usage
Increase plaintext exposure
Remove all keys
Disable extraction
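A sketch of hashing and masking a sensitive column so downstream users can still join or count on it without seeing raw values; in practice the salt would come from a secrets manager, and the field names are illustrative:

```python
import hashlib

SALT = b"load-this-from-a-secrets-manager"

def pseudonymise(value: str) -> str:
    return hashlib.sha256(SALT + value.encode("utf-8")).hexdigest()

def mask_email(email: str) -> str:
    user, _, domain = email.partition("@")
    return f"{user[:1]}***@{domain}"

record = {"email": "jane.doe@example.com"}
record["email_hash"] = pseudonymise(record["email"])  # joinable surrogate
record["email"] = mask_email(record["email"])         # "j***@example.com"
```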
Everyone can edit pipelines
No logs needed
Data always public
Only authorized users can modify or run pipelines, enhancing security
Storing all PII unencrypted
Ignoring user requests
Implementing deletion or anonymization upon request
Adding more duplicates
No record of access
Less transparency
Compliance is ignored
Accountability and traceability for security and compliance
Faster access
Exposure of passwords in code or logs
Any encryption
Data from loading
Transforming data in-place in the secure environment
Decrypting data everywhere
Always using local disks
No updates to policies
Data remains unprotected
Ongoing adherence to security standards and regulations
Inconsistent governance
Data goes over public internet unprotected
Slower transfers only
Data moves within a secure, isolated environment
Replacing encryption
Strict controls, encryption, and auditing access to patient data
Ignoring patient privacy
No logging
Publishing data publicly
One person does everything
Different roles have limited, distinct permissions (e.g., dev vs. ops)
Only one admin for all
Less control
Reducing exposure if credentials are compromised
Always harder to manage
No benefit
Ignoring best practices
Giving all accounts full admin rights
No authentication needed
Accounts only get the minimum permissions needed to do their job
Disabling credentials
Storing passwords in logs
Printing PII openly
No filtering needed
Removing or masking sensitive information from logs
Keeping all data forever
Deleting data randomly
Removing old or unneeded data according to defined schedules
Ignoring compliance
Proper testing
Exposure of real sensitive data to dev/test environments
Any compliance
Realistic scenarios
Removing encryption
Verifying ETL processes follow all security, privacy, and data handling regulations
Increasing unauthorized access
If storage is compromised, attackers get ciphertext instead of plaintext
Keys are in plaintext near data
Ignoring all keys
Data is never accessed
Only running once a day
Continuously processing incoming data as it arrives
Ignoring source changes
Always slower
Only full dumps
No changes
Inserts, updates, and deletes in source data to apply incrementally
Random errors
Only batch loads
Capturing database changes in near real-time
Only file extractions
Manual transformations
Always batch processing
No real-time updates
Only single-thread reads
Data is ingested as events, allowing continuous transformations
Both are identical
Micro-batching never buffers data
Micro-batching processes small batches at intervals, while streaming processes events immediately
Streaming is offline
Sorting all data by hand
Dropping late data
Ignoring timestamps
Event-time processing logic, possibly using watermarks
Data is processed twice
Handling duplicates and idempotency so data isn’t double-counted
Stopping after one record
Storing no context
Only batch mode
Remembering previous events’ data to compute aggregates or handle joins
Discarding all history
Grouping events into manageable intervals for aggregation
Ignoring event timestamps
Only running monthly
Storing data in CSV
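A sketch of a tumbling-window aggregation that groups streaming events into fixed 60-second buckets keyed by event time; the event tuples are illustrative:

```python
from collections import defaultdict

def tumbling_window_sums(events, window_seconds: int = 60):
    """events: iterable of (event_time_in_epoch_seconds, value) tuples."""
    windows = defaultdict(float)
    for event_time, value in events:
        window_start = int(event_time // window_seconds) * window_seconds
        windows[window_start] += value
    return dict(windows)

# tumbling_window_sums([(0, 1.0), (30, 2.0), (65, 5.0)]) -> {0: 3.0, 60: 5.0}
```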
Overwhelming downstream systems
The pipeline adapts to variations in data flow rate without crashing
Data always lost
Ignoring slow consumers
Always raw only
Strict ELT
Lower latency now vs. the flexibility to reprocess raw data later
No schema requirements
Only traditional RDBMS is allowed
Query patterns, latency needs, and volume of data
Ignoring format
Detecting if the system can keep up with incoming data
Ignoring delays
Only checking batch jobs
Slowing down on purpose
No checkpoints
Always restarting from scratch
Checkpointing and replaying events from a certain offset
Ignoring state
No schema changes allowed
Handling new fields or removed fields dynamically without stopping the stream
Batch reload
Only fixed schemas forever
Defining when a window of events is considered complete despite late arrivals
Only working with batch data
Sorting all events by arrival time only
Removing timestamps
No real-time capability
Only batch steps
Leveraging built-in abstractions for stateful and windowed transformations
Disabling transformations
Immediate response to anomalies
No remediation
Only batch corrections
Always waiting for all late events
Ignoring event times
Deciding how long to wait for late data before producing results
Only using batch mode
Only manual updates
Automated testing and rolling out changes without stopping the stream
Offline processing only
Fewer updates
No testing needed
Always testing full pipeline only
Validating small pieces of code in isolation to catch errors early
Only running after production
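A sketch of a unit test for a single transformation function, runnable with pytest; the function itself is a hypothetical example of the kind of logic worth isolating:

```python
def normalise_country(code: str) -> str:
    return code.strip().upper()

def test_normalise_country():
    assert normalise_country(" de ") == "DE"
    assert normalise_country("us") == "US"
```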
Each component works alone
All steps (extract, transform, load) work together as intended
Slower code
No data is moved
Testing logic without depending on actual external systems
Always hitting production DB
Only GUI tests
Ignoring dependencies
No changes allowed
Data is always static
Recent code changes haven’t broken previously working functionality
Ignoring old tests
Only testing one record
Removing performance metrics
Breaking the pipeline deliberately
Measuring how the system performs under high data volumes
Only known scenarios
Production data leaks
Edge cases and scenarios without risking real sensitive data
Slower transformations only
Manual deployment always
Automated testing before changes go live, preventing regressions
No tests ever run
Only testing after production issues
The pipeline recovers from transient failures without manual intervention
Always fail permanently
Only manual restarts
No visibility
Only command-line logs
A visual representation of job runs, statuses, and dependencies
Always slow loading
Ignoring issues
Delayed responses
No escalation
Stakeholders are promptly notified to fix or investigate
Making logs unreadable
Easier parsing, searching, and analyzing issues
Slowing queries
No rollbacks possible
Lost history of changes
Changes are tracked, allowing revert to previous stable versions
Only manual edits
Accountability and traceability of operations
Only complexity
No benefits
Automatic fixing of errors
Pipeline fails under load
Only one scenario tested
The pipeline can handle growing data volumes or more concurrency
No need for infrastructure changes
Ignoring discrepancies
Slower tests
Manual calculations
Quickly verifying data integrity and completeness
Testing new logic on a small subset of data before full rollout
Deploying everywhere at once
No difference from normal deploys
No issues
Resource constraints and scaling needs
Only network errors
Faster ETL by magic
Only real-time snapshots
Removing historical data
Observing how performance or quality metrics change over time
Ignoring patterns
Credentials aren't committed in code and there are no obvious vulnerabilities
More exposure
Only syntax checks
Stagnation of pipelines
Only manual reviews once a year
Using insights from tests/monitoring to refine and optimize ETL/ELT continuously
Removing alerts