
Extract, transform, and load (ETL) and extract, load, and transform (ELT) are both data pipeline workflows; the difference is where cleaning, deduplicating, and formatting happen. With ETL, data is transformed in the second step, before being loaded into the data warehouse. With an ELT pipeline, transformation happens after the data is stored in the data warehouse.

ETL (Extract, Transform, Load)

In an ETL pipeline, the transformation step happens after data is extracted from a source. The source could be a website, an API, files, or another database. Raw data is often unusable until it’s standardized for your specific warehouse design. For example, if you pull raw data from a website, phone numbers might include parentheses and hyphens, while your relational database stores phone numbers with no special characters. The transform step in ETL formats data to fit your data warehouse, stripping the parentheses and hyphens before the numbers are stored in tables.
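To make that concrete, here is a minimal Python sketch of the transform step described above. The normalize_phone helper and the sample record are hypothetical illustrations, not part of any specific pipeline:

```python
import re

def normalize_phone(raw: str) -> str:
    """Strip parentheses, hyphens, spaces, and any other
    non-digit characters so the value fits a digits-only column."""
    return re.sub(r"\D", "", raw)

# A raw record as it might arrive from a scraped web page
record = {"name": "Acme Corp", "phone": "(555) 123-4567"}
record["phone"] = normalize_phone(record["phone"])
print(record["phone"])  # 5551234567
```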

Formatting isn’t the only transformation step. Raw data is deduplicated to avoid adding redundant records, which skew reports and waste storage resources. Extracted data might be compared with data already in the warehouse so that duplicates are excluded from the load. Malformed data might be discarded, or developer scripts might try to salvage some of it and store it as a partial record. What you do with data depends on your business requirements.
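One way this logic might look in Python, using an email field as a hypothetical natural key (your business requirements would dictate the real rules):

```python
def transform(extracted, existing_keys):
    """Deduplicate against already-stored data and decide what to do
    with malformed rows before anything reaches the load step."""
    clean = []
    for row in extracted:
        key = row.get("email")
        if key is None:
            continue                      # malformed beyond repair: discard
        if key in existing_keys:
            continue                      # already stored: skip the duplicate
        if "name" not in row:
            row = {**row, "name": None}   # salvage as a partial record
        clean.append(row)
        existing_keys.add(key)
    return clean

rows = [
    {"email": "a@example.com", "name": "Ana"},
    {"email": "a@example.com", "name": "Ana"},  # duplicate
    {"email": "b@example.com"},                 # partial but salvageable
    {"name": "no email at all"},                # malformed
]
print(transform(rows, existing_keys=set()))
```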

Without ETL, relational databases might reject malformed data, and important information could be lost. Relational databases have strict constraints on table columns, so data that violates those rules will not be stored. For example, if you try to store a zip code containing letters in a table column restricted to numeric values, the record will be rejected in the load step.
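A transform step can catch such records before the database has to reject them. Here is a minimal sketch, assuming a simple digits-only rule for the zip column:

```python
def has_valid_zip(record: dict) -> bool:
    """Return True only if the zip code would satisfy a
    numeric-only column constraint."""
    return record.get("zip", "").isdigit()

records = [{"zip": "90210"}, {"zip": "EC1A1BB"}]
loadable = [r for r in records if has_valid_zip(r)]
rejected = [r for r in records if not has_valid_zip(r)]
print(loadable)  # [{'zip': '90210'}]
print(rejected)  # [{'zip': 'EC1A1BB'}] -- routed for review instead of lost
```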


ELT (Extract, Load, Transform)

In a data pipeline using ELT, data is first loaded into the database and transformed after being stored. Most ELT procedures work with NoSQL databases, where the rules for raw data storage are much less constraining. For example, raw data can be dumped into a MongoDB collection, where each record is stored as a document with a unique document ID. The advantage is that data does not need specific formatting up front; it can be evaluated and organized later.
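A minimal sketch with pymongo shows how little preparation the load step needs; the connection string, database, and collection names here are assumptions for illustration:

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # assumed local instance
raw = client["pipeline_demo"]["raw_records"]

# Dump raw records as-is; MongoDB assigns each document a unique
# _id, so no schema or formatting is required up front.
docs = [
    {"phone": "(555) 123-4567", "source": "web_scrape"},
    {"phone": "555.123.4567", "zip": "90210"},  # a different shape is fine
]
result = raw.insert_many(docs)
print(result.inserted_ids)  # one auto-generated ObjectId per document
```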

After being loaded into a database, scripts run on the raw data to perform formatting, deduplication, and cleanup. The cleaned data might be reloaded into another database or kept in the current one. Big data tables store records that are later used in analytics, so the transformation step is often used to deduplicate information for more accurate reporting.
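Continuing the hypothetical pymongo setup above, a post-load cleanup might deduplicate on a chosen key with an aggregation pipeline and write the result to a second collection:

```python
from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017")["pipeline_demo"]

# Group raw documents by phone number, keep the first document in
# each group, and write the deduplicated set to a new collection.
db["raw_records"].aggregate([
    {"$group": {"_id": "$phone", "doc": {"$first": "$$ROOT"}}},
    {"$replaceRoot": {"newRoot": "$doc"}},
    {"$out": "clean_records"},
])
print(db["clean_records"].count_documents({}))
```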

ELT often requires less formatting and focuses more on speed. For example, logs from network events could be ingested into a data warehouse where analytics run artificial intelligence (AI) algorithms to flag anomalous events. Anomalous events are sent to cybersecurity analysts for further review. Time is of the essence in incident response, so the ELT process prioritizes speed, and formatting is less critical to the output results.

ETL vs. ELT: Key Differences

The key difference between ETL and ELT is when data is stored in the database. If you work with ETL, you need scripts to format and organize data before it’s stored in a database. ELT stores data in the database first, so you can perform the transformation later without requiring your workflow to do it prior to storage. Your strategy should be based on your speed and data integrity requirements.

ELT stores data prior to transformation, so data is usually in a raw format. NoSQL databases might be necessary for raw data unless you work with an API that returns preformatted data that fits your table structures. One benefit of ELT over ETL is that data is loaded and stored before any changes are made to it, so no information is lost if a transformation fails.

Loading data prior to deduplication, however, can cause data integrity issues. When transformation happens after loading, you’ll likely need a staging server to store and transform data before loading it into a production environment. The production environment can then be used for your business applications and analytics.

When businesses choose between ETL and ELT, they usually focus on the importance of data availability and integrity. The workflow you choose will determine which one takes priority. Speed requirements might be best with ELT, while data integrity requirements will call for ETL.

Choosing the Right Approach

Performing transformations before storing data gives you “cleaner” data. If data formatting and deduplication happen before loading it, then you could conceivably have a workflow that dumps data directly into a production environment. ETL might be necessary when you need new data available to applications in real time or close to real time as your workflows collect it throughout the day.

Speed of data availability is the main difference between ETL and ELT. Applications and business strategies will determine which approach is better. Real-time applications and machine learning might use ELT, but be aware that some data might be duplicated. Reporting and applications that don’t require immediate data ingestion might be better with ETL.

APIs often provide formatted data, so an ETL pipeline can work with few updates required on the data. ELT is usually necessary when you have raw data that cannot be easily imported into a database table. When you have raw files or data scraped from web pages, it might be necessary to format and deduplicate the data before adding it to relational database tables. Relational databases require structured data, so ETL is a necessity prior to loading. NoSQL databases are much more forgiving and can store unstructured data without formatting.

Conclusion

Any time you import data from one location to another, you probably need to consider data transformation and storage requirements. Once you decide what data must be imported, you can choose between ETL and ELT. As you architect your storage solution, Pure Storage offers scalable block storage for structured and unstructured data that integrates with AWS, Azure, and other major cloud services.

For virtualized environments, Portworx® helps with Kubernetes orchestration as your data travels through the pipeline. Portworx is beneficial for operational data and for pipelines with continuous integration in containerized and virtual environments.