Extract, transform, and load is a data warehousing function that involves extracting data from outside sources, transforming it to fit business needs, and ultimately loading it into the data warehouse. ETL is an important part of data warehousing, as it is the way data actually gets loaded into the warehouse.

Data warehousing projects consolidate data from different sources. Most sources tend to be relational databases or flat files, but other sources exist. A proper ETL system must be able to communicate with the databases and read the various file formats used throughout an organization. This can be a daunting task, and many data sources cannot be accessed very easily.

The transform phase looks at the data being retrieved and does several things. Some data sources will require very little manipulation of data. Other sources provide more information than is necessary, so an ETL system must cull out the extra data or consolidate it into a format that is more appropriate for a data warehouse. In addition, ETL systems may be required to combine data from multiple sources before inserting the data into the warehouse.

Finally, the consolidated data is uploaded to the data warehouse. Depending on the requirements of the organization, this process ranges widely. Some data warehouses merely overwrite old information with new data. More complex systems can selectively update information.

ETL systems can be quite complex, and certain problems may occur. Improperly designed ETL systems or the occasional change in format of one of the source databases can cause serious problems in this stage, potentially destroying or corrupting significant amounts of data. An additional difficulty is making sure the data being uploaded is relatively consistent. Since multiple source databases all have different update cycles (some may be updated every few minutes, while others may take days or weeks), an ETL system may be required to hold back certain data until all sources are synchronized.

List of ETL tools

  • Informatica
  • Data Junction
  • Data Stage
  • Ab Initio
  • Data Mirror

See also