What is ETL?
The mechanism of extracting information from source systems and bringing it into the data warehouse is commonly called ETL, which stands for Extraction, Transformation and Loading.
The ETL process requires active input from various stakeholders, including developers, analysts, testers, and top executives, and it is technically challenging.
To maintain its value as a tool for decision-makers, a data warehouse system needs to change as the business changes. ETL is a recurring activity (daily, weekly, monthly) of a data warehouse system and needs to be agile, automated, and well documented.
How Does ETL Work?
ETL consists of three separate phases: extraction, transformation, and loading.
Extraction is the operation of extracting information from a source system for further use in a data warehouse environment. This is the first stage of the ETL process.
The extraction process is often one of the most time-consuming tasks in ETL.
The source systems might be complicated and poorly documented, and thus determining which data needs to be extracted can be difficult.
The data has to be extracted periodically, not just once, to supply all changed data to the warehouse and keep it up to date.
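One common way to supply only the changed data on each periodic run is to remember a "watermark" (the latest modification timestamp already extracted) and ask the source for rows newer than it. The sketch below illustrates that idea under stated assumptions: the customers table, its updated_at column, and the in-memory SQLite source are all hypothetical and stand in for a real source system, which may expose changes in a different way.

```python
import sqlite3

def extract_changed_rows(conn, watermark):
    """Return rows modified after `watermark`, plus the new watermark."""
    rows = conn.execute(
        "SELECT id, name, updated_at FROM customers "
        "WHERE updated_at > ? ORDER BY updated_at",
        (watermark,),
    ).fetchall()
    # Advance the watermark only if something new was extracted.
    new_watermark = rows[-1][2] if rows else watermark
    return rows, new_watermark

# Hypothetical source system, built in memory for illustration only.
source = sqlite3.connect(":memory:")
source.execute(
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, updated_at TEXT)"
)
source.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [
        (1, "Alice", "2024-01-01T08:00:00"),
        (2, "Bob", "2024-01-02T09:30:00"),
        (3, "Carol", "2024-01-03T10:15:00"),
    ],
)

# First periodic run: everything changed since the stored watermark.
first_batch, watermark = extract_changed_rows(source, "2024-01-01T12:00:00")

# Second run with no changes in the source: nothing to extract.
second_batch, watermark = extract_changed_rows(source, watermark)

# A row changes in the source, so the third run picks up only that row.
source.execute(
    "UPDATE customers SET name = 'Alice B.', updated_at = '2024-01-04T07:00:00' "
    "WHERE id = 1"
)
third_batch, watermark = extract_changed_rows(source, watermark)
```

The timestamps are ISO 8601 strings so that plain string comparison orders them correctly. This approach only works if the source reliably maintains a modification timestamp, and it does not detect deleted rows.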
The cleansing stage is crucial in a data warehouse system because it is supposed to improve data quality. The primary data cleansing features found in ETL tools are rectification and homogenization. They use specific dictionaries to rectify typing mistakes and to recognize synonyms, and they apply rule-based cleansing to enforce domain-specific rules and define appropriate associations between values.
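As a rough illustration of dictionary-based rectification and homogenization followed by a rule-based check, consider the sketch below. The dictionaries, the city field, and the zip code rule are hypothetical examples chosen for this sketch; they are not taken from any particular ETL tool.

```python
# Hypothetical dictionaries; a real tool would ship or learn far larger ones.
SPELLING_FIXES = {"new yrok": "new york", "londn": "london"}
SYNONYMS = {"nyc": "new york", "big apple": "new york"}

def cleanse_city(raw):
    """Rectify typing mistakes, then map synonyms to one canonical value."""
    value = " ".join(raw.strip().lower().split())  # trim and collapse spaces
    value = SPELLING_FIXES.get(value, value)       # rectification
    value = SYNONYMS.get(value, value)             # homogenization
    return value.title()

def check_rules(record):
    """Rule-based cleansing: return the domain rules the record violates."""
    errors = []
    # Hypothetical domain rule: US records need a five-digit zip code.
    zip_code = record.get("zip", "")
    if record.get("country") == "US" and not (
        zip_code.isdigit() and len(zip_code) == 5
    ):
        errors.append("US zip code must be five digits")
    return errors

cleaned = [cleanse_city(city) for city in ["  New  Yrok ", "NYC", "Paris"]]
problems = check_rules({"country": "US", "zip": "1234"})
```

Values that appear in neither dictionary pass through unchanged apart from whitespace and capitalization, so the function never invents a correction it has no entry for.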
The following examples show why data cleansing is essential:
If an enterprise wishes to contact its users or its suppliers, a complete, accurate and up-to-date list of contact addresses, email addresses and telephone numbers must be available.
If a client or supplier calls, the staff member responding should be able to find the person quickly in the enterprise database, but this requires that the caller's name or company name is listed in the database.
If a customer appears in the databases with two or more slightly different names or different account numbers, it becomes difficult to update that customer's information.
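The last example, one person stored under slightly different names, can be detected approximately by normalizing names and comparing them with a string similarity measure. The sketch below uses Python's standard difflib module; the 0.85 threshold, the record layout, and the account numbers are assumptions made for illustration, and a production matcher would usually compare more fields than the name.

```python
import re
from difflib import SequenceMatcher

def normalize_name(name):
    """Lowercase, drop punctuation, and collapse repeated whitespace."""
    name = re.sub(r"[^\w\s]", "", name.lower())
    return " ".join(name.split())

def likely_same(a, b, threshold=0.85):
    """Treat two names as the same entity if they are similar enough."""
    ratio = SequenceMatcher(None, normalize_name(a), normalize_name(b)).ratio()
    return ratio >= threshold

def find_duplicate_pairs(customers, threshold=0.85):
    """Compare every pair of records and report likely duplicates."""
    pairs = []
    for i, first in enumerate(customers):
        for second in customers[i + 1:]:
            if likely_same(first["name"], second["name"], threshold):
                pairs.append((first["account"], second["account"]))
    return pairs

# Hypothetical customer records with different account numbers.
customers = [
    {"account": "A-100", "name": "John Smith"},
    {"account": "A-217", "name": "Jon  Smith."},
    {"account": "B-003", "name": "Mary Jones"},
]
duplicates = find_duplicate_pairs(customers)
```

Comparing every pair is quadratic in the number of records, so this is only practical for small sets. The threshold trades missed duplicates against false matches and would need tuning on real data.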
Transformation is the core of the reconciliation phase. It converts records from their operational source format into a particular data warehouse format. If we implement a three-layer architecture, this phase outputs our reconciled data layer.
The following points must be rectified in this phase:
Free text may hide valuable information. For example, the name XYZ PVT Ltd does not explicitly state that the company is a private limited company.
Different formats can be used for individual data. For example, a date can be saved as a string or as three integers.
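Taking a calendar date as an illustrative case of one value that may be stored either as a string or as three integers, the sketch below converts both representations into a single type. The ISO string format and the (year, month, day) ordering are assumptions made for this example; real sources would each need their own documented format.

```python
from datetime import date, datetime

def to_date(value):
    """Convert the assumed source representations into one date type."""
    if isinstance(value, date):
        return value
    if isinstance(value, str):
        # Assumed string format: ISO 8601, for example "2024-03-15".
        return datetime.strptime(value, "%Y-%m-%d").date()
    # Assumed integer format: a (year, month, day) triple.
    year, month, day = value
    return date(year, month, day)

from_string = to_date("2024-03-15")
from_integers = to_date((2024, 3, 15))
```

Once both sources yield the same type, the values can be compared, sorted, and loaded into one warehouse column.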
The following are the main transformation processes aimed at populating the reconciled data layer:
Conversion and normalization that operate on both storage formats and units of measure to make data uniform.
Matching that associates equivalent fields in different sources.
Selection that reduces the number of source fields and records.
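The three processes above can be sketched together on two hypothetical order sources that name their fields differently and record weight in different units. All source names, field names, and the choice of kilograms as the reconciled unit are assumptions made for this illustration.

```python
LB_TO_KG = 0.45359237  # definition of the international pound in kilograms

# Matching: associate equivalent source fields with one reconciled name.
FIELD_MAP = {
    "orders_eu": {"cust_id": "customer_id", "weight_kg": "weight_kg"},
    "orders_us": {"customer": "customer_id", "weight_lb": "weight_kg"},
}

# Conversion and normalization: per-source unit converters.
CONVERTERS = {
    "orders_us": {"weight_kg": lambda pounds: round(pounds * LB_TO_KG, 2)},
}

def reconcile(source, record):
    """Rename matched fields, convert units, and drop unmapped fields."""
    out = {}
    for source_field, target in FIELD_MAP[source].items():
        value = record[source_field]
        convert = CONVERTERS.get(source, {}).get(target)
        out[target] = convert(value) if convert else value
    return out

def select(records):
    """Selection: keep only records that carry a customer identifier."""
    return [r for r in records if r["customer_id"] is not None]

raw = [
    ("orders_us", {"customer": 42, "weight_lb": 10, "promo_code": "X"}),
    ("orders_eu", {"cust_id": 7, "weight_kg": 2.5}),
    ("orders_eu", {"cust_id": None, "weight_kg": 1.0}),
]
reconciled = select([reconcile(source, record) for source, record in raw])
```

Fields that are not listed in FIELD_MAP, such as promo_code here, never reach the reconciled layer, which is how the sketch reduces the number of source fields.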
Cleansing and transformation processes are often closely linked in ETL tools.