-
Staging area: Necessary or overhead?
Posted on September 23rd, 2008 No commentsIn this article, let us see what a staging area is, its types and the reason to have one in your data warehouse.
Ok, what is a stage area?
It is that part of a data warehouse where data is stored physically (in database or in files), but as an intermediate step before loading the target data warehouse / data marts. It is where activities like cleansing, de-duplication, etc take place. It is like a pit stop for a racing car before reaching the destination.
Some characteristics of the staging area are
- accessible to and owned by ETL / DW team
- OLAP / reporting teams do not have access to it
- indexed very little
- ETL developers are usually free to create / drop tables, controlled though (by the architect or modeling team)
Types of staging areas:
- Persistent staging – stage data is not deleted, if you want to maintain history.
- Transient staging – stage data is deleted after each ETL load
Most data warehouses have one or more staging areas, the types being either persistent or transient or both.
But should you really have a stage area? Can’t you do without it? After all these days, ETL tools are more capable of handling more data in memory fully.
Is staging necessary or is it an overhead?
I think it is very critical to have a staging area, especially if you are moving a lot of data from multiple sources. Here are a few reasons why :
- Intermediate processing: Often you need to perform transformations, cleansing and other processing on huge chunks of data from multiple sources. It is not feasible to do all of this in memory purely (extraction from source, processing and direct load to target). We need intermediate steps where data can be stored, the ETL tool can take a breather and start again. This also helps in several other ways discussed below.
- Auditing: Having a stage area provides a means of knowing the condition of data at certain points during the ETL cycle. If your ETL job fails mid-way, you can look at the stage tables and figure out the condition of data as it passes through the various steps. Thus it provides for easy auditing.
- Recovery: Suppose if your ETL job fails midway, you would need to fetch the data once again from the source and repeat the entire process all over again, if you do not maintain a stage area. Having stage areas could mean that the job can be restarted from the point of failure.
- Backup: If your source data gets overwritten, the original data is lost.. probably forever. And there is no way in which you can perform a reconciliation with the DW data against the source. But if you had a persistent stage, your data would be backed up in these intermittent databases.
- Cleansing, synchronizing, de-duping data, etc: If you have multiple sources, stage areas are a neat method to synchronize your data, at the same time, to cleanse and conform them. (For eg, if your customer data is from 3 different sources, it makes sense to pull them all into a stage area and have then synched. This is especially true if the different sources are not available at the same time)
- Helps reduce contention on busy source (OLTP) systems: It helps bring down the contention on the source systems. Once data is pulled, you can work on it in the stage area and get it transformed to your needs. If you were to apply these transformations on source data directly without a stage, you would be engaging the connections to the source for much longer. Thsi would impact OLTP systems adversely.
- Increases availability of data: If one of your source systems is temporarily down, you still will have access to the data which is staged.
- Joins are easier and faster: It is much easier and faster if the tables are in the same database schema as against different disparate databases.
Source: Several, but notably, Ralph Kimball, The data warehouse ETL toolkit.
Data Integration, Data Modelling, Data warehouse, Dimensional modelling, EIM, ETL Data Integration, Data Modelling, Dimensional modelling, DWH, EIM, ETL, StagingLeave a reply
You must be logged in to post a comment.


