• Staging area: Necessary or overhead?

    Posted on September 23rd, 2008 biexplorer No comments

    In this article, let us see what a staging area is, its types and the reason to have one in your data warehouse.

    Ok, what is a stage area?

    It is that part of a data warehouse where data is stored physically (in database or in files), but as an intermediate step before loading the target data warehouse / data marts. It is where activities like cleansing, de-duplication, etc take place. It is like a pit stop for a racing car before reaching the destination.

    Some characteristics of the staging area are

    1. accessible to and owned by ETL / DW team
    2. OLAP / reporting teams do not have access to it
    3. indexed very little
    4. ETL developers are usually free to create / drop tables, controlled though (by the architect or modeling team)

    Types of staging areas:

    1. Persistent staging – stage data is not deleted, if you want to maintain history.
    2. Transient staging – stage data is deleted after each ETL load

    Most data warehouses have one or more staging areas, the types being either persistent or transient or both.

    But should you really have a stage area? Can’t you do without it? After all these days, ETL tools are more capable of handling more data in memory fully.

    Is staging necessary or is it an overhead?

    Read the rest of this entry »

  • ETL effort estimation: Points to factor-in

    Posted on September 23rd, 2008 biexplorer No comments

    Estimation of ETL effort is not always fun (as with any estimation).

    There are several ways to estimate the effort needed to complete an ETL job. Work Breakdown Structure (WBS) is popular. And so is Function Point Analysis (FPA).

    But the most widely used is the one that factors in complexity based on the understanding of things like source, target, resources on project, etc.

    Though I haven’t really seen anyone use this method to perfection, it is a good place to start with. Some people argue against this method, but I see this as a complementary option to whatever method you have.

    So, here is a list of points that I think would be useful when you do any ETL effort estimation. I have grouped it under 5 heads: Source, target, transformations, resources, other.

    Source based:

    1. No of different sources & types
    2. Incremental extraction needs
    3. Profiling of data sources
    4. Cleansing / de-duplication dirty data sources
    5. Availability of documentation / transition of knowledge of source data
    6. Access control & management, if needed
    7. Data volumes for unit testing

    Read the rest of this entry »