-
Logic behind implementing SCD 2
Posted on October 7th, 2008 No commentsLets talk about the logic behind SCD type 2 today.
We know that SCD 2 is about preserving all the changes in the dimension records. Let us see the logic behind how we can implement it.
NOTE: The steps below assume SCD 2 having a Begin_date and End_date and a Current_Flag column
- Check if the incoming row is already present in the target table (dimension) using the source primary key
- If it doesn’t exist in the target dimension
- Generate a surrogate key
- Enter source record’s date as the Begin_date
- Enter the default end date (which could be 31/12/2099) as the End_date
- If you have a Current_flag column, set it as ‘Y’ or ’1′ (or whatever you want)
- Insert into the dimension
- If the row exists in the target
- Check if the incoming and target current record are different (at least for one chosen attribute)
- If they are same, do nothing
- If they are different, do the following
- For the record in the target table, change (update) the End_date to source record’s date and set the Current_flag to ‘N’ or ’0′ or whatever
- Take the incoming record, generate a surrogate key, enter source record’s date as the Begin_date and the default date as the End_date. Also set the Current_flag to ‘Y’ or ’1′. Insert into the dimension
NOTE: The End_date of the previous record and the Begin_date of the current record are assumed to be the same. But some people prefer them to be different dates ie) End_date is 1 day lesser than the next records Current_date.
-
Staging area: Necessary or overhead?
Posted on September 23rd, 2008 No commentsIn this article, let us see what a staging area is, its types and the reason to have one in your data warehouse.
Ok, what is a stage area?
It is that part of a data warehouse where data is stored physically (in database or in files), but as an intermediate step before loading the target data warehouse / data marts. It is where activities like cleansing, de-duplication, etc take place. It is like a pit stop for a racing car before reaching the destination.
Some characteristics of the staging area are
- accessible to and owned by ETL / DW team
- OLAP / reporting teams do not have access to it
- indexed very little
- ETL developers are usually free to create / drop tables, controlled though (by the architect or modeling team)
Types of staging areas:
- Persistent staging – stage data is not deleted, if you want to maintain history.
- Transient staging – stage data is deleted after each ETL load
Most data warehouses have one or more staging areas, the types being either persistent or transient or both.
But should you really have a stage area? Can’t you do without it? After all these days, ETL tools are more capable of handling more data in memory fully.
Is staging necessary or is it an overhead?
-
Decfloat datatype not supported in BODI / BODS
Posted on July 12th, 2008 No commentsRecently I was working with a few DB2 UDB 9.5 tables and discovered that decfloat datatype columns were not recognized even though I had checked the option to recognize unsupported datatypes as varchar.
-
Why surrogate key?
Posted on July 4th, 2008 No commentsWhy should we use surrogate keys in a data warehouse?
Here is why.
1. To separate the DWH from the operational environment
Think of this. You have a dimension loading from transactional tables. All of a sudden, the OLTP people decide to re-use all closed / inactive account id’s. For them, it wont matter a bit. For the DWH team, it will.
Read the rest of this entry » -
Handling many to many relationships (corrections made)
Posted on April 30th, 2008 No comments(There were issues with the design in my original post. Those have been corrected. I have let the errors stay and have highlighted where I have made corrections so that you can understand better)
Consider this situation.
2 new guys Ravi and Raj go to our bank and open new savings accounts 101 and 102. And Ravi deposits 100 and Raj deposits 144. This can be represented by the following.
CLICK ON THE IMAGES BELOW TO SEE A LARGER PICTURE
Now suddenly, Ramu and Ramya come to the bank and open a joint account 103. And they deposit 1000. How will you represent this? What will you enter for CUST_KEY ? Will you enter 3 or 4? It was the same transaction of Rs 1000. But which customer to indicate? What will you fill in the red ? position below? -
Migration of a DW solution from Oracle to DB2
Posted on April 26th, 2008 No commentsRecently we migrated an entire DW module from Oracle to DB2. And our ETL was done using Business Objects Data Integrator.
Below is a sequence of steps that we followed to migrate the entire solution (including data structures, ETL code and universe)
Assumption:
–>All estimates done for 4 DB2 databases, 3 federated systems, 62 tables, 33 indexes, 6 DI jobs, 110 data flows, 151 transformations, 16 SQL transforms, 34 SQL() scripts, 220 objects in 23 classes and 2 reports.
–>Add 20% time as buffer for unexpected issues.
–>Some tasks can be executed in parallel, while some have strong dependency on completion of previous tasks.Work breakdown of migration and rough estimation:
1. Initiation & planning : 40 Hrs
Establish objectives and goals
Requirements gathering
Establish scope
Plan resources
Identify development environment2. Assessment : 40 Hrs
Assess tool and technology needs
Assess and understand technical requirements
Establish technical guidelines
Finalize tools and development environmentBODI, BODS, Business Intelligence, DB2, Data Integration, Data Modelling, Data warehouse, Database, Dimensional modelling, EIM, ETL, Metadata, Oracle, SAP BODI, BODS, Business objects, Data Integration, Data Modelling, Data Quality, Database, DB2, Dimensional modelling, DWH, Effort estimation, EIM, ETL, Metadata, Oracle, SAP-BO -
Times.. they are changing.
Posted on March 3rd, 2008 No commentsLooks like traditional data warehousing will soon be taken over by a sea of changes.
(Click on underlined words to open the links)
Integration of OLTP with DW? Read about it here.
Here is a discussion on new approaches in DW.
If you want to know Werner’s opinion on the future of Business Objects Data Integrator (BODI), see here.
Appears like BO Data quality + BO Data Integrator = BO Data Services. A discussion on it here.
And is Informatica the next acquisition target? Some say maybe… some say no.
BI News, BODI, BODS, Business Intelligence, Business objects, Data Integration, Data Modelling, Data Quality, Data warehouse, Dimensional modelling, EIM, ETL, Informatica, SAP BODI, BODS, Business objects, Data Integration, Data Modelling, Data Quality, Dimensional modelling, DWH, EIM, ETL, Informatica, News, OLTP, SAP-BO -
The maximum size of a fact table.
Posted on July 11th, 2007 No commentsWell, this is just a workout. It may or may not make much sense!
Here I have detailed how we can calculate the maximum size of a fact table.
Assume our datamart has 5 dimensions: Time_dim, customer_dim, product_dim, product_warehouse_dim, supplier_dim. And our fact table has 5 foreign keys, 15 measures, 2 degenerate columns
Time_dim has 20 years of rows on daily grain: so 365 X 20 = 7300 rows
There are 4000 customers in customer_dim.
200 products.
10 warehouses.
12 suppliers.So the maximum possible fact table rows is
7300 x 4000 x 200 x 10 x 12 = 7008000000000 (7008000 million or 7008 billion)Now this is the maximum no of rows possible in the fact table.(Assuming that every customer brought every product from every store supplied by every supplier on every day…..etc… it goes on … permutations and combinations)
Assuming that the 22 columns have an average size of 5 bytes, we need 110 bytes per row.
So now, 110 x 7008 billion = 71794 GB
The maximum possible size for this fact table (assuming that dimension count stays same) is 71794 GB.
The maximum possible increase per year could be 3590 GB.
-
Semi-additive facts
Posted on May 31st, 2007 No commentsWhat are semi-additive facts?
Well, facts that can be aggregated across a few dimensions, but not all.
Take the case of an inventory fact table.
We track inventory every day for every product at every warehouse.
To get the inventory on a given day for all products, I aggregate the inventory fact measures across the product dimension filtered for that particular day. The following query will give the inventory for every product for today.
Select product_dim.name, sum(Inv_fact.inventory_qty)
from
product_dim, inv_fact, date_dim
where
date_dim.date_key= inv_fact.date_key and
date_dim.calendar_dt=sysdate and
product_dim.product_key = inv_fact.product_key
group by product_dimHowever, I cannot add all the inventory quantities every day for a product, say for a month and get anything meaningful. Aggregation across time is meaningless. (of course, you can use it to calculate the average inventory per day)
Such facts that can be aggregated across certain dimensions but not across all dimensions are called Semi-additive or partial facts.
-
Rapidly changing dimensions
Posted on May 25th, 2007 No commentsDimensions can be loaded in different ways. Slowly changing dimension loading is one. Even in this there are
1. Type 1 – any changes in existing data is updated. Used for correction handling.2. Type 2 – change is not updated, but a new record is inserted with change and the old data is kept as such and flagged as old version. For example, Prem is a record in Customer dimension. Today Prem lives in Hyderabad. So the current record in customer dim says
#1 — Prem — Hyderabad — MaleThen I move to Bangalore. Now how do I update this? So I say
# no — Name — Place — Sex — Current version
#1 — Prem — Hyd — M — No
#2 — Prem — Blr — M — YesWhy should I have the old record?
There might be a case where you want to know all the sales that was done by customers in Hyderabad last year. And Prem had a purchase in Hyderabad last year. I want this to be tracked. I dont want the sale to point to Prem living in Bangalore, but in Hyderabad.This way, if Prem moves 10 times, we will have 10 records in customer each pointing to a different Prem as per his location. This will keep the data accurate.
3. Type 3
Here instead of having all the history, we keep selected history data. Like 3 versions. The last 3 location moves of Prem will be tracked, not more!Okay, what are rapidly changing dimensions?
Any dimension whose attributes change too rapidly. For example, credit rating of Prem. Assume that your rating agency rates people every 3 months (assume for example purpose), then every 3 months Prem can have a possibly different rating!# no — Name — Place — Sex — Cr rate –Current version
#1 — Prem — Hyd — M — AAA –No
#2 — Prem — Blr — M — AA –No
#2 — Prem — Blr — M — A –No
#2 — Prem — Blr — M — B –YesSo if you have around 3000000 customer records and if they keep changing every quarter as they do here, customer dimension will grow exponentially. That defeats the purpose of a warehouse, effective querying.
What do we do?
Remove the rapidly changing attributes and make a new dimension out of it. Pull the Credit rating and make a dimension out of it and call it credit rating behaviour dimension.So the customer will not grow exponentially. The rating dimension will. But this will be used only by queries that need rating info. Any query that doesnt need rating info and only other customer info will go to the original customer dimension.



