-
What is a data warehouse?
Posted on May 25th, 2007 1 commentWhat is a datawarehouse?
In one line, it is a repository of several years of data using which reports can be created for business purposes (to take decisions).
It is a database specifically modeled and fine-tuned for analysis and decision making.
Data is extracted from heterogeneous sources as they are generated and stored centrally. This makes it much easier and more efficient to run queries over data that originally came from different sources.
Why do you need a DWH? Can’t the reports be created directly from daily transactional (operational) data?
Hmmm.. you can.. but it is not easy. Here is why.
Consider a huge car rental company with a fleet of 150,000 cars. They have a daily rent-out ratio of 50% meaning 75,000 cars are rented out per day. The challenge is to have fleet availability information ready at hand. This will enable them to have the right fleet available at the right time.
For this, the operational data will have 75,000 transactions per day on an average. This system will feed data for things like (only few listed below)
1. reserve a car
2. process a request
3. generate an invoice
4. receive cash
5. cancellation of bookingNow, this system is designed for feeding data in. What we need is to use this data in an intelligent manner to analyze the business. Such analysis may need answers for
1. Which model of car is best rented?
2. Which region is most profitable?
3. Which area is the most problematic?
4. Can I get the entire sales data aggregated by region, by year, by car model?
5. Can I drill down to quarter wise data for the same report above?
6. What is the operating profit margin per model? or per region?
7. Can I flag regions that are not yet problematic, but are on the border?Now, my daily data is not optimized to give me answers for the above queries. This is because the data is stored as per the application and not by business area (subject like customer, product info, garage or warehouse, etc). If we have a invoicing application, it will have a merely invoiced data and same with the booking & reservation application. And there might be a third application like the one to track loans given to large corporate clients which may be in a different environment thats incompatible with the other 2 above.
Now how do I get the data for the loans given and also make it interact with the other applications? How do I see the loan payment of top 100 huge clients and see their past sales information and maybe also their preference over models? These are in different systems which may or may not be compatible. Even if they are compatible, making them interact is tough.
In most cases, data is not meaningfully represented and is dispersed. This will cause several issues like duplication of effort, poor productivity, quality and integrity issues, etc. Also the range of analysis available is limited.
What is the solution then?
To have an (all applications) integrated system that is grouped business wise and not applicaton wise. This data should be non-volatile (unchanging..because its history data unless you made a wrong entry) and growing everyday! And its separated from the transactional data and is the single point of contact for all your decision making information.
Wow, welcome to datawarehousing!!
Does that solve all my problems?
Probably not! We still have issues of integrating the data, making them interact without issues, sort out discrepancies, cleansing data, modelling the data for easier understanding and querying, having methods of loading the tables, etc.
How is the data fed into the warehouse?
Data from several transactional sources is merged and is fed into the warehouse tables that have been modelled using concepts like star schema, snowflake schema, etc. This data is extracted from source transaction data and appended into the warehouse. We now have a history database of the business.
Okay, let me talk about star schema briefly. It will have a central table called Fact table and peripheral tables called dimensions. See image below.
Fact tables will have numeric data and descriptions are contained in dimensions.Now, dimensions are subject wise data. It can be a customer, product, warehouse, etc. The above picture is called a datamart modelled on star schema. This datamart is far easier to comprehend and understand than the operational system. And now we can use this for easier querying. See these questions
1. Which product sold the most? Available here
2. Which customer group had which product preference? available here
3. What was the sales last year? available.And the user will understand this business driven subject wise grouped modelling better.
To create this datamart, we need to get all customer, product, supplier data from all (different) systems and load them into the respective tables. Then get the transaction data into the fact table and just join them with the respective dimensions using a foreign key (we call this surrogate key, will discuss in a later post). The transaction data can be at at the individual transaction level or it can be grouped as per need.
Now, what is the dimension exactly?
In my report, when I say what is cost per X, X can be the dimension. Cost per employee, cost per product, sales per month (time dimension), etc. A dimension is an attribute across which you review data.And what is the fact table?
Data that is numerical…the sales amount, the cost amount, inventory quantity, etc will form the fact table. There will be separate fact tables for inventory, sales revenue, sales orders, account receivables, account payables, etc.So?
Now, you can get the data from fact tables grouped across dimensions to get data you need. What is the account payable last month (grouped across time dimension) for this region (grouped across region)? What is sales for product A (grouped across product dim filtered for A) in 2006 (group time dim and filter for 2006) ?Did you get it?
If not, pls mail info AT biexplorer DOT com or leave a comment. Its a very confined space here and I have tried my best. Pls let me know what you think.Business Intelligence, Data Integration, Data warehouse, EIM, ETL, OLAP BI, Data Integration, DWH, EIM, ETL, OLAP, OLTPOne response to “What is a data warehouse?”
-
Anonymous July 30th, 2007 at 22:16
This is the best introduction to DW ever -
Thanks!
Leave a reply
You must be logged in to post a comment. -


