Thursday, January 29, 2009

Data Modeling - Normalization process (1NF,2NF,3NF)

What is Normalization ?
Normalization is the process of putting things right.

First normal form (1NF):
------------------------
Table faithfully represents a relation and has no "repeating groups"

Second normal form (2NF):
-------------------------
No non-prime attribute in the table is functionally dependent
on a part (proper subset) of a candidate key

Third normal form (3NF):
------------------------
Every non-prime attribute is non-transitively dependent on every key of the table

----------------------------------------------------------------------------------------

First Normal Form (1NF)
First normal form (1NF) sets the very basic rules for an organized database:

* Eliminate duplicative columns from the same table.
* Create separate tables for each group of related data and
identify each row with a unique column or set of columns (the primary key).


Second Normal Form (2NF)
Second normal form (2NF) further addresses the concept of removing duplicative data:

* Meet all the requirements of the first normal form.
* Remove subsets of data that apply to multiple rows of a table and place them in separate tables.
* Create relationships between these new tables and their predecessors through the use of foreign keys.

Third Normal Form (3NF)
Third normal form (3NF) goes one large step further:

* Meet all the requirements of the second normal form.
* Remove columns that are not dependent upon the primary key.

-----------------------------------------------------------------------------------------------------

Example:
--------
Bookshelf: Title, Author,Publisher,Category

Create a Entity called BOOKSHELF, it consits of following columns: Title,Author1,Author2,Auther3,Publisher,Category1,Category2,Category3

BOOKSHELF: TITLE,PUBLISHER,CATEGORY1,CATEGORY2,CATEGORY3,RATING,RATING_DESC
BOOKSHEL_AUTHOR: TITLE,AUTHOR_NAME
AUTHOR: AUTHOR_NAME , Comments

Above step refers to 1NF. regrouping of repeated groups into respective table.

Create a RATING Table with RATING,RATING_DESC , this step is called 2NF.

Remove category1,Category2,Category3 to seperate table called Category Table.

Create CATEGORY table with CATEGORY_NAME,PARENT_CATEGORY,SUBCATEGORY , this step is called 3NF.

Comparison of Bill Inmon and Ralph Kimball paradigm:

Comparison of Bill Inmon and Ralph Kimball paradigm:

In the data warehousing field, we often hear about discussions on where a person / organization's philosophy falls into Bill Inmon's camp or into Ralph Kimball's camp. We describe below the difference between the two.

Bill Inmon's paradigm: Data warehouse is one part of the overall business intelligence system. An enterprise has one data warehouse, and data marts source their information from the data warehouse. In the data warehouse, information is stored in 3rd normal form.

Bill Inmon: Endorses a Top-Down design
Independent data marts cannot comprise an effective EDW. Organizations must focus on building EDW.

"You can catch all the minnows in the ocean and stack them together and they still do not make a whale,"
Bill Inmon, January 8, 1998.


Ralph Kimball's paradigm: Data warehouse is the conglomerate of all data marts within the enterprise . Information is always stored in the dimensional model.
Kimball model also proposes the data warehouse bus architecture. This architecture is comprised of:
- A staging area (which can have an E/R or relationally designed 3NF design or flat file format), which cannot be accessed by an end-user of the data warehouse bus (Presentation Layer) .
-The Data Warehouse Bus itself which includes several atomic data marts, several aggregated data marts and a personal data mart but no single or centralized data warehouse component.
The Data Warehouse Bus:
- Is dimensional;
- Contains transaction and summary data;
- Includes data marts, which have single subject or fact tables; and
- Can consist of multiple data marts in a single data base.


There is no right or wrong between these two ideas, as they represent different data warehousing philosophies. In reality, the data warehouse in most enterprises are closer to Ralph Kimball's idea. This is because most data warehouses started out as a departmental effort, and hence they originated as a data mart. Only when more data marts are built later do they evolve into a data warehouse.

Ralph Kimball: Endorses a Bottom-Up design
EDW effectively grows up around many of the several independent data marts – such as for sales, inventory, or marketing

"...The data warehouse is nothing more than the union of all the data marts...,"
Ralph Kimball, December 29, 1997.

The main difference is Top-Down vs. Bottom-Up Approach.
Let us see the advantages & disadvantages:

Advantages of Top-Down Approach: [Bill Inmon Approach]:

  • A truly corporate effort, an enterprise view of data
  • Inherently architect ed- not a union of disparate DataMarts
  • Central rules and control
  • May be developed fast using iterative approach
Disadvantages of Top-Down [Bill Inmon Approach]:

  • Takes longer to build even with iterative method
  • High exposure/risk to failure
  • Needs high level of cross functional skills
  • High outlay without proof of concept
  • Difficult to sell this approach to senior management and sponsors
Advantages of Bottom-Up Approach [Ralph Kimball Approach]:
  • Faster and easier implementation of manageable pieces
  • Favorable ROI and proof of concept
  • Less risk of failure
  • Inherently incremental; can schedule important DataMarts first
  • Allows project team to learn and grow
Disadvantages of Bottom-Up Approach [Ralph Kimball Approach]:
  • Each DataMart has its own narrow view of data
  • Permeates redundant data in every DataMart
  • Difficult to integrate if the overall requirements are not considered in the beginning
Kimball encourages to have perspective of both “vertically” and “horizontally”
while gathering business requirements while developing/designing data marts using Bus Architecture to integrate in to Enterprise Data warehouse.

Vertical
  • Don’t just rely on the business data analyst to determine requirements
  • Inputs from senior managers about their vision, objectives, and challenges are critical
  • Ignoring this vertical span might cause failure in understanding the organization’s direction and likely future trends
Horizontal
  • Look horizontally across the departments before designing the Data Warehouse
  • Critical in establishing the enterprise view i.e Bus Architecture
  • Challenging to do if one particular department if funding the project. [Even one departments funds this project , but all the departments together become the company]
  • Ignoring horizontal span will create isolated, department-centric databases that are inconsistent and can’t be integrated while considering for Enterprise Data Warehouse
  • Complete coverage in a large organization is difficult , but CEO or CTO represents all of the departments. They should have the proper understanding of the full business and recommend what is good for whole company not just one department.
  • One rep. from each dept. interacting with the core development team can be of immense help
  • Creating a Business Decision Making group when there is a difference in the perspective from each department would help in resolve these kind of conflicts.
New Practical approach by Kimball
  • Plan and define requirements at the overall corporate level
  • Create a surrounding architecture for a complete warehouse
  • Conform and standardize the data content
  • Implement the Data Warehouse as a series of Supermarts, one at a time
[Notes: Soon I will have detailed steps & process to be followed by Kimball , please check back or subscribe to my blog]

SUPERMARTS

  • Totally monolithic approach vs. totally stovepipe approach
  • A step-by-step approach for building an Enterprise Data Warehouse from granular data
  • A Supermart s a data mart that has been carefully built with a disciplined architectural framework
  • A Supermart is naturally a complete subset of the Data Warehouse.
  • A Supermart is based on the most granular data that can possible be collected and stored
  • Conformed dimensions and standardized fact definitions
Data Warehouse to Data marts


Operational Source System (OSS) To Directly transforms data to Enterprise Data Warehouse Let us call -- ETL Process (1) .

Enterprise Data warehouse to Data Marts
Let us call -- ETL Process (2).






Enterprise Data Warehouse











[Further comparison will be added often ]

Wednesday, January 28, 2009

Fact Table Comparisons and what is better for your business?

Fact Table Comparisons and what is better for your business?

There are three type of Fact Tables:
1. Transaction Fact Tables ( Transaction Grain)
2. Periodic Snapshot Fact Tables ( Periodic Snapshot Grain)
3. Accumulating Snapshot Fact Tables ( Accumulating Snapshot Grain)

First question to ask yourself , what is your main business ? and what Industry you are in ?

Retail Business , Inventory , Procurement , Order Management , Customer Relationship management, Accounting , Human Resource Management , Financial Service , Telecommunication , Transportation , Education , Health Care , E Commerce , Insurance , Manufacturing , etc.

Second Question:
What do you want to achieve in the Data warehouse / DSS (Decision Support System) / Business Intelligence (BI) system ?
Determine the goal of your Data Warehouse.

Third Question:
Who are all the users of the data warehouse / BI System ? And their roles such as CFO , CTO , VP(Sales & Marketing) , Sales Manager , Production Manager , Area Manager , Business Analyst , Business user , IT Support , etc.

Fourth question:

What kind of reports expected out of this new data warehouse ? And also if the users already using some excel type reports manually prepared by Business users to report to their higher officials. List out all of them & if you already have the samples keep it.

next , list out all of the business process , identify what are the measures /facts used in those reports. If there are any calculations involved list each one of them. And identify how often it changes , where is the source system it captures these measures and also changing criteria. This step is to identify the grain i.e Level of detail specified business process can't be divide further. (Lowest Level of detail)
Ex: Order processing may have Customer , Product , Order Date , Order Number .. etc , these become dimensions in the dimension model

Next , against each business process list out the measures / facts and identified Dimensions in the previous step.

Ex: Order Quantity , Product Unit Price [ Let us just consider this as measure but it is really Non-Additive measure]

[I will continue tomorrow]

What about Datawarehouse / Business Intelligence BI in the internet like other web application?

I was thinking about creating datawarehouse / Business Intelligence (BI) System / Decision Support System on-line meaning in the internet like ebusiness web application, where users / small to medium sized companies could benefit from this.

What everyone think about this? I need real inputs to this.

Please feel free to add your comments, suggestions.