POS Applications, HR Applications, Customer Survey results; these are just some of the myriad sources of data that we are, as database administrators or developers responsible for - how do we store that data so that a Business Intelligence system can interrogate and provide meaningful results to the user.
Business Intelligence - Data Storage
POS Applications, HR Applications, Customer Survey results; these are just some of the myriad sources of data that we are, as database administrators or developers responsible for and that a business intelligence system can consume. Within Business Intelligence systems, these sources of data are usually encompassed in to two main types of storage systems to provide historical, current, and predictive views of business operations.
The Data Warehouse
In a repository of an organization's electronically stored data, data warehouses are designed to facilitate reporting and analysis. The need for a data warehouse is driven by an organization's need for reliable, consolidated, unique and integrated reporting and analysis of its data, at different levels of aggregation. The practical reality of most organizations is that their data infrastructure is made up by a collection of heterogeneous systems. For example, an organization might have one system that handles service quality, one that handles employees, and others that handle sales data or production data. In practice, these systems are often poorly or not at all integrated and simple questions like "How many customers are complaining about branch A and have a credit account and which employees are being targeted" can be very hard to answer, even though the information is available "somewhere" in the different data systems.
It is partly the purpose of data warehousing to bridge such problems but also to make data appear consistent, integrated and consolidated despite the problems in the underlying source systems. The data warehouse achieves this by employing techniques, creating new data repositories (i.e. the data warehouse) whose data model(s) support the needed reporting and analysis.
A data mart should be viewed as a subset of an organizational data store, usually oriented to a specific purpose or major data subject that may be distributed to support business needs. Data marts are analytical data stores designed to focus on specific business functions for a specific community within an organization. Data marts are often derived from subsets of data in a data warehouse, though in the bottom-up data warehouse design methodology the data warehouse is created from the union of organizational data marts.
Reasons for creating a data mart
- Access to frequently needed data
- Creates collective view by a group of users
- End-user response time
- Ease of creation
- Lower cost than implementing a full Data warehouse
- Users are more clearly defined than in a full Data warehouse
The Real World
In practice, the terms data mart and data warehouse each tend to imply the presence of the other in some form. However, most writers using the term seem to agree that the design of a data mart tends to start from an analysis of user needs and that a data warehouse tends to start from an analysis of what data already exists and how it can be collected in such a way that the data can later be used. A data warehouse is a central aggregation of data (which can be distributed physically); a data mart is a data repository that may or may not derive from a data warehouse and that emphasizes ease of access and usability for a particular designed purpose. In general, a data warehouse tends to be a strategic but somewhat unfinished concept; a data mart tends to be tactical and aimed at meeting an immediate need.
One writer, Marc Demarest, suggests combining the ideas into a Universal Data Architecture (UDA). In practice, many products and companies offering data warehouse services also tend to offer data mart capabilities or services.
Other emerging technologies are being combined into new tools and software, which enables dynamic querying of real time data utilizing SOA to develop composable and adaptive middleware - there is a drive to name this type of BI as Business Intelligence 2.0. This renaming is has been attributed by Neil Raden as an attempt to imply a move away from standard storage systems (Data Warehouse and Data Mart) that current systems employ.
The star schema (sometimes referenced as star join schema) is the simplest style of data warehouse schema. The star schema consists of a few fact tables (possibly only one, justifying the name) referencing any number of dimension tables. The star schema is considered an important special case of the snowflake schema.
Dimension tables have a simple primary key, while fact tables have a set of foreign keys, which make up a compound primary key consisting of a combination of relevant dimension keys. It is common for dimension tables to consolidate redundant data in the most granular column, and is rendered in second normal form. Fact tables are usually in third normal form because all data depends on either one dimension or all of them, not on combinations of a few dimensions. The star schema is a way to implement multi-dimensional database (MDDB) functionality using a mainstream relational database: given the typical commitment to relational databases of most organizations, a specialized multidimensional DBMS is likely to be both expensive and inconvenient. The facts that the data warehouse helps analyze are classified along different dimensions: the fact tables hold the main data, while the usually smaller dimension tables describe each value of a dimension and can be joined to fact tables as needed. Another reason for using a star schema is its simplicity from the users' point of view: queries are never complex because the only joins and conditions involve a fact table and a single level of dimension tables, without the indirect dependencies to other tables that are possible in a better normalized snowflake schema.
Business Intelligence, the primary consumer of Star Schema data, is also best expressed in Business English, not programmer's dialect. The aggregate navigators (OLAP) tools that are common in the industry do not need to subsequently rename the elements, because they are already in proper business English. Most SQL database engines allow schemata descriptors, and also permit decoration suffixes on surrogate keys columns. Using square brackets, which are physically easier to type on the keyboard (no shift key needed) are not intrusive and make the code easier to read.
For example, the following query extracts how many SUVs have been sold, for each brand and country, in 1997.
SELECT Brand, Country, SUM ([Units Sold]) FROM Fact.Sales (NOLOCK) JOIN Dim.Date (NOLOCK) ON Date_FK = Date_PK JOIN Dim.Store (NOLOCK) ON Store_FK = Store_PK JOIN Dim.Product (NOLOCK) ON Product_FK = Product_PK WHERE [Year] = 1997 AND [Product Category] = 'SUV' GROUP BY Brand, Country
A snowflake schema is a logical arrangement of tables in a multidimensional database such that the entity relationship diagram resembles a snowflake in shape. Closely related to the star schema, the snowflake schema is represented by centralized fact tables, which are connected to multiple dimensions. In the snowflake schema, however, dimensions are normalized into multiple related tables whereas the star schema's dimensions are denormalized with each dimension being represented by a single table. When the dimensions of a snowflake schema are elaborate, have multiple levels of relationships, and where child tables have multiple parent tables ("forks in the road"), a complex snowflake shape starts to emerge. The "snowflaking" effect only affects the dimension tables and not the fact tables.
Data normalization and storage
Normalization splits up data to avoid redundancy (duplication) by moving commonly repeating groups of data into a new table. Normalization therefore tends to increase the number of tables that need to be joined in order to perform a given query, but reduces the space required to hold the data and the number of places where it needs to be updated if the data changes. From a space storage point of view, the dimensional tables are typically small compared to the fact tables. This often removes the storage space benefit of snowflaking the dimension tables, as compared with a star schema. Some database developers compromise by creating an underlying snowflake schema with views built on top of it that perform many of the necessary joins to simulate a star schema. This provides the storage benefits achieved through the normalization of dimensions with the ease of querying that the star schema provides. The tradeoff is that requiring the server to perform the underlying joins automatically can result in a performance hit when querying as well as extra joins to tables that may not be necessary to fulfill certain queries.
Benefits of "snowflaking"
- Some OLAP multidimensional database modeling tools that use dimensional data marts as a data source are optimized for snowflake schemas.
- If a dimension is very sparse, (i.e. most of the possible values for the dimension have no data) and/or a dimension has a very long list of attributes, which may be used in a query, the dimension table may occupy a significant proportion of the database and snowflaking may be appropriate.
- A multidimensional view is sometimes added to an existing transactional database to aid reporting. In this case, the tables, which describe the dimensions, will already exist and will typically be normalized. A snowflake schema will therefore be easier to implement.
- A snowflake schema can sometimes reflect the way in which users think about data. Users may prefer to generate queries using a star schema in some cases, although this may or may not be reflected in the underlying organization of the database.
- Some users may wish to submit queries to the database which, when using conventional multidimensional reporting tools, cannot be expressed within a simple star schema. This is particularly common in data mining of customer databases, where a common requirement is to locate common factors between customers who bought products meeting complex criteria. Some snowflaking would typically be required to permit simple query tools to form such a query, especially if provision for these forms of query weren't anticipated when the data warehouse was first designed.
Which schema to use
Your decision whether to employ a star schema or a snowflake schema should consider the relative strengths of the database platform in question and the query tool to be employed. The Star schema should be favored with query tools that largely expose users to the underlying table structures and in environments where most queries are simpler in nature. The Snowflake schema is often better with more sophisticated query tools that isolate users from the raw table structures and for environments having numerous queries with complex criteria.