Column Store Index is a new feature in SQL Server 2012 that improves performance of data warehouse queries several folds. Unlike regular indexes or heaps, which store data in B-Tree structure (in row-wise fashion), the column store index stores data in columnar fashion and uses compression aggressively to reduce the disk I/O needed to serve the query request. This article discusses the new Column Store Index in detail, how it differs from regular indexes, why and how it improves the performance of the same query by several folds if column store index is used, where it can be used and what its limitations are.
Understanding Column Store Index in SQL Server 2012
We all agree on the fact that hardware speed and capacity has increased from hundreds to thousands times in last 20-30 years but disk I/O(Input/Output) or disk access or data transfer rate has not grown proportionately and is still the slow. The SQL Server team realized this fact and came up with a new column store index type in SQL Server 2012 which greatly reduces the disk I/O needed to serve the request of the data warehousing queries by storing data in columnar (also called column store) fashion instead of traditional row store (which B-Tree and heap uses).
Now wait, you must be wondering how come a different physical layout representation of the same data (storing same data in columnar format instead of traditional row-wise format) can reduce disk I/O significantly and can improve the performance of the data warehousing queries up to 100 times or even more. Well if you have this question then I am completely with you on this but before I answer this simple but worthwhile question let me first talk about row-store storage vs. column-store storage.
The data in B-Tree and heap gets stored in row-wise (also called row-store) fashion, which means data from all the columns of a row are stored together contiguously on the same page. For example consider a table with ten columns (C1 to C10), so if you look at the image below the data of all the ten columns from each row gets stored together contiguously on the same page.
When we create column store index, the data is stored in columnar (also called column-store index) fashion, which means data of each individual column from each rows is stored together on same page. For example if you look at the image below, the data of column C1 of all the rows (or as much as can fit on a page) gets stored together on one page whereas the data for column C2 of the all the rows (or as much as can fit on a page) gets stored on another page and so on.
How Column Store Improves the Performance of the Query
Now coming back to the question we raised in last section; how come a different physical layout representation of the same data (storing same data in columnar format instead of traditional row-wise format) can reduce disk I/O significantly and can improve the performance of the query up to 100 times or even more.
Well there are basically two main reasons for this performance improvement with respect to data storage, apart from the new batch mode processing (Query Optimizer and Query Execution engine has been enhanced to use batch mode processing for column store index, which in turns uses a new iterator model for processing data a batch at a time instead of a row at a time, which is optimized for multicore CPUs and increased memory throughput of modern hardware architecture).
The first one lets you pull only data requested from disk to memory, thus reducing disk I/O and the second one lets you use compression significantly so that the same data is stored on fewer pages and hence less disk I/O. In addition, a larger fraction of the data can reside in a given size of memory. Minimizing disk I/O can significantly speed up query response time by retaining more of your working set of data in memory, which will eventually speed up response time for subsequent queries that access the same data (improves buffer cache hit ratio).
Transfer Only Needed Data from the Disk to Memory to CPU
Let's consider a query (SELECT C1, C2, C3 FROM T1) from the table in the above image; if data is stored in the row-store all the disk pages will be brought into memory but as these disk pages also contain data of other columns (C4-C10), the number of pages that need to be brought into memory is significantly higher. It means we are bringing the data into memory, which we actually don't require for this query.
Now let's take the same query when the data is stored in column-store index, in this case all disk pages will not be brought into memory as these pages contain data for individual columns only, pages which contain data for columns C1, C2 and C3 will be brought into the memory. That means we are bringing only data (C1, C2 and C3), which is actually required to serve the request and not the data of other columns like C4-C10 (disk pages containing the other columns data are simply ignored). This way, the number of pages that are transferred from disk to memory is significantly less.
This is a very simple example; consider a data warehouse fact table with dozens of columns but only a few of them (10%-20%) are required to serve the request; in that kind of scenario, disk I/O overhead can be significantly reduced if the column store index is used.
Aggressively Using Compression to Reduce the Amount of Storage Space Needed to Store the Same Set of Data
The column-store index groups and stores data from each column on its own page separately, which means a page will either contain data of C1 or C2 or C3 and so on unlike row store index where data of all the columns (C1 to C10) are stored together contiguously on the same page.
As data gets stored from a single column on a page and a column is more likely to have repeating data values (more data redundancy) than a complete row, the page can be compressed to a greater extent. For example, consider a column called City in a table, now with column store index the page contains data of City column only and the chance to contain many occurrences of a city called “New York” (more data redundancy) is greater than the row store index page, which contains data of all the columns (Name, Address, City, Country, Zip etc.). Storage of all the columns on a single page reduces repetitions and ultimately the compression ratio, unlike a page of column store index, which has higher redundancy and greater compression ratio.
As per a performance improvement study done by Microsoft on a 32-logical processor machine with 256GB of RAM on a table with 1 TB of data and 1.44 billion rows and stunning result published in this white paper; the queries gained a 16X speed-up in CPU time and a whopping 455X improvement in elapsed time. In real terms it means a query that took 501 seconds originally (when using no column store index) was reduced to merely 1.1 seconds (when using column store index).
When to Use Column Store Index
Column store index has been designed to substantially accelerate common data warehouse queries, which require scanning, aggregation and filtering of large amounts of data or joining multiple tables like a star schema. With column store index, you can get interactive response time for queries against billions of rows on an economical SMP server with enough RAM to hold your frequently accessed data.
Creating a columnstore index is a parallel operation, subject to the limitations on the number of CPUs available and any restrictions set on MAXDOP setting.
You can have both row store index and column store index on the same table. As I said before, the Query Optimizer will decide when to use the column store index and when to use other types of indexes. The Query Optimizer will also choose when to use the new batch execution mode and when to use row execution mode during usage of column store index.
Column store index might not be suited for OLTP load for the following reasons and you might continue using row store indexes in those scenarios:
- Creating a column store index makes a table read-only table and update is not easy whereas updating data is simpler in row store indexes.
- Column store index might not be the ideal choice for selective queries, which touch only one (or a few) rows or queries that lookup a single row or a small range of rows. In this scenario, row store index will perform better.
Limitations of Using Column Store Index
We can create only one non-clustered column store index on a table although we can include all the columns of the table in the single index (recommended too). When we create a column store index it makes the table read only; hence it's not suited for OLTP load where updates are high but it can be used in data warehouse fact tables or on dimension tables (if it contains millions of records) where the number of columns are really big and in a normal query only 10-20% of them are required to serve the request.
To update or load the data in a table with column store index, you can switch in a partition, or disable/drop the columnstore index, update/load the data in table, and rebuild the column store index. Creating a column store index might take slightly longer than creating clustered row store index on the same set of data, as extra CPU cycles are required for compression and all. Though with column store index, you can do aggregation on the fly and hence aggregations (pre-built summary aggregates, whether they are indexed views, user-defined summary tables, or OLAP cube) are not required to be created during data load, this time saving will offset the extra time needed for creating column store index and overall data load or ETL might reduce.
- A column store index can be created on a table only; creation on indexed view is not allowed
- A column store index can be non-cluster only
- Only one column store index is allowed per table (though you can include all the columns of the table in single index; up to 1024 columns limit applies)
- A column store index cannot be a filtered index
- A column store index cannot include computed or sparse columns
- A column store index cannot be unique and cannot be used as primary or foreign key
- A column store index definition cannot be changed using ALTER INDEX command; for the index definition change, it has to be dropped and created
- A column store index cannot be created using ASC/DESC or INCLUDE keywords
- A column store index on a partitioned table must be partition-aligned
- A column store cannot be combined with
- PAGE or ROW compression
- Change Tracking
- Change Data Capture
- There are certain data types that are not supported and hence any column of that type cannot be part of a column store index like binary, varbinary, text, ntext, image, varchar(max), nvarchar(max), uniqueidentifier, rowversion, timestamp, decimal and numeric data types with precision greater than 18 digits, common language runtime and xml; apart from these, the rest of the other data types are supported. For the latest update on supported and unsupported data types click here.
Column Store Index is a new feature in SQL Server 2012 that improves performance of data warehouse queries several folds. In this article I talked in detail about the new Column Store Index, how it differs from regular indexes, why and how it improves the performance of the same query by several folds if column store index is used, where it can be used and what are its limitations. In my next article I am going to take this talk to another level and show you how you can create column store index, how you can use index query hint to include or exclude a column store index, how the performance differs when using column store index vs. row store index and more.