Column Store Index, which improves performance of data warehouse queries several folds, was first introduced in SQL Server 2012. Though it had several limitations, now SQL Server 2014 enhances the columnstore index and overcomes several of those earlier limitations. In this article series I am going to talk about how you can get started with using this enhanced columnstore index feature in SQL Server 2014 and do some performance tests to understand the benefits.
If you are new to columnstore index, you can refer my earlier articles on the topic to understand it better.
- Understanding New Columnstore Index of SQL Server 2012
- Getting Started with the New Columnstore Index of SQL Server 2012
- New Enhanced Columnstore Index in SQL Server 2014 – Part 1
Support for Clustered Columnstore Index
With SQL Server 2012, you are allowed to create only one non-clustered columnstore index on a table (please note, non-clustered columnstore index makes the table read only). Though this feature is still supported in SQL Server 2014, you can now create a clustered columnstore index on a table. The good news is that, this new clustered columnstore index supports issuing some INSERT, UPDATE, DELETE statements on the table on which it exists (no more tedious workaround is required for writing data to a table with columnstore index in this release like previous releases). You can learn more about this new index type in my earlier article.
How DML Works on a Table with Clustered Columnstore Index
When we create a columnstore index, table rows are first divided into row groups (~one million rows in each row group; higher value chosen for better compression ratio). Each row group is further encoded and compressed independently and in parallel, producing one compressed column segment for each column included in the index. For columns that use dictionary encoding the conversion may also produce a number of dictionaries. These column segments and dictionaries are stored as blobs on the disk. You can learn more about this new storage structure in the MSDN article, 'Columnstore Indexes.'
To make clustered columnstore indexes updateable, two new components are added: delete bitmaps and delta stores.
New rows are inserted into a delta store (new delta store gets created automatically for storing inserted rows), which is a traditional B-tree row store. An index may have multiple delta stores. Delta stores are transparently included in any scan of the columnstore index. A delta store is closed when the number of rows it contains reaches a predefined limit or else it remains in an open state to accept incoming rows. For closed delta stores, SQL Server internally keeps on checking for it; this task is called Tuple Mover, and converts them to columnar storage format one delta store at a time and then that specific delta store is de-allocated. This whole process is applicable for non-bulk inserts; for bulk inserts, rows are directly converted to columnar format in batches instead of first inserting it in delta store. The process buffers rows until a sufficient number of rows has accumulated, converts them into columnar format, and writes the resulting segments and dictionaries to disk.
For record deletion, a record containing its row ID is inserted into the delete bitmap. If the record exists in a delta store, the row is simply deleted from there. Now during scan delete bitmap is consulted to disqualify rows that have been deleted from the result set to return.
An update operation is treated as a combination of a delete and an insert operations. That is to say, the old row is marked as deleted (as mentioned above) in the delete bitmap and the new row is inserted into a delta store (as mentioned above).
Enhancement in Batch Mode pProcessing
Batch mode processing was first introduced in SQL Server 2012 though only a subset of the query operators (scan, filter, project, hash (inner) join and (local) hash aggregation) are supported. The limitation with hash operator in SQL Server is that a table has to fit entirely in memory and it has support for inner join only.
SQL Server 2014 now handles the full spectrum of SQL Server join types (inner, outer, semi- and anti-semi joins), union all, and scalar aggregation. It also adds enhancements to batch-mode hash join by adding spilling data functionality (temporarily to disk) for cases when a hash table does not fit entirely into memory.
If you are new to the batch mode processing concept, for your information, SQL Server traditionally uses a row-at-a-time execution model, which means a query operator processes one row at a time whereas in batch mode processing it allows processing a large number of rows (batch typically consists of around a thousand rows) as a batch at a time.
In this article, I talked about clustered columnstore index and how update happens to it, plus what enhancements have been done in batch mode processing; in my next article in the series I am going to do a complete demo of using this feature along with some performance tests for evaluation different options.