Getting Started with Azure SQL Data Warehouse - Part 5

Monday May 22nd 2017 by Arshad Ali

In part 5 of this series covering Azure SQL Data Warehouse, Arshad Ali covers performance optimization and the different aspects that impact performance.


Azure SQL Data Warehouse is Microsoft’s new enterprise-class, elastic petabyte-scale, data warehouse service that can scale according to organizational demands in just few minutes. In my last article of the series, I discussed round-robin and the distributed table, how partitioning works in SQL Data Warehouse and looked at the impact of choosing the right distribution key. I also talked about PolyBase and how you can leverage it to import data-in or export data-out from SQL Data Warehouse very easily and quickly. In this article, I am going to cover performance optimization in SQL Data Warehouse and the different aspects that impact performance.

Importance of Statistics

Statistics play a significant role in query performance. In SQL Data Warehouse, the Query Optimizer uses statistics to create distributed query execution plans that improve query performance; this includes estimating the cost of both data movement and query operations. 

Statistics refers to the statistical information about the distribution of values in one or more columns of a table. The SQL Data Warehouse Query Optimizer uses this statistical information to estimate the cardinality, or number of rows, in the query result to be returned and enable the Query Optimizer to create a high-quality distributed query execution plan. For example, based on this statistical information the Query Optimizer might decide whether to use index seek operator or more resource-intensive index scan operator to provide optimal query performance.

In SQL Data Warehouse, statistics are maintained at both control node as well as on each of the compute nodes (distributions) to improve query performance by minimizing the data movement in the distributed query execution plan.

  • The control node keeps statistics for distributed tables by merging statistics-related contents of a table residing on all the distributions and storing the consolidated statistics information with it.
  • Distributions maintain statistics information about the data that it contains (the portion of the distributed table) and uses them to improve query performance for the queries that run on that specific distribution.

As of this writing, SQL Data Warehouse does not automatically create or update statistics, unlike SQL Server, and hence we need to setup a process to use CREATE STATISTICS and UPDATE STATISTICS commands to create and update statistics in order to ensure query optimizer uses these statistics for a distributed execution plan, including data movement operations. It’s recommended to create statistics on distributed tables for all the strategic columns if they are used in:

  • Join predicates
  • WHERE clauses
  • GROUP BY clauses
  • ORDER BY clauses

Statistics become stale over the period of time when significant changes are done with the data in the table. In this case, you need to setup a process to update statistics after significant changes are applied to the data. But please, be careful and don't update statistics too often. Updating statistics on a table causes a query to recompile if that specific query refers that specific table and unlike index rebuild, which can be done for each specific partition, statistics update is done for the complete table and in some cases where a table contains a huge volume of data it might take a very long time. In order to identify when update statistics are needed, there are a couple of Dynamic Management Views (DMVs) or Catalog Views in SQL Data Warehouse. You can use these queries to identify the percentage of mismatch and if the percentage is beyond some threshold (like 5% or 10% - you can decide on it based on your workload) you can consider updating it.

To help you automate the process of managing statistics during automated data load, I have created a small framework to manage (recreate or update) statistics on the SQL Data Warehouse tables. You need to call the below stored procedure and pass on the required parameter, like schema name, table name and type of action (whether to drop/create or update) you want to perform.

--To drop existing and creating new statistics
EXEC [dbo].[uspManageStatistics] @SchemaName = 'dbo', @TableName = 'FactInternetSalesReason', @ActionType = 'CREATE'
--To update statistics
EXEC [dbo].[uspManageStatistics] @SchemaName = 'dbo', @TableName = 'FactInternetSalesReason', @ActionType = 'UPDATE'

The above stored procedure basically refers a meta-data tables ([dbo].[StatisticsMetadata]) to drop and create statistics on a table. You can specify all the columns of the table for which you want statistics to be maintained in this meta-data table. In the absence of the meta-data for a given table, it creates statistics on all the columns of the specified table. This way you can start creating statistics for all the columns in the table and as you eventually learn, you can add strategic columns to the meta-data table to manage statistics for those selected columns only (this way you can avoid additional overhead in maintaining statistics for columns that don’t impact on query performance).

The script below creates tables for creating meta-data and execution meta-data tables for this framework.

CREATE TABLE [dbo].[StatisticsMetadata] (
    [StatisticsMetadataId] int NULL, 
    [SchemaName] varchar(255), 
       [TableName] varchar(255), 
    [ColumnName] varchar(255), 
    [ScanType] char(10), 
    [IsActive] bit NULL
CREATE TABLE [dbo].[StatisticsExecutionMetadata] (
       [SchemaName] varchar(255),
    [TableName] varchar(255), 
    [ColumnName] varchar(255), 
    [ActionType] char(10), 
    [ScanType] char(10), 
    [CreatedOn] datetime NULL

The above stored procedure logs the operations as it performs in the execution meta-data table ([dbo].[StatisticsExecutionMetadata]) in case you want to reference them for auditing purposes. This is very helpful as it keeps track of existing statistics on a table as well as tracks time when a specific statistic was created or updated.

The script below contains code for the stored procedure used for managing statistics; please feel free to modify it to suit to your specific requirement.

CREATE PROC [dbo].[uspManageStatistics]  @SchemaName [VARCHAR](255), @TableName   [VARCHAR](255), @ActionType [CHAR](10) AS
PROCEDURE NAME:      [uspManageStatistics] 
DESCRIPTION:         It is used to Manage the statistics for the   specified table,  
You need to first enter records in [dbo].[StatisticsMetadata]    table for your table and columns for which you need to manage 
the statistics and type of manage (FULLSCAN or SAMPLE). Next you   need to call this stored procedure to manage the 
statistics - ActionType = CREATE - drops and creates statistics:
EXEC [dbo].[uspManageStatistics] @SchemaName = 'dbo', @TableName   = 'FactInternetSalesReason', @ActionType = 'CREATE'
You can use ActionType = UPDATE to update table statistics   instead of DROP\CREATE:
EXEC [dbo].[uspManageStatistics] @SchemaName = 'dbo', @TableName   = 'FactInternetSalesReason', @ActionType = 'UPDATE'
       DECLARE @StatisticsMetadataId INT, @ColumnName VARCHAR(255), @ScanType CHAR(10)
       IF @ActionType = 'UPDATE' 
              SET @ScanType = 'FULLSCAN'
              SET @SQLString = 'UPDATE STATISTICS ' + @SchemaName + '.' + @TableName + ' WITH ' + @ScanType + ';'
              PRINT @SQLString
              --EXECUTE sp_executesql @SQLString
              INSERT INTO [dbo].[StatisticsExecutionMetadata] ([SchemaName], [TableName], [ColumnName], [ActionType], [ScanType], [CreatedOn])
              VALUES (@SchemaName, @TableName, NULL, @ActionType, @ScanType, @CreatedOn)
       ELSE IF @ActionType = 'CREATE'
              --Dropping statistics
              IF OBJECT_ID('tempdb..#DropStatsList') IS NOT NULL
                     DROP TABLE #DropStatsList
              CREATE TABLE #DropStatsList
              ([DropStatsCommand]  VARCHAR       (MAX))
              INSERT INTO #DropStatsList([DropStatsCommand]) 
              SELECT 'DROP STATISTICS ' + @SchemaName + '.' + + '.' +  + ';'
              FROM SYS.TABLES ST
              INNER JOIN sys.schemas sch on st.schema_id = sch.schema_id
              INNER JOIN sys.stats ss on st.object_id = ss.object_id 
              INNER JOIN sys.stats_columns sc on sc.stats_id =ss.stats_id and sc.object_id = ss.object_id
              INNER JOIN sys.columns c on C.object_id = st.object_id and sc.column_id = c.column_id
              WHERE User_created=1 AND = @SchemaName AND = @TableName 
              --SELECT * FROM #DropStatsList
              WHILE EXISTS (SELECT TOP 1 1 FROM #DropStatsList)
                     SET @SQLString = (SELECT TOP 1 [DropStatsCommand] FROM #DropStatsList)
                     --PRINT @SQLString
                     EXECUTE sp_executesql @SQLString
                     DELETE #DropStatsList WHERE [DropStatsCommand] = @SQLString
              --Creating statistics
              IF EXISTS (SELECT TOP 1 1 FROM [dbo].[StatisticsMetadata] WHERE [SchemaName] = @SchemaName AND [TableName] = @TableName  AND IsActive = 1)
                     IF OBJECT_ID('tempdb.dbo.#OperationItem') IS NOT NULL
                           DROP TABLE #OperationItem
                     CREATE TABLE #OperationItem
                           [StatisticsMetadataId] INT,
                           [ColumnName] [varchar](255),
                           [ScanType] [char](10),
                           [IsProcessed] [bit]
                     INSERT INTO #OperationItem ([StatisticsMetadataId], [ColumnName], [ScanType], [IsProcessed]) 
                     SELECT [StatisticsMetadataId], [ColumnName], [ScanType], 0
                     FROM [dbo].[StatisticsMetadata] 
                     WHERE [SchemaName] = @SchemaName AND [TableName] = @TableName  AND IsActive = 1
                     WHILE EXISTS (SELECT TOP 1 [StatisticsMetadataId] FROM #OperationItem WHERE [IsProcessed] = 0)
                           SET @StatisticsMetadataId = (SELECT TOP 1   [StatisticsMetadataId] FROM #OperationItem WHERE [IsProcessed] = 0)
                           SET @ColumnName = (SELECT TOP 1 [ColumnName] FROM #OperationItem WHERE   [StatisticsMetadataId] = @StatisticsMetadataId)
                           SET @ScanType = (SELECT TOP 1 [ScanType] FROM #OperationItem WHERE   [StatisticsMetadataId] = @StatisticsMetadataId)
                           IF @ScanType = 'FULLSCAN'
                                  SET @SQLString = 'CREATE STATISTICS stat_' + @SchemaName + '_' +  @TableName + '_' + @ColumnName + '
                                                              ON ' + @SchemaName + '.'+ @TableName + ' (' + @ColumnName + ') WITH ' + @ScanType + ''
                                  SET @SQLString = 'CREATE STATISTICS stat_' + @SchemaName + '_'  + @TableName + '_' + @ColumnName + '
                                                              ON ' + @SchemaName + '.'+ @TableName + ' (' + @ColumnName + ')'
                           --PRINT @SQLString
                           EXECUTE sp_executesql      @SQLString
                           UPDATE #OperationItem
                           SET [IsProcessed] = 1
                           WHERE [StatisticsMetadataId] =   @StatisticsMetadataId
                           --INSERT INTO   [dbo].[StatisticsExecutionMetadata] ([SchemaName], [TableName], [ColumnName],   [ActionType], [ScanType], [CreatedOn])
                           --VALUES (@SchemaName, @TableName,   @ColumnName, @ActionType, @ScanType, @CreatedOn)
                     INSERT INTO [dbo].[StatisticsExecutionMetadata] ([SchemaName], [TableName], [ColumnName], [ActionType], [ScanType], [CreatedOn])
                     VALUES (@SchemaName, @TableName, NULL, @ActionType, @ScanType, @CreatedOn)
                     IF OBJECT_ID('tempdb..#CreateStatsList') IS NOT NULL
                           DROP TABLE #CreateStatsList
                     CREATE TABLE #CreateStatsList
                     ([CreateStatsColumn] VARCHAR(255))
                     INSERT INTO #CreateStatsList([CreateStatsColumn]) 
                     FROM sys.tables st
                     INNER JOIN sys.schemas sch on st.schema_id = sch.schema_id
                     INNER JOIN sys.columns c on C.object_id = st.object_id 
                     WHERE = @SchemaName AND = @TableName  
                     AND not in ('EDW_ModifiedByID','EDW_CreatedByID','EDW_ModifiedOn','EDW_CreatedOn') 
                     DECLARE @CreateStatsColumn AS VARCHAR(255)
                     --SELECT * FROM #CreateStatsList
                     SET @ScanType = 'FULLSCAN'
                     WHILE EXISTS (SELECT TOP 1 1 FROM #CreateStatsList)
                           SET @CreateStatsColumn = (SELECT TOP 1   [CreateStatsColumn] FROM #CreateStatsList)
                           IF @ScanType = 'FULLSCAN'
                                  SET @SQLString = 'CREATE STATISTICS stat_' + @SchemaName + '_' +  @TableName + '_' + @CreateStatsColumn + '
                                                              ON ' + @SchemaName + '.'+ @TableName + ' (' + @CreateStatsColumn + ') WITH ' + @ScanType + ''
                                  SET @SQLString = 'CREATE STATISTICS stat_' + @SchemaName + '_' +  @TableName + '_' + @CreateStatsColumn + '
                                                              ON ' + @SchemaName + '.'+ @TableName + ' (' + @CreateStatsColumn + ')'
                           --PRINT @SQLString
                           EXECUTE sp_executesql @SQLString
                           DELETE #CreateStatsList WHERE [CreateStatsColumn]   = @CreateStatsColumn
                           --INSERT INTO   [dbo].[StatisticsExecutionMetadata] ([SchemaName], [TableName], [ColumnName],   [ActionType], [ScanType], [CreatedOn])
                           --VALUES (@SchemaName, @TableName,   @CreateStatsColumn, @ActionType, @ScanType, @CreatedOn)
                     INSERT INTO [dbo].[StatisticsExecutionMetadata] ([SchemaName], [TableName], [ColumnName], [ActionType], [ScanType], [CreatedOn])
                     VALUES (@SchemaName, @TableName, NULL, @ActionType, @ScanType, @CreatedOn)

In case you have composite joins on multiple-columns that have cross-column relationships and dependencies on each other, creating statistics on the multiple columns might improve the distributed query execution plan. A multiple-columns statistic contains cross-column correlation statistical information, called densities, with respect to neighboring columns and helps query optimizer to choose the right physical operators (for example hash join over nested loop join, as a nested loop join is often not optimal in data warehousing scenario).

Workload Management

If you have experience working with SQL Server, you might have used Resource Governor to manage the workload in SQL Server to make resources available to all the requests and better utilize the resources. In a similar line, SQL Data Warehouse also has workload management capability built in. This capability allows you to assign an incoming request to have pre-configured resources (like memory and concurrency) so that they don’t get starved forever and SQL Data Warehouse better utilizes system resources to achieve the best performance for concurrent requests.

In SQL Data Warehouse, a resource class is a built-in server role that has pre-assigned limits for memory and concurrency and varies from different DWU tiers. Though irrespective of DWU tiers, there are four different resource classes and resources for these resource classes are mentioned in “Concurrency and workload management in SQL Data Warehouse” for each of the tiers. SQL Data Warehouse allocates resources to requests according to the resource class server role membership of the login that submits the requests.

SELECT AS [Member Name], AS [Resource Class]
FROM    sys.database_role_members rm
JOIN    sys.database_principals AS r            ON rm.role_principal_id       = r.principal_id
JOIN    sys.database_principals AS m            ON rm.member_principal_id     = m.principal_id
WHERE IN ('mediumrc','largerc', 'xlargerc');

Default or small resource class will be used if the login executing the query has not been assigned to any other resource classes. You need to assign a login to the higher resource class if the query it’s going to execute is very expensive. For example, you can consider using mediumrc (medium resource class) for CTAS operations that have large hash joins or SELECT operations that need more memory to avoid caching to disk or building, rebuilding, and reorganizing clustered columnstore indexes for smaller tables that have 10-15 columns, etc. Likewise you might consider using largerc (large resource class) if your query is more expensive than that, for example, if you are running very heavy CTAS operations that have huge hash joins, contain large aggregations (ORDER BY or GROUP BY clauses) or SELECT operations that require very large amounts of memory for operations like hash joins, aggregations (ORDER BY or GROUP BY clauses). Please be careful when using xlargerc (extra-large class) and use when a request requires extra-large resource consumption at run time.

Please note, the higher resource class you use for executing queries, the more resources it’s going to consume, limiting the maximum number of concurrent queries. Hence, I would suggest to you to go through “Concurrency and workload management in SQL Data Warehouse in detail and decide on a resource class based on DWU tier and concurrency need for your workload.

Performance Optimization Guideline

Although the Massively Parallel Processing (MPP) engine of SQL Data Warehouse does a fabulous job in executing your query in parallel across multiple nodes and provides faster response time, this section covers some of the controls that you have at hand to optimize the performance to meet your performance expectation.

  • Design
    • Make sure you have the right collation setting for your SQL Data Warehouse database when you are creating it. Changing collation after database creation is not supported as of now.
    • SQL Data Warehouse does not support cross database queries. Hence, during your migration you can migrate tables from all your databases to a single SQL Data Warehouse database. We used this naming pattern for objects in SQL Data Warehouse and it worked fine in migration. <PreviousDatabaseName_PreviousSchemaName>.<TableName>.
    • As you can notice, the previous database and schema name now become a concatenated schema name in SQL Data Warehouse. This also requires you to change all references of these objects wherever applicable (like in views, stored procedures, etc).
    • Choose the right hash distribution key column for a hash distributed table to avoid data skewness and possible data movement. See “Getting Started with Azure SQL Data Warehouse - Part 3” for some guidelines on deciding the right distribution key for a table.
    • Even though you have a distribution key on the same column for multiple tables, there might be possible data movement if the data type for this column is not exactly the same as the data type for the column on the other table. This happens because SQL Data Warehouse must do implicit type conversion before joins and hence you need to make sure you use distribution column in joins (or aggregations) with same data type to avoid implicit type conversion, eg. <int>=<bigint>.
    • Consider using smallrc for all select queries, use mediumrc or largerc for bigger data loads (especially loading directly to a CCI table or rebuilding CCI).
    • During data load or during maintenance you can scale your database to a higher DWU tier and once done scale it down. This way you can balance cost and performance.
    • As of this writing, SQL Data Warehouse supports a maximum 32 concurrent queries, in cases where you expect a large number of concurrent queries, consider using SQL Server Analysis Services (SSAS) or Azure Analysis Services as a caching layer and to support higher traffic.
    • SQL Data Warehouse allows you to specify a label for your query; don’t be shy to use it. Use it extensively as this will help you in troubleshooting. Make sure though you have defined a naming convention for these labels so that it’s not confusing to you or to others who are working on troubleshooting.
    • Though its applicable for any platform, particularly in SQL Data Warehouse it’s more important to use the smallest possible column size in your table definition. This is because Data Movement Service (DMS) uses fixed length as of maximum size (even the column data type is variable) when moving data across. This is very much like how data buffers work in SQL Server Integration Services (SSIS).
    • Likewise, whenever not needed don’t use NAVARCHAR or NCHAR, instead use VARCHAR or CHAR.
    • SQL Data Warehouse has several DMVs which can be used to monitor query execution. See “Monitor your workload using DMVs” for more detail on these DMVs.
  • Statistics
    • Whenever you face any performance issue, the first and foremost thing to check is the presence of the statistics and if those are up to date for all those involved tables. In several of my cases, only ensuring up to date statistics on involved tables improved the performance of the queries several times.
    • When there is a large data change in the table, you might consider re-creating or updating statistics on that table.
    • There are DMVs available to find out the difference between control and compute nodes and if the difference is say 5% or 10% (test it in your case to arrive on this number applicable in your specific case) then only update the statistics. This way you can optimize your load process and offload statistics update from regular data load process to maintenance process, which often runs in off-peak hours anyway.
    • You can start with creating statistics on all the columns of the table and gradually identify strategic columns used in joins, aggregation, filter and sorting and maintain statistics for those columns only. This way you can minimize the time it takes for re-creating or updating statistics on that table.
  • Clustered Columnstore index (CCI)
    • By default, every table gets created with CCI in SQL Data Warehouse; don’t change it unless you have a table with a considerably less number of rows.
    • When you have lots of updates or deletes on a table with CCI, over a period of time that table will have lots of ghost rows in the delete bitmap (a row is not deleted immediately from a table with CCI, it is marked as logically deleted and doesn’t reclaim physical storage from that row until the index is rebuilt) - sometime referred to as index fragmentation - this can cause performance issues for your queries and data load processes. To get rid of these unwanted rows or to remove fragmentation, you can either recreate the table with CTAS or rebuild the CCI.

      To identify when an index rebuild is needed, there are couple of Dynamic Management Views (DMVs) or Catalog Views in SQL Data Warehouse. You can use these DMVs to identify the percentage of deleted rows and if the percentage is beyond some threshold (like 10% or 15% - you can decide based on your workload and performance needs) you can consider index rebuilding.

    • When you are rebuilding CCI, do it with a medium or large resource class, depending on the size of the table and DWU tier.

  • Data Load Process
    • During data refresh (ETL or ELT), normally we first bring data to a staging table and then transform and move to the final table (dimension or fact) on a defined interval. In this case, you can create the staging table as HEAP (as opposed to default CCI) and evaluate the performance. This should improve performance, as normally data refresh on interval will not have large data to get the benefit of CCI. Also, often you will be querying all the data (all columns) from the staging table to the final table.
  • Partitioning
    • When using partitioning, remember that SQL Data Warehouse already distributes data across 60 distributions and hence you should use it at a higher level to ensure each of these partitions get at least 1 million rows for better compression in CCI – leading to better performance.
    • In the case of SQL Server, you can switch-in a non-partitioned staging table to a partition of a partitioned table if the non-partitioned table has CHECK constraint defined. However, because there is no support for CHECK constraint in SQL Data Warehouse, you can only switch in from one partitioned table to another partitioned table (one partition at time) and hence make sure your staging table has the same partitioning strategy as your final table. Also, the distribution type and distribution key should match.
    • You can consider having narrower partitions for recent data (for example daily partitions in the current month) and wider partitions for older data (for example, monthly partitions for previous months in current year and yearly partitions in all earlier years) based on your query pattern.
    • For a large fact table, you can use partition switch-in to load data into it. Likewise, to delete or archive the data you can use partition switch-out. This way you can minimize the time and impact to other queries referencing this table.
  • Migration
    • During historical data load when you have a very large table to migrate, I would suggest moving data into chunks; this way your migration is more manageable and you will have better restart-ability from the point of failure in case of any failure. If you have a partitioned table, you can migrate data for each partition as one chunk (may run multiple processes to run them in parallel for quicker migration though). Once you have data imported for all the partitions in internal staging tables within SQL Data Warehouse, you can use partition switch-in to combine them together in your final table. When you use this technique, just make sure your internal staging tables (which hold data for each of the partitions) have exactly the same distribution type, distribution key and partitioning strategy as your final table.

      If your source table is not partitioned or if you don’t want to create your final table as a partitioned table, then you can use CTAS to create your final table by combining data from all internal staging tables (with UNION ALL). But this process is going to be slow and might face memory out of exception. In that case, you can consider scaling your SQL Data Warehouse to a higher DWU tier and consider using largerc or xlargerc when running CTAS.

  • Polybase
    • Currently, Polybase supports up to 1MB of row width, which will be sufficient in most of the cases. In case you have a very wide row, then you can consider doing vertical partitioning of your data into multiple tables and use the same distribution key to avoid any data movement when joining these tables.
    • If you have Unicode data, you can use Encoding = ‘UTF16’ in external file format to import the data into.
    • If your data has embedded lines or a line feed in it, currently you have two choices. You can replace those values before exporting data out from your source or use another file format like RCFILE.
    • As discussed in my earlier article, Polybase is designed to move data in parallel and hence it’s recommended to move data into multiple smaller files rather than moving it in a large single file. When you export data out from your source, export it into 60 or more files rather than exporting it in a single or smaller number of files.


    In this article of the series, we looked the importance of statistics, how workload management works and how to go about performance optimization in SQL Data Warehouse and different aspects that impacts query performance.


    Getting Started with Azure SQL Data Warehouse - Part 1

    Getting Started with Azure SQL Data Warehouse - Part 2

    Getting Started with Azure SQL Data Warehouse - Part 3

    Getting Started with Azure SQL Data Warehouse - Part 4

    Design for Big Data with Microsoft Azure SQL Data Warehouse

    What is Azure SQL Data Warehouse

    SQL Data Warehouse free trial

    See all articles by Arshad Ali

Mobile Site | Full Site
Copyright 2017 © QuinStreet Inc. All Rights Reserved