Throughout our coverage of SQL Server 2005 Integration Services, we have been primarily concerned with newly introduced and improved features, spending considerably less time discussing its performance advantages over Data Transformation Services. However, it is important to point out that superiority of this technology goes beyond functionality alone, providing numerous efficiency benefits, which we will present in more detail in this article.
As we already have explained in the introductory installments of this series, one of the fundamental changes underlying design and implementation of SQL Server 2005 Integration Services is separation of Control Flow and Data Flow. The former determines execution sequence of tasks, defining workflow of the package. The primary purpose of the latter is management of the data movement from its source to destination and transformations of its content during transit. Some of the most relevant performance tuning methods are directly related to this dual architecture and therefore should be discussed within its context.
Control Flow execution is managed by the SSIS run-time engine, which offers the ability to run its tasks in parallel (in addition to traditional serial approach, which tends to be less efficient, since it forces tasks to execute in synchronous manner, where a subsequent one needs to wait for its predecessor to complete before it can proceed). The maximum number of tasks that can execute simultaneously is controlled with the MaxConcurrentExecutables package property. You can set it explicitly or rely on its default value of -1, which derives the limit by adding 2 to the total number of processors (logical rather than physical, giving you a slight edge when hyperthreading is turned on). Note, however, that you should not assign this number strictly based on the number of tasks and their respective arrangement, but also take into consideration the characteristics of individual tasks (such as their resource utilization or blocking nature). Similarly, within Data Flow, it is possible to benefit from parallelism. The total number of threads created by the pipeline engine is calculated based on several factors, such as the number of data sources, destinations, and asynchronous transformations. While you can affect the outcome of these calculations by assigning an arbitrary value to the EngineThreads property of each Data Flow task (which, by default, is set to 5), this serves merely as a suggestion to the engine scheduler, which might make additional adjustments as it deems appropriate (for example, in cases where this would eliminate the potential for deadlocks). Alternative method of influencing thread distribution involves modifying the pipeline by introducing additional transformations, such as Multicast (which copies its input across multiple outputs), spreading the load across multiple transformations and destinations.
In addition to improvements in thread organization, Data Flow can be optimized by tuning its memory utilization. Data processed by its engine is stored in chunks of memory referred to as buffers, created as new records are read via source components and allocated dynamically according to requirements imposed by transformations. The volume of records in a buffer depends on a variety of factors, such as the number and width of data columns, as well as buffer size (more about this shortly). The total number of buffers is derived based primarily on a total count of data records (extracted from their source or generated throughout the pipeline) and blocking nature of Data Flow components. Such components can be viewed as non-blocking, semi-blocking, or blocking, depending on whether data rows are relayed from one transformation to the next after each one, some, or all of them have been processed. The non-blocking type can be further subdivided into streaming (such as Character Map, Conditional Split, or Data Conversion, which use readily available internal features and data) and row-based (such as Export Column, Import Column or OLE DB Command, which are inherently slower since they need to rely on external entities for the same functionality). In either case though, processing is very efficient, with data rows maintained in the same set of buffers throughout the pipeline, handled in-place by consecutive transformations (without the need to allocate new buffers and copy partially processed data to them). Semi-blocking components (such as Merge or Merge Join) require that records expected by downstream transformation are temporarily retained (for example, Merge and Merge Join must first locate matching keys from data streams being merged before producing the outcome) resulting typically in the difference between counts of input and output rows. This, in turn, impacts memory usage (and tends to increase thread count), since new buffers need to be created (and existing ones maintained) in order to accommodate some of the rows processed by an upstream transformation. Fully blocking Data Flow components (such as Aggregate or Sort) where the entire input data set is needed before any output records are generated are the worst performers, since it is necessary to create sufficient amount buffers (and dedicate processing resources to perform their calculations) to hold the entire data set.
As mentioned earlier, buffers are formed on an as needed basis, according to the count and size of individual data rows (which is calculated by reviewing the data types of columns involved in transformations) and following rules dictated by several Data Flow-specific parameters, which include:
- DefaultBufferSize - set by default (but configurable) to 10,485,760 Bytes (equal to 10 MB), which specifies maximum amount of memory that can be used for a single memory buffer.
- DefaultBufferMaxRows - set by default (but configurable) to 10,000, which limits the number of rows that can be stored in a single buffer. If the result of multiplying this value by the estimated row size is larger than DefaultBufferSize, then the actual number of rows is automatically reduced to fit within the designated range.
- MaxBufferSize - hard-coded to 100 MB (and not configurable) sets the upper limit on the DefaultBufferSize (you will receive an error message if you attempt to increase DefaultBufferSize beyond this value). For this reason, it also affects the maximum number of rows that can be loaded into a single buffer (even if a larger number would be allowed based on the value assigned to DefaultBufferMaxRows).
- MinBufferSize - defined by operating system features (in particular, by its virtual memory allocation mechanism), determines the smallest possible value of DefaultBufferSize (and is typically equal to 65,536 Bytes - or 64 kB). If the amount of memory calculated by multiplying DefaultBufferMaxRows by the estimated row size is lower than MinBufferSize, than the number of allowed rows is automatically increased to ensure that the buffer size is not smaller than this absolute minimum.
Buffer and thread characteristics are reflected by the structure of execution trees, generated during package execution. Each such tree is associated with a separate buffer and, potentially, a distinct thread. You can determine their structure and layout by turning on logging for your Data Flow tasks and monitoring the BufferSizeTuning, PipelineExecutionTrees, and PipelineExecutionPlan events, (the outcome is displayed in the Log Events window and recorded using the log provider you selected). In addition, you should also review the results of the Progress tab, where you can identify any redundant columns that, while included in the buffer, might not be needed by any downstream transformations (relevant messages pointing to this issue are displayed in the form of warnings). Eliminating them from the pipeline minimizes the value of estimated row size, optimizing buffer utilization. This can be further augmented by ensuring that appropriate data types are specified when loading content of flat files. (You can use for this purpose the Suggest Types button in the Advanced section Flat File Connection Manager Editor dialog box, which provides advice regarding choices of smallest integer and real data types as well as identifies Boolean columns based on a data sample of arbitrary size). You might also want to adjust the EngineThreads parameter (mentioned before) in order to allow a sufficient number of threads to cover all execution trees and source components.
In order to determine the efficiency of your tuning techniques, you might want to monitor performance of Data Flow transformations and buffer utilization with Windows built-in System Monitor. This utility (which you can invoke by selecting Performance entry from the Administrative Tools menu or by simply typing Perfmon in the Start->Run box) displays real-time statistics based on the set of arbitrarily selected counters. The ones most relevant in the context of our discussion include Buffer memory (total memory used by buffers), Buffers in use (number of buffer objects) and Buffers spooled (which you should pay particular attention to, since its increase indicates shortage of physical memory, forcing the buffer content into the Swap file on the hard drive, which has significance adverse impact on peformance). More details can be obtained by looking at equivalent statistics for flat and private buffers, such as Flat buffer memory, Flat buffers in use, Private buffer memory, and Private buffers in use. (Flat buffers are used in the majority of situations as storage for data passing through the pipeline, while private ones are required by few blocking transformations, such as Sorts or Lookups to cache data needed for their temporary calculations). In order to display any of them in the System Monitor interface, select the Add counters... item in its context sensitive menu (or click on the "+" sign in the toolbar), pick SQL Server:SSIS Pipeline from the Performance objects drop-down list, and choose the relevant entries from the listbox underneath. Keep in mind, however, that all of the values are server-wide, which might make their analysis difficult, unless you have full control over all packages running on the system and can schedule their execution in such way that other tasks do not impact measurements.