SQL Server Integration Services (SSIS) was first made available well over a decade ago as part of the release of SQL Server 2005, replacing Data Transformation Services, which had been integrated into the product since SQL Server 7.0. This enduring pedigree has been interrupted by the introduction of Azure SQL Database, which, unlike its on-premises counterparts, restricted its capabilities to those implemented directly by the database engine. Unfortunately this meant that existing SSIS packages could not be used to perform extraction, transformation, and loading (ETL) tasks by relying on computing and storage resources of Azure SQL Database. To remediate this shortcoming, Microsoft provided equivalent functionality by relying on integration runtime of Azure Data Factory. In this article, you get an overview of this Azure-based SSIS offering.
From the architectural standpoint, the Azure-based SSIS offering consists of two primary services. The first service is Azure SQL Database (in the form of a single database, a member of an elastic pool, or a Managed Instance), which provides storage for packages and related metadata (SSISDB catalog). When using the package deployment model, it is also possible to leverage file storage (including file systems, file shares, and Azure Files) as a package repository (SSISDB catalog supports only the project deployment model). The second service is the Azure SSIS Integration Runtime (Azure SSIS IR) component of Azure Data Factory, which serves as the runtime engine for package execution. This is in contrast with the traditional SSIS architecture, which relies on both storage and runtime provided by a SQL Server instance.
Azure SSIS Integration Runtime is a fully managed service implemented as a cluster of Azure virtual machines. While your control over cluster functionality is limited, you have the ability of scaling it both vertically (by increasing the size of cluster nodes) and horizontally (by increasing the number of nodes in the cluster). You can also bring it online or take it offline based on expected demand in order to minimize cost (stopped cluster does not incur compute charges). In addition, you can specify the maximum number of packages that can run simultaneously on the same cluster node by setting the maximum parallel execution per node parameter of the cluster. This allows you to adjust resource allocation depending on the type of packages you intend to run (for a small number of resource intensive packages, you would specify a number lower than the one you would choose for a larger number of lightweight packages).
When provisioning Azure SSIS Integration Runtime, you should take into account the following factors:
- the region where the Azure SQL Database hosting SSISDB catalog resides (when using the project deployment model).
In general, you should use the same Azure region to host the Integration Runtime in order to minimize the latency and cost of communication between the two services.
- the region hosting Azure data sources and destinations that are part of the SSIS pipeline.
In general, you should use the same Azure region to host the Integration Runtime in order to eliminate the latency and cost associated with cross-region data transfers.
- the location of internal data sources and destinations.
In cases where there is a requirement to process data residing within a customer's internal network, then you should deploy Azure SSIS IR into an Azure virtual network and establish cross-premises connectivity between that virtual network and the customer's internal network (either via Azure Site-to-Site VPN or Azure ExpressRoute). To minimize the latency, it is recommended that the virtual network resides in the Azure region closest to the location of the customer's internal network. Additionally, you might also want to consider configuring a service endpoint between the subnet of the virtual network hosting the Azure SSIS IR deployment and the instance of Azure SQL Database hosting the SSISDB catalog (service endpoints can also be used to optimize connectivity to Azure resources that serve as data stores). When using SQL Database Managed Instance, you have the option of co-locating it with Azure SSIS IR in the same virtual network or implementing peering if they reside on two different virtual networks. Note that when Azure SSIS IR operates outside of a virtual network, you need to enable the setting Allow access to Azure services on the logical instance of the SQL Server that will host the SSISDB catalog.
In situations where establishing cross-premises connectivity is not possible (for example, due to constraints imposed by an on-premises Information Security policies), you might need to resort to a workaround that involves setting up an instance of the self-hosted integration runtime component of Azure Data Factory in the on-premises environment. While this type of runtime does not provide direct support for SSIS package execution, it can serve as a proxy between Azure SSIS IR and on-premises data stores. This workaround also offers a number of side benefits, including support for data sources that require third-party drivers (e.g. SAP HANA).
For authentication purposes, you can either use SQL Server-based authentication or Azure Active Directory authentication to provide access from the Azure SSIS IR to the SSISDB catalog. To implement the latter, you need to create a managed identity, assign it to the Azure Data Factory instance, and grant it sufficient privileges to create the target database (this makes possible to use the Azure SSIS Integration Runtime to create the SSISDB catalog).
Once the provisioning of the Azure SSIS Integration Runtime and the Azure SQL Database (or Managed Instance)-based SSISDB catalog is completed, you can manage SSIS packages with the same set of tools that are available in on-premises scenarios. If you prefer graphical interface, you can use for this purpose SQL Server Management Studio and SQL Server Data Tools. If your objective is to automate management via command line tools, then your options include dtinstall, dtexec, and dtutil (these tools are required if you do not use the SSISDB catalog but instead you rely on the package deployment model with file system-based storage). It is also possible to execute packages by using the Execute SSIS Package activity available as part of Azure Data Factory pipelines.
This concludes the introduction to SQL Server Integration Services with Azure Data Factory. In an upcoming articles, I will step you through the implementation process and describe the process of managing SSIS packages.