While using Microsoft Office Excel to store a variety of semi-structured data tends to be more economical and convenient than SQL Server, it lacks data mining and data warehousing capabilities. This article explores importing spreadsheets onto SQL Server and taking advantage of Office Automation through managed code running inside a Script Task.
Despite the technological superiority of the SQL Server product line, Microsoft Office Excel remains a widely popular application used to store a variety of semi-structured data. While this approach tends to be (at least in a short term) more economical and convenient, it lacks mining and warehousing capabilities provided by database management systems. These shortcomings can be remediated by importing spreadsheets into a relational database. SQL Server 2008 Integration Services simplifies the task of collating their content into the appropriate format, making them suitable for transformation and loading. In this article, we will explore one of several methods that provide this functionality, which takes advantage of Office Automation through managed code running inside a Script Task.
The features we are interested in are exposed by Microsoft.Office.Interop.Excel namespace, incorporated into the Microsoft Office installation and made available via the Office Primary Interop Assemblies download. They provide the ability to interact with Excel (as well as other Microsoft Office applications) from within Visual Studio Tools for Applications. Since this technique is intended primarily for client-side programming, you need to be aware of its limitations (especially in regard to supportability) as well as additional caveats that need to be taken into account when employing it to implement server-based COM code running in a security context of a non-interactive user (for example as a scheduled task or a SQL Server Agent job). As the Microsoft Technet article, Considerations for server-side Automation of Office explains in detail, this involves potentially unstable, insecure, and inconsistent behavior, related to dependencies on such factors as user identity, interactivity with the desktop, reentrancy, and scalability.
Keeping these warnings in mind and heeding the consequences associated with them, let's examine what type of actions can be carried out by leveraging types and members of the Microsoft.Office.Interop.Excel namespace. To illustrate these capabilities, we will present a generic scenario, in which we enumerate Excel workbooks residing in an arbitrarily chosen location, then identify their worksheets and named ranges for each of them, and finally display the content of their individual cells. The code will be implemented using Visual Studio Tools for Applications accessible via SSIS Script Task.
To start, launch Business Intelligence Development Studio and create a new project based on the Integration Services template. Drag the Script Task icon from the Toolbox and drop it on the Designer interface. Display its Editor dialog box, designate Visual Basic .NET 2008 as the ScriptLanguage, and confirm your choice by clicking on the OK command button. Use the Variables window to define a new variable (we will call it FolderName) of String type and set its value to the name of a folder where the Excel spreadsheets reside. Next, activate the Script Task Editor dialog box (accessible via the Edit option in its context sensitive menu), add the User::FolderName to the ReadOnlyVariables entry of the Script section, and click on the Edit Script button to display Visual Studio Tools for Applications.
In order to facilitate interaction with the Excel object model, we need to add references to the relevant library (which you can identify by referring to the Microsoft Knowledge Base article 219151). To accomplish this, select the Properties entry from the context-sensitive menu of the project displayed in the Solution Explorer window within the Visual Studio Tools for Applications Tools interface. Switch to the References section, click on the Add command button, display content of COM tab, highlight the appropriate object library entry (which, in our example, happens to be Microsoft Excel 12.0 Object Library, corresponding to Excel 2007), and confirm your choice by clicking on the OK command button. Once you return to the ScriptMain.vb window, add Imports Microsoft.Office.Interop.Excel and Imports System.IO (used for file system operations) statements to its (General)(Declarations) section. Finally, copy the following code as the content of Public Sub Main():
Public Sub Main() Dts.TaskResult = ScriptResults.Success Dim strFile As String Dim objExcelApp As New Microsoft.Office.Interop.Excel.Application Dim objExcelWbk As Microsoft.Office.Interop.Excel.Workbook Dim objExcelSheet As Microsoft.Office.Interop.Excel.Worksheet Dim objExcelCell As Microsoft.Office.Interop.Excel.Range Dim intSheet As Integer Dim intRow As Integer Dim intCol As Integer Try Dim strFolder = Dts.Variables("FolderName").Value.ToString Dim colFiles As String() = Directory.GetFiles(strFolder, "*.xlsx") For Each strFile In colFiles MessageBox.Show(strFile) objExcelWbk = objExcelApp.Workbooks.Open(strFile) MessageBox.Show(objExcelWbk.Worksheets.Count) For intSheet = 1 To objExcelWbk.Worksheets.Count objExcelSheet = objExcelWbk.Worksheets(intSheet) MessageBox.Show(objExcelSheet.Name.ToString) For intRow = 1 To objExcelSheet.UsedRange.Rows.Count For intCol = 1 To objExcelSheet.UsedRange.Columns.Count objExcelCell = objExcelSheet.UsedRange.Cells(intRow, intCol) If Not (objExcelCell.Value = Nothing) Then MessageBox.Show(objExcelCell.Value.ToString, "Content of cell " + _ objExcelWbk.Name.ToString + "." + objExcelSheet.Name.ToString + _ "." + intRow.ToString + ":" + intCol.ToString, MessageBoxButtons.OK) End If Next Next Next objExcelWbk.Close(False) Next objExcelSheet = Nothing objExcelWbk = Nothing Catch ex As Exception Dts.TaskResult = ScriptResults.Failure MessageBox.Show(ex.Message.ToString(), "Exception", MessageBoxButtons.OK) Finally objExcelApp.Workbooks.Close() objExcelApp.Quit() End Try End Sub
Our script enumerates all files with the extension .xlsx residing in the folder, whose name is obtained by referencing the FolderName SSIS variable. For each of them, we drill down through Workbooks and Worksheets collections (accessible via Microsoft.Office.Interop.Excel.Application namespace) using their Open method, down to the individual cell level (taking advantage of the Microsoft.Office.Interop.Excel.Range object, its Rows and Columns properties, as well as their respective collections).