Build a Web Site Traffic Analysis Cube: Part I

Monday Jul 21st 2003 by William Pearson

Discover hands-on approaches to extracting site traffic data, and loading it to a data source from which we can build a Site Traffic Analysis Cube. Author Bill Pearson kicks off a two-part article with the ETL side of the process, in preparation for cube design and construction in Part II.

About the Series ...

This is the thirteenth article of the series, Introduction to MSSQL Server 2000 Analysis Services. As I stated in the first article, Creating Our First Cube, the primary focus of this series is an introduction to the practical creation and manipulation of multidimensional OLAP cubes. The series is designed to provide hands-on application of the fundamentals of MS SQL Server 2000 Analysis Services ("Analysis Services"), with each installment progressively adding features and techniques designed to meet specific real-world needs. For more information on the series, as well as the hardware / software requirements to prepare for the exercises we will undertake, please see my initial article, Creating Our First Cube.

Preparation

Prior to beginning the lesson, you will need to download a copy of the sample Server Access Log, ServAccessLog.txt, a zipped text file that we will use as a data source in Part I of this lesson. Once the log is downloaded, unzip it and place it in a location that you can easily remember later, when we select the file as a data source. Once the lesson is completed, the file can be discarded to conserve hard disk space, if desired.

Introduction

While the majority of our series to date has focused upon the design and creation of cubes within Analysis Services (see Articles One through Nine of the Introduction to MSSQL Server 2000 Analysis Services series), we began in Article Ten to discuss reporting options for our cubes. My intention with Articles Ten, Eleven, and Twelve was to offer a response to the expressed need of several readers for options in this regard - options beyond the mere browse capabilities within Analysis Services.

In Articles Ten and Eleven, we explored some of the options offered by Microsoft Office - specifically the Excel PivotTable Report and Office PivotTable List, respectively - for report building with Analysis Services cubes. In Article Twelve, we explored features that integrate Analysis Services and Cognos PowerPlay, to provide a vehicle for client reporting and other business intelligence pursuits. The focus of the article was a basic overview of the steps involved in a simple (non-integrated security) connection of Cognos PowerPlay to a Microsoft Analysis Services cube, and then a high level overview of the use of PowerPlay for Windows and PowerPlay Web for the performance of analysis and reporting upon the Analysis Services OLAP data source.

In this article, we will return to the hands-on design and building of cubes for various business purposes. Specifically, the next two articles will focus on the design and construction of a Web Site Traffic Analysis Cube. In Part I, after a brief discussion of potential business reasons for collecting web site traffic data, we will design and build an extract procedure, to illustrate one approach for entraining statistical data for ultimate placement into our new traffic analysis cube. We will also set up a simple data source that will serve as the destination point for the extract process, and as a basis for the design and creation of the cube. In Part II, we will build the cube itself, and then browse it using the Analysis Services browser to examine the results of our handiwork.

The topics within Part I of this two-part article will include:

  • An overview of the business needs behind the desire to report upon web site traffic statistics;
  • An overview of the Server Access Log, and a discussion of its use as a source of web site activity tracking data;
  • A practical demonstration of the extraction of sample traffic statistics raw data from a log file, and its importation into a database using MS SQL Server 2000 Data Transformation Services ("DTS");
  • Creation and population of a table in MSSQL Server 2000 to support our site traffic analysis cube in Part II.

Why a Site Traffic Analysis Cube?

In this lesson, we will return to an examination of real-life applications that can leverage the power of Analysis Services. The scenario that we explore in this article will surround the business need of a web site owner to analyze traffic.

The uses for site traffic analysis and statistics are legion, and the degree and complexity of the analysis performed can range widely. Examples might include the need to establish baseline activity on a given site before implementing a promotional campaign within the organization, as a means of determining the effectiveness of that campaign from various perspectives. Current traffic metrics can be useful for a number of other reasons as well. They can show us which overall resources or site features are attracting visitors, which pages in the site are being skipped by visitors (or, worse, simply not being seen due to obscurity in naming and referencing, non-intuitive links, and so forth), who our visitors are, and from what site they were referred to ours, among many other potentially valuable bits of information.

A partial list of "typical" web site tracking reports that I have put in place for clients in the past includes the following. The titles of the reports are shown here to give an indication of possible dimensions upon which one might seek to report. Other, more advanced reporting perspectives are, of course, possible.

Summary Reports

  • Totals and Averages (various reports)

Basic Tracking Reports

  • Unique Visitors by:
    • Days
    • Weeks
    • Months
    • Days of the Week
    • Hours of the day
  • Reloads by:
    • Days
    • Weeks
    • Months
  • Geographical Tracking by:
    • Domains
    • Countries (with obvious regional, province, state, etc., hierarchical levels)
    • Continents
  • System Tracking by:
    • Browsers
    • JavaScript Enabled
    • Operating Systems
    • Screen Resolutions
    • Screen Colors
  • Referrer Tracking by:
    • Last 20 (number varies ...)
    • Last 20 from Email
    • Last 20 from Search Engines
    • Last 20 Queries
    • Last 20 from Usenet
    • Last 20 from Hard Disk
  • Referrer Tracking by:
    • Totals by Source:
      • Website
      • Search Engine
      • Email
      • Usenet
      • Hard Disk
    • Totals by Search Engine:
      • 24 most popular engines (number varies)
    • All Keywords
    • All Website Referrers

There are many other potential dimensions, but perhaps this gives a flavor for the possibilities. Along with informing us of which resources on our site hold the attention of our visitors, web statistics can expose, both directly and by inference, many of the characteristics of the visitors, along with various attributes of their visits to our sites. These characteristics and attributes might include the following examples:

  • Duration of visits to the site (and individual pages thereof);
  • Most popular times of day / days of week for visits;
  • Likelihood of actual reading of resources, or mere skimming / skipping about;
  • Optimal times to perform maintenance / updates, based upon traffic valleys;
  • Characteristics of the people drawn to the site (demographics, etc.);
  • Characteristics of people likely to visit with adequate promotion;
  • Navigational impediments / perceived difficulties that shorten visits / prevent returns;
  • Participation in, percentage of completion of, and resistance to surveys and other information gathering vehicles.

Obtaining the Traffic Data

Before we can begin the generation of our Site Traffic Analysis Cube in Analysis Services, we need a data source that has, in essence, the characteristics of a star schema, or at least those of a fact table. A common source of information about traffic and activity on virtually any web site is the Server Access Log, which we will discuss in the next section.

Introducing the Server Access Log

While there are many options for storing site access information, a common measure of site activity, the file access (pervasively known as a "hit"), is typically recorded in a file called an access log. It is beyond the scope of this lesson to debate whether "hits" are appropriate as the sole measure of site traffic, although many of us are aware that alternatives exist that distinguish actual visits from mere file accesses more accurately. For the purposes of this article, we will use the access log as the sole source of data for our cube, primarily to keep our focus on cube design and construction. Suffice it to say that many more advanced approaches to data capture have been devised, and that our approach within this lesson is certainly not a recommendation of "best practices" within this specialized science.

The Server Access Log is central, for our purposes, to learning about the visits to our site, as well as the visitors themselves. At a high level, any visit to our site means a corresponding request for a file from the site. A request for a file results in a corresponding entry to the server access log, which acts as a cumulative history of every attempt (whether successful or unsuccessful) to retrieve information from the site. The information from the individual entries is easily extracted and loaded to a data store that is more readily adapted to the support of a multidimensional cube. The typical log contains entries in the common log file format that includes the following fields:

  • Host
  • Identification
  • User Authentication
  • Time Stamp
  • HTTP Request Type
  • Status Code
  • Transfer Volume

Example entries in a simple server access log appear in Illustration 1.


Illustration 1: Select Entries in an Example Server Access Log
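To make the field layout concrete, here is a minimal Python sketch that splits one entry in the common log file format into the seven fields listed above. The sample line, the pattern name CLF_PATTERN, and the function parse_clf_line are illustrative assumptions; they are not taken from the sample file used in this lesson, whose layout may differ.

```python
import re

# Common log file format:
#   host ident authuser [time stamp] "request" status bytes
# The group names below are our own labels for the fields in the list above.
CLF_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<timestamp>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)


def parse_clf_line(line):
    """Return a dict of the seven fields, or None if the line does not match."""
    match = CLF_PATTERN.match(line)
    if match is None:
        return None
    return match.groupdict()


# Hypothetical entry, written only for this illustration.
SAMPLE = '192.0.2.10 - - [21/Jul/2003:10:15:42 -0500] "GET /index.html HTTP/1.0" 200 5120'
entry = parse_clf_line(SAMPLE)
```

Each hit becomes one small record, which is the shape we want for a fact table row. Lines that do not fit the pattern return None rather than raising an error, so a caller can count or skip them.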

Other logs exist, in addition to the above, and, often, logs are combined to supplement the information in the entries shown above. For purposes of our lesson, we will work with a sample log file similar to that shown in Illustration 1. Keep in mind, throughout the procedures that follow, that the steps we take to entrain the data into our star schema are similar with any log, with modifications obviously required to adapt to differing scenarios. Again, we will keep the extraction process simple so that we can focus on our primary objectives.

As virtually all of us know, many options exist for performing the import of a log file. We not only want to import information from the file, however, but we want to set that information up into a data source that can be easily accessed by Analysis Services to build a cube. To achieve both objectives simultaneously, we will use Data Transformation Services (DTS), an Extraction, Transformation and Loading (ETL) tool that accompanies Microsoft SQL Server 2000.

Populating a Cube Data Source using DTS

Data Transformation Services (DTS), which comes along with the typical installation of MSSQL Server 2000, is an excellent tool for developing, automating, and managing data extraction, transformation and loading. In this lesson, our first major step will be to entrain those parts of the Server Access Log data, described in the previous section, that are useful to our cube design. While our excursion into DTS will involve only its simplest functions, we will still gain an appreciation for the flexibility of the tool, and for the number of things we can accomplish from a single graphical user interface. In our example, we will use DTS to accomplish the lion's share of building the data source structure, including the creation of the destination database and its member table. In practice the process would likely differ, particularly in that the star schema would most likely have been designed and put in place well before the ETL process began.

DTS makes use of OLE DB to acquire data. While the details lie outside our present scope, it is useful to understand that OLE DB goes beyond ODBC's limitations of access to only relational data. OLE DB defines a set of COM interfaces that let you access and manipulate any data type, enabling Microsoft's much-touted Universal Data Access (UDA).

In its role as an OLE DB consumer, DTS extracts data from any data source that acts as an OLE DB provider (that is, provides a native OLE DB interface), as well as from any ODBC data source. As we will see, we establish an OLE DB connection, using the DTS Package Designer, to our data source (the Server Access Log) in order to extract the data we need and to load it to our MSSQL Server destination database. Acting in its transformation role, DTS will map the access log data fields to the respective destination data fields via the data source connections, enabling the conversion of data types as necessary.

First, we will open the MSSQL Server 2000 Enterprise Manager, from which we can easily access DTS and many other database functions and objects.

1.             Go to the Start button on the PC, and then navigate to Microsoft SQL Server -> Enterprise Manager, as shown in Illustration 2:


Illustration 2: Navigate to MSSQL Server 2000 Enterprise Manager.

2.             Click Enterprise Manager.

The Enterprise Manager - Console Root appears.

3.             Expand Microsoft SQL Servers by clicking the "+" sign to its immediate left.

4.             Expand SQL Server Group by clicking the "+" sign to its immediate left.

Enterprise Manager now appears (with differences based upon our individual operating environments, of course), as shown in Illustration 3:


Illustration 3: MSSQL Server 2000 Enterprise Manager View (Compressed View)

5.             Expand the Server name for the server upon which you will be working (most likely named after the computer on which it resides, or simply "local" - mine is MOTHER, as shown above, and at other places within our lesson).

6.             Expand the Data Transformation Services folder, exposing the Local Packages icon.

7.             Right-click the Local Packages icon.

The context menu appears, as shown in Illustration 4.


Illustration 4: Context Menu from the Local Packages Icon

8.             Click New Package from the context menu.

The DTS Designer opens with the DTS Package: <New Package> window, as shown in Illustration 5.


Illustration 5: DTS Package: <New Package> Window

9.             Click Connection -> Text File (Source) from the main menu, as shown in Illustration 6.


Illustration 6: Select Connection -> Text File (Source)

The Connection Properties dialog appears.

10.         Ensure that the New Connection radio button is selected.

11.         Type ServerAccessLog into the Name box.

12.         Ensure that Text File (Source) appears in the Data Source selector box.

13.         Click the ellipsis (...) button to the right of the File Name box.

The Select File dialog appears.

14.         Navigate to the location of the sample Server Access Log (ServAccessLog.txt) file, which we downloaded in the preparatory steps above.

15.         Select the ServAccessLog.txt file by highlighting it.

16.         Click Open to apply the settings, and to close the Select File dialog.

We return to the Connection Properties dialog.

17.         Click the Properties button beneath the File Name box.

The Text File Properties dialog appears.

18.         Select the Delimited radio button, as depicted in Illustration 7.


Illustration 7: The Text File Properties Dialog

19.         Click Next.

The Specify Column Delimiter dialog appears. Here we define the boundaries of our data fields.

20.         Ensure the Comma radio button is selected, as shown in Illustration 8.


Illustration 8: Select the Comma Radio Button
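As a rough picture of what the Delimited and Comma settings mean, the following Python sketch splits one line on commas and labels the resulting fields Col001, Col002, and so on, which is how the source columns are named in the transformation steps later in this lesson. The sample line and the helper name_columns are hypothetical; the actual layout of ServAccessLog.txt may differ.

```python
import csv
import io


def name_columns(line):
    """Split one comma-delimited line and label the fields Col001, Col002, ..."""
    fields = next(csv.reader(io.StringIO(line)))
    return {f"Col{i:03d}": value for i, value in enumerate(fields, start=1)}


# Hypothetical line: address first, bracketed time stamp third.
sample = "192.0.2.10,-,[21/Jul/2003:10:15:42 -0500],GET /index.html HTTP/1.0,200,5120"
row = name_columns(sample)
```

With this assumed layout, the address lands in Col001 and the time stamp in Col003, the two columns we map to the destination table further on.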

21.         Click Finish.

The Connection Properties dialog reappears, as we see in Illustration 9.


Illustration 9: The Connection Properties Dialog

22.         Click OK.

We are returned to the DTS Package: <New Package> window, where we see the new ServerAccessLog data connection, as shown in Illustration 10.


Illustration 10: DTS Package: <New Package> Window with the New Data Connection

Having created a data connection for our data source, the Server Access Log file, we now need to create a data connection for our destination, the Web Traffic Analysis database. We will do so with the following steps.

23.         From within the DTS Package: <New Package> window, click Connection -> Microsoft OLE DB Provider for SQL Server from the main menu.

The Connection Properties dialog appears.

24.         Ensure that the New Connection radio button is selected.

25.         Type WebTrafficAnalysisDB into the Name box.

26.         Ensure that Microsoft OLE DB Provider for SQL Server appears in the Data Source selector box.

27.         Specify the appropriate server in the Server box.

28.         Either select the Use Windows Authentication radio button, or select the Use SQL Server Authentication button and input your credentials.

29.         Select <new> in the Database selector.

The Create Database dialog appears.

30.         Type WebTrafficAnalysisDB into the Name box.

The Create Database dialog appears as shown in Illustration 11.


Illustration 11: Completed Create Database Dialog

31.         Leaving the other settings in the dialog at default, click OK.

The new database is created, and we see that it appears in the Database selector box as we are returned to the Database Connection dialog, as displayed in Illustration 12.


Illustration 12: Completed Database Connection Dialog

32.         Click OK.

We are returned to the DTS Package: <New Package> window, once again, where we see the new WebTrafficAnalysisDB connection, alongside our ServerAccessLog connection.

Because no tables exist in the new WebTrafficAnalysisDB database, we will take advantage of the opportunity provided by DTS to create our destination table. As a part of the process, we will use the transformation features of DTS to filter out any part of the log that we do not require, as well as to facilitate other finishing touches.

33.         Select Task -> 3 Transform Data Task from the main menu.

A small icon, labeled Select Source Connection, appears, attached to the cursor.

34.         Position the cursor / icon combination over the ServerAccessLog connection icon.

35.         Click the ServerAccessLog connection, once the cursor is over it.

The ServerAccessLog connection is now designated the source connection. The cursor label immediately becomes Select Destination Connection.

36.         Position the cursor / icon combination over the WebTrafficAnalysisDB connection icon.

37.         Click the WebTrafficAnalysisDB connection, once the cursor is over it.

The icon label disappears, and a directional line, representing the Transform Data Task, is drawn between the two data connections, as shown in Illustration 13.


Illustration 13: Directional Line Connects the New Data Connections

38.         Double-click the Transform Data Task line.

The Transform Data Task Properties dialog appears.

39.         Click the Source tab, as necessary.

40.         Type ETL_ServerAccessLog into the Description box.

The Source tab appears as shown in Illustration 14.


Illustration 14: Transform Data Task Dialog Source Tab

41.         Click the Destination tab.

The Create Table dialog appears. Here we can modify the existing SQL to create the new destination table.

42.         Type the following into the SQL Statement box, replacing the SQL that is in place initially.

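		CREATE TABLE [ServerAccessLog] (
		    -- 11 characters taken from the time stamp by the Date Transformation
		    [Date] varchar (11) NULL,
		    -- address copied unchanged from the first source column
		    [IPAdd] varchar (15) NULL
		)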

43.         Click OK.

The Destination tab appears as shown in Illustration 15 after our changes.


Illustration 15: Transform Data Task Dialog Destination Tab

44.         Click the Transformations tab.

45.         Click Delete All to remove any pre-existing mapping lines.

Illustration 16 represents our starting point in the Transformations tab.


Illustration 16: Transform Data Task Dialog Transformations Tab

46.         Click New.

The Create New Transformation dialog appears.

47.         Select the Middle of String transformation, as shown in Illustration 17.


Illustration 17: Create New Transformation Dialog

48.         Click OK.

The Transformation Options dialog appears, defaulted to the General tab.

49.         Type Date Transformation into the Name box.

50.         Click the Properties button.

The Middle of String Transformation Properties dialog appears.

51.         Set Start Position at 2.

52.         Check the Limit Number of Characters to: checkbox, and set the number at 11.

53.         Leave the rest of the settings at default.

The completed Middle of String Transformation Properties dialog appears as shown in Illustration 18.


Illustration 18: The Middle of String Transformation Properties Dialog
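The effect of these settings can be sketched in a few lines of Python. This is only an approximation of the Middle of String transformation, assuming that Start Position is counted from 1 and that the source field begins with an opening bracket followed by a day/month/year date. The function middle_of_string and the sample value are hypothetical.

```python
def middle_of_string(value, start, limit=None):
    """Return the substring that begins at 1-based position `start`,
    optionally limited to `limit` characters."""
    if start < 1:
        raise ValueError("start is 1-based and must be >= 1")
    begin = start - 1
    end = None if limit is None else begin + limit
    return value[begin:end]


# Hypothetical time stamp field, as it might arrive in Col003.
raw = "[21/Jul/2003:10:15:42 -0500]"

# Start Position 2 skips the opening bracket; the limit of 11 keeps the date only.
date_part = middle_of_string(raw, start=2, limit=11)
```

Under these assumptions, a limit of 11 characters matches the varchar (11) width we gave the Date column, so the extracted value fits the destination without truncation.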

54.         Click OK.

We are returned to the Transformation Options dialog, General Tab, which appears as shown in Illustration 19.


Illustration 19: The Transformation Options Dialog - General Tab

55.         Click the Source Columns tab.

56.         Double-click Col003 on the left to add it to the Selected Columns box on the right.

57.         Click the Destination Columns tab.

58.         Double-click Date on the left to add it to the Selected Columns box on the right.

59.         Click OK.

We are returned to the Transformations tab, which appears as shown in Illustration 20.


Illustration 20: The Transform Data Task Properties - Transformations Tab

60.         Click Test to ascertain the correctness of setup of the transformation.

The Testing Transformation dialog briefly appears, and then is eclipsed by the appearance of the Package Execution Results message box, which announces successful completion of the process, and appears as shown in Illustration 21.


Illustration 21: Package Execution Results - Successful Completion

61.         Click OK.

The Package Execution Results dialog disappears, leaving the Testing Transformation dialog in place, as shown in Illustration 22.


Illustration 22: The Testing Transformation Dialog - Indicating Completion

The Testing Transformation dialog offers us an opportunity to preview the output of our transformation, before we move on to a second one.

62.         Click the View Results button.

The View Data dialog presents a sample view of the results of our transformation, as shown in Illustration 23.


Illustration 23: View Data - Showing a Sample of the Transformation Results

63.         Click OK to return to the Testing Transformation dialog.

64.         Click Done to close the dialog, and to return to the Transform Data Task Properties dialog - Transformations tab.

65.         Click New to once again open the Create New Transformation dialog.

66.         Highlight the Copy Column option to select it.

67.         Click OK.

The Transformation Options dialog again appears.

68.         Type IP Address into the Name box on the General tab.

69.         Click the Source Columns tab.

70.         Double-click Col001, if necessary, to add it to the Selected Columns box on the right.

71.         Click the Destination Columns tab.

72.         Double-click IPAdd to cause it to appear in the Selected Columns space to the right of the tab.

73.         Click OK.

We return to the Transform Data Task Properties dialog, which appears as depicted in Illustration 24 with our additions.


Illustration 24: The Transform Data Task Properties Dialog - With Added Transformation

74.         Ensuring that the new transformation's mapping line is selected, click the Test button.

Indication of a successful test is returned.

75.         Click the View Results button to obtain the sample data set shown in Illustration 25.


Illustration 25: The View Data Dialog with Results
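Beyond eyeballing the sample results, we might check the transformed values in code. The sketch below is one possible sanity check, not part of DTS: it assumes dates of the form 21/Jul/2003 and IPv4 addresses, and confirms that each value fits the column widths we declared. The function is_valid_row is a hypothetical helper.

```python
import ipaddress
from datetime import datetime


def is_valid_row(date_text, ip_text):
    """Check that a (Date, IPAdd) pair fits the destination columns
    and looks like a real date and a real IPv4 address."""
    # Column widths from the CREATE TABLE statement: varchar (11) and varchar (15).
    if len(date_text) > 11 or len(ip_text) > 15:
        return False
    try:
        # Assumed date layout: day/abbreviated month/four-digit year.
        datetime.strptime(date_text, "%d/%b/%Y")
        ipaddress.IPv4Address(ip_text)
    except ValueError:
        return False
    return True


ok = is_valid_row("21/Jul/2003", "192.0.2.10")
bad_date = is_valid_row("2003-07-21", "192.0.2.10")
bad_ip = is_valid_row("21/Jul/2003", "not-an-ip")
```

A check of this kind is most useful when the log contains host names instead of addresses, or when the start position of the date is off by a character, since both problems are easy to miss in a short preview.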

76.         Click OK to close the View Data dialog.

77.         Click Done to close the Testing Transformation dialog.

We are returned once again to the Transformations tab.

78.         Click OK.

The dialog closes, and we are returned to the DTS Package: <New Package> window, once again.

79.         Click Package -> Save As to save our work up until now, and to give the package a name.

80.         Type ETL Server Access Log to Web Analysis into the Package Name box.

81.         Type in / select the appropriate Server name in the Server selector box.

82.         Input the appropriate authentication information, as we did earlier.

The Save As dialog resembles that shown in Illustration 26.


Illustration 26: The Save As Dialog

83.         Click OK.

Now let's execute the package, and then navigate to the table that it creates to ascertain that our package has delivered the expected results.

84.         Click Package -> Execute from the main menu.

Execution progress is metered as the Executing DTS Package dialog appears, and then the Package Execution Results message box reports the successful execution of the package, as shown in Illustration 27.


Illustration 27: Package Execution Results - Successful Execution

85.         Click OK to close the message box.

86.         Click Done to close the Executing DTS Package dialog.

87.         Close the DTS Package: ETL Server Access Log to Web Analysis window.

88.         Expand (by clicking the "+" sign to the left of) the Databases folder, under the Server in use, in the left pane of the Enterprise Manager console.

89.         Expand the WebTrafficAnalysisDB database.

90.         Click the Tables icon.

The tables (largely system tables) appear in the right pane of Enterprise Manager. The new ServerAccessLog table appears, as well, as shown in Illustration 28.


Illustration 28: The WebTrafficAnalysisDB Tables

91.         Right-click the ServerAccessLog table.

92.         Select Properties from the context menu.

The Properties dialog for the table appears (depicted in Illustration 29). We note that the Column Settings appear to meet our specifications.


Illustration 29: The Properties Dialog for the ServerAccessLog Table

93.         Click OK to close the Properties dialog.

Now let's take a look at the data we have loaded via the execution of our DTS package.

94.         Right-click the ServerAccessLog table.

95.         Select Open Table -> Return Top from the context menu and cascading menu.

The Number of Rows dialog appears, as shown in Illustration 30.


Illustration 30: The Number of Rows Dialog

96.         Leaving the number at 1000, click OK.

We see the first 1000 rows of the ServerAccessLog table returned, as partially depicted in Illustration 31.


Illustration 31: Partial Set of Rows Returned from the New ServerAccessLog Table

And so we see that our data appears to have been extracted, transformed, and loaded to the new table as we have directed within our design of the DTS package.
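For readers who want to see the whole flow in one place, the following Python sketch mimics what our package does, using an in-memory SQLite database as a stand-in for MSSQL Server. It is an analogy under stated assumptions, not a DTS implementation: the sample lines, their comma-delimited layout, and the function load_log are all hypothetical.

```python
import csv
import io
import sqlite3

# Hypothetical comma-delimited log lines: address first, time stamp third.
SAMPLE_LOG = """\
192.0.2.10,-,[21/Jul/2003:10:15:42 -0500],GET /index.html HTTP/1.0,200,5120
192.0.2.11,-,[21/Jul/2003:10:16:03 -0500],GET /about.html HTTP/1.0,200,2048
192.0.2.10,-,[22/Jul/2003:08:01:17 -0500],GET /index.html HTTP/1.0,304,0
"""


def load_log(conn, text):
    """Create the destination table, transform each line, and load the rows."""
    conn.execute(
        "CREATE TABLE [ServerAccessLog] ([Date] varchar(11), [IPAdd] varchar(15))"
    )
    rows = []
    for fields in csv.reader(io.StringIO(text)):
        if len(fields) < 3:
            continue  # skip blank or short lines
        ip = fields[0]          # Copy Column: Col001 -> IPAdd
        date = fields[2][1:12]  # Middle of String: start 2, limit 11 -> Date
        rows.append((date, ip))
    conn.executemany(
        "INSERT INTO [ServerAccessLog] ([Date], [IPAdd]) VALUES (?, ?)", rows
    )
    conn.commit()
    return len(rows)


conn = sqlite3.connect(":memory:")
loaded = load_log(conn, SAMPLE_LOG)

# Rough equivalent of Open Table -> Return Top with the number left at 1000.
top_rows = conn.execute(
    "SELECT [Date], [IPAdd] FROM [ServerAccessLog] LIMIT 1000"
).fetchall()
```

The two assignments inside the loop correspond to the two transformations we defined in the Transformations tab, and the final query plays the role of the data browser.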

97.         Close the data browser, and exit the Enterprise Manager console as desired.

We have created a table that contains the data we need to construct our Web Site Traffic Analysis Cube. Several more steps are required before the cube itself exists; we will begin its design and creation in Part II of our lesson.
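As a small preview of the kind of question the loaded data can answer, the Python sketch below counts hits and distinct addresses per date from (Date, IPAdd) pairs. It treats each distinct address as one visitor, which is a simplifying assumption on our part; as noted earlier, hits and addresses are imperfect measures of actual visits. The function summarize_by_date and the sample rows are hypothetical.

```python
from collections import defaultdict


def summarize_by_date(rows):
    """Return {date: (hits, distinct_addresses)} for (date, ip) pairs."""
    hits = defaultdict(int)
    addresses = defaultdict(set)
    for date, ip in rows:
        hits[date] += 1
        addresses[date].add(ip)
    return {date: (hits[date], len(addresses[date])) for date in hits}


# Hypothetical rows in the shape of the ServerAccessLog table.
rows = [
    ("21/Jul/2003", "192.0.2.10"),
    ("21/Jul/2003", "192.0.2.11"),
    ("21/Jul/2003", "192.0.2.10"),
    ("22/Jul/2003", "192.0.2.10"),
]
summary = summarize_by_date(rows)
```

A cube performs this sort of aggregation for us across many dimensions at once, which is why we will move to Analysis Services for the real work.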

Next in Our Series ...

In this lesson, Build a Web Site Traffic Analysis Cube: Part I, we began a two-part lesson that focuses on the design and construction of a Web Site Traffic Analysis Cube. In this lesson, we briefly discussed potential business reasons for collecting web site traffic data. We provided an overview of the Server Access Log, and discussed its use as a source of web site activity tracking data. Finally, we designed and built an extract package using Data Transformation Services, to illustrate one approach for importing and transforming simple data from the Server Access Log into a table that we created in MSSQL Server. As an integrated part of the process, we created a destination table to serve as a data source for our new web traffic analysis cube.

In Part II of this two-article lesson, we will begin the design and creation of our Web Site Traffic Analysis Cube. Along the way, we will discuss some of the considerations and challenges encountered in designing a cube of this type, and we will demonstrate approaches for meeting the challenges. We will use Analysis Services to build a simple Web Site Traffic Analysis Cube, and then we will browse the new cube to learn about the visitors to our web site.
