Import XML Data Into Microsoft Access

Friday Feb 20th 2004 by Danny Lesandrini
Share:

If you have been searching the Internet for an 'Import XML into Access' solution, then congratulations, you've found it. Danny Lesendrini shares an import routine that refreshes data in an Access 2000 database from an array of XML files.

Back in November 2003, I was assigned the task of creating an import routine that would refresh data in an Access 2000 database from an array of XML files.  For some reason, I thought it would be easy, as in "one touch import" or something.  As it turns out, it is not that easy.

I searched the Internet for suggestions, but found little.  I am not sure what I was hoping for, but I thought there must be some kind of example out there, or maybe even a third party utility to simplify the process.  I even asked our local Access MVP for a quick fix, but he assured me that I would have to write a custom import routine from scratch.

Well, I did slog through the code, all the while vowing to turn it into an article someday.  What follows is the product of that vow.  If you have been searching the Internet for an Import XML into Access solution, then congratulations.  If not, and you just stumbled upon this article by accident, then bookmark this page.  The way XML is being used these days, it is just a matter of time before you'll be needing it.

The Problem Defined

The project that inspired this article is the quintessential example of where XML is useful.  My client had outsourced their help desk management to a company using non-Microsoft technologies, hosted off-site, behind a firewall.  They wanted metrics reports, but the help desk web interface had no such provision.  As a compromise, an arrangement was made to have daily data dumps sent to my client so that they could create their own reports, using Microsoft Access.

Each night, help desk data files were zipped up and FTP'd to our site. One XML file would be provided for each PROBLEM ticket modified the previous day.  The Microsoft Access import utility I created would run on schedule, looking for any new zip files, unpacking them and processing each XML file in turn.  The associated download for this article contains only the code for importing the XML. If you are looking for code to unpack a zip file using VBA and a third party DLL, drop me a line and I will send it to you.

WARNING:  Each XML import will be unique.  The download for this article is an example, not an import utility.  At best, all you will be able to leverage from the downloaded code is some simple file looping routines and the general process idea.

One of the caveats of working with XML is that there needs to be agreement, a contract of sorts, between the parties exchanging data. The structure or schema of the data should not change, or else the import process will break. For the example considered in this article, the following schema was agreed upon:

XML file schema example 

You will notice in the schema image above that each child of the PROBLEM node contains child node(s) named DATA.  While this XML is well formed, the convention of naming all lower level child nodes DATA is a bad practice.  I loaded one of these XML files into Visual Studio .Net to see if I could create a schema against which incoming XML could be validated, but I received the following error message.
 
XML Schema error message 

I am sure that by labeling this a "bad practice" I am going to hear from non-Microsoft programmers, explaining why this is wholly justified.  Quite frankly, I will not be able to defend myself.  I do not know that much about XML.  What I do know is that I had wanted to use the DOM to validate incoming XML files against a predefined schema, but the construction of this XML made that difficult, if not impossible.  I guess my point is that what you see in this example is probably the worst-case scenario of what you should expect when importing XML data in to Access.

Additionally, it may not be obvious from the above illustration but each PROBLEM has one and only one parent record whose attributes are stored in the PROBLEM_DETAIL node. For each PROBLEM_DETAIL record, there are one or more child records in each of the following nodes:  CALL_NOTES, PROBLEM_NOTES, OUTAGES, RCA, PROBLEM_AUDIT_TRAIL and PROBLEM_HISTORY.

I started by creating the Microsoft Access tables and their relationships, with cascading deletes to all the foreign key tables.  The mapping between Node names and Table names was necessary to conform to my client's naming standards.  You could, of course, match table names to node names, as well as field names to child node names.  The code for this example aliases everything, which is a pain, but makes the solution more flexible.  Below is a summary of the XML node names, their corresponding Access table and the Access relationship diagram of the finished table schema. 

XML Node Name

Access Table Name

 PROBLEM_DETAIL

 PRBLMS

 CALL_NOTES

 PRBLM_CALL_NTS

 PROBLEM_NOTES

 PRBLM_NTS

 OUTAGES

 PRBLM_OTGS

 RCA

 PRBLM_RCAS

 PROBLEM_AUDIT_TRAIL

 PRBLM_ADT_TRL

 PROBLEM_HISTORY

 PRBLM_HSTY

Table relationship diagram of Access database

 
One last project requirement that influenced my code below is that incoming help desk records may or may not have a corresponding record already stored in my Access database.  For example, say a ticket was opened on Tuesday, handled on Wednesday and closed on Thursday.  For the sake of my project, this would mean that I would get an XML file for this issue on Wednesday, Thursday and Friday, each describing its status on the previous day.  My approach to this requirement was first to perform a delete to remove any existing record before inserting the latest values.  I found this easier than trying to identify which columns had changed and updating an existing record.  For my needs, this worked fine, but you will have to consider how you want to approach this question. 

Import Utility Interface

The word "utility" should be taken with a grain of salt here.  It is a utility in the sense that it is MY utility to import MY data into MY database. To employ this process with your data, you will have to virtually rewrite the entire module that performs the data pump.  However, before we get into that, I wanted to introduce you to the user interface I created for demonstrating this process.  It is available in the download   for this article.

Click for larger image
XML data import utility
 

The first two buttons were added so that readers of this article could view the XML schema and table relationships diagram more easily while examining the sample code. The actual import is initiated by clicking the Run Import button.  Once clicked, it becomes disabled and the Cancel button is enabled.  Clicking Cancel interrupts the import process and returns control to the user.  This is critical when processing hundreds or thousands of files, since that could require an hour or more, depending on the complexity of your data.  The sample data includes 18 XML files, two of which contain errors.

The single text box above contains the path to the unpacked XML files.  Remember, for the sake of this article we are assuming that the files are already unzipped. In my example, the source files, once processed, remain if the folder for future test runs. You will need to modify my sample code to either delete or move files once they have been processed.

The top listbox shows XML files that could not be processed because they were ill formed or because they contained an illegal character.  The bottom listbox shows the records corresponding to the 16 files that were successfully imported. The six right-most columns of this listbox display the record count for each foreign key child table. Once finished, a log file is created in the same directory as the utility to keep track of what was just processed. This log file gave us an easy way to check to see if the nightly import ran and if so, how large the import was.

How The Real Work Gets Done

Up to this point, everything has been either theory or window dressing.  The real work of this utility is performed in the module named, BAS_IMPRT_PRBLM_XML by a procedure called ProcessSingleProblemXML().  It has been reproduced below with inline comments that explain each step of the process.


Private Function ProcessSingleProblemXML(ByVal sFile As String) As Boolean
On Error Resume Next

' This function will process a single Problem XML file, assuming
' its structure hasn't been changed.  (While XML is touted as the
' greatest thing since sliced bread, it has this singular flaw:
' the file must never change or the code below will break!)
'
' Each Problem file has the following data:
'
'     PROBLEM_DETAILS      (one node per problem)
'     CALL_NOTES           (many nodes per problem)
'     PROBLEM_NOTES        (many nodes per problem)
'     OUTAGES              (many nodes per problem)
'     RCA                  (many nodes per problem)
'     PROBLEM_AUDIT_TRAIL  (many nodes per problem)
'     PROBLEM_HISTORY      (many nodes per problem)
'
    ' XML DOM objects required for walking the Problem.xml tree.
    Dim objDomDoc         As New DOMDocument
    Dim objNodeData       As IXMLDOMNode
    Dim objNodeItem       As IXMLDOMNode
    Dim objNodeProblem    As IXMLDOMNode    ' Top most node
    Dim objNode_DETAILS   As IXMLDOMNode    ' Details node
    Dim objNode_CALL_NTS  As IXMLDOMNode    ' Call Notes node
    
    ' The iElement variable will be used for each recordset.  It will
    ' be used to loop through all elements of the node corresponding to
    ' each field in the corresponding temp table.  This process will be
    ' repeated for each of the 7 primary tables.
    Dim ielement          As Integer
    Dim sProblemID        As String
    Dim dbs               As DAO.Database
    Dim rst               As DAO.Recordset
    Dim sSQL              As String
    Dim sField            As String
    Dim sColumn           As String
    Dim sValue            As String
    
    ' Open a DAO Database connection
    Set dbs = CurrentDb
    
    ' Verify that such a file exists.  If not, then exit
    sFile = cXML_File_Folder & sFile
    If Dir(sFile) = "" Then Exit Function
    
    ' Open DOM object by loading the xml file.
    objDomDoc.async = False
    objDomDoc.Load sFile
     
    ' Create the top-most node object from which all others will be derived
    ' I believe there's a way to reference nodes by name but I had difficulty 
    ' getting this to work correctly.  Accordingly, I refer to child nodes by
    ' their ordinal position throughout my code.
    '
    ' I begin by setting a DOM Node object to the top-most parent Node.
    Set objNodeProblem = objDomDoc.documentElement.childNodes.Item(0)
    
    ' If this XML file cannot produce the node object, report the error.
    If Err.Number = 91 Then
        Set rst = dbs.OpenRecordset("FIL_IMPRT_ERRRS", dbOpenDynaset)
        rst.AddNew
        rst!TYP = "PRBLM"
        rst!BAD_FIL_NM = Replace(sFile, "Problem", "Reject")
        rst!ERRR_DAT = Now()
        rst.Update
        rst.Close
        Set rst = Nothing
        
        ProcessSingleProblemXML = False
        Exit Function
    Else
        ProcessSingleProblemXML = True
    End If
    
    ' If we made it this far, then the file is valid and we may continue.
    ' Capture the PROBLEM_ID because without it, we cannot proceed
    '
    ' Again, we're using ordinal location to navigate to that PROBLEM_ID value.
    '   objNodeProblem.childNodes(0) is the PROBLEM_DETAILS node
    '   objNodeProblem.childNodes(0).childNodes(0) is the DATA node
    '   objNodeProblem.childNodes(0).childNodes(0).childNodes(0).Text is PROBLEM_ID
    sProblemID = Nz(objNodeProblem.childNodes(0).childNodes(0).childNodes(0).Text, "")
    
    If sProblemID = "" Then
        ' Missing PROBLEM_ID from XML data file.  Cannot continue
        Exit Function
    Else
        ' The XML file has a value for PRBLM_ID, so we may proceed.
        ' Start by deleting any existing record.
        sSQL = "DELETE FROM PRBLMS WHERE [PRBLM_ID]='" & sProblemID & "'"
        dbs.Execute sSQL
    End If

' /////////////////////////////////////////////////////////////////////////// 
' Process PROBLEM_DETAILS Node first, creating new PROBLEM_ID  record. 
' NOTE: Using WHERE clause of 1=0 returns an empty recordset, ready for
' record additions. Saves on the overhead of large recordsets.  
' 
    ' REMEMBER: There is only one  PROBLEM_DETAIL per record. (parent record)
    ' As a sort of shorthand, I'm using the objNodeProblem DOM object created 
    ' above as a starting point from which I grab the DATA grandchild.
    Set objNode_DETAILS = objNodeProblem.childNodes(0).childNodes(0)
    
    sSQL = "SELECT * FROM PRBLMS WHERE 1=0"
    Set rst = dbs.OpenRecordset(sSQL, dbOpenDynaset)
    rst.AddNew
    
    ' The PROBLEM table has 52 columns.  The child nodes are processed in
    ' what ever order they come.  The corresponding table field is looked
    ' up through the GetFieldName() function. (see below)
    For ielement = 0 To 51
        Set objNodeItem = objNode_DETAILS.childNodes(ielement)
        If Nz(objNodeItem.Text, "") <> "" Then
            sField = objNodeItem.nodeName
            
            ' As mentioned above, I use aliases for all of the field names.
            ' This function, GetFieldName() returns the field name alias.
            sColumn = GetFieldName("PRBLMS", sField)
            
            ' I won't bother to explain this, other than to say that the 
            ' data coming in was dirty and had to be "cleaned up".
            ' Only add valid, non zero length values
            sValue = URLDecode(objNodeItem.Text)
            If sValue <> "" Then rst(sColumn) = sValue
        End If
        Set objNodeItem = Nothing
    Next ielement
    
    rst.Update
    rst.Close
    Set objNode_DETAILS = Nothing
    

' ///////////////////////////////////////////////////////////////////////////
    ' Process CALL_NOTES Node second.  Loop through for multiple records.  This
    ' node represents a child table, which can have many records for each detail.
    ' Again, I derive the node object from objNodeProblem created above.
    Set objNode_CALL_NTS = objNodeProblem.childNodes(1)
    
    sSQL = "SELECT * FROM PRBLM_CALL_NTS WHERE 1=0"
    Set rst = dbs.OpenRecordset(sSQL, dbOpenDynaset)
    
    ' This time, we loop through the child nodes, since we don't know how
    ' many (if any) exist.
    For Each objNodeData In objNode_CALL_NTS.childNodes
        rst.AddNew
        
        ' While the XML Schema doesn't include a PROBLEM_ID field with each of
        ' it's child nodes, my tables to.  It is this field that is used to  
        ' enforce referential integrity, so I add the ID to each new record.
        rst!PRBLM_ID = sProblemID
        
        ' Again, we know ahead of time that this node has 6 child nodes.
        ' We MUST know this and it must be accurate for the code to work.
        For ielement = 0 To 5
            Set objNodeItem = objNodeData.childNodes(ielement)
            If Nz(objNodeItem.Text, "") <> "" Then
                sField = objNodeItem.nodeName
                sColumn = GetFieldName("CALL_NT", sField)
                
                ' Only add valid, non zero length values
                sValue = URLDecode(objNodeItem.Text)
                If sValue <> "" Then rst(sColumn) = sValue
            End If
            Set objNodeItem = Nothing
        Next ielement
        rst.Update
    Next
    
    rst.Close
    Set objNode_CALL_NTS = Nothing

    
' ///////////////////////////////////////////////////////////////////////////
    ' Process PROBLEM_NOTES Node third.  Loop through for multiple records.					
	' ...
	'
	' From this point on, the above code is repeated for all nodes corresponding
	' to all 6 child tables.  It's nearly identical, so won't be repeated here.
	'
	' That's all there is to it, really.

In Conclusion

If I knew more about XML, I could probably have written some really slick and efficient DOM code but unfortunately, I do not and I did not.  The good news is that if I can do this without knowing a whole lot about XML and DOM, then so can you.  The biggest trick in the code above is figuring out how to walk the XML tree, finding the data you need.  As mentioned in the code comments above, I decided to do that by identifying the ordinal position of the child node I was looking for.  If your data is stored in attributes instead of node values, then you will have to figure out how to extract the values you are looking for.  (My favorite XML reference is a book by Mark and Tracey Wilson named XML Programming with VB and ASP. It is very helpful for VB programmers seeking to get up to speed on XML fast.)

While doing this article I searched the newsgroups looking for a utility or at least some suggestions on how to approach this issue.  I did not find anything very different from what you see above.  For now, at least, this is going to be a very manual process.  If your goal is similar to mine, a regular import of similar XML files , which share the same schema, then the paradigm described above will be of use.  If you want to casually import random XML files on an irregular basis, then I suggest you play around with the XML import wizard in Access 2003.  I tried it on the XML used for this article, and the wizard fell on its face.  I wish I had a simpler solution and as usual, if any reader has a better suggestion, I am all ears. 
 

» See All Articles by Columnist Danny J. Lesandrini

Share:
Home
Mobile Site | Full Site
Copyright 2017 © QuinStreet Inc. All Rights Reserved