Both the Importing XML Data into MySQL Tables Using a Stored Procedure and the Enhance Your MySQL XML Import Procedures using Prepared Statements tutorials described a way to overcome MySQL's import restrictions using the "proc-friendly" Load_File() and ExtractValue() functions. While by no means a perfect solution, it got the job done. Now it’s time to take a step back and ask ourselves how to best load large amounts of external data into our MySQL database(s) as efficiently as possible.
A Two-pronged Approach
There are two parts to importing data:
- Reading the data
- Inserting it into tables
We’ll address each of these separately as they are quite distinct from one another.
Part 1: Reading the data
Before we research how to parse XML as quickly as possible, the better question is, what format should we employ to transfer data? Now, you may not have a say in the matter, but if you do, I would urge you to explore other options, such as JSON, YAML, or even old-school CSV.
XML: Pros and Cons
XML has been used extensively for many years now, and with good reason; XML is readable by people, as well as machines. It also gives us the ability to create a vocabulary, which can then be utilized to describe the data. For example, when exchanging data between computers, the number 42 is meaningless unless you also exchange the meaning of the data, for instance, the CPU temperature expressed in degrees Celsius.
Moreover, Data Type Definition (DTD) and XML schema documents allow for the specification of constraint rules which are applied to the contents of XML instance documents.
The downside of the XML format is that it is not as easy to parse as newer formats. It’s also extremely verbose due to the inclusion of metadata.
Choose your Language Wisely
Today’s 3rd and 4th level programming languages have made the developer’s job easier than ever, at the expense of processing speed. By providing a higher level of abstraction of the internal computer hardware details, languages have become more programmer-friendly, powerful and versatile. So, when you ask yourself, which programming language would be better, PHP, Python, C#, or Java, the answer is none of the above. Assembly language is the fastest language because of its strong (generally one-to-one) correspondence between the language and the architecture's machine code instructions. Trouble is, very few developers – including myself – have any expertise in assembly coding.
From there, languages that compile to a machine’s native code are fastest. That’s why C and C++ are faster than Java and Visual Basic. Java compiles to byte code and then is interpreted at run time. Interpreted languages are slow because the run time engine has to read through the human-readable code and then translate it into machine code.
If you’re not keen on (re)learning an old language, the next best thing you can do is find a library that makes working with your preferred data transfer format more efficient. For example, I like coding in PHP and transferring data in JSON format. One of the reasons I choose this combination is that PHP provides the encode_json() and decode_json() functions for working with JSON. Not a PHP fan? The json.org site has helper libraries for just about every language in use today.
Different models of data parsing is appropriate in different situations, depending on performance and memory consumption requirements. The following models are in widespread use:
- One element at a time processing: this approach is better suited to formats that contain one record per line, such as CSV. It uses the least memory, but tends to be quite slow due to all of the read operations.
- One node at a time: The parser loads the data according to its structure in the document. The context needed for parsing is limited by the tree depth of the current element, which means that the memory requirements are greatly reduced. This type of parsing can be used for streaming documents where only part of the document is available at any single time.
- Parsing the entire document as a text stream/buffer, from which the parser generates a document object–an in-memory representation of the entire document tree, which has an individual object for each specific XML element or attribute, and a set of allowed operations (e.g., “get all child elements of this node”). The choice of the parsing model usually depends on the document size and structure. It is effective for documents that:
- are small enough to fit in memory,
- have a complicated structure with references between nodes that need to be traversed, or
- need complicated document transformations.
Part 2: The Database
Tweaking MySQL for speed is something that people have been doing for many years, so a lot of solid practices have been established. There is also a lot of information to be gleaned on the art of optimization in the MySQL docs. In the section on Optimizing INSERT Statements, they list the following factors as contributors to the time required for inserting a row, along with their approximate proportions:
- Connecting: (3)
- Sending query to server: (2)
- Parsing query: (2)
- Inserting row: (1 × size of row)
- Inserting indexes: (1 × number of indexes)
- Closing: (1)
There is also some initial overhead to open tables, which is done once for each concurrently running query.
Finally, the size of the table also slows down the insertion of indexes, with larger tables taking longer than smaller ones.
Using the above information as a guide, we can make the following adjustments:
- If we are inserting many rows from the same client at a time, we can include multiple VALUES lists to insert several rows at a time in our INSERT statements. This is much faster (many times faster in some cases) than using separate single-row INSERT statements.
For adding data to a table that already contains data, we can tune the bulk_insert_buffer_size variable to make data insertion faster still.
- Although it’s infinitely preferable to load data from memory, if we must read from a text file, we should employ LOAD DATA INFILE. It’s usually 20 times faster than using INSERT statements.
- We can take advantage of column default values and insert values explicitly only when the value to be inserted differs from the default. This improves the insert speed by reducing the amount of parsing that MySQL must perform.
I would be remiss if I didn’t touch on storage engines. InnoDB is the default and most general-purpose storage engine, and Oracle recommends using it for tables except for specialized use cases. In fact, the CREATE TABLE statement in MySQL 5.7 creates InnoDB tables by default. This is for good reason; InnoDB has a lot of outstanding features including ACID compliance, transaction support, and row-level locking, to name just a few. That being said, you may want to explore the numerous other engine options, starting with MyISAM. Rather than list the many factors to consider, I’ll refer you to Jaime Crespo’s excellent post entitled Testing the Fastest Way to Import a Table into MySQL. He did a tremendous job of comparing various strategies and configurations.
There is so much more to speeding up data imports than can be covered in a single article. We’ll come back to this important topic soon!