Designing an optimized schema in a Relational Database Management System is crucial for performance. One important aspect of optimization is the complexity of the queries that will run on the schema. If schema is considered simply as a model or layout about how data is to be stored in an orderly fashion, it can take a toll on the performance in real action.
So, you must tune a schema specifically for performance. This must be done right from the design phase, including estimating specific queries that you are supposed to run. You must put real emphasis on the measure of performance for different kinds of queries. Because queries rarely run in isolation, changes to one query or a part of the schema can have a ripple effect somewhere else. This however has its own trade-off because if you do something such as add indexes to boost the retrieval performance, updates are definitely going to be slow. Similarly, a schema that is not normalized might leverage some query performance but also might slow others. Sometime counters or summary tables might be added to optimize query performance, but they are expensive to maintain. So, in a nutshell there is not one rule of optimization only best practices or human discretion.
Optimizing the Use of Data Types
One of the important aspects of optimal schema design is to choose the right data type for the specific data one wants to store. This is crucial for performance. Unlike many RDBMS, MySQL provides a variety of data types to choose from.
You must choose the smallest data types that correctly represents the data. The smaller your data types are, the faster the retrieval process because they not only take less space on the disk, but also take less memory or CPU cache during query processing. The range of values must also be chosen optimally. A smaller range that foreseeably might not be exceeded is more acceptable. Remember that a simple data type such as integers are cheaper than characters. A character data types involve collation and character sets that makes comparison complicated and queries expensive.
The date and time should use built-in types instead of strings when possible. One should preferably use integers to represent key fields such as primary keys or indexed fields as far as possible. For example, one can use integer types to represent fixed large digit numbers such as ISBN field for a BOOK schema to leverage performance.
A field should be specified explicitly as NOT NULL if you do not have specific reason to store NULL. It is hard for MySQL to optimize a query that has nullable columns. Creating indexes, index statistics or doing any sort of value comparison becomes complicated with NULL columns. In fact, it is better to use simple a zero, a special value or even an empty string instead of NULL to represent a no-value in a field. Thumb rule is never to use a nullable column as indexes or avoid them as possible.
Refer: MySQL Data types in MySQL Reference Manual for more details on data types.
Working with Indexing
Indexes are data structures to leverage efficiency in the data retrieval process. There are many types of indexes and not all are implemented by every DBMS. Since indexes are implemented in the database engine and not on the database server there is no standardization. Same technique may be implemented differently in different engines. MySQL supports four types of indexes such as follows, with a very concise idea on each of them.
- B-Tree index: This is the most common indexing type that use B-Tree data structure to store its data. It speeds up data access by scanning the root node which holds pointers to child nodes. The storage engine picks up the right pointer by looking at the value in the node pages that define the upper and lower bound of the values in the child nodes. MySQL uses B-Trees for CREATE TABLE statements.
CREATE TABLE Product( p_code INT PRIMARY KEY, p_name VARCHAR(50) NOT NULL, p_generic_name VARCHAR(50) NOT NULL, p_descVARCHAR(128) NOT NULL, KEY (p_name,p_generic_name) );
- Hash index: Hash index uses a hash table to look up exact value using every column in the index. Storage engine computes hash code of the indexed columns and stores the code in the index and stores pointers to each row in a hash table. Only memory storage engine supports explicit hash indexes in MySQL. These index types are default for memory tables although memory tables can have B-Tree indexes as well.
CREATE TABLE Product( p_code INT PRIMARY KEY, p_name VARCHAR(50) NOT NULL, p_generic_name VARCHAR(50) NOT NULL, p_descVARCHAR(128) NOT NULL, KEY USING HASH(p_name) ) ENGINE=MEMORY;
Refer: Comparison of B-Tree and Hash Indexes for more information.
- Spatial index: Spatial index or R-Trees is used with geospatial types such as GEOMETRY. For storage engines that support nonspatial indexing of spatial columns, the engine creates a B-tree index. A B-tree index on spatial values is useful for exact-value lookups, but not for range scans. MySQL can create spatial indexes using syntax similar to that for creating regular indexes but using the SPATIAL keyword. Columns in spatial indexes must be declared
NOT NULL. MySQL creating spatial index documentation.
CREATE TABLE geom (g GEOMETRY NOT NULL SRID 4326, SPATIAL INDEX(g));
- Full-text index: Full-text indexes are used to assist in the full text search query where keyword searches are based on relevance rather than comparing values to each other. It is a special type of B-Tree index with two levels. First level holds the keywords and the second level holds the associated document pointers that points to the text collection that contain the keyword. The indexing method uses specific technique to prune the keyword. Refer MySQL Reference Manual for more details on Full-text search functions.
Using a correct indexing strategy is crucial for query performance. Only practice over time can help in choosing the right one under circumstance. There are many things to consider while choosing optimum index. Here are a couple of considerations:
- Simplify your WHERE criteria to make sure that the indexed column is on the one side of the comparison operator. For example, following is a bad query for performance because MySQL will not use an index on columns unless it is isolated in the query:
SELECT p_code FROM Product WHERE p_code + 1 = 101;
- For long character columns, leverage performance by indexing the first few characters only instead of the whole string. This not only makes an index use less space but also make is less selective.
ALTER TABLE Product ADD KEY(p_name(5));
In this example code, the first five-character prefix is used to make an index smaller and faster. But there is a problem: MySQL does not support ORDER BY or GROUP BY queries with prefix indexes.
Normalization and Denormalization
Normalization itself is a technique of schema optimization and helps in eradicating spurious tuples. Denormalization on the other hand is the opposite of it where information is duplicated and stored in multiple places. Typically, you should aim for the highest normalization of the database. But remember that queries can become expensive and make some indexing strategies impossible because columns that are strewn across multiple tables may have been better if they were together in a single table.
Sometimes denormalization schema works well simply because everything is in the same table. This is great for indexing. Query retrievals are fast, but you should also remember that CRUD operations can be highly unreliable in denormalized relational schemas.
So, in a nutshell you should be extremely careful to go in either extreme of normalization or denormalization. Sometimes it is better to stay in between and make some schema normalized with some denormalized only for the sake of performance with a cautious examination of all the pros and cons.
After creating table and adding indexes you should focus on finding and fixing corruption, maintain accurate index statistics and reduce fragmentation. Data corruption can occur due to internal bugs in MySQL or system failure. A corrupt index can retrieve undesirable results for queries. Indexes should be checked, and corruption should be repaired.
In case the storage engine does not provides accurate information to the optimizer about number of rows the query will examine, you can regenerate them by executing ANALYZE TABLE. This statistic is crucial for the optimizer to make correct decision. However, the implementation of statistics is different in each storage engine, for example, memory storage engines do not store index statistics at all. Whereas MyISAM stores statistics on disk on the other hand InnoDB uses memory for the purpose. Therefore, the exact way to run ANALYZE TABLE differs in each of them.
If the B-Tree index gets fragmented it can lag the performance. The fragmentation can occur when the row is stored in multiple locations or when logically sequential pages or rows are not stored sequentially on disk. In such a case you can defrag the data by running OPTIMIZE TABLE or using the technique of dump and reload the data.
Although optimization of schema is essential, a good strategy only comes through experience. There are good practices for schema design but not all are good for performance. Here I have hinted at only a few techniques that may help in schema design. You can get many such ideas thinking on the same line. Remember that optimization is always a trade-off between performance and space.
# # #