There is no silver bullet in the database world. Which database to choose depends on who you are, what youre trying to achieve, how much data you want to store, which OS/Language platform you use for applications, your budget, whether you need a data warehouse, a BI or decision supporting system on top of it, etc.
Many friends, developers, application and system architects and sometimes even system administrators often ask me the same types of questions: Bo, should I use MS SQL or Oracle database for my application?, Bo, should I migrate my Sybase databases to MS SQL?, Which database (platform) is the best?, Why dont we use open source databases such as MySQL or PostgreSQL? etc. All of these questions are essentially the same--which database platform fits the best in the application, or which database platform makes the most sense for the IT and business?
To answer any of those questions, Id ask them a dozen more questions before I could potentially recommend anything. I've been a database professional for one and a half decades, and often times people dont understand why a simple question from them would trigger so many questions from me. Hence, I've developed a reputation of over engineering, or over analyzing.
Many times, the interested person didnt even know why s/he even needed a database, what a relational database is, and what the differences (the pros and cons) of a spreadsheet, an Access database, or a MS SQL database are, or what the application can benefit from housing the database on a RDBMS platform.
While Im not here to preach about what a database is, as I already know it means different things to different IT/non-IT people. I dont want to be called a database-centric nerd, even though sometimes I dont think I can avoid that complement. However, I do think in the 21st century, people have been talking about BI (business intelligence) and I also think the most flexible, most complicated, and most maintainable systems are data driven. Therefore, Ill just give a brief introduction of the most popular database platforms that are available, and what people use them for, based on my experience and interaction with other database pros in this messy IT world.
Below are the databases I personally have supported/developed since I was in college about nineteen years ago:
- dBase/FoxBase/FoxPro in the DOS/Novell/Windows 3.x world
- MS SQL Server (6.5,7.0,2000,2005 and 2008)
- Oracle (7, 8/8i,9i,10g)
- Sybase SQL/ASE Servers (4.9.2, 10, 11, 11.9.2, 12/12.5, 15.x)
- Sybase IQ 12.6/12.7
- Informix/Dynamic (6, 7)
- MySQL (4, 5/5.1)
- PostGreSQL (8.3.5)/EnterpriseDB.
In addition, I've participated in the evaluation of time series databases, such as Vhayu and OneTick. I also have a little experience with in-memory databases used in the trading industry, such as Oracle TimesTen and Sybase ASE in Memory version (one component of Sybase RAP trading edition, which has become a strong competitor in the time series DB arena, especially since they integrated with CEP earlier last year).
I've worked as either as a full-time employee or consultant for different companies on different projects, and really think there has been no silver bullet in the database world. Which database to choose really depends on the following criteria:
- Who you are
- What youre trying to achieve (business/functionality requirements, performance/reliability/scalability/availability requirements)
- How much data you want to store in the current databases before archiving them
- Which OS/Language platform you want to choose for the application
- How much money you can budget for it
- Whether you want/need to build a data warehouse, a BI or decision supporting system on top of it eventually, etc.
Comparing different database systems is just like comparing different development platforms and OS platforms:
- Java/J2EE to (C#, VB)/.NET
- Windows to Linux/Unix
- Linux to proprietary Unix systems, such as AIX/HP-UX/Solaris/Tru64/Irix
- Unix to OpenMVS/UniSys/Tandem/IBM Mainframe
If we do not need the database to be in the critical path (due to concerns such as locking/blocking, latency, slow I/O, etc.) of any front office application systems such as order entry, banking or telecommunication systems, and we just want the database to store the after fact data or configuration/reference information as a data repository, or if we want it as either a backup/alternative way of saving data in database rather than in files, then virtually any RDBMS that support a distributed system will do:
- MS SQL
- PostGreSQL, etc.
Of course, they all provide different features and have their own flavor of SQL language, even though they all claim to comply with the ANSI SQL , SQL-92 standards, etc. Again, this is just like Websphere/WebLogic/Oracle App Server/Sybase EServer/JBoss--all different implementations of J2EE, and they all have to be compliant and compatible with J2EE specs, yet with their own flavor and expansion.
Mission-critical (OLTP) and High Performance System
However, if we need the database to play a more critical role in a mission/business-critical system, and we want everything, such as no down time/High Availability/clustering, reliability/point-in-time recovery, scalability (for both performance and data volume), replication for DR(disaster recovery) and reporting purposes, with extreme high speed and low latency (HPC), then we will probably need an RDBMS with enterprise features, which will incur high coste.g., for some of the examples in the TPCC results, the total system cost could be up to multi-million. Below is a comparison of the database platforms based on those critical, high performance enterprise features.
HA Clustering Features
The clustering capabilities of various systems are as follows:
- Oracle RAC is the best for HA clustering and load balancing and scalability.
- MS SQL Server Cluster provides HA failover but no load balancing.
- Sybase ASE 15 Cluster achieves the same thing as Oracle RAC.
- TeraData is a networked/distributed DB platform since birth.
- MySQL Cluster claimed higher TPC-C performance benchmarks than Oracle RAC.
- IBM DB2/UDB works very well with IBMs pSeries/Power server running AIX HACMP(High Availability Cluster Multiprocessing), and Veritas VCS/Sun Cluster etc.
High performance/low latency/In memory DB
- Oracle provides TimesTen in-memory database.
- MySQL's cluster uses an in-memory storage engine.
- Sybase provides ASE in memory database.
- Alternatively, we can use RAMSAN/SSD (Solid State Disk) to house MS SQL databases, then all the database related disk I/O will happen to the SSD rather than to traditional physical disks, hence we can dramatically reduces disk I/O latency and increase I/O throughput, and improve overall database performance.
- There are other in memory DBs, such as SQLite or eXtremeDB, but most of them either only support embedded systems or just support a single user/connection at a time.
- With Oracle RAC and Sybase ASE 15 Cluster, if the existing hardware cannot keep up with the performance requirement, we can simply add a beefier box to the cluster and gradually replace all other nodes, which protect our existing investment better.
- With MS SQL Cluster, since it only supports Active/Passive mode (instance wise), there is no scalability at the instance level. We have to replace the whole cluster (or pair by pair), if the existing hardware has been throttled.
- MySQL cluster also supports adding nodes to the cluster with all nodes being active, but it is a share nothing cluster. More nodes mean more copies of databases, more storage and more network traffic for data replication among all nodes in the cluster.
- Sybase. (OK, not to sell Sybase, because people think Bo is just a Sybase guy). Sybase has the best replication solution, which out performs Oracle's Data guard, or MS SQL Server's replication (even with its mirroring/log shipping/transactional replication, MS SQL's replication is still far from Sybase's replication server).
- MS SQL. MS SQL replication has log shipping and database mirroring for warm/standby DB solutions (active/passive, r/w and read-only), and it supports transactional replication with the granularity at the article/object level (per table).
- Oracle. Oracle's Data Guard: One of the best ways for database cloning and one of the best DR solutions (using RMAN).
- MySQL. MySQL provides shared nothing cluster, which basically is synchronous replication among all active nodes. And its asynchronous replication also supports at the transaction and object level.
- Storage level solution. With EMC's BCV, NetApp's Snap Manager, Hitachi's snapshots, Veritas' volume replicator, data replication can happen at the disk block/volume level. However, the replicant DB/volume is offline all the time until mounted to OS and presented to DB servers.
Sybase's replication is one of the main reasons why Sybase ruled Wall Street in the 1990s and is still used by many large shops (e.g., Citadel, Goldman Sachs, Bank of America, UBS, Chase, Barclays, etc.) for both heterogeneous replication and homogeneous replication.
Without wasting too many words about Sybase replication, Sybase replication can be implemented at the database level (active/active replication, better than SQL's mirroring), and transaction level (like SQL's transaction replication) with RPC, table level, stored proc level, function string level support, and MSA (multiple-site availability), and of course, log shipping (poor man's solution, in the replication world). Its mirror activator, in conjunction with EMC/SRDF, literally ensures true ZDL (zero data loss) for both planned and unplanned system outages.
Typically, middle office and back office databases house a great deal of large volume data, and performance issues usually appear during data retrieval (data sorting, grouping, summarizing, and calculating), which when running slowly will in turn affect the data writing process.
Therefore, for those types of applications, we want a database that handles large volume data very well and can also handle the reader/writer blocking issue very well.
Oracle is the best for this type of system, because in Oracle, reader and writer don't block each other; yet there's no dirty read for the reader and the data during the whole reading process/session/transaction is consistent. The overhead happens at the memory level by playing with latch and redo logs.
MS SQL 2005's new snapshot isolation using row versioning closely achieved what Oracle has, and the overhead is that there is more I/O incurred in tempdb.
Sybase ASE still has the reader and writer block each other. However, Sybase IQ has resolved the blocking issue by using snapshot versioning as well.
There are other databases that don't have the blocking issue, but some of them are not even ACID (atomicity, consistency, isolation, durability) compliant.
Companies build data warehouses to store/archive all historical data, and mine whatever data they feel needs to be put in database. Typically, data is denormalized into the data warehouse, and multi-dimensioned to provide fast retrieval for OLAP and business intelligence (BI) solutions.
Data warehousing is also useful for back testing to come up with new or enhanced business strategies. In addition, data grows exponentially in a data warehouse (cube) instead of linearly.
Advantages of Column-based RDBMS
Column-based RDBMS has an advantage now because column-based databases are multi-dimensioned by nature, and each column is self-indexed (B tree).
Examples of typical column-based databases are Google's BigTable, Sybase IQ (data store in Sybase RAP), Vertica, and KickFire, which is built on MySQL.
On average, these column-based databases are 60 to 100 times faster than traditional row-based RDBMS (e.g., Oracle, MS SQL, Sybase ASE, TeraData, DB2, MySQL) for data retrieval. However, they are slower for row level based data updating (much faster for large amount of data load, or batch load though). So far, Sybase IQ seems to be the leader in this world, which claims to house the largest data warehouse in the world with a size of over 1 PetaByte.
Additional information about these systems is available on their respective web sites.
More about Column-based RDBMS
Both Oracle and MS SQL need to build cubes to handle multi-dimensioned data for data warehousing, which incurs a lot more additional storage and design work as well. Personally, I think row-based RDBMS data warehouses will soon be replaced by column-based data warehouses.
Overall, row-based RDBMS (MS SQL, Oracle, Informix, DB2/UDB, MySQL, Sybase ASE, etc.) is better for OLTP, while column-based RDBMS (Sybase IQ, KickFire, Vertica, etc.) is better for OLAP/DSS and data warehouse. Note: those time series databases such as OneTick, kdb+ and Vhayu are not quite relational databases, but their storage structure are based on files on the file systems with each column as a separate file, which essentially is column-based. Column-based database is also useful for back testing to come up with new and enhanced trading strategies. In addition, data grows exponentially in data warehouse (cube) based on row-based RDBMS, as opposed data only growing linearly in data warehouses based on column-based RDBMS.
Traditionally, Oracle has had an advantage in the data warehouse world, but now it is losing edge to column-based RDBMS, such as Sybase IQ ,Vertica and KickFire built on MySQL. Google is using BigTable, another column-based database/storage system.
Supportability and existing investment in database technology is another thought.
I've seen quite a few successful migrations from Sybase to SQL and vice versa, but haven't seen many from Sybase/MS SQL to Oracle or vice versa. Depending on the database size and application scope and coupling with the databases, it could take a few consultants and full-time employees years to migrate Sybase/MS SQL stored procedures, rewrite applications and test them, even with the help of Oracle Migration Workbench (now we can use SQL Developer), as each and every stored procedure will have to be literally rewritten. The migration is easier if the applications only use ODBC/JDBC/ADO.NET and no stored procedures/temporary tables.
Native support of programming Languages
Having said the above, Oracle does have the best supportability/features for Java. We can store Java code in Oracle and treat it as a native Oracle function, which makes Oracle expandable (kind of like open source). Moreover, Sybase supports native Java code, too. (Sybase also has OpenServer, which is used to build event-driven applications for trading firms.) Both MS SQL/Sybase support extended stored procedures, which essentially is an interface within the DB calling external dlls written in different programming languages.
Cost of Ownership
When were looking at all of the features of those databases, one thing to keep in mind is the cost and time frame. Even if you have an experienced database team, it doesnt make much sense to spread across more than three database platforms. Additionally, it doesnt make much sense to consolidate them all and migrate to one database platform, as most of the heterogeneous databases can interact with one another through MS SQLs linked server, Oracle Gateway, or with some database replication and ETL process in place. When you consider the cost of maintaining a certain database, never isolate the database cost. Rather, consider it as an essential part of your application, your infrastructure.
I hope this article is a help for those people who will or are already in the process of picking a database for their applications.