SqlCredit - Part 16: The Cost of Bloat

Friday May 23rd 2008 by Rob Garrison

Database designers, developers, and administrators deal with a database that continues to grow over time, often at an alarming rate. In this series' current quest to understand costs of different architectures, we will look at what cost is paid in query performance as a table grows.

Download the code for this article.

This is part 16 of a series. If you have not read part 15, you can find it here, but this article does not require any knowledge of the previous work in the series.

Bloat: “We Never Delete Anything”

Every database designer, developer, and administrator I’ve talked to is dealing with a database that continues to grow over time, often at an alarming rate.

Applications often have tables that are fairly consistent (reference data), tables that grow fairly slowly over time (customers, stores), and tables that grow very quickly and to very large sizes (sales, purchases, web hits).

Database designers are more and more often receiving the orders, “Don’t delete anything.” This leads to database “bloat”. The question is, how does this affect performance? Can we reasonably support a million rows in a single table? How about a billion rows?

In this series’ current quest to understand costs of different architectures, we will look at what cost is paid in query performance as a table grows.

Building the Tests

Test Introduction

If you’ve been following the series, skip this section and move down to the Test Specifics section.

For this test, I built a database and added four sets of Card, Vendor, and Purchase tables. These tables were introduced in previous installments, but here is the DDL for review:

 CREATE TABLE dbo.Card1 (
     CardID         int          NOT NULL  IDENTITY,
     CardNumber1    bigint       NOT NULL,
     CardNumber2    bigint       NOT NULL,
     CardNumber3    bigint       NOT NULL,
     CardNumber4    bigint       NOT NULL,
     SecurityCode1  smallint     NOT NULL,
     SecurityCode2  smallint     NOT NULL,
     SecurityCode3  smallint     NOT NULL,
     SecurityCode4  smallint     NOT NULL,
     SecureString1  varchar(36)  NOT NULL,
     SecureString2  varchar(36)  NOT NULL,
     SecureString3  varchar(36)  NOT NULL,
     SecureString4  varchar(36)  NOT NULL,
 
     CONSTRAINT Card1_PK PRIMARY KEY CLUSTERED (CardID)
 )
 
 CREATE TABLE dbo.Vendor1 (
     VendorID         int          NOT NULL  IDENTITY,
     BusinessName     varchar(40)  NOT NULL,
     Address1         varchar(40)  NOT NULL,
     Address2         varchar(40)  NOT NULL,
     City             varchar(40)  NOT NULL,
     [State]          char(2)      NOT NULL,
     ZipCode          varchar(10)  NOT NULL,
     DateCreated      datetime     NOT NULL
         CONSTRAINT Vendor1_DateCreated_DF     DEFAULT GETUTCDATE(),
     DateLastUpdated  datetime     NOT NULL
         CONSTRAINT Vendor1_DateLastUpdated_DF DEFAULT GETUTCDATE(),
     StatusCd         tinyint      NOT NULL
         CONSTRAINT Vendor1_StatusCd_DF        DEFAULT 1,
 
     CONSTRAINT Vendor1_PK PRIMARY KEY CLUSTERED (VendorID)
 )
 
 CREATE TABLE dbo.Purchase1 (
     PurchaseID   int       NOT NULL  IDENTITY,
     CardID       int       NOT NULL,
     VendorID     int       NOT NULL,
     Amount       money     NOT NULL,
     DateCreated  datetime  NOT NULL
         CONSTRAINT Purchase1_DateCreated_DF DEFAULT GETUTCDATE(),
 
     CONSTRAINT Purchase1_PK PRIMARY KEY CLUSTERED (PurchaseID),
     CONSTRAINT Purchase1_Card_FK FOREIGN KEY (CardID)
         REFERENCES dbo.Card1 (CardID),
     CONSTRAINT Purchase1_Vendor_FK FOREIGN KEY (VendorID)
         REFERENCES dbo.Vendor1 (VendorID)
 )

For this test, I did not create any indexes other than the primary keys. The queries use the primary keys, so any additional indexes would just be space and processing overhead.
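
If you want to confirm that the test tables carry nothing beyond their clustered primary keys, a quick check against sys.indexes is one way to do it. This query is mine, not part of the article's test scripts:

 -- Lists every index on the Card/Vendor/Purchase test tables.
 -- Written against SQL Server 2005's sys.indexes catalog view;
 -- not part of the article's download.
 SELECT OBJECT_NAME(i.object_id) AS TableName,
        i.name                   AS IndexName,
        i.type_desc              AS IndexType
 FROM sys.indexes AS i
 WHERE OBJECT_NAME(i.object_id) LIKE 'Card%'
    OR OBJECT_NAME(i.object_id) LIKE 'Vendor%'
    OR OBJECT_NAME(i.object_id) LIKE 'Purchase%'
 ORDER BY TableName, IndexName;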

One thing I learned in building this set of tests is that creating a test table with 100 million records takes a long time. I put a reasonable amount of thought into the script that populated Purchase4 (the 100-million-record table). Still, populating the second half of the table (50 million records) started the morning of April 29th and finished the morning of May 3rd.
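
The actual population scripts are in the article's code download. As a rough sketch of the batched approach, a loop like the one below keeps each insert reasonably small; the batch size, the random CardID/VendorID/Amount generation, and the assumption that Card4 and Vendor4 hold IDs 1 through 100,000 with no gaps are all mine, not the download's exact code:

 -- Hypothetical batched-insert loop for Purchase4; the real script is in
 -- the download. Assumes Card4/Vendor4 IDs run 1..100,000 with no gaps.
 SET NOCOUNT ON;

 DECLARE @BatchSize int, @Target int, @Inserted int;
 SELECT @BatchSize = 10000, @Target = 100000000, @Inserted = 0;

 WHILE @Inserted < @Target
 BEGIN
     INSERT INTO dbo.Purchase4 (CardID, VendorID, Amount)
     SELECT TOP (@BatchSize)
         ABS(CHECKSUM(NEWID())) % 100000 + 1,                  -- random CardID
         ABS(CHECKSUM(NEWID())) % 100000 + 1,                  -- random VendorID
         CAST(ABS(CHECKSUM(NEWID())) % 100000 AS money) / 100  -- random Amount
     FROM sys.all_columns AS a
     CROSS JOIN sys.all_columns AS b;                          -- cheap row source

     SET @Inserted = @Inserted + @BatchSize;
 END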

Test Specifics

For these tests, we keep the number of Card records and Vendor records the same and increase the number of Purchase records.

  • Card1: 100,000 records
  • Vendor1: 100,000 records
  • Purchase1: 100 records
  • Card2: 100,000 records
  • Vendor2: 100,000 records
  • Purchase2: 10,000 records (100 times the size of Purchase1)
  • Card3: 100,000 records
  • Vendor3: 100,000 records
  • Purchase3: 1,000,000 records (100 times the size of Purchase2)
  • Card4: 100,000 records
  • Vendor4: 100,000 records
  • Purchase4: 100,000,000 records (100 times the size of Purchase3)

Each test reads 1,000 random Purchase records JOINed to Card and Vendor. It is important to understand that the workload is the same regardless of the size of the Purchase table.
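
The test harness itself is part of the code download. As a minimal sketch of what a single one of those 1,000 reads looks like, consider the query below; the column list and the way I pick a random PurchaseID are illustrative assumptions, not the harness's exact code:

 -- One random read against the set-1 tables; the real harness repeats
 -- this 1,000 times per test. Column list and random-ID technique are
 -- illustrative assumptions.
 DECLARE @PurchaseID int;
 SET @PurchaseID = ABS(CHECKSUM(NEWID())) % 100 + 1;  -- 1..100 for Purchase1

 SELECT p.PurchaseID, p.Amount, p.DateCreated,
        c.CardNumber1, c.SecurityCode1,
        v.BusinessName, v.City, v.[State]
 FROM dbo.Purchase1 AS p
 JOIN dbo.Card1     AS c ON c.CardID   = p.CardID
 JOIN dbo.Vendor1   AS v ON v.VendorID = p.VendorID
 WHERE p.PurchaseID = @PurchaseID;

Against sets 2 through 4, only the table names and the range of random PurchaseIDs change; the work per read is identical, which is what makes the elapsed times comparable.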

Given a standard workload (SELECTs only), what do you predict the penalty will be for significantly growing the size of just one of these tables? If we grow the record count 100 times, we don’t expect that reading 1,000 records will take 100 times as long, but we also don’t expect it to be free. I’ve given you the elapsed time for the 100-record test. Now make your informed guess as to what the other entries will be.

Test                              Elapsed Time (ms)   Predicted Penalty
1 (100 Purchase records)          580                 -
2 (10,000 Purchase records)
3 (1,000,000 Purchase records)
4 (100,000,000 Purchase records)

Results

Here are the results of the tests.

Run 1

Test   Card Records   Vendor Records   Purchase Records   Elapsed Time (ms)   Penalty vs. Set 1   Penalty vs. Next-Smaller Set
1      100,000        100,000          100                580                 -                   -
2      100,000        100,000          10,000             2,313               3.988               3.988
3      100,000        100,000          1,000,000          5,936               10.234              2.566
4      100,000        100,000          100,000,000        11,763              20.281              1.982

Run 2

Test   Card Records   Vendor Records   Purchase Records   Elapsed Time (ms)   Penalty vs. Set 1   Penalty vs. Next-Smaller Set
1      100,000        100,000          100                546                 -                   -
2      100,000        100,000          10,000             2,156               3.949               3.949
3      100,000        100,000          1,000,000          5,766               10.560              2.674
4      100,000        100,000          100,000,000        11,796              21.604              2.046

Subsequent runs of the same test produced similar results.

Conclusion

Storing larger and larger amounts of data is often driven by the business needs of your organization. There are many different options for how to accommodate such a large data set, but it helps to know the effect of data growth given a simple architecture.

Interesting points from the results:

1.  The larger the number of rows, the longer it takes to process a simple workload. (This is not surprising at all, but now there are specific numbers that show how much effect that bloat has.) This assumes that you are actually accessing all those rows. If 90% of your rows were old and not accessed in the SELECT workload, the results would be different.

2.  The SQL Server team is doing a good job of optimization. Notice the last column in the results. As the record count grows 100 times larger, it takes less than 4 times as long to process the records. And the differential keeps going down as the tables get bigger and bigger.

What are your thoughts here? Are you being asked to keep huge amounts of data? If so, are you being given the development time to modify your architecture to deal with the growth, or are you growing inside the same architecture? Drop a comment in the forum.

Download the code for this article.

