Oracle: The Nuts And Bolts Of Basic Compression

Monday Jan 26th 2015 by David Fitzjarrell
Share:

Oracle provides several levels of compression, with BASIC compression requiring no additional license cost. Read on to find out how this level of compression works and what you may expect when using it.

In previous articles compression options have been discussed, and now it's time to see how Oracle performs basic compression. It isn't really compression, it's de-duplication, but it does result in space savings for data that won't be modified after it's 'compressed'. Let's look at how Oracle saves space with your data.

Oracle de-duplicates the data by finding common strings, tokenizing them and using the token identifier in the string to reduce the row length. So, what does that mean? Looking at an example might help; a table is built and populated as follows:


--
-- Create and populate the table
--
create table comptst(
	tstcol1	varchar2(4),
	tstcol2 varchar2(6),
	tstcol3	number(8,2),
	tstcol4	varchar2(10));

insert into comptst
values ('ZZXZ', 'bbddff', 199.44, 'PENDING');

insert into comptst
values ('ZZXZ', 'ghijkl', 43.08, 'PENDING');

insert into comptst
values ('ZZXZ', 'bbddff', 881.02, 'PENDING');

insert into comptst
values ('ZZXZ', 'bbddff', 54.97, 'PENDING');

commit;

insert into comptst select * From comptst;
insert into comptst select * From comptst;
insert into comptst select * From comptst;
insert into comptst select * From comptst;
insert into comptst select * From comptst;
insert into comptst select * From comptst;
insert into comptst select * From comptst;
insert into comptst select * From comptst;
insert into comptst select * From comptst;
insert into comptst select * From comptst;
insert into comptst select * From comptst;
insert into comptst select * From comptst;
insert into comptst select * From comptst;
insert into comptst select * From comptst;
insert into comptst select * From comptst;
insert into comptst select * From comptst;
insert into comptst select * From comptst;
insert into comptst select * From comptst;
insert into comptst select * From comptst;
insert into comptst select * From comptst;
insert into comptst select * From comptst;
insert into comptst select * From comptst;
insert into comptst select * From comptst;
insert into comptst select * From comptst;
insert into comptst select * From comptst;

commit;
				--
-- Compress the table with BASIC compression
--
alter table comptst compress basic;
alter table comptst move;

[The table was compressed after the data was inserted which required two steps, the first to set the compression level and the second, a table move in place, to actually compress the data. Had the table been built as compressed and direct path inserts used the data would have been compressed without further action.] Since the initial 4-row insert was re-inserted multiple times there is a lot of duplication in the data, and since Oracle de-duplicates rows to produce the effect of compression there should be a lot of data in a block dump indicating this. There is, and the first piece of that data is the following line:


  perm_9ir2[4]={ 0 2 3 1 }

Oracle builds a token table for each data block; this provides a reference for each data string that occurs more than once in the block. Additionally Oracle can re-arrange the column values in that token table so that multiple column values can be turned into a single token and, thus, a single reference. The line shown above indicates what column values map to the table positions in the token table for this block; in this case column 0 maps to the data in table column 0, column 1 maps to the data in table column 2, column 2 maps to data column 3 and column 3 maps to data column 1. Let's look at the unique data that was inserted:


('ZZXZ', 'bbddff', 199.44, 'PENDING');
('ZZXZ', 'ghijkl', 43.08, 'PENDING');
('ZZXZ', 'bbddff', 881.02, 'PENDING');
('ZZXZ', 'bbddff', 54.97, 'PENDING');

Since these data rows are duplicated in each block every column is a potential compression token. Two values occur in every row, 'ZZXZ' and 'PENDING', so it should be expected that tokens for those values will be found in each of the compressed data rows. As mentioned previously Oracle builds a token table in each block so there are two tables in this block, the first, starting at offset 0, is the token table that has 7 rows and the second, starting at offset 7, is the actual table data and there are 721 rows:


0x24:pti[0]	nrow=7		offs=0
0x28:pti[1]	nrow=721	offs=7

Oracle has a clue with this implementation of compression and can create a token that includes a data value and a token, from the same token table, to reduce that row length even further. The examples provided here won't be demonstrating that but know that it is possible. Now let's look at the first row in this block for the data table:


tab 1, row 0, @0x1f31
tl: 5 fb: --H-FL-- lb: 0x0  cc: 4
col  0: [ 4]  5a 5a 58 5a
col  1: [ 7]  50 45 4e 44 49 4e 47
col  2: [ 6]  62 62 64 64 66 66
col  3: [ 3]  c1 37 62
bindmp: 2c 00 01 04 02

The actual column lengths are supplied between the square brackets for each column; the total length should be the sum of those values plus 7 bytes, 4 of those for the column lengths, one for the lock byte, one for the flag byte and one for the column count. Using that information the total length should be 24 bytes; the block dump provides a different total length of 5, as reported by the tl entry. There is a line at the end of the row dump labeled bindmp (a binary dump of the row contents) revealing the actual contents of those 5 bytes. As expected there is the lock byte (0x2c), the number of columns at this location (0x01) and two bytes representing the token, reporting that 4 columns are in this token and that the reference row in the token table is row 2. So, let's look at table 0, row 2:


tab 0, row 2, @0x1f5c
tl: 10 fb: --H-FL-- lb: 0x0  cc: 4
col  0: [ 4]  5a 5a 58 5a
col  1: [ 7]  50 45 4e 44 49 4e 47
col  2: [ 6]  62 62 64 64 66 66
col  3: [ 3]  c1 37 62
bindmp: 00 b3 04 04 05 06 cb c1 37 62

It looks almost like the data row, but the total token length is 10 bytes. Looking at the bindmp the first two bytes indicate this token is used 179 times in this block, the third byte indicates that 4 columns are in this token, the two bytes after that report that the first two columns are also tokens, 0x04 and 0x05. Going back to the token table we see that those tokens are:


tab 0, row 4, @0x1f66
tl: 7 fb: --H-FL-- lb: 0x0  cc: 1
col  0: [ 4]  5a 5a 58 5a
bindmp: 00 04 cc 5a 5a 58 5a
tab 0, row 5, @0x1f76
tl: 10 fb: --H-FL-- lb: 0x0  cc: 1
col  0: [ 7]  50 45 4e 44 49 4e 47
bindmp: 00 04 cf 50 45 4e 44 49 4e 47

These are single-column tokens, and each is used 4 times in this block. This is how Oracle reduced the row length from 24 bytes to 5 to save block space. Working through the block dump it's now possible to re-construct the 24 bytes of data the row originally contained even though it now is only 5 bytes in length.

We see that Oracle doesn't actually compress data, it replaces duplicate values with tokens and, through those tokens, reconstructs the data at query time by using the row directory and the actual row pieces in each block. Depending on the select list some tokens won't be accessed if that data isn't required. Of course all of this re-constructing can be expensive at the CPU level, and for full table scans of large tables performance can be an issue, especially with the "cache buffers chains" latch because Oracle is performing fewer "consistent gets - examination". This is because Oracle has to pin blocks for a longer period due to the reconstruction. On the plus side the number of physical reads can decrease since the data is in a smaller number of blocks and can stay in the cache longer. Using basic compression is a trade-off between size and performance, and for extremely large tables or in cases where the compression savings are quite large (meaning the data is compressed more) queries may become CPU-intensive rather than I/O intensive. The good and the bad need to be weighed carefully when making the decision to use compression; choose wisely. Space is relatively inexpensive when compared to end-user satisfaction. The DBA's idea of performance and the end-users' ideas of performance use different criteria, and it's really the end-user's idea that should take precedence.

See all articles by David Fitzjarrell

Share:
Home
Mobile Site | Full Site
Copyright 2017 © QuinStreet Inc. All Rights Reserved