Hashing in Oracle

Wednesday Oct 24th 2007 by Steve Callan

Hashing is but one of many cleverly implemented functions embedded in Oracle. Learn all about it, what it is and how to use it.

The internal workings of the Oracle RDBMS are a marvel to behold. Think about all of the things you do know with respect to what it takes to issue a select statement, perform an update, or create a table. Latches, locks, serialization, caching, pinning, flushing, checkpointing, sorting, space management, undo, redo, and the list goes on and on. Aside from visible functions, there are a slew of transparent operations. Many of these transparent operations, as the adjective implies, not only work behind the scenes, but also stay there. There are also operations whose work is largely hidden, but whose results are exposed to us. One such operation found in several key areas of how Oracle works is hashing.

What is Hashing?

Hash isn’t a formal term, but more of a description for what takes place: chop and mix. A hash function returns a value for an input, and the output is generally shorter or more compact than the input. For a relatively simple hash function, there is no guarantee that inputs map one to one for outputs, that is, two different inputs can have the same hashed output value. Add more sophistication to the hash function, or add yet another hashing, and the probability of two distinct inputs having the same hashed output can be greatly decreased.

Before getting more into where hashing takes place in Oracle, let’s look at some examples of how hashing is used. A good general overview of hashing and various types of hashing can be found at a Mathematica site. The table of names takes the sum of the ASCII values of each letter, sums them, and then lists the mod 13 remainder. The table is reproduced below with an additional column to help illustrate what is taking place. I also provided some code to help reproduce the results.

create table hash_example
(lastname varchar2(12),
  ascii_sum number,
  hashed_value number);
insert into hash_example (lastname) values ('Brecker');
insert into hash_example (lastname) values ('Corea');
insert into hash_example (lastname) values ('Davis');
insert into hash_example (lastname) values ('Hancock');
insert into hash_example (lastname) values ('Harris');
insert into hash_example (lastname) values ('Marsalis');
insert into hash_example (lastname) values ('Parker');

  cursor c is
  select lower(lastname) lastname 
  from hash_example 
  for update;
  l_name hash_example.lastname%type;
  l_sum hash_example.ascii_sum%type := 0;
  l_tmp number := 0;
  for r in c loop
    for i in 1..length(r.lastname) loop
      l_tmp := ascii(substr(r.lastname,i,1));
      l_sum := l_sum + l_tmp;
    end loop;
    update hash_example
    set ascii_sum = l_sum,
        hashed_value = mod(l_sum,13)
    where current of c;
    l_tmp := 0;
    l_sum := 0;
  end loop;

PL/SQL procedure successfully completed.

SQL> select * from hash_example;

------------ ---------- ------------
Brecker             734            6
Corea               522            2
Davis               535            2
Hancock             727           12
Harris              649           12
Marsalis            860            2
Parker              645            8

Overall, this was a pretty simple example. Given that you can pass in a value and get an output based on the hashing algorithm, and that the output is like a fingerprint of the original data, how do you think this can applied to a real table? Suppose you have a table that contains employee identification numbers along with salaries. Aside from using an already built in hashing function (e.g., SHA-1), you could devise a function to hash the combined employee ID and salary values. This hashed value could be stored elsewhere, and then auditing of the table would compare the current values and the stored values. Any (unknown) change in the fingerprint would be cause for concern. This is just one safeguard, and no doubt, you’d want to have some auditing (before the fact, not after!) on a sensitive table like this.

Hashing in a Database

So far, we’ve seen two uses of hashing that can be applied to a database. In the first example, a hashed value of some key attribute can be used for indexing (the hashed value points back to several values), and in the second example, hashing can be used as a comparison tool. These two concepts lend themselves to extensive use within Oracle.

How does Oracle know if a SQL statement is new or already parsed? A parsed statement has a hash value that is by and large unique enough (collisions, or same output for different inputs can happen, but are rare). Shown below is an extract from a trace file, and the lines show each statement’s identifying information.


PARSING IN CURSOR #1 len=46 dep=0 uid=5 hv=810835083 ad='1d9cd494'

select ename from scott.emp where empno = 7369


PARSING IN CURSOR #2 len=46 dep=0 uid=5 hv=1290101743 ad='1d9cca00'

select ename from scott.emp where empno = 7499


PARSING IN CURSOR #1 len=46 dep=0 uid=5 hv=1589217292 ad='1d9cc7c4'



PARSING IN CURSOR #2 len=46 dep=0 uid=5 hv=810835083 ad='1d9cd494'

select ename from scott.emp where empno = 7369

Statements 1, 2 and 3 are different, so each has its own hashed value (the hv value). Statements 1 and 4 are the same, and note both have the same hv value. One of the more desirable characteristics of a hashing function is that it should be quite fast and inexpensive to perform. If your statement has already been parsed (has a hash ID value), then a comparison is all that is needed. As pointed out in numerous other sources, hard parsing is undesirable. Re-use of statements can be accomplished two ways. One is via the use of bind variables, and the other is to embed commonly used statements in functions or procedures.

Where else is hashing used? Hash partitioning is useful for data that doesn’t cleanly fall into categories such as range or list. Composite partitioning such as range-hash can be used to further categorized and distribute data. In these examples, the indexing feature or benefit is being used. Since this is such a highly performant arrangement of data, the use of hash clusters would be a natural extension. “Overview of Hash Clusters” in the Concepts guide provides a good description of how useful clusters can be. Clustered data is easier to find, and for sorted data, the benefits are even better.

Maybe not as obvious for usage as in the past, given how much better the cost based optimizer is, but hash joins are a frequently used means of accessing data. Incorporating a hint in a statement can result in seeing the hashing step in an execution plan. Without looking at execution plans, and by letting the optimizer do its thing, Oracle isn’t going to “tell” you that a hash join was used.

Hidden Hashing

Now it’s time for a behind the scene use of hashing. In fact, this particular use of hashing is so under the radar that you would probably never even know it existed in Oracle unless you encountered one of several bugs. From note 4960265.9 we have, “In some cases when using a join filter in a RAC configuration (parallel hash-join) wrong results can be produced.” The stated fix for this is to set the undocumented parameter _BLOOM_FILTER_ENABLED to false. What is a Bloom filter?

There are lots of hashing techniques, and “General Purpose Hash Function Algorithms,” by Arash Partow, gives a decent summary without delving too deep into some esoteric computer science. The definition of a Bloom filter given in the paper states the following:

A Bloom filter allows for the state of existence of a very large set of possible type values to be repsented with a much smaller piece of memory. This is achieved through the use of multiple distinct hash functions and also by allowing the result of a query for the existence of a particular type to have a certain amount of error. This error can be controlled by varying the size of the table used for the Bloom filter and also by varying the number of hash functions.

So now, we know another technique Oracle uses to store data. In a sense, the data can be “over stored,” that is, we want to make really, really sure it exists in one place or another. We may be wrong about the place (and then go find an accurate location), but the main point is that it exists. With this type of hashing, it will be difficult to say we don’t have it anywhere, when in fact we do.

In Closing

Hashing is but one of many cleverly implemented functions embedded in Oracle. What would be interesting is being able to document all of the hash function implementation used in Oracle. For example, how does Oracle come up with 810835083 from “select ename from scott.emp where empno = 7369?” And why do hashed partitions work better (better = data spread more evenly) when the number of buckets is a power of two?

» See All Articles by Columnist Steve Callan

Mobile Site | Full Site