SQL joins - multi-table queries

Thursday Feb 28th 2002 by Ian Gilfillan
Share:

Most of you should be familiar with basic queries, SELECTS, UPDATES and DELETES using one table. But to harness the true power of relational databases it is vital to master queries using more than one table. This article will introduce you to database joins - queries using 2 or more tables.

Most of you should be familiar with basic queries, SELECTS, UPDATES and DELETES using one table. But to harness the true power of relational databases it is vital to master queries using more than one table. This article will introduce you to database joins - queries using 2 or more tables.

This article assumes you know how to perform basic SQL queries. If you don't, I suggest you read Simple SQL: Getting Started With SQL first. A warning though - SQL implementations are notoriously non-standard, with almost every DBMS having its own extensions, as well as exclusions, especially when it gets to the realm of inner and outer joins! And version by version, they change. So, although most of these examples should work with most implementations, they don't work with all. The final word should come from the documentation of your particular database installation.

Let's do a quick recap for those who may be unsure. We will perform queries on the table below, containing data about tourist guides. The table is defined with:

CREATE TABLE tour_guides(
       employee_number INT NOT NULL,
       employee_name VARCHAR(100),
       hourly_rate INT,
       PRIMARY KEY(employee_number));

To add records into this table, we use:

INSERT INTO tour_guides(
            employee_number,
            employee_name,
            hourly_rate) 
       VALUES(981,'Xoliswe Xaba',25)

INSERT INTO tour_guides(
            employee_number,
            employee_name,
            hourly_rate) 
       VALUES(978,'Siyabonge Nomvete',30)

INSERT INTO tour_guides(
            employee_number,
            employee_name,
            hourly_rate) 
       VALUES(942,'Jean-Marc Ithier',35)

INSERT INTO tour_guides(
            employee_number,
            employee_name,
            hourly_rate) 
       VALUES(923,'Tyrone Arendse',32)

Note that you can also do a shortcut INSERT statement, leaving out the fieldnames if the number and order of the fields is the same, e.g:

INSERT INTO tour_guides VALUES(982,'Mathew Booth',25)

I don't suggest using the shortcut however, as, particularly if you're doing your SQL from within an application, the table structure may change, and then the SQL may no longer be valid. For example, I may add another field, such as months_employed. Now, the above INSERT statement will not work. You will get an error something like

Column count doesn't match value count at row 1

If you had written the statement as:

INSERT into tour_guides(
            employee_number,
            employee_name,
            hourly_rate) 
       VALUES(982,'Mathew Booth',25)

it would have worked. For this reason, I suggest you don't use the shortcut - it makes your queries much less flexible, and less able to survive changes in table structure.

After the above, your table will looks as follows: Now the table contains:

employee_number employee_name hourly_rate
978 Siyabonge Nomvete 30
942 Jean-Marc Ithier 35
923 Tyrone Arendse 32
981 Xoliswe Xaba 25
982 Mathew Booth 25

To delete data from the table is also easy. Perhaps we have entered Xoliswe Xaba prematurely, as he has not accepted our generous offer of employment, and taken another job. To remove him, we use the statement:

DELETE FROM tour_guides WHERE employee_number=981

We use the employee number, and not the employee name, as according to our table definition the employee number is unique (the primary key). It is impossible for there to be another employee 981, while, unlikely as it may seem, there may, at least in theory, be another Xoliswe Xaba. This is why we almost always create a field that contains a unique code for each table.

Let's assume we need to know how long someone has worked for us, so we add the months_employed field talked about earlier. We use:

ALTER TABLE tour_guides ADD months_employed int

After the DELETE, and the creation of the new field, the table now looks as follows:

employee_number employee_name hourly_rate months_employed
978 Siyabonge Nomvete 30 0
942 Jean-Marc Ithier 35 0
923 Tyrone Arendse 32 0
982 Mathew Booth 25 0

See how all the months_employed values have defaulted to zero. Numeric fields, unless you specify otherwise, automatically default to zero. To start adding the months_employed, we use the UPDATE statement, as follows:

UPDATE tour_guides 
       SET months_employed=6 
       WHERE employee_number=978
UPDATE tour_guides 
       SET months_employed=12 
       WHERE employee_number=942
UPDATE tour_guides 
       SET months_employed=6 
       WHERE employee_number=923

Now our table contains the following:

employee_number employee_name hourly_rate months_employed
978 Siyabonge Nomvete 30 6
942 Jean-Marc Ithier 35 12
923 Tyrone Arendse 32 6
982 Mathew Booth 25 0

All of this putting data in, changing and taking out is of course of no use unless you can retrieve the data. This is done with the SELECT statement. For example,

SELECT employee_name FROM tour_guides

returns

employee_name
Siyabonge Nomvete
Jean-Marc Ithier
Tyrone Arendse
Mathew Booth

while

SELECT employee_name FROM tour_guides WHERE hourly_rate>30

returns

employee_name
Jean-Marc Ithier
Tyrone Arendse

If all this SELECT stuff is a bit over your head, and you're unsure of the powerful ways it can limit and perform calculations on the data, I suggest you first read a more introductory article, such as Simple SQL: Getting Started With SQL

All of this so far should be familiar to regular readers of SQLwire. Now, we introduce the 2nd table, that will show the real power of relational databases. First, a quick introduction to relational databases. Why are they given this name? The answer comes from the fact that, unlike earlier database structures (hierarchical and network), relational databases allow potentially every file, or table, to relate to every other one. They do this by using a common field. Let's add another couple of tables. (For more on database design, I suggest you read the article on database normalization.) We add the following tables:

tour_locations

location_code location_name
1024 Table Mountain
1025 Robben Island
1026 Kruger National Park
1027 St Lucia

tour_expeditions

location_code employee_number hours_worked tourgroup_size
1024 978 5 8
1027 942 8 4
1025 923 3 20
1026 982 6 8
1024 978 5 8
1025 978 3 16

I assume by now you can do the CREATE and INSERT statements to populate the above tables. Now you should be able to see the reason for the term relational database. The way these tables relate is by the common fields they have - tour expeditions joins to tour_guides through the field employee_number, and to tour_locations though location_code. Note that the field names do not have to be the same in both tables, as long as their definitions are the same (ie both int in this case). Try and make as many fields as possible NOT NULL (fields where there cannot logically be a NULL value). For example, the fields location_code and employee_number in the table tour_expeditions are good candidates. Make them NOT NULL now, and we'll reap the benefits later!

Now comes the crux. How would we answer the question "Which employees worked in which locations?". The secret here is to use the fields that relate in each table to join. Let's first answer a more simple question to introduce the concept. "Which employee_numbers worked in which locations?". We would use the following query:

SELECT employee_number,location_name 
       FROM tour_locations,tour_expeditions 
       WHERE tour_locations.location_code = tour_expeditions.location_code

This returns:

employee_number location_name
978 Table Mountain
942 St Lucia
923 Robben Island
982 Kruger National Park
978 Table Mountain
978 Robben Island

How did we get this query? The first part, immediately after the SELECT lists the fields we want to return. Easy enough - employee_numbers and location_name. The second part, after the FROM, provides the tables that contain the fields. In this case it's clearly tour_locations for location_name. But which table to choose for employee_number? Both tour_expeditions and tour_guides contain this field. Here, we have to look at which table is related to tour_location. Since tour_location is related only to tour_expedition (through the location_code field), we could only use tour_expedition. And the third part, after the WHERE clause, tells us which fields the relation exists on, or are being joined.

The usual SELECT rules apply. To bring back only the employee_numbers that gave a tour to Table Mountain, and to bring back only unique records (notice that the above query brought back a duplicate value, as there are 2 records that apply), we use:

SELECT DISTINCT employee_number,location_name 
       FROM tour_locations,tour_expeditions 
       WHERE tour_locations.location_code = tour_expeditions.location_code AND
             location_name='Table Mountain'
employee_number location_name
978 Table Mountain

is the only row returned. Note how the DISTINCT keyword returns only one identical row. Without it, we would have returned 2 identical rows, one for each time employee 978 gave a tour to Table Mountain.

How now do we return the name of the employees, as we originally requested, not just their numbers. To do so, we join the tour_expedition table to the tour_guide table, on the employee_number field, as follows:

SELECT DISTINCT 
    tour_expeditions.employee_number,employee_name,location_name 
    FROM 
    tour_locations,tour_expeditions,tour_guides 
    WHERE tour_locations.location_code = tour_expeditions.location_code AND 
    tour_expeditions.employee_number=tour_guides.employee_number

This brings back

employee_number employee_name location_name
978 Siyabonge Nomvete Table Mountain
923 Tyrone Arendse Robben Island
978 Siyabonge Nomvete Robben Island
982 Mathew Booth Kruger National Park
942 Jean-Marc Ithier St Lucia

Note the changes we made to our original join. We've added employee_name to the fields returned, tour_guides to the table list, and we've had to add the name of the table to the employee_number field of tour_expeditions, making it tour_expeditions.employee_number (now that there are 2 tables returning the employee_number, we need to specify which table to use.) In this case it makes no difference, but in others it may. And finally, we've added a join condition to the WHERE clause, pointing out the relation to use to join the 2 tables.

Let's add another record to the tour_guide table.

INSERT into tour_guides(
       employee_number,
       employee_name) 
VALUES(983,'Nelson Jiranga');

Now run the query again:

SELECT DISTINCT 
  tour_expeditions.employee_number,
  employee_name,location_name 
  FROM tour_locations,
       tour_expeditions,
       tour_guides 
  WHERE tour_locations.location_code = tour_expeditions.location_code AND 
        tour_expeditions.employee_number=tour_guides.employee_number

We get identical results

employee_number employee_name location_name
978 Siyabonge Nomvete Table Mountain
923 Tyrone Arendse Robben Island
978 Siyabonge Nomvete Robben Island
982 Mathew Booth Kruger National Park
942 Jean-Marc Ithier St Lucia

This makes sense, as our new tour guide has not yet undertaken any tours. He does not yet appear in the tour_expeditions table, and so the join does not work, as there is nothing in tour_expeditions to join to.

But what if we want all the employees back, regardless of whether they have undertaken a tour or not? We need to explicitly state this, and we do so using a LEFT JOIN (also called a LEFT OUTER JOIN). To introduce the concept, try the following query:

select DISTINCT employee_name 
       from tour_guides 
       LEFT JOIN tour_expeditions ON 
          tour_guides.employee_number = tour_expeditions.employee_number

This returns:

employee_name
Siyabonge Nomvete
Jean-Marc Ithier
Tyrone Arendse
Mathew Booth
Nelson Jiranga

Note the syntax is almost the same, except that the table names are separated by LEFT JOIN, not a comma, and ON is used for the fields to be joined, rather than WHERE.

So, going back to our original question - how do we return the employee numbers, names and locations of all guides, including those who have not yet given a tour. The query is as follows:

SELECT DISTINCT 
       tour_guides.employee_number,employee_name,location_name 
       FROM tour_guides 
       LEFT JOIN tour_expeditions ON 
          tour_guides.employee_number = tour_expeditions.employee_number 
       LEFT JOIN tour_locations ON 
          tour_locations.location_code=tour_expeditions.location_code

This now returns:

employee_number employee_name location_name
978 Siyabonge Nomvete Table Mountain
923 Tyrone Arendse Robben Island
978 Siyabonge Nomvete Robben Island
982 Mathew Booth Kruger National Park
942 Jean-Marc Ithier St Lucia
983 Nelson Jiranga NULL

Note that Nelson Jiranga now appears in the results table, and as he has not yet lead any tours, that field is NULL.

Many existing queries make use of what are called subselects (selects within selects). For example, try the following query, which returns all employees who've worked with a tourgroup of more than 10 people:

SELECT employee_name 
  FROM tour_guides 
  WHERE employee_number IN 
    (SELECT employee_name 
     FROM tour_expeditions 
     WHERE tourgroup_size>10)

The results are

employee_name
Siyabonge Nomvete
Mathew Booth

This query resolves in 2 steps - first the inner query (which returns 923 and 978) is resolved. Then we are left with

SELECT employee_name 
       FROM tour_guides 
       WHERE employee_number IN 
             (923,978)

which resolves to the results above. But I've just demonstrated another way to do this, and which is usually a better way - the join. You can rewrite this query as:

SELECT employee_name 
       FROM tour_guides,tour_expeditions 
       WHERE tourgroup_size>10 AND 
          tour_guides.employee_number=tour_expeditions.employee_number

Why do I say this is better? 2 reasons. One is that many DBMS's (such as early versions of MySQL) do not support nested selects. And the second reason is that more often they can be rewritten as a join, and the join is usually more efficient. On those big, heavily used tables, where performance is vital, you will want to do without nested selects as much as possible.

Let's take another example. How could we find all tour_guides who have not yet given a tour? We could write

SELECT employee_name 
       FROM tour_guides 
       WHERE employee_number NOT IN 
             (SELECT employee_number 
                     FROM tour_expeditions)

And this would return

employee_name
Nelson Jiranga

But, using the same principle as before, we could rewrite this as a join, in this case a LEFT JOIN (remembering that LEFT JOINS return values that are not present). Try the following:

SELECT employee_name 
       FROM tour_guides 
       LEFT JOIN tour_expeditions ON 
            tour_guides.employee_number = tour_expeditions.employee_number 
       WHERE tour_expeditions.employee_number IS NULL

Now we see an advantage of declaring employee_number NOT NULL. It allows us to use this kind of query, which is often more efficient than the nested select, and it also saves space (the DBMS does not have to waste space telling if the field is NULL or not--by definition it's not)

For this exercise, we first need to INSERT another tour guide, as follows:

INSERT INTO tour_guides 
  VALUES('999','Jon Qwelane',30,5)

Now consider another request. We want to find the names of all the employees who have the same hourly rate as Siyabonge Nomvete. Again, we can do this with a nested select:

SELECT employee_name 
  FROM tour_guides 
  WHERE hourly_rate IN 
  (select hourly_rate 
     from tour_guides 
     where employee_name='Siyabonge Nomvete')

But again, a join is preferable. In this case it will be a self-join, as all the data that we need is in the one table--tour_guides. So, we could use the following, more efficient, query:

SELECT e1.employee_name 
       FROM tour_guides e1,tour_guides e2 
       WHERE e1.hourly_rate=e2.hourly_rate AND 
             e2.employee_name='Siyabonge Nomvete'

This returns:

employee_name
Siyabonge Nomvete
Jon Qwelane

There are a few important points to notice here. We could not have used the query, as some of you may have thought,

SELECT employee_name 
       FROM tour_guides 
       WHERE employee_number=employee_number AND 
             employee_name ='Siyabonge Nomvete'

The reason is that we need to see the table as two separate tables to be joined. This query only returns "Siyabonge Nomvete", satisfying the final condition. In order to make the DBMS see the query as a join, we need to provide an alias for the tables. We give them the names e1 and e2. Also important is why we use e1 in SELECT e1.employee_name and e2 in e2.employee_name='Siyanbonge Nomvete'. These 2 have to come from the 2 'different' versions of the table. If we chose the employee_name from the same table that we impose the condition: WHERE employee_name='Siyabonge Nomvete', of course we're on'y going to get that one result back.

Good luck with your joins--remember to keep it as simple as possible for your DBMS, as few as possible nested selects, and you'll see the benefits in your applications!

Additional Resources:

Share:
Home
Mobile Site | Full Site
Copyright 2017 © QuinStreet Inc. All Rights Reserved