Protecting DB2 Against Non-Concurrent Application Code

Thursday Aug 14th 2003 by Marin Komadina

DBAs have learned that application performance is influenced by many factors, which in turn influence database performance. In this article, Marin Komadina explores a real-world example of an application with a locking problem.

Performance is one of the most important factors for any kind of database system. Companies, and banks in particular, spend fortunes on high-performance hardware because it is still cheaper than efficient application code; this works until the system finally breaks down. DBAs have learned that application performance can be influenced by many factors. In this article, I explain and give a real-world example of an application with a locking problem. A locking problem never happens alone; it is tied to commit behavior. A badly designed application hurts database performance through the way it applies changes and releases its locks and claims.

This article covers:

  • Transaction Terms
  • Locking Terms
  • Commit and Locking Parameters
  • Performance Problem: "Slow Database"
  • Database and Application Tuning
  • Conclusion

Transaction Terms

  • Transaction - a group of operations guaranteed either to succeed fully or to be undone completely. The result of a transaction never leaves the table data in an inconsistent state, and intermediate results are not visible outside the transaction until it commits.

  • Distributed Transaction - a transaction that spans more than one database on different machines. A distributed transaction is coordinated with two-phase commit as a Distributed Unit of Work (DUOW), which guarantees integrity.

  • Commit - confirms the changes made by a completed transaction and releases all locks acquired and held on local or distributed resources. Commit functionality is integrated into the DB2 engine, but end users can also control it as needed.

Locking Terms

The DB2 engine handles most locking activity internally. DB2 implements the ANSI standard isolation levels (RR, RS, CS and UR). By default, DB2 serializes transactions so that each transaction sees only the current committed data.

DB2 supports the following standard isolation levels:

  • Repeatable Read (RR) - locks all the rows an application references within a unit of work, so that a SELECT statement within the same unit of work gives the same result each time.

  • Read Stability (RS) - locks only those rows that an application retrieves within a unit of work; any row changed by another application process is not read until that process commits the change.

    With RS, application A cannot read a row locked by application B until application B commits, writes its changes and releases the held locks.

  • Cursor Stability (CS) - locks any row accessed by a transaction of an application while the cursor is positioned on the row. This lock remains in effect until the next row is fetched or the transaction is terminated.

  • Uncommitted Read (UR) - allows an application to access uncommitted changes of other transactions and does not lock other applications out of the row it is reading.

Isolation levels are ordered from most to least restrictive: RR, RS, CS and UR. The database engine tries to use the lowest lock level that can satisfy a request. If multiple applications try to lock the same record in exclusive mode (for update), DB2 builds a linked list of these lock requests in memory, served first come, first served. The default behavior within DB2 is that a transaction waits for a lock release so that it sees only committed data. From the viewpoint of read-only applications, UR gives the best performance, followed by CS, RS and RR.
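
DB2 also lets a single statement override the application's isolation level through the isolation clause. A minimal sketch for a read-only query, assuming a hypothetical table APPEASY.CUSTOMER; the UR clause applies only to this statement and leaves the application's default isolation level untouched:

	db2 => SELECT CUST_ID, NAME FROM APPEASY.CUSTOMER WITH UR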

Commit and Locking Parameters

The MINCOMMIT parameter controls how many commits are grouped together before their log records are written to disk. The parameter accepts values from 1 to 25 inclusive.

>>db2 get db cfg for ARTIST | grep MINCOMMIT
Group commit count                    (MINCOMMIT) = 1

With the default MINCOMMIT=1, log records are written from memory to disk after every commit statement.

With MINCOMMIT=2, the second commit writes the log records of both the previous and the current transaction, unless the one-second commit timeout has already triggered the write. A commit makes permanent the changes applied by the following statements: ALTER, COMMENT ON, CREATE, DELETE, DROP, GRANT, INSERT, LOCK TABLE, REVOKE, SET INTEGRITY, SET transition-variable, and UPDATE. On commit, all locks acquired by the unit of work are released, except locks needed for open cursors that are declared WITH HOLD.
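
To illustrate the WITH HOLD exception, here is a minimal embedded-SQL-style sketch (the cursor, table and host variable names are hypothetical): the commit releases the unit of work's locks, but the held cursor stays open and keeps its position.

	DECLARE cur1 CURSOR WITH HOLD FOR
	    SELECT ORDER_ID FROM APPEASY.ORDERS;
	OPEN cur1;
	FETCH cur1 INTO :order_id;
	COMMIT;                      -- locks released, cur1 remains open
	FETCH cur1 INTO :order_id;   -- continues from the previous position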

DB2 registry parameter DB2_RR_TO_RS is used to:

  • set the default isolation level for user tables to RS
  • reduce next-key locking for inserted or changed rows
  • reduce deadlocks when using reverse scan indexes (ALLOW REVERSE SCANS)
    Set this parameter only when you expect a performance improvement and your applications can tolerate RS semantics.
	>> db2set -all | grep DB2_RR_TO_RS
	 [i] DB2_RR_TO_RS=NO

Activating the DB2_RR_TO_RS parameter (registry changes take effect after the instance is restarted):

	>>db2set -g DB2_RR_TO_RS=YES

Other general locking and commit parameters:

	>>  db2 get db cfg for ARTIST | grep LOCK
	 Max storage for lock list (4KB)              (LOCKLIST) = 50
	 Percent. of lock lists per application       (MAXLOCKS) = 10
	 Lock timeout (sec)                        (LOCKTIMEOUT) = -1
	 Interval for checking deadlock (ms)         (DLCHKTIME) = 10000

LOCKLIST - The lock list holds the control information for all objects locked by applications. Storage for the database lock list (tables and rows) is allocated in units of 64 bytes for an object on which no other locks are held, and 32 bytes for an object that already has a lock on it. If the lock list is too small, lock escalation can result. When an application reaches the maximum number of locks allowed and has no more locks to escalate, it starts using space in the lock list allocated for other applications. When the entire lock list is full, an error occurs.

MAXLOCKS - The percentage of the lock list that a single application may occupy before the database manager performs lock escalation.

LOCKTIMEOUT - The time in seconds that an application waits to obtain a lock on an object:

  • (0) - the application either obtains the lock immediately or receives error -911
  • (-1) - the application waits until it obtains the lock or until it is discovered by deadlock detection; lock timeout detection is turned off.

DLCHKTIME - The interval in milliseconds at which the database manager checks for deadlocks. A longer interval saves some CPU time but leaves deadlocks unresolved for longer.
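
All four are database configuration parameters and are changed with the UPDATE DATABASE CONFIGURATION command. A minimal sketch against the ARTIST database (the values shown are illustrative, not recommendations); on version 7, such changes take effect after all applications have disconnected from the database:

	>> db2 update db cfg for ARTIST using LOCKTIMEOUT 60 DLCHKTIME 10000
	>> db2 get db cfg for ARTIST | grep LOCK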

The best way to check the status and usage of locking resources is through the performance monitor, which reports lock information similar to the following:

	Lock list memory in use (Bytes)          = 2952 bytes
	Lock Escalations                         = 0
	Exclusive Lock Escalations               = 0
	Locks Held Currently                     = 0
	Current Applications Waiting on Locks    = 0
	Lock waits                               = 0
	Time database waited on locks (ms)       = 0
	Average wait time per lock               = 0.00 ms

Observing these numbers over time makes it possible to pinpoint a potential locking problem.
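
A minimal sketch of collecting such a lock snapshot from the CLP, assuming the lock monitor switch is not yet enabled:

	>> db2 update monitor switches using lock on
	>> db2 get snapshot for locks on ARTIST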

Every data modification (an INSERT, UPDATE or DELETE statement) without a commit causes DB2 to hold and maintain locks on the data, and all other applications must wait until those locks are released before continuing their work. On commit, all locks held by the application are released, giving other applications free access to the data. Well-written application code ensures that the application process holds locks for the shortest time possible.

Commit behavior can be controlled at the database or at the application level. To show the diversity and importance of the commit function, here are a few situations that interact with it:

  • on a log file switch, the DB2 engine performs a commit in the background.
  • every second, the commit timeout forces log records to be written to disk.
  • the database tracks transaction state with the database setting "DATABASE IS CONSISTENT"; its value is NO whenever an update is in progress and returns to YES once all transactions are committed.
  • when you issue a commit, all open cursors of the executing application are closed, except those declared WITH HOLD.
  • during import, commits are controlled by the COMMITCOUNT parameter, committing after every n records and allowing filled log files to be closed (see the sketch after this list).
  • in JDBC, the commit after an insert is governed by the autocommit flag (managed by the Connection.setAutoCommit method).
  • the content of temporary tables declared with "on commit preserve rows" is preserved across commits and cleared when the session is closed.
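
As an example of the IMPORT behavior mentioned above, the following sketch commits after every 1000 inserted records, so locks and log space are released at regular intervals (the input file and target table are hypothetical):

	>> db2 import from customer.del of del commitcount 1000 insert into APPEASY.CUSTOMER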

Performance Problem: "Slow Database"

As an example, a customer called because of a database performance problem. The machine was an 8-CPU Sun Solaris server running a version 7 database with FixPak 3, spread over 8 database partitions. It was an OLTP-type database, and the machine was sized to handle more customer connections than were actually connected. A snapshot of the system resources showed high CPU utilization.

	$ vmstat 1 10
	kthr     memory             page              faults        cpu     
	----- ----------- ------------------------ ------------ -----------
	 r  b   avm   fre  re  pi  po  fr   sr  cy  in   sy  cs us sy id  wa 
	 0  5 160319 23319   0   0   0   0    0   0 475 5113 920 20 26  0 54
	 2  3 160319 23240   0   0   0   0    0   0 541 4791 929 23 24  0 53
	 1  3 160319 23250   0   0   0   0    0   0 466 5219 842 16 25  0 59
	 2  3 160319 23213   0   0   0   0    0   0 439 4151 868 21 29  0 50
	 5  2 160319 23204   0   0   0   0    0   0 509 6069 829 30 21  0 50
	 0  3 160319 23181   0   0   0   0    0   0 455 6385 865 28 18  0 54
	 0  3 160319 23183   0   0   0   0    0   0 456 6472 849 25 35  0 40
	 3  2 160319 23235   0   0   0   0    0   0 382 5346 605 26 23  0 51
	 2  4 160176 23404   0   0   0   0    0   0 407 6537 624 34 26  0 40

The memory situation was good, with no swapping. Disk I/O was very low, so the database had no problem working with the disk subsystem. The problem was high CPU utilization: on this 8-CPU system, idle time (the id column) was 0. So, what could cause such high CPU usage?

In most cases, the reason for high CPU utilization is badly written SQL application code or a database locking problem, and it is not unusual for an overloaded machine to suffer from both. Frequently, non-optimal SQL code causes a locking problem.

List of the database locking parameters:

	>>  db2 get db cfg for ARTIST | grep LOCK
	
	 Max storage for lock list (4KB)              (LOCKLIST) = 50
	 Percent. of lock lists per application       (MAXLOCKS) = 10
	 Lock timeout (sec)                        (LOCKTIMEOUT) = -1
	 Interval for checking deadlock (ms)         (DLCHKTIME) = 10000

List of the database commit parameters:

	Group commit count                          (MINCOMMIT) = 1
	Percent log file reclaimed before soft chckpt (SOFTMAX) = 100

A locking information snapshot:

	Lock list memory in use (Bytes)          = 2952 bytes
	Lock Escalations                         = 10023
	Exclusive Lock Escalations               = 1322
	Locks Held Currently                     = 43
	Current Applications Waiting on Locks    = 6
	Lock waits                               = 2322333
	Time database waited on locks (ms)       = 38447777
	Average wait time per lock               = 233.00 ms

Connections to the monitored database:

	db2 =>  LIST APPLICATIONS    
	 
	Auth Id  Application    Appl.      Application Id                 DB       # of
	         Name           Handle                                    Name    Agents
	-------- -------------- ---------- ------------------------------ -------- -----
	APPEASY  ARTIST101.exe    46         AA34000A.0BBA.030716115358     ARTIST   1    
	APPEASY  ARTIST101.exe    17         AA34000A.0BCD.030716115401     ARTIST   1    
	APPEASY  ARTIST101.exe    13         AA34000A.0BBE.030716115359     ARTIST   1    
	DB2INST1 db2bp_s          133        *LOCAL.db2inst1.030716082332   ARTIST   1    

Checking database log file db2alert.log for possible errors:

	2003-07-16-10.30.10.642353   Instance:db2inst1   Node:003
	PID:12255(db2agntp (ARTIST) 3)   Appid:*N0.db2inst1.030716072626
	data_management  sqldEscalateLocks   Probe:2   Database:ARTIST
	 
	 -- Lock Count, Target  : 113791, 56895
	 -- Table (ID) Name     : (123;5) APPEASY.VRES111
	 -- Locks, Request Type : 113769, X 
	 -- Result (0 = success): 0
	 
	 -- Lock Count, Target  : 113784, 56892
	 -- Table (ID) Name     : (37;2) APPEASY.TRES222
	 -- Locks, Request Type : 113781, S
	 -- Result (0 = success): 0
	
	2003-07-16-16.13.28.046685   Instance:db2inst1   Node:004
	PID:60794(db2agent (ARTIST))   Appid:*LOCAL.db2inst1.030627061228
	data_management  sqldEscalateLocks   Probe:3   Database:ARTIST
	 
	 -- Lock Count, Target  : 1141, 570
	 -- Table (ID) Name     : (2;39) APPEASY.MESSAGES
	 -- Locks, Request Type : 514, X
	 -- Result (0 = success): FFFF8502 -> DEADLOCK
	 
	2003-07-16-16.15.50.070572   Instance:db2inst1   Node:002
	PID:60794(db2agent (ARTIST))   Appid:*LOCAL.db2inst1.030708151448
	data_management  sqldEscalateLocks   Probe:3   Database:ARTIST
	 
	 -- Lock Count, Target  : 631, 574
	 -- Table (ID) Name     : (2;69) APPEASY.PAYMENTS
	 -- Locks, Request Type : 500, X
	 -- Result (0 = success): FFFF8502 -> DEADLOCK
	 
	2003-07-16-17.17.52.015357   Instance:db2inst1   Node:000
	PID:23998(db2agent (ANLDEB2))   Appid:*LOCAL.db2inst1.030709090853
	data_management  sqldEscalateLocks   Probe:3   Database:ARTIST
	 
	 -- Lock Count, Target  : 1140, 570
	 -- Table (ID) Name     : (2;39) APPEASY.ACCOUNTS
	 -- Locks, Request Type : 509, X
	 -- Result (0 = success): FFFF8502 -> DEADLOCK

The database log file showed numerous deadlock occurrences. The customer had already complained that end users' applications timed out from time to time with the following error:

	[DB2/6000] SQL0911N  The current transaction has been rolled back because of a deadlock 
	or timeout.  Reason code "2".  SQLSTATE=40001

General conclusion:

The system was overloaded with only 3 database connections, and the database log file contained a good deal of information about lock escalations and deadlock situations. A deadlock is an indication that:

  • applications have contention problems caused by lock escalations

  • the application was built with the wrong isolation level

  • the application's lock ordering is not designed correctly

  • in very rare cases, there is a problem with catalog tables locked for repeatable read

This was my first guess after getting this initial information.

Before taking any further action, we decided to take a deeper look at the database. We agreed with the customer to collect information from the database for one day, and the collected information was converted into graphical charts using Quest Central for DB2. The locking information pointed to the actual problem:

The database held an average of 60 locks across all connected applications, rising at times to 100 locks, even though only 3 applications were running against the database.

A thousand held locks would not be unusual with more than 30 connected applications, which is what the customer had planned for in the first place.

The lock information can be collected for an individual agent using "snapshot for locks." For example, getting lock info for application handle 13 from the LIST APPLICATIONS output shown above:

	db2 => GET SNAPSHOT FOR LOCKS FOR APPLICATION AGENTID 13

This command produces a report of all locks held by agent number 13.

Another useful piece of information, the average lock wait time, was also collected:

The average lock wait time ranged from 250 ms to 4,000 ms, which is very high and not suitable for a multi-user concurrent system. Locking too many resources overloaded the UNIX machine and left applications blocked while waiting for locks. Even adding more CPU units would not solve this problem.

Database and Application Tuning

We solved this problem in conversation with the application developers. They needed to provide the DBA with information about the target database load, and we needed to convince them to change their SQL code and the way they lock database objects.

The statement from the application developers regarding the target database load:

"We suppose that a maximum of 35 concurrent programs will attack the database and each will process around 20000 records from the customer table."

During database observation, we noticed a high number of exclusive escalations, equal to the total number of escalations in the database. An exclusive escalation occurs when one or several row locks are escalated to a single exclusive table lock; it is triggered by the database's internal mechanism, while normal lock escalations can also be provoked by the application code. The problem was mainly a lock list too small for the current number of running applications, or one or more applications using an excessive number of locks.

A very basic calculation for the lock parameters:

1 connection -> 20,000 records = 20,000 row locks x 72 bytes = 1.5 MB

35 connections -> 1.5 MB x 35 users = 52 MB, plus additional space for internal processing and the other users, up to 60 MB total.

To solve the problem with the high lock escalations, we decided to change the lock list size from 200 KB to 60 MB.

	Max storage for lock list (4KB)              (LOCKLIST) = 50 -> 15000
	Percent. of lock lists per application       (MAXLOCKS) = 10 -> 20
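
A sketch of applying these changes from the CLP; on version 7, the new lock list size takes effect once all applications have disconnected and the database is reactivated:

	>> db2 update db cfg for ARTIST using LOCKLIST 15000 MAXLOCKS 20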

The MAXLOCKS parameter was changed from 10% (20 KB of the old lock list) to 20% (12 MB of the new one), so a single application can now hold up to 12 MB of lock space before the lock escalation procedure starts. After these changes, lock escalations, in which DB2 replaces several row locks with one table lock, were eliminated.

Additional improvements were implemented in the application code by the developers:

  • implemented more frequent commits than before
  • changed locking for updates that touch many table rows to a single table lock held for a very short time (see the sketch after this list)
  • changed the LOCKSIZE parameter for several critical tables from row level to table level:
    	db2 =>  ALTER TABLE ARTIST.PAYMENT LOCKSIZE TABLE
  • changed the isolation level used by the application from Repeatable Read to Cursor Stability, which decreased the number of shared locks.
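
The table-lock pattern from the second bullet, as a minimal sketch (the table and its columns are hypothetical): one exclusive table lock replaces thousands of row locks, and the immediate commit releases it after a very short time.

	LOCK TABLE APPEASY.PAYMENTS IN EXCLUSIVE MODE;
	UPDATE APPEASY.PAYMENTS SET STATUS = 'DONE' WHERE STATUS = 'PENDING';
	COMMIT;    -- the exclusive table lock is released here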

The situation with deadlocks was discussed and managed with the following changes:

	Lock timeout (sec)                        (LOCKTIMEOUT) = -1 -> 60 
	Interval for checking deadlock (ms)         (DLCHKTIME) = 10000

The application now waits a maximum of 60 seconds to lock a resource, and the deadlock detector resolves a deadlock condition within 10 seconds. This brings the high CPU utilization down. The application was also changed to handle the exception raised when a lock cannot be acquired within the 60-second limit.

The commit behavior was also changed with the following parameter:

	Group commit count                          (MINCOMMIT) = 1 -> 3

This change forces the grouping of three commit requests into one write, which keeps the writing of COMMIT records to disk optimal without additional overhead. We decided to do this because numerous applications request commits within a very short time.
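
A sketch of applying the deadlock-handling and commit changes from the CLP:

	>> db2 update db cfg for ARTIST using LOCKTIMEOUT 60 MINCOMMIT 3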

The developers also changed the application code for one critical batch process. They now execute the job over additional DB2 connections, separating the work between connections and keeping rows locked for less time than before.

The database was restarted and statistics were collected again. There were no more deadlocks, and the average lock wait time was drastically reduced from 4,000 ms to 250 ms. The newly generated statistics were much better:

The average lock wait time is still high, but no longer critical. CPU utilization dropped to a normal level of 25%, and all connected users were satisfied with performance.

Lock situation overview after changes:

Locks are now held for shorter periods and the application responds faster. The commit handling implemented in the program code works much better: commits are made more frequently, which reduces the time locks are held on updated rows. Packages were rebound with isolation level CS, and the lock size was changed from LOCKSIZE(PAGE) to LOCKSIZE(ROW). The batch programs were bound with the RELEASE(COMMIT) option instead of RELEASE(DEALLOCATE). The database log file db2alert.log shows no new entries for the deadlock problem, just a regular row lock escalation from time to time. The customer is satisfied and the problem is solved.
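
Rebinding a package with a different isolation level is done through the BIND command. A minimal sketch, assuming a hypothetical bind file artist101.bnd; the RELEASE option is relevant when binding against host databases such as DB2 for z/OS:

	>> db2 bind artist101.bnd isolation CS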

Conclusion

Is it realistic to expect the database engine to solve all our application problems? Databases have recently been enhanced with self-tuning and auto-management features, but they still have no ability to rewrite inefficient user application code. It is amazing how many sites running versatile DB2 applications are still not capable of running them concurrently. The only solution I see is close cooperation between the application team and the production DBA.

» See All Articles by Columnist Marin Komadina
