Document Management with Oracle Text

Thursday Jun 12th 2003 by Marin Komadina
Share:

Powerful procedures for handling image, audio, video and the text data in various formats are incorporated in the Oracle database kernel code. Marin Komadina looks at the ins and outs of the feature originally known as ConText, then interMedia Text, and finally Oracle Text.

A few years ago, many projects started using the multimedia capability of Oracle databases. Powerful procedures for handling image, audio, video and the text data in various formats are incorporated in the database kernel code; even the default database installation has multimedia objects installed. Storing, retrieving and searching the text from documents stored inside the Oracle database are the most common usage of multimedia database capabilities. This feature, incorporated into the product was formerly known as ConText, later known as interMedia Text (iMT) and with the database version 9.X, it is known as Oracle Text.

This article covers:

  • Document Index Type
  • Oracle Text Architecture (Classes, Objects, Preferences, Attributes)
  • Oracle Text Installation Check
  • Text Indexes Inventory
  • Synching and Optimising a Text Index
  • Monitoring and Error Logging
  • Maintenance Tips
  • Conclusion

Document Index Type

Oracle Text is an extension to the Oracle database that allows searching specific words in the tables of documents, using standard SQL expressions. Oracle Text is integrated in a number of Oracle products such as Portal, iFS and Applications. Supported document types are text, HTML, DOC, XLS, PPT, PDF and XML documents. txt/HTML data content will be stored in the clob column and other formatted document content in the blob column. The content can "also" stored outside of database via BFILES. For any kind of data content, the text engine is used for indexing and retrieving.

Overview of the Oracle Text product development:

Product Name

Database version

Index Type

ConText 2.x.x

< 8.1.5

interMedia Text

< 8.1.7

Context

Oracle Text for Oracle8i

(V 8.1.x) 8.1.7

Catalog, Context

Oracle Text

9.x

Ctxxpath, Catalog, Context

There are three different index types: Context, Catalog and Ctxxpath. They are used all for document indexing, but each has a different functionality.

Context index is a "domain" index used for fast retrieval of unstructured text.

DML processing on a Context index is deferred. The actual index updates do not take place until an index SYNC is performed.

Catalog (CTXCAT) index is an online, "catalog" index, efficient for searching between small, simple text fields and with queries using some structured criteria, (usually numbers or dates). This index type supports only a basic functionality provided in a Context index. A Catalog index has all the characteristics of the normal database index.

Ctxxpath index is a special index installed during an Oracle Text install. This index uses Oracle Text code and can be created only on sys.xmltype columns. It is used to speed up certain queries using the existsnode method.

Oracle Text Architecture (Classes, Objects, Preferences, Attributes)

An Oracle text index can be created on the following column types:

CHAR, VARCHAR, VARCHAR2, LONG, LONG RAW, BLOB, CLOB or BFILE.

Text index will store words and the documents in which these words occur.

The Oracle Text engine uses special preference system for providing this service.


click for larger image

Text indexing is managed in several stages called "classes". Defined classes on the Oracle 8.1.7.2 database are:

SQL> select * from ctx_classes;
 
CLA_NAME                       CLA_DESCRIPTION
------------------------------ -----------------------------DATASTORE    Data store Class
FILTER                          Filter Class
SECTION_GROUP                   Section Group
LEXER                           Lexer Class
WORDLIST                        Word List Class
STOPLIST                        Stop List Class
STORAGE                         Storage Class
INDEX_SET                       Index Set

Datastore is a document data selector table from which the CTX engine reads column data and returns document data.

Filter takes the document data from the datastore class and filters it to readable text.

Sectioner takes a text format, as input, and generates two outputs (the section boundaries and plaintext).

Lexer gets plaintext from the sectioner and splits it into words (discrete tokens).

Wordlist, Stoplist, and storage classes

STOPLIST Class holds a list of stop words, which are filtered out during indexing WORDLIST Class holds fuzzy and steam expansion settings used at the query time

STORAGE Class holds storage parameters for the underlying index tables and their indexes.

For example, a text index KB_INDEX will consist of the following objects:

DR$KB_INDEX$I - the tokens table, indexed tokens

DR$KB_INDEX$K - the docid mapping table, where text keys occur

DR$KB_INDEX$N - the negative row table, documents marked for deletion

DR$KB_INDEX$R - the rowid mapping table

DR$KB_INDEX$P - the substring index table

Every class has a defined list of objects, which are only templates.

A Class list on the Oracle 8.1.7.2 database version:

SQL> select * from ctx_objects;
 
OBJ_CLASS           OBJ_NAME             OBJ_DESCRIPTION
---------------   ------------------- -----------------------------------------
DATASTORE         DIRECT_DATASTORE      Documents are stored in the column
DATASTORE         DETAIL_DATASTORE      Documents are split into multiple lines
DATASTORE         FILE_DATASTORE        Documents are stored in files,column is file name
DATASTORE         URL_DATASTORE         Documents are web pages, column is URL
DATASTORE         USER_DATASTORE        Documents are stored in the column
DATASTORE         NESTED_DATASTORE      Documents are stored in a column in the nested table
DATASTORE     MULTI_COLUMN_DATASTORE    Documents are stored in multiple columns
FILTER            NULL_FILTER           Null filter
FILTER            USER_FILTER           User-defined filter
FILTER            CHARSET_FILTER        character set converting filter
FILTER            INSO_FILTER           filter for binary document formats
FILTER            PROCEDURE_FILTER      Procedure filter
SECTION_GROUP     NULL_SECTION_GROUP    null section group
SECTION_GROUP     BASIC_SECTION_GROUP   basic section group
SECTION_GROUP     HTML_SECTION_GROUP    html section group
SECTION_GROUP     XML_SECTION_GROUP     xml section group
SECTION_GROUP     NEWS_SECTION_GROUP    news section group
SECTION_GROUP     AUTO_SECTION_GROUP    auto section group
LEXER             BASIC_LEXER           Lexer for alphabetic languages
LEXER             JAPANESE_VGRAM_LEXER  V-gram lexer for Japanese
LEXER             KOREAN_LEXER          Dictionary-based lexer for Korean
LEXER             CHINESE_VGRAM_LEXER   V-GRAM lexer for Chinese
LEXER             CHINESE_LEXER         Chinese lexer
LEXER             MULTI_LEXER           Multi-language lexer
WORDLIST          BASIC_WORDLIST        basic wordlist
STOPLIST          BASIC_STOPLIST        basic stoplist
STOPLIST          MULTI_STOPLIST        multi-language stoplist
STORAGE           BASIC_STORAGE         text-index storage
INDEX_SET         BASIC_INDEX_SET       basic index set

Instead of directly using objects, we are creating a "preference" from one of the object templates, which is further customized by setting "attributes".

When we create the Oracle Text index, the indexing engine will read the defined preference for the index and invoke the attribute for each indexed document.

Oracle Text Installation Check

The Oracle interMedia software is installed as a part of the standard installation.

All of the Oracle Text index information is kept in the database repository under user CTXSYS. To install Oracle Text, you need to select Oracle interMedia from the menu.

We can check for the existence of the objects under CTXSYS schema and the version of Oracle Text:

SQL> connect ctxsys/ctxsys
Connected.
SQL> select * from ctx_version;
 
VER_D
-----
8.1.7

SQL> select object_type, count(*)  from user_objects group by object_type order by 1; 
 
OBJECT_TYPE          COUNT(*)
------------------ ----------
FUNCTION                    1
INDEX                      41
INDEXTYPE                   2
LIBRARY                     2
LOB                         1
OPERATOR                    3
PACKAGE                    40
PACKAGE BODY               34
PROCEDURE                   2
SEQUENCE                    3
TABLE                      32
TYPE                        5
TYPE BODY                   4
VIEW                       43
 

Text Index Inventory

Oracle Text Index settings are written in the ctx_user_preference_values view:

SQL> select * from ctx_user_preference_values;
 
PRV_PREFERENCE                 PRV_ATTRIBUTE                  PRV_VALUE
--------------------------------------------------------------------------------------------
DEFAULT_LEXER                  BASE_LETTER                    YES
DEFAULT_LEXER                  COMPOSITE                      GERMAN
DEFAULT_LEXER                  MIXED_CASE                     NO
DEFAULT_LEXER                  INDEX_TEXT                     YES
DEFAULT_LEXER                  INDEX_THEMES                   NO
DEFAULT_LEXER                  ALTERNATE_SPELLING             GERMAN
DEFAULT_STORAGE                R_TABLE_CLAUSE                 lob (data) store as (cache)
DEFAULT_STORAGE                I_INDEX_CLAUSE                 compress 2
DEFAULT_WORDLIST               STEMMER                        ENGLISH
DEFAULT_WORDLIST               FUZZY_MATCH                    GENERIC
MY_LEXER                       COMPOSITE                      GERMAN
MY_LEXER                       MIXED_CASE                     YES
MY_LEXER                       ALTERNATE_SPELLING             GERMAN
URL_DATASTORE                  TIMEOUT                        30
URL_DATASTORE                  MAXTHREADS                     8
URL_DATASTORE                  URLSIZE                        256
URL_DATASTORE                  MAXURLS                        256
URL_DATASTORE                  MAXDOCSIZE                     2097152

These settings are an overview of Oracle Text settings. In this example, the listed settings come from the German language Oracle Text installation. An existence and the status of Text index is checked via user_indexes view:

select index_name name, index_type "TYPE 1", ityp_owner "OWNER",ityp_name "TYPE 2",
domidx_opstatus   status from user_indexes where ityp_name="CONTEXT"

NAME          TYPE 1       OWNER        TYPE 2       STATUS 
------------  ------------ ------------ ------------ -------- 
KB_INDEX      DOMAIN       ARTIST       CONTEXT      VALID
WEB_TEXT      DOMAIN       WEB_ARTIST   CONTEXT      VALID
TEST_IDX      DOMAIN       CTXTEST      CONTEXT      INVALID
WEB_TEXT_GER  DOMAIN       WEB_ARTIST   CONTEXT      VALID

We have four Text indexes defined in the database. One of them is invalid.

The following select will provide information about indexed columns and indexed documents:

SQL> select IDX_OWNER#,IDX_NAME,IDX_TABLE#,IDX_KEY_NAME, IDX_DOCID_COUNT from DR$INDEX;

IDX_OWNER# IDX_NAME              IDX_TABLE#  IDX_KEY_NAME             IDX_DOCID_COUNT
---------------------------------------------------------------------------------------
       107 KB_INDEX            274239       ID,DOCU                   99014
       108 WEB_TEXT            276487       PRODUCT_DETAIL            99332
        81 TEST_IDX            191146       NR                        2
        97 WEB_TEXT_GER        238978       PRODUCT_DETAIL            97368

Listing content for the WEB_TEXT_GER index:

SQL> select token_text from ARTIST.DR$WEB_TEXT_GER$I;

TOKEN_TEXT
----------------
900782
387869
MANTEL
124
221
ANDEREN
EF
26.17
150005
AUFLOESUNGEN 

The Oracle Text index WEB_TEXT_GER belongs to the user WEB_ARTIST and was created on the table WEB_SHOP_GER:

SQL> desc web_artist.web_shop_ger

 Name                          Null?    Type
 ----------------------------- -------- --------------
 ID                            NOT NULL NUMBER(38)
 NAME                          NOT NULL VARCHAR2(100)
 LOCNLS                        NOT NULL VARCHAR2(7)
 GLOBNLS                       NOT NULL VARCHAR2(7)
 DATE_MODIFIED                 NOT NULL DATE
 MODIFIED_BY                   NOT NULL VARCHAR2(30)
 CREATED_BY                    NOT NULL VARCHAR2(30)
 PRODUCT_DETAIL                NOT NULL VARCHAR2(4000)

We can test index functionality using a content-based query:

SQL> select product_detail from web_artist.web_shop_ger 
	where contains (product_detail, 'surestore') > 0;

PRODUCT_DETAIL
--------------------------------------------------------------------------------
Streamer, Hewlett-Packard Surestore Backup, 8GB Kapazit\303\244t, extern, SCSI-Schnittstell

The result set is document rows containing the word surestore.

Synching and Optimising a Text Index

Some potentional problems with the Text index are the index fragmentation, document invalidation and the index out of sync.

Index Sync

The following select will provide the information about out of sync indexes:

SQL> select count(*),PND_INDEX_OWNER from ctx_pending group by PND_INDEX_OWNER; 

 COUNT(*) PND_INDEX_OWNER
--------- ------------------------------
     1000 ARTIST
      698 WEB_ARTIST
      955 WEB_ARTIST

A user ARTIST has only one Text index with the name KB_INDEX Text. This index has 1000 pending, not indexed documents. We will run a manually synchronization of the KB_INDEX:

SQL> alter index ARTIST.KB_INDEX rebuild online parameters('sync memory 12345678'); 
Index altered.

After synchronization, the KB_INDEX should not have any non-indexed documents:

SQL> select count(*), PND_INDEX_OWNER from ctx_pending group by PND_INDEX_OWNER;

 COUNT(*) PND_INDEX_OWNER
--------- ------------------------------
      698 WEB_ARTIST
      955 WEB_ARTIST

The index KB_INDEX is now in the sync with the base table.

Another way to ensure index synchronization is to use the ctx_dll package. With this package, we have a time schedule option for the index synchronization.

Setting index synchronization every 2 hours:

SQL> exec ctx_schedule.startup ( 'web_text_ger', 'SYNC', 120 ); 
Index Optimization

Index Optimization can be fast or full. The fast optimization runs through the whole index table and connects fragmented rows together, to a maximum of 4000 bytes for a single row. This reduces the number of rows in the index table.

Full mode optimization is called garbage collection. It removes the old information left over after document invalidation and connects separate rows together where possible. Both optimizations are non-blocking, providing queries to run on the index during optimization.

Running fast index optimization:

SQL> alter index web_artist.web_text_ger rebuild online parameters ('optimize fast');

Running full index optimization:

SQL> alter index web_text_ger rebuild online parameters ('optimize full');

Running full index optimization every 2 hours:

SQL> exec ctx_schedule.startup ( 'web_text_ger', 'OPTIMIZE FAST', 120 );

Index optimization with "alter index..." is a blocking operation; during index maintainance, no one can use the index. To provide a non-blocking effect, use ctx_dll package, which allows users to use the Text index during rebuild or resync.

An example of making fast optimization:

begin 
Ctx_ddl.Sync_Index( idx_name=>'WEB_TEXT_GER', optlevel => Ctx_ddl.Optlevel_fast) ;
end;
/

Monitoring and Error Logging

The indexing process can log all progress information in a log file.

SQL> execute CTX_OUTPUT.START_LOG('$ORACLE_HOME/ctx/log');
PL/SQL procedure successfully completed.

Content of the log file should look similar to the following:

Oracle interMedia Text: Release 8.1.7.0.0 - Production on Tue Oct 17 09:35:46
(c) Copyright 1999 Oracle Corporation.  All rights reserved.

09-35-46 10/17/00 begin logging
09-36-21 10/17/00 populate index: INTERMEDIA.WEB_TEXT_GER
09-36-21 10/17/00 ..opening interMedia Text session
09-36-21 10/17/00 Begin document indexing
09-36-24 10/17/00 100 documents indexed
09-36-25 10/17/00 200 documents indexed
09-36-26 10/17/00 300 documents indexed
09-36-37 10/17/00 Errors reading documents: 0
09-36-37 10/17/00 Index data for 353 documents to be written to database
09-36-37 10/17/00    memory use: 490054
09-36-38 10/17/00    index data written to database.
09-36-38 10/17/00 End of document indexing. 353 documents indexed.

To stop the logging activity use the following syntax:

SQL> ctx_output.end_log;

Processing errors encountered during indexing are logged. The index build will continue reading the next document to index. The logged errors are stored in the database and can be checked with the ctx_user_index_errors view.

Sometimes a create index or alter index fails, leaving the context index in an unusable state. The database believes that the index is in "LOADING" or "INPROGRESS" state and blocks any attempt to recover through alter index. This situation can only be corrected by dropping and recreating the index.

SQL> select owner,index_name, status from dba_indexes where index_name='WEB_*';

OWNER                          INDEX_NAME                     STATUS
------------------------------ ------------------------------ --------
WEB_ARTIST                     WEB_TEXT_GER                   VALID
WEB_ARTIST                     WEB_TEXT                       INPROGRS

SQL> drop index web_artist.web_text;
drop index web_artist.web_text
*
ERROR at line 1:
ORA-29868: cannot issue DDL on a domain index marked as LOADING

SQL> drop index web_artist.web_text force;
Index dropped.

In Oracle version 9.2.0 we have the package CTX_ADM.MARK_FAILED. This package makes a call to force the index from LOADING to FAILED, thus unblocking the alter index command.

Conclusion

Oracle version 9.2 has several main enhancements for Oracle Text such as multi column datastore, new multi lexers, better partitioning support, enhanced XML features, document classification and many others. In addition, Oracle version 9.2 implements support for create or rebuild index online, providing DML on the base table during an index creation.

Nevertheless, many customers still remember the buggy code with ConText and interMedia, and the poor customer support for this product. Many of them have switched to some other text engine, using an Oracle database as a simple data store.

» See All Articles by Columnist Marin Komadina

Share:
Home
Mobile Site | Full Site
Copyright 2017 © QuinStreet Inc. All Rights Reserved