Saturday, March 21, 2009

Full-Text Search using Oracle Text

This is in continuation with my previous post on Full Text Search in which I discussed about MySQL’s built-in Full Text Search engine and external Open Source Full Text Search engines as options for performing integrating full-text search features in java applications. This time I want to share information on building Full-Text Search Applications with Oracle Text.

Oracle Text:

Oracle Text is a powerful search technology built into all Oracle Database editions, including the free Express Edition (XE). The development APIs provided by Oracle Text allow software developers to easily implement full-featured content search applications.

Oracle Text is suitable for a wide variety of search-related use cases and storage structures. Application areas for Text include e-business, document and records management as well as issue tracking just to name a few. Retrievable text can reside in a structured form inside the database or in unstructured form either in a local file system or on the Web.

Oracle Text can be used to search structured and unstructured documents complementing the SQL wildcard matching. Oracle Text provides a complete SQL-based search API that consists of custom query operators, DDL syntax extensions, a set of PL/SQL procedures and database views. Text API gives the application developer full control over indexing, queries, security, presentation, and software configuration that is sometimes required. Oracle Text is also programming-language agnostic and works equally well for PHP as well as Java applications.

Setting Up Oracle Text:

Oracle Text is installed with an Oracle Database XE installation by default. With other database editions, you need to install the Oracle Text feature yourself. Once the feature is present, you only need to create a normal database user and grant the CTXAPP role to the user. This will allow the user to execute certain index management procedures:

Indexing Process and Searching

Oracle Text indexes retrievable data items before users are able to find content with search. This is a common approach used to ensure adequate search performance. The Oracle Text indexing process is modeled after a pipeline, where data items retrieved from a data store pass through a series of transformations before their keywords are added to the index. The indexing process is split into multiple phases, where each phase is handled by a separate entity and configurable by the application developer.

Oracle Text has different index types that are suitable for different purposes. For full-text search with large documents, the CONTEXT index is the appropriate index type. The indexing process includes the following phases:

  1. Data Retrieval: Data is simply fetched from a data store, for example, a Web page, database large object, or local file system, and passed as a stream of data to the next phase.
  2. Filtering: The filters are responsible for converting data in different file formats to plain text. The other components in the indexing pipeline only process plain text data and don't know about file formats such as Microsoft Word or Excel.
  3. Sectioning: The sectioner adds metadata about the structure of the original data item.
  4. Lexing: A stream of characters is split into words based on the language of the item.
  5. Indexing: In this final phase, the keywords are added to the actual index.

Once the index has been built, an application can use plain SQL queries to execute a search entered by an end user.

Searching

The CONTAINS operator is used for searching CONTEXT indexes.

Index Maintenance

Because base table data is replicated by the index, the data needs to be periodically synchronized to the index. Index maintenance procedures can be found in the CTX_DDL PL/SQL package.

Summary

Oracle Text allows users to create full-text index on a single column / multiple columns in a single table as well across multiple tables in a database. Details on creating the index, searching and index maintence is discussed comprehensively in the OTN Developer article on full text indexing.

References:

1. OTN Developer article on full text indexing

2. OTN Discussion Forum - Topic on multi-table indexing

3. Thread on full-text indexing