26 December, 2017

Domain Indexes -- 2 : STOPLIST for a Context Index

In the previous example, I created a simple CONTEXTIndex.
For a CONTEXT Index, Oracle automatically applies a default "STOPLIST".  This is a list of common words that are automatically excluded from the Index.
If you notice in the previous post, the words "this", "is", "a", "to", "be" are all in the MY_TEXT column of the MY_TEXT_TABLE but do not appear in the list of tokens in DR$MY_TEXT_INDEX$I.


Can we expand and build our own STOPLIST ?

I start with a new table containing the same rows :

SQL> connect ctxuser/ctxuser
Connected.
SQL> create table another_text_table
  2  as select id_column as id_column,
  3  my_text as another_text
  4  from my_text_table;

Table created.

SQL> select another_text from another_text_table;

ANOTHER_TEXT
--------------------------------------------------------------------------------
This is a long piece of text written by Hemant
Another long text to be captured by the index

SQL> 


I then build a custom STOPLIST (which I name as "another_text") and then create the CONTEXT Index.

SQL> begin
  2  ctx_ddl.create_stoplist('another_text');
  3  ctx_ddl.add_stopword('another_text','Hemant');
  4  ctx_ddl.add_stopword('another_text','long');
  5  end;
  6  /

PL/SQL procedure successfully completed.

SQL> 
SQL> create index another_text_index
  2  on another_text_table(another_text)
  3  indextype is ctxsys.context
  4  parameters('stoplist another_text');

Index created.

SQL> 


This time, I don't have to use CTX_DDL.SYNC_INDEX as the Index is created after the table is populated. Nevertheless, if I execute DML (INSERT, UPDATE or DELETE) to change data, I will need a call to CTX_DDL.SYNC_INDEX to update the Index.

Now, I check the Tokens that are created in this index.

SQL> select token_text, token_count
  2  from dr$another_text_index$i
  3  order by token_text
  4  /

TOKEN_TEXT                                                       TOKEN_COUNT
---------------------------------------------------------------- -----------
A                                                                          1
ANOTHER                                                                    1
BE                                                                         1
BY                                                                         2
CAPTURED                                                                   1
INDEX                                                                      1
IS                                                                         1
OF                                                                         1
PIECE                                                                      1
TEXT                                                                       2
THE                                                                        1
THIS                                                                       1
TO                                                                         1
WRITTEN                                                                    1

14 rows selected.

SQL> 


The STOPLIST words that I defined ('Hemant' and 'long') are not in the Tokens list.  However, since I defined my own Custom STOPLIST, Oracle's default BASIC_STOPLIST has been overriden.  So, now the Tokens list includes words like "a", "be", "by", "is" etc.

Contrast this with the Tokens list in the CONTEXT index that was created on MY_TEXT_TABLE using the default BASIC_STOPLIST :

SQL> select token_text, token_count
  2  from dr$my_text_index$i
  3  order by token_text
  4  /

TOKEN_TEXT                                                       TOKEN_COUNT
---------------------------------------------------------------- -----------
ANOTHER                                                                    1
CAPTURED                                                                   1
HEMANT                                                                     1
INDEX                                                                      1
LONG                                                                       2
PIECE                                                                      1
TEXT                                                                       2
WRITTEN                                                                    1

8 rows selected.

SQL> 


So, if you want to build a Custom STOPLIST, make sure you identify all the "common" words you want excluded.
.
.
.

10 December, 2017

Domain Indexes -- 1 : CONTEXT Indexes

A CONTEXT Index is one of a class of Domain Indexes.  It is used to build text-retrieval application.
Instead of a regular B-Tree index that captures the entire text of a VARCHAR2 column as a key, you can build an index that consists of "Tokens", words extracted from a long-ish text in the database column.  Searching the index is based on the CONTAINS operator.

Here is a quick demo in 11.2.0.4 :

First I setup a new user for this demo.  I could have used an existing user but my "HEMANT" user has DBA privileges and I want to demonstrate CONTEXT without such privileges.
Here the key grant is "CTXAPP" that is granted to the user.

SQL> create user ctxuser identified by ctxuser default tablespace users;

User created.

SQL> grant create session, create table to ctxuser;

Grant succeeded.

SQL> grant ctxapp to ctxuser;

Grant succeeded.

SQL> alter user ctxuser quota unlimited on users; 

User altered.

SQL> 


Next, this user creates the demonstration table and index :

SQL> connect ctxuser/ctxuser
Connected.
SQL> create table my_text_table 
  2  (id_column number primary key,
  3  my_text varchar2(2000));

Table created.

SQL> create index my_text_index
  2  on my_text_table(my_text)
  3  indextype is ctxsys.context;

Index created.

SQL> 
SQL> insert into my_text_table
  2  values (1,'This is a long piece of text written by Hemant');

1 row created.

SQL> insert into my_text_table
  2  values (2,'Another long text to be captured by the index');

1 row created.

SQL> commit;

Commit complete.

SQL> 
SQL> -- this update to the index would be a regular background job
SQL> exec ctx_ddl.sync_index('MY_TEXT_INDEX');

PL/SQL procedure successfully completed.

SQL> 


Note the call to CTX_DDL.SYNC_INDEX.  Unlike a regular B-Tree index, the CONTEXT (Domain) Index is, by default, *not* updated in real-time when DML occurs against the base table.  (Hopefully, in a future post, I may demonstrate real-time updates to a CTXCAT index, different from a CONTEXT Index).
Only remember that this CONTEXT index is not automatically updated.  If new rows are added to the table, they are not visible through the index until a SYNC_INDEX operation is performed.  The SYNC_INDEX can be scheduled as a job with another call to DBMS_JOB or DBMS_SCHEDULER or itself as part of the CREATE INDEX statement.

Now, let me demonstrate usage of the Index with the CONTAINS operator.

SQL> select id_column as id,
  2  my_text
  3  from my_text_table
  4  where contains (my_text, 'written by Hemant') > 0
  5  /

 ID
----------
MY_TEXT
--------------------------------------------------------------------------------
         1
This is a long piece of text written by Hemant


SQL> select my_text
  2  from my_text_table
  3  where contains (my_text, 'Another long') > 0
  4  /

MY_TEXT
--------------------------------------------------------------------------------
Another long text to be captured by the index

SQL> 


Yes, the CONTAINS operator is a bit awkward.  It will be some time before you (or your developers) get used to the syntax !

Besides CTX_DDL, there are a number of other packages in the prebuilt CTXSYS schema that are available :
 CTX_CLS
 CTX_DDL
 CTX_DOC
 CTX_OUTPUT
 CTX_QUERY
 CTX_REPORT
 CTX_THES
 CTX_ULEXER

Since I have created a separate database user, I can also demonstrate the additional objects that are created when the INDEXTYPE IS CTXSYS.CONTEXT.

SQL> select object_type, object_name, to_char(created,'DD-MON-RR HH24:MI') Crtd
  2  from user_objects
  3  order by object_type, object_name
  4  /

OBJECT_TYPE         OBJECT_NAME                    CRTD
------------------- ------------------------------ ------------------------
INDEX               DR$MY_TEXT_INDEX$X             10-DEC-17 16:48
INDEX               DRC$MY_TEXT_INDEX$R            10-DEC-17 16:48
INDEX               MY_TEXT_INDEX                  10-DEC-17 16:48
INDEX               SYS_C0017472                   10-DEC-17 16:48
INDEX               SYS_IL0000045133C00006$$       10-DEC-17 16:48
INDEX               SYS_IL0000045138C00002$$       10-DEC-17 16:48
INDEX               SYS_IOT_TOP_45136              10-DEC-17 16:48
INDEX               SYS_IOT_TOP_45142              10-DEC-17 16:48
LOB                 SYS_LOB0000045133C00006$$      10-DEC-17 16:48
LOB                 SYS_LOB0000045138C00002$$      10-DEC-17 16:48
TABLE               DR$MY_TEXT_INDEX$I             10-DEC-17 16:48
TABLE               DR$MY_TEXT_INDEX$K             10-DEC-17 16:48
TABLE               DR$MY_TEXT_INDEX$N             10-DEC-17 16:48
TABLE               DR$MY_TEXT_INDEX$R             10-DEC-17 16:48
TABLE               MY_TEXT_TABLE                  10-DEC-17 16:48

15 rows selected.

SQL> 
SQL> select table_name, constraint_name, index_name
  2  from user_constraints
  3  where constraint_type = 'P'
  4  order by table_name, constraint_name
  5  /

TABLE_NAME                     CONSTRAINT_NAME
------------------------------ ------------------------------
INDEX_NAME
------------------------------
DR$MY_TEXT_INDEX$K             SYS_IOT_TOP_45136
SYS_IOT_TOP_45136

DR$MY_TEXT_INDEX$N             SYS_IOT_TOP_45142
SYS_IOT_TOP_45142

DR$MY_TEXT_INDEX$R             DRC$MY_TEXT_INDEX$R
DRC$MY_TEXT_INDEX$R

MY_TEXT_TABLE                  SYS_C0017472
SYS_C0017472


SQL> 


Yes, that is a large number of database objects besides MY_TEXT_TABLE and MY_TEXT_INDEX.  SYS_C0017472 is, of course, the Primary Key Index on MY_TEXT_TABLE (on the ID_COLUMN column).  The others are interesting.

The "Tokens" I mentioned in the first paragraph are, for the purpose of this table MY_TEXT_TABLE, in the DR$MY_TEXT_INDEX$I.

SQL> desc DR$MY_TEXT_INDEX$I
 Name                                      Null?    Type
 ----------------------------------------- -------- ----------------------------
 TOKEN_TEXT                                NOT NULL VARCHAR2(64)
 TOKEN_TYPE                                NOT NULL NUMBER(10)
 TOKEN_FIRST                               NOT NULL NUMBER(10)
 TOKEN_LAST                                NOT NULL NUMBER(10)
 TOKEN_COUNT                               NOT NULL NUMBER(10)
 TOKEN_INFO                                         BLOB

SQL> select token_text, token_count
  2  from dr$my_text_index$i
  3  /

TOKEN_TEXT                                                       TOKEN_COUNT
---------------------------------------------------------------- -----------
ANOTHER                                                                    1
CAPTURED                                                                   1
HEMANT                                                                     1
INDEX                                                                      1
LONG                                                                       2
PIECE                                                                      1
TEXT                                                                       2
WRITTEN                                                                    1

8 rows selected.

SQL> 


Update : Here is a demonstration of a STOPLIST for a CONTEXT Index.

I have been busy in the last few months and have not published much this quarter. Hopefully, I will get more time in the coming weeks to explore CONTEXT, CTXCAT and Domain Indexes.
.
.
.