This is an alpha version of contrib/tsearch with ranking support. It also
includes the OpenFTS (0.34) parser, and ispell and snowball (stemming)
support. Comments and documentation are welcome! Without documentation we
will not be able to release the module! We need documentation with examples
for beginners and advanced users. Please send any comments to Oleg Bartunov
(oleg@sai.msu.su) and Teodor Sigaev (teodor@sigaev.ru).

Notes for testers: the tsearch module has been fully reworked. Dixi.

1 Configuration is now kept in four tables (in the future it should move to
the system catalog):

1.1
    create table pg_ts_dict (
        dict_id         int not null primary key,
        dict_name       text not null,
        dict_init       oid,
        dict_initoption text,
        dict_lemmatize  oid not null,
        dict_comment    text
    );

Table for storing dictionaries. The dict_init field stores the Oid of the
function that initializes the dictionary. Dict_init takes one option, the
text value from dict_initoption, and should return the internal
representation (structure) of the dictionary. The structure must be
malloc'ed, or palloc'ed in TopMemoryContext. Dict_init is called only once
per process.

The dict_lemmatize field stores the Oid of the function that lemmatizes a
lexeme. Input values: the dictionary structure, a pointer to the string, and
its length. Output: a pointer to an array of pointers to C strings; the last
pointer in the array must be NULL. A NULL return value means that the
dictionary cannot resolve the word, while an empty array means that the
dictionary knows the word but considers it a stop word.

1.2
    create table pg_ts_parser (
        prs_id       int not null primary key,
        prs_name     text not null,
        prs_start    oid not null,
        prs_getlexem oid not null,
        prs_end      oid not null,
        prs_headline oid not null,
        prs_lextype  oid not null,
        dict_comment text
    );

Table for storing parsers. prs_start stores the Oid of the function that
initializes the parser. Arguments: a pointer to the string and its length;
it returns the internal parser structure, which can be palloc'ed.
prs_getlexem stores the Oid of the function that returns the next lexeme.
Input: the parser structure, a pointer to a char pointer, and a pointer to
an int4. It returns the type of the lexeme; a type equal to 0 means that
parsing is finished. The lexeme itself is returned through the last two
pointers. prs_end stores the Oid of the function that finishes a parsing
session. Input: the parser structure. prs_headline generates a headline; it
works on parsed text stored in an HLPRSTEXT structure (ts_cfg.h). Arguments:
a pointer to HLPRSTEXT, a pointer to the query, and a pointer to the options
(as pgsql's text type). prs_lextype returns an array of LexDescr (see
wparser.h) describing the lexeme types that the parser can return.

1.3
    create table pg_ts_cfg (
        id       int not null primary key,
        ts_name  text not null,
        prs_name text not null,
        locale   text
    );

Tsearch configuration. A locale can be specified to define which
configuration is used for the current locale.

1.4
    create table pg_ts_cfgmap (
        ts_name   text not null,
        lex_alias text not null,
        dict_name text[] not null,
        primary key (ts_name, lex_alias)
    );

Table that stores, for each lexeme type, which dictionaries to use.
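For illustration only, a mapping row might look like this (the configuration
name 'default_russian' and the lexeme-type alias 'lword' are hypothetical;
real aliases come from lexem_type(), see 2.2). Here Latin words would be
tried against the en_stem dictionary first, falling back to simple:

    -- illustrative mapping: process 'lword' lexemes with en_stem, then simple
    insert into pg_ts_cfgmap (ts_name, lex_alias, dict_name)
        values ('default_russian', 'lword', '{en_stem,simple}');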
2 SQL-level interfaces

2.1 Debug interface for dictionaries:

    text[] lemmatize(DICT_ID, text);
    text[] lemmatize(DICT_NAME, text);
    text[] lemmatize(text);

The last lemmatize function uses the dictionary selected by

    select set_curdict(DICT_ID);
    select set_curdict(DICT_NAME);

2.2 Debug interface for parsers:

    create type lexemtype as (lexid int4, alias text, descr text); -- type of table
    setof lexemtype lexem_type(ID_PARSER)
    setof lexemtype lexem_type(NAME_PARSER)
    setof lexemtype lexem_type()

Example: select * from lexem_type();

    create type lexemout as (lexid int4, lexem text); -- parse result
    setof lexemout parse_txt(ID_PARSER, text);
    setof lexemout parse_txt(NAME_PARSER, text);
    setof lexemout parse_txt(text);

The last lexem_type and parse_txt use the parser selected by

    select set_curprs(ID_PARSER);
    select set_curprs(NAME_PARSER);

2.3 Setting the current tsearch configuration:

    select set_curcfg(ID_CFG);
    select set_curcfg(NAME_CFG);
    int4 show_curcfg();

By default, the current configuration is determined by the current locale.
To reset all tsearch caches for dictionaries, parsers and configurations
(useful for debugging):

    select reset_tsearch();

3 Txtidx type

3.1 The txtidx type can store the position of each entry of each word:

    select txt2txtidx('Q W E Q W');
          txt2txtidx
    -----------------------
     'e':3 'q':1,4 'w':2,5

Each entry of a word may carry a 'weight entry', from 'A' (highest) to 'D'
(lowest, the default):

    select 'w:4A,3B,2C,1D,5 a:8A'::txtidx;
              txtidx
    -------------------------
     'a':8A 'w':1,2C,3B,4A,5

3.2 Functions and operations:

    int4 txtidxsize(txtidx)       - number of lexemes
    txtidx strip(txtidx)          - removes all position information
    txtidx chweight(txtidx, char) - sets the weight of each entry to the
                                    second argument ('A'-'D')
    txtidx || txtidx              - returns the union of two txtidx values

    select 'a:3A b:2a'::txtidx || 'ba a:1B';
           ?column?
    -----------------------
     'a':3A,4B 'b':2A 'ba'

Notice: the second argument is treated as text that follows the first one,
so in this example a:1B in the second argument becomes a:4B in the result.
An example of usage:

    create table tbl (title text, body text, ti txtidx);
    update tbl set ti = chweight(txt2txtidx(title),'A') || txt2txtidx(body);

And one more:

    select chweight(txt2txtidx('sky wonderfull'),'A') || txt2txtidx('skies wow foo bar');
                        ?column?
    ---------------------------------------------------
     'bar':6 'foo':5 'sky':1A,3 'wow':4 'wonderful':2A

3.3 txt2txtidx has been extended:

    txtidx txt2txtidx(ID_CFG, text)
    txtidx txt2txtidx(NAME_CFG, text)
    txtidx txt2txtidx(text)

4 mquery_txt has been removed. Use txt2query() instead:

    query_txt txt2query(ID_CFG, text);
    query_txt txt2query(NAME_CFG, text);
    query_txt txt2query(text);

The text is assumed to follow the syntax rules of query_txt. query_txt works
with lexeme weights: 'd:AC & ca:B' (that is, '\'d\':AC & \'ca\':B') searches
for d with weight A or C and for ca with weight B:

    wow=> select 'a b:89 ca:23A,64b d:34c'::txtidx @@ 'd:AC & ca';
    ----------
     t
    wow=> select 'a b:89 ca:23A,64b d:34c'::txtidx @@ 'd:AC & ca:B';
    ----------
     t
    wow=> select 'a b:89 ca:23A,64b d:34c'::txtidx @@ 'd:AC & ca:C';
    ----------
     f

5 The @@ operation is unchanged; ## has been removed.

6 Included dictionaries.

6.1 The 'simple' dictionary recognizes any word and only lowercases it. As
an option it can take the name of a file containing stop words, one per
line.

6.2 Snowball dictionaries, for example English ('en_stem') and Russian
('ru_stem'). They can take the name of a file containing stop words, one per
line.
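The debug interface from 2.1 can be used to check these dictionaries. For
example, one would expect the English stemmer to reduce 'skies' to 'sky',
consistent with the txt2txtidx example in 3.2 (the output below is
illustrative, not a verified run):

    select lemmatize('en_stem', 'skies');
     lemmatize
    -----------
     {sky}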
6.3 Template for the ispell interface. As options it requires the names of
the dictionary and affix files and, optionally, a stop-word file. Example
(output shown in expanded form for readability):

    select * from pg_ts_dict where dict_id=4;
    dict_id         | 4
    dict_name       | ispell_template
    dict_init       | 190309
    dict_initoption | DictFile="/usr/local/share/ispell/russian.dict",
                    | AffFile="/usr/local/share/ispell/russian.aff",
                    | StopFile="/usr/local/share/ispell/russian.stop"
    dict_lemmatize  | 190310
    dict_comment    | ISpell interface. Must have .dict and .aff files

6.4 The 'synonym' dictionary, for working with synonyms. As an option it
takes the name of a file containing a pair of words per line; the first word
is replaced by the second one.

7 The included parser was taken from OpenFTS v0.34.

8 GiST support is unchanged.

9 The tsearch trigger uses only the current configuration.

10 Ranking subsystem

10.1 rank, from OpenFTS:

    float4 rank(txtidx, query_txt);
    float4 rank(txtidx, query_txt, int4);
    float4 rank(float4[], txtidx, query_txt);
    float4 rank(float4[], txtidx, query_txt, int4);

Returns the rank of the txtidx field against query_txt. The int4 argument
selects the method of normalization by document length:

    0: none (default)
    1: divide the rank by log(length)
    2: divide the rank by the length

float4[] is an array of weights for each weight class :) The array must have
4 elements: the 0th element is for weight 'D' and the 3rd for 'A'. The
default weights are '{0.1, 0.2, 0.4, 1.0}'; each weight must satisfy
0 <= x <= 1.0. Example of usage:

    create table tbl (title text, body text, ti txtidx);
    update tbl set ti = chweight(txt2txtidx(title),'A') || txt2txtidx(body);
    select title, body, rank(ti, 'foo & bar') as w
    from tbl
    where ti @@ 'foo & bar'
    order by w desc;

10.2 Cover density ranking (Charles L. A. Clarke, Gordon V. Cormack and
Elizabeth A. Tudhope, "Relevance Ranking for One to Three Term Queries",
Information Processing and Management, 1999):

    float4 rank_cd(txtidx, query_txt);
    float4 rank_cd(txtidx, query_txt, int4);
    float4 rank_cd(int4, txtidx, query_txt);
    float4 rank_cd(int4, txtidx, query_txt, int4);

The first int4 is the K option in formula (2) of the paper.

10.2.1 Debug support for ranking:

    select get_covers(txtidx, query_txt);

11 Headline generation

    text headline(IDCFG, text, query_txt, OPTION)
    text headline(NAMECFG, text, query_txt, OPTION)
    text headline(text, query_txt, OPTION)
    text headline(IDCFG, text, query_txt)
    text headline(NAMECFG, text, query_txt)
    text headline(text, query_txt)

OPTION example: 'StartSel=<b>, StopSel=</b>, MaxWords=5, MinWords=4, ShortWord=3'

The headline algorithm is based on cover density ranking (see 10.2).

12 Statistics

    create type statinfo as (word text, ndoc int4, nentry int4);

The function stat('select query') returns a set of the statinfo type, where
ndoc is the number of documents containing the word and nentry is the total
number of entries. I wanted to write an aggregate function, but due to a
limitation of pgsql a function returning a setof cannot be aggregated.
Example:

    select * from stat('select a from test_txtidx') order by ndoc desc;

Warning: gathering statistics can be very slow. It is recommended to save
the results into a table if you want to compute further statistics.
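Putting the ranking (10.2) and headline (11) functions together, a typical
query against the tbl table from 10.1 might look like the following sketch
(the option string and the limit are illustrative):

    select title,
           -- highlight the matched words in the document body
           headline(body, txt2query('foo & bar'),
                    'StartSel=<b>, StopSel=</b>, MaxWords=20, MinWords=10') as snippet,
           -- cover density rank, used below to order the results
           rank_cd(ti, txt2query('foo & bar')) as w
    from tbl
    where ti @@ txt2query('foo & bar')
    order by w desc
    limit 10;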
13 Limits:

13.1 2048 bytes per lexeme.

13.2 txtidx is limited to about 1Mb; the exact value depends on the amount
of positional information. If there is no positional information at all, the
sum of the lengths of the lexemes must be less than 1Mb; otherwise, the sum
of the lengths of the lexemes and of the positional information. Positional
information uses 2 bytes per position plus 2 bytes per lexeme that carries
positional info. The number of lexemes is limited by 2^32, so in practice it
is unlimited.

13.3 query_txt: the number of entries (nodes, i.e. the sum of lexemes and
operations) is limited. The internal representation is in Polish notation
and the position of an operand is pointed to by an int2, so it is a rather
soft limit; in any case, the lower bound of the limit is 32768 nodes.
Notice: query_txt is not designed for storing in a table; it is optimized
for speed, not for size.

13.4 Positional information in txtidx:

13.4.1 The value of a position may not be greater than 2^14 (16384); any
value above this limit is replaced by 16383.

13.4.2 Only 256 position entries are kept per lexeme.

14 Exclusively for programmers :) txtidx: there is one unused byte per
lexeme in the positional information (because of alignment). query_txt:
there are one spare byte and one spare bit per node. I don't know what these
bytes could be used for... Any ideas on how to use them are welcome!
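As a quick sanity check of the position clamping described in 13.4.1 (the
output below is what one would expect, not a verified run):

    -- a position above 16383 should be clamped on input
    select 'a:20000'::txtidx;
      txtidx
    -----------
     'a':16383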