This is an alpha version of contrib/tsearch with ranking support. It also
includes the OpenFTS (0.34) parser, and ispell and snowball (stemming)
support. Comments and documentation are welcome! Without documentation we
will not be able to release the module! We need documentation with examples
for beginners and advanced users. Please send any comments to Oleg Bartunov
(oleg@sai.msu.su) and Teodor Sigaev (teodor@sigaev.ru).

Notes for testers: the tsearch module has been fully reworked. Dixi.

1 Configuration is now kept in four tables (in the future it should move to
the system catalog):

1.1
    create table pg_ts_dict (
        dict_id         int not null primary key,
        dict_name       text not null,
        dict_init       oid,
        dict_initoption text,
        dict_lemmatize  oid not null,
        dict_comment    text
    );

Table for storing dictionaries. The dict_init field stores the Oid of the
function that initializes the dictionary. Dict_init takes one option, the
text value from dict_initoption, and should return the internal
representation (structure) of the dictionary. The structure must be
malloc'ed, or palloc'ed in TopMemoryContext. Dict_init is called only once
per process.

The dict_lemmatize field stores the Oid of the function that lemmatizes a
lexeme. Input values: the dictionary structure, a pointer to the string, and
its length. Output: a pointer to an array of pointers to C strings; the last
pointer in the array must be NULL. A NULL return value means that the
dictionary cannot resolve the word, while an empty array means that the
dictionary knows the word but considers it a stop word.

1.2
    create table pg_ts_parser (
        prs_id       int not null primary key,
        prs_name     text not null,
        prs_start    oid not null,
        prs_getlexem oid not null,
        prs_end      oid not null,
        prs_headline oid not null,
        prs_lextype  oid not null,
        dict_comment text
    );

Table for storing parsers. prs_start stores the Oid of the function that
initializes the parser. Arguments: a pointer to the string and its length;
it returns the internal parser structure, which can be palloc'ed.
prs_getlexem stores the Oid of the function that returns the next lexeme.
Input: the parser structure, a pointer to a char pointer, and a pointer to
an int4. It returns the type of the lexeme; a type equal to 0 means that
parsing is finished. The lexeme itself is returned through the last two
pointers. prs_end stores the Oid of the function that finishes a parsing
session. Input: the parser structure. prs_headline generates a headline; it
works on parsed text stored in an HLPRSTEXT structure (ts_cfg.h). Arguments:
a pointer to HLPRSTEXT, a pointer to the query, and a pointer to the options
(as pgsql's text type). prs_lextype returns an array of LexDescr (see
wparser.h) describing the lexeme types that the parser can return.

1.3
    create table pg_ts_cfg (
        id       int not null primary key,
        ts_name  text not null,
        prs_name text not null,
        locale   text
    );

Tsearch configuration. A locale can be specified to define which
configuration is used for the current locale.

1.4
    create table pg_ts_cfgmap (
        ts_name   text not null,
        lex_alias text not null,
        dict_name text[] not null,
        primary key (ts_name, lex_alias)
    );

Table that stores, for each lexeme type, which dictionaries to use.
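For illustration only, a mapping row might look like this (the configuration
name 'default_russian' and the lexeme-type alias 'lword' are hypothetical;
real aliases come from lexem_type(), see 2.2). Here Latin words would be
tried against the en_stem dictionary first, falling back to simple:

    -- illustrative mapping: process 'lword' lexemes with en_stem, then simple
    insert into pg_ts_cfgmap (ts_name, lex_alias, dict_name)
        values ('default_russian', 'lword', '{en_stem,simple}');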
2 SQL-level interfaces

2.1 Debug interface for dictionaries:

    text[] lemmatize(DICT_ID, text);
    text[] lemmatize(DICT_NAME, text);
    text[] lemmatize(text);

The last lemmatize function uses the dictionary selected by

    select set_curdict(DICT_ID);
    select set_curdict(DICT_NAME);

2.2 Debug interface for parsers:

    create type lexemtype as (lexid int4, alias text, descr text); -- type of table
    setof lexemtype lexem_type(ID_PARSER)
    setof lexemtype lexem_type(NAME_PARSER)
    setof lexemtype lexem_type()

Example: select * from lexem_type();

    create type lexemout as (lexid int4, lexem text); -- parse result
    setof lexemout parse_txt(ID_PARSER, text);
    setof lexemout parse_txt(NAME_PARSER, text);
    setof lexemout parse_txt(text);

The last lexem_type and parse_txt use the parser selected by

    select set_curprs(ID_PARSER);
    select set_curprs(NAME_PARSER);

2.3 Setting the current tsearch configuration:

    select set_curcfg(ID_CFG);
    select set_curcfg(NAME_CFG);
    int4 show_curcfg();

By default, the current configuration is determined by the current locale.
To reset all tsearch caches for dictionaries, parsers and configurations
(useful for debugging):

    select reset_tsearch();

3 Txtidx type

3.1 The txtidx type can store the position of each entry of each word:

    select txt2txtidx('Q W E Q W');
          txt2txtidx
    -----------------------
     'e':3 'q':1,4 'w':2,5

Each entry of a word may carry a 'weight entry', from 'A' (highest) to 'D'
(lowest, the default):

    select 'w:4A,3B,2C,1D,5 a:8A'::txtidx;
              txtidx
    -------------------------
     'a':8A 'w':1,2C,3B,4A,5

3.2 Functions and operations:

    int4 txtidxsize(txtidx)       - number of lexemes
    txtidx strip(txtidx)          - removes all position information
    txtidx chweight(txtidx, char) - sets the weight of each entry to the
                                    second argument ('A'-'D')
    txtidx || txtidx              - returns the union of two txtidx values

    select 'a:3A b:2a'::txtidx || 'ba a:1B';
           ?column?
    -----------------------
     'a':3A,4B 'b':2A 'ba'

Notice: the second argument is treated as text that follows the first one,
so in this example a:1B in the second argument becomes a:4B in the result.
An example of usage:

    create table tbl (title text, body text, ti txtidx);
    update tbl set ti = chweight(txt2txtidx(title),'A') || txt2txtidx(body);

And one more:

    select chweight(txt2txtidx('sky wonderfull'),'A') || txt2txtidx('skies wow foo bar');
                        ?column?
    ---------------------------------------------------
     'bar':6 'foo':5 'sky':1A,3 'wow':4 'wonderful':2A

3.3 txt2txtidx has been extended:

    txtidx txt2txtidx(ID_CFG, text)
    txtidx txt2txtidx(NAME_CFG, text)
    txtidx txt2txtidx(text)

4 mquery_txt has been removed. Use txt2query() instead:

    query_txt txt2query(ID_CFG, text);
    query_txt txt2query(NAME_CFG, text);
    query_txt txt2query(text);

The text is assumed to follow the syntax rules of query_txt. query_txt works
with lexeme weights: 'd:AC & ca:B' (that is, '\'d\':AC & \'ca\':B') searches
for d with weight A or C and for ca with weight B:

    wow=> select 'a b:89 ca:23A,64b d:34c'::txtidx @@ 'd:AC & ca';
    ----------
     t
    wow=> select 'a b:89 ca:23A,64b d:34c'::txtidx @@ 'd:AC & ca:B';
    ----------
     t
    wow=> select 'a b:89 ca:23A,64b d:34c'::txtidx @@ 'd:AC & ca:C';
    ----------
     f

5 The @@ operation is unchanged; ## has been removed.

6 Included dictionaries.

6.1 The 'simple' dictionary recognizes any word and only lowercases it. As
an option it can take the name of a file containing stop words, one per
line.

6.2 Snowball dictionaries, for example English ('en_stem') and Russian
('ru_stem'). They can take the name of a file containing stop words, one per
line.
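The debug interface from 2.1 can be used to check these dictionaries. For
example, one would expect the English stemmer to reduce 'skies' to 'sky',
consistent with the txt2txtidx example in 3.2 (the output below is
illustrative, not a verified run):

    select lemmatize('en_stem', 'skies');
     lemmatize
    -----------
     {sky}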
6.3 Template for the ispell interface. As options it requires the names of
the dictionary and affix files and, optionally, a stop-word file. Example
(output shown in expanded form for readability):

    select * from pg_ts_dict where dict_id=4;
    dict_id         | 4
    dict_name       | ispell_template
    dict_init       | 190309
    dict_initoption | DictFile="/usr/local/share/ispell/russian.dict",
                    | AffFile="/usr/local/share/ispell/russian.aff",
                    | StopFile="/usr/local/share/ispell/russian.stop"
    dict_lemmatize  | 190310
    dict_comment    | ISpell interface. Must have .dict and .aff files

6.4 The 'synonym' dictionary, for working with synonyms. As an option it
takes the name of a file containing a pair of words per line; the first word
is replaced by the second one.

7 The included parser was taken from OpenFTS v0.34.

8 GiST support is unchanged.

9 The tsearch trigger uses only the current configuration.

10 Ranking subsystem

10.1 rank, from OpenFTS:

    float4 rank(txtidx, query_txt);
    float4 rank(txtidx, query_txt, int4);
    float4 rank(float4[], txtidx, query_txt);
    float4 rank(float4[], txtidx, query_txt, int4);

Returns the rank of the txtidx field against query_txt. The int4 argument
selects the method of normalization by document length:

    0: none (default)
    1: divide the rank by log(length)
    2: divide the rank by the length

float4[] is an array of weights for each weight class :) The array must have
4 elements: the 0th element is for weight 'D' and the 3rd for 'A'. The
default weights are '{0.1, 0.2, 0.4, 1.0}'; each weight must satisfy
0 <= x <= 1.0. Example of usage:

    create table tbl (title text, body text, ti txtidx);
    update tbl set ti = chweight(txt2txtidx(title),'A') || txt2txtidx(body);
    select title, body, rank(ti, 'foo & bar') as w
    from tbl
    where ti @@ 'foo & bar'
    order by w desc;

10.2 Cover density ranking (Charles L. A. Clarke, Gordon V. Cormack and
Elizabeth A. Tudhope, "Relevance Ranking for One to Three Term Queries",
Information Processing and Management, 1999):

    float4 rank_cd(txtidx, query_txt);
    float4 rank_cd(txtidx, query_txt, int4);
    float4 rank_cd(int4, txtidx, query_txt);
    float4 rank_cd(int4, txtidx, query_txt, int4);

The first int4 is the K option in formula (2) of the paper.

10.2.1 Debug support for ranking:

    select get_covers(txtidx, query_txt);

11 Headline generation

    text headline(IDCFG, text, query_txt, OPTION)
    text headline(NAMECFG, text, query_txt, OPTION)
    text headline(text, query_txt, OPTION)
    text headline(IDCFG, text, query_txt)
    text headline(NAMECFG, text, query_txt)
    text headline(text, query_txt)

OPTION example: 'StartSel=<b>, StopSel=</b>, MaxWords=5, MinWords=4, ShortWord=3'

The headline algorithm is based on cover density ranking (see 10.2).

12 Statistics

    create type statinfo as (word text, ndoc int4, nentry int4);

The function stat('select query') returns a set of the statinfo type, where
ndoc is the number of documents containing the word and nentry is the total
number of entries. I wanted to write an aggregate function, but due to a
limitation of pgsql a function returning a setof cannot be aggregated.
Example:

    select * from stat('select a from test_txtidx') order by ndoc desc;

Warning: gathering statistics can be very slow. It is recommended to save
the results into a table if you want to compute further statistics.
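Putting the ranking (10.2) and headline (11) functions together, a typical
query against the tbl table from 10.1 might look like the following sketch
(the option string and the limit are illustrative):

    select title,
           -- highlight the matched words in the document body
           headline(body, txt2query('foo & bar'),
                    'StartSel=<b>, StopSel=</b>, MaxWords=20, MinWords=10') as snippet,
           -- cover density rank, used below to order the results
           rank_cd(ti, txt2query('foo & bar')) as w
    from tbl
    where ti @@ txt2query('foo & bar')
    order by w desc
    limit 10;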
13 Limits:

13.1 2048 bytes per lexeme.

13.2 txtidx is limited to about 1Mb; the exact value depends on the amount
of positional information. If there is no positional information at all, the
sum of the lengths of the lexemes must be less than 1Mb; otherwise, the sum
of the lengths of the lexemes and of the positional information. Positional
information uses 2 bytes per position plus 2 bytes per lexeme that carries
positional info. The number of lexemes is limited by 2^32, so in practice it
is unlimited.

13.3 query_txt: the number of entries (nodes, i.e. the sum of lexemes and
operations) is limited. The internal representation is in Polish notation
and the position of an operand is pointed to by an int2, so it is a rather
soft limit; in any case, the lower bound of the limit is 32768 nodes.
Notice: query_txt is not designed for storing in a table; it is optimized
for speed, not for size.

13.4 Positional information in txtidx:

13.4.1 The value of a position may not be greater than 2^14 (16384); any
value above this limit is replaced by 16383.

13.4.2 Only 256 position entries are kept per lexeme.

14 Exclusively for programmers :) txtidx: there is one unused byte per
lexeme in the positional information (because of alignment). query_txt:
there are one spare byte and one spare bit per node. I don't know what these
bytes could be used for... Any ideas on how to use them are welcome!
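As a quick sanity check of the position clamping described in 13.4.1 (the
output below is what one would expect, not a verified run):

    -- a position above 16383 should be clamped on input
    select 'a:20000'::txtidx;
      txtidx
    -----------
     'a':16383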