Someone asked us about how to find documents, which contain emails. One possible solution is to write function
CREATE OR REPLACE FUNCTION document_token_types(text)
RETURNS _text AS
$$
SELECT ARRAY (
SELECT
DISTINCT alias
FROM
ts_token_type('default') AS tt,
ts_parse('default', $1) AS tp
WHERE
tt.tokid = tp.tokid
);
$$ LANGUAGE SQL immutable;
arxiv=# select document_token_types(title) from papers limit 10;
document_token_types
---------------------------------------------------------------
{asciihword,asciiword,blank,hword_asciipart}
{asciiword,blank}
{asciiword,blank}
{asciiword,blank}
{asciiword,blank}
{asciiword,blank,float,host}
{asciiword,blank}
{asciihword,asciiword,blank,hword_asciipart,int,numword,uint}
{asciiword,blank}
{asciiword,blank}
(10 rows)
Now we can create functional GIN index on document (which can be any combination of text fields) to speedup search.
create index fts_types_idx on papers using gin( document_token_types (comment) );
Find all documents with urls
explain analyze select comment from papers where document_token_types(comment) && '{url}';
Bitmap Heap Scan on papers (cost=40.80..8135.97 rows=2098 width=67) (actual time=6.811..27.679 rows=15483 loops=1)
Recheck Cond: (document_token_types(comment) && '{url}'::text[])
-> Bitmap Index Scan on fts_types_idx (cost=0.00..40.28 rows=2098 width=0) (actual time=3.820..3.820 rows=15483 loops=1)
Index Cond: (document_token_types(comment) && '{url}'::text[])
Total runtime: 33.124 ms
The list of available token types, supported by parser is available as
arxiv=# select * from ts_token_type('default');