PostgreSQL: UTF8 and KOI8

Working with UTF8 and KOI8

I got confused when I tried PostgreSQL (7.3.3) with Cyrillic text and UNICODE. After reading the documentation (charset.sgml) I realized what I had done wrong :) I thought that I could work with several databases in different encodings. Well, I can use createdb -E encoding, but the only thing that matters for text operations (sorting, character classification) is the locale, and the encoding it implies, fixed at the 'initdb' stage!

The key phrase:

The nature of some locale categories is that their value has to be fixed for the lifetime of a database cluster. That is, once initdb has run, you cannot change them anymore. LC_COLLATE and LC_CTYPE are those categories. They affect the sort order of indexes, so they must be kept fixed, or indexes on text columns will become corrupt. PostgreSQL enforces this by recording the values of LC_COLLATE and LC_CTYPE that are seen by initdb. The server automatically adopts those two values when it is started.
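
To see why sort order has to come from the locale rather than from raw byte values, here is a minimal sketch in Python (Python is an assumption of this example; it is not part of the test bed below). It sorts four Cyrillic letters by their encoded bytes in KOI8-R and in UTF-8. The helper name `sort_by_bytes` and the sample letters are made up for illustration.

```python
# Illustration only: byte-wise ordering depends on the encoding,
# which is why collation is tied to the locale chosen at initdb.

def sort_by_bytes(letters, encoding):
    """Sort strings by their encoded byte sequences (no locale involved)."""
    return sorted(letters, key=lambda s: s.encode(encoding))

letters = ["в", "г", "а", "б"]

# KOI8-R lays Cyrillic letters out to mirror Latin transliteration,
# so byte order is not alphabetical: "г" (0xC7) sorts before "в" (0xD7).
print(sort_by_bytes(letters, "koi8_r"))   # ['а', 'б', 'г', 'в']

# For these four letters, UTF-8 byte order happens to match the alphabet.
print(sort_by_bytes(letters, "utf-8"))    # ['а', 'б', 'в', 'г']
```

Neither byte order is a substitute for real collation rules; the point is only that the same letters compare differently depending on how they are stored.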

Test bed:

Linux, Slackware 8.1, libc 2.2.5, postgresql 7.3.3, perl 5.6.1

Steps to success:

Create the ru_RU.utf8 locale

    localedef -i ru_RU -f UTF-8 ru_RU.UTF8

Testing locale

    export LC_CTYPE=ru_RU.utf8
    export LC_COLLATE=ru_RU.utf8
    perl testlocale-utf8.pl | iconv -f utf8 -t koi8-r

The output should be the Cyrillic letters in sorted order.
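
The script testlocale-utf8.pl is not reproduced here. As a rough stand-in, the following Python sketch shows what such a check might look like: it switches LC_COLLATE to a named locale and sorts with the locale's collation rules. The function name `sort_in_locale` and the sample letters are hypothetical, and the sketch assumes the locale name matches what `localedef` installed on your system.

```python
import locale

def sort_in_locale(items, locale_name):
    """Return items sorted by the collation rules of locale_name,
    or None if that locale is not installed on this system."""
    previous = locale.setlocale(locale.LC_COLLATE)  # remember current setting
    try:
        locale.setlocale(locale.LC_COLLATE, locale_name)
    except locale.Error:
        return None
    try:
        return sorted(items, key=locale.strxfrm)
    finally:
        locale.setlocale(locale.LC_COLLATE, previous)  # restore it afterwards

if __name__ == "__main__":
    # Cyrillic letters in reverse order, to make the effect of sorting visible.
    letters = list("яюэьыъщшчцхфутсрпонмлкйизжедгвба")
    result = sort_in_locale(letters, "ru_RU.utf8")
    if result is None:
        print("locale ru_RU.utf8 is not installed")
    else:
        print("".join(result))
```

If the locale was created correctly, the printed letters should come out in alphabetical order; if the function returns None, the `localedef` step did not take effect.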

Prepare the PostgreSQL data directory for Unicode (as user postgres)

    export LC_CTYPE=ru_RU.utf8
    export LC_COLLATE=ru_RU.utf8
    initdb -E UTF8 --pgdata=/db1/pgdata.utf8
    pg_ctl -D /db1/pgdata.utf8 start

Testing

createdb -E UTF8 utf8
psql -l
        List of databases
   Name    |  Owner   | Encoding 
-----------+----------+----------
 template0 | postgres | UNICODE
 template1 | postgres | UNICODE
 utf8      | megera   | UNICODE
(3 rows)

psql utf8
utf8=# create table tt (a text);
CREATE TABLE
utf8=# \copy tt from './cyrillic.utf8'
\.
utf8=# \o out.utf8
utf8=# select * from tt order by a asc;
utf8=# \q
zen:~/app/locale$ iconv -f utf8 -t koi8-r out.utf8
The output should be the Cyrillic letters in sorted order!
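
For readers without iconv at hand, the conversion step can be approximated in Python. This is a sketch, not a replacement for iconv: the function name `utf8_to_koi8r` is made up, and it assumes every character in the input exists in KOI8-R (otherwise Python raises UnicodeEncodeError).

```python
def utf8_to_koi8r(data: bytes) -> bytes:
    """Re-encode UTF-8 bytes as KOI8-R, roughly what
    `iconv -f utf8 -t koi8-r` does for text that fits in KOI8-R."""
    return data.decode("utf-8").encode("koi8_r")

if __name__ == "__main__":
    sample = "абв\n".encode("utf-8")      # stand-in for the contents of out.utf8
    print(utf8_to_koi8r(sample))          # b'\xc1\xc2\xd7\n'
```

To convert the real file, read it in binary mode and pass its contents through the same function.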

Conclusion:

PostgreSQL works well with Cyrillic text and UTF8.

Bad news:

I discovered that the upper() and lower() functions don't work in my setup. Read http://fts.postgresql.org/db/msg.html?mid=1070198 for details.

Leave a message ? oleg@sai.msu.su