About

Dovecot can be used together with Solr to perform full-text searches that cover message headers and body. Note that the resulting search behaviour is not strictly compliant with the IMAP protocol, although having fast full-text search is usually worth the trade-off. This tutorial uses Debian to couple Dovecot with Solr and assumes that Dovecot is already installed and configured.

Installing Required Packages

Install solr-jetty and dovecot-solr:

aptitude install solr-jetty dovecot-solr

Configure Solr

Under Debian you will have to edit /etc/default/jetty8 and set NO_START to 0 in order to allow Jetty to start:

# change to 0 to allow Jetty to start
NO_START=0

Also change the port to 8983, since the Debian default of 8080 might conflict with other services:

JETTY_PORT=8983

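It is also wise to restrict access to localhost by setting the listening address. Note that JETTY_HOST=$(uname -n) binds to whatever address the machine's hostname resolves to, which is not necessarily the loopback address; setting 127.0.0.1 explicitly matches the Solr URL used in the Dovecot configuration further down:

JETTY_HOST=127.0.0.1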
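
The settings above can also be applied non-interactively. The following is a minimal sketch, assuming GNU sed (as shipped with Debian) and simple values without special characters; set_default is a hypothetical helper name and not part of Jetty or Debian:

```shell
#!/bin/sh
# set_default FILE KEY VALUE  (hypothetical helper)
# Sets KEY=VALUE in a shell-style defaults file. Every existing assignment
# of KEY, commented out or not, is rewritten in place; if there is none,
# a new assignment is appended.
# Assumes GNU sed and values that contain neither "|" nor "&".
set_default() {
    file=$1
    key=$2
    value=$3
    if grep -Eq "^#?[[:space:]]*${key}=" "$file"; then
        sed -i -E "s|^#?[[:space:]]*${key}=.*|${key}=${value}|" "$file"
    else
        printf '%s=%s\n' "$key" "$value" >> "$file"
    fi
}

# Example (run as root, with the path of your Jetty defaults file):
#   set_default /path/to/jetty-defaults NO_START 0
#   set_default /path/to/jetty-defaults JETTY_PORT 8983
```

Ordinary comment lines, such as the hint above NO_START, are left untouched because they do not contain an assignment of the key.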

On Debian, dovecot-solr places its Solr schema at /usr/share/dovecot/solr-schema.xml, which has to be copied into the Solr configuration folder.

First, rename /etc/solr/conf/schema.xml to /etc/solr/conf/schema.xml.dist:

mv /etc/solr/conf/schema.xml /etc/solr/conf/schema.xml.dist

and then copy /usr/share/dovecot/solr-schema.xml over to /etc/solr/conf/schema.xml:

cp /usr/share/dovecot/solr-schema.xml /etc/solr/conf/schema.xml
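
The two steps above can be combined so that they are safe to repeat. This is only a sketch; install_schema is a hypothetical helper, and the paths in the example are the ones used in this tutorial:

```shell
#!/bin/sh
# install_schema SRC DEST  (hypothetical helper)
# Copies SRC over DEST, first moving an existing DEST to DEST.dist.
# The backup is made only once, so running the function again does not
# overwrite the stock schema backup with the Dovecot schema.
install_schema() {
    src=$1
    dest=$2
    if [ ! -f "$src" ]; then
        echo "schema not found: $src" >&2
        return 1
    fi
    if [ -f "$dest" ] && [ ! -e "$dest.dist" ]; then
        mv "$dest" "$dest.dist"
    fi
    cp "$src" "$dest"
}

# Example:
#   install_schema /usr/share/dovecot/solr-schema.xml /etc/solr/conf/schema.xml
```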

Edit /etc/solr/conf/solrconfig.xml and locate the tag and value:

<str name="df">text</str>

and change the value from text to hdr, so that the line reads <str name="df">hdr</str>. This avoids errors such as "undefined field text".
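
The same change can be scripted. A minimal sketch, assuming GNU sed and that the df entry is written exactly as shown above; set_default_field is a hypothetical helper:

```shell
#!/bin/sh
# set_default_field FILE  (hypothetical helper)
# Rewrites every <str name="df">text</str> entry in FILE to use "hdr".
# Assumes GNU sed (for -i) and that the tag is written exactly like this.
set_default_field() {
    sed -i 's|<str name="df">text</str>|<str name="df">hdr</str>|g' "$1"
}

# Example:
#   set_default_field /etc/solr/conf/solrconfig.xml
#   grep -n 'name="df"' /etc/solr/conf/solrconfig.xml
```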

Configuring Dovecot

Open /etc/dovecot/conf.d/10-mail.conf and uncomment and edit the line starting with mail_plugins:

mail_plugins = fts fts_solr

such that it contains fts and fts_solr.

Next, edit /etc/dovecot/conf.d/90-plugins.conf and locate the plugin section and alter it like so:

plugin {
    #setting_name = value
    fts = solr
    fts_solr = url=http://127.0.0.1:8983/solr/ break-imap-search
    fts_autoindex = yes
}
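
Before restarting, it can help to confirm that both plugins ended up in the effective configuration. The sketch below assumes the usual doveconf -n output format (one "name = value" setting per line); has_fts_plugins is a hypothetical helper that reads that output on standard input:

```shell
#!/bin/sh
# has_fts_plugins  (hypothetical helper)
# Reads Dovecot settings ("name = value" lines) on stdin and succeeds only if
# some mail_plugins line lists both fts and fts_solr as separate words.
has_fts_plugins() {
    awk '
        $1 == "mail_plugins" && $2 == "=" {
            fts = 0; solr = 0
            for (i = 3; i <= NF; i++) {
                if ($i == "fts") fts = 1
                if ($i == "fts_solr") solr = 1
            }
            if (fts && solr) found = 1
        }
        END { exit(found ? 0 : 1) }
    '
}

# Example:
#   doveconf -n | has_fts_plugins && echo "fts and fts_solr are enabled"
```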

Restarting Services

Restart jetty:

service jetty restart

and then dovecot:

service dovecot restart

Checking

Connect to dovecot manually via IMAP using openssl:

openssl s_client -connect localhost:993

assuming that you issue the command on the same machine.

You should see the SSL layer and then you will be dropped into IMAP:

* OK [CAPABILITY IMAP4rev1 LITERAL+ SASL-IR LOGIN-REFERRALS ID ENABLE IDLE AUTH=PLAIN] Dovecot ready.

Now log in using a username and password:

a login boo peggy

where boo is the username and peggy is the password.

If successful, the server should answer with a line starting with a OK that lists its capabilities.

Next select the inbox:

a select Inbox

and then issue a search:

a search text whoopie!

and since we are running for the first time, it should return something along the lines of:

* OK Indexed 29% of the mailbox, ETA 0:25
* OK Indexed 37% of the mailbox, ETA 0:34
* OK Indexed 89% of the mailbox, ETA 0:03
* OK Indexed 97% of the mailbox, ETA 0:01
* OK Indexed 97% of the mailbox, ETA 0:01
* OK Indexed 97% of the mailbox, ETA 0:01
* OK Indexed 97% of the mailbox, ETA 0:02
* OK Indexed 97% of the mailbox, ETA 0:02
* OK Indexed 97% of the mailbox, ETA 0:02
* OK Indexed 97% of the mailbox, ETA 0:03
* OK Indexed 97% of the mailbox, ETA 0:03
* OK Mailbox indexing finished
* SEARCH 1 2 3 4 7 8 9 10 11 12 13 14 17 18 22 23 26 27 33 34 56 57 58 59 60 77 83 85 88 118 122 123 126 128 156 178 179 183 186 191 199 200 212 213 225 245 247 254 281 282 286 287 289 305 309 311 312 315 317 320 329 333 357 358 361 364 374 375 379 387 390 401 405 408 414 415 416 428 433 439 442 444 445 465 494 495 496 500 504 533 537 538 542 543 549 557 561 566 567 568 569 575 580 603 620 621 636 641 642 643 668 677 679 684 685 686 687 691 701 705 710 715 718 721 722 734 736 743 747 748 753 754 773 774 775 776 777 778 779 780 784 785 786 787 792 795 813 820 821 829 830 831 843 844 855 861 862 863 864 865 870 871 886 889 890 891 892 893 894 895 899 906 914 915 924 926 930 933 937 938 939 940 941 946 947 979 987 988 990 991 1004 1005 1006 1007 1012 1013 1014 1018 1027 1028 1030 1037 1047 1054 1055 1059 1060 1061 1063 1064 1068 1069 1070 1071 1072 1074 1080 1081 1082 1083 1085 1087 1092 1093 1104 1111 1112 1113 1117 1118 1120 1140 1142 1143 1146 1152 1153 1154 1165 1176 1184 1185 1189 1201 1202 1210 1211 1212 1217 1228 1237 1238 1240 1243 1244 1245 1246 1247 1248 1249 1250 1252 1258 1264 1265 1268 1269 1279 1280 1284 1291 1298 1299 1300 1302 1306 1314 1315 1317 1322 1331 1338
a OK Search completed (117.978 secs).

If you repeat the search:

a search text whoopie!

You should see that it runs much faster:

* SEARCH 1 2 3 4 7 8 9 10 11 12 13 14 17 18 22 23 26 27 33 34 56 57 58 59 60 77 83 85 88 118 122 123 126 128 156 178 179 183 186 191 199 200 212 213 225 245 247 254 281 282 286 287 289 305 309 311 312 315 317 320 329 333 357 358 361 364 374 375 379 387 390 401 405 408 414 415 416 428 433 439 442 444 445 465 494 495 496 500 504 533 537 538 542 543 549 557 561 566 567 568 569 575 580 603 620 621 636 641 642 643 668 677 679 684 685 686 687 691 701 705 710 715 718 721 722 734 736 743 747 748 753 754 773 774 775 776 777 778 779 780 784 785 786 787 792 795 813 820 821 829 830 831 843 844 855 861 862 863 864 865 870 871 886 889 890 891 892 893 894 895 899 906 914 915 924 926 930 933 937 938 939 940 941 946 947 979 987 988 990 991 1004 1005 1006 1007 1012 1013 1014 1018 1027 1028 1030 1037 1047 1054 1055 1059 1060 1061 1063 1064 1068 1069 1070 1071 1072 1074 1080 1081 1082 1083 1085 1087 1092 1093 1104 1111 1112 1113 1117 1118 1120 1140 1142 1143 1146 1152 1153 1154 1165 1176 1184 1185 1189 1201 1202 1210 1211 1212 1217 1228 1237 1238 1240 1243 1244 1245 1246 1247 1248 1249 1250 1252 1258 1264 1265 1268 1269 1279 1280 1284 1291 1298 1299 1300 1302 1306 1314 1315 1317 1322 1331 1338
a OK Search completed (0.001 secs).
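
When comparing runs, the interesting parts of the transcript are the number of hits and the time reported by the server. As a small sketch, assuming the session output has been saved to a file or is piped in, the hypothetical helper search_summary condenses it:

```shell
#!/bin/sh
# search_summary  (hypothetical helper)
# Reads an IMAP session transcript on stdin and prints, for each completed
# search, the number of message numbers returned and the reported time.
search_summary() {
    # IMAP lines end in CRLF; drop the carriage returns before parsing.
    tr -d '\r' | awk '
        # Untagged response "* SEARCH n1 n2 ...": fields 3..NF are the hits.
        $1 == "*" && $2 == "SEARCH" { hits = NF - 2 }
        # Tagged completion line, e.g. "a OK Search completed (0.001 secs)."
        $2 == "OK" && $3 == "Search" && $4 == "completed" {
            time = $5
            gsub(/[()]/, "", time)
            printf "%d hits in %s secs\n", hits, time
            hits = 0
        }
    '
}

# Example, with a transcript saved to a file:
#   search_summary < session.log
```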

You are now done and can log out:

a logout

Reindexing

Now that full-text search is enabled, it is a good idea to make Dovecot reindex all messages for all users and all mailboxes. This is particularly useful if you have a large number of users and would like to jump-start the index. Whilst new messages are indexed automatically via fts_autoindex = yes in /etc/dovecot/conf.d/90-plugins.conf, old messages may not have been indexed.

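To trigger a full re-index of all mailboxes, issue:

doveadm -D index -A '*'

where -D enables debug output, -A applies the command to all users and '*' is a mailbox mask that matches every mailbox.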
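
On servers with many users it may be preferable to index one user at a time, so that a failure for one account does not stop the rest. The following is a sketch, assuming doveadm's -u option and a list of user names on standard input (for example from doveadm user '*'); reindex_users and the DOVEADM override are hypothetical conveniences:

```shell
#!/bin/sh
# reindex_users  (hypothetical helper)
# Reads one user name per line on stdin and indexes all mailboxes of each
# user in turn. DOVEADM may be overridden, e.g. with "echo" for a dry run.
# Returns non-zero if indexing failed for at least one user.
reindex_users() {
    failed=0
    while IFS= read -r user; do
        [ -n "$user" ] || continue
        # </dev/null keeps the command from consuming the user list.
        if ! ${DOVEADM:-doveadm} index -u "$user" '*' < /dev/null; then
            echo "indexing failed for $user" >&2
            failed=1
        fi
    done
    return "$failed"
}

# Example:
#   doveadm user '*' | reindex_users
```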

Optimize and Commit

Dovecot only performs soft commits in order to improve performance, so it is recommended to send an explicit commit command to Solr once in a while. The optimize command can be sent in the same way.

A good idea is to use a script along the following lines:

solr-maintenance
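#!/bin/sh
# Periodic Solr maintenance: optimize the index, then issue a hard commit.

SERVER=127.0.0.1
PORT=8983

# The URLs are quoted so that the shell does not treat "?" as a glob character.
curl -s "http://$SERVER:$PORT/solr/update?optimize=true" >/dev/null 2>&1
curl -s "http://$SERVER:$PORT/solr/update?commit=true" >/dev/null 2>&1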

and use cron to execute the script once every hour (or more often, for very busy servers). On Debian, the script can be placed in /etc/cron.hourly and must be made executable.
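
The script above discards all output, so a failing Solr instance goes unnoticed. A slightly more defensive sketch is given below; it assumes curl's -f option (treat HTTP errors as failures) and the same update URLs, and the function names and the CURL override are hypothetical:

```shell
#!/bin/sh
# Defensive variant of the maintenance script (function names are hypothetical).
SERVER=${SERVER:-127.0.0.1}
PORT=${PORT:-8983}

# Print the Solr update URL for an action such as "commit" or "optimize".
solr_update_url() {
    printf 'http://%s:%s/solr/update?%s=true\n' "$SERVER" "$PORT" "$1"
}

# Run one action; report a failure on stderr so that cron can mail it.
solr_update() {
    if ! ${CURL:-curl} -fsS -o /dev/null "$(solr_update_url "$1")"; then
        echo "solr $1 failed" >&2
        return 1
    fi
}

# In the cron script itself one would then call:
#   solr_update optimize && solr_update commit
```

Because errors are written to stderr instead of being discarded, cron's usual mail-on-output behaviour can be used to notice a Solr outage.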

Using Tomcat Instead of Jetty

A better option may be to use Tomcat instead of Jetty if Jetty proves unstable. To install Tomcat and Solr, issue:

aptitude install solr-tomcat

Tomcat by default will run on port 8080 which is a busy port on many machines. To change the default port that Tomcat listens on, edit the file at /etc/tomcat8/server.xml and locate the Connector tag:

    <Connector port="8080" protocol="HTTP/1.1"
               connectionTimeout="20000"
               redirectPort="8443" />

and then change the 8080 value to, say, 8983 (the same port that was used for Jetty above).
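
The port change can also be scripted. The sketch below assumes GNU sed and that the Connector tag starts with the port attribute as shown above; set_connector_port is a hypothetical helper:

```shell
#!/bin/sh
# set_connector_port FILE OLD NEW  (hypothetical helper)
# Rewrites <Connector port="OLD" to <Connector port="NEW" in FILE and keeps
# a copy of the original as FILE.orig. Assumes GNU sed and numeric ports.
set_connector_port() {
    sed -i.orig "s|<Connector port=\"$2\"|<Connector port=\"$3\"|" "$1"
}

# Example:
#   set_connector_port /etc/tomcat8/server.xml 8080 8983
```

Connectors on other ports, such as the redirect port 8443, are not matched and stay as they are.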

If FTS misbehaves and the Tomcat logs (/var/log/tomcat8) show requests along the lines of "GET null null" 400, or /var/log/mail.log contains messages such as Error: fts_solr: Lookup failed: 400 Bad Request, the problem can be traced to clients that issue very long search requests to Dovecot: by default, Tomcat does not accept request headers that long. In such cases, one can increase the maximum header size in /etc/tomcat8/server.xml:

    <Connector port="8983" protocol="HTTP/1.1"
               connectionTimeout="20000"
               redirectPort="8443" 
               maxHttpHeaderSize="65536" />

by setting maxHttpHeaderSize.

Tomcat can then be restarted via:

/etc/init.d/tomcat8 restart

The rest of the instructions apply - just that you will be running Tomcat instead of Jetty.

Until the tomcat init script is fixed, you may have to edit /etc/init.d/tomcat8 and comment out or delete the section:

                        if [ -f "$CATALINA_PID" ]; then
                               rm -f "$CATALINA_PID"
                        fi

otherwise tomcat fails to start and stop properly.

Indexing Attachment Content via Apache Tika

Dovecot's full text search plugin allows using Apache Tika for indexing email attachments (PDFs, Word Documents, etc…). This requires running Apache Tika in the background and pointing the Dovecot configuration to use Apache Tika as the backend.

Unfortunately, Debian does not package the Apache Tika server by default but you can grab the Wizardry and Steamworks Tika package and install it either by adding the Wizardry and Steamworks Tika repository to /etc/apt/sources.list or by manually installing the tika-server.deb package via:

dpkg -i tika-server-1.16.deb

once you have downloaded it to a local directory.

Note that Java (for example OpenJDK) has to be installed in order to run the Tika server.

After installing the Wizardry and Steamworks Tika package, Apache Tika can be stopped with:

/etc/init.d/tika stop

and started with:

/etc/init.d/tika start

By issuing:

ps ax | grep tika

the Apache Tika process should show up:

12199 ?        Sl    14:57 /usr/bin/java -jar /usr/share/java/tika-server.jar -host localhost -port 9998

To check if Tika is answering properly, issue:

telnet localhost 9998

and a connection should be established.
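
A slightly stronger check than opening a TCP connection is to send a document and look at the extracted text. The sketch below assumes that the Tika server accepts a PUT request on /tika and returns the extracted text, as described in the Tika server documentation; tika_extract and the CURL override are hypothetical:

```shell
#!/bin/sh
# tika_extract FILE [URL]  (hypothetical helper)
# Uploads FILE to the Tika server with an HTTP PUT and prints the plain
# text that the server extracts. URL defaults to the local server used in
# this tutorial. CURL may be overridden, e.g. for a dry run.
tika_extract() {
    ${CURL:-curl} -sS -T "$1" -H 'Accept: text/plain' \
        "${2:-http://localhost:9998/tika}"
}

# Example:
#   tika_extract /path/to/sample.pdf | head
```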

The next step is to edit /etc/dovecot/conf.d/90-plugins.conf, locate the FTS directives mentioned in this document, and append:

fts_tika = http://localhost:9998/tika/

in the plugin section.

If fts_autoindex is set to yes in /etc/dovecot/conf.d/90-plugins.conf, new emails will have their attachments indexed automatically. To scan older attachments as well, you may want to trigger a full re-indexing of all mailboxes for all users:

doveadm -D index -A '*'

During the scanning phase, you will see a bunch of:

[Req2424 PUT http://localhost:9998/tika/]: Waiting for request to finish

but eventually, you should get:

[Req2425: PUT http://localhost:9998/tika/]: Finished sending payload
[Req2425: PUT http://localhost:9998/tika/]: Waiting for request to finish

indicating that Apache Tika is working.

Finally, to check, just send an email with a PDF and start a full body text search with a sentence or phrase in the PDF.