Dovecot can be used together with Solr to perform full-text searches over message text and body. Note that the break-imap-search option used below makes the server deviate from strict IMAP search semantics, although fast full-text search is usually worth the trade-off. This tutorial couples Dovecot with Solr on Debian and assumes that Dovecot is already installed and configured.
Install solr-jetty and dovecot-solr:
aptitude install solr-jetty dovecot-solr
Under Debian you will have to edit /etc/default/jetty8 and set NO_START to 0 in order to allow jetty to start:
# change to 0 to allow Jetty to start
NO_START=0
Also, change the port to 8983:
JETTY_PORT=8983
since by default the port is 8080 on Debian, which might conflict with other services.
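It is also wise to restrict access to the local machine by enabling the option:
JETTY_HOST=$(uname -n)
This binds Jetty to the address that the machine's hostname resolves to; if that is not a loopback address, set JETTY_HOST=127.0.0.1 instead.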
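The settings above can also be applied without opening an editor. The following is a minimal sketch, assuming the defaults file consists of plain KEY=value lines that may be commented out; the helper name set_default is hypothetical and not part of any Debian tool.

```shell
#!/bin/sh
# Hypothetical helper: set KEY=VALUE in a shell-style defaults file.
# An existing (possibly commented-out) KEY= line is replaced in place;
# if no such line exists, the setting is appended at the end.
# Limitation: VALUE must not contain "|" or "&" (special to sed here).
set_default() {
    file=$1 key=$2 value=$3
    if grep -Eq "^[[:space:]]*#?[[:space:]]*${key}=" "$file"; then
        sed -i -E "s|^[[:space:]]*#?[[:space:]]*${key}=.*|${key}=${value}|" "$file"
    else
        printf '%s=%s\n' "$key" "$value" >> "$file"
    fi
}

# Example usage (run as root, pointing JETTY_DEFAULTS at the Jetty
# defaults file edited above):
# set_default "$JETTY_DEFAULTS" NO_START 0
# set_default "$JETTY_DEFAULTS" JETTY_PORT 8983
```

Replacing the line in place rather than appending a second assignment keeps the file readable and avoids depending on which assignment the init script sees last.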
On Debian, dovecot-solr will place the Solr schema at /usr/share/dovecot/solr-schema.xml, which should be copied to the Solr schema folder.
First, rename /etc/solr/conf/schema.xml to /etc/solr/conf/schema.xml.dist:
mv /etc/solr/conf/schema.xml /etc/solr/conf/schema.xml.dist
and then copy /usr/share/dovecot/solr-schema.xml over to /etc/solr/conf/schema.xml:
cp /usr/share/dovecot/solr-schema.xml /etc/solr/conf/schema.xml
Edit /etc/solr/conf/solrconfig.xml and locate the tag and value:
<str name="df">text</str>
and change the value of df from text to hdr to avoid errors such as undefined field text.
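The same change can be scripted. This is a minimal sketch, assuming the df entry appears in solrconfig.xml exactly as shown above (same spacing and quoting); the helper name set_default_field is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: switch the default search field from "text" to "hdr"
# in a solrconfig.xml. The original file is kept with a .bak suffix.
set_default_field() {
    conf=$1
    sed -i.bak 's|<str name="df">text</str>|<str name="df">hdr</str>|g' "$conf"
}

# Example usage (run as root):
# set_default_field /etc/solr/conf/solrconfig.xml
# grep -c '<str name="df">hdr</str>' /etc/solr/conf/solrconfig.xml
```

The final grep only counts how many df entries now point at hdr, which is a quick way to confirm that the substitution matched anything at all.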
Open /etc/dovecot/conf.d/10-mail.conf, then uncomment and edit the line starting with mail_plugins such that it contains fts and fts_solr:
mail_plugins = fts fts_solr
Next, edit /etc/dovecot/conf.d/90-plugin.conf, locate the plugin section and alter it like so:
plugin {
  #setting_name = value
  fts = solr
  fts_solr = url=http://127.0.0.1:8983/solr/ break-imap-search
  fts_autoindex = yes
}
Restart jetty:
service jetty restart
and then dovecot:
service dovecot restart
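Before testing over IMAP, it can help to confirm that Solr is actually answering on the configured URL. The following is a rough sketch; the function name solr_is_up is hypothetical, and the URL is simply the one from the fts_solr setting.

```shell
#!/bin/sh
# URL taken from the fts_solr setting; adjust if yours differs.
SOLR_URL=${SOLR_URL:-http://127.0.0.1:8983/solr/}

# Hypothetical helper: succeeds if the URL answers with an HTTP 2xx or 3xx
# status, fails on any other status or if curl cannot connect at all.
solr_is_up() {
    code=$(curl -s -o /dev/null -w '%{http_code}' "$1") || return 1
    case $code in
        2??|3??) return 0 ;;
        *)       return 1 ;;
    esac
}

# Example usage:
# solr_is_up "$SOLR_URL" && echo "Solr reachable" || echo "Solr not reachable"
# doveconf -n | grep -E 'fts|mail_plugins'
```

The doveconf -n line prints Dovecot's non-default settings, so the fts and mail_plugins values entered earlier should show up there if the configuration was picked up.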
Connect to dovecot manually via IMAP using openssl:
openssl s_client -connect localhost:993
assuming that you issue the command on the same machine.
You should see the SSL handshake output, after which you will be dropped into the IMAP session:
* OK [CAPABILITY IMAP4rev1 LITERAL+ SASL-IR LOGIN-REFERRALS ID ENABLE IDLE AUTH=PLAIN] Dovecot ready.
Now log in using a username and password:
a login boo peggy
where:
boo is the username, and
peggy is the password.
If successful, the server should answer with a line starting with a OK that lists its capabilities.
Next select the inbox:
a select Inbox
and then issue a search:
a search text whoopie!
and since we are running for the first time, it should return something along the lines of:
* OK Indexed 29% of the mailbox, ETA 0:25
* OK Indexed 37% of the mailbox, ETA 0:34
* OK Indexed 89% of the mailbox, ETA 0:03
* OK Indexed 97% of the mailbox, ETA 0:01
* OK Indexed 97% of the mailbox, ETA 0:01
* OK Indexed 97% of the mailbox, ETA 0:01
* OK Indexed 97% of the mailbox, ETA 0:02
* OK Indexed 97% of the mailbox, ETA 0:02
* OK Indexed 97% of the mailbox, ETA 0:02
* OK Indexed 97% of the mailbox, ETA 0:03
* OK Indexed 97% of the mailbox, ETA 0:03
* OK Mailbox indexing finished
* SEARCH 1 2 3 4 7 8 9 10 11 12 13 14 17 18 22 23 26 27 33 34 56 57 58 59 60 77 83 85 88 118 122 123 126 128 156 178 179 183 186 191 199 200 212 213 225 245 247 254 281 282 286 287 289 305 309 311 312 315 317 320 329 333 357 358 361 364 374 375 379 387 390 401 405 408 414 415 416 428 433 439 442 444 445 465 494 495 496 500 504 533 537 538 542 543 549 557 561 566 567 568 569 575 580 603 620 621 636 641 642 643 668 677 679 684 685 686 687 691 701 705 710 715 718 721 722 734 736 743 747 748 753 754 773 774 775 776 777 778 779 780 784 785 786 787 792 795 813 820 821 829 830 831 843 844 855 861 862 863 864 865 870 871 886 889 890 891 892 893 894 895 899 906 914 915 924 926 930 933 937 938 939 940 941 946 947 979 987 988 990 991 1004 1005 1006 1007 1012 1013 1014 1018 1027 1028 1030 1037 1047 1054 1055 1059 1060 1061 1063 1064 1068 1069 1070 1071 1072 1074 1080 1081 1082 1083 1085 1087 1092 1093 1104 1111 1112 1113 1117 1118 1120 1140 1142 1143 1146 1152 1153 1154 1165 1176 1184 1185 1189 1201 1202 1210 1211 1212 1217 1228 1237 1238 1240 1243 1244 1245 1246 1247 1248 1249 1250 1252 1258 1264 1265 1268 1269 1279 1280 1284 1291 1298 1299 1300 1302 1306 1314 1315 1317 1322 1331 1338
a OK Search completed (117.978 secs).
If you repeat the search:
a search text whoopie!
You should see that it runs much faster:
* SEARCH 1 2 3 4 7 8 9 10 11 12 13 14 17 18 22 23 26 27 33 34 56 57 58 59 60 77 83 85 88 118 122 123 126 128 156 178 179 183 186 191 199 200 212 213 225 245 247 254 281 282 286 287 289 305 309 311 312 315 317 320 329 333 357 358 361 364 374 375 379 387 390 401 405 408 414 415 416 428 433 439 442 444 445 465 494 495 496 500 504 533 537 538 542 543 549 557 561 566 567 568 569 575 580 603 620 621 636 641 642 643 668 677 679 684 685 686 687 691 701 705 710 715 718 721 722 734 736 743 747 748 753 754 773 774 775 776 777 778 779 780 784 785 786 787 792 795 813 820 821 829 830 831 843 844 855 861 862 863 864 865 870 871 886 889 890 891 892 893 894 895 899 906 914 915 924 926 930 933 937 938 939 940 941 946 947 979 987 988 990 991 1004 1005 1006 1007 1012 1013 1014 1018 1027 1028 1030 1037 1047 1054 1055 1059 1060 1061 1063 1064 1068 1069 1070 1071 1072 1074 1080 1081 1082 1083 1085 1087 1092 1093 1104 1111 1112 1113 1117 1118 1120 1140 1142 1143 1146 1152 1153 1154 1165 1176 1184 1185 1189 1201 1202 1210 1211 1212 1217 1228 1237 1238 1240 1243 1244 1245 1246 1247 1248 1249 1250 1252 1258 1264 1265 1268 1269 1279 1280 1284 1291 1298 1299 1300 1302 1306 1314 1315 1317 1322 1331 1338
a OK Search completed (0.001 secs).
You are now done and you can log-out:
a logout
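The manual session above can be replayed non-interactively. What follows is only a sketch under a few assumptions: the server accepts the commands sent back to back, and the username, password and search term contain no spaces or quotes. The helper name imap_search_script is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print the IMAP commands from the manual session,
# each terminated by CRLF as the protocol expects.
imap_search_script() {
    user=$1 pass=$2 term=$3
    printf 'a login %s %s\r\n' "$user" "$pass"
    printf 'a select Inbox\r\n'
    printf 'a search text %s\r\n' "$term"
    printf 'a logout\r\n'
}

# Example usage (-quiet suppresses the certificate dump and keeps the
# connection open after stdin ends, until the server closes it):
# imap_search_script boo peggy 'whoopie!' | openssl s_client -quiet -connect localhost:993
```

Timing the whole pipeline twice with the shell's time keyword gives a rough picture of the difference between the first, indexing search and later cached ones.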
Now that full-text search is enabled, it is a good idea to make dovecot reindex all messages for all users and all mailboxes; this is particularly useful if you have a large number of users and would like to jump-start the caching. Whilst new messages are indexed automatically via the fts_autoindex = yes setting in /etc/dovecot/conf.d/90-plugin.conf, old messages may not have been indexed.
To trigger a full re-index of all mailboxes, issue:
doveadm -D index -A '*'
where:
-D turns on debugging, which is useful for observing errors since this will be a one-shot operation,
-A means to index mailboxes for all users,
'*' is a mailbox mask that matches all of each user's mailboxes, so all their mails are scanned.
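On servers with many users it may be preferable to index one user at a time, so that a failure for one account is easy to spot. The following is a minimal sketch, assuming the user database supports listing users; the helper name reindex_users is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read one username per line from stdin, index all of
# that user's mailboxes, and report every user for whom indexing failed.
# Returns non-zero if at least one user failed.
reindex_users() {
    failed=0
    while IFS= read -r user; do
        [ -n "$user" ] || continue
        # stdin is redirected so doveadm cannot consume the user list.
        if ! doveadm index -u "$user" '*' < /dev/null; then
            echo "indexing failed for $user" >&2
            failed=$((failed + 1))
        fi
    done
    [ "$failed" -eq 0 ]
}

# Example usage:
# doveadm user '*' | reindex_users
```

Compared with a single -A run, this trades some speed for a clear per-user error message.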
Dovecot only performs soft commits in order to improve performance; however, it is recommended to send a commit command to Solr once in a while. Similarly, one can periodically send the optimize command.
A good idea is to use a script along the following lines:
#!/bin/sh
SERVER=127.0.0.1
PORT=8983
# Quote the URLs so that the shell does not treat "?" as a glob character.
curl "http://$SERVER:$PORT/solr/update?optimize=true" >/dev/null 2>&1
curl "http://$SERVER:$PORT/solr/update?commit=true" >/dev/null 2>&1
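and use cron to execute the script once every hour (or more often, for very busy servers); on Debian, the script can be placed in /etc/cron.hourly, provided that it is executable and that its file name contains no dot, since run-parts skips such files.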
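Because the script discards all output, a failing Solr goes unnoticed. A variant that reports failures could look like the sketch below; the function name solr_update is hypothetical, and the URLs are the same ones used in the script above.

```shell
#!/bin/sh
# Sketch of a cron-friendly variant: same requests as the original script,
# but a failed request is reported on stderr instead of being discarded.
SERVER=${SERVER:-127.0.0.1}
PORT=${PORT:-8983}

# Hypothetical helper: send one update action ("commit" or "optimize").
# curl -f turns HTTP error statuses into a non-zero exit code.
solr_update() {
    action=$1
    if ! curl -fsS -o /dev/null "http://$SERVER:$PORT/solr/update?$action=true"; then
        echo "solr $action failed" >&2
        return 1
    fi
}

# Example usage:
# solr_update optimize
# solr_update commit
```

Cron normally mails anything a job writes to stderr to the job's owner, so, assuming local mail delivery is set up, a failed commit would show up in root's mailbox.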
A better option may be to use Tomcat instead of Jetty, since Jetty can be unstable in this setup. To install Tomcat and Solr, issue:
aptitude install solr-tomcat
Tomcat by default will run on port 8080, which is a busy port on many machines. To change the default port that Tomcat listens on, edit the file at /etc/tomcat8/server.xml and locate the Connector tag:
<Connector port="8080" protocol="HTTP/1.1" connectionTimeout="20000" redirectPort="8443" />
and then change the 8080 value to, say, 8983 (which is the same port used for Jetty above).
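The port change can be scripted as well. This is a minimal sketch, assuming that port="..." and protocol="HTTP/1.1" sit on the same line of server.xml, as in the Connector tag shown above; the helper name set_tomcat_port is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: change the port of the HTTP/1.1 connector in a Tomcat
# server.xml. Only lines that mention protocol="HTTP/1.1" are touched, so
# other connectors (for example AJP) keep their ports. A .bak copy is kept.
set_tomcat_port() {
    conf=$1 old=$2 new=$3
    sed -i.bak "/protocol=\"HTTP\/1.1\"/ s/port=\"$old\"/port=\"$new\"/" "$conf"
}

# Example usage (run as root):
# set_tomcat_port /etc/tomcat8/server.xml 8080 8983
```

The substitution is case-sensitive, which is why the redirectPort attribute on the same line is left alone.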
If you experience issues with FTS, with requests in the Tomcat logs (/var/log/tomcat8) along the lines of "GET null null" 400 or messages in /var/log/mail.log such as Error: fts_solr: Lookup failed: 400 Bad Request, these can be traced to clients that issue very long search requests to dovecot, which Tomcat rejects because by default it does not accept headers that long. In such cases, one can increase the maximum header size by setting maxHttpHeaderSize in /etc/tomcat8/server.xml:
<Connector port="8983" protocol="HTTP/1.1" connectionTimeout="20000" redirectPort="8443" maxHttpHeaderSize="65536" />
Tomcat can then be restarted via:
/etc/init.d/tomcat8 restart
The rest of the instructions apply - just that you will be running Tomcat instead of Jetty.
Until the tomcat init script is fixed, you may have to edit /etc/init.d/tomcat8 and comment out or delete the section:
if [ -f "$CATALINA_PID" ]; then
  rm -f "$CATALINA_PID"
fi
otherwise tomcat fails to start and stop properly.
Dovecot's full text search plugin allows using Apache Tika for indexing email attachments (PDFs, Word Documents, etc…). This requires running Apache Tika in the background and pointing the Dovecot configuration to use Apache Tika as the backend.
Unfortunately, Debian does not package the Apache Tika server by default, but you can grab the Wizardry and Steamworks Tika package and install it either by adding the Wizardry and Steamworks Tika repository to /etc/apt/sources.list or by manually installing the tika-server.deb package via:
dpkg -i tika-server-1.16.deb
once you have downloaded it to a local directory.
Note that a Java JDK (such as openjdk) has to be installed for the Tika server to run.
After installing the Wizardry and Steamworks Tika package, Apache Tika can be stopped with:
/etc/init.d/tika stop
and started with:
/etc/init.d/tika start
By issuing:
ps ax | grep tika
the Apache Tika process should show up:
12199 ? Sl 14:57 /usr/bin/java -jar /usr/share/java/tika-server.jar -host localhost -port 9998
To check if Tika is answering properly, issue:
telnet localhost 9998
and a connection should be established.
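Beyond checking that the port is open, Tika can be asked to extract text from a document over HTTP. The following is a rough sketch, assuming the server exposes the /tika endpoint on the host and port shown in the process listing; the function names tika_is_up and tika_extract are hypothetical.

```shell
#!/bin/sh
# Endpoint assumed from the process listing (localhost, port 9998).
# No trailing slash: with one, curl -T would append the file name to the URL.
TIKA_URL=${TIKA_URL:-http://localhost:9998/tika}

# Hypothetical helper: succeeds if the endpoint answers a plain HTTP GET.
tika_is_up() {
    curl -fsS -o /dev/null "$1"
}

# Hypothetical helper: upload a document with HTTP PUT and print the
# extracted plain text on stdout.
tika_extract() {
    curl -fsS -H 'Accept: text/plain' -T "$2" "$1"
}

# Example usage:
# tika_is_up "$TIKA_URL" && echo "Tika is answering"
# tika_extract "$TIKA_URL" /path/to/document.pdf
```

If tika_extract prints the text of a PDF, the server is doing the same work that Dovecot will later request for each attachment.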
The next step is to edit /etc/dovecot/conf.d/90-plugin.conf, locate the FTS directives mentioned in this document, and append the following in the plugin section:
fts_tika = http://localhost:9998/tika/
If fts_autoindex is set to yes in /etc/dovecot/conf.d/90-plugin.conf, then new emails will have their attachments indexed; however, you may want to trigger a full re-indexing of all mailboxes for all users in order to scan older attachments:
doveadm -D index -A '*'
During the scanning phase, you will see a number of lines such as:
[Req2424 PUT http://localhost:9998/tika/]: Waiting for request to finish
but eventually, you should get:
[Req2425: PUT http://localhost:9998/tika/]: Finished sending payload
[Req2425: PUT http://localhost:9998/tika/]: Waiting for request to finish
indicating that Apache Tika is working.
Finally, to check, send an email with a PDF attachment and run a full-text body search for a sentence or phrase that appears in the PDF.