Thoughts on Software: relevancy

Sonntag, 12. Januar 2014

Some hints for using Solr in production

Although it is easy to start with the search platform Apache Solr, it is difficult to master Solr in production. Here are some hints that could be helpful (additional to the usual things to do for Java apps in production):

Memory:

Solr requires sufficient memory the Java heap and OS disk cache. Some more background information is given here. Using SSDs can decrease the memory requirements.
The required Java heap size depends on the configuration of the Solr caches. Especially a wrongly configured filter cache can result in an OutOfMemoryError as at most as many bits as the number of documents are consumed in memory for one stored filter. That is, an upper bound for the Java heap space required by the filter cache (in bits) is the filter cache size (configured in solrconfig.xml) multiplied by the number of documents. A heap dump analysis is helpful in case of an OutOfMemoryError.

The search performance is heavily impacted when other I/O consuming operation are performed on the Solr server.
Do benchmarking in order to know when it is necessary to shard with SolrCloud. The search performance is quite often linearly dependent on the document size until a size where the search time increases exponentially. Try to use real word queries in order to perform performance tests with SolrMeter.
This article gives a good overview what can be done in order to ensure relevancy. Relevancy can't be ensured solely by the developers, it is best measured by content experts.

This list could be much longer (tune Solr caches, use the autocommit feature, ...) with things that you will probably find out yourself while putting your Solr app to production;)

Freitag, 10. Januar 2014

Use common terms queries in Solr queries in order to improve search performance while retaining relevancy

Removing stop words can help to improve the performance of search queries because it reduces the size of the index. Thereby the relevancy of the search results is usually not affected.

However, there are situations when it is necessecary to search for stop words (e.g. for "to be or not to be" which contains only stop words). Additionally, there could be domain-specific frequent words ("music", "book", ...) that are not in the usual stop word list. It is not desirable to remove them from the index, but on the other hand searching for them can worsen search performance.

A possible solution is the Lucene CommonTermsQuery which is already implemented in Elastic search. Here is a Github Gist that shows how to use common terms queries with Apache Solr. The common terms query can be used in a Solr query with q={!commonTermsQueryParser}query string&qf=query field.

Thoughts on Software

Sonntag, 12. Januar 2014

Some hints for using Solr in production

Freitag, 10. Januar 2014

Use common terms queries in Solr queries in order to improve search performance while retaining relevancy

Follower

Blog-Archiv

Über mich