Tuesday, January 14, 2014

Language detection at query time for long Solr queries

Try to avoid long queries (hundreds or thousands of terms) as Apache Solr is not optimized for them.

But if you have to execute long queries against a Solr app and the query language is not known, you should use language detection at query time. This improves relevancy (through language-specific stemming and analysis) and performance (through a smaller index and stop word / common terms queries). Additionally, language detection is the basis for using common grams in order to improve the performance of phrase queries.

Here is how it works:
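A minimal sketch with SolrJ, assuming language-specific fields like text_en and text_de in the schema and using Apache Tika's LanguageIdentifier for the detection (any other language detection library would work as well):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.tika.language.LanguageIdentifier;

public class LanguageAwareSearch {

    public static QueryResponse search(String queryText) throws Exception {
        // Detect the language of the (long) query text before sending it to Solr.
        String lang = new LanguageIdentifier(queryText).getLanguage();

        // Route the query to a language-specific field so that the matching
        // analyzer chain (stemming, stop words, common grams) is applied.
        String field = "de".equals(lang) ? "text_de" : "text_en";

        SolrQuery query = new SolrQuery(field + ":(" + queryText + ")");

        // Assumed Solr URL and core name.
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");
        return solr.query(query);
    }
}

Instead of a field, the detected language could also be used to pick a language-specific request handler.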

Monday, January 13, 2014

Which technology to use when implementing a REST service with Java

There is no simple answer. The first question is whether the team that wants to build the REST service already has hands-on experience with some of the relevant technologies:
  • JBoss 7, which uses RESTEasy internally
  • Spring MVC (which, as a minus, is not JAX-RS compliant)
  • Spring with Apache CXF is a good choice if the team already has experience with Spring and wants to deploy on Tomcat
  • It is even easier to run Jersey (the JAX-RS reference implementation) on Tomcat if no Spring support is needed
  • Restlet would be another option, although I haven't used it so far. Here is a comparison between Restlet and Jersey. 
Another factor in the decision is the specific features that the REST service has to support (security, performance, specific headers, ...). The frameworks given above should be evaluated against those features.
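For illustration, a minimal JAX-RS resource (a hypothetical BookResource, not taken from a real project) looks the same no matter whether Jersey, RESTEasy or Apache CXF serves it, which is the main benefit of staying JAX-RS compliant:

import javax.ws.rs.GET;
import javax.ws.rs.Path;
import javax.ws.rs.PathParam;
import javax.ws.rs.Produces;
import javax.ws.rs.core.MediaType;

// Plain JAX-RS: this resource runs unchanged on Jersey, RESTEasy or Apache CXF.
@Path("/books")
public class BookResource {

    @GET
    @Path("/{id}")
    @Produces(MediaType.APPLICATION_JSON)
    public String getBook(@PathParam("id") String id) {
        // A real service would look the book up in a repository.
        return "{\"id\": \"" + id + "\", \"title\": \"Example\"}";
    }
}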

Node.js can be an alternative to Java, especially with Express or Restify. A "no backend" approach (e.g. with deployd or hood.ie) could be interesting for prototyping.


Sunday, January 12, 2014

Some hints for using Solr in production


Although it is easy to get started with the search platform Apache Solr, it is difficult to master Solr in production. Here are some hints that could be helpful (in addition to the usual things to do for Java apps in production):

  • Memory: 
    • Solr requires sufficient memory for both the Java heap and the OS disk cache. Some more background information is given here. Using SSDs can decrease the memory requirements.
    • The required Java heap size depends on the configuration of the Solr caches. In particular, a wrongly configured filter cache can result in an OutOfMemoryError, because a single cached filter can consume up to as many bits in memory as there are documents in the index. That is, an upper bound for the Java heap space required by the filter cache (in bits) is the filter cache size (configured in solrconfig.xml) multiplied by the number of documents; see the example configuration after this list. A heap dump analysis is helpful in case of an OutOfMemoryError.
  • The search performance is heavily impacted when other I/O-consuming operations are performed on the Solr server. 
  • Do benchmarking in order to know when it becomes necessary to shard with SolrCloud. The search time quite often grows linearly with the index size up to a point where it starts to increase exponentially. Try to use real-world queries when running performance tests with SolrMeter.
  • This article gives a good overview of what can be done in order to ensure relevancy. Relevancy can't be ensured solely by the developers; it is best measured by content experts.
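As a rough illustration of the filter cache bound mentioned above (the numbers are only example values): a filterCache of size 512 against an index of 10 million documents can consume up to 512 * 10,000,000 bits, i.e. roughly 640 MB of heap, in the worst case.

<!-- solrconfig.xml (example values): with 10 million documents in the index,
     up to 512 * 10,000,000 bits (roughly 640 MB) of heap can be consumed. -->
<filterCache class="solr.FastLRUCache"
             size="512"
             initialSize="512"
             autowarmCount="0"/>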

This list could be much longer (tune the Solr caches, use the autocommit feature, ...) with things that you will probably find out yourself while putting your Solr app into production ;)



Saturday, January 11, 2014

Don't forget the leading slash when using SolrJ or SolrMeter with a custom search handler

A custom Solr search handler is quite useful in order to decouple search clients from an Apache Solr server, because several features can be added transparently for the search clients. Clients access the search handler through its URL path (like http://mySolrUrl/mySearchHandler).

SolrMeter is a stress testing tool and SolrJ is a Java client for Apache Solr. When accessing a custom search handler with these tools, it is important not to forget the leading slash in the handler name (SolrJ: http://lucene.apache.org/solr/4_6_0/solr-solrj/org/apache/solr/client/solrj/SolrQuery.html#setRequestHandler(java.lang.String)).

Otherwise, the resulting search query is http://mySolrUrl/select?qt=mySearchHandler instead of http://mySolrUrl/mySearchHandler. As of Solr 3.6, the qt parameter is no longer considered by default, and therefore the default /select search handler is executed instead of the custom one.
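With SolrJ this looks roughly as follows (the server URL and handler name are the placeholders from above):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;

public class CustomHandlerQuery {

    public static void main(String[] args) throws Exception {
        HttpSolrServer solr = new HttpSolrServer("http://mySolrUrl");
        SolrQuery query = new SolrQuery("some query");

        // Wrong: without the leading slash SolrJ sends .../select?qt=mySearchHandler,
        // which Solr 3.6+ answers with the default /select handler.
        // query.setRequestHandler("mySearchHandler");

        // Correct: the leading slash makes SolrJ request .../mySearchHandler directly.
        query.setRequestHandler("/mySearchHandler");

        solr.query(query);
    }
}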

Friday, January 10, 2014

Use common terms queries in Solr in order to improve search performance while retaining relevancy


Removing stop words can help to improve the performance of search queries because it reduces the size of the index. The relevancy of the search results is usually not affected by this.

However, there are situations when it is necessary to search for stop words (e.g. for "to be or not to be", which consists only of stop words). Additionally, there may be domain-specific frequent words ("music", "book", ...) that are not in the usual stop word lists. It is not desirable to remove them from the index, but on the other hand searching for them can worsen search performance.

A possible solution is the Lucene CommonTermsQuery, which is already integrated in Elasticsearch. Here is a GitHub Gist that shows how to use common terms queries with Apache Solr. The common terms query can be used in a Solr query with q={!commonTermsQueryParser}query string&qf=query field.
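With SolrJ such a query could be issued roughly like this (assuming the query parser from the Gist is registered under the name commonTermsQueryParser and the searched field is called text):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;

public class CommonTermsSearch {

    public static void main(String[] args) throws Exception {
        // A query consisting only of stop words; the common terms query parser
        // handles it efficiently without removing the terms from the index.
        SolrQuery query = new SolrQuery("{!commonTermsQueryParser}to be or not to be");
        query.set("qf", "text");

        new HttpSolrServer("http://mySolrUrl").query(query);
    }
}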