Tuesday, January 14, 2014

Language detection at query time for long Solr queries

Try to avoid long queries (hundreds or thousands of terms), as Apache Solr is not optimized for them.

But if you have to execute long queries against a Solr app and the query language is not known, you should use language detection at query time. This improves relevancy (through stemming, language-specific analysis, etc.) and performance (through a smaller index and stop words / common terms queries). Additionally, language detection is the basis for using common grams to improve the performance of phrase queries.

Here is how it works:
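As a minimal sketch of the idea, here is a naive query-time language guesser based on stop-word overlap. This is purely illustrative (the tiny word lists below are made up); a real application should use a proper library such as Apache Tika or langdetect, which are far more accurate:

```java
import java.util.*;

/** Naive query-time language detection by stop-word overlap.
 *  Illustrative only: real systems should use a library such as
 *  Apache Tika or langdetect. */
public class QueryLanguageGuesser {

    // Tiny illustrative stop-word lists per language code.
    private static final Map<String, Set<String>> STOP_WORDS = Map.of(
        "en", Set.of("the", "and", "is", "of", "to", "in", "a"),
        "de", Set.of("der", "die", "das", "und", "ist", "von", "zu", "ein")
    );

    /** Returns the language whose stop words overlap most with the query terms. */
    public static String guessLanguage(String query) {
        String[] terms = query.toLowerCase(Locale.ROOT).split("\\s+");
        String best = "unknown";
        int bestHits = 0;
        for (Map.Entry<String, Set<String>> e : STOP_WORDS.entrySet()) {
            int hits = 0;
            for (String t : terms) {
                if (e.getValue().contains(t)) hits++;
            }
            if (hits > bestHits) {
                bestHits = hits;
                best = e.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(guessLanguage("the quick brown fox jumps over the lazy dog"));
        System.out.println(guessLanguage("der schnelle braune Fuchs springt"));
    }
}
```

The detected language code can then be used to pick a language-specific request handler or query field (e.g. text_en vs. text_de).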

Monday, January 13, 2014

Which technology to use when implementing a REST service with Java

There is no simple answer. The first question is whether the team that wants to build the REST service already has hands-on experience with some of the relevant technologies:
  • JBoss 7, which uses RESTEasy internally
  • Spring MVC (which is - as a minus - not JAX-RS compliant)
  • Spring with Apache CXF is a good choice if the team already has experience with Spring and wants to deploy on Tomcat
  • It is even easier to run Jersey (the JAX-RS reference implementation) in Tomcat if no Spring support is needed
  • Restlet would be another option that I haven't used so far. Here is a comparison between Restlet and Jersey. 
Another decision criterion is the set of specific features the REST service must support (security, performance, specific headers, ...). The frameworks listed above should be evaluated against those features.

Node.js can be an alternative to Java, especially with Express or Restify. A no-backend approach (e.g. with deployd or hood.ie) could be interesting for prototyping.


Sunday, January 12, 2014

Some hints for using Solr in production


Although it is easy to get started with the search platform Apache Solr, it is difficult to master Solr in production. Here are some hints that could be helpful (in addition to the usual things to do for Java apps in production):

  • Memory: 
    • Solr requires sufficient memory for both the Java heap and the OS disk cache. Some more background information is given here. Using SSDs can decrease the memory requirements.
    • The required Java heap size depends on the configuration of the Solr caches. A wrongly configured filter cache in particular can result in an OutOfMemoryError, as up to one bit per document is consumed in memory for each cached filter. That is, an upper bound for the Java heap space required by the filter cache (in bits) is the filter cache size (configured in solrconfig.xml) multiplied by the number of documents. A heap dump analysis is helpful in case of an OutOfMemoryError.
  • Search performance suffers heavily when other I/O-consuming operations are performed on the Solr server. 
  • Do benchmarking in order to know when it is necessary to shard with SolrCloud. Search time is quite often linearly dependent on the index size up to a point where it starts to increase exponentially. Try to use real-world queries when running performance tests with SolrMeter.
  • This article gives a good overview of what can be done to ensure relevancy. Relevancy can't be ensured solely by the developers; it is best measured by content experts.
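The filter cache upper bound mentioned above is easy to estimate with a back-of-the-envelope calculation (the cache size and document count below are made-up example values, not recommendations):

```java
/** Back-of-the-envelope upper bound for the heap consumed by the Solr
 *  filter cache: up to one bit per document per cached filter. */
public class FilterCacheBound {

    /** Returns the upper bound in bytes. */
    static long filterCacheUpperBoundBytes(long filterCacheSize, long numDocs) {
        // Each cached filter can take up to one bit per document.
        return filterCacheSize * numDocs / 8;
    }

    public static void main(String[] args) {
        long cacheSize = 512;        // <filterCache size="512" .../> in solrconfig.xml
        long numDocs = 10_000_000L;  // documents in the index
        long bytes = filterCacheUpperBoundBytes(cacheSize, numDocs);
        System.out.println(bytes / (1024 * 1024) + " MB");
    }
}
```

So with 10 million documents, a filter cache of 512 entries alone can require around 600 MB of heap in the worst case.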

This list could be much longer (tune the Solr caches, use the autocommit feature, ...) with things that you will probably find out yourself while putting your Solr app into production. ;)



Saturday, January 11, 2014

Don't forget the leading slash when using SolrJ or SolrMeter with a custom search handler

A custom Solr search handler is quite useful for decoupling search clients from an Apache Solr server. More specifically, several features can be added transparently for search clients.
Clients access the search handler through its URL path (like http://mySolrUrl/mySearchHandler).

SolrMeter is a stress testing tool and SolrJ is a Java client for Apache Solr. It is important not to forget the leading slash when accessing a custom search handler with these tools (for SolrJ see http://lucene.apache.org/solr/4_6_0/solr-solrj/org/apache/solr/client/solrj/SolrQuery.html#setRequestHandler(java.lang.String)).

Otherwise, the resulting search query is http://mySolrUrl/select?qt=mySearchHandler instead of http://mySolrUrl/mySearchHandler. As of Solr 3.6, the qt parameter is no longer considered by default, and therefore the default select search handler is executed instead of the custom one.
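For reference, a custom search handler is registered in solrconfig.xml with a leading slash in its name; the handler name and default parameters below are illustrative:

```xml
<!-- solrconfig.xml: the leading slash in the name makes the handler
     addressable as http://mySolrUrl/mySearchHandler -->
<requestHandler name="/mySearchHandler" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">edismax</str>
    <int name="rows">10</int>
  </lst>
</requestHandler>
```

With SolrJ the handler is then selected via query.setRequestHandler("/mySearchHandler") - again with the leading slash.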

Friday, January 10, 2014

Use common terms queries in Solr queries in order to improve search performance while retaining relevancy


Removing stop words can help improve the performance of search queries because it reduces the size of the index. The relevancy of the search results is usually not affected by this.

However, there are situations in which it is necessary to search for stop words (e.g. for "to be or not to be", which consists only of stop words). Additionally, there could be domain-specific frequent words ("music", "book", ...) that are not on the usual stop word list. It is not desirable to remove them from the index, but on the other hand searching for them can worsen search performance.

A possible solution is the Lucene CommonTermsQuery, which is already implemented in Elasticsearch. Here is a GitHub Gist that shows how to use common terms queries with Apache Solr. The common terms query can be used in a Solr query with q={!commonTermsQueryParser}query string&qf=query field.
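A custom query parser plugin like the one from the Gist has to be registered in solrconfig.xml before it can be referenced in the q parameter; the class name below is a placeholder for your own plugin implementation, not a class shipped with Solr:

```xml
<!-- solrconfig.xml: register the custom query parser under the name used
     in the q parameter; the class is a placeholder for your own plugin -->
<queryParser name="commonTermsQueryParser"
             class="com.example.solr.CommonTermsQParserPlugin"/>
```

Afterwards a request like /select?q={!commonTermsQueryParser}to be or not to be&qf=text runs the common terms query against the configured query field.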

Friday, June 4, 2010

A non-obvious pitfall when using a custom facelets resource resolver

Sometimes it can be helpful to use a custom Facelets resource resolver. This resource resolver resolves the XHTML snippets during facelet composition, for instance in:
<ui:composition template="/masterTemplate.xhtml" /> or <ui:include src="/toInclude.xhtml" />

A pitfall is to omit the leading slash in /masterTemplate.xhtml or /toInclude.xhtml. The custom resource resolver is only used WITH a leading slash. The reason lies in the implementation of com.sun.facelets.impl.DefaultFaceletFactory#resolveURL(...), where this.resolver is the custom resource resolver:

public URL resolveURL(URL source, String path) throws IOException {
  if (path.startsWith("/")) {
    // Only absolute paths are delegated to the (custom) resource resolver.
    URL url = this.resolver.resolveUrl(path);
    if (url == null) {
      throw new FileNotFoundException(path + " Not Found in ExternalContext as a Resource");
    }
    return url;
  } else {
    // Relative paths bypass the resolver entirely.
    return new URL(source, path);
  }
}

Sunday, February 14, 2010

Performance of complex web pages with Seam, RichFaces and Facelets

I worked on a web app that basically consists of one fairly large page composed of sophisticated facelet composition components. The page is implemented with Seam, RichFaces and Facelets. The Seam components are mostly in page scope. After an initial GET request the user stays on the page and sends AJAX postbacks.

This page has a lot of cool features, but users complained about slow response times, both for the initial page load and for subsequent AJAX requests. Performance logging with a JSF phase listener revealed that more than 80% of the server-side response time was spent in the render response phase of the JSF lifecycle. I additionally checked (by profiling with JProfiler) that the performance bottleneck was not in the application code. I took the performance hints given in Dan Allen's articles on JSF/Seam performance (part 1, part 2) into consideration, but the page remained slow.

Then I activated Facelets logging and found some surprising results:
  1. About 90% of the time in the render response phase was consumed by Facelets.
  2. During AJAX requests, Facelets builds the whole component tree from scratch, even if only a (small) part would be sufficient for rerendering.
Facelets was the performance killer! Considering this fact, I succeeded in improving the response time by relying on the following principles:
  1. Divide the page into several pages. The conversation context can then be used as the scope of the Seam components instead of the page context.
  2. Reduce the number of facelet compositions.
  3. Use real JSF components instead of facelet composition components.
  4. Only load the parts of the page up front that are really needed. The other parts should only be loaded when necessary. This can be implemented by conditionally including parts of the page via EL evaluation: <ui:include src="#{clicked ? 'full.content.xhtml' : 'empty.content.xhtml'}">
    </ui:include>

    I double-checked that Facelets dynamically (lazily) loads the corresponding content.
  5. Embed AJAX action components like <a4j:commandLink> in <a4j:region selfRendered="true">. If selfRendered="true" is set, Facelets is not called at all during the AJAX request because all information is obtained from the JSF component tree. Yet, there are some potential pitfalls with EL evaluation and with the <rich:columns> tag. Note that transient components are not rendered.
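Item 5 can be sketched as follows (the bean name, action and panel id are made up for illustration):

```xhtml
<!-- a4j:region with selfRendered="true": the AJAX response is rendered
     directly from the JSF component tree, without invoking Facelets -->
<a4j:region selfRendered="true">
  <a4j:commandLink value="Refresh" action="#{searchBean.refresh}"
                   reRender="resultPanel"/>
</a4j:region>
```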

To sum up, there can be some non-trivial performance pitfalls when developing complex JSF apps. Maybe developers should give Apache Wicket a try (see the JSF/Wicket performance comparison by Peter Thomas). I liked this web framework in another project...