Custom indexing (using Java threads) with Hibernate Search

This post is based on recent work on a project where the requirement was to implement a custom incremental update of the Lucene index. The update runs at pre-defined (configurable) intervals and indexes all entities created or modified since the last run. For example, the full-text search index is aligned with the DB changes every 60 minutes while users keep accessing the application.

The technology stack is Hibernate 3.6, Hibernate Search 3.3, Lucene 3.0.1, Spring 2.5.6, Java 5.

Why and How

Unfortunately I couldn’t use the standard mechanisms suggested by the Hibernate Search documentation:

  • Automatic indexing (let Hibernate Search do everything for you, ideally the best choice): I have several instances of the application running in different containers, but they can only access the data server (hosting the database and the Lucene index files) via SQL because of a firewall restriction
  • Mass Indexer requires the application to go (almost) offline or to avoid index updates during the re-indexing, whilst (in my case) the users need to keep using the application
  • JMS-driven index update is a no-go in the production deployment scenario, again because of the firewall restriction 😦

The approach implemented disables the built-in automatic indexing and instead controls the indexing process in one location (on the data server), where a Java thread queries the entities created or updated since the last index modification and updates the Lucene index. The Lucene index files are then replicated (via a secure push mechanism) to each of the application instances.
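The "since the last index modification" part implies that the thread remembers when it last ran successfully. A minimal sketch of that bookkeeping, assuming an in-memory watermark (a real deployment would probably persist it so that it survives a restart); the class `IndexWatermarkSketch` and the `Indexer` interface are hypothetical names used only for illustration:

```java
import java.util.Date;

class IndexWatermarkSketch {

    /** Hypothetical callback: index everything changed in the window [from, to). */
    interface Indexer {
        void indexChangedBetween(Date from, Date to);
    }

    // epoch as the initial value, so the very first run picks up every entity
    private Date lastSuccessfulRun = new Date(0L);

    /** Runs one update; the watermark only moves forward when the run succeeds. */
    synchronized boolean runOnce(Indexer indexer, Date now) {
        try {
            indexer.indexChangedBetween(new Date(lastSuccessfulRun.getTime()), now);
        } catch (RuntimeException e) {
            return false; // watermark unchanged: the next run retries the same window
        }
        lastSuccessfulRun = new Date(now.getTime());
        return true;
    }

    /** Three runs, the second one failing; returns the lower bound seen by each run. */
    static long[] demo() {
        final long[] fromSeen = new long[3];
        final int[] call = { 0 };
        Indexer flaky = new Indexer() {
            public void indexChangedBetween(Date from, Date to) {
                int n = call[0]++;
                fromSeen[n] = from.getTime();
                if (n == 1) {
                    throw new IllegalStateException("simulated indexing failure");
                }
            }
        };
        IndexWatermarkSketch watermark = new IndexWatermarkSketch();
        watermark.runOnce(flaky, new Date(1000L));
        watermark.runOnce(flaky, new Date(2000L));
        watermark.runOnce(flaky, new Date(3000L));
        return fromSeen;
    }

    public static void main(String[] args) {
        System.out.println(java.util.Arrays.toString(demo()));
    }
}
```

The timestamp is taken before the query starts and stored only after the run succeeds, so a failed run does not lose the changes it was supposed to index.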

In the diagram below, Instance A hosts the thread responsible for updating the Lucene index, while the other applications just rely on a read-only local copy which is kept in sync with the master version.

Java Thread

The java.util.concurrent package provides everything we need to run tasks on background threads.

The ScheduledExecutorService interface schedules the execution of the task every X seconds (for convenience the interval is configured in a property file).

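import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class UpdateIndexTask {
   private static final ScheduledExecutorService scheduler =
      Executors.newScheduledThreadPool(1);
   // initial delay and period in seconds (the interval comes from the property file)
   private long start, interval;

   public void initTask() {
      // run UpdateIndexTimerTask (a Runnable) every 'interval' seconds
      scheduler.scheduleAtFixedRate(new UpdateIndexTimerTask(),
           start, interval, TimeUnit.SECONDS);
   }
}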
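One detail worth knowing about scheduleAtFixedRate: if a run throws an exception, all subsequent executions are suppressed, so a single failed update would silently stop the indexing. A minimal sketch of a guard around the task, assuming it is acceptable to just count the failure and wait for the next interval; `GuardedSchedulerSketch` and `guarded` are hypothetical names, not part of the project code:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

class GuardedSchedulerSketch {

    /** Wraps a task so that a failure in one run does not cancel the following runs. */
    static Runnable guarded(final Runnable task, final AtomicInteger failures) {
        return new Runnable() {
            public void run() {
                try {
                    task.run();
                } catch (RuntimeException e) {
                    // record the failure instead of letting it reach the scheduler
                    failures.incrementAndGet();
                }
            }
        };
    }

    /**
     * Schedules a task that fails on its first run only and waits for 'runs' executions.
     * Returns the number of recorded failures, or -1 if the runs did not happen in time.
     */
    static int demo(long intervalMillis, final int runs) {
        final AtomicInteger executed = new AtomicInteger();
        final AtomicInteger failures = new AtomicInteger();
        final CountDownLatch done = new CountDownLatch(runs);
        Runnable flaky = new Runnable() {
            public void run() {
                int n = executed.incrementAndGet();
                done.countDown();
                if (n == 1) {
                    throw new IllegalStateException("simulated indexing failure");
                }
            }
        };
        ScheduledExecutorService scheduler = Executors.newScheduledThreadPool(1);
        try {
            scheduler.scheduleAtFixedRate(guarded(flaky, failures),
                    0, intervalMillis, TimeUnit.MILLISECONDS);
            boolean finished;
            try {
                finished = done.await(5, TimeUnit.SECONDS);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                finished = false;
            }
            return finished ? failures.get() : -1;
        } finally {
            scheduler.shutdownNow(); // stop the worker thread once the demo is over
        }
    }

    public static void main(String[] args) {
        System.out.println("failures recorded: " + demo(10, 3));
    }
}
```

Without the wrapper, the task in the demo would run once and never again. With it, the later runs still happen and the failure is counted.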

Update Index Logic

The logic of the index update (class UpdateIndexTimerTask in the snippet above) works in two steps:
  • retrieve all entities which have been created/modified since the last index update
  • update the index with the entities loaded in the first step

// get FullText session
FullTextSession ftSession = Search.getFullTextSession(session);
// define criteria to load entities
Criteria query = session.createCriteria(...);
// update index going through the search results
ScrollableResults scroll = query.scroll(ScrollMode.FORWARD_ONLY);
int batch = 0; // number of entities indexed so far
while (scroll.next()) {
   batch++;
   ftSession.index(scroll.get(0)); // indexing of a single entity
   if (batch % batchSize == 0) { // commit batch
      ftSession.flushToIndexes(); // apply the pending changes to the index
      ftSession.clear();          // free the memory held by the session
   }
}
// entities left over from the last partial batch are written when the transaction commits

Obviously the Criteria query is strictly defined according to the application domain and the Hibernate Search mappings (in my case the query will load a set of entities including the physical files if available)

ScrollableResults suits the needs here: the search results are traversed one by one and written to the index in batches (changes are flushed in chunks rather than all at once).
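The batch technique itself does not depend on Hibernate. A minimal sketch of the pattern, assuming a hypothetical `BatchSink` interface that stands in for the index side (its `flush()` plays the role of flushToIndexes() followed by clear()):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

class BatchFlushSketch {

    /** Hypothetical stand-in for the index side of the loop. */
    interface BatchSink<T> {
        void add(T item);
        void flush();
    }

    /** Feeds items to the sink, flushing every batchSize items; returns the item count. */
    static <T> int processInBatches(Iterator<T> items, int batchSize, BatchSink<T> sink) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        int count = 0;
        while (items.hasNext()) {
            sink.add(items.next());
            count++;
            if (count % batchSize == 0) {
                sink.flush(); // a full batch has accumulated
            }
        }
        if (count % batchSize != 0) {
            sink.flush(); // flush the last, partial batch
        }
        return count;
    }

    /** Sink that records the size of every flushed batch, for demonstration only. */
    static class RecordingSink implements BatchSink<String> {
        final List<Integer> flushedSizes = new ArrayList<Integer>();
        private int pending = 0;

        public void add(String item) {
            pending++;
        }

        public void flush() {
            flushedSizes.add(pending);
            pending = 0;
        }
    }

    /** Seven items with a batch size of three. */
    static List<Integer> demo() {
        RecordingSink sink = new RecordingSink();
        List<String> items = Arrays.asList("a", "b", "c", "d", "e", "f", "g");
        processInBatches(items.iterator(), 3, sink);
        return sink.flushedSizes;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

The counter is incremented for every item and the remainder is flushed explicitly after the loop, so no item is left behind when the total is not a multiple of the batch size.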

Session Management

Here is the critical part: making sure the Hibernate session is managed correctly. It must be closed after the update is completed so that the JDBC connection can return to the pool.

A good strategy is to open a new session every time the update thread kicks in: we want the thread to get hold of a new session which is shared by all operations (query criteria, index flush) within the thread body, and which also runs in a transactional context.

If using Spring (who isn't?) then a simple SessionFactoryUtils.getNewSession(sessionFactory) does the trick.
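The open/close discipline can be sketched without Hibernate on the classpath. In the sketch below `WorkSession`, `SessionSource` and `Work` are hypothetical stand-ins (for a Hibernate session with its transaction, for whatever opens a new session, and for the body of the update); the point is only that close() sits in a finally block, so the connection goes back to the pool even when the update fails:

```java
import java.util.ArrayList;
import java.util.List;

class SessionPerRunSketch {

    /** Hypothetical stand-in for a session bound to one pooled connection. */
    interface WorkSession {
        void commit();
        void rollback();
        void close();
    }

    /** Hypothetical stand-in for whatever opens a fresh session. */
    interface SessionSource {
        WorkSession openNew();
    }

    /** The body of one update run: query and index flush share the same session. */
    interface Work {
        void doIn(WorkSession session);
    }

    /** Opens a new session for this run and always closes it; returns true on success. */
    static boolean runInNewSession(SessionSource source, Work work) {
        WorkSession session = source.openNew();
        try {
            work.doIn(session);
            session.commit();
            return true;
        } catch (RuntimeException e) {
            session.rollback(); // this sketch swallows the error; real code would log it
            return false;
        } finally {
            session.close(); // runs on success and on failure alike
        }
    }

    /** Records the order of the session calls for a successful or a failing run. */
    static List<String> demo(final boolean fail) {
        final List<String> events = new ArrayList<String>();
        final WorkSession session = new WorkSession() {
            public void commit() { events.add("commit"); }
            public void rollback() { events.add("rollback"); }
            public void close() { events.add("close"); }
        };
        SessionSource source = new SessionSource() {
            public WorkSession openNew() {
                events.add("open");
                return session;
            }
        };
        runInNewSession(source, new Work() {
            public void doIn(WorkSession s) {
                events.add("work");
                if (fail) {
                    throw new IllegalStateException("simulated failure");
                }
            }
        });
        return events;
    }

    public static void main(String[] args) {
        System.out.println(demo(false));
        System.out.println(demo(true));
    }
}
```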

Performance Considerations

A few aspects must be taken into account:

  • identify an appropriate batch size to flush index updates in chunks; this is very important
  • disable the cache on the query (for example with CacheMode.IGNORE), as we don't need to load all those objects into the cache
  • use MANUAL flush (FlushMode.MANUAL): we control the index update and we don't need Hibernate to perform dirty checking

Conclusion

I hope this helps developers whose scenarios differ from the ones covered by the Hibernate docs. The approach described works nicely and supports incremental updates when there is no strict real-time requirement.

Another tip: always provide a way to fully rebuild the index on demand. This is critical for upgrades when the Lucene version changes (the index files need to be re-created) or for recreating a corrupted index.

References

Hibernate Search documentation

Hibernate forums
