This post is based on recent development work on a project: the requirement was to implement a custom incremental update of the Lucene index. The update runs at pre-defined (configurable) intervals and indexes all entities created or modified since the last run (i.e. the full-text search index is aligned with the database changes every 60 minutes while users keep accessing the application).
The technology stack is Hibernate 3.6, Hibernate Search 3.3, Lucene 3.0.1, Spring 2.5.6, Java 5.
Why and How
Unfortunately I couldn't use the standard mechanisms suggested by the Hibernate Search documentation:
- Automatic indexing (let Hibernate Search do everything for you; ideally the best choice): I have several instances of the application running on different containers, but they can only reach the data server (hosting the database and the Lucene data files) over SQL, because of a firewall restriction
- The Mass Indexer requires the application to go (almost) offline, or at least to suspend index updates during re-indexing, whereas in my case users need to keep using the application
- A JMS-driven index update is a no-go in the production deployment scenario, again because of the firewall restriction 😦
The implemented approach disables the built-in automatic indexing and instead controls the indexing process in a single location (on the data server), where a Java thread queries the entities created or updated since the last index modification and updates the Lucene index. The Lucene index files are then replicated (via a secure push mechanism) to each of the application instances.
In the diagram below, Instance A hosts the thread responsible for updating the Lucene index, while the other instances rely on a read-only local copy that is kept in sync with the master version.
Java Thread
The java.util.concurrent package provides everything we need to program concurrent threads.
The ScheduledExecutorService interface "schedules" the execution of the task every X seconds (for convenience the interval is configured in a property file):
```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class UpdateIndexTask {

    private static final ScheduledExecutorService scheduler =
            Executors.newScheduledThreadPool(1);

    public void initTask() {
        // run the update task every 'interval' seconds,
        // after an initial delay of 'start' seconds
        scheduler.scheduleAtFixedRate(new UpdateIndexTimerTask(),
                start, interval, TimeUnit.SECONDS);
    }
}
```
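The post mentions that the interval is read from a property file. A minimal sketch of that part (the property name `index.update.interval` and the class name are hypothetical, not from the original project):

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

public class IndexTaskConfig {

    // read the scheduling interval (in seconds) from a Properties object,
    // falling back to a default of 3600 (one hour) if the key is absent
    static long intervalSeconds(Properties props) {
        return Long.parseLong(props.getProperty("index.update.interval", "3600"));
    }

    public static void main(String[] args) throws IOException {
        // in the real application the Properties would be loaded from a
        // file on the classpath; an inline string keeps the sketch self-contained
        Properties props = new Properties();
        props.load(new StringReader("index.update.interval=60"));
        System.out.println(intervalSeconds(props)); // prints 60
    }
}
```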
Update Index Logic
The logic of the index update (class UpdateIndexTimerTask in the snippet above) works in two steps:
- retrieve all entities that have been created or modified since the last index update
- update the index with the entities loaded in step 1
```java
// get a FullText session wrapping the current Hibernate session
FullTextSession ftSession = Search.getFullTextSession(session);

// criteria to load the entities created/modified since the last index update
Criteria query = session.createCriteria(...);

// update the index by traversing the search results
ScrollableResults scroll = query.scroll(ScrollMode.FORWARD_ONLY);
int batch = 0;
while (scroll.next()) {
    ftSession.index(scroll.get(0)); // index a single entity
    if (++batch % batchSize == 0) {
        ftSession.flushToIndexes(); // commit the batch of index changes
        ftSession.clear();          // free memory held by the session
    }
}
ftSession.flushToIndexes(); // flush the last, partial batch
scroll.close();
```
Obviously, the Criteria query is defined strictly according to the application domain and the Hibernate Search mappings (in my case the query loads a set of entities, including the physical files when available).
A ScrollableResults suits the needs here: the search results are traversed and flushed into the index using a batch technique (changes are flushed in chunks).
Session Management
Here is the critical part: making sure the Hibernate session is managed properly. It must be closed after the update completes so that the JDBC connection can return to the pool.
A good strategy is to open a new session every time the update thread kicks in: the thread gets hold of a fresh session that is shared by all operations within its body (criteria query, index flush), ideally in a transactional context.
If you are using Spring (who isn't?), a simple SessionFactoryUtils.getNewSession(sessionFactory) does the trick.
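Putting the pieces together, the thread body could look like the sketch below. This assumes Spring 2.5's SessionFactoryUtils; the `updateIndex` method stands in for the criteria-and-index logic shown earlier and is illustrative, not from the original project:

```java
// obtain a brand-new session, independent of any thread-bound one
Session session = SessionFactoryUtils.getNewSession(sessionFactory);
Transaction tx = null;
try {
    tx = session.beginTransaction();
    updateIndex(session); // criteria query + ftSession.index(...) as above
    tx.commit();
} catch (RuntimeException e) {
    if (tx != null) {
        tx.rollback();
    }
    throw e;
} finally {
    // closes the session (it is not transaction-bound),
    // returning the JDBC connection to the pool
    SessionFactoryUtils.releaseSession(session, sessionFactory);
}
```

The try/finally guarantees the session is released even if the update fails, which matters for a task that re-runs every hour: a leaked connection per run would drain the pool quickly.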
Performance Considerations
A few aspects must be taken into account:
- identify an appropriate batch size for flushing index updates in chunks; this is very important
- disable caching on the Query, as we don't need to load all those objects into the cache
- use MANUAL flush mode: we control the index update ourselves and don't need Hibernate to perform dirty checking
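The last two points translate into a few calls on the session and the Criteria. A sketch, assuming Hibernate 3.6 (the entity class `Document` and the `lastModified` property are hypothetical):

```java
// we never write entities in this session, so take manual control of flushing
// and spare Hibernate the dirty-checking work at flush time
session.setFlushMode(FlushMode.MANUAL);

Criteria query = session.createCriteria(Document.class)
        .add(Restrictions.ge("lastModified", lastIndexUpdate))
        .setCacheable(false)      // skip the query cache entirely
        .setFetchSize(batchSize); // hint the JDBC driver to stream results
```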
Conclusion
Hopefully this helps developers facing scenarios different from the ones covered by the Hibernate docs. The approach described works nicely and supports incremental updates when there is no strict real-time requirement.
Another tip: always provide a way to fully rebuild the index on demand. This is critical for upgrades when the Lucene version changes (the index files need to be re-created) or for recovering from a corrupted index.
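A full rebuild can reuse the same batching loop as the incremental update, preceded by a purge of the existing index. A sketch against the Hibernate Search 3.3 API (the entity class `Document` is hypothetical):

```java
// wipe every existing index entry for the entity type
ftSession.purgeAll(Document.class);
ftSession.flushToIndexes();

// re-index every entity, batching as in the incremental update
ScrollableResults scroll = session.createCriteria(Document.class)
        .scroll(ScrollMode.FORWARD_ONLY);
int batch = 0;
while (scroll.next()) {
    ftSession.index(scroll.get(0));
    if (++batch % batchSize == 0) {
        ftSession.flushToIndexes();
        ftSession.clear();
    }
}
ftSession.flushToIndexes();
scroll.close();

// compact the freshly rebuilt index files before replication
ftSession.getSearchFactory().optimize();
```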
References
Hibernate Search documentation
Hibernate forums