Thursday 23 May 2013

SOLR performance: slow disk IO problem

Problem

My colleagues and I have been having significant performance issues with SOLR, which I know many other organisations use on very large indexes without any problems.

Our SOLR instance has indexed approximately 5 million documents. The contents of the documents are kept as stored fields as well as being indexed, and most of the documents are quite large.

The SOLR data directory is around 500 GB in size.

Our SOLR version is 1.4, although upgrading to 4.1 made no difference.

There is only a single shard and single core.

The SOLR server is on a Windows Server VM with 24 GB of RAM and 8 CPUs.

We use VMware as our virtualisation platform.

Our Java web application, which runs in Tomcat, uses the SOLRj client to make requests to SOLR. On each user search we request 1,000 results from SOLR, because the permissions to the documents in our SOLR index are handled by a database server which SOLR has no visibility of. We need those 1,000 records so that we can filter the results based on the user's access and then show only the top 100 results they actually have access to.
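To illustrate the pattern, here is a minimal SOLRj sketch of this over-fetch-then-filter search. This is not our production code: the SOLR URL, the "id" field name and the PermissionService interface are all hypothetical stand-ins, and it assumes a SOLR 4.x SOLRj client.

import java.util.ArrayList;
import java.util.List;

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

public class SearchService {

    // Hypothetical SOLR URL; substitute your own.
    private final HttpSolrServer solr = new HttpSolrServer("http://solr-host:8983/solr");
    private final PermissionService permissions;

    public SearchService(PermissionService permissions) {
        this.permissions = permissions;
    }

    public List<SolrDocument> search(String userId, String userQuery) throws SolrServerException {
        SolrQuery query = new SolrQuery(userQuery);
        query.setRows(1000); // over-fetch: SOLR cannot see permissions, so we filter afterwards

        QueryResponse response = solr.query(query);

        List<SolrDocument> visible = new ArrayList<SolrDocument>();
        for (SolrDocument doc : response.getResults()) {
            String docId = (String) doc.getFieldValue("id");
            if (permissions.canRead(userId, docId)) {   // database-backed access check
                visible.add(doc);
                if (visible.size() == 100) {
                    break; // only the top 100 accessible results are shown to the user
                }
            }
        }
        return visible;
    }

    // Hypothetical interface standing in for our database permission lookup.
    public interface PermissionService {
        boolean canRead(String userId, String docId);
    }
}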

Our storage solution is a Dell EqualLogic SAN connected over 10Gb iSCSI, which we know is capable of high throughput and high IOPS.

We were finding that there would be a significant delay whenever SOLR had to go to disk to load index data that was not cached in memory. This delay could be anything up to two minutes at times.

After doing lots of research we found many blogs and discussions on the Internet suggesting ways to tune and optimise SOLR. Some of these helped slightly, but we still seemed to have this underlying disk IO issue.

When SOLR was busy reading index data from disk, we took a look at the SAN IO graphs and noticed that SOLR was only managing random reads of the index files at around 6 MB per second. We knew our SAN was capable of much more than this.

Solution

After trying many optimisations, the only solution that worked for us was to have our Java application make 10 concurrent requests to SOLR, each asking for a 100-record result set, instead of a single request for 1,000 records. This had a significant positive impact on performance: with multiple threads in flight, SOLR could read data from the storage array much faster, even though there is still only a single SOLR server with a single shard.
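Here is a rough sketch of how that fan-out might look with SOLRj. Again, the URL, thread-pool size and class names are assumptions rather than our exact code; the idea is simply to page the same query in ten 100-row chunks in parallel and stitch the pages back together in order.

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrDocumentList;

public class ParallelSearch {

    private static final int PAGE_SIZE = 100;
    private static final int PAGES = 10; // 10 x 100 = the same 1,000 results as before

    // Hypothetical SOLR URL; HttpSolrServer is thread-safe, so one instance is shared.
    private final HttpSolrServer solr = new HttpSolrServer("http://solr-host:8983/solr");
    private final ExecutorService pool = Executors.newFixedThreadPool(PAGES);

    public List<SolrDocument> fetchTop1000(final String userQuery) throws Exception {
        List<Future<SolrDocumentList>> futures = new ArrayList<Future<SolrDocumentList>>();

        // Fire off one request per 100-row page so the reads hit the SAN in parallel.
        for (int page = 0; page < PAGES; page++) {
            final int start = page * PAGE_SIZE;
            futures.add(pool.submit(new Callable<SolrDocumentList>() {
                public SolrDocumentList call() throws Exception {
                    SolrQuery query = new SolrQuery(userQuery);
                    query.setStart(start);    // offset of this page
                    query.setRows(PAGE_SIZE); // 100 rows per request
                    return solr.query(query).getResults();
                }
            }));
        }

        // Reassemble the pages in order so the combined list matches a single 1,000-row request.
        List<SolrDocument> combined = new ArrayList<SolrDocument>();
        for (Future<SolrDocumentList> future : futures) {
            combined.addAll(future.get());
        }
        return combined;
    }
}

Because each page runs the same query with a different start offset, the combined list should match a single 1,000-row request, provided the index is not committed to between the requests.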

We are not 100% sure why we had to do this, but it worked for us. It could be down to any of the following:

1. We use iSCSI to connect to our storage, which I have heard can have IO issues in certain single-threaded scenarios.
2. We use a Windows environment rather than UNIX.
3. It may be a Windows / VMware / Dell iSCSI specific issue.
4. It may be a limitation of the SOLRj client.