This week, the Apache Software Foundation released version 4.0 of the Lucene/Solr project. Solr is an open-source search and indexing platform built on the Lucene engine. It facilitates full-text search, hit highlighting, and faceted search across plain text, rich documents (e.g., Word, PDF), and geospatial data. Although it is outside Solr's intended purpose, organizations have even started using it as a “NoSQL” datastore.
Solr is often used as the search engine in eCommerce implementations due to its ability to “facet” results. Faceting is a fancy term for organizing results into common groups, typically with a count of how many products match each facet.
The search performance of Solr is so great that it has become the default search implementation in many popular eCommerce platforms, including Broadleaf Commerce, an open source framework for building custom eCommerce solutions. When configured to use Solr for search, Broadleaf Commerce requires a response from Solr to fulfill any catalog queries. By default, Broadleaf Commerce utilizes Solr in an embedded fashion, but production environments should query against stand-alone Solr servers.
Because Broadleaf Commerce relies on Solr to browse a catalog, it is imperative that the Solr implementation run in a high availability environment. Solr 4.0 includes a new high availability feature called SolrCloud, which uses sharding and replication to keep the index available and fault tolerant. It employs Apache ZooKeeper to maintain configuration and coordinate state among the shards.
The Solr project maintains a wiki outlining three basic implementations of SolrCloud. To keep implementation simple, SolrCloud includes an embedded version of ZooKeeper, and the configurations outlined in the wiki leverage it. Although that is acceptable for development and testing, performance and high availability concerns call for separating Solr and ZooKeeper in production.
Implementing a production-worthy SolrCloud involves highly available installations of both ZooKeeper and SolrCloud. Below, I will walk through setting up each component and conclude with an approach you can use to test your SolrCloud.
Infrastructure and Architecture
For ease of implementation, I have constructed the infrastructure outlined below in Amazon EC2, which lets me create and duplicate new servers quickly. For this configuration, I have deployed seven EC2 instances within a Virtual Private Cloud (VPC): three for my ZooKeeper ensemble (i.e., cluster) and four for the SolrCloud. Finally, I have an internal load balancer to distribute Solr queries. Although the load balancer is not strictly necessary, it makes issuing requests much easier: I can point all requests at the load balancer and let it farm them out across the available Solr servers.
To operate in a highly available manner, ZooKeeper requires that a majority of the servers in the ensemble be up. For this reason you will need an odd number of servers, with a minimum of three. ZooKeeper will automatically elect a leader of the ensemble.
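To see why an odd ensemble size matters, the quorum for n servers is floor(n/2) + 1, so a fourth server would raise the quorum without letting you survive any more failures than three servers do. A quick sketch of the arithmetic:

```shell
# A ZooKeeper ensemble of n servers needs a quorum of floor(n/2) + 1 to stay up
quorum() { echo $(( $1 / 2 + 1 )); }

for n in 3 4 5 7; do
  echo "ensemble=$n quorum=$(quorum $n) tolerated_failures=$(( n - $(quorum $n) ))"
done
```

Note that four servers tolerate only one failure, the same as three, which is why even ensemble sizes buy you nothing.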
Once ZooKeeper is installed, there are two configuration files that need to be created or modified. The first lives in the conf directory of your ZooKeeper installation. You can choose the name of the file; I chose zoo.cfg for simplicity. I copied the contents of zoo_sample.cfg and added the servers in my ensemble at the end. My final configuration file looks like this:
# The number of milliseconds of each tick
tickTime=2000
# The number of ticks that the initial
# synchronization phase can take
initLimit=10
# The number of ticks that can pass between
# sending a request and getting an acknowledgement
syncLimit=5
# the directory where the snapshot is stored.
dataDir=/var/lib/zookeeper
# the port at which the clients will connect
clientPort=2181
# ensemble members:
server.1=awszk0:2888:3888
server.2=awszk1:2888:3888
server.3=awszk2:2888:3888
Notice that I have defined three servers in the ensemble. I used FQDNs, but you can also use IP addresses if you prefer. The two numbers after each hostname are the peer communication port (used for synchronization between the servers) and the leader election port for that ZooKeeper instance. Clients, including Solr, connect on the clientPort (2181) defined above.
The other file you must add is in the dataDir defined above. The name of the file is myid, and its only content is the identifying number for the server, matching its server.N entry in the configuration. No carriage returns or spaces, just the number.
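For example, on the first server (server.1 in zoo.cfg) the file might be created like this; I use a temporary directory here as a stand-in for the real dataDir (/var/lib/zookeeper):

```shell
# Stand-in for the dataDir from zoo.cfg (/var/lib/zookeeper on the real server)
DATADIR=$(mktemp -d)
# This host is server.1 in zoo.cfg, so myid contains just "1" --
# printf avoids the trailing newline that echo would add
printf '1' > "$DATADIR/myid"
cat "$DATADIR/myid"
```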
With those minor configuration changes, you are now ready to fire up your ZooKeeper ensemble. You can start the three servers in any order using bin/zkServer.sh start from the ZooKeeper installation directory. They will independently elect a leader, and they will automatically elect a new one if the leader goes down.
For the Solr configuration, I’m going to once again take the path of least resistance. I will use the example core that comes with the Solr installation. Adapt this process to fit your implementation.
Activating SolrCloud is as simple as adding a zkHost directive to your startup script. However, the first time you run Solr in cloud mode, you must initialize the ZooKeeper configuration. To do so, start Solr in this fashion:
java -Dbootstrap_confdir=./solr/collection1/conf -Dcollection.configName=solrdemo -DzkHost=awszk0:2181,awszk1:2181,awszk2:2181 -DnumShards=2 -jar start.jar
Let’s now dissect the Java directives in this command. The first is -Dbootstrap_confdir, which defines the local directory used to seed the configuration for all subsequent Solr instances. The configuration files in this directory are pushed to ZooKeeper, and as Solr instances join our cloud, they receive this configuration from ZooKeeper. Next, -Dcollection.configName assigns a logical name to this configuration in ZooKeeper; I chose solrdemo as a simple name. These two directives set up our cloud in ZooKeeper. At this point I recommend stopping your Solr instance. We will restart it with our production-ready startup command:
java -DzkHost=awszk0:2181,awszk1:2181,awszk2:2181 -DnumShards=2 -jar start.jar
There are two more directives to observe here. -DzkHost is what actually triggers SolrCloud mode in Solr; it lists every ZooKeeper instance in our ensemble along with the port each is listening on. Finally, -DnumShards defines the number of shards. Since we are running four servers in our cloud, we will have two shards, each with one replica.
The role of each instance in our cloud will be defined as we start the process on each of our EC2 instances. The first two will be defined as shards 1 and 2. The third instance to start will be a replica of shard 1; the fourth will be a replica of shard 2.
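That startup-order assignment can be sketched as follows; the hostnames are made up for illustration, and the loop simply mirrors the round-robin behavior described above (it is not how Solr itself is invoked):

```shell
# Hypothetical sketch of SolrCloud role assignment with -DnumShards=2
# and four nodes joining in startup order (hostnames are invented)
NUM_SHARDS=2
i=0
ASSIGNMENTS=""
for host in awssolr0 awssolr1 awssolr2 awssolr3; do
  shard=$(( i % NUM_SHARDS + 1 ))                  # cycle through the shards
  [ $i -lt $NUM_SHARDS ] && role="leader" || role="replica"
  line="$host -> shard$shard ($role)"
  echo "$line"
  ASSIGNMENTS="$ASSIGNMENTS$line
"
  i=$(( i + 1 ))
done
```

The first pass through the shard list creates the leaders; every later node becomes a replica of the shard it lands on.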
That’s it! You are now up and running with a highly available, fault-tolerant Solr implementation!
Depending on how you will be accessing Solr, you may want a load balancer in front of your cloud. Any of the Solr instances, shard or replica, can service requests on the SolrCloud.
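For example, a query can be pointed at the load balancer instead of any particular shard. The hostname awslb below is a hypothetical DNS name for the internal load balancer; since this assumes a running cluster, the snippet only assembles and prints the request rather than executing it:

```shell
# Hypothetical load balancer hostname; any SolrCloud node would serve the
# query equally well
SOLR_LB="awslb:8983"
QUERY_URL="http://${SOLR_LB}/solr/collection1/select?q=*:*&wt=json"
# Against a live cluster you would run:  curl "$QUERY_URL"
echo "$QUERY_URL"
```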
If you are using SolrJ exclusively to access your cloud, you can skip the load balancer: instead of implementing HttpSolrServer, use LBHttpSolrServer, which provides a simple round-robin load balancer inside your Java application.
Try It Out
To test my SolrCloud, I loaded sample data from the example core included in the Solr installation:
java -Durl=http://awssc:8983/solr/collection1/update -jar post.jar ipod_video.xml
java -Durl=http://awssc:8983/solr/collection1/update -jar post.jar mem.xml
java -Durl=http://awssc:8983/solr/collection1/update -jar post.jar monitor.xml
java -Durl=http://awssc:8983/solr/collection1/update -jar post.jar monitor2.xml
At this point, you can throw requests at your cloud. I recommend using Apache Bench, included in the httpd package. Play around with dropping and restarting Solr instances. Watch your logs as you do so. You will see the requests get reallocated as the instances go up and down.
You can keep humming as long as a majority of your ZooKeeper ensemble is alive and at least one member of each shard/replica pair is active. You can scale this environment out to fit your individual needs.
Searching for products in the catalog is one of the most frequently used functions of an eCommerce site. Fast, reliable results must be considered a top priority in your high availability strategy. Utilizing the features I have outlined above will keep costs low and reliability high by taking advantage of Solr’s built-in functionality.
Please share your successes and challenges with SolrCloud in the comments below.