Category Archives: search


Why Enterprise Search is critical to the success of your business

Given that Google handles some 400,000 search queries every single second, we all know how important search is in our daily lives.

Turning to business, take the example of a bank, an e-commerce shop, a high-street retailer or an energy company. They all need a system that allows:

  • their potential customers to quickly and easily find the product they are looking for
  • their existing customers or employees to find help, support and advice
  • their employees and partners to easily find project and product documentation such as PDF files, MS Word documents, spreadsheets, documents shared across the company network, documents from the company CMS or CRM, and the list goes on and on…

 

And failing to provide such a system causes real losses for those companies.

Just imagine a home owner trying to re-mortgage his house for the very first time.
He goes to his bank's web site and searches for the phrase “renew my mortgage”.
He gets back result pages like:

  • “Applying for a mortgage”,
  • “My Mortgage”
  • “Expired credit card”
  • etc…

He gets all sort of results except what he is really looking for.

But when he searches for the word “remortgage”, he gets the right answer.

This is very frustrating and a proper Search implementation would have helped both the customer and the Bank.

The same goes for a search for “red shirt” that mostly returns products from a brand called “Red Foo”.
There are many examples of businesses losing customers just because they do not have the right technology.

At Menelic, we specialize in Enterprise Search, and we strongly believe that every organization, every business and every enterprise needs a proper search solution.

At Menelic, we go to extra lengths to make sure your clients, potential customers and employees get the right result whenever they perform a search.
Most importantly, we spend far more time on the top search terms, making sure they always return the right hit.

To find out how we can help you, please contact us.



ZooKeeper: shutdown Leader! reason: Not sufficient followers synced, only synced with sids

We have been running this cross DC SolrCloud cluster for over a year now and things have been working well for us.

A couple of weeks ago, in one of our non-production environments, our monitoring system went mad as our ZooKeeper quorum shut itself down, leaving our SolrCloud cluster in a read-only state.

The network seemed OK and no other system was affected.

Although this was only a non-production system, we spent some time investigating the issue by looking at the logs and the system configuration files.

The ZooKeeper Leader

The log file on the ZooKeeper leader node showed that, at the time of the incident, we had:

[QuorumPeer[myid=K]/0.0.0.0:2181:Leader@493] - Shutting down
[myid:K] - INFO  [QuorumPeer[myid=4]/0.0.0.0:2181:Leader@499] - Shutdown called
java.lang.Exception: shutdown Leader! reason: Not sufficient followers synced, only synced with sids: [ K ]
at org.apache.zookeeper.server.quorum.Leader.shutdown(Leader.java:499)
at org.apache.zookeeper.server.quorum.Leader.lead(Leader.java:474)
at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:799)

The above log entries reveal that within the allocated time (time-out T), no follower was able to sync data from the leader ZK node with myid K.
The leader (with id K), not having enough followers to maintain the quorum of 5, deliberately shut itself down.

The ZooKeeper Followers

The log entries on the followers are identical and go as follows:

[myid:L] - WARN  [QuorumPeer[myid=1]/0.0.0.0:2181:Follower@89] - Exception when following the leader
java.net.SocketTimeoutException: Read timed out
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.read(SocketInputStream.java:152)
at java.net.SocketInputStream.read(SocketInputStream.java:122)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
at java.io.BufferedInputStream.read(BufferedInputStream.java:254)
at java.io.DataInputStream.readInt(DataInputStream.java:387)
at org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
at org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:83)
at org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:103)
at org.apache.zookeeper.server.quorum.Learner.readPacket(Learner.java:153)
at org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:85)
at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:786)
[myid:L] - INFO  [QuorumPeer[myid=1]/0.0.0.0:2181:Follower@166] - shutdown called
java.lang.Exception: shutdown Follower
at org.apache.zookeeper.server.quorum.Follower.shutdown(Follower.java:166)

From the above entries, we can deduce that the followers were trying to sync data from the leader at the same time and threw java.net.SocketTimeoutException: Read timed out within the allocated time-out T.

The ZooKeeper config

Now, looking at our configuration, we have among others the following lines:

tickTime=2000
initLimit=5
syncLimit=2

This means that the time-out T I was referring to in this post is syncLimit × tickTime:

T = 2 * 2000 = 4000 ms = 4 sec

4 sec is definitely not enough for syncing SolrCloud configuration data across multiple DCs.
These were clearly the default values that originally shipped with ZK and were never changed to reflect our deployment.

 

The fix

We changed the config to the one below

tickTime=4000
initLimit=30
syncLimit=15

Now, we are giving each ZK follower node 60 seconds (syncLimit × tickTime = 15 × 4000 ms) to sync data with the leader.
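If you want to double-check what a given config implies, a trivial sketch like the one below prints the effective timeouts from a zoo.cfg, using the standard formulas initLimit × tickTime and syncLimit × tickTime:

import java.io.FileInputStream;
import java.util.Properties;

/** Tiny helper: read a zoo.cfg and print the effective ZooKeeper timeouts. */
public class ZkTimeoutCheck {
    public static void main(String[] args) throws Exception {
        Properties cfg = new Properties();
        try (FileInputStream in = new FileInputStream(args.length > 0 ? args[0] : "zoo.cfg")) {
            cfg.load(in); // zoo.cfg is plain key=value, so Properties can parse it
        }
        // Fallback values below are arbitrary; your zoo.cfg should set these explicitly.
        int tickTime  = Integer.parseInt(cfg.getProperty("tickTime", "2000"));
        int initLimit = Integer.parseInt(cfg.getProperty("initLimit", "10"));
        int syncLimit = Integer.parseInt(cfg.getProperty("syncLimit", "5"));

        System.out.println("follower initial sync timeout: " + initLimit * tickTime + " ms");
        System.out.println("follower sync/heartbeat timeout: " + syncLimit * tickTime + " ms");
    }
}

With our old config this prints 4000 ms for the sync timeout; with the new one it prints 60000 ms.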

Other recommendations

– I would strongly recommend reading the ZooKeeper manual to understand the meaning of configuration options such as tickTime, initLimit and syncLimit, and checking your ZK config files to make sure they are correct.

– If your ZooKeeper server does not have an IPv6 address, make sure you add

-Djava.net.preferIPv4Stack=true

to your ZK start-up script. This will help avoid all sorts of leader election issues (see [3] in the resources section below).

– By default, the RAM used by ZK depends on what is available on the system. It is recommended to explicitly allocate the heap size that ZK should use. This can be done by adding the following line to conf/java.env:

export JVMFLAGS="-Xms2g -Xmx2g"

You may want to change 2g to fit your needs.

– It is a good idea to leave enough RAM for the OS and to monitor the ZK node to make sure it NEVER swaps!

Resources

  1. https://zookeeper.apache.org/doc/trunk/zookeeperAdmin.html#sc_configuration
  2. https://zookeeper.apache.org/doc/trunk/zookeeperAdmin.html#sc_bestPractices
  3. http://lucene.472066.n3.nabble.com/Zookeeper-Quorum-leader-election-td4227130.html
  4. https://mail-archives.apache.org/mod_mbox/zookeeper-user/201505.mbox/%3CCANLc_9+5c-4eqGNx_mbXOH3MViiuBVbMLNPP3xhafQA2xQ=POg@mail.gmail.com%3E
  5. http://www.markround.com/ (thanks for figuring out the IPv6 issue)

 



Allowing SolrJ CloudSolrClient to have preferred replica for query operations

In the previous blog post, I discussed how HTTP compression helped us improve Solr response time and reduce network traffic in our cross-DC SolrCloud deployment.

In our deployment model, we have only 1 shard per collection and, in terms of content, all SolrCloud nodes are identical.

Figure: API and SolrCloud traffic across two DCs

Let’s assume that:

  1. a request comes from the load balancer and lands on API1 in DC1,
  2. then API1 queries Solr repl4, which is in DC2,
  3. the response travels from DC2 back to API1 in DC1,
  4. API1 finally sends the response back to the client.

As stated earlier, all SolrCloud nodes have the very same content and are just replicas of the same collection.

The question is: why should API1 go all the way to repl4 in DC2 to fetch data that is also available in repl1 and repl2 in DC1? There is certainly a better way.

To address this, we are proposing SOLR-8146 to the community.

How it works

  1. Internally, the SolrJ client queries ZooKeeper to find out the live replicas of the collection being queried.
  2. SolrJ also acts as a load balancer: before querying Solr, it shuffles the list of replica URLs, and the one at the top of the list is used for the query. The second one is used only if the first one fails.
  3. After the list is shuffled, we check whether the current request is a query operation or not.
  4. Only if it is a query operation is SOLR-8146 applied, by moving to the top of the list the URLs matching the specified Java regular expression. The pattern could be, for instance, an IP address or a port number. I would recommend checking the tests in the source code of the patch at SOLR-8146; a minimal sketch of the reordering idea follows this list.
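The sketch below is purely illustrative and is not the actual SOLR-8146 patch code: given the already shuffled replica URLs and a regular expression (as would be supplied via solr.preferredQueryNodePattern), URLs matching the pattern are simply moved to the front of the list while their relative order is preserved.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class PreferredReplicaOrdering {

    // Returns a new list with URLs matching the pattern first; relative order is preserved.
    static List<String> preferMatching(List<String> shuffledUrls, String regex) {
        Pattern pattern = Pattern.compile(regex);
        List<String> preferred = new ArrayList<>();
        List<String> others = new ArrayList<>();
        for (String url : shuffledUrls) {
            if (pattern.matcher(url).matches()) {
                preferred.add(url);
            } else {
                others.add(url);
            }
        }
        preferred.addAll(others);
        return preferred;
    }

    public static void main(String[] args) {
        List<String> urls = Arrays.asList(
                "http://repl4.dc2:8983/solr/collection1",
                "http://repl1.dc1:8983/solr/collection1",
                "http://repl2.dc1:8983/solr/collection1");
        // Prefer anything in DC1 (hostnames here are made up for the example).
        System.out.println(preferMatching(urls, ".*\\.dc1:.*"));
    }
}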

Notes

  1. SOLR-8146 only deals with read or query operations. Admin, update and delete operations are not affected by the patch.
  2. SOLR-8146 changes only the SolrJ client behaviour.
  3. SOLR-8146 comes into play if and only if the system property solr.preferredQueryNodePattern is set, either via the standard java -D command-line switch or in Java code via System.setProperty() (a small example follows this list).
  4. SOLR-8146 will still work no matter how many collections are deployed.
  5. SOLR-8146 will still work no matter how many shards are deployed.
  6. SOLR-8146 does not add nodes to or remove nodes from the list of live Solr nodes to query; it just re-orders the list so that the nodes matching the specified pattern are picked first.
  7. One does not have to run SolrCloud across multiple DCs in order to take advantage of SOLR-8146. There are many other use cases, such as:
    1. a cluster running across multiple racks, where client APIs in rack1 should talk to Solr servers in rack1 only
    2. a SolrCloud cluster where one node is reserved for analytics, manual slow queries or batch processing; SOLR-8146 would help keep such a node away from regular SolrJ queries
    3. etc.
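As a small illustration of note 3 above (the pattern value, ZK connect string and collection name are placeholders, not recommendations), the property can be set programmatically before the CloudSolrClient is created:

// Prefer replicas whose URL contains ".dc1:" for query operations (illustrative pattern).
System.setProperty("solr.preferredQueryNodePattern", ".*\\.dc1:.*");

// Standard Solr 5.x CloudSolrClient creation; the ZK connect string is a placeholder.
CloudSolrClient cloudSolrClient = new CloudSolrClient("zk1:2181,zk2:2181,zk3:2181");
cloudSolrClient.setDefaultCollection("collection1");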

Conclusion

SOLR-8146 brings more flexibility to the way the SolrJ load balancer selects the nodes to query, and this has many use cases.

Hopefully, it will be useful to others too.



Deploying SolrCloud across multiple Data Centers (DC): Performance

After deploying our search platform across multiple DCs, we load tested the Search API.

We were not too impressed by the initial results.

We had issues like:
– high response times,
– high network traffic,
– long-running queries.

After investigation, it turned out that a large amount of search result data was being transferred between the SolrCloud nodes and the search API.

This is because clients were requesting a large number of documents.
It turned out that this was a business requirement and we could not put a cap on this.

HTTP compression to the rescue

Solr supports HTTP compression. This support is provided by the underlying Jetty Servlet Container.

To enable HTTP compression for Solr, two steps are required:

  1. Server Configuration

    To configure Solr 5 for HTTP compression, one needs to edit the file
    server/contexts/solr-jetty-context.xml, adding, before the closing </Configure> tag, an XML snippet along the following lines:

    
    <Call name="addFilter">
      <Arg>org.eclipse.jetty.servlets.GzipFilter</Arg>
      <Arg>/*</Arg>
      <Arg>
        <Call class="java.util.EnumSet" name="of">
          <Arg><Get class="javax.servlet.DispatcherType" name="REQUEST"/></Arg>
        </Call>
      </Arg>
      <Call name="setInitParameter">
        <Arg>mimetypes</Arg>
        <Arg>text/html,text/xml,text/plain,text/css,text/javascript,text/json,application/x-javascript,application/javascript,application/json,application/xml,application/xml+xhtml,image/svg+xml</Arg>
      </Call>
      <Call name="setInitParameter">
        <Arg>methods</Arg>
        <Arg>GET,POST</Arg>
      </Call>
    </Call>

    The next step is to set the gzip header on the client.

  2. Client Configuration

    The SolrJ client needs to send the HTTP header Accept-Encoding: gzip, deflate to the server. Only then will the server respond with compressed data.
    To achieve this, the org.apache.solr.client.solrj.impl.HttpClientUtil utility class is used:

    DefaultHttpClient httpClient = (DefaultHttpClient) cloudSolrClient.getLbClient().getHttpClient();
    HttpClientUtil.setAllowCompression(httpClient, true);
    HttpClientUtil.setMaxConnections(httpClient, maxTotalConnections);
    HttpClientUtil.setMaxConnectionsPerHost(httpClient, defaultMaxConnectionsPerRoute);
    HttpClientUtil.setSoTimeout(httpClient, readTimeout);
    HttpClientUtil.setConnectionTimeout(httpClient, connectTimeout);
    

    Note that in the code above we not only enable compression, but we also set soTimeout and connectionTimeout on the client (a short usage example follows this list).

  3. The result

    1. Before enabling compression, our total network traffic was about 12,000 KB/s.
    2. After the changes, we dropped to 3,000 KB/s, i.e. just 25% of the original traffic; in other words, a 75% drop in network traffic!
    3. We also saw response times drop by more than 60%!
    4. There is a price to pay for all of this: we noticed a slight increase in CPU usage.
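To round off the client side, a minimal, illustrative query through the compression-enabled client could look like the snippet below; the collection name and row count are placeholders, and cloudSolrClient is the instance configured above:

// A large page size is exactly the case where gzip compression pays off; 500 is just an example.
SolrQuery query = new SolrQuery("*:*");
query.setRows(500);

cloudSolrClient.setDefaultCollection("myCollection"); // placeholder collection name
QueryResponse rsp = cloudSolrClient.query(query);
System.out.println("Found " + rsp.getResults().getNumFound() + " documents");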

Conclusion

While HTTP compression can be very beneficial when serving large responses, it is not always the answer.
If possible, it is better to serve small responses (for instance 10-40 items per page).

In the next blog, I will share some of the challenges we have been facing.



Deploying SolrCloud across multiple Data Centers (DC)

Objectives

Our objective is to deploy SolrCloud (5.X) across 2 DCs in active-active mode so that we still have all our search services available in the unfortunate event of a Data Centre loss.

Background

We used to run a Solr 3.x cluster in the traditional master-slave mode.
This worked very well for us for many years.
When Solr4 came out, we upgraded the cluster to the latest version of Solr, but still using the traditional master-slave architecture.

Why the move to SolrCloud?
There were many reasons behind this move. Below is a subset of them:

  1. The need for near real-time (NRT) search so that any update is immediately available to search,
  2. The ability to add more nodes to the cluster and scale as needed,
  3. The ability to deploy our search services in 2 DCs in active-active mode i.e. queries are simultaneously being served by both DCs,
  4. The ability to easily shard collections,
  5. The ability to avoid any single point of failure and make sure that the search platform is up and running in case one DC is lost.

The initial goal was to have our SolrCloud cluster deployed in 2 DCs meaning a ZooKeeper cluster spanning across 2 DCs.
As of the time of this writing, for Solr 5.3 and ZooKeeper 3.4.6 to work in a redundant manner across multiple DCs, we need 3 DCs, the 3rd DC being used solely for hosting one ZK node in order to maintain the quorum in case one DC is lost.

In summary, we have 3 private corporate DCs, connected by high-speed gigabit fiber-optic links, so network latency is minimal.

Deployment

Note that we are well aware of SOLR-6273, which is currently being implemented, and of the related blog entry at yonik.com.

We are also aware of SolrCloud HAFT.

ZK Deployment:

  1. DC1: 2 ZK nodes
  2. DC2: 2 ZK nodes
  3. DC3: 1 ZK node

This is a standard ZK deployment forming an ensemble of 5 nodes spanning 3 DCs; losing any single DC still leaves at least 3 nodes, which is a majority of 5, so the quorum survives.
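For illustration only, such an ensemble would be declared in every node's zoo.cfg along these lines (hostnames are placeholders; the ports are the conventional ZooKeeper peer and election ports):

server.1=zk1.dc1.example.com:2888:3888
server.2=zk2.dc1.example.com:2888:3888
server.3=zk3.dc2.example.com:2888:3888
server.4=zk4.dc2.example.com:2888:3888
server.5=zk5.dc3.example.com:2888:3888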

SolrCloud deployment:

In total there are 8 SolrCloud nodes, 4 in each of the two DCs; DC3 has no SolrCloud nodes:

  1. DC1: 4 SolrCloud nodes
  2. DC2: 4 SolrCloud nodes
  3. DC3: 0 (no) SolrCloud

Ingest Services deployment

Figure: Cross-DC SolrCloud deployment architecture

The Ingest Service is used to push data through to the SolrCloud cluster.
It is built using SolrJ, so it talks to the ZK cluster as well.

  1. DC1 : 1 Ingest Service
  2. DC2 : 1 Ingest Service
  3. DC3 : 0 Ingest Service

Note that the Ingest Services work in a round-robin fashion: at any given moment, only one of them is actively ingesting. The other one stays in standby mode and takes over only if it is the first to “acquire the lock” (a sketch of one way to implement such a lock follows).
So, data flows from the active Ingest Service to the SolrCloud leader of a given collection, regardless of where that leader is located.
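The post does not show how this lock is implemented, so treat the following purely as an illustration: one way to build such a standby lock on top of the existing ZK quorum is Apache Curator's InterProcessMutex (the library choice, connect string and ZK path are assumptions, not necessarily what our Ingest Service uses).

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.recipes.locks.InterProcessMutex;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class IngestLockSketch {
    public static void main(String[] args) throws Exception {
        // Connect string and lock path are placeholders for this illustration.
        CuratorFramework zk = CuratorFrameworkFactory.newClient(
                "zk1:2181,zk2:2181,zk3:2181,zk4:2181,zk5:2181",
                new ExponentialBackoffRetry(1000, 3));
        zk.start();

        InterProcessMutex lock = new InterProcessMutex(zk, "/ingest/active-lock");
        lock.acquire(); // blocks until this instance holds the lock; the other stays in standby
        try {
            runIngest(); // only the lock holder pushes data into SolrCloud
        } finally {
            lock.release();
            zk.close();
        }
    }

    private static void runIngest() {
        // Placeholder for the actual ingest logic.
    }
}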

Search API deployment:

  1. DC1 : 2 API nodes
  2. DC2 : 2 API nodes
  3. DC3 : 0 (no) API node

This API is built using SolrJ and is used by many client applications for search and suggestions.

Important note

In this deployment model, the killer point is inter-DC connectivity latency.
If there is high latency between the 3 DCs, it will inevitably kill our ZK quorum.
In our specific case, all 3 DCs are UK based and have fat pipes connecting them together.

Conclusion

During this process, we came across many issues, which we managed to overcome.

In the next blog post, I will be sharing with you the challenges we faced and how we addressed them.

Resources

  1. ZooKeeper Internals
  2. Mailing list thread about SolrCloud across multiple DC
  3. Presentation about SolrCloud HAFT


Solr 5.0.0 released


Solr 5.0.0 was officially released on 20 February 2015.

This is a major release and, as such, there have been significant changes in both the Lucene and Solr code bases.
Here, I am going to discuss 3 of the changes:

  1. Solr 5 as a standalone server
    The most important change, in my opinion, is that from now on Solr is a standalone server, just like Elasticsearch, Cassandra or MongoDB.
    The distribution comes with a set of scripts in the bin/ directory that let users run Solr without installing a servlet container.

    $ tree bin/
    bin/
    ├── init.d
    │ └── solr
    ├── install_solr_service.sh
    ├── oom_solr.sh
    ├── post
    ├── solr
    ├── solr.cmd
    ├── solr.in.cmd
    ├── solr.in.sh
    └── solr-8983.port
    1 directory, 9 files

    And there are a lot of goodies in the bin/ directory, such as:

    • bin/install_solr_service.sh
      can be used to install Solr as a service on Unix-like systems
    • There are now some default GC tuning parameters available in
      bin/solr.in.sh to help reduce the guesswork. Note that these can easily be overridden if needed.
    • There is also
      bin/oom_solr.sh
      which gets executed automatically to kill Solr in case the worst happens and Solr falls over with an OutOfMemoryError.
    • To see the available options for starting Solr, just try
      bin/solr start -help
      You may also want to look at the bundled documentation at docs/quickstart.html
    • For Windows users, there is only
      solr.cmd
      meaning no automatic service installation for Windows at this moment.
      Note that there are many external tools that allow deploying a .bat file as a service on Windows.
    • Note that as of Solr 5.0.0, under the hood, Jetty is embedded in the distribution and there is still a solr.war file involved:

      $ tree server/webapps/
      server/webapps/
      └── solr.war
      0 directories, 1 file

      This means that, although it is not recommended, one would still be able to deploy Solr 5 into a custom servlet container if needed, using the provided solr.war.
  2. Solrj: The Java Client
    For client applications using SolrJ, the old abstract class SolrServer has been deprecated in favor of the new and shiny abstract class SolrClient, for obvious reasons.
    From now on, we should all be using implementations such as CloudSolrClient, ConcurrentUpdateSolrClient, HttpSolrClient or LBHttpSolrClient instead (a short example follows this list).
  3. Lucene core library
    There have been many important changes in the core library such as
    more robust index I/O operations by moving to NIO.2, a reduced memory footprint and various other optimizations. In addition, LUCENE-6050, which we have been waiting for, has finally been released. In other words, the Lucene AnalyzingInfixSuggester now allows you to specify whether a context should be applied as a MUST or as a SHOULD operation.
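As a small illustration of point 2 above (not official migration guidance; the URL and core name are placeholders), a Solr 5 SolrJ client can be created and queried like this:

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class Solr5ClientExample {
    public static void main(String[] args) throws Exception {
        // HttpSolrClient replaces the old HttpSolrServer; the URL and core below are placeholders.
        SolrClient client = new HttpSolrClient("http://localhost:8983/solr/mycollection");
        try {
            QueryResponse rsp = client.query(new SolrQuery("*:*"));
            System.out.println("Found " + rsp.getResults().getNumFound() + " documents");
        } finally {
            client.close();
        }
    }
}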

Resources