Thursday, December 12, 2013

Auto Suggest Using Solr - Configuring with SolrJ for Java Web Applications

I was tasked with setting up an auto suggest feature for the e-commerce site we were working on, and I chose Solr's auto suggest feature because:

1. The e-commerce site was already running on Solr for its search.
2. It required minimal config changes.

The challenge, though, was that Solr was being run as an embedded server, so I could talk to it only through the SolrJ library.

First, we have to make a few changes to schema.xml and solrconfig.xml to get this working.

schema.xml

1. Define a new field type "auto_suggest" in your schema.xml

<fieldType name="auto_suggest" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

2. Define a new field "autocomplete_text" of type "auto_suggest" in your schema.xml

<field name="autocomplete_text" type="auto_suggest" indexed="true" stored="false"  multiValued="true" />

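3. Define a copyField to copy the fields on which you want auto suggest to work into the newly created field. The example below uses source="*", which copies every field; name specific source fields instead if you want to narrow it down.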

<copyField source="*" dest="autocomplete_text"/> 
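
To make the effect of the "auto_suggest" field type concrete, here is a minimal plain-Java sketch of what the analyzer chain does to a field value. It does not use Solr or Lucene; the class and method names are hypothetical and exist only for illustration. The point is that the keyword tokenizer keeps the whole value as one term, which is what allows a multi-word prefix to match later.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Plain-Java illustration only; Solr does this internally with Lucene analyzers.
final class AutoSuggestAnalysisSketch {

    // Mimics KeywordTokenizerFactory + LowerCaseFilterFactory:
    // the entire field value becomes exactly one lowercased term.
    static List<String> keywordLowercase(String fieldValue) {
        List<String> terms = new ArrayList<>();
        terms.add(fieldValue.toLowerCase(Locale.ROOT));
        return terms;
    }

    // For contrast: splitting on whitespace yields one term per word,
    // so a prefix containing a space could never match a single term.
    static List<String> whitespaceLowercase(String fieldValue) {
        List<String> terms = new ArrayList<>();
        for (String word : fieldValue.trim().split("\\s+")) {
            terms.add(word.toLowerCase(Locale.ROOT));
        }
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(keywordLowercase("Heater Coil"));    // [heater coil]
        System.out.println(whitespaceLowercase("Heater Coil")); // [heater, coil]
    }
}
```

Because the field is multiValued and fed by a copyField, each copied value ends up as its own term in "autocomplete_text".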

solrconfig.xml

1. Put in the following search component and request handler to handle the auto suggest terms

<searchComponent name="terms" class="solr.TermsComponent"/>

<requestHandler name="/terms" class="solr.SearchHandler" startup="lazy">
  <lst name="defaults">
    <bool name="terms">true</bool>
    <bool name="distrib">false</bool>
  </lst>
  <arr name="components">
    <str>terms</str>
  </arr>
</requestHandler>
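
The "defaults" list in the handler above supplies parameter values only when the request does not send them; anything sent with the request takes precedence. The following sketch models that merge with plain maps. It is an illustration, not Solr code, and the class and method names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustration of how handler defaults and request parameters combine.
final class HandlerDefaultsSketch {

    // Request parameters win over handler defaults; defaults only fill the gaps.
    static Map<String, String> effectiveParams(Map<String, String> defaults,
                                               Map<String, String> request) {
        Map<String, String> effective = new LinkedHashMap<>(defaults);
        effective.putAll(request);
        return effective;
    }

    public static void main(String[] args) {
        // Values taken from the /terms handler definition.
        Map<String, String> defaults = new LinkedHashMap<>();
        defaults.put("terms", "true");
        defaults.put("distrib", "false");

        // Parameters a client might send with each lookup.
        Map<String, String> request = new LinkedHashMap<>();
        request.put("terms.fl", "autocomplete_text");
        request.put("terms.prefix", "ac");

        System.out.println(effectiveParams(defaults, request));
    }
}
```

With these defaults in place, a client only has to send the field and the prefix; "terms=true" is already implied by the handler.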

Second, we need to write some Java code to get the results from the embedded Solr server.

Java Code

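import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.client.solrj.response.TermsResponse;
import org.apache.solr.client.solrj.response.TermsResponse.Term;
import org.apache.solr.common.params.CommonParams;
import org.apache.solr.common.params.TermsParams;

public List<String> typeAhead(String q) throws SolrServerException {
    SolrQuery query = new SolrQuery();
    query.setParam(CommonParams.QT, "/terms"); // request handler defined in solrconfig.xml
    query.setParam(TermsParams.TERMS, true);
    query.setParam(TermsParams.TERMS_LIMIT, "10"); // maximum number of suggestions
    query.setParam(TermsParams.TERMS_FIELD, "autocomplete_text"); // field to read the terms from
    // terms.prefix is compared with the indexed terms as-is, and those are lowercased
    query.setParam(TermsParams.TERMS_PREFIX_STR, q.toLowerCase(Locale.ROOT));
    List<String> typeAheadList = new ArrayList<String>();
    QueryResponse response = solrServer.query(query); // solrServer is your SolrServer instance
    TermsResponse tr = response.getTermsResponse();
    List<Term> termList = (tr == null) ? null : tr.getTerms("autocomplete_text");
    if (termList != null) { // guard against a response without terms for this field
        for (Term t : termList) {
            typeAheadList.add(t.getTerm());
        }
    }
    return typeAheadList;
}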

Sample

Input  - a
Output - absolute, actor, actress, address, ...

Input  - ac
Output - actor, actress

Input  - heater co
Output - heater coil, heater core
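
The sample can be reproduced without Solr by a small in-memory sketch of a prefix lookup over sorted, lowercased terms. This is only a model of the idea: the class name and the sample values are hypothetical, and it returns matches in alphabetical (index) order, whereas TermsComponent sorts by term frequency unless terms.sort=index is set.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.TreeSet;

// In-memory model of a prefix lookup over a sorted term dictionary.
final class PrefixLookupSketch {

    private final TreeSet<String> terms = new TreeSet<>();

    // Each field value is stored as one lowercased term.
    PrefixLookupSketch(List<String> fieldValues) {
        for (String value : fieldValues) {
            terms.add(value.toLowerCase(Locale.ROOT));
        }
    }

    // Returns up to 'limit' terms that start with the given prefix.
    List<String> suggest(String prefix, int limit) {
        String p = prefix.toLowerCase(Locale.ROOT);
        List<String> matches = new ArrayList<>();
        // tailSet jumps to the first term >= prefix; matching terms are contiguous.
        for (String term : terms.tailSet(p, true)) {
            if (!term.startsWith(p) || matches.size() >= limit) {
                break;
            }
            matches.add(term);
        }
        return matches;
    }

    public static void main(String[] args) {
        PrefixLookupSketch index = new PrefixLookupSketch(Arrays.asList(
                "Actor", "Heater Core", "Address", "Actress", "Heater Coil", "Absolute"));
        System.out.println(index.suggest("a", 10));         // [absolute, actor, actress, address]
        System.out.println(index.suggest("ac", 10));        // [actor, actress]
        System.out.println(index.suggest("heater co", 10)); // [heater coil, heater core]
    }
}
```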

I have not tested this on a standalone Solr server, but I see no reason why it should not work on one.
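
On a standalone server the same handler would normally be called over HTTP. The sketch below only builds the request URL from the parameters used in this post. The base URL is an assumption (a default local install; a multi-core setup would also need the core name in the path), the class name is hypothetical, and like the rest of the standalone case this has not been run against a live server.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Builds the URL for a /terms lookup; it does not send the request.
final class TermsUrlSketch {

    // URL-encodes a single parameter value as UTF-8.
    private static String enc(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    // baseUrl is an assumption, for example "http://localhost:8983/solr".
    static String termsUrl(String baseUrl, String field, String prefix, int limit) {
        return baseUrl + "/terms"
                + "?terms=true"
                + "&terms.fl=" + enc(field)
                + "&terms.prefix=" + enc(prefix)
                + "&terms.limit=" + limit;
    }

    public static void main(String[] args) {
        System.out.println(
                termsUrl("http://localhost:8983/solr", "autocomplete_text", "heater co", 10));
    }
}
```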


Sunday, June 2, 2013

Logstash Configuration

Most projects we work on have a logging module that keeps writing to log files. Over time these files grow huge and end up being archived or deleted. I am not sure how many people actually analyze their logs to answer questions such as: Is the same exception or error being thrown again? Is there a pattern to when it is thrown? When was it last thrown? In fact, this is not limited to exceptions; it applies equally to user actions captured in the logs.

As part of a project I was working on, I came across Logstash, an open source tool for managing events and logs. You can use it to collect logs, parse them, and store them for later use. It is completely free and open source. The license is Apache 2.0, so you are free to use it pretty much however you want. (More info here)

In this post I am going to explain how to configure Logstash.

You need a JRE installed to run Logstash.

You need an Elasticsearch server running (I will use the embedded server for the purpose of this post).

You can download the latest version of Logstash as of this post (1.1.13) from the link here.

You need to define a config file for running Logstash (logstash.conf).

The format of the config file is as described below

# This is a comment. You should use comments to describe
# parts of your configuration.
input {
  ...
}

filter {
  ...
}

output {
  ...
}

You can define your input section by pointing it to the log file(s) that need to be monitored:

input {
  file {
    type => "tomcat"
    path => "/home/tomcat7/logs/catalina.out"
  }
  file {
    type => "application"
    path => "/home/apps/logs/app.log"
  }
}

In the filter section you define the patterns that Logstash uses to recognize and parse the records in the log file. A standard set of named regular expressions, called grok patterns, is available for matching common elements such as date, time, month, and log level (Details here). Using these, we can build a grok pattern that matches the records in a log file.
For example, to match the log records from Tomcat's catalina.out, we can use this pattern:

filter {
  grok {
    pattern => ["(?m)(?<logdate>%{MONTH} %{MONTHDAY}, %{YEAR} %{DATA} [AP]{1}M{1}) %{NOTSPACE:package} %{WORD:method}.*%{LOGLEVEL:loglevel}: %{GREEDYDATA:message}"]
  }
}

With this filter definition, the following log message

May 31, 2013 9:24:24 AM org.apache.catalina.core.StandardEngine startInternal
INFO: Starting Servlet Engine: Apache Tomcat/7.0.35

can be broken down as

logdate => May 31, 2013 9:24:24 AM
package => org.apache.catalina.core.StandardEngine
method => startInternal
loglevel => INFO
message => Starting Servlet Engine: Apache Tomcat/7.0.35
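
To show how this breakdown comes about, the sketch below rebuilds the grok pattern as an ordinary Java regular expression with named groups and applies it to the sample record. It is an approximation for illustration, not Logstash code: the grok macros (MONTH, MONTHDAY, YEAR, DATA, NOTSPACE, WORD, LOGLEVEL, GREEDYDATA) are replaced with simplified hand-written equivalents, LOGLEVEL is reduced to a few java.util.logging level names, and the class name is hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Approximates the catalina grok pattern with a plain Java regex.
final class CatalinaGrokSketch {

    private static final Pattern RECORD = Pattern.compile(
            "(?<logdate>[A-Z][a-z]+ \\d{1,2}, \\d{4} .*? [AP]M)" // MONTH MONTHDAY, YEAR DATA AM/PM
            + " (?<package>\\S+)"                                 // NOTSPACE
            + " (?<method>\\w+)"                                  // WORD
            + ".*(?<loglevel>SEVERE|WARNING|INFO|CONFIG|FINE)"    // LOGLEVEL (simplified)
            + ": (?<message>.*)",                                 // GREEDYDATA
            Pattern.DOTALL); // '.' also matches newlines, like (?m) in the grok pattern

    // Returns the extracted fields, or an empty map if the record does not match.
    static Map<String, String> parse(String record) {
        Map<String, String> fields = new LinkedHashMap<>();
        Matcher m = RECORD.matcher(record);
        if (m.matches()) {
            for (String name : new String[] {"logdate", "package", "method", "loglevel", "message"}) {
                fields.put(name, m.group(name));
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        String record = "May 31, 2013 9:24:24 AM org.apache.catalina.core.StandardEngine startInternal\n"
                + "INFO: Starting Servlet Engine: Apache Tomcat/7.0.35";
        for (Map.Entry<String, String> e : parse(record).entrySet()) {
            System.out.println(e.getKey() + " => " + e.getValue());
        }
    }
}
```

Note that the record is passed in as a single string containing both lines, which is why the pattern needs '.' to match the newline between the method name and the log level.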

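This log message represents one log record, and a log file will contain many such records. Note that the file input emits one event per line, so a record that spans two lines, like the one above, has to be joined into a single event first (for example with the multiline filter) before the pattern can match it. You can now define an output section to push these records into Elasticsearch.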

output {
  elasticsearch {
    embedded => true
  }
}

Once the logs are in Elasticsearch, you can run queries to search them by keyword, log level, date range, and so on.

Once you have finished editing the config file, place the Logstash jar and the config file in the same folder and start Logstash from that folder by running the following command:

java -jar logstash-1.1.13-flat.jar agent -f logstash.conf

To enable verbose output while Logstash is running, add -v or -vv, depending on how verbose you need it to be.