How to group data when querying Solr

A 3 minute read written by Terry July 30, 2015


I was recently tasked with building an online gallery for a client, which also needed to allow users to filter data based on various criteria. It sounded easy enough, but the data for this gallery would be in the Solr search appliance for the site.

The project was a lot of fun but posed a few struggles for a newb like me.

Solr is an open-source search platform built on Apache Lucene™. Once the website’s data has been indexed into a Solr collection, developers can add search functionality to the website, as well as run queries against the collection much as they would against a database.

The next few paragraphs describe how to group data, examine a problem I encountered, and explain how I resolved it. Before proceeding, please note that all of the schema.xml examples apply to Solr version 4.2. If you’re using a newer version, the configuration may be somewhat different, but the logic should be very similar.

Basic group by – the Solr way

In SQL we can simply write GROUP BY columnName and it’s done. With Solr it wasn’t quite so simple.

To return grouped data from a Solr query (in my case, a field called “category1”), there are two parameters we need to set.

  1. We need to set: group=true

  2. We need to specify which field we’d like to group by: group.field=category1

You can use the “fl” parameter to select which fields you’d like returned.

To do this via the Solr interface you have to populate those fields via the Raw Query Parameters field:

[Screenshot of the Solr interface]

If we are directly accessing our Solr appliance with a URL string, our URL would look something like this:

http://localhost:8983/solr/my-collection/select?q=*%3A*&fl=category1%2Ctitle&wt=json&indent=true&group=true&group.field=category1
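If you would rather build that URL in code than by hand, here is a minimal sketch in Python using only the standard library. The function name build_group_query_url is my own invention for illustration, and the host, port and collection name are simply the ones from the example URL, so adjust them for your setup.

```python
from urllib.parse import urlencode

# Assumed base URL, copied from the example above; change it for your own server.
BASE_URL = "http://localhost:8983/solr/my-collection/select"


def build_group_query_url(group_field, fields, query="*:*"):
    """Return a Solr select URL that groups results by `group_field`.

    Hypothetical helper for illustration; it is not part of any Solr client.
    """
    params = [
        ("q", query),
        ("fl", ",".join(fields)),
        ("wt", "json"),
        ("indent", "true"),
        ("group", "true"),               # turn grouping on
        ("group.field", group_field),    # the field to group by
    ]
    # safe="*" leaves asterisks unescaped so the result matches the URL above;
    # the colon and the comma are still percent-encoded.
    return BASE_URL + "?" + urlencode(params, safe="*")


url = build_group_query_url("category1", ["category1", "title"])
print(url)
```

Letting urlencode handle the escaping avoids having to remember that a colon becomes %3A and a comma becomes %2C.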

If you are using the Apache_Solr_Service, your code would look something like this:

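// Connect to the collection (SOLR_IP and SOLR_PORT are constants defined elsewhere).
$collection_path = '/solr/my-collection';
$solr = new Apache_Solr_Service(SOLR_IP, SOLR_PORT, $collection_path);

$query = '*:*';   // match all documents, same as q=*:* in the URL
$start = 0;
$num_rows = 1000;

// Extra parameters: the fields to return, plus the two grouping parameters.
$additionalParameters = array(
    'fl' => 'title,category1',
    'group' => 'true',
    'group.field' => 'category1',
);

// Only relevant on old PHP versions: magic quotes were removed in PHP 5.4
// and this function was removed in PHP 8.0.
if (function_exists('get_magic_quotes_gpc') && get_magic_quotes_gpc() === 1) {
    $query = stripslashes($query);
}

$output = $solr->search($query, $start, $num_rows, $additionalParameters);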

Trouble with tokenizers

I encountered one major issue when grouping data. Depending on the type of field you’re using, Solr will tokenize the content of that field. You can learn more about tokenizers here.

This is where you will have to revisit your schema.xml.

In my schema, my category1 field is a “text_general” field type. The “text_general” field type was being tokenized using the solr.StandardTokenizerFactory class.

This class splits the text field into tokens, treating whitespace and punctuation as delimiters.

<field name="category1" type="text_general" indexed="true" stored="true"/>

<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
  	<analyzer type="index">
    	<tokenizer class="solr.StandardTokenizerFactory"/>
    	<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
    	<!-- in this example, we will only use synonyms at query time
    	<filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
    	-->
    	<filter class="solr.LowerCaseFilterFactory"/>
  	</analyzer>
  	<analyzer type="query">
    	<tokenizer class="solr.StandardTokenizerFactory"/>
    	<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
    	<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    	<filter class="solr.LowerCaseFilterFactory"/>
  	</analyzer>
</fieldType>

This caused a few issues with grouping:

  1. Values were all converted to lowercase (e.g., Aviation was now ‘aviation’).
  2. Group names that were more than one word in length were split up (e.g., Ground Transportation was under a group value of “transportation”).
  3. When I had groups that had any of the same words (e.g., Ground Transportation, Air Transportation) all results were returned under “transportation”.
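To make the cause of those three symptoms easier to see, here is a rough Python sketch of what the analyzer does to a category value. It is only an approximation: the regular expression stands in for solr.StandardTokenizerFactory (the real tokenizer follows more detailed Unicode rules), the stop word and synonym filters are ignored, and the function name is made up for this example.

```python
import re


def approximate_text_general_tokens(value):
    """Roughly mimic StandardTokenizer followed by LowerCaseFilter.

    Assumption: splitting on non-word characters is close enough for
    simple category names. This is not Lucene's actual algorithm.
    """
    return [token.lower() for token in re.findall(r"\w+", value)]


categories = ["Aviation", "Ground Transportation", "Air Transportation"]
for category in categories:
    print(category, "->", approximate_text_general_tokens(category))

# Tokens that two different categories have in common.
shared = set(approximate_text_general_tokens("Ground Transportation")) & set(
    approximate_text_general_tokens("Air Transportation")
)
print(shared)
```

Both transportation categories produce the token "transportation", and none of the tokens keeps its original capitalization, which is consistent with the grouping results described above.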

Obviously, this was a problem. I couldn’t just change the type of the category1 field, because tokenized fields allow for more flexible and accurate full-text searching.

A simple solution

After doing a bit of research I discovered that the solution was actually pretty simple. I needed to create an additional field that would not be tokenized by Solr and then copy the value from my original field into that new field. Then I could use this new field for grouping.
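The idea can be sketched in a few lines of Python. This is only a simulation of the concept, not of Solr itself: the documents are made up, copy_field and group_by_exact_value are hypothetical helpers, and category1_full is the name I gave the new field in my schema.

```python
from collections import defaultdict

# Made-up documents for illustration only.
docs = [
    {"title": "Doc 1", "category1": "Ground Transportation"},
    {"title": "Doc 2", "category1": "Air Transportation"},
    {"title": "Doc 3", "category1": "Ground Transportation"},
]


def copy_field(doc, source, dest):
    """Return a copy of `doc` with the source value duplicated, unchanged, into `dest`."""
    copied = dict(doc)
    copied[dest] = doc[source]
    return copied


def group_by_exact_value(documents, field):
    """Group document titles by the whole, untokenized value of `field`."""
    groups = defaultdict(list)
    for document in documents:
        groups[document[field]].append(document["title"])
    return dict(groups)


indexed = [copy_field(doc, "category1", "category1_full") for doc in docs]
groups = group_by_exact_value(indexed, "category1_full")
print(groups)
```

Because the copied value is never split or lowercased, "Ground Transportation" and "Air Transportation" stay separate groups with their original capitalization.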

I used the “string” field type. In my schema.xml I have my original field in the <fields> node:

<field name="category1" type="text_general" indexed="true" stored="true"/>

I then added my new field to the <fields> node:

<field name="category1_full" type="string" indexed="true" stored="true"/>

Outside of the <fields> node, I copied the contents of the original field to my new field:

<copyField source="category1" dest="category1_full"/>

Now, after reindexing so that the new field is populated, I can run the same queries I did earlier, but with category1_full as my group field instead of category1.

[Another screenshot of the Solr interface]

http://localhost:8983/solr/my-collection/select?q=*%3A*&fl=category1_full%2Ctitle&wt=json&indent=true&group=true&group.field=category1_full

If you are using the Apache_Solr_Service, your code would look something like this:

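// Connect to the collection (SOLR_IP and SOLR_PORT are constants defined elsewhere).
$collection_path = '/solr/my-collection';
$solr = new Apache_Solr_Service(SOLR_IP, SOLR_PORT, $collection_path);

$query = '*:*';   // match all documents, same as q=*:* in the URL
$start = 0;
$num_rows = 1000;

// Group by the untokenized copy of the field instead of the original.
$additionalParameters = array(
    'fl' => 'title,category1,category1_full',
    'group' => 'true',
    'group.field' => 'category1_full',
);

// Only relevant on old PHP versions: magic quotes were removed in PHP 5.4
// and this function was removed in PHP 8.0.
if (function_exists('get_magic_quotes_gpc') && get_magic_quotes_gpc() === 1) {
    $query = stripslashes($query);
}

$output = $solr->search($query, $start, $num_rows, $additionalParameters);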
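Once the grouped response comes back, you still need to pull the groups out of it. Here is a small Python sketch of reading a grouped JSON response (wt=json). The sample response is made up and trimmed down to the parts that matter, and titles_by_group is a hypothetical helper; only the overall grouped / groups / groupValue / doclist layout is what Solr returns for a group.field query.

```python
import json

# Made-up sample in the shape Solr uses for group=true with wt=json.
# With the default group.limit of 1, each group lists one document even
# though numFound reports how many documents the group contains.
SAMPLE_RESPONSE = """
{
  "grouped": {
    "category1_full": {
      "matches": 3,
      "groups": [
        {"groupValue": "Ground Transportation",
         "doclist": {"numFound": 2, "start": 0,
                     "docs": [{"title": "Example title A"}]}},
        {"groupValue": "Air Transportation",
         "doclist": {"numFound": 1, "start": 0,
                     "docs": [{"title": "Example title B"}]}}
      ]
    }
  }
}
"""


def titles_by_group(response_text, group_field):
    """Map each group value to the titles of the documents returned for it."""
    data = json.loads(response_text)
    groups = data["grouped"][group_field]["groups"]
    return {
        group["groupValue"]: [doc["title"] for doc in group["doclist"]["docs"]]
        for group in groups
    }


result = titles_by_group(SAMPLE_RESPONSE, "category1_full")
print(result)
```

If you want more than one document per group, the group.limit parameter controls how many are returned.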

This is one way to solve this issue. As with most problems, there are likely several solutions. If you know a different way, then please post a comment on this blog and let me know. I’d love to hear different approaches.