‘Simple’ search options

8. ‘Simple’ search options

Learn about constructing searches in Trove, including the use of indexes and facets. Includes a variety of tips and tricks, focusing on undocumented or potentially confusing aspects of the Trove search system.

8.1. Simple search isn’t!

The Trove web interface distinguishes between ‘Advanced’ and ‘Simple’ search. This is a bit misleading, as you can construct complex queries using either. ‘Advanced’ search really just adds a structured interface over the ‘Simple’ search options. This Guide focuses on ‘Simple’ search because it gives you more control, exposes more of the workings of the search index, and produces queries that can be easily translated to work with the Trove API.

See Constructing a complex search query in the Trove help system for an introduction to:

boolean searches (use AND, OR, and NOT to combine search terms)
phrase searches
proximity searches (specify the number of words that can appear between search terms)
some of the available indexes

It can also be useful to poke around the Solr query parser documentation. Solr is the indexing software used by Trove, so many of the query formats described will work in Trove.

Below you’ll find information on some of the undocumented and potentially confusing aspects of Trove search.

8.2. De-fuzzify your searches

By default, Trove adds a bit of fuzziness to your searches to try and ensure you get back some useful results. This includes:

stemming of your search terms (this reduces words to their base form; for example, computer becomes comput, matching ‘compute’, ‘computer’, ‘computing’, etc.)
allowing extra words in phrases (this is to match across line breaks where words are hyphenated)
searching both full text (where available) and user-generated tags and comments

It’s possible to modify some of these settings by changing the format of your query. Here are some examples using a single search term:

Table 8.1 De-fuzzify keyword searches
Query	Results	Explanation
`hobart`	5,892,614	Searches article text, tags & comments (some fuzziness, terms are stemmed)
`hobart*`	5,964,555	Searches article text, tags & comments (more fuzziness, wildcard matches zero or more characters)
`text:hobart`	5,605,604	Searches article text only (exact match, ignores tags & comments)
`title:hobart`	720,316	Searches headlines only

Similarly you can adjust the fuzziness of phrase searches.

Table 8.2 De-fuzzify phrase searches
Query	Results	Explanation
`australia OR unlimited`	33,138,404	Matches articles containing either term
`australia unlimited`	437,656	Same as australia AND unlimited
`"australia unlimited"`	3,911	Search for phrase (with stemming)
`text:"australia unlimited"`	3,834	Search for phrase (no stemming & ignores tags/comments)
`"australia unlimited"~0`	2,873	Search for phrase (with stemming, no extra words)
`text:"australia unlimited"~0`	2,815	Search for exact phrase (no extra words, no stemming, ignore tags/comments)
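The patterns in the two tables above can be assembled programmatically. The following is a minimal sketch only: `defuzzify` is a hypothetical helper name, not part of Trove, and it simply combines the `text:` index and the `~0` modifier shown in the tables.

```python
def defuzzify(terms: str, *, text_only: bool = True, exact_phrase: bool = True) -> str:
    """Build a Trove query string with the default fuzziness reduced.

    Hypothetical helper: it only formats a string using the query
    patterns listed in Tables 8.1 and 8.2.
    """
    query = terms
    if len(terms.split()) > 1:
        # Quote multi-word input so it is treated as a phrase
        query = f'"{terms}"'
        if exact_phrase:
            # ~0 disallows extra words between the phrase terms
            query += "~0"
    if text_only:
        # The text: index skips stemming and ignores tags and comments
        query = f"text:{query}"
    return query


print(defuzzify("hobart"))
print(defuzzify("australia unlimited"))
```

The first call prints `text:hobart` and the second prints `text:"australia unlimited"~0`, matching the strictest rows of each table.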

8.3. Stemming oddities

As noted above, Trove stems your search terms, trimming them back to their base form. It seems that Trove uses the Porter stemming algorithm. If you’d like to see how stemming affects your query, you can use this online tool to test the results of the Porter algorithm.

I’ve noticed some oddities in handling -ise and -ize suffixes. For example:

Table 8.3 Stemming variations
Query	Results	Explanation
`naturalisation`	250,586	Stemmed to ‘naturalis’
`naturalization`	15,482,606	Stemmed to ‘natur’
`text:naturalisation`	132,840	No stemming
`text:naturalization`	24,732	No stemming

8.4. Proximity searches

The defuzzify examples above use the proximity modifier (~) to remove extra words from a query, but you can also use it to set a maximum distance between search terms. One thing to note is that the order of the search terms makes a difference to your results, as reversing the order of your terms counts as a ‘word’. For example:

Table 8.4 Using proximity modifiers
Query	Results	Explanation
`chinese tasmania`	279,705	articles contain both terms
`"chinese tasmania"~10`	4,183	articles where ‘tasmania’ is within 10 words of ‘chinese’
`"tasmania chinese"~10`	4,198	terms in reverse order are matched, but reversing counts towards the word distance so results can differ
`"tasmania chinese"~10 OR "chinese tasmania"~10`	4,702	10 word distance in either direction
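Because term order affects proximity results, it can help to generate both orderings at once. This is a small sketch of the last row of the table; `within` is a hypothetical helper name used here for illustration.

```python
def within(term_a: str, term_b: str, distance: int) -> str:
    """Build a proximity query that matches the two terms in either order.

    Hypothetical helper: it joins the two orderings with OR, as in the
    final row of Table 8.4.
    """
    forward = f'"{term_a} {term_b}"~{distance}'
    reverse = f'"{term_b} {term_a}"~{distance}'
    return f"{forward} OR {reverse}"


print(within("tasmania", "chinese", 10))
```

This prints `"tasmania chinese"~10 OR "chinese tasmania"~10`, the query that returned the combined set of results in the table.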

8.5. Using indexes

When you enter queries in the simple search box, or by using the q parameter in an API request, you’re searching across most metadata fields and any available full text. To control where and what you’re searching, you can specify the index you want Trove to use. For example, the query title:wragge will search only the title field for the term wragge.

Other indexes mentioned in Trove’s help documentation are:

subject
creator
issn
isbn
nuc
publictag

A more complete list of available indexes is provided in the API technical documentation.

Undocumented indexes include:

Table 8.5 Undocumented search indexes
Index	Description	Example
`series`	Search for resources that are part of a collection	`series:"Parliamentary paper (Australia. Parliament)"` – find Parliamentary Papers
`firstpageseq`	Search for newspaper articles published on a specific page	`firstpageseq:2` – find articles published on page two

You can use many of the standard search operators with index queries. For example:

Table 8.6 Using search operators with indexes
Query	Explanation
`subject:history`	Search for a keyword in the `subject` index
`subject:(history weather)`	Search for multiple keywords in the `subject` index
`subject:(history NOT australia)`	Search using boolean operators in `subject` index
`subject:"Australian history"`	Search for a phrase in the `subject` index

Unlike regular searches, stemming is not applied by default to index searches. If you want to use stemming, there are separate stemmed indexes for creator, subject, and title: s_creator, s_subject, and s_title.
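As a rough illustration of how index queries and their stemmed variants fit together, here is a sketch of a query builder. `index_query` and `STEMMED_INDEXES` are hypothetical names; the only Trove-specific details assumed are the index names and operators described above.

```python
# Stemmed counterparts of the indexes, as listed in the text above
STEMMED_INDEXES = {"creator": "s_creator", "subject": "s_subject", "title": "s_title"}


def index_query(index: str, value: str, stemmed: bool = False) -> str:
    """Format a query against a named index (hypothetical helper)."""
    if stemmed:
        if index not in STEMMED_INDEXES:
            raise ValueError(f"no stemmed index is listed for {index!r}")
        index = STEMMED_INDEXES[index]
    # Group multiple keywords or boolean expressions in parentheses,
    # unless the value is already a quoted phrase or a group
    if " " in value and not value.startswith(('"', "(")):
        value = f"({value})"
    return f"{index}:{value}"


print(index_query("subject", "history NOT australia"))
print(index_query("title", "wragge", stemmed=True))
```

The two calls print `subject:(history NOT australia)` and `s_title:wragge`.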

There’s some overlap between indexes and facets. For example, there’s a format index and a format facet that both let you limit your search by format. However, indexes and facets behave differently – facets expect exact matches, while indexes are much more flexible. Also, you can use the NOT operator with indexes to exclude particular values. For example, to exclude books from your search you could add NOT format:Book to your query. There’s no way of doing this with facets.
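To make the difference concrete, the sketch below encodes the two approaches as request parameters. It assumes the `l-[FACET NAME]` parameter convention used by the API for facets; `exclude_value` is a hypothetical helper, and the search term `wragge` is just an example.

```python
from urllib.parse import urlencode


def exclude_value(query: str, index: str, value: str) -> str:
    """Append a NOT clause against an index to a query (hypothetical helper)."""
    if " " in value:
        # Quote multi-word values so they are treated as a phrase
        value = f'"{value}"'
    return f"{query} NOT {index}:{value}"


# Limiting with a facet: a separate parameter holding an exact value
facet_params = {"q": "wragge", "l-format": "Book"}

# Excluding with an index: the NOT clause lives inside the query itself
index_params = {"q": exclude_value("wragge", "format", "Book")}

print(urlencode(facet_params))
print(urlencode(index_params))
```

The first line printed is `q=wragge&l-format=Book`; the second is `q=wragge+NOT+format%3ABook`, where the exclusion is part of the `q` value.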

Some indexes such as date and lastupdated expect a range of dates. Depending on the index and the category, the date values are either years or complete ISO formatted datetimes. For example:

Table 8.7 Using the date index
Query	Explanation
`date:[1901 TO 1904]`	1 January 1901 to 31 December 1904
`date:[* TO 1904]`	up to 31 December 1904
`date:[1904 TO 1904]`	1 January 1904 to 31 December 1904
`date:[1942-10-31T00:00:00Z TO 1942-11-30T00:00:00Z]`	1 November 1942 to 30 November 1942 (newspapers only – dates need timezones, first date in range ignored)
`date:[1942-11-09T00:00:00Z TO 1942-11-10T00:00:00Z]`	10 November 1942 (newspapers only – dates need timezones, first date in range ignored)

For more information see Date searches
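Since the first date in a newspaper range is ignored, it is easy to be off by a day. The sketch below builds both styles of range from the table; the function names are hypothetical, and the one-day offset is based only on the behaviour described in the table.

```python
from datetime import date, timedelta
from typing import Optional


def year_range(start: Optional[int], end: int) -> str:
    """Year-based range; pass None as the start for an open-ended range."""
    lower = "*" if start is None else str(start)
    return f"date:[{lower} TO {end}]"


def newspaper_day_range(first: date, last: date) -> str:
    """Range of full datetimes covering first..last inclusive (newspapers only).

    The first date in the range is ignored, so the range starts one day early.
    """
    start = first - timedelta(days=1)
    return f"date:[{start.isoformat()}T00:00:00Z TO {last.isoformat()}T00:00:00Z]"


print(year_range(1901, 1904))
print(newspaper_day_range(date(1942, 11, 10), date(1942, 11, 10)))
```

The second call prints `date:[1942-11-09T00:00:00Z TO 1942-11-10T00:00:00Z]`, the single-day query for 10 November 1942 from the table.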

8.6. Using facets

Facets are a set of pre-determined values you can use to set limits on your search results. They allow you to take slices of your results.

In the web interface, facets appear as a set of check boxes next to the list of results. You just click the box next to a facet value to apply it to your search. You can only select one facet value at a time.

Fig. 8.1 Display of facets in the web interface

Facets vary by category, but a complete list is available in the API technical documentation.

To use facets to limit the results of your API query, you add an l-[FACET NAME] parameter and set it to your desired value. For example, to limit a search of newspaper articles to those published in the Sydney Morning Herald, you add the l-title parameter and set it to 35 (the title identifier for the SMH).
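A request URL with a facet limit might be assembled as in the sketch below. The endpoint URL and the `category` parameter reflect my understanding of version 3 of the Trove API and should be checked against the API technical documentation; `build_search_url` is a hypothetical helper, and `wragge` is just an example search term. An API key is also needed to make real requests, which is not shown here.

```python
from urllib.parse import urlencode

# Assumed endpoint for version 3 of the Trove API; verify before use
API_URL = "https://api.trove.nla.gov.au/v3/result"


def build_search_url(query: str, category: str, facets: dict = None) -> str:
    """Build a search URL, adding each facet as an l-[FACET NAME] parameter."""
    params = [("q", query), ("category", category)]
    for name, value in (facets or {}).items():
        params.append((f"l-{name}", value))
    return f"{API_URL}?{urlencode(params)}"


# Limit a newspaper search to the Sydney Morning Herald (title id 35)
print(build_search_url("wragge", "newspaper", {"title": "35"}))
```

This only builds the URL string; it does not send a request.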

When you use the API you can apply multiple facet values. However, facet fields don’t all behave the same way when you select multiple values. In some cases, you’ll get back the sum of the requested slices, but in most you’ll only get the intersection of the slices.

For example, if you use the state facet to request newspaper articles from both Victoria and NSW, you get back articles from either Victoria or NSW.

Facet	Results
`l-state=Victoria`	48,133,262
`l-state=New South Wales`	91,338,016
`l-state=Victoria&l-state=New South Wales`	139,471,278

On the other hand, if you use the category facet to request articles in both the Article and Advertising categories, you’ll only get articles that are in both categories.

Facet	Results
`l-category=Article`	173,230,640
`l-category=Advertising`	47,033,067
`l-category=Article&l-category=Advertising`	6,203

User added categories

You might be thinking that the final result above should be zero, as newspaper articles are assigned to a single category – how can an article be in both the Article and Advertising categories? The answer is that Trove users can add extra categories to articles, and these user-added values are included in the facet counts. There doesn’t seem to be any way to exclude these values, so it’s something else to keep in mind if you’re working with the data!
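Repeating a facet parameter is straightforward with a list of pairs, and the state counts in the table can be checked with simple arithmetic. This is a sketch only; `multi_facet_query` is a hypothetical helper name.

```python
from urllib.parse import urlencode


def multi_facet_query(name: str, values: list) -> str:
    """Encode several values for one facet as repeated l-[FACET NAME] parameters."""
    return urlencode([(f"l-{name}", value) for value in values])


print(multi_facet_query("state", ["Victoria", "New South Wales"]))

# The combined state total equals the sum of the two slices (a union),
# using the counts reported in the table
victoria, nsw, combined = 48_133_262, 91_338_016, 139_471_278
print(victoria + nsw == combined)
```

The encoded string is `l-state=Victoria&l-state=New+South+Wales`, and the check prints `True`. The category counts behave differently: 6,203 is far smaller than either slice, which is consistent with an intersection rather than a sum.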
