🚧 This is a working draft and will change often. Do not cite!
Use the latest published version instead.
🚧

8. ‘Simple’ search options

Learn about constructing searches in Trove, including the use of indexes and facets. Includes a variety of tips and tricks, focusing on undocumented or potentially confusing aspects of the Trove search system.

8.1. Simple search isn’t!

The Trove web interface distinguishes between ‘Advanced’ and ‘Simple’ search. This is a bit misleading as you can construct complex queries using either. ‘Advanced’ search really just adds a structured interface over the ‘Simple’ search options. This Guide focuses on using ‘Simple’ search because it gives you more control, exposes more of the workings of the search index, and its queries can be easily translated to work with the Trove API.

See Constructing a complex search query in the Trove help system for an introduction to:

  • boolean searches (use AND, OR, and NOT to combine search terms)

  • phrase searches

  • proximity searches (specify the number of words that can appear between search terms)

  • some of the available indexes

It can also be useful to poke around the Solr query parser documentation. Solr is the indexing software used by Trove, so many of the query formats described will work in Trove.

Below you’ll find information on some of the undocumented and potentially confusing aspects of Trove search.

8.2. De-fuzzify your searches

By default, Trove adds a bit of fuzziness to your searches to try and ensure you get back some useful results. This includes:

  • stemming of your search terms (this reduces words to their base form; for example, computer becomes comput, which matches ‘compute’, ‘computer’, ‘computing’, etc.)

  • allowing extra words in phrases (this is to match across line breaks where words are hyphenated)

  • searching both full text (where available) and user-generated tags and comments

It’s possible to modify some of these settings by changing the format of your query. Here are some examples using a single search term:

Table 8.1 De-fuzzify keyword searches

| Query | Results | Explanation |
|---|---|---|
| hobart | 5,892,614 | Searches article text, tags & comments (some fuzziness, terms are stemmed) |
| hobart* | 5,964,555 | Searches article text, tags & comments (more fuzziness, wildcard matches zero or more characters) |
| text:hobart | 5,605,604 | Searches article text only (exact match, ignores tags & comments) |
| title:hobart | 720,316 | Searches headlines only |

Similarly you can adjust the fuzziness of phrase searches.

Table 8.2 De-fuzzify phrase searches

| Query | Results | Explanation |
|---|---|---|
| australia OR unlimited | 33,138,404 | |
| australia unlimited | 437,656 | Same as australia AND unlimited |
| "australia unlimited" | 3,911 | Search for phrase (with stemming) |
| text:"australia unlimited" | 3,834 | Search for phrase (no stemming & ignores tags/comments) |
| "australia unlimited"~0 | 2,873 | Search for phrase (with stemming, no extra words) |
| text:"australia unlimited"~0 | 2,815 | Search for exact phrase (no extra words, no stemming, ignore tags/comments) |
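The patterns in the two tables above can be wrapped in a small helper. This is only a sketch: the function name `defuzzify` and its `include_tags` argument are made up for illustration, and it simply assembles the query strings shown in the tables (the `text:` index to skip stemming and tags/comments, and `~0` to disallow extra words in a phrase).

```python
def defuzzify(query: str, include_tags: bool = False) -> str:
    """Build a stricter version of a keyword or phrase query (hypothetical helper)."""
    terms = query.split()
    if not terms:
        raise ValueError("query must contain at least one term")
    if len(terms) > 1:
        # Quote the phrase and add ~0 so no extra words are allowed between terms.
        strict = '"' + " ".join(terms) + '"~0'
    else:
        strict = terms[0]
    # The text: prefix searches article text only, without stemming.
    return strict if include_tags else f"text:{strict}"


print(defuzzify("hobart"))
print(defuzzify("australia unlimited"))
```

The resulting strings can be pasted into the simple search box or used as the value of the q parameter in an API request.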

8.3. Stemming oddities

As noted above, Trove stems your search terms, trimming them back to their base form. It seems that Trove uses the Porter stemming algorithm. If you’d like to see how stemming affects your query, you can use this online tool to test the results of the Porter algorithm.

I’ve noticed some oddities in handling -ise and -ize suffixes. For example:

Table 8.3 Stemming variations

| Query | Results | Explanation |
|---|---|---|
| naturalisation | 250,586 | Stemmed to ‘naturalis’ |
| naturalization | 15,482,606 | Stemmed to ‘natur’ |
| text:naturalisation | 132,840 | No stemming |
| text:naturalization | 24,732 | No stemming |

8.4. Proximity searches

The defuzzify examples above use the proximity modifier (~) to remove extra words from a query, but you can also use it to set a maximum distance between search terms. One thing to note is that the order of the search terms makes a difference to your results, as reversing the order of your terms counts as a ‘word’. For example:

Table 8.4 Using proximity modifiers

| Query | Results | Explanation |
|---|---|---|
| chinese tasmania | 279,705 | articles contain both terms |
| "chinese tasmania"~10 | 4,183 | articles where ‘tasmania’ is within 10 words of ‘chinese’ |
| "tasmania chinese"~10 | 4,198 | terms in reverse order are matched, but reversing counts towards the word distance so results can differ |
| "tasmania chinese"~10 OR "chinese tasmania"~10 | 4,702 | 10 word distance in either direction |
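Because term order affects proximity results, it can be handy to generate the ‘either direction’ form of a query automatically. The helper below is a minimal sketch; the name `within` is invented for this example and it only handles two single-word terms.

```python
def within(term_a: str, term_b: str, distance: int) -> str:
    """Build a proximity query that allows the terms in either order (hypothetical helper)."""
    forward = f'"{term_a} {term_b}"~{distance}'
    backward = f'"{term_b} {term_a}"~{distance}'
    # Combining both orders with OR gives the same word distance in either direction.
    return f"{forward} OR {backward}"


print(within("tasmania", "chinese", 10))
```

This reproduces the final query in the table above.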

8.5. Using indexes

When you enter queries in the simple search box, or by using the q parameter in an API request, you’re searching across most metadata fields and any available full text. To control where and what you’re searching, you can specify the index you want Trove to use. For example, the query title:wragge will search only the title field for the term wragge.

Other indexes mentioned in Trove’s help documentation are:

  • subject

  • creator

  • issn

  • isbn

  • nuc

  • publictag

A more complete list of available indexes is provided in the API technical documentation.

Undocumented indexes include:

Table 8.5 Undocumented search indexes

| Index | Description | Example |
|---|---|---|
| series | Search for resources that are part of a collection | series:"Parliamentary paper (Australia. Parliament)" – find Parliamentary Papers |
| firstpageseq | Search for newspaper articles published on a specific page | firstpageseq:2 – find articles published on page two |

You can use many of the standard search operators with index queries. For example:

Table 8.6 Using search operators with indexes

| Query | Explanation |
|---|---|
| subject:history | Search for a keyword in the subject index |
| subject:(history weather) | Search for multiple keywords in the subject index |
| subject:(history NOT australia) | Search using boolean operators in subject index |
| subject:"Australian history" | Search for a phrase in the subject index |

Unlike regular searches, stemming is not applied by default to index searches. If you want to use stemming, there are separate stemmed indexes for creator, subject, and title: s_creator, s_subject, and s_title.
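As a rough illustration of how index queries are put together, the sketch below builds index queries and optionally switches to the stemmed version of an index. The function `index_query` is hypothetical, and it only knows about the three stemmed indexes named above.

```python
# Indexes that have a stemmed counterpart (s_creator, s_subject, s_title).
STEMMED_INDEXES = {"creator", "subject", "title"}


def index_query(index: str, value: str, stemmed: bool = False) -> str:
    """Build an index query such as subject:(history weather) (hypothetical helper)."""
    if stemmed:
        if index not in STEMMED_INDEXES:
            raise ValueError(f"no stemmed version of the {index} index is listed")
        index = f"s_{index}"
    # Group multiple keywords in parentheses unless the value is already
    # a quoted phrase or a parenthesised expression.
    if " " in value and not value.startswith(('"', "(")):
        value = f"({value})"
    return f"{index}:{value}"


print(index_query("subject", "history weather"))
print(index_query("subject", "history", stemmed=True))
print(index_query("subject", '"Australian history"'))
```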

There’s some overlap between indexes and facets. For example, there’s a format index and a format facet that both let you limit your search by format. However, indexes and facets behave differently – facets expect exact matches, while indexes are much more flexible. Also, you can use the NOT operator with indexes to exclude particular values. For example, to exclude books from your search you could add NOT format:Book to your query. There’s no way of doing this with facets.

Some indexes such as date and lastupdated expect a range of dates. Depending on the index and the category, the date values are either years or complete ISO formatted datetimes. For example:

Table 8.7 Using the date index

| Query | Explanation |
|---|---|
| date:[1901 TO 1904] | 1 January 1901 to 31 December 1904 |
| date:[* TO 1904] | up to and including 31 December 1904 |
| date:[1904 TO 1904] | 1 January 1904 to 31 December 1904 |
| date:[1942-10-31T00:00:00Z TO 1942-11-30T00:00:00Z] | 1 November 1942 to 30 November 1942 (newspapers only – dates need timezones, first date in range ignored) |
| date:[1942-11-09T00:00:00Z TO 1942-11-10T00:00:00Z] | 10 November 1942 (newspapers only – dates need timezones, first date in range ignored) |
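Since the first date in a newspaper date range is ignored, the start of the range needs to be one day earlier than the first day you actually want. The sketch below does that adjustment for you. The function name `newspaper_date_query` is made up for this example, and it assumes the behaviour shown in the two newspaper examples above.

```python
from datetime import date, timedelta


def newspaper_date_query(first_day: date, last_day: date) -> str:
    """Build a date index query covering first_day to last_day inclusive (hypothetical helper)."""
    # The first date in the range is ignored, so start one day early.
    start = first_day - timedelta(days=1)
    return (
        f"date:[{start.isoformat()}T00:00:00Z TO "
        f"{last_day.isoformat()}T00:00:00Z]"
    )


# A single day: 10 November 1942
print(newspaper_date_query(date(1942, 11, 10), date(1942, 11, 10)))
# A whole month: November 1942
print(newspaper_date_query(date(1942, 11, 1), date(1942, 11, 30)))
```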

For more information see Date searches

8.6. Using facets

Facets are a set of pre-determined values you can use to set limits on your search results. They allow you to take slices of your results.

In the web interface, facets appear as a set of check boxes next to the list of results. You just click the box next to a facet value to apply it to your search. You can only select one facet value at a time.

Fig. 8.1 Display of facets in the web interface

Facets vary by category, but a complete list is available in the API technical documentation.

To use facets to limit the results of your API query, you add an l-[FACET NAME] parameter and set it to your desired value. For example, to limit a search of newspaper articles to those published in the Sydney Morning Herald, you add the l-title parameter and set it to 35 (the title identifier for the SMH).
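Here is a minimal sketch of how such a request URL might be assembled in Python. The endpoint URL and the `category` and `encoding` parameters are assumptions based on version 3 of the Trove API, so check them against the API technical documentation. The `build_search_url` function is hypothetical, and a real request would also need to supply an API key, which isn’t shown here.

```python
from urllib.parse import urlencode

# Assumed endpoint for version 3 of the Trove API; check the API documentation.
API_URL = "https://api.trove.nla.gov.au/v3/result"


def build_search_url(query, category, facets=()):
    """Build a search URL with optional facet limits (hypothetical helper)."""
    # category and encoding are assumed parameter names, not taken from this page.
    params = [("q", query), ("category", category), ("encoding", "json")]
    # Each facet becomes an l-[FACET NAME] parameter; a facet can be repeated.
    params += [(f"l-{name}", value) for name, value in facets]
    return f"{API_URL}?{urlencode(params)}"


# Limit a newspaper search to the Sydney Morning Herald (title id 35).
print(build_search_url("wragge", "newspaper", [("title", "35")]))
```

Passing the facets as a list of pairs, rather than a dictionary, means the same facet can appear more than once in a request.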

When you use the API you can apply multiple facet values. However, facet fields don’t all behave the same way when you select multiple values. In some cases, you’ll get back the sum of the requested slices, but in most you’ll only get the intersection of the slices.

For example, if you use the state facet to request newspaper articles from both Victoria and NSW, you get back articles from either Victoria or NSW.

Table 8.8 Results for state facet combinations

| Facet | Results |
|---|---|
| l-state=Victoria | 48,133,262 |
| l-state=New South Wales | 91,338,016 |
| l-state=Victoria&l-state=New South Wales | 139,471,278 |

On the other hand, if you use the category facet to request articles in the Article and Advertising categories, you’ll only get articles that are in both categories.

Table 8.9 Results for category facet combinations

| Facet | Results |
|---|---|
| l-category=Article | 173,230,640 |
| l-category=Advertising | 47,033,067 |
| l-category=Article&l-category=Advertising | 6,203 |

User-added categories

You might be thinking that the final result above should be zero, as newspaper articles are assigned to a single category – how can an article be in both the Article and Advertising categories? The answer is that Trove users can add extra categories to articles, and these user-added values are included in the facet counts. There doesn’t seem to be any way to exclude these values, so it’s something else to keep in mind if you’re working with the data!
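If you’re unsure how a particular facet behaves when you request multiple values, you can compare the result counts. The sketch below applies this check to the state and category counts reported in this section. The function `combination_behaviour` is invented for illustration, and the test is only a rough heuristic: a combined total equal to the sum of the parts suggests the slices are being added together, while a total no larger than the smaller part suggests an intersection.

```python
def combination_behaviour(count_a: int, count_b: int, combined: int) -> str:
    """Guess how two facet values were combined from their result counts (hypothetical helper)."""
    if combined == count_a + count_b:
        # The slices don't overlap and have been added together.
        return "sum"
    if combined <= min(count_a, count_b):
        # Only results that match both values are returned.
        return "intersection"
    return "unclear"


# State facet: Victoria, New South Wales, and both together
print(combination_behaviour(48_133_262, 91_338_016, 139_471_278))
# Category facet: Article, Advertising, and both together
print(combination_behaviour(173_230_640, 47_033_067, 6_203))
```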