8. ‘Simple’ search options
Learn about constructing searches in Trove, including the use of indexes and facets. Includes a variety of tips and tricks, focusing on undocumented or potentially confusing aspects of the Trove search system.
8.1. Simple search isn’t!
The Trove web interface distinguishes between ‘Advanced’ and ‘Simple’ search. This is a bit misleading as you can construct complex queries using either. ‘Advanced’ search really just adds a structured interface over the ‘Simple’ search options. This Guide focuses on using ‘Simple’ search because it gives you more control, exposes more of the workings of the search index, and its queries can be easily translated to work with the Trove API.
See Constructing a complex search query in the Trove help system for an introduction to:
- boolean searches (use `AND`, `OR`, and `NOT` to combine search terms)
- phrase searches
- proximity searches (specify the number of words that can appear between search terms)
- some of the available indexes
It can also be useful to poke around the Solr query parser documentation. Solr is the indexing software used by Trove, so many of the query formats described will work in Trove.
Below you’ll find information on some of the undocumented and potentially confusing aspects of Trove search.
8.2. De-fuzzify your searches
By default, Trove adds a bit of fuzziness to your searches to try and ensure you get back some useful results. This includes:
- stemming of your search terms (this reduces words to their base form, for example `computer` becomes `comput`, matching ‘compute’, ‘computer’, ‘computing’ etc)
- allowing extra words in phrases (this is to match across line breaks where words are hyphenated)
- searching both full text (where available) and user-generated tags and comments
It’s possible to modify some of these settings by changing the format of your query. Here are some examples using a single search term:
| Query | Results | Explanation |
|---|---|---|
| | 5,892,614 | Searches article text, tags & comments (some fuzziness, terms are stemmed) |
| | 5,964,555 | Searches article text, tags & comments (more fuzziness, wildcard matches zero or more characters) |
| | 5,605,604 | Searches article text only (exact match, ignores tags & comments) |
| | 720,316 | Searches headlines only |
Similarly you can adjust the fuzziness of phrase searches.
| Query | Results | Explanation |
|---|---|---|
| | 33,138,404 | |
| | 437,656 | Same as `australia AND unlimited` |
| | 3,911 | Search for phrase (with stemming) |
| | 3,834 | Search for phrase (no stemming & ignores tags/comments) |
| | 2,873 | Search for phrase (with stemming, no extra words) |
| | 2,815 | Search for exact phrase (no extra words, no stemming, ignore tags/comments) |
8.3. Stemming oddities
As noted above, Trove stems your search terms, trimming them back to their base form. It seems that Trove uses the Porter stemming algorithm. If you’d like to see how stemming affects your query, you can use this online tool to test the results of the Porter algorithm.
I’ve noticed some oddities in the handling of `-ise` and `-ize` suffixes. For example:
| Query | Results | Explanation |
|---|---|---|
| | 250,586 | Stemmed to ‘naturalis’ |
| | 15,482,606 | Stemmed to ‘natur’ |
| | 132,840 | No stemming |
| | 24,732 | No stemming |
8.4. Proximity searches
The defuzzify examples above use the proximity modifier (`~`) to remove extra words from a query, but you can also use it to set a maximum distance between search terms. One thing to note is that the order of the search terms makes a difference to your results, as reversing the order of your terms counts as a ‘word’. For example:
| Query | Results | Explanation |
|---|---|---|
| | 279,705 | articles contain both terms |
| | 4,183 | articles where ‘tasmania’ is within 10 words of ‘chinese’ |
| | 4,198 | terms in reverse order are matched, but reversing counts towards the word distance so results can differ |
| | 4,702 | 10 word distance in either direction |
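One way to work around the ordering issue is to generate the query for both orders and combine them with `OR`. The sketch below assumes the standard Solr proximity syntax (`"term1 term2"~N`); the `proximity_query` helper is hypothetical and not part of any Trove library.

```python
def proximity_query(first, second, distance, either_order=False):
    """Build a Solr-style proximity query for two terms."""
    query = f'"{first} {second}"~{distance}'
    if either_order:
        # Repeat the query with the terms swapped, so that reversing the
        # order doesn't use up part of the allowed word distance
        query = f'{query} OR "{second} {first}"~{distance}'
    return query


one_way = proximity_query("tasmania", "chinese", 10)
both_ways = proximity_query("tasmania", "chinese", 10, either_order=True)
print(one_way)
print(both_ways)
```

The resulting strings can be pasted into the simple search box or passed as the `q` parameter of an API request.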
8.5. Using indexes
When you enter queries in the simple search box, or by using the `q` parameter in an API request, you’re searching across most metadata fields and any available full text. To control where and what you’re searching, you can specify the index you want Trove to use. For example, the query `title:wragge` will search only the `title` field for the term `wragge`.
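As a rough sketch of how an index query might be passed to the API, the snippet below builds a request URL with the `q` parameter. The endpoint URL and the `category` parameter are assumptions based on version 3 of the Trove API, and `build_search_url` is a hypothetical helper; check the API technical documentation for the version you’re using. You’d also need to supply your own API key when actually sending the request.

```python
from urllib.parse import urlencode

# Assumed endpoint for version 3 of the Trove API; adjust for your API version
API_URL = "https://api.trove.nla.gov.au/v3/result"


def build_search_url(query, category="newspaper"):
    """Return a search URL with the query passed in the `q` parameter."""
    params = {"category": category, "q": query}
    return f"{API_URL}?{urlencode(params)}"


# Search only the title index for the term 'wragge'
url = build_search_url("title:wragge")
print(url)
```

Note that `urlencode` percent-encodes the colon, so the index query is safe to include in the URL.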
Other indexes mentioned in Trove’s help documentation are:
- `subject`
- `creator`
- `issn`
- `isbn`
- `nuc`
- `publictag`
A more complete list of available indexes is provided in the API technical documentation.
Undocumented indexes include:
| Index | Description | Example |
|---|---|---|
| | | |
| | Search for newspaper articles published on a specific page | |
You can use many of the standard search operators with index queries. For example:
| Query | Explanation |
|---|---|
| | Search for a keyword in the |
| | Search for multiple keywords in the |
| | Search using boolean operators in |
| | Search for a phrase in the |
Unlike regular searches, index searches don’t apply stemming by default. If you want to use stemming, there are separate stemmed indexes for creator, subject, and title: `s_creator`, `s_subject`, and `s_title`.
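If you’re building queries in code, switching between the exact and stemmed versions of an index can be reduced to a small lookup. This is only a sketch: the `index_query` helper is hypothetical, and it knows about nothing beyond the three stemmed indexes listed above.

```python
# Indexes that have a stemmed equivalent, as listed above
STEMMED_INDEXES = {
    "creator": "s_creator",
    "subject": "s_subject",
    "title": "s_title",
}


def index_query(index, term, stemmed=False):
    """Build an `index:term` query, optionally using the stemmed index."""
    if stemmed:
        if index not in STEMMED_INDEXES:
            raise ValueError(f"No stemmed index known for '{index}'")
        index = STEMMED_INDEXES[index]
    return f"{index}:{term}"


exact = index_query("title", "computer")
fuzzy = index_query("title", "computer", stemmed=True)
print(exact)
print(fuzzy)
```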
There’s some overlap between indexes and facets. For example, there’s a `format` index and a `format` facet that both let you limit your search by format. However, indexes and facets behave differently – facets expect exact matches, while indexes are much more flexible. Also, you can use the `NOT` operator with indexes to exclude particular values. For example, to exclude books from your search you could add `NOT format:Book` to your query. There’s no way of doing this with facets.
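A minimal sketch of adding exclusions to an existing query. The `exclude_values` helper is hypothetical, and it assumes single-word values that don’t need quoting.

```python
def exclude_values(query, index, values):
    """Append a `NOT index:value` clause to the query for each value."""
    clauses = [f"NOT {index}:{value}" for value in values]
    return " ".join([query, *clauses])


# Search for 'wragge' but leave out books
no_books = exclude_values("wragge", "format", ["Book"])
print(no_books)
```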
Some indexes such as `date` and `lastupdated` expect a range of dates. Depending on the index and the category, the date values are either years or complete ISO formatted datetimes. For example:
| Query | Explanation |
|---|---|
| | 1 January 1901 to 31 December 1904 |
| | before 31 December 1904 |
| | 1 January 1904 to 31 December 1904 |
| | 1 November 1942 to 30 November 1942 (newspapers only – dates need timezones, first date in range ignored) |
| | 10 November 1942 (newspapers only – dates need timezones, first date in range ignored) |
For more information see Date searches
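Because the first date in a newspaper date range is ignored, the range has to start one day earlier than the first day you actually want. The sketch below assumes the Solr range syntax (`index:[START TO END]`) and UTC (`Z`) timestamps; the `newspaper_date_range` helper is hypothetical, so check the generated queries against the Date searches documentation before relying on them.

```python
from datetime import date, timedelta


def newspaper_date_range(first_day, last_day):
    """Build a `date` range query covering first_day to last_day inclusive."""
    # The first date in the range is ignored, so start one day earlier
    start = first_day - timedelta(days=1)
    return (
        f"date:[{start.isoformat()}T00:00:00Z "
        f"TO {last_day.isoformat()}T00:00:00Z]"
    )


# All of November 1942
november = newspaper_date_range(date(1942, 11, 1), date(1942, 11, 30))
# A single day: 10 November 1942
single_day = newspaper_date_range(date(1942, 11, 10), date(1942, 11, 10))
print(november)
print(single_day)
```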
8.6. Using facets
Facets are a set of pre-determined values you can use to set limits on your search results. They allow you to take slices of your results.
In the web interface, facets appear as a set of check boxes next to the list of results. You just click the box next to a facet value to apply it to your search. You can only select one facet value at a time.
Facets vary by category, but a complete list is available in the API technical documentation.
To use facets to limit the results of your API query, you add an `l-[FACET NAME]` parameter and set it to your desired value. For example, to limit a search of newspaper articles to those published in the Sydney Morning Herald, you add the `l-title` parameter and set it to `35` (the title identifier for the SMH).
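Here’s a small sketch of how the facet parameters might be assembled in code. The `facet_params` helper, the `wragge` search term, and the `category` parameter are illustrative assumptions; only the `l-` prefix and the `l-title` value of `35` come from the description above.

```python
from urllib.parse import urlencode


def facet_params(query, category, **facets):
    """Build request parameters, prefixing each facet name with `l-`."""
    params = {"category": category, "q": query}
    for name, value in facets.items():
        params[f"l-{name}"] = value
    return params


# Limit a newspaper search to the Sydney Morning Herald (title id 35)
params = facet_params("wragge", "newspaper", title="35")
query_string = urlencode(params)
print(query_string)
```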
When you use the API you can apply multiple facet values. However, facet fields don’t all behave the same way when you select multiple values. In some cases, you’ll get back the sum of the requested slices, but in most you’ll only get the intersection of the slices.
For example, if you use the `state` facet to request newspaper articles from both Victoria and NSW, you get back articles from either Victoria or NSW.
| Facet | Results |
|---|---|
| | 48,133,262 |
| | 91,338,016 |
| | 139,471,278 |
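The figures in the table are consistent with this: the two single-state counts add up exactly to the combined count, which is what you’d expect if the slices don’t overlap and are combined with `OR`. The snippet below runs that check and sketches how multiple facet values might be encoded in a request. Repeating the `l-state` parameter once per value, and the state names used as values, are assumptions to check against the API technical documentation.

```python
from urllib.parse import urlencode

# Counts from the table above: two single-state searches and the combined search
single_state_counts = [48_133_262, 91_338_016]
combined_count = 139_471_278

# If the slices are disjoint and combined with OR, the counts should add up
counts_add_up = sum(single_state_counts) == combined_count
print(counts_add_up)

# Assumed encoding of multiple facet values: repeat the parameter once per value
params = {"category": "newspaper", "l-state": ["Victoria", "New South Wales"]}
query_string = urlencode(params, doseq=True)
print(query_string)
```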
On the other hand, if you use the `category` facet to request articles in the `Article` and `Advertising` categories, you’ll only get articles that are in both categories.
| Facet | Results |
|---|---|
| | 173,230,640 |
| | 47,033,067 |
| | 6,203 |
User added categories
You might be thinking that the final result above should be zero, as newspaper articles are assigned to a single category – how can an article be in both the `Article` and `Advertising` categories? The answer is that Trove users can add extra categories to articles, and these user-added values are included in the facet counts. There doesn’t seem to be any way to exclude these values, so it’s something else to keep in mind if you’re working with the data!