🚧 This is a working draft and will change often. Do not cite!
Use the latest published version instead.
🚧

27.1. Data sources

This page describes options for obtaining text from Trove.

Documentation

These sections of the Trove Data Guide explain how to access text from different parts of Trove:

Pre-harvested datasets

The GLAM Workbench provides a number of datasets containing OCRd text harvested from Trove.

OCRd text from Trove books and ephemera

A harvest of 26,762 files of OCRd text from digitised books and ephemera in Trove.

OCRd text from Trove digitised journals

This dataset contains OCRd text and metadata harvested from digitised periodicals in Trove.

Press releases relating to refugees

This dataset contains metadata and full text of items from the Parliamentary Library’s press releases collection that include the term ‘refugees’ (or a number of related terms).

Press releases relating to COVID

This dataset contains metadata and full text of items from the Parliamentary Library’s press releases collection that include the term ‘covid’ or ‘coronavirus’.

Creating datasets

These tools and examples can help you create your own collections of text from Trove.

GLAM Workbench notebooks

These GLAM Workbench notebooks can help you harvest text from different parts of Trove.

Trove Newspaper Harvester

The Trove Newspaper & Gazette Harvester makes it easy to download large quantities of digitised articles from Trove’s newspapers and gazettes.

Get OCRd text from a digitised journal in Trove

Many of the digitised periodicals available in Trove make OCRd text available for download. This notebook helps you download all the OCRd text from a single periodical – one text file for each issue.

Download summaries and transcripts from oral histories

If oral histories have summaries or transcripts, they can be downloaded as text or PDF files using their nla.obj identifiers. This notebook downloads all the available transcripts and summaries from digitised oral histories available in Trove.

Harvesting the text of digitised books (and ephemera)

This notebook harvests metadata and OCRd text from digitised works in Trove’s book zone.

Harvesting collections of text from archived web pages

This notebook helps you assemble datasets of text extracted from all available captures of archived web pages. You can then feed these datasets to the text analysis tool of your choice to analyse changes over time.

Harvest parliament press releases from Trove

Trove includes more than 380,000 press releases, speeches, and interview transcripts issued by Australian federal politicians and saved by the Parliamentary Library. This notebook shows you how to harvest both metadata and full text from a search of the parliamentary press releases.
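
Several of the notebooks above save text as a folder of plain text files, for example one file per journal issue. As a minimal sketch of what you might do next, the function below counts word frequencies across every text file in a folder. It uses only the Python standard library. The function name `count_words`, the `*.txt` file pattern, and the simple letters-only tokenisation are assumptions made for illustration; they are not part of any GLAM Workbench notebook.

```python
import re
from collections import Counter
from pathlib import Path


def count_words(text_dir, pattern="*.txt"):
    """Count word frequencies across every matching text file in a folder."""
    counts = Counter()
    for path in sorted(Path(text_dir).glob(pattern)):
        # OCRd text can contain odd bytes, so replace anything undecodable
        text = path.read_text(encoding="utf-8", errors="replace")
        # Lowercase the text and treat runs of letters/apostrophes as words
        counts.update(re.findall(r"[a-z']+", text.lower()))
    return counts
```

Calling `count_words("my-harvest").most_common(20)` would list the twenty most frequent words, where `my-harvest` is a hypothetical folder of harvested text files. OCR errors will appear as spurious words, so treat the counts as a rough guide.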

Software packages#

trove-newspaper-harvester

The Trove Newspaper (& Gazette) Harvester makes it easy to download large quantities of digitised articles from Trove’s newspapers and gazettes. Just give it a search from the Trove web interface, and the harvester will save the metadata of all the articles in a CSV (spreadsheet) file for further analysis. You can also save the full text of every article, as well as copies of the articles as JPG images, and even PDFs.
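
Once the harvester has saved article metadata to a CSV file, the results can be summarised with a few lines of Python. The sketch below counts articles per year using only the standard library. It assumes the CSV has a column named `date` holding ISO-style dates (`YYYY-MM-DD`); check the header row of your own harvest and adjust `date_column` if it differs. The function name `articles_per_year` is made up for this example and is not part of the trove-newspaper-harvester package.

```python
import csv
from collections import Counter


def articles_per_year(csv_path, date_column="date"):
    """Count rows per year in a CSV of harvested article metadata."""
    years = Counter()
    with open(csv_path, newline="", encoding="utf-8") as csv_file:
        for row in csv.DictReader(csv_file):
            date = (row.get(date_column) or "").strip()
            # Assumes ISO-style dates, so the year is the first four characters
            if len(date) >= 4 and date[:4].isdigit():
                years[date[:4]] += 1
    # Return a plain dict ordered by year
    return dict(sorted(years.items()))
```

For example, `articles_per_year("results.csv")` would return a dictionary mapping each year to the number of harvested articles, where `results.csv` is an assumed file name. Rows with a missing or malformed date are skipped rather than raising an error.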