Data sources

27.1. Data sources

This page describes options for obtaining text from Trove.

Documentation

These sections of the Trove Data Guide explain how to access text from different parts of Trove:

- OCRd text from digitised newspapers
  - Articles
  - Pages
  - Issues
  - Titles
- OCRd text from digitised periodicals
- Transcripts and summaries from oral histories

Pre-harvested datasets

The GLAM Workbench provides several datasets containing OCRd text harvested from Trove.

OCRd text from Trove books and ephemera

A harvest of 26,762 files of OCRd text from digitised books and ephemera in Trove.

OCRd text from Trove digitised journals

This dataset contains OCRd text and metadata harvested from digitised periodicals in Trove.

Press releases relating to refugees

This dataset contains metadata and full text of items from the Parliamentary Library’s press releases collection that include the term ‘refugees’ (or a number of related terms).

Press releases relating to COVID

This dataset contains metadata and full text of items from the Parliamentary Library’s press releases collection that include the term ‘covid’ or ‘coronavirus’.

Creating datasets

These tools and examples can help you create your own collections of text from Trove.

GLAM Workbench notebooks

These notebooks help you harvest text from different parts of Trove.

- Trove Newspaper Harvester: The Trove Newspaper & Gazette Harvester makes it easy to download large quantities of digitised articles from Trove’s newspapers and gazettes.
- Get OCRd text from a digitised journal in Trove: Many of the digitised periodicals available in Trove make OCRd text available for download. This notebook helps you download all the OCRd text from a single periodical – one text file for each issue.
- Download summaries and transcripts from oral histories: If oral histories have summaries or transcripts, they can be downloaded as text or PDF files using their nla.obj identifiers. This notebook downloads all the available transcripts and summaries from digitised oral histories available in Trove.
- Harvesting the text of digitised books (and ephemera): This notebook harvests metadata and OCRd text from digitised works in Trove’s book zone.
- Harvesting collections of text from archived web pages: This notebook helps you assemble datasets of text extracted from all available captures of archived web pages. You can then feed these datasets to the text analysis tool of your choice to analyse changes over time.
- Harvest parliament press releases from Trove: Trove includes more than 380,000 press releases, speeches, and interview transcripts issued by Australian federal politicians and saved by the Parliamentary Library. This notebook shows you how to harvest both metadata and full text from a search of the parliamentary press releases.
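
Several of these notebooks save their results as plain text files, for example one file for each issue of a periodical. As a minimal sketch of a possible next step, the functions below load a folder of text files and report a rough word count for each. The folder layout, the `.txt` extension, and the function names `load_text_files` and `count_words` are assumptions made for illustration; they are not defined by any of the notebooks, so adjust them to match your own harvest.

```python
from pathlib import Path


def load_text_files(folder):
    """Read every .txt file in a folder into a dict keyed by file name (without extension)."""
    texts = {}
    for path in sorted(Path(folder).glob("*.txt")):
        # errors="replace" keeps going if a file contains bytes that are not valid UTF-8
        texts[path.stem] = path.read_text(encoding="utf-8", errors="replace")
    return texts


def count_words(texts):
    """Return a whitespace-delimited word count for each text."""
    return {name: len(text.split()) for name, text in texts.items()}
```

Splitting on whitespace is a deliberately crude measure. OCRd text often contains broken or run-together words, so treat the counts as approximate.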

Software packages

trove-newspaper-harvester: The Trove Newspaper (& Gazette) Harvester makes it easy to download large quantities of digitised articles from Trove’s newspapers and gazettes. Just give it a search from the Trove web interface, and the harvester will save the metadata of all the articles in a CSV (spreadsheet) file for further analysis. You can also save the full text of every article, as well as copies of the articles as JPG images, and even PDFs.
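
Once the harvester has saved its CSV file, the metadata can be explored with ordinary tools. As a rough sketch, the function below counts articles per year using only Python's standard library. It assumes the CSV has a `date` column whose values begin with a four-digit year (such as `1915-04-25`); this column name is an assumption, so check the header row of your own harvest. The function name `articles_per_year` is hypothetical and is not part of the harvester package.

```python
import csv
from collections import Counter


def articles_per_year(csv_file, date_column="date"):
    """Count rows per year in a CSV of article metadata.

    Assumes `date_column` holds values that start with a four-digit year,
    e.g. 1915-04-25. Rows with a missing or malformed date are skipped.
    """
    counts = Counter()
    for row in csv.DictReader(csv_file):
        year = (row.get(date_column) or "")[:4]
        if len(year) == 4 and year.isdigit():
            counts[year] += 1
    # Sort by year so the result reads chronologically
    return dict(sorted(counts.items()))
```

To use it, open your CSV with `open(path, newline="", encoding="utf-8")` and pass the resulting file object to the function.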
