🚧 This is a working draft and will change often. Do not cite!
Use the latest published version instead. 🚧

29.1. Data sources

This page describes options for obtaining collection and system data from Trove.

Documentation

These sections of the Trove Data Guide explain how to access collection and system data from different parts of Trove:

Metadata from digitised newspapers
- Articles
- Pages
- Issues
- Titles
Metadata from digitised periodicals
Metadata from oral histories

Pre-harvested datasets

The GLAM Workbench provides a number of datasets containing collection and system data harvested from Trove.

First appearance of newspaper titles harvested from web archives

CSV formatted dataset containing details of the first appearance of newspaper titles in web archive captures, indicating when the titles were (approximately) added to Trove. The complete list of captures has been filtered to include only the first appearance of each title / place / date range combination.
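The filtering step described here can be sketched in Python. This is a minimal illustration rather than the code used to build the dataset; the field names (`title`, `place`, `date_range`, `capture_date`) and the sample rows are assumptions made for the example.

```python
def first_appearances(captures):
    """Keep only the earliest capture for each title / place / date range combination."""
    earliest = {}
    # Sorting by capture date means the first row seen for each key is the earliest
    for row in sorted(captures, key=lambda r: r["capture_date"]):
        key = (row["title"], row["place"], row["date_range"])
        earliest.setdefault(key, row)
    return list(earliest.values())


# Hypothetical sample data; the real dataset's columns may differ
captures = [
    {"title": "The Example Times", "place": "Sydney",
     "date_range": "1900 - 1910", "capture_date": "2012-05-01"},
    {"title": "The Example Times", "place": "Sydney",
     "date_range": "1900 - 1910", "capture_date": "2010-03-15"},
    {"title": "The Example Times", "place": "Sydney",
     "date_range": "1900 - 1920", "capture_date": "2014-07-20"},
]

firsts = first_appearances(captures)
```

A change in the date range counts as a new combination, so the same title can appear more than once in the result as its coverage in Trove grows.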

CSV formatted list of Australian Women’s Weekly issues, 1933 to 1982

This CSV formatted file includes metadata for 2,566 issues of the Australian Women’s Weekly from 1933 to 1982.

List of Trove newspapers with non-English language content

Markdown formatted list of newspapers with non-English content created by applying language detection tools to a sample of articles.

Trove newspapers with articles published after 1954

CSV formatted dataset containing a list of digitised newspapers in Trove with articles published after 1954 (the copyright cliff of death).

CSV formatted list of digitised books in Trove

This file provides metadata of digitised works with the format Book.

List of organisations contributing metadata to Trove

This is a flattened version of the contributors data available from the Trove API. It is harvested weekly.
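To illustrate what "flattening" might involve, here is a small sketch that walks a nested list of contributors and yields one flat record for each. The nested structure (`id`, `name`, `children`) is an assumption made for the example, and the real API response may be shaped differently.

```python
def flatten_contributors(contributors, parent_id=None):
    """Yield one flat record per contributor, walking nested children depth-first."""
    for contributor in contributors:
        yield {
            "id": contributor["id"],
            "name": contributor["name"],
            "parent_id": parent_id,
        }
        # Recurse into any nested child organisations
        yield from flatten_contributors(
            contributor.get("children", []), contributor["id"]
        )


# Hypothetical nested data, not real Trove contributors
nested = [
    {"id": "AAA", "name": "Example University", "children": [
        {"id": "AAA:LIB", "name": "Example University Library"},
    ]},
    {"id": "BBB", "name": "Example Museum"},
]

flat = list(flatten_contributors(nested))
```

Keeping a `parent_id` on each record preserves the hierarchy, so the flat list can still be regrouped later.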

Count of records by contributor and category

This dataset was created by searching for each contributor’s NUC code in each Trove category, giving a count of records by contributor and category. It is harvested weekly.
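A search of this kind might be constructed as in the sketch below. It only builds the request URL and does not call the API. The endpoint, the parameter names, and the `music` category code are assumptions based on version 3 of the Trove API, and a real request would also need an API key.

```python
from urllib.parse import urlencode

API_URL = "https://api.trove.nla.gov.au/v3/result"  # assumed endpoint


def build_count_query(nuc, category):
    """Build a search URL that asks for a NUC's record total in one category."""
    params = {
        "q": f'nuc:"{nuc}"',  # quote the NUC so codes containing colons stay together
        "category": category,
        "encoding": "json",
        "n": 0,  # only the total is wanted, so request no records
    }
    return f"{API_URL}?{urlencode(params)}"


# Example: the ABC Radio National NUC in the (assumed) Music & Audio category code
url = build_count_query("ABC:RN", "music")
```

Repeating the request for every contributor and category combination would produce the counts, at the cost of a large number of API calls.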

Digitised Parliamentary Papers in Trove

This dataset contains metadata describing Commonwealth Parliamentary Papers that have been digitised and are made available through Trove.

Details of periodicals submitted to Trove through the National edeposit scheme (NED)

This dataset contains details of periodical titles and issues submitted to Trove through the NLA’s National edeposit scheme. It includes CSV-formatted lists of titles and issues, and an SQLite database created for use with Datasette-Lite.
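As a rough sketch of how CSV lists of titles and issues could be loaded into SQLite, the example below uses only Python's standard library. The column names and rows are hypothetical, and this is not the code used to build the published database.

```python
import csv
import io
import sqlite3

# Hypothetical CSV content; the real files have their own columns
titles_csv = "id,title\nt1,Example Gazette\nt2,Example Review\n"
issues_csv = "id,title_id,date\ni1,t1,2020-01-01\ni2,t1,2020-02-01\ni3,t2,2021-06-30\n"


def load_csv(conn, table, text):
    """Create a table from CSV text, using the header row as column names."""
    rows = list(csv.reader(io.StringIO(text)))
    header, data = rows[0], rows[1:]
    columns = ", ".join(f'"{name}"' for name in header)
    placeholders = ", ".join("?" for _ in header)
    conn.execute(f'CREATE TABLE "{table}" ({columns})')
    conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', data)


conn = sqlite3.connect(":memory:")  # use a file path instead to keep the database
load_csv(conn, "titles", titles_csv)
load_csv(conn, "issues", issues_csv)

# Count the issues recorded for each title
counts = conn.execute(
    """
    SELECT titles.title, COUNT(issues.id)
    FROM titles LEFT JOIN issues ON issues.title_id = titles.id
    GROUP BY titles.id
    ORDER BY titles.title
    """
).fetchall()
```

Saving to a file rather than `:memory:` produces a database that tools such as Datasette-Lite can open.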

Details of digitised periodicals from the /magazine/titles API endpoint

This dataset was created by checking, correcting, and enriching data about digitised periodicals obtained from the Trove API. Additional metadata describing periodical titles and issues was extracted from the Trove website and used to check the API results. Where titles were wrongly described as issues, and vice versa, the records were corrected.

Trove lists metadata

CSV formatted file containing a complete harvest of metadata describing user-created Trove lists.

Trove public tags

This dataset contains details of 2,495,958 unique public tags added to 10,403,650 resources in Trove between August 2008 and June 2024. It is saved as a CSV file.

Trove tag counts

CSV formatted file containing the total number of times each tag in Trove has been applied to resources.
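Totals like these could be derived from a file that has one row for each application of a tag. The sketch below assumes a `tag` column and uses invented rows, so treat it as an illustration rather than a description of the published files.

```python
import csv
import io
from collections import Counter

# Hypothetical rows: one line per tag applied to a record
sample = "tag,record_id\nfamily history,1\nshipwrecks,2\nfamily history,3\n"

reader = csv.DictReader(io.StringIO(sample))
# Count how many times each tag value appears
tag_counts = Counter(row["tag"] for row in reader)

most_used = tag_counts.most_common(1)
```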

NLA oral histories metadata

This dataset contains metadata describing oral histories held by the National Library of Australia. The metadata was harvested from Trove and includes details of both digitised, and not digitised, oral histories.

List of NLA oral history collections and projects

This dataset contains a list of collection and project names extracted from the metadata of oral histories held by the NLA. The metadata was harvested from Trove and includes details of both digitised, and not digitised, oral histories.

Harvest of ABC Radio National metadata

The full harvest of ABC Radio National program metadata, containing more than 400,000 records.

Rights applied to images by each Trove contributor

This dataset includes information about the application of licences and rights statements to images by Trove contributors.

Pandora collections data

This dataset contains details of the subject and collection groupings used by Pandora to organise archived web resource titles.

Pandora titles data

This dataset contains a complete list of Pandora’s archived web resource titles.

NLA digitised finding aids: list of urls

A list of urls pointing to the National Library of Australia’s digitised manuscript finding aids, harvested from Trove.

NLA digitised finding aids: summary information

This dataset includes summary information describing each finding aid.

Creating datasets

These tools and examples can help you create your own datasets of collection and system data from Trove.

GLAM Workbench notebooks

Gathering historical data about the addition of newspaper titles to Trove: The number of digitised newspapers available through Trove has increased dramatically since 2009. Understanding when newspapers were added is important for historiographical purposes, but there’s no data about this available directly from Trove. This notebook uses web archives to extract lists of newspapers in Trove over time, and chart Trove’s development.
Harvest information about newspaper issues: When you search Trove’s newspapers, you find articles – these articles are grouped by page, and all the pages from a particular date make up an issue. But how do you find out what issues are available? On what dates were newspapers published? This notebook shows how you can get information about issues from the Trove API.
Get the page coordinates of a digitised newspaper article from Trove: This notebook demonstrates how to find the coordinates of a newspaper article on a digitised page.
Harvest details of Commonwealth Parliamentary Papers digitised in Trove: Trove includes thousands of digitised papers and reports presented to the Commonwealth Parliament. However, finding all the Parliamentary Papers is not straightforward because of inconsistencies in the way they’ve been arranged and described. This notebook attempts to work around these problems and harvest data about Parliamentary Papers in Trove that is as complete as possible.
Get details of periodicals from the /magazine/titles API endpoint: This notebook uses the /magazine/titles endpoint of the Trove API to get details of digitised periodical titles and issues. It then tries to fix some problems with the data by removing duplicates and Parliamentary Papers, and checking the lists of issues against those scraped from the Trove website.
Enrich the list of periodicals from the Trove API: This notebook tries to fix some problems with the periodicals data from the Trove API. It also enriches the harvested data by extracting additional information from the website. It creates two datasets – one for titles and one for issues – and loads these into an SQLite database for use with Datasette Lite.
Harvest details of periodicals submitted to Trove through the National edeposit scheme (NED): This notebook harvests details of periodicals submitted to Trove through the National edeposit scheme (NED). It creates two datasets, one containing details of the periodical titles, and the other listing all the available issues.
Harvest summary data from Trove lists: Use the Trove API to harvest data about all public lists, then extract some summary data and explore a few different techniques to analyse the complete dataset.
Harvest public tags from Trove zones: This notebook harvests all the public tags that users have added to records in Trove. However, tags are being added all the time, so by the time you’ve finished harvesting, the dataset will probably be out of date.
Harvest oral histories metadata: Harvests metadata describing the NLA’s oral history collection from Trove and saves the results as a CSV file.
Save a list of oral history collections and projects: Extracts a list of series from metadata describing oral histories held by the NLA and described in Trove.
Harvest ABC Radio National records from Trove: Trove harvests details of programs and segments broadcast on ABC Radio National. You can find them by searching for nuc:"ABC:RN" in the Music & Audio category. The records include basic metadata such as titles, dates, and contributors, but not full transcripts or audio. This notebook harvests, cleans, and saves all the available Radio National data from Trove.
Create title datasets from collections and subjects: This notebook helps you create a dataset of archived urls using Pandora’s subject and collection groupings.
Harvest Pandora subjects and collections: This notebook harvests Pandora’s navigation hierarchy, saving the connections between subjects, collections, and titles.
Harvest the full collection of Pandora titles: This notebook harvests a complete collection of archived web page titles from Pandora, the National Library of Australia’s selective web archive.
Find urls of digitised finding aids: This notebook uses the Trove API to harvest urls of NLA digitised finding aids from a search in the collection zone.
Collect information about digitised finding aids: This notebook works through a list of urls pointing to NLA’s digitised finding aids, extracting additional information about each one.
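As one illustration of the kind of request the notebooks above make, the sketch below builds a URL asking for the issues of a newspaper title within a single year, then pulls issue dates out of a response. The endpoint path, the `include` and `range` parameters, and the response shape are assumptions based on version 3 of the Trove API; the title identifier and sample data are invented, and a real request would also need an API key.

```python
from urllib.parse import urlencode

API_BASE = "https://api.trove.nla.gov.au/v3"  # assumed base URL


def issue_request_url(title_id, year):
    """Build a URL asking for the issues of one newspaper title in one year."""
    params = {
        "encoding": "json",
        "include": "years",
        "range": f"{year}0101-{year}1231",  # assumed YYYYMMDD-YYYYMMDD format
    }
    return f"{API_BASE}/newspaper/title/{title_id}?{urlencode(params)}"


def issue_dates(title_data):
    """Collect issue dates from a response shaped like the sample below."""
    return [
        issue["date"]
        for year in title_data.get("year", [])
        for issue in year.get("issue", [])
    ]


# Hypothetical response fragment, not real Trove data
sample = {
    "year": [
        {"date": "1930", "issue": [{"date": "1930-01-01"}, {"date": "1930-01-02"}]},
        {"date": "1931"},  # a year with no issue details included
    ]
}

url = issue_request_url("35", 1930)
dates = issue_dates(sample)
```

Requesting one year at a time keeps each response small; looping over a title's full date range would build up a complete list of issues.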

Software packages

trove-newspaper-harvester: The Trove Newspaper (& Gazette) Harvester makes it easy to download large quantities of digitised articles from Trove’s newspapers and gazettes. Just give it a search from the Trove web interface, and the harvester will save the metadata of all the articles in a CSV (spreadsheet) file for further analysis. You can also save the full text of every article, as well as copies of the articles as JPG images, and even PDFs.