selector_to_html = {"a[href=\"#check-for-missing-records\"]": "<h3 class=\"tippy-header\" style=\"margin-top: 0;\">Check for \u2018missing\u2019 records<a class=\"headerlink\" href=\"#check-for-missing-records\" title=\"Link to this heading\">#</a></h3><p>Some of the records in the dataset might represent <em>parts</em> of resources, such as the sections of a Parliamentary Paper. You\u2019d expect there to be separate results for the parent records, but I\u2019ve found this is not always the case \u2013 the parent records are missing. In the previous processing step you can add the identifiers for any parent resources from the metadata embedded in the digital object viewer. You can then check all the parent identifiers to make sure they\u2019re already included in the dataset. Something like this:</p>", "a[href=\"get-collection-items.html\"]": "<h1 class=\"tippy-header\" style=\"margin-top: 0;\"><span class=\"section-number\">25.3. </span>HOW TO: Get a list of items from a digitised collection<a class=\"headerlink\" href=\"#how-to-get-a-list-of-items-from-a-digitised-collection\" title=\"Link to this heading\">#</a></h1><h2><span class=\"section-number\">25.3.1. </span>Background<a class=\"headerlink\" href=\"#background\" title=\"Link to this heading\">#</a></h2><p>The NLA\u2019s digitised resources are often presented as \u2018collections\u2019. A collection could be the volumes in a multi-volume work, the issues of a periodical, a map series, an album of photographs, or a manuscript collection. In the web interface, collections will have a \u2018Browse this collection\u2019 link or button that displays a list of the contents, but getting machine-readable data is not so straightforward. 
You can use the <code class=\"docutils literal notranslate\"><span class=\"pre\">magazine/title</span></code> API endpoint to get a list of issues from a periodical, but there\u2019s no way to get the contents of other types of collections from the Trove API.</p>", "a[href=\"#additional-processing\"]": "<h2 class=\"tippy-header\" style=\"margin-top: 0;\"><span class=\"section-number\">25.1.2. </span>Additional processing<a class=\"headerlink\" href=\"#additional-processing\" title=\"Link to this heading\">#</a></h2><p>Once you have a dataset that is as complete as possible, you might want to:</p>", "a[href=\"get-downloads.html\"]": "<h1 class=\"tippy-header\" style=\"margin-top: 0;\"><span class=\"section-number\">25.4. </span>HOW TO: Get text, images, and PDFs using Trove\u2019s download link<a class=\"headerlink\" href=\"#how-to-get-text-images-and-pdfs-using-troves-download-link\" title=\"Link to this heading\">#</a></h1><h2><span class=\"section-number\">25.4.1. </span>Background<a class=\"headerlink\" href=\"#background\" title=\"Link to this heading\">#</a></h2><p>You can download text, images, and PDFs from individual digitised items using the Trove web interface. But only the text of periodical articles is available for machine access through the Trove API. This makes it difficult to assemble datasets, or build processing pipelines involving digitised resources.</p><p>This page documents a workaround developed by reverse-engineering the download link used by the Trove web interface. You can use it to automate the download of text, images, and PDFs from many digitised resources.</p>", "a[href=\"../../what-is-trove/works-and-versions.html\"]": "<h1 class=\"tippy-header\" style=\"margin-top: 0;\"><span class=\"section-number\">3. </span>Works and versions<a class=\"headerlink\" href=\"#works-and-versions\" title=\"Link to this heading\">#</a></h1><h2><span class=\"section-number\">3.1. 
</span>Grouping versions into works<a class=\"headerlink\" href=\"#grouping-versions-into-works\" title=\"Link to this heading\">#</a></h2><p>The idea is simple enough \u2013 bring all the versions of a publication together under a single heading to simplify a user\u2019s search results. Instead of having to wade through a long list of near-identical entries, a user can quickly focus in on a title of interest, and drill down to find a specific version at a specific library. This idea is based on the <a class=\"reference external\" href=\"https://www.ifla.org/references/best-practice-for-national-bibliographic-agencies-in-a-digital-age/resource-description-and-standards/bibliographic-control/functional-requirements-the-frbr-family-of-models/functional-requirements-for-bibliographic-records-frbr/\">Functional Requirements for Bibliographic Records</a> (FRBR). The FRBR data model describes four entities: \u2018work\u2019, \u2018expression\u2019, \u2018manifestation\u2019, and \u2018item\u2019:</p>", "a[href=\"../../accessing-data/trove-api-intro.html\"]": "<h1 class=\"tippy-header\" style=\"margin-top: 0;\"><span class=\"section-number\">14. </span>Trove API introduction<a class=\"headerlink\" href=\"#trove-api-introduction\" title=\"Link to this heading\">#</a></h1><p>Use the Trove Application Programming Interface (API) to get direct access to Trove data. Just make a request  and get back data in a predictable, structured format that computers can understand.</p>", "a[href=\"#merge-remove-duplicates-from-dataset\"]": "<h3 class=\"tippy-header\" style=\"margin-top: 0;\">Merge/remove duplicates from dataset<a class=\"headerlink\" href=\"#merge-remove-duplicates-from-dataset\" title=\"Link to this heading\">#</a></h3><p>Duplicates exist at multiple levels amongst Trove\u2019s digitised resources. There can be more than one work record pointing to a single digitised object. 
Single works can also contain near-duplicate versions pointing to the same resource but including slightly different metadata. The processing steps above will expand all of these duplicates and near-duplicates out into individual records. The aim of this step is to deduplicate the records while preserving all the harvested metadata. The desired result is a dataset with one record for each fulltext url. If there are multiple values in any column, these need to be concatenated into a single list or value.</p>", "a[href=\"extract-embedded-metadata.html\"]": "<h1 class=\"tippy-header\" style=\"margin-top: 0;\"><span class=\"section-number\">25.2. </span>HOW TO: Extract additional metadata from the digitised resource viewer<a class=\"headerlink\" href=\"#how-to-extract-additional-metadata-from-the-digitised-resource-viewer\" title=\"Link to this heading\">#</a></h1><p>The viewers you use to examine digitised resources in Trove embed some metadata that isn\u2019t available through the Trove API. This includes a JSON-ified version of the item\u2019s MARC record (presumably copied from the NLA catalogue), as well as structural information used by the viewer itself, such as a list of pages in a digitised book.</p><p>This metadata can be useful in a number of different contexts. For example, you can extract the number of pages in a digitised book, then use this number to <a class=\"reference internal\" href=\"get-downloads.html\"><span class=\"doc std std-doc\">automatically download the full text or a PDF</span></a>. The GLAM Workbench includes an example where geospatial coordinates are extracted from the MARC data to add to a <a class=\"reference external\" href=\"https://glam-workbench.net/trove-maps/exploring-digitised-maps/\">harvest of digitised maps</a>.</p>", "a[href=\"download-images.html\"]": "<h1 class=\"tippy-header\" style=\"margin-top: 0;\"><span class=\"section-number\">25.5. 
</span>HOW TO: Create download links for images using <code class=\"docutils literal notranslate\"><span class=\"pre\">nla.obj</span></code> identifiers<a class=\"headerlink\" href=\"#how-to-create-download-links-for-images-using-nla-obj-identifiers\" title=\"Link to this heading\">#</a></h1><h2><span class=\"section-number\">25.5.1. </span>Introduction<a class=\"headerlink\" href=\"#introduction\" title=\"Link to this heading\">#</a></h2><p>Many of the resources digitised by the NLA and its partners are made up of images. These might be digitised copies of visual material like photos and maps, or scanned pages of print publications like books or periodicals. In Trove, each image or page has its own unique <code class=\"docutils literal notranslate\"><span class=\"pre\">nla.obj</span></code> identifier. You can use these identifiers to construct urls that lead directly to downloadable versions of the image file.</p>", "a[href=\"#outline-of-harvesting-method\"]": "<h2 class=\"tippy-header\" style=\"margin-top: 0;\"><span class=\"section-number\">25.1.1. </span>Outline of harvesting method<a class=\"headerlink\" href=\"#outline-of-harvesting-method\" title=\"Link to this heading\">#</a></h2><p>This is an outline of a general, \u2018belts and braces\u2019 approach to harvesting details of digitised resources. The specific method will depend on the type of resource, the filters you apply, and the metadata you want to save.</p>", "a[href=\"#harvest-metadata-from-api\"]": "<h3 class=\"tippy-header\" style=\"margin-top: 0;\">Harvest metadata from API<a class=\"headerlink\" href=\"#harvest-metadata-from-api\" title=\"Link to this heading\">#</a></h3><p>Searches using the API return work-level records. Sometimes digitised resources are grouped as <a class=\"reference internal\" href=\"../../what-is-trove/works-and-versions.html\"><span class=\"doc std std-doc\">versions of a work</span></a>, even though they\u2019re quite different. 
To make sure you get everything, you need to work your way down through the hierarchy of <code class=\"docutils literal notranslate\"><span class=\"pre\">work</span></code> -&gt; <code class=\"docutils literal notranslate\"><span class=\"pre\">version</span></code> -&gt; <code class=\"docutils literal notranslate\"><span class=\"pre\">sub-version</span></code> (labelled <code class=\"docutils literal notranslate\"><span class=\"pre\">record</span></code> in API responses), harvesting every relevant record. The steps are:</p>", "a[href=\"#expand-collections-and-enrich-dataset-using-embedded-metadata\"]": "<h3 class=\"tippy-header\" style=\"margin-top: 0;\">Expand collections and enrich dataset using embedded metadata<a class=\"headerlink\" href=\"#expand-collections-and-enrich-dataset-using-embedded-metadata\" title=\"Link to this heading\">#</a></h3><p>Most of Trove\u2019s digitised resource viewers embed useful metadata in the HTML of their web pages. You can use this to determine whether a fulltext url points to a single resource or a collection, and to enrich the metadata you obtained from the API. The steps are:</p>", "a[href=\"../../accessing-data/how-to/harvest-complete-results.html\"]": "<h1 class=\"tippy-header\" style=\"margin-top: 0;\"><span class=\"section-number\">15.2. </span>HOW TO: Harvest a complete set of search results using the Trove API<a class=\"headerlink\" href=\"#how-to-harvest-a-complete-set-of-search-results-using-the-trove-api\" title=\"Link to this heading\">#</a></h1><p>See <a class=\"reference internal\" href=\"../../accessing-data/trove-api-intro.html\"><span class=\"doc\">Trove API introduction</span></a> for general information about using the Trove API.</p>", "a[href=\"#how-to-harvest-data-relating-to-digitised-resources\"]": "<h1 class=\"tippy-header\" style=\"margin-top: 0;\"><span class=\"section-number\">25.1. 
</span>HOW TO: Harvest data relating to digitised resources<a class=\"headerlink\" href=\"#how-to-harvest-data-relating-to-digitised-resources\" title=\"Link to this heading\">#</a></h1><p>Harvesting data from a search for digitised resources (other than newspapers) in Trove is not as simple as making a few <a class=\"reference internal\" href=\"../../accessing-data/trove-api-intro.html\"><span class=\"doc std std-doc\">API requests</span></a>. The major problem is that digitised resources are often assembled into groups or collections, and the full details of these groupings are not available through the Trove API. This means that simply harvesting results from an API query can miss many digitised resources. In addition, the way resources are described and arranged is often inconsistent, so you can\u2019t assume that a particular type of resource will be grouped in a particular way.</p><p>As a result of these problems, a \u2018belts and braces\u2019 approach seems best \u2013 follow every possible route and harvest as many records as possible. This may result in duplicates, but these can be dealt with later. In any case, Trove already contains a large number of duplicate records for digitised resources, so some form of merging or deduplication will always be required.</p>"}
// Link classes that should not receive a tooltip (e.g. heading anchors and pill buttons).
skip_classes = ["headerlink", "sd-stretched-link", "sd-rounded-pill"]

// Once the page has loaded, attach a tippy tooltip to every matching link
// inside the article body, using the pre-rendered HTML in selector_to_html.
window.onload = function () {
    for (const [selector, tip_html] of Object.entries(selector_to_html)) {
        const links = document.querySelectorAll(`article.bd-article ${selector}`);
        for (const link of links) {
            // Skip links whose class list marks them as non-content links.
            if (skip_classes.some(c => link.classList.contains(c))) {
                continue;
            }

            tippy(link, {
                content: tip_html,
                allowHTML: true,
                arrow: true,
                placement: 'auto-start',
                maxWidth: 500,
                interactive: false,
            });
        }
    }
    console.log("tippy tips loaded!");
};
