🚧 This is a working draft and will change often. Do not cite!
Use the latest published version instead.
🚧

HOW TO: Automate the download of digitised items as text, images, or PDFs

You can download text, images, and PDFs from individual digitised items using the Trove web interface. But only the text of periodical articles is available for machine access through the Trove API. This makes it difficult to assemble datasets or build processing pipelines involving digitised resources. This page documents a series of workarounds that enable you to automate the download of digitised items as text, images, or PDFs.

Downloading high-resolution images individually

The method described above has a couple of problems when it comes to downloading images. The first is that all the requested images are delivered in a single zip file. If you’ve requested images of all the pages in a book, this file could get very large. The second problem is that the built-in download link doesn’t always provide images at the highest possible resolution.

An alternative approach that avoids both of these problems is to construct a URL for each individual page. All you need is the page identifier: just tack /image on the end of its URL.

For example, this cute picture of a penguin has the identifier http://nla.gov.au/nla.obj-141171324. To download a high-resolution version, just add /image:

http://nla.gov.au/nla.obj-141171324/image
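The same step can be sketched in Python. This is a minimal illustration rather than an official method: the function name `image_url` is made up for this example, and it assumes that a bare identifier such as `nla.obj-141171324` can be prefixed with `http://nla.gov.au/`, following the pattern of the example above.

```python
def image_url(identifier: str) -> str:
    """Build the high-resolution image URL for a page or image identifier."""
    identifier = identifier.strip().rstrip("/")
    # Assumption: bare identifiers resolve under http://nla.gov.au/,
    # as in the penguin example.
    if not identifier.startswith("http"):
        identifier = "http://nla.gov.au/" + identifier
    return identifier + "/image"


print(image_url("http://nla.gov.au/nla.obj-141171324"))
# prints http://nla.gov.au/nla.obj-141171324/image
```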

But how do you get the individual identifiers for all the pages in a book, or all the images in a collection? Once again, the methods vary by format:

Once you have a list of identifiers, you can loop through them, saving each image.
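A loop like this might look something like the sketch below, which uses only the Python standard library. The names `download_images` and `filename_for` are hypothetical, as are the `output_dir`, `delay`, and `fetch` parameters. The code assumes each identifier is a full URL like the one in the example above, and that the images are delivered as JPEGs. The pause between requests is a courtesy to the server, not a documented requirement.

```python
import time
from pathlib import Path
from urllib.request import urlopen


def filename_for(identifier: str) -> str:
    """Derive a local file name from an identifier URL."""
    # Assumption: images are JPEGs; check the Content-Type header if unsure.
    return identifier.rstrip("/").split("/")[-1] + ".jpg"


def download_images(identifiers, output_dir="images", delay=1.0, fetch=None):
    """Save the /image version of each identifier as a file in output_dir."""
    if fetch is None:
        # Default fetcher: a plain HTTP GET that returns the response body.
        def fetch(url):
            with urlopen(url, timeout=60) as response:
                return response.read()

    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    saved = []
    for identifier in identifiers:
        target = out / filename_for(identifier)
        if target.exists():
            # Skip files already on disk so an interrupted run can be resumed.
            saved.append(target)
            continue
        target.write_bytes(fetch(identifier.rstrip("/") + "/image"))
        saved.append(target)
        time.sleep(delay)  # pause between requests
    return saved
```

Calling `download_images(["http://nla.gov.au/nla.obj-141171324"])` should save the penguin image as `images/nla.obj-141171324.jpg`. Saving each page as a separate file avoids the single large zip file, and the optional `fetch` argument lets you swap in a different HTTP client or a stub for testing.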