HOW TO: Automate the download of digitised items as text, images, or PDFs#
On this page
You can download text, images, and PDFs from individual digitised items using the Trove web interface. But only the text of periodical articles is available for machine access through the Trove API. This makes it difficult to assemble datasets, or build processing pipelines involving digitised resources. This page documents a series of work arounds that enable you to automate the download of digitised items as text, images, or PDFs.
Using Trove’s download link#
When you click on the download button in the web interface, your browser fires off a request to Trove that looks like this:
https://nla.gov.au/nla.obj-[ID]/download?downloadOption=[FORMAT]&firstPage=[FIRST]&lastPage=[LAST]
[ID]
is the NLA identifier for the current item or collection, for examplenla.obj-3199043190
downloadOption
is the format of the download, it can be one ofocr
,pdf
,zip
(for images), ortif
(available with some maps)firstPage
andlastPage
are numbers that define the range of items you want to download – ranges start at0
, so if a book had fifty pages you’d setfirstPage
to0
andlastPage
to49
.
For example, the book The pearling disaster, 1899 : a memorial is identified as nla.obj-33685055
and has 102 ‘pages’. Pages in this context actually refers to the images in the digitised version, rather than the number of printed pages in the original work. This is because the digitised version will typically include images of book covers, and endpapers, as well as printed pages. Using this information we can construct a url to download all the OCRd text in the book, setting lastPage
to 101
(102 - 1), because the numbering starts at 0
:
https://nla.gov.au/nla.obj-33685055/download?downloadOption=ocr&firstPage=0&lastPage=101
This method also works with collections of items that don’t have numbered pages. For example, B.A.N.Z. Antarctic Research Expedition photographs is an album containing 151 photographs. To download them all in one PDF, you would construct the following url:
https://nla.gov.au/nla.obj-141170265/download?downloadOption=pdf&firstPage=0&lastPage=150
Note
If you’re downloading a collection of images you might notice that you get the first image twice. This is because the first image is used as a ‘cover image’ for the collection, as well as being a child of the collection. When you download, you get the collection container and it’s contents, so the first image appears twice with different identifiers.
This method is consistent across most formats, so you can develop processes that construct urls like these from a list of NLA identifiers and download their contents automatically. But if you want to get the complete contents, you need some way of discovering the total number of pages or images to set the lastPage
value. You can find this value embedded in the code of the web page, though it’s location varies:
in books, periodicals, and other works with consecutive pages it’s in a block of embedded JSON metadata
in collections of images, maps, and manuscripts you need to look for
maxNumOfChildDownloads
variable in the page’s Javascript
==Code example==
Downloading high-resolution images individually#
The method described above has a couple of problems when it comes to downloading images. The first is that all the requested images are delivered in a single zip
file. If you’re requested images of all the pages in a book, this file could get very large. The second problem is that the built-in download link doesn’t always provide images at the highest possible resolution.
An alternative approach that avoids both of these problems is to construct a url for each individual page. All you need to do this is get the page identifier and tack /image
on the end of the url.
For example, this cute picture of a penguin has the identifier http://nla.gov.au/nla.obj-141171324
. To download a high-resolution version, just add /image
:
http://nla.gov.au/nla.obj-141171324/image
But how do you get the individual identifiers for all the pages in a book, or all the images in a collection? Once again, the methods vary by format:
in books, periodicals, and other works with consecutive pages it’s in a block of embedded JSON metadata
in collections of images, maps, and manuscripts you need to extract the list of identifiers from the collection’s pop-up browse screen
Once you have a list of identifiers, you can loop through them, saving each image.