🚧 This is a working draft and will change often. Do not cite!
Use the latest published version instead.
🚧

21. Overview of digitised books#

21.1. Finding digitised books#

Trove includes thousands of digitised resources described as books. You can find them by using the standard search strategy for digitised resources – search for "nla.obj" in the Books & Libraries category with the availability facet set to y, and the format facet set to Book.

Try it!

You might want to filter these results further, for example:

  • exclude books that are not freely available online by setting the availability facet to y/f – this should exclude books uploaded through the National E-Deposit scheme which can only be accessed onsite

  • exclude all books uploaded through the National E-Deposit scheme by adding NOT nuc:"ANL:NED" to your query

  • exclude Parliamentary Papers by adding NOT series:"Parliamentary paper (Australia. Parliament)" to your query

You’d expect all the books to be in the Books & Libraries category, but keeping in mind the basic principle to ‘always assume that Trove isn’t telling you everything’, you might want to look for books in other categories as well. If you’re using the API this is easy to do by setting category to all and bulkHarvest to true.

Try it!

21.2. How many digitised books are there?#

I’ve created a dataset of digitised books in Trove by using the basic search strategy outined above and excluding Parliamentary Papers. Harvesting data about digitised resoures is tricky because of the way they’re described and grouped. To workaround these problems, I used the method described in HOW TO: Harvest data relating to digitised resources to process all versions, exclude records without a fulltext identifier, add in multi-volume works and other collection groupings, and merge duplicate records using their fulltext identifier. I then attempted to download OCRd text from the books. The resulting dataset is available on GitHub.

You can use this dataset to investigate the characteristics of Trove’s digitised books. For example, how many book records are there in the dataset?

Hide code cell source
import pandas as pd

df = pd.read_csv(
    "https://raw.githubusercontent.com/GLAM-Workbench/trove-books-data/main/trove-books.csv",
    keep_default_na=False,
)

print(f"There are {df.shape[0]:,} book records!")
There are 21,218 book records!

How many of the books have OCRd text you can download?

Hide code cell source
print(f"{df.loc[df['text_file'] != ''].shape[0]:,}")
17,695

Why wouldn’t a digitised book have OCRd text? I suspect it’s because some of the so-called books are actually leaflets and posters. There are probably also some NED books in PDF or eBook formats.

21.3. What languages are they in?#

The book metadata includes a language field. This field can have multiple values.

How many different languages are recorded in the language field?

Hide code cell source
languages = df["language"].str.split(" | ", regex=False).explode()

languages.nunique()
251

What are the ten most common language values?

Hide code cell source
languages.value_counts().to_frame().reset_index()[:10].style.format(
    thousands=","
).hide()
language count
English 17,903
Chinese 1,685
518
French 256
Undetermined 162
German 132
Japanese 95
Italian 89
Austronesian (Other) 80
Dutch 77

It’s no suprise that a large majority of the books are in English. To dig deeper, uou can exclude the English books and visualise the remaining language counts as a word cloud.

Hide code cell source
from wordcloud import WordCloud

language_counts = languages.value_counts().to_dict()
del language_counts["English"]

wc = WordCloud(width=800, height=400)
wc.generate_from_frequencies(language_counts)
language_cloud = wc.to_image()
display(language_cloud)
../../_images/cf95e69b51f495f45d761d2aff5b2a6fc7babe26fd8cca57bca98d707f8a246d.png

Fig. 21.1 Languages of digitised books (excluding English)#

21.4. When were they published?#

You can use the date field to examine their publication dates.

Hide code cell source
import altair as alt

df["year"] = df["date"].str.extract(r"\b((?:16|17|18|19|20)\d{2})\b")
year_counts = df["year"].value_counts().to_frame().reset_index()

chart_years = alt.Chart(year_counts).mark_bar(size=1).encode(
    x="year:T",
    y=alt.Y("count:Q", title="number of books"),
    tooltip=[alt.Tooltip("year:T", format="%Y"), "count:Q"],
).properties(width="container", height=200, padding=20)

display(chart_years)

Fig. 21.2 Publication years of digitised books#

Why are there peaks in 1800, 1900, 2000? Presumably because an exact publication date was not known, so a value for the century was used.

21.6. What are they about?#

The subject field contains standard(ish) subject headings. There can be multiple values. The values below have been normalised to standardise spacing and remove trailing full stops.

Hide code cell source
import re

def split_and_clean(value):
    values = value.split("|")
    return list(
        set(
            [
                re.sub(r"(\w)--(\w)", r"\1 -- \2", v).strip().strip(".")
                for v in values
                if v
            ]
        )
    )


df["subjects"] = df["subject"].apply(split_and_clean)

subjects = df["subjects"].explode().to_frame()

subjects["subjects"].value_counts().to_frame().reset_index()[:20].style.format(
    thousands=","
).hide()
subjects count
Australian 3,456
Printed ephemera -- Australia 2,089
Menus -- Specimens 963
Chinese language -- Dictionaries 610
Menus - Shipping (general, Orient, P&O) 480
Menus -- Australia -- Specimens 470
Invitation cards -- Australia 422
Programs -- Australia 418
Australia 369
Menus - General 335
Menus - Specific events 242
Commercial catalogs -- Australia 225
Australia -- Commerce -- Miscellanea 224
Business -- Australia -- Miscellanea 224
Advertising -- Australia -- Miscellanea 219
Retail trade -- Australia -- Miscellanea 203
Trade literature -- Australia 201
Industries -- Australia -- Miscellanea 198
Advertising -- Brand name products -- Australia 181
Broadsides -- Australia 169

21.7. How many pages do they contain?#

The number of pages is generally not in the API data, but I’ve extracted it from the metadata embedded in the digitised book viewer. This number really represents the number of digitised images, so it can include covers and endleaves.

How many digitised pages (or images) are there in total?

Hide code cell source
print(f"{df['pages'].sum():,}")
1,473,339

By ‘binning’ the page values into groups of 100 you can examine how the page lengths vary.

Hide code cell source
chart_pages = alt.Chart(df).mark_bar().encode(
    x=alt.X("pages:Q", title="number of pages (binned)").bin(maxbins=20), 
    y=alt.Y("count()", title="number of books")
).properties(
    width="container", height=200
)

display(chart_pages)

Fig. 21.3 Number of pages in digitised books (binned)#

Why are most of the books less than 100 pages long? if you look at the list above of the most common subjects, you’ll see that a number of them relate to ‘ephemera’. It seems thata significant number of Trove’s ‘books’ are actually things like leaflets, pamphlets, posters, and menus.

What’s the longest book?

Hide code cell source
df.loc[df["pages"].idxmax()]
title                                   Jiu xin yue quan shu. | 舊新約全書.
sub_title                                                             
contributor          British and Foreign Bible Society | British an...
publisher            Sheng shu gong hui | [Guangzhou] : Sheng shu g...
date                                                              1907
type                                                                  
format                                                            Book
extent                         1308, 400 pages ; 23 cm. | 703718 words
language                                                       Chinese
subject                                                               
spatial                                                               
is_part_of                                                            
identifier           AuCNLKIN: 000015575462 | LMS 545 | OCoLC: 2224...
rights               No known copyright restrictions | http://right...
pages                                                             1723
fulltext_url                       https://nla.gov.au/nla.obj-79373227
fulltext_url_text         National Library of Australia digitised item
text_download_url    https://trove.nla.gov.au/nla.obj-79373227/down...
catalogue_url                      https://nla.gov.au/nla.cat-vn483026
work_url                        https://trove.nla.gov.au/work/12620536
work_type                                                         Book
parent                                                                
parent_url                                                            
children                                                              
text_file                                         nla.obj-79373227.txt
year                                                              1907
subjects                                                            []
Name: 2310, dtype: object