core

A harvester for downloading large numbers of digitised newspaper articles from Trove


Harvester

 Harvester (query_params=None, key=None, data_dir='data',
            harvest_dir=None, config_file=None, text=False, pdf=False,
            image=False, include_linebreaks=False, maximum=None)

Harvest large quantities of digitised newspaper articles from Trove. Note that you must supply either query_params and key or config_file.

Parameters:

  • query_params [optional, dictionary of parameters]
  • key [optional, Trove API key]
  • config_file [optional, path to a config file]
  • data_dir [optional, directory for harvests, string]
  • harvest_dir [optional, directory for this harvest, string]
  • text [optional, save articles as text files, True or False]
  • pdf [optional, save articles as PDFs, True or False]
  • image [optional, save articles as images, True or False]
  • include_linebreaks [optional, include linebreaks in text files, True or False]
  • maximum [optional, maximum number of results, integer]

The Harvester class configures and runs your harvest, saving results in a variety of formats.

You must supply either query_params and key, or the path to a config_file. If you don’t you’ll get a NoQueryError.

By default, the harvester will save harvests in a directory called data, with each individual harvest in a directory named according to the current date and time (YYYYMMDDHHmmss format). You can change this by setting the data_dir and harvest_dir parameters. This can help you to manage your harvests by grouping together related searches, or giving them meaningful names.

The harvester generates three data files by default:

  • harvester_config.json a file that captures the parameters used to launch the harvest
  • ro-crate-metadata.json a metadata file documenting the harvest in RO-Crate format
  • results.ndjson contains details of all the harvested articles in a newline delimited JSON format (each line is a JSON object)

You can convert the ndjson file to a CSV format using Harvester.save_csv.
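Because each line of results.ndjson is a self-contained JSON object, the file can also be read with nothing more than Python's standard library. The following is a minimal sketch, not part of the harvester itself: the read_ndjson helper and the sample records are made up for illustration.

```python
import json
import tempfile
from pathlib import Path


def read_ndjson(path):
    """Read a newline-delimited JSON file into a list of dictionaries."""
    records = []
    with Path(path).open("r", encoding="utf-8") as ndjson_file:
        for line in ndjson_file:
            line = line.strip()
            if line:  # skip any blank lines
                records.append(json.loads(line))
    return records


# Write two made-up records to a temporary file so the sketch runs anywhere.
# In practice you would point read_ndjson at a harvest's results.ndjson file.
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_file = Path(tmp_dir, "results.ndjson")
    sample_file.write_text(
        '{"id": "1", "heading": "First sample"}\n'
        '{"id": "2", "heading": "Second sample"}\n',
        encoding="utf-8",
    )
    records = read_ndjson(sample_file)

print(len(records))
```

Reading line by line keeps memory use low for big harvests if you process each record as it is parsed rather than collecting them all in a list.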

The text, pdf, and image options let you save the contents of the articles as text files, PDF files, or JPG images. Note that saving PDFs and images can be very slow.

If you only want to harvest part of the results set you can set the maximum parameter to the number of records you want.

Quick start

  • You’ll need a Trove API key to use the harvester.
  • Just copy the url from a search in the newspapers and gazettes category.
from trove_newspaper_harvester.core import prepare_query, Harvester

my_api_key = "myApIkEy"
search_url = "https://trove.nla.gov.au/search/category/newspapers?keyword=wragge"

# Convert the search url into a set of API parameters
my_query_params = prepare_query(search_url)

# Initialise the Harvester
harvester = Harvester(query_params=my_query_params, key=my_api_key)

# Start the harvest
harvester.harvest()

If you want to harvest the OCRd text of articles as well as metadata, add text=True to the harvester initialisation.

# Initialise the Harvester
harvester = Harvester(query_params=my_query_params, key=my_api_key, text=True)

Similarly you can harvest PDFs and images of articles by adding pdf=True and image=True to the harvester initialisation, but keep in mind that these options will make the harvest much slower!

You can generate a set of query parameters from a Trove search url using prepare_query().

# TEST FOR MISSING PARAMETERS
# You need to supply either query_params AND key, OR config_file. 
# If you don't you'll get a NoQueryError
with ExceptionExpected(ex=NoQueryError):
    harvester = Harvester()


prepare_query

 prepare_query (query)

Converts a Trove search url into a set of parameters ready for harvesting.

Parameters:

  • query [required, search url from Trove web interface or API, string]

Returns:

  • a dictionary of parameters

The prepare_query function converts a search url from the Trove web interface or API into a set of parameters that you can feed to Harvester. It uses the trove-query-parser to do most of the work, but adds in a few extra parameters needed for the harvest.

query_params = prepare_query("https://trove.nla.gov.au/search/category/newspapers?keyword=wragge&l-state=New%20South%20Wales&l-artType=newspapers&l-title=508&l-decade=191&l-category=Article"
)
query_params
{'q': 'wragge',
 'l-state': ['New South Wales'],
 'l-artType': 'newspapers',
 'l-title': ['508'],
 'l-decade': ['191'],
 'l-category': ['Article'],
 'category': 'newspaper',
 'encoding': 'json',
 'reclevel': 'full',
 'bulkHarvest': 'true'}
# TEST prepare_query()
# Convert a url from the Trove web interface
query_params = prepare_query(
    "https://trove.nla.gov.au/search/category/newspapers?keyword=wragge"
)

# Test the results
assert query_params == {
    "q": "wragge",
    "category": "newspaper",
    "encoding": "json",
    "reclevel": "full",
    "bulkHarvest": "true",
}

# Convert a url from an API request
query_params = prepare_query(
    "https://api.trove.nla.gov.au/v2/result?q=wragge&category=newspaper&encoding=json&l-category=Article"
)

assert query_params == {
    "q": ["wragge"],
    "category": ["newspaper"],
    "encoding": "json",
    "l-category": ["Article"],
    "reclevel": "full",
    "bulkHarvest": "true",
}
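If you're curious about what a search url contains before it is converted, you can pull apart its query string with Python's standard library. This sketch only shows the raw parameters as they appear in the url. It does not reproduce the conversion that prepare_query and trove-query-parser perform (for example, turning keyword into q), and the url is just a sample.

```python
from urllib.parse import parse_qs, urlsplit

search_url = (
    "https://trove.nla.gov.au/search/category/newspapers"
    "?keyword=wragge&l-state=New%20South%20Wales&l-decade=191"
)

# parse_qs decodes percent-encoding and returns each value as a list
raw_params = parse_qs(urlsplit(search_url).query)

for name, values in raw_params.items():
    print(name, values)
```

Comparing this raw view with the dictionary returned by prepare_query can help when a harvest doesn't return the results you expected from the web interface.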

Initialising a harvest using a harvester_config.json file

The parameters used to initialise a harvest are saved into a file called harvester_config.json. This provides useful documentation of your harvest, making it possible to reconstruct the process at a later date.

For example, you might want to re-harvest a particular query a year after your initial harvest to see how the results have changed. Remember, more articles are being added every week! To re-run a harvest, just point the Harvester to the harvester_config.json file. By default, your new harvest will be saved in a fresh directory.

from trove_newspaper_harvester.core import Harvester

harvester = Harvester(config_file="path/to/old/harvest/harvester_config.json")

harvester.harvest()

Note that the harvester_config.json contains all the parameters used for your harvest, including your Trove API key. This makes it easy to re-run a harvest at a later date, but if you’re intending to share your harvest results you should delete or obscure the key value.
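One way to obscure the key is to rewrite the config file with a placeholder before sharing. This is a minimal sketch using only the standard library. The obscure_key helper is hypothetical (it is not part of the harvester), and the sample config is made up and much shorter than a real one.

```python
import json
import tempfile
from pathlib import Path


def obscure_key(config_path, placeholder="########"):
    """Replace the API key in a harvester_config.json file with a placeholder."""
    config_path = Path(config_path)
    config = json.loads(config_path.read_text(encoding="utf-8"))
    if "key" in config:
        config["key"] = placeholder
    config_path.write_text(json.dumps(config, indent=2), encoding="utf-8")
    return config


# Demonstrate with a made-up config file in a temporary directory
with tempfile.TemporaryDirectory() as tmp_dir:
    config_file = Path(tmp_dir, "harvester_config.json")
    config_file.write_text(
        json.dumps({"query_params": {"q": "wragge"}, "key": "myApIkEy", "text": True}),
        encoding="utf-8",
    )
    cleaned = obscure_key(config_file)
    saved = json.loads(config_file.read_text(encoding="utf-8"))

print(cleaned["key"])
```

It's probably safest to run something like this on a copy of the harvest that you intend to share, since a config file without a valid key can't be used to re-run the harvest.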

# TEST: Reharvest from config file

import json
import os
import shutil
from pathlib import Path

API_KEY = os.getenv("TROVE_API_KEY")

test_config = {
    'query_params': {'q': 'wragge',
    'l-state': ['Western Australia'],
    'l-illustrated': 'true',
    'l-illtype': ['Photo'],
    'include': ['articleText'],
    'category': 'newspaper',
    'encoding': 'json',
    'reclevel': 'full',
    'bulkHarvest': 'true'},
    'key': API_KEY,
    'full_harvest_dir': 'harvests/test_harvest',
    'maximum': None,
    'text': True,
    'pdf': False,
    'image': False,
    'include_linebreaks': False
}

Path("harvester_config.json").write_text(json.dumps(test_config))

# Initialise the harvester
harvester = Harvester(config_file="harvester_config.json")

# Start the harvest!
harvester.harvest()

shutil.rmtree(Path("data"))
Path("harvester_config.json").unlink()

Where your harvests are saved

By default, harvests are saved in a directory named data. Each individual harvest is saved in a directory named according to the current date/time, for example: data/20230826125205.

# TEST HARVESTER CREATES DEFAULT HARVEST DIRECTORY
# This example initialises a harvest, but doesn't actually run it.

API_KEY = os.getenv("TROVE_API_KEY")

# Prepare query params
query_params = prepare_query(
    "https://trove.nla.gov.au/search/category/newspapers?keyword=wragge"
)

# Initialise the Harvester with the query parameters
harvester = Harvester(query_params=query_params, key=API_KEY, text=True)

# If you haven't set the maximum parameter, the total value will be the total number of results
assert harvester.total > 0
print(f"Total results: {harvester.total:,}")

# Check that the data directory exists
assert Path("data").exists() is True

# Check that a harvest directory with the current date/hour exists in the data directory
assert len(list(Path("data").glob(f'{arrow.utcnow().format("YYYYMMDDHH")}*'))) == 1

# Check that the path of the harvest directory found above exists
assert (
    Path(next(Path("data").glob(f'{arrow.utcnow().format("YYYYMMDDHH")}*'))).exists()
    is True
)

# Check that the cache has been initialised
assert Path(f"{'-'.join(harvester.harvest_dir.parts)}.sqlite").exists()

# Clean up
shutil.rmtree(Path("data"))
harvester.delete_cache()
Total results: 140,806

You can change the default directories using the data_dir and harvest_dir parameters. For example, if you wanted to keep all the harvests relating to a specific project together, you could set data_dir="my-cool-project". You can use harvest_dir to give your harvest a meaningful name, for example harvest_dir="search-for-cat-photos".
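The naming scheme described above can be sketched in a few lines. This is an illustration of the resulting directory layout only, not the harvester's own code: planned_harvest_path is a hypothetical helper, and the use of UTC for the default timestamp is an assumption.

```python
from datetime import datetime, timezone
from pathlib import Path


def planned_harvest_path(data_dir="data", harvest_dir=None, now=None):
    """Return the path a harvest would occupy under the naming scheme described above."""
    if harvest_dir is None:
        # Default name: a timestamp in YYYYMMDDHHmmss format (UTC is assumed here)
        now = now or datetime.now(timezone.utc)
        harvest_dir = now.strftime("%Y%m%d%H%M%S")
    return Path(data_dir, harvest_dir)


# A named harvest grouped under a project directory
print(planned_harvest_path(data_dir="my-cool-project", harvest_dir="search-for-cat-photos"))

# A default, timestamped harvest (the time is fixed so the result is repeatable)
print(planned_harvest_path(now=datetime(2023, 8, 26, 12, 52, 5)))
```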

# TEST HARVESTER CREATES REQUESTED HARVEST DIRECTORY

query_params = prepare_query(
    "https://trove.nla.gov.au/search/category/newspapers?keyword=wragge"
)

harvester = Harvester(
    query_params=query_params,
    key=API_KEY,
    data_dir="harvests",
    harvest_dir="my_trove_harvest",
    pdf=True,
    image=True,
)

assert harvester.total > 0
print(f"Total results: {harvester.total:,}")

# Check that the data directory exists
assert Path("harvests").exists() is True

assert Path("harvests", "my_trove_harvest").exists() is True

assert Path("harvests", "my_trove_harvest", "pdf").exists() is True

assert Path("harvests", "my_trove_harvest", "image").exists() is True

# Clean up
shutil.rmtree(Path("harvests"))
harvester.delete_cache()
Total results: 140,806


Harvester.harvest

 Harvester.harvest ()

Start the harvest and loop over the result set until finished.

Once the harvester is initialised, you can start the harvest by calling Harvester.harvest(). A progress bar will keep you informed of the status of your harvest.

Add text=True to include the OCRd full text of the articles in the harvest. The content of each article is saved as a separate file in the text directory. See the harvest results section below for more information.

# HARVEST WITH TEXT > 100 records

# Prepare query parameters
query_params = prepare_query(
    "https://trove.nla.gov.au/search/category/newspapers?keyword=wragge&l-state=Western%20Australia&l-illustrationType=Photo"
)

# Initialise the harvester
harvester = Harvester(
    query_params=query_params,
    key=API_KEY,
    data_dir="harvests",
    harvest_dir="test_harvest",
    text=True,
)

# Start the harvest
harvester.harvest()


# ---TESTS---
# Check that the ndjson file exists and lines can be parsed as json
json_data = []
with harvester.ndjson_file.open("r") as ndjson_file:
    for line in ndjson_file:
        json_data.append(json.loads(line.strip()))

# The length of the ndjson file should equal the number of records harvested
assert len(json_data) == harvester.harvested

# Check that the metadata file has been created
config = get_config(harvester.harvest_dir)
assert config["query_params"] == query_params

# Check that the RO-Crate file was created
crate = get_crate(harvester.harvest_dir)
eids = [
    "./", 
    "ro-crate-metadata.json", 
    "#harvester_run", 
    "harvester_config.json", 
    "https://github.com/wragge/trove-newspaper-harvester",
    "results.ndjson",
    "text",
    "https://creativecommons.org/publicdomain/zero/1.0/",
    "http://rightsstatements.org/vocab/CNE/1.0/",
    "http://rightsstatements.org/vocab/NKC/1.0/" 
]
for eid in eids:
    assert crate.get(eid) is not None

# Check that a text file exists and can be read
assert Path("harvests", "test_harvest", json_data[0]["articleText"]).exists()
text = Path("harvests", "test_harvest", json_data[0]["articleText"]).read_text()
assert isinstance(text, str)

# Check that the cache file was deleted
assert Path(f"{'-'.join(harvester.harvest_dir.parts)}.sqlite").exists() is False

shutil.rmtree(Path("harvests"))

The text of articles in the Australian Women’s Weekly is not available through the API, so the harvester has to scrape it separately. This happens automatically. The code below is just a little test to make sure it’s working as expected.

# ---TEST FOR AWW---
# Prepare query params
query_params = prepare_query(
    "https://trove.nla.gov.au/search/category/newspapers?keyword=wragge"
)

# Initialise the Harvester with the query parameters
harvester = Harvester(query_params=query_params, key=API_KEY, text=True)

# Get html text of an article
text = harvester.get_aww_text(51187457)
assert "THE SHAPE OF THINGS TO COME" in text

# Clean up
harvester.delete_cache()
shutil.rmtree(Path("data"))

You can include PDFs and images of the articles by adding pdf=True or image=True to the harvester initialisation. It’s important to note that this will slow down the harvest a lot, as each file needs to be generated and downloaded individually.

# HARVEST WITH PDF AND IMAGE -- 1 RECORD MAX

# Prepare the query parameters
query_params = prepare_query(
    "https://trove.nla.gov.au/search/category/newspapers?keyword=wragge&l-illustrationType=Cartoon"
)

# Initialise the harvester
harvester = Harvester(
    query_params=query_params,
    key=API_KEY,
    data_dir="harvests",
    harvest_dir="test_harvest",
    pdf=True,
    image=True,
    maximum=1,
)

# Start the harvest!
harvester.harvest()


# ---TESTS---

# Check that the ndjson file exists and lines can be parsed as json
json_data = []
with harvester.ndjson_file.open("r") as ndjson_file:
    for line in ndjson_file:
        json_data.append(json.loads(line.strip()))

assert harvester.maximum == harvester.harvested

# The length of the ndjson file should equal the number of records harvested
assert len(json_data) == harvester.harvested

# Check that a pdf and image file exist
assert Path("harvests", "test_harvest", json_data[0]["pdf"]).exists()
assert Path("harvests", "test_harvest", json_data[0]["images"][0]).exists()

shutil.rmtree(Path("harvests"))

Naturally enough, nothing is harvested from a query with no results. Check your search and your API key!

# HARVEST WITH NO RESULTS

# Prepare query parameters
query_params = prepare_query(
    "https://trove.nla.gov.au/search/category/newspapers?keyword=wwgagsgshggshghso"
)

# Initialise the harvester
harvester = Harvester(
    query_params=query_params,
    key=API_KEY
)

# Start the harvest
harvester.harvest()

assert harvester.harvested == 0
shutil.rmtree(Path("data"))

Restarting a failed harvest

The Harvester uses requests-cache to cache API responses. This makes it easy to restart a failed harvest. All you need to do is call Harvester.harvest() again and it will pick up where it left off.



Harvester.save_csv

 Harvester.save_csv ()

Flatten and rename data in the ndjson file to save as CSV.

Harvested metadata is saved, by default, in a newline-delimited JSON file. If you’d prefer the results in CSV format, just call Harvester.save_csv(). See below for more information on results formats.

# TEST - save harvest results as CSV

# Prepare query parameters
query_params = prepare_query(
    "https://trove.nla.gov.au/search/category/newspapers?keyword=wragge&l-state=Western%20Australia&l-illustrationType=Photo"
)

# Initialise the harvester
harvester = Harvester(
    query_params=query_params,
    key=API_KEY,
    data_dir="harvests",
    harvest_dir="test_harvest",
    text=True,
)

# Start the harvest
harvester.harvest()

# Save results as CSV
harvester.save_csv()

# ---TESTS---

# Check that CSV file exists
csv_file = Path(harvester.harvest_dir, "results.csv")
assert csv_file.exists()

# Open the CSV file and check that the number of rows equals number of records harvested
df = pd.read_csv(csv_file)
assert df.shape[0] == harvester.harvested

shutil.rmtree(Path("harvests"))

Harvest results

There will be at least three files created for each harvest:

  • harvester_config.json a file that captures the parameters used to launch the harvest
  • ro-crate-metadata.json a metadata file documenting the harvest in RO-Crate format
  • results.ndjson contains details of all the harvested articles in a newline delimited JSON format (each line is a JSON object)

The results.ndjson stores the API results from Trove as is, with a couple of exceptions:

  • if the text parameter has been set to True, the articleText field will contain the path to a .txt file containing the OCRd text contents of the article (rather than containing the text itself)
  • similarly, if PDFs and images are requested, the pdf and image fields in the ndjson file will point to the saved files.
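To get from a record in results.ndjson to the text of the article, you need to follow the stored path. The sketch below assumes the path in articleText is relative to the harvest directory, which is how the tests on this page use it. The read_article_text helper and the sample harvest directory are made up for illustration.

```python
import json
import tempfile
from pathlib import Path


def read_article_text(harvest_dir, record):
    """Return the OCRd text for a record, or None if no text file is recorded."""
    text_path = record.get("articleText")
    if not text_path:
        return None
    # The stored path is treated as relative to the harvest directory (an assumption)
    return Path(harvest_dir, text_path).read_text(encoding="utf-8")


# Build a tiny made-up harvest directory so the sketch is self-contained
with tempfile.TemporaryDirectory() as tmp_dir:
    harvest_dir = Path(tmp_dir, "test_harvest")
    Path(harvest_dir, "text").mkdir(parents=True)
    Path(harvest_dir, "text", "19460104-1002-206680758.txt").write_text(
        "Sample article text", encoding="utf-8"
    )
    record = json.loads(
        '{"id": "206680758", "articleText": "text/19460104-1002-206680758.txt"}'
    )
    article_text = read_article_text(harvest_dir, record)

print(article_text)
```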

You’ll probably find it easier to work with the results in CSV format. The Harvester.save_csv() method flattens the ndjson file and renames some columns to make them compatible with previous versions of the harvester. It produces a results.csv file, which is a plain text CSV (Comma Separated Values) file. You can open it with any spreadsheet program. The details recorded for each article are:

  • article_id – a unique identifier for the article
  • title – the title of the article
  • date – in ISO format, YYYY-MM-DD
  • page – page number (of course), but might also indicate the page is part of a supplement or special section
  • newspaper_id – a unique identifier for the newspaper or gazette title (this can be used to retrieve more information or build a link to the web interface)
  • newspaper_title – the name of the newspaper (or gazette)
  • category – one of ‘Article’, ‘Advertising’, ‘Detailed lists, results, guides’, ‘Family Notices’, or ‘Literature’
  • words – number of words in the article
  • illustrated – is it illustrated (values are y or n)
  • edition – edition of newspaper (rarely used)
  • supplement – section of newspaper (rarely used)
  • section – section of newspaper (rarely used)
  • url – the persistent url for the article
  • page_url – the persistent url of the page on which the article is published
  • snippet – short text sample
  • relevance – search relevance score of this result
  • status – some articles that are still being processed will have the status “coming soon” and might be missing other fields
  • corrections – number of text corrections
  • last_correction – date of last correction
  • tags – number of attached tags
  • comments – number of attached comments
  • lists – number of lists this article is included in
  • text – path to text file
  • pdf – path to PDF file
  • image – path to image file
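As well as opening results.csv in a spreadsheet, you can summarise it with Python's standard library. The following sketch uses a few of the column names listed above with made-up rows, held in memory so the example runs without a real harvest.

```python
import csv
import io
from collections import Counter

# Made-up sample rows using a few of the column names listed above.
# In practice you would open a harvest's results.csv file instead.
sample_csv = io.StringIO(
    "article_id,title,date,newspaper_title,category,illustrated\n"
    "1,First sample,1946-01-04,Sample Gazette,Article,y\n"
    "2,Second sample,1946-01-05,Sample Gazette,Advertising,n\n"
    "3,Third sample,1925-04-11,Example Times,Article,n\n"
)

rows = list(csv.DictReader(sample_csv))

# Count articles per newspaper and pick out the illustrated ones
articles_per_newspaper = Counter(row["newspaper_title"] for row in rows)
illustrated_ids = [row["article_id"] for row in rows if row["illustrated"] == "y"]

print(articles_per_newspaper.most_common())
print(illustrated_ids)
```

Note that csv.DictReader returns every value as a string, so numeric columns such as words would need converting before you do any arithmetic with them.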

If you’ve asked for text files, PDFs, or images, there will be additional directories containing those files. Files containing the OCRd text of the articles will be saved in a directory named text. These are just plain text files, stripped of any HTML. These files include some basic metadata in their file names – the date of the article, the id number of the newspaper, and the id number of the article. So, for example, the filename 19460104-1002-206680758.txt tells you:

  • the article was published on 4 January 1946 (19460104)
  • the id number of the newspaper is 1002
  • the id number of the article is 206680758

As you can see, you can use the newspaper and article ids to create direct links into Trove:

  • to a newspaper or gazette https://trove.nla.gov.au/newspaper/title/[newspaper id]
  • to an article http://nla.gov.au/nla.news-article[article id]
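The file name convention and the link patterns above can be combined in a short helper. This is a minimal sketch: parse_text_filename is a hypothetical function, not part of the harvester, and it only handles text file names with exactly three hyphen-separated parts.

```python
from datetime import datetime
from pathlib import Path


def parse_text_filename(filename):
    """Split a text file name of the form [date]-[newspaper id]-[article id].txt."""
    date_str, newspaper_id, article_id = Path(filename).stem.split("-")
    return {
        "date": datetime.strptime(date_str, "%Y%m%d").date(),
        "newspaper_id": newspaper_id,
        "article_id": article_id,
        "newspaper_url": f"https://trove.nla.gov.au/newspaper/title/{newspaper_id}",
        "article_url": f"http://nla.gov.au/nla.news-article{article_id}",
    }


details = parse_text_filename("19460104-1002-206680758.txt")
print(details["date"].isoformat())
print(details["newspaper_url"])
print(details["article_url"])
```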

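Similarly, if you’ve asked for copies of the articles as images, they’ll be in a directory named image. The image file names are similar to the text files, but with an extra id number for the page from which the image was extracted. So, for example, the image filename 19250411-460-140772994-11900413.jpg tells you:

  • the article was published on 11 April 1925 (19250411)
  • the id number of the newspaper is 460
  • the id number of the article is 140772994
  • the id number of the page is 11900413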




get_harvest

 get_harvest (data_dir='data', harvest_dir=None)

Get the path to a harvest. If data_dir and harvest_dir are not supplied, this will return the most recent harvest in the ‘data’ directory.

Parameters:

  • data_dir [optional, directory for harvests, string]
  • harvest_dir [optional, directory for this harvest, string]

Returns:

  • a pathlib.Path object pointing to the harvest directory
# TEST GET HARVEST

# Create test folders
Path("data", "20220919100000").mkdir(parents=True)
Path("data", "20220919200000").mkdir(parents=True)

# Get latest harvest folder
harvest = get_harvest()
print(harvest)

# ---TESTS---
assert harvest.name == "20220919200000"

harvest = get_harvest(data_dir="data", harvest_dir="20220919100000")
assert harvest.name == "20220919100000"

shutil.rmtree(Path("data"))
data/20220919200000


get_config

 get_config (harvest)

Get the query config parameters from a harvest directory.

Parameters:

  • harvest [required, path to harvest, string or pathlib.Path]

Returns:

  • config dictionary

The harvester_config.json file contains the parameters used to initiate a harvest. Using get_config you can retrieve the harvester_config.json for a particular harvest. This can be useful if, for example, you want to re-run a harvest at a later date – you can just grab the query_params and feed them into a new Harvester instance.

# Prepare query parameters
query_params = prepare_query(
    "https://trove.nla.gov.au/search/category/newspapers?keyword=wragge&l-state=Western%20Australia&l-illustrationType=Photo"
)

# Initialise the harvester
harvester = Harvester(
    query_params=query_params,
    key=API_KEY,
    text=True,
)

# Start the harvest
harvester.harvest()

# Get the most recent harvest
harvest = get_harvest()

# Get the metadata
config = get_config(harvest)

# Obscure key and display
config["key"] = "########"
display(config)

# ---TESTS---
assert config["query_params"]["q"] == "wragge"
assert config["text"] is True

shutil.rmtree(Path("data"))
{'query_params': {'q': 'wragge',
  'l-state': ['Western Australia'],
  'l-illustrated': 'true',
  'l-illustrationType': ['Photo'],
  'category': 'newspaper',
  'encoding': 'json',
  'reclevel': 'full',
  'bulkHarvest': 'true',
  'include': ['articleText']},
 'key': '########',
 'full_harvest_dir': 'data/20231023042615',
 'maximum': None,
 'text': True,
 'pdf': False,
 'image': False,
 'include_linebreaks': False}


get_crate

 get_crate (harvest)

Get the RO-Crate metadata file from a harvest directory.

Parameters:

  • harvest [required, path to harvest, string or pathlib.Path]

Returns:

  • ROCrate object

Trove is changing all the time, so it’s important to document your harvests. The Harvester automatically creates a metadata file using the Research Object Crate (RO-Crate) format. This documents when the harvest was run, how many results were saved, and the version of the harvester. It is linked to the harvester_config.json file that saves the query parameters and harvester settings. This function retrieves the RO-Crate file for a given harvest. It returns an RO-Crate object – see the ro-crate.py package for more information.

# Prepare query parameters
query_params = prepare_query(
    "https://trove.nla.gov.au/search/category/newspapers?keyword=wragge&l-state=Western%20Australia&l-illustrationType=Photo"
)

# Initialise the harvester
harvester = Harvester(
    query_params=query_params,
    key=API_KEY,
    text=True,
)

# Start the harvest
harvester.harvest()

# Get the most recent harvest
harvest = get_harvest()

# Get the metadata
crate = get_crate(harvest)

for eid in crate.get_entities():
    print(eid.id, eid.type)

assert crate.get("./").type == "Dataset"
assert crate.get("harvester_config.json").properties()["encodingFormat"] == "application/json"
assert crate.get("./").properties()["mainEntity"] == {"@id": "#harvester_run"}

shutil.rmtree(Path("data"))
./ Dataset
ro-crate-metadata.json CreativeWork
harvester_config.json File
results.ndjson ['File', 'Dataset']
text ['File', 'Dataset']
#harvester_run CreateAction
https://github.com/wragge/trove-newspaper-harvester SoftwareApplication
http://rightsstatements.org/vocab/NKC/1.0/ CreativeWork
http://rightsstatements.org/vocab/CNE/1.0/ CreativeWork
https://creativecommons.org/publicdomain/zero/1.0/ CreativeWork
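An ro-crate-metadata.json file is a JSON-LD document whose entities sit in a list under @graph, so you can also inspect it with the standard library when the ro-crate.py package isn't available. The sketch below uses a cut-down, made-up crate rather than a real harvest, and is not a substitute for get_crate.

```python
import json

# A cut-down, made-up example of an ro-crate-metadata.json file.
# A real file from a harvest will contain more entities and properties.
sample_crate = json.loads(
    """
    {
      "@context": "https://w3id.org/ro/crate/1.1/context",
      "@graph": [
        {"@id": "./", "@type": "Dataset", "mainEntity": {"@id": "#harvester_run"}},
        {"@id": "ro-crate-metadata.json", "@type": "CreativeWork"},
        {"@id": "harvester_config.json", "@type": "File"},
        {"@id": "#harvester_run", "@type": "CreateAction"}
      ]
    }
    """
)

# Index the entities by their @id so they can be looked up directly
entities = {entity["@id"]: entity for entity in sample_crate["@graph"]}

for entity_id, entity in entities.items():
    print(entity_id, entity["@type"])
```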


NoQueryError

Exception triggered by empty query.


Created by Tim Sherratt for the GLAM Workbench. Support this project by becoming a GitHub sponsor.