Converts a Trove search url into a set of parameters ready for harvesting.
Parameters:
query [required, search url from Trove web interface or API, string]
api_key [required, Trove API key, string]
text [optional, save text files, True or False]
Returns:
a dictionary of parameters
The prepare_query function converts a search url from the Trove web interface or API into a set of parameters that you can feed to the Harvester class. It uses the trove-query-parser library to do most of the work, but adds a few extra parameters needed for the harvest.
If you want to save the contents of the articles as text files, you need to set text to True. This ensures that the articleText field is included in the results.
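The code examples on this page assume imports along the following lines. This is a sketch: the trove_newspaper_harvester.core module path is an assumption based on the package's layout, so adjust the import to match your installation.

import json
import os
import shutil
from pathlib import Path

import arrow
import pandas as pd
from IPython.display import display  # used to show the metadata dictionary below

# Assumption: these helpers live in the package's core module
from trove_newspaper_harvester.core import (
    Harvester,
    get_harvest,
    get_metadata,
    prepare_query,
)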
# TEST query_params()

# Convert a url from the Trove web interface, including text
query_params = prepare_query(
    "https://trove.nla.gov.au/search/category/newspapers?keyword=wragge",
    api_key="MY_API_KEY",
    text=True,
)

# Test the results
assert query_params == {
    "q": "wragge",
    "include": ["articleText"],
    "zone": "newspaper",
    "key": "MY_API_KEY",
    "encoding": "json",
    "reclevel": "full",
    "bulkHarvest": "true",
}

# Convert a url from an API request
query_params = prepare_query(
    "https://api.trove.nla.gov.au/v2/result?q=wragge&zone=newspaper&encoding=json&l-category=Article",
    api_key="MY_API_KEY",
)

assert query_params == {
    "q": ["wragge"],
    "zone": ["newspaper"],
    "encoding": "json",
    "l-category": ["Article"],
    "key": "MY_API_KEY",
    "reclevel": "full",
    "bulkHarvest": "true",
}
Harvest large quantities of digitised newspaper articles from Trove.
Parameters:
query_params [required, dictionary of parameters]
data_dir [optional, directory for harvests, string]
harvest_dir [optional, directory for this harvest, string]
text [optional, save articles as text files, True or False]
pdf [optional, save articles as PDFs, True or False]
image [optional, save articles as images, True or False]
include_linebreaks [optional, include linebreaks in text files, True or False]
max [optional, maximum number of results, integer]
The Harvester class configures and runs your harvest, saving results in a variety of formats.
By default, the harvester will save harvests in a directory called data, with each individual harvest in a directory named according to the current date and time (YYYYMMDDHHmmss format). You can change this by setting the data_dir and harvest_dir parameters. This can help you to manage your harvests by grouping together related searches, or giving them meaningful names.
The harvester generates two data files by default:
metadata.json contains basic information about the harvest
results.ndjson contains details of all the harvested articles in a newline delimited JSON format (each line is a JSON object)
You can convert the ndjson file to a CSV format using Harvester.save_csv.
The text, pdf, and image parameters let you save the contents of the articles as text files, PDF files, or JPG images. Note that saving PDFs and images can be very slow.
If you only want to harvest part of the result set, you can set the max parameter to the number of records you want.
# TEST HARVESTER CREATES DEFAULT HARVEST DIRECTORY
# This example initialises a harvest, but doesn't actually run it.

API_KEY = os.getenv("TROVE_API_KEY")

# Prepare query params
query_params = prepare_query(
    "https://trove.nla.gov.au/search/category/newspapers?keyword=wragge",
    text=True,
    api_key=API_KEY,
)

# Initialise the Harvester with the query parameters
harvester = Harvester(query_params=query_params, text=True)

# If you haven't set the max parameter, the maximum value will be the total number of results
assert harvester.maximum > 0
print(f"Total results: {harvester.maximum:,}")

# Check that the data directory exists
assert Path("data").exists() is True

# Check that a harvest directory with the current date/hour exists in the data directory
assert len(list(Path("data").glob(f'{arrow.utcnow().format("YYYYMMDDHH")}*'))) == 1

# Check that a 'text' directory exists in the harvest directory
assert (
    Path(next(Path("data").glob(f'{arrow.utcnow().format("YYYYMMDDHH")}*')), "text").exists()
    is True
)

# Check that the cache has been initialised
assert Path(f"{'-'.join(harvester.harvest_dir.parts)}.sqlite").exists()

# Clean up
shutil.rmtree(Path("data"))
harvester.delete_cache()
Start the harvest and loop over the result set until finished.
Once the Harvester is initialised with your query parameters, you can call Harvester.harvest to actually start the process. The harvester will loop over the complete results set until finished.
# HARVEST WITH TEXT > 100 records

# Prepare query parameters
query_params = prepare_query(
    "https://trove.nla.gov.au/search/category/newspapers?keyword=wragge&l-state=Western%20Australia&l-illustrationType=Photo",
    api_key=API_KEY,
    text=True,
)

# Initialise the harvester
harvester = Harvester(
    query_params=query_params,
    data_dir="harvests",
    harvest_dir="test_harvest",
    text=True,
)

# Start the harvest
harvester.harvest()

# ---TESTS---
# Check that the ndjson file exists and lines can be parsed as json
json_data = []
with harvester.ndjson_file.open("r") as ndjson_file:
    for line in ndjson_file:
        json_data.append(json.loads(line.strip()))

# The length of the ndjson file should equal the number of records harvested
assert len(json_data) == harvester.harvested

# Check that the metadata file has been created
metadata = get_metadata(harvester.harvest_dir)
assert metadata["query_parameters"] == query_params

# Check that a text file exists and can be read
assert Path("harvests", "test_harvest", json_data[0]["articleText"]).exists()
text = Path("harvests", "test_harvest", json_data[0]["articleText"]).read_text()
assert isinstance(text, str)

# Check that the cache file was deleted
assert Path(f"{'-'.join(harvester.harvest_dir.parts)}.sqlite").exists() is False

shutil.rmtree(Path("harvests"))
# HARVEST WITH NO RESULTS

# Prepare query parameters
query_params = prepare_query(
    "https://trove.nla.gov.au/search/category/newspapers?keyword=wwgagsgshggshghso",
    api_key=API_KEY,
    text=True,
)

# Initialise the harvester
harvester = Harvester(
    query_params=query_params,
)

# Start the harvest
harvester.harvest()

assert harvester.harvested == 0

shutil.rmtree(Path("data"))
# HARVEST WITH PDF AND IMAGE -- 1 RECORD MAX

# Prepare the query parameters
query_params = prepare_query(
    "https://trove.nla.gov.au/search/category/newspapers?keyword=wragge&l-illustrationType=Cartoon",
    api_key=API_KEY,
    text=True,
)

# Initialise the harvester
harvester = Harvester(
    query_params=query_params,
    data_dir="harvests",
    harvest_dir="test_harvest",
    pdf=True,
    image=True,
    max=1,
)

# Start the harvest!
harvester.harvest()

# ---TESTS---
# Check that the ndjson file exists and lines can be parsed as json
json_data = []
with harvester.ndjson_file.open("r") as ndjson_file:
    for line in ndjson_file:
        json_data.append(json.loads(line.strip()))

assert harvester.maximum == harvester.harvested

# The length of the ndjson file should equal the number of records harvested
assert len(json_data) == harvester.harvested

# Check that a pdf and image file exist
assert Path("harvests", "test_harvest", json_data[0]["pdf"]).exists()
assert Path("harvests", "test_harvest", json_data[0]["images"][0]).exists()

shutil.rmtree(Path("harvests"))
The text of articles in the Australian Women’s Weekly is not available through the API, so the harvester has to scrape it separately. This happens automatically. The code below is just a little test to make sure it’s working as expected.
# ---TEST FOR AWW---

# Prepare query params
query_params = prepare_query(
    "https://trove.nla.gov.au/search/category/newspapers?keyword=wragge",
    text=True,
    api_key=API_KEY,
)

# Initialise the Harvester with the query parameters
harvester = Harvester(query_params=query_params, text=True)

# Get html text of an article
text = harvester.get_aww_text(51187457)
assert "THE SHAPE OF THINGS TO COME" in text

# Clean up
shutil.rmtree(Path("data"))
Restarting a failed harvest
The Harvester uses requests-cache to cache API responses. This makes it easy to restart a failed harvest. All you need to do is call Harvester.harvest() again and it will pick up where it left off.
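For example, here's a minimal sketch of a restart, assuming query_params holds the same parameters as the failed run, and my_harvest (a placeholder name) is the directory it was writing to:

# Re-initialise the harvester with the same parameters as the failed run
harvester = Harvester(
    query_params=query_params,
    data_dir="harvests",
    harvest_dir="my_harvest",
    text=True,
)

# Call harvest() again -- responses already retrieved are read back from
# the local requests-cache, so the harvest picks up where it left off
harvester.harvest()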
Flatten and rename data in the ndjson file to save as CSV.
Harvested metadata is saved, by default, in a newline-delimited JSON file. If you’d prefer the results in CSV format, just call Harvester.save_csv(). See below for more information on results formats.
# Prepare query parameters
query_params = prepare_query(
    "https://trove.nla.gov.au/search/category/newspapers?keyword=wragge&l-state=Western%20Australia&l-illustrationType=Photo",
    api_key=API_KEY,
    text=True,
)

# Initialise the harvester
harvester = Harvester(
    query_params=query_params,
    data_dir="harvests",
    harvest_dir="test_harvest",
    text=True,
)

# Start the harvest
harvester.harvest()

# Save results as CSV
harvester.save_csv()

# ---TESTS---
# Check that CSV file exists
csv_file = Path(harvester.harvest_dir, "results.csv")
assert csv_file.exists()

# Open the CSV file and check that the number of rows equals number of records harvested
df = pd.read_csv(csv_file)
assert df.shape[0] == harvester.harvested

shutil.rmtree(Path("harvests"))
Get the metadata for a harvest.
Parameters:
harvest [required, path to harvest, string or pathlib.Path]
Returns:
metadata dictionary
The metadata.json file contains information about a harvest. Using get_metadata you can retrieve the metadata.json file for a particular harvest. This can be useful if, for example, you want to re-run a harvest at a later date – you can just grab the query_parameters and feed them into a new Harvester instance.
# Prepare query parameters
query_params = prepare_query(
    "https://trove.nla.gov.au/search/category/newspapers?keyword=wragge&l-state=Western%20Australia&l-illustrationType=Photo",
    api_key=API_KEY,
    text=True,
)

# Initialise the harvester
harvester = Harvester(
    query_params=query_params,
    text=True,
)

# Start the harvest
harvester.harvest()

# Get the most recent harvest
harvest = get_harvest()

# Get the metadata
metadata = get_metadata(harvest)

# Obscure key
metadata["query_parameters"]["key"] = "########"
display(metadata)

# ---TESTS---
assert metadata["query_parameters"]["q"] == "wragge"
assert metadata["text"] is True
assert metadata["harvested"] == harvester.harvested

shutil.rmtree(Path("data"))
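For example, here's a minimal sketch of re-running an earlier harvest from its saved metadata. The harvest path is a placeholder, and the API key is re-inserted in case the stored value was obscured:

# Load the metadata of an earlier harvest (the path is a placeholder)
metadata = get_metadata(Path("data", "20230101120000"))

# Grab the stored query parameters and make sure they include a current API key
query_params = metadata["query_parameters"]
query_params["key"] = API_KEY

# Feed the parameters into a new Harvester instance and run the harvest again
harvester = Harvester(query_params=query_params, text=True)
harvester.harvest()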
There will be at least two files created for each harvest:
results.ndjson – a newline-delimited JSON file containing the details of all harvested articles
metadata.json – a JSON file which stores all the details of the harvest
The results.ndjson file stores the API results from Trove as is, with a couple of exceptions:
if the text parameter has been set to True, the articleText field will contain the path to a .txt file containing the OCRd text contents of the article (rather than containing the text itself)
similarly, if PDFs and images are requested, the pdf and image fields in the ndjson file will point to the saved files.
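For example, here's a minimal sketch that loops through the ndjson file and follows the articleText paths to the saved text files, assuming a harvest in harvests/test_harvest that was run with text=True:

# Read the results line by line and load the text of each article
harvest_dir = Path("harvests", "test_harvest")

with Path(harvest_dir, "results.ndjson").open("r") as ndjson_file:
    for line in ndjson_file:
        record = json.loads(line)
        # articleText holds a relative path to a .txt file, not the text itself
        text = Path(harvest_dir, record["articleText"]).read_text()
        print(record["id"], len(text.split()))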
You’ll probably find it easier to work with the results in CSV format. The Harvester.save_csv() method flattens the ndjson file and renames some columns to make them compatible with previous versions of the harvester. It produces a results.csv file, which is a plain text CSV (Comma Separated Values) file. You can open it with any spreadsheet program, or load it with pandas – there's a short example after the field list below. The details recorded for each article are:
article_id – a unique identifier for the article
title – the title of the article
date – in ISO format, YYYY-MM-DD
page – page number (of course), but might also indicate the page is part of a supplement or special section
newspaper_id – a unique identifier for the newspaper or gazette title (this can be used to retrieve more information or build a link to the web interface)
newspaper_title – the name of the newspaper (or gazette)
category – one of ‘Article’, ‘Advertising’, ‘Detailed lists, results, guides’, ‘Family Notices’, or ‘Literature’
words – number of words in the article
illustrated – is it illustrated (values are y or n)
edition – edition of newspaper (rarely used)
supplement – section of newspaper (rarely used)
section – section of newspaper (rarely used)
url – the persistent url for the article
page_url – the persistent url of the page on which the article is published
snippet – short text sample
relevance – search relevance score of this result
corrections – number of text corrections
last_correction – date of last correction
tags – number of attached tags
comments – number of attached comments
lists – number of lists this article is included in
text – path to text file
pdf – path to PDF file
image – path to image file
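Here's a minimal sketch of loading results.csv with pandas, assuming a harvest in harvests/test_harvest:

# Load the flattened results into a dataframe
df = pd.read_csv(Path("harvests", "test_harvest", "results.csv"))

# For example, count the harvested articles by newspaper
print(df["newspaper_title"].value_counts())

# Or list the most heavily corrected articles
print(df.sort_values("corrections", ascending=False)[["title", "corrections"]].head())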
If you’ve asked for text files, PDFs, or images, there will be additional directories containing those files. Files containing the OCRd text of the articles will be saved in a directory named text. These are plain text files, stripped of any HTML. The filenames include some basic metadata – the date of the article, the id number of the newspaper, and the id number of the article. So, for example, the filename 19460104-1002-206680758.txt tells you:
19460104 – the article was published on 4 January 1946 (YYYYMMDD)
1002 – the article was published in the newspaper with the id number 1002
206680758 – the unique id number of the article
As you can see, you can use the newspaper and article ids to create direct links into Trove (there's a short example after this list):
to a newspaper or gazette https://trove.nla.gov.au/newspaper/title/[newspaper id]
to an article http://nla.gov.au/nla.news-article[article id]
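For example, plugging the ids from the text filename above into those patterns:

# Build direct links using the ids from 19460104-1002-206680758.txt
newspaper_id = "1002"
article_id = "206680758"

print(f"https://trove.nla.gov.au/newspaper/title/{newspaper_id}")
print(f"http://nla.gov.au/nla.news-article{article_id}")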
Similarly, if you’ve asked for copies of the articles as images, they’ll be in a directory named image. The image filenames are similar to the text files, but with an extra id number for the page from which the image was extracted (a sketch for splitting a filename back into its parts follows the list below). So, for example, the image filename 19250411-460-140772994-11900413.jpg tells you:
19250411 – the article was published on 11 April 1925 (YYYYMMDD)
460 – the article was published in the newspaper with the id number 460
140772994 – the unique id number of the article
11900413 – the unique id number of the page
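If you need to get those components back out of a filename, splitting on the hyphens is enough:

# Split an image filename back into its component parts
filename = Path("19250411-460-140772994-11900413.jpg")
date, newspaper_id, article_id, page_id = filename.stem.split("-")

print(date)          # 19250411
print(newspaper_id)  # 460
print(article_id)    # 140772994
print(page_id)       # 11900413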