# TEST FOR MISSING PARAMETERS
# You need to supply either query_params AND key, OR config_file.
# If you don't you'll get a NoQueryError
with ExceptionExpected(ex=NoQueryError):
    harvester = Harvester()
core
Harvester
Harvester (query_params=None, key=None, data_dir='data', harvest_dir=None, config_file=None, text=False, pdf=False, image=False, include_linebreaks=False, maximum=None)
Harvest large quantities of digitised newspaper articles from Trove. Note that you must supply either query_params and key, or config_file.
Parameters:
- query_params [optional, dictionary of parameters]
- key [optional, Trove API key]
- config_file [optional, path to a config file]
- data_dir [optional, directory for harvests, string]
- harvest_dir [optional, directory for this harvest, string]
- text [optional, save articles as text files, True or False]
- pdf [optional, save articles as PDFs, True or False]
- image [optional, save articles as images, True or False]
- include_linebreaks [optional, include linebreaks in text files, True or False]
- maximum [optional, maximum number of results, integer]
The Harvester class configures and runs your harvest, saving results in a variety of formats. You must supply either query_params and key, or the path to a config_file. If you don’t you’ll get a NoQueryError.
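For example, a minimal sketch of catching the error yourself, assuming NoQueryError is imported from trove_newspaper_harvester.core alongside Harvester:
from trove_newspaper_harvester.core import Harvester, NoQueryError

try:
    # No query_params/key or config_file supplied, so this raises NoQueryError
    harvester = Harvester()
except NoQueryError:
    print("Supply query_params and key, or a config_file")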
By default, the harvester will save harvests in a directory called data, with each individual harvest in a directory named according to the current date and time (YYYYMMDDHHmmss format). You can change this by setting the data_dir and harvest_dir parameters. This can help you to manage your harvests by grouping together related searches, or giving them meaningful names.
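For example, a sketch that groups a harvest under a project directory with a meaningful name (my_query_params and my_api_key are placeholders, as in the Quick start below):
# Save this harvest in weather-project/wragge-search instead of data/YYYYMMDDHHmmss
harvester = Harvester(
    query_params=my_query_params,
    key=my_api_key,
    data_dir="weather-project",
    harvest_dir="wragge-search",
)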
The harvester generates three data files by default:
- harvester_config.json – a file that captures the parameters used to launch the harvest
- ro-crate-metadata.json – a metadata file documenting the harvest in RO-Crate format
- results.ndjson – contains details of all the harvested articles in a newline delimited JSON format (each line is a JSON object)
You can convert the ndjson file to CSV format using Harvester.save_csv.
The text, pdf, and image parameters let you save the contents of the articles as text files, PDF files, or JPG images. Note that saving PDFs and images can be very slow.
If you only want to harvest part of the result set, you can set the maximum parameter to the number of records you want.
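For example, a sketch that stops after the first 100 results (placeholder names as above):
# Harvest no more than 100 records
harvester = Harvester(query_params=my_query_params, key=my_api_key, maximum=100)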
Quick start
- You’ll need a Trove API key to use the harvester.
- Just copy the url from a search in the newspapers and gazettes category.
from trove_newspaper_harvester.core import prepare_query, Harvester
= "myApIkEy"
my_api_key = "https://trove.nla.gov.au/search/category/newspapers?keyword=wragge"
search_url
# Convert the search url into a set of API parameters
= prepare_query(search_url)
my_query_params
# Initialise the Harvester
= Harvester(query_params=myquery_params, key=my_api_key)
harvester
# Start the harvest
harvester.harvest()
If you want to harvest the OCRd text of articles as well as metadata, add text=True to the harvester initialisation.
# Initialise the Harvester
harvester = Harvester(query_params=my_query_params, key=my_api_key, text=True)
Similarly, you can harvest PDFs and images of articles by adding pdf=True and image=True to the harvester initialisation, but keep in mind that these options will make the harvest much slower!
You can generate a set of query parameters from a Trove search url using prepare_query().
prepare_query
prepare_query (query)
Converts a Trove search url into a set of parameters ready for harvesting.
Parameters:
- query [required, search url from Trove web interface or API, string]
Returns:
- a dictionary of parameters
The prepare_query function converts a search url from the Trove web interface or API into a set of parameters that you can feed to Harvester. It uses the trove-query-parser to do most of the work, but adds in a few extra parameters needed for the harvest.
= prepare_query("https://trove.nla.gov.au/search/category/newspapers?keyword=wragge&l-state=New%20South%20Wales&l-artType=newspapers&l-title=508&l-decade=191&l-category=Article"
query_params
) query_params
{'q': 'wragge',
'l-state': ['New South Wales'],
'l-artType': 'newspapers',
'l-title': ['508'],
'l-decade': ['191'],
'l-category': ['Article'],
'category': 'newspaper',
'encoding': 'json',
'reclevel': 'full',
'bulkHarvest': 'true'}
# TEST query_params()
# Convert a url from the Trove web interface
query_params = prepare_query(
    "https://trove.nla.gov.au/search/category/newspapers?keyword=wragge"
)
# Test the results
assert query_params == {
    "q": "wragge",
    "category": "newspaper",
    "encoding": "json",
    "reclevel": "full",
    "bulkHarvest": "true",
}
# Convert a url from an API request
query_params = prepare_query(
    "https://api.trove.nla.gov.au/v2/result?q=wragge&category=newspaper&encoding=json&l-category=Article"
)
assert query_params == {
    "q": ["wragge"],
    "category": ["newspaper"],
    "encoding": "json",
    "l-category": ["Article"],
    "reclevel": "full",
    "bulkHarvest": "true",
}
Initialising a harvest using a harvester_config.json file
The parameters used to initialise a harvest are saved into a file called harvester_config.json. This provides useful documentation of your harvest, making it possible to reconstruct the process at a later date.
For example, you might want to re-harvest a particular query a year after your initial harvest to see how the results have changed. Remember, more articles are being added every week! To re-run a harvest, just point the Harvester to the harvester_config.json file. By default, your new harvest will be saved in a fresh directory.
from trove_newspaper_harvester.core import Harvester
harvester = Harvester(config_file="path/to/old/harvest/harvester_config.json")
harvester.harvest()
Note that the harvester_config.json contains all the parameters used for your harvest, including your Trove API key. This makes it easy to re-run a harvest at a later date, but if you’re intending to share your harvest results you should delete or obscure the key value.
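A minimal sketch of obscuring the key before sharing; the harvest directory path here is a hypothetical example:
import json
from pathlib import Path

# Hypothetical path to a harvest's config file -- point this at your own harvest
config_path = Path("data", "20230826125205", "harvester_config.json")
config = json.loads(config_path.read_text())
# Overwrite the API key before sharing the file
config["key"] = "########"
config_path.write_text(json.dumps(config))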
# TEST: Reharvest from config file
API_KEY = os.getenv("TROVE_API_KEY")

test_config = {
    'query_params': {
        'q': 'wragge',
        'l-state': ['Western Australia'],
        'l-illustrated': 'true',
        'l-illtype': ['Photo'],
        'include': ['articleText'],
        'category': 'newspaper',
        'encoding': 'json',
        'reclevel': 'full',
        'bulkHarvest': 'true',
    },
    'key': API_KEY,
    'full_harvest_dir': 'harvests/test_harvest',
    'maximum': None,
    'text': True,
    'pdf': False,
    'image': False,
    'include_linebreaks': False,
}

Path("harvester_config.json").write_text(json.dumps(test_config))

# Initialise the harvester
harvester = Harvester(config_file="harvester_config.json")

# Start the harvest!
harvester.harvest()

# Clean up
shutil.rmtree(Path("data"))
Path("harvester_config.json").unlink()
Where your harvests are saved
By default, harvests are saved in a directory named data. Each individual harvest is saved in a directory named according to the current date/time, for example: data/20230826125205.
# TEST HARVESTER CREATES DEFAULT HARVEST DIRECTORY
# This example initialises a harvest, but doesn't actually run it.
API_KEY = os.getenv("TROVE_API_KEY")

# Prepare query params
query_params = prepare_query(
    "https://trove.nla.gov.au/search/category/newspapers?keyword=wragge"
)

# Initialise the Harvester with the query parameters
harvester = Harvester(query_params=query_params, key=API_KEY, text=True)

# if you haven't set the max parameter, the total value will be the total number of results
assert harvester.total > 0
print(f"Total results: {harvester.total:,}")

# Check that the data directory exists
assert Path("data").exists() is True

# Check that a harvest directory with the current date/hour exists in the data directory
assert len(list(Path("data").glob(f'{arrow.utcnow().format("YYYYMMDDHH")}*'))) == 1

# Check that a 'text' directory exists in the harvest directory
assert (
    Path(next(Path("data").glob(f'{arrow.utcnow().format("YYYYMMDDHH")}*')), "text").exists()
    is True
)

# Check that the cache has been initialised
assert Path(f"{'-'.join(harvester.harvest_dir.parts)}.sqlite").exists()

# Clean up
shutil.rmtree(Path("data"))
harvester.delete_cache()
Total results: 140,806
You can change the default directories using the data_dir and harvest_dir parameters. For example, if you wanted to keep all the harvests relating to a specific project together, you could set data_dir="my-cool-project". You can use harvest_dir to give your harvest a meaningful name, for example harvest_dir="search-for-cat-photos".
# TEST HARVESTER CREATES REQUESTED HARVEST DIRECTORY
query_params = prepare_query(
    "https://trove.nla.gov.au/search/category/newspapers?keyword=wragge"
)
harvester = Harvester(
    query_params=query_params,
    key=API_KEY,
    data_dir="harvests",
    harvest_dir="my_trove_harvest",
    pdf=True,
    image=True,
)
assert harvester.total > 0
print(f"Total results: {harvester.total:,}")

# Check that the data directory exists
assert Path("harvests").exists() is True
assert Path("harvests", "my_trove_harvest").exists() is True
assert Path("harvests", "my_trove_harvest", "pdf").exists() is True
assert Path("harvests", "my_trove_harvest", "image").exists() is True

# Clean up
shutil.rmtree(Path("harvests"))
harvester.delete_cache()
Total results: 140,806
Harvester.harvest
Harvester.harvest ()
Start the harvest and loop over the result set until finished.
Once the harvester is initialised, you can start the harvest by calling Harvester.harvest(). A progress bar will keep you informed of the status of your harvest.
Add text=True to include the OCRd full text of the articles in the harvest. The contents of each article are saved as a separate file in the text directory. See the harvest results section below for more information.
# HARVEST WITH TEXT > 100 records
# Prepare query parameters
query_params = prepare_query(
    "https://trove.nla.gov.au/search/category/newspapers?keyword=wragge&l-state=Western%20Australia&l-illustrationType=Photo"
)

# Initialise the harvester
harvester = Harvester(
    query_params=query_params,
    key=API_KEY,
    data_dir="harvests",
    harvest_dir="test_harvest",
    text=True,
)
# Start the harvest
harvester.harvest()
# ---TESTS---
# Check that the ndjson file exists and lines can be parsed as json
json_data = []
with harvester.ndjson_file.open("r") as ndjson_file:
    for line in ndjson_file:
        json_data.append(json.loads(line.strip()))
# The length of the ndjson file should equal the number of records harvested
assert len(json_data) == harvester.harvested
# Check that the metadata file has been created
config = get_config(harvester.harvest_dir)
assert config["query_params"] == query_params
# Check that the RO-Crate file was created
crate = get_crate(harvester.harvest_dir)
eids = [
    "./",
    "ro-crate-metadata.json",
    "#harvester_run",
    "harvester_config.json",
    "https://github.com/wragge/trove-newspaper-harvester",
    "results.ndjson",
    "text",
    "https://creativecommons.org/publicdomain/zero/1.0/",
    "http://rightsstatements.org/vocab/CNE/1.0/",
    "http://rightsstatements.org/vocab/NKC/1.0/",
]
for eid in eids:
    assert crate.get(eid) is not None
# Check that a text file exists and can be read
assert Path("harvests", "test_harvest", json_data[0]["articleText"]).exists()
= Path("harvests", "test_harvest", json_data[0]["articleText"]).read_text()
text assert isinstance(text, str)
# Check that the cache file was deleted
assert Path(f"{'-'.join(harvester.harvest_dir.parts)}.sqlite").exists() is False
"harvests")) shutil.rmtree(Path(
The text of articles in the Australian Women’s Weekly is not available through the API, so the harvester has to scrape it separately. This happens automatically. The code below is just a little test to make sure it’s working as expected.
# ---TEST FOR AWW---
# Prepare query params
query_params = prepare_query(
    "https://trove.nla.gov.au/search/category/newspapers?keyword=wragge"
)

# Initialise the Harvester with the query parameters
harvester = Harvester(query_params=query_params, key=API_KEY, text=True)

# Get html text of an article
text = harvester.get_aww_text(51187457)
assert "THE SHAPE OF THINGS TO COME" in text

# Clean up
harvester.delete_cache()
shutil.rmtree(Path("data"))
You can include PDFs and images of the articles by adding pdf=True or image=True to the harvester initialisation. It’s important to note that this will slow down the harvest a lot, as each file needs to be generated and downloaded individually.
# HARVEST WITH PDF AND IMAGE -- 1 RECORD MAX
# Prepare the query parameters
query_params = prepare_query(
    "https://trove.nla.gov.au/search/category/newspapers?keyword=wragge&l-illustrationType=Cartoon"
)

# Initialise the harvester
harvester = Harvester(
    query_params=query_params,
    key=API_KEY,
    data_dir="harvests",
    harvest_dir="test_harvest",
    pdf=True,
    image=True,
    maximum=1,
)
# Start the harvest!
harvester.harvest()
# ---TESTS---
# Check that the ndjson file exists and lines can be parsed as json
json_data = []
with harvester.ndjson_file.open("r") as ndjson_file:
    for line in ndjson_file:
        json_data.append(json.loads(line.strip()))
assert harvester.maximum == harvester.harvested
# The length of the ndjson file should equal the number of records harvested
assert len(json_data) == harvester.harvested
# Check that a pdf and image file exist
assert Path("harvests", "test_harvest", json_data[0]["pdf"]).exists()
assert Path("harvests", "test_harvest", json_data[0]["images"][0]).exists()
"harvests")) shutil.rmtree(Path(
Naturally enough, nothing is harvested from a query with no results. Check your search and your API key!
# HARVEST WITH NO RESULTS
# Prepare query parameters
query_params = prepare_query(
    "https://trove.nla.gov.au/search/category/newspapers?keyword=wwgagsgshggshghso"
)

# Initialise the harvester
harvester = Harvester(
    query_params=query_params,
    key=API_KEY,
)
# Start the harvest
harvester.harvest()
assert harvester.harvested == 0
"data")) shutil.rmtree(Path(
Restarting a failed harvest
The Harvester uses requests-cache to cache API responses. This makes it easy to restart a failed harvest. All you need to do is call Harvester.harvest() again and it will pick up where it left off.
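A minimal sketch, assuming the harvester instance from the failed run is still available:
# If the harvest fails part-way (e.g. a network error), just call harvest() again.
# Previously fetched API responses are served from the cache, so the harvest
# resumes where it left off rather than re-downloading everything.
harvester.harvest()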
Harvester.save_csv
Harvester.save_csv ()
Flatten and rename data in the ndjson file to save as CSV.
Harvested metadata is saved, by default, in a newline-delimited JSON file. If you’d prefer the results in CSV format, just call Harvester.save_csv(). See below for more information on results formats.
# TEST - save harvest results as CSV
# Prepare query parameters
query_params = prepare_query(
    "https://trove.nla.gov.au/search/category/newspapers?keyword=wragge&l-state=Western%20Australia&l-illustrationType=Photo"
)

# Initialise the harvester
harvester = Harvester(
    query_params=query_params,
    key=API_KEY,
    data_dir="harvests",
    harvest_dir="test_harvest",
    text=True,
)
# Start the harvest
harvester.harvest()
# Save results as CSV
harvester.save_csv()
# ---TESTS---
# Check that CSV file exists
csv_file = Path(harvester.harvest_dir, "results.csv")
assert csv_file.exists()

# Open the CSV file and check that the number of rows equals number of records harvested
df = pd.read_csv(csv_file)
assert df.shape[0] == harvester.harvested
shutil.rmtree(Path("harvests"))
Harvest results
There will be at least three files created for each harvest:
- harvester_config.json – a file that captures the parameters used to launch the harvest
- ro-crate-metadata.json – a metadata file documenting the harvest in RO-Crate format
- results.ndjson – contains details of all the harvested articles in a newline delimited JSON format (each line is a JSON object)
The results.ndjson file stores the API results from Trove as is, with a couple of exceptions (a sketch for reading the file follows this list):
- if the text parameter has been set to True, the articleText field will contain the path to a .txt file containing the OCRd text contents of the article (rather than containing the text itself)
- similarly, if PDFs and images are requested, the pdf and image fields in the ndjson file will point to the saved files.
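A minimal sketch of reading the ndjson file, using a hypothetical harvest directory:
import json
from pathlib import Path

# Hypothetical harvest directory -- adjust to your own harvest
harvest_dir = Path("data", "20230826125205")

# Each line of results.ndjson is a separate JSON object describing one article
records = []
with Path(harvest_dir, "results.ndjson").open("r") as f:
    for line in f:
        records.append(json.loads(line))

print(f"{len(records)} articles harvested")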
You’ll probably find it easier to work with the results in CSV format. The Harvester.save_csv() method flattens the ndjson file and renames some columns to make them compatible with previous versions of the harvester. It produces a results.csv file, which is a plain text CSV (Comma Separated Values) file. You can open it with any spreadsheet program, or load it with pandas (see the sketch after this list). The details recorded for each article are:
- article_id – a unique identifier for the article
- title – the title of the article
- date – in ISO format, YYYY-MM-DD
- page – page number (of course), but might also indicate the page is part of a supplement or special section
- newspaper_id – a unique identifier for the newspaper or gazette title (this can be used to retrieve more information or build a link to the web interface)
- newspaper_title – the name of the newspaper (or gazette)
- category – one of ‘Article’, ‘Advertising’, ‘Detailed lists, results, guides’, ‘Family Notices’, or ‘Literature’
- words – number of words in the article
- illustrated – is it illustrated (values are y or n)
- edition – edition of newspaper (rarely used)
- supplement – section of newspaper (rarely used)
- section – section of newspaper (rarely used)
- url – the persistent url for the article
- page_url – the persistent url of the page on which the article is published
- snippet – short text sample
- relevance – search relevance score of this result
- status – some articles that are still being processed will have the status “coming soon” and might be missing other fields
- corrections – number of text corrections
- last_correction – date of last correction
- tags – number of attached tags
- comments – number of attached comments
- lists – number of lists this article is included in
- text – path to text file
- pdf – path to PDF file
- image – path to image file
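A minimal sketch of loading the CSV with pandas (the path is a hypothetical example):
import pandas as pd

# Hypothetical path -- adjust to your own harvest
df = pd.read_csv("data/20230826125205/results.csv")

# For example, count the articles in each category
print(df["category"].value_counts())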
If you’ve asked for text files, PDFs, or images, there will be additional directories containing those files. Files containing the OCRd text of the articles will be saved in a directory named text. These are just plain text files, stripped of any HTML. These files include some basic metadata in their file names – the date of the article, the id number of the newspaper, and the id number of the article. So, for example, the filename 19460104-1002-206680758.txt tells you:
- 19460104 – the article was published on 4 January 1946 (YYYYMMDD)
- 1002 – the article was published in The Tribune
- 206680758 – the article’s unique identifier
As you can see, you can use the newspaper and article ids to create direct links into Trove (a short sketch follows this list):
- to a newspaper or gazette: https://trove.nla.gov.au/newspaper/title/[newspaper id]
- to an article: http://nla.gov.au/nla.news-article[article id]
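A minimal sketch that unpacks a text filename and builds those links:
# Split a harvested text filename into its metadata components
filename = "19460104-1002-206680758.txt"
date_str, newspaper_id, article_id = filename.rsplit(".", 1)[0].split("-")

# Build direct links into Trove
print(f"https://trove.nla.gov.au/newspaper/title/{newspaper_id}")
print(f"http://nla.gov.au/nla.news-article{article_id}")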
Similarly, if you’ve asked for copies of the articles as images, they’ll be in a directory named image. The image file names are similar to the text files, but with an extra id number for the page from which the image was extracted. So, for example, the image filename 19250411-460-140772994-11900413.jpg tells you:
- 19250411 – the article was published on 11 April 1925 (YYYYMMDD)
- 460 – the article was published in The Australasian
- 140772994 – the article’s unique identifier
- 11900413 – the page’s unique identifier (some articles can be split over multiple pages)
get_harvest
get_harvest (data_dir='data', harvest_dir=None)
Get the path to a harvest. If data_dir and harvest_dir are not supplied, this will return the most recent harvest in the ‘data’ directory.
Parameters:
- data_dir [optional, directory for harvests, string]
- harvest_dir [optional, directory for this harvest, string]
Returns:
- a pathlib.Path object pointing to the harvest directory
# TEST GET HARVEST
# Create test folders
"data", "20220919100000").mkdir(parents=True)
Path("data", "20220919200000").mkdir(parents=True)
Path(
# Get latest harvest folder
= get_harvest()
harvest print(harvest)
# ---TESTS---
assert harvest.name == "20220919200000"
= get_harvest(data_dir="data", harvest_dir="20220919100000")
harvest assert harvest.name == "20220919100000"
"data")) shutil.rmtree(Path(
data/20220919200000
get_config
get_config (harvest)
Get the query config parameters from a harvest directory.
Parameters:
- harvest [required, path to harvest, string or pathlib.Path]
Returns:
- config dictionary
The harvester_config.json file contains the parameters used to initiate a harvest. Using get_config you can retrieve the harvester_config.json for a particular harvest. This can be useful if, for example, you want to re-run a harvest at a later date – you can just grab the query_params and feed them into a new Harvester instance (see the sketch below).
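A minimal sketch of re-running the most recent harvest with its saved parameters (API_KEY as set earlier):
# Grab the config of the most recent harvest and launch a new harvest
# with the same query parameters
config = get_config(get_harvest())
harvester = Harvester(query_params=config["query_params"], key=API_KEY)
harvester.harvest()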
# Prepare query parameters
query_params = prepare_query(
    "https://trove.nla.gov.au/search/category/newspapers?keyword=wragge&l-state=Western%20Australia&l-illustrationType=Photo"
)

# Initialise the harvester
harvester = Harvester(
    query_params=query_params,
    key=API_KEY,
    text=True,
)

# Start the harvest
harvester.harvest()

# Get the most recent harvest
harvest = get_harvest()

# Get the metadata
config = get_config(harvest)

# Obscure key and display
config["key"] = "########"
display(config)
# ---TESTS---
assert config["query_params"]["q"] == "wragge"
assert config["text"] is True
"data")) shutil.rmtree(Path(
{'query_params': {'q': 'wragge',
'l-state': ['Western Australia'],
'l-illustrated': 'true',
'l-illustrationType': ['Photo'],
'category': 'newspaper',
'encoding': 'json',
'reclevel': 'full',
'bulkHarvest': 'true',
'include': ['articleText']},
'key': '########',
'full_harvest_dir': 'data/20231023042615',
'maximum': None,
'text': True,
'pdf': False,
'image': False,
'include_linebreaks': False}
get_crate
get_crate (harvest)
Get the RO-Crate metadata file from a harvest directory.
Parameters:
- harvest [required, path to harvest, string or pathlib.Path]
Returns:
- ROCrate object
Trove is changing all the time, so it’s important to document your harvests. The Harvester automatically creates a metadata file using the Research Object Crate (RO-Crate) format. This documents when the harvest was run, how many results were saved, and the version of the harvester. It is linked to the harvester_config.json file that saves the query parameters and harvester settings. This function retrieves the RO-Crate file for a given harvest. It returns an ROCrate object – see the ro-crate-py package for more information.
# Prepare query parameters
query_params = prepare_query(
    "https://trove.nla.gov.au/search/category/newspapers?keyword=wragge&l-state=Western%20Australia&l-illustrationType=Photo"
)

# Initialise the harvester
harvester = Harvester(
    query_params=query_params,
    key=API_KEY,
    text=True,
)

# Start the harvest
harvester.harvest()

# Get the most recent harvest
harvest = get_harvest()

# Get the metadata
crate = get_crate(harvest)
for eid in crate.get_entities():
    print(eid.id, eid.type)
assert crate.get("./").type == "Dataset"
assert crate.get("harvester_config.json").properties()["encodingFormat"] == "application/json"
assert crate.get("./").properties()["mainEntity"] == {"@id": "#harvester_run"}
"data")) shutil.rmtree(Path(
./ Dataset
ro-crate-metadata.json CreativeWork
harvester_config.json File
results.ndjson ['File', 'Dataset']
text ['File', 'Dataset']
#harvester_run CreateAction
https://github.com/wragge/trove-newspaper-harvester SoftwareApplication
http://rightsstatements.org/vocab/NKC/1.0/ CreativeWork
http://rightsstatements.org/vocab/CNE/1.0/ CreativeWork
https://creativecommons.org/publicdomain/zero/1.0/ CreativeWork
NoQueryError
Exception triggered by empty query.
Created by Tim Sherratt for the GLAM Workbench. Support this project by becoming a GitHub sponsor.