# Test for missing API key
def test_no_key():
    start_harvest(
        "https://trove.nla.gov.au/search/category/newspapers?keyword=wragge", None
    )

test_stdout(test_no_key, "The request could not be authorised, check your API key.")
cli
Before you do any harvesting you need to get yourself a Trove API key.
There are three basic commands:
- start – start a new harvest
- restart – restart a stalled harvest
- report – view harvest details
Start a harvest
To start a new harvest you can just do:
troveharvester start "[Trove query]" [Trove API key]
The Trove query can either be a url copied and pasted from a search in the Trove web interface, or a Trove API query url constructed using something like the Trove API Console. Enclose the url in double quotes.
Unless you specify otherwise, a data directory will be automatically created to hold all of your harvests. Each harvest will be saved into a directory named using the current datetime. Details of harvested articles are written to a CSV file named results.csv. The harvest configuration details are also saved to a metadata.json file.
The CLI automatically saves the harvested metadata in a CSV file and, by default, deletes the raw results in the results.ndjson file. You can change this behaviour with the --keep_json option. See more information about the results generated by the harvester.
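Once a harvest has finished, results.csv can be read with Python's standard csv module. The following is a minimal sketch rather than part of the harvester: load_results is a hypothetical helper name, the UTF-8 encoding is an assumption, and no particular column names are assumed.

```python
import csv
from pathlib import Path


def load_results(harvest_dir):
    """Return the rows of a harvest's results.csv as a list of dicts.

    Hypothetical helper: column names are taken from the file's header row.
    """
    csv_path = Path(harvest_dir, "results.csv")
    # newline="" lets the csv module handle line breaks inside quoted fields.
    # UTF-8 is an assumption about how the file was written.
    with csv_path.open(newline="", encoding="utf-8") as csv_file:
        return list(csv.DictReader(csv_file))
```

For example, load_results("data/20230405020440") would return one dictionary per harvested article, keyed by whatever column headings the file contains.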
Options:
- --data_dir – directory in which your harvests will be stored (default is data)
- --harvest_dir – directory in which this harvest will be stored within the output directory (default is current datetime)
- --text – save the OCRd text of each article into a separate .txt file
- --pdf – save a copy of each article as a PDF (this makes the harvest a lot slower as you have to allow a couple of seconds for each PDF to generate)
- --image – save an image of each article into a separate .jpg file (if the article is split over more than one page there will be multiple images)
- --include_linebreaks – preserve linebreaks in saved text files
- --keep_json – save harvested data in a results.ndjson file (one json object per line) as well as results.csv
- --max [integer] – specify a maximum number of articles to harvest
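If you want to launch the CLI from a Python script, the options above can be assembled into an argument list. This is a sketch only: build_start_command and its max_results parameter are hypothetical names, and the flags are the ones listed above.

```python
def build_start_command(query, key, data_dir=None, harvest_dir=None,
                        text=False, pdf=False, image=False,
                        include_linebreaks=False, keep_json=False,
                        max_results=None):
    """Build a 'troveharvester start' argument list (hypothetical helper)."""
    command = ["troveharvester", "start", query, key]
    if data_dir:
        command += ["--data_dir", data_dir]
    if harvest_dir:
        command += ["--harvest_dir", harvest_dir]
    # Boolean options are bare flags, included only when switched on.
    flags = {
        "--text": text,
        "--pdf": pdf,
        "--image": image,
        "--include_linebreaks": include_linebreaks,
        "--keep_json": keep_json,
    }
    command += [flag for flag, enabled in flags.items() if enabled]
    if max_results is not None:
        command += ["--max", str(max_results)]
    return command
```

Passing the resulting list to subprocess.run(command, check=True) avoids shell quoting problems, because the query url travels as a single argument and does not need to be wrapped in double quotes.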
More examples
Basic harvest with no options:
troveharvester start "https://trove.nla.gov.au/search/category/newspapers?keyword=wragge" mySeCReTkEy
Specify the data and harvest directories:
troveharvester start "https://trove.nla.gov.au/search/category/newspapers?keyword=wragge" mySeCReTkEy --data_dir my_harvests --harvest_dir wragge_search
Save the articles as individual text files:
troveharvester start "https://trove.nla.gov.au/search/category/newspapers?keyword=wragge" mySeCReTkEy --text
Save the articles as images and PDFs (this will be very slow):
troveharvester start "https://trove.nla.gov.au/search/category/newspapers?keyword=wragge" mySeCReTkEy --pdf --image
Keep the raw results in the results.ndjson file:
troveharvester start "https://trove.nla.gov.au/search/category/newspapers?keyword=wragge" mySeCReTkEy --keep_json
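Because results.ndjson holds one json object per line, it can be processed a record at a time without loading the whole file. A minimal sketch, assuming UTF-8 text; iter_ndjson is a hypothetical helper name and nothing is assumed about the fields inside each record:

```python
import json
from pathlib import Path


def iter_ndjson(ndjson_path):
    """Yield one parsed JSON object per non-empty line of an ndjson file."""
    with Path(ndjson_path).open(encoding="utf-8") as ndjson_file:
        for line in ndjson_file:
            line = line.strip()
            if line:  # skip blank lines, such as a trailing newline
                yield json.loads(line)
```

For example, sum(1 for _ in iter_ndjson("data/20230405020440/results.ndjson")) would count the records kept by a harvest run with --keep_json.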
Restart a harvest
Things go wrong and harvests get interrupted. If your harvest stops before it should, you can just do:
troveharvester restart
By default the script will try to restart the most recent harvest. If you’ve used the --data_dir or --harvest_dir parameters, you’ll have to supply these again to restart the harvest.
troveharvester restart --data_dir my_harvests --harvest_dir my_latest_dataset
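To see which harvest counts as the most recent, you can inspect the data directory yourself. The sketch below is not the harvester's own lookup: latest_harvest_dir is a hypothetical helper, and it assumes the harvest directories kept their default datetime names (for example 20230405020440), so that sorting by name is also sorting by time.

```python
from pathlib import Path


def latest_harvest_dir(data_dir="data"):
    """Return the harvest directory with the latest datetime name, or None.

    Assumes default datetime-style directory names, which sort
    chronologically when sorted as strings.
    """
    data_path = Path(data_dir)
    if not data_path.is_dir():
        return None
    harvest_dirs = sorted(
        (path for path in data_path.iterdir() if path.is_dir()),
        key=lambda path: path.name,
    )
    return harvest_dirs[-1] if harvest_dirs else None
```

If a harvest was given a custom name with --harvest_dir, name-based sorting no longer reflects when it ran, so this shortcut would not apply.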
Get a summary of a harvest
If you’d like to quickly check the status of a harvest, just try:
troveharvester report
By default the script will report on the most recent harvest. If you’ve used the --data_dir or --harvest_dir parameters, you’ll have to supply these again to generate a report.
troveharvester report --data_dir my_harvests --harvest_dir my_latest_dataset
Functions
The functions below are all called by the command-line interface, so don’t need to be accessed directly. See the core library for programmatic access to the Harvester class.
start_harvest
start_harvest (query, key, data_dir='data', harvest_dir=None, text=False, pdf=False, image=False, include_linebreaks=False, max=None, keep_json=False)
Start a harvest.
Parameters:
- query [required, search url from Trove web interface or API, string]
- key [required, Trove API key, string]
- data_dir [optional, directory for harvests, string]
- harvest_dir [optional, directory for this harvest, string]
- text [optional, save articles as text files, True or False]
- pdf [optional, save articles as PDFs, True or False]
- image [optional, save articles as images, True or False]
- include_linebreaks [optional, include linebreaks in text files, True or False]
- max [optional, maximum number of results, integer]
- keep_json [optional, keep the results.ndjson file, True or False]
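Since the key is required, it can help to check that one is available before starting a harvest. This sketch reads the key from the TROVE_API_KEY environment variable, as the tests in this page do; get_api_key is a hypothetical helper and not part of the harvester.

```python
import os


def get_api_key(env_var="TROVE_API_KEY"):
    """Return the Trove API key from the environment, or raise if it is unset."""
    key = os.getenv(env_var)
    if not key:
        # Fail early with a clear message rather than sending an unauthorised request.
        raise RuntimeError(
            f"Set the {env_var} environment variable to your Trove API key."
        )
    return key
```

Keeping the key in an environment variable also keeps it out of scripts and shell history.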
# Test for missing query
API_KEY = os.getenv("TROVE_API_KEY")

def test_no_query():
    start_harvest("", API_KEY)

test_stdout(test_no_query, "No query parameters found, check your query url.")
start_harvest(
    "https://trove.nla.gov.au/search/category/newspapers?keyword=wragge&l-state=Western%20Australia&l-illustrationType=Photo",
    API_KEY,
    text=True,
)

this_harvest = get_harvest()

assert Path(this_harvest, "results.csv").exists() is True
assert Path(this_harvest, "results.ndjson").exists() is False
assert Path(this_harvest, "text").exists() is True

shutil.rmtree(Path("data"))

start_harvest(
    "https://trove.nla.gov.au/search/category/newspapers?keyword=wragge&l-state=Western%20Australia&l-illustrationType=Photo",
    API_KEY,
    text=True,
    keep_json=True,
)

this_harvest = get_harvest()

assert Path(this_harvest, "results.csv").exists() is True
assert Path(this_harvest, "results.ndjson").exists() is True
assert Path(this_harvest, "text").exists() is True
report_harvest
report_harvest (data_dir='data', harvest_dir=None)
Provide some details of a harvest. If no harvest is specified, show the most recent.
Parameters:
- data_dir [optional, directory for harvests, string]
- harvest_dir [optional, directory for this harvest, string]
report_harvest()
HARVEST METADATA
================
Last harvest started: 2023-04-05T02:04:40.783025+00:00
Harvest id: data/20230405020440
Query parameters:
{ 'bulkHarvest': 'true',
'encoding': 'json',
'include': ['articleText'],
'key': 'gq29l1g1h75pimh4',
'l-illtype': ['Photo'],
'l-illustrated': 'true',
'l-state': ['Western Australia'],
'q': 'wragge',
'reclevel': 'full',
'zone': 'newspaper'}
Max results: 132
Include PDFs: False
Include text: True
Include images: False
Include linebreaks: False
Harvested with: trove_newspaper_harvester v0.6.5
# TEST REPORT
test_stdout(report_harvest, "^\nHARVEST METADATA.*", regex=True)
test_stdout(
    report_harvest,
    ".*Harvested with: trove_newspaper_harvester v[0-9\.]+$",
    regex=True,
)

shutil.rmtree(Path("data"))
restart_harvest
restart_harvest (data_dir='data', harvest_dir=None)
Restart a failed harvest.
Parameters:
- data_dir [optional, directory for harvests, string]
- harvest_dir [optional, directory for this harvest, string]
# TEST RESTART
# To test the restart function we'll create a new harvester but not start it
params = prepare_query(
    query="https://trove.nla.gov.au/search/category/newspapers?keyword=wragge&l-state=Western%20Australia&l-illustrationType=Photo",
    api_key=API_KEY,
    text=True,
)
harvester = Harvester(query_params=params, text=True)

# Should be no data yet
assert harvester.ndjson_file.exists() is False
# The cache should still exist
assert Path(f"{'-'.join(harvester.harvest_dir.parts)}.sqlite").exists()
# Now it should run with restart using the settings from above
restart_harvest()
# Should be data now
assert harvester.ndjson_file.exists() is True
# The cache should have been deleted
assert Path(f"{'-'.join(harvester.harvest_dir.parts)}.sqlite").exists() is False
# Clean up
shutil.rmtree(Path("data"))
main
main ()
Sets up the command-line interface
Created by Tim Sherratt for the GLAM Workbench. Support this project by becoming a GitHub sponsor.