cli

Command-line interface for the harvester

Before you do any harvesting you need to get yourself a Trove API key.

There are three basic commands:

Start a harvest

To start a new harvest you can just do:

troveharvester start "[Trove query]" [Trove API key]

The Trove query can be either a URL copied and pasted from a search in the Trove web interface, or a Trove API query URL constructed using something like the Trove API Console. Enclose the URL in double quotes so that your shell doesn't try to interpret characters such as & and ?.
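If you'd rather assemble a search URL in code than copy it from the browser, something like the following sketch might help. The base address and the keyword and l-state parameters are taken from the example URLs on this page; the build_search_url helper is hypothetical and is not part of the harvester.

```python
from urllib.parse import quote, urlencode

# Base address of a newspaper search in the Trove web interface,
# as used in the example commands on this page.
SEARCH_BASE = "https://trove.nla.gov.au/search/category/newspapers"


def build_search_url(keyword, facets=None):
    """Return a web-interface search URL (hypothetical helper, not a harvester function)."""
    params = {"keyword": keyword}
    # Facet names such as "l-state" contain hyphens, so they are passed in a dict.
    params.update(facets or {})
    # quote_via=quote encodes spaces as %20, matching the URLs shown in this page.
    return f"{SEARCH_BASE}?{urlencode(params, quote_via=quote)}"


print(build_search_url("wragge"))
print(build_search_url("wragge", {"l-state": "Western Australia"}))
```

The resulting string can be passed as the query argument, still wrapped in double quotes on the command line.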

Unless you specify otherwise, a data directory will be automatically created to hold all of your harvests. Each harvest will be saved into a directory named using the current datetime. Details of harvested articles are written to a CSV file named results.csv. The harvest configuration details are also saved to a metadata.json file.

The CLI automatically saves the harvested metadata in a CSV file and, by default, deletes the raw results in the results.ndjson file. You can change this behaviour with the --keep_json option. See more information about the results generated by the harvester.
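Once a harvest has finished, the two files described above can be read with the Python standard library. This is a minimal sketch that only assumes a harvest directory containing results.csv and metadata.json; the load_harvest helper and the example directory name are hypothetical, and no particular column names are assumed.

```python
import csv
import json
from pathlib import Path


def load_harvest(harvest_path):
    """Return (metadata, rows) for a harvest directory (hypothetical helper)."""
    harvest_path = Path(harvest_path)
    # metadata.json holds the harvest configuration details
    metadata = json.loads((harvest_path / "metadata.json").read_text(encoding="utf-8"))
    # results.csv holds one row per harvested article
    with (harvest_path / "results.csv").open(newline="", encoding="utf-8") as csv_file:
        rows = list(csv.DictReader(csv_file))
    return metadata, rows


# Hypothetical harvest directory; replace it with one of your own.
example = Path("data", "20220921114244")
if example.exists():
    metadata, rows = load_harvest(example)
    print(f"{len(rows)} articles harvested")
```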

Options:

--data_dir

directory in which your harvests will be stored (default is data)

--harvest_dir

directory in which this harvest will be stored within the data directory (default is current datetime)

--text

save the OCRd text of each article into a separate .txt file

--pdf

save a copy of each article as a PDF (this makes the harvest a lot slower as you have to allow a couple of seconds for each PDF to generate)

--image

save an image of each article into a separate .jpg file (if the article is split over more than one page there will be multiple images)

--include_linebreaks

preserve linebreaks in saved text files

--keep_json

save harvested data in a results.ndjson file (one JSON object per line) as well as results.csv

--max [integer]

specify a maximum number of articles to harvest
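If you want to launch harvests from a script, the options listed above can be assembled into an argument list. The sketch below only uses the flags documented on this page; the build_start_command helper and its max_articles parameter name are hypothetical, and the function builds the command without running it.

```python
import shlex


def build_start_command(
    query,
    key,
    data_dir=None,
    harvest_dir=None,
    text=False,
    pdf=False,
    image=False,
    include_linebreaks=False,
    keep_json=False,
    max_articles=None,
):
    """Return the argument list for a `troveharvester start` call (hypothetical helper)."""
    command = ["troveharvester", "start", query, key]
    # Options that take a value
    if data_dir:
        command += ["--data_dir", data_dir]
    if harvest_dir:
        command += ["--harvest_dir", harvest_dir]
    # Simple on/off flags
    flags = {
        "--text": text,
        "--pdf": pdf,
        "--image": image,
        "--include_linebreaks": include_linebreaks,
        "--keep_json": keep_json,
    }
    command += [flag for flag, enabled in flags.items() if enabled]
    if max_articles is not None:
        command += ["--max", str(max_articles)]
    return command


command = build_start_command(
    "https://trove.nla.gov.au/search/category/newspapers?keyword=wragge",
    "mySeCReTkEy",
    text=True,
    max_articles=100,
)
# shlex.join quotes the URL so the line can be pasted into a shell
print(shlex.join(command))
```

The list can be handed to subprocess.run(command, check=True) if you want Python to run the harvest for you.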

More examples

Basic harvest with no options:

troveharvester start "https://trove.nla.gov.au/search/category/newspapers?keyword=wragge" mySeCReTkEy

Specify the data and harvest directories:

troveharvester start "https://trove.nla.gov.au/search/category/newspapers?keyword=wragge" mySeCReTkEy --data_dir my_harvests --harvest_dir wragge_search

Save the articles as individual text files:

troveharvester start "https://trove.nla.gov.au/search/category/newspapers?keyword=wragge" mySeCReTkEy --text

Save the articles as images and PDFs (this will be very slow):

troveharvester start "https://trove.nla.gov.au/search/category/newspapers?keyword=wragge" mySeCReTkEy --pdf --image

Keep the raw results in the results.ndjson file:

troveharvester start "https://trove.nla.gov.au/search/category/newspapers?keyword=wragge" mySeCReTkEy --keep_json
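A results.ndjson file kept with --keep_json holds one JSON object per line, so it can be read a line at a time without loading the whole file. Here is a minimal sketch using only the standard library; the read_ndjson helper and the example path are hypothetical, and nothing is assumed about the fields inside each record.

```python
import json
from pathlib import Path


def read_ndjson(path):
    """Yield one parsed JSON object per non-empty line (hypothetical helper)."""
    with Path(path).open(encoding="utf-8") as ndjson_file:
        for line in ndjson_file:
            line = line.strip()
            if line:
                yield json.loads(line)


# Hypothetical path; replace it with a harvest of your own.
raw_results = Path("data", "20220921114244", "results.ndjson")
if raw_results.exists():
    for record in read_ndjson(raw_results):
        # Show the field names of the first record, then stop
        print(sorted(record.keys()))
        break
```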

Restart a harvest

Things go wrong and harvests get interrupted. If your harvest stops before it should, you can just do:

troveharvester restart

By default the script will try to restart the most recent harvest. If you’ve used the --data_dir or --harvest_dir parameters, you’ll have to supply these again to restart the harvest.

troveharvester restart --data_dir my_harvests --harvest_dir my_latest_dataset
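To see which harvest counts as the most recent, you can look for the newest directory yourself. The sketch below assumes harvests use the default datetime names (such as 20220921114244), so that sorting by name also sorts by time; it is not necessarily how the harvester finds the latest harvest, and the latest_harvest helper is hypothetical.

```python
from pathlib import Path


def latest_harvest(data_dir="data"):
    """Return the harvest directory whose name sorts last, or None (hypothetical helper)."""
    data_path = Path(data_dir)
    if not data_path.is_dir():
        return None
    # Assumes default datetime-style names, which sort chronologically
    harvests = sorted(p for p in data_path.iterdir() if p.is_dir())
    return harvests[-1] if harvests else None


print(latest_harvest())
```

If you've given harvests custom names with --harvest_dir, name order no longer tells you which is newest, which is one reason to supply the directories explicitly when restarting.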

Get a summary of a harvest

If you’d like to quickly check the status of a harvest, just try:

troveharvester report

By default the script will report on the most recent harvest. If you’ve used the --data_dir or --harvest_dir parameters, you’ll have to supply these again to generate a report.

troveharvester report --data_dir my_harvests --harvest_dir my_latest_dataset

Functions

The functions below are all called by the command-line interface, so don’t need to be accessed directly. See the core library for programmatic access to the Harvester class.



start_harvest

 start_harvest (query, key, data_dir='data', harvest_dir=None, text=False,
                pdf=False, image=False, include_linebreaks=False,
                max=None, keep_json=False)

Start a harvest.

Parameters:

  • query [required, search url from Trove web interface or API, string]
  • key [required, Trove API key, string]
  • data_dir [optional, directory for harvests, string]
  • harvest_dir [optional, directory for this harvest, string]
  • text [optional, save articles as text files, True or False]
  • pdf [optional, save articles as PDFs, True or False]
  • image [optional, save articles as images, True or False]
  • include_linebreaks [optional, include linebreaks in text files, True or False]
  • max [optional, maximum number of results, integer]
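  • keep_json [optional, keep the results.ndjson file, True or False]

import os
import shutil
from pathlib import Path

from trove_newspaper_harvester.cli import start_harvest
from trove_newspaper_harvester.core import get_harvest

# Read the Trove API key from an environment variable
API_KEY = os.getenv("TROVE_API_KEY")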

start_harvest(
    "https://trove.nla.gov.au/search/category/newspapers?keyword=wragge&l-state=Western%20Australia&l-illustrationType=Photo",
    API_KEY,
    text=True,
)

this_harvest = get_harvest()

assert Path(this_harvest, "results.csv").exists() is True
assert Path(this_harvest, "results.ndjson").exists() is False
assert Path(this_harvest, "text").exists() is True

shutil.rmtree(Path("data"))
start_harvest(
    "https://trove.nla.gov.au/search/category/newspapers?keyword=wragge&l-state=Western%20Australia&l-illustrationType=Photo",
    API_KEY,
    text=True,
    keep_json=True,
)

this_harvest = get_harvest()

assert Path(this_harvest, "results.csv").exists() is True
assert Path(this_harvest, "results.ndjson").exists() is True
assert Path(this_harvest, "text").exists() is True


report_harvest

 report_harvest (data_dir='data', harvest_dir=None)

Provide some details of a harvest. If no harvest is specified, show the most recent.

Parameters:

  • data_dir [optional, directory for harvests, string]
  • harvest_dir [optional, directory for this harvest, string]

report_harvest()

HARVEST METADATA
================
Last harvest started: 2022-09-21T11:42:44.719906+00:00
Harvest id: data/20220921114244
Query parameters:
{ 'bulkHarvest': 'true',
  'encoding': 'json',
  'include': ['articleText'],
  'key': 'gq29l1g1h75pimh4',
  'l-illtype': ['Photo'],
  'l-illustrated': 'true',
  'l-state': ['Western Australia'],
  'q': 'wragge',
  'reclevel': 'full',
  'zone': 'newspaper'}
Max results: 130
Include PDFs: False
Include text: True
Include images: False
Include linebreaks: False
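Harvested with: trove_newspaper_harvester v0.0.1

# TEST REPORT
from fastcore.test import test_stdout

test_stdout(report_harvest, "^\nHARVEST METADATA.*", regex=True)
# Raw string so that \. reaches the regex engine without an invalid-escape warning
test_stdout(
    report_harvest, r".*Harvested with: trove_newspaper_harvester v[0-9\.]+$", regex=True
)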

shutil.rmtree(Path("data"))


restart_harvest

 restart_harvest (data_dir='data', harvest_dir=None)

Restart a failed harvest.

Parameters:

  • data_dir [optional, directory for harvests, string]
  • harvest_dir [optional, directory for this harvest, string]

# TEST RESTART
# To test the restart function we'll create a new harvester but not start it
params = prepare_query(
    query="https://trove.nla.gov.au/search/category/newspapers?keyword=wragge&l-state=Western%20Australia&l-illustrationType=Photo",
    api_key=API_KEY,
    text=True,
)
harvester = Harvester(query_params=params, text=True)

# Should be no data yet
assert harvester.ndjson_file.exists() is False

# The cache should still exist
assert Path(f"{'-'.join(harvester.harvest_dir.parts)}.sqlite").exists()

# Now it should run with restart using the settings from above
restart_harvest()

# Should be data now
assert harvester.ndjson_file.exists() is True

# The cache should have been deleted
assert Path(f"{'-'.join(harvester.harvest_dir.parts)}.sqlite").exists() is False

# Clean up
shutil.rmtree(Path("data"))


main

 main ()

Sets up the command-line interface


Created by Tim Sherratt for the GLAM Workbench. Support this project by becoming a GitHub sponsor.