
Command-line interface for the harvester

Before you do any harvesting you need to get yourself a Trove API key.

There are three basic commands:

Start a harvest

To start a new harvest you can just do:

troveharvester start "[Trove query]" [Trove API key]

The Trove query can either be a url copied and pasted from a search in the Trove web interface, or a Trove API query url constructed using something like the Trove API Console. Enclose the url in double quotes.

Unless you specify otherwise, a data directory will be automatically created to hold all of your harvests. Each harvest will be saved into a directory named using the current datetime. Details of harvested articles are written to a CSV file named results.csv. The harvest configuration details are also saved to a metadata.json file.

The CLI automatically saves the harvested metadata in a CSV file and, by default, deletes the raw results in the results.ndjson file. You can change this behaviour with the --keep_json option. See more information about the results generated by the harvester.



Options

--config_file [path to config file]

Instead of supplying the query url and API key on the command line, you can point to an existing config file. The file harvester_config.json is automatically created when you run a harvest.

--data_dir [directory]

directory in which your harvests will be stored (default is data)

--harvest_dir [directory]

directory in which this harvest will be stored within the output directory (default is current datetime)

--text

save the OCRd text of each article into a separate .txt file

--pdf

save a copy of each article as a PDF (this makes the harvest a lot slower as you have to allow a couple of seconds for each PDF to generate)

--image

save an image of each article into a separate .jpg file (if the article is split over more than one page there will be multiple images)

--include_linebreaks

preserve linebreaks in saved text files

--keep_json

save harvested data in a results.ndjson file (one json object per line) as well as results.csv

--max [integer]

specify a maximum number of articles to harvest
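
If you want to launch harvests from a script, the options above can be assembled into a command list. This is only a sketch: build_start_command is a hypothetical helper (not part of the harvester), and it only uses flags that appear in the examples on this page.

```python
from typing import List, Optional


def build_start_command(
    query: str,
    key: str,
    data_dir: Optional[str] = None,
    harvest_dir: Optional[str] = None,
    text: bool = False,
    pdf: bool = False,
    image: bool = False,
    keep_json: bool = False,
    max_results: Optional[int] = None,
) -> List[str]:
    """Assemble an argument list for `troveharvester start` (hypothetical helper)."""
    cmd = ["troveharvester", "start", query, key]
    # Options that take a value
    if data_dir:
        cmd += ["--data_dir", data_dir]
    if harvest_dir:
        cmd += ["--harvest_dir", harvest_dir]
    # Simple on/off switches
    if text:
        cmd.append("--text")
    if pdf:
        cmd.append("--pdf")
    if image:
        cmd.append("--image")
    if keep_json:
        cmd.append("--keep_json")
    if max_results is not None:
        cmd += ["--max", str(max_results)]
    return cmd
```

The resulting list could then be passed to subprocess.run(cmd, check=True). Passing a list rather than a single string means the query url does not need shell quoting.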

More examples

Basic harvest with no options:

troveharvester start "" mySeCReTkEy

Specify the data and harvest directories:

troveharvester start "" mySeCReTkEy --data_dir my_harvests --harvest_dir wragge_search

Save the articles as individual text files:

troveharvester start "" mySeCReTkEy --text

Save the articles as images and PDFs (this will be very slow):

troveharvester start "" mySeCReTkEy --pdf --image

Keep the raw results in the results.ndjson file:

troveharvester start "" mySeCReTkEy --keep_json
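
Once you have kept the results.ndjson file, you can load it with the Python standard library, since it holds one JSON object per line. A minimal sketch, where read_ndjson is a hypothetical helper and the harvest path is whatever directory your harvest was saved to:

```python
import json
from pathlib import Path
from typing import Any, Dict, List


def read_ndjson(path) -> List[Dict[str, Any]]:
    """Load a newline-delimited JSON file into a list of dicts (hypothetical helper)."""
    records = []
    with Path(path).open(encoding="utf-8") as ndjson_file:
        for line in ndjson_file:
            line = line.strip()
            # Skip blank lines, for example a trailing newline at the end of the file
            if line:
                records.append(json.loads(line))
    return records
```

For a large harvest it may be better to process each line as it is read rather than hold every record in memory.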

Run a harvest from a config file:

troveharvester start --config_file "/old-harvest-dir/harvester_config.json"

Restart a harvest

Things go wrong and harvests get interrupted. If your harvest stops before it should, you can just do:

troveharvester restart

By default the script will try to restart the most recent harvest. If you’ve used the --data_dir or --harvest_dir parameters, you’ll have to supply these again to restart the harvest.

troveharvester restart --data_dir my_harvests --harvest_dir my_latest_dataset
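
To see which harvest counts as the most recent, you can list the data directory yourself. This sketch assumes the default datetime-based directory names, which sort chronologically when sorted as text. latest_harvest is a hypothetical helper, and the harvester's own lookup may work differently.

```python
from pathlib import Path
from typing import Optional


def latest_harvest(data_dir: str = "data") -> Optional[Path]:
    """Return the harvest directory whose name sorts last, or None (hypothetical helper)."""
    root = Path(data_dir)
    if not root.is_dir():
        return None
    # Datetime-style names such as 20230826123318 sort in chronological order
    harvests = sorted((p for p in root.iterdir() if p.is_dir()), key=lambda p: p.name)
    return harvests[-1] if harvests else None
```

If you have named harvests with --harvest_dir, the names will not necessarily sort by date, so this approach only suits the default naming.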

Get a summary of a harvest

If you’d like to quickly check the status of a harvest, just try:

troveharvester report

By default the script will report on the most recent harvest. If you’ve used the --data_dir or --harvest_dir parameters, you’ll have to supply these again to generate a report.

troveharvester report --data_dir my_harvests --harvest_dir my_latest_dataset


The functions below are all called by the command-line interface, so don’t need to be accessed directly. See the core library for programmatic access to the Harvester class.



 start_harvest (query=None, key=None, config_file=None, data_dir='data',
                harvest_dir=None, text=False, pdf=False, image=False,
                include_linebreaks=False, max=None, keep_json=False)

Start a harvest. Note that you must supply either query and key, or config_file.


  • query [optional, search url from Trove web interface or API, string]
  • key [optional, Trove API key, string]
  • config_file [optional, path to a config file]
  • data_dir [optional, directory for harvests, string]
  • harvest_dir [optional, directory for this harvest, string]
  • text [optional, save articles as text files, True or False]
  • pdf [optional, save articles as PDFs, True or False]
  • image [optional, save articles as images, True or False]
  • include_linebreaks [optional, include linebreaks in text files, True or False]
  • max [optional, maximum number of results, integer]
  • keep_json [optional, keep the results.ndjson file, True or False]

# Test for missing query
API_KEY = os.getenv("TROVE_API_KEY")

def test_no_query():
    start_harvest("", API_KEY)

test_stdout(test_no_query, "No query parameters found, check your query url. You must supply either a query and key, or a config_file.")

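# After a harvest that saves text files and leaves keep_json at its default,
# results.ndjson should have been deleted
this_harvest = get_harvest()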

assert Path(this_harvest, "results.csv").exists() is True
assert Path(this_harvest, "results.ndjson").exists() is False
assert Path(this_harvest, "text").exists() is True


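# After a harvest that saves text files with keep_json enabled,
# results.ndjson should still be there
this_harvest = get_harvest()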

assert Path(this_harvest, "results.csv").exists() is True
assert Path(this_harvest, "results.ndjson").exists() is True
assert Path(this_harvest, "text").exists() is True



 report_harvest (data_dir='data', harvest_dir=None)

Provide some details of a harvest. If no harvest is specified, show the most recent.


  • data_dir [optional, directory for harvests, string]
  • harvest_dir [optional, directory for this harvest, string]

Harvest path: data/20230826123318
Query parameters:
{ 'bulkHarvest': 'true',
  'category': 'newspaper',
  'encoding': 'json',
  'include': ['articleText'],
  'l-illtype': ['Photo'],
  'l-illustrated': 'true',
  'l-state': ['Western Australia'],
  'q': 'wragge',
  'reclevel': 'full'}
Max results: None
Include PDFs: False
Include text: True
Include images: False
Include linebreaks: False

Harvest started: 2023-08-26T22:33:18.881889+10:00
Harvest ended: 2023-08-26T22:33:25.034403+10:00
Total articles: 174
Harvested by: Trove Newspaper and Gazette Harvester version 0.7.0

test_stdout(report_harvest, "^\nHARVEST PARAMETERS.*", regex=True)
test_stdout(report_harvest, "HARVEST RESULTS.*", regex=True)
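
The start and end times in the report are ISO 8601 timestamps, so you can work out how long a harvest took. A small sketch using the values from the sample report above, where harvest_rate is a hypothetical helper rather than part of the harvester:

```python
from datetime import datetime
from typing import Tuple


def harvest_rate(started: str, ended: str, total_articles: int) -> Tuple[float, float]:
    """Return (elapsed seconds, articles per second) for a harvest (hypothetical helper)."""
    elapsed = (datetime.fromisoformat(ended) - datetime.fromisoformat(started)).total_seconds()
    return elapsed, total_articles / elapsed


# Values copied from the sample report
elapsed, rate = harvest_rate(
    "2023-08-26T22:33:18.881889+10:00",
    "2023-08-26T22:33:25.034403+10:00",
    174,
)
print(f"{elapsed:.2f} seconds, {rate:.1f} articles per second")
```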



 restart_harvest (data_dir='data', harvest_dir=None)

Restart a failed harvest.


  • data_dir [optional, directory for harvests, string]
  • harvest_dir [optional, directory for this harvest, string]

# To test the restart function we'll create a new harvester but not start it
params = prepare_query(
harvester = Harvester(query_params=params, key=API_KEY, text=True)

# Should be no data yet
assert harvester.ndjson_file.exists() is False

# The cache should still exist
assert Path(f"{'-'.join(}.sqlite").exists()

# Now it should run with restart using the settings from above

# Should be data now
assert harvester.ndjson_file.exists() is True

# The cache should have been deleted
assert Path(f"{'-'.join(}.sqlite").exists() is False

# Clean up



 main ()

Sets up the command-line interface

Created by Tim Sherratt for the GLAM Workbench. Support this project by becoming a GitHub sponsor.