core

A harvester for downloading large numbers of digitised newspaper articles from Trove.

prepare_query

 prepare_query (query, api_key, text=False)

Converts a Trove search url into a set of parameters ready for harvesting.

Parameters:

  • query [required, search url from Trove web interface or API, string]
  • api_key [required, Trove API key, string]
  • text [optional, save text files, True or False]

Returns:

  • a dictionary of parameters

The prepare_query function converts a search url from the Trove web interface or API into a set of parameters that you can feed to Harvester. It uses the trove-query-parser to do most of the work, but adds in a few extra parameters needed for the harvest.

If you want to save the contents of the articles as text files you need to set text to True. This ensures that the articleText field is included in the results.

query_params = prepare_query(
    query="https://trove.nla.gov.au/search/category/newspapers?keyword=wragge&l-state=New%20South%20Wales&l-artType=newspapers&l-title=508&l-decade=191&l-category=Article",
    api_key="MY_API_KEY",
)
query_params

{'q': 'wragge',
 'l-state': ['New South Wales'],
 'zone': 'newspaper',
 'l-title': ['508'],
 'l-decade': ['191'],
 'l-category': ['Article'],
 'key': 'MY_API_KEY',
 'encoding': 'json',
 'reclevel': 'full',
 'bulkHarvest': 'true'}

# TEST prepare_query()
# Convert a url from the Trove web interface, including text
query_params = prepare_query(
    "https://trove.nla.gov.au/search/category/newspapers?keyword=wragge",
    api_key="MY_API_KEY",
    text=True,
)

# Test the results
assert query_params == {
    "q": "wragge",
    "include": ["articleText"],
    "zone": "newspaper",
    "key": "MY_API_KEY",
    "encoding": "json",
    "reclevel": "full",
    "bulkHarvest": "true",
}

# Convert a url from an API request
query_params = prepare_query(
    "https://api.trove.nla.gov.au/v2/result?q=wragge&zone=newspaper&encoding=json&l-category=Article",
    api_key="MY_API_KEY",
)

assert query_params == {
    "q": ["wragge"],
    "zone": ["newspaper"],
    "encoding": "json",
    "l-category": ["Article"],
    "key": "MY_API_KEY",
    "reclevel": "full",
    "bulkHarvest": "true",
}
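The conversion described above can be pictured with a rough sketch built on the standard library. This is not how trove-query-parser is implemented: `sketch_prepare_query` is a hypothetical helper that only handles the `keyword` parameter and `l-` facets of a web interface url, and the extra harvest parameters are simply copied from the example output shown earlier. The real parser does more, such as translating or dropping some facets.

```python
from urllib.parse import parse_qs, urlparse


def sketch_prepare_query(query, api_key, text=False):
    """Hypothetical, simplified stand-in for prepare_query (illustration only)."""
    # parse_qs decodes percent-encoding and collects repeated values in lists
    raw = parse_qs(urlparse(query).query)
    params = {}
    for name, values in raw.items():
        if name == "keyword":
            # The web interface's keyword becomes the API's q parameter
            params["q"] = values[0]
        elif name.startswith("l-"):
            # Facets are kept as lists of values
            params[name] = values
    if text:
        # Ask for the article text to be included in the results
        params["include"] = ["articleText"]
    # Extra parameters copied from the example output earlier on this page
    params.update(
        {
            "zone": "newspaper",
            "key": api_key,
            "encoding": "json",
            "reclevel": "full",
            "bulkHarvest": "true",
        }
    )
    return params


sketch = sketch_prepare_query(
    "https://trove.nla.gov.au/search/category/newspapers?keyword=wragge&l-state=New%20South%20Wales",
    api_key="MY_API_KEY",
    text=True,
)
print(sketch)
```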

Harvester

 Harvester (query_params, data_dir='data', harvest_dir=None, text=False,
            pdf=False, image=False, include_linebreaks=False, max=None)

Harvest large quantities of digitised newspaper articles from Trove.

Parameters:

  • query_params [required, dictionary of parameters]
  • data_dir [optional, directory for harvests, string]
  • harvest_dir [optional, directory for this harvest, string]
  • text [optional, save articles as text files, True or False]
  • pdf [optional, save articles as PDFs, True or False]
  • image [optional, save articles as images, True or False]
  • include_linebreaks [optional, include linebreaks in text files, True or False]
  • max [optional, maximum number of results, integer]

The Harvester class configures and runs your harvest, saving results in a variety of formats.

By default, the harvester will save harvests in a directory called data, with each individual harvest in a directory named according to the current date and time (YYYYMMDDHHmmss format). You can change this by setting the data_dir and harvest_dir parameters. This can help you to manage your harvests by grouping together related searches, or giving them meaningful names.
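As a rough illustration of the naming scheme, the sketch below assembles a harvest path from the two parameters. `sketch_harvest_path` is a hypothetical helper, not part of the harvester, and it assumes the timestamp is taken in UTC (the test code on this page uses `arrow.utcnow()`).

```python
from datetime import datetime, timezone
from pathlib import Path


def sketch_harvest_path(data_dir="data", harvest_dir=None, now=None):
    """Hypothetical helper showing how a harvest path could be assembled."""
    if harvest_dir is None:
        # Fall back to a timestamp in YYYYMMDDHHmmss format (UTC assumed)
        now = now or datetime.now(timezone.utc)
        harvest_dir = now.strftime("%Y%m%d%H%M%S")
    return Path(data_dir, harvest_dir)


# A fixed time makes the default name predictable for this demonstration
fixed_time = datetime(2022, 9, 21, 12, 55, 8, tzinfo=timezone.utc)
default_path = sketch_harvest_path(now=fixed_time)
named_path = sketch_harvest_path("harvests", "my_trove_harvest")
print(default_path)
print(named_path)
```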

The harvester generates two data files by default:

  • metadata.json contains basic information about the harvest
  • results.ndjson contains details of all the harvested articles in a newline delimited JSON format (each line is a JSON object)

You can convert the ndjson file to a CSV format using Harvester.save_csv.
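If you want to work with the ndjson file directly, each line can be parsed on its own. The sketch below reads a newline-delimited JSON file into a list of dictionaries. `read_ndjson` is a hypothetical helper, and the sample records (with `id` and `heading` fields) are made up for the demonstration rather than taken from a real harvest.

```python
import json
import tempfile
from pathlib import Path


def read_ndjson(path):
    """Read a newline-delimited JSON file into a list of dictionaries."""
    records = []
    with Path(path).open("r", encoding="utf-8") as ndjson_file:
        for line in ndjson_file:
            line = line.strip()
            if line:  # skip any blank lines
                records.append(json.loads(line))
    return records


# Demonstrate with a small made-up file rather than a real harvest
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp, "results.ndjson")
    sample.write_text(
        '{"id": "1", "heading": "First"}\n{"id": "2", "heading": "Second"}\n',
        encoding="utf-8",
    )
    records = read_ndjson(sample)

print(len(records))
```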

The text, pdf, and image options let you save the contents of the articles as text files, PDF files, or JPG images. Note that saving PDFs and images can be very slow.

If you only want to harvest part of the results set you can set the max parameter to the number of records you want.

# TEST HARVESTER CREATES DEFAULT HARVEST DIRECTORY
# This example initialises a harvest, but doesn't actually run it.

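import os
import shutil
from pathlib import Path

import arrow

# Only needed when running this outside the notebook that defines them
from trove_newspaper_harvester.core import Harvester, prepare_query

API_KEY = os.getenv("TROVE_API_KEY")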

# Prepare query params
query_params = prepare_query(
    "https://trove.nla.gov.au/search/category/newspapers?keyword=wragge",
    text=True,
    api_key=API_KEY,
)

# Initialise the Harvester with the query parameters
harvester = Harvester(query_params=query_params, text=True)

# if you haven't set the max parameter, the maximum value will be the total number of results
assert harvester.maximum > 0
print(f"Total results: {harvester.maximum:,}")

# Check that the data directory exists
assert Path("data").exists() is True

# Check that a harvest directory with the current date/hour exists in the data directory
assert len(list(Path("data").glob(f'{arrow.utcnow().format("YYYYMMDDHH")}*'))) == 1

# Check that a 'text' directory exists in the harvest directory
assert (
    Path(
        next(Path("data").glob(f'{arrow.utcnow().format("YYYYMMDDHH")}*')), "text"
    ).exists()
    is True
)

# Check that the cache has been initialised
assert Path(f"{'-'.join(harvester.harvest_dir.parts)}.sqlite").exists()

# Clean up
shutil.rmtree(Path("data"))
harvester.delete_cache()

Total results: 137,770

# TEST HARVESTER CREATES REQUESTED HARVEST DIRECTORY

query_params = prepare_query(
    "https://trove.nla.gov.au/search/category/newspapers?keyword=wragge",
    api_key=API_KEY,
)

harvester = Harvester(
    query_params=query_params,
    data_dir="harvests",
    harvest_dir="my_trove_harvest",
    pdf=True,
    image=True,
)

assert harvester.maximum > 0
print(f"Total results: {harvester.maximum:,}")

# Check that the data directory exists
assert Path("harvests").exists() is True

assert Path("harvests", "my_trove_harvest").exists() is True

assert Path("harvests", "my_trove_harvest", "pdf").exists() is True

assert Path("harvests", "my_trove_harvest", "image").exists() is True

# Clean up
shutil.rmtree(Path("harvests"))
harvester.delete_cache()

Total results: 137,770

Harvester.harvest

 Harvester.harvest ()

Start the harvest and loop over the result set until finished.

Once the Harvester is initialised with your query parameters, you can call Harvester.harvest to actually start the process. The harvester will loop over the complete results set until finished.

# HARVEST WITH TEXT > 100 records

# Prepare query parameters
query_params = prepare_query(
    "https://trove.nla.gov.au/search/category/newspapers?keyword=wragge&l-state=Western%20Australia&l-illustrationType=Photo",
    api_key=API_KEY,
    text=True,
)

# Initialise the harvester
harvester = Harvester(
    query_params=query_params,
    data_dir="harvests",
    harvest_dir="test_harvest",
    text=True,
)

# Start the harvest
harvester.harvest()


# ---TESTS---
# Check that the ndjson file exists and lines can be parsed as json
json_data = []
with harvester.ndjson_file.open("r") as ndjson_file:
    for line in ndjson_file:
        json_data.append(json.loads(line.strip()))

# The length of the ndjson file should equal the number of records harvested
assert len(json_data) == harvester.harvested

# Check that the metadata file has been created
metadata = get_metadata(harvester.harvest_dir)
assert metadata["query_parameters"] == query_params

# Check that a text file exists and can be read
assert Path("harvests", "test_harvest", json_data[0]["articleText"]).exists()
text = Path("harvests", "test_harvest", json_data[0]["articleText"]).read_text()
assert isinstance(text, str)

# Check that the cache file was deleted
assert Path(f"{'-'.join(harvester.harvest_dir.parts)}.sqlite").exists() is False

shutil.rmtree(Path("harvests"))

# HARVEST WITH PDF AND IMAGE -- 1 RECORD MAX

# Prepare the query parameters
query_params = prepare_query(
    "https://trove.nla.gov.au/search/category/newspapers?keyword=wragge&l-illustrationType=Cartoon",
    api_key=API_KEY,
    text=True,
)

# Initialise the harvester
harvester = Harvester(
    query_params=query_params,
    data_dir="harvests",
    harvest_dir="test_harvest",
    pdf=True,
    image=True,
    max=1,
)

# Start the harvest!
harvester.harvest()


# ---TESTS---

# Check that the ndjson file exists and lines can be parsed as json
json_data = []
with harvester.ndjson_file.open("r") as ndjson_file:
    for line in ndjson_file:
        json_data.append(json.loads(line.strip()))

assert harvester.maximum == harvester.harvested

# The length of the ndjson file should equal the number of records harvested
assert len(json_data) == harvester.harvested

# Check that a pdf and image file exist
assert Path("harvests", "test_harvest", json_data[0]["pdf"]).exists()
assert Path("harvests", "test_harvest", json_data[0]["images"][0]).exists()

shutil.rmtree(Path("harvests"))

The text of articles in the Australian Women’s Weekly is not available through the API, so the harvester has to scrape it separately. This happens automatically. The code below is just a little test to make sure it’s working as expected.

#---TEST FOR AWW---
# Prepare query params
query_params = prepare_query(
    "https://trove.nla.gov.au/search/category/newspapers?keyword=wragge",
    text=True,
    api_key=API_KEY,
)

# Initialise the Harvester with the query parameters
harvester = Harvester(query_params=query_params, text=True)

# Get html text of an article
text = harvester.get_aww_text(51187457)
assert "THE SHAPE OF THINGS TO COME" in text

# Clean up
shutil.rmtree(Path("data"))

Restarting a failed harvest

The Harvester uses requests-cache to cache API responses. This makes it easy to restart a failed harvest. All you need to do is call Harvester.harvest() again and it will pick up where it left off.
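Because restarting is just a matter of calling the method again, the restart can be automated with a small retry loop. This is a sketch only: `harvest_with_retries` is a hypothetical helper, and `FlakyHarvester` is a stand-in that simulates failures so the example runs without an API key. With the real Harvester, the cached responses are what let a repeated call pick up where it left off.

```python
import time


def harvest_with_retries(harvester, attempts=3, wait_seconds=0):
    """Call harvester.harvest() until it completes or attempts run out.

    Returns the number of attempts that were needed.
    """
    for attempt in range(1, attempts + 1):
        try:
            harvester.harvest()
            return attempt
        except Exception as error:
            if attempt == attempts:
                # Give up and let the caller see the last error
                raise
            print(f"Attempt {attempt} failed ({error}); restarting")
            time.sleep(wait_seconds)


class FlakyHarvester:
    """Stand-in for Harvester that fails twice before succeeding."""

    def __init__(self):
        self.calls = 0

    def harvest(self):
        self.calls += 1
        if self.calls < 3:
            raise ConnectionError("simulated network failure")


flaky = FlakyHarvester()
attempts_needed = harvest_with_retries(flaky)
print(attempts_needed)
```

In practice you would probably want a pause between attempts (the `wait_seconds` argument) so a struggling connection has time to recover.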


Harvester.save_csv

 Harvester.save_csv ()

Flatten and rename data in the ndjson file to save as CSV.

Harvested metadata is saved, by default, in a newline-delimited JSON file. If you’d prefer the results in CSV format, just call Harvester.save_csv(). See below for more information on results formats.

# Prepare query parameters
query_params = prepare_query(
    "https://trove.nla.gov.au/search/category/newspapers?keyword=wragge&l-state=Western%20Australia&l-illustrationType=Photo",
    api_key=API_KEY,
    text=True,
)

# Initialise the harvester
harvester = Harvester(
    query_params=query_params,
    data_dir="harvests",
    harvest_dir="test_harvest",
    text=True,
)

# Start the harvest
harvester.harvest()

# Save results as CSV
harvester.save_csv()

# ---TESTS---

# Check that CSV file exists
csv_file = Path(harvester.harvest_dir, "results.csv")
assert csv_file.exists()

# Open the CSV file and check that the number of rows equals number of records harvested
df = pd.read_csv(csv_file)
assert df.shape[0] == harvester.harvested

shutil.rmtree(Path("harvests"))
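To give a feel for what "flatten and rename" means, here is a minimal sketch using only the standard library. It is not the code used by Harvester.save_csv: the `flatten` helper, the sample record, and the `renames` mapping are all made up for illustration, and the real method may treat nested values differently.

```python
import csv
import io


def flatten(record, parent_key="", sep="_"):
    """Flatten nested dictionaries into a single level of keys."""
    flat = {}
    for key, value in record.items():
        new_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten(value, new_key, sep))
        elif isinstance(value, list):
            # Join lists so they fit in a single CSV cell
            flat[new_key] = "|".join(str(item) for item in value)
        else:
            flat[new_key] = value
    return flat


# Made-up record and column mapping, for illustration only
record = {
    "id": "123",
    "heading": "A headline",
    "title": {"id": "35", "value": "A Newspaper"},
}
renames = {
    "id": "article_id",
    "heading": "title",
    "title_id": "newspaper_id",
    "title_value": "newspaper_title",
}

# Flatten first, then swap in the friendlier column names
row = {renames.get(key, key): value for key, value in flatten(record).items()}

output = io.StringIO()
writer = csv.DictWriter(output, fieldnames=list(row))
writer.writeheader()
writer.writerow(row)
csv_text = output.getvalue()
print(csv_text)
```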

get_harvest

 get_harvest (data_dir='data', harvest_dir=None)

Get the path to a harvest. If data_dir and harvest_dir are not supplied, this will return the most recent harvest in the ‘data’ directory.

Parameters:

  • data_dir [optional, directory for harvests, string]
  • harvest_dir [optional, directory for this harvest, string]

Returns:

  • a pathlib.Path object pointing to the harvest directory

# TEST GET HARVEST

# Create test folders
Path("data", "20220919100000").mkdir(parents=True)
Path("data", "20220919200000").mkdir(parents=True)

# Get latest harvest folder
harvest = get_harvest()
print(harvest)

# ---TESTS---
assert harvest.name == "20220919200000"

harvest = get_harvest(data_dir="data", harvest_dir="20220919100000")
assert harvest.name == "20220919100000"

shutil.rmtree(Path("data"))

data/20220919200000
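One way to find the most recent harvest is to rely on the fact that timestamp-named directories sort chronologically. The sketch below takes that approach. `sketch_latest_harvest` is a hypothetical helper, and sorting by name is an assumption: get_harvest may use a different method to decide which harvest is the latest.

```python
import tempfile
from pathlib import Path


def sketch_latest_harvest(data_dir="data"):
    """Hypothetical helper: return the harvest directory with the latest name."""
    harvests = sorted(path for path in Path(data_dir).iterdir() if path.is_dir())
    if not harvests:
        raise FileNotFoundError(f"No harvests found in {data_dir}")
    # Timestamp names sort chronologically, so the last one is the newest
    return harvests[-1]


# Create two made-up harvest folders in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "20220919100000").mkdir()
    Path(tmp, "20220919200000").mkdir()
    latest_name = sketch_latest_harvest(tmp).name

print(latest_name)
```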

get_metadata

 get_metadata (harvest)

Get the query metadata from a harvest directory.

Parameters:

  • harvest [required, path to harvest, string or pathlib.Path]

Returns:

  • metadata dictionary

The metadata.json file contains information about a harvest. Using get_metadata you can retrieve the metadata.json for a particular harvest. This can be useful if, for example, you want to re-run a harvest at a later date – you can just grab the query_parameters and feed them into a new Harvester instance.

# Prepare query parameters
query_params = prepare_query(
    "https://trove.nla.gov.au/search/category/newspapers?keyword=wragge&l-state=Western%20Australia&l-illustrationType=Photo",
    api_key=API_KEY,
    text=True,
)

# Initialise the harvester
harvester = Harvester(
    query_params=query_params,
    text=True,
)

# Start the harvest
harvester.harvest()

# Get the most recent harvest
harvest = get_harvest()

# Get the metadata
metadata = get_metadata(harvest)
# Obscure key
metadata["query_parameters"]["key"] = "########"
display(metadata)

# ---TESTS---
assert metadata["query_parameters"]["q"] == "wragge"
assert metadata["text"] is True
assert metadata["harvested"] == harvester.harvested

shutil.rmtree(Path("data"))

{'query_parameters': {'q': 'wragge',
  'l-state': ['Western Australia'],
  'l-illustrated': 'true',
  'l-illtype': ['Photo'],
  'include': ['articleText'],
  'zone': 'newspaper',
  'key': '########',
  'encoding': 'json',
  'reclevel': 'full',
  'bulkHarvest': 'true'},
 'harvest_directory': 'data/20220921125508',
 'max': 130,
 'text': True,
 'pdf': False,
 'image': False,
 'include_linebreaks': False,
 'date_started': '2022-09-21T12:55:09.287005+00:00',
 'harvester': 'trove_newspaper_harvester v0.6.1',
 'harvested': 130}
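Re-running a harvest from saved metadata comes down to reading the JSON file and pulling out the query_parameters entry. The sketch below shows the idea with plain `json`. `load_query_parameters` is a hypothetical helper, and the metadata it reads is a cut-down, made-up example rather than the output of a real harvest.

```python
import json
import tempfile
from pathlib import Path


def load_query_parameters(harvest_dir):
    """Read metadata.json from a harvest directory and return its query parameters."""
    metadata_file = Path(harvest_dir, "metadata.json")
    metadata = json.loads(metadata_file.read_text(encoding="utf-8"))
    return metadata["query_parameters"]


# Build a tiny made-up harvest directory to demonstrate
with tempfile.TemporaryDirectory() as tmp:
    sample = {"query_parameters": {"q": "wragge", "zone": "newspaper"}, "text": True}
    Path(tmp, "metadata.json").write_text(json.dumps(sample), encoding="utf-8")
    saved_params = load_query_parameters(tmp)

print(saved_params)

# With a real harvest you could then pass the parameters to a new Harvester:
# harvester = Harvester(query_params=saved_params, text=True)
# harvester.harvest()
```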

Results

There will be at least two files created for each harvest:

  • results.ndjson – a newline-delimited JSON file containing the details of all harvested articles
  • metadata.json – a JSON file which stores all the details of the harvest

The results.ndjson stores the API results from Trove as is, with a couple of exceptions:

  • if the text parameter has been set to True, the articleText field will contain the path to a .txt file containing the OCRd text contents of the article (rather than containing the text itself)
  • similarly, if PDFs and images are requested, the pdf and image fields in the ndjson file will point to the saved files.

You’ll probably find it easier to work with the results in CSV format. The Harvester.save_csv() method flattens the ndjson file and renames some columns to make them compatible with previous versions of the harvester. It produces a results.csv file, which is a plain text CSV (Comma Separated Values) file. You can open it with any spreadsheet program. The details recorded for each article are:

  • article_id – a unique identifier for the article
  • title – the title of the article
  • date – in ISO format, YYYY-MM-DD
  • page – page number (of course), but might also indicate the page is part of a supplement or special section
  • newspaper_id – a unique identifier for the newspaper or gazette title (this can be used to retrieve more information or build a link to the web interface)
  • newspaper_title – the name of the newspaper (or gazette)
  • category – one of ‘Article’, ‘Advertising’, ‘Detailed lists, results, guides’, ‘Family Notices’, or ‘Literature’
  • words – number of words in the article
  • illustrated – is it illustrated (values are y or n)
  • edition – edition of newspaper (rarely used)
  • supplement – section of newspaper (rarely used)
  • section – section of newspaper (rarely used)
  • url – the persistent url for the article
  • page_url – the persistent url of the page on which the article is published
  • snippet – short text sample
  • relevance – search relevance score of this result
  • corrections – number of text corrections
  • last_correction – date of last correction
  • tags – number of attached tags
  • comments – number of attached comments
  • lists – number of lists this article is included in
  • text – path to text file
  • pdf – path to PDF file
  • image – path to image file
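The columns listed above can also be read with Python's built-in csv module, without a spreadsheet program. This sketch uses a few made-up rows in place of a real results.csv and shows two simple summaries: articles per category and a total word count.

```python
import csv
import io
from collections import Counter

# Made-up rows using a few of the column names listed above
sample_csv = """article_id,title,date,category,words
1,First headline,1910-01-05,Article,250
2,Second headline,1911-03-17,Advertising,40
3,Third headline,1912-07-22,Article,610
"""

# For a real harvest you would open the results.csv file instead
rows = list(csv.DictReader(io.StringIO(sample_csv)))

# Count articles per category
category_counts = Counter(row["category"] for row in rows)

# Total words across all rows (CSV values are strings, so convert first)
total_words = sum(int(row["words"]) for row in rows)

print(category_counts)
print(total_words)
```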

If you’ve asked for text files, PDFs, or images, there will be additional directories containing those files. Files containing the OCRd text of the articles will be saved in a directory named text. These are just plain text files, stripped of any HTML. These files include some basic metadata in their file titles – the date of the article, the id number of the newspaper, and the id number of the article. So, for example, the filename 19460104-1002-206680758.txt tells you:

  • the article was published on 4 January 1946
  • the newspaper’s id is 1002
  • the article’s id is 206680758

As you can see, you can use the newspaper and article ids to create direct links into Trove:

  • to a newspaper or gazette https://trove.nla.gov.au/newspaper/title/[newspaper id]
  • to an article http://nla.gov.au/nla.news-article[article id]
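The file name convention and the two link patterns above can be combined in a few lines of code. `parse_text_filename` is a hypothetical helper written for this example. It assumes the file name always has exactly three hyphen-separated parts, as in the example given.

```python
from datetime import datetime
from pathlib import Path


def parse_text_filename(filename):
    """Split a text file name into its date, newspaper id, and article id."""
    # Path.stem drops the .txt extension before splitting on hyphens
    date_str, newspaper_id, article_id = Path(filename).stem.split("-")
    return {
        "date": datetime.strptime(date_str, "%Y%m%d").date(),
        "newspaper_id": newspaper_id,
        "article_id": article_id,
        "newspaper_url": f"https://trove.nla.gov.au/newspaper/title/{newspaper_id}",
        "article_url": f"http://nla.gov.au/nla.news-article{article_id}",
    }


details = parse_text_filename("19460104-1002-206680758.txt")
print(details["date"].isoformat())
print(details["newspaper_url"])
print(details["article_url"])
```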

Similarly, if you’ve asked for copies of the articles as images, they’ll be in a directory named image. The image file names are similar to the text files, but with an extra id number for the page from which the image was extracted. So, for example, the image filename 19250411-460-140772994-11900413.jpg tells you:

  • the article was published on 11 April 1925
  • the newspaper’s id is 460
  • the article’s id is 140772994
  • the page’s id is 11900413
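The four parts of an image file name can be pulled apart with a regular expression. This is a sketch under the assumption that every image name follows the date-newspaper-article-page pattern described here and ends in .jpg. `IMAGE_NAME` and `parse_image_filename` are hypothetical names, not part of the harvester.

```python
import re

# date (8 digits), then newspaper, article, and page ids, separated by hyphens
IMAGE_NAME = re.compile(
    r"^(?P<date>\d{8})-(?P<newspaper_id>\d+)-(?P<article_id>\d+)-(?P<page_id>\d+)\.jpg$"
)


def parse_image_filename(filename):
    """Return the parts of an image file name, or None if it doesn't match."""
    match = IMAGE_NAME.match(filename)
    return match.groupdict() if match else None


image_details = parse_image_filename("19250411-460-140772994-11900413.jpg")
print(image_details)
```

Returning None for names that don't match makes it easy to skip unrelated files when looping over a directory.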


Created by Tim Sherratt for the GLAM Workbench. Support this project by becoming a GitHub sponsor.