= download_images("107024751") images
articles
Functions
main
main ()
CLI configuration
download_images
download_images (article_id, output_dir='', size=None, masked=False)
Extract an image of a newspaper article from the page image(s), download and save it, and return the image filename(s).
Parameters:
- article_id – identifier for a Trove newspaper article
- output_dir – a directory to save images in (will be created if it doesn’t exist)
- size – maximum dimensions of image
- masked –
True
orFalse
, remove content that isn’t part of the article
Returns:
- a list of image file names
get_article_boxes
get_article_boxes (article_id)
Get a list of boundary boxes for an article, each box representing the position of an article on a page.
Positional information about the article is attached to each line of the OCR output in data attributes.
This function loads the HTML version of the article and scrapes the x, y, and width values for each line of text to determine the coordinates of a box around the article.
get_box
get_box (zones)
Loop through all the zones to find the outer limits of each boundary. Return a bounding box around the article.
Basic usage
Just give the download_images
function a Trove article identifier. It downloads and saves the images, and returns a list of image file names. If an article is split across multiple pages, you’ll get multiple images!
The image files names have the form
nla.news-article[article id]-[page id].jpg
.
For example:
nla.news-article107024751-11565831.jpg
.
This means you can always track back from the image to find out where it came from.
The download_images
function returns a list of the image file names. We can use this to preview the downloaded articles, or to run some additional processing steps.
0], width=300) displayImage(images[
By default, images are saved in the current directory. To change this, you can include the output_dir
parameter to specify the directory where you want the images saved. If the directory doesn’t exist, it will be created.
= download_images("107024751", output_dir="images") images
Not all articles fit neatly in bounding boxes. If the article spans multiple, uneven columns, your image might contain content from neighbouring articles. By setting masked
to True
you can try to remove the irrelevant content.
Compare these two images of the same article. The first is unmasked, and includes a lot of extra content. The second is created by setting masked=True
and only includes the desired article.
= download_images("255909273", output_dir="unmasked")
images "unmasked", images[0]), width=300) displayImage(Path(
= download_images("255909273", output_dir="masked", masked=True)
images "masked", images[0]), width=300) displayImage(Path(
The accuracy of the mask depends on the way the boundaries of the article have been identified in Trove. Sometimes an article will be grouped with a neighbouring illustration, even though they’re not connected.
Tests
The get_article_boxes
function resturns a list of bounding boxes defining the position of an article on a page.
= get_article_boxes("107024751")
boxes boxes
[{'page_id': '11565831',
'left': 1062,
'top': 1406,
'right': 1775,
'bottom': 1949,
'zones': [{'left': 1062, 'top': 1406, 'right': 1775, 'bottom': 1467},
{'left': 1062, 'top': 1469, 'right': 1775, 'bottom': 1949}]}]
This article is printed on one page only, so get_article_boxes
should return a list containing a single box.
assert len(boxes) == 1
Each box includes a page_id
value, providing the identifier of the page on which the article is printed.
assert "page_id" in boxes[0]
The bounding box has left
, right
, top
, and bottom
values. We’d expect left
to be less that right
.
assert boxes[0]["left"] < boxes[0]["right"]
Let’s try an article that is split across multiple pages.
= get_article_boxes("162833980") boxes
There should be two boxes.
assert len(boxes) == 2
The download_images
function downloads the images and returns a list of image names.
= download_images("107024751")
images images
['nla.news-article107024751-11565831.jpg']
There should be a single image file name in the images
list.
assert len(images) == 1
The image file names should include the article id.
assert "107024751" in images[0]
Let’s check that the image has been downloaded and saved.
assert Path(images[0]).exists()
# Delete the image
0]).unlink() Path(images[
Let’s use the output_dir
parameter to save the image to a specified directory.
= download_images("107024751", output_dir="images-test") images
Let’s make sure the image has been saved to the directory.
assert Path("images-test", images[0]).exists()
Let’s download an article split across multiple pages.
= download_images("162833980") images
assert len(images) > 1
And check that all the images have been downloaded.
for img in images:
assert Path(img).exists()
# Delete downloaded images
for img in images:
Path(img).unlink()
# Clean up after tests
"images"))
shutil.rmtree(Path("images-test"))
shutil.rmtree(Path("masked"))
shutil.rmtree(Path("unmasked")) shutil.rmtree(Path(