Huddersfield Chronicle (1850-1900)
1 August 2020 at 4:55 pm #299
Happy Yorkshire Day!
This seems as good an opportunity as any to announce something that I’ve been tinkering with for a while — adding the text from 50 years of the Huddersfield Chronicle newspaper to the site!
Anyone who’s done local history research will probably already be familiar with the newspaper archives available at sites such as Gale’s British Library Newspapers and the British Newspaper Archive (also available via FindMyPast). All of these require a subscription to access the content, as sadly the British Library haven’t adopted the model used elsewhere (e.g. Trove in Australia) of making historic newspaper content freely available.
Although the newspapers are mostly historic documents long out of copyright, the British Library has asserted a copyright claim over the digitised versions. However, they cannot also assert copyright over transcriptions of those digitised versions:
What I’m planning to do is to convert the newspaper scans (using OCR software) into a text version that can be incorporated into Huddersfield Exposed. The process of OCR-ing old newspapers can be extremely problematic but, after much experimenting, I believe the quality of the conversion is good enough to justify doing it. For example, here is an extract of a newspaper page (© British Library)…
…and this is the result of an OCR conversion, followed by positioning the extracted text so that it mirrors the original newspaper page layout…
As you can probably see, there are plenty of errors (mostly these are highlighted in shades of red) but the majority of the text is readable.
So, the plan is to figure out the best way of adding the text to Huddersfield Exposed to make it easily searchable.
There’ll also be some logistical issues to resolve. For example, the site currently has around 5 million words of text. If the full run of the Chronicle from 1850 to 1900 is added, that would be somewhere in the region of 500 million words, so roughly a hundredfold increase in the site’s word count. Whether that’s achievable with the current web server hardware remains to be seen!

4 August 2020 at 9:43 am #301
In terms of when the content will start to appear on the web site, the OCR conversion should be complete within the next week or so.
Once that’s done, I’m going to spend some time analysing the converted text to find ways of correcting mis-converted words. As an example, the conversion has found 550,689 occurrences of the word “huddersfield” so far. However, alongside the correct spellings there’s a long tail of errors — here’s a random selection and the number of occurrences:
- hudderefield – 2,385
- hudderafield – 2,318
- huddebsfield – 448
- hudderefield – 345
- huddeksfield – 317
- hudddersfield – 266
- huddbrsfield – 245
- huidersfield – 213
- hudderafleld – 190
- hudderaficld – 162
- huddarsfield – 149
- hucdersfield – 148
- hubdersfield – 146
- huldersfield – 133
- hudderafeld – 130
- huddbesfield – 109
- huddbrsfibld – 107
- hudaersfield – 86
- hudbersfield – 74
- hudderafiold – 52
- hucderstield – 46
- hulderstield – 41
- huddeebsfield – 24
- huddarstield – 18
- hubrdersfield – 13
- hugdersfield – 13
- hatdersfield – 10
- hulcdersfield – 10
- hucidersfield – 6
- hujdersfield – 6
- hudyersfield – 2
- hulersticid – 1
- huldorstield – 1
- hugecieficid – 1
- huedyversfield – 1
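Spotting that all of these variants are small corruptions of the same word can be automated. Here’s a minimal sketch (not the site’s actual code) that measures how far each variant is from the target using Levenshtein edit distance, with only the Python standard library:

```python
# Minimal sketch: measure how many single-character edits separate an
# OCR variant from the intended word ("huddersfield").

def levenshtein(a: str, b: str) -> int:
    """Number of single-character edits needed to turn a into b."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                   # deletion
                           cur[j - 1] + 1,                # insertion
                           prev[j - 1] + (ca != cb)))     # substitution
        prev = cur
    return prev[-1]

# A few of the variants from the list above, with their distance to the target:
for variant in ["hudderefield", "huddebsfield", "hudddersfield", "huedyversfield"]:
    print(variant, levenshtein(variant, "huddersfield"))
```

Most of the high-frequency variants turn out to be a single edit away from the correct word, which is what makes automated correction feasible at all.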
Once the OCR conversion is complete, I’ll use spell checking tools such as Hunspell and Aspell to identify the correctly spelled words and the number of times they occur — this process might also highlight words, such as local names and placenames, that are missing from the spell checker’s dictionary. That data can then be used to help provide weightings for determining that “Hudyersfield” is more likely to be “Huddersfield” than, say, “Hundersfield” (a former township in Rochdale).
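The weighting idea can be sketched in a few lines: among candidate words that are close matches to the misspelling, prefer the one that occurs most often in the corpus. This is only an illustration of the principle, not the actual tooling, and the frequency for “hundersfield” below is invented for the example:

```python
# Hypothetical sketch of frequency-weighted correction: among candidates
# within a similarity cutoff, prefer the one most frequent in the corpus.
import difflib

corpus_freq = {
    "huddersfield": 550_689,   # figure quoted above
    "hundersfield": 12,        # illustrative frequency for the rarer placename
}

def best_correction(word, cutoff=0.8):
    """Return the most frequent corpus word among close matches, or None."""
    candidates = difflib.get_close_matches(word.lower(), corpus_freq,
                                           n=5, cutoff=cutoff)
    if not candidates:
        return None
    return max(candidates, key=corpus_freq.__getitem__)

print(best_correction("hudyersfield"))
```

“Hudyersfield” is equally close to both candidates by pure string similarity, so the corpus frequency is what tips the decision towards “Huddersfield”.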
When everything’s ready, I’ll start adding the content to the site in batches and then monitor how it affects the web server’s performance. Adding 500m words to the site would make Huddersfield Exposed around 14% of the size of Wikipedia… but sadly I don’t have 14% of Wikipedia’s enormous budget (around $2.5m per year) to spend on extra hardware! 😀

7 August 2020 at 3:53 pm #304
The OCR process has now completed. My guesstimate of 500m words was a little high; the final word count is 432,998,673!

10 August 2020 at 4:22 pm #308
After a busy weekend of writing code, I now have a spellchecker API up and running:
For those who like techie details, it uses both Aspell and Hunspell. It makes use of a custom dictionary of local placenames (sourced from Huddersfield Exposed) and Yorkshire towns & villages, along with local personal names (also from Huddersfield Exposed) and common Yorkshire names (from the 1881 Census). The suggested words are weighted by the frequency with which they appear in the newspaper OCR text, and the API also checks whether the suggestions appear in the name or placename dictionaries.
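The thread doesn’t show the API’s actual response format, but a service built along these lines might return something like the following (the field names here are hypothetical, not the real schema):

```python
# Hypothetical sketch of a spellchecker API response; the real endpoint
# and field names are not shown in the thread.
import json

response = {
    "query": "huddersfleld",            # the mis-converted input word
    "suggestions": [
        {
            "word": "huddersfield",
            "corpus_frequency": 550_689,  # weighting from the OCR text
            "is_placename": True,         # matched the placename dictionary
            "is_personal_name": False,    # no match in the names dictionary
        }
    ],
}
print(json.dumps(response, indent=2))
```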
The first example is using one of the many mis-conversions of “Huddersfield”. The API gives the correct spelling and flags it as a place name.
Similarly for a misspelling of Fartown:
And finally, a misspelling of the surname Wimpenny, which the API flags as likely being part of a name:
With a half-decent API in place, I’ve started analysing the 43,006 newspaper pages of converted OCR text. Each page takes around 2 to 3 minutes to process, so we’re looking at roughly 3 months in total. I’ve effectively halved that by temporarily upgrading the server to double the CPU power. If anyone’s feeling generous and would like to make a donation via PayPal (firstname.lastname@example.org), I’ll do a further temporary upgrade and double the CPU power again.
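As a quick sanity check on that estimate (taking the upper end of the quoted per-page time):

```python
# Back-of-the-envelope check of the processing estimate.
pages = 43_006
minutes_per_page = 3                 # upper end of the 2-3 minute range
total_days = pages * minutes_per_page / 60 / 24
print(round(total_days))             # roughly 90 days, i.e. about 3 months
print(round(total_days / 2))         # roughly 45 days with double the CPU power
```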
24 December 2020 at 6:22 pm #385

Well, I wanted to try and get all of this done by Christmas but it’s taken a bit longer than expected. Anyway, the first batch of pages is now being added to the site!

27 December 2020 at 3:11 pm #386
So… some good news and some bad news!
The newspaper page content integrates well with the site and it’s definitely been worthwhile to attempt the project.
However, I’ve had to abandon the loading of the pages for technical reasons. As I mentioned in an earlier comment, “Whether [it’s] achievable with the current web server hardware remains to be seen!” Sadly, it’s not achievable with the current hardware…
The site runs on MediaWiki (the same software as Wikipedia) and adding the newspaper content wasn’t a major issue. As you’d expect, the size of the MediaWiki database grew as the content was added.
The site also uses CirrusSearch/Elasticsearch to power the search facility, and that’s where I hit issues — the additional content caused the Elasticsearch indexes to grow to around 30GB. In an ideal world, you’d want the indexes held in the server’s memory, but that would mean upgrading from a £20 per month server to one that costs £160 per month (an annual increase in the hosting costs of £1,680… gulp!). Instead, the indexes had to sit on disk, which makes searching a much slower (and more CPU-intensive) process.
I’m going to roll back the addition of most of the newspaper content and lick my wounds whilst I mull over the options!

31 December 2020 at 12:46 pm #391
I’ve hopefully found a decent compromise. After upgrading the server hardware (which adds £120 to the annual hosting cost), I’ve added all of the Saturday editions of the Chronicle. From 1850 to 1872, the newspaper was published weekly as a Saturday edition. From 1873 onwards, it was published as a 4-page weekday issue together with an 8-page Saturday issue (which mostly comprised content from the weekday issues).