Huddersfield Chronicle (1850-1900)
1 August 2020 at 4:55 pm #299
Happy Yorkshire Day!
This seems as good an opportunity as any to announce something that I’ve been tinkering with for a while — adding the text from 50 years of the Huddersfield Chronicle newspaper to the site!
Anyone who’s done local history research will probably already be familiar with the newspaper archives available at sites such as Gale’s British Library Newspapers and the British Newspaper Archive (also available via FindMyPast). All of these require a subscription to access the content, as sadly the British Library haven’t adopted the model used elsewhere (e.g. Trove in Australia) of making historic newspaper content freely available.
Despite mostly being historic documents that are long out of copyright, the British Library has asserted a copyright claim over the digitised newspapers. However, they cannot also assert copyright over transcriptions of the digitised versions:
What I’m planning to do is to convert the newspaper scans (using OCR software) into a text version that can be incorporated into Huddersfield Exposed. The process of OCR-ing old newspapers can be extremely problematic but, after much experimenting, I believe the quality of the conversion is good enough to justify doing it. For example, here is an extract of a newspaper page (© British Library)…
…and this is the result of an OCR conversion followed by positioning the extracted text so that it mirrors the original newspaper page layout…
As you can probably see, there are plenty of errors (mostly these are highlighted in shades of red) but the majority of the text is readable.
So, the plan is to figure out the best way of adding the text to Huddersfield Exposed to make it easily searchable.
There’ll also be some logistical issues to resolve. For example, the site currently has around 5 million words of text. If the full run of the Chronicle from 1850 to 1900 is added to the site, that would be somewhere in the region of 500 million words, so roughly a 10,000% increase in the site’s word count. Whether that’s achievable with the current web server hardware remains to be seen!
4 August 2020 at 9:43 am #301
In terms of when the content will start to appear on the web site, the OCR conversion should be complete within the next week or so.
Once that’s done, I’m going to spend some time on analysing the converted text to find ways of correcting mis-converted words. As an example, the conversion has found 550,689 occurrences of the word “huddersfield” so far. However, for each correctly spelled word, there’s a long tail of errors — here’s a random selection and the number of occurrences:
- hudderefield – 2,385
- hudderafield – 2,318
- huddebsfield – 448
- hudderefield – 345
- huddeksfield – 317
- hudddersfield – 266
- huddbrsfield – 245
- huidersfield – 213
- hudderafleld – 190
- hudderaficld – 162
- huddarsfield – 149
- hucdersfield – 148
- hubdersfield – 146
- huldersfield – 133
- hudderafeld – 130
- huddbesfield – 109
- huddbrsfibld – 107
- hudaersfield – 86
- hudbersfield – 74
- hudderafiold – 52
- hucderstield – 46
- hulderstield – 41
- huddeebsfield – 24
- huddarstield – 18
- hubrdersfield – 13
- hugdersfield – 13
- hatdersfield – 10
- hulcdersfield – 10
- hucidersfield – 6
- hujdersfield – 6
- hudyersfield – 2
- hulersticid – 1
- huldorstield – 1
- hugecieficid – 1
- huedyversfield – 1
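A tally like the one above could be produced by counting every token in the OCR text and keeping those within a small edit distance of the target word. The following is only a minimal sketch of that idea, not the code used for the site: the function names, the sample text and the distance threshold of 3 are all assumptions made for illustration.

```python
import re
from collections import Counter


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def near_variants(text: str, target: str, max_distance: int = 3) -> Counter:
    """Count words that are close to, but not equal to, the target."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    return Counter({
        word: n for word, n in counts.items()
        if word != target and edit_distance(word, target) <= max_distance
    })


# Tiny made-up sample standing in for the OCR output.
sample = "Huddersfield Hudderafield HUDDEBSFIELD hudderafield Halifax"
variants = near_variants(sample, "huddersfield")
for word, n in variants.most_common():
    print(word, n)
```

On a corpus of hundreds of millions of words it would probably be quicker to build the word counts once and compare only the distinct words against the target, which is what the sketch does.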
Once the OCR conversion is complete, I’ll use spell checking tools such as Hunspell and Aspell to identify the correctly spelled words and the number of times they occur — this process might also highlight words, such as local names and placenames, that are missing from the spell checker’s dictionary. That data can then be used to help provide weightings for determining that “Hudyersfield” is more likely to be “Huddersfield” than, say, “Hundersfield” (a former township in Rochdale).
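As a rough illustration of that weighting idea, the sketch below scores each candidate by its string similarity to the OCR word multiplied by the logarithm of how often the candidate appears in the corpus. It uses Python's `difflib` as a stand-in rather than Hunspell or Aspell, and the scoring formula, function names and the count for "hundersfield" are assumptions invented for the example.

```python
import math
from difflib import SequenceMatcher

# Corpus frequencies. The "huddersfield" count is the figure quoted earlier
# in this thread; the "hundersfield" count is made up for illustration.
corpus_counts = {"huddersfield": 550_689, "hundersfield": 12}


def score(ocr_word: str, candidate: str, counts: dict) -> float:
    """Similarity to the OCR word, weighted by log corpus frequency."""
    similarity = SequenceMatcher(None, ocr_word, candidate).ratio()
    return similarity * math.log10(1 + counts.get(candidate, 0))


def best_guess(ocr_word: str, candidates: list, counts: dict) -> str:
    """Return the candidate with the highest weighted score."""
    return max(candidates, key=lambda c: score(ocr_word, c, counts))


guess = best_guess("hudyersfield",
                   ["huddersfield", "hundersfield"],
                   corpus_counts)
print(guess)
```

Taking the logarithm stops a very common word from swamping every other candidate, while still letting frequency settle cases where two candidates look about equally similar to the OCR text.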
When everything’s ready, I’ll start adding the content to the site in batches and then monitor how it affects the web server’s performance. Adding 500m words to the site would make Huddersfield Exposed around 14% of the size of Wikipedia… but sadly I don’t have 14% of Wikipedia’s enormous budget (around $2.5m per year) to spend on extra hardware! 😀
7 August 2020 at 3:53 pm #304
The OCR process has now completed. My guesstimate of 500m words was a little high, and the final word count is 432,998,673!
10 August 2020 at 4:22 pm #308
After a busy weekend of writing code, I now have a spellchecker API up and running:
For those who like techie details, it uses both Aspell and Hunspell. It makes use of a custom dictionary of local placenames (sourced from Huddersfield Exposed) and Yorkshire towns & villages, along with local personal names (from Huddersfield Exposed) and common Yorkshire names (from the 1881 Census). The suggested words are weighted by how frequently they appear in the newspaper OCR text. The API also checks to see if the suggested words appear in names or placenames.
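The general shape of such a lookup can be sketched in a few lines. This is not the actual API: it swaps Aspell and Hunspell for Python's `difflib.get_close_matches`, and the function name, the tiny word lists, the response fields and every count except the one for "huddersfield" are assumptions made up for the example.

```python
from difflib import get_close_matches

# Hypothetical stand-ins for the custom dictionaries described above.
PLACENAMES = {"huddersfield", "fartown"}
SURNAMES = {"wimpenny"}

# Word frequencies from the OCR text. Only the "huddersfield" figure comes
# from this thread; the others are invented for illustration.
WORD_COUNTS = {
    "huddersfield": 550_689,
    "fartown": 9_000,
    "wimpenny": 1_500,
    "farthing": 700,
}


def check_word(word: str, counts: dict = WORD_COUNTS) -> dict:
    """Suggest a correction and flag whether it is a placename or a name."""
    word = word.lower()
    # Shortlist similar-looking known words, then prefer the most frequent.
    candidates = get_close_matches(word, list(counts), n=5, cutoff=0.7)
    best = max(candidates, key=lambda c: counts[c]) if candidates else None
    return {
        "input": word,
        "suggestion": best,
        "is_placename": best in PLACENAMES,
        "is_name": best in SURNAMES,
    }


for ocr_word in ["Hudderafield", "fartowu", "wimpeuny"]:
    print(check_word(ocr_word))
```

The misspellings of Fartown and Wimpenny fed in at the end are invented test inputs rather than real examples taken from the OCR text.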
The first example uses one of the many mis-conversions of “Huddersfield”. The API gives the correct spelling and flags it as a place name.
Similarly for a misspelling of Fartown:
And finally, a misspelling of the surname Wimpenny, which the API flags as likely being part of a name:
With a half-decent API in place, I’ve started the process of analysing the 43,006 newspaper pages of converted OCR text. Each page takes around 2 to 3 minutes to process, so we’re looking at around 2 to 3 months on the original server. I’ve effectively halved that by temporarily upgrading the server to have double the CPU power. If anyone’s feeling generous and would like to make a donation via PayPal (email@example.com), I’ll do a further temporary upgrade and double the CPU power again.
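The time estimate can be checked with some quick arithmetic. The sketch below assumes pages are processed one after another around the clock, and that doubling the CPU power halves the run time, which is a simplification; the variable and function names are made up for the example.

```python
PAGES = 43_006            # pages of converted OCR text
MINUTES_PER_DAY = 60 * 24


def days_needed(minutes_per_page: float, speedup: float = 1.0) -> float:
    """Days of continuous processing at the given per-page time."""
    return PAGES * minutes_per_page / speedup / MINUTES_PER_DAY


low, high = days_needed(2), days_needed(3)
print(f"original server: {low:.0f} to {high:.0f} days")

# Assumes processing time scales inversely with CPU power.
print(f"double CPU:      {days_needed(2, 2):.0f} to {days_needed(3, 2):.0f} days")
print(f"quadruple CPU:   {days_needed(2, 4):.0f} to {days_needed(3, 4):.0f} days")
```

On those assumptions the original server would need roughly 60 to 90 days, and each doubling of CPU power halves that again.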