Reply To: Huddersfield Chronicle (1850-1900)
After a busy weekend of writing code, I now have a spellchecker API up and running.
For those who like techie details, it uses both Aspell and Hunspell. It makes use of a custom dictionary of local placenames (sourced from Huddersfield Exposed) and Yorkshire towns & villages, along with local personal names (from Huddersfield Exposed) and common Yorkshire names (from the 1881 Census). The suggested words are weighted by how often they appear in the newspaper OCR text. The API also checks whether each suggested word appears in the lists of names or placenames.
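For anyone curious how the weighting and flagging might fit together, here's a minimal sketch in Python. It is not the actual API code: the word counts, the name lists, the scoring formula and the function name rank_suggestions are all made up for illustration, and difflib's similarity ratio stands in for whatever ranking Aspell and Hunspell apply to their suggestions.

```python
import math
from difflib import SequenceMatcher

# Hypothetical sample data; the real API draws on much larger lists.
OCR_WORD_COUNTS = {"Huddersfield": 52000, "Fartown": 1800, "Wimpenny": 240, "Hudders": 3}
PLACENAMES = {"Huddersfield", "Fartown"}
PERSONAL_NAMES = {"Wimpenny"}


def rank_suggestions(word, candidates):
    """Order candidate corrections by similarity weighted by corpus frequency."""
    ranked = []
    for candidate in candidates:
        # Stand-in similarity measure (assumption, not Aspell/Hunspell scoring).
        similarity = SequenceMatcher(None, word.lower(), candidate.lower()).ratio()
        frequency = OCR_WORD_COUNTS.get(candidate, 0)
        ranked.append({
            "word": candidate,
            # log1p damps very common words so they don't swamp similarity.
            "score": similarity * math.log1p(frequency),
            "is_placename": candidate in PLACENAMES,
            "is_name": candidate in PERSONAL_NAMES,
        })
    return sorted(ranked, key=lambda item: item["score"], reverse=True)


best = rank_suggestions("Hudderafield", ["Hudders", "Huddersfield"])[0]
print(best["word"], best["is_placename"])
```

Taking the log of the frequency is just one possible choice; it keeps a very common word from beating a much closer match.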
The first example is using one of the many mis-conversions of “Huddersfield”. The API gives the correct spelling and flags it as a place name.
Similarly for a misspelling of Fartown:
And finally, a misspelling of the surname Wimpenny, which the API flags as likely being part of a name:
With a half-decent API in place, I’ve started the process of analysing the 43,006 newspaper pages of converted OCR text. Each page takes around 2 to 3 minutes to process, so we’re looking at 2 to 3 months of continuous processing. I’ve effectively halved that by temporarily upgrading the server to have double the CPU power. If anyone’s feeling generous and would like to make a donation via PayPal (firstname.lastname@example.org), I’ll do a further temporary upgrade and double the CPU power again.
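As a rough check on the processing-time estimate, here's the arithmetic in Python. It assumes the pages are processed one after another around the clock, and that each doubling of CPU power halves the time, which is an idealisation; the function name days_to_process is just for illustration.

```python
PAGES = 43_006
MINUTES_PER_DAY = 24 * 60


def days_to_process(pages, minutes_per_page, speedup=1.0):
    """Days of continuous processing, assuming time scales inversely with CPU power."""
    return pages * minutes_per_page / speedup / MINUTES_PER_DAY


for speedup in (1, 2, 4):
    low = days_to_process(PAGES, 2, speedup)   # at 2 minutes per page
    high = days_to_process(PAGES, 3, speedup)  # at 3 minutes per page
    print(f"{speedup}x CPU: {low:.0f} to {high:.0f} days")
```

On those assumptions the original server would need about 60 to 90 days, the doubled one about 30 to 45, and a second doubling would bring it down to roughly 15 to 22.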