Reply To: Huddersfield Chronicle (1850-1900)
In terms of when the content will start to appear on the web site, the OCR conversion should be complete within the next week or so.
Once that’s done, I’m going to spend some time analysing the converted text to find ways of correcting mis-converted words. As an example, the conversion has found 550,689 occurrences of the word “huddersfield” so far. However, every correctly spelled word comes with a long tail of errors. Here’s a random selection of the variants found, with the number of occurrences of each:
- hudderefield – 2,385
- hudderafield – 2,318
- huddebsfield – 448
- hudderefield – 345
- huddeksfield – 317
- hudddersfield – 266
- huddbrsfield – 245
- huidersfield – 213
- hudderafleld – 190
- hudderaficld – 162
- huddarsfield – 149
- hucdersfield – 148
- hubdersfield – 146
- huldersfield – 133
- hudderafeld – 130
- huddbesfield – 109
- huddbrsfibld – 107
- hudaersfield – 86
- hudbersfield – 74
- hudderafiold – 52
- hucderstield – 46
- hulderstield – 41
- huddeebsfield – 24
- huddarstield – 18
- hubrdersfield – 13
- hugdersfield – 13
- hatdersfield – 10
- hulcdersfield – 10
- hucidersfield – 6
- hujdersfield – 6
- hudyersfield – 2
- hulersticid – 1
- huldorstield – 1
- hugecieficid – 1
- huedyversfield – 1
Once the OCR conversion is complete, I’ll use spell checking tools such as Hunspell and Aspell to identify the correctly spelled words and count how often each occurs. This process might also highlight words, such as local names and placenames, that are missing from the spell checker’s dictionary. Those counts can then be used as weightings when deciding that “Hudyersfield” is more likely to be “Huddersfield” than, say, “Hundersfield” (a former township in Rochdale).
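To make the weighting idea a bit more concrete, here's a rough Python sketch of how it might work. It isn't the code I'll actually be running: the scoring formula, the `penalty` and `max_distance` values, the function names and the count of 40 for "hundersfield" are all made up for illustration. Only the 550,689 figure comes from the conversion so far.

```python
import math


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete a character from a
                            curr[j - 1] + 1,      # insert a character into a
                            prev[j - 1] + cost))  # substitute (or match)
        prev = curr
    return prev[-1]


def best_correction(token, known_counts, max_distance=2, penalty=4.0):
    """Pick the most plausible known word for an OCR token.

    known_counts maps correctly spelled words to how often they occur.
    The score (an assumption, not a tuned model) rewards frequent words
    and penalises each edit needed to reach them.
    """
    token = token.lower()
    if token in known_counts:
        return token  # already a known word, so leave it alone
    best_word, best_score = None, float("-inf")
    for word, count in known_counts.items():
        distance = edit_distance(token, word)
        if distance > max_distance:
            continue  # too different to be a believable misreading
        score = math.log(count + 1) - penalty * distance
        if score > best_score:
            best_word, best_score = word, score
    return best_word


# 550,689 is the real count so far; 40 is a hypothetical placeholder.
known_counts = {"huddersfield": 550_689, "hundersfield": 40}

print(best_correction("Hudyersfield", known_counts))  # huddersfield
```

In this sketch "hudyersfield" is one edit away from "huddersfield" and two away from "hundersfield", so both the distance and the frequency point the same way. The frequency weighting matters most when two candidates are equally close to the misread word.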
When everything’s ready, I’ll start adding the content to the site in batches and then monitor how it affects the web server’s performance. Adding 500m words to the site would make Huddersfield Exposed around 14% of the size of Wikipedia… but sadly I don’t have 14% of Wikipedia’s enormous budget (around $2.5m per year) to spend on extra hardware! 😀