MediaWiki: adding synonyms to CirrusSearch
Posted by Dave Pattern, 23 August 2020 at 9:33 am
I struggled to find much info online about how to add synonym searching to the MediaWiki CirrusSearch extension, so I thought it might be worthwhile posting some notes in case any other wiki owners want to have a go. I should note that I’m a novice when it comes to Elasticsearch, so there may be better ways of achieving the same goal.
To see an example of the synonyms in action, view this search for metternich, which was a phonetic spelling of the surname Mettrick used in a small number of newspaper reports of the Holmfirth Flood of 1852. The synonym list causes the search results to match on Mettrick, Metrick and Metterick as well as Metternich, regardless of which spelling the searcher uses. There may be times when you don’t want to include synonyms, in which case you can put quotes around the search term(s).
This topic suggested it was possible, but didn’t provide specific details. However, it got me pointed in the right direction. The following is based on MediaWiki and CirrusSearch installed on an Ubuntu server.
1. Synonym List
The synonym word file (e.g. synonym.txt) needs to be placed in the /etc/elasticsearch/ directory with relevant permissions (e.g. group elasticsearch). For Huddersfield Exposed, I’m maintaining the list as a protected wiki page and then running a daily cron job to scrape that page and update the synonym file. Changes to the list require a full CirrusSearch reindex, which I run manually (but perhaps could be automated?):
php updateSearchIndexConfig.php --startOver
php forceSearchIndex.php
My list is simply comma-separated equivalent words (e.g. hinchliffe, hinchcliff, hinchcliffe, hinchliff), but it is possible to do more sophisticated stuff with the synonyms, such as one-way mappings written with =>.
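The daily cron job mentioned above could look something like the following. This is only a rough sketch, not the script actually used on Huddersfield Exposed: the function names clean_synonyms and install_if_changed, the wiki URL, the page title and the /tmp path are all made up for illustration, and it assumes the wiki page contains nothing but the synonym lines (fetched as plain text via MediaWiki's action=raw). It also only triggers the reindex when the list has really changed, which is one way the manual step might be automated.

```shell
#!/bin/sh
# Sketch of a daily synonym refresh. All URLs and paths are placeholders.

# Normalise a raw synonym list read from stdin: trim each line, drop blank
# lines and '#' comment lines, and lowercase everything.
clean_synonyms() {
    sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//' \
        | grep -v -e '^$' -e '^#' \
        | tr '[:upper:]' '[:lower:]'
}

# Replace the live file ($2) with the candidate ($1) only when the candidate
# is non-empty and differs from it. Returns 0 if the live file was replaced
# (so a reindex is needed) and non-zero otherwise.
install_if_changed() {
    candidate=$1
    live=$2
    [ -s "$candidate" ] || return 1            # refuse an empty list
    if [ -f "$live" ] && cmp -s "$candidate" "$live"; then
        return 1                               # nothing changed
    fi
    cp "$candidate" "$live"
}

# Example wiring (hypothetical URL and paths, not run here):
# curl -fsS 'https://wiki.example.org/index.php?title=Project:Synonyms&action=raw' \
#     | clean_synonyms > /tmp/synonym.txt
# if install_if_changed /tmp/synonym.txt /etc/elasticsearch/synonym.txt; then
#     php updateSearchIndexConfig.php --startOver && php forceSearchIndex.php
# fi
```

Refusing an empty candidate file is a deliberate safety net: if the fetch fails or the page is blanked, the existing list stays in place rather than being wiped.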
2. Changes to the CirrusSearch Code
Making CirrusSearch use the list requires some minor changes to the CirrusSearch extension file ./includes/Maintenance/AnalysisConfigBuilder.php (the additions in each snippet are the 'synonym' entries; if you copy the code, check that the quotes and apostrophes are plain ASCII ones, as WordPress tends to reformat them):
a) locate the defaults function at around line 360 and add a new analyzer…
private function defaults( $language ) {
    $defaults = [
        'analyzer' => [
            // added: custom analyzer that applies the synonym filter
            'synonym' => [
                'type' => 'custom',
                'tokenizer' => 'whitespace',
                'filter' => [ 'lowercase', 'synonym' ],
            ],
            'text' => [
                …
b) scroll down to around line 490 and add a new filter…
            ],
        ],
        'filter' => [
            // added: synonym filter that reads the word list from disk
            'synonym' => [
                'type' => 'synonym',
                'ignore_case' => true,
                'synonyms_path' => '/etc/elasticsearch/synonym.txt',
            ],
            'suggest_shingle' => [
                …
c) locate the customize function (around line 610) and then scroll down to around line 710 to update the list of filters…
private function customize( $config, $language ) {
    …
    $filters = [];
    $filters[] = 'aggressive_splitting';
    $filters[] = 'possessive_english';
    $filters[] = 'lowercase';
    $filters[] = 'synonym'; // added
    $filters[] = 'stop';
It’s important to add the new filter after the lowercase one, otherwise you’ll see issues with case-sensitive searches using the synonyms. Once you’ve made the changes, run a full CirrusSearch reindex, e.g.
php updateSearchIndexConfig.php --startOver
php forceSearchIndex.php
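After the reindex, one way to check that the synonyms are being picked up is to ask Elasticsearch's _analyze API how it tokenises a word. The following is a rough sketch rather than something from my own setup notes: the helper name analyze_body, the host localhost:9200 and the index name mywiki_content are assumptions, so substitute whatever index CirrusSearch created for your wiki.

```shell
#!/bin/sh
# Build the JSON body for Elasticsearch's _analyze API.
# $1 = analyzer name, $2 = text to analyse (assumed to contain no quotes
# or backslashes, since no JSON escaping is done here).
analyze_body() {
    printf '{"analyzer":"%s","text":"%s"}' "$1" "$2"
}

# Example call (not run here). The host and the index name "mywiki_content"
# are placeholders; "synonym" is the analyzer added in step a).
# curl -s -H 'Content-Type: application/json' \
#     'http://localhost:9200/mywiki_content/_analyze?pretty' \
#     -d "$(analyze_body synonym metternich)"
```

If the synonym filter is active, the response should list the other spellings from the same line of the synonym file as tokens alongside the word you sent. Passing text instead of synonym as the analyzer name should show what the modified text analyzer from step c) produces.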