Elasticsearch – Ignore special characters in query with pattern replace filter and custom analyzer
Using Elasticsearch 5, we had a field, such as a driver's license number, whose values could include special characters and inconsistent upper/lower case, because users entered them with limited validation. For example, consider these hypothetical values:
- CA-123-456-789
- WI.12345.6789
- tx123456789
- az-123-xyz-456
- …
In our application, the end user needs to search by that field. A business requirement was that the user should not have to enter special characters such as hyphens and periods to get the record back. So for the first example above, the user should be able to type any of these values and see that record:
- CA-123-456-789 (an exact match)
- CA123456789 (no special chars)
- ca123456789 (lower-case letters and no special chars)
- Ca.123.456-789 (mixed case letters and mixed special chars)
Our approach was to write a custom analyzer that ignores special characters and then query against that field.
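To make the intended behavior concrete, here is a minimal Python sketch that mimics the normalization outside Elasticsearch: strip every non-alphanumeric character, then lowercase. The helper name `normalize` is hypothetical, and the sketch assumes the analyzer emits the stripped value as a single token. It illustrates the idea and is not the Elasticsearch implementation.

```python
import re

# Same character class as the pattern replace filter defined in Step 1.
NON_ALPHANUMERIC = re.compile(r"[^A-Za-z0-9]")


def normalize(value: str) -> str:
    """Hypothetical helper: drop special characters, then lowercase."""
    return NON_ALPHANUMERIC.sub("", value).lower()


# The four inputs the user should be able to type for the first example.
queries = ["CA-123-456-789", "CA123456789", "ca123456789", "Ca.123.456-789"]
normalized = {normalize(q) for q in queries}
print(normalized)  # {'ca123456789'}
```

Because every variant collapses to the same string, the indexed value and the search text can be compared directly once both have gone through the same steps.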
Step 1: Create pattern replace character filter and custom analyzer
In the index settings, we defined a pattern replace character filter that removes every non-alphanumeric character:
"char_filter": {
  "specialCharactersFilter": {
    "pattern": "[^A-Za-z0-9]",
    "type": "pattern_replace",
    "replacement": ""
  }
}
We then used that filter in a custom analyzer, also defined in the index settings, that we named “alphanumericStringAnalyzer”:
"analyzer": {
  "alphanumericStringAnalyzer": {
    "filter": [ "lowercase" ],
    "char_filter": [ "specialCharactersFilter" ],
    "type": "custom",
    "tokenizer": "standard"
  }
}
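Both fragments belong under `settings.analysis` of the index. As a sketch, the Python snippet below assembles a complete create-index body and an `_analyze` request body that can be used to check what the analyzer produces. The index name `people` is hypothetical, and sending the requests is deliberately left out.

```python
import json

# Body for creating the index, e.g. PUT /people ("people" is a hypothetical index name).
index_settings = {
    "settings": {
        "analysis": {
            "char_filter": {
                "specialCharactersFilter": {
                    "type": "pattern_replace",
                    "pattern": "[^A-Za-z0-9]",
                    "replacement": "",
                }
            },
            "analyzer": {
                "alphanumericStringAnalyzer": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "char_filter": ["specialCharactersFilter"],
                    "filter": ["lowercase"],
                }
            },
        }
    }
}

# Body for the _analyze API, e.g. POST /people/_analyze, to inspect the resulting tokens.
analyze_request = {
    "analyzer": "alphanumericStringAnalyzer",
    "text": "Ca.123.456-789",
}

print(json.dumps(index_settings, indent=2))
print(json.dumps(analyze_request, indent=2))
```

If the analyzer is set up as intended, the `_analyze` response should contain a single token, `ca123456789`.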
Step 2: Define field mapping using the custom analyzer
The next step was to define a field mapping with an “alphanumeric” sub-field that uses the new “alphanumericStringAnalyzer” analyzer, alongside a “raw” keyword sub-field that keeps the value exactly as entered:
"driversLicenseNumber": {
  "type": "text",
  "fields": {
    "alphanumeric": {
      "type": "text",
      "analyzer": "alphanumericStringAnalyzer"
    },
    "raw": {
      "type": "keyword"
    }
  }
}
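In Elasticsearch 5, field mappings sit under a mapping type inside `mappings`. The sketch below wraps the field definition accordingly. The type name `person` is hypothetical and not taken from our original setup.

```python
import json

MAPPING_TYPE = "person"  # hypothetical mapping type name

mappings = {
    "mappings": {
        MAPPING_TYPE: {
            "properties": {
                "driversLicenseNumber": {
                    "type": "text",
                    "fields": {
                        # Normalized variant: special characters removed, lowercased.
                        "alphanumeric": {
                            "type": "text",
                            "analyzer": "alphanumericStringAnalyzer",
                        },
                        # Unanalyzed value, kept exactly as entered.
                        "raw": {"type": "keyword"},
                    },
                }
            }
        }
    }
}

print(json.dumps(mappings, indent=2))
```

Queries then address the sub-fields as `driversLicenseNumber.alphanumeric` and `driversLicenseNumber.raw`.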
Step 3: Run query against new field
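Because a match query analyzes the search text with the analyzer of the target field, the user's input is normalized the same way as the indexed values. In our case, the match query was part of a boolean query, in the “should” clause:
{
  "match" : {
    "driversLicenseNumber.alphanumeric" : {
      "query" : "Ca.123.456-789",
      "operator" : "OR",
      "boost" : 10.0
    }
  }
}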
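For context, here is a sketch of how such a clause might sit inside a complete search body. The helper name `build_license_query` is hypothetical, and the surrounding boolean query is reduced to this single “should” clause, since the rest of our query is not shown here.

```python
import json


def build_license_query(user_input: str) -> dict:
    """Hypothetical helper that wraps the match clause in a bool/should query."""
    return {
        "query": {
            "bool": {
                "should": [
                    {
                        "match": {
                            "driversLicenseNumber.alphanumeric": {
                                "query": user_input,
                                "operator": "OR",
                                "boost": 10.0,
                            }
                        }
                    }
                    # Other "should" clauses of the application would go here.
                ]
            }
        }
    }


body = build_license_query("Ca.123.456-789")
print(json.dumps(body, indent=2))
```

The user's input is passed through unchanged; the normalization happens inside Elasticsearch when the match query analyzes it.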
Published on Java Code Geeks with permission by Steven Wall, partner at our JCG program. See the original article here: Elasticsearch – Ignore special characters in query with pattern replace filter and custom analyzer. Opinions expressed by Java Code Geeks contributors are their own.