Amazon CloudSearch accented words

I have an index with documents with accented words.
For example this document in Portuguese:
title => 'Ponte metálica'
If I search "metálica" it matches, so no problem.
But people usually search without accents, so it's very common to search just for "metalica" (note the plain "a" instead of "á").
That search returns no results. I tested it in the AWS console and via the /search endpoint. I'm using the 2013 API.
I don't think synonyms can solve this, since the accented variations aren't full words.

It looks like you posted the same question in AWS forums and got a reply:
The CloudSearch Portuguese stemmer does not remove accents, so á won't match a, and it does not currently have an option to remove them.
Two work-arounds I can think of:
Remove accents before uploading (possibly to a different field).
Use a copy field, and the "multiple languages" analysis mode. This won't stem words by Portuguese rules, unfortunately, but it does remove accents!
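A minimal sketch of the first work-around, assuming Python; it strips accents into a copy field before upload (the field name title_unaccented is illustrative):

import unicodedata

def strip_accents(text):
    # NFD decomposes accented characters into base letter + combining mark
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

# Add an unaccented copy of the title before uploading the document
doc = {"id": "1", "fields": {"title": "Ponte metálica"}}
doc["fields"]["title_unaccented"] = strip_accents(doc["fields"]["title"])
print(doc["fields"]["title_unaccented"])  # -> Ponte metalica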
I like the idea of removing the accent before upload, but I also have two other ideas:
Use fuzzy matching, so that you can tolerate one or maybe two "wrong" characters. This might have a performance drawback to consider.
Provide an auto-complete/suggestor solution similar to a "did you mean?" type of experience.
I found this Stack Overflow thread from around 2014 that discusses these two possibilities, still using CloudSearch: Implementing "Did you mean?" using Amazon CloudSearch
About the fuzzy matching operator:
You can also perform fuzzy searches with the simple query parser. To perform a fuzzy search, append the ~ operator and a value that indicates how much terms can differ from the user query string and still be considered a match. For example, specifying planit~1 searches for the term planit and allows matches to differ by up to one character, which means the results will include hits for planet.
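For instance, a tiny helper (the name fuzzify is mine) that appends the operator to each term before the query is sent to the simple parser:

def fuzzify(query, max_edits=1):
    # Append ~N to every term so each may differ by up to N characters
    return " ".join(f"{term}~{max_edits}" for term in query.split())

print(fuzzify("ponte metalica"))  # -> ponte~1 metalica~1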
And about auto-complete, with fuzzy matching option:
When you request suggestions, Amazon CloudSearch finds all of the documents whose values in the suggester field start with the specified query string—the beginning of the field must match the query string to be considered a match. The return data includes the field value and document ID for each match. You can configure suggesters to find matches for the exact query string, or to perform approximate string matching (fuzzy matching) to correct for typographical errors and misspellings.
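A sketch of calling the 2013 API's suggest endpoint from Python; the domain endpoint and suggester name here are hypothetical:

import json
import urllib.parse
import urllib.request

endpoint = "https://search-mydomain-xxxxxxxx.us-east-1.cloudsearch.amazonaws.com"
params = urllib.parse.urlencode({"q": "metalic", "suggester": "title_suggester", "size": 5})
with urllib.request.urlopen(f"{endpoint}/2013-01-01/suggest?{params}") as resp:
    print(json.load(resp))  # each match includes the field value and document ID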

Related

Regex match hyphenated word with hyphen-less query

I have an Azure Storage Table set up that possesses lots of values containing hyphens, apostrophes, and other bits of punctuation that the Azure Indexers don't like. Hyphenated-Word gets broken into two tokens — Hyphenated and Word — upon indexing. Accordingly, this means that searching for HyphenatedWord will not yield any results, regardless of any wildcard or fuzzy matching characters. That said, Azure Cognitive Search possesses support for Regex Lucene queries...
As such, I'm trying to find out if there's a Regex pattern I can use to match words with or without hyphens to a given query. As an example, the query homework should match the results homework and home-work.
I know that if I were trying to do the opposite — match unhyphenated words even when a hyphen is provided in the query — I would use something like /home(-)?work/. However, I'm not sure what the inverse looks like — if such a thing exists.
Is there a raw Regex pattern that will perform the kind of matching I'm proposing? Or am I SOL?
Edit: I should point out that the example I provided is unrealistic because I won't always know where a hyphen should be. Optimally, the pattern that performs this matching would be agnostic to the precise placement of a hyphen.
Edit 2: A solution I've discovered that works but isn't exactly optimal (and, though I have no way to prove this, probably isn't performant) is to just break down the query, remove all of the special characters that cause token breaks, and then dynamically build a regex query that has an optional match in between every character in the query. Using the homework example, the pattern would look something like [-'\.! ]?h[-'\.! ]?o[-'\.! ]?m[-'\.! ]?e[-'\.! ]?w[-'\.! ]?o[-'\.! ]?r[-'\.! ]?k[-'\.! ]?...which is perhaps the ugliest thing I've ever seen. Nevertheless, it gets the job done.
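That pattern can at least be generated mechanically. A minimal Python sketch of the approach the asker describes, with the special-character set abbreviated for readability:

import re

SPECIALS = "-'.! "

def tolerant_pattern(query):
    # Drop the token-breaking characters, then allow an optional occurrence
    # of any of them between every remaining character
    cleaned = [ch for ch in query if ch not in SPECIALS]
    sep = "[" + re.escape(SPECIALS) + "]?"
    return sep + sep.join(re.escape(ch) for ch in cleaned) + sep

print(tolerant_pattern("home-work"))  # matches "homework", "home-work", "home work", ...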
My solution to scenarios like this is always to introduce content- and query-processing.
Content processing is easier when you use the push model via the SDK, but you could achieve the same by creating a shadow/copy of your table where the content is manipulated for indexing purposes. Your original table stays intact, and you maintain a duplicate table where the text is processed.
Query processing is something you should use regardless. In its simplest form you want to clean the input from the end users before you use it in a query. Additional steps can be to handle special characters like a hyphen. Either escape it, strip it, or whatever depending on what your requirements are.
EXAMPLE
I have to support searches for ordering codes that may contain hyphens or other special characters. The maintainers of our ordering codes may define ordering codes in an inconsistent format. Customers visiting our sites are just as inconsistent.
The requirement is that ABC-123-DE_F-4.56G should match any of
ABC-123-DE_F-4.56G
ABC123-DE_F-4.56G
ABC_123_DE_F_4_56G
ABC.123.DE.F.4.56G
ABC 123 DEF 56 G
ABC123DEF456G
I solve this using my suggested approach above. I use content processing to generate a version of the ordering code without any special characters (using a simple regex). Then, I use query processing to transform the end user's input into an OR-query, like:
<verbatim-user-input-cleaned> OR OrderingCodeVariation:<verbatim-user-input-without-special-chars>
So, if the user entered ABC.123.DE.F.4.56G I would effectively search for
ABC.123.DE.F.4.56G OR OrderingCodeVariation:ABC123DEF456G
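A minimal sketch of those two steps, assuming Python; the field name OrderingCodeVariation comes from the example above, everything else is illustrative:

import re

def normalize_code(text):
    # Content processing: strip everything except letters and digits
    return re.sub(r"[^A-Za-z0-9]", "", text)

def build_query(user_input):
    # Query processing: search the verbatim input OR its normalized variation
    cleaned = user_input.strip()
    return f"{cleaned} OR OrderingCodeVariation:{normalize_code(cleaned)}"

print(build_query("ABC.123.DE.F.4.56G"))
# -> ABC.123.DE.F.4.56G OR OrderingCodeVariation:ABC123DEF456G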
It sounds like you want to define your own tokenization. Would using a custom tokenizer help? https://learn.microsoft.com/azure/search/index-add-custom-analyzers
To add onto Jennifer's answer, you could consider using a custom analyzer consisting of either of these token filters:
pattern_replace: A token filter which applies a pattern to each token in the stream, replacing match occurrences with the specified replacement string.
pattern_capture: Uses Java regexes to emit multiple tokens, one for each capture group in one or more patterns.
You could use the pattern_replace token filter to replace hyphens with the desired character, or with an empty string to remove them entirely.
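For instance, a custom analyzer wired to a pattern_replace filter that deletes hyphens might look roughly like this in the index definition (the JSON payload shown as a Python dict; the analyzer and filter names are mine, and the exact schema is in the docs linked above):

index_fragment = {
    "analyzers": [{
        "@odata.type": "#Microsoft.Azure.Search.CustomAnalyzer",
        "name": "hyphen_free",
        "tokenizer": "keyword_v2",  # keep the whole value as one token
        "tokenFilters": ["strip_hyphens", "lowercase"],
    }],
    "tokenFilters": [{
        "@odata.type": "#Microsoft.Azure.Search.PatternReplaceTokenFilter",
        "name": "strip_hyphens",
        "pattern": "-",
        "replacement": "",  # drop the hyphen entirely
    }],
}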

Combine multiple regexes into one / build small regex to match a set of fixed strings

The situation:
We created a tool, Google Analytics Referrer Spam Killer, which automatically adds filters to Google Analytics to filter out spam.
These filters exclude traffic which comes from certain spammy domains. Right now we have 400+ spammy domains in our list.
To remove the spam, we add a regex (like domain1.com|domain2.com|...) as a filter to Analytics and tell Analytics to ignore all traffic which matches this filter.
The problem:
Google Analytics has a 255-character limit for each regex (one regex per filter). Because of that, we have to create a lot of filters to cover all 400+ domains (currently 30+ filters). The problem is that there is another limit: the number of write operations per day. Each new filter costs 3 more write operations.
The question:
I want to find the shortest regex that exactly matches another regex.
For example you need to match the following strings:
`abc`, `abbc` and `aac`
You could match them with the following regexes: /^(abc|abbc|aac)$/, /^a(b|bb|a)c$/, /^a(bb?|a)c$/, etc.
Basically I'm looking for an expression which exactly matches /^(abc|abbc|aac)$/, but is shorter in length.
I found multiregexp, but as far as I can tell, it doesn't create a new regex out of another expression which I can use in Analytics.
Is there a tool which can optimize regexes for length?
I found this C tool which compiles on Linux: http://bisqwit.iki.fi/source/regexopt.html
Super easy:
$ ./regex-opt '123.123.123.123'
(123.){3}123
$ ./regex-opt 'abc|abbc|aac'
(aa|ab{1,2})c
$ ./regex-opt 'aback|abacus|abacuses|abaft|abaka|abakas|abalone|abalones|abamp'
aba(ck|ft|ka|lone|mp|(cu|ka|(cus|lon)e)s)
I wasn't able to run the tool suggested by #sln. It looks like it makes an even shorter regex.
I'm not aware of an existing tool for combining / compressing / optimising regexes. There may be one. Maybe by building a finite-state machine out of a regex and then generating a regex back out of that?
You don't need to solve the problem for the general case of arbitrary regexes. I think it's a better bet to look at creating compact regexes to match any of a given set of fixed strings.
There may already be some existing code for making an optimised regex to match a given set of fixed strings, again, IDK.
To do it yourself, the simplest thing would be to sort your strings and look for common prefixes / suffixes, e.g. (afoo|bbaz|c)bar.com. Looking for common strings in the middle is less easy. You might want to look at algorithms used for lossless data compression for finding redundancy.
You'd ideally want to spot cases where you could use a foo[a-d] range instead of a foo(a|b|c|d), and various other things.
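If you roll your own for fixed strings, prefix sharing falls out naturally from a trie. A minimal Python sketch (prefix factoring and character classes only; sharing common suffixes would need something like DFA minimization):

import re

def build_trie(words):
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[""] = {}  # end-of-word marker
    return root

def trie_to_regex(node):
    if "" in node and len(node) == 1:
        return ""
    alts, optional = [], False
    for ch, child in sorted(node.items()):
        if ch == "":
            optional = True  # a word ends here, so the rest is optional
        else:
            alts.append(re.escape(ch) + trie_to_regex(child))
    if len(alts) == 1 and not optional:
        return alts[0]
    if all(len(a) == 1 for a in alts):
        body = "[" + "".join(alts) + "]"  # collapse single chars into a class
    else:
        body = "(?:" + "|".join(alts) + ")"
    return body + ("?" if optional else "")

words = ["abc", "abbc", "aac"]
print("^(?:" + trie_to_regex(build_trie(words)) + ")$")
# -> ^(?:a(?:ac|b(?:bc|c)))$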

Non printable regex

I am in need of a regex that will match non-printable characters. The reason is that I have a hailstorm spammer who is abusing my network and getting past my PCRE-based heuristic filter by obfuscating his subjects with non-printable characters. Therefore, any text-based rules I create are bypassed because there is no match.
For example:
The regular text-based subject: Reduce tech costs with cloud computing
The obfuscated subjects:
Reduce tech cоsts with clоud cоmputing
Reduсe teсh cоsts with сlоud соmputing
Reduсe teсh сosts with сloud сomputing
Rеducе tеch cоsts with сlоud соmputing
What I am looking for is a regex that I can modify to match all of the phrases that have been used and build a list of regexes.
Maybe, if I can get a regex that will match the subjects, I can meta them together with other matching header information that will thwart these messages.
Any help would be much appreciated.
You can use the following to match:
(Reduce|Reduсe|Rеducе)\s*(tech|teсh|tеch)\s*
(cоsts|сosts)\s*(with)\s*(clоud|сlоud|сloud)\s*
(cоmputing|соmputing|сomputing)
You can add each unique keyword variant that has been used to the corresponding group (reduce, tech, etc.), and the regex then handles the different combinations of phrases that can be made from those keywords.
For example, the above regex covers 3x3x2x1x3x3 (162) possible keyword combinations.
EDIT: You can use [^\w\s."'\/\\=!##$%^&*(){}\[\]?><,+|`~-]+ to check whether the subject contains characters outside the printable set, and take action on it. (If you use this, you might need to add other regexes to handle spam phrases that can be created with printable characters.)
Demo with PCRE
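As an alternative to enumerating phrases, you could flag any subject whose letters fall outside the Latin script, which is what the homoglyph trick above relies on. A sketch of the idea in Python (your filter is PCRE-based, so this is only an illustration of the check):

import unicodedata

def has_disguised_chars(subject):
    # Flag any letter whose Unicode name is not LATIN..., e.g. Cyrillic 'о'
    return any(
        ch.isalpha() and not unicodedata.name(ch, "").startswith("LATIN")
        for ch in subject
    )

print(has_disguised_chars("Reduce tech costs with cloud computing"))  # False
print(has_disguised_chars("Reduce tech cоsts with clоud cоmputing"))  # True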

Is rearranging words of text possible with RegEx?

I've got a list of devices in a database, such as Model X123 ABC. Another portion of the system accepts user input and needs to, as well as possible, match their entries to the existing devices. But the users have the ability to enter anything they want. They might enter the above model as Model 100 ABC X123 or Model X123.
Understand, this is a general example, and the permutations of available models and matching user entries are enormous; I'm just trying to match as many as possible so that manual corrections can be kept to a minimum.
The system is built in FileMaker, but has access to pretty much any plugin I wish, which means I have access to Groovy, PHP, JavaScript, etc. I'm currently working with Groovy using the ScriptMaster plugin for other simple regex pattern matching elsewhere, and I'm wondering if the most straightforward way to do this is to use regex.
My thinking with regex is that I'm looking for patterns, but I'm unsure if I can say, "Assign this grouping to this pattern regardless of where it is in the order of pattern groups." Above, I want to find if the string contains three patterns: (?i)\bmodel\b, (?i)\b[a-z]\d{3}\b, and (?i)\b[a-z]{3}\b, but I don't care about what order they come in. If all three are found, I want to place them in that specific order: first the word "model", capitalized, then the all-caps alphanumeric code and finally the pure alphabetical code in all-caps.
Can (should?) regex handle this?
I suggest tokenizing the input into words, matching each of them against the supported tokens, and assembling them into canonical categorized slots.
Even better would be to offer search suggestions when the user enters the information, and require the user to pick a suggestion.
But you could do it by (programmatically) constructing a monster regex pattern with all the permutations:
\b(?:(model)\s+([a-z]\d{3})\s+([a-z]{3})
|(model)\s+([a-z]{3})\s+([a-z]\d{3})
|([a-z]\d{3})\s+(model)\s+([a-z]{3})
|([a-z]\d{3})\s+([a-z]{3})\s+(model)
|([a-z]{3})\s+(model)\s+([a-z]\d{3})
|([a-z]{3})\s+([a-z]\d{3})\s+(model)
)\b
It'd have to use named capturing groups but I left that out in the hopes that the above might be close to readable.
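A sketch of the tokenize-and-assemble approach from the top of this answer, in Python (Groovy would look much the same); the slot names are mine:

import re

# One pattern per canonical slot, taken from the question
SLOTS = [
    ("keyword", re.compile(r"^model$", re.I)),
    ("code", re.compile(r"^[a-z]\d{3}$", re.I)),
    ("suffix", re.compile(r"^[a-z]{3}$", re.I)),
]

def canonicalize(entry):
    # Match each word against the slots, regardless of word order
    found = {}
    for token in entry.split():
        for name, pattern in SLOTS:
            if name not in found and pattern.match(token):
                found[name] = token
                break
    if len(found) < len(SLOTS):
        return None  # not all three parts present
    return f"Model {found['code'].upper()} {found['suffix'].upper()}"

print(canonicalize("Model 100 ABC X123"))  # -> Model X123 ABC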
I'm not sure I fully understand your underlying objective -- is this to be able to match up like products (e.g., products with the same model number)? If so, a word permutations function like this one could be used in a calculated field to create a multikey: http://www.briandunning.com/cf/1535
If you need partial matching in FileMaker, you could also use a redux search function like this one: http://www.fmfunctions.com/fid/380
Feel free to PM me if you have questions that aren't a good format to post here.

Is it possible to configure token separator characters with AWS CloudSearch?

My CloudSearch index currently returns no results for one-two three, but it does return one result (correctly) for one two three (and that document is also returned, correctly, when searching for two three etc.)
My understanding is that this is because searchable phrases are broken down into their tokens (words), with whitespace and punctuation acting as delimiters. So, one and two become separate tokens, but one-two is not a valid token, so no results are found. From the CloudSearch docs:
During tokenization, the stream of text in a field is split into separate tokens on detectable boundaries using the word break rules defined in the Unicode Text Segmentation algorithm.
That Unicode document is here.
I would like to be able to search for one-two three and find the relevant result, and similarly for a few other punctuation characters, like /. Is it possible to configure this with CloudSearch?
I just realized a simple solution that works fine, although it technically does not answer my question. I simply needed to pre-process my query strings before sending them to CloudSearch, replacing - or / or whatever character I want with a single space.
That way, one-two three actually performs a search for one two three, returning the correct result.
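A minimal sketch of that pre-processing step, assuming Python; extend the character class with whatever separators you need:

import re

def preprocess(query):
    # Replace separator characters that CloudSearch treats as token breaks
    return re.sub(r"[-/]", " ", query)

print(preprocess("one-two three"))  # -> one two three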