How to make a locale-character-insensitive search with regex in MongoDB?

I want to make a search, and let's say my keyboard is English. But the database contains some data with Turkish characters:
"İstanbul"
"İzmir"
etc. Because I don't have "İ" on my keyboard, I will never be able to find these two entries in my queries.
What is the best way to do it?
UPDATE:
In NodeJS, I have the following function to convert Turkish characters into their English look-alikes:
// S() is presumably from the string.js library: var S = require('string');
function convertTurkishToEnglish(trStr) {
    return S(trStr)
        .replaceAll('ı', 'i')
        .replaceAll('ö', 'o')
        .replaceAll('ü', 'u')
        .s; // .s unwraps the native string
}
But I cannot apply it to the data in the DB.

You can use the Unicode escape sequence \u0130 to identify İ.

Three options come to mind:
Enhance the data to include an additional field that holds the "to English" version of the text (using your convertTurkishToEnglish function, for example); you might be able to use a MapReduce job to build a new collection that has what you need. (See the sketch after this list.)
Investigate using a search engine like ElasticSearch or Solr for a more exhaustive search option
Increase the complexity of your regular expressions to include all of the combinations of character replacement whenever text is searched (at runtime you'd build these search strings):
db.users.find({"username": { $regex: "\u0130|ian", $options : "i" } })
In the above code snippet, the pattern matches either İ (\u0130) or ian, so a case-insensitive search finds both "Ian" and "İan". You'd need to build similar alternations for any other Turkish characters.
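A minimal sketch of the first option, assuming a users collection with a username field; the username_ascii field and the extended character map are my own additions, not part of the original question:
// Store an ASCII-folded copy of the field and search against it instead.
function convertTurkishToEnglish(trStr) {
    var map = { 'ı': 'i', 'İ': 'I', 'ö': 'o', 'Ö': 'O', 'ü': 'u', 'Ü': 'U',
                'ş': 's', 'Ş': 'S', 'ç': 'c', 'Ç': 'C', 'ğ': 'g', 'Ğ': 'G' };
    return trStr.replace(/[ıİöÖüÜşŞçÇğĞ]/g, function (ch) { return map[ch]; });
}
// One-off backfill of existing documents (mongo shell):
db.users.find().forEach(function (doc) {
    db.users.update(
        { _id: doc._id },
        { $set: { username_ascii: convertTurkishToEnglish(doc.username) } }
    );
});
// At query time, fold the user's input the same way:
db.users.find({ username_ascii: { $regex: convertTurkishToEnglish("Istanbul"), $options: "i" } })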

Related

Compare full-width and half-width Japanese characters in MongoDB by using collation and regex

According to the MongoDB documentation and the ICU documentation, it should be possible to ignore the full-width/half-width distinction in Japanese text by using collation.
I tried the following:
{ locale: "ja", caseLevel: true, strength: 1 }
with different strength values, but none of them works.
db.getCollection('mycollection')
.find({"desc":/バンド/})
.collation({ locale: "ja", caseLevel:true, strength:1})
This query gets no result from the following document:
{
"desc": "*EGRパイプバンド外れ"
}
Update:
I found the reason: in MongoDB, a regex query cannot apply collation. So if I use an exact match to perform the query, the result is perfect:
db.getCollection('mycollection')
.find({"desc":"*EGRパイプバンド外れ???"})
.collation({ locale: "ja", caseLevel:true, strength:1})
This query returns the *EGRパイプバンド外れ document.
But not if I use a regex. Any suggestions?
There is no way to make collation work with regex find logic, since the regex overrides any collation definition and only uses the logic defined within itself, namely: find any string that contains the exact width variant written into the pattern.
The simplest way to achieve this is to add extra logic before you send the search text to your MongoDB client, duplicating the text into both half-width and full-width forms (existing conversion tools can do this).
Then apply both the half-width and full-width search terms in your find condition with $or:
db.mycollection.find({$or: [{"desc": /バンド/}, {"desc": /ﾊﾞﾝﾄﾞ/}]})
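A sketch of that duplication; toHalfWidthKana here is a hypothetical helper from whatever conversion tool you pick (no built-in goes in that direction, while NFKC normalization covers half-to-full):
// Build both width variants of the search text and OR them together.
function widthInsensitiveFilter(field, text) {
    var escape = function (s) { return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&"); };
    var full = text.normalize("NFKC");   // half-width kana -> full-width, e.g. "ﾊﾞﾝﾄﾞ" -> "バンド"
    var half = toHalfWidthKana(full);    // hypothetical helper from your conversion tool
    var filter = { $or: [] };
    filter.$or.push({ [field]: new RegExp(escape(full)) });
    filter.$or.push({ [field]: new RegExp(escape(half)) });
    return filter;
}
db.getCollection('mycollection').find(widthInsensitiveFilter("desc", "バンド"))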
Same issue:
Use of collation in mongodb $regex

CloudSearch wildcard query not working with 2013 API after migration from 2011 API

I've recently upgraded a CloudSearch instance from the 2011 to the 2013 API. Both instances have a field called sid, which is a text field containing a two-letter code followed by some digits e.g. LC12345. With the 2011 API, if I run a search like this:
q=12345*&return-fields=sid,name,desc
...I get back 1 result, which is great. But the sid of the result is LC12345 and that's the way it was indexed. The number 12345 does not appear anywhere else in any of the resulting document fields. I don't understand why it works. I can only assume that this type of query is looking for any terms in any fields that even contain the number 12345.
The reason I'm asking is because this functionality is now broken when I query using the 2013 API. I need to use the structured query parser, but even a comparable wildcard query using the simple parser is not working e.g.
q.parser=simple&q=12345*&return=sid,name,desc
...returns nothing, although the document is definitely there i.e. if I query for LC12345* it finds the document.
If I could figure out how to get the simple query working like it was before, that would at least get me started on how to do the same with the structured syntax.
Why it's not working
CloudSearch v1 (2011) had a different way of tokenizing mixed alpha+numeric strings. Here's the logic as described in the archived docs (emphasis mine).
If a string contains both alphabetic and numeric characters and is at least three and no more than nine characters long, the alphabetic and numeric portions of the string are treated as separate tokens. For example, the string DOC298 is tokenized into two terms: doc 298
CloudSearch v2 (2013) text processing follows Unicode Text Segmentation, which does not specify that behavior:
Do not break within sequences of digits, or digits adjacent to letters (“3a”, or “A3”).
Solution
You should just be able to search *12345 to get back results with any prefix. There may be some edge cases like getting back results you don't want (things with more preceding digits like AB99912345); I don't know enough about your data to say whether those are real concerns.
Another option would be to index the numeric part separately from the alphabetic part, but that's additional work that may be unnecessary.
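For the second option, the split itself is simple. A sketch (the sid_alpha and sid_num field names are invented for the example):
// Split an id like "LC12345" into its alphabetic and numeric parts so
// each can be indexed in its own field.
function splitSid(sid) {
    var m = /^([A-Za-z]+)(\d+)$/.exec(sid);
    return m ? { sid_alpha: m[1], sid_num: m[2] } : { sid_alpha: sid, sid_num: "" };
}
console.log(splitSid("LC12345")); // { sid_alpha: 'LC', sid_num: '12345' }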
I'm guessing you are using CloudSearch in English, so maybe this isn't your specific problem, but also watch out for stop words in your search queries:
https://docs.aws.amazon.com/cloudsearch/latest/developerguide/configuring-analysis-schemes.html#stopwords
For example, the word "jo" is a stop word in Danish and other languages, and each supported language has a dictionary of very common stop words. If you don't specify a language for your text field, it defaults to English. You can see them here: https://docs.aws.amazon.com/cloudsearch/latest/developerguide/text-processing.html#text-processing-settings

False word elimination using regex replacement

I need to perform content/keyword-based search in a list of files. For that I need to extract the keywords and store them in a MySQL database. The keywords are extracted in the following manner:
Read the file content
Remove special characters and additional whitespace, if any, using
Regex.Replace(input, "[^a-zA-Z0-9_]+", " ")
Remove am/is/are/be/being/been, have/has/having/had, do/does/doing/did, adjectives, phrases, adverbs, etc.
Removing endings like :
-IC-ATION fortification
-IC-ITY electricity
-IC-MENT fantastically
-AT-IV contemplative
-AT-OR conspirator
-IV-ITY relativity
-IV-MENT instinctively
-ABLE-ITY incapability
-ABLE-MENT charitably
-OUS-MENT famously
Can I do the whole operation using a single regular expression? Is there any simpler method for this? Here I have a reference algorithm for this operation.
I don't think it would be possible to implement a stemming algorithm using regular expressions exclusively. Maybe you should take a look at existing implementations to get ideas. Here is a link to the Porter stemming algorithm in VB.NET.
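To see why a single expression falls short, here is a deliberately naive suffix stripper covering just the endings listed in the question (sketched in JavaScript for brevity; the same regex works with .NET's Regex.Replace). It has none of the measure and condition checks that make the Porter algorithm reliable:
// Naive single-regex suffix stripper for the endings in the question.
// It over-stems ("fortification" -> "fortif", not "fortify"), which is
// exactly the kind of case the Porter rules exist to handle.
function naiveStem(word) {
    return word.replace(/(ication|icity|ically|ative|ator|ivity|ively|ability|ably|ously)$/i, "");
}
console.log(naiveStem("fortification")); // "fortif"
console.log(naiveStem("relativity"));    // "relat"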

Search and Replace in Solr?

I'm looking for something like search-and-replace functionality in Solr.
I have dumped a document into Solr and am doing some text analysis over it. At times I may need to group a couple of words together and want Solr to treat them as one single token.
For example, "South Africa" should be treated as one single token for further processing. Also note that these groupings can be dynamic: I'm going to let the end user decide which words to group, so NO semantics are required.
My current plan is to add a special character between the two words so Solr (StandardTokenizerFactory) will treat them as one single token for further processing.
So I'm looking for something like:
replace("South Africa", "South_Africa")
Does anyone have a solution?
Use a Synonym filter and define these replacements in a synonyms.txt file. Once you have all of your definitions, rebuild the index.
You would probably have an entry like this to handle both the case where a field has a LowerCase filter before Synonym and where Synonym comes before LowerCase.
South Africa,south africa => southafrica
More info here http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.SynonymFilterFactory
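A sketch of how the filter might be wired into schema.xml; the text_grouped field type name is made up for this example, and expand="false" keeps only the replacement token:
<!-- Hypothetical field type: tokenize, lowercase, then collapse the
     user-defined groupings from synonyms.txt into single tokens. -->
<fieldType name="text_grouped" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="false"/>
  </analyzer>
</fieldType>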
You could perhaps use a PatternReplaceFilter and a clever regexp.

Using preg_replace/preg_match with UTF-8 characters - specifically Māori macrons

I'm writing some autosuggest functionality which suggests page names that relate to the terms entered in the search box on our website.
For example typing in "rubbish" would suggest "Rubbish & Recycling", "Rubbish Collection Centres" etc.
I am running into a problem that some of our page names include macrons - specifically the macron used to correctly spell "Māori" (the indigenous people of New Zealand).
Users are going to type "maori" into the search box and I want to be able to return pages such as "Māori History".
The autosuggestion is sourced from a cached array built from all the pages and keywords. To try to locate Māori I've been trying various regexes like:
preg_match('/m(.{1})ori/i', $page_title)
This also returns page titles containing "Moorings" but not "Māori". How do preg_match/preg_replace see characters like "ā", and how should I construct the regex to pick them up?
Cheers
Tama
Use the /u modifier for UTF-8 mode in regexes.
You're better off, on the whole, doing an iconv('utf-8', 'ascii//TRANSLIT', $string) on both the name and the search term and comparing those.
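A small sketch of both suggestions (the sample title and output are mine; //TRANSLIT behavior can vary with the iconv implementation and locale):
<?php
// Transliterate both sides to ASCII and compare: "Māori" becomes "Maori"
// (with a well-behaved iconv/locale), so a search for "maori" matches.
$page_title   = "Māori History";
$search       = "maori";
$title_ascii  = iconv('UTF-8', 'ASCII//TRANSLIT', $page_title);
$search_ascii = iconv('UTF-8', 'ASCII//TRANSLIT', $search);
if (stripos($title_ascii, $search_ascii) !== false) {
    echo "match\n";
}
// Or keep the regex but add /u so . matches one UTF-8 character, not one byte:
var_dump(preg_match('/m(.)ori/iu', $page_title)); // int(1)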
One thing you need to remember is that UTF-8 gives you multi-byte characters for anything outside of ASCII. I don't know if the string $page_title is being treated as a Unicode object or a dumb byte string. If it's the byte-string option, you're going to have to use two dots there to catch it instead, or {1,4}. And even then you're going to have to verify that the up-to-four bytes you grab between the m and the o form a single valid UTF-8 character. This is all moot if PHP does Unicode right; I haven't used it in years, so I can't vouch for it.
The other issue to consider is that ā can be constructed in two ways: as a single character (U+0101), or as TWO Unicode characters ('a' plus a combining diacritic in the U+0300 range). You're likely only ever going to get the former, but be aware that the latter is also possible.
The only language I know of that does this stuff reliably well is Perl 6, which has all kinds of insane modifiers for internationalized text in regexes.