MongoDB querying whitespace with regex

I've got a large collection of text data stored in MongoDB that users can query via keyword or phrase, and have an issue where some data has Unicode character U+00A0 (no-break space) instead of a regular space.
Fixing up the data not being an option (those nbsps are there intentionally), I still want the user to be able to search and find that data. So I updated our Mongo query-building code to search for any whitespace [\s] in places where the user entered a space, resulting in a query like so:
{ "tt" : { "$elemMatch" : { "x" : { "$regex" : "high[\s]performance" , "$options" : "i"} }}}
(there's more to the query, that's just the relevant bit).
Unfortunately, this doesn't return the expected results. So I played around with a bunch of other ways to accomplish this, and eventually discovered that I get the correct results when I search for "not non-whitespace" [^\S], like so:
{ "tt" : { "$elemMatch" : { "x" : { "$regex" : "high[^\S]performance" , "$options" : "i"} }}}
Which leads to my question -- why does "any whitespace" ([\s]) fail to find this text while "not non-whitespace" ([^\S]) finds it successfully? Does Mongo have a different set of rules for what counts as whitespace and non-whitespace?
Data is all in UTF-8 throughout, MongoDB version is 2.2.2

I suspect that the problem here is with the backslash (\), not with spaces. Can you try writing \\ instead of \ to test my conjecture?
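
To make the conjecture above concrete: if whatever builds the query string drops the backslash as an unrecognised string escape (an assumption, since the question does not show that code), the server would receive high[s]performance and high[^S]performance. The sketch below uses Python's re module, not MongoDB's regex engine, purely to show how those two stripped patterns would behave against text containing U+00A0. All names in it are illustrative.

```python
import re

NBSP_TEXT = "high\u00a0performance"  # text containing U+00A0 instead of a space

# Hypothetical patterns as they would look if the backslash were lost (assumption).
STRIPPED_WHITESPACE = "high[s]performance"    # [\s] degraded to [s]
STRIPPED_NOT_NON_WS = "high[^S]performance"   # [^\S] degraded to [^S]
# Pattern with the backslash preserved.
ESCAPED_WHITESPACE = r"high\sperformance"


def found(pattern: str, text: str) -> bool:
    """Case-insensitive search, mirroring the "i" option in the query."""
    return re.search(pattern, text, re.IGNORECASE) is not None


print(found(STRIPPED_WHITESPACE, NBSP_TEXT))           # [s] only matches the letter s
print(found(STRIPPED_NOT_NON_WS, NBSP_TEXT))           # [^S] matches anything except s/S
print(found(STRIPPED_NOT_NON_WS, "high-performance"))  # ...so a hyphen is accepted too
print(found(ESCAPED_WHITESPACE, NBSP_TEXT))            # Python's \s covers U+00A0 for str patterns
```

If the conjecture holds, [^\S] only appears to work because [^S] accepts almost any character, which would also produce false positives such as "high-performance".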

Related

Wrong regexp query for elasticsearch

I have some problems with the regexp query for elasticsearch. In my index there's a text field with comma-separated numeric values (IDs), e.g.
2,140,3,2495
And I have the following query term:
"regexp" : {
"myIds" : {
"value" : "^2495,|,2495,|,2495$|^2495$",
"boost" : 1
}
}
But my result list is empty.
Let me say that I know that regexp queries are kind of slow, but the index already exists and is filled with millions of documents, so unfortunately it's not an option to restructure it. So I need a regex solution.
In Elasticsearch regexp queries, patterns are anchored by default, so ^ and $ are treated as literal characters.
What you mean to use is "2495,.*|.*,2495,.*|.*,2495|2495". That covers 2495, at the start of the string, ,2495, in the middle, ,2495 at the end, or a whole string equal to 2495.
Or, you may use a simpler
"(.*,)?2495(,.*)?"
That means
(.*,)? - an optional text (not including line breaks) ending with ,
2495 - your value
(,.*)? - an optional text (not including line breaks) starting with ,
Here is an online demo showing how this expression works (not a proof though).
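
A minimal sketch of how the simpler pattern behaves. It uses Python's re.fullmatch to imitate the implicit anchoring described above; this is an approximation of the Elasticsearch behaviour, not the Lucene engine itself, and the sample values other than 2,140,3,2495 are made up for illustration.

```python
import re

PATTERN = "(.*,)?2495(,.*)?"


def id_matches(field_value: str) -> bool:
    # fullmatch requires the whole string to match, imitating implicit anchoring.
    return re.fullmatch(PATTERN, field_value) is not None


# The first four contain 2495 as a whole ID; the last two only contain it as a substring.
for value in ["2,140,3,2495", "2495", "2495,7", "1,2495,7", "12495,3", "24950"]:
    print(value, id_matches(value))
```

The last two values are rejected because 2495 must either start the string or follow a comma, and must either end the string or be followed by a comma.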
OK, I got it to work, but ran into another problem. I built the string as follows:
(.*,)?2495(,.*)?|(.*,)?10(,.*)?|(.*,)?898(,.*)?
It works well for a few IDs, but if I have, let's say, 50 IDs, then ES throws an exception which says that the regexp is too complex to process.
Is there a way to simplify the regexp or restructure the query itself?
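
One possible simplification, sketched here as an untested idea rather than a confirmed fix: write the shared (.*,)? prefix and (,.*)? suffix once and put only the IDs in the alternation. Whether this stays under Elasticsearch's complexity limit for 50 IDs is not something verified here. The helper name build_id_pattern is hypothetical.

```python
import re


def build_id_pattern(ids):
    """Return one pattern with the shared prefix and suffix written only once."""
    alternation = "|".join(re.escape(str(i)) for i in ids)
    return "(.*,)?(" + alternation + ")(,.*)?"


pattern = build_id_pattern([2495, 10, 898])
print(pattern)  # (.*,)?(2495|10|898)(,.*)?

# Checked with Python's re.fullmatch as a stand-in for anchored matching.
print(re.fullmatch(pattern, "2,140,3,2495") is not None)
print(re.fullmatch(pattern, "110,3") is not None)
```

The factored form accepts the same strings as the long alternation in the question, since each branch there differs only in the ID.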

Regex to replace all occurrences of a character within a given context

I am trying to write a search and replace regex (in ruby) to replace all instances of a character in a string in a given context.
The regex needs to replace all instances of "." in a json key, and I'm battling with references. I have a feeling that I need to use a lookaround in some way, but the variations I've tried I can't seem to get working.
Some example strings:
, "key1.name" : " value.something "
, "key2.complex.name" : "value.else"
, "this.is.the.most.complex.name" : "value"
I initially had this regex to replace a single occurrence (replacing it with "FULLSTOP"):
s/, "([^.]+)\.([^"]+)" :/, "\1FULLSTOP\2" :/g
Desired output:
, "key1FULLSTOPname" : " value.something "
, "key2FULLSTOPcomplexFULLSTOPname" : "value.else"
, "thisFULLSTOPisFULLSTOPtheFULLSTOPmostFULLSTOPcomplexFULLSTOPname" : "value"
I'm guessing I need to use a (?=\.) somehow in the search, but not sure how to use this correctly with references. I am using the opening , and ending : as a way of defining the context for a json key.
thanks in advance.
(?=.*?\:)\.
Use this. See demo.
http://regex101.com/r/cH8vN2/5
Edit:
(?=.*?\"\s*\:)\.
Use this to be very sure.
See demo.
http://regex101.com/r/cH8vN2/6
You can use the following as a sample :
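# Sample line from the question.
str = ', "this.is.the.most.complex.name" : "value"'
# Replace every literal dot; this does not distinguish the key from the value.
str = str.gsub(/\./, 'FULLSTOP')
puts str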
I have not taken care of the 'value' part.
You should be able to do that easily.
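
For comparison, a minimal sketch of a different technique from the lookahead above: match the whole key context first, then replace the dots inside that match with a callback. It is written in Python rather than Ruby, and it assumes keys always appear as , "key" : exactly as in the sample lines. The names KEY_CONTEXT and replace_dots_in_keys are illustrative.

```python
import re

# Assumes the key is introduced by `, "` and closed by `" :`, as in the samples.
KEY_CONTEXT = re.compile(r', "([^"]+)" :')


def replace_dots_in_keys(line: str, replacement: str = "FULLSTOP") -> str:
    # Only the matched key context is rewritten; the value part is left untouched.
    return KEY_CONTEXT.sub(lambda m: m.group(0).replace(".", replacement), line)


samples = [
    ', "key1.name" : " value.something "',
    ', "key2.complex.name" : "value.else"',
    ', "this.is.the.most.complex.name" : "value"',
]
for sample in samples:
    print(replace_dots_in_keys(sample))
```

Because the replacement runs inside a single match, any number of dots in the key is handled without back-references.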

Replace using regex containing some of previous match in Notepad++

I'm trying to use a regex to match a block of text and, using Replace All, replace it with nothing so as to delete it.
But since the block sometimes (but not always) appears several times in a row, Replace All only removes every second block.
I made this Regex
http.*\n.*\K\n\{\n "code"(.*\n)+?\}\nhttp.*\n
It matches all isolated blocks, but only every second consecutive block.
I think I'm meant to use "assertions" as described here. But I couldn't get them to work.
Also, how do I replace with nothing (as in delete)? Just leave the "Replace with" field empty? Or do I need some special character? Or, as I am coming to suspect, should I not be using Notepad++ for this sort of thing? If that is the case, what should/could I be using?
Sample Data:
"teamAbbr" : "Foo",
"teamName" : "Bar",
"teamNickname" : "FBar"
}
} ]
}
http://www.link_I_want_to_keep_belonging_to_above_data.com
{
"code" : "XXXXXXXXXXXXXXXXXXXXXXX",
"techMessage" : "XXXXXXXXXXXXXXXXXXXXXX",
"userMessage" : "XXXXXXXXXXXXXXXXXXX",
"host" : "XXXXXXXXXXXX",
"date" : "XXXXXXXXXXX",
"version" : "XXX"
}
http://www.url_that_belong_to_block_Iwant_to_be_rid_off.com
{
"code" : "XXXXXXXXXXXXXXXXXXXXXXX",
"techMessage" : "XXXXXXXXXXXXXXXXXXXXXX",
"userMessage" : "XXXXXXXXXXXXXXXXXXX",
"host" : "XXXXXXXXXXXX",
"date" : "XXXXXXXXXXX",
"version" : "XXX"
}
http://www.url_that_belong_to_block_Iwant_to_be_rid_off.com
The problem is that you also match the first URL, but that URL is unavailable immediately after a previous match (it has already been consumed), and also at the start of the file.
A lookbehind assertion would take care of the problem, but it needs to be fixed-length.
Do you need to search for the first URL? I.e., does
\{\n "code"(.*\n)+?\}\nhttp.*\n
work for you?
To delete a whole match you replace with an empty string. No special characters needed.
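
A minimal sketch of the effect described above, in Python rather than Notepad++. The sample text is a shortened, made-up version of the data in the question, and CONSUMING is only an approximation of the original regex (Python's re module has no \K, so a capture group that is put back stands in for it).

```python
import re

TEXT = (
    "}\n"
    "http://keep.example\n"
    "{\n"
    '  "code" : "X",\n'
    '  "version" : "1"\n'
    "}\n"
    "http://drop-1.example\n"
    "{\n"
    '  "code" : "Y",\n'
    '  "version" : "2"\n'
    "}\n"
    "http://drop-2.example\n"
)

# Requires a preceding URL line, which the previous match has already consumed.
CONSUMING = re.compile(r'(http.*\n)\{\n *"code"(?:.*\n)+?\}\nhttp.*\n')
# Simplified form from the answer: start at the opening brace of the block.
SIMPLE = re.compile(r'\{\n *"code"(?:.*\n)+?\}\nhttp.*\n')

after_consuming = CONSUMING.sub(r"\1", TEXT)  # second block survives
after_simple = SIMPLE.sub("", TEXT)           # both blocks are removed

print(after_consuming)
print(after_simple)
```

Replacing with the empty string is what deletes the match; the first version leaves the second block behind because its required leading URL line was part of the first match.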

Issue with evaluating regexp for '\c+', '\i\c*' and '[\i-[:]][\c-[:]]*'

I am working on a Tcl GUI whose data tree structure comes from an XML Schema, and I have to validate the entry fields against the restrictions in that schema. The schema I am working with has the simple types NMTOKEN, Name and NCName, with pattern restrictions '\c+', '\i\c*' and '[\i-[:]][\c-[:]]*' respectively.
The code I use to check is
method validatePatternValue { value } {
set patternCheck 1
set pattern "^($patternValue)\$"
set patternCheck [regexp $pattern $value]
if {$patternCheck == 0} {
tk_messageBox -message "Only Characters within range $patternValue for $patternValueType is\
accepted "
return 0
}
return 1
}
and whenever the $pattern is one of these '\c+' , '\i\c*' and '[\i-[:]][\c-[:]]*' my text field does not accept any input and keeps throwing an error exception dialogue.
Just to add some more info, I came across this website, with some good info regarding my question about processing combinations of '\i' and '\c'. But is there no other way apart from the one suggested in the following link : XML Schema Character Classes
The \c escape sequence does not do in Tcl regexp what it does in XML-Schema regexp.
In XML Schema
\c matches any character that may occur after the first character in
an XML name, i.e. [-._:A-Za-z0-9]
In Tcl
\cX (where X is any character) the character whose low-order 5 bits
are the same as those of X, and whose other bits are all zero
It's also clearly stated in the link you sent
Note that the \c shorthand syntax conflicts with the control character
syntax used in many other regex flavors.
You should try using [-.:\w] instead of \c.
The same is true for \i: it does not mean the same thing in Tcl as it does in XML Schema.
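
A minimal sketch of that substitution, in Python rather than Tcl. It uses the ASCII-only approximation [-._:A-Za-z0-9] quoted above for \c and approximates \i as [_:A-Za-z]; real XML names also allow many non-ASCII characters, so this assumes only ASCII input matters. The names PATTERNS and is_valid are illustrative.

```python
import re

# XML Schema shorthands expanded by hand into explicit ASCII classes (approximation).
PATTERNS = {
    "NMTOKEN": r"[-._:A-Za-z0-9]+",            # \c+
    "Name": r"[_:A-Za-z][-._:A-Za-z0-9]*",     # \i\c*
    "NCName": r"[_A-Za-z][-._A-Za-z0-9]*",     # [\i-[:]][\c-[:]]*  (no colon)
}


def is_valid(type_name: str, value: str) -> bool:
    # fullmatch plays the role of the ^(...)$ wrapper in the Tcl code.
    return re.fullmatch(PATTERNS[type_name], value) is not None


print(is_valid("NMTOKEN", "1abc"))    # digits may start an NMTOKEN
print(is_valid("Name", "1abc"))       # but not a Name
print(is_valid("Name", "ns:tag"))     # colon allowed in Name
print(is_valid("NCName", "ns:tag"))   # colon not allowed in NCName
```

The same expanded classes could be substituted into the pattern string before it reaches regexp in the Tcl code.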

Vim regex to find missing JSON element

I have a big JSON file, formatted over multiple lines. I want to find objects that don't have a given property. The objects are guaranteed not to contain any further nested objects. Say the given property was "bad"; then I would want to locate the value of "foo" in the second element in the following (but not in the first element).
{
result: [
{
"foo" : {
"good" : 1,
"bad" : 0
},
"bar" : 123
},
{
"foo" : {
"good" : 1
},
"bar" : 123
}
]
}
I know about multi-line regexes in Vim but I can't get anything that does what I want. Any pointers?
Try the following:
/\v"foo"\_s*:\_s*\{%(%(\_[\t ,]"bad"\_s*:)#!\_.){-}\}
When you need to exclude something, look at negative look-aheads or look-behinds. The latter are slower, and, unlike Vim, Perl/PCRE regular expressions only support fixed-width look-behinds (or an alternation of fixed-width ones).
JSON is described by a context-free grammar and as such is not regular. Unless you can give a much stricter set of rules to go on, no regex will be able to do what you want.
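
Following on from that last point, a minimal sketch of the parser-based alternative, in Python rather than Vim. It assumes the file is strict JSON, so the result key is quoted here, unlike in the sample above. The function name indexes_missing is illustrative.

```python
import json

RAW = """
{
  "result": [
    {"foo": {"good": 1, "bad": 0}, "bar": 123},
    {"foo": {"good": 1}, "bar": 123}
  ]
}
"""


def indexes_missing(doc, outer_key, inner_key, required):
    """Indexes of elements whose inner object lacks the required property."""
    return [
        i for i, element in enumerate(doc[outer_key])
        if required not in element.get(inner_key, {})
    ]


doc = json.loads(RAW)
print(indexes_missing(doc, "result", "foo", "bad"))  # [1]
```

This reports positions in the parsed data rather than line numbers, so it suits a report or filter step more than interactive navigation in an editor.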