Translate specific return query into mgo - regex

I have a query which returns all names from a collection's documents which contain a specific text. In the following example, return all names which contain the sequence "oh" case-insensitively; do not return other fields in the document:
find({name:/oh/i}, {name:1, _id:0})
I have tried to translate this query into mgo:
Find([]bson.M{bson.M{"name": "/oh/i"}, bson.M{"name": "1", "_id": "0"}})
but there are always zero results when using mgo. What is the correct syntax for such a query using mgo?
This question is different from the alleged duplicates because none of those questions deal with how to restrict MongoDB to return only a specific field instead of entire documents.

To execute queries that use regexp patterns for filtering, use the bson.RegEx type.
And to exclude fields from the result documents, use the Query.Select() method.
Like in this example:
c.Find(bson.M{"name": bson.RegEx{Pattern: "oh", Options: "i"}}).
Select(bson.M{"name": 1, "_id": 0})
Translation of the regexp:
name:/oh/i
This means to match documents where the name field has a value that contains the "oh" substring, case-insensitively. This can be represented using a bson.RegEx, where the RegEx.Pattern field gets the pattern used in the above expression ("oh"), and the RegEx.Options field may contain options on how to apply / match the pattern. The documentation lists the possible values. If the Options field contains the 'i' character, the match is case-insensitive.
If you have a user-entered term such as "[a-c]", you have to quote regexp meta characters, so the final pattern you apply should be "\[a-c\]". To do that easily, use the regexp.QuoteMeta() function, e.g.:
fmt.Println(regexp.QuoteMeta("[a-c]")) // Prints: \[a-c\]
Try it on the Go Playground.
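Putting the pieces together, here is a minimal, self-contained sketch of how the whole query might look with mgo; the connection string and the database / collection names ("test", "people") are placeholders, not from the question:

package main

import (
    "fmt"
    "log"
    "regexp"

    mgo "gopkg.in/mgo.v2"
    "gopkg.in/mgo.v2/bson"
)

func main() {
    // Placeholder connection details; adjust for your deployment.
    sess, err := mgo.Dial("localhost")
    if err != nil {
        log.Fatal(err)
    }
    defer sess.Close()

    c := sess.DB("test").C("people") // hypothetical database / collection names

    // Quote the term in case it contains regexp meta characters.
    term := regexp.QuoteMeta("oh")

    var results []struct {
        Name string `bson:"name"`
    }
    err = c.Find(bson.M{"name": bson.RegEx{Pattern: term, Options: "i"}}).
        Select(bson.M{"name": 1, "_id": 0}).
        All(&results)
    if err != nil {
        log.Fatal(err)
    }
    fmt.Println(results)
}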

Related

Partial matches using mongo's primitive package

I am using Mongo's Primitive package to get a bson value based on what was submitted. This is what I am currently doing
school = "Havard"
value = primitive.Regex{Pattern: school, Options: ""}
This will only match BSON values that are exactly "Havard". How do I make this regex case-insensitive, and how do I make it match partial values, for example "hava"?
In short, if I search for "hava", I should also get "Havard".
The expression primitive.Regex{Pattern: school} matches substrings too, but it's not case insensitive. Use the "i" option to make it case insensitive:
value = primitive.Regex{Pattern: school, Options: "i"}
Also note that if the value of school contains special regexp characters, that might give you unexpected results or errors. So it's best to quote it, e.g. using regexp.QuoteMeta():
value = primitive.Regex{Pattern: regexp.QuoteMeta(school), Options: "i"}
For Go users, the filter looks like this:
filter := bson.D{{"column_name", primitive.Regex{Pattern: school, Options: "i"}}}
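For reference, here is a minimal sketch of running that filter with the official mongo-go-driver; the URI and the database / collection names are placeholders, and column_name is taken from the filter above:

package main

import (
    "context"
    "fmt"
    "log"
    "regexp"

    "go.mongodb.org/mongo-driver/bson"
    "go.mongodb.org/mongo-driver/bson/primitive"
    "go.mongodb.org/mongo-driver/mongo"
    "go.mongodb.org/mongo-driver/mongo/options"
)

func main() {
    ctx := context.Background()
    client, err := mongo.Connect(ctx, options.Client().ApplyURI("mongodb://localhost:27017"))
    if err != nil {
        log.Fatal(err)
    }
    defer client.Disconnect(ctx)

    coll := client.Database("test").Collection("schools") // hypothetical names

    school := "hava"
    // Case-insensitive substring match; QuoteMeta guards against special regexp characters.
    filter := bson.D{{Key: "column_name", Value: primitive.Regex{
        Pattern: regexp.QuoteMeta(school),
        Options: "i",
    }}}

    cur, err := coll.Find(ctx, filter)
    if err != nil {
        log.Fatal(err)
    }
    defer cur.Close(ctx)

    var docs []bson.M
    if err := cur.All(ctx, &docs); err != nil {
        log.Fatal(err)
    }
    fmt.Println(docs)
}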

MongoDB: Match multiple values in string field

I have a collection of entities that contain a string field. I'm looking for a way to query the collection with a combined number of values, and get all entities that contain all of these values, with these specifications:
contain ALL provided query values, not just some of them
case-insensitive
regardless of order
'word' query values can be part of something bigger (for example separated by _ or any other character)
So as an example, if I provide these words as the query values:
i am spiderman
(I can separate them by whitespace, give an array, or whatever works..)
I expect these results:
- "i am_spiderMan" // should match
- "AM i spiderman?!" // should match
- "who am I? supermanspiderman" // should match
- "I am superman" // should not match
- "i am spider_man" // should not match
I hope this covers all the cases I tried to describe.
I tried regex, and also did some research with similar questions but could not get it to work.
You could use regular expressions; this works well here. When you pass the sentence, you need to split the words into an array as shown below. Use $all to require all of the words to match, and the i flag to make each regular expression case-insensitive:
db.collection.find({ key: { $all: [ /spiderman/i, /i/i, /am/i ] } })
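If you are building this filter from application code rather than the shell, a small sketch in Go (using the official driver; the field name key comes from the query above, everything else is illustrative) could look like this:

package main

import (
    "fmt"
    "regexp"
    "strings"

    "go.mongodb.org/mongo-driver/bson"
    "go.mongodb.org/mongo-driver/bson/primitive"
)

// buildFilter turns "i am spiderman" into {key: {$all: [/i/i, /am/i, /spiderman/i]}}.
func buildFilter(sentence string) bson.M {
    patterns := bson.A{}
    for _, w := range strings.Fields(sentence) {
        patterns = append(patterns, primitive.Regex{
            Pattern: regexp.QuoteMeta(w), // escape regexp meta characters
            Options: "i",                 // case-insensitive
        })
    }
    return bson.M{"key": bson.M{"$all": patterns}}
}

func main() {
    fmt.Println(buildFilter("i am spiderman"))
}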

How to pass multiple regexes into a single re.compile and create a list of matched patterns

I have written multiple regex patterns separately and tried to collect the matched patterns into a list, like this:
pattern=re.compile('(?:OR011-|OGEA|LLCM|A|1-)\d{2,15}')
For a single pattern I am able to collect the matches from a column into a list like this, but not for all the patterns as a whole:
pattern_list=list(filter(pattern1.findall, column))
The input column looks like this:
column
OR011-103401461251
Hi the information is 1-234455
How are you?LLCM23466723
Output I am getting:
['OR011-103401461251','Hi the information is 1-234455','How are you?LLCM23466723']
Output required:
['OR011-103401461251','1-234455','LLCM23466723']
How can I compile all patterns in a single re.compile() and make a single pattern_list for all the matched patterns?
You can use an alternation to combine your expressions into one pattern:
(?:OR011-|OGEA|LLCM|A|1-)\d{2,15}
Explanation
(?: Non capturing group
OR011-|OGEA|LLCM|A|1- Match 1 of the options
) Close non capturing group
\d{2,15} Match 2-15 digits
About your approach
The filter function returns the elements for which the given function returns a truthy value. You pass the findall method to filter, so for every item where findall finds a match the original element is kept, which results in:
['OR011-103401461251','Hi the information is 1-234455','How are you?LLCM23466723']
What you could do instead of using filter is to use map and pass findall:
pattern = re.compile(r'(?:OR011-|OGEA|LLCM|A|1-)\d{2,15}')
pattern_list = map(pattern.findall, df.column)
print(list(pattern_list))
That will result in:
[['OR011-103401461251'], ['1-234455'], ['LLCM23466723']]
See a Python example
Or you could pass a lambda to map and first check if the search has a result:
pattern = re.compile(r'(?:OR011-|OGEA|LLCM|A|1-)\d{2,15}')
pattern_list = map(lambda x: pattern.search(x).group() if pattern.search(x) else None, df.column)
print(list(pattern_list))
That will result in:
['OR011-103401461251', '1-234455', 'LLCM23466723']
See a Python example

Does MongoDB $regex without the `i` option still make use of the index if I am searching on the indexed field?

I have a model with a normal index using Mongoose.
const mod = new mongoose.Schema({
  number: { type: String, required: true, index: { unique: true } },
});
I am using a regex in a query to get the mod corresponding to a specific number. Will my regex query utilize the index that is on this model?
query.number = {
  $regex: `.*Q10.*`
}
modelName.find(query)
I am concerned that this is looking through the entire collection without using the index. What would be the best way to know whether I am using the index? Or, if you happen to know a way that will utilize the index, could you show me? Here I am looking for everything close to Q10, not trying to get an exact match. Would using /^Q10.*/ be better and use the index?
Referencing MongoDB's regex documentation on index use and comments made on a previous Stack Overflow question.
The best way to confirm index usage for a given query is using MongoDB's query explain() feature. See Explain Results in the manual for your version of MongoDB for more information on the output fields and interpretation.
With regular expressions a main concern is efficient use of indexes. An unanchored substring match like /Q10/ will require examining all index keys (assuming a candidate index exists, as in your example). This is an improvement over scanning the full collection data (as would be the case without an index), but not as ideal as being able to check a subset of relevant index keys as is possible with a regex prefix search.
If you are routinely searching for substring matches and there is a common pattern to your strings, you could design a more scalable schema. For example, you could save whatever your Q10 value represents into a separate field (such as part_number) where you could use a prefix match or an exact match (non-regex).
To illustrate, I set up some test data using MongoDB 3.4.2 and the mongo shell:
// Needles: strings to search for
db.mod.insert([{number:'Q10'}, {number: 'foo-Q10'}, {number:'Q10-123'}])
// Haystack: some string values to illustrate key comparisons
for (i=0; i<1000; i++) { db.mod.insert({number: "I" + i}) }
Regex search without an index:
db.mod.find({ number: { $regex: /Q10/ }}).explain('executionStats')
The winningPlan is a COLLSCAN (collection scan), which requires the server to retrieve every document in the collection to perform the comparison. Note that the original regex includes an unnecessary .* prefix and suffix; this is implicit with a substring match, so it can be written more concisely as /Q10/.
Highlights from the executionStats section of the explain output:
"nReturned": 2,
"totalKeysExamined": 0,
"totalDocsExamined": 1003,
The explain output confirms there are no index keys examined and 1003 documents examined (all documents in this collection).
Add an index for the following two examples:
db.mod.createIndex({number:1}, {unique: true})
Regex substring search with an index:
db.mod.find({ number: { $regex: /Q10/}}).explain('executionStats')
The winningPlan is now an IXSCAN (index scan), but it has to examine all 1003 indexed string values to find the substring matches:
"nReturned": 3,
"totalKeysExamined": 1003,
"totalDocsExamined": 3,
Regex prefix search with an index:
db.mod.find({ number: { $regex: /^Q10/}}).explain('executionStats')
The winningPlan is an IXSCAN (Index scan) which requires 3 key comparisons and 2 document fetches to return the 2 matching documents:
"nReturned": 2,
"totalKeysExamined": 3,
"totalDocsExamined": 2,
A prefix search isn't equivalent to the first two searches, as it will not match the document with value foo-Q10. However, this does illustrate a more efficient regex search.
Note that totalKeysExamined is 3. It might be reasonable to expect this to be 2 since there were only 2 matches; however, this metric includes comparisons with out-of-range keys (e.g. the end of a range of values). For more information see Explain Results: keysExamined.
With the index in place, a case-sensitive regular expression query traverses the entire index (loading it into memory), then loads the matching documents to be returned. It is expensive, but it can still be better than a full collection scan.
For a /John Doe/ regex, MongoDB will scan the entire keyset in the index, then fetch the matched documents.
However, if you use a prefix query:
Further optimization can occur if the regular expression is a "prefix expression", which means that all potential matches start with the same string. This allows MongoDB to construct a "range" from that prefix and only match against those values from the index that fall within that range.
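For completeness, here is a small sketch (in Go with the official driver; the field name number is from the question, the rest is illustrative) contrasting the two query shapes discussed above:

package main

import (
    "fmt"

    "go.mongodb.org/mongo-driver/bson"
    "go.mongodb.org/mongo-driver/bson/primitive"
)

func main() {
    // Unanchored substring match: with an index on "number" this becomes an
    // IXSCAN, but every index key still has to be examined.
    substring := bson.M{"number": primitive.Regex{Pattern: "Q10"}}

    // Prefix ("^"-anchored) match: the server can turn the prefix into an
    // index key range and examine only the keys inside that range.
    prefix := bson.M{"number": primitive.Regex{Pattern: "^Q10"}}

    fmt.Println(substring, prefix)
}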

How to parse GET tokens from URL with regular expression

Given a URL with GET arguments such as
http://www.domain.com?key1=value1+value2+value3&key2=value4+value5
I wish to capture all the values for a given key (into separate references if possible). For example, if the desired key was key1, I would want to capture value1 in \1 (or $1 depending on the language), value2 in \2, and value3 in \3.
My flawed regex is:
/[?&](?:key1)=((?:[^+&]+[+&$])+)/
which yields 0 results.
I am writing this in C++ using ECMA syntax, but I think I could convert a solution or advice from any regex flavor to ECMA. Any help would be appreciated.
This has been answered before and there are compact scripts written for it.
Regular expressions are not optimal for extracting query string values. At the end of this answer, I will give you an expression which can extract the value(s) for a given field into separate references. But note that it takes a "lot" of time to extract the parameters one at a time using regular expressions, whereas they can all be extracted very quickly with no regular expression engine needed. For instance: http://www.htmlgoodies.com/beyond/javascript/article.php/3755006/How-to-Use-a-JavaScript-Query-String-Parser.htm
What language are you trying to use to extract these parameters, C++?
If you are using JavaScript, you can use the small functions mentioned in the article above, i.e.:
function ptq(q)
{
    /* parse the query */
    var x = q.replace(/;/g, '&').split('&'), i, name, t;

    /* q changes from string version of query to object */
    for (q = {}, i = 0; i < x.length; i++)
    {
        t = x[i].split('=', 2);
        name = unescape(t[0]);
        if (!q[name])
            q[name] = [];
        if (t.length > 1)
        {
            q[name][q[name].length] = unescape(t[1]);
        }
        /* next two lines are nonstandard, allowing programmer-friendly Boolean parameters */
        else
            q[name][q[name].length] = true;
    }
    return q;
}

function param() {
    return ptq(location.search.substring(1).replace(/\+/g, ' '));
}
Once you have that code included in your page's scripts, you can parse the current URL's data by doing query = param(); and then using the value of query.key1, etc.
You can parse other query-string formatted data by using the ptq() function directly, i.e., query_object = ptq(query_string).
If you are using another language and regular expressions are the way you want to do it, then this would return all values matching key1, for instance:
/key1=([^&;]*)/g
That will return all the values with a certain field name (which in the query string definition, are written like this, key1=value1&key1=value2&key1=value3, etc.).
The way you ask your question makes it sound like you want to create your own programmer-friendly way of supplying values (i.e., by constructing your own custom URLs rather than receiving data from form submissions through browsers) in which your values are separated by spaces (spaces are encoded as + signs in an HTTP GET query string, and as %20 in generic query strings).
You could make a complicated regular expression to do this in one step, but it is faster to match the entire field (all the values and the + signs as well), and then split the result at the + signs.
For each of the results from the regular expression I indicate, you can extract the plus-sign separated values by simply doing /[^+]*/g
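If you implement the regex route outside JavaScript, a small sketch of the "match the whole field, then split at the + signs" idea (shown here in Go, with the query string from the question) could look like this:

package main

import (
    "fmt"
    "regexp"
    "strings"
)

func main() {
    query := "key1=value1+value2+value3&key2=value4+value5"

    // Match the whole key1 field first (everything up to the next & or ;)...
    re := regexp.MustCompile(`(?:^|[?&])key1=([^&;]*)`)
    for _, m := range re.FindAllStringSubmatch(query, -1) {
        // ...then split the captured field at the + signs.
        values := strings.Split(m[1], "+")
        fmt.Println(values) // [value1 value2 value3]
    }
}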