MongoDB/PyMongo: how to 'escape' parameters in regex search? - regex

I'm using pymongo and want to do a search for items starting with a certain sequence of characters. I might implement that like this:
items = collection.find({ 'key': '/^text/' })
This should work, but what if text is a variable? I could do something like:
items = collection.find({ 'key': '/^' + variable + '/' })
But now if the text in variable contains any characters with special regex meaning (such as $), the query no longer behaves as expected. Is there a way to do some sort of parameter binding? Do I have to sanitize variable myself? Is that even reliably possible?
Thanks!

You have to assemble the regex programmatically. So either:
import re
regex = re.compile('^' + re.escape(variable))
items = collection.find({ 'key': regex })
OR
items = collection.find({'key': { '$regex': '^' + re.escape(variable) }})
Note that the code uses re.escape to escape the string in case it contains special characters.
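To see what re.escape actually does, here is a quick check (the value of variable is just illustrative):

```python
import re

variable = "price$9.99"  # contains the regex metacharacters $ and .
pattern = '^' + re.escape(variable)
print(pattern)  # ^price\$9\.99
print(bool(re.match(pattern, "price$9.99 (sale)")))  # True - literal prefix match
print(bool(re.match(pattern, "priceX9z99")))         # False - $ and . no longer act as wildcards
```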

Here is the concept: you have to use the $regex operator. Note that in Python the operator name must be quoted, or the code won't parse:
items = collection.find({
    'key': {
        '$regex': yourRegex
    }
})


Entire text is matched but not able to group in named groups

I have following example text:
my_app|key1=value1|user_id=testuser|ip_address=10.10.10.10
I want to extract sub-fields from it in following way:
appName = my_app,
[
{key = key1, value = value1},
{key = user_id, value = testuser},
{key = ip_address, value = 10.10.10.10}
]
I have written following regex for doing this:
(?<appName>\w+)\|(((?<key>\w+)?(?<equals>=)(?<value>[^\|]+))\|?)+
It matches the entire text but is not able to group it correctly in named groups.
Tried testing it on https://regex101.com/
What am I missing here?
I think the main problem is that you're trying to write a regex that matches ALL the key=value pairs at once. That's not the way to do it. The correct way is to base it on a pattern that matches ONLY ONE key=value pair, applied by a function that finds all occurrences of the pattern. Every language supplies such a function. Here's the code in Python, for example:
import re
txt = 'my_app|key1=value1|user_id=testuser|ip_address=10.10.10.10'
pairs = re.findall(r'(\w+)=([^|]+)', txt)
print(pairs)
This gives:
[('key1', 'value1'), ('user_id', 'testuser'), ('ip_address', '10.10.10.10')]
The pattern matches a key consisting of alphanumeric chars - (\w+) - together with a value. The value is designated by ([^|]+), that is, everything but a vertical line, because the value can contain non-alphanumeric characters, such as the dot in the IP address.
Mind the findall function: there's a search function to catch a pattern once, and a findall function to catch all occurrences of the pattern within the text.
I tested it on regex101 and it worked.
I must comment, though, that the specific text pattern you're working with doesn't require regex. All high-level languages supply a split function. You can split by the vertical line, and then split each slice you get (except the first one) again by the equals sign.
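A minimal sketch of that split approach, reusing the sample text from the question:

```python
txt = 'my_app|key1=value1|user_id=testuser|ip_address=10.10.10.10'
parts = txt.split('|')
app_name = parts[0]
# split each remaining slice at the first '=' only
pairs = [p.split('=', 1) for p in parts[1:]]
print(app_name)  # my_app
print(pairs)     # [['key1', 'value1'], ['user_id', 'testuser'], ['ip_address', '10.10.10.10']]
```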
Use the PyPi regex module with the following code:
import regex
s = "my_app|key1=value1|user_id=testuser|ip_address=10.10.10.10"
rx = r"(?<appName>\w+)(?:\|(?<key>\w+)=(?<value>[^|]+))+"
print( [(m.group("appName"), dict(zip(m.captures("key"),m.captures("value")))) for m in regex.finditer(rx, s)] )
# => [('my_app', {'ip_address': '10.10.10.10', 'key1': 'value1', 'user_id': 'testuser'})]
The .captures property contains all the values captured into a group at all the iterations.
Not sure, but maybe a regular expression is unnecessary here, and splitting, similar to:
data = 'my_app|key1=value1|user_id=testuser|ip_address=10.10.10.10'
x = data.split('|')
appName = []
for index, item in enumerate(x):
    if index > 0:
        element = item.split('=')
        temp = {"key": element[0], "value": element[1]}
        appName.append(temp)
appName = str(x[0] + ',' + str(appName))
print(appName)
might return an output similar to the desired output:
my_app,[{'key': 'key1', 'value': 'value1'}, {'key': 'user_id', 'value': 'testuser'}, {'key': 'ip_address', 'value': '10.10.10.10'}]
Using a dict:
temp = {"key": element[0], "value": element[1]}
temp can be modified to any other data structure you'd like to have.

compare list items against another list

So let's say I have a 3-item list:
myString = "prop zebra cool"
items = myString.split(" ")
#items = ["prop", "zebra", "cool"]
And another list, content, containing hundreds of string items. It's actually a list of files.
Now I want to get only the items of content that contain all of the items
So I started this way:
assets = []
for c in content:
    for item in items:
        if item in c:
            assets.append(c)
And then somehow isolate only the items that are duplicated in the assets list.
This would work fine, but I don't like it; it's not elegant. And I'm sure there is some better way to deal with this in Python.
If I interpret your question correctly, you can use all.
In your case, assuming:
content = [
    "z:/prop/zebra/rig/cool_v001.ma",
    "sjasdjaskkk",
    "thisIsNoGood",
    "shakalaka",
    "z:/prop/zebra/rig/cool_v999.ma"
]
string = "prop zebra cool"
You can do the following:
assets = []
matchlist = string.split(' ')
for c in content:
    if all(s in c for s in matchlist):
        assets.append(c)
print(assets)
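The same filter can also be written as a single list comprehension (using the same content and string as above):

```python
content = [
    "z:/prop/zebra/rig/cool_v001.ma",
    "sjasdjaskkk",
    "thisIsNoGood",
    "shakalaka",
    "z:/prop/zebra/rig/cool_v999.ma"
]
string = "prop zebra cool"

# keep only the entries that contain every word
assets = [c for c in content if all(s in c for s in string.split(' '))]
print(assets)  # ['z:/prop/zebra/rig/cool_v001.ma', 'z:/prop/zebra/rig/cool_v999.ma']
```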
Alternative Method
If you want to have more control (ie. you want to make sure that you only match strings where your words appear in the specified order), then you could go with regular expressions:
import re
# convert content to a single, tab-separated, string
contentstring = '\t'.join(content)
# generate a regex string to match
matchlist = [r'(?:{0})[^\t]+'.format(s) for s in string.split(' ')]
matchstring = r'([^\t]+{0})'.format(''.join(matchlist))
assets = re.findall(matchstring, contentstring)
print(assets)
Assuming \t does not appear in the strings of content, you can use it as a separator and join the list into a single string (obviously, you can pick any other separator that better suits you).
Then you can build your regex so that it matches any substring containing your words and any other character, except \t.
In this case, matchstring results in:
([^\t]+(?:prop)[^\t]+(?:zebra)[^\t]+(?:cool)[^\t]+)
where:
(?:word) means that word is matched but not returned
[^\t]+ means that all characters but \t will match
the outer () will return whole strings matching your rule (in this case z:/prop/zebra/rig/cool_v001.ma and z:/prop/zebra/rig/cool_v999.ma)

Regular expression within count() of a list not working

I am trying to count certain expressions in tokenized texts. My code is:
tokens = nltk.word_tokenize(raw)
print(tokens.count(r"<cash><flow>"))
'tokens' is a list of tokenized texts (partly shown below). But the regex here is not working: the output shows 0 occurrences of 'cash flow', which is not correct, and I receive no error message. If I only count 'cash', it works fine.
'that', 'produces', 'cash', 'flow', 'from', 'operations', ',', 'none', 'of', 'which', 'are', 'currently', 'planned', ',', 'the', 'cash', 'flows', 'that', 'could', 'result', 'from'
Anyone knows what the problem is?
You don't need regex for this.
Just find the matching keywords in tokens and count the elements.
Example:
tokens = ['that','produces','cash','flow','from','operations','with','cash']
keywords = ['cash','flow']
keywords_in_tokens = [x for x in keywords if x in tokens]
count_keywords_in_tokens = len(keywords_in_tokens)
print(keywords_in_tokens)
print(count_keywords_in_tokens)
count_keywords_in_tokens returns 2 because both words are found in the list.
To do it the regex way, you need a string to find the matches based on a regex pattern.
In the example below the 2 keywords are separated by an OR (the pipe)
import re
tokens = ['that','produces','cash','flow','from','operations','with','cash']
string = ' '.join(tokens)
pattern = re.compile(r'\b(cash|flow)\b', re.IGNORECASE)
keyword_matches = re.findall(pattern, string)
count_keyword_matches = len(keyword_matches)
print(keyword_matches)
print(count_keyword_matches)
count_keyword_matches returns 3 because there are 3 matches.
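If what you actually want is the number of times the two tokens appear next to each other (the phrase "cash flow", which seems to be the original goal), you can count adjacent pairs directly; a small sketch using the token list from above:

```python
tokens = ['that', 'produces', 'cash', 'flow', 'from', 'operations', 'with', 'cash']

# count positions where 'cash' is immediately followed by 'flow'
count = sum(1 for a, b in zip(tokens, tokens[1:]) if (a, b) == ('cash', 'flow'))
print(count)  # 1
```

Note that the trailing 'cash' is not counted, because it is not followed by 'flow'.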

Regex: Obtain ID(s) from URL

Getting my feet wet in Regular Expressions, and I'm having a difficult time getting this one to work.
I have a url as such:
/800-Flowers-inc-4124/18-roses-3123
Where 4124 is the business ID, and 3123 is the product ID.
The hard part for me is creating the capturing groups. Currently, my regex is as follows:
/(\d+)(?=/|$)/g
Unfortunately, that only selects the business ID, and doesn't return the product ID.
Any help is greatly appreciated, and if you provide a regex, I'd love it if you could include a little explanation.
thanks!
Your regex is fine, except since you've used the / as the regex delimiter you need to escape it in the expression:
/(\d+)(?=\/|$)/g
Or, you can just use a different delimiter (e.g. #):
#(\d+)(?=/|$)#g
Depending on the language you're using it'll probably return the results in some sort of array, or there could be a 'findAll'-type method instead of just 'find'.
mathematical.coffee is correct:
var data = '/800-Flowers-inc-4124/18-roses-3123';
var myregexp = /(\d+)(?=\/|$)/g;
var match = myregexp.exec(data);
var result = "Matches:\n";
while (match != null) {
    result += "match:" + match[0] + ',\n';
    match = myregexp.exec(data);
}
alert(result);
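For reference, the same pattern works in Python with re.findall, which returns all capture-group matches at once:

```python
import re

data = '/800-Flowers-inc-4124/18-roses-3123'
# a number counts as an ID only when followed by a slash or the end of the string
ids = re.findall(r'(\d+)(?=/|$)', data)
print(ids)  # ['4124', '3123']
```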

Replace using RegEx outside of text markers

I have the following sample text and I want to replace '[core].' with something else, but only when it is not between text markers ' (i.e. not inside an SQL string literal):
PRINT 'The result of [core].[dbo].[FunctionX]' + [core].[dbo].[FunctionX] + '.'
EXECUTE [core].[dbo].[FunctionX]
The result should be:
PRINT 'The result of [core].[dbo].[FunctionX]' + [extended].[dbo].[FunctionX] + '.'
EXECUTE [extended].[dbo].[FunctionX]
I hope someone can understand this. Can this be solved by a regular expression?
With RegLove
Kevin
Not in a single step, and not in an ordinary text editor. If your SQL is syntactically valid, you can do something like this:
First, you remove every string from the SQL and replace with placeholders. Then you do your replace of [core] with something else. Then you restore the text in the placeholders from step one:
Find all occurrences of '(?:''|[^'])+' and replace each with 'n', where n is an index number (the number of the match). Store each match in an array under the same number n. This removes all SQL strings from the input and exchanges them for harmless placeholders without invalidating the SQL itself.
Do your replace of [core]. No regex required, normal search-and-replace is enough here.
Iterate the array, replacing each placeholder with its corresponding array item: '0' with the first, '1' with the second, and so on. Now you have restored the original strings.
The regex, explained:
' # a single quote
(?: # begin non-capturing group
''|[^'] # either two single quotes, or anything but a single quote
)+ # end group, repeat at least once
' # a single quote
In JavaScript this would look something like this:
var sql = 'your long SQL code';
var str = [];
// step 1 - remove everything that looks like an SQL string
var newSql = sql.replace(/'(?:''|[^'])+'/g, function (m) {
    str.push(m);
    return "'" + (str.length - 1) + "'";
});
// step 2 - actual replacement (JavaScript replace is regex-only)
newSql = newSql.replace(/\[core\]/g, "[new-core]");
// step 3 - restore all original strings
for (var i = 0; i < str.length; i++) {
    newSql = newSql.replace("'" + i + "'", str[i]);
}
// done.
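A Python sketch of the same three-step idea, using the sample SQL from the question:

```python
import re

sql = ("PRINT 'The result of [core].[dbo].[FunctionX]' + "
       "[core].[dbo].[FunctionX] + '.'")
strings = []

def stash(m):
    # step 1: remember each string literal, leave a numbered placeholder behind
    strings.append(m.group(0))
    return "'%d'" % (len(strings) - 1)

masked = re.sub(r"'(?:''|[^'])+'", stash, sql)
# step 2: a plain replace is safe now - no original string contents remain
masked = masked.replace('[core]', '[extended]')
# step 3: put the original string literals back
for i, s in enumerate(strings):
    masked = masked.replace("'%d'" % i, s)
print(masked)
```

This prints the desired result: the [core] inside the quoted string is untouched, while the one outside becomes [extended].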
Here is a solution (javascript):
str.replace(/('[^']*'.*)*\[core\]/g, "$1[extended]");