Reserved Keyword search, but in reverse. Regex

I'm writing code that looks through a string and picks out the words that are not considered "reserved keywords". I am new to regex, but have spent quite some time learning what kind of structure I need to look for reserved words. So far, I've written something along the lines of this:
\b(import|false|int|etc)\b
I am going to use an array list to feed all of the reserved words into the pattern above, but I need it to work the opposite of how it works now. I've figured out how to get it to search for the specific words with the code above, but how can I get it to look for the words that are NOT listed above? I've tried incorporating the ^ symbol, but I'm not having any luck there. Any regex veterans out there who see what I'm doing wrong?

There are two obvious possibilities, depending on what (else) you are doing.
Possibility 1: Use a dict or set:
You could just match words and then test for membership in a set or dictionary:
import re

Reserved_words = set('import false true int ...'.split())
word_rx = r'\b\w+\b'  # Or whatever rule you like for "words"

for m in re.finditer(word_rx, text):  # text is the string being scanned
    word = m.group(0)
    if word in Reserved_words:
        print("Found reserved word:", word)
    else:
        print("Found unreserved word:", word)
This approach is frequently used in lexers, where it is easier to just write a catch-all "match a word" rule, and then check a matched word against a list of keywords, than it is to write a fairly complex rule for each keyword and a catch-all to deal with the rest.
You can use a dict if you want to associate some kind of payload with the keyword (such as a class handle for instantiating a particular AST node type, etc.).
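For instance, here is a minimal sketch of that dict idea (the KeywordNode and IdentifierNode names are hypothetical placeholders for whatever payload you actually want to attach):
# Hypothetical payloads: map each keyword to whatever you want to act on,
# e.g. an AST node class, a token type, or a handler function.
class KeywordNode: ...
class IdentifierNode: ...

reserved = {
    "import": KeywordNode,
    "int": KeywordNode,
    "true": KeywordNode,
    "false": KeywordNode,
}

def node_for(word):
    # Fall back to a generic identifier node for anything not reserved.
    return reserved.get(word, IdentifierNode)

print(node_for("int"))       # -> KeywordNode
print(node_for("integral"))  # -> IdentifierNode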
Possibility 2: Use named groups:
Another possibility is that you could use named groups in your regex to capture keyword/nonkeyword values:
word_rx = r'\b(?:(?P<keyword>import|int|true|false|\.\.\.)|(?P<nonkeyword>\w+))\b'
for m in re.finditer(word_rx, text):
    word = m.group('keyword')
    if word:
        print("Found keyword:", word)
    else:
        word = m.group('nonkeyword')
        print("Found nonkeyword:", word)
This is going to be slower than the previous approach, because of prefixes: "int" matches a keyword, but "integral" starts to match an int, then fails, then backtracks to the other branch, then matches a nonkeyword. :-(
However, if you are strongly tied to a mostly-regex implementation, for example, if you have many other regex-based rules, and you are processing them in a loop, then go for it!

How to remove/replace special characters from a 'dynamic' regex/string on ruby?

So I had this code working for a few months already. Let's say I have a table called Categories, which has a string column called name. I receive a string and I want to know if any category was mentioned (a mention occurs when the string contains the substring #name_of_a_category). The approach I followed for this was something like below:
categories.select { |category_i| content_received.downcase.match(/##{category_i.downcase}/)}
That worked pretty well until today, when I suddenly started to receive an unmatched close parenthesis exception. I realized that category names can contain special chars, so I decided not to consider special chars or spaces anymore (I don't want to add restrictions for the user, and at the same time I don't want to deal with those cases, so the policy is just to ignore them).
So the question: is there a clean way of removing these special chars (keeping the #) and matching the string (I don't want to modify the data, just ignore the special chars while looking for mentions)?
You can also use
prep_content_received = content_received.gsub(/[^\w\s]|_/,'')
p categories.select { |c|
  prep_content_received.match?(/\b#{c.gsub(/[^\w\s]|_/, '').strip}\b/i)
}
See the Ruby demo
Details:
The prep_content_received = content_received.gsub(/[^\w\s]|_/,'') creates a copy of content_received with no special chars and no _. Doing it once reduces overhead if there are a lot of categories.
Then, you iterate over the categories list, and each time check whether prep_content_received matches \b (word boundary) + the category with all special chars, _ and leading/trailing whitespace stripped from it + \b, in a case-insensitive way (see the /i flag, so there is no need to .downcase).
So after looking around I found some answers on the platform, but nothing with my specific requirements (maybe I missed something; if so, please let me know), and this is how I fixed it for my case:
content_received = 'pepe is watching a #comedy :)'
categories = ['comedy :)', 'terror']
temp_content = content_received.downcase
categories.select { |category_i|
  temp_content.gsub(/[^\sa-zA-Z0-9]/, '#' => '#')
              .match?(/##{category_i.downcase.gsub(/[^\sa-zA-Z0-9]/, '')}/)
}
For the sake of the example, I reduced the categories to a simple array of strings. Basically, the first gsub removes any character that is not a letter, a number, or whitespace (i.e. any special character) while replacing each # with itself (so the # is kept), and the second gsub is a simpler version of the first one.
You can test the snippet above here

Possible to limit to scope/range of a lookahead

We can check to see if a digit is in a password, for example, by doing something like:
(?=.*\d)
Or if there's a digit and lowercase with:
(?=.*\d)(?=.*[a-z])
This will basically go on "until the end" to check whether there's a letter in the string.
However, I was wondering if it's possible in some sort of generic way to limit the scope of a lookahead. Here's a basic example which I'm hoping will demonstrate the point:
start_of_string;
middle_of_string;
end_of_string;
I want to use a single regular expression to match against start_of_string + middle_of_string + end_of_string.
Is it possible to use a lookahead/lookbehind in the middle_of_string section WITHOUT KNOWING WHAT COMES BEFORE OR AFTER IT? That is, not knowing the size or contents of the preceding/succeeding string component. And limit the scope of the lookahead to only what is contained in that portion of the string?
Let's take one example:
start_of_string = 'start'
middle_of_string = '123'
end_of_string = 'ABC'
Would it be possible to check the contents of each part but limit its scope like this?
string = 'start123ABC'
# Check to make sure the first part has a letter, the second part has a number and the third part has a capital
((?=.*[a-z]).*) # limit scope to the first part only!!
((?=.*[0-9]).*) # limit scope to only the second part.
((?=.*[A-Z]).*) # limit scope to only the last part.
In other words, can lookaheads/lookbehinds be "chained" with other components of a regex without it screwing up the entire regex?
UPDATE:
Here would be an example, hopefully this is more helpful to the question:
START_OF_STRING = 'abc'
Does 'x' exist in it? (?=.*x) ==> False
END_OF_STRING = 'cdxoy'
Does 'y' exist in it? (?=.*y) ==> True
FULL_STRING = START_OF_STRING + END_OF_STRING
'abcdxoy'
Is it possible to chain the two regexes together in any sort of way so that each only works on its 'substring' component?
For example, now (?=.*x) in the first part of the string would return True, but it should not.
`((?=.*x)(?=.*y)).*`
I think the short answer to this is "No, it's not possible.", but am looking to hear from someone who understands this to tell why it is or isn't.
In .NET and JavaScript you could use a positive lookahead at the start of your string component and a positive lookbehind at the end of it to "constrain" the match. Example:
.*(?=.*arrow)(?<middle>.*)(?<=.*arrow).*
helloarrowxyz
{'middle': 'arrow'}
In PCRE, Python, or other flavors, you would need a fixed-width lookahead to constrain it from going too far forward, such as what Wiktor Stribiżew says above:
.*(?=.{0,5}arrow)(?<middle>.{0,5}).*
Otherwise, it wouldn't be possible without either a fixed-width lookahead or a variable-width lookbehind.
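For instance, here is a small hedged sketch of that fixed-width-lookahead idea in Python (Python spells named groups (?P<name>...)):
import re

# The lookahead may only scan at most 5 characters ahead of the current
# position, so it cannot "see" into the rest of the string.
m = re.search(r'.*(?=.{0,5}arrow)(?P<middle>.{0,5}).*', 'helloarrowxyz')
print(m.group('middle'))  # -> arrow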

How to group provided string correctly?

I have the following regex:
^([A-Za-z]{2,3}\d{6}|\d{5}|\d{3})((\d{3})?)(\d{2}|\d{3}|\d{6})(\d{2}|\d{3})$
I use this regex to match different, yet similar strings:
# MOR644-004-007-001
MOR644004007001 # string provided
# VUF00101-050-08-01
VUF001010500801 # string provided
# MF001317-077944-01
MF00131707794401 # string provided
These strings need to match/group as shown in the comment above each string; however, my problem is that my regex is not grouping them correctly.
The first string: MOR644004007001 is grouped: (MOR644004) (007) (001) which should be (MOR644) (004) (007) (001)
The second string: VUF001010500801 is grouped (VUF001010) (500) (801) which should be (VUF00101) (050) (08) (01)
How can I change ([A-Za-z]{2,3}\d{6}|\d{5}|\d{3})((\d{3})?) so that it would group the provided string correctly?
I am not sure that you can do what you want to.
Let's consider the first two strings:
# MOR644-004-007-001
MOR644004007001 # string provided
# VUF00101-050-08-01
VUF001010500801 # string provided
Now, both strings are composed of 3 letters followed by 12 digits. Thus, given a regex R, if R does not depend on particular (sequences of) characters or particular (sequences of) digits (i.e., it uses [A-Za-z] and \d but does not use, say, MO or 0070), then it will match both strings in the same way.
So, if you want to operate a different matching, then you need to look at the particular occurrences of certain characters or digits. We need more data from you in order to give you an answer.
Finally, I suggest you take a look at this tool:
http://regex.inginf.units.it/ (demo: http://regex.inginf.units.it/demo.html). It is a research project that automatically generates a regex given (many) examples of extraction. I warmly suggest you try it, especially if you know for sure that an underlying pattern is present in your case (i.e. strings beginning with VUF must be matched differently from strings beginning with MOR) but you are unable to find it. Again, you will need to provide many examples to the engine. Needless to say, if a generic pattern does not exist, then the tool won't find it ;)
Considering your comment to Serv I'd say the (only?) solution is to have one regex for each possibility, like -
MOR(\d{3})(\d{3})(\d{3})(\d{3})|VUF(\d{5})(\d{3})(\d{2})(\d{2})|MF(\d{6})(\d{6})(\d{2})
and then use the execution environment (JS/php/python - you haven't provided which one) to piece the parts together.
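Assuming Python for the sake of illustration (the question does not say which environment), the piecing-together step could look roughly like this:
import re

# One alternative per known prefix; only the groups of the branch that
# matched are populated, so keep the non-empty ones.
pattern = re.compile(r'MOR(\d{3})(\d{3})(\d{3})(\d{3})'
                     r'|VUF(\d{5})(\d{3})(\d{2})(\d{2})'
                     r'|MF(\d{6})(\d{6})(\d{2})')

for s in ['MOR644004007001', 'VUF001010500801', 'MF00131707794401']:
    m = pattern.match(s)
    if m:
        print(s, '->', [g for g in m.groups() if g is not None])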
See the example on regex101 here. Note that the substitution, given only as an example, matches only the second string.
Regards
Take a look at this. I have used what's called a named group. As pointed out earlier by others, it's better to have one regex for each string format. I have shown it here for the first string, MOR644004007001; you can easily expand it for the other two strings (a sketch for the second one follows the output below):
import re
# MOR644-004-007-001
MOR = "MOR644004007001" # string provided
# VUF00101-050-08-01
VUF = "VUF001010500801" # string provided
# MF001317-077944-01
MF = "MF00131707794401" # string provided
MORcompile = re.compile(r'(?P<first>\w{,6})(?P<second>\d{,3})(?P<third>\d{,3})(?P<fourth>\d{,3})')
MORsearch = MORcompile.search(MOR.strip())
print MORsearch.group('first')
print MORsearch.group('second')
print MORsearch.group('third')
print MORsearch.group('fourth')
This prints:
MOR644
004
007
001
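For instance, a hedged sketch of the same named-group idea for the second string, with the group widths read off the expected grouping VUF00101-050-08-01:
import re

VUF = "VUF001010500801"  # string provided
# Widths taken from the expected grouping VUF00101-050-08-01
VUFcompile = re.compile(r'(?P<first>[A-Za-z]{3}\d{5})(?P<second>\d{3})(?P<third>\d{2})(?P<fourth>\d{2})')
VUFsearch = VUFcompile.search(VUF.strip())
print(VUFsearch.group('first'))   # VUF00101
print(VUFsearch.group('second'))  # 050
print(VUFsearch.group('third'))   # 08
print(VUFsearch.group('fourth'))  # 01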

Finding if a string matches a pattern

At one point in my app, I need to match some strings against a pattern. Let's say that some of the sample strings look as follows:
Hi there, John.
What a lovely day today!
Lovely sunset today, John, isn't it?
Will you be meeting Linda today, John?
Most (not all) of these strings are from pre-defined patterns as follows:
"Hi there, %s."
"What a lovely day today!"
"Lovely sunset today, %s, isn't it?"
"Will you be meeting %s today, %s?"
This library of patterns is ever-expanding (currently at around 1,500), but is manually maintained. The input strings though (the first group) are largely unpredictable. Though most of them will match one of the patterns, some of them will not.
So, here's my question: Given a string (from the first group) as input, I need to know which of the patterns (known second group) it matched. If nothing matched, it needs to tell me that.
I'm guessing the solution involves building a regex out of the patterns, and iteratively checking which one matched. However, I'm unsure what the code to build those regexes looks like.
Note: The strings I've given here are for illustration purposes. In reality, the strings aren't human generated, but are computer-generated human-friendly strings as shown above from systems I don't control. Since they aren't manually typed in, we don't need to worry about things like typos and other human errors. Just need to find which pattern it matches.
Note 2: I could modify the patterns library to be some other format, if that makes it easier to construct the regexes. The current structure, with the printf style %s, isn't set in stone.
I am looking at this as a parsing problem. The idea is that the parser function takes a string and determines if it is valid or not.
The string is valid if you can find it among the given patterns. That means you need an index of all the patterns. The index must be a full-text index, and it must match according to word position; e.g. it should short-circuit if the first word of the input is not found among the first words of the patterns. It should also take care of the "match anything" case, i.e. %s in the pattern.
One solution is to put the patterns in an in-memory database (e.g. Redis) and do a full-text index on it (this will not match according to word position), but you should be able to narrow down to the correct pattern by splitting the input into words and searching. The searches will be very fast because you have a small in-memory database. Also note that you are looking for the closest match: one or more words will not match, and the pattern with the highest number of matching words is the one you want.
An even better solution is to generate your own index in a dictionary format. Here is an example index for the four patterns you gave as a JavaScript object.
{
  "Hi": {"there": {"%s": null}},
  "What": {"a": {"lovely": {"day": {"today": null}}}},
  "Lovely": {"sunset": {"today": {"%s": {"isnt": {"it": null}}}}},
  "Will": {"you": {"be": {"meeting": {"%s": {"today": {"%s": null}}}}}}
}
You descend this index recursively according to word position: search for the first word; if found, search for the next word within the object returned by the first, and so on. Identical words at a given level share a single key. You should also handle the %s ("match anything") case. This should be blindingly fast in memory.
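Here is a rough Python sketch of that recursive descent (the matches_index helper is a made-up name, the index below is a trimmed copy of the object above with null becoming None, and punctuation is assumed to be stripped before splitting into words):
def matches_index(index, words):
    # A literal word or the "%s" wildcard may match at each level;
    # None marks the end of a complete pattern.
    if not words:
        return index is None
    if index is None:
        return False
    first, rest = words[0], words[1:]
    for key in (first, "%s"):
        if key in index and matches_index(index[key], rest):
            return True
    return False

index = {
    "Hi": {"there": {"%s": None}},
    "What": {"a": {"lovely": {"day": {"today": None}}}},
}

print(matches_index(index, "Hi there John".split()))            # True ("%s" absorbs "John")
print(matches_index(index, "What a lovely day today".split()))  # True
print(matches_index(index, "Hello there John".split()))         # False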
My first thought would be to have the regexp engine take all the trouble of handling this. They're usually optimised to handle large amounts of text so it shouldn't be that much of a performance hassle. It's brute force but the performance seems to be okay. And you could split the input into pieces and have multiple processes handle them. Here's my moderately tested solution (in Python).
import random
import string
import re

def create_random_sentence():
    nwords = random.randint(4, 10)
    sentence = []
    for i in range(nwords):
        sentence.append("".join(random.choice(string.lowercase) for x in range(random.randint(3,10))))
    ret = " ".join(sentence)
    print ret
    return ret

patterns = [r"Hi there, [a-zA-Z]+.",
            r"What a lovely day today!",
            r"Lovely sunset today, [a-zA-Z]+, isn't it?",
            r"Will you be meeting [a-zA-Z]+ today, [a-zA-Z]+\?"]

for i in range(95):
    patterns.append(create_random_sentence())

monster_pattern = "|".join("(%s)"%x for x in patterns)
print monster_pattern
print "--------------"

monster_regexp = re.compile(monster_pattern)

inputs = ["Hi there, John.",
          "What a lovely day today!",
          "Lovely sunset today, John, isn't it?",
          "Will you be meeting Linda today, John?",
          "Goobledigoock"]*2000

for i in inputs:
    ret = monster_regexp.search(i)
    if ret:
        print ".",
    else:
        print "x",
I've created a hundred patterns. This is the maximum limit of the python regexp library. 4 of them are your actual examples and the rest are random sentences just to stress performance a little.
Then I combined them into a single regexp with 100 groups. (group1)|(group2)|(group3)|.... I'm guessing you'll have to sanitise the inputs for things that can have meanings in regular expressions (like ? etc.). That's the monster_regexp.
Testing one string against this tests it against 100 patterns in a single shot. There are methods that fetch out the exact group which was matched. I test 10000 strings, 80% of which should match and 20% of which will not. It short-circuits, so if there's a success it will be comparatively quick. Failures will have to run through the whole regexp, so they will be slower. You can order things based on the frequency of input to get some more performance out of it.
I ran this on my machine and this is my timing.
python /tmp/scratch.py 0.13s user 0.00s system 97% cpu 0.136 total
which is not too bad.
However, running a string against such a large regexp and failing takes longer, so I changed the inputs to have lots of randomly generated strings that won't match and tried again: 10000 strings, none of which match the monster_regexp, and I got this.
python /tmp/scratch.py 3.76s user 0.01s system 99% cpu 3.779 total
Similar to Noufal's solution, but returns the matched pattern or None.
import re

patterns = [
    "Hi there, %s.",
    "What a lovely day today!",
    "Lovely sunset today, %s, isn't it",
    "Will you be meeting %s today, %s?"
]

def make_re_pattern(pattern):
    # Characters like . ? etc. have special meaning in regular expressions.
    # Escape the string to avoid interpreting them differently.
    # The re.escape function escapes even %, so replace %s with XXX first to avoid that.
    p = re.escape(pattern.replace("%s", "XXX"))
    return p.replace("XXX", "\w+")

# Join all the patterns into a single regular expression.
# Each pattern is enclosed in () to remember the match.
# This will help us to find the matched pattern.
rx = re.compile("|".join("(" + make_re_pattern(p) + ")" for p in patterns))

def match(s):
    """Given an input string, returns the matched pattern or None."""
    m = rx.match(s)
    if m:
        # Find the index of the matching group.
        index = (i for i, group in enumerate(m.groups()) if group is not None).next()
        return patterns[index]

# Testing with a couple of patterns
print match("Hi there, John.")
print match("Will you be meeting Linda today, John?")
Python solution. JS should be similar.
>>> re2.compile('^ABC(.*)E$').search('ABCDE') == None
False
>>> re2.compile('^ABC(.*)E$').search('ABCDDDDDDE') == None
False
>>> re2.compile('^ABC(.*)E$').search('ABX') == None
True
>>>
The trick is to use ^ and $ to bound your pattern, making it a "template". Use (.*) or (.+) or whatever it is that you want to "search" for.
The main bottleneck for you, imho, will be iterating through a list of these patterns. Regex searches are computationally expensive.
If you want a "does any pattern match" result, build a massive OR based regex and let your regex engine handle the 'OR'ing for you.
Also, if you have only prefix patterns, check out the TRIE data structure.
This could be a job for sscanf; there is an implementation in JS: http://phpjs.org/functions/sscanf/; the function being ported is this one: http://php.net/manual/en/function.sscanf.php.
You should be able to use it without changing the prepared strings much, but I have doubts about the performance.
The problem isn't clear to me. Do you want to take the patterns and build regexes out of them?
Most regex engines have a "quoted string" option. (\Q \E). So you could take the string and make it
^\QHi there,\E(?:.*)\Q.\E$
These will be regexes that match exactly the string you want outside your variables.
If you want to use a single regex to match against all the patterns at once, you can combine them as grouped alternatives to find out which one matched, but that will not give you EVERY match, just the first one.
If you use a proper parser (I've used PEG.js), it might be more maintainable though. So that's another option if you think you might get stuck in regex hell.

Part of a string from a string using regular expressions

I have a string of 5 characters, of which the first two characters should be in some list and the next three should be in some other list.
How could I validate them with regular expressions?
Example:
List for first two characters: {VBNET, CSNET, HTML}
List for next three characters: {BEGINNER, EXPERT, MEDIUM}
My Strings are going to be: VBBEG, CSBEG, etc.
My regular expression should check that the input string's first two characters are VB, CS, or HT, and that the rest of the string follows the same kind of rule.
Would the following expression work for you in a more general case (so that you don't have hardcoded values)?
(^..)(.*$)
It returns the first two letters in the first group, and the remaining letters in the second group.
Something like this:
^(VB|CS|HT)(BEG|EXP|MED)$
This recipe works for me:
^(VB|CS|HT)(BEG|EXP|MED)$
I guess (VB|CS|HT)(BEG|EXP|MED) should do it.
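As a quick usage illustration (assuming Python, since the question does not name a language), the anchored alternation can be applied like this:
import re

pattern = re.compile(r'^(VB|CS|HT)(BEG|EXP|MED)$')

for s in ['VBBEG', 'CSEXP', 'HTMED', 'VBXYZ']:
    m = pattern.match(s)
    print(s, '->', m.groups() if m else 'no match')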
If your strings are as well-defined as this, you don't even need regex; simple string slicing would work.
For example, in Python we might say:
mystring = "HTEXP"
prefix = mystring[0:2]
suffix = mystring[2:5]
if (prefix in ['HT','CS','VB']) and (suffix in ['BEG','MED','EXP']):
    pass  # valid!
else:
    pass  # not valid. :(
Don't use regex where elementary string operations will do.