In Actionscript, how to match / in infinitive structures like to cross out/off? - regex

I'm using the following regular expression to find the exact occurrences in infinitives. Flag is global.
(?!to )(?<!\w) (' + word_to_search + ') (?!\w)
To give an example of what I'm trying to achieve:
looking for out should not bring : to outlaw
looking for out could bring : to be out of line
looking for to should not bring : to etc. just because it matches the first to
I've already done these steps, however, to cross out/off should be in the result list too. Is there any way to create an exception without compromising what I have achieved?
Thank you.

I'm still not sure I understand the question. You want to match something that looks like an infinitive verb phrase and contains the whole word word_to_search? Try this:
"\\bto\\s(?:\\w+[\\s/])*" + word_to_search + "\\b"
Remember, when you create a regex in the form of a string literal, you have to escape the backslashes. If you tried to use "\b" to specify a word boundary, it would have been interpreted as a backspace.
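For anyone who wants to try this pattern outside ActionScript, here is a minimal Python sketch of the same idea. The helper name infinitive_pattern and the sample phrases are made up for illustration, and re.escape is an addition so that the search word is treated literally. Python raw strings also sidestep the backslash-escaping issue described above.

```python
import re

def infinitive_pattern(word_to_search):
    # "to", one whitespace character, then zero or more words that are each
    # followed by a space or a slash, then the search word as a whole word.
    return re.compile(r"\bto\s(?:\w+[\s/])*" + re.escape(word_to_search) + r"\b")

phrases = ["to outlaw", "to be out of line", "to cross out/off"]
hits_out = [p for p in phrases if infinitive_pattern("out").search(p)]
hits_off = [p for p in phrases if infinitive_pattern("off").search(p)]
hits_to = [p for p in phrases if infinitive_pattern("to").search(p)]
print(hits_out)
print(hits_off)
print(hits_to)
```

Searching for "to" finds nothing here because the pattern needs a second "to" after the leading one; the leading "to" alone is not enough.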

I know about the OR operator, but the question was rather how to organize the structure so it can look ahead and behind. Here is what I have done so far:
var strPattern:String = '(?!to )(?<!\w) (' + word_to_search + ') (?!\w)|';
strPattern+='(?!to )(?<!\w) (' + word_to_search + '\/)|';
strPattern+='(?!to )(\/' + word_to_search + ')';
var pattern:RegExp = new RegExp(strPattern, "g");
The first line is the same as the one in my question; it finds structures like to bail out when you search for out. The second line matches structures like to cross out/off. But we need something else to match to cross out/off when the word is off, so the third line adds that extra condition.


error: multiple repeat for regex in robot [duplicate]

I'm trying to determine whether a term appears in a string.
Before and after the term must appear a space, and a standard suffix is also allowed.
Example:
term: google
string: "I love google!!! "
result: found
term: dog
string: "I love dogs "
result: found
I'm trying the following code:
regexPart1 = "\s"
regexPart2 = "(?:s|'s|!+|,|.|;|:|\(|\)|\"|\?+)?\s"
p = re.compile(regexPart1 + term + regexPart2 , re.IGNORECASE)
and get the error:
raise error("multiple repeat")
sre_constants.error: multiple repeat
Update
Real code that fails:
term = 'lg incite" OR author:"http++www.dealitem.com" OR "for sale'
regexPart1 = r"\s"
regexPart2 = r"(?:s|'s|!+|,|.|;|:|\(|\)|\"|\?+)?\s"
p = re.compile(regexPart1 + term + regexPart2 , re.IGNORECASE)
On the other hand, the following term passes smoothly (+ instead of ++)
term = 'lg incite" OR author:"http+www.dealitem.com" OR "for sale'
The problem is that, in a non-raw string, \" is ".
You get lucky with all of your other unescaped backslashes—\s is the same as \\s, not s; \( is the same as \\(, not (, and so on. But you should never rely on getting lucky, or assuming that you know the whole list of Python escape sequences by heart.
Either print out your string and escape the backslashes that get lost (bad), escape all of your backslashes (OK), or just use raw strings in the first place (best).
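A quick way to see the difference between plain and raw literals is to compare their lengths. This is only an illustrative sketch; the variable names are made up.

```python
# In a plain literal the backslash before the quote is consumed by Python,
# so the string holds three characters: a, ", b.
plain = "a\"b"

# In a raw literal the backslash is kept, so the string holds four
# characters: a, \, ", b. That is what the regex engine needs to see.
raw = r"a\"b"
print(len(plain), len(raw))

# "\b" is a backspace character in a plain literal, but a backslash
# followed by "b" (a regex word boundary) in a raw one.
is_backspace = "\b" == "\x08"
is_boundary = r"\b" == "\\b"
print(is_backspace, is_boundary)
```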
That being said, your regexp as posted won't match some expressions that it should, but it will never raise that "multiple repeat" error. Clearly, your actual code is different from the code you've shown us, and it's impossible to debug code we can't see.
Now that you've shown a real reproducible test case, that's a separate problem.
You're searching for terms that may have special regexp characters in them, like this:
term = 'lg incite" OR author:"http++www.dealitem.com" OR "for sale'
That p++ in the middle of a regexp means "1 or more of 1 or more of the letter p" (in other words, the same as "1 or more of the letter p") in some regexp languages, "always fail" in others, and "raise an exception" in others. Python's re falls into the last group. In fact, you can test this in isolation:
>>> re.compile('p++')
error: multiple repeat
If you want to put random strings into a regexp, you need to call re.escape on them.
One more problem (thanks to Ωmega):
. in a regexp means "any character". So, ,|.|;|: (I've just extracted a short fragment of your longer alternation chain) means "a comma, or any character, or a semicolon, or a colon"… which is the same as "any character". You probably wanted to escape the ..
Putting all three fixes together:
term = 'lg incite" OR author:"http++www.dealitem.com" OR "for sale'
regexPart1 = r"\s"
regexPart2 = r"(?:s|'s|!+|,|\.|;|:|\(|\)|\"|\?+)?\s"
p = re.compile(regexPart1 + re.escape(term) + regexPart2 , re.IGNORECASE)
As Ωmega also pointed out in a comment, you don't need to use a chain of alternations if they're all one character long; a character class will do just as well, more concisely and more readably.
And I'm sure there are other ways this could be improved.
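As one possible version of the character-class idea, here is a small sketch. The helper name build_term_pattern is hypothetical, and the suffix list is copied from the question with the single-character alternatives folded into one class.

```python
import re

def build_term_pattern(term):
    # Optional suffix: possessive/plural, a run of "!" or "?", or exactly
    # one punctuation character from the class.
    suffix = r"(?:'s|s|!+|\?+|[,.;:()\"])?"
    # re.escape makes every character of the term match literally.
    return re.compile(r"\s" + re.escape(term) + suffix + r"\s", re.IGNORECASE)

term = 'lg incite" OR author:"http++www.dealitem.com" OR "for sale'
found_google = bool(build_term_pattern("google").search("I love google!!! "))
found_dogs = bool(build_term_pattern("dog").search("I love dogs "))
found_dogma = bool(build_term_pattern("dog").search("I love dogma "))
found_long = bool(build_term_pattern(term).search(" " + term + " "))
print(found_google, found_dogs, found_dogma, found_long)
```

Inside a character class the dot is literal, so it does not need a backslash there.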
The other answer is great, but I would like to point out that using regular expressions to find strings in other strings is not the best way to go about it. In python simply write:
if term in string:
    pass  # do whatever
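I had example_str = "i love you c++" and got the "multiple repeat" error when I used it in a regex. The error occurs because the string contains "++", which the regex engine reads as two quantifiers in a row. My fix was to use re.escape on the search term; here is my code (word_filter is the term to look for and word_en is the text being searched):
example_str = "i love you c++"
regex_word = re.search(rf'\b{re.escape(word_filter)}\b', word_en)
One caveat: \b only matches next to a word character, so a trailing \b after c++ will not match at the end of the string; (?<!\w) and (?!\w) are a safer way to delimit terms that end in punctuation.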
Also make sure that your arguments are in the correct order!
I was trying to run a regular expression on some html code. I kept getting the multiple repeat error, even with very simple patterns of just a few letters.
Turns out I had the pattern and the html mixed up. I tried re.findall(html, pattern) instead of re.findall(pattern, html).
A general solution to "multiple repeat" is using re.escape to match the literal pattern.
Example:
>>> re.compile(re.escape("c++"))
re.compile('c\\+\\+')
However if you want to match a literal word with space before and after try out this example:
>>> re.findall(rf"\s{re.escape('c++')}\s", "i love c++ you c++")
[' c++ ']

Pattern matching in postgres 9.1

I am trying to extract the camera make & model from exifdata.
The exifdata itself is quite long, 4 lines follow:
JPEG.APP1.Ifd0.ImageDescription = ' '
JPEG.APP1.Ifd0.Make = 'Canon'
JPEG.APP1.Ifd0.Model = 'Canon PowerShot S120'
JPEG.APP1.Ifd0.Orientation = 1 = '0,0 is top left'
I use the following regexs, but they do not match. Are the patterns correct?
make := substring( meta from 'Make\\s+=\\s+(.*)');
model := substring( meta from 'Model\\s+=\\s+(.*)');
substring([str] from [pattern]) may not work quite like you think it does. There are two regex-flavored forms of substring, and they are easy to confuse. The two-parameter form you are calling takes a POSIX regular expression (see 9.7.3. POSIX Regular Expressions) and returns the text matched by the first parenthesized group. The three-parameter form, substring([str] from [pattern] for [escape]), is described in 9.7.2. SIMILAR TO Regular Expressions, and it is the stricter one.
For starters, there's this relevant bit of info about the three-parameter form:
As with SIMILAR TO, the specified pattern must match the entire data string, or else the function fails and returns null.
Second is the next sentence:
To indicate the part of the pattern that should be returned on success, the pattern must contain two occurrences of the escape character followed by a double quote (")
This isn't quite the standard regex, so you need to be aware of it if you ever use that form.
Rather than trying to get substring([str] from [pattern]) working, I'm going to recommend an alternative: regexp_matches. This function uses standard POSIX regex syntax, and it returns a text[] containing all the captured groups from the match. Here's a quick test to show that it works:
SELECT regexp_matches($$JPEG.APP1.Ifd0.ImageDescription = ' '
JPEG.APP1.Ifd0.Make = 'Canon'
JPEG.APP1.Ifd0.Model = 'Canon PowerShot S120'
JPEG.APP1.Ifd0.Orientation = 1 = '0,0 is top left'$$, '(Make)') m;
(I'm using dollar quoting for your example string, in case you're not familiar with that syntax.)
This gives back the array {Make}.
Second, your regex actually doesn't work, as I found out in my testing. You have two problems:
The doubled backslashes are incorrect. You don't need to escape the \ as PostgreSQL doesn't treat it as an escape character by default. You can read about escaping in strings here; the most relevant section is probably 4.1.2.2. String Constants with C-style Escapes. That section describes what you thought was happening by default, but it actually requires an E prefix to enable.
Fixing that improves the result:
SELECT regexp_matches($$JPEG.APP1.Ifd0.ImageDescription = ' '
JPEG.APP1.Ifd0.Make = 'Canon'
JPEG.APP1.Ifd0.Model = 'Canon PowerShot S120'
JPEG.APP1.Ifd0.Orientation = 1 = '0,0 is top left'$$, 'Make\s+=\s+(.*)') m;
now gives an array containing this string:
'Canon'
JPEG.APP1.Ifd0.Model = 'Canon PowerShot S120'
JPEG.APP1.Ifd0.Orientation = 1 = '0,0 is top left'
This brings us to...
The (.*) is matching everything to the end of the string, not the end of the line. You can actually fix this by doing something you probably want to do anyway: get the single quote marks out of the match. You can use this pattern to do that:
$$Make\s+=\s+'([^']+)'$$
I've used dollar quoting again, this time to avoid the ugliness of escaping all those single quote marks. Now the query is:
SELECT regexp_matches($$JPEG.APP1.Ifd0.ImageDescription = ' '
JPEG.APP1.Ifd0.Make = 'Canon'
JPEG.APP1.Ifd0.Model = 'Canon PowerShot S120'
JPEG.APP1.Ifd0.Orientation = 1 = '0,0 is top left'$$, $$Make\s+=\s+'([^']+)'$$) m;
which gives you pretty much exactly what you want: an array containing just the string Canon. You'll need to extract the result from the array, of course, but I'll leave that as an exercise for you.
That should be enough info for you to get the second expression working, too.
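If you want to sanity-check the final pattern outside the database, a rough Python sketch follows. Python's re is not PostgreSQL's regex engine (for one thing, the two treat . and newlines differently by default), so this only illustrates the negated character class, not Postgres behavior. The helper name exif_value is made up.

```python
import re

meta = """JPEG.APP1.Ifd0.ImageDescription = ' '
JPEG.APP1.Ifd0.Make = 'Canon'
JPEG.APP1.Ifd0.Model = 'Canon PowerShot S120'
JPEG.APP1.Ifd0.Orientation = 1 = '0,0 is top left'"""

def exif_value(field, text):
    # [^']+ stops at the closing quote, so the capture cannot run past
    # the value onto the following lines.
    m = re.search(re.escape(field) + r"\s+=\s+'([^']+)'", text)
    return m.group(1) if m else None

make = exif_value("Make", meta)
model = exif_value("Model", meta)
print(make)
print(model)
```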
P.S. PostgreSQL's truly fine manual is a blessing.

VB.Net Beginner: Replace with Wildcards, Possibly RegEx?

I'm converting a text file to a Tab-Delimited text file, and ran into a bit of a snag. I can get everything I need to work the way I want except for one small part.
One field I'm working with has the home addresses of the subjects as a single entry ("1234 Happy Lane Somewhere, St 12345") and I need each broken down by Street(Tab)City(Tab)State(Tab)Zip. The one part I'm hung up on is the Tab between the State and the Zip.
I've been using input=input.Replace throughout, and it's worked well so far, but I can't think of how to untangle this one. The wildcards I'm used to don't seem to be working, I can't replace ("?? #####") with ("??" + ControlChars.Tab + "#####")...which I honestly didn't expect to work, but it's the only idea on the matter I had.
I've read a bit about using Regex, but have no experience with it, and it seems a bit...overwhelming.
Is Regex my best option for this? If not, are there any other suggestions on solutions I may have missed?
Thanks for your time. :)
EDIT: Here's what I'm using so far. It makes some edits to the line in question, taking care of spaces, commas, and other text I don't need, but I've got nothing for the State/Zip situation; I've a bad habit of wiping something if it doesn't work, but I'll append the last thing I used to the very end, if that'll help.
If input Like "Guar*###/###-####" Then
input = input.Replace("Guar:", "")
input = input.Replace(" ", ControlChars.Tab)
input = input.Replace(",", ControlChars.Tab)
input = "C" + ControlChars.Tab + strAccount + ControlChars.Tab + input
End If
input = System.Text.RegularExpressions.Regex.Replace(" #####", ControlChars.Tab + "#####") <-- Just one example of something that doesn't work.
This is what's written to input in this example
" Guar: LASTNAME,FIRSTNAME 999 E 99TH ST CITY,ST 99999 Tel: 999/999-9999"
And this is what I can get as a result so far
C 99999/9 LASTNAME FIRSTNAME 999 E 99TH ST CITY ST 99999 999/999-9999
With everything being exactly what I need besides the "ST 99999" bit (with actual data obviously omitted for privacy and professional whatnots).
UPDATE: Just when I thought it was all squared away, I've got another snag. The raw data gives me this.
# TERMINOLOGY ######### ##/##/#### # ###.##
And the end result is giving me this, because this is a chunk of data that was just fine as-is...before I removed the Tabs. Now I need a way to replace them after they've been removed, or to omit this small group of code from a document-wide Tab genocide I initiate the code with.
#TERMINOLOGY###########/##/########.##
Would a variant on rgx.Replace work best here? Or can I copy the code to a variable, remove Tabs from the document, then insert the variable without losing the tabs?
I think what you're looking for is
Dim rgx As New System.Text.RegularExpressions.Regex(" (\d{5})(?!\d)")
input = rgx.Replace(input, ControlChars.Tab + "$1")
The first line compiles the regular expression. The \d matches a digit, and the {5}, as you can guess, matches 5 repetitions of the previous atom. The parentheses surrounding the \d{5} form what is known as a capture group, which is responsible for putting what's captured in a pseudovariable named $1. The (?!\d) is a more advanced concept known as a negative lookahead assertion, and it basically peeks at the next character to check that it's not a digit (because then it could be a 6-or-more digit number, where the first 5 happened to get matched). Another version is
" (\d{5})\b"
where the \b is a word boundary, disallowing alphanumeric characters following the digits.
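To see the capture group and the lookahead at work without a VB.Net project, here is the same substitution sketched in Python. The function name tab_before_zip and the second sample line are invented for illustration; the first sample follows the anonymized data in the question.

```python
import re

def tab_before_zip(line):
    # A space, then exactly five digits that are not followed by another
    # digit. The space becomes a tab; group 1 puts the digits back.
    return re.sub(r" (\d{5})(?!\d)", "\t" + r"\1", line)

fixed = tab_before_zip("999 E 99TH ST CITY ST 99999 999/999-9999")
untouched = tab_before_zip("ref 1234567 stays")
print(repr(fixed))
print(repr(untouched))
```

The seven-digit number is left alone because the lookahead rejects any five digits that are followed by a sixth.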

preg_match Part of a url

I have a link that looks like this http://site.com/numbers_and_letters/This_is_what-I-need_to-retrieve.html
I basically need to retrieve this part: This_is_what-I-need_to-retrieve
And also replace the dashes and underscores with spaces so it would end up looking like this: This is what I need to retrieve
I'm new to regex so this is what i'm using:
(it works but has poor performance)
function clean($url)
{
    $cleaned = preg_replace("/http:\/\/site.com\/.+\//", '', $url);
    $cleaned = preg_replace("/[-_]/", ' ', $cleaned);
    // remove the ".html" extension (5 characters, including the dot)
    $cleaned = substr($cleaned, 0, -5);
    return $cleaned;
}
What you've got isn't that bad. But maybe you can try comparing its performance to this:
preg_match('#[^/]+$#', $url, $match);
$cleaned = preg_replace('/[-_]/', ' ', $match[0]);
EDIT:
If all you have is a hammer, everything looks like a nail.
How about avoiding regex altogether? (I presume each input is a valid URL.)
$cleaned = strtr(substr($url, strrpos($url, '/') + 1, -5), '-_', '  ');
This even removes the .html extension! (I'm making all the same assumptions you already seem to be making, i.e. that all links end in .html.) A brief explanation:
strtr translates a set of characters, e.g. -_, to respective characters in another set, e.g. spaces. (I imagine it'd be more efficient than invoking the entire regex engine.)
substr, you must know, but note that if the last argument is negative, e.g. -5, it indicates the number of characters from the end to ignore. Handy for this case, and again, probably more efficient than regex.
strrpos, of course, finds the last position of a character in a string, e.g. /.
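The same regex-free approach carries over to other languages. As a rough sketch in Python (the function name clean mirrors the question, and the ".html" extension is the same assumption made above):

```python
def clean(url):
    # Everything after the last slash (the whole string if there is none).
    name = url[url.rfind("/") + 1:]
    # Drop a trailing ".html" if present.
    if name.endswith(".html"):
        name = name[:-len(".html")]
    # Map "-" and "_" to spaces, one character to one character.
    return name.translate(str.maketrans("-_", "  "))

url = "http://site.com/numbers_and_letters/This_is_what-I-need_to-retrieve.html"
cleaned = clean(url)
print(cleaned)
```

Note that the translation table needs one replacement character per source character, which is why the second argument holds two spaces.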

Regular expression any character with dynamic size

I want to use a regular expression that does the following (I've extracted just the part I'm having trouble with, to simplify):
any character for the first 1 to 5 characters, then an underscore, then some digits, then an underscore, then some digits or dots.
With a restriction on the underscore, it should look something like this:
^([^_]{1,5})_([\\d]{2,3})_([\\d\\.]*)$
But I want to allow the "_" in the first 1-5 characters as long as the rest of the regular expression still matches, for example if I had something like:
to_to_123_12.56
I think this is linked to the greediness of the regex engine; nevertheless, I tried some lazy quantifiers as explained here, but without success.
Any ideas?
I used the following regex and it appeared to work fine for your task. I've simply replaced your initial [^_] with ..
^.{1,5}_\d{2,3}_[\d\.]*$
It's probably best to replace your final * with + too, unless you allow nothing after the final '_'. And note your final part allows multiple '.' (I don't know if that's what you want or not).
For the record, here's a quick Python script I used to verify the regex:
import re

strs = [
    "a_12_1",
    "abc_12_134",
    "abcd_123_1.",
    "abcde_12_1",
    "a_123_123.456.7890.",
    "a_12_1",
    "ab_de_12_1",
]
myre = r"^.{1,5}_\d{2,3}_[\d\.]+$"
for s in strs:
    m = re.match(myre, s)
    if m:
        print("Yes:", end=" ")
        # "ALL" means the match covers the entire input string
        if m.group(0) == s:
            print("ALL", end=" ")
    else:
        print("No:", end=" ")
    print(s)
Output is:
Yes: ALL a_12_1
Yes: ALL abc_12_134
Yes: ALL abcd_123_1.
Yes: ALL abcde_12_1
Yes: ALL a_123_123.456.7890.
Yes: ALL a_12_1
Yes: ALL ab_de_12_1
^(.{1,5})_(\d{2,3})_([\d.]*)$
works for your example. The result doesn't change whether you use a lazy quantifier or not.
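A short check of that claim, sketched in Python (the variable names are arbitrary). Because the pattern is anchored at both ends, backtracking leaves only one way to split this input, so the greedy and lazy versions capture the same groups.

```python
import re

sample = "to_to_123_12.56"

# Greedy: .{1,5} grabs as much as it can first, then backs off if needed.
greedy = re.match(r"^(.{1,5})_(\d{2,3})_([\d.]*)$", sample)
# Lazy: .{1,5}? starts with one character and grows until the rest fits.
lazy = re.match(r"^(.{1,5}?)_(\d{2,3})_([\d.]*)$", sample)

print(greedy.groups())
print(lazy.groups())
```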
While answering the comment (writing the lazy expression), I saw that I had made a mistake... if I simply use the following classic regex, it works:
^(.{1,5})_([\\d]{2,3})_([\\d\\.]*)$
Thank you.