Regex match, return remaining rest of string - regex

Simple regex function that matches the start of a string "Bananas: " and returns the second part. I've done the regex, but it's not the way I expected it to work:
import re
def return_name(s):
    m = re.match(r"^Bananas:\s?(.*)", s)
    if m:
        # print(m.group(0))
        # print(m.group(1))
        return m.group(1)
somestring = "Bananas: Gwen Stefani"  # Bananas: + name
print(return_name(somestring))  # Gwen Stefani - correct!
However, I'm convinced that you don't have to identify the group with (.*) in order to get the same result, i.e. match the first part of the string and return the remaining part. But I'm not sure how to do that.
Also, I read somewhere that you should be cautious about using .* in a regex.

You could use a lookbehind ((?<=)):
(?<=^Bananas:\s).*
Remember to use re.search instead of re.match as the latter will try to match at the start of the string (aka implicit ^).
As for the .* concerns - it can cause a lot of backtracking if you don't have a clear understanding of how regexes work, but in this case it is guaranteed to be a linear search.
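A minimal sketch of the lookbehind approach in Python, assuming the standard re module. The function name return_name is borrowed from the question. Note that Python's re only accepts fixed-width lookbehinds, so the optional \s? from the question has to become a mandatory \s here.

```python
import re

# Lookbehind: assert that "Bananas:" plus one whitespace character
# precedes the match, without making it part of the match.
PATTERN = re.compile(r"(?<=^Bananas:\s).*")

def return_name(s):
    """Return the text after 'Bananas: ', or None if the prefix is absent."""
    m = PATTERN.search(s)
    return m.group(0) if m else None

print(return_name("Bananas: Gwen Stefani"))     # Gwen Stefani
print(return_name("Apples: Gwen Stefani"))      # None
# re.match only tries position 0, where nothing precedes the match,
# so the lookbehind can never succeed there.
print(PATTERN.match("Bananas: Gwen Stefani"))   # None
```

The whole match (group 0) is now the name itself, so no capturing group is needed.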

Using the alternative regular expression module "regex", you could use Perl's \K meta-character, which discards everything matched so far and Keeps only what follows.
I'm not really recommending this; I think your solution is good enough, and the lookbehind answer is probably also better than pulling in another module just for that.

Related

Regex string with 3 or more vowels

I'm trying to make a regular expression that matches a String with 3 or more vowels.
I've tried this one:
[aeiou]{3,}
But it only works when the vowels are in a sequence. Any tips ?
For example:
Samuel -> valid
Joan -> invalid
Sol Manuel -> valid
Sol -> Invalid
There are several ways to do it, and in this case keeping it simple will probably be the most helpful to future devs maintaining that code. That's the fun part about regexes: you can make them very efficient and clever, and then very hard to update for somebody who doesn't write them often.
import re
regex = "[aeiou].*[aeiou].*[aeiou]"
mylist = [
    "Samuel",      # yes!
    "JOAN",        # no!
    "Sol Manuel",  # yes!
    "",            # no!
]
for text in mylist:
    if re.search(regex, text, re.IGNORECASE):
        print("Winner!")
    else:
        print("Nein!")
You could also adjust each part to be [aeiouAEIOU] if you don't have an ignore case flag in your language of choice. Good luck! :)
Just use
(\w*[aeuio]\w*){3,}
or, if you want to match the whole line,
^(.*[aeuio].*){3,}$
This can be achieved using lookaheads like this.
Regex: ^(?=.*[aeiou].*[aeiou].*[aeiou])(?:[a-z] *)+$
Explanation:
(?=.*[aeiou].*[aeiou].*[aeiou]) is a positive lookahead that checks that the string contains at least three vowels, with any characters allowed before and between them.
(?:[a-z] *)+ matches one or more English words separated by spaces.
If case-insensitive mode is off, use the following regex:
Regex: ^(?=.*[aeiouAEIOU].*[aeiouAEIOU].*[aeiouAEIOU])(?:[a-zA-Z] *)+$
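As a rough check of the lookahead pattern, here is a small Python sketch that uses re.IGNORECASE in place of the explicit upper-case classes. The helper name has_three_vowels and the extra input "Sam3uel" are assumptions added for illustration.

```python
import re

# The lookahead requires three vowels anywhere in the string;
# the body then allows only letters and spaces up to the end.
THREE_VOWELS = re.compile(
    r"^(?=.*[aeiou].*[aeiou].*[aeiou])(?:[a-z] *)+$",
    re.IGNORECASE,
)

def has_three_vowels(name):
    """True if name has at least three vowels and only letters and spaces."""
    return THREE_VOWELS.match(name) is not None

for name in ["Samuel", "Joan", "Sol Manuel", "Sol", "Sam3uel"]:
    print(name, has_three_vowels(name))
```

"Sam3uel" passes the lookahead but is rejected by the letters-and-spaces body, which shows that this pattern validates the whole string rather than only counting vowels.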
Try this pattern:
^.*[AEIOUaeiou].*[AEIOUaeiou].*[AEIOUaeiou].*$
We could also use a positive lookahead:
^(?=.*[AEIOUaeiou].*[AEIOUaeiou].*[AEIOUaeiou]).*$
Note that due to the possibility of backtracking I would probably prefer using the first (non lookahead) pattern because it should be more efficient.
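I tried this, using help from sniperd's answer:
import re
def multi_vowel_words(text):
    # \w* (not \w+) at both ends, so words that start or end with a vowel also match
    pattern = r"\w*[aeiou]\w*[aeiou]\w*[aeiou]\w*"
    result = re.findall(pattern, text, re.IGNORECASE)
    return result
With re.IGNORECASE this also works with uppercase vowels.
If your text contains digits or underscores that should not count as part of a word, use [a-zA-Z] instead of \w.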

python regex how to avoid match multiple semicolon?

I'm about to write a regex to extract substrings. The string is:
ASP.NET_SessionId=frffcjcarie4dhxouz5yklwu;+BIGipServercapitaliq-ssl=3617221783.36895.0000;+ObSSOCookie=wkyQfn2Cyx2%2f7kSj4zBB886WaLs92Ord9FSf64c%2byHFOBwgEP4f3UmorDj051suQwRXAKEwBtYVKRYJuUGh2YNZtAj2%2bNp8asLIT9xQPqVktEAzkl3jNIv8MyWFsoFPDtm%2fTm1FeaCP%2bGTk9Oa%2fCNA0Hmy847qK2qo7%2bbziV%2bjeClbkGjAX3pgcPzfs%2bQp7p9BSjP1xJqUaUKwJ2%2flIgzZL5Ma%2bnJK8j%2b732ixNyIDNDGo7uIF%2b;+machineIdCookie=866873600;+userLoggedIn=jga;sdgjefdfdfs
I want to extract a substring beginning with ObSSOCookie=....; and ending just before the userLoggedIn.
I set my regex pattern
pattern = "ObSSOCookie=.*;"
But it continues to extract until the last semicolon (which includes the +machineIdCookie=866873600), rather than the first semicolon, which is what I want.
Is there a way to extract only up to the first semicolon? I can't just split on ";" because this regex is going to be used in a Logstash configuration file, where there's no way to use Python-style code.
You want to make your quantifier non-greedy (lazy).
Instead of using this
* - zero or more (greedy)
Use this
*? - zero or more (non-greedy)
Here's your expression:
ObSSOCookie=(.*?;)
This is a general technique.
Why not just grab anything except the next ; like this:
ObSSOCookie=([^;]*)
>>> import re
>>> data = 'ASP.NET_SessionId=frffcjcarie4dhxouz5yklwu;+BIGipServercapitaliq-ssl=3617221783.36895.0000;+ObSSOCookie=wkyQfn2Cyx2%2f7kSj4zBB886WaLs92Ord9FSf64c%2byHFOBwgEP4f3UmorDj051suQwRXAKEwBtYVKRYJuUGh2YNZtAj2%2bNp8asLIT9xQPqVktEAzkl3jNIv8MyWFsoFPDtm%2fTm1FeaCP%2bGTk9Oa%2fCNA0Hmy847qK2qo7%2bbziV%2bjeClbkGjAX3pgcPzfs%2bQp7p9BSjP1xJqUaUKwJ2%2flIgzZL5Ma%2bnJK8j%2b732ixNyIDNDGo7uIF%2b;+machineIdCookie=866873600;+userLoggedIn=jga;sdgjefdfdfs'
>>> p = re.compile('ObSSOCookie=([^;]*)')
>>> m = p.search(data)
>>> m.group(1)
'wkyQfn2Cyx2%2f7kSj4zBB886WaLs92Ord9FSf64c%2byHFOBwgEP4f3UmorDj051suQwRXAKEwBtYVKRYJuUGh2YNZtAj2%2bNp8asLIT9xQPqVktEAzkl3jNIv8MyWFsoFPDtm%2fTm1FeaCP%2bGTk9Oa%2fCNA0Hmy847qK2qo7%2bbziV%2bjeClbkGjAX3pgcPzfs%2bQp7p9BSjP1xJqUaUKwJ2%2flIgzZL5Ma%2bnJK8j%2b732ixNyIDNDGo7uIF%2b'
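To compare the three behaviours side by side, here is a short Python sketch on a shortened, hypothetical cookie string (the value abc%2fdef is made up for illustration). In the lazy variant the semicolon is kept outside the capturing group, a small deviation from the expression shown above.

```python
import re

# Hypothetical, shortened cookie string for illustration only.
data = "a=1;+ObSSOCookie=abc%2fdef;+machineIdCookie=866873600;+userLoggedIn=jga;x"

# Greedy: .* runs to the end, then backtracks to the LAST semicolon.
greedy = re.search(r"ObSSOCookie=(.*);", data).group(1)
# Lazy: .*? expands one character at a time and stops at the FIRST semicolon.
lazy = re.search(r"ObSSOCookie=(.*?);", data).group(1)
# Negated class: never crosses a semicolon, so no backtracking is needed.
negated = re.search(r"ObSSOCookie=([^;]*)", data).group(1)

print(greedy)   # abc%2fdef;+machineIdCookie=866873600;+userLoggedIn=jga
print(lazy)     # abc%2fdef
print(negated)  # abc%2fdef
```

The lazy and negated-class forms give the same result here. The negated class is usually the more explicit of the two, since it states directly what must not appear inside the value.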

regular expression replacement of numbers

Using regular expression how do I replace 1,186.55 with 1186.55?
My search string is
\b[1-9],[0-9][0-9][0-9].[0-9][0-9]
which works fine. I just can't seem to get the replacement part to work.
Your question is light on details, so I'll try to answer as generally as possible.
You can shorten the regex a bit by using quantifiers. While you're at it, escape the dot: an unescaped . matches any character, not just a decimal point. As a first step I would write
\b[1-9],[0-9]{3}\.[0-9]{2}
Most probably you can also replace [0-9] with \d, which is also more readable IMO.
\b\d,\d{3}\.\d{2}
Now we can move on to the replacement part. Here you need to store the parts you want to keep. You do that by putting them into capturing groups, i.e. by placing parentheses around them. This would be your search pattern:
\b(\d),(\d{3}\.\d{2})
Now you can access the matched content of those capturing groups in the replacement string. Groups are numbered by their opening parenthesis: the first opening parenthesis starts group 1, the second starts group 2, and so on.
There are two common syntaxes for referring to a group in the replacement string, depending on your language or tool: \1 or $1.
Your replacement string would then be
\1\2
OR
$1$2
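As an illustration, this is how the capture-and-replace step might look in Python, which uses the \1 style in replacement strings. The helper name strip_thousands_comma is an assumption, and the dot is escaped so that it only matches a literal decimal point.

```python
import re

# One digit, a comma, three digits, a literal dot, two digits.
# The two groups keep everything except the comma.
PATTERN = re.compile(r"\b(\d),(\d{3}\.\d{2})")

def strip_thousands_comma(text):
    """Rewrite numbers like 1,186.55 as 1186.55; leave other text alone."""
    return PATTERN.sub(r"\1\2", text)

print(strip_thousands_comma("Total: 1,186.55"))  # Total: 1186.55
```

Because of the \b and the single \d before the comma, a value such as 12,186.55 is left untouched by this sketch. The pattern would need \d{1,3} in the first group to cover it.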
Python:
def repl(initstr, unwanted=','):
    res = set(unwanted)
    return ''.join(r for r in initstr if r not in res)
Using regular expressions:
from re import compile
regex = compile(r'([\d\.])')
print(''.join(regex.findall('1,186.55')))
Using str.split() method:
num = '1,186.55'
print(''.join(num.split(',')))
Using str.replace() method:
num = '1,186.55'
print(num.replace(',', ''))
If you just want to remove the comma, you can do this in C#:
str.Replace(",", "");
(In Java the method is replace, with a lowercase r.)
Or in Perl:
s/(\d+),(\d+)/$1$2/

How to ignore whitespace in a regular expression subject string?

Is there a simple way to ignore the white space in a target string when searching for matches using a regular expression pattern? For example, if my search is for "cats", I would want "c ats" or "ca ts" to match. I can't strip out the whitespace beforehand because I need to find the begin and end index of the match (including any whitespace) in order to highlight that match and any whitespace needs to be there for formatting purposes.
You can stick optional whitespace characters \s* in between every other character in your regex. Although granted, it will get a bit lengthy.
/cats/ -> /c\s*a\s*t\s*s/
While the accepted answer is technically correct, a more practical approach, if possible, is to just strip whitespace out of both the regular expression and the search string.
If you want to search for "my cats", instead of:
myString.match(/m\s*y\s*c\s*a\s*t\s*s\s*/g)
Just do:
myString.replace(/\s*/g,"").match(/mycats/g)
Warning: You can't automate this on the regular expression by just replacing all spaces with empty strings because they may occur in a negation or otherwise make your regular expression invalid.
Addressing Steven's comment to Sam Dufel's answer
Thanks, sounds like that's the way to go. But I just realized that I only want the optional whitespace characters if they follow a newline. So for example, "c\n ats" or "ca\n ts" should match. But wouldn't want "c ats" to match if there is no newline. Any ideas on how that might be done?
This should do the trick:
/c(?:\n\s*)?a(?:\n\s*)?t(?:\n\s*)?s/
See this page for all the different variations of 'cats' that this matches.
You can also solve this using conditionals, but they are not supported in the javascript flavor of regex.
You could put \s* in between every character in your search string, so if you were looking for cats you would use c\s*a\s*t\s*s
It's long but you could build the string dynamically of course.
You can see it working here: http://www.rubular.com/r/zzWwvppSpE
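Building the pattern dynamically could look roughly like this in Python. The helper name whitespace_tolerant is hypothetical. re.escape is applied per character so that regex metacharacters in the search word are treated literally.

```python
import re

def whitespace_tolerant(word):
    """Build a pattern allowing optional whitespace between the characters of word."""
    return re.compile(r"\s*".join(re.escape(ch) for ch in word))

pattern = whitespace_tolerant("cats")
print(pattern.pattern)            # c\s*a\s*t\s*s

m = pattern.search("two c ats here")
# span() gives begin and end indices in the ORIGINAL string, whitespace included.
print(m.span(), repr(m.group(0)))  # (4, 9) 'c ats'
```

Since the search runs on the unmodified string, the returned indices can be used directly for highlighting.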
If you only want to allow spaces, then
\bc *a *t *s\b
should do it. To also allow tabs, use
\bc[ \t]*a[ \t]*t[ \t]*s\b
Remove the \b anchors if you also want to find cats within words like bobcats or catsup.
This approach can be used to automate this (the following example is in Python, but it can be ported to any language):
You can strip the whitespace beforehand and save the positions of the non-whitespace characters, so that you can later map the match boundaries back to positions in the original string:
import re
def regex_search_ignore_space(regex, string):
    no_spaces = ''
    char_positions = []
    for pos, char in enumerate(string):
        if re.match(r'\S', char):  # upper \S matches non-whitespace chars
            no_spaces += char
            char_positions.append(pos)
    match = re.search(regex, no_spaces)
    if not match:
        return match
    if match.start() == match.end():
        return ''  # empty match: nothing to map back
    # match.start() and match.end() are indices of start and end
    # of the found string in the spaceless string
    # (as we have searched in it).
    start = char_positions[match.start()]  # in the original string
    # match.end() is exclusive, so map the last matched char and add 1
    end = char_positions[match.end() - 1] + 1  # in the original string
    matched_string = string[start:end]
    # the match WITH spaces is returned.
    return matched_string
with_spaces = 'a li on and a cat'
print(regex_search_ignore_space('lion', with_spaces))
# prints 'li on'
If you want to go further you can construct the match object and return it instead, so the use of this helper will be more handy.
And the performance of this function can of course also be optimized, this example is just to show the path to a solution.
The accepted answer will not work if and when you are passing a dynamic value (such as the current value in an array loop) as the regex test value. You would not be able to insert the optional whitespace without ending up with a really ugly regex.
Konrad Hoffner's solution is therefore better in such cases, as it strips whitespace from both the regex and the test string. The test is conducted as though neither contains whitespace.

Need regexp to find substring between two tokens

I suspect this has already been answered somewhere, but I can't find it, so...
I need to extract a string from between two tokens in a larger string, in which the second token will probably appear again meaning... (pseudo code...)
myString = "A=abc;B=def_3%^123+-;C=123;" ;
myB = getInnerString(myString, "B=", ";" ) ;
method getInnerString(inStr, startToken, endToken){
return inStr.replace( EXPRESSION, "$1");
}
so, when I run this using expression ".+B=(.+);.+"
I get "def_3%^123+-;C=123;" presumably because it just looks for the LAST instance of ';' in the string, rather than stopping at the first one it comes to.
I've tried using (?=) in search of that first ';' but it gives me the same result.
I can't seem to find a regExp reference that explains how one can specify the "NEXT" token rather than the one at the end.
any and all help greatly appreciated.
Similar question on SO:
Regex: To pull out a sub-string between two tags in a string
Regex to replace all \n in a String, but no those inside [code] [/code] tag
Replace patterns that are inside delimiters using a regular expression call
RegEx matching HTML tags and extracting text
You're using a greedy pattern by not specifying the ? in it. Try this:
".+B=(.+?);.+"
Try this:
B=([^;]+);
This captures one or more characters after B= that are not a ;, so the match stops at the first ; after B=.
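A possible Python version of the getInnerString helper from the question, built on the negated-class idea. The name get_inner_string and the choice to return None on no match are assumptions, and the sketch only works when the end token is a single character.

```python
import re

def get_inner_string(text, start_token, end_token):
    """Return the text between start_token and the next end_token, or None."""
    # Assumes end_token is a single character, so it fits in a negated class.
    pattern = (
        re.escape(start_token)
        + "([^" + re.escape(end_token) + "]+)"
        + re.escape(end_token)
    )
    m = re.search(pattern, text)
    return m.group(1) if m else None

my_string = "A=abc;B=def_3%^123+-;C=123;"
print(get_inner_string(my_string, "B=", ";"))  # def_3%^123+-
```

For multi-character end tokens a lazy quantifier such as (.+?) followed by the escaped token would be the closer fit.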
(This is a continuation of the conversation from the comments to Evan's answer.)
Here's what happens when your (corrected) regex is applied: First, the .+ matches the whole string. Then it backtracks, giving up most of the characters it just matched until it gets to the point where the B= can match. Then the (.+?) matches (and captures) everything it sees until the next part, the semicolon, can match. Then the final .+ gobbles up the remaining characters.
All you're really interested in is the "B=" and the ";" and whatever's between them, so why match the rest of the string? The only reason you have to do that is so you can replace the whole string with the contents of the capturing group. But why bother doing that if you can access contents of the group directly? Here's a demonstration (in Java, because I can't tell what language you're using):
String s = "A=abc;B=def_3%^123+-;C=123;";
Pattern p = Pattern.compile("B=(.*?);");
Matcher m = p.matcher(s);
if (m.find())
{
    System.out.println(m.group(1));
}
Why do a 'replace' when a 'find' is so much more straightforward? Probably because your API makes it easier; that's why we do it in Java. Java has several regex-oriented convenience methods in its String class: replaceAll(), replaceFirst(), split(), and matches() (which returns true iff the regex matches the whole string), but not find(). And there's no convenience method for accessing capturing groups, either. We can't match the elegance of Perl one-liners like this:
print $1 if 'A=abc;B=def_3%^123+-;C=123;' =~ /B=(.*?);/;
...so we content ourselves with hacks like this:
System.out.println("A=abc;B=def_3%^123+-;C=123;"
    .replaceFirst(".+B=(.*?);.+", "$1"));
Just to be clear, I'm not saying not to use these hacks, or that there's anything wrong with Evan's answer--there isn't. I just think we should understand why we use them, and what trade-offs we're making when we do.
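One concrete trade-off can be sketched in Python, using re.search for the "find" style and re.sub for the "replace" style. The function names via_search and via_replace are hypothetical. The replace approach silently returns the whole input when the pattern does not match, while the find approach can report the miss.

```python
import re

def via_search(s):
    """'Find' style: read the capturing group directly."""
    m = re.search(r"B=(.*?);", s)
    return m.group(1) if m else None

def via_replace(s):
    """'Replace' style: swap the whole string for the group's contents."""
    return re.sub(r".+B=(.*?);.+", r"\1", s)

s = "A=abc;B=def_3%^123+-;C=123;"
print(via_search(s))    # def_3%^123+-
print(via_replace(s))   # def_3%^123+-

# No B= at all: search reports None, replace hands back the input unchanged.
print(via_search("A=abc;"))    # None
print(via_replace("A=abc;"))   # A=abc;

# B= is the last field: the trailing .+ has nothing left to match.
print(via_search("A=abc;B=def;"))   # def
print(via_replace("A=abc;B=def;"))  # A=abc;B=def;
```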