Python regex: how to avoid matching past the first semicolon?

I'm about to write a regex to extract substrings. The string is:
ASP.NET_SessionId=frffcjcarie4dhxouz5yklwu;+BIGipServercapitaliq-ssl=3617221783.36895.0000;+ObSSOCookie=wkyQfn2Cyx2%2f7kSj4zBB886WaLs92Ord9FSf64c%2byHFOBwgEP4f3UmorDj051suQwRXAKEwBtYVKRYJuUGh2YNZtAj2%2bNp8asLIT9xQPqVktEAzkl3jNIv8MyWFsoFPDtm%2fTm1FeaCP%2bGTk9Oa%2fCNA0Hmy847qK2qo7%2bbziV%2bjeClbkGjAX3pgcPzfs%2bQp7p9BSjP1xJqUaUKwJ2%2flIgzZL5Ma%2bnJK8j%2b732ixNyIDNDGo7uIF%2b;+machineIdCookie=866873600;+userLoggedIn=jga;sdgjefdfdfs
I want to extract a substring beginning with ObSSOCookie=....; and ending just before the userLoggedIn.
I set my regex pattern
pattern = "ObSSOCookie=.*;"
But it continues to extract until the last semicolon (which includes the +machineIdCookie=866873600), rather than the first semicolon, which is what I want.
Is there a way to just extract up to the first semicolon? And I can't just use split by ";" cause this regex is actually to be used in a Logstash configuration file and there's no way to use python-style coding there...

You want to make your regex non-greedy
Instead of using this
* - zero or more
Use this
*? - zero or more (non-greedy)
Here's your expression (demo).
ObSSOCookie=(.*?;)
This is a general technique, also described in this answer.
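To see the difference between the greedy and non-greedy quantifier, here is a minimal sketch in Python. The cookie string is shortened for readability (the shortened values are my own stand-ins, not the original data), and the third pattern uses a negated character class as another way to stop at the first semicolon.

```python
import re

# Shortened, made-up stand-in for the cookie header in the question.
data = ('ASP.NET_SessionId=abc;+ObSSOCookie=wkyQfn2%2b;'
        '+machineIdCookie=866873600;+userLoggedIn=jga;sdgjefdfdfs')

# Greedy: .* runs to the end of the string, then backs up to the LAST semicolon.
greedy = re.search(r'ObSSOCookie=.*;', data).group(0)

# Non-greedy: .*? expands one character at a time and stops at the FIRST semicolon.
lazy = re.search(r'ObSSOCookie=.*?;', data).group(0)

# Negated class: [^;]* cannot cross a semicolon at all, so it stops just before it.
negated = re.search(r'ObSSOCookie=[^;]*', data).group(0)

print(greedy)
print(lazy)
print(negated)
```

The non-greedy match includes the trailing semicolon, while the negated class leaves it out.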

Why not just grab anything except the next ; like this (demo)
ObSSOCookie=([^;]*)
>>> import re
>>> data = 'ASP.NET_SessionId=frffcjcarie4dhxouz5yklwu;+BIGipServercapitaliq-ssl=3617221783.36895.0000;+ObSSOCookie=wkyQfn2Cyx2%2f7kSj4zBB886WaLs92Ord9FSf64c%2byHFOBwgEP4f3UmorDj051suQwRXAKEwBtYVKRYJuUGh2YNZtAj2%2bNp8asLIT9xQPqVktEAzkl3jNIv8MyWFsoFPDtm%2fTm1FeaCP%2bGTk9Oa%2fCNA0Hmy847qK2qo7%2bbziV%2bjeClbkGjAX3pgcPzfs%2bQp7p9BSjP1xJqUaUKwJ2%2flIgzZL5Ma%2bnJK8j%2b732ixNyIDNDGo7uIF%2b;+machineIdCookie=866873600;+userLoggedIn=jga;sdgjefdfdfs'
>>> p = re.compile('ObSSOCookie=([^;]*)')
>>> m = p.search(data)
>>> m.group(1)
'wkyQfn2Cyx2%2f7kSj4zBB886WaLs92Ord9FSf64c%2byHFOBwgEP4f3UmorDj051suQwRXAKEwBtYVKRYJuUGh2YNZtAj2%2bNp8asLIT9xQPqVktEAzkl3jNIv8MyWFsoFPDtm%2fTm1FeaCP%2bGTk9Oa%2fCNA0Hmy847qK2qo7%2bbziV%2bjeClbkGjAX3pgcPzfs%2bQp7p9BSjP1xJqUaUKwJ2%2flIgzZL5Ma%2bnJK8j%2b732ixNyIDNDGo7uIF%2b'

Related

Regex match last substring among same substrings in the string

For example we have a string:
asd/asd/asd/asd/1#s_
I need to match this part: /asd/1#s_ or asd/1#s_
How is it possible to do with plain regex?
I've tried a negative lookahead like this, but it didn't work:
\/(?:.(?!\/))?(asd)(\/(([\W\d\w]){1,})|)$
it matches this '/asd/asd/asd/asd/asd/asd/1#s_'
from this 'prefix/asd/asd/asd/asd/asd/asd/1#s_'
and I need to match '/asd/1#s_' without all preceding /asd/'s
Match should work with plain regex
Without any helper functions of any programming language
https://regexr.com/
I use this site to check if regex matches or not
here's the possible strings:
prefix/asd/asd/asd/1#s
prefix/asd/asd/asd/1s#
prefix/asd/asd/asd/s1#
prefix/asd/asd/asd/s#1
prefix/asd/asd/asd/#1s
prefix/asd/asd/asd/#s1
and asd part could be replaced with any word like
prefix/a1sd/a1sd/a1sd/1#s
prefix/a1sd/a1sd/a1sd/1s#
...
So I need to match last repeating part with everything to the right
And everything to the right could be character, not character, digit, in any order
A more complicated string example:
prefix/a1sd/a1sd/a1sd/1s#/ds/dsse/a1sd/22$$#!/123/321/asd
this should match that part:
/a1sd/22$$#!/123/321/asd
Try this one. This works in python.
import re
reg = re.compile(r"\/[a-z]{1,}\/\d+[#a-z_]{1,}")
s = "asd/asd/asd/asd/1#s_"
print(reg.findall(s))
# ['/asd/1#s_']
Update:
Since the question is unclear, this only works for the given order of characters; I suppose any other combination simply fails.
Edits:
New Regex
reg = r"\/\w+(\/\w*\d+\W*)*(\/\d+\w*\W*)*(\/\d+\W*\w*)*(\/\w*\W*\d+)*(\/\W*\d+\w*)*(\/\W*\w*\d+)*$"
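Another possible approach, assuming "last repeating part" means the last occurrence of a path segment that also appears earlier in the string: capture a segment, then use a backreference to find its final occurrence. This is only a sketch under that reading of the question; the names last_repeat and last_repeated_tail are made up for illustration, and the wanted text ends up in group 2 rather than in the whole match.

```python
import re

# /(\w+)(?=/)        a whole segment, captured as group 1
# .*                 skip ahead as far as possible
# (/\1/ ...)$        the same segment again, then everything up to the end,
#                    where (?:(?!/\1/).)* refuses to cross another /segment/
last_repeat = re.compile(r'/(\w+)(?=/).*(/\1/(?:(?!/\1/).)*)$')

def last_repeated_tail(s):
    """Return the last repeated segment plus everything after it, or None."""
    m = last_repeat.search(s)
    return m.group(2) if m else None

print(last_repeated_tail('asd/asd/asd/asd/1#s_'))
print(last_repeated_tail('prefix/a1sd/a1sd/a1sd/1s#/ds/dsse/a1sd/22$$#!/123/321/asd'))
```

If more than one segment repeats, the leftmost one is chosen, which may or may not be what is wanted.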

Regex match, return remaining rest of string

Simple regex function that matches the start of a string "Bananas: " and returns the second part. I've done the regex, but it's not the way I expected it to work:
import re

def return_name(s):
    m = re.match(r"^Bananas:\s?(.*)", s)
    if m:
        # print(m.group(0))
        # print(m.group(1))
        return m.group(1)

somestring = "Bananas: Gwen Stefani"  # Bananas: + name
print(return_name(somestring))  # Gwen Stefani - correct!
However, I'm convinced that you don't have to identify the group with (.*) in order to get the same result, i.e. match the first part of the string and return the remaining part. But I'm not sure how to do that.
Also, I read somewhere that you should be cautious about using .* in a regex.
You could use a lookbehind ((?<=)):
(?<=^Bananas:\s).*
Remember to use re.search instead of re.match as the latter will try to match at the start of the string (aka implicit ^).
As for the .* concerns - it can cause a lot of backtracking if you don't have a clear understanding of how regexes work, but in this case it is guaranteed to be a linear search.
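A minimal sketch of the lookbehind version in Python, using the sample string from the question. It also shows why re.match does not work here: at position 0 there is nothing behind the cursor for the lookbehind to check.

```python
import re

s = "Bananas: Gwen Stefani"

# The lookbehind asserts that "Bananas:" plus one whitespace character sits
# immediately before the match, without making it part of the match.
m = re.search(r"(?<=^Bananas:\s).*", s)
name = m.group(0) if m else None

# re.match only tries position 0, where the lookbehind cannot succeed.
no_match = re.match(r"(?<=^Bananas:\s).*", s)

print(name)
print(no_match)
```

Note that Python's re module only accepts fixed-width lookbehinds, so \s works here but \s? or \s+ would not.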
Using the alternative regular expression module "regex", you could use Perl's \K meta-character, which discards everything matched so far and only Keeps what follows.
I'm not really recommending this, I think your solution is good enough, and the lookbehind answer is also probably better than using another module just for that.

Regular expression in python re.findall()

I tried the following:
I want to split the string with re.findall()
str="<abc>somechars<*><def>somechars<*><ghj>somechars<*><ijk>somechars<*>"
print(re.findall('<(abc|ghj)>.*?<*>',str))
The output should be
['<abc>somechars<*>','<ghj>somechars<*>']
In Notepad this expression gives the right result, but here I get:
['abc', 'ghj']
Any idea?
Thanks for the answers.
(<(?:abc|ghj)>.*?<\*>)
Try this. See the demo:
http://regex101.com/r/kP8uF5/12
import re
p = re.compile(r'(<(?:abc|ghj)>.*?<\*>)', re.IGNORECASE | re.MULTILINE)
test_str = u"<abc>somechars<*><def>somechars<*><ghj>somechars<*><ijk>somechars<*>"
re.findall(p, test_str)
You're capturing (abc|ghj). Use a non-capturing group (?:abc|ghj) instead.
Also, you should escape the second * in your regex since you want a literal asterisk: <\*> rather than <*>.
>>> s = '<abc>somechars<*><def>somechars<*><ghj>somechars<*><ijk>somechars<*>'
>>> re.findall(r'<(?:abc|ghj)>.*?<\*>', s)
['<abc>somechars<*>', '<ghj>somechars<*>']
Also also, avoid shadowing the built-in name str.
Just make the group a non-capturing group:
str="<abc>somechars<*><def>somechars<*><ghj>somechars<*><ijk>somechars<*>"
print(re.findall(r'<(?:abc|ghj)>.*?<\*>',str))
The function returns the groups from left to right, and since you specified a group it left out the entire match.
From the Python documentation
Return all non-overlapping matches of pattern in string, as a list of
strings. The string is scanned left-to-right, and matches are returned
in the order found. If one or more groups are present in the pattern,
return a list of groups; this will be a list of tuples if the pattern
has more than one group. Empty matches are included in the result
unless they touch the beginning of another match.
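To make the quoted behaviour concrete, here is a small sketch showing what re.findall returns with zero, one and two capturing groups. The input is a shortened version of the string from the question, and the two-group pattern is my own addition for illustration.

```python
import re

s = '<abc>somechars<*><def>somechars<*><ghj>somechars<*>'

# No capturing group: findall returns the whole matches.
no_group = re.findall(r'<(?:abc|ghj)>.*?<\*>', s)

# One capturing group: findall returns only what that group captured.
one_group = re.findall(r'<(abc|ghj)>.*?<\*>', s)

# Two capturing groups: findall returns a list of tuples.
two_groups = re.findall(r'<(abc|ghj)>(.*?)<\*>', s)

print(no_group)
print(one_group)
print(two_groups)
```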

How to regex something after first whitespace(s)

I have a string like TEST123 DATA, i.e. two words separated by whitespace. How can I use a regex to get the right-hand part after the whitespace, DATA? I am new to this and hope someone can tell me how to do it. Any characters at the beginning should be skipped, including the first whitespace. I need everything after the first whitespace(s). Here are some example strings:
TEST_1 DATA
TEST DATA
123 DATA
and the result should be always "DATA".
Thanks
^\S*\s+(\S+)
matches the string from the beginning until the word after the first whitespace(s). Group 1 will then contain the string DATA (in your example).
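In Python, the capture-group version might be used like this. It is a minimal sketch: the sample strings come from the question, and the guard against a failed match is my own addition.

```python
import re

samples = ["TEST_1 DATA", "TEST DATA", "123 DATA"]

# ^\S*   optional non-whitespace at the start (the part to skip)
# \s+    the first run of whitespace
# (\S+)  the word that follows, captured as group 1
pattern = re.compile(r"^\S*\s+(\S+)")

results = []
for s in samples:
    m = pattern.search(s)
    results.append(m.group(1) if m else None)

print(results)
```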
If you only want to match DATA, and you have access to a Perl-compatible regex engine, you can use
^\S*\s+\K\S+
The \K token tells the regex engine to ignore all the text that has been matched so far.
See it live on regex101.com.
With a .NET regex engine, you can use a positive lookbehind assertion:
(?<=^\S*\s+)\S+
See it live on regexhero.net.
Starting from the end of the string, matching everything that isn't a whitespace character:
[^\s]*$
Gets all 3 DATA in the sample with global and multiline flags.
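A sketch of the same idea in Python, with the three sample lines joined into one multi-line string. Note one change that is mine, not the answer's: I use \S+ instead of [^\s]*, because a pattern that can match zero characters may also report empty matches at the end of each line in Python.

```python
import re

text = "TEST_1 DATA\nTEST DATA\n123 DATA"

# With re.MULTILINE, $ matches at the end of every line, so \S+$ picks up
# the last run of non-whitespace characters on each line.
last_words = re.findall(r"\S+$", text, flags=re.MULTILINE)

print(last_words)
```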
You can also try the following regex: (Python)
>>> import re
>>> s = "TEST_1 DATA"
>>> result = re.sub(r".*?(\w+)\s*$", r"\1", s)
>>> result
'DATA'

How to match a specific character only if it is followed by string containing specific characters

I'm trying to replace slashes in a string, but not all of them - only the ones before first comma. To do that, I probably have to find a way to match only slashes being followed by string containing a comma.
Is it possible to do this using one regexp, i.e. without first splitting the string by commas?
Example input string:
Abc1/Def2/Ghi3,/Dore1/Mifa2/Solla3,Sido4
What I want to get:
Abc1.Def2.Ghi3,/Dore1/Mifa2/Solla3,Sido4
I've tried some lookahead and lookbehind techniques with no effect, so currently to do this in e.g. Python I first split the data:
test = 'Abc1/Def2/Ghi3,/Dore1/Mifa2/Solla3,Sido4'
strlist = re.split(r',', test)
result = ','.join([re.sub(r'\/', r'.', strlist[0])] + strlist[1:])
What I would prefer is to use a specific regexp pattern instead of Python-oriented solution though, so essentially I could have a pattern and replacement such that the following code would give me the same result:
result = re.sub(pattern, replacement, test)
Thanks for all regex-avoiding answers - I was wondering if I could do this using only regex (so e.g. I could use sed instead of Python).
item = 'Abc1/Def2/Ghi3,/Dore1/Mifa2/Solla3,Sido4'
print(item.replace("/", ".", item.count("/", 0, item.index(","))))
This will print what you need. Prefer plain string methods where they are enough; they are usually simpler and often faster than a regex.
You could do this with lookbehind expressions that look for both the beginning of the string and no comma. Or skip re entirely.
s = 'Abc1/Def2/Ghi3,/Dore1/Mifa2/Solla3,Sido4'
left,sep,right = s.partition(',')
sep.join((left.replace('/','.'),right))
Out[24]: 'Abc1.Def2.Ghi3,/Dore1/Mifa2/Solla3,Sido4'
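For a regex-based version, here are two possible sketches. Neither is a single pattern/replacement pair in the strict sense the question asks for, because Python's re lookbehind must be fixed-width: the first passes a function as the replacement, the second applies a plain pattern and replacement string repeatedly (similar in spirit to a substitution loop in sed). The names first_slash and dots_before_first_comma are made up for illustration.

```python
import re

test = 'Abc1/Def2/Ghi3,/Dore1/Mifa2/Solla3,Sido4'

# Option 1: match everything before the first comma and rewrite the slashes
# inside that match with a replacement function.
# Caveat: if the string has no comma at all, every slash is replaced.
prefix_result = re.sub(r'^[^,]*',
                       lambda m: m.group(0).replace('/', '.'),
                       test, count=1)

# Option 2: a plain pattern and replacement string, applied until nothing
# changes. Each pass replaces the first slash that has no comma or slash
# before it and at least one comma somewhere after it.
first_slash = re.compile(r'^([^,/]*)/(?=[^,]*,)')

def dots_before_first_comma(s):
    while True:
        s, n = first_slash.subn(r'\1.', s)
        if n == 0:
            return s

loop_result = dots_before_first_comma(test)

print(prefix_result)
print(loop_result)
```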