Regular expression in python re.findall() - regex

I tried the following.
I want to split the string with re.findall():
str="<abc>somechars<*><def>somechars<*><ghj>somechars<*><ijk>somechars<*>"
print(re.findall('<(abc|ghj)>.*?<*>',str))
The output should be
['<abc>somechars<*>','<ghj>somechars<*>']
In notepad, this expression gives the right result, but here I get:
['abc', 'ghj']
Any idea?
Thanks for the answers.

(<(?:abc|ghj)>.*?<\*>)
Try this. See demo.
http://regex101.com/r/kP8uF5/12
import re
p = re.compile(r'(<(?:abc|ghj)>.*?<\*>)', re.IGNORECASE | re.MULTILINE)  # the ur'' prefix is a syntax error in Python 3
test_str = u"<abc>somechars<*><def>somechars<*><ghj>somechars<*><ijk>somechars<*>"
re.findall(p, test_str)

You're capturing (abc|ghj). Use a non-capturing group (?:abc|ghj) instead.
Also, you should escape the second * in your regex since you want a literal asterisk: <\*> rather than <*>.
>>> s = '<abc>somechars<*><def>somechars<*><ghj>somechars<*><ijk>somechars<*>'
>>> re.findall(r'<(?:abc|ghj)>.*?<\*>', s)
['<abc>somechars<*>', '<ghj>somechars<*>']
Also also, avoid shadowing the built-in name str.
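To make the difference concrete, here is a minimal sketch that runs both patterns on the string from the question. The variable names text, captured and whole are only illustrative.

```python
import re

text = "<abc>somechars<*><def>somechars<*><ghj>somechars<*><ijk>somechars<*>"

# Capturing group: findall returns only what the group matched.
captured = re.findall(r"<(abc|ghj)>.*?<\*>", text)

# Non-capturing group: findall returns each whole match.
whole = re.findall(r"<(?:abc|ghj)>.*?<\*>", text)

print(captured)  # ['abc', 'ghj']
print(whole)     # ['<abc>somechars<*>', '<ghj>somechars<*>']
```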

Just make the group a non-capturing group:
str="<abc>somechars<*><def>somechars<*><ghj>somechars<*><ijk>somechars<*>"
print(re.findall(r'<(?:abc|ghj)>.*?<\*>',str))
When the pattern contains a group, findall returns the groups (from left to right) instead of the entire match, which is why only the group's text came back.
From the Python documentation
Return all non-overlapping matches of pattern in string, as a list of
strings. The string is scanned left-to-right, and matches are returned
in the order found. If one or more groups are present in the pattern,
return a list of groups; this will be a list of tuples if the pattern
has more than one group. Empty matches are included in the result
unless they touch the beginning of another match.
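If the capturing group has to stay in the pattern for some reason, re.finditer may be an alternative, since each match object exposes both the whole match and the groups. A small sketch, assuming the same input as in the question:

```python
import re

text = "<abc>somechars<*><def>somechars<*><ghj>somechars<*><ijk>somechars<*>"
pattern = re.compile(r"<(abc|ghj)>.*?<\*>")

# group(0) is always the entire match; group(1) is the captured tag name.
pairs = [(m.group(1), m.group(0)) for m in pattern.finditer(text)]
print(pairs)
# [('abc', '<abc>somechars<*>'), ('ghj', '<ghj>somechars<*>')]
```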

Related

Regular expression to match closest tag above specific word (HLS media playlist)

Given a HLS media playlist as follows:
#EXTM3U
#EXT-X-VERSION:3
#EXT-X-ALLOW-CACHE:NO
#EXT-X-TARGETDURATION:7
#EXT-X-MEDIA-SEQUENCE:0
#EXT-X-PROGRAM-DATE-TIME:2022-09-12T10:03:22.621+02:00
#EXTINF:6.666666667,
seg1.ts
#EXT-X-PROGRAM-DATE-TIME:2022-09-12T10:03:29.637+02:00
#EXTINF:6.666666667,
seg2.ts
#EXT-X-PROGRAM-DATE-TIME:2022-09-12T10:03:36.583+02:00
#EXTINF:6.666666666,
seg3.ts
I want to create a regular expression to match the datetime following the EXT-X-PROGRAM-DATE-TIME tag closest to a specified .ts file name. For example, I want to be able to retrieve the datetime 2022-09-12T10:03:29.637+02:00, by specifying that the match should end with seg2.ts. It should work even if new tags are added in between the file name and the EXT-X-PROGRAM-DATE-TIME tag in the future.
This pattern (EXT-X-PROGRAM-DATE-TIME:(.*)[\s\S]*?seg2.ts) is my best effort so far, but I can't figure out how to make the match start at the last possible EXT-X-PROGRAM-DATE-TIME tag. The lazy quantifier did not help. The group that is currently captured is the datetime following the first EXT-X-PROGRAM-DATE-TIME, i.e. 2022-09-12T10:03:22.621+02:00.
I also looked at using negative lookahead, but I can't figure out how to combine that with matching a variable number of characters and whitespaces before the seg2.ts.
I'm sure this has been answered before in another context, but I just can't find the right search terms.
We can use re.search here along with a regex tempered dot trick:
#Python 2.7.17
import re
inp = """#EXTM3U
#EXT-X-VERSION:3
#EXT-X-ALLOW-CACHE:NO
#EXT-X-TARGETDURATION:7
#EXT-X-MEDIA-SEQUENCE:0
#EXT-X-PROGRAM-DATE-TIME:2022-09-12T10:03:22.621+02:00
#EXTINF:6.666666667,
seg1.ts
#EXT-X-PROGRAM-DATE-TIME:2022-09-12T10:03:29.637+02:00
#EXTINF:6.666666667,
seg2.ts
#EXT-X-PROGRAM-DATE-TIME:2022-09-12T10:03:36.583+02:00
#EXTINF:6.666666666,
seg3.ts"""
match = re.search(r'#EXT-X-PROGRAM-DATE-TIME:(\S+)(?:(?!EXT-X-PROGRAM-DATE-TIME).)*\bseg2\.ts', inp, flags=re.S)
if match:
    print(match.group(1)) # 2022-09-12T10:03:29.637+02:00
Here is an explanation of the regex pattern:
#EXT-X-PROGRAM-DATE-TIME:
(\S+) match and capture the timestamp
(?:(?!EXT-X-PROGRAM-DATE-TIME).)* match all content WITHOUT crossing the next section
\bseg2\.ts match the filename
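The same tempered-dot pattern can be wrapped in a helper so the file name is a parameter. This is only a sketch: program_date_time_for is a hypothetical name, and re.escape is used on the assumption that the file name should be matched literally.

```python
import re

def program_date_time_for(playlist, segment_name):
    """Return the datetime of the closest EXT-X-PROGRAM-DATE-TIME tag above segment_name, or None."""
    pattern = (
        r"#EXT-X-PROGRAM-DATE-TIME:(\S+)"        # capture the timestamp
        r"(?:(?!EXT-X-PROGRAM-DATE-TIME).)*"     # never cross another date-time tag
        r"\b" + re.escape(segment_name)          # literal file name
    )
    match = re.search(pattern, playlist, flags=re.S)
    return match.group(1) if match else None

playlist = "\n".join([
    "#EXTM3U",
    "#EXT-X-PROGRAM-DATE-TIME:2022-09-12T10:03:22.621+02:00",
    "#EXTINF:6.666666667,",
    "seg1.ts",
    "#EXT-X-PROGRAM-DATE-TIME:2022-09-12T10:03:29.637+02:00",
    "#EXTINF:6.666666667,",
    "seg2.ts",
    "#EXT-X-PROGRAM-DATE-TIME:2022-09-12T10:03:36.583+02:00",
    "#EXTINF:6.666666666,",
    "seg3.ts",
])

print(program_date_time_for(playlist, "seg2.ts"))  # 2022-09-12T10:03:29.637+02:00
print(program_date_time_for(playlist, "seg9.ts"))  # None
```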
You could write the pattern so that it does not cross lines that start with a seg file name, and then match seg2.ts:
^#EXT-X-PROGRAM-DATE-TIME:(.*)(?:\n(?!seg\d+\.ts$).*)*\nseg2\.ts$
^ Start of a line (the pattern is used with the multiline flag)
#EXT-X-PROGRAM-DATE-TIME: Match literally
(.*) Capture group 1, match the rest of the line (note that this can also match an empty string)
(?:\n(?!seg\d+\.ts$).*)* Match all lines that do not start with the seg pattern
\nseg2\.ts Match a newline and seg2.ts
$ End of a line
Regex demo
import re
pattern = r"^#EXT-X-PROGRAM-DATE-TIME:(.*)(?:\n(?!seg\d+\.ts$).*)*\nseg2\.ts$"
s = ("#EXTM3U\n"
"#EXT-X-VERSION:3\n"
"#EXT-X-ALLOW-CACHE:NO\n"
"#EXT-X-TARGETDURATION:7\n"
"#EXT-X-MEDIA-SEQUENCE:0\n\n"
"#EXT-X-PROGRAM-DATE-TIME:2022-09-12T10:03:22.621+02:00\n"
"#EXTINF:6.666666667,\n"
"seg1.ts\n"
"#EXT-X-PROGRAM-DATE-TIME:2022-09-12T10:03:29.637+02:00\n"
"#EXTINF:6.666666667,\n"
"seg2.ts\n"
"#EXT-X-PROGRAM-DATE-TIME:2022-09-12T10:03:36.583+02:00\n"
"#EXTINF:6.666666666,\n"
"seg3.ts")
m = re.search(pattern, s, re.M)
if m:
    print(m.group(1))
Output
2022-09-12T10:03:29.637+02:00
If you also do not want the match to cross another #EXT-X-PROGRAM-DATE-TIME line in between, you can add that as an alternative in the negative lookahead:
^#EXT-X-PROGRAM-DATE-TIME:(.*)(?:\n(?!seg\d+\.ts\b|#EXT-X-PROGRAM-DATE-TIME:).*)*\nseg2\.ts$
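As a quick check of the requirement that extra tags may appear between the date-time tag and the file name, the sketch below runs this last pattern on a shortened playlist. The #EXT-X-DISCONTINUITY line is placed there purely for illustration and is not part of the original playlist.

```python
import re

pattern = (
    r"^#EXT-X-PROGRAM-DATE-TIME:(.*)"
    r"(?:\n(?!seg\d+\.ts\b|#EXT-X-PROGRAM-DATE-TIME:).*)*"
    r"\nseg2\.ts$"
)

# Shortened playlist with one extra tag line before seg2.ts (illustrative only).
s = "\n".join([
    "#EXTM3U",
    "#EXT-X-PROGRAM-DATE-TIME:2022-09-12T10:03:22.621+02:00",
    "#EXTINF:6.666666667,",
    "seg1.ts",
    "#EXT-X-PROGRAM-DATE-TIME:2022-09-12T10:03:29.637+02:00",
    "#EXT-X-DISCONTINUITY",
    "#EXTINF:6.666666667,",
    "seg2.ts",
    "#EXT-X-PROGRAM-DATE-TIME:2022-09-12T10:03:36.583+02:00",
    "#EXTINF:6.666666666,",
    "seg3.ts",
])

m = re.search(pattern, s, re.M)
found = m.group(1) if m else None
print(found)  # 2022-09-12T10:03:29.637+02:00
```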

python regex how to avoid match multiple semicolon?

I'm about to write a regex to extract substrings. The string is:
ASP.NET_SessionId=frffcjcarie4dhxouz5yklwu;+BIGipServercapitaliq-ssl=3617221783.36895.0000;+ObSSOCookie=wkyQfn2Cyx2%2f7kSj4zBB886WaLs92Ord9FSf64c%2byHFOBwgEP4f3UmorDj051suQwRXAKEwBtYVKRYJuUGh2YNZtAj2%2bNp8asLIT9xQPqVktEAzkl3jNIv8MyWFsoFPDtm%2fTm1FeaCP%2bGTk9Oa%2fCNA0Hmy847qK2qo7%2bbziV%2bjeClbkGjAX3pgcPzfs%2bQp7p9BSjP1xJqUaUKwJ2%2flIgzZL5Ma%2bnJK8j%2b732ixNyIDNDGo7uIF%2b;+machineIdCookie=866873600;+userLoggedIn=jga;sdgjefdfdfs
I want to extract a substring beginning with ObSSOCookie=....; and ending just before the userLoggedIn.
I set my regex pattern
pattern = "ObSSOCookie=.*;"
But it continues to extract until the last semicolon (which includes the +machineIdCookie=866873600), rather than the first semicolon, which is what I want.
Is there a way to extract only up to the first semicolon? I can't just split on ";" because this regex is actually to be used in a Logstash configuration file, and there's no way to use Python-style coding there.
You want to make your regex non-greedy
Instead of using this
* - zero or more
Use this
*? - zero or more (non-greedy)
Here's your expression (demo).
ObSSOCookie=(.*?;)
This is a general technique, also described in this answer.
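The difference between the two quantifiers can be seen on a shortened cookie string. Python's re module is used here only to illustrate the behaviour; the cookie values are made up, and whether the target engine behaves identically is an assumption.

```python
import re

# Shortened, made-up cookie string with the same overall shape as the question's.
cookies = "a=1;+ObSSOCookie=abc%2b;+machineIdCookie=866873600;+userLoggedIn=jga;x"

greedy = re.search(r"ObSSOCookie=.*;", cookies).group(0)   # runs to the LAST semicolon
lazy = re.search(r"ObSSOCookie=.*?;", cookies).group(0)    # stops at the FIRST semicolon

print(greedy)  # ObSSOCookie=abc%2b;+machineIdCookie=866873600;+userLoggedIn=jga;
print(lazy)    # ObSSOCookie=abc%2b;
```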
Why not just grab anything except the next ; like this (demo)
ObSSOCookie=([^;]*)
>>> import re
>>> data = 'ASP.NET_SessionId=frffcjcarie4dhxouz5yklwu;+BIGipServercapitaliq-ssl=3617221783.36895.0000;+ObSSOCookie=wkyQfn2Cyx2%2f7kSj4zBB886WaLs92Ord9FSf64c%2byHFOBwgEP4f3UmorDj051suQwRXAKEwBtYVKRYJuUGh2YNZtAj2%2bNp8asLIT9xQPqVktEAzkl3jNIv8MyWFsoFPDtm%2fTm1FeaCP%2bGTk9Oa%2fCNA0Hmy847qK2qo7%2bbziV%2bjeClbkGjAX3pgcPzfs%2bQp7p9BSjP1xJqUaUKwJ2%2flIgzZL5Ma%2bnJK8j%2b732ixNyIDNDGo7uIF%2b;+machineIdCookie=866873600;+userLoggedIn=jga;sdgjefdfdfs'
>>> p = re.compile('ObSSOCookie=([^;]*)')
>>> m = p.search(data)
>>> m.group(1)
'wkyQfn2Cyx2%2f7kSj4zBB886WaLs92Ord9FSf64c%2byHFOBwgEP4f3UmorDj051suQwRXAKEwBtYVKRYJuUGh2YNZtAj2%2bNp8asLIT9xQPqVktEAzkl3jNIv8MyWFsoFPDtm%2fTm1FeaCP%2bGTk9Oa%2fCNA0Hmy847qK2qo7%2bbziV%2bjeClbkGjAX3pgcPzfs%2bQp7p9BSjP1xJqUaUKwJ2%2flIgzZL5Ma%2bnJK8j%2b732ixNyIDNDGo7uIF%2b'

Dart RegExp, Why does this throw a FormatException

I'm not clear on why this is throwing a FormatException:
void main() {
  RegExp cssColorMatch = new RegExp(r'^#([0-9a-fA-F]{3}{1,2}$)');
  print(cssColorMatch.hasMatch('#F56'));
}
You are trying to specify multiple range quantifiers back to back, which causes the exception. Close the capturing group right after the first range quantifier and place the following range quantifier outside of the capturing group if you want to use it this way.
RegExp re = new RegExp(r"#([0-9a-fA-F]{3}){1,2}");
Since you are using hasMatch, which returns whether the regular expression has a match anywhere in the input string, you can remove the start ^ and end $ anchors if a match anywhere in the input is acceptable; in that case you don't need {1,2} either.
RegExp re = new RegExp(r"#([0-9a-fA-F]{3})");
You cannot do {3}{1,2}. But you can do:
RegExp cssColorMatch = new RegExp(r'^\#((?:[0-9a-fA-F]{3}){1,2})$');
which still does not match Hex colors correctly.
Because your regex contains {1,2} at the end. There is no need to include this part if you only want three-digit colors.
The regex below would be enough:
RegExp cssColorMatch = new RegExp(r'^#([0-9a-fA-F]{3})$');
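For comparison, the grouped form (quantifier applied to a group rather than stacked on another quantifier) can be tried out in Python. This is not Dart code and says nothing about Dart's RegExp beyond what the answers above state; it only sketches which inputs the grouped pattern accepts when the whole string must match.

```python
import re

# Group the three-digit unit, then repeat the group once or twice.
hex_color = re.compile(r"#(?:[0-9a-fA-F]{3}){1,2}")

results = {c: bool(hex_color.fullmatch(c)) for c in ["#F56", "#FF5566", "#F566", "F56"]}
print(results)
# {'#F56': True, '#FF5566': True, '#F566': False, 'F56': False}
```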

regular expression replacement of numbers

Using regular expression how do I replace 1,186.55 with 1186.55?
My search string is
\b[1-9],[0-9][0-9][0-9].[0-9][0-9]
which works fine. I just can't seem to get the replacement part to work.
You are very sparse with information in your question, so I'll try to answer as generally as possible:
You can shorten the regex a bit by using quantifiers. As a first step I would write it like this (also escaping the dot, since an unescaped . matches any character, not just a decimal point):
\b[1-9],[0-9]{3}\.[0-9]{2}
Most probably you can also replace [0-9] by \d, which is also more readable IMO.
\b\d,\d{3}\.\d{2}
Now we can go to the replacement part. Here you need to store the parts you want to keep. You can do that by putting those parts into capturing groups, i.e. by placing parentheses around them. This would be your search pattern:
\b(\d),(\d{3}\.\d{2})
Now you can access the matched content of those capturing groups in the replacement string. The first opening parenthesis starts the first group, the second opening parenthesis starts the second group, and so on.
There are two possibilities here, depending on the tool: you get that content either by \1 or by $1
Your replacement string would then be
\1\2
OR
$1$2
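In Python, for instance, the backslash form is the one re.sub understands. A minimal sketch, using a pattern with the dot escaped and a made-up sample sentence:

```python
import re

# Group 1: the digit before the comma; group 2: the rest of the number.
pattern = re.compile(r"\b(\d),(\d{3}\.\d{2})")

cleaned = pattern.sub(r"\1\2", "Total: 1,186.55")
print(cleaned)  # Total: 1186.55
```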
Python:
def repl(initstr, unwanted=','):
    # keep every character that is not in the unwanted set
    res = set(unwanted)
    return ''.join(r for r in initstr if r not in res)
Using regular expressions:
from re import compile
regex = compile(r'([\d\.])')
print(''.join(regex.findall('1,186.55')))
Using str.split() method:
num = '1,186.55'
print(''.join(num.split(',')))
Using str.replace() method:
num = '1,186.55'
print(num.replace(',', ''))
If you just want to remove the comma, you can do this (in Java or C#):
str.Replace(",", "");
(in Java it's replace)
Or in Perl:
s/(\d+),(\d+)/$1$2/

Using RegEx with something of the format "xxx:abc" to match just "abc"?

I've not done much RegEx, so I'm having issues coming up with a good RegEx for this script.
Here are some example inputs:
document:ASoi4jgt0w9efcZXNDOFzsdpfoasdf-zGRnae4iwn2, file:90jfa9_189204hsfiansdIASDNF, pdf:a09srjbZXMgf9oe90rfmasasgjm4-ab, spreadsheet:ASr0gk0jsdfPAsdfn
And here's what I'd want to match on each of those examples:
ASoi4jgt0w9efcZXNDOFzsdpfoasdf-zGRnae4iwn2, 90jfa9_189204hsfiansdIASDNF, a09srjbZXMgf9oe90rfmasasgjm4-ab, ASr0gk0jsdfPAsdfn
What would be the best and perhaps simplest RegEx to use for this? Thanks!
.*:(.*) should get you everything after the last colon in the string as the value of the first group (or second group if you count the 'match everything' group).
An alternative would be [^:]*$ which gets you all characters at the end of the string up to but not including the last character in the string that is a colon.
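Both suggestions can be checked on a single item from the question. The sketch below assumes each "xxx:abc" item is processed on its own rather than as one comma-separated string.

```python
import re

item = "document:ASoi4jgt0w9efcZXNDOFzsdpfoasdf-zGRnae4iwn2"

# Greedy .* runs to the last colon; group 1 is everything after it.
after_colon = re.search(r".*:(.*)", item).group(1)

# Longest run of non-colon characters that reaches the end of the string.
tail = re.search(r"[^:]*$", item).group(0)

print(after_colon)  # ASoi4jgt0w9efcZXNDOFzsdpfoasdf-zGRnae4iwn2
print(tail)         # ASoi4jgt0w9efcZXNDOFzsdpfoasdf-zGRnae4iwn2
```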
Use something like below:
([^:]*)(,|$)
and get the first group. You can use a non-capturing group (?:ABC) if needed for the last. Also this makes the assumption that the value itself can have , as one of the characters.
I don't think answers like (.*)\:(.*) would work. It will match entire string.
(.*)\:(.*)
And take the second capture group...
Simplest seems to be [^:]*:([^,]*)(?:,|$).
That is, find something (possibly nothing) up to a colon, then a colon, then anything not including a comma (which is the part that is captured), up to a comma or the end of the line.
Note the use of a non-capturing group at the end to encapsulate the alternation. The only capturing group appearing is the one which you wish to use.
So in python:
import re
exampStr = "document:ASoi4jgt0w9efcZXNDOFzsdpfoasdf-zGRnae4iwn2, file:90jfa9_189204hsfiansdIASDNF, pdf:a09srjbZXMgf9oe90rfmasasgjm4-ab, spreadsheet:ASr0gk0jsdfPAsdfn"
regex = re.compile("[^:]*:([^,]*)(?:,|$)")
result = regex.findall(exampStr)
print(result)
# Result:
# ['ASoi4jgt0w9efcZXNDOFzsdpfoasdf-zGRnae4iwn2', '90jfa9_189204hsfiansdIASDNF', 'a09srjbZXMgf9oe90rfmasasgjm4-ab', 'ASr0gk0jsdfPAsdfn']
A good introduction is at: http://www.regular-expressions.info/tutorial.html .