How to match a string until a certain pattern (Python 3.8) - regex

I have a small problem with my regex.
I want to parse a plist file to get a Unix timestamp (10 digits) plus everything after the timestamp up to a certain pattern. My current pattern looks like this:
,\s*(\d{10}),\s*'(?=.[',])
I want to match the timestamp and everything between the timestamp and the pattern ',.
This is a snippet of the string from the plist:
'$class': UID(23)}, 1572871204, 'I need this one', {'dictionary': UID(34)
I want to get:
1573078965, 'I need this one'
It would be ideal if I get the timestamp as a submatch and the string as another submatch.
Thank you very much!

Instead of the positive lookahead, you could add another capturing group that matches one or more characters other than a comma or ' using a negated character class: ([^,']+).
Since you are already matching the comma before it as well, you can omit the lookahead and match the closing ' or comma afterwards instead.
For example:
,\s*(\d{10}),\s*'([^,']+)[',]
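A minimal Python sketch of the pattern above, applied to the snippet from the question. The variable names are placeholders chosen for illustration, not part of the original post.

```python
import re

# Snippet from the question; the surrounding plist text is omitted.
snippet = "'$class': UID(23)}, 1572871204, 'I need this one', {'dictionary': UID(34)"

# Group 1: the 10-digit timestamp.
# Group 2: the text after the opening quote, up to the next ' or comma.
pattern = re.compile(r",\s*(\d{10}),\s*'([^,']+)[',]")

match = pattern.search(snippet)
if match:
    timestamp, text = match.groups()
    print(timestamp)  # 1572871204
    print(text)       # I need this one
```

Note that the negated class stops at the first ' or comma, so a text value that itself contains either character would be cut short.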


RegEx: How can I match all characters until the next match? [duplicate]

I have a string like this:
Hello [#foo] how are you [#bar] more text
Ultimately I need to modify each instance of a substring matching /\[#.+?\]/, but I also need to modify each substring before/after the [#foo] and [#bar].
The following regex matches the substring before a [#.+], the [#.+] itself, then a substring after the [#.+] until the next character is followed by another [#.+].
(.*?)(\[(#.+?)\])((.(?!(\[#.+?\])))*)
So the first match is "Hello [#foo] how are you" and the second match is " [#bar] more text".
Note the space at the beginning of the second match. That's the problem. Is there a way to get the first match to include all characters right up to the next [#.+]?
My regex includes characters after the [#.+] that are not followed by an instance of [#.+], and I cannot see any way of getting it to include all characters until we are actually in another instance of [#.+].
I'm really interested in whether I'm missing something - it certainly feels like there should be a simpler way to capture the characters around a given match, or a simpler way to capture characters not part of a match...
You have this regex:
(.*?)(\[(#.+?)\])((.(?!(\[#.+?\])))*)
                   ^
Look at that dot. It precedes the negative lookahead, so each character is consumed first and only then checked: the dot matches a character only if what follows that character is not \[#.+?\]. For the character immediately before a \[#.+?\] the lookahead fails, so the dot does not match it. Hence the space character isn't included.
To include it, just change the order. Put the dot after the negative lookahead:
(.*?)(\[(#.+?)\])(((?!(\[#.+?\])).)*)
                                 ^
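The difference between the two orderings can be checked with a short Python sketch. The two patterns are copied from the answer above; the variable names are only illustrative.

```python
import re

text = "Hello [#foo] how are you [#bar] more text"

# Dot before the lookahead: a character is consumed, then the lookahead
# checks what comes after it, so the character right before "[#bar]" is lost.
dot_first = re.compile(r"(.*?)(\[(#.+?)\])((.(?!(\[#.+?\])))*)")

# Lookahead before the dot: the check happens at the current position,
# so every character up to the next "[#...]" is consumed.
lookahead_first = re.compile(r"(.*?)(\[(#.+?)\])(((?!(\[#.+?\])).)*)")

dot_matches = [m.group(0) for m in dot_first.finditer(text)]
fixed_matches = [m.group(0) for m in lookahead_first.finditer(text)]

print(dot_matches)    # ['Hello [#foo] how are you', ' [#bar] more text']
print(fixed_matches)  # ['Hello [#foo] how are you ', '[#bar] more text']
```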
If I understand correctly, you want to separate your text into groups, each one having one instance of [#.+], and all of the text must be matched into a group.
Try (?:^.*?)?\[#.+?\].*?(?=\[|$).
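As a rough check of this suggestion, the following Python sketch splits the sample string from the question into chunks with findall. It assumes the input is a single line, since ^ and $ are used without the multiline flag.

```python
import re

text = "Hello [#foo] how are you [#bar] more text"

# Optional leading text (first chunk only), one [#...] tag, then everything
# up to the next "[" or the end of the string.
pattern = re.compile(r"(?:^.*?)?\[#.+?\].*?(?=\[|$)")

# The pattern has no capturing groups, so findall returns the full matches.
chunks = pattern.findall(text)
print(chunks)  # ['Hello [#foo] how are you ', '[#bar] more text']
```

Because the lookahead stops at any "[", text that contains a bracket that is not part of a [#...] tag would be split there as well.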
This RegEx might help you to get those vars.
(?:\[#[A-Za-z0-9]+\])
You can also add any other char to [A-Za-z0-9] such as ., +, #:
`[A-Za-z0-9\.\+\#]`
and change it as you wish:
(?:\[#[A-Za-z0-9\.\+\#]+\])
import re
x = 'Hello [#foo] how are you [#bar] more text'
out = re.search(r'((.*)(\[.*\])(.*))((\[.*\])(.*))', x)
After getting the match object above, you can use its group method to access the different groups:
out.group(1)
'Hello [#foo] how are you '
out.group(2)
'Hello '
out.group(3)
'[#foo]'
out.group(4)
' how are you '
out.group(5)
'[#bar] more text'
out.group(6)
'[#bar]'
out.group(7)
' more text'

I need a regex result that does not include the substring at the beginning and end of the matched pattern

I have a pattern I need to match that's always a date: "_YYYYMMDD.". However, I don't want to include the "_" and the "." in the result. I have a regex pattern that successfully matches the above. It's too complicated to include here because I would have to write it out by hand and would mess it up.
Suffice it to say I have a pattern:
[_](lots of stuff in the middle)[.]
It works fine but I don't want to include the "_" and "."
Any answers are greatly appreciated. Thanks!
To require the underscore and the dot without including them in the matched text, you need to use lookarounds in the regex pattern. The following regex matches a date preceded by _ and followed by .
(?<=_)\d{8}(?=\.)
Additionally, if you want to capture the year, month and day into their own capture groups, you can use this regex and read the year from group 1, the month from group 2 and the day from group 3:
(?<=_)(\d{4})(\d{2})(\d{2})(?=\.)
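A short Python sketch of the lookaround version with capture groups. The file name is a made-up example, since the question does not show a real input.

```python
import re

# Hypothetical input; the question only describes the "_YYYYMMDD." shape.
filename = "report_20190315.csv"

# The lookbehind and lookahead require "_" and "." around the date
# without making them part of the match.
pattern = re.compile(r"(?<=_)(\d{4})(\d{2})(\d{2})(?=\.)")

match = pattern.search(filename)
if match:
    full_date = match.group(0)
    year, month, day = match.groups()
    print(full_date)         # 20190315
    print(year, month, day)  # 2019 03 15
```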
The easiest way would be to slice the first and last characters off the result. In Bash, you can do it either by string length:
result="${result:1:${#result}-2}"
(or result="${result:1:8}" since the length will be constant)
Or by specific character:
result="${result#_}"
result="${result%.}"

RegEx - double condition to find some string

I'd like to find word RADU3_ or RADU3- in a sentence that begins with xlink:href= and ends with .svg
How to do this?
I've tried following, but does not give the result I'm expecting.
(?=\wxlink:href=|\wsvg\b)|\bRADU3_|\bRADU3-
Only the last line in the example is a good result (RADU3_):
ProductionGraphics\GP1**RADU3-**11_HeatingFurnaceF1.svg
PB:ExpressionText id="RADU3_FUEL GAS _SUM_EX" PBD:LinkUses
xlink:href="C:\ProcBookImport\MaintenanceGraphics\RADU3_AI.svg"
I'm not sure exactly how you want to use it, but the pattern below finds the string. I put the RADU3 part in its own group, which matches RADU3 followed by any number of - or _ characters ([_-]*):
(xlink:href=.*)(RADU3[_-]*)(.*\.svg)
Edit: handling multiple occurrences
If a string might contain the pattern several times, add ? after the quantifiers to make them lazy, so that each match stops at the first .svg instead of the last one:
(RADU3[_-]*?)(.*?\.svg?)
The three-group pattern could be used in a replace expression like
\1someotherword\3
where the second group (\2) is the part that gets replaced.
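The replace expression can be tried in Python with re.sub. This sketch uses the xlink:href line from the question and the three-group pattern from this answer; "someotherword" is the placeholder replacement from the answer.

```python
import re

# The matching line from the question, as a raw string to keep the backslashes.
line = r'xlink:href="C:\ProcBookImport\MaintenanceGraphics\RADU3_AI.svg"'

# Group 1: everything up to RADU3. Group 2: RADU3 plus any "_" or "-".
# Group 3: the rest up to and including ".svg".
pattern = re.compile(r"(xlink:href=.*)(RADU3[_-]*)(.*\.svg)")

# Keep groups 1 and 3, replace group 2.
replaced = pattern.sub(r"\1someotherword\3", line)
print(replaced)
```

The trailing double quote is outside the match, so it is left untouched.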
If you want to make sure that the string starts with xlink:href= and ends with .svg, you could use anchors to assert the start (^) and the end ($) of the string.
Use one capturing group to make sure xlink:href= comes before RADU3 followed by an underscore or a hyphen. You can then match that part and, in the replacement, use the capturing group followed by your replacement text.
You could use a positive lookahead to assert that the string ends with .svg.
For example:
^(xlink:href=.*)\bRADU3[_-](?=.*\.svg$)
^ Assert the start of the string
(xlink:href=.*) Capturing group, match up until the last occurrence of ..
\bRADU3[_-] Word boundary to prevent matching part of a larger word. Match RADU3 followed by an underscore or hyphen
(?=.*\.svg$) Positive lookahead to assert the string ends with .svg
It sounds like you only want the word (substring) if it appears in a specific context.
In that case, you can restart the match midway: the starting and ending conditions then act as "if-statements" and are not part of the result.
The following uses this method, with a match reset (\K) so that only the substring you are looking for is extracted.
# The string has to start with "xlink:href="
xlink:href=
# Fetch everything up to our match, and then restart the regex
.*\K
# The strings we are looking for
(RADU3[-_])
# String has to end with ".svg"
(?=(.*\.svg))
If you want the entire string matching our rules you are looking for something like this:
#The string has to start with "xlink:href"
^(xlink:href=).*
# The strings we are looking for
(RADU3[-_])
# String has to end with ".svg"
(\w+\.svg)
#Get everything after .svg too
.*
If you only want the ending " after the .svg, you'd want to modify the last part where I just take everything after .svg
You can play around with what I have come up with at regex101 (no affiliation, just love their site): https://regex101.com/r/g0v07V/3/
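\K is a PCRE-style feature and, as far as I know, Python's built-in re module does not support it. A rough equivalent under that assumption is to put the wanted part in a capturing group and read only that group. The sketch below runs this against the three lines from the question, with the bold markers removed from the first one.

```python
import re

lines = [
    r"ProductionGraphics\GP1RADU3-11_HeatingFurnaceF1.svg",
    'PB:ExpressionText id="RADU3_FUEL GAS _SUM_EX" PBD:LinkUses',
    r'xlink:href="C:\ProcBookImport\MaintenanceGraphics\RADU3_AI.svg"',
]

# "xlink:href=" must come first, ".svg" must follow somewhere later;
# only the capturing group is treated as the result.
pattern = re.compile(r"xlink:href=.*(RADU3[-_])(?=.*\.svg)")

found = []
for line in lines:
    m = pattern.search(line)
    found.append(m.group(1) if m else None)

print(found)  # [None, None, 'RADU3_']
```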

RegEx - match all periods except the period preceded by 'single capital letter'

Any ideas on how to remove all periods from a large text document, by using a regex on a text editor for the following example:
J. don't match
F.C. don't match
word. match
Word. match
WORD. match
The regex below matches a period at the end of a line that is preceded by either two or more word characters or a single non-capital character:
((\w{2,})|([^A-Z]))\.$
You can try this too,
(?<!(?<=^|[^A-Z])[A-Z])\.
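Python's built-in re module requires fixed-width lookbehinds, so the nested lookbehind above is likely to be rejected there. The sketch below uses a simplified variant that is my own substitution, not the answer's pattern: it skips a period when the character before it is a capital letter that starts a word (\b followed by [A-Z]).

```python
import re

text = (
    "J. don't match\n"
    "F.C. don't match\n"
    "word. match\n"
    "Word. match\n"
    "WORD. match"
)

# Assumed variant: remove a period unless it directly follows a single
# capital letter that stands at a word boundary (as in "J." or "F.C.").
pattern = re.compile(r"(?<!\b[A-Z])\.")

cleaned = pattern.sub("", text)
print(cleaned)
```

This variant treats a capital letter after a digit (for example "3D.") as part of a longer word, which may differ from the original pattern's behavior.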
You can try something like this: \w{2,}?\.
You can go to Regex101 and try it for yourself with more test strings to get the one you want. If you want to actually exclude the periods you can use a capturing group like so: (\w{2,}?)\.

Regex to match string between two characters in email

I'm trying to match a single string out of an email using regex. The email pattern looks like:
name.name.someid#mail.domain.com
And I would like to grab the 'someid' section. Meaning I need to match everything before the '#' and after the last period.
I can match everything before the '#' with (^[^#]+) however I can't effectively combine it in the regex statement to evaluate only after the last period (I can only get it to match after the first period).
Any pointers would be great, thanks!
Use a positive lookahead:
/[^.]+(?=#)/
Here's a demo: http://regex101.com/r/sW7sR3
/\.([^.#]+)#/
Without using lookarounds, this matches anything that's not an # or . that comes after a . and before #.
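Both suggestions can be compared in a few lines of Python. The address is the sample from the question, with # standing in for the separator as in the original post.

```python
import re

address = "name.name.someid#mail.domain.com"

# Lookahead version: the whole match is the part right before "#".
with_lookahead = re.search(r"[^.]+(?=#)", address)

# Group version: the "." and "#" are matched, the id is read from group 1.
with_group = re.search(r"\.([^.#]+)#", address)

id_from_lookahead = with_lookahead.group(0) if with_lookahead else None
id_from_group = with_group.group(1) if with_group else None

print(id_from_lookahead)  # someid
print(id_from_group)      # someid
```

The group version needs a "." before the id, so it would not match an address with no period in front of the separator, while the lookahead version would.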