regexp: multiline, non-greedy match until optional string

Using Go's regexp, I'm trying to extract a predefined, ordered set of multiline key-value pairs from raw text, where the last pair is optional, e.g.,
Key1:
SomeValue1
MoreValue1
Key2:
SomeValue2
MoreValue2
OptionalKey3:
SomeValue3
MoreValue3
(here, I want to extract all the values as named groups)
If I use the default greedy pattern (?s:Key1:\n(?P<Key1>.*)Key2:\n(?P<Key2>.*)(?:OptionalKey3:\n(?P<OptionalKey3>.*))?), it never sees OptionalKey3 and matches the rest of the text as Key2.
If I use the non-greedy pattern (?s:Key1:\n(?P<Key1>.*)Key2:\n(?P<Key2>.*?)(?:OptionalKey3:\n(?P<OptionalKey3>.*))?), it doesn't even see SomeValue2 and stops immediately: https://regex101.com/r/QE2g3o/1
Is there a way to optionally match OptionalKey3 while still capturing all the other keys?

Use
(?s)\AKey1:\n(?P<Key1>.*)Key2:\n(?P<Key2>.*?)(?:OptionalKey3:\n(?P<OptionalKey3>.*))?\z
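See regex proof. The anchors are what make this work: because \z forces the match to reach the end of the string, the lazy .*? in Key2 has to keep expanding until either OptionalKey3: or the end of the input follows.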
EXPLANATION
(?s)                  set flags for this block (with . matching \n) (case-sensitive) (with ^ and $ matching normally) (matching whitespace and # normally)
\A                    the beginning of the string
Key1:                 'Key1:'
\n                    '\n' (newline)
(?P<Key1>             group and capture to "Key1":
  .*                  any character (0 or more times (matching the most amount possible))
)                     end of "Key1"
Key2:                 'Key2:'
\n                    '\n' (newline)
(?P<Key2>             group and capture to "Key2":
  .*?                 any character (0 or more times (matching the least amount possible))
)                     end of "Key2"
(?:                   group, but do not capture (optional (matching the most amount possible)):
  OptionalKey3:       'OptionalKey3:'
  \n                  '\n' (newline)
  (?P<OptionalKey3>   group and capture to "OptionalKey3":
    .*                any character (0 or more times (matching the most amount possible))
  )                   end of "OptionalKey3"
)?                    end of grouping
\z                    the end of the string
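
As a quick check of the anchored pattern, here is a minimal sketch in Python rather than Go (Python is used only for illustration; its re module accepts the same (?P<name>...) syntax). It assumes \Z in place of \z, since Python's re has traditionally spelled the end-of-string anchor \Z. The helper name extract and the two sample strings are made up for the example.

```python
import re

# Same pattern as in the answer, with \Z standing in for \z (an assumption
# made for Python's re; Go's regexp accepts \z as written in the answer).
PATTERN = re.compile(
    r"(?s)\AKey1:\n(?P<Key1>.*)Key2:\n(?P<Key2>.*?)"
    r"(?:OptionalKey3:\n(?P<OptionalKey3>.*))?\Z"
)


def extract(text):
    """Return the named groups as a dict, or None if the text does not match."""
    m = PATTERN.match(text)
    return m.groupdict() if m else None


# Hypothetical inputs modelled on the question's sample text.
with_optional = (
    "Key1:\nSomeValue1\nMoreValue1\n"
    "Key2:\nSomeValue2\nMoreValue2\n"
    "OptionalKey3:\nSomeValue3\nMoreValue3"
)
without_optional = "Key1:\nSomeValue1\nMoreValue1\nKey2:\nSomeValue2\nMoreValue2"

print(extract(with_optional))
print(extract(without_optional))
```

When the optional key is absent, the OptionalKey3 group is simply None and Key2 runs to the end of the input.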

Related

Optional element between .*?

I'm trying to extract an optional element via PCRE from the following example.
I need to pull out the xxxx-xxx-xxxx-xxxx-xxxxx if ActivityID exists.
I'm guessing I need to use lookaheads or the like but I can't quite wrap my head around it.
</Level><Task>...<Correlation ActivityID='{xxxx-xxx-xxxx-xxxx-xxxxx}'/><Execution...</Channel>
This works if the element exists, saving to taco64:
<\/level>(?<taco16>.*?)ActivityID='{(?<taco64>.*)}'(?<taco32>.*?)<Computer>
Making the ActivityID part optional drops everything into taco32:
<\/level>(?<taco16>.*?)(ActivityID='{(?<taco64>.*)}')?(?<taco32>.*?)<Computer>

Use
<\/level>(?:(?<taco16>.*?)(ActivityID='{(?<taco64>.*)}'))?(?<taco32>.*?)<Computer>
See regex proof.
EXPLANATION
<                     '<'
\/                    '/'
level>                'level>'
(?:                   group, but do not capture (optional (matching the most amount possible)):
  (?<taco16>          group and capture to taco16:
    .*?               any character except \n (0 or more times (matching the least amount possible))
  )                   end of taco16
  (                   group and capture to \2:
    ActivityID='{     'ActivityID=\'{'
    (?<taco64>        group and capture to taco64:
      .*              any character except \n (0 or more times (matching the most amount possible))
    )                 end of taco64
    }'                '}\''
  )                   end of \2
)?                    end of grouping
(?<taco32>            group and capture to taco32:
  .*?                 any character except \n (0 or more times (matching the least amount possible))
)                     end of taco32
<Computer>            '<Computer>'
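
The effect of wrapping taco16 and the ActivityID part in one optional group can be sketched in Python. This is an illustration under assumptions, not the asker's setup: Python spells named groups (?P<name>...), the braces are escaped for clarity, and both input strings are hypothetical because the question's sample elides most of the XML. The helper name activity_id is also made up.

```python
import re

# The answer's pattern, adapted to Python's named-group syntax.
PATTERN = re.compile(
    r"</level>(?:(?P<taco16>.*?)(ActivityID='\{(?P<taco64>.*)\}'))?"
    r"(?P<taco32>.*?)<Computer>"
)

# Hypothetical event fragments, one with and one without ActivityID.
with_id = (
    "</level><Task>1</Task>"
    "<Correlation ActivityID='{xxxx-xxx-xxxx-xxxx-xxxxx}'/><Computer>"
)
without_id = "</level><Task>1</Task><Correlation/><Computer>"


def activity_id(event):
    """Return the ActivityID value, or None if it is absent or nothing matches."""
    m = PATTERN.search(event)
    return m.group("taco64") if m else None


print(activity_id(with_id))
print(activity_id(without_id))
```

Because the optional group is greedy, the engine tries the ActivityID branch first and only falls back to putting everything into taco32 when that branch cannot match.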

How to remove the hash tag from this regex

My current regex:
(\d.{17})[^#]*(\D+)(\d+)gr(\d+)
Group 2 still contains the hash sign, and I want to remove it from there. What should I change in my current regex?
201223E0MWJPJD2230#AdeSaputra290gr99000
2101023CNV6TT1109J#Fefe430gr142000
2101183EDTFPSA0128#Jessica500gr112000
201221E2QKWRY11413#EssyYosita880gr233500
2101123G9XQ7R41705#Meily1120gr329000
201228ECEWTJT50859#WidyaNatali1720gr457230
201227EEBX1K9K1020#Excelio112gr58900
2101112N4YNFB12016#DebyNath520gr156220
2101072R8A0QB22347#AlycieHandoTan700gr85000
Output:
group 1: 201223E0MWJPJD2230
group 2: #AdeSaputra
group 3: 290
group 4: 99000

Remove the [^ ] negation from your regex so that the # is matched outside group 2:
(\d.{17})#*(\D+)(\d+)gr(\d+)

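\D matches any non-digit, so it also matches the # in your input.
Match the # literally instead of excluding it with [^#], so that it is consumed before group 2 starts.
Use
^(\d.{17})#(\D+)(\d+)gr(\d+)$
See proof it works. The anchors are added so that the pattern matches entire strings (with several records in one input, enable the m flag so that ^ and $ apply per line).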
Explanation
^            the beginning of the string
(            group and capture to \1:
  \d         digits (0-9)
  .{17}      any character except \n (17 times)
)            end of \1
#            '#'
(            group and capture to \2:
  \D+        non-digits (all but 0-9) (1 or more times (matching the most amount possible))
)            end of \2
(            group and capture to \3:
  \d+        digits (0-9) (1 or more times (matching the most amount possible))
)            end of \3
gr           'gr'
(            group and capture to \4:
  \d+        digits (0-9) (1 or more times (matching the most amount possible))
)            end of \4
$            before an optional \n, and the end of the string
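
To see the four groups without the hash sign, the anchored pattern can be run over a few of the question's rows. The sketch below uses Python purely for illustration and assumes one record per line, hence the MULTILINE flag; the variable names are made up.

```python
import re

# The anchored pattern from the answer; MULTILINE lets ^ and $ apply per line.
PATTERN = re.compile(r"^(\d.{17})#(\D+)(\d+)gr(\d+)$", re.MULTILINE)

# Three rows copied from the question.
rows = """201223E0MWJPJD2230#AdeSaputra290gr99000
2101023CNV6TT1109J#Fefe430gr142000
2101183EDTFPSA0128#Jessica500gr112000"""

# findall returns one (group 1, group 2, group 3, group 4) tuple per row.
records = PATTERN.findall(rows)
for record in records:
    print(record)
```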

Sort Email:Pass List by first occurrence of each email

I want to reduce an email:password list to the first occurrence of each email.
Example list:
email#example.com:passsword1
email#example.com:passsword2
email#example.com:passsword3
email1#example.com:passsword1
email1#example.com:passsword2
email1#example.com:passsword2
So only
email#example.com:passsword1
email1#example.com:passsword1
should be kept as result.
With my limited Regex skills I worked out this one but I guess I misunderstand something:
^(.*)(\r?\n\1)+(?=:)

Use
^((.*:).*)(?:\r?\n\2.*)+
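See proof. Use the g and m flags and replace each match with $1 (the first line of the run); this collapses consecutive lines that share the same email.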
Explanation
^              the beginning of a "line" (with the m flag)
(              group and capture to \1:
  (            group and capture to \2:
    .*         any character except \n (0 or more times (matching the most amount possible))
    :          ':'
  )            end of \2
  .*           any character except \n (0 or more times (matching the most amount possible))
)              end of \1
(?:            group, but do not capture (1 or more times (matching the most amount possible)):
  \r?          '\r' (carriage return) (optional (matching the most amount possible))
  \n           '\n' (newline)
  \2           what was matched by capture \2
  .*           any character except \n (0 or more times (matching the most amount possible))
)+             end of grouping
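
A minimal sketch of the replacement step in Python, assuming the list uses \n line endings and that duplicates of an email sit on consecutive lines (the pattern only collapses adjacent runs). The function name keep_first is made up for the example, and \1 is Python's spelling of $1 in the replacement string.

```python
import re

# Group 2 is the "email:" prefix; the backreference \2 then consumes every
# following line that starts with the same prefix.
PATTERN = re.compile(r"^((.*:).*)(?:\r?\n\2.*)+", re.MULTILINE)


def keep_first(text):
    """Replace each run of same-email lines with the first line of the run."""
    return PATTERN.sub(r"\1", text)


# The list from the question.
sample = """email#example.com:passsword1
email#example.com:passsword2
email#example.com:passsword3
email1#example.com:passsword1
email1#example.com:passsword2
email1#example.com:passsword2"""

print(keep_first(sample))
```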

using "or" in pattern matching regular expression

I am using a regular expression to find a pattern, but somehow I am not able to find all occurrences.
My input file, from which I need to match the pattern (note that this is an example with only 3 occurrences; the real file has many more):
aaa-233- hi, how are you?
aaa-234- 6(-8989)
aaa-235- 123
end
So, I want my output to be
hi, how are you?
6(-8989)
123
My regex is
aaa\\-[A-Za-z0-9,->#]\\-(.+?)(aaa)
Pseudo code
Output= matcher.group(2);
How can I make the match start at aaa and end when it encounters either the next aaa or end?

Use
(?sm)^aaa-[^-]+-.*?(?=\naaa|\nend|\z)
See proof
Explanation
(?sm)          set flags for this block (with ^ and $ matching start and end of line) (with . matching \n) (case-sensitive) (matching whitespace and # normally)
^              the beginning of a "line"
aaa-           'aaa-'
[^-]+          any character except: '-' (1 or more times (matching the most amount possible))
-              '-'
.*?            any character (0 or more times (matching the least amount possible))
(?=            look ahead to see if there is:
  \n           '\n' (newline)
  aaa          'aaa'
 |             OR
  \n           '\n' (newline)
  end          'end'
 |             OR
  \z           the end of the string
)              end of look-ahead
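
The pattern as given matches the whole record including the aaa-NNN- prefix. The sketch below, in Python for illustration, makes two labelled changes to get only the values the question asks for: a capture group preceded by \s* is added around the value (my addition, not part of the answer), and \Z replaces \z because Python's re has traditionally used \Z for the end of the string. The function name values is made up.

```python
import re

# Lazy match up to the next record, the "end" marker, or the end of input.
# The capture group and the \s* before it are additions for this example.
PATTERN = re.compile(r"(?sm)^aaa-[^-]+-\s*(.*?)(?=\naaa|\nend|\Z)")

# The input from the question.
sample = """aaa-233- hi, how are you?
aaa-234- 6(-8989)
aaa-235- 123
end"""


def values(text):
    """Return the text that follows each aaa-NNN- prefix."""
    return PATTERN.findall(text)


print(values(sample))
```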

Match one to three groups only if they appear in order

I'd like to match positively strings like "10a3b4c", "10a", "5b4c", "3a6c", but not match "2c1b" (because the letters aren't in alphabetical order) or the empty string.
Attempt: (\d+a)?(\d+b)?(\d+c)?
Problem: It falsely matches the empty string "".
Attempt: (\d+[abc]){1,3}
Problem: Doesn't enforce the a, b, c order. It falsely matches "2c1b"
How can this restriction be expressed as a regex?

Use
^(?!$)(\d+a)?(\d+b)?(\d+c)?$
See proof
Explanation
^              the beginning of the string
(?!            look ahead to see if there is not:
  $            before an optional \n, and the end of the string
)              end of look-ahead
(              group and capture to \1 (optional (matching the most amount possible)):
  \d+          digits (0-9) (1 or more times (matching the most amount possible))
  a            'a'
)?             end of \1 (NOTE: because you are using a quantifier on this capture, only the LAST repetition of the captured pattern will be stored in \1)
(              group and capture to \2 (optional (matching the most amount possible)):
  \d+          digits (0-9) (1 or more times (matching the most amount possible))
  b            'b'
)?             end of \2 (NOTE: because you are using a quantifier on this capture, only the LAST repetition of the captured pattern will be stored in \2)
(              group and capture to \3 (optional (matching the most amount possible)):
  \d+          digits (0-9) (1 or more times (matching the most amount possible))
  c            'c'
)?             end of \3 (NOTE: because you are using a quantifier on this capture, only the LAST repetition of the captured pattern will be stored in \3)
$              before an optional \n, and the end of the string
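
The question's own examples make a convenient test set for the anchored pattern. The sketch below uses Python for illustration; the function name is_ordered is made up.

```python
import re

# (?!$) rejects the empty string; the three optional groups enforce a, b, c order.
PATTERN = re.compile(r"^(?!$)(\d+a)?(\d+b)?(\d+c)?$")


def is_ordered(s):
    """True if s consists of one to three number+letter groups in a, b, c order."""
    return PATTERN.match(s) is not None


# Strings taken from the question: the first four should match, the last two should not.
for s in ["10a3b4c", "10a", "5b4c", "3a6c", "2c1b", ""]:
    print(repr(s), is_ordered(s))
```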

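This seems to work for tokens inside longer text (see it on regex101), although it can still return a zero-length match at the start of an out-of-order token such as 2c1b, so discard empty matches: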
\b(?=[\dabc]+\b)(\d+a)?(\d+b)?(\d+c)?\b
