Using Go's regexp, I'm trying to extract a predefined set of ordered key-value (multiline) pairs whose last element may be optional from a raw text, e.g.,
Key1:
SomeValue1
MoreValue1
Key2:
SomeValue2
MoreValue2
OptionalKey3:
SomeValue3
MoreValue3
(here, I want to extract all the values as named groups)
If I use the default greedy pattern (?s:Key1:\n(?P<Key1>.*)Key2:\n(?P<Key2>.*)(?:OptionalKey3:\n(?P<OptionalKey3>.*))?), it never sees OptionalKey3 and matches the rest of the text as Key2.
If I use the non-greedy pattern (?s:Key1:\n(?P<Key1>.*)Key2:\n(?P<Key2>.*?)(?:OptionalKey3:\n(?P<OptionalKey3>.*))?), it doesn't even see SomeValue2 and stops immediately: https://regex101.com/r/QE2g3o/1
Is there a way to optionally match OptionalKey3 while also able to capture all the other ones?
Use
(?s)\AKey1:\n(?P<Key1>.*)Key2:\n(?P<Key2>.*?)(?:OptionalKey3:\n(?P<OptionalKey3>.*))?\z
See regex proof.
EXPLANATION
--------------------------------------------------------------------------------
(?s) set flags for this block (with . matching
\n) (case-sensitive) (with ^ and $
matching normally) (matching whitespace
and # normally)
--------------------------------------------------------------------------------
\A the beginning of the string
--------------------------------------------------------------------------------
Key1: 'Key1:'
--------------------------------------------------------------------------------
\n '\n' (newline)
--------------------------------------------------------------------------------
(?P<Key1> group and capture to "Key1":
--------------------------------------------------------------------------------
.* any character (0 or more times (matching
the most amount possible))
--------------------------------------------------------------------------------
) end of "Key1"
--------------------------------------------------------------------------------
Key2: 'Key2:'
--------------------------------------------------------------------------------
\n '\n' (newline)
--------------------------------------------------------------------------------
(?P<Key2> group and capture to "Key2":
--------------------------------------------------------------------------------
.*? any character (0 or more times (matching
the least amount possible))
--------------------------------------------------------------------------------
) end of "Key2"
--------------------------------------------------------------------------------
(?: group, but do not capture (optional
(matching the most amount possible)):
--------------------------------------------------------------------------------
OptionalKey3: 'OptionalKey3:'
--------------------------------------------------------------------------------
\n '\n' (newline)
--------------------------------------------------------------------------------
(?P<OptionalKey3> group and capture to "OptionalKey3":
--------------------------------------------------------------------------------
.* any character (0 or more times
(matching the most amount possible))
--------------------------------------------------------------------------------
) end of "OptionalKey3"
--------------------------------------------------------------------------------
)? end of grouping
--------------------------------------------------------------------------------
\z the end of the string
Related
I'm trying to extract an optional element via PCRE from the following example.
I need to pull out the xxxx-xxx-xxxx-xxxx-xxxxx if ActivityID exists.
I'm guessing I need to use lookaheads or the like but I can't quite wrap my head around it.
</Level><Task>...<Correlation ActivityID='{xxxx-xxx-xxxx-xxxx-xxxxx}'/><Execution...</Channel>
This works if the element exists, saving to taco64:
<\/level>(?<taco16>.*?)ActivityID='{(?<taco64>.*)}'(?<taco32>.*?)<Computer>
Being optional drops everything into taco32.
<\/level>(?<taco16>.*?)(ActivityID='{(?<taco64>.*)}')?(?<taco32>.*?)<Computer>
Use
<\/level>(?:(?<taco16>.*?)(ActivityID='{(?<taco64>.*)}'))?(?<taco32>.*?)<Computer>
See regex proof.
EXPLANATION
--------------------------------------------------------------------------------
< '<'
--------------------------------------------------------------------------------
\/ '/'
--------------------------------------------------------------------------------
level> 'level>'
--------------------------------------------------------------------------------
(?: group, but do not capture (optional
(matching the most amount possible)):
--------------------------------------------------------------------------------
(?<taco16> group and capture to taco16:
--------------------------------------------------------------------------------
.*? any character except \n (0 or more
times (matching the least amount
possible))
--------------------------------------------------------------------------------
) end of taco16
--------------------------------------------------------------------------------
(?<taco64> group and capture to taco64:
--------------------------------------------------------------------------------
ActivityID='{ 'ActivityID=\'{'
--------------------------------------------------------------------------------
( group and capture to \3:
--------------------------------------------------------------------------------
.* any character except \n (0 or more
times (matching the most amount
possible))
--------------------------------------------------------------------------------
) end of taco64
--------------------------------------------------------------------------------
}' '}\''
--------------------------------------------------------------------------------
) end of \2
--------------------------------------------------------------------------------
)? end of grouping
--------------------------------------------------------------------------------
(?<taco32> group and capture to taco32:
--------------------------------------------------------------------------------
.*? any character except \n (0 or more times
(matching the least amount possible))
--------------------------------------------------------------------------------
) end of taco32
--------------------------------------------------------------------------------
<Computer> '<Computer>'
My current regex:
(\d.{17})[^#]*(\D+)(\d+)gr(\d+)
In group 2, they are still having the hashtag, I want to remove it from there. What should I change from my current regex?
201223E0MWJPJD2230#AdeSaputra290gr99000
2101023CNV6TT1109J#Fefe430gr142000
2101183EDTFPSA0128#Jessica500gr112000
201221E2QKWRY11413#EssyYosita880gr233500
2101123G9XQ7R41705#Meily1120gr329000
201228ECEWTJT50859#WidyaNatali1720gr457230
201227EEBX1K9K1020#Excelio112gr58900
2101112N4YNFB12016#DebyNath520gr156220
2101072R8A0QB22347#AlycieHandoTan700gr85000
Output:
group 1: 201223E0MWJPJD2230
group 2: #AdeSaputra
group 3: 290
group 4: 99000
remove [^] from your regex
(\d.{17})#*(\D+)(\d+)gr(\d+)
see this
\D matches any non-digits and it matches # in your input.
Add # to the pattern instead of the opposite [^#] so as to match it.
Use
^(\d.{17})#(\D+)(\d+)gr(\d+)$
See proof it works. Adding anchors to match entire strings.
Explanation
--------------------------------------------------------------------------------
^ the beginning of the string
--------------------------------------------------------------------------------
( group and capture to \1:
--------------------------------------------------------------------------------
\d digits (0-9)
--------------------------------------------------------------------------------
.{17} any character except \n (17 times)
--------------------------------------------------------------------------------
) end of \1
--------------------------------------------------------------------------------
# '#'
--------------------------------------------------------------------------------
( group and capture to \2:
--------------------------------------------------------------------------------
\D+ non-digits (all but 0-9) (1 or more
times (matching the most amount
possible))
--------------------------------------------------------------------------------
) end of \2
--------------------------------------------------------------------------------
( group and capture to \3:
--------------------------------------------------------------------------------
\d+ digits (0-9) (1 or more times (matching
the most amount possible))
--------------------------------------------------------------------------------
) end of \3
--------------------------------------------------------------------------------
gr 'gr'
--------------------------------------------------------------------------------
( group and capture to \4:
--------------------------------------------------------------------------------
\d+ digits (0-9) (1 or more times (matching
the most amount possible))
--------------------------------------------------------------------------------
) end of \4
--------------------------------------------------------------------------------
$ before an optional \n, and the end of the
string
I want to sort a email:password by first occurrence of each email.
Example list:
email#example.com:passsword1
email#example.com:passsword2
email#example.com:passsword3
email1#example.com:passsword1
email1#example.com:passsword2
email1#example.com:passsword2
So only
email#example.com:passsword1
email1#example.com:passsword1
should be kept as result.
With my limited Regex skills I worked out this one but I guess I misunderstand something:
^(.*)(\r?\n\1)+(?=:)
Use
^((.*:).*)(?:\r?\n\2.*)+
See proof, use g and m flags.
Explanation
--------------------------------------------------------------------------------
^ the beginning of the string
--------------------------------------------------------------------------------
( group and capture to \1:
--------------------------------------------------------------------------------
( group and capture to \2:
--------------------------------------------------------------------------------
.* any character except \n (0 or more
times (matching the most amount
possible))
--------------------------------------------------------------------------------
: ':'
--------------------------------------------------------------------------------
) end of \2
--------------------------------------------------------------------------------
.* any character except \n (0 or more times
(matching the most amount possible))
--------------------------------------------------------------------------------
) end of \1
--------------------------------------------------------------------------------
(?: group, but do not capture (1 or more times
(matching the most amount possible)):
--------------------------------------------------------------------------------
\r? '\r' (carriage return) (optional
(matching the most amount possible))
--------------------------------------------------------------------------------
\n '\n' (newline)
--------------------------------------------------------------------------------
\2 what was matched by capture \2
--------------------------------------------------------------------------------
.* any character except \n (0 or more times
(matching the most amount possible))
--------------------------------------------------------------------------------
)+ end of grouping
I am using a regular expression to find the matched pattern. But somehow I am not able to find all Occrences.
My input file from where I need to match the pattern(please note that this is an example file with only 3 occurrence, in real - it have multiple Occrences) :
aaa-233- hi, how are you?
aaa-234- 6(-8989)
aaa-235- 123
end
So, I want my output to be
hi, how are you?
6(-8988)
123
My regex is
aaa\\-[A-Za-z0-9,->#]\\-(.+?)(aaa)
Pseudo code
Output= matcher.group(2);
How can I make the logic to start read from aaa and end either it encounters aaa or end.
Use
(?sm)^aaa-[^-]+-.*?(?=\naaa|\nend|\z)
See proof
Explanation
EXPLANATION
--------------------------------------------------------------------------------
(?ms) set flags for this block (with ^ and $
matching start and end of line) (with .
matching \n) (case-sensitive) (matching
whitespace and # normally)
--------------------------------------------------------------------------------
^ the beginning of a "line"
--------------------------------------------------------------------------------
aaa- 'aaa-'
--------------------------------------------------------------------------------
[^-]+ any character except: '-' (1 or more times
(matching the most amount possible))
--------------------------------------------------------------------------------
- '-'
--------------------------------------------------------------------------------
.*? any character (0 or more times (matching
the least amount possible))
--------------------------------------------------------------------------------
(?= look ahead to see if there is:
--------------------------------------------------------------------------------
\n '\n' (newline)
--------------------------------------------------------------------------------
aaa 'aaa'
--------------------------------------------------------------------------------
| OR
--------------------------------------------------------------------------------
\n '\n' (newline)
--------------------------------------------------------------------------------
end 'end'
--------------------------------------------------------------------------------
| OR
--------------------------------------------------------------------------------
\z the end of the string
--------------------------------------------------------------------------------
) end of look-ahead
I'd like to match positively strings like "10a3b4c", "10a", "5b4c", "3a6c", but not match "2c1b" (because the letters aren't in alphabetical order) or the empty string.
Attempt: (\d+a)?(\d+b)?(\d+c)?
Problem: Matches the empty string. It falsely matches "".
Attempt: (\d+[abc]){1,3}
Problem: Doesn't enforce the a, b, c order. It falsely matches "2c1b"
How can this restriction be expressed as a regex?
Use
^(?!$)(\d+a)?(\d+b)?(\d+c)?$
See proof
Explanation
EXPLANATION
--------------------------------------------------------------------------------
^ the beginning of the string
--------------------------------------------------------------------------------
(?! look ahead to see if there is not:
--------------------------------------------------------------------------------
$ before an optional \n, and the end of
the string
--------------------------------------------------------------------------------
) end of look-ahead
--------------------------------------------------------------------------------
( group and capture to \1 (optional
(matching the most amount possible)):
--------------------------------------------------------------------------------
\d+ digits (0-9) (1 or more times (matching
the most amount possible))
--------------------------------------------------------------------------------
a 'a'
--------------------------------------------------------------------------------
)? end of \1 (NOTE: because you are using a
quantifier on this capture, only the LAST
repetition of the captured pattern will be
stored in \1)
--------------------------------------------------------------------------------
( group and capture to \2 (optional
(matching the most amount possible)):
--------------------------------------------------------------------------------
\d+ digits (0-9) (1 or more times (matching
the most amount possible))
--------------------------------------------------------------------------------
b 'b'
--------------------------------------------------------------------------------
)? end of \2 (NOTE: because you are using a
quantifier on this capture, only the LAST
repetition of the captured pattern will be
stored in \2)
--------------------------------------------------------------------------------
( group and capture to \3 (optional
(matching the most amount possible)):
--------------------------------------------------------------------------------
\d+ digits (0-9) (1 or more times (matching
the most amount possible))
--------------------------------------------------------------------------------
c 'c'
--------------------------------------------------------------------------------
)? end of \3 (NOTE: because you are using a
quantifier on this capture, only the LAST
repetition of the captured pattern will be
stored in \3)
--------------------------------------------------------------------------------
$ before an optional \n, and the end of the
string
This seems to work (see it on regex101)
\b(?=[\dabc]+\b)(\d+a)?(\d+b)?(\d+c)?\b