Greedy/non-greedy quantifiers in ABAP regular expressions

I would like to extract two things from this string: | 2013.10.10 FEL felsz
The date field: the needed value is only 2013.10.10 (in this case).
The string between 2013.10.10 and felsz: the needed value is only FEL (in this case).
I tried the following regexes without much success:
(.*?<p/\s>.*?)(?=\s)
(.*?<p/("[0-9]+">.*?)(?=\s)
Do you have any suggestions?

As mentioned in comments, since ABAP doesn't allow non-greedy match with *?, if you can count on felsz occurring only immediately after the second portion you want to match you could use:
(\d{4}\.\d\d\.\d\d) (.*) felsz
PS (invalidated first answer): in non-ABAP systems where *? is supported, the following regex gets both values into submatches, with the date in submatch 1 and the other value (FEL in this case) in submatch 2:
(\d{4}\.\d\d\.\d\d) (.*?) felsz
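The greedy pattern from this answer can be tried outside ABAP as well. The following is a minimal sketch using Python's re module rather than ABAP, so it only illustrates how the pattern behaves; the variable names are made up for the example.

```python
import re

# Sample line from the question.
line = "| 2013.10.10 FEL felsz"

# Greedy (.*) is safe here only because the literal " felsz" follows it
# exactly once, as the answer assumes.
pattern = re.compile(r"(\d{4}\.\d\d\.\d\d) (.*) felsz")

match = pattern.search(line)
date, value = match.group(1), match.group(2)
print(date, value)
```

If felsz could appear more than once in the line, the greedy group would run up to the last occurrence.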

Is "felsz" variable? Can the white space vary? Can your date format vary? If not:
\| (\d{4}\.\d{2}\.\d{2}) (.*?) felsz
Otherwise:
\|\s+?(\d{4}\.\d{2}\.\d{2})\s+?(.*?)\s+?[a-z]+
Then access capture groups 1/2.

The regex
\d+\.\d+\.\d+
matches 2013.10.10 in the given string. Explanation and demonstration: http://regex101.com/r/bL7eO0
(?<=\d ).*(?= felsz)
should work to match FEL. Explanation and demonstration: http://regex101.com/r/pV2mW5
If you want them in capturing groups, you could use the regex:
\| (\d+\.\d+\.\d+) (.+?) .*
Explanation and demonstration: http://regex101.com/r/rQ6uU4

How about:
(?:\d+\.\d+\.\d+\s)(.*)\s
See it in action.
This matches FEL
Some things I took for granted:
the date always comes first and is a mix of numbers and periods
the date is always followed by a space
the word to capture is always followed by a space
the word to capture never contains a space

Assuming that FEL is always a single word (that is, delimited by a space), you could use the following expression:
(\d{4}\.\d\d\.\d\d) ([^\s]+) (.*)
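As a quick check of this expression, here is a small sketch in Python (not ABAP; the variable names are illustrative). It assumes, as the answer does, that the wanted value is a single word.

```python
import re

line = "| 2013.10.10 FEL felsz"

# Group 1: the date, group 2: the single word after it, group 3: the rest.
m = re.search(r"(\d{4}\.\d\d\.\d\d) ([^\s]+) (.*)", line)
date, word, rest = m.groups()
print(date, word, rest)
```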

Related

How to exclude a specific string with REGEX? (Perl)

For example, I have these strings
APPLEJUCE1A
APPLETREE2B
APPLECAKE3C
APPLETEA1B
APPLEWINE3B
APPLEWINE1C
I want all of these strings except those that have TEA or WINE1C in them.
APPLEJUCE1A
APPLETREE2B
APPLECAKE3C
APPLEWINE3B
I've already tried the following, but it didn't work:
^APPLE(?!.*(?:TEA|WINE1C)).*$
Any help is appreciated as I'm also kinda new to this.
If you indeed have multiple strings as you claim, there's no need to jam all that into one regex pattern.
/^APPLE/ && !/TEA|WINE1C/
If you have a single string, the best approach is probably to split it into lines (split /\n/), but you could also use a single regex match:
/^APPLE(?!.*(?:TEA|WINE1C)).*/mg
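The two-test idea (one positive match, one negated match) carries over to other languages. Below is a rough Python equivalent of the Perl condition, using the strings from the question; the list and variable names are only for illustration.

```python
import re

strings = [
    "APPLEJUCE1A", "APPLETREE2B", "APPLECAKE3C",
    "APPLETEA1B", "APPLEWINE3B", "APPLEWINE1C",
]

# Keep a string if it starts with APPLE and contains neither TEA nor WINE1C.
kept = [
    s for s in strings
    if re.match(r"APPLE", s) and not re.search(r"TEA|WINE1C", s)
]
print(kept)
```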
You can use
^APPLE(?!.*TEA)(?!.*WINE1C).*
See the regex demo.
Details:
^ - start of string
APPLE - a fixed string
(?!.*TEA) - no TEA allowed anywhere to the right of the current location
(?!.*WINE1C) - no WINE1C allowed anywhere to the right of the current location
.* - any zero or more chars other than line break chars as many as possible.
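A sketch of the same pattern in Python, applied to all sample strings at once. The MULTILINE flag and joining the strings with newlines are assumptions made for this example; without the flag, ^ only matches at the start of the whole string.

```python
import re

text = "\n".join([
    "APPLEJUCE1A", "APPLETREE2B", "APPLECAKE3C",
    "APPLETEA1B", "APPLEWINE3B", "APPLEWINE1C",
])

# Two independent negative lookaheads, each scanning the rest of the line.
# "." does not cross newlines, so each check stays within its own line.
pattern = re.compile(r"^APPLE(?!.*TEA)(?!.*WINE1C).*", re.MULTILINE)

kept_lines = pattern.findall(text)
print(kept_lines)
```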
If you don't want to match a string that has both of them (which is not in the current example data):
^APPLE(?!.*(WINE1C|TEA).*(?!\1)(?:TEA|WINE1C)).*
Explanation
^ Start of string
APPLE match literally
(?! Negative lookahead
.*(WINE1C|TEA) Capture either one of the values in group 1
.* Match 0+ characters
(?!\1)(?:TEA|WINE1C) Match either one of the values as long as it is not the same as previously matched in group 1
) Close the lookahead
.* Match the rest of the line
Regex demo
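To see the "both values" variant in action, here is a small Python sketch. The last two sample strings are hypothetical, since the original data contains no string with both TEA and WINE1C.

```python
import re

# Rejects a string only when it contains both TEA and WINE1C (in any order).
pattern = re.compile(r"^APPLE(?!.*(WINE1C|TEA).*(?!\1)(?:TEA|WINE1C)).*")

samples = ["APPLETEA1B", "APPLEWINE1C", "APPLETEAWINE1C", "APPLEWINE1CTEA"]
results = {s: bool(pattern.match(s)) for s in samples}
print(results)
```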

Capture number between two whitespaces (RegEx)

I have the following data:
SOMEDATA .test 01/45/12 2.50 THIS IS DATA
and I want to extract the number 2.50 out of this. I have managed to do this with the following RegEx:
(?<=\d{2}\/\d{2}\/\d{2} )\d+\.\d+
However that doesn't work for input like this:
SOMEDATA .test 01/45/12 2500 THIS IS DATA
In this case, I want to extract the number 2500.
I can't seem to figure out a regex rule for that. Is there a way to extract something between two spaces, that is, the text or number after the date up to the next whitespace? All I know is that the date will always have the same format, and that there will always be a space after the date and then a space after the number I want to extract.
Can someone help me out with this?
Capture number between two whitespaces
A whitespace is matched with \s, and non-whitespace with \S.
So, what you can use is:
\d{2}\/\d{2}\/\d{2} +(\S+)
                      ^^^
See the regex demo
The 1+ non-whitespace symbols are captured into Group 1.
If - for some reason - you need to only get the value as a whole match, use your lookbehind approach:
(?<=\d{2}\/\d{2}\/\d{2} )\S+
Or - if you are using PCRE - you may leverage the match reset operator \K:
\d{2}\/\d{2}\/\d{2} +\K\S+
                     ^^
See another demo
NOTE: the \K and a capture group approaches allow 1 or more spaces after the date and are thus more flexible.
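The capture-group variant can be sketched in Python as follows. Python's built-in re module does not support \K, so only the capture-group form is shown; the variable names are illustrative.

```python
import re

# Date, one or more spaces, then the run of non-whitespace in group 1.
pattern = re.compile(r"\d{2}/\d{2}/\d{2} +(\S+)")

lines = [
    "SOMEDATA .test 01/45/12 2.50 THIS IS DATA",
    "SOMEDATA .test 01/45/12 2500 THIS IS DATA",
]
values = [pattern.search(line).group(1) for line in lines]
print(values)
```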
I see some people helped you already, but if you would want an alternative working one for some reason, here's what works too :)
.+ \d+\/\d+\/\d+ (\d+[\.\d]*)
The .+ matches everything up to and including the space before the date,
then \d+\/\d+\/\d+ matches the date, followed by a space,
and the capturing group holds the number. As you can see, I made the fractional part optional, so both floating-point values and integers can be matched. Hope this helps!
Proof: https://regex101.com/r/fY3nJ2/1
Just make the fractional part optional:
(?<=\d{2}\/\d{2}\/\d{2} )\d+(?:\.\d+)?
Demo: https://regex101.com/r/jH3pU7/1
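A short Python sketch of the optional fractional part. The lookbehind here has a fixed width, which Python's re module requires; the variable names are made up for the example.

```python
import re

# Whole match is the number; (?:\.\d+)? makes the fractional part optional.
pattern = re.compile(r"(?<=\d{2}/\d{2}/\d{2} )\d+(?:\.\d+)?")

inputs = [
    "SOMEDATA .test 01/45/12 2.50 THIS IS DATA",
    "SOMEDATA .test 01/45/12 2500 THIS IS DATA",
]
numbers = [pattern.search(text).group(0) for text in inputs]
print(numbers)
```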
Update following clarifications in comments:
To match anything (but space) surrounded by spaces and prepended by date use:
(?<=\d{2}\/\d{2}\/\d{2} )\S+
Demo: https://regex101.com/r/jH3pU7/3
Rather than capture, you can make your entire match be the target text by using a look behind:
(?<=\d\d(\/\d\d){2} )\S+
This matches the first series of non-whitespace that follows a "date like" part.
Note also the reduction in the length of the "date like" pattern. You may consider using this part of the regex in whatever solution you use.

RegEx: Match everything up to the last space without including it

I'd like to match everything in a string up to the last space, but without including that space. For example, in the string below I would like to match only RENATA T.:
RENATA T. GROCHAL
So far I have ^(.+\s)(.+) However, it matches the last space and I don't want it to. RegEx should work also for other languages than English, as mine does.
EDIT: I didn't mention that the second capturing group should not contain a space – it should be GROCHAL not GROCHAL with a space before it.
EDIT 2: My new RegEx, based on what the two answers have provided, is: ^((.+)(?=\s))\s(.+) and the RegEx used to replace the matches is \3, \1. It produces the expected result:
GROCHAL, RENATA T.
Any improvements would be desirable.
^(.+)\s(.+)
with substitution string:
\2, \1
Update:
Another version that can collapse extra spaces between the 2 capturing groups:
^(.+?)\s+(\S+)$
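Both patterns and the substitution string can be tried with Python's re.sub, as in this sketch. The second input, with extra spaces, is a made-up example to show the collapsing behaviour.

```python
import re

# Greedy (.+) backtracks to the last whitespace character.
swapped = re.sub(r"^(.+)\s(.+)", r"\2, \1", "RENATA T. GROCHAL")

# Lazy (.+?) with \s+ swallows any run of spaces before the last word.
collapsed = re.sub(r"^(.+?)\s+(\S+)$", r"\2, \1", "RENATA T.   GROCHAL")

print(swapped)
print(collapsed)
```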
Use a positive lookahead assertion:
^(.+)(?=\s)
Capturing group 1 will contain the match.
I like using named capturing groups:
rawName = RENATA T. GROCHAL
RegexMatch(rawName, "O)^(?P<firstName>.+)\s(?P<lastName>.+)", match)
MsgBox, % match.lastName ", " match.firstName
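For comparison, the same named-group idea looks roughly like this in Python, which uses the same (?P<name>...) syntax. The group names here are chosen for the example.

```python
import re

raw_name = "RENATA T. GROCHAL"

# Everything before the last whitespace is "first", the remainder is "last".
m = re.match(r"^(?P<first>.+)\s(?P<last>.+)", raw_name)
formatted = f"{m.group('last')}, {m.group('first')}"
print(formatted)
```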

How to match, but not capture, part of a regex?

I have a list of strings. Some of them are of the form 123-...456. The variable portion "..." may be:
the string "apple" followed by a hyphen, e.g. 123-apple-456
the string "banana" followed by a hyphen, e.g. 123-banana-456
a blank string, e.g. 123-456 (note there's only one hyphen)
Any word other than "apple" or "banana" is invalid.
For these three cases, I would like to match "apple", "banana", and "", respectively. Note that I never want to capture the hyphen, but I always want to match it. If the string is not of the form 123-...456 as described above, then there is no match at all.
How do I write a regular expression to do this? Assume I have a flavor that allows lookahead, lookbehind, lookaround, and non-capturing groups.
The key observation here is that when you have either "apple" or "banana", you must also have the trailing hyphen, but you don't want to match it. And when you're matching the blank string, you must not have the trailing hyphen. A regex that encapsulates this assertion will be the right one, I think.
The only way not to capture something is using look-around assertions:
(?<=123-)((apple|banana)(?=-456)|(?=456))
Because even with non-capturing groups (?:…) the whole regular expression captures their matched contents. But this regular expression matches only apple or banana if it’s preceded by 123- and followed by -456, or it matches the empty string if it’s preceded by 123- and followed by 456.
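A sketch of how this look-around expression behaves, using Python's re module. The helper name variable_part is hypothetical; note that the pattern is not anchored, so it is applied with search as written in the answer.

```python
import re

pattern = re.compile(r"(?<=123-)((apple|banana)(?=-456)|(?=456))")

def variable_part(s):
    """Return the whole match (possibly empty), or None if nothing matches."""
    m = pattern.search(s)
    return m.group(0) if m else None

for s in ["123-apple-456", "123-banana-456", "123-456",
          "123-cherry-456", "123-apple456"]:
    print(s, repr(variable_part(s)))
```

The whole match contains only the word (or nothing), because the digits and hyphens are checked by assertions rather than consumed.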
Lookaround | Name                | What it Does
(?=foo)    | Lookahead           | Asserts that what immediately FOLLOWS the current position in the string is foo
(?<=foo)   | Lookbehind          | Asserts that what immediately PRECEDES the current position in the string is foo
(?!foo)    | Negative Lookahead  | Asserts that what immediately FOLLOWS the current position in the string is NOT foo
(?<!foo)   | Negative Lookbehind | Asserts that what immediately PRECEDES the current position in the string is NOT foo
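The four assertions can be compared side by side on one test string. This is an illustrative Python sketch; the sample text and dictionary keys are invented for the example.

```python
import re

# Index of each character: f0 o1 o2 b3 a4 r5 _6 b7 a8 r9 f10 o11 o12
text = "foobar barfoo"

positions = {
    "lookahead": re.search(r"bar(?=foo)", text).start(),            # bar followed by foo
    "lookbehind": re.search(r"(?<=foo)bar", text).start(),          # bar preceded by foo
    "negative lookahead": re.search(r"bar(?!foo)", text).start(),   # bar not followed by foo
    "negative lookbehind": re.search(r"(?<!foo)bar", text).start(), # bar not preceded by foo
}
print(positions)
```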
In JavaScript try: /123-(apple(?=-)|banana(?=-)|(?!-))-?456/
Remember that the result is in group 1
Debuggex Demo
Based on the input provided by Germán Rodríguez Herrera
Try:
123-(?:(apple|banana|)-|)456
That will match apple, banana, or a blank string, followed by zero or one hyphens. I was wrong about not needing a capturing group. Silly me.
I have modified one of the answers (by @op1ekun):
123-(apple(?=-)|banana(?=-)|(?!-))-?456
The reason is that the answer from @op1ekun also matches "123-apple456", without the hyphen after apple.
Try this:
/\d{3}-(?:(apple|banana)-)?\d{3}/
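A sketch of this optional non-capturing wrapper in Python. Using fullmatch to require that the whole string fits the pattern is an assumption added for the example, as is the helper name fruit_of.

```python
import re

# The hyphen lives inside the optional non-capturing group, so it is matched
# together with the word but never ends up in group 1.
pattern = re.compile(r"\d{3}-(?:(apple|banana)-)?\d{3}")

def fruit_of(s):
    """Return the captured word, "" for the blank case, or None if no match."""
    m = pattern.fullmatch(s)
    if m is None:
        return None
    return m.group(1) or ""

for s in ["123-apple-456", "123-banana-456", "123-456", "123-apple456"]:
    print(s, repr(fruit_of(s)))
```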
A variation of the expression by @Gumbo that makes use of \K for resetting match positions to prevent the inclusion of number blocks in the match. Usable in PCRE regex flavours.
123-\K(?:(?:apple|banana)(?=-456)|456\K)
Matches:
Match 1 apple
Match 2 banana
Match 3
By far the simplest (works for Python) is '123-(apple|banana)-?456'.

Matching on repeated substrings in a regex

Is it possible for a regex to match based on other parts of the same regex?
For example, how would I match lines that begin and end with the same sequence of 3 characters, regardless of what the characters are?
Matches:
abcabc
xyz abc xyz
Doesn't Match:
abc123
Undefined: (Can match or not, whichever is easiest)
ababa
a
Ideally, I'd like something in the Perl regex flavor. If that's not possible, I'd be interested to know if there are any flavors that can do it.
Use capture groups and backreferences.
/^(.{3}).*\1$/
The \1 refers back to whatever is matched by the contents of the first capture group (the contents of the ()). Regexes in most languages allow something like this.
You need backreferences. The idea is to use a capturing group for the first bit, and then refer back to it when you're trying to match the last bit. Here's an example of matching a pair of HTML start and end tags (from the link given earlier):
<([A-Z][A-Z0-9]*)\b[^>]*>.*?</\1>
This regex contains only one pair of parentheses, which capture the string matched by [A-Z][A-Z0-9]* into the first backreference. This backreference is reused with \1 (backslash one). The / before it is simply the forward slash in the closing HTML tag that we are trying to match.
Applying this to your case:
/^(.{3}).*\1$/
(Yes, that's the regex that Brian Carper posted. There just aren't that many ways to do this.)
A detailed explanation for posterity's sake (please don't be insulted if it's beneath you):
^ matches the start of the line.
(.{3}) grabs three characters of any type and saves them in a group for later reference.
.* matches anything for as long as possible. (You don't care what's in the middle of the line.)
\1 matches the group that was captured in step 2.
$ matches the end of the line.
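The steps above can be checked with a short Python sketch, since Python's re module supports the same \1 backreference syntax. The sample strings are the ones from the question; the variable names are illustrative.

```python
import re

# Capture the first three characters, then require them again at the end.
same_ends = re.compile(r"^(.{3}).*\1$")

samples = ["abcabc", "xyz abc xyz", "abc123"]
results = [bool(same_ends.search(s)) for s in samples]
print(results)
```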
For the same characters at the beginning and end:
/^(.{3}).*\1$/
This is a backreference.
This works:
my $test = 'abcabc';
print $test =~ m/^([a-z]{3}).*(\1)$/;
For matching the beginning and the end you should add ^ and $ anchors.