Could someone explain the regex /(.*)\.(.*)/? - regex

I want to get the file extension in Groovy with a regex, for a file name such as South.6987556.Input.csv.cop.
http://www.regexplanet.com/advanced/java/index.html shows me that the second group does contain the cop extension, which is what I want.
0: [0,27] South.6987556.Input.csv.cop
1: [0,23] South.6987556.Input.csv
2: [24,27] cop
I just don't understand why the result isn't
0: [0,27] South.6987556.Input.csv.cop
1: [0,5] South
2: [6,27] 6987556.Input.csv.cop
What regex would produce this kind of result?

Here is a visualization of this regex
(.*)\.(.*)
In words:
(.*) matches anything, as much as possible, and captures it
\. matches one period, not captured (no parentheses)
(.*) matches anything again, possibly nothing, and captures it
In your case this is
(.*) : South.6987556.Input.csv
\. : .
(.*) : cop
It isn't South and 6987556.Input.csv.cop because the first (.*) is greedy and must be followed by a period, so the engine gives it the longest possible string, which ends at the last period.
Your intended result is produced by this regex: (.*?)\.(.*). The ? after a quantifier (in this case *) switches that quantifier to non-greedy (lazy) matching, so the shortest matching string is used. Quantifiers are greedy by default in most regex engines.
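As a minimal sketch of the difference, the two patterns can be compared side by side. The snippet uses Python's re module rather than Groovy (an assumption made purely for illustration; greedy and lazy quantifiers behave the same way for these two patterns), and the variable names are hypothetical.

```python
import re

# File name taken from the question.
filename = "South.6987556.Input.csv.cop"

# Greedy: the first (.*) grabs as much as it can, so the split
# happens at the LAST period.
greedy = re.match(r"(.*)\.(.*)", filename)

# Lazy: (.*?) grabs as little as it can, so the split happens
# at the FIRST period.
lazy = re.match(r"(.*?)\.(.*)", filename)

print(greedy.groups())  # ('South.6987556.Input.csv', 'cop')
print(lazy.groups())    # ('South', '6987556.Input.csv.cop')
```

So the greedy form is the one to keep if the goal is the file extension, and the lazy form is the one that yields South as the first group.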

To get the desired output, your regex should be:
((.*?)\.(.*))
Explanation:
( group and capture to \1:
( group and capture to \2:
.*? any character except \n (0 or more times); the ? after * makes the regex engine do a non-greedy match (shortest possible match)
) end of \2
\. '.'
( group and capture to \3:
.* any character except \n (0 or more times)
) end of \3
) end of \1

Related

What is the proper regex for capturing everything after "String" and between two delimiters ('=' and a non-alphanumeric character)?

Details={
AwsEc2SecurityGroup={GroupName=m.com-rds, OwnerId=123, VpcId=vpc-123,
IpPermissions=[{FromPort=3306, ToPort=3306, IpProtocol=tcp, IpRanges=[{CidrIp=1.1.1.1/32}, {CidrIp=2.2.2.2/32}, {CidrIp=0.0.0.0/0}, {CidrIp=3.3.3.3/32}],
UserIdGroupPairs=[{UserId=123, GroupId=sg-123abc}]}], IpPermissionsEgress=[{IpProtocol=-1, IpRanges=[{CidrIp=0.0.0.0/0}]}], GroupId=sg-123abc}},
Region=us-east-1, Id=arn:aws:ec2:us-east-1:123:security-group/sg-123abc}]
}
I want to capture exactly arn:aws:ec2:us-east-1:123:security-group/sg-123abc in this example. Generically, I want to capture the value of Id regardless of placement. My current solution is /Details={.*Id=(.*\w)/, but this only works if it's the last object in the data. How can I take into account the following potential scenario:
Id=arn:aws:ec2:us-east-1:123:security-group/sg-123abc, Thing=123abc}]
Your pattern contains .* twice. The first .* matches to the end of the line or string (depending on whether the dot matches a newline) and then backtracks to the last position where the rest of the pattern, Id=(.*\w), can match. The second .* is greedy as well, so the group runs up to the last word character.
If you want to use a capture group, you can make the format and the allowed characters a bit more specific:
\bId=(\w+(?:[:\/-]\w+)+)
The pattern in parts
\b A word boundary to prevent a partial word match
Id= Match literally
( Capture group 1
\w+ Match 1+ word chars
(?:[:\/-]\w+)+ Repeat 1+ times either : / - and 1+ word chars
) Close group 1
Or if you know that it starts with Id=arn:
\bId=(arn:[\w:\/-]+)
Note that you only have to escape the / when the delimiters of the regex are forward slashes, but no language is tagged.
You can use look-behind to check that there is the Id= prefix, and then match anything that is not a space, comma or closing brace:
(?<=\bId=)[^,}\s]*
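The patterns above can be tried out with a short sketch. It assumes Python's re module (the question has no language tag, so this is only one possible target), uses a shortened, hypothetical sample string modeled on the second scenario, escapes the { in the original pattern for clarity, and leaves / unescaped because Python has no regex delimiters.

```python
import re

# Hypothetical, shortened sample in which Id is NOT the last key.
data = ("Details={GroupId=sg-123abc, Region=us-east-1, "
        "Id=arn:aws:ec2:us-east-1:123:security-group/sg-123abc, Thing=123abc}]")

# Original approach: both .* are greedy, so the group runs on to the
# last word character in the string.
original = re.search(r"Details=\{.*Id=(.*\w)", data)

# Capture-group version: \b keeps GroupId= from matching, and the value
# is limited to word chars separated by ':', '/' or '-'.
specific = re.search(r"\bId=(\w+(?:[:/-]\w+)+)", data)

# Look-behind version: match everything up to a comma, brace or space.
lookbehind = re.search(r"(?<=\bId=)[^,}\s]*", data)

print(original.group(1))    # over-captures: includes ", Thing=123abc"
print(specific.group(1))    # arn:aws:ec2:us-east-1:123:security-group/sg-123abc
print(lookbehind.group(0))  # arn:aws:ec2:us-east-1:123:security-group/sg-123abc
```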

Regex get all before first occurrence of character

I know it's been asked many many times. I tried my best but the result wasn't perfect.
Regex
/(\(\s*["[^']*]*)(.*\/logo\.png.*?)(["[^']*]*\s*\))/gmi
Regex101 Link: https://regex101.com/r/0f8Q08/1
It should capture all separately.
(../asdasd/dasdas/logo.png)
(../asdasd/dasdas/logo.png)
( '../logo.png' )
Right now it's capturing as a whole.
(../asdasd/dasdas/logo.png) (../asdasd/dasdas/logo.png) ( '../logo.png' )
What I need is for the regex to stop at the first closing parenthesis ).
You can use
(\(\s*(["']?))([^"')]*\/logo\.png[^"')]*)(\2\s*\))
Details
(\(\s*(["']?)) - Group 1: (, any zero or more whitespaces, and then Group 2 capturing either a ' or a " optionally
([^"')]*\/logo\.png[^"')]*) - Group 3: any zero or more chars other than ", ' and ), then a /logo.png string, and then again any zero or more chars other than ", ' and )
(\2\s*\)) - Group 4: the same value as in Group 2, zero or more whitespaces, and a ) char.
The issue in your pattern is that the .* matches too much. After the opening parenthesis, you should exclude ( and ) from the match to avoid overmatching across the separate parts.
You don't need all those capture groups if you want to match each parenthesized part as a whole.
You can use 1 capture group for the optional opening quote, and a backreference to it to match the same optional closing quote.
\(\s*(["']?)[^()'"]*\/logo\.png[^()'"]*\1\s*\)
If you also want to match when the opening and closing quotes are not the same:
\(\s*["']?[^()'"]*\/logo\.png[^()'"]*["']?\s*\)
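As a rough illustration of the backreference version, here is a sketch that assumes Python's re module (the question itself uses /.../gmi delimiters, so the original target is probably a different engine). The sample text is the one from the question; the variable names are hypothetical.

```python
import re

# Sample text from the question: three parenthesized parts on one line.
text = "(../asdasd/dasdas/logo.png) (../asdasd/dasdas/logo.png) ( '../logo.png' )"

# Group 1 captures the optional opening quote; \1 requires the same
# quote (or nothing) before the closing parenthesis. Excluding ( ) ' "
# from the free-text parts keeps each match inside one pair of parentheses.
pattern = re.compile(r"""\(\s*(["']?)[^()'"]*/logo\.png[^()'"]*\1\s*\)""",
                     re.IGNORECASE)

# finditer + group(0) returns the whole match rather than just group 1.
matches = [m.group(0) for m in pattern.finditer(text)]
for part in matches:
    print(part)
```

Each parenthesized part comes back as its own match instead of one match spanning the whole line.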
If you want to keep your own regex, you can change the first .* to [^)]* so the match stays between the parentheses
(\(\s*["[^']*]*)([^)]*\/logo\.png.*?)(["[^']*]*\s*\))

Regex doesn't ignore the optional groups

I'm trying to create a regex to match my URL and its optional groups. The regex works fine if the URL is complete, but the optional groups are not optional at all.
Regex :
\/(.+)(?:\/(.+))(?:(?:\?(.+)))
Urls to catch :
/taxi
/taxi/lyon
/taxi/lyon?coordinates=7542
https://regex101.com/r/NKFkwq/4/
As you can see, the third line is matched, but I'd like the first and second to match too.
I thought the ?: would be enough to do that, but I missed something...
Thanks a lot for your help!
Cheers
EDIT and answer
Thanks to the comments for helping me. Here is the regex I was after: https://regex101.com/r/NKFkwq/8
Indeed, ?: makes a group non-capturing; it does not make it optional.
Your pattern consists of capturing and non capturing groups. The (?: denotes a non capturing group.
If you want to match all 3 lines, you could match the part starting from the first forward slash and make the part starting from the second forward slash optional.
^/[^\s/]+(?:/[^\s/]+)?$
^ Start of string
/[^\s/]+ Match / and match 1+ times any char except a whitespace or /
(?: Non capturing group
/[^\s/]+ Match / and match 1+ times any char except a whitespace or /
)? Close non capturing group and make it optional
$ End of string
If you want to have capturing groups, but don't want to match /taxi?coordinates=7542 you could nest the groups and make them optional as well.
^/\w+(/\w+(\?\S*)?)?$
^ Start of string
/\w+ Match / and 1+ word chars
( Capture group 1
/\w+ Match / and 1+ word chars
( Capture group 2
\?\S* Match ? and 0+ times a non whitespace char
)? Close group 2
)? Close group 1
$ End of string
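The nested optional groups can be checked with a small sketch. It assumes Python's re module (the question does not name a language) and adds /taxi?coordinates=7542 as a fourth input to show that it is rejected; the variable names are hypothetical.

```python
import re

# Nested optional capture groups: the query string is only allowed
# when the second path segment is present.
nested = re.compile(r"^/\w+(/\w+(\?\S*)?)?$")

urls = [
    "/taxi",
    "/taxi/lyon",
    "/taxi/lyon?coordinates=7542",
    "/taxi?coordinates=7542",  # should NOT match
]

results = {}
for url in urls:
    m = nested.match(url)
    # groups() gives None for an optional group that did not take part.
    results[url] = m.groups() if m else None
    print(url, "->", results[url])
```

An optional group that is skipped shows up as None, which is how the calling code can tell which parts of the URL were present.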

Regex (PCRE) exclude certain words from match result

I need to get only the names from this string (they were in bold in the original post):
author={Trainor, Sarah F and Calef, Monika and Natcher, David and Chapin, F Stuart and McGuire, A David and Huntington, Orville and Duffy, Paul and Rupp, T Scott and DeWilde, La'Ona and Kwart, Mary and others},
Is there a way to skip all the 'and' and 'others' words in the match result?
I tried lots of things, but nothing works as I expect:
(?<=\{).+?(?<=and\s).+(?=\})
Instead of trying to exclude words, you may be better off writing a pattern that expects the specific name format in the examples you've provided:
([A-Z]+[A-Za-z]*('[A-Za-z]+)*, [A-Z]? ?[A-Z]+[A-Za-z]*('[A-Za-z]+)*( [A-Z])?)
https://regex101.com/r/9LGqn3/3
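A quick way to try the format-based pattern above is the following sketch. It assumes Python's re module instead of PCRE (this particular pattern uses no PCRE-only syntax), and the variable names are hypothetical.

```python
import re

# Entry from the question.
entry = ("author={Trainor, Sarah F and Calef, Monika and Natcher, David and "
         "Chapin, F Stuart and McGuire, A David and Huntington, Orville and "
         "Duffy, Paul and Rupp, T Scott and DeWilde, La'Ona and Kwart, Mary "
         "and others},")

# Format-based pattern from the answer: "Surname, [Initial ]Given[ Initial]".
name = re.compile(
    r"([A-Z]+[A-Za-z]*('[A-Za-z]+)*, [A-Z]? ?[A-Z]+[A-Za-z]*('[A-Za-z]+)*( [A-Z])?)"
)

# The pattern has several groups, so take group(1) of each match
# rather than relying on findall's tuples.
names = [m.group(1) for m in name.finditer(entry)]
for n in names:
    print(n)
```

Because every name must start with an uppercase letter, the lowercase words author, and and others never match.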
You could make use of \G and a capturing group to get you the matches.
The values are in capturing group 1.
(?:author={|\G(?!^))([^\s,]+,(?:\h+[^\s,]+)+)\h+and\h+(?=[^{}]*\})
About the pattern
(?: Non capturing group
author={ Match literally
| Or
\G(?!^) Assert position at the end of previous match, not at the start
) Close non capturing group
( Capture group 1
[^\s,]+, Match 1+ chars other than a whitespace char or comma, then match a comma
(?:\h+[^\s,]+)+ Repeat 1+ times matching 1+ horizontal whitespace chars followed by 1+ chars other than a whitespace char or comma
) Close group 1
\h+and\h+ Match and between 1+ horizontal whitespace chars
(?=[^{}]*\}) Assert what is on the right is a closing }

Regex - match a pair of repeated characters after space or beginning of line

I've been trying to learn regular expressions for some time now, and sometimes I run into some stuff I find hard to understand.
Early today I was trying to match a pair of repeated characters after space or beginning of line, so I first found a way to match space or beginning of line: (^|\s)
Then, to match a pair of (alphanumeric) characters: (\w)\1+
Both work very well, but when I put them together (^|\s)(\w)\1+, it just doesn't work.
Do you know why that is wrong, and what is the best way to achieve what I want?
By the way, I'm using this website to test my expressions.
Thank you very much!
Try this regex:
(?:^|\s)(\w)\1
The problem is that (^|\s) is a capturing group, so it becomes \1 and (\w) becomes \2. Your \1+ therefore refers to the whitespace or line start rather than the word character, and the regex doesn't work.
(?:..) is a non-capturing group, hence (\w) remains \1 (the first capturing group).
(^|\s)(\w)\1+, it just doesn't work
@anubhava gave you the answer.
This commented example might help as well.
Your original pattern:
( ^ | \s ) # (1), BOL or whitespace
( \w ) # (2), Word character
\1+ # backreference to group 1 (BOL or whitespace)
With the backreference corrected:
( ^ | \s ) # (1), BOL or whitespace
( \w ) # (2), Word character
\2+ # backreference to group 2 (Word character)
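Both fixes can be checked with a minimal sketch. It assumes Python's re module (the question does not say which engine the test website uses) and a hypothetical sample string.

```python
import re

# Hypothetical sample: "ll" is a repeated pair but not at a word start,
# while "aa" directly follows a space.
text = "hello aab"

# Fix 1: make the prefix non-capturing so (\w) is group 1.
non_capturing = re.search(r"(?:^|\s)(\w)\1", text)

# Fix 2: keep the capturing prefix and refer to group 2 instead.
renumbered = re.search(r"(^|\s)(\w)\2", text)

print(repr(non_capturing.group(0)), non_capturing.group(1))  # ' aa' a
print(repr(renumbered.group(0)), renumbered.group(2))        # ' aa' a

# A pair at the very beginning of the string is found via the ^ branch.
at_start = re.search(r"(?:^|\s)(\w)\1", "aab")
print(at_start.group(0))  # aa
```

Note that the match includes the leading whitespace; if only the pair is wanted, read it from the group rather than from the whole match.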