Can't find upper case letter in URL using Regex - regex

I have the following regex:
(href[\s]?=[\s]?)(\"[^"]*\/*[^"]*\")
using the following Test String:
href="http://mysite.io/Plan-documents"
I get two capturing groups. One with the href= and the other is everything past that. Now I want to only display matches where there is an uppercase letter anywhere in the second capture group. I tried:
(href[\s]?=[\s]?)(\"[A-Z]*[^"]*\/*[^"]*\")
to try and only have this regex come back with URLs that have uppercase in them. No luck. Even if I modify the test string to:
href="http://mysite.io/plan-documents"
I still get a match. I only want to match the href string if there is at least one uppercase letter in the string past the href=.
Thanks.

You don't get the right matches because everything between the double quotes in your second capturing group uses the quantifier *, which matches 0 or more times.
First the engine tries [A-Z]*. There is no uppercase char at that position, but that is fine because the quantifier allows 0 matches. The next part, [^"]*, then matches up to right before the next "
The following \/* matches nothing, which is also fine because of the 0+ quantifier, and the same holds for the final [^"]*.
What you might do instead is first match chars that are not uppercase until you reach an uppercase char, and then match until the closing double quote.
(href\s?=\s?)("[^A-Z\s]*[A-Z][^\s"]*")
Explanation
(href\s?=\s?) Capture group 1, match href and = with an optional whitespace char before and after the =
(" Start capture group and match "
[^A-Z\s]* Match 0+ times not an uppercase or whitespace char
[A-Z] Match 1 uppercase char
[^"\s]* Match 0+ times not " or a whitespace char
") Match " and close capture group
Without using groups, you could use:
href\s?=\s?"[^A-Z\s]*[A-Z][^\s"]*"
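As a quick check of the difference, here is a minimal Python sketch (the question has no language tagged, so Python's re module and the variable names are assumptions). It runs the attempted pattern and the pattern with the required [A-Z] against both test strings from the question:

```python
import re

# Pattern the asker tried: [A-Z]* may match zero chars, so no uppercase is required.
attempted = re.compile(r'(href[\s]?=[\s]?)(\"[A-Z]*[^"]*\/*[^"]*\")')
# Pattern from the answer: [A-Z] without a quantifier requires one uppercase char.
fixed = re.compile(r'(href\s?=\s?)("[^A-Z\s]*[A-Z][^\s"]*")')

upper = 'href="http://mysite.io/Plan-documents"'
lower = 'href="http://mysite.io/plan-documents"'

print(bool(attempted.search(lower)))   # True, the unwanted match
print(bool(fixed.search(lower)))       # False
print(fixed.search(upper).group(2))    # "http://mysite.io/Plan-documents"
```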

Related

What is the proper regex for capturing everything after "String" and between two delimiters ('=' and a non-alphanumeric character)?

Details={
AwsEc2SecurityGroup={GroupName=m.com-rds, OwnerId=123, VpcId=vpc-123,
IpPermissions=[{FromPort=3306, ToPort=3306, IpProtocol=tcp, IpRanges=[{CidrIp=1.1.1.1/32}, {CidrIp=2.2.2.2/32}, {CidrIp=0.0.0.0/0}, {CidrIp=3.3.3.3/32}],
UserIdGroupPairs=[{UserId=123, GroupId=sg-123abc}]}], IpPermissionsEgress=[{IpProtocol=-1, IpRanges=[{CidrIp=0.0.0.0/0}]}], GroupId=sg-123abc}},
Region=us-east-1, Id=arn:aws:ec2:us-east-1:123:security-group/sg-123abc}]
}
I want to capture exactly arn:aws:ec2:us-east-1:123:security-group/sg-123abc in this example. Generically, I want to capture the value of Id regardless of placement. My current solution is /Details={.*Id=(.*\w)/, but this only works if it's the last object in the data. How can I take into account the following potential scenario:
Id=arn:aws:ec2:us-east-1:123:security-group/sg-123abc, Thing=123abc}]
Your pattern contains .* twice. The first .* matches to the end of the line or string (depending on whether the dot matches a newline) and then backtracks until Id=(.*\w) can match, so you get the last occurrence of Id=. The second .* is greedy as well, so the capture runs up to the last word char that follows.
If you want to use a capture group, you can make the format and the allowed characters a bit more specific:
\bId=(\w+(?:[:\/-]\w+)+)
The pattern in parts
\b A word boundary to prevent a partial word match
Id= Match literally
( Capture group 1
\w+ Match 1+ word chars
(?:[:\/-]\w+)+ Repeat 1+ times either : / - and 1+ word chars
) Close group 1
Or if you know that it starts with Id=arn:
\bId=(arn:[\w:\/-]+)
Note that you only have to escape the / as \/ when the delimiters of the regex are forward slashes, but there is no language tagged.
You can use look-behind to check that there is the Id= prefix, and then match anything that is not a space, comma or closing brace:
(?<=\bId=)[^,}\s]*
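To compare the three patterns, here is a small Python sketch. Python is an assumption since no language is tagged, the sample string is a shortened version of the data in the question with the extra Thing= entry appended, and the variable names are made up for illustration. The / is left unescaped because Python has no regex delimiters:

```python
import re

# Shortened sample: Id is followed by another key instead of being last.
data = (
    "Details={AwsEc2SecurityGroup={GroupName=m.com-rds, OwnerId=123, VpcId=vpc-123}, "
    "Region=us-east-1, Id=arn:aws:ec2:us-east-1:123:security-group/sg-123abc, Thing=123abc}]}"
)

by_shape = re.compile(r"\bId=(\w+(?:[:/-]\w+)+)")    # word chars joined by : / or -
by_prefix = re.compile(r"\bId=(arn:[\w:/-]+)")       # value known to start with arn:
by_lookbehind = re.compile(r"(?<=\bId=)[^,}\s]*")    # full match, no group needed

print(by_shape.search(data).group(1))
print(by_prefix.search(data).group(1))
print(by_lookbehind.search(data).group(0))
```

All three print arn:aws:ec2:us-east-1:123:security-group/sg-123abc. The word boundary keeps OwnerId= and VpcId= from matching.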

Regular expression matching and remove spaces

Please how can I get the address using regex:
Address 123 Mayor Street, LAG Branch ABC
I used (?<=Address(\s))(.*(?=\s)) but it includes the spaces after "Address". I am trying to get an expression that extracts the address without the spaces. (There are a couple of spaces after "Address" before "123".)
Thanks!
The pattern (?<=Address(\s))(.*(?=\s)) that you tried asserts Address followed by a single whitespace char to the left, and then matches the rest of the line asserting a whitespace char to the right.
For the example data, the match ends right before the last whitespace char in the string, and it also contains the remaining whitespace chars that follow Address.
One option to match the bold parts in the question is to use a capture group.
\bAddress\s+([^,]+,\s*\S+)
The pattern matches:
\bAddress\s+ Match Address followed by 1+ whitespace chars
( Capture group 1
[^,]+, Match 1+ occurrences of any char except , and then match ,
\s*\S+ Match optional whitespace chars followed by 1+ non whitespace chars
) Close group 1
.NET regex demo (Click on the Table tab to see the value for group 1)
Note that \s and [^,] can also match a newline
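The capture group version also works outside .NET. A minimal sketch in Python, assuming five spaces after Address (the question only says "a couple") and a made-up variable name for the compiled pattern:

```python
import re

address = re.compile(r"\bAddress\s+([^,]+,\s*\S+)")

# Assumed input: several spaces between "Address" and "123".
line = "Address     123 Mayor Street, LAG Branch ABC"

m = address.search(line)
print(m.group(1))  # 123 Mayor Street, LAG
```

The \s+ outside the group consumes all spaces after Address, so group 1 starts at the first non-whitespace char.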
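A variant with a positive lookbehind to get a match only. The \s+ inside the lookbehind needs an engine that supports variable-length lookbehind, such as .NET: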
(?<=\bAddress\s+)[^,\s][^,]+,\s*\S+

Regex to capture everything after optional token

I have fields which contain data in the following possible formats (each line is a different possibility):
AAA - Something Here
AAA - Something Here - D
Something Here
Note that the first group of letters (AAA) can be of varying lengths.
What I am trying to capture is the "Something Here" or "Something Here - D" (if it exists) using PCRE, but I can't get the Regex to work properly for all three cases. I have tried:
- (.*) which works fine for cases 1 and 2 but obviously not 3;
(?<= - )(.*) which also works fine for cases 1 and 2;
(?! - )(.+)| - (.+) works for cases 2 and 3 but not 1.
I feel like I'm on the verge of it but I can't seem to crack it.
Thanks in advance for your help.
Edit: I realized that I was unclear in my requirements. If there is a trailing " - D" (the letter in the data is arbitrary but should only be a single character), that needs to be captured as well.
About the patterns that you tried:
- (.*) This pattern will match the first occurrence of - followed by the rest of the line. It will match too much for the second example as the .* will also match the second occurrence of -
(?<= - )(.*) This pattern will match the same as the first pattern without the - as it asserts that it should occur directly to the left
(?! - )(.+)| - (.+) This pattern uses a negative lookahead which asserts that what is directly to the right is not - between spaces. As none of the examples start with - , the lookahead succeeds at the start of the line, .+ matches the whole line, and the second part after the alternation | is never evaluated
If the first group of letters can be of varying length, you could make the match either specific matching 1 or more uppercase characters [A-Z]+ or 1+ word characters \w+.
To get a more broad match, you could match 1 or more non whitespace characters using \S+
^(?:\S+\h-\h)?\K\S+(?:\h(?!-\h)\S+)*
Explanation
^ Start of string
(?:\S+\h-\h)? Optionally match the first group of non whitespace chars followed by - between horizontal whitespace chars
\K Clear the match buffer (Forget what is currently matched)
\S+ Match 1+ non whitespace characters
(?: Non capture group
\h(?!-\h) Match a horizontal whitespace char and assert what is directly to the right is not - followed by another horizontal whitespace char
\S+ Match 1+ non whitespace chars
)* Close non capture group and repeat 0+ times to match more "words" separated by spaces
Edit
To match an optional hyphen and trailing single character, you could add an optional non capturing group (?:-\h\S\h*)?$ and assert the end of the string if the pattern should match the whole string:
^(?:\S+\h-\h)?\K\S+(?:\h(?!-\h)\S+)*\h*(?:-\h\S\h*)?$
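\K and \h are PCRE features. For engines without them, the same idea can be written with a capture group. The following is only a sketch in Python, under the assumptions that [ \t] is close enough to \h for this data and that group 1 takes the place of the \K reset. FIELD and extract are names made up for the example:

```python
import re

# Assumption: [ \t] approximates PCRE's \h, and group 1 replaces the \K reset.
FIELD = re.compile(
    r"^(?:\S+[ \t]-[ \t])?"          # optional leading token such as "AAA - "
    r"(\S+(?:[ \t](?!-[ \t])\S+)*"   # words separated by a single space or tab
    r"[ \t]*(?:-[ \t]\S[ \t]*)?)$"   # optional trailing " - D"
)

def extract(field):
    """Return the part after the optional leading token, or None if no match."""
    m = FIELD.match(field)
    return m.group(1) if m else None

for field in ["AAA - Something Here", "AAA - Something Here - D", "Something Here"]:
    print(repr(extract(field)))
```

This prints 'Something Here', 'Something Here - D' and 'Something Here' for the three formats in the question.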
You may use
^(?:.*? - )?\K.*?(?= - | *$)
^(?:.*?\h-\h)?\K.*?(?=\h-\h|\h*$)
Details
^ - start of string
(?:.*? - )? - an optional non-capturing group matching any 0+ chars other than line break chars, as few as possible, up to the first space, hyphen, space
\K - match reset operator
.*? - any 0+ chars other than line break chars as few as possible
(?= - | *$) - a space, hyphen, space or 0+ spaces up to the end of string should follow immediately on the right.
Note that \h matches any horizontal whitespace chars.
^(?:[A-Z]+ - \K)?.*\S
Since "Something Here" can be anything, there's no reason to specially describe the eventual last letter in the pattern. You don't need something more complicated.
With this pattern I assume that you are not interested in the trailing spaces, that's why I ended it with \S. If you want to keep them, remove the \S and change the previous quantifier to +.
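For an engine without \K, the same pattern can be expressed with a capture group. A minimal sketch in Python (the question asks for PCRE, so Python and the names SIMPLE and value are assumptions made for illustration):

```python
import re

# Group 1 takes the place of \K: the optional prefix is matched but not captured.
SIMPLE = re.compile(r"^(?:[A-Z]+ - )?(.*\S)")

def value(field):
    """Return everything after the optional uppercase prefix, or None."""
    m = SIMPLE.match(field)
    return m.group(1) if m else None

for field in ["AAA - Something Here", "AAA - Something Here - D", "Something Here"]:
    print(repr(value(field)))
```

The trailing " - D" needs no special handling because .* already runs to the last non-whitespace char of the line.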

How do I prioritize my "OR" condition in regex? I'm fairly new to coding

There are two cases from which I'll need to capture data.
Incoming this is a test string
this is a test string
I only want to capture "this is a test string".
The regex I wrote is:
((?<=Incoming)(?:\s+[A-Za-z-]+)+)|(?:[A-Za-z-]+\s+)
The regex works fine without the OR condition for case 1. But as soon as I add the OR condition, the entire string of case 1 gets captured instead of "this is a test string".
Is there any way to give priority to the first part of the regex? I want the first part to be tested first and, only if there is no match, the second part to be tried.
Thanks
If you only use the first part, the match can only start after encountering Incoming to the left. If you add the alternation, the second part can and will match the first word.
If you don't want to match Incoming but do want to match all the other words, you could use a negative lookahead.
Note that \s could also match a newline so I have used a space in the demo.
(?<!\S)(?!Incoming\s)[A-Za-z-]+(?:\s+(?!Incoming)[A-Za-z-]+)*
Explanation
(?<!\S) Assert a whitespace boundary on the left
(?!Incoming\s) Assert what is directly on the right is not Incoming and a whitespace char
[A-Za-z-]+ Match 1+ times a char A-Za-z or -
(?: Non capture group
\s+(?!Incoming) Match 1+ whitespace chars and check what is on the right is not Incoming
[A-Za-z-]+ Match 1+ times a char A-Za-z or -
)* Close group and repeat 0+ times
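A short Python sketch of the pattern against both cases from the question (Python is an assumption, as the question does not name a language, and words is a made-up name):

```python
import re

words = re.compile(r"(?<!\S)(?!Incoming\s)[A-Za-z-]+(?:\s+(?!Incoming)[A-Za-z-]+)*")

for text in ["Incoming this is a test string", "this is a test string"]:
    m = words.search(text)
    print(m.group(0))  # this is a test string (for both inputs)
```

In the first case the match cannot start inside Incoming, because (?<!\S) requires a whitespace char or the start of the string on the left.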

Seeking help on Regular expression

How can I extract this string from the text using regex
text: {abcdefgh="test-name-test-name-w2-a"} 54554654654 .654654654
Expected output: test-name-test-name-w2
Note: I tried this "([^\s]*)" and the output is test-name-test-name-w2-a. But need the output as I mentioned just above.
You can try with this regex
.*\"(.*)-.*\".*
You could extend the negated character class to also exclude - and ". Then use a repeating pattern using the same character class preceded with a -
The value is in the first capturing group.
"([^\s-"]+(?:-[^\s-"]+)*)-[^\s-"]+"
" Match a " char
( Capture group 1
[^\s-"]+ Match 1+ times any char except - " or a whitespace char
(?: Non capturing group
-[^\s-"]+ Match - and 1+ times any char except - " or a whitespace char
)* Close non capturing group, repeat 0+ times
) Close capture group
-[^\s-"]+ Match - and 1+ times any char except - " or a whitespace char
" Match a " char
Regex101 demo
(On regex101 at the FLAVOR panel you can switch between PCRE and Golang)
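If you try the pattern in Python instead (an assumption, the question does not name a language), note that Python's re module rejects a hyphen placed between \s and another char in a character class as a bad range. In this sketch the hyphen is therefore moved to the end of the class, which keeps the meaning the same:

```python
import re

text = '{abcdefgh="test-name-test-name-w2-a"} 54554654654 .654654654'

# Same pattern as above, with the hyphen moved to the end of each character class.
pattern = re.compile(r'"([^\s"-]+(?:-[^\s"-]+)*)-[^\s"-]+"')

m = pattern.search(text)
print(m.group(1))  # test-name-test-name-w2
```

The repeated group first takes every hyphen-separated part, then backtracks by one part so that the final -[^\s"-]+ can match -a before the closing quote.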
Update
To match where the word test is present but not, for example, test1, you could use a negative lookahead (?![^"\s]*\btest\w) to assert that test is not followed by a word character.
"(?![^"\s]*\btest\w)([^\s-"]+(?:-[^\s-"]+)*)-[^\s-"]+"