Regex modify capturing group

I have this Regex
^(?!.*\b(?:https?:\/\/|www\.))\w+(?:\.\w+)*\.\w{2,}(?:,\w+(?:\.\w+)*\.\w{2,})+$
that matches multiple URLs separated by commas.
It matches google.com,facebook.com but not URLs with extra characters, like google.com/home.php?,facebook.com/pages/#ref=?

Assuming your URLs won't contain a comma, you can add another optional non-capturing group in your regex like this:
^(?!.*\b(?:https?:\/\/|www\.))\w+(?:\.\w+)*\.\w{2,}(?:\/[^,]*)?(?:,\w+(?:\.\w+)*\.\w{2,}(?:\/[^,]*)?)*$
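Note the addition of an optional non-capturing group in the regex:
(?:\/[^,]*)?: matches text that starts with / followed by 0 or more of any character except a comma. The trailing ? makes this group optional. The final quantifier is also changed from + to *, so a single URL matches as well.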
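As a quick check, here is a minimal Python sketch of the extended pattern. Python's re module is only a convenient test harness here; the names URL_LIST, samples and results, and the sample inputs beyond the two from the question, are illustrative assumptions.

```python
import re

# Pattern from the answer above: every URL may carry an optional "/..." tail
# that runs up to the next comma (or the end of the string).
URL_LIST = re.compile(
    r"^(?!.*\b(?:https?:\/\/|www\.))"
    r"\w+(?:\.\w+)*\.\w{2,}(?:\/[^,]*)?"
    r"(?:,\w+(?:\.\w+)*\.\w{2,}(?:\/[^,]*)?)*$"
)

samples = [
    "google.com,facebook.com",
    "google.com/home.php?,facebook.com/pages/#ref=?",
    "google.com",                      # a single URL matches because of the final *
    "http://google.com,facebook.com",  # rejected by the negative lookahead
    "www.google.com,facebook.com",     # rejected by the negative lookahead
]

# Map each sample to True (matches) or False (does not match).
results = {s: bool(URL_LIST.match(s)) for s in samples}

for sample, ok in results.items():
    print(f"{sample!r}: {ok}")
```

The first three samples should match and the last two should not.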

Related

Regular Expression to prevent Email Name Spoofing

I want to match everything where .com or my\s?example appears in the display name of a From header and where the From email address is not .*#myexample.com.
It's easy when the display name is enclosed by quotation marks, but fails when the quotation marks are absent.
"(.*?(my\s?example|\.com).*?)"(?!\s?\<.*?\#myexample\.com\>)
Please see here:
https://regexr.com/5im6l
Everything works as desired except for the last line in the input field, where the double quotes are missing. I would like it to also match for this.
If your regex engine supports conditionals, and you want to capture what is between the double quotes when both are present, or the whole display name when there are no double quotes at the start and end, you might use:
\bFrom:\s(")?(.*?\b(my\s?example|\.com)\b.*?)(?(1)")\s+<(?!\s?[^\r\n<>]*#myexample\.com>)
The pattern matches:
\bFrom:\s(")? A word boundary, match From: and optionally capture " in group 1
(.*?\b(my\s?example|\.com)\b.*?) Capture group 2, match a part that contains either myexample or .com where the alternatives are in group 3
(?(1)") If clause, if group 1 exists, match " so it is not part of the capture group
\s+< Match 1+ whitespace chars and <
(?! Negative lookahead, assert that what is at the right is not
\s?[^\r\n<>]*#myexample\.com> Match #myexample\.com between the brackets
) Close lookahead
Group 2 contains the display name without the surrounding quotes, and group 3 contains the part that is either myexample or .com. Use a case-insensitive match so that Myexample is also found.
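Python's re module happens to support this kind of conditional, so the pattern can be tried as-is. The sketch below is only an illustration: the helper name display_name_if_spoofed and the sample headers are assumptions, and the # in the addresses is kept exactly as in the question.

```python
import re

# Conditional pattern from the answer: (?(1)") requires a closing quote only
# when the optional opening quote (group 1) was actually matched.
FROM_SPOOF = re.compile(
    r'\bFrom:\s(")?(.*?\b(my\s?example|\.com)\b.*?)(?(1)")\s+<'
    r'(?!\s?[^\r\n<>]*#myexample\.com>)',
    re.IGNORECASE,
)

def display_name_if_spoofed(header):
    """Return the suspicious display name (group 2), or None if no match."""
    m = FROM_SPOOF.search(header)
    return m.group(2) if m else None

# Hypothetical sample headers.
quoted = 'From: "My Example Support" <attacker#evil.com>'
unquoted = 'From: MyExample Billing <attacker#evil.com>'
legit = 'From: "My Example" <ceo#myexample.com>'

for header in (quoted, unquoted, legit):
    print(header, "->", display_name_if_spoofed(header))
```

The first two headers should yield the display name and the legitimate one should yield None, because the negative lookahead rejects addresses ending in #myexample.com.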
If \K is supported (it forgets what has been matched so far) and you want the display name as the match itself rather than in a capture group, another option is:
\bFrom:\s"?\K.*?\b(?:my\s?example|\.com)\b.*?(?="?\s<(?![^<>]*#myexample\.com>))
Note that you don't have to escape \< \> and \#

greedy-but-not-too-greedy regex: need to exclude last occurrence of optional character

(it must be something trivial and answered many times already - but I can't formulate the right search query, sorry!)
From text like prefix start.then.123.some-more.text. All the rest I need to extract start.then.123.some-more.text, i.e. a string that has no spaces, has periods in the middle, and may or may not have a trailing period (which should not be included). I struggle to build a regex that handles both cases:
prefix (start[0-9a-zA-Z\.\-]+)\..* - this works correctly only if there's a trailing period,
prefix (start[0-9a-zA-Z\.\-]+)\.?.* - I thought adding ? after \. will make it optional - but it doesn't...
P.S. My environment is MS VBA script, I'm using CreateObject("vbscript.regexp") - but I guess the question is relevant to other regex engines as well.
If you don’t want to include “prefix” you can use:
(?<=prefix )\S*?(?=\.?\s)
EDIT:
Even simpler, without lookbehinds or lookaheads, if you're using capturing groups anyway:
prefix (\S*\w)
This will stop at the last letter, number, or underscore. If you want to be able to capture a hyphen as the last character, you can change \w above to [\w-].
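A minimal sketch of the capturing-group version, written in Python rather than VBScript purely for illustration. The names TOKEN and extract are assumptions, and the second sample is the question's text with the trailing period removed.

```python
import re

# \S* greedily grabs the whole non-space run, then backtracks until the last
# character is a word character, which drops a trailing period.
TOKEN = re.compile(r"prefix (\S*\w)")

def extract(text):
    """Return the token after 'prefix ', without a trailing period."""
    m = TOKEN.search(text)
    return m.group(1) if m else None

with_dot = "prefix start.then.123.some-more.text. All the rest"
without_dot = "prefix start.then.123.some-more.text All the rest"

print(extract(with_dot))
print(extract(without_dot))
```

Both calls should return the same token, since the greedy \S* only gives back the final period when one is present.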
You could match prefix, and use a capturing group to first match chars A-Za-z0-9.
Then you can repeat the previous pattern in a group preceded by either a . or - using a character class.
prefix ([0-9a-zA-Z]+(?:[.-][0-9a-zA-Z]+)+)
In parts
prefix Match literally
( Capture group 1
[0-9a-zA-Z]+ Match 1+ times any of the listed chars
(?: Non capture group
[.-][0-9a-zA-Z]+ match either a . or - and again match 1+ times any of the listed chars
)+ Close group and repeat 1+ times to match at least a dot or hyphen
) Close group
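The stricter pattern broken down above can be tried the same way. This is a sketch in Python for illustration only; STRICT, extract_strict and the "no separator" sample are assumed names and data.

```python
import re

# Alphanumeric chunks joined by "." or "-"; a trailing separator is never
# consumed because every separator must be followed by another chunk.
STRICT = re.compile(r"prefix ([0-9a-zA-Z]+(?:[.-][0-9a-zA-Z]+)+)")

def extract_strict(text):
    """Return capture group 1, or None when the shape does not match."""
    m = STRICT.search(text)
    return m.group(1) if m else None

with_trailing_dot = "prefix start.then.123.some-more.text. All the rest"
no_separator = "prefix word only"  # hypothetical input without any . or -

print(extract_strict(with_trailing_dot))
print(extract_strict(no_separator))
```

Because the inner group is repeated 1+ times, a token without any dot or hyphen is not matched at all, which is the main behavioural difference from the looser \S*\w approach.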
If the value in the capturing group should begin with start:
prefix (start(?:[.-][0-9a-zA-Z]+)+)

Regex Extract a string between two words containing a particular string

I have the below string
abc-12d-ef-oy-5678-xyz--**--20190120075439322am--**--ghi-66d-ef-oy-8877-sdf--**--sfdfdsgfg--**--20190120075765487am
It is a multi-character-delimited string, delimited by '--**--'. I am trying to extract the first and second words that contain the -oy- tag. This is a column in a table. I am using the regex_extract method, but I am not able to extract a string that contains one given string and ends with another.
Here is one pattern that I tried: .*(.*oy.*)--
If the -oy- can not be at the start or at the end, you could use this pattern to match the 2 hyphen delimited strings with -oy-:
[a-z0-9]+(?:-[a-z0-9]+)*-oy(?:-[a-z0-9]+)+
Regex details
[a-z0-9]+ Match 1+ times a-z0-9
(?: Non capturing group
-[a-z0-9]+ Match - and 1+ times a-z0-9
)* Close group and repeat 0+ times
-oy Match literally
(?:-[a-z0-9]+)+ Repeat 1+ times a group which will match - and 1+ times a-z0-9
You can extend the character class, for example to [A-Za-z0-9], to also allow other characters such as uppercase chars.
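To see what the pattern picks out of the sample row, here is a small Python sketch. The question uses a regex_extract function in a table context, so Python's re.findall is only a stand-in; OY_FIELD, row and oy_fields are assumed names.

```python
import re

# Pattern from the answer: hyphen-separated lowercase/digit chunks with "-oy"
# in the middle (at least one chunk before it and one after it).
OY_FIELD = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*-oy(?:-[a-z0-9]+)+")

# The sample string from the question, split across two literals for width.
row = ("abc-12d-ef-oy-5678-xyz--**--20190120075439322am--**--"
       "ghi-66d-ef-oy-8877-sdf--**--sfdfdsgfg--**--20190120075765487am")

# With no capturing groups, findall returns the full text of each match.
oy_fields = OY_FIELD.findall(row)
print(oy_fields)
```

The delimiter never ends up inside a match because * is not in the character class, so each match stops at the first --**--.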
If the matches should be between delimiters, you could use a positive lookbehind and positive lookahead and an alternation:
(?<=^|--\\*\\*--)[a-z0-9]+(?:-[a-z0-9]+)*-oy(?:-[a-z0-9]+)+(?=--\\*\\*--|$)
You can use this regex which will match string containing -oy- and capture them in group1 and group2.
^.*?(\w+(?:-\w+)*-oy-\w+(?:-\w+)*).*?(\w+(?:-\w+)*-oy-\w+(?:-\w+)*)
This regex basically matches two strings delimiter separated containing -oy- using this (\w+(?:-\w+)*-oy-\w+(?:-\w+)*) to capture the text.
Are you able to select values from capture groups?
(?:--\*\*--|^)(.*?-oy-.*?)(?:--\*\*--|$)
?: - Non-capture group, matches the delimiter, begin of line, or end of line but does not create a capture group
*? - Lazy match so you only grab the contents of the field
https://regex101.com/r/aUAvcx/1
--- Second stab at this follows ---
This is convoluted. Hopefully you can use Lookahead and Lookbehind. The last problem I had was the final record was being "Greedy" and sucking up the field before it too. So I had to add an exclusion in the capture group for your delimiter.
See if this works for you.
(?<=--\*\*--|^)((?:(?:(?!--\*\*--).)*)-oy-(?:(?:(?!--\*\*--).)*))(?=--\*\*--|$)
https://regex101.com/r/aUAvcx/3
Basically the (?: are so we are not getting too many capture groups to work with.
There are three parts to this:
The lookbehind - Make sure the field is framed by the delimiter (or start of line)
The capture group - Grab the contents of the field, making sure a delimiter isn't sucked up into it
The lookahead - Make sure the field is framed by the delimiter (or end of line)
As far as the capture group goes, I check the left and right side of the -oy- to make sure the delimiter isn't there.
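The three-part pattern can be sketched in Python with one adaptation: Python's re module only accepts fixed-width lookbehinds, so (?<=--\*\*--|^) is rewritten here as (?:(?<=--\*\*--)|^). That rewrite, the names FIELD_WITH_OY and fields, and the second test string are assumptions made for this illustration.

```python
import re

# (?:(?!--\*\*--).)* consumes one character at a time, but only while the
# delimiter does not start at the current position, so a field can never
# swallow a delimiter.
FIELD_WITH_OY = re.compile(
    r"(?:(?<=--\*\*--)|^)"                         # framed by delimiter or start
    r"((?:(?!--\*\*--).)*-oy-(?:(?!--\*\*--).)*)"  # field containing -oy-
    r"(?=--\*\*--|$)"                              # framed by delimiter or end
)

row = ("abc-12d-ef-oy-5678-xyz--**--20190120075439322am--**--"
       "ghi-66d-ef-oy-8877-sdf--**--sfdfdsgfg--**--20190120075765487am")

# With exactly one capturing group, findall returns that group per match.
fields = FIELD_WITH_OY.findall(row)
print(fields)

# Hypothetical row where the -oy- field is the last one.
print(FIELD_WITH_OY.findall("x--**--a-oy-b"))
```

Compared with matching character classes, this version accepts any characters inside a field, at the cost of a lookahead check on every character.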

Regular expression to exclude group with 0 and more occurrence issue

I need to extract 1234567 from below URLs
http://www.test.in/some--wonders-1234567---2
http://www.test.in/some--wonders-1234567
I tried with .*\-([0-9]+)(?:-{2,}2)?.
but for the first URL it returned 2, even though that part is in a non-capturing group.
Please suggest a solution. I have been digging into this for a long time without getting anywhere.
Try this one:
.*?\-([0-9]+)(?:-{2,}2|$)
It makes the leading .* lazy. You can also drop that part entirely, with the same effect:
\-([0-9]+)(?:-{2,}2|$)
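The difference is easiest to see side by side. In this Python sketch the question's pattern is taken without its final period (assumed to be sentence punctuation), and the variable names are illustrative.

```python
import re

url = "http://www.test.in/some--wonders-1234567---2"

# Greedy .* runs to the end and backtracks only as far as the LAST hyphen
# that is followed by digits, so group 1 ends up holding the trailing 2.
greedy = re.search(r".*\-([0-9]+)(?:-{2,}2)?", url).group(1)

# Lazy .*? stops at the FIRST hyphen followed by digits, and (?:-{2,}2|$)
# forces the digits to be followed by the suffix or the end of the string.
lazy = re.search(r".*?\-([0-9]+)(?:-{2,}2|$)", url).group(1)

# Without any leading .* the search itself finds the first workable hyphen.
bare = re.search(r"\-([0-9]+)(?:-{2,}2|$)", url).group(1)

print(greedy, lazy, bare)
```

So the non-capturing group was never the problem: the greedy prefix simply left only the final -2 for the capturing group to match.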
If your regex engine supports variable-length negative lookbehinds (many do not), you can do it this way:
(?<!\d+-+)\d+
It matches any non-empty digit string that is not preceded by digits followed by minuses.
Big advantage is that you don't have to use groups here - regex itself returns what you want.
You could match a - followed by one or more digits which you could capture in a group ([0-9]+). This group will contain the value you want to extract.
Then an optional part (?:-{2,}[0-9]+)? that would match ---2 followed by asserting the end of the line $.
-(\d+)(?:-{2,}\d+)?$
Explanation
- Match literally
(\d+) Capture one or more digits in a group
(?: Non capturing group
-{2,} Match 2 or more times -
\d+ Match one or more digits
)? close non capturing group and make it optional
$ Assert position at the end of the line
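A short Python sketch of the anchored pattern run against both URLs from the question. The third URL, with a different numeric suffix, is a made-up example showing that the suffix is not hard-coded to 2; extract_id and ID_AT_END are assumed names.

```python
import re

# A hyphen, the captured digits, an optional "--...digits" suffix, then the
# end of the string.
ID_AT_END = re.compile(r"-(\d+)(?:-{2,}\d+)?$")

def extract_id(url):
    """Return the captured digits, or None if the URL has no such tail."""
    m = ID_AT_END.search(url)
    return m.group(1) if m else None

urls = [
    "http://www.test.in/some--wonders-1234567---2",
    "http://www.test.in/some--wonders-1234567",
    "http://www.test.in/some--wonders-1234567---15",  # hypothetical variant
]

ids = [extract_id(u) for u in urls]
print(ids)
```

The search tries start positions from left to right, so the hyphen before 1234567 is found before the hyphens of the optional suffix.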

Repeated capturing group PCRE

Can't get why this regex (regex101)
/[\|]?([a-z0-9A-Z]+)(?:[\(]?[,][\)]?)?[\|]?/g
captures all the input, while this (regex101)
/[\|]+([a-z0-9A-Z]+)(?:[\(]?[,][\)]?)?[\|]?/g
captures only |Func
Input string is |Func(param1, param2, param32, param54, param293, par13am, param)|
Also how can i match repeated capturing group in normal way? E.g. i have regex
/\(\(\s*([a-z\_]+){1}(?:\s+\,\s+(\d+)*)*\s*\)\)/gui
And input string is (( string , 1 , 2 )).
Regex101 says "a repeated capturing group will only capture the last iteration. Put a capturing group around the repeated group to capture all iterations...". I've tried to follow this tip, but it didn't help.
Your /[\|]+([a-z0-9A-Z]+)(?:[\(]?[,][\)]?)?[\|]?/g regex does not match because you did not define a pattern to match the words inside parentheses. You might fix it as \|+([a-z0-9A-Z]+)(?:\(?(\w+(?:\s*,\s*\w+)*)\)?)?\|?, but all the values inside parentheses would be matched into one single group that you would have to split later.
It is not possible to get an arbitrary number of captures with a PCRE regex, as in case of repeated captures only the last captured value is stored in the group buffer.
What you may do is get multiple matches with preg_match_all, capturing the initial delimiter.
So, to match the second string, you may use
(?:\G(?!\A)\s*,\s*|\|+([a-z0-9A-Z]+)\()\K\w+
Details:
(?:\G(?!\A)\s*,\s*|\|+([a-z0-9A-Z]+)\() - either the end of the previous match (\G(?!\A)) and a comma enclosed with 0+ whitespaces (\s*,\s*), or 1+ | symbols (\|+), followed with 1+ alphanumeric chars (captured into Group 1, ([a-z0-9A-Z]+)) and a ( symbol (\()
\K - omit the text matched so far
\w+ - 1+ word chars.
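The \G and \K version needs PCRE (for example via preg_match_all) and is not supported by Python's re module. The other route mentioned earlier, capturing the whole parameter list in one group and splitting it afterwards, can be sketched in Python as follows. CALL, name, params and last_only are assumed names, and the last pattern is a made-up miniature that only demonstrates the "last iteration" behaviour, which Python shares with PCRE.

```python
import re

# Fixed pattern from the answer: group 1 is the function name, group 2 is the
# whole comma-separated parameter list as ONE string.
CALL = re.compile(r"\|+([a-z0-9A-Z]+)(?:\(?(\w+(?:\s*,\s*\w+)*)\)?)?\|?")

text = "|Func(param1, param2, param32, param54, param293, par13am, param)|"

m = CALL.search(text)
name = m.group(1)
# Split the single captured list into the individual parameters.
params = re.split(r"\s*,\s*", m.group(2))
print(name, params)

# A repeated capturing group keeps only its last iteration.
last_only = re.search(r"\((?:\s*(\w+)\s*,?)+\)", "(a, b, c)").group(1)
print(last_only)
```

Splitting after the match keeps the regex simple; the \G approach instead returns each parameter as its own match in a single pass.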