I wrote a URL validation regex, but it is very slow

I know this is slow because of ([\.\-][a-z0-9])*. But I don't know how to optimize it.
^https:\/\/([a-z0-9]+([\.\-][a-z0-9])*)+(\.([a-z]{2,11}|[0-9]{1,5}))(:[0-9]{1,5})?(\/.*)?$

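You don't need the nested quantifier )*)+ in your pattern. A repeated group inside another repeated group lets the engine split the same input in many different ways, which can lead to catastrophic backtracking.
Note that you only have to escape the forward slash if the delimiters for the regex are also /, and you don't have to escape the characters in [\.\-]; writing [.-] is enough.
If you don't need the capture groups afterwards, you can make them non-capturing with (?:...).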
^https:\/\/[a-z0-9]+(?:[.-][a-z0-9]+)*\.(?:[a-z]{2,11}|[0-9]{1,5})(?::[0-9]{1,5})?(\/.*)?$
The pattern matches:
^ Start of string
https:\/\/ Match https:// As you only want to match https
[a-z0-9]+ Match 1+ times any of the listed
(?:[.-][a-z0-9]+)* Optionally repeat matching . or - and 1+ times any of the listed
\.(?:[a-z]{2,11}|[0-9]{1,5}) Match a . followed by either 2-11 chars a-z or 1-5 digits
(?::[0-9]{1,5})? Optionally match : and 1-5 digits
(\/.*)? Optionally match / and the rest of the line
$ End of string
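As a rough check of the rewritten pattern, here is a minimal Python sketch. The helper name is_valid_url and the sample URLs are made up for illustration, and the forward slashes are left unescaped because Python regexes have no / delimiters.

```python
import re

# Rewritten pattern from the answer above: one flat repetition instead of
# a quantified group nested inside another quantified group.
URL_RE = re.compile(
    r"^https://[a-z0-9]+(?:[.-][a-z0-9]+)*"
    r"\.(?:[a-z]{2,11}|[0-9]{1,5})(?::[0-9]{1,5})?(/.*)?$"
)


def is_valid_url(url: str) -> bool:
    """Hypothetical helper: True if the string matches URL_RE."""
    return URL_RE.match(url) is not None


# Illustrative inputs, not taken from the question.
samples = [
    "https://example.com",
    "https://sub.example.com:8080/path?x=1",
    "http://example.com",      # wrong scheme
    "https://-example.com",    # host may not start with a hyphen
]
for s in samples:
    print(s, is_valid_url(s))

# A long non-matching input of the kind that tends to trigger heavy
# backtracking with nested quantifiers; here it is rejected after a
# linear number of attempts.
long_input = "https://" + "a" * 10_000 + "!"
print(is_valid_url(long_input))
```

The long input is only run against the rewritten pattern; running it against the original nested version is not recommended.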

Related

Using Regex to only accept x amount of a certain character

I have this Regex pattern:
\b(?:[A-Z\d]+[\/\-])+[A-Z\d]+\b
And it collects everything I need perfectly, but then also grabs some things I don't want. I'm wondering how to write in there something like I do want to accept "-", but no more than 5 at a time. Same with "/" but maybe no more than 1 for those. Here's an example of what it's grabbing that I do want vs what it's grabbing that I don't want:
Yes:
AIR-CT2504-50-K9
1000BASE-T
ISR4451-X-SEC/K9
No:
0/1/10/5/50
2B108-250A-2B-2B-2B-250A-2B-2B-2B-250A-2B-2B
2022/10/28

If you don't want partial matches, you might use anchors and negative lookaheads to reject strings with too many hyphens or forward slashes.
As the strings do not seem to contain spaces, and - and / can be mixed, you could use:
^(?!(?:[^\s-]*-){5})(?!(?:[^\s\/]*\/){2})(?:[A-Z\d]+[\/-])+[A-Z\d]+$
The pattern matches:
^ Start of string
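(?!(?:[^\s-]*-){5}) Assert that the string does not contain 5 or more hyphens, where [^\s-] matches a non whitespace char except for - (use {6} instead if exactly 5 hyphens should still be accepted)
(?!(?:[^\s\/]*\/){2}) Assert that the string does not contain 2 or more forward slashes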
(?:[A-Z\d]+[\/-])+ Repeat 1+ times matching 1+ chars A-Z or digits followed by either / or -
[A-Z\d]+ match 1+ chars A-Z or a digit
$ End of string
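The pattern can be checked against the examples from the question with a short Python sketch. The variable names are illustrative, and the / is left unescaped because Python has no regex delimiters.

```python
import re

# Anchored pattern from the answer: two negative lookaheads limit the
# number of hyphens and forward slashes before the actual match runs.
PART_RE = re.compile(
    r"^(?!(?:[^\s-]*-){5})(?!(?:[^\s/]*/){2})(?:[A-Z\d]+[/-])+[A-Z\d]+$"
)

wanted = ["AIR-CT2504-50-K9", "1000BASE-T", "ISR4451-X-SEC/K9"]
unwanted = [
    "0/1/10/5/50",
    "2B108-250A-2B-2B-2B-250A-2B-2B-2B-250A-2B-2B",
    "2022/10/28",
]

for s in wanted + unwanted:
    print(s, bool(PART_RE.match(s)))
```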

greedy-but-not-too-greedy regex: need to exclude last occurrence of optional character

(it must be something trivial and answered many times already - but I can't formulate the right search query, sorry!)
From text like prefix start.then.123.some-more.text. All the rest I need to extract start.then.123.some-more.text - i.e. a string that has no spaces, has periods in the middle, and may or may not have a trailing period (the trailing period should not be included). I struggle to build a regex that would catch both cases:
prefix (start[0-9a-zA-Z\.\-]+)\..* - this works correctly only if there's a trailing period,
prefix (start[0-9a-zA-Z\.\-]+)\.?.* - I thought adding ? after \. will make it optional - but it doesn't...
P.S. My environment is MS VBA script, I'm using CreateObject("vbscript.regexp") - but I guess the question is relevant to other regex engines as well.

If you don’t want to include “prefix” you can use:
(?<=prefix )\S*?(?=\.?\s)
EDIT:
Even simpler, without lookbehinds or lookaheads, if you're using capturing groups anyway:
prefix (\S*\w)
This will stop at the last letter, number, or underscore. If you want to be able to capture a hyphen as the last character, you can change \w above to [\w-].
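A small Python sketch of the capturing-group version, run on the question's text with and without the trailing period. The function name extract is made up for illustration, and Python's re is used here rather than the VBScript engine from the question.

```python
import re

# \S* is greedy, then backtracks until the last char is a word char,
# which drops a trailing period.
PREFIX_RE = re.compile(r"prefix (\S*\w)")


def extract(text: str):
    """Hypothetical helper: return capture group 1, or None if no match."""
    m = PREFIX_RE.search(text)
    return m.group(1) if m else None


print(extract("prefix start.then.123.some-more.text. All the rest"))
print(extract("prefix start.then.123.some-more.text All the rest"))
```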

You could match prefix, and use a capturing group to first match chars A-Za-z0-9.
Then you can repeat the previous pattern in a group preceded by either a . or - using a character class.
prefix ([0-9a-zA-Z]+(?:[.-][0-9a-zA-Z]+)+)
In parts
prefix Match literally
( Capture group 1
[0-9a-zA-Z]+ Match 1+ times any of the listed chars
(?: Non capture group
[.-][0-9a-zA-Z]+ match either a . or - and again match 1+ times any of the listed chars
)+ Close group and repeat 1+ times to match at least a dot or hyphen
) Close group
If the value in the capturing group should begin with start:
prefix (start(?:[.-][0-9a-zA-Z]+)+)
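For comparison, a minimal Python sketch of the stricter pattern, which only accepts a . or - when it is followed by more alphanumeric chars. The name extract_strict is illustrative.

```python
import re

# Each . or - must be followed by 1+ alphanumerics, so a trailing period
# can never be part of the capture.
STRICT_RE = re.compile(r"prefix ([0-9a-zA-Z]+(?:[.-][0-9a-zA-Z]+)+)")


def extract_strict(text: str):
    """Hypothetical helper: return capture group 1, or None if no match."""
    m = STRICT_RE.search(text)
    return m.group(1) if m else None


print(extract_strict("prefix start.then.123.some-more.text. All the rest"))
print(extract_strict("prefix start.then.123.some-more.text All the rest"))
print(extract_strict("prefix start"))  # no dot or hyphen, so no match
```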

Regex - Allow dash character in body text but not at start or end

How can I allow one subsequent dash character in the body part but not at start or end?
https://regex101.com/r/D8MAXP/8/
One subsequent example: https://regex101.com/r/D8MAXP/9/
Regex
^((https?):\/\/)?(www.)?([a-z0-9-])+\.[a-z]+(\/[a-zA-Z0-9#]+\/?)*$
Allow:
http://www.b-c.de
https://www.b-c.de
www.b-c.de
b-c.de
Don't allow:
https://www.foufos-.gr
http://www.foufos-.gr
https://-foufos.gr
http://foufos-.gr
www.-foufos.gr
www.foufos-.gr
www.-foufos.gr
foufos-.gr
-foufos.gr

Instead of matching the - in the character class, you could take it out and use a repeating group that puts the hyphen before the character class.
Use a * to repeat the group 0+ times, or a ? to match it zero or 1 times.
For the example data in the question, you might use
^((https?):\/\/)?(www\.)?[a-z0-9]+(?:-[a-zA-Z]+)*(?:\.[a-z]+)+(\/[a-zA-Z0-9#]+\/?)*$
For all the links in the regex101 example, you might use for example 2 negative lookaheads:
^(?!ww?\.)(?:https?:\/\/)?(?:www\.)?(?!.*\.www\b)[a-z0-9]+(?:-[a-zA-Z]+)*(?:\.[a-z]+)+(?:\/[a-zA-Z0-9#]+\/?)*$
In parts
^ Start of string
(?!ww?\.) Assert not starting with 1 or 2 times a w char followed by a .
(?:https?:\/\/)? Optionally match the protocol part
(?:www\.)? Optionally match www.
(?!.*\.www\b) Assert that .www does not occur again further to the right
[a-z0-9]+(?:-[a-zA-Z]+)* Match 1+ chars a-z0-9, optionally followed by repeated groups of a - and 1+ chars a-zA-Z
(?:\.[a-z]+)+ Repeat 1+ times a dot and 1+ chars a-z
(?:\/[a-zA-Z0-9#]+\/?)* Repeat 0+ times matching / and 1+ times any of the listed, followed by an optional /
$ End of string
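The first pattern can be run against the allow and don't-allow lists from the question with a short Python sketch. The variable names are illustrative, and / is left unescaped because Python has no regex delimiters.

```python
import re

# First pattern from the answer: the hyphen is only accepted between
# two runs of characters, never at the start or end of the label.
DOMAIN_RE = re.compile(
    r"^((https?)://)?(www\.)?[a-z0-9]+(?:-[a-zA-Z]+)*"
    r"(?:\.[a-z]+)+(/[a-zA-Z0-9#]+/?)*$"
)

allowed = ["http://www.b-c.de", "https://www.b-c.de", "www.b-c.de", "b-c.de"]
rejected = [
    "https://www.foufos-.gr",
    "http://www.foufos-.gr",
    "https://-foufos.gr",
    "http://foufos-.gr",
    "www.-foufos.gr",
    "www.foufos-.gr",
    "foufos-.gr",
    "-foufos.gr",
]

for s in allowed + rejected:
    print(s, bool(DOMAIN_RE.match(s)))
```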

Finding a regex for an ID format with hyphens

I'm trying to validate an id using regex. The id is in the below format.
alphaNumeric-alphaNumeric-alphaNumeric (And the total length should be 14, and there should be two hyphens)
Below examples are valid formats
AS12-AS12-AB1C
AS-12ASBC-1234
N-IKNKL-A2LI40
Here the catch is hyphens should not come in the beginning as well as in the end. And also no two hyphens should be together.
Up until now I'm using positive look ahead to do the length match (?=^.{14}$). And matching the other hyphens logic using (?=^[^-]*-[^-]*-[^-]*$)[a-zA-Z0-9-]+. So the regex I'm using is
(?=^.{14}$)(?=^[^-]*-[^-]*-[^-]*$)[a-zA-Z0-9-]+
And the problem here is hyphens can come in the beginning as well as at the end, as well as two hyphens can come together, both of which should not be valid and it's against my id validation check.

You may use this regex:
^(?=.{14}$)[a-zA-Z0-9]+(?:-[a-zA-Z0-9]+){2}$
RegEx Details:
^: Match Start
(?=.{14}$): Lookahead condition to assert that the input is exactly 14 characters long
[a-zA-Z0-9]+: Match 1 or more of alphanumeric characters
(?:: Start a non-capturing group
-: Match a hyphen
[a-zA-Z0-9]+: Followed by 1 or more of alphanumeric characters
){2}: End non-capturing group. Match 2 instances of this group
$: Match end
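A minimal Python sketch that runs this regex on the valid ids from the question. The invalid ids are made up for illustration, one per rule (leading hyphen, double hyphen, wrong length).

```python
import re

# The lookahead fixes the total length at 14; the rest enforces three
# alphanumeric runs separated by exactly two single hyphens.
ID_RE = re.compile(r"^(?=.{14}$)[a-zA-Z0-9]+(?:-[a-zA-Z0-9]+){2}$")

valid_ids = ["AS12-AS12-AB1C", "AS-12ASBC-1234", "N-IKNKL-A2LI40"]

# Made-up counterexamples, not from the question.
invalid_ids = [
    "-AS12-AS12-AB1",   # 14 chars, but starts with a hyphen
    "AS12--AS12AB1C",   # 14 chars, but two hyphens together
    "AS12-AS12-AB1",    # only 13 chars
]

for s in valid_ids + invalid_ids:
    print(s, bool(ID_RE.match(s)))
```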

RegEx: don't capture match, but capture after match

There are a thousand regular expression questions on SO, so I apologize if this is already covered. I did look first.
I have string:
Name Subname 11X22 88X620 AB33(20) YA5619 77,66
I need to capture this string: YA5619
What I am doing is just finding AB33(20) and after this I am capturing until first white space. But AB33(20) can be AB-33(20) or AB33(-20) or AB33(-1).
My preg_match regex is: (?<=\bAB\d{2}\(\d{2}\)\s).+?(?=\s)
Why am I getting an error when I change \d{2} to \d+?
For the final result I thought this regex would work, but it doesn't:
(?<=\bAB-?\d+\(-?\d+\)\s).+?(?=\s)
Any ideas what I am doing wrong?

With most regex flavors, lookbehind needs to evaluate to a fixed-length sequence, so you can't use variable quantifiers like * or + or even {1,2}.
Instead of using lookaround, you can simply match your marker pattern and then forget it with \K.
AB-?\d+(?:\(-?\d+\))? \K[^ ]+
demo: https://regex101.com/r/8XXngH/1
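The same restriction can be seen outside PHP. The sketch below uses Python's standard re module, which also requires fixed-width lookbehind and does not support \K, so a capture group stands in for \K here. This is a Python adaptation, not the PCRE pattern itself, and the variable names are illustrative.

```python
import re

text = "Name Subname 11X22 88X620 AB33(20) YA5619 77,66"

# Fixed-width lookbehind, as in the question's first attempt: compiles fine.
fixed = re.compile(r"(?<=\bAB\d{2}\(\d{2}\)\s).+?(?=\s)")
print(fixed.search(text).group())

# Variable-width lookbehind: Python's re rejects it at compile time.
try:
    re.compile(r"(?<=\bAB-?\d+\(-?\d+\)\s).+?(?=\s)")
    lookbehind_ok = True
except re.error as exc:
    lookbehind_ok = False
    print("compile error:", exc)

# Match the marker normally and capture what follows instead of using \K.
captured = re.search(r"AB-?\d+(?:\(-?\d+\))? ([^ ]+)", text).group(1)
print(captured)
```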

It depends on the language. In .NET, for example, the pattern matches because variable-length lookbehind is supported.
Another solution might be to use a character class and add the characters you would allow to match. Then match a whitespace character, reset the match with \K, and match \S+, which matches 1+ times a non-whitespace character.
\bAB[()\d-]+\s\K\S+
Explanation
\bAB Match literally prepended with word boundary to prevent AB being part of a larger match.
[()\d-]+ Match 1+ times any of the listed character in the character class
\s Match a whitespace char (or \s+ to match 1 or more)
\K Reset the starting point of the reported match (forget what was matched so far)
\S+ Match 1+ times a non-whitespace character
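A short Python sketch of the character-class approach, tried on all marker variants mentioned in the question. Python's re has no \K, so a capture group around \S+ is used instead as an adaptation. The helper name after_marker is made up.

```python
import re

# Character class accepts digits, parentheses and hyphens after AB;
# the capture group replaces \K, which Python's re does not support.
MARKER_RE = re.compile(r"\bAB[()\d-]+\s(\S+)")


def after_marker(text: str):
    """Hypothetical helper: return the token after the AB marker, or None."""
    m = MARKER_RE.search(text)
    return m.group(1) if m else None


markers = ["AB33(20)", "AB-33(20)", "AB33(-20)", "AB33(-1)"]
for marker in markers:
    line = f"Name Subname 11X22 88X620 {marker} YA5619 77,66"
    print(marker, after_marker(line))
```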
