Validate incomplete Regex - regex

Let's say we have a Regex, in my case it's one I found to match UK car registration plates:
^([A-Z]{3}\s?(\d{3}|\d{2}|d{1})\s?[A-Z])|([A-Z]\s?(\d{3}|\d{2}|\d{1})\s?[A-Z]{3})|(([A-HK-PRSVWY][A-HJ-PR-Y])\s?([0][2-9]|[1-9][0-9])\s?[A-HJ-PR-Z]{3})
A typical UK car registration is
HG53CAY
This is matched correctly by the regex, but what i'd like to do is find a way to match any prefix substring of this, so the following would all be valid:
H, HG, HG5, HG53, HG53C, HG53CA, HG53CAY
Is there a suggested way to achieve this?

Firstly I'd rewrite your regexp to look like this:
^([A-Z]{3}\s?(\d{1,3})\s?[A-Z])|([A-Z]\s?(\d{1,3})\s?[A-Z]{3})|(([A-HK-PRSVWY][A-HJ-PR-Y])\s?([0][2-9]|[1-9][0-9])\s?[A-HJ-PR-Z]{3})
as the \d{3}|\d{2}|d{1} parts make no sense and should be written \d{1,3}.
Rewriting the regexp like
^([A-Z]{0,3}\s?(\d{0,3})\s?[A-Z]?)|([A-Z]\s?(\d{0,3})\s?[A-Z]{0,3})|(([A-HK-PRSVWY][A-HJ-PR-Y]?)\s?([0]?[2-9]?|[1-9]?[0-9]?)\s?[A-HJ-PR-Z]{0,3})
should have the desired effect of allowing matching of only the beginning of a registration, but unfortunately it's no longer guaranteed that the full registration will be a valid one, as I had to make most characters optional.
You could possibly try something like this
^(([A-Z]{3})|[A-Z]{1,2}$)\s?((\d{1,3})|$))...
to make it require either that each part is complete, or that it is incomplete but followed by "end of string", represented by the $ in the regexp.

Related

Regex: Non fixed-width look around assertions?

My college asked my to provide him with a regex that only matches if the test-string endswith
.rar or .part1.rar or part01.rar or part001.rar (and so on).
Should match:
foo.part1.rar
xyz.part01.rar
archive.rar
part3_is_the_best.rar
Should not match:
foo.r61
bar.part03.rar
test.sfv
I immediately came up with the regex \.(part0*1\.)?rar$. But this does match for bar.part03.rar.
Next I tried to add a negative look behind assertion: .*(?<!part\d*)\.(part\0*1\.)?rar$ That didn't work either, because look around assertions need to be fixed width.
Then I tried using a regex-conditional. But that didn't work either.
So my question: Can this even be solved by using pure regex?
An answer should either contain a link to regex101.com providing a working solution, or explain why it can't work by using pure regex.
You could use lookahead to verify the one case that fails your original regex (.rar with .part part that isn't 0*1) is discredited:
^(?!.*\.part0*[^1]\.rar$).*\.(part0*1\.)?rar$
See it in action
This is an old question, but here's another approach:
(?:\.part0*1\.rar|^(?<!\.)\w+\.rar)$
The idea is to match either:
A string that ends with .part0*1.rar (ie foo.part01.rar, foo.part1.rar, bar.part001.rar), OR
A string that ends with .rar and doesn't contain any other dots (.) before that.
Works on all your test cases, plus your extra foo.part19.rar.
https://regex101.com/r/EyHhmo/2

Why /^[a-zA-Z0-9]+#[a-zA-Z0-9]\.(com)|(edu)|(org)$/i does not work as expected

I have this regex for email validation (assume only x#y.com, abc#defghi.org, something#anotherhting.edu are valid)
/^[a-zA-Z0-9]+#[a-zA-Z0-9]\.(com)|(edu)|(org)$/i
But #abc.edu and abc#xyz.eduorg are both valid as to the regex above. Can anyone explain why that is?
My approach:
there should be at least one character or number before #
then there comes #
there should be at least one character or number after # and before .
the string should end with either edu, com, or org.
Try this
/^[a-zA-Z0-9]+#[a-zA-Z0-9]+\.(com|edu|org)$/i
and it should become clear - you need to group those alternatives, otherwise you can match any string that has 'edu' in it, or any string that ends with org. To put it another way, your version matches any of these patterns
^[a-zA-Z0-9]+#[a-zA-Z0-9]\.(com)
(edu)
(org)$
It's worth pointing out that the original poster is using this as a regex learning exercise. This would be a terrible regex for actual production use! It's a thorny problem - see Using a regular expression to validate an email address for a lot more depth.
Your grouping parentheses are incorrect:
/^[a-zA-Z0-9]+#[a-zA-Z0-9]+\.(com|edu|org)$/i
Can also just use one case as you're using the i modifier:
/^[a-z0-9]+#[a-z0-9]+\.(com|edu|org)$/i
N.B. you were also missing a + from the second set, I assume this was just a typo...
What you have written is the equivalent of matching something that:
Begins with [a-zA-Z0-9]+#[a-zA-Z0-9].com
contains edu
or ends with org
What you were looking for was:
/^[a-z0-9]+#[a-z0-9]+\.(com|edu|org)$/i
Your regex looks ok.
I guess you are looking using a find function in stead of a match function
Without specifying what you use it is a bit difficult, but in Python you would write
import re
pattern = re.compile ('^[a-zA-Z0-9]+#[a-zA-Z0-9]\.(com)|(edu)|(org)$')
re.match('#abc.edu') # fails, use this to validate an input
re.search('#abc.edu') # matches, finds the edu
Try to use it:
[a-zA-Z0-9]+#[a-zA-Z0-9]+.(com|edu|org)+$
U forget about + modificator if u want to catch any combinations of (com|edu|org)
Upd: as i see second [a-zA-Z0-9] u missed + too

Regex with negative look behind still matches certain strings in Scala

I have a text, that contains url domains in the following form:
[second_level_domain].[top_level_domain]
This could be for instance test.com, amazon.com or something similar, but not more complex stuff like e.g. www.test.com or de.wikipedia.org (no sub level domains!).
It could be that in front of the dot (between second and top level domain) or after the dot is an optional space like test . com, but this doesn't always have to be the case.
However what I don't want to match is if the second level domain and top level domain belong to an e-mail address like for instance hello#test.org. So in this case it shouldn't extract test.org
I wrote the following regex now:
(?<!#)(([a-zA-Z\d]+(?:-[a-zA-Z\d]+)*(?<!www))\s?\.\s?(com|net|org))
With the negative look behind I want to make sure, that in front of the second level domain shouldn't be an #. However it doesn't really do what I expected. For instance on the text hello#test.org it extracts est.org instead of extracting nothing. So, apparently it only looks at the first character when it checks if there is an # in front. But when I use the following regex it seems to work on the text hello#test.org:
(?<!#)((test)\s?\.\s?(com|net|org))
Here I hard coded the second level domain, with which it works. However if I exchange that with a regex that matches all kinds of second level domains
([a-zA-Z\d]+(?:-[a-zA-Z\d]+)*(?<!www))
it doesn't work anymore. It looks like that the negative look behind is already used after the first character is matched and that it doesn't wait with the negative look behind until everything is matched.
As an alternative I could match a bit more and then use the groups afterwards to build my desired match, but I want to avoid that if possible. I would like to match it correctly immediately. I'm not an expert in regular expressions and apparently I have not understood look arounds properly yet. Is there a way to write a regex, which behaves like I want?
(?:^|(?<=\s))((?:[a-zA-Z\d]+(?:-[a-zA-Z\d]+)*(?<!www))\s?\.\s?(?:com|net|org))
Add anchors to disallow partial matches.See demo.
https://www.regex101.com/r/rK5lU1/34

I want a regular expression that only matches domain names with one period in them

I want it to catch things like somedomain.com/folder/path, but not something like domain.sub.other.com. The regex I have so far is almost complete, it just doesn't sift out the multi-domain urls:
^(.*)://(?!(.{2,3})\.(.*)(.{2,3})(.*)
Is there any way to sift out on multiple periods?
Instead of .{2,3}, you want something like this: [^.]{2,3} - this excludes the period (no need to escape as it has no special meaning in this context in a regular expression) from that particular match. Overall you'd have something like:
://[^.]+\.[^.]{2,3}(/.*)?
Except obviously you're missing things like *.info by doing that....
Found a solution that is working given a variety of test scenarios:
^(.*)://([^.]+)\.([^(\?|/|\r|\n|\.)]+)((/|\?|$)+)(.*)$
Here, the 2nd to the last group is matching against a potential forward slash, question mark or end of string, working together with the group before it which does not allow matches which include '.'
So the final effect is that it only matches URLs with a two-part domain such as 'domain.com' and there aren't any limits placed on string length.

Regex with URLs - syntax

We're using a proprietary tracking system that requires the use of regular expressions to load third party scripts on the URLs we specify.
I wanted to check the syntax of the regex we're using to see if it looks right.
To match the following URL
/products/18/indoor-posters
We are using this rule:
.*\/products\/18\/indoor-posters.*
Does this look right? Also, if there was a query parameter on the URL, would it still work? e.g.
/products/18/indoor-posters?someParam=someValue
There's another URL to match:
/products
The rule for this is:
.*\/products
Would this match correctly?
Well, "right" is a relative term. Usually, .* is not a good idea because it matches anything, even nothing. So while these regexes will all match your example strings, they'll also match much more. The question is: What are you using the regexes for?
If you only want to check whether those substrings are present anywhere in the string, then they are fine (but then you don't need regex anyway, just check for substrings).
If you want to somehow check whether it's a valid URL, then no, the regexes are not fine because they'd also match foo-bar!$%(§$§$/products/18/indoor-postersssssss)(/$%/§($/.
If you can be sure that you'll always get a correct URL as your input and just want to check whether they match you pattern, then I'd suggest
^.*\/products$
to match any URL that ends in /products, and
^.*\/products\/18\/indoor-posters(?:\?[\w-]+=[\w-]+)?$
to match a URL that ends in /products/18/indoor-posters with an optional ?name=value bit at the end, assuming only alphanumeric characters are legal for name and value.