Regular expressions matching date formats and URLs - regex

Hi, I want a regular expression that allows dates to be entered like this:
01/01/1900 or 01/01/70. I have the following, but I'm not sure how to make it accept either 4 or 2 digits at the end.
^([1-9]|0[1-9]|1[012])[- /.]([1-9]|0[1-9]|[12][0-9]|3[01])[- /.][0-9]{4}$
The other one I would like to know about is for URLs.
I have no idea about this one. How do I make it match correct URLs?
Thank you

This should match two or four digit numbers:
\d{2}(\d{2})?
Your full regex would be something like this:
^([1-9]|0[1-9]|1[012])[- /.]([1-9]|0[1-9]|[12][0-9]|3[01])[- /.]\d{2}(\d{2})?$
URLs are hard to test. http://localhost is a valid URL and so is https://test.example.co.uk:443/index.ece?foo=bar. I would look for something in your language to test this for you, or do a very simple test like this (you will have to escape some special characters depending on the regex engine you use):
^https?://
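If the language happens to be Python, "something in your language" could be the standard library's urllib.parse. This is only a rough sketch: the function name looks_like_http_url is made up for illustration, and checking for an http(s) scheme plus a host is an assumption about what counts as a "correct" URL here.

```python
from urllib.parse import urlsplit


def looks_like_http_url(text: str) -> bool:
    """Return True if text parses as an http(s) URL that has a host part."""
    try:
        parts = urlsplit(text)
    except ValueError:
        # urlsplit raises ValueError for some malformed inputs
        return False
    # Accept only http/https and require a non-empty host name
    return parts.scheme in ("http", "https") and bool(parts.hostname)


print(looks_like_http_url("http://localhost"))
print(looks_like_http_url("https://test.example.co.uk:443/index.ece?foo=bar"))
print(looks_like_http_url("not a url"))
```

This accepts both example URLs above and rejects strings without an http(s) scheme, which is about as much as the simple ^https?:// test does, plus the host check.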

To modify your regex so that it takes either 2 or 4 digits at the end, you can try this:
^([1-9]|0[1-9]|1[012])[- /.]([1-9]|0[1-9]|[12][0-9]|3[01])[- /.]([0-9]{4}|[0-9]{2})$
For URLs, you can try (from here):
(http|https)://([\w-]+\.)+[\w-]+(/[\w- ./?%&=]*)?
or have a look at this S.O. question.

^([1-9]|0[1-9]|1[012])[- /.]([1-9]|0[1-9]|[12][0-9]|3[01])[- /.]([0-9]{4}|[0-9]{2})$

Well, is ([0-9]{4}|[0-9]{2}) not good enough for you? You could also check that the first two digits of the four-digit group are 19 or 20, but that depends on your needs.
As for URL matching, look here. There are many of them, with tests.
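The 19/20 check on the four-digit year mentioned above could look roughly like this in Python. The name DATE_RE is just for illustration, and the groups are made non-capturing on the assumption that the individual parts are not needed.

```python
import re

# Same month and day parts as the pattern above; the year is either
# four digits starting with 19 or 20, or any two digits.
DATE_RE = re.compile(
    r"^(?:[1-9]|0[1-9]|1[012])[- /.]"           # month
    r"(?:[1-9]|0[1-9]|[12][0-9]|3[01])[- /.]"   # day
    r"(?:(?:19|20)[0-9]{2}|[0-9]{2})$"          # year
)

for text in ("01/01/1900", "01/01/70", "01/01/1800"):
    print(text, bool(DATE_RE.match(text)))
```

With this, 01/01/1900 and 01/01/70 are accepted, while 01/01/1800 is rejected because a four-digit year has to start with 19 or 20.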

You can use another alternation at the end to accept 2 or 4 digits (the same way you do the "or" options for the other date parts). Alternatively, you can require 2 digits in the last position, and then have 2 optional digits after that.
Unless you need to capture the individual parts (day, month, year), you should use non-capturing parentheses, like this (?:) (that's the .NET syntax).
Finally, you should consider the type of validation that you are trying to achieve with this. It is probably better to enforce the format, and not worry about bad forms like 91/73/9004 because even with what you have you can still get invalid dates, like 02/31/2011. Since you probably have to perform further validation, why not simplify the regex to something like ^(?:\d{1,2}[-/.]){2}\d{2}(?:\d{2})?$
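As a sketch of that two-step approach in Python: the simplified regex only enforces the shape, and datetime.strptime then rejects impossible dates. The function name parse_date is hypothetical, and the month/day/year order is assumed from the 02/31/2011 example.

```python
import re
from datetime import datetime
from typing import Optional

# Simplified format check from the answer above
FORMAT_RE = re.compile(r"^(?:\d{1,2}[-/.]){2}\d{2}(?:\d{2})?$")


def parse_date(text: str) -> Optional[datetime]:
    """Return a datetime for a well-formed, real date, otherwise None."""
    if not FORMAT_RE.fullmatch(text):
        return None
    month, day, year = re.split(r"[-/.]", text)
    # %Y expects a four-digit year, %y a two-digit one
    fmt = "%m/%d/%Y" if len(year) == 4 else "%m/%d/%y"
    try:
        return datetime.strptime(f"{month}/{day}/{year}", fmt)
    except ValueError:
        # Right shape, but not a real calendar date
        return None


print(parse_date("01/01/1900"))
print(parse_date("02/31/2011"))
print(parse_date("91/73/9004"))
```

Both 02/31/2011 and 91/73/9004 pass the regex but come back as None from the second step, which is the point: the regex does not need to know anything about calendars.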
As for URLs, stackoverflow is littered with duplicate questions about this.

Related

Matching within matches by extending an existing Regex

I'm trying to see if its possible to extend an existing arbitrary regex by prepending or appending another regex to match within matches.
Take the following example:
The original regex is cat|car|bat so matching output is
cat
car
bat
I want to add to this regex and output only matches that start with 'ca',
cat
car
I specifically don't want to parse the whole regex, which could be quite a long operation, and then change its internal content to produce the output, as in:
^ca[tr]
or run the original regex and then the second one over the results. I'm taking the original regex as an argument in python but want to 'prefilter' the matches by adding the additional code.
This is probably a slight abuse of regex, but I'm still interested if it's possible. I have tried what I know of subgroups and the following examples but they're not giving me what I need.
Things I've tried:
^ca(cat|car|bat)
(?<=ca(cat|car|bat))
(?<=^ca(cat|car|bat))
It may not be possible but I'm interested in what any regex gurus think. I'm also interested if there is some way of doing this positionally if the length of the initial output is known.
A slightly more realistic example of the initial query might be [a-z]{4}, but if I create (?<=^ca([a-z]{4})) it matches against 6-letter strings starting with ca, not 4-letter ones.
Thanks for any solutions and/or opinions on it.
EDIT: See solution including @Nick's contribution below. The tool I was testing this with (exrex) seems to have a slight bug that, following the examples given, would create matches 6 characters long.
You were not far off with what you tried, only you don't need a lookbehind, but rather a lookahead assertion, and a parenthesis was misplaced. The right thing is: Put the original pattern in parentheses, and prepend (?=ca):
(?=ca)(cat|car|bat)
(?=ca)([a-z]{4})
In the second example (without | alternative), the parentheses around the original pattern wouldn't be required.
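Since the original regex arrives as an argument in Python, the wrapping can be done mechanically. A minimal sketch follows; the helper name prefilter is made up for illustration, and it uses a non-capturing group rather than the capturing one shown above so that the original pattern's own group numbers stay unchanged.

```python
import re


def prefilter(pattern: str, prefix: str) -> re.Pattern:
    """Wrap an arbitrary pattern so it only matches where prefix starts."""
    # The lookahead checks for the prefix without consuming characters;
    # the non-capturing group keeps alternations like a|b|c together.
    return re.compile(f"(?={re.escape(prefix)})(?:{pattern})")


only_ca = prefilter("cat|car|bat", "ca")
print([m.group(0) for m in only_ca.finditer("cat car bat")])

four_letters = prefilter("[a-z]{4}", "ca")
print(bool(four_letters.fullmatch("cart")))
print(bool(four_letters.fullmatch("cabbie")))
```

Because the lookahead consumes nothing, [a-z]{4} still matches exactly four characters, so a six-letter string such as cabbie is not a full match.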
Ok, thanks to @Armali I've come to the conclusion that (?=ca)(^[a-z]{4}$) works (see https://regexr.com/3f4vo). However, I'm trying this with the great exrex tool to attempt to produce matching strings, and it's producing matches that are 6 characters long rather than 4. This may be a limitation of exrex rather than the regex, which seems to work in other cases.
See @Nick's comment.
I've also raised an issue on the exrex GitHub for this.

regexing url parameters IIS

I have reviewed a couple of questions on regexing url parameters, but none of them seem to specifically address my issue. I have been trying to work out the correct regex pattern in www.regex101.com and I haven't had any success. I have a url that has parameters which are separated by /'s. I am able to regex one parameter at a time, but I would ideally like to develop a pattern that can extract all of the parameters. So far this is what I have:
\/([a-zA-z]+)\/([a-zA-z]+)\/([a-zA-z-]+)\/
The url that I am trying to modify is:
www.mydomain.com/firstparameter/secondparameter/hyphenated-url-parameter/
The above pattern works for this example, but I need it to also work for these two examples:
www.mydomain.com/firstparameter/secondparameter/
www.mydomain.com/firstparameter/
Is it even possible to write one singular regex that can extract the parameters from each example above?
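Try this regex (using [a-zA-Z], since the range A-z also matches the punctuation characters between Z and a): \/([a-zA-Z]+)\/(?:(?:([a-zA-Z]+)\/)?([a-zA-Z-]+)\/)?
Details:
? Quantifier: matches zero or one time, preferring one when possible (greedy)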
Demo
This assumes there is at least one parameter and at most three.
This should work for any number of parameters:
\/([\w-]+)
Example
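For what it's worth, here is how the any-number-of-parameters pattern could be applied in Python with re.findall. The character class is written as [\w-], because a | inside a class is just a literal pipe character. The function name extract_params is hypothetical, and the sketch assumes URLs without a scheme, as in the question (with http:// in front, the host would be picked up too).

```python
import re

# One capture per "/segment"; findall returns every captured segment
PARAM_RE = re.compile(r"/([\w-]+)")


def extract_params(url: str) -> list:
    """Return all path segments made of word characters and hyphens."""
    return PARAM_RE.findall(url)


for url in (
    "www.mydomain.com/firstparameter/secondparameter/hyphenated-url-parameter/",
    "www.mydomain.com/firstparameter/secondparameter/",
    "www.mydomain.com/firstparameter/",
):
    print(extract_params(url))
```

The trailing slash produces no extra entry, since the pattern requires at least one character after each /.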

Define regular expression that matches urls that end with digits unless anything else comes after

I'm using Scrapy to scrape a web site. I'm stuck at defining properly the rule for extracting links.
Specifically, I need help to write a regular expression that allows urls like:
https://discuss.dwolla.com/t/the-dwolla-reflector-is-now-open-source/1352
https://discuss.dwolla.com/t/enhancement-dwolla-php-updated-to-2-1-3/1180
https://discuss.dwolla.com/t/updated-java-android-helper-library-for-dwollas-api/108
while forbidding urls like this one
https://discuss.dwolla.com/t/the-dwolla-reflector-is-now-open-source/1352/12
In other words, I want urls that end with digits (i.e., /1352 in the example above), unless anything follows those digits (i.e., /12 in the example above).
I am by no means an expert of regular expressions, and I could only come up with something like \/(\d+)$, or even this one ^https:\/\/discuss.dwolla.com\/t\/\S*\/(\d+)$, but both fail at excluding the unwanted urls since they all capture the last digits in the address.
--- UPDATE ---
Sorry for not being clear in the first place. This addition is to clarify that the digits at the end of the URLs can change, so the /1352 is not fixed. As such, another example of urls to be accepted is also:
https://discuss.dwolla.com/t/updated-java-android-helper-library-for-dwollas-api/108
This is probably the simplest way:
[^\/\d][^\/]*\/\d+$
or to restrict to a particular domain:
^https?:\/\/discuss\.dwolla\.com\/.*[^\/\d][^\/]*\/\d+$
See live demo.
This regex requires the last part to be all digits, and the 2nd last part to have at least 1 non-digit.
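A quick way to check the simple pattern against the URLs from the question, using plain Python re rather than Scrapy itself. The names TOPIC_RE and is_topic_url are made up for illustration.

```python
import re

# Last segment all digits, second-to-last segment contains a non-digit
TOPIC_RE = re.compile(r"[^/\d][^/]*/\d+$")


def is_topic_url(url: str) -> bool:
    """Return True if the URL ends in /<digits> with nothing after it."""
    return TOPIC_RE.search(url) is not None


wanted = [
    "https://discuss.dwolla.com/t/the-dwolla-reflector-is-now-open-source/1352",
    "https://discuss.dwolla.com/t/enhancement-dwolla-php-updated-to-2-1-3/1180",
    "https://discuss.dwolla.com/t/updated-java-android-helper-library-for-dwollas-api/108",
]
unwanted = "https://discuss.dwolla.com/t/the-dwolla-reflector-is-now-open-source/1352/12"

print([is_topic_url(u) for u in wanted])
print(is_topic_url(unwanted))
```

The unwanted URL fails because the segment before /12 is 1352, which contains no non-digit character. In a Scrapy rule, the same pattern string would presumably go into the link extractor's allow argument.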
Here is a regex, written as a Java string literal, that may fit your requirements. If you expect a fixed number of digits N, you can replace the + after \\d with {N}:
^https://discuss\\.dwolla\\.com/t/[\\w-]+/\\d+$

I want a regular expression that only matches domain names with one period in them

I want it to catch things like somedomain.com/folder/path, but not something like domain.sub.other.com. The regex I have so far is almost complete, it just doesn't sift out the multi-domain urls:
^(.*)://(?!(.{2,3})\.(.*)(.{2,3})(.*)
Is there any way to sift out on multiple periods?
Instead of .{2,3}, you want something like this: [^.]{2,3} - this excludes the period (no need to escape as it has no special meaning in this context in a regular expression) from that particular match. Overall you'd have something like:
://[^.]+\.[^.]{2,3}(/.*)?
Except obviously you're missing things like *.info by doing that....
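To try the [^.] idea out, here is a small Python sketch. Note the assumptions: the scheme part [^:/]+ and the use of fullmatch are additions of this sketch, because the pattern as written above is unanchored and a plain search would still find ://domain.sub inside a multi-part host. The names TWO_PART_RE and has_two_part_domain are hypothetical.

```python
import re

# Assumption: whole-string variant of the pattern above, with a scheme
# part added in front so that fullmatch can be used.
TWO_PART_RE = re.compile(r"[^:/]+://[^.]+\.[^.]{2,3}(/.*)?")


def has_two_part_domain(url: str) -> bool:
    """Return True for URLs whose host looks like name.tld (one period)."""
    return TWO_PART_RE.fullmatch(url) is not None


print(has_two_part_domain("http://somedomain.com/folder/path"))
print(has_two_part_domain("http://domain.sub.other.com"))
print(has_two_part_domain("http://example.info"))
```

The last call shows the limitation mentioned above: with {2,3}, a four-letter ending such as .info is rejected as well.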
Found a solution that is working given a variety of test scenarios:
^(.*)://([^.]+)\.([^(\?|/|\r|\n|\.)]+)((/|\?|$)+)(.*)$
Here, the 2nd to the last group is matching against a potential forward slash, question mark or end of string, working together with the group before it which does not allow matches which include '.'
So the final effect is that it only matches URLs with a two-part domain such as 'domain.com' and there aren't any limits placed on string length.

Regex to extract part of a url

I'm being lazy tonight and don't want to figure this one out. I need a regex to match 'jeremy.miller' and 'scottgu' from the following inputs:
http://codebetter.com/blogs/jeremy.miller/archive/2009/08/26/talking-about-storyteller-and-executable-requirements-on-elegant-code.aspx
http://weblogs.asp.net/scottgu/archive/2009/08/25/clean-web-config-files-vs-2010-and-net-4-0-series.aspx
Ideas?
Edit
Chris Lutz did a great job of meeting the requirements above. What if these were the inputs so you couldn't use 'archive' in the regex?
http://codebetter.com/blogs/jeremy.miller/
http://weblogs.asp.net/scottgu/
Would this be what you're looking for?
'/([^/]+)/archive/'
Captures the piece before "archive" in both cases. Depending on regex flavor you'll need to escape the /s for it to work. As an alternative, if you don't want to match the archive part, you could use a lookahead, but I don't like lookaheads, and it's easier to match a lot and just capture the parts you need (in my opinion), so if you prefer to use a lookahead to verify that the next part is archive, you can write one yourself.
EDIT: As you update your question, my idea of what you want is becoming fuzzier. If you want a new regex to match the second cases, you can just pluck the appropriate part off the end, with the same / conditions as before:
'/([^/]+)/$'
If you specifically want either the text jeremy.miller or scottgu, regardless of where they occur in a URL, but only as "words" in the URL (i.e. not scottgu2), try this, once again with the / caveat:
'/(jeremy\.miller|scottgu)/'
As yet a third alternative, if you want the field after the domain name, unless that field is "blogs", it's going to get hairy, especially with the / caveat:
'http://[^/]+/(?:blogs/)?([^/]+)/'
This will match the domain name, an optional blogs field, and then the desired field. The (?:) syntax is a non-capturing group, which means it works just like regular parentheses but won't capture the value, so the only value captured is the one you want. Support for (?:) may vary depending on your particular regex flavor. I don't know what language you're asking for, but I predominantly use Perl, so this regex should pretty much do it if you're using PCRE. If you're using something different, look into non-capturing groups.
Wow. That's a lot of talking about regexes. I need to shut up and post already.
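For reference, the last pattern (domain name, optional blogs field, then the wanted field) behaves like this in Python, where / needs no escaping. The names OWNER_RE and blog_owner are made up for illustration.

```python
import re
from typing import Optional

# Host, an optional "blogs/" field, then the captured field
OWNER_RE = re.compile(r"http://[^/]+/(?:blogs/)?([^/]+)/")


def blog_owner(url: str) -> Optional[str]:
    """Return the first path field after the host, skipping 'blogs/'."""
    m = OWNER_RE.match(url)
    return m.group(1) if m else None


print(blog_owner("http://codebetter.com/blogs/jeremy.miller/archive/2009/08/26/talking-about-storyteller-and-executable-requirements-on-elegant-code.aspx"))
print(blog_owner("http://weblogs.asp.net/scottgu/archive/2009/08/25/clean-web-config-files-vs-2010-and-net-4-0-series.aspx"))
print(blog_owner("http://codebetter.com/blogs/jeremy.miller/"))
print(blog_owner("http://weblogs.asp.net/scottgu/"))
```

It returns jeremy.miller and scottgu for both the long and the short inputs, since it never relies on the archive part.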
Try this one:
/\/([\w\.]+)\/archive/