Regex match domain that contain certain subdomain - regex

I have this regex (not mine, taken from here)
^[^\.]+\.example\.org$
The regex will match *.example.org (e.g. sub.example.org), but will leaves out sub-subdomain (e.g. sub.sub.example.org), that's great and it is what I want.
But I have other requirement, I want to match subdomain that contain specific string, in this case press. So the regex will match following (literally any subdomain that has word press in it).
free-press.example.org
press.example.org
press23.example.org
I have trouble finding the right syntax, have looked on the internet and mostly they works only for standalone text and not domain like this.

Ok, let's break down what the "subdomain" part of your regex does:
[^\.]+ means "any character except for ., at least once".
You can break your "desired subdomain" up into three parts: "the part before press", "press itself", and "the part after press".
For the parts before and after press, the pattern is basically the same as before, except that you want to change the + (one or more) to a * (zero or more), because there might not be anything before or after press.
So your subdomain pattern will look like [^\.]*press[^\.]*.
Putting it all together, we have ^[^\.]*press[^\.]*\.example\.org$. If we put that into Regex101 we see that it works for your examples.
Note that this isn't a very strict check for valid domains. It might be worth thinking about whether regexes are actually the best tool for the "subdomain checking" part of this task. You might instead want to do something like this:
Use a generic, more thorough, domain-validation regex to check that the domain name is valid.
Split the domain name into parts using String.split('.').
Check that the number of parts is correct (i.e. 3), and that the parts meet your requirements (i.e. the first contains the substring press, the second is example, and the third is org).

If you're looking for a regex that matches URLs whose subdomains contain the word press then use
^[^\.]*press[^\.]*\.example\.org$
See the demo

Related

Find last occurrence of period with regex

I'm trying to create a regex for validating URLs. I know there are many advanced ones out there, but I want to create my own for learning purposes.
So far I have a regex that works quite well, however I want to improve the validation for the TLD part of the URI because I feel it's not quite there yet.
Here's my regex (or find it on regexr):
/^[(http(s)?):\/\/(www\.)?a-zA-Z0-9#:._\+~#=]{2,256}\.[a-zA-Z]{2,6}\b([/#?]{0,1}([A-Za-z0-9-._~:?#[\]#!$&''()*+,;=]|(%[A-Fa-f0-9]{2}))*)$/
It works well for links such as foo.com or http://foo.com or foo.co.uk
The problem appears when you introduce subdomains or second-level domains such as co.uk because the regex will accept foo.co.u or foo.co..
I did try using the following to select the substring after the last .:
/[(http(s)?):\/\/(www\.)?a-zA-Z0-9#:._\+~#=]{2,256}[^.]{2,}$/
but this prevents me from defining the path rules of the URI.
How can I ensure that the substring after the last . but before the first /, ? or # is at least 2 characters long?
From what I can see, you're almost there. Made some modification and it seems to work.
^(http(s)?:\/\/)?(www\.)?[a-zA-Z0-9#:._\+~#=]{2,256}\.[a-zA-Z]{2,6}([/#?;]([A-Za-z0-9-._~:?#[\]#!$&''()*+,;=]|(%[A-Fa-f0-9]{2}))*)?$
Can be somewhat shortened by doing
^(http(s)?:\/\/)?(www\.)?[\w#:.\+~#=]{2,256}\.[a-zA-Z]{2,6}([/#?;]([-\w.~:?#[\]#!$&''()*+,;=]|(%[A-Fa-f0-9]{2}))*)?$
(basically just tweaked your regex)
The main difference is that the parameter part is optional, but if it is there it has to start with one of /#?;. That part could probably be simplified as well.
Check it out here.
Edit:
After some experimenting I think this one is about as simple it'll get:
^(http(?:s)?:\/\/)?([-.~\w]+\.[a-zA-Z]{2,6})(:\d+)?(\/[-.~\w]*)?([#/#?;].*)?$
It also captures the separate parts - scheme, host, port, path and query/params.
Example here.

Define regular expression that matches urls that end with digits unless anything else comes after

I'm using Scrapy to scrape a web site. I'm stuck at defining properly the rule for extracting links.
Specifically, I need help to write a regular expression that allows urls like:
https://discuss.dwolla.com/t/the-dwolla-reflector-is-now-open-source/1352
https://discuss.dwolla.com/t/enhancement-dwolla-php-updated-to-2-1-3/1180
https://discuss.dwolla.com/t/updated-java-android-helper-library-for-dwollas-api/108
while forbidding urls like this one
https://discuss.dwolla.com/t/the-dwolla-reflector-is-now-open-source/1352/12
In other words, I want urls that end with digits (i.e., /1352 in the example abpve), unless after these digits there is anything after (i.e., /12 in the example above)
I am by no means an expert of regular expressions, and I could only come up with something like \/(\d+)$, or even this one ^https:\/\/discuss.dwolla.com\/t\/\S*\/(\d+)$, but both fail at excluding the unwanted urls since they all capture the last digits in the address.
--- UPDATE ---
Sorry for not being clear in the first place. This addition is to clarify that the digits at the of URLS can change, so the /1352 is not fixed. As such, another example of urls to be accepted is also:
https://discuss.dwolla.com/t/updated-java-android-helper-library-for-dwollas-api/108
This is probably the simplest way:
[^\/\d][^\/]*\/\d+$
or to restrict to a particular domain:
^https?:\/\/discuss.dwolla.com\/.*[^\/\d][^\/]*\/\d+$
See live demo.
This regex requires the last part to be all digits, and the 2nd last part to have at least 1 non-digit.
Here is a java regex may fit your requirements in java style. You can specify number of digits N you are excepting in {N}
^https://discuss.dwolla.com/t/[\\w|-]+/[\\d]+$

Regex with negative look behind still matches certain strings in Scala

I have a text, that contains url domains in the following form:
[second_level_domain].[top_level_domain]
This could be for instance test.com, amazon.com or something similar, but not more complex stuff like e.g. www.test.com or de.wikipedia.org (no sub level domains!).
It could be that in front of the dot (between second and top level domain) or after the dot is an optional space like test . com, but this doesn't always have to be the case.
However what I don't want to match is if the second level domain and top level domain belong to an e-mail address like for instance hello#test.org. So in this case it shouldn't extract test.org
I wrote the following regex now:
(?<!#)(([a-zA-Z\d]+(?:-[a-zA-Z\d]+)*(?<!www))\s?\.\s?(com|net|org))
With the negative look behind I want to make sure, that in front of the second level domain shouldn't be an #. However it doesn't really do what I expected. For instance on the text hello#test.org it extracts est.org instead of extracting nothing. So, apparently it only looks at the first character when it checks if there is an # in front. But when I use the following regex it seems to work on the text hello#test.org:
(?<!#)((test)\s?\.\s?(com|net|org))
Here I hard coded the second level domain, with which it works. However if I exchange that with a regex that matches all kinds of second level domains
([a-zA-Z\d]+(?:-[a-zA-Z\d]+)*(?<!www))
it doesn't work anymore. It looks like that the negative look behind is already used after the first character is matched and that it doesn't wait with the negative look behind until everything is matched.
As an alternative I could match a bit more and then use the groups afterwards to build my desired match, but I want to avoid that if possible. I would like to match it correctly immediately. I'm not an expert in regular expressions and apparently I have not understood look arounds properly yet. Is there a way to write a regex, which behaves like I want?
(?:^|(?<=\s))((?:[a-zA-Z\d]+(?:-[a-zA-Z\d]+)*(?<!www))\s?\.\s?(?:com|net|org))
Add anchors to disallow partial matches.See demo.
https://www.regex101.com/r/rK5lU1/34

Regex for URL routing - match alphanumeric and dashes except words in this list

I'm using CodeIgniter to write an app where a user will be allowed to register an account and is assigned a URL (URL slug) of their choosing (ex. domain.com/user-name). CodeIgniter has a URL routing feature that allows the utilization of regular expressions (link).
User's are only allowed to register URL's that contain alphanumeric characters, dashes (-), and under scores (_). This is the regex I'm using to verify the validity of the URL slug: ^[A-Za-z0-9][A-Za-z0-9_-]{2,254}$
I am using the url routing feature to route a few url's to features on my site (ex. /home -> /pages/index, /activity -> /user/activity) so those particular URL's obviously cannot be registered by a user.
I'm largely inexperienced with regular expressions but have attempted to write an expression that would match any URL slugs with alphanumerics/dash/underscore except if they are any of the following:
default_controller
404_override
home
activity
Here is the code I'm using to try to match the words with that specific criteria:
$route['(?!default_controller|404_override|home|activity)[A-Za-z0-9][A-Za-z0-9_-]{2,254}'] = 'view/slug/$1';
but it isn't routing properly. Can someone help? (side question: is it necessary to have ^ or $ in the regex when trying to match with URL's?)
Alright, let's pick this apart.
Ignore CodeIgniter's reserved routes.
The default_controller and 404_override portions of your route are unnecessary. Routes are compared to the requested URI to see if there's a match. It is highly unlikely that those two items will ever be in your URI, since they are special reserved routes for CodeIgniter. So let's forget about them.
$route['(?!home|activity)[A-Za-z0-9][A-Za-z0-9_-]{2,254}'] = 'view/slug/$1';
Capture everything!
With regular expressions, a group is created using parentheses (). This group can then be retrieved with a back reference - in our case, the $1, $2, etc. located in the second part of the route. You only had a group around the first set of items you were trying to exclude, so it would not properly capture the entire wild card. You found this out yourself already, and added a group around the entire item (good!).
$route['((?!home|activity)[A-Za-z0-9][A-Za-z0-9_-]{2,254})'] = 'view/slug/$1';
Look-ahead?!
On that subject, the first group around home|activity is not actually a traditional group, due to the use of ?! at the beginning. This is called a negative look-ahead, and it's a complicated regular expression feature. And it's being used incorrectly:
Negative lookahead is indispensable if you want to match something not followed by something else.
There's a LOT more I could go into with this, but basically we don't really want or need it in the first place, so I'll let you explore if you'd like.
In order to make your life easier, I'd suggest separating the home, activity, and other existing controllers in the routes. CodeIgniter will look through the list of routes from top to bottom, and once something matches, it stops checking. So if you specify your existing controllers before the wild card, they will match, and your wild card regular expression can be greatly simplified.
$route['home'] = 'pages';
$route['activity'] = 'user/activity';
$route['([A-Za-z0-9][A-Za-z0-9_-]{2,254})'] = 'view/slug/$1';
Remember to list your routes in order from most specific to least. Wild card matches are less specific than exact matches (like home and activity), so they should come after (below).
Now, that's all the complicated stuff. A little more FYI.
Remember that dashes - have a special meaning when in between [] brackets. You should escape them if you want to match a literal dash.
$route['([A-Za-z0-9][A-Za-z0-9_\-]{2,254})'] = 'view/slug/$1';
Note that your character repetition min/max {2,254} only applies to the second set of characters, so your user names must be 3 characters at minimum, and 255 at maximum. Just an FYI if you didn't realize that already.
I saw your own answer to this problem, and it's just ugly. Sorry. The ^ and $ symbols are used improperly throughout the lookahead (which still shouldn't be there in the first place). It may "work" for a few use cases that you're testing it with, but it will just give you problems and headaches in the future.
Hopefully now you know more about regular expressions and how they're matched in the routing process.
And to answer your question, no, you should not use ^ and $ at the beginning and end of your regex -- CodeIgniter will add that for you.
Use the 404, Luke...
At this point your routes are improved and should be functional. I will throw it out there, though, that you might want to consider using the controller/method defined as the 404_override to handle your wild cards. The main benefit of this is that you don't need ANY routes to direct a wild card, or to prevent your wild card from goofing up existing controllers. You only need:
$route['404_override'] = 'view/slug';
Then, your View::slug() method would check the URI, and see if it's a valid pattern, then check if it exists as a user (same as your slug method does now, no doubt). If it does, then you're good to go. If it doesn't, then you throw a 404 error.
It may not seem that graceful, but it works great. Give it a shot if it sounds better for you.
I'm not familiar with codeIgniter specifically, but most frameworks routing operate based on precedence. In other words, the default controller, 404, etc routes should be defined first. Then you can simplify your regex to only match the slugs.
Ok answering my own question
I've seem to come up with a different expression that works:
$route['(^(?!default_controller$|404_override$|home$|activity$)[A-Za-z0-9][A-Za-z0-9_-]{2,254}$)'] = 'view/slug/$1';
I added parenthesis around the whole expression (I think that's what CodeIgniter matches with $1 on the right) and added a start of line identifier: ^ and a bunch of end of line identifiers: $
Hope this helps someone who may run into this problem later.

Regex for checking a body of text for a URL?

I have a regex pattern for URL's that I use to check for links in a body of text. The only problem is that the pattern will match this link
stackoverflow.com
And this sentence
I'm a sentence.Next Sentence.
Obviously this would make sense because my pattern doesn't strong check .com, .co.uk, .com.au etc
I want it to match stackoverflow.com and not the latter.
As I'm no Regex expert, does anyone know of any good Regex patterns for checking for all types of URL's in a body text, while not matching the sentences like above?
If I have to strong check the domain extension, I suppose I'll have to settle.
Here's my pattern, but i don't think it help.
(([\w]+:)?\/\/)?(([\d\w]|%[a-fA-f\d]{2,2})+(:([\d\w]|%[a-fA-f\d]{2,2})+)?#)?([\d\w][-\d\w]{0,253}[\d\w]\.)+[\w]{2,4}(:[\d]+)?(\/([-+_~.\d\w]|%[a-fA-f\d]{2,2})*)*(\?(&?([-+_~.\d\w]|%[a-fA-f\d]{2,2})=?)*)?(#([-+_~.\d\w]|%[a-fA-f\d]{2,2})*)?
I would definitely suggest finding a working regex that someone else has made (which would probably include a strong check on the domain extension), but here is one possible way to just modify your existing regex.
It requires that you make the assumption that usually links will not mix case in the domain extension, for example you might see .COM or .com but probably not .Com, if you only match domain extensions that don't mix case then you would avoid matching most sentences.
In the middle of your regex you have [\w]{2,4}, try changing this to ([A-Z]{2,4}|[a-z]{2,4}) (or (?:[A-Z]{2,4}|[a-z]{2,4}) if you don't want a new captured group).