So I have the following Regex URL validator:
[a-zA-Z0-9][a-zA-Z0-9-]{1,61}[a-zA-Z0-9](?:\.[a-zA-Z]{2,})+
It works perfectly well for my needs, except that it accepts URLs without a domain; for example, www.test is accepted.
How can I modify it to require a domain? (Any domain should be accepted, not just .com.)
Just require the last group in your regex to appear two or more times:
[a-zA-Z0-9][a-zA-Z0-9-]{1,61}[a-zA-Z0-9](?:\.[a-zA-Z]{2,}){2,}
As a disclaimer, and as @Wiktor will probably comment, you might want to use a regex pattern for validating URLs that has already been thoroughly tested. While this answer may fix your immediate problem, there are most likely other edge cases it does not handle.
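To see the effect of the {2,} quantifier, here is a minimal Python sketch that compares the two patterns. The names ORIGINAL, FIXED and accepts are made up for this illustration, and fullmatch is used on the assumption that the whole string should be validated.

```python
import re

# Pattern from the question: one or more dot-separated suffix groups.
ORIGINAL = re.compile(
    r"[a-zA-Z0-9][a-zA-Z0-9-]{1,61}[a-zA-Z0-9](?:\.[a-zA-Z]{2,})+"
)

# Pattern from the answer: the suffix group must appear at least twice.
FIXED = re.compile(
    r"[a-zA-Z0-9][a-zA-Z0-9-]{1,61}[a-zA-Z0-9](?:\.[a-zA-Z]{2,}){2,}"
)


def accepts(pattern: re.Pattern, candidate: str) -> bool:
    """Return True if the whole candidate string matches the pattern."""
    return pattern.fullmatch(candidate) is not None


if __name__ == "__main__":
    for candidate in ("www.test", "www.test.com", "test.com"):
        print(candidate, accepts(ORIGINAL, candidate), accepts(FIXED, candidate))
```

Note that with {2,} a bare name such as test.com is rejected as well, because it has only one dot-separated group after the first label. That may or may not be what you want.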
You could do it like this to account for Unicode:
^\p{L}+\.\p{L}+(\.\p{L}{2,})+
\p{L} or \p{Letter}: any kind of letter from any language.
So with this we match a group of one or more letters (subdomain), followed by a ., followed by a group of one or more letters (main domain), followed by one or more groups of . plus two or more letters (domain suffix).
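If you want to try this in Python, be aware that the built-in re module does not understand \p{L}. The sketch below swaps in [^\W\d_] as a stand-in. That is only an approximation (it also lets through a few non-decimal numeric characters), and the names LETTER, UNICODE_HOST and looks_like_host are invented for the example. The third-party regex package accepts \p{L} directly if you need the exact class.

```python
import re

# Approximation of \p{L} for the standard-library re module:
# a word character that is neither a decimal digit nor an underscore.
LETTER = r"[^\W\d_]"

# Same shape as the answer's pattern: letters, dot, letters,
# then one or more ".xx" suffix groups of two or more letters.
UNICODE_HOST = re.compile(rf"^{LETTER}+\.{LETTER}+(\.{LETTER}{{2,}})+")


def looks_like_host(candidate: str) -> bool:
    """Return True if the candidate starts with subdomain.domain.suffix."""
    return UNICODE_HOST.match(candidate) is not None


if __name__ == "__main__":
    for candidate in ("www.münchen.de", "www.test.com", "www.test"):
        print(candidate, looks_like_host(candidate))
```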
Related
I need to find all email addresses with an arbitrary number of alphanumeric words, separated through a period. To test the regex, I'm using the website https://regex101.com/.
The structure of a valid email address is word1.word2.wordN#word1.word2.wordN.word.
The regex /[a-zA-Z0-9.]+#[a-zA-Z0-9.]+.[a-zA-Z0-9]+/gm finds all email addresses included in the document string, but also includes invalid addresses like ........#....com, if present.
I tried to group the repeating parts by using round brackets and a Kleene star, but that causes the regex engine to collapse.
Invalid regex:
/([a-zA-Z0-9]+.?)*[a-zA-Z0-9]+#([a-zA-Z0-9]+.?)*[a-zA-Z0-9]+.[a-zA-Z0-9]+/gm
Although there are many posts concerning regex groups, I was unable to find an explanation of why the regex engine fails. It seems that the engine gets stuck while trying to find a match.
How can I avoid this problem, and what is the correct solution?
I think the main issue that caused you troubles is:
. (outside of []) matches any character; you probably meant to specify \. instead (which only matches a literal dot character).
Also there is no need to make it optional with ?, because the non-dot part of your regex will just match with the alphanumerical characters anyway.
I also reduced the right part (x*x is the same as x+), added a case-insensitive flag and ended up with this:
/([a-z0-9]+\.)*[a-z0-9]+#([a-z0-9]+\.)+[a-z0-9]+/gmi
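For a quick check outside regex101, here is a small Python sketch of the corrected pattern. It keeps # as the separator exactly as written in this thread; the function name find_addresses and the sample text are made up for the illustration.

```python
import re

# Corrected pattern: dots are escaped, so each dot must sit between
# two runs of alphanumeric characters.
ADDRESS = re.compile(
    r"([a-z0-9]+\.)*[a-z0-9]+#([a-z0-9]+\.)+[a-z0-9]+",
    re.IGNORECASE | re.MULTILINE,
)


def find_addresses(text: str) -> list[str]:
    """Return every full match; finditer avoids findall's group tuples."""
    return [match.group(0) for match in ADDRESS.finditer(text)]


if __name__ == "__main__":
    sample = "alice.smith#example.co.uk ........#....com bob#mail.org"
    print(find_addresses(sample))
```

Because every dot now has to be a literal dot surrounded by alphanumerics, the engine has far fewer ways to split the input, and a string like ........#....com is simply skipped.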
I'm trying to stop spammers who are using short domains (bit.ly, etc.). The domains they post all seem to end in only two characters (not .com, etc.).
I've used this:
\.[a-z][a-z]$
But, it has two problems:
It matches .co.uk.
If anything comes after the domain (a space or slash, for example bit.ly/2231), it doesn't match.
Could someone assist me with a regex that would accomplish this, please?
Both patterns match against the whole URL and rely on the domain appearing before the first forward slash after the protocol. The first one matches only when the host contains a single dot and ends in a two-character TLD. The second one uses a negative lookbehind to make sure the two-character ending is not part of something like .co.uk.
https://regex101.com/r/5acu56/2
^(https?:\/\/)?[^\/.]+\.[a-z][a-z](\/|\s*$)
https://regex101.com/r/p8Ajw9/2
^(https?:\/\/)?[^\/]+(?<!\.[a-z][a-z])\.[a-z][a-z](\/\s*|\s*$)
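Both patterns should also work in Python, since the lookbehind is fixed-width. Here is a rough sketch for trying them out; SINGLE_DOT, LOOKBEHIND and is_short_domain are names I made up, and the test URLs are only examples.

```python
import re

# First pattern: exactly one dot in the host, two-letter ending.
SINGLE_DOT = re.compile(r"^(https?:\/\/)?[^\/.]+\.[a-z][a-z](\/|\s*$)")

# Second pattern: any host, but the two-letter ending must not be
# preceded by another ".xx" group (which rules out .co.uk).
LOOKBEHIND = re.compile(
    r"^(https?:\/\/)?[^\/]+(?<!\.[a-z][a-z])\.[a-z][a-z](\/\s*|\s*$)"
)


def is_short_domain(url: str, pattern: re.Pattern = LOOKBEHIND) -> bool:
    """Return True if the URL's host ends in a two-letter TLD."""
    return pattern.search(url) is not None


if __name__ == "__main__":
    for url in ("bit.ly/2231", "http://bit.ly", "example.com", "example.co.uk"):
        print(url, is_short_domain(url, SINGLE_DOT), is_short_domain(url))
```

Both patterns only look at lowercase letters, so you may want to add a case-insensitive flag if the input is not normalized first.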
I have such an http-url detector regex:
(?:http|https)(?::\/{2}[\w]+)(?:[\/|\.]?)(?:[^\s<"]*)
It works pretty well for the following url representation:
http://www.acer.com/clearfi/download/
What modification can I make to extract
http://schemas.microsoft.com/office/word/2003/wordml2450
from
Huanghhttp://schemas.microsoft.com/office/word/2003/wordml2450...)()()()()()
?
You can modify it to capture:
group of http stuff
followed by (group of) subdomain stuff
followed by as many as possible groups of:
one point or slash
followed by a group of characters (non-point, non-space, non-", non-<)
(?:http|https)(?::\/{2}[\w]+)([\/|\.][^\s<"\.]+)*
I made capturing groups to visualize the results
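Here is a small Python sketch of that idea. It assumes the scheme is followed by a literal colon, as in the question's pattern, and uses a non-capturing group for the repeated part since only the whole match is needed. URL and extract_url are names invented for the example.

```python
import re

# Scheme, then "://" plus the first label, then as many
# "separator + run of non-dot characters" groups as possible.
URL = re.compile(r'(?:http|https)(?::\/{2}[\w]+)(?:[\/|\.][^\s<"\.]+)*')


def extract_url(text: str):
    """Return the first URL-like substring, or None if there is none."""
    match = URL.search(text)
    return match.group(0) if match else None


if __name__ == "__main__":
    noisy = (
        "Huanghhttp://schemas.microsoft.com/office/word/2003/"
        "wordml2450...)()()()()()"
    )
    print(extract_url(noisy))
    print(extract_url("http://www.acer.com/clearfi/download/"))
```

The match stops before the trailing "..." because each repeated group needs at least one non-dot character right after its separator, and a dot followed by another dot does not qualify.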
I've changed your expression here and there: (.*)(https?:\/{2}[\w]+[\/|\.]?[^\s<"]*)(\.{3}.*) and take only the second capturing group from it. See the example here: https://regex101.com/r/0viPC5/2
This expression probably can be simplified further but I don't know your exact input and search criteria so let's stick with what you already wrote.
So I've got the pattern /.+[^\x20-\x2A\x2C\x2F\x3A-\x40\x5B-\x5E\x60\x7B-\xFF]\#[\w+-?]+(.{1})\w{2,}/ that I want to use for client-side email validation, but it doesn't work as expected.
I know that my pattern is simple and doesn't cover every standard possibility, but it's part of my regex training.
Local part of address should be valid only when it has at least one digit [0-9] or letter [a-zA-Z] and can be mixed with comma or plus sign or underscore (or all at once) and then # sign, then domain part, but no IP address literals, only domain names with at least one letter or digit, followed by one dot and at least two letters or two digits.
In test string form it doesn't validate a#b.com and does validate baz_bar.test+private#e-mail-testing-service..com, which is wrong - it should be vice versa - validate a#b.com and not validate baz_bar.test+private#e-mail-testing-service..com
What specific error have I made, and where?
I can't locate it, sorry.
You need to change your regex
From: .+[^\x20-\x2A\x2C\x2F\x3A-\x40\x5B-\x5E\x60\x7B-\xFF]\#[\w+-?]+(\.{1})\w{2,}
To: .+[^\x20-\x2A\x2C\x2F\x3A-\x40\x5B-\x5E\x60\x7B-\xFF]?\#[\w+-]+(\.{1})\w{2,}
Notice that I added a ? before the # sign and removed the ? from the first "group" after the # sign. Adding that ? tells your regex that the whole "group" is not mandatory.
See it working here: https://regex101.com/r/iX5zB5/2
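If you prefer to check it locally, here is a rough Python sketch with the two test strings from the question. It keeps # as the separator as written in this thread and uses an unanchored search, which is my assumption about how the pattern is applied; FIXED and is_accepted are made-up names.

```python
import re

# The "To:" pattern from this answer, copied as-is.
FIXED = re.compile(
    r".+[^\x20-\x2A\x2C\x2F\x3A-\x40\x5B-\x5E\x60\x7B-\xFF]?"
    r"\#[\w+-]+(\.{1})\w{2,}"
)


def is_accepted(address: str) -> bool:
    """Return True if the pattern is found anywhere in the address."""
    return FIXED.search(address) is not None


if __name__ == "__main__":
    print(is_accepted("a#b.com"))
    print(is_accepted("baz_bar.test+private#e-mail-testing-service..com"))
```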
You're requiring the local part (before #) to be at least two characters with the .+ followed by the character class [^...]. It's looking for any character followed by another character not in the list of exclusions you specify. That explains why "a#b.com" doesn't match.
The second problem is partly caused by the character class range +-? which includes the . character. I think you wanted [-\w+?]+. (Do you really want question marks?) And then later I think you wanted to look for a literal . character but it really ends up matching the first character that didn't match the previous block.
Between the regex provided and the explanatory text I'm not sure what rules you intend to implement though. And since this is an exercise it's probably better to just give hints anyway.
You will also want to use the ^ and $ anchors to make sure the entire string matches.
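As a hint for the range problem, this short Python sketch shows what each character class accepts. RANGE_CLASS, LITERAL_CLASS and class_accepts are names made up for the illustration.

```python
import re

# "+-?" inside a class is a range from "+" (0x2B) to "?" (0x3F),
# which covers , - . / digits : ; < = > as well.
RANGE_CLASS = re.compile(r"[\w+-?]")

# With the hyphen first it is a literal, so there is no range at all.
LITERAL_CLASS = re.compile(r"[-\w+?]")


def class_accepts(pattern: re.Pattern, char: str) -> bool:
    """Return True if the single character belongs to the class."""
    return pattern.fullmatch(char) is not None


if __name__ == "__main__":
    for char in (".", "/", "-", "+", "?"):
        print(
            repr(char),
            class_accepts(RANGE_CLASS, char),
            class_accepts(LITERAL_CLASS, char),
        )
```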
I'm attempting to set up Content Groupings using Extraction within Google Analytics.
I have URLs of the form http://www.ehattons.com/52674/Bachmann_Branchline_37_671_Pack_of_3_14_Ton_tank_wagons_in_Fina_livery_weathered/StockDetail.aspx
I wish to use Regex to say that only in cases where a URL contains /StockDetail.aspx, extract everything before the first underscore, excluding any digits. e.g. 'Bachmann'.
I've managed to source the following regex to return everything before the first underscore
^[^_]+(?=_).
However, that's as far as I can get with my limited understanding. Anyone know what regex will do the trick here?
Many thanks,
Well, you're halfway there.
Think about it this way: you want to extract something that is followed by an underscore but does not itself follow one, and only when the string contains /StockDetail.aspx. You know that this part of the string always comes after your first underscore.
So you start with a character that is not an underscore: [^_]
Then you create the group you want to match with ([a-zA-Z]*) (you cannot use \w, since it includes the underscore). Your string has to be followed by an underscore, so you add _ after your group. And finally, somewhere in the URL you've got /StockDetail.aspx. Your regex should look like this:
[^_]([a-zA-Z]*)_.*(?:\/StockDetail\.aspx)
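To sanity-check the pattern outside Google Analytics, here is a minimal Python sketch. Python's engine is not the one Google Analytics uses, so treat it as a rough check only; BRAND and extract_brand are names invented for the example, and the second URL is a made-up counterexample.

```python
import re

# One non-underscore character, the captured letters, an underscore,
# and /StockDetail.aspx somewhere later in the URL.
BRAND = re.compile(r"[^_]([a-zA-Z]*)_.*(?:\/StockDetail\.aspx)")


def extract_brand(url: str):
    """Return the letters before the first underscore, or None."""
    match = BRAND.search(url)
    return match.group(1) if match else None


if __name__ == "__main__":
    stock_url = (
        "http://www.ehattons.com/52674/Bachmann_Branchline_37_671_Pack_of_3_"
        "14_Ton_tank_wagons_in_Fina_livery_weathered/StockDetail.aspx"
    )
    print(extract_brand(stock_url))
    print(extract_brand("http://www.ehattons.com/52674/Bachmann_Branchline/Other.aspx"))
```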