I want a regular expression that only matches domain names with one period in them - regex

I want it to catch things like somedomain.com/folder/path, but not something like domain.sub.other.com. The regex I have so far is almost complete; it just doesn't filter out URLs with multi-part domains:
^(.*)://(?!(.{2,3})\.(.*)(.{2,3})(.*)
Is there any way to filter out URLs whose domain contains multiple periods?

Instead of .{2,3}, you want something like [^.]{2,3}. This excludes the period from that particular match (inside a character class the period has no special meaning, so there is no need to escape it). Overall you'd have something like:
://[^.]+\.[^.]{2,3}(/.*)?
Except obviously you're missing things like *.info by doing that, because the {2,3} quantifier limits the last part to two or three characters.
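To see the effect of the negated character class, here is a minimal Python sketch. The sample URLs and the trailing `$` are assumptions added for this demonstration; without an end anchor, a search could still stop after the first two labels of a longer domain.

```python
import re

# Pattern from the answer above, with "$" appended (an assumption for this
# demo) so the match has to run to the end of the string.
two_part = re.compile(r"://[^.]+\.[^.]{2,3}(/.*)?$")

samples = [
    "http://somedomain.com/folder/path",
    "http://domain.sub.other.com",
    "http://example.info",
]

# Map each sample URL to whether the pattern accepts it.
results = {url: bool(two_part.search(url)) for url in samples}
for url, matched in results.items():
    print(url, matched)
```

Under these assumptions only the first URL is accepted. The third is rejected purely because of the {2,3} length limit, which is the *.info problem mentioned above.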

Found a solution that is working given a variety of test scenarios:
^(.*)://([^.]+)\.([^(\?|/|\r|\n|\.)]+)((/|\?|$)+)(.*)$
Here, the second-to-last group matches a forward slash, a question mark, or the end of the string. It works together with the group before it, which cannot match a '.'.
The net effect is that it only matches URLs with a two-part domain such as 'domain.com', and no limit is placed on string length.
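A quick way to check that claim is to run the pattern over a few URLs. This is only a sketch: the pattern is copied verbatim from the post, while the helper name `is_two_part` and the sample URLs are made up for illustration.

```python
import re

# Pattern copied verbatim from the post above.
pattern = re.compile(r"^(.*)://([^.]+)\.([^(\?|/|\r|\n|\.)]+)((/|\?|$)+)(.*)$")

def is_two_part(url: str) -> bool:
    """Return True if the pattern accepts the URL (hypothetical helper)."""
    return pattern.match(url) is not None

for url in [
    "http://somedomain.com/folder/path",
    "https://domain.com?x=1",
    "http://domain.com",
    "http://example.info/page.html",
    "http://domain.sub.other.com",
]:
    print(url, is_two_part(url))
```

Only the last URL is rejected. Note that the parentheses and pipes inside the character class are literal characters there, so a simpler class such as [^?/\r\n.] would behave almost identically.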

Related

RegEx to match and select specific URLs

I’m on a website with these URLs;
https://flyheight.com/videos/ybb347
https://flyheight.com/videos/yb24os
https://flyheight.com/public/images/videos/793f77362f321e62c32659c3ab00952d.png
https://flyheight.com/videos/5o6t98/#disqus_thread
I need a RegEx that will only select these URLs instead
https://flyheight.com/videos/yb24os
https://flyheight.com/videos/ybb347
This is what I got so far ^(?!images$).*(flyheight.com/videos/).*
PCRE: ^https?:\/\/flyheight\.com\/videos\/[a-z0-9]{6}$
https://regex101.com/r/vM31MK/1
Maybe this will also work in your language:
^https?://flyheight\.com/videos/[a-z0-9]{6}$
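As a rough check, the pattern can be run against the four URLs from the question. Python is used here only as an assumed host language; the question does not say which one is in use.

```python
import re

# Pattern from the answer above (the variant without escaped slashes).
video_url = re.compile(r"^https?://flyheight\.com/videos/[a-z0-9]{6}$")

urls = [
    "https://flyheight.com/videos/ybb347",
    "https://flyheight.com/videos/yb24os",
    "https://flyheight.com/public/images/videos/793f77362f321e62c32659c3ab00952d.png",
    "https://flyheight.com/videos/5o6t98/#disqus_thread",
]

# Keep only the URLs that the pattern accepts in full.
selected = [u for u in urls if video_url.match(u)]
print(selected)
```

Only the two plain video URLs are selected; the image URL and the #disqus_thread URL fail because of the fixed path and the `$` anchor.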
I'm not too sure if this is what you were looking for, but you could use the following:
^(?!images$).*(flyheight.com/videos/)([^/]+)$
The idea is that it matches the first part that you had, then matches one or more characters that are not a slash ([^/]+).
If you had strings that may or may not contain the / on the end (for example, you had https://flyheight.com/videos/yb24os or https://flyheight.com/videos/yb24os/), you can try the following:
^(?!images$).*(flyheight.com/videos/)([^/]+)/?$
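A small sketch of how the optional trailing slash behaves. The pattern is copied as posted; the helper name `video_id` is hypothetical.

```python
import re
from typing import Optional

# Copied from the answer; the dot in "flyheight.com" is unescaped as posted.
with_optional_slash = re.compile(r"^(?!images$).*(flyheight.com/videos/)([^/]+)/?$")

def video_id(url: str) -> Optional[str]:
    """Return the last path segment if the URL is accepted, else None."""
    m = with_optional_slash.match(url)
    return m.group(2) if m else None

print(video_id("https://flyheight.com/videos/yb24os"))
print(video_id("https://flyheight.com/videos/yb24os/"))
print(video_id("https://flyheight.com/videos/5o6t98/#disqus_thread"))
```

The first two calls return the same ID, while the third returns None because something other than a single optional slash follows the segment.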
This simple expression might do that, since all of your desired outputs start with a y:
\/(y.*)
However, if you wish to add additional boundaries to it, you can do so. For instance, this would strengthen the left boundary:
flyheight.com\/videos\/(y.*)
Or you could use a character class, similar to this:
flyheight.com\/videos\/([a-z0-9]+)
You can also add a quantifier to the desired output, similar to this expression:
flyheight.com\/videos\/([a-z0-9]{6})
You can keep adding boundaries in this way until the expression captures the URLs you want and fails on the others.
For instance, anchoring both ends (adjust the escaping to suit your regex engine) gives:
^(.*)(flyheight.com\/videos\/)([a-z0-9]{6})$
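The difference between the unanchored and the anchored variants shows up on the #disqus_thread URL from the question. In this sketch the dots are escaped, which is an adjustment to the expressions as posted, and the variable names are made up.

```python
import re

# Dots are escaped here (an adjustment for this sketch) so that "." only
# matches a literal period.
unanchored = re.compile(r"flyheight\.com/videos/([a-z0-9]{6})")
anchored = re.compile(r"^(.*)(flyheight\.com/videos/)([a-z0-9]{6})$")

plain = "https://flyheight.com/videos/ybb347"
thread = "https://flyheight.com/videos/5o6t98/#disqus_thread"

# Both variants capture the six-character ID from the plain URL.
plain_ids = (unanchored.search(plain).group(1), anchored.search(plain).group(3))

# Only the unanchored variant still finds a match in the thread URL.
thread_hits = (unanchored.search(thread) is not None,
               anchored.search(thread) is not None)

print(plain_ids, thread_hits)
```

So if the #disqus_thread URL must be excluded, the trailing `$` is what does the work.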

regular expressions: catch any URLs of the domain example.com

I'm trying to write a regexp for the case below. I've made several attempts, but in vain.
I need to catch any URL of the domain site.com. I tried the regexp ^site.com/*$
but it does not recognize them.
I'm just looking for a regexp that matches site.com/*
With your expression ^site.com/*$ you match only strings that consist of site.com followed by zero or more trailing / characters (/*).
If you want to match any string starting with site.com/ you might want to try ^site\.com/.*$ instead.
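The two expressions can be compared side by side. This is a minimal sketch; the test strings, including the deliberately odd "siteXcom/", are assumptions chosen to show the unescaped dot as well.

```python
import re

original = re.compile(r"^site.com/*$")     # "/*" means zero or more slashes
suggested = re.compile(r"^site\.com/.*$")  # a slash followed by anything

# For each sample: (accepted by original, accepted by suggested)
outcome = {
    s: (bool(original.match(s)), bool(suggested.match(s)))
    for s in ["site.com", "site.com////", "site.com/foo", "siteXcom/"]
}
for s, flags in outcome.items():
    print(s, flags)
```

The original accepts "siteXcom/" because its dot is unescaped, and rejects "site.com/foo" because nothing but slashes may follow the domain.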
There are already a lot of other regex questions regarding domain names on SO, but your question is not clear to me in what context you are trying to do this, or what is the actual goal you want to achieve. If you describe your needs more precisely you could probably find some answers on this forum.
I generally use a helper website like regex101.com.
Also, a few things to note: . has a special meaning in regex (it matches any character), and if you want to capture something like site.com/foo you need a pattern that does not limit the number of characters at the end. I'd do this with groupings.
^(site\.com\/)(.+)$
You can see this in action here: https://regex101.com/r/AU2iYC/2
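In Python, for example, the two groups can be read back like this (a sketch; the input strings are made up):

```python
import re

grouped = re.compile(r"^(site\.com/)(.+)$")

m = grouped.match("site.com/foo/bar")
# Group 1 is the fixed domain prefix, group 2 is everything after it.
parts = (m.group(1), m.group(2)) if m else None
print(parts)

# ".+" needs at least one character, so the bare "site.com/" is rejected.
bare = grouped.match("site.com/")
print(bare)
```

If the bare "site.com/" should be accepted too, `.*` would be the quantifier to use in the second group.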
Your regex ^site.com/*$ only matches strings like the following:
site.com/ site.com//////// site.com
because the asterisk (*) in a regex means "match zero or more of the preceding token".
So this should work instead:
^site\.com\/.*$

Define regular expression that matches urls that end with digits unless anything else comes after

I'm using Scrapy to scrape a web site. I'm stuck at defining properly the rule for extracting links.
Specifically, I need help to write a regular expression that allows urls like:
https://discuss.dwolla.com/t/the-dwolla-reflector-is-now-open-source/1352
https://discuss.dwolla.com/t/enhancement-dwolla-php-updated-to-2-1-3/1180
https://discuss.dwolla.com/t/updated-java-android-helper-library-for-dwollas-api/108
while forbidding urls like this one
https://discuss.dwolla.com/t/the-dwolla-reflector-is-now-open-source/1352/12
In other words, I want URLs that end with digits (i.e., /1352 in the example above), unless anything else comes after these digits (i.e., /12 in the example above).
I am by no means an expert of regular expressions, and I could only come up with something like \/(\d+)$, or even this one ^https:\/\/discuss.dwolla.com\/t\/\S*\/(\d+)$, but both fail at excluding the unwanted urls since they all capture the last digits in the address.
--- UPDATE ---
Sorry for not being clear in the first place. This addition is to clarify that the digits at the end of the URLs can change, so the /1352 is not fixed. As such, another example of a URL to be accepted is:
https://discuss.dwolla.com/t/updated-java-android-helper-library-for-dwollas-api/108
This is probably the simplest way:
[^\/\d][^\/]*\/\d+$
or to restrict to a particular domain:
^https?:\/\/discuss\.dwolla\.com\/.*[^\/\d][^\/]*\/\d+$
This regex requires the last part to be all digits, and the 2nd last part to have at least 1 non-digit.
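Before wiring the expression into a Scrapy rule, it can be tried with plain `re` on the URLs from the question. This sketch assumes `re.search` semantics and uses a made-up variable name.

```python
import re

# Simplest pattern from the answer above; slashes need no escaping in Python.
topic_url = re.compile(r"[^/\d][^/]*/\d+$")

urls = [
    "https://discuss.dwolla.com/t/the-dwolla-reflector-is-now-open-source/1352",
    "https://discuss.dwolla.com/t/enhancement-dwolla-php-updated-to-2-1-3/1180",
    "https://discuss.dwolla.com/t/updated-java-android-helper-library-for-dwollas-api/108",
    "https://discuss.dwolla.com/t/the-dwolla-reflector-is-now-open-source/1352/12",
]

# Keep the URLs whose last segment is numeric and whose second-to-last is not.
allowed = [u for u in urls if topic_url.search(u)]
print(len(allowed))
```

The first three URLs are kept. The last one is dropped because the segment before /12 consists only of digits.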
Here is a Java-style regex that may fit your requirements. If you expect a fixed number of digits N, replace the + after \\d with {N}.
^https://discuss\\.dwolla\\.com/t/[\\w-]+/\\d+$

Regex with negative look behind still matches certain strings in Scala

I have a text, that contains url domains in the following form:
[second_level_domain].[top_level_domain]
This could be for instance test.com, amazon.com or something similar, but not more complex stuff like e.g. www.test.com or de.wikipedia.org (no sub level domains!).
It could be that in front of the dot (between second and top level domain) or after the dot is an optional space like test . com, but this doesn't always have to be the case.
However what I don't want to match is if the second level domain and top level domain belong to an e-mail address like for instance hello#test.org. So in this case it shouldn't extract test.org
I wrote the following regex now:
(?<!#)(([a-zA-Z\d]+(?:-[a-zA-Z\d]+)*(?<!www))\s?\.\s?(com|net|org))
With the negative look behind I want to make sure, that in front of the second level domain shouldn't be an #. However it doesn't really do what I expected. For instance on the text hello#test.org it extracts est.org instead of extracting nothing. So, apparently it only looks at the first character when it checks if there is an # in front. But when I use the following regex it seems to work on the text hello#test.org:
(?<!#)((test)\s?\.\s?(com|net|org))
Here I hard coded the second level domain, with which it works. However if I exchange that with a regex that matches all kinds of second level domains
([a-zA-Z\d]+(?:-[a-zA-Z\d]+)*(?<!www))
it doesn't work anymore. It looks like that the negative look behind is already used after the first character is matched and that it doesn't wait with the negative look behind until everything is matched.
As an alternative I could match a bit more and then use the groups afterwards to build my desired match, but I want to avoid that if possible. I would like to match it correctly immediately. I'm not an expert in regular expressions and apparently I have not understood look arounds properly yet. Is there a way to write a regex, which behaves like I want?
(?:^|(?<=\s))((?:[a-zA-Z\d]+(?:-[a-zA-Z\d]+)*(?<!www))\s?\.\s?(?:com|net|org))
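Add anchors to disallow partial matches. A lookbehind is only checked at the position where the engine attempts a match, so after failing at the t (which is preceded by #) the engine retries at the e, where the lookbehind succeeds; that is why est.org was extracted. Requiring the start of the string or preceding whitespace rules out such mid-word starts. See demo: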
https://www.regex101.com/r/rK5lU1/34
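The behaviour can be reproduced outside Scala. The sketch below uses Python's `re`, which handles these fixed-width lookbehinds the same way; the patterns are rebuilt from the same pieces as in the post, with a slightly simplified grouping, and the helper `domains` and the sample sentence are made up for illustration.

```python
import re

DOMAIN = r"[a-zA-Z\d]+(?:-[a-zA-Z\d]+)*(?<!www)"
TLD = r"(?:com|net|org)"

# Lookbehind only, as in the question: the engine can still start mid-word.
lookbehind_only = re.compile(r"(?<!#)(" + DOMAIN + r"\s?\.\s?" + TLD + r")")
# With the answer's anchor: start of string or preceding whitespace required.
anchored = re.compile(r"(?:^|(?<=\s))(" + DOMAIN + r"\s?\.\s?" + TLD + r")")

def domains(pattern, text):
    """Return every group-1 match of pattern in text (hypothetical helper)."""
    return [m.group(1) for m in pattern.finditer(text)]

print(domains(lookbehind_only, "hello#test.org"))
print(domains(anchored, "hello#test.org"))
print(domains(anchored, "see test . com and www.test.com or amazon.com"))
```

The first call yields est.org, the second yields nothing, and the third yields only "test . com" and "amazon.com". One consequence of the anchor is that a domain directly preceded by punctuation, such as an opening parenthesis, is no longer matched either.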

Validate incomplete Regex

Let's say we have a Regex, in my case it's one I found to match UK car registration plates:
^([A-Z]{3}\s?(\d{3}|\d{2}|d{1})\s?[A-Z])|([A-Z]\s?(\d{3}|\d{2}|\d{1})\s?[A-Z]{3})|(([A-HK-PRSVWY][A-HJ-PR-Y])\s?([0][2-9]|[1-9][0-9])\s?[A-HJ-PR-Z]{3})
A typical UK car registration is
HG53CAY
This is matched correctly by the regex, but what I'd like to do is find a way to match any prefix substring of this, so the following would all be valid:
H, HG, HG5, HG53, HG53C, HG53CA, HG53CAY
Is there a suggested way to achieve this?
Firstly I'd rewrite your regexp to look like this:
^([A-Z]{3}\s?(\d{1,3})\s?[A-Z])|([A-Z]\s?(\d{1,3})\s?[A-Z]{3})|(([A-HK-PRSVWY][A-HJ-PR-Y])\s?([0][2-9]|[1-9][0-9])\s?[A-HJ-PR-Z]{3})
as the \d{3}|\d{2}|d{1} parts make no sense (the last alternative lacks a backslash, so it matches a literal d rather than a digit) and should be written \d{1,3}.
Rewriting the regexp like
^([A-Z]{0,3}\s?(\d{0,3})\s?[A-Z]?)|([A-Z]\s?(\d{0,3})\s?[A-Z]{0,3})|(([A-HK-PRSVWY][A-HJ-PR-Y]?)\s?([0]?[2-9]?|[1-9]?[0-9]?)\s?[A-HJ-PR-Z]{0,3})
should have the desired effect of allowing matching of only the beginning of a registration, but unfortunately it's no longer guaranteed that the full registration will be a valid one, as I had to make most characters optional.
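The trade-off can be seen with a short check. This sketch uses Python's `re.fullmatch`, so that the whole input must be consumed; the constant name `LOOSE` and the counter-example "HG5C" are assumptions chosen for illustration.

```python
import re

# Loosened pattern copied from the answer above, split over three lines.
LOOSE = re.compile(
    r"^([A-Z]{0,3}\s?(\d{0,3})\s?[A-Z]?)"
    r"|([A-Z]\s?(\d{0,3})\s?[A-Z]{0,3})"
    r"|(([A-HK-PRSVWY][A-HJ-PR-Y]?)\s?([0]?[2-9]?|[1-9]?[0-9]?)\s?[A-HJ-PR-Z]{0,3})"
)

plate = "HG53CAY"
prefixes = [plate[:i] for i in range(1, len(plate) + 1)]

# Every prefix of the sample plate is accepted...
accepted = [p for p in prefixes if LOOSE.fullmatch(p)]
print(accepted == prefixes)

# ...but so is a string that is not the start of any valid plate.
print(bool(LOOSE.fullmatch("HG5C")))
```

"HG5C" is accepted by the first alternative even though no format in the original expression begins with two letters, one digit, and a letter.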
You could possibly try something like this
^(([A-Z]{3})|[A-Z]{1,2}$)\s?((\d{1,3})|$)...
to make it require either that each part is complete, or that it is incomplete but followed by "end of string", represented by the $ in the regexp.
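A related way to get the same "complete, or incomplete but at the end" behaviour is to nest optional groups, so that each later part may only appear once the earlier part is complete. The sketch below is a swapped-in variant of that idea, not the expression above, and it covers only the current-style format (two letters, two digits, three letters). All names in it are hypothetical.

```python
import re

# Up to three trailing letters, optionally preceded by a space.
LETTERS = r"(?:\s?[A-HJ-PR-Z]{1,3})?"
# Two digits (02-99); the letters may only follow a complete pair of digits.
DIGITS = r"(?:0(?:[2-9]" + LETTERS + r")?|[1-9](?:[0-9]" + LETTERS + r")?)"
# The digits may only follow a complete pair of leading letters.
PLATE_PREFIX = re.compile(
    r"[A-HK-PRSVWY](?:[A-HJ-PR-Y](?:\s?" + DIGITS + r")?)?"
)

def is_plate_prefix(text: str) -> bool:
    """True if text is a prefix (or the whole) of a current-style plate."""
    return PLATE_PREFIX.fullmatch(text) is not None

plate = "HG53CAY"
print(all(is_plate_prefix(plate[:i]) for i in range(1, len(plate) + 1)))
print(is_plate_prefix("HG5C"))
```

Here every prefix of HG53CAY is accepted while "HG5C" is rejected, because a letter can only follow two complete digits. Extending the sketch to the two older formats would mean adding similarly nested alternatives.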