regular expressions: catch any URLs of the domain example.com - regex

I'm trying to write a regexp for the case below. I have made several attempts, but in vain.
I need to catch any URLs of the domain site.com. I tried the regexp '^site.com/*$'
but it does not recognize them.
I'm just looking for a regexp that matches site.com/*

With your expression ^site.com/*$ you match only strings that consist of site.com followed by zero or more trailing / characters (/*), and nothing else.
If you want to match any string starting with site.com/, you might want to try ^site\.com/.*$.
There are already a lot of other regex questions regarding domain names on SO, but it is not clear to me from your question in what context you are trying to do this, or what goal you actually want to achieve. If you describe your needs more precisely you could probably find some answers on this forum.
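The difference between the two expressions can be checked quickly. The following is a minimal sketch in Python (the question does not say which engine is used, so Python's re module is an assumption, and the sample strings are made up for illustration):

```python
import re

# The asker's pattern: unescaped dot, then zero or more slashes, then end of string.
original = re.compile(r"^site.com/*$")
# The suggested pattern: literal dot, one slash, then anything up to the end.
suggested = re.compile(r"^site\.com/.*$")

# Hypothetical sample inputs, not taken from the question.
samples = ["site.com", "site.com///", "site.com/foo", "siteXcom/"]

results = {
    s: (bool(original.search(s)), bool(suggested.search(s)))
    for s in samples
}

for s, (orig_ok, sugg_ok) in results.items():
    print(f"{s!r}: original={orig_ok}, suggested={sugg_ok}")
```

Note that the unescaped dot in the original pattern also accepts a string such as siteXcom/, which is why the suggested pattern escapes it.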

I generally use a helper website like regex101.com.
Also, a few things to note: . has a special meaning in regex (it matches any character), and if you want to capture site.com/foo you need something that does not limit the number of characters at the end. I'd do this with groups.
^(site\.com\/)(.+)$
You can see this in action here: https://regex101.com/r/AU2iYC/2
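As a rough illustration of what the two groups capture, here is a small Python sketch (Python is an assumption, and the input string is invented for the example):

```python
import re

# Group 1 is the fixed prefix, group 2 is everything after it (at least one character).
pattern = re.compile(r"^(site\.com/)(.+)$")

m = pattern.match("site.com/foo/bar")
prefix, rest = m.groups() if m else (None, None)
print(prefix, rest)

# Because of .+, a bare "site.com/" with nothing after the slash does not match.
bare = pattern.match("site.com/")
print(bare)
```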

Your regex ^site.com/*$ matches only strings like the following:
e.g. site.com/ site.com//////// site.com
because the asterisk (*) in regex means "match 0 or more of the preceding token".
So this should work:
^site\.com\/.*$

Related

301 redirection, matching URLs through regex. Matching dashes

I'm trying to match URLs for a migration, but I can't seem to come up with a regex that matches them.
I've tried different expressions and used regex checkers to determine where exactly it's broken, but it's not clear to me.
This is my regex
https:\/\/blog\.xyz\.ca\/EN\/post\/201[0-9]\/[0-9][0-9]\/[0-9][0-9]\/*\).aspx
I'm trying to match these kinds of urls (hundreds)
https://blog.xyz.ca/EN/post/2019/05/14/how-test-higher-education-test-can-test-more-test-students-and-test-sdf-the-test.aspx
https://blog.xyz.ca/EN/post/2019/05/14/how-test.aspx
https://blog.xyz.ca/EN/post/2019/05/14/how-test-higher-the-test.aspx
And remap them to something like this
https://blog.xyz.ca/2017/12/21/test-how-the-testaspx
I thought that I could match the dash section using the wildcard, but it seems to not be working and none of the generators are giving me a clear warning. I've tried https://regexr.com/ and https://www.regextester.com/
If I understand the problem right, here we might just want to have a simple expression and capture our desired URL components, according to which we would find our redirect rules, and we can likely start with:
(.+\.ca)\/EN\/post(\/[0-9]{4}\/[0-9]{2}\/[0-9]{2})(\/.+)\.aspx
and if necessary, we would be adding/reducing our constraints, and I'm guessing that no validation might be required.
or:
(.+\.ca)\/EN\/post(\/[0-9]{4}\/[0-9]{2}\/[0-9]{2})(\/.+)(\.aspx)
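To show how the captured components could feed a redirect rule, here is a hedged Python sketch. The target layout (host, then date, then slug, with /EN/post and .aspx dropped) is only a guess based on the example in the question, and remap is a hypothetical helper name:

```python
import re

# Groups: 1 = scheme and host, 2 = /YYYY/MM/DD, 3 = /slug (extension excluded).
OLD_POST = re.compile(
    r"(.+\.ca)/EN/post(/[0-9]{4}/[0-9]{2}/[0-9]{2})(/.+)\.aspx"
)

def remap(url: str) -> str:
    """Rebuild the URL from the captured groups (assumed target format)."""
    return OLD_POST.sub(r"\1\2\3", url)

new_url = remap("https://blog.xyz.ca/EN/post/2019/05/14/how-test.aspx")
print(new_url)

# URLs that do not fit the old pattern are returned unchanged.
untouched = remap("https://blog.xyz.ca/about")
print(untouched)
```

In a real migration the same groups would typically be referenced from the web server's rewrite rule rather than from a script.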

RegEx to match and select specific URLs

I'm on a website with these URLs:
https://flyheight.com/videos/ybb347
https://flyheight.com/videos/yb24os
https://flyheight.com/public/images/videos/793f77362f321e62c32659c3ab00952d.png
https://flyheight.com/videos/5o6t98/#disqus_thread
I need a RegEx that will only select these URLs instead
https://flyheight.com/videos/yb24os
https://flyheight.com/videos/ybb347
This is what I got so far ^(?!images$).*(flyheight.com/videos/).*
PCRE: ^https?:\/\/flyheight\.com\/videos\/[a-z0-9]{6}$
https://regex101.com/r/vM31MK/1
Maybe this will also work in your language:
^https?://flyheight\.com/videos/[a-z0-9]{6}$
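A quick way to try the pattern against the four URLs from the question is sketched below in Python (the language is an assumption, since the question does not name one):

```python
import re

# Exactly six lowercase letters or digits after /videos/, and nothing after that.
VIDEO_URL = re.compile(r"^https?://flyheight\.com/videos/[a-z0-9]{6}$")

urls = [
    "https://flyheight.com/videos/ybb347",
    "https://flyheight.com/videos/yb24os",
    "https://flyheight.com/public/images/videos/793f77362f321e62c32659c3ab00952d.png",
    "https://flyheight.com/videos/5o6t98/#disqus_thread",
]

selected = [u for u in urls if VIDEO_URL.match(u)]
print(selected)
```

The {6} quantifier assumes every video ID is six characters long, as in the examples; if IDs vary in length, a + quantifier would be the looser choice.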
I'm not too sure if this is what you were looking for, but you could use the following:
^(?!images$).*(flyheight.com/videos/)([^/]+)$
The idea is that it would match the first part that you had, then match one or more characters that are not slashes ([^/]+).
If you had strings that may or may not contain the / on the end (for example, you had https://flyheight.com/videos/yb24os or https://flyheight.com/videos/yb24os/), you can try the following:
^(?!images$).*(flyheight.com/videos/)([^/]+)/?$
Here are my results on regexr.
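The two variants above can be compared side by side. This is a small Python sketch under the assumption that Python's re behaves like the tester used in the answer for these patterns:

```python
import re

# Note: (?!images$) only rejects the exact string "images", so it has no effect on these URLs.
STRICT = re.compile(r"^(?!images$).*(flyheight.com/videos/)([^/]+)$")
LENIENT = re.compile(r"^(?!images$).*(flyheight.com/videos/)([^/]+)/?$")

urls = [
    "https://flyheight.com/videos/yb24os",
    "https://flyheight.com/videos/yb24os/",
    "https://flyheight.com/videos/5o6t98/#disqus_thread",
    "https://flyheight.com/public/images/videos/793f77362f321e62c32659c3ab00952d.png",
]

strict_hits = [u for u in urls if STRICT.match(u)]
lenient_hits = [u for u in urls if LENIENT.match(u)]
print(strict_hits)
print(lenient_hits)

# Group 2 holds the video ID without the optional trailing slash.
video_id = LENIENT.match(urls[1]).group(2)
print(video_id)
```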
This simple expression might do that, since all your desired outputs start with a y:
\/(y.*)
However, if you wish to add additional boundaries to it, you can do so. For instance, this would strengthen the left boundary:
flyheight.com\/videos\/(y.*)
Or you could add a character class, similar to this:
flyheight.com\/videos\/([a-z0-9]+)
You can also add a quantifier to the desired output, similar to this expression:
flyheight.com\/videos\/([a-z0-9]{6})
and you can simply increase and add any boundary that you wish and capture your desired URLs, and fail others.
You might want to use this tool and change/edit/modify your expression based on your desired engine, as you wish:
^(.*)(flyheight.com\/videos\/)([a-z0-9]{6})$

regex to find domain without those instances being part of subdomain.domain

I'm new to regex. I need to find instances of example.com in an .SQL file in Notepad++ without those instances being part of subdomain.example.com.
From this answer, I've tried using ^((?!subdomain))\.example\.com$, but this does not work.
I tested this in Notepad++ and at https://regex101.com/r/kS1nQ4/1 but it doesn't work.
Help appreciated.
Simple
^example\.com$
with g,m,i switches will work for you.
https://regex101.com/r/sJ5fE9/1
If the matching should be done somewhere in the middle of the string you can use negative look behind to check that there is no dot before:
(?<!\.)example\.com
https://regex101.com/r/sJ5fE9/2
Without access to example text, it's a bit hard to guess what you really need, but the regular expression
(^|\s)example\.com\>
will find example.com where it is preceded by nothing or by whitespace, and followed by a word boundary. (You could still get a false match on example.com.pk because the period is a word boundary. Provide better examples in your question if you want better answers.)
If you specifically want to use a lookaround, note that the negative lookahead you used (as the name implies) specifies what the regex should not match at this point. So (?!subdomain\.)example trivially always matches, because the text at that position is "example", not "subdomain."; the negative lookahead can never fail there.
You might be better served by a lookbehind:
(?<!subdomain\.)example\.com
Demo: https://regex101.com/r/kS1nQ4/3
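The two lookbehind variants suggested in the answers behave differently, which a short Python sketch can show. Python is used here only for illustration (the question is about Notepad++), and the sample text is invented:

```python
import re

# Hypothetical sample text with a bare domain and two subdomains.
text = "example.com subdomain.example.com www.example.com"

# Rejects any occurrence that has a dot directly before it.
no_dot_before = re.compile(r"(?<!\.)example\.com")
# Rejects only occurrences preceded by the literal "subdomain.".
no_subdomain_before = re.compile(r"(?<!subdomain\.)example\.com")

dot_starts = [m.start() for m in no_dot_before.finditer(text)]
sub_starts = [m.start() for m in no_subdomain_before.finditer(text)]
print(dot_starts)
print(sub_starts)
```

The first pattern keeps only the bare example.com, while the second also keeps www.example.com, so the choice depends on whether other subdomains should count as matches.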
Here's a solution that takes into account the protocols/prefixes,
/^(www\.)?(http:\/\/www\.)?(https:\/\/www\.)?example\.com$/

Regular expression with negative look aheads

I am trying to construct a regular expression to remove links from content unless they meet one of two conditions.
<a.*?href=[""'](http[s]?:\/\/(.*?)\.link\.com)?\/(?!m\/).*?<\/a>
This will match any link to link.com that does not have m/ at the end of the domain section. I want to change this slightly so it doesn't match URLs that are links to PDF files, regardless of whether they have m/ in the URL. I came up with:
<a.*?href=["'](http[s]?:\/\/(.*?)\.brodies\.com)?\/(?!m\/).*?\.(?!pdf)["'].*?<\/a>
Which is ooh so very close except now it will only match if the URL has a "." at the end - I can see why it's doing it. I can't seem to make the "." optional as this causes the non greedy pattern prior to the "." to keep going until it hits the ["']
Any help would be good to help solve this.
Thanks
Paul
You probably want to use (?<!\.pdf)["'] instead of \.(?!pdf)["'].
But note that this expression has several issues, best way to solve them is to use a proper HTML parser.
First, see "RegEx match open tags except XHTML self-contained tags".
That said (since it probably will not deter you), here is a slightly-better-constrained version of what you're trying to do, with the caveat that this is still not good enough!
<a[^>]+?href\s*=\s*["'](https?:\/\/[^"']*?\.link\.com)?\/(?!m\/)[^"']*?\.(?!pdf)[^"']*?["'][^>]*?>.*?<\/a>
You can see a running example of this regex at: http://rubular.com/r/obkKrKpB8B.
Your problem was actually just that you were looking for a quote character immediately after the dot, here: .(?!pdf)["'].
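As a rough sketch of the (?<!\.pdf)["'] suggestion, the Python snippet below combines it with the tighter character classes from the other answer. The combined pattern is an illustration and not something either answer posted verbatim, the sample HTML is made up, and links_to_remove is a hypothetical helper name. As noted above, a proper HTML parser remains the more robust option.

```python
import re

# The lookbehind checks that the closing quote is not directly preceded by ".pdf".
LINK = re.compile(
    r"""<a[^>]+?href\s*=\s*["'](?:https?://[^"']*?\.link\.com)?/(?!m/)[^"']*?(?<!\.pdf)["'][^>]*?>.*?</a>"""
)

def links_to_remove(html: str) -> list[str]:
    """Return the anchors that are neither under m/ nor PDF links."""
    return [m.group(0) for m in LINK.finditer(html)]

# Hypothetical sample content.
html = (
    '<a href="http://www.link.com/about">About</a> '
    '<a href="http://www.link.com/m/page">Mobile</a> '
    '<a href="http://www.link.com/files/report.pdf">Report</a>'
)

matches = links_to_remove(html)
print(matches)
```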

I want a regular expression that only matches domain names with one period in them

I want it to catch things like somedomain.com/folder/path, but not something like domain.sub.other.com. The regex I have so far is almost complete, it just doesn't sift out the multi-domain urls:
^(.*)://(?!(.{2,3})\.(.*)(.{2,3})(.*)
Is there any way to sift out on multiple periods?
Instead of .{2,3}, you want something like this: [^.]{2,3} - this excludes the period (no need to escape as it has no special meaning in this context in a regular expression) from that particular match. Overall you'd have something like:
://[^.]+\.[^.]{2,3}(/.*)?
Except obviously you're missing things like *.info by doing that....
Found a solution that is working given a variety of test scenarios:
^(.*)://([^.]+)\.([^(\?|/|\r|\n|\.)]+)((/|\?|$)+)(.*)$
Here, the 2nd to the last group is matching against a potential forward slash, question mark or end of string, working together with the group before it which does not allow matches which include '.'
So the final effect is that it only matches URLs with a two-part domain such as 'domain.com' and there aren't any limits placed on string length.
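The final expression can be exercised with a few sample URLs. This is a minimal Python sketch; Python is an assumption and the last two sample URLs are invented for the test:

```python
import re

# Pattern copied from the answer. Inside [^...] the characters ( | ) are literal,
# so the class also excludes them in addition to ? / CR LF and the dot.
TWO_PART = re.compile(
    r"^(.*)://([^.]+)\.([^(\?|/|\r|\n|\.)]+)((/|\?|$)+)(.*)$"
)

samples = [
    "http://somedomain.com/folder/path",
    "http://domain.sub.other.com",
    "https://domain.com",
    "https://domain.com?x=1",
]

accepted = [s for s in samples if TWO_PART.match(s)]
print(accepted)
```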