Regular expression with negative look aheads - regex

I am trying to contruct a regular expression to remove links from content unless it contains 1 of 2 conditions.
<a.*?href=[""'](http[s]?:\/\/(.*?)\.link\.com)?\/(?!m\/).*?<\/a>
This will match any link to link.com that does not have m/ at the end of the domain section. I want to change this slightly so it does't match URLs that are links to pdf files regardless of having the m/ in the url, I came up with:
<a.*?href=["'](http[s]?:\/\/(.*?)\.brodies\.com)?\/(?!m\/).*?\.(?!pdf)["'].*?<\/a>
Which is ooh so very close except now it will only match if the URL has a "." at the end - I can see why it's doing it. I can't seem to make the "." optional as this causes the non greedy pattern prior to the "." to keep going until it hits the ["']
Any help would be good to help solve this.
Thanks
Paul

You probably want to use (?<!\.pdf)["'] instead of \.(?!pdf)["'].
But note that this expression has several issues, best way to solve them is to use a proper HTML parser.

First, RegEx match open tags except XHTML self-contained tags.
That said, (since it probably will not deter,) here is a slightly-better-constrained version of what you're trying to, with the caveat that this is still not good enough!
<a[^>]+?href\s*=\s*["'](https?:\/\/[^"']*?\.link\.com)?\/(?!m\/)[^"']*?\.(?!pdf)[^"']*?["'][^>]*?>.*?<\/a>
You can see a running example of this regex at: http://rubular.com/r/obkKrKpB8B.
Your problem was actually just that you were looking for a quote character immediately after the dot, here: .(?!pdf)["'].

Related

regular expressions: catch any URLs of the domain example.com

I'm trying to get regexp code for the below case. I tried multiple tries but in vain.
I need to catch any URLs of the domain site.com. Tried using regexp '^site.com/*$
but it does not recognizes it.
i'm just looking for regexp code whichmatches site.com/*
With your expression ^site.com/*$ you match all strings that start with site.com and have zero or more trailing / characters (/*):
If you want to match any strings starting with site.com/ you might want to try ^site\.com/.*$:
There are already a lot of other regex questions regarding domain names on SO, but your question is not clear to me in what context you are trying to do this, or what is the actual goal you want to achieve. If you describe your needs more precisely you could probably find some answers on this forum.
I generally use a helper website like regex101.com.
Also, a few things to note, . has a special meaning in regex meaning any character, and if you wanted to capture site.com/foo you might want to use something where you are not limited to the number of characters by the end. I'd do this with groupings.
^(site\.com\/)(.+)$
You can see this in action here: https://regex101.com/r/AU2iYC/2
Your regex ^site.com/*$ is only matched follow sentences
ex) site.com/ site.com//////// site.com
because * asterisk in regex means Match 0 or more of the preceding token.
so, it should be work
^site.com\/.*$

regex to find domain without those instances being part of subdomain.domain

I'm new to regex. I need to find instances of example.com in an .SQL file in Notepad++ without those instances being part of subdomain.example.com(edited)
From this answer, I've tried using ^((?!subdomain))\.example\.com$, but this does not work.
I tested this in Notepad++ and # https://regex101.com/r/kS1nQ4/1 but it doesn't work.
Help appreciated.
Simple
^example\.com$
with g,m,i switches will work for you.
https://regex101.com/r/sJ5fE9/1
If the matching should be done somewhere in the middle of the string you can use negative look behind to check that there is no dot before:
(?<!\.)example\.com
https://regex101.com/r/sJ5fE9/2
Without access to example text, it's a bit hard to guess what you really need, but the regular expression
(^|\s)example\.com\>
will find example.com where it is preceded by nothing or by whitespace, and followed by a word boundary. (You could still get a false match on example.com.pk because the period is a word boundary. Provide better examples in your question if you want better answers.)
If you specifically want to use a lookaround, the neative lookahead you used (as the name implies) specifies what the regex should not match at this point. So (?!subdomain\.)example trivially matches always, because example is not subdomain. -- the negative lookahead can't not be true.
You might be better served by a lookbehind:
(?<!subdomain\.)example\.com
Demo: https://regex101.com/r/kS1nQ4/3
Here's a solution that takes into account the protocols/prefixes,
/^(www\.)?(http:\/\/www\.)?(https:\/\/www\.)?example\.com$/

RegEx filter links from a document

I am currently learning regex and I am trying to filter all links (eg: http://www.link.com/folder/file.html) from a document with notepad++. Actually I want to delete everything else so that in the end only the http links are listed.
So far I tried this : http\:\/\/www\.[a-zA-Z0-9\.\/\-]+
This gives me all links which is find, but how do I delete the remaining stuff so that in the end I have a neat list of all links?
If I try to replace it with nothing followed by \1, obviously the link will be deleted, but I want the exact opposite to have everything else deleted.
So it should be something like:
- find a string of numbers, letters and special signs until "http"
- delete what you found
- and keep searching for more numbers, letters ans special signs after "html"
- and delete that again
Any ideas? Thanks so much.
In Notepad++, in the Replace menu (CTRL+H) you can do the following:
Find: .*?(http\:\/\/www\.[a-zA-Z0-9\.\/\-]+)
Replace: $1\n
Options: check the Regular expression and the . matches newline
This will return you with a list of all your links. There are two issues though:
The regex you provided for matching URLs is far from being generic enough to match any URL. If it is working in your case, that's fine, else check this question.
It will leave the text after the last matched URL intact. You have to delete it manually.
The answer made previously by #psxls was a great help for me when I have wanted to perform a similar process.
However, this regex rule was written six years ago now: accordingly, I had to adjust / complete / update it in order it can properly work with the some recent links, because:
a lot of URL are now using HTTPS instead of HTTP protocol
many websites less use www as main subdomain
some links adds punctuation mark (which have to be preserved)
I finally reshuffle the search rule to .*?(https?\:\/\/[a-zA-Z0-9[:punct:]]+) and it worked correctly with the file I had.
Unfortunately, this seemingly simple task is going to be almost impossible to do in notepad++. The regex you would have to construct would be...horrible. It might not even be possible, but if it is, it's not worth it. I pretty much guarantee that.
However, all is not lost. There are other tools more suitable to this problem.
Really what you want is a tool that can search through an input file and print out a list of regex matches. The UNIX utility "grep" will do just that. Don't be scared off because it's a UNIX utility: you can get it for Windows:
http://gnuwin32.sourceforge.net/packages/grep.htm
The grep command line you'll want to use is this:
grep -o 'http:\/\/www.[a-zA-Z0-9./-]\+\?' <filename(s)>
(Where <filename(s)> are the name(s) of the files you want to search for URLs in.)
You might want to shake up your regex a little bit, too. The problems I see with that regex are that it doesn't handle URLs without the 'www' subdomain, and it won't handle secure links (which start with https). Maybe that's what you want, but if not, I would modify it thusly:
grep -o 'https\?:\/\/[a-zA-Z0-9./-]\+\?' <filename(s)>
Here are some things to note about these expressions:
Inside a character group, there's no need to quote metacharacters except for [ and (sometimes) -. I say sometimes because if you put the dash at the end, as I have above, it's no longer interpreted as a range operator.
The grep utility's syntax, annoyingly, is different than most regex implementations in that most of the metacharacters we're familiar with (?, +, etc.) must be escaped to be used, not the other way around. Which is why you see backslashes before the ? and + characters above.
Lastly, the repetition metacharacter in this expression (+) is greedy by default, which could cause problems. I made it lazy by appending a ? to it. The way you have your URL match formulated, it probably wouldn't have caused problems, but if you change your match to, say [^ ] instead of [a-zA-Z0-9./-], you would see URLs on the same line getting combined together.
I did this a different way.
Find everything up to the first/next (https or http) (then everything that comes next) up to (html or htm), then output just the '(https or http)(everything next) then (html or htm)' with a line feed/ carriage return after each.
So:
Find: .*?(https:|http:)(.*?)(html|htm)
Replace with: \1\2\3\r\n
Saves looking for all possible (incl non-generic) url matches.
You will need to manually remove any text after the last matched URL.
Can also be used to create url links:
Find: .*?(https:|http:)(.*?)(html|htm)
Replace: \1\2\3\r\n
or image links (jpg/jpeg/gif):
Find: .*?(https:|http:)(.*?)(jpeg|jpg|gif)
Replace: <img src="\1\2\3">\r\n
I know my answer won't be RegEx related, but here is another efficient way to get lines containing URLs.
This won't remove text around links like Toto mentioned in comments.
At least if there is nice pattern to all links, like https://.
CTRL+F => change tab to Mark
Insert https://
Tick Mark to bookmark.
Mark All.
Find => Bookmarks => Delete all lines without bookmark.
I hope someone who lands here in search of same problem will find my way more user-friendly.
You can still use RegEx to mark lines :)

Regex to identify HTML tags (as a regex repetition learning exercise ONLY!!)

I'm very very new to regex. I'd managed to not touch it with a 10-foot pole for so long. And I tried my best to avoid it so far. But now a personal project is pushing me to learn it.
So I started. And I'm going through the tutorial located here:http://www.regular-expressions.info/tutorial.html
Currently I'm here: http://www.regular-expressions.info/repeat.html
My question is:
The tutorial says <[A-Za-z][A-Za-z0-9]*> will match an HTML tag.
But wouldn't it also match invalid html tags like - <h11> or <h111>?
Also how would it match the closing tags?
Edit - My question is very specific. I am referring to one particular example in one particular tutorial to clarify whether or not my understanding of repetitions is correct. Again, I REPEAT, I DO NOT care about html parsing with regex.
I don't see any harm in answering your question seeing as how you are attempting to learn regex:
1) Yes, it will match invalid tags as well because it's any letter followed by any zero or more matches of another letter or a number.
2) It will not match closing tags (there would have to be a search for a / somewhere in there).
One more comment: one way people used to use to look for html tags inside a document was to look for the pattern of opening and closing brackets, like so:
<\/?[^>]*>
That's opening-bracket, an optional slash, (anything but a closing bracket)-repeated and then a closing bracket. Of course, I am not recommending anyone do this. It's merely left here as an exercise.
The tutorial says <[A-Za-z][A-Za-z0-9]*> will match an HTML tag.
But wouldn't it also match invalid html tags like - or ?
Also how would it match the closing tags?
Yes, that will match <h11> as well as <X098wdfhfdshs98fhj2hsdljhkvjnvo9sudvsodfih23234osdfs>.
If you want to just match a letter followed by an optional single digit, so you'd match <h1>, then you want <[A-Za-z][0-9]?>

Regex Problem (newbie)

i'm writing a little app for spam-checking and i'm having problems with a regex.
let's say i'm having this spam-url:
http://hosting.tyumen.ru/tip.html
so i want to check its url for having 2 full stops (subdomain+ending), a slash, a word, full stop and "html".
here's what i got so far:
(http://.*?\..*?..*?/.*?.html)
might look like rubbish but it works - the problem: it's really slow and freezing my app.
any hints on how to optimize it?
thx.re
The reason it's slow is that the non-greedy operators ? being used this way is prone to catastrophic backtracking
Instead of saying "any amount of anything, but only to an extent where it doesn't conflict with later requirements", which is effectively what .*? is saying, try asking for "as much as possible, that isn't a double quote, which would terminate the href ":
\1
I also added a back-reference (\1) to your first capturing group, inside the <a>...</a>, so that you don't have to do the exact same matching all over again.
Note that this regex will be broken if, say, the a has a class name, an id, or anything else in its body. I left it like this because I wanted to give you what you asked for with as few changes as possible, and as to-the-point as possible.
(http://[\w.-]+/.+?\.html) - may be will work for your case only.
or may be faster one
(http://[\w.-]+/[^.]+\.html)
Since you claim to be a regexp newbie, I will offer a more general advice on creating and debugging regular expressions. When they get pretty complicated, I find using Regexp Coach a must.
It's a freeware and really saves a lot of headache. Not to mention you don't have to build / run your application every minute just to see if the regexp works the way you wanted.
In Python, a simple way to match URLs ending in .html or .htm is to use
url_re = re.compile(
r'https?://' # http:// or https://
r'(?:(?:[A-Z0-9](?:[A-Z0-9-]{0,61}[A-Z0-9])?\.)+[A-Z]{2,6}\.?|' #domain...
r'localhost|' #localhost...
r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})' # ...or ip
r'(?::\d+)?' # optional port
r'(?:\S+.html?)+' # ending in .html
, re.IGNORECASE)
which is a modified version of Django's UrlField regex.
This will match any site ending with .html or .htm. (either localhost, ip, domain).
#http://[-a-zA-Z0-9]+\.[-a-zA-Z0-9]+\.[-a-zA-Z]+/\w+\.html#