Regex match except URLs starting with a custom string - regex

I have some text and a regex pattern.
The text is something like:
foo https://www.google.hu <img ... src="http://a-page.com/foobar.jpg" ...> bar
The regex:
/(http|https|ftp)\:\/\/(www\.)?([a-zA-Z0-9\-\_\.]+)\.([a-z]{1,5}+)\/([a-zA-Z0-9\.\?\=\&\-\_\~\/\%\+\;]+)?(\#([a-zA-Z0-9\_]+))?/i
I'd like to extend it with a special case:
if a URL is preceded by src=", the regex should not match it (it is an image URL); only the other URLs should match.
I tried this:
/(?!src\=\")(http|https|ftp)\:\/\/(www\.)?([a-zA-Z0-9\-\_\.]+)\.([a-z]{1,5}+)\/([a-zA-Z0-9\.\?\=\&\-\_\~\/\%\+\;]+)?(\#([a-zA-Z0-9\_]+))?/
but it doesn't work.
Could you help me, please?
I know I could add (^|\s) to the pattern, but that won't work when I want to hide URLs: a user can write any character before the URL, and then the URL is no longer hidden. There are also other regexes in the source, one of which handles an img BB tag, and I don't want to hide (replace) its URL.
(Sorry for my English.)

To be honest, I had difficulty understanding exactly what you want, but I guess you mean that you have text with various URLs in it and you don't want to match those that sit inside an HTML img tag. If so, try this:
/(?<!src\=\")(https?|ftp):\/\/(www\.)?([\w\-\.]+)\.([a-z]{1,5}+)\/?([\w\.\?\=\&\-\~\/\%\+\;]+)?(\#(\w+))?/
Notes:
You can replace [A-Za-z0-9_] with character class \w (read more in perlre).
The (?!pattern) assertion you tried is a negative look-ahead assertion. In your case you want a negative look-behind (?<!pattern) (again you can read perlre for more info).
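To see the difference between the two assertions, here is a minimal Python sketch. The URL pattern is a simplified version of the one above (the possessive {1,5}+ is written as {1,5} so it also runs on Python versions before 3.11), and the sample text fills in the elided img attributes, so treat both as assumptions rather than the original code.

```python
import re

# Sample text modelled on the question (img attributes shortened).
text = 'foo https://www.google.hu <img src="http://a-page.com/foobar.jpg"> bar'

# Simplified URL body shared by both patterns (an assumption, not the full original regex).
url = r'(?:https?|ftp)://[\w.-]+\.[a-z]{1,5}(?:/[\w.?=&~/%+;-]*)?'

# Negative look-ahead: inspects the text AT the match start, which is "http...", never 'src="'.
lookahead = re.compile(r'(?!src=")' + url, re.IGNORECASE)

# Negative look-behind: inspects the text just BEFORE the match start.
lookbehind = re.compile(r'(?<!src=")' + url, re.IGNORECASE)

ahead_matches = lookahead.findall(text)
behind_matches = lookbehind.findall(text)
print(ahead_matches)   # both URLs, the look-ahead never rejects anything
print(behind_matches)  # only the URL that is not preceded by src="
```

The look-ahead version can never fail here, because the characters at the match position always begin with the scheme, while the look-behind version skips the image URL.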

Related

Notepad++ html tag / string (a href) replace

I found another post that uses the regex <a[^>]*>([^<]+)</a>. It works great; however, I want to use a capture group to target URLs that contain the four letters RTRD.
I used <a[^>]*>(RTRD+)</a> and that did not work.
TESTER I want to remove the URL and leave TESTER
LEAVE I want to not touch this one.
One that will work: <a\s[^>]*href\=[\"][^\"]*(RTRD)[^\"]*[\"][^>]*>([^<]+)<\/a>
Decomposition:
<a\s[^>]* find opening a tag with space followed by some arguments
href\=[\"][^\"]* find the href attribute: an opening " followed by any number of non-" characters
(RTRD) Your Key group
[^\"]*[\"] Find remainder of argument and closing "
[^>]*>([^<]+)<\/a> The remainder of the original regex
Things your original RegExp would match:
<a stuffhere!!.,?>RTRDDD</a>
<a>RTRD</a>
Decomposing your RegExp:
<a[^>]*> Look for opening tag with any properties
(RTRD+) Look for RTR followed by one or more D, as a group
</a> Look for closing tag
Use <a[^>]*RTRD[^>]*>([^<]+)<\/a> here.
The pattern RTRD should appear somewhere inside the opening tag (<a[^>]*>). This can be done by replacing [^>]* with [^>]*RTRD[^>]*, which is simply:
[^>]* Anything that's not a > (the end of the tag)
RTRD The pattern RTRD
[^>]* Again, anything that's not a >
But caution: this also matches <aRTRD>test</a> or <a id="RTRD">blubb</a>.
And if you have any way other than regex to process HTML, use that instead (string operations etc.).
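As a rough illustration of the href-restricted pattern, the following Python sketch replaces each matching anchor with its link text. The input HTML and its example.com URLs are hypothetical, since the original sample links are not shown.

```python
import re

# Hypothetical input; the URLs are placeholders, not from the original post.
html = (
    '<a href="http://example.com/RTRD/page">TESTER</a> '
    '<a href="http://example.com/other">LEAVE</a>'
)

# Require RTRD inside the quoted href value; group 1 is the link text.
pattern = re.compile(r'<a\s[^>]*href="[^"]*RTRD[^"]*"[^>]*>([^<]+)</a>')

# Replace each matching anchor with its link text only.
result = pattern.sub(r'\1', html)
print(result)
```

Because RTRD has to sit between the quotes of the href value, an anchor such as <a id="RTRD"> would be left alone by this version.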

Conditional Regex to match url

I am trying to write an if/then condition to match a URL, but I can't seem to get it to work. I want to match URLs and capture only the non-optional group. So, if a URL comes in like this:
/en/testing.aspx
I want to capture /testing.aspx
if the url comes in like this:
/testing.aspx
I want to capture /testing.aspx
Is there an easy way to do this using regex?
EDIT:
The Url can be multi-part url, like /en/sub1/sub2/testing.aspx - I essentially want everything after "/en/".
Use the regex \/en(\/.+)$
Check this out:
https://regex101.com/r/lwowhi/6
If the URL may lack the "/en" prefix and you still want to capture /testing.aspx, here is an edit: (?:\/en)*(\/.+)$
https://regex101.com/r/lwowhi/8
You can use a lazy prefix that consumes as little as possible, skip an optional /en, and then capture everything from the next forward slash to the end of the string.
^.*?(?:\/en)?(\/.*)$
Assuming all pages end in .aspx, use a capture group.
regex: .*(/.*\.aspx)
This will capture "/testing.aspx" in all of the samples below:
/testing.aspx or
/en/testing.aspx or
www.abc.com/en-us/testing.aspx
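The behaviour of these patterns can be checked with a short Python sketch. It assumes the optional-prefix pattern ^.*?(?:/en)?(/.*)$ from one of the answers, plus a greedy pattern that keeps only the last .aspx segment, which is my reading of the final answer.

```python
import re

# Optional /en prefix, then capture the rest of the path.
after_en = re.compile(r'^.*?(?:/en)?(/.*)$')

# Greedy variant that keeps only the last path segment ending in .aspx.
last_segment = re.compile(r'.*(/.*\.aspx)')

paths = ['/en/testing.aspx', '/testing.aspx', '/en/sub1/sub2/testing.aspx']
captured = [after_en.match(p).group(1) for p in paths]
print(captured)

last = last_segment.match('www.abc.com/en-us/testing.aspx').group(1)
print(last)
```

The first pattern keeps multi-part paths intact after /en, while the second always reduces the URL to its final segment, so which one fits depends on the requirement in the EDIT above.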

Regex to extract hyperlink containing a specific word

I need to extract a hyperlink, containing a specific word in the url, from a piece of text. Example;
"This is a text with a link to some page. Click this link <a href="/server/specificword.htm>this is a link to a page</a> to see that page. Here is a link that doesn't have the word "specificword" in it: <a href="/server/mypage.htm>this is a link without the word "specificword" in the url</a>"
So, I need to parse this text, check the hyperlinks to see if one of them contains the word "specificword", and then extract the entire hyperlink. I would then end up with this:
<a href="/server/specificword.htm>this is a link to a page</a>
I need the hyperlink that has specificword in the url eg. /server/specificword.htm, not in the link text
One regex I have tried, is this one: /(<a[^>]*>.*?</a>)|specificword/
This will match all hyperlinks in the text, or "specificword". If the text has multiple links, without the word "specificword", I will get those too.
Also, I have tried this one, but it matches nothing:
<a.*?href\s*=\s*["\']([^"\'>]*specificword[^"\'>]*)["\'][^>]*>.*?<\/a>
My regex skills end here, any help would be great....
Try this to match the whole a tag:
/<a [^>]*\bhref\s*=\s*"[^"]*SPECIFICWORD.*?<\/a>/
or just for the link (in the first capture group):
/<a [^>]*\bhref\s*=\s*"([^"]*SPECIFICWORD[^"]*)/
If you use php, for the link:
preg_match_all('/<a [^>]*\bhref\s*=\s*"\K[^"]*SPECIFICWORD[^"]*/', $text, $results);
This one should suit your needs:
<a href="[^"]*?specificword.*?">.*?</a>
If you want to allow other attributes on your anchor tag, and be more permissive about inner spaces, you could try:
<a( [^>]*?)? href="[^"]*?specificword.*?"( .*?)?>.*?</a>
You could also of course use non-capturing groups (?:...):
<a(?: [^>]*?)? href="[^"]*?specificword.*?"(?: .*?)?>.*?</a>
And finally, if you want to allow single quotes for your href attribute (a backreference such as \1 is not recognized inside a character class, so a negative lookahead excludes the closing quote instead):
<a(?: [^>]*?)? href=(["'])(?:(?!\1).)*?specificword.*?\1(?: .*?)?>.*?</a>
Last but not least: if you want to capture the URL, just put parentheses around the (?:(?!\1).)*?specificword.*? part:
<a(?: [^>]*?)? href=(["'])((?:(?!\1).)*?specificword.*?)\1(?: .*?)?>.*?</a>
The final regex you tried almost had it. Try this alteration of it:
<a\s.*?href=["']([^"']*?specificword[^"']*?)[^>]*>.*?<\/a>
The main difference is making the quantifiers "lazy".
Try this pattern:
(?=.*href="([^"]*specificword[^"]*)")<a [^>]+>
If you want only the URL value, use Groups[1].
Like:
Regex.Match("input string", @"(?=.*href=""([^""]*specificword[^""]*)"")<a [^>]+>").Groups[1].Value;
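For comparison, here is a minimal Python sketch of the "word inside the href value" approach. The sample text is a hypothetical, well-formed variant of the one in the question (the closing quotes on href are added), and the pattern is adapted from the first answer.

```python
import re

# Hypothetical, well-formed version of the sample text (closing quotes added to href).
text = (
    'Click <a href="/server/specificword.htm">this is a link to a page</a> '
    'or <a href="/server/mypage.htm">a link that only mentions specificword</a>.'
)

# Group 1 captures the href value; the word must appear inside the quotes.
pattern = re.compile(r'<a [^>]*\bhref\s*=\s*"([^"]*specificword[^"]*)"[^>]*>.*?</a>')

# Collect (whole anchor, href value) pairs.
links = [(m.group(0), m.group(1)) for m in pattern.finditer(text)]
print(links)
```

The second anchor is skipped even though its link text mentions the word, because [^"]* cannot cross the closing quote of the href value.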

Using regex to filter URLs that contain certain words (GA)

I want to filter for all URLs that contain certain words. For example,
I have a URL that looks like this:
www.google.com/&SaveThis=true&SaveType=VeryFast&Page=0
The 'SaveType' value might sometimes change to slow or something else. What I want is to show all URLs that contain 'SaveType=VeryFast'; sometimes this is in the middle of a very long URL.
I tried this:
.*SaveType=VeryFast.*
But it didn't work!
Thanks
From Tip #4 on this page, it looks like you don't need the .* on either end. That is, without the ^ and $ anchors, using SaveType=VeryFast should match any URL that contains those exact characters. It does look like word boundary anchors (\b) are not supported, so you will likely also match any URL that contains e.g. OtherSaveType=VeryFast or SaveType=VeryFastly
Otherwise, I don't see anything wrong with your expression... (?)
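The point about unanchored matching can be sketched in Python. Python's re module stands in for the analytics tool's regex engine here, which is an assumption, and the second and third URLs are made up for illustration.

```python
import re

# Only the first URL comes from the question; the other two are hypothetical.
urls = [
    'www.google.com/&SaveThis=true&SaveType=VeryFast&Page=0',
    'www.google.com/&SaveThis=true&SaveType=Slow&Page=0',
    'www.google.com/&OtherSaveType=VeryFastly&Page=0',
]

# An unanchored search finds the literal text anywhere in the URL,
# so no leading or trailing .* is needed.
kept = [u for u in urls if re.search(r'SaveType=VeryFast', u)]
print(kept)
```

The third URL is kept as well, which is the over-matching the answer warns about when word boundaries are unavailable.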

Match URLs that don't contain asp, aspx, css, htm, html, jpg

Q-1. Match URLs that don't contain asp, aspx, css, htm, html, jpg.
Q-2. Match URLs that don't end with asp, aspx, css, htm, html, jpg.
You want to use a repetition count and make it match 0 times,
e.g.
(match all characters, then a dot, then anything that isn't aspx or css):
^.*\.((aspx) | (css)){0}.*$
Edit:
added the ^ (start) and $ (end of line) anchors.
Q-1. This is better done using a normal string search, but if you insist on regex: (.(?!asp|aspx|css|htm|html|jpg))*.
Q-2. This is better done using a normal string search, but if you insist on regex: .*(?<!asp|css|htm|jpg)(?<!aspx|html)$.
If your regular expression implementation does allow lookaround assertions, try these:
(?:(?!aspx?|css|html?|jpg).)*
.*$(?<!aspx?|css|html?|jpg)
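A minimal Python sketch of both questions follows, assuming Python's re module. Python look-behinds must be fixed width, so the "does not end with" check uses the split form with separate 3- and 4-letter alternatives. The sample paths are hypothetical. The sketch also checks the {0} pattern, since a zero-repetition quantifier simply matches the empty string.

```python
import re

# Tempered dot: no position in the string may start one of the listed words.
not_containing = re.compile(r'^(?:(?!aspx?|css|html?|jpg).)*$')

# Python look-behinds need fixed width, so the 3- and 4-letter endings are split.
not_ending = re.compile(r'^.*(?<!asp|css|htm|jpg)(?<!aspx|html)$')

# A {0} quantifier just matches the empty string, so it rejects nothing.
zero_count = re.compile(r'^.*\.((aspx) | (css)){0}.*$')

contains_ok = bool(not_containing.match('/about/team'))
contains_css = bool(not_containing.match('/css/site'))
ends_txt = bool(not_ending.match('/aspx/readme.txt'))
ends_aspx = bool(not_ending.match('/page.aspx'))
zero_aspx = bool(zero_count.match('page.aspx'))
print(contains_ok, contains_css, ends_txt, ends_aspx, zero_aspx)
```

Note that the "contains" check works on raw substrings, so a path such as /grasp would also be rejected because it contains asp.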