Negative lookbehind in a regex with an optional prefix - regex

We are using the following regex to recognize urls (derived from this gist by Jim Gruber). This is being executed in Scala using scala.util.matching which in turn uses java.util.regex:
(?i)\b((?:https?:(?:/{1,3}|[a-z0-9%])|[a-z0-9.\-]+[.](?!js)[a-z]{2,6}/)(?:[^\s()<>{}\[\]]+)(?:[^\s`!()\[\]{};:'".,<>?«»“”‘’])|(?:(?<!#)[a-z0-9]+(?:[.\-][a-z0-9]+)*[.](?!js)[a-z]{2,6}\b/?(?!#)))
This version has escaped forward slashes, for Rubular:
(?i)\b(((?:https?:(?:\/{1,3}|[a-z0-9%])|[a-z0-9.\-]+[.](?!js)[a-z]{2,6}\/)(?:[^\s()<>{}\[\]]+)(?:[^\s`!()\[\]{};:'".,<>?«»“”‘’])|(?:(?<!#)[a-z0-9]+(?:[.\-][a-z0-9]+)*[.](?!js)[a-z]{2,6}\b\/?(?!#))))
Previously the front end only sent plaintext to the back end, but now users are allowed to create anchor tags for URLs. The back end therefore needs to recognize URLs except for those that are already inside anchor tags. I initially tried to accomplish this with a negative lookbehind, ignoring URLs with an href=" prefix:
(?i)\b((?<!href=")((?:https?: ... etc
The problem is that our URL regex is very liberal, recognizing http://www.google.com, www.google.com, and google.com. Given
<a href="http://www.google.com">Google</a>
the negative lookbehind will ignore http://www.google.com, but the regex will then still recognize www.google.com. I'm wondering if there's a succinct way to tell the regex "ignore www.google.com and google.com if they are substrings of an ignored http(s)://www.google.com".
At present I'm using a filter on the url regex matches (code is in Scala) - this also ignores urls in link text (www.google.com) by ignoring urls with a > prefix and </a> suffix. I'd rather stick with the filter if doing this in a regex would make an already complicated regex even more unreadable.
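urlPattern.findAllMatchIn(text).toList.filter { m =>
  val start: Int = m.start(1)
  val end: Int = m.end(1)
  // Skip URLs that are the value of an href attribute: href="<url>
  // (href=" is 6 characters long, so start must be at least 6)
  val isHref: Boolean = start >= 6 &&
    text.substring(start - 6, start) == "href=\""
  // Skip URLs that are the link text of an anchor: ><url></a>
  // (</a> is 4 characters long, so compare 4 characters after the match)
  val isAnchor: Boolean = start >= 1 && end + 4 <= text.length &&
    text.substring(start - 1, start) == ">" &&
    text.substring(end, end + 4) == "</a>"
  !(isHref || isAnchor) && Option(m.group(1)).isDefined
}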

<a href=\S+|\b((?:https?:(?:\/{1,3}|[a-z0-9%])|[a-z0-9.\-]+[.](?!js)[a-z]{2,6}\/)(?:[^\s()<>{}\[\]]+)(?:[^\s`!()\[\]{};:'".,<>?«»“”‘’])|(?:(?<!#)[a-z0-9]+(?:[.\-][a-z0-9]+)*[.](?!js)[a-z]{2,6}\b\/?(?!#)))
or
<a href=(?:(?!<\/a>).)*<\/a>|\b((?:https?:(?:\/{1,3}|[a-z0-9%])|[a-z0-9.\-]+[.](?!js)[a-z]{2,6}\/)(?:[^\s()<>{}\[\]]+)(?:[^\s`!()\[\]{};:'".,<>?«»“”‘’])|(?:(?<!#)[a-z0-9]+(?:[.\-][a-z0-9]+)*[.](?!js)[a-z]{2,6}\b\/?(?!#)))
Try one of these. What they essentially do is:
Consume every <a href=...> link so that it cannot be matched later.
Do not capture it, so it will not appear in the groups anyway.
Process the rest as before.
See demo.
http://regex101.com/r/vR4fY4/17
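The consume-and-discard idea above can be sketched in Python as follows. This is only an illustration: the URL sub-pattern is a much simplified stand-in for the long regex in the question, and the names URL, PATTERN and bare_urls are made up for this example.

```python
import re

# Simplified stand-in for the question's URL regex (an assumption, not the original).
URL = r"(?:https?://)?[a-z0-9]+(?:[.\-][a-z0-9]+)*\.[a-z]{2,6}\b(?:/[^\s<>\"]*)?"

# Alternative 1 consumes a whole anchor element and captures nothing.
# Alternative 2 captures a bare URL in group 1.
PATTERN = re.compile(r"<a\s[^>]*>.*?</a>|\b(" + URL + r")",
                     re.IGNORECASE | re.DOTALL)

def bare_urls(text):
    # Matches of alternative 1 leave group 1 unset, so they are dropped here.
    return [m.group(1) for m in PATTERN.finditer(text) if m.group(1) is not None]

text = ('Visit <a href="http://www.google.com">www.google.com</a>, '
        'example.org or http://foo.com/bar now')
print(bare_urls(text))
```

Because the whole anchor element is consumed by the first alternative, neither the href value nor the link text is offered to the URL alternative again.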

It seems that you not only want to ignore www.google.com and google.com when they are substrings of an ignored http(s)://www.google.com, but any substring fragment of a previously ignored section. In that case, you can use a bit of code to work around this. Please see the regex:
(a href=")?(?i)\b(((?:https?:(?:\/{1,3}|[a-z0-9%])|[a-z0-9.\-]+[.](?!js)[a-z]{2,6}\/)(?:[^\s()<>{}\[\]]+)(?:[^\s`!()\[\]{};:'".,<>?«»“”‘’])|(?:(?<!#)[a-z0-9]+(?:[.\-][a-z0-9]+)*[.](?!js)[a-z]{2,6}\b\/?(?!#))))
^^^^^^^^^^^
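I'm not good at Scala, but you can probably do this:
import scala.util.matching.Regex
// Name the first (optional) capturing group "unwanted"
val links = new Regex("""(a href=")?(?i)\b(((?:https?:... """, "unwanted")
val unwanted = for (o <- links findAllMatchIn text) yield o group "unwanted"
If the unwanted group is null for a match, that URL was not preceded by a href=" and the match is useful.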
If you need to do a replacement instead, you can work around this with an alternation:
a href="(?i)\b(?:(?:https?:(?:\/{1,3}|[a-z0-9%])|[a-z0-9.\-]+[.](?!js)[a-z]{2,6}\/)(?:[^\s()<>{}\[\]]+)(?:[^\s`!()\[\]{};:'".,<>?«»“”‘’])|(?:(?<!#)[a-z0-9]+(?:[.\-][a-z0-9]+)*[.](?!js)[a-z]{2,6}\b\/?(?!#)))|((?i)\b(((?:https?:(?:\/{1,3}|[a-z0-9%])|[a-z0-9.\-]+[.](?!js)[a-z]{2,6}\/)(?:[^\s()<>{}\[\]]+)(?:[^\s`!()\[\]{};:'".,<>?«»“”‘’])|(?:(?<!#)[a-z0-9]+(?:[.\-][a-z0-9]+)*[.](?!js)[a-z]{2,6}\b\/?(?!#)))))
The second part of the regex, after the pipe |, is wrapped in a capturing group. You can replace using this regex and refer to that first group as \1 in the replacement.
Similar question:
Regex Pattern to Match, Excluding when... / Except between

How about just adding the <a href= part as an optional group, then when checking your matching, you only return those matches in which that group is empty?
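A minimal sketch of this optional-group approach, written in Python rather than Scala. The URL sub-pattern is a simplified stand-in for the regex in the question, and the group names and the function urls_outside_href are hypothetical names chosen for this example.

```python
import re

# Simplified stand-in for the question's URL regex (an assumption, not the original).
URL = r"(?:https?://)?[a-z0-9]+(?:[.\-][a-z0-9]+)*\.[a-z]{2,6}\b(?:/[^\s<>\"]*)?"

# The optional group "unwanted" is set only when the URL directly follows href=".
PATTERN = re.compile(r'(?P<unwanted>href=")?\b(?P<url>' + URL + r")", re.IGNORECASE)

def urls_outside_href(text):
    # Keep only matches where the optional prefix group did not participate.
    return [m.group("url") for m in PATTERN.finditer(text)
            if m.group("unwanted") is None]

text = '<a href="http://www.google.com">Google</a> and example.org'
print(urls_outside_href(text))
```

Since the full http://www.google.com is consumed together with its prefix, the shorter www.google.com and google.com fragments are not matched again. A URL used as the link text of the anchor would still be returned, so that case needs a separate check.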

Related

Exclude phrase ending with matched pattern using regex

I have written a rule to catch all web domains ending with .watch or .video, and I want to exclude 2 domains:
(?!\.videolecture|fb\.watch)\b(\.video|\.watch)
The first exclusion .videolecture works fine. But I can't exclude fb.watch.
I'm sorry, but I couldn't find any similar questions on Stack Overflow.
You can use
\b(\.video(?!lecture\b)|(?<!\bfb)\.watch)
See the regex demo. Details:
\b - a word boundary
( - start of a capturing group:
\.video(?!lecture\b) - .video that is not immediately followed by lecture as a whole word
| - or
(?<!\bfb)\.watch - .watch that is not immediately preceded with fb as a whole word
) - end of the group.
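The pattern explained above can be checked quickly with a few host names. This is a small Python sketch; apart from fb.watch, the host names are invented for illustration, and is_caught is a hypothetical helper name.

```python
import re

# The pattern from the answer above, unchanged.
DOMAIN_RULE = re.compile(r"\b(\.video(?!lecture\b)|(?<!\bfb)\.watch)")

def is_caught(host):
    # True when the rule would catch this host name.
    return DOMAIN_RULE.search(host) is not None

for host in ["example.video", "example.watch", "fb.watch", "site.videolecture"]:
    print(host, is_caught(host))
```

The lookahead handles the exclusion that comes after the match (lecture), while the lookbehind handles the one that comes before it (fb), which is why the original single lookahead could not exclude fb.watch.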
An exclude variable could be set using a map. To exclude second-level domains, videolecture or fb, and top-level domains, watch or video:
map $host $exclude {
~\b(?:videolecture|fb)\.(?:watch|video)$ 1;
}
Return 403 if $exclude is set:
server {
    if ($exclude) {
        return 403;
    }
}

Notepad++ html tag / string (a href) replace

I found another post that uses the regex <a[^>]*>([^<]+)</a>. It works great, but I want to use a capture group to target only URLs that contain the four letters RTRD.
I tried <a[^>]*>(RTRD+)</a>, and that did not work.
TESTER I want to remove the URL and leave TESTER
LEAVE I want to not touch this one.
One that will work: <a\s[^>]*href\=[\"][^\"]*(RTRD)[^\"]*[\"][^>]*>([^<]+)<\/a>
Decomposition:
<a\s[^>]* find opening a tag with space followed by some arguments
href\=[\"][^\"]* find href attribute with " opening and then multiple non " closing
(RTRD) Your Key group
[^\"]*[\"] Find remainder of argument and closing "
[^>]*>([^<]+)<\/a> The remainder of the original regex
Things your original RegExp would match:
<a stuffhere!!.,?>RTRDDD</a>
<a>RTRD</a>
Decomposing your RegExp:
<a[^>]*> Look for opening tag with any properties
(RTRD+) Look for the RTRD group but also match one or more D
</a> Look for closing tag
Use <a[^>]*RTRD[^>]*>([^<]+)<\/a> here.
The pattern RTRD should appear somewhere inside the opening tag (<a[^>]*>). This can be done by replacing [^>]* with [^>]*RTRD[^>]*, which is simply
[^>]* Anything that's not a > (closing bracket of the tag)
RTRD The pattern RTRD
[^>]* Again anything that's not a >
But caution: this also matches <aRTRD>test</a> or <a id="RTRD">blubb</a>
And if you have any way other than regex to process the HTML, use it (string operations etc.)
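To see the href-based pattern from the first answer at work outside Notepad++, here is a short Python sketch. The sample links are invented, since the URLs in the question are not shown, and strip_rtrd_links is a hypothetical helper name.

```python
import re

# Pattern from the first answer: an <a> tag whose href value contains RTRD.
# Group 1 is the key (RTRD), group 2 is the link text.
RTRD_LINK = re.compile(r'<a\s[^>]*href="[^"]*(RTRD)[^"]*"[^>]*>([^<]+)</a>')

def strip_rtrd_links(html):
    # Replace each matching anchor with its link text only.
    return RTRD_LINK.sub(r"\2", html)

# Hypothetical sample input.
sample = ('<a href="http://example.com/RTRD/1">TESTER</a> '
          '<a href="http://example.com/other">LEAVE</a>')
print(strip_rtrd_links(sample))
```

Anchoring RTRD inside the quoted href value avoids the false positives mentioned above, such as an id attribute that happens to contain RTRD.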

JMeter extract link using regular expression pass into next request with blank values

This is how I have Test Plan set up:
HTTP Request -> Regular Expression Extractor to extract multiple links. The extraction works correctly, but some of the links are blank.
Regular Expression Extractor: <a href="(.*)" class="product-link">
BeanShell Sampler to filter out blank or null values. This works fine.
BeanShell Sampler
log.info("Enter Beanshell Sampler");
matches = vars.get("url_matchNr");
log.info(matches);
// url_1 .. url_N are 1-based, so the loop has to include the last index
for (Integer i = 1; i <= Integer.parseInt(matches); i++)
{
    String url = vars.get("url_" + i);
    //log.info(url1);
    if (url != null && url.length() > 0)
    {
        log.info(i + "->" + url);
        //return url;
        //vars.put("url2", url);
        vars.put("url2", url);
        //props.put("url2", url);
        log.info("URL2:" + vars.get("url2"));
    }
}
ForEach Controller
Test Plan
The problem I am facing is that the ForEach Controller runs through all the values, including blank or null ones. How can I run the loop only for the non-blank, non-null values?
You should change your regular expression so that it excludes empty values.
Instead of matching any value, including an empty one, with the * quantifier:
<a href="(.*)" class="product-link">
match only non-empty strings with the + quantifier:
<a href="(.+)" class="product-link">
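The difference between the two quantifiers can be illustrated outside JMeter. The following Python sketch uses made-up markup with each anchor on its own line; the href values are hypothetical.

```python
import re

# Hypothetical response body: one anchor per line, the first with an empty href.
html = "\n".join([
    '<a href="" class="product-link">',
    '<a href="/product/1" class="product-link">',
    '<a href="/product/2" class="product-link">',
])

# * allows zero characters, so the empty href is captured as ''.
with_star = re.findall(r'<a href="(.*)" class="product-link">', html)
# + requires at least one character, so the empty href is skipped.
with_plus = re.findall(r'<a href="(.+)" class="product-link">', html)

print(with_star)
print(with_plus)
```

If several anchors share one line, a greedy .+ can run across them, so a narrower class such as [^"]+ may be the safer choice there.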
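As mentioned earlier, you should change your regex.
You can replace it directly with
<a href="(.+)" class="product-link">
or with something more constraining like this:
<a href="((https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?)" class="product-link">
which only matches when the href value is a URL. The inner part is the URL regex from the article below, without its ^ and $ anchors: they cannot match in the middle of a pattern and would prevent any match here.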
https://code.tutsplus.com/tutorials/8-regular-expressions-you-should-know--net-6149
The first capturing group is all option. It allows the URL to begin
with "http://", "https://", or neither of them. I have a question mark
after the s to allow URL's that have http or https. In order to make
this entire group optional, I just added a question mark to the end of
it.
Next is the domain name: one or more numbers, letters, dots, or hypens
followed by another dot then two to six letters or dots. The following
section is the optional files and directories. Inside the group, we
want to match any number of forward slashes, letters, numbers,
underscores, spaces, dots, or hyphens. Then we say that this group can
be matched as many times as we want. Pretty much this allows multiple
directories to be matched along with a file at the end. I have used
the star instead of the question mark because the star says zero or
more, not zero or one. If a question mark was to be used there, only
one file/directory would be able to be matched.
Then a trailing slash is matched, but it can be optional. Finally we
end with the end of the line.
String that matches:
http://net.tutsplus.com/about
String that doesn't match:
http://google.com/some/file!.html (contains an exclamation point)
Good luck!!!
The ForEach Controller doesn't work with JMeter Properties. You need to change the "Input Variable Prefix" to url_2, and your test should start working as expected.
Also be aware that since JMeter 3.1 it is recommended to use Groovy language for any form of scripting so consider migrating to JSR223 Sampler and Groovy language on next available opportunity.
Groovy has much better performance while Beanshell might become a bottleneck when it comes to immense loads.

Regex in Google Analytics for segment creation

I'm trying to trap URLs of the following structure:
/resources/state-name/city-name
given that there are URLs of the following type
/resources/other-words
/resources/state-name
/resources/state-name/city-name/other-words
I have tried to trap using
include/matches regex:
\/resources\/.*\/.*
exclude/matches regex:
\/resources\/.*\/.*\/.*
but this is allowing the other-words and state-name only to slip through.
Try this regex: \/resources\/[^\/\r\n]*(?:$|(?:\/.*\/.*$)). I assumed the end of the URL is also the end of the line. It matches all of them except /resources/state-name/city-name.
To match only /resources/state-name/city-name, use this one: \/resources\/[^\/\r\n]*\/[^\/\r\n]*$.
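Both expressions can be checked against the four URL shapes from the question. This is a quick Python sketch; Google Analytics uses its own regex engine, so treat it as a sanity check of the logic only. The names ONLY_CITY and ALL_BUT_CITY are made up for this example.

```python
import re

# The two expressions from the answer above (slashes need no escaping in Python).
ALL_BUT_CITY = re.compile(r"/resources/[^/\r\n]*(?:$|(?:/.*/.*$))")
ONLY_CITY = re.compile(r"/resources/[^/\r\n]*/[^/\r\n]*$")

paths = [
    "/resources/other-words",
    "/resources/state-name",
    "/resources/state-name/city-name",
    "/resources/state-name/city-name/other-words",
]

city_pages = [p for p in paths if ONLY_CITY.search(p)]
others = [p for p in paths if ALL_BUT_CITY.search(p)]
print(city_pages)
print(others)
```

The class [^/] is what keeps each segment from swallowing a slash, which is the step the original .* versions were missing.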
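Something like this: /(\/resources\/)([\w+%-]+)\/([\w+%-]+)/g
With [\w+%-] you match any letter, digit, underscore, +, % and - in the URL (I included + and % because spaces in URLs are replaced with -, + or %20).
Also, thanks to the (), you can access each part with $1 to $3. Note that without a trailing $ this also matches the longer /resources/state-name/city-name/other-words URLs.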

Regex not working with similar looking string

I am trying to get the content between Start and End tag for the below mentioned strings
Products
Services & Solution
Regex used:
<([a-z0-9]+)([^<]+)\*(?:>(.\*?)</\\2>|\\D+/>)
It works fine for the first string but not for the latter one.
Why so complex? Won't a simple />([^<]+)</ capture the content of an element?
Depending on the flavour of regex, use lookahead and lookbehind to get just the match between > and <, i.e.
(?<=>)[^>]*(?=<)
(?<=>) - looks behind for a >
(?=<) - looks ahead for a <
[^>]* - matches the text in the link itself
Lookahead and lookbehind are zero-width matches, so you will get just what you need.
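A short Python sketch of the lookaround approach. The tags in the sample strings are assumptions, because the markup in the question is not shown; only the link texts come from the question.

```python
import re

# Text that has a > right before it and a < right after it.
BETWEEN_TAGS = re.compile(r"(?<=>)[^>]*(?=<)")

# Hypothetical markup around the two link texts from the question.
samples = [
    '<a href="/products">Products</a>',
    '<a href="/services">Services & Solution</a>',
]

for s in samples:
    print(BETWEEN_TAGS.findall(s))
```

Because the lookarounds consume nothing, the match contains only the text itself, with no capture group needed.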
Usually you don't want to parse HTML yourself with regex; parsers are better at that.
Assuming you are using PCRE here's a random guess at the expression you are looking for:
(?is)<([a-z]+)\b[^<>]*(?:>(.*?)</\1>|/>)
Note that this will not work with nested tags.
Just get rid of the tags.
var str = 'Products '
var str2 = 'Services & Solution '
// The closing-tag branch needs + so that tag names longer than one character match
var RE_findOpenAndCloseTag = /^<[^>]+>|<\/[^>]+>$/g;
str.replace( RE_findOpenAndCloseTag, '' ) == "Products ";
str2.replace( RE_findOpenAndCloseTag, '' ) == "Services & Solution ";
Note that RE_findOpenAndCloseTag assumes that tags will always start with a < and not contain an > unless it's closing the tag.
Thus this will fail.
'>">This will fail
But an easier way would be to convert the tags into a node, then get the innerHTML.
Try this it will resolve your issue (Just Add |</\1>)
<([a-z0-9]+)([^<]+)*(?:>(.*?)|\D+/>|</\1>)