Regex to find anchor tags whose href has a specific domain and that have nofollow

I have a string that contains HTML. I want a regex that finds the anchor tags whose href has a specific domain name and that also have nofollow.
I have found the following, which works for the domain name but does not include the nofollow condition:
(<a\s*(?!.\brel=)[^>])(href="https?://)((?stackoverflow)[^"]+)"([^>]*)>
let's say the domain name I want is stackoverflow
Example:
- "click here " this would match
- "<a href="stackoverflow.com"> would not match since it has no nofollow
- "<a href="google.com" rel = "nofollow"> would not match

It's a bit hard to match an HTML tag with specific conditions, but the following regex should do it:
select regexp_match(str, '<a((?:\s+(([^\/=''"<>\s]+)(=((''[^'']*'')|("[^"]*")|([^\s<>''"=`]+)))?)))* href=((''(https?:\/\/)?stackoverflow\.com[^'']*'')|("(https?:\/\/)?stackoverflow\.com[^"]*"))((?: (([^\/=''"<>\s]+)(=((''[^'']*'')|("[^"]*")|([^\s<>''"=`]+)))?)))*\s+rel=("nofollow"|''nofollow'')((?: (([^\/=''"<>\s]+)(=((''[^'']*'')|("[^"]*")|([^\s<>''"=`]+)))?)))*\/?>') from tes;
It's really hard to read, but most of the regex is there just to match attributes. The important part for you is stackoverflow\.com, which appears twice (once for an href in single quotes and once for an href in double quotes); replace it with whatever domain you need, and don't forget to escape it properly.
Some notes
I don't know which regexp function you want to use, but the pattern should work with whichever one you need. Also, your "click here" example won't be matched as written, because it has spaces between the attribute name and the = sign (I don't know whether that is valid HTML). It will match once those spaces are removed. If you need to match tags that may include spaces around the = sign, leave a comment and I'll try to edit the regex.
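As a rough alternative that tolerates spaces around the = sign, the check can be split into steps: find each opening <a> tag, pull out its attributes, then test href and rel separately. The sketch below uses Python's re module; the function name nofollow_links, the sample HTML, and the optional www. prefix are assumptions made for illustration and are not part of the regex above.

```python
import re

# Opening <a ...> tags. This simple form breaks if an attribute value contains ">".
TAG_RE = re.compile(r'<a\s[^>]*>', re.IGNORECASE)
# name = "value" | 'value' | bare value, with optional spaces around "=".
ATTR_RE = re.compile(
    r'([^\s=<>"\'/]+)\s*=\s*(?:"([^"]*)"|\'([^\']*)\'|([^\s<>"\'=`]+))'
)

def nofollow_links(html, domain):
    """Return opening <a> tags whose href is on `domain` and whose rel has nofollow."""
    href_re = re.compile(
        r'^(?:https?://)?(?:www\.)?' + re.escape(domain) + r'(?:[/?#:]|$)',
        re.IGNORECASE,
    )
    hits = []
    for tag in TAG_RE.findall(html):
        attrs = {}
        for m in ATTR_RE.finditer(tag):
            value = m.group(2) or m.group(3) or m.group(4) or ''
            attrs[m.group(1).lower()] = value
        on_domain = href_re.match(attrs.get('href', '')) is not None
        has_nofollow = 'nofollow' in attrs.get('rel', '').lower().split()
        if on_domain and has_nofollow:
            hits.append(tag)
    return hits

# Hypothetical sample input, modelled on the examples in the question.
sample = (
    '<a href="https://stackoverflow.com/q/1" rel = "nofollow">click here</a> '
    '<a href="stackoverflow.com">no rel</a> '
    '<a href="google.com" rel="nofollow">other domain</a>'
)
matches = nofollow_links(sample, 'stackoverflow.com')
```

Only the first tag in the sample ends up in matches. Splitting the work this way keeps each pattern short, at the cost of no longer being a single expression you can drop into a SQL query.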

Related

Notepad++ html tag / string (a href) replace

I found another post that uses the following regex: <a[^>]*>([^<]+)</a>. It works great; however, I want to use a capture group to target URLs that contain the four letters RTRD.
I used <a[^>]*>(RTRD+)</a> and that did not work.
TESTER I want to remove the URL and leave TESTER
LEAVE I want to not touch this one.
One that will work: <a\s[^>]*href\=[\"][^\"]*(RTRD)[^\"]*[\"][^>]*>([^<]+)<\/a>
Decomposition:
<a\s[^>]* find opening a tag with space followed by some arguments
href\=[\"][^\"]* find href attribute with " opening and then multiple non " closing
(RTRD) Your Key group
[^\"]*[\"] Find remainder of argument and closing "
[^>]*>([^<]+)<\/a> The remainder of the original regex
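To see the decomposed pattern at work, here is a minimal sketch in Python (in Notepad++ the same idea would be a regex replace with the second group as the replacement). The sample links and URLs are made up for illustration; only the pattern comes from the answer above.

```python
import re

# The pattern from above: group 1 is the RTRD key, group 2 is the link text.
LINK_RE = re.compile(r'<a\s[^>]*href="[^"]*(RTRD)[^"]*"[^>]*>([^<]+)</a>')

# Hypothetical input: one link whose href contains RTRD, one that does not.
sample = (
    '<a href="http://example.com/RTRD/page">TESTER</a> '
    '<a href="http://example.com/other">LEAVE</a>'
)

# Replace each matching link with just its text (group 2).
result = LINK_RE.sub(r'\2', sample)
print(result)  # TESTER <a href="http://example.com/other">LEAVE</a>
```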
Things your original RegExp would match:
<a stuffhere!!.,?>RTRDDD</a>
<a>RTRD</a>
Decomposing your RegExp:
<a[^>]*> Look for opening tag with any properties
(RTRD+) Look for the RTRD group but also match one or more D
</a> Look for closing tag
Use <a[^>]*RTRD[^>]*>([^<]+)<\/a> here.
Somewhere inside the opening tag (<a[^>]*>) should be the pattern RTRD. This can be done by replacing [^>]* with [^>]*RTRD[^>]*, which is simply:
[^>]* Anything that's not a > (the end of the opening tag)
RTRD The pattern RTRD
[^>]* Again, anything that's not a >
But caution: this also matches <aRTRD>test</a> or <a id="RTRD">blubb</a>
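The caution can be checked directly. The sketch below, in Python, compares the loose pattern with a stricter one that requires the string to sit inside a double-quoted href (the approach of the other answer); RTRD is the string from the question, and the third sample link is a made-up example.

```python
import re

# Loose: RTRD may appear anywhere inside the opening tag.
loose = re.compile(r'<a[^>]*RTRD[^>]*>([^<]+)</a>')
# Stricter: RTRD must appear inside a double-quoted href value.
strict = re.compile(r'<a\s[^>]*href="[^"]*RTRD[^"]*"[^>]*>([^<]+)</a>')

samples = [
    '<aRTRD>test</a>',
    '<a id="RTRD">blubb</a>',
    '<a href="http://example.com/RTRD">TESTER</a>',
]

loose_hits = [bool(loose.search(s)) for s in samples]
strict_hits = [bool(strict.search(s)) for s in samples]
print(loose_hits)   # [True, True, True]
print(strict_hits)  # [False, False, True]
```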
And if you have any way other than regex to process the HTML (string operations, etc.), use that instead.

Regex for Page Filtering in Google Analytics

I'm trying to use GA to filter out certain URL pages. I need to distinguish between pages like this:
www.example.com/hotel/hotelfoofoo
and this:
www.example.com/hotel/hotelfoofoo/various-options-go-here?lots-of-other-stuff-follows
I'm new to regex, so I know very little, but am basically trying to capture URL pages that begin with /hotel/ but do not include any other forward slashes. Is there a way to write that code?
Two possible solutions:
1) Assuming only alpha numeric + '-' signs allowed in the name of hotel:
/hotel/([-\w]+)(?![-\/\w])
Note: the hotel name is captured in the first group. The idea is to capture the run of letters, digits, underscores, and - symbols that is not followed by a slash.
2) Assuming white space symbol required to designate url end:
/hotel/([^\s/]+)(?=\s)
Note: depending on your regexp flavor, some characters may need to be escaped. For JS, every "/" should be escaped, e.g. "\/".
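Both patterns can be tried against the two URLs from the question. This is a small Python sketch rather than a Google Analytics filter (GA's own regex flavor may differ, for example in lookahead support, which is an assumption worth checking); the variable names are illustrative.

```python
import re

# Solution 1: name made of word characters and "-", not followed by more of them or "/".
by_charset = re.compile(r'/hotel/([-\w]+)(?![-/\w])')
# Solution 2: name runs up to whitespace, which must follow the URL.
by_whitespace = re.compile(r'/hotel/([^\s/]+)(?=\s)')

short_url = 'www.example.com/hotel/hotelfoofoo'
long_url = ('www.example.com/hotel/hotelfoofoo/'
            'various-options-go-here?lots-of-other-stuff-follows')

m1 = by_charset.search(short_url)
print(m1.group(1))                  # hotelfoofoo
print(by_charset.search(long_url))  # None

# Solution 2 only matches when whitespace terminates the URL.
m2 = by_whitespace.search(short_url + ' ')
print(m2.group(1))                      # hotelfoofoo
print(by_whitespace.search(short_url))  # None
```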

extracting a part of an href in JMeter

I'm stuck with the following.
I have a page on IBM FileNet containing a list of objects (documents or files) which have a specific classId and id in their href. I need JMeter to get all hrefs containing a specific type of ID:
<a href="http://ipaddress/Workplace/Browse.jsp?eventTarget=WcmController&eventName=GetInfo&id={350B278C-DE7D-44DE-9B54-099672152476}&vsId=&classId={F14AC85A-4474-479A-9B4E-BCBA180B7975}&objectStoreName=Nice&majorVersion=&minorVersion=&versionStatus=&mimeType=&mode=&objectType=customobject&isPopup=true" target="_blank">
The 'classId' = {F14AC85A-4474-479A-9B4E-BCBA180B7975} is the class ID type I need to click on the page (there are several files with this classId, but that is no problem). The 'id', on the other hand, is different for each file.
How can I extract all 'id's whose link contains this specific classId and make JMeter pass one to the next sampler, so it clicks on just one of them? What will my regex look like?
As already mentioned in the comment, I do not know JMeter or how to implement this in it. A regular expression to match both id and classId within a link would be:
~(classId=|id=)([^&]*)~g
This is, search for a string classId= or id= first. If one of the strings is found, match any character afterwards, except an ampersand (&), as many times as possible (*) and capture it in a group (brackets). Possibly you need to fiddle with the parameters (e.g. /g for global) after the regex.
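To illustrate what the pattern captures, here is a sketch in Python rather than JMeter. The href is a shortened copy of the one in the question, and it assumes the parameters are separated by a plain & (in raw HTML source they may appear as &amp; instead).

```python
import re

# Same pattern as above, without the ~ delimiters and the g flag.
PARAM_RE = re.compile(r'(classId=|id=)([^&]*)')

# Shortened version of the href from the question.
href = ('http://ipaddress/Workplace/Browse.jsp?eventTarget=WcmController'
        '&eventName=GetInfo&id={350B278C-DE7D-44DE-9B54-099672152476}'
        '&vsId=&classId={F14AC85A-4474-479A-9B4E-BCBA180B7975}'
        '&objectStoreName=Nice')

WANTED_CLASS = '{F14AC85A-4474-479A-9B4E-BCBA180B7975}'

# findall returns (name, value) pairs; strip the trailing "=" from each name.
params = {name.rstrip('='): value for name, value in PARAM_RE.findall(href)}

# Keep the id only when the link carries the wanted classId.
selected_id = params['id'] if params.get('classId') == WANTED_CLASS else None
```

Note that the match is case-sensitive, so vsId= is not picked up by the id= alternative.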
See this regex101 fiddle for more information.

Regex to match anything after /

I'm basically clueless about regex, but I need a regex that will recognise anything after the / in a URL.
Basically, I'm developing a site for someone, and a page's URL (a local URL, of course) is, say, (http://)localhost/sweettemptations/available-sweets. This page is filled with custom post types (it's a WordPress site) which have URLs of the form (http://)localhost/sweettemptations/sweets/sweet-name.
What I want to do is redirect the URL (http://)localhost/sweettemptations/sweets back to (http://)localhost/sweettemptations/available-sweets which is easy to do, but I also need to redirect any type of sweet back to (http://)localhost/sweettemptations/available-sweets. So say I need to redirect (http://)localhost/sweettemptations/sweets/* back to (http://)localhost/sweettemptations/available-sweets.
If anyone could help by telling me how to write a proper regex statement to match everything after sweets/ in the URL, it would be hugely appreciated.
To do what you ask, you need to use groups. In regular expressions, groups allow you to isolate parts of the whole match.
For example:
input string of: aaaaaaaabbbbcccc
regex: a*(b*)
The parentheses mark a group; in this case it is group 1, since it is the first in the pattern.
Note: group 0 is implicit and is the complete match.
So the matches in my above case will be:
group 0: aaaaaaaabbbb
group 1: bbbb
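The same example can be run as a quick check. This sketch uses Python's re module; other regex engines expose groups in a similar way.

```python
import re

m = re.match(r'a*(b*)', 'aaaaaaaabbbbcccc')

whole_match = m.group(0)  # group 0: the complete match
first_group = m.group(1)  # group 1: the part inside the parentheses

print(whole_match)  # aaaaaaaabbbb
print(first_group)  # bbbb
```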
In order to achieve what you want with the sweets pattern above, you just need to put a group around the end.
possible solution: /sweets/(.*)
the more precise you are with the pattern before the group the less likely you will have a possible false positive.
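Applied to the URLs from the question, the group picks out the sweet name. A minimal Python sketch (the helper name sweet_name is an assumption for illustration):

```python
import re

SWEETS_RE = re.compile(r'/sweets/(.*)')

def sweet_name(url):
    """Return whatever follows /sweets/ in the URL, or None if it is absent."""
    m = SWEETS_RE.search(url)
    return m.group(1) if m else None

print(sweet_name('http://localhost/sweettemptations/sweets/sweet-name'))
# sweet-name
print(sweet_name('http://localhost/sweettemptations/available-sweets'))
# None
```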
If what you really want is to match anything after the last / you can take another approach:
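possible other solution: /([^/]*)$
The pattern above finds the last / followed by a run of characters that are NOT another /, up to the end of the string, and keeps that run in group 1 (without the $ anchor, the first / in the string would match instead). The issue here is that you could match things that do not have sweets in the URL.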
Note if you do not mind the / at the beginning then just remove the ( and ) and you do not have to worry about groups.
I like to use http://regexpal.com/ to test my regex.. It will mark in different colors the different matches.
Hope this helps.
I may have misunderstood your requirement in my original post.
If you just want to change any string that matches
(http://)localhost/sweettemptations/sweets/*
into the other one you provided (without adding the part matched by your * at the end), I would use a regular expression to match the pattern in the URL but then just blindly replace the whole string with the desired one:
(http://)localhost/sweettemptations/available-sweets
So if you want the URL:
http://localhost/sweettemptations/sweets/somethingmore.html
to turn into:
http://localhost/sweettemptations/available-sweets
and not into:
localhost/sweettemptations/available-sweets/somethingmore.html
Then the solution is simpler, no groups required :).
When doing this, I would avoid hard-coding the "localhost" part in the pattern. I am also assuming that (http://) really means an optional http:// in front, since (http://) is not a valid protocol prefix.
so if that is what you want then this should match the pattern:
(http://)?[^/]+/sweettemptations/sweets/.*
This regular expression will match the http:// part optionally with a host (be it localhost, an IP or the host name). You could omit the .* at the end if you want.
If that pattern matches just replace the whole URL with the one you want to redirect to.
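The match-then-replace idea could look like the sketch below. It is written in Python only to show the logic; in a WordPress site the redirect would more likely live in a rewrite rule or plugin. The function name redirect_target and the TARGET constant are assumptions.

```python
import re

# Optional http://, any host, then the sweets path and whatever follows it.
SWEETS_URL_RE = re.compile(r'(http://)?[^/]+/sweettemptations/sweets/.*')
TARGET = 'http://localhost/sweettemptations/available-sweets'

def redirect_target(url):
    """Return TARGET for any sweets URL; return other URLs unchanged."""
    # The pattern requires the slash after "sweets", so a bare .../sweets
    # (no trailing slash) is left alone and needs its own rule.
    return TARGET if SWEETS_URL_RE.fullmatch(url) else url

print(redirect_target('http://localhost/sweettemptations/sweets/somethingmore.html'))
# http://localhost/sweettemptations/available-sweets
```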
use this regular expression (?<=://).+

rel-tag bookmarklet for last path component of a URL

Many web sites support folksonomy tags. You may have heard of rel-tag, where it says that "The last path component of the URL is the text of the tag".
I am looking for a bookmarklet or greasemonkey script (javascript) to get the "last path component" for the URL currently being viewed in the browser, add that tag into another URL, and then open that page in a new tab or window.
For example, if I am looking at a delicious.com page with the tag "foo", I may want to create a new URL with the tag "foo". This should also work for multiple tags in the last path component, such as, foo+bar.
Some regexp suggestions have been offered.
Since you're using JavaScript, there's no need to worry about hostnames, querystrings, etc - just use location.pathname to get at the important bit.
For example:
var newUrl = 'http://technorati.com/tag/';
// match() returns null when nothing matches, so guard before using the result
var lastPart = location.pathname.match( /[^\/]+\/?$/ );
if ( lastPart ) window.open( newUrl + lastPart[0] );
That allows for a potential single trailing slash.
You can use /[^\/]+$/ to disallow trailing slashes, or /[^\/]+\/*$/ for any number of them.
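The three trailing-slash behaviours (one optional slash, none, any number) can be compared side by side. The sketch below uses Python and plain path strings in place of location.pathname; the helper name last_part and the sample paths are made up for illustration.

```python
import re

ONE_OPTIONAL_SLASH = re.compile(r'[^/]+/?$')
NO_TRAILING_SLASH = re.compile(r'[^/]+$')
ANY_TRAILING_SLASHES = re.compile(r'[^/]+/*$')

def last_part(pattern, path):
    """Return the text matched at the end of the path, or None."""
    m = pattern.search(path)
    return m.group(0) if m else None

print(last_part(ONE_OPTIONAL_SLASH, '/tag/foo+bar/'))   # foo+bar/
print(last_part(NO_TRAILING_SLASH, '/tag/foo+bar/'))    # None
print(last_part(NO_TRAILING_SLASH, '/tag/foo+bar'))     # foo+bar
print(last_part(ANY_TRAILING_SLASHES, '/tag/foo//'))    # foo//
```

Note that the matched text includes any trailing slashes, so they would need stripping before the tag is appended to another URL.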
If you can assume both your URLs to be valid, you can get the tag from the first URL with this regex:
^[a-z]+://[^/#?]+/[^#?]*?([^#?/]+)(?:[#?]|$)
The first (and only) capturing group will hold the tag. This regex won't match URLs that don't have any tags.
To append the tag to another URL, search for the regex:
^([^#?]*?)/?(?:[#?]|$)
and replace with:
$1/tag
This regex makes sure not to end up with two adjacent slashes in the URL if the path of the original URL ends with a slash.
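Put together, the two regexes could be used as in this sketch. It is Python rather than a bookmarklet, the function names extract_tag and append_tag are assumptions, and the examples stick to target URLs without a query string or fragment.

```python
import re

# Group 1 holds the last path component (the tag).
TAG_RE = re.compile(r'^[a-z]+://[^/#?]+/[^#?]*?([^#?/]+)(?:[#?]|$)')
# Group 1 holds the target URL up to, but excluding, an optional trailing slash.
APPEND_RE = re.compile(r'^([^#?]*?)/?(?:[#?]|$)')

def extract_tag(url):
    """Return the last path component of url, or None if there is none."""
    m = TAG_RE.match(url)
    return m.group(1) if m else None

def append_tag(url, tag):
    """Append tag to the path of url without producing a double slash."""
    # A function replacement avoids treating backslashes in the tag specially.
    return APPEND_RE.sub(lambda m: m.group(1) + '/' + tag, url, count=1)

tag = extract_tag('http://delicious.com/tag/foo+bar')
new_url = append_tag('http://technorati.com/tag/', tag)
print(tag)      # foo+bar
print(new_url)  # http://technorati.com/tag/foo+bar
```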