RegEx to find all possible relative links to a specific file - also capture link text - regex

Yes, there's hundreds of [regex] [html] topics on SO, but the first 30 I've checked don't help me with my problem.
I've got 745 total links (all relative, and they have to stay relative) to a file in my site. I need to find all these links and append data before and after them. I also need to capture and use the link text.
I've tried several expressions and the regex below is the closest I can get, but it's not good enough - it keeps finding a few instances of some other href to a different file and captures the content all the way to the </a> of the file I actually care about.
<a href="((.)*?)?myFile.html((.)*?)?>((.)*?)?</a>
In the above, I need to capture the relative path to the file and any anchors that might be present, as well as the actual link text.
What regex should I be using?
It shouldn't matter, but I'm using Adobe Dreamweaver to perform the search.

The following regex should work for what you need:
<a href="([^"]*?a\.fparameters\.html)(#[^"]+?)?".*?>(.*?)<
It will work even if you have URLs like:
JOBMAXNODECOUNT
that do not have #xxxx.
A few examples:
For JOBMAXNODECOUNT you will get:
Group 1: a.fparameters.html
Group 2: #jobmaxnodecount
Group 3: JOBMAXNODECOUNT
For mjobctl -m to modify the job after it has been submitted. See the RSVSEARCHALGO you will get only one match:
Group 1: a.fparameters.html
Group 2: #rsvsearchalgo
Group 3: RSVSEARCHALGO

Try this regex: (updated)
href="([^"]*?)myFile\.html#?([^"]*).*?>(.*?)<\/a>
Explained demo here: http://regex101.com/r/lA6vB7

First, never do this: (.)* ...or this: (?:.)*
The first one consumes one character at a time and captures it in a group, each time overwriting the previously captured character. The second one avoids most of that overhead by using a non-capturing group, but it's still only matching one character at a time inside that group; why bother? All it's doing is cluttering up the regex.
Adding the ? to make it non-greedy -- e.g. (.)*? -- doesn't make it worse, but it doesn't help, either. And sticking that inside another group and making the group optional -- i.e. ((.)*?)? -- is a recipe for catastrophic backtracking. But performance considerations aside, when I see a capturing group with a quantifier attached, it almost always turns out to be a mistake on the author's part.
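To make the overwriting behaviour concrete, here is a small Python sketch (Python's re module is an assumption here; the question itself uses Dreamweaver, whose engine may differ in details). The helper name last_capture is made up for illustration.

```python
import re
from typing import Optional

def last_capture(pattern: str, text: str) -> Optional[str]:
    """Return what group 1 holds after matching pattern at the start of text."""
    m = re.match(pattern, text)
    return m.group(1) if m else None

# A quantified group keeps only its final iteration.
print(last_capture(r"(.)*", "abc"))   # c
# Quantifying inside the group captures the whole run.
print(last_capture(r"(.*)", "abc"))   # abc
```

So (.)* matches the same text as (.*), but the group ends up holding only the last character, which is rarely what the author wanted.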
As for your question, my solution turns out to be almost identical to Oscar's:
([^<>]*)
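The idea the answers share is to replace .*? with negated character classes so a match cannot run past a closing quote or tag. A minimal Python sketch of that idea follows; the exact pattern and the name find_links are my own assumptions for illustration, not the regex from any of the answers, and Dreamweaver's engine may behave differently.

```python
import re
from typing import List, Tuple

# [^"]* cannot cross the closing quote of the href, [^>]* cannot leave the
# opening tag, and [^<]* cannot run into the next tag.
LINK = re.compile(r'<a href="([^"]*?)myFile\.html(#[^"]*)?"[^>]*>([^<]*)</a>')

def find_links(html: str) -> List[Tuple[str, str, str]]:
    """Return (relative path, anchor or '', link text) for each link to myFile.html."""
    return [(m.group(1), m.group(2) or "", m.group(3)) for m in LINK.finditer(html)]

sample = '<a href="other.html">Other</a> text <a href="../docs/myFile.html#top">Top</a>'
print(find_links(sample))  # [('../docs/', '#top', 'Top')]
```

The link to other.html is skipped because [^"]* has to stop at its closing quote before myFile.html can be found. Link text that contains nested tags would not be matched by this sketch.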

Related

What is the correct regex pattern to use to clean up Google links in Vim?

As you know, Google links can be pretty unwieldy:
https://www.google.com/search?q=some+search+here&source=hp&newwindow=1&ei=A_23ssOllsUx&oq=some+se....
I have MANY Google links saved that I would like to clean up to make them look like so:
https://www.google.com/search?q=some+search+here
The only issue is that I cannot figure out the correct regex pattern for Vim to do this.
I figure it must be something like this:
:%s/&source=[^&].*//
:%s/&source=[^&].*[^&]//
:%s/&source=.*[^&]//
But none of these are working; they start at &source, and replace until the end of the line.
Also, the search?q=some+search+here can appear anywhere after the .com/, so I cannot rely on it being in the same place every time.
So, what is the correct Vim regex pattern to use in order to clean up these links?
Your example can easily be dealt with by using a very simple pattern:
:%s/&.*
because you want to keep everything that comes before the second parameter, which is marked by the first & in the string.
But, if the q parameter can be anywhere in the query string, as in:
https://www.google.com/search?source=hp&newwindow=1&q=some+search+here&ei=A_23ssOllsUx&oq=some+se....
then no amount of capturing or whatnot will be enough to cover every possible case with a single pattern, let alone a readable one. At this point, scripting is really the only reasonable approach, preferably with a language that understands URLs.
--- EDIT ---
Hmm, scratch that. The following seems to work across the board:
:%s#^\(https://www.google.com/search?\)\(.*\)\(q=.\{-}\)&.*#\1\3
We use # as separator because of the many / in a typical URL.
We capture a first group, up to and including the ? that marks the beginning of the query string.
We capture whatever comes between the ? and q= in a second group, which the replacement does not use. Because that group is greedy, it stops at the last q= that is followed by an &, which can be the tail of another parameter such as oq= if more parameters follow it.
We capture a third group, the q parameter, up to and excluding the next &.
We replace the whole thing with the first capture group followed by the third capture group.
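For comparison, the scripting approach mentioned earlier (a language that understands URLs) could look roughly like this in Python. This is only a sketch using the standard urllib.parse module; the function name keep_only_q is hypothetical, and it assumes you only want the first q parameter.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

def keep_only_q(url: str) -> str:
    """Drop every query parameter except the first q, wherever it appears."""
    parts = urlsplit(url)
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    q_only = [(key, value) for key, value in pairs if key == "q"][:1]
    # Rebuild the URL with the reduced query string and no fragment.
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(q_only), ""))

url = "https://www.google.com/search?source=hp&newwindow=1&q=some+search+here&ei=A_23ssOllsUx"
print(keep_only_q(url))  # https://www.google.com/search?q=some+search+here
```

Because the query is parsed into key/value pairs, a parameter like oq= is never mistaken for q=. Note that the value is decoded and re-encoded, so an unusual escape such as %20 would come back as +.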

Google Analytics Content Grouping by Extraction - extract 3rd level subdirectories

I've been going round in circles with this one.
I'd like to perform a content grouping in google analytics that groups by a 3rd level subdirectory.
I can grab the second level successfully with the following regex
`/destinations/(.*?)/`
where the url is
mydomain.com/destinations/europe
mydomain.com/destinations/alaska
I get content groups of europe and alaska.
However, I also then want a grouping of the next level, for example
mydomain.com/destinations/europe/southampton
mydomain.com/destinations/europe/portugal
mydomain.com/destinations/alaska/somealaskanplace
to give me groupings of southampton, portugal and somealaskanplace
This means I need to effectively ignore whatever's in the second level, and this is what I'm struggling with.
So far I have
`/destinations\/.*\/(.*?)/$`
but that's given me the domain name as a grouping.
Can anyone help? It would be very much appreciated.
You need to have the Multiline flag On
Check this:
/.*?\/(destinations)\/(\w+)\/(\w+)/gm
Demo on Regex101:
https://regex101.com/r/2wvRIx/2
I don't think you need the / delimiters. GA may be interpreting your last /$ as being a slash and then end-of-string. Try making it just /destinations/.*/(.*?)$ (note that GA regex does not require you to escape slashes).
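One way to check the "ignore the second level" idea outside GA is a quick Python test. This is a sketch only: the pattern below swaps .* for [^/]+ so each level stays within its own slashes, and whether GA's regex engine treats it the same way as Python is an assumption. The name third_level is made up.

```python
import re
from typing import Optional

# [^/]+ matches exactly one path segment, so the second level is skipped
# without letting the match spill into other parts of the URL.
THIRD_LEVEL = re.compile(r"/destinations/[^/]+/([^/?#]+)")

def third_level(url: str) -> Optional[str]:
    """Return the segment after /destinations/<second level>/, if present."""
    m = THIRD_LEVEL.search(url)
    return m.group(1) if m else None

print(third_level("mydomain.com/destinations/europe/southampton"))  # southampton
print(third_level("mydomain.com/destinations/alaska"))              # None
```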

regex to find domain without those instances being part of subdomain.domain

I'm new to regex. I need to find instances of example.com in an .SQL file in Notepad++ without those instances being part of subdomain.example.com.
From this answer, I've tried using ^((?!subdomain))\.example\.com$, but this does not work.
I tested this in Notepad++ and at https://regex101.com/r/kS1nQ4/1 but it doesn't work.
Help appreciated.
Simple
^example\.com$
with g,m,i switches will work for you.
https://regex101.com/r/sJ5fE9/1
If the matching should be done somewhere in the middle of the string you can use negative look behind to check that there is no dot before:
(?<!\.)example\.com
https://regex101.com/r/sJ5fE9/2
Without access to example text, it's a bit hard to guess what you really need, but the regular expression
(^|\s)example\.com\>
will find example.com where it is preceded by nothing or by whitespace, and followed by a word boundary. (You could still get a false match on example.com.pk because the period is a word boundary. Provide better examples in your question if you want better answers.)
If you specifically want to use a lookaround, the negative lookahead you used (as the name implies) specifies what the regex should not match at this point. So (?!subdomain\.)example trivially matches always: the text at that position is example, which can never be subdomain., so the negative lookahead always succeeds.
You might be better served by a lookbehind:
(?<!subdomain\.)example\.com
Demo: https://regex101.com/r/kS1nQ4/3
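The difference between the lookahead and the lookbehind can be checked with a few lines of Python (used here only for illustration; Notepad++ uses a different engine, and the sample text is invented):

```python
import re

def count_matches(pattern: str, text: str) -> int:
    """Count non-overlapping matches of pattern in text."""
    return len(re.findall(pattern, text))

text = "example.com subdomain.example.com other.example.com"

# The lookahead is tested where 'example' starts, so it never blocks anything.
print(count_matches(r"(?!subdomain\.)example\.com", text))   # 3
# The lookbehind inspects the text before the match.
print(count_matches(r"(?<!subdomain\.)example\.com", text))  # 2
# Rejecting any preceding dot excludes every subdomain.
print(count_matches(r"(?<!\.)example\.com", text))           # 1
```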
Here's a solution that takes into account the protocols/prefixes,
/^(www\.)?(http:\/\/www\.)?(https:\/\/www\.)?example\.com$/

Regular expression with negative look aheads

I am trying to construct a regular expression to remove links from content unless it meets one of two conditions.
<a.*?href=[""'](http[s]?:\/\/(.*?)\.link\.com)?\/(?!m\/).*?<\/a>
This will match any link to link.com that does not have m/ at the end of the domain section. I want to change this slightly so it doesn't match URLs that are links to PDF files, regardless of having the m/ in the URL. I came up with:
<a.*?href=["'](http[s]?:\/\/(.*?)\.brodies\.com)?\/(?!m\/).*?\.(?!pdf)["'].*?<\/a>
Which is oh so very close, except now it will only match if the URL has a "." at the end - I can see why it's doing it. I can't seem to make the "." optional, as this causes the non-greedy pattern prior to the "." to keep going until it hits the ["']
Any help would be good to help solve this.
Thanks
Paul
You probably want to use (?<!\.pdf)["'] instead of \.(?!pdf)["'].
But note that this expression has several issues, best way to solve them is to use a proper HTML parser.
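As a rough illustration of the lookbehind placed just before the closing quote, here is a stripped-down Python sketch. It only looks at the href value, not the whole link, and the name non_pdf_hrefs and the sample HTML are assumptions for the example.

```python
import re
from typing import List

# The lookbehind runs right before the closing quote and rejects any URL
# whose last four characters are ".pdf" (case-sensitive in this sketch).
NOT_PDF_HREF = re.compile(r"""href=["'](?P<url>[^"']*)(?<!\.pdf)["']""")

def non_pdf_hrefs(html: str) -> List[str]:
    """Return href values that do not end in .pdf."""
    return [m.group("url") for m in NOT_PDF_HREF.finditer(html)]

html = '<a href="/docs/a.pdf">A</a> <a href="/news/item">B</a>'
print(non_pdf_hrefs(html))  # ['/news/item']
```

Because [^"']* cannot cross a quote, the engine has no way to backtrack around the lookbehind, so a .pdf link simply fails to match.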
First, see "RegEx match open tags except XHTML self-contained tags".
That said (since it probably will not deter), here is a slightly-better-constrained version of what you're trying to do, with the caveat that this is still not good enough!
<a[^>]+?href\s*=\s*["'](https?:\/\/[^"']*?\.link\.com)?\/(?!m\/)[^"']*?\.(?!pdf)[^"']*?["'][^>]*?>.*?<\/a>
You can see a running example of this regex at: http://rubular.com/r/obkKrKpB8B.
Your problem was actually just that you were looking for a quote character immediately after the dot, here: \.(?!pdf)["'].

Regex to extract part of a url

I'm being lazy tonight and don't want to figure this one out. I need a regex to match 'jeremy.miller' and 'scottgu' from the following inputs:
http://codebetter.com/blogs/jeremy.miller/archive/2009/08/26/talking-about-storyteller-and-executable-requirements-on-elegant-code.aspx
http://weblogs.asp.net/scottgu/archive/2009/08/25/clean-web-config-files-vs-2010-and-net-4-0-series.aspx
Ideas?
Edit
Chris Lutz did a great job of meeting the requirements above. What if these were the inputs so you couldn't use 'archive' in the regex?
http://codebetter.com/blogs/jeremy.miller/
http://weblogs.asp.net/scottgu/
Would this be what you're looking for?
'/([^/]+)/archive/'
Captures the piece before "archive" in both cases. Depending on regex flavor you'll need to escape the /s for it to work. As an alternative, if you don't want to match the archive part, you could use a lookahead, but I don't like lookaheads, and it's easier to match a lot and just capture the parts you need (in my opinion), so if you prefer to use a lookahead to verify that the next part is archive, you can write one yourself.
EDIT: As you update your question, my idea of what you want is becoming fuzzier. If you want a new regex to match the second cases, you can just pluck the appropriate part off the end, with the same / conditions as before:
'/([^/]+)/$'
If you specifically want either the text jeremy.miller or scottgu, regardless of where they occur in a URL, but only as "words" in the URL (i.e. not scottgu2), try this, once again with the / caveat:
'/(jeremy\.miller|scottgu)/'
As yet a third alternative, if you want the field after the domain name, unless that field is "blogs", it's going to get hairy, especially with the / caveat:
'http://[^/]+/(?:blogs/)?([^/]+)/'
This will match the domain name, an optional blogs field, and then the desired field. The (?:) syntax is a non-capturing group, which means it's just like regular parentheses, but won't capture the value, so the only value captured is the value you want. (?:) has a risk of varying depending on your particular regex flavor. I don't know what language you're asking for, but I predominantly use Perl, so this regex should pretty much do it if you're using PCRE. If you're using something different, look into non-capturing groups.
Wow. That's a lot of talking about regexes. I need to shut up and post already.
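For what it's worth, the last pattern can be tried against all four inputs from the question in Python, where / needs no escaping. The language is an assumption, since the question doesn't name one, and blog_owner is a made-up name.

```python
import re
from typing import Optional

# Domain, an optional non-captured "blogs/" field, then the captured field.
USER_FIELD = re.compile(r"http://[^/]+/(?:blogs/)?([^/]+)/")

def blog_owner(url: str) -> Optional[str]:
    """Return the first path field after the domain, skipping a leading 'blogs'."""
    m = USER_FIELD.match(url)
    return m.group(1) if m else None

print(blog_owner("http://codebetter.com/blogs/jeremy.miller/"))  # jeremy.miller
print(blog_owner("http://weblogs.asp.net/scottgu/"))             # scottgu
```

The same function also returns jeremy.miller and scottgu for the longer /archive/ URLs, since the pattern only looks at the start of the path.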
Try this one:
/\/([\w\.]+)\/archive/