Regular expression adjustment not working as expected - regex

I have the following regex https://regex101.com/r/arBFtI/2.
It is a regex for searching and replacing on a webpage. It searches and replaces the results by appending a highlight div so the words show up accordingly for the user.
In order to make sure the HTML itself is not changed (the page must not break), it recognizes HTML tags and attributes so it doesn't show a result inside them.
But I have one more issue: the regexp is now strict and only shows a result if it is preceded by a space.
When searching for "export" it will show a result in the sentence above but not in the query below on db0383_bpost.export_201506.
In order to match on all "export" occurrences I can adjust the regex to be (?<![&((])export(?![^<>]*(([\/\"']|]]|\b)>)) but then the following problem arises.. HTML entities!
If you search for "b", for example using (?<![&((])b(?![^<>]*(([\/\"']|]]|\b)>)), it will also match the b inside an HTML entity.
So I like the "strict" regexp (?<![&((\S+])export(?![^<>]*(([\/\"']|]]|\b)>)), or (?<![&((\S+])b(?![^<>]*(([\/\"']|]]|\b)>)) when searching for b, but the only thing I need is for it to ignore HTML entities as well. So if I search for "b" it should match every b except those inside HTML entities and those inside HTML tags.
It looks like a slight adjustment to the (\S+]) part of the original regex, but I can't figure it out. Can you? Please help me; I would greatly appreciate it.
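One possible way to get this behaviour, sketched in Python rather than in the flavor used on regex101: instead of tightening the lookbehind, match whole tags and whole entities first in an alternation, and only wrap the branch that captured the search term. The function name highlight and the span wrapper with class "hl" are assumptions for illustration, not part of the original setup, and the tag branch assumes attribute values never contain a literal ">".

```python
import re


def highlight(html: str, term: str) -> str:
    """Wrap occurrences of `term` in text, skipping tags and entities."""
    # Branch 1: a whole HTML tag.
    # Branch 2: a whole entity such as &nbsp; or &#160;.
    # Branch 3 (captured): the search term itself.
    pattern = re.compile(r"<[^>]*>|&#?\w+;|(" + re.escape(term) + r")")

    def wrap(match: re.Match) -> str:
        if match.group(1) is None:
            # A tag or an entity was matched: return it unchanged.
            return match.group(0)
        return '<span class="hl">' + match.group(1) + "</span>"

    return pattern.sub(wrap, html)


sample = '<b class="big">db0383_bpost.export_201506</b>&nbsp;b'
print(highlight(sample, "b"))
print(highlight(sample, "export"))
```

Because tags and entities are consumed by the earlier branches, the term can match anywhere else, including right after a dot as in db0383_bpost.export_201506.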

Related

How to use a regular expression in notepad++ to change a url

I need some help with our migrated site's URLs. We moved our site from Joomla to WordPress, and in our posts we have over 20K internal links.
These links are structured like this:
www.mysite.nl/current-post-title/index.php?option=com_content&view=article&id=5259:related-post-title&catid=35:universum&Itemid=48
What we need is this:
www.mysite.nl/related-post-title
So basically we need to remove everything after www.mysite.nl/ up to the colon :, i.e. remove this: current-post-title/index.php?option=com_content&view=article&id=5259: (the colon itself must be removed too)
And then remove everything from the first remaining ampersand (including the ampersand itself) to the end of the string, i.e. remove &catid=35:universum&Itemid=48
Of course, only URL strings containing index.php?option=com_content must be changed.
I have dumped the table in plain text and opened it in Notepad++ to do a search and replace with regular expression because the content that must be removed from these lines is different every time.
Can someone please help me with the right regular expression?
In the Find what box, enter:
(www\.mysite\.nl)\/.*index\.php\?option=com[^:]+:([^&]+)&.*
In the Replace with box, enter:
\1/\2
Result
www.mysite.nl/related-post-title
Go inside-out rather than outside-in: replace \/.+&id=\d+\:(.+?)&.+ with /$1. Also, paste a few into http://www.regexr.com/ and play around, although JavaScript and Notepad++ differ in some of the regex features they implement, e.g. negative lookbehinds.
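For checking the rewrite outside Notepad++, here is a minimal Python sketch of the same two-group idea. The character classes that stop at whitespace, quotes and angle brackets are my own assumption (so a link inside an href attribute is not over-matched), and the name rewrite_links is hypothetical.

```python
import re

# Group 1: the host. Group 2: the slug between the colon and the next "&".
JOOMLA_LINK = re.compile(
    r"(www\.mysite\.nl)/[^\s\"'<>]*?index\.php\?option=com_content"
    r"[^:\s\"'<>]*:([^&\s\"'<>]+)&[^\s\"'<>\\]*"
)


def rewrite_links(text: str) -> str:
    """Replace old Joomla article links with www.mysite.nl/<slug>."""
    return JOOMLA_LINK.sub(r"\1/\2", text)


old = ("www.mysite.nl/current-post-title/index.php?option=com_content"
       "&view=article&id=5259:related-post-title&catid=35:universum&Itemid=48")
print(rewrite_links(old))  # www.mysite.nl/related-post-title
```

Links that do not contain index.php?option=com_content are left untouched.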

How to match plain text URL in a markdown?

I'm currently trying to match all plain text links in a markdown text.
Example of the markdown text:
Dude, look at this url http://www.google.com .. it's a great search engine
I would like it to be converted into
Dude, look at this url <http://www.google.com> .. it's a great search engine
So in short, a plain url should become <url>, but an existing <url> shouldn't become <<url>>. Also, a link in the markdown can be in the form (url), so we'll have to avoid matching the normal brackets too.
So my working regex for matching the plain text url in Java is:
"[^(\\<|\\(](https?|ftp|file)://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|][^(\\>|\\)]",
with [^(\\<|\\(] and [^(\\>|\\)] to avoid matching the wrapping brackets.
But here lies one problem: I also do not want to match this kind of url:
[1]: http://slashdot.org
So, if the markdown text is
Dude, look at this url http://www.google.com .. it's a great search engine
[1]: http://slashdot.org
I want only http://www.google.com to be matched, but not the http://slashdot.org.
What pattern would meet these criteria?
What you have here is a parsing problem. Regexes are fine, but using only regexes here will make it a mess (supposing you achieve it). After you fix this problem, you'll probably find yourself facing other ones, like URLs in code (between ` or in lines starting with tabs or four spaces) that you don't want to replace.
A solution would be to split into lines and then:
detect patterns (for example ^\[\d+\]:\s+)
apply your replacements (for example this URL to link change) only on lines that don't match an incompatible pattern
That's the logic I use in this small pseudo-markdown parser that you can test here.
Note that there's always the option of using an existing, proven markdown parser; there are many of them.
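The line-by-line approach described above can be sketched in Python as follows. This is only an illustration, not the parser mentioned in the answer: the names are hypothetical, the lookbehind for "<" and "(" is my own choice, and code spans or indented code blocks are not handled.

```python
import re

# Lines that define a reference link, e.g. "[1]: http://slashdot.org".
REFERENCE_DEF = re.compile(r"^\[\d+\]:\s+")

# A URL that is not directly preceded by "<" or "(".
PLAIN_URL = re.compile(
    r"(?<![<(])\b((?:https?|ftp|file)://"
    r"[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|])"
)


def wrap_plain_urls(markdown: str) -> str:
    """Wrap bare URLs in <...>, leaving reference-definition lines alone."""
    out = []
    for line in markdown.split("\n"):
        if REFERENCE_DEF.match(line):
            # Incompatible pattern: keep the line exactly as it is.
            out.append(line)
        else:
            out.append(PLAIN_URL.sub(r"<\1>", line))
    return "\n".join(out)


text = ("Dude, look at this url http://www.google.com .. it's a great search engine\n"
        "[1]: http://slashdot.org")
print(wrap_plain_urls(text))
```

Further "incompatible" line patterns (for example lines starting with a tab or four spaces) could be added next to REFERENCE_DEF in the same way.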

Regex select XML Element (containing hyphen) and inside content

I'm working with an enterprise CMS and in order to properly create our weekly-updated dropdown menu without republishing our entire site, I have an XML document being created which has a number of useful XML elements. However, when pulling in a link with the CMS, the generated XML also outputs the link's contents (the entire HTML for the page). Needless to say, with roughly 50 items, the XML file is too big for use on the web (as it stands I think it's over 600KB). The element is <page-content>filler here</page-content>.
What I'm trying to do is use TextWrangler to find and replace all <page-content> tags as well as their containing content.
I've tried a few different regexes, but I can't seem to match the closing tag, so the match just trails on.
Here's what I've tried:
(<page-content>)(.*?)
The above will match up until the next starting <page-content> tag, which is not what I want.
(<page-content>)(.*?)(<\/page-content>)
(<page-content>)(.*?)(<\/page\-content>)
The above finds no matches, even though the below will find the 7 matches it should.
(<content>)(.*?)(<\/content>)
I don't know if there's a special way to deal with hyphens (I'm inexperienced in regular expressions), but if anyone could help me out, it would be greatly appreciated.
Thanks!
EDIT: Before you tell me that regex isn't meant to parse HTML, I know that, but there seems to be no other way for me to easily find and replace this. There are too many occurrences to manually delete them and save the file again every week.
It seems the problem is that your . is not matching newlines that exist between your open and close tags.
An easy solution for this would be to add the s flag in order for your . to match over newlines. TextWrangler appears to support inline modifiers (?s). You could do it like this:
(<page-content>)(?s)(.*?)(<\/page-content>)
More information on modifiers here.
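If the weekly cleanup is ever scripted instead of done in TextWrangler, the same idea looks like this in Python, where the re.DOTALL flag plays the role of (?s). The function name and the sample item and title elements are made up for illustration.

```python
import re

# re.DOTALL makes "." match newlines too, like the inline (?s) modifier.
PAGE_CONTENT = re.compile(r"<page-content>.*?</page-content>", re.DOTALL)


def strip_page_content(xml_text: str) -> str:
    """Remove every <page-content> element together with its content."""
    return PAGE_CONTENT.sub("", xml_text)


sample = """<item>
  <title>Menu entry</title>
  <page-content><html>
    <body>filler here</body>
  </html></page-content>
</item>"""
print(strip_page_content(sample))
```

The lazy .*? stops at the first closing tag, so several <page-content> elements in one file are removed one by one rather than as a single span.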

How to Easily Remove Unwanted Parts in HTML Table Cells Using Notepad++

I have a series of different occurrences of table cells in some HTML files, as shown in this image:
http://screencast.com/t/MqGHN2iwfd
Apart from the beginning and end of each cell, they have the following parts in common:
.net/?mobile=true
/spotlightProfile.htm?f=mkt&v=
/#stats
I want to either be able to remove all the parts that look like that at once
OR be able to remove one-by-one in Notepad++:
the url part that precedes .net/?mobile=true
the url parts before and after /spotlightProfile.htm?f=mkt&v= and
the url part before /#stats
Furthermore, I also want to be able to remove the duplicate occurrences in Notepad++.
Thanks a lot in anticipation for helping out.
Regex would look something like this.
Search for: (.*?)(\/\?mobile=true|\/spotlightProfile\.htm\?f=mkt&v=|\/#stats)(.*)
Replace With: \1\3
Basically we create 3 groups:
before the expression you match,
the expression you are trying to remove
the rest of the line
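As a quick way to try the alternation outside Notepad++, here is a minimal Python sketch that deletes the three fragments from the question wherever they occur. It uses a plain alternation replaced by an empty string rather than the three-group form, and the sample strings are invented.

```python
import re

# The three fragments listed in the question, with regex metacharacters escaped.
FRAGMENTS = re.compile(
    r"/\?mobile=true|/spotlightProfile\.htm\?f=mkt&v=|/#stats"
)


def remove_fragments(line: str) -> str:
    """Delete every occurrence of the known fragments from a line."""
    return FRAGMENTS.sub("", line)


print(remove_fragments("example.net/?mobile=true"))
print(remove_fragments("example.net/#stats"))
```

Unlike a whole-line match, this removes every occurrence on a line in one pass.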

Reg Ex for hyperlinks in comments

I am trying to find a solution to extract a hyperlink out of every comment which begins with %. My first idea was to use a regular hyperlink regex:
^(http|https|ftp)\://[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(:[a-zA-Z0-9]*)?/?([a-zA-Z0-9\-\._\?\,\'/\\\+&%\$#\=~])*[^\.\,\)\(\s]$
and some kind of pattern like:
%.*
so I added them both to:
^%.*(http|https|ftp)\://[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(:[a-zA-Z0-9]*)?/?([a-zA-Z0-9\-\._\?\,\'/\\\+&%\$#\=~])*[^\.\,\)\(\s]$
But with this pattern I match everything, including the % character and multiple spaces. How can I get only the hyperlink inside the comment?
EDIT1:
Here is an example what to parse:
% http://www.test.com
It is a regular MATLAB comment and I want to highlight it like a hyperlink to get a more intuitive editor. I am working with Qt 4.7.1 / C++
Thanks for all the answers!
I guess it depends a little on the language that is executing your regex, but you could try putting the URL part in parentheses:
%.*((http|https|ftp)\://[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(:[a-zA-Z0-9]*)?/?([a-zA-Z0-9\-\._\?\,\'/\\\+&%\$#\=~])*[^\.\,\)\(\s])
That way you can access it as a group (usually an expression such as $1).
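To illustrate reading the URL back out of the capturing group, here is a small Python sketch (the question itself uses Qt/C++, so this only demonstrates the pattern). The regex is a lightly cleaned-up version of the one above; using a lazy .*? so that the first URL in the comment wins, and the name url_in_comment, are my own assumptions.

```python
import re
from typing import Optional

# Group 1 is the URL; the inner groups are non-capturing to keep numbering simple.
COMMENT_URL = re.compile(
    r"%.*?((?:http|https|ftp)://[a-zA-Z0-9\-.]+\.[a-zA-Z]{2,3}"
    r"(?::[a-zA-Z0-9]*)?/?(?:[a-zA-Z0-9\-._?,'/\\+&%$#=~])*[^.,)(\s])"
)


def url_in_comment(line: str) -> Optional[str]:
    """Return the first URL found after a % comment marker, or None."""
    match = COMMENT_URL.search(line)
    return match.group(1) if match else None


print(url_in_comment("% http://www.test.com"))  # http://www.test.com
print(url_in_comment("x = 1;"))                 # None
```

Only group 1 is returned, so the % character and the spaces before the URL are not part of the result.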