How to Easily Remove Unwanted Parts in HTML Table Cells Using Notepad++ - regex

I have a series of different table cells in some HTML files, as shown in this image:
http://screencast.com/t/MqGHN2iwfd
Apart from the beginning and end of each cell, they have the following parts in common:
.net/?mobile=true
/spotlightProfile.htm?f=mkt&v=
/#stats
I want to be able to either remove all the parts that look like that at once
OR remove them one by one in Notepad++:
the URL part that precedes .net/?mobile=true
the URL parts before and after /spotlightProfile.htm?f=mkt&v= and
the URL part before /#stats
Furthermore, I also want to be able to remove the duplicate occurrences, also in Notepad++.
Thanks a lot in advance for helping out.

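The regex would look something like this.
Search for: (.*?)(\/\?mobile=true|\/spotlightProfile\.htm\?f=mkt&v=|\/#stats)(.*)
Replace with: \1\3
Basically we create three groups:
the text before the expression you match (lazy, so it stops at the first occurrence),
the expression you are trying to remove,
the rest of the line.
With ". matches newline" unchecked, this removes one occurrence per line on each pass, so repeat Replace All until nothing more is found.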
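The alternation in the middle group can also be tried outside Notepad++. Below is a minimal Python sketch, assuming the three fragments should simply be deleted wherever they occur. The sample cell and the name strip_fragments are made up for illustration and do not come from the original files.

```python
import re

# The three fragments from the question, escaped so that "." and "?" are literal.
FRAGMENTS = re.compile(
    r"/\?mobile=true|/spotlightProfile\.htm\?f=mkt&v=|/#stats"
)

def strip_fragments(line: str) -> str:
    """Delete every occurrence of the fragments from one line."""
    return FRAGMENTS.sub("", line)

# Hypothetical table cells, not taken from the original files.
sample = "<td>http://example.net/?mobile=true</td><td>http://example.net/#stats</td>"
print(strip_fragments(sample))
```

Because sub() replaces every match in one call, no repeated passes are needed here, unlike the per-line Search/Replace form.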

Related

RegEx to delete all XML data outside of specified tags

I am using the latest and greatest version of NotePad++. Is it possible for a RegEx to delete all text and tags I don't need and only leave behind text and tags I need? The tags I need to remain look like this:
<warning>I need this text to remain intact together with accompanying tags.</warning>
There must be around 500 of these WARNING tag pairs nested within a variety of XML levels. I would like the RegEx to delete all data that exists outside of these WARNING tags but not the opening and closing warning tags themselves or the text within the tags. Below are four different RegEx variations I tested out and they all eliminate the text located within the warning tags after performing a Find&Replace operation therefore they are no help:
<warning>[^<>]+</warning>
<warning>[^>]+</warning>
<warning>(.+?)</warning>
<warning>.*?</warning>
I would tremendously appreciate any help that will assist me in developing a RegEx that will perform the data clean up task I need to perform.
The Notepad++ regex find and replace below seems to work for me. Remember to select Regular expression as the search mode.
Search for each regex below in turn and replace it with nothing. It takes two steps, so it is not perfect yet:
The 1st replace empties every line that does not start with <warning> (ignoring leading whitespace).
The 2nd replace removes the empty lines and leading whitespace, leaving only the lines with warnings.
^(?!\s*?<warning>).*?$
^\s*
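A different route, sketched here in Python rather than Notepad++, is to collect the warning elements instead of deleting everything around them. This is only a sketch: it assumes warning elements are not nested inside each other and carry no attributes, and the sample XML and the name keep_warnings are invented for the example.

```python
import re

# Non-greedy match of one <warning>...</warning> pair; DOTALL lets the
# text inside the pair span several lines.
WARNING = re.compile(r"<warning>.*?</warning>", re.DOTALL)

def keep_warnings(xml_text: str) -> str:
    """Return only the warning elements, one after another, newline-separated."""
    return "\n".join(WARNING.findall(xml_text))

# Invented sample with warnings at different nesting levels.
sample = """<root>
  <item>drop me</item>
  <group><warning>Keep this.</warning></group>
  <warning>Keep this
too.</warning>
</root>"""
print(keep_warnings(sample))
```

Unlike the two-step line-based replace, this does not depend on each warning sitting on its own line.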

Regular expression adjustment not working as expected

I have the following regex https://regex101.com/r/arBFtI/2.
It is a regex for searching and replacing on a webpage. It searches and replaces the results by appending a highlight div so the words show up accordingly for the user.
In order to make sure the HTML itself is not changed (page can not break..) it recognizes HTML tags/attributes so it doesn't show a result in that.
But I have one more issue: the regexp is now strict and only shows a result if the match is preceded by a space.
When searching for "export" it will show a result in the sentence above but not in the query below on db0383_bpost.export_201506.
In order to match on all "export" occurrences I can adjust the regex to be (?<![&((])export(?![^<>]*(([\/\"']|]]|\b)>)) but then the following problem arises.. HTML entities!
If you search on "b" for example using (?<![&((])b(?![^<>]*(([\/\"']|]]|\b)>)) it will also match the b inside an HTML entity such as &nbsp;.
So I like the "strict" regexp (?<![&((\S+])export(?![^<>]*(([\/\"']|]]|\b)>)) or (?<![&((\S+])b(?![^<>]*(([\/\"']|]]|\b)>)) when searching for b but the only thing I need is for it to ignore HTML entities as well. So if I search for "b" it should match all the b's except in HTML entities and b's not between HTML tags.
It looks like a slight adjustment to the original regex in the (\S+]) part but I can't figure it out. Can you? Please help me I greatly appreciate it.

Regex capture words inside tags

Given an XML document, I'd like to be able to pick out individual key/value pairs from a particular tag:
<aaa>key0:val0 key1:val1 key2:val2</aaa>
I'd like to get back
key0:val0
key1:val1
key2:val2
So far I have
(?<=<aaa>).*(?=<\/aaa>)
Which will match everything inside, but as one result.
I also have
[^\s][\w]*:[\w]*[^\s] which will also match correctly in groups on this:
key0:val0 key1:val1 key2:val2
But not with the tags. I believe this is an issue with searching for subgroups and I'm not sure how to get around it.
Thanks!
You cannot combine the two expressions in the way you want, because you have to match each occurrence of "key:value".
So in what you came up with - (?<=<abc>)([\w]*:[\w]*[\s]*)+(?=<\/abc>) - there are two matches to speak of. The bigger one, the overall match, covers everything inside the tags, while the capturing group matches a single "key:value" occurrence. Because the group is repeated, the regex engine does not keep each individual occurrence; it only gives you the last one.
In Python terms, on the match object obtained after applying your regex, you have access to matcher.group(0) (the whole match) and matcher.group(1) (the last "key:value"), because the regex contains a single ( ) capturing group.
But what you want is the n occurrences of "key:value". So it's easier to just run the simpler \w+:\w+ regex on the string inside the tags.
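The two-step idea can be sketched in Python as follows. The function name key_value_pairs and its tag parameter are hypothetical helpers for this example, and the sketch assumes the tag has no attributes.

```python
import re

def key_value_pairs(text: str, tag: str = "aaa") -> list:
    """Return every key:value pair found inside <tag>...</tag>."""
    name = re.escape(tag)
    pairs = []
    # Step 1: isolate the content of each tag.
    for inner in re.findall(rf"<{name}>(.*?)</{name}>", text, re.DOTALL):
        # Step 2: run the simple pair regex on that content only.
        pairs.extend(re.findall(r"\w+:\w+", inner))
    return pairs

print(key_value_pairs("<aaa>key0:val0 key1:val1 key2:val2</aaa>"))
```

findall() returns all non-overlapping matches, which is what a single repeated capturing group cannot do.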
I uploaded this one at parsemarket, and I'm not sure it's what you are looking for, but maybe something like this:
(<aaa>)((\w+:\w+\s)*(\w+:\w+)*)(<\/aaa>)
AFAIK, unless you know how many k:v pairs are in the tags, you can't capture all of them in one regex. So, if there are only three, you could do something like this:
<(?:aaa)>(\w+:\w+\s*)+(\w+:\w+\s*)+(\w+:\w+\s*)+<(?:\/aaa)>
But I would think you would want to do some sort of loop with whatever language you are using. Or, as some of the comments suggest, use the parser classes in the language. I've used BeautifulSoup in Python for HTML.
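As a rough illustration of the parser route, here is a sketch that uses the standard library's xml.etree.ElementTree instead of BeautifulSoup, so that it runs without extra packages. The wrapping <root> element, the function name pairs_from_xml, and the choice of a dict as the result are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

def pairs_from_xml(xml_text: str, tag: str = "aaa") -> dict:
    """Parse the document and split each tag's text into key/value pairs."""
    root = ET.fromstring(xml_text)
    result = {}
    # iter() walks the root and all of its descendants with the given tag name.
    for element in root.iter(tag):
        for token in (element.text or "").split():
            key, sep, value = token.partition(":")
            if sep:  # skip tokens that contain no colon
                result[key] = value
    return result

# Invented document wrapping the tag from the question.
doc = "<root><aaa>key0:val0 key1:val1 key2:val2</aaa></root>"
print(pairs_from_xml(doc))
```

A dict drops repeated keys; if duplicates matter, collecting (key, value) tuples in a list would be the safer choice.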

How to use a regular expression in notepad++ to change a url

I need some help with our migrated site's URLs. We moved our site from Joomla to WordPress, and in our posts we have over 20K internal links.
The structure of these links are like these:
www.mysite.nl/current-post-title/index.php?option=com_content&view=article&id=5259:related-post-title&catid=35:universum&Itemid=48
What we need is this:
www.mysite.nl/related-post-title
So basically we need to remove everything after www.mysite.nl/ up to the colon :, i.e. remove this: current-post-title/index.php?option=com_content&view=article&id=5259: (the colon itself must be removed too)
And then remove everything behind the first ampersand (including the ampersand itself) until the end of the string, i.e. remove &catid=35:universum&Itemid=48
Of course only url strings containing this index.php?option=com_content must be changed.
I have dumped the table in plain text and opened it in Notepad++ to do a search and replace with regular expression because the content that must be removed from these lines is different every time.
Can someone please help me with the right regular expression?
In the Find what box, enter:
(www\.mysite\.nl)\/.*index\.php\?option=com[^:]+:([^&]+)&.*
In the Replace with box, enter:
\1/\2
Result:
www.mysite.nl/related-post-title
Go inside-out, rather than outside-in, replace \/.+&id=\d+\:(.+?)&.+ with /$1. Also, paste a few into http://www.regexr.com/ and play around, although JavaScript and Notepad++ might have some differences in implemented Regex features, e.g. negative lookbehinds.
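For anyone who prefers to run the rewrite over the dump with a script, here is a hedged Python sketch of the same replacement. It assumes URLs contain no whitespace or quote characters; the pattern is tightened so that several URLs on one line are handled separately. The name rewrite_links is made up for the example.

```python
import re

# Only URLs containing index.php?option=com_content are touched.
JOOMLA_URL = re.compile(
    r"(www\.mysite\.nl)/"                # group 1: the host
    r"[^\s\"']*?"                        # current-post-title/ (lazy)
    r"index\.php\?option=com_content"    # marker that must be present
    r"[^\s\"':]*:"                       # &view=article&id=5259:
    r"([^&\s\"']+)"                      # group 2: related-post-title
    r"[^\s\"']*"                         # &catid=...&Itemid=... up to the URL end
)

def rewrite_links(text: str) -> str:
    """Collapse old Joomla article links to www.mysite.nl/<slug>."""
    return JOOMLA_URL.sub(r"\1/\2", text)

old = ("www.mysite.nl/current-post-title/index.php?option=com_content"
       "&view=article&id=5259:related-post-title&catid=35:universum&Itemid=48")
print(rewrite_links(old))
```

Running it on a copy of the dump first is advisable, since the assumptions about URL delimiters may not hold for every post.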

Regular Expression to Remove Subdomains from Domain List

I have a list of domains and subdomains stored in a .txt file (I'm using Windows XP).
The format of the domains is this:
somesite1.com
sub1.somesite1.com
sub2.somesite1.com
somesite2.com
sub1.somesite2.com
sub2.somesite2.com
somesite3.com
sub1.somesite3.com
sub2.somesite3.com
I use notepad++, and I need to use regular expressions
Anyway, I don't know what to put in the find & replace boxes so it can go through the contents of the file and leave me with only the root domains. If done properly, it would turn the above example list into this:
somesite1.com
somesite2.com
somesite3.com
Can somebody help me out?
Thank you in advance.
It's an old question, but the answers provided didn't work for me. You need a negative lookahead. The correct regex is:
^\w*\.(?!\w+\s*\n)
You can use:
Find what: [^\r\n]+\.[^.\r\n]+\.[^.\r\n]+[\r\n]+
Replace with: (leave empty)
with Regular expression checked and ". matches newline" NOT checked
I suggest using the Mark tab of the Notepad++ Find dialogue. Enter the regular expression ^\w+\.\w+\.\w+$, make sure that Bookmark line is selected, then click Mark all. Next, use Menu => Search => Bookmark => Remove bookmarked lines. This removes all entries made of three "words" separated by two dots. It will leave all other lines in place.
An alternative is to mark all lines matching the regular expression ^\w+\.\w+$ and use the Remove unmarked lines menu entry. This I do not recommend as it will remove all lines with an unexpected format as well as the lines for subdomains.
Another method would use the Replace tab of the Notepad++ Find dialogue. Enter the regular expression ^\w+\.\w+\.\w+\r\n in the Find what field, and leave the Replace with field empty. The \r\n part of this expression may need some adjustment to account for the line endings set on the file.
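The same filtering can be sketched in Python for lists too large to handle comfortably in an editor. Like the expressions above, this sketch treats any name of the form word.word as a root domain, so hyphenated names and suffixes such as co.uk would need a different pattern. The function name root_domains is made up for the example.

```python
import re

# Same shape as the ^\w+\.\w+$ expression used for marking lines.
ROOT = re.compile(r"\w+\.\w+")

def root_domains(lines):
    """Keep only entries made of two words separated by a single dot."""
    roots = []
    for line in lines:
        name = line.strip()  # drop \r\n or \n line endings
        if ROOT.fullmatch(name):
            roots.append(name)
    return roots

sample = """somesite1.com
sub1.somesite1.com
sub2.somesite1.com
somesite2.com
sub1.somesite2.com
somesite3.com"""
print(root_domains(sample.splitlines()))
```

Stripping each line first sidesteps the \r\n versus \n adjustment mentioned for the Replace tab.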