RegEx to delete all XML data outside of specified tags - regex

I am using the latest version of Notepad++. Is it possible for a RegEx to delete all text and tags I don't need and only leave behind text and tags I need? The tags I need to remain look like this:
<warning>I need this text to remain intact together with accompanying tags.</warning>
There must be around 500 of these WARNING tag pairs nested within a variety of XML levels. I would like the RegEx to delete all data that exists outside of these WARNING tags, but not the opening and closing warning tags themselves or the text within the tags. Below are four different RegEx variations I tested. They all eliminate the text located within the warning tags after a Find & Replace operation, so they are no help:
<warning>[^<>]+</warning>
<warning>[^>]+</warning>
<warning>(.+?)</warning>
<warning>.*?</warning>
I would tremendously appreciate any help that will assist me in developing a RegEx that will perform the data clean up task I need to perform.

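The Notepad++ regex find and replace below works for me. Remember to select "Regular expression" as the search mode.
Search for each of the two regexes below in turn and replace with nothing. It takes two steps, so it is not perfect yet, and it assumes each warning element sits on its own line.
1st replace: removes every line that does not start with <warning> (after optional leading whitespace)
2nd replace: removes the empty lines and leading whitespace, leaving only the lines with warnings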
^(?!\s*?<warning>).*?$
^\s*
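Outside Notepad++, the same clean-up can be sketched in Python by extracting the warning elements instead of deleting everything around them. This is a minimal sketch, not the answer's method: the function name keep_warnings and the sample data are made up for illustration, and it assumes <warning> elements carry no attributes and are not nested inside each other.

```python
import re

def keep_warnings(xml_text: str) -> str:
    """Return only the <warning>...</warning> elements, one per line."""
    # re.DOTALL lets '.' cross line breaks, so a warning that spans
    # several lines is still captured whole.
    pattern = re.compile(r"<warning>.*?</warning>", re.DOTALL)
    return "\n".join(pattern.findall(xml_text))

# Hypothetical sample input with warnings at different nesting levels.
sample = """<root>
  <section id="a">
    <warning>Keep me.</warning>
    <note>Drop me.</note>
    <step><warning>Keep
me too.</warning></step>
  </section>
</root>"""

print(keep_warnings(sample))
```

Unlike the two-step line-based replace, this also handles warnings that share a line with other tags or span several lines.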

Related

Search in VSCode for the multiline contents of a set of XML tags, using a regular expression

I am using VSCode to do a global search of XML files. Within those files there are multiple instances of these XML tags: <translated></translated>. I need to find all occurrences of any hyphens - that exist anywhere between those tags, where the contents of those tags can be on multiple lines.
<translated>
Content is here
Could be on multiple lines
The meeting could take 3-4 hours
</translated>
In the above example, the phrase "3-4 hours" has a hyphen in it. I need a regex that works for VSCode which finds all incidences of hyphens which happen to be within a set of these XML tags.
Option 1 (using VS Code)
This only matches one dash at a time and not all dashes. This is because limiting the search to inside one set of tags means it can only do one pass at a time. I was going to delete this answer but if it's the only answer given it may be better than nothing. The work around would be that you would have to refresh the search (button above the search box) and click replace all over and over. If there are lots of dashes this would be annoying but better than no answer.
I have been fiddling with Visual Studio Code and the following seems to work.
(<translated>(.|\n)*?)(-)((.|\n)*?<\/translated>)
Assuming you want to, for example, replace the dash, you can do so by putting groups 1 and 4 back around the new text:
$1 <yourTextHere> $4
Example (before and after screenshots omitted): after the replace, only the 3-4 in the first section of the file(s) is affected and the 3 to 4 is not changed.
Option 2 / Update (using Brackets.io)
While I'm unsure of the cause of the failure for VSCode to match across files, the following regex works with Brackets (google Brackets.io) across multiple files...
-(?=[^<]*?<\/translated>)
You have to have all your files in a folder and open the folder. Then search in the project (Find > Find in files). Notice in the screenshot it shows for the matches found across all files. In the lower panel for the selected file t2 copy.txt it matches first on line 6 and then on line 16 and (correctly) does not match on line 10 because it is not contained in a translated tag set.
The reason -(?=[^<]*?<\/translated>) doesn't work in VSCode is that it does not EXPLICITLY contain a newline \n. Even though [^<] includes newlines, the \n needs to be actually written into the regex in order to trigger the multiline option. Why is this?
Primarily for performance reasons. See https://github.com/microsoft/vscode/issues/75265, which uses a similar regex. The issue makes for interesting reading.
So simply using this
-(?=[^<]*?\n*<\/translated>)
works in vscode!
-(?=[^<]*?\n<\/translated>) would work for you too unless you have single line blocks like:
<translated>Con-tent is he-re</translated>
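To see what the lookahead does independently of any editor, here is a small Python sketch. Python's re module has no equivalent of the VSCode multiline trigger, so the plain form of the pattern is enough there. The sample text, the <source> element and the helper name hyphen_positions are assumptions made for illustration, and the pattern assumes no other tag appears between the hyphen and the closing </translated>.

```python
import re

# A hyphen counts only if the next '<' after it starts '</translated>'.
# [^<] also matches newlines, so multi-line blocks are covered.
HYPHEN_IN_TRANSLATED = re.compile(r"-(?=[^<]*?</translated>)")

def hyphen_positions(text):
    """Return the offsets of hyphens that sit inside <translated> blocks."""
    return [m.start() for m in HYPHEN_IN_TRANSLATED.finditer(text)]

# Hypothetical sample: one hyphen outside, three inside translated blocks.
sample = (
    "<source>Takes 3-4 hours</source>\n"
    "<translated>\n"
    "Content is here\n"
    "The meeting could take 3-4 hours\n"
    "</translated>\n"
    "<translated>Con-tent is he-re</translated>\n"
)

print(len(hyphen_positions(sample)))
print(HYPHEN_IN_TRANSLATED.sub(" to ", sample))
```

Because the lookahead consumes nothing, every hyphen in a block is found in a single pass, unlike the capture-group version in Option 1.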

RegEx for matching HTML tags

I am trying to use a regular expression to extract the start tags in lines of given HTML code. In the following lines I expect to get only 'body' and 'h1' as start tags in the first line, and 'html', 'head' and 'title' as start tags in the second line:
'<body data-modal-target class=\'3\'><h1>Website</h1><br /></body></html>'
'<html><head><title>HTML Parser - II</title></head>'
I have already tried to do this using the following regular expression:
start_tags = re.findall(r'<(\w+)\s*.*?[^\/]>',line)
But my output for the first line is ['body','h1','br'], although I did not expect to catch 'br' because I excluded '/'.
And for the second line it is ['html','title'], whereas I expect to catch 'head' too. I would be very grateful if you could let me know which part of my code is wrong.
If you wish to do so with regular expressions, you might want to design multiple different expressions, step by step. You may be able to connect them using OR pipes, but it may not be necessary.
RegEx 1 for h1-h6 tags
This expression helps you capture tags inside the body, excluding body and head:
(<(.*)>(.*)</([^br][A-Za-z0-9]+)>)
You might want to add more boundaries to it. For example, you can replace (.*) with lists of chars [].
RegEx 2 for head and body
For head and body tags, you might want to match across new lines, for which you might want an expression similar to:
(<head>([\s\S]*)<\/head>)|(<body>([\s\S]*)</body>)
Performance
These expressions are rather expensive. You might want to simplify them, write some other script to parse your HTML, or use an HTML parser instead.
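As a sketch of the HTML parser route, Python's standard html.parser module can collect start tags directly. The class name StartTagCollector and the helper start_tags are made up for this example. It assumes that self-closing tags such as <br /> should be skipped, which is what the question expects.

```python
from html.parser import HTMLParser

class StartTagCollector(HTMLParser):
    """Collect the names of ordinary start tags in the order they appear."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br /> arrive here; ignore them.
        pass

def start_tags(line):
    parser = StartTagCollector()
    parser.feed(line)
    parser.close()
    return parser.tags

line1 = "<body data-modal-target class='3'><h1>Website</h1><br /></body></html>"
line2 = "<html><head><title>HTML Parser - II</title></head>"
print(start_tags(line1))
print(start_tags(line2))
```

The parser handles attributes without values (data-modal-target) and quoted attribute values, which is where the hand-written regex goes wrong.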

Regex select XML Element (containing hyphen) and inside content

I'm working with an enterprise CMS and in order to properly create our weekly-updated dropdown menu without republishing our entire site, I have an XML document being created which has a various number of useful XML elements. However, when pulling in a link with the CMS, the generated XML also outputs the link's contents (the entire HTML for the page). Needless to say, with roughly 50 items, the XML file is too big for use on the web (as it stands I think it's over 600KB). The element is <page-content>filler here</page-content>.
What I'm trying to do is use TextWrangler to find and replace all <page-content> tags as well as their containing content.
I've tried a few different regex's, but I can't seem to match the closing tag, so it will just trail on.
Here's what I've tried:
(<page-content>)(.*?)
The above will match up until the next starting <page-content> tag, which is not what I want.
(<page-content>)(.*?)(<\/page-content>)
(<page-content>)(.*?)(<\/page\-content>)
The above finds no matches, even though the below will find the 7 matches it should.
(<content>)(.*?)(<\/content>)
I don't know if there's a special way to deal with hyphens (I'm inexperienced in regular expressions), but if anyone could help me out, it would be greatly appreciated.
Thanks!
EDIT: Before you tell me that Regex isn't meant to parse HTML, I know that, but there seems to be no other way for me to easily find and replace this. There are too many occurrences to manually delete it and save the file again every week.
It seems the problem is that your . is not matching newlines that exist between your open and close tags.
An easy solution for this would be to add the s flag in order for your . to match over newlines. TextWrangler appears to support inline modifiers (?s). You could do it like this:
(<page-content>)(?s)(.*?)(<\/page-content>)
More information on modifiers here.
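The same effect can be reproduced in Python, where the s modifier is spelled re.DOTALL. This is only an illustrative sketch: the function name strip_page_content and the sample XML are assumptions, and TextWrangler's own engine may differ in details. Note that the hyphen needs no escaping outside a character class.

```python
import re

PAGE_CONTENT = r"<page-content>.*?</page-content>"

def strip_page_content(xml_text):
    """Remove every <page-content> element together with its contents."""
    # re.DOTALL makes '.' match newlines, like the inline (?s) modifier.
    return re.sub(PAGE_CONTENT, "", xml_text, flags=re.DOTALL)

# Hypothetical sample: the page-content element spans several lines.
sample = """<item>
<title>Home</title>
<page-content><html>
<body>lots of markup</body>
</html></page-content>
</item>"""

# Without DOTALL the pattern finds nothing, which is the symptom described.
print(re.search(PAGE_CONTENT, sample))
print(strip_page_content(sample))
```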

Regex to remove text between tags in Notepad++

I have a code like this
<wp:post_name>artifical-sweeteners-ruin-your-health</wp:post_name>
I want to change it to
<wp:post_name></wp:post_name>
removing everything inside the tag.
Search for
<wp:post_name>[^<>]+</wp:post_name>
and replace all with
<wp:post_name></wp:post_name>
This assumes that tags can't be nested (which makes the regex quite safe to use). If other tags may be present, then you need to search for
(?i)<wp:post_name>.*?</wp:post_name>
instead (same replace string). However, this probably only works in the latest versions of Notepad++ which brought a major regex engine overhaul, and it's a bit riskier because it will mess up your file if nested <wp:post_name> tags can occur.
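For a scripted version of the first, safer pattern, a short Python sketch follows. The function name clear_post_names and the second sample line are hypothetical, and the same no-nested-tags assumption applies.

```python
import re

def clear_post_names(xml_text):
    """Empty every <wp:post_name> element that contains plain text only."""
    return re.sub(
        r"<wp:post_name>[^<>]+</wp:post_name>",
        "<wp:post_name></wp:post_name>",
        xml_text,
    )

sample = (
    "<wp:post_name>artifical-sweeteners-ruin-your-health</wp:post_name>\n"
    "<wp:post_id>42</wp:post_id>\n"
)
print(clear_post_names(sample))
```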

How to Easily Remove Unwanted Parts in HTML Table Cells Using Notepad++

I have series of different occurrences of table cells in some html files as shown in this image:
http://screencast.com/t/MqGHN2iwfd
Apart from the beginning and end of each cell, they have the following parts in common:
.net/?mobile=true
/spotlightProfile.htm?f=mkt&v=
/#stats
I want to either be able to remove all the parts that look like that at once
OR be able to remove them one by one in Notepad++:
the URL part that precedes .net/?mobile=true
the URL parts before and after /spotlightProfile.htm?f=mkt&v= and
the URL part before /#stats
Furthermore, I also want to be able to remove duplicate occurrences in Notepad++.
Thanks a lot in anticipation for helping out.
A regex for this would look something like the following.
Search for: (.*?)(\/\?mobile=true|\/spotlightProfile\.htm\?f=mkt&v=|\/#stats)(.*)
Replace With: \1\3
Basically we create 3 groups:
before the expression you match,
the expression you are trying to replace
the rest of the line
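The three-group idea can be tried out in Python as below. This is a sketch under assumptions: the first group is lazy and the middle group is required, so that the match is anchored on one of the fragments; the helper name drop_fragment and the sample URLs are invented for illustration. It removes only the fragment itself, not the surrounding URL parts the question also mentions.

```python
import re

# Group 1: text before the fragment (lazy, so it stops at the first fragment).
# Group 2: the fragment to remove.
# Group 3: the rest of the line.
FRAGMENT = re.compile(
    r"(.*?)(/\?mobile=true|/spotlightProfile\.htm\?f=mkt&v=|/#stats)(.*)"
)

def drop_fragment(line):
    """Rebuild the line from groups 1 and 3, leaving out the fragment."""
    return FRAGMENT.sub(r"\1\3", line)

# Hypothetical sample lines.
for line in [
    "http://example.net/?mobile=true",
    "http://example.com/spotlightProfile.htm?f=mkt&v=123",
    "http://example.org/#stats",
    "http://example.org/untouched",
]:
    print(drop_fragment(line))
```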