Regex select XML Element (containing hyphen) and inside content

I'm working with an enterprise CMS, and in order to properly create our weekly-updated dropdown menu without republishing our entire site, I have an XML document being generated that contains a varying number of useful XML elements. However, when pulling in a link with the CMS, the generated XML also outputs the link's contents (the entire HTML for the page). Needless to say, with roughly 50 items, the XML file is too big for use on the web (as it stands, I think it's over 600KB). The element is <page-content>filler here</page-content>.
What I'm trying to do is use TextWrangler to find and replace all <page-content> tags as well as the content they contain.
I've tried a few different regexes, but I can't seem to match the closing tag, so the match just trails on.
Here's what I've tried:
(<page-content>)(.*?)
The above will match up until the next starting <page-content> tag, which is not what I want.
(<page-content>)(.*?)(<\/page-content>)
(<page-content>)(.*?)(<\/page\-content>)
The above finds no matches, even though the below will find the 7 matches it should.
(<content>)(.*?)(<\/content>)
I don't know if there's a special way to deal with hyphens (I'm inexperienced in regular expressions), but if anyone could help me out, it would be greatly appreciated.
Thanks!
EDIT: Before you tell me that regex isn't meant to parse HTML, I know that, but there seems to be no other way for me to easily find and replace this. There are too many occurrences to manually delete them and save the file again every week.

It seems the problem is that your . is not matching newlines that exist between your open and close tags.
An easy solution for this would be to add the s flag in order for your . to match over newlines. TextWrangler appears to support inline modifiers (?s). You could do it like this:
(<page-content>)(?s)(.*?)(<\/page-content>)
More information on modifiers here.
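For readers who would rather script the weekly cleanup than run it in an editor, here is a minimal sketch of the same idea in Python. It is not part of the original answer: the sample XML is made up for illustration, and re.DOTALL is used as Python's counterpart of the (?s) inline modifier.

```python
import re

# Hypothetical sample; the real file would come from the CMS export.
xml = """<item>
  <title>Home</title>
  <page-content>
    <html>lots of markup</html>
  </page-content>
</item>
<item>
  <title>About</title>
  <page-content>more markup</page-content>
</item>"""

# Without a dot-all flag, "." stops at newlines, so only the
# single-line element is found.
single_line_only = re.findall(r"<page-content>.*?</page-content>", xml)

# re.DOTALL lets "." match newline characters as well, so the
# multi-line element is matched too.
pattern = re.compile(r"<page-content>.*?</page-content>", re.DOTALL)
all_elements = pattern.findall(xml)

# Replace every <page-content> element (tags and contents) with nothing.
stripped = pattern.sub("", xml)
print(stripped)
```

The lazy quantifier .*? matters here: a greedy .* with dot-all enabled would run from the first opening tag to the last closing tag in the file.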

Related

Search in VSCode for the multiline contents of a set of XML tags, using a regular expression

I am using VSCode to do a global search of XML files. Within those files there are multiple instances of these XML tags: <translated></translated>. I need to find all occurrences of any hyphens - that exist anywhere between those tags, where the contents of those tags can be on multiple lines.
<translated>
Content is here
Could be on multiple lines
The meeting could take 3-4 hours
</translated>
In the above example, the phrase "3-4 hours" has a hyphen in it. I need a regex that works for VSCode which finds all incidences of hyphens which happen to be within a set of these XML tags.
Option 1 (using VS Code)
This only matches one dash at a time, not all dashes. That is because limiting the search to the inside of one set of tags means it can only find one dash per tag pair on each pass. I was going to delete this answer, but if it's the only answer given it may be better than nothing. The workaround is to refresh the search (the button above the search box) and click Replace All over and over. If there are lots of dashes this would be annoying, but it is better than no answer.
I have been fiddling with Visual Studio Code and the following seems to work.
(<translated>(.|\n)*?)(-)((.|\n)*?<\/translated>)
If you want to replace the dash, for example, you can do so by putting groups 1 and 4 back around the new text:
$1 <yourTextHere> $4
Example:
Before replace:
After replace (note only the 3-4 in the first section of the file(s) is affected and the 3 to 4 is not changed):
Option 2 / Update (using Brackets.io)
While I'm unsure of the cause of the failure for VSCode to match across files, the following regex works with Brackets (google Brackets.io) across multiple files...
-(?=[^<]*?<\/translated>)
You have to have all your files in a folder and open the folder. Then search in the project (Find > Find in files). Notice in the screenshot it shows for the matches found across all files. In the lower panel for the selected file t2 copy.txt it matches first on line 6 and then on line 16 and (correctly) does not match on line 10 because it is not contained in a translated tag set.
The reason why -(?=[^<]*?<\/translated>) doesn't work in VS Code is that it does not EXPLICITLY contain a newline \n. Even though [^<] includes newlines, the \n needs to be actually written into the regex in order to trigger the multiline option. Why is this?
See https://github.com/microsoft/vscode/issues/75265, which uses a similar regex. The issue makes for interesting reading ;>} The short version: primarily for performance reasons.
So simply using this
-(?=[^<]*?\n*<\/translated>)
works in vscode!
-(?=[^<]*?\n<\/translated>) would work for you too unless you have single line blocks like:
<translated>Con-tent is he-re</translated>
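Outside of an editor, the lookahead approach can be tried directly. The following is a rough Python sketch, not taken from the answers above; the sample text and the <source> element are invented for illustration, and it assumes that <translated> elements never contain nested tags (the [^<]* part relies on that).

```python
import re

# Hypothetical sample with hyphens both inside and outside <translated>.
text = """<translated>
Content is here
The meeting could take 3-4 hours
</translated>
<source>A well-known issue</source>
<translated>Con-tent is he-re</translated>"""

# Match a hyphen only if the next "<" after it begins </translated>.
# Python's negated class [^<] matches newlines without any extra flag.
hyphen_in_translated = re.compile(r"-(?=[^<]*</translated>)")

matches = hyphen_in_translated.findall(text)
replaced = hyphen_in_translated.sub(" to ", text)
print(len(matches))
print(replaced)
```

Because the lookahead does not consume any text, every hyphen in a block is found in a single pass, which avoids the repeated Replace All workaround described in Option 1.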

Visual Studio Code - Removing Lines Containing criteria

This probably isn't a VS Code-specific question but it's my tool of choice.
I have a log file with a lot of lines containing the following:
Company.Environment.Security.RightsBased.Policies.RightsUserAuthorizationPolicy
Those are debug-level log records that clutter the file I'm trying to process. I'm looking to remove the lines with that content.
I've looked into Regex but, unlike removing a blank line where you have the whole content in the search criteria (making find/replace easy), here I need to match from line break to line break on some criteria between the two, I think...
What are your thoughts on how criteria like that would work?
If the criterion is a particular string and you don't want to have to remember regexes, there are a few handy keyboard shortcuts that can help you out. I'm going to assume you're on a Mac.
Cmd-F to open find.
Paste your string.
Opt-Enter to select all of the instances of the string on the page.
Cmd-L to broaden the selection to the entire line of each instance on the page.
Delete/Backspace to remove those lines.
I think you should be able to just search for ^.*CONTENT.*$\n, where the content is the text you showed us. That is, search on the following pattern:
^.*Company\.Environment\.Security\.RightsBased\.Policies\.RightsUserAuthorizationPolicy.*$\n
And then just replace with empty string.
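If the log file needs the same cleanup repeatedly, the pattern can also be applied from a script. Below is a minimal Python sketch under a few assumptions: the log lines are invented for illustration, re.escape is used so the dots in the class name are treated literally, and the trailing newline is made optional so a final line without a line break is still removed.

```python
import re

needle = "Company.Environment.Security.RightsBased.Policies.RightsUserAuthorizationPolicy"

# Hypothetical log content for illustration only.
log = (
    "INFO  service started\n"
    "DEBUG " + needle + " checked\n"
    "WARN  disk almost full\n"
    "DEBUG " + needle + " checked\n"
)

# re.MULTILINE makes ^ and $ anchor at every line, not just the
# start and end of the whole string.
line_pattern = re.compile(r"^.*" + re.escape(needle) + r".*$\n?", re.MULTILINE)

cleaned = line_pattern.sub("", log)
print(cleaned)
```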
I have already upvoted the answer from #james, but I also found an extension for VS Code that is easier to use and offers more features. Here it is
It has simple options for applying filters.
To match the specific case mentioned in the question, I am attaching a screenshot that shows how to use it. I am posting this for others who come here searching for the same issue (like I did).

RegEx to delete all XML data outside of specified tags

I am using the latest and greatest version of Notepad++. Is it possible for a RegEx to delete all text and tags I don't need and only leave behind text and tags I need? The tags I need to remain look like this:
<warning>I need this text to remain intact together with accompanying tags.</warning>
There must be around 500 of these WARNING tag pairs nested within a variety of XML levels. I would like the RegEx to delete all data that exists outside of these WARNING tags, but not the opening and closing warning tags themselves or the text within the tags. Below are four different RegEx variations I tested. After a Find & Replace operation they all eliminate the text located within the warning tags, so they are no help:
<warning>[^<>]+</warning>
<warning>[^>]+</warning>
<warning>(.+?)</warning>
<warning>.*?</warning>
I would tremendously appreciate any help that will assist me in developing a RegEx that will perform the data clean up task I need to perform.
The Notepad++ regex find and replace below seems to work for me. Remember to select the Regular expression search mode.
Search for each of the two regexes below and replace with nothing. It requires two steps, so it is not perfect yet:
The 1st replace empties every line that does not start with <warning> (leading whitespace allowed).
The 2nd replace removes all the empty lines, leaving only the lines with <warning>.
^(?!\s*?<warning>).*?$
^\s*
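As an alternative to deleting everything around the elements, the wanted elements can be extracted and everything else discarded. This is a different technique from the two-step replace above, sketched here in Python with a made-up sample document; it also keeps warnings that span several lines or share a line with other markup, which the line-based approach assumes does not happen.

```python
import re

# Hypothetical sample; structure and element names other than
# <warning> are invented for illustration.
xml = """<doc>
  <section>
    <para>Unwanted text</para>
    <warning>I need this text to remain intact together with accompanying tags.</warning>
  </section>
  <warning>Second
warning spans two lines.</warning>
</doc>"""

# Collect every <warning> element; re.DOTALL lets "." cross newlines.
warnings = re.findall(r"<warning>.*?</warning>", xml, flags=re.DOTALL)

# Keep only the extracted elements, one per entry.
result = "\n".join(warnings)
print(result)
```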

How to use a regular expression in notepad++ to change a url

I need some help with our migrated site's URLs. We moved our site from Joomla to WordPress, and in our posts we have over 20K internal links.
The structure of these links are like these:
www.mysite.nl/current-post-title/index.php?option=com_content&view=article&id=5259:related-post-title&catid=35:universum&Itemid=48
What we need is this:
www.mysite.nl/related-post-title
So basically we need to remove everything after www.mysite.nl/ up to and including the colon :, i.e. remove this: current-post-title/index.php?option=com_content&view=article&id=5259:
And then remove everything behind the first ampersand (including the ampersand itself) until the end of the string, i.e. remove &catid=35:universum&Itemid=48
Of course only url strings containing this index.php?option=com_content must be changed.
I have dumped the table in plain text and opened it in Notepad++ to do a search and replace with regular expression because the content that must be removed from these lines is different every time.
Can someone please help me with the right regular expression?
In find what box enter below:
(www\.mysite\.nl)\/.*index\.php\?option=com[^:]+:([^&]+)&.*
In replace with box enter:
\1/\2
Result
www.mysite.nl/related-post-title
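The same substitution can be checked outside Notepad++ before running it on the full dump. Here is a small Python sketch, offered as an illustration rather than as part of the answer; the dots are escaped so they only match literal dots, and the second URL is a made-up example of a link that should be left alone.

```python
import re

old_url = (
    "www.mysite.nl/current-post-title/index.php?option=com_content"
    "&view=article&id=5259:related-post-title&catid=35:universum&Itemid=48"
)
# Hypothetical link that is already in the new format.
clean_url = "www.mysite.nl/some-other-post"

# Group 1 keeps the host, group 2 keeps the slug between the first
# colon after "option=com" and the next ampersand.
pattern = re.compile(
    r"(www\.mysite\.nl)/.*index\.php\?option=com[^:]+:([^&]+)&.*"
)

new_url = pattern.sub(r"\1/\2", old_url)
untouched = pattern.sub(r"\1/\2", clean_url)
print(new_url)
print(untouched)
```

Note that the trailing .* consumes the rest of the line, so this sketch assumes one URL per line; a dump with other text after the link on the same line would need a tighter ending.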
Go inside-out, rather than outside-in, replace \/.+&id=\d+\:(.+?)&.+ with /$1. Also, paste a few into http://www.regexr.com/ and play around, although JavaScript and Notepad++ might have some differences in implemented Regex features, e.g. negative lookbehinds.

Get value between <b> tag using regex in Yahoo Pipes

I have searched up and down trying to find an answer that will work for me but haven't been able to figure this out. I'm using Yahoo Pipes for this.
Lake Harmony Estates <b>Sleeps: 16</b>
What I need to do is extract the Sleeps: 16 out from the B tag and output just that value and nothing else. I don't suspect this is very hard to do, but given my limited regex knowledge it's giving me troubles. I've tried adapting regex code pertaining to other tags, but just can't seem to get this one to work.
Any help on this would be appreciated. Thanks.
Edit:
Here is my pipe if you wanted to take a look at the regex horrible-ness I've created. The one I'm trying to work though is the item.sleeps, last entry in the 2nd regex
http://pipes.yahoo.com/pipes/pipe.info?_id=567026d850223b0075d80fd3c9bf7e75
This should fit your needs, assuming the HTML isn't laden with quotes and such. Note that the + means that empty <b> tags are ignored. Also, HTML is not truly parsable via regex, so this will only work for basic tags. It should work even if the tag has an ID or a class attribute, but there are absolutely ways to break this regex.
/<b[^>]*>([^<]+)<\/b>/
I posted this question to Twitter and got a response back that worked for me.
(?s)^.*<b>(.*?)</b>.*
Replace with $1 and make sure the g flag is checked.
This solution did everything I needed. I had additional data that I had already excluded in my example that became unnecessary with this regex.
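Both patterns from this thread can be compared side by side outside Yahoo Pipes. The Python sketch below is only an illustration using the sample string from the question: the first pattern captures the tag content directly, while the second replaces the whole string with the captured group.

```python
import re

item = "Lake Harmony Estates <b>Sleeps: 16</b>"

# Approach 1: search for the tag and read the captured group.
match = re.search(r"<b[^>]*>([^<]+)</b>", item)
sleeps = match.group(1) if match else None

# Approach 2: match the entire string and substitute it with group 1.
# (?s) lets "." cross newlines in multi-line input.
sleeps_via_sub = re.sub(r"(?s)^.*<b>(.*?)</b>.*", r"\1", item)

print(sleeps)
print(sleeps_via_sub)
```

One difference worth knowing: when there is no <b> tag at all, the first approach yields None, while the substitution leaves the input string unchanged.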