Notepad++ XML node regex find and replace

Can you tell me what to search for in Notepad++ in order to find all nodes with the OptionCode 09 below and delete them? For example, I would like to be able to search the XML below, delete the first entry, and be left with the entry shown after it.
I tried searching for <Vehicle>.*?</Vehicle>, which works when replacing with a blank value, but I also want to add a criterion that the 09 value appears somewhere in the node. Is it possible to add a search condition for the ">09<" text string?
Search here:
<Vehicle>
<InvoiceDateTime>2016-03-20T00:00:00</InvoiceDateTime>
<InvoiceChargeCents>63</InvoiceChargeCents>
<OptionCode>09</OptionCode>
<JobEndDateTime>2016-03-19T00:00:00</JobEndDateTime>
<AuthorizationCode />
</Vehicle>
<Vehicle>
<InvoiceDateTime>2016-03-20T00:00:00</InvoiceDateTime>
<InvoiceChargeCents>63</InvoiceChargeCents>
<OptionCode>35</OptionCode>
<JobEndDateTime>2016-03-19T00:00:00</JobEndDateTime>
<AuthorizationCode />
</Vehicle>
Return the entry below:
<Vehicle>
<InvoiceDateTime>2016-03-20T00:00:00</InvoiceDateTime>
<InvoiceChargeCents>63</InvoiceChargeCents>
<OptionCode>35</OptionCode>
<JobEndDateTime>2016-03-19T00:00:00</JobEndDateTime>
<AuthorizationCode />
</Vehicle>

My approach consists of matching the opening <Vehicle> tag and then restricting dot matching with a tempered greedy token, so that it can match neither an opening nor a closing Vehicle tag:
<Vehicle>(?:(?!</?Vehicle>).)*>09<.*?</Vehicle>\R*
The ". matches newline" option must be enabled.
Replace with an empty string.
The (?:(?!</?Vehicle>).)* part is a tempered greedy token: it matches any text that contains neither <Vehicle> nor </Vehicle>, so the >09< that follows must lie inside the same node as the opening <Vehicle>.
Note that \R* matches zero or more newline sequences (either CRLF, or CR, or LF).
Also please note that a more efficient pattern will be the unrolled version of the above pattern:
<Vehicle>[^<]*(?:<(?!\/?Vehicle>)[^<]*)*>09<.*?<\/Vehicle>\R*
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
It matches the same text, but is more efficient with larger texts.
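For anyone who wants to try the tempered greedy token outside Notepad++, here is a minimal Python sketch. It is only an illustration under a few assumptions: Python's re module has no \R, so the newline alternatives are spelled out, re.DOTALL stands in for the ". matches newline" option, and the function name and the shortened sample are made up for the example.

```python
import re

# Python's re has no \R, so CRLF / CR / LF are listed explicitly.
# re.DOTALL plays the role of Notepad++'s ". matches newline" option.
VEHICLE_09 = re.compile(
    r"<Vehicle>(?:(?!</?Vehicle>).)*>09<.*?</Vehicle>(?:\r\n|\r|\n)*",
    re.DOTALL,
)


def drop_vehicles_with_09(xml_text: str) -> str:
    """Remove every <Vehicle> node that contains the text >09< (hypothetical helper)."""
    return VEHICLE_09.sub("", xml_text)


# Shortened version of the sample from the question.
sample = (
    "<Vehicle>\n<OptionCode>09</OptionCode>\n</Vehicle>\n"
    "<Vehicle>\n<OptionCode>35</OptionCode>\n</Vehicle>\n"
)
cleaned = drop_vehicles_with_09(sample)
print(cleaned)
```

Because the token refuses to step over <Vehicle> or </Vehicle>, a node without >09< can never borrow the >09< of a later node, whichever order the nodes appear in.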

Related

Regex - How to remove last instance of the search value it finds?

I have multiple XML files that I need to delete a line from. The same line exists in different sections of the file but I only need to delete the last instance it finds. For example -
(Opening tag here)Simple name="DisplayValue" value="{?Consumer}" />
(Opening tag here)Simple name="DisplayValue" value="{?Consumer}" />
(Opening tag here)Simple name="DisplayValue" value="{?Consumer}" /> - This is the line I need to delete
This is the line in file.
I am using the Find in Files feature in Notepad++ to achieve this. Tia.
Try the following find and replace, in regex mode (with dot all enabled):
Find: (.*)Same Text(?:\r?\n|$)(.*)
Replace: $1$2
This should work because the initial (.*) capture group should match and capture all content up to, but not including, the last occurrence of Same Text. Then, we also match and capture all content after this last occurrence. Finally, we replace with just the first two capture groups, to effectively splice out the line you want to remove.
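The same idea can be checked quickly in Python. This is just a sketch, not the Notepad++ behaviour itself: re.DOTALL stands in for the "dot all" option, "Same Text" is the placeholder from the pattern above, and the helper name is invented for the example. re.escape is added so that a line containing regex metacharacters (such as the { and ? in the question) is treated literally.

```python
import re


def remove_last_line_occurrence(text: str, line: str) -> str:
    """Delete the last occurrence of `line`, plus its line break, from `text`."""
    pattern = re.compile(
        # Greedy (.*) backtracks from the end, so it stops at the LAST occurrence.
        r"(.*)" + re.escape(line) + r"(?:\r?\n|$)(.*)",
        re.DOTALL,
    )
    # Keep everything before and after the matched line.
    return pattern.sub(r"\1\2", text, count=1)


doc = "keep\nSame Text\nmiddle\nSame Text\ntail\n"
result = remove_last_line_occurrence(doc, "Same Text")
print(result)
```

If the line does not occur at all, the pattern does not match and the text comes back unchanged.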

Regex to find subelement in XML

I am using the Regular Expression search feature in Notepad++ to find matches in a few hundred files.
My goal is to find a parent/child combo in each. I don't care a lot about what specifically is selected (parent and child or just child). I just want to know if the parent contains a specific child.
I want to find a parent <description> element that also has a child <description> element.
Example of what it should find (since one of the sub-elements is a <description>):
<description>
<otherstuff>
</otherstuff>
<something>
</something>
<description>
</description>
<otherstuff>
</otherstuff>
</description>
Example of what it should NOT find:
<description>
<otherstuff>
</otherstuff>
<something>
</something>
<notadescription>
</notadescription>
<otherstuff>
</otherstuff>
</description>
Each may have other children and sub children as well. They both also may be in the same document.
If I search for this:
<description>(.*)<description>(.*)</description>
It selects too much, because it will select another top level when I only want it to select the child for that 2nd piece.
You said you're working with Notepad++, so here is a way to go:
Ctrl+F
Find what: <description>(?:(?!</description).)*<description>(?:(?!<description>).)*</description>
check Match case
check Wrap around
check Regular expression
CHECK . matches newline
Explanation:
<description> # opening tag
(?:(?!</description).)* # tempered greedy token: any characters, as long as no closing tag is crossed
<description> # nested opening tag
(?:(?!<description>).)* # tempered greedy token: any characters, as long as no further opening tag is crossed
</description> # closing tag
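To check the pattern against small inputs before running it over a few hundred files, a Python sketch like the following may help. It assumes re.DOTALL as the equivalent of ". matches newline"; the function name and the three test strings are made up for illustration.

```python
import re

# Same pattern as in "Find what", split over two lines for readability.
NESTED_DESCRIPTION = re.compile(
    r"<description>(?:(?!</description).)*<description>"
    r"(?:(?!<description>).)*</description>",
    re.DOTALL,
)


def has_nested_description(xml_text: str) -> bool:
    """True if a <description> opens inside another, still open, <description>."""
    return NESTED_DESCRIPTION.search(xml_text) is not None


nested = (
    "<description>\n<otherstuff>\n</otherstuff>\n"
    "<description>\n</description>\n</description>"
)
flat = "<description>\n<notadescription>\n</notadescription>\n</description>"
siblings = "<description>\n</description>\n<description>\n</description>"

print(has_nested_description(nested))
print(has_nested_description(flat))
print(has_nested_description(siblings))
```

The sibling case is the interesting one: the first token stops at the first closing tag, so two top-level descriptions in the same document are not reported as parent and child.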
You should not use (.*) here, because it is greedy.
Here is an example of why you shouldn't use it in your case:
<description>
<otherstuff>
</otherstuff>
<description>
<description>hello</description>
</description>
</description>
Suppose we use <description>(.*)<description>(.*)</description> here.
It will match:
<description>
<description>hello</description>
</description>
</description>
So if you only want to match what is inside the second description, use (.*?), which is called non-greedy (lazy).
Using <description>(.*)<description>(.*?)</description> will match:
<description>
<description>hello</description> # end of match
# the remaining </description> tags are left out, because (.*?) stops at the first match
So you must use (.*?): it stops as soon as it finds the first closing tag, whereas (.*) is greedy and looks for the largest possible match.
So <description>(.*)<description>(.*?)</description> will be fine, because it only matches what is inside the sub-description in your case.
I'm guessing that we'd be designing an expression to exclude <notadescription>, such as:
<description>(?!<notadescription>)[\s\S]*<\/description>
which if we would be capturing the description element, we might want a capturing group:
(<description>(?!<notadescription>)[\s\S]*<\/description>)

Regex that finds unspecified html tags which are not surrounded by specified html tags

I'm trying to find a Regex that finds all tags that are:
NOT part of a list of allowed tags
NOT surrounded by a specific tag
This is what I currently have:
(?<!<noparse>)<(?!(\/?(noparse|u))).*?>(?!<\/noparse>)
If I have the following as input
<u><b>test2</b></u>
<noparse><u><b>test</b></u></noparse>
<noparse><b>test</b></noparse>
It will match
<b> & </b> (correct, not surrounded by <noparse></noparse>, <u></u> is allowed)
<b> & </b> (incorrect, surrounded by <noparse></noparse>)
</b></noparse> (incorrect, surrounded by <noparse></noparse>)
However, I'd like it to match
<b> & </b>
{nothing}
{nothing}
You can check it out here:
https://regex101.com/r/HO1Bo2/1
I want to do this so that I can sanitize strings. Our app is made in Unity and uses TextMeshPro to display text. TMP supports quite a lot of tags, all of which you can find here: http://digitalnativestudios.com/textmeshpro/docs/rich-text/ . We only want to allow a couple of these tags, because users could get too creative and start messing with line heights, offsets, font size and so on. We also want to use the <noparse> tag so that users can surround any supported tag with it to make it show up as plain text.
Thanks in advance, I'm sure there are smarter people than me around!
Yours,
Bas
In the end I went with a different route, because Regex was really not working out in this case.
Create a List that will contain all strings that should not be taken into account during the sanitization process
Replace all existing format items in the input string with a normalized format item, and backup the original format item in said list.
<b>test</b> {56} <noparse><b>test</b></noparse> {3}
becomes
<b>test</b> {0} <noparse><b>test</b></noparse> {1}
Replace all existing <noparse>...</noparse> blocks, also with normalized format items, and add them to the same list.
<b>test</b> {0} <noparse><b>test</b></noparse> {1}
becomes
<b>test</b> {0} {2} {1}
Retrieve all unsupported tags in the remaining string & sanitize, using the following regex:
<(?!(\/?(u|i))).*?>
The following tags are supported tags in this case, all others will be found by the Regex:
<u></u><i></i>
Surrounding all unsupported tags with noparse tags leads to the following string:
<b>test</b> {0} {2} {1}
becomes
<noparse><b></noparse>test<noparse></b></noparse> {0} {2} {1}
Now we can replace all the format items with their original text again
string sanitizedString = string.Format(sanitizedStringBuilder.ToString(), replacedStrings.ToArray());
Result:
<noparse><b></noparse>test<noparse></b></noparse> {56} <noparse><b>test</b></noparse> {3}
Seems to work really well, and while it's quite a lot of steps I'm pretty happy with this solution.
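For readers who want to experiment with the steps above outside Unity, here is a rough Python sketch of the same stash-and-restore idea. It is not the C# code from the answer: instead of renumbered {n} items and string.Format it uses NUL-delimited placeholders (which assumes the input never contains a NUL character), and the names ALLOWED, sanitize and protect are invented for the example.

```python
import re

ALLOWED = ("u", "i")  # assumed whitelist, mirroring the example above
FORMAT_ITEM = re.compile(r"\{\d+\}")
NOPARSE_BLOCK = re.compile(r"<noparse>.*?</noparse>", re.DOTALL)
# Same shape as the regex above. Note that it also lets through any tag
# whose name merely STARTS with an allowed name (for example <indent>).
UNSUPPORTED_TAG = re.compile(r"<(?!/?(?:%s)).*?>" % "|".join(ALLOWED))
PLACEHOLDER = re.compile(r"\x00(\d+)\x00")


def sanitize(text: str) -> str:
    stash = []  # original snippets, indexed by placeholder number

    def protect(match):
        stash.append(match.group(0))
        return "\x00%d\x00" % (len(stash) - 1)

    def restore(match):
        # A stashed <noparse> block may itself contain placeholders.
        return PLACEHOLDER.sub(restore, stash[int(match.group(1))])

    text = FORMAT_ITEM.sub(protect, text)    # back up existing format items
    text = NOPARSE_BLOCK.sub(protect, text)  # back up <noparse> blocks
    text = UNSUPPORTED_TAG.sub(              # wrap whatever is left
        lambda m: "<noparse>" + m.group(0) + "</noparse>", text
    )
    return PLACEHOLDER.sub(restore, text)    # put the originals back


print(sanitize("<b>test</b> {56} <noparse><b>test</b></noparse> {3}"))
```

The comment on UNSUPPORTED_TAG points at one thing worth checking in a real implementation: a stricter pattern would require the allowed name to be followed by > or =.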

Regex search in XSL, select string after match

I have a solution where the filename has a prefix showing the file size of a PDF. I need to pick up that value into an XML file that holds a lot of other info collected with the XSLT.
However, I can't get this regex match to work.
The filenames follow the structure of this example:
776524_P9466_Novilon_Broschyr_SE_Omslag.xml where the digits before the underscore are the file size.
I have a regex search pattern of _(.*) and I can validate that it matches everything after the first section of digits.
Here is my XSL that I'm having problems with:
<xsl:param name="find_size">
<xsl:text>(_.*)</xsl:text>
</xsl:param>
<xsl:variable name="filename_of_start"><xsl:value-of select="replace($filename_of_file, '$find_size', '')"/></xsl:variable>
<artwork_size><xsl:value-of select="$filename_of_start"/></artwork_size>
$filename_of_file has the string: 776524_P9466_Novilon_Broschyr_SE_Omslag.xml
I have also tried to match the digits before the underscore and replace with that match, but I haven't got that to work either. Other replaces, where I remove other matches from the beginning of the string, work.
Thanks
How about using the substring-before() XPath function?
<xsl:variable name="file_size" select="substring-before($filename, '_')" />
Instead of replace($filename_of_file, '$find_size', '') you want replace($filename_of_file, $find_size, ''): with the quotes, '$find_size' is a string literal rather than a reference to the parameter.
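The difference between the two suggestions, and what the quoted version ends up doing, can be seen in a few lines of Python. This is only an analogy in Python's regex dialect, not XSLT; the variable names are made up, and the filename is the one from the question.

```python
import re

filename = "776524_P9466_Novilon_Broschyr_SE_Omslag.xml"

# Rough equivalent of substring-before($filename, '_').
size_by_split = filename.partition("_")[0]

# Rough equivalent of replace($filename_of_file, $find_size, '')
# with the parameter holding the pattern (_.*).
size_by_regex = re.sub(r"(_.*)", "", filename)

# The quoted version uses the text "$find_size" itself as the pattern:
# an end-of-string anchor followed by literal characters, which never matches.
unchanged = re.sub(r"$find_size", "", filename)

print(size_by_split, size_by_regex, unchanged)
```

The first two give the digits before the underscore; the third returns the filename untouched, which matches the symptom described in the question.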

How to find a word within text using XSLT 2.0 and REGEX (which doesn't have \b word boundary)?

I am attempting to scan a string of words and look for the presence of a particular word (case insensitive) in an XSLT 2.0 stylesheet using regex.
I have a list of words that I wish to iterate over and determine whether or not they exist within a given string.
I want to match on a word anywhere within the given text, but I do not want to match within a word (i.e. a search for foo should not match on "food" and a search for bar should not match on "rebar").
XSLT 2.0 regex does not have a word boundary (\b), so I need to replicate it as best I can.
You can use alternation to avoid repetition:
<xsl:if test="matches($prose, concat('(^|\W)', $word, '($|\W)'),'i')">
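The alternation trick is easy to try out in Python before wiring it into a stylesheet. A small sketch, with a made-up helper name; re.escape is an addition that the XSLT concat() version does not have, so a word containing regex metacharacters would behave differently there.

```python
import re


def contains_word(prose: str, word: str) -> bool:
    """Emulate \\b by requiring start/end of string or a non-word character."""
    pattern = r"(^|\W)" + re.escape(word) + r"($|\W)"
    return re.search(pattern, prose, re.IGNORECASE) is not None


print(contains_word("all Foo is bar", "foo"))
print(contains_word("good food", "foo"))
print(contains_word("rebar", "bar"))
```

Unlike a true \b, this pattern consumes the neighbouring non-word character, which is harmless for a yes/no test like the xsl:if above but matters if the match is later used in a replace.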
If your XSLT 2.0 processor is Saxon 9 then you can use Java regular expression syntax (including \b) with the functions matches, tokenize and replace by starting the flag attribute with an exclamation mark:
<xsl:value-of select="matches('all foo is bar', '\bfoo\b', '!i')"/>
Michael Kay mentioned that option recently on the XSL mailing list.