Search & replace regex - filtering files - regex

A little bit of background:
I work at a multilingual communication company, where we’re working with a CMS system. Since its last update, all the files I export out of the system are ‘polluted’ with metadata, which I don't want to see, use or replace. To filter and change a heap of xml files, I use Powergrep, which operates with regexes.
I want my regex to find, e.g. "there is no spoon", "oracle", "I know kung-fu" and "bending method" (all straight quotation marks) and replace it with “there is no spoon”, “oracle”, “I know kung-fu” and “bending method” (all with curly quotation marks).
I don’t want it to find the metadata "concept.dtd" and "map.dtd"
The following lines are the first lines of my xml file. It's this "concept.dtd" that I would like to ignore.
<?xml version="1.0" encoding="UTF-16" standalone="no"?>
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"[
]>
<?ish ishref="GUID-6B84EF92-DA99-4C54-BA91-FD0A113D4A96" version="1" lang="sv" srclng="en"?>
This is somewhere in the middle of the xml file
<row>
<entry colname="col1" valign="middle" align="left">"Bending method" </entry>
<entry colname="col2" valign="middle" align="left">another word</entry>
</row>
So.. this is the original regex:
(?<!=)"\b(.+?)\b"(?! \[)
Replacement:
“\1”
Problem:
As the metadata “concept.dtd” and “map.dtd” are part of the file, I don’t want to replace their quotation marks in order not to change anything crucial. So I tried rewriting the regex:
(?<!=)"\b(.+?[\.d])\b"(?! \[)
It almost works: “concept.dtd” and “map.dtd” are skipped, most of the terms between quotation marks are found, but not all: “Bending method” is not found, for example.
What am I missing? Any help or opinions would be greatly appreciated!

Based on your last answers, here is a regexp that can help you:
(?<=<entry)[^>]+>[^<>]*?(".+?")[^<>]*?(?=<\x2Fentry>)
Demo
http://regex101.com/r/lX2cU3
Discussion
I assume that each <entry> node contains a single run of words between straight quotation marks and that there are no carriage returns or line feeds inside an <entry> node.
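Another way to look at the problem is to restrict the replacement to text that sits between tags, which leaves the DOCTYPE line and all attribute values untouched. The following is a minimal Python sketch of that idea, not a PowerGREP recipe. The function name curl_quotes_in_text is hypothetical, and the sketch assumes the DOCTYPE internal subset is empty, as in the sample file.

```python
import re

# A run of text that sits between a closing '>' and the next opening '<'.
TEXT_BETWEEN_TAGS = re.compile(r"(?<=>)[^<>]+(?=<)")
# A straight-quoted phrase inside such a text run.
STRAIGHT_QUOTED = re.compile(r'"([^"]+)"')


def curl_quotes_in_text(xml: str) -> str:
    """Replace straight quotes with curly quotes, but only in element text."""
    def fix_run(match):
        # \u201c and \u201d are the left and right curly quotation marks.
        return STRAIGHT_QUOTED.sub("\u201c\\1\u201d", match.group(0))

    return TEXT_BETWEEN_TAGS.sub(fix_run, xml)


sample = (
    '<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"[\n'
    ']>\n'
    '<row>\n'
    '<entry colname="col1" valign="middle" align="left">"Bending method" </entry>\n'
    '</row>\n'
)
result = curl_quotes_in_text(sample)
print(result)
```

Because the outer pattern only ever sees text that follows a '>' and precedes a '<', nothing inside a tag, a processing instruction or the DOCTYPE declaration is a candidate for replacement.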


how to use '*' in XPATH starts-with()?

We receive banking statements from the SAP system. Sometimes the file name does not follow the naming standard, and the file is rejected.
We want to validate the file name. As in the example below, the file name arrives in the name attribute.
Can the country ISO code be skipped in the validation?
We want an XPath that matches GLO_***_UPLOAD_STATEMENT, so that the ISO code itself is not validated.
Example XML:
<?xml version="1.0" encoding="UTF-8"?>
<Details name="GLO_ZFA_UPLOAD_STATEMENT" type="banking" version="3.0">
<description/>
<object>
<encrypted data="b528f05b96102f5d99743ff6122bb0984aa16a02893984a9e427a44fcedae1612104a7df1173d9c61a99ebe0c34ea67a46aecc86f41f5924f74dd525"/>
</object>
</Details>
Xpath tried:
Details[@type="banking"]/@name[not(starts-with(., "GLO_***_UPLOAD_STATEMENT"))]
which is not working :(
Can anyone help here, please :)
Thanks in advance!
Try using the matches() function with a regex like this:
Details[@type="banking"]/@name[not(matches(., "^GLO_(.){3}_UPLOAD_STATEMENT"))]
starts-with() compares literal characters; it doesn't recognize patterns.
If your XPath processor doesn't support regex, you can use something like the following (note that ends-with() is itself an XPath 2.0 function):
Details[@type="banking"]/@name[not(starts-with(., "GLO_") and ends-with(., "_UPLOAD_STATEMENT"))]
You can match regular expressions using the matches() function. For example:
//Details[@type="banking" and not(matches(@name, "GLO_[A-Z]*_UPLOAD_STATEMENT"))]/@name
This selects the name attribute only of Details elements that have type="banking" and a name that does not match the regular expression "GLO_[A-Z]*_UPLOAD_STATEMENT". You can refine the regex as needed.
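If the check can also be done outside XPath, the same rule is easy to express in a few lines of Python. This is only a sketch under assumptions: the function name invalid_banking_names is hypothetical, and the pattern assumes the ISO code is exactly three uppercase letters, which the question suggests but does not state.

```python
import re
import xml.etree.ElementTree as ET

# Assumption: the country code is exactly three uppercase letters.
NAME_PATTERN = re.compile(r"GLO_[A-Z]{3}_UPLOAD_STATEMENT")


def invalid_banking_names(xml_text):
    """Return the name of every banking Details element that breaks the convention."""
    root = ET.fromstring(xml_text)
    bad = []
    # iter() also visits the root element itself.
    for element in root.iter("Details"):
        name = element.get("name", "")
        if element.get("type") == "banking" and not NAME_PATTERN.fullmatch(name):
            bad.append(name)
    return bad


good_doc = '<Details name="GLO_ZFA_UPLOAD_STATEMENT" type="banking" version="3.0"><description/></Details>'
bad_doc = '<Details name="GLO_ZF_UPLOAD" type="banking" version="3.0"><description/></Details>'
print(invalid_banking_names(good_doc))
print(invalid_banking_names(bad_doc))
```

fullmatch() anchors the pattern at both ends, so a name with extra characters before or after the expected form is reported as invalid.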

NiFi ReplaceText: strip all xml tags between specific tags

I have the XML document below. I want to strip out all the tags between <TXT> and </TXT> to make a raw text tag in NiFi so the raw text reads like a sentence. I tried the following regex pattern in the ReplaceText processor in NiFi but the process failed--even though it captured the full txt section on regex101.com. What have I done wrong?
Client would prefer to use the built-in NiFi processors to do this rather than implement a script.
Regex
<TXT.*>((.|\n)*?)<\/TXT>$
XML
<DOC>
<ID>12345</ID>
<TXT>
<A><DESC type="PERSON">George Washington</DESC> lived in a house called <DESC type="PLACE">Mount Vernon</DESC></A>
</TXT>
</DOC>
ReplaceText configurations are as follows
Search Value: <TXT.*>((.|\n)*?)<\/TXT>$
Replacement Value: <RAW>$1</RAW>
Character Set: UTF-8
Maximum Buffer Size: 1 MB
Replacement Strategy: Regex Replace
Evaluation Mode: Entire text
Ideal output
<DOC>
<ID>12345</ID>
<RAW>George Washington lived in a house called Mount Vernon</RAW>
</DOC>
First, disclaimers:
XSLT Transformation could be what you want
A script could be what you want
To my knowledge, you can't do recursive regexps in NiFi, so you would need to chain processors:
One processor to replace <TXT>([\S\s]*?)<\/TXT> with <RAW>$1</RAW>
One processor to route on content matching <RAW>[\S\s]*?<[\S\s]*?</RAW> (i.e. RAW still contains an inner tag)
If unmatched, you're good
If it matches, remove the first tag using another ReplaceText with (<RAW>[\S\s]*?)(<[\S\s]*?>)([\S\s]*?</RAW>), keeping groups 1 and 3, then route again
This really seems overkill though, and since your text is annotated, it is likely that your client is already using Python somewhere, and should not be afraid of scripts.
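For comparison, here is roughly what the script route could look like. It is a minimal Python sketch, not a NiFi processor configuration; the function name txt_to_raw is hypothetical, and it assumes the TXT element carries no attributes and that tags never contain a literal '>' inside an attribute value.

```python
import re

TXT_BLOCK = re.compile(r"<TXT>([\s\S]*?)</TXT>")
ANY_TAG = re.compile(r"<[^>]+>")


def txt_to_raw(xml: str) -> str:
    """Replace each TXT element with a RAW element holding only its plain text."""
    def to_raw(match):
        # Drop every inner tag, then collapse runs of whitespace into single spaces.
        plain = ANY_TAG.sub("", match.group(1))
        return "<RAW>" + " ".join(plain.split()) + "</RAW>"

    return TXT_BLOCK.sub(to_raw, xml)


doc = """<DOC>
<ID>12345</ID>
<TXT>
<A><DESC type="PERSON">George Washington</DESC> lived in a house called <DESC type="PLACE">Mount Vernon</DESC></A>
</TXT>
</DOC>"""
print(txt_to_raw(doc))
```

The two-pass structure (find the block, then clean its contents) is what a single regex replace cannot do, which is why the processor-only route needs a loop.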

KML Inserting a Specific Tag between Two Other Tags Based on a Condition

TLDR - Insert the Style tag and its contents (see code blocks) between the name and description tag only if the description mentions the phrase "the office" in order to change the current Google Earth placemark from the default yellow one to a custom one...
Hi guys, I’m having a bit of trouble figuring this one out…
Using Notepad++ I am editing a Google Earth kml file where I have many placemarks that follow this XML pattern:
<Placemark>
<name>Jim</name>
<description>Jim goes to the office every day</description>
<TimeSpan>
<begin>2016-06-20T12:00:00Z</begin>
<end>2016-06-25T12:00:00Z</end>
</TimeSpan>
<Point>
<coordinates>123412341234,123412341234,1</coordinates>
</Point>
</Placemark>
I would like to find every instance of the phrase “the office”. If that text is found, the code below should be inserted between name and description in a form that Google Earth can read.
<Style id="customstyle">
<IconStyle>
<color>a1ff00ff</color>
<scale>1.5</scale>
<Icon>
<href>http://maps.google.com/mapfiles/kml/shapes/shaded_dot.png</href>
</Icon>
</IconStyle>
</Style>
Doing this would change the placemark from the default one to a custom one.
All of the tutorials I have found so far explain how to add words or phrases to the beginning or end of a line using Notepad++ regex, or how to insert text on the next line using \n.
However, I think my situation is different in that, based on a certain criterion, I want to insert my text above the line (more specifically, between the name and description tags).
The end result would look something like this (notice how the text I wanted to add is now in between the name tag and the description tag)
<Placemark>
<name>Jim</name>
<Style id="customstyle">
<IconStyle>
<color>a1ff00ff</color>
<scale>1.5</scale>
<Icon>
<href>http://maps.google.com/mapfiles/kml/shapes/shaded_dot.png</href>
</Icon>
</IconStyle>
</Style>
<description>Jim goes to the office every day</description>
<TimeSpan>
<begin>2016-06-15T12:00:00Z</begin>
<end>2016-06-20T12:00:00Z</end>
</TimeSpan>
<Point>
<coordinates>2135125,1234523451,12341234</coordinates>
</Point>
</Placemark>
Now all placemarks that follow that pattern would have a different type of placemark than the default one (and I would no longer have a headache).
Thanks in advance all.
Well, the regex doesn't really need to match something before the line.
It just needs to put something with lines before your match.
So it's still a fairly simple thing to do.
So using Notepad++
Find What : (\s+)(<description>)(?=.*?the office.*?<\/description>)
Replace with : $1<Style id="customstyle">$1\t<IconStyle>$1\t\t<color>a1ff00ff</color>$1\t\t<scale>1.5</scale>$1\t\t<Icon>$1\t\t\t<href>http://maps.google.com/mapfiles/kml/shapes/shaded_dot.png</href>$1\t\t</Icon>$1\t</IconStyle>$1</Style>$1$2
Search mode : Regular Expression
Note that the whitespaces before the description tag are put in capture group 1.
That's a trick to make the insert use the same indentation as the tag.
But you could also just put in tags without whitespaces.
Find What : (<description>)(?=.*?the office.*?<\/description>)
Replace with : <Style id="customstyle"><IconStyle><color>a1ff00ff</color><scale>1.5</scale><Icon><href>http://maps.google.com/mapfiles/kml/shapes/shaded_dot.png</href></Icon></IconStyle></Style>$1
And then use a plugin like "XML Tools" to "Pretty Print" your XML.
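The same lookahead idea can be tried outside Notepad++ as well. Below is a small Python sketch that mirrors the second, whitespace-free variant above; the function name add_style is hypothetical, and it assumes each description contains plain text with no nested tags.

```python
import re

STYLE = (
    '<Style id="customstyle"><IconStyle><color>a1ff00ff</color>'
    '<scale>1.5</scale><Icon>'
    '<href>http://maps.google.com/mapfiles/kml/shapes/shaded_dot.png</href>'
    '</Icon></IconStyle></Style>'
)

# Match an opening description tag only when its text mentions "the office".
# The lookahead checks the condition without consuming the description itself.
OFFICE_DESCRIPTION = re.compile(
    r"<description>(?=[^<]*the office[^<]*</description>)"
)


def add_style(kml: str) -> str:
    """Insert the Style block in front of every matching description tag."""
    # A function is used as the replacement so STYLE is inserted literally.
    return OFFICE_DESCRIPTION.sub(lambda m: STYLE + m.group(0), kml)


office = "<name>Jim</name>\n<description>Jim goes to the office every day</description>"
home = "<name>Bob</name>\n<description>Bob stays home</description>"
print(add_style(office))
print(add_style(home))
```

Using [^<]* instead of .*? keeps the lookahead from running past the end of the current description into a later placemark.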

Match particular CDATA sections in XML data

I am trying to write a PowerShell regex. From the page further below I want to capture two pieces of information and assign each to a variable, so I need two regexes. In the text below, the two values I need to match exactly are King and Years & Years. Note that these two values change (which is why I need to capture them); the rest of the markup stays the same.
This is the regex I have at the moment, but it's not working for me.
\s+artist\s*>\s*<\s*!\s*[CDATA\s*[(.*)\s*]\s*]\s*>\s*<\s*/artist
And here is the page (or data) I am trying to use regex with.
<on_air>
<publishedInfo publishedDate="2015-07-18 16:24:28" />
<stationName><![CDATA[Mix 106.5]]></stationName>
<stationPrefix><![CDATA[mix1065]]></stationPrefix>
<generic_coverart><![CDATA[http://media.arn.com.au/images/getImage.aspx?i=generic_mix1065.jpg]]></generic_coverart>
<now_playing>
<audio ID="id_1705168034_30458146" type="song">
<title generic="False"><![CDATA[King*]]></title>
<artist><![CDATA[Years & Years]]></artist>
<number><![CDATA[46029]]></number>
<cut><![CDATA[1]]></cut>
<ref><![CDATA[]]></ref>
<played_datetime><![CDATA[2015-07-18 16:24:27]]></played_datetime>
<length><![CDATA[00:03:28]]></length>
<coverart generic="true"><![CDATA[http://media.arn.com.au/images/getImage.aspx?i=generic_mix1065.jpg]]></coverart>
<options>
<option><![CDATA[KIIS S Integrated]]></option>
</options>
</audio>
</now_playing>
If it is valid XML, then you do not need regular expressions. PowerShell adapts XML objects, so you can use standard property syntax to navigate them:
$xml=[xml]@'
<on_air>
<publishedInfo publishedDate="2015-07-18 16:24:28" />
<stationName><![CDATA[Mix 106.5]]></stationName>
<stationPrefix><![CDATA[mix1065]]></stationPrefix>
<generic_coverart><![CDATA[http://media.arn.com.au/images/getImage.aspx?i=generic_mix1065.jpg]]></generic_coverart>
<now_playing>
<audio ID="id_1705168034_30458146" type="song">
<title generic="False"><![CDATA[King*]]></title>
<artist><![CDATA[Years & Years]]></artist>
<number><![CDATA[46029]]></number>
<cut><![CDATA[1]]></cut>
<ref><![CDATA[]]></ref>
<played_datetime><![CDATA[2015-07-18 16:24:27]]></played_datetime>
<length><![CDATA[00:03:28]]></length>
<coverart generic="true"><![CDATA[http://media.arn.com.au/images/getImage.aspx?i=generic_mix1065.jpg]]></coverart>
<options>
<option><![CDATA[KIIS S Integrated]]></option>
</options>
</audio>
</now_playing>
</on_air>
'@
$xml.on_air.now_playing.audio.title.'#cdata-section'
$xml.on_air.now_playing.audio.artist.'#cdata-section'
You want to escape bracket literals.
Also, it's a good practice to avoid using the dot "match almost any character" metacharacter when your intentions are more specific. In your case, what you really want to do is match until you hit the closing bracket, so it's safer to specify that:
'\s+artist\s*>\s*<\s*!\s*\[CDATA\s*\[([^]]*)\s*\]\s*\]\s*>\s*<\s*\/artist'
Note: Regex is contextual, so the reason I don't have to escape the closing bracket within the character class is because of its position, i.e., being the first character specified in the negated class--in that context, it cannot be the closing bracket for the character class. In other words, it's not ambiguous.
To help you get off the ground, here is a suggestion for Years & Years (insert a whitespace selector wherever needed):
artist><!\[CDATA\[Years & Years\]\]></artist
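Both approaches from the answers above, reading the values through an XML parser and matching with escaped brackets, can be sketched side by side. The example below is in Python rather than PowerShell and is only illustrative: the function names now_playing and artist_by_regex are hypothetical, and the sample is a trimmed copy of the feed with the closing </on_air> tag added so that it parses.

```python
import re
import xml.etree.ElementTree as ET

sample = """<on_air>
<now_playing>
<audio ID="id_1705168034_30458146" type="song">
<title generic="False"><![CDATA[King*]]></title>
<artist><![CDATA[Years & Years]]></artist>
</audio>
</now_playing>
</on_air>"""


def now_playing(xml_text):
    """Parser route: CDATA content is exposed as ordinary element text."""
    audio = ET.fromstring(xml_text).find("now_playing/audio")
    return audio.findtext("title"), audio.findtext("artist")


# Regex route: every literal bracket in the CDATA markers is escaped.
ARTIST_CDATA = re.compile(
    r"<artist>\s*<!\[CDATA\[(.*?)\]\]>\s*</artist>", re.DOTALL
)


def artist_by_regex(xml_text):
    match = ARTIST_CDATA.search(xml_text)
    return match.group(1) if match else None


print(now_playing(sample))
print(artist_by_regex(sample))
```

The parser route needs no escaping at all, which is the main argument for preferring it whenever the input is well-formed XML.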

XSL disable-output-escaping removes whitespaces

Part of the XML:
<text><b>Title</b> <b>Happy</b></text>
In my XSL I have:
<xsl:value-of select="text" disable-output-escaping="yes" />
My output becomes
**TitleHappy**
My spacing went missing - there's supposed to be a space between </b> and <b>.
I tried normalize-space(), it doesn't work.
Any suggestions? Thanks!
If you want whitespace emitted from the XSL, use:
<xsl:text> </xsl:text>
Whitespace is only preserved if it's recognized as part of a text node (e.g. in " a " both spaces will be recognized).
Whitespace from the original source XML has to be preserved by telling the parser, for example:
parser.setPreserveWhitespace(true);
As you're outputting HTML, you could substitute your space with a non-breaking space.
Do you have any control over the generation of the original XML? If so, you could try adding xml:space="preserve" to the text element which should tell the processor to keep the whitespace.
<text xml:space="preserve"><b>Title</b> <b>Happy</b></text>
Alternatively, try looking at the "xsl:preserve-space" element in XSLT.
<xsl:preserve-space elements="text"/>
Although I have never used this personally, it might be of some help. See W3Schools for more information.
Thank you, everyone, for your input.
Currently I am using MattH's suggestion, which is to test for the space and substitute a non-breaking space. Another method I thought of is to test for "</b> <b>" and substitute " </b><b>"; a space contained within the bold tags is actually output. Both methods worked, though I don't know what the implications are. And I still can't figure out why the space is removed when it sits between two separate bold tags.
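One plausible explanation, consistent with the answers above, is that the space between the two bold elements is a whitespace-only text node, and that the parser in use discards such nodes before the stylesheet ever sees them. The Python sketch below only illustrates that idea with the standard library parser (which keeps the node); the helper drop_whitespace_only_nodes is hypothetical and merely imitates what a whitespace-stripping parser would do.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring("<text><b>Title</b> <b>Happy</b></text>")

# The space lives between the two <b> elements, as the "tail" of the first one.
space_between = root[0].tail
full_text = "".join(root.itertext())


def drop_whitespace_only_nodes(element):
    """Imitate a parser that discards text nodes made up only of whitespace."""
    for node in element.iter():
        if node.text is not None and not node.text.strip():
            node.text = None
        if node.tail is not None and not node.tail.strip():
            node.tail = None
    return element


stripped_text = "".join(drop_whitespace_only_nodes(root).itertext())
print(repr(space_between), full_text, stripped_text)
```

This would also explain why moving the space inside a bold tag works: it then belongs to a text node that contains other characters, so it is no longer whitespace-only.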