Parsing complex xml tag - xslt

I have the XML file below:
<sect2>
<title>Prophylaxis</title>
<para><calc type="weight"/> EXAMPLE DATA</para>
<para>2 months</para>
</sect2>
I tried some regular expressions, but had no luck.
I want to extract "EXAMPLE DATA" using an XSLT template.

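I am assuming that you want an XPath expression to extract the data from the XML. Select the para that contains the calc element, and trim the leading space with normalize-space():
<xsl:value-of select="normalize-space(//sect2/para[calc])"/>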
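To check the node structure outside XSLT, here is a minimal Python sketch using the standard library's xml.etree.ElementTree. It shows that the wanted string is not the content of the calc element: it is the text that follows calc inside the first para, which ElementTree exposes as the element's tail. The variable names are illustrative assumptions, not part of the original question.

```python
import xml.etree.ElementTree as ET

xml_text = """<sect2>
<title>Prophylaxis</title>
<para><calc type="weight"/> EXAMPLE DATA</para>
<para>2 months</para>
</sect2>"""

root = ET.fromstring(xml_text)

# <calc/> is empty; the text after it belongs to its "tail".
calc = root.find("para/calc")
example_data = (calc.tail or "").strip()
print(example_data)
```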

Related

NiFi ReplaceText: strip all xml tags between specific tags

I have the XML document below. I want to strip out all the tags between <TXT> and </TXT> to make a raw text tag in NiFi so the raw text reads like a sentence. I tried the following regex pattern in the ReplaceText processor in NiFi but the process failed--even though it captured the full txt section on regex101.com. What have I done wrong?
Client would prefer to use the built-in NiFi processors to do this rather than implement a script.
Regex
<TXT.*>((.|\n)*?)<\/TXT>$
XML
<DOC>
<ID>12345</ID>
<TXT>
<A><DESC type="PERSON">George Washington</DESC> lived in a house called <DESC type="PLACE">Mount Vernon</DESC></A>
</TXT>
</DOC>
ReplaceText configurations are as follows
Search Value: <TXT.*>((.|\n)*?)<\/TXT>$
Replacement Value: <RAW>$1</RAW>
Character Set: UTF-8
Maximum Buffer Size: 1 MB
Replacement Strategy: Regex Replace
Evaluation Mode: Entire text
Ideal output
<DOC>
<ID>12345</ID>
<RAW>George Washington lived in a house called Mount Vernon</RAW>
</DOC>
First, disclaimers:
An XSLT transformation could be what you want.
A script could be what you want.
To my knowledge, you can't do recursive regex in NiFi, so you would need to chain processors:
One ReplaceText processor to replace <TXT>([\S\s]*?)<\/TXT> with <RAW>$1</RAW>
One processor to route on content matching <RAW>[\S\s]*?<[\S\s]*?</RAW> (that is, RAW still contains an inner tag)
If it does not match, you're done
If it matches, remove the first inner tag with another ReplaceText, using (<RAW>[\S\s]*?)(<[\S\s]*?>)([\S\s]*?</RAW>) as the search value and $1$3 as the replacement, then send the result back to the routing step
This really seems like overkill, though. Since your text is annotated, it is likely that your client is already using Python somewhere and should not be afraid of scripts.
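For comparison, the script route mentioned above might look roughly like the following Python sketch. It uses only the standard library's xml.etree.ElementTree; the function name txt_to_raw is a hypothetical helper, and how such a script would be wired into a NiFi flow is left open here.

```python
import xml.etree.ElementTree as ET

def txt_to_raw(xml_text: str) -> str:
    """Turn each <TXT> element into a <RAW> element holding only its text."""
    root = ET.fromstring(xml_text)
    for txt in list(root.iter("TXT")):
        # itertext() yields the text of the element and all its descendants
        # in document order; split/join collapses runs of whitespace.
        sentence = " ".join("".join(txt.itertext()).split())
        for child in list(txt):
            txt.remove(child)
        txt.tag = "RAW"
        txt.text = sentence
    return ET.tostring(root, encoding="unicode")

sample = """<DOC>
<ID>12345</ID>
<TXT>
<A><DESC type="PERSON">George Washington</DESC> lived in a house called <DESC type="PLACE">Mount Vernon</DESC></A>
</TXT>
</DOC>"""

result = txt_to_raw(sample)
print(result)
```

Because the parser handles nesting, no loop over "first remaining inner tag" is needed, however deeply the annotations are nested.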

Regex find all XML values based on subvalue

I have the following XML code:
<quantity1 value="foo" name="bar">
<subquantity duration="2">
<parameter unit="meters" />
</subquantity>
</quantity1>
I want to export all names for further analysis in another document, but only if they have a certain subvalue. For example, how can I use regex to find all names based on if unit="meters"?
Bonus points if you can instruct how to do this in Notepad++. Open to other suggestions/SO posts as well.
Regular expressions are wrong for parsing XML.
Use XPath in XSLT or a scripting language or xmlstarlet instead.
Examples:
//quantity1[subquantity/parameter/@unit="meters"]/@name
//*[*/*/@unit="meters"]/@name
//*[.//@unit="meters"]/@name
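As one scripting-language option, here is a small Python sketch with the standard library's xml.etree.ElementTree. ElementTree supports only a subset of XPath, so the filter that the last expression performs is split into a loop plus a simple descendant lookup. The wrapping root element and the second quantity1 (with unit="feet") are assumptions added so the filter has something to reject.

```python
import xml.etree.ElementTree as ET

xml_text = """<root>
<quantity1 value="foo" name="bar">
<subquantity duration="2">
<parameter unit="meters" />
</subquantity>
</quantity1>
<quantity1 value="foo2" name="baz">
<subquantity duration="3">
<parameter unit="feet" />
</subquantity>
</quantity1>
</root>"""

root = ET.fromstring(xml_text)

# Keep the name of every element that has a name attribute and
# some descendant carrying unit="meters".
names = [
    el.get("name")
    for el in root.iter()
    if el.get("name") is not None
    and el.find(".//*[@unit='meters']") is not None
]
print(names)
```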

REGEXP_LIKE to match xml tag content that is not like a specific string

I'm trying to do a regular expression matching with REGEXP_LIKE and I'm looking for a regexp to find if the value of a specific tag is not a specific string.
For example:
<person>
<name>John</name>
<age>40</age>
</person>
My goal is to validate that the name tag's value is not John, so the REGEXP_LIKE would return true for input xmls where name is not John.
Thank you in advance for the help!
A quick and easy way to do this is to simply negate the regex search:
... WHERE NOT REGEXP_LIKE(column_name, '<name>John</name>')
However, as should be mentioned every time a question like this is posted, it's generally a bad idea to parse XML with regex. If you find yourself constructing more complex regex patterns to search this XML data, then you should:
Use an XML parser instead of regular expressions, or
Change how you are storing the data! Make person.age a separate table column; don't bung the entire XML structure into a single place.
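To illustrate the XML parser option outside the database, here is a minimal Python sketch using the standard library. The helper name name_is_not is hypothetical, and treating a missing name element as "not John" is an assumption of this sketch.

```python
import xml.etree.ElementTree as ET

def name_is_not(xml_text: str, forbidden: str) -> bool:
    """Return True when the <name> child of the root differs from `forbidden`."""
    root = ET.fromstring(xml_text)
    # A missing <name> is read as an empty string, so it counts as "not forbidden".
    name = root.findtext("name", default="")
    return name.strip() != forbidden

john = "<person>\n<name>John</name>\n<age>40</age>\n</person>"
jane = "<person>\n<name>Jane</name>\n<age>35</age>\n</person>"
print(name_is_not(john, "John"), name_is_not(jane, "John"))
```

Unlike the literal pattern, this also treats a value padded with whitespace, such as <name> John </name>, as John.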

xml Regex matching the whole xml file

I need a regular expression that, given the following XML, will give me all the products (productos) that have 'Bebidas' as a category (categoria). I have to do this in Sublime Text, so I only have the option to use a regular expression (no dedicated XML parser allowed):
XML File www.ethgf.com/electricos.xml
I have a problem when I use (?s)<producto>(.+?Bebidas.+?)<\/producto> because it highlights almost all the XML (the first 'producto' tag through the last tag closure).
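Since the question is about selecting whole product nodes, you can use the following regex. It is written with English tag names; for the file in the question, replace product, category, and Drinks with producto, categoria, and Bebidas:
(?s)<product>(?:\s*<(\w+)>[^<]*?<\/\1>\s*)*?\s*<category>Drinks<\/category>(?:\s*<(\w+)>[^<]*?<\/\2>\s*)*?\s*<\/product>
It will highlight all <product> nodes that contain the Drinks category, regardless of the order of their child elements. It cannot run past a closing </product> tag, because between the opening tag and the category it only accepts complete child elements with plain text content.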
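The pattern can be tried outside the editor with Python's re module, which supports the same backreferences. The sample XML below is invented for this sketch (the file from the question is not reproduced here), and it assumes each child element contains only text.

```python
import re

pattern = re.compile(
    r"(?s)<product>(?:\s*<(\w+)>[^<]*?<\/\1>\s*)*?\s*<category>Drinks<\/category>"
    r"(?:\s*<(\w+)>[^<]*?<\/\2>\s*)*?\s*<\/product>"
)

# Hypothetical sample: two Drinks products with different child order.
sample = """<products>
<product>
<name>Cola</name>
<category>Drinks</category>
<price>1.50</price>
</product>
<product>
<name>Lamp</name>
<category>Lighting</category>
</product>
<product>
<category>Drinks</category>
<name>Tea</name>
</product>
</products>"""

matches = [m.group(0) for m in pattern.finditer(sample)]
matched_names = [re.search(r"<name>([^<]*)</name>", m).group(1) for m in matches]
print(matched_names)
```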

How to avoid comma in comma delimited CSV with using XSLT

I'm using XSLT to produce CSV files. Sometimes the XML contains text like "a, b". When you open the CSV file in Excel as comma-delimited, Excel splits that value into separate columns, but I want it to stay in a single column. Is there a way to do that in the XSLT?
To keep such a value in a single column, you need to quote values that contain commas. This is possible in XSLT, but the details depend on your stylesheet design. If you want a more precise answer, please share your code. Generally, you can use the following template to wrap any text nodes of interest in quotes:
<xsl:template match="text()[contains(., ',')]">
<xsl:value-of select="concat('&quot;', ., '&quot;')"/>
</xsl:template>
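The quoting rule that the template relies on can be sketched in a few lines of Python. This follows the RFC 4180 convention: a field containing a comma, a double quote, or a line break is wrapped in double quotes, and any embedded double quote is doubled. The helper name csv_field is hypothetical, and the template above does not handle embedded quotes, so this sketch goes one step further than it does.

```python
def csv_field(value: str) -> str:
    """Quote a single CSV field following the RFC 4180 convention."""
    if any(ch in value for ch in ',"\r\n'):
        # Double embedded quotes, then wrap the whole field in quotes.
        return '"' + value.replace('"', '""') + '"'
    return value

row = ["a, b", 'say "hi"', "plain"]
line = ",".join(csv_field(v) for v in row)
print(line)
```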
You can get clues from the open-source CSV-to-XML package in XSLT 2.0 that I've published in the "Free Developer" section of my web site: http://www.CraneSoftwrights.com/resources/#csv. It follows RFC 4180: http://www.ietf.org/rfc/rfc4180.txt
The idea is to look for quotes first, and then for commas when there are no quotes. This can be expressed as a regex, as I have done in the code cited above.