Remove Ampersand from an input xml during xslt transformation - xslt

I am new to XSLT. I have an input XML document that has to be sorted. I built an XSLT stylesheet to produce the sorted output, and the sorting works fine. But if one of the tags in the input XML (one that the sort is not based on) contains '&', the XSLT transformation fails while parsing the input XML.
How do I escape the '&' in the input XML so that parsing does not fail?

When you say "the input XML contains '&'", that's technically incorrect: if it contains an unescaped '&' then it's not XML.
If you're asking how to escape the ampersand correctly while generating the XML, then that depends on the tools you are using to generate the XML.
If you're asking how to fix broken XML containing unescaped ampersands, then the answer in the general case is that you can't (there's a reason they have to be escaped, after all). But a search for XML and ampersand will find plenty of suggestions for partial solutions.

Thank you very much for your response. As you said, the XML had an unescaped '&'. Example: Test A & B. So the XSLT transformation was failing while parsing the XML. After changing the source system to escape the ampersand, the parsing works fine.
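For reference, when the XML is produced by code, the escaping can usually be left to a library. The following is a minimal sketch in Python, not the asker's actual source system: the element name `name` is hypothetical, and the value is the "Test A & B" example from above. It uses `xml.sax.saxutils.escape` and `xml.etree.ElementTree` from the standard library.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

raw_value = "Test A & B"  # sample value from the question

# Unescaped: the parser rejects the document before any XSLT could run.
broken = f"<name>{raw_value}</name>"  # 'name' is a hypothetical element
try:
    ET.fromstring(broken)
    broken_parses = True
except ET.ParseError:
    broken_parses = False

# Escaped at generation time: '&' becomes '&amp;' and the document parses.
fixed = f"<name>{escape(raw_value)}</name>"
parsed_text = ET.fromstring(fixed).text

print(broken_parses)  # False
print(fixed)          # <name>Test A &amp; B</name>
print(parsed_text)    # Test A & B
```

The parser turns `&amp;` back into `&`, so the stylesheet sees the original text.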

Related

Issue with regex expression used to extract timestamps from XML file

I am trying to use regex in my C++ program to extract timestamps, among other things, from an XML file. Right now I am focusing on a regular expression that extracts six particular timestamps from the file. Unfortunately, my expression does not locate them. The expression I have created is: \2\0\1\4\\-\0\7\-\0\8\T\1\8\:\1\4\:\.\.\\.\7\1\6\Z. In the XML file linked below, I am trying to extract the timestamps from six lines in particular (lines 72, 75, 78, 81, 84, and 87). Could someone help me point out what I am doing wrong? Sorry, I'm just familiarizing myself with regex for the first time. I am using http://regexr.com/ to test my expressions.
Link to XML file: http://pastebin.com/5hMy9RzK
Six timestamps which I want my regex expression to locate:
timestamp="2014-07-08T18:14:17.716Z"
timestamp="2014-07-08T18:14:18.716Z"
timestamp="2014-07-08T18:14:19.716Z"
timestamp="2014-07-08T18:14:20.716Z"
timestamp="2014-07-08T18:14:21.716Z"
timestamp="2014-07-08T18:14:22.716Z"
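Your expression looks strange: you are escaping every literal character with a \, which is normally needed only for special characters. Escaping ordinary characters can also change their meaning (for example, an escaped digit such as \1 is not read as the literal digit 1).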
Is this what you're looking for?
\d\d\d\d-\d\d-\d\d\w\d\d:\d\d:\d\d\.716Z
Example:
http://regexr.com/3cbs2
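To try the suggested expression outside regexr, here is a small sketch in Python (the question is about C++, so this only illustrates the pattern itself). The `<point .../>` lines are hypothetical stand-ins for the linked file; only the timestamp values come from the question. The second pattern is an assumed, slightly stricter variant that spells out the `T` and uses counted repetition.

```python
import re

# Hypothetical sample lines; the real file is at the pastebin link above.
sample = "\n".join(
    f'<point timestamp="2014-07-08T18:14:{second}.716Z"/>'
    for second in range(17, 23)
)

# Pattern from the answer: \w happens to match the literal 'T'.
pattern = re.compile(r"\d\d\d\d-\d\d-\d\d\w\d\d:\d\d:\d\d\.716Z")
timestamps = pattern.findall(sample)

# Assumed stricter variant: literal 'T', any three millisecond digits.
strict = re.compile(r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{3}Z")
strict_timestamps = strict.findall(sample)

print(len(timestamps))  # 6
print(timestamps[0])    # 2014-07-08T18:14:17.716Z
```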

Problems with finding and replacing

Hey Stack Overflow community. I need help with a huge data file. Is it possible, with a regular expression, to find this tag:
<category_name><![CDATA[Prekiniai ženklai&gt;Adler|Kita buitinė technika&gt;Buičiai naudingi prietaisai|Kita buitinė technika&gt;Lygintuvai]]></category_name>
and somehow remove all the other data, leaving only 'Adler' or 'Lygintuvai'? I'm using Altova to edit XML files, so I can't find any way other than find-and-replace. I'm also new to regex, so I thought maybe you could help me.
#\<category_name\>.+?gt\;([\w]+?)\|.+?gt;([\w]+?)\]\]\>\<\/category_name\>#i
\1 - Adler
\2 - Lygintuvai
PHP
regex101.com
The captured fields may contain only alphanumeric characters, without spaces.
If you want to change the set of accepted characters, replace [\w] with something else:
[a-z] - only letters
[0-9] - only digits
etc.
It's possible, but use of regular expressions to process XML will never be 100% correct (you can prove that using computer science theory), and it may also be very inefficient. For example, the solution given by Luk is incorrect because it doesn't allow whitespace in places where XML allows it. Much better to use XQuery or XSLT, both of which are designed for the job (and both work in Altova). You can then use XPath expressions to locate the element or attribute nodes you are interested in, and you can still use regular expressions (e.g. in the XPath replace() function) to process the content of text or attribute nodes.
Incidentally, your input is rather strange because it uses escape sequences like &gt; within a CDATA section; but XML escape sequences are not recognized in a CDATA section.
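As an illustration of letting an XML parser find the element and then using plain string operations on its content, here is a rough sketch in Python rather than XQuery or XSLT. It assumes the CDATA section contains the literal text `&gt;` as a separator (which is what the regex above searches for) and that `|` separates the category paths.

```python
import xml.etree.ElementTree as ET

xml_text = (
    "<category_name><![CDATA[Prekiniai ženklai&gt;Adler|"
    "Kita buitinė technika&gt;Buičiai naudingi prietaisai|"
    "Kita buitinė technika&gt;Lygintuvai]]></category_name>"
)

element = ET.fromstring(xml_text)

# Inside CDATA, '&gt;' is NOT unescaped, so it arrives as five literal characters.
paths = element.text.split("|")
leaves = [path.split("&gt;")[-1] for path in paths]

first_and_last = (leaves[0], leaves[-1])
print(first_and_last)  # ('Adler', 'Lygintuvai')
```

Because the parser handles the markup, whitespace or attribute changes around `category_name` do not affect the result.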

JavaScript Regular expression text parsing

I have a string like the following
~~<b>A<i>C</i></b>~~/~~<u>D</u><b>B</b>~~has done this.
I am trying to get the text inside <b> tag. I am trying
<b>(.+)</b>
But I am getting <b>A<i>C</i></b>~~/~~<u>D</u><b>B</b>, but I need <b>A<i>C</i></b> as first match and <b>B</b> as the second match
Can anyone please help?
You need to use a non-greedy quantifier:
<b>(.+?)</b>
This will ensure that the match stops at the first </b> it finds.
However, I would generally recommend using a proper XML or HTML parser for this sort of thing. Regular expressions are simply not powerful enough to handle the recursive structure of XML.
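The difference between the greedy and non-greedy quantifier can be checked with a short sketch. It is written in Python rather than JavaScript, but the two patterns behave the same way in both for this input.

```python
import re

text = "~~<b>A<i>C</i></b>~~/~~<u>D</u><b>B</b>~~has done this."

# Greedy: .+ runs to the LAST </b>, so there is a single oversized match.
greedy = re.findall(r"<b>(.+)</b>", text)

# Non-greedy: .+? stops at the FIRST </b> after each <b>.
lazy = re.findall(r"<b>(.+?)</b>", text)

print(greedy)  # ['A<i>C</i></b>~~/~~<u>D</u><b>B']
print(lazy)    # ['A<i>C</i>', 'B']
```

Note that the non-greedy version would still go wrong if a `<b>` element were nested inside another `<b>`, which is one reason a parser is the safer choice.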

How to write a regular expression pattern for this scenario

I am trying to find where special characters appear in the sample XML below.
<?xml version="1.0"?>
<PayLoad>
<requestRows>****</requestRows>
<requestRowLength>1272</requestRowLength>
<exceptionTimestamp>2012070202281068-0700</exceptionTimestamp>
<exceptionTimestamp>201$2070202281068-0700</exceptionTimestamp>
<exceptionTimestamp>20120(702022810680700</exceptionTimestamp>
<exceptionDetail>NO DATA AVAILABLE FOR TIME PERIOD SPECIFIED =</exceptionDetail>
</PayLoad>
I have to find the entire tags that contain the characters $, (, = or -. For this I have written the regular expression pattern below:
(<[\w\d]*>\w*(?<value>[^\w]+)\w*\d*</[\w\d]*>)
and it returns the following output (running in the Expresso tool):
<requestRows>****</requestRows>
<exceptionTimestamp>2012070202281068-0700</exceptionTimestamp>
<exceptionTimestamp>20120(702022810680700</exceptionTimestamp>
but it should also return the two entries below:
<exceptionTimestamp>201$2070202281068-0700</exceptionTimestamp>
<exceptionDetail>NO DATA AVAILABLE FOR TIME PERIOD SPECIFIED =</exceptionDetail>
These entries are omitted because they contain more than one special character (including spaces). Can anyone please give me a correct regular expression for the above scenario?
Thanks.
I would use lookaround for the mid part, so instead of
(<[\w\d]*>\w*(?<value>[^\w]+)\w*\d*</[\w\d]*>)
I would use
(<[\w\d]*>(?=[^<]*[^<\w])(?<value>.*)</[\w\d]*>)
Without the ?<value> part that I don't really recognise the syntax of, this becomes
(<[\w\d]*>(?=[^<]*[^<\w]).*</[\w\d]*>)
Just add capturing groups where you like if you want to save anything in particular.
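Here is a quick way to check the lookahead version against the sample document, sketched in Python. One assumption to flag: Python spells named groups `(?P<value>...)`, whereas the `(?<value>...)` form in the question is the .NET spelling used by Expresso. The pattern is applied line by line.

```python
import re

xml_lines = """<?xml version="1.0"?>
<PayLoad>
<requestRows>****</requestRows>
<requestRowLength>1272</requestRowLength>
<exceptionTimestamp>2012070202281068-0700</exceptionTimestamp>
<exceptionTimestamp>201$2070202281068-0700</exceptionTimestamp>
<exceptionTimestamp>20120(702022810680700</exceptionTimestamp>
<exceptionDetail>NO DATA AVAILABLE FOR TIME PERIOD SPECIFIED =</exceptionDetail>
</PayLoad>""".splitlines()

# The lookahead requires at least one non-word character before the next '<'.
pattern = re.compile(r"(<[\w\d]*>(?=[^<]*[^<\w])(?P<value>.*)</[\w\d]*>)")

values = []
for line in xml_lines:
    match = pattern.search(line)
    if match:
        values.append(match.group("value"))

for value in values:
    print(value)
```

This keeps every element whose content has any non-word character, so `<requestRows>****</requestRows>` is included and `<requestRowLength>1272</requestRowLength>` is not. To restrict it to specific characters, the class `[^<\w]` could be replaced with something like `[$(=-]`.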

Regular Expression find a phrase not inside an HTML tag

I'm struggling a bit with this regular expression and wondered if anyone was about to help me please?
What I need to do is isolate the 1st phrase inside a string which is NOT inside an HTML tag. So the examples I have at the moment are:
This is some test text about <acronym
title="Incomplete Test Syndrome"
class="CustomClass">ITS</acronym> for
the **ITS** department. Also worth
mentioning ABS as well I guess.ITS,
... and ...
This is some **ITS** test text about
<acronym title="Incomplete Test
Syndrome"
class="GOTManager">ITS</acronym> for
the ITS department. Also worth
mentioning ABS as well I guess
So in the first example I want it to ignore the wrapped ITS and give me the ITS at the end of the 1st sentence.
In the second example I want it to return the ITS at the start of the 2nd sentence.
The aim is to replace these with my own custom wrapped acronym tags in a ColdFusion application I'm writing.
Thanks a lot,
James
As the commenters have pointed out, regular expressions are not a good tool for working with XML/HTML-like texts. That is because being "inside" something is very hard to check for in general (you never know which of the possibly unlimited nesting levels you are in).
For your particular examples, though, it is possible to do. This relies heavily on not having any nested tags. If you do, you should seriously try a different approach.
Your examples work with
^(?:<[^<]*<[^>]*>|.)*?(ITS)
This matches the entire string up to the first occurrence of ITS that is not in a tag (and captures that occurrence in its first capturing group), so it should be easy to extract the data you need from there. Matching only this instance of ITS is not possible, since your implementation of regular expressions does not support arbitrary-length look-behinds.
Ask if you want/need the expression explained. =)
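To show how the first capturing group can be used for the replacement, here is a sketch in Python rather than ColdFusion. The sample strings are the two examples from the question joined onto single lines and cut after the first sentence, the replacement markup is a hypothetical custom tag, and `re.DOTALL` is added so that `.` would also cross line breaks.

```python
import re

# Alternative 1 skips a whole <tag ...>...</tag> element; alternative 2 skips one character.
pattern = re.compile(r"^(?:<[^<]*<[^>]*>|.)*?(ITS)", re.DOTALL)

first = ('This is some test text about <acronym title="Incomplete Test Syndrome" '
         'class="CustomClass">ITS</acronym> for the **ITS** department.')
second = ('This is some **ITS** test text about <acronym title="Incomplete Test Syndrome" '
          'class="GOTManager">ITS</acronym> for the ITS department.')

# Hypothetical replacement markup, for illustration only.
WRAPPED = '<acronym title="Incomplete Test Syndrome">ITS</acronym>'


def wrap_first_unwrapped(text):
    """Replace only the text captured by group 1, leaving everything else as is."""
    match = pattern.search(text)
    if match is None:
        return text
    start, end = match.span(1)
    return text[:start] + WRAPPED + text[end:]


print(wrap_first_unwrapped(first))
print(wrap_first_unwrapped(second))
```

As noted above, this only holds as long as the tags are not nested.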
I will tell you the same thing I told you when you asked a very similar question:
Stuck with Regular Expression code to apply HTML tag to text but exclude if inside <?> tag
You CANNOT parse HTML, including nested elements, with pure regular expressions. This is a known limitation of regex and is well documented.
You can try installing and using an external regular expression engine with extensions, which might work. You can manually walk the string, counting the nesting to see if the string you are looking at is wrapped. You can use a genuine HTML parser, like WebKit, to do this externally.
But you can't do it with regex. Please look for an alternative. Heck, we'll even help.
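As one illustration of the "count the nesting" and "use a real parser" options, here is a rough sketch using Python's standard `html.parser` (not ColdFusion, and not WebKit). The class and function names are made up for this example. It assumes well-nested markup with no void or self-closing tags, comments, or entity references, since the rebuilt output does not try to preserve those.

```python
from html.parser import HTMLParser


class FirstUnwrappedReplacer(HTMLParser):
    """Rebuilds the markup, replacing the first `phrase` found outside any element."""

    def __init__(self, phrase, replacement):
        super().__init__()
        self.phrase = phrase
        self.replacement = replacement
        self.depth = 0          # number of currently open elements
        self.replaced = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        self.depth += 1
        self.parts.append(self.get_starttag_text())  # raw text of the start tag

    def handle_endtag(self, tag):
        self.depth -= 1
        self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        # Only text at nesting depth 0 is eligible for replacement.
        if self.depth == 0 and not self.replaced and self.phrase in data:
            data = data.replace(self.phrase, self.replacement, 1)
            self.replaced = True
        self.parts.append(data)


def replace_first_unwrapped(text, phrase, replacement):
    parser = FirstUnwrappedReplacer(phrase, replacement)
    parser.feed(text)
    parser.close()
    return "".join(parser.parts)


sample = ('This is some test text about <acronym title="Incomplete Test Syndrome" '
          'class="CustomClass">ITS</acronym> for the **ITS** department.')
wrapped = '<acronym title="Incomplete Test Syndrome">ITS</acronym>'  # hypothetical
result = replace_first_unwrapped(sample, "ITS", wrapped)
print(result)
```

Unlike the regex approach, the depth counter keeps working when elements are nested inside each other.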
You say:
The aim is to replace these with my
own custom wrapped acronym tags in a
ColdFusion application I'm writing.
It sounds like using XSL might be more appropriate than regex to transform one tag into another.
UPDATE:
Just threw this together, it seems to work for simple cases:
(NOTE: this will simply strip out the 'acronym' tags. You could use XSL to replace them with your own custom tags, but you didn't specify anything along those lines so I didn't get into that)
XSL:
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:template match="*[name() = 'acronym']" />
</xsl:stylesheet>
Input:
<?xml version="1.0" encoding="UTF-8"?>
<root>
This is some test text about <acronym
title="Incomplete Test Syndrome"
class="CustomClass">ITS</acronym> for
the **ITS** department. Also worth
mentioning ABS as well I guess.ITS,
This is some **ITS** test text about
<acronym title="Incomplete Test
Syndrome"
class="GOTManager">ITS</acronym> for
the ITS department. Also worth
mentioning ABS as well I guess
</root>
Output:
<?xml version="1.0" encoding="UTF-8"?>
This is some test text about for
the **ITS** department. Also worth
mentioning ABS as well I guess.ITS,
This is some **ITS** test text about
for
the ITS department. Also worth
mentioning ABS as well I guess
UPDATE:
You said:
So in the first example I want it to
ignore the wrapped ITS and give me the
ITS at the end of the 1st sentence.
In the second example I want it to
return the ITS at the start of the 2nd
sentence.
This makes no sense. Your second example doesn't have "ITS" in the second sentence. I think what you meant was that the **ITS** is what you want to have extracted.
The XSL sample I gave only strips the <acronym/> tags, but after that's done you can try to find the ITS at different points in the sentence, and for that a regex might be easy enough (this assumes that you ONLY have to worry about the <acronym/> tags).
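The same two-step idea (strip the wrapped acronyms, then search the remaining text) can be sketched in Python with the standard library instead of XSL. The helper name `text_without` is made up for this example, the input is a shortened single-line version of the first sample, and the sketch assumes the `acronym` elements are direct children of the root.

```python
import re
import xml.etree.ElementTree as ET

xml_text = (
    "<root>This is some test text about "
    '<acronym title="Incomplete Test Syndrome" class="CustomClass">ITS</acronym>'
    " for the **ITS** department.</root>"
)


def text_without(root, tag):
    """Return the text of root, dropping direct children named `tag`."""
    parts = [root.text or ""]
    for child in root:
        if child.tag != tag:
            parts.append("".join(child.itertext()))  # keep text of other children
        parts.append(child.tail or "")               # text following the child
    return "".join(parts)


stripped = text_without(ET.fromstring(xml_text), "acronym")
match = re.search(r"ITS", stripped)

print(stripped)
print(match.start())
```

After stripping, the only remaining ITS is the unwrapped one, so a plain search is enough to locate it.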