XslCompiledTransform.Transform replaces "&quot;" with real quotes - xslt

Doing something like this:
using (XmlWriter myMamlHelpWriter = XmlWriter.Create(myFileStream, XmlHelpExToMamlXslTransform.OutputSettings))
{
XmlHelpExToMamlXslTransform.Transform(myMsHelpExTopicFilePath, null, myMamlHelpWriter);
}
where
private static XslCompiledTransform XmlHelpExToMamlXslTransform
{
get
{
if (fMsHelpExToMamlXslTransform == null)
{
// Create the XslCompiledTransform and load the stylesheet.
fMsHelpExToMamlXslTransform = new XslCompiledTransform();
using (Stream myStream = typeof(XmlHelpBuilder).Assembly.GetManifestResourceStream(
typeof(XmlHelpBuilder),
MamlXmlTopicConsts.cMsHelpExToMamlTransformationResourceName))
{
XmlTextReader myReader = new XmlTextReader(myStream);
fMsHelpExToMamlXslTransform.Load(myReader, null, null);
}
}
return fMsHelpExToMamlXslTransform;
}
}
And every time the string "&quot;" is replaced with real quotes in the result file.
I cannot understand why this happens...

The reason is that in the XSLT's internal representation, &quot; is exactly the same character as ". They both represent the code point 0x22 (decimal 34). It would seem that when the XslCompiledTransform produces its output, it uses " where it's legal to do so. I would imagine that it would still output &quot; inside an attribute value.
Is it a problem for you that &quot; is produced as " in the output?
I just ran the following XSLT in Visual Studio using an arbitrary input file:
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="xml" indent="yes"/>
<xsl:template match="/*">
<xml>
<xsl:variable name="chars">&quot;&apos;&lt;&gt;&amp;</xsl:variable>
<node a='{$chars}' b="{$chars}">
<xsl:value-of select="$chars"/>
</node>
</xml>
</xsl:template>
</xsl:stylesheet>
The output was:
<xml>
<node a="&quot;'&lt;&gt;&amp;" b="&quot;'&lt;&gt;&amp;">"'&lt;&gt;&amp;</node>
</xml>
As you can see, even though all five characters were represented as entities originally, the apostrophes are produced as ' everywhere, and quotation marks are produced as " in text nodes. Furthermore, the a attribute, which had ' delimiters, uses " delimiters in the output. As I said, as far as the XSLT cares, a quotation mark is just a quotation mark, and an attribute is just an attribute. How those are produced in the output is up to the XSLT processor.
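The same effect is easy to reproduce outside .NET. Here is a minimal sketch, assuming Python's standard xml.etree.ElementTree as a stand-in for XslCompiledTransform (the helper name reserialize is made up for illustration). Once the input is parsed only the characters remain, and the serializer alone decides how to escape them:

```python
import xml.etree.ElementTree as ET


def reserialize(xml_text: str) -> str:
    """Parse XML and write it back out; the parser resolves every entity first."""
    return ET.tostring(ET.fromstring(xml_text), encoding="unicode")


# All five characters are written as entities, and the attribute uses ' delimiters.
source = "<node a='&quot;&apos;&lt;&gt;&amp;'>&quot;&apos;&lt;&gt;&amp;</node>"

root = ET.fromstring(source)
print(repr(root.text))      # plain characters, no trace of the entities
print(reserialize(source))  # escaping chosen by the serializer, not by the input
```

ElementTree happens to make the same choices as the output shown above: quotes are escaped only inside attribute values, and the attribute delimiters become double quotes.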
Edit: The root cause of this behavior appears to be the behavior of the XmlWriter class. It looks like the general suggestion for those who want more customized escaping is to extend the XmlTextWriter class. This page has an implementation that looks fairly promising:
public class KeepEntityXmlTextWriter : XmlTextWriter
{
private static readonly string[] ENTITY_SUBS = new string[] { "&apos;", "&quot;" };
private static readonly char[] REPLACE_CHARS = new char[] { '\'', '"' };
public KeepEntityXmlTextWriter(string filename) : base(filename, null) { ; }
private void WriteStringWithReplace(string text)
{
string[] textSegments = text.Split(KeepEntityXmlTextWriter.REPLACE_CHARS);
if (textSegments.Length > 1)
{
for (int pos = -1, i = 0; i < textSegments.Length; ++i)
{
base.WriteString(textSegments[i]);
pos += textSegments[i].Length + 1;
// Assertion: Replace the following if-else when the number of
// replacement characters and substitute entities has grown
// greater than 2.
Debug.Assert(2 == KeepEntityXmlTextWriter.REPLACE_CHARS.Length);
if (pos != text.Length)
{
if (text[pos] == KeepEntityXmlTextWriter.REPLACE_CHARS[0])
base.WriteRaw(KeepEntityXmlTextWriter.ENTITY_SUBS[0]);
else
base.WriteRaw(KeepEntityXmlTextWriter.ENTITY_SUBS[1]);
}
}
}
else base.WriteString(text);
}
public override void WriteString( string text)
{
this.WriteStringWithReplace(text);
}
}
On the other hand, the MSDN documentation recommends using XmlWriter.Create() rather than instantiating XmlTextWriters directly.
In the .NET Framework 2.0 release, the recommended practice is to create XmlWriter instances using the XmlWriter.Create method and the XmlWriterSettings class. This allows you to take full advantage of all the new features introduced in this release. For more information, see Creating XML Writers.
One way around that would be to use the same logic as above, but put it in a class that wraps an XmlWriter. This page has a ready-made implementation of an XmlWrappingWriter, that you can modify as needed.
To use the above code with the XmlWrappingWriter, you would subclass the wrapping writer, like this:
public class KeepEntityWrapper : XmlWrappingWriter
{
public KeepEntityWrapper(XmlWriter baseWriter)
: base(baseWriter)
{
}
private static readonly string[] ENTITY_SUBS = new string[] { "&apos;", "&quot;" };
private static readonly char[] REPLACE_CHARS = new char[] { '\'', '"' };
private void WriteStringWithReplace(string text)
{
string[] textSegments = text.Split(REPLACE_CHARS);
if (textSegments.Length > 1)
{
for (int pos = -1, i = 0; i < textSegments.Length; ++i)
{
base.WriteString(textSegments[i]);
pos += textSegments[i].Length + 1;
// Assertion: Replace the following if-else when the number of
// replacement characters and substitute entities has grown
// greater than 2.
Debug.Assert(2 == REPLACE_CHARS.Length);
if (pos != text.Length)
{
if (text[pos] == REPLACE_CHARS[0])
base.WriteRaw(ENTITY_SUBS[0]);
else
base.WriteRaw(ENTITY_SUBS[1]);
}
}
}
else base.WriteString(text);
}
public override void WriteString(string text)
{
this.WriteStringWithReplace(text);
}
}
Note that this is essentially the same code as the KeepEntityXmlTextWriter, but using XmlWrappingWriter as the base class and with a different constructor.
I don't recognize the Guard that the XmlWrappingWriter code is using in two places, but given that you'll be consuming the code yourself, it should be pretty safe to delete the lines like this. They just ensure that a null value isn't passed to the constructor or the (in the above case inaccessible) BaseWriter property:
Guard.ArgumentNotNull(baseWriter, "baseWriter");
To create an instance of the XmlWrappingWriter, you would create an XmlWriter however you need to, and then use:
KeepEntityWrapper wrap = new KeepEntityWrapper(writer);
And then you'd use this wrap variable as the XmlWriter you pass to your XSL transform.
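For comparison, the core idea of both writers (escape text as usual, but emit quotes and apostrophes as entities) fits in a few lines. This is only an illustration in Python using the standard library's xml.sax.saxutils.escape, not a drop-in replacement for the C# classes; the function name escape_keeping_quote_entities is hypothetical.

```python
from xml.sax.saxutils import escape

# Extra substitutions applied on top of the default &, < and > escaping.
QUOTE_ENTITIES = {'"': "&quot;", "'": "&apos;"}


def escape_keeping_quote_entities(text: str) -> str:
    """Escape text for an XML text node, writing quotes and apostrophes as entities."""
    return escape(text, QUOTE_ENTITIES)


print(escape_keeping_quote_entities("say \"hi\" & 'bye'"))
```

As with the C# versions, this only affects how text is written. It cannot tell which quotes were entities in the original input, so every quote and apostrophe comes out as an entity.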

The XSLT processor doesn't know whether a character was represented by a character entity or not. This is because the XML parser substitutes any character entity with its code-value.
Therefore, the XSLT processor would see exactly the same character, regardless of whether it was represented as &quot; or as &#34; or as &#x22; or as ".
What you want can be achieved in XSLT 2.0 by using the so called "character maps".

Here is the trick you wanted:
replace all & with &amp;
perform XSLT
replace all &amp; with &
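A rough sketch of those three steps, assuming Python with xml.etree.ElementTree, where a plain parse-and-serialize round trip stands in for the actual XSLT step (the function name is made up):

```python
import xml.etree.ElementTree as ET


def roundtrip_preserving_entities(xml_text: str) -> str:
    # Step 1: protect every entity by escaping its ampersand.
    protected = xml_text.replace("&", "&amp;")
    # Step 2: stand-in for the transform; parse and serialize again.
    root = ET.fromstring(protected)
    output = ET.tostring(root, encoding="unicode")
    # Step 3: undo the protection so the original entities reappear.
    return output.replace("&amp;", "&")


print(roundtrip_preserving_entities("<a>&quot;hi&quot; &amp; bye</a>"))
```

Keep in mind that during step 2 the stylesheet sees the literal text &quot; instead of a quote character, so any template that inspects or compares that text will behave differently.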

Related

Pretty printing XML in wxWidgets

I'm writing a class derived from wxStyledTextCtrl and I want it to prettify given XML without adding anything other than whitespace. I cannot find a simple working solution. I can only use wxStyledTextCtrl, wxXmlDocument and libxml2.
The result I'm aiming for is that after calling SetText with a wxString containing the following text
<!-- comment1 --> <!-- comment2 --> <node><emptynode/> <othernode>value</othernode></node>
the control should show
<!-- comment1 -->
<!-- comment2 -->
<node>
<emptynode/>
<othernode>value</othernode>
</node>
Using libxml2 I managed to almost achieve this, but it also prints the XML declaration (e.g. <?xml version="1.0" encoding="UTF-8"?>) and I don't want this.
To be clear, I'm looking for a simple and clean solution; I don't want to manually remove the first line of the formatted XML.
Is there any simple solution to this using given tools? I feel like I'm missing something.
Is there a simple solution? No. But if you want to write your own pretty print function, you basically need to make a depth-first iteration over the XML document tree, printing it as you go. There's a slight complication in that you also need some way of knowing when to close a tag.
Here's an incomplete example of one way to do this using only the wxWidgets XML classes. Currently, it doesn't handle attributes, self-closing elements (such as '<emptynode/>' in your sample text), or any other special element types. A complete pretty printer would need to add those things.
#include <stack>
#include <set>
#include <wx/xml/xml.h>
#include <wx/sstream.h>
wxString PrettyPrint(const wxString& in)
{
wxStringInputStream string_stream(in);
wxXmlDocument doc(string_stream);
wxString pretty_print;
if (doc.IsOk())
{
std::stack<wxXmlNode*> nodes_in_progress;
std::set<wxXmlNode*> visited_nodes;
nodes_in_progress.push(doc.GetDocumentNode());
while (!nodes_in_progress.empty())
{
wxXmlNode* cur_node = nodes_in_progress.top();
nodes_in_progress.pop();
int depth = cur_node->GetDepth();
for (int i=1;i<depth;++i)
{
pretty_print << "\t";
}
if (visited_nodes.find(cur_node)!=visited_nodes.end())
{
pretty_print << "</" << cur_node->GetName() << ">\n";
}
else if ( !cur_node->GetNodeContent().IsEmpty() )
{
//If the node has content, just print it now
pretty_print << "<" << cur_node->GetName() << ">";
pretty_print << cur_node->GetNodeContent() ;
pretty_print << "</" << cur_node->GetName() << ">\n";
}
else if (cur_node==doc.GetDocumentNode())
{
std::stack<wxXmlNode *> nodes_to_add;
wxXmlNode *child = cur_node->GetChildren();
while (child)
{
nodes_to_add.push(child);
child = child->GetNext();
}
while (!nodes_to_add.empty())
{
nodes_in_progress.push(nodes_to_add.top());
nodes_to_add.pop();
}
}
else if (cur_node->GetType()==wxXML_COMMENT_NODE)
{
pretty_print << "<!-- " << cur_node->GetContent() << " -->\n";
}
//insert checks for other types of nodes with special
//printing requirements here
else
{
//otherwise, mark the node as visited and then put it back
visited_nodes.insert(cur_node);
nodes_in_progress.push(cur_node);
//If we push the children in order, they'll be popped
//in reverse order.
std::stack<wxXmlNode *> nodes_to_add;
wxXmlNode *child = cur_node->GetChildren();
while (child)
{
nodes_to_add.push(child);
child = child->GetNext();
}
while (!nodes_to_add.empty())
{
nodes_in_progress.push(nodes_to_add.top());
nodes_to_add.pop();
}
pretty_print <<"<" << cur_node->GetName() << ">\n";
}
}
}
return pretty_print;
}
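To show the shape of a more complete depth-first printer, here is a compact sketch. It is not wxWidgets code: it assumes Python's standard xml.dom.minidom instead of wxXmlDocument, and the names pretty_print and _lines are invented for the example. It handles attributes, self-closing elements and comments, and never emits an XML declaration because it serializes the document's children one by one.

```python
from xml.dom import minidom
from xml.dom.minidom import Node
from xml.sax.saxutils import escape, quoteattr


def _lines(node, depth, indent):
    """Return the pretty-printed lines for one node (recursive, depth first)."""
    pad = indent * depth
    if node.nodeType == Node.TEXT_NODE:
        text = node.data.strip()
        return [pad + escape(text)] if text else []
    if node.nodeType != Node.ELEMENT_NODE:
        # Comments, CDATA, processing instructions: emit unchanged.
        return [pad + node.toxml()]
    attrs = "".join(f" {k}={quoteattr(v)}" for k, v in node.attributes.items())
    # Ignore whitespace-only text between child elements.
    kids = [c for c in node.childNodes
            if not (c.nodeType == Node.TEXT_NODE and not c.data.strip())]
    name = node.tagName
    if not kids:
        return [f"{pad}<{name}{attrs}/>"]
    if len(kids) == 1 and kids[0].nodeType == Node.TEXT_NODE:
        # Keep simple text content on the same line as its tags.
        return [f"{pad}<{name}{attrs}>{escape(kids[0].data.strip())}</{name}>"]
    lines = [f"{pad}<{name}{attrs}>"]
    for kid in kids:
        lines.extend(_lines(kid, depth + 1, indent))
    lines.append(f"{pad}</{name}>")
    return lines


def pretty_print(xml_text, indent="\t"):
    doc = minidom.parseString(xml_text)
    lines = []
    for child in doc.childNodes:  # top-level comments and the root element
        lines.extend(_lines(child, 0, indent))
    return "\n".join(lines)


sample = ("<!-- comment1 --> <!-- comment2 --> "
          "<node><emptynode/> <othernode>value</othernode></node>")
print(pretty_print(sample))
```

One caveat: the sketch trims leading and trailing whitespace from text content, which is fine for data-style XML like the sample but changes mixed-content documents. The recursion replaces the explicit stack and visited set; the closing tag is simply appended after the children have been processed.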

xalan and custom function for xslt

I'm using Apache FOP with IKVM from my C# code. I generate the PDF by using the XSLT stylesheet to get the result as XSL-FO. I have one problem, which is using custom functions.
My stylesheet declaration:
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:fo="http://www.w3.org/1999/XSL/Format"
xmlns:cal="xalan://m.test"
extension-element-prefixes="cal"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.w3.org/2001/XMLSchema-instance http://www.xmlblueprint.com/documents/fop.xsd">
The custom function:
namespace m
{
public class test
{
public static string zzz(ExpressionContext x, object d)
{
return "test";
}
}
}
And calling this from the xslt:
<xsl:value-of select="cal:zzz(1)"/>
Code to compile it:
FopFactory fopFactory = FopFactory.newInstance();
fopFactory.ignoreNamespace("http://www.w3.org/2001/XMLSchema-instance");
fopFactory.setUserConfig(new File("fop.xconf"));
OutputStream o = new DotNetOutputMemoryStream();
try
{
Fop fop = fopFactory.newFop("application/pdf", o);
TransformerFactory factory = TransformerFactory.newInstance();
Source xsltSrc = new StreamSource(new File("data.xsl"));
Transformer transformer = factory.newTransformer(xsltSrc);
var bytes = System.IO.File.ReadAllBytes("data.xml"); //"HR_CV.fo");
var stream = new DotNetInputMemoryStream(new System.IO.MemoryStream(bytes));
Source src = new StreamSource(stream);
Result res = new SAXResult(fop.getDefaultHandler());
transformer.transform(src, res);
}
finally
{
o.close();
}
Exception I got is:
java.lang.NoSuchMethodException: For extension function, could not find method org.apache.xml.utils.NodeVector.zzz([ExpressionContext,])
What am I doing wrong?
You're calling the zzz function with a single argument (1). But your function expects two arguments. If you provide both arguments, chances are it will work just fine.

AS3/Regular Expressions - Replacing segments of a string

I have absolutely no knowledge in Regex whatsoever. Basically what I'm trying to do is have an error class that I can use to call errors (obviously) which looks like this:
package avian.framework.errors
{
public class AvError extends Object
{
// errors
public static const LAYER_WARNING:String = "Warning: {0} is not a valid layer - the default layer _fallback_ has been used as the container for {1}.";
/**
* Constructor
* Places a warning or error into the output console to assist with misuse of the framework
* @param err The error to display
* @param params A list of Objects to use throughout the error message
*/
public function AvError(err:String, ...params)
{
trace(err);
}
}
}
What I want to be able to do is use the LAYER_WARNING like this:
new AvError(AvError.LAYER_WARNING, targetLayer, this);
And have the output be something along the lines of:
Warning: randomLayer is not a valid layer - the default layer _fallback_ has been used as the container for [object AvChild].
The idea is to replace {0} with the first parameter parsed in ...params, {1} with the second, etc.
I've done a bit of research and I think I've worked out that I need to search using this pattern:
var pattern:RegExp = /{\d}/;
You can use StringUtil
var original:String = "Here is my {0} and my {1}!";
var myStr:String = StringUtil.substitute(original, ['first', 'second']);
Using the g flag in RegExp you can create an array containing all of your {x} matches, then loop through this array and replace each of the matches with the appropriate parameter.
Code:
var mystring:String = "{0} went to {1} on {2}";
function replace(str:String, ...params):String
{
var pattern:RegExp = /{\d}/g;
var ar:Array = str.match(pattern);
var i:uint = 0;
for(i; i<ar.length; i++)
{
str = str.split(ar[i]).join(params[i]);
}
return str;
}
trace(replace(mystring, "marty", "work", "friday")); // marty went to work on friday
i'm assuming you want to have several static constants with varying replacement instances ({0}, {1}, {2}, etc.) in each string constant.
something like this should work - sorry, it's untested:
public function AvError(err:String, ...params)
{
var replacementArray:Array = err.match(new RegExp("{\\d}", "g"));
for (var i:int = 0; i < replacementArray.length; i++)
err = err.replace(new RegExp(replacementArray[i], "g"), params[i]);
trace(err);
}
if you do have several static constants with varying replacement instances, you'll want to check for an appropriate matching amount of …params that are passed.
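One thing both loops above share is that they pair the i-th match with the i-th parameter, so a template such as "{1} before {0}" would be filled in the wrong order. A sketch of an index-based variant, written in Python instead of AS3 purely for illustration (the function name substitute is made up), reads the number inside the braces and leaves placeholders without a matching parameter untouched:

```python
import re

# Matches {0}, {1}, ... and captures the number between the braces.
_PLACEHOLDER = re.compile(r"\{(\d+)\}")


def substitute(template: str, *params) -> str:
    def replace(match):
        index = int(match.group(1))
        # Keep the placeholder as-is when no parameter was supplied for it.
        return str(params[index]) if index < len(params) else match.group(0)

    return _PLACEHOLDER.sub(replace, template)


print(substitute("{0} went to {1} on {2}", "marty", "work", "friday"))
```

AS3's String.replace also accepts a function as its second argument, so the same approach should carry over, though that port is untested here.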

Parsing XML file using pugixml

Hi
I want to use an XML file as a config file, from which I will read parameters for my application. I came across the PugiXML library, however I have a problem with getting the values of attributes.
My XML file looks like that
<?xml version="1.0"?>
<settings>
<deltaDistance> </deltaDistance>
<deltaConvergence>0.25 </deltaConvergence>
<deltaMerging>1.0 </deltaMerging>
<m> 2</m>
<multiplicativeFactor>0.7 </multiplicativeFactor>
<rhoGood> 0.7 </rhoGood>
<rhoMin>0.3 </rhoMin>
<rhoSelect>0.6 </rhoSelect>
<stuckProbability>0.2 </stuckProbability>
<zoneOfInfluenceMin>2.25 </zoneOfInfluenceMin>
</settings>
To parse the XML file I use this code
void ReadConfig(char* file)
{
pugi::xml_document doc;
if (!doc.load_file(file)) return false;
pugi::xml_node tools = doc.child("settings");
//[code_traverse_iter
for (pugi::xml_node_iterator it = tools.begin(); it != tools.end(); ++it)
{
cout<<it->name() << " " << it->attribute(it->name()).as_double();
}
}
and I also was trying to use this
void ReadConfig(char* file)
{
pugi::xml_document doc;
if (!doc.load_file(file)) return false;
pugi::xml_node tools = doc.child("settings");
//[code_traverse_iter
for (pugi::xml_node_iterator it = tools.begin(); it != tools.end(); ++it)
{
cout<<it->name() << " " << it->value();
}
}
Attributes are loaded correctly, however all values are equal to 0. Could somebody tell me what I am doing wrong?
I think your problem is that you're expecting the value to be stored in the node itself, but it's really in a CHILD text node. A quick scan of the documentation showed that you might need
it->child_value()
instead of
it->value()
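The distinction between an element's attributes and its child text is the same in most XML APIs. As a language-neutral illustration (Python's xml.etree.ElementTree here, not pugixml; read_settings is a made-up name), the numbers in a file like the one above live in each child's text, and an element such as deltaDistance that holds only whitespace needs explicit handling:

```python
import xml.etree.ElementTree as ET


def read_settings(xml_text):
    """Map each child element of <settings> to its numeric text content."""
    root = ET.fromstring(xml_text)
    settings = {}
    for child in root:
        # The value is the element's text, not an attribute named after it.
        text = (child.text or "").strip()
        settings[child.tag] = float(text) if text else None
    return settings


config = read_settings(
    "<settings>"
    "<deltaDistance> </deltaDistance>"
    "<deltaConvergence>0.25 </deltaConvergence>"
    "<m> 2</m>"
    "</settings>"
)
print(config)
```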
Are you trying to get all the attributes for a given node or do you want to get the attributes by name?
For the first case, you should be able to use this code:
unsigned int numAttributes = node.attributes();
for (unsigned int nAttribute = 0; nAttribute < numAttributes; ++nAttribute)
{
pug::xml_attribute attrib = node.attribute(nAttribute);
if (!attrib.empty())
{
// process here
}
}
For the second case:
LPCTSTR GetAttribute(pug::xml_node & node, LPCTSTR szAttribName)
{
if (szAttribName == NULL)
return NULL;
pug::xml_attribute attrib = node.attribute(szAttribName);
if (attrib.empty())
return NULL; // or empty string
return attrib.value();
}
If you want to store plain text data in the nodes, like
<name> My Name</name>
You need to make it like
rootNode.append_child("name").append_child(node_pcdata).set_value("My name");
If you want to store datatypes, you need to set an attribute. I think what you want is to be able to read the value directly right?
When you are writing the node,
rootNode.append_child("version").append_attribute("value").set_value(0.11)
When you want to read it,
rootNode.child("version").attribute("value").as_double()
At least that's my way of doing it!

Unicode Regex; Invalid XML characters

The list of valid XML characters is well known, as defined by the spec it's:
#x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
My question is whether or not it's possible to make a PCRE regular expression for this (or its inverse) without actually hard-coding the codepoints, by using Unicode general categories. An inverse might be something like [\p{Cc}\p{Cs}\p{Cn}], except that improperly covers linefeeds and tabs and misses some other invalid characters.
I know this isn't exactly an answer to your question, but it's helpful to have it here:
Regular Expression to match valid XML Characters:
[\u0009\u000a\u000d\u0020-\uD7FF\uE000-\uFFFD]
So to remove invalid chars from XML, you'd do something like
// filters control characters but allows only properly-formed surrogate sequences
private static Regex _invalidXMLChars = new Regex(
@"(?<![\uD800-\uDBFF])[\uDC00-\uDFFF]|[\uD800-\uDBFF](?![\uDC00-\uDFFF])|[\x00-\x08\x0B\x0C\x0E-\x1F\x7F-\x9F\uFEFF\uFFFE\uFFFF]",
RegexOptions.Compiled);
/// <summary>
/// removes any unusual unicode characters that can't be encoded into XML
/// </summary>
public static string RemoveInvalidXMLChars(string text)
{
if (string.IsNullOrEmpty(text)) return "";
return _invalidXMLChars.Replace(text, "");
}
I had our resident regex / XML genius, he of the 4,400+ upvoted post, check this, and he signed off on it.
For systems that internally store the code points in UTF-16, it is common to use surrogate pairs (xD800-xDFFF) for code points above 0xFFFF, and in those systems you must verify whether you can really use, for example, \u12345 or must specify that as a surrogate pair. (I just found out that in C# you can use \u1234 (16-bit) and \U00001234 (32-bit).)
According to Microsoft "the W3C recommendation does not allow surrogate characters inside element or attribute names." While searching W3s website I found C079 and C078 that might be of interest.
I tried this in java and it works:
private String filterContent(String content) {
return content.replaceAll("[^\\u0009\\u000a\\u000d\\u0020-\\uD7FF\\uE000-\\uFFFD]", "");
}
Thank you Jeff.
The above solutions didn't work for me if the hex code was present in the xml. e.g.
<element>&#x8;</element>
The following code would break:
string xmlFormat = "<element>{0}</element>";
string invalid = "&#x8;";
string xml = string.Format(xmlFormat, invalid);
xml = Regex.Replace(xml, @"[\x01-\x08\x0B\x0C\x0E\x0F\u0000-\u0008\u000B\u000C\u000E-\u001F]", "");
XDocument.Parse(xml);
It returns:
XmlException: '', hexadecimal value 0x08, is an invalid character.
Line 1, position 14.
The following is the improved regex and fixed the problem mentioned above:
&#x([0-8BCEFbcef]|1[0-9A-Fa-f]);|[\x01-\x08\x0B\x0C\x0E\x0F\u0000-\u0008\u000B\u000C\u000E-\u001F]
Here is a unit test for the first 300 unicode characters and verifies that only invalid characters are removed:
[Fact]
public void validate_that_RemoveInvalidData_only_remove_all_invalid_data()
{
string xmlFormat = "<element>{0}</element>";
string[] allAscii = (Enumerable.Range('\x1', 300).Select(x => ((char)x).ToString()).ToArray());
string[] allAsciiInHexCode = (Enumerable.Range('\x1', 300).Select(x => "&#x" + (x).ToString("X") + ";").ToArray());
string[] allAsciiInHexCodeLoweCase = (Enumerable.Range('\x1', 300).Select(x => "&#x" + (x).ToString("x") + ";").ToArray());
bool hasParserError = false;
IXmlSanitizer sanitizer = new XmlSanitizer();
foreach (var test in allAscii.Concat(allAsciiInHexCode).Concat(allAsciiInHexCodeLoweCase))
{
bool shouldBeRemoved = false;
string xml = string.Format(xmlFormat, test);
try
{
XDocument.Parse(xml);
shouldBeRemoved = false;
}
catch (Exception e)
{
if (test != "<" && test != "&") //these char are taken care of automatically by my convertor so don't need to test. You might need to add these.
{
shouldBeRemoved = true;
}
}
int xmlCurrentLength = xml.Length;
int xmlLengthAfterSanitize = Regex.Replace(xml, @"&#x([0-8BCEF]|1[0-9A-F]);|[\u0000-\u0008\u000B\u000C\u000E-\u001F]", "").Length;
if ((shouldBeRemoved && xmlCurrentLength == xmlLengthAfterSanitize) //it wasn't properly Removed
||(!shouldBeRemoved && xmlCurrentLength != xmlLengthAfterSanitize)) //it was removed but shouldn't have been
{
hasParserError = true;
Console.WriteLine(test + xml);
}
}
Assert.Equal(false, hasParserError);
}
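The regex above only covers hexadecimal references up to &#x1F;. A more general sketch of the same idea decodes every numeric character reference (hex or decimal) and drops the ones that point outside the XML 1.0 character ranges. It is written in Python for brevity, and the helper names are invented for the example:

```python
import re

# Matches &#xHH; (hex) or &#DD; (decimal) character references.
_CHAR_REF = re.compile(r"&#(?:x([0-9A-Fa-f]+)|([0-9]+));")


def is_xml_char(cp: int) -> bool:
    """True if the code point is allowed by the XML 1.0 Char production."""
    return (cp in (0x9, 0xA, 0xD)
            or 0x20 <= cp <= 0xD7FF
            or 0xE000 <= cp <= 0xFFFD
            or 0x10000 <= cp <= 0x10FFFF)


def drop_invalid_char_refs(xml_text: str) -> str:
    def replace(match):
        hex_digits, dec_digits = match.group(1), match.group(2)
        cp = int(hex_digits, 16) if hex_digits is not None else int(dec_digits)
        # Keep valid references verbatim, remove the rest.
        return match.group(0) if is_xml_char(cp) else ""

    return _CHAR_REF.sub(replace, xml_text)


print(drop_invalid_char_refs("<element>&#x8;&#x41;&#11;&#9;</element>"))
```

This handles only the references; raw control characters still need a separate pass like the ones shown earlier.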
Another way to remove invalid XML characters in C# is the XmlConvert.IsXmlChar method (available since .NET Framework 4.0):
public static string RemoveInvalidXmlChars(string content)
{
return new string(content.Where(ch => System.Xml.XmlConvert.IsXmlChar(ch)).ToArray());
}
or you may check that all characters are XML-valid.
public static bool CheckValidXmlChars(string content)
{
return content.All(ch => System.Xml.XmlConvert.IsXmlChar(ch));
}
.Net Fiddle - https://dotnetfiddle.net/v1TNus
For example, the vertical tab symbol (\v) is not valid for XML, it is valid UTF-8, but not valid XML 1.0, and even many libraries (including libxml2) miss it and silently output invalid XML.
In PHP the regex would look like this:
protected function isStringValid($string)
{
$regex = '/[^\x{9}\x{a}\x{d}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]+/u';
return (preg_match($regex, $string, $matches) === 0);
}
This would handle all 3 ranges from the xml specification:
#x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
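In a language whose strings are sequences of code points, not UTF-16 units, the three ranges can be written into one character class without any surrogate handling. A small sketch in Python (the function name is arbitrary); note that it still hard-codes the code points and does not answer the original question about Unicode general categories:

```python
import re

# Negated class: anything outside the XML 1.0 Char production.
_INVALID_XML_CHARS = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def remove_invalid_xml_chars(text: str) -> str:
    """Strip characters that XML 1.0 does not allow, keeping tab, LF and CR."""
    return _INVALID_XML_CHARS.sub("", text)


cleaned = remove_invalid_xml_chars("a\x08b\x0bc\td\U0001F600\ufffe")
print(repr(cleaned))
```

The vertical tab (\x0b) mentioned above is removed, while the tab and the character beyond U+FFFF survive.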