XSL Transformation, emoji and attributes - xslt

I'm encountering an issue with emojis when generating HTML output with an XSL transformation under certain circumstances.
For instance, I've tested the following XSL with different transformation engines:
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="html" encoding="UTF-8"/>
<xsl:template match="/">
<xsl:text disable-output-escaping="yes">&lt;!doctype html&gt;</xsl:text>
<html>
<head>
<meta charset="UTF-8"/>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
</head>
<body>
<textarea>👍🏻</textarea><br/>
<input type="text" value="👍🏻"/>
</body>
</html>
</xsl:template>
</xsl:stylesheet>
I tested with exactly the same code (based on the JAXP API) for all transformers. I only changed the class used to instantiate the transformer.
Saxon gives the correct result:
The transformer repackaged inside the JDK and based on Xalan (aka com.sun.org.apache.xalan.internal.xsltc.trax.TransformerFactoryImpl) is correct when the emoji is placed as text in the textarea body, but generates a wrong result for the <input> field: the emoji seems to be wrongly encoded when placed in the value attribute:
Xalan 2.7.2 gives an even worse result:
For various reasons (mainly licensing), I would prefer to use the Xalan transformer. Any idea how I can make Xalan handle emoji correctly?
EDIT
The transformation is performed with the following code:
TransformerFactory factory = TransformerFactory.newInstance(
"com.sun.org.apache.xalan.internal.xsltc.trax.TransformerFactoryImpl",
null);
Transformer transformer = factory.newTransformer(new StreamSource(xsl));
DocumentSource domSource = new DocumentSource(doc);
OutputStream stream = response.getOutputStream();
transformer.transform(domSource, new StreamResult(stream));
stream.flush();
stream.close();
where doc is a dom4j document, xsl is the InputStream containing the above stylesheet, and response is the HttpServletResponse object that receives the transformation result.

I have tried
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="html" encoding="UTF-8" doctype-system="about:legacy-compat"/>
<xsl:template match="/">
<html>
<head>
</head>
<body>
<textarea>👍🏻</textarea><br/>
<input type="text" value="👍🏻"/>
</body>
</html>
</xsl:template>
</xsl:stylesheet>
with Xalan 2.7.1 at http://xsltransform.net/ and both thumbs seem to be shown fine, i.e. the serialized HTML is
<!DOCTYPE HTML SYSTEM "about:legacy-compat">
<html>
<head>
<META http-equiv="Content-Type" content="text/html; charset=UTF-8">
</head>
<body>
<textarea>👍🏻</textarea>
<br>
<input value="👍🏻" type="text">
</body>
</html>
which renders as

I finally decided to fork the xalan-java project and patch the serializer myself. With the patch compiled in, I get correct emojis in both attributes and text with UTF-8 XSL output.
The patch commit is https://github.com/morbac/xalan-java/commit/a685171e1b621e9b63c8507f467a395fd1fc96a4. It fixes the issue for both input and textarea. The jar with fixed classes is available here

After a day of research, I have come to the conclusion that this is a bug in the Xalan HTML serializer (line 1440 and following) with surrogate characters (chars between \ud800 and \udbff). As mentioned in the comments, Xalan 2.6.0 performs the transformation correctly, but Xalan 2.7.* does not.
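To make the surrogate problem concrete, here is a minimal Java sketch. It is not Xalan's serializer code; the class and method names are made up for illustration. An emoji such as U+1F44D is stored as two UTF-16 chars (a high and a low surrogate), and an escaping routine has to join them into one code point before writing a character reference. Handling each char on its own yields two invalid references.

```java
class SurrogateDemo {
    // Hypothetical helper: escapes a string for an HTML attribute value and
    // writes every non-ASCII code point as a decimal character reference.
    static String escapeAttribute(String value) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < value.length(); ) {
            int cp = value.codePointAt(i);   // joins a high/low surrogate pair
            i += Character.charCount(cp);    // advances by 1 or 2 chars
            if (cp == '&') {
                out.append("&amp;");
            } else if (cp == '"') {
                out.append("&quot;");
            } else if (cp == '<') {
                out.append("&lt;");
            } else if (cp < 0x80) {
                out.append((char) cp);
            } else {
                out.append("&#").append(cp).append(';');
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String thumbsUp = "\uD83D\uDC4D\uD83C\uDFFB"; // U+1F44D followed by U+1F3FB
        System.out.println(thumbsUp.length());                             // 4
        System.out.println(thumbsUp.codePointCount(0, thumbsUp.length())); // 2
        System.out.println(escapeAttribute(thumbsUp)); // &#128077;&#127995;
    }
}
```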
Martin Honnen mentioned XALANJ-2419. I also found other tickets related to this issue (XALANJ-2617, https://github.com/apache/xalan-j/pull/4, etc.). I tried to implement some of the fixes. For instance, the version suggested here effectively fixes the issue for my <input> field, but the issue remains for the textarea.
I'll try to fork Xalan and fix the issue for both attributes and text. Meanwhile, the easiest way to work around the issue is to replace the "UTF-8" encoding with "UTF-16" in xsl:output. This fixes both issues.
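If editing the stylesheet is not convenient, the same encoding switch can probably be made from the Java side, because JAXP output properties set on the Transformer override xsl:output. The sketch below is an assumption-laden illustration, not the code from the question: it uses the JDK's default TransformerFactory, an inline stylesheet, and a byte array instead of the servlet response. In a servlet, the response charset would have to be changed to UTF-16 as well, otherwise the browser decodes the bytes with the wrong encoding.

```java
import java.io.ByteArrayOutputStream;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class Utf16OutputDemo {
    // Reduced version of the stylesheet from the question (illustrative only).
    static final String XSL =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:output method='html' encoding='UTF-8'/>"
      + "<xsl:template match='/'>"
      + "<html><body><input type='text' value='\uD83D\uDC4D\uD83C\uDFFB'/></body></html>"
      + "</xsl:template></xsl:stylesheet>";

    static String transformAsUtf16() {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(XSL)));
            // Overrides encoding="UTF-8" from xsl:output for this run only.
            transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-16");
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            transformer.transform(new StreamSource(new StringReader("<doc/>")),
                                  new StreamResult(bytes));
            // Decode with the same charset the serializer was told to use.
            return new String(bytes.toByteArray(), StandardCharsets.UTF_16);
        } catch (Exception e) {
            throw new IllegalStateException("transformation failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(transformAsUtf16());
    }
}
```

Whether the emoji then appears literally or as a character reference depends on the transformer version, so it is worth checking the output with the exact factory class you deploy.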

Related

CHtmlView that is compatible with UltraHD

CHtmlView is not compatible with UltraHD resolutions. It is not simply down to using the correct HTML/CSS to be UltraHD aware. The print preview mechanism fails and crops the page. Many months ago Microsoft acknowledged this as an issue and has not addressed it.
My application heavily uses a CHtmlView element for displaying schedules and printing. Whilst my application is Windows based (Win32/x64), I am getting more and more users on Mac computers who run Windows on them, and they all use UltraHD by default. As a result my application fails to function properly and the user has to reduce the resolution and adjust text scaling back to 100%.
Has anyone else encountered this issue with using UltraHD with CHtmlView print preview and got it working?
The related question is here:
How can I make this HTML / CSS file UltraHD / 4k friendly in a CHtmlView?
But I asked that ages ago and got nowhere so I am trying again.
Thank you.
Update
I provided this XSL script to the user to try with Ultra HD resolution in my program:
<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns="http://www.w3.org/1999/xhtml">
<xsl:output method="html" indent="yes" version="4.01"
doctype-system="http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"
doctype-public="-//W3C//DTD XHTML 1.0 Transitional//EN"/>
<xsl:template match="/">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type" />
<meta http-equiv="X-UA-Compatible" content="IE=edge" />
</head>
<body>
<div style="width:100%; height:100%; border: thick solid #00FF00;">This is a test
</div>
</body>
</html>
</xsl:template>
</xsl:stylesheet>
So it uses the <meta http-equiv="X-UA-Compatible" content="IE=edge" /> code and it has made no difference. When he does a print preview:
So the problem still remains. It seems to be something to do with the Print Preview mechanism of the CHtmlView control.
Update
This is the Microsoft link to this issue:
https://developercommunity.visualstudio.com/content/problem/215368/chtmlview-and-printing-on-ultrahd-computers.html
Still not resolved.

Clientside XSLT in Mozilla/Firefox fails

I have a result document that renders in Chrome, but not Mozilla/Firefox.
I believe it is because there is top level leading whitespace (two blank lines before the <!DOCTYPE html).
How can I change this transform to not have leading whitespace (fiddle)?
XML:
<?xml-stylesheet href="/css/my.xsl" type="text/xsl"?>
<webpage>
<title>Book</title>
<auth>Mike</auth>
<container-content>
<p>foo1</p>
<p>foo2</p>
</container-content>
</webpage>
XSLT:
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:xs="http://www.w3.org/2001/XMLSchema"
exclude-result-prefixes="xs"
version="1.0">
<xsl:output
method="xml"
indent="yes"
encoding="UTF-8"
omit-xml-declaration="yes"
doctype-system="http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"
doctype-public="-//W3C//DTD XHTML 1.0 Strict//EN"/>
<xsl:strip-space elements="*"/>
<xsl:template match="text()"/>
<xsl:template match="/">
<html>
<head>
<meta http-equiv="Content-Type"
content="text/html; charset=utf-8"/>
</head>
<body>
<xsl:copy-of select="//container-content/*"/>
</body>
</html>
</xsl:template>
</xsl:stylesheet>
Result:
- a blank line here -
- and here -
<!DOCTYPE html
PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>
</head>
<body>
<p>foo1</p>
<p>foo2</p>
</body>
</html>
Alternatively, I may be incorrect, and the two blank lines are not the cause of the failed render in Mozilla/Firefox. I have a hard time troubleshooting client side transforms.
Side note: I've developed in Saxon 6.5, thinking Saxon best approximates what browsers do. I could be wrong. I note Xalan does not put in leading whitespace.
I just ran your stylesheet with Saxon 6.5 and indeed, it outputs two blank lines, which are removed if you change the xsl:output to omit indentation and include the XML declaration. However, I believe this to be a bug in Saxon 6.5 (a small one, as the whitespace is not significant).
Running it with other XSLT 1.0 processors shows no whitespace. However, as said in my comment, the whitespace is insignificant, as browsers do not serialize the result anyway. (Note: apparently, browsers do some kind of serialization, in the sense that they check whether you use XML or HTML output.)
I ran your example with Firefox and it "just works". Since your stylesheet does a simple copy of the XML, it shows just the text. If I change the xsl:output to HTML and add a few lines to be sure I am running it correctly (I added an <h1>Hello</h1>), it shows the HTML.
I'm not sure what you expect the browser to show, but my guess is not XML, but (X)HTML. XSLT 1.0 is not very good with XHTML (an xhtml output method exists in XSLT 2.0, but browsers do not support that version), but works fine with HTML.
I modified your stylesheet as follows:
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:xs="http://www.w3.org/2001/XMLSchema"
exclude-result-prefixes="xs"
version="1.0">
<xsl:strip-space elements="*"/>
<xsl:template match="text()"/>
<xsl:template match="/">
<html>
<head />
<body>
<xsl:apply-templates />
</body>
</html>
</xsl:template>
<xsl:template match="title">
<h1><xsl:value-of select="."/></h1>
</xsl:template>
<xsl:template match="auth">
<p>Author: <xsl:value-of select="." /></p>
</xsl:template>
</xsl:stylesheet>
And in Firefox and Chrome it renders as follows:
Note (1): if you do not run it from a web server (either local or remote), it will not run in either Firefox or Chrome because of security restrictions.
Note (2): to view the rendered XML or HTML, use the Inspect Element feature of the developer tools of either Chrome or Firefox.
Note (3): you do not need to write the meta tag yourself, as the specification calls for the processor to add it when it serializes HTML.
Note (4): if you are unsure whether Firefox is loading your stylesheet correctly, have a look using Firebug; it should show something like this (note the "200 OK"):
If you want to transform to XHTML then you need to make sure you use the XHTML namespace for your result elements so put
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns="http://www.w3.org/1999/xhtml"
version="1.0">
on your stylesheet, as otherwise with output method xml your elements in no namespace are not recognized as XHTML elements by Mozilla.
As your input p elements are also in no namespace you can not copy them through but have to write a template for them
<xsl:template match="*">
<xsl:element name="{local-name()}"><!-- assumes you have the namespace declaration suggested above -->
<xsl:apply-templates/>
</xsl:element>
</xsl:template>
and then use <xsl:apply-templates select="//container-content/*"/> instead of the copy-of. And in that case the <xsl:template match="text()"/> needs to be removed as otherwise the text of the transformed p elements would not show up.

saxon including boolean itemscope value and closing source tag in html output

I'm using Saxon-HE 9.6.0.1J from Saxonica to generate HTML documents (xsl:output method="html"). It's generally good at omitting the value of boolean attributes and closing tags for empty elements, but I've found a few situations where it fails:
The microdata itemscope="itemscope" attribute is not output as simply itemscope
empty source elements are given closing tags
Here is an example stylesheet that demonstrates the problem:
<?xml version="1.0"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">
<xsl:output method="html" encoding="utf-8" include-content-type="no"/>
<xsl:template match="/">
<html>
<head>
<meta charset="UTF-8" />
<title>HTML test</title>
</head>
<body>
<div itemscope="itemscope" itemtype="http://example.com/dummy/">
<span itemprop="prop1">val1</span>
</div>
<audio autoplay="autoplay" controls="controls">
<source type="audio/mpeg" src="example.mp3" />
<source type="audio/x-wav" src="example.wav" />
</audio>
</body>
</html>
</xsl:template>
</xsl:stylesheet>
Sample XML:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="example.xsl"?>
<example/>
Command:
java -cp saxon9he.jar net.sf.saxon.Transform -s:example.xml -a
Results:
<html>
<head>
<meta charset="UTF-8">
<title>HTML test</title>
</head>
<body>
<div itemscope="itemscope" itemtype="http://example.com/dummy/"><span itemprop="prop1">val1</span></div>
<audio autoplay controls>
<source type="audio/mpeg" src="example.mp3"></source>
<source type="audio/x-wav" src="example.wav"></source>
</audio>
</body>
</html>
As demonstrated, meta is properly empty but source is not, and the values for autoplay and controls are properly omitted but not for itemscope.
Is this a bug, or am I missing the solution to tell Saxon how to treat those elements and attributes? I've searched the docs on saxonica.com and questions here for a clue, but not found anything.
Thanks in advance!
Quick update: In XSLT 3.0, you can specify the @html-version attribute, which you can set to 5.0 if you want to use XHTML5. Doing so solved the <source> issue for me while still using @method="xhtml".
The "source" element is recognized as an empty element if you specify version="5.0" on xsl:output.
The list of attributes that Saxon recognizes as boolean attributes when you specify method="html" version="5.0" comes from here:
http://www.w3.org/TR/html5/index.html#attributes-1
which does not include "itemscope". I'm afraid I can't help you with the history of how it comes to be present in some flavours of HTML and not in the W3C flavour, but Saxon inevitably follows the W3C specs.
Perhaps we should provide some way of extending the list (if you're really keen you can do it by writing your own serializer factory class that customizes the Saxon serializer, but that's serious hackery).

Generate html from xsl without DOCTYPE

Is it possible to generate HTML output with XSL that has no doctype added to the output? If I don't set any doctype myself, it produces one on its own.
EDIT:
Since I don't think that this is possible, I solved my problem by simply cutting away the DOCTYPE after the HTML is generated, with the following regex: '<!DOCTYPE[^>]*>'
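For reference, the post-processing step described in the edit could look roughly like this in Java. This is only a sketch: the class name is made up, the match is case-insensitive as an extra assumption, and it assumes the DOCTYPE has no internal subset, so it ends at the first '>'.

```java
import java.util.regex.Pattern;

class DoctypeStripper {
    // Matches the first DOCTYPE declaration plus any whitespace after it.
    // Assumption: no internal subset ("[ ... ]") inside the declaration.
    private static final Pattern DOCTYPE =
        Pattern.compile("<!DOCTYPE[^>]*>\\s*", Pattern.CASE_INSENSITIVE);

    static String stripDoctype(String html) {
        return DOCTYPE.matcher(html).replaceFirst("");
    }

    public static void main(String[] args) {
        String html = "<!DOCTYPE html PUBLIC \"-//W3C//DTD HTML 4.01//EN\">\n<html></html>";
        System.out.println(stripDoctype(html)); // <html></html>
    }
}
```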
Quickly tested with Saxon, yes this is possible...I'm not sure what xslt library you're using so it could be a symptom of that.
If I use Saxon to run this transform against ANY xml file (which generates a minimum viable HTML5 document, minus DOCTYPE) :
<?xml version="1.0" encoding="utf-8" ?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="html" encoding="iso-8859-1" indent="yes"/>
<xsl:template match="/">
<html>
<head>
<title>Test</title>
</head>
<body>
</body>
</html>
</xsl:template>
</xsl:stylesheet>
I get this output :
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<title>Test</title>
</head>
<body></body>
</html>

XSLT collapses my <script> contents to one line, commenting out my JavaScript!

UPDATE: Apologies all, it was my HTTP server stripping whitespace from the XSLT before it was sent, and it was not aware of JavaScript comments (I should really delete the question but cannot).
My XSLT looks like:
<?xml version="1.0"?>
<xsl:stylesheet
version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns="http://www.w3.org/1999/xhtml">
<xsl:output
method="xml"
indent="yes"/>
<xsl:template match="/root">
<html>
<head>
<title>Title</title>
<script type="text/javascript"><![CDATA[
// ©2011
function function(){
// do stuff...
}
]]></script>
</head>
<body>
<p> blah blah... </p>
</body>
</html>
</xsl:template>
</xsl:stylesheet>
But the resulting XML is always collapsed to one line, so everything after the initial comment in my JavaScript is commented out! This happens in all major browsers, despite indent="yes".
I couldn't repro this.
With all of the following nine XSLT processors (MSXML3 included -- so in IE you should get a good result):
MSXML (3, 4, 6)
.NET (XslCompiledTransform and XslTransform)
Altova (XML-SPY)
Saxon 6.5.4
Saxon 9.1.07 (XSLT 2.0 processor)
XQSharp (XSLT 2.0 processor)
when I perform the provided XSLT transformation:
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns="http://www.w3.org/1999/xhtml">
<xsl:output method="xml" indent="yes"/>
<xsl:template match="/root">
<html>
<head>
<title>Title</title>
<script type="text/javascript">
<![CDATA[
// ©2011
function function()
{
// do stuff...
}
]]>
</script>
</head>
<body>
<p> blah blah... </p>
</body>
</html>
</xsl:template>
</xsl:stylesheet>
on this XML document (as no source XML document is provided in the question):
<root/>
the result is the same:
<?xml version="1.0" encoding="utf-8"?>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>Title</title>
<script type="text/javascript">
// ©2011
function function()
{
// do stuff...
}
</script>
</head>
<body>
<p> blah blah... </p>
</body>
</html>
Therefore, this is behavior of a buggy XSLT processor, not on the above list -- or there is some missing data in the question.
Try to wrap your JavaScript in an <xsl:text> element instead of the CDATA section. This will at least keep the line breaks you made inside. I'm not sure if CDATA content cares about line breaks.
<script type="text/javascript"><xsl:text>
// ©2011
function function(){
// do stuff...
}
</xsl:text></script>
You should also try to use method="html" instead of xml, because you're generating HTML content.
In addition: I think indent="yes" only applies to the indentation of the XML elements. I don't think that mechanism cares about text or CDATA sections, so you have to do the line breaks yourself (as you already did in your JavaScript).
Three things to try:
You're generating HTML, so why have output method XML?
The CDATA will be consumed by the XML parser on input to the XSLT engine, and not carried through (CDATA doesn't appear in the XML info model).
Would using xml:space='preserve' on the script element help?
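The point about CDATA can be checked with a small Java sketch (the class name is made up, and it uses the JDK's default DOM parser rather than any particular XSLT engine). A CDATA section and plain text with the same characters give the same text content, line breaks included, so the parser itself does not collapse the script to one line.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class CdataDemo {
    // Parses the XML and returns the text content of the root element.
    static String rootText(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
    }

    public static void main(String[] args) {
        String withCdata = rootText("<script><![CDATA[// note\ndoStuff();\n]]></script>");
        String plain = rootText("<script>// note\ndoStuff();\n</script>");
        System.out.println(withCdata.equals(plain)); // true
        System.out.println(withCdata.contains("\n")); // true
    }
}
```

If the line breaks are gone in what the browser receives, something between the file and the browser removed them, which matches the HTTP server explanation in the update at the top of the question.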