I have encountered some odd characters that do not display properly in Internet Explorer, such as these: “, –, and ’. I think they're carried over from copy-and-paste Word content.
I am using XSLT to build the page content, and it would be great to detect these characters in the XSLT and replace them with valid HTML codes. I already do string replacement in the stylesheet, but I'm not sure how to detect these encoded characters or whether it's even possible.
What about simply changing the encoding of the stylesheet, as well as its output, to UTF-8? The characters you mention are “, – and ’. They are certainly not invalid as such, given the correct encoding (at the very least, they are perfectly valid in Codepage 1252).
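For example, a minimal sketch of the relevant declarations (the html output method here is just an assumption; keep whatever method you already serialize to):

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

    <!-- Serialize the result as UTF-8 so the smart quotes and dashes come through intact -->
    <xsl:output method="html" encoding="UTF-8"/>

    <!-- ... your existing templates ... -->

</xsl:stylesheet>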
Using a good XML editor such as XMLSpy should highlight any formatting errors in your XSLT by validating it at development time.
Jeni Tennison's Multiple string replacements may be a good starting point.
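For reference, the usual shape of such a recursive replacement template in XSLT 1.0 looks roughly like this (a generic sketch of the idiom, not her exact code; the template and parameter names are made up):

<xsl:template name="replace-string">
    <xsl:param name="text"/>
    <xsl:param name="search"/>
    <xsl:param name="replace"/>
    <xsl:choose>
        <xsl:when test="contains($text, $search)">
            <xsl:value-of select="substring-before($text, $search)"/>
            <xsl:value-of select="$replace"/>
            <!-- Recurse over the remainder of the string -->
            <xsl:call-template name="replace-string">
                <xsl:with-param name="text" select="substring-after($text, $search)"/>
                <xsl:with-param name="search" select="$search"/>
                <xsl:with-param name="replace" select="$replace"/>
            </xsl:call-template>
        </xsl:when>
        <xsl:otherwise>
            <xsl:value-of select="$text"/>
        </xsl:otherwise>
    </xsl:choose>
</xsl:template>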
I have a large text file that I'm going to be working with programmatically, but I've run into problems with a special character strewn throughout the file. The file is far too large to scan by eye looking for specific characters. Most of the other unwanted special characters I've been able to get rid of using regex patterns, but there is a box character, similar to "□". When I try to copy the character from the actual text file and paste it here, I get "�", so the example of the box is from the Windows Character Map, which lists the code 'U+25A1'. I'm not sure how to interpret that or whether it's something I could use for a regex search.
Would anyone know how I could search for the box symbol similar to "□" in a UTF-8 encoded file?
EDIT:
Here is an example from the text file:
"� Prune palms when flower spathes show, or delay pruning until after the palm has finished flowering, to prevent infestation of palm flower caterpillars. Leave the top five rows."
The only problem is that, as mentioned in the original post, the square gets converted into a diamond question mark.
It's unclear where and how you are searching, although you could use the hex equivalent:
\x{25A1}
Example:
https://regex101.com/r/b84oBs/1
The black diamond with a question mark is not a character, per se. It is what a browser spits out at you when you give it unrecognizable bytes.
Find out where that data is coming from.
Determine its encoding. (Usually UTF-8, but might be something else.)
Be sure the browser is configured to display that encoding. A <meta charset=UTF-8> tag in the head of the page will likely suffice.
I found a workaround using Notepad++ and this website. It's still not clear what encoding system the square originally comes from, but when I paste it into the query field on the website above, or into the Notepad++ Conversion Table (Plugins > Converter > Conversion Table), it gives the hex character code for the "Replacement Character", which is the diamond with the question mark.
Using that code, \x{FFFD}, in a regex search within Notepad++ gave me all the squares, although it recognizes them as the Replacement Character.
I've been trying to prohibit users from entering double-quotes (") into some fields that are used in JSON strings, as they cause unexpected termination of values in the strings. Unfortunately, while the regex isn't hard to write, I can't get it to work within XPages.
I tried using both double-quotes alone and using the escape character. Both ways fail any string, not just ones including the double-quotes.
<xp:validateConstraint message="Please do not use double quotes in organization/vendor names">
<xp:this.regex><![CDATA['^[^\"]*$]]></xp:this.regex>
</xp:validateConstraint>
There must be a simple way around this issue.
I think you're running into issues with the regex property of your xp:validateConstraint validator. You seem to be attempting to strip the characters in xp:this.regex, whereas, as I believe the docs read, it should specify which characters are allowed. I might recommend checking out xp:customConverter (bias: I'm more familiar with the customConverter), which gives you the ability to alter the getAsObject and getAsString methods; then you can escape the undesired characters.
Here's what I'm thinking of, to strip them out. If you plug this into an XPage, you'll find that when the value is pulled (e.g. by the partial refresh), it converts the input content accordingly by stripping out quotes (both single and double, in my case).
<?xml version="1.0" encoding="UTF-8"?>
<xp:view xmlns:xp="http://www.ibm.com/xsp/core">
<xp:inputTextarea
id="inputTextarea1"
value="#{viewScope.myStuff}"
disableClientSideValidation="true">
<xp:this.converter>
<xp:customConverter>
<xp:this.getAsString><![CDATA[#{javascript:return value.replace(/["']/g, "");}]]></xp:this.getAsString>
<xp:this.getAsObject><![CDATA[#{javascript:return value.replace(/["']/g, "");}]]></xp:this.getAsObject>
</xp:customConverter>
</xp:this.converter>
</xp:inputTextarea>
<xp:button
value="Do Something"
id="button1">
<xp:eventHandler
event="onclick"
submit="true"
refreshMode="partial"
refreshId="computedField1" />
</xp:button>
<xp:text
escape="true"
id="computedField1"
value="#{viewScope.myStuff}" />
</xp:view>
My interaction with the above code yields the input with both single and double quotes stripped out.
Notice that for it to reflect in the refresh, I'm modifying both the getAsString and the getAsObject, since the viewScope'd object is updated during the refresh (a fact I had to remind myself of); saving to a text field in XPages, however, will get the value via getAsString (provided your data source knows it's a String-related field, e.g. a NotesXspDocument as document1, with a known Form, where the field is a Text field).
As the above comments alluded to, this filters the input values rather than escaping or validating them. You could also change my replace methods to escape the quotes instead, e.g. return value.replace(/"/g, '\\"').replace(/'/g, "\\'");.
Is the simple answer just to add a JavaScript function call on the submit button to remove the quotes?
A more elegant solution would be to not allow typing of the quote at all, by checking the keydown event and preventing the default action for that character code. The user should not be able to type one thing and then have it changed on them in processing.
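A rough sketch of that idea as an XPages control (the control id and value binding are made up for illustration; thisEvent is the event object XPages exposes to client-side scripts, and the key property of the event needs a reasonably modern browser, otherwise fall back to checking keyCode):

<xp:inputText id="inputTextVendor" value="#{viewScope.vendorName}">
    <xp:eventHandler event="onkeydown" submit="false">
        <xp:this.script><![CDATA[
            // Swallow the double-quote key before it ever reaches the field
            if (thisEvent.key === '"') {
                thisEvent.preventDefault();
            }
        ]]></xp:this.script>
    </xp:eventHandler>
</xp:inputText>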
@Eric McCormick recommends a customConverter, which in my opinion is a neat solution that I'd probably go for in many cases. Sometimes, however, we need to teach users to adhere to the rules, so we have to show them where they went wrong. That's when we may need a validator.
Playing around a bit, the simplest solution I came up with is an xp:validateExpression that simply looks for the first occurrence of a double quote within the String entered:
<xp:inputText
id="inputText1"
value="#{viewScope.testvalue}">
<xp:this.validators>
<xp:validateExpression
message="Hey, wait! Didn't I tell you not to use double quotes in here?">
<xp:this.expression><![CDATA[#{javascript:value.indexOf("\"")==-1}]]></xp:this.expression>
</xp:validateExpression>
</xp:this.validators>
</xp:inputText>
If that's a single occurrence in your application, that's it, really. If you need this and similar solutions all over the place, you might want to look into writing a small validator bean (Java), registering it via faces-config.xml, and then using it everywhere in your application, e.g. by using an xp:validator instead, as sketched below.
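A rough sketch of that registration and usage (the validator id and class name are hypothetical; the class would implement javax.faces.validator.Validator and throw a ValidatorException when it finds a double quote):

<!-- faces-config.xml -->
<faces-config>
    <validator>
        <validator-id>noDoubleQuotesValidator</validator-id>
        <validator-class>com.example.NoDoubleQuotesValidator</validator-class>
    </validator>
</faces-config>

<!-- then on any field -->
<xp:inputText id="inputText2" value="#{viewScope.testvalue}">
    <xp:this.validators>
        <xp:validator validatorId="noDoubleQuotesValidator"/>
    </xp:this.validators>
</xp:inputText>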
As suggested by @Tomalik and @sidyll, this is an attempt to solve the wrong problem. While each of the answers supplied does solve the problem of preventing the user from entering undesirable characters, it is better to encode those characters and so preserve the user's input. In this particular case, the intermediate step in providing the data to the user via a JSON string is to pull the value from a view.
So all I had to do was change the column formula to encode the string using the UTF-8 character set, and it displays the values with the "undesirable characters". The unencoded value is stored on the document so that old Notes access won't create confusion.
@URLEncode("UTF-8"; vendorName)
In one case, the JSON is computed as part of the form design, but the same solution works.
I have an XSLT transform that puts &#x202F; into my output. That is a narrow no-break space. Here is one section that results in the nnbsp:
<span>
<xsl:text>§ </xsl:text>
<xsl:value-of select="$firstsection"/>
<xsl:text> to </xsl:text>
<xsl:value-of select="$lastsection"/>
</span>
The nnbsp, in this case, comes in after the § and after the text "to".
<span>§ 1 to 8</span>
(interestingly, the space before the to turns out to be a regular full size space)
This occurs in my UTF-8 encoded output, as well as iso-8859-1 (latin1).
How can I avoid the nnbsp? While the narrow space is visually more appropriate, it doesn't work for all the devices that will read this document. I need a plain vanilla blank space.
Is there a transform setting? I use Saxon 9 at the command line.
Should I do another transform, using a replace template to replace the nnbsp?
Should I redo my templates like the one above? For example, if I used concat(), would that be better coding practice?
UPDATE: For those who may find this question someday... as suggested by Michael Kay, I researched the issue further. Indeed, it turns out narrow NBSPs were in the source XML files (and bled into my templates via cut/paste). I did not know this, and it was hard to discover (hat tip to gVim's hex view); the narrow spaces don't exactly jump out at you in a GUI editor. I have no control over production of the source XML, so I had to find a way to deal with it. Eric's answer below turned out to be my preferred way to scrub the narrow NBSPs. SED editing was (and is) another option to consider, but I like keeping my production in XSLT when possible, so Eric's suggestion has worked well for me.
You could use the translate() function to replace your nnbsp with something else, but since you are using Saxon 9 you can rely on XSLT 2.0 features and use a character map, which will do that kind of thing automatically for you, for instance (assuming that you want to replace them with a non-breaking space):
<xsl:output use-character-maps="nnbsp"/>
<xsl:character-map name="nnbsp">
    <xsl:output-character character="&#x202F;" string="&#xA0;"/>
</xsl:character-map>
Eric
The narrow non-breaking space is coming from somewhere: either the source document or the stylesheet. It's not being magically injected by the XSLT processor. If it's in the stylesheet, then get rid of it. If it's in the source document, then transform it away, for example by use of the translate() function.
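For instance, a minimal sketch of the translate() route, reusing the $firstsection variable from the question (this assumes the character really does arrive with the source value):

<!-- Swap any narrow no-break space (U+202F) from the source for a plain space -->
<xsl:value-of select="translate($firstsection, '&#x202F;', ' ')"/>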
In fact, pasting your code fragment into a text editor and looking at it in hex, I see that the 202F characters are right there in your code. I don't know how you got them into your stylesheet, but you should (a) remove them, and (b) work out how it happened so it doesn't happen again.
I am using tinyxml to save input from a text ctrl. The user can copy whatever they like into the text box and it gets written to an xml file. I'm finding that the new lines don't get saved and neither do & characters. The weird part is that tinyxml just discards them completely without any warning. If I put a & into the textbox and save, the tag will look like:
<textboxtext></textboxtext>
Newlines completely disappear as well; no characters whatsoever are stored. What's going on? Even if I need to escape them with &amp; or something, why does it just discard everything? Also, I can't find anything on Google regarding this topic. Any help?
EDIT:
I found this topic, which suggests the discarding of these characters may be a bug.
TinyXML and preserving HTML Entities
It is, apparently, a bug in TinyXml.
The simple workaround is to escape anything that it might not like:
&, ", ', < and > get their regular XML entity encodings
strange characters (read: anything that isn't alphanumeric or regular punctuation) are best translated to their Unicode code point: &#....;
Remember that TinyXml is, above all, a lightweight XML library, not a full-fledged beast.
I am trying to create a generic stylesheet that can convert all Latin characters in Unicode to uppercase ASCII characters. Using <xsl:character-map> works well except for one thing: namespaces. The character map converts all of my namespaces to upper case, which I do not want.
Is there a way to utilize a character map to do what I want to all the other nodes while leaving the namespaces untouched? I see the disable-output-escaping attribute might be an option, but I haven't been able to make it work.
Looks like this is an Oracle-specific issue. I'll probably post this on the Oracle Forums then.
Thanks for all the feedback!