ColdFusion character encoding issue - coldfusion

Let me start out by saying my end goal (in case I'm incorrect in any of my assumptions or understanding of character encoding): I want the characters of my webpages to display correctly, even if I paste in unique characters from Word or other programs.
Hopefully this can work most times without having to pass it through some encoding function on the server-side. I'm always working with US English and pasting from Word, so it doesn't seem like it should be that difficult.
It seems to me that ColdFusion is messing up my character encoding and displaying apostrophes (probably pasted from Word) as ’. My page works as a .htm file, but as soon as I rename the extension to .cfm, it doesn't. See here:
http://www.viktorwithak.com/Temp/utf8.htm
http://www.viktorwithak.com/Temp/utf8.cfm
Here's what I'm seeing, in case you get anything different:
http://i.imgur.com/tA4p1yc.png
(Looks like the one that works says Content-Type is text/html, and the one that doesn't has Content-Type: text/html; charset=UTF-8)
Suggestions? (Am I correct that UTF-8 should be the correct encoding?)
Edit: for convenience, my source for both files:
<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<title>utf-8</title>
</head>
<body>
<p>Make sure you don’t pack your toiletries or clothes for your travel day.</p>
</body>
</html>

What editor are you using? If you're using CFBuilder2+, the BOM should make everything UTF-8 by default. If you're not, and your editor has no setting for writing a BOM, then you'll need to use this as the first line of your .cfm file:
<cfprocessingdirective pageEncoding="utf-8">
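To illustrate the symptom from the question (this is not part of the original answer): the sequence ’ is what the UTF-8 bytes of a curly apostrophe look like when they are read as Windows-1252, which is presumably what happens when a BOM-less .cfm file is parsed with a non-UTF-8 default page encoding. A minimal Python sketch; the variable names are made up for the demonstration.

```python
# The curly apostrophe that Word inserts is U+2019 (right single quotation mark).
apostrophe = "\u2019"

# Saved as UTF-8 it becomes three bytes: E2 80 99.
utf8_bytes = apostrophe.encode("utf-8")

# Reading those bytes as Windows-1252 yields three separate characters.
misread = utf8_bytes.decode("cp1252")

# Reading them as UTF-8 recovers the original character.
correct = utf8_bytes.decode("utf-8")

print(utf8_bytes.hex(), repr(misread), repr(correct))
```

If the page shows the three-character sequence, the bytes on disk are most likely fine and only the declared or assumed encoding is wrong.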

Related

How to find and replace box character in text file?

I have a large text file that I'm going to be working with programmatically but have run into problems with a special character strewn throughout the file. The file is way too large to scan by eye for specific characters. Most of the other unwanted special characters I've been able to get rid of using a regex pattern. But there is a box character, similar to "□". When I tried to copy the character from the actual text file and paste it here I got "�", so the example of the box is from the Windows character map, which includes the code 'U+25A1'. I'm not sure how to interpret that code or whether it's something I could use for a regex search.
Would anyone know how I could search for the box symbol similar to "□" in a UTF-8 encoded file?
EDIT:
Here is an example from the text file:
"� Prune palms when flower spathes show, or delay pruning until after the palm has finished flowering, to prevent infestation of palm flower caterpillars. Leave the top five rows."
The only problem is that, as mentioned in the original post, the square gets converted into a diamond question mark.
It's unclear where and how you are searching, although you could use the hex equivalent:
\x{25A1}
Example:
https://regex101.com/r/b84oBs/1
The black diamond with a question mark is not a character, per se. It is what a browser spits out at you when you give it unrecognizable bytes.
Find out where that data is coming from.
Determine its encoding. (Usually UTF-8, but might be something else.)
Be sure the browser is configured to display that encoding. Putting <meta charset=UTF-8> in the head of the page is likely to suffice.
I found a workaround using Notepad++ and this website. It's still not clear what encoding system the square is originally from, but when I post it into the query field in the website above or into the Notepad++ Conversion Table (Plugins > Converter > Conversion Table) it gives the hex-character code for the "Replacement Character" which is the diamond with the question mark.
Using this code in a regex expression, \x{FFFD}, within Notepad++ search gave me all the squares, although recognizing them as the Replacement Character.
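The same search can be sketched outside Notepad++. The Python snippet below is an illustration only: the sample line is shortened from the excerpt in the question, and the byte 0x95 (the bullet in Windows-1252) is just an assumed example of a byte that is invalid on its own in UTF-8, since the file's real original encoding is unknown.

```python
import re

# Hypothetical sample line; its first character is U+FFFD, as in the excerpt.
line = "\ufffd Prune palms when flower spathes show."

# Match either the white square (U+25A1) or the replacement character (U+FFFD).
pattern = re.compile("[\u25a1\ufffd]")

count = len(pattern.findall(line))
cleaned = pattern.sub("", line).lstrip()

# Bytes that are not valid UTF-8 become U+FFFD when decoded leniently,
# which is one way such characters end up in a "UTF-8" file.
decoded = b"\x95 bullet".decode("utf-8", errors="replace")

print(count, cleaned, repr(decoded))
```

Once a character has been replaced by U+FFFD the original byte is gone, so recovering the intended symbol means going back to the source data.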

Detect charset of file dynamically in c++

I am trying to read a file which may have any charset/code page, but I don't know which locale to set in order to read the file correctly.
Below is my code snippet in which I am trying to read a file having charset as windows-1256, but I want to get the charset dynamically from the file being read so that I can set the locale accordingly.
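std::wifstream input{ filename.c_str() };
// Imbue the locale before reading; ".1256" selects the Windows-1256 code page.
input.imbue(std::locale(".1256"));
// Read the whole stream into a wide string.
std::wstring content{ std::istreambuf_iterator<wchar_t>(input), std::istreambuf_iterator<wchar_t>() };
contents = ws2s(content); // Convert wstring to CString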
In general, this is impossible to do accurately using the content of a plain text file alone. Usually you should rely on some external information. For example, if the file was downloaded with HTTP, the encoding should be received within a response header.
Some files may contain information about the encoding as specified by the file format. XML for example: <?xml version="1.0" encoding="XXX"?>.
Unicode encodings can be detected if the file starts with a Byte Order Mark - which is optional.
You can usually assume that the encoding uses a wide character if the file contains a zero byte - which would represent the string terminator as a narrow character - before the end of the file. Likewise if you find two consecutive zeroes aligned to a 2 byte boundary (before the end), then the encoding is probably 4 bytes wide.
Other than that, you could try to guess the encoding based on the frequency of certain characters. This can have some unintended consequences.
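The BOM and zero-byte checks described above can be sketched as follows. This is a rough Python illustration, not a reliable detector: the function name `sniff_encoding` and the label it returns for the zero-byte case are invented here, and a file with no BOM and no zero bytes still yields no answer.

```python
import codecs

# UTF-32 is checked before UTF-16 because the UTF-32 LE BOM (FF FE 00 00)
# begins with the UTF-16 LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_encoding(data: bytes):
    """Return an encoding name if a BOM is present, a hint if zero bytes
    suggest a wide encoding, or None when nothing can be concluded."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    if b"\x00" in data:
        return "wide (unknown)"
    return None


print(sniff_encoding(codecs.BOM_UTF8 + b"abc"))
print(sniff_encoding(b"plain ascii"))
```

A result of None is the common case for legacy code pages such as windows-1256, which is why external information about the encoding is still needed.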
Let me be blunt and say: you can't.
Let me qualify that: a file is simply tons of 0's and 1's stuck on your disk. A charset is a way to interpret these 0's and 1's. You have to provide the information on how to interpret them, namely, by specifying a charset.
A typical way of doing that is by writing a header to specify the charset.
This is an HTML header:
<head>
<title>Page Title</title>
<meta charset="UTF-8">
</head>
As you can see, the charset must be specified one way or another.
Once in a while, you do see some rogue application guessing a charset. They often do so with heuristics on the distribution of bytes, but that is not reliable and often results in gibberish.
As a side note, try to use UTF-8 everywhere; the others are, to put it lightly, messy.
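To make the point about interpretation concrete, here is a small Python sketch showing the same bytes read under two charsets. The sample word is arbitrary and the variable names are made up; both decodes succeed without error, which is why the bytes alone cannot tell you which one was intended.

```python
# An arbitrary Arabic sample, stored as Windows-1256 (one byte per letter).
text = "مرحبا"
raw = text.encode("cp1256")

# Interpreted with the right charset, the text comes back intact.
as_1256 = raw.decode("cp1256")

# Interpreted as Latin-1, the very same bytes decode without any error,
# but into unrelated accented Latin characters.
as_latin1 = raw.decode("latin-1")

print(len(raw), as_1256, as_latin1)
```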

Non-breaking space(" ") is not transformed correctly in IE9 and IE11 while doing XSLT transformation

We are doing client-side XSL transformation over documents with ISO-8859-1 encoding. This works fine in IE7 and IE8. However, when we run it in IE9 and IE11, the actual transformation works perfectly, but the non-breaking spaces (" ") are no longer maintained in the transformation.
Instead of the non-breaking space, it puts in the "replacement character" (renders as a question mark inside a dark diamond).
If we flip to Compatibility Mode, the non-breaking space is properly rendered as part of the transformation.
This seems to be a bug in the XSL processor that is in IE9 -- the non-breaking spaces should be transformed correctly.
Is there some way around this issue?
The image below shows the replacement character as rendered by IE9 and IE11.
Warning/error messages from IE11 console -
HTML1300: Navigation occurred.
File: Test
XML5001: Applying Integrated XSLT Handling.
HTML1524: Invalid HTML5 DOCTYPE. Consider using the interoperable form "<!DOCTYPE html>".
File: Test, Line: 3, Column: 1
HTML1114: Codepage utf-8 from (10) overrides conflicting codepage iso-8859-1 from (META tag)
File: Test
HTML1504: Unexpected end tag.
File: Test, Line: 380, Column: 1
HTML1504: Unexpected end tag.
File: Test, Line: 381, Column: 1
HTML1504: Unexpected end tag.
File: Test, Line: 476, Column: 1
SEC7115: :visited and :link styles can only differ by color. Some styles were not applied to :visited.
File: Test
HTML1114: Codepage utf-8 from (10) overrides conflicting codepage iso-8859-1 from (META tag)
Not sure why IE11 is overriding the encoding given by server.
It's an encoding issue. It seems likely to me that IE10/11 is trying to re-encode this in UTF-16 after it's already been encoded in ISO-8859-1.
I think your most likely fix for this would be to include a <meta charset="ISO-8859-1"> in your <head> tag in the HTML. I believe all versions of MSIE will respect this (or at least default to it if they ignore it). I've had similar issues before in a different browser-related context and that resolved it.
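One plausible reading of the HTML1114 message in the console log is that the ISO-8859-1 output is being decoded as UTF-8. Under that assumption, the symptom is easy to reproduce: a non-breaking space is the single byte 0xA0 in ISO-8859-1, and that byte on its own is not valid UTF-8. A minimal Python sketch with made-up variable names:

```python
nbsp = "\u00a0"  # non-breaking space
sample = "a" + nbsp + "b"

# In ISO-8859-1 the non-breaking space is the single byte 0xA0.
latin1_bytes = sample.encode("iso-8859-1")

# Decoding those bytes as UTF-8 fails on 0xA0; a lenient decoder
# substitutes the replacement character U+FFFD, as a browser does.
misread = latin1_bytes.decode("utf-8", errors="replace")

# In UTF-8 the same character takes two bytes: C2 A0.
utf8_bytes = sample.encode("utf-8")

print(latin1_bytes, repr(misread), utf8_bytes)
```

Plain ASCII text is identical in both encodings, which would explain why only the non-breaking spaces are affected.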

ColdFusion special unicode characters in the content returned by cfhttp

In the content retrieved with ColdFusion http object there are some characters that are returned as question marks; namely these are roman numerals (like Ⅱ) which are displayed without problems when I visit the same page with a browser.
The server I make the request to does not seem to provide any charset information in the response headers (the value of Content-Type is just "text/html" and the charset property in the result of cfhttp is blank), but the encoding is declared in the page's HTML as "charset=EUC-JP" (it is a page in Japanese). So I make the request with charset set to EUC-JP.
The content in Japanese (Japanese characters) is retrieved correctly, but the roman numerals are turned into question marks.
I tried requesting with charset set to UTF-8, but in this case everything gets scrambled. To me it seems that those roman numerals are Unicode, so my understanding is that the server I make the request to mixes encodings (but I may be wrong about this).
How do I get those special characters to display correctly in the fileContent of cfhttp?
Thanks!
The only way I can think of is to make two requests with different encodings and then merge the data together. The first request would use a charset of EUC-JP and the second would use UTF-8. After the second request, look through the content from the first, and for every question mark, look up the corresponding index in the second. For example, when you hit the 5th question mark in the first set of content, look for the 5th roman numeral in the second set. It's not guaranteed to work, but it's all I can think of.

Special characters in CFMail

I'm trying to auto-generate a plain text email with a trademark symbol in it. I've tried everything I can think of but it's still not going through.
<cfmail from="#x#" to="#y#" subject="test" charset="UTF-8">
™
™
#Chr(153)#
</cfmail>
This is an encoding issue.
You state the mail is encoded as UTF-8, but Chr(153) does not return a trademark symbol in Unicode. It does in Windows-1252, but Chr() works with Unicode code points.
Use Chr(8482) to nail it to the Unicode TM symbol.
I've found an info page that outlines the issue nicely.
By the way, writing the literal TM symbol works for me as well. But this assumes your .cfm files are in fact encoded as Windows-1252 and that the ColdFusion runtime is configured to expect this (both of which are the default on Windows systems, which is where I tested it; analogous rules apply to other systems). ColdFusion converts all strings to Unicode internally, so maybe something is broken in this chain of expectations in your set-up.
I think that this is not so much an issue with CFMail but rather an issue with email clients displaying the character codes in plain text messages literally rather than converting them to their corresponding characters.
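The difference between 153 and 8482 can be checked outside ColdFusion. The Python sketch below is only an analogy: Python's chr() also takes Unicode code points, so it behaves like the Chr() described above, and the variable names are made up.

```python
# 153 as a Unicode code point is U+0099, an invisible C1 control character.
from_code_point_153 = chr(153)

# 153 as a Windows-1252 byte (0x99) is the trademark sign.
from_cp1252_byte = bytes([153]).decode("cp1252")

# The trademark sign's Unicode code point is U+2122, which is 8482 in decimal.
trademark = chr(8482)

print(repr(from_code_point_153), from_cp1252_byte, trademark, hex(8482))
```

So 153 is a byte value from one particular code page, while 8482 is the code point that is valid regardless of the charset the mail is sent in.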
Using CFMail in HTML mode should provide the result you're looking for.