How to let jtidy not convert Chinese characters into html entities? - html-entities

I have some HTML to clean up with JTidy, and it contains some Chinese characters:
<font>怎么回事</font>
But the result looks like:
<font>&#24590;&#20040;&#22238;&#20107;</font>
How can I configure JTidy so that it does not convert Chinese characters into HTML entities?

tidy.setInputEncoding("utf-8");
tidy.setOutputEncoding("utf-8");
Or whatever encodings your input and output actually use.

See this converter:
http://www.pinyin.info/tools/converter/chars2uninumbers.html
It uses the following function to convert Chinese characters to Unicode numeric entities:
function convertToEntities() {
    var tstr = document.form.unicode.value;
    var bstr = '';
    // Declare i locally so it does not leak into the global scope.
    for (var i = 0; i < tstr.length; i++) {
        if (tstr.charCodeAt(i) > 127) {
            // Non-ASCII code unit: emit a decimal numeric entity.
            bstr += '&#' + tstr.charCodeAt(i) + ';';
        } else {
            bstr += tstr.charAt(i);
        }
    }
    document.form.entity.value = bstr;
}

Related

Encoding Vietnamese characters from ISO88591, UTF8, UTF16BE, UTF16LE, UTF16 to Hex and vice versa using C++

I have edited my post. What I'm currently trying to do is take an input string from the user, encode it, and convert the result to hex. This works properly as long as the string does not contain any Vietnamese characters.
For example, it works if my inputString is "Hello". But when the input is a string such as "Tôi", I don't know how to do it.
enum Encodings { USASCII, ISO88591, UTF8, UTF16BE, UTF16LE, UTF16, BIN, OCT, HEX };
switch (Encodings)
{
case USASCII:
ASCIIToHex(inputString, &ascii); //hello output 48656C6C6F
return new ByteField(ascii.c_str());
case ISO88591:
ASCIIToHex(inputString, &ascii);//hello output 48656C6C6F
//tôi output 54F469
return new ByteField(ascii.c_str());
case UTF8:
ASCIIToHex(inputString, &ascii);//hello output 48656C6C6F
//tôi output 54C3B469
return new ByteField(ascii.c_str());
case UTF16BE:
ToUTF16(inputString, &ascii, Encodings);//hello output 00480065006C006C006F
//tôi output 005400F40069
return new ByteField(ascii.c_str());
case UTF16:
ToUTF16(inputString, &ascii, Encodings);//hello output FEFF00480065006C006C006F
//tôi output FEFF005400F40069
return new ByteField(ascii.c_str());
case UTF16LE:
ToUTF16(inputString, &ascii, Encodings);//hello output 480065006C006C006F00
//tôi output 5400F4006900
return new ByteField(ascii.c_str());
}
void StringUtilLib::ASCIIToHex(std::string s, std::string * result)
{
int n = s.length();
for (int i = 0; i < n; i++)
{
unsigned char c = s[i];
long val = long(c);
std::string bin = "";
while (val > 0)
{
(val % 2) ? bin.push_back('1') :
bin.push_back('0');
val /= 2;
}
reverse(bin.begin(), bin.end());
result->append(ConvertBinToHex(bin));
}
}
std::string ToUTF16(std::string s, std::string * result, int encodings) {
int n = s.length();
if (encodings == UTF16) {
result->append("FEFF");
}
for (int i = 0; i < n; i++)
{
int val = int(s[i]);
std::string bin = "";
while (val > 0)
{
(val % 2) ? bin.push_back('1') :
bin.push_back('0');
val /= 2;
}
reverse(bin.begin(), bin.end());
if (encodings == UTF16 || encodings == UTF16BE) {
result->append("00" + ConvertBinToHex(bin));
}
if (encodings == UTF16LE) {
result->append(ConvertBinToHex(bin) + "00");
}
}
}
std::string ConvertBinToHex(std::string str) {
long long temp = atoll(str.c_str());
int dec_value = 0;
int base = 1;
int i = 0;
while (temp) {
int last_digit = temp % 10;
temp = temp / 10;
dec_value += last_digit * base;
base = base * 2;
}
char hexaDeciNum[10];
while (dec_value != 0)
{
int temp = 0;
temp = dec_value % 16;
if (temp < 10)
{
hexaDeciNum[i] = temp + 48;
i++;
}
else
{
hexaDeciNum[i] = temp + 55;
i++;
}
dec_value = dec_value / 16;
}
str.clear();
for (int j = i - 1; j >= 0; j--) {
str = str + hexaDeciNum[j];
}
return str;
}
The question is completely unclear. To encode something you need an input right? So when you say "Encoding Vietnamese Character to UTF8, UTF16" what's your input string and what's the encoding before converting to UTF-8/16? How do you input it? From file or console?
And why on earth are you converting to binary and then to hex? You can print directly to binary and hex from the bytes, no need to convert from binary to hex. Note that converting to binary like that is fine for testing but vastly inefficient in production code. I also don't know what you mean by "But what if my letter is "Á" or "À" which is a Vietnamese letter I cannot get the value of it". Please show a minimal, reproducible example along with the input/output
But I think you just want to output the UTF encoded bytes from a string literal in the source code like "ÁÀ". In that case it isn't called "encoding a string" but just "outputting a string"
Both Á and À in Unicode can be represented by precomposed characters (U+00C1 and U+00C0) or combining characters (A + U+0301 ◌́/U+0300 ◌̀). You can switch between them by selecting "Unicode dựng sẵn" or "Unicode tổ hợp" in Unikey. Suppose you have those characters in string literal form then std::string str = "ÁÀ" contains a series of bytes that corresponds to the above letters in the source file encoding. So depending on which encoding you save the *.cpp file as (CP1252, CP1258, UTF-8...), the output byte values will be different
To force UTF-8/16/32 encoding you just need to use the u8, u and U prefixes respectively, along with the correct type (char8_t, char16_t, char32_t or std::u8string/std::u16string/std::u32string)
std::u8string utf8 = u8"ÁÀ";
std::u16string utf16 = u"ÁÀ";
std::u32string utf32 = U"ÁÀ";
Then just use c_str() to get the underlying buffers and print the bytes. std::u8string is only available from C++20, so with earlier standards just save the file as UTF-8 and use std::string. For a user-input string, note that std::cin reads plain char data in whatever encoding the console uses, so those bytes still have to be converted to the encoding you want to print
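As a minimal sketch of "print the bytes", the helpers below dump a byte string or a sequence of UTF-16 code units as hex directly, with no detour through binary. The names bytesToHex and utf16ToHex are made up for this illustration, and the test strings use explicit escapes so the result does not depend on the encoding the source file is saved in.

```cpp
#include <cstdint>
#include <string>

// Append one byte to `out` as two uppercase hex digits.
static void appendHexByte(std::string& out, std::uint8_t byte) {
    static const char digits[] = "0123456789ABCDEF";
    out.push_back(digits[byte >> 4]);
    out.push_back(digits[byte & 0x0F]);
}

// Hex dump of a byte string (for example UTF-8 or ISO-8859-1 data).
std::string bytesToHex(const std::string& bytes) {
    std::string out;
    for (unsigned char c : bytes) appendHexByte(out, c);
    return out;
}

// Hex dump of UTF-16 code units. `bigEndian` selects the byte order and
// `withBom` prepends U+FEFF in that same byte order.
std::string utf16ToHex(const std::u16string& units, bool bigEndian,
                       bool withBom = false) {
    std::string out;
    auto emit = [&](char16_t u) {
        const std::uint8_t hi = static_cast<std::uint8_t>(u >> 8);
        const std::uint8_t lo = static_cast<std::uint8_t>(u & 0xFF);
        if (bigEndian) { appendHexByte(out, hi); appendHexByte(out, lo); }
        else           { appendHexByte(out, lo); appendHexByte(out, hi); }
    };
    if (withBom) emit(static_cast<char16_t>(0xFEFF));
    for (char16_t u : units) emit(u);
    return out;
}
```

With "Tôi" written as "T\xC3\xB4" "i" (UTF-8) and u"T\u00F4i" (UTF-16), these produce the values listed in the question: 54C3B469, 005400F40069 for big-endian and 5400F4006900 for little-endian.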
Edit:
To convert between UTF encodings use the standard std::codecvt, std::wstring_convert, std::codecvt_utf8_utf16... (note that std::wstring_convert and the <codecvt> facets are deprecated since C++17)
Working on non-Unicode encodings is trickier and needs some external library like ICU or OS-dependent APIs:
WideCharToMultiByte and MultiByteToWideChar on Windows
iconv on Linux
Limiting to ISO-8859-1 makes it easier but you still need many lookup tables, and there's no way to convert other encodings to ASCII without loss of information
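For the one direction ISO-8859-1 to UTF-8, a small sketch is possible without any table, because every ISO-8859-1 byte value is equal to the Unicode code point of the character it represents. The function name latin1ToUtf8 is invented for this example; other legacy code pages such as CP1258 do not have this property and still need a table or a library.

```cpp
#include <string>

// Convert ISO-8859-1 bytes to UTF-8.
// Bytes below 0x80 are ASCII and are copied unchanged; bytes 0x80..0xFF
// become the two-byte UTF-8 sequence for the code point of the same value.
std::string latin1ToUtf8(const std::string& latin1) {
    std::string utf8;
    for (unsigned char c : latin1) {
        if (c < 0x80) {
            utf8.push_back(static_cast<char>(c));
        } else {
            utf8.push_back(static_cast<char>(0xC0 | (c >> 6)));   // lead byte
            utf8.push_back(static_cast<char>(0x80 | (c & 0x3F))); // continuation
        }
    }
    return utf8;
}
```

For "Tôi" stored as the ISO-8859-1 bytes 54 F4 69, this yields the UTF-8 bytes 54 C3 B4 69.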
-64 is the correct representation of À if you are using signed char and CP1258. If you want a positive number you need to cast to unsigned char first.
If you are indeed using CP1258, you are probably on Windows. To convert your input string to UTF-16, you probably want to use a Windows platform API such as MultiByteToWideChar which accepts a code page parameter (of course you have to use the correct code page). Alternatively you may try a standard function like mbstowcs but you need to set up your locale correctly before using it.
You might find it easier to switch to wide characters throughout your application, and avoid most transcoding.
As a side note, converting an integer to binary only to convert that to hexadecimal is not an easy or efficient way to display a hexadecimal representation of an integer.
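The signed-versus-unsigned point can be shown in a few lines. This is only a sketch, and byteValue is a hypothetical helper name: it returns the value of a char as a number from 0 to 255 whether or not plain char is signed on the platform.

```cpp
// Return the numeric value (0..255) of a char, independent of whether
// plain char is signed or unsigned on this platform.
int byteValue(char c) {
    return static_cast<unsigned char>(c);
}
```

A char holding the byte 0xC0 (À in CP1258) reads as -64 when char is signed, while byteValue gives 192.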

C++: converting a hex string to extended ASCII does not show the correct characters in the text file

How do I convert a hex string to extended ASCII characters and write the converted characters to a text file?
Example input string:
std::string strInput = "FF2139FF";
Example output string should be "ÿ!9ÿ" in the text file.
I tried to write the program as below to write to a text file.
#include <cstdio>   // fopen, fputs, fclose
#include <cstdlib>  // strtol
#include <string>
using namespace std;
string ConvertHexStringToAsciiString(string sInputHexString, int step)
{
int len = sInputHexString.length();
string sOutputAsciiString;
for (int i = 0; i < len; i += step)
{
string byte = sInputHexString.substr(i, step);
char chr = (char)(int)strtol(byte.c_str(), nullptr, 16);
sOutputAsciiString.push_back(chr);
}
return sOutputAsciiString;
}
int main()
{
string sInputHexString = "FF2139FF";
string sOutputAsciiString = "";
sOutputAsciiString = ConvertHexStringToAsciiString(sInputHexString, 2);
const char* sFileName = "E:\\MyProgramDev\\Convert_HexString_To_AsciiCode\\Convert_HexString_To_AsciiCode\\TestFolder\\1.txt";
FILE* file = fopen(sFileName, "wt");
if (nullptr != file)
{
fputs(sOutputAsciiString.c_str(), file);
fclose(file);
}
}
It seems to work, but when I open the text file 1.txt with Notepad, I cannot see the ÿ; only !9 is displayed. I am not sure whether I need to do something to display it correctly in Notepad or whether my code is wrong.
Thanks.
Use a better text editor or, even better, any hex editor to view the result.
Try, for example, the XVI 32 hex editor.
I found a way to do this: split the hex string FF into two BYTE (unsigned char) values "F" and "F", then combine them and convert the result to decimal. It then shows the correct letter.
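One way to take the editor out of the picture is to check the bytes themselves. The sketch below (function names are hypothetical) decodes the hex string and writes the result in binary mode. The byte 0xFF is ÿ in ISO-8859-1 and Windows-1252, so whether an editor shows ÿ depends on which encoding it assumes for the file, which is an assumption about the viewer rather than something this code controls.

```cpp
#include <cstdio>
#include <stdexcept>
#include <string>

// Decode a hex string such as "FF2139FF" into raw bytes.
std::string hexToBytes(const std::string& hex) {
    if (hex.size() % 2 != 0) {
        throw std::invalid_argument("hex string must have an even length");
    }
    std::string bytes;
    for (std::size_t i = 0; i < hex.size(); i += 2) {
        // std::stoi throws std::invalid_argument if the pair is not hex.
        bytes.push_back(static_cast<char>(std::stoi(hex.substr(i, 2), nullptr, 16)));
    }
    return bytes;
}

// Write the bytes unchanged; "wb" avoids text-mode newline translation.
bool writeBinaryFile(const char* path, const std::string& bytes) {
    std::FILE* file = std::fopen(path, "wb");
    if (!file) return false;
    const bool ok = std::fwrite(bytes.data(), 1, bytes.size(), file) == bytes.size();
    return std::fclose(file) == 0 && ok;
}
```

For "FF2139FF" the decoded string holds the four bytes FF 21 39 FF, which a hex editor should show regardless of how a text editor renders them.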

Replace non printable character with octal representation in C++/CLI

I need to replace any non printable character with its octal representation using C++/CLI. There are examples using C# which require lambda or linq.
// There may be a variety of non printable characters, not just the example ones.
String^ input = "\vThis has internal vertical quote and tab \t\v";
Regex::Replace(input, "\\p{Cc}", ??? );
// desired output string = "\013This has internal vertical quote and tab \011\013"
Is this possible with C++/CLI?
Not sure if you can do it inline. I've used this type of logic.
// tested
String^ input = "\042This has \011 internal vertical quote and tab \042";
String^ pat = "\\p{C}";
String^ result = input;
array<Byte>^ bytes;
Regex^ search = gcnew Regex(pat);
for (Match^ match = search->Match(input); match->Success; match = match->NextMatch()) {
bytes = Encoding::ASCII->GetBytes(match->Value);
int x = bytes[0];
result = result->Replace(match->Value
, "\\" + Convert::ToString(x,8)->PadLeft(3, '0'));
}
Console::WriteLine("{0} -> {1}", input, result);
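For comparison, here is a sketch of the same idea in plain standard C++ rather than C++/CLI, so it uses std::string instead of String^ and std::isprint instead of a regex. The function name escapeNonPrintable is made up for this example, and in the default "C" locale it also escapes every byte above 0x7E, which may or may not be what is wanted.

```cpp
#include <cctype>
#include <cstdio>
#include <string>

// Replace every non-printable byte with a backslash followed by its
// three-digit octal value; printable bytes are copied unchanged.
std::string escapeNonPrintable(const std::string& input) {
    std::string out;
    for (unsigned char c : input) {
        if (std::isprint(c)) {
            out.push_back(static_cast<char>(c));
        } else {
            char buf[5];  // backslash + 3 octal digits + terminating NUL
            std::snprintf(buf, sizeof buf, "\\%03o", static_cast<unsigned>(c));
            out += buf;
        }
    }
    return out;
}
```

A vertical tab (decimal 11) becomes \013 and a horizontal tab (decimal 9) becomes \011.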

Polish diacritical marks in libcurl (c++)

I have a problem with text that contains Polish diacritical marks (e.g. ą, ć, ę, ł, ń, ó, ś, ź, ż) obtained by libcurl from a server. I'm trying to display this text correctly in a Windows C++ console application.
I solved a similar problem, printing something like this to the console:
cout << "ąćęźół";
by switching the code page of my source file to DOS code page 852 (Central Europe). Unfortunately, that does not work for text coming from libcurl; I think it only works for text written directly in the code. Could you give me some helpful information? I have no idea how to resolve this issue.
Well, I've written a temporary solution to my problem. It works fine, but I'm not happy with this approach:
char* cpl(const char* input)
{
size_t length = strlen(input);
char* output = new char[length+1];
/* Order of the diacritics
Ą ą Ć ć Ę ę
Ł ł Ń ń Ó ó
Ś ś Ź ź Ż ż
*/
const size_t pld_in[] = {
0xA1,0xB1,0xC6,0xE6,0xCA,0xEA,
0xA3,0xB3,0xD1,0xF1,0xD3,0xF3,
0xA6,0xB6,0xAC,0xBC,0xAF,0xBF,
};
const size_t pld_out[] = {
0xA4,0xA5,0x8F,0x86,0xA8,0xA9,
0x9D,0x88,0xE3,0xE4,0xE0,0xA2,
0x97,0x98,0x8D,0xAB,0xBD,0xBE
};
for(size_t i = 0; i < length; i++)
{
bool modified = false;
for(size_t j = 0; j < 18; j++)
{
if(*(input + i) == (*(pld_in + j)) + 0xFFFFFF00)
{
*(output + i) = *(pld_out + j);
modified = true;
break;
}
}
if(!modified)
*(output + i) = *(input + i);
}
*(output + length) = 0x00;
return output;
}
Could you propose a better solution to this problem, one that does not require converting characters?
The content of the web page returned by libcurl will use the character set of the web page. What's likely happening here is that it's not the character set used by your "codeset", which I presume is the MS-Windows term for locale.
libcurl should let you look at the headers of the HTTP response that was received from the server. Look at the Content-Type: header, which will indicate which character set the returned text uses; then look up which codepage uses the same character set.
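Once the header value is in hand (libcurl can report it through curl_easy_getinfo with CURLINFO_CONTENT_TYPE), pulling out the character set is plain string handling. The sketch below covers only that parsing step; charsetFromContentType is a hypothetical helper name and the parsing is deliberately simple rather than a full implementation of the HTTP header grammar.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Extract the charset parameter from a Content-Type value such as
// "text/html; charset=ISO-8859-2". Returns "" when none is present.
std::string charsetFromContentType(const std::string& contentType) {
    // Search case-insensitively by working on a lowercase copy.
    std::string lower(contentType);
    std::transform(lower.begin(), lower.end(), lower.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });

    const std::string key = "charset=";
    std::size_t pos = lower.find(key);
    if (pos == std::string::npos) return "";
    pos += key.size();

    // The value ends at the next parameter separator or whitespace.
    const std::size_t end = contentType.find_first_of("; \t", pos);
    std::string value = contentType.substr(
        pos, end == std::string::npos ? std::string::npos : end - pos);

    // Strip optional surrounding double quotes.
    if (value.size() >= 2 && value.front() == '"' && value.back() == '"') {
        value = value.substr(1, value.size() - 2);
    }
    return value;
}
```

If the server reports ISO-8859-2, for example, that is the encoding the received bytes are in, and they need converting to the console's code page (852 in the question) before printing.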

How to send an SMS in hebrew with clickatell

How can I send an SMS in hebrew through Clickatell?
It arrives on the device as gibberish.
I couldn't find a working example, so I wrote my own. Try this:
UnicodeEncoding unicode = new UnicodeEncoding(true, false);
return string.Concat(unicode.GetBytes(val).Select(c => string.Format("{0:x2}", c)));
Is it in Unicode? If I remember correctly, they require Unicode text to be escaped into its hexadecimal representation. This should be in their docs.
However, when I did this I found out that this is not the only issue: many phones do not display Unicode characters properly.
Also, sending Unicode may cost more, since the message may be split into several parts.
Encode your message as Unicode; see this FAQ page for details.
I ran into the same issue: you need to encode to Unicode and then convert to hex. The strange thing is that you need to take the last value and move it to the front in order to get it to work. I found this out by comparing the results of my code against the output of their online tool.
private string ToUnicode(string val)
{
Encoding utf8 = Encoding.UTF8;
Encoding unicode = Encoding.Unicode;
byte[] utf8Bytes = utf8.GetBytes(val);
byte[] unicodeBytes = Encoding.Convert(utf8, unicode, utf8Bytes);
var result = ByteArrayToString(unicodeBytes);
result = result.Substring(result.Length - 2, 2) + result.Substring(0, result.Length - 2);
return result;
}
public static string ByteArrayToString(byte[] ba)
{
StringBuilder hex = new StringBuilder(ba.Length * 2);
foreach (byte b in ba)
hex.AppendFormat("{0:x2}", b);
return hex.ToString();
}
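A possible explanation for the "move the last value to the front" step, offered as a hedged reading rather than anything from Clickatell's documentation: Encoding.Unicode in .NET is little-endian UTF-16, and rotating the last byte to the front turns little-endian bytes into big-endian bytes only when every character has the same high byte, which is true for Hebrew letters (all in U+05xx). The C++ sketch below, with made-up function names, compares the two approaches.

```cpp
#include <string>

// Append one byte as two lowercase hex digits.
static void appendHex(std::string& out, unsigned byte) {
    static const char digits[] = "0123456789abcdef";
    out.push_back(digits[(byte >> 4) & 0x0F]);
    out.push_back(digits[byte & 0x0F]);
}

// Big-endian UTF-16 hex: high byte first, no byte order mark.
std::string utf16BeHex(const std::u16string& s) {
    std::string out;
    for (char16_t u : s) {
        appendHex(out, u >> 8);
        appendHex(out, u & 0xFF);
    }
    return out;
}

// Little-endian UTF-16 hex with the last byte moved to the front,
// mirroring the workaround described above.
std::string rotatedLeHex(const std::u16string& s) {
    std::string le;
    for (char16_t u : s) {
        appendHex(le, u & 0xFF);
        appendHex(le, u >> 8);
    }
    if (le.size() < 2) return le;
    return le.substr(le.size() - 2) + le.substr(0, le.size() - 2);
}
```

For the Hebrew word u"\u05E9\u05DC\u05D5\u05DD" both functions return 05e905dc05d505dd, but for mixed text such as u"A\u05E9" they differ (004105e9 versus 054100e9), so encoding as big-endian directly looks like the safer choice.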
I used the following logic for Arabic. It needs more testing. The language is VB.NET.
If txtArabic.Text.Trim.Length > 0 Then
Dim unicodeString As String = txtArabic.Text
Dim unicode As Encoding = Encoding.Unicode
' Convert the string into a byte array.
Dim unicodeBytes As Byte() = unicode.GetBytes(unicodeString)
Dim sb As String = ToUnicode(txtArabic.Text)
End If
Here is the conversion part
Private Function ToUnicode(ByVal strVal As String)
Dim unicode As Encoding = New UnicodeEncoding(True, False)
' Encoding.Unicode
Dim utf8 As Encoding = Encoding.UTF8
Dim utf8Bytes() As Byte = unicode.GetBytes(strVal)
Dim unicodeBytes() As Byte = Encoding.Convert(utf8, unicode, utf8Bytes)
Dim result As String = ByteArrayToString(unicodeBytes)
Return result
End Function
Private Function ByteArrayToString(ByVal ba() As Byte)
Dim hex As StringBuilder = New StringBuilder(ba.Length)
For i As Integer = 0 To ba.Length - 1
If (((i - 2) Mod 4.0) = 0) Then
Else
hex.AppendFormat("{0:x00}", ba(i))
' hex.Append(ba(i))
End If
' hex.Append("-")
Next
Return hex.ToString
End Function