PHP: How to get rid of � symbol inside text? - regex

I can't figure out how to remove this � symbol from a string.
The string is in UTF-8 format.
What should I do? :(
This removes the whole string:
preg_replace('/\W/','',utf8_decode(substr(utf8_encode($ad['description']),0,125)))
Thanks ;)
Update:
I am sending this header:
header('Content-Type: text/html; charset=utf-8');
and I call exit() right after the replacement.

U+FFFD REPLACEMENT CHARACTER is shown when a character has no representation in the current character encoding, for example when a byte sequence cannot be decoded in the declared charset. Declare your encoding properly as UTF-8, use UTF-8 strings throughout, and it will not show up on most platforms.

The problem here is that your string is not actually in UTF-8. You treat it as if it were and handle the data accordingly, but the string probably contains ANSI (single-byte code page) characters. It is not enough to send the Content-Type: text/html; charset=utf-8 header; the content itself also needs to be converted to UTF-8 before it is sent.

You could try utf8_decode('string'); or utf8_encode('string');
But you should really try to find the actual problem: make sure the headers are set correctly, the document type is right, and the text is encoded in the right format when it is saved.
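One way to find out whether a string really is UTF-8, as the answers above suggest checking, is to validate its bytes. The following is only a sketch in C++ rather than PHP; the function name is_valid_utf8 is made up for illustration. Bytes that fail this kind of check are what a browser typically renders as �.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: returns true if `s` is well-formed UTF-8.
// Rejects stray continuation bytes, truncated sequences, overlong forms,
// surrogates, and values above U+10FFFF.
bool is_valid_utf8(const std::string& s) {
    std::size_t i = 0;
    const std::size_t n = s.size();
    while (i < n) {
        const unsigned char b0 = static_cast<unsigned char>(s[i]);
        std::size_t len = 0;
        unsigned long cp = 0;
        if (b0 < 0x80)                { len = 1; cp = b0; }
        else if ((b0 & 0xE0) == 0xC0) { len = 2; cp = b0 & 0x1F; }
        else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; }
        else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; }
        else return false;              // stray continuation or invalid lead byte
        if (i + len > n) return false;  // sequence cut off at the end of the string
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char b = static_cast<unsigned char>(s[i + k]);
            if ((b & 0xC0) != 0x80) return false;  // not a continuation byte
            cp = (cp << 6) | (b & 0x3F);
        }
        if (len == 2 && cp < 0x80) return false;     // overlong encodings
        if (len == 3 && cp < 0x800) return false;
        if (len == 4 && cp < 0x10000) return false;
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // surrogates
        if (cp > 0x10FFFF) return false;
        i += len;
    }
    return true;
}
```

A single-byte ü (0xFC, as in ISO-8859-1 or Windows-1252) fails this check, and so does a multibyte sequence that was cut in half by a byte-based truncation.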

Related

Converting Hexadecimal(\x) in a string to unicode (\u)

I'm currently running into a problem.
I'm getting a string from a URL and decoding it with curl_easy_unescape, which gives me a decoded string. So far so good.
Here is where the problem is. When the URL contains the escaped "counterpart" of ü, curl_easy_unescape turns it into \xfc, so my string now contains \xfc.
I need it as a "ü".
I need a literal "ü" in my string, otherwise I get an error that my string is not valid UTF-8. And I need it inside a string. For example:
"Hallü howre yoü"
With curl_easy_escape this turns into
"Hall\xfc+howre+you\xfc"
And I want to turn the \xfc back into "ü" or into "\u00fc".
The solutions I have tried:
Changing the \x to \u00. That would do the trick, but the replacement doesn't work.
Encoding the string as UTF-8.
Getting the decimal value of 0xFC and doing char = valueofFC.
I don't have a clue how to resolve this issue.
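A minimal sketch of one possible fix, assuming the \xfc byte is ISO-8859-1 (Latin-1): Latin-1 byte values are identical to the Unicode code points U+0000 to U+00FF, so every byte of 0x80 or above can be rewritten as a two-byte UTF-8 sequence. The function name latin1_to_utf8 is made up for illustration.

```cpp
#include <string>

// Hypothetical helper: re-encodes ISO-8859-1 bytes as UTF-8.
// ASCII bytes are copied unchanged; bytes >= 0x80 become two bytes.
std::string latin1_to_utf8(const std::string& in) {
    std::string out;
    out.reserve(in.size() * 2);
    for (char ch : in) {
        const unsigned char c = static_cast<unsigned char>(ch);
        if (c < 0x80) {
            out.push_back(static_cast<char>(c));
        } else {
            out.push_back(static_cast<char>(0xC0 | (c >> 6)));    // lead byte 110xxxxx
            out.push_back(static_cast<char>(0x80 | (c & 0x3F)));  // continuation 10xxxxxx
        }
    }
    return out;
}
```

With this, the single byte 0xFC becomes the two bytes 0xC3 0xBC, which is "ü" in UTF-8. If the input is in some other single-byte code page, this mapping does not apply.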

UTF 8 encoded Japanese string in XML

I am trying to create a SOAP call with a Japanese string. The problem is that when I encode this string as UTF-8, it contains many control characters (e.g. 0x1B (Esc)). If I remove all such control characters to make it a valid SOAP call, the Japanese content appears as garbage on the server side.
How can I create a valid SOAP request for Japanese characters? Any suggestion is highly appreciated.
I am using C++ with MS-DOM.
With Best Regards.
If I remember correctly it's true: most of the first 32 Unicode code points (everything except tab, line feed, and carriage return) are not allowed as characters in XML 1.0 documents, even escaped with &#. I'm not sure whether they're allowed in HTML or not, but the server certainly thinks they're not allowed in your requests, and it gets the only meaningful vote.
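For reference, the set of code points that XML 1.0 accepts (its Char production) can be written as a small predicate. This is a sketch; the name is_xml10_char is made up for illustration.

```cpp
// Hypothetical helper: true if `cp` matches the XML 1.0 Char production:
// #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
bool is_xml10_char(char32_t cp) {
    return cp == 0x9 || cp == 0xA || cp == 0xD ||
           (cp >= 0x20 && cp <= 0xD7FF) ||
           (cp >= 0xE000 && cp <= 0xFFFD) ||
           (cp >= 0x10000 && cp <= 0x10FFFF);
}
```

ESC (U+001B) falls outside every range, which is why it cannot appear in the request, escaped or not.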
I notice that your document claims to be encoded in iso-2022-jp, not utf-8. And indeed, the sequence of characters ESC $ B that appears in your document is valid iso-2022-jp. It indicates that the data is switching encodings (from ASCII to a 2-byte Japanese encoding called JIS X 0208-1983).
But somewhere in the process of constructing your request, something has seen that 0x1B byte and interpreted it as a character U+001B, not realising that it's intended as one byte in data that's already encoded in the document encoding. So, it has XML-escaped it as a "best effort", even though that's not valid XML.
Probably, whatever is serializing your XML document doesn't know that the encoding is supposed to be iso-2022-jp. I imagine it thinks it's supposed to be serializing the document as ASCII, ISO-Latin-1, or UTF-8, and the <meta> element means nothing to it (that's an HTML way of specifying the encoding anyway, it has no particular significance in XML). But I don't know MS-DOM, so I don't know how to correct that.
If you just remove the ESC characters from iso-2022-jp data, then you conceal the fact that the data has switched encodings, and so the decoder will continue to interpret all that 7nMK stuff as ASCII, when it's supposed to be interpreted as JIS X 0208-1983. Hence, garbage.
Something else strange -- the iso-2022-jp code to switch back to ASCII is ESC ( B, but I see |(B</font> in your data, when I'd expect the same thing to happen to the second ESC character as happened to the first: &#0x1B(B</font>. Similarly, $B#M#S(B and $BL#D+(B are mangled attempts to switch from ASCII to JIS X 0208-1983 and back, and again the ESC characters have just disappeared rather than being escaped.
I have no explanation for why some ESC characters have disappeared and one has been escaped, but it cannot be coincidence that what you generate looks almost, but not quite, like valid iso-2022-jp. I think iso-2022-jp is a 7 bit encoding, so part of the problem might be that you've taken iso-2022-jp data, and run it through a function that converts ISO-Latin-1 (or some other 8 bit encoding for which the lower half matches ASCII, for example any Windows code page) to UTF-8. If so, then this function leaves 7 bit data unchanged, it won't convert it to UTF-8. Then when interpreted as UTF-8, the data has ESC characters in it.
If you want to send the data as UTF-8, then first of all you need to actually convert it out of iso-2022-jp (to wide characters or to UTF-8, whichever your SOAP or XML library expects). Secondly you need to label it as UTF-8, not as iso-2022-jp. Finally you need to serialize the whole document as UTF-8, although as I've said you might already be doing that.
As pointed out by Steve Jessop, it looks like you have encoded the text as iso-2022-jp, not UTF-8. So the first thing to do is to check that and ensure that you have proper UTF-8.
If the problem still persists, consider encoding the text.
The simplest option is "hex encoding" where you just write the hex value of each byte as ASCII digits. e.g. the 0x1B byte becomes "1B", i.e. 0x31, 0x42.
If you want to be fancy you could use MIME or even UUENCODE.
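The hex encoding described above could be sketched like this. The function name hex_encode is made up for illustration, and the receiving side would need a matching decoder.

```cpp
#include <string>

// Hypothetical helper: writes each input byte as two uppercase hex digits.
std::string hex_encode(const std::string& bytes) {
    static const char digits[] = "0123456789ABCDEF";
    std::string out;
    out.reserve(bytes.size() * 2);
    for (char ch : bytes) {
        const unsigned char c = static_cast<unsigned char>(ch);
        out.push_back(digits[c >> 4]);    // high nibble
        out.push_back(digits[c & 0x0F]);  // low nibble
    }
    return out;
}
```

The output contains only ASCII digits and the letters A to F, so it passes through XML unchanged, at the cost of doubling the size.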

How to convert ISO-8859-1 to UTF-8 using libiconv in C++

I'm using libcurl to fetch some HTML pages.
The HTML pages contain some character references like: סלקום
When I read this using libxml2 I'm getting: ׳₪׳¨׳˜׳ ׳¨
Is it the ISO-8859-1 encoding?
If so, how do I convert it to UTF-8 to get the correct word?
Thanks
EDIT: I got the solution, MSalters was right, libxml2 does use UTF-8.
I added this to eclipse.ini
-Dfile.encoding=utf-8
and finally I got Hebrew characters on my Eclipse console.
Thanks
Have you seen the libxml2 page on i18n? It explains how libxml2 solves these problems.
You will get a ס from libxml2. However, you said that you get something like ׳₪׳¨׳˜׳ ׳¨. Why do you think that you got that? You get an xmlChar*. How did you convert that pointer into the string above? Did you perhaps use a debugger? Does that debugger know how to render an xmlChar*? My bet is that the xmlChar* is correct, but you used a debugger that cannot render the Unicode in an xmlChar*.
To answer your last question, an xmlChar* is already UTF-8 and needs no further conversion.
No. Those entities correspond to the decimal value of the Unicode code point of each of your characters. See this page for example.
You can therefore store your Unicode values as integers and use an algorithm to transform those integers into UTF-8 multibyte characters. See the UTF-8 specification for this.
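The integer-to-UTF-8 algorithm mentioned above can be sketched as follows. The function name encode_utf8 is made up for illustration. For example, ס is U+05E1, decimal 1505, which is the number that appears in its numeric character reference.

```cpp
#include <string>

// Hypothetical helper: encodes one Unicode code point as UTF-8.
// Returns an empty string for surrogates and values above U+10FFFF,
// which have no UTF-8 representation.
std::string encode_utf8(char32_t cp) {
    std::string out;
    if (cp < 0x80) {
        out.push_back(static_cast<char>(cp));
    } else if (cp < 0x800) {
        out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF) return out;  // surrogate range
        out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else if (cp <= 0x10FFFF) {
        out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    }
    return out;
}
```

Calling encode_utf8(1505) yields the two bytes 0xD7 0xA1. As the other answer notes, libxml2 already hands back UTF-8, so a step like this is only needed when you parse the numeric references yourself.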
This answer was given on the assumption that the encoded text is returned as UTF-16, which, as it turns out, isn't the case.
I would guess the encoding is UTF-16 or UCS-2. Specify this as the input encoding for iconv. There might also be an endianness issue, have a look here
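The C-style way would be (error checking omitted for clarity):
iconv_t ic = iconv_open("UTF-8", "UCS-2");  // arguments are (tocode, fromcode)
// iconv takes pointers to the buffer pointers (char*) and to the byte counts (size_t)
iconv(ic, &myUCS2_Text, &inputSize, &myUTF8_Text, &outputSize);
iconv_close(ic);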

asp-classic Request.Cookies brings this value "ϑ" for 1 cookie instead of "ÅÙÏ‘‹„‰Š„‹"

This happens with one cookie that has keys, and only for one of its keys.
The value should be "ÅÙÏ‘‹„‰Š„‹".
Erm, really? That looks like the corrupted, wrong-character set version to me! :-) Either way, “ϑ” is what you get when you save that string in Windows Western European encoding (cp1252) and then read it back in as UTF-8, removing all the ‘invalid character’ codes that result because it's not a valid UTF-8 string. So you've got a classic reading-and-writing-using-different-encodings problem.
As a general rule you can't get away with putting non-ASCII characters in a cookie (name or value) directly. You'll need an application-level encoding mechanism of some sort; one of the most popular ways is to URL-encode the UTF-8 representation of the characters you want, similarly to how JavaScript's encodeURIComponent does it.
(Unfortunately ASP classic has very poor support for handling Unicode.)
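As a sketch of the URL-encoding approach described above, written in C++ rather than ASP: the function below percent-encodes every byte of a UTF-8 string except the RFC 3986 unreserved characters. The name percent_encode is made up for illustration, and the set of characters left unescaped is slightly smaller than the one JavaScript's encodeURIComponent uses.

```cpp
#include <string>

// Hypothetical helper: percent-encodes the bytes of a UTF-8 string.
// Only A-Z, a-z, 0-9, '-', '_', '.', and '~' are left as they are.
std::string percent_encode(const std::string& utf8) {
    static const char digits[] = "0123456789ABCDEF";
    std::string out;
    for (char ch : utf8) {
        const unsigned char c = static_cast<unsigned char>(ch);
        const bool unreserved =
            (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') ||
            (c >= '0' && c <= '9') ||
            c == '-' || c == '_' || c == '.' || c == '~';
        if (unreserved) {
            out.push_back(static_cast<char>(c));
        } else {
            out.push_back('%');
            out.push_back(digits[c >> 4]);
            out.push_back(digits[c & 0x0F]);
        }
    }
    return out;
}
```

The result is pure ASCII, so it survives whatever code page the server or browser applies to the cookie.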
Final Solution:
Save As different file with "correct" encoding
Changed encoding
From "Unicode (UTF-8 with signature) -Codepage 65001"
To "Western European (Windows) - Codepage 1252"
We're using encoding on our cookies and some of the resulting characters can cause problems. So what we did is take the cookie string and encode it in HEX. - Problems Solved.

Rule for handling UTF-8 characters in cookie for CGI applications?

I was told to always URL-encode a UTF-8 string before placing it in a cookie.
So when a CGI application reads this cookie, it has to URL-decode the value to get the original UTF-8 string.
Is this the right way to handle UTF-8 characters in cookies?
Is there a better way to do this?
There is no one standard scheme for encapsulating Unicode characters into a cookie.
URL-encoding the UTF-8 representation is certainly a common and sensible way of doing it, not least because it can be read easily into a Unicode string from JavaScript (using decodeURIComponent). But there's no reason you couldn't choose some other scheme if you prefer.
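On the reading side, a CGI application written in C++ could recover the UTF-8 bytes with a decoder along these lines. This is a sketch; the names hex_value and percent_decode are made up for illustration. Like decodeURIComponent, it does not treat '+' as a space.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: value of one hex digit, or -1 if `c` is not a hex digit.
int hex_value(char c) {
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

// Hypothetical helper: turns each %XX escape back into the byte it stands for.
// A '%' that is not followed by two hex digits is copied through unchanged.
std::string percent_decode(const std::string& in) {
    std::string out;
    out.reserve(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        if (in[i] == '%' && i + 2 < in.size()) {
            const int hi = hex_value(in[i + 1]);
            const int lo = hex_value(in[i + 2]);
            if (hi >= 0 && lo >= 0) {
                out.push_back(static_cast<char>((hi << 4) | lo));
                i += 2;  // skip the two hex digits just consumed
                continue;
            }
        }
        out.push_back(in[i]);
    }
    return out;
}
```

The decoded bytes are the original UTF-8 string, so no further character set conversion is needed as long as the writer encoded UTF-8 in the first place.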
Generally, this is the easiest way. You could use another binary-to-text encoding such as base64, but note that the standard base64 alphabet includes '+', '/' and '=', which may themselves need escaping. In my view %uXXXX, where XXXX is the hex Unicode value, is most appropriate.