Printing Japanese characters in PDF using libharu - C++

Currently I am using libharu to create a PDF file. The file contains some Japanese characters, which are first stored as UTF-8.
After that, I call HPDF_UseJPEncodings(m_pdf), HPDF_UseJPFonts(m_pdf) and m_fontStandard = HPDF_GetFont(m_pdf, "MS-Mincho", "90msp-RKSJ-H") to set up the encoding.
However, 90msp-RKSJ-H is a CMap encoding, not UTF-8. Does anyone know how to convert UTF-8 text into the encoding that 90msp-RKSJ-H expects?
Thank you

Why don't you refer to libharu's support group on Google?
Here's a link to the code you want:
https://groups.google.com/forum/?fromgroups=#!topic/libharu/YzXoH_K3OAI
I hope it solves your problem.
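In case that link goes stale: since 90msp-RKSJ-H expects Shift-JIS ("RKSJ") bytes rather than UTF-8, one approach is to convert the UTF-8 text to Shift-JIS before handing it to libharu. Below is a minimal sketch using iconv; the helper name is illustrative, error checking is omitted, and on some platforms iconv()'s input-buffer parameter is declared const char**.

#include <iconv.h>
#include <string>

// Convert UTF-8 text to Shift-JIS so it can be drawn with an
// RKSJ-encoded font such as "MS-Mincho" / "90msp-RKSJ-H".
std::string Utf8ToShiftJis(const std::string& utf8)
{
    iconv_t cd = iconv_open("SHIFT_JIS", "UTF-8");
    std::string sjis(utf8.size() * 2, '\0');   // generous output buffer
    char* in = const_cast<char*>(utf8.data());
    size_t inLeft = utf8.size();
    char* out = &sjis[0];
    size_t outLeft = sjis.size();
    iconv(cd, &in, &inLeft, &out, &outLeft);   // error checking omitted
    iconv_close(cd);
    sjis.resize(sjis.size() - outLeft);        // trim unused space
    return sjis;
}

// Usage sketch: the converted bytes are passed to the page as-is.
// HPDF_Page_SetFontAndSize(page, m_fontStandard, 12);
// HPDF_Page_ShowText(page, Utf8ToShiftJis(text).c_str());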

Related

Convert \xc3\xd8\xe8\xa7\xc3\xb4\xd to a human-readable format

I am having trouble converting '\xc3\xd8\xe8\xa7\xc3\xb4\xd' (which is Thai text) to a readable format. I get this value from a smart card; it basically worked on Windows but not on Linux.
If I print it in my Python console, I get:
����ô
I tried following some Google hints but could not accomplish my goal.
Any suggestions are appreciated.
Your text does not seem to be Unicode. Instead, it looks like it is in one of the Thai encodings, so you need to know the encoding before printing the text.
For example, if we assume your data is encoded in TIS-620 (and the last character is \xd2 instead of \xd) then it will be "รุ่งรดา".
To work with non-Unicode strings in Python, you may try myString.decode("tis-620"), or even sys.setdefaultencoding("tis-620").

ReportLab PDF encoding

I want to write some Turkish characters to a PDF with ReportLab.
I used the following code to do this:
c = Canvas("test.pdf")
data="ğçİöşü"
p = Paragraph(data.decode('utf-8'), style=styNormal)
but it doesn't show my data in the PDF.
Output:
■ç■ö■ü
As explained in this answer to a similar question, you need to use a font which supports your characters.
In short, try this:
from reportlab.pdfbase import pdfmetrics
from reportlab.pdfbase.ttfonts import TTFont
pdfmetrics.registerFont(TTFont('Verdana', 'Verdana.ttf'))
c.setFont("Verdana", 8)
Make sure your file is UTF-8 encoded, and I'd also recommend making sure the data variable holds a Unicode string by doing
data = u"ğçİöşü"

Read Chinese characters in DICOM files

I have just started to get a feel for the DICOM standard. I am trying to write a small program that reads a DICOM file and dumps the information to a text file. I have a dataset in which the patient names are in Chinese. How can I read and store these names?
Currently, I am reading the names as a char* from the DICOM file, converting this char* to a wchar_t* using code page 950 for Chinese, and writing the result to a text file. Instead of Chinese characters I see *, ? and % in my text file. What am I missing?
I am working in C++ on Windows.
If the text file contains UTF-16, have you included a BOM?
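For reference, writing a UTF-16LE BOM at the start of the file would look roughly like this (the file name is illustrative):

#include <fstream>

int main()
{
    std::ofstream out("names.txt", std::ios::binary);
    // UTF-16LE byte-order mark, so editors recognize the encoding.
    const unsigned char bom[] = { 0xFF, 0xFE };
    out.write(reinterpret_cast<const char*>(bom), sizeof(bom));
    // ... then write the UTF-16LE encoded text ...
}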
There may be multiple issues at hand.
First, do you know the character encoding of the Chinese name, e.g. Big5 or GB*? See http://en.wikipedia.org/wiki/Chinese_character_encoding
Second, do you know the encoding of your output text file? If it is ASCII, you probably won't ever be able to view the Chinese characters; in that case, I would suggest changing it to a Unicode encoding (e.g. UTF-8).
Then, when you read the Chinese name, convert the raw bytes and write out the result. For example, if the DICOM stores it in Big5, and your text file is UTF-8, you will need a Big5->UTF-8 converter.
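For example, assuming the names are Big5 (code page 950) and the output file is UTF-8, a Windows-only sketch of that conversion with MultiByteToWideChar/WideCharToMultiByte could look like this; the function name is illustrative and error handling is minimal:

#include <windows.h>
#include <string>

// Convert a Big5 byte string (code page 950) to UTF-8 so it can be
// written to a UTF-8 text file.
std::string Big5ToUtf8(const std::string& big5)
{
    // Big5 -> UTF-16
    int wlen = MultiByteToWideChar(950, 0, big5.c_str(), -1, nullptr, 0);
    std::wstring wide(wlen, L'\0');
    MultiByteToWideChar(950, 0, big5.c_str(), -1, &wide[0], wlen);

    // UTF-16 -> UTF-8
    int ulen = WideCharToMultiByte(CP_UTF8, 0, wide.c_str(), -1, nullptr, 0, nullptr, nullptr);
    std::string utf8(ulen, '\0');
    WideCharToMultiByte(CP_UTF8, 0, wide.c_str(), -1, &utf8[0], ulen, nullptr, nullptr);

    // Both lengths include the terminating NUL; trim it.
    utf8.resize(ulen > 0 ? ulen - 1 : 0);
    return utf8;
}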

Regex to parse an image data URI

If I have a string in the form:
data:image/x-icon;base64,iVBORw0KGgoAAAANSUhEUgAAABAAAAAQCAYAAAAf8/9hAAAABmJLR0QAAAAAAAD5Q7t/AAAA2UlEQVQ4y8WSvQvCMBDFX2rFUvuFSAUFBQfBwUXQVfFfFpzdRV2c7O5UKmihX9E6RZo2pXbyTbmX3C+5uwD/FskG+76WsvX65n/3Lm0pdU214HOAbHIWwvzeYPL1p4cT4QCi5DIxEINIdWt+Hs9cXAtg3UOkIJAUpT5ADiho8kbD0NG0LB6Q76xIevwCpW+0bBvj7Y5wgCpI148RBxTmYo7Z1RGPkSk/kc4jgme0oHoJlmFUOC+8lUEMN0ASvyBpGha++IXCJrJyKJGhjIalyZVyNqufP9j/9AH0S0vqrU+YMgAAAABJRU5ErkJggg==
What is the best regex I can use to parse these elements into an array (so I can write out the correct image)?
Update: I understand Base64 encoding, but the question is really how to parse these kinds of embedded icons in web pages, since I don't know whether people are using e.g. base62 ... or other image strings or even other formats to embed images. I also see examples in pages where the identifier is image/x-icon but the string actually contains a PNG.
UPDATE: just to give something back, the code where I used this is here: http://plugins.svn.wordpress.org/wp-favicons/trunk/filters/search/filter_extract_from_page.php
I still have some questions, e.g. whether only Base64 is used, but time will tell in practice.
Can you see the base64 at the beginning? You don't need a regex. You need to decode this Base64 string into a byte stream and then save it as an image.
I have now saved the following text into a file icon.txt:
iVBORw0KGgoAAAANSUhEUgAAABAAAAAQCAYAAAAf8/9hAAAABmJLR0QAAAAAAAD5Q7t
/AAAA2UlEQVQ4y8WSvQvCMBDFX2rFUvuFSAUFBQfBwUXQVfFfFpzdRV2c7O5UKmihX9E6RZo2pXbyTbmX3C+5uwD
/FskG+76WsvX65n
/3Lm0pdU214HOAbHIWwvzeYPL1p4cT4QCi5DIxEINIdWt+Hs9cXAtg3UOkIJAUpT5ADiho8kbD0NG0LB6Q76xIevwCpW+0bBvj7Y5wgCpI148RBxTmYo7Z1RGPkSk
/kc4jgme0oHoJlmFUOC+8lUEMN0ASvyBpGha++IXCJrJyKJGhjIalyZVyNqufP9j
/9AH0S0vqrU+YMgAAAABJRU5ErkJggg==
And processed:
base64 -d icon.txt > icon.png
and it shows a red heart icon, 16x16 pixels.
This is how you can decode it on the command line. Most programming languages offer good libraries to decode it directly in your program.
EDIT: If you use PHP, then have a look at base64_decode().
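If you do still want to split the string before decoding it, a small C++ sketch using std::regex might look like this; the URI and the pattern are illustrative, and real data URIs can also omit the ;base64 token:

#include <iostream>
#include <regex>
#include <string>

int main()
{
    // Illustrative data URI (payload truncated for brevity).
    std::string uri = "data:image/x-icon;base64,iVBORw0KGgoAAAANSUhEUg...";

    // Capture the MIME type, the encoding token, and the payload.
    std::regex re(R"(^data:([^;,]+);([^,]+),(.*)$)");
    std::smatch m;
    if (std::regex_match(uri, m, re)) {
        std::cout << "mime:     " << m[1] << "\n";
        std::cout << "encoding: " << m[2] << "\n";
        std::cout << "payload:  " << m[3].str().size() << " chars\n";
        // m[3] would then go to a Base64 decoder before being written to disk.
    }
}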

How to convert ISO-8859-1 to UTF-8 using libiconv in C++

I'm using libcurl to fetch some HTML pages.
The HTML pages contain some numeric character references like &#1505;&#1500;&#1511;&#1493;&#1501; (i.e. סלקום).
When I read this using libxml2 I'm getting: ׳₪׳¨׳˜׳ ׳¨
Is it the ISO-8859-1 encoding?
If so, how do I convert it to UTF-8 to get the correct word?
Thanks
EDIT: I found the solution. MSalters was right, libxml2 does use UTF-8.
I added this to eclipse.ini
-Dfile.encoding=utf-8
and finally I got Hebrew characters on my Eclipse console.
Thanks
Have you seen the libxml2 page on i18n? It explains how libxml2 solves these problems.
You will get a ס from libxml2. However, you said that you get something like ׳₪׳¨׳˜׳ ׳¨. Why do you think that is what you got? You get an xmlChar*. How did you convert that pointer into the string above? Did you perhaps use a debugger? Does that debugger know how to render an xmlChar*? My bet is that the xmlChar* is correct, but you used a debugger that cannot render the Unicode in an xmlChar*.
To answer your last question, an xmlChar* is already UTF-8 and needs no further conversion.
No. Those entities correspond to the decimal value of the Unicode code point of each character. See this page for example.
You can therefore store your Unicode values as integers and use an algorithm to transform those integers into UTF-8 multibyte sequences. See the UTF-8 specification for this.
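As an illustration of that algorithm, here is a small sketch that encodes a single code point (e.g. 1505 for ס) as UTF-8 bytes; it does not validate surrogate ranges or out-of-range values:

#include <cstdint>
#include <string>

// Encode one Unicode code point as a UTF-8 byte sequence.
std::string CodePointToUtf8(uint32_t cp)
{
    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}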
This answer was given on the assumption that the encoded text is returned as UTF-16, which, as it turns out, isn't the case.
I would guess the encoding is UTF-16 or UCS-2. Specify this as the input encoding for iconv. There might also be an endianness issue; have a look here.
The C-style way would be (error checking omitted for clarity):
// iconv_open(tocode, fromcode): here we convert from UCS-2 to UTF-8
iconv_t ic = iconv_open("UTF-8", "UCS-2");
char *in = myUCS2_Text, *out = myUTF8_Text;
iconv(ic, &in, &inputSize, &out, &outputSize);
iconv_close(ic);