How to encode text in pango-markup language? - gstreamer

I am writing a subtitle parser GStreamer plugin. I need to encode the parsed subtitle text in Pango markup, as the GStreamer text-overlay element supports it.
I know how to decode Pango markup text to plain text from this link.
But I am not able to find a standard utility library that can encode a plain string as a Pango markup string.
Is there a standard encoder library for Pango markup, or should I implement the encoder myself?

It's not clear what you are expecting from "encoding to pango-markup string." Pango markup is just XML-escaped text, with optional formatting markup elements in it.
That is, g_markup_escape_text() should give you a valid Pango markup string, with no formatting.
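For example, a minimal sketch (the subtitle string here is invented) of escaping subtitle text with GLib before handing it to the text overlay:

#include <glib.h>
#include <iostream>

int main() {
    // Raw subtitle text containing characters that are special in XML
    const char *subtitle = "Tom & Jerry <the chase>";

    // -1 means the input is NUL-terminated; the result must be g_free()d
    gchar *escaped = g_markup_escape_text(subtitle, -1);

    // Prints: Tom &amp; Jerry &lt;the chase&gt;
    std::cout << escaped << std::endl;

    g_free(escaped);
    return 0;
}

Formatting can then be added by wrapping the escaped text in Pango markup elements such as <b>, <i>, or <span>.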

Related

How to get the character encoding of a JSON file?

I'm trying to get the character encoding of a JSON string from jsoncpp: UTF-8, ANSI or Unicode? How do I get the character encoding of a Json::Value? Thanks in advance!
Any string is just a sequence of bytes, conforming, perhaps, to some basic rules (null terminators, prohibited symbols for JSON, etc.). There is no magic way to determine which encoding was used to form a string, because an encoding is just a way to represent string data in binary. So the JSON string's encoding should either be specified by the JSON issuer (in documentation, perhaps), or information about it should be part of the JSON itself (if, for some reason, different strings have different encodings).
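That said, for a complete JSON document there is a byte-level heuristic: RFC 4627 observes that JSON text begins with two ASCII characters, so the pattern of NUL bytes in the first four octets identifies the Unicode flavour. A minimal sketch (the BOM checks are an assumption added on top of the RFC rule; UTF-32 BOMs are omitted):

#include <cstring>
#include <string>

// Guess the Unicode flavour of a whole JSON document from its first bytes.
std::string guessJsonEncoding(const unsigned char *buf, size_t len) {
    if (len >= 3 && std::memcmp(buf, "\xEF\xBB\xBF", 3) == 0) return "UTF-8 (BOM)";
    if (len >= 2 && std::memcmp(buf, "\xFE\xFF", 2) == 0) return "UTF-16BE (BOM)";
    if (len >= 2 && std::memcmp(buf, "\xFF\xFE", 2) == 0) return "UTF-16LE (BOM)";
    if (len >= 4) {
        // JSON starts with two ASCII characters, so NUL placement is telling
        if (!buf[0] && !buf[1] && !buf[2] && buf[3]) return "UTF-32BE";
        if (buf[0] && !buf[1] && !buf[2] && !buf[3]) return "UTF-32LE";
        if (!buf[0] && buf[1] && !buf[2] && buf[3]) return "UTF-16BE";
        if (buf[0] && !buf[1] && buf[2] && !buf[3]) return "UTF-16LE";
    }
    return "UTF-8"; // the default for JSON
}

This only works on a full document, not on an individual string already extracted from it.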
Determining the character encoding of a string is quite complicated. See this SO answer for help choosing the right tool.
Apache Tika, the content analysis toolkit, is perhaps one of the most advanced, according to the following quote:
The Apache Tika™ toolkit detects and extracts metadata and text from over a thousand different file types (such as PPT, XLS, and PDF). All of these file types can be parsed through a single interface, making Tika useful for search engine indexing, content analysis, translation, and much more. You can find the latest release on the download page.
Analyzing a JSON string with such a library results in a (probable) charset usable for further processing.

POCO C++ SAX parser: if the XML document encoding is ANSI, the next element is not read and an encoding error exception is thrown

Suppose the following is the XML document; the hello tag is not read by the POCO SAX parser because the encoding is ANSI.
<?xml version="1.0" encoding="ANSI"?>
<hello xmlns=" ">
If the encoding is UTF-8, the hello tag is read and everything goes fine.
Is there any solution in POCO for this issue?
It's not a POCO problem, fix the producer. There's no such thing as "ANSI" encoding in XML. The producer should generate output in a valid encoding. Whether that's "UTF-8" or "ISO-8859-1" doesn't really matter, as long as it's all consistent.
An encoding problem arises if you declare one encoding but actually use another. Trouble can come, for example, from copy-pasting XML source between multiple documents or from web pages, or simply from a buggy encoder. Try a program that can detect the encoding and convert it: set it to UTF-8 and then replace the ANSI header tag with the UTF-8 one.
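If fixing the producer is not an option, a pragmatic workaround is to rewrite the bogus declaration before handing the document to the parser. A minimal sketch, assuming the bytes really are in a known single-byte encoding such as ISO-8859-1 (the function name is made up):

#include <string>

// Rewrite a bogus encoding="ANSI" declaration so the parser can decode
// the document; only valid if the bytes really are ISO-8859-1.
std::string fixDeclaration(std::string xml) {
    const std::string bogus = "encoding=\"ANSI\"";
    const std::string valid = "encoding=\"ISO-8859-1\"";
    std::string::size_type pos = xml.find(bogus);
    if (pos != std::string::npos)
        xml.replace(pos, bogus.size(), valid);
    return xml;
}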

Read Chinese Characters in Dicom Files

I have just started to get a feel for the DICOM standard. I am trying to write a small program that reads a DICOM file and dumps the information to a text file. I have a dataset that has the patient names in Chinese. How can I read and store these names?
Currently, I am reading the names as char* from the DICOM file, converting this char* to wchar_t* using code page 950 for Chinese, and writing to a text file. Instead of seeing Chinese characters I see *, ? and % in my text file. What am I missing?
I am working in C++ on Windows.
If the text file contains UTF-16, have you included a BOM?
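For reference, a minimal sketch of writing the UTF-16LE byte-order mark before the text (the file name is made up):

#include <fstream>

int main() {
    // The BOM lets editors such as Notepad recognise the file as UTF-16LE
    std::ofstream out("names.txt", std::ios::binary);
    out.write("\xFF\xFE", 2);
    // ... then write each name's UTF-16LE bytes ...
    return 0;
}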
There may be multiple issues at hand.
First, do you know the character encoding of the Chinese name, e.g. Big5 or GB*? See http://en.wikipedia.org/wiki/Chinese_character_encoding
Second, do you know the encoding of your output text file? If it is ASCII, then you probably won't ever be able to view the Chinese characters. In that case, I would suggest changing it to a Unicode encoding (e.g. UTF-8).
Then, when you read the Chinese name, convert the raw bytes and write out the result. For example, if the DICOM stores it in Big5, and your text file is UTF-8, you will need a Big5->UTF-8 converter.
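For example, a minimal sketch of that last case on Windows, assuming the name bytes really are Big5 (code page 950) and omitting error handling:

#include <windows.h>
#include <string>

// Convert Big5 bytes to UTF-8 via a UTF-16 intermediate.
std::string big5ToUtf8(const char *big5, int len) {
    // First hop: Big5 -> UTF-16 (code page 950)
    int wlen = MultiByteToWideChar(950, 0, big5, len, NULL, 0);
    std::wstring wide(wlen, L'\0');
    MultiByteToWideChar(950, 0, big5, len, &wide[0], wlen);

    // Second hop: UTF-16 -> UTF-8
    int ulen = WideCharToMultiByte(CP_UTF8, 0, wide.data(), wlen, NULL, 0, NULL, NULL);
    std::string utf8(ulen, '\0');
    WideCharToMultiByte(CP_UTF8, 0, wide.data(), wlen, &utf8[0], ulen, NULL, NULL);
    return utf8;
}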

regex to parse image

If I have a string in the form:
data:image/x-icon;base64,iVBORw0KGgoAAAANSUhEUgAAABAAAAAQCAYAAAAf8/9hAAAABmJLR0QAAAAAAAD5Q7t/AAAA2UlEQVQ4y8WSvQvCMBDFX2rFUvuFSAUFBQfBwUXQVfFfFpzdRV2c7O5UKmihX9E6RZo2pXbyTbmX3C+5uwD/FskG+76WsvX65n/3Lm0pdU214HOAbHIWwvzeYPL1p4cT4QCi5DIxEINIdWt+Hs9cXAtg3UOkIJAUpT5ADiho8kbD0NG0LB6Q76xIevwCpW+0bBvj7Y5wgCpI148RBxTmYo7Z1RGPkSk/kc4jgme0oHoJlmFUOC+8lUEMN0ASvyBpGha++IXCJrJyKJGhjIalyZVyNqufP9j/9AH0S0vqrU+YMgAAAABJRU5ErkJggg==
What is the best regex I can use to parse these elements in an array? (so I can write away the correct image)
Update: I understand base64 encoding, but the question is actually how to parse these kinds of embedded icons in web pages, since I don't know whether people are using e.g. base62 or other image strings, or even other formats, to embed images. I also see examples in pages where the identifier is image/x-icon but the string actually contains a PNG.
UPDATE just some giveback to share the code where I used this: http://plugins.svn.wordpress.org/wp-favicons/trunk/filters/search/filter_extract_from_page.php
Though I still have some questions, e.g. whether only base64 is used, but time will tell in practice.
Can you see the base64 at the beginning? You don't need a regex. You need to decode this base64 string into a byte stream and then save it as an image.
I have now saved the following text into a file icon.txt:
iVBORw0KGgoAAAANSUhEUgAAABAAAAAQCAYAAAAf8/9hAAAABmJLR0QAAAAAAAD5Q7t
/AAAA2UlEQVQ4y8WSvQvCMBDFX2rFUvuFSAUFBQfBwUXQVfFfFpzdRV2c7O5UKmihX9E6RZo2pXbyTbmX3C+5uwD
/FskG+76WsvX65n
/3Lm0pdU214HOAbHIWwvzeYPL1p4cT4QCi5DIxEINIdWt+Hs9cXAtg3UOkIJAUpT5ADiho8kbD0NG0LB6Q76xIevwCpW+0bBvj7Y5wgCpI148RBxTmYo7Z1RGPkSk
/kc4jgme0oHoJlmFUOC+8lUEMN0ASvyBpGha++IXCJrJyKJGhjIalyZVyNqufP9j
/9AH0S0vqrU+YMgAAAABJRU5ErkJggg==
And processed:
base64 -d icon.txt > icon.png
and it shows a red heart icon, 16x16 pixels.
This is the way you can decode it in the command line. Most programming languages offer good libraries to decode it directly in your program.
EDIT: If you use PHP, then have a look at base64_decode().
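If you do want to split the data URI into its parts first, a minimal C++ sketch (the URI is truncated here, and the regex assumes the MIME type is present) could use capture groups; the captured payload can then go to any base64 decoder, such as GLib's g_base64_decode():

#include <iostream>
#include <regex>
#include <string>

int main() {
    // Truncated example; a real URI carries the full payload
    std::string uri = "data:image/x-icon;base64,iVBORw0KGgoAAAANSUhEUg...";

    // Groups: 1 = MIME type, 2 = optional ";base64" flag, 3 = payload
    static const std::regex re(R"(^data:([^;,]+)(;base64)?,(.*)$)");
    std::smatch m;
    if (std::regex_match(uri, m, re)) {
        std::cout << "type:    " << m[1] << "\n";
        std::cout << "base64:  " << (m[2].matched ? "yes" : "no") << "\n";
        std::cout << "payload: " << m[3].str().size() << " characters\n";
    }
    return 0;
}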

Replacing special characters from HTML source

I'm new to HTML coding, and I know HTML has some reserved characters for its own use and that it also displays some characters by their character code. For example:
&#140; is Œ
&#169; is ©
&#174; is ®
I have the HTML source in a std::string. How can I decode these entities into their actual form and replace them in the std::string? Is there any library with source available, or can it be done with preprocessor macros?
I would recommend using some HTML/XML parser that can automatically do the conversion for you. Parsing HTML correctly by hand is extremely difficult. If you insist on doing it yourself, Boost String Algorithms library provides useful replacement functions.
&#140; is Œ
No it isn't. &#140; is 'PARTIAL LINE BACKWARD'. The correct numeric entities for Œ are &#338; and &#x152;.
One method for the numeric entities would be to use a regular expression like &#([0-9]+);, grab the numeric value, and convert it to the corresponding character (straightforward for ASCII values; code points above 127 need to be encoded, e.g. as UTF-8).
For the named entities you would need to build a mapping. You could probably do a simple string replace to convert to the numbers, then use the method above. W3C has a table here: http://www.w3.org/TR/WD-html40-970708/sgml/entities.html
But if you're trying to read or parse a bunch of HTML in a string, you should use an HTML parser. Search for the many questions on SO.
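Tying the two steps above together, a minimal sketch (only a few named entities are shown; a full mapping would come from the W3C table linked above, and only code points up to U+07FF are encoded):

#include <map>
#include <regex>
#include <string>

// Decode HTML entities: first rewrite named entities as numeric
// references, then decode decimal references into UTF-8.
std::string decodeEntities(std::string html) {
    static const std::map<std::string, std::string> named = {
        {"&copy;", "&#169;"}, {"&reg;", "&#174;"}, {"&amp;", "&#38;"}};
    for (const auto &kv : named) {
        std::string::size_type pos;
        while ((pos = html.find(kv.first)) != std::string::npos)
            html.replace(pos, kv.first.size(), kv.second);
    }
    static const std::regex re("&#([0-9]+);");
    std::smatch m;
    std::string out;
    while (std::regex_search(html, m, re)) {
        out += m.prefix().str();
        unsigned long cp = std::stoul(m[1].str());
        if (cp < 0x80) {
            out += static_cast<char>(cp);          // plain ASCII
        } else if (cp < 0x800) {                   // two-byte UTF-8 sequence
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }                                          // wider code points skipped
        html = m.suffix().str();
    }
    return out + html;
}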