How to programmatically write Unicode text in RTF format? - c++

In order to generate RTF programmatically I have decided to use rtflib v1.0 from codeproject.com, but I can't understand how to generate Russian Unicode text. So I need to generate Unicode text. Could someone help me?
P.S. Honestly, I could only write some text into the .rtf file by opening it with MS Word. But after writing some Unicode text that way, WordPad displayed the text correctly.

Here are the steps:
Create a file with the .rtf extension.
Open the .rtf file.
Write the following into it in order to generate an RTF file whose content uses Unicode escapes:
{\rtf1\adeflang1025\ansi
{\fonttbl
{\f26\fbidi \froman\fcharset204\fprq2{\*\panose 010a0502050306030303}Sylfaen;}
}
{\rtlch\fcs1 \af31507 \ltrch\fcs0 \f26 \u<unicode number>\'3f\u<unicode number>\'3f\u<unicode number>\'3f A lot of other text and symbols
like this: + - * _ }
}
For example:
{\rtf1\adeflang1025\ansi
{\fonttbl
{\f26\fbidi \froman\fcharset204\fprq2{\*\panose 010a0502050306030303}Sylfaen;}
}
{\rtlch\fcs1 \af31507 \ltrch\fcs0 \f26 \u1329\'3f\u1330\'3f\u1331\'3f\u1332\'3f - these are the first 4 letters of the Armenian alphabet}
}
For more details, see the Unicode character table here. And the RTF spec is here.
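Putting the escape scheme above into C++, here is a minimal sketch that writes such a file directly, without rtflib (the function name, font table and example string are purely illustrative, and only BMP code points are handled):

#include <fstream>
#include <string>

// Minimal sketch, not using rtflib. Every character above 0x7F is emitted as a \uN
// escape (N is the code point as a signed 16-bit decimal) followed by \'3f, the '?'
// fallback shown by readers that do not understand \u.
void writeRtf(const std::string& path, const std::wstring& text)
{
    std::ofstream out(path);
    out << "{\\rtf1\\ansi{\\fonttbl{\\f0\\fcharset204 Sylfaen;}}\n{\\f0 ";
    for (wchar_t ch : text)
    {
        if (ch == L'\\' || ch == L'{' || ch == L'}')      // RTF control characters
            out << '\\' << static_cast<char>(ch);
        else if (ch < 0x80)                               // plain ASCII passes through
            out << static_cast<char>(ch);
        else                                              // e.g. Cyrillic or Armenian
            out << "\\u" << static_cast<short>(ch) << "\\'3f";
    }
    out << "}}\n";
}

// Usage: writeRtf("out.rtf", L"\u0410\u0411\u0412 and \u0531\u0532\u0533\u0534");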

Related

Korean text does not print correctly when reading metadata using TagLib

I use the TagLib library to read MP3 files' metadata.
There is no problem when a file's metadata is in English,
but when the metadata is in Korean, the Korean text does not print correctly.
How do I solve this problem? Could you give me some tips?
TagLib::FileRef f(path);
TagLib::AudioProperties *audioProperties = f.audioProperties();
if (!f.isNull() && f.tag()) {
    TagLib::Tag* tag = f.tag();
    strcpy_s(musicFile->album, sizeof(musicFile->album),
             tag->album().toCString());
    strcpy_s(musicFile->artist, sizeof(musicFile->artist),
             tag->artist().toCString());
    strcpy_s(musicFile->title, sizeof(musicFile->title),
             tag->title().toCString());
    strcpy_s(musicFile->genre, sizeof(musicFile->genre),
             tag->genre().toCString());
}
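For what it's worth, TagLib::String::toCString() defaults to Latin-1 output; passing true requests UTF-8 instead, which is what non-ASCII (e.g. Korean) tags need. A minimal sketch, assuming the musicFile buffers are meant to hold UTF-8 text:

// Sketch: ask TagLib for UTF-8 instead of the default Latin-1 when copying tag text.
strcpy_s(musicFile->title, sizeof(musicFile->title),
         tag->title().toCString(true));   // true = UTF-8 encoded C string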

PoDoFo Polish characters & PdfContentsTokenizer error

1.
How do I get Polish characters from a PDF file? Can I somehow tell
PdfVariant::GetString()
to process Polish characters?
Because I get \200 instead of ł, for example, and the funny thing is that this only happens when ł occurs as the first "non-base" character. So if the PDF file begins with aaaałęąaaaa, the ł is coded as \200, the ę as \201 and the ą as \202, but if the PDF file begins with aaaaąęłaaaa, the ł is coded as \202, the ę as \201 and the ą as \200.
How can I get these characters on any system?
2.
When I'm trying to extract text from a PDF file, I do something like this:
string input_name = "example.pdf";
PdfMemDocument pdf(input_name.c_str());
for (int pn = 0; pn < pdf.GetPageCount(); ++pn) {
    PdfPage* page = pdf.GetPage(pn);
    PdfContentsTokenizer tok(page);
    const char* token = nullptr;
    PdfVariant var;
    EPdfContentsType type;
    while (tok.ReadNext(type, token, var)) {
        // etc.
    }
}
But I have a problem with PdfContentsTokenizer tok(page);. It doesn't work properly: for some PDF files it goes smoothly, and for others it throws an access violation reading location error at line 669 of inffas32.asm:
L_get_length_code_mmx:
    pand    mm4, mm0
    movd    eax, mm4
    movq    mm4, mm3
    mov     eax, [ebx+eax*4]    // this is the error line
Btw, I noticed that not every PDF file is encoded in the same way. For example, using PoDoFoBrowser I couldn't see the Hello World! text from the official PoDoFo hello-world example, and for other PDF files PoDoFoBrowser showed the text in different ways or didn't show it at all.
Ad 1. The link to the patch files
which allow extracting Polish text from a PDF using TextExtractor.
This is the most important line when it comes to extracting non-Unicode text from a PDF:
PdfString unicode = pCurFont->GetEncoding()->ConvertToUnicode( rString, pCurFont );
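A minimal sketch of how that line fits into the tokenizer loop (operand handling is simplified, and pCurFont is assumed to have been resolved earlier, when a Tf operator selected the font, as in PoDoFo's TextExtractor example):

#include <podofo/podofo.h>
#include <cstring>
#include <iostream>
#include <stack>
using namespace PoDoFo;

// Sketch only: prints the text of every Tj operator on a page as UTF-8.
void printTjText(PdfPage* page, PdfFont* pCurFont)
{
    PdfContentsTokenizer tok(page);
    const char* token = nullptr;
    PdfVariant var;
    EPdfContentsType type;
    std::stack<PdfVariant> operands;

    while (tok.ReadNext(type, token, var)) {
        if (type == ePdfContentsType_Variant) {
            operands.push(var);                        // collect operands for the next operator
        } else if (type == ePdfContentsType_Keyword) {
            if (std::strcmp(token, "Tj") == 0 && !operands.empty() && pCurFont) {
                PdfString raw = operands.top().GetString();
                PdfString unicode =
                    pCurFont->GetEncoding()->ConvertToUnicode(raw, pCurFont);
                std::cout << unicode.GetStringUtf8() << std::endl;   // ł, ę, ą come out intact
            }
            while (!operands.empty()) operands.pop();  // operator has consumed its operands
        }
    }
}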
Ad 2. The problem was the zlib library, which had been built incorrectly. I rebuilt it, rebuilt PoDoFo, and the problem is gone.

Read Arabic file contents using string in c++

I have a text file (ANSI encoding) that contains Arabic content, and I have read it in C++ as:
ifstream ifs(file.GetFileName());
std::string content((std::istreambuf_iterator<char>(ifs)), std::istreambuf_iterator<char>());
Unfortunately, the content variable holds garbled strings (which should be Arabic text), i.e.:
121101 ÇáÒÈæä ßãÇá
121102 ÇáÒÈæä ÓÚíÏ
121103 ÇáÒÈæä ÚãÇÑ
Any solution?
Thanks :)
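A minimal sketch of one way to handle this, assuming a Windows platform and that the "ANSI" file is actually code page 1256 (Arabic); the function name is illustrative. The raw bytes are read as before and then converted to UTF-16 with MultiByteToWideChar:

#include <windows.h>
#include <fstream>
#include <iterator>
#include <string>

// Sketch: read the file as raw bytes, then decode them from code page 1256 to UTF-16.
std::wstring ReadArabicFile(const std::string& path)
{
    std::ifstream ifs(path, std::ios::binary);
    std::string bytes((std::istreambuf_iterator<char>(ifs)),
                      std::istreambuf_iterator<char>());

    // First call computes the required length, second call performs the conversion.
    int len = MultiByteToWideChar(1256, 0, bytes.data(), (int)bytes.size(), nullptr, 0);
    std::wstring wide(len, L'\0');
    MultiByteToWideChar(1256, 0, bytes.data(), (int)bytes.size(), &wide[0], len);
    return wide;   // proper Arabic text, ready for wide-character APIs
}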

Aspose.PDF: How to replace text on a PDF page with all upper case

I am trying to replace the text on a specific page with upper case using Aspose.PDF for .NET. If anyone can provide any help, that would be great. Thank you.
My name is Tilal Ahmad and I am a developer evangelist at Aspose.
You may use the documentation link for searching and replacing text on a specific page of a PDF document. You should call the Accept method for the specific page index, as suggested at the bottom of the documentation. Furthermore, to replace text with its uppercase form you can use the ToUpper() method of the String object, as follows:
....
textFragment.Text = textFragment.Text.ToUpper();
....
Edit: sample code to change the text case on a specific PDF page:
// open the document
Document pdfDocument = new Document(myDir + "testAspose.pdf");
// create a TextFragmentAbsorber object to find all instances of the input search phrase
TextFragmentAbsorber textFragmentAbsorber = new TextFragmentAbsorber("");
// accept the absorber for the specific page
pdfDocument.Pages[2].Accept(textFragmentAbsorber);
// get the extracted text fragments
TextFragmentCollection textFragmentCollection = textFragmentAbsorber.TextFragments;
// loop through the fragments
foreach (TextFragment textFragment in textFragmentCollection)
{
    // update the text and other properties
    textFragment.Text = textFragment.Text.ToUpper();
}
pdfDocument.Save(myDir + "replacetext_output.pdf");

How to replace text in a content control after XML binding using docx4j

I am using docx4j 2.8.1 with content controls in my .docx file. I can replace the CustomXML part by injecting my own XML and then calling BindingHandler.applyBindings after supplying the input XML. I can add a token in my XML, such as ¶, and then I would like to replace that token in the MainDocumentPart. But with that approach, when I iterate through the content in the MainDocumentPart with this (link) method, none of the text from my XML is even in the collection extracted from the MainDocumentPart. I am thinking that even after binding, the XML remains separate from the MainDocumentPart (??)
I haven't tried this with anything more than a little test doc yet. My token is the pilcrow: ¶. Since it's a single character, it won't be split across separate runs. My code is:
private void injectXml(WordprocessingMLPackage wordMLPackage) throws JAXBException {
    MainDocumentPart part = wordMLPackage.getMainDocumentPart();
    String xml = XmlUtils.marshaltoString(part.getJaxbElement(), true);
    xml = xml.replaceAll("¶", "</w:t><w:br/><w:t>");
    Object obj = XmlUtils.unmarshalString(xml);
    part.setJaxbElement((Document) obj);
}
The pilcrow character comes from the XML and is injected by applying the XML bindings to the content controls. The problem is that the content from the XML does not seem to be in the MainDocumentPart so the replace doesn't work.
(Using docx4j 2.8.1)