PoDoFo Polish characters & PdfContentsTokenizer error - c++

1.
How do I get Polish characters from a PDF file? Can I somehow tell
PdfVariant::getString()
to handle Polish characters?
Because I get \200 instead of ł, for example, and the funny thing is that the code depends on which "non-base" character occurs first. So if the PDF file begins with aaaałęąaaaa, ł is coded as \200, ę as \201 and ą as \202, but if the PDF file begins with aaaaąęłaaaa, ł is coded as \202, ę as \201 and ą as \200.
How can I get these characters on any system?
2.
When I try to extract text from a PDF file, I do something like this:
string input_name = "example.pdf";
PdfMemDocument pdf(input_name.c_str());
for (int pn = 0; pn < pdf.GetPageCount(); ++pn) {
    PdfPage* page = pdf.GetPage(pn);
    PdfContentsTokenizer tok(page);
    const char* token = nullptr;
    PdfVariant var;
    EPdfContentsType type;
    while (tok.ReadNext(type, token, var)) {
        // etc.
But I have a problem with PdfContentsTokenizer tok(page); it doesn't work properly. For some PDF files it goes smoothly, and for others it throws an "Access violation reading location" error in the inffas32.asm file, at line 669:
L_get_length_code_mmx:
pand mm4,mm0
movd eax,mm4
movq mm4,mm3
mov eax, [ebx+eax*4] ; this is the error line
By the way, I noticed that not every PDF file is encoded in the same way. For example, using podofobrowser I couldn't see the Hello World! text from the official PoDoFo helloworld example. For other PDF files, podofobrowser showed the text in different ways or didn't show it at all.

Ad 1. The link to patch files
which allow extracting Polish text from a PDF using TextExtractor.
This is the most important line when it comes to extracting non-Unicode text from a PDF:
PdfString unicode = pCurFont->GetEncoding()->ConvertToUnicode( rString, pCurFont );
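The behaviour described in the question (the same letter getting code \200 in one file and \202 in another) is consistent with each document's font carrying its own code-to-character table, which is why the conversion has to go through the font's encoding. A minimal standalone sketch of that idea follows. It does not use PoDoFo; FontCodeMap, appendUtf8 and decodeWithFontMap are hypothetical names invented for this illustration, and the tables passed in are assumptions standing in for whatever the real font encoding provides.

```cpp
#include <map>
#include <string>

// Hypothetical per-font table: byte code used in the PDF string -> Unicode code point.
using FontCodeMap = std::map<unsigned char, char32_t>;

// Append one Unicode code point to a UTF-8 encoded std::string.
static void appendUtf8(std::string& out, char32_t cp) {
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Decode raw string bytes with the table of the font that was active when
// the string was drawn. Codes missing from the table are passed through as
// code points of the same value (an assumption made for this sketch only).
std::string decodeWithFontMap(const std::string& raw, const FontCodeMap& fontMap) {
    std::string out;
    for (unsigned char code : raw) {
        auto it = fontMap.find(code);
        appendUtf8(out, it != fontMap.end() ? it->second
                                            : static_cast<char32_t>(code));
    }
    return out;
}
```

With a table such as {0x80: U+0142, 0x81: U+0119, 0x82: U+0105} the bytes \200\201\202 decode to łęą, while a second document with the table reversed needs its own map. The raw bytes alone are not enough to recover the text.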
Ad 2. The problem was the zlib library, which had been built incorrectly. I rebuilt it, rebuilt PoDoFo, and the problem is gone.

Related

Why is libxml not storing html in my htmlDocPtr?

I am working on a piece of software that uses libxml to store xml on webpages in an xmlDocPtr. I need to expand this functionality to do the same for html.
The original code:
xmlDocPtr doc = xmlParseEntity(filename.c_str());
Where filename = 10.1.1.135/poll_data.xml and everything works just fine
Now, I have html filename = 10.1.1.165/index.htm and would like to store this as well. I have tried using htmlParseDoc with no success.
htmlDocPtr doc = htmlParseFile(filename.c_str(), "windows-1252");
The resulting doc object is not null, but it does not contain the contents of index.htm.
Netbeans spits out:
http://10.1.1.165/index.htm:1: HTML parser error : Document is empty
Any suggestions?

How to write ACE_TString to file in c++

I'm on Windows and trying to write an ACE_TString that contains a sentence in multiple languages (Unicode) to a file using ACE_OS::write().
But what I get in the file is unpredictable characters (gibberish).
This is my code:
ACE_TString *str = new ACE_TString(L"مرحبا привет świecie Hello");
ACE_HANDLE hFile = ACE_OS::open(L"myfile", _O_WRONLY);
ACE_OS::write(hFile, str, 1048);
wprintf(L"%ls",str->c_str());
As you can see, I also print the string to the screen, and on screen I get "?" characters in place of every character except the English ones.
Written Text
مرحبا привет świecie Hello
Result on Screen :
?????? ????? ??????? Hello
What am I missing and what is wrong with my code?
ACE_TString is a typedef for ACE_CString when ACE_USES_WCHAR is not set. Try using ACE_WString if you need to force it to wide-chars.
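Separately from the typedef issue, the snippet in the question passes the string object pointer and a fixed length of 1048 to the write call, rather than the character buffer and its real size in bytes. A minimal sketch of the buffer-and-size pattern is shown below. It deliberately uses standard C++ (std::wstring and std::ofstream) instead of ACE so that it stands alone; writeWideString is a hypothetical helper name, and the bytes written are raw wchar_t units in the platform's native layout, not UTF-8.

```cpp
#include <fstream>
#include <string>

// Write the characters of a wide string to a binary file.
// Returns the number of bytes written, or -1 on failure.
long writeWideString(const std::string& path, const std::wstring& text) {
    std::ofstream out(path, std::ios::binary);
    if (!out) {
        return -1;
    }
    // Size in bytes = number of characters * size of one wide character.
    const std::streamsize byteCount =
        static_cast<std::streamsize>(text.size() * sizeof(wchar_t));
    // Pass the character buffer, not the address of the string object.
    out.write(reinterpret_cast<const char*>(text.data()), byteCount);
    return out ? static_cast<long>(byteCount) : -1;
}
```

The same two points would apply to a write call in any API: the second argument should point at the characters, and the third should be computed from the string length.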

C# UWP textfile to list of strings

I'm creating a UWP application (latest Windows IoT) for my Raspberry Pi.
I'm trying to get a list of strings from a .txt file located in the project folder, in a folder called Words.
This is my code so far.
public async void GenereerGokWoord(int character)
{
    StorageFile file = await StorageFile.GetFileFromApplicationUriAsync(
        new Uri("ms-appx:///Words//5-letterwoorden.txt")
    );
}
By setting a breakpoint at the closing } I can confirm that this code finds the .txt file.
But now I don't know how to get a list of strings from this point on.
The .txt file looks like this:
word 1
word 2
word 3
If you want the contents of the file as a List with each line as an item, use FileIO.ReadLinesAsync():
IList<string> data = await FileIO.ReadLinesAsync(file);
List<string> finallist = data.ToList();
Your finallist should then contain all the words as a List.

End of string expected error when trying to post comment in Weblog

I'm trying to solve an issue with posting comments for a blog that uses the Weblog Sitecore module. From what I can tell, if the blog entry url contains dashes (i.e. http://[domain.org]/blog/2016/december/test-2-entry), then I get the "End of string expected at line [#]" error. If the blog entry url does NOT contain dashes, then the comment form works fine.
<replace mode="on" find="-" replaceWith="_"/>
I also tried replacing the dash with an empty space. Neither solution has worked, as I still get the error.
Is there some other setting in the Web.config I can alter to escape the dashes in the urls? I have read that enclosing dashed url text with the # symbol works, but I'd like to be able to do that automatically instead of having the user go back and rename all their blog entries.
Here is a screenshot of the error for reference:
I have no experience with the WeBlog module, but for the issue you are facing, you should escape the dash with #. Please see the following code snippet:
public string EscapePath(string path)
{
    string[] joints = Regex.Split(path, "/");
    string output = string.Empty;
    for (int index = 0; index < joints.Length; index++)
    {
        string joint = joints[index];
        if (!string.IsNullOrEmpty(joint))
            output += string.Format("#{0}#", joint);
        if (index != joints.Length - 1)
            output += "/";
    }
    return output;
}
Reference: https://github.com/WeTeam/WeBlog/issues/52
More information about escaping dash in queries can be found here
UPDATE
You should call this method before posting the comment so that the dashes are escaped. You may also download the DLL from here and use it in your solution.

How to write programmatically some unicode text in RTF format?

In order to generate RTF programmatically I have decided to use rtflib v1.0 from codeproject.com. But I can't figure out how to generate Russian text in Unicode. Could someone help me?
P.S. Honestly, I could only write text into the .rtf file by opening it with MS Word. But after I wrote some Unicode text that way, WordPad displayed it correctly.
Here are the steps:
Create a file named .rtf
Open .rtf
Write the following code in it to generate an RTF file whose content is written as Unicode escapes (each \u is followed by the decimal Unicode number of the character):
{\rtf1\adeflang1025\ansi
{\fonttbl
{\f26\fbidi \froman\fcharset204\fprq2{\*\panose 010a0502050306030303}Sylfaen;}
}
{\rtlch\fcs1 \af31507 \ltrch\fcs0 \f26 \u<unicode number>\'3f\u<unicode number>\'3f\u<unicode number>\'3f A lot of other text and symbols
like this: + - * _ }
}
For example:
{\rtf1\adeflang1025\ansi
{\fonttbl
{\f26\fbidi \froman\fcharset204\fprq2{\*\panose 010a0502050306030303}Sylfaen;}
}
{\rtlch\fcs1 \af31507 \ltrch\fcs0 \f26 \u1329\'3f\u1330\'3f\u1331\'3f\u1332\'3f - these are the first 4 letters of the Armenian alphabet}
}
For more details see the UTF-8 encoding table here. The RTF spec is here.
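Since the question is about producing this programmatically, the escaping shown in the example can be sketched as a small C++ function. This is only an illustration under stated assumptions, not part of rtflib: rtfEscape and appendRtfUnit are hypothetical names, the input is assumed to already be a sequence of Unicode code points (std::u32string), and the fallback after each escape is \'3f as in the example above. RTF treats the number after \u as a signed 16-bit decimal, so values above 32767 are written as negative numbers and characters beyond U+FFFF are split into a surrogate pair.

```cpp
#include <cstdint>
#include <string>

// Append one UTF-16 code unit as an RTF \uN escape followed by the
// fallback character \'3f ('?') for readers that ignore \u.
static void appendRtfUnit(std::string& out, std::uint16_t unit) {
    // RTF expects a signed 16-bit decimal, so units above 32767 go negative.
    const int value = unit > 32767 ? static_cast<int>(unit) - 65536
                                   : static_cast<int>(unit);
    out += "\\u" + std::to_string(value) + "\\'3f";
}

// Convert Unicode text into the body text of an RTF group.
std::string rtfEscape(const std::u32string& text) {
    std::string out;
    for (char32_t cp : text) {
        if (cp == U'\\' || cp == U'{' || cp == U'}') {
            // Characters with special meaning in RTF get a backslash.
            out += '\\';
            out += static_cast<char>(cp);
        } else if (cp < 0x80) {
            // Plain ASCII is copied through unchanged.
            out += static_cast<char>(cp);
        } else if (cp <= 0xFFFF) {
            appendRtfUnit(out, static_cast<std::uint16_t>(cp));
        } else {
            // Outside the BMP: emit a UTF-16 surrogate pair.
            const char32_t v = cp - 0x10000;
            appendRtfUnit(out, static_cast<std::uint16_t>(0xD800 + (v >> 10)));
            appendRtfUnit(out, static_cast<std::uint16_t>(0xDC00 + (v & 0x3FF)));
        }
    }
    return out;
}
```

For the four Armenian letters U+0531 to U+0534 this produces the same \u1329\'3f\u1330\'3f\u1331\'3f\u1332\'3f sequence as the example, which can then be placed inside the group after the font selection.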