I am trying to generate some XML files (TMX) on our servers. The servers are Solaris SPARC machines, but the files are destined for some legacy Windows CAT tools.
The CAT tool requires CR+LF line endings, as is the default on Windows. Writing the files with libxml2, using xmlWriter, is easy and works quite well. But I haven't figured out a way to force the library to emit CR+LF instead of the Unix-standard LF. The library only seems to support the line ending of the platform it runs on.
Has somebody found a way to generate files with a line ending other than the platform default? My current workaround is to open the written file and write a new copy with the changed line endings using a simple C loop. That works, but it is annoying to have such an unnecessary step in our chain.
I haven't tried this myself, but from the xmlsave module I can see two possibilities:
xmlSaveToBuffer: save to a buffer, convert to CR+LF and write it out yourself.
xmlSaveToIO: register an iowrite callback and convert to CR+LF while writing, in your callback function (see the sketch below).
There may be other options, but I haven't found them.
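For the second option, a minimal sketch (untested; it assumes the document already exists as an xmlDocPtr, and saveWithCrlf is a made-up helper name):

#include <libxml/xmlsave.h>
#include <stdio.h>

/* Write callback: expand every LF to CR+LF while streaming out. */
static int crlfWrite(void *ctx, const char *buf, int len)
{
    FILE *fp = (FILE *)ctx;
    for (int i = 0; i < len; ++i) {
        if (buf[i] == '\n' && fputc('\r', fp) == EOF)
            return -1;
        if (fputc(buf[i], fp) == EOF)
            return -1;
    }
    return len; /* report the number of input bytes consumed */
}

static int crlfClose(void *ctx)
{
    return fclose((FILE *)ctx) == 0 ? 0 : -1;
}

int saveWithCrlf(xmlDocPtr doc, const char *path)
{
    FILE *fp = fopen(path, "wb");   /* binary mode: no further translation */
    if (!fp)
        return -1;
    xmlSaveCtxtPtr ctxt = xmlSaveToIO(crlfWrite, crlfClose, fp,
                                      "UTF-8", XML_SAVE_FORMAT);
    if (!ctxt) {
        fclose(fp);
        return -1;
    }
    long rc = xmlSaveDoc(ctxt, doc);
    xmlSaveClose(ctxt);             /* flushes and invokes crlfClose */
    return rc < 0 ? -1 : 0;
}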
The CAT tool requires CR+LF line endings, as is the default on Windows.
FWIW, that means the CAT tool has a broken XML parser. It shouldn't care about this, as the XML spec says:
To simplify the tasks of applications, the XML processor must behave as if it normalized all line breaks ... by translating both the two-character sequence #xD #xA and any #xD that is not followed by #xA to a single #xA character.
I know these things are often out of our control, but if you can lean on the CAT tool vendor to fix their software, that would be a more future-proof solution.
According to the source code (as of April 2013), libxml2 just puts "\n" into the output stream, at least when writing the DTD part of a document. Therefore, re-encoding the stream on the fly is the only option to get "\r\n" as the result.
If you are lucky (as I was) and your tool runs on Windows, you can open the file in text mode, and the runtime will do the translation for you.
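For example (Windows-only; it is the C runtime that does the translation, not libxml2):

#include <stdio.h>

int main(void)
{
    /* "w" = text mode on Windows: the CRT expands '\n' to "\r\n" on write; */
    /* "wb" (binary mode) would suppress the translation.                   */
    FILE *fp = fopen("out.tmx", "w");
    if (fp) {
        fputs("<tu>\n", fp);   /* arrives on disk as "<tu>\r\n" */
        fclose(fp);
    }
    return 0;
}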
I am desperately trying to find a way to make the application output panel a bit more useful by printing an error with a file path and a line number (basically the __FILE__ and __LINE__ macros) and making it clickable from the panel so it goes directly to the source file in the IDE.
Is it possible to do so with std::cout only?
I found a post on Stack Overflow, but it doesn't cover my need.
The mechanism you need here is ANSI escape sequences.
ANSI escape sequences are processed by (mostly) Unix terminals and terminal emulators to change the terminal's behavior, e.g. formatting or coloring text. More recently, hyperlinks can be embedded using an escape sequence as well. For example, the ls utility may embed file:// scheme links with the printed filenames, and a terminal may then let you open a file by clicking on it. GCC does this as well (see the -fdiagnostics-urls option).
Several IDEs nowadays also support these links in their output panes. To form a link you need to print one escape sequence before the text and one after (to reset the link state), like so:
printf '\e]8;;http://example.com\e\\This is a link\e]8;;\e\\\n'
Note that \e is ESC; the other characters in the example are regular characters, printed as-is.
You can find good documentation about this, especially about how to form appropriate file:// URIs, here.
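Since the question asks about std::cout, the same sequence translated to C++ (where \033 is ESC) would look like this:

#include <iostream>

int main()
{
    // OSC 8 hyperlink: ESC ] 8 ;; URI ESC \ ... link text ... ESC ] 8 ;; ESC '\'
    std::cout << "\033]8;;http://example.com\033\\"
              << "This is a link"
              << "\033]8;;\033\\" << '\n';
}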
For Qt Creator to recognize a link to a source file in the Application Output, it needs to be in a specific format. In my tests I've found the following to work:
std::cout << "file:///home/user/project/src/foo.cpp:1234" << std::endl;
This follows the pattern file://%{file}:%{line}.
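Putting this together with the __FILE__ and __LINE__ macros from the question, a minimal sketch could look like this (LOG_HERE is a made-up name, and __FILE__ only yields an absolute path if the compiler was invoked with one):

#include <iostream>

// Prints a message followed by a file://<path>:<line> reference, which
// Qt Creator's Application Output pane renders as a clickable link.
#define LOG_HERE(msg) \
    (std::cout << (msg) << " (file://" << __FILE__ << ':' << __LINE__ << ')' \
               << std::endl)

int main()
{
    LOG_HERE("something went wrong");
    // e.g.: something went wrong (file:///home/user/project/src/main.cpp:12)
}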
Since this question is related to Qt, you may want to set the QT_MESSAGE_PATTERN environment variable such that the file and line number is automatically included in your debug messages. For example:
QT_MESSAGE_PATTERN="%{message} (file://%{file}:%{line})"
For more information, see:
https://doc.qt.io/qt-5/qtglobal.html#qSetMessagePattern
https://doc.qt.io/qt-5/debug.html#warning-and-debugging-messages
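Returning to the QT_MESSAGE_PATTERN example above, a quick sketch of how it plays out (assuming a debug build; release builds need QT_MESSAGELOGCONTEXT defined for %{file} and %{line} to be populated):

#include <QDebug>

int main()
{
    // With QT_MESSAGE_PATTERN="%{message} (file://%{file}:%{line})" set in
    // the run environment, this prints something like:
    //   value out of range (file:///home/user/project/src/main.cpp:8)
    qDebug() << "value out of range";
    return 0;
}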
I have the following problem:
When I build my application on Windows, QML texts actually wrap correctly with respect to the nbsp character (U+00A0, I think). On my Raspberry Pi with Raspbian, however, the nbsp seems to be ignored and the text wraps as if it were just a normal space.
There are several things that may have some importance here:
On Windows I have Qt 5.4, whereas the Raspberry Pi has 5.2.
I think it may have something to do with encoding. The thing is, I remember it worked before I forced the g++ compiler on the Pi to treat the input files as CP1250 (I added QMAKE_CXXFLAGS += -finput-charset=CP1250 to the project file). I had to make this tweak because of the diacritics in some of the string literals (otherwise the texts were absolutely broken on the Raspberry). So, as I said, I think word wrap worked before I changed this compiler switch.
But still, there is not a single problem with the display of anything except that the texts break where they shouldn't. Note that there is no "random" character or anything, just a regular space. That's absolutely strange, since it looks as if there is no problem with the encoding but rather with the word-wrapping algorithm itself. But as I said, it used to work when the compiler thought the string literals were in whatever the default on Linux is (UTF-8, I guess...).
As for the QML Text assignment, these strings are taken from a C array and assigned to the QML text using QObject::setProperty, if that is of any importance...
Also note that I probably cannot change the encoding of my sources to UTF-8, because the file with the strings is also shared with an embedded project that sits on the other side of the communication, and that one has to be CP1250 because of its IDE.
Thanks in advance
EDIT:
I have some additional information: if I step through one of the affected string literals on Windows, it is in fact shorter than the same literal compiled on the Raspberry, even when the source encoding is set to CP1250. For example, the nbsp is encoded in only one byte on Windows (160 decimal), but in two bytes on the Raspberry (194, 160 decimal). That's strange, isn't it? I'd expect that after telling g++ that the source code is encoded in CP1250, it would encode the literals the same way? Or maybe not, because this is the encoding of the string in memory, which differs by default between Windows and Linux. But I still don't see where the problem is.
As suggested by Kevin Krammer,
QString::fromLocal8Bit()
was the solution.
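For illustration only (the function and parameter names are made up), the assignment from the question would then look something like:

#include <QObject>
#include <QString>

// 'cp1250Bytes' stands in for an entry of the shared CP1250-encoded C array;
// QString::fromLocal8Bit() decodes it with the system's 8-bit locale codec
// instead of assuming the literal is Latin-1 or UTF-8.
void setQmlText(QObject *qmlTextItem, const char *cp1250Bytes)
{
    qmlTextItem->setProperty("text", QString::fromLocal8Bit(cp1250Bytes));
}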
I need to read a txt file created on Windows in a C++ program compiled on Debian Linux. Unfortunately, I have a problem with the end-of-line delimiters. I know that the end-of-line indicators differ between Linux and Windows. Consequently, on Linux my C++ program reads something like "correct_line^M".
My question: how can I correctly read, on Linux, a file created on Windows?
Do I need to convert it manually to the Linux representation (I would like to avoid that)?
Thank you.
You'll have to do it yourself. (IMHO, a good library would do it automatically, in filebuf, if the file is opened in text mode. But the libraries I'm familiar with don't.)
Depending on what you're doing, it may not matter. Any line-oriented input should accept trailing white space anyway, and the extra 0x0D character is white space. So except for editors, it usually won't matter.
If you want to suppress the extra 0x0D when writing the file (under Windows), just open the file in binary mode. For that matter, when portability of the file is a concern, it's often a good idea to open the file in binary mode and write whichever convention the protocol requires. (Using the two-character sequence 0x0D, 0x0A is more or less standard, and is what most Internet protocols specify.) When reading, again, open in binary mode, and write your code to accept any of the usual conventions: the two-character sequence 0x0D, 0x0A, or a single 0x0D, or a single 0x0A. (This could be done with a filtering streambuf.)
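A minimal sketch of the reading side (no filtering streambuf; this accepts CRLF and LF endings, while lone 0x0D endings would still need extra handling):

#include <fstream>
#include <iostream>
#include <string>

int main()
{
    std::ifstream in("windows_file.txt", std::ios::binary);
    std::string line;
    while (std::getline(in, line)) {         // splits on 0x0A
        if (!line.empty() && line.back() == '\r')
            line.pop_back();                 // drop the 0x0D of a CRLF pair
        std::cout << line << '\n';
    }
}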
You may run dos2unix. It will convert your file.
I wrote a C++ program that needs to zip files as part of its work. For creating these zip files I used the XZip library. During development this program ran on a Win7 machine, where it works fine.
Now the program is to be used on a Windows XP machine. The issue I run into is:
If I let XZip create the zip archive "ü.zip" and add the file "ü.txt" to it on Win7, it works as intended. On Windows XP, however, I end up with the "ü.zip" file containing "³.txt".
The "³" => "ü" thing is of course an encoding issue between UTF8 and Ascii (ü = 252 in UTF8 and 252 = ³ in Ascii) BUT I can't really imagine how this could affect the creating of the internal zip structure in different ways depending on the OS.
//EDIT to clear it up:
the problem is that I run a test with XZip on Win7 and get the archive "ü.zip" containing the file named "ü.txt".
When I run the same test on an XP machine, I get the archive "ü.zip" containing the file "³.txt".
//Edit2:
What makes me wonder is what exactly causes the zip to change between XP and Win7. The fact that it does change means that either a Windows function behaves differently or XZip has OS-specific behavior built in.
Having a quick look at XZip, I can't see that it changes the encoding flag on the zip archives. This question can of course only be answered by people who have had a closer look at this exact problem before.
As a general rule, if you want any sort of portability between locales, OSes (including different versions) and what have you, you should limit your filenames to the usual 26 letters, the 10 digits, and perhaps '_' and '-' (and I'm not even sure about the latter), and one '.', no more than three characters from the end. Once you start using letters beyond the original ASCII character set, you're at the mercy of the various programs which interpret the character set.
Also, 252 isn't anything in ASCII, since ASCII only uses character codes in the range 0...127. And in UTF-8, 252 would be the first byte of a six-byte sequence, something that doesn't exist in Unicode: in UTF-8, LATIN SMALL LETTER U WITH DIAERESIS is the two-byte sequence 0xC3, 0xBC. 252 is the encoding of LATIN SMALL LETTER U WITH DIAERESIS in ISO 8859-1, otherwise known as Latin-1; it's also the code point in UTF-16 and UTF-32.
None of this, of course, should affect what is in the file.
Maybe you are building your Win32 program (or the library) as ANSI/MBCS (not as UNICODE). It may help to build your Win32 applications with the UNICODE configuration setting (you can change it in your Visual Studio project settings).
It is impossible to say what happens in your program without seeing your code. Maybe your library or the archive format is not UNICODE-aware, maybe your program's code is not UNICODE-aware, maybe you don't handle strings carefully enough, or maybe you just have to change your project setting to UNICODE. The Windows "language for non-Unicode programs" OS setting also matters if you don't use UNICODE strings.
As for 252, UTF-8 and ASCII, read the post by James Kanze. It is more or less safe to use ASCII file names with no ':', '?', '*', '/', '\' characters. Using non-ASCII characters may lead to encoding problems if you are not using UNICODE-based programs and file systems.
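As an illustration of what UNICODE-aware means in practice (a sketch only; whether XZip itself exposes wide-character entry points is a separate question):

#include <windows.h>

int main()
{
    // Convert the locale-dependent narrow name explicitly to UTF-16 and use
    // the wide API, so the "language for non-Unicode programs" setting no
    // longer influences the file name that ends up on disk.
    wchar_t wide[MAX_PATH];
    MultiByteToWideChar(CP_ACP, 0, "\xFC.txt" /* "ü.txt" in CP1252 */, -1,
                        wide, MAX_PATH);
    HANDLE h = CreateFileW(wide, GENERIC_WRITE, 0, NULL, CREATE_ALWAYS,
                           FILE_ATTRIBUTE_NORMAL, NULL);
    if (h != INVALID_HANDLE_VALUE)
        CloseHandle(h);
    return 0;
}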
If I'm given a .doc file with special tags in it such as [first_name], how do I go about replacing all occurrences of it with something like "Clark"? A simple binary replacement only works if the replacement string is the exact same length.
Haskell, C, and C++ answers would be best, but any compiled language would do. I'd also prefer to do this without an external library since it has to be deployed on Windows and Linux and cross-platform dependency handling is a bitch.
To summarize...
.doc -> magic program -> .doc with strings replaced
You could use the Word COM component ("Word.Application") on Windows to open the file, do the replacements, save the file, and close it. However, this is Windows-only and can be buggy.
Another thing you could do is use the OpenOffice.org command line interface to convert the file to the ODF format, unzip the file (ODF is mostly zipped XML), do the replacements in the files inside, re-zip the file, and re-convert it to .doc format. However, OpenOffice.org doesn't always read Word files correctly (especially if there is a lot of complex formatting), and it can make your program harder to distribute (users must either have OpenOffice.org or you must distribute it with your program).
Also, if you have a file in the .docx format, you can unzip it, do the replacements, and re-zip it.
First read the Word Document Specification.
If that hasn't terrified you, then you should find it fairly straightforward to figure out how to read and write it. It must be possible; Word manages to do it most of the time.
You probably have to use .Net programming (VB or C#) to create an object of Word.Application and then use the MS Word object model to manipulate your document.
Why do you want to use C/C++/Haskell or another compiled language? I'm not too familiar with Haskell, but in general I would say that C is not a great language for text processing. A lot of interpreted languages (Perl, Python, etc.) also have powerful regular-expression libraries that are well suited to finding and replacing phrases.
With that said, as the other posters have noted, you will still have to deal with the eccentricities of the .doc format.
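That said, even standard C++ can handle the find-and-replace part once you have plain text; the hard part is the .doc container, not the substitution. A minimal sketch:

#include <iostream>
#include <regex>
#include <string>

int main()
{
    std::string doc = "Dear [first_name], welcome!";
    // Brackets are regex metacharacters, so they are escaped; regex_replace
    // substitutes every occurrence by default.
    std::string out = std::regex_replace(
        doc, std::regex("\\[first_name\\]"), "Clark");
    std::cout << out << '\n';   // Dear Clark, welcome!
}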