protocol buffers : no notation for fixed size buffers? - c++

Since I am not getting an answer on this question I'll have to prototype and check myself. My dataset headers need to be fixed size, so I need fixed-size strings. So, is it possible to specify fixed-size strings or byte arrays in protocol buffers? It is not readily apparent here, and I feel a bit uneasy about forcing fixed-size strings into the header message -- i.e., std::string(128, '\0');
If not, I'd rather use a #pragma pack(1) struct header {...};
Edit
Question indirectly answered here. Will answer and accept.

protobuf does not have such a concept in the protocol, nor in the .proto schema language. In strings and blobs, the data is always technically variable length using a length prefix (which itself uses varint encoding, so even the length is variable length).
Of course, if you only ever store data of a particular length, then it will line up. Note also that since strings in protobuf are Unicode using UTF-8 encoding, the length of the encoded data is not as simple as the number of characters (unless you are using only ASCII characters).
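If the header really must occupy a fixed number of bytes, one workaround (my own sketch, not a protobuf feature) is to enforce the size at the application level before storing the value in a bytes field; the field stays length-prefixed on the wire, but the payload is always exactly the same size. The set_header() call below assumes a hypothetical bytes header = 1; field.

#include <cstddef>
#include <string>

// Sketch only: pad or truncate a value to exactly 128 bytes before storing
// it in a protobuf `bytes` field.
std::string make_fixed_header(std::string value, std::size_t size = 128) {
    value.resize(size, '\0');   // pads with NUL bytes, or truncates if too long
    return value;
}

// Usage (hypothetical generated API for `bytes header = 1;`):
//   msg.set_header(make_fixed_header(raw_header));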

This is a slight clarification to the previous answer. Protocol Buffers does not encode strings as UTF-8; it encodes them as regular bytes. The on-wire format is the number of bytes consumed, followed by the actual bytes. See https://developers.google.com/protocol-buffers/docs/encoding/.
While the on-wire format is always the same, protocol buffers provides two interfaces for developers to use, string and bytes, with the primary difference being that the former will generally try to provide string types to the developer whereas the latter will try to provide byte types (e.g., Java provides String for string and ByteString for bytes).
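To make that wire format concrete, here is a minimal hand-rolled sketch (my own illustration, not the protobuf library API) of a length-delimited value: a varint byte count followed by that many raw payload bytes.

#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <string>

// Sketch only: decode one length-delimited value. `data` is assumed to point
// just past the field's tag.
std::string read_length_delimited(const uint8_t* data, size_t size, size_t* consumed) {
    uint64_t len = 0;
    int shift = 0;
    size_t i = 0;
    // Varint: 7 payload bits per byte; a set high bit means "more bytes follow".
    while (i < size) {
        uint8_t b = data[i++];
        len |= static_cast<uint64_t>(b & 0x7F) << shift;
        shift += 7;
        if ((b & 0x80) == 0) break;
    }
    size_t n = std::min<size_t>(static_cast<size_t>(len), size - i);
    std::string payload(reinterpret_cast<const char*>(data + i), n);
    *consumed = i + n;
    return payload;
}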

Related

Convert array to encoded string

I am using libiconv to convert array of char to encoded string.
I am new to the library.
I wonder how I can know what encoded type the given array is encoded in before I call iconv_open("To-be-encoded","given-encoded-type")
It's the second parameter that I need to know.
Yes, you do indeed need to know it. That is, it is you who must tell iconv what encoding your array is in. There is no reliable way of detecting what encoding was used to produce a set of bytes; at best you can take a guess based on character frequencies or other such heuristics.
But there is no way to be 100% sure without other information, from metadata or from the file/data format itself. (e.g. HTTP provides headers to indicate encoding, XML has that capability too.)
In other words, if you don't know how a stream of bytes you have is encoded, you cannot convert it to anything else. You need to know the starting point.
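For completeness, here is a minimal sketch of the conversion itself once you do know the source encoding; the "ISO-8859-1" in the usage comment is just an assumed example.

#include <cstddef>
#include <iconv.h>
#include <string>

// Sketch only: convert a byte buffer from a known source encoding to UTF-8
// with libiconv; error handling is kept to the bare minimum.
std::string to_utf8(const char* in, size_t in_len, const char* from_encoding) {
    iconv_t cd = iconv_open("UTF-8", from_encoding);   // to-encoding, from-encoding
    if (cd == (iconv_t)-1) return {};

    std::string out(in_len * 4, '\0');   // generous worst-case output size
    char* inbuf = const_cast<char*>(in);
    char* outbuf = &out[0];
    size_t inleft = in_len, outleft = out.size();

    if (iconv(cd, &inbuf, &inleft, &outbuf, &outleft) == (size_t)-1)
        out.clear();                      // conversion failed
    else
        out.resize(out.size() - outleft); // shrink to actual output length

    iconv_close(cd);
    return out;
}

// Usage, assuming the caller knows (e.g. from metadata) that the input is Latin-1:
//   std::string utf8 = to_utf8(data, len, "ISO-8859-1");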

Is it possible to 'trim' trailing spaces/tabs from a string in an arbitrary encoding using ICU without doing any conversions

Specifically, given the following:
A pointer to a buffer containing string data in some encoding X supported by ICU
The length of the data in the buffer, in bytes
The encoding of the buffer (i.e. X)
Can I compute the length of the string, minus the trailing space/tab characters, without actually converting it into ICU's internal encoding first and then converting back? (That itself could be problematic due to Unicode normalization.)
For certain encodings, such as any ASCII-based encoding along with UTF-8/16/32, the solution is pretty simple: just iterate from the back of the string, comparing either 1/2/4 bytes at a time against the two constants.
For others it could be harder (variable-length encodings come to mind). I would like this to be as efficient as possible.
For a large subset of encodings, and for the limited set of U+0020 SPACE and U+0009 HORIZONTAL TAB, this is pretty simple.
In ASCII, single-byte Windows code pages, and single-byte ISO code pages, these characters all have the same value. You can simply work backwards, byte-by-byte, lopping them off as long as the value is either 9 or 32.
This approach also works for UTF-8, which has the nice property that a byte less than 128 is always that ASCII character. You don't have to wonder whether it's a lead byte or a continuation byte, as those always have the high bit set.
Given UTF-16, you work two bytes at a time, looking for 0x0009 and 0x0020, being careful to handle byte order. Like UTF-8, UTF-16 has the nice property that if you see this value, you don't have to wonder if it's part of a surrogate pair, as those always have a distinct value.
The problematic cases are the variable-byte encodings that don't give you the assurance that a given unit is unique. If you see a byte with the value 9, you don't necessarily know whether it's a tab character or a random byte from a multibyte sequence. Even for some of these, however, it may be possible that the specific values you care about (9 and 32) are unique. For example, looking at Windows code page 950, it seems that lead bytes have the high bit set, and tail bytes steer clear of the lower values (it would take a lot of checking to be absolutely sure). So for your limited case, this might be sufficient.
For the general problem of stripping out an arbitrary set of characters from absolutely any encoding, you need to parse the string according to the rules of that encoding (as well as knowing all the character mappings). For the general case, it's almost certainly best to convert the string to some Unicode encoding, do the trimming, and then convert back. This should round-trip correctly if you're careful about which normalization forms you apply.
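As a concrete illustration of the byte-wise approach described above (a sketch only, assuming the buffer is in an ASCII-compatible single-byte encoding or UTF-8, where the byte values 0x09 and 0x20 never occur inside a multi-byte sequence):

#include <cstddef>

// Returns the length of the buffer with trailing spaces and tabs removed.
size_t trimmed_length(const unsigned char* buf, size_t len) {
    while (len > 0 && (buf[len - 1] == 0x20 || buf[len - 1] == 0x09))
        --len;
    return len;
}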
I use the rather simplistic STL approach of:
std::string mystring;
// Erase everything after the last non-whitespace character. If the string is
// all whitespace, find_last_not_of returns npos, and npos + 1 == 0, so the
// whole string is erased.
mystring.erase(mystring.find_last_not_of(" \n\r\t") + 1);
This has worked for all my needs so far (your mileage may vary); after years of use it still seems to do the job :)
Let me know if you need more information :)
If you restrict the "arbitrary encoding" requirement to "any encoding that uses the same code values for space and tab as ASCII", which is still rather general, you don't even need ICU at all. boost::trim_right or boost::trim_right_if is all you need.
http://www.boost.org/doc/libs/1_55_0/doc/html/string_algo/usage.html#idp206822440
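For example, a minimal sketch using Boost.StringAlgo, trimming only trailing spaces and tabs in place:

#include <boost/algorithm/string/classification.hpp>
#include <boost/algorithm/string/trim.hpp>
#include <string>

int main() {
    std::string s = "some text \t ";
    // Remove trailing characters that are either a space or a tab.
    boost::trim_right_if(s, boost::is_any_of(" \t"));
    // s == "some text"
}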

How to deal with the 9-bit strings in PDF files?

In PDF files, the datatype string has 9-bit characters. While hex-encoded strings are only 8-bit, it is possible in "normal" strings to express 512 different values for a single character. This is achieved using the octal representation of the character after a backslash.
When parsing PDF files, what datatype should I use to store such strings in? It's really annoying because I can't handle them like a byte array on which I later apply an encoding like UTF-8, but I also can't use them as an already decoded string because 512 different characters are not enough to store Unicode characters, so those 9-bit strings again need to get encoded somehow :/ I just don't know any encodings that encode/decode from/to 9-bit "bytes"...
Do you have any tips/ best practices on this?
Update
As R.Martinho Fernandes pointed out:
Even if it is theoretically possible to express 512 different values with 3 octal digits, it is only valid to express values smaller than 256. The only strange thing about it is: why did they use a 3-digit octal representation and not just a 2-digit hex representation?
I think the answer is that \b and \f would then be treated as hex values, but I'm not sure about this.
Anyway: I'm glad the guys at Adobe were not drunk when they made the PDF format :) AND: I need an answer to accept, guys!
The PDF format only allows strings of 8-bit bytes. Octal escapes could represent 9-bit units, but the 9th possible bit is useless for representing 8-bit bytes. This is common practice; the same is true for C++ octal escapes, for example. So, worry not, there are no 9-bit strings in PDF :)
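As an illustration (my own sketch, not code taken from the PDF specification), decoding a \ddd octal escape and reducing it to an 8-bit byte could look like this:

#include <cstddef>

// Reads up to three octal digits after the backslash and reduces the result
// modulo 256, so the output is always an ordinary 8-bit byte.
unsigned char decode_octal_escape(const char* p, size_t n, size_t* consumed) {
    unsigned value = 0;
    size_t i = 0;
    while (i < n && i < 3 && p[i] >= '0' && p[i] <= '7') {
        value = value * 8 + static_cast<unsigned>(p[i] - '0');
        ++i;
    }
    *consumed = i;
    return static_cast<unsigned char>(value & 0xFF);   // the "9th bit" is dropped
}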
If you're not really pressed for memory space, why not simply use a 16-bit type?

UTF-8 decoding library

I have to write code in an application that uses Unicode UTF-8 on Windows, MSVC 10. I'm aware that UTF-8 encoded strings use one or more bytes per character. So, my question is: is std::string suitable for this? If yes, how do I decode the strings? As far as I understand, std::string is just an array of bytes and it doesn't provide any decoding logic.
How can I know the logical length of the string? How can I extract logical characters from a string? Are there any libraries which helps me to extract logical characters from the string?
e.g.: If I have the string "olé" in a std::string, I need to know that the length is 3, not 4.
A commonly used library is ICU - International Components for Unicode.
Yes, std::string is appropriate, but as you've noticed it only operates on bytes, not Unicode code points. In that sense, std::string is an opaque type; this isn't necessarily bad (in fact, it has some advantages; see the links below for information) but it makes it necessary to decode the string if you need information about characters.
For the actual handling of UTF-8 (where necessary), you can use the Boost.NoWide library to decode UTF-8.
Furthermore, I suggest reading the UTF-8 everywhere manifesto for some information about the use of UTF-8 vs. other Unicode transformations.
First you may want to call the mbstowcs() function to transform the UTF-8 characters to wide characters. Then, if you want the result to be 8 bits, you'll have a loss of data in the event you have "Unicode" characters (characters outside of the ISO-8859-1 range, also called Latin-1).
Note that the "Windows" encoding is not a 1-to-1 equivalent of ISO-8859-1, but in most cases ISO-8859-1 is what people use these days.
Reference: http://www.cplusplus.com/reference/clibrary/cstdlib/mbstowcs/
Okay, if you just want the length in characters, use the mblen() function:
len = mblen(str.c_str(), str.length());
Additional note: an easy way to implement mblen() yourself is to count the number of bytes that are not between 0x80 and 0xBF, since those are continuation bytes of a multi-byte sequence. This is particularly useful if you receive a UTF-8 byte sequence over a flaky serial connection.
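A minimal sketch of that counting trick (assuming the input is valid UTF-8):

#include <cstddef>
#include <string>

// Counts code points by skipping UTF-8 continuation bytes (0x80..0xBF).
std::size_t utf8_length(const std::string& s) {
    std::size_t count = 0;
    for (unsigned char c : s) {
        if (c < 0x80 || c > 0xBF)   // not a continuation byte
            ++count;
    }
    return count;
}

// With this, utf8_length("olé") is 3 while the byte length is 4 (when the
// source file itself is saved as UTF-8).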

How does the UTF-8 support of TinyXML work?

I'm using TinyXML to parse/build XML files. Now, according to the documentation this library supports multibyte character sets through UTF-8. So far so good I think. But, the only API that the library provides (for getting/setting element names, attribute names and values, ... everything where a string is used) is through std::string or const char*. This has me doubting my own understanding of multibyte character set support. How can a string that only supports 8-bit characters contain a 16 bit character (unless it uses a code page, which would negate the 'supports Unicode' claim)? I understand that you could theoretically take a 16-bit code point and split it over 2 chars in a std::string, but that wouldn't transform the std::string to a 'Unicode' string, it would make it invalid for most purposes and would maybe accidentally work when written to a file and read in by another program.
So, can somebody explain to me how a library can offer an '8-bit interface' (std::string or const char*) and still support 'Unicode' strings?
(I probably mixed up some Unicode terminology here; sorry about any confusion coming from that).
First, UTF-8 is stored in const char * strings, as #quinmars said. And it's not only a superset of 7-bit ASCII (code points <= 127 are always encoded in a single byte as themselves); it's furthermore careful that bytes with those values are never used as part of the encoding of the multibyte values for code points >= 128. So if you see a byte == 60, it's a '<' character, etc. All of the metachars in XML are in 7-bit ASCII. So one can just parse the XML, breaking strings where the metachars say to, sticking the fragments (possibly including non-ASCII chars) into a char * or std::string, and the returned fragments remain valid UTF-8 strings even though the parser didn't specifically know UTF-8.
Further (not specific to XML, but rather clever), even more complex things generally just work (tm). For example, if you sort UTF-8 lexicographically by bytes, you get the same answer as sorting it lexicographically by code points, despite the variation in the number of bytes used, because the prefix bytes introducing the longer (and hence higher-valued) code points are numerically greater than those for lesser values.
UTF-8 is compatible with 7-bit ASCII code. If the value of a byte is larger than 127, it means a multibyte character starts. Depending on the value of the first byte you can see how many bytes the character will take, which can be 2-4 bytes including the first byte (technically, 5 or 6 are also possible, but they are not valid UTF-8). Here is a good resource about UTF-8: UTF-8 and Unicode FAQ, also the wiki page for UTF-8 is very informative. Since UTF-8 is char-based and 0-terminated, you can use the standard string functions for most things. The only important thing is that the character count can differ from the byte count. Functions like strlen() return the byte count but not necessarily the character count.
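As a small illustration of the lead-byte rule described above (a sketch, not TinyXML code):

#include <cstddef>

// The leading byte of a UTF-8 sequence encodes how many bytes the whole
// sequence occupies.
std::size_t utf8_sequence_length(unsigned char lead) {
    if (lead < 0x80) return 1;            // 0xxxxxxx : plain ASCII
    if ((lead & 0xE0) == 0xC0) return 2;  // 110xxxxx
    if ((lead & 0xF0) == 0xE0) return 3;  // 1110xxxx
    if ((lead & 0xF8) == 0xF0) return 4;  // 11110xxx
    return 0;                             // continuation byte or invalid lead byte
}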
By using between 1 and 4 chars to encode one Unicode code point.