So, I'm building a scripting language, and one of my goals is convenient string operations. I tried some ideas in C++:
Strings as sequences of bytes, with free functions that return vectors containing the code-point indices.
A wrapper class that combines a string and a vector containing the indices.
Both ideas had the same problem: what should indexing return? It couldn't be a char, and if it were a string it would waste space.
I ended up creating a wrapper class around a char array of exactly 4 bytes: a string that occupies exactly 4 bytes in memory, no more, no less.
After creating this class, I felt tempted to wrap a std::vector of it in another class and build from there, thus making a string type of code points. I don't know if this is a good approach; it would be much more convenient, but it would waste more space.
So, before posting some code, here's a more organized list of ideas.
My character type would be neither a byte nor a grapheme, but a code point. I named it a rune, like the one in the Go language.
A string as a series of decomposed runes, thus making indexing and slicing O(1).
Because a rune is now a class and not a primitive, it can be extended with methods, e.g. for detecting Unicode whitespace: mystring[0].is_whitespace()
I still don't know how to handle graphemes.
Curious fact! An odd thing about the way I built the prototype of the rune class is that it always prints as UTF-8. Because my rune is not an int32 but a 4-byte string, it ends up having some interesting properties.
My code:
#include <cstddef>
#include <ostream>
#include <string>

class rune {
    char data[4] {};
public:
    rune(char c) {
        data[0] = c;
    }
    // This constructor needs a string, a position and a byte count (at most 4)!
    rune(std::string const & s, size_t p, size_t n) {
        for (size_t i = 0; i < n && i < 4; ++i) {
            data[i] = s[p + i];
        }
    }
    void swap(rune & other) {
        rune t = *this;
        *this = other;
        other = t;
    }
    // Output as UTF-8!
    friend std::ostream & operator <<(std::ostream & output, rune input) {
        for (size_t i = 0; i < 4; ++i) {
            if (input.data[i] == '\0') {
                return output;
            }
            output << input.data[i];
        }
        return output;
    }
};
Error handling ideas:
I don't like to use exceptions in C++. My idea is: if the constructor fails, initialize the rune as four '\0' bytes, then explicitly overload operator bool to return false if the first byte of the rune happens to be '\0'. Simple and easy to use.
So, thoughts? Opinions? Different approaches?
Even if my rune string is too much, at least I have a rune type. Small and fast to copy. :)
It sounds like you're trying to reinvent the wheel.
There are, of course, two ways you need to think about text:
As an array of codepoints
As an encoded array of bytes.
In some codebases, those two representations are the same (and all encodings are basically arrays of char32_t or unsigned int). In some (I'm inclined to say "most" but don't quote me on that), the encoded array of bytes will use UTF-8, where the codepoints are converted into variable lengths of bytes before being placed into the data structure.
And of course many codebases simply ignore unicode entirely and store their data in ASCII. I don't recommend that.
For your purposes, while it does make sense to write a class to "wrap around" your data (though I wouldn't call it a rune, I'd probably just call it a codepoint), you'll want to think about your semantics.
You can (and probably should) treat all std::string's as UTF-8 encoded strings, and prefer this as your default interface for dealing with text. It's safe for most external interfaces—the only time it will fail is when interfacing with a UTF-16 input, and you can write corner cases for that—and it'll save you the most memory, while still obeying common string conventions (it's lexicographically comparable, which is the big one).
If you need to work with your data in codepoint form, then you'll want to write a struct (or class) called codepoint, with the following useful functions and constructors.
While I have had to write code that handles text in codepoint form (notably for a font renderer), this is probably not how you should store your text. Storing text as codepoints leads to problems later on when you're constantly comparing against UTF-8 or ASCII encoded strings.
code:
struct codepoint {
    char32_t val;

    codepoint(char32_t _val = 0) : val(_val) {}
    codepoint(std::string const& s);
    codepoint(std::string::const_iterator begin, std::string::const_iterator end);
    //I don't know the UTF-8→codepoint conversion off-hand. There are lots of places
    //online that show how to do this

    std::string to_utf8() const;
    //Again, look up an algorithm. They're not *too* complicated.

    void append_to_string_as_utf8(std::string & s) const;
    //This might be more performant if you're trying to reduce how many dynamic memory
    //allocations you're making.

    //codepoint(std::wstring const& s);
    //std::wstring to_utf16() const;
    //void append_to_string_as_utf16(std::wstring & s) const;

    //Anything else you need, equality operator, comparison operator, etc.
};
Related
I am writing a parser in C++ to parse a well-defined binary file. I have declared all the required structs, and since only particular fields are of interest to me, I have skipped the non-required fields in my structs by declaring char arrays of size equal to the skipped bytes. So I am just reading the file into a char array and casting the char pointer to my struct pointer. Now the problem is that all data fields in that binary are in big-endian order, so after typecasting I need to change the endianness of all the struct fields. One way is to do it manually for each and every field, but there are various structs with many fields, so that would be very cumbersome. What's the best way to achieve this? And since I'll be parsing very large files (terabytes), I need a fast way to do it.
EDIT: I have used __attribute__((packed)), so there is no need to worry about padding.
If you can do misaligned accesses with no penalty, and you don't mind compiler- or platform-specific tricks to control padding, this can work. (I assume you are OK with this since you mention __attribute__((packed))).
In this case the nicest approach is to write value wrappers for your raw data types, and use those instead of the raw types when declaring your struct in the first place. Remember the value wrapper must be trivial/POD-like for this to work. If you have a POSIX platform you can use ntohs/ntohl for the endian conversion; it's likely to be better optimized than whatever you write yourself.
If misaligned accesses are illegal or slow on your platform, you need to deserialize instead. Since we don't have reflection yet, you can do this with the same value wrappers (plus an Ignore<N> placeholder that skips N bytes for fields you're not interested in), and declare them in a tuple instead of a struct: you can iterate over the members of a tuple and tell each to deserialize itself from the message.
One way to do that is to combine the C preprocessor with C++ operators. Write a couple of C++ classes like this one:
#include <immintrin.h>

class FlippedInt32
{
    int value;
public:
    inline operator int() const
    {
        return _bswap( value );
    }
};

class FlippedInt64
{
    __int64 value;
public:
    inline operator __int64() const
    {
        return _bswap64( value );
    }
};
Then,
#define int FlippedInt32
before including the header that define these structures. #undef immediately after the #include.
This will replace all int fields in the structures with FlippedInt32, which has the same size but returns flipped bytes.
If it’s your own structures which you can modify you don’t need the preprocessor part. Just replace the integers with the byte-flipping classes.
If you can come up with a list of offsets (in bytes, relative to the top of the file) of the fields that need endian-conversion, as well as the sizes of those fields, then you could do all of the endian-conversion with a single for-loop, directly on the char array. E.g. something like this (pseudocode):
struct EndianRecord {
    size_t offsetFromTop;
    size_t fieldSizeInBytes;
};

std::vector<EndianRecord> todoList;
// [populate the todo list here...]

char * rawData = [pointer to the raw data];
for (size_t i = 0; i < todoList.size(); i++)
{
    const EndianRecord & er = todoList[i];
    ByteSwap(&rawData[er.offsetFromTop], er.fieldSizeInBytes);
}

struct MyPackedStruct * data = (struct MyPackedStruct *) rawData;
// Now you can just read the member variables
// as usual because you know they are already
// in the correct endian-format.
... of course the difficult part is coming up with the correct todoList, but since the file format is well-defined, it should be possible to generate it algorithmically (or better yet, create it as a generator with e.g. a GetNextEndianRecord() method that you can call, so that you don't have to store a very large vector in memory)
I have a char iterator - an std::istreambuf_iterator<char> wrapped in a couple of adaptors - yielding UTF-8 bytes. I want to read a single UTF-32 character (a char32_t) from it. Can I do so using the STL? How?
There's std::codecvt_utf8<char32_t>, but that apparently only works on char*, not arbitrary iterators.
Here's a simplified version of my code:
#include <iostream>
#include <sstream>
#include <iterator>
// in the real code some boost adaptors etc. are involved
// but the important point is: we're dealing with a char iterator.
typedef std::istreambuf_iterator< char > iterator;
char32_t read_code_point( iterator& it, const iterator& end )
{
    // how do I do this conversion?
    // codecvt_utf8<char32_t>::in() only works on char*
    return U'\0';
}

int main()
{
    // actual code uses std::istream so it works on strings, files etc.
    // but that's irrelevant for the question
    std::stringstream stream( u8"\u00FF" );
    iterator it( stream );
    iterator end;
    char32_t c = read_code_point( it, end );
    std::cout << std::boolalpha << ( c == U'\u00FF' ) << std::endl;
    return 0;
}
I am aware that Boost.Regex has an iterator for this, but I'd like to avoid boost libraries that are not header-only and this feels like something the STL should be capable of.
I don't think you can do this directly with codecvt_utf8 or any other standard library components. To use codecvt_utf8 you'd need to copy bytes from the iterator stream into a buffer and convert the buffer.
Something like this should work:
char32_t read_code_point( iterator& it, const iterator& end )
{
    char32_t result;
    char32_t* resend = &result + 1;
    char32_t* resnext = &result;
    char buf[7]; // room for 3-byte UTF-8 BOM and a 4-byte UTF-8 character
    char* bufpos = buf;
    const char* const bufend = std::end(buf);
    std::codecvt_utf8<char32_t> cvt;

    while (bufpos != bufend && it != end)
    {
        *bufpos++ = *it++;
        std::mbstate_t st{};
        const char* be = bufpos;
        const char* bn = buf;
        auto conv = cvt.in(st, buf, be, bn, &result, resend, resnext);
        if (conv == std::codecvt_base::error)
            throw std::runtime_error("Invalid UTF-8 sequence");
        if (conv == std::codecvt_base::ok && bn == be)
            return result;
        // otherwise read another byte and try again
    }
    if (it == end)
        throw std::runtime_error("Incomplete UTF-8 sequence");
    throw std::runtime_error("No character read from first seven bytes");
}
This appears to do more work than necessary, re-scanning the whole UTF-8 sequence in [buf, bufpos) on every iteration (and making a virtual function call to codecvt_utf8::do_in). In theory the codecvt_utf8::in implementation could read an incomplete multibyte sequence and store state information in the mbstate_t argument, so that the next call would resume from where the last one left off, only consuming new bytes, not re-processing the incomplete multibyte sequence that was already seen.
However, implementations are not required to use the mbstate_t argument to store state between calls and in practice at least one implementation of codecvt_utf8::in (the one I wrote for GCC) doesn't use it at all. From my experiments it seems that the libc++ implementation doesn't use it either. This means that they stop converting before an incomplete multibyte sequence, and leave the from_next pointer (the bn argument here) pointing to the beginning of that incomplete sequence, so that the next call should start from that position and (hopefully) provide enough additional bytes to complete the sequence and allow a complete Unicode character to be read and converted to char32_t. Because you are only trying to read a single codepoint, this means it does no conversion at all, because stopping before an incomplete multibyte sequence means stopping at the first byte.
It's possible that some implementations do use the mbstate_t argument, so you could modify the function above to handle that case as well, but to be portable it would still need to cope with implementations that ignore the mbstate_t. Supporting both types of implementation would complicate the function considerably, so I kept it simple and wrote a form that should work with all implementations, even if they do actually use the mbstate_t. Because you are only going to be reading up to 7 bytes at a time (in the worst case ... the average case may be only one or two bytes, depending on the input text) the cost of re-scanning the first few bytes every time shouldn't be huge.
To get better performance from codecvt_utf8 you should avoid converting one codepoint at a time, because it's designed for converting arrays of characters not individual ones. Since you always need to copy to a char buffer anyway you could copy larger chunks from the input iterator sequence and convert whole chunks. This would reduce the likelihood of seeing incomplete multibyte sequences, since only the last 1-3 bytes at the end of a chunk would need to be re-processed if the chunk ends in an incomplete sequence, everything earlier in the chunk would have been converted.
To get better performance reading single codepoints you should probably avoid codecvt_utf8 entirely and either roll your own (if you only need UTF-8 to UTF-32BE it's not so hard) or use a third-party library such as ICU.
My question is very simple: how is getline(istream, string) implemented?
How do implementations solve the problem of fixed-size char arrays, as with getline(char* s, streamsize n)?
Are they using temporary buffers and many calls to new char[length] or another neat structure?
getline(istream&, string&) is implemented so that it reads a line. There is no single definitive implementation; each library probably differs from the others.
Possible implementation:
istream& getline(istream& stream, string& str)
{
    char ch;
    str.clear();
    while (stream.get(ch) && ch != '\n')
        str.push_back(ch);
    return stream;
}
@SethCarnegie is right: more than one implementation is possible. The C++ standard does not say which should be used.
However, the question is still interesting. It's a classic computer-science problem. Where, and how, does one allocate memory when one does not know in advance how much memory to allocate?
One solution is to record the string's characters as a linked list of individual characters. This is neither memory-efficient nor fast, but it works, is robust, and is relatively simple to program. However, a standard library is unlikely to be implemented this way.
A second solution is to allocate a buffer of some fixed length, such as 128 characters. When the buffer overflows, you allocate a new buffer of double length, 256 characters, then copy the old characters over to the new storage, then release the old. When the new buffer overflows, you allocate an even newer buffer of double length again, 512 characters, then repeat the process; and so on.
A third solution combines the first two. A linked list of character arrays is maintained. The first two members of the list store (say) 128 characters each. The third stores 256. The fourth stores 512, and so on. This requires more programming than the others, but may be preferable to either, depending on the application.
And the list of possible implementations goes on.
Regarding standard-library implementations, @SteveJessop adds that "[a] standard library's string isn't permitted to be implemented as (1), because of the complexity requirement of operator[] for strings. In C++11 it's not permitted to be implemented as (3) either, because of the contiguity requirement for strings. The C++ committee expressed the belief that no active C++ implementation did (3) at the time they added the contiguity requirement. Of course, getline can do what it likes temporarily with the characters before adding them all to the string, but the standard does say a lot about what string can do."
The addition is relevant because, although getline could temporarily store its data in any of several ways, if the data's ultimate target is a string, this may be relevant to getline's implementation. @SteveJessop further adds, "For string itself, implementations are pretty much required to be (2) except that they can choose their own rate of expansion; they don't have to double each time as long as they multiply by some constant."
As @3bdalla said, thb's implementation doesn't behave like the GNU implementation. So I wrote my own implementation, which works like GNU's. I don't know how this variant behaves on errors, so it needs to be tested.
My implementation of getline:
std::istream& getline(std::istream& is, std::string& s, char delim = '\n'){
    s.clear();

    char c;
    std::string temp;

    if(is.get(c)){
        temp.push_back(c);
        while((is.get(c)) && (c != delim))
            temp.push_back(c);

        if(!is.bad())
            s = temp;
        if(!is.bad() && is.eof())
            is.clear(std::ios_base::eofbit);
    }

    return is;
}
Parsing text consisting of a sequence of integers from a stream in C++ is easy enough: just decode them. When the data is received somehow and is readily available within a program, e.g., receiving a base64 encoded text (the decoding isn't the problem), the situation is a bit different. The data is sitting in a buffer within the program and only needs to be decoded, not read. Of course, a std::istringstream could be used:
std::vector<int> parse_text(char* begin, char* end) {
    std::istringstream in(std::string(begin, end));
    return std::vector<int>(std::istream_iterator<int>(in),
                            std::istream_iterator<int>());
}
Since a lot of these buffers are received and they can be fairly big, it is desirable to not copy the actual content of character array and, ideally, to also avoid creating a stream for each buffer. Thus, the question becomes:
Given a buffer of chars containing a sequence of (space-separated; other separators are easily handled, e.g., using a suitable manipulator) integers, how can they be decoded without copying the sequence and, if possible, without even creating a std::istream?
Avoiding a copy of the buffer is easily done with a custom stream buffer which simply sets up the get area to use the buffer. The stream buffer actually doesn't even need to override any of the virtual functions; it would just set up the internal buffer:
class imemstream
    : private virtual std::streambuf
    , public std::istream
{
public:
    imemstream(char* begin, char* end)
        : std::streambuf()
        , std::istream(static_cast<std::streambuf*>(this))
    {
        this->setg(begin, begin, end);
    }
};

std::vector<int> parse_data_via_istream(char* begin, char* end)
{
    imemstream in(begin, end);
    return std::vector<int>(std::istream_iterator<int>(in),
                            std::istream_iterator<int>());
}
This approach avoids copying the buffer and uses the ready-made std::istream functionality. However, it does create a stream object. With a suitable update function the stream/stream buffer can be extended to reset the buffer and process multiple buffers.
To avoid creation of the stream, the underlying functionality from std::num_get<...> could be used. The actual parsing is done by one of the std::locale facets. The numeric parsing for std::istream is done by std::num_get<char, std::istreambuf_iterator<char>>. That facet isn't much help, as it uses a sequence specified by std::istreambuf_iterator<char>s, but a std::num_get<char, char const*> facet can be instantiated. It won't be part of the default std::locale, but it is easy to create a corresponding std::locale and install it, e.g., as the global std::locale object first thing in main():
int main()
{
    std::locale::global(std::locale(std::locale(),
                                    new std::num_get<char, char const*>()));
    ...
Note that the std::locale object will clean up the added facet, i.e., there is no need to add any clean-up code: the facets are reference counted and released when the last std::locale holding a particular facet disappears. To actually use the facet it, unfortunately, needs a std::ios_base object, which can only really be obtained from some stream object. However, any stream can be used (although in a multi-threaded system it should probably be a separate stream object per thread to avoid accidental race conditions):
char const* skipspace(char const* it, char const* end)
{
    return std::find_if(it, end,
                        [](unsigned char c){ return !std::isspace(c); });
}

std::vector<int> parse_data_via_istream(std::ios_base& fmt,
                                        char const* it, char const* end)
{
    std::vector<int> rc;
    std::num_get<char, char const*> const& ng
        = std::use_facet<std::num_get<char, char const*>>(std::locale());
    std::ios_base::iostate error;
    for (long tmp;
         (it = ng.get(skipspace(it, end), end, fmt, error, tmp))
         , error == std::ios_base::goodbit; ) {
        rc.push_back(tmp);
    }
    return rc;
}
Most of this is just about a bit of error handling and skipping leading whitespace: normally, std::istream provides facilities to automatically skip whitespace for formatted input and deals with the necessary error protocol. There is potentially a small advantage of the approach outlined above with respect to getting the facet just once per buffer and avoiding creation of a std::istream::sentry object, as well as avoiding creation of a stream. Of course, the code assumes that some stream can be used to pass it in as its std::ios_base& subobject to provide parsing flags like the base to be used.
OK, this is quite a bit of code for something which strtol() could mostly do, too. The approach using std::num_get<char, char const*> has some flexibility which isn't offered by strtol():
Since the std::locale's facets are used, which can be overridden to parse arbitrary representation formats, e.g., Roman numerals, it is more flexible with respect to input formats.
It is easy to set up use of thousands separators or change the representation of the decimal point (just change std::numpunct<char> in std::locale used by fmt to set these up).
The buffer doesn't have to be null-terminated. For example, a contiguous sequence of characters made up of 8-digit values can be parsed by passing it and it+8 as the range when calling std::num_get<char, char const*>::get().
However, strtol() is probably a good approach for most uses. On the other hand, the above provides an alternative which may be useful in some contexts.
In C/C++, strings are NULL terminated.
Could I use stringstream as a memory stream like MemoryStream in C#?
Data of memory streams may have \0 values in the middle of data, but C++ strings are NULL terminated.
When storing character sequences in a std::string you can include null characters. Correspondingly, a std::stringstream can deal with embedded null characters as well. However, the various formatted operations on streams won't pass through the null characters. Also, when assigning values to a std::string from a built-in character array, the null characters will matter, i.e., you'd need to use the various overloads taking the size of the character sequence as an argument.
What exactly are you trying to achieve? There may be an easier approach than traveling through string streams. For example, if you want to use the stream interface to interact with a memory buffer, a custom stream buffer is really easy to write and set up:
struct membuf
    : std::streambuf
{
    membuf(char* base, std::size_t size) {
        this->setp(base, base + size);
        this->setg(base, base, base + size);
    }
    std::size_t written() const { return this->pptr() - this->pbase(); }
    std::size_t read() const { return this->gptr() - this->eback(); }
};

int main() {
    // obtain a buffer starting at base with size size
    membuf sbuf(base, size);
    std::ostream out(&sbuf);
    out.write("1\08\09\0", 6); // write three digits and three null chars
}
In C/C++, strings are NULL terminated.
Two completely different languages (that have some commonality in syntax). But C is about as close to C++ as Java is to C# (in that they are different).
Each language has their own string features.
C uses a sequence of bytes that by convention is terminated by a '\0' byte.
Commonly referred to as a C-String.
There is nothing special about this area of memory it is just a sequence of bytes.
There is no enforcement of the convention and no standard way to build one.
It is a huge area of bugs in C programs.
C++ uses a class (with methods) std::string.
Could I use stringstream as a memory stream like MemoryStream in C#?
Sure. The Memory stream is a stream that is backed by memory (rather than a file). This is exactly what std::stringstream is. There are some differences in the interface but these are minor and use of the documentation should easily resolve any confusion.
Data of memory streams may have \0 values in the middle of data, but C++ strings are NULL terminated.
This is totally incorrect.
C-strings are '\0' terminated.
A C++ std::string is not null terminated and can contain any character.
C and C++ are two different languages. In C, a contiguous sequence of characters terminated by '\0' is a string, and you can treat it as a block of memory whose values you can set and get using memcpy(), memset(), and memcmp().
In C++, a string is a class that manages its character data for you, so you can't simply treat it as a raw sequence of memory locations of char type.