UTF-8 to UTF-32 on iterators using the STL - c++

I have a char iterator - an std::istreambuf_iterator<char> wrapped in a couple of adaptors - yielding UTF-8 bytes. I want to read a single UTF-32 character (a char32_t) from it. Can I do so using the STL? How?
There's std::codecvt_utf8<char32_t>, but that apparently only works on char*, not arbitrary iterators.
Here's a simplified version of my code:
#include <iostream>
#include <sstream>
#include <iterator>

// in the real code some boost adaptors etc. are involved
// but the important point is: we're dealing with a char iterator.
typedef std::istreambuf_iterator< char > iterator;

char32_t read_code_point( iterator& it, const iterator& end )
{
    // how do I do this conversion?
    // codecvt_utf8<char32_t>::in() only works on char*
    return U'\0';
}

int main()
{
    // actual code uses std::istream so it works on strings, files etc.
    // but that's irrelevant for the question
    std::stringstream stream( u8"\u00FF" );
    iterator it( stream );
    iterator end;
    char32_t c = read_code_point( it, end );
    std::cout << std::boolalpha << ( c == U'\u00FF' ) << std::endl;
    return 0;
}
I am aware that Boost.Regex has an iterator for this, but I'd like to avoid boost libraries that are not header-only and this feels like something the STL should be capable of.

I don't think you can do this directly with codecvt_utf8 or any other standard library components. To use codecvt_utf8 you'd need to copy bytes from the iterator stream into a buffer and convert the buffer.
Something like this should work:
char32_t read_code_point( iterator& it, const iterator& end )
{
    // in addition to the question's headers, this needs <codecvt> (std::codecvt_utf8),
    // <stdexcept> (std::runtime_error) and <cwchar> (std::mbstate_t)
    char32_t result;
    char32_t* resend = &result + 1;
    char32_t* resnext = &result;
    char buf[7]; // room for 3-byte UTF-8 BOM and a 4-byte UTF-8 character
    char* bufpos = buf;
    const char* const bufend = std::end(buf);
    std::codecvt_utf8<char32_t> cvt;
    while (bufpos != bufend && it != end)
    {
        *bufpos++ = *it++;
        std::mbstate_t st{};
        const char* be = bufpos;
        const char* bn = buf;
        auto conv = cvt.in(st, buf, be, bn, &result, resend, resnext);
        if (conv == std::codecvt_base::error)
            throw std::runtime_error("Invalid UTF-8 sequence");
        if (conv == std::codecvt_base::ok && bn == be)
            return result;
        // otherwise read another byte and try again
    }
    if (it == end)
        throw std::runtime_error("Incomplete UTF-8 sequence");
    throw std::runtime_error("No character read from first seven bytes");
}
This appears to do more work than necessary, re-scanning the whole UTF-8 sequence in [buf, bufpos) on every iteration (and making a virtual function call to codecvt_utf8::do_in). In theory the codecvt_utf8::in implementation could read an incomplete multibyte sequence and store state information in the mbstate_t argument, so that the next call would resume from where the last one left off, only consuming new bytes, not re-processing the incomplete multibyte sequence that was already seen.
However, implementations are not required to use the mbstate_t argument to store state between calls and in practice at least one implementation of codecvt_utf8::in (the one I wrote for GCC) doesn't use it at all. From my experiments it seems that the libc++ implementation doesn't use it either. This means that they stop converting before an incomplete multibyte sequence, and leave the from_next pointer (the bn argument here) pointing to the beginning of that incomplete sequence, so that the next call should start from that position and (hopefully) provide enough additional bytes to complete the sequence and allow a complete Unicode character to be read and converted to char32_t. Because you are only trying to read a single codepoint, this means it does no conversion at all, because stopping before an incomplete multibyte sequence means stopping at the first byte.
It's possible that some implementations do use the mbstate_t argument, so you could modify the function above to handle that case as well, but to be portable it would still need to cope with implementations that ignore the mbstate_t. Supporting both types of implementation would complicate the function considerably, so I kept it simple and wrote a form that should work with all implementations, even if they do actually use the mbstate_t. Because you are only going to be reading up to 7 bytes at a time (in the worst case ... the average case may be only one or two bytes, depending on the input text) the cost of re-scanning the first few bytes every time shouldn't be huge.
To get better performance from codecvt_utf8 you should avoid converting one codepoint at a time, because it's designed for converting arrays of characters not individual ones. Since you always need to copy to a char buffer anyway you could copy larger chunks from the input iterator sequence and convert whole chunks. This would reduce the likelihood of seeing incomplete multibyte sequences, since only the last 1-3 bytes at the end of a chunk would need to be re-processed if the chunk ends in an incomplete sequence, everything earlier in the chunk would have been converted.
To get better performance reading single codepoints you should probably avoid codecvt_utf8 entirely and either roll your own (if you only need UTF-8 to UTF-32BE it's not so hard) or use a third-party library such as ICU.
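To illustrate that last option, here is a minimal hand-rolled sketch (not part of the original answer) of read_code_point that decodes one code point straight from the iterator pair. It reuses the question's iterator typedef, needs <stdexcept>, and does only basic validation - it does not reject overlong encodings or surrogate values:
char32_t read_code_point( iterator& it, const iterator& end )
{
    if (it == end)
        throw std::runtime_error("Empty input");
    const unsigned char b0 = static_cast<unsigned char>(*it++);
    int extra;   // number of continuation bytes that should follow
    char32_t cp; // accumulated code point value
    if (b0 < 0x80)                { extra = 0; cp = b0; }
    else if ((b0 & 0xE0) == 0xC0) { extra = 1; cp = b0 & 0x1F; }
    else if ((b0 & 0xF0) == 0xE0) { extra = 2; cp = b0 & 0x0F; }
    else if ((b0 & 0xF8) == 0xF0) { extra = 3; cp = b0 & 0x07; }
    else
        throw std::runtime_error("Invalid UTF-8 lead byte");
    for (int i = 0; i < extra; ++i) {
        if (it == end)
            throw std::runtime_error("Incomplete UTF-8 sequence");
        const unsigned char b = static_cast<unsigned char>(*it++);
        if ((b & 0xC0) != 0x80)
            throw std::runtime_error("Invalid UTF-8 continuation byte");
        cp = (cp << 6) | (b & 0x3F);
    }
    return cp;
}
Dropped into the question's example, this returns U+00FF for the two-byte sequence 0xC3 0xBF without any codecvt machinery.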

Related

Why is the last character of the character array getting excluded?

#include<iostream>
using namespace std;

int main()
{
    int n;
    cin>>n;
    cin.ignore();
    char arr[n+1];
    cin.getline(arr,n);
    cin.ignore();
    cout<<arr;
    return 0;
}
Input:
11
of the year
Output:
of the yea
I'm already providing n+1 for the null character. Then why is the last character getting excluded?
You allocated n+1 characters for your array, but then you told getline that there were only n characters available. It should be like this:
int n;
cin>>n;
cin.ignore();
char arr[n+1];
cin.getline(arr,n+1); // change here
cin.ignore();
cout<<arr;
Per cppreference.com:
https://en.cppreference.com/w/cpp/io/basic_istream/getline
Behaves as UnformattedInputFunction. After constructing and checking the sentry object, extracts characters from *this and stores them in successive locations of the array whose first element is pointed to by s, until any of the following occurs (tested in the order shown):
end of file condition occurs in the input sequence (in which case setstate(eofbit) is executed)
the next available character c is the delimiter, as determined by Traits::eq(c, delim). The delimiter is extracted (unlike basic_istream::get()) and counted towards gcount(), but is not stored.
count-1 characters have been extracted (in which case setstate(failbit) is executed).
If the function extracts no characters (e.g. if count < 1), setstate(failbit) is executed.
In any case, if count > 0, it then stores a null character CharT() into the next successive location of the array and updates gcount().
In your case, n=11. You are allocating n+1 (12) chars, but telling getline() that only n (11) chars are available, so it reads only n-1 (10) chars into the array and then terminates the array with '\0' in the 11th char. That is why you are missing the last character.
of the year
         ^
         10th char, stops here
You need to +1 when calling getline(), to match your actual array size:
cin.getline(arr,n+1);
john's answer should fix your issue. Variable-length arrays (your char arr[n+1]) are not part of the C++ standard, for justified reasons. Yet I've taken a few hours of my time to go way out of the question's scope and create the...
Student's guide to C++ I/O
...and I/O in general, with an emphasis on the I part. Fear not, do it the C++ way! The following snippets should be compiled with a standard-conforming C++ compiler.
C++ I/O & standard library
Textual input
This is the recommended way of reading UTF-8 encoded strings in C++, the most widespread text encoding. We will use std::string for storage, which is the de-facto way for holding UTF-8 encoded strings, and std::getline for the reading itself.
#include <iostream> // std::cin, std::cout, std::ws
#include <string>   // std::string, std::getline

int main() {
    int size;
    // std::ws ignores all whitespace in the stream,
    // until the first non-whitespace character.
    // it's prettier and handles cases a simple .ignore() does not.
    std::cin >> size >> std::ws;

    std::string input;
    std::getline(std::cin, input);

    // This condition will most certainly be true (output will be 1).
    std::cout << (size == input.size()) << '\n';
}
std::string is dynamically allocated, or, as you may hear, "on the heap". This is a broad subject, so feel free to venture on your own from this starting point! How does this help us? We can store strings of sizes unknown ahead of time on the heap, because we can always reallocate a bigger buffer. std::getline allocates and reallocates as it reads the input until a newline is reached, so you can read without knowing a size beforehand. Your size variable will most probably be equal to the size of the string, under the assumption that this is a school exercise where the input length is provided because you haven't been taught dynamic memory yet. For good reason, though - it's complex and would needlessly distract from the actual subject (algorithms, data structures etc.). Good to keep in mind: std::strings, unlike C-style strings, are not null-terminated, but you can get a null-terminated C-style string from an std::string by calling its .c_str() method.
Binary data
What's binary data? Everything that's not text: images, videos, music, 2003 MS Word documents (the .doc ones, wait 'til you see what .docx is) and many others. It's customary to store binary data as raw bytes, which is a fancy way to say numbers. unsigned char is the C/C++ type used to represent these raw bytes (C++17 introduces std::byte for this purpose). To work with data from binary input we need to store it somewhere in memory - either on the stack, or on the heap. We could store the whole input at once, but binary files are considered too large for this (and, really, are - think about the size of a movie!), so we usually read them in chunks - that is, we read only a finite part at a time (say 256 bytes, that's our buffer), and we keep reading until we reach the end of the input (usually called end-of-file or, for short, EOF). As a rule of thumb, when a buffer is small and static (doesn't need to be resized, like our string above), we can store it on the stack. If either of those conditions is not met, it goes on the heap. We should note that the notions of small and large are quite context dependent - compiler, OS, hardware, runtime environment (see this thread on stack size limits and embedded systems). The buffer size you'll choose is also task-specific, so there's no hard rule here either. Let's see some code now!
#include <array>   // std::array
#include <fstream> // std::ifstream, std::ofstream

int main() {
    // We open this file in binary mode.
    // The default mode may modify the input.
    std::ifstream input{"some_image.jpg", std::ios::binary};

    // 256 is our buffer size, unsigned char is the array type.
    // This is the C++ way of `unsigned char buffer[256]`.
    std::array<unsigned char, 256> buffer;

    // istream::read wants a char*, so we cast our unsigned char buffer.
    while (input.read(reinterpret_cast<char*>(buffer.data()), buffer.size())) {
        // Buffer is filled, do something with it
    }

    // At this point, either EOF is reached or an error occurred.
    if (input.eof()) {
        // Fewer characters than the buffer's size have been read.
        // .gcount() returns the number of characters read by
        // the last operation.
        const std::streamsize chunk_size = input.gcount();
        // Do something with these characters, as in the loop.
        // Valid range to access in the buffer is [0, chunk_size).
        // chunk_size can be 0, too. In that case, there is no more data
        // to handle.
    } else {
        // Some other failure, handle error.
    }
}
This snippet is reading through a file using a small, stack-allocated buffer of 256 bytes. std::array makes usage convenient and safe with its methods - read the linked docs! If we want to use a large buffer (say, 16MB), we replace the std::array with an std::vector:
std::vector<unsigned char> buffer(1 << 24); // 1 << 24 gives 16MB in bytes
The rest is the same. You could use std::string here too, as std::string does not imply/force UTF-8 encoding of its contents. Still, it's useful to have a convention that easily differentiates between binary and text data in code.
Something to note is that reading in smaller chunks uses less space, but takes more time - taking bytes from a file involves making OS system calls and moving disks or electrons, when reading from a hard drive or an SSD, respectively. C++'s fstream objects already do buffering for you to speed up reads, which is usually a much-needed optimization. You'll know if this affects you.
Another thing to note is the EOF and error handling, using the .eof() method. We omitted error handling in the textual input retrieval, but here we are forced into doing it if we don't want to lose data. When EOF is reached, usually fewer bytes than the buffer size have been read, so we need a way to know how much of the buffer was filled with data. This is what .gcount() tells us. Depending on the program you're making, you may deem EOF "unexpected" if the buffer is only partially filled (.gcount() returns a non-0 value) - for example, when the data read is incomplete according to the format it was supposed to follow, in other words the end of file was reached before the data was supposed to end. Other than that, EOF is a condition that all files are in after being fully read.
C-style I/O
This may look closer to what's taught in school. As we've explained the general concepts above, this section will be richer in coding and explanations of code. We still use C++ as a language, so the C++ version of the C headers and the std namespace will be used - to have the code that follows work in a C compiler, replace the <csomething> headers with <something.h> and remove the std:: namespace prefix from types and functions. Let's dive into it!
Textual input
The equivalent of a C++ stream (std::cin, std::fstream etc.) in C is the std::FILE. FILEs are buffered by default, as are C++ streams. We'll use std::fscanf for reading the size of the input, which is just scanf but it takes as parameter the stream you read from, and std::fgets for reading the text line.
#include <cstdio>  // std::FILE, std::fscanf, std::fgets, stdin
#include <cstring> // std::strcspn, std::strlen

// discard_whitespace does what std::ws did above.
// It consumes all whitespace before a non-whitespace
// character from stream f.
void discard_whitespace(std::FILE* f) {
    // A format string consisting of a single whitespace directive tells
    // fscanf to consume whitespace and stop at the first non-whitespace
    // character, leaving that character in the stream.
    std::fscanf(f, " ");
}

int main() {
    int size;
    // stdin is a macro, doesn't have a namespace,
    // hence no std:: prefix.
    std::fscanf(stdin, "%d", &size);
    // fscanf with %d, like std::cin >>, leaves the trailing newline behind
    discard_whitespace(stdin);

    // Your school exercise will probably have a size limit for the input.
    // We consider it to be 256.
    const int SIZE_UPPER_BOUND = 256;
    // We add some extra bytes so the maximum length input can be accommodated.
    // 1 is added for the null terminator of C-style strings.
    // The other 2 is because `fgets` will also read the newline,
    // which can be \n or \r\n, depending on OS. See explanation after code.
    char input[SIZE_UPPER_BOUND + 3];
    // The actual read - sizeof gets the size of our input buffer,
    // we don't have to write it twice.
    std::fgets(input, sizeof input, stdin);
    // fgets also reads the newline, unlike `std::getline` or
    // `std::cin.getline` - we have to remove it ourselves.
    input[std::strcspn(input, "\r\n")] = '\0';

    // This condition will be true, as in the C++ example.
    std::fprintf(stdout, "%d\n", std::strlen(input) == size);
}
Let's unpack that newline removal. std::strcspn finds the position of the first occurrence of any of the given characters in the input. We provide both \r and \n, to support UNIX (\n) and Windows (\r\n) newline terminators - yeah, they're different, see Wikipedia, on "Newline". By writing the null terminator, '\0', at that position, we move the ending of the string to where the newline was, basically "removing" the newline. If this is a school assignment, we can assume the input is correct, so we could have used size (the index right after the last real character, where the newline sits) instead of std::strcspn:
input[size] = '\0';
This doesn't work when we don't know the input size or the input may be invalid.
As an optimization trick, observe that std::strcspn returns the line length, in this case. When you don't know the size, but you need it for later, you can save the result of std::strcspn in a variable before, and then use it instead of std::strlen:
// std::size_t is an unsigned integral type, used to represent
// array sizes and indexes in C/C++
const std::size_t input_size = std::strcspn(input, "\r\n");
input[input_size] = '\0';
You'll see some people use 0 or NULL for the terminator. I recommend against it - unlike the '\0' literal, which is of char type, the other two variants are implicitly converted to char. If you read the linked documentation, you'll realize NULL is even incorrect according to the spec, as it's meant to be used only in contexts that require pointers.
An alternative method to fgets is fscanf, again. Tread carefully, though - while a simple %s may do it, it makes your code vulnerable to buffer overflow exploits. See this StackOverflow thread on disadvantages of scanf, too. Let's see the (safe) code:
std::fscanf(stdin, " %256[^\r\n]", input);
That number limits the input size to our SIZE_UPPER_BOUND, and the [^\r\n] tells fscanf to read all characters up to \r or \n. The leading space in the format consumes leading whitespace (unlike %s, the %[ conversion does not skip it by itself), so with this method you can remove the discard_whitespace call. A downside to fscanf is that you have to keep the size limit in the format string and the buffer size in sync - you have no way to specify the input size dynamically other than building the format string dynamically (which is overkill for a school assignment). This is a problem in more sizable codebases, but for a one-file, one-time school assignment it's not a big deal, so you may prefer fscanf over fgets, as it's less work. fscanf doesn't read the newline into the buffer, either.
Binary data
The equivalent of C++'s std::cin.read in C world is std::fread. Code will resemble its C++ counterpart:
#include <cstdio>

int main() {
    // The second parameter is the file access mode.
    // In this case, it is read (r) binary (b).
    std::FILE* f = std::fopen("some_image.jpg", "rb");

    unsigned char buffer[256];
    std::size_t chunk_size;
    // fread returns the number of elements read; the loop stops when it returns 0.
    while ((chunk_size = std::fread(buffer, sizeof buffer[0], sizeof buffer, f)) != 0) {
        // chunk_size bytes are in the buffer (the last chunk may be partial),
        // do something with them
    }
    if (std::feof(f)) {
        // the whole file has been read
    } else {
        // an error occurred, handle it
    }
    // We need to close the file, unlike in C++, where it is closed automatically.
    std::fclose(f);
}
The arguments to std::fread are hairy: read the documentation. Everything else looks very similar to the C++ way, from the loop to the error handling. Why? Because it's literally the same thing - we're just using different (standard) libraries. Another similarity is that C I/O is also buffered by default, just like C++'s. What's different is the line at the end - the call to std::fclose. We're not doing anything similar in the C++ code, right? No. Remember that C++ classes have constructors and destructors, functions that are automatically called at the beginning and at the end of a variable's lifetime, respectively. These two allow us to implement the RAII technique, which does the resource management automatically (opening the file in the constructor, closing it in the destructor). RAII is used inside std::string and std::vector (and other containers, smart pointers and more). In other words, the destructor of std::ifstream closes the file at the end of main(), just as we are doing here, manually.
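As an aside, the same RAII idea can be wrapped around std::FILE itself. This is only a sketch (the file_closer/unique_file names are made up for illustration), using std::unique_ptr with a custom deleter so the fclose happens automatically:
#include <cstdio>
#include <memory>

// Small function-object deleter: closes the FILE when the unique_ptr is destroyed.
struct file_closer {
    void operator()(std::FILE* f) const { if (f) std::fclose(f); }
};
using unique_file = std::unique_ptr<std::FILE, file_closer>;

int main() {
    unique_file f{ std::fopen("some_image.jpg", "rb") };
    if (!f) {
        return 1; // fopen failed, nothing to close
    }
    unsigned char buffer[256];
    // Only full chunks are handled here; partial-chunk handling is omitted for brevity.
    while (std::fread(buffer, sizeof buffer[0], sizeof buffer, f.get()) == sizeof buffer) {
        // do something with the buffer
    }
    // No std::fclose call needed: the deleter runs when f goes out of scope.
}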
Hybrid approach (??)
Would you ever want to combine the two? So it seems. Let's talk drawbacks:
The C++ I/O library, due to the way it's built, takes more care to use in a performant manner compared to C's (virtual function calls and extra function calls in general, especially when using the << and >> operators and stream manipulators, since each of these is a function call, compared to a single plain function call per operation with the C library). See this StackOverflow thread on i/ostream speed, too. The C++ library is also more verbose, especially when outputting (ever heard of "chevron hell"?).
The C I/O library is easy to use improperly/unsafely, its terse, shorthand names make code difficult to follow, and output cannot be extended to support custom types (a problem when using C-style I/O in C++). Handling dynamic buffers correctly also takes great care, given that the only way of managing heap memory in C is malloc and free.
Some schools may crucify you if any trace of std::string is left in sight (or so I've heard)
Using C-style types (char[N] instead of std::array<char, N>, for example), is easier - no headers to include, as the types are builtin primitives and less to type. May be preferred in short, throwaway programs like algorithmic exercises at school.
With these in mind, we can take a look at how to conveniently combine the two when reading text and binary!
Textual input
We will take advantage of the terseness of C-style types and the ease of use of C++'s I/O library:
#include <iostream>

int main() {
    int size;
    std::cin >> size >> std::ws;

    const int SIZE_UPPER_BOUND = 256;
    char input[SIZE_UPPER_BOUND + 1];
    std::cin.getline(input, sizeof input);

    // Input done, solve the problem.
}
Teachers don't have to scratch their head at the presence of std::string and std::getline and all the standard library shenanigans you start using after diving in this rabbit hole. You, the programmer, don't have to clean up newline endings and memorize arcane format specifiers just to read a string and an int. Focus on code and solve problems without having to debug the input reading logic, ever - it just works!
Binary data
The convoluted hierarchical tree of C++'s I/O library types scares you, the clean assembly output enjoyer, just like Linus Torvalds. You're still somehow afraid to manually manage memory, so you choose this solution:
#include <cstdio>
#include <vector>

int main() {
    // The second parameter is the file access mode.
    // In this case, it is read (r) binary (b).
    std::FILE* f = std::fopen("some_image.jpg", "rb");

    std::vector<unsigned char> buffer(1 << 24);
    std::size_t chunk_size;
    while ((chunk_size = std::fread(buffer.data(), sizeof buffer[0], buffer.size(), f)) != 0) {
        // use the first chunk_size bytes of the buffer
    }
    if (std::feof(f)) {
        // handle EOF
    } else {
        // handle error
    }
    std::fclose(f);
}
Weird choice, given that you still manage the file's lifetime manually. While this may not be the best example, using C++ RAII containers together with C libraries is not uncommon - memory safety is crucial.
Trivia
as usual, weigh your decision of using namespace std;
Cool things you won't need:
speed up C++ I/O using a single line at the beginning of the program (but be careful) - a quick sketch follows this list
disable C I/O buffering
disable C++ I/O buffering
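For the curious, the "single line" above usually refers to unsynchronizing the C++ streams from the C ones; a minimal sketch of what that looks like, with the caveat that you must not mix printf/scanf with cout/cin afterwards:
#include <iostream>

int main() {
    // Stop synchronizing std::cin/std::cout with C's stdin/stdout.
    // After this, mixing C and C++ I/O on the same streams is unsafe.
    std::ios_base::sync_with_stdio(false);
    // Don't flush std::cout before every read from std::cin.
    std::cin.tie(nullptr);
    // ... the rest of the program uses only C++ streams ...
}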
Conclusion
I/O is the crowded junction of fundamental CS concepts, hardware & software inner workings and C++'s features and quirks. Take in what you can at a time & focus on what matters, and make sure you're building on sturdy fundamentals.

C++ Unicode: Bytes, Code Points and Graphemes

So, I'm building a scripting language and one of my goals is convenient string operations. I tried some ideas in C++.
Strings as sequences of bytes, plus free functions that return vectors containing the code-point indices.
A wrapper class that combines a string and a vector containing the indices.
Both ideas had a problem, and that problem was: what should I return? It couldn't be a char, and if it were a string it would waste space.
I ended up creating a wrapper class around a char array of exactly 4 bytes: a string that has exactly 4 bytes in memory, no more, no less.
After creating this class, I felt tempted to just wrap a std::vector of it in another class and build from there, thus making a string type of code points. I don't know if this is a good approach; it would end up being much more convenient but it would also waste more space.
So, before posting some code, here's a more organized list of ideas.
My character type would be neither a byte nor a grapheme, but rather a code point. I named it a rune, like the one in the Go language.
A string would be a series of decomposed runes, thus making indexing and slicing O(1).
Because a rune is now a class and not a primitive, it can be extended with methods for detecting Unicode whitespace: mystring[0].is_whitespace()
I still don't know how to handle graphemes.
Curious fact! An odd thing about the way I built the prototype of the rune class is that it always prints in UTF-8. Because my rune is not an int32 but a 4-byte string, this ends up having some interesting properties.
My code:
class rune {
    char data[4] {};

public:
    rune(char c) {
        data[0] = c;
    }

    // This constructor needs a string, a position and an offset!
    rune(std::string const & s, size_t p, size_t n) {
        for (size_t i = 0; i < n; ++i) {
            data[i] = s[p + i];
        }
    }

    void swap(rune & other) {
        rune t = *this;
        *this = other;
        other = t;
    }

    // Output as UTF8!
    friend std::ostream & operator <<(std::ostream & output, rune input) {
        for (size_t i = 0; i < 4; ++i) {
            if (input.data[i] == '\0') {
                return output;
            }
            output << input.data[i];
        }
        return output;
    }
};
Error handling ideas:
I don't like to use exceptions in C++. My idea is: if the constructor fails, initialize the rune as four '\0' bytes, then overload operator bool explicitly to return false if the first byte of the rune happens to be '\0'. Simple and easy to use.
So, thoughts? Opinions? Different approaches?
Even if my rune string is too much, at least I have a rune type. Small and fast to copy. :)
It sounds like you're trying to reinvent the wheel.
There are, of course, two ways you need to think about text:
As an array of codepoints
As an encoded array of bytes.
In some codebases, those two representations are the same (and all encodings are basically arrays of char32_t or unsigned int). In some (I'm inclined to say "most" but don't quote me on that), the encoded array of bytes will use UTF-8, where the codepoints are converted into variable-length sequences of bytes before being placed into the data structure.
And of course many codebases simply ignore unicode entirely and store their data in ASCII. I don't recommend that.
For your purposes, while it does make sense to write a class to "wrap around" your data (though I wouldn't call it a rune, I'd probably just call it a codepoint), you'll want to think about your semantics.
You can (and probably should) treat all std::strings as UTF-8 encoded strings, and prefer this as your default interface for dealing with text. It's safe for most external interfaces - the only time it will fail is when interfacing with a UTF-16 input, and you can write corner cases for that - and it'll save you the most memory, while still obeying common string conventions (it's lexicographically comparable, which is the big one).
If you need to work with your data in codepoint form, then you'll want to write a struct (or class) called codepoint, with the useful functions and constructors shown below.
While I have had to write code that handles text in codepoint form (notably for a font renderer), this is probably not how you should store your text. Storing text as codepoints leads to problems later on when you're constantly comparing against UTF-8 or ASCII encoded strings.
code:
struct codepoint {
    char32_t val;

    codepoint(char32_t _val = 0) : val(_val) {}
    codepoint(std::string const& s);
    codepoint(std::string::const_iterator begin, std::string::const_iterator end);
    //I don't know the UTF-8→codepoint conversion off-hand. There are lots of places
    //online that show how to do this

    std::string to_utf8() const;
    //Again, look up an algorithm. They're not *too* complicated.

    void append_to_string_as_utf8(std::string & s) const;
    //This might be more performant if you're trying to reduce how many dynamic memory
    //allocations you're making.

    //codepoint(std::wstring const& s);
    //std::wstring to_utf16() const;
    //void append_to_string_as_utf16(std::wstring & s) const;

    //Anything else you need, equality operator, comparison operator, etc.
};
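The answer deliberately leaves the conversions as an exercise; here is one possible sketch (not from the original answer) of the decoding constructor and to_utf8(), with only minimal validation - it does not reject overlong forms, surrogates or truncated input:
#include <string>

// Decode the first UTF-8 sequence in [begin, end) into a single code point.
codepoint::codepoint(std::string::const_iterator begin, std::string::const_iterator end)
    : val(0)
{
    if (begin == end) return; // empty range decodes to U+0000
    const unsigned char b0 = static_cast<unsigned char>(*begin++);
    const int extra = b0 < 0x80           ? 0
                    : (b0 & 0xE0) == 0xC0 ? 1
                    : (b0 & 0xF0) == 0xE0 ? 2 : 3;
    val = (extra == 0) ? b0 : static_cast<char32_t>(b0 & (0x3F >> extra));
    for (int i = 0; i < extra && begin != end; ++i)
        val = (val << 6) | (static_cast<unsigned char>(*begin++) & 0x3F);
}

// Encode the code point as 1-4 UTF-8 bytes.
std::string codepoint::to_utf8() const {
    std::string s;
    if (val < 0x80) {
        s += static_cast<char>(val);
    } else if (val < 0x800) {
        s += static_cast<char>(0xC0 | (val >> 6));
        s += static_cast<char>(0x80 | (val & 0x3F));
    } else if (val < 0x10000) {
        s += static_cast<char>(0xE0 | (val >> 12));
        s += static_cast<char>(0x80 | ((val >> 6) & 0x3F));
        s += static_cast<char>(0x80 | (val & 0x3F));
    } else {
        s += static_cast<char>(0xF0 | (val >> 18));
        s += static_cast<char>(0x80 | ((val >> 12) & 0x3F));
        s += static_cast<char>(0x80 | ((val >> 6) & 0x3F));
        s += static_cast<char>(0x80 | (val & 0x3F));
    }
    return s;
}
The codepoint(std::string const& s) constructor can simply delegate to the iterator version with s.begin() and s.end().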

How to parse a sequence of integers stored in a text buffer?

Parsing text consisting of a sequence of integers from a stream in C++ is easy enough: just decode them. When the data is received somehow and is readily available within a program, e.g., receiving a base64 encoded text (the decoding isn't the problem), the situation is a bit different. The data is sitting in a buffer within the program and only needs to be decoded, not read. Of course, a std::istringstream could be used:
std::vector<int> parse_text(char* begin, char* end) {
    std::istringstream in(std::string(begin, end));
    return std::vector<int>(std::istream_iterator<int>(in),
                            std::istream_iterator<int>());
}
Since a lot of these buffers are received and they can be fairly big, it is desirable not to copy the actual content of the character array and, ideally, to also avoid creating a stream for each buffer. Thus, the question becomes:
Given a buffer of chars containing a sequence of (space separated; dealing with other separators is easily done, e.g., using a suitable manipulator) integers, how can they be decoded without copying the sequence and, if possible, without even creating an std::istream?
Avoiding a copy of the buffer is easily done with a custom stream buffer which simply sets up the get area to use the buffer. The stream buffer actually doesn't even need to override any of the virtual functions and would just set up the internal buffer:
class imemstream
    : private virtual std::streambuf
    , public std::istream
{
public:
    imemstream(char* begin, char* end)
        : std::streambuf()
        , std::istream(static_cast<std::streambuf*>(this))
    {
        this->setg(begin, begin, end);
    }
};

std::vector<int> parse_data_via_istream(char* begin, char* end)
{
    imemstream in(begin, end);
    return std::vector<int>(std::istream_iterator<int>(in),
                            std::istream_iterator<int>());
}
This approach avoids copying the content and uses the ready-made std::istream functionality. However, it does create a stream object. With a suitable update function the stream/stream buffer can be extended to reset the buffer and process multiple buffers, as sketched below.
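A sketch of that extension (the update name is illustrative, not from the original answer) - the same stream object can then be re-pointed at a new buffer between parses:
class imemstream
    : private virtual std::streambuf
    , public std::istream
{
public:
    imemstream(char* begin, char* end)
        : std::streambuf()
        , std::istream(static_cast<std::streambuf*>(this))
    {
        this->setg(begin, begin, end);
    }
    void update(char* begin, char* end)
    {
        this->setg(begin, begin, end); // point the get area at the new buffer
        this->clear();                 // reset eofbit/failbit left over from the previous buffer
    }
};
With that, one imemstream can be constructed once and fed each incoming buffer via update() before running the std::istream_iterator<int> extraction again.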
To avoid creation of the stream, the underlying functionality from std::num_get<...> could be used. The actual parsing is done by one of the std::locale facets. The numeric parsing for std::istream is done by std::num_get<char, std::istreambuf_iterator<char>>. This facet isn't much help as it uses a sequence specified by std::istreambuf_iterator<char>s, but a std::num_get<char, char const*> facet can be instantiated. It won't be part of the default std::locale, but it is easy to create a corresponding std::locale and install it, e.g., as the global std::locale object first thing in main():
int main()
{
    std::locale::global(std::locale(std::locale(),
                                    new std::num_get<char, char const*>()));
    ...
Note that the std::locale object will clean up the added facet, i.e., there is no need to add any clean-up code: the facets are reference counted and released when the last std::locale holding a particular facet disappears. To actually use the facet it, unfortunately, needs an std::ios_base object, which can only really be obtained from some stream object. However, any stream can be used (although in a multi-threaded system it should probably be a separate stream object per thread to avoid accidental race conditions):
char const* skipspace(char const* it, char const* end)
{
    return std::find_if(it, end,
                        [](unsigned char c){ return !std::isspace(c); });
}

std::vector<int> parse_data_via_istream(std::ios_base& fmt,
                                        char const* it, char const* end)
{
    std::vector<int> rc;
    std::num_get<char, char const*> const& ng
        = std::use_facet<std::num_get<char, char const*>>(std::locale());
    std::ios_base::iostate error;
    for (long tmp;
         (it = ng.get(skipspace(it, end), end, fmt, error, tmp))
             , error == std::ios_base::goodbit; ) {
        rc.push_back(tmp);
    }
    return rc;
}
Most of this is just about a bit of error handling and skipping leading whitespace: normally, std::istream provides the facilities to automatically skip whitespace for formatted input and deals with the necessary error protocol. There is potentially a small advantage of the approach outlined above with respect to getting the facet just once per buffer and avoiding creation of a std::istream::sentry object, as well as avoiding creation of a stream. Of course, the code assumes that some stream can be passed in as its std::ios_base& subobject to provide parsing flags like the base to be used.
OK, this is quite a bit of code for something which strtol() could mostly do, too. The approach using std::num_get<char, char const*> has some flexibility which isn't offered by strtol():
Since std::locale facets are used, which can be overridden to parse arbitrary representations, e.g., Roman numerals, it is more flexible with respect to input formats.
It is easy to set up use of thousands separators or change the representation of the decimal point (just change std::numpunct<char> in std::locale used by fmt to set these up).
The buffer doesn't have to be null-terminated. For example, a contiguous sequence of characters made up of 8-digit values can be parsed by feeding it and it+8 as the range when calling std::num_get<char, char const*>::get().
However, strtol() is probably a good approach for most uses. On the other hand, the above provides an alternative which may be useful in some contexts.
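For comparison, a minimal sketch of the strtol() route (the function name is illustrative); unlike the facet-based version it requires a null-terminated buffer:
#include <cstdlib>
#include <vector>

std::vector<int> parse_data_via_strtol(char const* it)
{
    std::vector<int> rc;
    for (;;) {
        char* next = nullptr;
        long tmp = std::strtol(it, &next, 10);
        if (next == it)   // no digits consumed: end of input (or unparsable text)
            break;
        rc.push_back(static_cast<int>(tmp));
        it = next;
    }
    return rc;
}
strtol skips leading whitespace itself, so no separate skipspace step is needed, but all the locale-based flexibility described above is lost.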

Print text between pointers

I have a character range with pointers (pBegin and pEnd). I think of it as a string, but it is not \0 terminated. How can I print it to std::cout effectively?
Without creating a copy, like with std::string
Without a loop that prints each character
Do we have a good solution? If not, what is the smoothest workaround?
You can use ostream::write, which takes pointer and length arguments:
std::cout.write(pBegin, pEnd - pBegin);
Since C++17 you can use std::string_view, which was created for sharing part of std::string without copying
std::cout << std::string_view(pBegin, pEnd - pBegin);
pEnd must point to one past the last character to print, like how iterators work in C++, not to the last character itself.
See also: What is string_view? and In C++11 what is the most performant way to return a reference/pointer to a position in a std::string?
In older C++ standards boost::string_ref is an alternative. Newer boost versions also have boost::string_view with the same semantics as std::string_view. See Differences between boost::string_ref and boost::string_view
If you use Qt then there's also QStringView and QStringRef although unfortunately they're used for viewing QString which stores data in UTF-16 instead of UTF-8 or a byte-oriented encoding
However, if you need to process the string with some function that requires a null-terminated string, and without any external libraries, then there's a simple solution:
char tmpEnd = *pEnd; // backup the after-end character
*pEnd = '\0';
std::cout << pBegin; // use it as normal C-style string, like dosomething(pBegin);
*pEnd = tmpEnd; // restore the char
In this case make sure that pEnd still points to an element inside the original array and not one past the end of it

Ignore byte-order marks in C++, reading from a stream

I have a function to read the value of one variable (integer, double, or boolean) on a single line in an ifstream:
template <typename Type>
void readFromFile (ifstream &in, Type &val)
{
    string str;
    getline (in, str);
    stringstream ss(str);
    ss >> val;
}
However, it fails on text files created with editors inserting a BOM (byte order mark) at the beginning of the first line, which unfortunately includes {Note,Word}pad. How can I modify this function to ignore the byte-order mark if present at the beginning of str?
(I'm assuming you're on Windows, since using U+FEFF as a signature in UTF-8 files is mostly a Windows thing and should simply be avoided elsewhere)
You could open the file as a UTF-8 file and then check to see if the first character is U+FEFF. You can do this by opening a normal char-based fstream and then using wbuffer_convert to treat it as a series of code units in another encoding. VS2010 doesn't yet have great support for char32_t, so the following uses UTF-16 in wchar_t.
std::fstream fs(filename);
std::wbuffer_convert<std::codecvt_utf8_utf16<wchar_t>, wchar_t> wb(fs.rdbuf());
std::wistream is(&wb);
// if you don't do this on the stack remember to destroy the objects
// in reverse order of creation: is, then wb, then fs.

std::wistream::int_type ch = is.get();
const std::wistream::int_type ZERO_WIDTH_NO_BREAK_SPACE = 0xFEFF;
if (ZERO_WIDTH_NO_BREAK_SPACE != ch)
    is.putback(ch);

// now the stream can be passed around and used without worrying about
// the extra character in the stream.
int i;
readFromStream<int>(is, i);
Remember that this should be done on the file stream as a whole, not inside readFromFile on your stringstream, because ignoring U+FEFF should only be done if it's the very first character in the whole file, if at all. It shouldn't be done anywhere else.
On the other hand, if you're happy using a char-based stream and just want to skip U+FEFF if present, then James Kanze's suggestion seems good, so here's an implementation:
std::fstream fs(filename);
char a, b, c;
a = fs.get();
b = fs.get();
c = fs.get();
if (a != (char)0xEF || b != (char)0xBB || c != (char)0xBF) {
    fs.seekg(0);
} else {
    std::cerr << "Warning: file contains the so-called 'UTF-8 signature'\n";
}
Additionally if you want to use wchar_t internally the codecvt_utf8_utf16 and codecvt_utf8 facets have a mode that can consume 'BOMs' for you. The only problem is that wchar_t is widely recognized to be worthless these days* and so you probably shouldn't do this.
std::wifstream fin(filename);
fin.imbue(std::locale(fin.getloc(),
                      new std::codecvt_utf8_utf16<wchar_t, 0x10FFFF, std::consume_header>));
* wchar_t is worthless because it is specified to do just one thing; provide a fixed size data type that can represent any code point in a locale's character repertoire. It does not provide a common representation between locales (i.e., the same wchar_t value can be different characters in different locales so you cannot necessarily convert to wchar_t, switch to another locale, and then convert back to char in order to do iconv-like encoding conversions.)
The fixed-size representation itself is worthless for two reasons; first, many code points have semantic meanings and so understanding text means you have to process multiple code points anyway. Secondly, some platforms such as Windows use UTF-16 as the wchar_t encoding, which means a single wchar_t isn't even necessarily a code point value. (Whether using UTF-16 this way is even conformant to the standard is ambiguous. The standard requires that every character supported by a locale be representable as a single wchar_t value; if no locale supports any character outside the BMP then UTF-16 could be seen as conformant.)
You have to start by reading the first byte or two of the stream, and deciding whether it is part of a BOM or not. It's a bit of a pain, since you can only putback a single byte, whereas you typically will want to read four. The simplest solution is to open the file, read the initial bytes, memorize how many you need to skip, then seek back to the beginning and skip them.
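A sketch of that approach (the function name is made up; it detects the common UTF-8, UTF-16 and UTF-32 signatures and leaves the stream positioned just past whichever one it finds, or at the start if there is none):
#include <istream>

void skip_any_bom(std::istream& in)
{
    unsigned char b[4] = {};
    in.read(reinterpret_cast<char*>(b), 4);
    const std::streamsize got = in.gcount();
    in.clear(); // the file may hold fewer than 4 bytes

    std::streamoff skip = 0;
    if (got >= 4 && ((b[0] == 0x00 && b[1] == 0x00 && b[2] == 0xFE && b[3] == 0xFF) ||
                     (b[0] == 0xFF && b[1] == 0xFE && b[2] == 0x00 && b[3] == 0x00)))
        skip = 4; // UTF-32 BE / LE
    else if (got >= 3 && b[0] == 0xEF && b[1] == 0xBB && b[2] == 0xBF)
        skip = 3; // UTF-8
    else if (got >= 2 && ((b[0] == 0xFE && b[1] == 0xFF) || (b[0] == 0xFF && b[1] == 0xFE)))
        skip = 2; // UTF-16 BE / LE
    in.seekg(skip, std::ios::beg);
}
This only makes sense on a seekable stream (a file), for the same reason given in the edit at the end of this thread.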
With a not-so-clean solution, I solved it by removing non-printing chars:
bool isNotAlnum(unsigned char c)
{
    return (c < ' ' || c > '~');
}
...
str.erase(remove_if(str.begin(), str.end(), isNotAlnum), str.end());
Here's a simple C++ function to skip the BOM on an input stream on Windows. This assumes byte-sized data, as in UTF-8:
// skip BOM for UTF-8 on Windows
// (the `auto` function parameter requires C++20)
void skip_bom(auto& fs) {
    const unsigned char boms[]{ 0xef, 0xbb, 0xbf };
    bool have_bom{ true };
    for (const auto& c : boms) {
        if ((unsigned char)fs.get() != c) have_bom = false;
    }
    if (!have_bom) fs.seekg(0);
    return;
}
It simply checks the first three bytes for the UTF-8 BOM signature, and skips them if they all match. There's no harm if there's no BOM.
Edit: This works with a file stream, but not with cin, as Dúthomhas pointed out in the comments. I found it did work with cin on Linux with GCC 11, but that's clearly not portable.