I have a function to read the value of one variable (integer, double, or boolean) on a single line in an ifstream:
template <typename Type>
void readFromFile (ifstream &in, Type &val)
{
string str;
getline (in, str);
stringstream ss(str);
ss >> val;
}
However, it fails on text files created with editors inserting a BOM (byte order mark) at the beginning of the first line, which unfortunately includes {Note,Word}pad. How can I modify this function to ignore the byte-order mark if present at the beginning of str?
(I'm assuming you're on Windows, since using U+FEFF as a signature in UTF-8 files is mostly a Windows thing and should simply be avoided elsewhere)
You could open the file as a UTF-8 file and then check to see if the first character is U+FEFF. You can do this by opening a normal char-based fstream and then using wbuffer_convert to treat it as a series of code units in another encoding. VS2010 doesn't yet have great support for char32_t, so the following uses UTF-16 in wchar_t.
std::fstream fs(filename);
std::wbuffer_convert<std::codecvt_utf8_utf16<wchar_t>,wchar_t> wb(fs.rdbuf());
std::wistream is(&wb);
// if you don't do this on the stack remember to destroy the objects in reverse order of creation. is, then wb, then fs.
std::wistream::int_type ch = is.get();
const std::wistream::int_type ZERO_WIDTH_NO_BREAK_SPACE = 0xFEFF;
if(ZERO_WIDTH_NO_BREAK_SPACE != ch)
is.putback(ch);
// now the stream can be passed around and used without worrying about the extra character in the stream.
int i;
readFromStream<int>(is,i);
Remember that this should be done on the file stream as a whole, not inside readFromFile on your stringstream, because ignoring U+FEFF should only be done if it's the very first character in the whole file, if at all. It shouldn't be done anywhere else.
On the other hand, if you're happy using a char-based stream and just want to skip U+FEFF if present, then James Kanze's suggestion seems good, so here's an implementation:
std::fstream fs(filename);
char a,b,c;
a = fs.get();
b = fs.get();
c = fs.get();
if (a != (char)0xEF || b != (char)0xBB || c != (char)0xBF) {
fs.clear(); // in case the file held fewer than three bytes
fs.seekg(0);
} else {
std::cerr << "Warning: file contains the so-called 'UTF-8 signature'\n";
}
Additionally, if you want to use wchar_t internally, the codecvt_utf8_utf16 and codecvt_utf8 facets have a mode that can consume 'BOMs' for you. The only problem is that wchar_t is widely recognized to be worthless these days* and so you probably shouldn't do this.
std::wifstream fin(filename);
fin.imbue(std::locale(fin.getloc(), new std::codecvt_utf8_utf16<wchar_t, 0x10FFFF, std::consume_header>));
* wchar_t is worthless because it is specified to do just one thing: provide a fixed-size data type that can represent any code point in a locale's character repertoire. It does not provide a common representation between locales (i.e., the same wchar_t value can be different characters in different locales), so you cannot necessarily convert to wchar_t, switch to another locale, and then convert back to char in order to do iconv-like encoding conversions.
The fixed-size representation itself is worthless for two reasons: first, many code points have semantic meanings, so understanding text means you have to process multiple code points anyway; second, some platforms such as Windows use UTF-16 as the wchar_t encoding, which means a single wchar_t isn't even necessarily a code point value. (Whether using UTF-16 this way is even conformant to the standard is ambiguous. The standard requires that every character supported by a locale be representable as a single wchar_t value; if no locale supports any character outside the BMP, then UTF-16 could be seen as conformant.)
You have to start by reading the first byte or two of the stream, and
deciding whether it is part of a BOM or not. It's a bit of a pain,
since you can only putback a single byte, whereas you typically will
want to read four. The simplest solution is to open the file, read the
initial bytes, memorize how many you need to skip, then seek back to the
beginning and skip them.
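A minimal sketch of that approach (the function name is mine; it recognizes the UTF-8, UTF-16, and UTF-32 marks and assumes the stream was opened in binary mode):
#include <fstream>

// Returns how many initial bytes belong to a BOM (0 if there is none).
std::streamsize bom_length(std::istream& in)
{
    unsigned char b[4] = { 0, 0, 0, 0 };
    in.read(reinterpret_cast<char*>(b), 4);
    in.clear(); // the file may hold fewer than four bytes
    if (b[0] == 0xFF && b[1] == 0xFE && b[2] == 0x00 && b[3] == 0x00) return 4; // UTF-32LE
    if (b[0] == 0x00 && b[1] == 0x00 && b[2] == 0xFE && b[3] == 0xFF) return 4; // UTF-32BE
    if (b[0] == 0xEF && b[1] == 0xBB && b[2] == 0xBF) return 3;                 // UTF-8
    if (b[0] == 0xFF && b[1] == 0xFE) return 2;                                 // UTF-16LE
    if (b[0] == 0xFE && b[1] == 0xFF) return 2;                                 // UTF-16BE
    return 0;
}

// usage: std::ifstream in(filename, std::ios::binary);
//        in.seekg(bom_length(in)); // seek to just past the BOM, if any
Note that the UTF-32LE check must come before the UTF-16LE check, since FF FE is a prefix of FF FE 00 00.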
As a not-so-clean solution, I solved it by removing non-printing characters:
// true for anything outside printable ASCII (despite the name, this filters
// all non-printing and non-ASCII bytes, not just non-alphanumerics)
bool isNotAlnum(unsigned char c)
{
    return (c < ' ' || c > '~');
}
...
str.erase(remove_if(str.begin(), str.end(), isNotAlnum), str.end());
Here's a simple C++ function to skip the BOM on an input stream on Windows. This assumes byte-sized data, as in UTF-8:
// skip BOM for UTF-8 on Windows
void skip_bom(auto& fs) {
    const unsigned char boms[]{ 0xef, 0xbb, 0xbf };
    bool have_bom{ true };
    for(const auto& c : boms) {
        if((unsigned char)fs.get() != c) have_bom = false;
    }
    if(!have_bom) {
        fs.clear(); // reset eof/fail state in case the file held fewer than three bytes
        fs.seekg(0);
    }
}
It simply checks the first three bytes for the UTF-8 BOM signature, and skips them if they all match. There's no harm if there's no BOM.
Edit: This works with a file stream, but not with cin. I found it did work with cin on Linux with GCC 11, but that's clearly not portable. See @Dúthomhas's comment below.
Related
I want to save the content of a file to a struct. I've tried to use seekg and read to fill it, but it isn't working.
My file is something like:
johnmayer24ericclapton32
I want to store the name, the last name, and the age in a struct like this:
typedef struct test_struct{
string name;
string last_name;
int age;
} test_struct;
Here is my code
int main(){
test_struct ts;
ifstream data_base;
data_base.open("test_file.txt");
data_base.seekg(0, ios_base::beg);
data_base.read(ts, sizeof(test_struct));
data_base.close();
return 0;
}
It doesn't compile, as it won't let me use ts in the read function. Is there another way - or any way - of doing it?
Serialization/Deserialization of strings is tricky.
As binary data the convention is to output the length of the string first, then the string data.
https://isocpp.org/wiki/faq/serialization#serialize-binary-format
String data is tricky because you have to unambiguously know when the string’s body stops. You can’t unambiguously terminate all strings with a '\0' if some string might contain that character; recall that std::string can store '\0'. The easiest solution is to write the integer length just before the string data. Make sure the integer length is written in “network format” to avoid sizeof and endian problems (see the solutions in earlier bullets).
That way when reading the data back in you know the length of the string to expect and can preallocate the size of the string then just read that much data from the stream.
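A minimal sketch of that convention (assuming a fixed-width std::uint32_t length; as the FAQ says, production code should also pin down the byte order, e.g. with htonl):
#include <cstdint>
#include <istream>
#include <ostream>
#include <string>

void write_string(std::ostream& out, const std::string& s)
{
    const std::uint32_t len = static_cast<std::uint32_t>(s.size());
    out.write(reinterpret_cast<const char*>(&len), sizeof len); // length first...
    out.write(s.data(), len);                                   // ...then the raw bytes
}

void read_string(std::istream& in, std::string& s)
{
    std::uint32_t len = 0;
    in.read(reinterpret_cast<char*>(&len), sizeof len);
    s.resize(len);                 // preallocate, then read exactly that many bytes
    if (len != 0) in.read(&s[0], len);
}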
If your data is a non-binary (text) format it's a little trickier:
https://isocpp.org/wiki/faq/serialization#serialize-text-format
String data is tricky because you have to unambiguously know when the string’s body stops. You can’t unambiguously terminate all strings with a '\n' or '"' or even '\0' if some string might contain those characters. You might want to use C++ source-code escape-sequences, e.g., writing '\' followed by 'n' when you see a newline, etc. After this transformation, you can either make strings go until end-of-line (meaning they are deliminated by '\n') or you can delimit them with '"'.
If you use C++-like escape-sequences for your string data, be sure to always use the same number of hex digits after '\x' and '\u'. I typically use 2 and 4 digits respectively. Reason: if you write a smaller number of hex digits, e.g., if you simply use stream << "\x" << hex << unsigned(theChar), you’ll get errors when the next character in the string happens to be a hex digit. E.g., if the string contains '\xF' followed by 'A', you should write "\x0FA", not "\xFA".
If you don’t use some sort of escape sequence for characters like '\n', be careful that the operating system doesn’t mess up your string data. In particular, if you open a std::fstream without std::ios::binary, some operating systems translate end-of-line characters.
Another approach for string data is to prefix the string’s data with an integer length, e.g., to write "now is the time" as 15:now is the time. Note that this can make it hard for people to read/write the file, since the value just after that might not have a visible separator, but you still might find it useful.
Text-based serialization/deserialization convention varies but one field per line is an accepted practice.
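For instance, a sketch of one-field-per-line (de)serialization for the test_struct from the question (assuming the names themselves never contain a newline):
#include <istream>
#include <ostream>
#include <string>

void save(std::ostream& out, const test_struct& ts)
{
    out << ts.name << '\n' << ts.last_name << '\n' << ts.age << '\n';
}

bool load(std::istream& in, test_struct& ts)
{
    return std::getline(in, ts.name)
        && std::getline(in, ts.last_name)
        && (in >> ts.age)
        && in.ignore(); // consume the newline after the age
}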
You'll have to develop a specific algorithm, since there is no separator character between the "fields".
static const std::string input_text = "johnmayer24ericclapton32";
static const std::string alphabet = "abcdefghijklmnopqrstuvwxyz";
static const std::string decimal_digit = "0123456789";
std::string::size_type position = 0;
std::string artist_name;
position = input_text.find_first_not_of(alphabet);
if (position != std::string::npos)
{
artist_name = input_text.substr(0, position); // position is the index of the first non-letter, i.e. the name's length
}
else
{
std::cerr << "Artist name not found.";
return EXIT_FAILURE;
}
Similarly, you can extract the number, then use std::stoi to convert the numeric string to an internal representation.
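For instance, continuing the sketch above (digits_end and age_text are my names; error handling elided):
std::string::size_type digits_end =
    input_text.find_first_not_of(decimal_digit, position);
// npos simply means the digits run to the end of the string
std::string age_text = input_text.substr(position, digits_end - position);
int age = std::stoi(age_text); // throws std::invalid_argument if no digits were found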
Edit 1: Splitting the name
Since there is no separator character between the first and last name, you may want to have a list of possible first names and use that to find out where the first name ends and the surname starts.
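A sketch of that idea (first_names is a hypothetical lookup table you would have to supply; the function returns the length of the longest known first name that prefixes the combined name, or 0 if none matches):
#include <set>
#include <string>

std::string::size_type split_point(const std::string& full_name,
                                   const std::set<std::string>& first_names)
{
    std::string::size_type best = 0;
    for (std::set<std::string>::const_iterator it = first_names.begin();
         it != first_names.end(); ++it)
    {
        if (it->size() > best && full_name.compare(0, it->size(), *it) == 0)
            best = it->size(); // longest matching known first name so far
    }
    return best;
}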
I have a problem with a std::string comparison that I think is encoding-related. I have to compare a received string, whose encoding I don't know, against a Spanish string with unusual characters. I can't change s_area.m_s_area_text, so I need to give s2 an identical value, and I don't know how to do it in a generic way that would work for other cases.
std::string s2= "Versión de sistema";
std::cout << s_area.m_s_area_text << std::endl;
for (const char* p = s2.c_str(); *p; ++p)
{
printf("%02x", *p);
}
printf("\n");
for (const char* p = s_area.m_s_area_text.c_str(); *p; ++p)
{
printf("%02x", *p);
}
printf("\n");
And the result of the execution is:
Versi├│n de sistema
5665727369fffffff36e2064652073697374656d61
5665727369ffffffc3ffffffb36e2064652073697374656d61
Obviously, as the two strings do not have the same byte values, all the comparison methods fail: strncmp, std::string's operator==, std::string::compare, etc.
Any idea how to do this without touching the s_area.m_s_area_text string?
In general it is impossible to guess the encoding of a string by inspecting its raw bytes. The exception to this rule is when a byte order mark (BOM) is present at the start of the byte stream. The BOM will tell you which Unicode encoding the bytes use, and the endianness.
As an aside: if at some point in the future you decide you need a canonical string encoding (as some have pointed out in the comments, it would be a good idea), there are strong arguments in favour of UTF-8 as the best choice for C++. See UTF-8 Everywhere for further information on this.
First of all, to compare two strings correctly you at least need to know their encoding. In your example, s_area.m_s_area_text happens to be encoded in UTF-8, while s2 uses ISO/IEC 8859-1 (Latin-1).
If you are sure that s_area.m_s_area_text will always be encoded in UTF-8, you can try to make s2 use the same encoding and then just compare them. One way of defining a UTF-8 encoded string is to escape every character that is not in the basic character set with \u.
std::string s2 = u8"Versi\u00F3n de sistema";
...
if (s_area.m_s_area_text == s2)
...
It should also be possible to do it without escaping the characters by setting an appropriate encoding for the source file and specifying the encoding to the compiler.
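For example, with GCC the relevant switches are -finput-charset and -fexec-charset (other compilers use different flags):
// main.cpp saved as ISO-8859-1; tell GCC to read it as Latin-1
// and store the literal as UTF-8 in the binary:
//   g++ -finput-charset=ISO-8859-1 -fexec-charset=UTF-8 main.cpp
std::string s2 = "Versión de sistema"; // UTF-8 bytes at run time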
As #nwp mentioned, you may also want to normalise the strings before comparing. Otherwise, two strings that look the same may have different Unicode representation and that will cause your comparison to yield a false negative result.
For example, "Versión de sistema" will not be equal to "Versión de sistema" (one may use the precomposed character U+00F3, the other an 'o' followed by the combining acute accent U+0301).
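A sketch of such a comparison using ICU (assuming it is available; equal_normalized and the s1/s2 parameters are my names):
#include <string>
#include <unicode/normalizer2.h>
#include <unicode/unistr.h>

// Compare two UTF-8 strings after normalizing both to NFC.
bool equal_normalized(const std::string& s1, const std::string& s2)
{
    UErrorCode status = U_ZERO_ERROR;
    const icu::Normalizer2* nfc = icu::Normalizer2::getNFCInstance(status);
    if (U_FAILURE(status)) return false;
    icu::UnicodeString a = icu::UnicodeString::fromUTF8(s1);
    icu::UnicodeString b = icu::UnicodeString::fromUTF8(s2);
    return nfc->normalize(a, status) == nfc->normalize(b, status)
        && U_SUCCESS(status);
}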
I have a char iterator - an std::istreambuf_iterator<char> wrapped in a couple of adaptors - yielding UTF-8 bytes. I want to read a single UTF-32 character (a char32_t) from it. Can I do so using the STL? How?
There's std::codecvt_utf8<char32_t>, but that apparently only works on char*, not arbitrary iterators.
Here's a simplified version of my code:
#include <iostream>
#include <sstream>
#include <iterator>
// in the real code some boost adaptors etc. are involved
// but the important point is: we're dealing with a char iterator.
typedef std::istreambuf_iterator< char > iterator;
char32_t read_code_point( iterator& it, const iterator& end )
{
// how do I do this conversion?
// codecvt_utf8<char32_t>::in() only works on char*
return U'\0';
}
int main()
{
// actual code uses std::istream so it works on strings, files etc.
// but that's irrelevant for the question
std::stringstream stream( u8"\u00FF" );
iterator it( stream );
iterator end;
char32_t c = read_code_point( it, end );
std::cout << std::boolalpha << ( c == U'\u00FF' ) << std::endl;
return 0;
}
I am aware that Boost.Regex has an iterator for this, but I'd like to avoid boost libraries that are not header-only and this feels like something the STL should be capable of.
I don't think you can do this directly with codecvt_utf8 or any other standard library components. To use codecvt_utf8 you'd need to copy bytes from the iterator stream into a buffer and convert the buffer.
Something like this should work:
char32_t read_code_point( iterator& it, const iterator& end )
{
char32_t result;
char32_t* resend = &result + 1;
char32_t* resnext = &result;
char buf[7]; // room for 3-byte UTF-8 BOM and a 4-byte UTF-8 character
char* bufpos = buf;
const char* const bufend = std::end(buf);
std::codecvt_utf8<char32_t> cvt;
while (bufpos != bufend && it != end)
{
*bufpos++ = *it++;
std::mbstate_t st{};
const char* be = bufpos;
const char* bn = buf;
auto conv = cvt.in(st, buf, be, bn, &result, resend, resnext);
if (conv == std::codecvt_base::error)
throw std::runtime_error("Invalid UTF-8 sequence");
if (conv == std::codecvt_base::ok && bn == be)
return result;
// otherwise read another byte and try again
}
if (it == end)
throw std::runtime_error("Incomplete UTF-8 sequence");
throw std::runtime_error("No character read from first seven bytes");
}
This appears to do more work than necessary, re-scanning the whole UTF-8 sequence in [buf, bufpos) on every iteration (and making a virtual function call to codecvt_utf8::do_in). In theory the codecvt_utf8::in implementation could read an incomplete multibyte sequence and store state information in the mbstate_t argument, so that the next call would resume from where the last one left off, only consuming new bytes, not re-processing the incomplete multibyte sequence that was already seen.
However, implementations are not required to use the mbstate_t argument to store state between calls and in practice at least one implementation of codecvt_utf8::in (the one I wrote for GCC) doesn't use it at all. From my experiments it seems that the libc++ implementation doesn't use it either. This means that they stop converting before an incomplete multibyte sequence, and leave the from_next pointer (the bn argument here) pointing to the beginning of that incomplete sequence, so that the next call should start from that position and (hopefully) provide enough additional bytes to complete the sequence and allow a complete Unicode character to be read and converted to char32_t. Because you are only trying to read a single codepoint, this means it does no conversion at all, because stopping before an incomplete multibyte sequence means stopping at the first byte.
It's possible that some implementations do use the mbstate_t argument, so you could modify the function above to handle that case as well, but to be portable it would still need to cope with implementations that ignore the mbstate_t. Supporting both types of implementation would complicate the function considerably, so I kept it simple and wrote a form that should work with all implementations, even if they do actually use the mbstate_t. Because you are only going to be reading up to 7 bytes at a time (in the worst case ... the average case may be only one or two bytes, depending on the input text) the cost of re-scanning the first few bytes every time shouldn't be huge.
To get better performance from codecvt_utf8 you should avoid converting one codepoint at a time, because it's designed for converting arrays of characters not individual ones. Since you always need to copy to a char buffer anyway you could copy larger chunks from the input iterator sequence and convert whole chunks. This would reduce the likelihood of seeing incomplete multibyte sequences, since only the last 1-3 bytes at the end of a chunk would need to be re-processed if the chunk ends in an incomplete sequence, everything earlier in the chunk would have been converted.
To get better performance reading single codepoints you should probably avoid codecvt_utf8 entirely and either roll your own (if you only need UTF-8 to UTF-32BE it's not so hard) or use a third-party library such as ICU.
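For the roll-your-own route, here is a minimal UTF-8 to char32_t decoder sketch (using the same iterator typedef as the question; it rejects malformed lead and continuation bytes but, for brevity, does not reject overlong encodings or surrogate code points):
// needs <stdexcept> for std::runtime_error
char32_t read_code_point( iterator& it, const iterator& end )
{
    if (it == end) throw std::runtime_error("empty input");
    const unsigned char b0 = static_cast<unsigned char>(*it++);
    char32_t cp;
    int continuation_bytes;
    if      (b0 < 0x80)         { return b0; }                          // ASCII fast path
    else if ((b0 >> 5) == 0x06) { cp = b0 & 0x1F; continuation_bytes = 1; } // 110xxxxx
    else if ((b0 >> 4) == 0x0E) { cp = b0 & 0x0F; continuation_bytes = 2; } // 1110xxxx
    else if ((b0 >> 3) == 0x1E) { cp = b0 & 0x07; continuation_bytes = 3; } // 11110xxx
    else throw std::runtime_error("invalid UTF-8 lead byte");
    while (continuation_bytes--) {
        if (it == end) throw std::runtime_error("incomplete UTF-8 sequence");
        const unsigned char b = static_cast<unsigned char>(*it++);
        if ((b >> 6) != 0x02) throw std::runtime_error("invalid UTF-8 continuation byte"); // 10xxxxxx
        cp = (cp << 6) | (b & 0x3F);
    }
    return cp;
}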
I'm maintaining a large open source project, so I'm running into an odd fringe case on the I/O front.
When my app parses a user parameter file containing a line of text like the following:
CH3 CH2 CH2 CH2 −68.189775 2 180.0 ! TraPPE 1
...at first it looks innocent because it is formatted as desired. But then I see the minus is a Unicode character (−, U+2212 MINUS SIGN) rather than the ASCII hyphen-minus (-).
I'm just using STL's >> with the ifstream object.
When it attempts to convert the token to a negative number and fails on the Unicode character, the STL apparently just sets the stream's internal failure flag, which was triggering my logic that stops the reading process. This is sort of good, as without that logic I would have had an even harder time tracking it down.
But it's definitely not my desired error handling. I want to catch common minus-like characters when reading a double with >>, replace them, and complete the conversion if the string is otherwise a properly formatted negative number.
This appears to be happening to my users relatively frequently as they're copying and pasting from programs (calculator or Excel perhaps in Windows?) to get their file values.
I was somewhat surprised not to find this problem on Stack Overflow, as it seems pretty ubiquitous. I found some reference to it in this question:
c++ error cannot be used as a function, some stray error [closed]
...but that was a slightly different problem, in which the source code itself contained a similar but incompatible "minus-like" character (an EN DASH).
Does anyone have a good solution (preferably compact, portable, and reusable) for catching such bad minuses when reading doubles or signed integers?
Note:
I don't want to use Boost or C++11 because, believe it or not, some of my users on certain supercomputers don't have access to those. I'm trying to keep it as portable as possible.
Maybe a custom std::num_get is for you. Other character-to-value aspects can be overridden as well.
#include <iostream>
#include <locale>
#include <sstream>
#include <string>
class num_get : public std::num_get<wchar_t>
{
public:
iter_type do_get( iter_type begin, iter_type end, std::ios_base & str,
std::ios_base::iostate & error, float & value ) const
{
bool neg=false;
if(begin != end && *begin == 8722) { // 8722 == 0x2212, U+2212 MINUS SIGN
begin++;
neg=true;
}
iter_type i = std::num_get<wchar_t>::do_get(begin, end, str, error, value);
if (!(error & std::ios_base::failbit))
{
if(neg)
value=-value;
}
return i;
}
};
int main(int argc,char ** argv) {
std::locale new_locale(std::cin.getloc(), new num_get);
// Parsing wchar_t streams makes life easier, but in principle
// it should work with char (e.g. UTF-8) as well
static const std::wstring ws(L"CH3 CH2 CH2 CH2 −68.189775 2 180.0 ! TraPPE 1");
std::basic_stringstream<wchar_t> wss(ws);
std::wstring a;
std::wstring b;
std::wstring c;
float f=0;
// Imbue this new locale into wss
wss.imbue(new_locale);
for(int i=0;i<4;i++) {
std::wstring s;
wss >> s >> std::ws;
std::wcerr << s << std::endl;
}
wss >> f;
std::wcerr << f << std::endl;
}
Not gonna happen except manually. There are many characters in Unicode, there's an Em Dash as well as an En Dash, and most likely quite a few more. For example, did you consider the possibility of an Em Dash and then a non-breaking-space and then some numbers? Or an RTL override? Unicode is legend because the possibilities are nearly endless, and double-legend in C++ because the Standard support for it could be charitably described as ISIS's support for sanity.
The only real way to do this is to find each situation as your users report it, and handle it manually- i.e., do not use operator>> for double.
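If you do go the manual route, one C++98-friendly sketch (the names are mine; it only handles the U+2212 case from the question, whose UTF-8 encoding is the bytes 0xE2 0x88 0x92) is to read the token as a string, replace the Unicode minus with an ASCII '-', and then convert:
#include <cstdlib>
#include <istream>
#include <string>

// Read a double, tolerating U+2212 MINUS SIGN in UTF-8 input.
bool read_double(std::istream& in, double& value)
{
    std::string tok;
    if (!(in >> tok)) return false;
    std::string::size_type p;
    while ((p = tok.find("\xE2\x88\x92")) != std::string::npos)
        tok.replace(p, 3, "-");   // swap the 3-byte UTF-8 minus for ASCII '-'
    char* endp = 0;
    value = std::strtod(tok.c_str(), &endp);
    // succeed only if the whole token parsed as a number
    return endp != tok.c_str() && *endp == '\0';
}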
I'd like to transcode character encoding on-the-fly. I'd like to use iostreams and my own transcoding streambuf, e.g.:
xcoder_streambuf xbuf( "UTF-8", "ISO-8859-1", cout.rdbuf() );
cout.rdbuf( &xbuf );
char *utf8_s; // pointer to buffer containing UTF-8 encoded characters
// ...
cout << utf8_s; // characters are written in ISO-8859-1
The implementation of xcoder_streambuf would use ICU's converters API. It would take the data coming in (in this case, from utf8_s), transcode it, and write it out using the iostream's original steambuf.
Is that a reasonable way to go? If not, what would be better?
Is that a reasonable way to go?
Yes, but it is not the way you are expected to do it in modern (as in 1997) iostream.
The behaviour of outputting through basic_streambuf<> is defined by the overflow(int_type c) virtual function.
The description of basic_filebuf<>::overflow(int_type c = traits::eof()) includes a_codecvt.out(state, b, p, end, xbuf, xbuf+XSIZE, xbuf_end); where a_codecvt is defined as:
const codecvt<charT,char,typename traits::state_type>& a_codecvt
= use_facet<codecvt<charT,char,typename traits::state_type> >(getloc());
so you are expected to imbue a locale with the appropriate codecvt<charT,char,typename traits::state_type> converter.
The class codecvt<internT,externT,stateT> is for use when converting from one character encoding to another, such as from wide characters to multibyte characters or between wide character encodings such as Unicode and EUC.
The standard library support for Unicode has made some progress since 1997:
the specialization codecvt<char32_t, char, mbstate_t> converts between the UTF-32 and UTF-8 encoding schemes.
This seems to be what you want (ISO-8859-1 code points coincide with the first 256 UCS-4/UTF-32 code points).
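As a sketch of that mechanism (the locale name is platform-dependent, and this writes through a wide stream rather than transcoding char to char):
#include <fstream>
#include <locale>

int main()
{
    std::wofstream out("latin1.txt");
    // the filebuf converts each wchar_t to bytes via the imbued
    // locale's codecvt<wchar_t, char, mbstate_t> facet
    out.imbue(std::locale("en_US.ISO-8859-1")); // locale name is platform-dependent
    out << L"Versi\u00F3n de sistema";          // written as ISO-8859-1 bytes
}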
If not, what would be better?
I would introduce a different type for UTF8, like:
struct utf8 {
unsigned char d; // d for data
};
struct latin1 {
unsigned char c; // c for character
};
This way you cannot accidentally pass UTF8 where ISO-8859-* is expected. But then you would have to write some interface code, and the type of your streams won't be istream/ostream.
Disclaimer: I never actually did such a thing, so I don't know if it is workable in practice.