ifstream position in C++

I am trying to write a simple UTF-8 decoder for my assignment. I'm fairly new to C++, so bear with me here...
I have to determine whether the encoding is valid or not, and output the value of the UTF-8 character in hexadecimal in either case. Say that I have read the first byte and used it to determine the number of bytes in this UTF-8 character. The problem is that after I read the first byte, I'm having trouble setting the ifstream position back one byte and reading the whole UTF-8 character. I've tried seekg() and putback(), but I always get a bus error or some weird output that's not my test data. Please help, thanks.
Even though I can use peek() for the first byte, I still have to read the following bytes to determine whether the encoding is valid or not. The problem of setting back the stream position is still there.

I would suggest you use peek() to read the first byte instead. seekg() should work to rewind, but a bus error is usually caused by code that violates alignment requirements, which suggests that something else in your code is wrong.

Why do you have to seek back? Can't you simply read the rest of the UTF-8 sequence, after knowing how many more octets you're expecting?

I would read the next byte directly and add it to what I already have, as Ates Goral said. It is cleaner, IMHO.
Anyway, you could move the stream pointer using seekg():
char byte = 0;
unsigned int character = 0; // reset to 0 before every use
ifstream file("test.txt", ios::binary);
file.get(byte);
// ...
file.seekg(-1, ios::cur); // cur == current position
file.get(
    reinterpret_cast<char*>(&character),
    numberOfBytesAndNullTerminator);
cout << hex << character;
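Beware that get() in the second case writes '\0' after the bytes it stores in character, so you have to pass the required number of bytes plus one for the null terminator. So, if you want to read two bytes, numberOfBytesAndNullTerminator = 3. Note also that this form of get() stops early if it meets a '\n' byte, and that a four-byte sequence plus the terminator does not fit in a typical 4-byte unsigned int.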

I don't know why you need to put the character back but istream::unget() or istream::putback() should do what you want. Look them up in your compiler's documentation.

Please look up:
ifstream::seekg()
ifstream::tellg()

Related

How to read number of characters stored in input stream buffer

I have a quick question: how can I type something in the console window into std::cin without assigning it to a string or char[]? And then how do I read the number of characters that are stored in the buffer?
Let's say that I want to create an array of char whose size is the length of the input. I might create a buffer or a big variable to store the input, then read its length, allocate memory for my char array and copy it. But let's also say that I am a purist and I don't want any additional memory (other than the stream buffer) used. Is it possible to access the std::cin buffer, read the number of characters stored and copy them to my array? I spent several hours trying to find the answer, reading the C++ reference, but I couldn't find a solution. I couldn't even find out whether it is possible to write something to the std::cin buffer without assigning it to a variable, i.e. without executing cin >> variable. I would appreciate any help, including alternative solutions to this problem.
Also, does somebody know where I can find information about how buffers work (that is, where the computer stores keyboard input, how it is processed, and how iostream works with the computer to extract data from it)?
Many thanks!
First of all, in order for the input buffer to be filled you need to do some sort of read operation. The read operation does not necessarily have to put what is read into a variable. For example, cin.peek() may block until the user enters some value and then returns the next character that will be read from the buffer without extracting it, or you could use cin.get() along with cin.putback().
You can then use the streambuf::in_avail() function to determine how many characters are in the input buffer, including the newline character. What in_avail() reports for std::cin is implementation-dependent, so check it on your platform.
With that in mind you could do something like this:
char ch;
cin.get(ch); // blocks until some data is entered
cin.putback(ch); // put back the character just read
streamsize size = cin.rdbuf()->in_avail(); // characters available for reading, including spaces and the newline
if (size > 0)
{
    char* arr = new char[size]; // add one more element if you need a null terminator
    for (streamsize i = 0; i < size; i++)
        cin.get(arr[i]); // copy each character, including spaces and the newline
    for (streamsize i = 0; i < size; i++)
        cout << arr[i]; // display the result
    delete[] arr; // release the array when done
}
That being said, I am sure you have a specific reason for doing this, but I don't think it is a good idea to do I/O like this. If you don't want to estimate the size of the character array you need for the input, you can always read the input into a std::string instead.

When we get a string from the user using gets(), where does the '\0' terminating character go?

Now when we declare a string, the last character is the null character, right?
(Please see the image of the code and its output that I have attached.)
As you can see in the attached image, I am getting the null character at the 7th position! What is happening?
According to the book I refer to (see the other attached image), a string always has an extra character associated with it at the end, called the null character, which adds to the size of the string.
But with the above code I am getting the null character at the 7th position, although according to the book I should get it at the 6th position.
Can someone explain the output, please?
Any help is really appreciated!
Thank you!
Do not use gets(), ever! It is entirely immaterial what gets() does, as it has no place in any reasonably written code. It has been removed from the C++ standard (in C++14) and also from C (in C11, so C removed it first). gets() happily overruns the buffer provided, as it doesn't even know the size of the storage provided. It has been blamed for a great many security exploits.
In the code you linked to there is such a buffer overrun. Also note that sizeof() determines the size of a variable. It does not consider its content in any shape or form: sizeof(str) will not change unless you change the type of str. If you want to determine the length of the string in that array, you'll need to use strlen(str).
If you really need to read a string into a C array using FILE* functions, you should use fgets(), which, in addition to the pointer to the storage and the stream (e.g. stdin for the default input stream), also takes the size of the array as a parameter. fgets() stores at most size - 1 characters and always null-terminates what it has read; it returns a null pointer if nothing could be read.
You declare a char array that can hold up to 5 chars; however, dummy\0 is 6 characters long, resulting in a buffer overflow.
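As a sketch of the fgets() approach, assuming a 6-element array so that dummy plus its terminator fits. The helper name readLineBounded is made up; it also shows that strlen() reports the content length while sizeof reports the array capacity.

```cpp
#include <cstdio>
#include <cstring>

// Hypothetical helper: reads at most size - 1 characters of one line into
// `buffer`, strips a trailing newline if one was stored, and returns the
// string length, or -1 if nothing could be read.
inline long readLineBounded(std::FILE* stream, char* buffer, int size) {
    if (std::fgets(buffer, size, stream) == nullptr) return -1;
    std::size_t length = std::strlen(buffer);    // content length, not capacity
    if (length > 0 && buffer[length - 1] == '\n') {
        buffer[--length] = '\0';                 // drop the stored newline
    }
    return static_cast<long>(length);
}
```

With char str[6]; a call such as readLineBounded(stdin, str, sizeof str) can never write past the array, whatever the user types. Input that does not fit stays in the stream for the next read.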

Why does std::wofstream not write the whole wstring to a file?

I have a std::wstring whose size is 139,580,199 characters.
For debugging I wrote it to a file with this code:
std::wofstream f(L"C:\\some file.txt");
f << buffer;
f.close();
After that I noticed that the end of the string is missing. The created file size is 109,592,584 bytes (and the "size on disk" is 109,596,672 bytes).
I also checked whether the buffer contains null chars, like this:
size_t pos = buffer.find(L'\0');
I expected the result to be std::wstring::npos, but it is 18446744073709551615. My string doesn't have a null char at the end, though, so it's probably OK.
Can somebody explain why not all of the string was written to the file?
A lot depends on the locale, but typically, files on disk will
not use the same encoding form (or even the same encoding) as
that used by wchar_t; the filebuf which does the actual
reading and writing translates the encodings according to its
imbued locale. And there is only a vague relationship between
the length of a string in different encodings or encoding forms.
(And the size the system sees doesn't correspond directly to the
number of bytes you can read from the file.)
To see if everything was written, check the status of f
after the close, i.e.:
f.close();
if ( !f ) {
// Something went wrong...
}
One thing that can go wrong is that the external encoding
doesn't have a representation for one of the characters. If
you're in the "C" locale, this could occur for any character
outside of the basic execution character set.
If there is no error above, there's no reason off hand to assume
that not all of the string has been written. What happens if
you try to read it in another program? Do you get the same
number of characters or not?
For the rest, nul characters are characters like any others in
a std::wstring; there's nothing special about them, including
when they are output to a stream. And 18446744073709551615
looks very much like the value I would expect for
std::wstring::npos on a 64 bit machine.
EDIT:
Following up on Mat Petersson's comment: it's actually highly
unlikely that the file ends up with fewer bytes than there are
wchar_t elements in the std::wstring. (std::wstring::size()
returns the number of wchar_t elements, which can differ from the
number of code points where wchar_t holds UTF-16.) I was thinking in
terms of bytes, not in terms of what std::wstring::size() returns. So
the most likely explanation is that you have some characters in
your string which aren't representable in the target encoding
(which probably only supports characters with code points
32-126, plus a few control characters, by default).

Any way to get rid of the null character at the end of an istream get?

I'm currently trying to write a bit of code to read a file and extract bits of it and save them as variables.
Here's the relevant code:
char address[10];
ifstream tracefile;
tracefile.open ("trace.txt");
tracefile.seekg(2, ios::beg);
tracefile.get(address, 10, ' ');
cout << address;
The contents of the file: (just the first line)
R 0x00000000
The issue I'm having is that address misses the final '0' because a '\0' character is put there, and I'm not sure how to get around that. So it outputs:
0x0000000
I'm also having issues with
tracefile.seekg(2, ios::cur);
It doesn't seem to work, which is why I've changed it to ios::beg just to try and get something working, although obviously that won't be usable once I try to read multiple lines one after another.
Any help would be appreciated.
ifstream::get() will attempt to produce a null-terminated C string, which you haven't provided enough space for.
You can either:
Allocate char address[11]; (or bigger) to hold a null-terminated string longer than 9 characters.
Use ifstream::read() instead to read the 10 bytes without a null-terminator.
Edit:
If you want a buffer that can dynamically account for the length of the line, use std::getline with a std::string.
std::string buffer;
tracefile.seekg(2, ios::beg);
std::getline( tracefile, buffer );
Edit 2
If you only want to read to the next whitespace, use:
std::string buffer;
tracefile.seekg(2, ios::beg);
tracefile >> buffer;
Make the buffer bigger, so that you can read the entire input text into it, including the terminating '\0'. Or use std::string, which doesn't have a pre-determined size.
There are several issues with your code. The first is that
seekg( 2, ios::beg ) is undefined behavior unless the stream
is opened in binary mode (which yours isn't). It will work
under Unix, and depending on the contents of the file, it
might work under Windows (but it could also send you to the
wrong place). On some other systems, it might systematically
fail, or do just about anything else. You cannot reliably seek
to arbitrary positions in a text stream.
The second is that if you want to read exactly 10 characters,
the function you need is istream::read, and not
istream::get. On the other hand, if you want to read up to
the next white space, using >> into a string will work best.
If you want to limit the number of characters extracted to a
maximum, set the width before calling >>:
std::string address;
// ...
tracefile >> std::setw( 10 ) >> address;
This avoids all issues of '\0', etc.
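A small self-contained version of that, with the include it needs. The wrapper readToken is just an illustrative name.

```cpp
#include <iomanip>
#include <istream>
#include <string>

// Hypothetical helper: skips leading whitespace, then extracts characters
// until whitespace is met or `maxChars` characters have been stored.
inline std::string readToken(std::istream& in, int maxChars) {
    std::string token;
    in >> std::setw(maxChars) >> token;   // the width resets after each extraction
    return token;
}
```

Since the width is reset after every extraction, it has to be set again before each >> that should be limited.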
Finally, of course, you need error checking. You should
probably check whether the open succeeded before doing anything
else, and you should definitely check whether the read succeeded
before using the results. (As you've written the code, if the
open fails for any reason, you have undefined behavior.)
If you're reading multiple lines, of course, the best solution
is usually to use std::getline to read each line into a
string, and then parse that string (possibly using
std::istringstream). This prevents the main stream from
entering error state if there is a format error in the line, and
it provides automatic resynchronization in such cases.
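A sketch of the line-by-line approach, assuming each line has the form shown in the question (an operation letter followed by a hexadecimal address). TraceEntry and parseTrace are hypothetical names.

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical record type for one line of the trace file.
struct TraceEntry {
    char operation;
    unsigned long address;
};

// Hypothetical helper: parses every well-formed line and skips the rest.
inline std::vector<TraceEntry> parseTrace(std::istream& in) {
    std::vector<TraceEntry> entries;
    std::string line;
    while (std::getline(in, line)) {          // one line at a time
        std::istringstream fields(line);      // parse that line in isolation
        TraceEntry entry;
        if (fields >> entry.operation >> std::hex >> entry.address) {
            entries.push_back(entry);
        }
        // On a malformed line only `fields` fails; `in` is unaffected and
        // the loop simply continues with the next line.
    }
    return entries;
}
```

With std::hex set, the extraction accepts the 0x prefix, so 0x00000000 parses as 0.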

Using C++, how do I read a string of a specific length, from a non-binary file?

The cplusplus.com example for reading text files shows that a line can be read using the getline function. However, I don't want to get an entire line; I want to get only a certain number of characters. How can this be done in a way that preserves character encoding?
I need a function that does something like this:
ifstream fileStream;
fileStream.open("file.txt", ios::in);
resultStream << getstring(fileStream, 10); // read first 10 chars
file.ftell(10); // move to the next item
resultStream << getstring(fileStream, 10); // read 10 more chars
I thought about reading to a char buffer, but wouldn't this change the character encoding?
I really suspect that there's some confusion here regarding the term "character." Judging from the OP's question, he is using the term "character" to refer to a char (as opposed to a logical "character", like a multi-byte UTF-8 character), and thus for the purpose of reading from a text-file the term "character" is interchangeable with "byte."
If that is the case, you can read a certain number of bytes from disk using ifstream::read(), e.g.
ifstream fileStream;
fileStream.open("file.txt", ios::in);
char buffer[1024];
fileStream.read(buffer, sizeof(buffer));
Reading into a char buffer won't affect the character encoding at all. The exact sequence of bytes stored on disk will be copied into the buffer.
However, it is a different story if you are using a multi-byte character set where each character is variable-length. If characters are not fixed-size, there's no way to read exactly N characters from disk with a single disk read. This is not a limitation of C++, this is simply the reality of dealing with block devices (disks). At the lowest levels of your OS, block devices are addressed in terms of blocks, which in turn are made up of bytes. So you can always read an exact number of bytes from disk, but you can't read an exact number of logical characters from disk, unless each character is a fixed number of bytes. For character-sets like UTF-8 where each character is variable length, you'll have to either read in the entire file, or else perform speculative reads and parse the read buffer after each read to determine if you need to read more.
C++ itself doesn't have a concept of character encoding. chars are always the same size, as are wchar_ts. So if you need to read X chars of a multibyte character set (such as UTF-8), you'll either have to read one (single-byte) char at a time (e.g. using getchar()), or read X chars speculatively (e.g. using istream::getline()), and test the MBCS signals yourself, or use a third-party library to do it.
If the charset is a fixed width encoding, and you don't mind stopping when you get to a newline, then getline(), which allows you to specify the maximum number of chars to read, is probably what you want.
As a few people have mentioned, the C/C++ Standard Libraries don't really provide anything that operates above essentially byte level. So if you want to do this using only the core libraries, you don't have a ready-made option.
Which leaves either checking if your chosen platform(s) provide another library that implements this capability, writing your own parser for handling character encodings, or punching something like "c++ utf8 library" or "posix unicode" into Google and taking a look at what turns up.
Possible interesting hits:
UTF-8 and Unicode FAQ
UTF-CPP
I'll leave further investigation to the reader.
I think you can use the sgetn member function of the stream's associated streambuf...
char buf[32];
streamsize i = fileStream.rdbuf()->sgetn( &buf[0], 10 );
Which will read 10 chars into buf (if there are 10 available to read), returning the number of chars read.