Input redirection: reading integers and chars in C++

Thanks for taking your time to read this!
I'm having trouble parsing a file supplied via input redirection: specifically, reading a mix of integers and characters.
Without using getline(), how do you read the whole file, including integers, characters, and any amount of whitespace? (I know the >> operator can skip whitespace, but it fails when it hits a character while extracting an int.)
Thanks!

The first thing you need to realise is that, fundamentally, there are no things like "integers" in your file. Your file does not contain typed data: it contains bytes.
Now, since C++ doesn't support any text encodings, for our purposes here we can consider bytes equivalent to "characters". (In reality, you'll probably layer something like a UTF-8 support library on top of your code, at which point "characters" takes on a whole new meaning. But we'll save that discussion for another day.)
At the most basic, then, we can just extract a bunch of bytes. Let's say 50 at a time:
#include <cstddef>
#include <fstream>

std::ifstream ifs("filename.dat");
constexpr std::size_t CHUNK_SIZE = 50;
char buf[CHUNK_SIZE];
// Keep going while a full chunk was read, or while a final short chunk
// still extracted some bytes (read() fails there, but gcount() reports
// how many bytes it got).
while (ifs.read(buf, CHUNK_SIZE) || ifs.gcount() > 0) {
    const std::size_t num_extracted = static_cast<std::size_t>(ifs.gcount());
    parseData(buf, num_extracted);
}
The function parseData would then examine those bytes in whatever manner you see fit.
For many text files this is unnecessarily arduous. So, as you've discovered, the IOStreams part of the C++ Standard Library provides us with some shortcuts. For example, std::getline will read bytes up to a delimiter, rather than reading a certain quantity of bytes.
Using this, we can read in things "line by line" — assuming a "line" is a sequence of bytes terminated by a \n (or \r\n if your platform performs line-ending translation, and you haven't put the stream into binary mode):
#include <fstream>
#include <string>

std::ifstream ifs("filename.dat");
std::string line;
while (std::getline(ifs, line)) {
    parseLine(line);
}
Instead of \n you can provide, as a third argument to std::getline, some other delimiter.
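For instance, a minimal sketch that treats ',' as the delimiter (the filename is the same placeholder as above):

#include <fstream>
#include <iostream>
#include <string>

int main()
{
    std::ifstream ifs("filename.dat");
    std::string field;
    // ',' replaces the default '\n', so each extraction stops at a comma.
    while (std::getline(ifs, field, ','))
        std::cout << field << '\n';
}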
The other facility it offers is operator>>, which will pick out tokens (sequences of bytes delimited by whitespace) and attempt to "lexically cast" them; that is, it'll try to interpret friendly human ASCII text as C++ data. So if your input is "123 abc", you can pull out the "123" into an int with value 123, and the "abc" into a string.
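A minimal sketch of that behaviour, with a std::istringstream standing in for the file (any istream, including std::cin, behaves the same):

#include <iostream>
#include <sstream>
#include <string>

int main()
{
    std::istringstream in("123 abc");
    int number;
    std::string word;
    if (in >> number >> word)                            // >> skips the whitespace
        std::cout << number + 1 << ' ' << word << '\n';  // prints "124 abc"
}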
If you need more complex parsing, though, you're back to the initial offering, and to the conclusion of my answer: read everything and parse it byte-by-byte as you see fit. To help with this, there's sscanf inherited from the C standard library, or spooky incantations from Boost; or you could just write your own algorithms.
The above is true of any compatible input stream, be it a std::ifstream, a std::istringstream, or the good old ready-provided std::istream instance named std::cin (which I guess is how you're accepting the data, given your mention of input redirection: shell scripting?).

Related

What happens when I read a file into a string

For a small program (seen here), I found out that with gcc/libstdc++ and clang++/libc++, reading file contents into a string works as intended with std::string itself:
std::string filecontents;
{
    std::ifstream t(file);
    std::stringstream buffer;
    buffer << t.rdbuf();
    filecontents = buffer.str();
}
Later on I modify the string. E.g.
ending_it = std::find(ending_it, filecontents.end(), '$');
*ending_it = '\\';
auto ending_pos = static_cast<size_t>(std::distance(filecontents.begin(), ending_it));
filecontents.insert(ending_pos + 1, ")");
This worked even when the file included non-ASCII characters like a Greek lambda. I never searched for these Unicode characters, but they were present in the string. Later on I output the string to std::cout.
Is this guaranteed to work in C++17 (and beyond)?
The question is: under what conditions can I read file contents into a std::string via std::ifstream, work on the string like above, and expect things to work correctly?
As far as I know, std::string uses char, which is only one byte.
Therefore it surprised me that the method worked with non-ASCII characters in the file.
Thanks @user4581301 and @PeteBecker for their helpful comments making me understand the problem.
The question stems from a wrong mental model of std::string, or more fundamentally a wrong model of char.
This is nicely explained here and here.
I implicitly thought that a char holds a "character" in the colloquial sense and therefore knows its encoding. Instead, a char really only holds a single byte (in C++; in C it's defined slightly differently). Therefore it is always well-defined to read a file into a string, as a string is first and foremost only an array of bytes.
This also means that reading a file in an encoding where a "character" can span multiple bytes results in those characters spanning multiple indices in the std::string.
This can be seen when outputting a single char from the string.
Luckily, whenever the file is ASCII-encoded or UTF-8-encoded, the byte representation of an ASCII character can only ever appear when encoding that character. This means that searching the file's string for an ASCII character will find exactly those characters and nothing else. Therefore the above operations of searching for '$' and inserting a substring after an index that points to an ASCII character will not corrupt the multi-byte characters in the string.
Outputting the string to a terminal then just hands over the bytes to be interpreted by the terminal. If the terminal knows UTF-8, it will interpret the bytes accordingly.
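To illustrate, a minimal sketch of why the '$' search is safe; the sample text is made up, and "\xCE\xBB" is the UTF-8 encoding of a Greek lambda:

#include <algorithm>
#include <cstddef>
#include <iostream>
#include <string>

int main()
{
    // UTF-8 continuation bytes always have the top bit set, so the
    // single-byte ASCII '$' can never match inside the two-byte lambda.
    std::string filecontents = "price \xCE\xBB = $5";
    auto it = std::find(filecontents.begin(), filecontents.end(), '$');
    if (it != filecontents.end()) {
        auto pos = static_cast<std::size_t>(it - filecontents.begin());
        filecontents.insert(pos + 1, "(");
        std::cout << filecontents << '\n';  // price λ = $(5
    }
}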

Convert UCS-2 inside character array to UTF-8 std::string

Well, this is a direct follow-up of this question; I decided to split the problem into two. Originally I posted the whole picture to avoid another close vote for an "XY problem". For now, consider that I already know the character encoding.
However, I read a string using std::getline from a file. This file is encoded in a format I know, say UTF-16 big-endian.
But not all files are UTF-16 (actually most are UTF-8), and I'd prefer as little code duplication as possible.
Now my first instinct is to "just read the bytes" and "then do the conversion to UTF-8", skipping the conversion if the input is already UTF-8. So I read it first into a std::string (please ignore the "ugliness" of OpenFilestreams()[file_index]):
std::string retString;
if (isValidIndex(file_index) && OpenFilestreams()[file_index]->good()) {
    std::getline(*OpenFilestreams()[file_index], retString);
}
return retString;
After this I obviously have a nonsense string, as the bytes are ordered as if the string were UCS-2/UTF-16. So how can I convert this std::string to another std::string with the UTF-8 byte ordering? Or should I do this at the line-reading level (or even when opening the file stream)?
I'd prefer to keep to the C++11 standard, maybe Boost/ICU if it is really better (I already have Boost, but no ICU library on my PC).
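As a hedged sketch of the "read the bytes, then convert" idea within plain C++11: once the big-endian byte pairs are assembled into char16_t units, std::wstring_convert with std::codecvt_utf8_utf16 (introduced in C++11, deprecated in C++17 but still available) can produce the UTF-8 bytes. The helper name is made up, and BOM handling and error checking are omitted:

#include <codecvt>
#include <cstddef>
#include <locale>
#include <string>

// Hypothetical helper: convert a buffer of UTF-16BE bytes (as read from
// the file) into a UTF-8 encoded std::string. Assumes an even byte count.
std::string utf16be_to_utf8(const std::string& raw)
{
    std::u16string u16;
    u16.reserve(raw.size() / 2);
    for (std::size_t i = 0; i + 1 < raw.size(); i += 2) {
        // Big-endian: the high byte comes first.
        u16.push_back(static_cast<char16_t>(
            (static_cast<unsigned char>(raw[i]) << 8) |
            static_cast<unsigned char>(raw[i + 1])));
    }
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> conv;
    return conv.to_bytes(u16);  // UTF-16 (char16_t) -> UTF-8 bytes
}

Note also that extracting lines from UTF-16 data with the single-byte '\n' delimiter is itself fragile, since a UTF-16 newline is two bytes; converting at (or before) the line-reading level, as the question suggests, avoids that.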

C++ tokenization

I am writing a lexer in C++ and I am reading from a file character by character; however, how do you do tokenization in this case? I can't use strtok since I have a character, not a string. Somehow I need to keep reading until I reach a delimiter?
The answer is Yes. You need to keep reading until you hit a delimiter.
There are multiple solutions.
The simplest thing to do is exactly that: keep a buffer (std::string) of the characters you have already read until you reach a delimiter. At that point, build a token from the accumulated characters in the buffer, clear the buffer, and push the delimiter (if necessary) into the buffer.
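A minimal sketch of that buffer-accumulation approach, assuming whitespace only separates tokens and any punctuation character is a one-character token (a real lexer adds token types, multi-character operators, and so on):

#include <cctype>
#include <istream>
#include <string>
#include <vector>

std::vector<std::string> tokenize(std::istream& in)
{
    std::vector<std::string> tokens;
    std::string buffer;
    char c;
    while (in.get(c)) {
        const unsigned char uc = static_cast<unsigned char>(c);
        if (std::isspace(uc) || std::ispunct(uc)) {
            if (!buffer.empty()) {            // flush the accumulated token
                tokens.push_back(buffer);
                buffer.clear();
            }
            if (std::ispunct(uc))             // the delimiter is a token itself
                tokens.push_back(std::string(1, c));
        } else {
            buffer.push_back(c);              // keep accumulating
        }
    }
    if (!buffer.empty())                      // trailing token at end of input
        tokens.push_back(buffer);
    return tokens;
}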
Another solution would be to read ahead of time: i.e., pick up an entire line with std::getline (for example), and then check what's on this line. In general the end-of-line is a natural token delimiter.
This works well... when delimiters are easy.
Unfortunately some languages, like C++, have awkward grammars. For example, in C++ >> can be either:
the operator >> (for right-shift and stream extraction)
the end of two nested templates (i.e., it could be rewritten as > >)
In those cases... well, just don't bother with the difference in the tokenizer; let your AST-building pass disambiguate, since it has more information.
Based on the information you provided: if you want to read up to a delimiter from a file, use the istream::getline(char*, streamsize, char) member function.
getline() reads up to n-1 characters, or up to (but not including) the delimiter.
Example:
#include <fstream>
#include <iostream>
using namespace std;

int main()
{
    fstream f;
    f.open("test.cpp", ios::in);
    char c[2];              // getline needs a caller-provided buffer
    f.getline(c, 2, ' ');   // reads up to 1 char, or until a space
    cout << c;
}

Any way to get rid of the null character at the end of an istream get?

I'm currently trying to write a bit of code to read a file and extract bits of it and save them as variables.
Here's the relevant code:
char address[10];
ifstream tracefile;
tracefile.open ("trace.txt");
tracefile.seekg(2, ios::beg);
tracefile.get(address, 10, ' ');
cout << address;
The contents of the file: (just the first line)
R 0x00000000
The issue I'm having is that address misses the final '0', because get() puts a '\0' terminator there, and I'm not sure how to get around that. So it outputs:
0x0000000
I'm also having issues with
tracefile.seekg(2, ios::cur);
It doesn't seem to work, which is why I've changed it to ios::beg just to get something working, although obviously that won't be usable once I try to read multiple lines one after another.
Any help would be appreciated.
ifstream::get() will attempt to produce a null-terminated C string, which you haven't provided enough space for.
You can either:
Allocate char address[11]; (or bigger) to hold a null-terminated string longer than 9 characters.
Use ifstream::read() instead to read the 10 bytes without a null-terminator.
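A minimal sketch of that second option, reading into a std::string so no terminator is involved (the filename and offset come from the question):

#include <fstream>
#include <iostream>
#include <string>

int main()
{
    std::ifstream tracefile("trace.txt");
    tracefile.seekg(2, std::ios::beg);
    std::string address(10, '\0');       // exactly 10 bytes, no '\0' appended
    if (tracefile.read(&address[0], 10))
        std::cout << address << '\n';    // prints "0x00000000"
}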
Edit:
If you want a buffer that can dynamically account for the length of the line, use std::getline with a std::string.
std::string buffer;
tracefile.seekg(2, ios::beg);
std::getline( tracefile, buffer );
Edit 2
If you only want to read to the next whitespace, use:
std::string buffer;
tracefile.seekg(2, ios::beg);
tracefile >> buffer;
Make the buffer bigger, so that you can read the entire input text into it, including the terminating '\0'. Or use std::string, which doesn't have a pre-determined size.
There are several issues with your code. The first is that seekg(2, ios::beg) is undefined behavior unless the stream is opened in binary mode (which yours isn't). It will work under Unix, and depending on the contents of the file, it might work under Windows (but it could also send you to the wrong place). On some other systems, it might systematically fail, or do just about anything else. You cannot reliably seek to arbitrary positions in a text stream.
The second is that if you want to read exactly 10 characters, the function you need is istream::read, not istream::get. On the other hand, if you want to read up to the next whitespace, using >> into a string will work best. If you want to limit the number of characters extracted to a maximum, set the width before calling >>:
#include <iomanip>  // std::setw

std::string address;
// ...
tracefile >> std::setw( 10 ) >> address;
This avoids all issues of '\0', etc.
Finally, of course, you need error checking. You should probably check whether the open succeeded before doing anything else, and you should definitely check whether the read succeeded before using the results. (As you've written the code, if the open fails for any reason, you have undefined behavior.)
If you're reading multiple lines, of course, the best solution is usually to use std::getline to read each line into a string, and then parse that string (possibly using std::istringstream). This prevents the main stream from entering an error state if there is a format error in the line, and it provides automatic resynchronization in such cases.

Using C++, how do I read a string of a specific length, from a non-binary file?

The cplusplus.com example for reading text files shows that a line can be read using the getline function. However, I don't want to get an entire line; I want to get only a certain number of characters. How can this be done in a way that preserves character encoding?
I need a function that does something like this:
ifstream fileStream;
fileStream.open("file.txt", ios::in);
resultStream << getstring(fileStream, 10); // read first 10 chars
file.ftell(10); // move to the next item
resultStream << getstring(fileStream, 10); // read 10 more chars
I thought about reading to a char buffer, but wouldn't this change the character encoding?
I really suspect that there's some confusion here regarding the term "character." Judging from the OP's question, he is using the term "character" to refer to a char (as opposed to a logical "character", like a multi-byte UTF-8 character), and thus for the purpose of reading from a text-file the term "character" is interchangeable with "byte."
If that is the case, you can read a certain number of bytes from disk using ifstream::read(), e.g.
ifstream fileStream;
fileStream.open("file.txt", ios::in);
char buffer[1024];
fileStream.read(buffer, sizeof(buffer));
Reading into a char buffer won't affect the character encoding at all. The exact sequence of bytes stored on disk will be copied into the buffer.
However, it is a different story if you are using a multi-byte character set where each character is variable-length. If characters are not fixed-size, there's no way to read exactly N characters from disk with a single disk read. This is not a limitation of C++, this is simply the reality of dealing with block devices (disks). At the lowest levels of your OS, block devices are addressed in terms of blocks, which in turn are made up of bytes. So you can always read an exact number of bytes from disk, but you can't read an exact number of logical characters from disk, unless each character is a fixed number of bytes. For character-sets like UTF-8 where each character is variable length, you'll have to either read in the entire file, or else perform speculative reads and parse the read buffer after each read to determine if you need to read more.
C++ itself doesn't have a concept of character encoding. chars are always the same size, as are wchar_ts. So if you need to read X characters of a multi-byte character set (such as UTF-8), you'll either have to read one (single-byte) char at a time (e.g. using getchar(), or X chars speculatively using istream::getline()) and test the multi-byte lead bytes yourself, or use a third-party library to do it.
If the charset is a fixed-width encoding, and you don't mind stopping when you get to a newline, then getline(), which allows you to specify the maximum number of chars to read, is probably what you want.
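A minimal sketch of the byte-at-a-time idea, assuming well-formed UTF-8: continuation bytes have the form 10xxxxxx, so counting only the other bytes counts logical characters. The function name is made up:

#include <cstddef>
#include <istream>
#include <string>

std::string read_utf8_chars(std::istream& in, std::size_t n)
{
    std::string out;
    std::size_t count = 0;
    for (int ci; (ci = in.peek()) != std::istream::traits_type::eof(); ) {
        const unsigned char c = static_cast<unsigned char>(ci);
        if ((c & 0xC0) != 0x80) {   // a lead byte starts a new character
            if (count == n)
                break;              // the next character would be one too many
            ++count;
        }
        out.push_back(static_cast<char>(in.get()));
    }
    return out;
}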
As a few people have mentioned, the C/C++ Standard Libraries don't really provide anything that operates above essentially byte level. So if you're wanting to do this using only the core libraries you don't have a ready made option.
Which leaves either checking if your chosen platform(s) provide another library that implements this capability, writing your own parser for handling character encodings, or punching something like "c++ utf8 library" or "posix unicode" into Google and taking a look at what turns up.
Possible interesting hits:
UTF-8 and Unicode FAQ
UTF-CPP
I'll leave further investigation to the reader.
I think you can use the sgetn member function of the stream's associated streambuf:
char buf[32];
streamsize i = fileStream.rdbuf()->sgetn( &buf[0], 10 );
Which will read 10 chars into buf (if there are 10 available to read), returning the number of chars read.