How to save text file to struct with string in C++

I want to save the contents of a file to a struct. I've tried to use seekg and read to fill it, but it isn't working.
My file is something like:
johnmayer24ericclapton32
I want to store the name, the last name and the age in a struct like this:
typedef struct test_struct{
string name;
string last_name;
int age;
} test_struct;
Here is my code
int main(){
test_struct ts;
ifstream data_base;
data_base.open("test_file.txt");
data_base.seekg(0, ios_base::beg);
data_base.read(ts, sizeof(test_struct));
data_base.close();
return 0;
}
It doesn't compile because read won't accept ts as its first argument. Is there another way, or any way, of doing this?

Serialization/Deserialization of strings is tricky.
For binary data, the convention is to write the length of the string first, then the string data.
https://isocpp.org/wiki/faq/serialization#serialize-binary-format
String data is tricky because you have to unambiguously know when the string’s body stops. You can’t unambiguously terminate all strings with a '\0' if some string might contain that character; recall that std::string can store '\0'. The easiest solution is to write the integer length just before the string data. Make sure the integer length is written in “network format” to avoid sizeof and endian problems (see the solutions in earlier bullets).
That way when reading the data back in you know the length of the string to expect and can preallocate the size of the string then just read that much data from the stream.
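A minimal sketch of that length-prefix convention, assuming a 32-bit big-endian length field. The names write_string, read_string and round_trip are invented for this illustration and do not come from the FAQ.

```cpp
#include <cstdint>
#include <istream>
#include <ostream>
#include <sstream>
#include <string>

// Writes a 32-bit length in big-endian ("network") order, then the raw bytes.
// Assumes the string is shorter than 4 GiB.
void write_string(std::ostream& out, const std::string& s) {
    const std::uint32_t n = static_cast<std::uint32_t>(s.size());
    const char len[4] = {static_cast<char>((n >> 24) & 0xFF),
                         static_cast<char>((n >> 16) & 0xFF),
                         static_cast<char>((n >> 8) & 0xFF),
                         static_cast<char>(n & 0xFF)};
    out.write(len, 4);
    out.write(s.data(), static_cast<std::streamsize>(s.size()));
}

// Reads the length back, sizes the string, then reads exactly that many bytes.
// Returns false if the stream ends early. A real reader should also cap the
// length, since a corrupt file could request a huge allocation.
bool read_string(std::istream& in, std::string& s) {
    unsigned char len[4];
    if (!in.read(reinterpret_cast<char*>(len), 4)) return false;
    const std::uint32_t n = (std::uint32_t{len[0]} << 24) | (std::uint32_t{len[1]} << 16) |
                            (std::uint32_t{len[2]} << 8) | std::uint32_t{len[3]};
    s.resize(n);
    return n == 0 || static_cast<bool>(in.read(&s[0], static_cast<std::streamsize>(n)));
}

// Demonstration on an in-memory binary stream; a file stream opened with
// std::ios::binary would be used the same way.
std::string round_trip(const std::string& s) {
    std::stringstream buf(std::ios::in | std::ios::out | std::ios::binary);
    write_string(buf, s);
    std::string back;
    read_string(buf, back);
    return back;
}
```

Because the length is stored explicitly, a string containing an embedded '\0' survives the round trip unchanged.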
If your data is a non-binary (text) format it's a little trickier:
https://isocpp.org/wiki/faq/serialization#serialize-text-format
String data is tricky because you have to unambiguously know when the string’s body stops. You can’t unambiguously terminate all strings with a '\n' or '"' or even '\0' if some string might contain those characters. You might want to use C++ source-code escape-sequences, e.g., writing '\' followed by 'n' when you see a newline, etc. After this transformation, you can either make strings go until end-of-line (meaning they are deliminated by '\n') or you can delimit them with '"'.
If you use C++-like escape-sequences for your string data, be sure to always use the same number of hex digits after '\x' and '\u'. I typically use 2 and 4 digits respectively. Reason: if you write a smaller number of hex digits, e.g., if you simply use stream << "\x" << hex << unsigned(theChar), you’ll get errors when the next character in the string happens to be a hex digit. E.g., if the string contains '\xF' followed by 'A', you should write "\x0FA", not "\xFA".
If you don’t use some sort of escape sequence for characters like '\n', be careful that the operating system doesn’t mess up your string data. In particular, if you open a std::fstream without std::ios::binary, some operating systems translate end-of-line characters.
Another approach for string data is to prefix the string’s data with an integer length, e.g., to write "now is the time" as 15:now is the time. Note that this can make it hard for people to read/write the file, since the value just after that might not have a visible separator, but you still might find it useful.
Conventions for text-based serialization/deserialization vary, but one field per line is an accepted practice.
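If the file can be rewritten with one field per line, reading it back becomes straightforward. The sketch below assumes three lines per record (name, last name, age); Person and read_people are hypothetical names, with Person mirroring the asker's test_struct.

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

struct Person {  // mirrors the asker's test_struct
    std::string name;
    std::string last_name;
    int age = 0;
};

// Reads records stored as three lines each: name, last name, age.
// Stops at the first incomplete or malformed record.
std::vector<Person> read_people(std::istream& in) {
    std::vector<Person> people;
    Person p;
    std::string age_line;
    while (std::getline(in, p.name) && std::getline(in, p.last_name) &&
           std::getline(in, age_line)) {
        std::istringstream age_stream(age_line);
        if (!(age_stream >> p.age)) break;  // age line was not a number
        people.push_back(p);
    }
    return people;
}
```

Any std::istream works here, so the same function can be handed a std::ifstream opened on the data file.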

You'll have to develop a specific algorithm, since there is no separator character between the "fields".
static const std::string input_text = "johnmayer24ericclapton32";
static const std::string alphabet = "abcdefghijklmnopqrstuvwxyz";
static const std::string decimal_digit = "0123456789";
std::string::size_type position = 0;
std::string artist_name;
position = input_text.find_first_not_of(alphabet);
if (position != std::string::npos)
{
artist_name = input_text.substr(0, position); // second argument is a length, so this takes [0, position)
}
else
{
std::cerr << "Artist name not found.\n";
return EXIT_FAILURE;
}
Similarly, you can extract the run of digits that follows, then use std::stoi to convert that numeric string to an int.
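Putting both steps together, a sketch of the whole loop might look like the following. It assumes names contain no digits and ages fit in an int; Record and parse_records are hypothetical names, and the name is left unsplit because the input carries no first/last separator.

```cpp
#include <string>
#include <vector>

struct Record {  // hypothetical holder: undivided name plus age
    std::string full_name;
    int age;
};

// Splits input such as "johnmayer24ericclapton32" into (letters, number) pairs.
// Stops at the first malformed record. std::stoi throws std::out_of_range if a
// digit run does not fit in an int.
std::vector<Record> parse_records(const std::string& text) {
    static const std::string digits = "0123456789";
    std::vector<Record> records;
    std::string::size_type pos = 0;
    while (pos < text.size()) {
        const std::string::size_type name_end = text.find_first_of(digits, pos);
        if (name_end == std::string::npos || name_end == pos) break;  // no age, or no name
        std::string::size_type age_end = text.find_first_not_of(digits, name_end);
        if (age_end == std::string::npos) age_end = text.size();
        records.push_back({text.substr(pos, name_end - pos),
                           std::stoi(text.substr(name_end, age_end - name_end))});
        pos = age_end;
    }
    return records;
}
```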
Edit 1: Splitting the name
Since there is no separator character between the first and last name, you may want to have a list of possible first names and use that to find out where the first name ends and the surname starts.

Related

Convert string to raw string

char str[] = "C:\Windows\system32";
auto raw_string = convert_to_raw(str);
std::cout << raw_string;
Desired output:
C:\Windows\system32
Is it possible? I am not a big fan of cluttering my path strings with extra backslash. Nor do I like an explicit R"()" notation.
Any other work-around of reading a backslash in a string literally?
That's not possible, \ has special meaning inside a non-raw string literal, and raw string literals exist precisely to give you a chance to avoid having to escape stuff. Give up, what you need is R"(...)".
Indeed, when you write something like
char const * str{"a\nb"};
you can verify yourself that strlen(str) is 3, not 4, which means that once you compile that line, in the binary/object file there's only one single character, the newline character, corresponding to \n; there's no \ nor n anywhere in it, so there's no way you can retrieve them.
As a matter of personal taste, I find raw string literals great! You can even put a real Enter in there. Often that costs just 3 extra characters - R, (, and ) - in addition to those you would write anyway, and escaping everything that needs escaping would take more characters still.
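The length claim can be checked directly. In this small sketch the function names are made up for illustration:

```cpp
#include <cstring>

// Ordinary literal: 'a', newline, 'b'.
std::size_t escaped_length() { return std::strlen("a\nb"); }

// Raw literal: 'a', backslash, 'n', 'b'.
std::size_t raw_length() { return std::strlen(R"(a\nb)"); }
```

The first function returns 3 and the second returns 4, because only the raw literal keeps the backslash and the n as separate characters.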
Look at
std::string s{R"(Hello
world!
This
is
Me!)"};
That's 29 keystrokes from R to the last " included, and you can see at a glance that it's 5 lines.
The equivalent non-raw string
std::string s{"Hello\nworld!\nThis\nis\nMe!"};
is 30 keystrokes from the first " to the last " included, and you have to parse it carefully to count the lines.
A pretty short string, and you already see the advantage.
To answer the question, as asked, no it is not possible.
As an example of the impossibility, assume we have a path specified as "C:\a\b";
Now, that string is actually represented in memory (in your program when running) as a statically allocated array of five characters with values {'C', ':', '\007', '\010', '\000'}, where '\xyz' is an OCTAL escape (so '\010' is a char numerically equal to 8 in decimal).
The problem is that there is more than one way to produce that array of five characters using a string literal.
char str[] = "C:\a\b";
char str1[] = "C:\007\010";
char str2[] = "C:\a\010";
char str3[] = "C:\007\b";
char str4[] = "C:\x07\x08"; // \xmn uses hex coding
In the above, str, str1, str2, str3, and str4 are all initialised with identical arrays of five char.
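A quick way to confirm this is to compare the arrays byte for byte. The sketch below declares the same five literals as const arrays; all_identical is a name invented for this check.

```cpp
#include <cstring>

// The five spellings from the answer; each should compile to the same five bytes.
const char str[]  = "C:\a\b";
const char str1[] = "C:\007\010";
const char str2[] = "C:\a\010";
const char str3[] = "C:\007\b";
const char str4[] = "C:\x07\x08";

// True when every array has size 5 and the same contents as str.
bool all_identical() {
    return sizeof(str) == 5 && sizeof(str1) == 5 && sizeof(str2) == 5 &&
           sizeof(str3) == 5 && sizeof(str4) == 5 &&
           std::memcmp(str, str1, 5) == 0 && std::memcmp(str, str2, 5) == 0 &&
           std::memcmp(str, str3, 5) == 0 && std::memcmp(str, str4, 5) == 0;
}
```

Since the compiled bytes are identical, no function receiving one of these arrays can tell which spelling appeared in the source.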
That means convert_to_raw("C:\a\b") could quite legitimately assume it is passed ANY of the strings above AND
std::cout << convert_to_raw("C:\a\b") << '\n';
could quite legitimately produce output of
C:\007\010
(or any one of a number of other strings).
The practical problem with this, if you are working with windows paths, is that c:\a\b, C:\007\010, C:\a\010, C:\007\b, and C:\x07\x08 are all valid filenames under windows - that (unless they are hard links or junctions) name DIFFERENT files.
In the end, if you want to have string literals in your code representing filenames or paths, then use \\ or a raw string literal when you need a single backslash. Alternatively, write your paths as string literals in your code using all forward slashes (e.g. "C:/a/b") since windows API functions accept those too.

How to deal with garbage characters in a string?

Suppose I have a string that contains a numeric character I need, but it is not terminated by '\0'; instead, it has garbage characters after the number. So how to deal with the garbage characters while storing that numeric character in another string or variable?
So how to deal with the garbage character while storing that numerical character in another string or variable?
Only copy a substring. Example:
std::string example = "garbage1garbage";
char numerical = example[7];
We got the numerical character excluding the garbage entirely.
If the text to be converted is in a std::string, then you can extract a number from the front as follows:
#include <sstream>
...
std::string input = "128734garbage";
std::istringstream iss{input};
int num;
if (iss >> num)
...use_num...
else
std::cerr << "wasn't able to parse an int from input\n";
Just change int to double, uint64_t, ... - whatever suits your data.
If you have only a pointer to the text and know it's not null-terminated, just getting the text into a std::string is problematic. You could instead use a function that converts text to a number, but stops at the first invalid character. std::stol et al, and the other unsigned and floating point variants linked from the same reference page, are good candidates for that.
From your "another string or variable" - the above addresses storing into a numeric variable. You can then create a new std::string from the number using std::to_string, or a std::ostringstream, if that's what you want to do. This will standardise the output format though, so input like say "1E4" parsed as a double might end up looking like say 10000.000000. Alternatively, with the stol-type functions you can use the pointer-to-the-end-of-the-number to work out the length of the numeric part, and use std::string::substr() to extract the leading number as a new std::string object.
You should also be aware that the distinction between number and garbage is not always what you might expect. For example "0XBEFHJQ" might be split by some of the above functions as 0xBEF hex and HJQ garbage.
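A sketch of the substr approach using std::stol, which reports how many characters it consumed through its second parameter. NumberSplit and split_leading_number are hypothetical names; the function assumes the text is already in a std::string and lets std::stol's exceptions propagate.

```cpp
#include <cstddef>
#include <string>

struct NumberSplit {     // hypothetical result type
    long value;          // the parsed number
    std::string digits;  // the number exactly as written in the input
    std::string rest;    // whatever followed it
};

// std::stol throws std::invalid_argument if no number can be parsed and
// std::out_of_range if it does not fit in a long. Leading whitespace, if any,
// ends up in `digits` because std::stol counts it as consumed.
NumberSplit split_leading_number(const std::string& text, int base = 10) {
    std::size_t consumed = 0;
    const long value = std::stol(text, &consumed, base);
    return {value, text.substr(0, consumed), text.substr(consumed)};
}
```

With base 0 the prefix decides the radix, so split_leading_number("0XBEFHJQ", 0) treats 0XBEF as hex and leaves HJQ as the remainder, which is the surprising split described above.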

How can I check if the first char in a string is '-'?

In general, I need to check if a given string is a number. So I thought my function will check:
1. If the first char is '-' I want to check if there are only digits after it.
2. If the first char is 0 the length of the string has to be less than 3.
The problem: I cannot find a way to get the first char in the string, like if I would do it in C (just look if it is equal to ASCII number), nor in Java, where I would compare strings with equals().
Here's a handy utility function to parse numbers based on streams:
template <class T>
bool try_parse_number(std::string_view s, T& v, const std::locale& locale)
{
std::stringstream stream;
stream.imbue(locale);
stream << s;
stream >> v;
return !stream.fail();
}
Requires the includes <sstream>, <string_view> and <locale>, although you could strip the locale handling out.
You can further create a custom locale and a number facet to control number parsing to a greater degree.
I think it is easier in Java: s.charAt(0) returns the first character of the string, which you can store and later compare against anything.
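In C++ the first character of a std::string is available as s[0] or s.front(), and it compares directly against a character literal. The sketch below is a simplified version of the check the question describes: it accepts an optional leading '-' followed by digits only and does not implement the question's leading-zero rule. The name is_integer is made up for this example.

```cpp
#include <string>

// True if s is an optional leading '-' followed by one or more decimal digits.
bool is_integer(const std::string& s) {
    if (s.empty()) return false;
    const std::string::size_type start = (s[0] == '-') ? 1 : 0;  // s.front() is equivalent
    if (start == s.size()) return false;                         // a lone "-" is not a number
    return s.find_first_not_of("0123456789", start) == std::string::npos;
}
```

Unlike the stream-based helper above, this rejects trailing garbage such as "12abc".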

How to read a specific amount of characters from a text file

I tried to do it like this
#include <iostream>
#include <fstream>
using namespace std;
int main()
{
char b[2];
ifstream f("prad.txt");
f>>b ;
cout <<b;
return 0;
}
It should read 2 characters but it reads whole line. This worked on another language but doesn't work in C++ for some reason.
You can use read() to specify the number of characters to read:
char b[3] = "";
ifstream f("prad.txt");
// Read one less than sizeof(b) so that b stays null terminated for use with cout.
f.read(b, sizeof(b) - 1);
cout << b;
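A variant that avoids the fixed-size array is to read into a std::string and trim it to the number of characters actually read. This is only a sketch; read_chars and first_two_of are names invented for the example.

```cpp
#include <istream>
#include <sstream>
#include <string>

// Reads up to n characters from in, whitespace included, and returns what was actually read.
std::string read_chars(std::istream& in, std::size_t n) {
    std::string buffer(n, '\0');
    in.read(&buffer[0], static_cast<std::streamsize>(n));
    buffer.resize(static_cast<std::size_t>(in.gcount()));  // shorter if the stream ended early
    return buffer;
}

// Demonstration on an in-memory stream; with a file, pass a std::ifstream instead.
std::string first_two_of(const std::string& text) {
    std::istringstream in(text);
    return read_chars(in, 2);
}
```

gcount() reports how many characters the last unformatted read extracted, so a file shorter than n yields a shorter string instead of stale buffer contents.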
This worked on another language but doesn't work in C++ for some reason.
Some things change from language to language. In particular, in this case you've run afoul of the fact that in C++ an array decays to a pointer when it is passed to a function. That array gets passed to operator>> as a pointer to char, which is interpreted as a C-string buffer, so it does what it does to char buffers (to wit, read until the width limit or the next whitespace, whichever comes first). With no width set, nothing stops it from writing past the end of your two-byte buffer, which is undefined behavior, so your program could well crash when that happens. (Since C++20, operator>> takes the array by reference and respects its size.)
istream& get (char* s, streamsize n );
Extracts characters from the stream and stores them as a c-string into
the array beginning at s. Characters are extracted until either (n -
1) characters have been extracted or the delimiting character '\n' is
found. The extraction also stops if the end of file is reached in the
input sequence or if an error occurs during the input operation. If
the delimiting character is found, it is not extracted from the input
sequence and remains as the next character to be extracted. Use
getline if you want this character to be extracted (and discarded).
The ending null character that signals the end of a c-string is
automatically appended at the end of the content stored in s.

Ignore byte-order marks in C++, reading from a stream

I have a function to read the value of one variable (integer, double, or boolean) on a single line in an ifstream:
template <typename Type>
void readFromFile (ifstream &in, Type &val)
{
string str;
getline (in, str);
stringstream ss(str);
ss >> val;
}
However, it fails on text files created with editors inserting a BOM (byte order mark) at the beginning of the first line, which unfortunately includes {Note,Word}pad. How can I modify this function to ignore the byte-order mark if present at the beginning of str?
(I'm assuming you're on Windows, since using U+FEFF as a signature in UTF-8 files is mostly a Windows thing and should simply be avoided elsewhere)
You could open the file as a UTF-8 file and then check to see if the first character is U+FEFF. You can do this by opening a normal char based fstream and then use wbuffer_convert to treat it as a series of code units in another encoding. VS2010 doesn't yet have great support for char32_t so the following uses UTF-16 in wchar_t.
std::fstream fs(filename);
std::wbuffer_convert<std::codecvt_utf8_utf16<wchar_t>,wchar_t> wb(fs.rdbuf());
std::wistream is(&wb);
// if you don't do this on the stack remember to destroy the objects in reverse order of creation. is, then wb, then fs.
std::wistream::int_type ch = is.get();
const std::wistream::int_type ZERO_WIDTH_NO_BREAK_SPACE = 0xFEFF;
if(ZERO_WIDTH_NO_BREAK_SPACE != ch)
is.putback(ch);
// now the stream can be passed around and used without worrying about the extra character in the stream.
int i;
readFromStream<int>(is,i);
Remember that this should be done on the file stream as a whole, not inside readFromFile on your stringstream, because ignoring U+FEFF should only be done if it's the very first character in the whole file, if at all. It shouldn't be done anywhere else.
On the other hand, if you're happy using a char based stream and just want to skip U+FEFF if present, then James Kanze's suggestion seems good, so here's an implementation:
std::fstream fs(filename);
char a,b,c;
a = fs.get();
b = fs.get();
c = fs.get();
if (a != (char)0xEF || b != (char)0xBB || c != (char)0xBF) {
fs.clear(); // a file shorter than three bytes leaves failbit set, which would make seekg fail
fs.seekg(0);
} else {
std::cerr << "Warning: file contains the so-called 'UTF-8 signature'\n";
}
Additionally if you want to use wchar_t internally the codecvt_utf8_utf16 and codecvt_utf8 facets have a mode that can consume 'BOMs' for you. The only problem is that wchar_t is widely recognized to be worthless these days* and so you probably shouldn't do this.
std::wifstream fin(filename);
fin.imbue(std::locale(fin.getloc(), new std::codecvt_utf8_utf16<wchar_t, 0x10FFFF, std::consume_header));
* wchar_t is worthless because it is specified to do just one thing; provide a fixed size data type that can represent any code point in a locale's character repertoire. It does not provide a common representation between locales (i.e., the same wchar_t value can be different characters in different locales so you cannot necessarily convert to wchar_t, switch to another locale, and then convert back to char in order to do iconv-like encoding conversions.)
The fixed sized representation itself is worthless for two reasons; first, many code points have semantic meanings and so understanding text means you have to process multiple code points anyway. Secondly, some platforms such as Windows use UTF-16 as the wchar_t encoding, which means a single wchar_t isn't even necessarily a code point value. (Whether using UTF-16 this way is even conformant to the standard is ambiguous. The standard requires that every character supported by a locale be representable as a single wchar_t value; If no locale supports any character outside the BMP then UTF-16 could be seen as conformant.)
You have to start by reading the first byte or two of the stream, and
deciding whether it is part of a BOM or not. It's a bit of a pain,
since you can only putback a single byte, whereas you typically will
want to read four. The simplest solution is to open the file, read the
initial bytes, memorize how many you need to skip, then seek back to the
beginning and skip them.
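A sketch of that read, decide, seek sequence, limited to the three-byte UTF-8 signature (other BOMs would need more cases). It assumes a seekable stream positioned at its beginning; skip_utf8_signature and first_line are hypothetical names.

```cpp
#include <istream>
#include <sstream>
#include <string>

// Positions a seekable stream just past a UTF-8 signature (EF BB BF), or at
// offset 0 if there is none. Returns the number of bytes skipped (0 or 3).
int skip_utf8_signature(std::istream& in) {
    char head[3] = {};
    in.read(head, 3);
    const bool has_signature = in.gcount() == 3 &&
                               static_cast<unsigned char>(head[0]) == 0xEF &&
                               static_cast<unsigned char>(head[1]) == 0xBB &&
                               static_cast<unsigned char>(head[2]) == 0xBF;
    const int skip = has_signature ? 3 : 0;
    in.clear();      // an input shorter than three bytes sets eofbit and failbit
    in.seekg(skip);  // seek back to the start, then past the signature if present
    return skip;
}

// Demonstration on an in-memory stream: returns the first line without any signature.
std::string first_line(const std::string& contents) {
    std::istringstream in(contents);
    skip_utf8_signature(in);
    std::string line;
    std::getline(in, line);
    return line;
}
```

Calling it once, right after opening the file, keeps the check at the very start of the stream where a signature can legitimately appear.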
As a not-so-clean solution, I solved it by removing non-printing characters:
// True for anything outside printable ASCII (space through '~').
bool isNotPrintable(unsigned char c)
{
return (c < ' ' || c > '~');
}
...
str.erase(remove_if(str.begin(), str.end(), isNotPrintable), str.end());
Here's a simple C++ function to skip the BOM on an input stream on Windows. This assumes byte-sized data, as in UTF-8:
// skip BOM for UTF-8 on Windows (the auto parameter requires C++20)
void skip_bom(auto& fs) {
const unsigned char boms[]{ 0xef, 0xbb, 0xbf };
bool have_bom{ true };
for(const auto& c : boms) {
if((unsigned char)fs.get() != c) have_bom = false;
}
if(!have_bom) {
fs.clear(); // get() past end of file sets failbit, which would make seekg fail
fs.seekg(0);
}
}
It simply checks the first three bytes for the UTF-8 BOM signature, and skips them if they all match. There's no harm if there's no BOM.
Edit: This works with a file stream, but not with cin. I found it did work with cin on Linux with GCC-11, but that's clearly not portable. See @Dúthomhas's comment.