C++: problem while copying string into character array

I am trying to copy the contents of a file into fields of a class courseInfo.
This is the code I'm using:
#include<iostream>
#include<fstream>
#include<vector>
#include<sstream>
#include <bits/stdc++.h>
using namespace std;
class courseInfo
{
public:
char courseCode[8];
char courseName[80];
int ECTS;
};
int main()
{
ifstream fin("courses.txt");
if(!fin.is_open())
{
cout<<"file doesn't exist";
return 0;
}
string line;
vector<courseInfo> courses;
while(getline(fin,line))
{
stringstream linestream(line);
string segment;
vector<string> segmentlist;
while(getline(linestream, segment, ';'))
{
segmentlist.push_back(segment);
}
//cout<<segmentlist.at(0).c_str();
courseInfo c;
//segmentlist.at(0).copy(c.courseCode, segmentlist.at(0).size()+1);
//c.courseCode[segmentlist.at(0).size()] = '\0';
strcpy(c.courseCode, segmentlist.at(0).c_str());
cout<<c.courseCode<<"\n";
strcpy(c.courseName, segmentlist.at(1).c_str());
cout<<c.courseCode;
}
return 0;
}
Content of the courses.txt file:
TURK 101;Turkish l;3.
Output I get:
TURK 101
TURK 101Turkish l
The contents of courseCode change when I copy something into courseName.
Why does this happen?
How do I rectify it?

Note how TURK 101 is exactly 8 bytes, so strcpy needs 9 bytes to store it with its terminating NUL. That NUL lands one byte past the end of courseCode, which, given the class layout, is the first byte of c.courseName.
When you cout << c.courseCode, your program prints characters until it encounters a NUL byte. At first, the first byte of c.courseName is that NUL.
After you read into courseName, it is no longer NUL, and thus printing c.courseCode happily continues into c.courseName.
Some options:
The most obvious (and recommended) solution is to use std::string in your class instead of fixed-size char arrays (see the sketch after this list).
However, this looks like a homework question, so you probably are not allowed to use std::string.
Use std::vector<char> instead, but that is probably also not allowed.
Make courseCode large enough to contain any possible course code, plus one character for the NUL-terminator. In this case, make courseCode 9 chars large.
Use heap-allocated memory (new char[str.size()+1] to allocate a char *, delete[] ptr to free it afterwards). And then change courseInfo to take regular pointers. Ideally all the memory management is done in constructors/destructors. See the rule of three/five/zero.
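For reference, a minimal sketch of the std::string option; the parsing mirrors the question's code, with error handling omitted:
#include <fstream>
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

struct courseInfo
{
    std::string courseCode;  // no size guess, no manual NUL handling
    std::string courseName;
    int ECTS = 0;
};

int main()
{
    std::ifstream fin("courses.txt");
    std::string line;
    std::vector<courseInfo> courses;
    while (std::getline(fin, line))
    {
        std::stringstream linestream(line);
        courseInfo c;
        std::getline(linestream, c.courseCode, ';');
        std::getline(linestream, c.courseName, ';');
        linestream >> c.ECTS;  // reads the 3 of "3."
        courses.push_back(c);
        std::cout << c.courseCode << "\n" << c.courseName << "\n";
    }
}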

Related

scanf function for strings

The problem is simple: the code below does not work. It says Process finished with exit code -1073740940 (0xC0000374). Removing the ampersand does not change anything.
#include <cstdio>
#include <iostream>
#include <string>
using namespace std;
int main(){
string x;
scanf("%s",&x);
cout << x;
}
scanf() with the %s format specifier reads bytes into a preallocated character array (char[]), to which you pass a pointer.
Your x is not a character array. It is a std::string, a complex object.
A std::string* is not in any way the same as a char*. Your code overwrites the memory of parts of a complex object in unpredictable ways, so you end up with a crash.
Your compiler should have warned about this, since it knows that a char* is not a std::string*, and because compilers are clever and can detect mistakes like this despite the type-unsafe nature of C library functions.
Even if this were valid via some magic compatibility layer, the string is empty, so there would be no buffer to read into anyway.
Use I/O streams instead.
You cannot pass complex objects through the ... (varargs) parameter of printf/scanf. Many compilers print a warning for that.
scanf requires a pointer of type char* pointing to sufficient storage for an argument of %s. std::string is something completely different.
In C++ the iostream operators are intended for text input and output.
cin >> x;
will do the job.
You should not use scanf in C++. There are many pitfalls; you found one of them.
Another pitfall: %s with scanf is almost always undefined behavior unless you really ensure that the source stream can only contain strings of limited size. In that case a buffer of char buffer[size]; is the right target.
In any other case you should at least restrict the size of the string to scan, e.g. use %20s and, of course, a matching char buffer, char buffer[21]; in this case. Note the size + 1.
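As a concrete example of that rule (the field width is one less than the buffer size):
#include <cstdio>

int main()
{
    char buffer[21];
    if (std::scanf("%20s", buffer) == 1)  // reads at most 20 chars + NUL
        std::printf("%s\n", buffer);
}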
You should use cin. But if you want to use scanf() for whatever reason and still manipulate your strings with std::string, then you can read the C-string and use it to initialize your C++ string.
#include <iostream>
#include <cstdio>
#include <string>
using std::cout;
using std::string;
int main()
{
char c_str[80];
scanf("%79s", c_str); // width bounds the read: 79 chars + terminating NUL fit the buffer
string str(c_str);
cout << str << "\n";
}
If you want to use strings, use cin (or getline).
string s;
cin>>s; //s is now read
If you want to use scanf, you want to have a char array (and don't use &):
char text[30];
scanf("%s", text); //text is now read
You can use char[] instead of string
#include <cstdio>
#include <iostream>
using namespace std;
int main()
{
char tmp[101];
scanf("%100s", tmp);
cout << tmp;
}

10 fold size increase when reading file into struct

I am trying to read a CSV file into a struct containing a vector of vectors of strings. The file contains ~2 million lines and its size on disk is ~350 MB. When I read the full file into the struct, top shows the program using almost 3.5 GB of memory. I have used vector::reserve to try to limit the capacity growth on push_back.
#include<iomanip>
#include<stdio.h>
#include<stdlib.h>
#include<iostream>
#include<fstream>
#include<string.h>
#include<sstream>
#include<math.h>
#include<vector>
#include<algorithm>
#include<array>
#include<ctime>
#include<boost/algorithm/string.hpp>
using namespace std;
struct datStr{
vector<string> colNames;
vector<vector<string>> data;
};
datStr readBoost(string fileName)
{
datStr ds;
ifstream inFile;
inFile.open(fileName);
string line;
getline(inFile, line);
vector<string> colNames;
stringstream ss(line);
string item;
int i = 0;
vector<int> colTypeInt;
while(getline(ss, item, ','))
{
item.erase( remove( item.begin(), item.end(), ' ' ), item.end() );
colNames.push_back(item);
vector<string> colVec;
ds.data.push_back(colVec);
ds.data[i].reserve(3000000);
i++;
}
int itr = 0;
while(getline(inFile, line))
{
vector<string> rowStr;
boost::split(rowStr, line, boost::is_any_of(","));
for(int ktr = 0; ktr < rowStr.size(); ktr++)
{
rowStr[ktr].erase( remove( rowStr[ktr].begin(), rowStr[ktr].end(), ' ' ), rowStr[ktr].end() );
ds.data[ktr].push_back(rowStr[ktr]);
}
itr++;
}
return ds;
}
int main()
{
datStr ds = readBoost("file.csv");
while(true)
{
}
}
PS: The last while is just so I can monitor the memory usage on completion of the program.
Is this something expected when using vectors or am I missing something here?
Another interesting fact: I started adding up size and capacity for each string in the read loop. Surprisingly, it adds up to only about a tenth of what top shows on Ubuntu. Is top misreporting, or is my compiler allocating too much space?
I tested your code with an input file that has 1886850 lines of text, with a size of 105M.
With your code, the memory consumption was about 2.5G.
Then, I started modifying how data is stored.
First test:
Change datStr to:
struct datStr{
vector<string> colNames;
vector<string> lines;
};
This reduced the memory consumption to 206M. That's more than a tenfold reduction in size. It's clear that the penalty for using
vector<vector<string>> data;
is rather stiff.
Second test:
Change datStr to:
struct datStr{
vector<string> colNames;
vector<string> lines;
vector<vector<string::size_type>> indices;
};
with indices keeping track of where the tokens in lines start. You can extract the tokens from each line by using lines and indices.
With this change, the memory consumption went up to 543MB, but that is still almost five times smaller than the original.
Third test:
Change datStr to:
struct datStr{
vector<string> colNames;
vector<string> lines;
vector<vector<unsigned int>> indices;
};
With this change, the memory consumption came down to 455MB. This should work as long as your lines are shorter than UINT_MAX characters.
Fourth test:
Change datStr to:
struct datStr{
vector<string> colNames;
vector<string> lines;
vector<vector<unsigned short>> indices;
};
With this change, the memory consumption came down to 278MB. This should work as long as your lines are shorter than USHRT_MAX (typically 65535) characters. For this case, the overhead of the indices is really small, only 72MB.
Here's the modified code I used for my tests.
#include<iomanip>
#include<stdio.h>
#include<stdlib.h>
#include<iostream>
#include<fstream>
#include<string.h>
#include<sstream>
#include<math.h>
#include<vector>
#include<algorithm>
#include<array>
#include<ctime>
// #include<boost/algorithm/string.hpp>
using namespace std;
struct datStr{
vector<string> colNames;
vector<string> lines;
vector<vector<unsigned short>> data;
};
void split(vector<unsigned short>& rowStr, string const& line)
{
// Record the start offset of every comma-separated token in the line.
string::size_type begin = 0;
string::size_type end = line.size();
for (string::size_type iter = begin; iter != end; ++iter)
{
if ( line[iter] == ',' )
{
rowStr.push_back(static_cast<unsigned short>(begin));
begin = iter + 1;
}
}
if (begin != end )
{
rowStr.push_back(static_cast<unsigned short>(begin));
}
}
datStr readBoost(string fileName)
{
datStr ds;
ifstream inFile;
inFile.open(fileName);
string line;
getline(inFile, line);
vector<string> colNames;
stringstream ss(line);
string item;
int i = 0;
vector<int> colTypeInt;
while(getline(ss, item, ','))
{
item.erase( remove( item.begin(), item.end(), ' ' ), item.end() );
ds.colNames.push_back(item);
}
int itr = 0;
while(getline(inFile, line))
{
ds.lines.push_back(line);
vector<unsigned short> rowStr;
split(rowStr, line);
ds.data.push_back(rowStr);
}
return ds;
}
int main(int argc, char** argv)
{
datStr ds = readBoost(argv[1]);
while(true)
{
}
}
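For completeness, here's a hypothetical accessor (not part of the original answer) showing how a token can be recovered from lines plus the start offsets that split records; token k ends one character before the next token's start (the comma), or at the end of the line:
// Hypothetical helper, assuming the datStr layout above: ds.lines[i] is
// the raw row and ds.data[i] holds the start offset of each token in it.
string getToken(datStr const& ds, size_t i, size_t k)
{
    string const& line = ds.lines[i];
    vector<unsigned short> const& starts = ds.data[i];
    string::size_type begin = starts[k];
    string::size_type end = (k + 1 < starts.size())
        ? starts[k + 1] - 1   // character before the next start is the comma
        : line.size();
    return line.substr(begin, end - begin);
}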
Your vector<vector<string>> suffers from the costs of indirection (pointer members to dynamically allocated memory), of housekeeping (members supporting size()/end()/capacity()), and of the housekeeping and rounding-up internal to the dynamic memory allocation functions. If you look at the first graph, titled "Real memory consumption for different string lengths", here, it suggests total overheads of around 40-45 bytes per string for a 32-bit app built with G++ 4.6.2, though an implementation could potentially get this as low as 4 bytes for strings of up to ~4 characters. Then there's the waste from vector overheads on top.
You can address the issue in any of several ways, depending on your input data and efficiency needs:
store vector<std::pair<string, Column_Index>>, where Column_Index is a class you write that records the offsets in the string where each field appears (a minimal sketch follows this list)
store vector<std::string> where column values are padded to known maximum widths, which will help most if the lengths are small, fixed and/or similar (e.g. date/times, small monetary amounts, ages)
memory map the file, then store offsets (but unquoting/unescaping is an issue - you could do that in-place, e.g. abc""def or abc\"def (whichever you support) -> abc"def)
with the last two approaches, you can potentially overwrite the character after each field with a NUL if that's useful to you, so you can treat the fields as "C"-style ASCIIZ NUL-terminated strings
if some/all fields contain values like, say, 1.23456789012345678... where the textual representation may be longer than a built-in binary type (float, double, int64_t), doing a conversion before storage makes sense
similarly, if there's a set of repeating values - like a field of what are logically enumeration identifiers - you can encode them as integers; or, if the values are repetitive but not known until runtime, you can create a bi-directional mapping from incrementing indices to values
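A minimal sketch of the first option; Column_Index is only named, not defined, in the answer, so this layout is an assumption:
#include <string>
#include <utility>
#include <vector>

// Hypothetical Column_Index: records where each field starts in the line.
struct Column_Index
{
    std::vector<unsigned> starts;

    static Column_Index scan(const std::string& line)
    {
        Column_Index idx;
        idx.starts.push_back(0);
        for (unsigned i = 0; i < line.size(); ++i)
            if (line[i] == ',')
                idx.starts.push_back(i + 1);  // field begins after the comma
        return idx;
    }
};

// One entry per CSV row: the raw line plus its field offsets.
using Row = std::pair<std::string, Column_Index>;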
A couple things come to mind:
You say your file has about 2 million lines, but you reserve space for 3 million strings for each column. Even if you only have one column, that's a lot of wasted space. If you have a bunch of columns, that's a ton of wasted space. It might be informative to see how much space difference it makes if you don't reserve at all.
string has a small* but nonzero amount of overhead that you're paying for every single field in your 2-million line file. If you really need to hold all the data in memory at once and it's causing problems to do so, this may actually be a case where you're better off just using char* instead of string. But I'd only resort to this if adjusting reserve doesn't help.
* The overhead due to metadata is small, but if the strings are allocating extra capacity for their internal buffers, that could really add up. See this recent question.
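To see those two costs on your own platform, a quick probe (exact numbers vary by standard library implementation):
#include <iostream>
#include <string>

int main()
{
    std::string s = "12.5";
    std::cout << sizeof(s) << " bytes of metadata, capacity for "
              << s.capacity() << " chars\n";
}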
Update: The problem with your update is that you are storing pointers to temporary std::string objects in datStr. By the time you get around to printing, those strings have been destroyed and your pointers are wild.
If you want a simple, safe way to store your strings in datStr that doesn't allocate more space than it needs, you could use something like this:
#include <cstring>  // strlen, strcpy
#include <string>
class TrivialReadOnlyString
{
private:
char* m_buffer;
public:
TrivialReadOnlyString(const std::string& src)
{
InitFrom(src.c_str());
}
TrivialReadOnlyString(const TrivialReadOnlyString& src)
{
InitFrom(src.m_buffer);
}
~TrivialReadOnlyString()
{
delete[] m_buffer;
}
const char* Get() const
{
return m_buffer;
}
private:
// Copy assignment is deliberately left undefined: the compiler-generated
// one would double-delete m_buffer (see the rule of three).
TrivialReadOnlyString& operator=(const TrivialReadOnlyString&);
void InitFrom(const char* src)
{
// Can switch to the safe(r) versions of these functions
// if you're using vc++ and it complains.
size_t length = strlen(src);
m_buffer = new char[length + 1];
strcpy(m_buffer, src);
}
};
There are a lot of further enhancements that could be made to this class, but I think it is sufficient for your program's needs as shown. This will fragment memory more than Blastfurnace's idea of storing the whole file in a single buffer, however.
If there is a lot of repetition in your data, you might also consider "folding" the repeats into a single object to avoid redundantly storing the same strings in memory over and over (flyweight pattern).
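A sketch of that folding idea: intern each distinct value once and keep small integer handles per cell. The names here are illustrative, not from the original answer:
#include <string>
#include <unordered_map>
#include <vector>

class StringPool
{
    std::unordered_map<std::string, unsigned> index_;
    std::vector<std::string> values_;
public:
    unsigned intern(const std::string& s)
    {
        auto it = index_.find(s);
        if (it != index_.end())
            return it->second;  // already stored once
        unsigned id = static_cast<unsigned>(values_.size());
        values_.push_back(s);
        index_.emplace(s, id);
        return id;
    }
    const std::string& lookup(unsigned id) const { return values_[id]; }
};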
Indulge me while I take a very different approach in answering your question. Others have already answered your direct question quite well, so let me provide another perspective entirely.
Do you realize that you could store that data in memory with a single allocation, plus one pointer for each line, or perhaps one pointer per cell?
On a 32-bit machine, that's 350MB + 8MB (or 8MB × the number of columns).
Did you know that it's easy to parallelize CSV parsing?
The problem you have is layers and layers of bloat. ifstream, stringstream, vector<vector<string>>, and boost::split are wonderful if you don't care about size or speed. All of that can be done more directly and at lower cost.
In situations like this, where size and speed do matter, you should consider doing things the manual way. Read the file using an API from your OS. Read it into a single memory location, and modify the memory in place by replacing commas or EOLs with '\0'. Store pointers to those C strings in your datStr and you're done.
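A sketch of that approach, using portable iostreams rather than an OS-specific read (my assumption; the answer suggests using the OS API directly): load the whole file once, split it in place, and keep pointers to the field starts.
#include <fstream>
#include <iterator>
#include <vector>

int main()
{
    std::ifstream in("file.csv", std::ios::binary);
    std::vector<char> buf((std::istreambuf_iterator<char>(in)),
                          std::istreambuf_iterator<char>());
    buf.push_back('\0');              // terminate the final field

    std::vector<const char*> fields;  // one pointer per cell, into buf
    const char* start = buf.data();
    for (char& c : buf)
        if (c == ',' || c == '\n')
        {
            c = '\0';                 // split in place
            fields.push_back(start);
            start = &c + 1;
        }
    if (*start != '\0')
        fields.push_back(start);      // last field if the file didn't end in \n
}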
You can write similar solutions for variants of the problem. If the file is too large for memory, you can process it in pieces. If you need to convert data to other formats like floating point, that's easy to do. If you'd like to parallelize parsing, that's far easier without the extra layers between you and your data.
Every programmer should be able to choose to use convenience layers or to use simpler methods. If you lack that choice, you won't be able to solve some problems.

C++ multidimension array storing strings

I wish to store for example 10 words into a multi-d array. This is my code.
char array[10][80]; //store 10 words, each 80 chars in length, get from file
int count = 0;
while ( ifs >> word ){ //while loop get from file input stream <ifstream>
array[count++][0] = word;
}
When I compile, there's an error: "invalid conversion from 'char*' to 'char'". ifs returns a char pointer. How can I successfully store into the array?
As this is C++, I would use the STL containers to avoid some char* limitations. word would have type std::string, array would have type std::vector<std::string> and you would push_back instead of assigning. The code looks like this:
#include <string>
#include <vector>
std::string word;
std::vector<std::string> array;
while(ifs >> word) {
array.push_back(word);
}
This is better than char* for a few reasons: you hide the dynamic allocation, you get words of truly variable size (limited only by available memory), and you have no issues if you need more than 10 words.
Edit: as mentioned in the comments, if you have a compiler that supports C++11, you can use emplace_back and std::move instead, which will move the string instead of copying it (emplace_back alone constructs the string in place).
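A sketch of that C++11 variant, assuming the same ifs stream as above:
#include <string>
#include <utility>
#include <vector>

std::string word;
std::vector<std::string> array;
while (ifs >> word) {
    array.emplace_back(std::move(word));  // steals word's buffer, no copy
}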
You could define a pointer into the array and walk its blocks one by one (or in whatever order you want); dynamic allocation is another option. Either way you end up working through pointers, which makes comparisons straightforward.
word is a char* (a string), but array[count++][0] stores a single char. You can change "array[count++][0] = word;" to "strcpy(array[count++], word);":
char array[10][80]; //store 10 words, each 80 chars in length, get from file
int count = 0;
while ( ifs >> word ){ //while loop get from file input stream <ifstream>
strcpy(array[count++], word);
}

Segmentation fault while using "ifstream"

I'm trying to get a part of text in a file.
I used "ifstream":
#include <fstream>
void func(){
char *myString;
ifstream infile;
infile.open("/proc/cpuinfo");
while (infile >> myString){
if (!strcmp(myString,"MHz"))
break;
}
}
and I get a "Segmentation fault". does anyone know why?
You have not allocated memory for myString. Use std::string. Or, better, any other tool: Python, Perl, or Unix utilities such as grep, awk, or sed.
Because the target value should be:
std::string myString;
and not char*. It's possible to use char*, but you have to ensure that it points to something big enough first. (In your case, it doesn't point anywhere—you forgot to initialize it.) And defining “big enough” is non-trivial, given that you don't know the size of the next word until you've read it.
There's a reason why C++ has a string class, you know. It's because using char pointers is cumbersome and error-prone.
infile >> myString
will read from the file into wherever myString points. And it is an uninitialized pointer; it points to some random garbage address.
If you absolutely do want to use char pointers instead of strings, you'll have to allocate a buffer you can read data into.
But the sane solution is to replace it entirely by std::string.
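A minimal sketch of that replacement, keeping the original loop's logic:
#include <fstream>
#include <string>

void func()
{
    std::ifstream infile("/proc/cpuinfo");
    std::string myString;
    while (infile >> myString)
    {
        if (myString == "MHz")  // operator== replaces strcmp
            break;
    }
}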
Because you did not allocate memory for myString. The quick solution to this is to use std::string instead of the C-style char* strings, which does memory management so you don't have to.
Here's why your error occurs:
When you declare char *myString you are creating a pointer to a character. However you do not initialize this pointer to anything, so it can point absolutely anywhere in memory. When you do infile >> myString you are going to write a bunch of characters at an unspecified location in memory. It just so happens that this location was a vital part of your program, which caused it to crash.
A plain array like char myString[256] compiles and works just as well, too.
#include <iostream>
#include <fstream>
#include <cstring> // for strcmp
using namespace std;
void func()
{
char myString[256] ;
ifstream infile;
infile.open("/proc/cpuinfo");
while ( infile >> myString ) // stops cleanly at EOF, unlike testing eof()
{
cout<<myString<<" \n";
if (! strcmp(myString,"MHz") )
{
break;
}
}
infile.close();
cout<<" \n";
}
int main()
{
func();
return 0;
}

How to read in a bunch of strings in C++?

I need to read in a bunch of strings without knowing in advance how many there are, and print them as they are read. So I decided to use while(!feof(stdin)) as an EOF indicator. Here is my code:
#include<stdio.h>
#include<stdlib.h>
#include<iostream>
using namespace std;
int main(void)
{
char* str;
std::cout<<"\nEnter strings:";
while(!feof(stdin))
{
std::cin>>str;
std::cout<<"\nThe string you entered is"<<str;
}
return 0;
}
The above code for some reason gives a segmentation fault after I enter the first string. Can someone suggest a fix?
You need to allocate some memory for the string you are reading to go into.
All you have currently is a pointer on the stack to some random memory area, which means that as you read characters they will stomp all over other data, or even memory you aren't allowed to write to - which then causes a segfault.
The problem with trying to allocate some memory is that you don't know how much to allocate until the string is read in... (You could just say "300 chars" and see if it's enough. But if it isn't you have the same problem of data corruption)
Better to use C++ std::string type.
str is a pointer to char. It doesn't point anywhere valid when you try to write there.
Try some form of new in C++, or, better yet since you're coding C++, use std::string.
str is declared as char* str, which means it isn't really a string (just a pointer to one, uninitialized BTW), and no space is allocated for the characters. That's why it segfaults. Since you are programming in C++, you can use
std::string str;
and it will work. Don't forget to #include <string>.
A segmentation fault occurs when an application tries to access a memory location that it isn't allowed to. In your case the problem is that you are dereferencing an uninitialized pointer: char* str;
A possible solution would be to change the pointer to an array with a suitable size.
Something like this may suffice:
#include<stdio.h>
#include<stdlib.h>
#include<iostream>
using namespace std;
int main(void)
{
char str[20];
cout<<"\nEnter strings:";
while(!feof(stdin))
{
cin.width(20); //set a limit
cin>>str;
cout<<"\nThe string you entered is"<<str;
}
return 0;
}
#include<iostream>
#include<string>
using namespace std;
int main(void)
{
std::string str;
std::cout<<"\nEnter strings:";
while(std::cin>>str)
{
std::cout<<"\nThe string you entered is "<<str;
}
return 0;
}
Should do the trick