Trying to copy lines from a text file to an array of strings (char**) - C++

This is my code for allocating memory for the array of strings:
FileReader::FileReader()
{
    readBuffer = (char**)malloc(100 * sizeof(char*));
    for (int i = 0; i < 100; i++)
    {
        readBuffer[i] = (char*)malloc(200 * sizeof(char));
    }
}
I'm allocating 100 strings for 100 lines, then allocating 200 chars for each string.
This is my code for reading the lines:
char** FileReader::ReadFile(const char* filename)
{
    int i = 0;
    File.open(filename);
    if (File.is_open())
    {
        while (getline(File, tmpString))
        {
            readBuffer[i] = (char*)tmpString.c_str();
            i++;
        }
        return readBuffer;
    }
}
and for printing:
for (int i = 0; i <= 5; i++)
{
    cout << fileCpy[i];
}
This is the output to the terminal:
[screenshot: the last line of the file printed for every entry]
As you can see, it just repeats the last line of the file, whereas the file actually reads:
This is test
line 2
line 3
line 4
line 5
Any idea what's going on? Why aren't the lines copying correctly?

Replace
readBuffer[i] = (char*)tmpString.c_str();
with
strcpy(readBuffer[i], tmpString.c_str());
Your version just saves a pointer to tmpString's internal buffer in your array. When tmpString changes, that pointer points at the new contents of tmpString (and that's the best possible outcome). strcpy, however, actually copies the characters of the string, which is what you want.
Of course, I'm sure it doesn't need saying, but you can avoid all the headache and complication like this
vector<string> readBuffer;
This way there are no more pointer problems and no more manual allocation or freeing of memory, and there are no limits: you aren't restricted to 100 lines or 200 characters per line. I'm sure you have a reason for doing things the hard way, but I wonder if it's a good reason.
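For illustration, a minimal sketch of what FileReader could look like built around std::vector<std::string> (this is my sketch, not code from the question; it assumes the class only needs to hand the lines back):

#include <fstream>
#include <string>
#include <vector>

class FileReader
{
public:
    // Read every line of the file into the internal buffer and return it.
    const std::vector<std::string>& ReadFile(const char* filename)
    {
        readBuffer.clear();
        std::ifstream file(filename);
        std::string line;
        while (std::getline(file, line))
            readBuffer.push_back(line);   // copies the characters, so nothing dangles
        return readBuffer;
    }

private:
    std::vector<std::string> readBuffer;
};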

First of all, you have to switch from C to C++.
Do not allocate memory like that; in modern C++ the idiomatic way to manage dynamic memory is through smart pointers from the <memory> header.
Anyway, you do not need dynamic allocation directly here. Encapsulate your data within the class and use std::vector<std::string> from the standard library. This vector is a dynamic array that handles all the memory management behind the scenes for you.
To read all the lines of a file:
#include <fstream>
#include <iostream>
#include <string>
#include <vector>

std::string item_name;
std::vector<std::string> your_buffer;
std::ifstream name_fileout;
name_fileout.open("test.txt");
while (std::getline(name_fileout, item_name))
{
    your_buffer.push_back(item_name);
    std::cout << item_name << '\n';   // newline so the echoed lines stay readable
}
name_fileout.close();


strcat error "Unhandled exception.."

My goal with my constructor is to:
open a file
read everything that exists between occurrences of a particular string ("%%%%%")
concatenate each row read into a variable (history)
add the final variable to a double pointer of type char (_stories)
close the file.
However, the program crashes when I use strcat, and I can't understand why; I have tried for many hours without result. :/
Here is the constructor code:
Texthandler::Texthandler(string fileName, int number)
    : _fileName(fileName), _number(number)
{
    char* history = new char[50];
    _stories = new char*[_number + 1]; // rows
    for (int j = 0; j < _number + 1; j++)
    {
        _stories[j] = new char[50];
    }
    _readBuf = new char[10000];
    ifstream file;
    int controlIndex = 0, whileIndex = 0, charCounter = 0;
    _storieIndex = 0;
    file.open("Historier.txt"); // filename
    while (file.getline(_readBuf, 10000))
    {
        // The "%%%%%" shouldn't be added to my variables
        if (strcmp(_readBuf, "%%%%%") == 0)
        {
            controlIndex++;
            if (controlIndex < 2)
            {
                continue;
            }
        }
        if (controlIndex == 1)
        {
            // Concatenate every line (_readBuf) to a complete history
            strcat(history, _readBuf);
            whileIndex++;
        }
        if (controlIndex == 2)
        {
            strcpy(_stories[_storieIndex], history);
            _storieIndex++;
            controlIndex = 1;
            whileIndex = 0;
            // Reset history variable
            history = new char[50];
        }
    }
    file.close();
}
I have also tried with stringstream, without results.
Edit: Forgot to post the error message:
"Unhandled exception at 0x6b6dd2e9 (msvcr100d.dll) in Step3_1.exe: 0xC00000005: Access violation writing location 0c20202d20."
Then a file named "strcat.asm" opens..
Best regards
Robert
You've had a buffer overflow somewhere on the stack, as evidenced by the fact that one of your pointers is 0c20202d20 (a few spaces and a - sign).
It's probably because:
char* history = new char[50];
is not big enough for what you're trying to put in there (or it's otherwise not set up correctly as a C string, terminated with a \0 character).
I'm not entirely certain why you think multiple buffers of up to 10K each can be concatenated into a 50-byte string :-)
strcat operates on null terminated char arrays. In the line
strcat(history, _readBuf);
history is uninitialised so isn't guaranteed to have a null terminator. Your program may read beyond the memory allocated looking for a '\0' byte and will try to copy _readBuf at this point. Writing beyond the memory allocated for history invokes undefined behaviour and a crash is very possible.
Even if you added a null terminator, the history buffer is much shorter than _readBuf. This makes memory over-writes very likely - you need to make history at least as big as _readBuf.
Alternatively, since this is C++, why don't you use std::string instead of C-style char arrays?
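For example, a rough sketch of the same loop with std::string and std::vector (not the original answer's code; the separator handling is simplified, so text before the first "%%%%%" would also be collected):

#include <fstream>
#include <string>
#include <vector>

std::vector<std::string> stories;   // replaces _stories
std::string history;                // grows as needed, no fixed 50-byte limit
std::string line;                   // replaces _readBuf
std::ifstream file("Historier.txt");
while (std::getline(file, line))
{
    if (line == "%%%%%")            // separator: finish the current story
    {
        if (!history.empty())
        {
            stories.push_back(history);
            history.clear();
        }
        continue;
    }
    history += line;                // no strcat, no uninitialised buffer
    history += '\n';
}
if (!history.empty())               // last story has no trailing separator
    stories.push_back(history);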

CFile writes some garbage value

I'm writing some data to a file, but it doesn't write it properly.
Code:
CString sFileName = "C:\\Test.txt";
CFile gpFile;
CString testarr[10] = {"Tom","Ger","FER","DER","SIL","REM","FWE","DWR","SFE","RPOP"};
if (!gpFile.Open(sFileName, CFile::modeCreate | CFile::modeWrite))
{
    AfxMessageBox(sFileName + (CString)" - File Write Error");
    return;
}
else
{
    gpFile.Write(testarr, 10);
}
AfxMessageBox("Completed");
gpFile.Close();
It shows the file as: [screenshot of the file contents, showing garbage characters instead of the names]
That's probably because you're using CFile incorrectly. The first parameter to CFile::Write should be a buffer whose bytes you'd like to write to the file. However, testarr is more like a "buffer of buffers", since each element of testarr is a string, and a string is itself a sequence of bytes.
What you would need to do instead is either concatenate the elements of testarr and then call CFile::Write, or (probably more practical) iterate over testarr, writing each string one at a time. For your particular example, the following should do what you're looking for:
for(int i = 0; i < 10; ++i)
{
    gpFile.Write(testarr[i], strlen(testarr[i]));
}
There may be some built-in way to accomplish this, but I'm not really familiar with MFC, so I won't be much help there.
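For what it's worth, one built-in option appears to be CStdioFile (a CFile subclass), whose WriteString member writes a null-terminated string. A sketch, with the open flags and the explicit newline as my assumptions:

CStdioFile gpFile;
if (gpFile.Open(sFileName, CFile::modeCreate | CFile::modeWrite | CFile::typeText))
{
    for (int i = 0; i < 10; ++i)
    {
        gpFile.WriteString(testarr[i]);   // writes the characters of the CString
        gpFile.WriteString(_T("\n"));     // text mode translates this to \r\n
    }
    gpFile.Close();
}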

very fast text file processing (C++)

I wrote an application which processes data on the GPU. The code works well, but I have the problem that the reading part of the input file (~3 GB, text) is the bottleneck of my application. (The read from the HDD is fast, but the line-by-line processing is slow.)
I read a line with getline(), copy line 1 to a vector and line 2 to a vector, and skip lines 3 and 4. And so on for the rest of the 11 million lines.
I tried several approaches to read the file in the best time possible:
Fastest method I found is using boost::iostreams::stream
Others were:
Reading the file as gzip, to minimize IO, but this is slower than reading it directly.
Copying the file to RAM with read(filepointer, chararray, length) and processing it with a loop to split the lines (also slower than boost).
Any suggestions how to make it run faster?
void readfastq(char *filename, int SRlength, uint32_t blocksize) {
    _filelength = 0;        //total datasets (each 4 lines)
    _SRlength = SRlength;   //length of the 2. line
    _blocksize = blocksize;
    boost::iostreams::stream<boost::iostreams::file_source> ins(filename);
    in = ins;
    readNextBlock();
}

void readNextBlock() {
    timeval start, end;
    gettimeofday(&start, 0);
    string name;
    string seqtemp;
    string garbage;
    string phredtemp;
    _seqs.empty();
    _phred.empty();
    _names.empty();
    _filelength = 0;
    //read only a part of the file, i.e. the first 4 million lines
    while (std::getline(in, name) && _filelength < _blocksize) {
        std::getline(in, seqtemp);
        std::getline(in, garbage);
        std::getline(in, phredtemp);
        if (seqtemp.size() != _SRlength) {
            if (seqtemp.size() != 0)
                printf("Error on read in fastq: size is invalid\n");
        } else {
            _names.push_back(name);
            for (int k = 0; k < _SRlength; k++) {
                //handle special letters
                if (seqtemp[k] == 'A') ...
                else {
                    _seqs.push_back(5);
                }
            }
            _filelength++;
        }
    }
}
EDIT:
The source-file is downloadable under https://docs.google.com/open?id=0B5bvyb427McSMjM2YWQwM2YtZGU2Mi00OGVmLThkODAtYzJhODIzYjNhYTY2
I changed the function readfastq to read the whole file, because of some pointer problems. So if you call readfastq, the blocksize (in lines) must be bigger than the number of lines to read.
SOLUTION:
I found a solution which gets the time to read in the file down from 60 sec to 16 sec. I removed the inner loop which handles the special characters and do this on the GPU instead. This decreases the read-in time and only minimally increases the GPU running time.
Thanks for your suggestions.
void readfastq(char *filename, int SRlength) {
    _filelength = 0;
    _SRlength = SRlength;
    size_t bytes_read, bytes_expected;
    FILE *fp;
    fp = fopen(filename, "r");
    fseek(fp, 0L, SEEK_END);      //go to the end of file
    bytes_expected = ftell(fp);   //get filesize
    fseek(fp, 0L, SEEK_SET);      //go to the beginning of the file
    fclose(fp);
    if ((_seqarray = (char *) malloc(bytes_expected/2)) == NULL) //allocate space for file
        err(EX_OSERR, "data malloc");
    string name;
    string seqtemp;
    string garbage;
    string phredtemp;
    boost::iostreams::stream<boost::iostreams::file_source> file(filename);
    while (std::getline(file, name)) {
        std::getline(file, seqtemp);
        std::getline(file, garbage);
        std::getline(file, phredtemp);
        if (seqtemp.size() != SRlength) {
            if (seqtemp.size() != 0)
                printf("Error on read in fastq: size is invalid\n");
        } else {
            _names.push_back(name);
            strncpy(&(_seqarray[SRlength*_filelength]), seqtemp.c_str(), seqtemp.length()); //do not handle special letters here, do on GPU
            _filelength++;
        }
    }
}
First, instead of reading the file into memory, you may work with file mappings. You just have to build your program as 64-bit so that 3 GB fits into your virtual address space (for a 32-bit application only 2 GB is accessible in user mode). Or, alternatively, you may map and process your file in parts.
Next, it sounds to me that your bottleneck is "copying a line to a vector". Dealing with vectors involves dynamic memory allocation (heap operations), which in a critical loop hurts performance very seriously. If this is the case, either avoid using vectors, or make sure they're declared outside the loop. The latter helps because when you reallocate/clear vectors they do not free memory.
Post your code (or a part of it) for more suggestions.
EDIT:
It seems that all your bottlenecks are related to string management.
std::getline(in, seqtemp); — reading into a std::string involves dynamic memory allocation.
_names.push_back(name); — this is even worse. First, the std::string is placed into the vector by value, which means the string is copied, so another dynamic allocation/deallocation happens. Moreover, when the vector is eventually reallocated internally, all the contained strings are copied again, with all the consequences.
I recommend using neither standard formatted file I/O functions (Stdio/STL) nor std::string. To achieve better performance you should work with pointers to strings (rather than copied strings), which is possible if you map the entire file. Plus you'll have to implement the file parsing (division into lines).
Like in this code:
class MemoryMappedFileParser
{
const char* m_sz;
size_t m_Len;
public:
struct String {
const char* m_sz;
size_t m_Len;
};
bool getline(String& out)
{
out.m_sz = m_sz;
const char* sz = (char*) memchr(m_sz, '\n', m_Len);
if (sz)
{
size_t len = sz - m_sz;
m_sz = sz + 1;
m_Len -= (len + 1);
out.m_Len = len;
// for Windows-format text files remove the '\r' as well
if (len && '\r' == out.m_sz[len-1])
out.m_Len--;
} else
{
out.m_Len = m_Len;
if (!m_Len)
return false;
m_Len = 0;
}
return true;
}
};
If _seqs and _names are std::vectors and you can guess their final size before processing the whole 3 GB of data, you can use reserve to avoid most of the memory re-allocation while pushing back new elements in the loop.
You should be aware of the fact that the vectors effectively produce another copy of parts of the file in main memory. So unless you have a main memory sufficiently large to store the text file plus the vector and its contents, you will probably end up with a number of page faults that also have a significant influence on the speed of your program.
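As an illustration of the reserve idea (reusing the member names from the question; the estimate is made up):

// Rough guess: ~11 million records of _SRlength characters each (numbers are illustrative).
const size_t expected_records = 11000000;
_names.reserve(expected_records);              // one big allocation instead of repeated regrows
_seqs.reserve(expected_records * _SRlength);   // same for the per-character vector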
You are apparently using <stdio.h>, since you are using getline.
Perhaps fopen-ing the file with fopen(path, "rm"); might help, because the m (a GNU extension) tells it to use mmap for reading.
Perhaps setting a big buffer (e.g. half a megabyte) with setbuffer could also help.
Probably, using the readahead system call (in a separate thread, perhaps) could help.
But all these are guesses. You should really measure things.
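For illustration, a sketch of the first two guesses (fopen's "m" mode and setbuffer are glibc/BSD extensions, and filename stands for your path, so this is platform-specific):

#include <stdio.h>

FILE *fp = fopen(filename, "rm");          // "m": glibc extension, asks stdio to mmap the file for reading
static char iobuf[512 * 1024];             // half a megabyte of stdio buffer
if (fp)
    setbuffer(fp, iobuf, sizeof iobuf);    // BSD/glibc extension; enlarge the stdio buffer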
General suggestions:
Code the simplest, most straight-forward, clean approach,
Measure,
Measure,
Measure,
Then if all else fails:
Read raw bytes (read(2)) in page-aligned chunks. Do so sequentially, so kernel's read-ahead plays to your advantage.
Re-use the same buffer to minimize cache flushing.
Avoid copying data, parse in place, pass around pointers (and sizes).
mmap(2)-ing [parts of the] file is another approach. This also avoids kernel-userland copy.
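A minimal sketch of that mmap approach on a POSIX system (error handling omitted; filename stands for the path you want to read):

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int fd = open(filename, O_RDONLY);
struct stat st;
fstat(fd, &st);
const char* data = static_cast<const char*>(
    mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0));
// data[0 .. st.st_size) now aliases the file; parse it in place, no copies into userland buffers.
// ... run the parser over data ...
munmap(const_cast<char*>(data), st.st_size);
close(fd);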
Depending on your disk speed, using a very fast decompression algorithm might help, like fastlz (there are at least two others that might be more efficient, but under the GPL, so the licence can be a problem).
Also, using C++ data structures and functions can increase the speed, as you can maybe achieve better compile-time optimization. Going the C way isn't always the fastest! In some bad conditions, using char* you need to parse the whole string to reach the \0, yielding disastrous performance.
For parsing your data, using boost::spirit::qi is probably also the most optimized approach: http://alexott.blogspot.com/2010/01/boostspirit2-vs-atoi.html

segmentation fault cause by delete[] while writing to a file

I'm trying to write to a file and I get a segmentation fault when I delete the allocated memory. I don't understand what the problem is, please help:
void writeToLog(string msg) {
    int len = msg.size() + 1;
    char *text = new char(len);
    strcpy(text, msg.c_str());
    char* p = text;
    for(int i = 0; i < len; i++){
        fputc(*p, _log);
        p++;
    }
    delete[] text; //THIS IS WHERE IT CRASHES
}
I also tried without the [ ], but then I get
*** glibc detected *** ./s: free(): invalid next size (fast): 0x09ef7308 ***
So what is the problem?
Thanks!
This:
char *text = new char(len);
should be:
char *text = new char[len + 1];
And this is all unnecessary anyway. Why are you doing it?
Well, delete[] doesn't balance new char(N), it balances new char[N]. The former creates a pointer to a single char and gives it the value N; the latter creates a pointer to an array of char with length N, and leaves the values undefined.
Of course, to write a std::string to a FILE *, why not just do:
fwrite(msg.c_str(), sizeof(char), msg.size() + 1, _log);
Note that this preserves the trailing null character; so does your original code.
char *text = new char(len);
allocates just one char. Try with:
char *text = new char[len];
Try this:
char *text = new char[len];
Then:
delete[] text;
Although the technical issue has been answered (mismatched new/delete pair), I still think you could benefit from some help here, so I propose to help you trim your code.
First: there would not be any issue if you simply did not perform a copy.
void writeToLog(string msg) {
    typedef std::string::const_iterator iterator;

    for(iterator it = msg.begin(), end = msg.end(); it != end; ++it) {
        fputc(*it, _log);
    }
}
Note how I reworked the code to use C++ iterators instead of a mix of pointers and indices.
Second: what is this fputc call?
You should not need to use a FILE* in your code. If you do, you are likely to get it wrong too and forget to close it, or close it twice etc...
The Standard Library provides the Streams collection to handle input and output, and for a log file the ofstream class seems particularly adapted.
std::ofstream _log("myLogFile");

void writeToLog(std::string const& msg) { // by reference (no copy)
    _log << msg;
}
Note how much simpler it is? And you cannot forget to close the file either, because if you do forget, it'll be closed when _log is destroyed anyway.
Of course, at this point one might decide that it is superfluous to have a function at all. However, such a function allows you to prefix the message, typically with timestamps / PID / thread ID or other decorations, so it's still nice.
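For example, a sketch of such a timestamp prefix (the strftime format and use of the C clock are just one option, not something the original answer specifies):

#include <ctime>
#include <fstream>
#include <string>

std::ofstream _log("myLogFile");

void writeToLog(std::string const& msg) {
    std::time_t now = std::time(NULL);
    char stamp[32];
    // e.g. "2011-10-30 14:05:12 "; strftime returns 0 on error, in which case the stamp is skipped
    if (std::strftime(stamp, sizeof stamp, "%Y-%m-%d %H:%M:%S ", std::localtime(&now)))
        _log << stamp;
    _log << msg << '\n';
}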

Read variable-length records from a buffer - weird memory issues

I'm trying to implement an I/O-intensive quicksort (C++ qsort) on a very large dataset. In the interests of speed, I'd like to read in a chunk of data at a time into a buffer and then use qsort to sort it inside the buffer. (I am currently working with text files but would like to move to binary soon.) However, my data is composed of variable-length records, and qsort needs to be told the length of the record in order to sort. Is there any way to standardize this? The only thing I could think of was rather convoluted: my program currently reads from the buffer until it hits a linefeed character (10 in ASCII), transferring each character over to another array. When it finds a linefeed (the delimiter in the input file), it fills the number of spaces remaining in the buffer for that record (record size is set to 30) with null characters. This way, I should end up with a buffer full of fixed-size records to give qsort.
I know there are several problems with my approach, one being that it's just clumsy, another that the record size might conceivably be larger than 30, but is generally much less. Is there a better way of doing this?
As well, my current code doesn't even work. When I debug it, it seems to be transferring characters from one buffer to the other, but when I try to print out the buffer, it contains only the first record.
Here is my code:
FILE *fp;
unsigned char *buff;
unsigned char *realbuff;
FILE *inputFiles[NUM_INPUT_FILES];
buff = (unsigned char *) malloc(2048);
realbuff = (unsigned char *) malloc(NUM_RECORDS * RECORD_SIZE);
fp = fopen("postings0.txt", "r");
if(fp)
{
    fread(buff, 1, 2048, fp);
    /*for(int i=0; i < 30; i++)
        cout << buff[i] << endl;*/
    int y = 0;
    int recordcounter = 0;
    //cout << buff;
    for(int i = 0; i < 100; i++)
    {
        if(buff[i] != char(10))
        {
            realbuff[y] = buff[i];
            y++;
            recordcounter++;
        }
        else
        {
            if(recordcounter < RECORD_SIZE)
                for(int j = recordcounter; j < RECORD_SIZE; j++)
                {
                    realbuff[y] = char(0);
                    y++;
                }
            recordcounter = 0;
        }
    }
    cout << realbuff << endl;
    cout << buff;
}
else
    cout << "sorry";
Thank you very much,
bsg
The qsort function can only work on fixed-length records (like you say). In order to sort variable-length records, you need an array of pointers to them and then have qsort sort the array of pointers. This may be more efficient too, as pointers are much faster to move around than large chunks of data.
The same goes for std::sort, which would be recommended because it is type-safe. Just be sure to supply a comparison predicate (a less-than function) taking pointers as its arguments as the third parameter.
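A sketch of that idea (recordLess and the records vector are illustrative names; it assumes each record is a null-terminated C string):

#include <algorithm>
#include <cstring>
#include <vector>

// Comparison predicate for std::sort: order two records by their text.
bool recordLess(const char* a, const char* b)
{
    return std::strcmp(a, b) < 0;
}

std::vector<const char*> records;
// ... push_back a pointer to the start of each variable-length record ...
std::sort(records.begin(), records.end(), recordLess);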
How about using C++ file streams for parsing your file?
Check out this example (website name is strange, no offense!!), which reads the records into an STL vector,
and then you can use the STL sort algorithm.