Why did getline not reach the end of the file in C++?

I have a simple function that edits an HTML file. All it does is replace some text in the file. This is the code for the function:
void edit_file(char* data1, char* data1_token, char* data2, char* data2_token) {
    std::ifstream filein("datafile.html");
    std::ofstream fileout("temp.html");
    std::string line;
    //bool found = false;
    while (std::getline(filein, line))
    {
        std::size_t for_data1 = line.find(data1_token);
        std::size_t for_data2 = line.find(data2_token);
        if (for_data1 != std::string::npos) {
            line.replace(for_data1, 11, data1);
        }
        if (for_data2 != std::string::npos) {
            line.replace(for_data2, 19, data2);
        }
        fileout << line;
    }
    filein.close();
    fileout.close();
}
void edit_file_and_copy_back(char* data1, char* data1_token, char* data2, char* data2_token)
{
    edit_file(data1, data1_token, data2, data2_token);
    MoveFileEx("temp.html", "datafile.html", MOVEFILE_REPLACE_EXISTING);
}
For various reasons I call this function multiple times, but it only works the first time; on later calls, getline stops somewhere in the middle of the file.
The replace logic itself works without any problems (it works the first time). The second time, however, the while loop ends after reading only some of the lines.
I have tried filein.close() and the file.seekg function, but neither fixes the problem. What causes the incorrect execution, and how do I solve it?

Buffering is biting you. Here's what you're doing:
1. Opening datafile.html for read and temp.html for write
2. Copying lines from datafile.html to temp.html
3. When you're done, without closing or flushing temp.html, you open a separate handle to temp.html for read (which won't share the buffer with the original handle, so unflushed data isn't seen)
4. You open a separate handle to datafile.html for write, and copy from the second temp.html handle to the new datafile.html handle
But the copy in steps 3 & 4 is missing the data still in the buffer for temp.html opened in step 1. And each time you call this, if the input and output buffer sizes don't match, or the iostream implementation you're using doesn't flush until you write buffer size + 1 bytes, you'll drop up to another buffer's worth of data.
Change your code so the scope of the original handles ends before you call the copy back function:
void write_back_file(); // forward declaration so edit_file can call it

void edit_file(char* data1, char* data1_token, char* data2, char* data2_token) {
    { // New scope; when it ends, files are closed
        std::ifstream filein("datafile.html");
        std::ofstream fileout("temp.html");
        std::string line;
        //bool found = false;
        while (std::getline(filein, line))
        {
            std::size_t for_data1 = line.find(data1_token);
            std::size_t for_data2 = line.find(data2_token);
            if (for_data1 != std::string::npos) {
                line.replace(for_data1, 11, data1);
            }
            if (for_data2 != std::string::npos) {
                line.replace(for_data2, 19, data2);
            }
            fileout << line << '\n'; // getline strips the newline; write it back
        }
    } // End of new scope, files closed at this point
    write_back_file();
}

void write_back_file() {
    std::ifstream filein("temp.html");
    std::ofstream fileout("datafile.html");
    fileout << filein.rdbuf();
}
Mind you, this still has potential errors: if both data tokens are found and data1_token occurs before data2_token, the index for data2_token will be stale by the time you use it. You need to delay the scan for data2_token until after you scan and replace data1_token. (Or, if the data1_token replacement might create a data2_token that shouldn't be replaced, compare the hit indices and perform the replacement for the later hit first, so the earlier index remains valid.)
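For instance, a minimal sketch of the delayed second scan (the hard-coded replacement lengths are kept from the original code):

std::size_t for_data1 = line.find(data1_token);
if (for_data1 != std::string::npos) {
    line.replace(for_data1, 11, data1);
}
// Only now search for the second token, against the already-modified line,
// so the index we get is valid for the string we call replace() on.
std::size_t for_data2 = line.find(data2_token);
if (for_data2 != std::string::npos) {
    line.replace(for_data2, 19, data2);
}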
Similarly, from a performance and atomicity perspective, you probably don't want to copy from temp.html back to datafile.html; other threads and processes would be able to see the incomplete datafile.html in that case, rather than seeing the old version atomically replaced with the new version. It also means you need to worry about removing temp.html at some point. Typically, you just move the temporary file over the original file:
rename("temp.html", "datafile.html");
If you're on Windows, that won't work atomically to replace an existing file; you'd need to use MoveFileEx to force replacing of existing files:
MoveFileEx("temp.html", "datafile.html", MOVEFILE_REPLACE_EXISTING);

void edit_file(char* data1, char* data1_token, char* data2, char* data2_token) {
    ifstream filein("datafile.html");
    ofstream fileout("temp.html");
    // STUFF
    // At this point the two streams are still open and
    // may not have been flushed to the file system.
    // You now call this function.
    write_back_file();
}
void write_back_file() {
    // You are opening files that are already open.
    // Do not think there are any guarantees about the content at this point.
    // So what is copied is questionable.
    ifstream filein("temp.html");
    ofstream fileout("datafile.html");
    fileout << filein.rdbuf();
}
Do not call write_back_file() from within edit_file(). Rather, provide a wrapper that calls both:
void edit_file_and_copy_back(char* data1, char* data1_token, char* data2, char* data2_token)
{
    edit_file(data1, data1_token, data2, data2_token);
    write_back_file();
}


compare data in a JSON::Value variable and then update to file

I am trying to update data in two JSON files, providing the filename at run time.
This is the updateToFile function, which writes the data stored in a JSON variable to two different files from two different threads.
void updateToFile()
{
    while(runInternalThread)
    {
        std::unique_lock<std::recursive_mutex> invlock(mutex_NodeInvConf);
        FILE * pFile;
        std::string conff = NodeInvConfiguration.toStyledString();
        pFile = fopen(filename.c_str(), "wb");
        std::ifstream file(filename);
        fwrite(conff.c_str(), sizeof(char), conff.length(), pFile);
        fclose(pFile);
        sync();
    }
}
thread 1:
std::thread nt(&NodeList::updateToFile,this);
thread 2:
std::thread it(&InventoryList::updateToFile,this);
Right now it updates the files even if no data has changed since the previous execution. I want to write the file only if there is a change compared to what was previously stored; if there is no change, it should print that the data is the same.
Can anyone please help with this?
Thanks.
You can check if it has changed before writing.
void updateToFile()
{
    std::string previous;
    while(runInternalThread)
    {
        std::unique_lock<std::recursive_mutex> invlock(mutex_NodeInvConf);
        std::string conf = NodeInvConfiguration.toStyledString();
        if (conf != previous)
        {
            // TODO: error handling missing like in OP
            std::ofstream file(filename);
            file.write(conf.c_str(), conf.length());
            file.close();
            previous = std::move(conf);
            sync();
        }
    }
}
However, such constant polling in a loop is likely inefficient. You may add sleeps to make it less busy. Another option is for NodeInvConfiguration itself to track whether it has changed, and to clear that flag when storing, as sketched below.
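A minimal sketch of that flag-based approach (the TrackedConfig wrapper and its member names are hypothetical, not part of jsoncpp; runInternalThread, mutex_NodeInvConf and filename are the OP's globals):

#include <json/json.h>   // jsoncpp; the header path may differ per install
#include <atomic>
#include <chrono>
#include <fstream>
#include <mutex>
#include <string>
#include <thread>

extern std::atomic<bool> runInternalThread;    // OP's loop condition
extern std::recursive_mutex mutex_NodeInvConf;
extern std::string filename;

// Hypothetical wrapper: every mutation sets a dirty flag,
// so the writer thread touches the disk only after a change.
struct TrackedConfig {
    Json::Value value;
    std::atomic<bool> changed{false};

    void set(const std::string& key, const Json::Value& v) {
        value[key] = v;
        changed = true;               // mark dirty on mutation
    }
};

void updateToFile(TrackedConfig& cfg)
{
    while (runInternalThread) {
        if (cfg.changed.exchange(false)) {      // atomically test-and-clear
            std::unique_lock<std::recursive_mutex> invlock(mutex_NodeInvConf);
            std::ofstream file(filename);
            file << cfg.value.toStyledString();
        }
        // avoid a busy spin between checks
        std::this_thread::sleep_for(std::chrono::milliseconds(100));
    }
}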

C++ rapidxml access violation after a certain amount of time (visual studio 2013)

I have been using the excellent rapidxml library to read and use information from XML files to hold cutscene information for a game I am programming in C++. I have run into an odd problem.
I start by loading the XML file into a rapidxml::xml_document<>* from a std::ifstream* XMLFile:
std::stringstream buffer; //Create a string buffer to hold the loaded information
buffer << XMLFile->rdbuf(); //Pass the ifstream buffer to the string buffer
XMLFile->close(); //close the ifstream
std::string content(buffer.str()); //get the buffer as a string
buffer.clear();
cutScene = new rapidxml::xml_document<>;
cutScene->parse<0>(&content[0]);
root = cutScene->first_node();
My cutscene XML file is made up of "parts", and at the beginning I want to load all of those parts (which are all xml_nodes) into a vector:
//Load parts
if (parts->size() == 0) {
    rapidxml::xml_node<>* partNode = root->first_node("part");
    parts->push_back(partNode);
    for (int i = 1; i < numParts; i++) {
        parts->push_back(partNode->next_sibling());
        printf("name of part added at %i: %s.\n", i, parts->at(i)->name());
    }
}
That last line prints "name of part added at 1: part" to the console.
The problem is that whenever I try to access the vector and print the name of that same specific part outside this method, the name can be accessed but is just a random string of letters and numbers. It seems that for some reason rapidxml is deleting everything after my load method completes. I am still new to posting on Stack Overflow, so if you need more information just ask. Thanks!
RapidXml is an in-situ XML parser: it alters the original string buffer (content in your case) to create null-terminated tokens such as element and attribute names. Secondly, the lifespan of the tree nodes referenced by the parts items is tied to the xml_document (cutScene) instance.
Keep the cutScene and content instances together with the vector; this will keep the vector items alive as well.
e.g:
struct SceneData
{
    std::vector<char> content;
    rapidxml::xml_document<> cutScene;
    std::vector<rapidxml::xml_node<>*> parts;
    bool Parse(const std::string& text);
};

bool SceneData::Parse(const std::string& text)
{
    content.reserve(text.length() + 1);
    content.assign(text.begin(), text.end());
    content.push_back('\0');
    parts.clear();
    try
    {
        cutScene.parse<0>(content.data());
    }
    catch (rapidxml::parse_error& err)
    {
        return false;
    }
    // Load parts; walk the sibling chain instead of relying on a separate count.
    rapidxml::xml_node<>* root = cutScene.first_node();
    for (rapidxml::xml_node<>* partNode = root->first_node("part");
         partNode != nullptr;
         partNode = partNode->next_sibling())
    {
        parts.push_back(partNode);
    }
    return true;
}
EDITED
The parser expects a sequence of characters terminated by '\0' as input. Since the buffer referenced by &string[0] is not guaranteed to be null-terminated, it is recommended to copy the string content into a std::vector<char>.

How to check whether ifstream is end of file in C++

I need to read all blocks of one large file (about 10GB) sequentially. The file contains many floats with a few strings, like this (each item separated by '\n'):
6.292611
-1.078219E-266
-2.305673E+065
sod;eiwo
4.899747e-237
1.673940e+089
-4.515213
I read MAX_NUM_PER_FILE items each time, process them, and write them to another file, but I don't know when the ifstream has reached its end.
Here is my code:
ifstream file_input(path_input); //my file is a text file, but i tried both text and binary mode, both failed.
if(file_input)
{
    file_input.seekg(0,file_input.end);
    unsigned long long length = file_input.tellg(); //get file size
    file_input.seekg(0,file_input.beg);
    char * buffer = new char [MAX_NUM_PER_FILE+MAX_NUM_PER_LINE];
    int i=1,j;
    char c,tmp[3];
    while(file_input.tellg()<length)
    {
        file_input.read(buffer,MAX_NUM_PER_FILE);
        j=MAX_NUM_PER_FILE;
        while(file_input.get(c)&&c!='\n')
            buffer[j++]=c; //get a complete item
        //process with buffer...
        itoa(i++,tmp,10); //int2char
        string out_name="out"+string(tmp)+".txt";
        ofstream file_output(out_name);
        file_output.write(buffer,j);
        file_output.close();
    }
    file_input.close();
    delete[] buffer;
}
My code goes wrong: length ends up bigger than the real file size. I have tried file_input.good() and !file_input.eof(); they didn't work. getline(file_input, s) works, but it is much slower than read. I want to use read, but I don't know how to check whether the ifstream has hit end-of-file.
I do my work on Windows 7 with VS2010.
I have searched, but found no answer about it; How to open a file using ifstream and keep reading it until the end doesn't answer my question.
Update: problem solved
Hi everyone, I have figured out that it was my fault. Both while(file_input.tellg()<length) and while(file_input.peek()!=EOF) work fine! while(file_input.peek()!=EOF) is recommended.
The extra items written past the end of the file were the leftover items in the buffer from the previous iteration.
Here is the correct code:
ifstream file_input(path_input);
if(file_input)
{
    //file_input.seekg(0,file_input.end);
    //unsigned long long length = file_input.tellg(); //get file size
    //file_input.seekg(0,file_input.beg);
    char * buffer = new char [MAX_NUM_PER_FILE+MAX_NUM_PER_LINE];
    int i=1,j;
    char c,tmp[3];
    while(file_input.peek()!=EOF)
    {
        memset(buffer,0,sizeof(char)*(MAX_NUM_PER_FILE+MAX_NUM_PER_LINE)); //clear first!
        file_input.read(buffer,MAX_NUM_PER_FILE);
        j=MAX_NUM_PER_FILE;
        while(file_input.get(c)&&c!='\n')
            buffer[j++]=c;
        itoa(i++,tmp,10); //int2char
        string out_name="out"+string(tmp)+".txt";
        ofstream file_output(out_name);
        file_output.write(buffer,strlen(buffer)); //use the actual buffer length instead of j
        file_output.close();
    }
    file_input.close();
    delete[] buffer;
}
while( file_input.peek() != EOF )
{
    // code
}
Basically peek() will read the next char without extracting it.
So you can simply compare it to EOF.
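As an alternative sketch (my addition, not from the answers above): with unformatted read() you can also drop the EOF test entirely and rely on gcount(), which reports how many characters the last read actually extracted, so the final short block is handled the same way as the full ones:

#include <fstream>

int main()
{
    std::ifstream in("data.txt", std::ios::binary); // illustrative file name
    char buffer[4096];
    // read() fails on the final short block, but gcount() still
    // reports how many characters that last call extracted
    while (in.read(buffer, sizeof buffer) || in.gcount() > 0)
    {
        std::streamsize n = in.gcount();
        // ... process buffer[0 .. n) ...
    }
}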

very fast text file processing (C++)

I wrote an application which processes data on the GPU. The code works well, but I have the problem that the reading part of the input file (~3GB, text) is the bottleneck of my application. (The read from the HDD is fast, but the line-by-line processing is slow.)
I read a line with getline(), copy line 1 to a vector, copy line 2 to a vector, and skip lines 3 and 4. And so on for the rest of the 11 million lines.
I tried several approaches to read the file in the best time possible:
The fastest method I found is using boost::iostreams::stream.
Others were:
Reading the file as gzip, to minimize IO, which was slower than reading it directly.
Copying the file to RAM via read(filepointer, chararray, length) and processing it with a loop to distinguish the lines (also slower than boost).
Any suggestions how to make it run faster?
void readfastq(char *filename, int SRlength, uint32_t blocksize) {
    _filelength = 0; //total datasets (each 4 lines)
    _SRlength = SRlength; //length of the 2. line
    _blocksize = blocksize;
    boost::iostreams::stream<boost::iostreams::file_source> ins(filename);
    in = ins;
    readNextBlock();
}
void readNextBlock() {
    timeval start, end;
    gettimeofday(&start, 0);
    string name;
    string seqtemp;
    string garbage;
    string phredtemp;
    _seqs.empty();
    _phred.empty();
    _names.empty();
    _filelength = 0;
    //read only a part of the file i.e the first 4mio lines
    while (std::getline(in, name) && _filelength < _blocksize) {
        std::getline(in, seqtemp);
        std::getline(in, garbage);
        std::getline(in, phredtemp);
        if (seqtemp.size() != _SRlength) {
            if (seqtemp.size() != 0)
                printf("Error on read in fastq: size is invalid\n");
        } else {
            _names.push_back(name);
            for (int k = 0; k < _SRlength; k++) {
                //handle special letters
                if (seqtemp[k] == 'A') ...
                else {
                    _seqs.push_back(5);
                }
            }
            _filelength++;
        }
    }
} // note: this closing brace was missing in the original listing
EDIT:
The source file is downloadable under https://docs.google.com/open?id=0B5bvyb427McSMjM2YWQwM2YtZGU2Mi00OGVmLThkODAtYzJhODIzYjNhYTY2
I changed the function readfastq to read the file, because of some pointer problems. So if you call readfastq, the blocksize (in lines) must be bigger than the number of lines to read.
SOLUTION:
I found a solution which got the read-in time for the file from 60 seconds down to 16 seconds. I removed the inner loop which handles the special characters and do this on the GPU instead. This decreases the read-in time and only minimally increases the GPU running time.
Thanks for your suggestions.
void readfastq(char *filename, int SRlength) {
    _filelength = 0;
    _SRlength = SRlength;
    size_t bytes_read, bytes_expected;
    FILE *fp;
    fp = fopen(filename, "r");
    fseek(fp, 0L, SEEK_END); //go to the end of file
    bytes_expected = ftell(fp); //get filesize
    fseek(fp, 0L, SEEK_SET); //go to the beginning of the file
    fclose(fp);
    if ((_seqarray = (char *) malloc(bytes_expected/2)) == NULL) //allocate space for file
        err(EX_OSERR, "data malloc");
    string name;
    string seqtemp;
    string garbage;
    string phredtemp;
    boost::iostreams::stream<boost::iostreams::file_source> file(filename);
    while (std::getline(file, name)) {
        std::getline(file, seqtemp);
        std::getline(file, garbage);
        std::getline(file, phredtemp);
        if (seqtemp.size() != SRlength) {
            if (seqtemp.size() != 0)
                printf("Error on read in fastq: size is invalid\n");
        } else {
            _names.push_back(name);
            strncpy(&(_seqarray[SRlength*_filelength]), seqtemp.c_str(), seqtemp.length()); //do not handle special letters here, do it on GPU
            _filelength++;
        }
    }
}
First, instead of reading the file into memory, you may work with file mappings. You just have to build your program as 64-bit to fit 3GB of virtual address space (for a 32-bit application, only 2GB is accessible in user mode). Or alternatively you may map and process your file in parts.
Next, it sounds to me that your bottleneck is "copying a line to a vector". Dealing with vectors involves dynamic memory allocation (heap operations), which in a critical loop hurts performance very seriously. If this is the case, either avoid using vectors, or make sure they're declared outside the loop. The latter helps because when you reallocate/clear vectors they do not free memory.
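For instance, a minimal sketch of hoisting the reused objects out of the loop (the names and the reserve size are illustrative):

std::string line;          // constructed once, reused every iteration
std::vector<int> fields;   // illustrative per-line scratch vector
line.reserve(256);         // assumption: a typical line length
while (std::getline(in, line)) {
    fields.clear();        // clear() keeps the allocated capacity
    // ... parse line into fields ...
}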
Post your code (or a part of it) for more suggestions.
EDIT:
It seems that all your bottlenecks are related to string management.
std::getline(in, seqtemp); reading into a std::string involves dynamic memory allocation.
_names.push_back(name); This is even worse. First, the std::string is placed into the vector by value, meaning the string is copied, so another dynamic allocation/freeing happens. Moreover, when the vector is eventually reallocated internally, all the contained strings are copied again, with all the consequences.
I recommend using neither the standard formatted file I/O functions (stdio/STL) nor std::string. To achieve better performance you should work with pointers to strings (rather than copied strings), which is possible if you map the entire file. Plus, you'll have to implement the file parsing (division into lines) yourself.
Like in this code:
class MemoryMappedFileParser
{
    const char* m_sz;
    size_t m_Len;

public:
    struct String {
        const char* m_sz;
        size_t m_Len;
    };

    bool getline(String& out)
    {
        out.m_sz = m_sz;
        const char* sz = (char*) memchr(m_sz, '\n', m_Len);
        if (sz)
        {
            size_t len = sz - m_sz;
            m_sz = sz + 1;
            m_Len -= (len + 1);
            out.m_Len = len;
            // for Windows-format text files remove the '\r' as well
            if (len && '\r' == out.m_sz[len-1])
                out.m_Len--;
        } else
        {
            out.m_Len = m_Len;
            if (!m_Len)
                return false;
            m_Len = 0;
        }
        return true;
    }
};
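The class above assumes m_sz/m_Len already point at the mapped file; a possible POSIX setup (my sketch, not part of the original answer) would look like:

#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>

// Sketch: map a whole file read-only and hand it to the parser.
// Error handling is reduced to early returns for brevity.
const char* map_file(const char* path, size_t& len_out)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0) return nullptr;
    struct stat st;
    if (fstat(fd, &st) != 0) { close(fd); return nullptr; }
    void* p = mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd); // the mapping stays valid after close
    if (p == MAP_FAILED) return nullptr;
    len_out = st.st_size;
    return static_cast<const char*>(p);
}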
If _seqs and _names are std::vectors and you can guess their final size before processing the whole 3GB of data, you can use reserve to avoid most of the memory reallocation while pushing back new elements in the loop.
You should be aware of the fact that the vectors effectively produce another copy of parts of the file in main memory. So unless you have a main memory sufficiently large to store the text file plus the vector and its contents, you will probably end up with a number of page faults that also have a significant influence on the speed of your program.
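Tying those two notes together, a small reserve sketch (the record estimate is an assumption derived from the stated ~11 million lines, 4 lines per record):

// reserving up front turns many incremental reallocations
// into one allocation per vector
const std::size_t expected_records = 11000000 / 4; // assumption: 4 lines per record
_names.reserve(expected_records);
_seqs.reserve(expected_records * SRlength);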
You are apparently using <stdio.h>, since you are using getline.
Perhaps fopen-ing the file with fopen(path, "rm"); might help, because the m tells the library (it is a GNU extension) to use mmap for reading.
Perhaps setting a big buffer (i.e. half a megabyte) with setbuffer could also help.
Probably, using the readahead system call (in a separate thread, perhaps) could help.
But all this is guesswork. You should really measure things.
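A small sketch of the buffer suggestion, using the standard setvbuf (portable, unlike the BSD-specific setbuffer; the file name is illustrative):

#include <cstdio>
#include <vector>

int main() {
    std::FILE* fp = std::fopen("input.txt", "r"); // hypothetical input file
    if (!fp) return 1;
    std::vector<char> buf(512 * 1024);            // half a megabyte, as suggested
    std::setvbuf(fp, buf.data(), _IOFBF, buf.size()); // must be called before any I/O
    // ... read from fp ...
    std::fclose(fp);
}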
General suggestions:
Code the simplest, most straight-forward, clean approach,
Measure,
Measure,
Measure,
Then if all else fails:
Read raw bytes (read(2)) in page-aligned chunks. Do so sequentially, so the kernel's read-ahead plays to your advantage.
Re-use the same buffer to minimize cache flushing.
Avoid copying data, parse in place, pass around pointers (and sizes).
mmap(2)-ing [parts of the] file is another approach. This also avoids the kernel-to-userland copy.
Depending on your disk speed, using a very fast decompression algorithm might help, like fastlz (there are at least two others that might be more efficient, but under the GPL, so the licence can be a problem).
Also, using C++ data structures and functions can increase the speed, as you can maybe achieve better compile-time optimization. Going the C way isn't always the fastest! In some bad conditions, using char* means you need to parse the whole string to reach the \0, yielding disastrous performance.
For parsing your data, using boost::spirit::qi is also probably the most optimized approach: http://alexott.blogspot.com/2010/01/boostspirit2-vs-atoi.html
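A minimal qi sketch of the kind of numeric parsing the linked article benchmarks (the input string is illustrative):

#include <boost/spirit/include/qi.hpp>
#include <iostream>
#include <string>

int main() {
    namespace qi = boost::spirit::qi;
    std::string line = "-2.305673E+065";  // illustrative input
    double value = 0.0;
    std::string::const_iterator first = line.begin(), last = line.end();
    // qi::parse advances `first`; success plus full consumption
    // means the whole line was a valid float
    if (qi::parse(first, last, qi::double_, value) && first == last)
        std::cout << "float: " << value << '\n';
    else
        std::cout << "not a float: " << line << '\n';
}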

Newline character in Text Document?

I wrote a pretty simple function that reads in possible player names and stores them in a map for later use. In the file, each line is a new possible player name, but for some reason it seems like all but the last name have some invisible newline character after them. My printout shows it like this...
nameLine = Georgio
Name: Georgio
0
nameLine = TestPlayer
Name: TestPlayer 0
Here is the actual code. I assume I need to strip something out, but I am not sure what I need to check for.
bool PlayerManager::ParsePlayerNames()
{
    FileHandle_t file;
    file = filesystem->Open("names.txt", "r", "MOD");
    if(file)
    {
        int size = filesystem->Size(file);
        char *line = new char[size + 1];
        while(!filesystem->EndOfFile(file))
        {
            char *nameLine = filesystem->ReadLine(line, size, file);
            if(strcmp(nameLine, "") != 0)
            {
                Msg("nameLine = %s\n", nameLine);
                g_PlayerNames.insert(std::pair<char*, int>(nameLine, 0));
            }
            for(std::map<char*,int>::iterator it = g_PlayerNames.begin(); it != g_PlayerNames.end(); ++it)
            {
                Msg("Name: %s %d\n", it->first, it->second);
            }
        }
        return true;
    }
    Msg("[PlayerManager] Failed to find the Player Names File (names.txt)\n");
    filesystem->Close(file);
    return false;
}
You really need to consider using iostreams and std::string. The above code is so much simpler if you use the C++ constructs available to you.
Problems with your code:
Why do you allocate a buffer for a single line which is the size of the whole file?
You don't clean up this buffer!
How does ReadLine fill the line buffer?
Presumably nameLine points to the beginning of the line buffer. If so, the key stored in the std::map is a pointer (char*) rather than a string as you were expecting, and it is the same pointer every time. If it is different (i.e. somehow you read a line and then move the pointer along for each name), then std::map will contain an entry per player, but you'll not be able to find an entry by player name, because the comparison will be a pointer comparison rather than the string comparison you are expecting!
I suggest that you look at implementing this using iostreams; here is some example code (without any testing):
ifstream fin("names.txt");
std::string line;
while (fin.good())
{
std::getline(fin, line); // automatically drops the new line character!
if (!line.empty())
{
g_PlayerNames.insert(std::pair<std::string, int>(line, 0));
}
}
// now do what you need to
}
No need for any manual memory management, and the std::map is keyed by std::string!
ReadLine clearly includes the newline in the data it returns. Simply check for and remove it:
char *nameLine = filesystem->ReadLine(line, size, file);
// remove any newline...
if (char* p_nl = strchr(nameLine, '\n'))
    *p_nl = '\0';
(What this does is overwrite the newline character with a new NUL terminator, which effectively truncates the ASCIIZ string at that point.)
Most likely the ReadLine function also reads the newline character. I suppose your file does not have a newline on the very last line, so you do not get a newline for that name.
But until I know what filesystem, FileHandle_t, and Msg are, it is very hard to determine where the issue could be.
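A small sketch of stripping any trailing newline (and a Windows '\r') from a C string, under the assumption that ReadLine behaves like fgets and keeps the terminator:

#include <cstring>

// Trim a trailing "\n" or "\r\n" in place; safe on strings without one.
void chomp(char* s)
{
    size_t len = std::strlen(s);
    while (len > 0 && (s[len - 1] == '\n' || s[len - 1] == '\r'))
        s[--len] = '\0';
}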