Filter unwanted elements from ifstream - c++

Currently I use this template method to read a file and store the content to a std::vector:
template <class T> std::vector<T> FileToBuffer(const std::string pathToFile, std::vector<T> *ignoreList = NULL, bool binary = false)
{
std::vector<T> fileBuffer;
std::ifstream file;
try
{
if(binary) file.open(pathToFile, std::ios::in | std::ios::binary);
else file.open(pathToFile, std::ios::in);
if (file)
{
// get length of file:
file.seekg (0, file.end);
int length = file.tellg();
file.seekg (0, file.beg);
fileBuffer.resize(length);
file.read((char*)fileBuffer.data(), length);
file.close();
// remove unwanted elements
if(ignoreList != NULL)
{
for(typename std::vector<T>::iterator it=ignoreList->begin(); it!=ignoreList->end(); it++)
{
fileBuffer.erase(std::remove(fileBuffer.begin(), fileBuffer.end(), *it), fileBuffer.end());
}
}
}
}
catch(std::exception &exp)
{
fileBuffer.clear();
}
return fileBuffer;
}
To filter unwanted elements I go through the whole vector afterwards and erase them. Is there a better way to filter the unwanted elements directly while reading?
(In fact I need this to remove line feeds and carriage returns.)
Thanks!

This should work:
if (file)
{
T object;
while ( file.read((char*)(&object), sizeof(T)) )
{
bool found = false;
if(ignoreList != NULL)
{
auto iter = std::find(ignoreList->begin(), ignoreList->end(), object);
found = (iter != ignoreList->end());
}
if ( !found )
{
fileBuffer.push_back(object);
}
}
file.close();
}

Related

Fastest way to read in a file c++

I would like to read in a file like this:
13.3027 29.2191 2.39999
13.3606 29.1612 2.39999
13.3586 29.0953 2.46377
13.4192 29.106 2.37817
It has more than 1 million lines.
My current cpp code is:
bool loadCloud(const string &filename, PointCloud<PointXYZ> &cloud)
{
print_info("\nLoad the Cloud .... (this takes some time!!!) \n");
ifstream fs;
fs.open(filename.c_str(), ios::binary);
if (!fs.is_open() || fs.fail())
{
PCL_ERROR(" Could not open file '%s'! Error : %s\n", filename.c_str(), strerror(errno));
fs.close();
return (false);
}
string line;
vector<string> st;
while (!fs.eof())
{
getline(fs, line);
// Ignore empty lines
if (line == "")
{
std::cout << " this line is empty...." << std::endl;
continue;
}
// Tokenize the line
boost::trim(line);
boost::split(st, line, boost::is_any_of("\t\r "), boost::token_compress_on);
cloud.push_back(PointXYZ(float(atof(st[0].c_str())), float(atof(st[1].c_str())), float(atof(st[2].c_str()))));
}
fs.close();
std::cout<<" Size of loaded cloud: " << cloud.size()<<" points" << std::endl;
cloud.width = uint32_t(cloud.size()); cloud.height = 1; cloud.is_dense = true;
return (true);
}
Reading this file currently takes a really long time. I would like to speed this up. Any ideas how to do that?
You can just read the numbers instead of the whole line plus parsing, as long as the numbers always come in sets of three.
void readFile(const std::string& fileName, PointCloud<PointXYZ>& cloud)
{
std::ifstream infile(fileName);
float vertex[3];
int coordinateCounter = 0;
while (infile >> vertex[coordinateCounter])
{
coordinateCounter++;
if (coordinateCounter == 3)
{
cloud.push_back(PointXYZ(vertex[0], vertex[1], vertex[2]));
coordinateCounter = 0;
}
}
}
Are you running optimised code? On my machine your code reads a million values in 1800ms.
The trim and the split are probably taking most of the time. If there is white space at the beginning of the string, trim has to move the whole string contents to erase the first characters. split creates new string copies; you can optimise this by using string_view to avoid the copies.
As your separators are white space you can avoid all the copies with code like this (std::from_chars for float requires C++17 and the <charconv> header):
bool loadCloud(const string &filename, std::vector<std::array<float, 3>> &cloud)
{
ifstream fs;
fs.open(filename.c_str(), ios::binary);
if (!fs)
{
fs.close();
return false;
}
string line;
while (getline(fs, line))
{
// Ignore empty lines
if (line == "")
{
continue;
}
const char* first = &line.front();
const char* last = first + line.length();
std::array<float, 3> arr;
for (float& f : arr)
{
auto result = std::from_chars(first, last, f);
if (result.ec != std::errc{})
{
return false;
}
first = result.ptr;
while (first != last && isspace(*first))
{
first++;
}
}
if (first != last)
{
return false;
}
cloud.push_back(arr);
}
fs.close();
return true;
}
On my machine this code runs in 650ms. About 35% of the time is used by getline, 45% by parsing the floats, the remaining 20% is used by push_back.
A few notes:
I've fixed the while(!fs.eof()) issue by checking the state of the stream after calling getline.
I've changed the result to an array, as your example wasn't an MCVE and I didn't have a definition of PointCloud or PointXYZ. It's possible that these types are the cause of your slowness.
If you know the number of lines (or at least an approximation) in advance, reserving the size of the vector would improve performance.

Reading a file using fstream

When I try to read a file to a buffer, it always appends random characters to the end of the buffer.
char* thefile;
std::streampos size;
std::fstream file(_file, std::ios::in | std::ios::ate);
if (file.is_open())
{
size = file.tellg();
std::cout << "size: " << size;
thefile = new char[size]{0};
file.seekg(0, std::ios::beg);
file.read(thefile, size);
std::cout << thefile;
}
int x = 0;
While my original text in my file is: "hello"
The output becomes: "helloýýýý««««««««þîþîþ"
Could anyone help me as to what is happening here? Thanks
From the C++ docs: http://cplusplus.com/reference/istream/istream/read
"This function simply copies a block of data, without checking its contents nor appending a null character at the end."
So your string misses the trailing null character which indicates the end of the string. In this case cout will just continue printing characters from what is beyond thefile in memory.
Add a '\0' at the end of your string.
If the file is not opened in ios::binary mode, you cannot assume that the position returned by tellg() gives you the number of chars that you will read. Text mode may transform the data (for example, on Windows "\r\n" in the file is converted to "\n", so tellg() might report a size of 2 while only 1 char is read).
Anyway, read() doesn't add a null terminator.
Finally, you must allocate one more character than the size that you expect due to the null terminator that you have to add. Otherwise you risk a buffer overflow when you add it.
You should verify how many chars were really read with gcount(), and set a null terminator to your string accordingly.
thefile = new char[size + 1]{0}; // one more for the trailing null
file.seekg(0, std::ios::beg);
if (file.read(thefile, size))
thefile[size]=0; // successful read: all size chars were read
else thefile[file.gcount()]=0; // fewer chars were read, e.g. due to text mode
Here's a better way of reading your collection:
#include <algorithm>
#include <cstdint>
#include <fstream>
#include <iostream>
#include <iterator>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>
// The stream overload is declared first so that the path overload below can find it.
template<class T>
void Write(std::ostream & stream, T const & value)
{
std::copy(value.begin(), value.end(), std::ostreambuf_iterator<char>(stream));
if (!stream)
{
throw std::runtime_error("failed to write");
}
}
template<class T>
void Write(std::string const & path, T const & value, std::ios_base::openmode mode)
{
if (auto stream = std::ofstream(path, mode))
{
Write(stream, value);
stream.close();
}
else
{
throw std::runtime_error("failed to create/open stream");
}
}
template<class T>
void Read(std::istream & stream, T & output)
{
auto eof = std::istreambuf_iterator<char>();
output = T(std::istreambuf_iterator<char>(stream), eof);
if(!stream)
{
throw std::runtime_error("failed to read stream");
}
}
template<class T>
void Read(std::string const & path, T & output)
{
if (auto stream = std::ifstream(path, std::ios::in | std::ios::binary))
{
Read(stream, output);
stream.close();
}
else
{
throw std::runtime_error("failed to create stream");
}
}
int main(void)
{
// Write and read back text.
{
auto const s = std::string("I'm going to write this string to a file");
Write("temp.txt", s, std::ios_base::trunc | std::ios_base::out);
auto t = std::string();
Read("temp.txt", t);
}
// Write and read back a set of ints.
{
auto const v1 = std::vector<int>{ 10, 20, 30, 40, 50 };
Write("temp.txt", v1, std::ios_base::trunc | std::ios_base::out | std::ios_base::binary);
auto v2 = std::vector<int>();
Read("temp.txt", v2);
}
return 0;
}
Pass in an iterable container rather than using "new".

Is there any possibility that this class may be unsafe?

I've made this class to read binary files and store their data.
FileInput.h:
#pragma once
#include <Windows.h>
#include <fstream>
using namespace std;
class FileInput
{
public:
FileInput(LPSTR Filename);
FileInput(LPWSTR Filename);
~FileInput();
operator char*();
explicit operator bool();
size_t Size;
private:
__forceinline void Read();
ifstream File;
char* Data;
};
FileInput.cpp
#include "FileInput.h"
FileInput::FileInput(LPSTR Filename)
{
File.open(Filename, ios::binary);
Read();
}
FileInput::FileInput(LPWSTR Filename)
{
File.open(Filename, ios::binary);
Read();
}
FileInput::~FileInput()
{
if (Data) delete[] Data;
}
FileInput::operator char*()
{
return Data;
}
FileInput::operator bool()
{
return (bool)Data;
}
void FileInput::Read()
{
if (!File)
{
Data = nullptr, Size = 0;
return;
}
File.seekg(0, ios::end);
Size = (size_t)File.tellg();
File.seekg(0, ios::beg);
Data = new char[Size];
File.read(Data, Size);
File.close();
}
Then I use it like this:
FileInput File(Filename); // This reads the file and allocates memory
if (!File) // This is for error checking
{
// Do something
}
if (File.Size >= sizeof(SomeStruct))
{
char FirstChar = File[0]; // Gets a single character
SomeStruct *pSomeStruct = reinterpret_cast<SomeStruct*>(&File[0]); // Gets a structure
}
So, is there any possibility that this class may be unsafe?
A reinterpret_cast<SomeStruct*>(&File) or other nonsense statement doesn't count.
EDIT: What I mean with unsafe is "to do unexpected or 'dangerous' things".
This is major overkill. If you want to copy a file into a buffer, you can use an istreambuf_iterator:
std::ifstream inFile(fileName, std::ios::binary);
std::vector<char> fileBuffer ( (std::istreambuf_iterator<char>(inFile)),
std::istreambuf_iterator<char>() );
Then you can read from fileBuffer as necessary.

How to update struct item in binary files

I have a binary file to which I write some struct items. Now I want to find and update a specific item in the file.
Note that my struct has a vector and its size is not constant.
my struct:
struct mapItem
{
string term;
vector<int> pl;
};
Code that writes struct items to the file:
if (it==hashTable.end())//didn't find
{
vector <int> posting;
posting.push_back(position);
hashTable.insert ( pair<string,vector <int> >(md,posting ) );
mapItem* mi = new mapItem();
mi->term = md;
mi->pl = posting;
outfile.write((char*)mi, sizeof(mi));
}
else // found
{
}
In the else block I want to find and update the item by its term (the term is unique).
Now I have changed my code like this to serialize my vector.
if (it==hashTable.end())//didn't find
{
vector <int> posting;
posting.push_back(position);
hashTable.insert ( pair<string,vector <int> >(md,posting ) );
mapItem* mi = new mapItem();
mi->term = md;
mi->pl = posting;
if(!outfile.is_open())
outfile.open("sample.dat", ios::binary | ios::app);
size_t size = mi->term.size() + 1;
outfile.write((char*)&size, sizeof(size) );
outfile.write((char*)mi->term.c_str(), size);
size = (int)mi->pl.size() * sizeof(int);
outfile.write((char*)&size, sizeof(size) );
outfile.write((char*)&mi->pl[0], size );
outfile.close();
}
else // found
{
(it->second).push_back(position);
mapItem* mi = new mapItem();
size_t size;
if(!infile.is_open())
{
infile.open("sample.dat", ios::binary | ios::in);
}
do{
infile.read((char*)&size, sizeof(size) ); // string size
mi->term.resize(size - 1); // make string the right size
infile.read((char*)mi->term.c_str(), size); // may need const_cast
infile.read((char*)&size, sizeof(size) ); // vector size
mi->pl.resize(size / sizeof(int));
infile.read((char*)&mi->pl[0], size );
}while(mi->term != md);
infile.close();
}
Well, my main question still remains: how can I update the data that I found?
Is there a better way to find them?
I evaluated the following solutions:
update in a new file and rename it to the old one at the end
update in the same file with one stream that has two file positions, read and write, but I couldn't quickly find support for such a thing
update in the same file with two streams, read and write, but the risk of overwriting data that has not been read yet is too big for me (even if protected outside against overlaps)
So I chose the first one, the most straightforward anyway.
#include <string>
#include <vector>
#include <fstream>
#include <iterator>
#include <cstdio>
#include <cassert>
I added the following function to your struct:
size_t SizeWrittenToFile() const
{
return 2*sizeof(size_t)+term.length()+pl.size()*sizeof(int);
}
The read and write functions are basically the same as yours, except that I chose not to write through the string's c_str() pointer (although this ugly solution should work on every known compiler).
bool ReadNext(std::istream& in, mapItem& item)
{
size_t size;
in.read(reinterpret_cast<char*>(&size), sizeof(size_t));
if (!in)
return false;
std::istreambuf_iterator<char> itIn(in);
std::string& out = item.term;
out.reserve(size);
out.clear(); // this is necessary if the string is not empty
for (std::insert_iterator<std::string> itOut(out, out.begin());
in && (out.length() < size); itIn++, itOut++)
*itOut = *itIn;
assert(in);
if (!in)
return false;
in.read(reinterpret_cast<char*>(&size), sizeof(size_t));
if (!in)
return false;
std::vector<int>& out2 = item.pl;
out2.resize(size); // unfortunately reserve doesn't work here
in.read(reinterpret_cast<char*>(&out2[0]), size * sizeof(int));
assert(in);
return true;
}
// a "header" should be added to mark complete data (to write "atomically")
bool WriteNext(std::ostream& out, const mapItem& item)
{
size_t size = item.term.length();
out.write(reinterpret_cast<const char*>(&size), sizeof(size_t));
if (!out)
return false;
out.write(item.term.c_str(), size);
if (!out)
return false;
size = item.pl.size();
out.write(reinterpret_cast<const char*>(&size), sizeof(size_t));
if (!out)
return false;
out.write(reinterpret_cast<const char*>(&item.pl[0]), size * sizeof(int));
if (!out)
return false;
return true;
}
The update functions look like this:
bool UpdateItem(std::ifstream& in, std::ofstream& out, const mapItem& item)
{
mapItem it;
bool result;
for (result = ReadNext(in, it); result && (it.term != item.term);
result = ReadNext(in, it))
if (!WriteNext(out, it))
return false;
if (!result)
return false;
// write the new item content
assert(it.term == item.term);
if (!WriteNext(out, item))
return false;
for (result = ReadNext(in, it); result; result = ReadNext(in, it))
if (!WriteNext(out, it))
return false;
// failure or just the end of the file?
return in.eof();
}
bool UpdateItem(const char* filename, const mapItem& item)
{
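// binary mode: the records are raw bytes, so no newline translation may happen
std::ifstream in(filename, std::ios::binary);
assert(in);
std::string filename2(filename);
filename2 += ".tmp";
std::ofstream out(filename2.c_str(), std::ios::binary);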
assert(out);
bool result = UpdateItem(in, out, item);
// close them before delete
in.close();
out.close();
int err = 0;
if (result)
{
err = remove(filename);
assert(!err && "remov_140");
result = !err;
}
if (!result)
{
err = remove(filename2.c_str());
assert(!err && "remov_147");
}
else
{
err = rename(filename2.c_str(), filename);
assert(!err && "renam_151");
result = !err;
}
return result;
}
Any questions?
This:
outfile.write((char*)mi, sizeof(mi));
Does not make sense. mi is a pointer, so sizeof(mi) is the size of the pointer, not of the struct, and only the first few bytes of the object get written. Even with sizeof(*mi) you would be writing the internals of std::string and std::vector directly to disk. Some of those bits are extremely likely to be pointers. Pointers written to a file on disk are not useful, because they point into the address space of the process which wrote the file and won't work in another process reading the same file.
You need to "serialize" your data to the file, e.g. in a for loop writing each element.
You can serialize the struct to a file this way:
write the length of the string (a size_t in the code below)
write the string itself
write the length of the vector (in bytes is easier to parse later)
write the vector data. &vec[0] is the address of the first element. You can write all elements in one shot since this buffer is contiguous.
Write:
size_t size = mi->term.size() + 1;
outfile.write((char*)&size, sizeof(size) );
outfile.write((char*)mi->term.c_str(), size);
size = (int)mi->pl.size() * sizeof(int);
outfile.write((char*)&size, sizeof(size) );
outfile.write((char*)&mi->pl[0], size );
Read:
infile.read((char*)&size, sizeof(size) ); // string size
mi->term.resize(size - 1); // make string the right size
infile.read((char*)mi->term.c_str(), size); // may need const_cast
infile.read((char*)&size, sizeof(size) ); // vector size
mi->pl.resize(size / sizeof(int));
infile.read((char*)&mi->pl[0], size );

read in txt file in C++

I have a .txt parameter file like this:
#Stage
filename = "a.txt";
...
#Stage
filename = "b.txt";
...
Basically I want to read one stage each time I access the parameter file.
I planned to use getline in C++ with the delimiter "#Stage" to do this. Or is there a better way to solve this? Any sample code would be helpful.
// pseudocode: DATA, List::add() and List::last() are left undefined here
struct content{
DATA data;
content* next;
};
struct List{
content * head;
};
static List * list;
char buf[100];
FILE * f = fopen(filename, "r");
while(NULL != fgets(buf, sizeof(buf), f))
{
char str[100] ={0};
sscanf(buf,"%99s",str);
if(strcmp("#Stage",str) == 0)
{
// read content and add to list
content * p = new content();
list->add();
}
else
{
//update content in last node
list->last().data = ...;
}
}
Maybe I should have expressed myself more clearly. Anyway, I managed it like this:
ifstream file1;
file1.open(parfile, ios::binary);
if (!file1) {
cout<<"Error opening file"<<parfile<<"for read"<<endl;
exit(1);
}
std::istreambuf_iterator<char> eos;
std::string s(std::istreambuf_iterator<char>(file1), eos);
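const std::string marker = "#Stage"; // must match the case used in the file
std::string::size_type block_begin = s.find(marker);
// advance to the marker of the requested stage (stage is treated as 0-based here)
for (unsigned int i=0; i<stage && block_begin!=string::npos; i++) {
block_begin = s.find(marker, block_begin + marker.size());
}
if (block_begin == string::npos) {
cout<<"Stage "<<stage<<" not found in "<<parfile<<endl;
exit(1);
}
// the block ends where the next marker starts, or at the end of the file
std::string::size_type block_end = s.find(marker, block_begin + marker.size());
// substr takes a length, not an end position
string block = s.substr(block_begin, block_end == string::npos ? string::npos : block_end - block_begin);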
stringstream ss(block);
....
I'd read line by line, ignoring the lines that start with # (or the lines whose content is #Stage, depending on the format and goal), as there's no getline overload that takes a std::string as the delimiter.