Reading/writing binary file returns 0xCCCCCCCC - c++

I have a script that dumps class info into a binary file, then another script that retrieves it.
Since file streams read and write raw chars, I wrote three functions for reading and writing short ints, ints, and floats. I've been experimenting with them, so they're not properly overloaded yet, but they all look like this:
void writeInt(ofstream& file, int val) {
    file.write(reinterpret_cast<char *>(&val), sizeof(val));
}

int readInt(ifstream& file) {
    int val;
    file.read(reinterpret_cast<char *>(&val), sizeof(val));
    return val;
}
I'll put the class load/save script at the end of the post, but I don't think it'll make too much sense without the rest of the class info.
Anyway, it seems that the file gets saved properly: it has the correct size, and all of the data matches when I load it. However, at some point in the load process, readInt() starts returning 0xCCCCCCCC every time. This looks to me like a read error, but I'm not sure why it happens or how to correct it. Since the file is the correct size and I never touch seekg(), it doesn't seem likely that it's reaching the end of the file prematurely. I can only assume it's an issue with my read/write method, since I did kind of hack it together with limited knowledge. But if that's the case, why does it read all the data up to a certain point without issue?
The error starts occurring at a random point each run. This may or may not be related to the fact that all the class data is randomly generated.
Does anyone have experience with this? I'm not even sure how to continue debugging it at this point.
Anyway, here are the load/save functions:
void saveToFile(string fileName) {
    ofstream dataFile(fileName.c_str());
    writeInt(dataFile, inputSize);
    writeInt(dataFile, fullSize);
    writeInt(dataFile, outputSize);
    // Skips input nodes - no data needs to be saved for them.
    for (int i = inputSize; i < fullSize; i++) { // Saves each node after inputSize
        writeShortInt(dataFile, nodes[i].size);
        writeShortInt(dataFile, nodes[i].skip);
        writeFloat(dataFile, nodes[i].value);
        //vector<int> connects;
        //vector<float> weights;
        for (int j = 0; j < nodes[i].size; j++) {
            writeInt(dataFile, nodes[i].connects[j]);
            writeFloat(dataFile, nodes[i].weights[j]);
        }
    }
    read(500);
}
void loadFromFile(string fileName) {
    ifstream dataFile(fileName.c_str());
    inputSize = readInt(dataFile);
    fullSize = readInt(dataFile);
    outputSize = readInt(dataFile);
    nodes.resize(fullSize);
    for (int i = 0; i < inputSize; i++) {
        nodes[i].setSize(0); // Sets input nodes
    }
    for (int i = inputSize; i < fullSize; i++) { // Loads each node after inputSize
        int s = readShortInt(dataFile);
        nodes[i].setSize(s);
        nodes[i].skip = readShortInt(dataFile);
        nodes[i].value = readFloat(dataFile);
        //vector<int> connects;
        //vector<float> weights;
        for (int j = 0; j < nodes[i].size; j++) {
            nodes[i].connects[j] = readInt(dataFile); // Error occurs in a random instance of this call of readInt().
            nodes[i].weights[j] = readFloat(dataFile);
        }
        read(i); // Outputs class data to console
    }
    read(500);
}
Thanks in advance!

You have to check the result of your open, read, and write operations. And you need to open the files (for both reading and writing) in binary mode. As written, both streams are opened in text mode: on Windows that means newline translation, and a text-mode read stops at the first 0x1A (Ctrl-Z) byte it encounters, so a stream of random binary data will fail at an unpredictable point. The 0xCCCCCCCC you see is the pattern MSVC's debug runtime uses to fill uninitialized stack memory, which tells you the read failed and val was never written.
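A minimal sketch of both fixes together, keeping the question's free-function style (the file name data.bin and the throw-on-failure policy are illustrative choices, not from the original code):

#include <fstream>
#include <iostream>
#include <stdexcept>
using namespace std;

void writeInt(ofstream& file, int val) {
    if (!file.write(reinterpret_cast<char *>(&val), sizeof(val)))
        throw runtime_error("write failed");
}

int readInt(ifstream& file) {
    int val;
    if (!file.read(reinterpret_cast<char *>(&val), sizeof(val)))
        throw runtime_error("read failed");
    return val;
}

int main() {
    ofstream out("data.bin", ios::binary); // binary mode: no newline translation, no Ctrl-Z end-of-file
    if (!out) throw runtime_error("could not open for writing");
    writeInt(out, 42);
    out.close();

    ifstream in("data.bin", ios::binary);
    if (!in) throw runtime_error("could not open for reading");
    cout << readInt(in) << '\n'; // prints 42
    return 0;
}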


How to generate a hashmap for huge chunk of data?

I want to make a map such that a set of pointers point to arrays of dynamic size.
I used hashing with chaining. But since the data I am using is huge, the program gives std::bad_alloc after a few iterations. The cause is probably the new calls used to build the linked lists.
Can someone suggest which data structure I should use, or anything else that can improve the memory usage of my hash table?
The program is in C++.
This is what my code looks like:
Initialization of hashtable:
class Link
{
public:
    double iData;
    Link* pNext;
    Link(double it) : iData(it), pNext(NULL) // initialize pNext so the first node's link isn't dangling
    { }
    void displayLink()
    { cout << iData << " "; }
};

class List
{
private:
    Link* pFirst;
public:
    List()
    { pFirst = NULL; }
    void insert(double key)
    {
        if(pFirst == NULL)
            pFirst = new Link(key);
        else
        {
            Link* pLink = new Link(key);
            pLink->pNext = pFirst;
            pFirst = pLink;
        }
    }
};

class HashTable
{
public:
    int arraySize;
    vector<List*> hashArray;
    HashTable(int size)
    {
        hashArray.resize(size);
        for(int j = 0; j < size; j++)
            hashArray[j] = new List;
    }
};
main snippet:
int t_sample = 1000;
for(int i = 0; i < k; i++) // initialize random positions
{
    x[i] = (cal_rand() * dom_sizex); // dom_sizex = 20e-10; cal_rand() generates a random number between 0 and 1
    y[i] = (cal_rand() * dom_sizey); // dom_sizey = 10e-10
}
for(int t = 0; t < t_sample; t++)
{
    int size;
    size = cell_nox * cell_noy; // size of hash table; cell_nox = 212, cell_noy = 424
    HashTable theHashTable(size); // make table
    int hashValue = 0;
    for(int n = 0; n < k; n++) // k = 10*212*424
    {
        int m = x[n] / cell_width; // cell_width = 4.7e-8
        int l = y[n] / cell_width;
        hashValue = (kx*l) + m;
        theHashTable.hashArray[hashValue]->insert(n);
    }
    -------
    -------
}
First things first, use a Standard Container. In your specific case, you might want:
either std::unordered_multimap<int, double>
or std::unordered_map<int, std::vector<double>>
(Note: if you do not have C++11, those are available in Boost)
Your main loop becomes (using the second option):
typedef std::unordered_map<int, std::vector<double>> HashTable;

for(int t = 0; t < t_sample; ++t)
{
    size_t const size = cell_nox * cell_noy;
    // size of hash table; cell_nox = 212, cell_noy = 424
    HashTable theHashTable;
    theHashTable.reserve(size);
    for (int n = 0; n < k; ++n) // k = 10*212*424
    {
        int m = x[n] / cell_width; // cell_width = 4.7e-8
        int l = y[n] / cell_width;
        int const cellId = (kx*l) + m;
        theHashTable[cellId].push_back(n);
    }
}
This will reliably not leak memory (though of course you might have leaks elsewhere), and thus gives you a sound baseline. It is also probably faster than your approach, with a more convenient interface, etc...
In general you should not reinvent the wheel unless you have a specific need the available wheels do not address, or you are deliberately trying to learn how to create a wheel, or to create a better wheel.
The OS has to solve the same issues with memory pages, so maybe it's worth looking at how that is done. First of all, let's assume all pages are on disk. A page is a fixed-size memory chunk; for your use case, let's say it's an array of your records. Because RAM is limited, the OS maintains a mapping between a page number and its location in RAM.
So, let's say your pages hold 1000 records each, and you want to access record 2024: you ask the OS for page 2 and read record 24 from that page. That way, your map is only 1/1000 of the size.
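In code, that lookup is just integer division and remainder (a tiny sketch; the 1000-records-per-page figure comes from the example above):

#include <cstddef>

const std::size_t recordsPerPage = 1000;
std::size_t record = 2024;
std::size_t page   = record / recordsPerPage; // page 2
std::size_t offset = record % recordsPerPage; // record 24 within that page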
Now, if your page has no mapping to a memory location, then it is either on disk or has never been accessed before (is empty). Then you need to swap out another page, and load that page from disk (and update the location mapping).
This is a very simplified description of what happens, and I wouldn't be surprised if someone jumps down my throat for describing it like this.
The point is: what does this mean for you?
First, your data exceeds your RAM, so you won't get around writing to disk, unless you want to try compression first.
Second, your chains can work as pages if you want, but I wonder whether just paging your hashcode would work better. What I mean is: use the upper bits as the page number, and the lower bits as the offset within the page (see the sketch after this list). Avoiding collisions is still key, as you want to load as few pages as possible. You can still chain your pages, and you end up with a much smaller map.
Third, a crucial part is deciding which pages to swap out to make room for new ones. LRU should do OK. If you can predict which pages you will (or will not) need, so much the better for you.
Fourth, you need placeholders for your pages that tell you whether each page is in memory or on disk.
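Here is a minimal sketch of the hashcode-paging idea from the second point above (the 64-bit hash width and 16 offset bits are assumptions chosen for illustration, not something from the question):

#include <cstdint>

const unsigned kPageBits = 16; // 2^16 slots per page (assumed)
const std::uint64_t kOffsetMask = (std::uint64_t(1) << kPageBits) - 1;

// Upper bits select the page, lower bits the slot within the page.
std::uint64_t page_of(std::uint64_t hash) { return hash >> kPageBits; }
std::uint64_t slot_of(std::uint64_t hash) { return hash & kOffsetMask; }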
Hope this helps.

Where did these .tmp files come from?

Firstly, some details:
I am using a combination of C++ (Armadillo library) and R.
I am using Ubuntu as my operating system.
I am not using Rcpp
Suppose that I have some C++ code called cpp_code which:
Reads, as input from R, an integer,
Performs some calculations,
Saves, as output for R, a spreadsheet "out.csv" (I use .save(name, file_type = csv)).
Some simplified R code would be:
for(i in 1:10000)
{
    system(paste0("echo ", toString(i), " | ./cpp_code")) ## produces out.csv
    output[i,,] <- read.csv("out.csv")                    ## reads out.csv
}
My Problem:
99% of the time, everything works fine. However, every now and then I get an unusual .tmp file like "out.csv.tmp_a0ac9806ff7f0000703a". These .tmp files only appear for a second or so, then suddenly disappear.
Questions:
What is causing this?
Is there a way to stop this from happening?
Please go easy on me since computing is not my main discipline.
Thank you very much for your time.
Many programs write their output to a temporary file and then rename it to the destination file. This is often done to avoid leaving a half-written output file if the process is killed while writing. By using a temporary, the file can be atomically renamed to the output file name, ensuring that either:
the entire output file is properly written or
no change is made to the output file
Note there usually are still some race conditions that could result, for example, in the output file being deleted but the temporary file not renamed, but one of the two outcomes above is the general goal.
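A minimal sketch of that pattern, assuming POSIX rename() semantics (renaming onto an existing file on the same filesystem atomically replaces it; the function name save_atomically and the .tmp suffix are illustrative):

#include <cstdio>
#include <fstream>
#include <string>

bool save_atomically(const std::string& final_name, const std::string& contents)
{
    const std::string tmp_name = final_name + ".tmp";
    std::ofstream f(tmp_name.c_str());
    if (!(f << contents))
        return false;  // a failed write leaves only the .tmp file behind
    f.close();
    // A reader sees either the old file or the complete new one, never a partial write.
    return std::rename(tmp_name.c_str(), final_name.c_str()) == 0;
}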
I believe you're using the .save function in Armadillo.
http://arma.sourceforge.net/docs.html#save_load_field
There are two functions you should look at in include/armadillo_bits/diskio_meat.hpp. In save_raw_ascii, the data is first written to a file named by diskio::gen_tmp_name, and if save_okay, the file is renamed to the final name by safe_rename. If the save or the rename fails, the temporary file is left behind, and that is what you are seeing. The temporary file name is the original filename plus .tmp_ plus a quasi-random hex value (see gen_tmp_name below).
//! Save a matrix as raw text (no header, human readable).
//! Matrices can be loaded in Matlab and Octave, as long as they don't have complex elements.
template<typename eT>
inline
bool
diskio::save_raw_ascii(const Mat<eT>& x, const std::string& final_name)
{
    arma_extra_debug_sigprint();

    const std::string tmp_name = diskio::gen_tmp_name(final_name);
    std::fstream f(tmp_name.c_str(), std::fstream::out);

    bool save_okay = f.is_open();

    if(save_okay == true)
    {
        save_okay = diskio::save_raw_ascii(x, f);
        f.flush();
        f.close();

        if(save_okay == true)
        {
            save_okay = diskio::safe_rename(tmp_name, final_name);
        }
    }

    return save_okay;
}
//! Append a quasi-random string to the given filename.
//! The rand() function is deliberately not used,
//! as rand() has an internal state that changes
//! from call to call. Such states should not be
//! modified in scientific applications, where the
//! results should be reproducible and not affected
//! by saving data.
inline
std::string
diskio::gen_tmp_name(const std::string& x)
{
    const std::string* ptr_x     = &x;
    const u8*          ptr_ptr_x = reinterpret_cast<const u8*>(&ptr_x);

    const char* extra      = ".tmp_";
    const uword extra_size = 5;

    const uword tmp_size = 2*sizeof(u8*) + 2*2;
    char        tmp[tmp_size];

    uword char_count = 0;

    for(uword i=0; i<sizeof(u8*); ++i)
    {
        conv_to_hex(&tmp[char_count], ptr_ptr_x[i]);
        char_count += 2;
    }

    const uword x_size = static_cast<uword>(x.size());
    u8 sum = 0;

    for(uword i=0; i<x_size; ++i)
    {
        sum += u8(x[i]);
    }

    conv_to_hex(&tmp[char_count], sum);
    char_count += 2;

    conv_to_hex(&tmp[char_count], u8(x_size));

    std::string out;
    out.resize(x_size + extra_size + tmp_size);

    for(uword i=0; i<x_size; ++i)
    {
        out[i] = x[i];
    }

    for(uword i=0; i<extra_size; ++i)
    {
        out[x_size + i] = extra[i];
    }

    for(uword i=0; i<tmp_size; ++i)
    {
        out[x_size + extra_size + i] = tmp[i];
    }

    return out;
}
What “Dark Falcon” hypothesises is exactly true: when calling save, Armadillo creates a temporary file to which it saves the data, and then renames the file to the final name.
This can be seen in the source code (this is save_raw_ascii but the other save* functions work analogously):
const std::string tmp_name = diskio::gen_tmp_name(final_name);
std::fstream f(tmp_name.c_str(), std::fstream::out);

bool save_okay = f.is_open();

if(save_okay == true)
{
    save_okay = diskio::save_raw_ascii(x, f);
    f.flush();
    f.close();

    if(save_okay == true)
    {
        save_okay = diskio::safe_rename(tmp_name, final_name);
    }
}
The comment on safe_rename says this:
Safely rename a file.
Before renaming, test if we can write to the final file.
This should prevent:
overwriting files that are write protected,
overwriting directories.
It’s worth noting, however, that this will not prevent a race condition.

Problems with destroying an array of libtrace_out_t*

The task is to read packets from one tracer and write to many.
I use libtrace_out_t** for output tracers.
Initialization:
uint16_t size = 10;
libtrace_out_t** outTracers_ = new libtrace_out_t*[size];
for(uint16_t i = 0; i < size; ++i) {
    outTracers_[i] = trace_create_output(uri); // created OK
    trace_start_output(outTracers_[i]);        // started OK
}
// writing packets
// writing packets
Creating, starting, and writing packets through the elements of the tracer array all work fine.
The problems come from trace_destroy_output() when I destroy the output tracers in a loop:
for(uint16_t i = 0; i < size; ++i)
{
    if(outTracers_[i])
        trace_destroy_output(outTracers_[i]);
}
On the first iteration the output tracer is destroyed fine, but on the second it fails with a segmentation fault in pcap_close(pcap_t* p), because the pointer p has the value 0x0.
Can someone explain why this happens, or how to destroy the tracers properly?
From the code that you have posted, it looks like you are creating 10 output traces all using the same URI. So, essentially, you've created 10 output files all with the same filename, which probably isn't what you intended.
When it comes time to destroy the output traces, the first destroy closes the file matching the name you provided and sets the reference to that file to be NULL. Because the reference is now NULL, any subsequent attempts to destroy that file will cause a segmentation fault.
Make sure you change your URI for each new output trace you create and you should fix the problem.
Example:
/* I prefer pcapfile: over pcap: */
const char *base="pcapfile:output";
uint16_t size = 10;
libtrace_out_t* array[size];
for (uint16_t i = 0; i < size; ++i) {
char myuri[1024];
/* First output file will be called output-1.pcap
* Second output file will be called output-2.pcap
* And so on...
*/
snprintf(myuri, 1023, "%s-%u.pcap", base, i);
array[i] = trace_create_output(uri);
/* TODO Check for errors here */
if (trace_start_output(array[i])) {
/* TODO Handle error case */
}
}
One other hint: libtrace already includes a tool called tracesplit, which takes an input source and splits the packets into multiple output traces based on certain criteria (e.g. number of packets, size of output file, time interval). This tool may already do what you want without you having to write any code, or at least it will act as a good example when writing your own.
I think you have an out-of-bounds access in your code:

uint16_t size = 5; /// number of tracers
for(uint16_t i = 0; i != size; ++i)
{
    if(outTracers_[i])
        trace_destroy_output(outTracers_[i]);
}

If outTracers_ actually holds fewer tracers than size (say, only 5 were created but size is larger), the loop walks past the end of the array, and an element like outTracers_[5] is not a valid element in your array.

Trying to fill a 2d array of structures in C++

As above, I'm trying to create and then fill an array of structures with some starting data to then write to/read from.
I'm still writing the cache simulator as per my previous question:
Any way to get rid of the null character at the end of an istream get?
Here's how I'm making the array:
struct cacheline
{
    string data;
    string tag;
    bool valid;
    bool dirty;
};

cacheline **AllocateDynamicArray(int nRows, int nCols)
{
    cacheline **dynamicArray;
    dynamicArray = new cacheline*[nRows];
    for(int i = 0; i < nRows; i++)
        dynamicArray[i] = new cacheline[nCols];
    return dynamicArray;
}
I'm calling this from main:
cacheline **cache = AllocateDynamicArray(nooflines,noofways);
It seems to create the array OK, but when I try to fill it I get memory errors; here's how I'm trying to do it:
int fillcache(cacheline **cache, int cachesize, int cachelinelength, int ways)
{
    for (int j = 0; j < ways; j++)
    {
        for (int i = 0; i < cachesize/(cachelinelength*4); i++)
        {
            cache[i][ways].data = "EMPTY";
            cache[i][ways].tag = "";
            cache[i][ways].valid = 0;
            cache[i][ways].dirty = 0;
        }
    }
    return(1);
}
Calling it with:
fillcache(cache, cachesize, cachelinelength, noofways);
Now, this is the first time I've really tried to use dynamic arrays, so it's entirely possible I'm doing that completely wrong, let alone when trying to make it 2D; any ideas would be greatly appreciated :)
Also, is there an easier way to write to and read from the array? At the moment (I think) I'm having to pass lots of variables, including the array (or a pointer to the array?), to and from functions each time, which doesn't seem efficient.
Something else I'm unsure of: when I pass the array (pointer?) into a function and edit the array there, will it still be edited when I go back out of the function?
Thanks
Edit:
Just noticed a monumentally stupid error; it should of course be:
cache[i][j].data = "EMPTY";
You should find your happiness. You just need the time to check it out (:
The way to happiness
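For what it's worth, here is a minimal sketch of the same allocate-and-fill done with std::vector instead of raw new, which sidesteps the manual memory management (the names Cache and fillcache mirror the question, but the code itself is illustrative). Because the function takes the vector by reference, edits made inside it are still there when you return from the function, which also answers the pass-by-pointer question above:

#include <string>
#include <vector>
using namespace std;

struct cacheline
{
    string data;
    string tag;
    bool valid;
    bool dirty;
};

// A rows-by-columns grid of cachelines; no manual delete needed.
typedef vector<vector<cacheline> > Cache;

void fillcache(Cache& cache)
{
    for (size_t i = 0; i < cache.size(); i++)
    {
        for (size_t j = 0; j < cache[i].size(); j++)
        {
            cache[i][j].data  = "EMPTY";
            cache[i][j].tag   = "";
            cache[i][j].valid = false;
            cache[i][j].dirty = false;
        }
    }
}

// Usage: Cache cache(nooflines, vector<cacheline>(noofways)); fillcache(cache);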

C++: how to output data to multiple .dat files?

I have a research project I'm working on. I am a beginner in C++ and programming in general. I have already made a program that generates interacting particles that move on continuous space as time progresses. The only things my program outputs are the XY coordinates for each particle in each time-step.
I want to visualize my findings, to know if my particles are moving as they should. My professor said that I must use gnuplot. Since I could not find a way to output my data in one file so that gnuplot would recognize it, I thought of the following strategy:
a) For each time-step generate one file with XY coordinates of the form "output_#.dat".
b) Generate a .png file for each one of them in gnuplot.
c) Make a movie of the moving particles with all the .png files.
I am going to worry about (b) and (c) later; for now, I am able to output all my data using this code:
int main()
{
    int i = 0;
    int t = 0; // time
    int j = 0;
    int ctr = 0;
    double sumX = 0;
    double sumY = 0;
    srand(time(NULL)); // give system time as seed to random generator

    Cell* particles[maxSize]; // create array of pointers of type Cell
    for(i = 0; i < maxSize; i++)
    {
        particles[i] = new Cell(); // initialize in memory
        particles[i]->InitPos();   // give initial positions
    }

    FILE *newFile = fopen("output_00.dat", "w"); // print initial positions
    for(i = 0; i < maxSize; i++)
    {
        fprintf(newFile, "%f ", particles[i]->getCurrX());
        fprintf(newFile, "%f\n", particles[i]->getCurrY());
    }
    fclose(newFile);

    FILE *output = fopen("output_01.dat", "w");
    for(t = 0; t < tMax; t++)
    {
        fprintf(output, "%d ", t);
        for(i = 0; i < maxSize; i++) // for every cell
        {
            sumX = 0;
            sumY = 0;
            for(j = 0; j < maxSize; j++) // for all surrounding cells
            {
                sumX += particles[i]->ForceX(particles[i], particles[j]);
                sumY += particles[i]->ForceY(particles[i], particles[j]);
            }
            particles[i]->setVelX(particles[i]->getPrevVelX() + (sumX)*dt); // update speed
            particles[i]->setVelY(particles[i]->getPrevVelY() + (sumY)*dt);
            particles[i]->setCurrX(particles[i]->getPrevX() + (particles[i]->getVelX())*dt); // update position
            particles[i]->setCurrY(particles[i]->getPrevY() + (particles[i]->getVelY())*dt);
            fprintf(output, " ");
            fprintf(output, "%f ", particles[i]->getCurrX());
            fprintf(output, "%f\n", particles[i]->getCurrY());
        }
    }
    fclose(output);
    return 0;
}
This indeed generates two files, output_00.dat and output_01.dat, the first containing the initial randomly generated positions and the second containing all my results.
My feeling is that in the nested for loop, where I update the speed and position for the XY coordinates, I could open a FILE* that stores the coordinates for that time-step and then close it before incrementing time. That way, I would not need multiple file pointers open at the same time. At least, that is my intuition.
I do not know how to generate incrementing filenames. I have stumbled upon ofstream, but I don't understand how it works...
I think what I would like my program to do at this point is:
1) Generate a new file name, using a base name and the current loop counter value.
2) Open that file.
3) Write the coordinates for that time-step.
4) Close the file.
5) Repeat.
Any help will be greatly appreciated. Thank you for your time!
Using ofstream instead of fopen would be a better use of the C++ standard library, whereas now you are using C standard library calls, but there is nothing wrong per se with what you are doing now.
It seems like your real core question is how to generate a filename from an integer so you can use it in a loop. Here is one way:
// Include these somewhere
#include <string>
#include <sstream>

// Define this function
std::string make_output_filename(size_t index) {
    std::ostringstream ss;
    ss << "output_" << index << ".dat";
    return ss.str();
}

// Use that function with fopen in this way:
for (size_t output_file_number = 0; /* rest of your for loop stuff */) {
    FILE *file = fopen(make_output_filename(output_file_number).c_str(), "w");
    /* use the file */
    fclose(file);
}
This uses a std::ostringstream to build a filename with stream operations, and returns the built std::string. When you pass it to fopen, you have to give it a const char * rather than a std::string, so we use the .c_str() member, which exists for just this purpose.
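If you later want to switch to the C++ streams mentioned at the top of this answer, a minimal sketch of the same loop using std::ofstream (the loop bound of 10 and the values written are just illustrative) could look like this:

#include <fstream>
#include <sstream>
#include <string>

std::string make_output_filename(size_t index) {
    std::ostringstream ss;
    ss << "output_" << index << ".dat";
    return ss.str();
}

int main() {
    for (size_t t = 0; t < 10; ++t) {
        std::ofstream out(make_output_filename(t).c_str()); // opens "output_0.dat", "output_1.dat", ...
        out << 1.23 << ' ' << 4.56 << '\n'; // write one time-step's coordinates
    } // the file is closed automatically when out goes out of scope
    return 0;
}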