Sorting a file in-place with Shell sort - C++

I have been asked to sort a file in place using Shell sort (and quicksort too, but I think that if I find a way to do one I will be able to do both). I have been thinking about what could help, but I can't find a way to do it. I have the algorithm for an array, but I can't see how to make it work on a file.
Is there any way this can be done?
Edit:
With the help of the code posted by André Puel I was able to write some code that is working for the moment. Here it is if you want to check it out:
#include <iostream>
#include <iomanip>
#include <fstream>
#include <cstdlib>
#include <string>
#include <sstream>
using namespace std;

int toNum(const string &s) {
    stringstream ss(s);
    int n;
    ss >> n;
    return n;
}

string toStr(int n) {
    stringstream ss;
    ss << n;
    string s;
    ss >> s;
    return string(5 - s.size(), ' ') + s;
}

// Each number occupies a fixed-width 5-character field in the file.
int getNum(fstream &f, int pos) {
    f.seekg(pos * 5);
    string s;
    for (int i = 0; i < 5; ++i) s += f.get();
    return toNum(s);
}

void putNum(fstream &f, int pos, int n) {
    f.seekp(pos * 5);
    f.write(toStr(n).c_str(), 5);
}

int main() {
    fstream input("entrada1", fstream::in | fstream::out);
    string aux;
    getline(input, aux);
    int n = aux.size() / 5, temp, j;
    int gaps[] = {701, 301, 132, 57, 23, 10, 4, 1};
    int g = sizeof(gaps) / sizeof(gaps[0]);
    for (int k = 0; k < g; ++k) {
        int gap = gaps[k]; // use the gap value, not the loop index
        for (int i = gap; i < n; ++i) {
            temp = getNum(input, i);
            for (j = i; j >= gap and getNum(input, j - gap) > temp; j -= gap) {
                putNum(input, j, getNum(input, j - gap));
            }
            putNum(input, j, temp);
        }
    }
    input.close();
    return 0;
}

When you open a file in C++ you have two positions: the getter (read) position and the putter (write) position. They indicate where in the file you are reading from and writing to.
Using seekp, you tell the stream where you want to write; using tellp, you find out where the next write will go. Every time you write something, the put position advances automatically.
The same goes for the get position; the corresponding functions are seekg and tellg.
Using these operations you can easily simulate an array. Let me show you some code:
#include <cassert>
#include <cstddef>
#include <fstream>

class FileArray {
public:
    FileArray(const char* path)
        // in|out (not app): app mode would force every write to the end,
        // which breaks random-access writes via seekp
        : file(path, std::fstream::in | std::fstream::out | std::fstream::binary)
    {
        file.seekg(0, std::fstream::end);
        size = file.tellg();
    }
    void write(unsigned pos, char data) {
        assert(pos < size);
        file.seekp(pos);
        file.put(data);
    }
    char read(unsigned pos) {
        assert(pos < size);
        file.seekg(pos);
        return file.get();
    }
private:
    std::fstream file;
    std::size_t size;
};
This is a naive way to deal with a file because it assumes random access is cheap. Random access works, but it may be slow: file streams are faster when you access data that are near each other (spatial locality).
Even so, it is a nice way to start dealing with your problem. You will gain experience with file I/O and will eventually figure out ways to improve performance for your specific problem. Let's keep to baby steps.
Another thing I want you to note is that every write you perform is handed to the fstream, which writes it to the file. The kernel will try to cache this and optimize for speed, but it would still be better to have some kind of cache layer of your own to avoid writing directly to disk every time.
Finally, I assumed that you are dealing with chars (because it is easier), but you can deal with other data types; you just need to be careful about the indexing and the size of the data type. For example, long long typically has a size of 8 bytes. If you want to access the first element of your file-array, you access position 8*0; if you want the element at index 10, you access position 8*10. In both cases you then read 8 bytes of data to reconstruct the long long value.

Related

Reading dynamic_bitset data written to a file does not give back the correct data

So I have a vector which has three numbers: 65, 66, and 67. I convert these numbers from int to binary and append them to a string, so the string becomes 100000110000101000011 (65, 66, 67 respectively). I write this data into a file through the dynamic_bitset library. I have a BitOperations class which does the reading from and writing to the file. When I read the data back from the file, instead of the above bits it gives me these bits: 001100010100001000001.
Here is my BitOperations class:
#include <iostream>
#include <boost/dynamic_bitset.hpp>
#include <fstream>
#include <streambuf>
#include "Utility.h"
using namespace std;
using namespace boost;

template <typename T>
class BitOperations {
private:
    T data;
    int size;
    dynamic_bitset<unsigned char> Bits;
    string fName;
    int bitSize;
public:
    BitOperations(dynamic_bitset<unsigned char> b){
        Bits = b;
        size = b.size();
    }
    BitOperations(dynamic_bitset<unsigned char> b, string fName){
        Bits = b;
        this->fName = fName;
        size = b.size();
    }
    BitOperations(T data, string fName, int bitSize){
        this->data = data;
        this->fName = fName;
        this->bitSize = bitSize;
    }
    BitOperations(int bitSize, string fName){
        this->bitSize = bitSize;
        this->fName = fName;
    }
    void writeToFile(){
        if (data != ""){
            vector<int> bitTemp = extractIntegersFromBin(data);
            for (int i = 0; i < bitTemp.size(); i++){
                Bits.push_back(bitTemp[i]);
            }
        }
        ofstream output(fName, ios::binary | ios::app);
        ostream_iterator<char> osit(output);
        to_block_range(Bits, osit);
        cout << "File Successfully modified" << endl;
    }
    dynamic_bitset<unsigned char> readFromFile(){
        ifstream input(fName);
        stringstream strStream;
        strStream << input.rdbuf();
        T str = strStream.str();
        dynamic_bitset<unsigned char> b;
        for (int i = 0; i < str.length(); i++){
            for (int j = 0; j < bitSize; ++j){
                bool isSet = str[i] & (1 << j);
                b.push_back(isSet);
            }
        }
        return b;
    }
};
And here is the code which calls these operations:
#include <iostream>
// #include <string.h>
#include <boost/dynamic_bitset.hpp>
#include "Utility/BitOps.h"

int main(){
    vector<int> v;
    v.push_back(65);
    v.push_back(66);
    v.push_back(67);
    stringstream ss;
    string st;
    for (int i = 0; i < v.size(); i++){
        ss = toBinary(v[i]);
        st += ss.str().c_str();
        cout << i << " )" << st << endl;
    }
    // reverse(st.begin(), st.end());
    cout << "Original: " << st << endl;
    BitOperations<string> b(st, "bits2.bin", 7);
    b.writeToFile();
    BitOperations<string> c(7, "bits2.bin");
    boost::dynamic_bitset<unsigned char> bits;
    bits = c.readFromFile();
    string s;
    // for (int i = 0; i < 16; i++){
    to_string(bits, s);
    // reverse(s.begin(), s.end());
    // }
    cout << "Decompressed: " << s << endl;
}
What am I doing wrong which results in incorrect behaviour?
EDIT: Here is the extractIntegersFromBin(string s) function.
vector<int> extractIntegersFromBin(string s){
    vector<int> nums;
    for (int i = 0; s[i]; i++){
        nums.push_back(s[i] - '0');
    }
    return nums;
}
Edit 2: Here is the code for toBinary:
stringstream toBinary(int n){
    vector<int> bin, bin2;
    int i = 0;
    while (n > 0){
        bin.push_back(n % 2);
        n /= 2;
        i++;
    }
    // for (int j = i-1; j >= 0; j--){
    //     bin2.push_back(bin[j]);
    // }
    reverse(bin.begin(), bin.end());
    stringstream s;
    for (int i = 0; i < bin.size(); i++){
        s << bin[i];
    }
    return s;
}
You are facing two different issues:
The boost function to_block_range will pad the output to the internal block size, by appending zeros at the end. In your case, the internal block size is sizeof(unsigned char)*8 == 8. So if the bit sequence you write to the file in writeToFile is not a multiple of 8, additional 0s will be written to make for a multiple of 8. So if you read the bit sequence back in with readFromFile, you have to find some way to remove the padding bits again.
There is no standard way for how to represent a bit sequence (reference). Depending on the scenario, it might be more convenient to represent the bits left-to-right or right-to-left (or some completely different order). For this reason, when you use different code pieces to print the same bit sequence and you want these code pieces to print the same result, you have to make sure that these code pieces agree on how to represent the bit sequence. If one piece of code prints left-to-right and the other right-to-left, you will get different results.
Let's discuss each issue individually:
Regarding issue 1
I understand that you want to define your own block size with the bitSize variable, on top of the internal block size of boost::dynamic_bitset. For example, in your main method, you construct BitOperations<string> c(7, "bits2.bin");. I understand that to mean that you expect the bit sequence stored in the file to have a length that is some multiple of 7.
If this understanding is correct, you can remove the padding bits that have been inserted by to_block_range by reading the file size and then rounding it down to the nearest multiple of your block size. Though you should note that you currently do not enforce this contract in the BitOperation constructor or in writeToFile (i.e. by ensuring that the data size is a multiple of 7).
In your readFromFile method, first note that the inner loop incorrectly takes bitSize into account. If bitSize is 7, it only considers the first 7 bits of each block, whereas the blocks written by to_block_range use the full 8 bits of each 1-byte block, since boost::dynamic_bitset does not know anything about your 7-bit block size. This makes you miss some bits.
Here is one example for how to fix your code:
size_t bitCount = (str.length() * 8) / bitSize * bitSize;
size_t bitsPerByte = 8;
for (int i = 0; i < bitCount; i++) {
    size_t index = (i / bitsPerByte);
    size_t offset = (i % bitsPerByte);
    bool isSet = (str[index] & (1 << offset));
    b.push_back(isSet);
}
This example first calculates how many bits should be read in total, by rounding down the file size to the nearest multiple of your block size. It then iterates over the full bytes in the input (i.e. the internal blocks that were written by boost::dynamic_bitset), until the targeted number of bits have been read. The remaining padding bits are discarded.
An alternative method would be to use boost::from_block_range. This allows you to get rid of some boilerplate code (i.e. reading the input into some string buffer):
dynamic_bitset<unsigned char> readFromFile() {
    ifstream input{fName};
    // Get file size
    input.seekg(0, ios_base::end);
    ssize_t fileSize{input.tellg()};
    // TODO Handle error: fileSize < 0
    // Reset to beginning of file
    input.clear();
    input.seekg(0);
    // Create bitset with desired size
    size_t bitsPerByte = 8;
    size_t bitCount = (fileSize * bitsPerByte) / bitSize * bitSize;
    dynamic_bitset<unsigned char> b{bitCount};
    // TODO Handle error: fileSize != b.num_blocks() * b.bits_per_block / bitsPerByte
    // Read file into bitset
    std::istream_iterator<char> iter{input};
    boost::from_block_range(iter, {}, b);
    return b;
}
Regarding issue 2
Once you have solved issue 1, the boost::dynamic_bitset that is written to the file by writeToFile will be the same as the one read by readFromFile. If you print both with the same method, the output will match. However, if you use different methods for printing, and these methods do not agree on the order in which to print the bits, you will get different results.
For example, in the output of your program you can now see that the "Original:" output is the same as "Decompressed:", except in reverse order:
Original: 100000110000101000011
...
Decompressed: 110000101000011000001
Again, this does not mean that readFromFile is working incorrectly, only that you are using different ways of printing the bit sequences.
The output for Original: is obtained by directly printing the 0/1 input string in main from left to right. In writeToFile, this string is then decomposed in the same order with extractIntegersFromBin and each bit is passed to the push_back method of boost::dynamic_bitset. The push_back method appends to the end of the bit sequence, meaning it will interpret each bit you pass as more significant than the previous (reference):
Effects: Increases the size of the bitset by one, and sets the value of the new most-significant bit to value.
Therefore, your input string is interpreted such that the first bit in the input string is the least significant bit (i.e. the "first" bit of the sequence), and the last bit of the input string is the most significant bit (i.e. the "last" bit of the sequence).
Whereas you construct the output for "Decompressed:" with to_string. From the documentation of this method, we can see that the least-significant bit of the bit sequence will be the last bit of the output string (reference):
Effects: Copies a representation of b into the string s. A character in the string is '1' if the corresponding bit is set, and '0' if it is not. Character position i in the string corresponds to bit position b.size() - 1 - i.
So the problem is simply that to_string (by design) prints in opposite order compared to the order in which you print the input string manually. So to fix this, you have to reverse one of these, i.e. by printing the input string by iterating over the string in reverse order, or by reversing the output of to_string.

Multithreaded storing into file

I have this code:
const int N = 100000000;

int main() {
    FILE* fp = fopen("result.txt", "w");
    for (int i = 0; i < N; ++i) {
        int res = f(i);
        fprintf(fp, "%d\t%d\n", i, res);
    }
    return 0;
}
Here f takes several milliseconds per call on average, in a single thread.
To make it faster I'd like to use multithreading.
1. What is the best way to get the next i? Or do I need to lock, get, add, and unlock?
2. Should the writing be done in a separate thread to make things easier?
3. Do I need temporary memory in case f(7) finishes before f(3)?
4. If 3, is it likely that f(3) is not computed for a long time and the temporary memory fills up?
I'm currently using C++11, but requiring a higher version of C++ may be acceptable.
General rule for how to improve performance:
1. Find a way to measure performance (an automated test).
2. Profile the existing code (find the bottlenecks).
3. Understand the findings from point 2 and try to fix them (without mutilating the code).
4. Measure again as in point 1 and decide whether the change provided the expected improvement.
5. Go back to point 2 a couple of times.
Only if steps 1 to 5 didn't help, try multithreading. The procedure is the same as in points 2-5, but you have to ask: can you split the large task into a couple of smaller ones? If yes, do they need synchronization? Can you avoid it?
Now, in your example, just split the result into 8 (or more) separate chunks and merge them at the end if you have to.
This can look like this:
#include <vector>
#include <future>
#include <fstream>

std::vector<int> multi_f(int start, int stop)
{
    std::vector<int> r;
    r.reserve(stop - start);
    for (; start < stop; ++start) r.push_back(f(start));
    return r;
}

int main()
{
    const int N = 100000000;
    const int tasks = 100;
    const int sampleCount = N / tasks;
    std::vector<std::future<std::vector<int>>> allResults;
    for (int i = 0; i < N; i += sampleCount) {
        // launch::async forces a real thread; the default policy may defer execution
        allResults.push_back(std::async(std::launch::async, &multi_f, i, i + sampleCount));
    }
    std::ofstream f{ "result.txt" }; // it is a myth that printf is faster
    int i = 0;
    for (auto& task : allResults)
    {
        for (auto r : task.get()) {
            f << i++ << '\t' << r << '\n';
        }
    }
    return 0;
}

How to read an "uneven" matrix from a file, and store into a 2D array?

I'm working on an experiment that requires me to switch over to C++, which I'm still learning. I need to read data from a file into a 2D array, where the data in the file is composed of floating point numbers, laid out in a matrix format. However, each row of the matrix in the data file has a different number of columns, for example:
1.24 3.55 6.00 123.5
65.8 45.2 1.0
1.1 389.66 101.2 34.5 899.12 23.7 12.1
The good news is that I know the maximum number of possible rows/columns the file might have, and at least right now, I'm not particularly worried about optimization for memory. What I would like is to have a 2D array where the corresponding rows/columns match those of the file, and all the other elements are some known "dummy" value.
The idea I had was to loop through each element of the file (row by row), recognize the end of a line, and then begin reading the next line. Unfortunately I'm having trouble executing this. For example:
#include <iostream>
#include <fstream>
using namespace std;

int main() {
    const int max_rows = 100;
    const int max_cols = 12;
    //initialize the 2D array with a known dummy
    float data[max_rows][max_cols] = {{-361}};
    //prepare the file for reading
    ifstream my_input_file;
    my_input_file.open("file_name.dat");
    int k1 = 0, k2 = 0; //the counters
    while (!my_input_file.eof()) { //keep looping until we reach the end of the file
        float data_point = my_input_file.get(); //get the current element from the file
        //somehow, recognize that we haven't reached the end of the line...?
        data[k1][k2] = data_point;
        //recognize that we have reached the end of the line
        //in this case, reset the counters
        k1 = 0;
        k2 = k2 + 1;
    }
}
And so I haven't been able to figure out the indexing. Part of the problem is that although I'm aware that the character "\n" marks the end of a line, it is of a different type compared to the floating point numbers in the file, and so I'm at a loss. Am I thinking about this the wrong way?
You don't need to know the limits in advance if you stick to std::vector. Here's some example code that will read the file (assuming there is nothing but floats in the file).
#include <algorithm>
#include <fstream>
#include <sstream>
#include <string>
#include <vector>

using Row = std::vector<float>;
using Array2D = std::vector<Row>;

int main() {
    Array2D data;
    std::ifstream in("file_name.dat");
    std::string line;
    Row::size_type max_cols = 0U;
    while (std::getline(in, line)) { // will stop at EOF
        Row newRow;
        std::istringstream iss(line);
        Row::value_type nr;
        while (iss >> nr) // will stop at end-of-line
            newRow.push_back(nr);
        max_cols = std::max(max_cols, newRow.size());
        data.push_back(std::move(newRow)); // using move to avoid a copy
    }
    // make all rows the same length, padding shorter ones with dummy value -361.0f
    for (auto & row : data)
        row.resize(max_cols, -361.0f);
}
Here is a starting point: use a two-dimensional vector, vector<vector<double>>. You can use getline() to read each line as a string, then use a string stream to extract the decimal values from it.
And here is the code
#include <iostream>
#include <vector>
#include <fstream>
#include <sstream>

int main (void) {
    std::vector<std::vector<double>> matrix;
    std::ifstream inputFile;
    inputFile.open("test.txt");
    char line[99];
    for (int i = 0; inputFile.getline(line, sizeof(line)); ++i) {
        std::stringstream strStream(line);
        double val = 0.0;
        matrix.push_back(std::vector<double>());
        while (strStream >> val)
            matrix[i].push_back(val);
    }
    return 0;
}

calculating occurrences of values in an input stream

I am a college student who is currently learning programming. One of the problem statements given to us was:
The user inputs an integer n followed by n integers. Without using arrays or strings, find the number which occurs the most times in the input stream.
We are required to use the simplecpp package, which basically provides simpler commands than standard C++; for example, we write repeat(n) to get a for loop with n iterations.
What can I do to solve the problem?
I thought of creating a number like
10101010[number]10101010[number2]...
to store the input and then splitting it, but this fails to solve the problem.
We are not allowed to use anything like while loops or string manipulation to solve it. The only solution I could think of was using the string method and then manipulating the string, but apparently that is not allowed.
Is there any method to do this, and other such problems where the input cannot be stored in an array?
Assuming you are allowed to use normal streams and other std:: headers that are not specifically forbidden:
#include <iostream>
#include <map>
#include <algorithm>

using counter = std::map<int, int>;
using element = std::map<int, int>::const_reference;

bool compare_second(element lhs, element rhs)
{
    return lhs.second < rhs.second;
}

int main()
{
    counter data;
    int n = 0;
    std::cin >> n;
    for (int i = 0; i < n; ++i)
    {
        int current = 0;
        std::cin >> current;
        data[current]++;
    }
    if (n > 0) // also handles a single-element input
    {
        // ->first is the value that occurs most often; ->second would be its count
        std::cout << std::max_element(data.begin(), data.end(), compare_second)->first;
    }
    return 0;
}

Processing a large image text file

I'm using C++ and have a 1234 by 1234 text file with values from 0 to 255. I have been trying to speed up my code because it's used in real time with the user. Right now it takes 0.5 seconds to run, with 0.4 seconds devoted to reading the text file into a vector<vector<int>>. I am using getline and then istringstream. Below is the code I'm currently using. There is some stuff in there where I get rid of the first and last 50 columns, and read the first chunk of rows into one vector and the second chunk into another, because that's how I need it for processing purposes.
void readInRawData(string fileName, int start, int split, int finish,
                   vector< vector<int> > &rawArrayTop, vector< vector<int> > &rawArrayBottom)
{
    string line;
    vector<int> rawRow;
    int counter = 0;
    int value = 0;
    int numberOfColumns = 0, numberOfRows = 0;
    ifstream rawImage;
    rawImage.open(fileName.c_str()); // open file using fileName
    if (rawImage.is_open() && !is_empty(rawImage))
    {
        int length = 0;
        getline(rawImage, line);
        istringstream ss(line);
        while (ss >> value) // count the values in the first row
        {
            length++;
        }
        while (getline(rawImage, line)) // skip rows until 'start'
        {
            if (counter < start)
            {
            }
            else
            {
                break;
            }
            counter++;
        }
        while (getline(rawImage, line)) // get row
        {
            if (counter < split)
            {
                rawRow.clear();
                istringstream ss(line);
                for (int i = 0; i < 50; i++) // discard the first 50 columns
                {
                    ss >> value;
                }
                for (int i = 0; i < length - 100; i++)
                {
                    ss >> value;
                    rawRow.push_back(value);
                }
                rawArrayTop.push_back(rawRow);
            }
            else
            {
                break;
            }
            counter++;
        }
        while (getline(rawImage, line)) // get row
        {
            if (counter < finish)
            {
                rawRow.clear();
                istringstream ss(line);
                for (int i = 0; i < 50; i++) // discard the first 50 columns
                {
                    ss >> value;
                }
                for (int i = 0; i < length - 100; i++)
                {
                    ss >> value;
                    rawRow.push_back(value);
                }
                rawArrayBottom.push_back(rawRow);
            }
            else
            {
                break;
            }
            counter++;
        }
        rawImage.close();
    }
    // if it can't be opened, throw an error
    else
    {
        throw rawArrayTop;
    }
}
To get a real increase in performance, you'll have to rewrite totally.
#include <cctype>
#include <stdio.h>

/* Parse whitespace-separated decimal values into a preallocated buffer. */
void parseValues(FILE *fp, int *out)
{
    int ch, sample = 0, onsample = 0;
    while ((ch = fgetc(fp)) != EOF)
    {
        if (isdigit(ch))
        {
            sample = sample * 10 + ch - '0';
            onsample = 1;
        }
        else
        {
            if (onsample)
            {
                *out++ = sample;
                sample = 0;
                onsample = 0;
            }
        }
    }
    if (onsample) /* flush the last value if the file doesn't end in whitespace */
        *out++ = sample;
}
Set up out with malloc(width * height * sizeof(int)). Now it should zip through the file almost as fast as it can be read.
I will not give you the code, but I will suggest how to proceed here:
Parsing text takes a long time. If real-time performance is important, pre-process the file into a binary format, which can be loaded directly with read/write functions. You will need to create a stream in binary mode from the binary file and use istream::read.
Try to avoid vector<vector<int>> unless you use scoped allocators, which I assume you are not using; it is bad for the cache. A much better fit is a single vector with n * m reserved space.
If you need two-dimensional access, you can just write your own functions for that:
using Matrix = vector<int>;
int & idx(Matrix &mat, size_t row, size_t col); // maps (row, col) to row * cols + col
Matrix mat(m * n);
idx(mat, 2, 3) = 17;
Another concern is how you load the data into the Matrix. If you want to avoid redundant initialization and at the same time reserve memory before loading the data, that is not possible with the standard vector, but you can use Boost.Container, whose vector has a resize overload taking default_init_t. That will not value-initialize the elements of the vector.
If the values are between 0 and 255, use unsigned char, not int: you will fit more data in the cache at once.