Calculating a file's mean value of data bytes

Just for fun, I am trying to calculate a file's mean value of data bytes, essentially replicating a feature available in an existing tool (ent). It is simply the result of summing all the bytes of a file and dividing by the file length. If the data are close to random, this should be about 127.5. I am testing two methods of computing the mean value: one is a simple for loop over an unordered_map of byte frequencies, and the other uses std::accumulate directly on a string object.
Benchmarking both methods shows that std::accumulate is much slower than the simple for loop. Also, measured on my system, the accumulate method is on average about 4 times faster when built with clang++ than with g++.
So here are my questions:
Why is the for loop method producing bad output at around 2.5GB input for g++ but not with clang++? My guess is that I am doing something wrong (probably UB) that happens to work with clang++. (solved and code modified accordingly)
Why is the std::accumulate method so much slower on g++ with the same optimization settings?
Thanks!
Compiler info (target is x86_64-pc-linux-gnu):
clang version 11.1.0
gcc version 11.1.0 (GCC)
Build info:
g++ -Wall -Wextra -pedantic -O3 -DNDEBUG -std=gnu++2a main.cpp -o main-g
clang++ -Wall -Wextra -pedantic -O3 -DNDEBUG -std=gnu++20 main.cpp -o main-clang
Sample file (using random data):
dd if=/dev/urandom iflag=fullblock bs=1G count=8 of=test-8g.bin (example for 8GB random data file)
Code:
#include <chrono>
#include <filesystem>
#include <fstream>
#include <iostream>
#include <numeric>
#include <stdexcept>
#include <string>
#include <unordered_map>
auto main(int argc, char** argv) -> int {
using std::cout;
std::filesystem::path file_path{};
if (argc == 2) {
file_path = std::filesystem::path(argv[1]);
} else {
return 1;
}
std::string input{};
std::unordered_map<char, int> char_map{};
std::ifstream istrm(file_path, std::ios::binary);
if (!istrm.is_open()) {
throw std::runtime_error("Could not open file");
}
const auto file_size = std::filesystem::file_size(file_path);
input.resize(file_size);
istrm.read(input.data(), static_cast<std::streamsize>(file_size));
istrm.close();
// store frequency of individual chars in unordered_map
for (const auto& c : input) {
if (!char_map.contains(c)) {
char_map.insert(std::pair<char, int>(c, 1));
} else {
char_map[c]++;
}
}
double sum_for_loop = 0.0;
cout << "using for loop\n";
// start stopwatch
auto start_timer = std::chrono::steady_clock::now();
// for loop method
for (const auto& item : char_map) {
sum_for_loop += static_cast<unsigned char>(item.first) * static_cast<double>(item.second);
}
// stop stopwatch
cout << std::chrono::duration<double>(std::chrono::steady_clock::now() - start_timer).count() << " s\n";
auto mean_for_loop = static_cast<double>(sum_for_loop) / static_cast<double>(input.size());
cout << std::fixed << "sum_for_loop: " << sum_for_loop << " size: " << input.size() << '\n';
cout << "mean value of data bytes: " << mean_for_loop << '\n';
cout << "using accumulate()\n";
// start stopwatch
start_timer = std::chrono::steady_clock::now();
// accumulate method, but is slow (much slower in g++)
auto sum_accum =
std::accumulate(input.begin(), input.end(), 0.0, [](auto current_val, auto each_char) { return current_val + static_cast<unsigned char>(each_char); });
// stop stopwatch
cout << std::chrono::duration<double>(std::chrono::steady_clock::now() - start_timer).count() << " s\n";
auto mean_accum = sum_accum / static_cast<double>(input.size());
cout << std::fixed << "sum_for_loop: " << sum_accum << " size: " << input.size() << '\n';
cout << "mean value of data bytes: " << mean_accum << '\n';
}
Sample output from 2GB file (clang++):
using for loop
2.024e-05 s
sum_for_loop: 273805913805 size: 2147483648
mean value of data bytes: 127.500814
using accumulate()
1.317576 s
sum_for_loop: 273805913805.000000 size: 2147483648
mean value of data bytes: 127.500814
Sample output from 2GB file (g++):
using for loop
2.41e-05 s
sum_for_loop: 273805913805 size: 2147483648
mean value of data bytes: 127.500814
using accumulate()
5.269024 s
sum_for_loop: 273805913805.000000 size: 2147483648
mean value of data bytes: 127.500814
Sample output from 8GB file (clang++):
using for loop
1.853e-05 s
sum_for_loop: 1095220441576 size: 8589934592
mean value of data bytes: 127.500440
using accumulate()
5.247585 s
sum_for_loop: 1095220441576.000000 size: 8589934592
mean value of data bytes: 127.500440
Sample output from 8GB file (g++):
using for loop
7.5e-07 s
sum_for_loop: 1095220441576.000000 size: 8589934592
mean value of data bytes: 127.500440
using accumulate()
21.484348 s
sum_for_loop: 1095220441576.000000 size: 8589934592
mean value of data bytes: 127.500440

There are numerous issues with the code. The first - and the one that is causing your display problem - is that sum_for_loop should be a double, not an unsigned long. The sum is overflowing what can be stored in an unsigned long, resulting in your incorrect result when that happens.
The timers should be started after the cout, otherwise you're including the output time in the compute time. In addition, the "for loop" elapsed time excludes the time taken to construct char_map.
When building char_map, you don't need the if. If an entry is not found in the map it will be zero initialized. A better approach (since you only have 256 unique values) is to use an indexed vector (remembering to cast the char to unsigned char).
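The indexed-container suggestion above could look roughly like the following. This is only a sketch: the function names byte_histogram and mean_byte_value are hypothetical, and a std::array is used in place of a vector because the size (256) is fixed.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <string>

// Count how often each byte value occurs; the index is the byte value (0..255).
// byte_histogram is a hypothetical name, not taken from the question's code.
std::array<std::uint64_t, 256> byte_histogram(const std::string& data) {
    std::array<std::uint64_t, 256> counts{};  // all counters start at zero
    for (char c : data) {
        // Cast to unsigned char so bytes >= 0x80 do not become negative indices.
        ++counts[static_cast<unsigned char>(c)];
    }
    return counts;
}

// Mean byte value computed from the histogram (hypothetical helper).
double mean_byte_value(const std::string& data) {
    if (data.empty()) {
        return 0.0;  // avoid dividing by zero for an empty input
    }
    const auto counts = byte_histogram(data);
    double sum = 0.0;
    for (std::size_t value = 0; value < counts.size(); ++value) {
        sum += static_cast<double>(value) * static_cast<double>(counts[value]);
    }
    return sum / static_cast<double>(data.size());
}
```

Calling mean_byte_value(input) on the string read from the file would replace both the unordered_map and the summing loop; timing that call would also include the cost of building the histogram.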

Related

Is there an easy way to add leading zeros to a string created via std::to_string(int)?

I got the following std::string that I partially create through converting an int to a string:
std::string text = std::string("FPS: " + std::to_string(1000 / d));
Example output:
FPS: 60
Now, I would like to add leading zeros specifically to the int part, such that I get this output:
FPS: 060
I already know how to achieve this for stdout with std::cout << std::setfill('0') << std::setw(5) .
But I haven't found a solution for the simple std::to_string() conversion.

Use a stringstream.
It behaves exactly like std::cout but has a method str() to get the string you created.
For your problem it would probably look like this:
std::stringstream ss;
ss << "FPS: " << std::setfill('0') << std::setw(5) << std::to_string(1000 / d);
std::string text(ss.str());
Edit: To test the performance of this I created a dumb test program:
#include <sstream>
#include <iomanip>
int main()
{
std::stringstream ss;
for(int i=0; i<100000; i++)
{
ss.clear();
ss << "FPS: " << std::setfill('0') << std::setw(5) << std::to_string(i);
std::string text(ss.str());
}
return 0;
}
and compiled it with g++ -O3 main.cpp. I then opened a terminal and started the program through time:
$ time ./a.out
./a.out 1,53s user 0,01s system 99% cpu 1,536 total
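So 1.53 s for 100k iterations (15.3 µs per iteration on average) on my Intel(R) Core(TM) i5-8500 CPU @ 3.00GHz running a Linux 5.13.5 kernel with the latest libstdc++. Note that ss.clear() only resets the stream's error flags and does not empty the buffer, so the string returned by ss.str() grows on every iteration and this figure overstates the cost of a single format.
That is a lot in terms of instructions: tens of thousands of instructions per call is costly on a small microprocessor, but on a modern system it is hardly ever a problem.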
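For reference, the stringstream approach can be wrapped in a small function. This is a minimal sketch, assuming a hypothetical helper named format_fps and a default width of 3 so that the result matches the "FPS: 060" output asked for in the question. The integer is streamed directly, so std::to_string is not needed.

```cpp
#include <iomanip>
#include <sstream>
#include <string>

// Hypothetical helper: formats an FPS value zero-padded to at least `width` digits.
std::string format_fps(int fps, int width = 3) {
    std::ostringstream ss;  // a fresh stream per call, so nothing accumulates
    // std::setw applies only to the next inserted item; std::setfill persists.
    ss << "FPS: " << std::setfill('0') << std::setw(width) << fps;
    return ss.str();
}
```

With these assumptions, format_fps(60) yields "FPS: 060", and values that already have three or more digits are left unchanged.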

In C++20, you might use std::format
std::format("{:03}", 1000 / d);

You could count the length of the converted string and use that to create a string with zeroes:
size_t min_len = 3;
std::string text = std::to_string(1000 / d);
if(text.size() < min_len) text = std::string(min_len - text.size(), '0') + text;
text = "FPS: " + text;
A performance test comparing using this "strings only" approach to that of using std::stringstream may be interesting if you do this formatting a lot:
quick-bench.com

Setprecision in a function is also applying in another function. I can't seem to know why [duplicate]

I want to control the precision for a double during a comparison, and then come back to default precision, with C++.
I intend to use setPrecision() to set precision. What is the syntax, if any, to set precision back to default?
I am doing something like this
std::setPrecision(math.log10(m_FTOL));
I do some stuff, and I would like to come back to default double comparison right afterwards.
I modified like this, and I still have some errors
std::streamsize prec = std::ios_base::precision();
std::setprecision(cmath::log10(m_FTOL));
with cmath false at compilation, and std::ios_base also false at compilation. Could you help?

You can get the precision before you change it, with std::ios_base::precision and then use that to change it back later.
You can see this in action with:
#include <ios>
#include <iostream>
#include <iomanip>
int main (void) {
double d = 3.141592653589;
std::streamsize ss = std::cout.precision();
std::cout << "Initial precision = " << ss << '\n';
std::cout << "Value = " << d << '\n';
std::cout.precision (10);
std::cout << "Longer value = " << d << '\n';
std::cout.precision (ss);
std::cout << "Original value = " << d << '\n';
std::cout << "Longer and original value = "
<< std::setprecision(10) << d << ' '
<< std::setprecision(ss) << d << '\n';
std::cout << "Original value = " << d << '\n';
return 0;
}
which outputs:
Initial precision = 6
Value = 3.14159
Longer value = 3.141592654
Original value = 3.14159
Longer and original value = 3.141592654 3.14159
Original value = 3.14159
The code above shows two ways of setting the precision, first by calling std::cout.precision (N) and second by using a stream manipulator std::setprecision(N).
But you need to keep in mind that the precision is for outputting values via streams, it does not directly affect comparisons of the values themselves with code like:
if (val1== val2) ...
In other words, even though the output may be 3.14159, the value itself is still the full 3.141592653590 (subject to normal floating point limitations, of course).
If you want to do that, you'll need to check if it's close enough rather than equal, with code such as:
if (fabs (val1 - val2) < 0.0001) ...
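The "close enough" check can be put into a small function. This is a sketch only: nearly_equal is a hypothetical name, and it uses a fixed absolute tolerance, which suits values of similar magnitude but is not a general-purpose floating point comparison.

```cpp
#include <cmath>

// Hypothetical helper: true when a and b differ by less than `tolerance`.
// Uses an absolute tolerance; very large or very small values may need a relative one.
bool nearly_equal(double a, double b, double tolerance = 1e-4) {
    return std::fabs(a - b) < tolerance;
}
```

Under these assumptions, nearly_equal(3.14159, 3.141592653589) is true, whereas comparing the same two values with == is false.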

Use C++20 std::format and {:.2} instead of std::setprecision
Finally, this will be the superior choice once you can use it:
#include <format>
#include <iostream>
int main() {
std::cout << std::format("{:.3} {:.4}\n", 3.1415, 3.1415);
}
Expected output:
3.14 3.142
This will therefore completely overcome the madness of modifying std::cout state.
The existing fmt library implements the same interface and can be used until your standard library supports std::format: https://github.com/fmtlib/fmt Install on Ubuntu 22.04:
sudo apt install libfmt-dev
Modify source to replace:
<format> with <fmt/core.h>
std::format to fmt::format
main.cpp
#include <iostream>
#include <fmt/core.h>
int main() {
std::cout << fmt::format("{:.3} {:.4}\n", 3.1415, 3.1415);
}
and compile and run with:
g++ -std=c++11 -o main.out main.cpp -lfmt
./main.out
Output:
3.14 3.142
See also:
How do I print a double value with full precision using cout?
std::string formatting like sprintf
Pre C++20/fmt: Save the entire state with std::ios::copyfmt
You might also want to restore the entire previous state with std::ios::copyfmt in these situations, as explained at: Restore the state of std::cout after manipulating it
main.cpp
#include <iomanip>
#include <iostream>
int main() {
constexpr float pi = 3.14159265359;
constexpr float e = 2.71828182846;
// Sanity check default print.
std::cout << "default" << std::endl;
std::cout << pi << std::endl;
std::cout << e << std::endl;
std::cout << std::endl;
// Change precision format to scientific,
// and restore default afterwards.
std::cout << "modified" << std::endl;
std::ios cout_state(nullptr);
cout_state.copyfmt(std::cout);
std::cout << std::setprecision(2);
std::cout << std::scientific;
std::cout << pi << std::endl;
std::cout << e << std::endl;
std::cout.copyfmt(cout_state);
std::cout << std::endl;
// Check that cout state was restored.
std::cout << "restored" << std::endl;
std::cout << pi << std::endl;
std::cout << e << std::endl;
std::cout << std::endl;
}
GitHub upstream.
Compile and run:
g++ -ggdb3 -O0 -std=c++11 -Wall -Wextra -pedantic -o main.out main.cpp
./main.out
Output:
default
3.14159
2.71828
modified
3.14e+00
2.72e+00
restored
3.14159
2.71828
Tested on Ubuntu 19.04, GCC 8.3.0.

You need to keep track of the current precision and then reset it once you are done with the operations that need the modified precision. For this you can use std::ios_base::precision:
streamsize precision ( ) const;
streamsize precision ( streamsize prec );
The first syntax returns the value of the current floating-point precision field for the stream.
The second syntax also sets it to a new value.
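Because the second overload returns the previous precision, saving and restoring can be tied to a scope. The following is a sketch under that assumption; PrecisionGuard and print_with_precision are hypothetical names, not part of the standard library.

```cpp
#include <ios>
#include <iostream>

// Hypothetical RAII helper: sets a new precision and restores the old one on scope exit.
class PrecisionGuard {
public:
    PrecisionGuard(std::ios_base& stream, std::streamsize new_precision)
        : stream_(stream), saved_(stream.precision(new_precision)) {}  // returns the old value
    ~PrecisionGuard() { stream_.precision(saved_); }
    PrecisionGuard(const PrecisionGuard&) = delete;
    PrecisionGuard& operator=(const PrecisionGuard&) = delete;

private:
    std::ios_base& stream_;
    std::streamsize saved_;
};

// Prints one value with a temporary precision; the stream's precision is unchanged afterwards.
void print_with_precision(std::ostream& out, double value, std::streamsize digits) {
    PrecisionGuard guard(out, digits);
    out << value << '\n';
}
```

A call such as print_with_precision(std::cout, 3.141592653589, 10) prints with ten significant digits and leaves std::cout at whatever precision it had before.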

setprecision() only affects output operations; it cannot be used for comparisons.
To compare floats, say a and b, you have to do it explicitly like this (std::fabs is declared in <cmath>):
if (std::fabs(a - b) < 1e-6) {
// a and b are considered equal
}
else {
// a and b differ
}

You can use cout << setprecision(-1)

Flatbuffers struct in union not working (C++)

I am trying to get going with Flatbuffers in C++, but I'm already failing to write and read a struct in a union. I have reduced my original problem to an anonymous, minimal example.
Example Schema (favorite.fbs)
// favorite.fbs
struct FavoriteNumbers
{
first: uint8;
second: uint8;
third: uint8;
}
union Favorite
{ FavoriteNumbers }
table Data
{ favorite: Favorite; }
root_type Data;
I compiled the schema using Flatbuffers 1.11.0 downloaded from the release page (I'm on Windows so to be safe I used the precompiled binaries).
flatc --cpp favorite.fbs
This generates the file favorite_generated.h.
Example Code (fav.cpp)
#include <iostream>
#include "favorite_generated.h"
int main(int, char**)
{
using namespace flatbuffers;
FlatBufferBuilder builder;
// prepare favorite numbers and write them to the buffer
FavoriteNumbers inFavNums(17, 42, 7);
auto inFav{builder.CreateStruct(&inFavNums)};
auto inData{CreateData(builder, Favorite_FavoriteNumbers, inFav.Union())};
builder.Finish(inData);
// output original numbers from struct used to write (just to be safe)
std::cout << "favorite numbers written: "
<< +inFavNums.first() << ", "
<< +inFavNums.second() << ", "
<< +inFavNums.third() << std::endl;
// output final buffer size
std::cout << builder.GetSize() << " B written" << std::endl;
// read from the buffer just created
auto outData{GetData(builder.GetBufferPointer())};
auto outFavNums{outData->favorite_as_FavoriteNumbers()};
// output read numbers
std::cout << "favorite numbers read: "
<< +outFavNums->first() << ", "
<< +outFavNums->second() << ", "
<< +outFavNums->third() << std::endl;
return 0;
}
I'm using unary + to force numerical output instead of characters. An answer to another question here on StackOverflow told me I had to use CreateStruct to achieve what I want. I compiled the code using g++ 9.1.0 (by MSYS2).
g++ -std=c++17 -Ilib/flatbuffers/include fav.cpp -o main.exe
This generates the file main.exe.
Output
favorite numbers written: 17, 42, 7
32 B written
favorite numbers read: 189, 253, 34
Obviously this is not the desired outcome. What am I doing wrong?

Remove the & in front of inFavNums and it will work.
CreateStruct is a template function, which sadly means that in this case it will also accept a pointer without complaining. It would be nice to prevent that, but it isn't that easy in C++.
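The effect can be reproduced without FlatBuffers. In this sketch, copy_bytes is a hypothetical stand-in for a template that copies sizeof(T) raw bytes of its argument; it is not the FlatBuffers implementation. Passing the address deduces T as a pointer type, so the bytes of the pointer are copied rather than the struct's fields.

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

// Three one-byte fields, mirroring the FavoriteNumbers struct from the schema.
struct Numbers {
    unsigned char first;
    unsigned char second;
    unsigned char third;
};

// Hypothetical stand-in for a builder method: copies the raw bytes of `value`.
template <typename T>
std::vector<unsigned char> copy_bytes(const T& value) {
    std::vector<unsigned char> out(sizeof(T));
    std::memcpy(out.data(), &value, sizeof(T));
    return out;
}

// T is deduced as Numbers: the three field bytes are copied.
std::vector<unsigned char> copy_struct(const Numbers& n) { return copy_bytes(n); }

// T is deduced as const Numbers*: the bytes of the address are copied instead.
std::vector<unsigned char> copy_pointer(const Numbers& n) { return copy_bytes(&n); }
```

Both calls compile without a warning, which is why the mistake in the question goes unnoticed until the values are read back.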

boost::filesystem::space() is reporting wrong diskspace

I have 430 GB free on drive C:. But for this program:
#include <iostream>
#include <boost/filesystem.hpp>
int main()
{
boost::filesystem::path p("C:");
std::size_t freeSpace = boost::filesystem::space(p).free;
std::cout<<freeSpace << " Bytes" <<std::endl;
std::cout<<freeSpace / (1 << 20) << " MB"<<std::endl;
std::size_t availableSpace = boost::filesystem::space(p).available;
std::cout << availableSpace << " Bytes" <<std::endl;
std::cout << availableSpace / (1 << 20) << " MB"<<std::endl;
std::size_t totalSpace = boost::filesystem::space(p).capacity;
std::cout << totalSpace << " Bytes" <<std::endl;
std::cout << totalSpace / (1 << 20) << " MB"<<std::endl;
return 0;
}
The output is:
2542768128 Bytes
2424 MB
2542768128 Bytes
2424 MB
2830102528 Bytes
2698 MB
I need to know how much diskspace is available because my application has to download a huge file, and I need to know whether it's viable to download it.
I'm using mingw on Windows:
g++ (i686-posix-dwarf-rev2, Built by MinGW-W64 project) 7.1.0
I also tried using MXE to cross compile from Linux:
i686-w64-mingw32.static-g++ (GCC) 5.5.0
Both are returning the same numbers.

std::size_t is not guaranteed to be the biggest standard unsigned type. Actually, it rarely is.
And boost::filesystem defines space_info thus:
struct space_info // returned by space function
{
uintmax_t capacity;
uintmax_t free;
uintmax_t available; // free space available to a non-privileged process
};
You would have easily avoided the error by using auto, which is natural here because the exact type is not important. Usually only a type mismatch hurts, hence the guideline Almost Always Auto.

Use the type that boost::filesystem::space(p).free actually has. It may require a 64-bit integer type:
uintmax_t freeSpace = boost::filesystem::space(p).free;
Using auto is also good.
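The truncation can be shown in isolation. This is a sketch with hypothetical helper names; the 32-bit cast mimics what happens when a uintmax_t value is stored in std::size_t on a 32-bit target such as the i686 MinGW builds in the question.

```cpp
#include <cstdint>

// Hypothetical helper: mimics storing a byte count in a 32-bit std::size_t.
// Only the low 32 bits survive, i.e. the value modulo 2^32.
std::uint32_t truncate_to_32_bits(std::uintmax_t bytes) {
    return static_cast<std::uint32_t>(bytes);
}

// Converts bytes to MiB without narrowing the type.
std::uintmax_t to_mebibytes(std::uintmax_t bytes) {
    return bytes >> 20;
}
```

Any byte count of 4 GiB or more loses its upper bits in the first function, whereas keeping the value in std::uintmax_t (or using auto) preserves it.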
