boost::filesystem::space() is reporting wrong diskspace - c++

I have 430 GB free on drive C:. But for this program:
#include <iostream>
#include <boost/filesystem.hpp>

int main()
{
    boost::filesystem::path p("C:");

    std::size_t freeSpace = boost::filesystem::space(p).free;
    std::cout << freeSpace << " Bytes" << std::endl;
    std::cout << freeSpace / (1 << 20) << " MB" << std::endl;

    std::size_t availableSpace = boost::filesystem::space(p).available;
    std::cout << availableSpace << " Bytes" << std::endl;
    std::cout << availableSpace / (1 << 20) << " MB" << std::endl;

    std::size_t totalSpace = boost::filesystem::space(p).capacity;
    std::cout << totalSpace << " Bytes" << std::endl;
    std::cout << totalSpace / (1 << 20) << " MB" << std::endl;

    return 0;
}
The output is:
2542768128 Bytes
2424 MB
2542768128 Bytes
2424 MB
2830102528 Bytes
2698 MB
I need to know how much diskspace is available because my application has to download a huge file, and I need to know whether it's viable to download it.
I'm using mingw on Windows:
g++ (i686-posix-dwarf-rev2, Built by MinGW-W64 project) 7.1.0
I also tried using MXE to cross compile from Linux:
i686-w64-mingw32.static-g++ (GCC) 5.5.0
Both are returning the same numbers.

std::size_t is not guaranteed to be the biggest standard unsigned type, and it rarely is. With a 32-bit MinGW target (such as your i686 toolchains), std::size_t is only 32 bits wide, so the 64-bit values returned by space() are truncated modulo 2^32, which is exactly the effect you are seeing.
And boost::filesystem defines space_info thus:
struct space_info  // returned by the space function
{
    uintmax_t capacity;
    uintmax_t free;
    uintmax_t available;  // free space available to a non-privileged process
};
You could easily have avoided the error by using auto, which is natural here since the exact type is of no importance; almost always it is only a mismatch that hurts. Hence the guideline "Almost Always Auto".

Use the type that boost::filesystem::space(p).free actually returns, which is uintmax_t, a 64-bit (or wider) unsigned integer type:
uintmax_t freeSpace = boost::filesystem::space(p).free;
Using auto also works.


Calculating a file's mean value of data bytes

Just for fun, I am trying to calculate a file's mean value of data bytes, essentially replicating a feature available in an already existing tool (ent). Basically, it is simply the result of summing all the bytes of a file and dividing by the file length. If the data are close to random, this should be about 127.5. I am testing 2 methods of computing the mean value, one is a simple for loop which works on an unordered_map and the other is using std::accumulate directly on a string object.
Benchmarking both methods shows that std::accumulate is much slower than the simple for loop. Also, measured on my system, clang++ is on average about 4 times faster than g++ for the accumulate method.
So here are my questions:
Why is the for loop method producing bad output at around 2.5GB input for g++ but not with clang++. My guess is I am doing things wrong (UB probably), but they happen to work with clang++. (solved and code modified accordingly)
Why is the std::accumulate method so much slower on g++ with the same optimization settings?
Thanks!
Compiler info (target is x86_64-pc-linux-gnu):
clang version 11.1.0
gcc version 11.1.0 (GCC)
Build info:
g++ -Wall -Wextra -pedantic -O3 -DNDEBUG -std=gnu++2a main.cpp -o main-g
clang++ -Wall -Wextra -pedantic -O3 -DNDEBUG -std=gnu++20 main.cpp -o main-clang
Sample file (using random data):
dd if=/dev/urandom iflag=fullblock bs=1G count=8 of=test-8g.bin (example for 8GB random data file)
Code:
#include <chrono>
#include <filesystem>
#include <fstream>
#include <iostream>
#include <numeric>
#include <stdexcept>
#include <string>
#include <unordered_map>

auto main(int argc, char** argv) -> int {
    using std::cout;
    std::filesystem::path file_path{};
    if (argc == 2) {
        file_path = std::filesystem::path(argv[1]);
    } else {
        return 1;
    }
    std::string input{};
    std::unordered_map<char, int> char_map{};
    std::ifstream istrm(file_path, std::ios::binary);
    if (!istrm.is_open()) {
        throw std::runtime_error("Could not open file");
    }
    const auto file_size = std::filesystem::file_size(file_path);
    input.resize(file_size);
    istrm.read(input.data(), static_cast<std::streamsize>(file_size));
    istrm.close();
    // store frequency of individual chars in unordered_map
    for (const auto& c : input) {
        if (!char_map.contains(c)) {
            char_map.insert(std::pair<char, int>(c, 1));
        } else {
            char_map[c]++;
        }
    }
    double sum_for_loop = 0.0;
    cout << "using for loop\n";
    // start stopwatch
    auto start_timer = std::chrono::steady_clock::now();
    // for loop method
    for (const auto& item : char_map) {
        sum_for_loop += static_cast<unsigned char>(item.first) * static_cast<double>(item.second);
    }
    // stop stopwatch
    cout << std::chrono::duration<double>(std::chrono::steady_clock::now() - start_timer).count() << " s\n";
    auto mean_for_loop = static_cast<double>(sum_for_loop) / static_cast<double>(input.size());
    cout << std::fixed << "sum_for_loop: " << sum_for_loop << " size: " << input.size() << '\n';
    cout << "mean value of data bytes: " << mean_for_loop << '\n';
    cout << "using accumulate()\n";
    // start stopwatch
    start_timer = std::chrono::steady_clock::now();
    // accumulate method, but is slow (much slower in g++)
    auto sum_accum = std::accumulate(input.begin(), input.end(), 0.0,
        [](auto current_val, auto each_char) { return current_val + static_cast<unsigned char>(each_char); });
    // stop stopwatch
    cout << std::chrono::duration<double>(std::chrono::steady_clock::now() - start_timer).count() << " s\n";
    auto mean_accum = sum_accum / static_cast<double>(input.size());
    cout << std::fixed << "sum_for_loop: " << sum_accum << " size: " << input.size() << '\n';
    cout << "mean value of data bytes: " << mean_accum << '\n';
}
Sample output from 2GB file (clang++):
using for loop
2.024e-05 s
sum_for_loop: 273805913805 size: 2147483648
mean value of data bytes: 127.500814
using accumulate()
1.317576 s
sum_for_loop: 273805913805.000000 size: 2147483648
mean value of data bytes: 127.500814
Sample output from 2GB file (g++):
using for loop
2.41e-05 s
sum_for_loop: 273805913805 size: 2147483648
mean value of data bytes: 127.500814
using accumulate()
5.269024 s
sum_for_loop: 273805913805.000000 size: 2147483648
mean value of data bytes: 127.500814
Sample output from 8GB file (clang++):
using for loop
1.853e-05 s
sum_for_loop: 1095220441576 size: 8589934592
mean value of data bytes: 127.500440
using accumulate()
5.247585 s
sum_for_loop: 1095220441576.000000 size: 8589934592
mean value of data bytes: 127.500440
Sample output from 8GB file (g++):
using for loop
7.5e-07 s
sum_for_loop: 1095220441576.000000 size: 8589934592
mean value of data bytes: 127.500440
using accumulate()
21.484348 s
sum_for_loop: 1095220441576.000000 size: 8589934592
mean value of data bytes: 127.500440
There are numerous issues with the code. The first - and the one that is causing your display problem - is that sum_for_loop should be a double, not an unsigned long. The sum is overflowing what can be stored in an unsigned long, resulting in your incorrect result when that happens.
The timers should be started after the cout, otherwise you're including the output time in the compute time. In addition, the "for loop" elapsed time excludes the time taken to construct char_map.
When building char_map, you don't need the if: operator[] zero-initializes a missing entry, so char_map[c]++ alone is enough. A better approach still (since there are only 256 possible byte values) is an indexed array or vector, remembering to cast the char to unsigned char before using it as an index.
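A sketch of that indexed-counter idea, using a fixed-size std::array of 256 slots (the function and variable names are mine):

```cpp
#include <array>
#include <cstdint>
#include <string>

// Mean byte value via a 256-slot histogram indexed by unsigned char.
double mean_byte_value(const std::string& input) {
    std::array<std::uint64_t, 256> counts{};  // zero-initialized, no hashing needed
    for (unsigned char c : input)             // the unsigned char conversion avoids
        ++counts[c];                          // negative indices for bytes >= 0x80
    double sum = 0.0;
    for (int v = 0; v < 256; ++v)
        sum += static_cast<double>(v) * static_cast<double>(counts[v]);
    return sum / static_cast<double>(input.size());
}
```

For random data this should land near 127.5, matching the outputs shown above.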

Setting stack size with setrlimit fails with gcc

I'm using gcc 10.1 on Ubuntu 18.04. I'm getting segfaults when defining a large stack allocated variable, even though my stack seems to be large enough to accommodate it. Here is a code snippet:
#include <iostream>
#include <array>
#include <sys/resource.h>

using namespace std;

int main() {
    if (struct rlimit rl{1 << 28, 1l << 32}; setrlimit(RLIMIT_STACK, &rl))
        cout << "Can not set stack size! errno = " << errno << endl;
    else
        cout << "Stack size: " << rl.rlim_cur / (1 << 20) << "MiB to " << rl.rlim_max / (1 << 20) << "MiB\n";
    array<int8_t, 100'000'000> a;
    cout << (int)a[42] << endl;
}
which segfaults when compiled with gcc, but runs fine when compiled with clang 11.0.1 and outputs:
Stack size: 256MiB to 4096MiB
0
EDIT
Clang was eliding allocation of a. Here is a better example:
#include <iostream>
#include <array>
#include <sys/resource.h>

using namespace std;

void f() {
    array<int8_t, 100'000'000> a;
    cout << (long)&a[0] << endl;
}

int main()
{
    if (struct rlimit rl{1 << 28, 1l << 32}; setrlimit(RLIMIT_STACK, &rl))
        cout << "Can not set stack size! errno = " << errno << endl;
    else
        cout << "Stack size: " << rl.rlim_cur / (1 << 20) << "MiB to " << rl.rlim_max / (1 << 20) << "MiB" << endl;
    array<int8_t, 100'000'000> a; // line 21
    cout << (long)&a[0] << endl;  // line 23
    f();
}
which you can find at: https://wandbox.org/permlink/XMaGFMa7heWfI9G8. It runs fine when lines 21 and 23 are commented out, but segfaults otherwise.
Use proc(5) and pmap(1) and strace(1) to understand the limitations of your computer.
array<int8_t, 100'000'000> a;
requires about 100 MB of space on the call stack, which is generally limited to a few megabytes (possibly by your Linux kernel itself).
Try also cat /proc/$$/limits in your terminal. On mine I am getting
Limit             Soft Limit   Hard Limit   Units
Max cpu time      unlimited    unlimited    seconds
Max file size     unlimited    unlimited    bytes
Max data size     unlimited    unlimited    bytes
Max stack size    8388608      unlimited    bytes
The difference of behavior between compilers might be attributed to various optimizations (e.g. permitted by some C++ standard like n4849). A clever enough compiler is allowed to use just a few words for a inside your function f (e.g. because it would figure out, maybe with some abstract interpretation techniques, that locations a[1024] ... a[99999999] are useless).
If you compile with a recent GCC (e.g. GCC 10), you could invoke it as g++ -O -Wall -Wextra -fanalyzer -Wstack-usage=2048 to get useful warnings. See also this draft report funded by CHARIOT & DECODER projects.
In practice, use dynamic allocation for huge data (e.g. placement new with mmap(2) and smart pointers)
For a real application, consider writing your GCC plugin to get ad-hoc warnings.
Or at least compile your source code foo.cc with g++ -O2 -fverbose-asm -S foo.cc and look inside the generated foo.s and repeat with clang++ : the generated assembler files are different.

How to define and use numbers smaller than 2e-308

Smallest double is 2.22507e-308. Is there any way I can use smaller numbers?
I found a library called gmp, but have no idea how to use it, documentation is not clear at all, and I'm not sure if it works on windows.
I don't expect to give me instructions, but maybe at least some piece of advice.
If you need really big precision, then give gmp a chance. I am sure it works on Windows too.
If you just need more precision than double, try long double. Whether it gives you more depends on your compiler and target platform.
In my case it does give more (gcc 6, x86_64 Linux):
Test program:
#include <iostream>
#include <limits>

int main() {
    std::cout << "float:"
              << " bytes=" << sizeof(float)
              << " min=" << std::numeric_limits<float>::min()
              << std::endl;
    std::cout << "double:"
              << " bytes=" << sizeof(double)
              << " min=" << std::numeric_limits<double>::min()
              << std::endl;
    std::cout << "long double:"
              << " bytes=" << sizeof(long double)
              << " min=" << std::numeric_limits<long double>::min()
              << std::endl;
}
Output:
float: bytes=4 min=1.17549e-38
double: bytes=8 min=2.22507e-308
long double: bytes=16 min=3.3621e-4932
If your compiler/architecture allows it, you could use something like long double, which compiles to an 80-bit float (though I think it aligns to 128 bits, so there's a bit of wasted space) and has more range and precision than a typical double value. Not all compilers will do that though, and on many compilers, long double is equivalent to a double, at 64-bits.
"gmp" is one library you could use for extended precision floats. I generally recommend boost.multiprecision, which includes gmp, though personally, I'd use cpp_bin_float or cpp_dec_float for my multiprecision needs (the former is IEEE756 compliant, the latter isn't)
As for how to use them: I haven't used gmp, so I can't comment on its syntax, but cpp_bin_float is pretty easy to use:
#include <boost/multiprecision/cpp_bin_float.hpp>
#include <iostream>

typedef boost::multiprecision::cpp_bin_float_quad quad;

quad a = 34;
quad b = 17.95467;
b += a;
for (int i = 0; i < 10; i++) {
    b *= b;
}
std::cout << "This might be rather big: " << b << std::endl;
If you switch your compiler to gcc or Intel, the type long double will be supported with bigger precision (80-bit). With the default Visual Studio compiler, I have no advice for you on what to do.

Is there something wrong with the way I am using a long double?

I have recently become interested in learning about programming in c++, because I want to get a bit deeper understanding of the way computers work and handle instructions. I thought I would try out the data types, but I don't really understand what's happening with my output...
#include <iostream>
#include <iomanip>

using namespace std;

int main() {
    float fValue = 123.456789;
    cout << setprecision(20) << fixed << fValue << endl;
    cout << "Size of float: " << sizeof(float) << endl;

    double dValue = 123.456789;
    cout << setprecision(20) << fixed << dValue << endl;
    cout << "Size of double: " << sizeof(double) << endl;

    long double lValue = 123.456789;
    cout << setprecision(20) << fixed << lValue << endl;
    cout << "Size of long double: " << sizeof(long double) << endl;

    return 0;
}
The output I expected would be something like:
123.45678710937500000000
Size of float: 4
123.45678900000000000000
Size of double: 8
123.45678900000000000000
Size of long double: 16
This is my actual output:
123.45678710937500000000
Size of float: 4
123.45678900000000000000
Size of double: 8
-6518427077408613100000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000.00000000000000000000
Size of long double: 12
Any ideas on what happened would be much appreciated, thanks!
Edit:
System:
Windows 10 Pro Technical Preview
64-bit Operating System, x64-based processor
Eclipse CDT 8.5
From the patch that fixed this in earlier versions:
MinGW uses the Microsoft runtime DLL msvcrt.dll. Here lies a problem: while gcc creates 80 bits long doubles, the MS runtime accepts 64 bit long doubles only.
This bug happens to me when I use 4.8.1 revision 4 from MinGW-get (the most recent version it offers), but not when I use 4.8.1 revision 5.
So you are not using long double wrong (although you would get better accuracy by writing long double lValue = 123.456789L, so the literal isn't first parsed as a double and then converted to long double).
The easiest way to fix this would be to simply change the version of MinGW you are using to 4.9 or 4.7, depending on what you need (you can get 4.9 here).
If you are willing to instead use printf, you could change to printf("%Lf", ...), and either:
add the flag -posix when you compile with g++
add #define __USE_MINGW_ANSI_STDIO 1 before #include <cstdio> (found this in the original patch)
Finally, you can even just cast to a double whenever you try to print out the long double (there is some loss of accuracy, but it shouldn't matter when just printing out numbers).
To find more details, you can also look at my blog post on this issue.
Update: If you want to continue to use MinGW 4.8, you can also just download a different distribution of MinGW, which didn't have that problem for me.

Can someone provide an example of seeking, reading, and writing a >4GB file using boost iostreams

I have read that boost iostreams supposedly supports 64-bit access to large files in a semi-portable way. Their FAQ mentions 64-bit offset functions, but there are no examples of how to use them. Has anyone used this library for handling large files? A simple example of opening two files, seeking to their middles, and copying one to the other would be very helpful.
Thanks.
Short answer
Just include
#include <boost/iostreams/seek.hpp>
and use the seek function as in
boost::iostreams::seek(device, offset, whence);
where
device is a file, stream, streambuf or any object convertible to seekable;
offset is a 64-bit offset of type stream_offset;
whence is BOOST_IOS::beg, BOOST_IOS::cur or BOOST_IOS::end.
The return value of seek is of type std::streampos, and it can be converted to a stream_offset using the position_to_offset function.
Example
Here is a long, tedious and repetitive example, which shows how to open two files, seek to offsets >4GB, and copy data between them.
WARNING: This code will create very large files (several GB). Try this example on an OS/file system which supports sparse files. Linux is ok; I did not test it on other systems, such as Windows.
/*
 * WARNING: This creates very large files (several GB)
 * unless your OS/file system supports sparse files.
 */
#include <boost/iostreams/device/file.hpp>
#include <boost/iostreams/positioning.hpp>
#include <boost/iostreams/seek.hpp>
#include <cstring>
#include <iostream>

using boost::iostreams::file_sink;
using boost::iostreams::file_source;
using boost::iostreams::position_to_offset;
using boost::iostreams::seek;
using boost::iostreams::stream_offset;

static const stream_offset GB = 1000*1000*1000;

void setup()
{
    file_sink out("file1", BOOST_IOS::binary);
    const char *greetings[] = {"Hello", "Boost", "World"};
    for (int i = 0; i < 3; i++) {
        out.write(greetings[i], 5);
        seek(out, 7*GB, BOOST_IOS::cur);
    }
}

void copy_file1_to_file2()
{
    file_source in("file1", BOOST_IOS::binary);
    file_sink out("file2", BOOST_IOS::binary);
    stream_offset off;

    off = position_to_offset(seek(in, -5, BOOST_IOS::end));
    std::cout << "in: seek " << off << std::endl;
    for (int i = 0; i < 3; i++) {
        char buf[6];
        std::memset(buf, '\0', sizeof buf);
        std::streamsize nr = in.read(buf, 5);
        std::streamsize nw = out.write(buf, 5);
        std::cout << "read: \"" << buf << "\"(" << nr << "), "
                  << "written: (" << nw << ")" << std::endl;
        off = position_to_offset(seek(in, -(7*GB + 10), BOOST_IOS::cur));
        std::cout << "in: seek " << off << std::endl;
        off = position_to_offset(seek(out, 7*GB, BOOST_IOS::cur));
        std::cout << "out: seek " << off << std::endl;
    }
}

int main()
{
    setup();
    copy_file1_to_file2();
}