Get Real Free Usable Space - c++

I've got an issue with a program that should report the free disk space usable by any user.
My goal is to get the free disk space of every partition of my hard drive that is usable by anyone who doesn't have sudo rights.
So I tried this:
#include <sys/statvfs.h>
#include <iostream>

int main() {
    struct statvfs diskData;
    statvfs("/", &diskData);
    unsigned long long available = ((diskData.f_favail + diskData.f_bavail) * diskData.f_frsize) / (1024 * 1024);
    std::cout << "Free Space : " << available << std::endl;
}
This gives me a total of 2810 ...
However, when I run df -h, I can see that the available space is 25G for sda3 and 30G for sda1.
This seems completely inaccurate.
I've been going through posts on Stack Overflow, mixing solutions I saw, but none is satisfactory. How can I get a correct value, in megabytes, of my available free space?
EDIT : Full statvfs and df / output
statvfs :
Block Size : 4 096
Fragment Size : 4 096
Blocks : 9 612 197
Free Blocks : 7 009 166
Non Root Free Blocks : 6 520 885
Inodes : 2 444 624
Free Inodes Space : 2 137 054
Non Root Free Inodes : 2 137 054
File System ID : 4 224 884 198
Mount Flags : 4 096
Max Filename Length : 255
df / :
Filesystem     1K-Blocks    Used       Available  Use%  Mounted on
/dev/sda3      38448788     10412112   26083556   29%   /

This seems like a more accurate measure of the free disk space:
unsigned long long available = (diskData.f_bavail * diskData.f_bsize) / (1024 * 1024);
It matches the output from df quite closely on my system (df shows the sizes in gigs, and probably rounds them).
If you want the output in gigs like df you could use this:
#include <sys/statvfs.h>
#include <stdio.h>

unsigned long rounddiv(unsigned long num, unsigned long divisor) {
    return (num + (divisor/2)) / divisor;
}

int main() {
    struct statvfs diskData;
    statvfs("/home", &diskData);
    unsigned long available = diskData.f_bavail * diskData.f_bsize;
    printf("Free Space : %luG\n", rounddiv(available, 1024*1024*1024));
    return 0;
}
The output from this on my system:
Free Space : 31G
And if I run df -h /home:
Filesystem Size Used Avail Use% Mounted on
181G 141G 31G 83% /home

It seems that the right value to use is the fragment size, not the block size (i.e. f_frsize)
Have you tried with
diskData.f_bavail * diskData.f_frsize
instead ?
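For reference, here is a minimal sketch combining both suggestions (f_bavail with f_frsize, reported in megabytes); the "/" mount point and the variable names are just placeholders:
#include <sys/statvfs.h>
#include <iostream>

int main() {
    struct statvfs diskData;
    if (statvfs("/", &diskData) != 0) {   // statvfs can fail, so check the return value
        std::cerr << "statvfs failed\n";
        return 1;
    }
    // f_bavail = blocks available to unprivileged users, f_frsize = fragment size in bytes
    unsigned long long availableMB =
        static_cast<unsigned long long>(diskData.f_bavail) * diskData.f_frsize / (1024ULL * 1024ULL);
    std::cout << "Free Space : " << availableMB << " MB" << std::endl;
    return 0;
}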

Related

OpenCL: Confusing Results according local_item_size

My code does 2D matrix multiplication (http://gpgpu-computing4.blogspot.de/2009/09/matrix-multiplication-2-opencl.html).
The dimensions of the matrices are 1000*1000, 10000*10000, and 100000*100000.
My Hardware is: NVIDIA Corporation GM204 [GeForce GTX 980] (MAX_WORK_GROUP_SIZES: 1024 1024 64).
The question is:
I have got some confusing results depending on local_item_size, and I need to understand what is happening:
1000 x 1000 matrices & local_item_size = 16 : INVALID_WORKGROUP_SIZE.
1000 x 1000 matrices & local_item_size = 8 : WORKS :).
1000 x 1000 matrices & local_item_size = 10 : WORKS :) (the execution time with 8 was better).
10000 x 10000 matrices & local_item_size = 8 or 16: CL_OUT_OF_RESOURCES.
Thanks in advance,
To your second question, this is the reasoning behind it:
1000 / 8 = 125, ok
1000 / 16 = 62.5, wrong! INVALID_WORKGROUP_SIZE
1000 / 10 = 100, ok, but 10 and multiples of 10 will never fully use the GPU cores.
I.e.: if you have 16 warps, 6 are wasted; if you have 32, 2 are wasted, and so on.
10000 x 10000 = 400 MB (at least, if using floats) for just the input, so something is getting too big for the memory, therefore CL_OUT_OF_RESOURCES.
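To illustrate the divisibility requirement (OpenCL 1.x requires each global work size to be a multiple of the corresponding local work size), a small host-side sketch; the helper name is made up, and the kernel then has to ignore the padded out-of-range indices:
#include <cstddef>

// Round a global work size up to the nearest multiple of the local work size,
// so clEnqueueNDRangeKernel does not fail with CL_INVALID_WORK_GROUP_SIZE.
std::size_t roundUpToMultiple(std::size_t globalSize, std::size_t localSize) {
    std::size_t remainder = globalSize % localSize;
    return remainder == 0 ? globalSize : globalSize + localSize - remainder;
}

// Example for a 1000 x 1000 matrix with local_item_size = 16:
//   size_t global[2] = { roundUpToMultiple(1000, 16), roundUpToMultiple(1000, 16) };  // 1008 x 1008
//   size_t local[2]  = { 16, 16 };
// The kernel then needs a guard such as: if (row >= N || col >= N) return;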

pack bytes array with strange alignment requirements

Suppose you have 32 threads and 32 pieces of data for each thread to operate on independently, e.g.
struct data
{
    unsigned short int N;
    char *features; // array length N
    uint *values;   // array length N
};
data alldata[32];
Suppose that the shared memory for these threads is "partitioned" into 32 "banks", where each bank is 4 bytes wide. Each thread can read from its corresponding "bank" in parallel, but if threads try to access the same bank simultaneously, the read operations are serialized.
bank | 0 | 1 | 2 | .....
bytes | 0 1 2 3 | 4 5 6 7 | 8 9 10 11 | ....
bytes | 128 129 130 131 | 132 133 134 135 | 136 137 138 139 | ...
...............
...............
threads | 0 | 1 | 2 | .....
(This bizarre situation is called GPU computing).
Thus, for maximum parallelization:
(in terms of the picture above) the member variables of alldata[0] must only be written to the bytes in the first column, the member variables of alldata[1] to the second column, and so on.
In other words, I must write the contents of alldata[32] into one dynamic array, where the member variables of alldata[j] are written in 4-byte intervals once every 32*4 bytes. Then, when I copy this dynamic array into the shared memory for the threads, it will be properly aligned for the banks.
Question:
Does anybody know of any kind of package that will write variables to a byte array with proper spacing, as discussed above (once every 32*4 bytes) ?
This is a desperation question...
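No ready-made package comes to mind, but a minimal hand-rolled sketch of the strided layout described above may help; the function name and the 32/4 constants are purely illustrative, matching the bank picture:
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Write thread t's k-th 4-byte word into its own "column":
// the word lands at byte offset (k * 32 + t) * 4.
void pack_word(std::vector<uint8_t>& buf, int thread, int wordIndex, uint32_t word) {
    std::size_t offset = (static_cast<std::size_t>(wordIndex) * 32 + thread) * 4;
    if (buf.size() < offset + 4)
        buf.resize(offset + 4, 0);
    // assumes the host byte order is what the device expects
    std::memcpy(&buf[offset], &word, sizeof(word));
}
Each member of alldata[t] (N, then the features bytes packed into 4-byte words, then the values) would be fed through pack_word with an increasing wordIndex.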

c++ array sorting with some specifications

I'm using C++. Using sort from STL is allowed.
I have an array of int, like this :
1 4 1 5 145 345 14 4
The numbers are stored in a char* (I read them from a binary file, 4 bytes per number).
I want to do two things with this array :
swap each number with the one after that
4 1 5 1 345 145 4 14
sort it by group of 2
4 1 4 14 5 1 345 145
I could code it step by step, but it wouldn't be efficient. What I'm looking for is speed. O(n log n) would be great.
Also, this array can be bigger than 500MB, so memory usage is an issue.
My first idea was to sort the array starting from the end (to swap the numbers two by two) and to treat it as a long* (to force the sort to take two ints at a time). But I couldn't manage to code it, and I'm not even sure it would work.
I hope I was clear enough, thanks for your help : )
This is the most memory efficient layout I could come up with. Obviously the vector I'm using would be replaced by the data blob you're using, assuming endian-ness is all handled well enough. The premise of the code below is simple.
Generate 1024 random values in pairs, each pair consisting of the first number between 1 and 500, the second number between 1 and 50.
Iterate the entire list, flipping all even-index values with their following odd-index brethren.
Send the entire thing to std::qsort with an item width of two (2) int32_t values and a count of half the original vector.
The comparator function simply sorts on the immediate value first, and on the second value if the first is equal.
The sample below does this for 1024 items. I've tested it without output for 134217728 items (exactly 536870912 bytes) and the results were pretty impressive for a measly MacBook Air laptop: about 15 seconds, only about 10 of that on the actual sort. What is ideally most important is that no additional memory allocation is required beyond the data vector. Yes, to the purists, I do use call-stack space, but only because q-sort does.
I hope you get something out of it.
Note: I only show the first part of the output, but I hope it shows what you're looking for.
#include <iostream>
#include <fstream>
#include <algorithm>
#include <iterator>
#include <vector>
#include <cstdint>
#include <cstdlib>
#include <ctime>

using namespace std;

// a most-wacked-out random generator. every other call will
// pull from a rand modulo either the first, or second template
// parameter, in alternation.
template<int N, int M>
struct randN
{
    int i = 0;
    int32_t operator ()()
    {
        i = (i + 1) % 2;
        return (i ? rand() % N : rand() % M) + 1;
    }
};

// compare two integer values by address.
int pair_cmp(const void* arg1, const void* arg2)
{
    const int32_t *left = (const int32_t*)arg1;
    const int32_t *right = (const int32_t*)arg2;
    return (left[0] == right[0]) ? left[1] - right[1] : left[0] - right[0];
}

int main(int argc, char *argv[])
{
    // a crapload of int values
    static const size_t N = 1024;

    // seed rand()
    srand((unsigned)time(0));

    // get a huge array of random crap
    vector<int32_t> data;
    data.reserve(N);
    std::generate_n(back_inserter(data), N, randN<500,50>());

    // flip all the values
    for (size_t i = 0; i < data.size(); i += 2)
    {
        int32_t tmp = data[i];
        data[i] = data[i+1];
        data[i+1] = tmp;
    }

    // now sort in pairs. using qsort only because it lends itself
    // *very* nicely to performing block-based sorting.
    std::qsort(&data[0], data.size()/2, sizeof(data[0])*2, pair_cmp);

    cout << "After sorting..." << endl;
    std::copy(data.begin(), data.end(), ostream_iterator<int32_t>(cout, "\n"));
    cout << endl << endl;

    return EXIT_SUCCESS;
}
Output
After sorting...
1
69
1
83
1
198
1
343
1
367
2
12
2
30
2
135
2
169
2
185
2
284
2
323
2
325
2
347
2
367
2
373
2
382
2
422
2
492
3
286
3
321
3
364
3
377
3
400
3
418
3
441
4
24
4
97
4
153
4
210
4
224
4
250
4
354
4
356
4
386
4
430
5
14
5
26
5
95
5
145
5
302
5
379
5
435
5
436
5
499
6
67
6
104
6
135
6
164
6
179
6
310
6
321
6
399
6
409
6
425
6
467
6
496
7
18
7
65
7
71
7
84
7
116
7
201
7
242
7
251
7
256
7
324
7
325
7
485
8
52
8
93
8
156
8
193
8
285
8
307
8
410
8
456
8
471
9
27
9
116
9
137
9
143
9
190
9
190
9
293
9
419
9
453
With some additional constraints on both your input and your platform, you can probably use an approach like the one you are thinking of. These constraints would include
Your input contains only positive numbers (i.e. can be treated as unsigned)
Your platform provides uint8_t and uint64_t in <cstdint>
You address a single platform with known endianness.
In that case you can divide your input into groups of 8 bytes, do some byte shuffling to arrange each group as one uint64_t with the "first" number from the input in the lower-valued half, and run std::sort on the resulting array. Depending on endianness you may need to do more byte shuffling to rearrange each sorted 8-byte group as a pair of uint32_t in the expected order.
If you can't code this on your own, I'd strongly advise you not to take this approach.
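If you do want to see the shape of it anyway, here is a rough sketch under the constraints above (little-endian host, non-negative values, buffer length a multiple of 8); note that the reinterpret_cast also relies on the byte buffer being suitably aligned:
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Sort pairs in place, treating each 8-byte group as one uint64_t.
void swap_and_sort_le(std::vector<uint8_t>& buf) {
    const std::size_t n = buf.size() / 8;
    uint64_t* groups = reinterpret_cast<uint64_t*>(buf.data());

    // On a little-endian host the second int of each pair already sits in the
    // high 32 bits, so plain uint64_t ordering sorts by (second, first).
    std::sort(groups, groups + n);

    // Swap the halves of every group so the (former) second int is stored first.
    for (std::size_t i = 0; i < n; ++i)
        groups[i] = (groups[i] << 32) | (groups[i] >> 32);
}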
A better and more portable approach (you have some inherent non-portability by starting from a not clearly specified binary file format) would be:
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

std::vector<int> swap_and_sort_int_pairs(const unsigned char buffer[], size_t buflen) {
    const size_t intsz = sizeof(int);
    // We have to assume that the binary format in buffer is compatible with our int representation;
    // we also require an even number of integers
    assert(buflen % (2*intsz) == 0);

    // load pairwise
    std::vector< std::pair<int,int> > pairs;
    pairs.reserve(buflen/(2*intsz));
    for (const unsigned char* bufp = buffer; bufp < buffer+buflen; bufp += 2*intsz) {
        // It would be better to have a more portable binary -> int conversion
        int first_value = *reinterpret_cast<const int*>(bufp);
        int second_value = *reinterpret_cast<const int*>(bufp + intsz);
        // swap each pair here
        pairs.emplace_back( second_value, first_value );
    }

    // less<pair<..>> does lexicographical ordering, which is what you are looking for
    std::sort(pairs.begin(), pairs.end());

    // convert back to a linear vector
    std::vector<int> result;
    result.reserve(2*pairs.size());
    for (auto& entry : pairs) {
        result.push_back(entry.first);
        result.push_back(entry.second);
    }
    return result;
}
Both the initial parse/swap pass (which you need anyway) and the final conversion are O(N), so the total complexity is still O(N log N).
If you can continue to work with pairs, you can save the final conversion. The other way to save that conversion would be to use a hand-coded sort with two-int strides and two-int swap: much more work - and possibly still hard to get as efficient as a well-tuned library sort.
Do one thing at a time. First, give your data some *struct*ure. It seems that each 8 bytes form a unit of the form
struct unit {
    int key;
    int value;
};
If the endianness is right, you can do this in O(1) with a reinterpret_cast. If it isn't, you'll have to live with an O(n) conversion effort. Both vanish compared to the O(n log n) sorting effort.
When you have an array of these units, you can use std::sort like:
bool compare_units(const unit& a, const unit& b) {
    return a.key < b.key;
}
std::sort(array, array + length, compare_units);
The key to this solution is that you do the "swapping" and byte-interpretation first and then do the sorting.
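Putting those pieces together, a sketch of how the reinterpret_cast route could look; the buffer/length names are placeholders, and endianness and alignment are assumed to cooperate as described above:
#include <algorithm>
#include <cstddef>
#include <utility>

struct unit {
    int key;    // after the swap, the number that should come first
    int value;
};

// buffer points at the raw bytes read from the file; byteCount is a multiple of 8.
void swap_and_sort(char* buffer, std::size_t byteCount) {
    unit* units = reinterpret_cast<unit*>(buffer);
    std::size_t count = byteCount / sizeof(unit);

    // "swap each number with the one after that"
    for (std::size_t i = 0; i < count; ++i)
        std::swap(units[i].key, units[i].value);

    // sort in groups of two ints: by key first, then by value
    std::sort(units, units + count, [](const unit& a, const unit& b) {
        return a.key != b.key ? a.key < b.key : a.value < b.value;
    });
}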

In Doom3's source code, why did they use bitshift to generate the number instead of hardcoding it?

Why did they do this:
Sys_SetPhysicalWorkMemory( 192 << 20, 1024 << 20 ); //Min = 201,326,592 Max = 1,073,741,824
Instead of this:
Sys_SetPhysicalWorkMemory( 201326592, 1073741824 );
The article I got the code from
A neat property is that shifting a value << 10 is the same as multiplying it by 1024 (1 KiB), and << 20 by 1024*1024 (1 MiB).
Shifting by successive multiples of 10 yields all of our standard units of computer storage:
1 << 10 = 1 KiB (Kibibyte)
1 << 20 = 1 MiB (Mebibyte)
1 << 30 = 1 GiB (Gibibyte)
...
So that function is expressing its arguments to Sys_SetPhysicalWorkMemory(int minBytes, int maxBytes) as 192 MB (min) and 1024 MB (max).
Self commenting code:
192 << 20 means 192 * 2^20 = 192 * 2^10 * 2^10 = 192 * 1024 * 1024 = 192 MByte
1024 << 20 means 1024 * 2^20 = 1 GByte
Computations on constants are optimized away so nothing is lost.
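For instance, a quick compile-time check confirms the comment in the original call (nothing Doom 3 specific here, just the arithmetic):
#include <cstdio>

int main() {
    // The shifts are folded at compile time; these asserts cost nothing at run time.
    static_assert(192  << 20 == 201326592,  "192 MiB");
    static_assert(1024 << 20 == 1073741824, "1 GiB");
    std::printf("%d %d\n", 192 << 20, 1024 << 20);
}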
I might be wrong (and I didn't study the source), but I guess it's just for readability reasons.
I think the point (not mentioned yet) is that all but the most basic compilers will do the shift at compilation time. Whenever you use operators with constant expressions, the compiler will be able to do this before the code is even generated. Note that before constexpr and C++11, this did not extend to functions.

Fair comparison of fork() Vs Thread [closed]

I was having a discussion about the relative cost of fork() Vs thread() for parallelization of a task.
We understand the basic differences between processes and threads:
Thread:
Easy to communicate between threads
Fast context switching.
Processes:
Fault tolerance.
Communicating with parent not a real problem (open a pipe)
Communication with other child processes hard
But we disagreed on the start-up cost of processes Vs threads.
So to test the theories I wrote the following code. My question: is this a valid test of measuring the start-up cost, or am I missing something? Also, I would be interested in how each test performs on different platforms.
fork.cpp
#include <boost/lexical_cast.hpp>
#include <vector>
#include <unistd.h>
#include <sys/wait.h>
#include <iostream>
#include <stdlib.h>
#include <time.h>

extern "C" int threadStart(void* threadData)
{
    return 0;
}

int main(int argc, char* argv[])
{
    int threadCount = boost::lexical_cast<int>(argv[1]);
    std::vector<pid_t> data(threadCount);

    clock_t start = clock();
    for(int loop = 0; loop < threadCount; ++loop)
    {
        data[loop] = fork();
        if (data[loop] == -1)
        {
            std::cout << "Abort\n";
            exit(1);
        }
        if (data[loop] == 0)
        {
            exit(threadStart(NULL));
        }
    }

    clock_t middle = clock();
    for(int loop = 0; loop < threadCount; ++loop)
    {
        int result;
        waitpid(data[loop], &result, 0);
    }

    clock_t end = clock();
    std::cout << threadCount << "\t" << middle - start << "\t" << end - middle << "\t" << end - start << "\n";
}
thread.cpp
#include <boost/lexical_cast.hpp>
#include <vector>
#include <iostream>
#include <pthread.h>
#include <stdlib.h>
#include <time.h>

extern "C" void* threadStart(void* threadData)
{
    return NULL;
}

int main(int argc, char* argv[])
{
    int threadCount = boost::lexical_cast<int>(argv[1]);
    std::vector<pthread_t> data(threadCount);

    clock_t start = clock();
    for(int loop = 0; loop < threadCount; ++loop)
    {
        if (pthread_create(&data[loop], NULL, threadStart, NULL) != 0)
        {
            std::cout << "Abort\n";
            exit(1);
        }
    }

    clock_t middle = clock();
    for(int loop = 0; loop < threadCount; ++loop)
    {
        void* result;
        pthread_join(data[loop], &result);
    }

    clock_t end = clock();
    std::cout << threadCount << "\t" << middle - start << "\t" << end - middle << "\t" << end - start << "\n";
}
I expect Windows to do worse at process creation.
But I would expect modern Unix-like systems to have a fairly light fork cost and be at least comparable to threads. On older Unix-style systems (before fork() was implemented using copy-on-write pages) it would be worse.
Anyway, my timing results are:
> uname -a
Darwin Alpha.local 10.4.0 Darwin Kernel Version 10.4.0: Fri Apr 23 18:28:53 PDT 2010; root:xnu-1504.7.4~1/RELEASE_I386 i386
> gcc --version | grep GCC
i686-apple-darwin10-gcc-4.2.1 (GCC) 4.2.1 (Apple Inc. build 5659)
> g++ thread.cpp -o thread -I~/include
> g++ fork.cpp -o fork -I~/include
> foreach a ( 1 2 3 4 5 6 7 8 9 10 12 15 20 30 40 50 60 70 80 90 100 )
foreach? ./thread ${a} >> A
foreach? end
> foreach a ( 1 2 3 4 5 6 7 8 9 10 12 15 20 30 40 50 60 70 80 90 100 )
foreach? ./fork ${a} >> A
foreach? end
vi A
Thread: Fork:
C Start Wait Total C Start Wait Total
==============================================================
1 26 145 171 1 160 37 197
2 44 198 242 2 290 37 327
3 62 234 296 3 413 41 454
4 77 275 352 4 499 59 558
5 91 107 10808 5 599 57 656
6 99 332 431 6 665 52 717
7 130 388 518 7 741 69 810
8 204 468 672 8 833 56 889
9 164 469 633 9 1067 76 1143
10 165 450 615 10 1147 64 1211
12 343 585 928 12 1213 71 1284
15 232 647 879 15 1360 203 1563
20 319 921 1240 20 2161 96 2257
30 461 1243 1704 30 3005 129 3134
40 559 1487 2046 40 4466 166 4632
50 686 1912 2598 50 4591 292 4883
60 827 2208 3035 60 5234 317 5551
70 973 2885 3858 70 7003 416 7419
80 3545 2738 6283 80 7735 293 8028
90 1392 3497 4889 90 7869 463 8332
100 3917 4180 8097 100 8974 436 9410
Edit:
Doing 1000 children caused the fork version to fail.
So I have reduced the children count. But doing a single test also seems unfair so here is a range of values.
mumble ... I do not like your solution for many reasons:
You are not taking into account the execution time of the child processes/threads.
You should compare CPU usage, not the bare elapsed time. This way your statistics will not depend on, e.g., disk access congestion.
Let your child process do something. Remember that "modern" fork uses copy-on-write mechanisms to avoid allocating memory for the child process until it is needed. Exiting immediately is too easy; this way you avoid almost all of the disadvantages of fork.
CPU time is not the only cost you have to account for. Memory consumption and the slowness of IPC are both disadvantages of the fork solution.
You could use "rusage" instead of "clock" to measure real resource usage.
P.S. I do not think you can really measure the process/thread overhead writing a simple test program. There are too many factors and, usually, the choice between threads and processes is driven by other reasons than mere cpu-usage.
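As a sketch of the getrusage() suggestion (measuring the parent's CPU time and, separately, that of its waited-for children; the helper name is made up):
#include <sys/resource.h>
#include <cstdio>

// Print user and system CPU time consumed so far, either by this process
// (RUSAGE_SELF) or by its terminated, waited-for children (RUSAGE_CHILDREN).
static void printUsage(int who, const char* label)
{
    struct rusage usage;
    if (getrusage(who, &usage) == 0)
    {
        std::printf("%s: user %ld.%06lds, sys %ld.%06lds\n", label,
                    (long)usage.ru_utime.tv_sec, (long)usage.ru_utime.tv_usec,
                    (long)usage.ru_stime.tv_sec, (long)usage.ru_stime.tv_usec);
    }
}

// In the benchmark: call printUsage(RUSAGE_SELF, "parent") and
// printUsage(RUSAGE_CHILDREN, "children") after the waitpid/pthread_join loop.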
Under Linux, fork is a special call to sys_clone, either within the library or within the kernel. Clone has lots of switches to flip on and off, and each of them affects how expensive it is to start.
The actual library function clone is probably more expensive than fork, though, because it does more; most of that extra work is on the child side (stack swapping and calling a function by pointer).
What that micro-benchmark shows is that thread creation and joining (there are no fork results when I'm writing this) takes tens or hundreds of microseconds (assuming your system has CLOCKS_PER_SEC=1000000, which it probably has, since it's an XSI requirement).
Since you said that fork() takes 3 times the cost of threads, we are still talking tenths of a millisecond at worst. If that is noticeable on an application, you could use pools of processes/threads, like Apache 1.3 did. In any case, I'd say that startup time is a moot point.
The important difference of threads vs processes (on Linux and most Unix-likes) is that on processes you choose explicitly what to share, using IPC, shared memory (SYSV or mmap-style), pipes, sockets (you can send file descriptors over AF_UNIX sockets, meaning you get to choose which fd's to share), ... While on threads almost everything is shared by default, whether there's a need to share it or not. In fact, that is the reason Plan 9 had rfork() and Linux has clone() (and recently unshare()), so you can choose what to share.
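For example, sharing a single counter between parent and child through an anonymous shared mapping, i.e. explicitly choosing what to share instead of sharing everything as threads do (error checking trimmed for brevity):
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>
#include <cstdio>

int main()
{
    // One int shared between parent and child; everything else stays private.
    int* counter = static_cast<int*>(mmap(NULL, sizeof(int),
                                          PROT_READ | PROT_WRITE,
                                          MAP_SHARED | MAP_ANONYMOUS, -1, 0));
    *counter = 0;

    if (fork() == 0)            // child
    {
        *counter = 42;          // visible to the parent through the shared mapping
        _exit(0);
    }

    wait(NULL);
    std::printf("counter = %d\n", *counter);   // prints 42
    munmap(counter, sizeof(int));
    return 0;
}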