We have migrated a C++ app from RHEL 6.2 to RHEL 5.4 and found that the application has broken. One of our investigations found that the code on the 5.4 box refers to 'futex'. Note that our app is compiled using the 32-bit compiler option:
grep futex tool_strace.txt
futex(0xff8ea454, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0xf6d1f4fc, FUTEX_WAKE_PRIVATE, 2147483647) = 0
futex(0xf6c10a4c, FUTEX_WAKE_PRIVATE, 2147483647) = 0
As per http://www.akkadia.org/drepper/assumekernel.html, I added this code to the 5.4 build:
setenv("LD_ASSUME_KERNEL" , "2.4.1" , 1); // to use Linux Threads
But the strace dump still shows me 'futex' being called.
All the addresses (ff8ea454, f6d1f4fc and f6c10a4c) are 32-bit addresses. So, if my assumption is right, how can I write the code so that the 'futex' calls are suppressed or not made at all?
Is there any known issue with futex calls?
I believe the following to be true:
LD_ASSUME_KERNEL has to be set before your program starts to have any effect.
futex is used to implement any type of lock, so you can't avoid it.
You shouldn't need LD_ASSUME_KERNEL when you are compiling your own code, as it should use newer interfaces as appropriate.
2.4.1 is an ancient kernel version to be emulating. Your mention of 32-bit compiles suggests you are on an AMD64 architecture machine, which may not even have the libraries to support going back that far.
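To illustrate the first point: the variable has to be in the environment before the dynamic linker of the target process runs, so calling setenv() inside the already-running application (as in the snippet above) is too late. A minimal sketch of a launcher that sets it and then execs the real binary; the binary path here is hypothetical:

#include <stdlib.h>   // setenv
#include <unistd.h>   // execv
#include <stdio.h>    // perror

int main()
{
    // Must be in the environment before the target process (and its dynamic
    // linker) starts; setting it from inside the running app has no effect.
    setenv("LD_ASSUME_KERNEL", "2.4.1", 1);

    // Hypothetical path to the real application binary.
    char *const args[] = { (char *)"/opt/myapp/bin/myapp", NULL };
    execv(args[0], args);

    perror("execv");   // reached only if execv() fails
    return 1;
}

The same effect can be had from a shell with LD_ASSUME_KERNEL=2.4.1 ./myapp, assuming the target box still ships the LinuxThreads compatibility libraries at all (see the last point above).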
I have a fairly robust set of CI tests for a C++ library; these tests (around 50) run over the same docker image but on different machines.
In one machine ("A") all the memcheck (valgrind) tests pass (i.e. no memory leaks).
In the other ("B"), all tests produce the same valgrind error below.
51/56 MemCheck #51: combinations.cpp.x ....................***Exception: SegFault 0.14 sec
valgrind: m_libcfile.c:66 (vgPlain_safe_fd): Assertion 'newfd >= VG_(fd_hard_limit)' failed.
Cannot find memory tester output file: /builds/user/boost-multi/build/Testing/Temporary/MemoryChecker.51.log
The machines are very similar; both are Intel i7.
The only difference I can think of is that one is:
A. Ubuntu 22.10, Linux 5.19.0-29, docker 20.10.16
and the other:
B. Fedora 37, Linux 6.1.7-200.fc37.x86_64, docker 20.10.23
and perhaps some configuration of docker I don't know about.
Is there some configuration of docker that might cause the difference? Or of the kernel? Or some option in valgrind to work around this problem?
I know for a fact that on real machines (not docker) valgrind doesn't produce any memory error.
The options I always pass to valgrind are --leak-check=yes --num-callers=51 --trace-children=yes --leak-check=full --track-origins=yes --gen-suppressions=all.
Valgrind version in the image is 3.19.0-1 from the debian:testing image.
Note that this isn't an error reported by valgrind, it is an error within valgrind.
Perhaps, after all, the only difference is that the Ubuntu version of valgrind is compiled in release mode and the error is just ignored. (This doesn't make sense, though: valgrind is the same in both cases because the docker image is the same.)
I tried removing --num-callers=51 or setting it at 12 (default value), to no avail.
I found a difference between the images and the real machine and a workaround.
It has to do with the number of file descriptors.
(This was pointed out briefly in one of the threads on valgrind bug reports for Mac OS: https://bugs.kde.org/show_bug.cgi?id=381815#c0)
Inside the docker image running in Ubuntu 22.10:
ulimit -n
1048576
Inside the docker image running in Fedora 37:
ulimit -n
1073741816
(which looks like a ridiculous number or an overflow)
In the Fedora 37 and the Ubuntu 22.10 real machines:
ulimit -n
1024
So, doing this in the CI recipe, "solved" the problem:
- ulimit -n # reports current value
- ulimit -n 1024 # workaround needed by valgrind in docker running in Fedora 37
- ctest ... (with memcheck)
I have no idea why this workaround works.
For reference:
$ ulimit --help
...
-n the maximum number of open file descriptors
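If the CI shell does not allow the ulimit builtin, the same thing can be done programmatically in a small launcher run before the tests. A minimal sketch using getrlimit/setrlimit; the 1024 value simply mirrors the workaround above:

#include <stdio.h>
#include <sys/resource.h>

int main()
{
    struct rlimit lim;

    if (getrlimit(RLIMIT_NOFILE, &lim) != 0) { perror("getrlimit"); return 1; }
    printf("soft=%llu hard=%llu\n",
           (unsigned long long)lim.rlim_cur, (unsigned long long)lim.rlim_max);

    lim.rlim_cur = 1024;   // same effect as `ulimit -n 1024` (soft limit only)
    if (setrlimit(RLIMIT_NOFILE, &lim) != 0) { perror("setrlimit"); return 1; }

    // exec ctest/valgrind from here so the children inherit the lowered limit
    return 0;
}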
First off, "you are doing it wrong" with your Valgrind arguments. For CI I recommend a two stage approach. Use as many default arguments as possible for the CI run (--trace-children=yes may well be necessary but not the others). If your codebase is leaky then you may need to check for leaks, but if you can maintain a zero leak policy (or only suppressed leaks) then you can tell if there are new leaks from the summary. After your CI detects an issue you can run again with the kitchen sink options to get full information. Your runs will be significantly faster without all those options.
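For illustration, the two stages might look roughly like this (the test runner name and the --error-exitcode flag are assumptions for the sketch; adapt them to your CTest/memcheck setup):
# stage 1: everyday CI run, mostly default options
valgrind --trace-children=yes --error-exitcode=1 ./tests
# stage 2: re-run a failing test with the full set of options for diagnosis
valgrind --trace-children=yes --leak-check=full --track-origins=yes --num-callers=51 --gen-suppressions=all ./tests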
Back to the question.
Valgrind is trying to dup() some file (the guest exe, a tempfile or something like that). The fd that it gets is higher than what it thinks the nofile rlimit is, so it asserts.
A billion files is ridiculous.
Valgrind will try to call prlimit RLIMIT_NOFILE, with a fallback call to rlimit, and a second fallback to setting the limit to 1024.
To really see what is going on you need to modify the Valgrind source (m_main.c, setup_file_descriptors, set the local show to True). With this change I see
fd limits: host, before: cur 65535 max 65535
fd limits: host, after: cur 65535 max 65535
fd limits: guest : cur 65523 max 65523
Otherwise with strace I see
2049 prlimit64(0, RLIMIT_NOFILE, NULL, {rlim_cur=65535, rlim_max=65535}) = 0
2049 prlimit64(0, RLIMIT_NOFILE, {rlim_cur=65535, rlim_max=65535}, NULL) = 0
(all the above on RHEL 7.6 amd64)
EDIT: Note that the above show Valgrind querying and setting the resource limit. If you use ulimit to lower the limit before running Valgrind, then Valgrind will try to honour that limit. Also note that Valgrind reserves a small number (8) of files for its own use.
Through the terminal on Ubuntu:
db@morris:~/lisbet/elki-master/elki/target$ elki-cli -algorithm outlier.lof.LOF -dbc.parser ArffParser -dbc.in /home/db/lisbet/AllData/literature/WBC/WBC_withoutdupl_norm_v10_no_ids.arff -lof.k 8 -evaluator outlier.OutlierROCCurve -rocauc.positive yes
giving
# ROCAUC: 0.6230046948356808
and in ELKI's GUI:
Running: -verbose -dbc.in /home/db/lisbet/AllData/literature/WBC/WBC_withoutdupl_norm_v10_no_ids.arff -dbc.parser ArffParser -algorithm outlier.lof.LOF -lof.k 8 -evaluator outlier.OutlierROCCurve -rocauc.positive yes
de.lmu.ifi.dbs.elki.datasource.FileBasedDatabaseConnection.parse: 18 ms
de.lmu.ifi.dbs.elki.datasource.FileBasedDatabaseConnection.filter: 0 ms
LOF #1/3: Materializing LOF neighborhoods.
de.lmu.ifi.dbs.elki.index.preprocessed.knn.MaterializeKNNPreprocessor.k: 9
Materializing k nearest neighbors (k=9): 223 [100%]
de.lmu.ifi.dbs.elki.index.preprocessed.knn.MaterializeKNNPreprocessor.precomputation-time: 10 ms
LOF #2/3: Computing LRDs.
LOF #3/3: Computing LOFs.
LOF: complete.
de.lmu.ifi.dbs.elki.algorithm.outlier.lof.LOF.runtime: 39 ms
ROCAUC: 0.6220657276995305
I don't understand why the two ROCAUC curves aren't the same.
My goal in testing this is to be confident that what I do is right, but that is hard when I don't get matching results. Once I see that my settings are right, I will move on to my own experiments, which I can then trust.
Pass cli as the first command-line parameter to launch the CLI, or minigui to launch the MiniGUI. The following are equivalent:
java -jar elki/target/elki-0.6.5-SNAPSHOT.jar cli
java -jar elki/target/elki-0.6.5-SNAPSHOT.jar KDDCLIApplication
java -jar elki/target/elki-0.6.5-SNAPSHOT.jar de.lmu.ifi.dbs.elki.application.KDDCLIApplication
This will work for any class extending the class AbstractApplication.
You can also do:
java -cp elki/target/elki-0.6.5-SNAPSHOT.jar de.lmu.ifi.dbs.elki.application.KDDCLIApplication
(Which will load 1 class less, but this is usually not worth the effort.)
This will work for any class that has a standard public void main(String[]) method, as this is the standard Java invocation.
But notice that -h will currently still print 0.6.0 (January 2014); that value was not updated for the 0.6.5 interim versions. It will be bumped for 0.7.0, so that version number is not reliable.
As for the differences you observed: try varying k by 1. If I recall correctly, we changed the meaning of the k parameter to be more consistent across different algorithms. (They are not consistent in the literature anyway.)
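Concretely, re-running the same CLI call as above with the k parameter shifted by one should show whether that is the cause; if the ROCAUC then matches the GUI's 0.6220657276995305, the off-by-one in k explains the difference:
elki-cli -algorithm outlier.lof.LOF -dbc.parser ArffParser -dbc.in /home/db/lisbet/AllData/literature/WBC/WBC_withoutdupl_norm_v10_no_ids.arff -lof.k 9 -evaluator outlier.OutlierROCCurve -rocauc.positive yes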
I've developed a multi-platform program (using the FLTK toolkit) and implemented multithreading to do intensive background tasks.
I have followed the FLTK tutorials/examples on multithreading, which involve using pthreads on Mac (i.e. the function pthread_create) and Windows threading on Windows (i.e. _beginthread).
What I have noticed is that the performance is much higher on Windows: these background threads execute 3 to 4 times faster.
Why might this be? Is it the threading libraries I'm using? Surely there shouldn't be such a difference? Or could it be the runtime libraries underneath it all?
Here are my machine details
Mac:
Intel(R) Core(TM) i7-3820QM CPU @ 2.70GHz
16 GB DDR3 1600 MHz
Model MacBookPro9,1
OS: Mac OSX 10.8.5
Windows:
Intel(R) Core(TM) i7-3520M CPU @ 2.90GHz
16 GB DDR3 1600 MHz
Model: Dell Latitude E5530
OS: Windows 7 Service Pack 1
EDIT
To just do a basic speed comparison I compiled this on both machines running from the command line
#include <math.h>
#include <time.h>
#include <iostream>
#include <iomanip>
#include <sstream>

int main(int argc, char **argv)
{
    time_t t = time(NULL);
    tm* tt=localtime(&t);
    std::stringstream s;
    s<< std::setfill('0')<<std::setw(2)<<tt->tm_mday<<"/"<<std::setw(2)<<tt->tm_mon+1<<"/"<< std::setw(4)<<tt->tm_year+1900<<" "<< std::setw(2)<<tt->tm_hour<<":"<<std::setw(2)<<tt->tm_min<<":"<<std::setw(2)<<tt->tm_sec;
    std::cout<<"1: "<<s.str()<<std::endl;
    double sum=0;
    for (int i=0;i<100000000;i++){
        double ii=i*0.123456789;
        sum=sum+sin(ii)*cos(ii);
    }
    t = time(NULL);
    tt=localtime(&t);
    s.str("");
    s<< std::setfill('0')<<std::setw(2)<<tt->tm_mday<<"/"<<std::setw(2)<<tt->tm_mon+1<<"/"<< std::setw(4)<<tt->tm_year+1900<<" "<< std::setw(2)<<tt->tm_hour<<":"<<std::setw(2)<<tt->tm_min<<":"<<std::setw(2)<<tt->tm_sec;
    std::cout<<"2: "<<s.str()<<std::endl;
}
Windows takes less than a second. Mac takes 4 to 5 seconds. Any ideas?
On Mac I'm compiling with g++; on Windows, with Visual Studio 2013.
SECOND EDIT
if I change the line
std::cout<<"2: "<<s.str()<<std::endl;
to
std::cout<<"2: "<<s.str()<<" "<<sum<<std::endl;
Then all of a sudden Windows takes a little bit longer...
This makes me think that the whole thing might be compiler optimisation. So the question is: is g++ (4.2 is the version I have) worse at optimisation, or do I need to provide additional flags?
THIRD(!) AND FINAL EDIT
I can report that I achieved comparable performance by ensuring the g++ optimisation flag -O was provided at compile time (a sample compile line follows the exchange below). One of those annoying things that happens so often:
A: I'm tearing my hair out over problem x.
B: Are you sure you're not doing y?
A: That works, why is this information not plastered all over the place and in every tutorial on problem x on the web?
B: Did you read the manual?
A: No, if I completely read the manual for every single bit of code/program I used I would never actually get round to doing anything...
Meh.
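For reference, the kind of compile line that brought the Mac numbers in line with Windows (the file name is a placeholder; any optimisation level built on -O, such as -O2, makes the difference here):
g++ -O2 test.cpp -o test
./test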
I'm using libpcap version 1.1.1, built as a static library (libpcap.a). When I try to execute the following block of code on RHEL 6 64-bit (the executable module itself is built as a 32-bit ELF image), I get a segmentation fault:
const unsigned char* packet;
pcap_pkthdr pcap_header = {0};
unsigned short ether_type = 0;

while ( ether_type != ntohs( 0x800 ) )
{
    packet = pcap_next ( m_pcap_handle, &pcap_header );
    if (packet != NULL)
    {
        memcpy ( &ether_type, &( packet[ 12 ] ), 2 );
    }
    else
    {
        /* Sleep call goes here */
    }
}

if ( raw_buff->data_len >= pcap_header.caplen )
{
    memcpy ( raw_buff->data, &(packet[14]), pcap_header.len - 14 );
    raw_buff->data_len = pcap_header.len - 14;
    raw_buff->timestamp = pcap_header.ts;
}
A bit of investigation revealed that the pcap_header.len field is equal to zero upon pcap_next's return. In fact, the caplen field seems to reflect the packet size correctly. If I dump packet memory from the packet address, the data seems to be valid. As for the len field being equal to zero, I know it's invalid; it is supposed to be at least as large as caplen. Is it a bug? What steps shall I take to get this fixed?
GDB shows pcap_header contents as:
(gdb) p pcap_header
$1 = {ts = {tv_sec = 5242946, tv_usec = 1361456997}, caplen = 66, len = 0}
Maybe I can apply some workaround? I don't want to upgrade the libpcap version.
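One defensive stopgap, given these symptoms, is to never copy more than caplen bytes, since caplen is what libpcap reports as actually captured. A sketch against the snippet above (raw_buff fields as in the question); note this only papers over the symptom, see the answer below for the real cause:

/* Sketch: copy only what was actually captured; guards against len being 0 or bogus. */
if ( packet != NULL && pcap_header.caplen > 14 && raw_buff->data_len >= pcap_header.caplen )
{
    memcpy ( raw_buff->data, &(packet[14]), pcap_header.caplen - 14 );
    raw_buff->data_len = pcap_header.caplen - 14;
    raw_buff->timestamp = pcap_header.ts;
}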
Kernels prior to the 2.6.27 kernel do not support running 32-bit binaries using libpcap 1.0 or later on a 64-bit kernel.
libpcap 1.0 and later use the "memory-mapped" capture mechanism on Linux kernels that have it available, and the first version of that mechanism did not ensure that the data structures shared between the kernel and code using the "memory-mapped" capture mechanism were laid out in memory the same way in 32-bit and 64-bit mode.
2.6 kernels prior to the 2.6.27 kernel have only the first version of that mechanism. The 2.6.27 kernel has the second version of that mechanism, which does ensure that the data structures are laid out in memory the same way in 32-bit and 64-bit mode, so that 32-bit user-mode code works the same atop 32-bit and 64-bit kernels.
I googled and found the defect description at https://bugzilla.redhat.com/show_bug.cgi?id=557728, and it seems it is still relevant nowadays. The problem went away when I linked my application against the shared-library version of libpcap instead of the static one; the system then links my app at runtime against the libpcap that ships with RHEL.
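For illustration, the change amounts only to the link step; with hypothetical file names, going from the static archive to the shared library looks like:
g++ app.o /path/to/libpcap-1.1.1/libpcap.a -o app    # static: the 32-bit libpcap.a is baked into the binary
g++ app.o -lpcap -o app                              # dynamic: resolved at runtime against the libpcap shipped with RHEL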
Sincerely yours, Alexander Chernyaev.
When I call the function directly, execution time is 6.8 sec.
Calling it from a thread, the time is 3.4 sec,
and when using 2 threads, 1.8 sec. No matter what optimization I use, the ratios stay the same.
In Visual Studio the times are as expected: 3.1, 3.0 and 1.7 sec.
#include <math.h>
#include <stdio.h>
#include <windows.h>
#include <time.h>

using namespace std;

#define N 400

float a[N][N];

struct b{
    int begin;
    int end;
};

DWORD WINAPI thread(LPVOID p)
{
    b b_t = *(b*)p;
    for(int i=0;i<N;i++)
        for(int j=b_t.begin;j<b_t.end;j++)
        {
            a[i][j] = 0;
            for(int k=0;k<i;k++)
                a[i][j]+=k*sin(j)-j*cos(k);
        }
    return (0);
}

int main()
{
    clock_t t;
    HANDLE hn[2];
    b b_t[3];

    b_t[0].begin = 0;
    b_t[0].end = N;
    b_t[1].begin = 0;
    b_t[1].end = N/2;
    b_t[2].begin = N/2;
    b_t[2].end = N;

    t = clock();
    thread(&b_t[0]);
    printf("0 - %d\n",clock()-t);

    t = clock();
    hn[0] = CreateThread ( NULL, 0, thread, &b_t[0], 0, NULL);
    WaitForSingleObject(hn[0], INFINITE );
    printf("1 - %d\n",clock()-t);

    t = clock();
    hn[0] = CreateThread ( NULL, 0, thread, &b_t[1], 0, NULL);
    hn[1] = CreateThread ( NULL, 0, thread, &b_t[2], 0, NULL);
    WaitForMultipleObjects(2, hn, TRUE, INFINITE );
    printf("2 - %d\n",clock()-t);

    return 0;
}
Times:
0 - 6868
1 - 3362
2 - 1827
CPU - Core 2 Duo T9300
OS - Windows 8, 64-bit
compiler: mingw32-g++.exe, gcc version 4.6.2
edit:
Tried a different order; same result. Even tried separate applications.
Task Manager shows CPU utilization of around 50% for the plain function call and for 1 thread, and 100% for 2 threads.
Sum of all elements after each call is the same: 3189909.237955
Cygwin result: 2.5, 2.5 and 2.5 sec
Linux result(pthread): 3.7, 3.7 and 2.1 sec
@borisbn results: 0 - 1446 1 - 1439 2 - 721.
The difference is a result of something in the math library implementing sin() and cos() - if you replace the calls to those functions with something else that takes time, the significant difference between step 0 and step 1 goes away.
Note that I see the difference with gcc (tdm-1) 4.6.1, which is a 32-bit toolchain targeting 32 bit binaries. Optimization makes no difference (not surprising since it seems to be something in the math library).
However, if I build using gcc (tdm64-1) 4.6.1, which is a 64-bit toolchain, the difference does not appear - regardless of whether the build creates a 32-bit program (using the -m32 option) or a 64-bit program (-m64).
Here are some example test runs (I made minor modifications to the source to make it C99 compatible):
Using the 32-bit TDM MinGW 4.6.1 compiler:
C:\temp>gcc --version
gcc (tdm-1) 4.6.1
C:\temp>gcc -m32 -std=gnu99 -o test.exe test.c
C:\temp>test
0 - 4082
1 - 2439
2 - 1238
Using the 64-bit TDM 4.6.1 compiler:
C:\temp>gcc --version
gcc (tdm64-1) 4.6.1
C:\temp>gcc -m32 -std=gnu99 -o test.exe test.c
C:\temp>test
0 - 2506
1 - 2476
2 - 1254
C:\temp>gcc -m64 -std=gnu99 -o test.exe test.c
C:\temp>test
0 - 3031
1 - 3031
2 - 1539
A little more information:
The 32-bit TDM distribution (gcc (tdm-1) 4.6.1) links to the sin()/cos() implementations in the msvcrt.dll system DLL via a provided import library:
c:/mingw32/bin/../lib/gcc/mingw32/4.6.1/../../../libmsvcrt.a(dcfls00599.o)
0x004a113c _imp__cos
While the 64-bit distribution (gcc (tdm64-1) 4.6.1) doesn't appear to do that, instead linking to some static library implementation provided with the distribution:
c:/mingw64/bin/../lib/gcc/x86_64-w64-mingw32/4.6.1/../../../../x86_64-w64-mingw32/lib/../lib32/libmingwex.a(lib32_libmingwex_a-cos.o)
C:\Users\mikeb\AppData\Local\Temp\cc3pk20i.o (cos)
Update/Conclusion:
After a bit of spelunking in a debugger, stepping through the assembly of msvcrt.dll's implementation of cos(), I've found that the difference in timing between the main thread and an explicitly created thread is due to the FPU's precision being set to a non-default value (presumably the MinGW runtime in question does this at start-up). In the situation where the thread() function takes twice as long, the FPU is set to 64-bit precision (REAL10, or in MSVC-speak _PC_64). When the FPU control word is something other than 0x27f (the default state?), the msvcrt.dll runtime will perform the following steps in the sin() and cos() functions (and probably other floating-point functions):
save the current FPU control word
set the FPU control word to 0x27f (I believe it's possible for this value to be modified)
perform the fsin/fcos operation
restore the saved FPU control word
The save/restore of the FPU control word is skipped if it's already set to the expected/desired 0x27f value. Apparently saving/restoring the FPU control word is expensive, since it appears to double the amount of time the function takes.
You can solve the problem by adding the following line to main() before calling thread():
_control87( _PC_53, _MCW_PC); // requires <float.h>
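Based on the main() from the question, the call goes right at the top; a sketch (only the added line is new, and it affects only the calling thread's FPU state, which is exactly where the slowdown occurs):

#include <float.h>   // _control87, _PC_53, _MCW_PC

int main()
{
    // Drop the main thread's x87 precision to 53 bits, matching what the
    // CreateThread()-created threads already run with (the control word is per-thread).
    _control87( _PC_53, _MCW_PC );

    clock_t t;
    HANDLE hn[2];
    b b_t[3];
    /* ... rest of main() exactly as in the question ... */
}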
Not a cache matter here.
Likely different runtime libraries for user-created threads and the main thread.
You may compare the calculations a[i][j]+=k*sin(j)-j*cos(k); in detail (the actual numbers) for specific values of i, j, and k to confirm the difference.
The reason is that the main thread is doing 64-bit-precision floating-point math while the threads are doing 53-bit-precision math.
You can verify this / fix it by changing the code to:
...
extern "C" unsigned int _control87( unsigned int newv, unsigned int mask );
DWORD WINAPI thread(LPVOID p)
{
    printf( "_control87(): 0x%.4x\n", _control87( 0, 0 ) );
    _control87(0x00010000,0x00010000);
    ...
The output will be:
c:\temp>test
_control87(): 0x8001f
0 - 2667
_control87(): 0x9001f
1 - 2683
_control87(): 0x9001f
_control87(): 0x9001f
2 - 1373
c:\temp>mingw32-c++ --version
mingw32-c++ (GCC) 4.6.2
You can see that 0 was going to run without the 0x10000 flag, but once it is set, it runs at the same speed as 1 and 2. If you look up the _control87() function, you'll see that this value is the _PC_53 flag, which sets the precision to 53 bits instead of the 64 bits it would have been had the value been left as zero.
For some reason, Mingw isn't setting it to the same value at process init time that CreateThread() does at thread create time.
Another workaround is to turn on SSE2 with _set_SSE2_enable(1), which will run even faster but may give different results.
c:\temp>test
0 - 1341
1 - 1326
2 - 702
I believe this is on by default for 64-bit builds because all 64-bit processors support SSE2.
As others suggested, change the order of your three tests to get some more insight. Also, the fact that you have a multi-core machine pretty well explains why using two threads doing half the work each takes half the time. Take a look at your CPU usage monitor (Control-Shift-Escape) to find out how many cores are maxed out during the running time.