How to understand percents in perf? (C++)

Consider the following application.
#include <cmath>

void foo()
{
    double x = 42.0;
    for ( unsigned long i = 0; i < 10000000; ++i )
        x = std::sin( x );
}

int main()
{
    foo();
    return 0;
}
I use the following commands.
g++ main.cpp
perf record ./a.out
perf report
And I see:
Samples: 518 of event 'cycles', Event count (approx.): 410229343
Overhead Command Shared Object Symbol
84,28% a.out libm.so.6 [.] __subtf3
12,59% a.out a.out [.] foo
2,47% a.out a.out [.] _init
0,47% a.out [kernel.kallsyms] [k] may_open
0,17% a.out [kernel.kallsyms] [k] memcg_slab_post_alloc_hook
0,01% perf-ex [kernel.kallsyms] [k] mutex_unlock
0,01% a.out [kernel.kallsyms] [k] __intel_pmu_enable_all.constprop.0
0,00% perf-ex [kernel.kallsyms] [k] native_write_msr
0,00% a.out [kernel.kallsyms] [k] native_write_msr
How should I understand the 12,59% for foo?
And how do I tell perf report to show the full (inclusive) percentage of time spent in a function? I want to see something like: foo 99%, __subtf3 90%.

This is half of an answer; I will answer my own second question.
Use the following commands to see the percentages as you wish.
perf record -e cpu-cycles --call-graph dwarf,4096 -F 250 ./a.out
perf report
And the result will look like this (the first percentage column is the inclusive "Children" overhead, the second the exclusive "Self" overhead):
+ 100,00% 0,00% a.out a.out [.] _start
+ 100,00% 0,00% a.out libc.so.6 [.] __libc_start_main_impl (inlined)
+ 100,00% 0,00% a.out libc.so.6 [.] __libc_start_call_main (inlined)
+ 100,00% 0,00% a.out a.out [.] main
+ 100,00% 16,27% a.out a.out [.] foo
+ 83,73% 0,00% a.out libm.so.6 [.] __sin_fma (inlined)
+ 83,73% 83,73% a.out libm.so.6 [.] __subtf3
+ 62,87% 0,00% a.out libm.so.6 [.] do_sin (inlined)
+ 5,85% 0,00% a.out libm.so.6 [.] libc_feholdsetround_sse_ctx (inlined)
...
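An alternative, if DWARF unwinding produces truncated stacks or too much data: rebuild with frame pointers and let perf unwind through them. A sketch (-g is only needed for symbol and source information):
g++ -g -fno-omit-frame-pointer main.cpp
perf record --call-graph fp ./a.out
perf report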


Unable to set the lowest byte in a register x86-64? [duplicate]

This question already has an answer here:
How do you access low-byte registers for r8-r15 from gdb in x86-64?
I'm writing a function in x86-64 to convert a 1-byte value into a hexadecimal string representing the ASCII code for that byte. At the start of my function, I try to use
movb %dil, %r11b
to store the 1-byte value in the lowest byte of register %r11. However, when I examine this in gdb, %r11b is never set. Instead, the higher bytes of %r11 are getting set. This is what I get when using gdb:
Breakpoint 1, 0x00000000004011f0 in byte_as_hex ()
(gdb) print /x $r11b
$1 = 0x0
(gdb) print /x $r11
$2 = 0x246
(gdb) print /x $rdi
$3 = 0x48
(gdb) print /x $dil
$4 = 0x48
(gdb) stepi /* subq $8, %rsp */
0x00000000004011f4 in byte_as_hex ()
(gdb) print /x $r11b
$5 = 0x0
(gdb) print /x $r11
$6 = 0x246
(gdb) print /x $rdi
$7 = 0x48
(gdb) print /x $dil
$8 = 0x48
(gdb) stepi /* movb %dil, %r11b */
0x00000000004011f7 in byte_as_hex ()
(gdb) print /x $r11b
$9 = 0x0
(gdb) print /x $r11
$10 = 0x248
(gdb) print /x $rdi
$11 = 0x48
(gdb) print /x $dil
$12 = 0x48
(gdb) print /x $r11d
$13 = 0x248
(gdb) print /x $r11w
$14 = 0x248
(gdb) print /x $r11b
$15 = 0x0
I'm very confused because I specifically tried to movb from %dil into %r11b, but I still can't set the byte. Could anyone explain to me why this is happening? Thanks!
Multiple problems are at play here:
(Reported as a GDB bug.) An undefined convenience variable (a GDB-local variable whose name starts with $), when printed with an explicit format specifier, is shown as 0 instead of the default void, which is displayed when no format is specified:
$ gdb /bin/true
Reading symbols from /bin/true...
(gdb) p $asdf
$1 = void <------ undefined, OK
(gdb) p/x $asdf
$2 = 0x0 <------ the problem
(gdb) set $asdf=4345
(gdb) p $asdf
$3 = 4345
(gdb) p/x $asdf
$4 = 0x10f9
(gdb)
Register values have the same syntax as the values of convenience variables. Thus, when you get the name of a register wrong, e.g. use r11b instead of GDB's r11l, you are actually referring to an (undefined) convenience variable. Moreover, even if you use the correct name in the wrong case, like R11L, you run into the same problem.
GDB uses its own set of names for x86(_64) registers. Sometimes they differ from the names given e.g. in Intel manuals (e.g. ftag instead of Intel's FTW). In any case, the lowest bytes of the general purpose registers have the following names in GDB:
al
cl
dl
bl
spl
bpl
sil
dil
r8l
...
r15l
There are no aliases for them (e.g. no r11b for r11l), so one must use the exact names.
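Continuing the session from the question with the GDB name, the low byte shows up as expected (expected output, assuming the movb has already executed as in the transcript above):
(gdb) print /x $r11l
$16 = 0x48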

C++ TBB concurrent_hash_map: find/erase methods stuck my program

I'm new to the Intel TBB library. I'm using the concurrent_hash_map.
When I call the find/erase methods, my program gets stuck. Why?
Am I using find incorrectly?
TBBProcessingThreadHash_t::accessor tbbAccessor;
std::cout << "size before erase is: " << m_oProcessingThreadHash.size() << std::endl; // returns 1
std::cout << "going to find" << std::endl;
if(m_oProcessingThreadHash.find(tbbAccessor, pi_nThreadID)) // <<<<<<<<<<< stuck at this point
{
    std::cout << "going to erase" << std::endl;
    m_oProcessingThreadHash.erase(tbbAccessor);
    std::cout << "Thread deleted successfully from CServerThreadManager" << std::endl;
}
else
{
    std::cout << "Failed to delete thread from CServerThreadManager" << std::endl;
}
I used perf top and this is the result for my program:
23.40% libServer.so [.] __TBB_machine_pause
22.81% libc-2.27.so [.] __sched_yield
15.07% [kernel] [k] do_syscall_64
13.95% [kernel] [k] __sched_text_start
2.23% libServer.so [.] tbb::spin_rw_mutex_v3::scoped_lock::try_acquire
1.86% libServer.so [.] tbb::internal::atomic_backoff::bounded_pause
1.43% libServer.so [.] tbb::interface5::concurrent_hash_map<int, IServerProcessingThread*, tbb::tbb_hash_compare<int>, tbb::tbb_allocator<std::pair<int const, IServerProcessingThread*> > >::lookup
1.28% libServer.so [.] __TBB_machine_fetchadd8
1.17% [kernel] [k] exit_to_usermode_loop
0.99% libtbb.so.2 [.] tbb::spin_rw_mutex_v3::internal_acquire_reader
0.97% libServer.so [.] tbb::spin_rw_mutex_v3::scoped_lock::release
0.75% libServer.so [.] tbb::internal::__TBB_machine_atomic_load<long, 2>
0.74% prog [.] tbb::internal::no_copy::no_copy
0.72% [kernel] [k] sys_sched_yield
0.66% libServer.so [.] tbb::interface5::concurrent_hash_map<int, IServerProcessingThread*, tbb::tbb_hash_compare<int>, tbb::tbb_allocator<std::pair<int const, IServerProcessingThread*> > >::search_bu
0.64% libServer.so [.] tbb::interface5::internal::hash_map_base::get_bucket
0.57% libServer.so [.] tbb::interface5::concurrent_hash_map<int, IServerProcessingThread*, tbb::tbb_hash_compare<int>, tbb::tbb_allocator<std::pair<int const, IServerProcessingThread*> > >::bucket_ac
0.56% libServer.so [.] tbb::spin_rw_mutex_v3::scoped_lock::acquire
0.52% libServer.so [.] tbb::tbb_hash_compare<int>::equal
0.51% libServer.so [.] tbb::internal::__TBB_machine_atomic_load<tbb::interface5::internal::hash_map_node_base*, 2>
0.48% libServer.so [.] tbb::internal::atomic_impl<unsigned long>::to_bits_ref<unsigned long const volatile>
0.46% libServer.so [.] tbb::internal::atomic_impl<unsigned long>::to_value<unsigned long>
0.44% libServer.so [.] tbb::spin_rw_mutex_v3::scoped_lock::scoped_lock
0.42% libtbb.so.2 [.] tbb::spin_rw_mutex_v3::internal_try_acquire_writer
0.41% libServer.so [.] tbb::internal::itt_load_word_with_acquire<unsigned long>
0.39% libServer.so [.] tbb::internal::atomic_backoff::atomic_backoff
0.38% prog [.] tbb::interface5::internal::hash_map_base::segment_base
0.37% libServer.so [.] tbb::internal::atomic_impl<unsigned long>::operator unsigned long
0.35% libServer.so [.] tbb::internal::__TBB_load_with_acquire<tbb::interface5::internal::hash_map_node_base*>
0.32% libServer.so [.] tbb::interface5::concurrent_hash_map<int, IServerProcessingThread*, tbb::tbb_hash_compare<int>, tbb::tbb_allocator<std::pair<int const, IServerProcessingThread*> > >::bucket_ac
0.31% libServer.so [.] tbb::interface5::internal::hash_map_base::is_valid
0.28% libServer.so [.] tbb::internal::atomic_backoff::pause
0.28% libServer.so [.] tbb::internal::itt_load_word_with_acquire<tbb::interface5::internal::hash_map_node_base*>
0.24% libServer.so [.] tbb::interface5::internal::hash_map_base::segment_index_of
0.24% libServer.so [.] __TBB_machine_lg
0.23% libServer.so [.] tbb::internal::gcc_builtins::clz
0.23% libServer.so [.] tbb::internal::machine_load_store<long, 8ul>::load_with_acquire
0.23% libServer.so [.] tbb::internal::__TBB_load_with_acquire<long>
0.23% libServer.so [.] tbb::internal::machine_load_store<tbb::interface5::internal::hash_map_node_base*, 8ul>::load_with_acquire
0.23% libServer.so [.] tbb::aligned_space<std::pair<int const, IServerProcessingThread*>, 1ul>::begin
0.22% libServer.so [.] tbb::internal::punned_cast<std::pair<int const, IServerProcessingThread*>*, tbb::aligned_space<std::pair<int const, IServerProcessingThread*>, 1ul> const>
0.21% libServer.so [.] tbb::internal::atomic_impl<unsigned long>::ptr_converter<unsigned long const volatile*>::ptr_converter
0.20% libServer.so [.] tbb::spin_rw_mutex_v3::scoped_lock::~scoped_lock
0.18% libServer.so [.] tbb::interface5::concurrent_hash_map<int, IServerProcessingThread*, tbb::tbb_hash_compare<int>, tbb::tbb_allocator<std::pair<int const, IServerProcessingThread*> > >::node::sto
0.17% libServer.so [.] tbb::interface5::concurrent_hash_map<int, IServerProcessingThread*, tbb::tbb_hash_compare<int>, tbb::tbb_allocator<std::pair<int const, IServerProcessingThread*> > >::bucket_ac
0.15% libServer.so [.] tbb::interface5::concurrent_hash_map<int, IServerProcessingThread*, tbb::tbb_hash_compare<int>, tbb::tbb_allocator<std::pair<int const, IServerProcessingThread*> > >::node::val
0.13% prog [.] tbb::internal::no_assign::no_assign
0.12% libServer.so [.] tbb::interface5::concurrent_hash_map<int, IServerProcessingThread*, tbb::tbb_hash_compare<int>, tbb::tbb_allocator<std::pair<int const, IServerProcessingThread*> > >::bucket_ac
0.09% [kernel] [k] __softirqentry_text_start
0.08% libServer.so [.] tbb::spin_rw_mutex_v3::scoped_lock::acquire#plt
It looks like this entry is locked somehow, but no one has locked it.
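An observation that may explain the hang (hedged, since the surrounding code is not shown): find with a plain accessor takes a write lock on the matching entry, and it blocks for as long as any other accessor refers to that entry, including one held earlier by the same thread (for example an accessor from insert that was never released); the spinning in __TBB_machine_pause/__sched_yield above is consistent with that. A minimal self-contained pattern that does not hang, as a sketch with an int payload instead of IServerProcessingThread*:
#include <tbb/concurrent_hash_map.h>
#include <iostream>

using Map = tbb::concurrent_hash_map<int, int>;

int main()
{
    Map map;
    map.insert(Map::value_type(1, 42)); // any accessor from insert must be released before find
    {
        Map::accessor acc;              // write lock, released when acc goes out of scope
        if (map.find(acc, 1))
            map.erase(acc);             // erase the entry whose lock we hold
    }
    std::cout << "size after erase: " << map.size() << std::endl; // 0
    return 0;
}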

Dynamic Parallelism Invalid File Format

I am facing a problem correctly compiling CUDA code that contains dynamic parallelism.
Compilation and linking show no errors, but the generated file is not a valid executable.
Configuration:
Tesla K40, Ubuntu 14.04 LTS, CUDA 7.5
Compilation Command:
nvcc -o cdp -rdc=true -dc -dlink -arch=sm_35 cdp.cu -lcudadevrt
Code:
#include <iostream>
#include <cuda_runtime.h>

using namespace std;

__global__ void kernel_find(int* data, int count, int value, int* index)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if(idx < count)
    {
        bool exists = (data[idx] == value);
        if(exists)
            atomicMin(index, idx);
    }
}

__host__ __device__ int find_device(int* data, int count, int value)
{
    int* idx = new int;
    (*idx) = count;
    dim3 block(8);
    dim3 grid((count + block.x - 1) / block.x);
    kernel_find<<<grid, block>>>(data, count, value, idx); // device-side launch: dynamic parallelism
    cudaDeviceSynchronize();
    int retval = *idx;
    delete idx;
    return retval;
}

__global__ void kernel_find_bulk(int* data, int count, const int* toFind, int* foundIndices, int toFindCount)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if(idx < toFindCount)
    {
        int val = toFind[idx];
        int foundIndex = find_device(data, count, val);
        foundIndices[idx] = foundIndex;
    }
}

int main()
{
    const int count = 100, toFindCount = 10;
    int *data, *toFind, *foundIndices;
    cudaMallocManaged(&data, count * sizeof(int));
    cudaMallocManaged(&toFind, toFindCount * sizeof(int));
    cudaMallocManaged(&foundIndices, toFindCount * sizeof(int));
    for(int i = 0; i < count; i++)
    {
        data[i] = rand() % 30;
    }
    for(int i = 0; i < toFindCount; i++)
    {
        toFind[i] = i;
    }
    dim3 block(8);
    dim3 grid((toFindCount + block.x - 1) / block.x);
    kernel_find_bulk<<<grid, block>>>(data, count, toFind, foundIndices, toFindCount);
    cudaDeviceSynchronize();
    for(int i = 0; i < toFindCount; i++)
    {
        if(foundIndices[i] < count)
        {
            cout << toFind[i] << " found at index " << foundIndices[i] << endl;
        }
        else
        {
            cout << toFind[i] << " not found" << endl;
        }
    }
    return 0;
}
If I try to run the executable, I get a Permission denied error. If the permissions are changed forcefully using chmod, the error changes to cannot execute binary file: Exec format error.
I can't figure out the solution, as the CUDA dynamic parallelism samples run fine, and CUDA programs without dynamic parallelism also work fine. Any help would be appreciated.
Output of file command:
cdp: ELF 64-bit LSB relocatable, x86-64, version 1 (SYSV), not stripped
Output of objdump -f command:
cdp: file format elf64-x86-64
architecture: i386:x86-64, flags 0x00000011:
HAS_RELOC, HAS_SYMS
start address 0x0000000000000000
If you run your compile command with the --dryrun option:
$ nvcc --dryrun -o cdp -rdc=true -dc -dlink -arch=sm_35 cdp.cu -lcudadevrt
#$ _SPACE_=
#$ _CUDART_=cudart
#$ _HERE_=/opt/cuda-7.5/bin
#$ _THERE_=/opt/cuda-7.5/bin
#$ _TARGET_SIZE_=
#$ _TARGET_DIR_=
#$ _TARGET_SIZE_=64
#$ TOP=/opt/cuda-7.5/bin/..
#$ NVVMIR_LIBRARY_DIR=/opt/cuda-7.5/bin/../nvvm/libdevice
#$ LD_LIBRARY_PATH=/opt/cuda-7.5/bin/../lib:/opt/cuda-7.5/lib64
#$ PATH=/opt/cuda-7.5/bin/../open64/bin:/opt/cuda-7.5/bin/../nvvm/bin:/opt/cuda-7.5/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/opt/cuda-7.5/bin
#$ INCLUDES="-I/opt/cuda-7.5/bin/..//include"
#$ LIBRARIES= "-L/opt/cuda-7.5/bin/..//lib64/stubs" "-L/opt/cuda-7.5/bin/..//lib64"
#$ CUDAFE_FLAGS=
#$ PTXAS_FLAGS=
#$ gcc -D__CUDA_ARCH__=350 -E -x c++ -DCUDA_DOUBLE_MATH_FUNCTIONS -D__CUDACC__ -D__NVCC__ -D__CUDACC_RDC__ "-I/opt/cuda-7.5/bin/..//include" -D"__CUDACC_VER__=70517" -D"__CUDACC_VER_BUILD__=17" -D"__CUDACC_VER_MINOR__=5" -D"__CUDACC_VER_MAJOR__=7" -include "cuda_runtime.h" -m64 "cdp.cu" > "/tmp/tmpxft_000022ba_00000000-7_cdp.cpp1.ii"
#$ cudafe --allow_managed --m64 --gnu_version=40603 -tused --no_remove_unneeded_entities --device-c --gen_c_file_name "/tmp/tmpxft_000022ba_00000000-4_cdp.cudafe1.c" --stub_file_name "/tmp/tmpxft_000022ba_00000000-4_cdp.cudafe1.stub.c" --gen_device_file_name "/tmp/tmpxft_000022ba_00000000-4_cdp.cudafe1.gpu" --nv_arch "compute_35" --gen_module_id_file --module_id_file_name "/tmp/tmpxft_000022ba_00000000-3_cdp.module_id" --include_file_name "tmpxft_000022ba_00000000-2_cdp.fatbin.c" "/tmp/tmpxft_000022ba_00000000-7_cdp.cpp1.ii"
#$ gcc -D__CUDA_ARCH__=350 -E -x c -DCUDA_DOUBLE_MATH_FUNCTIONS -D__CUDACC__ -D__NVCC__ -D__CUDACC_RDC__ -D__CUDANVVM__ -D__CUDA_PREC_DIV -D__CUDA_PREC_SQRT "-I/opt/cuda-7.5/bin/..//include" -m64 "/tmp/tmpxft_000022ba_00000000-4_cdp.cudafe1.gpu" > "/tmp/tmpxft_000022ba_00000000-8_cdp.cpp2.i"
#$ cudafe -w --allow_managed --m64 --gnu_version=40603 --c --device-c --gen_c_file_name "/tmp/tmpxft_000022ba_00000000-9_cdp.cudafe2.c" --stub_file_name "/tmp/tmpxft_000022ba_00000000-9_cdp.cudafe2.stub.c" --gen_device_file_name "/tmp/tmpxft_000022ba_00000000-9_cdp.cudafe2.gpu" --nv_arch "compute_35" --module_id_file_name "/tmp/tmpxft_000022ba_00000000-3_cdp.module_id" --include_file_name "tmpxft_000022ba_00000000-2_cdp.fatbin.c" "/tmp/tmpxft_000022ba_00000000-8_cdp.cpp2.i"
#$ gcc -D__CUDA_ARCH__=350 -E -x c -DCUDA_DOUBLE_MATH_FUNCTIONS -D__CUDABE__ -D__CUDANVVM__ -D__CUDA_PREC_DIV -D__CUDA_PREC_SQRT "-I/opt/cuda-7.5/bin/..//include" -m64 "/tmp/tmpxft_000022ba_00000000-9_cdp.cudafe2.gpu" > "/tmp/tmpxft_000022ba_00000000-10_cdp.cpp3.i"
#$ filehash -s "--compile-only " "/tmp/tmpxft_000022ba_00000000-10_cdp.cpp3.i" > "/tmp/tmpxft_000022ba_00000000-11_cdp.hash"
#$ gcc -E -x c++ -D__CUDACC__ -D__NVCC__ -D__CUDACC_RDC__ "-I/opt/cuda-7.5/bin/..//include" -D"__CUDACC_VER__=70517" -D"__CUDACC_VER_BUILD__=17" -D"__CUDACC_VER_MINOR__=5" -D"__CUDACC_VER_MAJOR__=7" -include "cuda_runtime.h" -m64 "cdp.cu" > "/tmp/tmpxft_000022ba_00000000-5_cdp.cpp4.ii"
#$ cudafe++ --allow_managed --m64 --gnu_version=40603 --parse_templates --device-c --gen_c_file_name "/tmp/tmpxft_000022ba_00000000-4_cdp.cudafe1.cpp" --stub_file_name "tmpxft_000022ba_00000000-4_cdp.cudafe1.stub.c" --module_id_file_name "/tmp/tmpxft_000022ba_00000000-3_cdp.module_id" "/tmp/tmpxft_000022ba_00000000-5_cdp.cpp4.ii"
#$ cicc -arch compute_35 -m64 -ftz=0 -prec_div=1 -prec_sqrt=1 -fmad=1 -nvvmir-library "/opt/cuda-7.5/bin/../nvvm/libdevice/libdevice.compute_35.10.bc" --device-c --orig_src_file_name "cdp.cu" "/tmp/tmpxft_000022ba_00000000-10_cdp.cpp3.i" -o "/tmp/tmpxft_000022ba_00000000-6_cdp.ptx"
#$ ptxas -arch=sm_35 -m64 --compile-only "/tmp/tmpxft_000022ba_00000000-6_cdp.ptx" -o "/tmp/tmpxft_000022ba_00000000-13_cdp.sm_35.cubin"
#$ fatbinary --create="/tmp/tmpxft_000022ba_00000000-2_cdp.fatbin" -64 --key="xxxxxxxxxx" --cmdline="--compile-only " "--image=profile=sm_35,file=/tmp/tmpxft_000022ba_00000000-13_cdp.sm_35.cubin" "--image=profile=compute_35,file=/tmp/tmpxft_000022ba_00000000-6_cdp.ptx" --embedded-fatbin="/tmp/tmpxft_000022ba_00000000-2_cdp.fatbin.c" --cuda --device-c
#$ rm /tmp/tmpxft_000022ba_00000000-2_cdp.fatbin
#$ gcc -D__CUDA_ARCH__=350 -E -x c++ -DCUDA_DOUBLE_MATH_FUNCTIONS -D__CUDA_PREC_DIV -D__CUDA_PREC_SQRT "-I/opt/cuda-7.5/bin/..//include" -m64 "/tmp/tmpxft_000022ba_00000000-4_cdp.cudafe1.cpp" > "/tmp/tmpxft_000022ba_00000000-14_cdp.ii"
#$ gcc -c -x c++ "-I/opt/cuda-7.5/bin/..//include" -fpreprocessed -m64 -o "cdp" "/tmp/tmpxft_000022ba_00000000-14_cdp.ii"
it becomes obvious that this has only emitted a host object file with an embedded cubin payload (note that the final gcc invocation uses -c). There is no link step producing an executable, which is confirmed by the objdump output posted in an edit to your question.
The complicating factor here is that you must use separate (relocatable) device code compilation for dynamic parallelism and then link the device code, but you only have a single source file, so the conventional build approach (device compile, device link, host compile) would fail with duplicate symbols.
The solution seems to be this:
$ nvcc --dryrun -o cdp -rdc=true -arch=sm_35 cdp.cu
#$ _SPACE_=
#$ _CUDART_=cudart
#$ _HERE_=/opt/cuda-7.5/bin
#$ _THERE_=/opt/cuda-7.5/bin
#$ _TARGET_SIZE_=
#$ _TARGET_DIR_=
#$ _TARGET_SIZE_=64
#$ TOP=/opt/cuda-7.5/bin/..
#$ NVVMIR_LIBRARY_DIR=/opt/cuda-7.5/bin/../nvvm/libdevice
#$ LD_LIBRARY_PATH=/opt/cuda-7.5/bin/../lib:/opt/cuda-7.5/lib64
#$ PATH=/opt/cuda-7.5/bin/../open64/bin:/opt/cuda-7.5/bin/../nvvm/bin:/opt/cuda-7.5/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/opt/cuda-7.5/bin
#$ INCLUDES="-I/opt/cuda-7.5/bin/..//include"
#$ LIBRARIES= "-L/opt/cuda-7.5/bin/..//lib64/stubs" "-L/opt/cuda-7.5/bin/..//lib64"
#$ CUDAFE_FLAGS=
#$ PTXAS_FLAGS=
#$ gcc -D__CUDA_ARCH__=350 -E -x c++ -DCUDA_DOUBLE_MATH_FUNCTIONS -D__CUDACC__ -D__NVCC__ -D__CUDACC_RDC__ "-I/opt/cuda-7.5/bin/..//include" -D"__CUDACC_VER__=70517" -D"__CUDACC_VER_BUILD__=17" -D"__CUDACC_VER_MINOR__=5" -D"__CUDACC_VER_MAJOR__=7" -include "cuda_runtime.h" -m64 "cdp.cu" > "/tmp/tmpxft_00002454_00000000-9_cdp.cpp1.ii"
#$ cudafe --allow_managed --m64 --gnu_version=40603 -tused --no_remove_unneeded_entities --device-c --gen_c_file_name "/tmp/tmpxft_00002454_00000000-4_cdp.cudafe1.c" --stub_file_name "/tmp/tmpxft_00002454_00000000-4_cdp.cudafe1.stub.c" --gen_device_file_name "/tmp/tmpxft_00002454_00000000-4_cdp.cudafe1.gpu" --nv_arch "compute_35" --gen_module_id_file --module_id_file_name "/tmp/tmpxft_00002454_00000000-3_cdp.module_id" --include_file_name "tmpxft_00002454_00000000-2_cdp.fatbin.c" "/tmp/tmpxft_00002454_00000000-9_cdp.cpp1.ii"
#$ gcc -D__CUDA_ARCH__=350 -E -x c -DCUDA_DOUBLE_MATH_FUNCTIONS -D__CUDACC__ -D__NVCC__ -D__CUDACC_RDC__ -D__CUDANVVM__ -D__CUDA_PREC_DIV -D__CUDA_PREC_SQRT "-I/opt/cuda-7.5/bin/..//include" -m64 "/tmp/tmpxft_00002454_00000000-4_cdp.cudafe1.gpu" > "/tmp/tmpxft_00002454_00000000-10_cdp.cpp2.i"
#$ cudafe -w --allow_managed --m64 --gnu_version=40603 --c --device-c --gen_c_file_name "/tmp/tmpxft_00002454_00000000-11_cdp.cudafe2.c" --stub_file_name "/tmp/tmpxft_00002454_00000000-11_cdp.cudafe2.stub.c" --gen_device_file_name "/tmp/tmpxft_00002454_00000000-11_cdp.cudafe2.gpu" --nv_arch "compute_35" --module_id_file_name "/tmp/tmpxft_00002454_00000000-3_cdp.module_id" --include_file_name "tmpxft_00002454_00000000-2_cdp.fatbin.c" "/tmp/tmpxft_00002454_00000000-10_cdp.cpp2.i"
#$ gcc -D__CUDA_ARCH__=350 -E -x c -DCUDA_DOUBLE_MATH_FUNCTIONS -D__CUDABE__ -D__CUDANVVM__ -D__CUDA_PREC_DIV -D__CUDA_PREC_SQRT "-I/opt/cuda-7.5/bin/..//include" -m64 "/tmp/tmpxft_00002454_00000000-11_cdp.cudafe2.gpu" > "/tmp/tmpxft_00002454_00000000-12_cdp.cpp3.i"
#$ filehash -s "--compile-only " "/tmp/tmpxft_00002454_00000000-12_cdp.cpp3.i" > "/tmp/tmpxft_00002454_00000000-13_cdp.hash"
#$ gcc -E -x c++ -D__CUDACC__ -D__NVCC__ -D__CUDACC_RDC__ "-I/opt/cuda-7.5/bin/..//include" -D"__CUDACC_VER__=70517" -D"__CUDACC_VER_BUILD__=17" -D"__CUDACC_VER_MINOR__=5" -D"__CUDACC_VER_MAJOR__=7" -include "cuda_runtime.h" -m64 "cdp.cu" > "/tmp/tmpxft_00002454_00000000-5_cdp.cpp4.ii"
#$ cudafe++ --allow_managed --m64 --gnu_version=40603 --parse_templates --device-c --gen_c_file_name "/tmp/tmpxft_00002454_00000000-4_cdp.cudafe1.cpp" --stub_file_name "tmpxft_00002454_00000000-4_cdp.cudafe1.stub.c" --module_id_file_name "/tmp/tmpxft_00002454_00000000-3_cdp.module_id" "/tmp/tmpxft_00002454_00000000-5_cdp.cpp4.ii"
#$ cicc -arch compute_35 -m64 -ftz=0 -prec_div=1 -prec_sqrt=1 -fmad=1 -nvvmir-library "/opt/cuda-7.5/bin/../nvvm/libdevice/libdevice.compute_35.10.bc" --device-c --orig_src_file_name "cdp.cu" "/tmp/tmpxft_00002454_00000000-12_cdp.cpp3.i" -o "/tmp/tmpxft_00002454_00000000-6_cdp.ptx"
#$ ptxas -arch=sm_35 -m64 --compile-only "/tmp/tmpxft_00002454_00000000-6_cdp.ptx" -o "/tmp/tmpxft_00002454_00000000-15_cdp.sm_35.cubin"
#$ fatbinary --create="/tmp/tmpxft_00002454_00000000-2_cdp.fatbin" -64 --key="xxxxxxxxxx" --cmdline="--compile-only " "--image=profile=sm_35,file=/tmp/tmpxft_00002454_00000000-15_cdp.sm_35.cubin" "--image=profile=compute_35,file=/tmp/tmpxft_00002454_00000000-6_cdp.ptx" --embedded-fatbin="/tmp/tmpxft_00002454_00000000-2_cdp.fatbin.c" --cuda --device-c
#$ rm /tmp/tmpxft_00002454_00000000-2_cdp.fatbin
#$ gcc -D__CUDA_ARCH__=350 -E -x c++ -DCUDA_DOUBLE_MATH_FUNCTIONS -D__CUDA_PREC_DIV -D__CUDA_PREC_SQRT "-I/opt/cuda-7.5/bin/..//include" -m64 "/tmp/tmpxft_00002454_00000000-4_cdp.cudafe1.cpp" > "/tmp/tmpxft_00002454_00000000-16_cdp.ii"
#$ gcc -c -x c++ "-I/opt/cuda-7.5/bin/..//include" -fpreprocessed -m64 -o "/tmp/tmpxft_00002454_00000000-17_cdp.o" "/tmp/tmpxft_00002454_00000000-16_cdp.ii"
#$ nvlink --arch=sm_35 --register-link-binaries="/tmp/tmpxft_00002454_00000000-7_cdp_dlink.reg.c" -m64 "-L/opt/cuda-7.5/bin/..//lib64/stubs" "-L/opt/cuda-7.5/bin/..//lib64" -cpu-arch=X86_64 "/tmp/tmpxft_00002454_00000000-17_cdp.o" -lcudadevrt -o "/tmp/tmpxft_00002454_00000000-18_cdp_dlink.sm_35.cubin"
#$ fatbinary --create="/tmp/tmpxft_00002454_00000000-8_cdp_dlink.fatbin" -64 --key="cdp_dlink" --cmdline="--compile-only " -link "--image=profile=sm_35,file=/tmp/tmpxft_00002454_00000000-18_cdp_dlink.sm_35.cubin" --embedded-fatbin="/tmp/tmpxft_00002454_00000000-8_cdp_dlink.fatbin.c"
#$ rm /tmp/tmpxft_00002454_00000000-8_cdp_dlink.fatbin
#$ gcc -c -x c++ -DFATBINFILE="\"/tmp/tmpxft_00002454_00000000-8_cdp_dlink.fatbin.c\"" -DREGISTERLINKBINARYFILE="\"/tmp/tmpxft_00002454_00000000-7_cdp_dlink.reg.c\"" -I. "-I/opt/cuda-7.5/bin/..//include" -D"__CUDACC_VER__=70517" -D"__CUDACC_VER_BUILD__=17" -D"__CUDACC_VER_MINOR__=5" -D"__CUDACC_VER_MAJOR__=7" -m64 -o "/tmp/tmpxft_00002454_00000000-19_cdp_dlink.o" "/opt/cuda-7.5/bin/crt/link.stub"
#$ g++ -m64 -o "cdp" -Wl,--start-group "/tmp/tmpxft_00002454_00000000-19_cdp_dlink.o" "/tmp/tmpxft_00002454_00000000-17_cdp.o" "-L/opt/cuda-7.5/bin/..//lib64/stubs" "-L/opt/cuda-7.5/bin/..//lib64" -lcudadevrt -lcudart_static -lrt -lpthread -ldl -Wl,--end-group
i.e. just pass -rdc=true. It seems that, for the single source file case, the necessary device link stage is implicitly performed, and the result is an executable which should work:
$ nvcc -o cdp -rdc=true -arch=sm_35 cdp.cu
$ file cdp
cdp: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically linked (uses shared libs), for GNU/Linux 2.6.24, BuildID[sha1]=0xdcd6119fb9e2efdf2759093e8e9b762d0a55ddfd, not stripped
Note that I haven't run this because I am doing the build on a system with a GPU without dynamic parallelism support.
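For completeness: with more than one translation unit, the separate-compilation flow has to be spelled out explicitly. A sketch of what that would look like (the file names and CUDA install path are assumptions, and I haven't run this either):
nvcc -arch=sm_35 -rdc=true -dc a.cu b.cu
nvcc -arch=sm_35 -dlink a.o b.o -o dlink.o
g++ a.o b.o dlink.o -o app -L/usr/local/cuda/lib64 -lcudadevrt -lcudart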

Using a text file as input for a binary

I'm trying to test my homework (the knapsack problem), and it's getting repetitive entering all these inputs every time I recompile.
Here's what I've got:
will#will-mint ~/code/byun-sp15 $ g++ knapsack.cpp
will#will-mint ~/code/byun-sp15 $ ./a.out
Please enter number of items: 3
Please enter Knapsack Capacity: 10
Enter weights and values of 3 items:
Item 1: 3 40
Item 2: 2 10
Item 3: 5 50
0 * 0 * 0 * 0 * 0 * 0 * 0 * 0 * 0 * 0 *
0 * 10 * 10 * 10 * 10 * 10 * 10 * 10 * 10 * 10 *
0 * 50 * 50 * 99 * 50 * 50 * 60 * 60 * 60 * 60 *
Clearly my table is not correct, please do not help me there. NO SPOILERS!
I put 3 10 3 40 2 10 5 50 in test.txt and tried the following:
will#will-mint ~/code/byun-sp15 $ vim test.txt
will#will-mint ~/code/byun-sp15 $ test.txt > ./a.out
test.txt: command not found
will#will-mint ~/code/byun-sp15 $ cat test.txt | ./a.out
will#will-mint ~/code/byun-sp15 $ cat test.txt | ./a.out >> output.log
will#will-mint ~/code/byun-sp15 $ vim output.log
will#will-mint ~/code/byun-sp15 $ cat test.txt | ./a.out 2> output.log
will#will-mint ~/code/byun-sp15 $ vim output.log
will#will-mint ~/code/byun-sp15 $ ./a.out < test.txt
will#will-mint ~/code/byun-sp15 $ cat test.txt | ./a.out > output.log
will#will-mint ~/code/byun-sp15 $ vim output.log
will#will-mint ~/code/byun-sp15 $ ./a.out << test.txt
None of these worked. I need help with my bash-fu: how can I use a string of space-separated numbers in a text file as input for my a.out?
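For the record, two of the attempts above are the standard forms, shown again below. By contrast, test.txt > ./a.out tries to execute test.txt as a command and redirect its output, and ./a.out << test.txt starts a here-document delimited by the literal word test.txt rather than reading the file.
./a.out < test.txt        # redirect the file to the program's standard input
cat test.txt | ./a.out    # equivalent, via a pipe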

How to know if process has truly finished with a dlclose()ed library?

I'm on Linux (Ubuntu 12.04, gcc 4.6.3), trying to bend dlopen/close to my will in order to make a plugin-based application that can reload its plugins as and when necessary (e.g. if they're recompiled).
The basic theory is simple: dlopen the plugin; use it, keeping track of all its symbols that are in use. When the time comes to reload, clean up all the symbols and dlclose the plugin.
I threw together a simple demo app, 'test.cpp':
#include <dlfcn.h>
#include <iostream>

using namespace std;

int main(int argc, char** argv)
{
    if (argc > 1)
    {
        void* h = dlopen(argv[1], RTLD_NOW|RTLD_LOCAL);
        if (!h)
        {
            cerr << "ERROR: " << dlerror() << endl;
            return 1;
        }
        cin.get();
        if (dlclose(h))
        {
            cerr << "ERROR: " << dlerror() << endl;
            return 2;
        }
        cin.get();
    }
    return 0;
}
Compile with:
g++ test.cpp -o test -ldl
To make a trivial library that can be passed as an argument to the above code, use:
touch libtest.cpp && g++ -rdynamic -shared libtest.cpp -o libtest.so
Then run with:
./test ./libtest.so
Here's the problem: if, after hitting [Enter] once (i.e. after loading and supposedly unloading the library), you run pmap to check which libraries are loaded in test, it will tell you that libtest.so is still there! This is despite a valid return from dlclose() and no reasonable way the reference count could have risen above 1 before that (which can be verified by attempting a second dlclose() - it returns an error saying the library is already closed).
So either Linux never unloads a dlopen()ed library (which contradicts the documentation), or pmap is wrong. If it is the latter, is there a more reliable method of determining whether a library is still loaded?
I don't observe the same as you with the below program:
// file soq.c
#include <dlfcn.h>
#include <iostream>
#include <cstdio>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

using namespace std;

int main(int argc, char** argv)
{
    char cmd[60];
    snprintf(cmd, sizeof(cmd), "pmap %d", getpid());
    if (argc > 1)
    {
        void* h = dlopen(argv[1], RTLD_NOW|RTLD_LOCAL);
        if (!h)
        {
            cerr << "ERROR: " << dlerror() << endl;
            return 1;
        }
        cerr << "after dlopen " << argv[1] << endl;
        system(cmd);
        cin.get();
        if (dlclose(h))
        {
            cerr << "ERROR: " << dlerror() << endl;
            return 2;
        }
        cin.get();
        cerr << "after close " << argv[1] << endl;
        system(cmd);
    }
    return 0;
}
I am getting, as expected:
% ./soq ./libempty.so
after dlopen ./libempty.so
5276: ./soq ./libempty.so
0000000000400000 8K r-x-- /home/basile/tmp/soq
0000000000601000 4K rw--- /home/basile/tmp/soq
0000000001b4d000 132K rw--- [ anon ]
00007f1dbfd01000 4K r-x-- /home/basile/tmp/libempty.so
00007f1dbfd02000 2044K ----- /home/basile/tmp/libempty.so
00007f1dbff01000 4K rw--- /home/basile/tmp/libempty.so
00007f1dbff02000 1524K r-x-- /lib/x86_64-linux-gnu/libc-2.13.so
00007f1dc007f000 2048K ----- /lib/x86_64-linux-gnu/libc-2.13.so
00007f1dc027f000 16K r---- /lib/x86_64-linux-gnu/libc-2.13.so
00007f1dc0283000 4K rw--- /lib/x86_64-linux-gnu/libc-2.13.so
00007f1dc0284000 20K rw--- [ anon ]
00007f1dc0289000 84K r-x-- /lib/x86_64-linux-gnu/libgcc_s.so.1
00007f1dc029e000 2048K ----- /lib/x86_64-linux-gnu/libgcc_s.so.1
00007f1dc049e000 4K rw--- /lib/x86_64-linux-gnu/libgcc_s.so.1
00007f1dc049f000 516K r-x-- /lib/x86_64-linux-gnu/libm-2.13.so
00007f1dc0520000 2044K ----- /lib/x86_64-linux-gnu/libm-2.13.so
00007f1dc071f000 4K r---- /lib/x86_64-linux-gnu/libm-2.13.so
00007f1dc0720000 4K rw--- /lib/x86_64-linux-gnu/libm-2.13.so
00007f1dc0721000 928K r-x-- /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.17
00007f1dc0809000 2048K ----- /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.17
00007f1dc0a09000 32K r---- /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.17
00007f1dc0a11000 8K rw--- /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.17
00007f1dc0a13000 84K rw--- [ anon ]
00007f1dc0a28000 8K r-x-- /lib/x86_64-linux-gnu/libdl-2.13.so
00007f1dc0a2a000 2048K ----- /lib/x86_64-linux-gnu/libdl-2.13.so
00007f1dc0c2a000 4K r---- /lib/x86_64-linux-gnu/libdl-2.13.so
00007f1dc0c2b000 4K rw--- /lib/x86_64-linux-gnu/libdl-2.13.so
00007f1dc0c2c000 128K r-x-- /lib/x86_64-linux-gnu/ld-2.13.so
00007f1dc0e1c000 20K rw--- [ anon ]
00007f1dc0e49000 8K rw--- [ anon ]
00007f1dc0e4b000 4K r---- /lib/x86_64-linux-gnu/ld-2.13.so
00007f1dc0e4c000 4K rw--- /lib/x86_64-linux-gnu/ld-2.13.so
00007f1dc0e4d000 4K rw--- [ anon ]
00007fff076c3000 132K rw--- [ stack ]
00007fff077b4000 4K r-x-- [ anon ]
ffffffffff600000 4K r-x-- [ anon ]
total 15984K
after close ./libempty.so
5276: ./soq ./libempty.so
0000000000400000 8K r-x-- /home/basile/tmp/soq
0000000000601000 4K rw--- /home/basile/tmp/soq
0000000001b4d000 132K rw--- [ anon ]
00007f1dbff02000 1524K r-x-- /lib/x86_64-linux-gnu/libc-2.13.so
00007f1dc007f000 2048K ----- /lib/x86_64-linux-gnu/libc-2.13.so
00007f1dc027f000 16K r---- /lib/x86_64-linux-gnu/libc-2.13.so
00007f1dc0283000 4K rw--- /lib/x86_64-linux-gnu/libc-2.13.so
00007f1dc0284000 20K rw--- [ anon ]
00007f1dc0289000 84K r-x-- /lib/x86_64-linux-gnu/libgcc_s.so.1
00007f1dc029e000 2048K ----- /lib/x86_64-linux-gnu/libgcc_s.so.1
00007f1dc049e000 4K rw--- /lib/x86_64-linux-gnu/libgcc_s.so.1
00007f1dc049f000 516K r-x-- /lib/x86_64-linux-gnu/libm-2.13.so
00007f1dc0520000 2044K ----- /lib/x86_64-linux-gnu/libm-2.13.so
00007f1dc071f000 4K r---- /lib/x86_64-linux-gnu/libm-2.13.so
00007f1dc0720000 4K rw--- /lib/x86_64-linux-gnu/libm-2.13.so
00007f1dc0721000 928K r-x-- /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.17
00007f1dc0809000 2048K ----- /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.17
00007f1dc0a09000 32K r---- /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.17
00007f1dc0a11000 8K rw--- /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.17
00007f1dc0a13000 84K rw--- [ anon ]
00007f1dc0a28000 8K r-x-- /lib/x86_64-linux-gnu/libdl-2.13.so
00007f1dc0a2a000 2048K ----- /lib/x86_64-linux-gnu/libdl-2.13.so
00007f1dc0c2a000 4K r---- /lib/x86_64-linux-gnu/libdl-2.13.so
00007f1dc0c2b000 4K rw--- /lib/x86_64-linux-gnu/libdl-2.13.so
00007f1dc0c2c000 128K r-x-- /lib/x86_64-linux-gnu/ld-2.13.so
00007f1dc0e1c000 20K rw--- [ anon ]
00007f1dc0e48000 12K rw--- [ anon ]
00007f1dc0e4b000 4K r---- /lib/x86_64-linux-gnu/ld-2.13.so
00007f1dc0e4c000 4K rw--- /lib/x86_64-linux-gnu/ld-2.13.so
00007f1dc0e4d000 4K rw--- [ anon ]
00007fff076c3000 132K rw--- [ stack ]
00007fff077b4000 4K r-x-- [ anon ]
ffffffffff600000 4K r-x-- [ anon ]
total 13936K
So you must have run your pmap incorrectly.
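As for a more reliable method: pmap itself just formats /proc/<pid>/maps, so the process can read that file directly after dlclose. A minimal sketch (is_mapped is an illustrative helper, not a standard API):
// a sketch: true if any line of /proc/self/maps mentions `name`
#include <fstream>
#include <string>

bool is_mapped(const std::string& name)
{
    std::ifstream maps("/proc/self/maps");
    std::string line;
    while (std::getline(maps, line))
        if (line.find(name) != std::string::npos)
            return true;
    return false;
}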
By the way, you could avoid dlclose-ing altogether, as my manydl.c example demonstrates. In practice, not bothering to dlclose means only a tiny address-space leak, which is rarely a big deal (you can dlopen nearly a million different shared objects without much hurt).
And to know when a shared object is unloaded, use the "destructor" functions of your dlclose-d plugin (e.g. the destructor of static data in C++, or __attribute__((destructor)) in C code), because they are called from inside the unloading dlclose.
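For instance, the trivial libtest.cpp from the question could report its own unloading (a sketch; the message is illustrative):
// libtest.cpp
#include <cstdio>

// called from inside dlclose, when the library is actually unloaded
__attribute__((destructor))
static void on_unload()
{
    std::fprintf(stderr, "libtest.so unloaded\n");
}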