Thread-safe std::string and std::stringbuf in C++

In my multithreaded C++ program on Solaris 10, built with GNU g++ 2.95.3, I am hitting a contention issue when one thread calls the string constructor while another thread calls a stringbuf constructor, as shown below:
One thread is calling
__t12basic_string3ZcZt18string_char_traits1ZcZt24__default_alloc_template2b0i0PCc
i.e.
basic_string<char, string_char_traits<char>, __default_alloc_template<false, 0>>::basic_string(char const *)
Another thread is calling
__9stringbufRCt12basic_string3ZcZt18string_char_traits1ZcZt24__default_alloc_template2b0i0i
i.e.
stringbuf::stringbuf(basic_string<char, string_char_traits<char>, __default_alloc_template<false, 0> > const &, int)
Both threads eventually allocate memory for the string or stringbuf, and the resulting contention ends in a SIGBUS error.
Any clue?
Do I need to overload malloc or new to make them thread-safe, or do I need to overload string and stringbuf to ensure thread safety?
I use the following g++ linker flags:
g++ -g -lclntsh -ldl -lsocket -lrt -lthread
Do I need to add 'g++ -pthread' during compilation? I hope this would add -D_REENTRANT as well, but g++ 2.95.3 on Solaris does not support -pthread.
Here's more pstack core output:
core 'core' of 28477:
----------------- lwp# 1 / thread# 1 --------------------
fcb48ac4 __lwp_park (0, 0, fcbb74c4, 1c00, 0, 0) + 14
ff375244 sem_wait (1da56fd8, 1da56c00, 8ea400, 2aae33d0, fcbb49cc, 0) + 20
....
001965f0 _start (0, 0, 0, 0, 0, 0) + 5c
----------------- lwp# 2 / thread# 2 --------------------
fcb4c714 _lwp_kill (a, fed84948, 0, 1, fffffffc, 0) + 8
fdfd6c6c skgesigOSCrash (fc3eea54, fc3ee8fc, ff031460, fef7536c, bc0f4, bc000) + 4c
fe5253a0 kpeDbgSignalHandler (fc3eea54, 2b200cd8, 9e800, a, fc3eee5d, 6060000) + 2f0
fdfd71ac skgesig_sigactionHandler (a, fc3ef6e0, fefef8f8, 0, ff0314a0, fc3eea40) + f0
fcb48b4c __sighndlr (a, fc3ef6e0, fc3ef428, fdfd70bc, 0, 1) + c
fcb3d1f8 call_user_handler (a, 0, 8, 0, fc540200, fc3ef428) + 3b8
fcb3d3cc sigacthandler (a, fc3ef6e0, fc3ef428, 1, fc540200, 0) + 4c
--- called from signal handler with signal 10 (SIGBUS) ---
00608e00 allocate__t24__default_alloc_template2b0i0Ui (20, 20, 9eff80, 1c7be, 2ab11e06, fcbb4f18) + a4
003cde3c __nw__Q2t12basic_string3ZcZt18string_char_traits1ZcZt24__default_alloc_template2b0i0_3RepUiUi (10, 10, ff000000, 4, fc540200, 861908) + 14
003cef8c create__Q2t12basic_string3ZcZt18string_char_traits1ZcZt24__default_alloc_template2b0i0_3RepUi (2, 2, 2b10d5c8, fcbb5434, fcbb5784, fffc00) + 24
003ee944 replace__t12basic_string3ZcZt18string_char_traits1ZcZt24__default_alloc_template2b0i0UiUiPCcUi (fc3efaf8, 0, ffffffff, 8618a0, 2, 80808080) + 114
006919cc assign__t12basic_string3ZcZt18string_char_traits1ZcZt24__default_alloc_template2b0i0PCcUi (fc3efaf8, 8618a0, 2, 0, fc540200, 7164b4) + 24
00666684 assign__t12basic_string3ZcZt18string_char_traits1ZcZt24__default_alloc_template2b0i0PCc (fc3efaf8, 8618a0, 8ea400, 0, fc5706c0, 0) + 24
005f7868 __t12basic_string3ZcZt18string_char_traits1ZcZt24__default_alloc_template2b0i0PCc (fc3efaf8, 8618a0, 861800, 0, fc3efa90, a4090) + 28
....
----------------- lwp# 3 / thread# 3 --------------------
fcb41818 mutex_lock_impl (fcbb3910, 0, 2dfc0030, fc3beee0, 6, 80808080) + 4
fcad640c malloc (6, 1, d9fd8, 6919cc, fcbb03a8, fcbba518) + 44
003c878c __builtin_vec_new (6, 3, 37, fc3bee38, ff32a888, 4) + 3c
005e16ec __9stringbufRCt12basic_string3ZcZt18string_char_traits1ZcZt24__default_alloc_template2b0i0i (fc3bee60, fc3bee58, 3, 0, 2b2ab6e8, 20) + 64
...
----------------- lwp# 4 / thread# 4 --------------------
fcb48ac4 __lwp_park (0, 0, fcbb74c4, 1c00, 0, 0) + 14
ff375244 sem_wait (1da56f48, 1da56c00, 1, 0, fc541200, 0) + 20
....
fcb48a20 _lwp_start (0, 0, 0, 0, 0, 0)

Related

Dataset of invalid geometries in boost::geometry

Does a dataset exist of all possible invalid geometries for use with C++ and the boost::geometry library? Or at least the polygon coordinates of such invalid geometries, which I could translate to boost::geometry?
Example: self-intersection, etc.
I would like to test my application with at least all possible invalid geometries.
Something like this:
https://knowledge.safe.com/articles/21674/invalid-ogc-geometry-examples.html
but with more test cases with inner and outer polygons.
I have created a library 'boost_geometry_make_valid' which allows correcting the errors described in this dataset:
https://github.com/kleunen/boost_geometry_make_valid
I now use the dataset for testing, and the library is able to correct all the mentioned failures, most importantly removing self-intersections from polygons.
The Boost Geometry library implements the OGC standard. From the intro:
The library follows existing conventions:
conventions from Boost
conventions from the std library
conventions and names from one of the OGC standards on geometry and, more specifically, from the OGC Simple Feature Specification
So the list that you used is relevant.
Besides, you can use the is_valid function with a reason parameter to interrogate the library about your geometry. I have several examples on this site showing how to do that. (Note: Not all constraints might be validatable)
Your Samples, Live
Let's adopt the outer ring orientation from the samples (not the BG default):
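#include <boost/geometry.hpp>
#include <boost/geometry/geometries/point_xy.hpp>
#include <boost/geometry/geometries/polygon.hpp>
#include <boost/geometry/geometries/multi_polygon.hpp>
#include <iostream>
#include <string>
namespace bg = boost::geometry;
using pt = bg::model::d2::point_xy<double>;
using poly = bg::model::polygon<pt, false>; // false: counter-clockwise outer ring
using multi = bg::model::multi_polygon<poly>;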
Let's create a generalized checker:
template <typename Geo = poly> void check(std::string wkt) {
    Geo g;
    bg::read_wkt(wkt, g);

    // Report validity together with the library's textual reason
    std::string reason;
    bool ok = bg::is_valid(g, reason);
    std::cout << "Valid: " << std::boolalpha << ok << " (" << reason << ")\n";

    // Try the built-in correction and report only if the result is valid
    bg::correct(g);
    if (bg::is_valid(g, reason)) {
        std::cout << "Autocorrected: " << bg::wkt(g) << "\n";
    }
}
And run it for all the test cases:
//Hole Outside Shell
check("POLYGON((0 0, 10 0, 10 10, 0 10, 0 0), (15 15, 15 20, 20 20, 20 15, 15 15))");
//Nested Holes
check("POLYGON((0 0, 10 0, 10 10, 0 10, 0 0), (2 2, 2 8, 8 8, 8 2, 2 2), (3 3, 3 7, 7 7, 7 3, 3 3))");
//Disconnected Interior
check("POLYGON((0 0, 10 0, 10 10, 0 10, 0 0), (5 0, 10 5, 5 10, 0 5, 5 0))");
//Self Intersection
check("POLYGON((0 0, 10 10, 0 10, 10 0, 0 0))");
//Ring Self Intersection
check("POLYGON((5 0, 10 0, 10 10, 0 10, 0 0, 5 0, 3 3, 5 6, 7 3, 5 0))");
//Nested Shells
check<multi>("MULTIPOLYGON(((0 0, 10 0, 10 10, 0 10, 0 0)),(( 2 2, 8 2, 8 8, 2 8, 2 2)))");
//Duplicated Rings
check<multi>("MULTIPOLYGON(((0 0, 10 0, 10 10, 0 10, 0 0)),((0 0, 10 0, 10 10, 0 10, 0 0)))");
//Too Few Points
check("POLYGON((2 2, 8 2))");
//Invalid Coordinate
check("POLYGON((NaN 3, 3 4, 4 4, 4 3, 3 3))");
//Ring Not Closed
check("POLYGON((0 0, 0 10, 10 10, 10 0))");
Output
Prints
Valid: false (Geometry has interior rings defined outside the outer boundary)
Valid: false (Geometry has nested interior rings)
Valid: false (Geometry has wrong orientation)
Valid: false (Geometry has wrong orientation)
Valid: false (Geometry has invalid self-intersections. A self-intersection point was found at (5, 0); method: t; operations: i/i; segment IDs {source, multi, ring, segment}: {0, -1, -1, 4}/{0, -1, -1, 8})
Valid: false (Multi-polygon has intersecting interiors)
Valid: false (Geometry has invalid self-intersections. A self-intersection point was found at (10, 0); method: e; operations: c/c; segment IDs {source, multi, ring, segment}: {0, 0, -1, 0}/{0, 1, -1, 0})
Valid: false (Geometry has too few points)
Valid: false (Geometry has point(s) with invalid coordinate(s))
Valid: false (Geometry is defined as closed but is open)
Autocorrected: POLYGON((0 0,10 0,10 10,0 10,0 0))
Note: bg::correct may in some cases correct only /part/ of the problem and leave other issues, and this check function doesn't report on that.
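To make a few of the simpler failure reasons above more concrete, here is a minimal sketch in plain Python. It is not how Boost.Geometry implements is_valid; the names signed_area and simple_ring_check are made up for illustration, and the order of the checks is an assumption. It only covers invalid coordinates, too few points, an unclosed ring, and orientation (via the shoelace formula) for a single ring given as (x, y) tuples.

```python
import math

def signed_area(ring):
    """Shoelace formula for a closed ring: positive means counter-clockwise."""
    return 0.5 * sum(x1 * y2 - x2 * y1
                     for (x1, y1), (x2, y2) in zip(ring, ring[1:]))

def simple_ring_check(ring, clockwise=False):
    """Hypothetical helper: return a short reason string for one ring."""
    # NaN or infinite coordinates make every later test meaningless.
    if any(math.isnan(c) or math.isinf(c) for point in ring for c in point):
        return "invalid coordinate"
    # A closed ring needs at least 4 points (the first repeated as the last).
    if len(ring) < 4:
        return "too few points"
    if ring[0] != ring[-1]:
        return "not closed"
    area = signed_area(ring)
    # Assumption of this sketch: a zero area also counts as wrong orientation.
    if (area >= 0) if clockwise else (area <= 0):
        return "wrong orientation"
    return "ok"

square = [(0, 0), (10, 0), (10, 10), (0, 10), (0, 0)]
print(simple_ring_check(square))        # counter-clockwise, as in the samples
print(simple_ring_check(square[::-1]))  # same ring, reversed
```

Self-intersections, nested holes and multi-polygon overlaps need real segment intersection tests and are deliberately left out of this sketch.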

hash_map implementation (Solaris 10, g++ 2.95.3) with default_alloc

I am getting a core dump; pstack on the core shows the following:
fed4ebd4 _lwp_kill (b, fe984948, 0, 1, fffffffc, 0) + 8
fdbd6c6c skgesigOSCrash (ffbfb5d4, ffbfb47c, fec31460, feb7536c, bc0f4, bc000) + 4c
fe1253a0 kpeDbgSignalHandler (ffbfb5d4, b13d630, 9e800, b, ffbfb9dd, 6060000) + 2f0
fdbd71ac skgesig_sigactionHandler (b, ffbfc260, febef8f8, 0, fec314a0, ffbfb5c0) + f0
fed4b00c __sighndlr (b, ffbfc260, ffbfbfa8, fdbd70bc, 0, 1) + c
fed3f6bc call_user_handler (b, 0, 8, 0, fc7f2a00, ffbfbfa8) + 3b8
fed3f8a4 sigacthandler (b, ffbfc260, ffbfbfa8, 29a8d0, 0, 0) + 60
--- called from signal handler with signal 11 (SIGSEGV) ---
fecd8028 _malloc_unlocked (308, bad8f38, bad8f38, bad8f38, fffffffc, 0) + 22c
fecd7de0 malloc (304, 1, ea654, 297f1c, fedc23f0, fedcc5e0) + 4c
002f0ee8 malloc (304, ffbfddd4, ba56c40, 1c, 3a, ba56d7d) + 54
001c4c78 allocate__t23__malloc_alloc_template1i0Ui (304, 0, 0, ffbfd99f, 0, 80808080) + c
001c4cfc allocate__t24__default_alloc_template2b0i0Ui (304, 304, ffbfd99f, 1, b, 0) + 18
0029a80c allocate__t12simple_alloc2ZPt15_Hashtable_node1Zt4pair2ZCt12basic_string3ZcZt18string_char_traits1ZcZt24__default_alloc_template2b0i0Zt6vector2ZiZt9allocator1ZiZt24__default_alloc_template2b0i0Ui (c1, 0, 0, 0, 0, ffffffff) + 20
00299d38 _M_allocate__t18_Vector_alloc_base3ZPt15_Hashtable_node1Zt4pair2ZCt12basic_string3ZcZt18string_char_traits1ZcZt24__default_alloc_template2b0i0Zt6vector2ZiZt9allocator1ZiZt9allocator1Zt6vector2ZiZt9allocator1Zib1Ui (ffbfddf4, c1, 0, ffffffff, a, 80808080) + 10
0029a8d0 _M_allocate_and_copy__H1ZPPt15_Hashtable_node1Zt4pair2ZCt12basic_string3ZcZt18string_char_traits1ZcZt24__default_alloc_template2b0i0Zt6vector2ZiZt9allocator1Zi_t6vector2ZPt15_Hashtable_node1Zt4pair2ZCt12basic_string3ZcZt18string_char_traits1ZcZt (ffbfddf4, c1, 0, 0, ffbfe047, 1) + 1c
00299e24 reserve__t6vector2ZPt15_Hashtable_node1Zt4pair2ZCt12basic_string3ZcZt18string_char_traits1ZcZt24__default_alloc_template2b0i0Zt6vector2ZiZt9allocator1ZiZt9allocator1Zt6vector2ZiZt9allocator1ZiUi (ffbfddf4, c1, ffbfdb74, 0, ffbfe047, 7ffffff0) + 48
00297f1c _M_copy_from__t9hashtable6Zt4pair2ZCt12basic_string3ZcZt18string_char_traits1ZcZt24__default_alloc_template2b0i0Zt6vector2ZiZt9allocator1ZiZt12basic_string3ZcZt18string_char_traits1ZcZt24__default_alloc_template2b0i0Z7strhashZt10_Select1st1Zt4pa (ffbfddf0, ffbfddd4, c1, ffbfdbe8, 38, ba56602) + 3c
0029ace8 __t9hashtable6Zt4pair2ZCt12basic_string3ZcZt18string_char_traits1ZcZt24__default_alloc_template2b0i0Zt6vector2ZiZt9allocator1ZiZt12basic_string3ZcZt18string_char_traits1ZcZt24__default_alloc_template2b0i0Z7strhashZt10_Select1st1Zt4pair2ZCt12basi (ffbfddf0, ffbfddd4, ba56c40, 1c, 3a, ba56d7d) + 124
002998ac __t8hash_map5Zt12basic_string3ZcZt18string_char_traits1ZcZt24__default_alloc_template2b0i0Zt6vector2ZiZt9allocator1ZiZ7strhashZ5streqZt9allocator1Zt6vector2ZiZt9allocator1ZiRCt8hash_map5Zt12basic_string3ZcZt18string_char_traits1ZcZt24__default_a (ffbfddf0, ffbfddd4, ffbfdcef, ffbfdcee, ffbfdce0, 80808080) + 14
00294754 __Q212Notification4._47RCQ212Notification4._47 (ffbfddec, ffbfddd0, 484564, 1, b, 0) + 1c
0029adf8 __t4pair2ZCt12basic_string3ZcZt18string_char_traits1ZcZt24__default_alloc_template2b0i0ZQ212Notification9_RecordIdRCt12basic_string3ZcZt18string_char_traits1ZcZt24__default_alloc_template2b0i0RCQ212Notification9_RecordId (ffbfdde8, ffbfde88, ffbfddd0, 0, 0, ffffffff) + 2c
00299fcc __vc__t8hash_map5Zt12basic_string3ZcZt18string_char_traits1ZcZt24__default_alloc_template2b0i0ZQ212Notification9_RecordIdZ7strhashZ5streqZt9allocator1ZQ212Notification9_RecordIdRCt12basic_string3ZcZt18string_char_traits1ZcZt24__default_alloc_tem (484564, ffbfde88, ffbfde80, ffffffff, a, 80808080) + 2c
002948a8 _Add__12NotificationPCcP8RecordId (484560, ba56d60, 0, ffbfe047, ffbfe047, 1) + a0
...
The core dump occurs when using hash_map in a program compiled with g++ 2.95.3 on Solaris 10.
What is __Q212Notification4._47RCQ212Notification4._47?
Any clue would be appreciated.

What could be cause of allocate failing with a SIGBUS

I am getting a runtime SIGBUS with the call stack showing as below:
--- called from signal handler with signal 10 (SIGBUS) ---
001279b8 allocate__t24__default_alloc_template2b0i0Ui (20, 20, 2fa3c0, 32, 0, 0) + a4
00117380 __nw__Q2t12basic_string3ZcZt18string_char_traits1ZcZt24__default_alloc_template2b0i0_3RepUiUi (10, 10, 838e00, 0, 0, 0) + 14
001173c0 create__Q2t12basic_string3ZcZt18string_char_traits1ZcZt24__default_alloc_template2b0i0_3RepUi (9, 9, 838e00, 9, 0, 0) + 24
00117784 replace__t12basic_string3ZcZt18string_char_traits1ZcZt24__default_alloc_template2b0i0UiUiPCcUi (fbf7f758, 0, ffffffff, fcbf40c2, 9, 80808080) + 114
0012b988 assign__t12basic_string3ZcZt18string_char_traits1ZcZt24__default_alloc_template2b0i0PCcUi (fbf7f758, fcbf40c2, 9, ffffffff, ffffffff, 20) + 24
0012a35c assign__t12basic_string3ZcZt18string_char_traits1ZcZt24__default_alloc_template2b0i0PCc (fbf7f758, fcbf40c2, 90, b0, 1ff0, 0) + 24
00127170 __t12basic_string3ZcZt18string_char_traits1ZcZt24__default_alloc_template2b0i0PCc (fbf7f758, fcbf40c2, fcbf40b8, 1c00, 9, 7c) + 28
What could be the cause?
A stack overflow, or no more space on the heap?

Parallel algorithm that does a small insertion/shifting

Say I have an array A of 8 numbers, and another array B that determines how many places each number in A should be shifted to the right:
A 3, 6, 7, 8, 1, 2, 3, 5
B 0, 1, 0, 0, 0, 0, 0, 0
0 means the position is valid; 1 means the number should end up 1 place further to the right. Here a 0 should be inserted after the 3, so the output array C should be:
C: 3,0,6,7,8,1,2,3
Whether to insert 0 or something else is not important; the point is that all numbers after 3 get shifted by one place. Numbers shifted out of bounds are dropped from the array.
Another example:
A 3, 6, 7, 8, 1, 2, 3, 5
B 0, 1, 0, 0, 2, 0, 0, 0
C 3, 0, 6, 7, 8, 0, 1, 2
.......................................
A 3, 6, 7, 8, 1, 2, 3, 5
B 0, 1, 0, 0, 1, 0, 0, 0
C 3, 0, 6, 7, 8, 1, 2, 3
I am thinking about using scan/prefix-sum or something similar to solve this problem. Also, the array is small enough that I should be able to fit it in one warp (<32 numbers) and use shuffle instructions. Does anyone have an idea?
One possible approach.
Due to the ambiguity of your shifting (0, 1, 0, 1, 0, 1, 1, 1 and 0, 1, 0, 0 all produce the same data offset pattern, for example), it's not possible to just create a prefix sum of the shift pattern to produce the relative offset at each position. An observation we can make, however, is that a valid offset pattern will be created if each zero in the shift pattern gets replaced by the first non-zero shift value to its left:
0, 1, 0, 0 (shift pattern)
0, 1, 1, 1 (offset pattern)
or
0, 2, 0, 2 (shift pattern)
0, 2, 2, 2 (offset pattern)
So how to do this? Let's assume we have the second test case shift pattern:
0, 1, 0, 0, 2, 0, 0, 0
Our desired offset pattern would be:
0, 1, 1, 1, 2, 2, 2, 2
1. For a given shift pattern, create a binary value in which each bit is one if the value at the corresponding index in the shift pattern is zero, and zero otherwise. We can use a warp vote instruction, called __ballot(), for this. Each lane will get the same value from the ballot:
1 0 1 1 0 1 1 1 (this is a single binary 8-bit value in this case)
2. Each warp lane now takes this value and adds to it a value which has a single 1 bit at the warp lane position. Using lane 1 for the remainder of the example:
+ 0 0 0 0 0 0 1 0 (the only 1 bit in this value will be at the lane index)
= 1 0 1 1 1 0 0 1
3. We now take the result of step 2 and bitwise exclusive-OR it with the result from step 1:
= 0 0 0 0 1 1 1 0
4. We now count the number of 1 bits in this value (there is a __popc() intrinsic for this) and subtract one from the result. So for the lane 1 example above, the result of this step would be 2, since there are 3 bits set. This gives us the distance to the first value to our left that is non-zero in the original shift pattern. So for the lane 1 example, the first non-zero value to the left of lane 1 is 2 lanes higher, i.e. lane 3.
5. For each lane, we use the result of step 4 to grab the appropriate offset value for that lane. We can process all lanes at once using a __shfl_down() warp shuffle instruction.
0, 1, 1, 1, 2, 2, 2, 2
Thus producing our desired "offset pattern".
Once we have the desired offset pattern, the process of having each warp lane use its offset value to appropriately shift its data item is straightforward.
Here is a fully worked example, using your 3 test cases. Steps 1-4 above are contained in the __device__ function mydelta. The remainder of the kernel is performing the step 5 shuffle, appropriately indexing into the data, and copying the data. Due to the usage of the warp shuffle instructions, we must compile this for a cc3.0 or higher GPU. (However, it would not be difficult to replace the warp shuffle instructions with other indexing code that would allow operation on cc2.0 or greater devices.) Also, due to the various intrinsics used, this function cannot work for more than 32 data items, but that was a prerequisite condition stated in your question.
$ cat t475.cu
#include <stdio.h>
#include <stdlib.h>
#define DSIZE 8
#define cudaCheckErrors(msg) \
    do { \
        cudaError_t __err = cudaGetLastError(); \
        if (__err != cudaSuccess) { \
            fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \
                    msg, cudaGetErrorString(__err), \
                    __FILE__, __LINE__); \
            fprintf(stderr, "*** FAILED - ABORTING\n"); \
            exit(1); \
        } \
    } while (0)

// Steps 1-4: distance from this lane to the nearest higher lane holding a
// non-zero shift (0 if this lane's own shift is non-zero).
// __ballot, __shfl and __shfl_down are the legacy (pre-CUDA 9) warp intrinsics.
__device__ int mydelta(const int shift){
    unsigned nz = __ballot(shift == 0);   // step 1: one bit per lane with a zero shift
    unsigned mylane = (threadIdx.x & 31);
    unsigned lanebit = 1<<mylane;
    unsigned temp = nz + lanebit;         // step 2: carry ripples through the run of ones
    temp = nz ^ temp;                     // step 3: isolate the bits changed by the carry
    unsigned delta = __popc(temp);        // step 4: count them
    return delta-1;
}

__global__ void mykernel(const int *data, const unsigned *shift, int *result, const int limit){ // limit <= 32
    if (threadIdx.x < limit){
        unsigned lshift = shift[(limit - 1) - threadIdx.x];     // load the shift pattern reversed
        unsigned delta = mydelta(lshift);
        unsigned myshift = __shfl_down(lshift, delta);          // step 5
        myshift = __shfl(myshift, ((limit -1) - threadIdx.x));  // reverse offset pattern
        result[threadIdx.x] = 0;
        if ((myshift + threadIdx.x) < limit)
            result[threadIdx.x + myshift] = data[threadIdx.x];
    }
}

int main(){
    int A[DSIZE] = {3, 6, 7, 8, 1, 2, 3, 5};
    unsigned tc1B[DSIZE] = {0, 1, 0, 0, 0, 0, 0, 0};
    unsigned tc2B[DSIZE] = {0, 1, 0, 0, 2, 0, 0, 0};
    unsigned tc3B[DSIZE] = {0, 1, 0, 0, 1, 0, 0, 0};
    int *d_data, *d_result, *h_result;
    unsigned *d_shift;
    h_result = (int *)malloc(DSIZE*sizeof(int));
    if (h_result == NULL) { printf("malloc fail\n"); return 1;}
    cudaMalloc(&d_data, DSIZE*sizeof(int));
    cudaMalloc(&d_shift, DSIZE*sizeof(unsigned));
    cudaMalloc(&d_result, DSIZE*sizeof(int));
    cudaCheckErrors("cudaMalloc fail");
    cudaMemcpy(d_data, A, DSIZE*sizeof(int), cudaMemcpyHostToDevice);

    // test case 1
    cudaMemcpy(d_shift, tc1B, DSIZE*sizeof(unsigned), cudaMemcpyHostToDevice);
    cudaCheckErrors("cudaMempcyH2D fail");
    mykernel<<<1,32>>>(d_data, d_shift, d_result, DSIZE);
    cudaDeviceSynchronize();
    cudaCheckErrors("kernel fail");
    cudaMemcpy(h_result, d_result, DSIZE*sizeof(int), cudaMemcpyDeviceToHost);
    cudaCheckErrors("cudaMempcyD2H fail");
    printf("index: ");
    for (int i = 0; i < DSIZE; i++)
        printf("%d, ", i);
    printf("\nA: ");
    for (int i = 0; i < DSIZE; i++)
        printf("%d, ", A[i]);
    printf("\ntc1 B: ");
    for (int i = 0; i < DSIZE; i++)
        printf("%d, ", tc1B[i]);
    printf("\ntc1 C: ");
    for (int i = 0; i < DSIZE; i++)
        printf("%d, ", h_result[i]);

    // test case 2
    cudaMemcpy(d_shift, tc2B, DSIZE*sizeof(unsigned), cudaMemcpyHostToDevice);
    cudaCheckErrors("cudaMempcyH2D fail");
    mykernel<<<1,32>>>(d_data, d_shift, d_result, DSIZE);
    cudaDeviceSynchronize();
    cudaCheckErrors("kernel fail");
    cudaMemcpy(h_result, d_result, DSIZE*sizeof(int), cudaMemcpyDeviceToHost);
    cudaCheckErrors("cudaMempcyD2H fail");
    printf("\ntc2 B: ");
    for (int i = 0; i < DSIZE; i++)
        printf("%d, ", tc2B[i]);
    printf("\ntc2 C: ");
    for (int i = 0; i < DSIZE; i++)
        printf("%d, ", h_result[i]);

    // test case 3
    cudaMemcpy(d_shift, tc3B, DSIZE*sizeof(unsigned), cudaMemcpyHostToDevice);
    cudaCheckErrors("cudaMempcyH2D fail");
    mykernel<<<1,32>>>(d_data, d_shift, d_result, DSIZE);
    cudaDeviceSynchronize();
    cudaCheckErrors("kernel fail");
    cudaMemcpy(h_result, d_result, DSIZE*sizeof(int), cudaMemcpyDeviceToHost);
    cudaCheckErrors("cudaMempcyD2H fail");
    printf("\ntc3 B: ");
    for (int i = 0; i < DSIZE; i++)
        printf("%d, ", tc3B[i]);
    printf("\ntc3 C: ");
    for (int i = 0; i < DSIZE; i++)
        printf("%d, ", h_result[i]);
    printf("\n");
    return 0;
}
$ nvcc -arch=sm_35 -o t475 t475.cu
$ ./t475
index: 0, 1, 2, 3, 4, 5, 6, 7,
A: 3, 6, 7, 8, 1, 2, 3, 5,
tc1 B: 0, 1, 0, 0, 0, 0, 0, 0,
tc1 C: 3, 0, 6, 7, 8, 1, 2, 3,
tc2 B: 0, 1, 0, 0, 2, 0, 0, 0,
tc2 C: 3, 0, 6, 7, 8, 0, 1, 2,
tc3 B: 0, 1, 0, 0, 1, 0, 0, 0,
tc3 C: 3, 0, 6, 7, 8, 1, 2, 3,
$
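For checking the bit manipulation without a GPU, the five steps described earlier can be emulated on the host. The following is only a sketch in plain Python, not a replacement for the kernel: offset_pattern and apply_shift are names made up for illustration, the 32-bit mask stands in for unsigned arithmetic, and a "shuffle" that would read past the last lane is assumed to yield 0 (a choice made for this sketch).

```python
def offset_pattern(shift):
    """Emulate the ballot/popcount trick for up to 32 lanes."""
    n = len(shift)
    assert n <= 32
    # Lane i holds shift[n-1-i], mirroring the reversed load in the kernel.
    lane_shift = shift[::-1]
    # Step 1: "ballot" with bit i set when lane i holds a zero shift.
    nz = sum(1 << i for i, s in enumerate(lane_shift) if s == 0)
    offsets = []
    for lane in range(n):
        # Steps 2-4: add the lane bit, XOR with the ballot, popcount, minus one.
        temp = (nz + (1 << lane)) & 0xFFFFFFFF
        delta = bin(nz ^ temp).count("1") - 1
        # Step 5: "shuffle down", i.e. read the shift held delta lanes higher.
        src = lane + delta
        offsets.append(lane_shift[src] if src < n else 0)
    # Undo the reversal so offsets line up with the original indices.
    return offsets[::-1]

def apply_shift(data, shift, fill=0):
    """Move each element right by its offset; out-of-range elements are dropped."""
    result = [fill] * len(data)
    for i, (value, off) in enumerate(zip(data, offset_pattern(shift))):
        if i + off < len(data):
            result[i + off] = value
    return result

A = [3, 6, 7, 8, 1, 2, 3, 5]
print(offset_pattern([0, 1, 0, 0, 2, 0, 0, 0]))  # [0, 1, 1, 1, 2, 2, 2, 2]
print(apply_shift(A, [0, 1, 0, 0, 2, 0, 0, 0]))  # [3, 0, 6, 7, 8, 0, 1, 2]
```

The printed lists match the desired offset pattern and the second test case from the question.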

How to set width of GDB memory examine (x) or print (p) commands?

I am trying to get a formatted print of a long 2D array of floats, of width 8. When using the x command, I get the array printed as a four-column table:
(gdb) x/16f 0x81000000
0x81000000: 0 0 1 0
0x81000010: 2 0 3 0
0x81000020: 4 0 5 0
0x81000030: 6 0 7 0
When using the p command, I get an unformatted output, the width of the terminal:
(gdb) p/f *(0x81000000)@16
$27 = {0, 0, 1, 0, 2, 0, 3, 0, 4, 0, 5, 0, 6, 0, 7, 0}
Required output, something like:
(gdb) x/16f 0x81000000
0x81000000: 0 0 1 0 2 0 3 0
0x81000020: 4 0 5 0 6 0 7 0
or:
(gdb) p/f *(0x81000000)@16
$27 = {0, 0, 1, 0, 2, 0, 3, 0,
4, 0, 5, 0, 6, 0, 7, 0}
Is there a simple way to format the output for a specific width?
Use Python scripting.
I think this is pretty close, if pretty obscure (written with print() so it also works when GDB embeds Python 3):
python print("\n".join(", ".join(gdb.execute('x/f 0x%x' % a, False, True).split()[-1] for a in range(s, s+32, 4)) for s in range(0x81000000, 0x81000040, 32)))
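A somewhat more readable variant is to separate the row formatting from the memory access. The following is a minimal sketch in plain Python rather than a GDB feature: format_rows is a hypothetical helper name, and it assumes the memory holds little-endian 4-byte floats.

```python
import struct

def format_rows(raw, base, per_row=8, fmt="f"):
    """Format packed values as rows of per_row columns, each prefixed by its address."""
    size = struct.calcsize("<" + fmt)
    count = len(raw) // size
    # Assumption: little-endian layout, hence the "<" prefix.
    values = struct.unpack("<%d%s" % (count, fmt), raw[:count * size])
    lines = []
    for start in range(0, count, per_row):
        cells = " ".join("%g" % v for v in values[start:start + per_row])
        lines.append("0x%x: %s" % (base + start * size, cells))
    return "\n".join(lines)

# Stand-in for target memory: the 16 floats from the question.
raw = struct.pack("<16f", 0, 0, 1, 0, 2, 0, 3, 0, 4, 0, 5, 0, 6, 0, 7, 0)
print(format_rows(raw, 0x81000000))
```

Inside GDB's Python interpreter, the bytes could presumably come from gdb.selected_inferior().read_memory(0x81000000, 64) instead of struct.pack; that part is not exercised by this sketch.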