How to display AVX registers as doubles with GDB? - gdb

I was trying to use AVX in a Mandelbrot program and it's not working right.
I tried debugging it, but GDB refuses to show me the floating-point values in the YMM registers. Here's a minimal example:
t.c
#include <stdio.h>
extern void loadnum(void);
extern double input[4];
extern double output[4];
int main(void)
{
/*
input[0] = 1.1;
input[1] = 2.2;
input[2] = 3.3;
input[3] = 3.14159;
*/
printf("%f %f %f %f\n",input[0],input[1],input[2],input[3]);
loadnum();
printf("%f %f %f %f\n",output[0],output[1],output[2],output[3]);
return 0;
}
l.asm
section .data
global input
global output
align 64
input dq 1.1,2.2,3.3,3.14159
output dq 0,0,0,0
section .text
global loadnum
loadnum:
vmovapd ymm0, [input]
vmovapd [output],ymm0
ret
how it's compiled
OBJECTS = t.o l.o
CFLAGS = -c -O2 -g -no-pie -mavx -Wall
t: $(OBJECTS)
gcc -g -no-pie $(OBJECTS) -o t
t.o: t.c
gcc $(CFLAGS) t.c
l.o: l.asm
nasm -felf64 -gdwarf l.asm
The output is
> 1.100000 2.200000 3.300000 3.141590
> 1.100000 2.200000 3.300000 3.141590
which shows it's loading and storing these doubles as expected, but in gdb it shows
> gdb t (followed by some boilerplate)
> Reading symbols from t...
> (gdb) b loadnum
> Breakpoint 1 at 0x4011b0: file l.asm, line 15.
> (gdb) run
> Starting program: /somedir/t
> 1.100000 2.200000 3.300000 3.141590
> Breakpoint 1, loadnum () at l.asm:15
> 15 vmovapd ymm0, [input]
> (gdb) n
> 16 vmovapd [output],ymm0
> (gdb)
then I say
> (gdb) info all-registers
and this shows up.
> ymm0 (blah blah) v4_double = {0x1, 0x2, 0x3, 0x3}
when I expected it to show
> ymm0 (blah blah) v4_double = {1.100000 2.200000 3.300000 3.141590}
None of the other fields shows anything like that either, unless you want to parse the floating-point bits yourself:
> v4_int64 = {0x3ff199999999999a, 0x400199999999999a, 0x400a666666666666, 0x400921f9f01b866e}
How can I fix this?

p $ymm0.v4_double (the print command) defaults to decimal formatting.
Use p /whatever for other formats, e.g. p /x $ymm0.v4_int64 to see the bit patterns in hex. See help p for more.
display $ymm0.v4_double works as a stand-in for layout reg: tui reg vec is buggy or broken in some GDB versions, and is always an unusable mess of mixed formats for registers as wide and numerous as ymm0-ymm15. display takes the same options as print and re-prints the expression before every prompt. (Use undisplay 1, or undisplay with no argument, to drop expressions you've set up.)
This can get cluttered in TUI mode (layout asm, or layout reg + layout next to see integer registers and disassembly) if you want to track more than a couple of registers, so you might prefer non-TUI mode: either don't use layout in the first place, or use tui disable.
(When debugging hand-written asm, I almost always want to look at disassembly, not source; but maybe for a complicated algorithm I'd sometimes want to see source with comments as a reminder of what the values should be/mean at a certain point.)

Related

How do I dump the contents of an ELF file at a specific address?

Using GDB, if I load an ELF image and give it an address, I can get GDB to dump the contents of the ELF file at that address. For example:
p *((MYSTRUCT *)0x06f8f5b0)
$1 = {
filename = 0x6f8f5e0 <bla> "this is a string", format = 0x6f8f640 <pvt> "This is another string!\n", lineNumber = 148, argumentCount = 0 '\000', printLevel = 1 '\001'}
This works because GDB has loaded the ELF image and parsed its relocation tables, and it's also aware of the layout of MYSTRUCT.
How do I do the same thing without GDB? I actually don't really care about parsing MYSTRUCT. I just want a dump of 20 bytes at location 0x06f8f5b0. I've tried playing with readelf and objdump, but I couldn't get what I wanted.
Python code (e.g. using pyelftools) would also be acceptable.
I just want a dump of 20 bytes at location 0x06f8f5b0.
Your question only makes sense for a position-dependent (i.e. ET_EXEC) binary; any other binary can be loaded at an arbitrary address.
For a position-dependent binary, the answer is pretty easy:
iterate over the program headers until you find the one that "covers" the desired address,
compute the offset into the file from .p_vaddr and .p_offset,
use lseek and read to read the bytes of interest.
To make this more concrete, here is an example:
// main.c
#include <stdio.h>
const char foo[] = "This is the song that never ends.";
int main() { printf("&foo = %p\n", &foo[0]); return 0; }
gcc -w -no-pie main.c
./a.out ; ./a.out
&foo = 0x402020
&foo = 0x402020
readelf -Wl a.out | grep LOAD
Type Offset VirtAddr PhysAddr FileSiz MemSiz Flg Align
LOAD 0x000000 0x0000000000400000 0x0000000000400000 0x000438 0x000438 R 0x1000
LOAD 0x001000 0x0000000000401000 0x0000000000401000 0x0001bd 0x0001bd R E 0x1000
LOAD 0x002000 0x0000000000402000 0x0000000000402000 0x000190 0x000190 R 0x1000
LOAD 0x002e10 0x0000000000403e10 0x0000000000403e10 0x000220 0x000228 RW 0x1000
Here we see that the address we care about is 0x20 bytes into the 3rd LOAD segment, which starts at offset 0x002000 into the file.
Therefore the bytes we are interested in are at offset 0x2020 into the file.
Let's check:
dd if=a.out bs=1 count=15 skip=$((0x002020)) 2>/dev/null
This is the son
QED.

shellcode calls a different syscall when running alone as an individual binary vs. when running inside C++ code

I have this code that runs a shell:
BITS 64
global _start
_start:
mov rax, 59
jmp short file
c1:
pop rdi
jmp short argv
c2:
pop rsi
mov rdx, 0
syscall
file:
call c1
db '/bin/sh',0
argv:
call c2
dq arg, 0
arg:
db 'sh',0
It works when it's built in this way:
nasm -f elf64 shcode.asm
ld shcode.o -o shcode
However, when I assemble it to flat binary form with:
nasm -f bin shcode.asm
and paste it into the following C++ code:
int main(void)
{
char kod[]="\xB8\x3B\x00\x00\x00\xEB\x0B\x5F\xEB\x15\x5E\xBA\x00\x00\x00\x00\x0F\x05\xE8\xF0\xFF\xFF\xFF\x2F\x62\x69\x6E\x2F\x73\x68\x00\xE8\xE6\xFF\xFF\xFF\x34\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x73\x68\x00";
reinterpret_cast<void(*)()>(kod)();
return 0;
}
build it with clang++ texp.cpp -o texp.e -Wl,-z,execstack and execute it, no shell runs.
After running it with
strace ./texp.e
I see something like this (I stopped this process with ^C):
syscall_0xffffffffffffffda(0x7ffc23e0a297, 0x7ffc23e0a2a4, 0, 0x4a0, 0x7fe1ff3039b0, 0x7fe1ff69b960) = -1 ENOSYS (Nie zaimplementowana funkcja)
syscall_0xffffffffffffffda(0x7ffc23e0a297, 0x7ffc23e0a2a4, 0, 0x4a0, 0x7fe1ff3039b0, 0x7fe1ff69b960) = -1 ENOSYS (Nie zaimplementowana funkcja)
.
.
.
syscall_0xffffffffffffffda(0x7ffc23e0a297, 0x7ffc23e0a2a4, 0, 0x4a0, 0x7fe1ff3039b0, 0x7fe1ff69b960) = -1 ENOSYS (Nie zaimplementowana funkcja)
^Csyscall_0xffffffffffffffda(0x7ffc23e0a297, 0x7ffc23e0a2a4, 0, 0x4a0, 0x7fe1ff3039b0, 0x7fe1ff69b960strace: Process 2806 detached
<detached ...>
Nie zaimplementowana funkcja - Function not implemented
So the program (i.e. the shellcode) is probably making an invalid syscall.
In your C++ shellcode caller, strace shows your execve system call was
execve("/bin/sh", [0x34], NULL) = -1 EFAULT (Bad address)
The later syscall_0xffffffffffffffda(...) = -1 ENOSYS calls are from an infinite loop: first with RAX = -EFAULT instead of 59, and then with RAX = -ENOSYS (again not a valid call number). The loop is created by your call/pop.
Presumably you hexdumped an absolute address for arg from an unlinked .o or from a PIE executable; that's how you got 0x34 as the absolute address.
Obviously the whole approach of embedding an absolute address in your shellcode can't work if it's going to run from a randomized stack address with no relocation fixup: dq arg, 0 is not position-independent.
You need to construct at least the argv array yourself (usually with push) using pointers. You could also use a push imm32 to construct arg itself. e.g. push 'shsh' / lea rax, [rsp+2].
Or the most common trick is to take advantage of a Linux-specific "feature": you can pass argv=NULL (instead of a pointer to a NULL pointer) with xor esi,esi.
(Using mov reg,0 completely defeats the purpose of the jmp/call/pop trick for avoiding zero bytes. You might as well just use a normal RIP-relative LEA if zero bytes are allowed. But if not, you can jump forward over data then use RIP-relative LEA with a negative displacement.)

GCC produces unnecessary register pushes for simple ISR on AVR

I have a simple C++ program that produces the following assembler text when compiled with g++. The only statement is sbi, which doesn't affect any status flags. I wonder why g++ produces these useless pushes/pops of r0 and r1.
.global __vector_14
.type __vector_14, #function
__vector_14:
push r1 ;
push r0 ;
in r0,__SREG__ ; ,
push r0 ;
clr __zero_reg__ ;
/* prologue: Signal */
/* frame size = 0 */
/* stack size = 3 */
.L__stack_usage = 3
sbi 0x1e,0 ; ,
/* epilogue start */
pop r0 ;
out __SREG__,r0 ; ,
pop r0 ;
pop r1 ;
reti
.size __vector_14, .-__vector_14
Is there any way to make g++ omit these register saves automatically? I don't want to declare the ISR as ISR_NAKED in general.
Edit:
This is the corresponding C++ code (compiled with -Os or -O3):
#include <avr/interrupt.h>
struct AppFlags final {
bool expired : 1;
} __attribute__((packed));
int main() {
}
ISR(TIMER0_COMPA_vect) {
auto f = reinterpret_cast<volatile AppFlags*>(0x3e);
f->expired = true;
}
The reason is that you are using an outdated compiler. This optimization was added in GCC 8 (released in spring 2018); see the GCC 8 Release Notes:
The compiler now generates efficient interrupt service routine (ISR) prologues and epilogues. This is achieved by using the new AVR pseudo instruction __gcc_isr which is supported and resolved by the GNU assembler.
GCC pushes all used registers. Your only real recourse is the naked attribute, which emits no prologue or epilogue at all and leaves any register saving to you, or switching to assembly language.
Simple answer:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=20296
The difficulty is that the present architecture of the avr back-end does not easily permit improving this case: every instruction pattern (like "multiply two 16 bit integers" or "sign-extend a 16 bit variable to 32 bits") is presently free to assume that it may overwrite or change r0 and r1, provided it leaves the "zero_reg" with 0 after finishing its task.
Resolving this issue, IMHO, would require a major refactoring of the back-end.
This is a long-standing bug / enhancement request against the avr back-end.
GCC before version 8 doesn't optimize ISR prologues and epilogues.
GCC 8 and later emit __gcc_isr pseudo-instructions enclosing your ISR body when compiling with optimization levels such as -Os, or when -mgas-isr-prologues is supplied.
The GNU assembler (from reasonably recent binutils) understands these pseudo-instructions and scans the instructions between __gcc_isr 1 and __gcc_isr 2 to decide which of r0 (temporary register), r1 (zero register), and SREG (status register) need to be saved and restored.
Thus, for your example I get a pretty minimal objdump (compiling with GCC 11.1):
$ avr-objdump -d foo.o
[..]
00000000 <__vector_14>:
0: f0 9a sbi 0x1e, 0 ; 30
2: 18 95 reti
[..]
When I tell GCC to just emit assembly we see the pseudo-instructions:
$ avr-g++ -c -S -Os -mmcu=atmega328p foo.c -fno-exceptions
$ cat foo.s
[..]
.global __vector_14
.type __vector_14, #function
__vector_14:
__gcc_isr 1
/* prologue: Signal */
/* frame size = 0 */
/* stack size = 0...3 */
.L__stack_usage = 0 + __gcc_isr.n_pushed
sbi 0x1e,0
/* epilogue start */
__gcc_isr 2
reti
__gcc_isr 0,r0
[..]

neon float multiplication is slower than expected

I have two tabs of floats. I need to multiply elements from the first tab by corresponding elements from the second tab and store the result in a third tab.
I would like to use NEON to parallelize the float multiplications: four multiplications at a time instead of one.
I expected a significant speed-up, but achieved only about a 20% reduction in execution time. This is my code:
#include <stdlib.h>
#include <iostream>
#include <arm_neon.h>
const int n = 100; // table size
/* fill a tab with random floats */
void rand_tab(float *t) {
for (int i = 0; i < n; i++)
t[i] = (float)rand()/(float)RAND_MAX;
}
/* Multiply elements of two tabs and store results in third tab
- STANDARD processing. */
void mul_tab_standard(float *t1, float *t2, float *tr) {
for (int i = 0; i < n; i++)
tr[i] = t1[i] * t2[i];
}
/* Multiply elements of two tabs and store results in third tab
- NEON processing. */
void mul_tab_neon(float *t1, float *t2, float *tr) {
for (int i = 0; i < n; i+=4)
vst1q_f32(tr+i, vmulq_f32(vld1q_f32(t1+i), vld1q_f32(t2+i)));
}
int main() {
float t1[n], t2[n], tr[n];
/* fill tables with random values */
srand(1); rand_tab(t1); rand_tab(t2);
// I repeat table multiplication function 1000000 times for measuring purposes:
for (int k=0; k < 1000000; k++)
mul_tab_standard(t1, t2, tr); // switch to next line for comparison:
//mul_tab_neon(t1, t2, tr);
return 1;
}
I run the following command to compile:
g++ -mfpu=neon -ffast-math neon_test.cpp
My CPU: ARMv7 Processor rev 0 (v7l)
Do you have any ideas how I can achieve more significant speed-up?
Cortex-A8 and Cortex-A9 can do only two SP FP multiplications per cycle, so you may at most double the performance on those (most popular) CPUs. In practice, ARM CPUs have very low IPC, so it is preferable to unroll loops as much as possible. If you want ultimate performance, write in assembly: gcc's code generator for ARM is nowhere near as good as for x86.
I also recommend using CPU-specific optimization options: "-O3 -mcpu=cortex-a9 -march=armv7-a -mtune=cortex-a9 -mfpu=neon -mthumb" for Cortex-A9; for Cortex-A15, Cortex-A8 and Cortex-A5, replace the -mcpu= and -mtune= values with cortex-a15/a8/a5 accordingly. gcc does not have optimizations for Qualcomm CPUs, so for Qualcomm Scorpion use the Cortex-A8 parameters (and also unroll even more than you usually would), and for Qualcomm Krait try the Cortex-A15 parameters (you will need a recent version of gcc that supports it).
One shortcoming of NEON intrinsics: you can't use post-increment addressing on loads, which shows up as extra instructions in your NEON version.
Compiled with gcc 4.4.3 and options -c -std=c99 -mfpu=neon -O3, and dumped with objdump, this is the loop part of mul_tab_neon
000000a4 <mul_tab_neon>:
ac: e0805003 add r5, r0, r3
b0: e0814003 add r4, r1, r3
b4: e082c003 add ip, r2, r3
b8: e2833010 add r3, r3, #16
bc: f4650a8f vld1.32 {d16-d17}, [r5]
c0: f4642a8f vld1.32 {d18-d19}, [r4]
c4: e3530e19 cmp r3, #400 ; 0x190
c8: f3400df2 vmul.f32 q8, q8, q9
cc: f44c0a8f vst1.32 {d16-d17}, [ip]
d0: 1afffff5 bne ac <mul_tab_neon+0x8>
and this is loop part of mul_tab_standard
00000000 <mul_tab_standard>:
58: ecf01b02 vldmia r0!, {d17}
5c: ecf10b02 vldmia r1!, {d16}
60: f3410db0 vmul.f32 d16, d17, d16
64: ece20b02 vstmia r2!, {d16}
68: e1520003 cmp r2, r3
6c: 1afffff9 bne 58 <mul_tab_standard+0x58>
As you can see, in the standard case the compiler creates a much tighter loop.

efficient way to divide ignoring rest

There are two ways I found to get a whole number from a division in C++.
The question is: which way is more efficient (faster)?
First way:
Quotient = value1 / value2; // normal division, which may leave a fractional part
floor(Quotient); // round the number down to the nearest integer
Second way:
Rest = value1 % value2; // get the remainder with the modulus (%) operator
Quotient = (value1-Rest) / value2; // subtract the remainder so the division is exact
Also, please demonstrate how to find out which method is faster.
If you're dealing with integers, then the usual way is
Quotient = value1 / value2;
That's it. The result is already an integer, so there's no need for the floor(Quotient); statement. It has no effect anyway; you would want Quotient = floor(Quotient); if it were needed.
If you have floating point numbers, then the second method won't work at all, as % is only defined for integers. But what does it mean to get a whole number from a division of real numbers? What integer do you get when you divide 8.5 by 3.2? Does it ever make sense to ask this question?
As a side note, the thing you call 'Rest' is normally called the 'remainder'.
Use this program:
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#ifdef DIV_BY_DIV
#define DIV(a, b) ((a) / (b))
#else
#define DIV(a, b) (((a) - ((a) % (b))) / (b))
#endif
#ifndef ITERS
#define ITERS 1000
#endif
int main()
{
int i, a, b;
srand(time(NULL));
a = rand();
b = rand();
for (i = 0; i < ITERS; i++)
a = DIV(a, b);
return 0;
}
You can time execution
mihai@keldon:/tmp$ gcc -Wall -Wextra -DITERS=1000000 -DDIV_BY_DIV 1.c && time ./a.out
real 0m0.010s
user 0m0.012s
sys 0m0.000s
mihai@keldon:/tmp$ gcc -Wall -Wextra -DITERS=1000000 1.c && time ./a.out
real 0m0.019s
user 0m0.020s
sys 0m0.000s
Or, you look at the assembly output:
mihai@keldon:/tmp$ gcc -Wall -Wextra -DITERS=1000000 -DDIV_BY_DIV 1.c -S; mv 1.s 1_div.s
mihai@keldon:/tmp$ gcc -Wall -Wextra -DITERS=1000000 1.c -S; mv 1.s 1_modulus.s
mihai@keldon:/tmp$ diff 1_div.s 1_modulus.s
24a25,32
> movl %edx, %eax
> movl 24(%esp), %edx
> movl %edx, %ecx
> subl %eax, %ecx
> movl %ecx, %eax
> movl %eax, %edx
> sarl $31, %edx
> idivl 20(%esp)
As you see, doing only the division is faster.
Edited to fix error in code, formatting and wrong diff.
More edit (explaining the assembly diff): In the second case, when doing the modulus first, the assembly shows that two idivl operations are needed: one to get the result of % and one for the actual division. The above diff shows the subtraction and the second division, as the first one is exactly the same in both codes.
Edit: more relevant timing information:
mihai@keldon:/tmp$ gcc -Wall -Wextra -DITERS=42000000 -DDIV_BY_DIV 1.c && time ./a.out
real 0m0.384s
user 0m0.360s
sys 0m0.004s
mihai@keldon:/tmp$ gcc -Wall -Wextra -DITERS=42000000 1.c && time ./a.out
real 0m0.706s
user 0m0.696s
sys 0m0.004s
Hope it helps.
Edit: diff between assembly with -O0 and without.
mihai@keldon:/tmp$ gcc -Wall -Wextra -DITERS=1000000 1.c -S -O0; mv 1.s O0.s
mihai@keldon:/tmp$ gcc -Wall -Wextra -DITERS=1000000 1.c -S; mv 1.s noO.s
mihai@keldon:/tmp$ diff noO.s O0.s
Since the default optimization level of gcc is -O0 (see this article explaining optimization levels in gcc), the result was expected.
Edit: if you compile with -O3, as one of the comments suggested, you'll get the same assembly; at that level of optimization, both alternatives are the same.