Need information about using Inline Assembly for WinCE, ARM9 - mfc

I am not very experienced with inline assembly, but I plan to use it for optimization in an embedded project. Since I don't know much about it, I need some help.
I am using Windows CE 6.0 on an ARM9, with MS Visual Studio 2005 (using MFC).
Basically, I want to make memory access faster and do some bitwise operations.
It would be really helpful if I could get an online link or some examples of using registers, variable names, and pointers (memory transfers and bitwise operations) for my particular environment.
EDIT after ctacke's answer:
It would be really helpful if there were a link or a small example for working with .s files, specifically writing and exporting functions from a .s file, and the steps involved in combining them with my MFC application. Any small example would do.

The ARM compilers that ship with Visual Studio (all versions) do not support inline ASM - only the x86 compilers support inline ASM. To use ASM for ARM (or SH or MIPS as well) you have to create a separate code file (typically a .s file), export functions from your ASM and call those.
Here's a simple example (taken from here):
; Code must live in an AREA (the section name asm_func is arbitrary)
    AREA asm_func, CODE, READONLY
; Export my_asm function location so that C compiler can find it and link
    EXPORT my_asm
my_asm
; ARM Assembly language function to set LED1 bit to a value passed from C
; LED1 gets value (passed from C compiler in R0)
; LED1 is on GPIO port 1 bit 18
; See Chapter 9 in the LPC1768 User Manual
; for all of the GPIO register info and addresses
; Pinnames.h has the mbed modules pin port and bit connections
; Load GPIO Port 1 base address in register R1
    LDR R1, =0x2009C020 ; 0x2009C020 = GPIO port 1 base address
; Move bit mask in register R2 for bit 18 only
; (MOV.W is the Thumb-2 wide encoding; on an ARM9 use plain MOV)
    MOV.W R2, #0x040000 ; 0x040000 = 1<<18 all "0"s with a "1" in bit 18
; value passed from C compiler code is in R0 - compare to a "0"
    CMP R0, #0 ; value == 0 ?
; (If-Then-Else) on next two instructions using equal cond from the zero flag
; STORE if EQ - clear led 1 port bit using GPIO FIOCLR register and mask
    STREQ R2, [R1,#0x1C] ; if==0, clear LED1 bit
; STORE if NE - set led 1 port bit using GPIO FIOSET register and mask
    STRNE R2, [R1,#0x18] ; if==1, set LED1 bit
; Return to C using link register (Branch indirect using LR - a return)
    BX LR
    END
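For comparison, here is a minimal C sketch of the same logic as the listing above. It is often worth checking what the compiler generates for plain C before moving a routine into a .s file. The function name set_led1 and its signature are hypothetical: the register block is passed in as a pointer so the routine can be tried against an ordinary array on a desktop machine. The offsets and the bit mask are the ones from the listing (an LPC1768, not the asker's ARM9 board).

```c
#include <stdint.h>

/* Offsets and mask taken from the assembly listing above. */
#define FIOSET_OFFSET 0x18u        /* writing 1 bits here sets port pins   */
#define FIOCLR_OFFSET 0x1Cu        /* writing 1 bits here clears port pins */
#define LED1_MASK     (1u << 18)   /* LED1 is bit 18 of the port           */

/* gpio_base is a parameter (hypothetical signature) so the routine can be
   exercised on a host; on the LPC1768 from the listing it would be
   (volatile uint8_t *)0x2009C020. */
void set_led1(volatile uint8_t *gpio_base, int value)
{
    uint32_t offset = (value == 0) ? FIOCLR_OFFSET : FIOSET_OFFSET;
    volatile uint32_t *reg = (volatile uint32_t *)(gpio_base + offset);

    *reg = LED1_MASK;   /* one store, like STREQ / STRNE in the listing */
}
```

To call the assembly version from an MFC (C++) source file instead, the exported routine would need a declaration with C linkage, along the lines of extern "C" void my_asm(int value); (the exact signature is an assumption based on the comments in the listing).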


AWS Neoverse N1 r3p1: MOVSNE pc, lr = Illegal instruction unconditionally

On a Neoverse N1 r3p1 in an AWS t4g.nano instance, in 32-bit user mode, each of the following code sequences results in an illegal instruction exception:
teq pc, pc
movsne pc, lr
teq pc, pc
ldmfdne sp!, {pc}
However, from reading the current Arm Architecture Reference Manual, it should work. First, teq pc, pc should set the Z flag in 32-bit modes (and clear Z in 26-bit modes if one of the PSR bits is set), so the NE condition fails. Second, the pseudo-code for MOVS states:
if ConditionPassed() then
    (shifted, carry) = Shift_C(R[m], shift_t, shift_n, PSTATE.C);
    result = shifted;
    if d == 15 then
        if setflags then
            ALUExceptionReturn(result);
        // else branch snipped
ALUExceptionReturn(result) is CONSTRAINED UNPREDICTABLE, but control flow shouldn't reach there.
Is my understanding wrong or is the CPU broken?
Is it safe to replace the offending instruction in an exception handler (a Linux SIGILL handler) without stopping other threads?
A complete test program for Linux:
.syntax unified
.global _start
_start:
teq pc, pc
movsne pc, lr
mov r7, #1
svc 0
The intent is to return while restoring the PSR in 26-bit modes, in order to comply with an old ABI on sufficiently old hardware.

how do I load my kernel from my bootloader?

I am trying to make an operating system. I just finished the bootloader; however, I am having a problem loading my kernel.
section .boot
bits 16
global boot
boot:
    mov ax, 0x2401
    int 0x15
    mov ax, 0x3
    int 0x10
    mov [disk],dl
    mov ah, 0x2 ;read sectors
    mov al, 6 ;sectors to read
    mov ch, 0 ;cylinder idx
    mov dh, 0 ;head idx
    mov cl, 2 ;sector idx
    mov dl, [disk] ;disk idx
    mov bx, copy_target;target pointer
    int 0x13
    cli
    lgdt [gdt_pointer]
    mov eax, cr0
    or eax,0x1
    mov cr0, eax
    mov ax, DATA_SEG
    mov ds, ax
    mov es, ax
    mov fs, ax
    mov gs, ax
    mov ss, ax
    jmp CODE_SEG:boot2
gdt_start:
    dq 0x0
gdt_code:
    dw 0xFFFF
    dw 0x0
    db 0x0
    db 10011010b
    db 11001111b
    db 0x0
gdt_data:
    dw 0xFFFF
    dw 0x0
    db 0x0
    db 10010010b
    db 11001111b
    db 0x0
gdt_end:
gdt_pointer:
    dw gdt_end - gdt_start
    dd gdt_start
disk:
    db 0x0
CODE_SEG equ gdt_code - gdt_start
DATA_SEG equ gdt_data - gdt_start
times 510 - ($-$$) db 0
dw 0xaa55
copy_target:
bits 32
hello: db "Hello more than 512 bytes world!!",0
boot2:
    mov esi,hello
    mov ebx,0xb8000
.loop:
    lodsb
    or al,al
    jz halt
    or eax,0x0F00
    mov word [ebx], ax
    add ebx,2
    jmp .loop
halt:
    mov esp,kernel_stack_top
    extern kzos
    call kzos
    cli
    hlt
section .bss
align 4
kernel_stack_bottom: equ $
    resb 16384 ; 16 KB
kernel_stack_top:
extern "C" void kzos()
{
    const short color = 0x0F00;
    const char* hello = "Kernel test!";
    short* vga = (short*)0xb8000;
    for (int i = 0; i<16;++i)
        vga[i+80] = color | hello[i];
}
The error I am getting is at "extern kzos"; it reads "boot.asm:77: error: binary output format does not support external references".
I tried adding "extern kzos.cpp", but then I get "boot.asm:77: error: symbol `kmain' undefined".
If I add ".cpp" to the call instruction, I get "boot4.asm:77: error: binary output format does not support external references".
I am using NASM to assemble to a flat binary and QEMU to run it.
I just finished the bootloader, ...
No you didn't. At least half of a boot loader's code exists to handle errors that "shouldn't" happen (things like checking if the BIOS failed to enable A20, or if the BIOS says there was a problem reading data from disk) and handling those cases somehow - at a minimum, by providing the user information that can help them fix the problem (and determine if it's faulty hardware or a problem with the way the OS was installed or ...), so the user isn't stuck wondering why their entire computer is unusable (and so that you're not stuck with a bug report saying "doesn't boot" with no useful information either).
how do I load my kernel from my bootloader?
Your choices for finding where the kernel is (its location on the disk and its size) are:
a) Put the kernel in some kind of normal file system, and have the boot loader support that file system (e.g. find the right directory entry and get location of file's data from the file's directory entry). Note that this is complicated (e.g. you'll have a lot of error handling, in case the file system's structures are corrupted, etc)
b) Put the kernel in some kind of purpose designed file system or in some kind of purpose designed "special/reserved area" of a normal file system. This could be as simple as a table of "offset and length" structures stored in the first sector, where kernel's file is always the first entry in that table.
c) Include the kernel's binary directly into the boot loader using something like the incbin directive in NASM. In this case you can use labels to determine the size of the kernel's file, like:
kernel_start:
incbin 'kernel.bin'
kernel_end:
In this case you can determine where the kernel is on disk from where the boot loader was on disk, and calculate how many sectors it is (e.g. (kernel_end - kernel_start + SECTOR_SIZE-1)/SECTOR_SIZE). Of course this is horribly inflexible (e.g. you can't easily update the kernel without assembling and then reinstalling the boot loader).
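The sector arithmetic above, together with the "offset and length" table from option (b), can be sketched in C as follows. This is only an illustration under stated assumptions: the struct boot_entry layout, the function names, a 512-byte sector, and a table that starts at byte 0 of the sector in the host's byte order are all made up for the example.

```c
#include <stdint.h>
#include <string.h>

#define SECTOR_SIZE 512u

/* Hypothetical "offset and length" table entry, as in option (b). */
struct boot_entry {
    uint32_t offset;   /* byte offset of the file from the start of the disk */
    uint32_t length;   /* file size in bytes */
};

/* Round a byte count up to whole sectors:
   (bytes + SECTOR_SIZE - 1) / SECTOR_SIZE */
uint32_t sectors_needed(uint32_t bytes)
{
    return (bytes + SECTOR_SIZE - 1u) / SECTOR_SIZE;
}

/* Copy entry `index` out of a raw sector buffer. memcpy avoids any
   alignment assumptions about the buffer. */
struct boot_entry read_entry(const uint8_t *sector, unsigned index)
{
    struct boot_entry e;
    memcpy(&e, sector + index * sizeof e, sizeof e);
    return e;
}
```

With these, a 3000-byte kernel needs sectors_needed(3000) sectors, and the kernel's entry would be read_entry(sector, 0) if the kernel is always first in the table.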
Once you've determined the location and size of the kernel on disk; you need to load it into memory somewhere. Note that this can depend on what kind of executable file format you chose to use for the kernel; and could involve loading the executable file's headers and parsing them to figure out which parts of the file should go where in memory (and setting up things like ".bss" that aren't in the file).
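The output format is bin, i.e. a flat binary, which does not support relocations or external symbols.
The address of kzos in call kzos is therefore unknown when NASM writes the output, and there is no object file left for a linker to patch with the address of extern "C" void kzos() from kzos.cpp.
To call into the C++ code, assemble to an object format instead (for example nasm -f elf32) and link the result with the compiled kzos.cpp.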

STM32F767ZI External Interrupt Handling

I'm attempting to create a proper SPI slave interface for an AD7768-4 ADC. The ADC has a SPI interface, but it doesn't output the conversions via SPI. Instead, there are data outputs that are clocked out on individual GPIO pins. So I basically need to bit-bang data, and output to SPI to get a proper slave SPI interface. Please don't ask why I'm doing it this way, it was assigned to me.
The issue I'm having is with the interrupts. I'm using the STM32F767ZI processor - it runs at 216 MHz, and my ADC data MUST BE clocked out at 20MHz. I've set up my NMIs but what I'm not seeing is where the system calls or points to the interrupt handler.
I used the STMCubeMX software to assign pins and generate the setup code, and in the stm32F7xx.c file, it shows the NMI_Handler() function, but I don't see a pointer to it anywhere in the system files. I also found void HAL_GPIO_EXTI_IRQHandler() function in STM32F7xx_hal_gpio.c, which appears to check if the pin is asserted, and clears any pending bits, but it doesn't reset the interrupt flag, or check it, and again, I see no pointer to this function.
To complicate things further, I have 10 clock cycles to determine which flag is set (one of two at a time), reset it, increment a variable, and move data from the GPIO registers. I believe this is possible, but again, I'm uncertain what the system is doing as soon as the interrupt is tripped.
Does anyone have any experience in working with external interrupts on this processor that could shed some light on how this particular system handles things? Again - 10 clock cycles to do what I need to... moving data should only take me 1-2 clock cycles, leaving me 8 to handle interrupts...
We changed the DCLK speed to 5.12 MHz (20.48 MHz MCLK/4) because at 2.56 MHz we had exactly 12.5 microseconds to pipe data out and set up for the next DRDY pulse, and 80 kHz speed gives us exactly zero margin. At 5.12 MHz, I have 41 clock cycles to run the interrupt routine, which I can reduce slightly if I skip checking the second flag and just handle incoming data. But I feel I must use the DRDY flag check at least, and use the routine to enable the second interrupt otherwise I'll be constantly interrupting because DCLK on the ADC is always running. This allows me 6.12 microseconds to read in the data, and 6.25 microseconds to shuffle it out before the next DRDY pulse. I should be able to do that at 32 MHz SPI clock (slave) but will most likely do it at 50MHz. This is my current interrupt code:
void NMI_Handler(void)
count = 0;
data_pad[count] = GPIOF->IDR;
if (count == 31)
data_send = !data_send;
I am still concerned about clock cycles, and I believe I can get away with only checking the DRDY flag if I operate on the presumption that the only other EXTI flag that will trip is for the clock pin. Although I question how this will work if SYS_TICK is running in the background... I'll have to find out.
We're investigating a faster processor to handle the bit-banging, but right now, it looks like the PI3 won't be able to handle it if it's running Linux, and I'm unaware of too many faster processors that run either a very small reliable RTOS, or can be bare metal programmed in a pinch...
10 clock cycles to do what I need to... moving data should only take me 1-2 clock cycles, leaving me 8 to handle interrupts...
No way. Interrupt entry (pushing registers, fetching the vector and filling the pipeline) takes 10-12 cycles even on a Cortex-M7. Then consider a very simple interrupt handler, just moving the input data bits to a buffer and clearing the interrupt flag:
uint32_t *p;
void handler(void) {
*p++ = GPIOA->IDR;
EXTI->PR = 0x10;
}
it gets translated to something like this
ldr r0, .addr_of_idr // load &GPIOA->IDR
ldr r1, [r0] // load GPIOA->IDR
ldr r2, .addr_of_p // load &p
ldr r3, [r2] // load p
str r1, [r3] // store the value from IDR to *p
adds r3, r3, #4 // increment p
str r3, [r2] // store p
ldr r0, .addr_of_pr // load &EXTI->PR
movs r1, #0x10
str r1, [r0] // store 0x10 to EXTI->PR
bx lr
.addr_of_p:
.word p
.addr_of_idr:
.word 0x40020010
.addr_of_pr:
.word 0x40013C14
So it's 11 instructions, each taking at least one cycle, after interrupt entry. That's assuming the code, vector table, and the stack are all in the fastest RAM region. I'm not sure whether literal pools work in ITCM at all, using immediate literals would add 3 more cycles. Forget it.
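To put numbers on that, here is a rough C sketch of the cycle budget. The function names are made up for illustration, and it uses the most optimistic figures from above: 10 cycles of interrupt entry and one cycle per instruction, with interrupt exit ignored.

```c
#include <stdint.h>

/* Whole CPU cycles available between two edges of the data clock. */
uint32_t cycles_per_bit(uint32_t cpu_hz, uint32_t dclk_hz)
{
    return cpu_hz / dclk_hz;
}

/* Lower bound for one interrupt: entry latency plus one cycle per
   instruction (exit latency and wait states are ignored). */
uint32_t min_isr_cycles(uint32_t entry_cycles, uint32_t instructions)
{
    return entry_cycles + instructions;
}

/* Returns 1 if the handler could fit into one bit period, 0 otherwise. */
int isr_fits(uint32_t cpu_hz, uint32_t dclk_hz,
             uint32_t entry_cycles, uint32_t instructions)
{
    return min_isr_cycles(entry_cycles, instructions)
           <= cycles_per_bit(cpu_hz, dclk_hz);
}
```

At 216 MHz and a 20 MHz DCLK, cycles_per_bit gives 10, while the 11-instruction handler needs at least 21 cycles, so it cannot fit even before exit latency is counted.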
This has to be solved with hardware.
The controller has 6 SPI interfaces, pick 4 of them. Connect DRDY to all four NSS pins, DCLK to all SCK pins, and each DOUT pin to one MISO pin. Now each SPI interface handles a single channel, and can collect up to 32 bits in its internal FIFO.
Then I'd set an interrupt on a rising edge on one of the NSS pins (EXTI still works even if the pin is in alternate function mode), and read all data at once.
It turns out that the STM32 SPI requires an inordinate amount of delay between NSS falling and SCK rising, which the AD7768 does not provide, so it will not work.
Sigma-Delta interface
The STM32F767 has a DFSDM peripheral, designed to receive data from external ADCs. It can receive up to 8 channels of serial data with 20 MHz, and it can even do some preprocessing that your application might need.
The problem is that the DFSDM has no DRDY input, so I don't know exactly how the data transfer could be synchronized. It might work by asserting the START# signal to reset the communication.
If that doesn't work, then you can try starting the DFSDM channels using a timer and DMA. Connect DRDY to the external trigger of TIM1 or TIM8 (other timers won't work, because they are connected to the slower APB1 bus and the other DMA controller), start it on the rising edge of ETR, and let it generate a DMA request after ~20 ns. Then let the DMA write the value needed to start the channel to the DFSDM channel configuration register. Repeat for the other three channels.
There's a startup file generated before compiling, startup_stm32f767xx.s, which contains all the pointers to the handler functions.
Under the label g_pfnVectors: is .word NMI_Handler, pointing to the function that handles the non-maskable interrupt, and two other pointers, .word EXTI0_IRQHandler and .word EXTI1_IRQHandler, as vectors to the external interrupt handlers. Further down in the same file are the following assembler directives:
.weak NMI_Handler
.thumb_set NMI_Handler,Default_Handler
.weak EXTI0_IRQHandler
.thumb_set EXTI0_IRQHandler,Default_Handler
.weak EXTI1_IRQHandler
.thumb_set EXTI1_IRQHandler,Default_Handler
This was the info I was looking for to be able to control my interrupts with more precision and fewer clock cycles.
I read the AD7768 datasheet more carefully and found that it can send the data of all four channels on one DOUT pin. So I am talking about the serial audio interface (SAI) again.
If you can lower the DCLK frequency to 2.5 MHz, the sample rate drops by a ratio of 1:8 (2.5 MHz to 20 MHz) relative to the sample rate at the full ADC clock.
If you route all 4 channels to the single output DOUT0, the sample rate drops only by a ratio of 1:4.
AD7768-4 DS
page 53
On the AD7768, the interface can be configured to output conversion
data on one, two, or eight of the DOUTx pins. The DOUTx configuration
for the AD7768 is selected using the FORMATx pins (see Table 33).
page 66 table 34: (for AD7768-4)
page 67 figure 98:
FORMAT0 = 1 All channels output on the DOUT0 pin, in TDM output. Only DOUT0 is in use.
You can use SAI with FS = DRDY and four slots, 32 bits/slot
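Once a frame of four 32-bit slots has been received, each slot still has to be split up. The following is a minimal sketch, assuming each slot carries an 8-bit status header in the top byte followed by a 24-bit two's-complement sample; check that layout against the datasheet before relying on it. The function names are made up for the example.

```c
#include <stdint.h>

/* Assumed slot layout: bits 31..24 = status header, bits 23..0 = sample. */
uint8_t slot_header(uint32_t slot)
{
    return (uint8_t)(slot >> 24);
}

/* Sign-extend the 24-bit two's-complement sample to 32 bits. */
int32_t slot_sample(uint32_t slot)
{
    uint32_t raw = slot & 0x00FFFFFFu;

    if (raw & 0x00800000u)               /* sign bit of the 24-bit value */
        return (int32_t)raw - 0x01000000;
    return (int32_t)raw;
}

/* Split one TDM frame of four slots into per-channel samples. */
void unpack_frame(const uint32_t frame[4], int32_t samples[4])
{
    for (int ch = 0; ch < 4; ++ch)
        samples[ch] = slot_sample(frame[ch]);
}
```

Doing the sign extension with a subtraction rather than a shift of a signed value keeps the result well defined on any C compiler.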

ORG alternative for C++

In assembly we use the org instruction to set the location counter to a specific location in the memory. This is particularly helpful in making Operating Systems. Here's an example boot loader (From wikibooks):
org 7C00h
jmp short Start ;Jump over the data (the 'short' keyword makes the jmp instruction smaller)
Msg: db "Hello World! "
EndMsg:
Start: mov bx, 000Fh ;Page 0, colour attribute 15 (white) for the int 10 calls below
mov cx, 1 ;We will want to write 1 character
xor dx, dx ;Start at top left corner
mov ds, dx ;Ensure ds = 0 (to let us load the message)
cld ;Ensure direction flag is cleared (for LODSB)
Print: mov si, Msg ;Loads the address of the first byte of the message, 7C02h in this case
;PC BIOS Interrupt 10 Subfunction 2 - Set cursor position
;AH = 2
Char: mov ah, 2 ;BH = page, DH = row, DL = column
int 10h
lodsb ;Load a byte of the message into AL.
;Remember that DS is 0 and SI holds the
;offset of one of the bytes of the message.
;PC BIOS Interrupt 10 Subfunction 9 - Write character and colour
;AH = 9
mov ah, 9 ;BH = page, AL = character, BL = attribute, CX = character count
int 10h
inc dl ;Advance cursor
cmp dl, 80 ;Wrap around edge of screen if necessary
jne Skip
xor dl, dl
inc dh
cmp dh, 25 ;Wrap around bottom of screen if necessary
jne Skip
xor dh, dh
Skip: cmp si, EndMsg ;If we're not at end of message,
jne Char ;continue loading characters
jmp Print ;otherwise restart from the beginning of the message
times 0200h - 2 - ($ - $$) db 0 ;Zerofill up to 510 bytes
dw 0AA55h ;Boot Sector signature
;To zerofill up to the size of a standard 1.44MB, 3.5" floppy disk
;times 1474560 - ($ - $$) db 0
Is it possible to accomplish this task with C++? Is there any command, function, etc. like org with which I can change the location of the program?
No it's not possible to do in any C compiler that I know of. You can however create your own linker script that places the code/data/bss segments at specific addresses.
Just for clarity, the org directive does not load the code at the specified address; it merely informs the assembler that the code will be loaded at that address. The code shown appears to be for Nasm (or similar). In AT&T syntax, the .org directive does something different: it pads the code to that address, similar to the times line in the Nasm code. Nasm can do this because in -f bin mode it "acts as its own linker".
The important thing for the code to know is the address where Msg can be found. The jmps and jnes (and call and ret which your example doesn't have, but a compiler may generate) are relative addressing mode. We code jmp target but the bytes that are actually emitted say jmp distance_to_target (plus or minus) so the address doesn't matter.
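As a small illustration of why the load address doesn't matter for relative jumps, here is a C sketch that computes the displacement byte of a 2-byte jmp short (opcode EB followed by a signed 8-bit distance measured from the end of the instruction). The function name is made up for the example.

```c
#include <stdint.h>

/* Displacement byte of a 2-byte `jmp short`: the distance from the end of
   the instruction (jmp_addr + 2) to the target. The caller must keep the
   distance within -128..127. */
int8_t jmp_short_disp(uint16_t jmp_addr, uint16_t target_addr)
{
    int distance = (int)target_addr - ((int)jmp_addr + 2);
    return (int8_t)distance;
}
```

In the boot sector above, the jmp is at 7C00h and Start follows the 13-byte message at 7C0Fh, so the displacement is 13. Assembled with org 0 the addresses would be 0000h and 000Fh and the displacement would still be 13. Only the absolute reference mov si, Msg changes.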
Gas doesn't do this, it emits a linkable object file. To use ld without a linker script the command line looks something like:
ld -o boot.bin boot.o --oformat binary -Ttext 0x7C00
(don't quote me on that exact syntax but "something like that") If you can get a linkable object file from your (16-bit capable!) C++ compiler, you might be able to do the same.
In the case of a bootsector, the code is loaded by the BIOS (or fake BIOS) at 0x7C00 - one of the few things we can assume about the bootsector. The sane thing for a bootsector to do is not fiddle-faddle around printing a message, but to load something else. You'll need to know how to find the something else on the disk and where you want to load it to (perhaps where your C++ compiler wants to put it by default) - and jmp there. This jmp will want to be a far jmp, which does need to know the address.
I'm guessing it's going to be some butt-ugly C++!

How to print register values in GDB?

How do I print the value of %eax and %ebp?
(gdb) p $eax
$1 = void
info registers shows all the registers; info registers eax shows just the register eax. The command can be abbreviated as i r
If you're trying to print a specific register in GDB, you have to omit the % sign. For example,
info registers eip
If your executable is 64 bit, the registers start with r. Starting them with e is not valid.
info registers rip
Those can be abbreviated to:
i r rip
There is also:
info all-registers
Then you can get the register name you are interested in -- very useful for finding platform-specific registers (like NEON Q... on ARM).
If you only want to check once, info registers shows the registers.
If you only want to watch one register, for example, display $esp keeps displaying the esp register at the gdb command line.
If you want to watch all registers, layout regs keeps showing the registers in TUI mode.
Gdb commands:
i r <register_name>: print a single register, e.g i r rax, i r eax
i r <register_name_1> <register_name_2> ...: print multiple registers, e.g i r rdi rsi,
i r: print all registers except floating point & vector registers (xmm, ymm, zmm).
i r a: print all registers, including floating point & vector registers (xmm, ymm, zmm).
i r f: print all FPU floating point registers (st0-7 and a few other f*)
Other register groups besides a (all) and f (float) can be found with:
maint print reggroups
xmm0 ~ xmm15 are 128 bits wide; almost every modern machine has them. They were introduced in 1999.
ymm0 ~ ymm15 are 256 bits wide; newer machines usually have them. They were introduced in 2011.
zmm0 ~ zmm31 are 512 bits wide; a normal PC probably doesn't have them (as of 2016). They were introduced in 2013 and are mainly used in servers so far.
Only one series of xmm / ymm / zmm will be shown, because they are the same registers in different modes. On my machine ymm is shown.
p $eax works as of GDB 7.7.1
Tested as of GDB 7.7.1, the command you've tried works:
set $eax = 0
p $eax
# $1 = 0
set $eax = 1
p $eax
# $2 = 1
This syntax can also be used to select between different union members e.g. for ARM floating point registers that can be either floating point or integers:
p $s0.f
p $s0.u
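The .f and .u members behave like the members of a C union over the same 32 bits. Here is a minimal sketch of that idea, assuming the usual IEEE-754 single-precision float; the union reg32 type and the function name are made up for illustration and are not GDB's own definitions.

```c
#include <stdint.h>

/* One 32-bit register, viewable as a float or as its raw bits. */
union reg32 {
    float    f;
    uint32_t u;
};

/* Raw bit pattern of a float, i.e. what the .u view of a register
   holding `value` would show. */
uint32_t float_bits(float value)
{
    union reg32 r;
    r.f = value;
    return r.u;   /* reading the other member reinterprets the same bytes */
}
```

For example, a register holding 1.0 would print as 1 through the float view and as 0x3f800000 (in hex) through the unsigned view.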
From the docs:
Any name preceded by ‘$’ can be used for a convenience variable, unless it is one of the predefined machine-specific register names.
You can refer to machine register contents, in expressions, as variables with names starting with ‘$’. The names of registers are different for each machine; use info registers to see the names used on your machine.
But I haven't had much luck with control registers so far: OSDev 2012 || 2005 feature request || alt.lang.asm 2013
ARM floating point registers
Easiest for me is:
(gdb) x/x $eax
First x stands for examine and second x is hex. You can see other formats using:
(gdb) help x
You can easily print strings with x/s $eax or return addresses with x/a $ebp+4.