beq $s0, $s1, Lab1
add $s2, $s0, $s1
Lab1: sub $s1, $s1, $s0
When $s0 and $s1 are not equal, line 2 (the add) will be executed. Is line 3 (the sub at Lab1) going to be executed right after line 2?
Or can line 3 be executed only when the if condition is satisfied and control is sent to Lab1?
I hope I made my question clear. Thanks in advance.
Every instruction tells the processor what instruction comes next.
Let's take a closer look at the add instruction.
It computes the sum, places that into the target register, and also, in parallel, increments the program counter by 4 — definitively telling the processor that the next instruction is the next one in address order sequence.
A nop instruction is commonly said to do nothing (it even stands for no-operation), but it does increment the pc, so technically it does do something after all.
As a mentor to assembly language students, I find it useful to emphasize the program counter.
Experienced assembly language programmers often overlook the program counter precisely because it is so fundamental to the operation of the processor. So, let's talk about it for a moment.
Every instruction tells the processor what instruction comes next: each instruction updates the program counter, and that update is the mechanism by which it tells the processor what is next. Each instruction has its own memory address; a given instruction executes because the program counter held its address. Sequential operation isn't magic: each instruction has to tell the processor what is next (i.e. has to update the pc).
Programs can also interact with the program counter: calling (jal) captures pc-next into the $ra register for the subroutine or function to use to return control of the processor to the caller.
Only slightly over-simplified, within a subroutine or function, moving the program counter backwards forms a loop (the processor goes back to re-execute something it already did), while moving it forwards skips something, as needed for if-then or if-then-else.
But each instruction has a well-defined way in which it modifies the program counter, whether that appears to be explicit or implicit.
Assembly (the operation is described in the comment):
beq $s0, $s1, Lab1 # skip 1 instruction on condition $s0 == $s1
add $s2, $s0, $s1 # not skipped if $s0 != $s1: next is pc+4
Lab1: # no machine code for this
sub $s1, $s1, $s0 # run after beq when $s0 == $s1 -or else- after add
In assembly language, the label informs the assembler how to translate an instruction that uses it. Lab1 will be associated with a location, an address, here the address of the sub instruction.
The beq is a conditional pc-relative branch. Thus, the value it wants is the delta between the pc-next (of itself) and the branch target, here Lab1. A delta of 0 would cause no instructions to be skipped, and a delta of 1 will cause 1 instruction to be skipped. Here we want to skip 1 instruction, so the delta will be 1. There is literally a 1 in the machine code for that beq.
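To make that concrete, here is a small C sketch (my own, not part of the original answer) that builds the encoding by hand, assuming the standard MIPS numbering $s0 = register 16, $s1 = register 17, and opcode 4 for beq. Note the 1 sitting in the low 16 bits:

#include <stdio.h>

int main(void)
{
    unsigned opcode = 4;  /* beq */
    unsigned rs = 16;     /* $s0 */
    unsigned rt = 17;     /* $s1 */
    unsigned delta = 1;   /* skip one instruction; the field counts instructions, not bytes */
    unsigned insn = (opcode << 26) | (rs << 21) | (rt << 16) | (delta & 0xFFFF);
    printf("beq $s0, $s1, Lab1 encodes as 0x%08X\n", insn);  /* prints 0x12110001 */
    return 0;
}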
After executing the beq the processor will have been told (based on the specified eq condition) whether it will branch or not. It does this by either adjusting the pc to pc+4 -or- to pc+4+delta*4 (the delta counts instructions, so the hardware scales it by the 4-byte instruction size).
Both of the other instructions are what we call sequential, so they inform the processor that pc-next is pc+4. Knowing that, you can follow the full sequencing whether the conditional branch is taken or not taken.
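For example, assuming the beq happens to sit at address 0x00400000 (an address made up purely for illustration), the pc sequence is:

not taken ($s0 != $s1):  0x00400000 (beq) -> 0x00400004 (add) -> 0x00400008 (sub)
taken     ($s0 == $s1):  0x00400000 (beq) -> 0x00400008 (sub at Lab1)

So, to answer the question directly: when $s0 and $s1 are not equal, the add runs and the sub runs right after it; when they are equal, the add is skipped and the sub runs immediately after the beq. Either way, the instruction at Lab1 executes.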
To elaborate on the question in the title, suppose I declared the following array in C++,
int myarr[10];
This compiles to the following x86 assembly:
myarr:
.zero 40
Now, AFAIK this .zero directive is a convention and is not an instruction. Then how exactly is this directive translated to x86 (or any other architecture, that's not the emphasis here) instructions? Because, as far as we know, the CPU can only execute instructions. So I guess these directives are somehow translated to instructions, am I correct?
I could generalize the question by also asking how .word, .long, etc. are translated into instructions, but I think the point is clear.
The output of the assembler is an object module. In the object module are representations of various sections for a program. Each section has a size, some attributes, and possibly some data to be put into the section.
For example, a section may be a few thousand bytes, have attributes indicating it contains instructions for execution, and have data that consists of those instructions. Another section might be several hundred bytes but have no data—it is just space to be allocated when the program starts. Another section might be very big and have non-zero data that contains its initial values when the program starts.
To assemble a .zero 40 directive, the assembler just includes forty bytes of zeros in the section it is currently building. When it writes the final output, it will include those zeros in that section. Data directives like this, and .word and such, simply tell the assembler what data to put into its output.
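As a quick illustration (my own example, not from the question), compile the following file and compare the sections with a tool such as objdump or size: the zero-initialized array contributes only a size to .bss (no bytes stored in the object file), the explicitly initialized one puts its 40 bytes of values into .data, and neither produces a single instruction.

int zeroed[10];                                    /* typically .bss: reserved space only */
int filled[10] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};  /* .data: these 40 bytes are stored in the object file */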
unsigned int stuff[10];
void fun ( void )
{
unsigned int r;
for(r=0;r<10;r++) stuff[r]=r;
}
using ARM...
00000000 <fun>:
0: e3a03000 mov r3, #0
4: e59f2010 ldr r2, [pc, #16] ; 1c <fun+0x1c>
8: e5a23004 str r3, [r2, #4]!
c: e2833001 add r3, r3, #1
10: e353000a cmp r3, #10
14: 1afffffb bne 8 <fun+0x8>
18: e12fff1e bx lr
1c: 00000ffc
Disassembly of section .bss:
00001000 <stuff>:
...
The array stuff is simply data; it is not code, it is not instructions, and it won't be. The directive you asked about won't become code; it can't, because it is data.
If you want to see code (instructions), then you need to write lines of high-level language that act on data, for example as shown here. In that case the compiler generates code.
Looking at this compiler's actual output (comments and other non-essentials removed):
fun:
mov r3, #0
ldr r2, .L6
.L2:
str r3, [r2, #4]!
add r3, r3, #1
cmp r3, #10
bne .L2
bx lr
.L7:
.align 2
.L6:
.word stuff-4
...
.comm stuff,40,4
The .comm in this case is how the data that represents the array in the high-level language was declared, and the rest is mostly code. The .align is there so that the address of .L6 is aligned and you don't get an alignment fault when you try to read it.
.word is a directive. What you see here is .text vs .data, even though it is just one simple C program with the array and the code right next to each other. Code can live in read-only memory like flash while data needs to be in read/write memory, and at compile time the compiler doesn't know where the data will end up relative to the code. So it generates an abstraction: it places a read-only word in the code that the linker fills in later; the code is generic and uses whatever the linker puts in there. The linker "places" .text and .bss (in this case the array wasn't initialized, so it isn't actually .data) and then makes that connection in the code.
Labels are directives, if you will, so that the programmer or code generator (compiler) doesn't have to count instructions or the overall size of instructions to make relative jumps. Let the tools do that for you.
1c: 00000ffc
Disassembly of section .bss:
00001000 <stuff>:
...
Based on the way I linked this (not actually working) program, stuff is the only data item in it, and the linker placed it where I asked, at address 0x1000, then went back and filled in that .word directive to be stuff-4, which is 0xFFC, so that the code as compiled works.
Directives are not part of the instruction set, but they are part of the assembly language. Note that assembly language is defined by the assembler (the tool), not by the instruction set/target. There are countless different x86 assembly languages, and AT&T vs Intel is not the primary difference: the directives, how you define a label, and how you indicate that a number is hex or decimal all vary. Because of the vagueness of the instructions as defined in the early docs, there are lots of adjectives, if you will, to specify which mov instruction you were actually after; even though those are part of the instruction and not a directive, they too varied across assembly languages. ARM, MIPS, and many if not most other targets have had tools created with incompatible assembly languages, .zero being one example of those incompatible things.
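For instance (from memory, so treat the exact spellings as an assumption), GNU as accepts several spellings for reserving zeroed bytes, while other vendors' assemblers use different directives entirely:

.zero  40    # what GCC emits, accepted by GNU as
.space 40    # GNU as synonym
.skip  40    # another GNU as synonym

ARM's own armasm, by contrast, spells the equivalent directive SPACE, which GNU as does not accept.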
In any case, an assembly language needs to be able to define data and then have a way for code to reference that data in order to make useful programs.
The notion of a one-to-one mapping from lines of assembly language to instructions is very misleading; don't be fooled by it. Today's compilers generate almost as much non-code as code in their output: lots of directives and other information.
This question follows this one, considering a GCC-compliant compiler and an x86-64 architecture.
I am wondering if there is any difference between option 1, option 2 and option 3 below. Would the result be the same in all contexts, or would it be different? And if so, what would be the difference?
// Option 1
asm volatile("":::"memory");
asm volatile("CPUID":"=a"(eax),"=b"(ebx),"=c"(ecx),"=d"(edx):"0"(level):);
and
// Option 2
asm volatile("CPUID":"=a"(eax),"=b"(ebx),"=c"(ecx),"=d"(edx):"0"(level):);
asm volatile("":::"memory");
and
// Option 3
asm volatile("CPUID":"=a"(eax),"=b"(ebx),"=c"(ecx),"=d"(edx):"0"(level):"memory");
Options 1 & 2 would let the CPUID itself reorder with unrelated non-volatile loads/stores (in one direction or the other). This is very likely not what you want.
You could put a memory barrier on both sides of CPUID, but it's certainly better to just make CPUID a memory barrier itself.
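(For reference, that "both sides" variant would just combine options 1 and 2:

asm volatile("":::"memory");
asm volatile("CPUID":"=a"(eax),"=b"(ebx),"=c"(ecx),"=d"(edx):"0"(level):);
asm volatile("":::"memory");

but the single "memory" clobber on the CPUID statement in option 3 expresses the same thing in one statement.)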
As Jester points out, option 1 would force a reload of level from memory if it had ever had its address passed outside of the function, or if it is already a global or static.
(Or whatever the exact criterion is that decides whether a C variable could be read or written by asm that uses a "memory" clobber. I think it's essentially the same as what the optimizer uses to decide whether a variable can be kept in a register across a call to an opaque non-inline function, so purely local variables that haven't had their address passed anywhere, and that aren't inputs to the asm statement, can still live in registers.)
For example (Godbolt compiler explorer):
void foo(int level){
int eax, ebx, ecx, edx;
asm volatile("":::"memory");
asm volatile("CPUID"
: "=a"(eax),"=b"(ebx),"=c"(ecx),"=d"(edx)
: "0"(level)
:
);
}
# x86-64 gcc7.3 -O3 -fverbose-asm
pushq %rbx # # rbx is call-preserved, but we clobber it.
movl %edi, %eax # level, eax
CPUID
popq %rbx #
ret
Notice the lack of a spill/reload of the function arg.
Normally I'd use Intel syntax, but with inline asm it's a good idea to always use AT&T unless you completely hate AT&T syntax or don't know it.
Even if level started in memory (i386 System V calling convention, with stack args), the compiler still decides that nothing else (including the asm statement with a memory clobber) could reference it. But how do we tell whether the compiler has merely delayed the load until after the barrier? Modify the function arg before the barrier, then use it after:
void modify_level(int level){
level += 1; // modify level before the barrier
int eax, ebx, ecx, edx;
asm volatile("#mem barrier here":::"memory");
asm volatile("CPUID" // then read it after
: "=a"(eax),"=b"(ebx),"=c"(ecx),"=d"(edx)
: "0"(level):);
}
The asm output from gcc -m32 -O3 -fverbose-asm is:
modify_level(int):
pushl %ebx #
#mem barrier here
movl 8(%esp), %eax # level, tmp97
addl $1, %eax #, level
CPUID
popl %ebx #
ret
Notice that the compiler let the level += 1 reorder across the memory barrier, because level is a local variable.
Godbolt filters hand-written asm comments along with compiler-generated asm comment-only lines. I disabled the comment filter and found the mem barrier. You might want to remove -fverbose-asm to get less noise. Or use a non-comment string for the mem barrier: it doesn't have to assemble if you're just looking at the compiler's asm output. (Unless you're using clang, which has the assembler built-in).
BTW, the original version of your question didn't compile: you left out the empty string as asm template. asm(:::"memory"). The output, input, and clobber sections can be empty, but the asm instruction string is not optional.
Fun fact, you can put asm comments in the string:
asm volatile("# memory barrier here":::"memory");
gcc fills in any %whatever things in the string template as it writes asm output, so you can even do stuff like "CPUID # %%0 was in %0" and see what gcc chose for your "dummy" args that are otherwise unmentioned in the asm template. (This is more interesting for dummy memory input/output operands to tell the compiler which memory you read/write instead of using a "memory" clobber, when you give the asm statement a pointer.)
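A related trick (a sketch of my own; the function and names are made up): when the asm only touches a specific buffer through a pointer, you can pass those bytes as a memory operand instead of using a blanket "memory" clobber, so only that buffer is treated as read/written:

void touch_first_64(char *buf)
{
    /* "+m" on a 64-byte view of buf: GCC must assume the asm may read and
       write buf[0..63], but unrelated variables can stay cached in registers.
       The real instructions would go where the comment-only template is. */
    asm ("# asm that reads/writes buf[0..63] goes here"
         : "+m" (*(char (*)[64]) buf)
         : "r" (buf));
}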
I want to be able to use the base pointer register (%rbp) within inline asm. A toy example of this is like so:
void Foo(int &x)
{
asm volatile ("pushq %%rbp;" // 'prologue'
"movq %%rsp, %%rbp;" // 'prologue'
"subq $12, %%rsp;" // make room
"movl $5, -12(%%rbp);" // some asm instruction
"movq %%rbp, %%rsp;" // 'epilogue'
"popq %%rbp;" // 'epilogue'
: : : );
x = 5;
}
int main()
{
int x;
Foo(x);
return 0;
}
I hoped that, since I am using the usual prologue/epilogue function-calling method of pushing and popping the old %rbp, this would be ok. However, it seg faults when I try to access x after the inline asm.
The GCC-generated assembly code (slightly stripped-down) is:
_Foo:
pushq %rbp
movq %rsp, %rbp
movq %rdi, -8(%rbp)
# INLINEASM
pushq %rbp; // prologue
movq %rsp, %rbp; // prologue
subq $12, %rsp; // make room
movl $5, -12(%rbp); // some asm instruction
movq %rbp, %rsp; // epilogue
popq %rbp; // epilogue
# /INLINEASM
movq -8(%rbp), %rax
movl $5, (%rax) // x=5;
popq %rbp
ret
main:
pushq %rbp
movq %rsp, %rbp
subq $16, %rsp
leaq -4(%rbp), %rax
movq %rax, %rdi
call _Foo
movl $0, %eax
leave
ret
Can anyone tell me why this seg faults? It seems that I somehow corrupt %rbp but I don't see how. Thanks in advance.
I'm running GCC 4.8.4 on 64-bit Ubuntu 14.04.
See the bottom of this answer for a collection of links to other inline-asm Q&As.
Your code is broken because you step on the red-zone below RSP (with push) where GCC was keeping a value.
What are you hoping to accomplish with inline asm? If you want to learn inline asm, learn to use it to make efficient code, rather than horrible stuff like this. If you want to write function prologues and push/pop to save/restore registers, you should write whole functions in asm. (Then you can easily use nasm or yasm, rather than the less-preferred-by-most AT&T syntax with GNU assembler directives (footnote 1).)
GNU inline asm is hard to use, but allows you to mix custom asm fragments into C and C++ while letting the compiler handle register allocation and any saving/restoring if necessary. Sometimes the compiler will be able to avoid the save and restore by giving you a register that's allowed to be clobbered. Without volatile, it can even hoist asm statements out of loops when the input would be the same. (i.e. unless you use volatile, the outputs are assumed to be a "pure" function of the inputs.)
If you're just trying to learn asm in the first place, GNU inline asm is a terrible choice. You have to fully understand almost everything that's going on with the asm, and understand what the compiler needs to know, to write correct input/output constraints and get everything right. Mistakes will lead to clobbering things and hard-to-debug breakage. The function-call ABI is a much simpler and easier to keep track of boundary between your code and the compiler's code.
Why this breaks
You compiled with -O0, so gcc's code spills the function parameter from %rdi to a location on the stack. (This could happen in a non-trivial function even with -O3).
Since the target ABI is the x86-64 SysV ABI, it uses the "Red Zone" (128 bytes below %rsp that even asynchronous signal handlers aren't allowed to clobber), instead of wasting an instruction decrementing the stack pointer to reserve space.
It stores the 8B pointer function arg at -8(rsp_at_function_entry). Then your inline asm pushes %rbp, which decrements %rsp by 8 and then writes there, clobbering the low 32b of &x (the pointer).
When your inline asm is done:
gcc reloads -8(%rbp) (which has been overwritten with %rbp) and uses it as the address for a 4B store.
Foo returns to main with %rbp = (upper32)|5 (orig value with the low 32 set to 5).
main runs leave: %rsp = (upper32)|5
main runs ret with %rsp = (upper32)|5, reading the return address from virtual address (void*)(upper32|5), which from your comment is 0x7fff0000000d.
I didn't check with a debugger; one of those steps might be slightly off, but the problem is definitely that you clobber the red zone, leading to gcc's code trashing the stack.
Even adding a "memory" clobber doesn't get gcc to avoid using the red zone, so it looks like allocating your own stack memory from inline asm is just a bad idea. (A memory clobber means you might have written some memory you're allowed to write to, e.g. a global variable or something pointed-to by a global, not that you might have overwritten something you're not supposed to.)
If you want to use scratch space from inline asm, you should probably declare an array as a local variable and use it as an output-only operand (which you never read from).
AFAIK, there's no syntax for declaring that you modify the red-zone, so your only options are:
use an "=m" output operand (possibly an array) for scratch space; the compiler will probably fill in that operand with an addressing mode relative to RBP or RSP. You can index into it with constants like 4 + %[tmp] or whatever. You might get an assembler warning from 4 + (%rsp) but not an error.
skip over the red-zone with add $-128, %rsp / sub $-128, %rsp around your code. (Necessary if you want to use an unknown amount of extra stack space, e.g. push in a loop, or making a function call. Yet another reason to deref a function pointer in pure C, not inline asm.)
compile with -mno-red-zone (I don't think you can enable that on a per-function basis, only per-file)
Don't use scratch space in the first place. Tell the compiler what registers you clobber and let it save them.
Here's what you should have done:
void Bar(int &x)
{
int tmp;
long tmplong;
asm ("lea -16 + %[mem1], %%rbp\n\t"
"imul $10, %%rbp, %q[reg1]\n\t" // q modifier: 64bit name.
"add %k[reg1], %k[reg1]\n\t" // k modifier: 32bit name
"movl $5, %[mem1]\n\t" // some asm instruction writing to mem
: [mem1] "=m" (tmp), [reg1] "=r" (tmplong) // tmp vars -> tmp regs / mem for use inside asm
:
: "%rbp" // tell compiler it needs to save/restore %rbp.
// gcc refuses to let you clobber %rbp with -fno-omit-frame-pointer (the default at -O0)
// clang lets you, but memory operands still use an offset from %rbp, which will crash!
// gcc memory operands still reference %rsp, so don't modify it. Declaring a clobber on %rsp does nothing
);
x = 5;
}
Note the push/pop of %rbp in the code outside the #APP / #NO_APP section, emitted by gcc. Also note that the scratch memory it gives you is in the red zone. If you compile with -O0, you'll see that it's at a different position from where it spills &x.
To get more scratch regs, it's better to just declare more output operands that are never used by the surrounding non-asm code. That leaves register allocation to the compiler, so it can be different when inlined into different places. Choosing ahead of time and declaring a clobber only makes sense if you need to use a specific register (e.g. shift count in %cl). Of course, an input constraint like "c" (count) gets gcc to put the count in rcx/ecx/cx/cl, so you don't emit a potentially redundant mov %[count], %%ecx.
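Here is a hedged sketch of that pattern (function and variable names are my own; the instructions are just filler to show the shape):

void inc_via_scratch(unsigned *p)
{
    unsigned scratch1, scratch2;          /* dummy outputs, never read by the C code */
    asm ("movl %[mem], %[s1]\n\t"         /* load *p into a compiler-chosen register */
         "movl %[s1], %[s2]\n\t"          /* a second scratch register, same idea */
         "addl $1, %[s2]\n\t"
         "movl %[s2], %[mem]"             /* store the result back */
         : [s1] "=&r" (scratch1), [s2] "=&r" (scratch2), [mem] "+m" (*p));
}

The early-clobber & on the scratch outputs matters: they are written before the last use of the [mem] operand, so they must not share a register with its addressing mode.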
If this looks too complicated, don't use inline asm. Either lead the compiler to the asm you want with C that's like the optimal asm, or write a whole function in asm.
When using inline asm, keep it as small as possible: ideally just the one or two instructions that gcc isn't emitting on its own, with input/output constraints to tell it how to get data into / out of the asm statement. This is what it's designed for.
Rule of thumb: if your GNU C inline asm starts or ends with a mov, you're usually doing it wrong and should have used a constraint instead.
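To illustrate the rule of thumb, here is a hedged pair of examples of my own (a 32-bit variable shift, where the count has to be in %cl):

unsigned shl_bad(unsigned x, int n)    /* starts with a mov: register allocation by hand */
{
    asm ("movl %1, %%ecx\n\t"
         "shll %%cl, %0"
         : "+r" (x) : "r" (n) : "ecx", "cc");
    return x;
}

unsigned shl_good(unsigned x, int n)   /* the "c" constraint puts n in ecx/cl for us */
{
    asm ("shll %%cl, %0" : "+r" (x) : "c" (n) : "cc");
    return x;
}

(Of course the truly best version is plain C, x << n; the pair is only meant to show the constraint.)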
Footnotes:
1. You can use GAS's Intel syntax in inline asm by building with -masm=intel (in which case your code will only work with that option), or by using dialect alternatives so it works whether the compiler outputs Intel or AT&T asm syntax. But that doesn't change the directives, and GAS's Intel syntax is not well documented. (It's like MASM, not NASM, though.) I don't really recommend it unless you really hate AT&T syntax.
Inline asm links:
x86 wiki. (The tag wiki also links to this question, for this collection of links)
The inline-assembly tag wiki
The manual. Read this. Note that inline asm was designed to wrap single instructions that the compiler doesn't normally emit. That's why it's worded to say things like "the instruction", not "the block of code".
A tutorial
Looping over arrays with inline assembly Using r constraints for pointers/indices and using your choice of addressing mode, vs. using m constraints to let gcc choose between incrementing pointers vs. indexing arrays.
How can I indicate that the memory *pointed* to by an inline ASM argument may be used? (pointer inputs in registers do not imply that the pointed-to memory is read and/or written, so it might not be in sync if you don't tell the compiler).
In GNU C inline asm, what're the modifiers for xmm/ymm/zmm for a single operand?. Using %q0 to get %rax vs. %w0 to get %ax. Using %g[scalar] to get %zmm0 instead of %xmm0.
Efficient 128-bit addition using carry flag Stephen Canon's answer explains a case where an early-clobber declaration is needed on a read+write operand. Also note that x86/x86-64 inline asm doesn't need to declare a "cc" clobber (the condition codes, aka flags); it's implicit. (gcc6 introduces syntax for using flag conditions as input/output operands. Before that you have to setcc a register that gcc will emit code to test, which is obviously worse.)
Questions about the performance of different implementations of strlen: my answer on a question with some badly-used inline asm, with an answer similar to this one.
llvm reports: unsupported inline asm: input with type 'void *' matching output with type 'int': Using offsetable memory operands (in x86, all effective addresses are offsettable: you can always add a displacement).
When not to use inline asm, with an example of 32b/32b => 32b division and remainder that the compiler can already do with a single div. (The code in the question is an example of how not to use inline asm: many instructions for setup and save/restore that should be left to the compiler by writing proper in/out constraints.)
MSVC inline asm vs. GNU C inline asm for wrapping a single instruction, with a correct example of inline asm for 64b/32b=>32bit division. MSVC's design and syntax require a round trip through memory for inputs and outputs, making it terrible for short functions. It's also "never very reliable" according to Ross Ridge's comment on that answer.
Using x87 floating point, and commutative operands. Not a great example, because I didn't find a way to get gcc to emit ideal code.
Some of those re-iterate some of the same stuff I explained here. I didn't re-read them to try to avoid redundancy, sorry.
In x86-64, the stack pointer needs to be aligned to 8 bytes.
This:
subq $12, %rsp; // make room
should be:
subq $16, %rsp; // make room
I'm running gcov over some C code with a switch statement. I've written test cases to cover every possible path through that switch statement, but it still reports a branch in the switch statement as not taken and less than 100% on the "Taken at least once" stat.
Here's some sample code to demonstrate:
#include "stdio.h"
void foo(int i)
{
switch(i)
{
case 1:printf("a\n");break;
case 2:printf("b\n");break;
case 3:printf("c\n");break;
default: printf("other\n");
}
}
int main()
{
int i;
for(i=0;i<4;++i)
foo(i);
return 0;
}
I built with "gcc temp.c -fprofile-arcs -ftest-coverage", ran "a", then did "gcov -b -c temp.c". The output indicates eight branches on the switch and one (branch 6) not taken.
What are all those branches and how do I get 100% coverage?
Oho! bde's assembly dump shows that that version of GCC is compiling this switch statement as some approximation of a binary tree, starting at the middle of the set. So it checks if i is equal to 2, then checks if it's greater or less than 2, and then for each side it checks if it's equal to 1 or 3 respectively, and if not, then it goes to default.
That means there are two different code paths for it to get to the default result -- one for numbers higher than 2 that aren't 3, and one for numbers lower than 2 that aren't 1.
Looks like you'll get to 100% coverage if you change that i<4 in your loop to i<=4, so as to test the path on each side.
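Concretely (just restating that suggestion as code), the driver loop becomes:

for(i=0;i<=4;++i)    /* i==4 exercises the "greater than 2 but not 3" path into default */
    foo(i);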
(And, yes, that's something that's very likely to have changed from GCC 3.x to GCC 4.x. I wouldn't say it's "fixed", as it's not "wrong" exactly aside from making the gcov results confusing. It's just that on a modern processor with branch prediction, it's probably slow as well as overly complicated.)
I get the same result using gcc/gcov 3.4.6.
For a switch statement, it should normally generate two branches for each case statement. One is if the case is true and should be executed, and the other is a "fallthrough" branch that goes on to the next case.
In your situation, it looks like gcc is making a "fallthrough" branch for the last case, which doesn't make sense since there is nothing to fall into.
Here's an excerpt from the assembly code generated by gcc (I changed some of the labels for readability):
cmpl $2, -4(%ebp)
je CASE2
cmpl $2, -4(%ebp)
jg L7
cmpl $1, -4(%ebp)
je CASE1
addl $1, LPBX1+16
adcl $0, LPBX1+20
jmp DEFAULT
L7:
cmpl $3, -4(%ebp)
je CASE3
addl $1, LPBX1+32
adcl $0, LPBX1+36
jmp DEFAULT
I admit that I don't know much about x86 assembly, and I don't understand the use of the L7 label but it might have something to do with the extra branch. Maybe someone with more knowledge about gcc can explain what is going on here.
It sounds like it might be an issue with the older version of gcc/gcov; upgrading to a newer gcc/gcov might fix the problem, especially given the other post where the results look correct.
Are you sure you are running a.out? Here are my results (gcc 4.4.1):
File 't.c'
Lines executed:100.00% of 11
Branches executed:100.00% of 6
Taken at least once:100.00% of 6
Calls executed:100.00% of 5
t.c:creating 't.c.gcov'
I'm using mingw on windows (which is not the latest gcc) and it looks like this may be sorted out in newer versions of gcc.