Could you help me understand parallelism in VHDL? - concurrency

I understand that in a process the instructions are executed sequentially and that the value of a signal is not updated until the end of the process, but I cannot understand the principle of parallelism. For example, in the following code I know that both instructions will be executed in parallel (at the same time), but I do not know whether Q will get the new value of Sig2 or the previous one. Also, when we calculate Sig2, do we use the new value of Sig1 or the previous one?
Sig1<=a and b;
Sig2<=Sig1 and a;
Q<=Sig2;

As VHDL uses event-driven semantics, nothing actually executes in parallel; it just has the appearance of parallelism. The concurrent assignments you show execute whenever their RHS operands change; there is no implied ordering. If a changes from 1 to 0, you cannot depend on the order in which the first two statements execute. It's possible the 2nd assignment executes first, then the 1st, then the 3rd (because Sig2 has changed), and then the 2nd again because Sig1 has changed.
Most tools will try to order the statements to minimize the number of assignment re-executions and may even optimize it as if you wrote:
Q <= a and b;
and eliminate Sig1 and Sig2 from the simulation.


volatile variable updated from multiple threads C++

volatile bool b;

// Thread 1: only reads b
void f1() {
    while (1) {
        if (b) { /* do something */ }
        else   { /* do something else */ }
    }
}

// Thread 2: only sets b to true if a certain condition is met
void f2() {
    while (1) {
        // some local condition evaluated -> local_cond
        if (!b && (local_cond == true)) b = true;
        // some other work
    }
}

// Thread 3: only sets b to false when it gets a message on the socket it is listening to
void f3() {
    while (1) {
        // select on the socket
        if (/* expected message came */) b = false;
        // do some other work
    }
}
If thread 2 updates b first at time t and thread 3 later updates b at time t+5:
will thread 1 see the latest value "in time" whenever it reads b?
For example: reads from t+delta to t+5+delta should read true, and
reads after t+5+delta should read false.
delta is the time it takes for the store of "b" to reach memory after one of threads 2 or 3 updates it.
The effect of the volatile keyword is principally two things (I avoid scientifically strict formulations here):
1) Its accesses can't be cached or combined. (UPD: as suggested, I should stress this means caching in registers or in another compiler-provided location, not the RAM cache in the CPU.) For example, the following code:
x = 1;
x = 2;
for a volatile x will never be combined into a single x = 2, whatever optimization level is requested; but if x is not volatile, even low optimization levels will likely cause this collapse into a single write. The same goes for reads: each read operation will access the variable's value without any attempt to cache it.
2) All volatile operations are emitted at the machine-instruction level in the same order relative to each other (to stress: only relative to other volatile operations) as they appear in the source code.
But this ordering does not hold between non-volatile and volatile accesses. For the following code:
int *x;
volatile int *vy;

void foo()
{
    *x = 1;
    *vy = 101;
    *x = 2;
    *vy = 102;
}
gcc (9.4) with -O2 and clang (10.0) with -O produce something similar to:
movq x(%rip), %rax
movq vy(%rip), %rcx
movl $101, (%rcx)
movl $2, (%rax)
movl $102, (%rcx)
retq
so one access to x is already gone, despite its presence between two volatile accesses. If one needs the first x = 1 to complete before the first write to vy, an explicit barrier must be added (since C11, atomic_signal_fence is the platform-independent means for this).
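A minimal sketch of that fix in C++, following the answer's suggestion (the declarations are the same as above; atomic_signal_fence only constrains the compiler, which is what is needed here):

#include <atomic>

int *x;
volatile int *vy;

void foo()
{
    *x = 1;
    // Compiler-level barrier: the store *x = 1 may not be moved past this
    // point or merged with the later *x = 2, so it is emitted before *vy = 101.
    std::atomic_signal_fence(std::memory_order_seq_cst);
    *vy = 101;
    *x = 2;
    *vy = 102;
}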
That was the general rule, but without regard to multithreading issues. What happens with multithreading?
Well, imagine that thread 2 writes true to b; this is a write of the value 1 to a single-byte location. But it is an ordinary write without any memory-ordering requirements. What volatile gave you is that the compiler won't optimize it away. But what about the processor?
If this were a modern abstract processor, or one with relaxed rules like ARM, I'd say nothing prevents it from postponing the real write for an indefinite time. (To clarify, "write" here means exposing the operation to the RAM-and-all-caches conglomerate.) It's entirely up to the processor's discretion. Well, processors are designed to flush their stockpile of pending writes as fast as possible. But you can't know what affects the real delay: for example, it could "decide" to fill the instruction cache with a few upcoming lines first, or flush other queued writes... there are lots of variants. The only thing we know is that it makes a "best effort" to flush all queued operations, to avoid getting buried under previous results. That's entirely natural and nothing more.
With x86, there is an additional factor. Nearly every memory write (and, I guess, this one as well) is a "releasing" write on x86, so all previous reads and writes must be completed before this write. But the catch is that the operations required to complete are the ones before this write. So when you write true to the volatile b, you can be sure all previous operations have already become visible to the other participants... but this write itself could still be postponed for a while... how long? Nanoseconds? Microseconds? Any later write to memory will flush, and so publish, this write to b... do you have any writes in a loop iteration of thread 2?
The same affects thread 3. You can't be sure this b = false will be published to other CPUs when you need it. The delay is unpredictable; the only guarantee, unless this is a realtime-aware hardware system, is that the write becomes visible eventually, at some indefinite time. The ISA rules and barriers provide ordering, not exact timing. And x86 is definitely not built for that kind of realtime.
Well, all this means you also need an explicit barrier after the write, one which affects not only the compiler but the CPU as well: a barrier between the previous write and any following reads or writes. Among the C/C++ facilities, a full barrier satisfies this, so you have to add std::atomic_thread_fence(std::memory_order_seq_cst), or use an atomic variable (instead of a plain volatile one) with the same memory order for the write.
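A minimal sketch of the atomic-variable option (the function names here are mine; only the flag b comes from the question):

#include <atomic>

std::atomic<bool> b{false};          // replaces the plain "volatile bool b"

// Thread 2: publish the flag.
void publish_ready()
{
    // A seq_cst store is ordered against surrounding memory operations by
    // both the compiler and the CPU (on x86 it compiles to a locked
    // instruction or an mfence), matching the barrier discussed above.
    b.store(true, std::memory_order_seq_cst);
}

// Thread 1: read the flag.
bool observe()
{
    return b.load(std::memory_order_seq_cst);
}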
And all this still won't provide you with exact timings like the ones you described ("t" and "t+5"), because the visible "timestamps" of the same operation can differ for different CPUs! (Well, this resembles Einstein's relativity a bit.) All you can say in this situation is that something is written into memory, and typically (not always) the inter-CPU order is what you expected (but an ordering violation will punish you).
But I can't grasp the general idea of what you want to implement with this flag b. What do you want from it; what state should it reflect? Please go back to the higher-level task and reformulate it. Is this (I'm just guessing from the coffee grounds) a green light to do something, which can be cancelled by an external order? If so, an internal permission ("we are ready") from thread 2 must not override that cancellation. This can be done using different approaches, such as:
1) Just separate flags and a mutex/spinlock around setting them. Easy but a bit costly (or even substantially costly, I don't know your environment).
2) An atomically modified analog. For example, you can use a bitfield variable which is modified using compare-and-swap. Assign bit 0 to "ready" and bit 1 to "cancelled". For C, atomic_compare_exchange_strong is what you'll need here on x86 (and on most other ISAs). And volatile is not needed anymore if you stay with memory_order_seq_cst. A minimal sketch of this option follows below.
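Something along these lines, written in C++ for consistency with the fence example above (the names READY, CANCELLED, try_set_ready, and set_cancelled are mine, not from the question):

#include <atomic>
#include <cstdint>

// Bit 0 = "ready", bit 1 = "cancelled", as suggested above.
constexpr std::uint8_t READY     = 1u << 0;
constexpr std::uint8_t CANCELLED = 1u << 1;

std::atomic<std::uint8_t> state{0};

// Thread 2: set "ready", but never once a cancellation has arrived.
bool try_set_ready()
{
    std::uint8_t cur = state.load();
    while (!(cur & CANCELLED)) {
        if (state.compare_exchange_strong(cur, cur | READY))
            return true;            // we won the race; "ready" is now visible
        // cur was reloaded by the failed CAS; loop and re-check
    }
    return false;                   // cancelled; "ready" must not be set
}

// Thread 3: cancellation always sticks, regardless of "ready".
void set_cancelled()
{
    state.fetch_or(CANCELLED);
}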
Will thread1 see the latest value "in time" whenever it is reading b?
Yes, the volatile keyword denotes that the variable can be modified from outside the thread, or by hardware, without the compiler being aware of it. Thus every access (both read and write) made through an lvalue expression of volatile-qualified type is considered an observable side effect for the purpose of optimization and is evaluated strictly according to the rules of the abstract machine (that is, all writes are completed at some time before the next sequence point). This means that within a single thread of execution, a volatile access cannot be optimized out or reordered relative to another visible side effect that is separated from it by a sequence point.
Unfortunately, the volatile keyword is not thread-safe, so these operations have to be handled with care; it is recommended to use std::atomic for this, unless you are in an embedded or bare-metal scenario.
Also, if the flag is part of a struct such as struct X { int a; volatile bool b; };, the whole struct should be made atomic.
Say I have a system with 2 cores. The first core runs thread 2, the second core runs thread 3.
reads from t+delta to t+5+delta should read true and reads after t+5+delta should read false.
The problem is that thread 1 might only read at t + 10000000, when the kernel decides one of the other threads has run long enough and schedules a different one. So it is likely that thread 1 will not see the change a lot of the time.
Note: this ignores all the additional problems of cache synchronization and observability. If the thread isn't even running, all of that becomes irrelevant.

Execute and finish of methods

This is a very naive question, please forgive my ignorance if I use the wrong terms.
If I have a series of instructions as in the snippet,
bool methodComplete = false;
methodComplete = doSomeMethod(someParam, etcParam); //long & complex method that returns true
if (methodComplete)
    doSomeOtherMethod();
will the method doSomeMethod() finish its execution before if (methodComplete) is evaluated?
Or is this a case for an asynchronous pattern if I want to guarantee it is completed?
The language specifications define how a program will effectively behave from the point of view of the user/programmer. So, yes, you can assume that the program behaves like this:
It computes doSomeMethod
It stores the results in methodComplete
It executes the if clauses
That said, some optimizations might result in code being executed ahead of time; see Speculative execution.
will the method doSomeMethod() finish executing before if (methodComplete) is evaluated?
Yes*.
or is this a case for an asynchronous pattern if I want to guarantee it has completed?
Only if you are doing parallel computing.
*) It can become a no if your code is executing in parallel.
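If the long method really were run asynchronously, the ordering would have to be restored explicitly. A minimal sketch in C++ (doSomeMethod and doSomeOtherMethod are the question's hypothetical methods; the stub bodies and parameter values are made up):

#include <future>
#include <iostream>

// Hypothetical stand-ins for the question's methods.
bool doSomeMethod(int someParam, int etcParam) { return someParam < etcParam; }
void doSomeOtherMethod() { std::cout << "done\n"; }

int main()
{
    // Launch doSomeMethod on another thread; the main thread keeps going.
    std::future<bool> methodComplete =
        std::async(std::launch::async, doSomeMethod, 1, 2);

    // ... other work could happen here, concurrently ...

    // get() blocks until doSomeMethod has finished and returns its result,
    // so the "finished before the if" guarantee is made explicit again.
    if (methodComplete.get())
        doSomeOtherMethod();
}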

How statements are executed concurrently in combinational logic using VHDL?

I wonder how signal assignment statements are executed concurrently in combinational logic using VHDL. In the following code, for example, the three statements are supposed to run concurrently. What I have a doubt about is how the 'y' output signal changes immediately when I run the simulation, even though, if the statements ran concurrently, 'y' would not see the effect of 'wire1' and 'wire2' (unless the statements were executed more than once).
entity test1 is port (a, b, c, d : in bit; y : out bit);
end entity test1;
------------------------------------------------------
architecture basic of test1 is
    signal wire1, wire2 : bit;
begin
    wire1 <= a and b;
    wire2 <= c and d;
    y <= wire1 and wire2;
end architecture basic;
Since VHDL is used for simulating digital circuits, this must work similarly to the actual circuits, where (after a small delay usually ignored in simulations) the circuits continuously follow their inputs.
I assume you wonder how the implementation achieves this behaviour:
The simulator keeps track of which signal depends on which other signals and reevaluates the expression whenever one of the inputs changes.
So when a changes, wire1 will be updated, and in turn trigger an update to y. This will continue as long as combinatorial updates are necessary. So in the simulation the updates are indeed well ordered, although no simulation time has passed. The "time" between such updates is often called a "delta cycle".
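A toy sketch of that mechanism, written in C++ rather than VHDL (this is not how a real simulator is implemented, just an illustration of the "reevaluate whenever an input changes, repeat until stable" idea; all names are mine):

#include <functional>
#include <iostream>
#include <queue>

struct Sim {
    bool a = false, b = false, c = false, d = false;
    bool wire1 = false, wire2 = false, y = false;
    std::queue<std::function<void()>> pending;    // updates scheduled for the next delta cycle

    // Reevaluate an expression; if the signal actually changed,
    // schedule the update of whatever depends on it.
    void set(bool &sig, bool value, std::function<void()> downstream) {
        if (sig != value) { sig = value; pending.push(downstream); }
    }

    void update_wire1() { set(wire1, a && b, [this] { update_y(); }); }
    void update_wire2() { set(wire2, c && d, [this] { update_y(); }); }
    void update_y()     { set(y, wire1 && wire2, [] {}); }

    // Run delta cycles until no more signals change; no simulation time passes.
    void settle() {
        while (!pending.empty()) {
            auto next = pending.front();
            pending.pop();
            next();
        }
    }
};

int main() {
    Sim s;
    s.a = s.b = s.c = s.d = true;          // drive the inputs
    s.update_wire1();                      // a and b changed -> reevaluate wire1
    s.update_wire2();                      // c and d changed -> reevaluate wire2
    s.settle();                            // delta cycles propagate the change to y
    std::cout << "y = " << s.y << "\n";    // prints y = 1
}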

concurrent and conditional signal assignment (VHDL)

In VHDL, there are two types of signal assignment:
concurrent ----> when...else
           ----> select...when...else
sequential ----> if...else
           ----> case...when
The problem is that some say that when...else conditions are checked line by line (kind of sequential), while select...when...else conditions are checked once. See this reference for example.
I say that when..else is also a sequential assignment, because you are checking line by line. In other words, I say there is no need to claim that if..else within a process is equivalent to when..else. Why do they assume when..else is a concurrent assignment?
What you are hinting at in your problem has nothing to do with concurrent assignments or sequential statements. It has more to do with the difference between if and case. Before we get to that, let's first understand a few equivalences. The concurrent conditional assignment:
Y <= A when ASel = '1' else B when BSel = '1' else C ;
Is exactly equivalent to a process with the following code:
process(A, ASel, B, BSel, C)
begin
    if ASel = '1' then
        Y <= A ;
    elsif BSel = '1' then
        Y <= B ;
    else
        Y <= C ;
    end if ;
end process ;
Likewise the concurrent selected assignment:
With MuxSel select
Y <= A when "00", B when "01", C when others ;
Is equivalent to a process with the following:
process(MuxSel, A, B, C)
begin
    case MuxSel is
        when "00"   => Y <= A ;
        when "01"   => Y <= B ;
        when others => Y <= C ;
    end case ;
end process ;
From a coding perspective, the sequential forms above have a little more capability than the assignment forms, because case and if allow blocks of code, whereas the assignment form only assigns to one signal. However, other than that, they have the same language restrictions and produce the same hardware (to the extent that synthesis tools achieve that). In addition, for many simple hardware problems, the assignment form works well and is a concise capture of the problem.
So where your thoughts are leading really comes down to the difference between if and case. If statements (and their equivalent conditional assignments) that have multiple "elsif" branches in (or implied in) them tend to create priority logic, or at least cascaded logic. Whereas case (and its equivalent selected assignments) tends to be well suited for things like multiplexers, and its logic structure tends to be more of a balanced tree.
Sometimes tools will refactor an if statement to allow it to be equivalent to a case statement. Also, for some targets (particularly LUT-based logic like Xilinx and Altera), the difference between them in terms of hardware efficiency does not show up until there are enough "elsif" branches.
With VHDL-2008, the assignment forms are also allowed in sequential code. The transformation is the same except without the process wrapper.
Concurrent vs Sequential is about independence of execution.
A concurrent statement is simply a statement that is evaluated and/or executed independently of the code that surrounds it. Processes are concurrent. Component/Entity Instances are concurrent. Signal assignments and procedure calls that are done in the architecture are concurrent.
Sequential statements (other than wait) run when the code around them also runs.
Interesting note, while a process is concurrent (because it runs independently of other processes and concurrent assignments), it contains sequential statements.
Often when we write RTL code, the processes that we write are simple enough that it is hard to see their sequential nature. It really takes a state machine or a testbench to see the true sequential nature of a process.

Can an expression containing post increment execute in parallel with other parts of that expression in C++?

I came up with this question from the following answer:
Efficiency of postincrement v.s. preincrement in C++
There I've found this expression:
a = b++ * 2;
They said that, in the above, b++ can run in parallel with the multiplication.
How can b++ run in parallel with the multiplication?
What I've understood about the procedure is:
First we copy b's value to a temporary variable, then increment b, and finally multiply that temporary variable by 2.
We're not multiplying by b but by that temporary variable, so how can it run in parallel?
I got the above idea about the temporary variable from another answer: Is there a performance difference between i++ and ++i in C++?
What you are talking about is instruction-level parallelism. The processor in this case can execute the increment of b while also multiplying the old value of b (the copy).
This is very fine-grained parallelism, at the processor level, and in general you can expect it to give you some advantages, depending on the architecture.
In the case of pre-increment, instead, the result of the increment operation must be waited for in the processor's pipeline before the multiplication can be executed, hence incurring a penalty.
However, this is not semantically equivalent, as the value of a will be different.
#SimpleGuy's answer is a pretty reasonable first approximation. The trouble is, it assumes a halfway point between the simple theoretical model (no parallelization) and the real world. This answer tries to look at a more realistic model, still without assuming one particular CPU.
The chief thing to realize is that real CPUs have registers and caches. These exist because memory operations are far more expensive than simple math. Parallelization of an integer increment and an integer bit shift (*2 is optimized to <<1 on real CPUs) is a minor concern; the optimizer will chiefly look at avoiding load stalls.
So let's assume that a and b aren't in a CPU register. The relevant memory operations are LOAD b, STORE a and STORE b. Everything starts with LOAD b so the optimizer may move that up as far as possible, even before a previous instruction when possible (aliasing is the chief concern here). The STORE b can start as soon as b++ has finished. The STORE a can happen after the STORE b so it's not a big problem that it's delayed by one CPU instruction (the <<1), and there's little to be gained by parallelizing the two operations.
The reason that b++ can run in parallel is as follows.
b++ is a post-increment, which means the value of the variable is incremented after use. The line below can be broken into two parts:
a = b++ * 2
Part-1: Multiply b with 2
Part-2: Increment the value of b by 1
Since the above two do not depend on each other, they can run in parallel.
Had the case been pre-increment, which means incrementing before use:
a = ++b * 2
The parts would have been
Part-1: Increment the value of b by 1
Part-2: Multiply (new) b with 2
As can be seen above, part 2 can run only after part 1 is executed, so there is a dependency and hence no parallelism.
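A conceptual C++ decomposition of the two cases (this is only how the semantics can be pictured, not what a compiler literally emits; the starting values are made up):

#include <iostream>

int main()
{
    int a = 0, b = 5;

    // a = b++ * 2;  pictured as:
    int tmp = b;    // copy the old value of b
    a = tmp * 2;    // Part 1: multiply the old value   } independent of each other,
    b = tmp + 1;    // Part 2: increment b              } so they can overlap

    std::cout << a << " " << b << "\n";   // prints 10 6

    // a = ++b * 2;  pictured as:
    b = b + 1;      // Part 1: increment b first
    a = b * 2;      // Part 2: multiply the NEW value -> must wait for Part 1

    std::cout << a << " " << b << "\n";   // prints 14 7
}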