I am trying to understand some Fortran 77 code but stumbled on an EQUIVALENCE statement in the code.
Here is part of the code:
REAL*8 DATA1(0:N-1)
COMPLEX*16 DATA2(0:N/2-1)
EQUIVALENCE(DATA1, DATA2)
...
...
CALL FFT(DATA1, N/2, -1)
Basically, the FFT subroutine is a one-dimensional complex-to-complex FFT engine. There are some permutations and matrix-vector multiplications in the subroutine.
The code uses DATA2 later in this manner:
K0=DATA2(0)
K1=DCONJG(DATA2(0))
Can anyone give me a clue about why the EQUIVALENCE statement is used? My assumption is that DATA1, which is REAL, is changed into DATA2, which is a COMPLEX variable, with some changes performed in the FFT subroutine. But if that is so, what about the imaginary part of DATA2? The FFT subroutine only works with REAL variables. And why are the array sizes of DATA1 and DATA2 different?
I cannot find any answer on this forum that satisfies my question. Thanks for your answers; they would help me a lot.
equivalence is one of Fortran's two features for storage association of entities. (The other is common blocks, on which topic I will remain silent here.) The equivalence statement declares that the entities named in its argument list share the same storage locations. In this case data1 and data2 share the same memory locations.
If you have a tool for inspecting memory locations and point it at data1 you'll see something like this:
+----------+----------+----------+----------+----------+----------+
| | | | | |
| data1(0) | data1(1) | data1(2) | data1(3) | data1(4) | data1(...
| | | | | |
+----------+----------+----------+----------+----------+----------+
Point the same tool at data2 and you'll see something like this
+----------+----------+----------+----------+----------+----------+
| | |
| data2(0) | data2(1) | data2(....
| re im | re im | re im
+----------+----------+----------+----------+----------+----------+
but the 'truth' is rather more like
+----------+----------+----------+----------+----------+----------+
| | |
| data1(0) data1(1) | data1(2) data1(3) | data1(4) data1(...
| | |
| data2(0) | data2(1) | data2(....
| re im | re im | re im
+----------+----------+----------+----------+----------+----------+
data1(0) is at the same location as the real component of data2(0). data1(1) is the imaginary component of data2(0), and so forth.
This is one of the applications of equivalence which one still occasionally comes across -- being able to switch between viewing data as complex or as pairs of reals. However, it's not confined to this kind of type punning; there's nothing to say you can't equivalence integers and reals, or any other types.
Another use one occasionally still sees is the use of equivalence for remapping arrays from one rank to another. For example, given
integer, dimension(3,2) :: array2
integer, dimension(6) :: array1
and
equivalence(array1(1),array2(1,1))
the same elements can be treated as belonging to a rank-2 array or to a rank-1 array to suit the program's needs.
equivalence is generally frowned upon these days; most of what it has been used for can be done more safely with modern Fortran. For more, you might care to look at my answer to Is the storage of COMPLEX in fortran guaranteed to be two REALs?
I am here to ask whether my perception is actually true.
I originally thought that defining vector<T> v(size_t someSize, T init_value) would call a function such as vector<T>::reserve, rather than vector<T>::push_back. I found some discussion related to this here: std::vector push_back is bottleneck, but that question is slightly different in its idea.
Running some experiments, I noticed that vector<T> v(size_t someSize, T init_value) calls ::push_back all along. Is this true? I have the following report using uftrace (https://github.com/namhyung/uftrace).
Avg total Min total Max total Function
========== ========== ========== ====================================
858.323 ms 858.323 ms 858.323 ms main
618.245 ms 618.245 ms 618.245 ms sortKaway
234.795 ms 234.795 ms 234.795 ms std::sort
72.752 us 72.752 us 72.752 us std::vector::_M_fill_initialize
65.788 us 49.551 us 82.026 us std::vector::vector
20.292 us 11.387 us 68.629 us std::vector::_M_emplace_back_aux
18.722 us 17.263 us 20.181 us std::equal
18.472 us 18.472 us 18.472 us std::vector::~vector
17.891 us 10.002 us 102.079 us std::vector::push_back // push_back?!
Does vector<T>::reserve also call vector<T>::push_back eventually? Is there a faster version for vector?
The above was the original post. After some comments, I tested a simple version, and realized I was completely mistaken.
#include <vector>
#include <functional>
#include <queue>
#include <cassert>
using namespace std; // for the time being
int main () {
    vector<int> v(10, 0);
    return 0;
}
This actually results in the following, which doesn't involve std::vector<T>::push_back.
# Function Call Graph for 'main' (session: 9ce7f6bb33885ff7)
=============== BACKTRACE ===============
backtrace #0: hit 1, time 12.710 us
[0] main (0x4009c6)
========== FUNCTION CALL GRAPH ==========
12.710 us : (1) main
0.591 us : +-(1) std::allocator::allocator
0.096 us : | (1) __gnu_cxx::new_allocator::new_allocator
: |
6.880 us : +-(1) std::vector::vector
4.338 us : | +-(1) std::_Vector_base::_Vector_base
0.680 us : | | +-(1) std::_Vector_base::_Vector_impl::_Vector_impl
0.445 us : | | | (1) std::allocator::allocator
0.095 us : | | | (1) __gnu_cxx::new_allocator::new_allocator
: | | |
3.294 us : | | +-(1) std::_Vector_base::_M_create_storage
3.073 us : | | (1) std::_Vector_base::_M_allocate
2.849 us : | | (1) std::allocator_traits::allocate
2.623 us : | | (1) __gnu_cxx::new_allocator::allocate
0.095 us : | | +-(1) __gnu_cxx::new_allocator::max_size
: | | |
1.867 us : | | +-(1) operator new
: | |
2.183 us : | +-(1) std::vector::_M_fill_initialize
0.095 us : | +-(1) std::_Vector_base::_M_get_Tp_allocator
: | |
1.660 us : | +-(1) std::__uninitialized_fill_n_a
1.441 us : | (1) std::uninitialized_fill_n
1.215 us : | (1) std::__uninitialized_fill_n::__uninit_fill_n
0.988 us : | (1) std::fill_n
0.445 us : | +-(1) std::__niter_base
0.096 us : | | (1) std::_Iter_base::_S_base
: | |
0.133 us : | +-(1) std::__fill_n_a
Sorry for the confusion. Yes, the library implementation works as we expect; it doesn't involve push_back when the vector is constructed with an initial size.
Phew, looks like you answered your own question! I was extremely confused for a moment. Hypothetically, I could imagine some obscure vector implementation whose fill constructor uses reserve and push_back, but definitely not in high-quality implementations like the one accompanying GCC in the GNU standard library. So: possible in theory for an obscure implementation, but practically completely unlikely for any decent one.
On the contrary: almost two decades ago I tried to implement my own version of std::vector in hopes of matching its performance. This wasn't just some dumb exercise; we had a software development kit and wanted to use some basic C++ containers for it, but the SDK had the goal of allowing people to write plugins for our software using different compilers (and also different standard library implementations) than the ones we were using. So we couldn't safely use std::vector in those contexts, since our version might not match the plugin writer's, and we were forced to begrudgingly roll our own containers for the SDK.
Instead I found std::vector to be incredibly efficient in ways that were hard to match, especially for plain old data types with trivial ctors and dtors. Again, this was over a decade ago, but I found that using the fill constructor with vector<int> in MSVC 5 or 6 (I forget which) actually translated to the same disassembly as using memset, in ways that my naive version (which just looped through the elements and used placement new on them, regardless of whether they were PODs or not) did not. The range ctor also effectively translated to a super fast memcpy for PODs. And that's precisely what made vector so hard to beat for me, at least back then. Without getting deep into type traits and special-casing PODs, I couldn't really match vector's performance for PODs. I could match it for UDTs, but most of our performance-critical code tended to use PODs.
So chances are that popular vector implementations today are just as efficient as, if not more efficient than, when I conducted those tests, and I wanted to pitch in as reassurance that your vector implementation is most likely damned fast. The last thing I'd expect it to do is implement fill ctors or range ctors using push_back.
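For what it's worth, the kind of POD special-casing described above looks roughly like the sketch below. It is a hypothetical helper (fill_construct is made up, not part of any standard library), shown only to illustrate the type-traits dispatch:

#include <cstddef>
#include <cstring>
#include <new>
#include <type_traits>

// Hypothetical fill helper: trivially copyable elements are filled with raw
// byte copies (which optimizers commonly lower to memset/memcpy loops), while
// everything else falls back to placement new with the copy constructor.
template <typename T>
void fill_construct(T* dest, std::size_t n, const T& value) {
    if constexpr (std::is_trivially_copyable_v<T>) {
        for (std::size_t i = 0; i < n; ++i)
            std::memcpy(dest + i, &value, sizeof(T));
    } else {
        for (std::size_t i = 0; i < n; ++i)
            ::new (static_cast<void*>(dest + i)) T(value);
    }
}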
I am a university student currently studying computer science and programming. While reading chapter 2 of C++ Primer by Stanley B. Lippman, a question popped into my mind: if computer memory is divided into tiny storage locations called bytes (8 bits), each byte of memory is assigned a unique address, and an integer variable uses up 4 bytes of memory, shouldn't my console, when using the address-of operator, print out 4 unique addresses instead of 1?
I doubt that the textbook is incorrect, so there is probably a flaw in my understanding of computer memory. As a result, I would like a clear answer to this question. Thanks in advance, people :)
shouldn't my console, when using the address-of operator print out 4 unique addresses instead of 1?
No.
The address of an object is the address of its starting byte. A 4-byte int has a unique address, the address of its first byte, but it occupies the next three bytes as well. Those next three bytes have different addresses, but they are not the address of the int.
Each variable is located in memory somewhere, so each variable gets an address you can get with the address-of operator.
That each byte in a multi-byte variable also has its own address doesn't matter; the address-of operator gives you a pointer to the variable.
Some "graphics" to hopefully explain it...
Let's say we have an int variable named i, and that the type int takes four bytes (32 bits, which is usual for int). Then you have something like
+---+---+---+---+
| | | | |
+---+---+---+---+
Some place is reserved for the four bytes; where doesn't matter, the compiler will handle all that for you.
Now if you use the address-of operator to get a pointer to the variable i, i.e. you do &i, then you have something like
+---+---+---+---+
| | | | |
+---+---+---+---+
^
|
&i
The expression &i points to the memory position where the byte sequence of the variable begins. It can't possibly give you multiple pointers, one for each byte; that's not how pointers work, and it isn't needed either.
Yes, an integer commonly requires four bytes. All four bytes are allocated as one block of memory for your integer, and each block has a unique address: the address of the block's first byte.
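A small self-contained example makes this concrete (the printed addresses will of course differ from run to run):

#include <cstdio>

int main() {
    int i = 42;

    // The address-of operator yields a single pointer: the address of the
    // first byte of the int.
    std::printf("&i        : %p\n", static_cast<void*>(&i));

    // The remaining bytes still have their own addresses; we can inspect
    // them by viewing the object as a sequence of bytes.
    unsigned char* bytes = reinterpret_cast<unsigned char*>(&i);
    for (unsigned k = 0; k < sizeof(int); ++k)
        std::printf("byte %u at : %p\n", k, static_cast<void*>(bytes + k));
}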
I was reading through some C code written online and came across the following line:
if(!(array[index] ^ array[index - 1]))
The ^ operator is a bitwise XOR, so I read this line to say that it will return true if "The array value at index is not different to the value at the previous index." Simplified, I read this as "If the array value at the index is the same as the one at the previous index."
When I read it like that, it seems like an overcomplicated way to write:
if(array[index] == array[index - 1])
Are these expressions the same? If not, then why? If I'm not misreading it, the best explanation I have is that since this code is involved with interrupts on clock signals it needs to be quick. Maybe a bitwise operation is faster than whatever goes on behind-the-scenes with ==?
Yes, basically they are the same thing.
Let's see here:
 a | b | a xnor b
---+---+----------
 0 | 0 |    1
 0 | 1 |    0
 1 | 0 |    0
 1 | 1 |    1
As you can see here, the xnor returns 1 only when a and b are equal.
This is one of the more common techniques in embedded C, especially when you have memory or time constraints.
Since Arithmetic Logic Units (ALUs) include an n-bit XNOR circuit (n depending on your processor's architecture), a comparison written with XNOR would be processed in one instruction cycle.
[Someone who has more experience can correct me if I am wrong.]
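For integer operands (the ^ operator isn't defined for floating-point types anyway), the two forms really do agree; a quick sanity check, sketched here, confirms it:

#include <cassert>

int main() {
    int array[] = {5, 5, 7, 7, 8};
    const int count = sizeof(array) / sizeof(array[0]);

    for (int index = 1; index < count; ++index) {
        bool xor_form = !(array[index] ^ array[index - 1]);
        bool eq_form  = (array[index] == array[index - 1]);
        assert(xor_form == eq_form);   // both conditions agree for every pair
    }
    return 0;
}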
Let's say I have a struct of type A that is POD, and a void pointer p.
Can I safely cast p to a pointer to A, then read/write to the A structure pointed to by p?
Is it guaranteed to work every time, even if the alignment of A is 8 and p points to an odd memory address (the worst case)?
I am not concerned about performance issues, I just want to know whether it's supposed to work according to the standard and/or whether it's portable enough on mainstream platforms.
Edit: I'm also interested to know whether there's any difference between x86 and x86-64 architectures.
Thanks!
Yes, you can cast a pointer to class A to a pointer to class B.
Essentially, you are telling the compiler to use class B as a stencil when referring to the memory location of the class A variable.
Generally, this is not safe, because the values at those locations may have different meanings and positions in the two types.
Usually this kind of cast is used for interpreting a buffer of uint8_t as a structured object. Another use is with unions.
So the term "safe" depends on the context in which the operation is used.
Edit 1: Alignment
Most modern processors can handle misaligned accesses (though some architectures raise a hardware fault instead). The processor may require more operations to fetch the data, which will slow down performance.
For example, with a 16-bit processor, a 16-bit value aligned on an odd address will require two fetches (since the processor only fetches at even addresses):
+----------+----------------------------+
| address | value |
+----------+----------------------------+
| 16 | N/A |
+----------+----------------------------+
| 17 | 1st 8 bits of 16-bit value |
+----------+----------------------------+
| 18 | 2nd 8 bits of 16-bit value |
+----------+----------------------------+
| 19 | N/A |
+----------+----------------------------+
Since the processor only fetches values at even addresses, reading the value requires two fetches. The first fetch, at address 16, obtains the first 8 bits of the 16-bit variable (stored at address 17). The second fetch, at address 18, obtains the second 8 bits.
The processor also has to perform some extra operations to get the bits into the preferred order within one "word". These operations also hurt performance.
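If you cannot guarantee that p is suitably aligned, a portable workaround (a sketch assuming A is trivially copyable, not something taken from the question) is to memcpy the bytes into a properly aligned object instead of dereferencing through the cast pointer:

#include <cstring>

struct A {        // some POD type with alignment 8
    int    x;
    double y;
};

// Reads an A out of an arbitrarily aligned buffer.  std::memcpy has no
// alignment requirement on its arguments, so this is well defined even if
// p points to an odd address; compilers typically optimize the copy away
// when the source happens to be aligned.
A read_A(const void* p) {
    A a;
    std::memcpy(&a, p, sizeof(A));
    return a;
}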
Is there a way to improve the Boost uBLAS product performance?
I have two matrices A and B which I want to multiply/add/subtract/...
In MATLAB vs. C++ I get the following times [s] for operations on 2000x2000 matrices:
OPERATION | MATLAB | C++ (MSVC10)
A + B | 0.04 | 0.04
A - B | 0.04 | 0.04
AB | 1.0 | 62.66
A'B' | 1.0 | 54.35
Why is there such a huge performance loss here?
The matrices are only real doubles.
But I also need positive definite, symmetric, and rectangular products.
EDIT:
The code is trivial
#include <boost/numeric/ublas/matrix.hpp>
using namespace boost::numeric::ublas;

matrix<double> A( 2000, 2000 );
// Fill Matrix A
matrix<double> B = A;
matrix<double> C = A + B;
matrix<double> D = A - B;
matrix<double> E = prod(A, B);
matrix<double> F = prod(trans(A), trans(B));
EDIT 2:
The results are mean values of 10 runs. The standard deviation was less than 0.005.
I would expect a factor of 2-3 maybe, but not 50 (!)
EDIT 3:
Everything was benched in Release ( NDEBUG/MOVE_SEMANTICS/.. ) mode.
EDIT 4:
Preallocating matrices for the product results did not affect the runtime.
Post your C++ code for advice on any possible optimizations.
You should be aware, however, that MATLAB is highly specialized for its designed task, and you are unlikely to be able to match it using Boost. On the other hand, Boost is free, while MATLAB decidedly is not.
I believe the best Boost performance can be had by binding the uBLAS code to an underlying LAPACK implementation.
You should use noalias on the left-hand side of matrix multiplications in order to get rid of unnecessary copies.
Instead of E = prod(A,B); use noalias(E) = prod(A,B);
From documentation:
If you know for sure that the left hand expression and the right hand expression have no common storage, then assignment has no aliasing. A more efficient assignment can be specified in this case: noalias(C) = prod(A, B); This avoids the creation of a temporary matrix that is required in a normal assignment. 'noalias' assignment requires that the left and right hand side be size conformant.
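Applied to the code in the question, that would look something like the following sketch (assuming the result matrices have already been sized to 2000x2000, so the size-conformance requirement is met):

#include <boost/numeric/ublas/matrix.hpp>
using namespace boost::numeric::ublas;

int main() {
    matrix<double> A(2000, 2000), B(2000, 2000);
    matrix<double> E(2000, 2000), F(2000, 2000);

    // E and F already have the right size, so noalias assigns straight into
    // them without building a temporary result matrix first.
    noalias(E) = prod(A, B);
    noalias(F) = prod(trans(A), trans(B));
}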
There are many efficient BLAS implementations, like ATLAS, GotoBLAS, and MKL; use them instead.
I haven't picked through the code, but I guess ublas::prod(A, B) uses a naive triple loop, with no blocking, and is not cache friendly. If that's true, prod(A, trans(B)) will be much faster than the other forms.
If CBLAS is available, use cblas_dgemm to do the calculation. If not, you can simply rearrange the data, i.e. use prod(A, trans(B)) instead.
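For reference, a minimal CBLAS call for the 2000x2000 case might look like the sketch below (assuming contiguous row-major double matrices; the header name and link flags vary between BLAS distributions):

#include <cblas.h>
#include <vector>

int main() {
    const int n = 2000;
    std::vector<double> A(n * n, 1.0), B(n * n, 1.0), C(n * n, 0.0);

    // C = 1.0 * A * B + 0.0 * C; all matrices are row-major with leading
    // dimension n.
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n,
                1.0, A.data(), n,
                B.data(), n,
                0.0, C.data(), n);
    return 0;
}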
You don't know what role memory management is playing here. prod has to allocate a 32 MB matrix, and so does trans, twice, and then you're doing all of that 10 times. Take a few stackshots and see what it's really doing. My dumb guess is that if you pre-allocate the matrices you get a better result.
Other ways matrix multiplication could be sped up are:
pre-transposing the left-hand matrix, to be cache-friendly, and
skipping over zeros. Only if A(i,k) and B(k,j) are both non-zero is any value contributed.
Whether this is done in uBLAS is anybody's guess.
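To illustrate the pre-transposing idea with a hand-rolled sketch (plain C++, not uBLAS code): transposing B once up front lets the innermost loop walk both operands contiguously in memory.

#include <vector>

// Naive but cache-friendlier multiply: C = A * B for n x n row-major
// matrices stored as flat vectors.  B is transposed once so that the
// innermost loop reads both A and Bt row-wise (sequentially).
void multiply(const std::vector<double>& A, const std::vector<double>& B,
              std::vector<double>& C, int n) {
    std::vector<double> Bt(B.size());
    for (int k = 0; k < n; ++k)
        for (int j = 0; j < n; ++j)
            Bt[j * n + k] = B[k * n + j];              // pre-transpose B

    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j) {
            double sum = 0.0;
            for (int k = 0; k < n; ++k)
                sum += A[i * n + k] * Bt[j * n + k];   // both reads are sequential
            C[i * n + j] = sum;
        }
}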