Array index ordering during declaration for performance [closed]


Closed. This question needs debugging details. It is not currently accepting answers.
Closed 6 years ago.
I require an array of dimensions 100000 x 2. Is there a performance advantage to declaring the array in one of the following two forms in Fortran 90/95?
Case (i):
real, dimension(100000,2) :: A
or case (ii):
real, dimension(2,100000) :: B
I believe case (ii) would have an advantage due to Fortran's column-major storage order. I have run a few test cases and the result was as expected, but the difference in timings is small. I would like someone to confirm this both with and without ifort's vectorization.
The compiler flags I used for the test cases were -no-vec to disable vectorization and -vec-report3 for report generation.

In your case, the arrays are laid out in memory like this:
A(1, 1) A(2, 1) A(3, 1) ... A(100000, 1) A(1, 2) A(2, 2) ... A(100000, 2)
B(1, 1) B(2, 1) B(1, 2) B(2, 2) ... B(1, 100000) B(2, 100000)
Which is better depends on what you want to do with it:
mean(A(:, 1)) + mean(A(:, 2))
is faster than
mean(B(1, :)) + mean(B(2, :))
because for A it can read in a lot of contiguous values at once, whereas for B it has to skip every second value and then go back for the other half.
But
do i = 1, 100000
    C(i) = A(i, 1) - A(i, 2)
end do
is probably slower than
do i = 1, 100000
    C(i) = B(1, i) - B(2, i)
end do
because for B it can read the values sequentially, whereas for A it has to jump forward and back by 100000 elements on every iteration.
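To make the two layouts concrete, here is a small sketch of where elements sit in memory under column-major storage. It is written in Python purely for illustration; `column_major_offset` is a hypothetical helper, not a Fortran intrinsic, and it uses 1-based indices as Fortran does.

```python
def column_major_offset(i, j, nrows):
    """Zero-based memory offset of element (i, j), 1-based indices, column-major."""
    return (j - 1) * nrows + (i - 1)

# Case (i): A(100000, 2). A(i, 1) and A(i, 2) are a whole column apart.
stride_a = column_major_offset(1, 2, 100000) - column_major_offset(1, 1, 100000)

# Case (ii): B(2, 100000). B(1, i) and B(2, i) are adjacent in memory.
stride_b = column_major_offset(2, 1, 2) - column_major_offset(1, 1, 2)

print(stride_a, stride_b)  # 100000 1
```

The distance between the two operands of `C(i)` is what matters for the second loop: 100000 elements for A, 1 element for B.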

Related

Improve performance of calculating log2 of a number between 1 and 2 [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 5 years ago.
I am trying to calculate log2(x) using integer-arithmetic operations.
The input x is a value between 1 and 2.
Because log2(x) then lies between 0 and 1, which truncates to 0 in integer arithmetic, everything is pre-scaled by 2^16.
In other words:
The function takes the integer value of x * 2^16 instead of x
The function returns the integer value of log2(x) * 2^16 instead of log2(x)
Here is my code:
uint64_t Log2(uint64_t x)
{
    static uint64_t TWO = (uint64_t)2 << 16;
    uint64_t res = 0;
    for (int i=0; i<16; i++)
    {
        x = (x * x) >> 16;
        if (x >= TWO)
        {
            x >>= 1;
            res += 1 << (15 - i);
        }
    }
    return res;
}
What I am looking for is a way to improve the performance of the loop.

While you said in the comments that you don't want a lookup-table-based solution, I still present one here. The reason is simple: this lookup table is 516 bytes, and if I compile your Log2 with -O3, I get a ~740 byte function, so the two are in the same ballpark.
I didn't create a solution that matches yours exactly, because your version is not as precise as possible. I used rint(log(in/65536.0f)/log(2)*65536) as a reference. Your version has a worst-case difference of 2 and an average difference of 1.0. The proposed version has a worst-case difference of 1 and an average difference of 0.2, so it is more accurate.
About performance: I've checked two microbenchmarks:
use a simple LCG random generator for input. My version is 29 times faster
use numbers 0x10000->0x20000 linearly. My version is 17 times faster
The solution is extremely simple: it linearly interpolates between table elements (call initTable() first to initialize the lookup table):
unsigned short table[0x102];
void initTable() {
    for (int i=0; i<0x102; i++) {
        int v = rint(log(i*0x100/65536.0f+1)/log(2)*65536);
        if (v>0xffff) v = 0xffff;
        table[i] = v;
    }
}
int log2(int val) {
    int idx = (val-0x10000)>>8;
    int l0 = table[idx];
    int l1 = table[idx+1];
    return l0+(((l1-l0)*(val&0xff)+128)>>8);
}
I've just played with the table, and here are further results:
you can decrease table size to have 0x82 elements (260 bytes), and still have worst error of 1, and average error of 0.32 (you need to put 0.5+ in rint() in this case)
you can decrease table size to have 0x42 elements (132 bytes), worst error becomes 2, and average error is 0.53 (you need to put 0.75+ in rint() in this case)
decreasing the table size further significantly increases worst error
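For experimenting with table sizes and error figures like the ones above, a direct Python port of the table lookup can be handy. This is only a sketch: the names `init_table`, `log2_fixed` and `reference` are illustrative, and Python's `round` is assumed to stand in for `rint` (both round halves to even by default).

```python
import math

TABLE_SIZE = 0x102  # same size as the C table above

def init_table():
    """Build the lookup table of log2(1 + i/256) scaled by 2^16, clamped to 16 bits."""
    table = []
    for i in range(TABLE_SIZE):
        v = round(math.log(i * 0x100 / 65536.0 + 1) / math.log(2) * 65536)
        table.append(min(v, 0xFFFF))
    return table

TABLE = init_table()

def log2_fixed(val):
    """log2 of val / 2^16, scaled by 2^16, for val in [0x10000, 0x20000]."""
    idx = (val - 0x10000) >> 8
    l0 = TABLE[idx]
    l1 = TABLE[idx + 1]
    # Linear interpolation between neighbouring entries, rounded to nearest.
    return l0 + (((l1 - l0) * (val & 0xFF) + 128) >> 8)

def reference(val):
    """Floating-point reference, same formula as in the answer."""
    return round(math.log(val / 65536.0) / math.log(2) * 65536)

worst = max(abs(log2_fixed(v) - reference(v)) for v in range(0x10000, 0x20000))
print("worst error:", worst)
```

Changing `TABLE_SIZE` together with the shift and mask in `log2_fixed` is enough to try the smaller tables mentioned above.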

As your code is already very fast, I would try to unroll the loop. Writing the loop body 16 times renders the code unreadable, but saves the loop overhead and the expression 1 << (15 - i) becomes constant.

Trust your compiler :). Just use a high enough optimisation level and the compiler will take care of this kind of micro-optimisation.
examples:
gcc ARM - https://godbolt.org/g/4XdPCp
gcc - x86-64 https://godbolt.org/g/nBNmLR
gcc - AVR https://godbolt.org/g/Mq81Sg
So almost no branches and no cache flushes or misses (or at least a minimal number): easily pipelined, with optimal execution time.

Python3 how to create a list of partial products

I have a very long list (of big numbers), let's say for example:
a=[4,6,7,2,8,2]
I need to get this output:
b=[4,24,168,336,2688,5376]
where each b[i]=a[0]*a[1]...*a[i]
I'm trying to do this recursively in this way:
b=[4] + [ a[i-1]*a[i] for i in range(1,6)]
but the (wrong) result is: [4, 24, 42, 14, 16, 16]
I don't want to recompute all the products each time; I need an efficient way (if possible), because the list is very long.
At the moment this works for me:
b=[0]*6
b[0]=4
for i in range(1,6): b[i]=a[i]*b[i-1]
but it's too slow. Any ideas? Is it possible to avoid the "for" loop or to speed it up in some other way?

You can calculate the products step by step, since each product depends directly on the previous one.
What I mean is:
1) Compute the product of the first i - 1 numbers
2) The i-th product is then a[i] times that product of the first i - 1 numbers
This method is called dynamic programming:
Dynamic programming (also known as dynamic optimization) is a method for solving a complex problem by breaking it down into a collection of simpler subproblems, solving each of those subproblems just once, and storing their solutions
This is the implementation:
a = [4, 6, 7, 2, 8, 2]
b = []
product_so_far = 1
for i in range(len(a)):
    product_so_far *= a[i]
    b.append(product_so_far)
print(b)
This algorithm works in linear time (O(n)), which is the most efficient complexity you'll get for such a task
If you want a little optimization, you could generate the b list to the predefined length (b = [0] * len(a)) and, instead of appending, you would do this in a loop:
b[i] = product_so_far
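The same running product can also be written with the standard library, which moves the loop out of Python-level code. A minimal sketch using `itertools.accumulate` with `operator.mul`:

```python
from itertools import accumulate
from operator import mul

a = [4, 6, 7, 2, 8, 2]

# accumulate yields a[0], a[0]*a[1], a[0]*a[1]*a[2], ...
b = list(accumulate(a, mul))
print(b)  # [4, 24, 168, 336, 2688, 5376]
```

This is still O(n) multiplications; with very big numbers the multiplications themselves are likely to dominate the running time, so the gain over the explicit loop may be modest.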

Representing a polynomial with linked lists [closed]

Closed. This question needs debugging details. It is not currently accepting answers.
Closed 6 years ago.
Thanks in advance.
For my C++ class, I am tasked with representing a polynomial such as (MyPoly = 7x^6*y^4 + 4x^4*y^5 - 8x^5*y^3 - 9x + 8) using linked lists, building Node and Poly classes to help express it.
I don't know how to represent a polynomial with both an x and a y in a linked list.
I have an idea of building a linked list to represent the polynomial like 7 6 4 -> 4 4 5 -> -8 5 3 -> -9 1 0 -> 8 0 0 -> NULL
I am new to this, so any example code or pseudocode would be of great help.
Attempted setup
I came up with this code (a starting point), but I think it will only work for a single variable and not two (7x^6*... but not 7x^6*y^4). Thanks again :).

Have you thought about, or are you allowed to work with, Horner's representation of polynomials? It's not only a much more efficient way to calculate polynomial values, but can in many cases lead to a sparser data structure. For example, the polynomial:
7x^4 + 5x^3 - 4x^2 + 8x - 7
is equivalent to the following expression:
(((7x + 5)x - 4)x + 8)x - 7
So there are 3 things to note:
The one remarkable thing about this scheme (although not directly related to your question) is that its evaluation is much faster, since you save a lot of multiplications.
The length of the expression depends directly on the degree of the polynomial.
All elements in the expression have the same form, independently of the degree. This is also true for each arity.
So in this lucky case the polynomial I chose could be very easily and efficiently stored as the following list/array:
[7, 5, 1, -4, 1, 8, 1, -7]
or if you want, as a linked list of [x_mult|sum] numbers:
[7|5]->[1|-4]->[1|8]->[1|-7]
where each element at an even index is a multiplier applied together with x to the running result, and the element that follows it is then added. The scheme is quite simple.
#include <iostream>
using namespace std;
int main()
{
    // the x you want to calculate
    int x = 1;
    // the horner-representation of your polynom
    int A[8] {7, 5, 1, -4, 1, 8, 1, -7};
    int partial;
    int result = 1;
    // run calculations following horner-schema
    for (int i = 0; i < 8; i++) {
        if (i%2==0){
            partial = A[i]*x; // this mult. is only needed at i=0
            result *= partial;
        } else{
            partial = A[i];
            result += partial;
        }
    }
    cout << "x=" << x << ", p(x)=" << result << endl;
    return 0;
}
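For comparison, Horner's scheme is often written over a plain coefficient list, highest degree first, without the explicit multiplier entries. The Python sketch below assumes the polynomial encoded by the array above is 7x^4 + 5x^3 - 4x^2 + 8x - 7; `horner` is an illustrative name.

```python
def horner(coefficients, x):
    """Evaluate a polynomial given its coefficients, highest degree first."""
    result = 0
    for c in coefficients:
        # One multiplication and one addition per coefficient.
        result = result * x + c
    return result

coeffs = [7, 5, -4, 8, -7]  # 7x^4 + 5x^3 - 4x^2 + 8x - 7
print(horner(coeffs, 1))  # 9
print(horner(coeffs, 2))  # 145
```

A linked list of coefficients can be traversed in exactly the same way, one node per loop iteration.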
Issues: You could greatly improve its performance and memory usage if you suppress the multiplier entries and take the 1's for granted, storing the leading 7 elsewhere. Also, since the length of the list depends directly on the degree, polynomials like
x^1000 + 3
would have a very inefficient representation.
Workaround for the memory issues: A possible workaround would be to derive an ExpandedListElement from your ListElement, so that the numbers in its containers aren't interpreted as factors but as a number of repetitions. So the ExpandedListElement [1000|a] would mean that your list has one thousand ListElements that look like this: [1|a]. The x^1000 + 3 example above would then have two elements: ExpandedListElement[999|0]-->ListElement[1|3]. You would also need a method to perform the loop, which I omit (if you need this workaround let me know, and I'll post it).
I didn't test it extensively, but I assume it's also a good approach for two or more variables. I also left out the rest of the OO implementation details, but the core data structure and operations are there and should be easy to embed in classes. If you try it, let me know how it works!
Cheers
Andres

I think you can represent the polynomial with a matrix rather than a linked list.
    | x^0 | x^1 | x^2 | x^3
----|-----|-----|-----|-----
y^0 |     |     |     |
y^1 |     |     |     |
y^2 |     |     |     |
y^3 |     |     |     |
In each cell you keep the coefficient of the corresponding term x^i * y^j. This also makes operations easier to define.
You can use Boost.uBLAS for matrix operations.
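Since most cells of such a matrix are empty for a polynomial like MyPoly, a sparse variant may be worth considering. The Python sketch below is one possible reading of the matrix idea, not the answer's own code: it keeps only the non-zero cells in a dictionary keyed by the exponent pair (i, j). The names `add_polys`, `my_poly` and `other` are illustrative.

```python
from collections import defaultdict

def add_polys(p, q):
    """Add two polynomials stored as {(x_pow, y_pow): coefficient}."""
    result = defaultdict(int)
    for poly in (p, q):
        for exponents, coeff in poly.items():
            result[exponents] += coeff
    # Drop cells whose coefficients cancelled out.
    return {e: c for e, c in result.items() if c != 0}

# 7x^6*y^4 + 4x^4*y^5 - 8x^5*y^3 - 9x + 8
my_poly = {(6, 4): 7, (4, 5): 4, (5, 3): -8, (1, 0): -9, (0, 0): 8}
other = {(1, 0): 9, (2, 2): 3}  # 9x + 3x^2*y^2, a made-up second operand

total = add_polys(my_poly, other)
print(sorted(total.items()))
```

Addition becomes cell-wise, exactly as it would with a dense matrix, but storage grows with the number of terms rather than with the highest exponent.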

A fast solution, though not so clean:
MyPoly = 7x^6*y^4 + 4x^4*y^5 - 8x^5*y^3 - 9x + 8
#include <list>
// One term of the polynomial: constant * x^x_pow * y^y_pow
class factor{
public:
    factor()=default;
    factor(int constant,int x_pow,int y_pow):constant(constant),x_pow(x_pow),y_pow(y_pow){}
    int constant;
    int x_pow;
    int y_pow;
};
// Defined after factor so that std::list<factor> sees a complete type
class polynomial{
public:
    std::list<factor> factors;
};
int main()
{
    polynomial MyPoly;
    MyPoly.factors.emplace_back(7,6,4);
    MyPoly.factors.emplace_back(4,4,5);
    MyPoly.factors.emplace_back(-8,5,3); // negative coefficients keep their sign
    MyPoly.factors.emplace_back(-9,1,0);
    MyPoly.factors.emplace_back(8,0,0);
}
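To check that a term list like this really describes MyPoly, it helps to evaluate it at a few points. The Python sketch below mirrors the (constant, x_pow, y_pow) layout; `Term` and `evaluate` are illustrative names, not part of the answer's code.

```python
from dataclasses import dataclass

@dataclass
class Term:
    coefficient: int
    x_pow: int
    y_pow: int

def evaluate(terms, x, y):
    """Sum coefficient * x^x_pow * y^y_pow over all terms."""
    return sum(t.coefficient * x ** t.x_pow * y ** t.y_pow for t in terms)

# 7x^6*y^4 + 4x^4*y^5 - 8x^5*y^3 - 9x + 8
my_poly = [Term(7, 6, 4), Term(4, 4, 5), Term(-8, 5, 3), Term(-9, 1, 0), Term(8, 0, 0)]

print(evaluate(my_poly, 1, 1))  # 2
print(evaluate(my_poly, 2, 1))  # 246
```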

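This would be one way of doing it:
Design considerations:
1. Consider the polynomial as a list of nodes (terms).
2. Each node can consist of sub-nodes (one per variable).
So, your class definitions would be:
#include <list>
// One variable raised to a power, e.g. x^6
template<class T>
class SubNode
{
    int coefficient;
    int power;
    T variable;
};
// One term: a list of sub-nodes
template<class T>
class Node
{
    std::list<SubNode<T>> subNodeList;
};
// The polynomial: a list of terms
template<class T>
class Polynomial
{
    std::list<Node<T>> termList;
};
Note: I have not tested this code for correctness.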

Recursive Functions Using Fibonacci Series [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
Closed 9 years ago.
I am self-learning C++ from Sams Teach Yourself C++ In One Hour A Day and on page 150 the author discusses recursive functions using the Fibonacci Series.
He uses the following code:
#include <iostream>
using namespace std;
int GetFibNumber(int FibIndex)
{
    if(FibIndex < 2 )
        return FibIndex;
    else
        return GetFibNumber(FibIndex - 1) + GetFibNumber(FibIndex - 2);
}
int main()
{
    cout << " Enter 0 based index of desired Fibonacci Number: ";
    int Index = 0;
    cin >> Index;
    cout << " Fibonacci number is: " << GetFibNumber(Index) << endl;
    return 0;
}
What is the difference between having
return GetFibNumber(FibIndex - 1) + GetFibNumber(FibIndex - 2);
and
return FibIndex - 1 + FibIndex - 2;
Why do you have to call the function inside itself?
Thank you in advance!

You ask: "Why do you have to call the function inside itself?" Well, strictly, you don't.
The Fibonacci sequence is the sequence of numbers defined by this mathematical recursion:
a_0 = 0
a_1 = 1
a_n = a_{n-1} + a_{n-2}
The original function does compute this sequence, albeit inefficiently. If Index is 0 it returns 0; if it is 1, it returns 1. Otherwise it returns GetFibNumber(Index - 1) + GetFibNumber(Index - 2), which is precisely what the mathematical definition is. For each element of the sequence, you must add the two previous terms in the sequence.
Your code just returns Index - 1 + Index - 2, which gives a different numerical sequence. Compare:
Fibonacci: 0, 1, 1, 2, 3, 5, 8, 13, 21, 34...
Yours: 0, 1, 1, 3, 5, 7, 9, 11, 13, 15...
But that aside, you do not strictly need a recursive function to compute this mathematical recursion. All you need is a simple for loop:
int GetFibNumber(int FibIndex)
{
    if(FibIndex < 2 )
        return FibIndex;
    int a_n_2 = 0, a_n_1 = 1, a_n = 1;
    // run up to and including FibIndex so that a_n ends as the requested term
    for (int i = 2; i <= FibIndex; i++)
    {
        a_n = a_n_1 + a_n_2;
        a_n_2 = a_n_1;
        a_n_1 = a_n;
    }
    return a_n;
}
This approach will be much faster, also.
The code-recursive technique is mathematically correct. It is, however, slower, because it doesn't reuse any computations. It computes a_{n-1} by reworking the recursion all the way back to a_1. It then computes a_{n-2}, without reusing any of the work that generated a_{n-1}. And if you consider that this lack of reuse happens at every step of the recursion, you'll see that the running time grows exponentially for the recursive function. It grows linearly for the for loop, though.
Advanced topic: There is a way to make the recursive version run faster, and that can be important to know if you run into a programming problem that is most readily defined recursively. Once you're much more comfortable with C++, look up memoization. A memoized recursive Fibonacci gives linear worst-case run-time, and constant run time for repeated lookups (assuming the memo lookup is itself O(1)).
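As a taste of what memoization looks like, here is a minimal sketch in Python, where the standard library provides it as a decorator (`functools.lru_cache`). The function name `fib` is illustrative; the recursion is the same as in GetFibNumber.

```python
from functools import lru_cache

@lru_cache(maxsize=None)  # remember every result that has been computed
def fib(n):
    if n < 2:
        return n
    return fib(n - 1) + fib(n - 2)

print(fib(10))  # 55
print(fib(50))  # 12586269025
```

Each index is now computed once and then looked up, so the call tree collapses from exponential to linear size.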

Using the version that does not use recursion is not correct. It only computes the first few Fibonacci numbers correctly. Try to compute the first 10 Fibonacci numbers using the two versions and you will see for yourself that they produce two different sequences.

The function GetFibNumber calculates the Nth number in the Fibonacci series. If you take a look at the explanation on http://en.wikipedia.org/wiki/Fibonacci_number, it is calculated by adding the (N-1)th and (N-2)th numbers in the Fibonacci series, and this is exactly what the function does. You provide the function with the index in the Fibonacci series that you want to calculate (let's say 6; this should give 8 as the result).
To calculate the 6th element in the Fibonacci series you need to add the 5th and 4th elements together, so you first need to calculate those. This is where recursion steps in. You can let the function call itself, but instead of calling it again with the value 6 as parameter you now use 5 and 4. This again leads to the same problem (you need to calculate the 5th element by adding elements 4 and 3), and so on.
With the recursive function you can simply re-use the code to perform the same calculation over and over again until you reach a point where the answer is known directly (in this case N = 1 or N = 0; these cases return N itself, i.e. 1 and 0).

I would suggest, since you are still learning, to program this both recursively (like the author did) and using a loop (while, for). It will most likely show you the answer on how this algorithm is built up.
Hint 1: You must know that Fibonacci sequences are built upon two initial values...
Hint 2: When it comes to recursion, you should know how the function results are stored. That will answer your question as well.

They are not equivalent, and your version certainly won't calculate the Fibonacci sequence. Recursion can be thought of as a tree, so to compute Fib(8), say, by definition we take Fib(7) + Fib(6):
        Fib(8)
       /      \
   Fib(7)    Fib(6)
which in turn require computing Fib(6), Fib(5) and Fib(4) as follows:
              Fib(8)
           /          \
      Fib(7)          Fib(6)
      /    \          /    \
  Fib(6)  Fib(5)  Fib(5)  Fib(4)
And so on. What you are doing would produce a different tree of depth 1:
   Fib(8)
   /    \
  7      6
Because if you never call the function within the function, it can never go any deeper. It should be clear from this and the other answers why it is not correct.
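The size of the full recursion tree can be measured by counting calls. The Python sketch below uses the same recursion as GetFibNumber plus a counter; `fib_counting` and the one-element list used as a mutable counter are illustrative choices.

```python
def fib_counting(n, counter):
    """Plain recursive Fibonacci that also counts how many calls are made."""
    counter[0] += 1
    if n < 2:
        return n
    return fib_counting(n - 1, counter) + fib_counting(n - 2, counter)

calls = [0]
value = fib_counting(8, calls)
print(value, calls[0])  # 21 67
```

So the tree for Fib(8) has 67 nodes, whereas the depth-1 tree from the question has only the root and two plain numbers.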

Dynamic programming algorithm N, K problem

I need an algorithm that takes two positive numbers N and K and calculates the biggest possible number we can get by removing K digits from N.
For example, let's say we have N=12345 and K=3, so the biggest possible number we can get by removing 3 digits from N is 45 (other results would be 12, 15, 35, but 45 is the biggest). Also, you cannot change the order of the digits in N (so 54 is NOT a solution). Another example would be N=66621542 and K=3, where the solution is 66654.
I know this is a dynamic programming related problem, but I can't come up with any idea for solving it. I need to solve this within 2 days, so any help is appreciated. If you don't want to solve this for me you don't have to, but please point me to the trick or at least to some materials where I can read up on similar problems.
Thank you in advance.

This can be solved in O(L), where L is the number of digits. Why use complicated DP formulas when we can use a stack to do this?
For: 66621542
Push digits onto the stack while there are fewer than L - K digits on it:
66621. Now, for each further digit, pop digits off the stack while they are less than the digit just read, then push that digit if there is room:
read 5: 5 > 1, pop 1 off the stack. 5 > 2, pop 2 also. push 5: 6665
read 4: stack isn't full, push 4: 66654
read 2: 2 < 4, do nothing.
You need one more condition: be sure not to pop more items off the stack than there are digits left in your number, otherwise your solution will be incomplete!
Another example: 12345
L = 5, K = 3
push L - K = 2 digits onto the stack: 12
read 3: 3 > 2, pop 2. 3 > 1, pop 1. push 3. stack: 3
read 4: 4 > 3, pop 3. push 4. stack: 4
read 5: 5 > 4, but we can't pop 4, otherwise we won't have enough digits left. So push 5: 45.
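The stack idea can be sketched in a few lines of Python. This is a common variant rather than a literal transcription of the steps above: instead of checking how many digits are left, it tracks how many removals are still allowed and trims the stack at the end. The name `largest_after_removal` is illustrative, and N is passed as a string of digits.

```python
def largest_after_removal(digits, k):
    """Largest number (as a string) obtainable by deleting k digits, order kept."""
    keep = len(digits) - k
    removals_left = k
    stack = []
    for d in digits:
        # Pop smaller digits while we are still allowed to remove something.
        while stack and removals_left > 0 and stack[-1] < d:
            stack.pop()
            removals_left -= 1
        stack.append(d)
    # Unused removals are spent on the tail.
    return "".join(stack[:keep])

print(largest_after_removal("66621542", 3))  # 66654
print(largest_after_removal("12345", 3))     # 45
```

Each digit is pushed once and popped at most once, which is where the O(L) bound comes from.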

Well, to solve any dynamic programming problem, you need to break it down into recurring subproblems.
Say we define your problem as A(n, k), which returns the largest number possible by removing k digits from n.
We can define a simple recursive algorithm from this.
Using your example, A(12345, 3) = max { A(2345, 2), A(1345, 2), A(1245, 2), A(1235, 2), A(1234, 2) }
More generally, A(n, k) = max { A(n with 1 digit removed, k - 1) }
And your base case is A(n, 0) = n.
Using this approach, you can create a table that caches the result for each pair of n and k.
int A(int n, int k)
{
    typedef std::pair<int, int> input;
    static std::map<input, int> cache;
    if (k == 0) return n;
    input i(n, k);
    if (cache.find(i) != cache.end())
        return cache[i];
    cache[i] = /* ... as above ... */
    return cache[i];
}
Now, that's the straightforward solution, but there is a better solution that works with a very small one-dimensional cache. Consider rephrasing the question like this: "Given a string n and an integer k, find the lexicographically greatest subsequence of n of a given length (here, the length of n minus k)". This is essentially what your problem is, and the solution is much simpler.
We can now define a different function B(i, j), which gives the largest lexicographical sequence of length (i - j), using only the first i digits of n (in other words, having removed j digits from the first i digits of n).
Using your example again, we would have:
B(1, 0) = 1
B(2, 0) = 12
B(3, 0) = 123
B(3, 1) = 23
B(3, 2) = 3
etc.
With a little bit of thinking, we can find the recurrence relation:
B(i, j) = max( 10 * B(i-1, j) + n_i , B(i-1, j-1) ), where n_i is the i-th digit of n
or, if j = i, then B(i, j) = B(i-1, j-1)
and B(0, 0) = 0
And you can code that up in a very similar way to the above.
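A bottom-up version of the B(i, j) recurrence might look like the Python sketch below. It keeps only the previous row, which is the small one-dimensional cache mentioned above. The name `largest_by_dp` is illustrative, the j = 0 case (nothing removed, so only the first option applies) is handled explicitly, and k is assumed not to exceed the number of digits.

```python
def largest_by_dp(n, k):
    """Largest number left after removing k digits from the digit string n."""
    prev = [0]  # row i = 0: B(0, 0) = 0
    for i in range(1, len(n) + 1):
        digit = int(n[i - 1])
        cur = []
        for j in range(min(i, k) + 1):
            if j == i:
                cur.append(0)  # every digit so far has been removed
                continue
            best = 10 * prev[j] + digit      # keep digit i
            if j > 0:
                best = max(best, prev[j - 1])  # or remove digit i
            cur.append(best)
        prev = cur
    return prev[k]

print(largest_by_dp("12345", 3))     # 45
print(largest_by_dp("66621542", 3))  # 66654
```

Comparing the two candidates numerically is safe here because both always have the same number of digits, i - j.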

The trick to solving a dynamic programming problem is usually to figure out what the structure of a solution looks like, and more specifically whether it exhibits optimal substructure.
In this case, it seems to me that the optimal solution with N=12345 and K=3 would have an optimal solution to N=12345 and K=2 as part of the solution. If you can convince yourself that this holds, then you should be able to express a solution to the problem recursively. Then either implement this with memoisation or bottom-up.

The most important elements of any dynamic programming solution are:
Defining the right subproblems
Defining a recurrence relation between the answer to a sub-problem and the answer to smaller sub-problems
Finding base cases, the smallest sub-problems whose answer does not depend on any other answers
Figuring out the scan order in which you must solve the sub-problems (so that you never use the recurrence relation based on uninitialized data)
You'll know that you have the right subproblems defined when
The problem you need the answer to is one of them
The base cases really are trivial
The recurrence is easy to evaluate
The scan order is straightforward
In your case, it is straightforward to specify the subproblems. Since this is probably homework, I will just give you the hint that you might wish that N had fewer digits to start off with.

Here's what I think:
Consider the first k + 1 digits from the left and find the biggest one (the leftmost one, if it occurs more than once). Remove the digits to its left and store the number of removed digits (call it j).
Keep that digit as the next digit of the answer, then do the same thing with the digits that follow it as the new N and k - j as the new K, so that the next window has k + 1 - j digits. Do this until K reaches 0 (hopefully it will, if I'm not mistaken) or until exactly K digits remain, in which case they are all removed.
The number you end up with will be the number you're looking for.
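One way to read this window approach as code is sketched below in Python. It rests on an assumption the description leaves open: after each step the chosen digit is fixed as part of the answer and the search continues on the digits after it with K reduced by the number of digits removed. The name `largest_by_window` is illustrative.

```python
def largest_by_window(digits, k):
    """Pick the leftmost maximum among the next k + 1 digits, repeatedly."""
    result = []
    pos = 0
    while k > 0 and len(digits) - pos > k:
        window = digits[pos:pos + k + 1]
        best = max(window)
        j = window.index(best)  # digits removed to the left of the maximum
        result.append(best)
        pos += j + 1
        k -= j
    if k == 0:
        # Nothing left to remove: keep the remaining digits as they are.
        result.append(digits[pos:])
    # Otherwise exactly k digits remain, and all of them are removed.
    return "".join(result)

print(largest_by_window("12345", 3))     # 45
print(largest_by_window("66621542", 3))  # 66654
```

Scanning a window for every output digit can cost more than the single pass of a stack-based solution, but the result is the same.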
