Optimize an algorithm that finds a specific Fibonacci number - C++

Given a number A, I want to find the Ath Fibonacci number that is a multiple
of 3 or whose decimal representation contains the digit 3.
Example:
Fibonacci > 0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, 233, ...
Input: 1, Output: 3;
3 is the first Fibonacci number that is a multiple of 3 or has a 3 in it.
Input: 3, Output: 21;
21 is the third Fibonacci number that is a multiple of 3 or has a 3 in it.
Edit: Variable type changed to unsigned long long int and the Fibonacci generator adjusted. Thanks #rcgldr and #Jarod42 for the help!
My code:
#include <bits/stdc++.h>
using namespace std;

int tem(unsigned long long int i){
    while(i != 0){
        if(i%10 == 3){
            return true;
        }
        i = i/10;
    }
    return false;
}
int main(){
    int a, count = 0;
    unsigned long long int f1 = 1, f2 = 1;
    while(scanf("%d", &a) != EOF){
        for(unsigned long long int i = 2; i > 0; i++){
            i = f1 + f2;
            f1 = f2;
            f2 = i;
            if((i%3 == 0) || tem(i)){
                count++;
                if(count == a){
                    cout << i << endl;
                    break;
                }
            }
        }
    }
}
When A > 20, it starts to slow down. That makes sense, because Fibonacci numbers grow exponentially. My code is not very efficient, but I couldn't find a better approach.
I looked into these links, but didn't reach a conclusion:
1 - Recursive Fibonacci
2 - Fibonacci Optimization
Any ideas? Thanks for the help!

You can speed up the Fibonacci part using this sequence:
uint64_t f0 = 0;   // fib( 0)
uint64_t f1 = 1;   // fib(-1)
int n = ... ;      // generate fib(n)
for(int i = 0; i < n; i++){
    std::swap(f0, f1);
    f0 += f1;
}
Note that Fib(93) is the largest Fibonacci number that fits in a 64-bit unsigned integer, and it also has a 3 in it. Fib(92) is the largest Fibonacci number that is a multiple of 3.
I used this example code to find all of the values (a ranges from 0 to 62), and it seems to run fairly fast, so I'm not sure what the issue is. Is optimization enabled?
#include <iostream>
#include <iomanip>

typedef unsigned long long uint64_t;

int tem(uint64_t i){
    while(i != 0){
        if(i%10 == 3)
            return true;
        i = i/10;
    }
    return false;
}

int main(){
    int a = 0, n;
    uint64_t f0 = 1, f1 = -1;          // fib(-1), fib(-2)
    for(n = 0; n <= 93; n++){
        std::swap(f0, f1);             // f0 = next fib
        f0 += f1;                      // f0 = fib(n)
        if((n&3) == 0 || tem(f0)){     // fib(n) is a multiple of 3 exactly when n is a multiple of 4
            std::cout << std::setw( 2) << a << " "
                      << std::setw( 2) << n << " "
                      << std::setw(20) << f0 << std::endl;
            a++;
        }
    }
    return 0;
}
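Since only Fib(0) through Fib(93) fit in 64 bits, there are at most a few dozen qualifying values. Building on that, one option (a sketch of mine, not part of the original answer) is to precompute them once and answer every query with a table lookup. It starts at fib(3) = 2 so the indexing matches the question's examples, which skip 0, 1, 1:

#include <cstdint>
#include <cstdio>
#include <utility>
#include <vector>

// Sketch: precompute every 64-bit Fibonacci number (from fib(3) on) that is a
// multiple of 3 or contains the digit 3, then answer each query in O(1).
static bool has_digit_3(uint64_t i) {
    for (; i != 0; i /= 10)
        if (i % 10 == 3) return true;
    return false;
}

int main() {
    std::vector<uint64_t> table;
    uint64_t f0 = 1, f1 = 1;                // fib(2), fib(1)
    for (int n = 3; n <= 93; n++) {
        std::swap(f0, f1);
        f0 += f1;                           // f0 = fib(n)
        if (f0 % 3 == 0 || has_digit_3(f0))
            table.push_back(f0);
    }
    int a;
    while (scanf("%d", &a) == 1) {          // 1-based index, as in the question
        if (a >= 1 && a <= (int)table.size())
            printf("%llu\n", (unsigned long long)table[a - 1]);
    }
}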
Depending on the compiler, i%10 and i/10 may be implemented with a multiply by a "magic number" and a shift instead of an actual division by a constant. The code generated by Visual Studio 2015 for tem() is fairly "clever":
tem proc ;rcx = input
test rcx,rcx ; return false if rcx == 0
je SHORT tem1
mov r8, 0cccccccccccccccdh ;magic number for divide by 10
tem0: mov rax, r8 ;rax = magic number
mul rcx ;rdx = quotient << 3
shr rdx, 3 ;rdx = quotient
lea rax, QWORD PTR [rdx+rdx*4] ;rax = quotient*5
add rax, rax ;rax = quotient*10
sub rcx, rax ;rcx -= quotient * 10 == rcx % 10
cmp rcx, 3 ;br if rcx % 10 == 3
je SHORT tem2
mov rcx, rdx ;rcx = quotient (rcx /= 10)
test rdx, rdx ;loop if quotient != 0
jne SHORT tem0
tem1: xor eax, eax ;return false
ret 0
tem2: mov eax, 1 ;return true
ret 0
tem endp

Just pointing out some obvious coding errors:
for(unsigned long long int i = 2; i > 0; i++)
is redundant.
for(;;){
    unsigned long long i = f1 + f2;
should suffice. Secondly,
return 0;
is problematic because it exits the outer while loop as well; a break would be better.
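A minimal sketch of the question's inner loop with both fixes applied (same logic, just restructured):

// Sketch only: redundant counter removed, break instead of return,
// so further inputs keep being processed by the outer while loop.
for (;;) {
    unsigned long long int i = f1 + f2;
    f1 = f2;
    f2 = i;
    if (i % 3 == 0 || tem(i)) {
        count++;
        if (count == a) {
            cout << i << endl;
            break;          // leaves only the for loop, not the outer while
        }
    }
}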

There's a clever way to do Fibonacci.
http://stsievert.com/blog/2015/01/31/the-mysterious-eigenvalue/
The code is in Python and only computes the nth number, but I think you get the idea.
from math import sqrt

def fib(n):
    lambda1 = (1 + sqrt(5))/2
    lambda2 = (1 - sqrt(5))/2
    return (lambda1**n - lambda2**n) / sqrt(5)

def fib_approx(n):
    # for practical range, percent error < 10^-6
    return 1.618034**n / sqrt(5)
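For completeness, here is a rough C++ translation of that closed-form (Binet) formula. This is my own sketch, not part of the original answer, and because of double-precision rounding it is only reliable for roughly n <= 70; beyond that the iterative 64-bit loop above is the safer choice.

#include <cmath>
#include <cstdint>

// Closed-form (Binet) approximation of fib(n); accurate while the value
// still fits comfortably in a double (roughly n <= 70).
uint64_t fib_closed_form(int n) {
    const double sqrt5 = std::sqrt(5.0);
    const double phi   = (1.0 + sqrt5) / 2.0;
    const double psi   = (1.0 - sqrt5) / 2.0;
    return static_cast<uint64_t>(std::llround((std::pow(phi, n) - std::pow(psi, n)) / sqrt5));
}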

_asm function, which prints "Pow (x) = x^2"

I created this piece of code in VS (C++):
#include <iostream>
using namespace std;

static short arr[10];

void powers() {
    _asm {
        mov ecx, 0;
        mov dx, 0;
    for:
        mov ax, cx;
        inc ax;
        mul ax;
        mov [arr + 2 * ecx], ax;
        inc ecx;
        cmp ecx, 10;
        jl for;
    }
}
Now I want to create another function which prints "Pow (x) = x^2". I'm stuck here:
void print_power(unsigned short x) {
    const char* f = "Pow(%d) = %d \n";
    _asm {
        call powers
        push f
        add esp, 4
    }
}
When I call my "powers" function from "print_power", my arr[10] gets filled with the squares of 1 to 10
(arr[0] = 1, arr[1] = 4, arr[2] = 9, arr[3] = 16, arr[4] = 25, ...) (I think).
I want print_power(x), called from main() - for example print_power(8) - to print the 8th element of my arr like this: "Pow (8) = 81".
Thank you in advance, and I apologize if I have made some mistakes.
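Not an answer from the thread, but as a starting point: if the printing itself does not have to be in assembly, the surrounding C++ can make the call and only powers() stays in _asm. This sketch assumes powers() and arr live in the same source file, and that "Pow (8) = 81" means arr[8], i.e. (8+1)^2 the way powers() fills the array:

#include <cstdio>

// Sketch: fill arr via the existing powers() routine, then let C++ do the
// formatted output instead of pushing printf arguments by hand in _asm.
void print_power(unsigned short x) {
    powers();                              // arr[i] = (i+1)*(i+1) for i = 0..9
    std::printf("Pow (%d) = %d\n", (int)x, (int)arr[x]);
}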

Why is constexpr performing worse than normal expression?

I know there is a similar question about this: constexpr performing worse at runtime.
But my case is a lot simpler than that one, and the answers were not enough for me.
I'm just learning about constexpr in C++11 and I wrote some code to compare its efficiency, and for some reason using constexpr makes my code run more than 4 times slower!
By the way, I'm using exactly the same example as on this site: https://www.embarcados.com.br/introducao-ao-cpp11/ (it's in Portuguese, but you can see the example code about constexpr). I have already tried other expressions and the results are similar.
#include <iostream>
#include <ctime>
using namespace std;

constexpr double divideC(double num){
    return (2.0 * num + 10.0) / 0.8;
}

#define SIZE 1000

int main(int argc, char const *argv[])
{
    // Get number of iterations from user
    unsigned long long count;
    cin >> count;

    double values[SIZE];

    // Testing normal expression
    clock_t time1 = clock();
    for (int i = 0; i < count; i++)
    {
        values[i%SIZE] = (2.0 * 3.0 + 10.0) / 0.8;
    }
    time1 = clock() - time1;
    cout << "Time1: " << float(time1)/float(CLOCKS_PER_SEC) << " seconds" << endl;

    // Testing constexpr
    clock_t time2 = clock();
    for (int i = 0; i < count; i++)
    {
        values[i%SIZE] = divideC( 3.0 );
    }
    time2 = clock() - time2;
    cout << "Time2: " << float(time2)/float(CLOCKS_PER_SEC) << " seconds" << endl;

    return 0;
}
Input given:
9999999999
Output:
> Time1: 5.768 seconds
> Time2: 27.259 seconds
Can someone tell me the reason for this? Since constexpr calculations are supposed to run at compile time, this code should be faster, not slower.
I'm using msbuild version 16.6.0.22303 to compile the Visual Studio project generated by the following CMake code:
cmake_minimum_required(VERSION 3.1.3)
project(C++11Tests)
add_executable(Cpp11Tests main.cpp)
set_property(TARGET Cpp11Tests PROPERTY CXX_STANDARD_REQUIRED ON)
set_property(TARGET Cpp11Tests PROPERTY CXX_STANDARD 11)
Without optimizations, the compiler keeps the call to divideC, so the constexpr version is slower.
With optimizations, any decent compiler sees that - for the given code - everything related to values can be optimized away without any side effects. So the code shown can never give a meaningful measurement of the difference between values[i%SIZE] = (2.0 * 3.0 + 10.0) / 0.8; and values[i%SIZE] = divideC( 3.0 );.
With -O1, any decent compiler will create something like this:
for (int i = 0; i < count; i++)
{
    values[i%SIZE] = (2.0 * 3.0 + 10.0) / 0.8;
}
results in:
mov rdx, QWORD PTR [rsp+8]
test rdx, rdx
je .L2
mov eax, 0
.L3:
add eax, 1
cmp edx, eax
jne .L3
.L2:
and
for (int i = 0; i < count; i++)
{
    values[i%SIZE] = divideC( 3.0 );
}
results in:
mov rdx, QWORD PTR [rsp+8]
test rdx, rdx
je .L4
mov eax, 0
.L5:
add eax, 1
cmp edx, eax
jne .L5
.L4:
So both result in identical machine code, containing only the loop counter and nothing else. As soon as you turn on optimizations, you measure only the loop and nothing related to constexpr.
With -O2 even the loop is optimized away, and you would only measure:
clock_t time1 = clock();
time1 = clock() - time1;
cout << "Time1: " << float(time1)/float(CLOCKS_PER_SEC) << " seconds" << endl;

Why does shifting a value further than its size not result in 0?

Actual refined question:
Why does this not print 0?
#include "stdafx.h"
#include <iostream>
#include <string>
int _tmain(int argc, _TCHAR* argv[])
{
unsigned char barray[] = {1,2,3,4,5,6,7,8,9};
unsigned long weirdValue = barray[3] << 32;
std::cout << weirdValue; // prints 4
std::string bla;
std::getline(std::cin, bla);
return 0;
}
The disassembly of the shift operation:
10: unsigned long weirdValue = barray[3] << 32;
00411424 movzx eax,byte ptr [ebp-1Dh]
00411428 shl eax,20h
0041142B mov dword ptr [ebp-2Ch],eax
Original question:
I found the following snippet in some old code we maintain. It converts a byte array to multiple float values and adds the floats to a list. Why does it work for byte arrays with more than 4 bytes?
unsigned long ulValue = 0;
for (USHORT usIndex = 0; usIndex < m_oData.usNumberOfBytes; usIndex++)
{
    if (usIndex > 0 && (usIndex % 4) == 0)
    {
        float* pfValue = (float*)&ulValue;
        oValues.push_back(*pfValue);
        ulValue = 0;
    }
    ulValue += (m_oData.pabyDataBytes[usIndex] << (8*usIndex)); // Why does this work for usIndex > 3??
}
I would understand why this works if << were a rotate operator rather than a shift operator, or if it were
ulValue += (m_oData.pabyDataBytes[usIndex] << (8*(usIndex%4)))
But the code as I found it just confuses me.
The code is compiled using VS 2005.
If I try the original snippet in the immediate window, it doesn't work, though.
I know how to do this properly; I just want to know why the code, and especially the shift operation, works as it is.
Edit: The disassembly for the shift operation is:
13D61D0A shl ecx,3 // multiply uIndex by 8
13D61D0D shl eax,cl // shift to left, does nothing for multiples of 32
13D61D0F add eax,dword ptr [ulValue]
13D61D15 mov dword ptr [ulValue],eax
So the disassembly is fine.
The shift count is masked to 5 bits, which limits the range to 0-31.
A shift of 32 therefore is same as a shift of zero.
http://x86.renejeschke.de/html/file_module_x86_id_285.html
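For context (my note, not part of the original answer): in C++, shifting by a count greater than or equal to the width of the promoted operand is undefined behaviour, so the old code only "works" because of that x86 masking detail. A well-defined version keeps the count below 32, as the question's own %4 variant suggests. The function and parameter names below are stand-ins for m_oData.pabyDataBytes / usNumberOfBytes:

#include <cstdint>
#include <cstring>
#include <vector>

// Sketch: pack groups of 4 bytes into floats with shift counts in 0..24,
// instead of relying on the hardware masking the shift count.
std::vector<float> bytes_to_floats(const unsigned char* bytes, size_t numBytes)
{
    std::vector<float> values;
    uint32_t ulValue = 0;
    for (size_t i = 0; i < numBytes; i++)
    {
        if (i > 0 && i % 4 == 0)
        {
            float f;
            std::memcpy(&f, &ulValue, sizeof f);   // avoids the pointer type-punning too
            values.push_back(f);
            ulValue = 0;
        }
        ulValue += (uint32_t)bytes[i] << (8 * (i % 4));
    }
    // Note: like the original snippet, this drops a trailing incomplete group.
    return values;
}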

Run-Time Check Failure #2 - Stack around the variable 'result' was corrupted

I have compiled this code and got a "Run-Time Check Failure #2 - Stack around the variable 'result' was corrupted" error. But when I changed the result array size from 2 to 4, the error disappeared. Can you explain why this happens?
Sorry if you find this question too basic.
#include "stdafx.h"
string get_cpu_name()
{
uint32_t data[4] = { 0 };
_asm
{
cpuid;
mov data[0], ebx;
mov data[4], edx;
mov data[8], ecx;
}
return string((const char *)data);
}
void assembler()
{
    cout << "CPU is " << get_cpu_name() << endl;

    float f1[] = { 1.f , 22.f };
    float f2[] = { 5.f , 3.f };
    float result[2] = { 0.f };
    /*float f1[] = { 1.f , 22.f , 1.f , 22.f };
    float f2[] = { 5.f , 3.f , 1.f , 22.f };
    float result[4] = { 0.f };*/

    _asm
    {
        movups xmm1, f1;
        movups xmm2, f2;
        mulps xmm1, xmm2;
        movups result, xmm1;
    }

    /*for (size_t i = 0; i < 4; i++)*/
    for (size_t i = 0; i < 2; i++)
    {
        cout << result[i] << "\t";
    }
    cout << endl;
}

int main()
{
    assembler();
    getchar();
    return 0;
}
The movups instruction writes 128 bits (16 bytes) to memory. You are writing this to the location of an 8-byte array (2*4 bytes, or 64 bits), so the 8 bytes after the array also get written.
You should either make sure there are at least 16 bytes of space for the result, or make sure to write fewer than 16 bytes.
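As an illustration of the second option (a sketch of mine, not from the original answer): with SSE intrinsics you can keep a 2-element result array and store only the low 8 bytes of the product:

#include <xmmintrin.h>  // SSE intrinsics

// Sketch: multiply two pairs of floats and store only the low two lanes,
// so an 8-byte result buffer is enough.
void mul2(const float f1[2], const float f2[2], float result[2])
{
    __m128 a = _mm_loadl_pi(_mm_setzero_ps(), (const __m64*)f1); // load 2 floats into the low half
    __m128 b = _mm_loadl_pi(_mm_setzero_ps(), (const __m64*)f2);
    __m128 p = _mm_mul_ps(a, b);
    _mm_storel_pi((__m64*)result, p);                            // write only 8 bytes
}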

SSE2 shift by vector

I've been trying to implement shift by vector with SSE2 intrinsics, but from experimentation and the Intel intrinsics guide it appears that only the least-significant part of the count vector is used.
To reword my question, given a vector {v1, v2, ..., vn} and a set of shifts {s1, s2, ..., sn}, how do I calculate a result {r1, r2, ..., rn} such that:
r1 = v1 << s1
r2 = v2 << s2
...
rn = vn << sn
since it appears that _mm_sll_epi* performs this:
r1 = v1 << s1
r2 = v2 << s1
...
rn = vn << s1
Thanks in advance.
EDIT:
Here's the code I have:
#include <iostream>
#include <cstdint>
#include <mmintrin.h>
#include <emmintrin.h>
using namespace std;

namespace SIMD {
    using namespace std;

    class SSE2 {
    public:
        // flipped operands due to function arguments
        SSE2(uint64_t a, uint64_t b, uint64_t c, uint64_t d) { low = _mm_set_epi64x(b, a); high = _mm_set_epi64x(d, c); }

        uint64_t& operator[](int idx)
        {
            switch (idx) {
            case 0:
                _mm_storel_epi64((__m128i*)result, low);
                return result[0];
            case 1:
                _mm_store_si128((__m128i*)result, low);
                return result[1];
            case 2:
                _mm_storel_epi64((__m128i*)result, high);
                return result[0];
            case 3:
                _mm_store_si128((__m128i*)result, high);
                return result[1];
            }
            /* Undefined behaviour */
            return result[0];
        }

        SSE2& operator<<=(const SSE2& rhs)
        {
            low  = _mm_sll_epi64(low,  rhs.getlow());
            high = _mm_sll_epi64(high, rhs.gethigh());
            return *this;
        }

        void print()
        {
            uint64_t a[2];
            _mm_store_si128((__m128i*)a, low);
            cout << hex;
            cout << a[0] << ' ' << a[1] << ' ';
            _mm_store_si128((__m128i*)a, high);
            cout << a[0] << ' ' << a[1] << ' ';
            cout << dec;
        }

        __m128i getlow() const
        {
            return low;
        }
        __m128i gethigh() const
        {
            return high;
        }
    private:
        __m128i low, high;
        uint64_t result[2];
    };
}

int main()
{
    cout << "operator<<= test: vector << vector: ";
    {
        auto x = SIMD::SSE2(7, 8, 15, 10);
        auto y = SIMD::SSE2(4, 5, 6, 7);
        x.print();
        y.print();
        x <<= y;
        if (x[0] != 112 || x[1] != 256 || x[2] != 960 || x[3] != 1280) {
            cout << "FAILED: ";
            x.print();
            cout << endl;
        } else {
            cout << "PASSED" << endl;
        }
    }
    return 0;
}
What should happen is {7 << 4 = 112, 8 << 5 = 256, 15 << 6 = 960, 10 << 7 = 1280}. The results seem to be {7 << 4 = 112, 8 << 4 = 128, 15 << 6 = 960, 10 << 6 = 640}, which isn't what I want.
Hope this helps, Jens.
If AVX2 is available and your elements are 32 or 64 bits, your operation takes one variable-shift instruction: vpsrlvq (__m128i _mm_srlv_epi64(__m128i a, __m128i count)).
For 32bit elements with SSE4.1, see Shifting 4 integers right by different values SIMD. Depending on latency vs. throughput requirements, you can do separate shifts and then blend, or use a multiply (by a specially-constructed vector of powers of 2) to get variable-count left shifts and then do a same-count-for-all-elements right shift.
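As a rough sketch of that multiply trick (my illustration, assuming SSE4.1 for _mm_mullo_epi32; reliable for per-lane counts 0..30, and I'd verify the count = 31 edge case separately):

#include <smmintrin.h>  // SSE4.1

// Sketch: per-lane variable left shift of 32-bit elements, v[i] << s[i],
// by building 2^s in each lane via the float exponent field and multiplying.
static inline __m128i sllv_epi32_sse41(__m128i v, __m128i s)
{
    // Bit pattern of float 2^s: exponent = 127 + s, mantissa = 0.
    __m128i pow2_bits = _mm_add_epi32(_mm_slli_epi32(s, 23),
                                      _mm_set1_epi32(0x3f800000));
    __m128i pow2 = _mm_cvttps_epi32(_mm_castsi128_ps(pow2_bits));
    return _mm_mullo_epi32(v, pow2);  // low 32 bits of v * 2^s == v << s
}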
For your case, 64bit elements with runtime-variable shift counts:
There are only two elements per SSE vector, so we just need two shifts and then combine the results (which we can do with a pblendw, with a floating-point movsd (which may cause extra bypass-delay latency on some CPUs), with two shuffles, or with two ANDs and an OR).
__m128i SSE2_emulated_srlv_epi64(__m128i a, __m128i count)
{
    __m128i shift_low  = _mm_srl_epi64(a, count);            // high 64 is garbage
    __m128i count_high = _mm_unpackhi_epi64(count, count);   // broadcast the high element
    __m128i shift_high = _mm_srl_epi64(a, count_high);       // low 64 is garbage

    // SSE4.1:
    // return _mm_blend_epi16(shift_low, shift_high, 0x0F);

#if 1   // use movsd to blend
    __m128d blended = _mm_move_sd( _mm_castsi128_pd(shift_high), _mm_castsi128_pd(shift_low) );  // use movsd as a blend. Faster than multiple instructions on most CPUs, but probably bad on Nehalem.
    return _mm_castpd_si128(blended);
#else   // SSE2 without using FP instructions:
    // if we're going to do it this way, we could have shuffled the input before shifting. Probably not helpful though.
    shift_high = _mm_unpackhi_epi64(shift_high, shift_high);  // broadcast the high64
    return _mm_unpacklo_epi64(shift_high, shift_low);         // combine
#endif
}
Other shuffles like pshufd or psrldq would work, but punpckhqdq gets the job done without needing an immediate byte, so it's one byte shorter. SSSE3 palignr could get the high element from one register and the low element from another register into one vector, but they'd be reversed (so we'd need a pshufd to swap high and low halves). shufpd would work to blend, but has no advantage over movsd.
See Agner Fog's microarch guide for the details of the potential bypass-delay latency from using an FP instruction between two integer instructions. It's probably fine on Intel SnB-family CPUs, because other FP shuffles are. (And yes, movsd xmm1, xmm0 runs on the shuffle unit in port5. Use movaps or movapd for reg-reg moves even of scalars if you don't need the merging behaviour).
This compiles (on Godbolt with gcc5.3 -O3) to
movdqa xmm2, xmm0 # tmp97, a
psrlq xmm2, xmm1 # tmp97, count
punpckhqdq xmm1, xmm1 # tmp99, count
psrlq xmm0, xmm1 # tmp100, tmp99
movsd xmm0, xmm2 # tmp102, tmp97
ret