Sum of the products of every two-element subset - C++

I have an array with the elements {7, 2, 1}, and the idea is to compute 7 * 2 + 7 * 1 + 2 * 1, which is basically this algorithm:
for (int i = 0; i < n-1; ++i)
    for (int k = i+1; k < n; ++k)
        sum += a[i] * a[k];
Here, a is the array holding the numbers and n is the number of elements. I need a more efficient algorithm for doing this, but I have no clue how to find one. Can someone give me a hand?
Thank you!

You can do better in the general case. Time to do some math. Let's look at the 3-element version; we have:
ab + ac + bc
= 1/2 * (2ab + 2ac + 2bc)
= 1/2 * (2ab + 2ac + 2bc + a^2 + b^2 + c^2 - (a^2 + b^2 + c^2))
= 1/2 * ((a+b+c)^2 - (a^2 + b^2 + c^2))
That is:
int sum = 0;
int sum_sq = 0;
for (int i : arr) {
    sum += i;
    sum_sq += i*i;
}
int result = (sum*sum - sum_sq) / 2;
This is O(n) multiplications instead of O(n^2), so it will certainly beat the naive implementation once n is large enough. Whether it's better for just 3 elements is something I haven't timed.
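As a quick check against the example from the question (my addition, using long long to guard against overflow for larger inputs): {7, 2, 1} gives sum = 10 and sum_sq = 54, so (100 - 54) / 2 = 23 = 7*2 + 7*1 + 2*1.

#include <iostream>
#include <vector>

int main() {
    std::vector<int> arr = {7, 2, 1};
    long long sum = 0, sum_sq = 0;
    for (int i : arr) {
        sum += i;
        sum_sq += (long long)i * i;  // widen before multiplying
    }
    std::cout << (sum * sum - sum_sq) / 2 << "\n";  // prints 23
    return 0;
}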

@chux's suggestion is essentially to redistribute the operations:
a_i * a_(i+1) + a_i * a_(i+2) + ... + a_i * a_n
-->
a_i * (a_(i+1) + ... + a_n)
combined with avoiding unnecessary recomputation of the partial sums (a_(i+1) + ... + a_n), by leveraging the fact that each differs from the next by the value of a single element of the input array.
Here's a one-pass implementation with O(1) overhead:
int psum(int n, const int array[]) {
    int result = 0;
    int rsum = array[n - 1];
    for (int i = n - 2; i >= 0; i--) {
        result += array[i] * rsum;
        rsum += array[i];
    }
    return result;
}
The sum of all elements to the right of index i is maintained from iteration to iteration in variable rsum. It's unnecessary to track its various values in an array, because we need each value only for one iteration of the loop.
This scales linearly with the number of elements in the input array. You'll see that the number and type of operations is quite similar to @Barry's answer, but nothing analogous to his final step is required, which saves a few operations.
As @Barry observes in comments, the iteration can also be run in the other direction, in conjunction with tracking the left-hand partial sums instead of the right-hand ones. That would diverge a bit more from @chux's description, but it relies on exactly the same principles.
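A sketch of that left-to-right variant (my illustration, assuming n >= 1; psum_forward is a made-up name): lsum tracks the running sum of the elements to the left of index i instead.

int psum_forward(int n, const int array[]) {
    int result = 0;
    int lsum = array[0];                 // sum of elements left of index i
    for (int i = 1; i < n; i++) {
        result += array[i] * lsum;
        lsum += array[i];
    }
    return result;
}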

We have (a + b + c + ...)^2 = (a^2 + b^2 + c^2 + ...) + 2(ab + bc + ca + ...)
You want the sum S = ab + bc + ca + ..., which takes O(n^2) operations over all pairs (using 2 nested loops).
Instead, you can run two separate loops: one calculates P = a^2 + b^2 + c^2 + ... in O(n) time, and the other calculates Q = (a + b + c + ...)^2, also in O(n) time. Then take S = (Q - P) / 2.

Make one pass: walk from the end of a[] to the front and form the sum of all the elements "to the right" of each index.
In a second pass, multiply a[i] * sums[i].
O(n).
#include <stdio.h>
#include <stdlib.h>

long sum0(int a[], int n) {
    long sum = 0;
    for (int i = 0; i < n - 1; ++i)
        for (int k = i + 1; k < n; ++k)
            sum += a[i] * a[k];
    return sum;
}

long sum1(int a[], int n) {
    long sums[n];
    sums[n - 1] = 0;
    for (int i = n - 2; i >= 0; i--) {
        sums[i] = a[i + 1] + sums[i + 1];
    }
    long sum = 0;
    for (int i = 0; i < n - 1; ++i)
        sum += a[i] * sums[i];
    return sum;
}

void test(int a[], int n) {
    long s0 = sum0(a, n);
    long s1 = sum1(a, n);
    if (s0 != s1) printf("%9ld %9ld\n", s0, s1);
}

void tests(int k) {
    while (k--) {
        int n = rand() % 10 + 2;
        int a[n + 1];
        for (int m = 0; m < n; m++)
            a[m] = rand() % 256;
        test(a, n);
    }
}

int main(void) {
    int a[3] = { 7, 2, 1 };
    printf("%ld\n", sum1(a, 3));
    tests(1000000);
    puts("Done");
}
As it turns out, the sums[] array is not needed either, as the running sum needs only one location. This effectively makes this answer similar to the others:
long sum1(int a[], int n) {
    long sums = 0;
    long sum = 0;
    for (int i = n - 2; i >= 0; i--) {
        sums = a[i + 1] + sums;
        sum += a[i] * sums;
    }
    return sum;
}
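As a quick check with the example {7, 2, 1}: the first iteration gives sums = 1 and sum = 2 * 1 = 2; the second gives sums = 2 + 1 = 3 and sum = 2 + 7 * 3 = 23, the expected result.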


Min sum of distances (absolute differences) between array element and set of k array elements

I need to find the minimum sum of the distances between an element of the array and a set of k other elements of the array, not including that index itself.
For example:
arr = {5, 7, 4, 9}
k = 2
min_sum(5) = |5-4| + |5-7| = 3
min_sum(7) = |7-9| + |7-5| = 4
min_sum(4) = |4-5| + |4-7| = 4
min_sum(9) = |9-7| + |9-5| = 6
So a naive solution would be to subtract the i-th element from every element of the array, sort the resulting absolute differences, and sum the k smallest of them (excluding the zero for index i itself). But that takes too long... I believe this is a DP problem or something like that (maybe treaps).
Input:
n - number of array elements
k - number of elements in a set
array
Constraints:
2 <= n <= 350 000
1 <= k < n
1 <= a[i] <= 10^9
time limit: 2 seconds
Input:
4
2
5 7 4 9
Output:
3 4 4 6
What is the most efficient way to solve this problem? How to optimize the search for the minimum sum?
This is my code in C++; it takes about 3 minutes for n = 350,000 and k = 150,000:
#include <bits/stdc++.h>
using namespace std;

int main() {
    int n, k, tp;
    unsigned long long temp;
    cin >> n >> k;
    vector<unsigned int> org;
    vector<unsigned int> a;
    vector<unsigned long long> cum(n, 0);
    //unordered_map<int, long long> ans;
    unordered_map<int, long long> mp;
    for (int i = 0; i < n; i++) {
        cin >> tp;
        org.push_back(tp);
        a.push_back(tp);
    }
    /*
    srand(time(0));
    for (int i = 0; i < n; i++) {
        org.push_back(rand());
        a.push_back(org[i]);
    }
    */
    sort(a.begin(), a.end());
    partial_sum(a.begin(), a.end(), cum.begin());
    mp[a[0]] = cum[k] - cum[0] - a[0] * k;
    //ans[a[0]] = mp[a[0]];
    for (int i = 1; i <= k; i++) {
        mp[a[i]] = a[i] * i - cum[i-1] + cum[k] - cum[i] - a[i] * (k-i);
    }
    for (int i = 1; i < n-k; i++) {
        for (int j = 0; j <= k; j++) {
            //if (ans.find(a[i+j]) != ans.end()) { continue; }
            temp = ( (a[i+j] * j) - (cum[i+j-1] - cum[i-1]) ) + ( cum[i+k] - cum[i+j] - a[i+j] * (k-j) );
            if (mp.find(a[i+j]) == mp.end()) { mp[a[i+j]] = temp; }
            else if (mp[a[i+j]] > temp) { mp[a[i+j]] = temp; }
            //else { ans[a[i+j]] = mp[a[i+j]]; }
        }
    }
    for (int i = 0; i < n; i++) {
        cout << mp[org[i]] << " ";
    }
    return 0;
}
We can solve this problem efficiently by taking the sliding window approach.
It seems safe to assume that there are no duplicates in the array. If it contains duplicates, we can simply discard them with the help of a HashSet.
The next step is to sort the array. This guarantees that the closest k elements to each index i will be within the window [i - k; i + k].
We will keep three variables for the window: left, right and currentSum. They will be adjusted accordingly at each iteration. Initially, left = 0 and right = k (since the element at index 0 doesn't have elements to its left), and currentSum = the result for index 0.
The key consideration is that the variables left and right are unlikely to change 'significantly' during the iteration. To be more precise, at each iteration we should attempt to move the window to the right by comparing the distances nums[i + right + 1] - nums[i] vs nums[i] - nums[i - left]. (You can prove mathematically that there is no point in trying to move the window to the left.) If the former is less than the latter, we increment right and decrement left while updating currentSum at the same time.
In order to recalculate currentSum, I would suggest writing down expressions for two adjacent iterations and looking closer at the difference between them.
For instance, if
result[i] = nums[i + 1] + ... + nums[i + right] - (nums[i - 1] + ... + nums[i - left]) + (left - right) * nums[i], then
result[i + 1] = nums[i + 2] + ... + nums[i + right] - (nums[i] + ... + nums[i - left]) + (left - right + 2) * nums[i + 1].
As we can see, these expressions are quite similar. The time complexity of this solution is O(n * log(n)) (my solution in Java for n ~ 500_000 and k ~ 400_000 runs within 300 ms). I hope this, together with the considerations above, helps you.
Assuming that we have sorted the original array nums and computed the mapping element -> its index in the sorted array (for instance, through binary search), we can proceed with finding the distances.
public long[] findMinDistances(int[] nums, int k) {
    long[] result = new long[nums.length];
    long currentSum = 0;
    for (int i = 1; i <= k; i++) {
        currentSum += nums[i];
    }
    result[0] = currentSum - (long) k * nums[0];

    int left = 0;
    int right = k;
    currentSum = result[0];
    for (int i = 1; i < nums.length; i++) {
        int current = nums[i];
        int previous = nums[i - 1];
        currentSum -= (long) (left - right) * previous;
        currentSum -= previous;
        if (right >= 1) {
            currentSum -= current;
            left++;
            right--;
        } else {
            currentSum += nums[i - 1 - left];
        }
        currentSum += (long) (left - right) * current;
        while (i + right + 1 < nums.length && i - left >= 0 &&
               nums[i + right + 1] - current < current - nums[i - left]) {
            currentSum += nums[i + right + 1] - current;
            currentSum -= current - nums[i - left];
            right++;
            left--;
        }
        result[i] = currentSum;
    }
    return result;
}
For every element e in the original array its minimal sum of distances will be result[mapping.get(e)].
I think this one is better:
Sort the array first; then you can use the following fact:
for every element i in the array, the k minimum distances to the other elements are the distances to the elements within k places around it in the sorted array
(of course, they may all lie to its right, all to its left, or some on each side).
So for every element i, to calculate min_sum(a[i]), do this (a sketch in C++ follows below):
First, min_sum(a[i]) = 0.
Then walk with two indexes; let's mark them r (to the right of i) and l (to the left of i),
and compare the distance |a[i] - a[r]| with the distance |a[i] - a[l]|.
Add the smaller one to min_sum(a[i]); if it was the right one, then
increase index r, and if it was the left one, then decrease index l.
Of course, if l reaches 0 or r reaches n, you must take the remaining distances from elements on the other side.
Either way, continue until you have summed k distances, and that's it.
This way you don't sort anything but the main array.
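Here is a minimal sketch of that two-pointer walk (my illustration; min_sum_at is a made-up name). It assumes the array is already sorted and k < n, and costs O(k) per element, so O(n*k) overall:

#include <vector>

// Sum of the k smallest distances from a[i] to other elements of the
// sorted array a, walking outward with two pointers l and r.
long long min_sum_at(const std::vector<long long>& a, int i, int k) {
    int n = (int)a.size();
    int l = i - 1, r = i + 1;
    long long sum = 0;
    for (int taken = 0; taken < k; ++taken) {
        bool haveL = (l >= 0), haveR = (r < n);
        long long dl = haveL ? a[i] - a[l] : 0;  // distance to left neighbour
        long long dr = haveR ? a[r] - a[i] : 0;  // distance to right neighbour
        if (haveL && (!haveR || dl <= dr)) { sum += dl; --l; }  // left is closer
        else                               { sum += dr; ++r; }  // right is closer
    }
    return sum;
}

For the sorted example {4, 5, 7, 9} with k = 2, min_sum_at(a, 0, 2) walks right twice and returns |4-5| + |4-7| = 4, matching min_sum(4) above.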

How is an array of pair<double,double> 2 times faster than two arrays of double? - C++

#include <iostream>
#include <chrono>
#include <random>
#include <time.h>
using namespace std;

typedef pair<double,double> pd;
#define x first
#define y second
#define cell(i,j,w) ((i)*(w) + (j))

class MyTimer
{
private:
    std::chrono::time_point<std::chrono::steady_clock> starter;
    std::chrono::time_point<std::chrono::steady_clock> ender;

public:
    void startCounter() {
        starter = std::chrono::steady_clock::now();
    }

    long long getCounter() {
        ender = std::chrono::steady_clock::now();
        return std::chrono::duration_cast<std::chrono::milliseconds>(ender - starter).count();
    }
};

int main()
{
    const int n = 5000;
    int* value1 = new int[(n + 1) * (n + 1)];
    int* value2 = new int[(n + 1) * (n + 1)];
    double* a = new double[(n + 1) * (n + 1)];
    double* b = new double[(n + 1) * (n + 1)];
    pd* packed = new pd[(n + 1) * (n + 1)];
    MyTimer timer;

    for (int i = 1; i <= n; i++)
        for (int j = 1; j <= n; j++) {
            value1[cell(i, j, n + 1)] = rand() % 5000;
            value2[cell(i, j, n + 1)] = rand() % 5000;
        }

    for (int i = 1; i <= n; i++) {
        a[cell(i, 0, n + 1)] = 0;
        a[cell(0, i, n + 1)] = 0;
        b[cell(i, 0, n + 1)] = 0;
        b[cell(0, i, n + 1)] = 0;
        packed[cell(i, 0, n + 1)] = pd(0, 0);
        packed[cell(0, i, n + 1)] = pd(0, 0);
    }

    for (int tt = 1; tt <= 5; tt++)
    {
        timer.startCounter();
        for (int i = 1; i <= n; i++)
            for (int j = 1; j <= n; j++) {
                // packed[i][j] = packed[i-1][j] + packed[i][j-1] - packed[i-1][j-1] + value1[i][j]
                packed[cell(i, j, n + 1)].x = packed[cell(i - 1, j, n + 1)].x + packed[cell(i, j - 1, n + 1)].x - packed[cell(i - 1, j - 1, n + 1)].x + value1[cell(i, j, n + 1)];
                packed[cell(i, j, n + 1)].y = packed[cell(i - 1, j, n + 1)].y + packed[cell(i, j - 1, n + 1)].y - packed[cell(i - 1, j - 1, n + 1)].y + value1[cell(i, j, n + 1)] * value1[cell(i, j, n + 1)];
            }
        cout << "Time packed = " << timer.getCounter() << "\n";

        timer.startCounter();
        for (int i = 1; i <= n; i++)
            for (int j = 1; j <= n; j++) {
                // a[i][j] = a[i-1][j] + a[i][j-1] - a[i-1][j-1] + value2[i][j];
                // b[i][j] = b[i-1][j] + b[i][j-1] - b[i-1][j-1] + value2[i][j] * value2[i][j];
                a[cell(i, j, n + 1)] = a[cell(i - 1, j, n + 1)] + a[cell(i, j - 1, n + 1)] - a[cell(i - 1, j - 1, n + 1)] + value2[cell(i, j, n + 1)];
                b[cell(i, j, n + 1)] = b[cell(i - 1, j, n + 1)] + b[cell(i, j - 1, n + 1)] - b[cell(i - 1, j - 1, n + 1)] + value2[cell(i, j, n + 1)] * value2[cell(i, j, n + 1)];
            }
        cout << "Time separate = " << timer.getCounter() << "\n\n";
    }

    delete[] value1;
    delete[] value2;
    delete[] a;
    delete[] b;
    delete[] packed;
}
So I'm computing a 2D prefix table (summed-area table), and I noticed the property in the title.
When compiling with the CUDA nvcc compiler (with -O2) from the command line, or in Visual Studio Release mode, the packed version is 2x faster on the first run (separate takes 200 ms, packed takes 100 ms), but only 25% faster on subsequent runs (because value2[] is cached after the first loop). In my actual program with more calculation steps (computing the SAT is just step 1), it's always 2x faster, since value1[] and value2[] have definitely been evicted from cache by then.
I know the packed array is faster because modern Intel CPUs read 32-64 bytes into cache at once, so by packing both arrays together, both values arrive in one main-memory (RAM) access instead of two. But why is the speedup so high? Along with the memory accesses, the CPU still has to perform 6 additions, 2 subtractions, and 1 multiplication per iteration. A 2x speedup from halving memory accesses is 100% improvement efficiency (Amdahl's law), the same as if those add/multiply operations didn't exist. How is that possible?
I'm certain it has something to do with CPU pipelining, but I can't explain it more thoroughly. Can anyone explain this further in terms of instruction latency/memory access latency/assembly? Thank you.
The code doesn't use the GPU at all, so any other good compiler should give the same 2x speedup as nvcc. With g++ 9.3.0 (g++ file.cpp -O2 -std=c++11 -o file.exe), it's also a 2x speedup. The CPU is an Intel i7-7700.
I've run this program here and here2 with command-line arguments -O2 -std=c++11, and it also shows a 1.5-2x speedup. Use n = 3000; any bigger and it won't run (free VM service, after all). So it's not just my computer.
The answer lies in the access latency of the different levels of the memory hierarchy, from L1 cache down to main memory (RAM).
Data in L1 cache takes ~5 cycles to access, while data from RAM takes 50-100 cycles. Meanwhile, add/sub/mult operations take 3-5 cycles.
Therefore, the dominating limiter of performance is main memory access. By cutting the number of main-memory requests in half, performance almost doubles.
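As a rough back-of-envelope model (illustrative numbers, not measurements): if an iteration needs ~10 cycles of arithmetic plus ~100 cycles per cache-missing load, then two misses per iteration cost ~210 cycles while one miss costs ~110. Halving the memory traffic therefore nearly halves the runtime, and the arithmetic stays hidden behind the stalls either way.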

MaxDoubleSliceSum Codility Algorithm

I stumbled upon this problem on Codility Lessons, here is the description:
A non-empty zero-indexed array A consisting of N integers is given.
A triplet (X, Y, Z), such that 0 ≤ X < Y < Z < N, is called a double slice.
The sum of double slice (X, Y, Z) is the total of A[X + 1] + A[X + 2] + ... + A[Y − 1] + A[Y + 1] + A[Y + 2] + ... + A[Z − 1].
For example, array A such that:
A[0] = 3
A[1] = 2
A[2] = 6
A[3] = -1
A[4] = 4
A[5] = 5
A[6] = -1
A[7] = 2
contains the following example double slices:
double slice (0, 3, 6), sum is 2 + 6 + 4 + 5 = 17,
double slice (0, 3, 7), sum is 2 + 6 + 4 + 5 − 1 = 16,
double slice (3, 4, 5), sum is 0.
The goal is to find the maximal sum of any double slice.
Write a function:
int solution(vector<int> &A);
that, given a non-empty zero-indexed array A consisting of N integers, returns the maximal sum of any double slice.
For example, given:
A[0] = 3
A[1] = 2
A[2] = 6
A[3] = -1
A[4] = 4
A[5] = 5
A[6] = -1
A[7] = 2
the function should return 17, because no double slice of array A has a sum of greater than 17.
Assume that:
N is an integer within the range [3..100,000];
each element of array A is an integer within the range [−10,000..10,000].
Complexity:
expected worst-case time complexity is O(N);
expected worst-case space complexity is O(N), beyond input storage (not counting the storage required for input arguments).
Elements of input arrays can be modified.
I have already read about the algorithm that counts the MaxSum starting at index i and ending at index i, but I don't know why my approach sometimes gives bad results. The idea is to compute the MaxSum ending at index i, omitting the minimum value in the range 0..i. Here is my code:
int solution(vector<int> &A) {
    int n = A.size();
    int end = 2;
    int ret = 0;
    int sum = 0;
    int min = A[1];
    while (end < n-1)
    {
        if (A[end] < min)
        {
            sum = max(0, sum + min);
            ret = max(ret, sum);
            min = A[end];
            ++end;
            continue;
        }
        sum = max(0, sum + A[end]);
        ret = max(ret, sum);
        ++end;
    }
    return ret;
}
I would be glad if you could help me point out the flaw!
My solution is based on a bidirectional Kadane's algorithm. More details on my blog here. It scores 100/100.
public int solution(int[] A) {
    int N = A.length;
    int[] K1 = new int[N];
    int[] K2 = new int[N];

    for (int i = 1; i < N-1; i++) {
        K1[i] = Math.max(K1[i-1] + A[i], 0);
    }
    for (int i = N-2; i > 0; i--) {
        K2[i] = Math.max(K2[i+1] + A[i], 0);
    }

    int max = 0;
    for (int i = 1; i < N-1; i++) {
        max = Math.max(max, K1[i-1] + K2[i+1]);
    }
    return max;
}
Here is my code:
int get_max_sum(const vector<int>& a) {
    int n = a.size();
    vector<int> best_pref(n);
    vector<int> best_suf(n);

    // Compute the best sum among all x values assuming that y = i.
    int min_pref = 0;
    int cur_pref = 0;
    for (int i = 1; i < n - 1; i++) {
        best_pref[i] = max(0, cur_pref - min_pref);
        cur_pref += a[i];
        min_pref = min(min_pref, cur_pref);
    }

    // Compute the best sum among all z values assuming that y = i.
    int min_suf = 0;
    int cur_suf = 0;
    for (int i = n - 2; i > 0; i--) {
        best_suf[i] = max(0, cur_suf - min_suf);
        cur_suf += a[i];
        min_suf = min(min_suf, cur_suf);
    }

    // Check all y values (y = i) and return the answer.
    int res = 0;
    for (int i = 1; i < n - 1; i++)
        res = max(res, best_pref[i] + best_suf[i]);
    return res;
}

int get_max_sum_dummy(const vector<int>& a) {
    // Try all possible values of x, y and z.
    int res = 0;
    int n = a.size();
    for (int x = 0; x < n; x++)
        for (int y = x + 1; y < n; y++)
            for (int z = y + 1; z < n; z++) {
                int cur = 0;
                for (int i = x + 1; i < z; i++)
                    if (i != y)
                        cur += a[i];
                res = max(res, cur);
            }
    return res;
}

bool test() {
    // Generate a lot of small test cases and compare the output of
    // a brute force and the actual solution.
    bool ok = true;
    for (int test = 0; test < 10000; test++) {
        int size = rand() % 20 + 3;
        vector<int> a(size);
        for (int i = 0; i < size; i++)
            a[i] = rand() % 20 - 10;
        if (get_max_sum(a) != get_max_sum_dummy(a))
            ok = false;
    }
    for (int test = 0; test < 10000; test++) {
        int size = rand() % 20 + 3;
        vector<int> a(size);
        for (int i = 0; i < size; i++)
            a[i] = rand() % 20;
        if (get_max_sum(a) != get_max_sum_dummy(a))
            ok = false;
    }
    return ok;
}
The actual solution is the get_max_sum function (the other two are a brute-force solution and a tester function that generates random arrays and compares the outputs of the brute force and the actual solution; I used them for testing purposes only).
The idea behind my solution is to compute the maximum sum of a subarray that starts somewhere before i and ends at i - 1, and then do the same thing for suffixes (best_pref[i] and best_suf[i], respectively). After that, I just iterate over all i and return the best value of best_pref[i] + best_suf[i]. It works correctly because best_pref[y] finds the best x for a fixed y, best_suf[y] finds the best z for a fixed y, and all possible values of y are checked.
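As a quick sanity check (my addition, assuming the functions above are in the same file with the usual includes and using namespace std), calling get_max_sum on the array from the problem statement prints 17:

#include <cstdio>
#include <vector>

int main() {
    std::vector<int> a = {3, 2, 6, -1, 4, 5, -1, 2};
    printf("%d\n", get_max_sum(a));  // expected output: 17
    return 0;
}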
def solution(A):
    n = len(A)
    K1 = [0] * n
    K2 = [0] * n
    for i in range(1, n-1, 1):
        K1[i] = max(K1[i-1] + A[i], 0)
    for i in range(n-2, 0, -1):
        K2[i] = max(K2[i+1] + A[i], 0)
    maximum = 0
    for i in range(1, n-1, 1):
        maximum = max(maximum, K1[i-1] + K2[i+1])
    return maximum

def main():
    A = [3, 2, 6, -1, 4, 5, -1, 2]
    print(solution(A))

if __name__ == '__main__':
    main()
Ruby 100%
def solution(a)
  max_starting = (a.length - 2).downto(0).each.inject([[], 0]) do |(acc, max), i|
    [acc, acc[i] = [0, a[i] + max].max]
  end.first

  max_ending = 1.upto(a.length - 3).each.inject([[], 0]) do |(acc, max), i|
    [acc, acc[i] = [0, a[i] + max].max]
  end.first

  max_ending.each_with_index.inject(0) do |acc, (el, i)|
    [acc, el.to_i + max_starting[i + 2].to_i].max
  end
end

ith order Selection in O(n) time

From "Introduction to algorithms" I am trying to implement the code as dividing n elements in n/5 groups, then recursively finding the median and then recursively finding the ith order statistics. Here is My code
bool isTrue(int *a, int *b)
{
    if ((*a) < (*b))
        swap(*a, *b);
    return *a < *b;
}

int select(int *A, int n, int p)
{
    int *B[(n / 5) + 2];
    cout << (n / 5) + 2 << endl;
    int i, j;
    for (i = 1, j = 1; i <= (n - 5); i += 5, ++j)
    {
        sort(A + i, A + i + 5);
        B[j] = A + i + 2;
    }
    sort(A + i, A + n + 1);
    B[j] = A + i + (n - i) / 2;
    sort(B + 1, B + (n / 5) + 2, isTrue);
}
this is as far as I can go. Now I am trying to find the median of B, then use B[median] - A as the pivot, but it doesn't seem right. The book says to recursively find the median of medians; I can't grasp that. Any help?
Edit: I also noticed that in the wiki they didn't use any recursion!
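For what it's worth, here is a minimal sketch of the recursive structure the book describes (my own illustration, not a tuned implementation; mom_select is a made-up name, and copying sub-vectors keeps it simple at the cost of extra allocation). The key point is that the function calls itself on the array of group medians to obtain the pivot, then again on one side of the partition:

#include <algorithm>
#include <vector>

// Returns the i-th smallest element (0-based) of a.
int mom_select(std::vector<int> a, int i) {
    int n = (int)a.size();
    if (n <= 5) {
        std::sort(a.begin(), a.end());
        return a[i];
    }
    // 1. Median of each group of 5 (the last group may be smaller).
    std::vector<int> medians;
    for (int g = 0; g < n; g += 5) {
        int end = std::min(g + 5, n);
        std::sort(a.begin() + g, a.begin() + end);
        medians.push_back(a[g + (end - g) / 2]);
    }
    // 2. Recurse on the medians to get the median of medians (the pivot).
    int pivot = mom_select(medians, (int)medians.size() / 2);
    // 3. Partition around the pivot, then recurse into the side holding i.
    std::vector<int> lo, hi;
    int eq = 0;
    for (int v : a) {
        if (v < pivot) lo.push_back(v);
        else if (v > pivot) hi.push_back(v);
        else ++eq;
    }
    if (i < (int)lo.size()) return mom_select(lo, i);
    if (i < (int)lo.size() + eq) return pivot;
    return mom_select(hi, i - (int)lo.size() - eq);
}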

OpenCV Sum of squared differences speed

I've been using OpenCV to do some block matching, and I've noticed that its sum-of-squared-differences code is very fast compared to a straightforward for loop like this:
int SSD = 0;
for (int i = 0; i < arraySize; i++)
    SSD += (array1[i] - array2[i]) * (array1[i] - array2[i]);
If I look at the source code to see where the heavy lifting happens, the OpenCV folks have their for loops do 4 squared-difference calculations per iteration. The function that does the block matching looks like this:
int64
icvCmpBlocksL2_8u_C1( const uchar * vec1, const uchar * vec2, int len )
{
    int i, s = 0;
    int64 sum = 0;

    for( i = 0; i <= len - 4; i += 4 )
    {
        int v = vec1[i] - vec2[i];
        int e = v * v;

        v = vec1[i + 1] - vec2[i + 1];
        e += v * v;
        v = vec1[i + 2] - vec2[i + 2];
        e += v * v;
        v = vec1[i + 3] - vec2[i + 3];
        e += v * v;
        sum += e;
    }

    for( ; i < len; i++ )
    {
        int v = vec1[i] - vec2[i];
        s += v * v;
    }

    return sum + s;
}
This calculation is for unsigned 8 bit integers. They perform a similar calculation for 32-bit floats in this function:
double
icvCmpBlocksL2_32f_C1( const float *vec1, const float *vec2, int len )
{
    double sum = 0;
    int i;

    for( i = 0; i <= len - 4; i += 4 )
    {
        double v0 = vec1[i] - vec2[i];
        double v1 = vec1[i + 1] - vec2[i + 1];
        double v2 = vec1[i + 2] - vec2[i + 2];
        double v3 = vec1[i + 3] - vec2[i + 3];

        sum += v0 * v0 + v1 * v1 + v2 * v2 + v3 * v3;
    }

    for( ; i < len; i++ )
    {
        double v = vec1[i] - vec2[i];
        sum += v * v;
    }
    return sum;
}
I was wondering if anyone has any idea whether breaking a loop up into chunks of 4 like this might speed up code? I should add that there is no multithreading occurring in this code.
My guess is that this is just a simple implementation of loop unrolling - it saves 3 additions and 3 compares on each pass of the loop, which can be a great savings if, for example, checking len involves a cache miss. The downside is that this optimization adds code complexity (e.g. the additional for loop at the end to handle the len % 4 items left over when the length is not evenly divisible by 4) and, of course, it's an architecture-dependent optimization whose magnitude of improvement will vary by hardware/compiler/etc...
Still, it's straightforward to follow compared to most optimizations and will probably result in some sort of performance increase regardless of the architecture, so it's low risk to just throw it in there and hope for the best. Since OpenCV is such a well-supported chunk of code, I'm sure that someone instrumented these chunks of code and found them to be well worth it - as you yourself have done.
There is one obvious optimisation of your code, viz:
int SSD = 0;
for (int i = 0; i < arraySize; i++)
{
    int v = array1[i] - array2[i];
    SSD += v * v;
}
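And here is a sketch combining both ideas above (my illustration; ssd_unrolled is a made-up name): unroll by four and keep the squared differences in separate accumulators so the additions can overlap in the pipeline, with a tail loop for the arraySize % 4 leftovers.

// Unrolled-by-4 SSD with four independent accumulators; the tail loop
// handles the last arraySize % 4 elements.
int ssd_unrolled(const int* array1, const int* array2, int arraySize) {
    int s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    int i = 0;
    for (; i + 4 <= arraySize; i += 4) {
        int v0 = array1[i]     - array2[i];
        int v1 = array1[i + 1] - array2[i + 1];
        int v2 = array1[i + 2] - array2[i + 2];
        int v3 = array1[i + 3] - array2[i + 3];
        s0 += v0 * v0;
        s1 += v1 * v1;
        s2 += v2 * v2;
        s3 += v3 * v3;
    }
    int SSD = s0 + s1 + s2 + s3;
    for (; i < arraySize; i++) {
        int v = array1[i] - array2[i];
        SSD += v * v;
    }
    return SSD;
}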