I am making a test program to measure the storage (insertion) time for each container. The following is my test code.
#include <list>
#include <vector>
#include <iostream>
#include <iomanip>
#include <string>
#include <ctime>
#include <cstdlib>
using namespace std;
void insert(list<short>& l, const short& value);
void insert(vector<short>& v, const short& value);
void insert(short arr[], int& logicalSize, const int& physicalSize, const short& value);
int main() {
clock_t start, end;
srand(time(nullptr));
const int SIZE = 50000;
const short RANGE = 10000;
list<short> l;
vector<short> v;
short* arr = new short[SIZE];
int logicalSize = 0;
// array
start = clock();
cout << "Array storage time test...";
for (int i = 0; i < SIZE; i++) {
try {
insert(arr, logicalSize, SIZE, (short)(rand() % (2 * RANGE + 1) - RANGE));
} catch (string s) {
cout << s << endl;
system("pause");
exit(-1);
}
}
end = clock();
cout << "Time: " << difftime(end, start) << endl << endl;
// list
cout << "List storage time test...";
start = clock();
for (int i = 0; i < SIZE; i++) {
insert(l, (short)(rand() % (2 * RANGE + 1) - RANGE));
}
end = clock();
cout << "Time: " << difftime(end, start) << endl << endl;
// vector
cout << "Vector storage time test...";
start = clock();
for (int i = 0; i < SIZE; i++) {
insert(v, (short)(rand() % (2 * RANGE + 1) - RANGE));
}
end = clock();
cout << "Time: " << difftime(end, start) << endl << endl;
delete[] arr;
system("pause");
return 0;
}
void insert(list<short>& l, const short& value) {
for (auto it = l.begin(); it != l.end(); it++) {
if (value < *it) {
l.insert(it, value);
return;
}
}
l.push_back(value);
}
void insert(vector<short>& v, const short& value) {
for (auto it = v.begin(); it != v.end(); it++) {
if (value < *it) {
v.insert(it, value);
return;
}
}
v.push_back(value);
}
void insert(short arr[], int& logicalSize, const int& physicalSize, const short& value) {
if (logicalSize == physicalSize) throw string("No spaces in array.");
for (int i = 0; i < logicalSize; i++) {
if (value < arr[i]) {
for (int j = logicalSize - 1; j >= i; j--) {
arr[j + 1] = arr[j];
}
arr[i] = value;
logicalSize++;
return;
}
}
arr[logicalSize] = value;
logicalSize++;
}
However, when I execute the code, the result seems a little different from the theory. The list should be fastest, but the results say that insertion into the list is slowest. Can you tell me why?
Inserting into a vector or array requires moving everything after the insertion point, so inserting at a random spot requires an average of 1.5 accesses per element: 0.5 to find the spot, and 0.5 * 2 (a read and a write) to do the insert.
Inserting into a list requires 0.5 accesses per element (to find the spot).
This means the vector only does 3 times as many element accesses.
List nodes are 5 to 9 times larger than vector "nodes" (which are just the elements). Forward iteration requires reading 3 to 5 times as much memory (a 16-bit element plus a 32- to 64-bit next pointer).
So the list solution reads/writes more memory! Worse, it is sparser (because of the back pointer), and it may not be laid out in a cache-friendly way (vectors are contiguous; list nodes may be scattered all over the address space), which interferes with the CPU's cache prediction and prefetching.
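To put rough numbers on the node size, here is a small sketch of my own (the exact node layout is implementation-specific; mainstream standard libraries store the element plus a prev and a next pointer per node):

#include <iostream>

int main() {
    // A vector<short> stores just the elements, back to back.
    std::cout << "vector<short> bytes per element: " << sizeof(short) << '\n';
    // A list<short> node holds the element plus two pointers (prev/next),
    // usually rounded up further by alignment/padding.
    std::cout << "approx. list<short> bytes per node: "
              << sizeof(short) + 2 * sizeof(void*) << " (plus padding)\n";
}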
List is very rarely faster than vector; you have to be inserting/deleting many times more often than you iterate over the list.
Finally, vector grows exponentially and keeps reserved unused space, while list allocates a node every time. Calling new is slow, and asking for bigger chunks is often not much slower than asking for smaller ones. Growing a vector by 1 element at a time, 1000 times, results in about 15 allocations (give or take); for a list, it is 1000 allocations.
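If you want to see that allocation difference for yourself, a minimal sketch like the following (my code, not the asker's) counts how often a vector actually reallocates while growing one element at a time; a list built the same way would hit the allocator once per node:

#include <iostream>
#include <vector>

int main() {
    std::vector<short> v;
    std::size_t reallocations = 0;
    std::size_t lastCapacity = v.capacity();
    for (int i = 0; i < 1000; ++i) {
        v.push_back(static_cast<short>(i));
        if (v.capacity() != lastCapacity) {  // capacity change == fresh allocation
            ++reallocations;
            lastCapacity = v.capacity();
        }
    }
    // Typically prints a number in the low teens; a list<short> built the
    // same way would perform 1000 node allocations.
    std::cout << "reallocations for 1000 push_backs: " << reallocations << '\n';
}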
Insertion in a list is blisteringly fast, but first you have to find where you want to insert. This is where the list comes out the loser.
It might be helpful to stop and read Why is it faster to process a sorted array than an unsorted array? sometime around now because it covers similar material and covers it really well.
With a vector or array each element comes one after the next. Prediction is dead easy, so the CPU can be loading the cache with values you won't need for a while at the same time as it is processing the current value.
With a list predictability is shot, you have to get the next node before you can load the node after that, and that pretty much nullifies the cache. Without the cache you can see an order of magnitude degradation in performance as the CPU sits around waiting for data to be retrieved from RAM.
Bjarne Stroustrup has a number of longer pieces on this topic. The keynote video is definitely worth watching.
One important take-away is to take Big-O notation with a grain of salt, because it measures the efficiency of the algorithm, not how well the algorithm takes advantage of the hardware.
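As a rough way to see the cache effect in isolation, here is a minimal traversal-only benchmark of my own (not from this answer); on typical hardware the contiguous vector sum runs several times faster than the pointer-chasing list sum, even though both are O(n):

#include <chrono>
#include <iostream>
#include <list>
#include <numeric>
#include <vector>

int main() {
    const int N = 5000000;
    std::vector<int> v(N, 1);
    std::list<int> l(v.begin(), v.end());   // same values, one node per element

    auto t0 = std::chrono::steady_clock::now();
    long long vectorSum = std::accumulate(v.begin(), v.end(), 0LL);
    auto t1 = std::chrono::steady_clock::now();
    long long listSum = std::accumulate(l.begin(), l.end(), 0LL);
    auto t2 = std::chrono::steady_clock::now();

    std::cout << "vector sum " << vectorSum << " took "
              << std::chrono::duration<double>(t1 - t0).count() << " s\n";
    std::cout << "list   sum " << listSum << " took "
              << std::chrono::duration<double>(t2 - t1).count() << " s\n";
}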
As cppreference says
Lists are sequence containers that allow constant time insert and erase operations anywhere within the sequence, and iteration in both directions.
Since std::vector uses contiguous memory, erase should take linear time, so it seems reasonable that random erase operations on std::list should be more efficient than on std::vector.
But my program shows otherwise.
#include <chrono>
#include <cstdlib>
#include <ctime>
#include <iostream>
#include <iterator>
#include <list>
#include <vector>
using namespace std;

int randi(int min, int max) {
return rand() % (max - min) + min; // generate a number in [min, max)
}
int main() {
srand(time(NULL)); // Seed the time
int N = 100000;
int M = N-2;
int arr[N];
for (int i = 0; i < N; i++) {
arr[i] = i;
}
list<int> ls(arr, arr+N);
vector<int> vec(arr, arr+N);
std::chrono::time_point<std::chrono::system_clock> start, end;
start = std::chrono::system_clock::now();
for (int i = 0; i < M; i++) {
int j = randi(0, N - i);
ls.erase(next(ls.begin(), j));
}
end = std::chrono::system_clock::now();
std::chrono::duration<double> elapsed_seconds_1 = end - start;
cout << "list time cost: " << elapsed_seconds_1.count()) << "\n";
for (int i = 0; i < M; i++) {
int j = randi(0, N - i);
vec.erase(vec.begin() + j);
}
end = std::chrono::system_clock::now();
std::chrono::duration<double> elapsed_seconds_2 = end - start;
cout << "vector time cost: " << elapsed_seconds_2.count()) << "\n";
return 0;
}
~/cpp_learning/list$ ./demo
list time cost: 8.114993171
vector time cost: 8.306458676
Because it takes a long time to find the element in the list. Insertion or removal from a list is O(1) if you already hold an iterator to the desired insertion/deletion location. In this case you don't, and the std::next(ls.begin(), j) call is doing O(n) work, eliminating all the savings from the cheap O(1) erase. (Frankly, I'm a little surprised it didn't lose to the vector; I'd expect O(n) pointer-chasing operations to cost more than an O(n) contiguous memmove-like operation, but what do I know?)
Update: On checking, you forgot to save a new start point before the vector test, and in fact, once you fix that issue, the vector is much faster, so my intuition was correct there: Try it online!
With -std=c++17 -O3, output was:
list time cost: 9.63976
vector time cost: 0.191249
Similarly, the vector is cheap to get to the relevant index (O(1)), but (relatively) expensive to delete it (O(n) copy-down operation after).
If you won't be iterating the container otherwise, a list won't save you anything when you're performing random-access insertions and deletions. Situations like that call for std::unordered_map and related containers.
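To illustrate the "already hold an iterator" case from the answer above, here is a small sketch of my own (not from the answer) where each erase is O(1) because the traversal supplies the iterator; the whole pass is a single O(n) walk with no std::next searching:

#include <iostream>
#include <list>

int main() {
    std::list<int> ls;
    for (int i = 0; i < 10; ++i)
        ls.push_back(i);

    // Remove every odd value while walking the list once.
    for (auto it = ls.begin(); it != ls.end(); ) {
        if (*it % 2 == 1)
            it = ls.erase(it);   // O(1); erase returns the next valid iterator
        else
            ++it;
    }

    for (int x : ls)
        std::cout << x << ' ';   // prints: 0 2 4 6 8
    std::cout << '\n';
}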
Why does it appear that this quicksort algorithm is faster than std::sort? I've checked to make sure it's actually sorting the array. I've also replaced both sorting calls with hollow for loops of the same number of iterations to test the timing benchmark and everything checked out there.
I'm also wondering what adjustments I can make to the quicksort to allow it to recurse more times. Maybe some sort of variable memory management?
#include <iostream>
#include <vector>
#include <algorithm>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <chrono>
using namespace std;
void quickSort(int*, int);
void fillRandom(int*, int,int b2);
int main() {
//setup arrays
int size = 100000;
auto myints = new int[size];
auto myints2 = new int[size];
fillRandom(myints, size,10000);
std::copy(myints, myints + size, myints2);
//measurement 1
auto t1 = std::chrono::high_resolution_clock::now();
quickSort(myints, size);
auto t2 = std::chrono::high_resolution_clock::now();
auto duration = std::chrono::duration_cast<std::chrono::microseconds>(t2 - t1).count();
std::cout << endl << "Execution 1 took: "<< duration << endl;
//measurement 2
t1 = std::chrono::high_resolution_clock::now();
std::sort(myints2,myints2+size);
t2 = std::chrono::high_resolution_clock::now();
duration = std::chrono::duration_cast<std::chrono::microseconds>(t2 - t1).count();
std::cout << endl << "Execution 2 took: " << duration << endl;
cout << "finished!";
return 1;
}
void fillRandom(int* p, int size,int upTo) {
srand(time(0));
for (int i = 0;i < size;i++) {
p[i] = rand() % upTo + 1;
}
}
void quickSortSwap(int *p1, int*p2) {
int temp = *p1;
*p1 = *p2;
*p2 = temp;
}
void quickSort(int* original, int len) {
int split = *original;
int greaterIndex = len - 1;
int lesserIndex = 1;
int* currentP;
//rearrange stuff so smaller is left, bigger is right
for (int i = 1;i < len;i++) {
currentP = original + lesserIndex;
//cout << *currentP << " compared to " << split << endl;
if (*currentP <= split) {
lesserIndex++;
}
else {
//cout << "greater: " << *currentP <<endl;
quickSortSwap(currentP, original + greaterIndex);
greaterIndex--;
}
}
//uhh, now we switch pivot element with the right most left side element. Adjust our left side length measurement accordingly.
lesserIndex--;
quickSortSwap(original, original + lesserIndex);
greaterIndex++;
//this point
if (lesserIndex > 1) {
quickSort(original, lesserIndex);
}
int greater_range = len - greaterIndex;
if (greater_range > 1) {
quickSort(original + greaterIndex, greater_range);
}
}
https://rextester.com/AOPBP48224
Visual Studio's std::sort has some overhead and some optimizations that your program does not. Your program is based on the Lomuto partition scheme, while std::sort is a single-pivot, 3-partition Hoare-like quicksort plus insertion sort for small partitions. The 3 partitions are elements < pivot, elements == pivot, and elements > pivot. If there are no duplicate values, the 3-partition sort is just overhead. If there are duplicate values, then as the number of duplicates increases, Lomuto gets worse, while Hoare and std::sort get better. Try using fillRandom(myints, size, 10); and you should see a large performance hit with the Lomuto method, and an increase in performance with std::sort().
Visual Studio's std::sort uses median of nine if >= 40 elements and median of 3 for 33 to 39 elements, which reduces the probability of worst-case scenarios, and it switches to insertion sort for <= 32 elements (this speeds it up). To reduce stack overhead, it recurses on the smaller partition and loops back to handle the larger partition. It has a check to avoid worst-case O(n^2) time complexity, switching to heap sort if the depth of partitioning ("recursions") becomes excessive. It uses iterators instead of plain pointers, but in release mode my impression is that the iterators are unchecked and effectively pointers. It also uses a function pointer for the less-than compare, defaulting to std::less, which I don't know whether it gets optimized away.
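On the question of letting it recurse deeper: one common adjustment, sketched below under my own assumptions (this is the general "recurse on the smaller side" technique mentioned above, not std::sort's actual code), is to recurse only into the smaller partition and loop on the larger one, which bounds the stack depth at roughly O(log n) while keeping a Lomuto-style partition like yours (so duplicate-heavy inputs still degrade in running time):

#include <iostream>
#include <utility>

void quickSortBoundedStack(int* a, int len) {
    while (len > 1) {
        int pivot = a[0];
        int lo = 1;
        int hi = len - 1;
        while (lo <= hi) {
            if (a[lo] <= pivot)
                ++lo;                       // stays in the "lesser" part
            else
                std::swap(a[lo], a[hi--]);  // move it into the "greater" part
        }
        std::swap(a[0], a[hi]);             // pivot lands at its final index hi
        int leftLen = hi;
        int rightLen = len - hi - 1;
        if (leftLen < rightLen) {
            quickSortBoundedStack(a, leftLen);        // recurse on the smaller side
            a += hi + 1;                              // iterate on the larger side
            len = rightLen;
        } else {
            quickSortBoundedStack(a + hi + 1, rightLen);
            len = leftLen;
        }
    }
}

int main() {
    int data[] = {5, 3, 8, 1, 9, 2, 7, 2, 6, 4};
    quickSortBoundedStack(data, 10);
    for (int x : data)
        std::cout << x << ' ';              // 1 2 2 3 4 5 6 7 8 9
    std::cout << '\n';
}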
I found an interesting phenomenon while trying to optimize my solution for the LeetCode two-sum problem (https://leetcode.com/problems/two-sum/description/).
The LeetCode description of the two-sum problem is:
Given an array of integers, return indices of the two numbers such that they add up to a specific target.
You may assume that each input would have exactly one solution, and you may not use the same element twice.
Initially, I solved this problem using two loops. First I loop through the input array and store each value and its index as a pair in a map. Then I loop through the input array again, look up each element's complement, and check whether it exists in the map. The following is my solution from LeetCode:
class Solution {
public:
vector<int> twoSum(vector<int>& nums, int target)
{
vector<int> res;
map<int, int> store;
for(int i = 0; i < nums.size(); ++i)
{
store[nums[i]] = i;
}
for(int i = 0; i < nums.size(); ++i)
{
auto iter = store.find(target - nums[i]);
if(iter != store.end() && (iter -> second) != i)
{
res.push_back(i);
res.push_back(iter -> second);
break;
}
}
return res;
}
};
This solution takes 4 ms in the LeetCode submission. Since I am looping through the same array twice, I thought I could optimize my code by combining the insert operation and map.find() into a single loop, so I can check for a solution while inserting elements. That gives the following solution:
class Solution {
public:
vector<int> twoSum(vector<int>& nums, int target)
{
vector<int> res;
map<int, int> store;
for(int i = 0; i < nums.size(); ++i)
{
auto iter = store.find(target - nums[i]);
if(iter != store.end() && (iter -> second) != i)
{
res.push_back(i);
res.push_back(iter -> second);
break;
}
store[nums[i]] = i;
}
return res;
}
};
However, the single-loop version is much slower than the two separate loops: it takes 12 ms.
For further research, I made a test case where the input size is 100000001 and the solution will be [0, 100000001] (the first and last indices). The following is my test code:
#include <iostream>
#include <vector>
#include <algorithm>
#include <map>
#include <iterator>
#include <cstdio>
#include <ctime>
using namespace std;
vector<int> twoSum(vector<int>& nums, int target)
{
vector<int> res;
map<int, int> store;
for(int i = 0; i < nums.size(); ++i)
{
auto iter = store.find(target - nums[i]);
if(iter != store.end() && (iter -> second) != i)
{
res.push_back(i);
res.push_back(iter -> second);
break;
}
store[nums[i]] = i;
}
return res;
}
vector<int> twoSum2(vector<int>& nums, int target)
{
vector<int> res;
map<int, int> store;
for(int i = 0; i < nums.size(); ++i)
{
store[nums[i]] = i;
}
for(int i = 0; i < nums.size(); ++i)
{
auto iter = store.find(target - nums[i]);
if(iter != store.end() && (iter -> second) != i)
{
res.push_back(i);
res.push_back(iter -> second);
break;
}
}
return res;
}
int main()
{
std::vector<int> test1;
test1.push_back(4);
for (int i = 0; i < 100000000; ++i)
{
test1.push_back(3);
}
test1.push_back(6);
std::clock_t start;
double duration;
start = std::clock();
auto res1 = twoSum(test1, 10);
duration = ( std::clock() - start ) / (double) CLOCKS_PER_SEC;
std::cout<<"single loop: "<< duration <<'\n';
cout << "result: " << res1[1] << ", " << res1[0] << endl;
start = std::clock();
res1 = twoSum2(test1, 10);
duration = ( std::clock() - start ) / (double) CLOCKS_PER_SEC;
std::cout<<"double loops: "<< duration <<'\n';
cout << "result: " << res1[0] << ", " << res1[1] << endl;
}
I still get a similar result: the single-loop version (7.9 s) is slower than the double-loop version (3.0 s):
I don't really understand why the single combined loop is slower than the two separate loops. I think the single-loop version should avoid some redundant looping. Is it something about the STL map implementation that makes it better to do the insertions and the map.find() calls separately in two loops, rather than alternating insertion and map.find() in one loop?
BTW, I am working on macOS and using Apple LLVM version 10.0.0 (clang-1000.10.44.2).
Let us see what actually happens in both scenarios.
In the two-loop scenario, you do N insertions into the map but then only one find, because once the map is fully populated you get the expected result on the first iteration.
In the single-loop scenario, you must wait until the last insertion to find the result, so you do N-1 insertions and N-1 finds.
It is no surprise that it takes about twice the time in your worst-case test...
For randomized use cases, the two-loop scenario results in exactly N insertions and, statistically, N/2 finds: best case N inserts and 1 find, worst case N inserts and N-1 finds.
In the single loop you start finding as soon as the map is not empty. The best case is 1 insert and 1 find (far better than two loops), and the worst case is N-1 inserts and N-1 finds. I know it is easy to be misled by probabilities, but I would expect statistically 3N/4 inserts and N/2 finds, so slightly better than the two-loop scenario.
TL/DR: you get better results for the two-loop scenario than for the single-loop one because your test case is the best case for two loops and the worst case for the single loop.
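If you would rather verify those operation counts than trust the arithmetic, a small instrumented sketch like the one below (my code, with the worst-case input from the question scaled down) makes the asymmetry visible:

#include <iostream>
#include <map>
#include <vector>

int main() {
    const int N = 1000;                 // small stand-in for the huge test input
    std::vector<int> nums;
    nums.push_back(4);
    for (int i = 0; i < N - 2; ++i) nums.push_back(3);
    nums.push_back(6);
    const int target = 10;

    // Single loop: find first, then insert.
    {
        std::map<int, int> store;
        long long finds = 0, inserts = 0;
        for (std::size_t i = 0; i < nums.size(); ++i) {
            ++finds;
            auto it = store.find(target - nums[i]);
            if (it != store.end() && it->second != static_cast<int>(i)) break;
            ++inserts;
            store[nums[i]] = static_cast<int>(i);
        }
        std::cout << "single loop: " << inserts << " inserts, " << finds << " finds\n";
    }

    // Two loops: insert everything, then find.
    {
        std::map<int, int> store;
        long long finds = 0, inserts = 0;
        for (std::size_t i = 0; i < nums.size(); ++i) {
            ++inserts;
            store[nums[i]] = static_cast<int>(i);
        }
        for (std::size_t i = 0; i < nums.size(); ++i) {
            ++finds;
            auto it = store.find(target - nums[i]);
            if (it != store.end() && it->second != static_cast<int>(i)) break;
        }
        std::cout << "two loops:   " << inserts << " inserts, " << finds << " finds\n";
    }
}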
I was experimenting with a well-known algorithm that aims to reduce the number of comparisons when finding an element in an unsorted array. The algorithm uses a sentinel appended to the back of the array, which lets us write a loop with only one comparison per element instead of two. It's important to note that the overall big-O complexity is not changed, it is still O(n); however, counting comparisons, the standard find algorithm is, so to speak, O(2n) while the sentinel algorithm is O(n).
The standard find algorithm from the C++ library works like this:
template<class InputIt, class T>
InputIt find(InputIt first, InputIt last, const T& value)
{
for (; first != last; ++first) {
if (*first == value) {
return first;
}
}
return last;
}
We can see two comparisons there and one increment.
In the algorithm with sentinel the loop looks like this:
while (a[i] != key)
++i;
There is only one comparison and one increment.
I ran some experiments and measured time, but the results were different on every computer. Unfortunately I didn't have access to a serious machine, only my laptop running Ubuntu in VirtualBox, under which I compiled and ran the code, and I also had problems with the amount of memory. I tried online compilers like Wandbox and Ideone, but their time and memory limits didn't allow reliable experiments. Every time I ran my code, changing the number of elements in the vector or the number of test executions, I saw different results: sometimes the times were comparable, sometimes std::find was significantly faster, and sometimes the sentinel algorithm was significantly faster.
I was surprised, because the logic says the sentinel version should indeed be faster, every time. Do you have any explanation for this? Do you have any experience with this kind of algorithm? Is it worth the effort to even try using it in production code when performance is crucial and the array cannot be sorted (and no other mechanism to solve this problem, like a hash map, indexing, etc., can be used)?
Here's my test code. It's not beautiful, in fact it is ugly, but beauty wasn't my goal here. Maybe something is wrong with my code?
#include <iostream>
#include <algorithm>
#include <chrono>
#include <vector>
using namespace std::chrono;
using namespace std;
const unsigned long long N = 300000000U;
static void find_with_sentinel()
{
vector<char> a(N);
char key = 1;
a[N - 2] = key; // make sure the searched element is in the array at the last but one index
unsigned long long high = N - 1;
auto tmp = a[high];
// put a sentinel at the end of the array
a[high] = key;
unsigned long long i = 0;
while (a[i] != key)
++i;
// restore original value
a[high] = tmp;
if (i == high && key != tmp)
cout << "find with sentinel, not found" << endl;
else
cout << "find with sentinel, found" << endl;
}
static void find_with_std_find()
{
vector<char> a(N);
int key = 1;
a[N - 2] = key; // make sure the searched element is in the array at the last but one index
auto pos = find(begin(a), end(a), key);
if (pos != end(a))
cout << "find with std::find, found" << endl;
else
cout << "find with sentinel, not found" << endl;
}
int main()
{
const int times = 10;
high_resolution_clock::time_point t1 = high_resolution_clock::now();
for (auto i = 0; i < times; ++i)
find_with_std_find();
high_resolution_clock::time_point t2 = high_resolution_clock::now();
auto duration = duration_cast<milliseconds>(t2 - t1).count();
cout << "std::find time = " << duration << endl;
t1 = high_resolution_clock::now();
for (auto i = 0; i < times; ++i)
find_with_sentinel();
t2 = high_resolution_clock::now();
duration = duration_cast<milliseconds>(t2 - t1).count();
cout << "sentinel time = " << duration << endl;
}
Move the memory allocation (the vector construction) outside the measured functions (e.g. pass the vector as an argument).
Increase times to a few thousand.
You're doing a whole lot of time-consuming work in your functions. That work is hiding the differences in the timings. Consider your find_with_sentinel function:
static void find_with_sentinel()
{
// ***************************
vector<char> a(N);
char key = 1;
a[N - 2] = key; // make sure the searched element is in the array at the last but one index
// ***************************
unsigned long long high = N - 1;
auto tmp = a[high];
// put a sentinel at the end of the array
a[high] = key;
unsigned long long i = 0;
while (a[i] != key)
++i;
// restore original value
a[high] = tmp;
// ***************************************
if (i == high && key != tmp)
cout << "find with sentinel, not found" << endl;
else
cout << "find with sentinel, found" << endl;
// **************************************
}
The three lines at the top and the four lines at the bottom are identical in both functions, and they're fairly expensive to run. The top contains a memory allocation and the bottom contains an expensive output operation. These are going to mask the time it takes to do the real work of the function.
You need to move the allocation and the output out of the function. Change the function signature to:
static int find_with_sentinel(vector<char> a, char key);
In other words, make it the same as std::find. If you do that, then you don't have to wrap std::find, and you get a more realistic view of how your function will perform in a typical situation.
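One possible shape for that refactoring is sketched below (my code, not the answer's; I also pass the vector by reference rather than by value so copying it doesn't dominate the timing, return the found index instead of printing, and assume a non-empty vector):

#include <iostream>
#include <vector>

// Returns the index of key, or -1 if it is not present. The vector must be
// mutable because the last element is temporarily overwritten with the sentinel.
static long long find_with_sentinel(std::vector<char>& a, char key) {
    const unsigned long long high = a.size() - 1;
    const char tmp = a[high];
    a[high] = key;                  // plant the sentinel
    unsigned long long i = 0;
    while (a[i] != key)
        ++i;
    a[high] = tmp;                  // restore the original last element
    if (i == high && tmp != key)
        return -1;                  // only the sentinel matched
    return static_cast<long long>(i);
}

int main() {
    std::vector<char> a(1000, 0);
    a[998] = 1;
    std::cout << find_with_sentinel(a, 1) << '\n';   // prints 998
}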
It's quite possible that the sentinel find function will be faster. However, it comes with some drawbacks. The first is that you can't use it with immutable lists. The second is that it's not safe to use in a multi-threaded program due to the potential of one thread overwriting the sentinel that the other thread is using. It also might not be "faster enough" to justify replacing std::find.
EDIT:
I've fixed the insertion. As Blastfurnace kindly mentioned, the insertion invalidated the iterators. I believe the loop is needed to compare performance (see my comment on Blastfurnace's answer). My code is updated. I have completely analogous code for the list, just with vector replaced by list. However, with this code I find that the list performs better than the vector both for small and large data types, and even for linear search (if I remove the insertion). According to http://java.dzone.com/articles/c-benchmark-%E2%80%93-stdvector-vs and other sites, that should not be the case. Any clues as to how that can be?
I am taking a course on programming mathematical software (exam on Monday), and for it I would like to present a graph comparing the performance of random insertion of elements into a vector and into a list. However, when I test the code I get random slowdowns. For instance, I might have 2 iterations where inserting 10 elements at random into a vector of size 500 takes 0.01 seconds, and then 3 similar iterations that each take roughly 12 seconds. This is my code:
void AddRandomPlaceVector(vector<FillSize> &myContainer, int place) {
int i = 0;
vector<FillSize>::iterator iter = myContainer.begin();
while (iter != myContainer.end())
{
if (i == place)
{
FillSize myFill;
iter = myContainer.insert(iter, myFill);
}
else
++iter;
++i;
}
//cout << i << endl;
}
double testVector(int containerSize, int iterRand)
{
cout << endl;
cout << "Size: " << containerSize << endl << "Random inserts: " << iterRand << endl;
vector<FillSize> myContainer(containerSize);
boost::timer::auto_cpu_timer tid;
for (int i = 0; i != iterRand; i++)
{
double randNumber = (int)(myContainer.size()*((double)rand()/RAND_MAX));
AddRandomPlaceVector(myContainer, randNumber);
}
double wallTime = tid.elapsed().wall/1e9;
cout << "New size: " << myContainer.size();
return wallTime;
}
int main()
{
int testSize = 500;
int measurementIters = 20;
int numRand = 1000;
int repetionIters = 100;
ofstream tidOutput1_sum("VectorTid_8bit_sum.txt");
ofstream tidOutput2_sum("ListTid_8bit_sum.txt");
for (int i = 0; i != measurementIters; i++)
{
double time = 0;
for (int j = 0; j != repetionIters; j++) {
time += testVector((i+1)*testSize, numRand);
}
std::ostringstream strs;
strs << double(time/repetionIters);
tidOutput1_sum << ((i+1)*testSize) << "," << strs.str() << endl;
}
for (int i = 0; i != measurementIters; i++)
{
double time = 0;
for (int j = 0; j != repetionIters; j++) {
time += testList((i+1)*testSize, numRand);
}
std::ostringstream strs;
strs << double(time/repetionIters);
tidOutput2_sum << ((i+1)*testSize) << "," << strs.str() << endl;
}
return 0;
}
struct FillSize
{
double fill1;
};
The struct is just there so I can easily add more members and test elements of different sizes. I know this code is probably not perfect for performance testing, but they would rather have me make a simple example than simply reference something I found.
I've tested this code on two computers now, and both have the same issue. How can that be? And can you help me with a fix so I can graph it and present it on Monday? Perhaps adding a few seconds of wait time between each iteration would help?
Kind regards,
Bjarke
Your AddRandomPlaceVector function has a serious flaw: using insert() invalidates iterators, so the loop (as originally written, without reassigning the returned iterator) was invalid code.
If you know the desired insertion point there's no reason to iterate over the vector at all.
void AddRandomPlaceVector(vector<FillSize> &myContainer, int place)
{
FillSize myFill;
myContainer.insert(myContainer.begin() + place, myFill);
}
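For comparison, a hypothetical list counterpart (my sketch; the FillSize struct and the function name simply mirror the question) still has to walk to the insertion point, because a list only avoids the shifting, not the traversal:

#include <iostream>
#include <iterator>
#include <list>

struct FillSize
{
    double fill1;
};

void AddRandomPlaceList(std::list<FillSize>& myContainer, int place)
{
    FillSize myFill;
    // std::next walks 'place' nodes: O(place), unlike the vector's begin() + place.
    myContainer.insert(std::next(myContainer.begin(), place), myFill);
}

int main()
{
    std::list<FillSize> l(5);
    AddRandomPlaceList(l, 2);
    std::cout << l.size() << '\n';   // prints 6
}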