I have the following code:
char fname[255] = {0};
snprintf(fname, 255, "%s_test_no.%d.txt", baseLocation, i);
vs
std::string fname = baseLocation + "_test_no." + std::to_string(i) + ".txt";
Which one performs better? Does the second one involve temporary creation? Is there any better way to do this?
Let's run the numbers:
2022 edit:
Using Quick-Bench with GCC 10.3 and compiling as C++20 (with some minor changes for constness) demonstrates that std::string is now the faster option, by almost 3x:
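For reference, the shape of such a benchmark in Google Benchmark form (the framework quick-bench.com runs); this is my reconstruction of the setup, not the exact code behind those numbers:

#include <benchmark/benchmark.h>
#include <cstdio>
#include <string>

static void BM_Snprintf(benchmark::State& state) {
    const char* baseLocation = "baseLocation";
    char fname[255] = {};
    int i = 0;
    for (auto _ : state) {
        snprintf(fname, sizeof fname, "%s_test_no.%d.txt", baseLocation, i++);
        benchmark::DoNotOptimize(fname);     // keep the result observable
    }
}
BENCHMARK(BM_Snprintf);

static void BM_StringConcat(benchmark::State& state) {
    const std::string baseLocation = "baseLocation";
    int i = 0;
    for (auto _ : state) {
        std::string fname = baseLocation + "_test_no." + std::to_string(i++) + ".txt";
        benchmark::DoNotOptimize(fname);
    }
}
BENCHMARK(BM_StringConcat);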
Original answer (2014)
The code (I used PAPI Timers)
main.cpp
#include <iostream>
#include <string>
#include <stdio.h>
#include "papi.h"
#include <vector>
#include <cmath>
#define TRIALS 10000000
class Clock
{
public:
    typedef long_long time;
    time start;
    Clock() : start(now()) {}
    void restart() { start = now(); }
    time usec() const { return now() - start; }
    time now() const { return PAPI_get_real_usec(); }
};
int main()
{
    int eventSet = PAPI_NULL;
    PAPI_library_init(PAPI_VER_CURRENT);
    if (PAPI_create_eventset(&eventSet) != PAPI_OK)
    {
        std::cerr << "Failed to initialize PAPI event" << std::endl;
        return 1;
    }

    Clock clock;
    std::vector<long_long> usecs;

    const char* baseLocation = "baseLocation";
    //std::string baseLocation = "baseLocation";
    char fname[255] = {};

    for (int i = 0; i < TRIALS; ++i)
    {
        clock.restart();
        snprintf(fname, 255, "%s_test_no.%d.txt", baseLocation, i);
        //std::string fname = baseLocation + "_test_no." + std::to_string(i) + ".txt";
        usecs.push_back(clock.usec());
    }

    long_long sum = 0;
    for (auto vecIter = usecs.begin(); vecIter != usecs.end(); ++vecIter)
    {
        sum += *vecIter;
    }

    double average = static_cast<double>(sum) / static_cast<double>(TRIALS);
    std::cout << "Average: " << average << " microseconds" << std::endl;

    // compute variance
    double variance = 0;
    for (auto vecIter = usecs.begin(); vecIter != usecs.end(); ++vecIter)
    {
        variance += (*vecIter - average) * (*vecIter - average);
    }
    variance /= static_cast<double>(TRIALS);
    std::cout << "Variance: " << variance << " microseconds" << std::endl;
    std::cout << "Std. deviation: " << sqrt(variance) << " microseconds" << std::endl;

    double CI = 1.96 * sqrt(variance) / sqrt(static_cast<double>(TRIALS));
    std::cout << "95% CI: " << average - CI << " usecs to " << average + CI << " usecs" << std::endl;
}
Play with the comments to get one way or the other.
10 million iterations of both methods on my machine with the compile line:
g++ main.cpp -lpapi -DUSE_PAPI -std=c++0x -O3
Using char array:
Average: 0.240861 microseconds
Variance: 0.196387 microseconds
Std. deviation: 0.443156 microseconds
95% CI: 0.240586 usecs to 0.241136 usecs
Using string approach:
Average: 0.365933 microseconds
Variance: 0.323581 microseconds
Std. deviation: 0.568842 microseconds
95% CI: 0.365581 usecs to 0.366286 usecs
So at least on MY machine with MY code and MY compiler settings, I saw that character arrays incur about a 34% speedup over strings, using the following formula:
((time for string) - (time for char array) ) / (time for string)
which gives the difference in time between the approaches as a percentage of the string time alone. My original percentage used the character array time as the reference point instead, which shows a 52% slowdown when moving to strings; that figure is also correct, but I found it misleading.
I'll take any and all comments for how I did this wrong :)
2015 Edit
Compiled with GCC 4.8.4:
string
Average: 0.338876 microseconds
Variance: 0.853823 microseconds
Std. deviation: 0.924026 microseconds
95% CI: 0.338303 usecs to 0.339449 usecs
character array
Average: 0.239083 microseconds
Variance: 0.193538 microseconds
Std. deviation: 0.439929 microseconds
95% CI: 0.238811 usecs to 0.239356 usecs
So the character array approach remains significantly faster, although less so than before. In these tests, it was about 29% faster.
The snprintf() version will almost certainly be quite a bit faster. Why? Simply because no memory allocation takes place. The new operator is surprisingly expensive, roughly 250ns on my system - snprintf() will have finished quite a bit of work in the meantime.
That is not to say that you should use the snprintf() approach: the price you pay is safety. It is just so easy to get things wrong with the fixed buffer size you are supplying to snprintf(), and you absolutely need to supply code for the case that the buffer is not large enough. So think about using snprintf() only when you have identified this part of the code as really performance critical.
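For illustration, a minimal sketch of the truncation check the snprintf() approach obliges you to write (buffer size and names taken from the question):

char fname[255];
int n = snprintf(fname, sizeof fname, "%s_test_no.%d.txt", baseLocation, i);
if (n < 0 || n >= (int)sizeof fname) {
    // encoding error, or the result was truncated: fname does not hold
    // the full name, so fall back to a larger buffer or report an error
}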
If you have a POSIX-2008 compliant system, you may also think about trying asprintf() instead of snprintf(), it will malloc() the memory for you, giving you pretty much the same comfort as C++ strings. At least on my system, malloc() is quite a bit faster than the builtin new-operator (don't ask me why, though).
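A sketch of the asprintf() variant (on glibc you need _GNU_SOURCE; the function and its semantics are real, the surrounding helper is mine):

#define _GNU_SOURCE   // needed for asprintf on glibc
#include <stdio.h>
#include <stdlib.h>

void make_name(const char *baseLocation, int i) {
    char *fname = NULL;
    if (asprintf(&fname, "%s_test_no.%d.txt", baseLocation, i) < 0) {
        return;        // allocation failed; fname is left unspecified
    }
    // ... use fname ...
    free(fname);       // asprintf allocates; the caller must free
}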
Edit:
Just saw that you used filenames in your example. If filenames are your concern, forget about the performance of string operations! Your code will spend virtually no time in them. Unless you have on the order of 100000 such string operations per second, they are irrelevant to your performance.
If it's REALLY important, measure the two solutions. If not, pick whichever makes the most sense given your data, company/private coding style standards, etc. Make sure you use an optimised build, with the same optimisation level as your actual production build: don't benchmark at -O3 just because it's the highest when your production build uses -O1.
I expect that either will be pretty close if you only do a few. If you do several million, there may be a difference. Which is faster? I'd guess the second [1], but it depends on who wrote the implementation of snprintf and who wrote the std::string implementation. Both certainly have the potential to take a lot longer than you would expect from a naive idea of how the function works (and possibly also to run faster than you'd expect).
[1] Because I have worked with printf, and it's not a simple function: it spends a lot of time grokking the format string. It's not very efficient (and I have looked at the ones in glibc and such too, and they are not noticeably better).
On the other hand, std::string functions are often inlined since they are template implementations, which improves efficiency. The joker in the pack is the memory allocation that is likely to happen for std::string. Of course, if baseLocation somehow turns out to be rather large, you probably don't want to store the result in a fixed-size local array anyway, so that evens out in that case.
I would recommend using strcat in that case. It is by far the fastest method.
As silly as it seems, I would like to know whether there may be pitfalls when trying to reconcile the time costs of a for loop, as measured
either from time points just outside the for loop (global or external time cost)
or, from time points being inside the loop, and being cumulatively considered (local or internal time cost) ?
The example below illustrates my difficulties getting two equal measurements:
#include <iostream>
#include <vector> // std::vector
#include <ctime> // clock(), ..
int main() {
    clock_t clockStartLoop;
    double timeInternal(0); // time cost of the loop, summing all time costs of commands within the "for" loop
    double timeExternal;    // time cost of the loop, as measured outside the boundaries of the "for" loop

    std::vector<int> vecInt; // will be [0,1,..,9999] after the loop below
    clock_t costExternal(clock());
    for (int i = 0; i < 10000; i++) {
        clockStartLoop = clock();
        vecInt.push_back(i);
        timeInternal += clock() - clockStartLoop; // incrementing internal time cost
    }
    timeInternal /= CLOCKS_PER_SEC;
    timeExternal = (clock() - costExternal) / (double)CLOCKS_PER_SEC;

    std::cout << "timeExternal = " << timeExternal << " s ";
    std::cout << "vs timeInternal = " << timeInternal << std::endl;
    std::cout << "We have a ratio of " << timeExternal/timeInternal << " between the two.." << std::endl;
}
I usually get a ratio around 2 as output, e.g.:
timeExternal = 0.008407 s vs timeInternal = 0.004287
We have a ratio of 1.96105 between the two..
whereas I was hoping for a ratio closer to 1.
Is it just because there are operations internal to the loop which are not measured by the clock() difference (such as incrementing timeInternal) ?
Could the i++ operation in the for(..) be non-negligible in the external measurement and also explain the difference with the internal one ?
I'm actually dealing with more complex code, and I would like to isolate time costs within a loop, being sure that all the time slices I consider add up to a complete pie (which I have never achieved so far..). Thanks a lot.
timeExternal = 0.008407 s vs timeInternal = 0.004287 We have a ratio of 1.96105 between the two..
A ratio of ~2 is to be expected - by far the heaviest call in your loop is clock() itself (on most systems clock() is a syscall to the kernel).
Imagine that clock() implementation looks like the following pseudocode:
clock_t clock() {
    go_to_kernel();        // very long operation
    clock_t rc = query_process_clock();
    return_from_kernel();  // very long operation
    return rc;
}
Now going back to the loop, we can annotate the places where time is spent:
for (int i = 0; i < 10000; i++) {
    // go_to_kernel - very long operation
    clockStartLoop = clock();
    // return_from_kernel - very long operation
    vecInt.push_back(i);
    // go_to_kernel - very long operation
    timeInternal += clock() - clockStartLoop;
    // return_from_kernel - very long operation
}
So between the two calls to clock() we have 2 long operations, with a total in the loop of 4. Hence the ratio of 2-to-1.
Is it just because there are operations internal to the loop which are not measured by the clock() difference (such as incrementing timeInternal) ?
No, incrementing timeInternal is negligible.
Could the i++ operation in the for(..) be non-negligible in the external measurement and also explain the difference with the internal one ?
No, i++ is also negligible. Remove the inner calls to clock() and you will see a much faster execution time. On my system it was 0.00003 s.
The next most expensive operation after clock() is vector::push_back(), because it needs to resize the vector. This is amortized by the geometric growth factor and can be eliminated entirely by calling vector::reserve() before entering the loop.
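For example, a minimal sketch of the reserve() fix:

std::vector<int> vecInt;
vecInt.reserve(10000); // one up-front allocation, so no reallocation inside the loop
for (int i = 0; i < 10000; i++)
    vecInt.push_back(i);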
Conclusion: when benchmarking, make sure to time entire loops, not individual iterations. Better yet, use frameworks like Google Benchmark, which will help to avoid many other pitfalls (like compiler optimizations). There's also quick-bench.com for simple cases (based on Google Benchmark).
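A minimal sketch of what the loop above looks like as a Google Benchmark fixture (the BM_PushBack name is mine):

#include <benchmark/benchmark.h>
#include <vector>

static void BM_PushBack(benchmark::State& state) {
    for (auto _ : state) {                       // the framework times whole iterations
        std::vector<int> vecInt;
        vecInt.reserve(10000);
        for (int i = 0; i < 10000; i++)
            vecInt.push_back(i);
        benchmark::DoNotOptimize(vecInt.data()); // keep the work from being optimized away
    }
}
BENCHMARK(BM_PushBack);
BENCHMARK_MAIN();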
Does mt19937_64 have a higher throughput (bit/s) than the 32 bit version, mt19937, assuming a 64 bit architecture?
What about after vectorization?
As @byjoe points out, this obviously depends on the compiler.
In this case, it seems to be considerably more dependent on the compiler than is typical though. For example, the Boost test linked in the comments uses the compiler from VC++ 2010, and shows only a fairly slight increase in random bits per second from using mt19937_64.
To get more up-to-date information, I whipped up a simple test:
#include <random>
#include <chrono>
#include <iostream>
#include <iomanip>
template <class T, class U>
U test(char const *label, U count) {
    using namespace std::chrono;

    T gen(100);
    U result = 0;
    auto start = high_resolution_clock::now();
    for (U i = 0; i < count; i++)
        result ^= gen();
    auto stop = high_resolution_clock::now();
    std::cout << "Time for " << std::left << std::setw(12) << label
              << duration_cast<milliseconds>(stop - start).count() << "\n";
    return result;
}

int main(int argc, char **argv) {
    unsigned long long limit = 1000000000;
    auto result1 = test<std::mt19937>("mt19937: ", limit);
    auto result2 = test<std::mt19937_64>("mt19937_64: ", limit);
    std::cout << "Ignore: " << result1 << ", " << result2 << "\n";
}
With VC++ 2015 update 3 (with /O2b2 /GL, though it probably doesn't matter), I got results like these:
Time for mt19937: 4339
Time for mt19937_64: 4215
Ignore: 2598366015, 13977046647333287932
This shows mt19937_64 as being slightly faster per call, so over twice as fast per bit as mt19937. With MinGW (using -O3), the results were much more like those linked from the Boost site:
Time for mt19937: 2211
Time for mt19937_64: 4183
Ignore: 2598366015, 13977046647333287932
In this case, mt19937_64 takes just a little less than twice the time per call, so it's only slightly faster per bit. The highest overall speed seems to be from g++ with mt19937_64, but the difference between g++ and VC++ (on these runs) is less than 1%, so I'm not sure it's reproducible.
For what it's worth, the difference in speed (per call) between mt19937 and mt19937_64 with VC++ is also pretty small, but does seem to be reproducible--it happened quite consistently in my testing. I did wonder about whether that might be (at least partially) a matter of clock management--that when the code first started, the CPU was idle, and the clock had been slowed, so the first part of the first run was at a lower clock speed. To check, I reversed the order to test mt19937_64 first. I think my hypothesis was at least partially correct--when I reversed the order, mt19937_64 slowed down compared to mt19937, so they were nearly identical on a per-call basis with VC++.
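A simple way to test that hypothesis is to run a throwaway warm-up pass first and ignore its output, reusing the test() template above (a sketch; the warm-up length is arbitrary):

(void)test<std::mt19937_64>("warm-up: ", limit / 10); // let the CPU clock ramp up
auto result1 = test<std::mt19937>("mt19937: ", limit);
auto result2 = test<std::mt19937_64>("mt19937_64: ", limit);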
It clearly depends on your compiler and its implementation. I just tested, and the 64-bit version takes about 60% longer call-for-call, which makes the 64-bit version about 25% faster bit-for-bit. I tested with an i7 CPU.
If you need max speed, you may want to consider using something else. Especially if the numbers don't need to be very high quality.
I have a little doubt concerning what I understand about make_shared performance (Boost or STL), so I wanted some opinions.
Working on a C++ app, I had to do some performance tests, and I ended up comparing make_shared and shared_ptr+new (note that the purpose isn't a performance improvement, and I'm not expecting to gain time here; I'm just curious now).
I use
Debian Jessie x64
libboost 1.55
gcc 4.9.2
I read that make_shared is more efficient, and according to the explanations I can find (allocation counts, overhead) that seems logical to me (as far as I can understand).
But doing a quick and stupid test, I don't understand what I get:
std::shared_ptr<MyUselessClass> dummyPtr = std::shared_ptr<MyUselessClass>(new MyUselessClass());

auto start = boost::chrono::high_resolution_clock::now();

// - STD Share
std::shared_ptr<MyUselessClass> stdSharePtr = std::shared_ptr<MyUselessClass>(new MyUselessClass());
auto stdSharePtrTime_1 = boost::chrono::high_resolution_clock::now();

// - STD Make
std::shared_ptr<MyUselessClass> stdMakePtr = std::make_shared<MyUselessClass>();
auto stdMakePtrTime_2 = boost::chrono::high_resolution_clock::now();

// - BOOST Share
boost::shared_ptr<MyUselessClass> boostSharePtr = boost::shared_ptr<MyUselessClass>(new MyUselessClass());
auto boostSharePtrTime_3 = boost::chrono::high_resolution_clock::now();

// - BOOST Make
boost::shared_ptr<MyUselessClass> boostMakePtr = boost::make_shared<MyUselessClass>();
auto boostMakePtrTime_4 = boost::chrono::high_resolution_clock::now();

boost::chrono::nanoseconds stdShare = boost::chrono::duration_cast<boost::chrono::nanoseconds>(stdSharePtrTime_1 - start);
boost::chrono::nanoseconds stdMake = boost::chrono::duration_cast<boost::chrono::nanoseconds>(stdMakePtrTime_2 - stdSharePtrTime_1);
boost::chrono::nanoseconds boostShare = boost::chrono::duration_cast<boost::chrono::nanoseconds>(boostSharePtrTime_3 - stdMakePtrTime_2);
boost::chrono::nanoseconds boostMake = boost::chrono::duration_cast<boost::chrono::nanoseconds>(boostMakePtrTime_4 - boostSharePtrTime_3);

cout << "---" << endl;
cout << "STD share " << stdShare << endl;
cout << "BOOST share " << boostShare << endl;
cout << "STD make " << stdMake << endl;
cout << "BOOST make " << boostMake << endl;
MyUselessClass is a simple class with 3 class attributes (string, bool, int) and only a constructor and destructor.
The "results" (quoted because it's not an accurate test, of course) are the following (I ran it in a loop for many iterations, which gives on average the same results):
STD share 162 nanoseconds
BOOST share 196 nanoseconds
STD make 385 nanoseconds
BOOST make 264 nanoseconds
If I believe my test, make_shared is slightly slower than calling shared_ptr with a new instantiation. I would have expected the contrary, if I were to see any difference at nanosecond precision at all ...
So now I am wondering:
Maybe my test is too stupid, and the nanosecond order of these operations has no importance?
Maybe I missed a point in the explanation of make_shared's better performance?
Maybe I missed some (several) point(s) in the test?
If you have answers to any of these points, please don't hesitate :)
Thanks
Try compiling your program with -O2. I tried compiling your code without optimizations and got similar numbers to yours. After compiling with -O2, make_shared is consistently faster. By the way, the size of the class MyUselessClass also affects the ratio of the times.
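For intuition about what the optimizer exposes, here is the difference being measured, as a sketch assuming a typical implementation (the single combined allocation is an optimization the standard permits, not mandates):

// two heap allocations: one for the object, then one for the control block
auto a = std::shared_ptr<MyUselessClass>(new MyUselessClass());

// one heap allocation holding the object and the control block together
auto b = std::make_shared<MyUselessClass>();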
I have implemented an isPermutation function which, given two strings, returns true if the two are permutations of each other, and false otherwise.
One implementation uses the C++ sort algorithm twice, while the other uses an array of ints to keep character counts.
I ran the code several times and every time the sorting method is faster. Is my array implementation wrong?
Here is the output:
1
0
1
Time: 0.088 ms
1
0
1
Time: 0.014 ms
And the code:
#include <iostream>  // cout
#include <string>    // string
#include <cstring>   // memset
#include <algorithm> // sort
#include <ctime>     // clock_t

using namespace std;

#define MAX_CHAR 255

void PrintTimeDiff(clock_t start, clock_t end) {
    std::cout << "Time: " << (end - start) / (double)(CLOCKS_PER_SEC / 1000) << " ms" << std::endl;
}

// using an array to keep a count of used chars
bool isPermutation(string inputa, string inputb) {
    int allChars[MAX_CHAR];
    memset(allChars, 0, sizeof(int) * MAX_CHAR);

    for (int i = 0; i < inputa.size(); i++) {
        allChars[(int)inputa[i]]++;
    }

    for (int i = 0; i < inputb.size(); i++) {
        allChars[(int)inputb[i]]--;
        if (allChars[(int)inputb[i]] < 0) {
            return false;
        }
    }
    return true;
}

// using sorting and comparing
bool isPermutation_sort(string inputa, string inputb) {
    std::sort(inputa.begin(), inputa.end());
    std::sort(inputb.begin(), inputb.end());

    if (inputa == inputb) return true;
    return false;
}

int main(int argc, char* argv[]) {
    clock_t start = clock();
    cout << isPermutation("god", "dog") << endl;
    cout << isPermutation("thisisaratherlongerinput", "thisisarathershorterinput") << endl;
    cout << isPermutation("armen", "ramen") << endl;
    PrintTimeDiff(start, clock());

    start = clock();
    cout << isPermutation_sort("god", "dog") << endl;
    cout << isPermutation_sort("thisisaratherlongerinput", "thisisarathershorterinput") << endl;
    cout << isPermutation_sort("armen", "ramen") << endl;
    PrintTimeDiff(start, clock());
    return 0;
}
To benchmark this you have to eliminate all the noise you can.
The easiest way to do this is to wrap each call in a loop that repeats it 1000 times or so, then only spit out the value every 10 iterations. This way they each have a similar caching profile. Throw away values that are bogus (e.g. blowouts due to context switches by the OS).
I got your method marginally faster by doing this. An excerpt:
method 1 array Time: 0.768 us
method 2 sort Time: 0.840333 us
method 1 array Time: 0.621333 us
method 2 sort Time: 0.774 us
method 1 array Time: 0.769 us
method 2 sort Time: 0.856333 us
method 1 array Time: 0.766 us
method 2 sort Time: 0.850333 us
method 1 array Time: 0.802667 us
method 2 sort Time: 0.89 us
method 1 array Time: 0.778 us
method 2 sort Time: 0.841333 us
I used rdtsc, which works better for me on this system. 3000 cycles per microsecond is close enough for this machine, but do make it more accurate if you care about the precision of the readings.
#if defined(__x86_64__)
static uint64_t rdtsc()
{
    uint64_t hi, lo;
    __asm__ __volatile__ (
        "xor %%eax, %%eax\n"
        "cpuid\n"
        "rdtsc\n"
        : "=a"(lo), "=d"(hi)
        :: "ebx", "ecx");
    return (hi << 32) | lo;
}
#else
#error wrong architecture - implement me
#endif

void PrintTimeDiff(uint64_t start, uint64_t end) {
    std::cout << "Time: " << (end - start) / double(3000) << " us" << std::endl;
}
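The driver itself isn't shown above; a minimal sketch of the scheme described, reusing rdtsc() and PrintTimeDiff() (the loop counts are illustrative):

const int reps = 1000;
for (int sample = 0; sample < 60; ++sample) {
    bool sink = false;
    uint64_t t0 = rdtsc();
    for (int i = 0; i < reps; ++i)
        sink ^= isPermutation("armen", "ramen");
    uint64_t t1 = rdtsc();
    if (sample % 10 == 0)                         // report every 10th sample
        PrintTimeDiff(t0, t0 + (t1 - t0) / reps); // average time per call
    __asm__ __volatile__("" :: "r"(sink));        // keep the calls from being optimized out
}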
You cannot check performance differences between implementations while putting calls to std::cout into the mix. isPermutation and isPermutation_sort are orders of magnitude faster than a call to std::cout (and, anyway, prefer '\n' over std::endl).
Also, for testing you have to activate compiler optimizations. Doing so, the compiler will apply loop-invariant code motion, and you'll probably get the same results for both implementations.
A more effective way of testing is:
// additional include needed: <random>
int main()
{
    const std::vector<std::string> bag
    {
        "god", "dog", "thisisaratherlongerinput", "thisisarathershorterinput",
        "armen", "ramen"
    };

    static std::mt19937 engine;
    std::uniform_int_distribution<std::size_t> rand(0, bag.size() - 1);

    const unsigned stop = 1000000;
    unsigned counter = 0;

    std::clock_t start = std::clock();
    for (unsigned i(0); i < stop; ++i)
        counter += isPermutation(bag[rand(engine)], bag[rand(engine)]);
    std::cout << counter << '\n';
    PrintTimeDiff(start, clock());

    counter = 0;
    start = std::clock();
    for (unsigned i(0); i < stop; ++i)
        counter += isPermutation_sort(bag[rand(engine)], bag[rand(engine)]);
    std::cout << counter << '\n';
    PrintTimeDiff(start, clock());
    return 0;
}
I get 2.4s for isPermutation_sort vs 2s for isPermutation (somewhat similar to Hal's results). Same with g++ and clang++.
Printing the value of counter has the double benefit of:
triggering the as-if rule (the compiler cannot remove the for loops);
allowing a first check of your implementations (the two values cannot be too distant).
There are some things you have to change in your implementation of isPermutation:
pass arguments as const references
bool isPermutation(const std::string &inputa, const std::string &inputb)
just this change brings the time down to 0.8s (of course you cannot do the same with isPermutation_sort).
you can use std::array and std::fill instead of memset (this is C++ :-)
avoid premature pessimization and prefer preincrement. Only use postincrement if you're going to use the original value
do not mix signed and unsigned values in the for loops (inputa.size() and i); i should be declared as std::size_t
even better, use the range based for loop.
So something like:
bool isPermutation(const std::string &inputa, const std::string &inputb)
{
    std::array<int, MAX_CHAR> allChars;
    allChars.fill(0);

    for (auto c : inputa)
        ++allChars[(unsigned char)c];

    for (auto c : inputb)
    {
        --allChars[(unsigned char)c];
        if (allChars[(unsigned char)c] < 0)
            return false;
    }
    return true;
}
Anyway both isPermutation and isPermutation_sort should have this preliminary check:
if (inputa.length() != inputb.length())
    return false;
Now we are at 0.55s for isPermutation vs 1.1s for isPermutation_sort.
Last but not least consider std::is_permutation:
for (unsigned i(0); i < stop; ++i)
{
    const std::string &s1(bag[rand(engine)]), &s2(bag[rand(engine)]);
    counter += std::is_permutation(s1.begin(), s1.end(), s2.begin());
}
(0.6s)
EDIT
As observed in BeyelerStudios' comment, a Mersenne Twister is overkill in this case.
You can change the engine to a simpler one:
static std::linear_congruential_engine<std::uint_fast32_t, 48271, 0, 2147483647> engine;
This further lowers the timings. Luckily the relative speeds remain the same.
Just to be sure I've also checked with a non random access scheme obtaining the same relative results.
Your idea amounts to using a Counting Sort on both strings, but with the comparison happening on the count array, rather than after writing out sorted strings.
It works well because a byte can only have one of 255 non-zero values. Zeroing 256B of memory, or even 4*256B, is pretty cheap, so it works well even for fairly short strings, where most of the count array isn't touched.
It should be fairly good for very long strings, at least in some cases. It's pretty heavily dependent on a good, heavily pipelined L1 cache, because scattered increments to the count array produce scattered read-modify-writes. Repeated occurrences of a byte create a dependency chain with a store-load round-trip in it. This is a big glass jaw for this algorithm on CPUs where many loads and stores can be in flight at once (with their latencies happening in parallel). Modern x86 CPUs should run it pretty well, since they can sustain a load + store every clock cycle.
The initial count of inputa compiles to a very tight loop:
.L15:
    movsx   rdx, BYTE PTR [rax]
    add     rax, 1
    add     DWORD PTR [rsp-120+rdx*4], 1
    cmp     rax, rcx
    jne     .L15
This brings us to the first major bug in your code: char can be signed or unsigned. In the x86-64 ABI, char is signed, so allChars[(int)inputa[i]]++; sign-extends it for use as an array index. (movsx instead of movzx). Your code will write outside the array bounds on non-ASCII characters that have their high bit set. So you should have written allChars[(unsigned char)inputa[i]]++;. Note that casting to (unsigned) doesn't give the result we want (see comments).
Note that clang makes much worse code (v3.7.1 and v3.8, both with -O3), with a function call to std::basic_string<...>::_M_leak_hard() inside the inner loop. (Leak as in leak a reference, I think.) @manlio's version doesn't have this problem, so I guess the for (auto c : inputa) syntax helps clang figure out what's happening.
Also, using std::string when your callers have char[] forces them to construct a std::string. That's kind of silly, but it is helpful to be able to compare string lengths.
GNU libstdc++'s std::is_permutation uses a very different strategy:
First, it skips any common prefix that's identical without permutation in both strings.
Then, for each element in inputa:
count the occurrences of that element in inputb. Check that it matches the count in inputa.
There are a couple optimizations:
Only compare counts the first time an element is seen: find duplicates by searching from the beginning of inputa, and if the match position isn't the current position, we've already checked this element.
check that the match count in inputb is != 0 before counting matches in the rest of inputa.
This doesn't need any temporary storage, so it can work when the elements are large. (e.g. an array of int64_t, or an array of structs).
If there is a mismatch, this is likely to find it early, before doing as much work. There are probably a few cases of inputs where the counting version would take less time, but probably for most inputs the library algorithm is best.
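To make the strategy concrete, here is a rough paraphrase in code (my sketch of the approach described above, not the actual libstdc++ source; it assumes the lengths were already checked to be equal):

#include <algorithm>
#include <string>

bool is_permutation_sketch(const std::string &a, const std::string &b) {
    // 1. skip the common prefix
    std::size_t start = 0;
    while (start < a.size() && a[start] == b[start]) ++start;

    for (std::size_t i = start; i < a.size(); ++i) {
        // 2. only count the first time an element is seen in a
        if (a.find(a[i], start) != i) continue;
        // 3. count in b first: a zero count is an early mismatch
        auto in_b = std::count(b.begin() + start, b.end(), a[i]);
        if (in_b == 0) return false;
        if (in_b != std::count(a.begin() + i, a.end(), a[i])) return false;
    }
    return true;
}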
std::is_permutation uses std::count, which should be implemented very well with SSE / AVX vectors. Unfortunately, it's auto-vectorized in a really stupid way by both gcc and clang. It unpacks bytes to 64bit integers before accumulating them in vector elements, to avoid overflow. So it spends most of its instructions shuffling data around, and is probably slower than a scalar implementation (which you'd get from compiling with -O2, or with -O3 -fno-tree-vectorize).
It could and should only do this every few iterations, so the inner loop of count can just be something like pcmpeqb / psubb, with a psadbw every 255 iterations. Or pcmpeqb / pmovmskb / popcnt / add, but that's slower.
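For the curious, a sketch of that inner loop written with SSE2 intrinsics (an illustration of the pcmpeqb/psubb/psadbw idea, not actual library code):

#include <emmintrin.h>  // SSE2
#include <algorithm>
#include <cstddef>
#include <cstdint>

// count bytes equal to `target` in buf[0..n); assumes n is a multiple of 16
static size_t count_byte_sse2(const uint8_t *buf, size_t n, uint8_t target) {
    const __m128i vtarget = _mm_set1_epi8((char)target);
    __m128i total = _mm_setzero_si128();
    size_t i = 0;
    while (i < n) {
        __m128i acc = _mm_setzero_si128();
        // inner loop: pcmpeqb / psubb; the per-byte counters are good for
        // up to 255 iterations before they could overflow
        size_t chunk = std::min(n - i, (size_t)255 * 16);
        for (size_t end = i + chunk; i < end; i += 16) {
            __m128i v = _mm_loadu_si128((const __m128i *)(buf + i));
            // pcmpeqb yields 0xFF per matching byte; subtracting adds 1
            acc = _mm_sub_epi8(acc, _mm_cmpeq_epi8(v, vtarget));
        }
        // psadbw horizontally sums the byte counters into two 64-bit halves
        total = _mm_add_epi64(total, _mm_sad_epu8(acc, _mm_setzero_si128()));
    }
    alignas(16) uint64_t halves[2];
    _mm_store_si128((__m128i *)halves, total);
    return halves[0] + halves[1];
}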
Template specializations in the library could help a lot for std::count for 8, 16, and 32bit types whose equality can be checked with bitwise equality (integer ==).
I have a vector<bool> and I'd like to zero it out. I need the size to stay the same.
The normal approach is to iterate over all the elements and reset them. However, vector<bool> is a specially optimized container that, depending on implementation, may store only one bit per element. Is there a way to take advantage of this to clear the whole thing efficiently?
bitset, the fixed-length variant, has the set function. Does vector<bool> have something similar?
There seem to be a lot of guesses but very few facts in the answers that have been posted so far, so perhaps it would be worthwhile to do a little testing.
#include <vector>
#include <iostream>
#include <algorithm>
#include <time.h>

int seed(std::vector<bool> &b) {
    srand(1);
    for (int i = 0; i < b.size(); i++)
        b[i] = ((rand() & 1) != 0);
    int count = 0;
    for (int i = 0; i < b.size(); i++)
        if (b[i])
            ++count;
    return count;
}

int main() {
    std::vector<bool> bools(1024 * 1024 * 32);

    int count1 = seed(bools);
    clock_t start = clock();
    bools.assign(bools.size(), false);
    double using_assign = double(clock() - start) / CLOCKS_PER_SEC;

    int count2 = seed(bools);
    start = clock();
    for (int i = 0; i < bools.size(); i++)
        bools[i] = false;
    double using_loop = double(clock() - start) / CLOCKS_PER_SEC;

    int count3 = seed(bools);
    start = clock();
    size_t size = bools.size();
    bools.clear();
    bools.resize(size);
    double using_clear = double(clock() - start) / CLOCKS_PER_SEC;

    int count4 = seed(bools);
    start = clock();
    std::fill(bools.begin(), bools.end(), false);
    double using_fill = double(clock() - start) / CLOCKS_PER_SEC;

    std::cout << "Time using assign: " << using_assign << "\n";
    std::cout << "Time using loop: " << using_loop << "\n";
    std::cout << "Time using clear: " << using_clear << "\n";
    std::cout << "Time using fill: " << using_fill << "\n";
    std::cout << "Ignore: " << count1 << "\t" << count2 << "\t" << count3 << "\t" << count4 << "\n";
}
So this creates a vector, sets some randomly selected bits in it, counts them, and clears them (and repeats). The setting/counting/printing is done to ensure that even with aggressive optimization, the compiler can't/won't optimize out our code to clear the vector.
I found the results interesting, to say the least. First the result with VC++:
Time using assign: 0.141
Time using loop: 0.068
Time using clear: 0.141
Time using fill: 0.087
Ignore: 16777216 16777216 16777216 16777216
So, with VC++, the fastest method is what you'd probably initially think of as the most naive -- a loop that assigns to each individual item. With g++, the results are just a tad different though:
Time using assign: 0.002
Time using loop: 0.08
Time using clear: 0.002
Time using fill: 0.001
Ignore: 16777216 16777216 16777216 16777216
Here, the loop is (by far) the slowest method (and the others are basically tied -- the 1 ms difference in speed isn't really repeatable).
For what it's worth, in spite of this part of the test showing up as much faster with g++, the overall times were within 1% of each other (4.944 seconds for VC++, 4.915 seconds for g++).
Try
v.assign(v.size(), false);
Have a look at this link:
http://www.cplusplus.com/reference/vector/vector/assign/
Or the following:
std::fill(v.begin(), v.end(), false);
You are out of luck. std::vector<bool> is a specialization that apparently does not even guarantee contiguous memory or random access iterators (or even forward?!), at least based on my reading of cppreference -- decoding the standard would be the next step.
So write implementation specific code, pray and use some standard zeroing technique, or do not use the type. I vote 3.
The received wisdom is that it was a mistake and may become deprecated. Use a different container if possible. And definitely do not mess around with the internal guts, or rely on its packing. Check if you have a dynamic bitset in your std library mayhap, or roll your own wrapper around std::vector<unsigned char>.
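A minimal sketch of such a wrapper (the BoolVec name is mine; one byte per flag, assuming that memory cost is acceptable):

#include <vector>
#include <algorithm>
#include <cstddef>

// bool-vector stand-in backed by unsigned char, so clearing is a plain
// fill over contiguous bytes (typically lowered to memset)
struct BoolVec {
    std::vector<unsigned char> v;
    explicit BoolVec(std::size_t n) : v(n, 0) {}
    bool get(std::size_t i) const { return v[i] != 0; }
    void set(std::size_t i, bool b) { v[i] = b ? 1 : 0; }
    void clear_all() { std::fill(v.begin(), v.end(), 0); }
};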
I ran into this as a performance issue recently. I hadn't tried looking for answers on the web, but did find that assignment via the constructor was 10x faster using g++ -O3 (Debian 4.7.2-5) 4.7.2. I found this question because I was looking to avoid the additional malloc. It looks like assign is optimized as well as the constructor, and about twice as good in my benchmark:
unsigned sz = v.size();
for (unsigned ii = 0; ii != sz; ++ii)
    v[ii] = false;                    // baseline

v = std::vector<bool>(sz, false);     // 10x faster

v.assign(sz, false);                  // 20x faster
So, I wouldn't say to shy away from using the specialization of vector<bool>; just be very cognizant of the bit vector representation.
Use the std::vector<bool>::assign method, which is provided for this purpose.
If an implementation is specialized for bool, then assign is most likely also implemented appropriately.
If you're able to switch from vector<bool> to a custom bit vector representation, then you can use a representation designed specifically for fast clear operations, and get some potentially quite significant speedups (although not without tradeoffs).
The trick is to use integers per bit vector entry and a single 'rolling threshold' value that determines which entries currently evaluate to true.
You can then clear the bit vector by just increasing the single threshold value, without touching the rest of the data (until the threshold overflows).
A more complete write up about this, and some example code, can be found here.
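A minimal sketch of the idea, assuming 32-bit generation counters (the names are mine):

#include <vector>
#include <algorithm>
#include <cstdint>
#include <cstddef>

// "rolling threshold" (generation-stamped) bit vector with O(1) clear
class GenBitVector {
    std::vector<uint32_t> stamp_;  // one integer per "bit"
    uint32_t generation_ = 1;      // an entry is true iff stamp == generation
public:
    explicit GenBitVector(std::size_t n) : stamp_(n, 0) {}
    void set(std::size_t i)        { stamp_[i] = generation_; }
    bool test(std::size_t i) const { return stamp_[i] == generation_; }
    void clear_all() {
        // O(1) clear: bump the generation so old stamps no longer match
        if (++generation_ == 0) {  // on overflow, really reset the data
            std::fill(stamp_.begin(), stamp_.end(), 0);
            generation_ = 1;
        }
    }
};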
It seems that one nice option hasn't been mentioned yet:
auto size = v.size();
v.resize(0);
v.resize(size);
The STL implementer will supposedly have picked the most efficient means of zeroising, so we don't even need to know which particular method that might be. And this works with real vectors as well (think templates), not just the std::vector<bool> monstrosity.
There can be a minuscule added advantage for reused buffers in loops (e.g. sieves, whatever), where you simply resize to whatever will be needed for the current round, instead of to the original size.
As an alternative to std::vector<bool>, check out boost::dynamic_bitset (https://www.boost.org/doc/libs/1_72_0/libs/dynamic_bitset/dynamic_bitset.html). You can zero it out (i.e., set every element to false) by calling the reset() member function.
Like clearing, say, std::vector<int>, reset on a boost::dynamic_bitset can also compile down to a memset, whereas you probably won't get that with std::vector<bool>. For example, see https://godbolt.org/z/aqSGCi
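A quick usage sketch:

#include <boost/dynamic_bitset.hpp>

boost::dynamic_bitset<> bits(1024 * 1024); // a million-odd bits, all false
bits.set(42);                              // mark one
bits.reset();                              // zero every bit; the size stays the same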