C++ - Check if One Array of Strings Contains All Elements of Another


I've recently been porting a Python application to C++, but am now at a loss as to how I can port a specific function. Here's the corresponding Python code:
def foo(a, b):  # Where `a' is a list of strings, as is `b'
    for x in a:
        if not x in b:
            return False
    return True
I wish to have a similar function:
bool
foo (char* a[], char* b[])
{
// ...
}
What's the easiest way to do this? I've tried working with the STL algorithms, but can't seem to get them to work. For example, I currently have this (using the glib types):
gboolean
foo (gchar* a[], gchar* b[])
{
gboolean result;
std::sort (a, (a + (sizeof (a) / sizeof (*a))); // The second argument corresponds to the size of the array.
std::sort (b, (b + (sizeof (b) / sizeof (*b)));
result = std::includes (b, (b + (sizeof (b) / sizeof (*b))),
a, (a + (sizeof (a) / sizeof (*a))));
return result;
}
I'm more than willing to use features of C++11.

I'm just going to add a few comments to what others have stressed and give a better algorithm for what you want.
Do not use pointers here. Using pointers doesn't make it C++, it makes it bad coding. If you have a book that taught you C++ this way, throw it out. Just because a language has a feature does not mean it is proper to use it anywhere you can. If you want to become a professional programmer, you need to learn to use the appropriate parts of your languages for any given action. When you need a data structure, use the one appropriate to your activity. Pointers aren't data structures; they are reference types used when you need an object with state lifetime, i.e. when an object is created on one asynchronous event and destroyed on another. If an object lives its lifetime without any asynchronous wait, it can be modeled as a stack object and should be. Pointers should never be exposed to application code without being wrapped in an object, because standard operations (like new) throw exceptions, and pointers do not clean themselves up. In other words, pointers should always be used only inside classes and only when necessary to respond with dynamically created objects to events external to the class (which may be asynchronous).
Do not use arrays here. Arrays are simple homogeneous collection data types of stack lifetime of size known at compile time. They are not meant for iteration. If you need an object that allows iteration, there are types that have built in facilities for this. To do it with an array, though, means you are keeping track of a size variable external to the array. It also means you are enforcing external to the array that the iteration will not extend past the last element using a newly formed condition each iteration (note this is different than just managing size - it is managing an invariant, the reason you make classes in the first place). You do not get to reuse standard algorithms, are fighting decay-to-pointer, and generally are making brittle code. Arrays are (again) useful only if they are encapsulated and used where the only requirement is random access into a simple type, without iteration.
Do not sort a vector here. This one is just odd, because it is not a good translation from your original problem, and I'm not sure where it came from. Don't optimise early, but don't pessimise early by choosing a bad algorithm either. The requirement here is to look for each string inside another collection of strings. A sorted vector is an invariant (so, again, think something that needs to be encapsulated) - you can use existing classes from libraries like boost or roll your own. However, a little bit better on average is to use a hash table. With amortised O(N) lookup (with N the size of a lookup string - remember it's amortised O(1) number of hash-compares, and for strings this is O(N)), a natural first way to translate "look up a string" is to make an unordered_set<string> be your b in the algorithm. This changes the complexity of the algorithm from O(NM log P) (with N now the average size of strings in a, M the size of collection a and P the size of collection b), to O(NM). If the collection b grows large, this can be quite a savings.
In other words
gboolean foo(vector<string> const& a, unordered_set<string> const& b)
Note, you can now pass constant to the function. If you build your collections with their use in mind, then you often have potential extra savings down the line.
The point of this response is that you really should never get in the habit of writing code like that posted. It is a shame that there are a few really (really) bad books out there that teach coding with strings like this, and it is a real shame because there is no need to ever have code look that horrible. It fosters the idea that C++ is a tough language, when it has some really nice abstractions that do this more easily and with better performance than many standard idioms in other languages. An example of a good book that teaches you how to use the power of the language up front, so you don't build bad habits, is "Accelerated C++" by Koenig and Moo.
But also, you should always think about the points made here, independent of the language you are using. You should never try to enforce invariants outside of encapsulation - that was the biggest source of savings of reuse found in Object Oriented Design. And you should always choose your data structures appropriate for their actual use. And whenever possible, use the power of the language you are using to your advantage, to keep you from having to reinvent the wheel. C++ already has string management and compare built in, it already has efficient lookup data structures. It has the power to make many tasks that you can describe simply coded simply, if you give the problem a little thought.

Your first problem is related to the way arrays are (not) handled in C++. Arrays live a kind of very fragile shadow existence where, if you as much as look at them in a funny way, they are converted into pointers. Your function doesn't take two pointers-to-arrays as you expect. It takes two pointers to pointers.
In other words, you lose all information about the size of the arrays. sizeof(a) doesn't give you the size of the array. It gives you the size of a pointer to a pointer.
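The loss of size information can be seen in a small sketch. The names `bytes_seen_inside` and `element_count` are made up for illustration and are not part of the question's code; the second helper assumes you are free to take the array by reference instead of by pointer.

```cpp
#include <cstddef>

// A parameter written as `const char* a[]` is adjusted to `const char** a`,
// so sizeof(a) is the size of a pointer, not of the caller's array.
// (Hypothetical helper, for illustration only.)
std::size_t bytes_seen_inside(const char** a)
{
    return sizeof(a);
}

// Taking the array by reference keeps its extent N as part of the type,
// so the element count can be deduced by the compiler.
// (Hypothetical helper, for illustration only.)
template <std::size_t N>
std::size_t element_count(const char* (&)[N])
{
    return N;
}
```

Called with `const char* words[] = {"hello", "world", "stack"};`, the first helper reports the size of a pointer regardless of how many strings there are, while the second reports 3.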
So you have two options: the quick and dirty ad-hoc solution is to pass the array sizes explicitly:
gboolean foo (gchar** a, int a_size, gchar** b, int b_size)
Alternatively, and much nicer, you can use vectors instead of arrays:
gboolean foo (const std::vector<gchar*>& a, const std::vector<gchar*>& b)
Vectors are dynamically sized arrays, and as such, they know their size. a.size() will give you the number of elements in a vector. But they also have two convenient member functions, begin() and end(), designed to work with the standard library algorithms.
So, to sort a vector:
std::sort(a.begin(), a.end());
And likewise for std::includes.
Your second problem is that you don't operate on strings, but on char pointers. In other words, std::sort will sort by pointer address, rather than by string contents.
Again, you have two options:
If you insist on using char pointers instead of strings, you can specify a custom comparer for std::sort (using a lambda because you mentioned you were ok with them in a comment)
std::sort(a.begin(), a.end(), [](gchar* lhs, gchar* rhs) { return strcmp(lhs, rhs) < 0; });
Likewise, std::includes takes an optional fifth parameter used to compare elements. The same lambda could be used there.
Alternatively, you simply use std::string instead of your char pointers. Then the default comparer works:
gboolean
foo (std::vector<std::string> a, std::vector<std::string> b)
{
    // Taken by value: std::sort needs mutable ranges, and sorting copies
    // leaves the caller's vectors untouched.
    std::sort (a.begin(), a.end());
    std::sort (b.begin(), b.end());
    return std::includes (b.begin(), b.end(),
                          a.begin(), a.end());
}
Simpler, cleaner and safer.

The sort in the C++ version isn't working because it's sorting the pointer values (comparing them with std::less as it does with everything else). You can get around this by supplying a proper comparison functor. But why aren't you actually using std::string in the C++ code? The Python strings are real strings, so it makes sense to port them as real strings.

In your sample snippet your use of std::includes is pointless since it will use operator< to compare your elements. Unless you are storing the same pointers in both your arrays the operation will not yield the result you are looking for.
Comparing addresses is not the same thing as comparing the true content of your c-style-strings.
You'll also have to supply std::sort with the necessary comparator, preferably std::strcmp (wrapped in a functor).
It's currently suffering from the same problem as your use of std::includes, it's comparing addresses instead of the contents of your c-style-strings.
This whole "problem" could have been avoided by using std::strings and std::vectors.
Example snippet
#include <iostream>
#include <algorithm>
#include <cstring>
typedef char gchar;
gchar const * a1[5] = {
"hello", "world", "stack", "overflow", "internet"
};
gchar const * a2[] = {
"world", "internet", "hello"
};
...
int
main (int argc, char *argv[])
{
auto Sorter = [](gchar const* lhs, gchar const* rhs) {
return std::strcmp (lhs, rhs) < 0;
};
std::sort (a1, a1 + 5, Sorter);
std::sort (a2, a2 + 3, Sorter);
if (std::includes (a1, a1 + 5, a2, a2 + 3, Sorter)) {
std::cerr << "all elements in a2 was found in a1!\n";
} else {
std::cerr << "all elements in a2 wasn't found in a1!\n";
}
}
output
all elements in a2 was found in a1!

A naive transcription of the Python version would be:
bool foo(std::vector<std::string> const &a,std::vector<std::string> const &b) {
for(auto &s : a)
if(end(b) == std::find(begin(b),end(b),s))
return false;
return true;
}
It turns out that sorting the input is very slow. (And wrong in the face of duplicate elements.) Even the naive function is generally much faster. Just goes to show again that premature optimization is the root of all evil.
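The remark about duplicates can be illustrated with a small sketch. The function names `contains_all` and `sorted_includes` are hypothetical; the first mirrors the Python loop, the second is the sort-then-std::includes approach applied to copies of the inputs.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Element-wise membership test, equivalent to the Python loop.
bool contains_all(const std::vector<std::string>& a,
                  const std::vector<std::string>& b)
{
    return std::all_of(a.begin(), a.end(), [&b](const std::string& s) {
        return std::find(b.begin(), b.end(), s) != b.end();
    });
}

// Sort-then-includes variant; takes copies so that it can sort them.
// std::includes treats the ranges as multisets, so an element that occurs
// twice in `a` must also occur at least twice in `b`.
bool sorted_includes(std::vector<std::string> a, std::vector<std::string> b)
{
    std::sort(a.begin(), a.end());
    std::sort(b.begin(), b.end());
    return std::includes(b.begin(), b.end(), a.begin(), a.end());
}
```

With `a = {"x", "x"}` and `b = {"x", "y"}`, `contains_all` agrees with the Python function and returns true, while `sorted_includes` returns false because `b` holds only one "x".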
Here's an unordered_set version that is usually somewhat faster than the naive version (or was for the values/usage patterns I tested):
bool foo(std::vector<std::string> const& a,std::unordered_set<std::string> const& b) {
for(auto &s:a)
if(b.count(s) < 1)
return false;
return true;
}
On the other hand, if the vectors are already sorted and b is relatively small ( less than around 200k for me ) then std::includes is very fast. So if you care about speed you just have to optimize for the data and usage pattern you're actually dealing with.

Related

I can't get the right output that I want and the answer changes every time

So I am trying to code for this question:
Yes, I have to use arrays since it is a requirement.
Consider the problem of adding two n-bit binary integers, stored in two n-element arrays A and B. The sum of the two integers should be stored in binary form in an (n+1) element array C . State the problem formally and write pseudocode for adding the two integers.
I know that the ans array contains the correct output at the end of the addd function. However, I am not able to output that answer.
Below is my code. Please help me figure where in the code I'm going wrong, and what I can do to change it so it works. I will be very grateful.
#include <iostream>
using namespace std;
int * addd(int a[], int n1, int b[], int n2)
{
int s;
if(n1<n2) {s=n2+1;}
else {s=n1+1;}
int ans[s];
int i=n1-1, j=n2-1, k=s-1;
int carry=0;
while(i>=0 && j>=0 && k>0)
{
ans[k]=(a[i]+b[j]+carry)%2;
//cout<<k<<" "<<ans[k]<<endl;
carry=(a[i]+b[j]+carry)/2;
i--; j--; k--;
}
//cout<<"Carry "<<carry<<endl;
ans[0]=carry;
return ans;
}
int main(int argc, const char * argv[]) {
// insert code here...
int a[]={0,0,0,1,1,1};
int n1=sizeof(a)/sizeof(a[0]);
int b[]={1,0,1,1,0,1};
int n2=sizeof(b)/sizeof(b[0]);
int *p=addd(a,6,b,6);
// cout<<p[1]<<endl;
// cout<<p[0]<<" "<<p[1]<<" "<<p[2]<<" "<<p[3]<<" "<<p[4]<<" "<<p[5]<<" "<<p[6]<<endl;
return 0;
}

using namespace std;
Don't write using namespace std;. I have a summary I paste in from a file of common issues when I'm active in the Code Review Stack Exchange, but I don't have that here. Instead, you should just declare the symbols you need, like using std::cout;
int * addd(int a[], int n1, int b[], int n2)
The parameters of the form int a[] are very odd. This comes from C and is actually transformed into int* a and is not passing the array per-se.
The inputs should be const.
The names are not clear, but I'm guessing that n1 is the size of the array? In the C++ Core Guidelines, you'll see that passing a pointer plus length is strongly discouraged. The Guidelines Support Library (GSL) supplies a simple span type to use for this instead.
And the length should be size_t not int.
Based on the description, I think each element is only one bit, right? So why are the arrays of type int? I'd use bool or perhaps int8_t as being easier to work with.
What are you returning? If a and b and their lengths are the input, where is the output that you are returning a pointer to the beginning of? This is not giving value semantics, as you are returning a pointer to something that must exist elsewhere so what is its lifetime?
int s;
int ans[s];
return ans;
Well, there's your problem. First of all, declaring an array of a size that's not a constant is not even legal. (This is a GNU extension that implements C's VLA feature, but not without issues, as it breaks the C++ type system.)
Regardless of that, you are returning a pointer to the first element of the local array, so what happens to the memory when the function returns? Boom.
int s;
No. Initialize values when they are created.
if(n1<n2) {s=n2+1;}
else {s=n1+1;}
Learn the library.
How about:
const size_t s = 1+std::max(n1,n2);
and then the portable way to get your memory is:
std::vector<int> ans(s);
Your main logic will not work if one array is shorter than the other. The shorter input should behave as if it had leading zeros to match. Consider abstracting the problem of "getting the next bit" so you don't duplicate the code for handling each input and make an unreadable mess. You really should have learned to use collections and iterators first.
now:
return ans;
would work as intended since it is a value. You just need to declare the function to be the right type. So just use auto for the return type and it knows.
int n1=sizeof(a)/sizeof(a[0]);
Noooooooo.
There is a standard function to give the size of a built-in primitive array. But really, this should be done automatically as part of the passing, not as a separate thing, as noted earlier.
int *p=addd(a,6,b,6);
You wrote 6 instead of n1 etc.
Anyway, with the previous edits, it becomes:
using std::size;
const auto p = addd (a, size(a), b, size(b));
Finally, concerning:
cout<<p[0]<<" "<<p[1]<<" "<<p[2]<<" "<<p[3]<<" "<<p[4]<<" "<<p[5]<<" "<<p[6]<<endl;
How about using loops?
for (auto val : p) cout << val;
cout << '\n';
Oh, and don't use endl. It forces a flush every time, which is slow and not needed here: cout is flushed at program exit and whenever input is read from cin anyway. Modern best practice is to use '\n' and then flush explicitly if/when needed (like, never).

Let's look at:
int ans[s];
Apart from the fact that this is not even part of the standard, and the compiler is probably giving you some warnings, that statement allocates temporary memory on the stack, which gets deallocated on function exit. That's why you are getting different results every time: you are reading garbage, i.e. memory that in the meantime might have been overwritten.
You can replace it for example with
int* ans = new int[s];
Don't forget though to deallocate the memory when you have finished using the buffer (outside the function), to avoid memory leakage.
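If remembering the matching delete[] is a concern, one option (an assumption on my part, not something the question asked for) is to let std::unique_ptr own the buffer. A minimal sketch, with `make_result_buffer` as a hypothetical helper name:

```cpp
#include <cstddef>
#include <memory>

// Hypothetical helper: allocates a zero-initialised buffer of s ints on the
// heap. std::unique_ptr<int[]> calls delete[] automatically when it goes
// out of scope, so the caller cannot forget to release the memory.
std::unique_ptr<int[]> make_result_buffer(std::size_t s)
{
    return std::unique_ptr<int[]>(new int[s]());  // () value-initialises to 0
}
```

The returned object can be indexed like the raw pointer (`buf[k] = ...`), and the function could return it instead of `int*`.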
Some other notes:
int s;
if(n1<n2) {s=n2+1;}
else {s=n1+1;}
This can be more elegantly written as:
const int s = (n1 < n2) ? n2 + 1 : n1 + 1;
Also, the actual computation code is imprecise as it leads to wrong results if n1 is not equal to n2: You need further code to finish processing the remaining bits of the longest array. By the way you don't need to check on k > 0 because of the way you have defined s.
The following should work:
int i=n1-1, j=n2-1, k=s-1;
int carry=0;
while(i>=0 && j>=0)
{
ans[k]=(a[i]+b[j]+carry)%2;
carry=(a[i]+b[j]+carry)/2;
i--; j--; k--;
}
while(i>=0) {
ans[k]=(a[i]+carry)%2;
carry=(a[i]+carry)/2;
i--; k--;
}
while(j>=0) {
ans[k]=(b[j]+carry)%2;
carry=(b[j]+carry)/2;
j--; k--;
}
ans[0]=carry;
return ans;
}

If You Must Only Use C Arrays
Returning ans is returning the pointer to a local variable. The object the pointer refers to is no longer valid after the function has returned, so trying to read it would lead to undefined behavior.
One way to fix this is to pass in the address to an array to hold your answer, and populate that, instead of using a VLA (which is a non-standard C++ extension).
A VLA (variable length array) is an array which takes its size from a run-time computed value. In your case:
int s;
//... code that initializes s
int ans[s];
ans is a VLA because you are not using a constant to determine the array size. However, that is not a standard feature of the C++ language (it is an optional one in the C language).
You can modify your function so that ans is actually provided by the caller.
int * addd(int a[], int n1, int b[], int n2, int ans[])
{
//...
And then the caller would be responsible for passing in a large enough array to hold the answer.
Your function also appears to be incomplete.
while(i>=0 && j>=0 && k>0)
{
ans[k]=(a[i]+b[j]+carry)%2;
//cout<<k<<" "<<ans[k]<<endl;
carry=(a[i]+b[j]+carry)/2;
i--; j--; k--;
}
If one array is shorter than the other, then the index for the shorter array will reach 0 first. Then, when that corresponding index goes negative, the loop will stop, without handling the remaining terms in the longer array. This essentially makes the corresponding entries in ans be uninitialized. Reading those values results in undefined behavior.
To address this, you should populate the remaining entries in ans with the correct calculation based on carry and the remaining entries in the longer array.
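Putting the two points together, a minimal sketch of the C-array version might look like the following. It assumes the most significant bit is stored first, as in the question, and it returns void rather than `int *` since the caller already owns `ans`; that change is my own simplification.

```cpp
#include <algorithm>

// Sketch: adds two bit arrays (most significant bit first) into a
// caller-provided buffer. `ans` must have room for max(n1, n2) + 1 ints;
// guaranteeing that is the caller's responsibility.
void addd(const int a[], int n1, const int b[], int n2, int ans[])
{
    int i = n1 - 1, j = n2 - 1, k = std::max(n1, n2);
    int carry = 0;
    while (k > 0) {
        // An exhausted input behaves as if it had leading zeros.
        const int sum = (i >= 0 ? a[i--] : 0) + (j >= 0 ? b[j--] : 0) + carry;
        ans[k--] = sum % 2;
        carry = sum / 2;
    }
    ans[0] = carry;  // the extra, most significant bit
}
```

A caller would declare something like `int c[7];` for two 6-bit inputs and pass it as the last argument.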
A More C++ Approach
The original answer above was provided assuming you were constrained to only using C style arrays for both input and output, and that you wanted an answer that would allow you to stay close to your original implementation.
Below is a more C++ oriented solution, assuming you still need to provide C arrays as input, but otherwise no other constraint.
C Array Wrapper
A C array does not provide the amenities that you may be accustomed to have when using C++ containers. To gain some of these nice to have features, you can write an adapter that allows a C array to behave like a C++ container.
template <typename T, std::size_t N>
struct c_array_ref {
typedef T ARR_TYPE[N];
ARR_TYPE &arr_;
typedef T * iterator;
typedef std::reverse_iterator<T *> reverse_iterator;
c_array_ref (T (&arr)[N]) : arr_(arr) {}
std::size_t size () { return N; }
T & operator [] (int i) { return arr_[i]; }
operator ARR_TYPE & () { return arr_; }
iterator begin () { return &arr_[0]; }
iterator end () { return begin() + N; }
reverse_iterator rbegin () { return reverse_iterator(end()); }
reverse_iterator rend () { return reverse_iterator(begin()); }
};
Use C Array References
Instead of passing in two arguments as information about the array, you can pass in the array by reference, and use template argument deduction to deduce the array size.
Return a std::array
Although you cannot return a local C array like you attempted in your question, you can return an array that is wrapped inside a struct or class. That is precisely what the convenience container std::array provides. When you use C array references and template argument deduction to obtain the array size, you can now compute at compile time the proper array size that std::array should have for the return value.
template <std::size_t N1, std::size_t N2>
std::array<int, ((N1 < N2) ? N2 : N1) + 1>
addd(int (&a)[N1], int (&b)[N2])
{
Normalize the Input
It is much easier to solve the problem if you assume the arguments have been arranged in a particular order. If you always want the second argument to be the larger array, you can do that with a simple recursive call. This is perfectly safe, since we know the recursion will happen at most once.
if (N2 < N1) return addd(b, a);
Use C++ Containers (or Look-Alike Adapters)
We can now convert our arguments to the adapter shown earlier, and also create a std::array to hold the output.
c_array_ref<int, N1> aa(a);
c_array_ref<int, N2> bb(b);
std::array<int, std::max(N1, N2)+1> ans;
Leverage Existing Algorithms if Possible
In order to deal with the shortcomings of your original program, you can adjust your implementation a bit in an attempt to remove special cases. One way to do that is to add the longer array to 0 and store the result in the output. This can mostly be accomplished with a simple call to std::copy.
ans[0] = 0;
std::copy(bb.begin(), bb.end(), ans.begin() + 1);
Since we know the input consists of only 1s and 0s, we can compute straight addition from the shorter array into the longer array, without concern for carry (that will be addressed in the next step). To compute this addition, we apply std::transform with a lambda expression.
std::transform(aa.rbegin(), aa.rend(), ans.rbegin(),
ans.rbegin(),
[](int a, int b) -> int { return a + b; });
Lastly, we can make a pass over the output array to fix up the carry computation. After doing so, we are ready to return the result. The return is possible because we are using std::array to represent the answer.
for (auto i = ans.rbegin(); i != ans.rend()-1; ++i) {
*(i+1) += *i / 2;
*i %= 2;
}
return ans;
}
A Simpler main Function
We now only need to pass in the two arrays to the addd function, since template type deduction will discover the sizes of the arrays. In addition, the output generator can be handled more easily with an ostream_iterator.
int main(int, const char * []) {
int a[]={1,0,0,0,1,1,1};
int b[]={1,0,1,1,0,1};
auto p=addd(a,b);
std::copy(p.begin(), p.end(),
std::ostream_iterator<int>(std::cout, " "));
return 0;
}

If I may editorialize a bit... I think this is a deceptively difficult question for beginners, and as-stated should flag problems in the design review long before any attempt at coding. It's telling you to do things that are not good/typical/idiomatic/proper in C++, and distracting you with issues that get in the way of the actual logic to be developed.
Consider the core algorithm you wrote (and Antonio corrected): that can be understood and discussed without worrying about just how A and B are actually passed in for this code to use, or exactly what kind of collection it is. If they were std::vector, std::array, or primitive C array, the usage would be identical. Likewise, how does one return the result out of the code? You populate ans here, and how it is gotten into and/or out of the code and back to main is not relevant.
Primitive C arrays are not first-class objects in C++ and there are special rules (inherited from C) on how they are passed as arguments.
Returning is even worse, and returning dynamic-sized things was a major headache in C and memory management like this is a major source of bugs and security flaws. What we want is value semantics.
Second, using arrays and subscripts is not idiomatic in C++. You use iterators and abstract over the exact nature of the collection. If you were interested in writing super-efficient back-end code that doesn't itself deal with memory management (it's called by other code that deals with the actual collections involved) it would look like std::merge, which is a venerable function that dates back to the early 90's.
template< class InputIt1, class InputIt2, class OutputIt >
OutputIt merge( InputIt1 first1, InputIt1 last1,
InputIt2 first2, InputIt2 last2,
OutputIt d_first );
You can find others with similar signatures, that take two different ranges for input and outputs to a third area. If you write addp exactly like this, you could call it with primitive C arrays of hardcoded size:
int8_t A[] {0,0,0,1,1,1};
int8_t B[] {1,0,1,1,0,1};
int8_t C[ ??? ];
using std::begin; using std::end;
addp (begin(A),end(A), begin(B), end(B), begin(C));
Note that it's up to the caller to have prepared an output area large enough, and there's no error checking.
However, the same code can be used with vectors, or even any combination of different container types. This could populate a std::vector as the result by passing an insertion iterator. But in this particular algorithm that's difficult since you're computing it in reverse order.
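To illustrate the insertion-iterator idea with an algorithm that does run front to back, here is a small sketch using std::merge itself. The wrapper name `merge_sorted` is hypothetical; the point is only that the inputs are primitive C arrays while the output is a std::vector that grows as needed.

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>
#include <vector>

// Merges two sorted built-in arrays into a std::vector. The back_inserter
// appends each output element, so no output size has to be prepared by
// the caller. (Hypothetical wrapper, for illustration only.)
template <std::size_t N1, std::size_t N2>
std::vector<int> merge_sorted(const int (&a)[N1], const int (&b)[N2])
{
    std::vector<int> out;
    out.reserve(N1 + N2);
    std::merge(std::begin(a), std::end(a), std::begin(b), std::end(b),
               std::back_inserter(out));
    return out;
}
```

The same std::merge call would accept iterators from vectors, std::array, or any mix of them.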
std::array
Improving upon the situation with primitive C arrays, you could use the std::array class which is exactly the same array but without the strange passing/returning rules. It's actually just a primitive C array inside a wrapping struct. See this documentation: https://en.cppreference.com/w/cpp/container/array
So you could write it as:
using BBBNum1 = std::array<int8_t, 6>;
BBBNum1 addp (const BBBNum1& A, const BBBNum1& B) { ... }
The code inside can use A[i] etc. in the same way you are, but it also can get the size via A.size(). The issue here is that the inputs are the same length, and the output is the same as well (not 1 larger). Using templates, it could be written to make the lengths flexible but still only specified at compile time.
std::vector
The vector is like an array but with a run-time length. It's dynamic, and the go-to collection you should reach for in C++.
using BBBNum2 = std::vector<int8_t>;
BBBNum2 addp (const BBBNum2& A, const BBBNum2& B) { ... }
Again, the code inside this function can refer to B[j] etc. and use B.size() exactly the same as with the array collection. But now, the size is a run-time property, and can be different for each one.
You would create your result, as in my first post, by giving the size as a constructor argument, and then you can return the vector by-value. Note that the compiler will do this efficiently and not actually have to copy anything if you write:
auto C = addp (A, B);
now for the real work
OK, now that this distraction is at least out of the way, you can worry about actually writing the implementation. I hope you are convinced that using vector instead of a C primitive array does not affect your problem logic or even the (available) syntax of using subscripts. Especially since the problem referred to pseudocode, I interpret its use of "array" as "suitable indexable collection" and not specifically the primitive C array type.
The issue of going through 2 sequences together and dealing with differing lengths is actually a general purpose idea. In C++20, the Range library has things that make quick work of this. Older 3rd party libraries exist as well, and you might find it called zip or something like that.
But, let's look at writing it from scratch.
You want to read an item at a time from two inputs, but neatly make it look like they're the same length. You don't want to write the same code three times, or elaborate on the cases where A is shorter or where B may be shorter... just abstract out the idea that they are read together, and if one runs out it provides zeros.
This is its own piece of code that can be applied twice, to A and to B.
class backwards_bit_reader {
const BBBNum2& x;
size_t index;
public:
backwards_bit_reader(const BBBNum2& x) : x{x}, index{x.size()} {}
bool done() const { return index == 0; }
int8_t next()
{
if (done()) return 0; // keep reading infinite leading zeros
--index;
return x[index];
}
};
Now you can write something like:
backwards_bit_reader A_in { A };
backwards_bit_reader B_in { B };
while (!A_in.done() || !B_in.done()) { // keep going until both inputs are exhausted
const int a = A_in.next();
const int b = B_in.next();
const int c = a+b+carry;
carry = c/2; // update
C[--k]= c%2;
}
C[0]= carry; // the final bit, one longer than the input
It can be written far more compactly, but this is clear.
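For reference, a self-contained sketch of the whole idea, looping until both readers are exhausted so that the shorter input is padded with zeros. The function name `addp` and the `BBBNum2` alias follow the usage above; the constructor parameter name `bits` and the casts are my own choices.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

using BBBNum2 = std::vector<std::int8_t>;

// Reads bits from least to most significant. Once exhausted it keeps
// returning 0, which acts as an unlimited supply of leading zeros.
class backwards_bit_reader {
    const BBBNum2& x;
    std::size_t index;
public:
    explicit backwards_bit_reader(const BBBNum2& bits)
        : x(bits), index(bits.size()) {}
    bool done() const { return index == 0; }
    std::int8_t next() { return done() ? std::int8_t{0} : x[--index]; }
};

// Adds two binary numbers stored most significant bit first.
// The result is one element longer than the longer input.
BBBNum2 addp(const BBBNum2& A, const BBBNum2& B)
{
    BBBNum2 C(1 + std::max(A.size(), B.size()));
    backwards_bit_reader A_in(A);
    backwards_bit_reader B_in(B);
    std::size_t k = C.size();
    int carry = 0;
    while (!A_in.done() || !B_in.done()) {
        const int c = A_in.next() + B_in.next() + carry;
        carry = c / 2;
        C[--k] = static_cast<std::int8_t>(c % 2);
    }
    C[0] = static_cast<std::int8_t>(carry);  // the final, extra bit
    return C;
}
```

Because the reader hides which input ran out first, the loop body is written once and works for either argument order.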
another approach
The problem is, is writing backwards_bit_reader beyond what you've learned thus far? How else might you apply the same logic to both A and B without duplicating the statements?
You should be learning to recognize what's sometimes called "code smell". Repeating the same block of code multiple times, and repeating the same steps with nothing changed but which variable it's applying to, should be seen as ugly and unacceptable.
You can at least cut back the cases by ensuring that B is always the longer one, if they are of different length. Do this by swapping A and B if that's not the case, as a preliminary step. (Actually implementing that well is another digression)
But the logic is still nearly duplicated, since you have to deal with the possibility of the carry propagating all the way to the end. Just now you have 2 copies instead of 3.
Extending the shorter one, at least in façade, is the only way to write one loop.
how realistic is this problem?
It's simplified to the point of being silly, but if it's not done in base 2 but with larger values, this is actually implementing multi-precision arithmetic, which is a real thing people want to do. That's why I named the type above BBBNum for "Bad Binary Bignum".
Getting down to an actual range of memory and wanting the code to be fast and optimized is also something you want to do sometimes. The BigNum is one example; you often see this with string processing. But we'll want to make an efficient back-end that operates on memory without knowing how it was allocated, and higher-level wrappers that call it.
For example:
void addp (const int8_t* a_begin, const int8_t* a_end,
const int8_t* b_begin, const int8_t* b_end,
int8_t* result_begin, int8_t* result_end);
will use the provided range for output, not knowing or caring how it was allocated, and taking input that's any contiguous range without caring what type of container is used to manage it as long as it's contiguous. Note that as you saw with the std::merge example, it's more idiomatic to pass begin and end rather than begin and size.
But then you have helper functions like:
BBBNum2 addp (const BBBNum2& A, const BBBNum2& B)
{
BBBNum2 result (1+std::max(A.size(),B.size()));
addp (A.data(), A.data()+A.size(), B.data(), B.data()+B.size(), result.data(), result.data()+result.size());
return result;
}
Now the casual user can call it using vectors and a dynamically-created result, but it's still available to call for arrays, pre-allocated result buffers, etc.

Stable sorting a vector using std::sort

So I have some code like this, I want to sort the vector based on id and put the last overridden element first:
struct Data {
int64_t id;
double value;
};
std::vector<Data> v;
// add some Datas to v
// add some 'override' Datas with duplicated `id`s
std::sort(v.begin(), v.end(),
[](const Data& a, const Data& b) {
if (a.id < b.id) {
return true;
} else if (b.id < a.id) {
return false;
}
return &a > &b;
});
Since vectors are contiguous, &a > &b should work to put the appended overrides first in the sorted vector, which should be equivalent to using std::stable_sort, but I am not sure if there is a state in the std::sort implementation where the equal values would be swapped such that the address of an element that appeared later in the original vector is earlier now. I don't want to use stable_sort because it is significantly slower for my use case. I have also considered adding a field to the struct that keeps track of the original index, but I will need to copy the vector for that.
It seems to work here: https://onlinegdb.com/Hk8z1giqX

std::sort gives no guarantees whatsoever on when elements are compared, and in practice, I strongly suspect most implementations will misbehave for your comparator.
The common std::sort implementation is either plain quicksort or a hybrid sort (quicksort switching to a different sort for small ranges), implemented in-place to avoid using extra memory. As such, the comparator will be invoked with the same element at different memory addresses as the sort progresses; you can't use memory addresses to implement a stable sort.
Either add the necessary info to make the sort innately stable (e.g. the suggested initial index value) or use std::stable_sort. Using memory addresses to stabilize the sort won't work.
For the record, having experimented a bit, I suspect your test case is too small to trigger the issue. At a guess, the hybrid sorting strategy works coincidentally for smallish vectors, but breaks down when the vector gets large enough for an actual quicksort to occur. Once I increase your vector size with some more filler, the stability disappears.
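A minimal sketch of the "original index" idea, assuming it is acceptable to sort a vector of indices and build the result from it. The function name sort_overrides_first is made up for illustration. The tie-break uses the position in the original vector, which never changes during the sort, instead of the element's address:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

struct Data {
    std::int64_t id;
    double value;
};

// Returns a copy of v sorted by id. Among equal ids, the element that
// appeared later in v (the override) comes first.
std::vector<Data> sort_overrides_first(const std::vector<Data>& v) {
    // order[k] is an index into v; we sort the indices, not the elements.
    std::vector<std::size_t> order(v.size());
    std::iota(order.begin(), order.end(), std::size_t{0});

    std::sort(order.begin(), order.end(),
              [&v](std::size_t i, std::size_t j) {
                  if (v[i].id != v[j].id) return v[i].id < v[j].id;
                  return i > j;  // later original position sorts first
              });

    std::vector<Data> out;
    out.reserve(v.size());
    for (std::size_t i : order) out.push_back(v[i]);
    return out;
}
```

This costs one extra vector of indices plus the output copy, so whether it beats std::stable_sort for a given workload is something to measure rather than assume.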

Is it acceptable to use std::merge for overlapping ranges

I have an algorithm which requires applying set union many times to growing sets of integers. For efficiency I represent the sets as sorted vectors, so that their union can be obtained by merging them.
A classical way to merge two sorted vectors is this:
void inmerge(vector<int> &a, const vector<int> &b) {
a.reserve(a.size() + b.size());
std::copy(b.begin(), b.end(), std::back_inserter(a));
std::inplace_merge(a.begin(), a.end() - b.size(), a.end());
}
Unfortunately, std::inplace_merge appears to be much slower than std::sort in this case, because of the allocation overhead. The fastest way is to use std::merge directly to output into one of the vectors. In order not to write a value before reading it, we have to proceed from the ends, like this:
void inmerge(vector<int> &a, const vector<int> &b) {
a.resize(a.size() + b.size());
auto orig_a_rbegin = a.rbegin() + b.size();
// Merge back to front; the output range overlaps the input range inside a.
std::merge(orig_a_rbegin, a.rend(), b.rbegin(), b.rend(), a.rbegin(), [](int x, int y) { return x > y; });
}
An implementation of merge will surely never write more elements than it has read, so this looks like a safe thing to do. Unfortunately, the C++ standard (even the C++17 draft) forbids this:
The resulting range shall not overlap with either of the original ranges.
Is it okay to ignore this restriction if I know what I'm doing?

No, ignoring a mandate of the standard (or any other documentation of some library you're using) is never ok. You may know what you are doing, but are you sure you know what the library is doing - or might be doing in the next version?
For example, the merge algorithm could detect that at least two of your ranges are reverse ranges, unwrap them (and unwrap or reverse the third), and do the merge in the other direction. No observable difference as long as the preconditions are kept, but possibly a tiny bit faster since the overhead of the reverse iterators is gone. But it would really screw with your code.

To state it simply: No.
A bit longer: If you ignore a mandate by the standard you end up in Undefined Behaviour land and your compiler is free to do whatever it wants.
This includes doing exactly what you expect, doing nothing at all, crashing the program, deleting all your files or summoning nasal demons. That's not a place you want to be.
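If the goal is just the back-to-front merge, one way to stay within the rules is to write the loop by hand instead of calling std::merge on overlapping ranges. The following is only a sketch under the question's assumptions (both vectors sorted ascending, a and b distinct objects), not a drop-in replacement that has been benchmarked:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Merges sorted b into sorted a, filling a from the back so that no
// element of a is overwritten before it has been read.
void inmerge(std::vector<int>& a, const std::vector<int>& b) {
    assert(&a != &b);            // the loop below assumes distinct vectors
    std::size_t i = a.size();    // unread elements of the original a: [0, i)
    std::size_t j = b.size();    // unread elements of b: [0, j)
    a.resize(a.size() + b.size());
    std::size_t k = a.size();    // next write position is k - 1; k == i + j

    while (j > 0) {
        if (i > 0 && a[i - 1] > b[j - 1]) {
            a[--k] = a[--i];
        } else {
            a[--k] = b[--j];
        }
    }
    // When b is exhausted, k == i, so the rest of a is already in place.
}
```

Like the std::merge version in the question, this keeps duplicates; producing a true set union would need an extra deduplication step.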

Is there an easy way to sort an array of char*'s ? C++

I've got an array of char* in a file.
The company I work for stores data in flat files.. Sometimes the data is sorted, but sometimes it's not.
I'd like to sort the data in the files.
Now I could write the code to do this, from scratch.
Is there an easier way?
Of course an in-place sort would be the best option. I'm working on large files and have little RAM. But I'll consider all options.
All strings are the same length.
This is some sample data:
the data is of fixed length
the Data is of fixed length
thIS data is of fixed lengt
This would represent three records of length 28. The app knows the length. Each record ends with CRLF (\r\n), though it shouldn't matter for this sort.

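// Compares two fixed-length records byte by byte; true if left sorts before right.
template<size_t length> bool less(const char* left, const char* right) {
return memcmp(left, right, length) < 0;
}
// buffer_length must be a compile-time constant: the record length in bytes.
std::sort(array, array + array_length, less<buffer_length>);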

Use the GNU sort program (externally) if you can't fit the data into RAM: it will sort arbitrary sized files and the larger the file, the smaller the additional cost of creating the process.

You can use the algorithms in the STL on arrays of native datatypes, not just on STL containers. The other suggestion to use std::sort won't work as posted, however, because strcmp returns a value that evaluates to true for all comparisons when the strings aren't the same, not just when the left hand side is less than the right hand side. std::sort wants exactly the latter: a binary predicate returning true if the left hand side is less than the right hand side.
This works:
#include <algorithm>
#include <cstring>
// Strict weak ordering for null-terminated C strings.
struct string_lt
{
bool operator()(const char* lhs, const char* rhs) const
{
return strcmp(lhs, rhs) < 0;
}
};
int main()
{
// String literals are const, so the array holds const char*.
const char* strings [] = {"Hello", "World", "Alpha", "Beta", "Omega"};
size_t numStrings = sizeof(strings)/sizeof(strings[0]);
std::sort(strings, strings + numStrings, string_lt());
return 0;
}

boost::bind can do it:
// ascending
std::sort(c, c + size, boost::bind(std::strcmp, _1, _2) < 0);
// descending
std::sort(c, c + size, boost::bind(std::strcmp, _1, _2) > 0);
Edit: The strings are not null-terminated:
// ascending
std::sort(c, c + array_size, boost::bind(std::memcmp, _1, _2, size) < 0);
// descending
std::sort(c, c + array_size, boost::bind(std::memcmp, _1, _2, size) > 0);
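Since the question allows newer compilers, the same comparator can be written without Boost as a C++11 lambda. This is a sketch only; the function name sort_records and its parameters are invented for the example, and it assumes every pointer refers to at least record_len readable bytes:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>

// Sorts an array of pointers to fixed-length records in ascending byte
// order. The records need not be null-terminated.
void sort_records(const char** records, std::size_t count,
                  std::size_t record_len) {
    assert(records != nullptr || count == 0);
    std::sort(records, records + count,
              [record_len](const char* lhs, const char* rhs) {
                  return std::memcmp(lhs, rhs, record_len) < 0;
              });
}
```

For descending order, swap lhs and rhs inside the memcmp call or compare with > 0.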

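Probably the easiest way is to use the old stdlib.h function qsort.
This should work, given a small comparison wrapper (strcmp cannot be passed directly, because qsort calls the comparator with pointers to the array elements, i.e. char** passed as const void*):
qsort( array, num_elements, sizeof( char* ), compare_strings ); /* compare_strings: a wrapper you write that dereferences both arguments and calls strcmp */
Please note that this is standard C, and that strcmp compares raw byte values, so the resulting order is only reliable for plain ASCII (English) text.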
If you have a list of String objects, then other things are possible in C++.
If you are on Linux and writing a gtk or Qt application then I would propose that you have a look at these libraries beforehand.

If the files are large and do not fit in RAM, you can use bin/bucket sort to split the data into smaller files and finally aggregate the pieces in a result file. Other responses show you how to sort each individual bucket file.

The canonical way to sort an array of character strings in C, and therefore an available but not necessarily recommended way to do so in C++, uses a level of indirection to strcmp():
static int qsort_strcmp(const void *v1, const void *v2)
{
const char *s1 = *(char * const *)v1;
const char *s2 = *(char * const *)v2;
return(strcmp(s1, s2));
}
static void somefunc(void) // Or omit the parameter altogether in C++
{
char **array = ...assignment...
size_t num_in_array = ...number of char pointers in array...
...
qsort(array, num_in_array, sizeof(char *), qsort_strcmp);
...more code...
}

A few things come to mind:
If your data is too big to fit into memory, you may want to just build up an in-memory index of file offsets, then memory-map the file to access the strings (depends on your OS).
In-place is going to require a lot of memory copies. If you can, use a shell sort. Then once you know the final order, it's much easier to reorder the strings in-place in linear time.
If the strings are all the same length, you really want a radix sort. If you're not familiar with a radix sort, here's the basic idea: comparison-based sorting (which is what std::sort, qsort, and any other general-purpose sort do) always requires O(N log N) time. Radix sorting looks at a single digit at a time (starting at str[0] and ending at str[K-1] for a K-length string), and overall requires only O(N*K) time, which is linear in N for a fixed K.
Consult the Internet for a much more detailed description of radix sorting algorithms than I can provide. Aside from what I've said, I would avoid all of the other solutions that use standard library sorting facilities. They just aren't designed for your particular problem, unfortunately.
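To make the radix idea concrete, here is a rough sketch of the least-significant-digit (LSD) variant, which works from the last byte to the first rather than from str[0] as described above, because it is the shorter one to write. It is an in-memory version that assumes the records are held as std::string values of at least len bytes each; the name radix_sort_fixed is made up for this example:

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// LSD radix sort: one stable counting-sort pass per byte position,
// from the last byte to the first. Orders records by unsigned byte value.
void radix_sort_fixed(std::vector<std::string>& recs, std::size_t len) {
    std::vector<std::string> tmp(recs.size());
    for (std::size_t pos = len; pos-- > 0; ) {
        // count[b + 1] first holds how many records have byte b at pos.
        std::array<std::size_t, 257> count{};
        for (const std::string& r : recs) {
            assert(r.size() >= len);
            ++count[static_cast<unsigned char>(r[pos]) + 1];
        }
        // Prefix sums: count[b] becomes the first output slot for byte b.
        for (std::size_t b = 0; b < 256; ++b) count[b + 1] += count[b];
        // Stable scatter into tmp, then make tmp the current sequence.
        for (std::string& r : recs) {
            tmp[count[static_cast<unsigned char>(r[pos])]++] = std::move(r);
        }
        recs.swap(tmp);
    }
}
```

Note that this uses a second buffer of the same size, so it trades memory for speed; it is not the in-place sort the question asked for.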

You probably want to look into memory-mapped files (see http://en.wikipedia.org/wiki/Memory-mapped_file), i.e. the mmap() function (http://en.wikipedia.org/wiki/Mmap) on POSIX-compliant OSes. You'll essentially get a pointer to contiguous memory representing the file's contents.
The good side is that the OS will take care of loading parts of the file into memory and unloading them again, as needed.
One downside is that you'll need to resort to some form of file locking to avoid corruption if more than one process is likely to access the file.
Another downside is that this doesn't guarantee good performance - to do that, you'll need a sorting algorithm that tries to avoid constantly loading and unloading pages (unless of course you have enough memory to load the entire file into memory).
Hope this has given you some ideas!
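Once the file contents are available as one contiguous buffer (how the pointer is obtained, for example from mmap(), is deliberately left out here), the fixed-size records can be sorted where they sit. A possible sketch using std::qsort follows; the names compare_records and sort_mapped_records are invented, and RecordSize is assumed to be the full record size in bytes including the trailing CRLF:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <cstring>

// qsort comparator for records of exactly RecordSize bytes.
template <std::size_t RecordSize>
int compare_records(const void* lhs, const void* rhs) {
    return std::memcmp(lhs, rhs, RecordSize);
}

// Sorts record_count records stored back to back starting at base.
template <std::size_t RecordSize>
void sort_mapped_records(char* base, std::size_t record_count) {
    assert(base != nullptr || record_count == 0);
    std::qsort(base, record_count, RecordSize,
               &compare_records<RecordSize>);
}
```

Keep in mind that some qsort implementations allocate scratch memory internally, and that the access pattern may page heavily on a file much larger than RAM, as noted above.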

Best way to store constant data in C++

I have an array of constant data like following:
enum Language {GERMAN=LANG_DE, ENGLISH=LANG_EN, ...};
struct LanguageName {
Language language;
const char *name;
};
const LanguageName languages[] = {
{GERMAN, "German"},
{ENGLISH, "English"},
.
.
.
};
I have a function which accesses the array and finds the entry based on the Language enum parameter. Should I write a loop to find the specific entry in the array, or are there better ways to do this?
I know I could add the LanguageName-objects to an std::map but wouldn't this be overkill for such a simple problem? I do not have an object to store the std::map so the map would be constructed for every call of the function.
What way would you recommend?
Is it better to encapsulate this compile time constant array in a class which handles the lookup?

If the enum values are contiguous starting from 0, use an array with the enum as index.
If not, this is what I usually do:
const char* find_language(Language lang)
{
typedef std::map<Language,const char*> lang_map_type;
typedef lang_map_type::value_type lang_map_entry_type;
static const lang_map_entry_type lang_map_entries[] = { /*...*/ };
static const lang_map_type lang_map( lang_map_entries
, lang_map_entries + sizeof(lang_map_entries)
/ sizeof(lang_map_entries[0]) );
lang_map_type::const_iterator it = lang_map.find(lang);
if( it == lang_map.end() ) return NULL;
return it->second;
}
If you consider a map for constants, always also consider using a vector.
Function-local statics are a nice way to get rid of a good part of the dependency problems of globals, but before C++11 their initialization was not guaranteed to be thread-safe (since C++11 the language guarantees it). If you're worried about that on an older compiler, you might rather want to use globals:
typedef std::map<Language,const char*> lang_map_type;
typedef lang_map_type::value_type lang_map_entry_type;
const lang_map_entry_type lang_map_entries[] = { /*...*/ };
const lang_map_type lang_map( lang_map_entries
, lang_map_entries + sizeof(lang_map_entries)
/ sizeof(lang_map_entries[0]) );
const char* find_language(Language lang)
{
lang_map_type::const_iterator it = lang_map.find(lang);
if( it == lang_map.end() ) return NULL;
return it->second;
}
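With C++11 the helper array is no longer needed, because the map can be brace-initialized directly. A small sketch follows; the enum values 7 and 9 are placeholders standing in for LANG_DE and LANG_EN, which the question does not show:

```cpp
#include <cassert>
#include <map>

// Hypothetical, deliberately non-contiguous values.
enum Language { GERMAN = 7, ENGLISH = 9 };

// Returns the display name for lang, or nullptr if it is unknown.
const char* find_language(Language lang) {
    // Built once, on first call; C++11 makes this initialization thread-safe.
    static const std::map<Language, const char*> lang_map = {
        {GERMAN, "German"},
        {ENGLISH, "English"},
    };
    const auto it = lang_map.find(lang);
    return it == lang_map.end() ? nullptr : it->second;
}
```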

There are three basic approaches that I'd choose from. One is the switch statement, and it is a very good option under certain conditions. Remember - the compiler is probably going to compile that into an efficient table-lookup for you, though it will be looking up pointers to the case code blocks rather than data values.
Options two and three involve static arrays of the type you are using. Option two is a simple linear search - which you are (I think) already doing - very appropriate if the number of items is small.
Option three is a binary search. Static arrays can be used with standard library algorithms - just use the first and first+count pointers in the same way that you'd use begin and end iterators. You will need to ensure the data is sorted (using std::sort or std::stable_sort), and use std::lower_bound to do the binary search.
The complication in this case is that you'll need a comparison function object which acts like operator< with a stored or referenced value, but which only looks at the key field of your struct. The following is a rough template...
class cMyComparison
{
private:
const fieldtype& m_Value; // Note - only storing a reference
public:
cMyComparison (const fieldtype& p_Value) : m_Value (p_Value) {}
bool operator() (const structtype& p_Struct) const
{
return (p_Struct.field < m_Value);
// Warning : I have a habit of getting this comparison backwards,
// and I haven't double-checked this
}
};
This kind of thing should get simpler in the next C++ standard revision, when IIRC we'll get anonymous functions (lambdas) and closures.
If you can't put the sort in your app's initialisation, you might need an already-sorted boolean static variable to ensure you only sort once.
Note - this is for information only - in your case, I think you should either stick with linear search or use a switch statement. The binary search is probably only a good idea when...
There are a lot of data items to search
Searches are done very frequently (many times per second)
The key enumerate values are sparse (lots of big gaps) - otherwise, switch is better.
If the coding effort were trivial, it wouldn't be a big deal, but C++ currently makes this a bit harder than it should be.
One minor note - it may be a good idea to define an enumerate for the size of your array, and to ensure that your static array declaration uses that enumerate. That way, your compiler should complain if you modify the table (add/remove items) and forget to update the size enum, so your searches should never miss items or go out of bounds.
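For reference, here is roughly what the binary-search option can look like with a C++11 lambda. Note that std::lower_bound expects a binary comparator taking (element, key), so the lambda receives both instead of storing the key. The enum values and the name find_language_name are assumptions made for the sketch, and the table is written out already sorted by key instead of being sorted at startup:

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>

// Hypothetical sparse values.
enum Language { GERMAN = 7, ENGLISH = 9, DUTCH = 12 };

struct LanguageName {
    Language language;
    const char* name;
};

// Must stay sorted by `language` for the binary search to be valid.
static const LanguageName languages[] = {
    {GERMAN, "German"},
    {ENGLISH, "English"},
    {DUTCH, "Dutch"},
};

const char* find_language_name(Language lang) {
    const LanguageName* first = std::begin(languages);
    const LanguageName* last = std::end(languages);
    const auto by_key = [](const LanguageName& entry, Language key) {
        return entry.language < key;
    };
    // Debug-build check that nobody broke the ordering of the table.
    assert(std::is_sorted(first, last,
                          [](const LanguageName& a, const LanguageName& b) {
                              return a.language < b.language;
                          }));
    const LanguageName* it = std::lower_bound(first, last, lang, by_key);
    return (it != last && it->language == lang) ? it->name : nullptr;
}
```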

I think you have two questions here:
What is the best way to store a constant global variable (with possible Multi-Threaded access) ?
How to store your data (which container use) ?
The solution described by sbi is elegant, but you should be aware of 2 potential problems:
In case of multi-threaded access, the initialization could be screwed up.
You will potentially attempt to access this variable after its destruction.
Both issues on the lifetime of static objects are being covered in another thread.
Let's begin with the constant global variable storage issue.
The solution proposed by sbi is therefore adequate if you are not concerned by 1. or 2.; in any other case I would recommend the use of a Singleton, such as the ones provided by Loki. Read the associated documentation to understand the various policies on lifetime; it is very valuable.
I think that the use of an array + a map seems wasteful and it hurts my eyes to read this. I personally prefer a slightly more elegant (imho) solution.
const char* find_language(Language lang)
{
typedef std::map<Language, const char*> map_type;
typedef map_type::value_type value_type;
// I'll let you work out how 'my_stl_builder' works,
// it makes for an interesting exercise and it's easy enough
// Note that even if this is slightly slower (?), it is only executed ONCE!
static const map_type lang_map = my_stl_builder<map_type>()
<< value_type(GERMAN, "German")
<< value_type(ENGLISH, "English")
<< value_type(DUTCH, "Dutch")
....
;
map_type::const_iterator it = lang_map.find(lang);
if( it == lang_map.end() ) return NULL;
return it->second;
}
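For anyone who would rather not do the exercise, one possible shape for my_stl_builder is sketched below. This is a guess at what the answer has in mind, not its actual implementation: a small class that collects values through operator<< and converts to the container at the end. It assumes the container has insert(value_type), as std::map and std::set do:

```cpp
#include <cassert>
#include <map>

// Collects elements with operator<< and converts to Container on demand.
template <typename Container>
class my_stl_builder {
public:
    my_stl_builder& operator<<(const typename Container::value_type& v) {
        container_.insert(v);
        return *this;
    }

    // Lets `const Container c = my_stl_builder<Container>() << ...;` compile.
    operator Container() const { return container_; }

private:
    Container container_;
};
```

The final conversion copies the container once, which is irrelevant for a table that is built a single time.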
And now on to the container type issue.
If you are concerned about performance, then you should be aware that for small data collections, a vector of pairs is normally more efficient for lookups than a map. Once again I would turn toward Loki (and its AssocVector), but really I don't think that you should worry about performance.
I tend to choose my container depending on the interface I am likely to need first and here the map interface is really what you want.
Also: why do you use 'const char*' rather than a 'std::string'?
I have seen too many people using a 'const char*' like a std::string (like in forgetting that you have to use strcmp) to be bothered by the alleged loss of memory / performance...

It depends on the purpose of the array. If you plan on showing the values in a list (for a user selection, perhaps) the array would be the most efficient way of storing them. If you plan on frequently looking up values by their enum key, you should look into a more efficient data structure like a map.

There is no need to write a loop. If the enum values are contiguous and start at 0, you can use the enum value as an index into the array.

I would make an enum with sequential language codes
enum { GERMAN=0, ENGLISH, SWAHILI, ENOUGH };
Then put them all into an array:
const char *langnames[] = {
"German", "English", "Swahili"
};
Then I would check if sizeof(langnames)==sizeof(*langnames)*ENOUGH in debug build.
And pray that I have no duplicates or swapped languages ;-)
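With C++11 the size check does not have to wait for a debug build; static_assert moves it to compile time. A short sketch (the accessor language_name is an addition for illustration, and the check still cannot catch swapped or duplicated names):

```cpp
#include <cassert>
#include <cstddef>

enum Language { GERMAN = 0, ENGLISH, SWAHILI, ENOUGH };

const char* const langnames[] = { "German", "English", "Swahili" };

// Fails to compile if a language is added without a matching name.
static_assert(sizeof(langnames) / sizeof(langnames[0]) ==
                  static_cast<std::size_t>(ENOUGH),
              "langnames must have exactly one entry per language");

// Returns the name for lang, or nullptr for out-of-range values.
const char* language_name(Language lang) {
    return (lang >= GERMAN && lang < ENOUGH) ? langnames[lang] : nullptr;
}
```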

If you want a fast and simple solution, you can try something like this:
enum ELanguage {GERMAN=0, ENGLISH=1};
static const string Ger="GERMAN";
static const string Eng="ENGLISH";
bool getLanguage(const ELanguage& aIndex,string & arName)
{
switch(aIndex)
{
case GERMAN:
{
arName=Ger;
return true;
}
case ENGLISH:
{
arName=Eng;
return true; // without this, control falls through to default and returns false
}
default:
{
// Log Error
return false;
}
}
}
