Clamp iterator value to end() with std::min - c++

I have a vector with n strings in it. Now let's say I want to "page" or "group" those strings in a map.
typedef std::vector<std::string> TStringVec;
TStringVec myVec;
// ... fill myVec with n elements
typedef std::map<int, TStringVec> TPagedMap;
TPagedMap myMap;
int ItemsPerPage = 3; // or whatever
int PagesRequired = std::ceil(myVec.size() / ItemsPerPage);
for (int Page = 0; Page < PagesRequired; ++Page)
{
TStringVec::const_iterator Begin = myVec.begin() + (ItemsPerPage * Page);
TStringVec::const_iterator End = myVec.begin() + ItemsPerPage * (Page+1);
myMap[Page] = TStringVec(Begin, End);
}
One can easily spot the problem here. When determining the end iterator, I risk leaving the allocated space by the vector.
Quick example: 5 elements in the vector, ItemsPerPage is 3. That means we need a total of 2 pages in the map to group all elements.
Now when hitting the last iteration, begin is pointing at myVec[3] but end is "pointing" to myVec[6]. Remember, myVec only has 5 elements.
Could this case be safely handled by swapping
TStringVec::const_iterator End = myVec.begin() + ItemsPerPage * (Page+1);
with
TStringVec::const_iterator End = std::min(myVec.begin() + ItemsPerPage * (Page+1), myVec.end());
It compiles of course, and it seems to work. But I'm not sure if this can be considered a safe thing to do. Any advice or a definitive answer?
I think the question is... Is a value "past" .end() guaranteed to be larger than the address returned by .end()?
Thanks in advance.
EDIT: of course, an if check beforehand could solve the problem, but I'm looking for a more elegant solution.

There are two things wrong with your proposed replacement:
You are potentially creating an iterator past the end-iterator, which is UB.
For some unfathomable reason, you want to stop at end() - 1?? Everywhere else, you properly use half-open ranges.
What you want is more like
auto End = myVec.cbegin() + std::min<std::size_t>(ItemsPerPage * (Page + 1), myVec.size());
Also take note that I used auto to avoid needlessly specifying complicated type-names.
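As an aside, calling std::ceil on the result of an integer division does nothing useful: the division has already truncated by the time std::ceil sees the value (5 / 3 is 1, not 1.67). Use integer ceiling division instead, for example (myVec.size() + ItemsPerPage - 1) / ItemsPerPage.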
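Putting the two points together, here is a minimal sketch of the whole paging loop. The function name paginate and its signature are made up for illustration; the typedefs follow the question. It clamps the index, not the iterator, so no iterator past end() is ever formed.

```cpp
#include <algorithm>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

typedef std::vector<std::string> TStringVec;
typedef std::map<int, TStringVec> TPagedMap;

// Hypothetical helper: splits `items` into pages of at most `itemsPerPage` elements.
TPagedMap paginate(const TStringVec& items, std::size_t itemsPerPage)
{
    TPagedMap pages;
    if (itemsPerPage == 0)
        return pages; // a page size of zero makes no sense; return an empty map

    // Integer ceiling division, no round-trip through double.
    const std::size_t pageCount = (items.size() + itemsPerPage - 1) / itemsPerPage;

    for (std::size_t page = 0; page < pageCount; ++page)
    {
        const std::size_t first = page * itemsPerPage;
        // Clamp the index to size() before turning it into an iterator.
        const std::size_t last = std::min(first + itemsPerPage, items.size());
        pages[static_cast<int>(page)] = TStringVec(items.begin() + first, items.begin() + last);
    }
    return pages;
}
```

With 5 elements and 3 items per page this yields two pages, the second holding the remaining 2 elements.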

With range-v3, you may use chunk view:
std::vector<std::string> myVec = /*...*/;
const int ItemsPerPage = 3; // or whatever
std::map<int, std::vector<std::string>> myMap;
int counter = 0;
for (const auto& page : myVec | ranges::view::chunk(ItemsPerPage)) {
myMap[counter++] = page;
}

Related

Fast 'group by/count' std::vector<std::u16string> into a std::map<u16string, int>

I have a function that reads ~10000 words into a vector, I then want to group all the words into a map to 'count' how many times a certain word appears.
While the code 'works' it can sometimes take 2 seconds to re-build the map.
NB: Unfortunately, I cannot change the 'read' function, I have to work with the vector of std::u16string.
std::vector<std::u16string> vValues;
vValues.push_back( ... )
...
std::map<std::u16string, int> mValues;
for( auto it = vValues.begin(); it != vValues.end(); ++it )
{
if( mValues.find( *it ) == mValues.end() )
{
mValues[*it] = 1;
}
else
{
++mValues[*it];
}
}
How could I speed up the 'group by' while keeping track of the number of times the word appears in the vector?
If you call std::map::operator[] with a new key, the mapped value is value-initialized (to 0 for scalar types like int). So your loop can be simplified to:
for (auto it = vValues.begin(); it != vValues.end(); ++it)
++mValues[*it];
If there is no key *it, then the default value will be 0, but then it is incremented immediately, and it becomes 1.
If the key already exists, then it is simply incremented.
Furthermore, it doesn't look like you need the map to be ordered, so you can use a std::unordered_map instead, as insertion is average constant time, instead of logarithmic, which would speed it up even further.
std::vector<std::u16string> vValues;
vValues.push_back( ... )
...
std::sort( vValues.begin(), vValues.end() );
struct counted {
std::u16string value;
std::size_t count;
};
std::vector<counted> result;
auto it = vValues.begin();
while (it != vValues.end()) {
auto r = std::equal_range( it, vValues.end(), *it );
result.push_back({ *it, static_cast<std::size_t>(r.second - r.first) });
it = r.second;
}
After this is done, result will contain {value, count} for each value and will be sorted.
As all work was done in contiguous containers, it should be faster than your implementation.
If you aren't allowed to mutate vValues, one thing you could do is create a vector of gsl::span<char16_t const> from it, sort that, and then build the result vector the same way. (If you don't have gsl::span, write one; they aren't hard to write.)
Failing that, even copying result once may be faster than your original solution.
Using a gsl::span<char16_t const> in counted would save some allocations as well (it reuses the storage within vValues, at the cost of tying their lifetimes together).
One serious concern is that if your strings are extremely long, determining that two strings are equal is expensive. And if they have common prefixes, determining they are different can be expensive. We do log(n) comparisons per distinct element in the equal_range code, and n log(n) in the sort; sometimes sorting (hash of string, string) pairs can be faster than sorting (string)s alone, as it makes unlike strings easy to detect.
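As a rough sketch of the (hash, string) idea, assuming std::hash<std::u16string> is an acceptable hash: the function name count_by_hash_sort is hypothetical, and the counted struct is repeated from above so the example stands alone. Pointers into the input avoid copying strings while sorting, so the input must outlive the call.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <string>
#include <utility>
#include <vector>

struct counted {
    std::u16string value;
    std::size_t count;
};

// Hypothetical helper: counts occurrences by sorting (hash, pointer) pairs.
std::vector<counted> count_by_hash_sort(const std::vector<std::u16string>& words)
{
    typedef std::pair<std::size_t, const std::u16string*> keyed;
    std::vector<keyed> keys;
    keys.reserve(words.size());
    std::hash<std::u16string> hasher;
    for (const std::u16string& w : words)
        keys.push_back(keyed(hasher(w), &w));

    // Order by hash first; the strings are compared only when hashes are equal.
    std::sort(keys.begin(), keys.end(), [](const keyed& a, const keyed& b) {
        if (a.first != b.first)
            return a.first < b.first;
        return *a.second < *b.second;
    });

    // Equal strings are now adjacent; walk each run and record its length.
    std::vector<counted> result;
    for (std::size_t i = 0; i < keys.size();) {
        std::size_t j = i + 1;
        while (j < keys.size() && keys[j].first == keys[i].first &&
               *keys[j].second == *keys[i].second)
            ++j;
        result.push_back(counted{*keys[i].second, j - i});
        i = j;
    }
    return result;
}
```

Note that the result is ordered by hash value, not alphabetically, so sort it again by value if the order matters.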
Live example with 4 different versions. Simply change the test1 to test2 or test3 or test4.
test3 is fastest in every test I did:
std::unordered_map<std::string, int> test3(std::vector<std::string> vValues)
{
std::unordered_map<std::string, int> mValues;
for( auto it = vValues.begin(); it != vValues.end(); ++it )
{
++mValues[std::move(*it)];
}
return mValues;
}
Taking the vector by value lets each string be moved into the map instead of copied.
And here’s an alternative. You might consider storing a non-owning shared pointer, but if you can’t control the format of your inputs, Yakk’s suggestion of gsl::span might work. This is from the Guidelines Support Library.
std::unordered_map<std::u16string, unsigned> hash_corpus;
// constexpr float heuristic_parameter = ?;
// hash_corpus.max_load_factor(heuristic_parameter);
/* The maximum possible number of entries in the hash table is the size of
* the input vector.
*/
hash_corpus.reserve(corpus.size());
// Paul McKenzie suggested this trick in the comments:
for ( const std::u16string& s : corpus)
++hash_corpus[s]; // If the key is not in the table, [] inserts with value 0.

Insert multiple values into vector

I have a std::vector<T> variable. I also have two variables of type T, the first of which represents the value in the vector after which I am to insert, while the second represents the value to insert.
So let's say I have this container: 1,2,1,1,2,2
And the two values are 2 and 3 with respect to their definitions above. Then I wish to write a function which will update the container to instead contain:
1,2,3,1,1,2,3,2,3
I am using c++98 and boost. What std or boost functions might I use to implement this function?
Iterating over the vector and calling std::vector::insert is one way, but it gets messy once you realize that you have to hop over the value you just inserted.
This is what I would probably do:
vector<T> copy;
for (vector<T>::iterator i=original.begin(); i!=original.end(); ++i)
{
copy.push_back(*i);
if (*i == first)
copy.push_back(second);
}
original.swap(copy);
Put a call to reserve in there if you want. You know you need room for at least original.size() elements. You could also do an initial iteration over the vector (or use std::count) to determine the exact number of elements to reserve, but without testing, I don't know whether that would improve performance.
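For reference, a sketch of the copy approach with the exact reserve, wrapped in a function template. The name insert_after_each is made up here; the parameter names follow the snippet above. It sticks to C++98 constructs, and whether the extra counting pass pays off is still something to measure.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical helper: inserts `second` after every element equal to `first`.
template <class T>
void insert_after_each(std::vector<T>& original, const T& first, const T& second)
{
    // One counting pass gives the exact final size, so push_back never reallocates.
    const std::size_t extra =
        static_cast<std::size_t>(std::count(original.begin(), original.end(), first));

    std::vector<T> copy;
    copy.reserve(original.size() + extra);
    for (typename std::vector<T>::const_iterator i = original.begin(); i != original.end(); ++i)
    {
        copy.push_back(*i);
        if (*i == first)
            copy.push_back(second);
    }
    original.swap(copy);
}
```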
I propose a solution that works in place, using O(n) memory and O(2n) time, instead of the O(n^2) time of the solution proposed by Laethnes and the O(2n) memory of the solution proposed by Benjamin.
// First pass, count elements equal to first.
std::size_t elems = std::count(data.begin(), data.end(), first);
// Resize so we'll add without reallocating the elements.
data.resize(data.size() + elems);
vector<T>::reverse_iterator end = data.rbegin() + elems;
// Iterate from the end. Move elements from the end to the new end (and so elements to insert will have some place).
for(vector<T>::reverse_iterator new_end = data.rbegin(); end != data.rend() && elems > 0; ++new_end,++end)
{
// If the current element is the one we search, insert second first. (We iterate from the end).
if(*end == first)
{
*new_end = second;
++new_end;
--elems;
}
// Copy the data to the end.
*new_end = *end;
}
This algorithm may be buggy, but the idea is to copy each element only once:
First, count how many elements need to be inserted.
Second, walk through the data from the end, moving each element to its new position.
This is what I probably would do:
typedef ::std::vector<int> MyList;
typedef MyList::iterator MyListIter;
MyList data;
// ... fill data ...
const int searchValue = 2;
const int addValue = 3;
// Find first occurrence of searched value
MyListIter iter = ::std::find(data.begin(), data.end(), searchValue);
while(iter != data.end())
{
// We want to add our value after searched one
++iter;
// Insert value and return iterator pointing to the inserted position
// (original iterator is invalid now).
iter = data.insert(iter, addValue);
// This is needed only if we want to be sure that our value won't be used
// - for example, if searchValue == addValue is true, the code would create
// an infinite loop.
++iter;
// Search for next value.
iter = ::std::find(iter, data.end(), searchValue);
}
But as you can see, I couldn't avoid the increment you mentioned. I don't think that's a bad thing, though: I would put this code into a separate function (probably in some kind of "core/utils" module) and, of course, implement it as a template, so I would write it only once. Worrying about the increment only once is IMHO acceptable. Very acceptable.
template <class ValueType>
void insertAfter(::std::vector<ValueType> &io_data,
const ValueType &i_searchValue,
const ValueType &i_insertAfterValue);
or even better (IMHO)
template <class ListType, class ValueType>
void insertAfter(ListType &io_data,
const ValueType &i_searchValue,
const ValueType &i_insertAfterValue);
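One possible body for the generic declaration above, reusing the find/insert loop from the earlier snippet. This is only a sketch: it assumes the container's insert returns an iterator to the inserted element (true for std::vector and std::list), and that the two value arguments do not refer to elements inside io_data.

```cpp
#include <algorithm>
#include <list>
#include <vector>

template <class ListType, class ValueType>
void insertAfter(ListType &io_data,
                 const ValueType &i_searchValue,
                 const ValueType &i_insertAfterValue)
{
    typename ListType::iterator iter =
        std::find(io_data.begin(), io_data.end(), i_searchValue);
    while (iter != io_data.end())
    {
        ++iter;                                           // step past the match
        iter = io_data.insert(iter, i_insertAfterValue);  // iterator to the new element
        ++iter;                                           // skip it so it is never matched itself
        iter = std::find(iter, io_data.end(), i_searchValue);
    }
}
```

For a vector each insert shifts the tail, so this stays quadratic in the worst case; for a list every insert is constant time.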
EDIT:
Well, I would solve the problem a little differently: first count the occurrences of the searched value (preferably storing the count in some kind of cache that can be kept and reused) so that I could prepare the array beforehand (only one allocation), then use memcpy to move the original values (for types like int only, of course) or memmove (if the vector's allocated size is already sufficient).
In place, O(1) additional memory and O(n) time (Live at Coliru):
template <typename T, typename A>
void do_thing(std::vector<T, A>& vec, T target, T inserted) {
using std::swap;
typedef typename std::vector<T, A>::size_type size_t;
const size_t occurrences = std::count(vec.begin(), vec.end(), target);
if (occurrences == 0) return;
const size_t original_size = vec.size();
vec.resize(original_size + occurrences, inserted);
for(size_t i = original_size - 1, end = i + occurrences; i > 0; --i, --end) {
if (vec[i] == target) {
--end;
}
swap(vec[i], vec[end]);
}
}

Moving from C array to std::map <int, val> operator `-` gotcha

I had a C-style array of some values. I needed it to be a map to save memory (not allocating everything at once and keeping it, but allocating as needed)... It could also be made into a set or, as a further optimization, a vector. But I ran into one painful gotcha: val * v; auto val_index = v - val_collection used to give me the item id... now such code will not compile. Will it in the std::vector case?
std::distance can give you the distance from the beginning of a container (or other sequence):
std::vector<val>::iterator v = whatever();
size_t val_index = std::distance(val_collection.begin(), v);
For random-access containers (including vector, but not map), you could also use - if you like:
size_t val_index = v - val_collection.begin();
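To make both variants concrete, here is a small sketch. The helper names index_of and position_of are invented for this example. The first mirrors the old pointer arithmetic and only works for contiguous storage such as vector; the second uses std::distance and works for any container, though it is linear for non-random-access ones like map.

```cpp
#include <cstddef>
#include <iterator>
#include <vector>

// Hypothetical helper: index of the element `p` points to.
// `p` must point at an element of `v`.
template <class T>
std::size_t index_of(const std::vector<T>& v, const T* p)
{
    return static_cast<std::size_t>(p - v.data());
}

// Hypothetical helper: position of `it` counted from the beginning of `c`.
template <class Container>
std::size_t position_of(const Container& c, typename Container::const_iterator it)
{
    return static_cast<std::size_t>(std::distance(c.begin(), it));
}
```

With a map, the key itself is usually the better "item id", since the position of an element changes whenever a smaller key is inserted.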

erase element from vector

I have the following vector passed to a function
void WuManber::Initialize( const vector<const char *> &patterns,
bool bCaseSensitive, bool bIncludeSpecialCharacters, bool bIncludeExtendedAscii )
I want to erase any element whose length is less than 2.
I tried the following, but it didn't even compile.
Can you tell me what I am missing here?
for(vector<const char *>::iterator iter = patterns.begin();iter != patterns.end();iter++)
{//my for start
size_t lenPattern = strlen((iter).c_str);
if ( 2 > lenPattern )
patterns.erase(iter);
}//my for end
On top of the problems others have pointed out, it's a bad idea to erase items from the vector as you iterate over it. There are techniques to do it right, but it's generally slow and fragile. remove_if is almost always a better option for lots of random erasures from a vector:
#include <algorithm>
#include <cstring>
bool less_than_two_characters(const char* str) { return std::strlen(str) < 2; }
void Initialize(vector<const char*>& v) {
v.erase(std::remove_if(v.begin(), v.end(), less_than_two_characters), v.end());
}
In C++0x you can do that more concisely with a lambda function but the above is more likely to work on a slightly older compiler.
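For completeness, the lambda form might look like the sketch below, assuming a C++11 compiler. The function name erase_short_patterns is made up for the example.

```cpp
#include <algorithm>
#include <cstring>
#include <vector>

// Hypothetical helper: removes every pattern shorter than two characters.
void erase_short_patterns(std::vector<const char*>& patterns)
{
    patterns.erase(
        std::remove_if(patterns.begin(), patterns.end(),
                       [](const char* s) { return std::strlen(s) < 2; }),
        patterns.end());
}
```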
This cannot work, because if you erase something from your vector you invalidate your iterator.
It probably does not compile because you use your iterator in the wrong way. The elements are const char*, which has no c_str member, so dereference the iterator and pass the result to strlen: strlen(*iter). On the other hand, give us the error message ;)
Next thing, you try to modify a const vector. This is why the compiler is complaining.
You could do this with an index, like this:
for (int i = 0; i < patterns.size(); ++i) {
size_t lenPattern = strlen(patterns[i]);
if (2 > lenPattern) {
patterns.erase(patterns.begin() + i);
--i;
}
}
However, this is not very elegant, as I manipulate the counter...
First, as Tim mentioned, the patterns parameter is a const reference, so the compiler won't let you modify it - change that if you want to be able to erase elements in it.
Keep in mind that iter 'points to' a pointer (a char const* to be specific). So you dereference the iterator to get to the string pointer:
size_t lenPattern = strlen(*iter);
if ( 2 > lenPattern )
iter = patterns.erase(iter);
Also, in the last line of the snippet, iter is assigned whatever erase() returns to keep it a valid iterator.
Note that erasing the element pointed to by iter will not free whatever string is pointed to by the pointer in the vector. It's not clear whether or not that might be necessary, since the vector might not 'own' the strings that are pointed to.
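Put together as a complete loop, the iterator version could look like this sketch (the function name is hypothetical, and the parameter is a non-const reference as discussed above). The important detail is that the iterator is advanced only when nothing was erased, so the usual iter++ in the for header has to go.

```cpp
#include <cstring>
#include <vector>

// Hypothetical helper: erases short patterns with an explicit iterator loop.
void erase_short_patterns_loop(std::vector<const char*>& patterns)
{
    for (std::vector<const char*>::iterator iter = patterns.begin();
         iter != patterns.end();)
    {
        if (std::strlen(*iter) < 2)
            iter = patterns.erase(iter); // erase() returns the next valid iterator
        else
            ++iter;                      // advance only when nothing was erased
    }
}
```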

C++ STL Vectors: Get iterator from index?

So, I wrote a bunch of code that accesses elements in an stl vector by index[], but now I need to copy just a chunk of the vector. It looks like vector.insert(pos, first, last) is the function I want... except I only have first and last as ints. Is there any nice way I can get an iterator to these values?
Try this:
vector<Type>::iterator nth = v.begin() + index;
The way mentioned by @dirkgently ( v.begin() + index ) is nice and fast for vectors,
but std::advance( it, index ) is the most generic way, and for random-access iterators it works in constant time too.
EDIT
differences in usage:
std::vector<>::iterator it = ( v.begin() + index );
or
std::vector<>::iterator it = v.begin();
std::advance( it, index );
added after @litb's notes.
Also; auto it = std::next(v.begin(), index);
Update: Needs a C++11-compliant compiler
You can always use std::advance to move the iterator a certain number of positions; for random-access iterators such as vector's, this takes constant time:
std::vector<int>::iterator it = myvector.begin();
std::advance(it, 2);
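Applied to the original problem of copying a chunk given two int indices, a sketch could look like this. The name append_chunk is made up; it assumes first <= last <= src.size(), that dst and src are different vectors, and a C++11 compiler for std::next.

```cpp
#include <cstddef>
#include <iterator>
#include <vector>

// Hypothetical helper: appends the half-open index range [first, last) of `src` to `dst`.
// The caller must guarantee first <= last <= src.size() and that dst is not src.
template <class T>
void append_chunk(std::vector<T>& dst, const std::vector<T>& src,
                  std::size_t first, std::size_t last)
{
    dst.insert(dst.end(),
               std::next(src.begin(), first),
               std::next(src.begin(), last));
}
```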
Actually, std::vector is meant to be usable as a C array when needed. (The C++ standard requires contiguous storage for vector, as far as I know; see the replacement for array in Wikipedia.)
For instance, as far as I can tell, it is perfectly legal to do the following:
int main()
{
void foo(const char *);
std::vector<char> vec;
vec.push_back('h');
vec.push_back('e');
vec.push_back('l');
vec.push_back('l');
vec.push_back('o');
vec.push_back('\0');
foo(&vec[0]);
}
Of course, either foo must not copy the address passed as a parameter and store it somewhere, or you must ensure that your program never pushes a new item into vec or changes its capacity afterwards. Otherwise you risk a segmentation fault...
Therefore, in your example it leads to
vector.insert(pos, &vec[first_index], &vec[last_index]);