Keeping only N smallest elements with STL (with duplicates)


The MVCE below attempts to output only the 5 smallest elements in ascending order from a large incoming input stream of random elements (which contains duplicates).
int main(int argc, char *argv[])
{
std::set<int> s; //EDIT: std::multiset is an answer to Q1
for (int i : {6, 6, 5, 8, 3, 4, 0, 2, 8, 9, 7, 2}) //Billions of elements in reality
{
if ( (s.size() < 5) || (i <= *(--s.end())) ) //Insert only if not full or when the element to be inserted is smaller than the greatest one already in the set
{
if (s.size() >= 5) //Limit the number of smallest elements that are kept. In reality ~8000
s.erase(*(--s.end())); //Erase the largest element
s.insert(i);
}
}
for (int d: s)
std::cout << d << " "; //print the 5 smallest elements in ascending order
std::cout << '\n';
return 0;
}
The output is:
0 2 3 4
The output should be:
0 2 2 3 4
Q1: What must be changed to allow duplicates?
Q2: How can this code be made faster without wasting gigabytes of memory on storing all of the input elements? (As it stands, the code is far too slow.)

This sounds like the classic interview question "how do you keep the smallest N items without knowing in advance how much data will be processed?".
One answer is to use a max-heap of N items and adjust the heap (remove the top element, add the new element, re-heapify) whenever the next item is less than or equal to the topmost item in the heap.
This is easily done with the C++ library functions std::make_heap, std::pop_heap, and std::push_heap.
Here is an example:
#include <vector>
#include <algorithm>
#include <iostream>
int main(int argc, char *argv[])
{
std::vector<int> s;
for (int i : {6, 6, 5, 8, 3, 4, 0, 2, 8, 9, 7, 2})
{
// add the first 5 elements to the vector
if (s.size() < 5)
{
s.push_back(i);
if ( s.size() == 5 )
// make the max-heap of the 5 elements
std::make_heap(s.begin(), s.end());
continue;
}
// now check if the next element is smaller than the top of the heap
if (s.front() >= i)
{
// remove the front of the heap by placing it at the end of the vector
std::pop_heap(s.begin(), s.end());
// get rid of that item now
s.pop_back();
// add the new item
s.push_back(i);
// heapify
std::push_heap(s.begin(), s.end());
}
}
// sort the heap
std::sort_heap(s.begin(), s.end());
for (int d : s)
std::cout << d << " "; //print the 5 smallest elements in ascending order
std::cout << '\n';
return 0;
}
Output:
0 2 2 3 4
Of course you can make this a function and replace the hard-coded 5 with N.
If there are billions of elements, i.e. many more elements than N, the only thing that will be kept in the heap are N elements.
The max-heap is only manipulated if it is detected that the new item satisfies being one of the smallest N elements, and that is easily done by inspecting the top item in the heap and comparing it with the new item that is being processed.

Try this; there is no need to sort everything. std::partial_sort only sorts until the 5 smallest elements are at the front of the vector, it does not need any additional memory (the sort is in-place), and appending to a std::vector is fast.
#include <iostream>
#include <vector>
#include <algorithm>
int main()
{
std::vector<int> vec{ 6, 6, 5, 8, 3, 4, 0, 2, 8, 9, 7, 2 };
int numElements = 5;
std::partial_sort(vec.begin(), vec.begin() + numElements, vec.end());
for (int i = 0; i < numElements; ++i)
{
std::cout << vec[i] << "\n";
}
return 0;
}
If you do not want to store all inputs, the solution will be a bit different and depends on how you read the input. For example, read the input in chunks, take the smallest 5 of each chunk, and at the end run the same step once more on the combined "smallest 5" of every chunk.

Q2 answer: Go through the first N elements of the multiset (by default a std::multiset is ordered from lowest to highest, so begin() is the smallest element) and push_back() them into a std::vector.

Related

Duplicates in Array

Given an integer array nums sorted in non-decreasing order, remove some duplicates in-place such that each unique element appears at most twice. The relative order of the elements should be kept the same.
Since it is impossible to change the length of the array in some languages, you must instead have the result be placed in the first part of the array nums. More formally, if there are k elements after removing the duplicates, then the first k elements of nums should hold the final result. It does not matter what you leave beyond the first k elements.
Return k after placing the final result in the first k slots of nums.
What is wrong with my code?
map<int,int> m;
for(int i = 0 ; i < nums.size() ; i++){
m[nums[i]]++;
if(m[nums[i]] > 2)nums.erase(nums.begin() + i);
}
return nums.size();

From the given text, we can derive the following requirements
Given an integer array nums
sorted in non-decreasing order,
remove some duplicates in-place such that each unique element appears at most twice.
The relative order of the elements should be kept the same.
Since it is impossible to change the length of the array in some languages, you must instead have the result be placed in the first part of the array nums.
More formally, if there are k elements after removing the duplicates, then the first k elements of nums should hold the final result.
It does not matter what you leave beyond the first k elements
Return k after placing the final result in the first k slots of nums.
So, after eliciting the requirements, we know that we have a fixed-size array, presumably (because of the simplicity of the task) a C-style array or a C++ std::array. Based on the source code shown, we assume a std::array.
It will be sorted in increasing order. There shall be an in-place removal of elements, so no additional storage. The rest of the requirements already point to the solution.
--> If we find more than 2 equal values, we shift the rest of the values one position to the left and overwrite one of the duplicates. The logical number of elements in the array is then one less, so the loop must run one step less.
This ends up in a rather simple program:
#include <array>
#include <cstddef>
#include <iostream>
// Array with some test values (sorted in non-decreasing order)
constexpr std::size_t ArraySize = 25;
std::array<int, ArraySize> nums{ 1,2,2,2,3,3,3,4,4,4,4,4,5,5,5,5,5,6,6,6,6,6,6,9,9 };
int main() {
    // Current logical end of the data in the array. In the beginning, one past the last value
    std::size_t endIndex = nums.size();
    // Check all elements from left to right
    for (std::size_t index = 0; index < endIndex;) {
        // Check if 3 consecutive elements are the same
        if ((index + 2 < endIndex) and nums[index] == nums[index + 1] and nums[index + 1] == nums[index + 2]) {
            // Yes, found 3 equal elements. We will delete one, so endIndex needs to be decremented
            --endIndex;
            // Now shift the remaining array elements one to the left
            for (std::size_t shiftIndex = index + 2; shiftIndex < endIndex; ++shiftIndex)
                nums[shiftIndex] = nums[shiftIndex + 1];
        }
        else ++index;
    }
    // Show result
    std::cout << endIndex << '\n';
}

Here is one possible solution to your problem.
#include <iostream>
#include <vector>
#include <set>
using namespace std;
void showContentSet(set<int>& input)
{
for(auto iterator=input.begin(); iterator!=input.end(); ++iterator)
{
cout<<*iterator<<", ";
}
return;
}
void showContentVector(vector<int>& input)
{
for(int i=0; i<input.size(); ++i)
{
cout<<input[i]<<", ";
}
return;
}
void solve()
{
vector<int> numbers={1, 2, 1, 3, 4, 5, 7, 5, 8, 5, 9, 5};
set<int> indicesToDelete;
for(int i=0; i<numbers.size(); ++i)
{
int count=0;
for(int j=0; j<numbers.size(); ++j)
{
if(numbers[i]==numbers[j])
{
++count;
if(count>2)
{
indicesToDelete.insert(j);
}
}
}
}
cout<<"indicesToDelete <- ";
showContentSet(indicesToDelete);
int newOrder=0;
cout<<endl<<"Before, numbers <- ";
showContentVector(numbers);
for(auto iterator=indicesToDelete.begin(); iterator!=indicesToDelete.end(); ++iterator)
{
numbers.erase(numbers.begin()+(*iterator-newOrder));
++newOrder;
}
cout<<endl<<"After, numbers <- ";
showContentVector(numbers);
cout<<endl;
return;
}
int main()
{
solve();
return 0;
}
Here is the result:
indicesToDelete <- 9, 11,
Before, numbers <- 1, 2, 1, 3, 4, 5, 7, 5, 8, 5, 9, 5,
After, numbers <- 1, 2, 1, 3, 4, 5, 7, 5, 8, 9,

I suggest using a frequency array.
A frequency array counts how many times each number occurs while the input is being read. It is usually stored in an array called freq, but it can also be stored in a map<int, int> or unordered_map<int, int>.
Because the input is in non-decreasing order, producing the output from it is easy.
Note: this solution won't work if an input number is negative or bigger than 10^5, or if there are more than 10^5 + 1 numbers.
Solution:
#include <iostream>
const int N = 1e5 + 1; // Maximum size of input array
int n;
int nums[N], freq[N];
int main()
{
// Input
std::cin >> n;
for (int i = 0; i < n; i++)
{
std::cin >> nums[i];
freq[nums[i]]++;
}
// Outputting numbers, Using frequency array of it
for (int i = 0; i < N; i++)
{
if (freq[i] >= 1)
std::cout << i << ' ';
if (freq[i] >= 2)
std::cout << i << ' ';
}
return 0;
}

This is basically a conditional copy operation. Copy the entire range, but skip elements that have been copied twice already.
The following code makes exactly one pass over the entire range. More importantly it avoids erase operations, which will repeatedly shift all elements to the left.
#include <vector>
// Returns k, the number of elements kept at the front of nums.
// nums must be sorted already.
int removeDuplicates(std::vector<int>& nums) {
    if (nums.size() <= 1)
        return static_cast<int>(nums.size()); // already done
    // Basically you copy all elements inside the vector,
    // but copy them only if the requirement has been met.
    // `in` is the source iterator. It increments every time.
    // `out` is the destination iterator. It increments only
    // after a copy.
    auto in = nums.begin();
    auto out = nums.begin();
    // `current` is the current int value
    int current = *in;
    // `count` counts the repeats of the current value
    int count = 0;
    while (in != nums.end()) {
        if (*in == current) {
            // The current value repeats itself, so increment the count
            ++count;
        } else {
            // No, this is a new value: reset current and count
            current = *in;
            count = 1;
        }
        if (count <= 2) {
            // Copy the current value to `out` only while at most
            // two elements of the same value have been copied
            *out = current;
            ++out;
        }
        // try next element
        ++in;
    }
    // `out` points one past the last valid element
    return static_cast<int>(out - nums.begin());
}

next_permutation not by reference?

I do not understand why the positions of the values change inside the loop, but after the do-while loop all the values are back in their original positions. That is why I need the code marked // here. I also tried an array of pointers, but it showed the same behavior. Why is that?
#include <iostream>
#include <algorithm>
using namespace std;
int main()
{
int a[] = {0, 1, 2};
do
{
for (int i = 0; i < 3; i++)
cout << a[i];
cout << endl;
} while (next_permutation(a, a + 3));
cout << endl;
// here
a[0] = 2;
a[1] = 1;
a[2] = 0;
do
{
for (int i = 0; i < 3; i++)
cout << a[i];
cout << endl;
} while (prev_permutation(a, a + 3));
return 0;
}

That's how next_permutation is defined. The call that returns false is the one that wraps around and puts the elements back in sorted order.
I suppose there is another misunderstanding. Here:
do
{
print_permutation();
} while (next_permutation(a, a + 3));
The last permutation you print inside the loop is that one before the one that makes next_permutation return false. Hence in the last iteration you are not printing the same permutation as outside of the loop. It is similar to:
bool increment(int& i) {
++i;
return i<10;
}
int i = 0;
do {
std::cout << i;
} while( increment(i) );
std::cout << i;
The last value printed inside the loop is 9, but the value of i after the loop is 10.

Every time it is called, std::next_permutation rearranges the given container a from its current permutation, say P, into a permutation P'. As long as the permutation P' it generates is greater than P, it returns true. However, when P is already the greatest permutation of the elements in a (see below for what "greatest" means), for instance P=[2, 1, 0], the next permutation P' it generates is [0, 1, 2], which is not greater than P. In that case it returns false and the loop terminates. But, as a side effect of the call, by the time the function returns the elements in the container have already been placed in the smallest possible permutation. That is why you see the elements in their smallest permutation after the loop terminates.
The word greater may be a little confusing. Basically std::next_permutation uses whatever comparison operators are available to compare individual elements and ends up iterating over them in a way that lexicographically increases with each following permutation. So, for a vector of [0, 1, 2], it would iterate the following permutations in that order:
0, 1, 2
0, 2, 1
1, 0, 2
1, 2, 0
2, 0, 1
2, 1, 0
That is why the function recognizes that [0,1,2] is not a greater permutation than [2,1,0] and returns false.

Sort one vector according to another

I'm basing my question off the answer to this one:
How to obtain the index permutation after the sorting
I have two std::vectors:
std::vector<int> time={5, 16, 4, 7};
std::vector<int> amplitude={10,17,8,16};
I want to order the vectors for increasing time, so eventually they will be:
TimeOrdered={4,5,7,16};
AmplitudeOrdered={8,10,16,17};
Once finished, I want to add both ordered vectors to a CERN ROOT TTree. I looked online for solutions and found the example above, where the top answer is to use the following code:
vector<int> data = {5, 16, 4, 7};
vector<int> index(data.size(), 0);
for (int i = 0 ; i != index.size() ; i++) {
index[i] = i;
}
sort(index.begin(), index.end(),[&](const int& a, const int& b) {
return (data[a] < data[b]);
}
);
for (int ii = 0 ; ii != index.size() ; ii++) {
cout << index[ii] << endl;
}
Which I like because it's simple, it doesn't require too many lines and it leaves me with two simple vectors, which I can then use easily for my TTree.
I tried, therefore, to generalise it:
void TwoVectorSort(){
std::vector<int> data={5, 16, 4, 7};
std::vector<int> data2={10,17,8,16};
sort(data2.begin(), data2.end(),[&](const int& a, const int& b) {
return (data[a] < data[b]);
}
);
for (int ii = 0 ; ii != data2.size() ; ii++) {
std::cout <<data[ii]<<"\t"<< data2[ii]<<"\t"<< std::endl;//<<index[ii]
}
}
But not only does it not work, it gives me something different each time. I'm running it as a macro with ROOT 6.18/04, using .x TwoVectorSort.cpp+.
Can anyone tell me why it doesn't work and what the simplest solution is? I'm by no means a c++ expert so I hope the answers won't be too technical!
Thanks in advance!

Indeed you can re-use the solution from the link you shared to solve your problem. But you need to keep building the index vector (and I believe there's no need to modify the time or amplitude vectors).
The index vector is used to store the index/position of the time vector values sorted from smallest to largest, so for time={5, 16, 4, 7}:
index[0] will contain the index of the smallest value from time (which is 4, at position 2), hence index[0]=2
index[1] will contain the index of the 2nd smallest value from time (which is 5, at position 0), hence index[1]=0
etc.
And since amplitude's order is based on time you can use index[pos] to access both vectors when building your tree:
time[index[pos]] and amplitude[index[pos]]
Code with the corrections:
#include <iostream>
#include <vector>
#include <algorithm>
int main(){
std::vector<int> time={5, 16, 4, 7};
std::vector<int> amplitude={10,17,8,16};
std::vector<int> index(time.size(), 0);
for (int i = 0 ; i != index.size() ; i++) {
index[i] = i;
}
std::sort(index.begin(), index.end(),
[&](const int& a, const int& b) {
return (time[a] < time[b]);
}
);
std::cout << "Time \t Ampl \t idx" << std::endl;
for (int ii = 0 ; ii != index.size() ; ++ii) {
std::cout << time[index[ii]] << " \t " << amplitude[index[ii]] << " \t " << index[ii] << std::endl;
}
}
Output:
Time Ampl idx
4 8 2
5 10 0
7 16 3
16 17 1
Regarding "But not only does it not work, it gives me something different each time":
That happened because the lambda was receiving values from data2={10,17,8,16}, and those values were being used as indices into the data vector in return (data[a] < data[b]). Because data has only 4 elements, every one of those accesses is out of range, which is undefined behavior: the comparison reads whatever happens to be in memory, hence the seemingly random ordering.

How to consider only first two elements from *pointer

From the code below you can see that the vector contains the same number twice or more. What I want to do is find the positions of the first two occurrences of the same number by dereferencing the pointer *ptr.
#include<iostream>
#include<iterator> // for iterators
#include<vector> // for vectors
using namespace std;
int main()
{
vector<int> ar = { 1,8,2, 2, 2, 5,7,7,7,7,8 };
// Declaring iterator to a vector
vector<int>::iterator ptr;
// Displaying vector elements using begin() and end()
cout << "The vector elements are : ";
for (ptr = ar.begin(); ptr < ar.end(); ptr++)
cout << *ptr << " ";
return 0;
}
Let's assume I want to print the positions and values of the first two occurrences of 7 by dereferencing the pointer *ptr. Should I use an if condition like
int *array = ar.data();
for (size_t i = 0; i + 1 < ar.size(); i++) {
if (array[i] - array[i+1] == 0)
cout<<array[i]<<endl;
}
But how would I guarantee that it only looks at the first two equal elements from *ptr?
UPDATE
Clearing the question:
The reason I want to know the first and second position of each repeated element is that, later in my study, I will be given times associated with the first and second position of the same number. I want to ignore elements that still repeat after the second occurrence, because those positions must not enter my calculations.
For example, if you run the code you get: The vector elements are : 1 8 2 2 2 5 7 7 7 7 8. In this case, the first two positions of the element 2 are [2] and [3], so I would like to ignore position [4]. Another thing to mention: I don't care whether the equal values are consecutive or not (for example 8 2 8 or 8 8 8; I would consider both). For example, the number 8 is at array[1] and at array[10]. I would also consider this.

Create a map where each value is stored as key, mapped to a list of indices:
std::unordered_map<int, std::vector<size_t>> indexMap;
Loop over your initial values and fill the map:
for (size_t index = 0; index < ar.size(); index++)
{
indexMap[ar[index]].push_back(index);
}
Now you can loop over your map and work with every value that has 2 or more indices and only use the first 2 indices for whatever you want to do:
for (auto const& [value, indices] : indexMap)
{
if (indices.size() < 2)
continue;
size_t firstIndex = indices[0];
size_t secondIndex = indices[1];
// do whatever
}
(If you don't use C++17 or up, use for (auto const& pair : indexMap), where pair.first is value and pair.second is indices.)

You could use map or unordered_map to register indexes of each value.
Here's a simple demo of the concept:
#include<iostream>
#include<vector>
#include<map>
using namespace std;
int main() {
vector<int> ar{ 1, 8, 2, 2, 2, 5, 7, 7, 7, 7, 8 };
map<int, vector<size_t> > occurrences{ };
for (size_t i = 0; i < ar.size(); ++i) {
occurrences[ar[i]].push_back(i);
}
for (const auto& occurrence:occurrences) {
cout << occurrence.first << ": ";
for (auto index: occurrence.second) {
cout << index << " ";
}
cout << endl;
}
return 0;
}
Output:
1: 0
2: 2 3 4
5: 5
7: 6 7 8 9
8: 1 10

How to iterate through every SECOND element of doubly linked list (ring)?

RING: a linear data structure in which the end points to the beginning of the structure. It is also called a circular buffer, circular queue, or cyclic buffer.
I have a function to write. Its purpose is to produce another RING from the original RING with a given length, and it has to take every SECOND element of the original RING.
Example:
originalRing= 1,2,3,4,5
function newRing is called: newRing (originalRing, nRing, len1=5)
nRing=1,3,5,2,4
(explanation: '1' is the 1st element of the RING. Every second means I take 3, 5... but this is RING, so it goes like 1,2,3,4,5,1,2,3,4,5,... The function says nRing must have length of 5, so I am taking next every element: 2,4. Finally it gives 1,3,5,2,4)
I am using iterators (I have to, school project).
iterator i1 = nRing.begin(); //--- .begin() points to the 'beginning' of the Ring
if (originalRing.isEmpty()){ //---whether originalRing is empty or not
return false;}
if (originalRing.length()==1){ //--- if originalRing no. of elements is 1, returns that Ring
return originalRing;
}
if (len1<=0) //--- doesnt make sense
{return false;}
if(!i1.isNULL()) //--- checks whether iterator is null
{
for(int i = 0; i < len1; i++)
{
nRing.insertLast(i1.getKey()); //insert the element to the end of the Ring
i1++;
}
}
So here, the thing I am asking about is i1++: it iterates over the elements one by one.
My question is how to write the loop so that the iterator picks up every 2nd element?

You could use std::stable_partition to do this. It'll divide the elements in your ring buffer so that those for which the lambda returns true come before those for which it returns false. You can create a stateful/mutable lambda to toggle true/false for each iteration.
#include <iostream>
#include <vector> // std::vector
#include <algorithm> // std::stable_partition
int main() {
std::vector<int> RING = {1, 2, 3, 4, 5};
for(const auto& v : RING) std::cout << v << " ";
std::cout << "\n";
for(int i = 0; i < 4; ++i) {
std::stable_partition( RING.begin(), RING.end(),
[toggle = false](const auto&) mutable {
return toggle ^= true; // false becomes true and vice versa
}
);
for(const auto& v : RING) std::cout << v << " ";
std::cout << "\n";
}
}
Output:
1 2 3 4 5
1 3 5 2 4
1 5 4 3 2
1 4 2 5 3
1 2 3 4 5

The thing to do is to advance the iterator twice inside the loop:
i1++;
i1++;
