Efficient Computation of Frequent and Top-k Elements in Data Streams - c++

Here is the pseudocode for this algorithm.
Following is how I have implemented this.
#include <iostream>
#include <fstream>
#include <string>
#include <map>
typedef std::map<std::string, int> collection_t;
typedef collection_t::iterator collection_itr_t;
collection_t T;
collection_itr_t get_smallest_key() {
collection_itr_t min_key = T.begin();
collection_itr_t key = ++min_key;
while ( key != T.end() ) {
if ( key->second < min_key->second )
min_key = key;
++key;
}
return min_key;
}
void space_saving_frequent( std::string &i, int k ) {
if ( T.find(i) != T.end())
T[i]++;
else if ( T.size() < k ) {
T.insert(std::make_pair(i, 1 ));
} else {
collection_itr_t j = get_smallest_key();
int cnt = j->second + 1;
T.erase(j);
T.insert(std::make_pair(i, cnt));
}
}
int main ( int argc, char **argv) {
std::ifstream ifs(argv[1]);
if ( ifs.peek() == EOF )
return 1;
std::string line;
while( std::getline(ifs,line) ) {
std::string::size_type left = line.rfind('=') + 1;
std::string::size_type length = line.length();
std::string i = line.substr(left, length - left - 1);
space_saving_frequent(i, 5);
}
ifs.close();
return 0;
}
Original paper link : http://dimacs.rutgers.edu/~graham/pubs/papers/freqcacm.pdf
But the code does not work, and I am not able to figure out where I went wrong.

If two or more items share the lowest count, you can simply break ties arbitrarily, for instance by choosing the item with the lowest index stored in your data structure, or a random one among those with the lowest count.
If you want to compare your implementation with a reference one, take a look at the implementation of Cormode and Hadjieleftheriou that you will find here. Their code is more complex than yours, because you are not actually implementing the stream summary data structure. Their code also includes implementations of several other frequent-items algorithms, and the authors compared the performance of those algorithms. Space Saving proved to be the best algorithm in the majority of cases with regard to several metrics, such as precision, recall, update speed and space used. You will also find a paper discussing this experimental comparison. An improved version of this paper appeared later in Communications of the ACM. Here you can access a PDF version.
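As a point of comparison, here is a minimal sketch of a single update step that keeps the map-based approach from the question; it is not the stream summary structure used in the reference implementation. Note that in the question's get_smallest_key, the statement `collection_itr_t key = ++min_key;` advances min_key as well, so the first entry of T is never considered as the minimum; the sketch sidesteps this with std::min_element, which also breaks ties by taking the first minimum it meets. The names Counters, smallest_count and space_saving_update are illustrative and do not come from the paper.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <utility>

// Hypothetical names, chosen for this sketch only.
typedef std::map<std::string, int> Counters;

// Returns the entry with the lowest count. std::min_element returns the
// first minimum it meets, so ties go to the smallest key in map order.
Counters::iterator smallest_count(Counters &counters) {
    assert(!counters.empty());
    return std::min_element(
        counters.begin(), counters.end(),
        [](const Counters::value_type &a, const Counters::value_type &b) {
            return a.second < b.second;
        });
}

// One update step with at most k monitored items.
void space_saving_update(Counters &counters, const std::string &item, std::size_t k) {
    assert(k > 0);
    Counters::iterator hit = counters.find(item);
    if (hit != counters.end()) {
        ++hit->second;                               // already monitored
    } else if (counters.size() < k) {
        counters.insert(std::make_pair(item, 1));    // a free slot is left
    } else {
        Counters::iterator victim = smallest_count(counters);
        const int inherited = victim->second + 1;    // newcomer inherits min + 1
        counters.erase(victim);
        counters.insert(std::make_pair(item, inherited));
    }
}
```

Feeding the items a, b, b, c with k = 2 should evict a and leave b and c, both with count 2. The linear scan for the minimum costs O(k) per eviction, which is the price of not maintaining the stream summary.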

Related

Returning Array in C++ returns inaccessible elements

I am working on a project where I parse a string into an array and then return it to the main function. It parses fine, but when I return it to the main function I can't access the array elements.
//This is from the Main function. It calls commaSeparatedToArray which returns the array.
for (int i = 0; i < numberOfStudents; i++) {
string * parsedToArray = mainRoster->commaSeparatedToArray(studentData[i]);
Degree degreeType = SOFTWARE;
for (int i = 0; i < 3; i++) {
if (degreeTypeStrings[i] == parsedToArray[8])
degreeType = static_cast<Degree>(i);
}
mainRoster->add(parsedToArray[0], parsedToArray[1], parsedToArray[2], parsedToArray[3], stoi(parsedToArray[4]), stoi(parsedToArray[5]), stoi(parsedToArray[6]), stoi(parsedToArray[7]), degreeType);
}
//Here is the commaSeparatedToArray function
string * roster::commaSeparatedToArray(string rowToParse) {
int currentArraySize = 0;
const int expectedArraySize = 9;
string valueArray[expectedArraySize];
int commaIndex = 0;
string remainingString = rowToParse;
while (remainingString.find(",") != string::npos) {
currentArraySize++;
if (currentArraySize <= expectedArraySize) {
commaIndex = static_cast<int>(remainingString.find(","));
valueArray[currentArraySize - 1] = remainingString.substr(0, commaIndex);
remainingString = remainingString.substr(commaIndex + 1, remainingString.length());
}
else {
cerr << "INVALID RECORD. Record has more values then is allowed.\n";
exit(-1);
}
}
if (currentArraySize <= expectedArraySize) {
currentArraySize++;
commaIndex = static_cast<int>(remainingString.find(","));
valueArray[currentArraySize - 1] = remainingString.substr(0, commaIndex);
remainingString = remainingString.substr(commaIndex + 1, remainingString.length());
}
if (currentArraySize < valueArray->size()) {
cerr << "INVALID RECORD. Record has fewer values then is allowed.\n";
exit(-1);
}
return valueArray;
}
1) You can't return arrays in C++. Your code (as I'm sure you know) returns a pointer to the first element of an array. That's an important difference.
2) The array is declared locally in the function and therefore no longer exists after the function has exited.
3) Therefore, once you have returned from the function, you have a pointer to something which no longer exists. Bad news.
4) You must always consider the lifetime of objects when you program in C++. One solution to this problem is to dynamically allocate the array (using new[]). This means that the array will still exist when you exit the function. But it has the significant disadvantage that you must remember to delete[] the array at a suitable later time.
5) The best solution (in general) is to use a std::vector. Unlike an array, a std::vector can be returned from a function. So this option leads to the simplest, most natural code.
vector<string> roster::commaSeparatedToArray(string rowToParse) {
...
vector<string> valueArray(expectedArraySize);
...
return valueArray;
}
Since your array/vector has a constant size, you could also use a std::array:
array<string, expectedArraySize> valueArray;
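To make the std::array option concrete, here is a small self-contained sketch. It is a free function rather than a member of roster, it throws instead of calling exit, and the names splitRow and kExpectedFields are made up for the example; the field count of 9 is taken from expectedArraySize in the question.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <sstream>
#include <stdexcept>
#include <string>

// Assumed to match expectedArraySize from the question.
constexpr std::size_t kExpectedFields = 9;

// Hypothetical free function; splits a comma-separated row into exactly
// kExpectedFields strings and returns them by value.
std::array<std::string, kExpectedFields> splitRow(const std::string &row) {
    std::array<std::string, kExpectedFields> fields;
    std::istringstream in(row);
    std::size_t count = 0;
    std::string field;
    while (std::getline(in, field, ',')) {
        if (count == kExpectedFields)
            throw std::invalid_argument("record has more values than allowed");
        fields[count++] = field;
    }
    if (count < kExpectedFields)
        throw std::invalid_argument("record has fewer values than allowed");
    assert(count == kExpectedFields);
    return fields;  // a copy (or move) of the array, so nothing dangles
}
```

The caller can then write `auto parsed = splitRow(studentData[i]);` and index parsed[0] to parsed[8] safely. One caveat of this getline-based loop: an empty field at the very end of the row is not counted.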
To complete the answer that John has already given, here is some example code showing what such a function could look like.
Parsing, or tokenizing, can easily be done with std::sregex_token_iterator; that is one of the purposes of this iterator. You can see how simple it is to use below.
In the function we define a vector of strings and use its range constructor to do the whole tokenizing.
Then we do a sanity check and return the data.
Please see:
#include <string>
#include <regex>
#include <iterator>
#include <vector>
#include <algorithm>
#include <iostream>
const std::regex separator(",");
constexpr size_t ExpectedColumnSize = 9;
std::vector<std::string> commaSeparatedToArray(std::string rowToParse)
{
// Parse row into substrings
std::vector<std::string> columns{
std::sregex_token_iterator(rowToParse.begin(),rowToParse.end(),separator ,-1),
std::sregex_token_iterator() };
// Check number of columns
if (columns.size() != ExpectedColumnSize) {
std::cerr << "Error. Unexpected number of columns in record\n";
}
return columns;
}
// test code
int main()
{
// Define test data
std::string testInputData{ "1,2,3,4,5,6,7,8,9" };
// Get the result from the parser
std::vector<std::string> parsedElements{ commaSeparatedToArray(testInputData) };
// show the result on the console
std::copy(parsedElements.begin(), parsedElements.end(), std::ostream_iterator<std::string>(std::cout, "\n"));
return 0;
}

std::string::reserve() and std::string::clear() conundrum

This question starts with a bit of code, just because I think it is easier to see what I am after:
/*static*/
void
Url::Split
(std::list<std::string> & url
, const std::string& stringUrl
)
{
std::string collector;
collector.reserve(stringUrl.length());
for (auto c : stringUrl)
{
if (PathSeparator == c)
{
url.push_back(collector);
collector.clear(); // Sabotages my optimization with reserve() above!
}
else
{
collector.push_back(c);
}
}
url.push_back(collector);
}
In the code above, the collector.reserve(stringUrl.length()); line is supposed to reduce the number of heap operations performed during the loop below. After all, no substring can be longer than the whole URL, so reserving that much capacity up front looks like a good idea.
But, once a substring is finished and I add it to the url parts list, I need to reset the string to length 0 one way or another. Brief "peek definition" inspection suggests to me that at least on my platform, the reserved buffer will be released and with that, the purpose of my reserve() call is compromised.
Internally it calls some _Eos(0) in case of clear.
I could as well accomplish the same with collector.resize(0) but peeking definition reveals it also calls _Eos(newsize) internally, so the behavior is the same as in case of calling clear().
Now the question is, if there is a portable way to establish the intended optimization and which std::string function would help me with that.
Of course I could write collector[0] = '\0'; but that looks very off to me.
Side note: While I found similar questions, I do not think this is a duplicate of any of them.
Thanks, in advance.
In the C++11 standard clear is defined in terms of erase, which is defined as value replacement. There is no obvious guarantee that the buffer isn't deallocated. It might be there, implicit in other stuff, but I failed to find any such.
Without a formal guarantee that clear doesn't deallocate, and it appears that at least as of C++11 it isn't there, you have the following options:
1) Ignore the problem. After all, chances are that the microseconds incurred by dynamic buffer allocation will be absolutely irrelevant, and in addition, even without a formal guarantee the chance of clear deallocating is very low.
2) Require a C++ implementation where clear doesn't deallocate. (You can add an assert to this effect, checking .capacity().)
3) Do your own buffer implementation.
Ignoring the problem appears to be safe even where the allocations (if performed) would be time critical, because with common implementations clear does not reduce the capacity.
E.g., here with g++ and Visual C++ as examples:
#include <iostream>
#include <string>
using namespace std;
auto main() -> int
{
string s = "Blah blah blah";
cout << s.capacity();
s.clear();
cout << ' ' << s.capacity() << endl;
}
C:\my\so\0284>g++ keep_capacity.cpp -std=c++11
C:\my\so\0284>a
14 14
C:\my\so\0284>cl keep_capacity.cpp /Feb
keep_capacity.cpp
C:\my\so\0284>b
15 15
C:\my\so\0284>_
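The assert mentioned for the second option could look roughly like the sketch below. It is a free-function variant of the question's Url::Split with the separator passed as a parameter; the name splitUrl is made up for the example, and the assert documents an expectation about the implementation, not something the standard promises.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <string>

// Hypothetical free-function variant of Url::Split from the question.
void splitUrl(std::list<std::string> &parts, const std::string &stringUrl, char separator) {
    std::string collector;
    collector.reserve(stringUrl.length());
    const std::size_t reserved = collector.capacity();
    for (char c : stringUrl) {
        if (c == separator) {
            parts.push_back(collector);
            collector.clear();
            // Fires on an implementation where clear() gives the buffer back.
            assert(collector.capacity() >= reserved);
        } else {
            collector.push_back(c);
        }
    }
    parts.push_back(collector);
}
```

Since assert compiles away when NDEBUG is defined, the check costs nothing in release builds and only flags a surprising library in debug runs.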
Doing your own buffer management, if you really want to take it that far, can be done as follows:
#include <iostream>
#include <string>
#include <vector>
namespace my {
using std::string;
using std::vector;
class Collector
{
private:
vector<char> buffer_;
int size_;
public:
auto str() const
-> string
{ return string( buffer_.begin(), buffer_.begin() + size_ ); }
auto size() const -> int { return size_; }
void append( const char c )
{
if( size_ < int( buffer_.size() ) )
{
buffer_[size_++] = c;
}
else
{
buffer_.push_back( c );
buffer_.resize( buffer_.capacity() );
++size_;
}
}
void clear() { size_ = 0; }
explicit Collector( const int initial_capacity = 0 )
: buffer_( initial_capacity )
, size_( 0 )
{ buffer_.resize( buffer_.capacity() ); }
};
auto split( const string& url, const char pathSeparator = '/' )
-> vector<string>
{
vector<string> result;
Collector collector( url.length() );
for( const auto c : url )
{
if( pathSeparator == c )
{
result.push_back( collector.str() );
collector.clear();
}
else
{
collector.append( c );
}
}
if( collector.size() > 0 ) { result.push_back( collector.str() ); }
return result;
}
} // namespace my
auto main() -> int
{
using namespace std;
auto const url = "http://en.wikipedia.org/wiki/Uniform_resource_locator";
for( string const& part : my::split( url ) )
{
cout << '[' << part << ']' << endl;
}
}

Balanced parentheses ordering via dynamic programming

Hi, in the very famous book Code to Crack I came across this question:
Implement an algorithm to print all valid (e.g., properly opened and closed) combinations of n-pairs of parentheses.
Example:
input: 3 (i.e., 3 pairs of parentheses)
output: ()()(), ()(()), (())(), ((()))
#include <iostream>
#include <string>
using namespace std;
void _paren(int l,int r,string s,int count);
void paren(int n)
{
string s="";
_paren(n,n,s,n);
}
void _paren(int l,int r,string s,int count){
if(l<0 || r<0)
return;
if(l==0 && r==0)
cout<<s<<endl;
else{
if(l>0)
{
_paren(l-1,r,s+"(",count+1);
}
if(r>l)
_paren(l,r-1,s+")",count+1);
}
}
int main(){
int n;
cin>>n;
paren(n);
return 0;
}
This is a recursive approach I tried. I am pretty sure that we can solve this through dynamic programming as well, since we compute a lot of the same values again and again, but I have no idea how to implement it that way. I tried a tabular bottom-up approach but couldn't work it out. Please help me with just the basic idea of how to approach this.
DP does not really help you. The recursive algorithm is time and space optimal!
In fact, there is a reason not to use DP: the memory requirements! These will be huge.
A better algorithm is to have one character array that you pass in, have the recursive method modify parts of it, and print it when needed. I believe that solution is given in the book you mention.
DP can reduce the number of traversed states by choosing the optimal solution on every call. It also helps you reuse calculated values. Here there are no calculations to reuse: every valid state must be visited, and invalid states can be avoided with an if( ).
I suggest you implement another recursion (at least one that does not copy a new string object on every call; just declare a global char array and send it to the output when you need to).
My idea of recursion is
const int maxN = 64; // assumed upper bound on the string length
char arr[maxN]; int n; // n is the string length and must be even
void func(int pos, int count) { // position in string, count of opened '('
if( pos == n ) {
for(int i = 0; i < n; i++)
cout << char(arr[i]);
cout << "\n";
return;
}
if( n-pos-1 > count ) {
arr[pos] = '('; func(pos+1,count+1);
}
if( count > 0 ) {
arr[pos] = ')'; func(pos+1,count-1);
}
}
I haven't tested it, but I think the idea is clear.
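For reference, the same idea with the buffer passed in by reference instead of kept global might look like the following sketch. The names fillParens and allParens are made up for the example, and the results are collected in a vector rather than printed so that they are easy to inspect.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: fills buffer[pos..] in place. `open` counts the '('
// that are still unclosed. Complete strings are appended to `out`.
void fillParens(std::string &buffer, std::size_t pos, std::size_t open,
                std::vector<std::string> &out) {
    assert(pos <= buffer.size());
    if (pos == buffer.size()) {
        out.push_back(buffer);
        return;
    }
    const std::size_t remaining = buffer.size() - pos;
    if (open + 1 < remaining) {  // enough room left to close one more '('
        buffer[pos] = '(';
        fillParens(buffer, pos + 1, open + 1, out);
    }
    if (open > 0) {              // there is something to close
        buffer[pos] = ')';
        fillParens(buffer, pos + 1, open - 1, out);
    }
}

// Returns all valid combinations of `pairs` pairs of parentheses.
std::vector<std::string> allParens(std::size_t pairs) {
    std::string buffer(2 * pairs, ' ');
    std::vector<std::string> out;
    fillParens(buffer, 0, 0, out);
    return out;
}
```

Only one buffer of length 2n is ever written to; the only copies made are the finished strings stored in the result.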

How can I generate the following tree of permutations in C++?

Suppose we have the string ABCD
I would like to create the following tree:
ABCD <------ level 1
ABC ABD ACD BCD <------ level 2
AB AC AD BC BD CD <------ level 3
A B C D <------ level 4
And save it inside a vector in the following order:
ABCD->ABC->ABD->ACD->BCD->AB->AC->AD->BC->BD->CD->A->B->C->D
So from the starting point, I want to generate the nodes of the next level, store them inside the vector, then generate the nodes of the next level and do the same thing for all the remaining levels
I have created the following program to generate level 2 from level 1.
void test(int dimensions, vector<string> & nodes, const char* currentNode){
int i,j;
for(i=dimensions-1;i>=0;i--){
string temp; // a std::string instead of new char[] avoids leaking the buffer
for(j=0;j<dimensions;j++){
if(j!=i){
temp += currentNode[j]; // copy every character except the i-th
}
}
nodes.push_back(temp);
}
}
which is called from main:
vector<string> nodes;
int dimension = 4;
nodes.push_back("ABCD");
test(dimension, nodes, "ABCD");
This gives me the following: ABCD->ABC->ABD->ACD->BCD
As you can see the nodes of the level 2 are added successfully, however if I try to apply recursion here, for example for node "ABC"
I would get as a result:
AB -> AC -> BC
These will be saved successfully, however if the recursion keeps going, for example for node AB now it will find A -> B
so the resulting order of the nodes saved in the vector won't be how I described in the beginning.
Instead of
ABCD->ABC->ABD->ACD->BCD->AB->AC->AD->BC->BD->CD->...
it will be
ABCD->ABC->ABD->ACD->BCD->AB->AC->A->B->...
Finally, I would like the computation of this tree to be generalized for any number of dimensions. For example the initial node could be ABCD or ABCDEFGHIJKLM.
For some reason I believe this is very difficult to do, however I'm not exactly certain about it. Note that I don't want to use any external libraries for computing the permutations, I need to understand 100% the code in order to proceed with the algorithm that I want to implement.
Thank you in advance
As stated in the comments, I don't see how this is remotely related to permutations, but here's the code for what I think you're trying to achieve:
#include <algorithm>
#include <iostream>
#include <string>
#include <vector>
typedef std::vector<std::string> Layer;
Layer getNextLayer(const Layer &);
int main()
{
std::vector<Layer> layers;
layers.push_back(Layer());
layers[0].push_back("ABCDE");
while ( layers.back().back().size() > 1 )
{
layers.push_back(getNextLayer(layers.back()));
for ( size_t i = 0; i < layers.back().size(); ++i )
{
std::cout << layers.back()[i] << " ";
}
std::cout << "\n";
}
}
Layer getNextLayer(const Layer &layer)
{
Layer result;
for ( size_t i = 0; i < layer.size(); ++i )
{
const std::string item = layer[i];
for ( size_t j = 0; j < item.size(); ++j )
{
std::string new_item = item;
new_item.erase(new_item.begin() + j); // erase the j-th character from item
result.push_back(new_item);
}
}
std::sort(result.begin(), result.end());
result.erase(std::unique(result.begin(), result.end()), result.end()); // erase duplicates
return result;
}
This creates each layer based on the last one. To store it all in one vector, you just have to merge all these layers.
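The merge itself could be as simple as the sketch below, which appends the layers one after another, top level first. The helper name flatten is made up for the example; Layer is the same typedef as in the code above.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

typedef std::vector<std::string> Layer;

// Hypothetical helper: concatenates all layers into one vector, keeping
// the order of the layers and the order of the items inside each layer.
std::vector<std::string> flatten(const std::vector<Layer> &layers) {
    std::size_t total = 0;
    for (std::size_t i = 0; i < layers.size(); ++i)
        total += layers[i].size();

    std::vector<std::string> all;
    all.reserve(total);
    for (std::size_t i = 0; i < layers.size(); ++i)
        all.insert(all.end(), layers[i].begin(), layers[i].end());

    assert(all.size() == total);
    return all;
}
```

Because each layer is sorted before it is stored, the flattened vector for ABCD should come out in the order asked for in the question: ABCD, ABC, ABD, ACD, BCD, AB, and so on.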

How to capture a string into variable in a recursive function?

I tried to print all the possible combinations of the members of several vectors. Why doesn't the function below return the string as I expected?
#include <iostream>
#include <vector>
#include <fstream>
#include <sstream>
using namespace std;
string EnumAll(const vector<vector<string> > &allVecs, size_t vecIndex, string strSoFar)
{
string ResultString;
if (vecIndex >= allVecs.size())
{
//cout << strSoFar << endl;
ResultString = strSoFar;
//return ResultString;
}
for (size_t i=0; i<allVecs[vecIndex].size(); i++) {
strSoFar=EnumAll(allVecs, vecIndex+1, strSoFar+allVecs[vecIndex][i]);
}
ResultString = strSoFar; // Updated but still doesn't return the string.
return ResultString;
}
int main ( int arg_count, char *arg_vec[] ) {
vector <string> Vec1;
Vec1.push_back("T");
Vec1.push_back("C");
Vec1.push_back("A");
vector <string> Vec2;
Vec2.push_back("C");
Vec2.push_back("G");
Vec2.push_back("A");
vector <string> Vec3;
Vec3.push_back("C");
Vec3.push_back("G");
Vec3.push_back("T");
vector <vector<string> > allVecs;
allVecs.push_back(Vec1);
allVecs.push_back(Vec2);
allVecs.push_back(Vec3);
string OutputString = EnumAll(allVecs,0,"");
// print the string or process it with other function.
cout << OutputString << endl; // This prints nothing why?
return 0;
}
The expected output is:
TCC
TCG
TCT
TGC
TGG
TGT
TAC
TAG
TAT
CCC
CCG
CCT
CGC
CGG
CGT
CAC
CAG
CAT
ACC
ACG
ACT
AGC
AGG
AGT
AAC
AAG
AAT
You call EnumAll recursively, but you ignore the string that it returns. You have to decide how you are going to aggregate those strings - or what you are going to do with them.
Your function doesn't return anything because your last call doesn't return anything, since there's no return at the end of your function.
Edit:
One thing that you can do, is to insert your ResultString to a global vector each time before the return. And at the end, all your results will be available in this vector.
Here is an alternate solution. This does not expect you to pass anything but the initial vectors:
int resultSize( vector< vector<string> > vector ){
int x=1;
for( int i=0;i<vector.size(); i++ )
x *= vector[i].size();
return x;
}
vector<string> enumAll(const vector< vector<string> > allVecs )
{
//__ASSERT( allVecs.size() > 0 );
vector<string> result;
if( allVecs.size() == 1 ){
for( int i=0 ; i< allVecs[0].size(); i++){
result.push_back( allVecs[0][i] );
}
return result;
}
for( int i=0; i<allVecs[0].size(); i++ ){
for( int j=0; j<resultSize( vector< vector<string> >(allVecs.begin()+1, allVecs.end() ) ); j++){
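result.push_back( allVecs[0][i] + enumAll(vector< vector<string> >(allVecs.begin()+1, allVecs.end() ))[j] );// enumAll is called multiple times on the same tail of allVecs. Can be optimized.
}
}
return result; // without this, flowing off the end of the function is undefined behavior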
}
Advantage of this method:
This is very readable in terms of the recursion. It has an easily identifiable base step and an easily identifiable recursive step. It works as follows: each recursive call enumerates all possible strings from the remaining n-1 vectors, and the current step prepends each element of the first vector to them.
Disadvantages of this method:
1. enumAll() function is called multiple times returning the same result.
2. Heavy on stack usage since this is not tail recursion.
We can fix (1.) by doing the following, but unless we eliminate the recursion itself, we cannot get rid of (2.).
vector<string> enumAll(const vector< vector<string> > allVecs )
{
//__ASSERT( allVecs.size() > 0 );
vector<string> result;
if( allVecs.size() == 1 ){
for( int i=0 ; i< allVecs[0].size(); i++){
result.push_back( allVecs[0][i] );
}
return result;
}
const vector< vector<string> > tempVector(allVecs.begin()+1, allVecs.end() );
vector<string> tempResult = enumAll( tempVector );// recurse
int size = resultSize( tempVector );
cout << size << " " << tempResult.size() << endl;
for( int i=0; i<allVecs[0].size(); i++ ){
for( int j=0; j<size; j++){
result.push_back( allVecs[0][i] + tempResult[j] );
}
}
return result; // without this, flowing off the end of the function is undefined behavior
}
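Getting rid of the recursion altogether is also possible. The sketch below builds the combinations one input vector at a time, extending every prefix found so far; the name enumAllIterative is made up for the example, and this is only one of several ways to write such a loop.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical iterative variant: no recursion, so the call stack stays flat.
std::vector<std::string> enumAllIterative(
        const std::vector<std::vector<std::string> > &allVecs) {
    std::vector<std::string> result(1, std::string());  // start with one empty prefix
    for (std::size_t v = 0; v < allVecs.size(); ++v) {
        std::vector<std::string> next;
        next.reserve(result.size() * allVecs[v].size());
        // Extend every prefix built so far with every element of vector v.
        for (std::size_t p = 0; p < result.size(); ++p)
            for (std::size_t i = 0; i < allVecs[v].size(); ++i)
                next.push_back(result[p] + allVecs[v][i]);
        assert(next.size() == result.size() * allVecs[v].size());
        result.swap(next);
    }
    return result;
}
```

For the three vectors from the question this yields 27 strings, starting with TCC and TCG and ending with AAT. For an empty allVecs it returns a single empty string, which may or may not be what a caller wants.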
Your second return should also accumulate the strSoFar in some way. Something like:
for (size_t i=0; i<allVecs[vecIndex].size(); i++)
{
strSoFar = EnumAll(allVecs, vecIndex+1, strSoFar+allVecs[vecIndex][i]);
}
ResultString = strSoFar;
return ResultString;
The code you provided crashes. In the following line, notice that vecIndex can run past the end of allVecs. There is no check on it in the loop. Also, in the if condition above, you do not reset vecIndex either. So you will have an access violation.
strSoFar = EnumAll(allVecs, vecIndex+1, strSoFar+allVecs[vecIndex][i]);
To fix it, either reset vecIndex in the if() or use the following for statement:
for (size_t i=0; vecIndex < allVecs.size() && i<allVecs[vecIndex].size(); i++){...} // bounds check first, so allVecs[vecIndex] is never read out of range
Edit: However, this does not give the correct output yet.
Your function determines all the correct combinations but they are lost since you do not aggregate them properly.
I see you asked the same question here. I will assume you are now looking for a means to get the output back to the top level so you can handle it from there.
The problem then comes down to how you aggregate the output. You are using a string, but are looking for multiple rows of data. There are many ways to do this; here is one using a vector container.
#include <iostream>
#include <vector>
#include <fstream>
#include <sstream>
using namespace std;
void printAll(const vector<string> data);
void EnumAll(const vector<vector<string> > &allVecs, size_t vecIndex, vector<string>&allStr, string strSoFar)
{
if (vecIndex >= allVecs.size())
{
allStr.push_back(strSoFar);
return;
}
for (size_t i=0; i<allVecs[vecIndex].size(); i++)
EnumAll(allVecs, vecIndex+1, allStr, strSoFar+allVecs[vecIndex][i]);
}
int main ( int arg_count, char *arg_vec[] ) {
vector <string> Vec1;
Vec1.push_back("T");
Vec1.push_back("C");
Vec1.push_back("A");
vector <string> Vec2;
Vec2.push_back("C");
Vec2.push_back("G");
Vec2.push_back("A");
vector <string> Vec3;
Vec3.push_back("C");
Vec3.push_back("G");
Vec3.push_back("T");
vector <vector<string> > allVecs;
allVecs.push_back(Vec1);
allVecs.push_back(Vec2);
allVecs.push_back(Vec3);
vector<string> allStr;
EnumAll(allVecs,0,allStr,"");
// print the string or process it with other function.
printAll(allStr);
return 0;
}
void printAll(const vector<string> data)
{
vector<string>::const_iterator c = data.begin();
while(c!=data.end())
{
cout << *c << endl;
++c;
}
}