Count the occurrences and print top K using C/STL - c++

I have a large text file with a token on each line. I want to count the number of occurrences of each token and sort the result. How do I do that efficiently in C++, preferably using built-in functions and as little code as possible? I know how to do it in Python, but I'm not sure how to do it using unordered_map in the STL.

I'd go with the unordered_map approach. For selecting the most frequent k tokens, assuming that k is smaller than the total number of tokens, you should take a look at std::partial_sort.
By the way, ++frequency_map[token] (where frequency_map is, say, std::unordered_map<std::string, long>) is perfectly acceptable in C++, although I think the equivalent in Python will blow up on newly seen tokens.
OK, here you go:
void most_frequent_k_tokens(std::istream& in, std::ostream& out, long k = 1) {
using mapT = std::unordered_map<std::string, long>;
using pairT = typename mapT::value_type;
mapT freq;
for (std::string token; in >> token; ) ++freq[token];
std::vector<pairT*> tmp;
for (auto& p : freq) tmp.push_back(&p);
auto lim = tmp.begin() + std::min<long>(k, tmp.size());
std::partial_sort(tmp.begin(), lim, tmp.end(),
[](pairT* a, pairT* b)->bool {
return a->second > b->second
|| (a->second == b->second && a->first < b->first);
});
for (auto it = tmp.begin(); it != lim; ++it)
out << (*it)->second << ' ' << (*it)->first << std::endl;
}

Assuming you know how to read lines from a file in C++, this should be a push in the right direction
std::string token = "token read from file";
std::unordered_map<std::string,int> map_of_tokens;
map_of_tokens[token] = map_of_tokens[token] + 1;
You can then print them out like this (for a test):
for ( auto i = map_of_tokens.begin(); i != map_of_tokens.end(); ++i ) {
std::cout << i->first << " : " << i->second << "\n";
}

Related

How to work around the error ' does not provide a call operator ' in C++ when trying to use += operator for strings?

I am working on vector of strings and map of strings and int to print a histogram.
What I need to do now is return a count of the strings, from the most frequent to the least frequent elements. This is the part of my code where I am having issues so far:
string histogram( const vector<string> &v) {
string st = "";
map<vector<string>, int> h;
for(auto ite = v.begin(); ite!= v.end(); ite++ ) {
if(h.find(v) == h.end()) {
h[v] = 1;
} else {
h[v]++;
}
}
for (auto it: h) {
st += "[" + it.first + ":" + it.second + "]";
}
return st;
}
The error I keep getting is related to this line:
st += "[" + it.first + ":" + it.second + "]";
I have been checking online how to fix this error or what I am missing for I still can't see exactly how to work with the operator for the strings.

How to find a sequence of letter in a given string?

My string is "AAABBAABABB",
and I want to get the result as
A = 3
B = 2
A = 2
B = 1
A = 1
B = 2
I have tried to use
for (int i = 0; i < n - 1; i++) {
if (msg[i] == msg[i + 1]) {
if(msg[i]==A)
a++;
else
b++;
}
}
I tried this but it didn't work for me, and I don't know if there are any other ways to do it. Please help me out.
Iterate through the string as follows:
If i = 0, store the 0th character in a variable and set the counter to 1.
If the ith character is equal to the previous character, increase the counter.
If the ith character is not equal to the (i-1)th character, print the character and the counter, then start counting the new character.
Try the following snippet:
char ch = msg[0];
int cnt = 1;
for (int i = 1; i < n; i ++){
if(msg[i] != msg[i-1]){
cout<<ch<<" "<<cnt<<endl;
cnt = 1;
ch = msg[i];
}
else {
cnt++;
}
}
cout<<ch<<" "<<cnt<<endl;
You can use std::vector<std::pair<char, std::size_t>> to store character occurrences.
Eventually, you would have something like:
#include <iostream>
#include <utility>
#include <vector>
#include <string>
int main() {
std::vector<std::pair<char, std::size_t>> occurrences;
std::string str{ "AAABBAABABB" };
for (auto const c : str) {
if (!occurrences.empty() && occurrences.back().first == c) {
occurrences.back().second++;
} else {
occurrences.emplace_back(c, 1);
}
}
for (auto const& it : occurrences) {
std::cout << it.first << " " << it.second << std::endl;
}
return 0;
}
It will output:
A 3
B 2
A 2
B 1
A 1
B 2
This is very similar to run-length encoding. The simplest way I can think of (fewest lines of code) is this:
void runLength(const char* msg) {
const char *p = msg;
while (p && *p) {
const char *start = p++; // start of a run
while (*p == *start) p++; // move p to next run (different run)
std::cout << *start << " = " << (p - start) << std::endl;
}
}
Please note that:
This function does not need to know the length of the input string beforehand; it stops at the end of the string, the '\0'.
It also works for an empty string and for NULL. Both of these work: runLength(""); runLength(nullptr);
I cannot comment yet, but if you look carefully, mahbubcseju's code does not work for an empty msg.
With std, you might do:
void print_sequence(const std::string& s)
{
auto it = s.begin();
while (it != s.end()) {
auto next = std::adjacent_find(it, s.end(), std::not_equal_to<>{});
next = next == s.end() ? s.end() : next + 1;
std::cout << *it << " = " << std::distance(it, next) << std::endl;
it = next;
}
}
Welcome to stackoverflow. Oooh, an algorithm problem? I'll add a recursive example:
#include <iostream>
#include <string>
void countingThing( const std::string &input, size_t index = 1, size_t count = 1 ) {
if( input.size() == 0 ) return;
if( input[index] != input[index - 1] ) {
std::cout << input[index - 1] << " = " << count << std::endl;
count = 0;
}
if( index < input.size() ) return countingThing( input, index + 1, count + 1 );
}
int main() {
countingThing( "AAABBAABABB" );
return 0;
}
To help work out algorithms and figure out what to write in your code, I suggest a few steps:
First, write out your problem in multiple ways: what sort of input it expects and what you would like the output to be.
Secondly, try to solve it on paper and see how the logic would work. A good tip here is to try to understand how YOU would solve it. Your brain is a good problem solver, and if you can listen to what it does, you can turn it into code (it isn't always the most efficient, though).
Thirdly, work it out on paper, see if your solution does what you expect it to do by following your steps by hand. Then you can translate the solution to code, knowing exactly what you need to write.

How to find in c++ STL in a specific manner?

map< pair<int,int> , int > m ;
Here pair.first and pair.second are positive and pair.second >= pair.first.
I would like to find all iterators in map m that match a given key. The key is an integer that lies within a pair's range, e.g. key 2 lies within the pairs [2,5] and [1,2].
e.g. m[1,3] = 10 , m[3,5] = 6 , m[1,8] = 9 , m[7,8] = 15 then when I search for m.find(3) then it would return iterator for m[1,3] , m[1,8] , m[3,5] .If there is no key then it would return m.end().
I'm not sure why you want to do this, but these days Boost Interval Container library is pretty capable.
Assuming that you might have wanted to keep track of the total (sum) of mapped values for a specific point, you could simply apply a splitting Combining Style to your input data and profit:
#include <boost/icl/split_interval_map.hpp>
#include <boost/range/iterator_range.hpp>
#include <iostream>
namespace icl = boost::icl;
int main() {
using Map = icl::split_interval_map<int, int>;
using Ival = Map::interval_type;
Map m;
m.add({Ival::closed(1,3), 10});
m.add({Ival::closed(3,5), 6});
m.add({Ival::closed(1,8), 9});
m.add({Ival::closed(7,8), 15});
for (auto e : m) std::cout << e.first << " -> " << e.second << "\n";
std::cout << "------------------------------\n";
for (auto e : boost::make_iterator_range(m.equal_range(Ival::closed(3,3))))
std::cout << e.first << " -> " << e.second << "\n";
}
This will tell us:
[1,3) -> 19
[3,3] -> 25
(3,5] -> 15
(5,7) -> 9
[7,8] -> 24
------------------------------
[3,3] -> 25
Notice how
the consolidation very accurately reflects that the point [3,3] is the only point that coincided with both [1,3] and [3,5] from the input data simultaneously, and as a result we get half-open intervals in the combined set ([1,3), [3,3] and (3,5]).
Note also how the query for this one point correctly returns the sum of 10+6+9 for all the three intervals you were interested in.
What Use Is This?
So, you see I shifted the focus of the question from the "How?" to the "What?". It usually helps to state the goal of code instead of the particular mechanics.
Of course, if instead of the sum you'd have been interested in the average, the minimum or the maximum, you'd likely find yourself writing some custom combining strategy.
Bonus: In case you wanted it, here's how you can at least write the solution to the problem posed in the OP using Boost Icl: Live On Coliru. Though it's not particularly efficient, it's straightforward and robust.
I think that instead of iterators you could store (and use) corresponding keys of the map. If so then the code could look like
#include <iostream>
#include <map>
#include <algorithm>
#include <vector>
#include <utility>
int main()
{
std::map<std::pair<int, int>, int> m;
m[{ 1, 3}] = 10;
m[{ 3, 5}] = 6;
m[{ 7, 8}] = 15;
typedef std::map<std::pair<int, int>, int>::value_type value_type;
typedef std::map<std::pair<int, int>, int>::key_type key_type;
int search;
auto in_range = [&search]( const value_type &value )
{
return value.first.first <= search && search <= value.first.second;
};
search = 3;
std::vector<key_type> v;
v.reserve( std::count_if( m.begin(), m.end(), in_range ) );
for ( const auto &p : m )
{
if ( in_range( p ) ) v.push_back( p.first );
}
for ( const auto &p : v )
{
std::cout << "[ " << p.first << ", " << p.second << " ]" << std::endl;
}
return 0;
}
The output is
[ 1, 3 ]
[ 3, 5 ]
Take into account that it is supposed that key.first is less than or equal to key.second where key is the key of the map.
There's no way to avoid a linear search from the start of the map, because if the first element is {0,INT_MAX} then it matches. The elements you want are also not necessarily in a contiguous range, e.g. if you have {1,3},{2,2},{3,5} you only want the first and last elements when the key is 3.
You can stop searching when you reach an element with first greater than the key.
Something like this (untested):
typedef map< pair<int,int> , int > Map;
std::vector<Map::const_iterator>
find(int key, const Map& m)
{
std::vector<Map::const_iterator> matches;
// Keys are ordered by first, so stop once first exceeds the key.
for (Map::const_iterator it = m.begin(), end = m.end(); it != end && it->first.first <= key; ++it)
{
if (key <= it->first.second)
matches.push_back(it);
}
return matches;
}
You could turn it into a functor and use find_if but I'm not sure it's worth it.
If you just want one iterator returned per call:
typedef map< pair<int,int> , int > Map;
Map::const_iterator
find(int key, const Map& m, Map::const_iterator it)
{
// Keys are ordered by first, so stop once first exceeds the key.
for (Map::const_iterator end = m.end(); it != end && it->first.first <= key; ++it)
{
if (key <= it->first.second)
return it;
}
return m.end();
}
Map::const_iterator
find(int key, const Map& m)
{ return find(key, m, m.begin()); }
If you only need an iterator to the next value found in the map, you can use the std::find_if algorithm like this:
int key=2;
map<pair<int,int>, int>::iterator it =std::find_if(m.begin(),m.end(),
[key](pair<pair<int,int>,int> entry)
{
return (entry.first.first <= key)
&& (entry.first.second >= key);
}
);
if (it != m.end()) { // find_if returns m.end() when nothing matches
cout << "the next value is: [" << it->first.first << "/";
cout << it->first.second << "] " << it->second << endl;
}

Longest common substring from more than two strings - C++

I need to compute the longest common substrings from a set of filenames in C++.
Precisely, I have a std::list of std::strings (or the Qt equivalent, also fine)
char const *x[] = {"FirstFileWord.xls", "SecondFileBlue.xls", "ThirdFileWhite.xls", "ForthFileGreen.xls"};
std::list<std::string> files(x, x + sizeof(x) / sizeof(*x));
I need to compute the n distinct longest common substrings of all strings, in this case e.g. for n=2
"File" and ".xls"
If I could compute the longest common substring, I could cut it out and run the algorithm again to get the second longest, so essentially this boils down to:
Is there a (reference?) implementation for computing the LCS of a std::list of std::strings?
This is not a good answer but a dirty solution that I have - brute force on a QList of QUrls from which only the part after the last "/" is taken. I'd love to replace this with "proper" code.
(I have discovered http://www.icir.org/christian/libstree/ - which would help greatly, but I can't get it to compile on my machine. Someone used this maybe?)
QString SubstringMatching::getMatchPattern(QList<QUrl> urls)
{
QString a;
int foundPosition = -1;
int foundLength = -1;
for (int i=urls.first().toString().lastIndexOf("/")+1; i<urls.first().toString().length(); i++)
{
bool hit=true;
int xj;
for (int j=0; j<urls.first().toString().length()-i+1; j++ ) // try to match from position i up to the end of the string :: test character at pos. (i+j)
{
if (!hit) break;
QString firstString = urls.first().toString().right( urls.first().toString().length()-i ).left( j ); // this needs to match all k strings
//qDebug() << "SEARCH " << firstString;
for (int k=1; k<urls.length(); k++) // test all other strings, k = test string number
{
if (!hit) break;
//qDebug() << " IN " << urls.at(k).toString().right(urls.at(k).toString().length() - urls.at(k).toString().lastIndexOf("/")+1);
//qDebug() << " RES " << urls.at(k).toString().indexOf(firstString, urls.at(k).toString().lastIndexOf("/")+1);
if (urls.at(k).toString().indexOf(firstString, urls.at(k).toString().lastIndexOf("/")+1)<0) {
xj = j;
//qDebug() << "HIT LENGTH " << xj-1 << " : " << firstString;
hit = false;
}
}
}
if (hit) xj = urls.first().toString().length()-i+1; // hit up to the end of the string
if ((xj-2)>foundLength) // have longer match than existing, j=1 is match length
{
foundPosition = i; // at the current position
foundLength = xj-1;
//qDebug() << "Found at " << i << " length " << foundLength;
}
}
a = urls.first().toString().right( urls.first().toString().length()-foundPosition ).left( foundLength );
//qDebug() << a;
return a;
}
If as you say suffix trees are too heavyweight or otherwise impractical, the following
fairly simple brute-force approach may be adequate for your application.
I assume distinct substrings shall be non-overlapping and are picked from
left to right.
Even with these assumptions, there need not be a unique set that comprises
"the N distinct longest common substrings" of a set of strings. Whatever N is,
there might be more than N distinct common substrings all of the same maximal
length and any choice of N from among them would be arbitrary. Accordingly
the solution finds the at-most N *sets* of the longest distinct common
substrings in which all those of the same length are one set.
The algorithm is as follows:
Q is the target quota of lengths.
Strings is the problem set of strings.
Results is an initially empty multimap that maps a length to a set of strings,
Results[l] being the set with length l
N, initially 0, is the number of distinct lengths represented in Results
If Q is 0 or Strings is empty return Results
Find any shortest member of Strings; keep a copy of it S and remove it
from Strings. We proceed by comparing the substrings of S with those
of Strings because all the common substrings of {Strings, S} must be
substrings of S.
Iteratively generate all the substrings of S, longest first, using the
obvious nested loop controlled by offset and length. For each substring ss of
S:
If ss is not a common substring of Strings, next.
Iterate over Results[l] for l >= the length of ss until end of
Results or until ss is found to be a substring of the examined
result. In the latter case, ss is not distinct from a result already
in hand, so next.
ss is a common substring distinct from any already in hand. Iterate over
Results[l] for l < the length of ss, deleting each result that is a
substring of ss, because all those are shorter than ss and not distinct
from it. ss is now a common substring distinct from any already in hand and
all others that remain in hand are distinct from ss.
For l = the length of ss, check whether Results[l] exists, i.e. if
there are any results in hand the same length as ss. If not, call that
a NewLength condition.
Check also if N == Q, i.e. we have already reached the target quota of distinct
lengths. If NewLength obtains and also N == Q, call that a StickOrRaise condition.
If StickOrRaise obtains then compare the length of ss with l = the
length of the shortest results in hand. If ss is shorter than l
then it is too short for our quota, so next. If ss is longer than l
then all the shortest results in hand are to be ousted in favour of ss, so delete
Results[l] and decrement N.
Insert ss into Results keyed by its length.
If NewLength obtains, increment N.
Abandon the inner iteration over substrings of S that have the
same offset of ss but are shorter, because none of them are distinct
from ss.
Advance the offset in S for the outer iteration by the length of ss,
to the start of the next non-overlapping substring.
Return Results.
Here is a program that implements the solution and demonstrates it with
a list of strings:
#include <list>
#include <map>
#include <string>
#include <iostream>
#include <algorithm>
using namespace std;
// Get a non-const iterator to the shortest string in a list
list<string>::iterator shortest_of(list<string> & strings)
{
auto where = strings.end();
size_t min_len = size_t(-1);
for (auto i = strings.begin(); i != strings.end(); ++i) {
if (i->size() < min_len) {
where = i;
min_len = i->size();
}
}
return where;
}
// Say whether a string is a common substring of a list of strings
bool
is_common_substring_of(
string const & candidate, list<string> const & strings)
{
for (string const & s : strings) {
if (s.find(candidate) == string::npos) {
return false;
}
}
return true;
}
/* Get a multimap whose keys are the at-most `quota` greatest
lengths of common substrings of the list of strings `strings`, each key
multi-mapped to the set of common substrings of that length.
*/
multimap<size_t,string>
n_longest_common_substring_sets(list<string> & strings, unsigned quota)
{
size_t nlengths = 0;
multimap<size_t,string> results;
if (quota == 0) {
return results;
}
auto shortest_i = shortest_of(strings);
if (shortest_i == strings.end()) {
return results;
}
string shortest = *shortest_i;
strings.erase(shortest_i);
for ( size_t start = 0; start < shortest.size();) {
size_t skip = 1;
for (size_t len = shortest.size(); len > 0; --len) {
string subs = shortest.substr(start,len);
if (!is_common_substring_of(subs,strings)) {
continue;
}
auto i = results.lower_bound(subs.size());
for ( ;i != results.end() &&
i->second.find(subs) == string::npos; ++i) {}
if (i != results.end()) {
continue;
}
for (i = results.begin();
i != results.end() && i->first < subs.size(); ) {
if (subs.find(i->second) != string::npos) {
i = results.erase(i);
} else {
++i;
}
}
auto hint = results.lower_bound(subs.size());
bool new_len = hint == results.end() || hint->first != subs.size();
if (new_len && nlengths == quota) {
size_t min_len = results.begin()->first;
if (min_len > subs.size()) {
continue;
}
results.erase(min_len);
--nlengths;
}
nlengths += new_len;
results.emplace_hint(hint,subs.size(),subs);
len = 1;
skip = subs.size();
}
start += skip;
}
return results;
}
// Testing ...
int main()
{
list<string> strings{
"OfBitWordFirstFileWordZ.xls",
"SecondZWordBitWordOfFileBlue.xls",
"ThirdFileZBitWordWhiteOfWord.xls",
"WordFourthWordFileBitGreenZOf.xls"};
auto results = n_longest_common_substring_sets(strings,4);
for (auto const & val : results) {
cout << "length: " << val.first
<< ", substring: " << val.second << endl;
}
return 0;
}
Output:
length: 1, substring: Z
length: 2, substring: Of
length: 3, substring: Bit
length: 4, substring: .xls
length: 4, substring: File
length: 4, substring: Word
(Built with gcc 4.8.1)

Determining most freq char element in a vector<char>?

I am trying to determine the most frequent character in a vector that has chars as its elements.
I am thinking of doing this:
looping through the vector and creating a map, where a key would be a unique char found in the vector. The corresponding value would be the integer count of the frequency of that char.
After I have gone through all elements in the vector, the map will
contain all character frequencies. Thus I will then have to find
which key had the highest value and therefore determine the most
frequent character in the vector.
This seems quite convoluted though, so I was wondering whether this method would be considered 'acceptable' in terms of performance and good coding practice.
Can this be done in a better way?
If you are only using regular ASCII characters, you can make the solution a bit faster: instead of using a map, use an array of size 256 and count the occurrences of the character with code x in the array cell count[x]. This removes the log(256) factor of the map lookups from your solution and thus makes it a bit faster. I do not think much more can be done to optimize this algorithm.
Sorting a vector of chars and then iterating through it looking for the maximum run length seems to be 5 times faster than using the map approach (using the fairly unscientific test code below acting on 16M chars). On the surface both functions should perform close to each other: sorting is O(N log N), and each map lookup or insertion costs O(log K) for K distinct characters. However, the sorting method probably benefits from branch prediction and from working in place on a contiguous vector.
The resultant output is:
Most freq char is '\334', appears 66288 times.
usingSort() took 938 milliseconds
Most freq char is '\334', appears 66288 times.
usingMap() took 5124 milliseconds
And the code is:
#include <algorithm>
#include <chrono>
#include <cstdlib>
#include <iostream>
#include <map>
#include <vector>
void usingMap(std::vector<char> v)
{
std::map<char, int> m;
for ( auto c : v )
{
auto it= m.find(c);
if( it != m.end() )
m[c]++;
else
m[c] = 1;
}
char mostFreq;
int count = 0;
for ( auto mi : m )
if ( mi.second > count )
{
mostFreq = mi.first;
count = mi.second;
}
std::cout << "Most freq char is '" << mostFreq << "', appears " << count << " times.\n";
}
void usingSort(std::vector<char> v)
{
if ( v.empty() ) return; // nothing to count
std::sort( v.begin(), v.end() );
char currentChar = v[0];
char mostChar = v[0];
int currentCount = 0;
int mostCount = 0;
for ( auto c : v )
{
if ( c == currentChar )
currentCount++;
else
{
if ( currentCount > mostCount)
{
mostChar = currentChar;
mostCount = currentCount;
}
currentChar = c;
currentCount = 1;
}
}
// The final run is never followed by a different char, so check it here.
if ( currentCount > mostCount )
{
mostChar = currentChar;
mostCount = currentCount;
}
std::cout << "Most freq char is '" << mostChar << "', appears " << mostCount << " times.\n";
}
int main(int argc, const char * argv[])
{
size_t size = 1024*1024*16;
std::vector<char> v(size);
for ( int i = 0; i < size; i++)
{
v[i] = random() % 256;
}
auto t1 = std::chrono::high_resolution_clock::now();
usingSort(v);
auto t2 = std::chrono::high_resolution_clock::now();
std::cout
<< "usingSort() took "
<< std::chrono::duration_cast<std::chrono::milliseconds>(t2-t1).count()
<< " milliseconds\n";
auto t3 = std::chrono::high_resolution_clock::now();
usingMap(v);
auto t4 = std::chrono::high_resolution_clock::now();
std::cout
<< "usingMap() took "
<< std::chrono::duration_cast<std::chrono::milliseconds>(t4-t3).count()
<< " milliseconds\n";
return 0;
}