Determining most freq char element in a vector<char>? - c++

I am trying to determine the most frequent character in a vector that has chars as its elements.
I am thinking of doing this:
looping through the vector and creating a map, where a key would be a unique char found in the vector. The corresponding value would be the integer count of the frequency of that char.
After I have gone through all elements in the vector, the map will
contain all character frequencies. Thus I will then have to find
which key had the highest value and therefore determine the most
frequent character in the vector.
This seems quite convoluted, though, so I was wondering whether this method would be considered 'acceptable' in terms of performance and good coding practice.
Can this be done in a better way?

If you are only using regular ASCII characters, you can make the solution a bit faster: instead of using a map, use an array of size 256 and count the occurrences of the character with code x in the array cell count[x] (cast the char to unsigned char before indexing, since plain char may be signed). This removes a factor of log(256) from your solution and thus makes it a bit faster. I do not think much more can be done to optimize this algorithm.
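A minimal sketch of the array-based count described above, assuming 8-bit chars. The function name mostFrequentChar and its tie-breaking rule (the smallest byte value wins) are illustrative choices, not part of the original answer.

```cpp
#include <array>
#include <cstddef>
#include <vector>

// Hypothetical helper (not from the thread): returns the most frequent
// character in v, or '\0' if v is empty. Ties go to the smallest byte value.
char mostFrequentChar(const std::vector<char>& v)
{
    std::array<std::size_t, 256> count{};  // zero-initialized; assumes 8-bit char
    for (char c : v)
        ++count[static_cast<unsigned char>(c)];  // cast: plain char may be signed

    std::size_t best = 0;
    for (std::size_t i = 1; i < count.size(); ++i)
        if (count[i] > count[best])
            best = i;
    return static_cast<char>(best);
}
```

The counting pass touches each element once and the final scan is a fixed 256 steps, so the whole thing is linear in the size of the vector.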

Sorting a vector of chars and then iterating through it looking for the longest run seems to be about 5 times faster than the map approach (using the fairly unscientific test code below on 16M chars). On paper the map should not lose: with at most 256 distinct keys each map operation costs O(log 256), so the map approach is effectively O(N), while the sort is O(N log N). In practice the sort probably benefits from working in place on contiguous memory, which is friendlier to the cache and the branch predictor than walking the nodes of a std::map.
The resultant output is:
Most freq char is '\334', appears 66288 times.
usingSort() took 938 milliseconds
Most freq char is '\334', appears 66288 times.
usingMap() took 5124 milliseconds
And the code is:
#include <algorithm>
#include <chrono>
#include <cstdlib>
#include <iostream>
#include <map>
#include <vector>
void usingMap(std::vector<char> v)
{
std::map<char, int> m;
for ( auto c : v )
{
auto it= m.find(c);
if( it != m.end() )
m[c]++;
else
m[c] = 1;
}
char mostFreq = 0; // stays 0 if v is empty
int count = 0;
for ( auto mi : m )
if ( mi.second > count )
{
mostFreq = mi.first;
count = mi.second;
}
std::cout << "Most freq char is '" << mostFreq << "', appears " << count << " times.\n";
}
void usingSort(std::vector<char> v)
{
if ( v.empty() ) return; // v[0] below requires at least one element
std::sort( v.begin(), v.end() );
char currentChar = v[0];
char mostChar = v[0];
int currentCount = 0;
int mostCount = 0;
for ( auto c : v )
{
if ( c == currentChar )
currentCount++;
else
{
if ( currentCount > mostCount)
{
mostChar = currentChar;
mostCount = currentCount;
}
currentChar = c;
currentCount = 1;
}
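}
// The loop only compares a run when the next one starts, so compare the final run here.
if ( currentCount > mostCount )
{
mostChar = currentChar;
mostCount = currentCount;
}
std::cout << "Most freq char is '" << mostChar << "', appears " << mostCount << " times.\n";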
}
int main(int argc, const char * argv[])
{
size_t size = 1024*1024*16;
std::vector<char> v(size);
for ( size_t i = 0; i < size; i++)
{
v[i] = random() % 256;
}
auto t1 = std::chrono::high_resolution_clock::now();
usingSort(v);
auto t2 = std::chrono::high_resolution_clock::now();
std::cout
<< "usingSort() took "
<< std::chrono::duration_cast<std::chrono::milliseconds>(t2-t1).count()
<< " milliseconds\n";
auto t3 = std::chrono::high_resolution_clock::now();
usingMap(v);
auto t4 = std::chrono::high_resolution_clock::now();
std::cout
<< "usingMap() took "
<< std::chrono::duration_cast<std::chrono::milliseconds>(t4-t3).count()
<< " milliseconds\n";
return 0;
}

Related

How to compare long strings in C++?

I know how to compare two strings with "==" or "compare", but if the strings are very long, should we hash them and compare the hash codes instead?
#include <chrono>
#include <functional>
#include <iostream>
#include <string>
using namespace std;
static int n = 100000;
bool TestCompare(const string& a, const string& b) {
return a == b;
}
bool TestCompareHash(const string& a, const string& b) {
std::hash<std::string> hash_fn;
std::size_t str_hash_a = hash_fn(a);
std::size_t str_hash_b = hash_fn(b);
return str_hash_a == str_hash_b;
}
int main()
{
string a(100, 'a');
string b(100, 'c');
std::chrono::time_point<std::chrono::system_clock> now = std::chrono::system_clock::now();
for (int i = 0; i < n; i++) {
TestCompare(a, b);
}
std::chrono::duration<float> difference = std::chrono::system_clock::now() - now;
cout << "difference.count() 1: " << difference.count() << endl;
now = std::chrono::system_clock::now();
for (int i = 0; i < n; i++) {
TestCompareHash(a, b);
}
difference = std::chrono::system_clock::now() - now;
cout << "difference.count() 2: " << difference.count() << endl;
return 0;
}
I tested this code and found that the hash version slows down as the strings get longer. Why?
when string length is 100
difference.count() 1: 0.00263665
difference.count() 2: 0.00713478 //hash
when string length is 10000
difference.count() 1: 0.00322366
difference.count() 2: 1.99765 //hash
Following the comments, I made some improvements to the test, such as making both strings match exactly except for the last character.
It seems that hashing does not reduce the amount of computation. It may be possible to do these operations in the database to avoid a single point of failure, but does hashing make much sense for comparing strings?
In your case the main issue is that you need to compute the hashes first, and that costs more than comparing the strings. The comparison checks characters only until they differ (O(n) at worst), while std::hash<std::string> generally has to go over every character of both strings (O(n) every time).
Hashes would help if you compute and store them once and then expect to compare the strings many times.
Note the hashes can be used only to compare for equality (e.g. no > or <).
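The "compute once, compare many times" idea can be sketched as below. The HashedString class is a hypothetical wrapper, not something from the thread. It also falls back to a full comparison when the hashes match, because equal hashes do not guarantee equal strings.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <utility>

// Hypothetical wrapper (not from the thread): hashes the string once on
// construction so that repeated equality checks can usually exit early.
class HashedString {
public:
    explicit HashedString(std::string s)
        : text_(std::move(s)), hash_(std::hash<std::string>{}(text_)) {}

    const std::string& text() const { return text_; }

    friend bool operator==(const HashedString& a, const HashedString& b)
    {
        // Different hashes prove the strings differ. Equal hashes do not
        // prove equality (collisions exist), so fall back to a full compare.
        return a.hash_ == b.hash_ && a.text_ == b.text_;
    }

private:
    std::string text_;  // declared first, so it is initialized before hash_
    std::size_t hash_;
};
```

This only pays off when the same strings are compared repeatedly and most comparisons are between different strings; for a one-off comparison, plain == is the cheaper choice.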

How to find a sequence of letter in a given string?

My string is "AAABBAABABB",
and I want to get the result as
A = 3
B = 2
A = 2
B = 1
A = 1
B = 2
I have tried to use
for (int i = 0; i < n - 1; i++) {
if (msg[i] == msg[i + 1]) {
if(msg[i]==A)
a++;
else
b++;
}
}
I tried this, but it didn't work for me, and I don't know whether there are other ways to do it. Please help me out.
Iterate through the array as follows:
If i = 0, store the 0th character in a variable and set the counter to 1.
If the ith character is equal to the previous character, increase the counter.
If the ith character is not equal to the (i-1)th character, print the character and its counter, then start counting the new character.
Try the following snippet:
char ch = msg[0];
int cnt = 1;
for (int i = 1; i < n; i ++){
if(msg[i] != msg[i-1]){
cout<<ch<<" "<<cnt<<endl;
cnt = 1;
ch = msg[i];
}
else {
cnt++;
}
}
cout<<ch<<" "<<cnt<<endl;
You can use std::vector<std::pair<char, std::size_t>> to store character occurrences.
Eventually, you would have something like:
#include <iostream>
#include <utility>
#include <vector>
#include <string>
int main() {
std::vector<std::pair<char, std::size_t>> occurrences;
std::string str{ "AAABBAABABB" };
for (auto const c : str) {
if (!occurrences.empty() && occurrences.back().first == c) {
occurrences.back().second++;
} else {
occurrences.emplace_back(c, 1);
}
}
for (auto const& it : occurrences) {
std::cout << it.first << " " << it.second << std::endl;
}
return 0;
}
It will output:
A 3
B 2
A 2
B 1
A 1
B 2
This is very similar to run-length encoding. The simplest way (fewest lines of code) I can think of is this:
#include <iostream>
void runLength(const char* msg) {
const char *p = msg;
while (p && *p) {
const char *start = p++; // start of a run
while (*p == *start) p++; // move p to next run (different run)
std::cout << *start << " = " << (p - start) << std::endl;
}
}
Please note that:
This function does not need to know the length of the input string beforehand; it stops at the end of the string, the '\0'.
It also works for an empty string and for NULL. Both of these work: runLength(""); runLength(nullptr);
I cannot comment yet, but if you look carefully, mahbubcseju's code does not work for an empty msg.
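With the standard library, you might do:
#include <algorithm>
#include <functional>
#include <iostream>
#include <iterator>
#include <string>
void print_sequence(const std::string& s)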
{
auto it = s.begin();
while (it != s.end()) {
auto next = std::adjacent_find(it, s.end(), std::not_equal_to<>{});
next = next == s.end() ? s.end() : next + 1;
std::cout << *it << " = " << std::distance(it, next) << std::endl;
it = next;
}
}
Welcome to stackoverflow. Oooh, an algorithm problem? I'll add a recursive example:
#include <iostream>
#include <string>
void countingThing( const std::string &input, size_t index = 1, size_t count = 1 ) {
if( input.size() == 0 ) return;
if( input[index] != input[index - 1] ) {
std::cout << input[index - 1] << " = " << count << std::endl;
count = 0;
}
if( index < input.size() ) return countingThing( input, index + 1, count + 1 );
}
int main() {
countingThing( "AAABBAABABB" );
return 0;
}
To help work out algorithms and figure out what to write in your code, I suggest a few steps:
First, write out your problem in multiple ways: what sort of input it expects and what you would like the output to be.
Secondly, try to solve it on paper and see how the logic would work. A good tip here is to try to understand how YOU would solve it. Your brain is a good problem solver, and if you can listen to what it does, you can turn it into code (it isn't always the most efficient, though).
Thirdly, check it on paper: follow your steps by hand to see whether your solution does what you expect. Then you can translate the solution to code, knowing exactly what you need to write.

Is there an alternate way to conditionally increment array values without using if statements? C++

If I have an array,
int amounts[26] = { 0, 0, 0, ...};
and I want each element of the array to hold the count of a different letter, such that amounts[0] is the number of 'a's found within a given string, is there any way to increment each value without using if statements?
Pseudocode example:
int amounts[26] = { 0, 0, 0, ...};
string word = "blahblah";
loop here to check and increment amounts[0] based on amount of 'a's in string
repeat loop for each letter in word.
At the end of the loop, based on the string word, amounts should be as follows:
amounts[0] = 2 ('a')
amounts[1] = 2 ('b')
amounts[2] = 0 ('c')
// etc
Given your example, and assuming the entire string consists of valid lowercase characters, there's a fairly simple solution (that is to say, you handle the validation yourself):
for (int i = 0; i < word.size(); i++) {
amounts[word[i]-'a']++; // you can also do a pre-increment if you want
}
What you want:
const char char_offset = 'a';
const int num_chars = 26;
std::vector<int> amounts(num_chars, 0);
std::string word = "blahblah";
for (auto c : word) {
int i = c - char_offset;
// this if statement is only for range checking.
// you can remove it if you are sure about the data range.
if (i >= 0 && i < num_chars) {
++amounts[i];
}
}
for (int i = 0; i < (int)amounts.size(); ++i) {
std::cout << (char)(char_offset + i) << ": " << amounts[i] << std::endl;
}
Output
a: 2
b: 2
c: 0
d: 0
e: 0
f: 0
g: 0
h: 2
i: 0
j: 0
k: 0
l: 2
m: 0
n: 0
...
Use std::unordered_map< std::string, int >. Note that std::unordered_map< char, int > would be more efficient if only a single character is required. std::string allows counting complex strings (e.g. map["substring"]++ )
Maps can be accessed using bracket notation ( e.g. map[index] ), and thus can effectively remove the need for if statements.
#include <string>
#include <unordered_map>
#include <iostream>
int main()
{
std::unordered_map< std::string, int > map = { {"a",0} };
map["a"] += 1;
std::cout << map["a"];
}
A general and portable solution would be
const std::string alphabet = "abcdefghijklmnopqrstuvwxyz";
for (int i = 0; alphabet[i]; ++i)
amounts[i] = std::count(word.begin(), word.end(), alphabet[i]);
If you can assume the set of lowercase letters is a contiguous range, this can be simplified to
for (char c = 'a'; c <= 'z'; ++c)
amounts[c - 'a'] = std::count(word.begin(), word.end(), c);
No (overt) if in the above. Of course, there is nothing preventing std::count() being implemented using one.
The following has a good chance of being one of the fastest ways to count:
std::array<unsigned int, (1U << CHAR_BIT)> counts({ });
for(auto c : word)
counts[static_cast<unsigned char>(c)]++; // cast first: plain char may be signed
Getting individual values is quite efficient:
std::cout << "a: " << counts['a'] << std::endl;
Iterating over the letters requires a little trick:
for(char const* c = "abcdefghijklmnopqrstuvwxyz"; *c; ++c)
// case-insensitive:
std::cout << *c << ": " << counts[*c] + counts[toupper(*c)] << std::endl;
Sure, you are wasting a bit of memory - which might cost you the performance gained again: If the array does not fit into the cache any more...

Sorting characters in a string first by frequency and then alphabetically

Given a string, I'm trying to count the occurrences of each letter in the string and then sort the letters by frequency from highest to lowest. Then, for letters that have the same number of occurrences, I have to sort them alphabetically.
Here is what I have been able to do so far:
I created an int array of size 26 corresponding to the 26 letters of the alphabet with individual values representing the number of times it appeared in the sentence
I pushed the contents of this array into a vector of pairs, v, of int and char (int for the frequency, and char for the actual letter)
I sorted this vector of pairs using std::sort(v.begin(), v.end());
In displaying the frequency count, I just used a for loop starting from the last index to display the result from highest to lowest. I am having problems, however, with regard to those letters having similar frequencies, because I need them displayed in alphabetical order. I tried using a nested for loop with the inner loop starting with the lowest index and using a conditional statement to check if its frequency is the same as the outer loop. This seemed to work, but my problem is that I can't seem to figure out how to control these loops so that redundant outputs will be avoided. To understand what I'm saying, please see this example output:
Enter a string: hello world
Pushing the array into a vector pair v:
d = 1
e = 1
h = 1
l = 3
o = 2
r = 1
w = 1
Sorted first according to frequency then alphabetically:
l = 3
o = 2
d = 1
e = 1
h = 1
r = 1
w = 1
d = 1
e = 1
h = 1
r = 1
d = 1
e = 1
h = 1
d = 1
e = 1
d = 1
Press any key to continue . . .
As you can see, it would have been fine if it wasn't for the redundant outputs brought about by the incorrect for loops.
If you can suggest more efficient or better implementations with regard to my concern, then I would highly appreciate it as long as they're not too complicated or too advanced as I am just a C++ beginner.
If you need to see my code, here it is:
#include <iostream>
#include <string>
#include <vector>
#include <algorithm>
using namespace std;
int main() {
cout<<"Enter a string: ";
string input;
getline(cin, input);
int letters[26]= {0};
for (int x = 0; x < input.length(); x++) {
if (isalpha(input[x])) {
int c = tolower(input[x]) - 'a'; // lowercase first, then take the offset from 'a'
letters[c]++;
}
}
cout<<"\nPushing the array into a vector pair v: \n";
vector<pair<int, char> > v;
for (int x = 0; x < 26; x++) {
if (letters[x] > 0) {
char c = x + 'a';
cout << c << " = " << letters[x] << "\n";
v.push_back(std::make_pair(letters[x], c));
}
}
// Sort the vector of pairs.
std::sort(v.begin(), v.end());
// I need help here!
cout<<"\n\nSorted first according to frequency then alphabetically: \n";
for (int x = v.size() - 1 ; x >= 0; x--) {
for (int y = 0; y < x; y++) {
if (v[x].first == v[y].first) {
cout << v[y].second<< " = " << v[y].first<<endl;
}
}
cout << v[x].second<< " = " << v[x].first<<endl;
}
system("pause");
return 0;
}
You could simplify this a lot, in two steps:
First use a map to count the number of occurrences of each character in the string:
std::unordered_map<char, unsigned int> count;
for( char character : string )
count[character]++;
Use the values of that map as comparison criteria:
std::sort( std::begin( string ) , std::end( string ) ,
[&]( char lhs , char rhs )
{
return count[lhs] < count[rhs];
}
);
Here is a working example running at ideone.
If you want highest frequency then lowest letter, an easy way would be to store negative values for frequency, then negate it after you sort. A more efficient way would be to change the function used for sorting, but that is a touch trickier:
struct sort_helper {
bool operator()(std::pair<int,char> lhs, std::pair<int,char> rhs) const{
return std::make_pair(-lhs.first,lhs.second)<std::make_pair(-rhs.first,rhs.second);
}
};
std::sort(vec.begin(),vec.end(),sort_helper());
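The "easy way" mentioned above (negate the frequencies, sort, negate back) could look roughly like this. The function name sortByFrequencyThenLetter is made up for illustration; the pair layout (frequency first, letter second) follows the question's vector.

```cpp
#include <algorithm>
#include <utility>
#include <vector>

// Hypothetical helper (not from the thread): sorts (frequency, letter)
// pairs by descending frequency, then ascending letter, using only the
// default pair ordering.
void sortByFrequencyThenLetter(std::vector<std::pair<int, char>>& v)
{
    for (auto& p : v) p.first = -p.first;  // largest frequency becomes smallest
    std::sort(v.begin(), v.end());         // default: first, then second, ascending
    for (auto& p : v) p.first = -p.first;  // restore the real frequencies
}
```

It costs two extra passes over the vector compared with the custom comparator, but needs no extra type or function object.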
(Posted on behalf of the OP.)
Thanks to the responses of the awesome people here at Stack Overflow, I was finally able to fix my problem. Here is my final code in case anyone is interested, or for future reference for people who might be stuck in the same boat:
#include <iostream>
#include <string>
#include <vector>
#include <algorithm>
using namespace std;
struct Letters
{
Letters() : freq(0){}
Letters(char letter,int freq) {
this->freq = freq;
this->letter = letter;
}
char letter;
int freq;
};
bool Greater(const Letters& a, const Letters& b)
{
if(a.freq == b.freq)
return a.letter < b.letter;
return a.freq > b.freq;
}
int main () {
cout<<"Enter a string: ";
string input;
getline(cin, input);
vector<Letters> count;
int letters[26]= {0};
for (int x = 0; x < input.length(); x++) {
if (isalpha(input[x])) {
int c = tolower(input[x]) - 'a'; // lowercase first, then take the offset from 'a'
letters[c]++;
}
}
for (int x = 0; x < 26; x++) {
if (letters[x] > 0) {
char c = x + 'a';
count.push_back(Letters(c, letters[x]));
}
}
cout<<"\nUnsorted list..\n";
for (int x = 0 ; x < count.size(); x++) {
cout<<count[x].letter<< " = "<< count[x].freq<<"\n";
}
std::sort(count.begin(),count.end(),Greater);
cout<<"\nSorted list according to frequency then alphabetically..\n";
for (int x = 0 ; x < count.size(); x++) {
cout<<count[x].letter<< " = "<< count[x].freq<<"\n";
}
system("pause");
return 0;
}
Example output:
Enter a string: hello world
Unsorted list..
d = 1
e = 1
h = 1
l = 3
o = 2
r = 1
w = 1
Sorted list according to frequency then alphabetically..
l = 3
o = 2
d = 1
e = 1
h = 1
r = 1
w = 1
Press any key to continue . . .
I basically just followed the advice of @OliCharlesworth and implemented a custom comparator with the help of this guide: A Function Pointer as Comparison Function.
Although I'm pretty sure that my code can still be made more efficient, I'm still pretty happy with the results.
// CODE BY VIJAY JANGID in C language
// Using arrays, Time complexity - ( O(N) * distinct characters )
// Efficient answer
#include <stdio.h>
int main() {
int iSizeFrequencyArray= 58;
// 122 - 65 = 57 for A to z
int frequencyArray[iSizeFrequencyArray];
int iIndex = 0;
// Initializing frequency to zero for all
for (iIndex = 0; iIndex < iSizeFrequencyArray; iIndex++) {
frequencyArray[iIndex] = 0;
}
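int iMyStringLength = 1000;
char chMyString[iMyStringLength];
chMyString[0] = '\0'; // stay well-defined if no input is read
// take input for the string (at most 999 characters plus the terminating '\0')
scanf("%999s", chMyString);
// calculating length
int iSizeMyString = 0;
while(chMyString[iSizeMyString]) iSizeMyString++;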
// saving each character frequency in the freq. array
for (iIndex = 0; iIndex < iSizeMyString; iIndex++) {
int currentChar = chMyString[iIndex];
frequencyArray[currentChar - 65]++;
}
/* // To print the frequency of each alphabet
for (iIndex = 0; iIndex < iSizeFrequencyArray; iIndex++) {
char currentChar = iIndex + 65;
printf("\n%c - %d", currentChar, frequencyArray[iIndex ]);
}
*/
int lowestDone = 0, lowest = 0, highestSeen = 0;
for( iIndex = 0; iIndex < iSizeFrequencyArray; iIndex++ ) {
if(frequencyArray[iIndex] > highestSeen) {
highestSeen = frequencyArray[iIndex];
}
}
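// assigning sorted values to the current array
lowest = highestSeen + 1; // start above every frequency so the first pass finds the minimum
// loop until the highest frequency has been printed; testing 'lowest' here would never
// terminate, because 'lowest' is reset to highestSeen+1 at the end of every pass
while (lowestDone != highestSeen) {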
// calculating lowest frequency
for( iIndex = 0; iIndex < iSizeFrequencyArray; iIndex++ ) {
if( frequencyArray[iIndex] > lowestDone &&
frequencyArray[iIndex] < lowest) {
lowest = frequencyArray[iIndex]; // taking lowest value
}
}
// printing that frequency
for( iIndex =0; iIndex < iSizeFrequencyArray; iIndex++ ) {
// print that work for that times
if(frequencyArray[iIndex] == lowest){
char currentChar = iIndex + 65;
int iIndex3;
for(iIndex3 = 0; iIndex3 < lowest; iIndex3++){
printf("%c", currentChar);
}
}
}
// now that is done, move to next lowest
lowestDone = lowest;
// reset to highest value, to get the next lowest one
lowest = highestSeen+1;
}
return 0;
}
Explanation:
First create an array of size 58 (122 - 65 + 1) to store the frequencies of the ASCII characters from 'A' to 'z'.
Store the frequency of each character by incrementing its cell at each occurrence.
Now find the highest frequency.
Run a loop that repeatedly looks for the lowest frequency that has not been printed yet, starting from 0.
In each iteration print every character whose frequency is equal to that lowest value. They will automatically be in alphabetical order.
Then find the next higher frequency, print its characters, and so on.
When the lowest frequency reaches the highest, the loop ends.
Using an unordered_map for counting characters as suggested by @Manu343726 is a good idea. However, in order to produce your sorted output, another step is required.
My solution is also in C++11 and uses a lambda expression. This way you neither need to define a custom struct nor a comparison function. The code is almost complete, I just skipped reading the input:
#include <unordered_map>
#include <iostream>
#include <set>
#include <string>
#include <utility>
using namespace std;
int main() {
string input = "hello world";
unordered_map<char, unsigned int> count;
for (char character : input)
if (character >= 'a' && character <= 'z')
count[character]++;
cout << "Unsorted list:" << endl;
for (auto const &kv : count)
cout << kv.first << " = " << kv.second << endl;
using myPair = pair<char, unsigned int>;
auto comp = [](const myPair& a, const myPair& b) {
return (a.second > b.second || a.second == b.second && a.first < b.first);
};
set<myPair, decltype(comp)> sorted(comp);
for(auto const &kv : count)
sorted.insert(kv);
cout << "Sorted list according to frequency then alphabetically:" << endl;
for (auto const &kv : sorted)
cout << kv.first << " = " << kv.second << endl;
return 0;
}
Output:
Unsorted list:
r = 1
h = 1
e = 1
d = 1
o = 2
w = 1
l = 3
Sorted list according to frequency then alphabetically:
l = 3
o = 2
d = 1
e = 1
h = 1
r = 1
w = 1
Note 1: Instead of inserting each element from the unordered_map into the set, it might be more efficient to use the function std::transform or std::copy, but my code is at least short.
Note 2: Instead of using a custom sorted set which maintains the order you want, it might be more efficient to use a vector of pairs and sort it once in the end, but your solution is already similar to this.
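As a rough illustration of Note 2, here is a sketch that copies the map into a vector of pairs and sorts it once at the end. The function name rankLetters is hypothetical; the comparison is the same rule as the lambda above (higher count first, then alphabetical).

```cpp
#include <algorithm>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical helper (not from the thread): counts lowercase letters and
// returns them ordered by descending count, then alphabetically.
std::vector<std::pair<char, unsigned int>> rankLetters(const std::string& input)
{
    std::unordered_map<char, unsigned int> count;
    for (char c : input)
        if (c >= 'a' && c <= 'z')
            ++count[c];

    // Copy the map entries into a vector and sort it once at the end.
    std::vector<std::pair<char, unsigned int>> sorted(count.begin(), count.end());
    std::sort(sorted.begin(), sorted.end(),
              [](const auto& a, const auto& b) {
                  return a.second > b.second ||
                         (a.second == b.second && a.first < b.first);
              });
    return sorted;
}
```

For "hello world" the first two entries are ('l', 3) and ('o', 2), followed by the letters that occur once in alphabetical order.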
#include<stdio.h>
// CODE BY AKSHAY BHADERIYA
char iFrequencySort (char iString[]);
void vSort (int arr[], int arr1[], int len);
int
main ()
{
int iLen, iCount;
char iString[100], str[100];
printf ("Enter a string : ");
scanf ("%s", iString);
iFrequencySort (iString);
return 0;
}
char
iFrequencySort (char iString[])
{
int iFreq[100] = { 0 };
int iI, iJ, iK, iAsc, iLen1 = 0, iLen = 0;
while (iString[++iLen]);
int iOccurrence[94];
int iCharacter[94];
for (iI = 0; iI < iLen; iI++)
{ //frequency of the characters
iAsc = (int) iString[iI];
iFreq[iAsc - 32]++;
}
for (iI = 0, iJ = 0; iI < 94; iI++)
{ //the characters and occurrence arrays
if (iFreq[iI] != 0)
{
iCharacter[iJ] = iI;
iOccurrence[iJ] = iFreq[iI];
iJ++;
}
}
iLen1 = iJ;
vSort (iOccurrence, iCharacter, iLen1); //sorting both arrays
/*letter array consists only the index of iFreq array.
Converting it to the ASCII value of corresponding character */
for (iI = 0; iI < iLen1; iI++)
{
iCharacter[iI] += 32;
}
iK = 0;
for (iI = 0; iI < iLen1; iI++)
{ //characters into original string
for (iJ = 0; iJ < iOccurrence[iI]; iJ++)
{
iString[iK++] = (char) iCharacter[iI];
}
}
printf ("%s", iString);
}
void
vSort (int iOccurrence[], int iCharacter[], int len)
{
int iI, iJ, iTemp;
for (iI = 0; iI < len - 1; iI++)
{
for (iJ = iI + 1; iJ < len; iJ++)
{
if (iOccurrence[iI] > iOccurrence[iJ])
{
iTemp = iOccurrence[iI];
iOccurrence[iI] = iOccurrence[iJ];
iOccurrence[iJ] = iTemp;
iTemp = iCharacter[iI];
iCharacter[iI] = iCharacter[iJ];
iCharacter[iJ] = iTemp;
}
}
}
}
Answers have been given and one is accepted. I would like to add an answer showing the standard approach for this task.
There is often a requirement to first count things and then get their rank, the topmost value, or some other information.
One of the most common solutions is to use a so-called associative container, here specifically a std::map or, even better, a std::unordered_map. We need a key (here a letter) and an associated value (the count for that letter). The key is unique: the container cannot hold the same letter more than once, which would not make sense anyway.
Associative containers are very efficient at accessing their elements by key.
There are two of them: std::map and std::unordered_map. The first uses a tree to store the keys in sorted order; the second uses fast hashing to access the keys. Since we are not interested in sorted keys but in sorted counts, we can choose std::unordered_map and benefit from its fast hashed lookup.
The maps have an additional huge advantage: their index operator [] looks up a key very quickly. If the key is found, it returns a reference to the associated value. If not, it creates the key and initializes its value with the default (0 in our case). Counting any key is then as simple as map[key]++.
But then we often hear: it must be sorted by the count. That cannot work inside the map, because counts may have duplicate values and the map is organized by its unique keys.
The solution is to use a second associative container, a std::multiset, which can hold equal keys, together with a custom sort operator that sorts according to the value. In it we store not a key and a value as two elements, but a std::pair holding both, and we sort by the second part of the pair.
We cannot use a std::multiset in the first place, because we need the unique key (in this case the letter).
This approach gives us great flexibility and ease of use. We can count basically anything with it.
It could, for example, look like the compact code below:
#include <iostream>
#include <string>
#include <utility>
#include <set>
#include <unordered_map>
#include <type_traits>
#include <cctype>
// ------------------------------------------------------------
// Create aliases. Save typing work and make code more readable
using Pair = std::pair<char, unsigned int>;
// Standard approach for counter
using Counter = std::unordered_map<Pair::first_type, Pair::second_type>;
// Sorted values will be stored in a multiset
struct Comp { bool operator ()(const Pair& p1, const Pair& p2) const { return (p1.second == p2.second) ? p1.first<p2.first : p1.second>p2.second; } };
using Rank = std::multiset<Pair, Comp>;
// ------------------------------------------------------------
// --------------------------------------------------------------------------------------
// Compact function to calculate the frequency of characters and then get their rank
Rank getRank(std::string& text) {
// Definition of our counter
Counter counter{};
// Iterate over all characters in text and count their frequency
// (cast to unsigned char first: passing a negative char to std::isalpha is undefined)
for (const char c : text) if (std::isalpha(static_cast<unsigned char>(c))) counter[char(std::tolower(static_cast<unsigned char>(c)))]++;
// Return ranks,sorted by frequency and then sorted by character
return { counter.begin(), counter.end() };
}
// --------------------------------------------------------------------------------------
// Test, driver code
int main() {
// Get a string from the user
if (std::string text{}; std::getline(std::cin, text))
// Calculate rank and show result
for (const auto& [letter, count] : getRank(text))
std::cout << letter << " = " << count << '\n';
}
Note how few statements are needed. Very elegant.
But we often see arrays used as an associative container. They also have an index (a key) and a value. A disadvantage may be a tiny space overhead for unused keys. Additionally, they only work for a key range of known size, for example 26 letters. The alphabets of other languages may have more or fewer letters, and then this kind of solution would not be that flexible. Anyway, it is often used and OK.
So your solution may be a little more complex, but it will of course still work.
Let me give you an additional example for getting the topmost value of any container. Here you will see how flexible such a solution can be.
I am sorry, but it is a little bit advanced. . .
#include <iostream>
#include <utility>
#include <unordered_map>
#include <queue>
#include <vector>
#include <iterator>
#include <type_traits>
#include <string>
// Helper for type trait We want to identify an iterable container ----------------------------------------------------
template <typename Container>
auto isIterableHelper(int) -> decltype (
std::begin(std::declval<Container&>()) != std::end(std::declval<Container&>()), // begin/end and operator !=
++std::declval<decltype(std::begin(std::declval<Container&>()))&>(), // operator ++
void(*std::begin(std::declval<Container&>())), // operator*
void(), // Handle potential operator ,
std::true_type{});
template <typename T>
std::false_type isIterableHelper(...);
// The type trait -----------------------------------------------------------------------------------------------------
template <typename Container>
using is_iterable = decltype(isIterableHelper<Container>(0));
// Some Alias names for later easier reading --------------------------------------------------------------------------
template <typename Container>
using ValueType = std::decay_t<decltype(*std::begin(std::declval<Container&>()))>;
template <typename Container>
using Pair = std::pair<ValueType<Container>, size_t>;
template <typename Container>
using Counter = std::unordered_map<ValueType<Container>, size_t>;
template <typename Container>
using UnderlyingContainer = std::vector<Pair<Container>>;
// Predicate Functor
template <class Container> struct LessForSecondOfPair {
bool operator () (const Pair<Container>& p1, const Pair<Container>& p2) { return p1.second < p2.second; }
};
template <typename Container>
using MaxHeap = std::priority_queue<Pair<Container>, UnderlyingContainer<Container>, LessForSecondOfPair<Container>>;
// Function to get most frequent used number in any Container ---------------------------------------------------------
template <class Container>
auto topFrequent(const Container& data) {
if constexpr (is_iterable<Container>::value) {
// Count all occurrences of data
Counter<Container> counter{};
for (const auto& d : data) counter[d]++;
// Build a Max-Heap
MaxHeap<Container> maxHeap(counter.begin(), counter.end());
// Return most frequent number
return maxHeap.top().first;
}
else
return data;
}
// Test
int main() {
std::vector testVector{ 1,2,2,3,3,3,4,4,4,4,5,5,5,5,6,6,6,6,6,7 };
std::cout << "Most frequent is: " << topFrequent(testVector) << "\n";
double cStyleArray[] = { 1.1, 2.2, 2.2, 3.3, 3.3, 3.3 };
std::cout << "Most frequent is: " << topFrequent(cStyleArray) << "\n";
std::string s{ "abbcccddddeeeeeffffffggggggg" };
std::cout << "Most frequent is: " << topFrequent(s) << "\n";
double value = 12.34;
std::cout << "Most frequent is: " << topFrequent(value) << "\n";
return 0;
}

Count the occurrences and print top K using C/STL

I have a large text file with tokens in each line. I want to count the number of occurrences of each token and sort this. How do I do that efficiently in C++ preferably using built-in functions and shortest coding (and, of course most efficient) ? I know how to do it in python, but not sure how to do it using unordered_map in STL.
I'd go with the unordered_map approach. For selecting the most frequent k tokens, assuming that k is smaller than the total number of tokens, you should take a look at std::partial_sort.
By the way, ++frequency_map[token] (where frequency_map is, say, std::unordered_map<std::string, long>) is perfectly acceptable in C++, although I think the equivalent in Python will blow up on newly seen tokens.
OK, here you go:
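#include <algorithm>
#include <iostream>
#include <string>
#include <unordered_map>
#include <vector>
void most_frequent_k_tokens(std::istream& in, std::ostream& out, long k = 1) {
using mapT = std::unordered_map<std::string, long>;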
using pairT = typename mapT::value_type;
mapT freq;
for (std::string token; in >> token; ) ++freq[token];
std::vector<pairT*> tmp;
for (auto& p : freq) tmp.push_back(&p);
auto lim = tmp.begin() + std::min<long>(k, tmp.size());
std::partial_sort(tmp.begin(), lim, tmp.end(),
[](pairT* a, pairT* b)->bool {
return a->second > b->second
|| (a->second == b->second && a->first < b->first);
});
for (auto it = tmp.begin(); it != lim; ++it)
out << (*it)->second << ' ' << (*it)->first << std::endl;
}
Assuming you know how to read lines from a file in C++, this should be a push in the right direction:
std::string token = "token read from file";
std::unordered_map<std::string,int> map_of_tokens;
map_of_tokens[token] = map_of_tokens[token] + 1;
You can then print them out like this (as a test):
for ( auto i = map_of_tokens.begin(); i != map_of_tokens.end(); ++i ) {
std::cout << i->first << " : " << i->second << "\n";
}