Word Frequency Statistics

Word Frequency Statistics - c++

In an pre-interview, I am faced with a question like this:
Given a string consists of words separated by a single white space, print out the words in descending order sorted by the number of times they appear in the string.
For example an input string of “a b b” would generate the following output:
b : 2
a : 1
Firstly, I'd say it is not so clear that whether the input string is made up of single-letter words or multiple-letter words. If the former is the case, it could be simple.
Here is my thought:
int c[26] = {0};
char *pIn = strIn;
while (*pIn != 0 && *pIn != ' ')
{
++c[*pIn];
++pIn;
}
/* how to sort the array c[26] and remember the original index? */
I can get the statistics of the frequecy of every single-letter word in the input string, and I can get it sorted (using QuickSort or whatever). But after the count array is sorted, how to get the single-letter word associated with the count so that I can print them out in pair later?
If the input string is made of of multiple-letter word, I plan to use a map<const char *, int> to track the frequency. But again, how to sort the map's key-value pair?
The question is in C or C++, and any suggestion is welcome.
Thanks!

I would use a std::map<std::string, int> to store the words and their counts. Then I would use something this to get the words:
while(std::cin >> word) {
// increment map's count for that word
}
finally, you just need to figure out how to print them in order of frequency, I'll leave that as an exercise for you.

You're definitely wrong in assuming that you need only 26 options, 'cause your employer will want to allow multiple-character words as well (and maybe even numbers?).
This means you're going to need an array with a variable length. I strongly recommend using a vector or, even better, a map.
To find the character sequences in the string, find your current position (start at 0) and the position of the next space. Then that's the word. Set the current position to the space and do it again. Keep repeating this until you're at the end.
By using the map you'll already have the word/count available.
If the job you're applying for requires university skills, I strongly recommend optimizing the map by adding some kind of hashing function. However, judging by the difficulty of the question I assume that that is not the case.

Taking the C-language case:
I like brute-force, straightforward algos so I would do it in this way:
Tokenize the input string to give an unsorted array of words. I'll have to actually, physically move each word (because each is of variable length); and I think I'll need an array of char*, which I'll use as the arg to qsort( ).
qsort( ) (descending) that array of words. (In the COMPAR function of qsort(), pretend that bigger words are smaller words so that the array acquires descending sort order.)
3.a. Go through the now-sorted array, looking for subarrays of identical words. The end of a subarray, and the beginning of the next, is signalled by the first non-identical word I see.
3.b. When I get to the end of a subarray (or to the end of the sorted array), I know (1) the word and (2) the number of identical words in the subarray.
EDIT new step 4: Save, in another array (call it array2), a char* to a word in the subarry and the count of identical words in the subarray.
When no more words in sorted array, I'm done. it's time to print.
qsort( ) array2 by word frequency.
go through array2, printing each word and its frequency.
I'M DONE! Let's go to lunch.

All the answers prior to mine did not give really an answer.
Let us think on a potential solution.
There is a more or less standard approach for counting something in a container.
We can use an associative container like a std::map or a std::unordered_map. And here we associate a "key", in this case the word, to a count, with a value, in this case the count of the specific word.
And luckily the maps have a very nice index operator[]. This will look for the given key and, if found, return a reference to the value. If not found, then it will create a new entry with the key and return a reference to the new entry. So, in both cases, we will get a reference to the value used for counting. And then we can simply write:
std::unordered_map<char,int> counter{};
counter[word]++;
And that looks really intuitive.
After this operation, you have already the frequency table. Either sorted by the key (the word), by using a std::map or unsorted, but faster accessible with a std::unordered_map.
Now you want to sort according to the frequency/count. Unfortunately this is not possible with maps.
Therefore we need to use a second container, like a ```std::vector`````which we then can sort unsing std::sort for any given predicate, or, we can copy the values into a container, like a std::multiset that implicitely orders its elements.
For getting out the words of a std::string we simply use a std::istringstream and the standard extraction operator >>. No big deal at all.
And because writing all this long names for the std containers, we create alias names, with the using keyword.
After all this, we now write ultra compact code and fulfill the task with just a few lines of code:
#include <iostream>
#include <string>
#include <sstream>
#include <utility>
#include <set>
#include <unordered_map>
#include <type_traits>
#include <iomanip>
// ------------------------------------------------------------
// Create aliases. Save typing work and make code more readable
using Pair = std::pair<std::string, unsigned int>;
// Standard approach for counter
using Counter = std::unordered_map<Pair::first_type, Pair::second_type>;
// Sorted values will be stored in a multiset
struct Comp { bool operator ()(const Pair& p1, const Pair& p2) const { return (p1.second == p2.second) ? p1.first<p2.first : p1.second>p2.second; } };
using Rank = std::multiset<Pair, Comp>;
// ------------------------------------------------------------
std::istringstream text{ " 4444 55555 1 22 4444 333 55555 333 333 4444 4444 55555 55555 55555 22 "};
int main() {
Counter counter;
// Count
for (std::string word{}; text >> word; counter[word]++);
// Sort
Rank rank(counter.begin(), counter.end());
// Output
for (const auto& [word, count] : rank) std::cout << std::setw(15) << word << " : " << count << '\n';
}

Related

Find All Palindrome Substrings in a String by Rearranging Characters

For fun and practice, I have tried to solve the following problem (using C++): Given a string, return all the palindromes that can be obtained by rearranging its characters.
I've come up with an algorithm that doesn't work completely. Sometimes, it finds all the palindromes, but other times it finds some but not all.
It works by swapping each adjacent pair of characters N times, where N is the length of the input string. Here is the code:
std::vector<std::string> palindromeGen(std::string charactersSet) {
std::vector<std::string> pals;
for (const auto &c : charactersSet) {
for (auto i = 0, j = 1; i < charactersSet.length() - 1; ++i, ++j) {
std::swap(charactersSet[i], charactersSet[j]);
if (isPalandrome(charactersSet)) {
if (std::find(pals.begin(), pals.end(), charactersSet) == pals.end()) {
// if palindrome is unique
pals.push_back(charactersSet);
}
}
}
}
return pals;
}
What's the fault in this algorithm? I'm mostly concerned about the functionality of the algorithm, rather than the efficiency. Although I'll appreciate tips about efficiency as well. Thanks.

This probably fits a bit better in Code Review but here goes:
Logic Error
You change charactersSet while iterating over it, meaning that your iterator breaks. You need to make a copy of characterSet, and iterate over that.
Things to Change
Since pals holds only unique values, it should be a std::set instead of a std::vector. This will simplify some things. Also, your isPalandrome method spells palindrome wrong!
Alternative Approach
Since palindromes can only take a certain form, consider sorting the input string first, so that you can have a list of characters with an even number of occurrences, and a list of characters with an odd number. You can only have one character with an odd number of occurrences (and this only works for an odd length input). This should let you discard a bunch of possibilities. Then you can work through the different possible combinations of one half of the palindrome (since you can build one half from the other).

Here is another implementation that leverages std::next_permutation:
#include <string>
#include <algorithm>
#include <set>
std::set<std::string> palindromeGen(std::string charactersSet)
{
std::set<std::string> pals;
std::sort(charactersSet.begin(), charactersSet.end());
do
{
// check if the string is the same backwards as forwards
if ( isPalindrome(charactersSet))
pals.insert(charactersSet);
} while (std::next_permutation(charactersSet.begin(), charactersSet.end()));
return pals;
}
We first sort the original string. This is required for std::next_permutation to work correctly. A call to the isPalindrome function with a permutation of the string is done in a loop. Then if the string is a palindrome, it's stored in the set. The subsequent call to std::next_permutation just rearranges the string.
Here is a Live Example. Granted, it uses a reversed copy of the string as the "isPalindrome" function (probably not efficient), but you should get the idea.

Hashing Function/Code

so I'm just learning (or trying to) a bit about hashing. I'm attempting to make a hashing function, however I'm confused where I save the data to. I'm trying to calculate the number of collisions and print that out. I have made 3 different files, one with 10,000 words, 20,000 words and 30,000 words. Each word is just 10 random numbers/letters.
long hash(char* s]){
long h;
for(int i = 0; i < 10; i++){
h = h + (int)s[i];
}
//A lot of examples then mod h by the table size
//I'm a bit confused what this table is... Is it an array of
//10,000 (or however many words)?
//h % TABLE_SIZE
return h
}
int main (int argc, char* argv[]){
fstream input(argv[1]);
char* nextWord;
while(!input.eof()){
input >> nextWord;
hash(nextWord);
}
}
So that's what I currently have, but I can't figure out what the table is exactly, as I said in the comments above... Is it a predefined array in my main with the number of words in it? For example, if I have a file of 10 words, do I make an array a of size 10 in my main? Then if/when I return h, lets say the order goes: 3, 7, 2, 3
The 4th word is a collision, correct? When that happens, I add 1 to collision and then add 1 to then check if slot 4 is also full?
Thanks for the help!

The point of hashing is to have a constant time access to every element you store. I'll try to explain on simple example bellow.
First, you need to know how much data you'd have to store. If for example you want to store numbers and you know, that you won't store numbers greater than 10. Simpliest solution is to create an array with 10 elements. That array is your "table", where you store your numbers. So how do I achieve that amazing constant time access? Hashing function! It's point is to return you an index to your array. Let's create a simple one: If you'd like to store 7, you just save it to array on position 7. Every time, you'd like to look, for element 7, you just pass it to your hasning funcion and bzaah! You got an position to your element in constant time! But what if you'd like to store more elements with value 7? Your simple hashing function is returning 7 for every element and now its position i already occupied! How to solve that? Well, there is not many solution, the simpliest are:
1: Chaining - you simply save element on first free position. This has significant draw back. Imagine, you want to delete some element ... (this is the method, you describing in question)
2: Linked list - if you create an array of pointers on some linked lists, you can easilly add your new element at the end of linked list, that is on position 7!
Both of this simple solutions has its drawbacks and cons. I guess you can see them. As #rwols has said, you don't have to use array. You can also use a tree or be a real C++ master and use unordered_map and unordered_set with custom hash function, which is quite cool. Also there is structure named trie, which is usefull, when you'd like to create some sort of dictionary (where is really hard to know, how many words you will need to store)
To sum it up. You has to know, how many things, you wan't to store and then, create ideal hashing function, that covers up array of apropriate size and in perfect world, it has to have uniform index distribution, with no colisions. (Achiving this is pretty hard and in the real world, I guess, this is impossible, so the less colisions, the better.)
Your hash function, is pretty bad. It will have lot of colisions (like strings "ab" and "ba") and also, you need to mod m it with m being the size of you array (aka. table), so you can save it to some array and you can profit of it. The modus is a way of simplyfiing the has function, because has function has to "fit" in table, that you specified in beginning, because you can't save element on position 11, 12, ... if you have array of 10.
How should good hashing function look like? Well, there is better sources than me. Some example (Alert! It's in Java)
To your example: You simply can't save 10k or even more words into table of size 10. That'll create a lot of collisions and you loose the main benefit of hashing function - constant access to elements you saved.
And how would your code look? Something like this:
int main (int argc, char* argv[]){
fstream input(argv[1]);
char* nextWord;
TypeOfElement table[size_of_table];
while(!input.eof()){
input >> nextWord;
table[hash(nextWord)] = // desired element which you want to save
}
}
But I guess, your goal isn't to save something somewhere, but to count number of colisions. Also note that code above doesn't solve colisions. If you'd like to count colisions, create array table of ints and initialize it to zero. Than, just increment the value, which is stored on index, which is returned by your hash funcion, like this:
table[hash(nextWord)]++;
I hope I helped. Please specify, what else you want to know.

If a hash table is required then as others have stated std::unordered_map will work in most cases. Now if you need something more powerful because of a large entry base, then I would suggest looking into tries. Tries combine the concepts of (Vector-Array) insertion, (Hashing) & Linked Lists. The run time is close to O(M) where M is the amount of characters in a string if you are hashing a string. It helps to remove the chance of collisions. And the more you add to a trie structure the less work has to be done as certain nodes are opened and created. The one draw back is that tries require more memory. Here is a diagram
Now your trie may vary on the size of the array due to what you are storing, but the overall concept and construction of one is the same. If you was doing a word - definition look up then you may want an array of 26 or a few more for each possible hashing character.

To count a number of words which have same hash, we should know hashes of all previous words. When you count a hash of some word, you should write it down, for example in some array. So you need an array with size equal to the number of words.
Then you should compare the new hash with all previous ones. Method of counting depends on what you need - number of pair of collisions or number off same elements.

Hash function should not be responsible for storing data. Normally you would have a container that uses hash function internally.
From what you wrote I understood that you want to create hashtable. One way you could do that (probably not the most efficient one, but should give you an idea):
#include <fstream>
#include <vector>
#include <string>
#include <map>
#include <memory>
using namespace std;
namespace example {
long hash(char* s){
long h;
for(int i = 0; i < 10; i++){
h = h + (int)s[i];
}
return h;
}
}
int main (int argc, char* argv[]){
fstream input(argv[1]);
char* nextWord;
std::map<long, std::unique_ptr<std::vector<std::string>>> hashtable;
while(!input.eof()){
input >> nextWord;
long newHash = example::hash(nextWord);
auto it = hashtable.find(newHash);
// Collision detected?
if (it == hashtable.end()) {
hashtable.insert(std::make_pair(newHash, std::unique_ptr<std::vector<std::string>>(new std::vector<std::string> { nextWord } )));
}
else {
it->second->push_back(nextWord);
}
}
}
I used some C++ 11 features to write an example faster.

I am not sure that I understand what you do not understand. The explanations below might help you.
A hash table is a kind of associative array. It is used to map keys to values in a similar manner an array is used to map indexes (keys) to values. For instance, an array of three numbers, { 11, -22, 33 }, associates index 0 to 11, index 1 to -22 and index 2 to 33.
Now, let us assume that we would like to associate 1 to 11, 2 to -22 and 3 to 33. The solution is simple: we keep the same array, only we transform the key by subtracting one from it, thus obtaining the original index
This is fine until we realize that this is just a particular case. What if the keys are not so “predictable”? A solution would be to put the associations in a list of {key, value} pairs and when someone is asking for a key, just search the list: { 123, 11}, {3, -22}, {0, 33} If the value associated to 3 is asked, we simply search the keys in list for a match and find -22. That’s fine, but if the list is large we’re in trouble. We could speed the search if we sort the array by keys and use binary search, but still the search may take some time if the list is large.
The search speed may be further enhanced if we break the list in sub-lists (or buckets) made of related pairs. This is what a hash function does: puts together pairs by related keys (an ideal hash function would associate one key to one value).
A hash table is a two columns table (an array):
The first column is the hash key (the index computed by a hash function). The size of the hash table is given by the maximum value of the hash function. If, for instance, the last step in computing the hash function is modulo 10, the size of the table will be 10; the pairs list will be broken into 10 sub-lists.
The second column is a list (bucket) of key/values pairs (the sub-list I was taking about).

Efficient C++ data structure to search for intervals in sets of integer values

I'm looking for a data structure (and an C++ implementation) that allows to search (efficiently) for all elements having an integer value within a given interval. Example: say the set contains:
3,4,5,7,11,13,17,20,21
Now I want to now all elements from this set within [5,19]. So the answer should be 5,7,11,13,17
For my usage trivial search is not an option, as the number of elements is large (several million elements) and I have to do the search quite often. Any suggestions?

For this, you typically use std::set, that is an ordered set which has a search tree built on top (at least that's one possible implementation).
To get the elements in the queried interval, find the two iterators pointing at the first and last element you're looking for. That's a use case of the algorithm std::lower_bound and upper_bound to consider both interval limits as inclusive: [x,y]. (If you want to have the end exclusive, use lower_bound also for the end.)
These algorithms have logarithmic complexity on the size of the set: O(log n)
Note that you may also use a std::vector if you sort it before applying these operations. This might be advantageous in some situations, but if you always want to sort the elements, use std::set, as it does that automatically for you.
Live demo
#include <set>
#include <algorithm>
#include <iostream>
int main()
{
// Your set (Note that these numbers don't have to be given in order):
std::set<int> s = { 3,4,5,7,11,13,17,20,21 };
// Your query:
int x = 5;
int y = 19;
// The iterators:
auto lower = std::lower_bound(s.begin(), s.end(), x);
auto upper = std::upper_bound(s.begin(), s.end(), y);
// Iterating over them:
for (auto it = lower; it != upper; ++it) {
// Do something with *it, or just print *it:
std::cout << *it << '\n';
}
}
Output:
5
7
11
13
17

For searching within the intervals like you mentioned, Segment trees are the best. In competitive programming, several questions are based on this data structure.
One such implementation could be found here:
http://www.sanfoundry.com/cpp-program-implement-segement-tree/
You might need to modify the code to suit your question, but the basic implementation remains the same.

Does hash_map automatically sort [C++]?

In the below code, hash_map is automatically sorting or maybe inserting elements in a sorted order. Any ideas why it does this?? Suggestions Please??
This is NOT a homework problem, trying to solve an interview question posted on glassdoor dot com.
#include <iostream>
#include <vector>
#include <ext/hash_map>
#include <map>
#include <string.h>
#include <sstream>
using namespace __gnu_cxx;
using namespace std;
struct eqstr
{
bool operator()(int i, int j) const
{
return i==j;
}
};
typedef hash_map<int, int, hash<int>, eqstr> myHash;
int main()
{
myHash array;
int inputArr[20] = {1,43,4,5,6,17,12,163,15,16,7,18,19,20,122,124,125,126,128,100};
for(int i=0;i<20;i++){
array[inputArr[i]] = inputArr[i]; //save value
}
myHash::iterator it = array.begin();
int data;
for (; it != array.end(); ++it) {
data = it->first;
cout << ":: " << data;
}
}
//!Output ::: 1:: 4:: 5:: 6:: 7:: 12:: 15:: 16:: 17:: 18:: 19:: 20:: 43:: 100:: 122:: 124:: 125:: 126:: 128:: 163

Hash map will not automatically sort your data. In fact the order is unspecified, depending on your hash function and input order. It is just that in your case the numbers turns out are sorted.
You may want to read about hash table for how this container stores the data.
A clear counter example can be created by replacing that 100 with 999999999. The result is
:: 1:: 4:: 5:: 6:: 7:: 12:: 15:: 16:: 17:: 18:: 19:: 20:: 999999999:: 43:: 122:: 124:: 125:: 126:: 128:: 163
(The actual reason is the hash_map's bucket_count is 193 and the hash function of int is an identity function, so any numbers below 193 will appear sorted.)

A hash map may appear to be sorted based on a few factors:
The hash function returns values that are in the same order as the input, for the range of input that you are providing.
There are no hash collisions.
There are no guarantees that the results will be sorted, it's only a coincidence if they come out that way.

Think about how the hash function operates. A hash is always a function f:input->output that maps the input set I into a (usually smaller) output set O so that the input set is approximately uniformly distributed across the output set.
There is no requirement that order should be preserved; in fact, it's unusual that it would be preserved, because (since the output set is smaller) there will be values *i,j*that have the same hash. This is called a collision.
On the other hand, there's no reason it shouldn't. IN fact, it can be proven that there will always exist at least one sequence that will preserve the order.
But there's another possibility: if ALL the values collide, then they get stored in some other kind of data structure, like a list. It may be that these all collidge, and the other structure imposes order.
Three possibilities: hash_map happens to sort that particular sequence, or hash_map is actually implimented as an array, or the values collidge and the implementation stores collisions in a way that gives a sorted order.

C++ string sort like a human being?

I would like to sort alphanumeric strings the way a human being would sort them. I.e., "A2" comes before "A10", and "a" certainly comes before "Z"! Is there any way to do with without writing a mini-parser? Ideally it would also put "A1B1" before "A1B10". I see the question "Natural (human alpha-numeric) sort in Microsoft SQL 2005" with a possible answer, but it uses various library functions, as does "Sorting Strings for Humans with IComparer".
Below is a test case that currently fails:
#include <set>
#include <iterator>
#include <iostream>
#include <vector>
#include <cassert>
template <typename T>
struct LexicographicSort {
inline bool operator() (const T& lhs, const T& rhs) const{
std::ostringstream s1,s2;
s1 << toLower(lhs); s2 << toLower(rhs);
bool less = s1.str() < s2.str();
//Answer: bool less = doj::alphanum_less<std::string>()(s1.str(), s2.str());
std::cout<<s1.str()<<" "<<s2.str()<<" "<<less<<"\n";
return less;
}
inline std::string toLower(const std::string& str) const {
std::string newString("");
for (std::string::const_iterator charIt = str.begin();
charIt!=str.end();++charIt) {
newString.push_back(std::tolower(*charIt));
}
return newString;
}
};
int main(void) {
const std::string reference[5] = {"ab","B","c1","c2","c10"};
std::vector<std::string> referenceStrings(&(reference[0]), &(reference[5]));
//Insert in reverse order so we know they get sorted
std::set<std::string,LexicographicSort<std::string> > strings(referenceStrings.rbegin(), referenceStrings.rend());
std::cout<<"Items:\n";
std::copy(strings.begin(), strings.end(), std::ostream_iterator<std::string>(std::cout, "\n"));
std::vector<std::string> sortedStrings(strings.begin(), strings.end());
assert(sortedStrings == referenceStrings);
}

Is there any way to do with without writing a mini-parser?
Let someone else do that?
I'm using this implementation: http://www.davekoelle.com/alphanum.html, I've modified it to support wchar_t, too.

It really depends what you mean by "parser." If you want to avoid writing a parser, I would think you should avail yourself of library functions.
Treat the string as a sequence of subsequences which are uniformly alphabetic, numeric, or "other."
Get the next alphanumeric sequence of each string using isalnum and backtrack-checking for + or - if it is a number. Use strtold in-place to find the end of a numeric subsequence.
If one is numeric and one is alphabetic, the string with the numeric subsequence comes first.
If one string has run out of characters, it comes first.
Use strcoll to compare alphabetic subsequences within the current locale.
Use strtold to compare numeric subsequences within the current locale.
Repeat until finished with one or both strings.
Break ties with strcmp.
This algorithm has something of a weakness in comparing numeric strings which exceed the precision of long double.

Is there any way to do it without writing a mini parser? I would think the answer is no. But writing a parser isn't that tough. I had to do this a while ago to sort our company's stock numbers. Basically just scan the number and turn it into an array. Check the "type" of every character: alpha, number, maybe you have others you need to deal with special. Like I had to treat hyphens special because we wanted A-B-C to sort before AB-A. Then start peeling off characters. As long as they are the same type as the first character, they go into the same bucket. Once the type changes, you start putting them in a different bucket. Then you also need a compare function that compares bucket-by-bucket. When both buckets are alpha, you just do a normal alpha compare. When both are digits, convert both to integer and do an integer compare, or pad the shorter to the length of the longer or something equivalent. When they're different types, you'll need a rule for how those compare, like does A-A come before or after A-1 ?
It's not a trivial job and you have to come up with rules for all the odd cases that may arise, but I would think you could get it together in a few hours of work.

Without any parsing, there's no way to compare human written numbers (high values first with leading zeroes stripped) and normal characters as part of the same string.
The parsing doesn't need to be terribly complex though. A simple hash table to deal with things like case sensitivity and stripping special characters ('A'='a'=1,'B'='b'='2,... or 'A'=1,'a'=2,'B'=3,..., '-'=0(strip)), remap your string to an array of the hashed values, then truncate number cases (if a number is encountered and the last character was a number, multiply the last number by ten and add the current value to it).
From there, sort as normal.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

Word Frequency Statistics - c++

Related

Find All Palindrome Substrings in a String by Rearranging Characters

Hashing Function/Code

Efficient C++ data structure to search for intervals in sets of integer values

Does hash_map automatically sort [C++]?

C++ string sort like a human being?

Categories

Resources