Most frequent character in range - c++

I have a string s of length n. What is the most efficient data structure / algorithm to use for finding the most frequent character in range i..j?
The string doesn't change over time, I just need to repeat queries that ask for the most frequent char among s[i], s[i + 1], ... , s[j].

Use an array in which you hold the number of occurrences of each character. You increase the respective value while iterating through the string once. While doing this, you can remember the current max in the array; alternatively, look for the highest value in the array at the end.
Pseudocode
arr = [0] * alphabetSize
for ( char in string )
    arr[char]++
mostFrequent = indexOfHighest(arr)

Do a single iteration over the string and for each position remember how many occurrences of each character there are up to that position. So something like this:
"abcdabc"
for index 0:
count['a'] = 1
count['b'] = 0
etc...
for index 1:
....
count['a'] = 1
count['b'] = 1
count['c'] = 0
etc...
for index 2:
....
count['a'] = 1
count['b'] = 1
count['c'] = 1
....
And so on. For index 6:
....
count['a'] = 2
count['b'] = 2
count['c'] = 2
count['d'] = 1
... all others are 0
After you compute these prefix arrays you can get the number of occurrences of a given letter in an interval (i, j) in constant time - simply compute count[j][letter] - count[i - 1][letter] (be careful here for i = 0!).
So for each query you will have to iterate over all letters, not over all characters in the interval, and thus instead of iterating over 10^6 characters you will only pass over at most 128 (assuming you only have ASCII symbols).
A drawback - you need more memory, depending on the size of the alphabet you are using.
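For the original question (C++, a fixed string, many i..j queries), a minimal sketch of this prefix-count idea might look as follows; the function names and the 128-entry ASCII assumption are illustrative, not part of the answer above:

#include <array>
#include <cstddef>
#include <string>
#include <vector>

// prefix[p][c] = number of occurrences of character c in s[0 .. p-1].
std::vector<std::array<int, 128>> buildPrefixCounts(const std::string& s) {
    std::vector<std::array<int, 128>> prefix(s.size() + 1);   // value-initialized to zeros
    for (std::size_t p = 0; p < s.size(); ++p) {
        prefix[p + 1] = prefix[p];                             // copy previous counts
        ++prefix[p + 1][(unsigned char)s[p] & 0x7F];           // stay inside the 128-entry table
    }
    return prefix;
}

// Most frequent character in s[i..j], inclusive.
char mostFrequentInRange(const std::vector<std::array<int, 128>>& prefix, int i, int j) {
    int best = 0, bestCount = -1;
    for (int c = 0; c < 128; ++c) {
        int count = prefix[j + 1][c] - prefix[i][c];
        if (count > bestCount) { bestCount = count; best = c; }
    }
    return (char)best;
}

Each query then costs O(alphabet size) regardless of j - i, at the price of the extra memory the answer mentions (one counter array per position).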

If you wish to get efficient results on intervals, you can build an integral distribution vector at each index of your sequence. Then by subtracting integral distributions at j+1 and i you can obtain the distribution at the interval from s[i],s[i+1],...,s[j].
Some pseudocode in Python follows. I assume your characters are chars, hence 256 distribution entries.
def buildIntegralDistributions(s):
    IDs = []  # integral distributions
    D = [0] * 256
    IDs.append(D[:])
    for x in s:
        D[ord(x)] += 1
        IDs.append(D[:])
    return IDs

def getIntervalDistribution(IDs, i, j):
    D = [0] * 256
    for k in range(256):
        D[k] = IDs[j][k] - IDs[i][k]
    return D
s='abababbbb'
IDs=buildIntegralDistributions(s)
Dij=getIntervalDistribution(IDs, 2,4)
>>> s[2:4]
'ab'
>>> Dij[ord('a')] # how many 'a'-s in s[2:4]?
1
>>> Dij[ord('b')] # how many 'b'-s in s[2:4]?
1

You need to specify your algorithmic requirements in terms of space and time complexity.
If you insist on O(1) space complexity, just sorting (e.g. using lexicographic ordering of the bits if there is no natural comparison operator available) and then scanning for the longest run of equal elements will give you O(N log N) time complexity.
If you insist on O(N) time complexity, use @Luchian Grigore's solution, which also takes O(N) space complexity (well, O(K) for a K-letter alphabet).

string="something"
arrCount[string.length()];
after each access of string call freq()
freq(char accessedChar){
arrCount[string.indexOf(x)]+=1
}
to get the most frequent char call the string.charAt(arrCount.max())

Assuming the string is constant, and different i and j will be passed to query occurrences:
If you want to minimize processing time you can make a
struct occurrences {
    char c;
    std::list<int> positions;
};
and keep a std::list<occurrences>, one entry per character. For fast searching you can keep positions ordered.
And if you want to minimize memory you can just keep an incrementing counter and loop through i .. j.
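A hedged sketch of the first idea: if the positions of each character are kept sorted (a std::vector<int> per character is used here instead of std::list, since binary search needs random access), the number of occurrences of a character in i..j falls out of two binary searches. Names are illustrative:

#include <algorithm>
#include <array>
#include <string>
#include <vector>

// positions[c] holds the sorted indices at which character c occurs in s.
std::array<std::vector<int>, 256> buildPositions(const std::string& s) {
    std::array<std::vector<int>, 256> positions;
    for (int i = 0; i < (int)s.size(); ++i)
        positions[(unsigned char)s[i]].push_back(i);
    return positions;
}

// Occurrences of character c in s[i..j], inclusive, via binary search.
int countInRange(const std::array<std::vector<int>, 256>& positions, char c, int i, int j) {
    const std::vector<int>& p = positions[(unsigned char)c];
    auto lo = std::lower_bound(p.begin(), p.end(), i);
    auto hi = std::upper_bound(p.begin(), p.end(), j);
    return (int)(hi - lo);
}

The most frequent character in i..j is then found by calling countInRange for every character of the alphabet, i.e. O(alphabet * log n) per query.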

The most time-efficient algorithm, as has been suggested, is to store the frequencies of each character in an array. Note, however, that if you simply index the array with the characters, you may invoke undefined behaviour. Namely, if you are processing text that contains code points outside of the range 0x00-0x7F, such as text encoded with UTF-8, you may end up with a segmentation violation at best, and stack data corruption at worst:
char frequencies [256] = {};
frequencies ['á'] = 9; // Oops. If our implementation represents char using a
                       // signed eight-bit integer, we just referenced memory
                       // outside of our array bounds!
A solution that properly accounts for this would look something like the following:
#include <limits>   // std::numeric_limits
#include <string>   // std::basic_string

using std::basic_string;
using std::numeric_limits;
using std::size_t;

template <typename charT>
charT most_frequent (const basic_string <charT>& str)
{
    constexpr auto charT_max = numeric_limits <charT>::max ();
    constexpr auto charT_min = numeric_limits <charT>::lowest ();
    size_t frequencies [charT_max - charT_min + 1] = {};
    for (auto c : str)
        ++frequencies [c - charT_min];
    charT most_frequent = charT_min;
    size_t count = 0;
    for (charT c = charT_min; c < charT_max; ++c)
        if (frequencies [c - charT_min] > count)
        {
            most_frequent = c;
            count = frequencies [c - charT_min];
        }
    // We have to check charT_max outside of the loop,
    // as otherwise it will probably never terminate
    if (frequencies [charT_max - charT_min] > count)
        return charT_max;
    return most_frequent;
}
If you want to iterate over the same string multiple times, modify the above algorithm (as construct_array) to use a std::array <size_t, numeric_limits <charT>::max () - numeric_limits <charT>::lowest () + 1>. Then return that array instead of the max character after the first for loop and omit the part of the algorithm that finds the most frequent character. Construct a std::map <std::string, std::array <...>> in your top-level code and store the returned array in that. Then move the code for finding the most frequent character into that top-level code and use the cached count array:
char most_frequent (string s)
{
    static map <string, array <...>> cache;
    if (cache.count (s) == 0)
        cache [s] = construct_array (s);
    // find the most frequent character, as above, replacing `frequencies`
    // with cache [s], then return it
}
Now, this only works for whole strings. If you want to process relatively small substrings repeatedly, you should use the first version instead. Otherwise, I'd say that your best bet is probably to do something like the second solution, but partitioning the string into manageable chunks; that way, you can fetch most of the information from your cache, only having to recalculate the frequencies in the chunks in which your iterators reside.
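A rough sketch of that chunking idea for a single long string (the block size, the 256-entry tables, and all names below are illustrative choices, not part of the answer above): frequency tables are precomputed once per whole chunk, and a range query adds up the cached whole chunks and only scans the partial chunks at the two ends.

#include <array>
#include <cstddef>
#include <string>
#include <vector>

struct ChunkedCounts {
    static constexpr std::size_t BLOCK = 4096;              // chunk size, an arbitrary choice
    const std::string& s;                                   // caller must keep the string alive
    std::vector<std::array<std::size_t, 256>> blockCounts;  // one table per full chunk

    explicit ChunkedCounts(const std::string& str) : s(str) {
        for (std::size_t start = 0; start + BLOCK <= s.size(); start += BLOCK) {
            std::array<std::size_t, 256> counts{};
            for (std::size_t k = start; k < start + BLOCK; ++k)
                ++counts[(unsigned char)s[k]];
            blockCounts.push_back(counts);
        }
    }

    // Most frequent character in s[i..j], inclusive (requires j < s.size()).
    char mostFrequent(std::size_t i, std::size_t j) const {
        std::array<std::size_t, 256> counts{};
        std::size_t pos = i;
        while (pos <= j) {
            if (pos % BLOCK == 0 && pos + BLOCK - 1 <= j && pos / BLOCK < blockCounts.size()) {
                const auto& b = blockCounts[pos / BLOCK];   // whole cached chunk
                for (int c = 0; c < 256; ++c) counts[c] += b[c];
                pos += BLOCK;
            } else {
                ++counts[(unsigned char)s[pos]];            // partial chunk: scan directly
                ++pos;
            }
        }
        int best = 0;
        for (int c = 1; c < 256; ++c)
            if (counts[c] > counts[best]) best = c;
        return (char)best;
    }
};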

The fastest would be to use an unordered_map or similar:
pair<char, int> fast(const string& s) {
    unordered_map<char, int> result;
    for (const auto i : s) ++result[i];
    return *max_element(cbegin(result), cend(result),
                        [](const auto& lhs, const auto& rhs) { return lhs.second < rhs.second; });
}
The lightest, memory-wise, would require a non-constant input which could be sorted, such that find_first_not_of or similar could be used:
pair<char, int> light(string& s) {
    pair<char, int> result;
    int start = 0;
    sort(begin(s), end(s));
    for (auto finish = s.find_first_not_of(s.front()); finish != string::npos;
         start = finish, finish = s.find_first_not_of(s[start], start)) {
        if (const int second = finish - start; second > result.second)
            result = make_pair(s[start], second);
    }
    if (const int second = size(s) - start; second > result.second)
        result = make_pair(s[start], second);
    return result;
}
It should be noted that both of these functions have the precondition of a non-empty string. Also, if there is a tie for the most characters in the string, both functions will return the character that is lexicographically first among those tied.

Related

Get string of characters from a vector of strings where the resulting string is equivalent to the majority of the same chars in the nth pos in strings

I'm reading from a file a series of strings, one for each line.
The strings all have the same length.
File looks like this:
010110011101
111110010100
010000100110
010010001001
010100100011
[...]
Result : 010110000111
I'll need to compare each 1st char of every string to obtain one single string at the end.
If the majority of the nth char in the strings is 1, the result string in that index will be 1, otherwise it's going to be 0 and so on.
For reference, the example I provided should return the value shown in the code block, as the majority of the chars in the first index of all strings is 0, the second one is 1 and so on.
I'm pretty new to C++ as I'm trying to move from Web Development to Software Development.
Also, I thought about doing this with vectors, but maybe there is a better way.
Thanks.
First off, you show the result of your example input should be 010110000111 but it should actually be 010110000101 instead, because the 11th column has more 0s than 1s in it.
That being said, what you are asking for is simple. Just put the strings into a std::vector, and then run a loop for each string index, running a second loop through the vector counting the number of 1s and 0s at the current string index, eg:
vector<string> vec;
// fill vec as needed...

string result(12, '\0');
for (size_t i = 0; i < 12; ++i) {
    int digits[2]{};
    for (const auto &str : vec) {
        digits[str[i] - '0']++;
    }
    result[i] = (digits[1] > digits[0]) ? '1' : '0';
}
// use result as needed...

What is the fastest way to search for a sequence of numbers in a 2d vector?

Given a 2d array (the array can be larger than 10k*10k) with integer values, what is the fastest way to search for a given sequence of numbers in the array?
Assume the 2d array, which is in a file, is read into a big 1d vector and is accessed as big_matrix[row * width + x].
There are 3 types of searches I would like to do on the same 2d array. They are Search Ordered, Search Unordered, Search Best Match. Here's my approach to each of the search functions.
Search Ordered: This function finds all the rows in which the given number sequence (order of numbers matters) is present. Here's the KMP-based method I implemented to find the given number sequence:
void searchPattern(std::vector<int> const &pattern, std::vector<int> const &big_matrix, int begin, int finish,
                   int width, std::vector<int> &searchResult) {
    auto M = (int) pattern.size();
    auto N = width; // size of one row
    while (begin < finish) {
        int i = 0;
        int j = 0;
        while (i < N) {
            if (pattern[j] == big_matrix[(begin * width) + i]) {
                j++;
                i++;
            }
            if (j == M) {
                searchResult[begin] = begin;
                begin++;
                break;
            } else if (i < N && pattern[j] != big_matrix[(begin * width) + i]) {
                if (j != 0)
                    j = lps[j - 1]; // lookup table as in KMP
                else
                    i = i + 1;
            }
        }
        if (j != M) {
            searchResult[begin] = -1;
            begin++;
        }
    }
}
Complexity: O(m*n); m is the number of rows, n is the number of cols
Search Unordered / Search Best Match: This function finds all the rows in which the given number sequence is present (order of numbers doesn't matter).
Here I sort the large array once up front and only sort the input sequence during the search.
void SearchUnordered/BestMatch(std::vector<int> const &match, std::vector<int> const &big_matrix_sorted, int begin, int finish,
                               int width, std::vector<int> &searchResult) {   // shorthand for two nearly identical functions
    std::vector<int>::iterator it;
    while (begin < finish) {
        std::vector<int> v(match.size() + width);   // re-created each row, since it is shrunk below
        it = std::set_intersection(match.begin(), match.end(), big_matrix_sorted.begin() + begin * width,
                                   big_matrix_sorted.begin() + begin * width + width, v.begin());
        v.resize(it - v.begin());
        if (v.size() == match.size())
            searchResult[begin] = begin;
        else
            searchResult[begin] = -1;
        begin++;
        /* For Search Best Match the last few lines change as follows:
           searchResult[begin] = (int) v.size();
           begin++;  and the largest value in searchResult will be the result */
    }
}
Complexity: O(m*(l + n)); l - the length of the pattern, m is the number of rows, n is the number of cols.
Preprocessing of big_matrix (constructing the lookup table, storing a sorted version of it - you're allowed to do any preprocessing) is not taken into consideration. How can I improve the complexity (to O(log(m*n))) of these search functions?
If you want it faster overall but already have the right algorithm, you may gain some performance just by optimising the code (memory allocations, removing duplicated operations the compiler didn't remove, etc.). For example there may be a gain from reading big_matrix[(begin * width) + i] once into a local variable instead of indexing twice. Be careful to profile and measure realistic cases.
For bigger gains, threads can be an option. You can process one row at a time here, so you should get a roughly linear speedup with the number of cores. C++11 has std::async, which can handle some of the work of launching threads and getting results, rather than dealing with std::thread yourself or with platform-specific mechanisms. There are some other newer things in later versions of C++ that may be useful as well.
#include <future>
#include <vector>

void searchPatternRow(std::vector<int> const &pattern, std::vector<int> const &big_matrix, int row, int width, std::vector<int> &searchResult);

void searchPattern(std::vector<int> const &pattern, std::vector<int> const &big_matrix, int begin, int finish, int width, std::vector<int> &searchResult)
{
    std::vector<std::future<void>> futures;
    for (int row = begin; row < finish; ++row)
        futures.push_back(std::async(std::launch::async, [&, row]() {
            searchPatternRow(pattern, big_matrix, row, width, searchResult);
        }));
    for (auto &future : futures)
        future.wait(); // Note: also implicit when the future returned by async gets destructed
}
To improve threading efficiency you may want to batch the work and search, say, 10 rows per task. There are also some considerations around multiple threads writing to the same cache line of searchResult (false sharing).
When searching for an exact match, you can do this quite efficiently by using what I will call a "moving hash".
When you search, you calculate a hash of your search string, and at the same time you keep calculating a moving hash of the data you are searching. When comparing, you first compare the hashes, and only if they match do you go on and compare the actual data.
Now the trick is to choose a hash algorithm that can easily be updated each time you move one spot, instead of recalculating everything. An example of such a hash is the sum of all the digits.
If I have the following array: 012345678901234567890 and I want to find 34567 in this array, I could define the hash as the sum of all the digits in the search string. This would give a hash of 25 (3+4+5+6+7). I would then search through the array and keep updating a running hash on the array. The first hash in the array would be 10 (0+1+2+3+4) and the second hash would be 15 (1+2+3+4+5). But instead of recalculating the second hash, I can just update the previous hash by adding 5 (the new digit) and subtracting 0 (the old digit).
As updating the "running hash" is O(1), you can speed up the process considerably if you have a good hash algorithm that doesn't give many false hits. The simple sum I use as the hash is probably too simple, but other methods allow this kind of update as well, e.g. XOR.

Big-O analysis of two algorithms

I created two solutions for LeetCode problem 17, which asks you to generate all possible text strings from a phone number combination, e.g. "3" results in ["d","e","f"].
My first solution uses a recursive algorithm to generate the strings and is given below:
class Solution {
public:
    void fill_LUT(vector<string>& lut) {
        lut.clear();
        lut.push_back(" ");
        lut.push_back("");
        lut.push_back("abc");
        lut.push_back("def");
        lut.push_back("ghi");
        lut.push_back("jkl");
        lut.push_back("mno");
        lut.push_back("pqrs");
        lut.push_back("tuv");
        lut.push_back("wxyz");
    }

    void generate_strings(int index, string& digits, vector<string>& lut, vector<string>& r, string& work) {
        if (index >= digits.size()) {
            r.push_back(work);
            return;
        }
        char idx = digits[index] - '0';
        for (char c : lut[idx]) {
            work.push_back(c);
            generate_strings(index + 1, digits, lut, r, work);
            work.pop_back();
        }
    }

    vector<string> letterCombinations(string digits) {
        vector<string> r;
        vector<string> lut;
        fill_LUT(lut);
        if (digits.size() <= 0)
            return r;
        string work;
        generate_strings(0, digits, lut, r, work);
        return r;
    }
};
I am a bit rusty with big-O, but it appears to me that the space complexity would be O(n) for the recursive call, i.e. its maximum depth, O(n) for the buffer string, and O(n*c^n) for the resulting strings. Would this sum together as O(n+n*c^n)?
For time complexity I am a bit confused. Each level of the recursion performs c pushes + pops + recursive calls multiplied by the number of operations by the next level, so it sounds like c^1 + c^2 + ... + c^n. In addition, there are c^n duplications of n length strings. How do I consolidate this into a nice big-O representation?
The second solution views the number of results as a mixed radix number and converts it to a string as you might perform an int to hex string conversion:
class Solution {
public:
    void fill_LUT(vector<string>& lut) {
        lut.clear();
        lut.push_back(" ");
        lut.push_back("");
        lut.push_back("abc");
        lut.push_back("def");
        lut.push_back("ghi");
        lut.push_back("jkl");
        lut.push_back("mno");
        lut.push_back("pqrs");
        lut.push_back("tuv");
        lut.push_back("wxyz");
    }

    vector<string> letterCombinations(string digits) {
        vector<string> r;
        vector<string> lut;
        fill_LUT(lut);
        if (digits.size() <= 0)
            return r;
        unsigned total = 1;
        for (int i = 0; i < digits.size(); i++) {
            digits[i] = digits[i] - '0';
            auto m = lut[digits[i]].size();
            if (m > 0) total *= m;
        }
        for (int i = 0; i < total; i++) {
            int current = i;
            r.push_back(string());
            string& s = r.back();
            for (char c : digits) {
                int radix = lut[c].size();
                if (radix != 0) {
                    s.push_back(lut[c][current % radix]);
                    current = current / radix;
                }
            }
        }
        return r;
    }
};
In this case, I believe that the space complexity is O(n*c^n) similar to the first solution minus the buffer and recursion, and the time complexity must be O(n) for the first for loop and an additional O(n*c^n) to create a result string for each of the possible results. The final big-O for this is O(n+n*c^n). Is my thought process correct?
Edit: To add some clarification to the code, imagine an input string of "234". The first recursive solution will call generate_strings with the arguments (0, "234", lut, r, work). lut is a look up table that converts a number to its corresponding characters. r is the vector containing the resulting strings. work is a buffer where the work is performed.
The first recursive call will then see that the index 0 digit is 2, which corresponds with "abc", push a to work, and then call generate_strings with the arguments (1, "234", lut, r, work). Once that call returns, it pops a, pushes b to work, and repeats.
When index is equal to the size of digits then a unique string has been generated and the string is pushed onto r.
For the second solution, the input string is first converted from its ASCII representation to its integer representation. For example "234" is converted to "\x02\x03\x04". Then the code uses those as indices to look up the number of corresponding characters in the lookup table and calculates the total number of strings that will be in the result. E.g. if the input string was "234": 2 corresponds with abc, which has 3 characters; 3 corresponds with def, which has 3 characters; 4 corresponds with ghi, which has 3 characters. The total number of possible strings is 3*3*3 = 27.
Then the code uses a counter to represent each of the possible strings. If i were 15 it would be evaluated by first finding 15 % 3 which is 0, corresponding with the first character for the first digit (a). Then divide 15 by 3 which is 5. 5 % 3 is 2 which corresponds with the third character for the second digit, which is f. Finally divide 5 by 3 and you get 1. 1 % 3 is 1 which corresponds with the second character for the third digit, h. Therefore the string that corresponds with the number 15 is afh. This is performed for each number and the resulting strings are stored in r.
Recursive algorithm:
Space: each level of recursion is O(1) and there are O(n) levels. Thus it is O(n) for the recursion. The space of result is O(c^n), where c = max(lut[i].length). Total space for the algorithm is O(c^n).
Time: let T(n) be the cost for a digit string of length n. Then we have the recurrence T(n) <= c T(n-1) + O(1). Solving this recurrence gives T(n) = O(c^n).
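To see how the recurrence consolidates into a single bound (which is what the question asks), write the O(1) per-call work as a constant d and unroll; nothing here goes beyond what the answer above states:

T(n) <= c*T(n-1) + d
     <= c*(c*T(n-2) + d) + d = c^2*T(n-2) + d*(1 + c)
     <= ...
     <= c^n*T(0) + d*(1 + c + ... + c^(n-1)) = O(c^n)

since for c > 1 the geometric sum 1 + c + ... + c^(n-1) is itself O(c^n). If you also count the cost of copying each finished n-character string into the result (as the question does), the time becomes O(n*c^n), matching the space of the output.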
Hashing algorithm:
Space: if you need the space to store all results, then it is still O(c^n).
Time: O(n+c^n) = O(c^n).
I like the hashing algorithm because it is better if the question asks you to produce one specific string result (suppose we order them alphabetically). In that case, the space and time are only O(n).
This question reminds me of a similar question: generate all permutations of the set {1,2,3,...,n}. The hashing approach is better because by generating the permutations one by one and processing each as it is produced, we can save a lot of space.

Can you do Top-K frequent Element better than O(nlogn) ? (code attached) [duplicate]

Input: A positive integer K and a big text. The text can actually be viewed as a word sequence, so we don't have to worry about how to break it down into words.
Output: The most frequent K words in the text.
My thinking is like this.
Use a hash table to record all words' frequencies while traversing the whole word sequence. In this phase, the key is "word" and the value is "word-frequency". This takes O(n) time.
Sort the (word, word-frequency) pairs, with "word-frequency" as the key. This takes O(n*lg(n)) time with a normal sorting algorithm.
After sorting, we just take the first K words. This takes O(K) time.
To summarize, the total time is O(n + n*lg(n) + K). Since K is surely smaller than N, it is actually O(n*lg(n)).
We can improve this. Actually, we just want the top K words; the other words' frequencies are not a concern for us. So, we can use "partial heap sorting". For steps 2) and 3), we don't do a full sort. Instead, we change it to:
2') build a heap of (word, word-frequency) pair with "word-frequency" as key. It takes O(n) time to build a heap;
3') extract top K words from the heap. Each extraction is O(lg(n)). So, total time is O(k*lg(n)).
To summarize, this solution cost time O(n+k*lg(n)).
This is just my thought. I haven't found a way to improve step 1).
I hope some Information Retrieval experts can shed more light on this question.
This can be done in O(n) time
Solution 1:
Steps:
Count the words and hash them, which will end up in a structure like this:
var hash = {
    "I": 13,
    "like": 3,
    "meow": 3,
    "geek": 3,
    "burger": 2,
    "cat": 1,
    "foo": 100,
    ...
};
Traverse the hash and find the most frequently used word (in this case "foo", 100), then create an array of that size.
Then we can traverse the hash again and use each word's number of occurrences as an array index; if there is nothing at that index yet, create a list there, else append the word to it. We end up with an array like:
   0      1       2           3             100
[ [ ], [cat], [burger], [like, meow, geek], [] ... [foo] ]
Then just traverse the array from the end, and collect the k words.
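A hedged C++ sketch of this bucket idea (names are illustrative; it assumes the word frequencies have already been counted into a hash map as in step 1):

#include <algorithm>
#include <string>
#include <unordered_map>
#include <vector>

// Top k words using frequency buckets: bucket[f] holds all words that occur f times.
std::vector<std::string> topKByBuckets(const std::unordered_map<std::string, int>& freq, int k) {
    int maxFreq = 0;
    for (const auto& kv : freq) maxFreq = std::max(maxFreq, kv.second);

    std::vector<std::vector<std::string>> bucket(maxFreq + 1);
    for (const auto& kv : freq) bucket[kv.second].push_back(kv.first);

    std::vector<std::string> result;
    for (int f = maxFreq; f >= 1 && (int)result.size() < k; --f)    // walk from the end
        for (const auto& w : bucket[f]) {
            result.push_back(w);
            if ((int)result.size() == k) break;
        }
    return result;
}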
Solution 2:
Steps:
Same as above
Use a min heap and keep its size at k. For each word in the hash, compare its occurrence count with the heap's minimum: 1) if the count is greater than the minimum value, remove the minimum (if the size of the min heap is equal to k) and insert the new entry; 2) the remaining cases are simple (e.g. the heap is not yet full, so just insert).
After traversing the hash, we just convert the min heap to an array and return it.
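A hedged C++ sketch of this size-k min-heap variant, again assuming the counts are already in a hash map (names are illustrative):

#include <functional>
#include <queue>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Top k words using a min-heap that never grows beyond k entries: O(n log k).
std::vector<std::pair<std::string, int>> topKByMinHeap(const std::unordered_map<std::string, int>& freq, int k) {
    using Entry = std::pair<int, std::string>;                                   // (count, word)
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> heap;    // min-heap

    for (const auto& kv : freq) {
        heap.emplace(kv.second, kv.first);
        if ((int)heap.size() > k) heap.pop();                                    // evict the current minimum
    }

    std::vector<std::pair<std::string, int>> result;
    while (!heap.empty()) {
        result.emplace_back(heap.top().second, heap.top().first);
        heap.pop();
    }
    return result;   // least frequent of the top k comes first
}

Keeping the heap at k entries is what brings the pass down to O(n log k) instead of O(n log n).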
You're not going to get generally better runtime than the solution you've described. You have to do at least O(n) work to evaluate all the words, and then O(k) extra work to find the top k terms.
If your problem set is really big, you can use a distributed solution such as map/reduce. Have n map workers count frequencies on 1/nth of the text each, and for each word, send it to one of m reducer workers calculated based on the hash of the word. The reducers then sum the counts. Merge sort over the reducers' outputs will give you the most popular words in order of popularity.
A small variation on your solution yields an O(n) algorithm if we don't care about ranking the top K, and an O(n + k*lg(k)) solution if we do. I believe both of these bounds are optimal within a constant factor.
The optimization here comes in after we run through the list, inserting into the hash table. We can use the median of medians algorithm to select the Kth largest element (by frequency) in the list. This algorithm is provably O(n).
After selecting the Kth largest element, we partition the list around that element just as in quicksort, in descending order of frequency. This is obviously also O(n). Anything on the "left" side of the pivot is in our group of K elements, so we're done (we can simply throw away everything else as we go along).
So this strategy is:
Go through each word and insert it into a hash table: O(n)
Select the Kth largest element (by frequency): O(n)
Partition around that element: O(n)
If you want to rank the K elements, simply sort them with any efficient comparison sort in O(k * lg(k)) time, yielding a total run time of O(n+k * lg(k)).
The O(n) time bound is optimal within a constant factor because we must examine each word at least once.
The O(n + k * lg(k)) time bound is also optimal because there is no comparison-based way to sort k elements in less than k * lg(k) time.
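In C++, std::nth_element provides this select-and-partition step directly (average-case linear; a worst-case-linear median-of-medians selection could be substituted to match the guarantee above). A hedged sketch, with illustrative names:

#include <algorithm>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Top k words via selection + partition, then an optional sort of just the k winners.
std::vector<std::pair<std::string, int>> topKBySelection(const std::unordered_map<std::string, int>& freq, int k) {
    std::vector<std::pair<std::string, int>> items(freq.begin(), freq.end());
    k = std::min<int>(k, (int)items.size());

    auto moreFrequent = [](const std::pair<std::string, int>& a, const std::pair<std::string, int>& b) {
        return a.second > b.second;
    };
    // Partition so the k most frequent entries end up in items[0..k-1]: O(n) on average.
    std::nth_element(items.begin(), items.begin() + k, items.end(), moreFrequent);
    items.resize(k);

    // Optional ranking of the winners: O(k log k).
    std::sort(items.begin(), items.end(), moreFrequent);
    return items;
}

Dropping the final std::sort gives the unranked O(n) variant; keeping it gives the O(n + k*lg(k)) ranked variant described above.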
If your "big word list" is big enough, you can simply sample and get estimates. Otherwise, I like hash aggregation.
Edit:
By sample I mean choose some subset of pages and calculate the most frequent word in those pages. Provided you select the pages in a reasonable way and select a statistically significant sample, your estimates of the most frequent words should be reasonable.
This approach is really only reasonable if you have so much data that processing it all is just kind of silly. If you only have a few megs, you should be able to tear through the data and calculate an exact answer without breaking a sweat rather than bothering to calculate an estimate.
You can cut down the time further by partitioning using the first letter of the words, then partitioning the largest multi-word set using the next character, until you have k single-word sets. You would use a sort of 256-way tree with lists of partial/complete words at the leaves. You would need to be very careful not to cause string copies everywhere.
This algorithm is O(m), where m is the number of characters. It avoids that dependence on k, which is very nice for large k [by the way your posted running time is wrong, it should be O(n*lg(k)), and I'm not sure what that is in terms of m].
If you run both algorithms side by side you will get what I'm pretty sure is an asymptotically optimal O(min(m, n*lg(k))) algorithm, but mine should be faster on average because it doesn't involve hashing or sorting.
You have a bug in your description: counting takes O(n) time, but sorting takes O(m*lg(m)), where m is the number of unique words. This is usually much smaller than the total number of words, so you should probably just optimize how the hash is built.
Your problem is the same as this one:
http://www.geeksforgeeks.org/find-the-k-most-frequent-words-from-a-file/
Use a trie and a min heap to solve it efficiently.
If what you're after is the list of the k most frequent words in your text for any practical k and for any natural language, then the complexity of your algorithm is not relevant.
Just sample, say, a few million words from your text, process that with any algorithm in a matter of seconds, and the most frequent counts will be very accurate.
As a side note, the complexity of the dummy algorithm (1. count all 2. sort the counts 3. take the best) is O(n+m*log(m)), where m is the number of different words in your text. log(m) is much smaller than (n/m), so it remains O(n).
Practically, the long step is counting.
Utilize a memory-efficient data structure to store the words.
Use a MaxHeap to find the top K frequent words.
Here is the code
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;
import com.nadeem.app.dsa.adt.Trie;
import com.nadeem.app.dsa.adt.Trie.TrieEntry;
import com.nadeem.app.dsa.adt.impl.TrieImpl;
public class TopKFrequentItems {
private int maxSize;
private Trie trie = new TrieImpl();
private PriorityQueue<TrieEntry> maxHeap;
public TopKFrequentItems(int k) {
this.maxSize = k;
this.maxHeap = new PriorityQueue<TrieEntry>(k, maxHeapComparator());
}
private Comparator<TrieEntry> maxHeapComparator() {
return new Comparator<TrieEntry>() {
@Override
public int compare(TrieEntry o1, TrieEntry o2) {
return o1.frequency - o2.frequency;
}
};
}
public void add(String word) {
this.trie.insert(word);
}
public List<TopK> getItems() {
for (TrieEntry trieEntry : this.trie.getAll()) {
if (this.maxHeap.size() < this.maxSize) {
this.maxHeap.add(trieEntry);
} else if (this.maxHeap.peek().frequency < trieEntry.frequency) {
this.maxHeap.remove();
this.maxHeap.add(trieEntry);
}
}
List<TopK> result = new ArrayList<TopK>();
for (TrieEntry entry : this.maxHeap) {
result.add(new TopK(entry));
}
return result;
}
public static class TopK {
public String item;
public int frequency;
public TopK(String item, int frequency) {
this.item = item;
this.frequency = frequency;
}
public TopK(TrieEntry entry) {
this(entry.word, entry.frequency);
}
@Override
public String toString() {
return String.format("TopK [item=%s, frequency=%s]", item, frequency);
}
@Override
public int hashCode() {
final int prime = 31;
int result = 1;
result = prime * result + frequency;
result = prime * result + ((item == null) ? 0 : item.hashCode());
return result;
}
@Override
public boolean equals(Object obj) {
if (this == obj)
return true;
if (obj == null)
return false;
if (getClass() != obj.getClass())
return false;
TopK other = (TopK) obj;
if (frequency != other.frequency)
return false;
if (item == null) {
if (other.item != null)
return false;
} else if (!item.equals(other.item))
return false;
return true;
}
}
}
Here are the unit tests:
@Test
public void test() {
TopKFrequentItems stream = new TopKFrequentItems(2);
stream.add("hell");
stream.add("hello");
stream.add("hello");
stream.add("hello");
stream.add("hello");
stream.add("hello");
stream.add("hero");
stream.add("hero");
stream.add("hero");
stream.add("hello");
stream.add("hello");
stream.add("hello");
stream.add("home");
stream.add("go");
stream.add("go");
assertThat(stream.getItems()).hasSize(2).contains(new TopK("hero", 3), new TopK("hello", 8));
}
For more details refer to this test case.
Use a hash table to record all words' frequencies while traversing the whole word sequence. In this phase, the key is "word" and the value is "word-frequency". This takes O(n) time; it is the same as everyone explained above.
While inserting into the hashmap, keep a TreeSet (specific to Java, but there are implementations in every language) of size 10 (k = 10) holding the top 10 frequent words. While its size is less than 10, keep adding entries. Once the size equals 10, check whether the inserted element is greater than the minimum element (i.e. the first element); if so, remove the minimum and insert the new element.
To restrict the size of the TreeSet, see this link.
Suppose we have a word sequence "ad" "ad" "boy" "big" "bad" "com" "come" "cold". And K=2.
as you mentioned "partitioning using the first letter of words", we got
("ad", "ad") ("boy", "big", "bad") ("com" "come" "cold")
"then partitioning the largest multi-word set using the next character until you have k single-word sets."
it will partition ("boy", "big", "bad") ("com" "come" "cold"), the first partition ("ad", "ad") is missed, while "ad" is actually the most frequent word.
Perhaps I misunderstand your point. Can you please detail your process about partition?
I believe this problem can be solved by an O(n) algorithm. We could do the sorting on the fly. In other words, the sorting in that case is a sub-problem of the traditional sorting problem, since only one counter gets incremented by one every time we access the hash table. Initially, the list is sorted since all counters are zero. As we keep incrementing counters in the hash table, we bookkeep another array of hash values ordered by frequency, as follows. Every time we increment a counter, we check its index in the ranked array and check whether its count exceeds its predecessor in the list. If so, we swap these two elements. As such we obtain a solution that is at most O(n), where n is the number of words in the original text.
I was struggling with this as well and got inspired by @aly. Instead of sorting afterwards, we can just maintain a presorted list of words (List<Set<String>>), where a word sits in the set at position X, with X being the word's current count. In general, here's how it works:
For each word, store its count in a map of occurrences: Map<String, Integer>.
Then, based on the new count, remove it from the previous count's set and add it to the new count's set.
The drawback of this is that the list may be big - this can be optimized by using a TreeMap<Integer, Set<String>> - but that will add some overhead. Ultimately we can use a mix of HashMap or our own data structure.
The code
public class WordFrequencyCounter {
private static final int WORD_SEPARATOR_MAX = 32; // UNICODE 0000-001F: control chars
Map<String, MutableCounter> counters = new HashMap<String, MutableCounter>();
List<Set<String>> reverseCounters = new ArrayList<Set<String>>();
private static class MutableCounter {
int i = 1;
}
public List<String> countMostFrequentWords(String text, int max) {
int lastPosition = 0;
int length = text.length();
for (int i = 0; i < length; i++) {
char c = text.charAt(i);
if (c <= WORD_SEPARATOR_MAX) {
if (i != lastPosition) {
String word = text.substring(lastPosition, i);
MutableCounter counter = counters.get(word);
if (counter == null) {
counter = new MutableCounter();
counters.put(word, counter);
} else {
Set<String> strings = reverseCounters.get(counter.i);
strings.remove(word);
counter.i ++;
}
addToReverseLookup(counter.i, word);
}
lastPosition = i + 1;
}
}
List<String> ret = new ArrayList<String>();
int count = 0;
for (int i = reverseCounters.size() - 1; i >= 0; i--) {
Set<String> strings = reverseCounters.get(i);
for (String s : strings) {
ret.add(s);
System.out.print(s + ":" + i);
count++;
if (count == max) break;
}
if (count == max) break;
}
return ret;
}
private void addToReverseLookup(int count, String word) {
while (count >= reverseCounters.size()) {
reverseCounters.add(new HashSet<String>());
}
Set<String> strings = reverseCounters.get(count);
strings.add(word);
}
}
I just found another solution for this problem, but I am not sure it is right.
Solution:
Use a hash table to record all words' frequencies: T(n) = O(n)
Choose the first k elements of the hash table, and store them in one buffer (whose space = k): T(n) = O(k)
Each time, we first need to find the current minimum element of the buffer, and compare it with the (n - k) remaining elements of the hash table one by one. If an element of the hash table is greater than this minimum element of the buffer, drop the buffer's current minimum and add that element to the buffer. Each search for the minimum in the buffer needs T(n) = O(k), and traversing the whole hash table needs T(n) = O(n - k). So the whole time complexity for this process is T(n) = O((n - k) * k).
After traversing the whole hash table, the result is in this buffer.
The whole time complexity: T(n) = O(n) + O(k) + O(kn - k^2) = O(kn + n - k^2 + k). Since k is generally much smaller than n, for this solution the time complexity is T(n) = O(kn). That is linear time when k is really small. Is that right? I am really not sure.
Try to think of a special data structure to approach this kind of problem. In this case a special kind of tree, like a trie, to store strings in a specific way - very efficient. Or, as a second way, build your own solution, like counting words. I guess this TB of data would be in English, and we have around 600,000 words in general, so it would be possible to store only those words and count which strings are repeated; this solution will need a regex to eliminate some special characters. The first solution will be faster, I'm pretty sure.
http://en.wikipedia.org/wiki/Trie
This is an interesting idea to explore, and I found this paper related to Top-K: https://icmi.cs.ucsb.edu/research/tech_reports/reports/2005-23.pdf
Also there is an implementation of it here.
The simplest code to get the most frequently used word:
function strOccurence(str) {
    var arr = str.split(" ");
    var length = arr.length, temp = {};
    while (length--) {
        if (temp[arr[length]] == undefined && arr[length].trim().length > 0) {
            temp[arr[length]] = 1;
        } else if (arr[length].trim().length > 0) {
            temp[arr[length]] = temp[arr[length]] + 1;
        }
    }
    console.log(temp);
    var max = [];
    for (var i in temp) {
        max[temp[i]] = i;
    }
    console.log(max[max.length - 1]);
    // if you want the second highest
    console.log(max[max.length - 2]);
}
In these situations, I recommend using Java's built-in features, since they are already well tested and stable. In this problem, I find the repetitions of the words by using the HashMap data structure. Then, I push the results into an array of objects, sort the objects with Arrays.sort(), and print the top k words and their repetitions.
import java.io.*;
import java.lang.reflect.Array;
import java.util.*;
public class TopKWordsTextFile {
static class SortObject implements Comparable<SortObject>{
private String key;
private int value;
public SortObject(String key, int value) {
super();
this.key = key;
this.value = value;
}
@Override
public int compareTo(SortObject o) {
//descending order
return o.value - this.value;
}
}
public static void main(String[] args) {
HashMap<String,Integer> hm = new HashMap<>();
int k = 1;
try {
BufferedReader br = new BufferedReader(new InputStreamReader(new FileInputStream("words.in")));
String line;
while ((line = br.readLine()) != null) {
// process the line.
//System.out.println(line);
String[] tokens = line.split(" ");
for(int i=0; i<tokens.length; i++){
if(hm.containsKey(tokens[i])){
//If the key already exists
Integer prev = hm.get(tokens[i]);
hm.put(tokens[i],prev+1);
}else{
//If the key doesn't exist
hm.put(tokens[i],1);
}
}
}
//Close the input
br.close();
//Print all words with their repetitions. You can use 3 for printing top 3 words.
k = hm.size();
// Get a set of the entries
Set set = hm.entrySet();
// Get an iterator
Iterator i = set.iterator();
int index = 0;
// Display elements
SortObject[] objects = new SortObject[hm.size()];
while(i.hasNext()) {
Map.Entry e = (Map.Entry)i.next();
//System.out.print("Key: "+e.getKey() + ": ");
//System.out.println(" Value: "+e.getValue());
String tempS = (String) e.getKey();
int tempI = (int) e.getValue();
objects[index] = new SortObject(tempS,tempI);
index++;
}
System.out.println();
//Sort the array
Arrays.sort(objects);
//Print top k
for(int j=0; j<k; j++){
System.out.println(objects[j].key+":"+objects[j].value);
}
} catch (IOException e) {
e.printStackTrace();
}
}
}
For more information, please visit https://github.com/m-vahidalizadeh/foundations/blob/master/src/algorithms/TopKWordsTextFile.java. I hope it helps.
C++11 Implementation of the above thought:
class Solution {
public:
    vector<int> topKFrequent(vector<int>& nums, int k) {
        unordered_map<int, int> map;
        for (int num : nums) {
            map[num]++;
        }
        vector<int> res;
        // We use a priority queue (max-heap) and keep the (map.size() - k) smallest
        // elements in the queue; everything popped out is one of the top k frequent values.
        // pair<first, second>: first is the frequency, second is the number.
        priority_queue<pair<int, int>> pq;
        for (auto it = map.begin(); it != map.end(); it++) {
            pq.push(make_pair(it->second, it->first));
            // Once the size gets bigger than map.size() - k, pop the top,
            // which is one of the top k frequent element values.
            if (pq.size() > (int)map.size() - k) {
                res.push_back(pq.top().second);
                pq.pop();
            }
        }
        return res;
    }
};

how to find distinct substrings?

Given a string, and a fixed length l, how can I count the number of distinct substrings whose length is l?
The size of the character set is also known (denote it as s).
For example, given a string "PccjcjcZ", s = 4, l = 3,
then there are 5 distinct substrings:
“Pcc”; “ccj”; “cjc”; “jcj”; “jcZ”
I tried to use a hash table, but the speed is still slow.
In fact I don't know how to make use of the character set size.
I have done things like this:
// node, hashF, compList and deleteList are helpers not shown in the question.
int diffPatterns(const string& src, int len, int setSize) {
    int cnt = 0;
    node* table[1 << 15];
    int tableSize = 1 << 15;
    for (int i = 0; i < tableSize; ++i) {
        table[i] = NULL;
    }
    unsigned int hashValue = 0;
    int end = (int)src.size() - len;
    for (int i = 0; i <= end; ++i) {
        hashValue = hashF(src, i, len);
        if (table[hashValue] == NULL) {
            table[hashValue] = new node(i);
            cnt++;
        } else {
            if (!compList(src, i, table[hashValue], len)) {
                cnt++;
            }
        }
    }
    for (int i = 0; i < tableSize; ++i) {
        deleteList(table[i]);
    }
    return cnt;
}
Hashtables are fine and practical, but keep in mind that if the length of the substrings is L, and the whole string length is N, then the algorithm is Theta((N+1-L)*L), which is Theta(NL) for most L. Remember, just computing the hash takes Theta(L) time. Plus there might be collisions.
Suffix trees can be used, and provide a guaranteed O(N) time algorithm (count the number of paths at depth L or greater), but the implementation is complicated. The saving grace is that you can probably find off-the-shelf implementations in the language of your choice.
The idea of using a hashtable is good. It should work well.
The idea of implementing your own hashtable as an array of length 2^15 is bad. See Hashtable in C++? instead.
You can use an unordered_set and insert the substrings into the set and then get the size of the set. Since the values in a set are unique, it will take care of not including substrings that are the same as ones previously found. This should give you close to O(StringSize - SubstringSize) complexity.
#include <iostream>
#include <string>
#include <unordered_set>

int main()
{
    std::string test = "PccjcjcZ";
    std::unordered_set<std::string> counter;
    size_t substringSize = 3;
    for (size_t i = 0; i < test.size() - substringSize + 1; ++i)
    {
        counter.insert(test.substr(i, substringSize));
    }
    std::cout << counter.size();
    std::cin.get();
    return 0;
}
Veronica Kham answered the question well, but we can improve this method to expected O(n) and still use a simple hash table rather than a suffix tree or any other advanced data structure.
Hash function
Let X and Y be two adjacent substrings of length L of a string A, more precisely:
X = A[i, i + L - 1]
Y = A[i + 1, i + L]
Let's assign to each letter of our alphabet a single non-negative integer, for example a := 1, b := 2 and so on.
Let's define a hash function h now:
h(A[i, j]) := (P^(L-1) * A[i] + P^(L-2) * A[i + 1] + ... + A[j]) % M
where P is a prime number ideally greater than the alphabet size and M is a very big number denoting the number of different possible hashes, for example you can set M to maximum available unsigned long long int in your system.
Algorithm
The crucial observation is the following:
If you have a hash computed for X, you can compute a hash for Y in
O(1) time.
Let's assume that we have computed h(X), which can obviously be done in O(L) time. We want to compute h(Y). Notice that X and Y differ by only two characters, so we can do that easily using addition and multiplication:
h(Y) = ((h(X) - P^(L-1) * A[i]) * P + A[i + L]) % M
Basically, we are subtracting the letter A[i] multiplied by its coefficient in h(X), multiplying the result by P in order to get the proper coefficients for the rest of the letters, and at the end we are adding the new last letter A[i + L].
Notice that we can precompute powers of P at the beginning and we can do it modulo M.
Since our hashing function returns integers, we can use any hash table to store them. Remember to make all computations modulo M and to avoid integer overflow.
Collisions
Of course, a collision might occur, but since P is prime and M is really huge, it is a rare situation.
If you want to lower the probability of a collision, you can use two different hashing functions, for example by using a different modulus in each of them. If the probability of a collision is p using one such function, then for two functions it is p^2, and we can make it arbitrarily small with this trick.
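A hedged C++ sketch of this scheme; the constant P and the use of unsigned 64-bit wraparound in place of an explicit modulus M are illustrative choices, and it simply counts distinct hash values, so it inherits the (rare) collision caveat discussed above:

#include <cstdint>
#include <string>
#include <unordered_set>

// Count distinct substrings of length L using the rolling polynomial hash described above.
// Collisions are assumed away here; a production version would verify matches or use two hashes.
std::size_t countDistinct(const std::string& s, std::size_t L) {
    if (L == 0 || s.size() < L) return 0;
    const std::uint64_t P = 131;             // prime larger than the alphabet size
    // M is implicitly 2^64 via unsigned overflow, playing the role of the "very big number" M.

    std::uint64_t topPower = 1;              // P^(L-1), precomputed
    for (std::size_t k = 1; k < L; ++k) topPower *= P;

    std::uint64_t h = 0;                     // h(s[0 .. L-1])
    for (std::size_t k = 0; k < L; ++k) h = h * P + (unsigned char)s[k];

    std::unordered_set<std::uint64_t> seen;
    seen.insert(h);
    for (std::size_t i = 0; i + L < s.size(); ++i) {
        // slide: remove s[i], append s[i + L]
        h = (h - topPower * (unsigned char)s[i]) * P + (unsigned char)s[i + L];
        seen.insert(h);
    }
    return seen.size();
}

For the question's example, countDistinct("PccjcjcZ", 3) returns 5, assuming no collision occurs.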
Use rolling hashes.
This will make the expected runtime O(n).
This might be repeating pkacprzak's answer, except it gives a name for easier remembering, etc.
A suffix automaton can also do it in O(N).
It's easy to code, but hard to understand.
Here are papers about it: http://dl.acm.org/citation.cfm?doid=375360.375365
http://www.sciencedirect.com/science/article/pii/S0304397509002370