Get combinations where some items should appear more than others - combinations

I apologize if this is better suited to math.stackexchange.com; I just thought it was important to note that this is for programming.
Given an array of items ['a', 'b', 'c', 'd'], how can I get only the combinations where certain elements appear a certain percentage of the time?
i.e. only combinations where 'a' appears 70% of the time, 'c' appears 45% of the time, and 'd' appears 0% of the time.
I may not be asking the right question, but I'm struggling to word it correctly.
Right now, I have the below function that gets ALL combinations with no repetitions.
function printCombos(items, r) {
  /**
   * This code is contributed by Devesh Agrawal
   * @param {*[]} items - input array
   * @param {number} slots - size of combination
   * @param {number} start - start index in items
   * @param {number} end - end index in items
   * @param {number} index - current index in items
   * @param {*[]} tmpData - temporary array to store current combination
   */
  const getCombos = (items, slots, start, end, index, tmpData = []) => {
    if (index == slots) {
      // current combination ready to be recorded
      console.log(tmpData.join(' '))
      return;
    }
    // replace index with all possible elements. The condition
    // "end - i + 1 >= slots - index" makes sure that including one element
    // at index will make a combination with remaining elements
    // at remaining positions
    for (let i = start; i <= end && end - i + 1 >= slots - index; ++i) {
      tmpData[index] = items[i];
      getCombos(items, slots, i + 1, end, index + 1, tmpData);
    }
  }
  // first call
  getCombos(items, r, 0, items.length - 1, 0, []);
}
sample:
const items = ['a', 'b', 'c', 'd'];
const slots = 3;
printCombos(items, slots);
Any resource, wikipedia article, etc. with more about this problem as it relates to percentages would be appreciated as well.

Related

Implementing a crossover function for multiple "Salesmen" TSP in a genetic algorithm

I’m trying to solve a variant of the TSP problem with "multiple salesmen". I have a series of n waypoints and m drones and I want to generate a result which sort of balances the number of waypoints between drones and returns an acceptable shortest travelling time. At the moment, I'm not really too worried about finding an optimal solution, I just want something that works at this point. I've sort of distilled my problem to a traditional TSP run multiple times. My example is for a series of waypoints:
[0,1,2,3,4,5,6,7,8,9,10,11]
where 0 == 11 is the start and end point. Say I have 4 drones, I want to generate something like:
Drone A = [0,1,2,3,11]
Drone B = [0,5,6,7,11]
Drone C = [0,4,8,11]
Drone D = [0,9,10,11]
However, I’m struggling to generate a consistent output in my crossover function. My current function looks like this:
DNA DNA::crossover( DNA &parentB)
{
// sol holds the individual solution for
// each drone
std::vector<std::vector<std::size_t>> sol;
// contains the values in flattened sol
// used to check for duplicates
std::vector<std::size_t> flat_sol;
// returns the number of solutions
// required
int number_of_paths = this->getSolution().size();
// limits the number of waypoints required for each drone
// subtracting 2 to remove "0" and "11"
std::size_t max_wp_per_drone = ((number_of_cities-2)/number_of_drones) + 1;
for(std::size_t i = 0; i < number_of_paths; i++)
{
int start = rand() % (this->getSolution().at(i).size() -2) + 1;
int end = start + 1 + rand() % ((this->getSolution().at(i).size()-2) - start +1);
std::vector<std::size_t>::const_iterator first = this->getSolution().at(i).begin()+start;
std::vector<std::size_t>::const_iterator second = this->getSolution().at(i).begin()+end;
// First problem occurs here... Sometimes, newOrder can return nothing based on
// the positions of start and end. Tried to mitigate by putting a while loop
// to regenerate the vector
std::vector<std::size_t> newOrder(first, second);
// RETURNS a vector from the vector of vectors sol
flat_sol = flatten(sol);
// compare new Order with solution and remove any duplicates..
for(std::size_t k = 0; k < newOrder.size(); k++ )
{
int duplicate = newOrder.at(k);
if(std::find(flat_sol.begin(), flat_sol.end(), duplicate) != flat_sol.end())
{
// second problem is found here, sometimes,
// new order might only return a vector with a single value
// or values that have already been assigned to another drone.
// In this case, those values are removed and newOrder is now 0
newOrder.erase(newOrder.begin()+k);
}
}
// attempt to create the vectors here.
for(std::size_t j = 1; j <=parentB.getSolution().at(i).size()-2; j++)
{
int city = parentB.getSolution().at(i).at(j);
if(newOrder.empty())
{
if(std::find(flat_sol.begin(), flat_sol.end(), city) == flat_sol.end())
{
newOrder.push_back(city);
}
}
else if((std::find(newOrder.begin(), newOrder.end(), city) == newOrder.end())
&&(std::find(flat_sol.begin(), flat_sol.end(), city) == flat_sol.end())
&& newOrder.size() < max_wp_per_drone )
{
newOrder.push_back(city);
}
}
sol.push_back(newOrder);
}
// waypoints and number_of drones are known,
//0 and 11 are appended to each vector in sol in the constructor.
return DNA(sol, waypoints, number_of_drones);
}
A sample output from my previous runs return the following:
[0,7,9,8, 11]
[0, 1,2,4,11]
[0, 10, 6, 11]
[0,3,11]
// This output is missing one waypoint.
[0,10,7,5, 11]
[0, 8,3,1,11]
[0, 6, 9, 11]
[0,2,4,11]
// This output is correct.
Unfortunately, this carries over into my subsequent generations of new children, and whether I get the correct output seems to be random. For example, in one generation I had a population with 40 correct children and 60 children with missing waypoints, while in other cases I've had more correct children. Any tips or help is appreciated.
Solved this by taking a slightly different approach. Instead of splitting the series of waypoints before performing crossover, I simply pass the series of waypoints
[0,1,2,3,4,5,6,7,8,9,10,11]
perform crossover, and when computing fitness of each set, I split the waypoints based on m drones and find the best solution of each generation. New crossover function looks like this:
DNA DNA::crossover( DNA &parentB)
{
int start = rand () % (this->getOrder().size()-1);
int end = getRandomInt<std::size_t>(start +1 , this->getOrder().size()-1);
std::vector<std::size_t>::const_iterator first = this->getOrder().begin() + start;
std::vector<std::size_t>::const_iterator second = this->getOrder().begin() + end;
std::vector<std::size_t> newOrder(first, second);
for(std::size_t i = 0; i < parentB.getOrder().size(); i++)
{
int city = parentB.getOrder().at(i);
if(std::find(newOrder.begin(), newOrder.end(), city) == newOrder.end())
{
newOrder.push_back(city);
}
}
return DNA(newOrder, waypoints, number_of_drones);
}

Efficient way to check if sum is possible from a given set of numbers [duplicate]

I've been tasked with helping some accountants solve a common problem they have - given a list of transactions and a total deposit, which transactions are part of the deposit? For example, say I have this list of numbers:
1.00
2.50
3.75
8.00
And I know that my total deposit is 10.50, I can easily see that it's made up of the 8.00 and 2.50 transaction. However, given a hundred transactions and a deposit in the millions, it quickly becomes much more difficult.
In testing a brute force solution (which takes way too long to be practical), I had two questions:
With a list of about 60 numbers, it seems to find a dozen or more combinations for any total that's reasonable. I was expecting a single combination to satisfy my total, or maybe a few possibilities, but there always seem to be a ton of combinations. Is there a math principle that describes why this is? It seems that given a collection of random numbers of even a medium size, you can find multiple combinations that add up to just about any total you want.
I built a brute force solution for the problem, but it's clearly O(n!), and quickly grows out of control. Aside from the obvious shortcuts (exclude numbers larger than the total themselves), is there a way to shorten the time to calculate this?
Details on my current (super-slow) solution:
The list of detail amounts is sorted largest to smallest, and then the following process runs recursively:
Take the next item in the list and see if adding it to your running total makes your total match the target. If it does, set aside the current chain as a match. If it falls short of your target, add it to your running total, remove it from the list of detail amounts, and then call this process again
This way it excludes the larger numbers quickly, cutting the list down to only the numbers it needs to consider. However, it's still n! and larger lists never seem to finish, so I'm interested in any shortcuts I might be able to take to speed this up - I suspect that even cutting 1 number out of the list would cut the calculation time in half.
Thanks for your help!
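For reference, a minimal Python sketch of the pruned depth-first search described above (function and variable names are mine; as an answer below notes, in practice you would work in integer cents rather than floats):

def find_matches(amounts, target):
    # Depth-first search over the amounts sorted largest-first, skipping any
    # amount that would overshoot the target; collects every combination
    # that sums exactly to the target.
    amounts = sorted(amounts, reverse=True)
    matches = []

    def search(start, chain, running):
        for i in range(start, len(amounts)):
            total = running + amounts[i]
            if total == target:
                matches.append(chain + [amounts[i]])
            elif total < target:
                search(i + 1, chain + [amounts[i]], total)
            # if total > target, skip this amount and try the next, smaller one

    search(0, [], 0)
    return matches

print(find_matches([1.00, 2.50, 3.75, 8.00], 10.50))  # [[8.0, 2.5]]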
This special case of the Knapsack problem is called Subset Sum.
C# version
setup test:
using System;
using System.Collections.Generic;
public class Program
{
public static void Main(string[] args)
{
// subtotal list
List<double> totals = new List<double>(new double[] { 1, -1, 18, 23, 3.50, 8, 70, 99.50, 87, 22, 4, 4, 100.50, 120, 27, 101.50, 100.50 });
// get matches
List<double[]> results = Knapsack.MatchTotal(100.50, totals);
// print results
foreach (var result in results)
{
Console.WriteLine(string.Join(",", result));
}
Console.WriteLine("Done.");
Console.ReadKey();
}
}
code:
using System.Collections.Generic;
using System.Linq;
public class Knapsack
{
internal static List<double[]> MatchTotal(double theTotal, List<double> subTotals)
{
List<double[]> results = new List<double[]>();
while (subTotals.Contains(theTotal))
{
results.Add(new double[1] { theTotal });
subTotals.Remove(theTotal);
}
// if no subtotals were passed
// or all matched the Total
// return
if (subTotals.Count == 0)
return results;
subTotals.Sort();
double mostNegativeNumber = subTotals[0];
if (mostNegativeNumber > 0)
mostNegativeNumber = 0;
// if there aren't any negative values
// we can remove any values bigger than the total
if (mostNegativeNumber == 0)
subTotals.RemoveAll(d => d > theTotal);
// if there aren't any negative values
// and sum is less than the total no need to look further
if (mostNegativeNumber == 0 && subTotals.Sum() < theTotal)
return results;
// get the combinations for the remaining subTotals
// skip 1 since we already removed subTotals that match
for (int choose = 2; choose <= subTotals.Count; choose++)
{
// get combinations for each length
IEnumerable<IEnumerable<double>> combos = Combination.Combinations(subTotals.AsEnumerable(), choose);
// add combinations where the sum matches the total to the result list
results.AddRange(from combo in combos
where combo.Sum() == theTotal
select combo.ToArray());
}
return results;
}
}
public static class Combination
{
public static IEnumerable<IEnumerable<T>> Combinations<T>(this IEnumerable<T> elements, int choose)
{
return choose == 0 ? // if choose = 0
new[] { new T[0] } : // return empty Type array
elements.SelectMany((element, i) => // else recursively iterate over array to create combinations
elements.Skip(i + 1).Combinations(choose - 1).Select(combo => (new[] { element }).Concat(combo)));
}
}
results:
100.5
100.5
-1,101.5
1,99.5
3.5,27,70
3.5,4,23,70
3.5,4,23,70
-1,1,3.5,27,70
1,3.5,4,22,70
1,3.5,4,22,70
1,3.5,8,18,70
-1,1,3.5,4,23,70
-1,1,3.5,4,23,70
1,3.5,4,4,18,70
-1,3.5,8,18,22,23,27
-1,3.5,4,4,18,22,23,27
Done.
If subTotals are repeated, there will appear to be duplicate results (the desired effect). In reality, you will probably want to use the subTotal Tupled with some ID, so you can relate it back to your data.
If I understand your problem correctly, you have a set of transactions, and you merely wish to know which of them could have been included in a given total. So if there are 4 possible transactions, then there are 2^4 = 16 possible sets to inspect. The problem is that for 100 possible transactions, the search space has 2^100 = 1267650600228229401496703205376 possible combinations to search over. For 1000 potential transactions in the mix, it grows to a total of
10715086071862673209484250490600018105614048117055336074437503883703510511249361224931983788156958581275946729175531468251871452856923140435984577574698574803934567774824230985421074605062371141877954182153046474983581941267398767559165543946077062914571196477686542167660429831652624386837205668069376
sets that you must test. Brute force will hardly be a viable solution on these problems.
Instead, use a solver that can handle knapsack problems. But even then, I'm not sure that you can generate a complete enumeration of all possible solutions without some variation of brute force.
There is a cheap Excel Add-in that solves this problem: SumMatch
The Excel Solver Addin as posted over on superuser.com has a great solution (if you have Excel) https://superuser.com/questions/204925/excel-find-a-subset-of-numbers-that-add-to-a-given-total
It's kind of like the 0-1 Knapsack problem, which is NP-complete and can be solved through dynamic programming in pseudo-polynomial time.
http://en.wikipedia.org/wiki/Knapsack_problem
But at the end of the algorithm you also need to check that the sum is what you wanted.
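For illustration, a minimal Python sketch of that dynamic program, assuming the amounts have first been converted to integer cents (names are mine):

def subset_sum_exists(amounts_cents, target_cents):
    # reachable[s] is True if some subset of the amounts seen so far sums to s
    reachable = [False] * (target_cents + 1)
    reachable[0] = True
    for a in amounts_cents:
        # iterate downwards so each amount is used at most once
        for s in range(target_cents, a - 1, -1):
            if reachable[s - a]:
                reachable[s] = True
    return reachable[target_cents]

print(subset_sum_exists([100, 250, 375, 800], 1050))  # True: 250 + 800

This answers the "is it possible" question in O(n * target) time; recovering which transactions were used needs back-pointers, and enumerating every matching combination can still be exponential in the worst case.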
Depending on your data you could first look at the cents portion of each transaction. Like in your initial example you know that 2.50 has to be part of the total because it is the only set of non-zero cent transactions which add to 50.
Not a super efficient solution, but here's an implementation in CoffeeScript.
combinations returns all possible combinations of the elements in list
combinations = (list) ->
  permutations = Math.pow(2, list.length) - 1
  out = []
  results = []
  while permutations
    out = []
    for i in [0...list.length]
      y = (1 << i)
      if y & permutations
        out.push(list[i])
    if out.length <= list.length and out.length > 0
      results.push(out)
    permutations--
  return results
and then find_components makes use of it to determine which numbers add up to total
find_components = (total, list) ->
  # given a list that is assumed to have only unique elements
  list_combinations = combinations(list)
  for combination in list_combinations
    sum = 0
    for number in combination
      sum += number
    if sum is total
      return combination
  return []
Here's an example:
list = [7.2, 3.3, 4.5, 6.0, 2, 4.1]
total = 7.2 + 2 + 4.1
console.log(find_components(total, list))
which returns [ 7.2, 2, 4.1 ]
#include <stdio.h>
#include <stdlib.h>
/* Takes at least 3 numbers as arguments.
* First number is desired sum.
* Find the subset of the rest that comes closest
* to the desired sum without going over.
*/
static long *elements;
static int nelements;
/* A linked list of some elements, not necessarily all */
/* The list represents the optimal subset for elements in the range [index..nelements-1] */
struct status {
long sum; /* sum of all the elements in the list */
struct status *next; /* points to next element in the list */
int index; /* index into elements array of this element */
};
/* find the subset of elements[startingat .. nelements-1] whose sum is closest to but does not exceed desiredsum */
struct status *reportoptimalsubset(long desiredsum, int startingat) {
struct status *sumcdr = NULL;
struct status *sumlist = NULL;
/* sum of zero elements or summing to zero */
if (startingat == nelements || desiredsum == 0) {
return NULL;
}
/* optimal sum using the current element */
/* if current elements[startingat] too big, it won't fit, don't try it */
if (elements[startingat] <= desiredsum) {
sumlist = malloc(sizeof(struct status));
sumlist->index = startingat;
sumlist->next = reportoptimalsubset(desiredsum - elements[startingat], startingat + 1);
sumlist->sum = elements[startingat] + (sumlist->next ? sumlist->next->sum : 0);
if (sumlist->sum == desiredsum)
return sumlist;
}
/* optimal sum not using current element */
sumcdr = reportoptimalsubset(desiredsum, startingat + 1);
if (!sumcdr) return sumlist;
if (!sumlist) return sumcdr;
return (sumcdr->sum < sumlist->sum) ? sumlist : sumcdr;
}
int main(int argc, char **argv) {
struct status *result = NULL;
long desiredsum = strtol(argv[1], NULL, 10);
nelements = argc - 2;
elements = malloc(sizeof(long) * nelements);
for (int i = 0; i < nelements; i++) {
elements[i] = strtol(argv[i + 2], NULL , 10);
}
result = reportoptimalsubset(desiredsum, 0);
if (result)
printf("optimal subset = %ld\n", result->sum);
while (result) {
printf("%ld + ", elements[result->index]);
result = result->next;
}
printf("\n");
}
Best to avoid the use of floats and doubles when doing arithmetic and equality comparisons on money, by the way.
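For example, one way to do that in Python is to parse the textual amounts with Decimal and scale them to integer cents (a small sketch, assuming two-decimal currency amounts):

from decimal import Decimal

def to_cents(amount_text):
    # exact parse, then scale to an integer number of cents
    return int(Decimal(amount_text) * 100)

print(to_cents("2.50") + to_cents("8.00") == to_cents("10.50"))  # True, no rounding surprises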

Can you do Top-K frequent Element better than O(nlogn) ? (code attached) [duplicate]

Input: A positive integer K and a big text. The text can actually be viewed as a word sequence, so we don't have to worry about how to break it down into words.
Output: The most frequent K words in the text.
My thinking is like this.
Use a hash table to record every word's frequency while traversing the whole word sequence. In this phase, the key is "word" and the value is "word-frequency". This takes O(n) time.
sort the (word, word-frequency) pair; and the key is "word-frequency". This takes O(n*lg(n)) time with normal sorting algorithm.
After sorting, we just take the first K words. This takes O(K) time.
To summarize, the total time is O(n + n*lg(n) + K). Since K is surely smaller than N, it is actually O(n*lg(n)).
We can improve this. Actually, we just want the top K words; the other words' frequencies are of no concern to us. So we can use "partial heap sorting". For steps 2) and 3), we don't just do sorting. Instead, we change it to:
2') build a heap of (word, word-frequency) pair with "word-frequency" as key. It takes O(n) time to build a heap;
3') extract top K words from the heap. Each extraction is O(lg(n)). So, total time is O(k*lg(n)).
To summarize, this solution cost time O(n+k*lg(n)).
This is just my thought. I haven't found a way to improve step 1).
I hope some Information Retrieval experts can shed more light on this question.
This can be done in O(n) time
Solution 1:
Steps:
Count words and hash it, which will end up in the structure like this
var hash = {
    "I" : 13,
    "like" : 3,
    "meow" : 3,
    "geek" : 3,
    "burger" : 2,
    "cat" : 1,
    "foo" : 100,
    ...
}
Traverse through the hash and find the most frequently used word (in this case "foo" 100), then create the array of that size
Then we can traverse the hash again and use each word's number of occurrences as an array index: if there is nothing at that index yet, create an array there, otherwise append the word to it. We end up with an array like:
0 1 2 3 100
[[ ],[cat],[burger],[like, meow, geek],[]...[foo]]
Then just traverse the array from the end, and collect the k words.
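A minimal Python sketch of Solution 1, assuming the text has already been split into words (names are mine):

def top_k_bucket(words, k):
    # count word frequencies
    counts = {}
    for w in words:
        counts[w] = counts.get(w, 0) + 1
    # buckets[i] holds the words that occur exactly i times
    buckets = [[] for _ in range(len(words) + 1)]
    for w, c in counts.items():
        buckets[c].append(w)
    # walk the buckets from the highest count down and collect k words
    result = []
    for c in range(len(buckets) - 1, 0, -1):
        for w in buckets[c]:
            result.append(w)
            if len(result) == k:
                return result
    return result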
Solution 2:
Steps:
Same as above
Use a min-heap and keep its size at k. For each word in the hash, compare its count with the heap's minimum: while the heap holds fewer than k entries, just insert; once it holds k, insert only if the count is greater than the current minimum, removing that minimum first.
After traversing the hash, we just convert the min-heap to an array and return it.
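And a corresponding sketch of Solution 2 with a size-k min-heap (again my own naming):

import heapq

def top_k_minheap(words, k):
    counts = {}
    for w in words:
        counts[w] = counts.get(w, 0) + 1
    heap = []  # min-heap of (count, word); never grows beyond k entries
    for w, c in counts.items():
        if len(heap) < k:
            heapq.heappush(heap, (c, w))
        elif c > heap[0][0]:
            heapq.heapreplace(heap, (c, w))  # pop the current minimum, push the new pair
    # most frequent first
    return [w for c, w in sorted(heap, reverse=True)]

This runs in O(n + m*lg(k)), where m is the number of distinct words.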
You're not going to get generally better runtime than the solution you've described. You have to do at least O(n) work to evaluate all the words, and then O(k) extra work to find the top k terms.
If your problem set is really big, you can use a distributed solution such as map/reduce. Have n map workers count frequencies on 1/nth of the text each, and for each word, send it to one of m reducer workers calculated based on the hash of the word. The reducers then sum the counts. Merge sort over the reducers' outputs will give you the most popular words in order of popularity.
A small variation on your solution yields an O(n) algorithm if we don't care about ranking the top K, and an O(n + k*lg(k)) solution if we do. I believe both of these bounds are optimal within a constant factor.
The optimization here comes again after we run through the list, inserting into the hash table. We can use the median of medians algorithm to select the Kth largest element in the list. This algorithm is provably O(n).
After selecting the Kth largest element, we partition the list around that element just as in quicksort. This is obviously also O(n). Anything on the "left" side of the pivot is in our group of K elements, so we're done (we can simply throw away everything else as we go along).
So this strategy is:
Go through each word and insert it into a hash table: O(n)
Select the Kth largest element: O(n)
Partition around that element: O(n)
If you want to rank the K elements, simply sort them with any efficient comparison sort in O(k * lg(k)) time, yielding a total run time of O(n+k * lg(k)).
The O(n) time bound is optimal within a constant factor because we must examine each word at least once.
The O(n + k * lg(k)) time bound is also optimal because there is no comparison-based way to sort k elements in less than k * lg(k) time.
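A rough Python sketch of this strategy (my naming); it substitutes a randomized quickselect for median of medians, so the linear bound is expected-case rather than the worst-case guarantee described above:

import random

def top_k_by_selection(counts, k):
    # counts: dict mapping word -> frequency (the hash table built in step 1)
    # returns the k most frequent words, unranked
    def top(items, want):
        if want <= 0:
            return []
        if want >= len(items):
            return items
        pivot = items[random.randrange(len(items))][1]
        bigger  = [x for x in items if x[1] > pivot]
        equal   = [x for x in items if x[1] == pivot]
        smaller = [x for x in items if x[1] < pivot]
        if want <= len(bigger):
            return top(bigger, want)
        if want <= len(bigger) + len(equal):
            return bigger + equal[:want - len(bigger)]
        return bigger + equal + top(smaller, want - len(bigger) - len(equal))

    return [word for word, freq in top(list(counts.items()), k)]

Sorting the returned k words afterwards gives the ranked version within the stated O(n + k*lg(k)) total.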
If your "big word list" is big enough, you can simply sample and get estimates. Otherwise, I like hash aggregation.
Edit:
By sample I mean choose some subset of pages and calculate the most frequent word in those pages. Provided you select the pages in a reasonable way and select a statistically significant sample, your estimates of the most frequent words should be reasonable.
This approach is really only reasonable if you have so much data that processing it all is just kind of silly. If you only have a few megs, you should be able to tear through the data and calculate an exact answer without breaking a sweat rather than bothering to calculate an estimate.
You can cut down the time further by partitioning using the first letter of words, then partitioning the largest multi-word set using the next character until you have k single-word sets. You would use a sort of 256-way tree with lists of partial/complete words at the leaves. You would need to be very careful not to cause string copies everywhere.
This algorithm is O(m), where m is the number of characters. It avoids that dependence on k, which is very nice for large k [by the way your posted running time is wrong, it should be O(n*lg(k)), and I'm not sure what that is in terms of m].
If you run both algorithms side by side you will get what I'm pretty sure is an asymptotically optimal O(min(m, n*lg(k))) algorithm, but mine should be faster on average because it doesn't involve hashing or sorting.
You have a bug in your description: Counting takes O(n) time, but sorting takes O(m*lg(m)), where m is the number of unique words. This is usually much smaller than the total number of words, so probably should just optimize how the hash is built.
Your problem is the same as this one:
http://www.geeksforgeeks.org/find-the-k-most-frequent-words-from-a-file/
Use a trie and a min-heap to solve it efficiently.
If what you're after is the list of k most frequent words in your text for any practical k and for any natural language, then the complexity of your algorithm is not relevant.
Just sample, say, a few million words from your text, process that with any algorithm in a matter of seconds, and the most frequent counts will be very accurate.
As a side note, the complexity of the dummy algorithm (1. count all 2. sort the counts 3. take the best) is O(n+m*log(m)), where m is the number of different words in your text. log(m) is much smaller than (n/m), so it remains O(n).
Practically, the long step is counting.
Utilize a memory-efficient data structure to store the words.
Use a MaxHeap to find the top K frequent words.
Here is the code:
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;
import com.nadeem.app.dsa.adt.Trie;
import com.nadeem.app.dsa.adt.Trie.TrieEntry;
import com.nadeem.app.dsa.adt.impl.TrieImpl;
public class TopKFrequentItems {
private int maxSize;
private Trie trie = new TrieImpl();
private PriorityQueue<TrieEntry> maxHeap;
public TopKFrequentItems(int k) {
this.maxSize = k;
this.maxHeap = new PriorityQueue<TrieEntry>(k, maxHeapComparator());
}
private Comparator<TrieEntry> maxHeapComparator() {
return new Comparator<TrieEntry>() {
@Override
public int compare(TrieEntry o1, TrieEntry o2) {
return o1.frequency - o2.frequency;
}
};
}
public void add(String word) {
this.trie.insert(word);
}
public List<TopK> getItems() {
for (TrieEntry trieEntry : this.trie.getAll()) {
if (this.maxHeap.size() < this.maxSize) {
this.maxHeap.add(trieEntry);
} else if (this.maxHeap.peek().frequency < trieEntry.frequency) {
this.maxHeap.remove();
this.maxHeap.add(trieEntry);
}
}
List<TopK> result = new ArrayList<TopK>();
for (TrieEntry entry : this.maxHeap) {
result.add(new TopK(entry));
}
return result;
}
public static class TopK {
public String item;
public int frequency;
public TopK(String item, int frequency) {
this.item = item;
this.frequency = frequency;
}
public TopK(TrieEntry entry) {
this(entry.word, entry.frequency);
}
@Override
public String toString() {
return String.format("TopK [item=%s, frequency=%s]", item, frequency);
}
@Override
public int hashCode() {
final int prime = 31;
int result = 1;
result = prime * result + frequency;
result = prime * result + ((item == null) ? 0 : item.hashCode());
return result;
}
@Override
public boolean equals(Object obj) {
if (this == obj)
return true;
if (obj == null)
return false;
if (getClass() != obj.getClass())
return false;
TopK other = (TopK) obj;
if (frequency != other.frequency)
return false;
if (item == null) {
if (other.item != null)
return false;
} else if (!item.equals(other.item))
return false;
return true;
}
}
}
Here are the unit tests:
@Test
public void test() {
TopKFrequentItems stream = new TopKFrequentItems(2);
stream.add("hell");
stream.add("hello");
stream.add("hello");
stream.add("hello");
stream.add("hello");
stream.add("hello");
stream.add("hero");
stream.add("hero");
stream.add("hero");
stream.add("hello");
stream.add("hello");
stream.add("hello");
stream.add("home");
stream.add("go");
stream.add("go");
assertThat(stream.getItems()).hasSize(2).contains(new TopK("hero", 3), new TopK("hello", 8));
}
For more details refer this test case
Use a hash table to record every word's frequency while traversing the whole word sequence. In this phase the key is "word" and the value is "word-frequency". This takes O(n) time and is the same as everyone explained above.
While inserting into the hashmap, keep a TreeSet (specific to Java; there are implementations in every language) of size 10 (k = 10) holding the top 10 frequent words. While its size is less than 10, keep adding. Once the size equals 10, check whether the inserted element is greater than the minimum element (i.e. the first element); if so, remove the minimum and insert the new element.
To restrict the size of the TreeSet, see this link.
Suppose we have a word sequence "ad" "ad" "boy" "big" "bad" "com" "come" "cold". And K=2.
as you mentioned "partitioning using the first letter of words", we got
("ad", "ad") ("boy", "big", "bad") ("com" "come" "cold")
"then partitioning the largest multi-word set using the next character until you have k single-word sets."
it will partition ("boy", "big", "bad") ("com" "come" "cold"), the first partition ("ad", "ad") is missed, while "ad" is actually the most frequent word.
Perhaps I misunderstand your point. Can you please detail your process about partition?
I believe this problem can be solved by an O(n) algorithm. We could do the sorting on the fly. In other words, the sorting in that case is a sub-problem of the traditional sorting problem, since only one counter gets incremented by one every time we access the hash table. Initially, the list is sorted since all counters are zero. As we keep incrementing counters in the hash table, we bookkeep another array of hash values ordered by frequency as follows. Every time we increment a counter, we check its index in the ranked array and check if its count exceeds its predecessor in the list. If so, we swap these two elements. As such we obtain a solution that is at most O(n), where n is the number of words in the original text.
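A sketch of that bookkeeping in Python (class and field names are mine). One detail the paragraph glosses over: to keep the swap O(1) when several words share a count, track where each count's block of words begins in the ranked array and always swap with the front of that block:

class FrequencyRanker:
    # order: words sorted by count, most frequent first
    # first_index[c]: leftmost position in `order` holding a word whose count is c
    def __init__(self):
        self.counts = {}       # word -> count
        self.pos = {}          # word -> index in self.order
        self.order = []        # words, most frequent first
        self.first_index = {}  # count -> start of that count's block in self.order

    def add(self, word):
        if word not in self.counts:
            self.counts[word] = 0
            self.pos[word] = len(self.order)
            self.order.append(word)
            self.first_index.setdefault(0, self.pos[word])
        c, i = self.counts[word], self.pos[word]
        j = self.first_index[c]            # front of the block of words with count c
        other = self.order[j]
        # move `word` to the front of its block; everything left of j has a larger count
        self.order[i], self.order[j] = other, word
        self.pos[word], self.pos[other] = j, i
        self.first_index.setdefault(c + 1, j)
        if j + 1 < len(self.order) and self.counts[self.order[j + 1]] == c:
            self.first_index[c] = j + 1    # the count-c block shrinks by one
        else:
            del self.first_index[c]        # the count-c block is now empty
        self.counts[word] = c + 1

    def top(self, k):
        return self.order[:k]

Each add is then O(1), so counting the whole text stays O(n), and top(k) is just a slice of the ranked array.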
I was struggling with this as well and got inspired by @aly. Instead of sorting afterwards, we can just maintain a presorted list of words (List<Set<String>>), and the word will be in the set at position X, where X is the current count of the word. In general, here's how it works:
for each word, store it as part of a map of its occurrences: Map<String, Integer>.
then, based on the count, remove it from the previous count set, and add it into the new count set.
The drawback of this is the list may be big - it can be optimized by using a TreeMap<Integer, Set<String>> - but this will add some overhead. Ultimately we can use a mix of HashMap or our own data structure.
The code
public class WordFrequencyCounter {
private static final int WORD_SEPARATOR_MAX = 32; // UNICODE 0000-001F: control chars
Map<String, MutableCounter> counters = new HashMap<String, MutableCounter>();
List<Set<String>> reverseCounters = new ArrayList<Set<String>>();
private static class MutableCounter {
int i = 1;
}
public List<String> countMostFrequentWords(String text, int max) {
int lastPosition = 0;
int length = text.length();
for (int i = 0; i < length; i++) {
char c = text.charAt(i);
if (c <= WORD_SEPARATOR_MAX) {
if (i != lastPosition) {
String word = text.substring(lastPosition, i);
MutableCounter counter = counters.get(word);
if (counter == null) {
counter = new MutableCounter();
counters.put(word, counter);
} else {
Set<String> strings = reverseCounters.get(counter.i);
strings.remove(word);
counter.i ++;
}
addToReverseLookup(counter.i, word);
}
lastPosition = i + 1;
}
}
List<String> ret = new ArrayList<String>();
int count = 0;
for (int i = reverseCounters.size() - 1; i >= 0; i--) {
Set<String> strings = reverseCounters.get(i);
for (String s : strings) {
ret.add(s);
System.out.print(s + ":" + i);
count++;
if (count == max) break;
}
if (count == max) break;
}
return ret;
}
private void addToReverseLookup(int count, String word) {
while (count >= reverseCounters.size()) {
reverseCounters.add(new HashSet<String>());
}
Set<String> strings = reverseCounters.get(count);
strings.add(word);
}
}
I just found another solution for this problem, but I am not sure it is right.
Solution:
Use a hash table to record every word's frequency: T(n) = O(n)
Choose the first k elements of the hash table and store them in a buffer (whose space = k). T(n) = O(k)
Each time, first find the current minimum element of the buffer, and compare it with the (n - k) remaining elements of the hash table one by one. If an element of the hash table is greater than the buffer's minimum, drop the current minimum and add that element. Finding the minimum in the buffer takes T(n) = O(k) each time, and traversing the whole hash table takes T(n) = O(n - k), so the whole time complexity of this process is T(n) = O((n - k) * k).
After traversing the whole hash table, the result is in the buffer.
The whole time complexity: T(n) = O(n) + O(k) + O(kn - k^2) = O(kn + n - k^2 + k). Since k is generally much smaller than n, the time complexity of this solution is T(n) = O(kn). That is linear time when k is really small. Is that right? I am really not sure.
Try to think of a special data structure to approach this kind of problem. In this case a special kind of tree, like a trie, stores strings in a specific way and is very efficient. A second way is to build your own solution, like counting words. I guess this TB of data would be in English, and we have around 600,000 words in general, so it is possible to store only those words and count which strings are repeated; this solution will also need a regex to eliminate some special characters. The first solution will be faster, I'm pretty sure.
http://en.wikipedia.org/wiki/Trie
This is an interesting idea to search and I could find this paper related to Top-K https://icmi.cs.ucsb.edu/research/tech_reports/reports/2005-23.pdf
Also there is an implementation of it here.
Simplest code to get the most frequently used word and its occurrences:
function strOccurence(str){
    var arr = str.split(" ");
    var length = arr.length, temp = {};
    while(length--){
        if(temp[arr[length]] == undefined && arr[length].trim().length > 0)
        {
            temp[arr[length]] = 1;
        }
        else if(arr[length].trim().length > 0)
        {
            temp[arr[length]] = temp[arr[length]] + 1;
        }
    }
    console.log(temp);
    var max = [];
    for(var i in temp)
    {
        max[temp[i]] = i;
    }
    console.log(max[max.length - 1])
    // if you want the second highest
    console.log(max[max.length - 2])
}
In these situations, I recommend using Java's built-in features, since they are already well tested and stable. In this problem, I find the repetitions of the words by using the HashMap data structure. Then I push the results into an array of objects, sort the objects with Arrays.sort(), and print the top k words and their repetitions.
import java.io.*;
import java.lang.reflect.Array;
import java.util.*;
public class TopKWordsTextFile {
static class SortObject implements Comparable<SortObject>{
private String key;
private int value;
public SortObject(String key, int value) {
super();
this.key = key;
this.value = value;
}
@Override
public int compareTo(SortObject o) {
//descending order
return o.value - this.value;
}
}
public static void main(String[] args) {
HashMap<String,Integer> hm = new HashMap<>();
int k = 1;
try {
BufferedReader br = new BufferedReader(new InputStreamReader(new FileInputStream("words.in")));
String line;
while ((line = br.readLine()) != null) {
// process the line.
//System.out.println(line);
String[] tokens = line.split(" ");
for(int i=0; i<tokens.length; i++){
if(hm.containsKey(tokens[i])){
//If the key already exists
Integer prev = hm.get(tokens[i]);
hm.put(tokens[i],prev+1);
}else{
//If the key doesn't exist
hm.put(tokens[i],1);
}
}
}
//Close the input
br.close();
//Print all words with their repetitions. You can use 3 for printing top 3 words.
k = hm.size();
// Get a set of the entries
Set set = hm.entrySet();
// Get an iterator
Iterator i = set.iterator();
int index = 0;
// Display elements
SortObject[] objects = new SortObject[hm.size()];
while(i.hasNext()) {
Map.Entry e = (Map.Entry)i.next();
//System.out.print("Key: "+e.getKey() + ": ");
//System.out.println(" Value: "+e.getValue());
String tempS = (String) e.getKey();
int tempI = (int) e.getValue();
objects[index] = new SortObject(tempS,tempI);
index++;
}
System.out.println();
//Sort the array
Arrays.sort(objects);
//Print top k
for(int j=0; j<k; j++){
System.out.println(objects[j].key+":"+objects[j].value);
}
} catch (IOException e) {
e.printStackTrace();
}
}
}
For more information, please visit https://github.com/m-vahidalizadeh/foundations/blob/master/src/algorithms/TopKWordsTextFile.java. I hope it helps.
C++11 implementation of the above approach:
class Solution {
public:
vector<int> topKFrequent(vector<int>& nums, int k) {
unordered_map<int,int> map;
for(int num : nums){
map[num]++;
}
vector<int> res;
// we use a priority queue as a max-heap; we keep the (size - k) least frequent elements in the queue
// pair<first, second>: first is frequency, second is number
priority_queue<pair<int,int>> pq;
for(auto it = map.begin(); it != map.end(); it++){
pq.push(make_pair(it->second, it->first));
// once the size gets bigger than (size - k), we pop the top, which is one of the top k frequent values
if(pq.size() > (int)map.size() - k){
res.push_back(pq.top().second);
pq.pop();
}
}
return res;
}
};

Complexity of the "Search words/strings in Matrix of Char" Algorithm

I have a task to search a grid of letters (20×20 <= M×N <= 1000×1000) for words (5 <= length <= 100) from a list. Any word hidden in the grid is always in the form of zig-zag segments whose length may be only 2 or 3. Zig-zag segments can only go from left to right or from bottom to top.
The required complexity is equal to the product of the number of letters in the grid and the number of letters in the word list.
For grid:
••••••••••••••••••••
••••••••ate•••••x•••
•••••••er•••••••e•••
••••••it••••••••v•••
••••ell••••••a••f•••
•••at••••e••••••rbg•
•••s•••••••ga•••••••
and list of words {"forward", "iterate", "phone", "satellite"}
output will be
3,6,iterate
6,3,satellite
I did this in C++:
I saved all prefixes and words in an unordered_map<string, int> where key is prefix/word and value is 1 for prefix and 2 for word. Now I do something like this (pseudocode):
for (char c in grid) {
    check(c + "");
}
where:
check(string s) {
    if s is key in unordered_map {
        if (value[s] == 2)  // it's a word
            print s;        // and position
        if (not up 3 times consecutive)  // limit the segments <= 3
            check(s + next_up_char_from_grid);
        if (not right 3 times consecutive)
            check(s + next_right_char_from_grid);
    }
}
This implementation works great for random chars in the grid and words from a dictionary, but the complexity is C ≃ O(M * N * 2^K) > O(M * N * R). A better approximation is C ≃ O(M * N * 1.6^K) because of the restrictions on segment lengths.
M * N = number of chars in grid
K = the maximum length of any word from list (5 <= K <= 100)
R = number of chars in list of words
Worst case: max grid, max word length and same single char in grid and word
How can I achieve the required complexity? Is it possible only with the given restrictions?
Your check() function will do a lot of repeated work.
For grid
•aa
ab•
aa•
and word 'aabaa'
There are two ways to get 'aabaa' which are same after letter 'b'
(top, right, top, right) or (right, top, top, right)
Because of this, we use an array a[position][n][m] to record whether, for a specific word, its prefix of length position can end at grid cell [m, n].
For the previous example,
follow such sequence
a[0][2][0] = true
a[1][1][0] = a[1][2][1] = true
a[2][1][1] = true
a[3][0][1] = true
a[4][0][2] = true
'aabaa' can be found in the grid
So the complexity will be N*M*K*S, where S is the number of words in the list.
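For illustration, a minimal Python sketch of this table-filling idea (function name is mine). It ignores the 2-or-3 segment-length rule; handling that would add the current direction and run length to the state:

def find_word(grid, word):
    # reach[p] holds the cells (row, col) at which the prefix word[:p+1] can end,
    # moving only up (row - 1) or right (col + 1) from the previous letter
    rows, cols = len(grid), len(grid[0])
    reach = [set() for _ in word]
    reach[0] = {(r, c) for r in range(rows) for c in range(cols) if grid[r][c] == word[0]}
    for p in range(1, len(word)):
        for (r, c) in reach[p - 1]:
            for nr, nc in ((r - 1, c), (r, c + 1)):  # up or right
                if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == word[p]:
                    reach[p].add((nr, nc))
    return bool(reach[-1])

Each prefix length touches each cell at most once, so one word costs O(M*N*K) and the whole list gives the N*M*K*S bound above.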

Strategy to modify permutation algorithm to prevent duplicate printouts

I've been reviewing algorithms for practice, and I'm currently looking at a permutation algorithm that I quite like:
void permute(char* set, int begin, int end) {
    int range = end - begin;
    if (range == 1)
        cout << set << endl;
    else {
        for (int i = 0; i < range; ++i) {
            swap(&set[begin], &set[begin+i]);
            permute(set, begin+1, end);
            swap(&set[begin], &set[begin+i]);
        }
    }
}
I actually wanted to apply this to a situation where there will be many repeated characters though, so I need to be able to modify it to prevent the printing of duplicate permutations.
How would I go about detecting that I was generating a duplicate? I know I could store this in a hash or something similar, but that's not an optimal solution - I'd prefer one that didn't require extra storage. Can someone give me a suggestion?
PS: I don't want to use the STL permutation mechanisms, and I don't want a reference to another "unique permutation algorithm" somewhere. I'd like to understand the mechanism used to prevent duplication so I can build it into this one and learn from it, if possible.
There is no general way to prevent arbitrary functions from generating duplicates. You can always filter out the duplicates, of course, but you don't want that, and for very good reasons. So you need a special way to generate only non-duplicates.
One way would be to generate the permutations in increasing lexicographical order. Then you can just compare if a "new" permutation is the same as the last one, and then skip it. It gets even better: the algorithm for generating permutations in increasing lexicographical order given at http://en.wikipedia.org/wiki/Permutations#Generation_in_lexicographic_order doesn't even generate the duplicates at all!
However, that is not an answer to your question, as it is a different algorithm (although based on swapping, too).
So, let's look at your algorithm a little closer. One key observation is:
Once a character is swapped to position begin, it will stay there for all nested calls of permute.
We'll combine this with the following general observation about permutations:
If you permute a string s, but only at positions where there's the same character, s will remain the same. In fact, all duplicate permutations have a different order for the occurrences of some character c, where c occurs at the same positions.
OK, so all we have to do is to make sure that the occurrences of each character are always in the same order as in the beginning. Code follows, but... I don't really speak C++, so I'll use Python and hope to get away with claiming it's pseudo code.
We start by your original algorithm, rewritten in 'pseudo code':
def permute(s, begin, end):
    if end == begin + 1:
        print(s)
    else:
        for i in range(begin, end):
            s[begin], s[i] = s[i], s[begin]
            permute(s, begin + 1, end)
            s[begin], s[i] = s[i], s[begin]
and a helper function that makes calling it easier:
def permutations_w_duplicates(s):
    permute(list(s), 0, len(s))  # use a list, as in Python strings are not mutable
Now we extend the permute function with some bookkeeping about how many times a certain character has been swapped to the begin position (i.e. has been fixed), and we also remember the original order of the occurrences of each character (char_number). Each character that we try to swap to the begin position then has to be the next higher in the original order, i.e. the number of fixes for a character defines which original occurrence of this character may be fixed next - I call this next_fixable.
def permute2(s, next_fixable, char_number, begin, end):
    if end == begin + 1:
        print(s)
    else:
        for i in range(begin, end):
            if next_fixable[s[i]] == char_number[i]:
                next_fixable[s[i]] += 1
                char_number[begin], char_number[i] = char_number[i], char_number[begin]
                s[begin], s[i] = s[i], s[begin]
                permute2(s, next_fixable, char_number, begin + 1, end)
                s[begin], s[i] = s[i], s[begin]
                char_number[begin], char_number[i] = char_number[i], char_number[begin]
                next_fixable[s[i]] -= 1
Again, we use a helper function:
def permutations_wo_duplicates(s):
    alphabet = set(s)
    next_fixable = dict.fromkeys(alphabet, 0)
    count = dict.fromkeys(alphabet, 0)
    char_number = [0] * len(s)
    for i, c in enumerate(s):
        char_number[i] = count[c]
        count[c] += 1
    permute2(list(s), next_fixable, char_number, 0, len(s))
That's it!
Almost. You can stop here and rewrite in C++ if you like, but if you are interested in some test data, read on.
I used a slightly different code for testing, because I didn't want to print all permutations. In Python, you would replace the print with a yield, which turns the function into a generator function, the result of which can be iterated over with a for loop, and the permutations will be computed only when needed. This is the real code and test I used:
def permute2(s, next_fixable, char_number, begin, end):
    if end == begin + 1:
        yield "".join(s)  # join the characters to form a string
    else:
        for i in range(begin, end):
            if next_fixable[s[i]] == char_number[i]:
                next_fixable[s[i]] += 1
                char_number[begin], char_number[i] = char_number[i], char_number[begin]
                s[begin], s[i] = s[i], s[begin]
                for p in permute2(s, next_fixable, char_number, begin + 1, end):
                    yield p
                s[begin], s[i] = s[i], s[begin]
                char_number[begin], char_number[i] = char_number[i], char_number[begin]
                next_fixable[s[i]] -= 1

def permutations_wo_duplicates(s):
    alphabet = set(s)
    next_fixable = dict.fromkeys(alphabet, 0)
    count = dict.fromkeys(alphabet, 0)
    char_number = [0] * len(s)
    for i, c in enumerate(s):
        char_number[i] = count[c]
        count[c] += 1
    for p in permute2(list(s), next_fixable, char_number, 0, len(s)):
        yield p
s = "FOOQUUXFOO"
A = list(permutations_w_duplicates(s))
print("%s has %s permutations (counting duplicates)" % (s, len(A)))
print("permutations of these that are unique: %s" % len(set(A)))
B = list(permutations_wo_duplicates(s))
print("%s has %s unique permutations (directly computed)" % (s, len(B)))
print("The first 10 permutations :", A[:10])
print("The first 10 unique permutations:", B[:10])
And the result:
FOOQUUXFOO has 3628800 permutations (counting duplicates)
permutations of these that are unique: 37800
FOOQUUXFOO has 37800 unique permutations (directly computed)
The first 10 permutations : ['FOOQUUXFOO', 'FOOQUUXFOO', 'FOOQUUXOFO', 'FOOQUUXOOF', 'FOOQUUXOOF', 'FOOQUUXOFO', 'FOOQUUFXOO', 'FOOQUUFXOO', 'FOOQUUFOXO', 'FOOQUUFOOX']
The first 10 unique permutations: ['FOOQUUXFOO', 'FOOQUUXOFO', 'FOOQUUXOOF', 'FOOQUUFXOO', 'FOOQUUFOXO', 'FOOQUUFOOX', 'FOOQUUOFXO', 'FOOQUUOFOX', 'FOOQUUOXFO', 'FOOQUUOXOF']
Note that the permutations are computed in the same order than the original algorithm, just without the duplicates. Note that 37800 * 2! * 2! * 4! = 3628800, just like you would expect.
You could add an if statement to prevent the swap code from executing if it would swap two identical characters. The for loop is then
for(int i = 0; i < range; ++i) {
    if(i == 0 || set[begin] != set[begin+i]) {
        swap(&set[begin], &set[begin+i]);
        permute(set, begin+1, end);
        swap(&set[begin], &set[begin+i]);
    }
}
The reason for allowing the case i==0 is to make sure the recursive call happens exactly once even if all the characters of the set are the same.
A simple solution is to change the duplicate characters randomly to characters that aren't already present. Then after permutation, change the characters back. Only accept a permutation if its characters are in order.
e.g. if you have "a,b,b"
you would have had the following:
a b b
a b b
b a b
b a b
b b a
b b a
But, if we start with a,b,b and note the duplicate b's, then we can change the second b to a c
now we have a b c
a b c - accept because b is before c. change c back to b to get a b b
a c b - reject because c is before b
b a c - accept as b a b
b c a - accept as b b a
c b a - reject as c comes before b.
c a b - reject as c comes before b.
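One way to realize this idea in Python: instead of substituting unused letters, tag each occurrence of a character with its index, permute the now-distinct items, and accept only permutations where the tags of equal characters stay in increasing order (names are mine; it still walks all n! arrangements, so it only illustrates the acceptance rule):

from itertools import permutations

def unique_permutations_by_tagging(s):
    # tag each character with its occurrence index so all items are distinct
    tagged, seen = [], {}
    for ch in s:
        k = seen.get(ch, 0)
        tagged.append((ch, k))
        seen[ch] = k + 1
    for perm in permutations(tagged):
        # accept only if equal characters keep their original relative order
        last, ok = {}, True
        for ch, k in perm:
            if k < last.get(ch, -1):
                ok = False
                break
            last[ch] = k
        if ok:
            yield "".join(ch for ch, _ in perm)

print(list(unique_permutations_by_tagging("abb")))  # ['abb', 'bab', 'bba']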
OPTION 1
One option would be to use 256 bits of storage on the stack to store a bitmask of which characters you had tried in the for loop, and only to recurse for new characters.
OPTION 2
A second option is to use the approach suggested in the comments ( http://n1b-algo.blogspot.com/2009/01/string-permutations.html) and change the for loop to:
else {
    char last = 0;
    for (int i = 0; i < range; ++i) {
        if (last == set[begin+i])
            continue;
        last = set[begin+i];
        swap(&set[begin], &set[begin+i]);
        permute(set, begin+1, end);
        swap(&set[begin], &set[begin+i]);
    }
}
However, to use this approach you will also have to sort the characters set[begin],set[begin+1],...set[end-1] at the entry to the function.
Note that you have to sort every time the function is called. (The blog post does not seem to mention this, but otherwise you will generate too many results for an input string "aabbc". The problem is that the string does not stay sorted after swap is used.)
This is still not very efficient. For example, for a string containing 1 'a' and N 'b's this approach will end up calling the sort N times for an overall complexity of N^2logN
OPTION 3
A more efficient approach for long strings containing lots of repeats would be to maintain both the string "set" and a dictionary of how many of each type of character you have left to use. The for loop would change to a loop over the keys of the dictonary as these would be the unique characters that are allowed at that position.
This would have complexity equal to the number of output strings, and only a very small extra amount of storage to hold the dictionary.
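A short Python sketch of this third option (names are mine): the loop at each position runs over the distinct characters that still have a non-zero remaining count, so no duplicate output is ever produced:

from collections import Counter

def permute_unique(counts, prefix, out):
    # counts: remaining multiplicity of each character; prefix: what is fixed so far
    if not counts:
        out.append(prefix)
        return
    for ch in list(counts):                   # one branch per distinct remaining character
        counts[ch] -= 1
        if counts[ch] == 0:
            del counts[ch]
        permute_unique(counts, prefix + ch, out)
        counts[ch] = counts.get(ch, 0) + 1    # restore for the next branch

def unique_permutations_by_counts(s):
    out = []
    permute_unique(Counter(s), "", out)
    return out

print(unique_permutations_by_counts("aab"))  # ['aab', 'aba', 'baa']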
Simply insert each permutation into a set; it automatically removes duplicates. Declare the set s as a global variable.
set <string>s;
void permute(string a, int l, int r) {
int i;
if (l == r)
s.insert(a);
else
{
for (i = l; i <= r; i++)
{
swap((a[l]), (a[i]));
permute(a, l+1, r);
swap((a[l]), (a[i])); //backtrack
}
}
}
Finally print using the function
void printr()
{
    set<string>::iterator itr;
    for (itr = s.begin(); itr != s.end(); ++itr)
    {
        cout << '\t' << *itr;
    }
}
The key is not to swap the same character twice. So, you could use an unordered_set to memorize which characters have been swapped.
void permute(string& input, int begin, vector<string>& output) {
if (begin == input.size()){
output.push_back(input);
}
else {
unordered_set<char> swapped;
for(int i = begin; i < input.size(); i++) {
// Do not swap a character that has been swapped
if(swapped.find(input[i]) == swapped.end()){
swapped.insert(input[i]);
swap(input[begin], input[i]);
permute(input, begin+1, output);
swap(input[begin], input[i]);
}
}
}
}
You can go through your original code by hand, and you will find that the cases where duplicates occur are those that "swap with a character which has already been swapped."
Ex: input = "BAA"
index = 0, i = 0, input = "BAA"
----> index = 1, i = 1, input = "BAA"
----> index = 1, i = 2, input = "BAA" (duplicate)
index = 0, i = 1, input = "ABA"
----> index = 1, i = 1, input = "ABA"
----> index = 1, i = 2, input = "AAB"
index = 0, i = 2, input = "AAB"
----> index = 1, i = 1, input = "AAB" (duplicate)
----> index = 1, i = 2, input = "ABA" (duplicate)