Top 10 Frequencies in a Hash Table with Linked Lists - c++

The code below prints the highest frequency it can find in my hash table (which is an array of linked lists) 10 times. I need it to print the top 10 frequencies in my hash table instead. I do not know how to do this (code examples would be great; plain English logic or pseudocode is just as great).
I create a temporary hashing list called 'tmp' which is pointing to my hash table 'hashtable'
A while loop then goes through the list and looks for the highest frequency, which is an int 'tmp->freq'
The loop continues copying the highest frequency it finds into the variable 'topfreq' until it reaches the end of the linked lists in the hash table.
My 'node' is a struct comprising the variables 'freq' (int) and 'word' (a 128-char array). When the loop has nothing else to search for, it prints these two values on screen.
The problem is, I can't wrap my head around figuring out how to find the next lowest number from the number I've just found (and this can include another node with the same freq value, so I have to check that the word is not the same too).
void toptenwords()
{
int topfreq = 0;
int minfreq = 0;
char topword[SIZEOFWORD];
for(int p = 0; p < 10; p++) // We need the top 10 frequencies... so we do this 10 times
{
for(int m = 0; m < HASHTABLESIZE; m++) // Go through the entire hash table
{
node* tmp;
tmp = hashtable[m];
while(tmp != NULL) // Walk through the entire linked list
{
if(tmp->freq > topfreq) // If the frequency on hand is larger than the one found, store...
{
topfreq = tmp->freq;
strcpy(topword, tmp->word);
}
tmp = tmp->next;
}
}
cout << topfreq << "\t" << topword << endl;
}
}
Any and all help would be GREATLY appreciated :)

Keep an array of 10 node pointers, and insert each node into the array, maintaining the array in sorted order. The eleventh node in the array is overwritten on each iteration and contains junk.
void toptenwords()
{
int topfreq = 0;
int minfreq = 0;
node *topwords[11];
int current_topwords = 0;
for(int m = 0; m < HASHTABLESIZE; m++) // Go through the entire hash table
{
node* tmp;
tmp = hashtable[m];
while(tmp != NULL) // Walk through the entire linked list
{
topwords[current_topwords] = tmp;
current_topwords++;
for(int i = current_topwords - 1; i > 0; i--)
{
if(topwords[i]->freq > topwords[i - 1]->freq)
{
node *temp = topwords[i - 1];
topwords[i - 1] = topwords[i];
topwords[i] = temp;
}
else break;
}
if(current_topwords > 10) current_topwords = 10;
tmp = tmp->next;
}
}
}

I would maintain a set of words already used and change the inner-most if condition to test for frequency greater than previous top frequency AND tmp->word not in list of words already used.

When iterating over the hash table (and then over each linked list contained therein) keep a self balancing binary tree (std::set) as a "result" list. As you come across each frequency, insert it into the list, then truncate the list if it has more than 10 entries. When you finish, you'll have a set (sorted list) of the top ten frequencies, which you can manipulate as you desire.
There may be performance gains to be had by using sets instead of linked lists in the hash table itself, but you can work that out for yourself.

Step 1 (Inefficient):
Move the vector into a sorted container via insertion sort, but insert into a container (e.g. linkedlist or vector) of size 10, and drop any elements that fall off the bottom of the list.
Step 2 (Efficient):
Same as step 1, but keep track of the size of the item at the bottom of the list, and skip the insertion step entirely if the current item is too small.

Suppose there are n words in total, and we need the most-frequent k words (here, k = 10).
If n is much larger than k, the most efficient way I know of is to maintain a min-heap (i.e. the top element has the minimum frequency of all elements in the heap). On each iteration, you insert the next frequency into the heap, and if the heap now contains k+1 elements, you remove the smallest. This way, the heap is maintained at a size of k elements throughout, containing at any time the k highest-frequency elements seen so far. At the end of processing, read out the k highest-frequency elements in increasing order.
Time complexity: For each of n words, we do two things: insert into a heap of size at most k, and remove the minimum element. Each operation costs O(log k) time, so the entire loop takes O(nlog k) time. Finally, we read out the k elements from a heap of size at most k, taking O(klog k) time, for a total time of O((n+k)log k). Since we know that k < n, O(klog k) is at worst O(nlog k), so this can be simplified to just O(nlog k).
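A minimal sketch of the bounded min-heap idea described above, assuming a hypothetical `Node` type shaped like the question's `node` (the names `Node`, `Entry` and `topK` are mine, not from the original code). `std::priority_queue` with `std::greater` behaves as a min-heap, so the smallest of the k kept entries is always on top:

```cpp
#include <cstddef>
#include <functional>
#include <queue>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for the question's node type.
struct Node {
    int freq;
    std::string word;
    Node* next;
};

using Entry = std::pair<int, std::string>;  // (frequency, word)

// Returns up to k (frequency, word) pairs, highest frequency first.
std::vector<Entry> topK(const std::vector<Node*>& table, std::size_t k) {
    // std::greater turns priority_queue into a min-heap: top() is the smallest.
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> heap;
    for (const Node* head : table) {
        for (const Node* n = head; n != nullptr; n = n->next) {
            heap.emplace(n->freq, n->word);
            if (heap.size() > k) heap.pop();  // drop the current minimum
        }
    }
    std::vector<Entry> result(heap.size());
    // Popping yields increasing order, so fill the result back to front.
    for (std::size_t i = result.size(); i-- > 0;) {
        result[i] = heap.top();
        heap.pop();
    }
    return result;
}
```

Calling `topK(table, 10)` on the table's bucket heads would give the ten most frequent words. Ties are broken by comparing the words themselves, which is only a side effect of using `std::pair` comparison.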

A hash table containing linked lists of words seems like a peculiar data structure to use if the goal is to accumulate word frequencies.
Nonetheless, the efficient way to get the ten highest-frequency nodes is to insert each into a priority queue/heap, such as a Fibonacci heap, which has O(1) amortized insertion time and O(log n) amortized time for removing the top element. Assuming that iteration over the hash table is fast, this method has a runtime which is O(n×O(1) + 10×O(log n)) ≡ O(n).

The absolute fastest way to do this would be to use a SoftHeap. Using a SoftHeap, you can find the top 10 items in O(n) time whereas every other solution posted here would take O(n lg n) time.
http://en.wikipedia.org/wiki/Soft_heap
This wikipedia article shows how to find the median in O(n) time using a softheap, and the top 10 is simply a subset of the median problem. You could then sort the items that were in the top 10 if you needed them in order, and since you're always at most sorting 10 items, it's still O(n) time.

Related

Order Notation of pop-max in a binary heap

I need to write a pop_max function for a binary heap that removes the max element. The solution given is as below:
void pop_max() {
assert(!m_heap.empty());
int tmp = (size()+1)/2;
for (int i = tmp+1; i < size(); i++) {
if (m_heap[tmp] < m_heap[i])
tmp = i;
}
m_heap[tmp] = m_heap.back();
m_heap.pop_back();
this->percolate_up(tmp);
}
The solution also says the number of "nodes" visited is n+log(n) where n is the total number of nodes in the heap. It then goes on to say the running time is O(n).
This makes zero sense to me though.
Their solution finds the first leaf node int tmp = (size()+1)/2; then goes through the remaining leaf nodes.
Is their solution not n/2 nodes visited and O(n/2) for running time as well? Could someone explain why this might be?
Edit: O(n/2) = O(n). But what about the number of nodes visited? I still don't quite understand how it is O(n+log(n))
O(n/2) is equal to O(n)
The number of nodes visited is on the order of n for scanning the leaf nodes (about n/2 of them), plus log(n) for the percolate up!
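To make the two costs concrete, here is a rough sketch of the same idea, assuming a 0-indexed binary min-heap stored in a `std::vector<int>` (the question's class and its `percolate_up` are not shown, so `popMax` below is a hypothetical free function with the percolate-up step written inline). The leaf scan touches about n/2 elements and the percolate up touches at most about log(n):

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Removes the maximum from a 0-indexed binary min-heap.
void popMax(std::vector<int>& heap) {
    assert(!heap.empty());
    const std::size_t n = heap.size();
    // In a min-heap the maximum is always a leaf; leaves are indices n/2 .. n-1.
    std::size_t maxIdx = n / 2;
    for (std::size_t i = maxIdx + 1; i < n; ++i) {  // about n/2 comparisons
        if (heap[maxIdx] < heap[i]) maxIdx = i;
    }
    // Overwrite the maximum with the last element, then shrink the heap.
    heap[maxIdx] = heap.back();
    heap.pop_back();
    // Percolate up: at most one step per tree level, i.e. about log(n) steps.
    std::size_t i = maxIdx;
    while (i > 0 && i < heap.size()) {
        const std::size_t parent = (i - 1) / 2;
        if (heap[parent] <= heap[i]) break;
        std::swap(heap[parent], heap[i]);
        i = parent;
    }
}
```

The `i < heap.size()` check covers the case where the maximum was already the last element, so nothing remains to percolate.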

Can you do Top-K frequent Element better than O(nlogn) ? (code attached) [duplicate]

Input: A positive integer K and a big text. The text can be viewed as a word sequence, so we don't have to worry about how to break it down into words.
Output: The most frequent K words in the text.
My thinking is like this.
Use a hash table to record every word's frequency while traversing the whole word sequence. In this phase, the key is "word" and the value is "word-frequency". This takes O(n) time.
Sort the (word, word-frequency) pairs with "word-frequency" as the key. This takes O(n*lg(n)) time with a normal sorting algorithm.
After sorting, we just take the first K words. This takes O(K) time.
To summarize, the total time is O(n+nlg(n)+K). Since K is surely smaller than N, it is actually O(nlg(n)).
We can improve this. Actually, we just want the top K words; the other words' frequencies are of no concern to us. So, we can use "partial heap sorting". For steps 2) and 3), we don't do a full sort. Instead, we change them to be
2') build a heap of (word, word-frequency) pairs with "word-frequency" as key. It takes O(n) time to build a heap;
3') extract the top K words from the heap. Each extraction is O(lg(n)). So, the total time is O(k*lg(n)).
To summarize, this solution costs O(n+k*lg(n)) time.
This is just my thought. I haven't found a way to improve step 1).
I hope some Information Retrieval experts can shed more light on this question.
This can be done in O(n) time
Solution 1:
Steps:
Count words and hash it, which will end up in the structure like this
var hash = {
"I" : 13,
"like" : 3,
"meow" : 3,
"geek" : 3,
"burger" : 2,
"cat" : 1,
"foo" : 100,
...
...
Traverse through the hash and find the most frequently used word (in this case "foo" 100), then create the array of that size
Then we can traverse the hash again and use the number of occurrences of words as array index, if there is nothing in the index, create an array else append it in the array. Then we end up with an array like:
0 1 2 3 100
[[ ],[cat],[burger],[like, meow, geek],[]...[foo]]
Then just traverse the array from the end, and collect the k words.
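The steps of Solution 1 can be sketched in C++ roughly as follows. This assumes the word counts are already in a `std::unordered_map` and that every count is at least 1; the function name `topKByBuckets` is made up for the illustration:

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Returns up to k words, most frequent first (order within a bucket is unspecified).
std::vector<std::string> topKByBuckets(
        const std::unordered_map<std::string, int>& counts, std::size_t k) {
    // First pass: find the largest count so we know how many buckets to create.
    int maxCount = 0;
    for (const auto& kv : counts) maxCount = std::max(maxCount, kv.second);

    // buckets[c] holds every word that occurs exactly c times.
    std::vector<std::vector<std::string>> buckets(maxCount + 1);
    for (const auto& kv : counts) buckets[kv.second].push_back(kv.first);

    // Walk the buckets from the highest count down and collect k words.
    std::vector<std::string> result;
    for (int c = maxCount; c >= 1 && result.size() < k; --c) {
        for (const std::string& w : buckets[c]) {
            if (result.size() == k) break;
            result.push_back(w);
        }
    }
    return result;
}
```

Since no word can occur more often than there are words in the text, the bucket array has at most n+1 slots, which is what keeps the whole thing linear in n.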
Solution 2:
Steps:
Same as above
Use a min-heap and keep its size at most k. For each word in the hash, compare its occurrence count with the heap minimum: 1) if the heap holds fewer than k entries, insert the word; 2) otherwise, if the count is greater than the minimum, remove the minimum and insert the word; 3) otherwise, skip the word.
After traversing the hash, we just convert the min-heap to an array and return the array.
You're not going to get generally better runtime than the solution you've described. You have to do at least O(n) work to evaluate all the words, and then O(k) extra work to find the top k terms.
If your problem set is really big, you can use a distributed solution such as map/reduce. Have n map workers count frequencies on 1/nth of the text each, and for each word, send it to one of m reducer workers calculated based on the hash of the word. The reducers then sum the counts. Merge sort over the reducers' outputs will give you the most popular words in order of popularity.
A small variation on your solution yields an O(n) algorithm if we don't care about ranking the top K, and a O(n+k*lg(k)) solution if we do. I believe both of these bounds are optimal within a constant factor.
The optimization here comes again after we run through the list, inserting into the hash table. We can use the median of medians algorithm to select the Kth largest element in the list. This algorithm is provably O(n).
After selecting the Kth largest element (by frequency), we partition the list around that element just as in quicksort, placing the larger frequencies on the left. This is obviously also O(n). Anything on the "left" side of the pivot is in our group of K elements, so we're done (we can simply throw away everything else as we go along).
So this strategy is:
Go through each word and insert it into a hash table: O(n)
Select the Kth largest element: O(n)
Partition around that element: O(n)
If you want to rank the K elements, simply sort them with any efficient comparison sort in O(k * lg(k)) time, yielding a total run time of O(n+k * lg(k)).
The O(n) time bound is optimal within a constant factor because we must examine each word at least once.
The O(n + k * lg(k)) time bound is also optimal because there is no comparison-based way to sort k elements in less than k * lg(k) time.
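As a rough illustration of the select-then-partition strategy, the sketch below uses `std::nth_element`, which performs selection and partitioning in one call. Note that this is a swapped-in technique: the standard library only promises linear time on average, not the worst-case guarantee of median of medians. The names `WordCount` and `topKBySelection` are assumptions for the example:

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

using WordCount = std::pair<std::string, int>;  // (word, frequency)

// Returns the k most frequent entries, ranked from most to least frequent.
std::vector<WordCount> topKBySelection(std::vector<WordCount> items, std::size_t k) {
    k = std::min(k, items.size());
    auto byCountDesc = [](const WordCount& a, const WordCount& b) {
        return a.second > b.second;
    };
    // Selection + partition: afterwards the k highest counts occupy the
    // first k positions, in no particular order.
    std::nth_element(items.begin(), items.begin() + k, items.end(), byCountDesc);
    items.erase(items.begin() + k, items.end());
    // Optional ranking step: O(k lg k).
    std::sort(items.begin(), items.end(), byCountDesc);
    return items;
}
```

If the ranking is not needed, the final `std::sort` can simply be dropped.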
If your "big word list" is big enough, you can simply sample and get estimates. Otherwise, I like hash aggregation.
Edit:
By sample I mean choose some subset of pages and calculate the most frequent word in those pages. Provided you select the pages in a reasonable way and select a statistically significant sample, your estimates of the most frequent words should be reasonable.
This approach is really only reasonable if you have so much data that processing it all is just kind of silly. If you only have a few megs, you should be able to tear through the data and calculate an exact answer without breaking a sweat rather than bothering to calculate an estimate.
You can cut down the time further by partitioning using the first letter of words, then partitioning the largest multi-word set using the next character until you have k single-word sets. You would use a sort of 256-way tree with lists of partial/complete words at the leaves. You would need to be very careful not to cause string copies everywhere.
This algorithm is O(m), where m is the number of characters. It avoids that dependence on k, which is very nice for large k [by the way your posted running time is wrong, it should be O(n*lg(k)), and I'm not sure what that is in terms of m].
If you run both algorithms side by side you will get what I'm pretty sure is an asymptotically optimal O(min(m, n*lg(k))) algorithm, but mine should be faster on average because it doesn't involve hashing or sorting.
You have a bug in your description: Counting takes O(n) time, but sorting takes O(m*lg(m)), where m is the number of unique words. This is usually much smaller than the total number of words, so probably should just optimize how the hash is built.
Your problem is same as this-
http://www.geeksforgeeks.org/find-the-k-most-frequent-words-from-a-file/
Use a trie and a min-heap to solve it efficiently.
If what you're after is the list of k most frequent words in your text for any practical k and for any natural language, then the complexity of your algorithm is not relevant.
Just sample, say, a few million words from your text, process that with any algorithm in a matter of seconds, and the most frequent counts will be very accurate.
As a side note, the complexity of the dummy algorithm (1. count all 2. sort the counts 3. take the best) is O(n+m*log(m)), where m is the number of different words in your text. log(m) is much smaller than (n/m), so it remains O(n).
Practically, the long step is counting.
Utilize memory efficient data structure to store the words
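Use a heap bounded to K entries to find the top K frequent words (the field is named maxHeap below, but its comparator orders entries by ascending frequency, so it behaves as a min-heap).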
Here is the code
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;
import com.nadeem.app.dsa.adt.Trie;
import com.nadeem.app.dsa.adt.Trie.TrieEntry;
import com.nadeem.app.dsa.adt.impl.TrieImpl;
public class TopKFrequentItems {
private int maxSize;
private Trie trie = new TrieImpl();
private PriorityQueue<TrieEntry> maxHeap;
public TopKFrequentItems(int k) {
this.maxSize = k;
this.maxHeap = new PriorityQueue<TrieEntry>(k, maxHeapComparator());
}
private Comparator<TrieEntry> maxHeapComparator() {
return new Comparator<TrieEntry>() {
@Override
public int compare(TrieEntry o1, TrieEntry o2) {
return o1.frequency - o2.frequency;
}
};
}
public void add(String word) {
this.trie.insert(word);
}
public List<TopK> getItems() {
for (TrieEntry trieEntry : this.trie.getAll()) {
if (this.maxHeap.size() < this.maxSize) {
this.maxHeap.add(trieEntry);
} else if (this.maxHeap.peek().frequency < trieEntry.frequency) {
this.maxHeap.remove();
this.maxHeap.add(trieEntry);
}
}
List<TopK> result = new ArrayList<TopK>();
for (TrieEntry entry : this.maxHeap) {
result.add(new TopK(entry));
}
return result;
}
public static class TopK {
public String item;
public int frequency;
public TopK(String item, int frequency) {
this.item = item;
this.frequency = frequency;
}
public TopK(TrieEntry entry) {
this(entry.word, entry.frequency);
}
@Override
public String toString() {
return String.format("TopK [item=%s, frequency=%s]", item, frequency);
}
@Override
public int hashCode() {
final int prime = 31;
int result = 1;
result = prime * result + frequency;
result = prime * result + ((item == null) ? 0 : item.hashCode());
return result;
}
@Override
public boolean equals(Object obj) {
if (this == obj)
return true;
if (obj == null)
return false;
if (getClass() != obj.getClass())
return false;
TopK other = (TopK) obj;
if (frequency != other.frequency)
return false;
if (item == null) {
if (other.item != null)
return false;
} else if (!item.equals(other.item))
return false;
return true;
}
}
}
Here are the unit tests
@Test
public void test() {
TopKFrequentItems stream = new TopKFrequentItems(2);
stream.add("hell");
stream.add("hello");
stream.add("hello");
stream.add("hello");
stream.add("hello");
stream.add("hello");
stream.add("hero");
stream.add("hero");
stream.add("hero");
stream.add("hello");
stream.add("hello");
stream.add("hello");
stream.add("home");
stream.add("go");
stream.add("go");
assertThat(stream.getItems()).hasSize(2).contains(new TopK("hero", 3), new TopK("hello", 8));
}
For more details, refer to this test case
Use a hash table to record every word's frequency while traversing the whole word sequence. In this phase, the key is "word" and the value is "word-frequency". This takes O(n) time. This is the same as what everyone explained above.
While inserting into the hashmap, keep a TreeSet (specific to Java; there are equivalent implementations in other languages) of size 10 (k = 10) holding the top 10 frequent words. While its size is less than 10, keep adding to it. Once its size equals 10, check whether the inserted element is greater than the minimum element, i.e. the first element. If so, remove the minimum and insert the new element.
To restrict the size of treeset see this link
Suppose we have a word sequence "ad" "ad" "boy" "big" "bad" "com" "come" "cold". And K=2.
as you mentioned "partitioning using the first letter of words", we got
("ad", "ad") ("boy", "big", "bad") ("com" "come" "cold")
"then partitioning the largest multi-word set using the next character until you have k single-word sets."
it will partition ("boy", "big", "bad") ("com" "come" "cold"), the first partition ("ad", "ad") is missed, while "ad" is actually the most frequent word.
Perhaps I misunderstand your point. Can you please detail your process about partition?
I believe this problem can be solved by an O(n) algorithm. We could make the sorting on the fly. In other words, the sorting in that case is a sub-problem of the traditional sorting problem since only one counter gets incremented by one every time we access the hash table. Initially, the list is sorted since all counters are zero. As we keep incrementing counters in the hash table, we bookkeep another array of hash values ordered by frequency as follows. Every time we increment a counter, we check its index in the ranked array and check if its count exceeds its predecessor in the list. If so, we swap these two elements. As such we obtain a solution that is at most O(n) where n is the number of words in the original text.
I was struggling with this as well and got inspired by @aly. Instead of sorting afterwards, we can just maintain a presorted list of words (List<Set<String>>), where a word is in the set at position X and X is the current count of the word. In general, here's how it works:
for each word, store its occurrence count in a map: Map<String, Integer>.
then, based on the count, remove it from the previous count set, and add it into the new count set.
The drawback of this is that the list may be big; this can be optimized by using a TreeMap<Integer, Set<String>>, but that adds some overhead. Ultimately we can use a mix of HashMap or our own data structure.
The code
public class WordFrequencyCounter {
private static final int WORD_SEPARATOR_MAX = 32; // UNICODE 0000-001F: control chars
Map<String, MutableCounter> counters = new HashMap<String, MutableCounter>();
List<Set<String>> reverseCounters = new ArrayList<Set<String>>();
private static class MutableCounter {
int i = 1;
}
public List<String> countMostFrequentWords(String text, int max) {
int lastPosition = 0;
int length = text.length();
for (int i = 0; i < length; i++) {
char c = text.charAt(i);
if (c <= WORD_SEPARATOR_MAX) {
if (i != lastPosition) {
String word = text.substring(lastPosition, i);
MutableCounter counter = counters.get(word);
if (counter == null) {
counter = new MutableCounter();
counters.put(word, counter);
} else {
Set<String> strings = reverseCounters.get(counter.i);
strings.remove(word);
counter.i ++;
}
addToReverseLookup(counter.i, word);
}
lastPosition = i + 1;
}
}
List<String> ret = new ArrayList<String>();
int count = 0;
for (int i = reverseCounters.size() - 1; i >= 0; i--) {
Set<String> strings = reverseCounters.get(i);
for (String s : strings) {
ret.add(s);
System.out.print(s + ":" + i);
count++;
if (count == max) break;
}
if (count == max) break;
}
return ret;
}
private void addToReverseLookup(int count, String word) {
while (count >= reverseCounters.size()) {
reverseCounters.add(new HashSet<String>());
}
Set<String> strings = reverseCounters.get(count);
strings.add(word);
}
}
I just found another solution for this problem, but I am not sure it is right.
Solution:
Use a Hash table to record all words' frequency T(n) = O(n)
Choose first k elements of hash table, and restore them in one buffer (whose space = k). T(n) = O(k)
Each time, firstly we need find the current min element of the buffer, and just compare the min element of the buffer with the (n - k) elements of hash table one by one. If the element of hash table is greater than this min element of buffer, then drop the current buffer's min, and add the element of the hash table. So each time we find the min one in the buffer need T(n) = O(k), and traverse the whole hash table need T(n) = O(n - k). So the whole time complexity for this process is T(n) = O((n-k) * k).
After traverse the whole hash table, the result is in this buffer.
The whole time complexity: T(n) = O(n) + O(k) + O(kn - k^2) = O(kn + n - k^2 + k). Since, k is really smaller than n in general. So for this solution, the time complexity is T(n) = O(kn). That is linear time, when k is really small. Is it right? I am really not sure.
Try to think of a special data structure for this kind of problem. In this case, a special kind of tree such as a trie stores strings in a specific way and is very efficient. A second way is to build your own solution, like counting words. I guess this TB of data would be in English, where we have around 600,000 words in general, so it would be possible to store only those words and count which strings are repeated; this solution will also need a regex to eliminate some special characters. The first solution will be faster, I'm pretty sure.
http://en.wikipedia.org/wiki/Trie
This is an interesting idea to search and I could find this paper related to Top-K https://icmi.cs.ucsb.edu/research/tech_reports/reports/2005-23.pdf
Also there is an implementation of it here.
Simplest code to get the occurrence of most frequently used word.
function strOccurence(str){
var arr = str.split(" ");
var length = arr.length,temp = {},max;
while(length--){
if(temp[arr[length]] == undefined && arr[length].trim().length > 0)
{
temp[arr[length]] = 1;
}
else if(arr[length].trim().length > 0)
{
temp[arr[length]] = temp[arr[length]] + 1;
}
}
console.log(temp);
var max = [];
for(i in temp)
{
max[temp[i]] = i;
}
console.log(max[max.length - 1])
//if you want second highest
console.log(max[max.length - 2])
}
In these situations, I recommend to use Java built-in features. Since, they are already well tested and stable. In this problem, I find the repetitions of the words by using HashMap data structure. Then, I push the results to an array of objects. I sort the object by Arrays.sort() and print the top k words and their repetitions.
import java.io.*;
import java.lang.reflect.Array;
import java.util.*;
public class TopKWordsTextFile {
static class SortObject implements Comparable<SortObject>{
private String key;
private int value;
public SortObject(String key, int value) {
super();
this.key = key;
this.value = value;
}
@Override
public int compareTo(SortObject o) {
//descending order
return o.value - this.value;
}
}
public static void main(String[] args) {
HashMap<String,Integer> hm = new HashMap<>();
int k = 1;
try {
BufferedReader br = new BufferedReader(new InputStreamReader(new FileInputStream("words.in")));
String line;
while ((line = br.readLine()) != null) {
// process the line.
//System.out.println(line);
String[] tokens = line.split(" ");
for(int i=0; i<tokens.length; i++){
if(hm.containsKey(tokens[i])){
//If the key already exists
Integer prev = hm.get(tokens[i]);
hm.put(tokens[i],prev+1);
}else{
//If the key doesn't exist
hm.put(tokens[i],1);
}
}
}
//Close the input
br.close();
//Print all words with their repetitions. You can use 3 for printing top 3 words.
k = hm.size();
// Get a set of the entries
Set set = hm.entrySet();
// Get an iterator
Iterator i = set.iterator();
int index = 0;
// Display elements
SortObject[] objects = new SortObject[hm.size()];
while(i.hasNext()) {
Map.Entry e = (Map.Entry)i.next();
//System.out.print("Key: "+e.getKey() + ": ");
//System.out.println(" Value: "+e.getValue());
String tempS = (String) e.getKey();
int tempI = (int) e.getValue();
objects[index] = new SortObject(tempS,tempI);
index++;
}
System.out.println();
//Sort the array
Arrays.sort(objects);
//Print top k
for(int j=0; j<k; j++){
System.out.println(objects[j].key+":"+objects[j].value);
}
} catch (IOException e) {
e.printStackTrace();
}
}
}
For more information, please visit https://github.com/m-vahidalizadeh/foundations/blob/master/src/algorithms/TopKWordsTextFile.java. I hope it helps.
C++11 implementation of the above idea
class Solution {
public:
vector<int> topKFrequent(vector<int>& nums, int k) {
unordered_map<int,int> map;
for(int num : nums){
map[num]++;
}
vector<int> res;
// we use the priority queue, like the max-heap , we will keep (size-k) smallest elements in the queue
// pair<first, second>: first is frequency, second is number
priority_queue<pair<int,int>> pq;
for(auto it = map.begin(); it != map.end(); it++){
pq.push(make_pair(it->second, it->first));
// once the size is bigger than size-k, we pop the top, which is one of the top k frequent elements
// (both sides are cast to int so the comparison stays signed even when k > map.size())
if((int)pq.size() > (int)map.size() - k){
res.push_back(pq.top().second);
pq.pop();
}
}
return res;
}
};

Time Limit Exceeded for Merge k Sorted Lists(leetcode)

merge-k-sorted-lists
Merge k sorted linked lists and return it as one sorted list. Analyze and describe its complexity.
My code:
ListNode *mergeTwoLists(ListNode *p1, ListNode *p2) {
ListNode dummy(-1);
ListNode *head = &dummy;
while(p1 != nullptr && p2 != nullptr) {
if (p1->val < p2->val) {
head->next = p1;
head = head->next;
p1 = p1->next;
} else {
head->next = p2;
head = head->next;
p2 = p2->next;
}
}
if (p1 != nullptr) {
head->next = p1;
}
if (p2 != nullptr) {
head->next = p2;
}
//head->next = nullptr;
return dummy.next;
}
ListNode *mergeKLists(vector<ListNode *> &lists) {
if (lists.size() == 0) return nullptr;
if (lists.size() == 1) return lists[0];
ListNode *p1, *p2, *p;
while (lists.size() > 1) {
p1 = lists.back();
lists.pop_back();
p2 = lists.back();
lists.pop_back();
p = mergeTwoLists(p1, p2);
lists.push_back(p);
}
return lists[0];
}
I always get Time Limit Exceeded. How should I change the program?
What you are doing has complexity O(nk^2), where n is the size of each list, because you merge two lists at a time. Merging the first two lists takes 2n operations and yields a list of size 2n. Merging that with the third list takes 3n operations and yields a list of size 3n, and so on, so the total number of operations is 2n+3n+...+kn (an arithmetic progression), which is O(nk^2). Instead, take a priority queue (min-heap) and insert the first element of each of the k lists. Then repeatedly take the smallest element from the priority queue, put it in your new list, remove it from the priority queue, and insert the next element of the list to which it belonged. Every element is inserted into and deleted from the priority queue exactly once, and there are nk elements in total. Each insert or delete costs O(log(number_of_elements_in_queue)), and the queue holds at most k elements at any time, so the complexity is O(nklog(k)).
For a more detailed explanation plus a code have a look here : Merging k sorted lists. I assume this would be enough to get AC on leetcode :).
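A minimal sketch of the priority-queue merge described above. The `ListNode` definition here is an assumption modelled on the question's usage (`val`, `next`, an `int` constructor), and `mergeKListsHeap` is a hypothetical name:

```cpp
#include <queue>
#include <vector>

// Assumed shape of the list node, based on how the question uses it.
struct ListNode {
    int val;
    ListNode* next;
    explicit ListNode(int x) : val(x), next(nullptr) {}
};

ListNode* mergeKListsHeap(const std::vector<ListNode*>& lists) {
    // The comparator is inverted so the node with the smallest val is on top.
    auto greaterVal = [](const ListNode* a, const ListNode* b) {
        return a->val > b->val;
    };
    std::priority_queue<ListNode*, std::vector<ListNode*>, decltype(greaterVal)>
        heap(greaterVal);
    for (ListNode* head : lists) {
        if (head != nullptr) heap.push(head);  // seed with each list's first node
    }
    ListNode dummy(0);
    ListNode* tail = &dummy;
    while (!heap.empty()) {
        ListNode* smallest = heap.top();
        heap.pop();
        tail->next = smallest;  // append to the merged list
        tail = smallest;
        // Replace the removed node with its successor from the same list.
        if (smallest->next != nullptr) heap.push(smallest->next);
    }
    return dummy.next;
}
```

The heap never holds more than one node per input list, so each push or pop costs O(log k).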
Your problem is that you are doing unbalanced merges. If each list has n elements to start with and merge(a,b) means you merge lists of length a and b (which takes time O(a+b)), then the operations you are doing are
merge(n,n)
merge(2n,n)
merge(3n,n)
merge(4n,n)
....
and so you're paying a lot of cost iterating over the long list so many times; with k lists you're doing about (1/2) k^2 n work.
You could look for a specialized imbalance merging algorithm, but a much easier approach would be to just reorganize your work to merge lists of similar size. If you started with k lists each of n elements, then you would do
k/2 instances of `merge(n,n)`
k/4 instances of `merge(2n,2n)`
...
1 instance of `merge(nk/2, nk/2)`
Each step takes nk time, and there are lg(k) steps, for a total cost of nk lg(k).
If k isn't a power of 2 or the lists are not all the same length, there are lots of things you can do to try and minimize the overall amount of work, but a very simple way is to make lists a deque instead of a vector, and for each merge you pop two lists off the back and push the result onto the front instead of the back. Another simple optimization on this is to first sort the lists by length.
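The deque variant might look like the sketch below. `ListNode` is again an assumed definition matching the question's usage, and `mergeTwo` and `mergeKListsBalanced` are illustrative names rather than the asker's functions:

```cpp
#include <deque>
#include <vector>

// Assumed shape of the list node, based on how the question uses it.
struct ListNode {
    int val;
    ListNode* next;
    explicit ListNode(int x) : val(x), next(nullptr) {}
};

// Standard two-way merge of sorted lists; either argument may be null.
ListNode* mergeTwo(ListNode* a, ListNode* b) {
    ListNode dummy(0);
    ListNode* tail = &dummy;
    while (a != nullptr && b != nullptr) {
        if (b->val < a->val) {
            tail->next = b;
            b = b->next;
        } else {
            tail->next = a;
            a = a->next;
        }
        tail = tail->next;
    }
    tail->next = (a != nullptr) ? a : b;  // append whatever is left
    return dummy.next;
}

ListNode* mergeKListsBalanced(const std::vector<ListNode*>& lists) {
    std::deque<ListNode*> work(lists.begin(), lists.end());
    if (work.empty()) return nullptr;
    while (work.size() > 1) {
        ListNode* a = work.back();
        work.pop_back();
        ListNode* b = work.back();
        work.pop_back();
        // Pushing to the FRONT means a freshly merged (long) list waits until
        // all the shorter lists ahead of it have been merged too.
        work.push_front(mergeTwo(a, b));
    }
    return work.front();
}
```

The only change relative to the question's loop is the `push_front`, which is what keeps the merges roughly balanced.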
The other answer is likely better when k is not too large. When k is rather large you're probably better off with a hybrid algorithm: you pick an appropriate m and you organize the total work as I've described, but rather than merging 2 lists at a time, you merge m lists at a time.
My first two guesses at an appropriate m are ceil(sqrt(k)) and the largest value for which the other answer's algorithm is efficient for an m-way merge.
(if for some strange reason m is still very large, then you do the m-way merge with the hybrid algorithm)
Why do I make the predictions above? The other answer only makes one pass through the data, so as long as your CPU can efficiently maintain a priority queue of length k as well as read from k lists at the same time, it is surely better than my algorithm which makes many passes through the data.
But when k gets too large, you run into problems:
Your TLB might not have enough entries to read from k lists at a time
Your cache might not be big enough to store a cache line or two from all of k of the lists as well as fit a priority queue
cache misses and especially TLB misses will degrade performance. The hybrid algorithm reorganizes the work so that you keep the benefit of my algorithmic approach (balanced merges) while nearly all of the work is done with the efficient m-way merge from the other answer.

Is this a shell sort or an insertion sort?

I'm just starting to learn about sorting algorithms and found one online. At first I thought it was a shell sort, but it's missing that distinct interval of "k" and the halving of the array, so I'm not sure if it is or not. My second guess is an insertion sort, but I'm just here to double-check:
for(n = 1; n < num; n++)
{
key = A[n];
k = n;
while((k > 0) && (A[k-1] > key))
{
A[k] = A[k-1];
k = k-1;
}
A[k] = key;
}
Also if you can explain why that'd be helpful as well
Shell Sort consists of many insertion sorts that are performed on sub-arrays of the original array.
The code you have provided is insertion sort.
To get shell sort, it would be roughly having other fors around your code changing h (that gap in shell sort) and starting index of the sub-array and inside, instead of moving from k to k-1, you move from k to k+h (or k-h depending on which direction you do the insertion sort)
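For comparison, here is a rough sketch of what the question's loop would look like wrapped into a shell sort. It keeps the question's variable names `A`, `n`, `k` and `key`, uses `h` for the gap, and assumes the simple "halve the gap each round" sequence, which is only one of several possible gap sequences:

```cpp
#include <cstddef>
#include <vector>

void shellSort(std::vector<int>& A) {
    // Outer loop: shrink the gap h until it reaches 1.
    for (std::size_t h = A.size() / 2; h > 0; h /= 2) {
        // Inner part: the question's insertion sort, but stepping by h.
        for (std::size_t n = h; n < A.size(); ++n) {
            int key = A[n];
            std::size_t k = n;
            while (k >= h && A[k - h] > key) {
                A[k] = A[k - h];  // shift the larger element h places forward
                k -= h;
            }
            A[k] = key;
        }
    }
}
```

With h fixed at 1 the inner part is exactly the code from the question, which is why that code on its own is a plain insertion sort.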
I think you're right, that does look a lot like an insertion sort.
This fragment assumes A[0] is already inserted. If n == 0, then the k > 0 check will fail and execution will continue at A[k] = key;, properly storing the first element into the array.
This fragment also assumes that A[0:n-1] is already sorted. It inspects A[n] and starts scanning the array backward, moving every element that is larger than the original A[n] key forward one place.
Once the scanning encounters an element less than or equal to the key, it inserts it in that location.
It's called insertion sort because the line A[k] = key inserts the current value in the correct position in the partially sorted array.

How to implement a minimum heap sort to find the kth smallest element?

I've been implementing selection sort problems for class and one of the assignments is to find the kth smallest element in the array using a minimum heap. I know the procedure is:
heapify the array
delete the minimum (root) k times
return kth smallest element in the group
I don't have any problems creating a minimum heap. I'm just not sure how to go about properly deleting the minimum k times and successfully return the kth smallest element in the group. Here's what I have so far:
bool Example::min_heap_select(long k, long & kth_smallest) const {
//duplicate test group (thanks, const!)
Example test = Example(*this);
//variable declaration and initialization
int n = test._total ;
int i;
//Heapifying stage (THIS WORKS CORRECTLY)
for (i = n/2; i >= 0; i--) {
//allows for heap construction
test.percolate_down_protected(i, n);
}//for
//Delete min phase (THIS DOESN'T WORK)
for(i = n-1; i >= (n-k+1); i--) {
//deletes the min by swapping elements
int tmp = test._group[0];
test._group[0] = test._group[i];
test._group[i] = tmp;
//resumes perc down
test.percolate_down_protected(0, i);
}//for
//IDK WHAT TO RETURN
kth_smallest = test._group[0];
void Example::percolate_down_protected(long i, long n) {
//variable declaration and initialization:
int currPos, child, r_child, tmp;
currPos = i;
tmp = _group[i];
child = left_child(i);
//set a sentinel and begin loop (no recursion allowed)
while (child < n) {
//calculates the right child's position
r_child = child + 1;
//we'll set the child to index of greater than right and left children
if ((r_child > n ) && (_group[r_child] >= _group[child])) {
child = r_child;
}
//find the correct spot
if (tmp <= _group [child]) {
break;
}
//make sure the smaller child is beneath the parent
_group[currPos] = _group[child];
//shift the tree down
currPos = child;
child = left_child(currPos);
}
//put tmp where it belongs
_group[currPos] = tmp;
}
As I stated before, the minimum heap part works correctly. I understand what I want to do: it seems easy to delete the root k times, but after that, which index in the array do I return... 0? This almost works, but it doesn't work with k = n or k = 1. Would the kth smallest element be in the Any help would be much appreciated!
The only array index which is meaningful to the user is zero, which holds the minimum element. So, after removing the minimum k-1 times, the k'th smallest element will be at index zero.
Probably you should destroy the heap and return the value rather than asking the user to concern themself with the heap itself… but I don't know the details of the assignment.
Note that the C++ Standard Library has algorithms to help with this: make_heap, pop_heap, and nth_element.
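As a hedged illustration of those standard algorithms, the sketch below finds the kth smallest value (k counted from 1) in two ways. It works on a plain `std::vector<long>` rather than the assignment's `Example` class, and both function names are invented for the example:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Heap version: build a min-heap, remove the minimum k-1 times, read the root.
long kthSmallestHeap(std::vector<long> v, std::size_t k) {
    assert(k >= 1 && k <= v.size());
    // std::greater makes the heap functions maintain a min-heap.
    std::make_heap(v.begin(), v.end(), std::greater<long>());
    auto last = v.end();
    for (std::size_t i = 1; i < k; ++i) {
        // pop_heap moves the current minimum to last-1 and re-heapifies the rest.
        std::pop_heap(v.begin(), last, std::greater<long>());
        --last;
    }
    return v.front();  // root of the remaining heap is the kth smallest
}

// Selection version: nth_element places the kth smallest at index k-1.
long kthSmallestSelect(std::vector<long> v, std::size_t k) {
    assert(k >= 1 && k <= v.size());
    std::nth_element(v.begin(), v.begin() + (k - 1), v.end());
    return v[k - 1];
}
```

Both functions take the vector by value, so the caller's data is left untouched, much like the copy made at the top of the question's const member function.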
I am not providing a detailed answer, just explaining the key points in getting k smallest elements in a min-heap ordered tree. The approach uses skip lists.
First form a skip list of nodes of the tree with just one element the node corresponding to the root of the heap. the 1st minimum element is just the value stored at this node.
Now delete this node and insert its child nodes in the right position such that to maintain the order of values. This steps takes O(logk) time.
The second minimum value is just then the value at first node in this skip list.
Repeat the above steps until you get all the k minimum elements. The overall time complexity will be log(2)+log(3)+log(4)+... log(k) = O(k.logk). Forming a heap takes time n, so overall time complexity is O(n+klogk).
There is one more approach without making a heap that is Quickselect, which has an average time complexity of O(n) but worst case as O(n^2).
The striking difference between the two approaches is that the first approach gives all k minimum elements up to the kth minimum, while Quickselect gives only the kth minimum element.
Memory-wise, the former approach uses O(n) extra space, while Quickselect uses O(1).