Given a collection of stacks of different heights, how can I select every possible combination? - C++

Input: total cost.
Output: all the combinations of levels that give the desired cost.
Every level of each stack costs a different amount (level 1 in stack 1 doesn't cost the same as level 1 in stack 2). I have a function that converts the level to the actual cost based on the base cost (level 1), which I entered manually (hard coded).
I need to find the combinations of levels that give me the input cost. I realize there can be more than one possible solution, but I only need a way to iterate through every possibility.
Here is what I need:
input = 224, and this is one of the solutions:
I'm making a simple program that needs to select levels of different stacks and then calculate the cost, and I need to know every possible cost that exists... Each level of each stack costs a different amount of money, but that is not the problem; the problem is how to select one level for each stack.
I probably explained that very vaguely, so here's a picture (you'll have to excuse my poor drawing skills):
So, all stacks have the level 0, and level 0 always costs 0 money.
Additional info:
I have an array called "maxLevels", length of that array is the number of stacks, and each element is the number of the highest level in that stack (for example, maxLevels[0] == 2).
You can iterate from the 1st level because level 0 doesn't matter at all.
The selected levels should be saved in an array (name: "currentLevels") that is similar to maxLevels (same length) but, instead of containing the maximum level of a stack, it contains the selected level of a stack (for example: currentLevels[3] == 2).
I'm programming in C++, but pseudocode is fine as well.
This isn't homework, I'm doing it for fun (it's basically for a game).

I'm not sure I understand the question, but here's how to churn through all the possible combinations of selecting one item from each stack (in this case 3*1*2*3*1 = 18 possibilities):
void visit_each_combination(size_t *maxLevels, size_t num_of_stacks, Visitor &visitor, std::vector<size_t> &choices_so_far) {
    if (num_of_stacks == 0) {
        visitor.visit(choices_so_far);
    } else {
        for (size_t pos = 0; pos <= maxLevels[0]; ++pos) {
            choices_so_far.push_back(pos);
            visit_each_combination(maxLevels+1, num_of_stacks-1, visitor, choices_so_far);
            choices_so_far.pop_back();
        }
    }
}
You can replace visitor.visit with whatever you want to do with each combination, to make the code more specific. I've used a vector choices_so_far instead of your array currentLevels, but it could just as well work with an array.
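For the original cost question, a possible Visitor might sum the cost of the chosen levels and print every combination that hits the target; costOf() below is only a stand-in for your own hard-coded level-to-cost conversion:
#include <iostream>
#include <vector>

// Stand-in for the hard-coded level-to-cost conversion (an assumption here).
int costOf(std::size_t stack, std::size_t level) {
    return static_cast<int>((stack + 1) * 10 * level);   // replace with your own table
}

// Visitor that sums the cost of the chosen levels and prints combinations
// whose total matches the target cost.
struct Visitor {
    int target;
    void visit(const std::vector<std::size_t> &levels) {
        int total = 0;
        for (std::size_t s = 0; s < levels.size(); ++s)
            total += costOf(s, levels[s]);
        if (total == target) {
            for (std::size_t lvl : levels) std::cout << lvl << ' ';
            std::cout << "-> cost " << total << '\n';
        }
    }
};
You would call it roughly as Visitor v{224}; std::vector<size_t> choices; visit_each_combination(maxLevels, numStacks, v, choices);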

This is very simple, if I've understood it correctly. The minimum cost is 0, and the maximum cost is just the sum of the heights of the stacks. To achieve any specific cost between these limits, you can start from the left, selecting the maximum level for each stack until your target is achieved, and then select level 0 for the remaining stacks. (You may have to adjust the last non-zero stack if you overshoot the target.)
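A rough sketch of that greedy idea, under this answer's assumption that each stack's cost is simply its selected level (so the total cost is just a sum of levels):
#include <algorithm>
#include <vector>

// Greedily fill stacks from the left until the target is reached; std::min()
// handles the "adjust the last non-zero stack" case automatically.
std::vector<int> pickLevels(const std::vector<int> &maxLevels, int target) {
    std::vector<int> current(maxLevels.size(), 0);
    int remaining = target;
    for (std::size_t i = 0; i < maxLevels.size() && remaining > 0; ++i) {
        current[i] = std::min(maxLevels[i], remaining);
        remaining -= current[i];
    }
    return current;   // the remaining stacks stay at level 0
}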

I solved it, I think. @Steve Jessop gave me the idea to use recursion.
Algorithm:
circ(currentStack)
{
    for (i = 0; i <= allStacks[currentStack]; i++) {
        if (currentStack == lastStack && i == allStacks[currentStack])
            return 0;
        else if (currentStack != lastStack)
            circ(currentStack + 1);   // recurse into the next stack without modifying currentStack
    }
}

Related

Possible way to find the actual billboard locations in the Billboard Highway Problem [Dynamic Programming]

I've been learning about dynamic programming the past few days and came across the Highway Billboard Problem. From what I understand, we are able to find the maximum revenue that can be generated given the possible sites, their revenues, the size of the highway, and the minimum distance between two billboards. Is there a way we can also find out the actual billboard locations alongside the maximum revenue?
For the code, I've been looking at this: https://www.geeksforgeeks.org/highway-billboard-problem/
Yes, it is possible to write down the sequence of the chosen sites.
There are two max() calls. Replace each of them with your own maximum computation using an if, and inside the branch where the current site is used, add the current position to a site list (to an emptied list in the first max clause, as far as I understand).
For example,
maxRev[i] = max(maxRev[i-1], revenue[nxtbb]);
change to this (pseudocode, did not check validity)
if (revenue[nxtbb] > maxRev[i-1]) {
    maxRev[i] = revenue[nxtbb];
    sitelist.clear();
    sitelist.push(i);
}
else
    maxRev[i] = maxRev[i-1];
and
maxRev[i] = max(maxRev[i-t-1]+revenue[nxtbb], maxRev[i-1]);
change to
if (maxRev[i-t-1] + revenue[nxtbb] > maxRev[i-1]) {
    maxRev[i] = maxRev[i-t-1] + revenue[nxtbb];
    sitelist.push(i);
}
else
    maxRev[i] = maxRev[i-1];
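Alternatively, you can skip the list bookkeeping during the forward pass and backtrack over the finished table. This is only a sketch, assuming maxRev[0..m] was filled exactly as in the linked solution and t is the minimum distance; the function name is my own:
#include <vector>

// Walk backwards over the filled maxRev table. If maxRev[i] == maxRev[i-1],
// no billboard needs to be taken at mile i; otherwise one was taken, so record
// it and jump back past the minimum-distance window t.
std::vector<int> reconstructSites(const std::vector<int> &maxRev, int m, int t) {
    std::vector<int> sites;
    int i = m;
    while (i > 0) {
        if (maxRev[i] == maxRev[i - 1]) {
            --i;
        } else {
            sites.push_back(i);   // a billboard at mile i is part of the optimum
            i -= (t + 1);         // the next billboard must be at least t miles earlier
        }
    }
    return sites;                 // mile positions, from right to left
}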

Recursive backtracking, showing the best solution

For school I am supposed to use recursive backtracking to solve a Boat puzzle. The user inputs a maximum weight for the boat, the amount of item types, and a weight and value for each item type. More than one of each item type can be placed on the boat.
Our assignment states "The program should find a solution that fills the boat with selected valuable items such that the total value of the items in the boat is maximized while the total weight of the items stays within the weight capacity of the boat."
It also has a pretty specific template for the recursive backtracking algorithm.
Currently I am using contiguous lists of items to store the possible items and the items on the boat. The item struct includes int members for weight, value, count (of how many times it is used) and a unique code for printing purposes. I then have a Boat class which contains data members max_weight, current_weight, value_sum, and members for each of the contiguous lists, and then member functions needed to solve the puzzle. All of my class functions seem to be working perfectly and my recursion is indeed displaying the correct answer given the example input.
The thing I can't figure out is the condition for extra credit, which is: "Modify your program so that it displays the best solution, which has the lowest total weight. If there are two solutions with the same total weight, break the tie by selecting the solution with the least items in it." I've looked at it for a while, but I'm just not sure how I can change it to make sure the weight is minimized while also maximizing the value. Here is the code for my solution:
bool solve(Boat &boat) {
    if (boat.no_more()) {
        boat.print();
        return true;
    }
    else {
        int pos;
        for (int i = 0; i < boat.size(); i++) {
            if (boat.can_place(i)) {
                pos = boat.add_item(i);
                bool solved = solve(boat);
                boat.remove_item(pos);
                if (solved) return true;
            }
        }
        return false;
    }
}
All functions do pretty much exactly what their name says. no_more() returns true if none of the possible items will fit on the boat. size() returns the size of the list of possible items. Adding and removing items change the item count data and also the Boat current_weight and value_sum members accordingly. Also, the add_item, remove_item and can_place parameter is the index of the possible item that is being used. In order to make sure the maximum value is found, the list of possible items is sorted in descending order by value in the Boat's constructor, which takes a list of possible items as a parameter.
Also here is an example of what input and output look like:
Any insight is greatly appreciated!
It turned out that the above solution was correct. The only reason I was getting an incorrect answer was my implementation of the no_more() function. In the function I was checking whether any item in the possible-items list weighed less than the weight left on the boat. I should have been checking whether it weighed less than or equal to the weight left on the boat. A simple mistake.
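A hypothetical sketch of the corrected check; the Item/Boat members follow the description in the question, but the exact names are my assumptions:
#include <vector>

struct Item { int weight, value, count, code; };

struct Boat {
    std::vector<Item> items;    // the possible item types
    int max_weight = 0;
    int current_weight = 0;

    bool no_more() const {
        for (const Item &it : items)
            // "<=" rather than "<": an item that exactly fills the remaining
            // capacity still fits on the boat
            if (it.weight <= max_weight - current_weight)
                return false;
        return true;            // nothing fits any more
    }
};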
The wikipedia entry was indeed of use and I enjoyed the comic :)

C++, determine the part that has the highest zero-crossing rate

I'm not a specialist in signal processing. I'm doing simple processing on a 1D signal using C++. I really want to know how I can determine the part that has the highest zero-crossing rate (highest frequency!). Is there a simple way or method to tell the beginning and the end of this part?
This image illustrates the form of my signal, and this image shows what I need to get (the two indices of the beginning and the end).
Edited:
Actually I have no prior idea about the width of the beginning and the end; it varies a lot.
I can calculate the number of zero crossings, but I have no idea how to define the range of that part:
int calculateZC(const std::vector<double> &signals) {
    int ZC_counter = 0;
    int size = signals.size();
    for (int i = 0; i < size - 1; i++) {
        // a sign change between consecutive samples counts as one crossing
        if ((signals[i] >= 0 && signals[i+1] < 0) || (signals[i] < 0 && signals[i+1] >= 0)) {
            ZC_counter++;
        }
    }
    return ZC_counter;
}
Here is a fairly simple strategy which might give you a starting point. The outline of the algorithm is as follows:
Input: Vector of your data points {y0,y1,...}
Parameters:
Window size sigma.
A threshold 0<p<1 defining when to start looking for a region.
Output: The start- and endpoint {t0,t1} of the region with the most zero-crossings
I won't give any C++ code, but the method should be easy to implement. As an example, let us use the following function:
What we want is the region between about 480 and 600, where the zero density is higher than in the front part. The first step in the algorithm is to calculate the positions of the zeros. You can do this with what you already have, but instead of counting, you store the values of i where you met a zero.
This will give you a list of zero positions
From this list (you can do this directly in the above for-loop!) you create a list having the same size as your input data which looks like {0,0,0,...,1,0,..,1,0,..}. Every zero-crossing position in your input data is marked with a 1.
The next step is to smooth this list with a smoothing filter of size sigma. Here, you can use what you like; in the simplest case a moving average or a Gaussian filter. The higher you choose sigma the bigger becomes your look around window which measures how many zero-crossings are around a certain point. Let me give the output of this filter together with the original zero positions. Note that I used a Gaussian filter of size 10 here
In the next step, you go through the filtered data and find the maximum value. In this case it is about 0.15. Now you choose your second parameter, which is some percentage p of this maximum. Let's say p=0.6.
The final step is to go through the filtered data and, whenever the value is greater than p times the maximum, start to remember a new region. As soon as the value drops below that threshold, you end the region and remember its start and endpoint. Once you are finished walking through the data, you are left with a list of regions, each defined by a start and an endpoint. Now you choose the region with the biggest extent and you are done.
(Optionally, you could add the filter size to each end of the final region)
For the above example, I get 11 regions as follows
{{164,173},{196,205},{220,230},{241,252},{259,271},{278,290},
{297,309},{318,327},{341,350},{458,468},{476,590}}
where the one with the biggest extent is the last one, {476,590}. The final result looks like this (with 1/2 filter-size padding at each end):
Conclusion
Please don't be discouraged by the length of my answer. I tried to explain everything in detail. The implementation is really just some loops:
one loop to create the zero-crossings list {0,0,..,1,0,...}
one nested loop for the moving-average filter (or you use some library Gaussian filter); here you can extract the maximum value at the same time
one loop to extract all regions
one loop to extract the largest region if you haven't already extracted it in the above step
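Putting those loops together, a minimal C++ sketch might look like this (I use a plain moving average rather than a Gaussian filter, and all names are my own):
#include <algorithm>
#include <utility>
#include <vector>

// Returns the start/end indices of the widest region whose smoothed
// zero-crossing density exceeds p * (maximum density).
std::pair<int, int> densestZeroCrossingRegion(const std::vector<double> &signal,
                                              int sigma,   // smoothing window size
                                              double p)    // threshold factor, 0 < p < 1
{
    const int n = static_cast<int>(signal.size());
    if (n == 0) return {0, 0};

    // 1. Mark every zero-crossing position with a 1.
    std::vector<double> marks(n, 0.0);
    for (int i = 0; i + 1 < n; ++i)
        if ((signal[i] >= 0 && signal[i + 1] < 0) ||
            (signal[i] < 0 && signal[i + 1] >= 0))
            marks[i] = 1.0;

    // 2. Smooth the marker list with a moving average of size sigma.
    std::vector<double> density(n, 0.0);
    for (int i = 0; i < n; ++i) {
        int lo = std::max(0, i - sigma / 2);
        int hi = std::min(n - 1, i + sigma / 2);
        double sum = 0.0;
        for (int j = lo; j <= hi; ++j) sum += marks[j];
        density[i] = sum / (hi - lo + 1);
    }

    // 3. Threshold at p * max density and keep the widest region found.
    double threshold = p * *std::max_element(density.begin(), density.end());
    std::pair<int, int> best{0, 0};
    int start = -1;
    for (int i = 0; i < n; ++i) {
        if (density[i] > threshold && start < 0) {
            start = i;                              // a region begins
        } else if ((density[i] <= threshold || i == n - 1) && start >= 0) {
            if (i - start > best.second - best.first)
                best = {start, i};                  // widest region so far
            start = -1;
        }
    }
    return best;
}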

Finding the most common three-item sequence in a very large file

I have many log files of webpage visits, where each visit is associated with a user ID and a timestamp. I need to identify the most popular (i.e. most often visited) three-page sequence of all. The log files are too large to be held in main memory at once.
Sample log file:
User ID  Page ID
A          1
A          2
A          3
B          2
B          3
C          1
B          4
A          4
Corresponding results:
A: 1-2-3, 2-3-4
B: 2-3-4
2-3-4 is the most popular three-page sequence
My idea is to use two hash tables. The first is keyed by user ID and stores that user's recent page sequence; the second is keyed by three-page sequences and stores the number of times each one appears. This takes O(n) space and O(n) time.
However, since I have to use two hash tables, memory cannot hold everything at once, and I have to use disk. It is not efficient to access disk very often.
How can I do this better?
If you want to quickly get an approximate result, use hash tables, as you intended, but add a limited-size queue to each hash table to drop least recently used entries.
If you want an exact result, use an external sort to order the logs by userid, then combine every 3 consecutive entries and sort again, this time by page IDs.
Update (sort by timestamp)
Some preprocessing may be needed to properly use logfiles' timestamps:
If the logfiles are already sorted by timestamp, no preprocessing is needed.
If there are several log files (possibly coming from independent processes), and each file is already sorted by timestamp, open all these files and use a merge sort to read them.
If the files are almost sorted by timestamp (as when several independent processes write logs to a single file), use a binary heap to get the data in the correct order.
If the files are not sorted by timestamp (which is not likely in practice), use an external sort by timestamp.
Update2 (Improving approximate method)
The approximate method with an LRU queue should produce quite good results for randomly distributed data. But webpage visits may have different patterns at different times of day, or may be different on weekends. The original approach may give poor results for such data. To improve this, a hierarchical LRU queue may be used.
Partition the LRU queue into log(N) smaller queues, with sizes N/2, N/4, ... The largest one may contain any element, the next one only elements seen at least 2 times, the next one at least 4 times, and so on. If an element is removed from some sub-queue, it is added to another one, so it lives in all sub-queues lower in the hierarchy before it is removed completely. Such a priority queue is still of O(1) complexity, but allows a much better approximation for the most popular pages.
There are probably syntax errors galore here, but this should take a limited amount of RAM for a virtually unlimited-length log file.
#include <array>
#include <climits>
#include <cstddef>
#include <fstream>
#include <iostream>
#include <unordered_map>

typedef int pageid;
typedef int userid;
typedef std::array<pageid, 3> sequence;   // three consecutive pages, newest first
typedef int sequence_count;

// std::unordered_map needs a hash functor for the array key
struct sequence_hash {
    std::size_t operator()(const sequence &s) const {
        std::size_t h = 0;
        for (pageid p : s) h = h * 1000003u + std::hash<pageid>()(p);
        return h;
    }
};

int main() {
    const int num_pages = 1000; // where 1-1000 inclusive are valid pageids
    const int num_passes = 4;
    std::unordered_map<userid, sequence> userhistory;
    std::unordered_map<sequence, sequence_count, sequence_hash> visits;
    sequence_count max_count = 0;
    sequence max_sequence = {};
    userid curuser;
    pageid curpage;
    for (int pass = 0; pass < num_passes; ++pass) { // have to go in four passes
        std::ifstream logfile("log.log");
        pageid minpage = num_pages / num_passes * pass;           // where first page is in a range
        pageid maxpage = num_pages / num_passes * (pass + 1) + 1;
        if (pass == num_passes - 1)     // if it's the last pass, fix rounding errors
            maxpage = INT_MAX;
        while (logfile >> curuser >> curpage) {          // read in line
            sequence &curhistory = userhistory[curuser]; // find that user's history
            curhistory[2] = curhistory[1];
            curhistory[1] = curhistory[0];
            curhistory[0] = curpage;                     // push back new page for that user
            // if they visited three pages in a row, and the oldest is in this pass's range
            if (curhistory[2] > minpage && curhistory[2] < maxpage) {
                sequence_count &count = visits[curhistory]; // get times sequence was hit
                ++count;                                    // and increase it
                if (count > max_count) {        // if that's a new max
                    max_count = count;          // update the max
                    max_sequence = curhistory;  // std::array, so plain assignment copies
                }
            }
        }
        userhistory.clear(); // start each pass with fresh per-user history
        visits.clear();      // drop this pass's counts so only one page range is held at a time
    }
    std::cout << "The sequence visited the most is:\n";
    std::cout << max_sequence[2] << '\n';
    std::cout << max_sequence[1] << '\n';
    std::cout << max_sequence[0] << '\n';
    std::cout << "with " << max_count << " visits.\n";
}
Note that if your pageid or userid are strings instead of ints, you'll take a significant speed/size/caching penalty.
[EDIT2] It now works in 4 (customizable) passes, which means it uses less memory, making this work realistically in RAM. It just goes proportionately slower.
If you have 1000 web pages then you have 1 billion possible 3-page sequences. If you have a simple array of 32-bit counters then you'd use 4GB of memory. There might be ways to prune this down by discarding data as you go, but if you want to guarantee to get the correct answer then this is always going to be your worst case - there's no avoiding it, and inventing ways to save memory in the average case will make the worst case even more memory hungry.
On top of that, you have to track the users. For each user you need to store the last two pages they visited. Assuming the users are referred to by name in the logs, you'd need to store the users' names in a hash table, plus the two page numbers, so let's say 24 bytes per user on average (probably conservative - I'm assuming short user names). With 1000 users that would be 24KB; with 1000000 users 24MB.
Clearly the sequence counters dominate the memory problem.
If you do only have 1000 pages then 4GB of memory is not unreasonable on a modern 64-bit machine, especially with a good amount of disk-backed virtual memory. If you don't have enough swap space, you could just create an mmapped file (on Linux - I presume Windows has something similar), and rely on the OS to keep the most used cases cached in memory.
So, basically, the maths dictates that if you have a large number of pages to track, and you want to be able to cope with the worst case, then you're going to have to accept that you'll have to use disk files.
I think that a limited-capacity hash table is probably the right answer. You could probably optimize it for a specific machine by sizing it according to the memory available. Having got that, you need to handle the case where the table reaches capacity. It may not need to be terribly efficient if it's likely you rarely get there. Here are some ideas:
Evict the least commonly used sequences to file, keeping the most common in memory. You'd need two passes over the table: one to determine what level is below average, and one to do the eviction. Somehow you'd need to know where you'd put each entry whenever you get a hash miss, which might prove tricky.
Just dump the whole table to file, and build a new one from scratch. Repeat. Finally, recombine the matching entries from all the tables. The last part might also prove tricky.
Use an mmapped file to extend the table. Ensure that the file is used primarily for the least-commonly used sequences, as in my first suggestion. Basically, you'd simply use it as virtual memory - the file would be meaningless later, after the addresses have been forgotten, but you wouldn't need to keep it that long. I'm assuming there isn't enough regular virtual memory here, and/or you don't want to use it. Obviously, this is for 64-bit systems only.
I think you only have to store the most recently seen triple for each userid, right?
So you have two hash tables. The first is keyed by userid, with the most recently seen triple as its value; its size equals the number of userids.
EDIT: this assumes the file is already sorted by timestamp.
The second hash table has a key of userid:page-triple, and a value of count of times seen.
I know you said C++, but here's some awk which does this in a single pass (it should be pretty straightforward to convert to C++):
# $1 is userid, $2 is pageid
{
    old = ids[$1];                    # map with id, most-recently-seen triple
    split(old, oldarr, "-");
    oldarr[1] = oldarr[2];
    oldarr[2] = oldarr[3];
    oldarr[3] = $2;
    ids[$1] = oldarr[1] "-" oldarr[2] "-" oldarr[3];   # save new most-recently-seen
    tripleid = $1 ":" ids[$1];        # build a triple-id of userid:triple
    if (oldarr[1] != "") {            # don't accumulate incomplete triples
        triples[tripleid]++;          # count this triple-id
    }
}
END {
    max_count = 0; max_tid = "";
    for (tid in triples) {
        print tid " " triples[tid];
        if (triples[tid] > max_count) {   # track the most common triple-id
            max_count = triples[tid];
            max_tid = tid;
        }
    }
    print "MAX is -> " max_tid " seen " max_count " times";
}
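For reference, a rough single-pass C++ equivalent of the awk above might look like the sketch below; like the awk, it keeps everything in memory and counts userid:triple keys, and the choice of reading from stdin is my own:
#include <array>
#include <iostream>
#include <map>
#include <string>

int main() {
    std::map<std::string, std::array<std::string, 3>> last3; // userid -> last three pages seen
    std::map<std::string, long> triples;                     // "userid:p1-p2-p3" -> count
    std::string user, page;
    while (std::cin >> user >> page) {
        auto &h = last3[user];
        h[0] = h[1]; h[1] = h[2]; h[2] = page;               // shift the new page in
        if (!h[0].empty())                                   // only count complete triples
            ++triples[user + ":" + h[0] + "-" + h[1] + "-" + h[2]];
    }
    std::string best;
    long best_count = 0;
    for (const auto &kv : triples)
        if (kv.second > best_count) { best = kv.first; best_count = kv.second; }
    std::cout << "MAX is -> " << best << " seen " << best_count << " times\n";
}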
If you are using Unix, the sort command can cope with arbitrarily large files. So you could do something like this:
sort -k1,1 -s logfile > sorted (note that this is a stable sort (-s) on the first column)
Perform some custom processing of sorted that outputs each triplet as a new line to another file, say triplets, using either C++ or a shell script. So in the example given you get a file with three lines: 1-2-3, 2-3-4, 2-3-4. This processing is quick because Step 1 means that you are only dealing with one user's visits at a time, so you can work through the sorted file a line at a time.
sort triplets | uniq -c | sort -r -n | head -1 should give the most common triplet and its count (it sorts the triplets, counts the occurrences of each, sorts them in descending order of count and takes the top one).
This approach might not have optimal performance, but it shouldn't run out of memory.
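A possible C++ version of the second step (reading the userid-sorted file and writing one triplet per line); the file names sorted and triplets are the ones used above:
#include <fstream>
#include <string>

int main() {
    std::ifstream in("sorted");     // output of step 1
    std::ofstream out("triplets");  // input for step 3
    std::string user, page, prevuser, p1, p2;
    while (in >> user >> page) {
        if (user != prevuser) {     // new user: reset the sliding window
            p1.clear();
            p2.clear();
            prevuser = user;
        }
        if (!p1.empty())            // we already have two earlier pages for this user
            out << p1 << '-' << p2 << '-' << page << '\n';
        p1 = p2;
        p2 = page;
    }
}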

Clustering algorithm with upper bound requirement for each cluster size

I need to partition approximately 50,000 points into distinct clusters. There is one requirement: the size of every cluster cannot exceed K. Is there any clustering algorithm that can do this job?
Please note that the upper bound K is the same for every cluster, say 100.
Most clustering algorithms can be used to create a tree in which the lowest level is just a single element - either because they naturally work "bottom up" by joining pairs of elements and then groups of joined elements, or because, like K-Means, they can be used to repeatedly split groups into smaller groups.
Once you have a tree, you can decide where to split off subtrees to form your clusters of size <= 100. Pruning an existing tree is often quite easy. Suppose that you want to divide an existing tree to minimise the sum of some cost of the clusters you create. You might have:
f(tree_node, list_of_clusters)
{
    cost = infinity;
    if (size of tree below tree_node <= 100)
    {
        cost = cost_function(stuff below tree_node);
    }
    temp_list = new List();
    cost_children = 0;
    for (children of tree_node)
    {
        cost_children += f(child, temp_list);
    }
    if (cost_children < cost)
    {
        list_of_clusters.add_all(temp_list);
        return cost_children;
    }
    list_of_clusters.add(tree_node);
    return cost;
}
One way is to use hierarchical K-means: keep splitting each cluster that is larger than K until all of them are small enough.
Another (in some sense opposite) approach would be to use hierarchical agglomerative clustering, i.e. a bottom-up approach, and again make sure you don't merge clusters if they would form a new one of size > K.
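For the first, top-down variant, a minimal sketch might look like this; split_in_two() stands in for a real 2-means step (here it just splits at the median x coordinate), and all names are my own:
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

struct Point { double x, y; };

// Stand-in for a 2-means step: split the group at the median x coordinate.
std::pair<std::vector<Point>, std::vector<Point>> split_in_two(std::vector<Point> pts) {
    std::nth_element(pts.begin(), pts.begin() + pts.size() / 2, pts.end(),
                     [](const Point &a, const Point &b) { return a.x < b.x; });
    std::vector<Point> left(pts.begin(), pts.begin() + pts.size() / 2);
    std::vector<Point> right(pts.begin() + pts.size() / 2, pts.end());
    return {left, right};
}

// Top-down: any group larger than K is split and both halves are revisited.
void cluster_with_cap(const std::vector<Point> &pts, std::size_t K,
                      std::vector<std::vector<Point>> &out) {
    if (pts.size() <= K) {
        out.push_back(pts);          // small enough: keep it as a final cluster
        return;
    }
    auto halves = split_in_two(pts);
    cluster_with_cap(halves.first, K, out);
    cluster_with_cap(halves.second, K, out);
}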
The issue with naive clustering is that you do indeed have to calculate a distance matrix that holds the distance of each point from every other member in the set. It depends whether you've pre-processed the population or whether you're amalgamating the clusters into typical individuals and then recalculating the distance matrix again.