I'm looking to use optaplanner to help calculate the best discount for a user buying a list of items (e.g. A Shopping Cart).
Does this sound like a good idea? Has anyone tried it? Any help would be greatly appreciated.
Consider:
I have a user wanting to buy a number of different items
There are a number of Promotional Discounts available to this customer, including a Percentage Discount and Buy-one-get-one-free on a number of the items
Each discount may only apply to some of the items
Each item can only have one Promotion applied
But each Promotion may be applied to multiple items. The same Percentage Discount may be available to multiple items.
Applying each Promotion in different orders may result in a different total discount
Goal: I want to identify which Promotions, applied in a certain order, will give the user the greatest discount.
I have looked at both drools-expert (considering the brute force option) and optaplanner. With optaplanner I do not see how I can do the following:
Take into account that a Promotion may apply to multiple items
Discount gained from a Promotion may differ depending on the state of the basket (i.e. which promotions have already been applied) when it is applied.
Something like this should work. It allows the same promotion to be used on multiple items. Add a hard constraint so that a buy-3-get-1 promotion only counts when at least 3 items are assigned to it:
import org.optaplanner.core.api.domain.entity.PlanningEntity;
import org.optaplanner.core.api.domain.variable.PlanningVariable;

@PlanningEntity
class ItemAssignment {
    Item item; // the cart item being assigned
    @PlanningVariable(nullable = true, ...)
    Promotion promotion; // null means no promotion is applied to this item
}
Add a hard constraint for promotions that can only be used once. Add a soft constraint to activate a promotion that requires 3 items to be bought. I'd do an insertLogical(new Discount) in that soft constraint, so the other rules can be aware of which promotions have already been applied.
I am a bit in doubt whether metaheuristics are the best approach for this, presuming that your problem is very small. Unless someone's buying more than 10 items at a time, I think brute force (implemented in optaplanner) or branch & bound (not yet implemented in optaplanner) might be a better choice...
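For the brute-force route, here is a rough sketch in plain Python (not OptaPlanner) just to illustrate how small the search space is for a small cart. The item names, promotion names and the discount() function are made-up placeholders, and the order of promotion application could be handled by also permuting each assignment.

from itertools import product

cart = ["shirt", "shoes", "hat"]                   # hypothetical items
promotions = [None, "10% off", "buy-one-get-one"]  # None = no promotion on that item

def discount(assignment):
    # Hypothetical: total discount for an {item: promotion-or-None} assignment.
    # Your basket-state-dependent rules would live here.
    total = 0.0
    for item, promo in assignment.items():
        if promo == "10% off":
            total += 2.0   # placeholder amounts
        elif promo == "buy-one-get-one":
            total += 5.0
    return total

# Enumerate every promotion-per-item assignment and keep the best one.
best = max(
    (dict(zip(cart, combo)) for combo in product(promotions, repeat=len(cart))),
    key=discount,
)
print(best, discount(best))

With 10 items and, say, 3 candidate promotions plus "no promotion" per item, that is 4^10, about a million assignments, which is still enumerable.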
For some context, I recommend watching this video from Tom Scott in which he determines what the best "thing" is: https://youtu.be/ALy6e7GbDRQ. I think it will help explain my question.
Basically I am trying to make an algorithm/program that has one user rank all the items in a list by choosing, two items at a time, which one he likes the most.
For example, take the most basic list with 3 items: A, B and C.
The order of operations would look like this:
1. The user is presented with the choice: A or C.
2. He prefers C.
3. The user is presented with the choice: B or C.
4. He prefers C.
5. The user is presented with the choice: A or B.
6. He prefers A.
So we know that the ranked order is [C > A > B] in 3 comparisons.
But what if, at step 4, the user had chosen B? We could deduce by logic that he prefers B over A, making step 5 unnecessary. So we would know that the ranked list is [B > C > A] in only 2 comparisons, and it is just as accurate as in the first situation. Consequently, we can see that the number of steps depends on the user's choices.
So what is the right or most efficient way to rank all the items in the smallest number of comparisons? I thought about literally comparing every possible combination of two items and keeping a counter for every item that goes up each time that item is selected over the other. But that way the number of comparisons would grow quadratically with the size of the list, and it would not be the most efficient approach, just like in the example above. I also thought about using a simple sorting algorithm in which the user would do the comparisons "manually", but I am not very familiar with sorting algorithms in general.
In Tom Scott's video, the website picks two items at random, and he uses such a big pool of users that the problem just corrects itself through probability and statistics, so he does not need to worry about every item being picked equally often to stay accurate. Since I want just one user to rank the items, I need to find a way to pick the right items to compare to be as efficient as possible (I think). Another difference is that I will not have over 7000 items like in Tom's version. I want to use this to rank, for example, "All Disney movies", "Certain music genres" or "All Zelda video games", so I am pretty sure that I will have at most around 100 items.
My question is: am I on the right track? Should I use a simple sorting algorithm (if yes, which one?), or is comparing every combination the only way to do this accurately? Or am I missing something that could help me? If I brute-force every combination, it will clearly be the simplest and most accurate way to rank the items, but it will certainly not be the most efficient. I suppose that the order of the comparisons, and which items we compare, really matters for efficiency, and this is where I am lost.
I know it is not really a conventional question to ask here, but thanks for helping me or pointing me in the right direction.
Assuming that the user will produce a total ordering of the items, the algorithm should maintain a fully ordered list of a growing number of items.
Take the next unordered item and use the comparisons needed for a binary search to insert it into the ordered list.
Repeat until all items are ordered.
The binary searches should minimise the comparisons at each stage and present the user with more meaningful comparisons over time for each new item.
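A minimal Python sketch of this approach, assuming a hypothetical ask() prompt for the user's preference (the example at the end uses a scripted preference instead, so it runs on its own):

def ask(a, b):
    """Hypothetical prompt: return True if the user prefers a over b."""
    return input(f"Do you prefer '{a}' over '{b}'? [y/n] ").strip().lower() == "y"

def rank(items, prefers=ask):
    # Keep a ranked list (best first) and binary-insert each new item,
    # asking the user one comparison per probe.
    ranked = []
    for item in items:
        lo, hi = 0, len(ranked)
        while lo < hi:
            mid = (lo + hi) // 2
            if prefers(item, ranked[mid]):   # the new item beats the one at mid
                hi = mid                     # so it belongs higher up
            else:
                lo = mid + 1                 # otherwise it belongs lower down
        ranked.insert(lo, item)
    return ranked

# Scripted example reproducing the A/B/C story: C beats everything, A beats B.
print(rank(["A", "B", "C"], prefers=lambda a, b: a == "C" or (a == "A" and b == "B")))
# -> ['C', 'A', 'B']

For around 100 items this needs on the order of n*log2(n), roughly 600 comparisons in the worst case, rather than the 4,950 needed to compare every pair.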
I want to try UMAP on my high-dimensional dataset as a preprocessing step (not for data visualization) in order to decrease the number of features, but how can I choose (if there is a method) the right number of dimensions in which to map the original data? For example, in PCA you can select the number of components that explain a fixed % of the variance.
There is no good way to do this comparable to the explicit measure given by PCA. As a rule of thumb, however, you will get significantly diminishing returns for an embedding dimension larger than the n_neighbors value. With that in mind, and since you actually have a downstream task, it makes the most sense to build a pipeline to the downstream task evaluation and look at cross validation over the number of UMAP dimensions.
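A sketch of that pipeline idea, assuming the umap-learn and scikit-learn packages; the digits dataset, the logistic-regression classifier and the candidate dimension values are only example choices:

from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from umap import UMAP

X, y = load_digits(return_X_y=True)

# UMAP as a preprocessing step, scored by the downstream classifier.
pipe = Pipeline([
    ("umap", UMAP(n_neighbors=15, random_state=42)),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Cross-validate over the number of UMAP dimensions.
search = GridSearchCV(pipe, {"umap__n_components": [2, 5, 10, 20]}, cv=3)
search.fit(X, y)
print(search.best_params_, search.best_score_)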
A model name is usually the singular name of some entity, and the table name is the plural form of that word.
For example, a Transaction is stored in the transactions table.
But there are cases when a whole table is described by a singular word that denotes a collection of entities, for example: journal, log, history.
And there is no more precise name for a single row than "entry" or "item". But a model named ThingsJournalEntry looks messy, and a plain ThingsJournal is confusing, because an instance doesn't describe an actual journal but a single entry.
Is there a common naming approach for such cases better than described above?
Your question seems to show that there are actually two naming issues at play. One is regarding the elements of your collection, which you ask about explicitly. The other is regarding the collection itself, which is rather implicit. In fact, when you refer to the Journal you feel compelled to clarify (accounting). This means that your collection class would be better named AccountingJournal, which would remove the ambiguity.
Now, since the description you provide of these objects (collection and elements) is a little succinct, I don't have enough information to suggest an appropriate name. However, to give you some hints, I would recommend considering not just the nature of the elements but the responsibility they will have in your model. Will they represent entities or actions? If the elements are entities (things), consider simple nouns that replicate or resemble the language used by accountants. Examples include AccountingEntry or AccountingRecord. If your elements represent actions, then use a suffix that stresses that characteristic, for example AccountingAnnotation or AccountingRegistration. Another question you can ask yourself is what kind of messages these elements will receive. For instance, if they will represent additions and subtractions of money, you might want to use AccountingOperation or AccountChange.
Whatever the situation dictated by the domain, you should verify that your names sound like sentences said by actual domain experts: "[this is an] accounting record of [something]" or "add this accounting record to the journal" or "register this accounting operation in the journal" or even "this is an accounting operation for subtracting this amount of money," etc.
The (intellectual) activity of naming your objects is directly connected with the activity of shaping your model. Exercise your prospective model in your head by saying aloud the messages your objects would typically receive, and check that the language you are shaping closely reproduces the language of the domain. In other words, let your objects talk and hear what they say; they will tell you their names.
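Not part of the advice itself, but here is a tiny sketch, with hypothetical names, of what "letting the objects talk" can look like in code:

# Hypothetical classes; the point is only that the code reads like domain
# language: "register this accounting entry in the accounting journal".
class AccountingEntry:
    def __init__(self, amount, description):
        self.amount = amount
        self.description = description

class AccountingJournal:
    def __init__(self):
        self.entries = []

    def register(self, entry):
        self.entries.append(entry)

journal = AccountingJournal()
journal.register(AccountingEntry(-50, "office supplies"))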
There is a sentiment analysis tool to find out people's perceptions on social networks.
This tool can:
(1) Decompose a document into a set of sentences.
(2) Decompose each sentence into a set of words, and perform filtering such that only product names and adjectives are preserved.
e.g. "This MacBook is awesome. Sony is better than Macbook."
After processing, we can get:
{MacBook, awesome}
{Sony, better}. (not the truth :D)
We just assume there exists a list of product names, P, that we care about, and a list of adjectives, A, that we care about.
My questions are:
Can we reduce this problem to a specialized association rule mining problem, and how? If yes, is there anything that needs attention, such as the reduction itself, parameter settings (minsup and minconf), additional constraints, or modifications to the Apriori algorithm to solve the problem?
Is there any way to artificially spam the result, e.g. pushing "horrible" to the top adjective? And are there any good ways to prevent this kind of spam?
Thanks.
Have you considered counting?
For every product, count how often each adjective occurs.
Report the top-3 adjectives for each product.
Takes just one pass over your data, and does not use a lot of memory (unless you have millions of products to track).
There is no reason to use association rule mining. Association rule mining only pays off when you are looking for large itemsets (i.e. 4 or more terms) and they are equally important. If you know that one term is special (e.g. product name vs. adjectives), it makes sense to split the data set by this unique key, and then use counting.
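A small Python sketch of the counting approach; the (product, adjective) pairs are made up:

# One pass over (product, adjective) pairs, then the top-3 adjectives per product.
from collections import Counter, defaultdict

pairs = [("MacBook", "awesome"), ("Sony", "better"),
         ("MacBook", "great"), ("MacBook", "awesome"), ("Sony", "sleek")]

counts = defaultdict(Counter)
for product, adjective in pairs:
    counts[product][adjective] += 1

for product, adjectives in counts.items():
    print(product, adjectives.most_common(3))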
I am trying to schedule a certain number of events in the week according to certain constraints, and would like to spread out these events as evenly as possible throughout the week.
If I add the standard deviation of the intervals between events to the objective function, then CPLEX can minimise it.
I am struggling to define the standard deviation of the intervals in terms of CPLEX expressions, mainly because the events don't have to be in any particular sequence, and I don't know which event is prior to any other one.
I feel sure this must be a solved problem, but I have not been able to find help in IBM's cplex documentation or on the internet.
Scheduling Uniformly Spaced Events
Here are a few possible ideas for you to try:
Let t0, t1, t2, t3 ... tn be the event times. (These are variables chosen by the model.)
Let d1 = t1 - t0, d2 = t2 - t1, ..., dn = tn - t(n-1).
Goal: We want all these d's to be roughly equal, which would have the effect of roughly evenly spacing out the t's.
Options
Option 1: Put a cost on the deviation from ideal
Let us take one example. Let's say that you want to schedule 10 events in a week (168 hours). With no other constraint except equal spacing, we could have the first event start at time t=0 and the last one end at time t=168. The others would be 168/(10-1) =~ 18.7 hours apart. Let's call this d_ideal.
We don't want the d's to be much less than d_ideal (18.7) or much greater than d_ideal.
That is, in the objective, add Cost_dev * (abs(d_ideal - dj))
(You have to create two variables for each d, d+ and d-, to handle the absolute values in the objective function; see the sketch below.)
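Here is a sketch of Option 1 using CPLEX's docplex Python API (assuming you can drive CPLEX from Python). The horizon, the number of events and the cost weight are example values, and the t[j+1] >= t[j] constraints simply fix an arbitrary order of the time variables to sidestep the sequencing issue you mentioned:

# Option 1 sketch: penalize each gap's deviation from d_ideal via d+ / d- variables.
from docplex.mp.model import Model

n, horizon = 10, 168.0
d_ideal = horizon / (n - 1)
cost_dev = 1.0

m = Model(name="even_spacing")
t = m.continuous_var_list(n, lb=0, ub=horizon, name="t")
dev_plus = m.continuous_var_list(n - 1, lb=0, name="dev_plus")
dev_minus = m.continuous_var_list(n - 1, lb=0, name="dev_minus")

for j in range(n - 1):
    m.add_constraint(t[j + 1] >= t[j])  # fix an arbitrary order of the t's
    # d_j - d_ideal = dev_plus - dev_minus; minimizing their sum penalizes |d_j - d_ideal|
    m.add_constraint((t[j + 1] - t[j]) - d_ideal == dev_plus[j] - dev_minus[j])

# ... your real hard constraints on the events would go here ...
m.minimize(cost_dev * m.sum(dev_plus[j] + dev_minus[j] for j in range(n - 1)))

if m.solve():
    print([round(v.solution_value, 1) for v in t])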
Option 1a
In the method above, all deviations are priced the same, so the model doesn't care whether it has one deviation of 3 hours or two deviations of 1.5 hours each. The way to handle that is to add step-wise costs: a small cost for small deviations, with a very high cost for large deviations. (You make them step-wise linear so that the formulation stays an LP/IP.)
Option 2: Max-min
This is along the lines of your minimize-the-standard-deviation-of-the-d's idea. We want to maximize each d (increase the inter-event separation),
but we would also hugely punish (big cost) whichever d value is the greatest. (In English: we don't want to let any single d get too large.)
This is the MinMax idea. (Minimize the maximum d value, but also maximize the individual d's.)
Option 3: Two LPs: Solve first, then move the events around in a second LP
One drawback of layering on more and more of these side constraints is that the formulation becomes complicated.
To address this, I have seen two (or more) passes used. You solve the base LP first and assign events, and then in another LP you address the issue of uniformly distributing the times.
The goal of the second LP is to move the events around without breaking any hard constraints.
Option 3a: Choose one of many "copies"
To achieve this, we use the following idea:
We allow multiple possible time slots for an event, and make the model select one.
The Event e1 (currently assigned to time t1) is copied into (say) 3 other possible slots.
e11 + e12 + e13 + e14 = 1
The second model can choose to move the event to a "better" time slot, or leave it be. (The old solution is always feasible.)
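A docplex sketch of the "copies" idea; the candidate slot times are made up, and the objective is only a placeholder so the snippet solves on its own:

# Option 3a sketch: binary variables pick exactly one candidate time slot for event e1.
from docplex.mp.model import Model

candidate_slots = [10.0, 14.5, 19.0, 23.5]   # hypothetical times: original slot plus 3 copies

m = Model(name="choose_copy")
pick = m.binary_var_list(len(candidate_slots), name="e1_copy")

m.add_constraint(m.sum(pick) == 1)           # e11 + e12 + e13 + e14 = 1
t_e1 = m.sum(pick[i] * candidate_slots[i] for i in range(len(candidate_slots)))

# ... hard constraints and the spacing objective involving t_e1 would go here ...
m.maximize(t_e1)                             # placeholder objective so the model solves
if m.solve():
    print([int(v.solution_value) for v in pick])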
The reason you are not seeing much in the CPLEX manuals is that these are all formulation ideas. If you search for job- or event-scheduling using LPs, you will come across a few PDFs that might be helpful.