Learning an action using PPO reinforcement learning that is also a negative reward? - action

I’m doing RL on a problem where the agent learns 2 actions and my reward = action 1 - action 2. Since action 2 is subtracted, the agent learns to output 0.0 for action 2 (the action space for both actions is between 0.0 and 0.1). Can someone please advise how I can make the agent explore non-zero values for action 2?

Related

Linear Programming - Re-setting a variable based on its cumulative count

Detailed business problem:
I'm trying to solve a production scheduling business problem as below:
I have two plants producing FG A and B respectively.
Both the products consume the same Raw Material x
I need to create a 30 day production schedule looking at the Raw Material availability.
FG A and B can be produced if there is sufficient raw material available on the day.
After every 6 days of production the plant has to undergo maintenance and the production on that day will be zero.
Objective is to maximize the margin looking at the day level Raw material available and adhere to the production constraint (i.e. shutdown after every 6th day)
I need to build a linear program to address the problem below:
Variable y: (binary)
variable z: cumulative of y
When z > 6 then y = 0. I also need to reset the cumulation of z after this point.
Desired output:
How can I express this statement as a MILP constraint? Are there any techniques for solving this problem? Thank you.
I think you can model your maintenance differently. Just forbid any sequences of 7 ones for y. I.e.
y[t-6]+y[t-5]+y[t-4]+y[t-3]+y[t-2]+y[t-1]+y[t] <= 6 for t=1,..,T
This is easier than using your accumulator. Note that the beginning needs some attention: you can use historic data for this. I.e., at t=1, the values for t=0,-1,-2,.. are known.
Your accumulator approach is not inherently wrong. We often use it to model inventory. An inventory capacity is a restriction on how large the accumulated inventory can be.
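As a rough illustration of the rolling-window rule above, here is a minimal Python sketch that checks a given 0/1 schedule against it. This is not a MILP model; the function name `violates_window` and its parameters are made up for this example, and `history` stands in for the known values at t=0,-1,-2,.. mentioned in the answer.

```python
def violates_window(y, history=(), window=7, limit=6):
    """Return the first index t (0-based, within y) at which the last
    `window` days contain more than `limit` production days, else None.

    y       -- planned 0/1 production decisions, one per day
    history -- known 0/1 values for the days before the plan starts
    """
    full = list(history) + list(y)
    offset = len(history)
    for t in range(len(y)):
        end = offset + t + 1            # slice end (exclusive) for day t
        start = max(0, end - window)    # shorter window at the very start
        if sum(full[start:end]) > limit:
            return t
    return None


# Seven production days in a row break the rule on the seventh day.
print(violates_window([1] * 7))
# A maintenance day after every six production days is fine.
print(violates_window([1] * 6 + [0] + [1] * 6))
```

In an actual MILP, the same sum over each 7-day window would be added as one linear constraint per t rather than checked after the fact.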

Netlogo - Creating a global mean for a value that is sometimes not existing

I'm trying to measure the mean value of agents that are performing a certain activity. To calculate this value, I tried to use: set mean-powerdemand mean [ powerdemand ] of agents with [ powerdemand > 0 ]
To give some context: My model concerns charging electric cars. I want to measure the powerdemand of agents that are actually charging their car and thus have a powerdemand > 0. I want to exclude the non-charging agents from this mean calculation, as they would bring down the mean value. However, as there are moments in time (specifically at the start of the run) where no cars are charging, I am getting the error: Can't find mean of a list with no numbers: []
Does someone know a way to work around calculations with agent-own variables and/or lists that are sometimes empty?
I started off by calculating this value in a plot using the same code. Doing so does not prevent me from running the model, but since I want to use it as a reporter in the BehaviorSpace set-up, I want to make a global value of it.
Thanks.

Calculate a success rate depending on a percentage set by a value

I am deeply sorry if this question has been asked and answered already but I couldn't find anything related online. To be honest I'm not even sure what should I search for haha.
So, I want to calculate if a result is a success(true) or a fail(false) depending on a predetermined fixed percentage and a value.
To explain my example:
I want to calculate the success of a mission based on the number of operatives doing it.
Using 1 operative, mission success rate of 50%. (50% to be a success / 50% to be a fail)
Using 2 operatives, mission success rate of 60%. (60% to be a success / 40% to be a fail)
Using 3 operatives, mission success rate of 70%. (70% to be a success / 30% to be a fail)
Is it possible to somehow translate that into a Google Sheets formula?
Thanks for your help guys!
try:
=ARRAYFORMULA(IF(A2:A=1, 0.5,
IF(A2:A=2, 0.6,
IF(A2:A=3, 0.7, ))))
Found this:
=IFS(K2="sh1",IF(RANDBETWEEN(1,100)>40,"YES","NO"),K2="sh2",IF(RANDBETWEEN(1,100)>20,"YES","NO"),K2="sh3",IF(RANDBETWEEN(1,100)>10,"YES","NO"))
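For readers who want to see the logic behind these formulas outside of a spreadsheet, here is a small Python sketch of the same idea: look up the success rate for the number of operatives, then compare it with a random draw. The names `SUCCESS_RATE` and `mission_succeeds` are invented for this example, and the rates are the ones from the question.

```python
import random

# Success probability per number of operatives, as given in the question.
SUCCESS_RATE = {1: 0.5, 2: 0.6, 3: 0.7}


def mission_succeeds(operatives, rng=random):
    """Return True with the probability assigned to `operatives`."""
    return rng.random() < SUCCESS_RATE[operatives]


# Each call is an independent draw, like each recalculation of RANDBETWEEN.
print(mission_succeeds(2))
```

As with RANDBETWEEN in a sheet, the result changes on every evaluation, so it should be stored if the outcome needs to stay fixed.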

How to interpret solution metrics in AWS Personalize?

Can someone help me interpret the AWS Personalize solution version metrics in layman’s terms or, at the very least, tell me what these metrics should ideally look like?
I have no knowledge of Machine Learning and wanted to take advantage of Personalize as it is marketed as a 'no-previous-knowledge-required' ML SaaS. However, the “Solution version metrics” in my solution results seem to require a fairly high level of math knowledge.
My Solution version metrics are as follows:
Normalized discounted cumulative gain
At 5: 0.9881, At 10: 0.9890, At 25: 0.9898
Precision
At 5: 0.1981, At 10: 0.0993, At 25: 0.0399
Mean reciprocal rank
At 25: 0.9833
Research
I have looked through the Personalize Developer's Guide which includes a short definition of each metric on page 72. I also attempted to skim through the Wikipedia articles on discounted cumulative gain and mean reciprocal rank. From reading, this is my interpretation of each metric:
NDCG = Consistency of relevance of recommendations; Is the first recommendation as relevant as the last?
Precision = Relevance of recommendations to user; How relevant are your recommendations to users across the board?
MRR = Relevance of first recommendation in the list versus the others in the list; How relevant is your first recommendation to each user?
If these interpretations are right, then my solution metrics indicate that I am highly consistent about recommending irrelevant content. Is that a valid conclusion?
Alright, my company has Developer Tier Support so I was able to get an answer to this question from AWS.
Answer Summary
The metrics are better the closer they are to '1'. My interpretation of my metrics was pretty much correct but my conclusion was not.
Apparently, these metrics (and Personalize in general) do not take into account how much a user likes an item. Personalize only cares how soon a relevant recommendation gets to the user. This makes sense because if you get the 25th item in a queue and don't like anything you've seen, you are not likely to continue looking.
Given this, what's happening in my solution is that the first-ish recommendation is relevant but none of the others are.
Detailed Answer from AWS
I will start with relatively easier question first: What are the ideal values for these metrics, so that a solution version can be preferred over another solution version?
The answer to the above question is that for each metric, higher numbers are better. [1] If you have more than one solution version, please prefer the solution version with higher values for these metrics. Please note that you can create a number of solution versions by Overriding Default Recipe Parameters [2] and by using Hyperparameters [3].
The second question: How to understand and interpret the metrics for AWS Personalize Solution version?
I can confirm from my research that the definitions and interpretation provided for these metrics in the case by you are valid.
Before I explain each metric, here is a primer on one of the main concepts in Machine Learning: how are these metrics calculated?
The Model training step during the creation of solution version splits the input dataset into two parts, a training dataset (~70%) and test dataset (~30%). The training dataset is used during the Model training. Once the model is trained, it is used to predict the values for test dataset. Once the prediction is made it is validated against the known (and correct) value in the test dataset. [4]
I researched further to find more resources to understand the concept behind these metrics and also elaborate further an example provided in the AWS documentation. [1]
"mean_reciprocal_rank_at_25"
Let’s first understand Reciprocal Rank:
For example, a movie streaming service uses a solution version to predict a list of 5 recommended movies for a specific user, i.e. A, B, C, D, E.
Once these 5 recommended movies are compared against the actual movies liked by that user (in the test dataset) we find out that only movie B and E are actually liked by the user.
The Reciprocal Rank will only consider the first relevant (correct according to test dataset) recommendation which is movie B located at rank 2 and it will ignore the movie E located at rank 5. Thus the Reciprocal Rank will be 1/2 = 0.5
Now let’s expand the above example to understand Mean Reciprocal Rank: [5] Let’s assume that we ran predictions for three users and below movies were recommended.
User 1: A, B, C, D, E (user liked B and E, thus the Reciprocal Rank is 1/2)
User 2: F, G, H, I, J (user liked H and I, thus the Reciprocal Rank is 1/3)
User 3: K, L, M, N, O (user liked K, M and N, thus the Reciprocal Rank is 1)
The Mean Reciprocal Rank will be the sum of all the individual Reciprocal Ranks divided by the total number of queries run for predictions, which is 3.
(1/2 + 1/3 + 1)/3 = (0.5+0.33+1)/3 = (1.83)/3 = 0.61
In case of AWS Personalize Solution version metrics, the mean of the reciprocal ranks of the first relevant recommendation out of the top 25 recommendations over all queries is called “mean_reciprocal_rank_at_25”.
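The three-user example above can be reproduced with a short Python sketch. This only illustrates the arithmetic described in the answer and is not how Personalize computes the metric internally; the function names are made up for this example.

```python
def reciprocal_rank(recommended, liked):
    """1 / position of the first recommended item the user liked, else 0."""
    for position, item in enumerate(recommended, start=1):
        if item in liked:
            return 1.0 / position
    return 0.0


def mean_reciprocal_rank(queries):
    """Average the reciprocal rank over (recommended, liked) pairs."""
    return sum(reciprocal_rank(r, l) for r, l in queries) / len(queries)


queries = [
    (["A", "B", "C", "D", "E"], {"B", "E"}),       # first hit at rank 2
    (["F", "G", "H", "I", "J"], {"H", "I"}),       # first hit at rank 3
    (["K", "L", "M", "N", "O"], {"K", "M", "N"}),  # first hit at rank 1
]
print(round(mean_reciprocal_rank(queries), 2))  # 0.61
```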
"precision_at_K"
It can be stated as the capability of a model for delivering the relevant elements with the least amount of recommendations.
The concept of precision is described in the following free video available at Coursera. [6] A very good article on the same topic can be found here. [7]
Let’s consider the same example: a movie streaming service uses a solution version to predict a list of 5 recommended movies for a specific user, i.e. A, B, C, D, E. Once these 5 recommended movies are compared against the actual movies liked by that user (correct values in the test dataset) we find out that only movies B and E are actually liked by the user.
The precision_at_5 will be 2 correctly predicted movies out of total 5 movies and can be stated as 2/5=0.4
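The same calculation as a minimal Python sketch, using the movie example from the answer; `precision_at_k` is a name chosen for this illustration, not a Personalize API.

```python
def precision_at_k(recommended, liked, k):
    """Fraction of the top-k recommendations that the user actually liked."""
    hits = sum(1 for item in recommended[:k] if item in liked)
    return hits / k


print(precision_at_k(["A", "B", "C", "D", "E"], {"B", "E"}, 5))  # 0.4
```

Note that with a single relevant item per user, precision at K cannot exceed 1/K, which is the pattern visible in the metrics posted in the question (about 0.2 at 5, 0.1 at 10 and 0.04 at 25).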
"normalized_discounted_cumulative_gain_at_K"
This metric uses the concept of logarithms and the logarithmic scale to assign a weighting factor to relevant items (correct values in the test dataset). The full description of logarithms and the logarithmic scale is beyond the scope of this document. The main objective of using a logarithmic scale is to compress wide-ranging quantities into a much smaller range.
discounted_cumulative_gain_at_K
Let’s consider the same example: a movie streaming service uses a solution version to predict a list of 5 recommended movies for a specific user, i.e. A, B, C, D, E. Once these 5 recommended movies are compared against the actual movies liked by that user (correct values in the test dataset) we find out that only movies B and E are actually liked by the user.
To produce the discounted cumulative gain (DCG) at 5, each relevant item is assigned a weighting factor (using a logarithmic scale) based on its position in the top 5 recommendations. The value produced by this formula is called the “discounted value”.
The formula is 1/log(1 + position)
As B is at position 2 so the discounted value is = 1/log(1 + 2)
As E is at position 5 so the discounted value is = 1/log(1 + 5)
The discounted cumulative gain (DCG) is calculated by adding the discounted values for both relevant items: DCG = ( 1/log(1 + 2) + 1/log(1 + 5) )
normalized_discounted_cumulative_gain_at_K
First of all, what is “ideal DCG”?
In the above example the ideal predictions should look like B, E, A, C, D. Thus, in the ideal case, the relevant items are at positions 1 and 2. To produce the “ideal DCG” at 5, each relevant item is assigned a weighting factor (using a logarithmic scale) based on its position in the top 5 recommendations. The value produced by this formula is called the “discounted value”.
The formula is 1/log(1 + position).
As B is at position 1 so the discounted value is = 1/log(1 + 1)
As E is at position 2 so the discounted value is = 1/log(1 + 2)
The ideal DCG is calculated by adding the discounted values for both relevant items: ideal DCG = ( 1/log(1 + 1) + 1/log(1 + 2) )
The normalized discounted cumulative gain (NDCG) is the DCG divided by the “ideal DCG”.
DCG / ideal DCG = (1/log(1 + 2) + 1/log(1 + 5)) / (1/log(1 + 1) + 1/log(1 + 2)) = 0.6241
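The worked NDCG example can be checked with a few lines of Python. The answer does not say which logarithm base is used, so base 2 is an assumption here; the NDCG ratio itself does not depend on the base, because the same factor appears in the DCG and the ideal DCG. The function names are invented for this sketch.

```python
import math


def dcg_at_k(recommended, liked, k):
    """Sum of 1/log2(1 + position) over liked items in the top k."""
    return sum(
        1.0 / math.log2(1 + position)
        for position, item in enumerate(recommended[:k], start=1)
        if item in liked
    )


def ndcg_at_k(recommended, liked, k):
    """DCG divided by the DCG of an ideal ordering (liked items first)."""
    ideal_hits = min(len(liked), k)
    ideal = sum(1.0 / math.log2(1 + p) for p in range(1, ideal_hits + 1))
    return dcg_at_k(recommended, liked, k) / ideal if ideal else 0.0


print(round(ndcg_at_k(["A", "B", "C", "D", "E"], {"B", "E"}, 5), 4))  # 0.6241
```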
I hope the information provided above is helpful in understanding the concept behind these metrics.
[1] https://docs.aws.amazon.com/personalize/latest/dg/working-with-training-metrics.html
[2] https://docs.aws.amazon.com/personalize/latest/dg/customizing-solution-config.html
[3] https://docs.aws.amazon.com/personalize/latest/dg/customizing-solution-config-hpo.html
[4] https://medium.com/#m_n_malaeb/recall-and-precision-at-k-for-recommender-systems-618483226c54
[5] https://www.blabladata.com/2014/10/26/evaluating-recommender-systems/
[6] https://www.coursera.org/lecture/ml-foundations/optimal-recommenders-4EQc2
[7] https://medium.com/#bond.kirill.alexandrovich/precision-and-recall-in-recommender-systems-and-some-metrics-stuff-ca2ad385c5f8

Math Expression on AWS Cloudwatch metrics is not giving expected output

I have created two metrics (m1 and m2) on my logs, each of which gives me the sum of matches for a filter pattern. I wanted to add a math expression that sums these two metrics, so I added SUM([m1,m2]), but it is not giving me the actual sum. Please refer to the snapshot below.
I tried to add expressions as m1+m2 but still no luck. One thing I tried, m1 + 2 is giving me exact sum as 5. Not sure if anything is missing here.
Update (2019-07-18):
Adding stacked snapshot,
The SUM() function sums up values per datapoint. At your last datapoint you have the value 2 for Completed and no value for Failed, so the sum is 2 + 0 = 2. The Number widget, on the other hand, displays the last value returned, which for the Failed count is 3, but that 3 didn't happen in the last observed time period; it happened before.
You can do a few things here:
Update the metric filter on the logs to emit the value 0 as default if no Failed events are encountered.
Add a new expression to your graph, FILL(m1, 0), with ID e3 for example, which will give you a continuous line with zeros when there are no failures and the number of failures otherwise. Then you can update your SUM expression to be SUM([m2, e3]).
You can do this on both of your metrics, so you don't have gaps in any of them. This will make the graphing and alarming more consistent and intuitive.
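To make the difference between a per-datapoint sum and the "last value" shown by a Number widget concrete, here is a toy Python model. It does not call CloudWatch; the datapoints are hypothetical (chosen to match the 2 and 3 from the question), and `fill`, `sum_per_datapoint` and `last_value` are illustrative helpers, not CloudWatch functions.

```python
# Hypothetical datapoints keyed by period index. A missing key means the
# metric reported nothing for that period.
completed = {0: 1, 1: 2, 2: 2}
failed = {0: 3}  # no Failed events after period 0


def fill(series, periods, default=0):
    """Mimic the idea of FILL(m, 0): substitute a default for missing periods."""
    return {t: series.get(t, default) for t in periods}


def sum_per_datapoint(*series):
    """Add up whatever values exist in each period."""
    periods = sorted(set().union(*series))
    return {t: sum(s[t] for s in series if t in s) for t in periods}


def last_value(series):
    """What a last-value display would show for one metric."""
    return series[max(series)]


print(sum_per_datapoint(completed, failed)[2])      # 2, not 5
print(last_value(completed) + last_value(failed))   # 5, mixes two periods
print(fill(failed, completed))                      # {0: 3, 1: 0, 2: 0}
```

The 5 only appears when the latest Completed value is combined with an older Failed value, which is why the per-datapoint expression never shows it.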