'Where' clause with graphite queries

'Where' clause with graphite queries - mapreduce

I hold a lot of time-series metric data in graphite. Let's assume I have 2 different metrics X and Y that represent the prices of two different products.
I would now like to query the data from an app and do something like this (of course, that's a pseudo-sql):
Select all points of metric X where value is smaller than value of metric Y during a time frame
I couldn't find any reasonable way, without writing my own script or some map-reduce job, to do it.
I could, for example, plot the graphs and try to understand it visually pretty easily. But it won't be usable for an app to use.
I also thought about using some functions like currentBelow and currentAbove but it doesn't look like I can provide two different series to compare but only one specific integer per the entire period of time.

Disclaimer: I hope there is some better solution ;).
The solution is:
Metrics:
product.X.price
product.Y.price
create a filter mask - values 0 is not greater, 1 is greater
diffSeries - to set bound 0 - above points we want to include, below to exclude.
diffSeries(product.X.price, product.Y.price)
removeBelowValue - to remove all excluded - below 0,
removeBelowValue(diffSeries(product.X.price, product.Y.price), 0)
divideSeries - divide above series by itself to create a mask of 0 and 1 - fortunately in graphite's functions 0/0 = 0 (sic!),
divideSeries(removeBelowValue(diffSeries(product.X.price, product.Y.price), 0), removeBelowValue(diffSeries(product.X.price, product.Y.price), 0))
filter out using prepared mask with multiplySeries, since
0 * datapoint = 0
1 * datapoint = datapoint
multiplySeries(product.X.price, divideSeries(removeBelowValue(diffSeries(product.X.price, product.Y.price), 0), removeBelowValue(diffSeries(product.X.price, product.Y.price), 0)))

Related

Linear Programming - Re-setting a variable based on it's cumulative count

Detailed business problem:
I'm trying to solve a production scheduling business problem as below:
I have two plants producing FG A and B respectively.
Both the products consume the same Raw Material x
I need to create a 30 day production schedule looking at the Raw Material availability.
FG A and B can be produced if there is sufficient raw material available on the day.
After every 6 days of production the plant has to undergo maintenance and the production on that day will be zero.
Objective is to maximize the margin looking at the day level Raw material available and adhere to the production constraint (i.e. shutdown after every 6th day)
I need to build a linear programming to address the below problem:
Variable y: (binary)
variable z: cumulative of y
When z > 6 then y = 0. I also need to reset the cumulation of z after this point.
Desired output:
How can I build the statement to MILP constraint. Are there any techniques for solving this problem. Thank you.

I think you can model your maintenance differently. Just forbid any sequences of 7 ones for y. I.e.
y[t-6]+y[t-5]+y[t-4]+y[t-3]+y[t-2]+y[t-1]+y[t] <= 6 for t=1,..,T
This is easier than using your accumulator. Note that the beginning needs some attention: you can use historic data for this. I.e., at t=1, the values for t=0,-1,-2,.. are known.
Your accumulator approach is not inherently wrong. We often use it to model inventory. An inventory capacity is a restriction on how large the accumulated inventory can be.

How to formulate discrete-time resource scheduling into problem?

I'm new to linear programming and trying to develop an ILP model around a problem I'm trying to solve.
My problem is analogous to a machine resource scheduling problem. I have a set of binary variables to represent paired-combinations of machines with a discrete-time grid. Job A takes 1 hour, Job B takes 1 hr and 15 minutes, so the time grid should be in 15 minute intervals. Therefore Job A would use 4 time units, and Job B would use 5 time units.
I'm having difficulty figuring out how to express a constraint such that when a job is assigned to a machine, the units it occupies are sequential in the time variable. Is there an example of how to model this constraint? I'm using PuLP if it helps.
Thanks!

You want to implement the constraint:
x(t-1) = 0 and x(t) = 1 ==> x(t)+...+x(t+n-1) = n
One way is:
x(t)+...+x(t+n-1) >= n*(x(t)-x(t-1))
Notes:
you need to repeat this constraint for each t.
A slightly better version is:
x(t+1)+...+x(t+n-1) >= (n-1)*(x(t)-x(t-1))
There is also a disaggregated version of this constraint that may help performance (depending on the solver: some solvers can do this disaggregation automatically).
Things can become interesting near the beginning and end of the planning period. E.g. machine started at t=-1.
Update:
A different approach is just to limit the "start" of a job to 1. I.e. allow only the combination
x(j,t-1) = 0 and x(j,t) = 1
for a given job j. This can be handled in a similar way:
start(j,t) >= x(j,t)-x(j,t-1)
sum(t, start(j,t)) <= 1
0 <= start(j,t) <= 1

GCP Console: How are percentile charts calculated?

I do not understand how the charts that show percentiles are calculated inside the Google Cloud Platform Monitoring UI.
Here is how I am creating the standard chart:
Example log events
Creating a log-based metric for request durations
https://cloud.google.com/logging/docs/logs-based-metrics/distribution-metrics
https://console.cloud.google.com/logs/viewer
Here I have configured a histogram of 20 buckets, starting from 0, each bucket taking 100ms.
0 - 100,
100 - 200,
... until 2 seconds
Creating a chart to show percentiles over time
https://console.cloud.google.com/monitoring/metrics-explorer
I do not understand how these histogram buckets work with "aggregator", "aligner" and "alignment period".
The UI forces using an "aligner" and "alignment period".
Questions
A. If I am trying to compute percentiles, why would I want to sum all of my response times every "alignment period"?
B. Do the histogram buckets configured for the log-based metric affect these sums?

I've been looking into the same question for a couple of days and found the Understanding distribution percentiles section in official docs quite helpful.
The percentile value is a computed value. The computation takes into account the number of buckets, the width of the buckets, and the total count of samples. Because the actual measured values aren't known, the percentile computation can't rely on this data.
They have a good example with buckets [0, 1) [1, 2) [2, 4) [4, 8) [8, 16) [16, 32) [32, 64) [64, 128) [128, 256) and only one measurement in the last bucket [128, 256) (none in all other buckets).
You use the bucket counts to determine that the [128, 256) bucket contains the 50th percentile.
You assume that the measured values within the selected bucket are uniformly distributed and therefore the best estimate of the 50th percentile is the bucket midpoint.
A. If I am trying to compute percentiles, why would I want to sum all of my response times every "alignment period"?
I find the GCP Console UI for Metrics explorer a little misleading/confusing with the wording as well (but maybe it's just me unfamiliar with their terms). The key concepts here are Alignment and Reduction, I think.
The aligner produces a single value placed at the end of each alignment period.
A reducer is a function that is applied to the values across a set of time series to produce a single value.
The difference between the two are horizontal vs. vertical aggregations. With the UI, Aggregator (both primary and secondary) are reducers.
Back to the question, a sum alignment applying before a percentile reducer seems more useful in other use cases than yours. In short, a mean or max aligner may be more useful to your "duration_ms" metric, but they're not available in the dropdown on UI, and to be honest I haven't figured out how to implement them in MQL Editor either. Just referencing from the docs here. There are other aligners that may also be useful, but I'm just gonna leave them out for now.
B. Do the histogram buckets configured for the log-based metric affect these sums?
Same as #Anthony, I'm not quite sure what the question is implying either. Just going to assume you're asking if you can align/reduce log-based metrics using these aligners/reducers, and the answer would be yes. However, you'll need to know what metric type you're using (counter vs distribution) and aggregate them in corresponding ways as you need.

Before we look at your questions, we must understand Histograms.
By using the documentation you had provided in the post, there is a section in the document that explains Histogram Buckets. Looking at this section and reflecting your setup, we can see that you are using the Linear type to specify the boundaries between histogram buckets for distribution metrics.
Furthermore, the Linear type has three values for calculations:
offset value (Start value [a])
width value (Bucket width [b])
I value (Number of buckets [N])
Every bucket has the same width and the boundaries are calculated using the following formula: offset + width x I (Where I = 0,1,2,...,∞).
For example, if the start value is 5, the number of buckets is 4, and the bucket width is 15, then the bucket ranges are as follows:
[-INF, 5), [5, 20), [20, 35), [35, 50), [50, 65), [65, +INF]
Now we understand the formula, we can look at your questions and answer them:
How are percentile charts calculated?
If we look at this documentation on Selecting metrics, we can see that there is a section that speaks about how Aggregation works. I would suggest looking into this part to understand how Aggregation works in GCP
The formula to calculate the Percentile is the following:
R = P / 100 (N + 1)
Where R represents the rank order of the score. P represents the percentile rank. N represents the number of scores in the distribution.
If I am trying to compute percentiles, why would I want to sum all of my response times every "alignment period"?
In the same section, it also explains what the Alignment Period is, but for the most part, the alignment period determines the length of time for subdividing the time series. For example, you can break a time series into one-minute chunks or one-hour chunks. The data in each period is summarized so that a single value represents that period. The default alignment period is one minute.
Although you can set the alignment interval for your data, time series might be realigned when you change the time interval displayed on a chart or change the zoom level.
Do the histogram buckets configured for the log-based metric affect these sums?
I am not too sure on what you are applying here, are you asking if when logs are created, the sums would be altered via by the logs being generated?
I hope this helps!

Use case for "sets of tuple data" in Pyomo

When we specify the data for a set we have the ability to give it tuples of data. For example, we could write in our .dat file the following:
set A : 1 2 3 :=
1 + - -
2 - - +
3 - + +
This would specify that we would have 4 tuples in our set: (1,1), (2,3), (3,2), (3,3)
But I guess that I am struggling to understand exactly why we would want to do this? Furthermore, suppose we instantiated a Set object in our code as:
model.Aset = RangeSet(4, dimen=2)
Would this then specify that our tuples would have the indices 1, 2, 3, and 4?
I am thinking that specifying tuples in our set could potentially be useful when working with some data in which it's important to have a bit of a "spatial" understanding of the problem. But I would be curious to hear from the community what the potential applications of specifying set data this way might be.

The most common place this appears is when you're trying to model edges between nodes in a network. Networks aren't usually completely dense (have edges between every pair of nodes) so it's beneficial to represent just the edges that appear using a sparse set of tuples.

Recoding 0-3 values

I have speech data set so here is how it is coded now:
Hypernasality (0-3)
Speech understandibility (0-3)
Speech Acceptability (0-3)
Where 0 is good 3 is severe deviation from normal speech.
Hypnasality (0 and 1)
Audible Air Emission (0 and 1)
Where 0 is none and 1 is yes
I recoded my data this way:
foreach j in speechunderstandibility speechacceptability hypernasality {
recode `j' (0 = 3) (3 = 0) (1 = 2) (2 = 1), gen (`j'_1)
}
foreach j in hyponasality audibleemission {
recode `j' (0 = 1) (1 = 0), gen (`j'_1)
}
However, when I run my regression it gives me counter-intuitive results.
My dependent variable is speech outcome and beta of interest is cleft severity.
Results after recoding would say" Cleft severity improves speech but cleft surgery decreases it"
If I leave it the way it is coded then all 5 outcomes mentioned above have different outcomes.
I need them to go in one direction so I can build a summary index.

It may be a raw data issue. I would make sure that all the data points were entered correctly, because at some stage of data entry, the entry of the 0-3 may have gotten mixed up. So it may be a mix up during data-entry.
Secondly, if you're really sure of the data entry (this sounds like a data entry issue to me, or like Nick Cox says, a data interpretation issue), then perhaps try using "gen" and "replace" commands to recode your variables inside or outside of a loop.
When I have trouble with a looped command, I dissect each part and raw code it.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js