RPN (Risk Priority Number) of FMEA Analysis and SLO

One of the concepts in FMEA (Failure Mode & Effects Analysis) is the RPN (Risk Priority Number), which decides how to prioritize your actions for addressing those failures.
However, going by just severity, probability, and effectiveness of control seems to leave out a vital component - the SLI. For example, there might be an issue that occurs moderately frequently, has moderate severity, and has moderate control effectiveness: with ratings of 3, 3, and 3 that gives RPN = 27. Let's also say that this failure leads to a downtime of 30 minutes. Now let's say that there is another failure mode with RPN = 15 (ratings 5, 3, 1), but it takes 12 hours to rectify. Going by the FMEA analysis, I would end up worrying more about the first one rather than the second.
How would I capture the availability aspect here, and how does RPN even relate to an SLI (if it even does)?
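To make the comparison concrete, here is a small Python sketch of the two failure modes above; the downtime-weighted score at the end is just one hypothetical way to fold availability into the ranking, not a standard FMEA metric:

# The two failure modes described above.
# RPN = severity x occurrence x detection (on whatever rating scale you use, e.g. 1-5 or 1-10).
failure_modes = [
    {"name": "A", "severity": 3, "occurrence": 3, "detection": 3, "downtime_min": 30},
    {"name": "B", "severity": 5, "occurrence": 3, "detection": 1, "downtime_min": 12 * 60},
]

for fm in failure_modes:
    rpn = fm["severity"] * fm["occurrence"] * fm["detection"]
    # Hypothetical availability-weighted score: scale the RPN by the expected
    # downtime per failure. Not part of standard FMEA.
    weighted = rpn * fm["downtime_min"]
    print(fm["name"], "RPN =", rpn, "downtime-weighted =", weighted)

By plain RPN the first mode (27) outranks the second (15), but once the 30-minute versus 12-hour recovery time is factored in (810 versus 10800) the order flips, which is exactly the tension I'm asking about.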

Related

GCP: Cloud Functions Graphs

When I execute a Cloud Function (CF) on GCP, it shows graphs for 4 parameters. Invocations and Active Instances are easy to understand, but I am unable to make sense of the other graphs, i.e. execution time and memory usage. This is a screenshot of one of our HTTP-triggered CFs. Can someone explain how exactly to make sense of this data? What does CF mean when it says "99th percentile: 882.85"?
Is the 99th percentile good or bad?
It is neither good nor bad; these are statistics about the execution time.
See what a percentile actually means in order to understand the chart.
E.g. 99% of the observed execution times fall below 882.85 ms,
and the remaining 1% are the extreme values that do not fall below it.
Those 882.85 ms might merely be suboptimal, in case the function could possibly run quicker.
It is presented this way so that a few extreme values do not distort the whole statistic.
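For intuition, a small self-contained sketch (with made-up durations, not your Cloud Functions data) of what a 99th percentile is:

import random

# Simulated execution times in ms: mostly fast, plus a few slow outliers.
durations = ([random.gauss(400, 80) for _ in range(990)]
             + [random.uniform(800, 3000) for _ in range(10)])

def percentile(values, p):
    # Nearest-rank percentile: the value below which roughly p% of observations fall.
    ordered = sorted(values)
    index = max(0, int(round(p / 100 * len(ordered))) - 1)
    return ordered[index]

print("mean:", sum(durations) / len(durations))
print("p95 :", percentile(durations, 95))
print("p99 :", percentile(durations, 99))   # analogous to the 882.85 ms figure

The mean gets pulled around by the few slow outliers, while the 99th-percentile value simply says that 99% of invocations finished faster than it.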

Training TensorFlow Inception-v3 on ImageNet on a modest hardware setup

I've been training Inception V3 on a modest machine with a single GPU (GeForce GTX 980 Ti, 6GB). The maximum batch size appears to be around 40.
I've used the default learning rate settings specified in the inception_train.py file: initial_learning_rate = 0.1, num_epochs_per_decay = 30 and learning_rate_decay_factor = 0.16. After a couple of weeks of training the best accuracy I was able to achieve is as follows (About 500K-1M iterations):
2016-06-06 12:07:52.245005: precision # 1 = 0.5767 recall # 5 = 0.8143 [50016 examples]
2016-06-09 22:35:10.118852: precision # 1 = 0.5957 recall # 5 = 0.8294 [50016 examples]
2016-06-14 15:30:59.532629: precision # 1 = 0.6112 recall # 5 = 0.8396 [50016 examples]
2016-06-20 13:57:14.025797: precision # 1 = 0.6136 recall # 5 = 0.8423 [50016 examples]
I've tried fiddling with the settings towards the end of the training session, but couldn't see any improvements in accuracy.
I've started a new training session from scratch with num_epochs_per_decay = 10 and learning_rate_decay_factor = 0.001 based on some other posts in this forum, but I'm sort of grasping in the dark here.
Any recommendations on good defaults for a small hardware setup like mine?
TL;DR: There is no known method for training an Inception V3 model from scratch in a tolerable amount of time on a modest hardware setup. I would strongly suggest retraining a pre-trained model on your desired task.
On a small hardware setup like yours, it will be difficult to achieve maximum performance. Generally speaking for CNNs, the best performance comes with the largest batch sizes possible. This means that for CNNs the training procedure is often limited by the maximum batch size that can fit in GPU memory.
The Inception V3 model available for download here was trained with an effective batch size of 1600 across 50 GPUs -- where each GPU ran a batch size of 32.
Given your modest hardware, my number one suggestion would be to download the pre-trained model from the link above and retrain it for the individual task you have at hand. This would make your life much happier.
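If it helps, here is a minimal retraining (transfer learning) sketch; note it uses today's tf.keras API rather than the original Inception training scripts, and NUM_CLASSES plus the input pipeline are placeholders you would fill in for your own task:

import tensorflow as tf

NUM_CLASSES = 10  # placeholder: number of classes in your own task

# Load Inception V3 with ImageNet weights, dropping the ImageNet classifier head.
base = tf.keras.applications.InceptionV3(weights="imagenet", include_top=False,
                                         input_shape=(299, 299, 3))
base.trainable = False  # freeze the pre-trained features

model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(train_dataset, epochs=5)  # plug in your own tf.data pipeline here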
As a thought experiment (but hardly practical): if you feel especially compelled to exactly match the training performance of the pre-trained model by training from scratch, you could run the following insane procedure on your 1 GPU:
Run with a batch size of 32
Store the gradients from the run
Repeat this 50 times.
Average the gradients from the 50 batches.
Update all variables with the gradients.
Repeat
I am only mentioning this to give you a conceptual sense of what would need to be accomplished to achieve the exact same performance. Given the speed numbers you mentioned, this procedure would take months to run. Hardly practical.
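If it helps to see the idea concretely, here is a framework-agnostic NumPy sketch of that accumulate-and-average loop on a toy linear model (not the actual Inception code):

import numpy as np

rng = np.random.default_rng(0)
w = np.zeros(8)                      # toy model: linear regression weights
true_w = rng.normal(size=8)

def gradient(w, x, y):
    # Gradient of mean squared error for the toy linear model.
    return 2 * x.T @ (x @ w - y) / len(y)

lr, micro_batch, accumulation_steps = 0.1, 32, 50   # 32 x 50 = effective batch of 1600
for step in range(100):
    grads = []
    for _ in range(accumulation_steps):              # 50 forward/backward passes...
        x = rng.normal(size=(micro_batch, 8))
        y = x @ true_w
        grads.append(gradient(w, x, y))              # ...storing each micro-batch gradient
    w -= lr * np.mean(grads, axis=0)                 # one update with the averaged gradient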
More realistically, if you are still strongly interested in training from scratch and doing the best you can, here are some general guidelines:
Always run with the largest batch size possible. It looks like you are already doing that. Great.
Make sure that you are not CPU bound. That is, make sure that the input processing queues are always modestly full as displayed on TensorBoard. If not, increase the number of preprocessing threads or use a different CPU if available.
Re: learning rate. If you are always running synchronous training (which must be the case if you only have 1 GPU), then the higher the batch size, the higher the tolerable learning rate. I would try a series of quick runs (e.g. a few hours each) to identify the highest learning rate which does not lead to NaNs. After you find such a learning rate, knock it down by, say, 5-10% and run with that.
As for num_epochs_per_decay and decay_rate, there are several strategies. The strategy behind 10 epochs per decay with a 0.001 decay factor is to hammer the model for as long as possible until the eval accuracy asymptotes, and then lower the learning rate. This is a simple strategy, which is nice. I would verify in your model monitoring that the eval accuracy does indeed asymptote before you allow the model to decay the learning rate. Finally, the decay factor is a bit ad hoc, but lowering by, say, a power of 10 seems to be a good rule of thumb.
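For reference, assuming the usual staircase exponential-decay schedule (which, if I recall correctly, is what the Inception training script uses), those two parameters interact like this toy sketch:

def decayed_learning_rate(epoch, initial_lr=0.1, epochs_per_decay=30, decay_factor=0.16):
    # Staircase exponential decay: multiply by decay_factor every epochs_per_decay epochs.
    return initial_lr * decay_factor ** (epoch // epochs_per_decay)

for epoch in (0, 29, 30, 60, 90):
    print(epoch, decayed_learning_rate(epoch))
# With epochs_per_decay=10 and decay_factor=0.001 the rate collapses almost
# immediately after the first decay step, which is why that setting behaves so differently.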
Note again that these are general guidelines and others might even offer differing advice. The reason why we can not give you more specific guidance is that CNNs of this size are just not often trained from scratch on a modest hardware setup.
Excellent tips.
There is precedent for training with a setup similar to yours.
Check this out - http://3dvision.princeton.edu/pvt/GoogLeNet/
They trained GoogLeNet, but using Caffe. Still, studying their experience would be useful.

How to measure the latency of a low-latency C++ application

I need to measure the message decoding latency (3 to 5 µs) of a low latency application.
I used the following method:
1. Get time T1
2. Decode the data
3. Get time T2
4. L1 = T2 - T1
5. Store L1 in an array (size = 100000)
6. Repeat the same steps 100000 times.
7. Print the array.
8. Get the 99th and 95th percentiles for the data set.
But I get fluctuations between tests. Can someone explain the reason for this?
Could you suggest any alternative method for this?
Note: The application runs in a tight loop (using 100% of a CPU) and is bound to a CPU via the taskset command.
There are a number of different ways that performance metrics can be gathered, either using code profilers or by using existing system calls.
NC State University has a good resource on the different types of timers and profilers that are available, as well as the appropriate case for using each and some examples, on their HPC website here.
Fluctuations will inevitably occur on most modern systems. Certain BIOS settings related to hyper-threading and frequency scaling can have a significant impact on the performance of certain applications, as can power-consumption and cooling/environmental settings.
Looking at the distribution of results as a histogram and/or fitting them to a Gaussian will also help determine how normal the distribution is and whether the fluctuations are statistical noise or serious outliers. Running additional tests would also be beneficial.
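For the analysis step, here is a small Python sketch (assuming you dump the 100000 measured latencies, one value per line, to a file; latencies.txt is just an example name) that computes the percentiles and a rough text histogram:

# Rough post-processing of measured latencies (one value per line, e.g. nanoseconds).
def load(path="latencies.txt"):
    with open(path) as f:
        return sorted(float(line) for line in f if line.strip())

def percentile(ordered, p):
    return ordered[min(len(ordered) - 1, int(p / 100 * len(ordered)))]

samples = load()
print("min/median/max:", samples[0], percentile(samples, 50), samples[-1])
print("p95:", percentile(samples, 95), "p99:", percentile(samples, 99))

# Crude text histogram to spot outliers or multi-modal behaviour.
lo, hi, bins = samples[0], samples[-1], 20
width = (hi - lo) / bins or 1.0
counts = [0] * bins
for s in samples:
    counts[min(bins - 1, int((s - lo) / width))] += 1
peak = max(counts)
for i, c in enumerate(counts):
    print(f"{lo + i * width:12.1f} | {'#' * (60 * c // peak)}")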

How to get maximum value out of automated testing during a limited period of time?

I have a relatively large project, which up to now had almost no automatic tests.
Now the management has given us several weeks of time to stabilize the product, including test automation.
In order to get the most value out of these X weeks we can spend on automatic testing, I need to know which classes/methods to test first.
How can I prioritize the testing effort (decide which class/method to test now and which later), apart from the approaches listed below?
Calculate the number of dependents for each class (how many other classes use the class, including transitive dependencies). Classes with the greatest number of dependent classes should be tested first.
Find out which classes change most frequently (according to the version control system); a quick way to pull this out of git is sketched after this list. Frequent changes may be a symptom of either lots of bugs or active development in those classes. In both cases, it makes sense to write unit tests for them.
Find out which classes are involved in bug reports from testers and/or customers.
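For the change-frequency heuristic, a quick sketch (assuming the project lives in a git repository; adjust paths and filters to your codebase) of counting how often each file has been touched:

import subprocess
from collections import Counter

# Count how often each file was changed, according to the git history.
log = subprocess.run(
    ["git", "log", "--name-only", "--pretty=format:"],
    capture_output=True, text=True, check=True,
).stdout

changes = Counter(line for line in log.splitlines() if line.strip())
for path, count in changes.most_common(20):
    print(f"{count:5d}  {path}")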
All of your ideas seem good. This article can help you with prioritizing and automating.
Here is a formula for estimating testing effort (a method based on a use-case-driven approach):
Step 1: Count the number of use cases (NUC) of the system.
Step 2: Set the average number of test cases per use case (ATTC) as per the test plan.
Step 3: Estimate the total number of test cases (NTC):
NTC = NUC * ATTC
Step 4: Set the average execution time (AET) per test case (ideally 15 min; depends on your system).
Step 5: Calculate the total execution time (TET):
TET = NTC * AET
Step 6: Calculate the test case creation time (TCCT); usually we take 1.5 times TET:
TCCT = 1.5 * TET
Step 7: Calculate the time for retest case execution (RTCE), i.e. retesting; usually we take 0.5 times TET:
RTCE = 0.5 * TET
Step 8: Set the report generation time (RGT); usually we take 0.2 times TET:
RGT = 0.2 * TET
Step 9: Set the test environment setup time (TEST); this also depends on the test plan.
Step 10: Total estimated time = TET + TCCT + RTCE + RGT + TEST + some buffer... ;)
Here is an example of how it works:
Total number of use cases (NUC): 227
Average test cases per use case (ATTC): 10
Estimated test cases (NTC): 227 * 10 = 2270
Total execution time (TET): 2270 / 4 = 567.5 hr
Time for creating test cases (TCCT): 567.5 * 4/3 = 756.6 hr
Time for retesting (RTCE): 567.5 / 2 = 283.75 hr
Report generation time (RGT) = 100 hr
Test environment setup time (TEST) = 20 hr
Total hours: 1727.85 + buffer
Here 4 is the number of test cases executed per hour, i.e. each test case takes 15 min to execute.
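If you want to play with the numbers, here is the same estimate as a small script. It applies the step formulas literally, so it lands a bit higher than the worked example above (which uses roughly 756.6 h for creation and a flat 100 h for reporting):

def estimate_testing_effort(num_use_cases, test_cases_per_use_case,
                            exec_minutes_per_test_case=15,
                            env_setup_hours=20, buffer_hours=0):
    ntc = num_use_cases * test_cases_per_use_case    # total test cases
    tet = ntc * exec_minutes_per_test_case / 60      # total execution time (hours)
    tcct = 1.5 * tet                                 # test case creation time
    rtce = 0.5 * tet                                 # retesting time
    rgt = 0.2 * tet                                  # report generation time
    return tet + tcct + rtce + rgt + env_setup_hours + buffer_hours

print(estimate_testing_effort(227, 10))   # the example above, formulas applied literally

With 227 use cases and 10 test cases each, this prints about 1836 hours before buffer.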
And since you are going to automate almost from scratch ("up to now had almost no automatic tests"), I think you may consider not only the benefits, but the myths about automated testing too:
Automation cannot replace the human.
Once automated, cost savings are a given.
Every test case can be automated.
Testing can be fully automated.
One test tool is suitable for all tasks.
Automated testing doesn't mean automatic testing... after all, it means computer-aided testing.
Other than the above answer, I would like to add a few more points regarding priority and coverage for automation:
1. Before you start any coding, understand which test cases can give you maximum coverage, like a full flow or end-to-end test scenario.
2. Always maintain the usability of the automation test suite: code those test cases first which can be used all the time and can cover most of the regression suite, rather than concentrating on a specific feature of the application.
3. When you have limited time, try to avoid implementing fancy things like loggers, email summary reports, HTML reports, etc.
4. It is better to have a data-driven framework rather than a keyword-driven or hybrid framework, because you can cover many test cases by just varying your test data in a limited time.
5. Maintain an Excel sheet or CSV for test data and test results; you can use the JXL library to handle Excel sheets in Java.
6. Reading and writing Excel sheets are the most common operations you may want to use in your automation code. You can find a reference for this on this blog: http://testingmindzz.blogspot.in/
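As a rough illustration of the data-driven point (sketched here with Python's csv and unittest modules rather than Java/JXL; the file name, columns, and fake_login helper are all made up):

import csv
import unittest

def load_rows(path="login_test_data.csv"):
    # Each row drives one test case: username, password, expected_result.
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def fake_login(username, password):
    # Stand-in for a call into the real application under test.
    return "success" if password else "failure"

class LoginDataDrivenTest(unittest.TestCase):
    def test_login_rows(self):
        for row in load_rows():
            with self.subTest(user=row["username"]):
                result = fake_login(row["username"], row["password"])
                self.assertEqual(result, row["expected_result"])

if __name__ == "__main__":
    unittest.main()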

Programming for a Financial Application

I've seen this twice now, and I just don't understand it. When calculating a "Finance Charge" for a fixed-rate loan, applications make the user enter all possible loan amounts and their associated finance charges. Even though these charges are calculable (30% of the loan amount), the application makes the user fill out a table like this:
Loan Amount Finance Charge
100 30
105 31.5
etc, with the loan amounts being provided from $5 to $1500 in $5 increments.
We are starting a new initiative to rebuild this system. Is there a valid reason for doing a rate table this way? I would imagine that we should keep a simple interest field, and calculate it every time we need it.
I'm really at a loss as to why anyone would hardcode a table like that instead of calculating...I mean, computers are kind of designed to do stuff like this. Right?
It looks like compound interest where you're generously rounding up. The 100 case after one period is pretty boring, but the 105 case after one period is interesting.
T[0] = FC[105] => 31.5 (new balance 136.5)
T[1] = FC[136.5] => ?
Where does 136.5 hit in the table -- 135 or 140? At 140, you've made an extra $1.05 (30% of 140 is 42.00, versus 40.95 for the exact amount).
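A quick sketch of that arithmetic, assuming a flat 30% charge and the $5 buckets from the question (exact_charge and bucketed_charge are just illustrative helpers):

def exact_charge(amount, rate=0.30):
    return amount * rate

def bucketed_charge(amount, rate=0.30, bucket=5):
    # Round the loan amount up to the next $5 bucket before charging, as the table does.
    rounded = -(-amount // bucket) * bucket   # ceiling to the nearest bucket
    return rounded * rate

principal = 105
charge = exact_charge(principal)              # 31.5, matching the table
compounded = principal + charge               # 136.5 after one period
print(round(bucketed_charge(compounded) - exact_charge(compounded), 2))

Printed, the difference is the extra $1.05 you collect if 136.5 gets rounded up into the 140 bucket.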
Or... If the rates were ever not calculable, that would be one reason for this implementation.
Or... The other reason (and one I would do if annoyed enough) would be that these rates were constantly changing, the developer got fed up with it, and he gave them an interface where the end users could set them on their own. The $5 buckets seem outrageous but maybe they were real jerks...