What is P99 latency? - web-services

What does P99 latency represent? I keep hearing about it in discussions about an application's performance, but I couldn't find a resource online that explains it.

It's the 99th percentile. It means that 99% of the requests should be faster than the given latency. In other words, only 1% of the requests are allowed to be slower.

We can explain it through an analogy: if 100 students are running a race, then 99 of them should complete the race within the "latency" time.

Imagine that you are collecting performance data of your service and the below table is the collection of results (the latency values are fictional to illustrate the idea).
Latency | Number of requests
1s      | 5
2s      | 5
3s      | 10
4s      | 40
5s      | 20
6s      | 15
7s      | 4
8s      | 1
The P99 latency of your service is 7s. Only 1% of the requests take longer than that. So, if you can decrease the P99 latency of your service, you increase its performance.
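To make this concrete, here is a minimal sketch in Python that expands the buckets above into individual samples (the same fictional data) and computes the p99 with a nearest-rank percentile; the `percentile` helper is just illustrative, not a standard library function.

```python
# A quick sketch using the same fictional data as the table above.
import math

# Expand the buckets into 100 individual latency samples (seconds).
latencies = ([1] * 5 + [2] * 5 + [3] * 10 + [4] * 40 +
             [5] * 20 + [6] * 15 + [7] * 4 + [8] * 1)

def percentile(samples, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(len(ordered) * p / 100))
    return ordered[rank - 1]

print(percentile(latencies, 99))  # 7 -> 99% of the requests finished in 7s or less
```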

Let's take an example:
Request latency:
min: 0.1
max: 7.2
median: 0.2
p95: 0.5
p99: 1.3
So we can say that for 99 percent of the web requests, the latency was 1.3ms or less (whether that means milliseconds or microseconds depends on how your system's latency measurements are configured).
As #tranmq said, if we decrease the P99 latency of the service, we increase its performance.
It is also worth noting the p95, since a few requests may make the p99 much costlier than the p95, e.g. initial requests that build caches, warm up class objects, initialize threads, etc.
So the p95 may cut out those 5% worst-case scenarios. Still, within that 5%, we don't know what proportion is real noise versus genuinely worst-case inputs.
Finally, we can have roughly 1% noise in our measurements (network congestion, outages, service degradation), so the p99 latency is a good representation of the practical worst case. And, almost always, our goal is to reduce the p99 latency.

Explaining P99 through an analogy:
If 100 horses are running in a race, 99 horses should complete the race in less than or equal to the "latency" time. Only 1 horse is allowed to finish the race in more than the "latency" time.
That means if the P99 is 10ms, 99 percent of the requests should have a latency less than or equal to 10ms.

If the p99 value is 1ms, it means that 99 out of 100 requests take less than 1ms, and 1 request takes about 1ms or more.

Related

How does OpenMP actually reduce clock cycles?

It might be a silly question, but with OpenMP you can distribute the operations across all the cores your CPU has. Of course, it is going to be faster 99% of the time because you went from a single core doing N operations to K cores doing the same amount of operations at the same time.
Despite this, the total amount of clock cycles should be the same, right? Because the number of operations is the same. Or am I wrong?
This question boils down more or less to the difference between CPU time and elapsed time. Indeed, we see here, more often than not, questions that start with "my code doesn't scale, why?", for which the first answer is "How did you measure the time?" (a quick search will turn up many such examples).
But to illustrate how things work, let's imagine you have a fixed-size problem for which you have an algorithm that is perfectly parallelized. You have 120 actions to do, each taking 1 second. Then, 1 CPU core would take 120s, 2 cores would take 60s, 3 cores 40s, etc.
That is the elapsed time that is decreasing. However, 2 cores, running for 60 seconds in parallel, will consume 120s of CPU time. This means that the overall number of clock cycles won't have reduced compared to having only one CPU core running.
In summary, for a perfectly parallelized problem, you expect to see your elapsed time scaling down perfectly with the number of cores used, and the CPU time to remain constant.
In reality, what you often see is the elapsed time scaling down less than expected, due to parallelization overheads and/or imperfect parallelization. At the same time, you see the CPU time slightly increasing with the number of cores used, for the same reasons.
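A rough sketch of this distinction, using Python's multiprocessing rather than OpenMP purely to show the accounting: the wall-clock (elapsed) time shrinks as workers are added, while the summed per-process CPU time stays roughly constant. The `burn` function and the task sizes are made up for illustration.

```python
# Elapsed time drops as workers are added; total CPU time does not.
import time
from multiprocessing import Pool

def burn(n):
    """CPU-bound busy work; returns the CPU time this call consumed."""
    start = time.process_time()
    total = 0
    for i in range(n):
        total += i * i
    return time.process_time() - start

if __name__ == "__main__":
    tasks = [2_000_000] * 8                      # 8 identical chunks of work
    for workers in (1, 2, 4):
        wall_start = time.perf_counter()
        with Pool(workers) as pool:
            cpu_times = pool.map(burn, tasks)    # distribute the chunks
        wall = time.perf_counter() - wall_start
        print(f"{workers} worker(s): elapsed {wall:.2f}s, summed CPU {sum(cpu_times):.2f}s")
```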
I think the answer depends on how you define the total amount of clock cycles. If you define it as the sum of all the clock cycles from the different cores then you are correct and there will not be fewer clock cycles. But if you define it as the amount of clock cycles for the "main" core between initiating and completing the distributed operations then it is hopefully fewer.

Redshift CPU utilisation is 100 percent most of the time

I have a 96 vCPU Redshift ra3.4xlarge 8-node cluster, and the CPU utilisation is at 100 percent most of the time. It was a dc2.large 3-node cluster before; that was also always at 100 percent, which is why we moved up to ra3. We are doing most of our computation on Redshift, but the data is not that much! I read somewhere that no matter how much compute you add, unless you increase it significantly there will only be a slight improvement in the computation. Can anyone explain this?
I can give it a shot. Having 100% CPU for long stretches of time is generally not a good (optimal) thing in Redshift. You see, Redshift is made for performing analytics on massive amounts of structured data. To do this it utilizes several resources - disks/disk IO bandwidth, memory, CPU, and network bandwidth. If your workload is well matched to Redshift, your utilization of all these things will average around 60%. Sometimes CPU bound, sometimes memory bound, sometimes network bandwidth bound, etc. Lots of data being read means disk IO bandwidth is at a premium; lots of redistribution of data means network IO bandwidth is the constraint. If you are using all these factors above 50% capacity, you are getting what you paid for. Once any of these factors gets to 100%, there is a significant drop-off in performance, as working around the oversubscribed item steals performance.
Now you are in a situation where you are seeing 100% for a significant portion of the operating time, right? This means you have all these other attributes you have paid for but are not using, AND inefficiencies are being incurred to manage through this (though of all the factors, high CPU has the least overhead). The big question is why.
There are a few possibilities, but the most likely, in my experience, is inefficient queries. An example might be the best way to explain this. I've seen queries that are intended to find all the combinations of certain factors from several tables. So they cross join these tables, but this produces lots of repeats, so they add DISTINCT - problem solved. But this still creates all the duplicates and then reduces the set down. All the work is being done and most of the results are thrown away. However, if they pared down the factors in the tables first, and then cross joined them, the total work would be significantly lower. This example will do exactly what you are seeing: high CPU as it spins making repeated combinations and then throwing most of them away.
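A rough sketch of that "build everything, then de-duplicate" pattern, in plain Python rather than SQL, just to show how much work gets thrown away; the column values are made up.

```python
# Hypothetical illustration of the "fat in the middle" anti-pattern:
# cross join first and de-duplicate later vs. pare down the inputs first.
from itertools import product

color = ["red", "blue", "red"] * 1000        # a column with many repeated values
size = ["small", "large", "small"] * 1000    # another repetitive column

# Wasteful: build all 9,000,000 combinations, then "DISTINCT" them down to 4.
wasteful = set(product(color, size))

# Cheaper: de-duplicate each side first (2 values each), then combine (4 rows).
pared = set(product(set(color), set(size)))

assert wasteful == pared   # identical result, vastly different amount of work
```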
If you have many of this type of "fat in the middle" query, where lots of extra data is made and immediately reduced, you won't get a lot of benefit from adding CPU resources. Things will get 2X faster with 2X the cluster size, but you are buying 2X of all these other resources that aren't helping you. You would expect that buying 2X CPU and 2X memory and 2X disk IO etc. would give you much more than a 2X improvement. Being constrained on one thing makes scaling costly. Also, you are unlikely to see the CPU utilization come down, as your queries just "spin the tires" of the CPU. More CPUs will just mean you can run more queries, resulting in more tire-spinning.
Now the above is just my #1 guess based on my consulting experience. It could be that your workload just isn't right for Redshift. I've seen people try to put many small database problems into Redshift, thinking that since it's powerful it must be good at this too. They turn up the slot count to try to pump more work into Redshift but just create more issues. Or I've seen people try to run transactional workloads. Or ... If you have the wrong tool for the job, it may not work well. One 6-ton dump truck isn't the same thing as a 50-motorcycle delivery team - each has its purpose, but they aren't interchangeable.
Another possibility is that you have a very unusual workload but Redshift is still the best tool for the job. You don't need all the strengths of Redshift, but that's ok; you are getting the job done at an appropriate cost. In this case, 100% CPU is just how your workload uses Redshift. It's not a problem, just reality. Now, I doubt this is the case, but it is possible. I'd want to be sure I'm getting all the value from the money I'm spending before assuming everything is ok.

Can I use the block height to measure the passage of a year based on the average block time in RSK and Ethereum?

I want to build a Solidity smart contract in RSK and Ethereum that pays dividends every year.
Should I use the block time, or can I rely on the block number, assuming the current average inter-block time in RSK and Ethereum?
RSK and Ethereum have trunk blocks, which are chained and executed, and uncle blocks (now called ommers), which are referenced but not executed. Both RSK and Ethereum have difficulty adjustment functions that try to maintain a target density of blocks (counting both trunk blocks and ommers); in other words, a fixed number of blocks mined per time period. The adjustment functions in RSK and Ethereum are not equal, but both target a block density, not an inter-block time in the chain. Therefore, if the mining network produces a higher number of ommer blocks, the number of trunk blocks created over a period decreases, and the average trunk inter-block time increases.
In the case of Ethereum, the ommer rate has oscillated between 5% and 40% over the last 5 years, but in the last 2 years it has stayed relatively stable between 4% and 8%. This translates to roughly a ±2% error when measuring time based on block count. However, in Ethereum the "difficulty bomb" has affected the average block time much more than the ommer rate. The average block time is ~14 seconds now, but it has peaked at 30, 20 and 17 seconds at different times. Therefore, in Ethereum, the number of blocks should not be used to measure long periods of time. It may be used only for short periods, no longer than a month. More importantly, if Ethereum switches to PoS, the average block interval will decrease to 12 seconds at that point.
Here we show the Ethereum ommer rate:
(source: https://ycharts.com/indicators/ethereum_uncle_rate)
And this is Ethereum average block time:
(source: https://ycharts.com/indicators/ethereum_average_block_time)
The spikes are caused by the difficulty bomb and the abrupt decays by hard-forks that delayed the bomb.
In RSK, most miners are configured to minimize mining pool bandwidth and create a high number of ommers. This is permitted and encouraged by design. They can also be configured to minimize the number of ommers, and consume more bandwidth. RSK targets approximately a density of 2 blocks every 33 seconds, and currently one block is an ommer, and the other is part of the trunk. If the RSK/Bitcoin miners decide in the future to switch to the ommer-minimizing mode, almost no ommers will be created and the average trunk block interval will decrease to 16.5 seconds (to keep the 2 blocks per 33 seconds invariant). This is why, even if the trunk block interval in RSK is currently very stable, in the future (and without prior notice) it can suddenly change from 22 seconds down to 16.5 seconds. This makes the block number an unreliable source for computing the time for values such as the interest rate.
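A back-of-the-envelope sketch of why this matters for a yearly dividend: if a contract hard-codes "blocks per year" from one average trunk interval and the real interval later changes, the "year" elapses at the wrong time. The 22s and 16.5s figures are the RSK intervals mentioned above; the code is just illustrative arithmetic, not contract code.

```python
# Illustrative arithmetic only; 22s and 16.5s are the RSK figures cited above.
SECONDS_PER_YEAR = 365 * 24 * 3600

def blocks_per_year(avg_block_interval_s):
    return round(SECONDS_PER_YEAR / avg_block_interval_s)

assumed = blocks_per_year(22.0)      # "one year" hard-coded at deployment time
print("blocks assumed per year:", assumed)

# If the trunk interval later drops to 16.5s, that many blocks arrive sooner:
elapsed_years = assumed * 16.5 / SECONDS_PER_YEAR
print("actual elapsed time:", round(elapsed_years, 2), "years")  # ~0.75 years
```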
On the other hand, the block time cannot be easily forged, because nodes check that a block's timestamp is not in the future and not prior to the parent block's timestamp. Also, RSK has a consensus rule that ties the RSK timestamp to the Bitcoin timestamp, which makes cheating extremely expensive, as back-dated or forward-dated Bitcoin blocks produced by merge-mining would be invalid.
Here is the RSK average block time and average uncle rate from June 2018 to March 2021. The X-axis shows the block number.
Each dot in the chart corresponds to a day. We can see that the block interval is highly correlated to the uncle rate.
The EVM opcode NUMBER (which is used to obtain the block height) returns the number of trunk blocks, not considering ommers. As a consequence, the value returned cannot be used to count all types of blocks. However, a new opcode OMMERCOUNT could be added, to query the total number of ommers referenced up to the current block. Together with NUMBER, these opcodes could be used to better approximate the passage of time.

Understanding FMA performance

I would like to understand how to compute FMA performance. If we look into the description here:
https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm256_fmadd_ps&expand=2520,2520&techs=FMA
for the Skylake architecture the instruction has Latency=4 and Throughput(CPI)=0.5, so the overall performance of the instruction is 4*0.5 = 2 clocks per instruction.
So, as far as I understand, if the max (turbo) clock frequency is 3 GHz, then on a single core I can execute 1,500,000,000 instructions in one second.
Is that right? If so, what could be the reason that I am observing slightly higher performance?
A throughput of 0.5 means that the processor can execute two independent FMAs per cycle. So at 3 GHz, the maximum FMA throughput is 6 billion per second. You said you are only able to achieve a throughput that is slightly larger than 1.5B. This can happen due to one or more of the following reasons:
The frontend is delivering less than 2 FMA uops every single cycle due to a frontend bottleneck (the DSB path or the MITE path).
There are data dependencies between the FMAs or with other instructions (that are perhaps part of the looping mechanics). This can be stated alternatively as follows: there are fewer than 2 FMAs ready in the RS every single cycle. Latency comes into play when there are dependencies.
Some of the FMAs use memory operands and, if those operands are not found in the L1D cache when needed, a throughput of 2 FMAs per cycle cannot be sustained.
The core frequency becomes less than 3GHz during the experiment. This factor only impacts the throughput per second, not per cycle.
Other reasons depending on how exactly your loop works and how you are measuring throughput.
Latency=4 and Throughput(CPI)=0.5, so the overall performance of the instruction is 4*0.5 = 2 clocks per instruction.
Just working out the units gives cycles²/instr, which is strange and I have no interpretation for it.
The throughput listed here is really a reciprocal throughput, in CPI, so 0.5 cycles per instruction or 2 instructions per cycle. These numbers are related by being each others reciprocal, the latency has nothing to do with it.
There is a related calculation that does involve both latency and (reciprocal) throughput, namely the product of the latency and the throughput: 4 * 2 = 8 (in units of "number of instructions"). This is how many independent instances of the operation can be "in flight" (started but not completed) simultaneously, comparable with the bandwidth-delay product in network theory. This number informs some code design decisions, because it is a lower bound on the amount of instruction-level parallelism the code needs to expose to the CPU in order for it to fully use the computation resources.
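A tiny sketch of the arithmetic from both answers, using the figures quoted in the question (latency 4 cycles, reciprocal throughput 0.5 cycles per instruction) and an assumed 3 GHz clock; nothing here comes from a real measurement.

```python
# Illustrative arithmetic only, using the numbers quoted in the question.
latency_cycles = 4          # cycles until one FMA's result is ready
recip_throughput = 0.5      # cycles per FMA (i.e. 2 independent FMAs per cycle)
freq_hz = 3e9               # assumed 3 GHz core clock

fmas_per_cycle = 1 / recip_throughput               # 2
peak_fmas_per_second = fmas_per_cycle * freq_hz     # 6e9, not 1.5e9
in_flight_needed = latency_cycles * fmas_per_cycle  # 8 independent FMAs

print(f"peak FMA throughput: {peak_fmas_per_second:.1e} per second")
print(f"independent FMAs needed in flight: {in_flight_needed:.0f}")
```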

What is the optimal setting for multipart_threshold and multipart_chunksize while doing S3 multipart upload

Is there any formula available to identify the optimal settings for multipart_threshold and multipart_chunksize? Suppose I am trying to upload a 2 TB file to S3 using multipart upload; what would be the best values for threshold and chunk size? Any script to do multipart upload/download would be appreciated.
There is no single optimal setting. The optimal values are dependent on a number of factors, including the latency and available bandwidth between the system where aws-cli is running and the S3 region in question, the amount of CPU and network capacity on the machine, and the size of the objects.
For some scenarios, no significant difference in overall performance is likely, regardless of configuration.
Perhaps more important than either of the parameters you asked about, is max_concurrent_requests. This defaults to 10.
I have a 100 Mbit/s connection to the Internet and am 80 ms away from the most distant S3 region that I use heavily. Anecdotally, I can tell you that I can upload to this region at about 5 Mbit/s per part, so by default my best case performance with default settings would be about 50 Mbit/s (5 Mbit/s × max_concurrent_requests -- your mileage may vary greatly, 5 Mbit/s is for illustration purposes only).
If I increase max_concurrent_requests I will increase my net transfer speed approximately linearly, until I max out a resource (most likely, my 100 Mbit/s connection). However, increasing max_concurrent_requests beyond total_upload_size ÷ multipart_chunksize would result in no further gains. If I increase max_concurrent_requests to a ridiculously large value, I'll saturate my connection beyond reliability, and packet drops will cause errors, retries, re-transmission, and other inefficiencies -- thus there are diminishing returns on how much I can increase this and have it be beneficial.
Independently, increasing multipart_chunksize will tend to improve my performance and decreasing it will degrade it, but only by small degrees, because I'll spend proportionally less or more time on housekeeping compared to actually transmitting data -- so a larger multipart_chunksize will be better if my connection is fast and clean, and worse if my connection is slow and error-prone. Multipart uploads can recover from failed part-uploads, but the minimum re-transmission unit after a hard failure is multipart_chunksize -- so a larger value increases the amount I have to re-transmit in order to recover from hard errors. (This refers to part requests that fail entirely, not TCP re-transmissions, which are of course much smaller.)
There is a clear interaction among total_object_size, max_concurrent_requests, and multipart_chunksize.
The least "interesting" parameter is multipart_threshold, which is the object size at which the multipart mechanism is even engaged: objects smaller than multipart_threshold will not use multipart. On a fast, clean connection, increasing this value may be advisable, but beyond some threshold, larger values will mean slower transfers, because it will mean we don't use the parallelism allowed by multipart.
For a 2 TB file, any value of multipart_chunksize below 200 MB will have no meaning, because multipart requires that a single file be uploaded in not more than 10,000 distinct parts, and 200 MB × 10,000 is 2 TB. Smaller values may actually be invalid and result in an error, unless aws-cli silently overrides your specification and uses a larger value, because the 10,000 part maximum is imposed by the S3 service, itself. Untested.
For a 2 TB file, multipart_threshold has no impact, because multipart is mandatory for objects exceeding 5 GB total size.
For a 2 TB file, max_concurrent_requests should be set as high as your connection can sustain, for optimal throughput. You will need a mechanism to monitor your bandwidth usage in order to titrate this to an optimal value.
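As for a script: here is a minimal sketch using boto3, assuming the knobs discussed above map onto boto3's TransferConfig (where max_concurrency plays the role of max_concurrent_requests). The bucket, key, file name, and the specific numbers are placeholders to adapt to your own connection.

```python
# Sketch only: bucket, key, file name and the numbers are placeholders.
import boto3
from boto3.s3.transfer import TransferConfig

MB = 1024 ** 2

config = TransferConfig(
    multipart_threshold=64 * MB,   # objects larger than this use multipart
    multipart_chunksize=256 * MB,  # >= ~200 MB keeps a 2 TB upload under 10,000 parts
    max_concurrency=20,            # tune to what your connection can sustain
    use_threads=True,
)

s3 = boto3.client("s3")
s3.upload_file("huge-backup.bin", "example-bucket", "backups/huge-backup.bin",
               Config=config)
```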