Why is the transaction throughput of the RChain blockchain calculated per node per CPU, instead of just per CPU?

I get that the RChain blockchain runs faster because of concurrency, and that block proposal can be parallelised (no total ordering).
Nevertheless, I have trouble understanding why the TPS of RChain is quoted per node per CPU. In my view, every node has to replay all transactions, so the TPS calculation should only be per CPU. Do you have an understanding of this?


What's preventing the Ethereum blockchain from getting too big too fast?

I recently started looking at Solidity on the Ethereum blockchain, and I have a question about the size that smart contracts generate.
I'm aware that there is a size limit for the bytecode generated by the contract itself (it cannot exceed 24 KB), and that there's an upper limit for transactions too. However, what I'm curious about is that, since there's no limit on the variables a smart contract stores, what is stopping those variables from getting very large in size? For popular smart contracts like Uniswap, I would imagine they can generate hundreds of thousands of transactions per day, and the state they keep would be huge.
If I understand it correctly, basically every node on the chain stores the whole blockchain, so limiting the size of the blockchain would be very important. Is there anything done to limit the size of smart contracts, which I think is mainly dominated by the state variables they store?
Is there anything done to limit the size of smart contracts, which I think is mainly dominated by the state variables they store?
No. Ethereum will grow indefinitely, and currently there is no viable plan to limit state growth besides keeping transaction costs high and block space at a premium.
You can read more about this in my post Scaling EVM here.
TLDR: The block size limit.
The protocol has a hardcoded limit that prevents the blockchain from growing too fast.
Full Answer
Growth Speed
The protocol measures storage (and computation) in a unit called gas. Each transaction consumes more or less gas depending on what it's doing: an ether transfer costs 21k gas, while a Uniswap v2 swap consumes around 100k gas. Deploying big contracts consumes even more.
The current gas limit is 30 million per block, so the actual number of transactions per block varies even if the blocks are always full (some transactions consume more gas than others).
FYI: this is why transactions per second is a BS, marketing metric in blockchains with rich smart contracts.
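To make that concrete, here is a small sketch (my own illustration; the gas costs are the approximate figures above, and the ~13-second average block time is an assumption that changes over time) of how the same 30M-gas block translates into very different transaction counts and TPS:

```cpp
// Rough sketch: how the 30M gas limit caps throughput differently per tx type.
// Gas costs are approximate; the ~13 s average block time is an assumption.
#include <cstdio>

int main() {
    const double block_gas_limit = 30000000.0;  // current mainnet block gas limit
    const double avg_block_time_s = 13.0;       // assumed average, changes over time

    struct Tx { const char* name; double gas; };
    const Tx txs[] = {
        {"ETH transfer", 21000.0},
        {"Uniswap v2 swap", 100000.0},          // approximate figure
    };

    for (const Tx& tx : txs) {
        double per_block = block_gas_limit / tx.gas;
        double tps = per_block / avg_block_time_s;
        std::printf("%-16s ~%4.0f per block, ~%5.1f TPS\n", tx.name, per_block, tps);
    }
    // Roughly 1428 transfers (~110 TPS) vs 300 swaps (~23 TPS): same chain,
    // very different "transactions per second".
    return 0;
}
```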
Deeper Dive
Storage as of June 2022
The Ethereum blockchain is currently ~180 GB in size. This is the data that is critical to the chain's existence and from which absolutely everything else is calculated.
(source: Péter Szilágyi, lead developer of Geth, the oldest, flagship Ethereum node implementation)
That being said, nodes generate a lot of data while processing the blockchain in order to compute the current state (i.e. how much money you have in your wallet right now).
Today, if you want to run a node that stores every single block and transaction starting from genesis (or what Bitcoin engineers, but not Ethereum engineers, call an archive node), you currently need around 580 GB (this grows as the node runs). See Etherscan's geth node after they deleted some locally generated data, June 26, 2022.
If you want to run what Ethereum engineers call an archive node (a node that not only keeps all blocks from genesis but also does not delete generated data), then you currently need 1.5 TB of storage using Erigon.
Older clients that do not use flat key-value storage generate considerably more data (on the order of 10 TB).
The Future
There are a lot of proposals, research and active development efforts working in parallel, so this part of the answer might become outdated. Here are some of them:
Sharding: Ethereum will split data (but not execution) into multiple shards, without losing confidence that the entirety of it is available via Data Availability Sampling;
Layer 2 Technologies: These move gas that was consumed by computation to another layer, without losing guarantees of the first layer such as censorship resistance and security. The two most promising instances of this (on Ethereum) are optimistic and zero-knowledge rollups.
State Expiry: Registers, cache, RAM, SSD, HDD and tape libraries are storage solutions, ordered from fastest and most expensive to slowest and cheapest. Ethereum will follow the same strategy: move state data that is not accessed often to cheaper places;
Verkle Trees;
Portal network;
State Rent;
Bitcoin's Lightning network is the first blockchain layer 2 technology.

Can I use the block height to measure the passage of a year based on the average block time in RSK and Ethereum?

I want to build a Solidity smart contract in RSK and Ethereum that pays dividends every year.
Should I use the block time, or can I rely on the block number, assuming the current average inter-block time in RSK and Ethereum?
RSK and Ethereum have trunk blocks, which are chained and executed, and uncle blocks (now called ommers), which are referenced but not executed. Both RSK and Ethereum have difficulty adjustment functions that try to maintain a target density of blocks (counting both trunk blocks and ommers); in other words, a fixed number of blocks mined per time period. The adjustment functions in RSK and Ethereum are not equal, but both target a block density, not an inter-block time in the chain. Therefore, if the mining network produces a higher number of ommer blocks, the number of trunk blocks created over a period decreases, and the average trunk inter-block time increases.

In the case of Ethereum, the ommer rate has oscillated between 5% and 40% over the last 5 years, but in the last 2 years it has stayed relatively stable between 4% and 8%. This translates to roughly a ±2% error when measuring time based on block count. However, in Ethereum the "difficulty bomb" has affected the average block time much more than the ommer rate has. The average block time is ~14 seconds now, but it has peaked at 30, 20 and 17 seconds at different times. Therefore, in Ethereum, the number of blocks should not be used to measure long periods of time. It may be used only for short periods, not longer than a month. More importantly, if Ethereum switches to PoS, the average block interval will decrease to a fixed 12 seconds at that point.
Here we show the Ethereum ommer rate:
(source: https://ycharts.com/indicators/ethereum_uncle_rate)
And this is Ethereum average block time:
(source: https://ycharts.com/indicators/ethereum_average_block_time)
The spikes are caused by the difficulty bomb and the abrupt decays by hard-forks that delayed the bomb.
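To make the error concrete, here is a small sketch (my own illustration, using the ~14-second average and the higher averages mentioned above) of how a block count calibrated for 14-second blocks behaves when the average block time drifts:

```cpp
// Sketch: sensitivity of "blocks per year" to the average block time.
// 14 s is the current Ethereum average cited above; 17/20/30 s are past peaks.
#include <cstdio>

int main() {
    const double seconds_per_year = 365.25 * 24.0 * 3600.0;
    const double assumed_bt = 14.0;                        // what a contract might hard-code
    const double observed_bt[] = {14.0, 17.0, 20.0, 30.0};

    const double assumed_blocks = seconds_per_year / assumed_bt;
    for (double bt : observed_bt) {
        double actual_blocks = seconds_per_year / bt;
        std::printf("avg block time %4.1f s -> %8.0f blocks/year (%.0f%% of the 14 s assumption)\n",
                    bt, actual_blocks, 100.0 * actual_blocks / assumed_blocks);
    }
    // Waiting a fixed block count tuned for 14 s blocks takes 20/14 = ~1.43
    // calendar years if the average drifts to 20 s.
    return 0;
}
```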
In RSK, most miners are configured to minimize mining pool bandwidth and therefore create a high number of ommers. This is permitted and encouraged by design. They can also be configured to minimize the number of ommers, at the cost of consuming more bandwidth. RSK targets a density of approximately 2 blocks every 33 seconds, and currently one of those blocks is an ommer while the other is part of the trunk.

If the RSK/Bitcoin miners decide in the future to switch to the ommer-minimizing mode, almost no ommers will be created and the average trunk block interval will decrease to 16.5 seconds (to keep the 2-blocks-per-33-seconds invariant). This is why, even though the trunk block interval in RSK is currently very stable, in the future (and without prior notice) it can suddenly drop from 22 seconds down to 16.5 seconds. This makes the block number an unreliable basis for computing time-dependent values such as an interest rate.
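As a sketch of that invariant (my own illustration, not part of the original answer): under a target of 2 blocks every 33 seconds, the average trunk interval depends only on what fraction of mined blocks end up as ommers.

```cpp
// Sketch: trunk block interval implied by the 2-blocks-per-33-seconds target
// for different ommer fractions (0.0 corresponds to the ommer-minimizing mode).
#include <cstdio>

int main() {
    const double window_s = 33.0;
    const double blocks_per_window = 2.0;   // trunk blocks + ommers

    const double ommer_fractions[] = {0.0, 0.25, 0.5};
    for (double f : ommer_fractions) {
        double trunk_per_window = blocks_per_window * (1.0 - f);
        double trunk_interval_s = window_s / trunk_per_window;
        std::printf("ommer fraction %.2f -> average trunk interval %.1f s\n",
                    f, trunk_interval_s);
    }
    // 0.00 -> 16.5 s, 0.25 -> 22.0 s, 0.50 -> 33.0 s: the same total hash rate
    // can yield very different trunk block counts over a year.
    return 0;
}
```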
On the other hand, the block time cannot be easily forged, because nodes check that a block's timestamp is not in the future and not prior to the parent block's timestamp. Also, RSK has a consensus rule that ties the RSK timestamp to the Bitcoin timestamp, which makes cheating extremely expensive, as back-dated or forward-dated Bitcoin blocks produced by merge-mining would be invalid.
Here is the RSK average block time and average uncle rate from June 2018 to March 2021. The X-axis shows the block number.
Each dot in the chart corresponds to a day. We can see that the block interval is highly correlated to the uncle rate.
The EVM opcode NUMBER (which is used to obtain the block height) returns the number of trunk blocks, not counting ommers. As a consequence, the value it returns cannot be used to count all types of blocks. However, a new opcode OMMERCOUNT could be added to query the total number of ommers referenced up to the current block. Together with NUMBER, these opcodes could be used to better approximate the passage of time.

Understanding FMA performance

I would like to understand how to compute FMA performance. If we look into the description here:
https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm256_fmadd_ps&expand=2520,2520&techs=FMA
for the Skylake architecture the instruction has Latency=4 and Throughput (CPI)=0.5, so the overall performance of the instruction is 4*0.5 = 2 clocks per instruction.
So, as far as I understand, if the max (turbo) clock frequency is 3 GHz, then on a single core I can execute 1 500 000 000 instructions in one second.
Is it right? If so, what could be the reason that I am observing a slightly higher performance?
A throughput of 0.5 means that the processor can execute two independent FMAs per cycle. So at 3 GHz, the maximum FMA throughput is 6 billion per second. You said you are only able to achieve a throughput that is slightly larger than 1.5 billion. This can happen due to one or more of the following reasons:
The frontend is delivering less than 2 FMA uops every single cycle due to a frontend bottleneck (the DSB path or the MITE path).
There are data dependencies between the FMAs or with other instructions (that are perhaps part of the looping mechanics). This can be stated alternatively as follows: there are fewer than 2 FMAs ready in the RS every cycle. Latency comes into play when there are dependencies.
Some of the FMAs use memory operands; if these are not found in the L1D cache when they are needed, a throughput of 2 FMAs per cycle cannot be sustained.
The core frequency becomes less than 3GHz during the experiment. This factor only impacts the throughput per second, not per cycle.
Other reasons depending on how exactly your loop works and how you are measuring throughput.
Latency=4 and Throughput(CPI)=0.5, so the overall performance of the instruction is 4*0.5 = 2 clocks per instruction.
Just working out the units gives cycles²/instr, which is strange and I have no interpretation for it.
The throughput listed here is really a reciprocal throughput, in CPI, so 0.5 cycles per instruction or 2 instructions per cycle. These numbers are related by being each other's reciprocal; the latency has nothing to do with it.
There is a related calculation that does involve both latency and (reciprocal) throughput, namely the product of the latency and the throughput: 4 * 2 = 8 (in units of "number of instructions"). This is how many independent instances of the operation can be "in flight" (started but not completed) simultaneously, comparable with the bandwidth-delay product in network theory. This number informs some code design decisions, because it is a lower bound on the amount of instruction-level parallelism the code needs to expose to the CPU in order for it to fully use the computation resources.
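As a minimal sketch of what exposing that much ILP looks like (my own example, not from the answer above; it assumes an AVX2+FMA machine, is built with something like -O2 -mavx2 -mfma, and uses arbitrary constants and iteration counts):

```cpp
// Sketch: with latency 4 and throughput 2 FMAs/cycle, 4 * 2 = 8 independent
// dependency chains are needed to keep both FMA units busy.
#include <immintrin.h>
#include <cstdio>

int main() {
    __m256 acc[8];
    for (int k = 0; k < 8; ++k) acc[k] = _mm256_set1_ps(1.0f);
    const __m256 a = _mm256_set1_ps(0.9999999f);
    const __m256 b = _mm256_set1_ps(1e-7f);

    // Each acc[k] is its own dependency chain, so the 8 FMAs per iteration can
    // overlap. With a single accumulator the loop would be serialized by the
    // 4-cycle latency (one FMA every 4 cycles instead of 2 per cycle).
    for (long i = 0; i < 100000000L; ++i) {
        for (int k = 0; k < 8; ++k) {
            acc[k] = _mm256_fmadd_ps(acc[k], a, b);  // acc = acc * a + b
        }
    }

    // Use the results so the compiler cannot discard the work.
    float out[8];
    _mm256_storeu_ps(out, _mm256_add_ps(acc[0], acc[7]));
    std::printf("%f\n", out[0]);
    return 0;
}
```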

Calculating mflop/s of a HPC application using memory bandwidth info

I want to calculate the MFLOPS (millions of floating-point operations per second per processor) of an HPC application (a NAS benchmark) without running the application. I have measured the memory bandwidth of each core of my system (a supercomputer) using the STREAM benchmark. I'm wondering how I can get the MFLOPS per processor of the application from the memory bandwidth info of the cores.
My node has 64 GiB of memory (16 cores, 2 sockets) and 58 GiB/s aggregate bandwidth using all physical cores. The memory bandwidth of my cores varies from 2728.1204 MB/s to 10948.8962 MB/s for the Triad function, which must be because of the NUMA architecture.
Any help would be appreciated.
You can't estimate the MFLOPS/GFLOPS of a benchmark from the STREAM memory bandwidth results alone. You need to know two more parameters: the peak MFLOPS/GFLOPS of your CPU core (better expressed as the maximum FLOP operations per clock cycle for each variant of vector instructions, plus the CPU frequency limits: min, mean, max) and also the GFLOPS/GByte ratio (the flops-to-bytes ratio, or Arithmetic Intensity) of every program you want to estimate (every NAS benchmark).
The STREAM benchmark has very low arithmetic intensity (0 DP=FP64 flops per two double operands = 2*8 bytes in Copy, 1 flop per 16 bytes in Scale, 1 flop per 24 bytes in Add and 2 flops per 24 bytes in Triad). So the STREAM benchmark is limited by memory bandwidth in correct runs (and by cache bandwidth in incorrect runs where the working set fits in cache). Many benchmarks have higher arithmetic intensity.
With this data (memory bandwidth, max GFLOPS/GHz at different vectorization levels, normal/maximum/low CPU frequency, arithmetic intensity of the test) you can start to use the roofline performance model: https://crd.lbl.gov/departments/computer-science/PAR/research/roofline/
With the roofline model you have an x axis of flops/byte and a y axis of GFlop/s (both on a logarithmic scale). The line of the "roof" consists of two parts for every CPU (or machine).
The first part is inclined and corresponds to low arithmetic intensity. Applications in this part have to wait for data to be loaded from memory; they have no data to operate on at the full GFlop/s speed of the CPU, so they are limited by memory. This part of the line is defined by the STREAM benchmark.
The second part of the line is flat; it corresponds to higher intensity. Tasks here are not limited by memory bandwidth, they are limited by the available FLOPS. On a modern CPU the full FLOPS are only available with wide vector (SIMD) instructions, and not all tasks can use the widest vectors.
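Putting those pieces together, here is a tiny roofline sketch (my own illustration; the per-core peak GFLOP/s is a made-up placeholder, while 10.9 GB/s is the best per-core Triad figure from the question):

```cpp
// Sketch of the roofline bound: attainable GFLOP/s = min(peak, AI * bandwidth).
#include <algorithm>
#include <cstdio>

int main() {
    const double peak_gflops = 50.0;    // placeholder per-core peak (vectorized FMA)
    const double bandwidth_gbs = 10.9;  // best per-core STREAM Triad from the question

    // Arithmetic intensities in flops/byte: STREAM Triad plus two hypothetical kernels.
    const double intensities[] = {2.0 / 24.0, 0.5, 8.0};
    for (double ai : intensities) {
        double attainable = std::min(peak_gflops, ai * bandwidth_gbs);
        std::printf("AI = %5.3f flops/byte -> at most %6.2f GFLOP/s (%s-bound)\n",
                    ai, attainable,
                    ai * bandwidth_gbs < peak_gflops ? "memory" : "compute");
    }
    return 0;
}
```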

CUDA performance improves when running more threads than there are cores

Why does performance improve when I run more than 32 threads per block?
My graphics card has 480 CUDA cores (15 SMs * 32 SPs).
Each SM has 1-4 warp schedulers (Tesla = 1, Fermi = 2, Kepler = 4). Each warp scheduler is responsible for executing a subset of the warps allocated to the SM. Each warp scheduler maintains a list of eligible warps. A warp is eligible if it can issue an instruction on the next cycle. A warp is not eligible if it is stalled on a data dependency, waiting to fetch an instruction, or if the execution unit for the next instruction is busy. On each cycle, each warp scheduler picks a warp from the list of eligible warps and issues 1 or 2 instructions.
The more active warps per SM, the larger the number of warps each warp scheduler has to pick from on each cycle. In most cases, optimal performance is achieved when there are sufficient active warps per SM to have 1 eligible warp per warp scheduler per cycle. Increasing occupancy beyond this point does not increase performance and may decrease it.
A typical target for active warps is 50-66% of the maximum warps for the SM. The ratio of warps in a launch configuration to the maximum warps supported is called theoretical occupancy. The runtime ratio of active warps per cycle to maximum warps per cycle is achieved occupancy. For a GTX 480 (a CC 2.0 device), a good starting point when designing a kernel is 50-66% theoretical occupancy. A CC 2.0 SM can have a maximum of 48 warps, so 50% occupancy means 24 warps, or 768 threads per SM.
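A quick back-of-the-envelope sketch of that arithmetic (my own illustration; it only accounts for the warp and resident-block limits of a CC 2.0 SM, ignoring register and shared-memory limits):

```cpp
// Sketch: theoretical occupancy on a CC 2.0 SM (48 warps, 1536 threads,
// 8 resident blocks max) for a few block sizes, ignoring register and
// shared-memory limits.
#include <algorithm>
#include <cstdio>

int main() {
    const int warp_size = 32;
    const int max_warps_per_sm = 48;
    const int max_threads_per_sm = max_warps_per_sm * warp_size;  // 1536
    const int max_blocks_per_sm = 8;

    const int threads_per_block[] = {32, 64, 96, 192, 256};
    for (int tpb : threads_per_block) {
        int warps_per_block = (tpb + warp_size - 1) / warp_size;
        int blocks = std::min(max_blocks_per_sm, max_threads_per_sm / tpb);
        int active_warps = std::min(max_warps_per_sm, blocks * warps_per_block);
        std::printf("%4d threads/block -> %2d active warps -> %5.1f%% theoretical occupancy\n",
                    tpb, active_warps, 100.0 * active_warps / max_warps_per_sm);
    }
    // 32 threads/block is capped at 8 blocks * 1 warp = 8 warps (~17%), which is
    // one reason very small blocks underperform on this device.
    return 0;
}
```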
The CUDA Profiling Activity in Nsight Visual Studio Edition can show the theoretical occupancy, achieved occupancy, active warps per SM, eligible warps per SM, and stall reasons.
The CUDA Visual Profiler, nvprof, and the command line profiler can show theoretical occupancy, active warps, and achieved occupancy.
NOTE: The count of CUDA cores should only be used to compare cards of similar architectures, to calculate theoretical FLOPS, and to potentially compare differences between architectures. Do not use the count when designing algorithms.
Welcome to Stack Overflow. The reason is that CUDA cores are pipelined. On Fermi, the pipeline is around 20 clocks long. This means that to saturate the GPU, you may need up to 20 threads per core.
The primary reason is the memory latency hiding model of CUDA. Most modern CPUs use caches to hide the latency to main memory. This results in a large percentage of chip resources being devoted to cache: most desktop and server processors have several megabytes of cache on the die, which actually accounts for most of the die space. In order to pack on more cores with the same energy usage and heat dissipation characteristics, CUDA-based chips instead devote their chip space to tons of CUDA cores (which are mostly just floating-point ALUs). Since there is very little cache, they instead rely on having more threads ready to run while other threads are waiting on memory accesses, in order to hide that latency. This gives the cores something productive to work on while some warps are waiting on memory accesses. The more warps per SM, the better the chance that one of them will be runnable at any given time.
CUDA also has zero-cost thread switching in order to aid in this memory-latency-hiding scheme. A normal CPU incurs a large overhead to switch from executing one thread to the next, due to the need to store all of the register values for the thread it is switching away from onto the stack and then load all of the ones for the thread it is switching to. CUDA SMs simply have tons and tons of registers, so each thread has its own set of physical registers assigned to it throughout the life of the thread. Since there is no need to store and load register values, an SM can execute threads from one warp on one clock cycle and threads from a different warp on the very next clock cycle.