Maximum throughput of a Corda application - blockchain

I started working with the Corda platform 3 days ago and am currently running into an issue measuring the throughput of a Corda application.
I worked with Hyperledger before, so the script I used for performance testing is Caliper. The main idea is to send transactions at a given send rate and see when each transaction is committed. With the creation time and commit time I can calculate the throughput of the system. When I run the test against Corda, I send transactions at a rate of around 50 txn per sec and get a throughput of 3-5 tps.
The application I used for testing is cordapp-example with the default config. I configured it to run with Docker on my local machine (4 containers - one for the notary, 3 for the party nodes).
So is that the actual performance of a Corda application? Has anyone done this, or is there any article about it? I want to build an application with a throughput of around 1000 tps. What would the configuration for such a system be if using the Corda platform (resources, number of nodes, etc.)?
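For reference, here is a minimal sketch of the throughput calculation described above (committed transactions divided by the commit-time window); the timestamps are made up and it is not tied to Caliper or any Corda API:

# Hypothetical example: compute throughput from (created, committed) timestamps
# collected by a load-test script. Not tied to Caliper or any Corda API.
from datetime import datetime

# Each entry: (time the tx was submitted, time the tx was observed as committed)
tx_times = [
    (datetime(2019, 1, 1, 12, 0, 0, 0), datetime(2019, 1, 1, 12, 0, 1, 200000)),
    (datetime(2019, 1, 1, 12, 0, 0, 20000), datetime(2019, 1, 1, 12, 0, 1, 450000)),
    # ... one entry per transaction sent during the test run
]

committed = [c for _, c in tx_times]
window_seconds = (max(committed) - min(committed)).total_seconds()

throughput_tps = len(tx_times) / window_seconds if window_seconds > 0 else float("nan")
avg_latency = sum((c - s).total_seconds() for s, c in tx_times) / len(tx_times)

print(f"throughput: {throughput_tps:.1f} tps, average commit latency: {avg_latency:.2f} s")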

The open source version of Corda isn't optimised and won't be able to reach 1000 tps. That's a pretty demanding use case. Try downloading the Enterprise version from here:
https://www.r3.com/corda-enterprise/
and see if you get better performance.
You might also want to email partner@r3.com and get a more formal relationship in place, because we're constantly optimising to reach higher and higher tps levels, so you'll probably want to be working closely with the performance team. In particular, apps can do things that slow the node down, and right now most of the knowledge about how to make fast apps is in the heads of the perf team. Over time there'll be optimisation advice added to the developer docs, but we're not there yet.

Related

Need help building an uptime dashboard for a distributed system

I have a product for which I would like to create a dashboard to show
its availability/uptime over time and display any outages.
Specifically, I am looking for:
- the ability to report historical information on service uptime
- details on any service outages
The product runs on a fleet of Linux servers and connects to a DB running on a separate instance; we also have some dedicated instances that run nightly batch jobs. The system relies on some external services to provide additional functionality for select customers, and there is a Redis cache for caching data for multiple customers.
We replicate all of the above (application servers, DB, job servers, Redis cache, etc.) into dedicated clusters for large customers. Small customers are put on one of the shared clusters to keep costs low.
Currently we are running health checks on the application servers only and presenting that information on a simple HTML page. This is the go-to page for end users/customers and support teams.
Since the product is made up of multiple systems/services, our current HTML page often says that the system is up and running fine while it is actually experiencing issues with some of its components or external services.
The current health check uses a simple HTTP request and looks for a 200 status code; it runs every minute, and we plot the data into a simple chart showing the last 30 days. We also show a list of outages with timestamps and additional static information that is added manually.
We would like to build a more robust solution that monitors much more than the HTTP port and gives us more detail: which part of the system is having issues, how those issues are impacting the system, and which customers are affected.
Appreciate any guidance or help. We prefer to build the solution using open source tools since we don't have much of a budget. The goal is to improve things for my team members, who are already overloaded.
I'm not sure if this will be overkill or not for your setup, given that I don't know your product, but have a look at the ELK Stack and see if you can use some components or at least some ideas from there:
What is the ELK Stack?
The Complete Guide to the ELK Stack
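To make the idea of monitoring more than the HTTP port concrete, here is a minimal composite health-check sketch; every hostname, port and URL in it is a placeholder, and each run appends a JSON line that a log shipper (for example Filebeat into Elasticsearch) or the existing chart could consume:

# Minimal composite health check sketch. All hostnames, ports and URLs below are
# placeholders - replace them with your real application, DB, Redis and external
# service endpoints. Each run appends one JSON line to a log file.
import json, socket, time
import requests

HTTP_CHECKS = {
    "app_server": "https://app.example.com/health",       # placeholder URL
    "external_service": "https://partner.example.com/ping",
}
TCP_CHECKS = {
    "database": ("db.example.com", 5432),                 # placeholder host/port
    "redis_cache": ("cache.example.com", 6379),
}

def check_http(url, timeout=5):
    # Component is "up" only if it answers with HTTP 200 within the timeout.
    try:
        return requests.get(url, timeout=timeout).status_code == 200
    except requests.RequestException:
        return False

def check_tcp(host, port, timeout=5):
    # Cheap liveness check: can we open a TCP connection to the service port?
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

result = {"timestamp": time.time()}
result.update({name: check_http(url) for name, url in HTTP_CHECKS.items()})
result.update({name: check_tcp(*addr) for name, addr in TCP_CHECKS.items()})
result["overall_up"] = all(v for k, v in result.items() if k != "timestamp")

with open("healthcheck.log", "a") as f:
    f.write(json.dumps(result) + "\n")

Run it from cron every minute (as the current check already does) and you get per-component status over time rather than a single pass/fail.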

Sitecore publishing and lag of up to 30 seconds

We have noticed an interesting issue in our Sitecore install. Any auto-publish or scheduled publish job takes a long time compared to our other environments. Between individual jobs there is a lag of anywhere from 5 to 30 seconds. In our other environments we do not see any lag, as the gap between two publishing jobs there is less than a second.
We have tried the following up until now:
- We have already checked for differences between the problematic and other environments and do not see any differences in configuration or code.
- The caches are pretty similar in all environments.
- We tried enabling parallel publishing, but that did not make much difference.
- Indexing is also very quick in the problematic environment and finishes within one second for each job.
At this point, we are not sure what is causing this issue. Any suggestions would be helpful.
Thanks
As Sitecore allows at most one publish to be executed at a time to avoid data corruption, I would assume you might be adding publish jobs faster than they are processed, so they queue up.
To draw accurate conclusions, the publish operation needs to be profiled - that will tell you where the wall-clock time is spent (for example, ~80% on network and database operations and only 20% in Sitecore code).
You'll need to collect a few 20-second-long profiles while observing the publishing lag.
From there you'll see how the time is spent.
Please keep in mind that seeing stale content in the browser does not necessarily mean publishing is slow - there are many caching layers in between that can influence what you see.
Looks like I have a similar issue.
I have multiple IaaS Sitecore installations. Two environments (hosted on one VM) have much better performance (package installation, publishing, etc.).
I also have two more Sitecore installations on another VM, and publishing and package installation there are 4-5 times slower than on the first VM.
I used the same Sitecore installation configuration, but with a different prefix.
In my case I was migrating from Sitecore 8.2 to Sitecore 9.2. I used Unicorn to migrate the data and saw that content publishing (which, it seems, writes to master) was slow right away.
So on the first two environments the Unicorn migration, content publishing and package installation were much faster, while on the other two the process is slower.

How to profile Django's bottlenecks for scaling?

I am using Django and Tastypie for a REST API.
For profiling, I am using django-silk, and below is a summary of requests:
How do I profile the complete flow? The time taken outside of database queries is (382 - 147) ms on average. How do I figure out the bottleneck and optimize/scale? I did use @silk_profile() on the get_object_list method for this resource, but even this method doesn't seem to be the bottleneck.
I used caching to decrease response time, but that didn't help much. What are the other options?
When testing with loader.io, the peak the server can handle is 1000 requests per 30 secs (which seems very low). Other than caching (which I already tried), what might help?
Here's a bunch of suggestions:
- bring the queries per request down to below 5 (34 per request is really bad) - see the sketch after this list
- install the Django Debug Toolbar and have a look at where the time is spent
- use gunicorn or uWSGI behind a reverse proxy (NGINX)
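On the first point, the usual way to collapse dozens of per-row queries in Django is select_related/prefetch_related on the queryset the resource returns. A hedged sketch with made-up model names (Order, Customer, OrderItem are placeholders, not from the question):

# Sketch with hypothetical models (Order -> Customer FK, OrderItem -> Order FK).
# The model names are placeholders, not taken from the question's code.
from django.db import models

class Customer(models.Model):
    name = models.CharField(max_length=100)

class Order(models.Model):
    customer = models.ForeignKey(Customer, on_delete=models.CASCADE)

class OrderItem(models.Model):
    order = models.ForeignKey(Order, related_name="items", on_delete=models.CASCADE)
    sku = models.CharField(max_length=50)

def naive_order_list():
    # 1 query for the orders, then 1 extra query per order for .customer
    # and 1 per order for .items -> dozens of queries per request.
    return [(o.customer.name, list(o.items.all())) for o in Order.objects.all()]

def optimised_order_list():
    # select_related joins the FK in the same query; prefetch_related fetches
    # all related items in one extra query -> 2 queries regardless of row count.
    qs = Order.objects.select_related("customer").prefetch_related("items")
    return [(o.customer.name, list(o.items.all())) for o in qs]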
You have too many queries; even if they are relatively fast, you spend some time reaching the database, etc. Also, if you have external cache storage (for example, Redis), it can take some time to connect to it.
To investigate the slow parts of the code you have two options:
- Use a profiler - profiling on a local PC may not be representative if you have a distributed system deployed across several machines.
- Add tracing points to your code that record a message and the current time (something like https://gist.github.com/dbf256/0f1d5d7d2c9aa70bce89). Deploy this patched code, test it with your load-testing tool, and check the logs; a minimal sketch of such a tracing point is shown below.
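In case the gist goes away, the tracing-point idea is just a small timing wrapper that logs where the time goes. A minimal sketch (the logger setup and the traced function are illustrative only):

# Minimal tracing-point sketch: log how long named blocks of code take.
# The logger configuration and the traced function are illustrative only.
import logging
import time
from contextlib import contextmanager

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(message)s")
log = logging.getLogger("tracing")

@contextmanager
def trace(label):
    start = time.monotonic()
    try:
        yield
    finally:
        log.info("%s took %.1f ms", label, (time.monotonic() - start) * 1000)

# Usage inside a suspect code path, e.g. a Tastypie get_object_list override:
def get_object_list_example():
    with trace("db query"):
        time.sleep(0.05)            # stand-in for the ORM query
    with trace("serialisation"):
        time.sleep(0.02)            # stand-in for building the response

Run it under load and the log tells you which labelled section the time is actually going to.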

CouchDB load spike (even with low traffic)?

We've been running CouchDB v1.5.0 on AWS and it's been working fine. Recently AWS came out with new prices for their new m3 instances, so we switched our CouchDB instance to an m3.large. We have a relatively small database with < 10 GB of data in it.
Our steady-state metrics for it are a system load of 0.2 and memory usage of 5% or so. However, we noticed that every few hours (3-4 times per day) we get a huge spike that pushes our load to 1.5 or so and memory usage to close to 100%.
We don't run any cron jobs that involve the database, and our traffic flow is about the same over the day. We do run a continuous replication from one database on the west coast to another on the east coast.
This has been stumping me for a bit - any ideas?
Just wanted to follow up on this question in case it helps anyone.
While I didn't figure out the direct answer to my load-spike question, I did discover another bug from inspecting the logs that I was able to solve.
In my case, running "sudo service couchdb stop" was not actually stopping CouchDB. On top of that, every couple of seconds a new CouchDB process would try to spawn, only to be blocked by the existing CouchDB process.
Ultimately, removing the respawn flag in /etc/init.d/couchdb fixed this error.
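For anyone chasing a similar spike, one cheap diagnostic is to poll CouchDB's /_active_tasks endpoint while the load is high; it lists running compactions, replications and indexer jobs, which are the usual suspects for periodic spikes. A hedged sketch, assuming CouchDB is reachable on localhost:5984 without auth:

# Sketch: poll CouchDB's /_active_tasks every 30s and print what is running.
# Assumes CouchDB is on localhost:5984 with no auth - adjust URL/credentials.
import time
import requests

COUCH_URL = "http://localhost:5984/_active_tasks"

while True:
    try:
        tasks = requests.get(COUCH_URL, timeout=10).json()
        if tasks:
            for t in tasks:
                # typical fields: type (e.g. replication, database_compaction),
                # progress, started_on - the exact keys vary by task type
                print(time.strftime("%H:%M:%S"), t.get("type"), t.get("progress"))
        else:
            print(time.strftime("%H:%M:%S"), "no active tasks")
    except requests.RequestException as exc:
        print(time.strftime("%H:%M:%S"), "couchdb unreachable:", exc)
    time.sleep(30)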

Electric Cloud / BuildForge: worth the expense?

I just saw a demo of Electric Cloud and it was very interesting, but it is expensive.
Pro: Excellent features
- extract the secret sauce from my builds and make them more standardized with reusable steps
- parallelize the build to speed it up and use my build farm more effectively
- restart the build from any step
- integrate test automation and promotion (perhaps even deployment into production) with good logging, auditing and reporting
Con: enterprise-sized price tag
I feel like I could probably use STAF, maven and hudson with some plugin development to do most (but not all) of what these tools offer, but it would require a lot of customization and feels like beating my clothes against rocks instead of paying for a washing machine.
Does anyone have opinions to share about these options and what aspects of the environment makes one choice fit better than another?
At my last company, we deployed both Commander and Accelerator. At my current company, we are planning on doing the same thing.
My last company did about 70 builds per day. The build time was 12 hours. The total build time was reduced to about 3 hours using Accelerator. We started the deployment building only the very latest release and its incoming streams. We used Commander to follow a continuous integration (CI) model - the same "recipe" was used for both the CI and nightly builds, with the CI builds using some different options. The number of nightly build failures dropped to near 0 and the velocity of development increased significantly. At that point, all we heard from development was "ME NEXT"!!! The ROI for this was incredible.
Yes, you can develop some of this using Hudson or CruiseControl, but as you indicated you'll be missing a lot of functionality and end up spending time customizing and supporting this environment.
Feel free to contact me if you'd like to discuss this more.
I do not have experience with BuildForge.
We started our 20-team program with Jenkins and IncrediBuild; however, this didn't scale as well as we had hoped. Many of our teams would check in a day or two before the end of a sprint (yes, a behavioral issue) and Jenkins would get overwhelmed. A build without IncrediBuild would take ~90 mins and with it ~12 mins. This does not include the wait time teams would face, since Jenkins builds in a serial manner (queue).
We moved to Electric Commander + Accelerator and saw our build times decrease to ~5 mins. The biggest benefit, however, was running parallel builds. Teams don't have to wait any more for their build to start. We use EC's schedules for each team, and our build is much more modular/maintainable (written in Perl).
Be warned, their dashboard is not like Jenkins. This was a common complaint from our teams. There are ways to run EC from Jenkins (so you get the Jenkins dashboard with the EC speed) though.
tl;dr Electric Cloud is great if you need to scale.