Sitecore publishing and lag of up to 30 seconds

We have noticed an interesting issue in our Sitecore install. Auto-publish and scheduled publish jobs take much longer than in our other environments: between individual jobs there is a lag of anywhere from 5 to 30 seconds. In our other environments we do not see any such lag; the gap between two publishing jobs there is less than a second.
We have tried the following so far:
We have already checked for differences between the problematic environment and the others and do not see any differences in configuration or code.
The caches are pretty similar in all environments.
We tried enabling parallel publishing but that did not make much difference.
Indexing is also very quick in the problematic environment and finishes within one second for each job.
At this point, we are not sure what is causing this issue. Any suggestions would be helpful.
Thanks

Since Sitecore allows at most one publish to execute at a time (to avoid data corruption), I would assume you might be adding publish jobs faster than they are processed, so they queue up.
To draw accurate conclusions, the publish operation needs to be profiled; that will show where the wall-clock time is spent (for example, ~80% on network and database operations and only 20% in Sitecore code).
You'll need to collect a few 20-second long profiles while observing publishing lag.
From there you'll see how the time is spent.
Please keep in mind that seeing stale content in the browser does not necessarily mean publishing is slow; there are many caching layers in between that can influence what you see.

Looks like I have a similar issue.
I have multiple IaaS Sitecore installations. Two environments (hosted on one VM) have much better performance (package installation, publishing, etc.).
I also have two more Sitecore installations on another VM, and publishing and package installation there are 4-5 times slower than on the first VM.
I used the same Sitecore installation configuration, but with a different prefix.
In my case I was migrating from Sitecore 8.2 to Sitecore 9.2. I used Unicorn to migrate the data and saw that content publishing (writes to master, it seems) was slow right away.
So on the first two environments migration with Unicorn, content publishing, and package installation were much faster, while on the other two the process is slower.

Related

Spark History Server ListBucket costs

We are using Spark History Server 3.2.1 to monitor our Spark applications.
We have thousands of daily jobs (running on Kubernetes) that write event logs to an S3 bucket (in a dedicated folder).
We are using the history-server to analyze and compare completed jobs (incomplete running jobs never appeared in the UI, but that is not a requirement right now).
Recently I noticed an increase in our ListBucket API operations in the AWS billing cost explorer. This cost is higher than the StandardStorage cost (the price we pay for storing the data itself). It's up to a few hundred per month!
Running the history-server with DEBUG log level exposed the "problem": every 10s the history-server lists the bucket to get all logs and then iterates over each folder to get its contents. So if I want to keep the last 10,000 jobs, I'll have to pay for 10,101 ListBucket requests every 10s!
Here is one example (out of the 10k), reproduced locally with MinIO as S3:
22/02/20 06:44:31 DEBUG wire: http-outgoing-57 << "<ListBucketResult xmlns="http://s3.amazonaws.com/doc/2006-03-01/"><Name>local-audience</Name><Prefix>history-logs/eventlog_v2_spark-ffffdf5903c841259f28b53981746b76/</Prefix><KeyCount>2</KeyCount><MaxKeys>5000</MaxKeys><Delimiter>/</Delimiter><IsTruncated>false</IsTruncated><Contents><Key>history-logs/eventlog_v2_spark-ffffdf5903c841259f28b53981746b76/appstatus_spark-ffffdf5903c841259f28b53981746b76</Key><LastModified>2022-02-12T17:00:15.304Z</LastModified><ETag>"d41d8cd98f00b204e9800998ecf8427e"</ETag><Size>0</Size><Owner><ID></ID><DisplayName></DisplayName></Owner><StorageClass>STANDARD</StorageClass></Contents><Contents><Key>history-logs/eventlog_v2_spark-ffffdf5903c841259f28b53981746b76/events_1_spark-ffffdf5903c841259f28b53981746b76</Key><LastModified>2022-02-12T17:00:15.136Z</LastModified><ETag>"f91cc774d92c6f6c2ca4d0e1a1e76e13"</ETag><Size>868837</Size><Owner><ID></ID><DisplayName></DisplayName></Owner><StorageClass>STANDARD</StorageClass></Contents></ListBucketResult>"
To confirm that the cost comes from the history-server, I turned it off for a day, and there has been no ListBucket charge since.
To mitigate the problem (because we still need the history-server), I can set spark.history.fs.update.interval to a higher value (such as 3600s or so). As we only check the history-server about once a day, the current interval is overkill and not worth it (cost-wise).
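For reference, the mitigation is a single setting (the 3600s value is just an illustration):
spark.history.fs.update.interval: 3600s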
Why does it scan the completed jobs every time (over and over again) and not only the new jobs? Is there a way to configure this behavior to avoid those ListBucket operations?
If I only care about completed jobs, and assuming I can wait a few minutes to see the list, is there a mode that loads the list only when I log in to the UI (rather than refreshing it periodically for nothing)?
P.S. - I'm using AWS lifecycle rules to clean this folder (rather than the server's cleaning feature), by expiring objects after a few days.
Tree-walking in S3 is (a) expensive and (b) horribly slow, especially given that a deep tree scan exists. If you want to fix this and can write Scala code, see if you can change the server to switch to a deep listing by moving to FileSystem.listFiles(path, true). Yes, that involves coding, but the OSS community depends on everyone fixing their own personal issues and sharing the outcome.
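Not the Scala change described above, but a rough boto3 sketch of why one deep listing is so much cheaper in ListBucket requests than walking each job folder (boto3 itself, the bucket and the prefix names are only placeholders for illustration):

import boto3

BUCKET = "my-spark-logs"   # placeholder
PREFIX = "history-logs/"   # placeholder

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")

# Shallow walk: one listing to enumerate the per-application "directories",
# then one more ListBucket call per directory - O(number of jobs) requests.
shallow_requests = 0
job_prefixes = []
for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX, Delimiter="/"):
    shallow_requests += 1
    job_prefixes += [p["Prefix"] for p in page.get("CommonPrefixes", [])]
for job_prefix in job_prefixes:
    for page in paginator.paginate(Bucket=BUCKET, Prefix=job_prefix):
        shallow_requests += 1

# Deep listing: no delimiter, so S3 returns every key under the prefix,
# roughly one request per 1000 keys regardless of how many job folders exist.
deep_requests = 0
for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
    deep_requests += 1

print("shallow walk:", shallow_requests, "requests; deep listing:", deep_requests)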
After digging into this issue, I decided to stop using the "rolling" feature for now, as my application's jobs are relatively small.
I removed:
spark.eventLog.rolling.enabled: true
spark.eventLog.rolling.maxFileSize: 16m
from the spark-submit command and the cost is now back to normal...
I also wrote about it here.
@stevel thanks for your answer - I will try to contribute and fix that! :)

Maximum throughput of a Corda application

I started working with the Corda platform 3 days ago and am currently facing an issue measuring the throughput of a Corda application.
I worked with Hyperledger before, so the tool I used for performance testing is Caliper. The main idea is to send transactions at a given send rate and see when each transaction is committed. With the creation time and the commit time I can calculate the throughput of the system. When I run the test against Corda, I send transactions at a send rate of around 50 txn per sec and get a throughput of 3-5 tps.
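For context, the calculation behind a figure like that is roughly the following (a minimal sketch; the timestamps are made up for illustration, not real benchmark data):

# Throughput from creation/commit timestamps, as described above.
created_times = [0.0, 0.1, 0.2, 0.3, 0.4]   # seconds, relative to test start
commit_times  = [0.8, 1.1, 1.9, 2.4, 9.7]

window = max(commit_times) - min(created_times)    # observation window in seconds
throughput_tps = len(commit_times) / window        # committed transactions per second
avg_latency = sum(c - s for c, s in zip(commit_times, created_times)) / len(commit_times)

print("throughput: %.1f tps, average latency: %.1f s" % (throughput_tps, avg_latency))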
The application I used for testing is cordapp-example with the default config. I configured it to run with Docker on my local machine (4 containers: one for the notary, 3 for the party nodes).
So is that the actual performance of a Corda application? Has anyone done this, or is there any article about it? I want to build an application with a throughput of around 1000 tps. What would the configuration for such a system be if it uses the Corda platform (resources, number of nodes, etc.)?
The open source version of Corda isn't optimised and won't be able to reach 1000 tps. That's a pretty demanding use case. Try downloading the Enterprise version from here:
https://www.r3.com/corda-enterprise/
and see if you get better performance.
You might also want to email partner@r3.com and get a more formal relationship in place, because we're constantly optimising to get higher and higher tps levels, so you'll probably want to be working closely with the performance team - in particular, apps can do things that slow the node down, and right now most of the knowledge about how to make fast apps is in the heads of the perf team. Over time there'll be optimisation advice added to the developer docs, but we're not there yet.

Django "migrate" consuming too much CPU

Our staging server, a t2.micro instance on AWS, was going down constantly. On investigating, we found that when manage.py migrate is run, CPU usage shoots up to 99%. It was easily reproducible on the local machine. We are running Django 1.9 and a PostgreSQL database. I am not sure whether we are doing something wrong or it is meant to be that way. We have around 18 apps in the project, but running migrate app_name also results in the same behaviour. Attaching the screenshots of CPU usage.
Also, I profiled the migrate function; here is a graph:
Are you depending on migrate to run regularly? Because once the project is nearing and then entering production state, there shouldn't be many migrations to run. Or do you mean that migrate takes this long, even if migrate --list shows that there is nothing to migrate?
Also, to know what Postgres is doing, you should set up logging of queries including their time. You can filter to log only longer running queries:
http://www.postgresql.org/docs/9.5/static/runtime-config-logging.html
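For example, in postgresql.conf (the 250 ms threshold is an arbitrary illustration):
# log every statement that runs longer than 250 ms, together with its duration
log_min_duration_statement = 250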
Run those queries through the explain analyze sql command:
psql> EXPLAIN ANALYZE <complete query>;
http://www.postgresql.org/docs/9.5/static/using-explain.html
You need to provide the information you get from explain to get further help.
EDIT:
Also, you could try to squash your migrations if you have a lot of migration files. I would imagine that Django works its way through all of them, one by one, so if you have many apps with many migration files depending on each other, you can imagine what happens.
https://docs.djangoproject.com/en/1.9/topics/migrations/#squashing-migrations
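For example (the app label and migration name are placeholders):
python manage.py squashmigrations myapp 0004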
EDIT 2:
Moving this from the comment into the answer:
Does migrate --list also consume that much CPU? If not, then you could run it first, see whether there really is a need to migrate and only run migrate on those apps that have open migrations.
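(On Django 1.9, which the question mentions, migrate --list has been replaced by showmigrations, which as far as I know reports the same information:)
python manage.py showmigrations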
I think this would be the best. If you can profile in more detail, you might actually ask the Django community for help. I could imagine that you have an interesting setup for finding out how to tune Django migrations to do less (actually unnecessary) work. But I don't know the migrations code well enough to tell.
But this also depends on how many apps we are talking about, and how many migration files. If you have less than 30 apps (including 3rd party), I think it should work fine and there is something else wrong (IMHO!).
Also, you have not shown the resource usage of your server. If the slowness is due to swapping/too much RAM usage you really might be able to boost it by supplying more RAM (to the process).
I believe migrations consume a lot of resources, especially when you have many models and many apps: more apps means more dependencies and more migration complexity.
I would recommend starting a new instance that only runs the migration and shuts down afterwards. This way your web server stays reachable.
This does not address the problem statement exactly, but a part of it. I went through the AWS t2.micro documentation and found that t2.micro instances are designed to handle CPU bursts of short duration (~1 min) happening after reasonably long intervals. From the t2.micro documentation:
A CPU Credit provides the performance of a full CPU core for one minute. Traditional Amazon EC2 instance types provide fixed performance, while T2 instances provide a baseline level of CPU performance with the ability to burst above that baseline level. The baseline performance and ability to burst are governed by CPU credits.
Running migrations shouldn't be an issue given this, even if they consume 100% of the CPU. We investigated more and found that there were crons running on the server which were not supposed to be there.
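To put the credit arithmetic in perspective, a rough sketch (the 6 credits/hour and 10% baseline figures are the t2.micro numbers as I remember them, so treat them as assumptions):

# Back-of-the-envelope CPU-credit math for a t2.micro (figures assumed, not authoritative).
credits_earned_per_hour = 6     # a t2.micro accrues roughly 6 CPU credits per hour
baseline_utilisation = 0.10     # ~10% of one core is covered by the baseline

migration_minutes = 3           # a migrate run pinning one core for 3 minutes
credits_spent = migration_minutes * (1.0 - baseline_utilisation)
hours_to_recover = credits_spent / credits_earned_per_hour

print("spent ~%.1f credits, earned back in ~%.0f minutes" % (credits_spent, hours_to_recover * 60))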

How to profile Django's bottlenecks for scaling?

I am using Django and Tastypie for a REST API.
For profiling, I am using django-silk and below is a summary of requests:
How do I profile the complete flow? The time taken excluding database queries is (382 - 147) ms on average. How do I figure out the bottleneck and optimize/scale? I did use @silk_profile() on the get_object_list method for this resource, but even this method doesn't seem to be the bottleneck.
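For reference, the profiler was attached roughly like this (a sketch; the resource and model names are placeholders, not the actual code):

from silk.profiling.profiler import silk_profile
from tastypie.resources import ModelResource
from myapp.models import Thing   # hypothetical model

class ThingResource(ModelResource):
    class Meta:
        queryset = Thing.objects.all()

    @silk_profile(name="ThingResource.get_object_list")
    def get_object_list(self, request):
        # Everything inside the decorated method shows up as a profile in Silk.
        return super(ThingResource, self).get_object_list(request)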
I used caching to decrease the response time, but that didn't help much. What are the other options?
When testing using loader.io, the peak the server can handle is 1000 requests per 30 secs (which seems very low). Other than caching (which I already tried) what might help?
Here's a bunch of suggestions:
bring the query count per request down to below 5 (34 per request is really bad); see the sketch after this list
install django-debug-toolbar and have a look at where the time is spent
use gunicorn or uwsgi behind a reverse proxy (NGINX)
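The first point usually comes down to eliminating N+1 query patterns with select_related/prefetch_related; a generic sketch (the models are hypothetical, not taken from the question):

from django.db import models

class Author(models.Model):
    name = models.CharField(max_length=100)

class Entry(models.Model):
    author = models.ForeignKey(Author, on_delete=models.CASCADE)
    tags = models.ManyToManyField("Tag")

class Tag(models.Model):
    label = models.CharField(max_length=50)

# Before: 1 query for the entries plus 1 query per entry for its author (N+1).
names = [entry.author.name for entry in Entry.objects.all()]

# After: a single JOIN brings the authors along with the entries.
names = [entry.author.name for entry in Entry.objects.select_related("author")]

# For many-to-many or reverse relations, prefetch_related batches the lookups.
entries = Entry.objects.prefetch_related("tags")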
You have too many queries; even if they are relatively fast, you spend some time reaching the database, etc. Also, if you have external cache storage (for example, Redis), it can take some time to connect to it.
To investigate slow parts of the code you have two options:
Use a profiler - although profiling on a local PC may not be representative if you have a distributed system deployed across several machines
Add tracing points to your code that record a message and the current time (something like https://gist.github.com/dbf256/0f1d5d7d2c9aa70bce89). Deploy the patched code, run your load-testing tool against it, and check the logs.
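A minimal version of such a tracing point (generic Python, not the linked gist):

import logging
import time
from functools import wraps

logger = logging.getLogger("tracing")

def trace(label):
    # Log how long the wrapped function takes, tagged with a label.
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            start = time.time()
            try:
                return func(*args, **kwargs)
            finally:
                logger.info("%s took %.1f ms", label, (time.time() - start) * 1000)
        return wrapper
    return decorator

# Usage: decorate the suspect steps of the request flow, deploy, and grep the logs.
@trace("get_object_list")
def get_object_list(request):
    ...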

Sitecore performance enhancements

We need our Sitecore web application to process 60-80 web requests per second. We are using Sitecore 7.0. We have tried a 1 web server + 1 database server deployment, but it only processes 20-25 requests per second. The web server queues up all the other requests in memory, and as we increase the load, memory fills up (we applied all the recommended Sitecore performance enhancements). We need 4x the performance to reach the goal :).
Will it be possible to achieve this goal by upgrading the existing server, or do we have to add more web servers to the production environment?
Note: We are using Lucene indexing as well.
Here are some things you can consider without changing the overall architecture of your deployment:
CDN to offload media and static asset requests
This leaves your content delivery server available to handle important content queries and display logic.
Example www.cloudflare.com
Configure and use Sitecore's built-in caching
This is from the guide:
Investigation and configuration of the Sitecore Caches is broken down into multiple tasks. This way each task is more focused and simplified. The focus is on configuration and tuning of the Sitecore Database Caches (prefetch, data, and item caches). For configuration of the output rendering caching properties, the customer should be made aware of both the Sitecore Cache Configuration Reference and the Sitecore Presentation Component Reference as to how to properly enable these caches and set the properties to expire them.
Check out the Sitecore Tuning Guide
Find Slow Queries or Controls
It sounds like your application follows Sitecore best practices, but I leave this note in for anyone that might find this answer. Use Sitecore's built-in Debug mode to identify the slowest running controls and sublayouts. Additionally, if you have Analytics set up there is a "Slow Pages" report that might give you some information on where your application is slowing down.
Those things being said, if you're prepared to provision additional servers and set up a load-balanced environment then read on.
Separate Content Delivery and Content Management
To me the first logical step before load-balancing content delivery servers is to separate the content management from the equation. This is pretty easy and the Scaling Guide walks you through getting the HistoryEngine set up to keep those Lucene indexes up to date.
Set up Load Balancer with 2 or more Content Delivery servers
Once you've done the first step, this can be as easy as cloning your content delivery server and adding it to your load balancer "pool". There are a couple of things to consider here: Does your web application allow users to log in? If so, you'll need to worry about sticky sessions or machine keys. Does your web application use file media instead of blob media? I haven't had to deal with this, but I understand that's another consideration.
Scale your SQL solution
I've seen applications with up to four load balanced content delivery servers and the SQL Server did not have a problem - I think this will be unique to each case depending on a lot of factors: horsepower and tuning of SQL Server, content model of your application, complexity of your queries, caching configuration on content delivery servers, etc. Again, the Scaling Guide covers SQL Mirroring and Failover, so that is going to be your first stop on getting that going.
Finally, I would say contact Sitecore. These guys have probably seen more of what's gone right and what's gone wrong with installations and could get you on the right path. Good luck!
This answer written from a Sitecore developer perspective:
Bottom line: You need to figure out exactly where your performance bottleneck is. That is going to take some digging, but will be very worthwhile. You should definitely be able to serve 60-80 requests/s without any trouble... but of course that makes a lot of assumptions about the nature of your site and the requests.
For my site, I found Sitecore's caching implementation to be sub-par... I created some very simple and aggressive application-specific caches in my app and this made all the difference in the world. For instance, we have 900+ "Partner" items where our sites' advertisements live... and simply putting all these objects in an array in the Application object sped up page requests significantly. Finding an object in a Hashtable indexed by its Item.Name or ID is going to be a lot faster than Sitecore.Context.Database.GetItem("/itempath") or a SelectItems() call (at least, that's my experience). If your architecture and data set will allow this strategy, we've had good experience with it.
Another thing to watch out for is XSLT renderings. Personally, I avoid them completely in favor of ASP.NET UserControls. The XSLT rendering is just slow. As much as 10x slower than a native UserControl rendering the same HTML. So if you have a few of these... replace with some custom code and you'll see a world of difference.