Sidekiq Stats mysteriously reset - amazon-web-services

We are in the process of migrating from Heroku to AWS, and I am noticing the Sidekiq stats mysteriously resetting for no apparent reason.
This is happening in several different applications that are connected to the same Redis instance, each with its own namespace set in initializers/sidekiq.rb.
The stats reset across all of the Sidekiq counters at the same time. It seems like perhaps we are momentarily dropping the Redis connection, but that is just wild conjecture and at any rate I'm not sure how to mitigate it.
Is this a common problem? Is there a setting I can tweak?

Someone is running the FLUSHDB or FLUSHALL command and clearing out the data in Redis. Perhaps one of the apps does this when it starts.
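One way to confirm this is to watch the Redis command stream while the apps boot. A minimal sketch, assuming Python 3 and the redis-py package; the host and port are placeholders for your shared instance, and note that MONITOR adds overhead, so only run it long enough to catch the culprit:
# watch_flush.py - print any FLUSHDB/FLUSHALL this Redis instance receives.
# Assumes: pip install redis ; host/port below are placeholders.
import redis

r = redis.Redis(host="your-redis-host", port=6379)

with r.monitor() as mon:          # MONITOR streams every command Redis executes
    for entry in mon.listen():    # each entry is a dict: time, client_address, command, ...
        cmd = entry["command"].upper()
        if cmd.startswith("FLUSHDB") or cmd.startswith("FLUSHALL"):
            print(entry["time"], entry["client_address"], cmd)
If a client does show up issuing FLUSHALL, another option is to rename or disable those commands with the rename-command directive in redis.conf.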

Related

How can I make celery more robust with regards to dropped tasks?

Occasionally (read: too often) my celery setup will drop tasks. I'm running the latest celery 4.x on Django 1.11 with a redis backend for the queue and results.
I don't know exactly why tasks are being dropped, but what I suspect is that a worker is starting a job, then the worker is killed for some reason (autoscaling action, redeployment, out-of-memory...) and the job is killed in the middle.
At this point it has probably already left the Redis queue, so it won't be picked up again.
So my questions are:
How can I monitor this kind of thing? I use celerymon; the task is not reported as failed, and yet I don't see in my database the data I expected from the task that I suspect failed.
How can I make celery retry such tasks without implementing my own "fake queue" with flags in the database?
How do I make celery more robust and dependable in general?
Thanks for any pointers!
You should use RabbitMQ instead of Redis. I read this in the Celery documentation (right here: https://docs.celeryproject.org/en/stable/getting-started/first-steps-with-celery.html#choosing-a-broker):
RabbitMQ is feature-complete, stable, durable and easy to install. It’s an
excellent choice for a production environment.
Redis is also feature-complete, but is more susceptible to data loss in the
event of abrupt termination or power failures.
With RabbitMQ, your problem of losing messages on restart should go away.
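For illustration, a minimal sketch of pointing a Celery app at RabbitMQ; the project name, broker URL and credentials are placeholders, and the acks_late flag (not mentioned above, but a standard Celery option) tells the broker to re-deliver a task whose worker disappears before finishing it:
# celery_app.py - Celery configured with RabbitMQ (AMQP) as the broker.
# The URLs and credentials are placeholders; adjust for your environment.
from celery import Celery

app = Celery(
    "myproject",
    broker="amqp://user:password@rabbitmq-host:5672//",  # RabbitMQ instead of redis://
    backend="redis://redis-host:6379/1",                 # results can still live in Redis
)

# Acknowledge only after the task completes, so a task whose worker dies
# mid-run can be re-delivered by the broker instead of being silently lost.
@app.task(acks_late=True)
def import_data(row_id):    # hypothetical task, for illustration only
    ...
The broker and the result backend are independent settings, so you can switch the broker to RabbitMQ without changing where your results are stored.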

Very slow: ActiveRecord::QueryCache#call

I have an app on Heroku, running on Puma:
workers 2
threads_count 3
pool 5
It looks like some requests get stuck in the middleware, and it makes the app very slow (VERY!).
I have seen other people's threads about this problem but no solution so far.
Please let me know if you have any hint.
I work for Heroku support, and Middleware/Rack/ActiveRecord::QueryCache#call is commonly reported as a problem by New Relic. Unfortunately, it's usually a red herring; each time, the source of the problem lies elsewhere.
QueryCache is where Rails first tries to check out a connection for use, so any problems with a connection will show up here as a request getting 'stuck' waiting. This doesn't mean the database server is out of connections necessarily (if you have Librato charts for Postgres they will show this). It likely means something is causing certain database connections to enter a bad state, and new requests for a connection are waiting. This can occur in older versions of Puma where multiple threads are used and the reaping_frequency is set - if some connections get into a bad state and the others are reaped this will cause problems.
Some high-level suggestions are as follows:
Upgrade Ruby & Puma
If using the rack-timeout gem, upgrade that too
These upgrades often help. If not, there are other options to look into such as switching from threads to worker based processes or using a Postgres connection pool such as PgBouncer. We have more suggestions on configuring concurrent web servers for use with Postgres here: https://devcenter.heroku.com/articles/concurrency-and-database-connections
I will answer my own question:
I simply had to check all the queries to my DB. One of them was taking a VERY long time, and even though it was not run often, it would slow down the whole server for quite some time afterwards (even after the query finished, there was a sort of "traffic jam" on the server).
Solution:
Check all the queries to your database and fix the slowest ones (that might simply mean breaking a query down into a few steps, or running it at night when there is no traffic, etc.).
Once these queries are fixed, everything should go back to normal.
I recently started seeing a spike in time spent in ActiveRecord::QueryCache#call. After looking at the source, I decided to try clearing said cache using ActiveRecord::Base.connection.clear_query_cache from a Rails console attached to the production environment. The error I got back was PG::ConnectionBad: could not fork new process for connection: Cannot allocate memory, which led me to this other SO question: Heroku Rails could not fork new process for connection: Cannot allocate memory.

Counter randomly resets on page refresh, not sure why? [heroku/python]

So right now I have a global counter variable, and it updates whenever a message is sent. Running my program locally I get no weirdness, but on Heroku, sometimes the variable just drops back to zero when I reload the page. Reload it again, and it's back to a number.
I don't know why that is happening. It only happens on heroku. What am I doing wrong? Is there a better way to do this?
https://github.com/sharkwheels/sendCatsWeb/blob/master/app.py
Thanks!
Your counter variable is running in-memory only.
What this means is that if your Python process stops running (restarts, crashes, whatever) -- the next time this web server boots up, that counter will be reset back to 0 (the default value).
When you run any programs on Heroku, they will restart frequently. This happens randomly, and for many reasons:
Maybe Heroku had to move your application to another web server due to load issues.
Maybe your process crashes, and Heroku restarted it.
Maybe Heroku created another instance of your web app to handle increased load.
This stuff can happen for many reasons.
Regardless -- you should NEVER expect Heroku to run your webapp continuously. Always expect your process to be restarted. This is a best practice in the web development community.
Your application should ideally function something like this:
When a visitor hits your page, you send an increment request to a counter that is persisted permanently in some form of database.
This way, regardless of how often your web application reboots, your data is never lost.
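As a concrete illustration, here is a minimal sketch of that pattern for a small Flask app, using Redis as the persistent store; the route names, key name and REDIS_URL variable are assumptions for the example, not taken from the linked app.py (any real database works the same way):
# app.py (sketch) - persist the counter in Redis instead of a Python global.
# Assumes: pip install flask redis ; REDIS_URL points at a Redis add-on.
import os
import redis
from flask import Flask

app = Flask(__name__)
store = redis.Redis.from_url(os.environ.get("REDIS_URL", "redis://localhost:6379"))

@app.route("/send")
def send():
    store.incr("message_count")   # atomic increment, survives dyno restarts
    return "sent"

@app.route("/")
def index():
    count = int(store.get("message_count") or 0)
    return "Messages sent: {}".format(count)
On Heroku the Redis URL would come from the add-on's config var; the key point is simply that the count lives outside the dyno's memory.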

Duration of service alert constantly changing on Nagios

OK, so before I start, full disclosure: I'm pretty new to Nagios (only been using it for 3 weeks), so forgive me for lack of brevity in this explanation.
In my environment which I inherited, I have two redundant Nagios instances running (primary and secondary). On the primary, I added an active check for seeing if Apache is running on a select group of remote hosts (modifying commands.cfg and services.cfg). Unfortunately, it didn't go well so I had to revert the changes back to the previous configuration.
Where my issue comes in is this: after reverting the changes (deleted the added lines, started Nagios back up), the primary instance of Nagios' web UI is showing that a particular service is going critical intermittently with a change in duration, e.g., when the service is showing as OK, it'll be 4 hours but when it's critical, it'll show as 10 days (see here for an example host; the screenshots were taken less than a minute apart). This is only happening when I'm refreshing any of the Current Status pages or going to an individual host to view monitored services and refreshing there as well. Also, to note, this is a passive check for the service with checking freshness enabled.
I've already done a manual check from the primary Nagios server via the CLI and the status comes back as OK every time. I figured that there was a stale state somewhere in retention.dat, status.dat, objects.cache, or objects.precache, but even after stopping Nagios, removing said files, starting it back up, and restarting NSCA, the same behavior persists. The secondary Nagios server isn't showing this behavior; it shows the correct statuses for all hosts and services, and no modifications were made to it either.
Any help would be greatly appreciated and in advance, thanks! I've already posted up on the Nagios Support forums, but to no avail.
EDIT: Never mind. Turns out there were two instances of Nagios running, hence the intermittent nature. Killed off both and started Nagios again and it stabilized.

What could be causing seemingly random AWS EC2 server to Crash? (Error couldn't establish database connection)

To begin, I am running a Wordpress site on an AWS EC2 Ubuntu Micro instance. I have already confirmed that this is NOT an error with Wordpress/mysql.
Seemingly at random the site will go down and I'll get the "Error establishing database connection" message. The server says that it is running just fine, and rebooting usually fixes the issue; however, I'd like to figure out the cause and resolve it so this stops happening (for the past 2 weeks it has gone down almost every other day).
It's not a spike in traffic, or at least Google Analytics hasn't shown the site as having any spikes in traffic (it averages about 300 visits per day.)
What's the cause, and how can this be fixed?
Sounds like you might be running into the throttling that is a limitation of t1.micro instances. If you use too many CPU cycles, you will be throttled.
See http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/concepts_micro_instances.html#available-cpu-resources-during-spikes
The next time this happens I would check some general stats on the health of the instance. You can get a feel for the high-level health of the instance using the 'top' command (http://linuxaria.com/howto/understanding-the-top-command-on-linux?lang=en). Be sure to look for CPU and memory usage. You may find a process (pid) that is consuming a lot of resources and starving your app.
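If you want to capture that information automatically the next time the site goes down, a small logging script can help. A sketch, assuming Python 3 and the psutil package (an assumption; it is not installed by default), run in the background or from cron:
# resource_log.py - append overall memory plus the top memory consumers to a
# log every minute, so you can see what was running when the site went down.
# Assumes: pip install psutil
import time
import psutil

while True:
    mem = psutil.virtual_memory()
    procs = sorted(
        psutil.process_iter(attrs=["pid", "name", "memory_percent"]),
        key=lambda p: p.info["memory_percent"] or 0.0,
        reverse=True,
    )
    with open("/tmp/resource_log.txt", "a") as log:
        log.write("{} available={}MB\n".format(time.ctime(), mem.available // (1024 * 1024)))
        for p in procs[:5]:
            log.write("  pid={} {} mem={:.1f}%\n".format(
                p.info["pid"], p.info["name"], p.info["memory_percent"] or 0.0))
    time.sleep(60)
Reviewing that log after the next outage should show whether something is exhausting the instance's limited memory.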
More likely, something within your application (how did you come to the conclusion that this is not a Wordpress/MySQL issue?) is going out of control. Possibly there is a database connection not being released? To see what your app is doing, find the process id (pid) for your app:
ps aux | grep "php"
and get a thread dump for that process (kill -3 <pid> is what produces a thread dump for a Java process). This will help you see where your application's threads are stuck (if they are).
Typically it's good practice to execute two thread dumps a few seconds apart and compare trends in both. If there is an issue in the application, you should see a lot of threads stuck at the same point.
You might also want to check out what MySQL is seeing (https://dev.mysql.com/doc/refman/5.1/en/show-processlist.html).
mysql> SHOW FULL PROCESSLIST;
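If you want to capture the process list automatically (for example, from a cron job while the slowdown is happening), here is a sketch using Python and the PyMySQL package; the credentials are placeholders you would replace with the ones in wp-config.php:
# show_processlist.py - dump MySQL's current connections and running queries.
# Assumes: pip install pymysql ; credentials below are placeholders.
import pymysql

conn = pymysql.connect(host="localhost", user="wp_user",
                       password="secret", database="wordpress")
try:
    with conn.cursor() as cur:
        cur.execute("SHOW FULL PROCESSLIST")
        for row in cur.fetchall():
            # Each row: (Id, User, Host, db, Command, Time, State, Info)
            print(row)
finally:
    conn.close()
Rows with a large Time value and a query in the Info column are the ones worth investigating.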
Hope this helps, let us know what you find!