Nginx choking - amazon-web-services

In my production environment I had nginx-1.6.0 as a proxy server with Tomcat 7 as the backend, installed on AWS. At random intervals the site would go down because nginx choked and stopped serving traffic, although Tomcat was running fine the whole time.
I even upgraded nginx to 1.10.2 (the latest stable release), but the behavior continues.
I have checked everything: the server had average traffic and there were no errors in the logs during this period. There is just an increase in CPU % in this time window.
But if the increase in CPU % alone is the cause, why is Tomcat not showing any errors?
I am not sure what the exact reason is. Please help!
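
As a starting point, this is roughly the kind of worker-limit and upstream tuning that is usually reviewed when nginx stops accepting traffic while the backend stays healthy. It is only a sketch under assumed defaults; the values and the Tomcat address are placeholders, not the poster's configuration:

    worker_processes auto;            # one worker per CPU core
    worker_rlimit_nofile 65535;       # raise the per-worker file-descriptor limit

    events {
        worker_connections 10240;     # simultaneous connections per worker
    }

    http {
        upstream tomcat {
            server 127.0.0.1:8080;    # placeholder Tomcat address
            keepalive 32;             # reuse upstream connections under load
        }

        server {
            listen 80;
            location / {
                proxy_pass http://tomcat;
                proxy_http_version 1.1;
                proxy_set_header Connection "";   # needed for upstream keepalive
                proxy_connect_timeout 5s;
                proxy_read_timeout 60s;
            }
        }
    }

If the CPU spike coincides with connection exhaustion, the worker_connections value and the file-descriptor limit are usually the first things to rule out.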

Related

Concurrent requests to Nginx Server

I am having a problem with my server handling a large volume of concurrent users signing on and operating at the same time. Our business case requires the user base to log in at essentially the same time (a 1-minute window) and perform various operations in our application. Once the server goes past 1,000 concurrent users, the website starts loading very slowly and constantly gives 502 errors.
We have reviewed the server metrics (CPU, RAM, data traffic utilization) in the cloud, and most resources are operating below 10%. Scaling up the server doesn't help.
While the website is constantly giving 502 errors or responding only after a long time, direct database queries and SSH connections work fine. We have therefore concluded that the issue comes down to the number of concurrent requests the server can handle, most likely due to an Nginx or Gunicorn configuration we have set up incorrectly.
Please advise on any possible incorrect configuration (or any other solution) to this issue:
Server info :
Server specs - AWS c4.8xlarge (36 vCPUs, 60 GB RAM)
Web Server -Nginx
Backend Server -Gunicorn
nginx conf (image)
Gunicorn conf file
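
Since the actual configuration files were only attached as images, here is a rough sketch of the nginx side of such a setup with the per-worker connection limit raised; the socket path and numbers are assumptions, not the poster's settings:

    events {
        worker_connections 4096;              # default is often 768/1024; too low for 1,000+ users
    }

    http {
        upstream gunicorn_app {
            server unix:/run/gunicorn.sock;   # placeholder; a 127.0.0.1:8000 entry works the same way
        }

        server {
            listen 80;
            location / {
                proxy_pass http://gunicorn_app;
                proxy_set_header Host $host;
                proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
            }
        }
    }

With CPU and RAM mostly idle, 502s at this scale often point at the Gunicorn side rather than nginx, e.g. too few synchronous workers for the burst of logins, so the Gunicorn worker count and worker class are worth reviewing together with the nginx limits.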

504 gateway timeout for any request to Nginx with lots of free resources

We maintain an internal project that has both a web and a mobile application platform. The backend of the project is developed in Django 1.9 (Python 3.4) and deployed on AWS.
The server stack consists of Nginx, Gunicorn, Django and PostgreSQL. We use a Redis-based cache server to serve resource-intensive heavy queries. Our AWS resources include:
t1.medium EC2 (2 core, 4 GB RAM)
PostgreSQL RDS with one additional read-replica.
Right now Gunicorn is set to create 5 workers (following the 2*n+1 rule). Load-wise, there are around 20-30 mobile users making requests every minute and 5-10 users checking the web panel every hour. So I would say, not very much load.
This setup works fine for about 80% of days. But when something goes wrong (for example, we detect a bug in the live system and have to take the server down for maintenance for a few hours; in the meantime the mobile apps queue up requests, so when we bring the backend back online a lot of users hit the system at the same time), the server stops behaving normally and starts responding with 504 gateway timeout errors.
Surprisingly, every time this happened we found the server resources (CPU, memory) to be 70-80% free and the database connection pools mostly idle.
Any idea where the problem is? How can I debug it? If you have already faced a similar issue, please share the fix.
Thank you,
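
One pattern consistent with this description is nginx giving up on the upstream before the five synchronous Gunicorn workers get through the backlog of queued requests. A minimal sketch of the relevant nginx proxy timeouts, with a placeholder address and assumed values:

    location / {
        proxy_pass http://127.0.0.1:8000;   # placeholder Gunicorn address
        proxy_connect_timeout 10s;          # time allowed to establish the upstream connection
        proxy_read_timeout 120s;            # default is 60s; a 504 is returned when this expires
        proxy_send_timeout 120s;
    }

Raising these only buys time; Gunicorn's own timeout and worker count (or an async worker class) would need to keep pace, otherwise workers are killed mid-request and the 504s come back.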

A timeout occurred while attempting to lock fusebox

I have a Fusebox application set up on ColdFusion 9 and it is in production mode. When the application starts, it gives me the following error:
A timeout occurred while attempting to lock fusebox.
I have increased the timeout, but that did not help. I also tried other solutions, but nothing helped.
There is one more thing: I increased the JVM heap size from 1-2 GB to 2-3 GB. After restarting the ColdFusion service, the error remains for 1-2 hours and then the site starts working.
It also stops working after running correctly for more than half a day. Then I get a 504 gateway error and have to restart the service again.
Can anyone guide me on how to solve this problem?
Which FB version are you running? Is it set to production or development mode? If set to development, try setting it to production mode.

jetty 404 error page on hot deployment

I am currently using Jetty 9.1.4 on Windows.
When I deploy the WAR file without the hot deployment config and then restart the Jetty service, all client connections to my Jetty server wait for the server to finish loading during that 5-10 second startup, and then the clients can view the content.
With the hot deployment config on, however, the default Jetty 404 error page is shown during that 5-10 second loading interval.
Is there any way I can make hot deployment behave the same as a complete restart, so that client connections wait instead of seeing the 404 error page?
Unfortunately, after talking with the Jetty developers on IRC (#jetty), this does not currently seem to be possible.
One solution I will try is to run two Jetty instances behind a load-balancing reverse proxy (e.g. nginx) and take one instance down at a time for deployment (a rough nginx sketch follows below).
Of course this will instantly lead to new requirements (session persistence/sharing) which need to be handled. So in conclusion: much work to do in the Java world for zero downtime on deployments.
Edit: I will try this; it seems like a simple enough solution: http://rafaelsteil.com/zero-downtime-deploy-script-for-jetty/ (GitHub: https://github.com/rafaelsteil/jetty-zero-downtime-deploy)
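
For completeness, the two-instance idea could look roughly like this on the nginx side; the ports are assumptions, and marking an instance down before redeploying it keeps clients on the other one:

    upstream jetty_cluster {
        server 127.0.0.1:8080;      # Jetty instance A
        server 127.0.0.1:8081;      # Jetty instance B; add "down" here while redeploying it
    }

    server {
        listen 80;
        location / {
            proxy_pass http://jetty_cluster;
            proxy_next_upstream error timeout http_502 http_503;   # retry the other instance on failure
        }
    }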

Is there a way for the cache to stay up without timeouts after a crash in AppFabric Cache?

First, my setup used for testing purposes:
3 Virtual Machines running with the following configuration:
MS Windows 2008 Server Standard Edition
Latest version of AppFabric Cache
Each one has a local network share where the config file is stored (I have added all the machines in each config)
The cache is distributed but not highly available (we don't have the Enterprise version of Windows)
Each host is configured as lead, so according to the documentation at least one host should be allowed to crash.
Each machine has the website I am testing installed, with local cache configured
One Linux machine is used as a proxy (running Varnish) to distribute the traffic for testing purposes.
That's the setup; now on to the problem. The scenario I am testing simulates one of the servers crashing and then bringing it back into the cluster. I have problems both with the server crashing and with bringing it back up. Steps I am using to test it:
Direct the traffic with Varnish on the Linux machine to one server only.
Log in to make sure there is something in the cache.
Unplug the network cable of one of the other servers (simulates that server crashing).
Now I get a cache timeout and a service error. I want the application to stay up on the servers that didn't crash, but it takes some time for the cache to come back up on the remaining servers. Is that how it should be? Plugging the network cable back in and starting the host causes a similar problem.
So my question is: have I missed something? What I would like to see is that if one server crashes, the cache stays up, since a majority of the leads are still running, and starting the crashed server again brings it back gracefully into the cluster without causing any problems on the other hosts. But maybe that is not how it works?
I ran through a similar test scenario a few months ago where I had a test client generating load on a 3 lead-server cluster with a variety of Puts, Gets, and Removes. I rebooted one of the servers multiple times while the load test was running and the cache stayed online. If I remember correctly, there were a limited number of errors as that server rebooted, but overall the cache appeared to remain healthy.
I'm not sure why you're not seeing similar results, but I would try removing the Varnish proxy from your test and see if that helps.