We are using django-eventstream for sending out events to clients. You can think of our workflow to be celery like use case but a very simple one. Things were working flawlessly until we hit the 'too many open files' error (Redhat 7.4). We tracked which processes are opening the files using 'lsof' and found python was shooting several threads which loaded the required libraries (mostly .so files). We are using gunicorn as our server which spawns uvicorn workers. Tried to fall back to 'runserver', but faced the same issue.
On trying out the 'time' and 'chat' examples, we saw the same behavior. On every refresh of the page (same machine, same browser, same tab) a new thread is spawned and 'lsof' lists an increment of about 2k files on every refresh of the page.
We tried to recreate the same issue on two other different machines with the same OS. Saw the same behavior, expect in 1 machine. This was a laptop with 4GB of RAM and the rest are servers with 256GB of RAM. Interestingly everything works absolutely fine in the laptop, but not in the servers. Maybe because of the relative sparsity of resources, OS is closing the files in the laptop but not in servers, which is causing the 'too many open files' error?
Any idea how to resolve this issue?
Cheers!
Going ahead with the threads assumption, tried to limit number of threads by setting ASGI_THREADS . The number of threads are now limited and thus the number of files too. I don't know what will happen if users > ASGI_THREADS will try to connect to the server. I guess now I need to read up on load balancing..
Related
I had installed ColdFusion 2018 recently and with the installation less than a month old (and my understanding of the technology even less), my Cold Fusion service has stopped working. I have tried a number of things and have referred to a number of articles and out of many such errors where the service is not being accessible, some of them were able to get it resolved. However, some other obscure reason that may be causing this error have been untouched and unknown.
Whenever, I try to restart the service, I get an error as shown below:
Windows could not start the ColdFusion 8 Application Server on Local Computer. For more information, review the System Event Log. If this is a non-Microsoft service, contact the service vendor, and refer to server-specific code error 2.”
Without much understanding, I started to google it out. Looking into every one of these posts, I tried
Configure JRE and try to relaunch the service by looking at "JAVA_HOME" variable and JVM.config
Run the batch files in every possible combination to find if anything clicks
Check if the present JAVA version works and is compatible with Coldfusion version installed
Fiddling with the "SessionStorage" var in neo-runtime.xml file as some suggested
and few other tricks coupled with a numerous service restart attempts and a few machine reboots as well.
A service that renders Cold Fusion pages should be shut down abruptly. To add to agony, the CF Admin also depends on the service and hence does not work.
Any pointers to any potential solutions?
I have an app on heroku, running on Puma:
workers 2
threads_count 3
pool 5
It looks like some requests get stuck in the middleware, and it makes the app very slow (VERY!).
I have seen other people threads about this problem but no solution so far.
Please let me know if you have any hint.
!
!
I work for Heroku support and Middleware/Rack/ActiveRecord::QueryCache#call is a commonly reported as a problem by New Relic. Unfortunately, it's usually a red herring as each time the source of the problem lies elsewhere.
QueryCache is where Rails first tries to check out a connection for use, so any problems with a connection will show up here as a request getting 'stuck' waiting. This doesn't mean the database server is out of connections necessarily (if you have Librato charts for Postgres they will show this). It likely means something is causing certain database connections to enter a bad state, and new requests for a connection are waiting. This can occur in older versions of Puma where multiple threads are used and the reaping_frequency is set - if some connections get into a bad state and the others are reaped this will cause problems.
Some high-level suggestions are as follows:
Upgrade Ruby & Puma
If using the rack-timeout gem, upgrade that too
These upgrades often help. If not, there are other options to look into such as switching from threads to worker based processes or using a Postgres connection pool such as PgBouncer. We have more suggestions on configuring concurrent web servers for use with Postgres here: https://devcenter.heroku.com/articles/concurrency-and-database-connections
I will answer my own question:
I simply had to check all queries to my DB. One of them was taking a VERY long time, and even if it was not requested often, it would slow down the whole server for quite some time afterwards(even after the process was done, there was a sort of "traffic jam" on the server).
Solution:
Check all the queries to your database, fix the slowest ones (it might simply mean breaking it down in few steps, it might mean make it run at night when there is no traffic, etc...).
Once this queries are fixed, everything should go back to normal.
I recently started seeing a spike in time spent in ActiveRecord::QueryCache#call. After looking at the source, I decided to try clearing said cache using ActiveRecord::Base.connection.clear_query_cache from a Rails Console attached to the production environment. The error I got back was PG::ConnectionBad: could not fork new process for connection: Cannot allocate memory which lead me to this other SO question at least Heroku Rails could not fork new process for connection: Cannot allocate memory
I am using Coldfusion MX8 server and one of the scheduled task was running from 2 years but now suddenly from 01/12/2014 scheduled tasks are not running. When i browsed the file in browser then the file is running successfully without error.
I am not sure is there any updatation or license expiration problem. I am aware that mid of this year Adobe closed the support for coldfusion 8.
The first most common problem of this problem is external to the server. When you say you browsed to the file and it worked in a browser, it is very important to know if that test was performed on the server desktop. Knowing that you can browse to the file from your desktop or laptop is of small value.
The most common source of issues like this is a change in the DNS or network stack that is interfereing with resolution. For example, if the internal DNS serving your DMZ suddenly starts serving the "external" address - suddenly your server can't browse to your domain. Or if the IP served by the server for the domain in question goes from being 127.0.0.1 to some other IP that the server can't acces correctly due to reverse proxy or LB or some other rule. Finally, sometimes the Apache or IIS is altered so that an IP that previously was serviced (127.0.0.1 being the most common example) now does not respond.
If it is something intrinsic to the scheduler service then Frank's advice is pretty good - especially look for "proxy schduler" entries in the log - they can give you good clues. I would also log results of a scheduled task to a file. Then check the file. If it exists then your scheduled tasks ARE running - they are just not succeeding. Good luck!
I've seen the cf scheduling service crash in CF8. The rest of CF is unaffected.
Have you tried restarting the server?
Here are your concerns:
Your File (works since you tested it manually).
Your Scheduled Task (failed).
Your Coldfusion Application (Service) (any changes here)?
Your Server (what about here).
To test your problem create a duplicate task and schedule it. Leave the other one in place (maybe set your new one to run earlier). Use the same file too. See if it completes.
If it doesn't then you have a larger problem. Since the Coldfusion Server sits atop of the JVM there could be something happening there. Things just don't stop working unless something got corrupted or you got compromised. If you hardened your server by rearranging/renaming the file structure to make it more secure...It would break your task.
So going back: if your test schedule works then determine what is different between the two. Note you have logging capabilities. Logging abilities for CF8
If you are not directly incharge of maintaining this server, then I would recommend asking around and see if there was recent maintenance, if so, what was done to the server?
I am trying to set up Geoserver as a backend to our MVC app. Geoserver works great...except it only lets me do one thing at a time. If I am processing a shapefile, the REST interface and GUI lock up until the job is done processing.
I know that there is the option to Cluster a geoserver configuration, but that would only be load balancing, so instead of only one read/write operation, I would have two instead...but we need to scale this up to at least 20 concurrent tasks at one time.
All of the references I've seen on the internet talk about locking down the number of concurrent connections, but only 1 is allowed the whole time.
Obviously GeoServer is used in production environments that have more than 1 request at the same time. I am just stumped about how to make it happen.
A few weeks ago, my colleague sent this email to the Geoserver Development team, the problem was described as a configuration lock...and that by changing a variable we could release it. The only place I saw this variable was in the source code on GitHub.
Is there a way to specify in one of the config files of Geoserver to turn these locks off so I can do concurrent read/writes? If anybody out there has encountered this before PLEASE HELP!!! Thanks!
On Fri, May 16, 2014 at 7:34 PM, Sean Winstead wrote:
Hi,
We are using GeoServer 2.5 RC2. When uploading a shape file via the REST
API, the server does not respond to other requests until after the shape
file has been processed.
For example, if I start a file upload and then click on the Layers menu
item in the web app, the response for the Layers page is not received until
after the file upload and processing have completed.
I researched the issue but did not find a suitable cause/answer. I did
install the control flow extension and created an controlflow.properties
file in the data directory, but this did not appear to have any effect.
How do I diagnose the cause of this behavior?
Simple, it's the configuration lock. Our configuration subsystem is not
able to handle correct concurrent writes,
or reads during writes, so there is a whole instance read/write lock that
is taken every time you use the rest
api and the user interface, nothing can be done while the lock is in place
If you want, you can disable it using the system variable
GeoServerConfigurationLock.enabled,
-DGeoServerConfigurationLock.enabled=true
but of course we cannot predict what will happen to the configuration if
you do that.
Cheers
Andrea
-DGeoServerConfigurationLock.enabled=true is referring to a startup parameter given to the java command when GeoServer is first started. Looking at GeoServer's bin/startup.sh and bin\startup.bat the approved way to do this is via an environment variable named JAVA_OPTS. You will see lines like
if [ -z "$JAVA_OPTS" ]; then
export JAVA_OPTS="-XX:MaxPermSize=128m"
fi
in startup.sh and
if "%JAVA_OPTS%" == "" (set JAVA_OPTS=-XX:MaxPermSize=128m)
in startup.bat. You will need to make those
... JAVA_OPTS="-DGeoServerConfigurationLock.enabled=true -XX:MaxPermSize=128m"
or define that JAVA_OPTS environment variable similarly before GeoServer is started.
The development team's response of "of course we cannot predict what will happen to the configuration if you do that", however, suggests to me that there may be concurrency issues lurking; which may be likely to surface more frequently as you scale up. Maybe you want to think about disconnecting the backend processing of those shape files from the REST requests to do so using some queueing mechanism instead of disabling GeoServer's configuration lock.
Thank You, I figured it out. We didn't even need to do this because we were only using one login for the REST interface (admin) instead of making a new user for each repository, now the locking issue doesn't happen.
First my setup that is used for testing purpose:
3 Virtual Machines running with the following configuration:
MS Windows 2008 Server Standard Edition
Latest version of AppFabric Cache
Each one has a local network share where the config file is stored (I have added all the machines in each config)
The cache is distributed but not high availibility (we don't have Enterprise version of Windows)
Each host is configured as lead, so according to the documentation at least one host should be allowed to crash.
Each machine has the website I testing installed, and local cache configured
One linux machine that is used as a proxy (varnish is used) to distribute the traffic for testing purpose.
That's the setup and now on to the problem. The scenario I am testing is simulating one of the servers crashing and then bring it back in the cluster. I have problem both with the server crashing and bringing it back up. Steps I am using to test it:
Direct the traffic with Varnish on the linux machine to one server only.
Log in to make sure there is something in the cache.
Unplug the network cable for one of the other servers (simulates that server crashing)
Now I get a cache timeout and I get a service error. I want the application to still be up on the servers that didn't crash, and it take some time for the cache to come back up on the remaining servers. Is that how it should be? Plugging the network cable back in and starting the host cause a similar problem.
So my question is if I have missed something? What I would like to see happen is that if one server crashes the cache should still remaing upp since a majority of the leads are still up, and starting the crashed server again should bring it back gracefully into the cluster without any causing any problems on the other hosts. But that might no be how it works?
I ran through a similar test scenario a few months ago where I had a test client generating load on a 3 lead-server cluster with a variety of Puts, Gets, and Removes. I rebooted one of the servers multiple times while the load test was running and the cache stayed online. If I remember correctly, there were a limited number errors as that server rebooted, but overall the cache appeared to remain healthy.
I'm not sure why you're not seeing similar results, but I would try removing the Varnish proxy from your test and see if that helps.