I'm working on a Python 2.7 script that must check a Fedora Commons repository for the existence of some data in 20,000 objects.
Basically this means sending 20,000 HTTP requests to 20,000 different URLs on the repository (which runs on a Tomcat server).
I wrote a script that does the job, but I've been warned by the server's system administrator that it opens too many network connections, which causes trouble.
So far my script uses urllib2 to make the HTTP requests:
import urllib2

response = urllib2.urlopen(url)
response_content = response.read()
This code effectively opens one new network connection per request.
I have tried other libraries to make the requests, but could not find any way to reuse the same connection for all of them. Both solutions below still open many network connections, even if the number is much lower (each seems to open one connection per 100 HTTP requests, which in my case still means around 200 connections).
httplib:
url = "http://localhost:8080/fedora/objects/test:1234?test="
url_infos = urlparse(url)
conn = httplib.HTTPConnection(url_infos.hostname + ":" + str(url_infos.port))
for x in range(0, 20000):
myurl = url + str(x)
conn.request("GET", myurl)
r = conn.getresponse()
response_content = r.read()
print x, "\t", myurl, "\t", r.status
requests:
url = "http://localhost:8080/fedora/objects/test:1234?test="
s = requests.Session()
for x in range(0, 20000):
myurl = url + str(x)
r = s.get(myurl)
response_content = r.content
print x, "\t", myurl, "\t", r.status_code
Even if the number of connections is much better, ideally I'd like to use one or very few connections for all requests. Is that even possible? Is this figure of 100 requests per connection determined by my system or by the server? By the way, I also tried pointing the requests at an Apache server, and the result was the same.
The fact that both solutions share some code, as Lukasa said, and the fact that the results were the same whether querying Apache or Tomcat, made me first think the problem was in the Python code. But in fact it was down to the servers' configuration.
The trick is that both Apache and Tomcat have a setting which indicates how many HTTP requests may be served over the same TCP connection, and both default to 100.
Tomcat:
maxKeepAliveRequests:
The maximum number of HTTP requests which can be pipelined until the connection is closed by the server.
If not specified, this attribute is set to 100.
See http://tomcat.apache.org/tomcat-7.0-doc/config/http.html#Standard_Implementation
Apache:
MaxKeepAliveRequests:
The MaxKeepAliveRequests directive limits the number of requests allowed per connection when KeepAlive is on
Default: MaxKeepAliveRequests 100
See http://httpd.apache.org/docs/2.2/en/mod/core.html#maxkeepaliverequests
By raising these values, only a very few connections are indeed needed.
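For reference, a sketch of where these settings live; the values below are illustrative only and should be adapted to your installation:
Tomcat (conf/server.xml, on the HTTP connector; -1 allows unlimited requests per connection):
<Connector port="8080" protocol="HTTP/1.1" maxKeepAliveRequests="-1" />
Apache (httpd.conf; a value of 0 means unlimited):
KeepAlive On
MaxKeepAliveRequests 0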
I have an HTTP API Gateway with an HTTP integration backend server on EC2. The API gets lots of queries during the day, and looking at the logs I realized that the API sometimes returns a 503 HTTP code with the body:
{ "message": "Service Unavailable" }
When I found this out, I tried the API by running the HTTP requests many times in Postman; when I try twenty times, I get at least one 503.
I then thought that the HTTP integration server was busy, but the server is not loaded, and when I go directly to the HTTP integration server I get 200 responses every time.
The timeout parameter is set to 30000 ms and the endpoint's average response time is 200 ms, so timeout is not the problem. Also, the HTTP 503 comes back instantly, not after 30 seconds.
Can anyone help me?
Thanks
I solved this issue by editing the keep-alive connection parameters of my internal integration server. AWS API Gateway expects standard keep-alive behaviour, so I started tweaking my NGINX server parameters until the issue went away.
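As an illustration of the kind of keep-alive tuning meant here, a minimal NGINX sketch; the exact values are assumptions and should exceed the idle timeout of the gateway in front:
# nginx.conf, inside the http (or server) block
keepalive_timeout  75s;     # keep idle connections open longer than the gateway does
keepalive_requests 1000;    # allow many requests over a single connection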
Had the same issue on a self-made microservice in Node that was integrated into AWS API Gateway. After some reconfiguration of the CloudWatch logs I got a further indicator of what was wrong: INTEGRATION_NETWORK_FAILURE
Verify that your problem is the same, i.e. through more elaborate log output:
In API Gateway - Logging, add more output in "Log format".
Use this or similar content for "Log format":
{"httpMethod":"$context.httpMethod","integrationErrorMessage":"$context.integrationErrorMessage","protocol":"$context.protocol","requestId":"$context.requestId","requestTime":"$context.requestTime","resourcePath":"$context.resourcePath","responseLength":"$context.responseLength","routeKey":"$context.routeKey","sourceIp":"$context.identity.sourceIp","status":"$context.status","errMsg":"$context.error.message","errType":"$context.error.responseType","intError":"$context.integration.error","intIntStatus":"$context.integration.integrationStatus","intLat":"$context.integration.latency","intReqID":"$context.integration.requestId","intStatus":"$context.integration.status"}
After using the API Gateway endpoint and reproducing the failure, consult the logs again; the error fields (errMsg, errType, intError) should now point to the INTEGRATION_NETWORK_FAILURE.
Solve it in the NodeJS microservice (using Express)
Add timeouts for headers and keep-alive to the Express server's socket configuration when listening:
const app = require('express')();

// If not already set, and you need to advertise the keep-alive in the
// HTTP response, you might want to use this:
/*
app.use((req, res, next) => {
    res.setHeader('Connection', 'keep-alive');
    res.setHeader('Keep-Alive', 'timeout=30');
    next();
});
*/

/* ...your main logic... */

const server = app.listen(8080, 'localhost', () => {
    console.warn(`⚡️[server]: Server is running at http://localhost:8080`);
});

server.keepAliveTimeout = 30 * 1000; // <- important: keep sockets open at least as long as the gateway does
server.headersTimeout = 35 * 1000;   // <- important: must be greater than keepAliveTimeout
Reason
Some AWS components seem to demand that a connection be kept alive, even if the server indicates otherwise (Connection: close). When API Gateway (and possibly AWS ELBs) tries to recycle such a connection, the reuse fails because the other side has most likely already closed it, hence the assumed "NETWORK FAILURE".
This error is intermittent, since the API Gateway at least seems to close unused connections after a while, so the next execution runs cleanly. I can only assume they keep connections alive for performance reasons and do not want to settle for anything less.
I recently tried my hand at the new Gmail API, and all seems to work fine except for one thing. My issue is as follows:
I am working on a receptionist project that may need to generate more than one email in less than a minute during busy hours. So, just for testing purposes, I ran the following code, which works fine:
if __name__ == '__main__':
    service = setup()  # Simply a helper function that does the basic credential check. Works fine!
    print('service:' + str(service))
    for counter in range(1, 10):
        print('Sending message ' + str(counter))
        message = create_message(<SENDER_EMAIL_ID>, <RECEIVER_EMAIL_ID>,
                                 "Email Number: " + str(counter), "Sample text")
        response = send_message(service, 'me', message)
        print(response)
The setup() function is as follows:
def setup():
    credentials = get_credentials()
    http = credentials.authorize(httplib2.Http())
    service = discovery.build('gmail', 'v1', http=http)
    return service
Now, when I run the code, say, three times consecutively in less than a minute, it runs fine and I am able to see all 27 emails in the sent folder of SENDER_EMAIL_ID using a web browser, so the Gmail API is sending every message whenever a request is made. However, only some of these emails reach RECEIVER_EMAIL_ID; the rest are simply dropped.
However, if I run the program with, say, a 2-5 minute delay between runs, all the mails are received.
I have no idea why this happens.
Any help would be really appreciated. :)
To expound more on @ken-y-n's response in the comments section, the Gmail API has usage limits. Specifically for this product, daily usage is about:
1 billion quota units / day
250 quota units / user / second
You may have encountered the rateLimitExceeded error during your tests.
Since you're sending emails in a loop, each call to send costs about 100 units (plus other costs depending on the methods you're calling). This is the reason why some emails seem to be dropped. You can counter this by implementing exponential backoff on the messages that failed to send, as sketched below.
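A minimal sketch of such a backoff, assuming the service object from the question's setup(); the retried status codes, the retry cap, and the helper name are illustrative, not part of the Gmail API:
import random
import time
from googleapiclient import errors  # "from apiclient import errors" in older releases

def send_with_backoff(service, user_id, message, max_retries=5):
    # Retry rate-limited sends, doubling the wait each time, plus jitter.
    for attempt in range(max_retries):
        try:
            return service.users().messages().send(
                userId=user_id, body=message).execute()
        except errors.HttpError as error:
            if error.resp.status in (403, 429, 500, 503):
                time.sleep(2 ** attempt + random.random())
            else:
                raise
    raise RuntimeError('Giving up after %d attempts' % max_retries)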
Another alternative, instead of running it all in a loop, is to use batch requests, which group your API calls together to reduce the number of HTTP connections your app makes.
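A sketch of the batch variant, again assuming the service object and the create_message helper from the question; sender and receiver stand in for the question's placeholder addresses:
def handle_result(request_id, response, exception):
    # Called once per batched call; exception is None on success.
    if exception is not None:
        print('Message %s failed: %s' % (request_id, exception))

batch = service.new_batch_http_request(callback=handle_result)
for counter in range(1, 10):
    message = create_message(sender, receiver,
                             "Email Number: " + str(counter), "Sample text")
    batch.add(service.users().messages().send(userId='me', body=message))
batch.execute()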
I have a Java web app hosted on Google App Engine (GAE). The user clicks on a button and gets a data table with 100 rows. At the bottom of the page there is a "Make Web service calls" button. Clicking on it, the application takes one row at a time and makes a third-party web service call using the URLConnection class. That part is working fine.
However, since there is a 60-second limit on the HttpRequest/Response cycle, not all 100 transactions go through; the timeout happens around row 50 or so.
How do I create a loop and send the web service calls without the user having to click on 'Make Web service calls' more than once?
Is there a way to stop the loop before 60 seconds and then start again without committing the HttpResponse? (I don't want to use an asynchronous Google backend.)
Also, does GAE support file upload (to get the 100 rows from a file instead of a database)?
Thank you.
Adding some code as per the comments:
URL url = new URL(urlString);
HttpURLConnection connection = (HttpURLConnection) url.openConnection();
connection.setDoOutput(true);
connection.setRequestMethod("POST");
connection.setConnectTimeout(35000);
connection.setRequestProperty("Accept-Language", "en-US,en;q=0.5");
connection.setRequestProperty("Authorization", encodedCredentials);

// Send post request
DataOutputStream wr = new DataOutputStream(connection.getOutputStream());
wr.writeBytes(submitRequest);
It all depends on what happens with the results of these calls.
If the results are not returned to the UI, there is no need to block it. You can use the Task Queue API to create 100 tasks and return a response to the user; this takes a few seconds at most. The additional benefit is that tasks let you make up to 10 calls in parallel.
If the results have to be returned to the user, you can still use up to 10 threads to process as many requests in parallel as possible. Hopefully this will bring your time under 1 minute, but you cannot guarantee it, since you depend on responses from third-party resources which may be unavailable at the moment. You will have to implement your own retry mechanism.
Also note that users are not accustomed to waiting several minutes for a website to respond. You may want to consider a different approach, where the user is notified after the last request is processed, without blocking your client code.
And yes, you can load data from files on App Engine.
Try using asynchronous urlfetch calls:
URLFetchService urlFetchService = URLFetchServiceFactory.getURLFetchService();
LinkedList<Future<HTTPResponse>> futures = new LinkedList<>();

// Start all the requests asynchronously
for (URL url : urls) {
    HTTPRequest request = new HTTPRequest(url, HTTPMethod.POST);
    request.setPayload(...);
    futures.add(urlFetchService.fetchAsync(request));
}

// Collect all the results
for (Future<HTTPResponse> future : futures) {
    HTTPResponse response = future.get();
    // Do something with the response
}
I am trying to move some of the messaging system to Redis. I have a question regarding connection management towards Redis from Django. The text below is taken from Quora:
When talking to Redis from Django (or indeed any other web framework, I imagine) an interesting challenge is deciding when to connect and disconnect.
If you make a new connection for every query to Redis, that's a ton of unnecessary overhead considering a single page request might make hundreds of Redis requests.
If you keep one connection open in the thread / process, you end up with loads of unclosed connections which can lead to problems. I've also seen the Redis client library throw the occasional timeout error, which is obviously bad.
The best result I've had has been from opening a single Redis connection at the start of the request, then closing it at the end - which can be achieved with Django middleware. It feels a bit dirty though having to add a piece of middleware just to get this behaviour.
Has anybody had a chance to create such a Redis middleware? I am always in favor of not reinventing the wheel, but I didn't find anything on Google related to this topic.
I implemented the middleware:
import redis
from redis_sessions import settings

# Avoid a new Redis connection on each request
if settings.SESSION_REDIS_URL is not None:
    redis_server = redis.StrictRedis.from_url(settings.SESSION_REDIS_URL)
elif settings.SESSION_REDIS_UNIX_DOMAIN_SOCKET_PATH is None:
    redis_server = redis.StrictRedis(
        host=settings.SESSION_REDIS_HOST,
        port=settings.SESSION_REDIS_PORT,
        db=settings.SESSION_REDIS_DB,
        password=settings.SESSION_REDIS_PASSWORD
    )
else:
    redis_server = redis.StrictRedis(
        unix_socket_path=settings.SESSION_REDIS_UNIX_DOMAIN_SOCKET_PATH,
        db=settings.SESSION_REDIS_DB,
        password=settings.SESSION_REDIS_PASSWORD,
    )

class ReddisMiddleWare(object):
    def process_request(self, request):
        request.redisserver = redis_server
Then in the view I just use request.redisserver.get(key).
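For completeness, a sketch of registering the middleware, assuming it lives in a hypothetical module myapp.middleware (old-style middleware, matching the process_request API used above):
# settings.py
MIDDLEWARE_CLASSES = (
    'myapp.middleware.ReddisMiddleWare',
    # ...the rest of your middleware...
)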
Here is my setup right now:
from django.core import mail

connection = mail.get_connection()
maillist = []

# My real setup is a slightly more complex for-loop, but basically I add all recipients to a list.
for person in object_list:
    mail_subject = "Mail subject here"
    mail_body = "Mail body text...bla bla"
    email_sender = "me@example.com"
    maillist.append((mail_subject, mail_body, email_sender, [person.email]))

# send_mass_mail wants a tuple, so we convert the list
mailtuple = tuple(maillist)
mail.send_mass_mail(mailtuple, fail_silently=False, connection=connection)
However, the for-loop iterates over 1000+ objects/persons, and when I try this method I'm able to send 101 emails before it stops. No errors anywhere that I can see.
A fellow developer mentioned that maybe the POST size was too big? Any ideas from the SO community?
Your SMTP server probably has some send limits. For example, I believe Gmail limits outgoing mail to 100 recipients.
As Micah suggested, there is a good chance you are hitting server limits.
Generally, when dealing with mass mail, it is always a good idea to throttle the sending. Sending 50 mails every 5 seconds for 300 seconds beats sending 3000 mails at once, for many practical reasons including SMTP server limitations; see the sketch below.
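A minimal sketch of such throttling, reusing the maillist from the question; the chunk size and delay are assumptions to tune against your SMTP server's limits:
import time
from django.core import mail

CHUNK_SIZE = 50
connection = mail.get_connection()
for start in range(0, len(maillist), CHUNK_SIZE):
    chunk = tuple(maillist[start:start + CHUNK_SIZE])
    mail.send_mass_mail(chunk, fail_silently=False, connection=connection)
    time.sleep(5)  # give the SMTP server a breather between chunks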
Since you mentioned a POST limit: do you send out the emails in a view? I'm wondering how you handle cancelled requests in your setup.
I'm using a management command to send out 1000+ newsletters. But instead of send_mass_mail I use the normal send method in a loop. It takes about 5 minutes (I don't have an exact count at the moment) to send out the mails, and I haven't run into any server limits yet.
My plan is to switch to Celery to handle the sending through a web interface. Perhaps you want to have a look at it, in case you haven't already.
http://celeryproject.org/
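If it helps, a minimal sketch of what a Celery task for this could look like; the module name and task are hypothetical:
# tasks.py
from celery import shared_task
from django.core import mail

@shared_task
def send_newsletter(subject, body, sender, recipient_list):
    # One mail per task, so failures can be retried individually.
    mail.send_mail(subject, body, sender, recipient_list, fail_silently=False)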