REST - Get updated resource - web-services

I'm working on a service that scrapes specific links from blogs. The service makes calls to different sites, pulls in the data, and stores it.
I'm having trouble specifying the URL for updating the data on the server; right now I use the verb update in the path to pull in the latest links.
I currently use the following endpoints:
GET /user/{ID}/links - gets all previously scraped links (few milliseconds)
GET /user/{ID}/links/update - starts scraping and returns the scraped data (takes a few seconds)
What would be a good option for the second URL? Some examples I came up with myself:
GET /user/{ID}/links?collection=(all|cached|latest)
GET /user/{ID}/links?update=1
GET /user/{ID}/links/latest
GET /user/{ID}/links/new

Using GET to start a process isn't very RESTful. You aren't really GETting information; you're asking the server to process information. You probably want to POST against /user/{ID}/links (a quick Google for PUT vs POST will give you endless reading if you're curious about the finer points there). You'd then have two options:
POST with background process: If you're using a background process (or queue), you can return a 202 Accepted, indicating that the service has accepted the request and is about to do something. 202 generally indicates that the client shouldn't wait around, which makes sense when performing time-dependent actions like scraping. The client can then issue GET requests against the original /user/{ID}/links endpoint to retrieve updates.
Creative use of Last-Modified headers can tell the client when new updates are available. If you want to be super fancy, you can implement HEAD /user/{ID}/links that will return a Last-Modified header without a response body (saving both bandwidth and processing).
POST with direct processing: If you're doing the processing during the request (not a great plan in the grand scheme of things), you can return a 200 OK with a response body containing the updated links.
Subsequent GETs would perform as normal.
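To make the first option concrete, here is a minimal sketch assuming Flask and a hypothetical enqueue_scrape_job helper that hands the work to a background worker; none of these module or function names come from the original question:

from flask import Flask, jsonify

from myapp.tasks import enqueue_scrape_job            # assumption: background queue
from myapp.store import get_links, last_scraped_at    # assumption: persistence layer

app = Flask(__name__)

@app.route("/user/<int:user_id>/links", methods=["POST"])
def start_scrape(user_id):
    enqueue_scrape_job(user_id)                        # kick off scraping, don't wait for it
    return jsonify({"status": "scraping started"}), 202

@app.route("/user/<int:user_id>/links", methods=["GET"])
def list_links(user_id):
    resp = jsonify(get_links(user_id))                 # previously scraped links
    resp.last_modified = last_scraped_at(user_id)      # lets clients check for updates cheaply
    return resp                                        # Flask answers HEAD for GET routes automatically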

Related

Request data seemingly dirty in multithreaded flask app

We are seeing a random error that seems to be caused by two requests' data getting mixed up. We receive a request for quoting shipping costs on an Order, but the request fails because the requested Order is not accessible by the requesting account. I'm looking for anyone who can provide an inkling on what might be happening here; I haven't found anything on Google, the official Flask help channels, or SO that looks like what we're experiencing.
We're deployed on AWS, with Apache, mod_wsgi, 1 process, 15 threads, and about 10 instances.
Here's the code that sends the email:
msg = f"Order ID {self.shipping.order.id} is not valid for this Account {self.user.account_id}"
body = f"Error:<br/>{msg}<br/>Request Data:<br/>{request.data}<br/>Headers:<br/>{request.headers}"
send_email(msg, body, "devops#*******.com")
request_data = None
The problem is that in that scenario we email ourselves the error and the request data, and the request data we're getting, in many cases, could never have reached that particular piece of code. It can be, for example, a request from the frontend to get the current user's settings, which makes no reference to any orders, never mind trying to get a shipping quote for one.
Comparing the application logs with Apache's access_log, we see that, in all cases, we got two requests on the same instance: one requesting the quote, and another which is the request that actually gets logged. We don't know whether these two requests are processed by the same thread in rapid succession or by different threads, but they come so close together that I think the latter is much more probable. So far we have no way of unambiguously tying the access_log entries to the application logging, so we don't know which of the two requests is logging the error, but the fact is that we're getting routed to a view that does not correspond to the request's content (i.e., we're not sure whether the quoting request is getting the wrong request object, or the other one is getting routed to the wrong view).
Another fact of interest is that we use GraphQL, so part of the routing is done after Flask/Werkzeug do theirs, but the body we get from flask.request at the moment the error shows up does not correspond to the GraphQL function/mutation that gets executed. This also happens in views mapped directly through Flask, though. The user is looked up by the flask-login workflow at the very beginning, and it corresponds to the "bad" request (i.e., the one not for quoting).
The actual issue turned out to be a bug in one of python-graphql's libraries (promise), not in Flask, Werkzeug, or Apache. It was not the request data that was "moving" to a different thread, but a different thread trying to resolve the promise for a query that was supposed to be handled elsewhere.

How to send a POST from Zapier Python in less than 1.00 second?

Is there a way to send a POST from a "Code by Zapier" Zap to MailChimp to add a subscriber to a list and have it reliably complete in less than 1.00 second?
I spent the weekend at a volunteer hackathon for non-profit organizations. My non-profit client needs some data parsed out of an email and used to add a subscriber to a list in MailChimp (the Commerce portion of SquareSpace emails the data but doesn't allow setting storage on the purchase form to MailChimp -- even though that works in SquareSpace if you're not in the Commerce area). We found we could do that with Zapier -- except we ran up against the limits of what one can do with a free account on Zapier, and the non-profit couldn't purchase a paid account right now (the Zapier discount for non-profits is a 15% reduction).
The first limitation was we couldn't do a 3-step zap (maximum 2 steps for free accounts) to go from (1) a Gmail trigger to (2) "Code by Zapier" to parse the email contents and then (3) to MailChimp. The workaround we came up with was to delete step #3 and send to MailChimp directly via http POST to the MailChimp API from a Python script in "Code by Zapier". This worked in test mode in Zapier.
But once the Zap was turned on and we ran an end-to-end test with the site, the Zap failed. There is a 1.00 second runtime limit on free Zaps: after that, Zapier kills the job. The POST to MailChimp took long enough that the Zap timed out.
I used "Code by Zapier" with Python to send the post. They use Python 2.7.10. I was able to import requests to do the post, and I found several other modules worked too, such as json, httplib, and urllib.
What I'm wondering is whether there's a way to get the POST to happen reliably in under 1 second. For example, is there a way to use an async send and then not wait for the response? I'm constrained to Python 2.7.10 and the Zapier environment. Zapier also allows JavaScript as an alternative to Python, so that might be another path to investigate if there's no solution in Python.
David here, from the Zapier Platform team.
I can't speak to the speed of Python specifically, but I know that JavaScript can fire off requests without waiting for a response. We've got a basic example here, which you'd modify to send the request and then immediately end execution (by calling the callback function). This won't be a great experience, because errors will happen silently, but it'll almost certainly fit in the 1 second window.
Separately, the whole python stdlib is available, as well as the requests module (docs)
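One way to approximate that fire-and-forget behaviour in Python is to send the POST with a very short read timeout and swallow the resulting exception: the request body still goes out, you just stop waiting for MailChimp's reply. This is only a sketch, and it assumes the requests version available in Code by Zapier supports separate connect/read timeouts; the URL pieces and field names below are placeholders, and any errors returned by MailChimp will go unnoticed:

import json
import requests

def add_subscriber(email, api_key, list_id, datacenter):
    # Placeholder MailChimp v3 endpoint; adjust list_id and datacenter (e.g. "us6").
    url = "https://%s.api.mailchimp.com/3.0/lists/%s/members" % (datacenter, list_id)
    payload = json.dumps({"email_address": email, "status": "subscribed"})
    try:
        requests.post(
            url,
            data=payload,
            headers={"Content-Type": "application/json"},
            auth=("anystring", api_key),
            timeout=(0.5, 0.2),  # (connect, read): stop waiting for the reply quickly
        )
    except requests.exceptions.Timeout:
        pass  # expected: we deliberately don't wait for MailChimp's response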

How to update progress bar while making a Django Rest api request?

My Django REST app accepts a request to scrape multiple pages for prices and compare them (which takes ~5 seconds), then returns a list of the prices from each page as a JSON object.
I want to update the user with the current operation, for example if I scrape 3 pages I want to update the interface like this :
Searching 1/3
Searching 2/3
Searching 3/3
How can I do this?
I am using Angular 2 for my front end but this shouldn't make a big difference as it's a backend issue.
This isn't the only way, but this is how I do this in Django.
Things you'll need
Asynchronous worker processes
This allows you to do work outside the context of the request-response cycle. The most common are either django-rq or Celery. I'd recommend django-rq for its simplicity, especially if all you're implementing is a progress indicator.
Caching layer (optional)
While you can use the database for persistence in this case, a temporary key-value cache makes more sense here, since the progress information is ephemeral. The Memcached backend is built into Django; however, I'd recommend switching to Redis, as it's more fully featured, super fast, and, since it sits behind Django's caching abstraction, does not add complexity. (It's also a requirement for using the django-rq worker processes above.)
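As a rough idea of the wiring involved (package names and connection details here are assumptions, not part of the original answer), the settings for django-redis plus django-rq might look something like this:

# settings.py (sketch)
INSTALLED_APPS += ["django_rq"]

CACHES = {
    "default": {
        "BACKEND": "django_redis.cache.RedisCache",
        "LOCATION": "redis://127.0.0.1:6379/1",   # point at your Redis instance
    }
}

RQ_QUEUES = {
    "default": {"HOST": "localhost", "PORT": 6379, "DB": 0},
}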
Implementation
Overview
Basically, we're going to send a request to the server to start the async worker, and poll a different progress-indicator endpoint which gives the current status of that worker's progress until it's finished (or failed).
Server side
Refactor the function you'd like to track the progress of into an async task function (using the @job decorator in the case of django-rq).
The initial POST endpoint should first generate a random unique ID to identify the request (possibly with uuid). Then, pass the POST data along with this unique ID to the async function (in django-rq this would look something like function_name.delay(payload, unique_id)). Since this is an async call, the interpreter does not wait for the task to finish and moves on immediately. Return an HttpResponse with a JSON payload that includes the unique ID.
Back in the async function, we need to set the progress using cache. At the very top of the function, we should add a cache.set(unique_id, 0) to show that there is zero progress so far. Using your own math implementation, as the progress approaches 100% completion, change this value to be closer to 1. If for some reason the operation fails, you can set this to -1.
Create a new endpoint to be polled by the browser to check the progress. This looks for a unique_id query parameter and uses this to look up the progress with cache.get(unique_id). Return a JSON object back with the progress amount.
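Putting those server-side pieces together, a minimal sketch (assuming django-rq and Django's cache framework; scrape_prices and the payload shape are placeholders, not part of the original answer) might look like:

# tasks.py
from django.core.cache import cache
from django_rq import job

@job
def compare_prices_task(pages, unique_id):
    cache.set(unique_id, 0)                              # zero progress so far
    results = []
    try:
        for i, page in enumerate(pages, start=1):
            results.append(scrape_prices(page))          # placeholder scraping function
            cache.set(unique_id, float(i) / len(pages))  # fraction done, ends at 1
    except Exception:
        cache.set(unique_id, -1)                         # signal failure to the poller
        raise
    return results

# views.py
import json
import uuid
from django.core.cache import cache
from django.http import JsonResponse
from .tasks import compare_prices_task

def start_comparison(request):
    unique_id = uuid.uuid4().hex
    pages = json.loads(request.body)["pages"]
    compare_prices_task.delay(pages, unique_id)          # returns immediately
    return JsonResponse({"unique_id": unique_id})

def comparison_progress(request):
    unique_id = request.GET["unique_id"]
    return JsonResponse({"progress": cache.get(unique_id)})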
Client side
After sending the POST request for the action and receiving a response, that response should include the unique_id. Immediately start polling the progress endpoint at a regular interval, setting the unique_id as a query parameter. The interval could be something like 1 second using setInterval(), with logic to prevent sending a new request if there is still a pending request.
When the progress received equals 1 (or -1 for failures), you know the process is finished and you can stop polling.
That's it! It's a bit of work just to get progress indicators, but once you've done it once it's much easier to re-use the pattern in other projects.
Another way to do this which I have not explored is via Webhooks / Channels. In this way, polling is not required, and the server simply sends the messages to the client directly.

Auditing Jetty Client requests and responses

I have a requirement to count the Jetty transactions and measure the time it took to process the request and get back the response, using JMX for our monitoring system.
I am using Jetty 8.1.7 and I can't seem to find a proper way to do this. I basically need to identify when a request is sent (due to Jetty's async approach this is triggered from thread A) and when the response is complete (as onResponseComplete is called on another thread).
I usually use ThreadLocal for such state in other areas where I need similar functionality, but obviously this won't work here.
Any ideas how to overcome this?
To use Jetty's async requests you basically have to subclass ContentExchange and override its methods. So you can add an extra field to it which would contain a timestamp of when the request was sent, and use it later in your onResponseComplete() method to measure the processing time. If you need to know the time when your request was actually sent to the server, instead of when it was created, you can override the onRequestCommitted() and onRequestComplete() methods.

Sustain an http connection while django processes a big request (20mins+)

I've got a Django site that produces a CSV download. The content of the CSV is dictated by user-defined parameters. It's possible that users will set parameters that require significant thinking time on the server. I need a way of sustaining the HTTP connection so the browser doesn't kick up an error message. I heard that it's possible to send intermittent HTTP headers to do this. Can anyone point me in the right direction to set this up on a Django site?
(Unfortunately I'm stuck with the possibility of slow reports; improving my SQL won't mitigate this.)
Don't do it online. Trigger an offline task, use a bit of JavaScript to repeatedly call a view that checks if the task has finished, and redirect to the finished file when it's ready.
Instead of blocking the user and their browser for 20 minutes (which is not a good idea), do the time-consuming task in the background. When the task finishes and the result is generated, simply notify the user, so they just need to download the ready result.
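A rough sketch of that shape in Django (all names here are hypothetical; generate_report_rows stands in for the slow report code, and Celery is just one possible task queue):

# tasks.py
import csv
import os
from celery import shared_task

@shared_task
def build_report(report_id, params):
    path = os.path.join("/tmp/reports", "%s.csv" % report_id)
    with open(path, "w") as f:
        writer = csv.writer(f)
        for row in generate_report_rows(params):   # placeholder for the slow query
            writer.writerow(row)

# views.py
import os
from django.http import FileResponse, JsonResponse

def report_status(request, report_id):
    path = os.path.join("/tmp/reports", "%s.csv" % report_id)
    return JsonResponse({"ready": os.path.exists(path)})   # polled by the JavaScript

def report_download(request, report_id):
    path = os.path.join("/tmp/reports", "%s.csv" % report_id)
    return FileResponse(open(path, "rb"), content_type="text/csv")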