Camel route POSTs to service that takes 20+ minutes to respond - web-services

I have an Apache Camel (version 2.15.3) route that is configured as follows (using a mix of XML and Java DSL):
Read a file from one of several folders on an FTP site.
Set a header to indicate which folder it was read from.
Do some processing and auditing.
Synchronously POST to an external REST service (jax-rs 1.1, Glassfish, Java EE 6).
The REST service takes a long time to do its job, 20+ minutes.
Receive the reply.
Do some more processing and auditing.
Write the response to one of several folders on an FTP site.
Use the header set at the start to know which folder to write to.
This is all configured in a single path of chained routes.
The problem is that the connection to the external REST service times out while the service is still processing. The infrastructure is a bit complex (edge servers, load balancers, Glassfish), and in any case I don't think increasing the timeout is the right solution.
How can I implement this route such that I avoid timeouts while still meeting all my requirements to (1) write the response to the appropriate FTP folder, (2) audit the transaction, and (3) meet other transaction/context-specific requirements?
I'm relatively new to Camel and REST, so maybe this is easy, but I don't know what Camel and REST tools and techniques to use.
(Questions and suggestions for improvement are welcome.)

Isn't it possible to break the two main steps apart and have two asynchronous operations?
I would do as follows.
Read a file from one of several folders on an FTP site.
Set a header to indicate which folder it was read from.
Save the header, file name, and other relevant information in a cache. There is a Camel component called camel-cache that is relatively easy to set up, and it can store key-value pairs or other objects.
Do some processing and auditing. Asynchronously POST to an external REST service (jax-rs 1.1, Glassfish, Java EE 6). Note that we are posting asynchronously here.
Step 2.
Receive the reply.
Look up the reply's identifier (the file name or some other identifier) in the cache to match the reply to its request, then fetch the header.
Do some more processing and auditing.
Write the response to one of several folders on an FTP site.
This way, the route doesn't hold a connection open while it waits, and processing can take 20 minutes or longer. You just set your cache entries to not expire for, say, 24 hours.
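The cache step above can be sketched independently of Camel. The following is a minimal illustration in Python, not camel-cache's API: `RouteContext`, `CorrelationStore`, and the field names are hypothetical, and the in-memory dictionary stands in for whatever cache or database the route would really use. It shows the request side saving the folder and file name under a correlation id, and the reply side fetching them back.

```python
import time
import uuid
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class RouteContext:
    """What the reply side needs to finish the job (hypothetical fields)."""
    source_folder: str
    file_name: str
    created_at: float = field(default_factory=time.time)


class CorrelationStore:
    """In-memory stand-in for a cache such as camel-cache or a database table."""

    def __init__(self, ttl_seconds: float = 24 * 60 * 60) -> None:
        self._ttl = ttl_seconds
        self._entries: Dict[str, RouteContext] = {}

    def save(self, context: RouteContext) -> str:
        """Store the context and return the id to send along with the POST."""
        correlation_id = str(uuid.uuid4())
        self._entries[correlation_id] = context
        return correlation_id

    def pop(self, correlation_id: str, now: Optional[float] = None) -> Optional[RouteContext]:
        """Fetch and remove the context for a reply; None if unknown or expired."""
        context = self._entries.pop(correlation_id, None)
        if context is None:
            return None
        now = time.time() if now is None else now
        if now - context.created_at > self._ttl:
            return None
        return context
```

The reply must carry the same correlation id back (for example in a header or in the payload); that part depends on what the REST service can echo and is assumed here.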

This is a typical asynchronous use case. Can the REST service give you a token or some other unique id immediately after you call it? You could then have a batch job or another Camel route pick up this id from a database or cache and call the REST service again after 20 minutes.
This is the ideal solution I can think of, if the REST service can provide this.
You are right, waiting 20 minutes on a synchronous call is a crazy idea. Also, what is the estimated size of the file/payload you are planning to post to the REST service?
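The token-and-poll idea can be sketched as below. This is only an illustration in Python under stated assumptions: the status values `"PENDING"`, `"DONE"`, and `"FAILED"`, the `check_status` callable, and the job id format are all hypothetical, since the thread does not say what the REST service can actually return.

```python
import time
from typing import Callable, Optional, Tuple

# (state, result) as returned by a hypothetical status endpoint.
Status = Tuple[str, Optional[str]]


def poll_for_result(
    check_status: Callable[[str], Status],
    job_id: str,
    interval_seconds: float = 60.0,
    max_attempts: int = 60,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Ask for the job's status until it is done or attempts run out.

    Returns the result when the job is done, or None if it is still
    pending after max_attempts checks. Raises if the job reports failure.
    """
    for attempt in range(max_attempts):
        state, result = check_status(job_id)
        if state == "DONE":
            return result
        if state == "FAILED":
            raise RuntimeError(f"job {job_id} failed")
        if attempt < max_attempts - 1:
            sleep(interval_seconds)  # each check is a short request, so no long-lived connection
    return None
```

Each status check is a short request that returns immediately, so no single connection has to survive the 20 minutes of processing.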

Related

Django(2.11) simultaneous (within 10ms) identical HTTP requests

Consider a POST/PUT REST API (using DRF).
If the server receives request1 and, within a couple of ms, request2 that is identical to request1 in every respect (a duplicate request), is there a Django way to prevent request2 from being executed? Or should I deal with it manually by keeping some state?
Any inputs would be much appreciated.
There isn't anything out of the box, so you would need to write something yourself. A piece of custom middleware (https://docs.djangoproject.com/en/3.0/topics/http/middleware/) would probably be best, as it would run over all of the requests. You would need to capture and examine the requests, so you'd need fast storage of some sort, such as an in-memory store.
You could also look into the synchronization primitives in the Python asyncio library - https://docs.python.org/3/library/asyncio-sync.html
Another possible solution would be a FIFO message queue configured to support de-duplication based on content. This would turn the request into a deferred process, though, so it may not be suitable for your needs.
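The "capture and examine" step could look roughly like the following sketch. It is framework-agnostic Python rather than actual Django middleware, and `request_fingerprint`, `DuplicateGuard`, and the one-second window are hypothetical choices. The in-process dictionary only works for a single server process; with several workers the same check would need a shared store.

```python
import hashlib
import threading
import time
from typing import Dict, Optional


def request_fingerprint(method: str, path: str, body: bytes, user_id: str = "") -> str:
    """Hash the parts of a request that make it 'identical' to another one."""
    digest = hashlib.sha256()
    for part in (method.upper().encode(), path.encode(), user_id.encode(), body):
        # Length-prefix each part so different splits cannot collide.
        digest.update(len(part).to_bytes(8, "big"))
        digest.update(part)
    return digest.hexdigest()


class DuplicateGuard:
    """Remembers fingerprints for a short window and flags repeats."""

    def __init__(self, window_seconds: float = 1.0) -> None:
        self._window = window_seconds
        self._seen: Dict[str, float] = {}
        self._lock = threading.Lock()

    def first_seen(self, fingerprint: str, now: Optional[float] = None) -> bool:
        """Return True for the first request in the window, False for a duplicate."""
        now = time.monotonic() if now is None else now
        with self._lock:
            expired = [key for key, seen_at in self._seen.items() if now - seen_at > self._window]
            for key in expired:
                del self._seen[key]
            if fingerprint in self._seen:
                return False
            self._seen[fingerprint] = now
            return True
```

A middleware would compute the fingerprint from the incoming request and, when `first_seen` returns False, short-circuit with whatever response suits the API (for instance a conflict status).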

Communicate internally between Google Cloud Functions?

We've created a Google Cloud Function that is essentially an internal API. Is there any way that other internal Google Cloud Functions can talk to the API function without exposing a HTTP endpoint for that function?
We've looked at PubSub but, as far as we can see, you can send a request (per se) but you can't receive a response.
Ideally, we don't want to expose a HTTP endpoint due to the extra security ramifications and we are trying to follow a microservice approach so every function is its own entity.
I sympathize with your microservices approach and trying to keep your services independent. You can accomplish this without opening all your functions to HTTP. Chris Richardson describes a similar case on his excellent website microservices.io:
You have applied the Database per Service pattern. Each service has
its own database. Some business transactions, however, span multiple
services so you need a mechanism to ensure data consistency across
services. For example, lets imagine that you are building an e-commerce store
where customers have a credit limit. The application must ensure that
a new order will not exceed the customer’s credit limit. Since Orders
and Customers are in different databases the application cannot simply
use a local ACID transaction.
He then goes on:
An e-commerce application that uses this approach would create an
order using a choreography-based saga that consists of the following
steps:
The Order Service creates an Order in a pending state and publishes an OrderCreated event.
The Customer Service receives the event and attempts to reserve credit for that Order. It publishes either a CreditReserved event or a CreditLimitExceeded event.
The Order Service receives the event and changes the state of the order to either approved or cancelled.
Basically, instead of a direct function call that returns a value synchronously, the first microservice sends an asynchronous "request event" to the second microservice which issues a "response event" that the first service picks up. You would use Cloud PubSub to send and receive the messages.
You can read more about this under the Saga pattern on his website.
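The request-event/response-event flow from the quoted saga can be sketched as follows. This is a toy illustration in Python: `InMemoryBus` is a hypothetical stand-in for Cloud Pub/Sub that delivers events synchronously in-process, and the event payload fields (`order_id`, `amount`) are assumptions. None of this is the Pub/Sub client API.

```python
from collections import defaultdict
from typing import Any, Callable, DefaultDict, Dict, List

Event = Dict[str, Any]
Handler = Callable[[Event], None]


class InMemoryBus:
    """Stand-in for Pub/Sub topics; delivers synchronously for simplicity."""

    def __init__(self) -> None:
        self._subscribers: DefaultDict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Handler) -> None:
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, event: Event) -> None:
        for handler in list(self._subscribers[topic]):
            handler(event)


class CustomerService:
    """Reacts to OrderCreated and answers with a credit event."""

    def __init__(self, bus: InMemoryBus, credit_limit: int) -> None:
        self._bus = bus
        self._remaining = credit_limit
        bus.subscribe("OrderCreated", self.on_order_created)

    def on_order_created(self, event: Event) -> None:
        reply = {"order_id": event["order_id"]}
        if event["amount"] <= self._remaining:
            self._remaining -= event["amount"]
            self._bus.publish("CreditReserved", reply)
        else:
            self._bus.publish("CreditLimitExceeded", reply)


class OrderService:
    """Creates pending orders and finalizes them when the reply event arrives."""

    def __init__(self, bus: InMemoryBus) -> None:
        self._bus = bus
        self.orders: Dict[str, str] = {}
        bus.subscribe("CreditReserved", lambda e: self._finish(e, "approved"))
        bus.subscribe("CreditLimitExceeded", lambda e: self._finish(e, "cancelled"))

    def create_order(self, order_id: str, amount: int) -> None:
        self.orders[order_id] = "pending"
        self._bus.publish("OrderCreated", {"order_id": order_id, "amount": amount})

    def _finish(self, event: Event, state: str) -> None:
        self.orders[event["order_id"]] = state
```

The `order_id` carried in both events is what lets the Order Service match a response event to the request it sent; with a real message broker the reply would arrive later, in a separate function invocation.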
The most straightforward thing to do is wrap your API up into a regular function or object, and deploy that extra code along with each function that needs to use it. You may even wish to fully modularize the code, as you would expect from an npm module.

AWS API Gateway Cache - Multiple service hits with burst of calls

I am working on a mobile app that will broadcast a push message to hundreds of thousands of devices at a time. When each user opens their app from the push message, the app will hit our API for data. The API resource will be identical for each user of this push.
Now let's assume that all 500,000 users open their app at the same time. API Gateway will get 500,000 identical calls.
Because all 500,000 nearly concurrent requests are asking for the same data, I want to cache it. But keep in mind that it takes about 2 seconds to compute the requested value.
What I want to happen
I want API Gateway to see that the data is not in the cache, let the first call through to my backend service while the other requests are held in queue, populate the cache from the first call, and then respond to the other 499,999 requests using the cached data.
What is (seems to be) happening
API Gateway, seeing that there is no cached value, is sending every one of the 500,000 requests to the backend service! So I will be recomputing the value with some complex db query way more times than resources will allow. This happens because the last call comes into API Gateway before the first call has populated the cache.
Is there any way I can get this behavior?
I know that based on my example that perhaps I could prime the cache by invoking the API call myself just before broadcasting the bulk push job, but the actual use-case is slightly more complicated than my simplified example. But rest assured, solving this simplified use-case will solve what I am trying to do.
If you anticipate that kind of burst concurrency, priming the cache yourself is certainly the best option. Have you also considered adding throttling to the stage/method to protect your backend from a large surge in traffic? Clients could be instructed to retry on throttles and they would eventually get a response.
I'll bring your feedback and proposed solution to the team and put it on our backlog.
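The behavior the question asks for (one request computes, the rest wait and reuse the value) is often called request coalescing. The answer above does not say API Gateway offers it, so the sketch below assumes it would live in the backend service instead. It is a single-process Python illustration; `SingleFlightCache` is a hypothetical name, and it has no expiry, which a real cache would need.

```python
import threading
from typing import Any, Callable, Dict


class SingleFlightCache:
    """Cache where concurrent misses for one key trigger a single computation."""

    def __init__(self, compute: Callable[[str], Any]) -> None:
        self._compute = compute
        self._values: Dict[str, Any] = {}
        self._in_flight: Dict[str, threading.Event] = {}
        self._lock = threading.Lock()

    def get(self, key: str) -> Any:
        while True:
            with self._lock:
                if key in self._values:
                    return self._values[key]
                event = self._in_flight.get(key)
                leader = event is None
                if leader:
                    # First caller for this key: announce the computation.
                    event = threading.Event()
                    self._in_flight[key] = event
            if leader:
                try:
                    value = self._compute(key)
                    with self._lock:
                        self._values[key] = value
                    return value
                finally:
                    with self._lock:
                        del self._in_flight[key]
                    event.set()  # wake the waiters whether or not compute succeeded
            else:
                # Wait for the leader, then loop to read the cached value
                # (or take over as leader if the computation failed).
                event.wait()
```

With several backend instances each one would still compute the value once, so this reduces the load per instance rather than guaranteeing exactly one computation overall.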

Auditing Jetty Client requests and responses

I have a requirement to count the jetty transactions and measure the time it took to process the request and get back the response using JMX for our monitoring system.
I am using Jetty 8.1.7 and I can’t seem to find a proper way to do this. I basically need to identify when the request is sent (because of Jetty's async approach this is triggered from thread A) and when the response is complete (as onResponseComplete is called in another thread).
I usually use ThreadLocal for such state in other areas I need similar functionality, but obviously this won’t work here.
Any ideas how to overcome?
To use jetty's async requests you basically have to subclass ContentExchange and override its methods. So you can add an extra field to it which would contain a timestamp of when the request was sent, and use it later in your onResponseComplete() method to measure the processing time. If you need to know the time when your request was actually sent to the server instead of when it was created you can override the onRequestCommitted() and onRequestComplete() methods.

Sustain an http connection while django processes a big request (20mins+)

I've got a django site that is producing a csv download. The content of the csv is dictated by user defined parameters. It's possible that users will set parameters that require significant thinking time on the server. I need a way of sustaining the http connection so the browser doesn't kick up an error message. I heard that it's possible to send intermittent http headers to do this. Can anyone point me in the right direction to set this up on a django site?
(Unfortunately I'm stuck with the possibility of slow reports; improving my SQL won't mitigate this.)
Don't do it online. Trigger an offline task, use a bit of Javascript to repeatedly call a view that checks if the task has finished, and redirect to the finished file when it's ready.
Instead of blocking the user and their browser for 20 minutes (which is not a good idea), do the time-consuming task in the background. When the task finishes and the result has been generated, simply notify the user so that they only need to download the finished file.
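The start-then-check pattern both answers describe can be sketched like this. It is a plain-Python illustration, not Django code: `ReportJobs` and the state names are hypothetical, and the in-process thread pool is only for demonstration. A real deployment would more likely hand the work to a separate task queue so that it survives server restarts.

```python
import threading
import uuid
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Callable, Dict


class ReportJobs:
    """Runs report builders in the background and reports their status."""

    def __init__(self, max_workers: int = 2) -> None:
        self._executor = ThreadPoolExecutor(max_workers=max_workers)
        self._jobs: Dict[str, Future] = {}
        self._lock = threading.Lock()

    def start(self, build_report: Callable[[], str]) -> str:
        """Kick off the slow work and return a job id immediately."""
        job_id = uuid.uuid4().hex
        future = self._executor.submit(build_report)
        with self._lock:
            self._jobs[job_id] = future
        return job_id

    def status(self, job_id: str) -> dict:
        """What a lightweight 'is it ready yet?' view could return as JSON."""
        with self._lock:
            future = self._jobs.get(job_id)
        if future is None:
            return {"state": "unknown"}
        if not future.done():
            return {"state": "running"}
        if future.exception() is not None:
            return {"state": "failed"}
        return {"state": "done", "result": future.result()}
```

The first view would call `start` and return the job id; the browser's JavaScript would then call a second view wrapping `status` every few seconds and redirect to the file once the state is `done`.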