SWF Activity is not completing even though the computation has finished - amazon-web-services

I'm testing a new SWF workflow, and I've got some activity that makes a RESTful call out to another service. Problem is, I can see through logging that the actual call takes less than a second to complete, but the Activity always times out in SWF (START_TO_CLOSE of 5 mins). Being more specific, the RESTful call is a list call, and when I limit the batch size to a small number, the Activity completes and moves on very quickly. But at some seemingly arbitrary threshold, it chokes completely.
Does anyone have any insight into this? I've read that SWF calls have a size limitation of 1 MB, does anyone know how to find the size of data my workers are trying to pass SWF?

After some remote debugging, it turns out the response from the task is too big and the activity is failing silently. The failure occurs when the framework tries to report the response back to SWF, and the SDK calls RespondActivityTaskCompleted. That API has a length restriction on the internal result param:
Length Constraints: Maximum length of 32768.
This is a validation error that throws an uncaught exception and is swallowed internally until the Activity times out.

I wouldn't recommend using activity input and output parameters for passing large data sets. SWF is an orchestration technology, not the data passing one. The standard workarounds are:
Storing result in a separate store (S3 for example) and passing reference to it.
Caching result locally on a machine and route all following activities to the same host for them to have access to the cached result. See fileprocessing sample for the details of routing approach.
BTW. Have you checked out Cadence which is an open source version of SWF with much better client side libraries?

Related

Concurrency Issue not Recreatable through a JMeter Script

Development analysis had revealed that a certain issue had occurred due to two concurrent user requests being executed in the server. However, when the same was performed using a JMeter script with two threads(users), the issue did not get reproduced despite both threads being synchronized using a synchronization timer for the save method request and the listeners indicating that both response times of the threads were the same for that particular request.
What could be the potential cause of this observation? and could there be suggestions to improve the test to either disprove or prove this claim in a better way?
I can come up with the following assumptions:
The analysis result is not correct
The analysis result is correct but the issue is intermittent or Heizenbug-like
You're not sending "the same" request using JMeter, maybe the payload is incorrect or you miss a header. If it's possible to obtain a network footprint of the issue in form of i.e. .pcap or .har file you could compare it with the network footprint produced by JMeter
Concurrent user actions and simultaneous user actions and two different things. Please see the details in this article.
Following diagrams taken from the article explain the difference.
JMeter will simulate the concurrent user actions by defaults. What you really need is to simulate simultaneous user actions. You can achieve this by adding a Synchronizing Timer to your test plan.
The purpose of the SyncTimer is to block threads until X number of threads have been blocked, and then they are all released at once. A SyncTimer can thus create large instant loads at various points of the test plan.

AWS lambda execution fails only first time I run it with 'customer function error'

I trigger a lambda function via API gateway and everything works perfectly with the one exception that the very first time I trigger it on a given day it fails.
Strangely, the lambda function logs don't show any errors. I get my usual START log statement and then the request and context of the trigger, then after 5s, it ends unexpectedly.
When I look into the API gateway logs this is the error it returns:
Lambda execution failed with status 200 due to customer function error: 2018-12-10T11:00:31.208Z cc233168-fc9n-11fc-a05a-577bb4sd2b2ccc Task timed out after 5.01 seconds.
Has anyone encountered a similar problem? What is customer function error and how may I resolve this?
without knowing much of the background code you are using, i would termed this a Cold Start. Cold start happens for the first request where your function has not be called for a very long time. If you notice error message says "Time Out after 5.01 seconds. which is default set. you can increase a time out.
Alternatively, you could consider reducing the impact of cold starts by reducing the length of cold starts reference :
by authoring your Lambda functions in a language that doesn’t incur a high cold start time — i.e. Node.js, Python, or Go
choose a higher memory setting for functions on the critical path of handling user requests (i.e. anything that the user would have to wait for a response from, including intermediate APIs)
optimizing your function’s dependencies, and package size
You can also explore by putting a cron job through Cloud Watch after every specific interval to call your API through PING
Adding to Yash's answer:
I've only seen Lambda execution failed with status 200 in API Gateway execution logs, though in case it can manifest in other ways: ensure you have execution logging enabled for the endpoint. If you didn't already have it enabled you'll need to wait for the problem to manifest again.
You can verify it's a cold start problem as follows:
In the log entry with the error grab the #logStream value and the timestamp for the event; it'll be a long string of alphanumerics like a4f8115980dc83a511eeedc493a78741
Open the log group for that endpoint's execution log -> find the log stream with the identifier you just grabbed
Narrow the date/time range to a window around the time where the event occurred
If you chose a narrow window and if it's a cold start problem: I would expect the offending request to be the first one in the list. Click the There are older events to load. Load more. at the top of the list.
You should now see a gap of time between the last request received and the offending request.
In my case the error says connection reset by peer which leads me to think it's behaving as though a virtual machine were put to sleep then awoken in the sense that it believes TCP connections it previously had open are still valid.
In the short term the solution we're going with is to implement a retry strategy.
Besides the cold-start problem, there's another potential aspect of this problem: your API Gateway access log format.
Do the following:
Find the access log entries that correspond to the offending request in the execution log.
Is the HTTP status == 502?
502s in the API Gateway access log usually (always?) indicate the Lambda responded with malformed JSON.
The most obvious reason for it returning malformed JSON is a bug in your code. One of the less obvious reasons: a mistake in the access log format.
If you suspect that's the case, look for the following:
Quoted fields that shouldn't be; eg $context.error.messageString
Un-quoted fields that should be. A common idiom is to leave numeric fields un-quoted because it makes insights queries like this work: | filter #status >= 500. As convenient as that is, if the field isn't guaranteed to produce a numeric result then the JSON response will be malformed.
Trailing commas in {} bodies
Here's the documentation for many of the the context variables, though one thing to keep in mind: the context variables that are available differ between the different API Gateway endpoint types (lambda, websocket, etc).

AWS API Gateway Cache - Multiple service hits with burst of calls

I am working on a mobile app that will broadcast a push message to hundreds of thousands of devices at a time. When each user opens their app from the push message, the app will hit our API for data. The API resource will be identical for each user of this push.
Now let's assume that all 500,000 users open their app at the same time. API Gateway will get 500,000 identical calls.
Because all 500,000 nearly concurrent requests are asking for the same data, I want to cache it. But keep in mind that it takes about 2 seconds to compute the requested value.
What I want to happen
I want API Gateway to see that the data is not in the cache, let the first call through to my backend service while the other requests are held in queue, populate the cache from the first call, and then respond to the other 499,999 requests using the cached data.
What is (seems to be) happening
API Gateway, seeing that there is no cached value, is sending every one of the 500,000 requests to the backend service! So I will be recomputing the value with some complex db query way more times than resources will allow. This happens because the last call comes into API Gateway before the first call has populated the cache.
Is there any way I can get this behavior?
I know that based on my example that perhaps I could prime the cache by invoking the API call myself just before broadcasting the bulk push job, but the actual use-case is slightly more complicated than my simplified example. But rest assured, solving this simplified use-case will solve what I am trying to do.
If you anticipate that kind of burst concurrency, priming the cache yourself is certainly the best option. Have you also considered adding throttling to the stage/method to protect your backend from a large surge in traffic? Clients could be instructed to retry on throttles and they would eventually get a response.
I'll bring your feedback and proposed solution to the team and put it on our backlog.

WSO2 delivery-garantee pattern implementation: doesn't work sampling processor with more than 20 attempts

I'm quite a newbie in WSO2 so sorry for the mistakes (and for my english too ... )
I need to implement a proxy with delivery-garantee pattern and here you are my solution (I'm started from this post http://charith.wickramaarachchi.org/2012/05/another-message-redelivery-pattern-with.html):
a proxy invoke an external service giving, as input, the initial
client message
if the external service is running all works fine and
the reply is given to the client
if the external service is down or generate a SOAP fault, I'll
put the message in a store (retry store), and then, using a sampling
processor (after a time "t"), I'll try again for "n" max attempts:
at any attempt, if the external service is down or generate a SOAP
fault, I'll put the message again in the retry store, and the
process is repeated
after "n" attempts, if the external service is still out of
service, the message is stored in another store (garbage store)
All works fine when I try to test with one message, but when I try to test with more messages (> 20 but this number is variable ... ), the sampling processor hangs completely, nothing is shown in the logs. Looking in the console, sometimes (but not always ...), the processor is off, deactivate and in this case, to restore, I need to undeploy, stop and restart, and then deploy again my .car.
NOTE: I've to use the sampling processor and not the forwarding processor because this processor, after "n" attempts deactive itself and I can't use it for my goals.
I can't put here the complete code because is too long, but I can give you a sample .car that you can deploy and execute on your WSO2 installation (to simulate the external service I've used the echo service ...).
Here you are the sample car that you can download
Thank you very much in advance: all suggestions are appreciated!!!
Cesare
Message Forwarding Processor
Retrieves the messages stored in a message store and reliably forwards them to a specified endpoint. This processor attempts to send one message at a time and it does not dequeue a message from the store until it receives a response from the target endpoint. Therefore this processor is ideal for implementing in-order delivery scenarios and guaranteed delivery scenarios.
Sampling Processor
Retrieves the messages stored in a message store and injects them to a given sequence at specified intervals. This processor utilizes the Quartz scheduler framework for periodically processing messages. This can be used to implement message rate throttling scenarios.
--> You can use the forwarding processor and configure it so that it will never be deactivated, just add this parameter : <parameter name="max.delivery.attempts">-1</parameter>

The rate of control plane requests made by this account is too high

I'm using AWS Dynamo DB and it keeps giving me the following error when trying to create DB by https://www.npmjs.org/package/dynamodb:
The rate of control plane requests made by this account is too high
Does anyone know what the reason is?
Thanks
Could you share your code that is calling the create? And does this happen every time, or only sometimes? If you can get insight into whether the CreateTable API call is failing, or a DescribeTable API call is failing, that would be helpful too. If you can log the request ids of all of the requests you're making, and share them on this post, we (the DynamoDB folks) can see if we can get more details on our side.
This error may occur when you create, update, or delete many tables simultaneously (as in call the API with many operations simultaneously). This is easy to do in Node.js because of its non-blocking programming model. The error may also happen if you CreateTable and then immediately call DescribeTable simultaneously or immediately after (this typically doesn't happen though).