Calling multiple Lambdas and returning an aggregated result - C++

I need to implement the following functionality:
Receive an API request.
From the input, create X different cases.
For each case, do a calculation.
Aggregate all the results and send the result back to API Gateway.
My first guess was Step Functions, but that service has a 32 KB limit on the data transferred between steps, which does not work for me. Also, since I have about 10 steps, I assume it would be hard to implement, and also expensive, to use S3 for storing this inter-step data.
My second guess was calling multiple Lambdas from a single Lambda and waiting for all the responses. Since I use the AWS C++ SDK, this seems a bit complicated: there is almost no documentation for C++ and there are no good examples of this case.
The simplest solution for me would be to create multiple threads inside a single Lambda, but a Lambda supports only 2 cores, which also does not work for me; I need at least 50-100.
Do you have any other solution or idea, as simple as possible? Is it possible to use AWS Batch or SQS for this, or something else?
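For what it's worth, the second approach (fan-out from one Lambda, then wait for all responses) does not need one core per call: the invocations are network-bound, so 50-100 concurrent requests can be in flight from a 2-core Lambda. Below is a minimal sketch of the pattern using the AWS SDK for Go (the function name and payloads are hypothetical); the C++ SDK exposes the same Invoke operation, e.g. via the future-returning InvokeCallable on its LambdaClient.

```go
package main

import (
	"fmt"
	"sync"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/lambda"
)

func main() {
	// One client is safe for concurrent use across goroutines.
	client := lambda.New(session.Must(session.NewSession()))

	cases := [][]byte{ /* ... one JSON payload per case ... */ }
	results := make([][]byte, len(cases))
	var wg sync.WaitGroup

	for i, payload := range cases {
		wg.Add(1)
		go func(i int, payload []byte) {
			defer wg.Done()
			// Synchronous (RequestResponse) invoke; each goroutine waits
			// on its own worker Lambda, so all calls run in parallel.
			out, err := client.Invoke(&lambda.InvokeInput{
				FunctionName: aws.String("worker-function"), // hypothetical name
				Payload:      payload,
			})
			if err != nil {
				fmt.Println("invoke failed:", err)
				return
			}
			results[i] = out.Payload
		}(i, payload)
	}

	wg.Wait()
	// Aggregate `results` and send the combined response back here.
}
```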

Related

What is the best way to know when all concurrently running AWS Lambdas have ended?

I am running multiple Lambdas in parallel based on an S3 trigger. I want to get the end time, to send back to the user, once all the Lambdas have finished executing.
If you need coordination between multiple Lambdas, and something to happen when all of them are complete, your best bet is to use a Parallel state in Step Functions to run them in parallel, with an additional Lambda as the task that follows the Parallel one. This is pretty much the standard use case for Step Functions/state machines: maintaining "state" between Lambdas, including (and beyond) knowing when the others are complete.
This also gives you a single entry point for your process, as opposed to trying to replicate the data yourself to multiple Lambdas.
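As a rough Amazon States Language sketch of that shape (the function ARNs are placeholders): the Parallel state's output is an array of the branch results, which the follow-on Lambda receives as its input.

```json
{
  "StartAt": "RunInParallel",
  "States": {
    "RunInParallel": {
      "Type": "Parallel",
      "Branches": [
        {
          "StartAt": "TaskA",
          "States": {
            "TaskA": { "Type": "Task", "Resource": "arn:aws:lambda:us-east-1:123456789012:function:task-a", "End": true }
          }
        },
        {
          "StartAt": "TaskB",
          "States": {
            "TaskB": { "Type": "Task", "Resource": "arn:aws:lambda:us-east-1:123456789012:function:task-b", "End": true }
          }
        }
      ],
      "Next": "Aggregate"
    },
    "Aggregate": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123456789012:function:aggregate",
      "End": true
    }
  }
}
```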

Highly concurrent AWS Express Step Functions

I have a system that receives records from a Kinesis stream. Lambda consumes the stream and invokes one function per shard; that function takes a batch of records and invokes an asynchronous Express Step Function to process each record. The Step Function contains a Task that relies on a third party. I have a timeout set for this task, but when the task takes longer, a high number of Step Functions can still be executing concurrently, because they are not completing quickly enough; this causes throttling of Lambda executions further down the line.
To mitigate the issue, I am thinking of implementing a "semaphore" for concurrent Express executions. There isn't much out there describing a similar approach. I found this article, but its method of checking how many executions are active at a time would only work with Standard Step Functions. If it worked with Express, I can imagine throwing an error in the function that receives the Kinesis records whenever an arbitrary Step Function execution limit is exceeded, causing Kinesis+Lambda to retry until capacity is available. But since I am using Express workflows, calling ListExecutions is not really an option.
Is there an existing solution for limiting the number of parallel asynchronous Express Step Function executions, or do you see how I could otherwise implement the "semaphore" approach?
Have you considered triggering one step function per Lambda invoke and using a Map state to handle the multiple records per batch? The Map state allows you to limit the number of concurrent iterations (sketched below). This doesn't address multiple executions of the step function itself, and it could lead to timeout issues if you are pushing against the five-minute limit for Express workflows.
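A minimal sketch of that Map-state idea (the ARN, items path, and limit are placeholders); MaxConcurrency caps how many records from the batch are processed at once:

```json
{
  "StartAt": "ProcessBatch",
  "States": {
    "ProcessBatch": {
      "Type": "Map",
      "ItemsPath": "$.records",
      "MaxConcurrency": 5,
      "Iterator": {
        "StartAt": "CallThirdParty",
        "States": {
          "CallThirdParty": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:call-third-party",
            "TimeoutSeconds": 30,
            "End": true
          }
        }
      },
      "End": true
    }
  }
}
```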
I think that if you find you need to throttle something across partitions, you are going to be in a world of complex solutions. One could imagine a two-phase-commit system for tracking concurrent executions and handling timeouts, but such solutions are often more complicated than they are worth.
Perhaps the solution is to make adjustments downstream to reduce the concurrency there? If other Lambdas end up being invoked too many times at once, you can put SQS in front of them, enable batching, and manage throttling there. In general you should use something like SQS to trigger Lambdas at the point where high concurrency is a problem, and less so at the points that feed into it. In other words, if your current step functions can handle the high concurrency, let them; anything that has issues as a result should be managed at that point.

Easiest way to synchronize a Document across multiple Collections in Firestore?

Suppose I have two top-level collections, users and companies. The companies collection contains a subcollection of users called employees. What is the simplest way to ensure that user records in the users and companies/employees paths stay synchronized? Is it more common to use batch operations or a trigger function?
If your document writes are coming directly from your client app, you can use security rules to make sure that all documents have the same values as part of a batch write. If you write the rules correctly, it will force the client to make appropriate batch writes at all required locations, assuming that you have a well-defined document structure.
You can see a similar example of this technique in this other question that ensures that clients increment and decrement a document counter with each create and delete. Your rules will obviously be more complex.
Since security rules only apply to client code, there are no similar techniques for backend code. If you're writing code on the backend, you just have to make sure your batch-write code is correct.
I see no need to trigger a Cloud Function if you're able to do a batch write, as the batch takes effect atomically and immediately, while the function adds some latency and can even incur a race condition, since you don't have a guaranteed order of execution.
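As a minimal sketch of that backend batch write, using the Firestore Go client (the project ID, document IDs, and field values here are hypothetical):

```go
package main

import (
	"context"
	"log"

	"cloud.google.com/go/firestore"
)

func main() {
	ctx := context.Background()
	client, err := firestore.NewClient(ctx, "my-project") // hypothetical project ID
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	user := map[string]interface{}{"name": "Ada", "email": "ada@example.com"}

	// Both writes commit atomically: either both documents are
	// updated or neither is, keeping the two paths in sync.
	batch := client.Batch()
	batch.Set(client.Collection("users").Doc("uid123"), user)
	batch.Set(client.Collection("companies").Doc("acme").
		Collection("employees").Doc("uid123"), user)

	if _, err := batch.Commit(ctx); err != nil {
		log.Fatal(err)
	}
}
```

Because both Set operations are in one batch, a reader will never observe one path updated without the other.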

Best way to parallelize AWS Lambda

I have a large file being uploaded to S3, and for each line in the file I need to make a long-running REST API call. I'm trying to figure out the best way to break up the work. My current idea for the flow is:
Lambda (break up file by line) -> SNS (notification per line) -> Lambda (separate per line/notification)
This seems like a common use case, but I can't find many references to it; am I missing something? Is there a better option to break up my work and get it done in a reasonable amount of time?
The "best" way is going to be subjective. The method you are currently using, Lambda->SNS->Lambda, is one possible approach. As JohnAllen pointed out, you could simply do Lambda->Lambda.
Your scenario reminds me of this project, which has a single Lambda function adding items to a Kinesis stream, which then triggers many parallel Lambda functions.
I think Lambda->Kinesis->Lambda might be a better fit for your use case than Lambda->SNS->Lambda if you are generating a very large number of Lambda tasks. I would be worried that the SNS implementation would run up against the maximum number of concurrent Lambda functions, while the Kinesis implementation would queue the work up and handle it gracefully.
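A minimal sketch of the producer side of that first Lambda, using the AWS SDK for Go (the stream name and file handling are placeholders; in a real Lambda the lines would come from the S3 object that triggered it):

```go
package main

import (
	"bufio"
	"fmt"
	"os"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/kinesis"
)

func main() {
	client := kinesis.New(session.Must(session.NewSession()))

	f, err := os.Open("input.txt") // stand-in for the uploaded S3 object
	if err != nil {
		panic(err)
	}
	defer f.Close()

	scanner := bufio.NewScanner(f)
	lineNo := 0
	for scanner.Scan() {
		lineNo++
		// One record per line; the partition key spreads lines across
		// shards, and each shard fans out to downstream Lambda workers.
		_, err := client.PutRecord(&kinesis.PutRecordInput{
			StreamName:   aws.String("lines-stream"), // hypothetical stream name
			PartitionKey: aws.String(fmt.Sprintf("line-%d", lineNo)),
			Data:         []byte(scanner.Text()),
		})
		if err != nil {
			fmt.Println("put failed:", err)
		}
	}
}
```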

Is there a distributed data processing pipeline framework, or a good way to organize one?

I am designing an application that requires a distributed set of processing workers that asynchronously consume and produce data in a specific flow. For example:
Component A fetches pages.
Component B analyzes pages from A.
Component C stores analyzed bits and pieces from B.
There are obviously more than just three components involved.
Further requirements:
Each component needs to be a separate process (or set of processes).
Producers don't know anything about their consumers. In other words, component A just produces data, not knowing which components consume that data.
This is a kind of data flow solved by topology-oriented systems like Storm. While Storm looks good, I'm skeptical; it's a Java system and it's based on Thrift, neither of which I am a fan of.
I am currently leaning towards a pub/sub-style approach which uses AMQP as the data transport, with HTTP as the protocol for data sharing/storage. This means the AMQP queue model becomes a public API — in other words, a consumer needs to know which AMQP host and queue that the producer uses — which I'm not particularly happy about, but it might be worth the compromise.
Another issue with the AMQP approach is that each component will have to have very similar logic for:
Connecting to the queue
Handling connection errors
Serializing/deserializing data into a common format
Running the actual workers (goroutines or forking subprocesses)
Dynamic scaling of workers
Fault tolerance
Node registration
Processing metrics
Queue throttling
Queue prioritization (some workers are less important than others)
…and many other little details that each component will need.
Even if a consumer is logically very simple (think MapReduce jobs, something like splitting text into tokens), there is a lot of boilerplate. Certainly I can do all this myself — I am very familiar with AMQP and queues and everything else — and wrap all this up in a common package shared by all the components, but then I am already on my way to inventing a framework.
Does a good framework exist for this kind of stuff?
Note that I am asking specifically about Go. I want to avoid Hadoop and the whole Java stack.
Edit: Added some points for clarity.
Because Go has CSP channels, I suggest that Go provides a special opportunity to implement a framework for parallelism that is simple, concise, and yet completely general. It should be possible to do rather better than most existing frameworks with rather less code. Java and the JVM can have nothing like this.
It requires just the implementation of channels using configurable TCP transports. This would consist of
a writing channel-end API, including some general specification of the intended server for the reading end
a reading channel-end API, including listening port configuration and support for select
marshalling/unmarshalling glue to transfer data - probably encoding/gob
An acceptance test for such a framework would be that a program using channels can be split across multiple processors and yet retain the same functional behaviour (even if the performance differs).
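As a minimal sketch of those two channel-end APIs over TCP with encoding/gob (the function names are invented for illustration; reconnection, select support, and server configuration are omitted):

```go
package main

import (
	"encoding/gob"
	"fmt"
	"net"
)

// SendChannel drains ch and gob-encodes each value to the remote reading end.
func SendChannel(addr string, ch <-chan string) error {
	conn, err := net.Dial("tcp", addr)
	if err != nil {
		return err
	}
	defer conn.Close()
	enc := gob.NewEncoder(conn)
	for v := range ch {
		if err := enc.Encode(v); err != nil {
			return err
		}
	}
	return nil
}

// RecvChannel listens on addr and feeds decoded values into a local channel.
func RecvChannel(addr string) (<-chan string, error) {
	ln, err := net.Listen("tcp", addr)
	if err != nil {
		return nil, err
	}
	ch := make(chan string)
	go func() {
		defer close(ch)
		conn, err := ln.Accept()
		if err != nil {
			return
		}
		defer conn.Close()
		dec := gob.NewDecoder(conn)
		for {
			var v string
			if err := dec.Decode(&v); err != nil {
				return // EOF: the writing end closed its channel
			}
			ch <- v
		}
	}()
	return ch, nil
}

func main() {
	out, _ := RecvChannel("localhost:9000")
	in := make(chan string)
	go SendChannel("localhost:9000", in)
	in <- "hello"
	close(in)
	for v := range out {
		fmt.Println(v) // the program behaves as if both ends shared one channel
	}
}
```

gob works well here because both ends are Go; swapping in another encoding would only touch the Encoder/Decoder lines.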
There are quite a few existing transport-layer networking projects in Go. Notable is ZeroMQ (0MQ) (gozmq, zmq2, zmq3).
I guess you are looking for a message queue, like beanstalkd, RabbitMQ, or ØMQ (pronounced zero-MQ). The essence of all of these tools is that they provide push/receive methods for FIFO (or non-FIFO) queues and some even have pub/sub.
So, one component puts data in a queue and another one reads. This approach is very flexible in adding or removing components and in scaling each of them up or down.
Most of these tools already have Go libraries (ØMQ is very popular among Gophers) as well as libraries for other languages, so your overhead code is very small. Just import a library and start receiving and pushing messages.
To decrease this overhead further, and to avoid depending on a particular API, you can write a thin package of your own that uses one of these message-queue systems to provide very simple push/receive calls, and use that package in all of your tools; a sketch follows.
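Such a facade can be as small as one interface plus a per-broker implementation. A sketch with an in-memory stand-in (a broker-backed version would satisfy the same hypothetical interface):

```go
package main

import (
	"fmt"
	"sync"
)

// Queue is the only surface components depend on; the broker behind it
// (AMQP, beanstalkd, ØMQ, ...) stays an implementation detail.
type Queue interface {
	Push(topic string, payload []byte) error
	Receive(topic string) (<-chan []byte, error)
}

// MemQueue is an in-process stand-in used here for illustration.
type MemQueue struct {
	mu     sync.Mutex
	topics map[string]chan []byte
}

func NewMemQueue() *MemQueue {
	return &MemQueue{topics: make(map[string]chan []byte)}
}

// chanFor lazily creates one buffered channel per topic.
func (q *MemQueue) chanFor(topic string) chan []byte {
	q.mu.Lock()
	defer q.mu.Unlock()
	ch, ok := q.topics[topic]
	if !ok {
		ch = make(chan []byte, 1024)
		q.topics[topic] = ch
	}
	return ch
}

func (q *MemQueue) Push(topic string, payload []byte) error {
	q.chanFor(topic) <- payload
	return nil
}

func (q *MemQueue) Receive(topic string) (<-chan []byte, error) {
	return q.chanFor(topic), nil
}

func main() {
	var q Queue = NewMemQueue()
	q.Push("pages", []byte("http://example.com"))
	msgs, _ := q.Receive("pages")
	fmt.Printf("%s\n", <-msgs)
}
```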
I understand that you want to avoid Hadoop+Java, but instead of spending time developing your own framework, you may want to have a look at Cascading. It provides a layer of abstraction over underlying MapReduce jobs.
Best summarized on Wikipedia: "It [Cascading] follows a 'source-pipe-sink' paradigm, where data is captured from sources, follows reusable 'pipes' that perform data analysis processes, and the results are stored in output files or 'sinks'. Pipes are created independent of the data they will process. Once tied to data sources and sinks, it is called a 'flow'. These flows can be grouped into a 'cascade', and the process scheduler will ensure a given flow does not execute until all its dependencies are satisfied. Pipes and flows can be reused and reordered to support different business needs."
You may also want to have a look at some of their examples, Log Parser, Log Analysis, TF-IDF (especially this flow diagram).