Triggering Lambda on basis of multiple files


I need to run an AWS Glue job once several specific files are available in S3. On every file put event in S3, I trigger a Lambda which writes that file's metadata to DynamoDB. In DynamoDB, I also maintain a counter of how many of the required files are present.
But when multiple files are uploaded at once, multiple Lambdas are triggered and write to DynamoDB at nearly the same time, so the counter does not count accurately.
I need a better way to start a job, when specific (multiple) files are made available in s3.
Kindly suggest a better way.

DynamoDB reads are eventually consistent by default. You need to request a strongly consistent read to guarantee you are reading the same data that was written.
See this page for more information, or for a more concrete example, see the ConsistentRead flag in the GetItem docs.
It's worth noting that this will only minimise your problem. There will still be a very small window between the read and the write in which another function can read or write the same item. You should think about only allowing one function to run at a time, or some other logic to guarantee mutually exclusive access to the DB.
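As a minimal sketch of the ConsistentRead flag mentioned above, assuming a boto3 Table resource is passed in. The key name "id" and the attribute name "file_count" are hypothetical and not taken from the question:

```python
def read_counter(table, counter_id):
    """Read the counter item with a strongly consistent read.

    `table` is expected to behave like a boto3 DynamoDB Table resource,
    e.g. boto3.resource("dynamodb").Table("file-tracker") (hypothetical name).
    """
    response = table.get_item(
        Key={"id": counter_id},   # hypothetical key schema
        ConsistentRead=True,      # the default is an eventually consistent read
    )
    item = response.get("Item")
    # A missing item is treated as "no files seen yet".
    return int(item["file_count"]) if item else 0
```

This only makes the read reflect earlier successful writes; it does not stop two functions from reading the same value and both writing back value + 1.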

It sounds like you are getting the current count, incrementing it in your Lambda function, then updating DynamoDB with the new value. Instead you need to be using DynamoDB Atomic Counters, which will ensure that multiple concurrent updates will not cause the problems you are describing.
By using Atomic counters you simply send DynamoDB a request to increment your counter by 1. If your Lambda needs to check if this was the last file you were waiting on before doing other work, then you can use the return value from the update call to check what the new count is.
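A rough sketch of the atomic counter approach described above, using boto3's update_item with an ADD expression and ReturnValues="UPDATED_NEW". The key name "id", the attribute "file_count", and the start_job callback are assumptions for illustration:

```python
def record_file_and_count(table, job_id):
    """Atomically add 1 to the counter and return the new value.

    `table` is expected to behave like a boto3 DynamoDB Table resource.
    """
    response = table.update_item(
        Key={"id": job_id},                      # hypothetical key schema
        UpdateExpression="ADD file_count :one",  # the increment happens inside DynamoDB
        ExpressionAttributeValues={":one": 1},
        ReturnValues="UPDATED_NEW",              # return the value after the update
    )
    return int(response["Attributes"]["file_count"])


def handle_upload(table, job_id, expected_files, start_job):
    """Start the job only in the invocation that observes the final count."""
    new_count = record_file_and_count(table, job_id)
    if new_count == expected_files:
        start_job()  # hypothetical callback, e.g. one that starts the Glue job
        return True
    return False
```

Because each invocation gets back its own post-increment value, only one of them sees the count equal to expected_files. Note that this counts events, not distinct files, so a re-uploaded file or a duplicated event notification would be counted twice unless you guard against it.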

Not sure what you mean by "specific" (multiple) files.
If you are expecting specific file names (or "patterns"), then you could just check for all the expected files as the first instruction of your Lambda function. For example, if you expect the files A.txt, B.txt and C.txt, test whether your S3 bucket contains those 3 specific files (or 3 *.txt files, or whatever suits your requirements). If it does, keep processing; if not, return from the function. This would technically still work with concurrent invocations.
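The check described above could look roughly like this, assuming a boto3 S3 client is passed in and the expected keys are known up front. The function name and parameters are illustrative only:

```python
def all_expected_files_present(s3_client, bucket, expected_keys, prefix=""):
    """Return True if every expected key currently exists in the bucket.

    `s3_client` is expected to behave like boto3.client("s3").
    """
    paginator = s3_client.get_paginator("list_objects_v2")
    found = set()
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # "Contents" is absent from a page when nothing matches.
        for obj in page.get("Contents", []):
            found.add(obj["Key"])
    return set(expected_keys) <= found
```

A handler would call this first and return early when it is False. With concurrent uploads, more than one invocation may see the complete set, so the job start itself may still need to be made idempotent.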

Related

AWS "state file" solution for Lambda

I'm using a library in Lambda where a "state file" is persisted.
This is what it looks like in code:
def initialize
  @config = '/tmp/dogscaler.yaml'  # path of the persisted state file
  @state = self.load               # populate the state via the class's load method
end
If you need to look at the whole logic
https://github.com/cvent/dogscaler/blob/master/lib/dogscaler/state.rb#L5
My issue is that this won't work in Lambda (it being serverless). I'm trying to look for a solution where I don't have to change the logic in how the file is read and modified.
Can this be achieved with S3?
Would something like this pseudo code work?
read s3://path/to/file
write s3://path/to/file
Are there better solutions than S3?
Additional Context
The file is needed for cooldown period logic. Every time the application runs, it checks a timestamp from that file to decide whether to change an element or not. The file is less than 1 KB.

Based on the updated information you could store the data in a number of places.
S3 would be perfectly fine, but might be overkill if this is all you're using it for.
The same can be said of DynamoDB.
Parameter Store is a solid option for your use case. Bear in mind that if you are calling it often you may need to increase your TPS limit. It doesn't sound like that will be an issue for you. Also keep in mind that there is no protection here for multiple instances of your Lambda function writing to the parameter at the "same time." The last write will win. If you need to protect against that DynamoDB is probably the best option.
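As a sketch of the Parameter Store option, assuming a boto3 SSM client is passed in and the state is just a timestamp stored as a string. The parameter name and the function are hypothetical, and this does not reproduce the library's own state format:

```python
import time


def cooldown_elapsed(ssm_client, name, cooldown_seconds, now=None):
    """Return True (and record a new timestamp) if the cooldown has passed.

    `ssm_client` is expected to behave like boto3.client("ssm").
    """
    now = time.time() if now is None else now
    try:
        response = ssm_client.get_parameter(Name=name)
        last_run = float(response["Parameter"]["Value"])
    except ssm_client.exceptions.ParameterNotFound:
        last_run = None  # first run: no state stored yet

    if last_run is not None and now - last_run < cooldown_seconds:
        return False  # still cooling down, leave the stored timestamp alone

    # Last write wins: nothing here protects against concurrent writers.
    ssm_client.put_parameter(
        Name=name, Value=str(now), Type="String", Overwrite=True
    )
    return True
```

If two invocations can run at the same moment and that matters, a DynamoDB conditional write is the safer place for this state, as noted above.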

Apache beam / PubSub time delay before processing files

I need to delay processing or publishing filenames (files).
I am looking for the best option.
Currently I have two Apache Beam Dataflow jobs with PubSub in between. The first job reads filenames from the source and pushes them to a PubSub topic. The other job reads them and processes them. However, my use case is to start processing/reading the actual files at least 1 hour after they are created in the source.
So I have two options:
1) Delay publishing a message, so that it can be processed right away once it arrives
2) Delay processing of retrieved files
As mentioned above, I am looking for the best solution. I am not sure if the Guava retry mechanism should be used in Apache Beam. Any other ideas?

You could likely achieve what you want via triggering/window configuration in the publishing job.
For example, you could define a windowing configuration where the trigger does not fire until after a 1 hour delay. Something like:
    // Hold each pane until 1 hour of processing time after its first element.
    Window.<String>into(FixedWindows.of(Duration.standardMinutes(1)))
        .triggering(AfterProcessingTime.pastFirstElementInPane()
            .plusDelayOf(Duration.standardHours(1)))
        .withAllowedLateness(Duration.ZERO)
        .discardingFiredPanes()
Keep in mind that you'll end up with a job that's simply sitting doing not much of anything except holding onto state for an hour. Also, the above is based solely on processing time, so it will wait an hour after job start even if the actual creation time of the files is old enough that it could emit the results immediately.
You could refine this to an event time trigger, but you would likely need to write your own code to assign timestamps to your records (the filenames). To my knowledge, Beam does not currently have built-in support for reading the creation time of files. When reading files via TextIO, for example, I have observed that the records are all assigned a default static timestamp. You should check the specifics of the transform you're using to read filenames to see if it perhaps does something more useful for your purposes. You can also use a WithTimestamps transform to assign timestamps on your own.

S3 vs EFS propagation delay for distributed file system?

I'm working on a project that utilizes multiple Docker containers which all need to have access to the same files for comparison purposes. What's important is that if a file appears visible to one container, then there is minimal time between when it appears visible to other containers.
As an example, here's the situation I'm trying to avoid:
Let's say we have two files, A and B, and two containers, 1 and 2. File A is both uploaded to the filesystem and submitted for comparison at roughly the same time. Immediately after, the same happens to file B. Soon after File A appears visible to container 1 and file B appears visible to container 2. Due to the way the files propagated on the distributed file system, file B is not visible to container 1 and file A is not visible to container 2. Container 1 is now told to compare file A to all other files and container 2 is told to compare B to all other files. Because of the propagation delay, A and B were never compared to each other.
I'm trying to decide between EFS and S3 as the place to store all of these files. I'm wondering which would better fit my needs (or if there's a third option I'm unaware of).
The characteristics of the files/containers are:
- All files are small text files averaging 2 KB in size (although rarely they can be 10 KB)
- There's currently 20 MB of files in total, but I expect there to be 1 GB by the end of the year
- These containers are not in a swarm
- The output of each comparison is already being uploaded to S3
- Trying to make sure that every file is compared to every other file is extremely important, so the propagation delay is definitely the most important factor
(One last note: if I end up using S3, I would probably use sync to pull down all new files put into the bucket.)
Edit: To answer Kannaiyan's questions, what I'm trying to achieve is having every file compared to every other file at least once. I can't exactly say what I'm comparing, but the comparison happens by executing a closed-source Linux binary that takes in the file you want to compare and the files you want to compare it against (the distributed file system holds all the files I want to compare against). They need to be in containers for two reasons:
- The binary relies heavily upon a specific file system setup, and containerizing it ensures that the file system will always be correct (I know it's dumb, but again the binary is closed source and there's no way around it)
- The binary only runs on Linux, and containerizing it makes development easier in terms of testing on local machines
Lastly, the files only accumulate over time as we get more and more submissions. Every file is only read from and never modified after being added to the system.

In the end, I decided that the approach I was going for originally was too complicated. Instead I ended up using S3 to store all the files, with DynamoDB acting as a cache for the keys of the most recently stored files. Keys are added to the DynamoDB table only after a successful upload to S3. Whenever a comparison operation runs, the containers sync the desired S3 directory, then check DynamoDB to see if any files are missing. Due to S3's read-after-write consistency, any missing files can be pulled from S3 directly without needing to wait for propagation to all the S3 caches. This gives a distributed file system that propagates pretty much instantly.
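The "check for missing files" step could be sketched as below. This is not the poster's actual code: it assumes the recent keys have already been read from the DynamoDB table, that files are stored locally under their base names, and that a boto3 S3 client is passed in:

```python
import os


def fetch_missing_files(s3_client, bucket, recent_keys, local_dir):
    """Download any recently stored key that the local sync has not picked up.

    `s3_client` is expected to behave like boto3.client("s3");
    `recent_keys` is the list of keys read from the DynamoDB cache table.
    """
    downloaded = []
    for key in recent_keys:
        local_path = os.path.join(local_dir, os.path.basename(key))
        if not os.path.exists(local_path):
            # The key was recorded only after a successful upload,
            # so the object should be readable from S3 directly.
            s3_client.download_file(bucket, key, local_path)
            downloaded.append(key)
    return downloaded
```

Running this after the sync and before the comparison means a container compares against every file whose key was recorded at that point.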

AWS boto3 -- Difference between `batch_writer` and `batch_write_item`

I'm currently using boto3 with DynamoDB, and I noticed that there are two ways to do a batch write:
batch_writer is used in the tutorial, and it seems like you can just iterate through different JSON objects to insert them (this is just one example, of course).
batch_write_item seems to me to be a DynamoDB-specific function. However, I'm not 100% sure about this, and I'm not sure what the difference is between these two functions (performance, methodology, and so on).
Do they do the same thing? If they do, why have two different functions? If not, what's the difference? How does their performance compare?

As far as I understand and use these APIs, with the batch_write_item(), you can even handle the data for more than one table in one query. But with batch_writer(), it means you are going to specify the actions are only applicable for a certain table. I think that should be the very basic difference I can tell you.

batch_writer creates a context manager for writing objects to Amazon
DynamoDB in batch.
The batch writer will automatically handle buffering and sending items
in batches.
In addition, the batch writer will also automatically handle any
unprocessed items and resend them as needed. All you need to do is
call put_item for any items you want to add, and delete_item for any
items you want to delete.
In addition, you can specify auto_dedup if the batch might contain
duplicated requests and you want this writer to handle de-dup for you.
source
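To make the difference concrete, here is a small side-by-side sketch. It assumes `table` behaves like a boto3 DynamoDB Table resource and `client` like a boto3 DynamoDB client; the function names are made up for illustration:

```python
def write_with_batch_writer(table, items):
    """Single table: the context manager buffers, batches and retries for you."""
    with table.batch_writer() as batch:
        for item in items:
            batch.put_item(Item=item)


def write_with_batch_write_item(client, items_by_table):
    """One request that may span several tables; retries are up to the caller.

    `items_by_table` maps a table name to a list of items. With the low-level
    client, items must use DynamoDB's typed format, e.g. {"id": {"S": "a"}}.
    """
    request_items = {
        table_name: [{"PutRequest": {"Item": item}} for item in items]
        for table_name, items in items_by_table.items()
    }
    response = client.batch_write_item(RequestItems=request_items)
    # Anything DynamoDB could not process is returned and must be resent.
    return response.get("UnprocessedItems", {})
```

A single batch_write_item request is limited to 25 put or delete requests, which is part of what batch_writer hides by splitting the items into batches.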

Why should I use Simple Queue Service (SQS) over ElastiCache on AWS

SQS seems really easy to use but has some message size restrictions, e.g. a 256 KB message size limit (really, really small). On the other hand, ElastiCache seems to be more high-end? I am not sure if these assumptions are right; please correct me.
I am in the process of deploying an application on AWS that will make use of one or the other type of message passing (and/or caching) system. In what situations should I choose one over the other?

Comparing SQS to ElastiCache is somewhat like comparing... postcards to filing cabinets.
Which one is "better?"
It depends on what you want to accomplish. The two have very little overlap in their functionality, other than the fact that they both transiently store information.
A cache, like ElastiCache, is a place where information that's frequently accessed can be stored for frequent retrieval when repeatedly fetching that information from its authoritative source (often a database) is more expensive (typically in terms of resources or time) than fetching it from the cache would be. A cache is more like a filing cabinet with an open back that automatically drains old documents into the shredder any time there's a need to store new documents. This is referred to as eviction from the cache. Because of its purpose, the information stored in a cache is typically not considered durably stored. A node fails, or the data is evicted, and what you stored is not there any more.
“Also implicit is the fact that the cache can expire or evict values if they become too old or if the cache becomes full.”
— http://aws.amazon.com/blogs/aws/amazon-elasticache-distributed-in-memory-caching/
No problem, because the cache wasn't the authoritative data source... but while the data was there, you could access it very quickly. If you look something up and the cache doesn't have it, you go to the authoritative source for the data, and optionally, send a copy of it back to the cache so the next entity to ask for it may find it in the cache.
When you want something from a cache, you go ask for that thing, specifically.
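The lookup flow described above (check the cache, fall back to the authoritative source, then repopulate the cache) can be sketched as follows. DictCache is a hypothetical in-memory stand-in used so the example runs on its own; it is not an ElastiCache client:

```python
class DictCache:
    """In-memory stand-in for a cache client (hypothetical, for illustration)."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data.get(key)

    def set(self, key, value):
        self._data[key] = value


def get_with_cache(cache, key, load_from_source):
    """Ask the cache for a specific key; on a miss, go to the source."""
    value = cache.get(key)
    if value is None:                  # miss: never stored, expired or evicted
        value = load_from_source(key)  # the authoritative (slower) lookup
        cache.set(key, value)          # so the next caller may find it cached
    return value
```

The caller always names the key it wants, which is the contrast with a queue drawn in the next paragraph.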
On the other hand, a queue, like Simple Queue Service (SQS), is more like a postcard -- or, at least, that's what queue messages are like. You write the message, send it into the queue, and it pops out the other end -- once, typically. Ordering of messages isn't guaranteed (though it's common for them to arrive in order) and messages are guaranteed to be delivered "at least once" (though, again, duplicated messages are typically rare -- still, it's a massive, distributed infrastructure, so duplicated deliveries are possible).
When you want something from a queue, it sends you the next message in line -- you don't select which one.
If you need to cache information for random, quick, and repeated retrieval, and the information is disposable and recreatable, then of course ElastiCache is the choice of the two.
If you need to send messages between two parts of a system that run independently or at different speeds, then you're looking for a message queue, such as SQS. Typically, the small payload size limit is more than sufficient, because it's unnecessary to send a block of data in the queue message itself. Instead, you send a reference to the data, a pointer, an "id" from a transaction table or a URL to a web-accessible object (perhaps stored in S3) and the queue consumer can then fetch the chunk of data referenced by the queue message, and act on it.
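A minimal sketch of the "send a reference, not the data" pattern, assuming boto3 SQS and S3 clients are passed in and the payload already sits in S3. The message layout (a JSON object with "bucket" and "key") and the function names are assumptions:

```python
import json


def enqueue_reference(sqs_client, queue_url, bucket, key):
    """Producer: put a small pointer to the S3 object on the queue."""
    body = json.dumps({"bucket": bucket, "key": key})
    return sqs_client.send_message(QueueUrl=queue_url, MessageBody=body)


def process_next_message(sqs_client, s3_client, queue_url, handle_payload):
    """Consumer: take the next message, fetch the referenced data, act on it."""
    response = sqs_client.receive_message(
        QueueUrl=queue_url, MaxNumberOfMessages=1, WaitTimeSeconds=10
    )
    messages = response.get("Messages", [])
    if not messages:
        return False  # nothing in the queue right now

    message = messages[0]
    ref = json.loads(message["Body"])
    payload = s3_client.get_object(Bucket=ref["bucket"], Key=ref["key"])["Body"].read()
    handle_payload(payload)

    # Delete only after successful handling; otherwise the message reappears.
    sqs_client.delete_message(
        QueueUrl=queue_url, ReceiptHandle=message["ReceiptHandle"]
    )
    return True
```

Since delivery is at least once, handle_payload should tolerate seeing the same reference twice.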

Got the use case right, through experience. Actually deployed code on AWS about 2 years ago. Went with SQS because it was just what I needed to interface two parts of my cloud application. And indeed... 256 KB is way more than enough size ;).
