I am trying to figure out whether there is a way to prioritize which data gets fetched first from a Kinesis stream, regardless of when it was put into the stream. For example, I am writing two types of data, A and B, into the stream; I want data of type A to always be read ahead of type B, even if type B was written into the stream first. (The stream data feeds an AWS Lambda function for processing.)
I couldn't find any information about priorities on the AWS website. The only solution I could think of was to set up two streams and vary the read limits on them.
If there is anything like priority queue/stream in Kinesis, please give me some info on it.
Thank you,
Jaydeep
No, there's no way to do that with Kinesis. You could set the PartitionKey or ExplicitHashKey so that certain messages go to specific shards, and then process some shards faster or more urgently. That's not quite the same thing, though.
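If you do go the partition-key route, here is a rough boto3 sketch (the stream name and partition keys are made up). It only pins each record type to its own shard so the "A" shard can be consumed more aggressively; it does not reorder anything within a shard:

import json
import boto3

kinesis = boto3.client("kinesis")

def put_record(record, record_type):
    # "priority-a" / "priority-b" are hypothetical partition keys; records with the
    # same partition key always land on the same shard, so the shard(s) holding
    # type-A records can be polled and processed more urgently than the rest.
    kinesis.put_record(
        StreamName="my-stream",  # hypothetical stream name
        Data=json.dumps(record).encode("utf-8"),
        PartitionKey="priority-a" if record_type == "A" else "priority-b",
    )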
SQS doesn't support priority queues either, unfortunately, but that SO thread has a few suggestions for alternatives.
I'm using a library in Lambda that persists a "state file".
This is what it looks like in code:
def initialize
  @config = '/tmp/dogscaler.yaml'
  @state = self.load
end
If you need to look at the whole logic
https://github.com/cvent/dogscaler/blob/master/lib/dogscaler/state.rb#L5
My issue is that this won't work in Lambda (it being serverless). I'm looking for a solution where I don't have to change the logic of how the file is read and modified.
Can this be achieved with S3?
Would something like this pseudo code work?
read s3://path/to/file
write s3://path/to/file
Are there better solutions than S3?
Additional Context
The file is needed for cooldown-period logic. Every time the application runs, it checks a timestamp in that file to decide whether or not to change an element. The file is less than 1 KB.
Based on the updated information you could store the data in a number of places.
S3 would be perfectly fine, but might be overkill if this is all you're using it for.
The same can be said of DynamoDB.
Parameter Store is a solid option for your use case. Bear in mind that if you are calling it often you may need to increase your TPS limit. It doesn't sound like that will be an issue for you. Also keep in mind that there is no protection here for multiple instances of your Lambda function writing to the parameter at the "same time." The last write will win. If you need to protect against that DynamoDB is probably the best option.
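As a rough boto3 sketch of the Parameter Store option (the parameter name is made up, and the last-write-wins caveat above still applies):

import boto3

ssm = boto3.client("ssm")
PARAM_NAME = "/dogscaler/last-run-timestamp"  # hypothetical parameter name

def load_state():
    return ssm.get_parameter(Name=PARAM_NAME)["Parameter"]["Value"]

def save_state(timestamp_str):
    # Overwrite=True replaces the previous value; the last writer wins.
    ssm.put_parameter(Name=PARAM_NAME, Value=timestamp_str, Type="String", Overwrite=True)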
I'm a bit confused: I need to run an AWS Glue job when multiple specific files are available in S3. On every file put event in S3, I trigger a Lambda function that writes that file's metadata to DynamoDB. In DynamoDB I also maintain a counter that counts how many of the required files are present.
But when multiple files are uploaded at once, multiple Lambdas are triggered and they write to DynamoDB at nearly the same time, which affects the counter; as a result, the counter is not accurate.
I need a better way to start a job when the specific (multiple) files are available in S3.
Kindly suggest one.
Dynamo is eventually consistent by default. You need to request a strongly consistent read to guarantee you are reading the same data that was written.
See this page for more information, or for a more concrete example, see the ConsistentRead flag in the GetItem docs.
It's worth noting that these will only minimise your problem. There will also be a very small window between read/writes where network lag causes one function to read/write while another is doing so too. You should think about only allowing one function to run at a time, or some other logic to guarantee mutually exclusive access to the DB.
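For example, a strongly consistent read with boto3 might look like this (the table name and key are made up):

import boto3

table = boto3.resource("dynamodb").Table("file-counter")  # hypothetical table name

# ConsistentRead=True guarantees the read reflects all writes acknowledged
# before the read started, at the cost of extra read capacity.
resp = table.get_item(
    Key={"pk": "required-files"},  # hypothetical key
    ConsistentRead=True,
)
item = resp.get("Item")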
It sounds like you are getting the current count, incrementing it in your Lambda function, then updating DynamoDB with the new value. Instead you need to be using DynamoDB Atomic Counters, which will ensure that multiple concurrent updates will not cause the problems you are describing.
By using Atomic counters you simply send DynamoDB a request to increment your counter by 1. If your Lambda needs to check if this was the last file you were waiting on before doing other work, then you can use the return value from the update call to check what the new count is.
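A minimal boto3 sketch of that atomic-counter update (the table, key, and attribute names are made up):

import boto3

table = boto3.resource("dynamodb").Table("file-counter")  # hypothetical table name

# DynamoDB applies the ADD server-side, so concurrent Lambda invocations
# cannot overwrite each other's increments.
resp = table.update_item(
    Key={"pk": "required-files"},  # hypothetical key
    UpdateExpression="ADD file_count :inc",
    ExpressionAttributeValues={":inc": 1},
    ReturnValues="UPDATED_NEW",
)
new_count = int(resp["Attributes"]["file_count"])
if new_count == 3:  # e.g. all three expected files have now arrived
    pass  # kick off the Glue job here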
Not sure what you mean by "specific" (multiple) files.
If you are expecting specific file names (or "patterns"), then you could simply check for all the expected files as the first step of your Lambda function. E.g. if you expect the files A.txt, B.txt, and C.txt, test whether your S3 bucket contains those 3 specific files (or 3 *.txt files, or whatever suits your requirements). If they're all there, keep processing; if not, return from the function. This would technically work even with concurrent invocations.
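A quick boto3 sketch of that check (the bucket and file names are made up):

import boto3

s3 = boto3.client("s3")
BUCKET = "my-input-bucket"              # hypothetical bucket name
EXPECTED = {"A.txt", "B.txt", "C.txt"}  # the specific files you are waiting for

def handler(event, context):
    listed = s3.list_objects_v2(Bucket=BUCKET)
    present = {obj["Key"] for obj in listed.get("Contents", [])}
    if not EXPECTED.issubset(present):
        return  # not everything is there yet; a later upload will trigger us again
    # ... all expected files are present: start the Glue job ...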
I need to delay processing or publishing filenames (files).
I am looking for the best option.
Currently I have two Apache Beam Dataflow pipelines with PubSub in between. The first pipeline reads filenames from the source and pushes them to a PubSub topic. The other pipeline reads them and processes them. However, my use case is to start processing/reading the actual files at least 1 hour after they are created in the source.
So I have two options:
1) Delay publishing a message, so that it can be processed as soon as it arrives, but at the right/expected moment
2) Delay processing of retrieved files
As mentioned above, I am looking for the best solution. I am not sure whether the Guava retry mechanism should be used in Apache Beam? Any other ideas?
You could likely achieve what you want via triggering/window configuration in the publishing job.
Then, you could define a windowing configuration where the trigger does not fire until after a 1 hour delay. Something like:
Window.<String>into(FixedWindows.of(Duration.standardMinutes(1)))
    .triggering(AfterProcessingTime.pastFirstElementInPane()
        .plusDelayOf(Duration.standardHours(1)))
    .withAllowedLateness(Duration.ZERO)  // required whenever a non-default trigger is set
    .discardingFiredPanes()
Keep in mind that you'll end up with a job that mostly just sits there holding onto state for an hour. Also, the above is based solely on processing time, so it will wait an hour after the filenames arrive even if the actual creation time of the files is old enough that the results could be emitted immediately.
You could refine this to an event time trigger, but you would likely need to write your own code to assign timestamps to your records (the filenames). To my knowledge, Beam does not currently have built-in support for reading the creation time of files. When reading files via TextIO, for example, I have observed that the records are all assigned a default static timestamp. You should check the specifics of the transform you're using to read filenames to see if it perhaps does something more useful for your purposes. You can also use a WithTimestamps transform to assign timestamps on your own.
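As a sketch only (the snippet above uses the Java SDK; this is roughly what assigning your own event-time timestamps looks like in the Python SDK, where filenames stands in for the PCollection of names and get_creation_time is a function you would have to write yourself, returning seconds since the epoch):

import apache_beam as beam
from apache_beam.transforms.window import TimestampedValue

# Give each filename an event-time timestamp equal to the file's creation time,
# so an event-time trigger can fire one hour after creation rather than one
# hour after the element happens to arrive.
with_event_time = (
    filenames
    | "AssignTimestamps" >> beam.Map(
        lambda name: TimestampedValue(name, get_creation_time(name)))
)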
I'm currently using boto3 with DynamoDB, and I noticed that there are two types of batch write:
batch_writer is used in the tutorial, and it seems like you can just iterate through different JSON objects to do inserts (this is just one example, of course).
batch_write_item seems to me to be a DynamoDB-specific function. However, I'm not 100% sure about this, and I'm not sure what the difference between these two functions is (performance, methodology, and so on).
Do they do the same thing? If they do, why are there two different functions? If not, what's the difference? How do they compare in performance?
As far as I understand and use these APIs, batch_write_item() lets you handle data for more than one table in a single call, whereas batch_writer() is created from a specific table, so its actions only apply to that table. I think that's the most basic difference between them.
batch_writer creates a context manager for writing objects to Amazon DynamoDB in batch.
The batch writer will automatically handle buffering and sending items in batches. In addition, the batch writer will also automatically handle any unprocessed items and resend them as needed. All you need to do is call put_item for any items you want to add, and delete_item for any items you want to delete.
In addition, you can specify auto_dedup if the batch might contain duplicated requests and you want this writer to handle de-dup for you.
source
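To make the difference concrete, here is a rough boto3 sketch of both (the table and item names are made up):

import boto3

table = boto3.resource("dynamodb").Table("my-table")  # hypothetical table name

# batch_writer: tied to one table; it buffers put/delete requests, sends them
# in chunks of up to 25, and retries unprocessed items for you.
with table.batch_writer() as batch:
    for item in items:  # 'items' stands in for your iterable of dicts
        batch.put_item(Item=item)

# batch_write_item: the low-level client call; a single request can target
# several tables, but you build the batches (max 25 items) and retry any
# UnprocessedItems yourself, using the low-level attribute-value format.
client = boto3.client("dynamodb")
client.batch_write_item(
    RequestItems={
        "my-table": [
            {"PutRequest": {"Item": {"pk": {"S": "example-id"}}}},
        ],
        # "my-other-table": [...],
    }
)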
SQS seems really easy to use but has some message size restrictions, e.g. a 256 KB maximum message size (really, really small). On the other hand, ElastiCache seems to be more high-end? I am not sure if these assumptions are right - please correct me.
I am in the process of deploying an application on AWS by making use of one or the other type of message-passing (and/or caching) system. In what situations would I choose one over the other?
Comparing SQS to ElastiCache is somewhat like comparing... postcards to filing cabinets.
Which one is "better?"
It depends on what you want to accomplish; the two have very little overlap in their functionality other than the fact that they both transiently store information.
A cache, like ElastiCache, is a place where information that's frequently accessed can be stored for frequent retrieval when repeatedly fetching that information from its authoritative source (often a database) is more expensive (typically in terms of resources or time) than fetching it from the cache would be. A cache is more like a filing cabinet with an open back that automatically drains old documents into the shredder any time there's a need to store new documents. This is referred to as eviction from the cache. Because of its purpose, the information stored in a cache is typically not considered durably stored. A node fails, or the data is evicted, and what you stored is not there any more.
“Also implicit is the fact that the cache can expire or evict values if they become too old or if the cache becomes full.”
— http://aws.amazon.com/blogs/aws/amazon-elasticache-distributed-in-memory-caching/
No problem, because the cache wasn't the authoritative data source... but while the data was there, you could access it very quickly. If you look something up and the cache doesn't have it, you go to the authoritative source for the data, and optionally, send a copy of it back to the cache so the next entity to ask for it may find it in the cache.
When you want something from a cache, you go ask for that thing, specifically.
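A minimal sketch of that cache-aside pattern, assuming an ElastiCache for Redis endpoint (the hostname here is a placeholder) and a hypothetical fetch_from_database function standing in for the authoritative source:

import redis

cache = redis.Redis(host="my-cluster.xxxxxx.cache.amazonaws.com", port=6379)  # hypothetical endpoint

def get_value(key):
    value = cache.get(key)
    if value is not None:
        return value                  # cache hit: fast and cheap
    value = fetch_from_database(key)  # cache miss: go to the authoritative source
    cache.setex(key, 300, value)      # optionally copy it back, with a 5-minute TTL
    return value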
On the other hand, a queue, like Simple Queue Service (SQS), is more like a postcard -- or, at least, that's what queue messages are like. You write the message, send it into the queue, and it pops out the other end -- once, typically. Ordering of messages isn't guaranteed (though it's common for them to arrive in order) and messages are guaranteed to be delivered "at least once" (though, again, duplicated messages are typically rare -- still, it's a massive, distributed infrastructure, so duplicated deliveries are possible).
When you want something from a queue, it sends you the next message in line -- you don't select which one.
If you need to cache information for random, quick, and repeated retrieval, and the information is disposable and recreatable, then of course ElastiCache is the choice of the two.
If you need to send messages between two parts of a system that run independently or at different speeds, then you're looking for a message queue, such as SQS. Typically, the small payload size limit is more than sufficient, because it's unnecessary to send a block of data in the queue message itself. Instead, you send a reference to the data, a pointer, an "id" from a transaction table or a URL to a web-accessible object (perhaps stored in S3) and the queue consumer can then fetch the chunk of data referenced by the queue message, and act on it.
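For example, the pointer-in-the-message pattern with boto3 might look like this (the queue URL, bucket, and key are made up):

import json
import boto3

sqs = boto3.client("sqs")
s3 = boto3.client("s3")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/my-queue"  # hypothetical

# Producer: send only a reference to the data, not the data itself.
sqs.send_message(
    QueueUrl=QUEUE_URL,
    MessageBody=json.dumps({"bucket": "my-bucket", "key": "payloads/12345.json"}),
)

# Consumer: receive the reference, then fetch the actual payload from S3.
resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=1)
for msg in resp.get("Messages", []):
    ref = json.loads(msg["Body"])
    payload = s3.get_object(Bucket=ref["bucket"], Key=ref["key"])["Body"].read()
    # ... act on the payload ...
    sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])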
Got the use case right through experience: I actually deployed code on AWS about 2 years ago and went with SQS because it was just what I needed to interface two parts of my cloud application. And indeed... 256 KB is way more than enough ;).