SQS seems really easy to use but has some message size restrictions, e.g. a 256 KB message size limit (really, really small). On the other hand, ElastiCache seems to be more high-end? I am not sure if these assumptions are right - please correct me.
I am in the process of deploying an application on AWS that will use one or the other type of message passing (and/or caching) system. In what situations would I choose one over the other?
Comparing SQS to ElastiCache is somewhat like comparing... postcards to filing cabinets.
Which one is "better?"
It depends on what you want to accomplish. The two have very little overlap in their functionality, other than the fact that they both transiently store information.
A cache, like ElastiCache, is a place where information that's frequently accessed can be stored for frequent retrieval when repeatedly fetching that information from its authoritative source (often a database) is more expensive (typically in terms of resources or time) than fetching it from the cache would be. A cache is more like a filing cabinet with an open back that automatically drains old documents into the shredder any time there's a need to store new documents. This is referred to as eviction from the cache. Because of its purpose, the information stored in a cache is typically not considered durably stored. A node fails, or the data is evicted, and what you stored is not there any more.
“Also implicit is the fact that the cache can expire or evict values if they become too old or if the cache becomes full.”
— http://aws.amazon.com/blogs/aws/amazon-elasticache-distributed-in-memory-caching/
No problem, because the cache wasn't the authoritative data source... but while the data was there, you could access it very quickly. If you look something up and the cache doesn't have it, you go to the authoritative source for the data, and optionally, send a copy of it back to the cache so the next entity to ask for it may find it in the cache.
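The lookup flow just described (check the cache, fall back to the authoritative source, then repopulate the cache) can be sketched roughly as follows. This is a minimal in-memory illustration, not ElastiCache client code: TinyCache, fetch and load_from_source are hypothetical names, and a real deployment would talk to a Memcached or Redis node instead of a Python dict.

```python
import time


class TinyCache:
    """In-memory stand-in for a cache node; entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds=60.0):
        self.ttl = ttl_seconds
        self._data = {}  # key -> (value, expiry timestamp)

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if time.monotonic() >= expires_at:
            del self._data[key]  # expired: behave as if it was never stored
            return None
        return value

    def set(self, key, value):
        self._data[key] = (value, time.monotonic() + self.ttl)


def fetch(key, cache, load_from_source):
    """Try the cache first; on a miss, go to the authoritative source."""
    value = cache.get(key)
    if value is None:
        value = load_from_source(key)  # the slow, authoritative lookup
        cache.set(key, value)          # optional: repopulate for the next caller
    return value
```

Losing an entry only costs one extra trip to the source, which is why it does not matter that the cache is not durable. (This sketch treats a stored None as a miss, which is a simplification.)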
When you want something from a cache, you go ask for that thing, specifically.
On the other hand, a queue, like Simple Queue Service (SQS), is more like a postcard -- or, at least, that's what queue messages are like. You write the message, send it into the queue, and it pops out the other end -- once, typically. Ordering of messages isn't guaranteed (though it's common for them to arrive in order) and messages are guaranteed to be delivered "at least once" (though, again, duplicated messages are typically rare -- still, it's a massive, distributed infrastructure, so duplicated deliveries are possible).
When you want something from a queue, it sends you the next message in line -- you don't select which one.
If you need to cache information for random, quick, and repeated retrieval, and the information is disposable and recreatable, then of course ElastiCache is the choice of the two.
If you need to send messages between two parts of a system that run independently or at different speeds, then you're looking for a message queue, such as SQS. Typically, the small payload size limit is more than sufficient, because it's unnecessary to send a block of data in the queue message itself. Instead, you send a reference to the data, a pointer, an "id" from a transaction table or a URL to a web-accessible object (perhaps stored in S3) and the queue consumer can then fetch the chunk of data referenced by the queue message, and act on it.
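As a rough sketch of the "send a reference, not the data" idea: the message body below carries only an S3 bucket and key, and the consumer fetches the real payload itself. The function names, the JSON field names and the fetch_object callback are assumptions made for illustration; send_message with QueueUrl and MessageBody is the boto3 SQS client call, and the client is passed in by the caller.

```python
import json

SQS_MAX_BYTES = 256 * 1024  # the payload limit mentioned in the question


def build_pointer_message(bucket, key, job_id):
    """Return a small JSON body that references the payload instead of carrying it."""
    body = json.dumps({"job_id": job_id, "s3_bucket": bucket, "s3_key": key})
    if len(body.encode("utf-8")) > SQS_MAX_BYTES:
        raise ValueError("message body exceeds the queue's size limit")
    return body


def enqueue(sqs_client, queue_url, body):
    """Send the body using a boto3 SQS client supplied by the caller."""
    return sqs_client.send_message(QueueUrl=queue_url, MessageBody=body)


def handle(body, fetch_object):
    """Consumer side: decode the reference, then fetch the real data."""
    ref = json.loads(body)
    return fetch_object(ref["s3_bucket"], ref["s3_key"])
```

The message stays well under a hundred bytes no matter how large the referenced object is.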
Got the use case right, through experience. Actually deployed code on AWS about 2 years ago. Went with SQS because it was just what I needed to interface two parts of my cloud application. And indeed... 256 KB is way more than enough size ;).
I'm a bit confused, as I need to run an AWS glue job, when multiple specific files are available in s3. On every file put event in s3, I am triggering a lambda which writes that file metadata to dynamodb. Here in dynamodb, I am also maintaining a counter which counts the number of required files present.
But when multiple files are uploaded at once, which triggers multiple lambdas, they write at nearly the same time in dynamodb, which impacts the counter; hence the counter is not able to count accurately.
I need a better way to start a job, when specific (multiple) files are made available in s3.
Kindly suggest a better way.
DynamoDB reads are eventually consistent by default. You need to request a strongly consistent read to guarantee that you are reading the most recently written data.
See this page for more information, or for a more concrete example, see the ConsistentRead flag in the GetItem docs.
It's worth noting that these will only minimise your problem. There will also be a very small window between read/writes where network lag causes one function to read/write while another is doing so too. You should think about only allowing one function to run at a time, or some other logic to guarantee mutually exclusive access to the DB.
It sounds like you are getting the current count, incrementing it in your Lambda function, then updating DynamoDB with the new value. Instead you need to be using DynamoDB Atomic Counters, which will ensure that multiple concurrent updates will not cause the problems you are describing.
By using Atomic counters you simply send DynamoDB a request to increment your counter by 1. If your Lambda needs to check if this was the last file you were waiting on before doing other work, then you can use the return value from the update call to check what the new count is.
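A minimal sketch of that increment, assuming a boto3 DynamoDB Table resource is passed in. The key name job_id and the attribute name files_seen are made up for illustration; the update_item parameters shown (UpdateExpression with ADD, ReturnValues="UPDATED_NEW") are the standard ones.

```python
def record_file_and_count(table, job_id):
    """Atomically add 1 to the counter and return the new value.

    `table` is expected to be a boto3 DynamoDB Table resource.
    """
    response = table.update_item(
        Key={"job_id": job_id},                  # hypothetical key schema
        UpdateExpression="ADD files_seen :one",  # increment happens server-side
        ExpressionAttributeValues={":one": 1},
        ReturnValues="UPDATED_NEW",              # return the value after the update
    )
    return int(response["Attributes"]["files_seen"])


def should_start_job(new_count, required_files):
    """Only the invocation that observes the final count starts the job."""
    return new_count == required_files
```

Because the increment is applied by DynamoDB rather than computed in the Lambda, two concurrent invocations get two different return values, and only one of them sees the final count.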
Not sure what you mean by "specific" (multiple) files.
If you are expecting specific file names (or "patterns"), then you could just check for all the expected files as first instruction of your lambda function. I.e. you expect files: A.txt, B.txt, C.txt, then test if your s3 bucket contains those 3 specific files (or 3 *.txt files or whatever suits your requirements). If that's the case then keep processing, if not then return from the function. This would technically work in case of concurrency calls.
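For example, the check could look something like this. The helper names are hypothetical; list_objects_v2 and its paginator are the boto3 S3 client calls, and the client is supplied by the caller.

```python
def list_keys(s3_client, bucket, prefix=""):
    """List object keys with a boto3 S3 client, following pagination."""
    paginator = s3_client.get_paginator("list_objects_v2")
    keys = []
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        keys.extend(obj["Key"] for obj in page.get("Contents", []))
    return keys


def missing_files(expected, present_keys):
    """Return the expected keys that are not in the bucket yet, sorted."""
    return sorted(set(expected) - set(present_keys))
```

The Lambda would return early whenever missing_files(...) is non-empty. Note that two invocations triggered almost simultaneously could both see the complete set, so the step that starts the job may still need to be idempotent.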
I am trying to figure out if there is a way to prioritize data to fetch first from the stream regardless of when it was put into the stream. For example, I am writing two types of data A and B into the stream; I want data of type A to always be ahead of type B even though type B was written into the stream first. (data is feeding on stream to AWS lambda for execution).
I couldn’t find any information regarding priority on the AWS website. The only solution I could think of was to have two streams set up and vary the read limits on them.
If there is anything like priority queue/stream in Kinesis, please give me some info on it.
Thank you,
Jaydeep
No, there's no way to do that with Kinesis. It's possible you could set PartitionKey or ExplicitHashKey to make some messages go to certain shards, where some shards are processed faster / more urgently. That's not quite the same, though.
SQS doesn't support priority queues either, unfortunately, but that SO thread has a few suggestions for alternatives.
I'm currently applying boto3 with dynamodb, and I noticed that there are two types of batch write
batch_writer is used in tutorial, and it seems like you can just iterate through different JSON objects to do insert (this is just one example, of course)
batch_write_item seems to me to be a DynamoDB-specific function. However, I'm not 100% sure about this, and I'm not sure what the difference between these two functions is (performance, methodology, and so on).
Do they do the same thing? If they do, why have two different functions? If they don't, what's the difference? How does the performance compare?
As far as I understand and use these APIs, with batch_write_item() you can handle data for more than one table in a single request. With batch_writer(), the actions apply only to the one table you called it on. That is the most basic difference I can tell you.
batch_writer creates a context manager for writing objects to Amazon DynamoDB in batch.
The batch writer will automatically handle buffering and sending items in batches.
In addition, the batch writer will also automatically handle any unprocessed items and resend them as needed. All you need to do is call put_item for any items you want to add, and delete_item for any items you want to delete.
In addition, you can specify auto_dedup if the batch might contain duplicated requests and you want this writer to handle de-dup for you.
source
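To make the difference concrete, here is a small sketch of both styles. put_all assumes a boto3 Table resource and uses batch_writer as quoted above. chunk_for_batch_write_item is a hypothetical helper that only builds the RequestItems payloads a low-level client.batch_write_item call would take, split into groups of 25 because that is the per-request limit; it does not send anything or retry unprocessed items.

```python
def put_all(table, items):
    """Write items through batch_writer; `table` is a boto3 DynamoDB Table resource."""
    with table.batch_writer() as batch:
        for item in items:
            batch.put_item(Item=item)  # buffered, flushed and retried for you


def chunk_for_batch_write_item(items, table_name, size=25):
    """Yield RequestItems dicts for client.batch_write_item, `size` puts at a time.

    With the low-level client, each item must already be in DynamoDB's
    typed format, e.g. {"id": {"N": "1"}}.
    """
    for start in range(0, len(items), size):
        chunk = items[start:start + size]
        yield {table_name: [{"PutRequest": {"Item": item}} for item in chunk]}
```

With the low-level call you would also have to inspect UnprocessedItems in each response and resend them yourself, which is the bookkeeping batch_writer hides.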
As the title goes, I want to trigger a notification when some events happen.
An event can be user-defined, such as specified files being updated within 1 minute.
If the files were stored locally, I could easily do this with the inotify system call, but in my case the files are located on a distributed file system such as mfs.
How can I do this? I would like to know if there are any solutions or open-source projects that solve this problem. Thanks.
If you have only black-box access (e.g. NFS protocol) to the remote system(s), you don't have many options unless the protocol supports what you need. So I'll assume you have control over the remote systems.
The "trivial" approach is running a local inotify/fanotify listener on each computer that would forward the notification over the network. FAM can do this over NFS.
A problem with all notification-based systems is the risk of lost notifications in various edge cases. This becomes much more acute over a network - e.g. a client confirms receipt of a notification, then immediately crashes. There are reliable message queues you can build on but IMHO this way lies madness...
A saner approach is stateless hash-based scan.
I like to call the following design "hnotify" but that's not an established term. The ideas are widely used by many version control and backup systems, dating back to Plan 9.
The core idea is if you know cryptographic hashes for files, you can compose a single hash that represents a directory of files - it changes if any of the files changed - and you can build these bottom-up to represent the whole filesystem's state.
(Git stores things this way and is very efficient at it.)
Why are hash trees cool? If you have 2 hash trees — one representing the filesystem state you saw at point in the past, one representing the current state — you can easily find out what changed between them:
1. You start at the roots. If they are different you read the 2 root directories and compare hashes for subdirectories.
2. If a subdirectory has the same hash in both trees, then nothing under it changed. No point going there.
3. If a subdirectory's hash changed, compare its contents recursively (go back to step 1).
4. If one has a subdirectory the other doesn't, well that's a change. With some global table you can also detect moves/renames.
Note that if few files changed, you only read a small portion of the current state. So the remote system doesn't have to send you the whole tree of hashes, it can be an interactive ping-pong of "give me hashes for this directory; ok now for this...".
(This is akin to how Git's dumb http protocol worked; there is a newer protocol with fewer round trips.)
This is as robust and bug-proof as polling the whole filesystem for changes — you can't miss anything — but reasonably efficient!
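Here is a toy sketch of the hash-tree comparison, with a directory modelled as a dict and a file as bytes. The function names and the exact hashing scheme are made up for illustration. For brevity it recomputes hashes on every call; a real system would store each directory's hash and only exchange those over the network.

```python
import hashlib


def tree_hash(node):
    """Hash a tree where a dict is a directory and bytes are file contents."""
    if isinstance(node, bytes):
        return hashlib.sha256(b"file:" + node).hexdigest()
    h = hashlib.sha256(b"dir:")
    for name in sorted(node):  # sorted so the hash does not depend on dict order
        child_hash = tree_hash(node[name]).encode("ascii")
        h.update(name.encode("utf-8") + b"\0" + child_hash + b"\n")
    return h.hexdigest()


def diff(old, new, path=""):
    """Return the paths that differ, skipping subtrees whose hashes match."""
    if tree_hash(old) == tree_hash(new):
        return []                      # same hash: nothing below changed
    if isinstance(old, bytes) or isinstance(new, bytes):
        return [path or "/"]           # a file changed, or file <-> directory
    changed = []
    for name in sorted(set(old) | set(new)):
        child = f"{path}/{name}"
        if name not in old or name not in new:
            changed.append(child)      # entry added or removed
        else:
            changed.extend(diff(old[name], new[name], child))
    return changed
```

Only the branches whose hashes differ are ever descended into, which is what keeps the comparison cheap when few files changed.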
But how does the server track current hashes?
Unfortunately, fully hashing all disk writes is too expensive for most people. You may get it for free if you're lucky to be running a deduplicating filesystem, e.g. ZFS or Btrfs.
Otherwise you're stuck with re-reading all changed files (which is even more expensive than doing it in the filesystem layer) or using fake file hashes: upon any change to a file, invent a new random "hash" to invalidate it (and try to keep the fake hashes on moves). Still compute real hashes up the tree. Now you may have false positives — you "detect a change" when the content is the same — but never false negatives.
Anyway, the point is that whatever stateful hacks you do (e.g. inotify with periodic scans to be sure), you only do them locally on the server. Across the network, you only ever send hashes that represent snapshots of current state (or its subtrees)! This way you can have a distributed system with many servers and clients, intermittent connectivity, and still keep your sanity.
P.S. Btrfs can efficiently find differences from an older snapshot. But this is a snapshot taken on the server (and causing all data to be preserved!), less flexible than a client-side lightweight tree-of-hashes.
P.S. One of your tags is HadoopFS. I'm not really familiar with it, but I suspect a lot of its files are write-once-then-immutable, and it might be able to natively give you some kind of file/chunk ids that can serve as fake hashes?
Existing tools
The first tool that springs to my mind is bup index. bup is a very clever deduplicating backup tool built on git (only scalable to huge data), so it sits on the foundation described above. In theory, indexing data in bup on the server and doing git fetch over the network would even implement the hash-walking comparison of what's new that I described above — unfortunately the git repositories that bup produces are too big for git itself to cope with. Also you probably don't want bup to read and store all your data. But bup index is a separate subsystem that quickly scans a filesystem for potential changes, without yet reading the changed files.
Currently bup doesn't use inotify but it's been discussed in depth.
Oh, and bup uses Bloom Filters which are a nearly optimal way to represent sets with false positives. I'm almost certain Bloom filters have a role to play in optimizing stateless notification protocols ("here is a compressed bitmap of all I have; you should be able to narrow your queries with it" or "here is a compressed bitmap of what I want to be notified about"). Not sure if the way bup uses them is directly useful to you, but this data structure should definitely be in your toolbelt.
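In case the data structure is unfamiliar, here is a bare-bones Bloom filter sketch. It is not how bup implements it; the class name, sizes and hashing scheme are arbitrary choices for illustration.

```python
import hashlib


class BloomFilter:
    """Compact set that can give false positives but never false negatives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, item):
        # Derive num_hashes bit positions by salting the item with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def might_contain(self, item):
        # True means "possibly present"; False means "definitely absent".
        return all((self.bits >> pos) & 1 for pos in self._positions(item))
```

A client could send such a bitmap of the paths it cares about, and the server would only need to report changes for paths where might_contain is true.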
Another tool is git annex. It's also based on Git (are you noticing a trend?) but is designed to keep the data itself out of Git repos (so git fetch should just work!) and has a "WORM" option that uses fake hashes for faster performance.
Alternative design: compressed replayable journal
I used to think the above is the only sane stateless approach for clients to check what's changed. But I just read http://arstechnica.com/apple/2007/10/mac-os-x-10-5/7/ about OS X's FSEvents framework, which has a perhaps simpler design:
ALL changes are logged to a file. It's kept forever.
Clients can ask "replay for me everything since event 51348".
The magic trick is the log has coarse granularity ("something in this directory changed, go re-scan it to find out what", repeated changes within 30 seconds are combined) so this journal file is very compact.
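A toy version of such a journal might look like the following. It is only a sketch of the idea, not how FSEvents is implemented: the class and method names are made up, events are kept in memory rather than in a file, and the time-based combining of repeated changes is left out (duplicates are merged at replay time instead).

```python
import posixpath


class ChangeJournal:
    """Append-only log of 'something in this directory changed' events."""

    def __init__(self):
        self.events = []  # index i holds the directory for event id i + 1

    def record(self, path):
        """Log a change at directory granularity and return the new event id."""
        self.events.append(posixpath.dirname(path) or "/")
        return len(self.events)

    def replay_since(self, last_seen_id):
        """Return (latest event id, directories to re-scan) after last_seen_id."""
        dirs = []
        for directory in self.events[last_seen_id:]:
            if directory not in dirs:  # repeated changes collapse to one entry
                dirs.append(directory)
        return len(self.events), dirs
```

A client only has to remember the last event id it saw; it then re-scans the returned directories itself to find out what actually changed.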
At the low level you might resort to similar techniques — e.g. hashes — but the top-level interface is different: instead of snapshots you deal with a timeline of events. It may be an easier fit for some applications.
I've got an app that has about 10 types of objects. There will be potentially a few thousand object instances of each type. These lists of objects need to stay synchronized between apps running on different machines. If an object is added, changed or deleted, that needs to propagate to the other machines.
This will be a star topology -- there is a central master, and the rest are clients.
I DO have the concept of a session, so can store data about each client.
Is there a good design pattern to follow for this? Even better, is there a (template based?) library that would handle asking the container what has changed since client X came by and getting that delta to send out?
Right now I'm thinking every object-type container has an update counter. When something is added/changed/removed, the update counter is incremented, and the changed object(s) are tagged with that value. Each client will save the value of the update counter when it gets an update. Later it will come back and ask for any changes since its update counter value. Finally, deletes are kept as tombstone records (although I'm not exactly sure when to clear them out).
One thing that makes this harder is clients can come and go without the central server necessarily knowing, although I guess there could be a timeout concept (if the server hasn't heard from a client in 5 minutes, it assumes the client is gone).
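To make that concrete, here is a rough sketch of one such container. All names are hypothetical, and it deliberately leaves open when tombstones get cleared.

```python
class SyncedContainer:
    """One object-type container with an update counter and tombstones."""

    def __init__(self):
        self.counter = 0
        self.objects = {}     # object id -> (value, counter value at last change)
        self.tombstones = {}  # object id -> counter value at deletion

    def put(self, obj_id, value):
        """Add or change an object and tag it with the new counter value."""
        self.counter += 1
        self.objects[obj_id] = (value, self.counter)
        self.tombstones.pop(obj_id, None)  # a re-added object is no longer deleted

    def delete(self, obj_id):
        if obj_id in self.objects:
            self.counter += 1
            del self.objects[obj_id]
            self.tombstones[obj_id] = self.counter

    def changes_since(self, client_counter):
        """Return (current counter, updated objects, deleted ids) for a client."""
        updated = {k: v for k, (v, tag) in self.objects.items() if tag > client_counter}
        deleted = sorted(k for k, tag in self.tombstones.items() if tag > client_counter)
        return self.counter, updated, deleted
```

The client stores the returned counter and passes it back on its next visit.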
Is this a well-known pattern? Any additional suggestions?
How you implement synchronization very much depends on your needs. Do the changes need to be sent to the clients, or is it sufficient that a client checks whether an object is up to date whenever it uses the object? How about using the Proxy pattern? This pattern allows you to create a proxy implementation of your objects that can check whether they are up to date, update them if they are not, and then return the result. I would do this by having a lastChanged timestamp on the objects on the master and a lastUpdated timestamp on the client objects. If latency is an issue, checking whether an object is up to date on each call is probably not a good idea. Consider having a separate thread that queries the master for changed objects and marks them "dirty". This could dramatically reduce the network traffic as well.
You could also look into the Observer pattern and Publish/Subscribe.
An option that might be simple to implement and still pretty efficient is to treat the pile of objects as an opaque blob and use librsync to synchronize them. It sounds like all of the updates flow one direction, from master to clients, and there's probably some persistent representation of the objects on the clients -- a file or something. I'm assuming it's a file for the rest of this answer, though any sequence of bytes can be used.
The way it would work is that each client would generate a librsync "signature" of its local copy of the blob and send that signature to the master. The signature is about 1% of the size of the blob. The master would then use librsync to compute a delta between that signature and the current data, and send the delta to the client, which would use librsync to apply the delta to its local copy of the blob.
The librsync API is simple, and the signature/delta data transfer is relatively efficient.
If that's not workable, it may still be useful to take a more manual "delta-based" approach, to avoid having to do per-object versioning. Each time the master makes a change, it should log that change to a journal, recording what was done and to which object. Versioning is done at the whole-database level, so in effect a version number is assigned to each journal entry.
When a client connects, it should send its version of the whole object collection, and the server can then respond with the contents of the journal between the client's version and the newest entry. If updates on a given object are done by completely replacing the object contents, then you can optimize this by filtering out all but the most recent version of each object. If the master also keeps track of which versions it has sent to which client, it can know when it is safe to discard old journal entries. Even if it doesn't track that, you can still discard old journal entries according to some heuristic (probably just age) and if you receive a connection from a client whose last version is older than your oldest journal entry, then you just have to send the entire set of objects to that client.
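A compact sketch of this journal scheme, under the assumption that updates replace whole objects. The class and method names are hypothetical, and it keeps everything in memory.

```python
class MasterJournal:
    """Whole-collection versioning: every change gets the next version number."""

    def __init__(self):
        self.version = 0
        self.state = {}    # object id -> current value
        self.journal = []  # list of (version, op, object id, value)

    def apply(self, op, obj_id, value=None):
        """Apply a 'put' or 'delete' on the master and log it."""
        self.version += 1
        if op == "delete":
            self.state.pop(obj_id, None)
        else:
            self.state[obj_id] = value
        self.journal.append((self.version, op, obj_id, value))

    def discard_up_to(self, version):
        """Drop journal entries at or below `version` (e.g. by age heuristic)."""
        self.journal = [entry for entry in self.journal if entry[0] > version]

    def sync(self, client_version):
        """Return (new version, 'delta' or 'full', payload) for a client."""
        # Oldest client version the remaining journal can still bring up to date.
        oldest_ok = self.journal[0][0] - 1 if self.journal else self.version
        if client_version < oldest_ok:
            return self.version, "full", dict(self.state)
        latest = {}  # keep only the most recent entry per object
        for ver, op, obj_id, value in self.journal:
            if ver > client_version:
                latest[obj_id] = (op, value)
        return self.version, "delta", latest
```

A client whose version predates the oldest retained entry simply gets the full set of objects, as described above.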