Accessing Data while updating the same data on AWS DynamoDB - amazon-web-services

I am planning to build a mini content management system and am checking the possibility of storing the content in DynamoDB. Will the services be able to access the content while the same content is being updated? (The scenario is updating content in the CMS and publishing it.)
Or would CloudSearch be a better solution than DynamoDB for such a use case?
Thanks in advance!

Please think about your use case and decide whether it requires eventually consistent reads or strongly consistent reads.
Read Consistency
Eventually Consistent Reads
When you read data from a DynamoDB table, the response might not reflect the results of a recently completed write operation. The response might include some stale data. If you repeat your read request after a short time, the response should return the latest data.
Strongly Consistent Reads
When you request a strongly consistent read, DynamoDB returns a response with the most up-to-date data, reflecting the updates from all prior write operations that were successful. A strongly consistent read might not be available in the case of a network delay or outage.
Note:-
DynamoDB uses eventually consistent reads, unless you specify otherwise. Read operations (such as GetItem, Query, and Scan) provide a ConsistentRead parameter: If you set this parameter to true, DynamoDB will use strongly consistent reads during the operation.
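For illustration, a strongly consistent GetItem with the AWS SDK for Java v2 might look like the sketch below (the table name "CmsContent" and key attribute "contentId" are placeholders made up for a CMS-like schema):

import java.util.Map;

import software.amazon.awssdk.services.dynamodb.DynamoDbClient;
import software.amazon.awssdk.services.dynamodb.model.AttributeValue;
import software.amazon.awssdk.services.dynamodb.model.GetItemRequest;
import software.amazon.awssdk.services.dynamodb.model.GetItemResponse;

public class StrongReadExample {
    public static void main(String[] args) {
        DynamoDbClient dynamoDb = DynamoDbClient.create();

        // consistentRead(true) forces a strongly consistent read;
        // omit it (or pass false) for the default eventually consistent read.
        GetItemRequest request = GetItemRequest.builder()
                .tableName("CmsContent")   // placeholder table name
                .key(Map.of("contentId", AttributeValue.builder().s("article-42").build()))
                .consistentRead(true)
                .build();

        GetItemResponse response = dynamoDb.getItem(request);
        System.out.println(response.item());
    }
}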
AWS DynamoDB tool (Java) for transaction management:-
Out of the box, DynamoDB provides two of the four ACID properties:
Consistency and Durability. Within a single item, you also get
Atomicity and Isolation, but when your application needs to involve
multiple items you lose those properties. Sometimes that's good
enough, but many applications, especially distributed applications,
would appreciate some of that Atomicity and Isolation as well.
Fortunately, DynamoDB provides the tools (especially optimistic
concurrency control) so that an application can achieve these
properties and have full ACID transactions.
You can use this tool if you are using Java or the AWS SDK for Java with DynamoDB. I am not sure whether a similar tool is available for other languages.
One of the features available on this library is Isolated Reads.
Isolated reads: Read operations to multiple items are not interfered with by other transactions.
Dynamodb transaction library for Java
Transaction design
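As a rough, hand-rolled illustration of the optimistic concurrency control mentioned above (this is not the transaction library's own API, just a conditional write against a hypothetical "version" attribute on a made-up "CmsContent" table):

import java.util.Map;

import software.amazon.awssdk.services.dynamodb.DynamoDbClient;
import software.amazon.awssdk.services.dynamodb.model.AttributeValue;
import software.amazon.awssdk.services.dynamodb.model.ConditionalCheckFailedException;
import software.amazon.awssdk.services.dynamodb.model.UpdateItemRequest;

public class OptimisticLockExample {
    // Succeeds only if nobody else bumped the version since we read the item.
    static void updateContent(DynamoDbClient dynamoDb, String contentId,
                              String newBody, long expectedVersion) {
        UpdateItemRequest request = UpdateItemRequest.builder()
                .tableName("CmsContent")   // placeholder table name
                .key(Map.of("contentId", AttributeValue.builder().s(contentId).build()))
                .updateExpression("SET #body = :b, #ver = :newV")
                .conditionExpression("#ver = :expectedV")
                .expressionAttributeNames(Map.of("#body", "body", "#ver", "version"))
                .expressionAttributeValues(Map.of(
                        ":b", AttributeValue.builder().s(newBody).build(),
                        ":newV", AttributeValue.builder().n(Long.toString(expectedVersion + 1)).build(),
                        ":expectedV", AttributeValue.builder().n(Long.toString(expectedVersion)).build()))
                .build();
        try {
            dynamoDb.updateItem(request);
        } catch (ConditionalCheckFailedException e) {
            // Another writer got there first: re-read the item and retry, or surface a conflict.
        }
    }
}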

Related

AWS Event-Sourcing implementation

I'm quite a newbie in microservices and Event-Sourcing, and I was trying to figure out a way to deploy a whole system on AWS.
As far as I know there are two ways to implement an Event-Driven architecture:
Using AWS Kinesis Data Stream
Using AWS SNS + SQS
So my base strategy is that every command is converted to an event which is stored in DynamoDB, and DynamoDB Streams are used to notify other microservices about the new event. But how? Which of the previous two solutions should I use?
The first one has the advantages of:
Message ordering
At-least-once delivery
But the disadvantages are quite problematic:
No built-in autoscaling (you can achieve it using triggers)
No message visibility functionality (apparently; I'm asking for confirmation of that)
No topic subscription
Very strict limits on read transactions: you can improve this using multiple shards, but from what I read here you must have a not-well-defined number of Lambdas with different invocation priorities and a not-well-defined strategy to avoid duplicate processing across multiple instances of the same microservice.
The second one has the advantages of:
It is completely managed
Very high TPS
Topic subscriptions
Message visibility functionality
Drawbacks:
SQS messages have only best-effort ordering; I still have no idea what that means.
It says "A standard queue makes a best effort to preserve the order of messages, but more than one copy of a message might be delivered out of order".
Does it mean that, given n copies of a message, the first copy is delivered in order while the others are delivered out of order relative to the other messages' copies? Or could "more than one" be "all"?
A very big thanks for every kind of advice!
I'm quite a newbie in microservices and Event-Sourcing
Review Greg Young's talk Polyglot Data for more insight into what follows.
Sharing events across service boundaries has two basic approaches - a push model and a pull model. For subscribers that care about the ordering of events, a pull model is "simpler" to maintain.
The basic idea being that each subscriber tracks its own high water mark for how many events in a stream it has processed, and queries an ordered representation of the event list to get updates.
In AWS, you would normally get this representation by querying the authoritative service for the updated event list (the implementation of which could include paging). The service might provide the list of events by querying dynamodb directly, or by getting the most recent key from DynamoDB, and then looking up cached representations of the events in S3.
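A sketch of that subscriber-side pull with the AWS SDK for Java v2, assuming a hypothetical "EventStore" table whose partition key is the stream id and whose numeric sort key ("sequenceNumber") orders the events; the subscriber asks for everything past its high water mark:

import java.util.List;
import java.util.Map;

import software.amazon.awssdk.services.dynamodb.DynamoDbClient;
import software.amazon.awssdk.services.dynamodb.model.AttributeValue;
import software.amazon.awssdk.services.dynamodb.model.QueryRequest;

public class PullSubscriberExample {
    // Fetch events this subscriber has not processed yet, oldest first.
    static List<Map<String, AttributeValue>> fetchNewEvents(DynamoDbClient dynamoDb,
                                                            String streamId,
                                                            long highWaterMark) {
        QueryRequest request = QueryRequest.builder()
                .tableName("EventStore")   // placeholder table name
                .keyConditionExpression("#sid = :s AND #seq > :hwm")
                .expressionAttributeNames(Map.of("#sid", "streamId", "#seq", "sequenceNumber"))
                .expressionAttributeValues(Map.of(
                        ":s", AttributeValue.builder().s(streamId).build(),
                        ":hwm", AttributeValue.builder().n(Long.toString(highWaterMark)).build()))
                .scanIndexForward(true)    // ascending sort-key order
                .build();
        return dynamoDb.query(request).items();
        // The caller processes the items in order, then persists its new high water mark.
    }
}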
In this approach, the "events" that are being pushed out of the system are really just notifications, allowing the subscribers to reduce the latency between the write into Dynamo and their own read.
I would normally reach for SNS (fan-out) for broadcasting notifications. Consumers that need bookkeeping support for which notifications they have handled would use SQS. But the primary channel for communicating the ordered events is pull.
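For the notification side, a minimal sketch (assuming a Lambda function subscribed to the table's DynamoDB stream and a placeholder SNS topic ARN) publishes nothing more than a pointer, leaving the events themselves to be pulled:

import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.lambda.runtime.RequestHandler;
import com.amazonaws.services.lambda.runtime.events.DynamodbEvent;

import software.amazon.awssdk.services.sns.SnsClient;
import software.amazon.awssdk.services.sns.model.PublishRequest;

public class StreamNotifier implements RequestHandler<DynamodbEvent, Void> {
    private final SnsClient sns = SnsClient.create();

    @Override
    public Void handleRequest(DynamodbEvent event, Context context) {
        event.getRecords().forEach(rec -> {
            // Publish only a lightweight notification; subscribers pull the
            // ordered events from the authoritative store themselves.
            String streamId = rec.getDynamodb().getKeys().get("streamId").getS();   // placeholder key name
            sns.publish(PublishRequest.builder()
                    .topicArn("arn:aws:sns:us-east-1:123456789012:new-events")      // placeholder ARN
                    .message(streamId)
                    .build());
        });
        return null;
    }
}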
I myself haven't looked hard at Kinesis - there's some general discussion in earlier questions -- but I think Kevin Sookocheff is onto something when he writes
...if you dig a little deeper you will find that Kinesis is well suited for a very particular use case, and if your application doesn’t fit this use case, Kinesis may be a lot more trouble than it’s worth.
Kinesis’ primary use case is collecting, storing and processing real-time continuous data streams. Data streams are data that are generated continuously by thousands of data sources, which typically send in the data records simultaneously, and in small sizes (order of Kilobytes).
Another thing: the fact that I'm accessing data from another microservice stream is an anti-pattern, isn't it?
Well, part of the point of dividing a system into microservices is to reduce the coupling between the capabilities of the system. Accessing data across the microservice boundaries increases the coupling. So there's some tension there.
But basically if I'm using a pull model I need to read data from other microservices' stream. Is it avoidable?
If you query the service you need for the information, rather than digging it out of the stream yourself, you reduce the coupling -- much like asking a service for data rather than reaching into an RDBMS and querying the tables yourself.
If you can avoid sharing the information between services at all, then you get even less coupling.
(Naive example: order fulfillment needs to know when an order has been paid for; so it needs a correlation id when the payment is made, but it doesn't need any of the other billing details.)

Implement atomic transactions over multiple AWS resources

I want to implement atomic transactions over multiple AWS resources -- e.g. uploading an object to S3 and adding a record to a DynamoDB table. Both should happen in lockstep -- or not at all. If one of the operations fails, the other should be rolled back. I understand I can implement it myself, but I was wondering if there is an existing library that does it.
One of the challenges while implementing this is expiry of temporary credentials. What if credentials expire after one of the operations was performed?
Any suggestions?
Transactions are hard! Especially in a distributed system. Transactions are also slow.
If there is any way to redesign your system to not require transactional semantics, I strongly encourage you to try.
If you really need transactions involving multiple AWS resources across different services, you sort of have to roll your own. You can leverage a distributed data store that supports atomic operations and build on top of that.
It won’t be easy.
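As one illustration of rolling your own, a common shape is write-then-compensate: upload the object, write the DynamoDB record, and delete the orphaned object if that second step fails. The bucket and table names below are placeholders, and this is only best effort, not a real atomic transaction (the compensation itself can fail, e.g. on expired credentials, so a periodic sweep for orphans is still advisable):

import java.util.Map;

import software.amazon.awssdk.core.sync.RequestBody;
import software.amazon.awssdk.services.dynamodb.DynamoDbClient;
import software.amazon.awssdk.services.dynamodb.model.AttributeValue;
import software.amazon.awssdk.services.dynamodb.model.PutItemRequest;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.DeleteObjectRequest;
import software.amazon.awssdk.services.s3.model.PutObjectRequest;

public class BestEffortTwoPhaseWrite {
    static void storeDocument(S3Client s3, DynamoDbClient dynamoDb,
                              String bucket, String key, byte[] payload) {
        // Step 1: upload the object.
        s3.putObject(PutObjectRequest.builder().bucket(bucket).key(key).build(),
                RequestBody.fromBytes(payload));
        try {
            // Step 2: record its location in DynamoDB.
            dynamoDb.putItem(PutItemRequest.builder()
                    .tableName("Documents")   // placeholder table name
                    .item(Map.of(
                            "docKey", AttributeValue.builder().s(key).build(),
                            "bucket", AttributeValue.builder().s(bucket).build()))
                    .build());
        } catch (RuntimeException e) {
            // Compensate: best-effort cleanup of the now-orphaned S3 object.
            s3.deleteObject(DeleteObjectRequest.builder().bucket(bucket).key(key).build());
            throw e;
        }
    }
}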

Store streaming data - fast, cheap, reliable and good for batch consumption

I have a (spring-boot) web service that generates a json response for each request. This response, while returned to the querying user, also needs to be archived somewhere (so that we know what we responded with to the user).
The service needs to support 4,000 requests/second. As such, we need the archival method to be fast. The archived data would later be consumed by a map-reduce (batch) job.
I want to know which solution to use - Kafka, S3, or any other solution. The service has been deployed to AWS. So solutions within AWS are ideal.
The requirements are as follows:
Writes should be fast (4K req/s at least).
Writes should be non-blocking (so that the service response time is not affected).
Reads need not be fast but should be suitable for consumption by map-reduce jobs.
Data should be resilient to server crashes etc.
Should not be too expensive to write/store and read.
There is no data retirement plan, i.e. the data needs to persist until the end of time.
Which solutions do you recommend?
Some of your requirements like "should not be too expensive" are a bit vague. In the end, you are going to need to evaluate a service against all of your exact requirements yourself.
Given that qualification, I would look into streaming the data to Kinesis with the goal of archiving the data to S3. I recommend reading this blog post from AWS to get an idea of how to achieve this.
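As a sketch of the write path, assuming a Kinesis Data Firehose delivery stream (placeholder name "api-response-archive") configured to buffer and deliver batches to S3; the async client keeps the archival call from blocking the request thread:

import java.util.concurrent.CompletableFuture;

import software.amazon.awssdk.core.SdkBytes;
import software.amazon.awssdk.services.firehose.FirehoseAsyncClient;
import software.amazon.awssdk.services.firehose.model.PutRecordRequest;
import software.amazon.awssdk.services.firehose.model.PutRecordResponse;
import software.amazon.awssdk.services.firehose.model.Record;

public class ResponseArchiver {
    private final FirehoseAsyncClient firehose = FirehoseAsyncClient.create();

    // Fire-and-forget archival of the JSON response; Firehose buffers and
    // delivers batches to S3, where map-reduce jobs can consume them later.
    public CompletableFuture<PutRecordResponse> archive(String jsonResponse) {
        return firehose.putRecord(PutRecordRequest.builder()
                .deliveryStreamName("api-response-archive")   // placeholder stream name
                .record(Record.builder()
                        .data(SdkBytes.fromUtf8String(jsonResponse + "\n"))
                        .build())
                .build());
    }
}

At 4K req/s you would likely move to putRecordBatch and add retry handling, but the shape stays the same.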

Redshift as a Web App Backend?

I am building an application (using Django's ORM) that will ingest a lot of events, let's say 50/s (1-2k per msg). Initially some "real time" processing and monitoring of the events is in scope so I'll be using redis to keep some of that data to make decisions, expunging them when it makes sense. I was going to persist all of the entities, including events in Postgres for "at rest" storage for now.
In the future I will need "analytical" capability for dashboards and other features. I want to use Amazon Redshift for this. I considered just going straight for Redshift and skipping Postgres. But I also see folks say that it should play more of a passive role. Maybe I could keep a window of data in the SQL backend and archive to Redshift regularly.
My question is:
Is it even normal to use something like Redshift as a backend for web applications, or does it typically play more of a passive role? If not, is it realistic to think I can scale Postgres enough for the event data to start with only that? And if not, does the "window of data and archival" method make sense?
EDIT Here are some things I've seen before writing the post:
Some say "yes, go for it" regarding the "should I use Redshift for this?" question.
Others say "eh, not performant enough for most web apps" and are in the "front it with a Postgres database" camp.
Redshift (ParAccel) is an OLAP-optimised DB, based on a fork of a very old version of PostgreSQL.
It's good at parallelised read-mostly queries across lots of data. It's bad at many small transactions, especially many small write transactions as seen in typical OLTP workloads.
You're partway in between. If you don't mind a data loss window, then you could reasonably accumulate data points and have a writer thread or two write batches of them to Redshift in decent sized transactions.
If you can't afford any data loss window and expect to be processing 50+ TPS, then don't consider using Redshift directly. The round-trip costs alone would be horrifying. Use a local database - or even a file based append-only journal that you periodically rotate. Then periodically upload new data to Redshift for analysis.
A few other good reasons you probably shouldn't use Redshift directly:
OLAP DBs with column store designs often work best with star schemas or similar structures. Such schemas are slow and inefficient for OLTP workloads as inserts and updates touch many tables, but they make querying the data along various axes for analysis much more efficient.
Using an ORM to talk to an OLAP DB is asking for trouble. ORMs are quite bad enough on OLTP-optimised DBs, with their unfortunate tendency toward n+1 SELECTs and/or wasteful chained left joins, tendency to do many small inserts instead of a few big ones, etc. This will be even worse on most OLAP-optimised DBs.
Redshift is based on a painfully old PostgreSQL with a bunch of limitations and incompatibilities. Code written for normal PostgreSQL may not work with it.
Personally I'd avoid an ORM entirely for this - I'd just accumulate data locally in an SQLite or a local PostgreSQL or something, sending multi-valued INSERTs or using PostgreSQL's COPY to load chunks of data as I received it from an in-memory buffer. Then I'd use appropriate ETL tools to periodically transform the data from the local DB and merge it with what was already on the analytics server.
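For instance, a batched flush from an in-memory buffer into a local staging table might look like the following sketch (the JDBC URL, credentials and table are placeholders; PostgreSQL's COPY via CopyManager would be faster still):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.List;

public class LocalEventBuffer {
    // Flush a buffer of serialized events into the local staging DB in one
    // transaction, using a batched insert instead of one INSERT per event.
    static void flush(List<String> eventJsonBatch) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/staging", "app", "secret")) {   // placeholder DSN
            conn.setAutoCommit(false);
            try (PreparedStatement ps = conn.prepareStatement(
                    "INSERT INTO event_staging (payload) VALUES (?)")) {          // placeholder table
                for (String json : eventJsonBatch) {
                    ps.setString(1, json);
                    ps.addBatch();
                }
                ps.executeBatch();
            }
            conn.commit();
        }
    }
}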
Now forget everything I just said and go do some benchmarks with a simulation of your app's workload. That's the only really useful way to tell.
In addition to Redshift's slow transaction processing (by modern DB standards) there's another big challenge:
Redshift only supports serializable transaction isolation, most likely as a compromise to support ACID transactions while also optimizing for parallel OLAP mostly-read workloads.
That can result in all kinds of concurrency-related failures that would not have been failures on a typical DB that supports read-committed isolation by default.
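In practice that means the writer needs retry logic around its batches. A minimal sketch follows; treating SQLSTATE 40001 as the serialization-failure code is an assumption carried over from PostgreSQL, so verify it against the errors your Redshift cluster actually returns:

import java.sql.Connection;
import java.sql.SQLException;

public class SerializableRetry {
    interface TxBody { void run(Connection conn) throws SQLException; }

    // Re-run a transaction a few times if it is aborted by a serialization conflict.
    static void runWithRetry(Connection conn, TxBody body, int maxAttempts) throws SQLException {
        for (int attempt = 1; ; attempt++) {
            try {
                conn.setAutoCommit(false);
                body.run(conn);
                conn.commit();
                return;
            } catch (SQLException e) {
                conn.rollback();
                boolean serializationFailure = "40001".equals(e.getSQLState());   // assumed SQLSTATE
                if (!serializationFailure || attempt >= maxAttempts) {
                    throw e;
                }
                // Otherwise fall through and retry the whole transaction.
            }
        }
    }
}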

DynamoDB's PutItem is multiple zones safe?

According to the link [1]:
Amazon DynamoDB has built-in fault tolerance, automatically and synchronously
replicating your data across three Availability Zones in a Region for high
availability and to help protect your data against individual machine, or even
facility failures.
So can I assume that, at the time I get the result of a successful write, it is already replicated to three Availability Zones?
[1] http://aws.amazon.com/dynamodb/
I think it depends on how you do the read:
from http://aws.amazon.com/dynamodb/faqs/
Q: What is the consistency model of Amazon DynamoDB?
When reading data from Amazon DynamoDB, users can specify whether they want the read to be eventually consistent or strongly consistent:
Eventually Consistent Reads (Default) – the eventual consistency option maximizes your read throughput. However, an eventually consistent read might not reflect the results of a recently completed write. Consistency across all copies of data is usually reached within a second. Repeating a read after a short time should return the updated data.
Strongly Consistent Reads — in addition to eventual consistency, Amazon DynamoDB also gives you the flexibility and control to request a strongly consistent read if your application, or an element of your application, requires it. A strongly consistent read returns a result that reflects all writes that received a successful response prior to the read.
Yes, you can rely on the data being there if PutItem succeeded.
automatically and synchronously replicating your data across three Availability Zones
The keyword is synchronously, meaning at the same time. At the same time it accepts your data, it's writing to all three Availability Zones. If PutItem returned before completing those writes, DynamoDB wouldn't have the consistency and durability guarantees advertised.