I'm reading through the BigQuery documentation and I'm confused by the two quotes below, which seem to contradict each other.
Source: Google Doc
BigQuery transparently and automatically provides highly durable,
replicated storage in multiple locations and high availability with no
extra charge and no additional setup.
Source: Google Doc
BigQuery does not automatically provide a backup or replica of your data in
another geographic region. You can create cross-region dataset copies
to enhance your disaster recovery strategy.
Does BigQuery automatically replicate data across zones/regions?
For long-term data storage, given the options of Bigtable, BigQuery, and Regional Persistent Disk, is it preferable to use a Regional Persistent Disk to automatically replicate data across different geographical locations?
Yes, BigQuery automatically replicates data across zones, but within the selected location rather than across regions.
This is from the Google documentation:
In either case, BigQuery automatically stores copies of your data in two different Google Cloud zones within the selected location.
But reading further, I think you're missing some information: the documentation also mentions hard regional failures.
What is a hard regional failure? As the Google documentation describes it:
Hard failure is an operational deficiency where hardware is destroyed. Hard failures are more severe than soft failures. Hard failure examples include damage from floods, terrorist attacks, earthquakes, and hurricanes.
For example, earthquakes are quite frequent in asia-east1 (Taiwan), so if you're creating a dataset in that region, you might consider cross-region dataset copies to enhance your disaster recovery strategy.
For long-term data storage, I think you can also export your table data to Cloud Storage (GCS), because it offers several storage classes. For example:
Archive Storage is the lowest-cost, highly durable storage service for data archiving, online backup, and disaster recovery.
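If you go the export route, here is a minimal sketch using the BigQuery Python client to extract a table to a GCS bucket. The project, dataset, table, and bucket names are placeholders, and the Archive storage class would be configured on the bucket itself rather than in the extract job.

```python
# Minimal sketch: export a BigQuery table to GCS for long-term storage.
# The project, dataset, table, and bucket names are placeholders.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")

source_table = "my-project.my_dataset.my_table"
# The wildcard lets BigQuery shard the export into multiple files if needed.
destination_uri = "gs://my-archive-bucket/my_table/export-*.avro"

job_config = bigquery.ExtractJobConfig(destination_format="AVRO")

extract_job = client.extract_table(
    source_table,
    destination_uri,
    job_config=job_config,
)
extract_job.result()  # Wait for the export job to finish.
```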
Recently, S3 announced strong read-after-write consistency. I'm curious how one would implement that. Doesn't it violate the CAP theorem?
In my mind, the simplest way is to wait for the replication to happen and then return, but that would result in performance degradation.
AWS says that there is no performance difference. How is this achieved?
Another thought is that Amazon has a giant index table that keeps track of all S3 objects and where they are stored (triple replication, I believe), and that it needs to update this index on every PUT/DELETE. Is that technically feasible?
As indicated by Martin above, there is a link to Reddit which discusses this. The top response from u/ryeguy gave this answer:
If I had to guess, s3 synchronously writes to a cluster of storage nodes before returning success, and then asynchronously replicates it to other nodes for stronger durability and availability. There used to be a risk of reading from a node that didn't receive a file's change yet, which could give you an outdated file. Now they added logic so the lookup router is aware of how far an update is propagated and can avoid routing reads to stale replicas.
I just pulled all this out of my ass and have no idea how s3 is actually architected behind the scenes, but given the durability and availability guarantees and the fact that this change doesn't lower them, it must be something along these lines.
Better answers are welcome.
Our usual assumptions don't carry over to cloud systems. There are many factors involved in the risk analysis: availability, consistency, disaster recovery, backup mechanisms, maintenance burden, charges, and so on. Also, theorems are only a reference point when designing; we can create our own approach by combining several of them. So I would like to share a link from AWS that illustrates the process in detail:
https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-consistent-view.html
When you create a cluster with consistent view enabled, Amazon EMR uses an Amazon DynamoDB database to store object metadata and track consistency with Amazon S3. You must grant the EMRFS role permissions to access DynamoDB. If consistent view determines that Amazon S3 is inconsistent during a file system operation, it retries that operation according to rules that you can define. By default, the DynamoDB database has 400 read capacity units and 100 write capacity units. You can configure the read/write capacity settings depending on the number of objects that EMRFS tracks and the number of nodes concurrently using the metadata. You can also configure other database and operational parameters. Using consistent view incurs DynamoDB charges, which are typically small, in addition to the charges for Amazon EMR.
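To make that concrete, here is a minimal sketch (not from the AWS page above) of creating an EMR cluster with consistent view enabled through the emrfs-site configuration classification using boto3. The cluster name, release label, roles, and instance settings are placeholders.

```python
# Minimal sketch: create an EMR cluster with EMRFS consistent view enabled.
# Cluster name, release label, roles, and instance settings are placeholders.
import boto3

emr = boto3.client("emr", region_name="us-east-1")

response = emr.run_job_flow(
    Name="consistent-view-example",
    ReleaseLabel="emr-5.36.0",
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 3,
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    Configurations=[
        {
            "Classification": "emrfs-site",
            "Properties": {
                "fs.s3.consistent": "true",
                # How EMRFS retries when it detects an inconsistency with S3.
                "fs.s3.consistent.retryCount": "5",
                "fs.s3.consistent.retryPeriodSeconds": "10",
            },
        }
    ],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print(response["JobFlowId"])
```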
I'm new to Redshift and need some clarification on how Redshift operates:
Does Amazon Redshift have its own backend storage platform, or does it depend on S3 to store the data as objects, with Redshift used only for querying, processing, and transforming, keeping temporary storage to pick up a specific slice from S3 and process it?
In other words, does Redshift have its own backend storage, the way Oracle or Microsoft SQL Server have their own physical servers on which data is stored?
I ask because, if I'm migrating from a conventional RDBMS to Redshift due to increased volume, would Redshift alone do, or should I opt for a combination of Redshift and S3?
This question seems basic, but I'm unable to find an answer on Amazon's websites or in any of the blogs related to Redshift.
Yes, Amazon Redshift uses its own storage.
The prime use-case for Amazon Redshift is running complex queries against huge quantities of data. This is the purpose of a "data warehouse".
Whereas normal databases start to lose performance beyond a million or so rows, Amazon Redshift can handle billions of rows. This is because data is distributed across multiple nodes and is stored in a columnar format, making it suitable for handling "wide" tables (which are typical in data warehouses). It is this dedicated storage, and the way that data is stored, that gives Redshift its speed.
The trade-off is that while Redshift is excellent for querying large quantities of data, it is not designed for frequently updating data. Thus, it should not be substituted for a normal database that an application uses for transactions. Rather, Redshift is often used to take that transactional data, combine it with other information (customers, orders, transactions, support tickets, sensor data, website clicks, tracking information, etc.) and then run complex queries that combine all of that data.
Amazon Redshift can also use Amazon Redshift Spectrum, which is very similar to Amazon Athena. Both services can read data directly from Amazon S3. Such access is not as efficient as using data stored directly in Redshift, but it can be improved by using columnar storage formats (e.g. ORC and Parquet) and by partitioning files. This, of course, is only good for querying data, not for performing transactions (updates) against the data.
The newer Amazon Redshift RA3 nodes also have the ability to offload less-used data to Amazon S3 and use caching to keep queries fast. The benefit is that this separates storage from compute.
Quick summary:
If you need a database for an application, use Amazon RDS
If you are building a data warehouse, use Amazon Redshift
If you have a lot of historical data that is rarely queried, store it in Amazon S3 and query it via Amazon Athena or Amazon Redshift Spectrum
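To illustrate the last point, here is a minimal sketch of querying historical data in S3 through Amazon Athena with boto3. The database, table, and bucket names are placeholders, and the external table over partitioned Parquet files is assumed to exist already.

```python
# Minimal sketch: run an Athena query against data stored in S3.
# Database, table, and result bucket names are placeholders.
import time
import boto3

athena = boto3.client("athena", region_name="us-east-1")

query = """
    SELECT event_date, COUNT(*) AS events
    FROM historical_events  -- external table over partitioned Parquet files in S3
    WHERE event_date >= DATE '2023-01-01'
    GROUP BY event_date
"""

execution = athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={"Database": "analytics_archive"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results-bucket/"},
)
query_id = execution["QueryExecutionId"]

# Poll until the query finishes, then print the first few result rows.
while True:
    status = athena.get_query_execution(QueryExecutionId=query_id)
    state = status["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    results = athena.get_query_results(QueryExecutionId=query_id)
    for row in results["ResultSet"]["Rows"][:5]:
        print(row)
```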
Looking at your question, you may benefit from professional help with your architecture.
However, to get you started, Redshift:
has its own data storage, with no link to S3.
can also query data held in S3 via Amazon Redshift Spectrum (similar to AWS Athena).
is not a good alternative as a back-end database to replace a traditional RDBMS, as transactions are very slow.
is a great data warehouse tool, so just use it for that!
In the AWS developer docs for SageMaker, they recommend using Pipe mode to stream large datasets directly from S3 to the model training containers (since it's faster, uses less disk storage, reduces training time, etc.).
However, they don't say whether this streaming data transfer is charged for (they only include data transfer pricing for the model building and deployment stages, not training).
So I wanted to ask if anyone knows whether this data transfer in Pipe mode is charged for. If it is, I don't see how it could be recommended for large datasets: streaming a few epochs for each model iteration could get prohibitively expensive (my dataset, for example, is 6.3 TB on S3).
Thank you!
You are charged for the S3 GET calls that you make, similar to what you would be charged if you used File mode for training. However, these charges are usually marginal compared to the alternatives.
When you use File mode, you need to pay for the local EBS volumes on the instances, and for the extra time that your instances are up and doing nothing but copying the data from S3. If you are running multiple epochs, you will not benefit much from Pipe mode; however, when you have this much data (6.3 TB), you don't really need to run multiple epochs.
The best use of Pipe mode is when you can make a single pass over the data. In the era of big data, this is a better model of operation, as you can't afford to retrain your models often. In SageMaker, you can point to your "old" model in the "model" channel and your "new" data in the "train" channel, and benefit from Pipe mode to the maximum.
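For illustration, here is a minimal sketch of what such a training job could look like with the SageMaker Python SDK. The image URI, IAM role, and S3 paths are placeholders, and the "model" channel setup assumes the incremental-training pattern used by the built-in algorithms.

```python
# Minimal sketch: a SageMaker training job that streams data from S3 in Pipe mode.
# The image URI, IAM role, and S3 paths are placeholders.
import sagemaker
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput

session = sagemaker.Session()

estimator = Estimator(
    image_uri="123456789012.dkr.ecr.us-east-1.amazonaws.com/my-training-image:latest",
    role="arn:aws:iam::123456789012:role/MySageMakerRole",
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    input_mode="Pipe",  # stream from S3 instead of copying everything to local disk
    sagemaker_session=session,
)

channels = {
    # New data, streamed to the container during training.
    "train": TrainingInput(s3_data="s3://my-bucket/new-training-data/"),
    # Previously trained model artifacts, used as the starting point
    # (the model channel is typically read in File mode).
    "model": TrainingInput(
        s3_data="s3://my-bucket/old-model/model.tar.gz",
        content_type="application/x-sagemaker-model",
        input_mode="File",
    ),
}

estimator.fit(inputs=channels)
```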
I just realized that on S3's official pricing page, it says the following under the Data transfer section:
Transfers between S3 buckets or from Amazon S3 to any service(s) within the same AWS Region are free.
And since my S3 bucket and my SageMaker instances will be in the same AWS region, there should be no data transfer charge.
We have a completely serverless application, with only lambdas and DynamoDB.
The lambdas are running in two regions, and the originals are stored in Cloud9.
DynamoDB is configured with all tables global (bidirectional multi-master replication across the two regions), and the schema definitions are stored in Cloud9.
The only data we need to worry about losing is in DynamoDB, which, even if it crashed in both regions, is presumably diligently backed up by AWS.
Given all of that, what is the point of classic backups? If both regions were completely obliterated, we'd likely be out of business anyway, and anything short of that would be recoverable from AWS.
Not all AWS regions support backup and restore functionality. You'll need to roll your own solution for backups in unsupported regions.
If all the regions your application runs in support the backup functionality, you probably don't need to do it yourself. That is the point of going serverless: you let the platform handle simple DevOps tasks.
Redundancy through regional or, optionally, cross-regional replication for DynamoDB mainly provides durability, availability, and fault tolerance for your data storage. However, alongside these built-in capabilities, there can still be a need for backups.
For instance, if data is corrupted by an external threat (like an attack) or by an application malfunction, you might still want to restore it. This is one place where backups are useful: restoring the data to a recent point in time.
There can also be compliance-related requirements that mandate taking backups of your database system.
Another use case is when you need to create new DynamoDB tables for your build pipeline and quality assurance: it is more practical to reuse an existing snapshot of the data from a backup rather than taking a copy from the live database (since that can consume the provisioned throughput and affect the application's behavior).
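As a minimal sketch of that last pattern with boto3 (the table and backup names are placeholders):

```python
# Minimal sketch: take an on-demand DynamoDB backup and restore it into a
# separate table (e.g. for a QA environment). Table and backup names are placeholders.
import boto3

dynamodb = boto3.client("dynamodb", region_name="us-east-1")

# Create an on-demand backup of the production table.
backup = dynamodb.create_backup(
    TableName="orders",
    BackupName="orders-pre-release",
)
backup_arn = backup["BackupDetails"]["BackupArn"]

# Restore the backup into a new table for testing, leaving the live table
# and its provisioned throughput untouched.
dynamodb.restore_table_from_backup(
    TargetTableName="orders-qa",
    BackupArn=backup_arn,
)
```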
I am considering using AWS DynamoDB for an application we are building. I understand that setting up a backup job that exports data from DynamoDB to S3 involves a Data Pipeline with EMR. But my question is: do I need to worry about having a backup job set up on day 1? What are the chances that data loss would happen?
There are multiple use cases for copying DynamoDB table data elsewhere:
(1) Create a backup in S3 on a daily basis, in order to restore in case of accidental deletion of data or, worse yet, a dropped table (code bugs?).
(2) Create a backup in S3 to become the starting point of your analytics workflows. Once this data is backed up in S3, you can combine it with, say, your RDBMS data (RDS or on-premises) or other S3 data from log files. Data integration workflows could involve EMR jobs whose output is ultimately loaded into Redshift (ETL) for BI queries, or you could load the data directly into Redshift in a more ELT style, so the transforms happen within Redshift.
(3) Copy (the whole set or a subset of) data from one table to another (either within the same region or in another region), so the old table can be garbage collected for controlled growth and cost containment. This table-to-table copy could also be used as a readily consumable backup table in case of, say, region-specific availability issues. Or, use this mechanism to copy data from one region to another to serve it from an endpoint closer to the DynamoDB client application that is using it.
(4) Periodically restore data from S3, possibly as a way to load post-analytics data back into DynamoDB for serving it in online applications with high-concurrency, low-latency requirements.
AWS Data Pipeline helps schedule all these scenarios with flexible data transfer solutions (using EMR underneath).
One caveat with these solutions is that they are not point-in-time backups, so any changes made to the underlying table while the backup is running might leave it inconsistent.
This is really subjective. IMO, you shouldn't worry about backups 'now'.
You can also use simpler solutions than Data Pipeline. Perhaps that would be a good place to start.
After running DynamoDB as our main production database for more than a year, I can say it has been a great experience: no data loss and no downtime. The only things we care about are the occasionally misbehaving SDK and tweaking provisioned throughput.
Data Pipeline is only available in a limited set of regions:
https://docs.aws.amazon.com/general/latest/gr/rande.html#datapipeline_region
I would recommend setting up a Data Pipeline to back up to an S3 bucket on a daily basis, if you want to be really safe.
DynamoDB itself might be very reliable, but nobody can protect you from your own accidental deletions (what if you or a colleague deleted a table from the console by mistake?). So I would suggest setting up a daily backup; it doesn't cost much in any case.
You can tell the pipeline to consume only, say, 25% of the table's read capacity while the backup is running, so that your real users don't see any delay. Every backup is "full" (not incremental), so at some periodic interval you can delete old backups if you are concerned about storage.
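As a side note, and not the Data Pipeline mechanism described above, here is a minimal sketch of DynamoDB's managed export to S3 with boto3. It produces a full export without consuming the table's read capacity; it assumes point-in-time recovery is enabled on the table, and the table ARN, bucket, and prefix are placeholders.

```python
# Minimal sketch: a full DynamoDB export to S3 that does not consume table
# read capacity. Requires point-in-time recovery to be enabled on the table.
# The table ARN, bucket, and prefix are placeholders.
import boto3

dynamodb = boto3.client("dynamodb", region_name="us-east-1")

export = dynamodb.export_table_to_point_in_time(
    TableArn="arn:aws:dynamodb:us-east-1:123456789012:table/orders",
    S3Bucket="my-dynamodb-backups",
    S3Prefix="orders/daily/",
    ExportFormat="DYNAMODB_JSON",
)

description = export["ExportDescription"]
print(description["ExportArn"], description["ExportStatus"])
```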