AWS data transfer estimates in a distributed setup - amazon-web-services

I would like to understand how to estimate data transfer costs.
Let me explain the setup.
I have a REST endpoint for accessing data from our caches, serving multiple users in multiple regions on the cloud.
The setup consists of Cassandra and Hazelcast caches for data storage. The added complexity is that the data in Cassandra is sourced from components on on-premise servers.
Cassandra setup:
Cassandra nodes are spread across the AZs in two regions (UK and HK). Streaming services on on-premise servers in the US and ME access the data, but only when it is not present in our Hazelcast caches. The UK Cassandra instance replicates data to the HK instance for data consistency.
HZ setup:
HZ caches are set up in 5 regions as local caches. These caches sync with each other using a bidirectional sync. When data needed to serve a REST call is not found in the cache, a service initiates a gRPC call to pull the missing data.
My method of estimating data transfer is:
For the API: payload * number of requests in a day
How do I estimate the data transfer for Cassandra replication (including gossip) and Hazelcast replication across regions?

For the Hazelcast part, if you enable diagnostics logging on the Hazelcast member, you can read the following metrics: bytesReceived and bytesSent.
Read more at: https://groups.google.com/g/hazelcast/c/IDIynkEG1YE
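To turn those counters into a daily figure, here is a rough back-of-envelope sketch in Python. It uses the payload * requests-per-day method from the question for the API, and extrapolates replication traffic from two readings of a cumulative byte counter such as bytesSent. The function names and sample numbers are hypothetical, and it assumes the sampled interval is representative of a whole day.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def api_bytes_per_day(payload_bytes: int, requests_per_day: int) -> int:
    """Daily API transfer using the payload * request-count method."""
    return payload_bytes * requests_per_day


def counter_bytes_per_day(first_reading: int, second_reading: int,
                          interval_seconds: float) -> float:
    """Extrapolate a cumulative byte counter (e.g. bytesSent) to one day.

    Assumes the counter only increases and that the sampled interval is
    representative of the whole day.
    """
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    delta = second_reading - first_reading
    if delta < 0:
        raise ValueError("counter went backwards (member restart?)")
    return delta / interval_seconds * SECONDS_PER_DAY


def to_gib(num_bytes: float) -> float:
    """Convert bytes to GiB for easier comparison with pricing tiers."""
    return num_bytes / 1024 ** 3


if __name__ == "__main__":
    # Hypothetical numbers, for illustration only.
    api = api_bytes_per_day(payload_bytes=20 * 1024, requests_per_day=1_000_000)
    repl = counter_bytes_per_day(5_000_000_000, 5_600_000_000,
                                 interval_seconds=3600)
    print(f"API: {to_gib(api):.2f} GiB/day")
    print(f"Replication (one member): {to_gib(repl):.2f} GiB/day")
```

A member-level counter most likely includes all traffic from that member, not only cross-region traffic, so treat the result as an upper bound for inter-region transfer. The same counter-delta idea should work for Cassandra if you can read per-node network byte counters.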

Related

Fastest way to Import Data from AWS Redshift to BI Tool

I have a table with 15 million rows in AWS Redshift, running on ra3.xlplus with 2 nodes. I am retrieving the data on-premise at the office and trying to load it into memory in a BI tool. Importing the data over a JDBC connection takes a long time (12 minutes). I also tried an ODBC connection and got the same result. I tried spinning up an EC2 instance with a 25 gigabit connection on AWS, but got the same results.
For comparison, loading that data in CSV format takes about 90 seconds.
Are there any ways to speed up the data transfer?
There are ways to improve this, but the true limiter needs to be identified first. The likely bottleneck is the network bandwidth between AWS and your on-prem system. As you are bringing a large amount of data down from the cloud, you will want an efficient process for this transport.
JDBC and ODBC are not network efficient, as you are seeing. The first thing that will help in moving the data is compression. The second is parallel transfer, since there is a fair amount of handshaking in the TCP protocol and there is more usable bandwidth than one connection can consume. The way I have done this in the past is to UNLOAD the data compressed to S3, then copy the files from S3 to the local machine in parallel, piping them through decompression and saving them. Lastly, these files are loaded into your BI tool.
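As a minimal sketch of that pipeline in Python: the first function builds an UNLOAD statement that writes gzip-compressed CSV parts to S3, and the others decompress already-downloaded parts in parallel. The S3 download step is not shown (the AWS CLI's `aws s3 cp --recursive` or `aws s3 sync` would be one option). The function names, bucket prefix, and role ARN are placeholders, not values from the question.

```python
import gzip
import shutil
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import List


def build_unload_sql(query: str, s3_prefix: str, iam_role_arn: str) -> str:
    """Build a Redshift UNLOAD statement producing gzip-compressed CSV parts.

    UNLOAD takes the query as a quoted string literal, so single quotes
    inside the query are doubled.
    """
    escaped = query.replace("'", "''")
    return (
        f"UNLOAD ('{escaped}')\n"
        f"TO '{s3_prefix}'\n"
        f"IAM_ROLE '{iam_role_arn}'\n"
        "FORMAT AS CSV\n"
        "GZIP\n"
        "PARALLEL ON;"
    )


def decompress_one(gz_path: Path) -> Path:
    """Decompress one .gz file next to the original and return the new path."""
    out_path = gz_path.with_suffix("")  # drops the trailing ".gz"
    with gzip.open(gz_path, "rb") as src, open(out_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return out_path


def decompress_all(directory: Path, workers: int = 8) -> List[Path]:
    """Decompress every *.gz file in a directory using a thread pool."""
    gz_files = sorted(directory.glob("*.gz"))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(decompress_one, gz_files))


if __name__ == "__main__":
    # Placeholder values, for illustration only.
    print(build_unload_sql(
        "select * from my_table",
        "s3://my-bucket/exports/my_table_",
        "arn:aws:iam::123456789012:role/my-unload-role",
    ))
```

The worker count of 8 is an arbitrary starting point; the right value depends on your link and disk.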
Clearly, setting this up takes some time, so you want to be sure that the process will be used enough to justify the effort. Another way to go is to bring your BI tool closer to Redshift by running it on an EC2 instance. The shorter network distance and higher bandwidth should bring down the transfer time significantly. A downside of locating your database in the cloud is that it is in the cloud and not on-prem.

What is the replication logic for Cluster mode enabled in Elasticache Redis?

I have a Redis cluster with cluster mode enabled and 3 nodes (1 master and 2 replicas). I have noticed that the CPU percentage of one of the replicas is similar to the master node while that of the other replica remains quite low. So, what is the replication logic at play here? Is it like only one replica is used to replicate data proactively and the other one is used only after the first one fails?
PFA screenshot of the CPU percentage usage over a week
PS: The application connects to the cluster using Configuration Endpoint
As it is stated here,
Redis (cluster mode disabled) clusters, use the Primary Endpoint for all write operations. Use the Reader Endpoint to evenly split incoming connections to the endpoint between all read replicas. Use the individual Node Endpoints for read operations (In the API/CLI these are referred to as Read Endpoints).
If you use the reader endpoint it will split the load evenly.
As it is stated here
Each read replica maintains a copy of the data from the cluster's primary node. Asynchronous replication mechanisms are used to keep the read replicas synchronized with the primary.
My guess is that, instead of using the reader endpoint, your application reads directly from a single replica. Maybe the endpoint of the replica with the higher CPU usage is hardcoded in the application.

Transfer/Replicate Data periodically from AWS Documentdb to Google Cloud Big Query

We are building a customer-facing app. For this app, data is captured by IoT devices owned by a 3rd party and transferred to us from their server via API calls. We store this data in our AWS DocumentDB cluster. The user app is connected to this cluster and has real-time data feed requirements. Note: the data is time series data.
For long-term data storage and for creating analytic dashboards to be shared with stakeholders, our data governance folks are asking us to replicate/copy the data daily from the AWS DocumentDB cluster to their Google Cloud Platform -> BigQuery. We can then run queries directly on BigQuery to perform analysis and send data to maybe explorer or Tableau to create dashboards.
I couldn't find any straightforward solutions for this. Any ideas, comments or suggestions are welcome. How do I achieve or plan the above replication? And how do I make sure the data is copied efficiently in terms of memory and pricing? Also, we don't want to disturb the performance of AWS DocumentDB, since it supports our user-facing app.
This would need some custom implementation. You can use Change Streams and process the data changes at intervals to send them to BigQuery, so that there is a data replication mechanism in place for you to run analytics. One of the documented use cases for Change Streams is analytics with Redshift, so BigQuery should serve a similar purpose.
Using Change Streams with Amazon DocumentDB:
https://docs.aws.amazon.com/documentdb/latest/developerguide/change_streams.html
This document also contains sample Python code for consuming change stream events.
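For the "process in intervals" part, a minimal sketch of turning a batch of change events into newline-delimited JSON, which is a format BigQuery load jobs accept, might look like this. It assumes events carry the usual `operationType`, `documentKey` and `fullDocument` fields; the row layout and function names are my own illustration, and reading the events from the cluster and uploading the result are left out.

```python
import json
from typing import Any, Dict, Iterable


def event_to_row(event: Dict[str, Any]) -> Dict[str, Any]:
    """Flatten one change stream event into a row for an analytics table."""
    doc = event.get("fullDocument")  # absent for delete events
    return {
        "operation": event["operationType"],
        "document_id": str(event["documentKey"]["_id"]),
        # Keep the document as a JSON string; default=str covers values such
        # as ObjectId or datetime that json cannot serialize directly.
        "document_json": json.dumps(doc, default=str) if doc is not None else None,
    }


def events_to_ndjson(events: Iterable[Dict[str, Any]]) -> str:
    """Serialize a batch of events as newline-delimited JSON, one row per line."""
    return "\n".join(json.dumps(event_to_row(e)) for e in events)


if __name__ == "__main__":
    # Hypothetical events, for illustration only.
    batch = [
        {"operationType": "insert",
         "documentKey": {"_id": "a1"},
         "fullDocument": {"_id": "a1", "temp": 21.5}},
        {"operationType": "delete",
         "documentKey": {"_id": "a0"}},
    ]
    print(events_to_ndjson(batch))
```

Batching the rows and loading them on a schedule, rather than streaming each event, keeps the load on the DocumentDB cluster limited to reading the change stream.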

Lambda architecture on AWS: choose database for batch layer

We're building Lambda architecture on AWS stack. A lack of devops knowledge forces us to prefer AWS managed solution over custom deployments.
Our workflow:
[Batch layer]
Kinesis Firehose -> S3 -Glue-> EMR (Spark) -Glue-> S3 views -----+
                                                                 |===> Serving layer (ECS) => Users
Kinesis -> EMR (Spark Streaming) -> DynamoDB/ElastiCache views --+
[Speed layer]
We are already using 3 datastores: ElastiCache, DynamoDB and S3 (queried with Athena). The batch layer produces from 500,000 up to 6,000,000 rows each hour. Only the last hour's results need to be queried by the serving layer, with low-latency random reads.
None of our databases fits the batch-insert & random-read requirements. DynamoDB does not fit batch inserts: it is too expensive because of the throughput required. Athena is MPP and moreover has a limit of 20 concurrent queries. ElastiCache is used by the streaming layer, and we are not sure it is a good idea to perform batch inserts there.
Should we introduce a fourth storage solution or stay with the existing ones?
Considered options:
1. Persist batch output to DynamoDB and ElastiCache (the part of the data that is rarely updated and can be compressed/aggregated goes to DynamoDB; frequently updated data, ~8GB/day, goes to ElastiCache).
2. Introduce another database (HBase on EMR over S3 / Amazon Redshift?) as a solution.
3. Use S3 Select over Parquet to overcome the Athena concurrent query limit. That will also reduce query latency. But does S3 Select have any concurrent query limits? I can't find any related info.
The first option is bad because of batch inserts into the ElastiCache used by streaming. Also, does keeping batch and speed layer views in the same data stores follow the Lambda architecture?
The second option is bad because it adds a fourth database, isn't it?
In this case you might want to use something like HBase or Druid; not only can they handle batch inserts and very low latency random reads, they could even replace the DynamoDB/ElastiCache component from your solution, since you can write directly to them from the incoming stream (to a different table).
Druid is probably superior for this, but as per your requirements, you'll want HBase, as it is available on EMR with the Amazon Hadoop distribution, whereas Druid doesn't come in a managed offering.
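To put the DynamoDB throughput concern from the question in numbers, here is a rough sketch. It relies on one write capacity unit covering one standard write per second for an item of up to 1 KB. The item size and the load windows are assumptions for illustration, not figures from the question.

```python
import math


def required_write_units(rows: int, load_window_seconds: float,
                         item_size_kb: float = 1.0) -> int:
    """Write capacity units needed to insert `rows` items within the window.

    One write capacity unit covers one standard write per second for an
    item of up to 1 KB; larger items consume one unit per started KB.
    """
    if load_window_seconds <= 0:
        raise ValueError("load_window_seconds must be positive")
    units_per_item = max(1, math.ceil(item_size_kb))
    return math.ceil(rows * units_per_item / load_window_seconds)


if __name__ == "__main__":
    # Peak hourly batch from the question; item size <= 1 KB is assumed.
    print(required_write_units(6_000_000, 3600))  # spread over the full hour
    print(required_write_units(6_000_000, 300))   # loaded in 5 minutes
```

Spreading the peak batch over the whole hour needs about 1,667 write units, while loading it in 5 minutes needs 20,000, which is why bursty batch inserts get expensive.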

Data upload required for cloud computing?

If I want to utilize Amazon Web Services to provide the hardware (cores and memory) to process a large amount of data, do I need to upload that data to AWS? Or can I keep the data on the system and rent the hardware?
Yes, in order for an AWS-managed system to process a large amount of data, you will need to upload the data to an AWS region for processing at some point. AWS does not rent out servers to other physical locations, as far as I'm aware (EDIT: actually, AWS does have an offering for on-premises data processing as of Nov 30 2016, see Snowball Edge).
AWS offers a variety of services for getting large amounts of data into its data centers for processing (ranging from basic HTTP uploads to physically mailing disk drives for direct data import), and the best service to use will depend entirely on your specific use-case, needs and budget. See the Cloud Data Migration page for an overview of the various services and help on selecting the most appropriate one.
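One quick way to decide between a network upload and shipping disks is to estimate how long the upload would take. A minimal sketch follows; the 80% link utilization default and the example figures are assumptions, not AWS numbers.

```python
SECONDS_PER_DAY = 86_400


def transfer_days(data_bytes: float, link_bits_per_second: float,
                  utilization: float = 0.8) -> float:
    """Estimate upload time in days for a data set over a network link.

    `utilization` is the assumed fraction of the nominal link speed that
    the transfer actually achieves.
    """
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    if link_bits_per_second <= 0:
        raise ValueError("link_bits_per_second must be positive")
    seconds = data_bytes * 8 / (link_bits_per_second * utilization)
    return seconds / SECONDS_PER_DAY


if __name__ == "__main__":
    # Hypothetical example: 100 TB over a 1 Gbit/s link.
    days = transfer_days(100e12, 1e9)
    print(f"{days:.1f} days")
```

If the estimate comes out at weeks rather than hours, the physical import options are probably worth a look.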