Data ingestion configuration for Spark in AWS

I am working on a batch pipeline in which we receive a 1 GB CSV input file at a time on EMR. What is the ideal Master and Core configuration for 1 GB of data, how do you arrive at that conclusion, and is there a standard procedure? I am currently using the configuration below and would like to downgrade to 1 Core instance. My concern is: if more data comes in, how do I scale the configuration back up?
1 Master instance: 4 vCore, 16 GiB memory, 64 GB EBS
2 Core instances: 4 vCore, 16 GiB memory, 64 GB EBS
The ingestion code does a simple transformation and converts the data to Parquet.
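For reference, a minimal PySpark sketch of the kind of job described (read a CSV, apply a simple transformation, write Parquet). The paths, column name and coalesce factor are placeholders, not the actual ingestion code:

from pyspark.sql import SparkSession, functions as F

# Hypothetical locations -- adjust to the real bucket layout.
INPUT_PATH = "s3://my-bucket/incoming/batch.csv"
OUTPUT_PATH = "s3://my-bucket/curated/"

spark = SparkSession.builder.appName("csv-to-parquet-ingest").getOrCreate()

# Read the 1 GB CSV; an explicit schema is faster than inferSchema on larger inputs.
df = (spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv(INPUT_PATH))

# Example "simple transformation": trim a column and stamp the load time.
df = (df
      .withColumn("some_column", F.trim(F.col("some_column")))
      .withColumn("load_ts", F.current_timestamp()))

# A 1 GB input only needs a handful of output files; with bigger inputs,
# raise (or drop) the coalesce and add Core nodes rather than resizing the Master.
df.coalesce(8).write.mode("append").parquet(OUTPUT_PATH)

spark.stop()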

Related

How does AWS Athena manage to load 10 GB/s from S3? I've managed 230 MB/s from a c6gn.16xlarge

When running this query on AWS Athena against a 63 GB Trades.csv file:
SELECT * FROM Trades WHERE TraderID = 1234567
it takes 6.81 seconds, scanning 63.82 GB in the process (almost exactly the size of Trades.csv, so it is doing a full table scan).
What I'm shocked at is the unbelievable speed of data drawn from S3. It seems like AWS Athena's strategy is to use an unbelievably massive box with a ton of RAM and incredible S3 loading ability to get around the lack of indexing (although on a standard SQL DB you would have an index on TraderID and load millions of times less data).
But in my experiments I only managed to get these data reads from S3 (which are still impressive):
Instance type    MB/s   Network card (Gigabit)
t2.2xlarge       113    low
t3.2xlarge       140    up to 5
c5n.2xlarge      160    up to 25
c6gn.16xlarge    230    100
(that's megabytes rather than megabits)
I'm using an internal VPC endpoint for S3 in eu-west-1. Anyone got any tricks/tips for getting S3 to load fast? Has anyone got over 1 GB/s read speeds from S3? Is this even possible?
It seems like AWS Athena's strategy is to use an unbelievably massive box with a ton of RAM
No, it's more like many small boxes, not a single massive one. Athena runs your query in parallel, on multiple servers at once. The exact details are not published anywhere as far as I am aware, but the documentation makes it very clear that queries run in parallel.
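For what it's worth, you can approximate that parallelism from a single instance by issuing many ranged GETs at once instead of one sequential read. A rough boto3 sketch (bucket, key, chunk size and thread count are assumptions; throughput is still ultimately capped by the instance's network bandwidth):

import concurrent.futures
import boto3

# Hypothetical object -- replace with your own bucket/key.
BUCKET, KEY = "my-bucket", "Trades.csv"
CHUNK = 64 * 1024 * 1024  # 64 MB per ranged GET

s3 = boto3.client("s3")
size = s3.head_object(Bucket=BUCKET, Key=KEY)["ContentLength"]
ranges = [(start, min(start + CHUNK, size) - 1) for start in range(0, size, CHUNK)]

def fetch(byte_range):
    start, end = byte_range
    resp = s3.get_object(Bucket=BUCKET, Key=KEY, Range=f"bytes={start}-{end}")
    return len(resp["Body"].read())

# A single GET rarely saturates the NIC; many concurrent ranged GETs usually do better.
with concurrent.futures.ThreadPoolExecutor(max_workers=32) as pool:
    total = sum(pool.map(fetch, ranges))

print(f"read {total / 1e6:.0f} MB")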

Subscribing a Google Pub/Sub topic to a Cloud Storage Avro file gives me a "quota exceeded" error - in a beginner's tutorial?

I'm going through Google's Firestore to BigQuery pipeline tutorial and I've come to step 10, where I should set up an export from my topic to an Avro file saved on Cloud Storage.
However, when I try running the job, after doing exactly what's described in the tutorial, I get an error telling me that my project has insufficient quota to execute the workflow. In the quota summary of the message, I notice that it says 1230/818 disk GB. Does that mean the job requires 1,230 GB of disk space? There are currently only 100 documents in Firestore, so this seems wrong to me.
All my Cloud Storage buckets are empty.
But when I look at the resources used by the first export job I set up (Pub/Sub topic to BigQuery, on page 9), I'm even more confused. It seems like it's using CRAZY amounts of resources:
Current vCPUs:       4
Total vCPU time:     2.511 vCPU hr
Current memory:      15 GB
Total memory time:   9.417 GB hr
Current PD:          1.2 TB
Total PD time:       772.181 GB hr
Current SSD PD:      0 B
Total SSD PD time:   0 GB hr
Can this be real, or have I done something completely wrong for all these resources to be used? I mean, there's no activity at all; it's just a subscription, right?
Under the hood, that step is calling a Cloud Dataflow template (this one, to be exact) to read from Pub/Sub and write to GCS. In turn, Cloud Dataflow uses GCE instances (VMs) for its worker pool. Cloud Dataflow is requesting too many resources (GCE instances, which need disk, RAM, vCPUs, etc.) and is hitting your project's limit/quota.
You can override the default number of workers (try 1 to start with) and also set the smallest VM type (n1-standard-1) when configuring the job under optional parameters. This should save you some money too. Bonus!
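If you would rather launch the template programmatically than through the console, the same overrides can be passed through the Dataflow templates launch API. A sketch using the Google API Python client; the project, region, bucket and topic names are placeholders, and the template's parameter names should be double-checked against its documentation:

from googleapiclient.discovery import build

# All identifiers below are placeholders for your own project/bucket/topic.
PROJECT = "my-project"
REGION = "europe-west1"
TEMPLATE = "gs://dataflow-templates/latest/Cloud_PubSub_to_Avro"

dataflow = build("dataflow", "v1b3")
response = dataflow.projects().locations().templates().launch(
    projectId=PROJECT,
    location=REGION,
    gcsPath=TEMPLATE,
    body={
        "jobName": "pubsub-to-avro-small",
        "parameters": {
            # Parameter names as documented for the Pub/Sub-to-Avro template.
            "inputTopic": "projects/my-project/topics/my-topic",
            "outputDirectory": "gs://my-bucket/avro/",
            "avroTempDirectory": "gs://my-bucket/avro-tmp/",
        },
        "environment": {
            # Keep the worker pool tiny so the job stays inside the quota.
            "numWorkers": 1,
            "maxWorkers": 1,
            "machineType": "n1-standard-1",
            "tempLocation": "gs://my-bucket/tmp/",
        },
    },
).execute()
print(response["job"]["id"])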

Best AWS Instance for Partitioning Big Data

The problem I am having right now is finding the best AWS instance for partitioning large data (scaling to greater than 1 TB).
The data I am receiving is structured, and I am hoping to partition it by either /year/month/day/ or /year/month/day/hour of the created-at time (a sketch of such a write appears below). So far I have tried using EMR with the following configurations to partition 260 GB of Parquet data into /year/month/day (spark.dynamicAllocation.enabled == true):
3 x r5.2xlarge (8 vCPU, 64 GB) -> over 1 hour just to write to HDFS
2 x c5.4xlarge (16 vCPU, 32 GB) -> well over 1 hour just to write to HDFS (28% slower than the 3 x r5.2xlarge)
2 x r5d.4xlarge (16 vCPU, 128 GB) -> 54 minutes just to write to HDFS (note: HDFS is on NVMe SSD)
(Graphs of the cluster metrics for each of the three configurations accompanied the original post; in the 2 x c5.4xlarge graph, the two peaks are due to running the job twice.)
Is it possible for me to reach ~10 minutes? If so, would that mean adding more nodes or a different instance type?
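For reference, a minimal PySpark sketch of the partitioned write being described; the input/output paths and the created_at column name are assumptions. Repartitioning on the partition keys before the write is one common way to speed such jobs up, since each task then writes to only a few output partitions instead of spraying small files everywhere:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("partition-by-date").getOrCreate()

# Hypothetical input; assumes a timestamp column named created_at.
df = spark.read.parquet("s3://my-bucket/raw-parquet/")

df = (df
      .withColumn("year",  F.year("created_at"))
      .withColumn("month", F.month("created_at"))
      .withColumn("day",   F.dayofmonth("created_at")))

# Shuffle by the partition keys first so each task writes to only a few
# /year/month/day directories rather than all of them.
(df.repartition("year", "month", "day")
   .write
   .partitionBy("year", "month", "day")
   .mode("overwrite")
   .parquet("hdfs:///partitioned/"))

spark.stop()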

Amazon Redshift query aborts automatically after 1 hour

I have around 500 GB of compressed data in Amazon S3. I want to load this data into Amazon Redshift. For that, I have created a table in AWS Athena and I am trying to load the data into an internal table in Amazon Redshift.
Loading this much data into Amazon Redshift takes more than an hour. The problem is that when I fire a query to load the data, it gets aborted after 1 hour. I tried 2-3 times, and each time it was aborted after 1 hour. I am using the Aginity tool to fire the query; the Aginity tool also shows the query as still running while the loader spins.
More Details:
The Redshift cluster has 12 nodes with 2 TB of space per node, of which I have used 1.7 TB.
The S3 files are not all the same size: one of them is 250 GB, and some are only a few MB.
I am using the command
create table table_name as select * from athena_schema.table_name
and it stops exactly after 1 hour.
Note: I have set the current query timeout in Aginity to 90,000 seconds.
I know this is an old thread, but for anyone coming here because of the same issue: I've realised that, at least in my case, the problem was the Aginity client. It's not related to Redshift or its Workload Manager, but only to that third-party client. In short, use a different client such as SQL Workbench and run the COPY command from there.
Hope this helps!
Carlos C.
More information about my environment:
Redshift:
Cluster type: Multi Node
Node type: ds2.xlarge
Nodes: 4
Cluster version: 1.0.4852
Client Environment:
Aginity Workbench for Redshift
Version 4.9.1.2686 (build 05/11/17)
Microsoft Windows NT 6.2.9200.0 (64-bit)
Network:
Connected to OpenVPN, via SSH Port tunneling.
The connection is not being dropped. This issue is only affecting the COPY command. The connection remains active.
Command:
copy tbl_XXXXXXX
from 's3://***************'
iam_role 'arn:aws:iam::***************:role/***************';
S3 Structure:
120 files of 6.2 GB each. 20 files of 874MB.
Output:
ERROR: 57014: Query (22381) cancelled on user's request
Statistics:
Start: ***************
End: ***************
Duration: 3,600.2420863
I'm not sure if the following answer will solve your exact problem of the timeout at exactly 1 hour.
But based on my experience, loading data into Redshift via the COPY command is the best and fastest way, so I feel the timeout issue shouldn't happen at all in your case.
The COPY command in Redshift can load data from S3 or via SSH.
e.g.
Simple copy
copy sales
from 'emr://j-SAMPLE2B500FC/myoutput/part-*'
iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
delimiter '\t' lzop;
e.g. Using a manifest
copy customer
from 's3://mybucket/cust.manifest'
iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
manifest;
PS: If you use a manifest and divide your data into multiple files, it will be even faster, as Redshift loads the data in parallel.
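For illustration, a manifest is just a small JSON file listing the S3 objects to load ({"entries": [{"url": ..., "mandatory": true}, ...]}). A boto3 sketch that builds and uploads one; the bucket and prefix are made up, and the real entries would be your 120 + 20 files:

import json
import boto3

# Hypothetical bucket/prefix -- replace with the real S3 layout.
BUCKET = "my-bucket"
PREFIX = "exports/"

s3 = boto3.client("s3")

# One manifest entry per object under the prefix.
entries = []
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
    for obj in page.get("Contents", []):
        entries.append({"url": f"s3://{BUCKET}/{obj['Key']}", "mandatory": True})

# Upload it; the COPY ... manifest; command then points at this object.
s3.put_object(Bucket=BUCKET,
              Key="manifests/cust.manifest",
              Body=json.dumps({"entries": entries}).encode("utf-8"))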

HDFS replication and data distribution

I have a Hadoop cluster with 4 DataNodes. I am confused about two issues: data replication and data distribution.
Suppose that I have a 2 GB file, my replication factor is 2, and the block size is 128 MB. When I put this file into HDFS, I see that 2 copies of each 128 MB block are created, and they are placed on datanode3 and datanode4. But datanode1 and datanode2 are not used. The data is replicated because of the replication factor, but I expected to see some data blocks on datanode1 and datanode2. Is something wrong?
Let's say I have 20 DataNodes and the replication factor is 2. If I put a 2 GB file on HDFS, I again expect to see two copies of each 128 MB block, but I also expect those blocks to be distributed across the 20 DataNodes.
Ideally, the 2GB file should get distributed among all the available DataNodes.
File Size: 2GB = 2048MB
Block Size: 128MB
Replication Factor: 2
With the above configuration you should have 2048 / 128 = 16 blocks, each replicated twice, i.e. 32 block replicas. These should be distributed almost equally across all DataNodes; with 4 DataNodes, each should hold around 8 of them.
The only reason I can think of for not seeing this is that some DataNodes are down. Check that all the DataNodes are up: sudo -u hdfs hdfs dfsadmin -report
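The arithmetic above, as a quick sketch:

import math

file_size_mb = 2048   # 2 GB file
block_size_mb = 128
replication = 2
datanodes = 4

blocks = math.ceil(file_size_mb / block_size_mb)   # 16 blocks
replicas = blocks * replication                    # 32 block replicas in total
per_node = replicas / datanodes                    # ~8 replicas per DataNode
print(blocks, replicas, per_node)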