I am using PySpark to read data from EMR. But if the EMR cluster is fully occupied (I can see on the cluster manager that all the memory is taken by an ETL job), can I still run a script on my physical server that brings data from the EMR cluster to my physical server?
What is the suggested best practice?
Will it take the same amount of time to read the data from EMR to the physical server? How is the read request handled when the EMR cluster is fully occupied?
What kind of processes are executed on the EMR side (S3 bucket) while the data is accessed/read from the physical server through an S3 utility?
Can I pull data to the physical server when the EMR cluster is fully occupied? If not, why?
Thanks and Regards
The best practice is to expand your cluster and do things consistently. I strongly suggest that you do not develop out-of-band/one-off processes that move things outside of the normal flow of work.
Master node - does this node store HDFS data in an AWS EMR cluster?
Task node - if this node does not store HDFS data, is it purely a computational node? In that case, does Hadoop transfer data to the task node, and does this not defeat the data-locality advantage of the computation?
(Other than the edge case of a master-only cluster with no core or task instances...)
The master instance does not store any HDFS data, nor does it act as a computational node. The master instance runs services like the YARN ResourceManager and HDFS NameNode.
The only nodes that store data are those that run HDFS DataNode, which are only the core instances.
The core and task instances both run YARN NodeManager and thus are the "computational nodes".
Regarding your question, "in this case does hadoop transfer to task node", I assume that you are asking whether or not Hadoop transfers (HDFS) data to the task instances so that they may perform computations on HDFS data. In a sense, yes, task instances may read HDFS blocks remotely from core instances where the blocks are stored.
It's true that this means that task instances can never take advantage of data locality for HDFS data, but there are many cases where this does not matter anyway, such as for tasks that are reading shuffle data from other nodes, or tasks that are reading data from remote storage anyway (e.g., Amazon S3). Furthermore, depending upon the core instance type being used, keep in mind that even the HDFS blocks might be getting stored in remote storage (i.e., EBS). That said, even when your task instances are reading data from a remote DataNode or a remote service like S3 or EBS, it might not even be noticeable to the point that you need to worry about data locality.
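To make that role split concrete, here is a minimal boto3 sketch of how the three roles are declared when launching an EMR cluster; the cluster name, release label, instance types and counts below are placeholders, not recommendations.

```python
# Minimal sketch: launching an EMR cluster with explicit MASTER/CORE/TASK
# instance groups via boto3. Names, types and counts are illustrative only.
import boto3

emr = boto3.client("emr", region_name="us-east-1")

response = emr.run_job_flow(
    Name="example-cluster",          # hypothetical cluster name
    ReleaseLabel="emr-6.10.0",
    Instances={
        "InstanceGroups": [
            # Master: runs YARN ResourceManager and HDFS NameNode, stores no HDFS data
            {"Name": "master", "InstanceRole": "MASTER",
             "InstanceType": "m5.xlarge", "InstanceCount": 1},
            # Core: runs HDFS DataNode + YARN NodeManager (the only HDFS storage)
            {"Name": "core", "InstanceRole": "CORE",
             "InstanceType": "m5.xlarge", "InstanceCount": 2},
            # Task: runs YARN NodeManager only (compute, no HDFS storage)
            {"Name": "task", "InstanceRole": "TASK",
             "InstanceType": "m5.xlarge", "InstanceCount": 2},
        ],
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print(response["JobFlowId"])
```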
I have a simple question for someone with experience with AWS, but I am getting a little confused by the terminology and don't know how to decide which node type to purchase.
At my company we currently have a Postgres db that we insert into continuously.
We currently insert ~600M rows per year but would like to be able to scale up.
Each row is basically a timestamp, two floats, one int and one enum.
So the workload is write intensive, but also with constant small reads.
(There will be the occasional large read)
There are also two services that need to be run (both Rust based)
1. We have a Rust application that abstracts the db data, allowing clients to access it through a RESTful interface.
2. We have a Rust app that gets the data to import from thousands of individual devices through Modbus.
These devices are on a private mobile network. Can I set up AWS cluster nodes to access a private network through a VPN?
We would like to move to Amazon Redshift, but I am confused by the node types.
Amazon recommends choosing RA3 or DC2.
If we chose ra3.4xlarge, that means we get one cluster of nodes, right?
Can I run our Rust services on that cluster along with a number of Redshift database instances?
I believe AWS supports Docker, and I think I could containerise my services easily.
Or am I misunderstanding things, and when you purchase a Redshift cluster you can only run Redshift on that cluster, and you have to get a different one (possibly an EC2 cluster) for containerised applications?
Can anyone recommend a better fit for scaling this workload?
Thanks
I would not recommend Redshift for this application, and I'm a Redshift guy. Redshift is designed for analytic workloads (lots of reads and few, large writes). Constant updates are not what it is designed for.
I would point you to Postgres RDS as the best fit. It has a RESTful API interface already. This will be more of the transactional database you are looking for, with minimal migration changes.
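To give a feel for that transactional write path, here is a rough psycopg2 sketch of batched inserts for rows of the shape described in the question (timestamp, two floats, an int and an enum); the table, column names and connection string are hypothetical.

```python
# Rough sketch: batched inserts into a Postgres/RDS table. The table,
# columns and DSN are placeholders, not a schema recommendation.
from datetime import datetime, timezone

import psycopg2
from psycopg2.extras import execute_values

conn = psycopg2.connect(
    "host=my-rds-endpoint.rds.amazonaws.com dbname=metrics user=app password=secret"
)

rows = [
    (datetime.now(timezone.utc), 1.23, 4.56, 42, "OK"),
    # ... more rows batched from the Modbus collector
]

with conn, conn.cursor() as cur:
    # Batching keeps the constant small writes cheap on the primary.
    execute_values(
        cur,
        "INSERT INTO readings (ts, value_a, value_b, device_id, status) VALUES %s",
        rows,
    )
```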
When your data gets really large (TB+) you can add Redshift to the mix to quickly perform the analytics you need.
Just my $.02
Redshift is a managed service: you don't get any access to it for installing things, nor is there any possibility of installing/running custom software of your own.
Or am I misunderstanding things and when you purchase a Redshift cluster you can only run Redshift on this cluster
Yes, you don't run your own software there - AWS manages the cluster and you run your analytics/queries etc.
have to get a different one for containerised applications, possibly an ec2 cluster ?
Yes, you could possibly make use of EC2, running the orchestration on your own, or make use of ECS/Fargate/EKS, depending on your budget, how skilled your team members are, etc.
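To make the split concrete: your services run on EC2/ECS/Fargate/EKS and simply connect to the Redshift cluster's endpoint over its SQL interface. A minimal Python sketch of such a connection, with placeholder endpoint, credentials and table:

```python
# Sketch: a containerised service connecting to a Redshift cluster endpoint.
# Redshift speaks the Postgres wire protocol, so psycopg2 works; all values
# below are placeholders.
import psycopg2

conn = psycopg2.connect(
    host="my-cluster.abc123.eu-west-1.redshift.amazonaws.com",  # cluster endpoint
    port=5439,                                                  # Redshift default port
    dbname="analytics",
    user="app_user",
    password="secret",
)

with conn, conn.cursor() as cur:
    cur.execute("SELECT count(*) FROM readings")  # hypothetical table
    print(cur.fetchone())
```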
I have some doubts about a deployment of CDH on AWS. I read the reference architecture doc and other material I found on the Cloudera Engineering Blog, but I need some more suggestions.
1) Is the CDH deployment available only for certain kinds of instances, or can I deploy it on all AWS instance types?
2) Assume I want to create a cluster that will be active 24x7. For a long-running cluster I understood it's better to base the cluster on local-storage instances. If we consider a cluster of 2 PB, I think d2.8xlarge should be the best choice for the data nodes.
About the master nodes:
- If I want to deploy only 3 master nodes, is it better to have them as local-storage instances too, or as EBS-backed instances so that I can react quickly to a possible master node failure?
- Are there any best practices for the master node instance type (EBS or local-storage)?
About the data nodes:
- If a data node fails, does CDH have an automated mechanism to spin up a new instance and connect it to the cluster so that the cluster is restored without downtime, or do we have to build a script from scratch to do this?
About the edge nodes:
- Are there any best practices for the instance type (EBS or local-storage)?
3) If I want to back up the cluster to S3:
- When I do a distcp from CDH to S3, can I move the data directly to Glacier instead of standard S3?
- If the data has some compression applied (e.g. Snappy, gzip, etc.) and I do a distcp to S3, is the space occupied on S3 the same, or does the distcp command decompress the data for the copy?
If I have a cluster based on EBS-backed instances:
- Is it possible to snapshot the disks and re-attach a data node with EBS disks rebuilt from the snapshot?
4) If I have the data nodes deployed as r4.8xlarge and I need more horsepower, is it possible to scale up the cluster from r4.8xlarge to r4.16xlarge on the fly, attaching and detaching the disks in a few minutes?
Thanks a lot for the clarifications; I hope my questions will also help other users.
1) There's no explicit restriction on instance types where CDH components will work, but you'd need to pick types with a minimum of horsepower. For example, I don't expect that a micro size instance would work for much of anything. A type that is too small will generally cause daemons to run out of memory. The reference architecture has suggested instance types for certain situations.
2) You should stick with EBS for the root volume of instance types. There are a few reasons, including that newer instance types don't even support local instance storage for the root disk.
CDH doesn't have a mechanism for replacing data nodes when they fail. You could roll something yourself, possibly with help from Cloudera Director.
3) You can set up lifecycle rules for data in S3 to migrate it from the standard storage class into Glacier over time, but it doesn't look like direct Glacier access can be done through the s3a connector, so you can't simply write straight into Glacier. I'm pretty sure distcp and S3 won't fiddle with compression; what you copy is opaque to S3 for sure. You can snapshot EBS volumes (root or additionally attached), then detach them and re-attach them to a different instance; this isn't necessarily a great way to back up data nodes compared with the distcp route, because each data node is unique and holds changing data as the cluster runs.
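As a sketch of that lifecycle-rule route with boto3 (the bucket name, prefix and 30-day threshold are arbitrary placeholders):

```python
# Sketch: transition distcp'd backups from the standard storage class to
# Glacier after 30 days. Bucket, prefix and threshold are placeholders.
import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="my-cdh-backup-bucket",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-distcp-backups",
                "Filter": {"Prefix": "backups/"},
                "Status": "Enabled",
                "Transitions": [{"Days": 30, "StorageClass": "GLACIER"}],
            }
        ]
    },
)
```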
4) You can resize EBS-backed EC2 instances without detaching and re-attaching disks. You do have to stop an instance to resize it.
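A sketch of that stop/resize/start cycle with boto3, using a placeholder instance ID and target type; expect downtime while the instance is stopped.

```python
# Sketch: resize an EBS-backed instance in place by stopping it, changing
# the instance type and starting it again. IDs and types are placeholders.
import boto3

ec2 = boto3.client("ec2")
instance_id = "i-0123456789abcdef0"  # hypothetical data node instance

ec2.stop_instances(InstanceIds=[instance_id])
ec2.get_waiter("instance_stopped").wait(InstanceIds=[instance_id])

ec2.modify_instance_attribute(
    InstanceId=instance_id,
    InstanceType={"Value": "r4.16xlarge"},  # target size
)

ec2.start_instances(InstanceIds=[instance_id])
ec2.get_waiter("instance_running").wait(InstanceIds=[instance_id])
```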
Point 3 only:
You need to distcp to S3 and then move that data to Glacier via the AWS lifecycle settings.
distcp doesn't do anything to the data, compression, etc.
See the Hortonworks doc on distcp and S3 and read its warnings/caveats. In particular, incremental distcp isn't checksum-based, and "atomic" distcp isn't actually atomic; it's just really slow distcp.
I am reading S3 buckets with Drill and writing the data back to S3 as Parquet in order to read it with Spark DataFrames for further analysis. AWS EMR requires me to have at least 2 core machines.
Will using a micro instance for the master and core nodes affect performance?
I don't make use of HDFS as such, so I am thinking of making them micro instances to save money.
All computation will be done in memory by R3.xlarge spot instances as task nodes anyway.
And finally, does Spark utilise multiple cores on each machine? Or is it better to launch a fleet of r3.xlarge task nodes on EMR release 4.1 so they can be auto-resized?
I don't know how familiar you are with Spark, but there are a couple of things you need to know about core usage (a small PySpark sketch follows this list):
You can set the number of cores to use for the driver process, only in cluster mode. It's 1 by default.
You can also set the number of cores to use on each executor. For YARN and standalone mode only. It's 1 in YARN mode, and all the available cores on the worker in standalone mode. In standalone mode, setting this parameter allows an application to run multiple executors on the same worker, provided that there are enough cores on that worker. Otherwise, only one executor per application will run on each worker.
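A small PySpark sketch of setting these options explicitly when building a SparkSession; the values are illustrative, not recommendations.

```python
# Sketch: setting driver and executor core counts explicitly. Each executor
# can run up to spark.executor.cores tasks in parallel, so Spark does use
# multiple cores on each machine.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("core-usage-example")
    .config("spark.driver.cores", "1")        # driver cores (cluster mode only)
    .config("spark.executor.cores", "4")      # cores per executor
    .config("spark.executor.instances", "6")  # number of executors (YARN)
    .getOrCreate()
)
```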
Now, to answer both of your questions:
Will using a micro instance for the master and core nodes affect performance?
Yes. The driver needs a minimum of resources to schedule jobs, sometimes collect data, etc. Performance-wise, you'll need to benchmark according to your use case to see what suits your usage better, which you can do using Ganglia on AWS, for example.
Does Spark utilise multiple cores on each machine?
Yes, Spark uses multiple cores on each machine.
You can also read this question concerning which instance type is preferred for an AWS EMR cluster running Spark.
Spark support on AWS EMR is fairly new, but it's usually close to any other Spark cluster setup.
I advise you to read the AWS EMR developer guide ("Plan EMR Instances" chapter) along with the official Spark documentation.
I am a newbie to Amazon RDS. I have set up a db instance in RDS and want to try the RDS read replicas feature.
I have a few queries:
For what kind of applications are read replicas suitable?
Does the source instance replicate data to the read replicas synchronously or asynchronously?
Is it a substitute for Multi-AZ deployments?
How is it better than master-slave or master-master replication in MySQL?
If we have replicas on EC2, will they work the same way as RDS read replicas?
Thanks in advance.
For what kind of applications are read replicas suitable?
It is best suited if your application:
- Is read intensive and is used by several read clients
- Can tolerate (live with) a minor lag between the data written to the db and the data replicated to the read replicas
Does the source instance replicate data to the read replicas synchronously or asynchronously?
The replication is asynchronous, so expect a small replication lag.
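A small boto3 sketch of creating a replica and then watching that lag via the CloudWatch ReplicaLag metric; the instance identifiers are placeholders.

```python
# Sketch: create an asynchronous read replica and check its replication lag.
# Instance identifiers are placeholders.
from datetime import datetime, timedelta, timezone

import boto3

rds = boto3.client("rds")
cloudwatch = boto3.client("cloudwatch")

rds.create_db_instance_read_replica(
    DBInstanceIdentifier="mydb-replica-1",
    SourceDBInstanceIdentifier="mydb-primary",
)

# The lag you have to live with is exposed as the ReplicaLag metric (seconds).
stats = cloudwatch.get_metric_statistics(
    Namespace="AWS/RDS",
    MetricName="ReplicaLag",
    Dimensions=[{"Name": "DBInstanceIdentifier", "Value": "mydb-replica-1"}],
    StartTime=datetime.now(timezone.utc) - timedelta(minutes=10),
    EndTime=datetime.now(timezone.utc),
    Period=60,
    Statistics=["Average"],
)
print(stats["Datapoints"])
```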
Is it a substitute for Multi-AZ deployments?
A Multi-AZ setup and Read Replicas complement each other; they aren't replacements or substitutes for one another. A Multi-AZ setup is for high availability (an out-of-the-box setup by AWS), whereas a Read Replica is purely there to reduce/distribute the load on the database instances, improving read performance and avoiding read and write bottlenecks on the primary. You can (and need to) write your application logic to divert reads to the Read Replica and writes to the main instance to make the best use of the setup.
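As a toy illustration of that read/write split (hostnames, credentials and the table are hypothetical, and pymysql is just one possible driver):

```python
# Sketch: route writes to the primary endpoint and reads to the replica
# endpoint. Real applications usually hide this behind a connection pool.
import pymysql

PRIMARY = dict(host="mydb-primary.abc123.rds.amazonaws.com",
               user="app", password="secret", database="app")
REPLICA = dict(host="mydb-replica-1.abc123.rds.amazonaws.com",
               user="app", password="secret", database="app")

def write(sql, params=()):
    conn = pymysql.connect(**PRIMARY, autocommit=True)
    try:
        with conn.cursor() as cur:
            cur.execute(sql, params)
    finally:
        conn.close()

def read(sql, params=()):
    conn = pymysql.connect(**REPLICA)
    try:
        with conn.cursor() as cur:
            cur.execute(sql, params)
            return cur.fetchall()
    finally:
        conn.close()

write("INSERT INTO events (name) VALUES (%s)", ("signup",))
print(read("SELECT COUNT(*) FROM events"))
```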
Generally people mix and match both Multi AZ and Read Replica(s) depending on the application and load.
How is it better than master-slave or master-master replication in MySQL?
Comparing master-master vs master-slave depends on several factors like the data, data volume, the mix of read and write operations, load, etc.; you need to test to see exactly how the system performs with either setup.
The biggest advantage of going with Multi-AZ / Read Replicas is that you can offload the DB management activities and the overhead of supervising the replica setup and its health to AWS, instead of managing those yourself.
If we have replicas on EC2, will they work the same way as RDS read replicas?
This is again more of a corollary to Q4. When you install a database on your own EC2 instance, you need to take care of (monitor & manage) EC2 instance patches, database patches, the replication setup, replication lag, and availability.
Whereas when you leave that to AWS by using a Read Replica, they manage all of the above for you. It is your call to choose whichever is best for you, depending on what the application requires, which involves factors like cost, availability, compliance, etc.