I like to run the WordCount example on Hadoop2.0. I know we can do it either running java program(mapper & reducer) or using simple HiveQL.
When i write the HiveQL to run the WorCount example, my understanding is HIVE converts the SQL into MapReduce program and run example for me.
That being said, YARN architecture says that apart from running MapReduce application, now YARN allows user to non Mapreduce applications like (HIVE,PIG,Impala etc). im not able to connect the dots here. Isnt HiveSQL a MapReduce Program?
Hive is an abstraction program. It converts HiveQL into code to be executed with other engines, of which MapReduce is the most popular. You can also change the execution engine to Tez, if you're running Hortonworks, for example.
Cloudera is also planning on having HiveQL execute against Spark. So that's 3 execution engines, all of which would operate under YARN.
Related
I am trying to build a multi-node parallel job in AWS Batch running an R script. My R script runs independently multiple statistical models for multiple users. Hence, I want to split and distribute this job running on parallel on a cluster of several servers for faster execution. My understanding is that at some point I have to prepare a containerized version of my R-application code using a Dockerfile pushed to ECR. My question is:
The parallel logic should be placed inside the R code, while using 1 Dockerfile? If yes, how does Batch know how to split my job (in how many chunks) ?? Is the for-loop in the Rcode enough?
or I should define the parallel logic somewhere in the Dockerfile saying that: container1 run the models for user1-5, container2 run
the models for user6-10, etc.. ??
Could you please share some ideas or code on that topic for better understanding? Much appreciated.
AWS Batch does not inspect or change anything in your container, it just runs it. So you would need to handle the distribution of the work within the container itself.
Since these are independent processes (they don't communicate with each other over MPI, etc) you can leverage AWS Batch Array Jobs. Batch MNP jobs are for tightly-coupled workloads that need that inter-instance or inter-GPU communication using Elastic Fabric Adapter.
Your application code in the container can leverage the AWS_BATCH_JOB_ARRAY_INDEX environment variable to process a subset of users. AWS_BATCH_JOB_ARRAY_INDEX starts with 0 so you would need to account for that.
You can see an example in the AWS Batch docs for how to use the index.
Note that AWS_BATCH_JOB_ARRAY_INDEX is zero based, so you will need to account for that if your user numbering / naming scheme is different.
AWS Glue looks promising but I'm having a challenge with the development cycle time. If I edit PySpark scripts through the AWS console, it takes several minutes to run even on a minimal test dataset. This makes it a challenge to iterate quickly if I have to wait 3-5 minutes just to see whether I called the right method on glueContext or understood a particular DynamicFrame behavior.
What techniques would allow me to iterate faster?
I suppose I could develop Spark code locally, and deploy it to Glue as an execution framework. But if I need to test code with Glue-specific extensions, I am stuck.
For development and testing scripts Glue has Development Endpoints which you can use with notebooks like Zeppelin installed either on a local machine or on Amazon EC2 instance (other options are 'REPL Shell' and 'PyCharm Professional').
Please don't forget to remove the endpoint when you are done with testing since you pay for it even if it's idling.
I keep pyspark code in separate class file and glue code in another file. We use glue for reading and writing data only. We do test driven development using pytest in local machine. No need of dev endpoint or zeppelin. Once all syntactical or business logic specific bugs are fixed in pyspark, end-to-end testing is done using glue. We also wrote shell script, which uploads latest code to S3 bucket from which glue job is run.
https://github.com/fatangare/aws-glue-deploy-utility
https://github.com/fatangare/aws-python-shell-deploy
Where should I execute a python script that process ~7giga of data that is available on GCS. The output will be writen to GCS as well.
The script was debugged on datalab notebook with small dataset. I would like to scale up the processing. Should I allocate a big machine? I have no idea what size (resources) of machine is needed.
Many thanks,
Eila
Just in case,
Dataflow can’t work for that kind of data processing
From what I have read about HDF5, it seems that it is not easily parallelizable (See Parallel HDF5 and h5py multiprocessing_example) so I'll assume that reading that ~7GB must me done by one worker.
If there is no workaround to it, and you do not encounter memory issues while processing it on the machine you are already using, I do not see a need to upgrade your datalab instance.
I am aware of flume and Kafka but these are event driven tools. I don't need it to be event driven or real time but may be just schedule the import once in a day.
What are the data ingestion tools available for importing data from API's in HDFS?
I am not using HBase either but HDFS and Hive.
I have used R language for that for quite a time but I am looking for a more robust,may be native solution to Hadoop environment.
Look into using Scala or Python for this. There are a couple ways to approach pulling from an API into HDFS. The first approach would be to write a script which runs on your edge node(essentially just a linux server) and pulls data from the API and lands it in a directory on the linux file system. Then your script can use HDFS file system commands to put the data into HDFS.
The second approach would be to use Scala or Python with Spark to call the API and directly load the data into HDFS using a Spark submit job. Again this script would be run from an edge node it is just utilizing Spark to bypass having to land the data in the LFS.
The first option is easier to implement. The second option is worth looking into if you have huge data volumes or an API that could be parallelized by making calls to muliple IDs/accounts at once.
Is there a way to run SAS using batch if I don't have the sas.exe in my machine?
My computer has the SAS EG but the code is ran on our companies servers
Thanks
If you are asking whether it is possible to run SAS batch on your local machine without having SAS on your local machine, the answer is no.
If you are using EG to connect to a SAS server, and you want to execute a batch job on the SAS server, that is possible (just not with EG). For example, if you have terminal access to the SAS server via putty or whatever, you can do a batch submit.
Enterprise Guide is quite capable of scheduling jobs, whether or not you have a local SAS installation.
Wendy McHenry covers this well in Four Ways to Schedule SAS Tasks. Way 1 is what you probably are familiar with ('batch'), but Ways 2 through 4 are all possible in server environments.
Way 2 is what I use, which is specifically covered in Chris Hemedinger's post Doing More with SAS Enterprise Guide Automation. In Enterprise Guide since I think EG 4.3, there has been an option in the File menu "Schedule ...", as well as a right-click option on a process flow "Schedule ...". These create VBScript files that can be scheduled using your normal Windows scheduler, and allow you to schedule a process flow or a project to run unattended, even if it needs to connect to a server.
You need to make sure you can connect to that server using the credentials you'll schedule the job to run under, of course, and that any network connections are created when you're not logged in interactively, but other than that it's quite simple to schedule the job. Then, once you've run it, it will save the project with the updated log and results tabs.
If your company uses the full suite of server products, I would definitely recommend seeing if you can get Way 3 to work (using SAS Management Console) - that is likely easier than doing it through EG. That's how SAS would expect you to schedule jobs in that kind of environment (and lets your SAS Administrator have better visibility on when the server will be more/less busy).