Executing HiveQL in EMR cluster

I have created an EMR cluster through the AWS CLI:
aws emr create-cluster --applications Name=Hive Name=HBase Name=Hue Name=Hadoop Name=ZooKeeper \
  --tags Name="EMR-Atlas" --release-label emr-5.16.0 \
  --ec2-attributes SubnetId=subnet-xxxxx,KeyName=atlas-emr-dif \
  --use-default-roles --ebs-root-volume-size 100 \
  --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m4.xlarge InstanceGroupType=CORE,InstanceCount=1,InstanceType=m4.xlarge \
  --log-uri s3://xxx/logs/new-log \
  --steps Name="Run Remote Script",Jar=command-runner.jar,Args=[bash,-c,"curl https://s3.amazonaws.com/aws-bigdata-blog/artifacts/aws-blog-emr-atlas/apache-atlas-emr.sh -o /tmp/script.sh; chmod +x /tmp/script.sh; /tmp/script.sh"]
Then I established an SSH tunnel for Hue:
ssh -L 8888:localhost:8888 -i key.pem hadoop@<EMR Master IP Address>
I created a Hive table through Hue:
CREATE external TABLE us_disease
(
YearStart int,
StratificationCategory2 string,
GeoLocation string,
ResponseID string,
LocationID int,
TopicID string
)
row format delimited
fields terminated by ','
LOCATION 's3://XXXX/data/USHealthcare/'
TBLPROPERTIES ("skip.header.line.count"="1");
I am able to fetch records with a SELECT statement through Hue.
But if I try to execute the same SELECT statement through an HQL script, it fails.
I tried it in the following way:
My HQL is a plain SELECT statement:
select * from us_disease limit 10;
and I have stored it in S3 as hive.hql.
I executed the HQL file as a step on the EMR cluster:
Log :
INFO redirectError to /mnt/var/log/hadoop/steps/s-xxxxxxxx/stderr
INFO Working dir /mnt/var/lib/hadoop/steps/s-xxxxxxxx
INFO ProcessRunner started child process 30597 :
hadoop 30597 5505 0 11:40 ? 00:00:00 bash /usr/lib/hadoop/bin/hadoop jar /var/lib/aws/emr/step-runner/hadoop-jars/command-runner.jar hive-script --run-hive-script --args -f s3://dif-test/data-governance/hql/hive.hql
2021-03-30T11:40:36.318Z INFO HadoopJarStepRunner.Runner: startRun() called for s-xxxxxxxx Child Pid: 30597
INFO Synchronously wait child process to complete : hadoop jar /var/lib/aws/emr/step-runner/hadoop-...
INFO waitProcessCompletion ended with exit code 127 : hadoop jar /var/lib/aws/emr/step-runner/hadoop-...
INFO total process run time: 2 seconds
2021-03-30T11:40:36.437Z INFO Step created jobs:
2021-03-30T11:40:36.438Z WARN Step failed with exitCode 127 and took 2 seconds
stderr:
/usr/lib/hadoop/bin/hadoop: line 169: /etc/alternatives/jre/bin/java: No such file or directory
Any help appreciated. Thank you.

The issue got fixed after I updated the EMR release version. Previously I was using emr-5.16.0; I changed to emr-5.32.0.
Modified command:
aws emr create-cluster --applications Name=Hive Name=HBase Name=Hue Name=Hadoop Name=ZooKeeper \
  --tags Name="EMR-Atlas" --release-label emr-5.32.0 \
  --ec2-attributes SubnetId=subnet-xxxx,KeyName=atlas-emr-dif \
  --use-default-roles --ebs-root-volume-size 100 \
  --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m5.xlarge InstanceGroupType=CORE,InstanceCount=2,InstanceType=m5.xlarge \
  --log-uri s3://xxx/xxx/new-log \
  --steps Name="Run Remote Script",Jar=command-runner.jar,Args=[bash,-c,"curl https://s3.amazonaws.com/aws-bigdata-blog/artifacts/aws-blog-emr-atlas/apache-atlas-emr.sh -o /tmp/script.sh; chmod +x /tmp/script.sh; /tmp/script.sh"]
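For reference, the failing step in the log above was submitted through command-runner.jar with hive-script. A roughly equivalent way to add such a step from the CLI (the cluster id and S3 path below are placeholders) would be:
aws emr add-steps --cluster-id j-XXXXXXXXXXXXX \
  --steps Type=CUSTOM_JAR,Name="Run Hive Script",ActionOnFailure=CONTINUE,Jar=command-runner.jar,Args=[hive-script,--run-hive-script,--args,-f,s3://your-bucket/hql/hive.hql]
The shorthand Type=HIVE with Args=[-f,s3://your-bucket/hql/hive.hql] should achieve the same thing.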

Related

bash: spark-submit: command not found while executing dag in AWS- Managed Apache Airflow

I have to run a Spark job (I am new to Spark) and am getting the following error:
[2022-02-16 14:47:45,415] {{bash.py:135}} INFO - Tmp dir root location: /tmp
[2022-02-16 14:47:45,416] {{bash.py:158}} INFO - Running command: spark-submit --class org.xyz.practice.driver.PractitionerDriver s3://pfdt-poc-temp/xyz_test/org.xyz.spark-xy_mvp-1.0.0-SNAPSHOT.jar
[2022-02-16 14:47:45,422] {{bash.py:169}} INFO - Output:
[2022-02-16 14:47:45,423] {{bash.py:173}} INFO - bash: spark-submit: command not found
[2022-02-16 14:47:45,423] {{bash.py:177}} INFO - Command exited with return code 127
[2022-02-16 14:47:45,437] {{taskinstance.py:1482}} ERROR - Task failed with exception
What has to be done?
def run_spark(**kwargs):
    import pyspark
    sc = pyspark.SparkContext()
    df = sc.textFile('s3://demoairflowpawan/people.txt')
    logging.info('Number of lines in people.txt = {0}'.format(df.count()))
    sc.stop()

spark_task = BashOperator(
    task_id='spark_java',
    bash_command='spark-submit --class {{ params.class }} {{ params.jar }}',
    params={'class': 'org.xyz.practice.driver.PractitionerDriver', 'jar': 's3://pfdt-poc-temp/xyz_test/org.xyz.spark-xy_mvp-1.0.0-SNAPSHOT.jar'},
    dag=dag
)
The question is - why do you expect the spark-submit to be there?
If you created the airflow default pods, then they come with airflow code only.
You can check here an example for spark and airflow - https://medium.com/codex/executing-spark-jobs-with-apache-airflow-3596717bbbe3 - and they state specifically "Spark binaries must be added and mapped".
So you need to figure out how to download the spark binaries to the existing airflow pod.
Alternatively - you can create another k8s job which will do the spark-submit, and have your DAG activate this job.
sorry for the high level answer...
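To make the Kubernetes alternative a bit more concrete, here is a rough sketch of a task that runs spark-submit inside a Spark container instead of on the Airflow worker. It only applies if your Airflow really runs on Kubernetes, and the import path, namespace, and image name are assumptions that depend on your provider version and cluster:
from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import KubernetesPodOperator

# Run spark-submit in a container that ships the Spark binaries,
# instead of relying on spark-submit being on the Airflow worker's PATH.
submit_spark_job = KubernetesPodOperator(
    task_id='spark_java',
    name='spark-submit-job',
    namespace='airflow',          # assumption: use your cluster's namespace
    image='apache/spark:3.3.1',   # assumption: any image containing spark-submit
    cmds=['/opt/spark/bin/spark-submit'],
    arguments=[
        '--class', 'org.xyz.practice.driver.PractitionerDriver',
        's3://pfdt-poc-temp/xyz_test/org.xyz.spark-xy_mvp-1.0.0-SNAPSHOT.jar',
    ],
    get_logs=True,
    dag=dag,
)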

Cloning existing EMR cluster into a new one using boto3

When creating a new cluster using boto3, I want to reuse the configuration of an existing (terminated) cluster and thus clone it.
As far as I know, emr_client.run_job_flow requires all of the configuration (Instances, InstanceFleets, etc.) to be provided as parameters.
Is there any way I can clone from an existing cluster, like I can from the AWS console for EMR?
What I can recommend is using the AWS CLI to fire up your cluster.
It lets you version your cluster configuration, and you can easily load the step configuration from a JSON file.
aws emr create-cluster --name "Cluster's name" --ec2-attributes KeyName=SSH_KEY --instance-type m3.xlarge --release-label emr-5.2.1 --log-uri s3://mybucket/logs/ --enable-debugging --instance-count 1 --use-default-roles --applications Name=Spark --steps file://step.json
Where step.json looks like :
[
  {
    "Name": "Step #1",
    "Type": "SPARK",
    "Jar": "command-runner.jar",
    "Args": [
      "--deploy-mode", "cluster",
      "--class", "com.your.data.set.class",
      "s3://path/to/your/spark-job.jar",
      "-c", "s3://path/to/your/config/or/not",
      "--aws-access-key", "ACCESS_KEY",
      "--aws-secret-key", "SECRET_KEY"
    ],
    "ActionOnFailure": "CANCEL_AND_WAIT"
  }
]
(Multiple steps are okay too.)
After that you can always start up the same configured cluster.
And, for example, schedule the whole cluster and its steps from one Airflow job.
But if you really want to use boto3, I suppose the describe_cluster() method can help you get all the information, and you can use the returned object to fire up a new one.
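To make that boto3 route a bit more concrete, here is a minimal sketch that copies a handful of settings from a terminated cluster into run_job_flow. It is not a complete clone (bootstrap actions, configurations, security groups, etc. would still need to be carried over), and the cluster id is a placeholder:
import boto3

emr = boto3.client('emr')
old_cluster_id = 'j-XXXXXXXXXXXXX'   # placeholder: the terminated cluster to clone

# Read the configuration of the old cluster.
old = emr.describe_cluster(ClusterId=old_cluster_id)['Cluster']
groups = emr.list_instance_groups(ClusterId=old_cluster_id)['InstanceGroups']

# Fire up a comparable cluster; only a subset of the settings is cloned here.
new = emr.run_job_flow(
    Name=old['Name'] + '-clone',
    ReleaseLabel=old['ReleaseLabel'],
    Applications=[{'Name': a['Name']} for a in old['Applications']],
    ServiceRole=old['ServiceRole'],
    JobFlowRole=old['Ec2InstanceAttributes']['IamInstanceProfile'],
    LogUri=old['LogUri'],   # assumes the old cluster had a log URI
    Instances={
        'Ec2SubnetId': old['Ec2InstanceAttributes']['Ec2SubnetId'],
        'Ec2KeyName': old['Ec2InstanceAttributes']['Ec2KeyName'],   # assumes a key pair was set
        'KeepJobFlowAliveWhenNoSteps': True,
        'InstanceGroups': [
            {
                'Name': g['Name'],
                'Market': g['Market'],
                'InstanceRole': g['InstanceGroupType'],
                'InstanceType': g['InstanceType'],
                'InstanceCount': g['RequestedInstanceCount'],
            }
            for g in groups
        ],
    },
)
print(new['JobFlowId'])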
There is no way to get an "EMR export CLI" through the command line.
You have to parse the parameters you want to clone out of the "describe-cluster" output.
See the sample below:
https://github.com/awslabs/aws-support-tools/tree/master/EMR/Get_EMR_CLI_Export
import boto3
import json
import sys

# Describe the existing cluster whose id is passed on the command line ...
cluster_id = sys.argv[1]
client = boto3.client('emr')
clst = client.describe_cluster(ClusterId=cluster_id)
...
# ... and assemble the pieces of an equivalent "aws emr create-cluster" command
awscli += ' --steps ' + '\'' + json.dumps(cli_steps) + '\''
...
awscli += ' --instance-groups ' + '\'' + json.dumps(cli_igroups) + '\''
print(awscli)
It works by parsing the parameters from "describe-cluster" first, and then building strings that fit "create-cluster" in the AWS CLI.

AWS EMR Spark Step args bug

I'm submitting a Spark job to EMR via the AWS CLI; the EMR steps and Spark configs are provided as separate JSON files. For some reason the name of my main class gets passed to my Spark jar as an unnecessary command line argument, resulting in a failed job.
AWS CLI command:
aws emr create-cluster \
--name "Spark-Cluster" \
--release-label emr-5.5.0 \
--instance-groups \
InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m3.xlarge \
InstanceGroupType=CORE,InstanceCount=20,InstanceType=m3.xlarge \
--applications Name=Spark \
--use-default-roles \
--configurations file://conf.json \
--steps file://steps.json \
--log-uri s3://blah/logs \
The json file describing my EMR Step:
[
  {
    "Name": "RunEMRJob",
    "Jar": "s3://blah/blah.jar",
    "ActionOnFailure": "TERMINATE_CLUSTER",
    "Type": "CUSTOM_JAR",
    "MainClass": "blah.blah.MainClass",
    "Args": [
      "--arg1", "these",
      "--arg2", "get",
      "--arg3", "passed",
      "--arg4", "to",
      "--arg5", "spark",
      "--arg6", "main",
      "--arg7", "class"
    ]
  }
]
The argument parser in my main class throws an error (and prints the parameters provided):
Exception in thread "main" java.lang.IllegalArgumentException: One or more parameters are invalid or missing:
blah.blah.MainClass --arg1 these --arg2 get --arg3 passed --arg4 to --arg5 spark --arg6 main --arg7 class
So for some reason the main class that I define in steps.json leaks into my separately provided command line arguments.
What's up?
I misunderstood how EMR steps work. There were two options for resolving this:
I could use Type = "CUSTOM_JAR" with Jar = "command-runner.jar" and add a normal spark-submit call to Args.
Using Type = "Spark" simply adds the "spark-submit" call as the first argument; one still needs to provide the master, the jar location, the main class, etc.
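For illustration, option 1 would make the step definition look roughly like this (same jar and class as above; the exact spark-submit flags depend on your job, so treat this as a sketch rather than a drop-in file):
[
  {
    "Name": "RunEMRJob",
    "Type": "CUSTOM_JAR",
    "Jar": "command-runner.jar",
    "ActionOnFailure": "TERMINATE_CLUSTER",
    "Args": [
      "spark-submit",
      "--deploy-mode", "cluster",
      "--class", "blah.blah.MainClass",
      "s3://blah/blah.jar",
      "--arg1", "these",
      "--arg2", "get"
    ]
  }
]
Note that MainClass is dropped entirely; the class is passed to spark-submit via --class instead, so it no longer leaks into the program arguments.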

AWS Cloudwatch Log - Is it possible to export existing log data from it?

I have managed to push my application logs to AWS CloudWatch using the CloudWatch Logs agent. But the CloudWatch web console does not seem to provide a button to let you download/export the log data from it.
Any idea how I can achieve this goal?
The latest AWS CLI has a CloudWatch Logs CLI that allows you to download the logs as JSON, a text file, or any other output supported by the AWS CLI.
For example, to get the first 1 MB (up to 10,000 log entries) from the stream a in group A into a text file, run:
aws logs get-log-events \
--log-group-name A --log-stream-name a \
--output text > a.log
The command is currently limited to a response size of at most 1 MB (up to 10,000 records per request), and if you have more you need to implement your own page-stepping mechanism using the --next-token parameter. I expect that in the future the CLI will also allow a full dump in a single command.
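If you do need to page past that limit, a small boto3 loop can follow the forward token until it stops changing, which is how the API signals that there are no more events. A minimal sketch, using the same placeholder group A and stream a as above:
import boto3

logs = boto3.client('logs')
kwargs = {
    'logGroupName': 'A',       # placeholder group
    'logStreamName': 'a',      # placeholder stream
    'startFromHead': True,
}
token = None

with open('a.log', 'w') as out:
    while True:
        if token:
            kwargs['nextToken'] = token
        resp = logs.get_log_events(**kwargs)
        for event in resp['events']:
            out.write(event['message'] + '\n')
        # The API returns the same token back when the end of the stream is reached.
        if resp['nextForwardToken'] == token:
            break
        token = resp['nextForwardToken']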
Update
Here's a small Bash script to list events from all streams in a specific group, since a specified time:
#!/bin/bash
function dumpstreams() {
aws $AWSARGS logs describe-log-streams \
--order-by LastEventTime --log-group-name $LOGGROUP \
--output text | while read -a st; do
[ "${st[4]}" -lt "$starttime" ] && continue
stname="${st[1]}"
echo ${stname##*:}
done | while read stream; do
aws $AWSARGS logs get-log-events \
--start-from-head --start-time $starttime \
--log-group-name $LOGGROUP --log-stream-name $stream --output text
done
}
AWSARGS="--profile myprofile --region us-east-1"
LOGGROUP="some-log-group"
TAIL=
starttime=$(date --date "-1 week" +%s)000
nexttime=$(date +%s)000
dumpstreams
if [ -n "$TAIL" ]; then
while true; do
starttime=$nexttime
nexttime=$(date +%s)000
sleep 1
dumpstreams
done
fi
That last part, if you set TAIL, will continue to fetch log events and will report newer events as they come in (with some expected delay).
There is also a Python project called awslogs that allows you to get the logs: https://github.com/jorgebastida/awslogs
It supports things like:
list log groups:
$ awslogs groups
list streams for given log group:
$ awslogs streams /var/log/syslog
get the log records from all streams:
$ awslogs get /var/log/syslog
get the log records from specific stream :
$ awslogs get /var/log/syslog stream_A
and much more (filtering by time period, watching log streams, ...).
I think this tool might help you do what you want.
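If it helps, the tool is published on PyPI, so installing it is usually just:
pip install awslogs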
It seems AWS has added the ability to export an entire log group to S3.
You'll need to set up permissions on the S3 bucket to allow CloudWatch to write to the bucket, by adding the following to your bucket policy (replace the region with your region and the bucket name with your bucket name).
{
  "Effect": "Allow",
  "Principal": {
    "Service": "logs.us-east-1.amazonaws.com"
  },
  "Action": "s3:GetBucketAcl",
  "Resource": "arn:aws:s3:::tsf-log-data"
},
{
  "Effect": "Allow",
  "Principal": {
    "Service": "logs.us-east-1.amazonaws.com"
  },
  "Action": "s3:PutObject",
  "Resource": "arn:aws:s3:::tsf-log-data/*",
  "Condition": {
    "StringEquals": {
      "s3:x-amz-acl": "bucket-owner-full-control"
    }
  }
}
Details can be found in Step 2 of this AWS doc
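Once the bucket policy is in place, the export itself can also be started from the CLI. A sketch, where the group name, bucket, and the epoch-millisecond timestamps are placeholders:
aws logs create-export-task \
  --task-name "my-export" \
  --log-group-name "my-log-group" \
  --from 1609459200000 --to 1612137600000 \
  --destination "tsf-log-data" \
  --destination-prefix "exported-logs"
The exported objects land in the bucket as gzipped files under the given prefix.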
The other answers were not useful with AWS Lambda logs since they create many log streams and I just wanted to dump everything in the last week. I finally found the following command to be what I needed:
aws logs tail --since 1w LOG_GROUP_NAME > output.log
Note that LOG_GROUP_NAME is the Lambda function's log group path (e.g. /aws/lambda/FUNCTION_NAME), and you can replace the --since argument with a variety of durations (1w = 1 week, 5m = 5 minutes, etc.).
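The same command can also follow the group live if you want to watch new events as they arrive:
aws logs tail LOG_GROUP_NAME --since 1w --follow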
I would add this one-liner to get all the logs for a stream:
aws logs get-log-events --log-group-name my-log-group --log-stream-name my-log-stream | grep '"message":' | awk -F '"' '{ print $(NF-1) }' > my-log-group_my-log-stream.txt
Or in a slightly more readable format :
aws logs get-log-events \
--log-group-name my-log-group\
--log-stream-name my-log-stream \
| grep '"message":' \
| awk -F '"' '{ print $(NF-1) }' \
> my-log-group_my-log-stream.txt
And you can make a handy script out of it that is admittedly less powerful than @Guss's, but simple enough. I saved it as getLogs.sh and invoke it with ./getLogs.sh log-group log-stream
#!/bin/bash
if [[ "${#}" != 2 ]]
then
echo "This script requires two arguments!"
echo
echo "Usage :"
echo "${0} <log-group-name> <log-stream-name>"
echo
echo "Example :"
echo "${0} my-log-group my-log-stream"
exit 1
fi
OUTPUT_FILE="${1}_${2}.log"
aws logs get-log-events \
--log-group-name "${1}"\
--log-stream-name "${2}" \
| grep '"message":' \
| awk -F '"' '{ print $(NF-1) }' \
> "${OUTPUT_FILE}"
echo "Logs stored in ${OUTPUT_FILE}"
Apparently there isn't an out-of-the-box way in the AWS Console to download CloudWatch Logs. Perhaps you can write a script to perform the fetch using the SDK / API.
The good thing about CloudWatch Logs is that you can retain the logs indefinitely (Never Expire), unlike CloudWatch metrics, which are kept for just 14 days. That means you can run the script at a monthly/quarterly frequency rather than on demand.
More information about the CloudWatch Logs API:
http://docs.aws.amazon.com/AmazonCloudWatchLogs/latest/APIReference/Welcome.html
http://awsdocs.s3.amazonaws.com/cloudwatchlogs/latest/cwl-api.pdf
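As a minimal sketch of such an SDK-based fetch, the boto3 paginator for filter_log_events walks every stream in a group (group name and output file are placeholders):
import boto3

logs = boto3.client('logs')
paginator = logs.get_paginator('filter_log_events')

# Iterate over all events in the group, across all of its streams.
with open('cloudwatch-dump.log', 'w') as out:
    for page in paginator.paginate(logGroupName='my-log-group'):
        for event in page['events']:
            out.write(event['message'] + '\n')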
You can now perform exports via the CloudWatch Management Console with the new CloudWatch Logs Insights page. Full documentation here: https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/CWL_ExportQueryResults.html. I had already started ingesting my Apache logs into CloudWatch as JSON, so YMMV if you haven't set it up in advance.
Add Query to Dashboard or Export Query Results
After you run a query, you can add the query to a CloudWatch dashboard, or copy the results to the clipboard.
Queries added to dashboards automatically re-run every time you load the dashboard and every time that the dashboard refreshes. These queries count toward your limit of four concurrent CloudWatch Logs Insights queries.
To add query results to a dashboard:
Open the CloudWatch console at https://console.aws.amazon.com/cloudwatch/.
In the navigation pane, choose Insights.
Choose one or more log groups and run a query.
Choose Add to dashboard.
Select the dashboard, or choose Create new to create a new dashboard for the query results.
Choose Add to dashboard.
To copy query results to the clipboard:
Open the CloudWatch console at https://console.aws.amazon.com/cloudwatch/.
In the navigation pane, choose Insights.
Choose one or more log groups and run a query.
Choose Actions, Copy query results.
Inspired by saputkin, I have created a Python script that downloads all the logs for a log group in a given time period.
The script itself: https://github.com/slavogri/aws-logs-downloader.git
In case there are multiple log streams for that period, multiple files will be created. Downloaded files will be stored in the current directory and will be named after the log streams that have log events in the given time period. (If the group name contains forward slashes, they will be replaced by underscores. Each file will be overwritten if it already exists.)
Prerequisite: you need to be logged in to your AWS profile. The script uses, on your behalf, the AWS CLI commands "aws logs describe-log-streams" and "aws logs get-log-events".
Usage example: python aws-logs-downloader -g /ecs/my-cluster-test-my-app -t "2021-09-04 05:59:50 +00:00" -i 60
optional arguments:
-h, --help show this help message and exit
-v, --version show program's version number and exit
-g , --log-group (required) Log group name for which the log stream events needs to be downloaded
-t , --end-time (default: now) End date and time of the downloaded logs in format: %Y-%m-%d %H:%M:%S %z (example: 2021-09-04 05:59:50 +00:00)
-i , --interval (default: 30) Time period in minutes before the end-time. This will be used to calculate the time since which the logs will be downloaded.
-p , --profile (default: dev) The aws profile that is logged in, and on behalf of which the logs will be downloaded.
-r , --region (default: eu-central-1) The aws region from which the logs will be downloaded.
Please let me know if it was useful to you. :)
After I did it I learned that there is another option using Boto3: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/logs.html#CloudWatchLogs.Client.get_log_events
Still, the command line API seems like a good option to me.
export LOGGROUPNAME=[SOME_LOG_GROUP_NAME]; for LOGSTREAM in `aws --output text logs describe-log-streams --log-group-name ${LOGGROUPNAME} |awk '{print $7}'`; do aws --output text logs get-log-events --log-group-name ${LOGGROUPNAME} --log-stream-name ${LOGSTREAM} >> ${LOGGROUPNAME}_output.txt; done
Adapted @Guss's answer to macOS. As I am not really a bash guy, I had to use Python to convert dates to a human-readable form.
runawslog -1w gets the last week, and so on:
runawslog() { sh awslogs.sh $1 | grep "EVENTS" | python parselogline.py; }
awslogs.sh:
#!/bin/bash
#set -x
function dumpstreams() {
aws $AWSARGS logs describe-log-streams \
--order-by LastEventTime --log-group-name $LOGGROUP \
--output text | while read -a st; do
[ "${st[4]}" -lt "$starttime" ] && continue
stname="${st[1]}"
echo ${stname##*:}
done | while read stream; do
aws $AWSARGS logs get-log-events \
--start-from-head --start-time $starttime \
--log-group-name $LOGGROUP --log-stream-name $stream --output text
done
}
AWSARGS=""
#AWSARGS="--profile myprofile --region us-east-1"
LOGGROUP="/aws/lambda/StockTrackFunc"
TAIL=
FROMDAT=$1
starttime=$(date -v ${FROMDAT} +%s)000
nexttime=$(date +%s)000
dumpstreams
if [ -n "$TAIL" ]; then
while true; do
starttime=$nexttime
nexttime=$(date +%s)000
sleep 1
dumpstreams
done
fi
parselogline.py:
import sys
import datetime

dat = sys.stdin.read()
for k in dat.split('\n'):
    d = k.split('\t')
    if len(d) < 3:
        continue
    d[2] = '\t'.join(d[2:])
    print(str(datetime.datetime.fromtimestamp(int(d[1]) / 1000)) + '\t' + d[2])
I had a similar use case where I had to download all the streams for a given log group. See if this script helps.
#!/bin/bash
if [[ "${#}" != 1 ]]
then
echo "This script requires two arguments!"
echo
echo "Usage :"
echo "${0} <log-group-name>"
exit 1
fi
streams=`aws logs describe-log-streams --log-group-name "${1}"`
for stream in $(jq '.logStreams | keys | .[]' <<< "$streams"); do
record=$(jq -r ".logStreams[$stream]" <<< "$streams")
streamName=$(jq -r ".logStreamName" <<< "$record")
echo "Downloading ${streamName}";
echo `aws logs get-log-events --log-group-name "${1}" --log-stream-name "$streamName" --output json > "${stream}.log" `
echo "Completed dowload:: ${streamName}";
done;
You have to pass the log group name as an argument.
E.g.: bash <name_of_the_bash_file>.sh <group_name>
I found the AWS documentation to be complete and accurate: https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/S3ExportTasks.html
It lays down the steps for exporting logs from CloudWatch to S3.

knife bootstrap returning error

Running:
knife bootstrap ec2-54-221-16-158.compute-1.amazonaws.com --sudo -x chef -P chef -N server --run-list 'role[inicial]'
My recipes/default.rb:
script "teste de script" do
interpreter "bash"
cwd "/home/ubuntu"
code <<-EOH
as-create-launch-config LcTiagoN --image-id ami-0521316c --instance-type t1.micro --key tiagov
EOH
end
My roles/inicial.rb:
name "inicial"
run_list "recipe[my_cookbook]"
The following error occurs below:
ShellOut::ShellCommandFailed
------------------------------------
Expected process to exit with [0], but received '127'
---- Begin output of "bash" "/tmp/chef-script20140501-8463-12uvvvl" ----
STDOUT:
STDERR: /tmp/chef-script20140501-8463-12uvvvl: line 1: as-create-launch-config: command not found
---- End output of "bash" "/tmp/chef-script20140501-8463-12uvvvl" ----
However, when I run the same command (as-create-launch-config LcTiagoN --image-id ami-0521316c --instance-type t1.micro --key tiagov) directly while logged in on the Amazon instance, the command executes successfully.
Any suggestions?
Sounds like a problem with the PATH environment. Did you log in as "chef" when running the as-create-launch-config command manually?
Best advice I can offer is to include the full path to the command in the bash script. For example:
script "teste de script" do
..
code <<-EOH
/path/to/this/cmd/as-create-launch-config ...
EOH
end
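If hard-coding the path feels too brittle, another option is to extend PATH for that one resource via the script resource's environment attribute. A sketch, assuming the auto-scaling CLI lives under /opt/aws/bin on the instance (adjust to wherever it is actually installed):
script "teste de script" do
  interpreter "bash"
  cwd "/home/ubuntu"
  # Prepend the directory containing as-create-launch-config for this resource only;
  # /opt/aws/bin is an assumption.
  environment 'PATH' => "/opt/aws/bin:#{ENV['PATH']}"
  code <<-EOH
    as-create-launch-config LcTiagoN --image-id ami-0521316c --instance-type t1.micro --key tiagov
  EOH
end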