AWS EMR (4.x-5.x) classpath for custom jar step - amazon-web-services

When adding a custom jar step for an EMR cluster - how do you set the classpath to a dependent jar (required library)?
Let's say I have my jar file - myjar.jar - but I need an external jar to run it - dependency.jar. Where do you configure this when creating the cluster? I am not using the command line; I am using the Advanced Options interface.
Thought I would post this after spending a number of hours poking around and reading outdated documentation.
The 2.x/3.x documentation that talks about setting HADOOP_CLASSPATH does not apply; it even states that this approach does not work for 4.x and above. Supposedly you need to specify a --libjars option somewhere, but passing it in the arguments list does not work either.
For example:
Step Name: MyCustomStep
Jar Location: s3://somebucket/myjar.jar
Arguments:
myclassname
option1
option2
--libjars dependentlib.jar

Copy your required jars to /usr/lib/hadoop-mapreduce/ in a bootstrap action. No other changes are necessary. Additional info below:
This command below works for me to copy a specific JDBC driver version:
sudo aws s3 cp s3://<your bucket>/mysql-connector-java-5.1.23-bin.jar /usr/lib/hadoop-mapreduce/
I have other dependencies, so I have a bootstrap action for each jar I need copied; of course, you could put all the copies in a single bash script (a combined script is sketched at the end of this answer). Below is the .NET code I use to get a bootstrap action to run the copy script. I am using .NET SDK version 3.3.* and launching the job with release label emr-5.2.0.
public static BootstrapActionConfig CopyEmrJarDependency(string jarName)
{
    return new BootstrapActionConfig()
    {
        Name = $"Copy jars for EMR dependency: {jarName}",
        ScriptBootstrapAction = new ScriptBootstrapActionConfig()
        {
            Path = $"s3n://{Config.AwsS3CodeBucketName}/EMR/Scripts/copy-thirdPartyJar.sh",
            Args = new List<string>()
            {
                $"s3://{Config.AwsS3CodeBucketName}/EMR/Java/lib/{jarName}",
                "/usr/lib/hadoop-mapreduce/"
            }
        }
    };
}
Note that the ScriptBootstrapActionConfig Path property uses the "s3n://" protocol, but the protocol for the aws s3 cp command should be "s3://".
My script copy-thirdPartyJar.sh contains the following:
#!/bin/bash
# $1 = location of jar
# $2 = attempted magic directory for java classpath
sudo aws s3 cp "$1" "$2"
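If you would rather run a single bootstrap action instead of one per jar, a minimal sketch of a combined copy script (the bucket prefix and jar names below are placeholders to replace with your own):
#!/bin/bash
# Hypothetical combined bootstrap script: copy every dependency jar in one pass.
set -e
BUCKET_PREFIX="s3://<your bucket>/EMR/Java/lib"
DEST_DIR="/usr/lib/hadoop-mapreduce/"
for jar in mysql-connector-java-5.1.23-bin.jar dependency.jar; do
    sudo aws s3 cp "${BUCKET_PREFIX}/${jar}" "${DEST_DIR}"
done
You would then register this as a single ScriptBootstrapActionConfig with no arguments, instead of one action per jar.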

VM Manager - OS Policy Assignment for a Windows VM in GCP

I am trying to create a couple of OS policy assignments to configure a Windows VM (Windows Server 2022) using VM Manager: run some scripts with PowerShell and install some security agents. I am following the official Google documentation to set up the OS policies. VM Manager is already enabled; nevertheless, I have difficulties creating the appropriate .yaml file required for the policy assignment, since I haven't found any detailed examples.
Related topics I have found:
Google documentation offers a very simple example of installing an .msi file - Example OS policies.
An example of a fixed policy assignment in Terraform registry - google_os_config_os_policy_assignment, from where I managed to better comprehend the required structure for the .yaml file even though it is in a .json format.
Few examples provided at GCP GitHub repository (OSPolicyAssignments).
OS Policy resources in JSON representation - REST Resource, from where you can navigate to sample cases based on the selected resource.
But it is still not very clear how to create the desired .yaml file (i.e. copy some files, run a PowerShell script to perform an installation or an authentication). According to the Google documentation, pkg, repository, exec, and file are the supported resource types.
Are there any more detailed examples I could use to understand what is needed? Have you already tried something similar?
Update: Adding an additional source.
You need to follow these steps:
Ensure that the OS Config agent is installed in your VM by running the below command in PowerShell:
Get-Service google_osconfig_agent
You should see output like this:
Status   Name                   DisplayName
------   ----                   -----------
Running  google_osconfig...     Google OSConfig Agent
If the agent is not installed, refer to this tutorial.
Set the metadata values to enable the OS Config agent with this Cloud Shell command:
gcloud compute instances add-metadata $YOUR_VM_NAME \
--metadata=enable-osconfig=TRUE
Generate an OS policy and OS policy assignment yaml file. As an example, I am generating an OS policy that installs an MSI file retrieved from a GCS bucket, and an OS policy assignment to apply it to all Windows VMs:
# An OS policy assignment to install a Windows MSI downloaded from a Google Cloud Storage bucket
# on all VMs running Windows Server OS.
osPolicies:
  - id: install-msi-policy
    mode: ENFORCEMENT
    resourceGroups:
      - resources:
          - id: install-msi
            pkg:
              desiredState: INSTALLED
              msi:
                source:
                  gcs:
                    bucket: <your_bucket_name>
                    object: chrome.msi
                    generation: 1656698823636455
instanceFilter:
  inventories:
    - osShortName: windows
rollout:
  disruptionBudget:
    fixed: 10
  minWaitDuration: 300s
Note: every file has its own generation number; you can get it with the command gsutil stat gs://<your_bucket_name>/<your_file_name>.
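For example, a quick way to read the generation of the sample object used above (the object name is illustrative, and the output format may vary slightly between gsutil versions):
# The output includes a Generation (and Metageneration) line; use the Generation value
# for the gcs.generation field in the policy.
gsutil stat gs://<your_bucket_name>/chrome.msi | grep -i generation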
Apply the policies created in the previous step using Cloud Shell command:
gcloud compute os-config os-policy-assignments create $POLICY_NAME --location=$YOUR_ZONE --file=/<your-file-path>/<your_file_name.yaml> --async
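To confirm the assignment was created and check its rollout state, something like this should work (a sketch; the exact output columns may vary):
# List OS policy assignments in the zone used above.
gcloud compute os-config os-policy-assignments list --location=$YOUR_ZONE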
Refer to the Examples of OS policy assignments for more scenarios, and check out this example of a PowerShell script.
Down below you can find the .yaml file that worked in my case. It copies a file and executes a PowerShell command so as to configure and deploy a sample agent (TrendMicro) - again, this is specifically for a Windows VM.
.yaml file:
id: trendmicro-windows-policy
mode: ENFORCEMENT
resourceGroups:
  - resources:
      - id: copy-exe-file
        file:
          path: C:/Program Files/TrendMicro_Windows.ps1
          state: CONTENTS_MATCH
          permissions: '755'
          file:
            gcs:
              bucket: [your_bucket_name]
              generation: [your_generation_number]
              object: Windows/TrendMicro/TrendMicro_Windows.ps1
      - id: validate-running
        exec:
          validate:
            interpreter: POWERSHELL
            script: |
              $service = Get-Service -Name 'ds_agent'
              if ($service.Status -eq 'Running') {exit 100} else {exit 101}
          enforce:
            interpreter: POWERSHELL
            script: |
              Start-Process PowerShell -ArgumentList '-ExecutionPolicy Unrestricted','-File "C:\Program Files\TrendMicro_Windows.ps1"' -Verb RunAs
To elaborate a bit more, this .yaml file:
copy-exe-file: It copies the necessary installation script from GCS to a specified location on the VM. The generation number can easily be found under "VERSION HISTORY" when you select the object in GCS.
validate-running: This resource contains two steps. The validate step checks whether the specific agent is up and running on the VM. If not, it proceeds with the enforce step, which executes the "TrendMicro_Windows.ps1" file with PowerShell. This .ps1 file downloads, configures and installs the agent. Note 1: the command is executed as Administrator and the full path of the file is specified. Note 2: instead of Start-Process PowerShell, Start-Process pwsh can also be used; this was vital in one of my cases.
Essentially, a PowerShell command can be run directly at the enforce step; nonetheless, I found it much easier to put it in a .ps1 file first and then just run that file. There are some restrictions with the .yaml file anyway.
PS: Passing osconfig-log-level: debug as a metadata key-value pair - either directly on a VM or applied to all of them (Compute Engine > Settings > Metadata > EDIT > ADD ITEM) - provides additional information and may help you deal with errors.

How to clone an AWS EB environment across platform branches

Background
Our AWS Elastic Beanstalk environment, running the latest version of the pre-configured "Python 3.7 on 64-bit Amazon Linux 2" platform branch, has a lot of custom configuration and environment properties.
Now we would like to switch this environment to the "Python 3.8 on 64-bit Amazon Linux 2" platform branch.
Basically, the goal is to clone the environment, keeping the current configuration (other than platform branch and version) and environment properties.
Problem
Unfortunately, when cloning, it is not possible to switch between different platform branches (we can only switch between platform versions within the same platform branch).
The documentation suggests that a blue/green deployment is required here. However, a blue/green deployment involves creating a new environment from scratch, so we would still need some other way to copy our configuration settings and environment properties.
Question
What would be the recommended way to copy the configuration settings and/or environment properties from the original environment into a newly created environment?
I suppose we could use eb config to download the original configuration, modify the environment name, platform branch and version, and so on, and then use eb config --update on the new environment. However, that feels like a hack.
Summary
save current config: eb config save <env name>
use a text editor to modify the platform branch in the saved config file
create new environment based on modified config file: eb create --cfg <config name> (add --sample to use the sample application)
if necessary, delete local config files
if necessary, use eb printenv and eb setenv to copy environment properties
EDIT: For some reason the saved config does not include all security group settings, so it may be necessary to check those manually, using the EB console (configuration->instances).
Background
AWS support have confirmed that using eb config is the way to go, and they referred to the online documentation for details.
Unfortunately, the documentation for the eb cli does not provide all the answers.
The following is based on my own adventures using the latest version of the eb cli (3.20.2) with botocore 1.21.50, and documentation at the time of writing (Sep 30, 2021). Note there's a documentation repo on github but it was last updated six months ago and does not match the latest online docs...
eb config
The eb config docs suggest that environment properties are not part of what this command handles; indeed, if you call eb config my-env or eb config my-env --display, environment properties are not shown.
However, this does not hold for eb config save: YAML files created using eb config save actually do include environment properties*.
*Beware, if your environment properties include secrets (e.g. passwords), these also end up in your saved configs, so make sure you don't commit those to version control.
Moreover, it is currently also possible to set environment properties using eb config --update.
This implies we should be able to "copy" both configuration settings and environment properties in one go.
EDIT: After some testing it turns out eb config save does not always get the complete set of environment properties: some properties may be skipped. Not yet sure why... Step 5 below might help in those cases.
Walk-through
Not sure if this is the best way to do it, but here's what seems to work for me:
Suppose we have an existing EB environment called py37-env with lots of custom configuration and properties, running the Python 3.7 platform branch.
The simplest way to "clone" this would be as follows:
Step 1: download the existing configuration
Download the configuration for the existing environment:
eb config save py37-env
By default, the config file will end up in our project directory as .elasticbeanstalk/saved_configs/py37-env-sc.cfg.yml.
The saved config file could look like this (just an example, also see environment manifest):
EnvironmentConfigurationMetadata:
  Description: Configuration created from the EB CLI using "eb config save".
  DateCreated: '1632989892000'
  DateModified: '1632989892000'
Platform:
  PlatformArn: arn:aws:elasticbeanstalk:eu-west-1::platform/Python 3.7 running on 64bit Amazon Linux 2/3.3.5
OptionSettings:
  aws:elasticbeanstalk:application:environment:
    MY_ENVIRONMENT_PROPERTY: myvalue
  aws:elasticbeanstalk:command:
    BatchSize: '30'
    BatchSizeType: Percentage
  aws:elb:policies:
    ConnectionDrainingEnabled: true
  aws:elb:loadbalancer:
    CrossZone: true
  aws:elasticbeanstalk:environment:
    ServiceRole: aws-elasticbeanstalk-service-role
  aws:elasticbeanstalk:healthreporting:system:
    SystemType: enhanced
  aws:autoscaling:launchconfiguration:
    IamInstanceProfile: aws-elasticbeanstalk-ec2-role
    EC2KeyName: my-key
  aws:autoscaling:updatepolicy:rollingupdate:
    RollingUpdateType: Health
    RollingUpdateEnabled: true
EnvironmentTier:
  Type: Standard
  Name: WebServer
AWSConfigurationTemplateVersion: 1.1.0.0
Also see the list of available configuration options in the documentation.
Step 2: modify the saved configuration
We are only interested in the Platform, so it is sufficient here to replace 3.7 by 3.8 in the PlatformArn value.
If necessary, you can use e.g. eb platform list to get an overview of valid platform names.
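For example, a one-liner sketch of that edit, assuming the default saved-config path from step 1 (also double-check the trailing platform version in the ARN, since it may differ between branches):
# Swap the platform branch in the saved config file in place.
sed -i 's/Python 3.7/Python 3.8/' .elasticbeanstalk/saved_configs/py37-env-sc.cfg.yml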
Step 3: create a new environment based on the modified config file
eb create --cfg py37-env-sc
This will deploy the most recent application version. Use --version <my version> to deploy a specific version, or use --sample to deploy the sample application, as described in the docs.
This will automatically look for files in the default saved config folder, .elasticbeanstalk/saved_configs/.
If you get a ServiceError or InvalidParameterValueError at this point, make sure only to pass in the name of the file, i.e. without the file extension .cfg.yml and without the folders.
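As a concrete sketch of this step (py38-env is just an example name for the new environment; add --sample if you want the sample application, as in the summary above):
# Create the new environment from the modified saved config.
eb create py38-env --cfg py37-env-sc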
Step 4: clean up local saved configuration file
Just in case you have any secrets stored in the environment properties.
Step 5: alternative method for copying environment properties
If environment properties are not included in the saved config files, or if some of them are missing, here's an alternative way to copy them (using bash).
This might not be the most efficient implementation, but I think it serves to illustrate the approach. Error handling was omitted, for clarity.
source_env="py37-env" # or "$1"
target_env="py38-env" # or "$2"
# get the properties from the source environment
source_env_properties="$(eb printenv "$source_env")"
# format the output so it can be used with `eb setenv`
mapfile -t arg_array < <(echo "$source_env_properties" | grep "=" | sed -e 's/ =/=/g' -e 's/= /=/g' -e 's/^ *//g')
# copy the properties to the target environment
eb setenv -e "$target_env" "${arg_array[@]}"
This has the advantage that it does not store any secrets in local files.
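Saved as a script (the file name is hypothetical), with the two variables switched to "$1" and "$2" as hinted in the comments, usage would look like:
# Copy environment properties from the old environment to the new one.
./copy-eb-env-props.sh py37-env py38-env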

How to import Spark packages in AWS Glue?

I would like to use the GraphFrames package. If I were to run pyspark locally I would use the command:
~/hadoop/spark-2.3.1-bin-hadoop2.7/bin/pyspark --packages graphframes:graphframes:0.6.0-spark2.3-s_2.11
But how would I run an AWS Glue script with this package? I found nothing in the documentation...
You can provide a path to extra libraries packaged into zip archives located in S3.
Please check out this doc for more details.
It's possible to use graphframes as follows:
Download the graphframes Python library package file, e.g. from here. Unzip the .tar.gz and then re-archive it to a .zip. Put it somewhere in S3 that your Glue job has access to.
When setting up your glue job:
Make sure that your Python Library Path references the zip file
For job parameters, you need {"--conf": "spark.jars.packages=graphframes:graphframes:0.6.0-spark2.3-s_2.11"}
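If you create the job from the AWS CLI rather than the console, the same two settings can be passed as default arguments; a sketch, where the job name, role, bucket and script paths are placeholders:
# "--extra-py-files" corresponds to the "Python library path" field in the console.
aws glue create-job \
  --name my-graphframes-job \
  --role MyGlueServiceRole \
  --command Name=glueetl,ScriptLocation=s3://my-bucket/scripts/job.py \
  --default-arguments '{
    "--extra-py-files": "s3://my-bucket/libs/graphframes.zip",
    "--conf": "spark.jars.packages=graphframes:graphframes:0.6.0-spark2.3-s_2.11"
  }'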
Everyone looking for an answer, please read this:
In order to use an external package in AWS Glue PySpark or Python shell:
1) Clone the repo from the following URL:
https://github.com/bhavintandel/py-packager/tree/master
git clone git@github.com:bhavintandel/py-packager.git
cd py-packager
2)
Add your required package under requirements.txt. For ex.,
pygeohash
Update the version and project name under setup.py. For ex.,
VERSION = "0.1.0"
PACKAGE_NAME = "dependencies"
3) Run the following "command 1" to create a .zip package for PySpark, OR "command 2" to create an egg file for Python shell:
Command 1:
sudo make build_zip
Command 2:
sudo make bdist_egg
The above commands will generate the package in the dist folder.
4) Finally, upload this package from the dist directory to an S3 bucket. Then go to the AWS Glue job console, edit the job, find the script libraries option, click the folder icon of "Python library path", and select your S3 path.
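If you prefer the CLI to the console for the upload, a sketch (the bucket path is a placeholder and the generated file name depends on your setup.py):
# Upload the built package from dist/ so the Glue job can reference it
# in its "Python library path".
aws s3 cp dist/<generated_package_file> s3://<your-bucket>/glue/libs/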
Finally, use it in your Glue script:
import pygeohash as pgh
Done!
Also set the --user-jars-first: "true" parameter in the Glue job.

A sane way to set up CloudWatch logs (awslogs-agent)

tl;dr The configuration of the CloudWatch agent is #$%^. Is there any straightforward way?
I wanted one place to store the logs, so I used Amazon CloudWatch Logs Agent. At first it seemed like I'd just add a Resource saying something like "create a log group, then a log stream and send this file, thank you" - all declarative and neat, but...
According to this doc I had to set up a JSON configuration that created a bash script that downloaded a Python script that set up the service that used a generated config in yet-another-language somewhere else.
I'd think logging is something frequently used, so there must be a declarative way to configure it, not this four-language crazy combo. Am I missing something, or is the ops world this painful?
Thanks for ideas!
"Agent" is just an aws-cli plugin and a bunch of scripts. You can install the plugin with pip install awscli-cwlogs on most systems (assuming you already installed awscli itself). NOTE: I think Amazon Linux is not "most systems" and might require a different approach.
Then you'll need two configs: awscli config with the following content (also add credentials if needed and replace us-east-1 with your region):
[plugins]
cwlogs = cwlogs
[default]
region = us-east-1
and logging config with something like this (adjust to your needs according to the docs):
[general]
state_file = push-state
[logstream-cfn-init.log]
datetime_format = %Y-%m-%d %H:%M:%S,%f
file = /var/log/cfn-init.log
file_fingerprint_lines = 1-3
multi_line_start_pattern = {datetime_format}
log_group_name = ec2-logs
log_stream_name = {hostname}-{instance_id}/cfn-init.log
initial_position = start_of_file
encoding = utf_8
buffer_duration = 5000
After that, to start the daemon automatically you can create a systemd unit like this (change config paths to where you actually put them):
[Unit]
Description=CloudWatch logging daemon
[Service]
ExecStart=/usr/local/bin/aws logs push --config-file /etc/aws/cwlogs
Environment=AWS_CONFIG_FILE=/etc/aws/config
Restart=always
Type=simple
[Install]
WantedBy=multi-user.target
After that you can systemctl enable and systemctl start as usual. That's assuming your instance is running a distribution that uses systemd (which is most of them nowadays; if not, consult your distribution's documentation to learn how to run daemons).
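For example, assuming you saved the unit above as /etc/systemd/system/cwlogs.service (the unit name is up to you):
# Reload unit files, then enable the daemon at boot and start it now.
sudo systemctl daemon-reload
sudo systemctl enable cwlogs.service
sudo systemctl start cwlogs.service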
The official setup script also adds a config for logrotate; I skipped that part because it wasn't required in my case, but if your logs are rotated you might want to do something with it. Consult the setup script and the logrotate documentation for details (essentially you just need to restart the daemon whenever files are rotated).
You've linked documentation specific to CloudFormation, so a bunch of the complexity is probably associated with that context.
Here's the stand-alone documentation for the Cloudwatch Logs Agent:
Quick Start
Agent Reference
If you're on Amazon Linux, you can install the 'awslogs' system package via yum. Once that's done, you can enable the logs plugin for the AWS CLI by making sure you have the following section in the CLI's config file:
[plugins]
cwlogs = cwlogs
E.g., the system package should create a file under /etc/awslogs/awscli.conf. You can use that file by setting the...
AWS_CONFIG_FILE=/etc/awslogs/awscli.conf
...environment variable.
Once that's all done, you can:
$ aws logs push help
and
$ cat /path/to/some/file | aws logs push [options]
The agent also comes with helpers to keep various log files in sync.

Deploy .war to AWS

I want to deploy a .war from Jenkins to the cloud.
Could you please let me know how to deploy a war file from Jenkins on my local machine to AWS Elastic Beanstalk?
I tried using a Jenkins post-process plugin to copy the artifact to S3, but I get the following error:
ERROR: Failed to upload files java.io.IOException: put Destination [bucketName=https:, objectName=/s3-eu-west-1.amazonaws.com/bucketname/test.war]:
com.amazonaws.AmazonClientException: Unable to execute HTTP request: Connect to s3.amazonaws.com/s3.amazonaws.com/ timed out at hudson.plugins.s3.S3Profile.upload(S3Profile.java:85) at hudson.plugins.s3.S3BucketPublisher.perform(S3BucketPublisher.java:143)
Some work has been done on this.
http://purelyinstinctual.com/2013/03/18/automated-deployment-to-amazon-elastic-beanstalk-using-jenkins-on-ec2-part-2-guide/
Basically, this is just adding a post-build task to run the standard command line deployment scripts.
From the referenced page, assuming you have the post-build task plugin on Jenkins and the AWS command line tools installed:
STEP 1
In a Jenkins job configuration screen, add a "Post-build action" and choose the plugin "Publish artifacts to S3 bucket", then specify the Source (in our case, we use Maven so the source is target/*.war and the destination is your S3 bucket name).
STEP 2
Then, add a "Post-build task" (if you don't have it, this is a plugin in the Maven repo) to the same section above ("Post-build Actions") and drag it below "Publish artifacts to S3 bucket". This is important because we want to make sure the war file is uploaded to S3 before proceeding with the scripts.
In the Post-build task portion, make sure you check the box “Run script only if all previous steps were successful”
In the script text area, put in the path of the script to automate the deployment (described in step 3 below). For us, we put something like this:
<path_to_script_file>/deploy.sh "$VERSION_NUMBER" "$VERSION_DESCRIPTION"
The $VERSION_NUMBER and $VERSION_DESCRIPTION are Jenkins build parameters and must be specified when a deployment is triggered. Both variables will be used for the AEB deployment.
STEP 3
The script
#!/bin/sh
export AWS_CREDENTIAL_FILE=<path_to_your aws.key file>
export PATH=$PATH:<path to bin file inside the "api" folder inside the AEB Command line tool (A)>
export PATH=$PATH:<path to root folder of s3cmd (B)>
# Get the current time and append it to the name of the .war file being deployed.
# This creates a unique identifier for each .war file and allows us to roll back easily.
current_time=$(date +"%Y%m%d%H%M%S")
original_file="app.war"
new_file="app_$current_time.war"
# Rename the deployed war file with the new name.
s3cmd mv "s3://<your S3 bucket>/$original_file" "s3://<your S3 bucket>/$new_file"
# Create an application version in AEB and link it to the renamed WAR file.
elastic-beanstalk-create-application-version -a "Hoiio App" -l "$1" -d "$2" -s "<your S3 bucket>/$new_file"
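The elastic-beanstalk-* commands above come from the legacy API tools; with the current AWS CLI the same idea could be sketched roughly as follows (it reuses "$1", "$2" and $new_file from the script above, the environment name is a placeholder, and update-environment is the extra step that actually deploys the new version):
# Register the uploaded .war as a new application version...
aws elasticbeanstalk create-application-version \
  --application-name "Hoiio App" \
  --version-label "$1" \
  --description "$2" \
  --source-bundle S3Bucket="<your S3 bucket>",S3Key="$new_file"
# ...then point an environment at it to deploy that version.
aws elasticbeanstalk update-environment \
  --environment-name "<your EB environment>" \
  --version-label "$1"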