Is there a way to get info at runtime about the Spark metrics configuration?

I added a metrics.properties file to the resource directory (Maven project) with a CSV sink. Everything is fine when I run the Spark app locally: metrics appear. But when I deploy the same fat jar to Amazon EMR, I do not see any attempt to put metrics into the CSV sink. So I want to check at runtime which settings are loaded for the Spark metrics subsystem. Is there any way to do this?
I looked into SparkEnv.get.metricsSystem but didn't find anything there.

That is basically because Spark on EMR is not picking up your custom metrics.properties file from the resources directory of the fat jar.
On EMR, the preferred way to configure this is through the EMR Configurations API, in which you pass a classification and its properties as embedded JSON.
For the Spark metrics subsystem, here is an example that modifies a couple of metrics settings:
[
  {
    "Classification": "spark-metrics",
    "Properties": {
      "*.sink.csv.class": "org.apache.spark.metrics.sink.CsvSink",
      "*.sink.csv.period": "1"
    }
  }
]
You can use this JSON when creating the EMR cluster through the Amazon Console or an SDK.
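As a sketch, the same classification can also be passed programmatically when launching the cluster. The helper below builds the Configurations list; the commented-out boto3 call shows where it would go. The cluster name, release label, instance types, and role names are placeholder assumptions, not values from the question:

```python
# Build the "spark-metrics" classification shown above as a Python structure,
# suitable for the Configurations parameter of boto3's run_job_flow call.
def spark_metrics_configuration(period_seconds=1):
    return [
        {
            "Classification": "spark-metrics",
            "Properties": {
                "*.sink.csv.class": "org.apache.spark.metrics.sink.CsvSink",
                "*.sink.csv.period": str(period_seconds),
            },
        }
    ]

# Hypothetical cluster creation; all names below are placeholders.
# import boto3
# emr = boto3.client("emr", region_name="us-east-1")
# emr.run_job_flow(
#     Name="metrics-demo",
#     ReleaseLabel="emr-6.2.0",
#     Configurations=spark_metrics_configuration(),
#     Instances={"InstanceCount": 2,
#                "MasterInstanceType": "m5.xlarge",
#                "SlaveInstanceType": "m5.xlarge"},
#     JobFlowRole="EMR_EC2_DefaultRole",
#     ServiceRole="EMR_DefaultRole",
# )
```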

Related

Does AWS Sagemaker PySparkProcessor manage autoscaling?

I'm using Sagemaker to do preprocessing and generate training data, and I'm following the Sagemaker API documentation here, but I don't see any way currently to specify autoscaling within the EMR cluster. What should I include in the configuration argument that I pass to my spark_processor's run() method? What shouldn't I include?
I'm aware of this resource, but it doesn't seem comprehensive.
Below is my code; it is very much a work in progress, but I would like to know if someone could provide me with, or point me to, a resource that shows:
Whether this PySparkProcessor object will manage autoscaling automatically. Should I put AutoScaling config within the configuration in the run() object?
An example of the full config that I can pass to the configuration variable.
Here's what I have so far for the configuration.
SPARK_CONFIG = {
    "Configurations": [
        {
            "Classification": "spark-env",
            "Configurations": [{"Classification": "export"}],
        }
    ]
}
spark_processor = PySparkProcessor(
    tags=TAGS,
    role=IAM_ROLE,
    instance_count=2,
    py_version="py37",
    volume_size_in_gb=30,
    container_version="1",
    framework_version="3.0",
    network_config=sm_network,
    max_runtime_in_seconds=1800,
    instance_type="ml.m5.2xlarge",
    base_job_name=EMR_CLUSTER_NAME,
    sagemaker_session=sagemaker_session,
)
spark_processor.run(
    configuration=SPARK_CONFIG,
    submit_app=LOCAL_PYSPARK_SCRIPT_DIR,
    spark_event_logs_s3_uri=f"s3://{BUCKET_NAME}/{S3_PYSPARK_LOG_PREFIX}",
)
I'm used to interacting via Python more directly with EMR for these types of tasks. Doing that allows me to specify the entire EMR cluster config at once--including applications, autoscaling, EMR default and autoscaling roles--and then adding the steps to the cluster once it's created; however, much of this config seems to be abstracted away, and I don't know what remains or needs to be specified, specifically regarding the following config variables: AutoScalingRole, Applications, VisibleToAllUsers, JobFlowRole/ServiceRole etc.
I found the answer in the Sagemaker Python SDK on GitHub.
_valid_configuration_keys = ["Classification", "Properties", "Configurations"]
_valid_configuration_classifications = [
    "core-site",
    "hadoop-env",
    "hadoop-log4j",
    "hive-env",
    "hive-log4j",
    "hive-exec-log4j",
    "hive-site",
    "spark-defaults",
    "spark-env",
    "spark-log4j",
    "spark-hive-site",
    "spark-metrics",
    "yarn-env",
    "yarn-site",
    "export",
]
Thus, specifying autoscaling, visibility, and some other cluster-level configurations does not seem to be supported. However, the applications installed at cluster startup seem to depend on the classifications in the above list.
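The lists above can be turned into a quick pre-flight check on a configuration before calling run(). This is a minimal sketch mirroring the SDK's validation; the function name and error messages are my own, not part of the SageMaker SDK:

```python
# Keys and classifications accepted by the SageMaker Python SDK's
# configuration validation, per the lists above.
VALID_KEYS = ["Classification", "Properties", "Configurations"]
VALID_CLASSIFICATIONS = [
    "core-site", "hadoop-env", "hadoop-log4j", "hive-env", "hive-log4j",
    "hive-exec-log4j", "hive-site", "spark-defaults", "spark-env",
    "spark-log4j", "spark-hive-site", "spark-metrics", "yarn-env",
    "yarn-site", "export",
]

def validate_configuration(config):
    """Recursively check one configuration dict against the valid lists."""
    for key in config:
        if key not in VALID_KEYS:
            raise ValueError(f"invalid configuration key: {key}")
    classification = config.get("Classification")
    if classification not in VALID_CLASSIFICATIONS:
        raise ValueError(f"invalid classification: {classification}")
    for nested in config.get("Configurations", []):
        validate_configuration(nested)

# Example: the SPARK_CONFIG from the question passes this check.
for entry in [{"Classification": "spark-env",
               "Configurations": [{"Classification": "export"}]}]:
    validate_configuration(entry)
```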

AWS Kinesis Data Analytics application (Flink): change a property originally located in flink-conf.yaml

As the runtime for my Flink application I use AWS managed Flink (Kinesis Data Analytics Application).
I added functionality (a sink) that writes processed events from a Kinesis queue to S3 in Parquet format.
Locally everything works for me, but when I try to run the application in the cloud I get the following exception:
"throwableInformation": [
"com.esotericsoftware.kryo.KryoException: Error constructing instance of class: org.apache.avro.Schema$LockableArrayList",
"Serialization trace:",
"types (org.apache.avro.Schema$UnionSchema)",
"schema (org.apache.avro.Schema$Field)",
"fieldMap (org.apache.avro.Schema$RecordSchema)",
After researching the problem, I found that I need to change the following property (verified on a local cluster):
classloader.resolve-order: child-first -> classloader.resolve-order: parent-first
Is it possible to change this configuration in any way when using AWS managed Flink (not EMR, but Kinesis Data Analytics applications)?
AWS support's answer: No, this property cannot be changed.

Google Dataprep copy flows from one project to another

I have two Google projects: dev and prod. I also import data from storage buckets located in these projects: dev-bucket and prod-bucket.
After I have made and tested changes in the dev environment, how can I smoothly apply (deploy/copy) the changes to prod as well?
What I do now is export the flow from dev and then re-import it into prod. However, each time I need to manually do the following in the prod flows:
Change the datasets that serve as inputs to the flow
Replace the manual and scheduled destinations with the right BigQuery dataset (dev-dataset-bigquery and prod-dataset-bigquery)
How can this be done more smoothly?
If you want to copy data between Google Cloud Storage (GCS) buckets dev-bucket and prod-bucket, Google provides a Storage Transfer Service with this functionality. https://cloud.google.com/storage-transfer/docs/create-manage-transfer-console You can either manually trigger data to be copied from one bucket to another or have it run on a schedule.
For the second part, it sounds like both dev-dataset-bigquery and prod-dataset-bigquery are loaded from files in GCS? If this is the case, the BigQuery Transfer Service may be of use. https://cloud.google.com/bigquery/docs/cloud-storage-transfer You can trigger a transfer job manually, or have it run on a schedule.
As others have said in the comments, if you need to verify data before initiating transfers from dev to prod, a CI system such as Spinnaker may help. If the verification can be automated, a system such as Apache Airflow (running on Cloud Composer, if you want a hosted version) provides more flexibility than the transfer services.
Follow the procedure below to move a flow from one environment to another using the API, updating the input dataset and the output for the new environment.
1) Export the plan:
GET https://api.clouddataprep.com/v4/plans/<plan_id>/package
2) Import the plan:
POST https://api.clouddataprep.com/v4/plans/package
3) Update the input dataset:
PUT https://api.clouddataprep.com/v4/importedDatasets/<dataset_id>
{
  "name": "<new_dataset_name>",
  "bucket": "<bucket_name>",
  "path": "<bucket_file_name>"
}
4) Update the output:
PATCH https://api.clouddataprep.com/v4/outputObjects/<output_id>
{
  "publications": [
    {
      "path": [
        "<project_name>",
        "<dataset_name>"
      ],
      "tableName": "<table_name>",
      "targetType": "bigquery",
      "action": "create"
    }
  ]
}
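The four calls above can be sketched as small helpers, each returning a (method, url, json_body) tuple so the request can be inspected before sending it with any HTTP client along with your access token. The helper names are my own, not part of the Dataprep API:

```python
# Sketch of the four-step promotion flow using the endpoints listed above.
# Plan, dataset, and output IDs are placeholders from your own project.
BASE = "https://api.clouddataprep.com/v4"

def export_plan(plan_id):
    # 1) Export the plan package from the dev project
    return ("GET", f"{BASE}/plans/{plan_id}/package", None)

def import_plan():
    # 2) Import the package into the prod project (package bytes as body)
    return ("POST", f"{BASE}/plans/package", None)

def update_input_dataset(dataset_id, name, bucket, path):
    # 3) Point the imported dataset at the prod bucket
    return ("PUT", f"{BASE}/importedDatasets/{dataset_id}",
            {"name": name, "bucket": bucket, "path": path})

def update_output(output_id, project, dataset, table):
    # 4) Re-target the output publication at the prod BigQuery dataset
    return ("PATCH", f"{BASE}/outputObjects/{output_id}",
            {"publications": [{"path": [project, dataset],
                               "tableName": table,
                               "targetType": "bigquery",
                               "action": "create"}]})
```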

ElasticSearch Migrate Data

I want to migrate data from Amazon AWS ElasticSearch version 2.3 to 5.1 and have created a data snapshot in S3. Now, how do I copy those dump files from S3 to ES 5.1?
You're in luck; according to the Elasticsearch documentation:
A snapshot of an index created in 2.x can be restored to 5.x.
Also, Elasticsearch has:
repository-s3 for S3 repository support
which should be very useful in your case, I assume.
The full guide on how to do this is here. I will summarize some of the steps:
Before any snapshot or restore operation can be performed, a snapshot
repository should be registered in Elasticsearch. The repository
settings are repository-type specific. See below for details.
PUT /_snapshot/my_backup
{
  "type": "fs",
  "settings": {
    ... repository specific settings ...
  }
}
After several other steps, like verifying the snapshot, you can restore it:
A snapshot can be restored using the following command:
POST /_snapshot/my_backup/snapshot_1/_restore
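As a sketch, the register-and-restore sequence could look like this against a local node, using the "s3" repository type from the repository-s3 plugin. The bucket name and region are placeholder assumptions:

```python
import json
import urllib.request

ES = "http://localhost:9200"  # assumed local endpoint

def s3_repository_body(bucket, region):
    # Repository definition for the repository-s3 plugin
    return {"type": "s3", "settings": {"bucket": bucket, "region": region}}

def register_repository(repo, bucket, region):
    # PUT /_snapshot/<repo> with the repository definition
    req = urllib.request.Request(
        f"{ES}/_snapshot/{repo}",
        data=json.dumps(s3_repository_body(bucket, region)).encode(),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
    return urllib.request.urlopen(req)

def restore_snapshot(repo, snapshot):
    # POST /_snapshot/<repo>/<snapshot>/_restore
    req = urllib.request.Request(
        f"{ES}/_snapshot/{repo}/{snapshot}/_restore", method="POST")
    return urllib.request.urlopen(req)
```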

Elasticsearch Snapshot Fails With RepositoryMissingException

Three-node Elasticsearch cluster on AWS. Bigdesk and Head both show a healthy cluster. All three nodes are running ES 1.3 and the latest Amazon Linux updates. When I fire off a snapshot request like:
http://localhost:9200/_snapshot/taxanalyst/201409031540-snapshot?wait_for_completion=true
the server churns away for several minutes before responding with the following:
{
  "snapshot": {
    "snapshot": "201409031521-snapshot",
    "indices": [
      "docs",
      "pdflog"
    ],
    "state": "PARTIAL",
    "start_time": "2014-09-03T19:21:36.034Z",
    "start_time_in_millis": 1409772096034,
    "end_time": "2014-09-03T19:28:48.685Z",
    "end_time_in_millis": 1409772528685,
    "duration_in_millis": 432651,
    "failures": [
      {
        "node_id": "ikauhFYEQ02Mca8fd1E4jA",
        "index": "pdflog",
        "reason": "RepositoryMissingException[[faxmanalips] missing]",
        "shard_id": 0,
        "status": "INTERNAL_SERVER_ERROR"
      }
    ],
    "shards": {
      "total": 10,
      "failed": 1,
      "successful": 9
    }
  }
}
These are three nodes on three different virtual EC2 machines, but they're able to communicate via 9300/9200 without any problems. Indexing and searching works as expected. There doesn't appear to be anything in the elasticsearch log files that speaks to the server error.
Does anyone know what's going on here, or at least where a good place to start would be?
UPDATE: Turns out that each of the nodes in the cluster need to have snapshot directories that match the directory specified when you register the snapshot with the elasticsearch cluster.
I guess the next question is: when you want to tgz up the snapshot directory so you can archive it, or provision a backup cluster, is it sufficient to just tgz the snapshot directory on the Master node? Or do you have to somehow consolidate the snapshot directories of all the nodes. (That can't be right, can it?)
Elasticsearch supports a shared file system repository, which uses a shared file system to store snapshots.
In order to register the shared file system repository, it is necessary to mount the same shared filesystem at the same location on all master and data nodes.
All you need to do is put the same repository path in elasticsearch.yml on all 3 nodes, for example:
path.repo: ["/my_repository"]
I think you are looking for this AWS plugin for Elasticsearch (I guess you already installed it to configure your cluster): https://github.com/elasticsearch/elasticsearch-cloud-aws#s3-repository
It will allow you to create a repository mapped to an S3 bucket.
To use a snapshot (create, restore, or anything else), you need to create a repository first. Then, when you perform actions on a snapshot, Elasticsearch will manage it directly in your S3 bucket.