Pyspark - read data from elasticsearch cluster on EMR - amazon-web-services

I am trying to read data from Elasticsearch in PySpark, using the elasticsearch-hadoop API in Spark. The ES cluster sits on AWS EMR, which requires credentials to sign in. My script is as follows:
from pyspark import SparkContext, SparkConf
sc.stop()
conf = SparkConf().setAppName("ESTest")
sc = SparkContext(conf=conf)
es_read_conf = {
    "es.host": "vhost", "es.nodes": "node", "es.port": "443",
    "es.query": '{ "query": { "match_all": {} } }',
    "es.input.json": "true", "es.net.https.auth.user": "aws_access_key",
    "es.net.https.auth.pass": "aws_secret_key", "es.net.ssl": "true",
    "es.resource": "index/type", "es.nodes.wan.only": "true"
}
es_rdd = sc.newAPIHadoopRDD(
    inputFormatClass="org.elasticsearch.hadoop.mr.EsInputFormat",
    keyClass="org.apache.hadoop.io.NullWritable",
    valueClass="org.elasticsearch.hadoop.mr.LinkedMapWritable",
    conf=es_read_conf)
PySpark keeps throwing this error:
py4j.protocol.Py4JJavaError: An error occurred while calling
z:org.apache.spark.api.python.PythonRDD.newAPIHadoopRDD.
: org.elasticsearch.hadoop.rest.EsHadoopInvalidRequest: [HEAD] on
[index] failed; servernode:443] returned [403|Forbidden:]
I checked everything, and it all made sense except for the user and pass entries. Would the AWS access key and secret key work here? We don't want to use the console username and password, for security reasons. Is there a different way to do the same thing?

Related

AWS SSM error while targets.1.member.values failed to satisfy constraint: Member must have length less than or equal to 50

I am trying to run an SSM command on more than 50 EC2 instances in my fleet, using the SSM client from AWS boto3. My code is given below. Running it produces an unexpected error.
# running ec2 instances
instances = client.describe_instances()
instance_ids = [
    inst["InstanceId"]
    for reservation in instances["Reservations"]
    for inst in reservation["Instances"]
]  # might contain more than 50 instances
# run command
run_cmd_resp = ssm_client.send_command(
    Targets=[
        {"Key": "InstanceIds", "Values": instance_ids},
    ],
    DocumentName="AWS-RunShellScript",
    DocumentVersion="1",
    Parameters={
        "commands": ["#!/bin/bash", "ls -ltrh", "# some commands"]
    }
)
Executing this produces the error below:
An error occurred (ValidationException) when calling the SendCommand operation: 1 validation error detected: Value '[...91 instance IDs...]' at 'targets.1.member.values' failed to satisfy constraint: Member must have length less than or equal to 50.
How do I run the SSM command on my whole fleet?
As the error message and the boto3 documentation (link) indicate, a single send_command call accepts at most 50 instance IDs. To run the SSM command on all instances, one solution is to split the original list into chunks of 50.
FYI: If your account has a large number of instances, describe_instances() can't retrieve all of them in one API call, so check whether NextToken is present in the response.
ref: How do you use "NextToken" in AWS API calls
# running ec2 instances
instances = client.describe_instances()
instance_ids = [inst["InstanceId"] for res in instances["Reservations"] for inst in res["Instances"]]
# keep fetching pages while the response carries a NextToken
while "NextToken" in instances:
    instances = client.describe_instances(NextToken=instances["NextToken"])
    instance_ids += [inst["InstanceId"] for res in instances["Reservations"] for inst in res["Instances"]]
# run command in batches of at most 50 instances
for i in range(0, len(instance_ids), 50):
    target_instances = instance_ids[i : i + 50]
    run_cmd_resp = ssm_client.send_command(
        Targets=[
            {"Key": "InstanceIds", "Values": target_instances},
        ],
        DocumentName="AWS-RunShellScript",
        DocumentVersion="1",
        Parameters={
            "commands": ["#!/bin/bash", "ls -ltrh", "# some commands"]
        }
    )
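As an alternative to following NextToken by hand, boto3 paginators can do the looping. The following is a minimal sketch, not part of the original answer. It assumes an EC2 client is passed in, and the helper name collect_instance_ids is made up for illustration:

```python
def collect_instance_ids(ec2_client):
    """Return every instance ID that describe_instances reports, across all pages."""
    # get_paginator wraps the NextToken handling for describe_instances
    paginator = ec2_client.get_paginator("describe_instances")
    instance_ids = []
    for page in paginator.paginate():
        # Each page groups instances under Reservations -> Instances
        for reservation in page["Reservations"]:
            for inst in reservation["Instances"]:
                instance_ids.append(inst["InstanceId"])
    return instance_ids


# Hypothetical usage (requires boto3 and AWS credentials):
# import boto3
# instance_ids = collect_instance_ids(boto3.client("ec2"))
```

The resulting list can then be sliced into groups of 50 as shown above.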
Finally, following @Rohan Kishibe's answer, I implemented the batched execution below for the SSM AWS-RunShellScript document.
import math
ec2_ids_all = [...] # all instance IDs fetched by pagination.
PG_START, PG_STOP = 0, 50
PG_SIZE = 50
PG_COUNT = math.ceil(len(ec2_ids_all) / PG_SIZE)
for page in range(PG_COUNT):
    # send the command to the current slice of at most PG_SIZE instances
    cmd = ssm.send_command(
        Targets=[{"Key": "InstanceIds", "Values": ec2_ids_all[PG_START:PG_STOP]}],
        DocumentName="AWS-RunShellScript",
        Parameters={"commands": ["ls -ltrh", "# other commands"]}
    )
    PG_START += PG_SIZE
    PG_STOP += PG_SIZE
This way, the instance IDs are split into batches of at most 50, and the command is run on each batch in turn. One can also save the command IDs and each batch's instance IDs in a mapping for later use.
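The mapping idea could be sketched roughly as follows. This is an illustration under assumptions, not the original poster's code: the function name send_in_batches is hypothetical, and the SSM client is passed in as a parameter.

```python
def send_in_batches(ssm_client, instance_ids, commands, batch_size=50):
    """Send one SSM command per batch and return {CommandId: [instance IDs in that batch]}."""
    command_to_instances = {}
    for start in range(0, len(instance_ids), batch_size):
        batch = instance_ids[start:start + batch_size]
        response = ssm_client.send_command(
            Targets=[{"Key": "InstanceIds", "Values": batch}],
            DocumentName="AWS-RunShellScript",
            Parameters={"commands": commands},
        )
        # send_command returns the new command's ID under Command -> CommandId
        command_to_instances[response["Command"]["CommandId"]] = batch
    return command_to_instances


# Hypothetical usage (requires boto3 and AWS credentials):
# import boto3
# mapping = send_in_batches(boto3.client("ssm"), ec2_ids_all, ["ls -ltrh"])
```

Keeping the command ID next to its batch makes it possible to look up the results of each batch later.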

AWS sagemaker endpoint received client (400) error

I've deployed a tensorflow multi-label classification model using a sagemaker endpoint as follows:
predictor = sagemaker_model.deploy(initial_instance_count=1, instance_type="ml.m5.2xlarge", endpoint_name='testing-2')
It gets deployed and works fine when I invoke it from the Sagemaker Jupyter instance:
sample = ['this movie was extremely good']
output=predictor.predict(sample)
output:
{'predictions': [[0.00370046496,
4.32942124e-06,
0.00080883503,
9.25126587e-05,
0.00023958087,
0.000130862]]}
However, I am unable to send a request to the deployed endpoint from other notebooks or SageMaker Studio. I'm unsure of the request format.
I've tried several variations of the input format, and all of them failed. The error message is below:
Request:
{
"body": {
"text": "Testing model's prediction on this text"
},
"contentType": "application/json",
"endpointName": "testing-2",
"customURL": "",
"customHeaders": [
{
"Key": "sm_endpoint_name",
"Value": "testing-2"
}
]
}
Error:
Error invoking endpoint: Received client error (400) from primary with message "{ "error": "Failed to process element:
0 key: text of 'instances' list. Error: INVALID_ARGUMENT: JSON object: does not have named input: text" }".
See https://us-west-2.console.aws.amazon.com/cloudwatch/home?region=us-west-2#logEventViewer:group=/aws/sagemaker/Endpoints/testing-2
in account 793433463428 for more information.
Is there any way to find out exactly how the model expects the request format to be?
Earlier I had the same model on my local system and the way I tested it was using this curl request:
curl -s -H 'Content-Type: application/json' -d '{"text": "what ugly posts"}' http://localhost:7070/sentiment
And it worked fine without any issues.
I've tried different formats and replaced the "text" key inside body with other keys such as "input" and "body", and with nothing at all.
Based on your description above, I assume you are deploying the TensorFlow model using the SageMaker TensorFlow container.
If you want to view what your model expects as input you can use the saved_model CLI:
1
├── keras_metadata.pb
├── saved_model.pb
└── variables
    ├── variables.data-00000-of-00001
    └── variables.index
!saved_model_cli show --all --dir {"1"}
After you have confirmed the input name above you can invoke the endpoint as follows:
import json
import boto3
client = boto3.client('runtime.sagemaker')
data = {"instances": ['this movie was extremely good']}
response = client.invoke_endpoint(EndpointName="<EndpointName>",  # replace with your endpoint name
                                  ContentType="application/json",
                                  Body=json.dumps(data))
response_body = response['Body']
print(response_body.read())
The same payload can then also be used in Studio when invoking the endpoint.
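The error in the question suggests that the body {"text": ...} was treated as an instance with a named input "text", which the model does not have. The following is a minimal sketch of building the "instances" payload in the TensorFlow Serving row format. The helper name build_payload is hypothetical, and any input name passed to it is an assumption that should be checked against the saved_model_cli output:

```python
import json


def build_payload(texts, input_name=None):
    """Build a TensorFlow Serving style request body for a list of input texts."""
    if input_name is None:
        # Single-input model: each instance is just the raw value
        instances = list(texts)
    else:
        # Named input: each instance is an object keyed by the model's input name
        instances = [{input_name: text} for text in texts]
    return json.dumps({"instances": instances})


body = build_payload(["this movie was extremely good"])
print(body)
```

The returned string can be passed as Body to invoke_endpoint, with ContentType set to "application/json".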

AWS Glue - Kafka Connection using SASL/SCRAM

I am trying to create an AWS Glue Streaming job that reads from Kafka (MSK) clusters using SASL/SCRAM client authentication for the connection, per
https://aws.amazon.com/about-aws/whats-new/2022/05/aws-glue-supports-sasl-authentication-apache-kafka/
The connection configuration has the following properties (plus adequate subnet and security groups):
"ConnectionProperties": {
"KAFKA_SASL_SCRAM_PASSWORD": "apassword",
"KAFKA_BOOTSTRAP_SERVERS": "theserver:9096",
"KAFKA_SASL_MECHANISM": "SCRAM-SHA-512",
"KAFKA_SASL_SCRAM_USERNAME": "auser",
"KAFKA_SSL_ENABLED": "false"
}
And the actual API method call is:
df = glue_context.create_data_frame.from_options(
    connection_type="kafka",
    connection_options={
        "connectionName": "kafka-glue-connector",
        "security.protocol": "SASL_SSL",
        "classification": "json",
        "startingOffsets": "latest",
        "topicName": "atopic",
        "inferSchema": "true",
        "typeOfData": "kafka",
        "numRetries": 1,
    }
)
When the job runs, the logs show the client attempting to connect to the brokers using Kerberos (GSSAPI), and it fails with:
22/10/19 18:45:54 INFO ConsumerConfig: ConsumerConfig values:
sasl.mechanism = GSSAPI
security.protocol = SASL_SSL
security.providers = null
send.buffer.bytes = 131072
...
org.apache.kafka.common.errors.SaslAuthenticationException: Failed to configure SaslClientAuthenticator
Caused by: org.apache.kafka.common.KafkaException: Principal could not be determined from Subject, this may be a transient failure due to Kerberos re-login
How can I authenticate the AWS Glue job using SASL/SCRAM? What properties do I need to set in the connection and in the method call?
Thank you

Read/write to AWS S3 from Apache Spark Kubernetes container via vpc endpoint giving 400 Bad Request

I am trying to read and write data to AWS S3 from an Apache Spark Kubernetes container via a VPC endpoint.
The Kubernetes container is on premises (in a data center) in the US region. The following PySpark code connects to S3:
from pyspark.conf import SparkConf
from pyspark.sql import SparkSession
conf = (
    SparkConf()
    .setAppName("PySpark S3 Example")
    .set("spark.hadoop.fs.s3a.endpoint.region", "us-east-1")
    .set("spark.hadoop.fs.s3a.endpoint","<vpc-endpoint>")
    .set("spark.hadoop.fs.s3a.access.key", "<access_key>")
    .set("spark.hadoop.fs.s3a.secret.key", "<secret_key>")
    .set("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
    .set("spark.driver.extraJavaOptions", "-Dcom.amazonaws.services.s3.enforceV4=true")
    .set("spark.executor.extraJavaOptions","-Dcom.amazonaws.services.s3.enableV4=true")
    .set("spark.executor.extraJavaOptions", "-Dcom.amazonaws.services.s3.enforceV4=true")
    .set("spark.fs.s3a.path.style.access", "true")
    .set("spark.hadoop.fs.s3a.server-side-encryption-algorithm","SSE-KMS")
    .set("spark.hadoop.fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider")
)
spark = SparkSession.builder.config(conf=conf).getOrCreate()
data = [{"key1": "value1", "key2": "value2"}, {"key1":"val1","key2":"val2"}]
df = spark.createDataFrame(data)
df.write.format("json").mode("append").save("s3a://<bucket-name>/test/")
Exception Raised:
py4j.protocol.Py4JJavaError: An error occurred while calling o91.save.
: org.apache.hadoop.fs.s3a.AWSBadRequestException: doesBucketExist on <bucket-name>
: com.amazonaws.services.s3.model.AmazonS3Exception: Bad Request (Service: Amazon S3; Status Code: 400; Error Code: 400 Bad Request; Request ID: <requestID>;
Any help would be appreciated
Unless your Hadoop S3A client is region-aware (3.3.1+), setting that region option won't work. There's an AWS SDK option, "aws.region", which you can set as a system property instead.
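One way to pass that system property is through the JVM options Spark already accepts. The sketch below only builds the configuration pairs. The helper names region_java_options and s3a_region_conf are made up for illustration, and the region value is taken from the question:

```python
def region_java_options(region, existing=""):
    """Append -Daws.region=<region> to an existing JVM options string."""
    flag = f"-Daws.region={region}"
    return f"{existing} {flag}".strip()


def s3a_region_conf(region, existing_opts=""):
    """Return (key, value) pairs that set aws.region on the driver and executor JVMs."""
    opts = region_java_options(region, existing_opts)
    return [
        ("spark.driver.extraJavaOptions", opts),
        ("spark.executor.extraJavaOptions", opts),
    ]


pairs = s3a_region_conf("us-east-1", "-Dcom.amazonaws.services.s3.enforceV4=true")
for key, value in pairs:
    print(key, "=", value)

# Hypothetical usage with PySpark:
# from pyspark.conf import SparkConf
# conf = SparkConf().setAll(pairs)
```

Note that in client mode the driver JVM may already be running by the time SparkConf is built, in which case the driver option likely has to be supplied on the spark-submit command line (--driver-java-options) instead.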

AWS Machine Learning error when creating DataSource from S3

I am having an error when trying to automate AWS DataSource creation from S3:
I am running a shell script:
#!/bin/bash
for k in 1 2 3 4 5
do
aws machinelearning create-data-source-from-s3 --cli-input-json file://data/cfg/dsrc_training_00$k.json
aws machinelearning create-data-source-from-s3 --cli-input-json file://data/cfg/dsrc_validate_00$k.json
done
and here is an example of the json file it references:
{
"DataSourceId": "Iris_training_00{k}",
"DataSourceName": "[DS Iris] training 00{k}",
"DataSpec": {
"DataLocationS3": "s3://ml-test-predicto-bucket/shuffled_{k}.csv",
"DataSchemaLocationS3": "s3://ml-test-predicto-bucket/dsrc_iris.csv.schema",
"DataRearrangement": {"splitting":{"percentBegin" : 0, "percentEnd" : 70}}
},
"ComputeStatistics": true
}
But when I run my script from the command line I get the error:
Parameter validation failed:
Invalid type for parameter DataSpec.DataRearrangement, value: {u'splitting': {u'percentEnd': u'100', u'percentBegin': u'70'}}, type: <type 'dict'>, valid types: <type 'basestring'>
Can someone please help? I have looked at the AWS ML API documentation and I think I am doing everything right, but I can't seem to solve this error. Many thanks!
The DataRearrangement element expects a JSON string, but you are passing a dictionary (a JSON object).
Change:
"DataRearrangement": {"splitting":{"percentBegin" : 0, "percentEnd" : 70}}
[to]
"DataRearrangement": "{\"splitting\":{\"percentBegin\":0,\"percentEnd\":70}}"