AWS Glue - Kafka Connection using SASL/SCRAM - amazon-web-services

I am trying to create an AWS Glue Streaming job that reads from Kafka (MSK) clusters using SASL/SCRAM client authentication for the connection, per
https://aws.amazon.com/about-aws/whats-new/2022/05/aws-glue-supports-sasl-authentication-apache-kafka/
The connection configuration has the following properties (plus adequate subnet and security groups):
"ConnectionProperties": {
"KAFKA_SASL_SCRAM_PASSWORD": "apassword",
"KAFKA_BOOTSTRAP_SERVERS": "theserver:9096",
"KAFKA_SASL_MECHANISM": "SCRAM-SHA-512",
"KAFKA_SASL_SCRAM_USERNAME": "auser",
"KAFKA_SSL_ENABLED": "false"
}
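For reference, a minimal boto3 sketch of how a Glue connection with these properties could be created (the connection name kafka-glue-connector is taken from the job options below; the subnet, security group, and availability zone values are placeholders):

import boto3

# Hypothetical sketch: create the Glue Kafka connection carrying the SASL/SCRAM
# properties. Network values below are placeholders, not from the original post.
glue = boto3.client("glue")

glue.create_connection(
    ConnectionInput={
        "Name": "kafka-glue-connector",
        "ConnectionType": "KAFKA",
        "ConnectionProperties": {
            "KAFKA_BOOTSTRAP_SERVERS": "theserver:9096",
            "KAFKA_SASL_MECHANISM": "SCRAM-SHA-512",
            "KAFKA_SASL_SCRAM_USERNAME": "auser",
            "KAFKA_SASL_SCRAM_PASSWORD": "apassword",
            "KAFKA_SSL_ENABLED": "false",
        },
        "PhysicalConnectionRequirements": {
            "SubnetId": "subnet-xxxxxxxx",
            "SecurityGroupIdList": ["sg-xxxxxxxx"],
            "AvailabilityZone": "us-east-1a",
        },
    }
)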
And the actual API method call is:
df = glue_context.create_data_frame.from_options(
    connection_type="kafka",
    connection_options={
        "connectionName": "kafka-glue-connector",
        "security.protocol": "SASL_SSL",
        "classification": "json",
        "startingOffsets": "latest",
        "topicName": "atopic",
        "inferSchema": "true",
        "typeOfData": "kafka",
        "numRetries": 1,
    },
)
When the job runs, the logs show the client attempting to connect to the brokers using Kerberos (GSSAPI), and it fails with:
22/10/19 18:45:54 INFO ConsumerConfig: ConsumerConfig values:
sasl.mechanism = GSSAPI
security.protocol = SASL_SSL
security.providers = null
send.buffer.bytes = 131072
...
org.apache.kafka.common.errors.SaslAuthenticationException: Failed to configure SaslClientAuthenticator
Caused by: org.apache.kafka.common.KafkaException: Principal could not be determined from Subject, this may be a transient failure due to Kerberos re-login
How can I authenticate the AWS Glue job using SASL/SCRAM? What properties do I need to set in the connection and in the method call?
Thank you

Related

Read/write to AWS S3 from Apache Spark Kubernetes container via vpc endpoint giving 400 Bad Request

I am trying to read and write data to AWS S3 from an Apache Spark Kubernetes container via a VPC endpoint.
The Kubernetes container is on premises (data center) in the US region. Following is the PySpark code to connect to S3:
from pyspark.conf import SparkConf
from pyspark.sql import SparkSession

conf = (
    SparkConf()
    .setAppName("PySpark S3 Example")
    .set("spark.hadoop.fs.s3a.endpoint.region", "us-east-1")
    .set("spark.hadoop.fs.s3a.endpoint", "<vpc-endpoint>")
    .set("spark.hadoop.fs.s3a.access.key", "<access_key>")
    .set("spark.hadoop.fs.s3a.secret.key", "<secret_key>")
    .set("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
    .set("spark.driver.extraJavaOptions", "-Dcom.amazonaws.services.s3.enforceV4=true")
    .set("spark.executor.extraJavaOptions", "-Dcom.amazonaws.services.s3.enableV4=true")
    .set("spark.executor.extraJavaOptions", "-Dcom.amazonaws.services.s3.enforceV4=true")
    .set("spark.fs.s3a.path.style.access", "true")
    .set("spark.hadoop.fs.s3a.server-side-encryption-algorithm", "SSE-KMS")
    .set("spark.hadoop.fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider")
)
spark = SparkSession.builder.config(conf=conf).getOrCreate()
data = [{"key1": "value1", "key2": "value2"}, {"key1":"val1","key2":"val2"}]
df = spark.createDataFrame(data)
df.write.format("json").mode("append").save("s3a://<bucket-name>/test/")
Exception Raised:
py4j.protocol.Py4JJavaError: An error occurred while calling o91.save.
: org.apache.hadoop.fs.s3a.AWSBadRequestException: doesBucketExist on <bucket-name>
: com.amazonaws.services.s3.model.AmazonS3Exception: Bad Request (Service: Amazon S3; Status Code: 400; Error Code: 400 Bad Request; Request ID: <requestID>;
Any help would be appreciated.
Unless your Hadoop S3A client is region aware (3.3.1+), setting that region option won't work. There's an AWS SDK option "aws.region" which you can set as a system property instead.
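A minimal sketch of what that comment suggests, assuming the SDK honours the aws.region JVM system property and reusing the us-east-1 region and VPC endpoint from the original configuration:

from pyspark.conf import SparkConf

# Sketch: pass the AWS SDK region as a JVM system property instead of relying
# on fs.s3a.endpoint.region (which is only honoured by Hadoop 3.3.1+).
# Driver JVM options may need to go through spark-submit --driver-java-options
# instead, depending on the deploy mode.
conf = (
    SparkConf()
    .setAppName("PySpark S3 Example")
    .set("spark.hadoop.fs.s3a.endpoint", "<vpc-endpoint>")
    .set("spark.driver.extraJavaOptions", "-Daws.region=us-east-1")
    .set("spark.executor.extraJavaOptions", "-Daws.region=us-east-1")
)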

Logstash AWS solving code 403 trying to reconnect

I'm trying to push documents from local to elastic server in AWS, and when trying to do so I get 403 error and Logstash keeps on trying to establish connection with the server like so:
[2021-05-09T11:09:52,707][TRACE][logstash.inputs.file ][main] Registering file input {:path=>["~/home/ubuntu/json_try/json_try.json"]}
[2021-05-09T11:09:52,737][DEBUG][logstash.javapipeline ][main] Shutdown waiting for worker thread {:pipeline_id=>"main", :thread=>"#<Thread:0x5033269f run>"}
[2021-05-09T11:09:53,441][DEBUG][logstash.outputs.amazonelasticsearch][main] Waiting for connectivity to Elasticsearch cluster. Retrying in 4s
[2021-05-09T11:09:56,403][INFO ][logstash.outputs.amazonelasticsearch][main] Running health check to see if an Elasticsearch connection is working {:healthcheck_url=>https://my-dom.co:8001/scans, :path=>"/"}
[2021-05-09T11:09:56,461][WARN ][logstash.outputs.amazonelasticsearch][main] Attempted to resurrect connection to dead ES instance, but got an error. {:url=>"https://my-dom.co:8001/scans", :error_type=>LogStash::Outputs::AmazonElasticSearch::HttpClient::Pool::BadResponseCodeError, :error=>"Got response code '403' contacting Elasticsearch at URL 'https://my-dom.co:8001/scans/'"}
[2021-05-09T11:09:56,849][DEBUG][logstash.instrument.periodicpoller.jvm] collector name {:name=>"ParNew"}
[2021-05-09T11:09:56,853][DEBUG][logstash.instrument.periodicpoller.jvm] collector name {:name=>"ConcurrentMarkSweep"}
[2021-05-09T11:09:57,444][DEBUG][logstash.outputs.amazonelasticsearch][main] Waiting for connectivity to Elasticsearch cluster. Retrying in 8s
.
.
.
I'm using the following logstash conf file:
input {
  file {
    type => "json"
    path => "~/home/ubuntu/json_try/json_try.json"
    start_position => "beginning"
    sincedb_path => "/dev/null"
  }
}
output {
  amazon_es {
    hosts => ["https://my-dom.co/scans"]
    port => 8001
    ssl => true
    region => "us-east-1b"
    index => "snapshot-%{+YYYY.MM.dd}"
  }
}
Also I've exported AWS keys for the SSL to work. Is there anything I'm missing in order for the connection to succeed?
I've been able to solve this by using elasticsearch as my output plugin instead of amazon_es.
This approach requires the cloud_id of the target AWS node, its cloud_auth, and the target Elasticsearch index for the data to be sent to. So the conf file will look something like this:
input {
  file {
    type => "json"
    path => "~/home/ubuntu/json_try/json_try.json"
    start_position => "beginning"
    sincedb_path => "/dev/null"
  }
}
output {
  elasticsearch {
    cloud_id => "node_name:node_hash"
    cloud_auth => "auth_hash"
    index => "snapshot-%{+YYYY.MM.dd}"
  }
}

How to use boto3 ssm client to create port-forwarding session?

I used the Python code below to create a port-forwarding session.
But it seems like the session gets terminated after a few minutes. Can anyone tell me if I am missing something here?
My target is to bind a remote port (80) to a local port (8888).
self._session_parameters = {
    "Target": self._instance_id,
    "DocumentName": "AWS-StartPortForwardingSession",
    "Parameters": {
        "portNumber": [str(self._remote_port)],
        "localPortNumber": [str(self._remote_port)],
    },
}
self._response = self._ssm.start_session(
    Target=self._session_parameters["Target"],
    DocumentName=self._session_parameters["DocumentName"],
    Parameters=self._session_parameters["Parameters"],
)
logger.info("Received response: %s", self._response["SessionId"])
self._session_id, self._token_value = (
    self._response["SessionId"],
    self._response["TokenValue"],
)

Access Kafka Cluster Outside GCP

I'm currently trying to access the Kafka cluster (Bitnami) from my local machine; however, even after exposing the required host and ports in server.properties and adding firewall rules to allow port 9092, it just doesn't connect.
I'm running a 2-broker and 1-ZooKeeper configuration.
Expected Output: Producer.bootstrap_connected() should return True.
Actual Output: False
server.properties
listeners=SASL_PLAINTEXT://:9092
advertised.listeners=SASL_PLAINTEXT://gcp-cluster-name:9092
sasl.mechanism.inter.broker.protocol=PLAIN
sasl.enabled.mechanisms=PLAIN
security.inter.broker.protocol=SASL_PLAINTEXT
Consumer.py
from kafka import KafkaConsumer
import json
import ssl

sasl_mechanism = 'PLAIN'
security_protocol = 'SASL_PLAINTEXT'

# Create a new context using system defaults, disable all but TLS1.2
context = ssl.create_default_context()
context.options &= ssl.OP_NO_TLSv1
context.options &= ssl.OP_NO_TLSv1_1

consumer = KafkaConsumer(
    'organic-sense',
    bootstrap_servers='<server-ip>:9092',
    value_deserializer=lambda x: json.loads(x.decode('utf-8')),
    ssl_context=context,
    sasl_plain_username='user',
    sasl_plain_password='<password>',
    sasl_mechanism=sasl_mechanism,
    security_protocol=security_protocol,
)
print(consumer.bootstrap_connected())
for data in consumer:
    print(data)
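Since the expected output above refers to Producer.bootstrap_connected(), here is a minimal producer-side connectivity check along the same lines, assuming the kafka-python client and the same SASL/PLAIN credentials (SASL_PLAINTEXT does not use TLS, so no ssl_context is needed here):

from kafka import KafkaProducer

# Sketch: same SASL/PLAIN settings as the consumer, used only to check whether
# the bootstrap connection to the externally exposed broker works.
producer = KafkaProducer(
    bootstrap_servers='<server-ip>:9092',
    security_protocol='SASL_PLAINTEXT',
    sasl_mechanism='PLAIN',
    sasl_plain_username='user',
    sasl_plain_password='<password>',
)
print(producer.bootstrap_connected())  # expected True once the listener is reachable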

Pyspark - read data from elasticsearch cluster on EMR

I am trying to read data from Elasticsearch from PySpark, using the elasticsearch-hadoop API in Spark. The ES cluster sits on AWS EMR, which requires credentials to sign in. My script is as below:
from pyspark import SparkContext, SparkConf

sc.stop()
conf = SparkConf().setAppName("ESTest")
sc = SparkContext(conf=conf)

es_read_conf = {
    "es.host": "vhost",
    "es.nodes": "node",
    "es.port": "443",
    "es.query": '{ "query": { "match_all": {} } }',
    "es.input.json": "true",
    "es.net.https.auth.user": "aws_access_key",
    "es.net.https.auth.pass": "aws_secret_key",
    "es.net.ssl": "true",
    "es.resource": "index/type",
    "es.nodes.wan.only": "true",
}

es_rdd = sc.newAPIHadoopRDD(
    inputFormatClass="org.elasticsearch.hadoop.mr.EsInputFormat",
    keyClass="org.apache.hadoop.io.NullWritable",
    valueClass="org.elasticsearch.hadoop.mr.LinkedMapWritable",
    conf=es_read_conf,
)
Pyspark keeps throwing error:
py4j.protocol.Py4JJavaError: An error occurred while calling
z:org.apache.spark.api.python.PythonRDD.newAPIHadoopRDD.
: org.elasticsearch.hadoop.rest.EsHadoopInvalidRequest: [HEAD] on
[index] failed; server [node:443] returned [403|Forbidden:]
I checked everything, and it all made sense except for the user and pass entries: would the AWS access key and secret key work here? We don't want to use the console user and password here for security purposes. Is there a different way to do the same thing?
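For what it's worth, the es.net.https.auth.user/pass settings are plain HTTP basic auth, so AWS access and secret keys would not be interpreted as signing credentials there. A common alternative, outside the elasticsearch-hadoop path and purely as a sketch under that assumption, is to sign requests with SigV4 using the requests-aws4auth and elasticsearch Python packages:

import boto3
from elasticsearch import Elasticsearch, RequestsHttpConnection
from requests_aws4auth import AWS4Auth

# Sketch: SigV4-signed access to an AWS-hosted Elasticsearch endpoint.
# Host, region and index names are placeholders; this bypasses
# elasticsearch-hadoop entirely and only illustrates the signing idea.
region = "us-east-1"
credentials = boto3.Session().get_credentials()
awsauth = AWS4Auth(
    credentials.access_key,
    credentials.secret_key,
    region,
    "es",
    session_token=credentials.token,
)

es = Elasticsearch(
    hosts=[{"host": "vhost", "port": 443}],
    http_auth=awsauth,
    use_ssl=True,
    verify_certs=True,
    connection_class=RequestsHttpConnection,
)

print(es.search(index="index", body={"query": {"match_all": {}}}))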