AWS: Add Connection to glue job with boto3 - amazon-web-services

I want to add a connection to my Glue job using Lambda, so I did this:
response = glue.create_job(
    Name="redshift",
    Role="XXXXXXX",
    Command={
        'Name': 'glueetl',
        'PythonVersion': '3',
        'ScriptLocation': path_redshift_job,
    },
    ExecutionProperty={'MaxConcurrentRuns': 1},
    Connections={'Connections': ['RedshiftClusterConnection']},
)
But as you can see in the image, no connection is added to the job. What can I do to solve this issue?
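One way to check whether the connection actually landed on the job definition is to read it back. Below is a minimal sketch, assuming boto3 and the same job name, role, and connection name as above (the script path is a placeholder), using the standard get_job and update_job Glue API calls; note that update_job overwrites the previous job definition, so the existing settings have to be supplied again:

import boto3

glue = boto3.client("glue")

# Read the job definition back; an attached connection should appear
# under Job -> Connections -> Connections.
job = glue.get_job(JobName="redshift")
print(job["Job"].get("Connections", {}).get("Connections", []))

# If the list is empty, the connection can also be attached after the fact.
glue.update_job(
    JobName="redshift",
    JobUpdate={
        "Role": "XXXXXXX",
        "Command": {
            "Name": "glueetl",
            "PythonVersion": "3",
            "ScriptLocation": "s3://my-bucket/scripts/redshift_job.py",  # placeholder path
        },
        "ExecutionProperty": {"MaxConcurrentRuns": 1},
        "Connections": {"Connections": ["RedshiftClusterConnection"]},
    },
)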

Related

How to set NumberOfWorkers, WorkerType as G2X in AWS Glue via airflow?

I am trying to create a Glue job with this configuration: 'NumberOfWorkers': 10, 'WorkerType': 'G.2X'. Here's my code for job creation.
glue_job_step = AwsGlueJobOperator(
    job_name=glue_job_name,
    job_desc="AWS Glue Job with Airflow",
    script_location="s3://<bucket_name>/scripts/test_spark_hello.py",
    create_job_kwargs={'GlueVersion': '3.0', 'NumberOfWorkers': 10, 'WorkerType': 'G.2X'},
    num_of_dpus=10,
    concurrent_run_limit=1,
    script_args=None,
    retry_limit=0,
    region_name=region_name,
    s3_bucket="s3_bucket_name",
    iam_role_name=glue_iam_role,
    run_job_kwargs=None,
    wait_for_completion=True,
    task_id='glue_job_step',
    dag=dag,
)
And I am facing the following error:
Failed to run aws glue job, error: An error occurred (InvalidInputException) when calling the CreateJob operation: Please do not set Allocated Capacity if using Worker Type and Number of Workers.
Note:
If I remove 'NumberOfWorkers' and 'WorkerType', the job runs with G.1X workers and 10 worker nodes, but I am wondering how to upgrade to G.2X workers.
I am using AWS MWAA for the Airflow infrastructure.
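The error text itself names the conflict: Allocated Capacity must not be set together with Worker Type and Number of Workers. The operator's num_of_dpus argument appears to be what the Glue hook turns into that capacity value, so one untested sketch, reusing the same names and imports as the DAG above, is to drop num_of_dpus and let create_job_kwargs carry the worker configuration:

# Sketch only: identical to the task above except num_of_dpus is omitted,
# so the hook should not send a capacity value alongside WorkerType/NumberOfWorkers.
glue_job_step = AwsGlueJobOperator(
    job_name=glue_job_name,
    job_desc="AWS Glue Job with Airflow",
    script_location="s3://<bucket_name>/scripts/test_spark_hello.py",
    create_job_kwargs={'GlueVersion': '3.0', 'NumberOfWorkers': 10, 'WorkerType': 'G.2X'},
    concurrent_run_limit=1,
    script_args=None,
    retry_limit=0,
    region_name=region_name,
    s3_bucket="s3_bucket_name",
    iam_role_name=glue_iam_role,
    run_job_kwargs=None,
    wait_for_completion=True,
    task_id='glue_job_step',
    dag=dag,
)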

AWS AppFlow Salesforce to Redshift Error Creating Connection

I want to create a one-way, real-time copy of a Salesforce (SF) object in Redshift. The idea is that when fields are updated in SF, those fields are updated in Redshift as well. The history of changes is irrelevant in AWS/Redshift; that is all being tracked in SF. I just need a real-time, read-only copy of that particular object to query, preferably without having to query the whole SF object, clear the Redshift table, and pipe the data in.
I thought AWS AppFlow listening for SF Change Data Capture events might be a good setup for this.
When I try to create a flow, I don't have any issues with the SF source connection,
so I click "Connect" in the Destination details section to set up Redshift, fill out that page, and click "Connect" again.
About 5 seconds go by and I receive this error pop-up:
An error occurred while creating the connection
Error while communicating to connector: Failed to validate Connection while attempting "select distinct(table_schema) from information_schema.tables limit 1" with connector failure Can't connect to JDBC database with message: Amazon Error setting/closing connection: SocketTimeoutException. (Service: null; Status Code: 400; Error Code: Client; Request ID: null; Proxy: null)
I know my connection string, username, password, etc. are all good; I'm connected to Redshift in other apps. Any idea what the issue could be? Is this even the right solution for what I'm trying to do?
I solved this by adding the AppFlow IP ranges for my region to my Redshift VPC's security group inbound rules.
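For anyone scripting that change, here is a minimal sketch using boto3's standard authorize_security_group_ingress call; the security group ID and CIDR below are placeholders (the real CIDRs are the published AppFlow IP ranges for your region), and 5439 is the default Redshift port:

import boto3

ec2 = boto3.client("ec2")

# Placeholders: substitute your Redshift VPC security group ID and the
# published AppFlow IP ranges for your region.
appflow_cidrs = ["198.51.100.0/24"]  # hypothetical CIDR, for illustration only

for cidr in appflow_cidrs:
    ec2.authorize_security_group_ingress(
        GroupId="sg-0123456789abcdef0",  # hypothetical security group ID
        IpPermissions=[{
            "IpProtocol": "tcp",
            "FromPort": 5439,  # default Redshift port
            "ToPort": 5439,
            "IpRanges": [{"CidrIp": cidr, "Description": "AppFlow access"}],
        }],
    )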

Post from GCP Pub Sub to on-prem Kafka

I have a requirement to publish messages from a Google Pub/Sub topic to Kafka running on my on-prem infrastructure. I stumbled on this link:
https://docs.confluent.io/current/connect/kafka-connect-gcp-pubsub/index.html
This should work, but I wanted to know whether anyone has used an alternative solution to achieve this.
If you need to integrate Pub/Sub and Kafka, I suggest that you create a script for this purpose. In Python, for example, there are client libraries for both Pub/Sub and Kafka.
Based on that, you could create a script more or less like the one below and run it inside some processing resource like Compute Engine or on your on-premises server:
from google.cloud import pubsub_v1
from kafka import KafkaProducer

# Change these for your real parameters
producer = KafkaProducer(bootstrap_servers='localhost:1234')
subscription_name = "projects/<your-project>/subscriptions/<your-subscription>"

def callback(message):
    # Forward each Pub/Sub message payload to the Kafka topic and ack it
    print(message.data)
    producer.send('<your-topic>', message.data)
    message.ack()

subscriber = pubsub_v1.SubscriberClient()
future = subscriber.subscribe(subscription_name, callback)
future.result()  # block so the process keeps pulling messages

AWS Glue job runs correct but returns a connection refused error

I am running a test job on AWS. I read CSV data from an S3 bucket, run a Glue ETL job on it, and store the same data in Amazon Redshift. The Glue job just reads the data from S3 and stores it in Redshift without any modification. The job runs fine and I get the desired result in Redshift, but it returns an error that I am unable to understand.
Here is the error log:
18/11/14 09:17:31 WARN YarnClient: The GET request failed for the URL http://169.254.76.1:8088/ws/v1/cluster/apps/application_1542186720539_0001
com.amazon.ws.emr.hadoop.fs.shaded.org.apache.http.conn.HttpHostConnectException: Connect to 169.254.76.1:8088 [/169.254.76.1] failed: Connection refused (Connection refused)
It is a WARN rather than an error, but I want to understand what is causing it. I tried to search for the IP indicated in the WARN, but I am not able to find a machine with that IP.
I noticed this error coming up in my AWS Glue job too, and I found something from AWS that could be helpful:
This WARN message is not special and does not directly mean job failure or any errors. I guess there should be some other cause.
I would recommend enabling continuous logging and checking both the driver and executor logs to see if there is any suspicious behavior.
If you have enabled job bookmarks, please try disabling them and see how the job goes without bookmarks.
https://forums.aws.amazon.com/thread.jspa?messageID=927547
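For reference, both suggestions map to documented Glue special job parameters. A minimal sketch of the values, which can be set under "Job parameters" in the console or passed as DefaultArguments through the create_job/update_job API:

# Documented AWS Glue special job parameters for the two suggestions above.
default_arguments = {
    "--enable-continuous-cloudwatch-log": "true",     # enable continuous logging
    "--job-bookmark-option": "job-bookmark-disable",  # disable job bookmarks
}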
I had disabled bookmarks from the beginning. What I found is that my Glue job got a memory exception while writing data to S3, so what I did was repartition the data:
# Convert to a Spark DataFrame, reduce the number of partitions, and write partitioned Parquet
MyDynamicFrame.toDF().coalesce(100).write.partitionBy("month").mode("overwrite").parquet("s3://" + bucket + "/" + path + "/out_data")
So if you have write operations, I recommend checking how you are writing to S3.

boto.sqs connect to non-aws endpoint

I currently need to connect to a fake_sqs server for dev purposes, but I can't find an easy way to specify the endpoint for the boto.sqs connection. In Java and Node.js there are ways to specify the queue endpoint, and by passing something like 'localhost:someport' I can connect to my own SQS-like instance. I've tried the following with boto:
fake_region = regioninfo.SQSRegionInfo(name=name, endpoint=endpoint)
conn = fake_region.connect(aws_access_key_id="TEST", aws_secret_access_key="TEST", port=9324, is_secure=False)
and then:
queue = conn.get_queue('some_queue')
but it fails to retrieve the queue object; it returns None. Has anyone managed to connect to their own SQS-like instance?
Here's how to create an SQS connection that connects to fake_sqs:
import boto.sqs.connection
import boto.sqs.regioninfo

region = boto.sqs.regioninfo.SQSRegionInfo(
    connection=None,
    name='fake_sqs',
    endpoint='localhost',  # or wherever fake_sqs is running
    connection_cls=boto.sqs.connection.SQSConnection,
)
conn = boto.sqs.connection.SQSConnection(
    aws_access_key_id='fake_key',
    aws_secret_access_key='fake_secret',
    is_secure=False,
    port=4568,  # or wherever fake_sqs is running
    region=region,
)
region.connection = conn
# you can now work with conn
# conn.create_queue('test_queue')
Be aware that, at the time of this writing, the fake_sqs library does not respond correctly to GET requests, which is how boto makes many of its requests. You can install a fork that has patched this functionality here: https://github.com/adammck/fake_sqs