Moto to launch aws glue jobs - amazon-web-services

I would like to test a glue job with Moto ( https://docs.getmoto.org/en/latest/docs/services/glue.html ).
So I first start by creating the glue job:
#mock_glue
#mock_s3
class TestStringMethods(unittest.TestCase):
...
...
self.s3_client.upload_fileobj(open("etl.py", "rb"),self.s3_bucket_name, "etl.py")
self.glue_client = boto3.client('glue')
self.glue_client.create_job(
Name="Test Monitoring Job",
Role="test_role",
Command=dict(Name="glueetl", ScriptLocation=f"s3://{self.s3_bucket_name}/etl.py"),
GlueVersion='2.0',
NumberOfWorkers=1,
WorkerType='G.1X'
)
#Job is created with correct confs
assert (job["Name"] == job_name)
assert (job["GlueVersion"] == "2.0")
Then I proceed to launch it:
job_run_response = self.glue_client.start_job_run(
JobName=job_name,
Arguments={...} )
and get the job run:
response = self.glue_client.get_job_run(
JobName=job_name,
RunId=job_run_response['JobRunId']
)
However, at this point I find out that the configuration of the job I just launched is not the same of what I have defined in the create job.
Look for example at the Glue version that I asserted before.
print(response)
{'JobRun': {'Id': '01',... 'Arguments': {'runSpark': 'spark -f test_file.py'}, 'ErrorMessage': '', 'PredecessorRuns': [{'JobName': 'string', 'RunId': 'string'}] ... 'GlueVersion': '0.9' }
There's no evidence my code has actually run, it looks like to be some standard default, apart from the job name.
My questions basically are:
have you experience with moto supporting this functionality?
if yes, can you recognise if something's off with this code?

Related

MWAA can retrieve variable by ID but not connection from AWS Secrets Manager

we've set up AWS SecretsManager as a secrets backend to Airflow (AWS MWAA) as described in their documentation. Unfortunately, nowhere is explained where the secrets are to be found and how they are to be used then. When I supply conn_id to a task in a DAG, we can see two errors in the task logs, ValueError: Invalid IPv6 URL and airflow.exceptions.AirflowNotFoundException: The conn_id redshift_conn isn't defined. What's even more surprising is that when retrieving variables stored the same way with Variable.get('my_variable_id'), it works just fine.
The question is: Am I wrongly expecting that the conn_id can be directly passed to operators as SomeOperator(conn_id='conn-id-in-secretsmanager')? Must I retrieve the connection manually each time I want to use it? I don't want to run something like read_from_aws_sm_fn in the code below every time beforehand...
Btw, neither the connection nor the variable show up in the Airflow UI.
Having stored a secret named airflow/connections/redshift_conn (and on the side one airflow/variables/my_variable_id), I expect the connection to be found and used when constructing RedshiftSQLOperator(task_id='mytask', redshift_conn_id='redshift_conn', sql='SELECT 1'). But this results in the above error.
I am able to retrieve the redshift connection manually in a DAG with a separate task, but I think that is not how SecretsManager is supposed to be used in this case.
The example DAG is below:
from airflow import DAG, settings, secrets
from airflow.operators.python import PythonOperator
from airflow.utils.dates import days_ago
from airflow.providers.amazon.aws.hooks.base_aws import AwsBaseHook
from airflow.models.baseoperator import chain
from airflow.models import Connection, Variable
from airflow.providers.amazon.aws.operators.redshift import RedshiftSQLOperator
from datetime import timedelta
sm_secret_id_name = f'airflow/connections/redshift_conn'
default_args = {
'owner': 'airflow',
'depends_on_past': False,
'start_date': days_ago(1),
'retries': 1,
}
def read_from_aws_sm_fn(**kwargs): # from AWS example code
### set up Secrets Manager
hook = AwsBaseHook(client_type='secretsmanager')
client = hook.get_client_type('secretsmanager')
response = client.get_secret_value(SecretId=sm_secret_id_name)
myConnSecretString = response["SecretString"]
print(myConnSecretString[:15])
return myConnSecretString
def get_variable(**kwargs):
my_var_value = Variable.get('my_test_variable')
print('variable:')
print(my_var_value)
return my_var_value
with DAG(
dag_id=f'redshift_test_dag',
default_args=default_args,
dagrun_timeout=timedelta(minutes=10),
start_date=days_ago(1),
schedule_interval=None,
tags=['example']
) as dag:
read_from_aws_sm_task = PythonOperator(
task_id="read_from_aws_sm",
python_callable=read_from_aws_sm_fn,
provide_context=True
) # works fine
query_redshift = RedshiftSQLOperator(
task_id='query_redshift',
redshift_conn_id='redshift_conn',
sql='SELECT 1;'
) # results in above errors :-(
try_to_get_variable_value = PythonOperator(
task_id='get_variable',
python_callable=get_variable,
provide_context=True
) # works fine!
The question is: Am I wrongly expecting that the conn_id can be directly passed to operators as SomeOperator(conn_id='conn-id-in-secretsmanager')? Must I retrieve the connection manually each time I want to use it? I don't want to run something like read_from_aws_sm_fn in the code below every time beforehand...
Using secret manager as a backend, you don't need to change the way you use the connections or variables. They work the same way, when looking up a connection/variable, airflow follow a search path.
Btw, neither the connection nor the variable show up in the Airflow UI.
The connection/variable will not up in the UI.
ValueError: Invalid IPv6 URL and airflow.exceptions.AirflowNotFoundException: The conn_id redshift_conn isn't defined
The 1st error is related to the secret and the 2nd error is due to the connection not existing in the airflow UI.
There is 2 formats to store connections in secret manager (depending on the aws provider version installed) the IPv6 URL error could be that its not parsing the connection correctly. Here is a link to the provider docs.
First step is defining the prefixes for connections and variables, if they are not defined, your secret backend will not check for the secret:
secrets.backend_kwargs : {"connections_prefix" : "airflow/connections", "variables_prefix" : "airflow/variables"}
Then for the secrets/connections, you should store them in those prefixes, respecting the required fields for the connection.
For example, for the connection my_postgress_conn:
{
"conn_type": "postgresql",
"login": "user",
"password": "pass",
"host": "host",
"extra": '{"key": "val"}',
}
You should store it in the path airflow/connections/my_postgress_conn, with the json dict as string.
And for the variables, you just need to store them in airflow/variables/<var_name>.

Cron job not able to trigger AWS Step function

I followed this tutorial here to configure a cron job to kick off my AWS Step Function Data Science SDK. Every time it tries to trigger the state machine it immediately get
I added the newly generated IAM role to have full access to SageMaker and Lambda so I am not sure what is going on.
I am going to provide more information here. The way I setup my sklearn estimator is like this
sklearn_estimator = SKLearn(
entry_point= sm_script,
role = role,
instance_count=1,
dependencies=[sm_utils_file, config_path, 'requirements.txt'],
instance_type=training_instance,
sagemaker_session=sm_sess,
framework_version=FRAMEWORK_VERSION,
base_job_name='{}-training'.format(base_name),
hyperparameters = {'config': config_file},
metric_definitions=[
{'Name': 'client_devices_validation_accuracy_top1', 'Regex': "client devices accuracy for top 1 = ([0-9.]+)"},
{'Name': 'client_devices_validation_f1_top1', 'Regex': "client devices f1 for top 1 = ([0-9.]+)"},
{'Name': 'account_and_password_validation_accuracy_top1', 'Regex': "account and password accuracy for top 1 = ([0-9.]+)"},
{'Name': 'account_and_password_validation_f1_top1', 'Regex': "account and password f1 for top 1 = ([0-9.]+)"}]
)
As you can see I have set some dependencies there that the model needs to train. For the training step I have defined it as so
training_step = steps.TrainingStep(
"Train Step",
estimator=sklearn_estimator,
data={
"train": sagemaker.TrainingInput(pre_train_utils.resolution_data_path()),
},
job_name=execution_input["TrainingJobName"],
wait_for_completion=True,
)
This part pre_train_utils.resolution_data_path() grabs the newest data from redshift and the pre_train_utils is stored as a dependency in the estimator so it should be fine? I am now thinking that this could be the problem?
Update:
I was able to find the error which states this
An error occurred while executing the state 'Train Step' (entered at the event id #2). The JSONPath '$$.Execution.Input['TrainingJobName']' specified for the field 'TrainingJobName.$' could not be found in the input
Specifically it needs to have a json input that looks like this in my case
{
"TrainingJobName": "tt-resolution-classifier-training-2022-09-02",
"ModelName": "tt-resolution-classifier-model-2022-09-02",
"EndpointName": "tt-resolution-classifier-endpoint",
"LambdaFunctionName": "odi-ds-grab-ticket-training-metrics"
}
How do I past this into AWSCloudWatch cron job, if I cannot pass it then I cannot automatically have the state machine train and deploy the endpoint...

How to add PubSub Destination while creating a job using preset "preset/web-hd" using google cloud transcoder-api?

I am converting a video, uploaded to cloud storage using a signed URL, using Transcoder API. I have written a cloud function that is triggered with write operations on the bucket. Everything is working fine but I need to get a notification when the conversion is completed. I am creating the job to convert the vid using the following code. I am trying to follow the solution proposed in this answer Google Cloud Platform: Convert uploaded MP4 file to HLS file
def create_job_from_preset(project_id, location, input_uri, output_uri, preset):
"""Creates a job based on a job preset.
Args:
project_id: The GCP project ID.
location: The location to start the job in.
input_uri: Uri of the video in the Cloud Storage bucket.
output_uri: Uri of the video output folder in the Cloud Storage bucket.
preset: The preset template (for example, 'preset/web-hd')."""
client = TranscoderServiceClient()
parent = f"projects/{project_id}/locations/{location}"
job = transcoder_v1.types.Job()
job.input_uri = input_uri
job.output_uri = output_uri
job.template_id = preset
job.ttl_after_completion_days = 1
job.config = transcoder_v1.types.JobConfig(
PubsubDestination={
topic_name=f"projects/{project_id}/topics/testing"
}
)
response = client.create_job(parent=parent, job=job)
print(f"Job: {response.name}")
return response
The following snippet in the above code is not working
job.config = transcoder_v1.types.JobConfig(
PubsubDestination={
topic_name=f"projects/{project_id}/topics/testing"
}
)
I have viewed the following but couldn't find any solution.
https://cloud.google.com/transcoder/docs/how-to/create-pub-sub
How to configure pubsub_destination in Transcoder API of GCP
You cannot not define any configuration on your JobConfig on your code if you are creating a job from a preset or template since the preset and template will already populate the JobConfig for you.
As an alternative, you may create job using an ad-hoc configuration and then define PubsubDestination as shown on below code:
Note that I also corrected the syntax in using the PubsubDestination
from google.cloud.video import transcoder_v1
from google.cloud.video.transcoder_v1.services.transcoder_service import (
TranscoderServiceClient,
)
def create_job_from_ad_hoc(project_id, location, input_uri, output_uri):
"""Creates a job based on an ad-hoc job configuration.
Args:
project_id: The GCP project ID.
location: The location to start the job in.
input_uri: Uri of the video in the Cloud Storage bucket.
output_uri: Uri of the video output folder in the Cloud Storage bucket."""
client = TranscoderServiceClient()
parent = f"projects/{project_id}/locations/{location}"
job = transcoder_v1.types.Job()
job.input_uri = input_uri
job.output_uri = output_uri
job.config = transcoder_v1.types.JobConfig(
elementary_streams=[
transcoder_v1.types.ElementaryStream(
key="video-stream0",
video_stream=transcoder_v1.types.VideoStream(
h264=transcoder_v1.types.VideoStream.H264CodecSettings(
height_pixels=360,
width_pixels=640,
bitrate_bps=550000,
frame_rate=60,
),
),
),
transcoder_v1.types.ElementaryStream(
key="video-stream1",
video_stream=transcoder_v1.types.VideoStream(
h264=transcoder_v1.types.VideoStream.H264CodecSettings(
height_pixels=720,
width_pixels=1280,
bitrate_bps=2500000,
frame_rate=60,
),
),
),
transcoder_v1.types.ElementaryStream(
key="audio-stream0",
audio_stream=transcoder_v1.types.AudioStream(
codec="aac", bitrate_bps=64000
),
),
],
mux_streams=[
transcoder_v1.types.MuxStream(
key="sd",
container="mp4",
elementary_streams=["video-stream0", "audio-stream0"],
),
transcoder_v1.types.MuxStream(
key="hd",
container="mp4",
elementary_streams=["video-stream1", "audio-stream0"],
),
],
pubsub_destination=
transcoder_v1.types.PubsubDestination(
topic=f"projects/{project_id}/topics/your-topic"
),
)
response = client.create_job(parent=parent, job=job)
print(f"Job: {response.name}")
return response
Output of my testing:
Other alternative is to create your own job template and then use it in your template_id so that you don't have to always define PubsubDestination in your code.

How to see progress when using Glue to export DynamoDB table

I'm trying to export every item in a DynamoDB table to S3. I found this tutorial https://aws.amazon.com/blogs/big-data/how-to-export-an-amazon-dynamodb-table-to-amazon-s3-using-aws-step-functions-and-aws-glue/ and followed the example. Basically,
table = glueContext.create_dynamic_frame.from_options(
"dynamodb",
connection_options={
"dynamodb.input.tableName": table_name,
"dynamodb.throughput.read.percent": read_percentage,
"dynamodb.splits": splits
}
)
glueContext.write_dynamic_frame.from_options(
frame=table,
connection_type="s3",
connection_options={
"path": output_path
},
format=output_format,
transformation_ctx="datasink"
)
I tested it in a tiny table in nonprod environment and it works fine. But my Dynamo table in production is over 400GB, 200 mil items. I suppose it'll take a while, but I have no idea how long to expect. Hours, or even days? Are there any way to show progress? For example, showing a count of how many items have been processed. I don't want to blindly start this job and wait.
One way would be to enable continuous logging for your AWS Glue Job to monitor its progress.
Another way would be to trigger a Lambda function whenever a file has been stored in S3, using Amazon S3 event notifications.
Did you try the custom waiter class within was docs?
For instance custom waiter for a Glue Job should look something like this:
class JobCompleteWaiter(CustomWaiter):
def __init__(self, client):
super().__init__(
"JobComplete",
"get_job_run",
"JobRun.JobRunState",
{"SUCCEEDED": WaitState.SUCCEEDED, "FAILED": WaitState.FAILED},
client,
max_tries=100,
)
def wait(self, JobName, RunId):
self._wait(JobName=JobName, RunId=RunId)
According to boto3 docs, you should expect a set of 6 different possible states from a JOB: STARTING'|'RUNNING'|'STOPPING'|'STOPPED'|'SUCCEEDED'|'FAILED'|'TIMEOUT'
So I chost checkein whether was SUCCEEDED or FAILED.

How to get the list of instances for AWS EMR?

Why is the list for EC2 different from the EMR list?
EC2: https://aws.amazon.com/ec2/spot/pricing/
EMR: https://aws.amazon.com/emr/pricing/
Why are not all the types of instances from the EC2 available for EMR? How to get this special list?
In case your question is not about the amazon console
(then it would surely be closed as off-topic):
As a programming solution, you are looking something like this: (using python boto3)
import boto3
client = boto3.client('emr')
for instance in client.list_instances():
print("Instance[%s] %s"%(instance.id, instance.name))
This is what I use, although I'm not 100% sure it's accurate (because I couldn't find documentation to support some of my choices (-BoxUsage, etc.)).
It's worth looking through the responses from AWS in order to figure out what the different values are for different fields in the pricing client responses.
Use the following to get the list of responses:
default_profile = boto3.session.Session(profile_name='default')
# Only us-east-1 has the pricing API
# - https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/pricing.html
pricing_client = default_profile.client('pricing', region_name='us-east-1')
service_name = 'ElasticMapReduce'
product_filters = [
{'Type': 'TERM_MATCH', 'Field': 'location', 'Value': aws_region_name}
]
response = pricing_client.get_products(
ServiceCode=service_name,
Filters=product_filters,
MaxResults=100
)
response_list.append(response)
num_prices = 100
while 'NextToken' in response:
# re-query to get next page
Once you've gotten the list of responses, you can then filter out the actual instance info:
emr_prices = {}
for response in response_list:
for price_info_str in response['PriceList']:
price_obj = json.loads(price_info_str)
attributes = price_obj['product']['attributes']
# Skip pricing info that doesn't specify a (EC2) instance type
if 'instanceType' not in attributes:
continue
inst_type = attributes['instanceType']
# AFAIK, Only usagetype attributes that contain the string '-BoxUsage' are the ones that contain the prices that we would use (empirical research)
# Other examples of values are <REGION-CODE>-M3BoxUsage, <REGION-CODE>-M5BoxUsage, <REGION-CODE>-M7BoxUsage (no clue what that means.. )
if '-BoxUsage' not in attributes['usagetype']:
continue
if 'OnDemand' not in price_obj['terms']:
continue
on_demand_info = price_obj['terms']['OnDemand']
price_dim = list(list(on_demand_info.values())[0]['priceDimensions'].values())[0]
emr_price = Decimal(price_dim['pricePerUnit']['USD'])
emr_prices[inst_type] = emr_price
Realistically, it's straightforward enough to figure this out from the boto3 docs. In particular, the get_products documentation.