Glue job creation using GlueJobOperator - amazon-web-services

Can anyone tell me how I can create a Glue job using the Glue job operator in Airflow?
job = AwsGlueJobOperator(
    task_id='jobCreation',
    job_name='jobname',
    job_desc='creating job creation',
    region_name='region',
    iam_role_name='role',
    num_of_dpus=1,
    concurrent_run_limit=2,
    script_location='s3://bucketname/filename.py',
    s3_bucket='bucketname',
    script_args={'connections': 'connectionname', '--key': 'value'},
    create_job_kwargs={'GlueVersion': 1},
)
I'm getting this error when using the above code:
Invalid type for parameter MaxRetries, value: None, type: <class 'NoneType'>, valid types: <class 'int'>
So I think I should add a max retries argument, but where should I add it? And where should I pass the catalog connection names in this Glue job operator? I passed them in create_job_kwargs because I don't know where they belong.

Not sure if you figured this out, but you should add this to your job:
retry_limit=1
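
Not part of the original answer, but as a rough sketch the operator call might then look like the following. Treating GlueVersion as a string and passing the catalog connection through create_job_kwargs (rather than script_args) are assumptions about the underlying Glue CreateJob API, not something confirmed above:

job = AwsGlueJobOperator(
    task_id='jobCreation',
    job_name='jobname',
    region_name='region',
    iam_role_name='role',
    num_of_dpus=1,
    concurrent_run_limit=2,
    retry_limit=1,  # gives MaxRetries an integer instead of None
    script_location='s3://bucketname/filename.py',
    s3_bucket='bucketname',
    script_args={'--key': 'value'},
    # assumption: catalog connection names belong in the CreateJob payload
    create_job_kwargs={
        'GlueVersion': '1.0',
        'Connections': {'Connections': ['connectionname']},
    },
)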


How to use dagster with great expectations?

The issue
I'm trying out Great Expectations with Dagster, as per this guide.
My pipeline seems to execute correctly until it reaches this block:
expectation = dagster_ge.ge_validation_op_factory(
    name='ge_validation_op',
    datasource_name='dev.data-pipeline-data-storage.data_pipelines.raw_data.sirene_update',
    suite_name='suite.data_pipelines.raw_data.sirene_update',
)
if expectation["success"]:
    print("Success")
trying to call expectation["success"] results in a
# TypeError: 'SolidDefinition' object is not subscriptable
When I go inside the code of ge_validation_op_factory, there is a _ge_validation_fn that should yield ExpectationResult, but somehow it gets converted into a SolidDefinition...
Dagster version = 0.15.9;
Great Expectations version = 0.15.44
Code to reproduce the error
In my code, I am trying to interact with an S3 bucket, so it would be a bit tedious to re-create the code for my example, but here it is anyway:
In gx_postprocessing.py:
import json
import boto3
import dagster_ge
from dagster import (
    op,
    graph,
    Field,
    String,
    OpExecutionContext,
)
from typing import List, Dict


@op(
    config_schema={
        "bucket": Field(
            String,
            description="s3 bucket name",
        ),
        "path_in_s3": Field(
            String,
            description="Prefix representing the path to data",
        ),
        "technical_date": Field(
            String,
            description="date string to fetch data",
        ),
        "file_name": Field(
            String,
            description="file name that contains the data",
        ),
    }
)
def read_in_json_datafile_from_s3(context: OpExecutionContext):
    bucket = context.op_config["bucket"]
    path_in_s3 = context.op_config["path_in_s3"]
    technical_date = context.op_config["technical_date"]
    file_name = context.op_config["file_name"]
    object = f"{path_in_s3}/" f"technical_date={technical_date}/" f"{file_name}"
    s3 = boto3.resource("s3")
    content_object = s3.Object(bucket, object)
    file_content = content_object.get()["Body"].read().decode("utf-8")
    json_content = json.loads(file_content)
    return json_content


@op
def process_example_dq(data: List[Dict]):
    return len(data)


@op
def postprocess_example_dq(numrows, expectation):
    if expectation["success"]:
        return numrows
    else:
        raise ValueError


@op
def validate_example_dq(context: OpExecutionContext):
    expectation = dagster_ge.ge_validation_op_factory(
        name='ge_validation_op',
        datasource_name='my_bucket.data_pipelines.raw_data.example_update',
        suite_name='suite.data_pipelines.raw_data.example_update',
    )
    return expectation


@graph(
    config={
        "read_in_json_datafile_from_s3": {
            "config": {
                "bucket": "my_bucket",
                "path_in_s3": "my_path",
                "technical_date": "2023-01-24",
                "file_name": "myfile_20230124.json",
            }
        },
    },
)
def example_update_evaluation():
    output_dict = read_in_json_datafile_from_s3()
    nb_items = process_example_dq(data=output_dict)
    expectation = validate_example_dq()
    postprocess_example_dq(
        numrows=nb_items,
        expectation=expectation,
    )
Do not forget to add great_expectations_poc_pipeline to your __init__.py where the pipelines=[..] are listed.
In this example, dagster_ge.ge_validation_op_factory(...) is returning an OpDefinition, which is the same type of thing as (for example) process_example_dq, and should be composed in the graph definition the same way, rather than invoked within another op.
So instead, you'd want to have something like:
validate_example_dq = dagster_ge.ge_validation_op_factory(
    name='ge_validation_op',
    datasource_name='my_bucket.data_pipelines.raw_data.example_update',
    suite_name='suite.data_pipelines.raw_data.example_update',
)
Then use that op inside your graph definition the same way you currently are (i.e. expectation = validate_example_dq())
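
Not part of the original answer, but putting that advice together, the corrected wiring might look roughly like this. The other ops from the question are assumed to be defined as before, the graph config block is omitted for brevity, and whether the generated op also expects a dataset input and a ge_data_context resource depends on the dagster_ge version, so check its generated signature:

# factory call moved to module level -- it returns an op definition
validate_example_dq = dagster_ge.ge_validation_op_factory(
    name='ge_validation_op',
    datasource_name='my_bucket.data_pipelines.raw_data.example_update',
    suite_name='suite.data_pipelines.raw_data.example_update',
)


@graph
def example_update_evaluation():
    output_dict = read_in_json_datafile_from_s3()
    nb_items = process_example_dq(data=output_dict)
    # compose the generated op like any other op, instead of calling the
    # factory from inside another op
    expectation = validate_example_dq()
    postprocess_example_dq(numrows=nb_items, expectation=expectation)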

Terraform variable interpolation and evaluation

I'm working with modules in Terraform, using a YAML approach to manage variables. I have a very simple module that should create a parameter in AWS Parameter Store based on the output of my RDS and IAM user modules. So, I wrote this module:
resource "aws_ssm_parameter" "ssm_parameter" {
name = var.parameter_name
type = var.parameter_type
value = var.parameter_value
overwrite = var.overwrite
tags = var.tags
}
The variables I'm using are stored in a YAML file like this:
ssms:
  /arquitetura/catalogo/gitlab/token:
    type: SecureString
    value: ManualInclude
  /arquitetura/catalogo/s3/access/key:
    type: String
    value: module.iam_user.access_key
  /arquitetura/catalogo/s3/secret/access/key:
    type: SecureString
    value: module.iam_user.secret_access_key
  /arquitetura/catalogo/rds/user:
    type: String
    value: module.rds_instance.database_username
  /arquitetura/catalogo/rds/password:
    type: SecureString
    value: module.rds_instance.database_password
As you can see, "value" holds the module output I would like to send to Parameter Store. I'm loading this variable file using the file and yamldecode functions:
ssmfile        = "./env/${terraform.workspace}/ssm.yaml"
ssmfilecontent = fileexists(local.ssmfile) ? file(local.ssmfile) : "ssmFileNotFound: true"
ssmsettings    = yamldecode(local.ssmfilecontent)
So, I have a local.ssmsettings and I can write a module call like this:
module "ssm_parameter" {
source = "../aws-ssm-parameter-tf"
for_each = local.ssmsettings.ssms
parameter_name = each.key
parameter_type = each.value.type
parameter_value = each.value.value
tags = local.tags
}
Doing this, my parameter is stored as:
{
    "Parameter": {
        "Name": "/arquitetura/catalogo/rds/user",
        "Type": "String",
        "Value": "module.rds_instance.database_username",
        "Version": 1,
        "LastModifiedDate": "2022-12-15T19:02:01.825000-03:00",
        "ARN": "arn:aws:ssm:sa-east-1:111111111111:parameter/arquitetura/catalogo/rds/user",
        "DataType": "text"
    }
}
Value is receiving the string module.rds_instance.database_username instead of the module output.
I know that file function doesn't interpolate variables and I know Terraform doesn't have an eval function.
Has anybody had the same situation and can tell me how you solved the problem, or have any clue I can follow?
I already tried to work with Terraform templates, without success.
Thanks in advance.
Terraform has no way to understand that the value strings in your YAML files are intended to be references to values elsewhere in your module, and even if it did it wouldn't be possible to resolve them from there because this YAML file is not actually a part of the Terraform module, and is instead just a data file that Terraform has loaded into memory.
However, you can get a similar effect by placing all of the values your YAML file might refer to into a map of strings inside your module:
locals {
  ssm_indirect_values = tomap({
    manual_include        = "ManualInclude"
    aws_access_key_id     = module.iam_user.access_key
    aws_secret_access_key = module.iam_user.secret_access_key
    database_username     = module.rds_instance.database_username
    database_password     = module.rds_instance.database_password
  })
}
Then change your YAML data so that the value strings match with the keys in this map:
ssms:
  /arquitetura/catalogo/gitlab/token:
    type: SecureString
    value: manual_include
  /arquitetura/catalogo/s3/access/key:
    type: String
    value: aws_access_key_id
  /arquitetura/catalogo/s3/secret/access/key:
    type: SecureString
    value: aws_secret_access_key
  /arquitetura/catalogo/rds/user:
    type: String
    value: database_username
  /arquitetura/catalogo/rds/password:
    type: SecureString
    value: database_password
You can then substitute the real values instead of the placeholders before you use the data structure in for_each:
locals {
  ssm_file         = "${path.module}/env/${terraform.workspace}/ssm.yaml"
  ssm_file_content = file(local.ssm_file)
  ssm_settings     = yamldecode(local.ssm_file_content)

  ssms = tomap({
    for k, obj in local.ssm_settings.ssms :
    k => {
      type  = obj.type
      value = local.ssm_indirect_values[obj.value]
    }
  })
}

module "ssm_parameter" {
  source   = "../aws-ssm-parameter-tf"
  for_each = local.ssms

  parameter_name  = each.key
  parameter_type  = each.value.type
  parameter_value = each.value.value
  tags            = local.tags
}
The for expression in the definition of local.ssms uses the source value string as a lookup key into local.ssm_indirect_values, thereby inserting the real value.
The module "ssm_parameter" block now refers to the derived local.ssms instead of the original local.ssm_settings.ssms, so each.value.value will be the final resolved value rather than the lookup key, and so your parameter should be stored as you intended.

DynamoDB costs arising due to triggers

I have a workflow where I put files into an S3 bucket, which triggers a Lambda function. The Lambda function extracts some info about the file and inserts a row into a DynamoDB table for each file:
def put_filename_in_db(dynamodb, filekey, filename):
    table = dynamodb.Table(dynamodb_table_name)
    try:
        response = table.put_item(
            Item={
                'masterclient': masterclient,
                'filekey': filekey,
                'filename': filename,
                'filetype': filetype,
                'source_bucket_name': source_bucket_name,
                'unixtimestamp': unixtimestamp,
                'processed_on': None,
                'archive_path': None,
                'archived_on': None,
            }
        )
    except Exception as e:
        raise Exception(f"Error")
    return response


def get_files():
    bucket_content = s3_client.list_objects(Bucket=str(source_bucket_name), Prefix=Incoming_prefix)['Contents']
    file_list = []
    for k, v in enumerate(bucket_content):
        if (v['Key'].endswith("zip") and not v['Key'].startswith(Archive_prefix)):
            filekey = v['Key']
            filename = ...
            dict = {"filekey": filekey, "filename": filename}
            file_list.append(dict)
    logger.info(f'Found {len(file_list)} files to process: {file_list}')
    return file_list


def lambda_handler(event, context):
    for current_item in get_files():
        filekey = current_item['filekey']
        filename = current_item['filename']
        put_filename_in_db(dynamodb, filekey, filename)
    return {
        'statusCode': 200
    }
This is how my DynamoDB table is defined in terraform:
resource "aws_dynamodb_table" "filenames" {
name = local.dynamodb_table_filenames
billing_mode = "PAY_PER_REQUEST"
#read_capacity = 10
#write_capacity = 10
hash_key = "filename"
stream_enabled = true
stream_view_type = "NEW_IMAGE"
attribute {
name = "filename"
type = "S"
}
}
resource "aws_lambda_event_source_mapping" "allow_dynamodb_table_to_trigger_lambda" {
event_source_arn = aws_dynamodb_table.filenames.stream_arn
function_name = aws_lambda_function.trigger_stepfunction_lambda.arn
starting_position = "LATEST"
}
New entries in the DynamoDB table trigger another Lambda function which contains this:
def parse_file_info_from_trigger(event):
    filename = event['Records'][0]['dynamodb']['Keys']['filename']['S']
    filetype = event['Records'][0]['dynamodb']['NewImage']['filetype']['S']
    unixtimestamp = event['Records'][0]['dynamodb']['NewImage']['unixtimestamp']['S']
    masterclient = event['Records'][0]['dynamodb']['NewImage']['masterclient']['S']
    source_bucket_name = event['Records'][0]['dynamodb']['NewImage']['source_bucket_name']['S']
    filekey = event['Records'][0]['dynamodb']['NewImage']['filekey']['S']
    return filename, filetype, unixtimestamp, masterclient, source_bucket_name, filekey


def start_step_function(event, state_machine_zip_files_arn):
    if event['Records'][0]['eventName'] == 'INSERT':
        filename, filetype, unixtimestamp, masterclient, source_bucket_name, filekey = parse_file_info_from_trigger(event)
        ......
    else:
        logger.info(f'This is not an Insert event')
However, the costs for this process are extremely high. When I tested with a single file loaded into S3, the overall DynamoDB costs for that day were $0.785. At around 50 files a day, that would put my total costs at roughly $40 per day, which seems too high if we want to run the workflow daily.
Am I doing something wrong? Or is DynamoDB generally this expensive? If it's the latter, which part exactly is costing so much? Or is it because put_filename_in_db is running in a loop?

Error when trying to use CustomPythonPackageTrainingJobRunOp in VertexAI pipeline

I am using the Google Cloud pipeline component CustomPythonPackageTrainingJobRunOp in a Vertex AI pipeline. I have been able to run this package successfully as a CustomTrainingJob before. I can see multiple (11) error messages in the logs, but the only one that seems to make sense to me is "ValueError: too many values to unpack (expected 2)", and I am unable to figure out the solution. I can add all the other error messages too if required. I am logging some messages at the start of the training code, so I know the errors happen before the training code is executed. I am completely stuck on this. Links to samples where someone has used CustomPythonPackageTrainingJobRunOp in a pipeline would be very helpful as well. Below is the pipeline code that I am trying to execute:
import kfp
from kfp.v2 import compiler
from kfp.v2.google.client import AIPlatformClient
from google_cloud_pipeline_components import aiplatform as gcc_aip


@kfp.dsl.pipeline(name=pipeline_name)
def pipeline(
    project: str = "adsfafs-321118",
    location: str = "us-central1",
    display_name: str = "vertex_pipeline",
    python_package_gcs_uri: str = "gs://vertex/training/training-package-3.0.tar.gz",
    python_module_name: str = "trainer.task",
    container_uri: str = "us-docker.pkg.dev/vertex-ai/training/scikit-learn-cpu.0-23:latest",
    staging_bucket: str = "vertex_bucket",
    base_output_dir: str = "gs://vertex_artifacts/custom_training/",
):
    gcc_aip.CustomPythonPackageTrainingJobRunOp(
        display_name=display_name,
        python_package_gcs_uri=python_package_gcs_uri,
        python_module=python_module_name,
        container_uri=container_uri,
        project=project,
        location=location,
        staging_bucket=staging_bucket,
        base_output_dir=base_output_dir,
        args=["--arg1=val1", "--arg2=val2", ...],
    )


compiler.Compiler().compile(
    pipeline_func=pipeline, package_path=package_path
)

api_client = AIPlatformClient(project_id=project_id, region=region)
response = api_client.create_run_from_job_spec(
    package_path,
    pipeline_root=pipeline_root_path,
)
In the documentation for CustomPythonPackageTrainingJobRunOp, the type of the "python_module" argument seems to be "google.cloud.aiplatform.training_jobs.CustomPythonPackageTrainingJob" instead of string, which seems odd. However, I tried to re-define the pipeline, replacing the python_module argument of CustomPythonPackageTrainingJobRunOp with a CustomPythonPackageTrainingJob object instead of a string, as below, but I am still getting the same error:
def pipeline(
    project: str = "...",
    location: str = "...",
    display_name: str = "...",
    python_package_gcs_uri: str = "...",
    python_module_name: str = "...",
    container_uri: str = "...",
    staging_bucket: str = "...",
    base_output_dir: str = "...",
):
    job = aiplatform.CustomPythonPackageTrainingJob(
        display_name=display_name,
        python_package_gcs_uri=python_package_gcs_uri,
        python_module_name=python_module_name,
        container_uri=container_uri,
        staging_bucket=staging_bucket,
    )
    gcc_aip.CustomPythonPackageTrainingJobRunOp(
        display_name=display_name,
        python_package_gcs_uri=python_package_gcs_uri,
        python_module=job,
        container_uri=container_uri,
        project=project,
        location=location,
        base_output_dir=base_output_dir,
        args=["--arg1=val1", "--arg2=val2", ...],
    )
Edit:
Added the args that I was passing and had forgotten to add here.
It turns out that the way I was passing the args to the Python module was incorrect. Instead of args = ["--arg1=val1", "--arg2=val2", ...], you need to specify args = ["--arg1", val1, "--arg2", val2, ...].
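
For illustration only, the component call from the first snippet would then become something like this (val1 and val2 stand in for the real argument values):

gcc_aip.CustomPythonPackageTrainingJobRunOp(
    display_name=display_name,
    python_package_gcs_uri=python_package_gcs_uri,
    python_module=python_module_name,
    container_uri=container_uri,
    project=project,
    location=location,
    staging_bucket=staging_bucket,
    base_output_dir=base_output_dir,
    # each flag and its value are separate list elements
    args=["--arg1", "val1", "--arg2", "val2"],
)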

How to implement Haystack search fetched autocomplete

I want to implement fetching in autocomplete; here is my autocomplete function:
def autocomplete(request):
    fetch_field = request.GET.get('fetch_field')
    sqs = SearchQuerySet().autocomplete(
        content_auto=request.GET.get('query', ''))[:5]
    s = []
    for result in sqs:
        d = {"value": result.title, "data": result.object.slug}
        s.append(d)
    output = {'suggestions': s}
    print('hihi', output)
    return JsonResponse(output)
Now I can get the fetch field, but I don't know how to fetch with SearchQuerySet.
sqs = SearchQuerySet().filter(field_want_to_fetch=fetch_field).autocomplete(
    content_auto=request.GET.get('query', ''))[:5]
Use this !!
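
Put together, the view from the question might then look like this sketch (field_want_to_fetch is a placeholder for whichever indexed field the filter should apply to, and handling a missing fetch_field parameter is left out):

def autocomplete(request):
    fetch_field = request.GET.get('fetch_field')
    # filter on the fetched field first, then run the autocomplete query
    sqs = SearchQuerySet().filter(field_want_to_fetch=fetch_field).autocomplete(
        content_auto=request.GET.get('query', ''))[:5]
    suggestions = [
        {"value": result.title, "data": result.object.slug}
        for result in sqs
    ]
    return JsonResponse({'suggestions': suggestions})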