I want to pass an S3 file name as an input_file_path argument to a job that I run from the Glue console. Is there any way to supply the input_file_path argument from the AWS Glue console?
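If you want to create a job parameter via the Glue console (UI), you first have to define the argument in the job:
unfold the "Script libraries and job parameters" section
enter your parameter as a key/value pair; the key must start with -- (for example --input_file_path) and the value is your S3 path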
Read it in your code (Scala example):
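import com.amazonaws.services.glue.util.GlueArgParser

def main(sysArgs: Array[String]): Unit = {
  // Resolve the job parameter that Glue passes as "--input_file_path <value>"
  val args = GlueArgParser.getResolvedOptions(sysArgs, Array("input_file_path"))
  print(s"Input path: ${args("input_file_path")}")
}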
Read it in your code (Python):
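import sys
from awsglue.utils import getResolvedOptions

# Resolve the job parameter that Glue passes as "--input_file_path <value>"
args = getResolvedOptions(sys.argv, ['input_file_path'])
print(args['input_file_path'])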
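For trying the script outside Glue, where the awsglue module is typically not installed, the following is a minimal sketch of a local stand-in. It is not the awsglue implementation; resolve_options and the sample argv values are hypothetical names introduced here, and the sketch only assumes that parameters reach the script as "--name value" pairs.

```python
import argparse


def resolve_options(argv, option_names):
    """Local stand-in (hypothetical) for reading '--name value' job parameters."""
    parser = argparse.ArgumentParser()
    for name in option_names:
        # Every requested option is mandatory, so a missing one fails loudly.
        parser.add_argument("--" + name, required=True)
    # Ignore arguments we did not ask for (for example --JOB_NAME).
    known, _unknown = parser.parse_known_args(argv[1:])
    return vars(known)


# Hypothetical command line, shaped like "script --key value ..."
argv = ["job.py", "--JOB_NAME", "demo",
        "--input_file_path", "s3://my-bucket/data/input.csv"]
args = resolve_options(argv, ["input_file_path"])
print(args["input_file_path"])
```

The console key --input_file_path corresponds to the name input_file_path that the script asks for, without the leading dashes.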
I am following the steps in the SageMaker Monitoring Tutorial here:
https://sagemaker-examples.readthedocs.io/en/latest/sagemaker_model_monitor/introduction/SageMaker-ModelMonitoring.html
And for the line:
bucket.Object(code_prefix + "/preprocessor.py").upload_file("preprocessor.py")
I get the error:
TypeError: expected string or bytes-like object
Which I don't understand, because the input to the upload_file() function is "preprocessor.py", which is a string.
To unblock you quickly, you can take the SageMaker session route to upload the artifact to the S3 bucket:
import sagemaker
from sagemaker import get_execution_role
# S3 bucket for saving code and model artifacts.
# Feel free to specify a different bucket and prefix
bucket = sagemaker.Session().default_bucket()
artifact_name = "preprocessor.py"
prefix = "code"  # change as required
sample_url = sagemaker.Session().upload_data(
    artifact_name, bucket=bucket, key_prefix=prefix
)
I am using the SageMaker Python SDK for my inference job and following this guide. I am triggering my SageMaker inference job from Airflow with the Python callable below:
def transform(sage_role, inference_file_local_path, **kwargs):
    """
    Python callable to execute Sagemaker SDK train job. It takes infer_batch_output, infer_batch_input, model_artifact,
    instance_type and infer_file_name as run time parameter.
    :param inference_file_local_path: Local entry_point path for Inference file.
    :param sage_role: Sagemaker execution role.
    """
    model = TensorFlowModel(entry_point=infer_file_name,
                            source_dir=inference_file_local_path,
                            model_data=model_artifact,
                            role=sage_role,
                            framework_version="2.5.1")
    tensorflow_serving_transformer = model.transformer(
        instance_count=1,
        instance_type=instance_type,
        accept="text/csv",
        strategy="SingleRecord",
        max_payload=10,
        max_concurrent_transforms=10,
        output_path=batch_output)
    return tensorflow_serving_transformer.transform(data=batch_input, content_type='text/csv')
and my simple inference.py looks like:
def input_handler(data, context):
    """ Pre-process request input before it is sent to TensorFlow Serving REST API
    Args:
        data (obj): the request data, in format of dict or string
        context (Context): an object containing request and configuration details
    Returns:
        (dict): a JSON-serializable dict that contains request body and headers
    """
    if context.request_content_type == 'application/x-npy':
        # very simple numpy handler
        payload = np.load(data.read().decode('utf-8'))
        x_user_feature = np.asarray(payload.item().get('test').get('feature_a_list'))
        x_channel_feature = np.asarray(payload.item().get('test').get('feature_b_list'))
        examples = []
        for index, elem in enumerate(x_user_feature):
            examples.append({'feature_a_list': elem, 'feature_b_list': x_channel_feature[index]})
        return json.dumps({'instances': examples})
    if context.request_content_type == 'text/csv':
        payload = pd.read_csv(data)
        print("Model name is ..............")
        model_name = context.model_name
        print(model_name)
        examples = []
        row_ch = []
        if config_exists(model_bucket, "{}{}".format(config_path, model_name)):
            config_keys = get_s3_json_file(model_bucket, "{}{}".format(config_path, model_name))
            feature_b_list = config_keys["feature_b_list"].split(",")
            row_ch = [float(ch_feature_str) for ch_feature_str in feature_b_list]
            if "column_names" in config_keys.keys():
                cols = config_keys["column_names"].split(",")
                payload.columns = cols
        for index, row in payload.iterrows():
            row_user = row['feature_a_list'].replace('[', '').replace(']', '').split()
            row_user = [float(x) for x in row_user]
            if not row_ch:
                row_ch = row['feature_b_list'].replace('[', '').replace(']', '').split()
                row_ch = [float(x) for x in row_ch]
            example = {'feature_a_list': row_user, 'feature_b_list': row_ch}
            examples.append(example)
        return json.dumps({'instances': examples})
    raise ValueError('{{"error": "unsupported content type {}"}}'.format(
        context.request_content_type or "unknown"))
def output_handler(data, context):
    """Post-process TensorFlow Serving output before it is returned to the client.
    Args:
        data (obj): the TensorFlow serving response
        context (Context): an object containing request and configuration details
    Returns:
        (bytes, string): data to return to client, response content type
    """
    if data.status_code != 200:
        raise ValueError(data.content.decode('utf-8'))
    response_content_type = context.accept_header
    prediction = data.content
    return prediction, response_content_type
It is working fine; however, I want to pass custom arguments to inference.py so that I can modify the input data based on the requirement. I thought of using a config file per requirement and downloading it from S3 based on the model name, but because I am using model_data and pass model.tar.gz at run time, context.model_name is always None.
Is there a way to pass run-time arguments to inference.py that I can use for customization?
In the docs I see that SageMaker provides custom_attributes, but I don't see any example of how to use it and access it in inference.py.
custom_attributes (string): content of ‘X-Amzn-SageMaker-Custom-Attributes’ header from the original request. For example, ‘tfs-model-name=half_plus_three,tfs-method=predict’
Currently, CustomAttributes is supported in the InvokeEndpoint API call when using a real-time endpoint.
As an alternative for a Transform Job, you can pass JSON Lines as input, where each record contains both the input payload and some custom arguments that you can consume in your inference.py file.
For example:
{
"input":"1,2,3,4",
"custom_args":"my_custom_arg"
}
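A minimal sketch of how such a record could be split inside inference.py, assuming each line of the transform input is one JSON object with the "input" and "custom_args" keys from the example above. The helper name split_record is hypothetical, and the sketch assumes the payload is a comma-separated list of numbers.

```python
import json


def split_record(line):
    """Split one JSON Lines record into (features, custom_args)."""
    record = json.loads(line)
    # "input" holds the comma-separated payload (assumed numeric here).
    features = [float(x) for x in record["input"].split(",")]
    # "custom_args" is optional; None means no customization was requested.
    custom_args = record.get("custom_args")
    return features, custom_args


line = '{"input": "1,2,3,4", "custom_args": "my_custom_arg"}'
features, custom_args = split_record(line)
print(features, custom_args)
```

The input handler could then branch on custom_args before building the request body for TensorFlow Serving.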
I have two Glue jobs, created from the AWS console. I would like to invoke one Glue job (Python) from another Glue job (Python) with parameters. What would be the best approach to do this? I appreciate your help.
You can use Glue workflows and set up workflow parameters, as mentioned by Bob Haffner. Trigger the Glue jobs using the workflow. The advantage is that if the second Glue job fails, you can resume or rerun only the second job after fixing the issue. You can also pass workflow parameters from one Glue job to another. Sample code for reading and writing workflow parameters:
In the first Glue job:
import sys
import boto3
from awsglue.utils import getResolvedOptions

glue_client = boto3.client('glue')
args = getResolvedOptions(sys.argv, ['JOB_NAME', 'WORKFLOW_NAME', 'WORKFLOW_RUN_ID'])
workflow_name = args['WORKFLOW_NAME']
workflow_run_id = args['WORKFLOW_RUN_ID']
# Read the current run properties, add the new values, and write them back
workflow_params = glue_client.get_workflow_run_properties(Name=workflow_name, RunId=workflow_run_id)["RunProperties"]
workflow_params['param1'] = param_value1
workflow_params['param2'] = param_value2
workflow_params['param3'] = param_value3
workflow_params['param4'] = param_value4
glue_client.put_workflow_run_properties(Name=workflow_name, RunId=workflow_run_id, RunProperties=workflow_params)
and in the second glue job:
import sys
import boto3
from awsglue.utils import getResolvedOptions

glue_client = boto3.client('glue')
args = getResolvedOptions(sys.argv, ['WORKFLOW_NAME', 'WORKFLOW_RUN_ID'])
workflow_name = args['WORKFLOW_NAME']
workflow_run_id = args['WORKFLOW_RUN_ID']
# Read the properties that the first job stored for this workflow run
workflow_params = glue_client.get_workflow_run_properties(Name=workflow_name, RunId=workflow_run_id)["RunProperties"]
param_value1 = workflow_params['param1']
param_value2 = workflow_params['param2']
param_value3 = workflow_params['param3']
param_value4 = workflow_params['param4']
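Workflow run properties are, as far as I know, a map of string keys to string values, so non-string values need to be serialized before they are stored. The following is a minimal sketch of one way to do that with JSON; the helper names to_run_properties and from_run_property, and the sample parameters, are hypothetical.

```python
import json


def to_run_properties(params):
    """Convert Python values into a string-to-string map (hypothetical helper)."""
    props = {}
    for key, value in params.items():
        # Strings are stored unchanged; everything else is JSON-encoded.
        props[str(key)] = value if isinstance(value, str) else json.dumps(value)
    return props


def from_run_property(props, key):
    """Decode a value written by to_run_properties; plain strings pass through."""
    raw = props[key]
    try:
        return json.loads(raw)
    except ValueError:
        return raw


# Hypothetical parameters set by the first job
params = {"input_path": "s3://my-bucket/in/", "row_limit": 1000, "columns": ["a", "b"]}
props = to_run_properties(params)
print(props)
print(from_run_property(props, "row_limit"))
```

In the first job the result of to_run_properties would be merged into workflow_params before calling put_workflow_run_properties, and the second job would decode the values it needs. Note that a plain string that happens to be valid JSON, such as "123", is decoded as a number by this sketch.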
To learn how to set up a Glue workflow, see:
https://docs.aws.amazon.com/glue/latest/dg/creating_running_workflows.html
https://medium.com/#pioneer21st/orchestrating-etl-jobs-in-aws-glue-using-workflow-758ef10b8434
I have a program which creates a topic in Pub/Sub and also publishes messages to the topic. I also have an automated Dataflow job (using a template) which saves these messages into my BigQuery table. Now I intend to replace the template-based job with a Python pipeline that reads data from Pub/Sub, applies transformations, and saves the data into BigQuery or publishes it to another Pub/Sub topic. I started writing the script in Python and did a lot of trial and error, but to my dismay I could not get it to work. The code looks like this:
import apache_beam as beam
from apache_beam.io import WriteToText
TOPIC_PATH = "projects/test-pipeline-253103/topics/test-pipeline-topic"
OUTPUT_PATH = "projects/test-pipeline-253103/topics/topic-repub"
def run():
    o = beam.options.pipeline_options.PipelineOptions()
    p = beam.Pipeline(options=o)
    print("I reached here")
    # # Read from PubSub into a PCollection.
    data = (
        p
        | "Read From Pub/Sub" >> beam.io.ReadFromPubSub(topic=TOPIC_PATH)
    )
    data | beam.io.WriteToPubSub(topic=OUTPUT_PATH)
    print("Lines: ", data)
run()
I would really appreciate some help with this as soon as possible.
Note: I have my project set up on google cloud and I have my script running locally.
Here is the working code.
import apache_beam as beam
TOPIC_PATH = "projects/test-pipeline-253103/topics/test-pipeline-topic"
OUTPUT_PATH = "projects/test-pipeline-253103/topics/topic-repub"
class PrintValue(beam.DoFn):
    def process(self, element):
        print(element)
        return [element]
def run():
    o = beam.options.pipeline_options.PipelineOptions()
    # Replace this by the --streaming execution param
    standard_options = o.view_as(beam.options.pipeline_options.StandardOptions)
    standard_options.streaming = True
    p = beam.Pipeline(options=o)
    print("I reached here")
    # # Read from PubSub into a PCollection.
    data = p | beam.io.ReadFromPubSub(topic=TOPIC_PATH) | beam.ParDo(PrintValue()) | beam.io.WriteToPubSub(topic=OUTPUT_PATH)
    # Don't forget to run the pipeline!
    result = p.run()
    result.wait_until_finish()
run()
In summary
You forgot to run the pipeline. Beam is a graph programming model, so in your original code you built the graph but never ran it. Here, at the end, the code runs it (a non-blocking call) and waits for it to finish (a blocking call).
When you start your pipeline, Beam reports that Pub/Sub works only in streaming mode. You can therefore start your code with the --streaming parameter, or set it programmatically as shown in my code.
Be careful: streaming mode means listening to Pub/Sub indefinitely. If you run this on Dataflow, your pipeline will stay up until you stop it. This can be expensive if you have few messages. Be sure that this is the model you want.
An alternative is to run your pipeline for a limited period of time (use a scheduler to start it and another one to stop it). In that case, however, you have to accumulate the messages in the meantime. Here you use a topic as the entry point of the pipeline. This option forces Beam to create a temporary subscription and to listen for messages on it, which means that messages published before the subscription was created won't be received or processed.
The idea is to create a subscription yourself, so that messages accumulate in it (for up to 7 days by default). Then use the subscription name as the entry point of your pipeline: beam.io.ReadFromPubSub(subscription=SUB_PATH). The messages will be drained and processed by Beam (order not guaranteed!).
Based on the Beam programming guide, you simply have to add a transform step to your pipeline. Here is an example of a transform:
class PrintValue(beam.DoFn):
    def process(self, element):
        print(element)
        return [element]
Add it to your pipeline
data | beam.ParDo(PrintValue()) | beam.io.WriteToPubSub(topic=OUTPUT_PATH)
You can add as many transforms as you want. You can also test the value and route elements to tagged PCollections (for multiple outputs) to fan out, or use side inputs to fan in PCollections.
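As a sketch of a transform that changes the data rather than only printing it, the function below decodes a message, adds a field, and encodes it again. It assumes the messages are UTF-8 JSON objects, which the original post does not state; enrich_message and the "processed" field are hypothetical names. With the default settings, ReadFromPubSub yields the payload as bytes and WriteToPubSub expects bytes, so the function takes and returns bytes.

```python
import json


def enrich_message(raw: bytes) -> bytes:
    """Decode a Pub/Sub payload, add a field, and re-encode it."""
    record = json.loads(raw.decode("utf-8"))  # assumption: payload is JSON
    record["processed"] = True                # hypothetical enrichment
    return json.dumps(record).encode("utf-8")


print(enrich_message(b'{"id": 1}'))

# In the pipeline, the function could be applied with beam.Map, for example:
# data | beam.Map(enrich_message) | beam.io.WriteToPubSub(topic=OUTPUT_PATH)
```

Keeping the logic in a plain function makes it easy to test locally before running the streaming pipeline.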
I am new to Jenkins and I am trying to fetch all the jar files from an S3 bucket and display them in a choice parameter in the Jenkins UI, so that we can select a particular jar file and deploy it.
I am using the command below for testing purposes in a run script, and it works.
aws s3 ls company-bucketname/test/
I have installed the "Extensible Choice" plugin to achieve this and am running the Groovy script below, with no luck.
def command = 'aws s3 ls company-bucketname/test/ --output text'
def proc = command.execute()
proc.waitFor()
def output = proc.in.text
def exitcode= proc.exitValue()
def error = proc.err.text
if (error) {
    println "Std Err: ${error}"
    println "Process exit code: ${exitcode}"
    return exitcode
}
//println output.split()
return output.tokenize()
Please let me know if there is a better approach to display S3 values in a Jenkins choice parameter.
According to this blog post, you can use the "Extended Choice Parameter Plugin" and create a script which returns an array of strings that will be used as your choices for the parameter.
It seems like you have the part of fetching the names of the jar files working, so all you need is that plugin and a method returning a list of the choices.
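One detail to watch when building the list of choices: each object line printed by aws s3 ls contains a date, a time, and a size before the key, so splitting the whole output on whitespace returns those columns as well. The sketch below shows the filtering logic in Python for illustration only; the script used by the plugin would be written in Groovy, and the sample listing and the jar_names helper are hypothetical.

```python
def jar_names(ls_output):
    """Return only the .jar keys from 'aws s3 ls' style output."""
    names = []
    for line in ls_output.splitlines():
        # Object lines look like: <date> <time> <size> <key>
        parts = line.split(None, 3)
        if len(parts) == 4 and parts[3].endswith(".jar"):
            names.append(parts[3])
    return names


# Hypothetical listing; "PRE" lines denote prefixes (folders) and are skipped.
sample = """\
                           PRE archive/
2021-03-01 10:15:02    1048576 app-1.0.0.jar
2021-03-02 11:20:45    1049000 app-1.1.0.jar
2021-03-02 11:21:10        512 notes.txt
"""
print(jar_names(sample))
```

The same idea, keeping only the last column of lines that end in .jar, would give the parameter a clean list of strings.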