Invoke sagemaker endpoint with custom inference script - amazon-web-services

I've deployed a sagemaker endpoint using the following code:
from sagemaker.pytorch import PyTorchModel
from sagemaker import get_execution_role, Session
sess = Session()
role = get_execution_role()
model = PyTorchModel(model_data=my_trained_model_location,
                     role=role,
                     sagemaker_session=sess,
                     framework_version='1.5.0',
                     entry_point='inference.py',
                     source_dir='.')
predictor = model.deploy(initial_instance_count=1,
                         instance_type='ml.m4.xlarge',
                         endpoint_name='my_endpoint')
If I run:
import numpy as np
pseudo_data = [np.random.randn(1, 300), np.random.randn(6, 300), np.random.randn(3, 300), np.random.randn(7, 300), np.random.randn(5, 300)] # input data is a list of 2D numpy arrays with variable first dimension and fixed second dimension
result = predictor.predict(pseudo_data)
I can generate the result with no errors. However, if I want to invoke the endpoint and make a prediction by running:
from sagemaker.predictor import RealTimePredictor
predictor = RealTimePredictor(endpoint='my_endpoint')
result = predictor.predict(pseudo_data)
I'd get an error:
Traceback (most recent call last):
File "default_local.py", line 77, in <module>
score = predictor.predict(input_data)
File "/home/biggytruck/.local/lib/python3.6/site-packages/sagemaker/predictor.py", line 113, in predict
response = self.sagemaker_session.sagemaker_runtime_client.invoke_endpoint(**request_args)
File "/home/biggytruck/.local/lib/python3.6/site-packages/botocore/client.py", line 316, in _api_call
return self._make_api_call(operation_name, kwargs)
File "/home/biggytruck/.local/lib/python3.6/site-packages/botocore/client.py", line 608, in _make_api_call
api_params, operation_model, context=request_context)
File "/home/biggytruck/.local/lib/python3.6/site-packages/botocore/client.py", line 656, in _convert_to_request_dict
api_params, operation_model)
File "/home/biggytruck/.local/lib/python3.6/site-packages/botocore/validate.py", line 297, in serialize_to_request
raise ParamValidationError(report=report.generate_report())
botocore.exceptions.ParamValidationError: Parameter validation failed:
Invalid type for parameter Body
From my understanding, the error occurs because I didn't pass inference.py as the entry point file, which is required to handle the input since it isn't in a standard format supported by SageMaker. However, sagemaker.predictor.RealTimePredictor doesn't let me define an entry point file. How can I solve this?

The error you're seeing is raised by the client-side SageMaker Python SDK, not the remote endpoint that you have published.
Here is the documentation for the data argument (in your case, this is pseudo_data):
data (object) – Input data for which you want the model to provide inference. If a serializer was specified when creating the RealTimePredictor, the result of the serializer is sent as input data. Otherwise the data must be sequence of bytes, and the predict method then sends the bytes in the request body as is.
Source: https://sagemaker.readthedocs.io/en/stable/api/inference/predictors.html#sagemaker.predictor.RealTimePredictor.predict
My guess is that pseudo_data is not the type that the SageMaker Python SDK is expecting, which is a sequence of bytes.
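In other words, either give the predictor a serializer or serialize the payload to bytes yourself before calling predict. Below is a minimal sketch, assuming the v1.x SageMaker Python SDK (where RealTimePredictor lives) and assuming the input_fn in your inference.py can parse JSON; the content type and payload layout are assumptions you would adjust to whatever your handler actually expects:
import json
import numpy as np
from sagemaker.predictor import RealTimePredictor

pseudo_data = [np.random.randn(1, 300), np.random.randn(6, 300), np.random.randn(3, 300)]

# content_type is an assumption; it must match what inference.py's input_fn parses
predictor = RealTimePredictor(endpoint='my_endpoint',
                              content_type='application/json')

# Turn the ragged list of 2D arrays into something JSON-serializable and
# send it as raw bytes in the request body
payload = json.dumps([arr.tolist() for arr in pseudo_data]).encode('utf-8')
result = predictor.predict(payload)
This is also why the predictor returned by model.deploy() works out of the box: the framework-specific predictor it returns already attaches a serializer that knows how to handle numpy input.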

Related

Airflow 2: Job Not Found when transferring data from BigQuery into Cloud Storage

I am trying to migrate from Cloud Composer 1 to Cloud Composer 2 (from Airflow 1.10.15 to Airflow 2.2.5). When attempting to load data from BigQuery into GCS using the BigQueryToGCSOperator:
from airflow.providers.google.cloud.transfers.bigquery_to_gcs import BigQueryToGCSOperator
# ...
BigQueryToGCSOperator(
    task_id='my-task',
    source_project_dataset_table='my-project-name.dataset-name.table-name',
    destination_cloud_storage_uris=f'gs://my-bucket/another-path/*.jsonl',
    export_format='NEWLINE_DELIMITED_JSON',
    compression=None,
    location='europe-west2'
)
the task fails with the following error:
[2022-06-07, 11:17:01 UTC] {taskinstance.py:1776} ERROR - Task failed with exception
Traceback (most recent call last):
File "/opt/python3.8/lib/python3.8/site-packages/airflow/providers/google/cloud/transfers/bigquery_to_gcs.py", line 141, in execute
job = hook.get_job(job_id=job_id).to_api_repr()
File "/opt/python3.8/lib/python3.8/site-packages/airflow/providers/google/common/hooks/base_google.py", line 439, in inner_wrapper
return func(self, *args, **kwargs)
File "/opt/python3.8/lib/python3.8/site-packages/airflow/providers/google/cloud/hooks/bigquery.py", line 1492, in get_job
job = client.get_job(job_id=job_id, project=project_id, location=location)
File "/opt/python3.8/lib/python3.8/site-packages/google/cloud/bigquery/client.py", line 2066, in get_job
resource = self._call_api(
File "/opt/python3.8/lib/python3.8/site-packages/google/cloud/bigquery/client.py", line 782, in _call_api
return call()
File "/opt/python3.8/lib/python3.8/site-packages/google/api_core/retry.py", line 283, in retry_wrapped_func
return retry_target(
File "/opt/python3.8/lib/python3.8/site-packages/google/api_core/retry.py", line 190, in retry_target
return target()
File "/opt/python3.8/lib/python3.8/site-packages/google/cloud/_http/__init__.py", line 494, in api_request
raise exceptions.from_http_response(response)
google.api_core.exceptions.NotFound: 404 GET https://bigquery.googleapis.com/bigquery/v2/projects/my-project-name/jobs/airflow_1654592634552749_1896245556bd824c71f31c79d28cdfbe?projection=full&prettyPrint=false: Not found: Job my-project-name:airflow_1654592634552749_1896245556bd824c71f31c79d28cdfbe
Any clue what the issue may be here, and why it does not work on Airflow 2.2.5 (even though the equivalent BigQueryToCloudStorageOperator works in Cloud Composer 1 on Airflow 1.10.15)?
This appears to be a bug introduced in apache-airflow-providers-google version 7.0.0.
Also note that the file transfer from BigQuery into GCS will actually succeed (even though the task fails).
As a workaround you can either revert to a working provider version (if that is possible), e.g. 6.8.0, or use the BigQuery client API directly and drop BigQueryToGCSOperator.
For example,
from google.cloud import bigquery
from airflow.operators.python import PythonOperator
def load_bq_to_gcs():
    client = bigquery.Client()
    job_config = bigquery.job.ExtractJobConfig()
    job_config.destination_format = bigquery.DestinationFormat.NEWLINE_DELIMITED_JSON
    destination_uri = f"{<gcs-bucket-destination>}*.jsonl"
    dataset_ref = bigquery.DatasetReference(bq_project_name, bq_dataset_name)
    table_ref = dataset_ref.table(bq_table_name)
    extract_job = client.extract_table(
        table_ref,
        destination_uri,
        job_config=job_config,
        location='europe-west2',
    )
    extract_job.result()
and then create an instance of PythonOperator:
PythonOperator(
    task_id='test_task',
    python_callable=load_bq_to_gcs,
)
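As a variant, if you prefer not to rely on module-level names such as bq_project_name inside the callable, PythonOperator can pass them in through op_kwargs. A short sketch under that assumption (the values are just the placeholders from the question):
from google.cloud import bigquery
from airflow.operators.python import PythonOperator

def load_bq_to_gcs(bq_project_name, bq_dataset_name, bq_table_name, destination_uri):
    # Same extract job as above, but fully parameterised
    client = bigquery.Client()
    job_config = bigquery.job.ExtractJobConfig()
    job_config.destination_format = bigquery.DestinationFormat.NEWLINE_DELIMITED_JSON
    table_ref = bigquery.DatasetReference(bq_project_name, bq_dataset_name).table(bq_table_name)
    extract_job = client.extract_table(
        table_ref,
        destination_uri,
        job_config=job_config,
        location='europe-west2',
    )
    extract_job.result()

PythonOperator(
    task_id='test_task',
    python_callable=load_bq_to_gcs,
    op_kwargs={
        'bq_project_name': 'my-project-name',
        'bq_dataset_name': 'dataset-name',
        'bq_table_name': 'table-name',
        'destination_uri': 'gs://my-bucket/another-path/*.jsonl',
    },
)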

ERROR: tabular dataset treated as image dataset (Vertex AI Pipelines: Custom training)

I used Vertex AI Pipelines to run custom training on tabular data.
I ran the Python code below and then created a run of the pipeline from the generated JSON.
The following error occurred at the start of training.
Why is the tabular dataset being treated as an image dataset? What is wrong?
Environment
Python 3.7.3
kfp==1.6.2
kfp-pipeline-spec==0.1.7
kfp-server-api==1.6.0
Error message
Traceback (most recent call last):
File "/opt/python3.7/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/opt/python3.7/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/opt/python3.7/lib/python3.7/site-packages/google_cloud_pipeline_components/aiplatform/remote_runner.py", line 284, in <module>
main()
File "/opt/python3.7/lib/python3.7/site-packages/google_cloud_pipeline_components/aiplatform/remote_runner.py", line 280, in main
print(runner(args.cls_name, args.method_name, executor_input, kwargs))
File "/opt/python3.7/lib/python3.7/site-packages/google_cloud_pipeline_components/aiplatform/remote_runner.py", line 236, in runner
prepare_parameters(serialized_args[METHOD_KEY], method, is_init=False)
File "/opt/python3.7/lib/python3.7/site-packages/google_cloud_pipeline_components/aiplatform/remote_runner.py", line 205, in prepare_parameters
value = cast(value, param_type)
File "/opt/python3.7/lib/python3.7/site-packages/google_cloud_pipeline_components/aiplatform/remote_runner.py", line 176, in cast
return annotation_type(value)
File "/opt/python3.7/lib/python3.7/site-packages/google/cloud/aiplatform/datasets/dataset.py", line 82, in __init__
self._validate_metadata_schema_uri()
File "/opt/python3.7/lib/python3.7/site-packages/google/cloud/aiplatform/datasets/dataset.py", line 100, in _validate_metadata_schema_uri
f"{self.__class__.__name__} class can not be used to retrieve "
ValueError: ImageDataset class can not be used to retrieve dataset resource projects/nnnnnnnnnnnn/locations/us-central1/datasets/3781554739456507904, check the dataset type
Python code:
import datetime
from kfp.v2 import dsl, compiler
from kfp.v2.google.client import AIPlatformClient
import google_cloud_pipeline_components.aiplatform as gcc_ai
PROJECT = "my-project"
PIPELINE_NAME = "test-pipeline"
PIPELINE_ROOT_PATH = f"gs://test-pipeline-20210525/{PIPELINE_NAME}"
@dsl.pipeline(
    name=PIPELINE_NAME,
    pipeline_root=PIPELINE_ROOT_PATH
)
def test_pipeline(
    display_name: str = f"{PIPELINE_NAME}-2021MMDD-nn"
):
    dataset_create_op = gcc_ai.TabularDatasetCreateOp(
        project=PROJECT, display_name=display_name,
        gcs_source="gs://used_apartment/datasource/train.csv"
    )
    training_job_run_op = gcc_ai.CustomContainerTrainingJobRunOp(
        project=PROJECT, display_name=display_name,
        container_uri="us-central1-docker.pkg.dev/my-project/dataops-rc2021/custom-train:latest",
        staging_bucket="vertex_ai_staging_rc2021",
        base_output_dir="gs://used_apartment/cstm_img_scrf/artifact",
        model_serving_container_image_uri="us-central1-docker.pkg.dev/my-project/dataops-rc2021/custom-pred:latest",
        model_serving_container_predict_route="/",
        model_serving_container_health_route="/health",
        model_serving_container_ports=[8080],
        training_fraction_split=0.8,
        validation_fraction_split=0.1,
        test_fraction_split=0.1,
        dataset=dataset_create_op.outputs["dataset"]
    )

def run_pipeline(event=None, context=None):
    # Compile the pipeline using the kfp.v2.compiler.Compiler
    compiler.Compiler().compile(
        pipeline_func=test_pipeline,
        package_path="test-pipeline.json"
    )

if __name__ == '__main__':
    run_pipeline()
This seems to be a bug in CustomContainerTrainingJobRunOp component code. We were able to reproduce the error.
I have created the tracking bug https://github.com/kubeflow/pipelines/issues/5885.

Parsing multipage tables into CSV files with AWS Textract

I'm a total AWS newbie trying to parse tables in multi-page files into CSV files with AWS Textract.
I tried using AWS's example on this page; however, when dealing with a multi-page file, the call response = client.analyze_document(Document={'Bytes': bytes_test}, FeatureTypes=['TABLES']) breaks, since those cases require asynchronous processing, as you can see in the documentation here. The correct function to call is client.start_document_analysis, and after running it you retrieve the result using client.get_document_analysis(JobId).
So I adapted their example to use this logic instead of the client.analyze_document function; the adapted piece of code looks like this:
client = boto3.client('textract')
response = client.start_document_analysis(Document={'Bytes': bytes_test}, FeatureTypes=['TABLES'])
jobid = response['JobId']
jobstatus = "IN_PROGRESS"
while jobstatus == "IN_PROGRESS":
    response = client.get_document_analysis(JobId=jobid)
    jobstatus = response['JobStatus']
    if jobstatus == "IN_PROGRESS": print("IN_PROGRESS")
    time.sleep(5)
But when I run that I get the following error:
Traceback (most recent call last):
File "/Users/santanna_santanna/PycharmProjects/KlooksExplore/PDFWork/textract_python_table_parser.py", line 125, in <module>
main(file_name)
File "/Users/santanna_santanna/PycharmProjects/KlooksExplore/PDFWork/textract_python_table_parser.py", line 112, in main
table_csv = get_table_csv_results(file_name)
File "/Users/santanna_santanna/PycharmProjects/KlooksExplore/PDFWork/textract_python_table_parser.py", line 62, in get_table_csv_results
response = client.start_document_analysis(Document={'Bytes': bytes_test}, FeatureTypes=['TABLES'])
File "/Users/santanna_santanna/anaconda3/lib/python3.6/site-packages/botocore/client.py", line 316, in _api_call
return self._make_api_call(operation_name, kwargs)
File "/Users/santanna_santanna/anaconda3/lib/python3.6/site-packages/botocore/client.py", line 608, in _make_api_call
api_params, operation_model, context=request_context)
File "/Users/santanna_santanna/anaconda3/lib/python3.6/site-packages/botocore/client.py", line 656, in _convert_to_request_dict
api_params, operation_model)
File "/Users/santanna_santanna/anaconda3/lib/python3.6/site-packages/botocore/validate.py", line 297, in serialize_to_request
raise ParamValidationError(report=report.generate_report())
botocore.exceptions.ParamValidationError: Parameter validation failed:
Missing required parameter in input: "DocumentLocation"
Unknown parameter in input: "Document", must be one of: DocumentLocation, FeatureTypes, ClientRequestToken, JobTag, NotificationChannel
And that happens because the standard way to call start_document_analysis is with an S3 file, using this sort of syntax:
response = client.start_document_analysis(
    DocumentLocation={
        'S3Object': {
            'Bucket': s3BucketName,
            'Name': documentName
        }
    },
    FeatureTypes=["TABLES"])
However, if I do that I will break the command-line logic proposed in the AWS example:
python textract_python_table_parser.py file.pdf
The question is: how do I adapt the AWS example so that it can process multi-page files?
Consider using two different Lambdas: one to call Textract and one to process the result.
Please read this document:
https://aws.amazon.com/blogs/compute/getting-started-with-rpa-using-aws-step-functions-and-amazon-textract/
And check this repository:
https://github.com/aws-samples/aws-step-functions-rpa
To process the JSON you can use this sample as a reference:
https://github.com/aws-samples/amazon-textract-response-parser
or use it directly as a library:
python -m pip install amazon-textract-response-parser
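If you would rather keep the single-script, command-line flow from the original AWS example instead of moving to Step Functions, the pieces already shown in the question can be combined into something like the sketch below. It assumes you can upload the PDF to an S3 bucket first (the bucket name is a placeholder), since start_document_analysis only reads from S3, and it gathers the paginated results before any table parsing:
import time
import boto3

def get_analysis_blocks(file_name, bucket='my-textract-bucket'):
    # start_document_analysis only accepts an S3 location, so upload the file first
    s3 = boto3.client('s3')
    s3.upload_file(file_name, bucket, file_name)

    textract = boto3.client('textract')
    job = textract.start_document_analysis(
        DocumentLocation={'S3Object': {'Bucket': bucket, 'Name': file_name}},
        FeatureTypes=['TABLES'])
    job_id = job['JobId']

    # Poll until the asynchronous job finishes
    while True:
        response = textract.get_document_analysis(JobId=job_id)
        status = response['JobStatus']
        if status in ('SUCCEEDED', 'FAILED'):
            break
        time.sleep(5)
    if status == 'FAILED':
        raise RuntimeError('Textract analysis failed for {}'.format(file_name))

    # Results are paginated: keep fetching pages until NextToken disappears
    blocks = response['Blocks']
    next_token = response.get('NextToken')
    while next_token:
        response = textract.get_document_analysis(JobId=job_id, NextToken=next_token)
        blocks.extend(response['Blocks'])
        next_token = response.get('NextToken')
    return blocks
The blocks collected this way have the same structure as the Blocks field returned by the synchronous analyze_document call, so the table-to-CSV parsing from the AWS example (or the response-parser library above) should work on them largely unchanged.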

Softlayer Object Storage Containers: Unable to find the server

Problem:
I'm trying to debug some of my code that uses SoftLayer Object Storage, but I kept getting errors from SoftLayer itself. Since the error came from SoftLayer, I went ahead and wrote some code that reproduces it, which can be seen below, followed by the stack trace I get.
Question:
Does anyone know why I'm getting the error below, other than possible protection against hitting SoftLayer so many times?
Source Code:
#!/usr/local/bin/python2.7
import argparse
import object_storage

def main():
    parser = argparse.ArgumentParser(description='Spam multiple sl storage containers.')
    parser.add_argument("--username", type=str, required=True, help="softlayer username")
    parser.add_argument("--apikey", type=str, required=True, help="softlayer api key")
    parser.add_argument("--datacenter", type=str, required=True, help="softlayer datacenter")
    parser.add_argument("--count", type=int, required=True, help="Amount of times to iterate")
    args = parser.parse_args()

    username = args.username
    api_key = args.apikey
    datacenter = args.datacenter
    count = args.count

    for i in range(0, count):
        print "Trying to create sl_storage.containers() #{0}".format(i)
        sl_storage = object_storage.get_client(username, api_key, datacenter=datacenter)
        containers = sl_storage.containers()
        del containers
        del sl_storage

if __name__ == "__main__":
    main()
Stack Trace:
Traceback (most recent call last):
File "/root/sl_test.py", line 32, in <module>
main()
File "/root/sl_test.py", line 27, in main
containers = sl_storage.containers()
File "/usr/local/lib/python2.7/site-packages/object_storage/client.py", line 293, in containers
formatter=_formatter)
File "/usr/local/lib/python2.7/site-packages/object_storage/client.py", line 354, in make_request
result = self.conn.make_request(method, url, *args, **kwargs)
File "/usr/local/lib/python2.7/site-packages/object_storage/transport/httplib2conn.py", line 55, in make_request
response = _make_request(headers)
File "/usr/local/lib/python2.7/site-packages/object_storage/transport/httplib2conn.py", line 48, in _make_request
body=data)
File "/usr/local/lib/python2.7/site-packages/httplib2/__init__.py", line 1659, in request
(response, content) = self._request(conn, authority, uri, request_uri, method, body, headers, redirections, cachekey)
File "/usr/local/lib/python2.7/site-packages/httplib2/__init__.py", line 1399, in _request
(response, content) = self._conn_request(conn, request_uri, method, body, headers)
File "/usr/local/lib/python2.7/site-packages/httplib2/__init__.py", line 1325, in _conn_request
raise ServerNotFoundError("Unable to find the server at %s" % conn.host)
httplib2.ServerNotFoundError: Unable to find the server at dal05.objectstorage.softlayer.net
I opened an issue against the softlayer-object-storage Python package here: https://github.com/softlayer/softlayer-object-storage-python/issues/50
First: I think you should have opened the issue here:
https://github.com/softlayer/softlayer-object-storage-python
Second: I do not think this is a bug; for me this is working fine. The error is probably because you do not have any storage in dal05. You can verify that in the control portal by going to https://control.softlayer.com/storage/objectstorage and making sure that there are containers in dal05.
Third: the client that you are using only works for Swift storage; it does not work for S3 containers.
The storage in this issue is Swift-based, not S3.

django-cumulus: retrieve PIL.Image object from django.db.models.ImageField

I'm using django-cumulus to store my media on Rackspace cloud.
I need to load the data from an ImageField into a PIL.Image, so that I can make some changes to the image (cropping, filters, etc.) and save it to another cumulus ImageField.
I tried this code:
def field_to_image(field):
    # field - cumulus-powered ImageField on some model
    from StringIO import StringIO
    from PIL import Image
    r = field.read()  # ERROR throws here!
    image = Image.open(StringIO(r))
    return image
It worked fine on half of my files, but on the other half I always get this error:
Traceback (most recent call last):
File "tmp.py", line 78, in <module>
resize_photos(start)
File "tmp.py", line 59, in resize_photos
photo.make_thumbs()
File "/hosting/site/news/models.py", line 65, in make_thumbs
i = functions.field_to_image(self.img)
File "/hosting/site/functions.py", line 169, in field_to_image
r = field.read()
File "/usr/local/lib/python2.7/dist-packages/cumulus/storage.py", line 352, in read
if self._pos == self._get_size() or chunk_size == 0:
File "/usr/local/lib/python2.7/dist-packages/cumulus/storage.py", line 322, in _get_size
self._size = self._storage.size(self.name)
File "/usr/local/lib/python2.7/dist-packages/cumulus/storage.py", line 244, in size
return self._get_object(name).total_bytes
AttributeError: 'bool' object has no attribute 'total_bytes'
Can anyone help me? Maybe there is a better way to retrieve a PIL.Image object from Rackspace?
The file I'm trying to read() exists and is available via URL on Rackspace.
The storage backend's _get_object() returns False if the file is not found in the container, hence the very confusing error.
It is fixed now in the repo, but not yet in a released version: it returns None instead of False:
https://github.com/django-cumulus/django-cumulus/blob/master/cumulus/storage.py#L203
But the basic cause of the problem is that the file is not found.
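If you need the conversion to keep running over the rest of the files while some objects are missing from the container, a small defensive sketch is below. It assumes the cumulus backend implements the standard Django Storage.exists() API; the helper name is just illustrative:
from StringIO import StringIO
from PIL import Image

def field_to_image_safe(field):
    # field - cumulus-powered ImageField on some model
    # Skip objects the container no longer knows about instead of
    # crashing with the AttributeError above
    if not field or not field.storage.exists(field.name):
        return None
    data = field.read()
    return Image.open(StringIO(data))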