I have created a Docker image whose entrypoint is processing.py. This script reads data from /opt/ml/processing/input and, after processing, writes the result to the /opt/ml/processing/output folder.
To process the data, I need to place the input file from S3 into /opt/ml/processing/input and then copy the processed file from /opt/ml/processing/output back to S3.
The following script does this properly in SageMaker:
from sagemaker.processing import Processor, ProcessingInput, ProcessingOutput
import sagemaker

input_data = 's3://sagemaker-ap-south-1-057036842446/sagemaker/Data/Training/Churn_Modelling.csv'
output_dir = 's3://sagemaker-ap-south-1-057036842446/sagemaker/Outputs/'
image_uri = '057036842446.dkr.ecr.ap-south-1.amazonaws.com/aws-docker-repo:latest'
aws_role = sagemaker.get_execution_role()

processor = Processor(image_uri=image_uri, role=aws_role, instance_count=1, instance_type="ml.m5.xlarge")

processor.run(
    inputs=[
        ProcessingInput(
            source=input_data,
            destination='/opt/ml/processing/input'
        )
    ],
    outputs=[
        ProcessingOutput(
            source='/opt/ml/processing/output',
            destination=output_dir
        )
    ]
)
Could someone please guide me on how this can be executed from a Lambda function? First, Lambda does not recognize the sagemaker package; second, there is the challenge of placing the input file before the script executes and picking up the processed files afterwards.
I am trying to automate this operation with CodePipeline, but have had no success with that so far.
I am not sure how to get the input file from S3 into the folders used internally by the script, i.e. how the processing step picks up data from /opt/ml/processing/input.
If you want to kick off a Processing Job from Lambda, you can use boto3 to make the CreateProcessingJob API call:
https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker.html#SageMaker.Client.create_processing_job
I would suggest first creating the job as you have been doing with the SageMaker SDK. Once it has been created, you can describe it with the DescribeProcessingJob API call:
https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker.html#SageMaker.Client.describe_processing_job
You can then use the information from the DescribeProcessingJob output to fill out the CreateProcessingJob request in Lambda.
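For illustration, here is a minimal Lambda handler sketch of that CreateProcessingJob call, built from the values in the question; it is an assumption-laden outline rather than tested code, and the role ARN, job-name prefix, volume size and max runtime are placeholders you would replace with the values from your DescribeProcessingJob output:
import time
import boto3

sm_client = boto3.client("sagemaker")

def lambda_handler(event, context):
    # Processing job names must be unique, so add a timestamp suffix
    job_name = f"churn-processing-{int(time.time())}"  # hypothetical name prefix
    sm_client.create_processing_job(
        ProcessingJobName=job_name,
        RoleArn="arn:aws:iam::057036842446:role/YOUR_SAGEMAKER_EXECUTION_ROLE",  # assumption: your execution role ARN
        AppSpecification={
            "ImageUri": "057036842446.dkr.ecr.ap-south-1.amazonaws.com/aws-docker-repo:latest"
        },
        ProcessingResources={
            "ClusterConfig": {
                "InstanceCount": 1,
                "InstanceType": "ml.m5.xlarge",
                "VolumeSizeInGB": 30,  # assumption
            }
        },
        ProcessingInputs=[
            {
                "InputName": "input",
                "S3Input": {
                    "S3Uri": "s3://sagemaker-ap-south-1-057036842446/sagemaker/Data/Training/Churn_Modelling.csv",
                    "LocalPath": "/opt/ml/processing/input",
                    "S3DataType": "S3Prefix",
                    "S3InputMode": "File",
                },
            }
        ],
        ProcessingOutputConfig={
            "Outputs": [
                {
                    "OutputName": "output",
                    "S3Output": {
                        "S3Uri": "s3://sagemaker-ap-south-1-057036842446/sagemaker/Outputs/",
                        "LocalPath": "/opt/ml/processing/output",
                        "S3UploadMode": "EndOfJob",
                    },
                }
            ]
        },
        StoppingCondition={"MaxRuntimeInSeconds": 3600},  # assumption
    )
    return {"ProcessingJobName": job_name}
Since this uses only boto3, which is preinstalled in the Lambda Python runtime, it avoids bundling the sagemaker package altogether.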
I have ~50k compressed (gzip) JSON files daily that need to be uploaded to BigQuery with some transformation, with no API calls. Each file may be up to 1 GB in size.
What is the most cost-efficient way to do this?
I will appreciate any help.
The most efficient way is to use Cloud Data Fusion.
I would suggest the approach below:
Create a Cloud Function triggered on every new file upload to uncompress the file (a minimal sketch follows after the video link).
Create a Data Fusion pipeline with the GCS file as source and BigQuery as sink, with the desired transformation.
Refer to my YouTube video below:
https://youtu.be/89of33RcaRw
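For reference, here is a minimal sketch of what that uncompress step could look like; it is my own illustration, not the answerer's code, and it assumes the uncompressed files are written to a separate landing bucket (UNCOMPRESSED_BUCKET is a placeholder) that the Data Fusion pipeline reads from:
import gzip
from google.cloud import storage

storage_client = storage.Client()

def uncompress(event, context):
    # Triggered by a google.storage.object.finalize event on the upload bucket
    bucket_name = event["bucket"]
    file_name = event["name"]
    if not file_name.endswith(".gz"):
        return
    compressed = storage_client.bucket(bucket_name).blob(file_name).download_as_bytes()
    # Write the uncompressed JSON under the original name, dropping the .gz suffix
    target_bucket = storage_client.bucket("UNCOMPRESSED_BUCKET")  # assumption: separate bucket for Data Fusion to read
    target_bucket.blob(file_name[:-3]).upload_from_string(
        gzip.decompress(compressed), content_type="application/json"
    )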
Here is (for example) one way - https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-json...
... but quickly looking it over, you can see that there are some specific limitations. So perhaps simplicity, customization and maintainability of the solution should also be added to your "cost" function.
Not knowing some details (for example, read the "Limitations" section under my link above, what stack you have / are willing and able to use, the file names, whether your files have nested fields, etc.), my first thought is the Cloud Functions service (https://cloud.google.com/functions/pricing) "listening" (event type = Finalize/Create) to the Cloud Storage bucket where your files land (if you go this route, put your storage and function in the same region if possible, which will make it cheaper).
If you can code in Python, here is some starter code:
main.py
import io
import pandas as pd
from google.cloud import storage

def process(event, context):
    file = event
    # check that it is your file; you can also check for patterns in the name
    if file['name'] == 'YOUR_FILENAME':
        try:
            working_file = file['name']
            storage_client = storage.Client()
            bucket = storage_client.get_bucket('your_bucket_here')
            blob = bucket.blob(working_file)
            # https://stackoverflow.com/questions/49541026/how-do-i-unzip-a-zip-file-in-google-cloud-storage
            zipbytes = io.BytesIO(blob.download_as_string())
            # print for logging
            print(f"file downloaded, {working_file}")
            # read the file as a dataframe --- see the docs: https://pandas.pydata.org/docs/reference/api/pandas.read_json.html
            # if nested, you might need to go text --> dictionary and then do some preprocessing
            df = pd.read_json(zipbytes, compression='gzip')
            # write the processed data to BigQuery
            df.to_gbq(destination_table='your_dataset.your_table',
                      project_id='your_project_id',
                      if_exists='append')
            print(f"table bq created, {working_file}")
            # if you want to delete the processed file from storage to save on storage costs, uncomment the 2 lines below
            # blob.delete()
            # print(f"blob deleted, {working_file}")
        except Exception as e:
            print(f"exception occurred {e}, {working_file}")
requirements.txt
# Function dependencies, for example:
# package>=version
google-cloud-storage
google-cloud-bigquery
pandas
pandas-gbq
PS
Some alternatives include:
1. Starting up a VM, running your script on a schedule, and shutting the VM down once the process is done (for example Cloud Scheduler --> Pub/Sub --> Cloud Function --> which starts up your VM --> which then runs your script).
2. Using App Engine to run your script (similar).
3. Using Cloud Run to run your script (similar).
4. Using Composer/Airflow (not similar to 1, 2 & 3) [it could use all types of approaches including data transfers etc.; I'm just not sure what stack you are trying to use or what you already have running].
5. Scheduling a Vertex AI workbook (not similar to 1, 2 & 3; basically write up a Jupyter notebook and schedule it to run in Vertex AI).
6. Trying to query the files directly (https://cloud.google.com/bigquery/external-data-cloud-storage#bq_1) and scheduling that query (https://cloud.google.com/bigquery/docs/scheduling-queries) to run (but again, I'm not sure about your overall pipeline).
To me, the setup for all of these (except #5 & #6) is just technical debt that is not worth it if you can get away with functions.
Best of luck,
I would like to place a CSV file in an S3 bucket and automatically get predictions from a SageMaker model using a batch transform job. I would like to do that by using an S3 event notification (upon CSV upload) to trigger a Lambda function which would run the batch transform job. The Lambda function I have written so far is this:
import boto3

sagemaker = boto3.client('sagemaker')

input_data_path = 's3://yeex/upload/examples.csv'.format(default_bucket, 's3://yeex/upload/', 'examples.csv')
output_data_path = 's3://nooz/download/'.format(default_bucket, 's3://nooz/download')

transform_job = sagemaker.transformer.Transformer(
    model_name = y_xgboost_21,
    instance_count = 1,
    instance_type = 'ml.m5.large',
    strategy = 'SingleRecord',
    assemble_with = 'Line',
    output_path = output_data_path,
    base_transform_job_name='y-test-batch',
    sagemaker_session=sagemaker.Session(),
    accept = 'text/csv')

transform_job.transform(data = input_data_path,
                        content_type = 'text/csv',
                        split_type = 'Line')
The error it returns is that the sagemaker object does not have the attribute transformer.
What syntax should I use in the Lambda function?
While Boto3 (boto3.client("sagemaker")) is the general-purpose AWS SDK for Python across different services, the examples you might see referencing classes like Estimator, Transformer, Predictor, etc. are referring to the SageMaker Python SDK (import sagemaker).
In general I'd say (almost?) anything that can be done in one can also be done in the other, as they use the same underlying service APIs - but the purpose of the SM Python SDK is to provide higher-level abstractions and useful utilities: for example, transparently zipping and uploading a source_dir to S3 to deliver "script mode" training.
As far as I'm aware, the SageMaker Python SDK is still not pre-installed in AWS Lambda Python runtimes by default, but it is an open-source and pip-installable package.
So you have 2 choices here:
Continue using boto3 and create your transform job via the low-level create_transform_job API (see the sketch after this list).
Install sagemaker in your Python Lambda bundle (tools like AWS SAM or CDK might make this process easier) and instead import sagemaker, so you can use Transformer and the other high-level Python APIs.
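For illustration, a minimal sketch of option 1 using the S3 paths and settings from your code; this is an outline under assumptions rather than tested code, and the model name is a placeholder for the name of your deployed SageMaker model:
import time
import boto3

sm_client = boto3.client("sagemaker")

def lambda_handler(event, context):
    # Transform job names must be unique, so add a timestamp suffix
    job_name = f"y-test-batch-{int(time.time())}"
    sm_client.create_transform_job(
        TransformJobName=job_name,
        ModelName="y-xgboost-21",  # assumption: replace with your model name
        BatchStrategy="SingleRecord",
        TransformInput={
            "DataSource": {
                "S3DataSource": {
                    "S3DataType": "S3Prefix",
                    "S3Uri": "s3://yeex/upload/examples.csv",
                }
            },
            "ContentType": "text/csv",
            "SplitType": "Line",
        },
        TransformOutput={
            "S3OutputPath": "s3://nooz/download/",
            "AssembleWith": "Line",
            "Accept": "text/csv",
        },
        TransformResources={
            "InstanceType": "ml.m5.large",
            "InstanceCount": 1,
        },
    )
    return {"TransformJobName": job_name}
Because this only needs boto3, which is already available in the Lambda runtime, there is nothing extra to package.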
I have a SageMaker model trained and deployed, and I am looking to run a batch transform on multiple files.
I have a lambda function configured to run when a new file is uploaded to S3.
Currently, I have only seen ways to use the InvokeEndpoint call from Lambda,
i.e.
runtime = boto3.client('runtime.sagemaker')
message = ["On Wed Sep PDT Brent Welch said Hacksaw said W"]
response = runtime.invoke_endpoint(EndpointName="sagemaker-scikit-learn-2020-02-05-13-44-45-011",
                                   ContentType='text/plain',
                                   Body=message)
json.loads(response['Body'].read())
However, I have multiple files that need to be processed by the SageMaker model.
I have code to create a transformer and run a batch transform on multiple files:
import sagemaker
training_job = sagemaker.estimator.Estimator.attach("{model_name")
transformer = training_job.transformer(instance_count=1,instance_type='ml.m4.xlarge',strategy='MultiRecord',assemble_with='Line')
#batch_input_s3
transformer.transform(batch_input_s3)
print('Waiting for transform job: ' + transformer.latest_transform_job.job_name)
transformer.wait()
However, I'm unable to use this transformer code in a Lambda, since it requires the sagemaker library, which pushes the deployment package over the 50 MB zip file size limit.
I am using AWS CodePipeline to perform CloudFormation deployments. My source code is committed to a GitHub repository. Whenever a commit happens in my GitHub repository, AWS CodePipeline starts its execution and performs the CloudFormation deployment. These functionalities are working fine.
My project has multiple modules. So even if a user modifies only one module, the Lambdas of every module are updated. Is there any way to restrict this using AWS CodePipeline?
My Code Pipeline has 3 stages.
Source
Build
Deploy
We had a similar issue and eventually came to the conclusion that this is not exactly possible. So unless you separate your modules into different repos and make a separate pipeline for each of them, it is always going to execute everything.
The good thing is that each execution of the pipeline does not entirely redeploy everything when the CloudFormation is executed. In the deploy stage you can add a Create Changeset action, which detects what has changed since the previous CloudFormation deployment, redeploys only those parts, and does not touch anything else.
This is the exact issue we faced recently, and while I see comments mentioning that it isn't possible to achieve with a single repository, I have found a workaround!
Generally, the pipeline is triggered by a CloudWatch event listening to the GitHub/CodeCommit repository. Rather than triggering the pipeline, I made the CloudWatch event trigger a Lambda function. In the Lambda, we can write the logic to execute the pipeline(s) only for the module that has changes. This works really nicely and provides a lot of control over the pipeline execution. This way multiple pipelines can be created from a single repository, solving the problem mentioned in the question.
The Lambda logic can be something like:
import boto3

# Map config files to pipelines
project_pipeline_mapping = {
    "CodeQuality_ScoreCard": "test-pipeline-code-quality",
    "ProductQuality_ScoreCard": "test-product-quality-pipeline"
}

files_to_ignore = ["readme.md"]

codecommit_client = boto3.client('codecommit')
codepipeline_client = boto3.client('codepipeline')

def lambda_handler(event, context):
    projects_changed = []

    # Extract commits
    print("\n EVENT::: ", event)
    old_commit_id = event["detail"]["oldCommitId"]
    new_commit_id = event["detail"]["commitId"]

    # Get commit differences
    codecommit_response = codecommit_client.get_differences(
        repositoryName="ScorecardAPI",
        beforeCommitSpecifier=str(old_commit_id),
        afterCommitSpecifier=str(new_commit_id)
    )
    print("\n Code commit response: ", codecommit_response)

    # Search commit differences for files that trigger executions
    for difference in codecommit_response["differences"]:
        file_name = difference["afterBlob"]["path"]
        project_name = file_name.split('/')[0]
        print("\nChanged project: ", project_name)
        # If project corresponds to pipeline, add it to the pipeline array
        if project_name in project_pipeline_mapping:
            projects_changed.append(project_name)

    projects_changed = list(dict.fromkeys(projects_changed))
    print("pipeline(s) to be executed: ", projects_changed)

    for project in projects_changed:
        codepipeline_response = codepipeline_client.start_pipeline_execution(
            name=project_pipeline_mapping[project]
        )
Check the AWS blog post on this topic: Customizing triggers for AWS CodePipeline with AWS Lambda and Amazon CloudWatch Events.
Why not model this as a pipeline per module?
I need to automate a process to extract data from Google BigQuery and export it to a CSV on an external server outside of GCP.
While researching how to do that, I found some commands I could run from my external server, but I would prefer to do everything within GCP to avoid possible problems.
To export the data to a CSV in Google Cloud Storage:
bq --location=US extract --compression GZIP 'dataset.table' gs://example-bucket/myfile.csv
To download the CSV from Google Cloud Storage:
gsutil cp gs://[BUCKET_NAME]/[OBJECT_NAME] [OBJECT_DESTINATION]
But I would like to hear your suggestions.
If you want to fully automate this process, I would do the following:
Create a Cloud Function to handle the export:
This is the more lightweight solution, as Cloud Functions are serverless and provide flexibility to implement code with the client libraries. See the quickstart; I recommend you use the console to create the function to start with.
In this example I recommend triggering the Cloud Function from an HTTP request, i.e. when the function's URL is called, it will run the code inside it.
Here is an example Cloud Function in Python that creates the export when an HTTP request is made:
main.py
from google.cloud import bigquery

def hello_world(request):
    project_name = "MY_PROJECT"
    bucket_name = "MY_BUCKET"
    dataset_name = "MY_DATASET"
    table_name = "MY_TABLE"
    destination_uri = "gs://{}/{}".format(bucket_name, "bq_export.csv.gz")

    bq_client = bigquery.Client(project=project_name)

    dataset = bq_client.dataset(dataset_name, project=project_name)
    table_to_export = dataset.table(table_name)

    job_config = bigquery.job.ExtractJobConfig()
    job_config.compression = bigquery.Compression.GZIP

    extract_job = bq_client.extract_table(
        table_to_export,
        destination_uri,
        # Location must match that of the source table.
        location="US",
        job_config=job_config,
    )

    return "Job with ID {} started exporting data from {}.{} to {}".format(
        extract_job.job_id, dataset_name, table_name, destination_uri)
requirements.txt
google-cloud-bigquery
Note that the job will run asynchronously in the background; you will receive a response with the job ID, which you can use to check the state of the export job in the Cloud Shell by running:
bq show -j <job_id>
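Alternatively (my own aside, not part of the original steps), you can poll the same job from Python with the BigQuery client library; the project and location values here are assumptions matching the function above:
from google.cloud import bigquery

bq_client = bigquery.Client(project="MY_PROJECT")
job = bq_client.get_job("<job_id>", location="US")  # job ID returned by the Cloud Function
print(job.state)  # PENDING, RUNNING or DONE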
Create a Cloud Scheduler scheduled job:
Follow this documentation to get started. You can set the Frequency with the standard cron format; for example, 0 0 * * * will run the job every day at midnight.
As the target, choose HTTP, put the Cloud Function's HTTP URL in the URL field (you can find it in the console, inside the Cloud Function details, under the Trigger tab), and choose GET as the HTTP method.
Create it, and you can test it in Cloud Scheduler by pressing the Run now button in the Console.
Synchronize your external server and the bucket:
Up until now you only have exports scheduled to run every 24 hours; to synchronize the bucket contents with your external server, you can use the gsutil rsync command. If you want to save the exports, let's say to the my_exports folder, you can run the following on your external server:
gsutil rsync gs://BUCKET_WITH_EXPORTS /local-path-to/my_exports
To run this command periodically on your server, you could create a standard cron job in the crontab of your external server, also running daily, just a few hours after the BigQuery export to ensure that the export has already been made.
Extra:
I have hard-coded most of the variables in the Cloud Function to always be the same. However, you can send parameters to the function if you make a POST request instead of a GET request, and send the parameters as data in the body.
You will have to change the Cloud Scheduler job to send a POST request to the Cloud Function's HTTP URL, and in the same place you can set the body to send the parameters for the table, dataset and bucket, for example. This will allow you to run exports from different tables at different hours, and to different buckets.
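As an illustration of that idea (my own sketch, not part of the original answer), the function above could read those values from the POST body like this; the parameter names "bucket", "dataset" and "table" are assumptions that must match whatever body you configure in Cloud Scheduler:
from google.cloud import bigquery

def hello_world(request):
    # Fall back to the hard-coded defaults if a parameter is missing
    params = request.get_json(silent=True) or {}
    project_name = "MY_PROJECT"
    bucket_name = params.get("bucket", "MY_BUCKET")
    dataset_name = params.get("dataset", "MY_DATASET")
    table_name = params.get("table", "MY_TABLE")
    destination_uri = "gs://{}/{}".format(bucket_name, "bq_export.csv.gz")

    bq_client = bigquery.Client(project=project_name)
    table_to_export = bq_client.dataset(dataset_name, project=project_name).table(table_name)

    job_config = bigquery.job.ExtractJobConfig()
    job_config.compression = bigquery.Compression.GZIP

    extract_job = bq_client.extract_table(
        table_to_export,
        destination_uri,
        location="US",  # must match the location of the source table
        job_config=job_config,
    )
    return "Job with ID {} started exporting data from {}.{} to {}".format(
        extract_job.job_id, dataset_name, table_name, destination_uri)
The Cloud Scheduler job body could then be, for example, {"dataset": "MY_DATASET", "table": "MY_TABLE", "bucket": "MY_BUCKET"} with Content-Type application/json.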