Currently I am using the Serverless Framework for deploying my applications on AWS.
https://serverless.com/
Using the serverless.yml file, we create the DynamoDB tables which are required for the application. These tables are accessed from the lambda functions.
When the application is deployed, I want a few of these tables to be loaded with an initial set of data.
Is this possible?
Can you provide me some pointers for inserting this initial data?
Is this possible with AWS SAM?
I don't know if there's a specific way to do this in Serverless; however, you can just add a call to the AWS CLI like this to your build pipeline:
aws dynamodb batch-write-item --request-items file://initialdata.json
Where initialdata.json looks something like this:
{
"Forum": [
{
"PutRequest": {
"Item": {
"Name": {"S":"Amazon DynamoDB"},
"Category": {"S":"Amazon Web Services"},
"Threads": {"N":"2"},
"Messages": {"N":"4"},
"Views": {"N":"1000"}
}
}
},
{
"PutRequest": {
"Item": {
"Name": {"S":"Amazon S3"},
"Category": {"S":"Amazon Web Services"}
}
}
}
]
}
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/SampleData.LoadData.html
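If the build pipeline already runs Python, the same load could probably be done with boto3 instead of the CLI. The following is only a sketch under stated assumptions: the helper names chunk_requests and load_initial_data are invented for illustration, the file is assumed to have the same shape as initialdata.json above, and the splitting relies on BatchWriteItem accepting at most 25 put/delete requests per call.

```python
import json
import time
from typing import Any, Dict, Iterator, List

MAX_BATCH = 25  # BatchWriteItem accepts at most 25 put/delete requests per call


def chunk_requests(
    request_items: Dict[str, List[Any]], size: int = MAX_BATCH
) -> Iterator[Dict[str, List[Any]]]:
    """Yield request-items dicts holding at most `size` requests for one table."""
    for table, requests in request_items.items():
        for start in range(0, len(requests), size):
            yield {table: requests[start:start + size]}


def load_initial_data(path: str = "initialdata.json") -> None:
    """Send the file's contents to DynamoDB (hypothetical helper, needs AWS credentials)."""
    import boto3  # imported lazily so chunk_requests works without boto3 installed

    client = boto3.client("dynamodb")
    with open(path) as handle:
        request_items = json.load(handle)
    for batch in chunk_requests(request_items):
        response = client.batch_write_item(RequestItems=batch)
        unprocessed = response.get("UnprocessedItems", {})
        # Resend whatever DynamoDB reports as unprocessed; a real pipeline
        # would likely want exponential backoff and a retry limit here.
        while unprocessed:
            time.sleep(1)
            response = client.batch_write_item(RequestItems=unprocessed)
            unprocessed = response.get("UnprocessedItems", {})
```

Calling load_initial_data() after the deploy step would then play the same role as the CLI command above.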
A more Serverless Framework-native option is to use a plugin such as serverless-plugin-scripts, which lets you hook your own CLI commands into the deploy process:
https://github.com/mvila/serverless-plugin-scripts
Sagemaker pipelines are rather unclear to me, I'm not experienced in the field of ML but I'm working on figuring out the pipeline definitions.
I have a few questions:
Is sagemaker pipelines a stand-alone service/feature? Because I don't see any option to create them through the console, though I do see CloudFormation and CDK resources.
Is a sagemaker pipeline essentially codepipeline? How do these integrate, how do these differ?
There's also a Python SDK, how does this differ from the CDK and CloudFormation?
I can't seem to find any examples besides the Python SDK usage, how come?
The docs and workshops seem to properly describe only the Python SDK usage; it would be really helpful if someone could clear this up for me!
SageMaker has two things called Pipelines: Model Building Pipelines and Serial Inference Pipelines. I believe you're referring to the former
A model building pipeline defines steps in a machine learning workflow, such as pre-processing, hyperparameter tuning, batch transformations, and setting up endpoints
A serial inference pipeline is two or more SageMaker models run one after the other
A model building pipeline is defined in JSON, and is hosted/run in some sort of proprietary, serverless fashion by SageMaker
Is sagemaker pipelines a stand-alone service/feature? Because I don't see any option to create them through the console, though I do see CloudFormation and CDK resources.
You can create/modify them using the API, which can also be called via the CLI, Python SDK, or CloudFormation. These all use the AWS API under the hood
You can start/stop/view them in SageMaker Studio:
Left-side Navigation bar > SageMaker resources > Drop-down menu > Pipelines
Is a sagemaker pipeline essentially codepipeline? How do these integrate, how do these differ?
Unlikely. CodePipeline is more for building and deploying code, not specific to SageMaker. There is no direct integration as far as I can tell, other than that you can start a SM pipeline with CP
There's also a Python SDK, how does this differ from the CDK and CloudFormation?
The Python SDK is a stand-alone library to interact with SageMaker in a developer-friendly fashion. It's more dynamic than CloudFormation: it lets you build pipelines using code, whereas CloudFormation takes a static JSON string
A very simple example of Python SageMaker SDK usage:
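from sagemaker.sklearn.processing import SKLearnProcessor
from sagemaker.workflow.pipeline import Pipeline
from sagemaker.workflow.steps import ProcessingStep

# Processor that runs a script in a managed scikit-learn container
processor = SKLearnProcessor(
    framework_version="0.23-1",
    instance_count=1,
    instance_type="ml.m5.large",
    role="role-arn",
)
# Pipeline step that runs preprocessor.py with that processor
processing_step = ProcessingStep(
    name="processing",
    processor=processor,
    code="preprocessor.py",
)
pipeline = Pipeline(name="foo", steps=[processing_step])
pipeline.upsert(role_arn=...)  # create or update the pipeline in SageMaker
pipeline.start()  # start an execution of the pipeline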
pipeline.definition() produces rather verbose JSON like this:
{
"Version": "2020-12-01",
"Metadata": {},
"Parameters": [],
"PipelineExperimentConfig": {
"ExperimentName": {
"Get": "Execution.PipelineName"
},
"TrialName": {
"Get": "Execution.PipelineExecutionId"
}
},
"Steps": [
{
"Name": "processing",
"Type": "Processing",
"Arguments": {
"ProcessingResources": {
"ClusterConfig": {
"InstanceType": "ml.m5.large",
"InstanceCount": 1,
"VolumeSizeInGB": 30
}
},
"AppSpecification": {
"ImageUri": "246618743249.dkr.ecr.us-west-2.amazonaws.com/sagemaker-scikit-learn:0.23-1-cpu-py3",
"ContainerEntrypoint": [
"python3",
"/opt/ml/processing/input/code/preprocessor.py"
]
},
"RoleArn": "arn:aws:iam::123456789012:role/foo",
"ProcessingInputs": [
{
"InputName": "code",
"AppManaged": false,
"S3Input": {
"S3Uri": "s3://bucket/preprocessor.py",
"LocalPath": "/opt/ml/processing/input/code",
"S3DataType": "S3Prefix",
"S3InputMode": "File",
"S3DataDistributionType": "FullyReplicated",
"S3CompressionType": "None"
}
}
]
}
}
]
}
You could use the above JSON with CloudFormation/CDK, but you build the JSON with the SageMaker SDK
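To illustrate the point that everything goes through the AWS API, here is a rough sketch of registering such a JSON definition with boto3 directly, without the SageMaker SDK. The helper names build_create_pipeline_request and create_and_start are made up for this example, and the definition is assumed to already be a Python dict with the structure shown above.

```python
import json
from typing import Any, Dict


def build_create_pipeline_request(
    name: str, definition: Dict[str, Any], role_arn: str
) -> Dict[str, str]:
    """Return keyword arguments for the SageMaker CreatePipeline API call."""
    return {
        "PipelineName": name,
        # The API expects the definition as a JSON string, not a dict.
        "PipelineDefinition": json.dumps(definition),
        "RoleArn": role_arn,
    }


def create_and_start(request: Dict[str, str]) -> str:
    """Create the pipeline and start one execution (needs AWS credentials)."""
    import boto3  # imported lazily so the request builder works without boto3

    client = boto3.client("sagemaker")
    client.create_pipeline(**request)
    execution = client.start_pipeline_execution(
        PipelineName=request["PipelineName"]
    )
    return execution["PipelineExecutionArn"]
```

The same JSON string is what a CloudFormation or CDK pipeline resource would carry as its definition body.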
You can also define model building workflows using Step Function State Machines, using the Data Science SDK, or Airflow
I want to download the public arn for a more compact version of spacy from this GitHub repository.
"arn:aws:lambda:us-west-2:113088814899:layer:Klayers-python37-spacy:27"
How can I achieve this?
You can get it from an ARN using the get-layer-version-by-arn command in the CLI.
You can run the below command to get the source of the Lambda layer you requested.
aws lambda get-layer-version-by-arn \
--arn "arn:aws:lambda:us-west-2:113088814899:layer:Klayers-python37-spacy:27"
An example of the response you will receive is below
{
"LayerVersionArn": "arn:aws:lambda:us-west-2:123456789012:layer:AWSLambda-Python37-SciPy1x:2",
"Description": "AWS Lambda SciPy layer for Python 3.7 (scipy-1.1.0, numpy-1.15.4) https://github.com/scipy/scipy/releases/tag/v1.1.0 https://github.com/numpy/numpy/releases/tag/v1.15.4",
"CreatedDate": "2018-11-12T10:09:38.398+0000",
"LayerArn": "arn:aws:lambda:us-west-2:123456789012:layer:AWSLambda-Python37-SciPy1x",
"Content": {
"CodeSize": 41784542,
"CodeSha256": "GGmv8ocUw4cly0T8HL0Vx/f5V4RmSCGNjDIslY4VskM=",
"Location": "https://awslambda-us-west-2-layers.s3.us-west-2.amazonaws.com/snapshots/123456789012/..."
},
"Version": 2,
"CompatibleRuntimes": [
"python3.7"
],
"LicenseInfo": "SciPy: https://github.com/scipy/scipy/blob/master/LICENSE.txt, NumPy: https://github.com/numpy/numpy/blob/master/LICENSE.txt"
}
Once you run this you will get a response returned with a key of "Content", containing a subkey of "Location" which references the S3 path to download the layer contents.
You can download the layer from this path. After removing any dependencies you don't need, you will have to publish the result as a Lambda layer again.
Please ensure in this process that you only remove unnecessary dependencies.
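The lookup and download could also be scripted. This is a minimal sketch assuming boto3 is installed and credentials are configured; the function names and the default file name layer.zip are placeholders chosen for this example, and the region is simply read from the fourth field of the ARN.

```python
import urllib.request
from typing import Any, Dict


def region_from_arn(arn: str) -> str:
    """Return the region field of an ARN (arn:partition:service:region:...)."""
    return arn.split(":")[3]


def layer_download_url(response: Dict[str, Any]) -> str:
    """Pull the pre-signed download URL out of a get-layer-version-by-arn response."""
    return response["Content"]["Location"]


def download_layer(arn: str, destination: str = "layer.zip") -> str:
    """Download the layer archive for `arn` to `destination` (needs AWS credentials)."""
    import boto3  # imported lazily so the helpers above work without boto3

    client = boto3.client("lambda", region_name=region_from_arn(arn))
    response = client.get_layer_version_by_arn(Arn=arn)
    urllib.request.urlretrieve(layer_download_url(response), destination)
    return destination
```

The downloaded file is a zip archive, so it can be unpacked, trimmed, and re-zipped before being published as a new layer.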
This question concerns Amazon Connect, the cloud-based call center. For those who have been involved in the setup and configuration of Amazon Connect: is there any portion of it that is configurable through a continuous integration flow, other than any possible Lambda touchpoints? What I am looking for is a way to script various functions, such as loading exported flows.
Looking at the AWS CLI (https://docs.aws.amazon.com/cli/latest/reference/connect/index.html), I see a number of Amazon Connect calls, but the majority only retrieve information and very few are available to configure portions of Amazon Connect.
There is basically nothing at this time. All contact flows must be imported/exported by hand. Other settings (e.g. routing profiles, prompts, etc.) must be re-created manually.
Someone has created a "beta" Connect CloudFormation template though that actually uses puppeteer behind the scenes to automate the import/export process. I imagine that Amazon will eventually support this, because devops is certainly one of the rough edges of the platform right now.
For anyone new checking this question: Amazon has recently published the APIs you are looking for, see create-contact-flow.
It uses a JSON-based language specific to Amazon Connect, below is an example:
{
"Version": "2019-10-30",
"StartAction": "12345678-1234-1234-1234-123456789012",
"Metadata": {
"EntryPointPosition": {"X": 88,"Y": 100},
"ActionMetadata": {
"12345678-1234-1234-1234-123456789012": {
"Position": {"X": 270, "Y": 98}
},
"abcdef-abcd-abcd-abcd-abcdefghijkl": {
"Position": {"X": 545, "Y": 92}
}
}
},
"Actions": [
{
"Identifier": "12345678-1234-1234-1234-123456789012",
"Type": "MessageParticipant",
"Transitions": {
"NextAction": "abcdef-abcd-abcd-abcd-abcdefghijkl",
"Errors": [],
"Conditions": []
},
"Parameters": {
"Prompt": {
"Text": "Thanks for calling the sample flow!",
"TextType": "text",
"PromptId": null
}
}
},
{
"Identifier": "abcdef-abcd-abcd-abcd-abcdefghijkl",
"Type": "DisconnectParticipant",
"Transitions": {},
"Parameters": {}
}
]
}
Exporting from the GUI does not produce a JSON in this format. Obviously, a problem with this is keeping a state. I am keeping a close eye on Terraform/CloudFormation/CDK and will update this post if there is any support (that does not use puppeteer).
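As a rough illustration of how this could be scripted, the sketch below checks that every action referenced in a flow document is actually defined and then submits it with boto3. The helper names are invented for this example, and the shape of the Errors and Conditions entries (each optionally carrying a NextAction) is an assumption based on the flow format shown above.

```python
import json
from typing import Any, Dict, List


def dangling_references(flow: Dict[str, Any]) -> List[str]:
    """Return action identifiers that are referenced but not defined in the flow."""
    defined = {action["Identifier"] for action in flow["Actions"]}
    referenced = [flow["StartAction"]]
    for action in flow["Actions"]:
        transitions = action.get("Transitions", {})
        if "NextAction" in transitions:
            referenced.append(transitions["NextAction"])
        # Error and condition branches may each point at a next action too.
        branches = transitions.get("Errors", []) + transitions.get("Conditions", [])
        for branch in branches:
            if "NextAction" in branch:
                referenced.append(branch["NextAction"])
    return [ref for ref in referenced if ref not in defined]


def create_flow(instance_id: str, name: str, flow: Dict[str, Any]) -> str:
    """Validate the flow locally, then create it in Amazon Connect (needs credentials)."""
    import boto3  # imported lazily so the validation works without boto3

    missing = dangling_references(flow)
    if missing:
        raise ValueError(f"flow references undefined actions: {missing}")
    client = boto3.client("connect")
    response = client.create_contact_flow(
        InstanceId=instance_id,
        Name=name,
        Type="CONTACT_FLOW",
        Content=json.dumps(flow),  # the API takes the flow as a JSON string
    )
    return response["ContactFlowId"]
```

Validating before the call gives an earlier and clearer failure in a CI run than waiting for the service to reject the document.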
I think it's doable now; with the newest APIs, you can do many things to script the entire process. There are some issues with the contact flows themselves, but I think this will improve over the next few months.
In the meantime, there is some effort to add Amazon Connect to Terraform. Here are the issues and the WIP PRs:
Github Issue
Service PR
Resource PR
I have an issue trying to use API Gateway as a proxy to DynamoDB.
Basically it works great if I know the structure of the data I want to store but I cannot manage to make it dynamic regardless of the payload structure.
There are many websites explaining how to use API Gateway as a proxy to DynamoDB.
None that I found explains how to store a JSON object though.
Basically I send this JSON to my API endpoint:
{
"entryId":"abc",
"data":{
"key1":"123",
"key2":123
}
}
If I map using the following template, the data gets put in my database properly
{
"TableName": "Events",
"Item": {
"entryId": {
"S": "abc"
},
"data": {
"M": {
"key1": {
"S": "123"
},
"key2": {
"N": "123"
}
}
}
}
}
However, I don't know the structure of "data" hence why I want the mapping to be dynamic, or even better, I would like to avoid any mapping at all.
I managed to make it dynamic but all my entries are of type String now:
"data": { "M" : {
#foreach($key in $input.path('$.data').keySet())
"$key" : {"S": "$input.path('$.data').get($key)"}#if($foreach.hasNext),#end
#end }
}
Is it possible to get the type dynamically?
I am not quite sure how API Gateway mapping works yet.
Thank you for your help.
Seb
You aren't going to avoid some sort of mapping when inserting into Dynamodb. I would recommend using a Lambda function instead of a service proxy to give you more control and flexibility in mapping the data to your Dynamodb schema.
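Inside a Lambda function, the types can be derived from the incoming JSON values themselves. The following is a minimal sketch rather than a complete handler: the function names are made up for this example, and only the JSON-compatible Python types are covered. boto3 also ships a TypeSerializer in boto3.dynamodb.types that serves a similar purpose.

```python
from typing import Any, Dict


def to_attribute_value(value: Any) -> Dict[str, Any]:
    """Convert a JSON-compatible Python value to a DynamoDB attribute value."""
    if value is None:
        return {"NULL": True}
    if isinstance(value, bool):  # must come before int: bool is a subclass of int
        return {"BOOL": value}
    if isinstance(value, (int, float)):
        return {"N": str(value)}  # DynamoDB transmits numbers as strings
    if isinstance(value, str):
        return {"S": value}
    if isinstance(value, list):
        return {"L": [to_attribute_value(item) for item in value]}
    if isinstance(value, dict):
        return {"M": {str(k): to_attribute_value(v) for k, v in value.items()}}
    raise TypeError(f"unsupported type: {type(value).__name__}")


def build_put_item(table: str, payload: Dict[str, Any]) -> Dict[str, Any]:
    """Build PutItem parameters for an arbitrary JSON payload."""
    return {
        "TableName": table,
        "Item": {key: to_attribute_value(val) for key, val in payload.items()},
    }
```

For the payload in the question, build_put_item("Events", payload) yields the same structure as the hand-written mapping template, with key2 typed as a number; the result could be passed to a boto3 DynamoDB client as put_item(**params).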
You can enable CloudWatch logs to verify that the payload after transformation is what you expect. You can also use the test invoke feature of the AWS API Gateway console to find out how your mapping works.
Here is the blog for using Amazon API Gateway as a proxy for DynamoDB. https://aws.amazon.com/blogs/compute/using-amazon-api-gateway-as-a-proxy-for-dynamodb/
I have just created an account on Amazon AWS and I am going to use DATAPIPELINE to schedule my queries. Is it possible to run multiple complex SQL queries from .sql file using SQLACTIVITY of data pipeline?
My overall objective is to process the raw data from REDSHIFT/s3 using sql queries from data pipeline and save it to s3. Is it the feasible way to go?
Any help in this regard will be appreciated.
Yes, if you plan on moving the data from Redshift to S3, you need to do an UNLOAD command found here: http://docs.aws.amazon.com/redshift/latest/dg/r_UNLOAD.html
The input to your SQL query will be a single data node and the output will be a single data file. Data Pipeline provides only one "Select query" field, in which you write your extraction/transformation query. I don't think there is a use case for a file of multiple queries.
However, if you want to make your pipeline configurable, you can do so by adding "parameters" and "values" objects to your pipeline definition JSON:
{
"objects":[
{
"selectQuery":"#{myRdsSelectQuery}"
}
],
"parameters":[
{
"description":"myRdsSelectQuery",
"id":"myRdsSelectQuery",
"type":"String"
}
],
"values":{
"myRdsSelectQuery":"Select Query"
}
}
If you want to execute and schedule multiple SQL scripts, you can do so with ShellCommandActivity.
I managed to execute a script with multiple insert statements with following AWS datapipeline configuration:
{
"id": "ExecuteSqlScript",
"name": "ExecuteSqlScript",
"type": "SqlActivity",
"scriptUri": "s3://mybucket/inserts.sql",
"database": { "ref": "rds_mysql" },
"runsOn": { "ref": "Ec2Instance" }
}, {
"id": "rds_mysql",
"name": "rds_mysql",
"type": "JdbcDatabase",
"username": "#{myUsername}",
"*password": "#{*myPassword}",
"connectionString" : "#{myConnStr}",
"jdbcDriverClass": "com.mysql.jdbc.Driver",
"jdbcProperties": ["allowMultiQueries=true","zeroDateTimeBehavior=convertToNull"]
},
It is important to allow the MySQL driver to execute multiple queries by setting allowMultiQueries=true; the S3 path of the script is provided by scriptUri.