I'm trying to set up a deployment pipeline using CodeCommit, ECR and ECS. My pipeline passes the source and build steps fine. I can deploy manually via CodeDeploy if I upload my appspec.yaml file to an s3 bucket. Deploys triggered by a change to my CodeCommit repository always fail with the error:
An AppSpec file is required, but could not be found in the revision
When I look at the details of the failed deployment, I can pull up the revision location, which shows this:
The CodeDeploy troubleshooting section notes that some text editors can cause issues with the AppSpec file. I'm using VS Code on Linux, so I don't think that applies here. Also, if I upload the same appspec file to S3 and reference it from a manual deployment, it works fine.
I've also tried uploading the same file named appspec.yml instead. It still failed.
The role this deployment uses has full S3 access, so I'm not sure whether some other permission could be the problem.
Here is my codepipeline definition:
{
"pipeline": {
"roleArn": "arn:aws:iam::690517313378:role/service-role/AWSCodePipelineServiceRole-us-east-1-blottermappertf",
"stages": [
{
"name": "Source",
"actions": [
{
"inputArtifacts": [],
"name": "Source",
"region": "us-east-1",
"actionTypeId": {
"category": "Source",
"owner": "AWS",
"version": "1",
"provider": "CodeCommit"
},
"outputArtifacts": [
{
"name": "SourceArtifact"
}
],
"configuration": {
"PollForSourceChanges": "false",
"BranchName": "master",
"RepositoryName": "blottermapper"
},
"runOrder": 1
}
]
},
{
"name": "Build",
"actions": [
{
"inputArtifacts": [
{
"name": "SourceArtifact"
}
],
"name": "Build",
"region": "us-east-1",
"actionTypeId": {
"category": "Build",
"owner": "AWS",
"version": "1",
"provider": "CodeBuild"
},
"outputArtifacts": [
{
"name": "BuildArtifact"
}
],
"configuration": {
"ProjectName": "blottermapper",
"EnvironmentVariables": "[{\"name\":\"REPOSITORY_URI\",\"value\":\"690517313378.dkr.ecr.us-east-1.amazonaws.com/net.threeninetyfive\",\"type\":\"PLAINTEXT\"}]"
},
"runOrder": 1
}
]
},
{
"name": "Deploy",
"actions": [
{
"inputArtifacts": [
{
"name": "BuildArtifact"
}
],
"name": "Deploy",
"region": "us-east-1",
"actionTypeId": {
"category": "Deploy",
"owner": "AWS",
"version": "1",
"provider": "CodeDeploy"
},
"outputArtifacts": [],
"configuration": {
"ApplicationName": "blottermappertf",
"DeploymentGroupName": "blottermappertf"
},
"runOrder": 1
}
]
}
],
"artifactStore": {
"type": "S3",
"location": "codepipeline-us-east-1-634554346591"
},
"name": "blottermappertf",
"version": 1
},
"metadata": {
"pipelineArn": "arn:aws:codepipeline:us-east-1:690517313378:blottermappertf",
"updated": 1573712712.49,
"created": 1573712712.49
}
}
"An AppSpec file is required, but could not be found in the revision"
This error comes from a misconfigured deploy stage in your pipeline. To run ECS deployments through CodeDeploy, the action provider in the deploy stage must be "ECS (blue/green)", not "CodeDeploy" (the plain CodeDeploy provider is used for EC2 deployments).
Even though CodeDeploy does the work on the back end, the provider is named "ECS (blue/green)".
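To make the change concrete, here is a minimal Python sketch that patches a pipeline definition shaped like the output of get-pipeline so that its deploy action uses the blue/green provider. In the pipeline JSON that provider is spelled "CodeDeployToECS". The file names taskdef.json and appspec.yaml, and the idea that both live in the build artifact, are assumptions for illustration; the function name is hypothetical.

```python
import copy


def use_ecs_blue_green(definition, stage_name="Deploy"):
    """Return a copy of the definition whose CodeDeploy action in the
    given stage is switched to the ECS blue/green provider."""
    patched = copy.deepcopy(definition)
    for stage in patched["pipeline"]["stages"]:
        if stage["name"] != stage_name:
            continue
        for action in stage["actions"]:
            type_id = action["actionTypeId"]
            if type_id["provider"] != "CodeDeploy":
                continue
            # API name of the console's "ECS (blue/green)" provider.
            type_id["provider"] = "CodeDeployToECS"
            artifact = action["inputArtifacts"][0]["name"]
            action["configuration"].update({
                # Assumed file names; the build must emit both files.
                "TaskDefinitionTemplateArtifact": artifact,
                "TaskDefinitionTemplatePath": "taskdef.json",
                "AppSpecTemplateArtifact": artifact,
                "AppSpecTemplatePath": "appspec.yaml",
            })
    return patched


# Trimmed-down stand-in for the deploy stage from the question.
example = {"pipeline": {"stages": [{"name": "Deploy", "actions": [{
    "name": "Deploy",
    "inputArtifacts": [{"name": "BuildArtifact"}],
    "actionTypeId": {"category": "Deploy", "owner": "AWS",
                     "version": "1", "provider": "CodeDeploy"},
    "configuration": {"ApplicationName": "blottermappertf",
                      "DeploymentGroupName": "blottermappertf"},
}]}]}}

patched = use_ecs_blue_green(example)
print(patched["pipeline"]["stages"][0]["actions"][0]["actionTypeId"])
```

The patched "pipeline" object could then be passed back with update-pipeline; the original definition is left untouched so the two can be compared first.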
I found the answer here:
The deployment specifies that the revision is a null file, but the revision provided is a zip file
I was using the wrong action provider when setting up my deployment. I chose ECS and I should have chosen ECS Blue/Green.
The ambiguous error message made debugging and searching for answers on stack overflow difficult for me.
I'm setting up a pipeline to automate the deployment of CloudFormation stack templates.
The pipeline itself is created in the eu-west-1 region, but the CloudFormation stacks could be deployed to any other region.
I know how to execute a pipeline action in a different account, but I don't see where to specify the region the template should be deployed to, as we do with the AWS CLI: aws --region cloudformation deploy.....
Is there any way to trigger a pipeline in one region and execute a deploy action in another region?
The action configuration properties don't seem to offer that possibility...
A workaround would be to run the AWS CLI deploy command in the CodeBuild container and specify the target region, but I would like to know whether there is a more elegant way to do it.
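For reference, the CLI workaround could look roughly like the following Python sketch, which only assembles the command line a CodeBuild step would run. The stack name, template file and region values are placeholders, and the helper name is hypothetical.

```python
import shlex


def cfn_deploy_command(region, stack_name, template_file,
                       capabilities=("CAPABILITY_IAM",)):
    """Build the `aws cloudformation deploy` invocation for one region."""
    cmd = [
        "aws", "--region", region, "cloudformation", "deploy",
        "--template-file", template_file,
        "--stack-name", stack_name,
    ]
    if capabilities:
        cmd += ["--capabilities", *capabilities]
    return cmd


# Placeholder values; pass the list to subprocess.run(cmd, check=True)
# inside the build container to actually deploy.
cmd = cfn_deploy_command("us-west-2", "my-stack", "template.json")
print(shlex.join(cmd))
```

Keeping the command as a list avoids shell-quoting problems if a stack name or path contains spaces.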
If you're looking to deploy to multiple regions, one after the other, you could create a CodePipeline pipeline in every region you want to deploy to, and set up S3 cross-region replication so that the output of the first pipeline becomes the input to the pipeline in the next region.
Here's a blog post explaining this further: https://aws.amazon.com/blogs/devops/building-a-cross-regioncross-account-code-deployment-solution-on-aws/
Since late November 2018, CodePipeline has supported cross-region deploys. However, it still leaves a lot to be desired: you need to create artifact buckets in each region and copy the deployment artifacts over to them (e.g. in the CodeBuild container, as you mentioned) before the Deploy action is triggered. So it's not as automated as it could be, but once you have gone through the process of setting it up, it works well.
CodePipeline now supports cross-region deployment. To run an action in a different region, specify the "Region": "us-west-2" property on the CloudFormation action in the stage; the deployment is then triggered in that region.
Steps to follow for this setup:
Create two buckets in two different regions, for example one in "us-east-1" and one in "us-west-2". (You can also use a bucket that CodePipeline created when you first set up a pipeline in that region.)
Configure the pipeline so that it uses the respective bucket when running an action in the respective region.
Specify the region in the action for CodePipeline.
Note: I have attached a sample CloudFormation template that shows how to do the cross-region CloudFormation deployment.
{
"Parameters": {
"BranchName": {
"Description": "CodeCommit branch name for all the resources",
"Type": "String",
"Default": "master"
},
"RepositoryName": {
"Description": "CodeCommit repository name",
"Type": "String",
"Default": "aws-account-resources"
},
"CFNServiceRoleDeployA": {
"Description": "CFN service role to create resources for account-A",
"Type": "String",
"Default": "arn:aws:iam::xxxxxxxxxxxxxx:role/CloudFormation-service-role-cp"
},
"CodePipelineServiceRole": {
"Description": "Service role for codepipeline",
"Type": "String",
"Default": "arn:aws:iam::xxxxxxxxxxxxxx:role/AWS-CodePipeline-Service"
},
"CodePipelineArtifactStoreBucket1": {
"Description": "S3 bucket to store the artifacts",
"Type": "String",
"Default": "bucket-us-east-1"
},
"CodePipelineArtifactStoreBucket2": {
"Description": "S3 bucket to store the artifacts",
"Type": "String",
"Default": "bucket-us-west-2"
}
},
"Resources": {
"AppPipeline": {
"Type": "AWS::CodePipeline::Pipeline",
"Properties": {
"Name": {"Fn::Sub": "${AWS::StackName}-cross-account-pipeline" },
"ArtifactStores": [
{
"ArtifactStore": {
"Type": "S3",
"Location": {
"Ref": "CodePipelineArtifactStoreBucket1"
}
},
"Region": "us-east-1"
},
{
"ArtifactStore": {
"Type": "S3",
"Location": {
"Ref": "CodePipelineArtifactStoreBucket2"
}
},
"Region": "us-west-2"
}
],
"RoleArn": {
"Ref": "CodePipelineServiceRole"
},
"Stages": [
{
"Name": "Source",
"Actions": [
{
"Name": "SourceAction",
"ActionTypeId": {
"Category": "Source",
"Owner": "AWS",
"Version": 1,
"Provider": "CodeCommit"
},
"OutputArtifacts": [
{
"Name": "SourceOutput"
}
],
"Configuration": {
"BranchName": {
"Ref": "BranchName"
},
"RepositoryName": {
"Ref": "RepositoryName"
},
"PollForSourceChanges": true
},
"RunOrder": 1
}
]
},
{
"Name": "Deploy-to-account-A",
"Actions": [
{
"Name": "stage-1",
"InputArtifacts": [
{
"Name": "SourceOutput"
}
],
"ActionTypeId": {
"Category": "Deploy",
"Owner": "AWS",
"Version": 1,
"Provider": "CloudFormation"
},
"Configuration": {
"ActionMode": "CREATE_UPDATE",
"StackName": "cloudformation-stack-name-account-A",
"TemplatePath":"SourceOutput::accountA.json",
"Capabilities": "CAPABILITY_IAM",
"RoleArn": {
"Ref": "CFNServiceRoleDeployA"
}
},
"RunOrder": 2,
"Region": "us-west-2"
}
]
}
]
}
}
}
}
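The same two changes (one artifact bucket per region, plus a region on the action) can also be applied to an existing pipeline through the API rather than CloudFormation. The sketch below assumes a pipeline object shaped like the "pipeline" value returned by get-pipeline, where the per-region buckets go into an artifactStores map keyed by region; the bucket names and the helper name are placeholders.

```python
import copy


def make_cross_region(pipeline, action_name, region, buckets):
    """Return a copy of the pipeline in which `action_name` runs in
    `region`. `buckets` maps every region used to its artifact bucket."""
    if region not in buckets:
        raise ValueError(f"no artifact bucket given for {region}")
    patched = copy.deepcopy(pipeline)
    # A pipeline carries either artifactStore or artifactStores.
    patched.pop("artifactStore", None)
    patched["artifactStores"] = {
        name: {"type": "S3", "location": bucket}
        for name, bucket in buckets.items()
    }
    for stage in patched["stages"]:
        for action in stage["actions"]:
            if action["name"] == action_name:
                action["region"] = region
    return patched


# Placeholder pipeline with a single CloudFormation deploy action.
pipeline = {
    "name": "demo",
    "artifactStore": {"type": "S3", "location": "bucket-us-east-1"},
    "stages": [{"name": "Deploy", "actions": [{"name": "stage-1"}]}],
}
cross = make_cross_region(
    pipeline, "stage-1", "us-west-2",
    {"us-east-1": "bucket-us-east-1", "us-west-2": "bucket-us-west-2"},
)
print(cross["stages"][0]["actions"][0]["region"])
```

The result is only a dictionary; sending it back (for example with boto3's update_pipeline) is left out so the sketch stays free of AWS calls.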
I have added an EMR cluster to a stack. After updating the stack successfully (CloudFormation), I can see the master and slave nodes in EC2 console and I can SSH into the master node. But AWS console does not show the new cluster. Even aws emr list-clusters doesn't show the cluster. I have triple checked the region and I am certain I'm looking at the right region.
Relevant CloudFormation JSON:
"Spark01EmrCluster": {
"Type": "AWS::EMR::Cluster",
"Properties": {
"Name": "Spark01EmrCluster",
"Applications": [
{
"Name": "Spark"
},
{
"Name": "Ganglia"
},
{
"Name": "Zeppelin"
}
],
"Instances": {
"Ec2KeyName": {"Ref": "KeyName"},
"Ec2SubnetId": {"Ref": "PublicSubnetId"},
"MasterInstanceGroup": {
"InstanceCount": 1,
"InstanceType": "m4.large",
"Name": "Master"
},
"CoreInstanceGroup": {
"InstanceCount": 1,
"InstanceType": "m4.large",
"Name": "Core"
}
},
"Configurations": [
{
"Classification": "spark-env",
"Configurations": [
{
"Classification": "export",
"ConfigurationProperties": {
"PYSPARK_PYTHON": "/usr/bin/python3"
}
}
]
}
],
"BootstrapActions": [
{
"Name": "InstallPipPackages",
"ScriptBootstrapAction": {
"Path": "[S3 PATH]"
}
}
],
"JobFlowRole": {"Ref": "Spark01InstanceProfile"},
"ServiceRole": "MyStackEmrDefaultRole",
"ReleaseLabel": "emr-5.13.0"
}
}
The cause is the missing VisibleToAllUsers property, which defaults to false. Since I'm using AWS Vault (i.e. authenticating through the STS AssumeRole API), I'm effectively a different user every time, so I couldn't see the cluster. I couldn't update the stack to add VisibleToAllUsers either, because I was getting "Job flow ID does not exist".
The solution was to log in as the root user and fix things from there (I had to delete the cluster manually, but removing it from the stack template JSON and updating the stack would probably have worked if I hadn't messed things up already).
I then added the cluster back to the template (with VisibleToAllUsers set to true) and updated the stack as usual (through AWS Vault).
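As an illustration of the template change, this small Python sketch sets VisibleToAllUsers on every AWS::EMR::Cluster resource in a template that has been loaded as a dictionary. The helper name and the trimmed-down template are made up for the example.

```python
import copy


def make_emr_clusters_visible(template):
    """Return a copy of the template with VisibleToAllUsers set to
    true on every EMR cluster resource."""
    patched = copy.deepcopy(template)
    for resource in patched.get("Resources", {}).values():
        if resource.get("Type") == "AWS::EMR::Cluster":
            resource.setdefault("Properties", {})["VisibleToAllUsers"] = True
    return patched


# Trimmed-down stand-in for the template in the question.
template = {"Resources": {
    "Spark01EmrCluster": {
        "Type": "AWS::EMR::Cluster",
        "Properties": {"Name": "Spark01EmrCluster"},
    },
    "SomeBucket": {"Type": "AWS::S3::Bucket"},
}}
visible = make_emr_clusters_visible(template)
print(visible["Resources"]["Spark01EmrCluster"]["Properties"])
```

Running something like this over templates before deployment is one way to avoid hitting the same invisible-cluster situation again.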
I'm trying to run a simple AWS Data Pipeline for my POC. The use case is the following: get data from a CSV file stored on S3, run a simple Hive query on it, and put the results back on S3.
I've created a very basic pipeline definition and tried to run it on two different EMR versions, 4.2.0 and 5.3.1. Both fail, though in different places.
The pipeline definition is the following:
{
"objects": [
{
"resourceRole": "DataPipelineDefaultResourceRole",
"role": "DataPipelineDefaultRole",
"maximumRetries": "1",
"enableDebugging": "true",
"name": "EmrCluster",
"keyPair": "Jeff Key Pair",
"id": "EmrClusterId_CM5Td",
"releaseLabel": "emr-5.3.1",
"region": "us-west-2",
"type": "EmrCluster",
"terminateAfter": "1 Day"
},
{
"column": [
"policyID INT",
"statecode STRING"
],
"name": "SampleCSVOutputFormat",
"id": "DataFormatId_9sLJ0",
"type": "CSV"
},
{
"failureAndRerunMode": "CASCADE",
"resourceRole": "DataPipelineDefaultResourceRole",
"role": "DataPipelineDefaultRole",
"pipelineLogUri": "s3://aws-logs/datapipeline/",
"scheduleType": "ONDEMAND",
"name": "Default",
"id": "Default"
},
{
"directoryPath": "s3://data-pipeline-input/",
"dataFormat": {
"ref": "DataFormatId_KIMjx"
},
"name": "InputDataNode",
"id": "DataNodeId_RyNzr",
"type": "S3DataNode"
},
{
"s3EncryptionType": "NONE",
"directoryPath": "s3://data-pipeline-output/",
"dataFormat": {
"ref": "DataFormatId_9sLJ0"
},
"name": "OutputDataNode",
"id": "DataNodeId_lnwhV",
"type": "S3DataNode"
},
{
"output": {
"ref": "DataNodeId_lnwhV"
},
"input": {
"ref": "DataNodeId_RyNzr"
},
"stage": "true",
"maximumRetries": "2",
"name": "HiveTest",
"hiveScript": "INSERT OVERWRITE TABLE ${output1} select policyID, statecode from ${input1};",
"runsOn": {
"ref": "EmrClusterId_CM5Td"
},
"id": "HiveActivityId_JFqr5",
"type": "HiveActivity"
},
{
"name": "SampleCSVDataFormat",
"column": [
"policyID INT",
"statecode STRING",
"county STRING",
"eq_site_limit FLOAT",
"hu_site_limit FLOAT",
"fl_site_limit FLOAT",
"fr_site_limit FLOAT",
"tiv_2011 FLOAT",
"tiv_2012 FLOAT",
"eq_site_deductible FLOAT",
"hu_site_deductible FLOAT",
"fl_site_deductible FLOAT",
"fr_site_deductible FLOAT",
"point_latitude FLOAT",
"point_longitude FLOAT",
"line STRING",
"construction STRING",
"point_granularity INT"
],
"id": "DataFormatId_KIMjx",
"type": "CSV"
}
],
"parameters": []
}
And the CSV file looks like this:
policyID,statecode,county,eq_site_limit,hu_site_limit,fl_site_limit,fr_site_limit,tiv_2011,tiv_2012,eq_site_deductible,hu_site_deductible,fl_site_deductible,fr_site_deductible,point_latitude,point_longitude,line,construction,point_granularity
119736,FL,CLAY COUNTY,498960,498960,498960,498960,498960,792148.9,0,9979.2,0,0,30.102261,-81.711777,Residential,Masonry,1
448094,FL,CLAY COUNTY,1322376.3,1322376.3,1322376.3,1322376.3,1322376.3,1438163.57,0,0,0,0,30.063936,-81.707664,Residential,Masonry,3
206893,FL,CLAY COUNTY,190724.4,190724.4,190724.4,190724.4,190724.4,192476.78,0,0,0,0,30.089579,-81.700455,Residential,Wood,1
The HiveActivity is just a simple query (copied from the AWS docs):
"INSERT OVERWRITE TABLE ${output1} select policyID, statecode from ${input1};"
However it fails when running on emr-5.3.1:
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.tez.TezTask
/mnt/taskRunner/./hive-script:617:in `<main>': Error executing cmd: /usr/share/aws/emr/scripts/hive-script "--base-path" "s3://us-west-2.elasticmapreduce/libs/hive/" "--hive-versions" "latest" "--run-hive-script" "--args" "-f"
Digging deeper into the logs, I found the following exception:
2017-02-25T00:33:00,434 ERROR [316e5d21-dfd8-4663-a03c-2ea4bae7b1a0 main([])]: tez.DagUtils (:()) - Could not find the jar that was being uploaded
2017-02-25T00:33:00,434 ERROR [316e5d21-dfd8-4663-a03c-2ea4bae7b1a0 main([])]: exec.Task (:()) - Failed to execute tez graph.
java.io.IOException: Previous writer likely failed to write hdfs://ip-170-41-32-05.us-west-2.compute.internal:8020/tmp/hive/hadoop/_tez_session_dir/31ae6d21-dfd8-4123-a03c-2ea4bae7b1a0/emr-hive-goodies.jar. Failing because I am unlikely to write too.
at org.apache.hadoop.hive.ql.exec.tez.DagUtils.localizeResource(DagUtils.java:1022)
at org.apache.hadoop.hive.ql.exec.tez.DagUtils.addTempResources(DagUtils.java:902)
at org.apache.hadoop.hive.ql.exec.tez.DagUtils.localizeTempFilesFromConf(DagUtils.java:845)
at org.apache.hadoop.hive.ql.exec.tez.TezSessionState.refreshLocalResourcesFromConf(TezSessionState.java:466)
at org.apache.hadoop.hive.ql.exec.tez.TezTask.updateSession(TezTask.java:294)
at org.apache.hadoop.hive.ql.exec.tez.TezTask.execute(TezTask.java:155)
When running on emr-4.2.0 I get a different crash:
Number of reduce tasks is set to 0 since there's no reduce operator
java.lang.NullPointerException
at org.apache.hadoop.fs.Path.<init>(Path.java:105)
at org.apache.hadoop.fs.Path.<init>(Path.java:94)
at org.apache.hadoop.hive.ql.exec.Utilities.toTempPath(Utilities.java:1517)
at org.apache.hadoop.hive.ql.exec.Utilities.createTmpDirs(Utilities.java:3555)
at org.apache.hadoop.hive.ql.exec.Utilities.createTmpDirs(Utilities.java:3520)
Both the S3 buckets and the EMR cluster are in the same region and run under the same AWS account. I've tried a bunch of experiments with the S3DataNode and EmrCluster configurations, but it always crashes.
I also couldn't find any working example of a data pipeline with HiveActivity, either in the documentation or on GitHub.
Can someone please help me figure it out? Thank you.
I was facing the same problem when upgrading my EMR cluster from a 4.*.* release to the 5.28.0 release. After changing the release label, I followed @andrii-gorishnii's comment and added
delete jar /mnt/taskRunner/emr-hive-goodies.jar;
to the beginning of my Hive script, and it solved my problem! Thanks @andrii-gorishnii
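If the pipeline definition is kept as JSON, the same fix can be applied mechanically. The following Python sketch prepends the statement to the hiveScript of every HiveActivity object in a definition that has been loaded as a dictionary; the helper name is hypothetical, and only inline hiveScript values are handled (not scripts referenced by URI).

```python
import copy

DELETE_JAR = "delete jar /mnt/taskRunner/emr-hive-goodies.jar;"


def prepend_delete_jar(definition):
    """Return a copy of the definition where each inline Hive script
    starts with the delete-jar statement (added at most once)."""
    patched = copy.deepcopy(definition)
    for obj in patched["objects"]:
        if obj.get("type") != "HiveActivity" or "hiveScript" not in obj:
            continue
        script = obj["hiveScript"]
        if not script.lstrip().startswith(DELETE_JAR):
            obj["hiveScript"] = DELETE_JAR + "\n" + script
    return patched


# Trimmed-down stand-in for the definition in the question.
definition = {"objects": [
    {"id": "HiveActivityId_JFqr5", "type": "HiveActivity",
     "hiveScript": "INSERT OVERWRITE TABLE ${output1} "
                   "select policyID, statecode from ${input1};"},
    {"id": "Default", "name": "Default"},
]}
fixed = prepend_delete_jar(definition)
print(fixed["objects"][0]["hiveScript"])
```

Applying the function twice leaves the script unchanged the second time, so it is safe to run as part of a deployment step.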
I have been using the UNLOAD statement in Redshift for a while now; it makes it easy to dump a file to S3 and let people analyse it.
The time has come to automate it. We have Amazon Data Pipeline running for several tasks, and I wanted to use SQLActivity to execute UNLOAD automatically. I use a SQL script hosted in S3.
The query itself is correct, but what I have been trying to figure out is how to assign the name of the file dynamically. For example:
UNLOAD('<the_query>')
TO 's3://my-bucket/' || to_char(current_date)
WITH CREDENTIALS '<credentials>'
ALLOWOVERWRITE
PARALLEL OFF
doesn't work, and of course I suspect that you can't call functions (to_char) in the "TO" line. Is there any other way I can do it?
And if UNLOAD is not the way, what other options do I have for automating such tasks with the currently available infrastructure (Redshift + S3 + Data Pipeline; our Amazon EMR is not active yet)?
The only thing I thought could work (but I'm not sure) is, instead of pointing to a script file, to copy the script into the Script option of SQLActivity (at the moment it points to a file) and reference {#ScheduleStartTime}.
Why not use RedshiftCopyActivity to copy from Redshift to S3? The input is a RedshiftDataNode and the output is an S3DataNode, where you can specify an expression for directoryPath.
You can also specify the transformSql property in RedshiftCopyActivity to override the default query, which is select * from followed by the input Redshift table.
Sample pipeline:
{
"objects": [{
"id": "CSVId1",
"name": "DefaultCSV1",
"type": "CSV"
}, {
"id": "RedshiftDatabaseId1",
"databaseName": "dbname",
"username": "user",
"name": "DefaultRedshiftDatabase1",
"*password": "password",
"type": "RedshiftDatabase",
"clusterId": "redshiftclusterId"
}, {
"id": "Default",
"scheduleType": "timeseries",
"failureAndRerunMode": "CASCADE",
"name": "Default",
"role": "DataPipelineDefaultRole",
"resourceRole": "DataPipelineDefaultResourceRole"
}, {
"id": "RedshiftDataNodeId1",
"schedule": {
"ref": "ScheduleId1"
},
"tableName": "orders",
"name": "DefaultRedshiftDataNode1",
"type": "RedshiftDataNode",
"database": {
"ref": "RedshiftDatabaseId1"
}
}, {
"id": "Ec2ResourceId1",
"schedule": {
"ref": "ScheduleId1"
},
"securityGroups": "MySecurityGroup",
"name": "DefaultEc2Resource1",
"role": "DataPipelineDefaultRole",
"logUri": "s3://myLogs",
"resourceRole": "DataPipelineDefaultResourceRole",
"type": "Ec2Resource"
}, {
"myComment": "This object is used to control the task schedule.",
"id": "ScheduleId1",
"name": "RunOnce",
"occurrences": "1",
"period": "1 Day",
"type": "Schedule",
"startAt": "FIRST_ACTIVATION_DATE_TIME"
}, {
"id": "S3DataNodeId1",
"schedule": {
"ref": "ScheduleId1"
},
"directoryPath": "s3://my-bucket/#{format(@scheduledStartTime, 'YYYY-MM-dd-HH-mm-ss')}",
"name": "DefaultS3DataNode1",
"dataFormat": {
"ref": "CSVId1"
},
"type": "S3DataNode"
}, {
"id": "RedshiftCopyActivityId1",
"output": {
"ref": "S3DataNodeId1"
},
"input": {
"ref": "RedshiftDataNodeId1"
},
"schedule": {
"ref": "ScheduleId1"
},
"name": "DefaultRedshiftCopyActivity1",
"runsOn": {
"ref": "Ec2ResourceId1"
},
"type": "RedshiftCopyActivity"
}]
}
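To see what the directoryPath expression above should produce for a given run, here is a rough Python equivalent. It assumes that the 'YYYY-MM-dd-HH-mm-ss' pattern corresponds to the strftime format '%Y-%m-%d-%H-%M-%S'; the bucket name is the placeholder from the sample and the function name is made up.

```python
from datetime import datetime


def output_directory(bucket, scheduled_start):
    """Mimic the directoryPath expression for one scheduled start time."""
    # Assumed strftime counterpart of 'YYYY-MM-dd-HH-mm-ss'.
    stamp = scheduled_start.strftime("%Y-%m-%d-%H-%M-%S")
    return f"s3://{bucket}/{stamp}"


path = output_directory("my-bucket", datetime(2017, 2, 25, 0, 33, 0))
print(path)
```

Each scheduled run therefore writes to its own timestamped prefix, which is what the dynamic file name in the question was after.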
Are you able to SSH into the cluster? If so, I would suggest writing a shell script in which you create the variables you need, then pass those variables into the query statement you run over the connection.
You can do this with a Redshift stored procedure that wraps the UNLOAD statement and derives the S3 path name dynamically.
In your job, call the procedure; it builds the UNLOAD statement dynamically and executes it.
This way you can avoid the other services, though whether it fits depends on your use case.
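Whichever of these routes is taken, the core step is assembling the UNLOAD text with the date-derived path before it is executed. A minimal client-side sketch in Python follows; the bucket name, query and credentials string are placeholders, the function name is hypothetical, and single quotes inside the query are doubled because the query itself is passed as a quoted literal.

```python
from datetime import date


def build_unload(query, bucket, credentials, run_date):
    """Assemble an UNLOAD statement whose target prefix contains the date."""
    escaped = query.replace("'", "''")  # quotes inside the quoted query
    target = f"s3://{bucket}/{run_date.isoformat()}"
    return (
        f"UNLOAD ('{escaped}')\n"
        f"TO '{target}'\n"
        f"WITH CREDENTIALS '{credentials}'\n"
        "ALLOWOVERWRITE\n"
        "PARALLEL OFF;"
    )


statement = build_unload(
    "select * from orders where status = 'open'",
    "my-bucket", "<credentials>", date(2017, 2, 25),
)
print(statement)
```

The resulting string can be handed to whatever client runs the SQL; a stored procedure would do the same concatenation on the Redshift side.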