I am running my Spark application on EMR, and have several println() statements. Other than the console, where do these statements get logged?
My S3 aws-logs directory structure for my cluster looks like:
node
├── i-0031cd7a536a42g1e
│ ├── applications
│ ├── bootstrap-actions
│ ├── daemons
│ ├── provision-node
│ └── setup-devices
containers/
├── application_12341331455631_0001
│ ├── container_12341331455631_0001_01_000001
You can find the println output in a few places:
Resource Manager -> Your Application -> Logs -> stdout
Your S3 log directory -> containers/application_.../container_.../stdout (though this takes a few minutes to populate after the application finishes)
SSH into the EMR master node and run yarn logs -applicationId <Application ID> -log_files <log_file_type>
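For example, with the application ID from the S3 listing above, the last option would look roughly like this (run on the master node; stdout is just one of the available log file types):
# Dump the stdout of every container belonging to the application
yarn logs -applicationId application_12341331455631_0001 -log_files stdout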
There is a very important thing that you need to consider when printing from Spark: are you running code that gets executed in the driver or is it code that runs in the executor?
For example, if you do the following, it will output in the console as you are bringing data back to the driver:
for i in your_rdd.collect():
    print(i)
But the following will run within an executor and thus it will be written in the Spark logs:
def run_in_executor(value):
    print(value)

# map() is lazy, so an action such as count() is needed to actually trigger the prints
your_rdd.map(lambda x: run_in_executor(x)).count()
Now, coming back to your original question: the second case will write to the log location. Logs are usually written to the master node, under /mnt/var/log/hadoop/steps, but it might be better to configure logging to an S3 bucket with --log-uri. That way the logs will be easier to find.
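For illustration, a cluster created with its logs pointed at S3 might look like this; the cluster name, instance settings, and bucket are placeholders, not values from your setup:
# Sketch of creating an EMR cluster that ships its logs to S3
aws emr create-cluster \
  --name "spark-cluster" \
  --release-label emr-6.10.0 \
  --applications Name=Spark \
  --instance-type m5.xlarge \
  --instance-count 3 \
  --use-default-roles \
  --log-uri s3://my-log-bucket/aws-logs/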
I'm quite new to the Amplify function world. I've been struggling to deploy my Golang function, which is connected to a DynamoDB stream. I can run my Lambda successfully by manually uploading a .zip that I create after building the binary with GOARCH=amd64 GOOS=linux go build src/index.go (I develop on a Mac), but when I use the Amplify CLI tools I am not able to deploy my function.
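For context, the manual workflow described above boils down to something like the following; the zip file name is just an example:
# Cross-compile for Lambda's Linux runtime from macOS
GOOS=linux GOARCH=amd64 go build -o index src/index.go
# Zip the resulting binary (not the source) and upload it to Lambda by hand
zip function.zip index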
This is the folder structure of my function, myfunction:
+ myfunction
├── amplify.state
├── custom-policies.json
├── dist
│ └── latest-build.zip
├── function-parameters.json
├── go.mod
├── go.sum
├── parameters.json
├── src
│ ├── event.json
│ └── index.go
└── tinkusercreate-cloudformation-template.json
The problem is that I can't use the amplify function build command, since it looks like it creates a .zip file containing my source file index.go (not the binary), so the Lambda, regardless of the handler I set, does not seem able to run from that source. I get errors like
fork/exec /var/task/index.go: exec format error: PathError null or
fork/exec /var/task/index: no such file or directory: PathError null
depending on the handler I set.
Is there a way to make amplify function build work for a Golang Lambda? I would like to be able to run amplify function build myfunction successfully, so that I can deliver a working deployment to my target environment with amplify push.
My infrastructure is composed of a Host Project and several Service Projects that use its Shared VPC.
I have refactored the .tf files of my infrastructure as follows:
├── env
│ ├── dev
│ │ ├── main.tf
│ │ ├── outputs.tf
│ │ └── variables.tf
│ ├── pre
│ └── pro
├── host
│ ├── main.tf
│ ├── outputs.tf
│ ├── terraform.tfvars
│ └── variables.tf
└── modules
├── compute
├── network
└── projects
The order of creation of the infrastructure is:
terraform apply in /host
terraform apply in /env/dev (for instance)
In the main.tf of the host directory I have created the VPC and enabled Shared VPC hosting:
# Creation of the hosted network
resource "google_compute_network" "shared_network" {
name = var.network_name
auto_create_subnetworks = false
project = google_compute_shared_vpc_host_project.host_project.project
mtu = "1460"
}
# Enable shared VPC hosting in the host project.
resource "google_compute_shared_vpc_host_project" "host_project" {
project = google_project.host_project.project_id
depends_on = [google_project_service.host_project]
}
The issue comes when I have to refer to the Shared VPC network in the Service Projects.
In the main.tf from env/dev/ I have set the following:
resource "google_compute_shared_vpc_service_project" "service_project_1" {
host_project = google_project.host_project.project_id
service_project = google_project.service_project_1.project_id
depends_on = [
google_compute_shared_vpc_host_project.host_project,
google_project_service.service_project_1,
]
}
QUESTION
How do I refer to the Host Project ID from another directory in the Service Project?
What I have tried so far
I have thought of using Output Values and Data Sources:
In host/outputs.tf I declared the Project ID as an output:
output "project_id" {
value = google_project.host_project.project_id
}
But then I end up not knowing how to implement this output in my env/dev/main.tf.
I have also thought about Data Sources: in env/dev/main.tf I could fetch the Host Project ID. But in order to fetch it I would need its name (which defeats the purpose of providing it programmatically if I have to hardcode it).
What should I try next? What am I missing?
The files under the env/dev folder can't see anything above them, only the modules they reference.
You could refactor the host folder into a module to allow access to its outputs... but that adds the risk that the host will be destroyed whenever you destroy a dev environment.
I would try running terraform output -raw project_id after creating the host and piping the result to a text file or an environment variable, then using that as the input for a new "host_project" (or similar) variable in the env/dev deployment.
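A rough sketch of that workflow, assuming a new variable named host_project_id is declared in env/dev/variables.tf (the variable name and paths are assumptions; -chdir requires Terraform 0.14+):
# From the repository root, after applying the host configuration
terraform -chdir=host output -raw project_id > env/dev/host_project_id.txt

# Or pass it straight to the dev deployment as an input variable
export TF_VAR_host_project_id="$(terraform -chdir=host output -raw project_id)"
terraform -chdir=env/dev apply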
I have a File Provisioner configured in my Packer template JSON:
"provisioners": [{
"type": "file",
"source": "packer/api-clients/database.yml",
"destination": "/tmp/api-clients-database.yml"
},
The provisioner above doesn't work when I'm trying to build an AMI on Amazon AWS; it always says:
Bad source 'packer/api-clients/database.yml': stat packer/api-clients/database.yml: no such file or directory
If I do this:
"source": "api-clients/database.yml",
It works like a charm. But I must have all my Packer files inside of a packer folder within my app folder for organization purposes.
What am I doing wrong?
My app folder is like this:
api_v1
├── template.json
├── app
│ ├── bin
│ ├── config
│ ├── packer
│ │ ├── api-clients
│ │ │ └── database.yml
│ ├── lib
│ ├── log
│ ├── ...
It seems to have something to do with relative vs. absolute paths in Packer, but I couldn't figure out what is wrong...
Thanks in advance,
Since the path doesn't start with a /, it's a relative path. Relative paths are resolved against the current working directory when executing packer build.
With source packer/api-clients/database.yml you have to run packer from the app directory, i.e.
packer build ../template.json
With source api-clients/database.yml you have to run packer from the packer directory, i.e.
packer build ../../template.json
For more info see Packer documentation - File provisioner: source.
It is, as you have surmised, a path thing.
You do not say from which folder you are calling Packer or what the calling command is, nor, for the case that works with "source": "api-clients/database.yml", whether you have moved the api-clients folder or are running Packer from that location.
If your folder structure will always look that way then, to avoid confusion, use a full path for the source; it will always work no matter where you run Packer from, e.g.
/api_v1/app/packer/api-clients/database.yml
If you must use relative paths, then make sure the source path is always relative to the folder in which Packer is run.
I'm new to Go.
I am trying to deploy a simple web project to EB without success.
I would like to deploy a project with the following local structure to Amazon EB:
$GOPATH
├── bin
├── pkg
└── src
├── github.com
│ ├── AstralinkIO
│ │ └── api-server <-- project/repository root
│ │ ├── bin
│ │ ├── cmd <-- main package
│ │ ├── pkg
│ │ ├── static
│ │ └── vendor
But I'm not sure how to do that; when building the command, Amazon treats api-server as the $GOPATH, and of course the import paths break.
I read that most of the time it's best to keep all repos under the same workspace, but that makes deployment harder.
I'm using a Procfile and a Buildfile to customize the output path, but I can't find a solution for the dependencies.
What is the best way to deploy such project to EB?
A long time has passed since I used Beanstalk, so I'm a bit rusty on the details, but the basic idea is as follows. AWS Beanstalk support for Go is a bit odd by design: it basically extracts your source files into a folder on the server, declares that folder as GOPATH, and tries to build your application assuming that your main package is at the root of your GOPATH, which is not a standard layout for Go projects. So your options are:
1) Package your whole GOPATH as the "source bundle" for Beanstalk. Then you should be able to write a build.sh script that sets GOPATH and builds it your way, and call build.sh from your Buildfile.
2) Change your main package to be a regular package (e.g. github.com/AstralinkIO/api-server/cmd). Then create an application.go file at the root of your GOPATH (yes, outside of src, while all actual packages are in src as they should be). Your application.go will become your "package main" and will only contain a main function (which will call your current Main function from github.com/AstralinkIO/api-server/cmd). Should do the trick. Though your mileage might vary.
3) A bit easier option is to use the Docker-based Go platform instead. It still builds your Go application on the server with mostly the same issues as above, but it's better documented, and the possibility to test it locally helps a lot with getting the configuration and build right. It will also give you some insights into how Beanstalk builds Go applications, thus helping with options 1 and 2. I used this option myself until I moved to plain EC2 instances, and I still use the skills gained from it to build my current app releases with Docker.
4) Your best option though (in my humble opinion) is to build your app yourself and package it as a ready-to-run binary file (see the second bullet point paragraph here); a rough sketch of this approach follows below.
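A rough sketch of option 4, assuming the Beanstalk Go platform's Procfile convention and the layout from the question (the output name bin/application and the bundle contents are assumptions):
# Cross-compile the main package for the Linux hosts Beanstalk runs on (GOPATH set as in the layout above)
GOOS=linux GOARCH=amd64 go build -o bin/application github.com/AstralinkIO/api-server/cmd
# Tell Beanstalk how to start the prebuilt binary
printf 'web: bin/application\n' > Procfile
# Ship only the binary, the Procfile and any runtime assets as the source bundle
zip -r bundle.zip bin/application Procfile static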
Well, whichever option you choose - good luck!
I have a React app hosted on an S3 bucket. The code is minified using yarn build (it's a create-react-app based app). The build folder looks something like:
build
├── asset-manifest.json
├── favicon.ico
├── images
│ ├── map-background.png
│ └── robot-icon.svg
├── index.html
├── js
│ ├── fontawesome.js
│ ├── packs
│ │ ├── brands.js
│ │ ├── light.js
│ │ ├── regular.js
│ │ └── solid.js
│ └── README.md
├── service-worker.js
└── static
├── css
│ ├── main.bf27c1d9.css
│ └── main.bf27c1d9.css.map
└── js
├── main.8d11d7ab.js
└── main.8d11d7ab.js.map
I never want index.html to be cached, because if I update the code (causing the hex suffix in main.*.js to update), I need the user's next visit to pick up on the <script src> change in index.html to point to the updated code.
In CloudFront, I can only seem to exclude paths, and excluding "/" doesn't seem to work properly. I'm getting strange behavior where I change the code, and if I hit refresh, I see it, but if I quit Chrome and go back, I see very outdated code for some reason.
I don't want to have to trigger an invalidation on every code release (via CodeBuild). Is there some other way? I think one of the challenges is that since this is an app using React Router, I'm having to do some trickery by setting the error document to index.html and forcing an HTTP status 200 instead of 403.
A solution based on CloudFront configuration:
Go to your CloudFront distribution, open the "Behaviors" tab, and create a new behavior.
Specify the following values:
Path Pattern: index.html
Object Caching: customize
Maximum TTL: 0 (or another very small value)
Default TTL: 0 (or another very small value)
Save this configuration.
CloudFront will not cache index.html anymore.
If you never want index.html to be cached, set the Cache-Control: max-age=0 header on that file only. CloudFront will make a request back to your origin S3 bucket on every request, but it sounds like this is desired behavior.
If you want to set longer expiry times and invalidate the CloudFront cache manually, you can use * or /* as the invalidation path (not /, as you mentioned). It can take up to 15 minutes for all CloudFront edge nodes around the world to reflect the changes in your origin, however.
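For example, with a placeholder distribution ID:
aws cloudfront create-invalidation --distribution-id E123EXAMPLE --paths "/*"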
Here is the command I ran to set cache-control on my index.html file after uploading new files to s3 and invalidating Cloudfront:
aws s3 cp s3://bucket/index.html s3://bucket/index.html --metadata-directive REPLACE --cache-control max-age=0 --content-type "text/html"
It's much better to run an invalidation for index.html on every release than to defeat CloudFront's purpose and serve it (basically the entry point of your app) from S3 every single time.
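A minimal release step along those lines might look like the following; the bucket name and distribution ID are placeholders:
# Long-lived cache for the fingerprinted assets, but never for index.html
aws s3 sync build/ s3://my-app-bucket --cache-control "max-age=31536000" --exclude "index.html"
aws s3 cp build/index.html s3://my-app-bucket/index.html --cache-control "no-cache" --content-type "text/html"
# Invalidate only the entry point so returning visitors pick up the new asset names
aws cloudfront create-invalidation --distribution-id E123EXAMPLE --paths "/index.html"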