Glue crawler created multiple tables from a partitioned S3 bucket

I have an S3 bucket that is structured like this:
root/
├── year=2020/
│   └── month=01/
│       ├── day=01/
│       │   ├── file1.log
│       │   ├── ...
│       │   └── file8.log
│       ├── day=.../
│       └── day=31/
│           ├── file1.log
│           ├── ...
│           └── file8.log
└── year=2019/
    └── ...
Each day folder contains 8 files whose names are identical across days, so there is a file1.log in every 'day' folder. I crawled this bucket using a custom classifier.
Expected behavior: Glue creates a single table with year, month, and day as partition fields, plus the other fields I described in my custom classifier. I can then use the table in my Job scripts.
Actual behavior:
1) Glue created one table that matched my expectations. However, when I tried to access it in my Job scripts, the table had no columns.
2) Glue created one table for every 'day' partition, and 8 tables for the file<number>.log files.
I have tried excluding **_SUCCESS and **crc as suggested in this other question: AWS Glue Crawler adding tables for every partition? However, it doesn't seem to work. I have also checked the 'Create a single schema for each S3 path' option in the crawler's settings. It still doesn't work.
What am I missing?

You should have a single data folder at the root (e.g. customers) and the partition sub-folders inside it. If the partition folders sit directly at the bucket level, the crawler will not create a single table.
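For example, with a layout like the one below (bucket and folder names are purely illustrative), you would point the crawler at s3://bucket/logs/ rather than at the bucket root:
s3://bucket/logs/
├── year=2020/
│   └── month=01/
│       ├── day=01/
│       │   └── file1.log ... file8.log
│       └── day=31/
│           └── file1.log ... file8.log
└── year=2019/
With the partition folders nested under a single data folder, the crawler can roll everything up into one table (named after that folder) with year, month, and day as partition keys.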


How to use variables from other modules in Terraform: Adding Host Project id to the Service Projects. (GCP)

My infrastructure is composed of a Host Project and several Service Projects that use its Shared VPC.
I have refactored the .tf files of my infrastructure as follows:
├── env
│   ├── dev
│   │   ├── main.tf
│   │   ├── outputs.tf
│   │   └── variables.tf
│   ├── pre
│   └── pro
├── host
│   ├── main.tf
│   ├── outputs.tf
│   ├── terraform.tfvars
│   └── variables.tf
└── modules
    ├── compute
    ├── network
    └── projects
The order of creation of the infrastructure is:
terraform apply in /host
terraform apply in /env/dev (for instance)
In the main.tf of the host directory I have created the VPC and enabled Shared VPC hosting:
# Creation of the hosted network
resource "google_compute_network" "shared_network" {
  name                    = var.network_name
  auto_create_subnetworks = false
  project                 = google_compute_shared_vpc_host_project.host_project.project
  mtu                     = "1460"
}

# Enable shared VPC hosting in the host project.
resource "google_compute_shared_vpc_host_project" "host_project" {
  project    = google_project.host_project.project_id
  depends_on = [google_project_service.host_project]
}
The issue comes when I have to refer to the Shared VPC network in the Service Projects.
In the main.tf from env/dev/ I have set the following:
resource "google_compute_shared_vpc_service_project" "service_project_1" {
host_project = google_project.host_project.project_id
service_project = google_project.service_project_1.project_id
depends_on = [
google_compute_shared_vpc_host_project.host_project,
google_project_service.service_project_1,
]
}
QUESTION
How do I refer to the Host Project ID from another directory in the Service Project?
What I have tried so far
I have thought of using Output Values and Data Sources:
In host/outputs.tf I declared the Project ID as an output:
output "project_id" {
value = google_project.host_project.project_id
}
But then I end up not knowing how to consume this output in my env/dev/main.tf.
I have also thought of Data Sources: in env/dev/main.tf I could fetch the Host Project ID. But in order to fetch it I would need the project's name, which defeats the purpose of providing it programmatically if I have to hardcode it.
What should I try next? What am I missing?
The files under the env/dev folder can't see anything above them; they only see the modules they reference.
You could refactor the host folder into a module to get access to its outputs... but that adds the risk that the host will be destroyed whenever you destroy a dev environment.
I would try running terraform output -raw project_id after creating the host and piping it to a text file or an environment variable, then using that as the input for a new "host_project" (or similar) variable in the env/dev deployment, as sketched below.
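A minimal sketch of that approach, assuming a variable named host_project_id (the name and paths are hypothetical):
# env/dev/variables.tf (hypothetical variable name)
variable "host_project_id" {
  description = "Project ID of the Shared VPC host project, taken from the host deployment's outputs"
  type        = string
}

# Populate it from the host outputs before applying, for example:
#   export TF_VAR_host_project_id="$(terraform -chdir=../../host output -raw project_id)"
#   terraform apply
The google_compute_shared_vpc_service_project resource in env/dev can then use var.host_project_id for its host_project argument instead of referencing google_project.host_project directly.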

create isolated stacks with terraform as with the serverless framework on AWS

I am developing a serverless data pipeline on AWS. Compared to the Serverless framework, Terraform has better support for services like Glue.
The good thing about Serverless is that you can pass the --stage argument when deploying, which creates an isolated stack on AWS. When developing new features on our data pipeline I can deploy my current state of the code like this:
serverless deploy --stage my-new-feature
this allows me to do an isolated integration test on the AWS account I share with my colleagues. Is this possible using Terraform?
Did you have a look at workspaces? https://www.terraform.io/docs/state/workspaces.html
Terraform manages resources by way of state.
If a resource already exists in the state file and Terraform doesn't detect any drift between the configuration, the state, and the provider (e.g. something was changed in the AWS console or by another tool), then it will show that there are no changes. If it does detect some form of drift, then a plan will show you what changes it needs to make to bring the existing state of things in the provider to what is defined in the Terraform code.
Separating state between different environments
If you want to have multiple environments or even other resources that are separate from each other and not managed by the same Terraform action (such as a plan, apply or destroy) then you want to separate these into different state files.
One way to do this is to separate your Terraform code by environment and use a state file matching the directory structure of your code base. A simple example might look something like this:
terraform/
├── production
│   ├── main.tf -> ../stacks/main.tf
│   └── terraform.tfvars
├── stacks
│   └── main.tf
└── staging
    ├── main.tf -> ../stacks/main.tf
    └── terraform.tfvars
stacks/main.tf
variable "environment" {}
resource "aws_lambda_function" "foo" {
function_name = "foo-${var.environment}"
# ...
}
production/terraform.tfvars
environment = "production"
staging/terraform.tfvars
environment = "staging"
This uses symlinks so that staging and production share exactly the same code, with the only differences introduced via the terraform.tfvars file. In this case it changes the Lambda function's name to include the environment.
This is what I generally recommend for static environments as it's much clearer from looking at the code/directory structure which environments exist.
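If you use a remote backend, each environment directory would also carry its own backend configuration so that the state files stay separate. A minimal sketch with an S3 backend (bucket, key and region are placeholders):
# staging/backend.tf (hypothetical bucket, key and region)
terraform {
  backend "s3" {
    bucket = "my-terraform-state"
    key    = "staging/terraform.tfstate"
    region = "eu-west-1"
  }
}
The production directory would use the same block with a production/terraform.tfstate key.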
Dynamic environments
However, if you have more dynamic environments, such as per feature branch, then it's not going to work to hard code the environment name directly in your terraform.tfvars file.
In this case I would recommend something like this:
terraform/
├── production
│   ├── main.tf -> ../stacks/main.tf
│   └── terraform.tfvars
├── review
│   ├── main.tf -> ../stacks/main.tf
│   └── terraform.tfvars
├── stacks
│   └── main.tf
└── staging
    ├── main.tf -> ../stacks/main.tf
    └── terraform.tfvars
This works the same way, but I would omit the environment variable from the review structure's terraform.tfvars so that it's set interactively or via CI environment variables (e.g. export TF_VAR_environment=${TRAVIS_BRANCH} when running in Travis CI; adapt this to whatever CI system you use).
Keeping state separate between review environments on different branches
This only gets you halfway there, though: if you were just using the default workspace, then when another person runs Terraform against a different branch they will see that Terraform wants to destroy or update the resources that were already created for the first branch.
Workspaces provide an option for separating state in a more dynamic way and also allows you to interpolate the workspace name into Terraform code:
resource "aws_instance" "example" {
tags {
Name = "web - ${terraform.workspace}"
}
# ... other arguments
}
Instead the review environments will need to create or use a dynamic workspace that is scoped for that branch only. You can do this by running the following command:
terraform workspace new [NAME]
If the workspace already exists then you should instead use the following command:
terraform workspace select [NAME]
In CI you can use the same environment variables as before to automatically use the branch name as your workspace name.

Terraform: Deploying the same web-app for multiple clients

I am currently using Terraform to deploy a PHP app to AWS.
This PHP app is deployed as a Service using AWS ECS.
I have multiple clients using this app, and each client receives their own copy of the system with their own configuration as their own service - a white label if you will.
Now, having done a bit of research on Terraform I've modularized my code and created the following file structure:
+---my-application
|       shared.tf
|       iam_policies.tf
|       iam_roles.tf
|       variables.tf
|       web-apps.tf
|
+---modules
|   +---role
|   |       main.tf
|   |       outputs.tf
|   |       variables.tf
|   |
|   \---webapp
|           main.tf
|           variables.tf
|
\---templates
        web_definition.tpl.json
My problem lies in the web-apps.tf file which I use as the "glue" for all of the webapp modules:
module "client_bob" {
source = "modules/webapp"
...
}
module "client_alice" {
source = "modules/webapp"
...
}
... Over 30 more client module blocks ...
Needless to say, this is not a good setup.
It is not scalable and also creates huge .tfstate files.
Once, when attempting to use Consul as a backend, I got an error message saying I had reached the size limit allowed for a Consul KV value.
What is the correct way to approach this situation?
I've looked at all of the questions in the Similar Questions section while writing this one, and all of them revolve around the idea of using multiple .tfstate files, but I don't quite understand how that would solve my problem. Any help would be greatly appreciated!
I did similar projects with Terragrunt; take a look.
It was born to answer exactly these kinds of requirements.
The open-source repository is https://github.com/gruntwork-io/terragrunt
Terragrunt is a thin wrapper for Terraform that provides extra tools for working with multiple Terraform modules. https://www.gruntwork.io
In your case, you can easily manage a separate tfstate file for each client.
I also recommend managing the IAM roles, policies, and any other per-client resources the same way; do not mix them between clients.
For example, the structure would become something like this
(I guess you will manage different environments for each client, right?):
├── bob
│   ├── prod
│   │   └── app
│   │       └── terraform.tfvars
│   └── nonprod
│       └── app
│           └── terraform.tfvars
├── alice
│   ├── prod
│   │   └── app
│   │       └── terraform.tfvars
│   └── nonprod
│       └── app
│           └── terraform.tfvars
└── ...
Later, once you master the terragrunt apply-all command (newer Terragrunt versions call this terragrunt run-all apply), deployments become simpler and easier.
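As a rough sketch, assuming the current terragrunt.hcl configuration format (the module source, repository URL and input names below are placeholders), each client/environment/app folder would contain something like:
# bob/prod/app/terragrunt.hcl (hypothetical source and inputs)
terraform {
  # Shared webapp module, pinned to a version so clients can be upgraded independently
  source = "git::https://github.com/your-org/infrastructure-modules.git//webapp?ref=v0.1.0"
}

# Pull in backend/provider settings defined in a parent terragrunt.hcl
include {
  path = find_in_parent_folders()
}

# Per-client, per-environment values passed to the module as input variables
inputs = {
  client      = "bob"
  environment = "prod"
}
Because every leaf folder gets its own state file, the huge single .tfstate problem from the question goes away.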
Quick start
https://github.com/gruntwork-io/terragrunt-infrastructure-modules-example
https://github.com/gruntwork-io/terragrunt-infrastructure-live-example

Where does EMR store Spark stdout?

I am running my Spark application on EMR, and have several println() statements. Other than the console, where do these statements get logged?
My S3 aws-logs directory structure for my cluster looks like:
node
├── i-0031cd7a536a42g1e
│   ├── applications
│   ├── bootstrap-actions
│   ├── daemons
│   ├── provision-node
│   └── setup-devices
containers/
├── application_12341331455631_0001
│   ├── container_12341331455631_0001_01_000001
You can find the println output in a few places:
Resource Manager -> Your Application -> Logs -> stdout
Your S3 log directory -> containers/application_.../container_.../stdout (though this takes a few minutes to populate after the application finishes)
SSH into the EMR master node and run yarn logs -applicationId <Application ID> -log_files <log_file_type>
There is a very important thing that you need to consider when printing from Spark: are you running code that gets executed in the driver or is it code that runs in the executor?
For example, if you do the following, it will output in the console as you are bringing data back to the driver:
for i in your_rdd.collect():
    print(i)
But the following will run within an executor and thus it will be written in the Spark logs:
def run_in_executor(value):
    print(value)

your_rdd.map(lambda x: run_in_executor(x)).count()  # map is lazy, so an action such as count() is needed to actually run it
Coming back to your original question: the second case will write to the log location. Logs are usually written on the master node under /mnt/var/log/hadoop/steps, but it might be better to configure the cluster to ship logs to an S3 bucket with --log-uri. That way they will be easier to find.

How to make CloudFront never cache index.html on S3 bucket

I have a React app hosted on an S3 bucket. The code is minified using yarn build (it's a create-react-app based app). The build folder looks something like:
build
├── asset-manifest.json
├── favicon.ico
├── images
│   ├── map-background.png
│   └── robot-icon.svg
├── index.html
├── js
│   ├── fontawesome.js
│   ├── packs
│   │   ├── brands.js
│   │   ├── light.js
│   │   ├── regular.js
│   │   └── solid.js
│   └── README.md
├── service-worker.js
└── static
    ├── css
    │   ├── main.bf27c1d9.css
    │   └── main.bf27c1d9.css.map
    └── js
        ├── main.8d11d7ab.js
        └── main.8d11d7ab.js.map
I never want index.html to be cached, because if I update the code (causing the hex suffix in main.*.js to update), I need the user's next visit to pick up on the <script src> change in index.html to point to the updated code.
In CloudFront, I can only seem to exclude paths, and excluding "/" doesn't seem to work properly. I'm getting strange behavior where I change the code, and if I hit refresh, I see it, but if I quit Chrome and go back, I see very outdated code for some reason.
I don't want to have to trigger an invalidation on every code release (via CodeBuild). Is there some other way? I think one of the challenges is that since this is an app using React Router, I'm having to do some trickery by setting the error document to index.html and forcing an HTTP status 200 instead of 403.
A solution based on CloudFront configuration:
Go to your CloudFront distribution, open the "Behaviors" tab and create a new behavior.
Specify the following values:
Path Pattern: index.html
Object Caching: customize
Maximum TTL: 0 (or another very small value)
Default TTL: 0 (or another very small value)
Save this configuration.
CloudFront will not cache index.html anymore.
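If you happen to manage the distribution with Terraform (as in some of the questions above), the same behavior could be sketched roughly as the following fragment inside an aws_cloudfront_distribution resource; the target_origin_id is a placeholder for your S3 origin:
# Fragment of an aws_cloudfront_distribution resource; "s3-webapp-origin" is a placeholder
ordered_cache_behavior {
  path_pattern           = "index.html"
  target_origin_id       = "s3-webapp-origin"
  viewer_protocol_policy = "redirect-to-https"
  allowed_methods        = ["GET", "HEAD"]
  cached_methods         = ["GET", "HEAD"]

  # "Object Caching: customize" with very small TTLs, mirroring the console steps above
  min_ttl     = 0
  default_ttl = 0
  max_ttl     = 0

  forwarded_values {
    query_string = false
    cookies {
      forward = "none"
    }
  }
}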
If you never want index.html to be cached, set the Cache-Control: max-age=0 header on that file only. CloudFront will make a request back to your origin S3 bucket on every request, but it sounds like this is desired behavior.
If you want to set longer expiry times and invalidate the CloudFront cache manually, you can use * or /* as your invalidation path (not /, as you have mentioned). However, it can take up to 15 minutes for all CloudFront edge nodes around the world to reflect the changes in your origin.
Here is the command I ran to set cache-control on my index.html file after uploading new files to s3 and invalidating Cloudfront:
aws s3 cp s3://bucket/index.html s3://bucket/index.html --metadata-directive REPLACE --cache-control max-age=0 --content-type "text/html"
It's much better to run an invalidation for index.html on every release than to defeat CloudFront's purpose and serve it (basically the entry point for your app) from S3 every single time.