AWS Glue Crawler adding tables for every partition?

AWS Glue Crawler adding tables for every partition? - amazon-web-services

I have several thousand files in an S3 bucket in this form:
├── bucket
│ ├── somedata
│ │   ├── year=2016
│ │   ├── year=2017
│ │   │   ├── month=11
│ │   | │   ├── sometype-2017-11-01.parquet
│ | | | ├── sometype-2017-11-02.parquet
│ | | | ├── ...
│ │   │   ├── month=12
│ │   | │   ├── sometype-2017-12-01.parquet
│ | | | ├── sometype-2017-12-02.parquet
│ | | | ├── ...
│ │   ├── year=2018
│ │   │   ├── month=01
│ │   | │   ├── sometype-2018-01-01.parquet
│ | | | ├── sometype-2018-01-02.parquet
│ | | | ├── ...
│ ├── moredata
│ │   ├── year=2017
│ │   │   ├── month=11
│ │   | │   ├── moretype-2017-11-01.parquet
│ | | | ├── moretype-2017-11-02.parquet
│ | | | ├── ...
│ │   ├── year=...
etc
Expected behavior:
The AWS Glue Crawler creates one table for each of somedata, moredata, etc. It creates partitions for each table based on the childrens' path names.
Actual Behavior:
The AWS Glue Crawler performs the behavior above, but ALSO creates a separate table for every partition of the data, resulting in several hundred extraneous tables (and more extraneous tables which every data add + new crawl).
I see no place to be able to set something or otherwise prevent this from happening... Does anyone have advice on the best way to prevent these unnecessary tables from being created?

Adding to the excludes
**_SUCCESS
**crc
worked for me (see aws page glue/add-crawler). Double stars match the files at all folder (ie partition) depths. I had an _SUCCESS living a few levels up.
Make sure you set up logging for glue, which quickly points out permission errors etc.

Use the Create a Single Schema for Each Amazon S3 Include Path option to avoid the AWS Glue Crawler adding all these extra tables.
I had this problem and ended up with ~7k tables 😅 so wrote the following script to remove them. It requires jq.
#!/bin/sh
aws glue get-tables --region <YOUR AWS REGION> --database-name <YOUR AWS GLUE DATABASE> | jq '.TableList[] | .Name' | grep <A PATTERN THAT MATCHES YOUR TABLENAMEs> > /tmp/table-names.json
cd /tmp
mkdir table-names
cd table-names
split -l 50 ../table-names.json
for f in `ls`; cat $f | tr '\r\n' ' ' | xargs aws glue batch-delete-table --region <YOUR AWS REGION> --database-name <YOUR AWS GLUE DATABASE> --tables-to-delete;

check if you have empty folders inside. When spark writes to S3, sometimes, the _temporary folder is not deleted, which will make Glue crawler create table for each partition.

I was having the same problem.
I added *crc* as exclude pattern to the AWS Glue crawler and it worked.
Or if you crawl entire directories add */*crc*.

So, my case was a little bit different and I was having the same behaviour.
I got a data structure like this:
├── bucket
│ ├── somedata
│ │ ├── event_date=2016-01-01
│ │ ├── event_date=2016-01-02
So when I started AWS Glue Crawler instead of update the tables, this pipeline was creating a one table per date. After digging into the problem I found that someone added a column as a bug at the json file instead of id was ID. Because my data is parquet the pipeline was working well to store the data and retrieve inside the EMR. But Glue was crashing pretty bad because Glue convert everything to lowercase and probably that was the reason why it was crashing. Removing the uppercase column glue start to work like a charm.

You need to have separate crawlers for each table / file type. So create one crawler that looks at s3://bucket/somedata/ and a 2nd crawler that looks at s3://bucket/moredata/.

Related

Terraform "Unsupported attribute" error when access output variable of sub-module

Gurus!
I'm under developing Terraform modules to provision NAT resource for production and non-production environment. There are two repositories one for Terraform modules another for the live environment for each account (ex: dev, stage, prod..)
I have an problem when access output variable of network/nat module.
It makes me very tired. Please refer below.
for Terraform module (sre-iac-module repo)
❯ tree sre-iac-modules/network/nat/
sre-iac-modules/network/nat/
├── main.tf
├── non_production
│   └── main.tf
├── outputs.tf
├── production
│   ├── main.tf
│   ├── outputs.tf
│   └── variables.tf
└── variables.tf
for live environment (sre-iac-modules repo)
❯ tree sre-iac-modules/network/nat/
sre-iac-modules/network/nat/
├── main.tf
├── non_production
│   └── main.tf
├── outputs.tf
├── production
│   ├── main.tf
│   ├── outputs.tf
│   └── variables.tf
└── variables.tf
In the main code snippet, sre-iac-live/dev/services/wink/network/main.tf
I cannot access output variable named module.wink_nat.eip_ids.
When I run terraform plan or terraform console, always I reached following error.
│ Error: Unsupported attribute
│
│ on ../../../../../sre-iac-modules/network/nat/outputs.tf line 2, in output "eip_ids":
│ 2: value = module.production.eip_ids
│ ├────────────────
│ │ module.production is tuple with 1 element
│
│ This value does not have any attributes.
╵
Here is the ../../../../../sre-iac-modules/network/nat/outputs.tf and main.tf
output "eip_ids" {
value = module.production.eip_ids
# value = ["a", "b", "c"]
}
----
main.tf
module "production" {
source = "./production"
count = var.is_production ? 1 : 0
env = ""
region_id = ""
service_code = ""
target_route_tables = []
target_subnets = var.target_subnets
}
module "non_production" {
source = "./non_production"
count = var.is_production ? 0 : 1
}
However, if I use value = ["a", "b", "c"] then it works well!
I couldn't re what is the problem.
Below is the code snippet of ./sre-iac-modules/network/nat/production/outputs.tf
output "eip_ids" {
value = aws_eip.for_nat[*].id
# value = [aws_eip.nat-gw-eip.*.id]
# value = aws_eip.for_nat.id
# value = ["a", "b", "c"]
}
Below is the code snippet of ./sre-iac-modules/network/nat/production/main.tf
resource "aws_eip" "for_nat" {
count = length(var.target_subnets)
vpc = true
}
And finally, here is the main.tf code snippet. (sre-iac-live/dev/services/wink/network/main.tf)
module "wink_vpc" {
.... skip ....
}
module "wink_nat" {
# Relative path references
source = "../../../../../sre-iac-modules/network/nat"
region_id = "${var.region_id}"
env = "${var.env}"
service_code = "${var.service_code}"
target_subnets = module.wink_vpc.protected_subnet_ids
is_production = true
depends_on = [module.wink_vpc]
}
I'm stuck this issue for one day.
I needs Terraform Guru's help.
Please give me your great advice.
Thank you so much in advance.
Cheers!

Your production module has a count meta attribute. To reference the module you have to use an index, like:
value = module.production[0].eip_ids

How can I reference a value from a different module?

I want to deploy an RDS database to AWS with a secret from AWS Secrets Manager. I have:
├─ environments
│ └─ myenv
│ ├── main.tf
│ ├── locals.tf
│ └── variables.tf
└─ modules
├─ db
│ ├── main.tf
│ └── variables.tf
└─ secrets
└── main.tf
In myenv/main.tf I define a module mydb that has modules/db/main.tf as source where a resource database is defined. Save for the password it all works, I specify values in blocks in myenv and the values "trickle down".
But for the credentials, I don't want to hard code them in myenv of course.
Instead in modules/secrets I define
data "aws_secretsmanager_secret_version" "my_credentials" {
# Fill in the name you gave to your secret
secret_id = "my-secret-id"
}
and with another block:
locals {
decoded_secrets = jsondecode(data.aws_secretsmanager_secret_version.my_credentials.secret_string)
}
I decode the secrcets and now I want to reference them as e.g. local.decoded_secrets.username in myenv/main. That is my interpretation of the tutorials. But it doesn't work: If I put the locals block in myenv it cannot reference data, and when I put it in modules/secrets then myenv cannot reference locals.
How can I combine the values of these two modules in my myenv/main?

Define an output in the secrets module. Define an input in the db module. Pass the output value from secrets to the input property in db.
For example if you defined an output named "password" in secrets and an input named "password" in db, then in your db module declaration you would pass the value like this:
module "secrets" {
source = "../modules/secrets"
}
module "db" {
source = "../modules/db"
password = module.secrets.password
}

You have a few options available to you here to pass the secret to the database module.
The smallest thing you need to do from your existing setup would be to call both modules at the same time and pass an output from the secrets module to the database module like this:
.
├── environments
│   └── myenv
│   ├── locals.tf
│   ├── main.tf
│   └── variables.tf
└── modules
├── db
│   ├── main.tf
│   └── variables.tf
└── secrets
├── main.tf
└── outputs.tf
modules/secrets/outputs.tf
output "secret_id" {
value = aws_secretsmanager_secret_version.secret.secret_id
}
environments/myenv/main.tf
module "secrets" {
source = "../../modules/secrets"
# ...
}
module "db" {
source = "../../modules/db"
# ...
secret_id = module.secrets.secret_id
}
A better approach however might be to have the database module create and manage its own secret and not require the secret to be passed into the database module as a parameter at all. If you wanted to reuse the secrets module with other modules then you could make it a child module of the database module or if this is the only place you currently use the secrets module then unnesting these makes things simpler.
Nesting modules
modules/db/main.tf
module "database_password_secret_id" {
source = "../secrets"
# ...
}
data "aws_secretsmanager_secret_version" "database_password" {
secret_id = module.database_password_secret_id.secret_id
}
Unnesting the modules
.
├── environments
│   └── myenv
│   ├── locals.tf
│   ├── main.tf
│   └── variables.tf
└── modules
└── db
├── main.tf
├── secrets.tf
└── variables.tf
modules/db/secrets.tf
resource "aws_secretsmanager_secret" "database_password" {
name = "database-password"
}
resource "random_password" "database_password" {
length = 32
}
resource "aws_secretsmanager_secret_version" "database_password" {
secret_id = aws_secretsmanager_secret.example.id
secret_string = random_password.database_password.result
}
modules/db/main.tf
resource "aws_db_instance" "database" {
# ...
password = aws_secretsmanager_secret_version.database_password.secret_string
}

How to properly server static files from a Flask server?

What is the proper way of serving static files (images, PDFs, Docs etc) from a flask server?
I have used the send_from_directory method before and it works fine. Here is my implementation:
#app.route('/public/assignments/<path:filename>')
def file(filename):
return send_from_directory("./public/assignments/", filename, as_attachment=True)
However if I have multiple different folders, it can get a bit hectic and repetitive because you are essentially writing the same code but for different file locations - meaning if I wanted to display files for a user instead of an assignment, I'd have to change it to /public/users/<path:filename> instead of /public/assignments/<path:filename>.
The way I thought of solving this is essentially making a /file/<path:filepath> route, where the filepath is the entire path to the destination folder + the file name and extension, instead of just the file name and extension. Then I did some formatting and separated the parent directory from the file itself and used that data when calling the send_from_directory function:
#app.route('/file/<path:filepath>', methods=["GET"])
def general_static_files(filepath):
filepath = filepath.split("/")
_dir = ""
for i, p in enumerate(filepath):
if i < len(filepath) - 1:
_dir = _dir + (p + "/")
return send_from_directory(("./" + _dir), filepath[len(filepath) - 1], as_attachment=True)
if we simulate the following request to this route:
curl http://127.0.0.1:5000/file/public/files/jobs/images/job_43_image_1.jpg
the _dir variable will hold the ./public/files/jobs/images/ value, and then filepath[len(filepath) - 1] holds the job_43_image_1.jpg value.
If i hit this route, I get a 404 - Not Found response, but all the code in the route body is being executed.
I suspect that the send_from_directory function is the reason why I'm getting a 404 - Not Found. However, I do have the image job_43_image_1.jpg stored inside the /public/files/jobs/images/ directory.
I'm afraid I don't see a lot I can do here except hope that someone has encountered the same issue/problem and found a way to fix it.
Here is the folder tree:
├── [2050] app.py
├── [2050] public
│   ├── [2050] etc
│   └── [2050] files
│   ├── [2050] jobs
│   │   ├── [2050] files
│   │   └── [2050] images
│   │   ├── [2050] job_41_image_decline_1.jpg
│   │   ├── [2050] job_41_image_decline_2554.jpg
│   │   ├── [2050] ...
│   ├── [2050] shop
│   └── [2050] videos
└── [2050] server_crash.log
Edit 1: I have set up the static_url_path. I have no reason to believe that that could be the cause of my problem.
Edit 2: Added tree

Pass these arguments when you initialise the app:
app = Flask(__name__, static_folder='public',
static_url_path='frontend_public' )
This would make the file public/blah.txt available at http://example.com/frontend_public/blah.txt.
static_folder sets the folder on the filesystem
static_url_path sets the path used within the URL
If neither of the variables are set, it defaults to 'static' for both.
Hopefully this is what you're asking.

Importing files from other directory in python?

I have following directory structure:
A
|
|--B--Hello.py
|
|--C--Message.py
Now if the path of root directory A is not fixed, how can i import "Hello.py" from B to "Message.py" in C.

At first I suggest to add empty __init__.py file into every directory with python sources. It will prevent many issues with imports because this is how the packages work in Python:
In your case it this should look like this:
A
├── B
│   ├── Hello.py
│   └── __init__.py
├── C
│   ├── Message.py
│   └── __init__.py
└── __init__.py
Let's say the Hello.py contains the function foo:
def foo():
return 'bar'
and the Message.py tries to use it:
from ..B.Hello import foo
print(foo())
The first way to make it work is to let the Python interpreter to do his job and to handle package constructing:
~ $ python -m A.C.Message
Another option is to add your Hello.py file into list of known sources with the following code:
# Message.py file
import sys, os
sys.path.insert(0, os.path.abspath('..'))
from B.Hello import foo
print(foo())
In this case you can execute it with
~/A/C $ python Message.py

Exclude multiple folders using AWS S3 sync

How to exclude multiple folders while using aws s3 syn ?
I tried :
# aws s3 sync s3://inksedge-app-file-storage-bucket-prod-env \
s3://inksedge-app-file-storage-bucket-test-env \
--exclude 'reportTemplate/* orders/* customers/*'
But still it's doing sync for folder "customer"
Output :
copy: s3://inksedge-app-file-storage-bucket-prod-env/customers/116/miniimages/IMG_4800.jpg
to s3://inksedge-app-file-storage-bucket-test-env/customers/116/miniimages/IMG_4800.jpg
copy: s3://inksedge-app-file-storage-bucket-prod-env/customers/116/miniimages/DSC_0358.JPG
to s3://inksedge-app-file-storage-bucket-test-env/customers/116/miniimages/DSC_0358.JPG

At last this worked for me:
aws s3 sync s3://my-bucket s3://my-other-bucket \
--exclude 'customers/*' \
--exclude 'orders/*' \
--exclude 'reportTemplate/*'
Hint: you have to enclose your wildcards and special characters in single or double quotes to work properly. Below are examples of matching characters. for more information regarding S3 commands, check it in amazon here.
*: Matches everything
?: Matches any single character
[sequence]: Matches any character in sequence
[!sequence]: Matches any character not in sequence

For those who are looking for sync some subfolder in a bucket, the exclude filter applies to the files and folders inside the folder that is be syncing, and not the path with respect to the bucket, example:
aws s3 sync s3://bucket1/bootstrap/ s3://bucket2/bootstrap --exclude '*' --include 'css/*'
would sync the folder bootstrap/css but not bootstrap/js neither bootstrap/fonts in the following folder tree:
bootstrap/
├── css/
│ ├── bootstrap.css
│ ├── bootstrap.min.css
│ ├── bootstrap-theme.css
│ └── bootstrap-theme.min.css
├── js/
│ ├── bootstrap.js
│ └── bootstrap.min.js
└── fonts/
├── glyphicons-halflings-regular.eot
├── glyphicons-halflings-regular.svg
├── glyphicons-halflings-regular.ttf
└── glyphicons-halflings-regular.woff
That is, the filter is 'css/*' and not 'bootstrap/css/*'
More in https://docs.aws.amazon.com/cli/latest/reference/s3/index.html#use-of-exclude-and-include-filters

From a Windows command prompt, single quotes ' don't work, only double quotes " work so use " " around wildcards, eg:
aws s3 sync s3://bucket-1/ . --exclude "reportTemplate/*" --exclude "orders/*"
Single quote doesn't work (as tested with the --dryrun option) on Windows 10.

I used a bit of a different way when we have multiple levels of folder structure. Use '**' with --include
Command:
aws s3 sync s3://$SOURCE_BUCKET/dir1/dir2/ s3://$TARGET_BUCKET/dir1/dir2/ --include "**/**'

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

AWS Glue Crawler adding tables for every partition? - amazon-web-services

Adding to the excludes _SUCCESS crc worked for me (see aws page glue/add-crawler). Double stars match the files at all folder (ie partition) depths. I had an _SUCCESS living a few levels up. Make sure you set up logging for glue, which quickly points out permission errors etc.

check if you have empty folders inside. When spark writes to S3, sometimes, the _temporary folder is not deleted, which will make Glue crawler create table for each partition.

I was having the same problem. I added crc as exclude pattern to the AWS Glue crawler and it worked. Or if you crawl entire directories add /crc*.

You need to have separate crawlers for each table / file type. So create one crawler that looks at s3://bucket/somedata/ and a 2nd crawler that looks at s3://bucket/moredata/.

Related

Terraform "Unsupported attribute" error when access output variable of sub-module

How can I reference a value from a different module?

How to properly server static files from a Flask server?

Importing files from other directory in python?

Exclude multiple folders using AWS S3 sync

Categories

Resources

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

AWS Glue Crawler adding tables for every partition? - amazon-web-services

Adding to the excludes **_SUCCESS **crc worked for me (see aws page glue/add-crawler). Double stars match the files at all folder (ie partition) depths. I had an _SUCCESS living a few levels up. Make sure you set up logging for glue, which quickly points out permission errors etc.

check if you have empty folders inside. When spark writes to S3, sometimes, the _temporary folder is not deleted, which will make Glue crawler create table for each partition.

I was having the same problem. I added *crc* as exclude pattern to the AWS Glue crawler and it worked. Or if you crawl entire directories add */*crc*.

You need to have separate crawlers for each table / file type. So create one crawler that looks at s3://bucket/somedata/ and a 2nd crawler that looks at s3://bucket/moredata/.

Related

Terraform "Unsupported attribute" error when access output variable of sub-module

How can I reference a value from a different module?

How to properly server static files from a Flask server?

Importing files from other directory in python?

Exclude multiple folders using AWS S3 sync

Categories

Resources

Adding to the excludes _SUCCESS crc worked for me (see aws page glue/add-crawler). Double stars match the files at all folder (ie partition) depths. I had an _SUCCESS living a few levels up. Make sure you set up logging for glue, which quickly points out permission errors etc.

I was having the same problem. I added crc as exclude pattern to the AWS Glue crawler and it worked. Or if you crawl entire directories add /crc*.