AWS Glue Crawler: create a separate table for each folder in S3

My s3 file structure is:
├── bucket
│   ├── customer_1
│   │   ├── year=2016
│   │   ├── year=2017
│   │   │   ├── month=11
│   │   │   │   ├── sometype-2017-11-01.parquet
│   │   │   │   ├── sometype-2017-11-02.parquet
│   │   │   │   ├── ...
│   │   │   └── month=12
│   │   │       ├── sometype-2017-12-01.parquet
│   │   │       ├── sometype-2017-12-02.parquet
│   │   │       ├── ...
│   │   └── year=2018
│   │       └── month=01
│   │           ├── sometype-2018-01-01.parquet
│   │           ├── sometype-2018-01-02.parquet
│   │           ├── ...
│   └── customer_2
│       ├── year=2017
│       │   └── month=11
│       │       ├── moretype-2017-11-01.parquet
│       │       ├── moretype-2017-11-02.parquet
│       │       ├── ...
│       └── year=...
I want to create a separate table for customer_1 and for customer_2 with an AWS Glue crawler. It works if I specify the paths s3://bucket/customer_1 and s3://bucket/customer_2.
I've tried s3://bucket/customer_* and s3://bucket/*, but neither works and no table is created in the Glue catalog.

I faced this issue myself recently. AWS Glue crawlers have a Grouping behavior for S3 data option. If the checkbox is not selected, the crawler will try to combine schemas; by selecting the checkbox you can ensure that multiple, separate tables are created.
The table level is the depth, from the root of the bucket, at which you want separate tables.
In your case the depth would be 2.
See the AWS Glue documentation on crawler configuration options for more.
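If you create the crawler through the API instead of the console, the same setting lives in the crawler's Configuration JSON. A minimal boto3 sketch, where the crawler name, role and database are placeholders (none of them come from the question):

import json
import boto3

glue = boto3.client("glue")

# Table level 2 = one table per folder at depth 2 from the bucket root,
# i.e. one table per s3://bucket/customer_x prefix.
configuration = json.dumps({
    "Version": 1.0,
    "Grouping": {"TableLevelConfiguration": 2},
})

glue.create_crawler(
    Name="customer-crawler",                                # placeholder
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",  # placeholder
    DatabaseName="customers",                               # placeholder
    Targets={"S3Targets": [{"Path": "s3://bucket/"}]},
    Configuration=configuration,
)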

Glue's natural tendency is to add similar schemas (when pointed at the parent folder) to the same table when there is anything over a 70% match (assuming, in your case, customer_1 and customer_2 have the same schema). Keeping them in individual folders might create respective partitions based on the folder names.

Chef::Exceptions::FileNotFound: template[/var/www/html/index.html]

I am new to Chef and I cannot understand what the issue is. The following is my default recipe:
apt_update 'Update the apt cache daily' do
  frequency 86_400
  action :periodic
end

package 'apache2'

service 'apache2' do
  supports status: true
  action [:enable, :start]
end

template '/var/www/html/index.html' do
  source 'index.html.erb'
end
This is the error I am getting:
[2020-04-25T12:57:00+00:00] FATAL: Stacktrace dumped to /home/vagrant/.chef/local-mode-cache/cache/chef-stacktrace.out
[2020-04-25T12:57:00+00:00] FATAL: Please provide the contents of the stacktrace.out file if you file a bug report
[2020-04-25T12:57:00+00:00] FATAL: Chef::Exceptions::FileNotFound: template[/var/www/html/index.html] (learn_chef_apache2::default line 18) had an error: Chef::Exceptions::FileNotFound: Cookbook 'learn_chef_apache2' (0.1.0) does not contain a file at any of these locations:
templates/host-vagrant.vm/index.html.erb
templates/ubuntu-18.04/index.html.erb
templates/ubuntu/index.html.erb
templates/default/index.html.erb
templates/index.html.erb
And this is my cookbooks tree:
cookbooks
├── learn_chef_apache2
│   ├── Berksfile
│   ├── CHANGELOG.md
│   ├── chefignore
│   ├── LICENSE
│   ├── metadata.rb
│   ├── README.md
│   ├── recipes
│   │   └── default.rb
│   ├── spec
│   │   ├── spec_helper.rb
│   │   └── unit
│   │       └── recipes
│   │           └── default_spec.rb
│   └── test
│       └── integration
│           └── default
│               └── default_test.rb
├── learn_chef_appache2
│   └── templates
│       ├── default
│       └── index.html.erb
└── templates
    └── index.html.erb
Can someone please help me understand what I am doing wrong? It would be great if you could share a link or explain it for my understanding.
What I did wrong was that my template was created outside learn_chef_apache2, whereas it should be inside, as follows:
cookbooks
└── learn_chef_apache2
    ├── Berksfile
    ├── CHANGELOG.md
    ├── chefignore
    ├── index.html.erb
    ├── LICENSE
    ├── metadata.rb
    ├── README.md
    ├── recipes
    │   └── default.rb
    ├── spec
    │   ├── spec_helper.rb
    │   └── unit
    │       └── recipes
    │           └── default_spec.rb
    ├── templates
    │   └── index.html.erb
    └── test
        └── integration
            └── default
                └── default_test.rb
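For reference, the template source itself can be minimal once it sits at templates/index.html.erb. A sketch of what it might contain (the content and the node['hostname'] Ohai attribute are illustrative, not from the original post):

<html>
  <body>
    <%# node['hostname'] is populated automatically by Ohai %>
    <h1>Hello from <%= node['hostname'] %></h1>
  </body>
</html>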

Terragrunt: use resources from another environment

I want to use resources, in this case the outputs of the vpc module, in another environment.
The goal is to reduce costs for the customer by putting the stage and dev resources in the same VPC.
Stage and dev have separate ECS clusters, ASGs, launch configurations, different Docker images in ECR, etc., but they should share the same VPC and the same load balancer, with a host-header listener rule to forward to the specific target group.
Both should use the same database and the same load balancer.
The requirement was to have n customers, each with stage, dev and prod environments.
All customer folders should contain the three environments.
My folder structure is:
├── Terraform
│   ├── Customer1
│   ├── Customer2
│   ├── Customer3
│   ├── Customer4
│   ├── Customer5
│   ├── Global
│   │   └── iam
│   │       └── terragrunt.hcl
│   ├── README.md
│   └── Customer6
│       ├── non-prod
│       │   ├── eu-central-1
│       │   │   ├── dev
│       │   │   │   ├── cloudwatch
│       │   │   │   │   └── terragrunt.hcl
│       │   │   │   ├── ec2
│       │   │   │   │   └── terragrunt.hcl
│       │   │   │   ├── ecs
│       │   │   │   │   └── terragrunt.hcl
│       │   │   │   ├── lambda
│       │   │   │   │   └── terragrunt.hcl
│       │   │   │   ├── rds
│       │   │   │   │   └── terragrunt.hcl
│       │   │   │   ├── terragrunt.hcl
│       │   │   │   ├── vars.hcl
│       │   │   │   └── vpc
│       │   │   │       └── terragrunt.hcl
│       │   │   ├── region.hcl
│       │   │   └── stage
│       │   │       ├── cloudwatch
│       │   │       │   └── terragrunt.hcl
│       │   │       ├── ec2
│       │   │       │   └── terragrunt.hcl
│       │   │       ├── ecs
│       │   │       │   └── terragrunt.hcl
│       │   │       ├── lambda
│       │   │       │   └── terragrunt.hcl
│       │   │       ├── rds
│       │   │       │   └── terragrunt.hcl
│       │   │       ├── terragrunt.hcl
│       │   │       ├── vars.hcl
│       │   │       └── vpc
│       │   │           └── terragrunt.hcl
│       │   └── terragrunt.hcl
│       └── prod
│           └── eu-central-1
│               ├── prod
│               │   ├── cloudwatch
│               │   │   └── terragrunt.hcl
│               │   ├── ec2
│               │   │   └── terragrunt.hcl
│               │   ├── ecs
│               │   │   └── terragrunt.hcl
│               │   ├── lambda
│               │   │   └── terragrunt.hcl
│               │   ├── rds
│               │   │   └── terragrunt.hcl
│               │   ├── terragrunt.hcl
│               │   ├── vars.hcl
│               │   └── vpc
│               │       └── terragrunt.hcl
│               └── region.hcl
└── Modules
    ├── cloudwatch
    │   ├── Main.tf
    │   ├── Outputs.tf
    │   └── Variables.tf
    ├── ec2
    │   ├── Main.tf
    │   ├── Outputs.tf
    │   └── Variables.tf
    ├── ecs
    │   ├── Main.tf
    │   ├── Outputs.tf
    │   └── Variables.tf
    ├── iam
    │   ├── Main.tf
    │   ├── Outputs.tf
    │   └── Variables.tf
    ├── lambda
    │   ├── Main.tf
    │   ├── Outputs.tf
    │   └── Variables.tf
    ├── rds
    │   ├── Main.tf
    │   ├── Outputs.tf
    │   └── Variables.tf
    ├── vpc
    │   ├── Main.tf
    │   ├── Outputs.tf
    │   └── Variables.tf
    └── vpc-stage
        ├── Main.tf
        ├── Outputs.tf
        └── Variables.tf
I've read about the terraform_remote_state data source, but that works at the module layer.
For me it's not a good approach to do this at the module layer, because it would only apply to the stage environment.
Is there a way to get the output from the remote state of the dev environment into the terragrunt.hcl in the stage folder, to use it as input for the ec2 module?
I've used

dependency "vpc" {
  config_path = "../vpc"
}

and then

vpc_id = dependency.vpc.outputs.vpc_id

for the input of the ec2 module, but that only works within the same environment.
Best regards.
In the directory structure you've shown above, you have a VPC in both the dev and stage environments. It sounds like you want dev and stage to share a VPC, so the first thing to do is move that VPC directory outside of dev and stage. Put the vpc under eu-central-1; then you can use it as a dependency within both dev and stage, as you desire.
Customer6
└── non-prod
    └── eu-central-1
        ├── dev
        │   └── ecs
        ├── stage
        │   └── ecs
        └── vpc
dependency "vpc" {
config_path = "../../vpc"
}
Refer to the Terragrunt docs on Passing outputs between modules.
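Putting that together, the stage/ec2/terragrunt.hcl could look like the sketch below. The module source path and the vpc_id input name are assumptions based on the tree above, and mock_outputs is optional (it lets terragrunt plan succeed before the VPC has been applied):

terraform {
  # Assumed relative path from stage/ec2 up to the Modules folder.
  source = "../../../../../../Modules//ec2"
}

dependency "vpc" {
  # The shared VPC now lives one level above dev/ and stage/.
  config_path = "../../vpc"

  # Placeholder output used during plan before the VPC exists.
  mock_outputs = {
    vpc_id = "vpc-00000000"
  }
}

inputs = {
  vpc_id = dependency.vpc.outputs.vpc_id
}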

Terraform: How to make sure I run terraform on the expected AWS account

Suppose I want to launch an EC2 instance in my dev account, but it is possible that I accidentally run the wrong command and create temporary credentials for the prod account instead of the dev account. Then, when I run terraform apply, I will launch the EC2 instance in the prod account.
How can I avoid this from happening? Could I create a text file with the dev account ID in this folder, then have Terraform compare the account ID of my temporary credentials with the account ID in this file before launching the EC2 instance, maybe in a null_resource? I cannot figure out how to implement that.
The AWS provider allows you to specify either a list of allowed_account_ids or a list of forbidden_account_ids that you could define to prevent that from happening if necessary.
So you might have a folder structure that looks a little like this:
$ tree -a
.
├── dev
│   ├── bar-app
│   │   ├── dev-eu-west-1.tf -> ../../providers/dev-eu-west-1.tf
│   │   └── main.tf
│   ├── foo-app
│   │   ├── dev-eu-west-1.tf -> ../../providers/dev-eu-west-1.tf
│   │   └── main.tf
│   └── vpc
│       ├── dev-eu-west-1.tf -> ../../providers/dev-eu-west-1.tf
│       └── main.tf
├── prod
│   ├── bar-app
│   │   ├── main.tf
│   │   └── prod-eu-west-1.tf -> ../../providers/prod-eu-west-1.tf
│   ├── foo-app
│   │   ├── main.tf
│   │   └── prod-eu-west-1.tf -> ../../providers/prod-eu-west-1.tf
│   └── vpc
│       ├── main.tf
│       └── prod-eu-west-1.tf -> ../../providers/prod-eu-west-1.tf
├── providers
│   ├── dev-eu-west-1.tf
│   ├── prod-eu-west-1.tf
│   └── test-eu-west-1.tf
└── test
    ├── bar-app
    │   ├── main.tf
    │   └── test-eu-west-1.tf -> ../../providers/test-eu-west-1.tf
    ├── foo-app
    │   ├── main.tf
    │   └── test-eu-west-1.tf -> ../../providers/test-eu-west-1.tf
    └── vpc
        ├── main.tf
        └── test-eu-west-1.tf -> ../../providers/test-eu-west-1.tf
Where your providers/dev-eu-west-1.tf file looks like:
provider "aws" {
region = "eu-west-1"
allowed_account_ids = [
"1234567890",
]
}
And your providers/test-eu-west-1.tf file looks like:
provider "aws" {
region = "eu-west-1"
allowed_account_ids = [
"5678901234",
]
}
This would mean that you could only run Terraform against dev/foo-app when you are using credentials belonging to the 1234567890 account, and only against test/foo-app when you are using credentials belonging to the 5678901234 account.
Store your Terraform state in an S3 bucket in that account. Make sure the buckets are named uniquely (bucket names are globally unique anyway). If you run Terraform against the wrong account, it will error out because the state bucket cannot be found.
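A minimal sketch of that approach, with placeholder bucket and key names:

terraform {
  backend "s3" {
    # This bucket exists only in the dev account, so running terraform init
    # or plan with prod credentials fails because the bucket is not found.
    bucket = "mycompany-dev-terraform-state"
    key    = "foo-app/terraform.tfstate"
    region = "eu-west-1"
  }
}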

Flask project structure for a Grunt-based workflow

I recently purchased an HTML/CSS/JS admin template based on the Bootstrap framework. It basically covered all my needs for an MVP, and my plan was to customize it a bit and then plug in my already developed back end through Flask.
I am quite inexperienced in this field, so I was quite impressed by the automated workflow used by this admin template.
The basic structure is the following:
root/
├── dist/
│   └── html/
│       ├── assets/
│       └── all_pages.html
├── grunt/
│   └── tasks/
├── node_modules/
├── src/
│   ├── assets/
│   ├── html/
│   ├── js/
│   └── sass/
├── Gruntfile.js
└── package.json
Thanks to Grunt tasks and npm management, handling assets is very easy: after an npm install you can handle everything with Grunt.
The Sass sources are compiled to CSS for production, and all code gets minified and copied to the dist folder depending on the settings.
You can easily develop in the src path and use the Grunt "server" task to both watch for changes and display them directly, before sending everything to the production folder "dist".
My problems arise when I try to keep this behavior with a Flask application interacting with it.
My Flask application uses this structure:
root/
├── __init__.py
├── templates/
│   ├── layout.html
│   ├── bp1/
│   │   ├── layout.html
│   │   └── other_pages.html
│   └── bp2/
│       ├── layout.html
│       └── other_pages.html
├── views/
│   ├── __init__.py
│   ├── bp1.py
│   └── bp2.py
├── static/
│   ├── css/
│   ├── js/
│   └── img/
├── Dockerfile
└── requirements.txt
Basically, there is no difference between the development and production versions, and the web app gets deployed through its Docker image.
My question is: how on earth should I approach merging these two? How can I have a Flask project with src/dist separation and a workflow similar to the one described above?
I would like to keep all the good features of the admin template (those I managed to notice with my skills) and have something with:
src and dist folder separation, so that all Sass sources, unused/discarded JS code and HTML pages live only in the development "src" folder and are not used in production;
Grunt automation for compiling Sass, cleaning lib directories, watching for changes, npmcopy (to install packages with npm and move only the required files to production), notifications, minification, etc.;
Docker-image-based deployment that uses only the "dist"-generated resources and ignores the "src" development stuff.
Alright, I came up with a setup that works pretty neatly and that I think is worth sharing for anyone else stuck or doubtful in a similar scenario.
Structure
root/
├── src/
│   ├── __init__.py
│   ├── models.py
│   ├── database.py
│   ├── static/
│   │   ├── css/
│   │   │   └── app.css
│   │   ├── js/
│   │   ├── img/
│   │   └── lib/
│   ├── templates/
│   │   ├── layout.html
│   │   ├── bp1/
│   │   │   ├── layout.html
│   │   │   └── other_pages.html
│   │   └── bp2/
│   │       ├── layout.html
│   │       └── other_pages.html
│   ├── views/
│   │   ├── __init__.py
│   │   ├── bp1.py
│   │   └── bp2.py
│   └── sass/
├── dist/
│   ├── __init__.py
│   ├── models.py
│   ├── database.py
│   ├── static/
│   │   ├── css/
│   │   │   └── app.css
│   │   ├── js/
│   │   ├── img/
│   │   └── lib/
│   ├── templates/
│   │   ├── layout.html
│   │   ├── bp1/
│   │   │   ├── layout.html
│   │   │   └── other_pages.html
│   │   └── bp2/
│   │       ├── layout.html
│   │       └── other_pages.html
│   └── views/
│       ├── __init__.py
│       ├── bp1.py
│       └── bp2.py
├── instance/
│   └── flask.cfg
├── grunt/
│   └── tasks/
├── static/
├── node_modules/
├── venv/
├── Gruntfile.js
├── package.json
├── Dockerfile
├── .gitignore
└── requirements.txt
Workflow
Packages are installed with npm via package.json (node_modules gets generated).
A Python virtualenv is configured using requirements.txt and linked as venv.
A Grunt task uses npmcopy to move only the required files to src/static/lib, which Flask's templates reference as static/lib in order to keep src/dist compatibility (see the sketch after this list).
Another Grunt task compiles the Sass parts and creates app.css within static/css.
Several other Grunt tasks do other useful things like minification.
Grunt's default task concurrently runs a watch task and launches flask run to let development continue smoothly (more on this later).
grunt dist creates, in the dist folder, a production-ready Flask project with all the packages, styles and pages developed in the previous steps.
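As an illustration of the npmcopy step, a sketch of a per-task config file in grunt/tasks (this assumes load-grunt-config-style task files; the library mappings are made-up examples, not the template's actual dependencies):

// grunt/tasks/npmcopy.js: copy only the needed dist files
// from node_modules into src/static/lib.
module.exports = {
  libs: {
    options: {
      destPrefix: 'src/static/lib'
    },
    files: {
      // destination (under destPrefix): source (under node_modules)
      'bootstrap/bootstrap.min.css': 'bootstrap/dist/css/bootstrap.min.css',
      'jquery/jquery.min.js': 'jquery/dist/jquery.min.js'
    }
  }
};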
Grunt's flask task
This simple piece of code launches a Flask server locally to start development.
// Launch flask's server
grunt.registerTask('flask', 'Run flask server.', function() {
  var spawn = require('child_process').spawn;
  grunt.log.writeln('Starting Flask.');
  var PIPE = {
    stdio: 'inherit',
    env: {
      FLASK_APP: './src/__init__.py:create_app()',
      FLASK_ENV: 'development',
      LC_ALL: 'C.UTF-8',
      LANG: 'C.UTF-8'
    }
  };
  // more on venv later
  spawn('venv/bin/flask', ['run'], PIPE);
});
Flask setup for development
In order for the flask run command used above to work in development mode, the following are configured (a sketch of the create_app factory that FLASK_APP points at follows this list):
venv: a symbolic link to the Python virtualenv used for the project.
instance/flask.cfg: the configuration file in Flask's instance folder.
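A minimal sketch of that factory, assuming each views module exposes a Blueprint object with the same name (the blueprint names follow the structure above, and the config filename matches instance/flask.cfg):

# src/__init__.py
from flask import Flask

def create_app():
    # instance_relative_config=True makes from_pyfile look in instance/.
    app = Flask(__name__, instance_relative_config=True)
    app.config.from_pyfile('flask.cfg')

    # Register the blueprints defined in the views package.
    from .views.bp1 import bp1
    from .views.bp2 import bp2
    app.register_blueprint(bp1)
    app.register_blueprint(bp2)

    return app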
Gitignore
Other than the whole dist folder, these are excluded from VCS:
venv;
the instance folder;
the lib folder within src;
node_modules.
Conclusion
This setup is pretty handy and quite easy to share. Simple local configuration lets everything work neatly for development.
Production code can be generated and then deployed/configured quickly depending on the strategy (k8s, server deployments, ...).
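As an example of the last point, a minimal Dockerfile sketch that ships only the dist output (the base image, gunicorn, and the bind address are assumptions, not part of the original setup):

FROM python:3.8-slim
WORKDIR /app

# Only the production-ready dist output goes into the image;
# src/, sass/ and node_modules/ are excluded (e.g. via .dockerignore).
COPY requirements.txt .
RUN pip install -r requirements.txt gunicorn
COPY dist/ ./dist/

# dist/ is a package exposing the create_app() factory used above.
CMD ["gunicorn", "--bind", "0.0.0.0:8000", "dist:create_app()"]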

How to load two webapps at one time with different context names?

How can I load two web contexts programmatically?
Here is the directory structure. What I want is to write Starter.java programmatically, not with XML configuration.
Jetty version: jetty-9.1.5.v20140505
├── java
│   └── com
│       └── embed
│           └── jetty
│               └── server
│                   ├── Starter.java
│                   └── ...
├── resources
│   ├── log4j.properties
│   └── version.properties
└── webapps
    ├── webapps1
    │   ├── WEB-INF
    │   │   ├── classes
    │   │   ├── lib
    │   │   └── web.xml
    │   └── index.jsp
    └── webapps2
        ├── WEB-INF
        │   ├── classes
        │   ├── lib
        │   └── web.xml
        └── index.jsp
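A minimal sketch of what Starter.java could look like for this layout (the port, the /app1 and /app2 context names, and the relative webapps paths are assumptions): each WebAppContext points at one exploded webapp directory and gets its own context name, and a ContextHandlerCollection serves both at once.

package com.embed.jetty.server;

import org.eclipse.jetty.server.Handler;
import org.eclipse.jetty.server.Server;
import org.eclipse.jetty.server.handler.ContextHandlerCollection;
import org.eclipse.jetty.webapp.WebAppContext;

public class Starter {
    public static void main(String[] args) throws Exception {
        Server server = new Server(8080);

        // One WebAppContext per exploded webapp directory.
        WebAppContext app1 = new WebAppContext();
        app1.setContextPath("/app1");
        app1.setWar("webapps/webapps1");

        WebAppContext app2 = new WebAppContext();
        app2.setContextPath("/app2");
        app2.setWar("webapps/webapps2");

        // Serve both contexts from the same server instance.
        ContextHandlerCollection contexts = new ContextHandlerCollection();
        contexts.setHandlers(new Handler[] { app1, app2 });

        server.setHandler(contexts);
        server.start();
        server.join();
    }
}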