How to Create Subdirectories for S3 Bucket with CDK

If I wanted to create a bucket with a layout like this:
bucket/
├─ subdir-1/
│ ├─ subsubdir-1/
├─ subdir-2/
├─ subdir-3/
how could I do this using the cdk?
I know that you can just upload a file with the requisite prefix, since subdirectories don't really do anything because S3 isn't really a file system, but I have a use case where Spark expects a subdirectory to exist for some reason.
And if you have to create a file in the directory, that is a really poor solution, because you lose the ability to configure your S3 bucket within the CDK (things like versioning, VPC access, replication controls, etc.).

Since folders don't really exist in S3, and objects only have a 'prefix' - which by convention results in an apparent directory structure - take a look at the BucketDeployment construct and upload zero-byte placeholder files named like subdir-1/subsubdir-1/placeholder that Spark will ignore.
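For illustration, here is a minimal sketch of that approach in CDK v2 (Python), assuming the aws_s3_deployment module's Source.data helper is available; the construct IDs, key names, and bucket settings are just placeholders:

from aws_cdk import Stack, aws_s3 as s3, aws_s3_deployment as s3deploy
from constructs import Construct

class BucketWithPrefixesStack(Stack):
    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        # The bucket itself stays fully configurable in the CDK (versioning, replication, etc.).
        bucket = s3.Bucket(self, "Bucket", versioned=True)

        # Zero-byte placeholder objects make the "subdirectories" appear.
        s3deploy.BucketDeployment(
            self, "Placeholders",
            destination_bucket=bucket,
            sources=[
                s3deploy.Source.data("subdir-1/subsubdir-1/placeholder", ""),
                s3deploy.Source.data("subdir-2/placeholder", ""),
                s3deploy.Source.data("subdir-3/placeholder", ""),
            ],
        )

If Source.data turns out not to accept empty content, pointing Source.asset at a local directory of empty placeholder files achieves the same result.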

Related

Terraform: can we create a "data source" in a separate file like locals or variables?

I want to separate data sources from the main code and use them in a separate file, similar to local.tf or variables.tf; however, even in the docs there is no reference to this.
Use case:
I am trying to create access logging for an S3 bucket. The target bucket is not managed by Terraform, so I want to make sure that it exists before using it via a data source:
resource "aws_s3_bucket" "artifact" {
bucket = "jatin-123"
}
data "aws_s3_bucket" "selected" {
bucket = "bucket.test.com"
}
resource "aws_s3_bucket_logging" "artifacts_server_access_logs" {
for_each = local.env
bucket = data.aws_s3_bucket.selected.id
target_bucket = local.s3_artifact_access_logs_bucket_name
target_prefix = "${aws_s3_bucket.artifact[each.key].id}/"
}
Yes, you can have data sources in whatever file you want.
Terraform basically does not care about file layout or file names; it just lumps all the .tf files in the same directory into one big blob.
Yes, of course you can. For organization purposes, you SHOULD use different files. When you have a simple project it's easy to check your code or even troubleshoot within a single file, but once you start to deploy more infrastructure it will become a nightmare. So my advice is to start your "small" projects by splitting the code across different files.
Here is my suggestion for you, regarding your example:
base.auto.tfvars
Here you can put variables that will be used throughout the project. E.g.:
region = us-east-1
project = web-appliance
s3.auto.tfvars
Variables that you will use in your s3 bucket
s3.tf
The code for S3 creation
datasource.tf
Here you will put all the data sources that you need in your project.
provider.tf
The configuration for your provider(s). In your example, aws provider
versions.tf
The versions of your providers
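Putting that suggestion together, the project layout would look something like this (the directory name is just a placeholder):
project/
├─ base.auto.tfvars
├─ s3.auto.tfvars
├─ s3.tf
├─ datasource.tf
├─ provider.tf
└─ versions.tf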

Is it feasible to maintain directory structure when backing up to AWS S3 Glacier classes?

I am trying to back up 2 TB from a shared drive on a Windows Server to S3 Glacier.
There are maybe 100 folders (some may be nested) and perhaps 5000 files (some small, like spreadsheets and photos, and others larger, like server images). My first question is: what counts as an object here?
Let's say I have Folder 1, which has 10 folders inside it. Each of the 10 folders has 100 files.
Would the number of objects be 1 folder + (10 folders * 100 files) = 1001 objects?
I am trying to understand how folder nesting is treated in S3. Do I have to manually create each folder as a prefix and then upload each file inside that using AWS CLI? I am trying to recreate the shared drive experience on the cloud where I can browse the folders and download the files I need.
Amazon S3 does not actually support folders. It might look like it does, but it actually doesn't.
For example, you could upload an object to invoices/january.txt and the invoices directory will just magically 'appear'. Then, if you deleted that object, the invoices folder would magically 'disappear' (because it never actually existed).
So, feel free to upload objects to any location without creating the directories first.
However, if you click the Create folder button in the Amazon S3 management console, it will create a zero-length object with the name of the directory. This will make the directory 'appear' and it would be counted as an object.
The easiest way to copy the files from your Windows computer to an Amazon S3 bucket would be:
aws s3 sync directoryname s3://bucket-name/ --storage-class DEEP_ARCHIVE
It will upload all files, including files in subdirectories. It will not create the folders, since they aren't necessary. However, the folders will still 'appear' in S3.
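If you want to check what actually counts as an object after the upload, a small boto3 sketch like the following lists everything under a prefix (the bucket and prefix names below are placeholders):

import boto3

s3 = boto3.resource("s3")
bucket = s3.Bucket("my-backup-bucket")  # placeholder bucket name

# Only real files come back; folders that were never explicitly created are not objects.
count = sum(1 for _ in bucket.objects.filter(Prefix="Folder 1/"))
print("Objects under 'Folder 1/':", count)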

How to delete empty sub folders from s3 buckets?

I'm performing 2 loops in my code. Below is the pseudo code.
get the subfolder list in the bucket
loop through every subfolder
    loop through every object in the subfolder
        read the object
        delete the object (so at the end the subfolder will be empty)
get the subfolder list again (assuming the empty subfolders will also be deleted, and new subfolders can be created by someone)
But as a result I'm getting an infinite loop, since the subfolders are still there in the bucket. So I was looking for a way to delete a subfolder after deleting every object in it, but I couldn't find a solution. Please suggest your ideas.
Below is the s3 folder structure
├── bucket
└── folder-1
└── folder-2
└── folder-3
Folders do not actually exist in Amazon S3. Instead, the Key (filename) of each object consists of the full path of the object.
Therefore, you merely need to loop through each object for a given Prefix and delete the objects.
Here's an example in Python:
import boto3

s3_resource = boto3.resource('s3')

# Delete every object whose Key starts with the given prefix
for obj in s3_resource.Bucket('bucket-name').objects.filter(Prefix='folder-name/'):
    print('Deleting:', obj.key)
    obj.delete()
This will also delete objects within subfolders, since they have the same Prefix (and folders do not actually exist).
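As a side note, boto3 collections also support a bulk delete, so the same cleanup can be written without an explicit loop (same placeholder bucket and prefix names as above):

import boto3

s3_resource = boto3.resource('s3')

# delete() batches the underlying DeleteObjects calls (up to 1000 keys per request).
s3_resource.Bucket('bucket-name').objects.filter(Prefix='folder-name/').delete()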

AWS Glue does not detect partitions and creates 1000+ tables in catalog

I am using AWS Glue to create metadata tables.
AWS Glue Crawler data store path: s3://bucket-name/
The bucket structure in S3 is like this:
├── bucket-name
│   ├── pt=2011-10-11-01
│   │   ├── file1
│   │   ├── file2
│   ├── pt=2011-10-11-02
│   │   ├── file1
│   ├── pt=2011-10-10-01
│   │   ├── file1
│   ├── pt=2011-10-11-10
│   │   ├── file1
For this, the AWS Glue crawler creates 4 tables.
My question is: why does the AWS Glue crawler not detect the partitions?
To force Glue to merge multiple schemas together, make sure this option is checked when creating the crawler:
Create a single schema for each S3 path.
Here's a detailed explanation, quoting directly from the AWS documentation:
By default, when a crawler defines tables for data stored in Amazon S3, it considers both data compatibility and schema similarity. Data compatibility factors taken into account include whether the data is of the same format (for example, JSON), the same compression type (for example, GZIP), the structure of the Amazon S3 path, and other data attributes. Schema similarity is a measure of how closely the schemas of separate Amazon S3 objects are similar.
You can configure a crawler to CombineCompatibleSchemas into a common table definition when possible. With this option, the crawler still considers data compatibility, but ignores the similarity of the specific schemas when evaluating Amazon S3 objects in the specified include path.
If you are configuring the crawler on the console, to combine schemas, select the crawler option Create a single schema for each S3 path.
You need to crawl a parent folder with all the partitions under it; otherwise the crawler will treat each partition as a separate table. So, for example, create a structure such as:
s3://bucket/table/part=1
s3://bucket/table/part=2
s3://bucket/table/part=3
then crawl s3://bucket/table/
The answer is:
Before merging schemas, the AWS Glue crawler first computes a similarity index for the schemas. If the similarity index is more than 70%, it merges them; otherwise it creates a new table.
There are two things I needed to do to get AWS Glue to avoid creating extraneous tables. This was tested with boto3 1.17.46.
Firstly, ensure an S3 object structure such as this:
s3://mybucket/myprefix/mytable1/<nested_partition>/<name>.xyz
s3://mybucket/myprefix/mytable2/<nested_partition>/<name>.xyz
s3://mybucket/myprefix/mytable3/<nested_partition>/<name>.xyz
Secondly, if using boto3, create the crawler with the arguments:
targets = [{"Path": f"s3://mybucket/myprefix/mytable{i}/"} for i in (1, 2, 3)]
config = {"Version": 1.0, "Grouping": {"TableGroupingPolicy": "CombineCompatibleSchemas"}}
boto3.client("glue").create_crawler(Targets={"S3Targets": targets}, Configuration=json.dumps(config))
Via Targets, each table's path is provided to the crawler as a list.
Via Configuration, all files under each provided path are merged into a single schema.
If using something other than boto3, it should be straightforward to provide the same arguments in a similar way.
Try using a table path like s3://bucket-name/<table_name>/pt=<date_time>/file.
If, after that, the crawler still treats every partition as a separate table, try creating the table manually and re-running the crawler to bring in the partitions.

Replicate local directory in S3 bucket

I have to replicate my local folder structure in an S3 bucket. I am able to do so, but it is not creating the folders which are empty. My local folder structure is as follows, and the command used is:
aws-exec s3 sync ./inbound s3://msit.xxwmm.supplychain.relex.eeeeeeeeee/
It's only creating inbound/procurement/pending/test.txt; masterdata and transaction are not created, but if I put some file in each directory it will create them.
As answered by @SabeenMalik in this Stack Overflow thread:
S3 doesn't have the concept of directories; the whole folder/file.jpg is the file name. If using a GUI tool or something and you delete the file.jpg from inside the folder, you will most probably see that the folder is gone too. The visual representation in terms of directories is for user convenience.
You do not need to pre-create the directory structure. Just pretend that the structure is there and everything will be okay.
Amazon S3 will automatically create the structure as objects are written to paths. For example, creating an object called s3://bucketname/inbound/procurement/foo will automatically create the directories.
(This isn't strictly true because Amazon S3 doesn't use directories, but it will appear that the directories are there.)
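If the empty directories really do need to show up in the bucket, one option is to create zero-byte objects whose keys end in /, which is what the console's Create folder button does. A minimal boto3 sketch, where the bucket name is taken from the question and the key names are assumptions:

import boto3

s3 = boto3.client("s3")
bucket = "msit.xxwmm.supplychain.relex.eeeeeeeeee"  # bucket name from the question

# Zero-byte objects with a trailing "/" appear as empty folders in the S3 console.
for key in ("masterdata/", "transaction/"):  # assumed prefixes for the empty directories
    s3.put_object(Bucket=bucket, Key=key, Body=b"")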