I'm using EMR to move a folder from the local file system to S3 in Spark, using the fs.moveFromLocalFile API. Everything works fine, except that EMRFS creates a 0-byte file named _$folder$ for EVERY folder that is uploaded.
Is there any way to move folders without this dummy file being created for every folder (other than deleting it manually)? Also, why is this dummy file created? I'm currently using the s3:// protocol recommended by the EMR team.
My experience is that the mkdir() call, usually made for local file systems or HDFS, results in an empty S3 object named after the mkdir folder with _$folder$ appended. In S3 there is no concept of an "empty folder", because you cannot have a key (pathname) with a null value (the file).
In a perfect world, mkdir(s3://bucket/path) should be a noop.
Don't know about EMRFS; this sounds like the same extension used by the S3n client. Those files are stripped out in the client when listing/stat-ing paths.
ASF's S3a creates one with a "/" suffix.
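If you end up with these marker objects and want to clean them up after the fact, a minimal sketch with boto3 (the bucket name and prefix here are hypothetical) could look like this:
import boto3

s3 = boto3.client("s3")
bucket = "my-bucket"   # hypothetical bucket name
prefix = "output/"     # hypothetical prefix to clean up

# Page through all keys under the prefix and delete the EMRFS markers.
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
    for obj in page.get("Contents", []):
        if obj["Key"].endswith("_$folder$"):
            s3.delete_object(Bucket=bucket, Key=obj["Key"])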
The example code in the MWAA docs for connecting MWAA to EKS has the following:
#use a kube_config stored in s3 dags folder for now
kube_config_path = '/usr/local/airflow/dags/kube_config.yaml'
This doesn't make me think that putting the kube_config.yaml file in the dags/ directory is a sensible long-term solution.
But I can't find any mention in the docs about where would be a sensible place to store this file.
Can anyone link me to a reliable source on this? Or make a sensible suggestion?
From KubernetesPodOperator Airflow documentation:
Users can specify a kubeconfig file using the config_file parameter, otherwise the operator will default to ~/.kube/config.
In a local environment, the kube_config.yaml file can be stored in a specific directory reserved for Kubernetes (e.g. .kube, kubeconfig). Reference: KubernetesPodOperator (Airflow).
In the MWAA environment, where DAG files are stored in S3, the kube_config.yaml file can be stored anywhere in the root DAG folder (including any subdirectory in the root DAG folder, e.g. /dags/kube). The location of the file is less important than explicitly excluding it from DAG parsing via the .airflowignore file. Reference: .airflowignore (Airflow).
Example S3 directory layout:
s3://<bucket>/dags/dag_1.py
s3://<bucket>/dags/dag_2.py
s3://<bucket>/dags/kube/kube_config.yaml
s3://<bucket>/dags/operators/operator_1.py
s3://<bucket>/dags/operators/operator_2.py
s3://<bucket>/dags/.airflowignore
Example .airflowignore file:
kube/
operators/
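Putting it together, a minimal sketch of a DAG that points the operator at the config file synced from the layout above (the DAG id, image, and task parameters are hypothetical, and the exact import path depends on your Airflow/provider version):
from datetime import datetime

from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import KubernetesPodOperator

with DAG(dag_id="eks_example", start_date=datetime(2022, 1, 1), schedule_interval=None) as dag:
    hello = KubernetesPodOperator(
        task_id="hello",
        name="hello",
        namespace="default",
        image="busybox",  # hypothetical image
        cmds=["echo", "hello"],
        # Point at the kube_config synced from the S3 dags folder.
        config_file="/usr/local/airflow/dags/kube/kube_config.yaml",
    )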
My code is working. The only issue I'm facing is that I cannot specify the folder within the S3 bucket that I would like to place my file in. Here is what I have:
with open("/hadoop/prodtest/tips/ut/s3/test_audit_log.txt", "rb") as f:
    s3.upload_fileobj(f, "us-east-1-tip-s3-stage", "BlueConnect/test_audit_log.txt")
Explanation from #danimal captures pretty much everything. If you want to just create a folder-like object in S3, you can simply specify that folder name and end it with "/", so that when you look at it from the console, it will look like a folder.
It's rather useless: an empty object without a body (consider it a key with a null value), just for eye candy. But if you really want to do it, you can.
1) You can create it in the console interactively, as it gives you that option.
2) You can use the AWS SDK: boto3 has a put_object method for the S3 client, where you specify the key as "your_folder_name/"; see the example below:
import boto3

session = boto3.Session()  # I assume you know how to provide credentials etc.
s3 = session.client('s3', 'us-east-1')
bucket = s3.create_bucket(Bucket='my-test-bucket')
response = s3.put_object(Bucket='my-test-bucket', Key='my_pretty_folder/')  # note the ending "/"
And there you have your bucket, with a folder-like object in it.
Again, when you upload a file as "my_pretty_folder/my_file", what you did there is create a "key" named "my_pretty_folder/my_file" and put the content of your file as its "value".
In this case you have 2 objects in the bucket. The first object has a null body and looks like a folder, while the second one looks like it is inside that folder; but, as #danimal pointed out, in reality you created 2 keys in the same flat hierarchy. It just "looks like" what we are used to seeing in a file system.
If you delete the file, you still have the other object, so on the AWS console it looks like the folder is still there, just with no files inside.
If you skipped creating the folder and simply uploaded the file like you did, you would still see the folder structure in the AWS console, but you would have a single object at that point.
When you list the objects from the command line, however, you would see a single object, and if you delete it on the console it looks like the folder is gone too.
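You can see this flat layout for yourself. A minimal sketch (reusing the bucket and key names from the example above) that prints the raw keys:
import boto3

s3 = boto3.client('s3')

# List the raw keys: there are no folders, just keys that may contain "/".
response = s3.list_objects_v2(Bucket='my-test-bucket')
for obj in response.get('Contents', []):
    print(obj['Key'])
# Prints, e.g.:
#   my_pretty_folder/
#   my_pretty_folder/my_file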
Files ('objects') in S3 are actually stored by their 'Key' (~folders+filename) in a flat structure in a bucket. If you place slashes (/) in your key then S3 represents this to the user as though it is a marker for a folder structure, but those folders don't actually exist in S3, they are just a convenience for the user and allow for the usual folder navigation familiar from most file systems.
So, as your code stands, although it appears you are putting a file called test_audit_log.txt in a folder called BlueConnect, you are actually just placing an object, representing your file, in the us-east-1-tip-s3-stage bucket with a key of BlueConnect/test_audit_log.txt. In order then to (seem to) put it in a new folder, simply make the key whatever the full path to the file should be, for example:
# upload_fileobj(file, bucket, key)
s3.upload_fileobj(f, "us-east-1-tip-s3-stage", "folder1/folder2/test_audit_log.txt")
In this example, the 'key' of the object is folder1/folder2/test_audit_log.txt, which you can think of as the file test_audit_log.txt inside the folder folder2, which in turn is inside the folder folder1 - this is how it will appear on S3, in a folder structure, which will generally be different and separate from your local machine's folder structure.
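Under the hood, the console (and aws s3 ls) synthesizes those folders by listing keys with a delimiter. A minimal sketch of that API call, reusing the bucket and prefix from above:
import boto3

s3 = boto3.client('s3')

# With Delimiter='/', keys sharing a prefix up to the next '/' are grouped
# into CommonPrefixes -- these are the "folders" the console shows you.
response = s3.list_objects_v2(
    Bucket='us-east-1-tip-s3-stage', Prefix='folder1/', Delimiter='/'
)
for cp in response.get('CommonPrefixes', []):
    print(cp['Prefix'])  # e.g. folder1/folder2/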
I have to replicate my local folder structure in an S3 bucket. I am able to do so, but it does not create the folders which are empty. My local folder structure is as follows, and the command used is:
aws-exec s3 sync ./inbound s3://msit.xxwmm.supplychain.relex.eeeeeeeeee/
It only creates inbound/procurement/pending/test.txt; masterdata and transaction are not created, but if I put some file in each directory it will create them.
As answered by #SabeenMalik in this StackOverflow thread:
S3 doesn't have the concept of directories, the whole folder/file.jpg is the file name. If using a GUI tool or something you delete the file.jpg from inside the folder, you will most probably see that the folder is gone too. The visual representation in terms of directories is for user convenience.
You do not need to pre-create the directory structure. Just pretend that the structure is there and everything will be okay.
Amazon S3 will automatically create the structure as objects are written to paths. For example, creating an object called s3://bucketname/inbound/procurement/foo will automatically create the directories.
(This isn't strictly true because Amazon S3 doesn't use directories, but it will appear that the directories are there.)
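If you really do want the empty directories (masterdata, transaction) to show up, one workaround is to create zero-byte marker keys ending in "/" for each empty local directory, since aws s3 sync only uploads files. A minimal sketch, assuming the bucket from the question:
import os

import boto3

s3 = boto3.client("s3")
bucket = "msit.xxwmm.supplychain.relex.eeeeeeeeee"  # bucket from the question

# Walk the local tree and create a "folder/" marker for every empty
# directory, which the console will then render as an empty folder.
for root, dirs, files in os.walk("./inbound"):
    if not dirs and not files:
        key = os.path.relpath(root, ".").replace(os.sep, "/") + "/"
        s3.put_object(Bucket=bucket, Key=key)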
Why does AWS S3 use objects and not files & directories? Is there any specific reason not to have directories/folders in S3?
You are welcome to use directories/folders in Amazon S3. However, please realise that they do not actually exist.
Amazon S3 is not a filesystem. It is an object storage service that is highly scalable, stores trillions of objects and serves millions of objects per second. To meet the demands of such scale, it has been designed as a Key-Value store. The name of the file is the Key and the contents of the file is the Object.
When a file is uploaded to a directory (eg cat.jpg is stored in the images directory), it is actually stored with a filename of images/cat.jpg. This makes it appear to be in the images directory, but the reality is that the directory does not exist -- rather, the name of the object includes the full path.
This will not impact your normal usage of Amazon S3. However, it is not possible to rename a directory because the directory does not exist. Instead, rename the file to rename the directory. For example:
aws s3 mv s3://my-bucket/images/cat.jpg s3://my-bucket/pictures/cat.jpg
This will cause the pictures directory to magically appear, with cat.jpg inside it. There is no need to create the directory first, because it doesn't actually exist. It is only the user interface that makes it appear as though there are directories.
Bottom line: Feel free to use directories, but be aware that they do not actually exist and can't be renamed.
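Since a directory is really just a shared key prefix, "renaming" a whole directory means copying every object under the old prefix to a new prefix and deleting the originals. A minimal boto3 sketch of that, using the bucket and prefixes from the example above:
import boto3

s3 = boto3.client("s3")
bucket = "my-bucket"
old_prefix = "images/"
new_prefix = "pictures/"

# Copy each object to its new key, then delete the original.
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=bucket, Prefix=old_prefix):
    for obj in page.get("Contents", []):
        old_key = obj["Key"]
        new_key = new_prefix + old_key[len(old_prefix):]
        s3.copy_object(Bucket=bucket, Key=new_key,
                       CopySource={"Bucket": bucket, "Key": old_key})
        s3.delete_object(Bucket=bucket, Key=old_key)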
Environment
I copied a file, ./barname.bin, to s3, using the command aws s3 cp ./barname.bin s3://fooname/barname.bin
I have a different file, ./barname.1.bin that I want to upload in place of that file
How can I upload and replace (overwrite) the file at s3://fooname/barname.bin with ./barname.1.bin?
Goals:
Don't change the s3 url used to access the file (new file should also be available at s3://fooname/barname.bin).
zero/minimum 'downtime'/unavailability of the s3 link.
As I understand it, you've got an existing file located at s3://fooname/barname.bin and you want to replace it with a new file. To replace that, you should just upload a new one on top of the old one:
aws s3 cp ./barname.1.bin s3://fooname/barname.bin
The old file will be replaced. According to the S3 docs, this is atomic, though due to S3's replication pattern, requests for the key may still return the old file for some time.
Note (thanks #Chris Kuehl): though the replacement is technically atomic, it's possible for multipart downloads to end up with chunks from different versions of the file. 😬
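For completeness, the same overwrite from code is just an upload to the existing key; a minimal boto3 sketch using the paths from the question:
import boto3

s3 = boto3.client("s3")

# Uploading to the same key overwrites the object in place;
# the URL s3://fooname/barname.bin keeps working throughout.
s3.upload_file("./barname.1.bin", "fooname", "barname.bin")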