HDFS Erasure Coding File Creation - hdfs

I am looking for a way to create a file (e.g. using copyFromLocal) in Apache HDFS and set the Erasure Coding policy in the process.
According to this page, I can use hdfs ec -setPolicy -path <folder> -policy RS-6-3-1024k to set the policy for a directory and its children. Is there a way to set the policy for a file when I create it, independent of the policy of the parent directory?

As far as I can tell, there is currently no way to do this in the HDFS client v3.0.0 beta.
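A workaround, if you need per-file control, is to stage the upload through a directory that carries the desired policy and then move the file into place: the policy is resolved when the file is created, and a rename should not rewrite the blocks afterwards. A minimal scripted sketch (paths and policy name are illustrative):
import subprocess

def run(cmd):
    # Stop on the first hdfs command that returns a non-zero exit code.
    subprocess.run(cmd, check=True)

staging = "/tmp/ec-staging"   # hypothetical staging directory that carries the EC policy
target = "/data/warehouse"    # hypothetical final location with a different policy

# One-time setup: create the staging directory and give it the desired policy.
run(["hdfs", "dfs", "-mkdir", "-p", staging])
run(["hdfs", "ec", "-setPolicy", "-path", staging, "-policy", "RS-6-3-1024k"])

# Upload into the staging directory so the file is created erasure-coded,
# then move it to its real destination.
run(["hdfs", "dfs", "-copyFromLocal", "local_file.csv", staging])
run(["hdfs", "dfs", "-mv", staging + "/local_file.csv", target + "/local_file.csv"])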


Where is a sensible place to put kube_config.yaml files on MWAA?

The example code in the MWAA docs for connecting MWAA to EKS has the following:
#use a kube_config stored in s3 dags folder for now
kube_config_path = '/usr/local/airflow/dags/kube_config.yaml'
This doesn't make me think that putting the kube_config.yaml file in the dags/ directory is a sensible long-term solution.
But I can't find any mention in the docs of where a sensible place to store this file would be.
Can anyone link me to a reliable source on this? Or make a sensible suggestion?
From KubernetesPodOperator Airflow documentation:
Users can specify a kubeconfig file using the config_file parameter, otherwise the operator will default to ~/.kube/config.
In a local environment, the kube_config.yaml file can be stored in a specific directory reserved for Kubernetes configuration (e.g. .kube, kubeconfig). Reference: KubernetesPodOperator (Airflow).
In the MWAA environment, where DAG files are stored in S3, the kube_config.yaml file can be stored anywhere in the root DAG folder (including any subdirectory in the root DAG folder, e.g. /dags/kube). The location of the file is less important than explicitly excluding it from DAG parsing via the .airflowignore file. Reference: .airflowignore (Airflow).
Example S3 directory layout:
s3://<bucket>/dags/dag_1.py
s3://<bucket>/dags/dag_2.py
s3://<bucket>/dags/kube/kube_config.yaml
s3://<bucket>/dags/operators/operator_1.py
s3://<bucket>/dags/operators/operator_2.py
s3://<bucket>/dags/.airflowignore
Example .airflowignore file:
kube/
operators/
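With that layout, the operator only needs to be pointed at the file explicitly via config_file. A minimal DAG sketch (the image, names and schedule are placeholders, and the provider import path can differ between versions):
from datetime import datetime

from airflow import DAG
# Import path for older cncf.kubernetes provider versions; newer ones expose it under operators.pod.
from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import KubernetesPodOperator

with DAG(dag_id="eks_pod_example", start_date=datetime(2022, 1, 1), schedule_interval=None) as dag:
    run_pod = KubernetesPodOperator(
        task_id="run_pod",
        # kube config stored under the DAGs folder but excluded from parsing by .airflowignore
        config_file="/usr/local/airflow/dags/kube/kube_config.yaml",
        in_cluster=False,
        namespace="default",
        image="amazon/aws-cli:latest",
        cmds=["sh", "-c", "echo hello from the pod"],
        name="mwaa-pod-example",
        get_logs=True,
    )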

Can I programmatically retrieve the directory an EFS Recovery Point was restored to?

I'm trying to restore data in EFS from recovery points managed by AWS Backup. It seems AWS Backup does not support destructive restores and will always restore to a directory in the target EFS file system, even when creating a new one.
I would like to sync the data extracted from such a recovery point to another volume, but right now I can only do this manually, as I need to look up the directory name used by the start-restore-job operation (e.g. aws-backup-restore_2022-05-16T11-01-17-599Z), as stated in the docs:
You can restore those items to either a new or existing file system. Either way, AWS Backup creates a new Amazon EFS directory (aws-backup-restore_datetime) off of the root directory to contain the items.
Looking further through the documentation, I can't find either of:
an option to set the name of the directory used
the directory name returned in any call (either start-restore-job or describe-restore-job)
I have also checked whether the datetime portion of the directory name maps to the creationDate or completionDate of the restore job, but neither matches (completionDate is very close, but it's not the exact same timestamp).
Is there any way for me to do one of these two things? With both of them missing, restoring a file system from a recovery point in an automated fashion is very hard.
Is there any way for me to do one of these two things?
As it stands, no.
However, since we know that the directory will always be in the root, doing find . -type d -name "aws-backup-restore_*" should return the directory name to you. You could also further filter this down based on the year, month, day, hour & minute.
You could have something polling the job status on the machine that has the EFS file system mounted, finding the correct directory and then pushing that to AWS Systems Manager Parameter Store for later retrieval. If restoring to a new file system, this of course becomes more difficult but still doable in an automated fashion.
If you're not mounting this on an EC2 instance, then running a Lambda with the EFS file system mounted, for example, will let you obtain the directory & then push it to Parameter Store for retrieval elsewhere. The Lambda service mounts EFS file systems while the execution environment is being prepared, in other words during the 'cold start', so there is no extra invocation time to pay for & as such this would be the cheapest option.
There's no built-in way via the APIs, however, to obtain the directory or to configure it, so you're stuck there.
It's a failure on AWS's part that they don't return the directory name they use in any way, and that none of the returned metadata (creationDate/completionDate) exactly matches the timestamp used to name the directory.
If you're an enterprise customer, suggest this as a missing feature to your TAM or SA.
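In the meantime, if you go the polling route, a rough sketch of the glue code could look like this (assumptions: the file system is mounted at /mnt/efs, the restore job ID is already known, and the SSM parameter name is arbitrary):
import glob
import time
import boto3

backup = boto3.client("backup")
ssm = boto3.client("ssm")

def wait_for_restore(restore_job_id):
    # Poll AWS Backup until the restore job reaches a terminal state.
    while True:
        job = backup.describe_restore_job(RestoreJobId=restore_job_id)
        if job["Status"] in ("COMPLETED", "FAILED", "ABORTED"):
            return job["Status"]
        time.sleep(30)

def find_restore_dir(mount_point="/mnt/efs"):
    # AWS Backup always creates aws-backup-restore_<datetime> off the root,
    # so the lexicographically newest matching directory is the latest restore.
    candidates = sorted(glob.glob(f"{mount_point}/aws-backup-restore_*"))
    return candidates[-1] if candidates else None

if wait_for_restore("my-restore-job-id") == "COMPLETED":  # placeholder job ID
    restore_dir = find_restore_dir()
    if restore_dir:
        # Publish the directory name so other automation (e.g. a sync job) can pick it up.
        ssm.put_parameter(
            Name="/efs/last-restore-dir",  # arbitrary parameter name
            Value=restore_dir,
            Type="String",
            Overwrite=True,
        )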

How to configure Apache Flume not to rename ingested files with .COMPLETED

We have an AWS S3 bucket in which we receive new CSV files every 10 minutes. The goal is to ingest these files into Hive.
The obvious way for me is to use Apache Flume with the Spooling Directory source, which will keep looking for new files in the landing directory and ingest them into Hive.
We have read-only permissions for the S3 bucket and for the landing directory into which the files are copied, and Flume marks ingested files with a .COMPLETED suffix. So in our case Flume won't be able to mark completed files because of the permission issue.
Now the questions are:
What will happen if Flume is not able to add the suffix to completed files? Will it give an error or will it silently fail? (I am actually testing this, but if anyone has already tried it then I don't have to reinvent the wheel.)
Will Flume be able to ingest files without marking them with .COMPLETED?
Is there any other Big Data tool/technology better suited for this use case?
The Flume Spooling Directory source needs write permission to either rename or delete the processed/read files.
Check the fileSuffix and deletePolicy settings.
If it can't rename/delete the completed files, it can't figure out which files have already been processed.
You might want to write a script that copies from the read-only S3 bucket to a staging folder with write permissions and provide this staging folder as the source to Flume; a sketch follows below.
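A rough sketch of such a script, assuming boto3 and local disk for the staging folder (bucket name, prefix and paths are placeholders); it only downloads objects it hasn't staged before, so Flume's rename of the local copy doesn't cause re-ingestion:
import os
import boto3

BUCKET = "my-landing-bucket"     # placeholder bucket name
PREFIX = "incoming/"             # placeholder key prefix
STAGING = "/data/flume/staging"  # local directory Flume spools (needs write access)

s3 = boto3.client("s3")
os.makedirs(STAGING, exist_ok=True)

def already_staged(filename):
    # Flume may have renamed the local copy to <name>.COMPLETED, so check both names.
    return (os.path.exists(os.path.join(STAGING, filename)) or
            os.path.exists(os.path.join(STAGING, filename + ".COMPLETED")))

paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
    for obj in page.get("Contents", []):
        filename = os.path.basename(obj["Key"])
        if filename and not already_staged(filename):
            s3.download_file(BUCKET, obj["Key"], os.path.join(STAGING, filename))
Run it on a schedule (e.g. cron every few minutes) so new drops land in the staging folder shortly after they appear in S3.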

list S3 folder on EMR

I fail to understand how to simply list the contents of an S3 bucket on EMR during a spark job.
I wanted to do the following
Configuration conf = spark.sparkContext().hadoopConfiguration();
FileSystem s3 = S3FileSystem.get(conf);
List<LocatedFileStatus> list = toList(s3.listFiles(new Path("s3://mybucket"), false))
This always fails with the following error
java.lang.IllegalArgumentException: Wrong FS: s3://*********/, expected: hdfs://**********.eu-central-1.compute.internal:8020
in the hadoopConfiguration fs.defaultFS -> hdfs://**********.eu-central-1.compute.internal:8020
The way I understand it, if I don't use a protocol, i.e. just /myfolder/myfile instead of hdfs://myfolder/myfile, it will default to fs.defaultFS.
But I would expect that if I specify s3://mybucket/, the fs.defaultFS should not matter.
How does one access the directory information? spark.read.parquet("s3://mybucket/*.parquet") works just fine but for this task I need to check the existence of some files and would also like to delete some. I assumed org.apache.hadoop.fs.FileSystem would be the correct tool.
PS: I also don't understand how logging works. If I use deploy-mode cluster (I want to deploy jars from S3, which does not work in client mode), then I can only find my logs in s3://logbucket/j-.../containers/application.../container...0001. There is quite a long delay before those show up in S3. How do I find them via ssh on the master? Or is there some faster/better way to check Spark application logs?
UPDATE: Just found them under /mnt/var/log/hadoop-yarn/containers, however they are owned by yarn:yarn and as the hadoop user I cannot read them. :( Ideas?
In my case I needed to read a parquet file that was generated by prior EMR jobs. I was looking for the list of files for a given S3 prefix, but the nice thing is we don't need to do all that; we can simply do this:
spark.read.parquet(bucket+prefix_directory)
URI.create() should be used to point it to the correct FileSystem.
val fs: FileSystem = FileSystem.get(spark.sparkContext.hadoopConfiguration)
val dirPaths = FileSystem.get(URI.create("<s3-path>"), fs.getConf).listStatus(new Path("<s3-path>"))
I don't think you are picking up the FS right; just use the static FileSystem.get() method, or Path.getFileSystem().
Try something like:
Path p = new Path("s3://bucket/subdir");
FileSystem fs = p.getFileSystem(conf);
FileStatus[] status= fs.listStatus(p);
Regarding logs, the YARN UI should let you get at them via the node managers.
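Going back to the listing question: if you happen to be working from PySpark rather than Java, the same pattern (resolving the FileSystem from the s3:// path instead of taking the default one) looks roughly like this. It goes through Spark's internal _jvm/_jsc accessors, so treat it as a sketch; the bucket, prefix and file names are placeholders:
# Resolve the FileSystem that matches the s3:// URI instead of fs.defaultFS.
jvm = spark.sparkContext._jvm
conf = spark.sparkContext._jsc.hadoopConfiguration()

path = jvm.org.apache.hadoop.fs.Path("s3://mybucket/some/prefix")
fs = path.getFileSystem(conf)

# List, check existence, and delete (non-recursively), as asked in the question.
for status in fs.listStatus(path):
    print(status.getPath().toString())

exists = fs.exists(jvm.org.apache.hadoop.fs.Path("s3://mybucket/some/prefix/part-00000.parquet"))
fs.delete(jvm.org.apache.hadoop.fs.Path("s3://mybucket/some/prefix/_SUCCESS"), False)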

Amazon S3 Error Code 403 Forbidden from EMR cluster

I know that this question may have been asked multiple times, but I tried those solutions and they didn't work out. Therefore, I am asking it in a new thread for a definite solution.
I have created an IAM user with S3 read-only permission (Get and List on all S3 resources), but when I try to access S3 from an EMR cluster using HDFS commands it throws an "Error Code 403 Forbidden" exception for certain folders. People in other posts have answered that it is a permission issue, but I didn't find that to be the right solution, as it is "Forbidden" instead of "Access Denied". This error appears only for certain folders (containing objects) inside a bucket and for certain empty folders. It was observed that if I use the native API calls then it works normally, as follows:
Exception "Forbidden" when using s3a calls:
hdfs dfs -ls s3a://<bucketname>/<folder>
No error when using s3 native calls s3n and s3:
hdfs dfs -ls s3://<bucketname>/<folder>
hdfs dfs -ls s3n://<bucketname>/<folder>
Similar behavior has also been observed for empty folders, and I understand that on S3 only objects are physical files, whereas the rest ("buckets" and "folders") are just placeholders. However, if I create a new empty folder, then s3a calls don't throw this exception.
P.S. - The root IAM access key bypasses this exception.
I'd recommend you file a JIRA on issues.apache.org, HADOOP project, component fs/s3 with the exact hadoop version you are using. Add the stack trace as the first comment, as that's the only way we could begin to work out what is happening.
FWIW, we haven't tested restricted permissions other than simple read-only and R/W; mixing permissions down the path is inevitably going to break things, as the client code expects to be able to HEAD, GET & LIST anything in the bucket.
BTW, the Hadoop S3 clients all mock empty directories by creating 0-byte objects with a "/" suffix, e.g. "folder/"; they then use a HEAD on that to probe for an empty directory. When data is added under an empty dir, the mock parent dir is DELETE-d.
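If you want to see this from the S3 side, a quick boto3 check (bucket and folder names are placeholders) shows the 0-byte marker object; under a restricted IAM policy, this HEAD on the marker, or the LIST under the prefix, is the kind of request that can come back as a 403 to s3a:
import boto3

s3 = boto3.client("s3")

# Inspect the empty-directory marker that the Hadoop S3 clients create.
# Bucket and key are placeholders; note the trailing "/" on the key.
marker = s3.head_object(Bucket="mybucket", Key="some/empty/folder/")
print(marker["ContentLength"])   # 0 for a mock directory marker

# s3a also lists under the prefix to decide whether the "directory" has children.
listing = s3.list_objects_v2(Bucket="mybucket", Prefix="some/empty/folder/", MaxKeys=5)
for obj in listing.get("Contents", []):
    print(obj["Key"], obj["Size"])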