I have created a folder in HDFS, using Spark, with '\' characters in its name. How can I access or delete that folder? I am unable to do either.
I tried creating this path in Spark:
\user\prime\temp\nipun\cddsIdNotinPsdw
and Spark interpreted the \t as a tab and the \n as a newline, so the name stored in HDFS contains those literal characters.
Here is the name of the folder as it shows up in my HDFS listing:
\user\prime emp
ipun\cddsIdNotinPsdw
Now I am unable to delete this path in HDFS.
If this is about Linux: type the part of the name that you can, then press the TAB key to complete the rest. For example, this is my directory:
drwxr-xr-x 2 root root 4096 Jan 12 08:28 lopa \popa
This is how I delete it: rmdir "lopa \popa"/
For your example, the delete would be:
rmdir "\\user\\prime emp ipun\\cddsldNotinPsdw"/
It is impossible to do this from the command line for HDFS. I had to write a Scala program in order to delete it, by getting the file object.
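For reference, here is a minimal sketch of that approach (the parent path /user/prime and the control-character match are assumptions; adjust both to your layout):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object DeleteWeirdHdfsDir {
  def main(args: Array[String]): Unit = {
    val fs = FileSystem.get(new Configuration())
    // Parent of the broken entry; adjust to wherever Spark created it.
    val parent = new Path("/user/prime")
    for (status <- fs.listStatus(parent)) {
      val name = status.getPath.getName
      // Match the entry by its embedded control characters instead of
      // trying to type a tab or newline in a shell.
      if (name.contains("\t") || name.contains("\n")) {
        println(s"Deleting: ${status.getPath}")
        fs.delete(status.getPath, true) // true = recursive
      }
    }
    fs.close()
  }
}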
I have a working script that finds the data file when it is in the same directory as the script. This works both on my local machine and on Google Colab.
When I try it on GCP, though, it cannot find the file. I tried three approaches:
PySpark Notebook:
Upload the .ipynb file, which includes a wget command. This downloads the file without error, but I am unsure where it saves it to, and the script cannot find the file either (I assume because I am telling it that the file is in the same directory, and presumably wget on GCP saves it somewhere else by default).
PySpark with bucket:
I did the same as in the PySpark notebook above, but first I uploaded the dataset to the bucket and then used the two links provided in the file details when you click the file name inside the bucket on the console (neither worked). I would like to avoid this approach anyway, as wget is much faster than downloading over my slow wifi and then re-uploading to the bucket through the console.
GCP SSH:
Create cluster
Access VM through SSH.
Upload .py file using the cog icon
wget the dataset and move both into the same folder
Run the script using python gcp.py
This just gives me a "file not found" error.
Thanks.
As per your first and third approaches: if you are running PySpark code on Dataproc, irrespective of whether you use an .ipynb file or a .py file, please note the points below:
If you use the wget command to download the file, it will be downloaded into the current working directory where your code is executed.
When you try to access the file through PySpark code, it looks in HDFS by default. If you want to access the downloaded file in the current working directory, use the file:/// URI scheme with an absolute file path.
If you want to access the file from HDFS, then you have to move the downloaded file into HDFS first and then access it there using an absolute HDFS file path. Please refer to the example below:
hadoop fs -put <local file_name> </HDFS/path/to/directory>
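As an illustration of the two access modes, here is a minimal Spark sketch (written in Scala; the URIs behave the same from PySpark, and all paths below are made-up examples):

import org.apache.spark.sql.SparkSession

object UriSchemesDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("uri-demo").getOrCreate()

    // Explicit file:/// URI: reads from the local filesystem of the node,
    // e.g. the working directory where wget saved the file.
    val localDf = spark.read.csv("file:///home/user/dataset.csv")

    // No scheme: the path is resolved against HDFS by default, so the
    // file must first be moved there with `hadoop fs -put`.
    val hdfsDf = spark.read.csv("/user/myuser/dataset.csv")

    spark.stop()
  }
}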
I've been trying to use the AWS CLI to download all files from a sub-folder in AWS; however, after the first few files download, it fails to download the rest. I believe this is because it adds an extension to the filename, which it then sees as an invalid file path.
I'm using the following command:
aws s3 cp s3://my_bucket/sub_folder /tmp/ --recursive
It gives me the following error for almost all of the files in the sub-folder:
[Errno 22] Invalid argument: 'C:\\tmp\\2019-08-15T16:15:02.tif.deDBF2C2
I think this is because of the .deDBF2C2 extension it seems to add to the files when downloading, though I don't know why it does. The filenames all end with .tif in the actual bucket.
Does anyone know what causes this?
Update: the command worked once I executed it from a Linux machine. It seems to be specific to Windows.
This is an oversight by AWS: it uses Windows-reserved characters in log file names! When you execute the command it will create all the directories; however, any logs with :: in the name fail to download. (The .deDBF2C2-style suffix is just a temporary name the CLI uses while downloading; the real problem is the ':' characters in the filename, which Windows does not allow.)
Issue is discussed here: https://github.com/aws/aws-cli/issues/4543
Frustrated, I came up with a workaround: execute a "dry run", which prints the expected log output, and redirect that to a text file, e.g.:
>aws s3 cp s3://config-bucket-7XXXXXXXXXXX3 c:\temp --recursive --dryrun > c:\temp\aScriptToDownloadFilesAndReplaceNames.txt
The output file is filled with these AWS log entries, which we can turn into aws script commands:
(dryrun) download: s3://config-bucket-7XXXXXXXXXXX3/AWSLogs/7XXXXXXXXXXX3/Config/ap-southeast-2/2019/10/1/ConfigHistory/7XXXXXXXXXXX3_Config_ap-southeast-2_ConfigHistory_AWS::RDS::DBInstance_20191001T103223Z_20191001T103223Z_1.json.gz to \AWSLogs\7XXXXXXXXXXX3\Config\ap-southeast-2\2019\10\1\ConfigHistory\703014955993_Config_ap-southeast-2_ConfigHistory_AWS::RDS::DBInstance_20191001T103223Z_20191001T103223Z_1.json.gz
In Notepad++ or another text editor, replace the (dryrun) download: with aws s3 cp.
You will then see lines containing the command aws s3 cp, the bucket file, and the local file path. We need to remove the :: in the local file path, on the right side of the to:
aws s3 cp s3://config-bucket-7XXXXXXXXXXX3/AWSLogs/7XXXXXXXXXXX3/Config/ap-southeast-2/2019/10/1/ConfigHistory/7XXXXXXXXXXX3_Config_ap-southeast-2_ConfigHistory_AWS::RDS::DBInstance_20191001T103223Z_20191001T103223Z_1.json.gz to AWSLogs\7XXXXXXXXXXX3\Config\ap-southeast-2\2019\10\1\ConfigHistory\7XXXXXXXXXXX3_Config_ap-southeast-2_ConfigHistory_AWS::RDS::DBInstance_20191001T103223Z_20191001T103223Z_1.json.gz
We can replace the :: with - in the local paths only (not the S3 bucket paths) by using the regex (.*)::. Because (.*) is greedy, the match always lands on the last occurrence of :: on each line, which is in the local path:
Here I've replaced the ::'s with hyphens using the replacement $1-, clicking Replace All twice (once per :: in the local path):
Next, remove the to between the two paths (replace it with nothing):
FIND: json.gz to AWSLogs
REPLACE: json.gz AWSLogs
Finally, select all the lines and copy/paste them into a command prompt to download all the files whose names contain reserved characters!
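The same edits can also be scripted in one pass, for example with sed from WSL or Git Bash (a rough sketch of the steps above, not tested against real output; adjust the patterns to your file):

# replace the dryrun prefix, strip every :: after " to " (i.e. in the
# local path only), then drop the " to " separator, mirroring the
# FIND/REPLACE step above
sed -e 's/(dryrun) download:/aws s3 cp/' \
    -e :a -e 's/\( to .*\)::/\1-/' -e ta \
    -e 's/\.gz to /.gz /' \
    aScriptToDownloadFilesAndReplaceNames.txt > downloadCommands.txt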
UPDATE:
If you have WSL (Windows Subsystem for Linux), you should be able to download the files and then do a simple file rename, replacing the ::'s, before copying to the mounted Windows filesystem.
I tried this from my Raspberry Pi and it worked. It seems to be an issue only on Windows.
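A minimal sketch of that WSL/Linux route (the bucket name is reused from above; the target folders are assumptions):

# download inside WSL, where ':' is a legal filename character
aws s3 cp s3://config-bucket-7XXXXXXXXXXX3 /tmp/logs --recursive
# rename the Windows-reserved '::' to '-' before copying over
find /tmp/logs -depth -name '*::*' | while read -r f; do
  mv "$f" "${f//::/-}"
done
# now the names are safe for the mounted Windows drive
cp -r /tmp/logs /mnt/c/temp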
I am getting a permission denied error while taking a backup using pg_basebackup.
/usr/pgsql-11/bin/pg_basebackup -h127.0.0.1 -U thbbackup -D backup -Ft -z -P
Password:
238546/238575 kB (99%), 1/1 tablespace
pg_basebackup: could not get write-ahead log end position from server: ERROR: could not open file "./.postgresql.conf.swp": Permission denied
pg_basebackup: removing data directory "backup"
You have probably forgotten that you still have the file postgresql.conf open in a text editor (vim). If you open the conf file again, the editor should complain that it is already open. You can simply delete the .swp file; it is a temporary file anyway.
"When you edit a file in Vim, you have probably noticed the (temporary) .swp file that gets created. By default it'll be in the same location as the file that you are editing (although you can change this). The swap file contains the info about changes made to the file (or buffer)."
In this case it looks like a swap file from an open editor or previously orphaned. In general, Postgres needs ownership of all files in the data directory for a pg_basebackup. I have seen this failure on files with root:root or other ownership residing in the data directory. After running chown postgres:postgres [filename] on the target files, pg_basebackup should be able to run successfully.
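A quick way to spot such files (a sketch; the data directory path is an assumption for a PostgreSQL 11 install, so check SHOW data_directory; for yours):

# list anything in the data directory not owned by postgres
find /var/lib/pgsql/11/data ! -user postgres -ls
# then either delete the stray swap file...
rm /var/lib/pgsql/11/data/.postgresql.conf.swp
# ...or hand its ownership to postgres
chown postgres:postgres /var/lib/pgsql/11/data/.postgresql.conf.swp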
I am trying to upload a file from my laptop to an EC2 instance.
I am trying with this:
$ scp -i ec2_instance_key.pem ~/WebstormProjects/RESTAPI/config.js ec2-user@ec2-xxx.eu-west-x.compute.amazonaws.com:~/data/
When I launch the command, the terminal responds with:
scp: /home/ec2-user/data/: Is a directory
But when I run this in the EC2 terminal:
$ cd /home/ec2-user/data
it responds with "no such file or directory".
And scp copies the file onto my laptop again, in the same path as ec2_instance_key.pem.
What is the problem?
Thank you very much.
The probable cause is:
There is no directory named data in the home directory of ec2-user,
OR a file named data was mistakenly created in the home directory while doing the scp.
Solution:
Check whether a file named data exists in the home directory of ec2-user.
Move the file ~/data to some other name (if it exists).
Create a directory named data in the home directory of ec2-user.
Give proper access permissions to the newly created data directory (chmod 755 ~/data; a directory needs the execute bit to be entered).
Try uploading the file again using the following command:
scp -i ec2_instance_key.pem ~/WebstormProjects/RESTAPI/config.js ec2-user@ec2-xxx.eu-west-x.compute.amazonaws.com:~/data
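Put together, the fix might look like this (a sketch that reuses the placeholder host from the question):

# on the EC2 instance: make sure ~/data is a directory, not a stray file
ssh -i ec2_instance_key.pem ec2-user@ec2-xxx.eu-west-x.compute.amazonaws.com \
  '[ -f ~/data ] && mv ~/data ~/data.bak; mkdir -p ~/data && chmod 755 ~/data'
# back on the laptop: retry the upload
scp -i ec2_instance_key.pem ~/WebstormProjects/RESTAPI/config.js \
  ec2-user@ec2-xxx.eu-west-x.compute.amazonaws.com:~/data/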
I'm trying to move my daily Apache access log files into a Hive external table by copying the daily log files to the relevant HDFS folder for each month.
I tried to use a wildcard, but it seems that hdfs dfs doesn't support it (the documentation seems to say that it should).
Copying individual files works:
$ sudo HADOOP_USER_NAME=myuser hdfs dfs -put "/mnt/prod-old/apache/log/access_log-20150102.bz2" /user/myuser/prod/apache_log/2015/01/
But all of the following throw "No such file or directory":
$ sudo HADOOP_USER_NAME=myuser hdfs dfs -put "/mnt/prod-old/apache/log/access_log-201501*.bz2" /user/myuser/prod/apache_log/2015/01/
put: `/mnt/prod-old/apache/log/access_log-201501*.bz2': No such file or directory
$ sudo HADOOP_USER_NAME=myuser hdfs dfs -put /mnt/prod-old/apache/log/access_log-201501* /user/myuser/prod/apache_log/2015/01/
put: `/mnt/prod-old/apache/log/access_log-201501*': No such file or directory
The environment is Hadoop 2.3.0-cdh5.1.3.
I'm going to answer my own question.
It turns out hdfs dfs -put does work with wildcards; the problem is that the input directory is not a local directory but a mounted SSHFS (FUSE) drive.
It seems that SSHFS is what cannot handle the wildcard characters.
Below is proof that hdfs dfs -put works just fine with wildcards when using the local filesystem instead of the mounted drive:
$ sudo HADOOP_USER_NAME=myuser hdfs dfs -put /tmp/access_log-201501* /user/myuser/prod/apache_log/2015/01/
put: '/user/myuser/prod/apache_log/2015/01/access_log-20150101.bz2': File exists
put: '/user/myuser/prod/apache_log/2015/01/access_log-20150102.bz2': File exists
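A practical workaround, then, is to stage the files on a real local filesystem first and glob from there (a sketch; whether the plain cp glob works over your particular SSHFS mount may vary):

# stage the month's logs on local disk, then put them into HDFS with a glob
cp /mnt/prod-old/apache/log/access_log-201501*.bz2 /tmp/
sudo HADOOP_USER_NAME=myuser hdfs dfs -put /tmp/access_log-201501*.bz2 /user/myuser/prod/apache_log/2015/01/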