How to delete a file in HDFS which is in a different cluster? - hdfs

I am creating a workflow in which I need to delete some HDFS files which are in a different cluster. Below is the command that I have tried:
clusterA-ip ] $ hdfs dfs -rm hdfs://clusterB-ip:8888/path/to/the/file.txt
But this is not working. Could you please help?
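For reference, a fully qualified hdfs:// URI passed to -rm is the usual approach; below is a minimal sketch, assuming the remote NameNode is reachable from clusterA and that its RPC port is the common default 8020 rather than 8888 (check fs.defaultFS in clusterB's core-site.xml, since pointing the hdfs client at a web/UI port will not work). The host name and port here are placeholders.
# clusterB-namenode:8020 is a placeholder for clusterB's NameNode RPC address
$ hdfs dfs -ls hdfs://clusterB-namenode:8020/path/to/the/
$ hdfs dfs -rm hdfs://clusterB-namenode:8020/path/to/the/file.txt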

Related

Getting an error while copying file from aws EC2 instance to hadoop cluster

I am not able to run any hadoop command; even -ls does not work and gives the same error, and I am not able to create a directory using:
hadoop fs -mkdir users
You can create a directory in HDFS using the command:
$ hdfs dfs -mkdir "path"
and then use the command below to copy a file from the local file system to HDFS:
$ hdfs dfs -put /root/Hadoop/sample.txt /"your_hdfs_dir_path"
Alternatively, you can use the command below:
$ hdfs dfs -copyFromLocal /root/Hadoop/sample.txt /"your_hdfs_dir_path"
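For completeness, a quick way to confirm that the copy landed where you expect (using the same placeholder path as above):
$ hdfs dfs -ls /"your_hdfs_dir_path"
$ hdfs dfs -cat /"your_hdfs_dir_path"/sample.txt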

How --py-files works internally in PySpark

I am new to PySpark. I have used --py-files as below in the spark-submit command to copy all files to the worker nodes.
spark-submit --master yarn-client --driver-memory 4g --py-files /home/valli/pyFiles.zip /home/valli/main.py
In the logs I observed that it stores pyFiles.zip in the .sparkStaging directory, like below.
hdfs://cdhstltest/user/valli/.sparkStaging/application_1550968677175_9659/pyFiles.zip
When I copied the above file into my local directory, it was still a zip file and I was unable to read the files inside it. But when I check the current file's directory from within the job, it shows hdfs_directory/pyfiles.zip/module1.py and the .py file executes fine. As far as I know, --py-files copies all the .py files in the zip folder to the worker nodes by automatically unzipping it.
Can anyone please help me understand what is happening behind the scenes?
Thanks in advance.
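The behaviour described above is consistent with how Spark and Python treat zip archives: the zip uploaded to .sparkStaging is shipped to the executors as-is and added to the Python sys.path, and Python's zipimport machinery can import modules directly out of a zip without unpacking it, which is why a path like pyfiles.zip/module1.py shows up. A small local demonstration of the same mechanism (file names here are made up for illustration):
# put a module into a zip and import it straight from the archive
$ zip -j pyFiles.zip module1.py
$ PYTHONPATH=pyFiles.zip python -c "import module1; print(module1.__file__)"
# prints a path ending in pyFiles.zip/module1.py -- the zip is never extracted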

How to put file from local computer to HDFS

Is there any way to push files from a local computer to HDFS?
I have tried sending a GET request to port 50070, but it always shows:
Authentication required.
Please help me! I am quite new to HDFS.
To create a new folder:
hadoop fs -mkdir your-folder-path-on-HDFS
Example:
hadoop fs -mkdir /folder1 /folder2
hadoop fs -mkdir hdfs://<namenode-host>:<port>/folder1
To upload your file:
bin/hdfs dfs -put ~/#youruploadfolderpath# /#yourhadoopfoldername#
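If the goal is to upload over HTTP on port 50070 rather than with the hadoop/hdfs CLI shown above, that has to go through the WebHDFS REST API, which uses a two-step PUT and, on a non-Kerberos cluster, a user.name query parameter. A rough sketch, with host, user, and paths as placeholders:
# step 1: ask the NameNode where to write; it replies with a 307 redirect
$ curl -i -X PUT "http://<namenode-host>:50070/webhdfs/v1/<hdfs-path>?op=CREATE&user.name=<user>"
# step 2: send the file body to the DataNode URL from the Location header
$ curl -i -X PUT -T <local-file> "<location-url-from-step-1>"
On a Kerberos-secured cluster the user.name parameter is not enough; SPNEGO authentication is required instead.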

Clear data from HDFS on AWS EMR in Hadoop 1.0.3

For various reasons I'm running some jobs on EMR with AMI 2.4.11/Hadoop 1.0.3. I'm trying to run a cleanup of HDFS after my jobs by adding an additional EMR step. Using boto:
from boto.emr.step import JarStep

step = JarStep(
    'HDFS cleanup',
    'command-runner.jar',
    action_on_failure='CONTINUE',
    step_args=['hadoop', 'dfs', '-rmr', '-skipTrash', 'hdfs:/tmp'])
emr_conn.add_jobflow_steps(cluster_id, [step])
However, it regularly fails with nothing in stderr in the EMR console.
What confuses me is that if I SSH into the master node and run the command:
hadoop dfs -rmr -skipTrash hdfs:/tmp
it succeeds with exit code 0 and a message that it successfully deleted everything. All the normal hadoop commands seem to work as documented. Does anyone know if there's an obvious reason for this? An issue with the Amazon distribution? Undocumented behavior in certain commands?
Note:
I have other jobs running on Hadoop 2, and the documented:
hdfs dfs -rm -r -skipTrash hdfs:/tmp
works as one would expect both as a step and as a command.
My solution in general was to upgrade everything to Hadoop 2, in which case this works:
JarStep(
    '%s: HDFS cleanup' % self.job_name,
    'command-runner.jar',
    action_on_failure='CONTINUE',
    step_args=['hdfs', 'dfs', '-rm', '-r', '-skipTrash', path]
)
This was the best I could get working with Hadoop 1, and it worked pretty well:
JarStep(
    '%s: HDFS cleanup' % self.job_name,
    'command-runner.jar',
    action_on_failure='CONTINUE',
    step_args=['hadoop', 'fs', '-rmr', '-skipTrash', 'hdfs:/tmp/mrjob']
)

No wildcard support in hdfs dfs put command in Hadoop 2.3.0-cdh5.1.3?

I'm trying to move my daily Apache access log files into a Hive external table by copying the daily log files to the relevant HDFS folder for each month.
I tried to use a wildcard, but it seems that hdfs dfs doesn't support it? (The documentation seems to say that it should.)
Copying individual files works:
$ sudo HADOOP_USER_NAME=myuser hdfs dfs -put "/mnt/prod-old/apache/log/access_log-20150102.bz2" /user/myuser/prod/apache_log/2015/01/
But all of the following throw "No such file or directory":
$ sudo HADOOP_USER_NAME=myuser hdfs dfs -put "/mnt/prod-old/apache/log/access_log-201501*.bz2" /user/myuser/prod/apache_log/2015/01/
put: `/mnt/prod-old/apache/log/access_log-201501*.bz2': No such file or directory
$ sudo HADOOP_USER_NAME=myuser hdfs dfs -put /mnt/prod-old/apache/log/access_log-201501* /user/myuser/prod/apache_log/2015/01/
put: `/mnt/prod-old/apache/log/access_log-201501*': No such file or directory
The environment is Hadoop 2.3.0-cdh5.1.3.
I'm going to answer my own question.
So hdfs dfs -put does work with wildcards; the problem is that the input directory is not a local directory but a mounted SSHFS (FUSE) drive.
It seems that SSHFS is what cannot handle the wildcard characters.
Below is proof that hdfs dfs -put works just fine with wildcards when using the local filesystem instead of the mounted drive:
$ sudo HADOOP_USER_NAME=myuser hdfs dfs -put /tmp/access_log-201501* /user/myuser/prod/apache_log/2015/01/
put: '/user/myuser/prod/apache_log/2015/01/access_log-20150101.bz2': File exists
put: '/user/myuser/prod/apache_log/2015/01/access_log-20150102.bz2': File exists
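For anyone who hits the same symptom, one quick check is to let the shell expand the pattern on its own before involving hdfs dfs; if this also reports "No such file or directory", the glob is not being expanded on the SSHFS mount at all and hdfs dfs never sees real file names:
$ ls /mnt/prod-old/apache/log/access_log-201501*.bz2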