How to copy a file to HDFS in a case-insensitive way - regex

I have to copy certain CSV files of the format ABCDWXYZ.csv (e.g. PERSONDETAILS.csv) to HDFS, and each file must go into an HDFS directory named AbcdWxyz (e.g. PersonDetails).
Now the problem is that I don't have the exact HDFS directory name; I derive it from the CSV file name after trimming it, and then fire the put:
hadoop fs -put $localRootDir/$Dir/*.csv $HDFSRootDir/$Dir
but it throws an error, as there is no directory in HDFS with the all-uppercase name.
How can I copy the file to HDFS? Is there a way to make the Hadoop put command case-insensitive, using a regex or natively?
Or is there a way to convert the string to the required CamelCase?

You should be able to use
hadoop fs -find / -iname $Dir -print
to get the path name with the correct spelling as it exists in HDFS, then feed that back into your copy command.
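For example, a minimal sketch along those lines, using the variable names from the question and assuming exactly one HDFS directory matches $Dir:
# Resolve the directory name case-insensitively, then copy into it.
hdfsDir=$(hadoop fs -find "$HDFSRootDir" -iname "$Dir" -print | head -n 1)
hadoop fs -put "$localRootDir/$Dir"/*.csv "$hdfsDir"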

Related

hdfs dfs -ls path/filename

I have copied a few files to the path, but when I try to run the command hdfs dfs -ls path/filename it returns no file found.
hdfs dfs -ls up to the directory works, but when I use the file name it returns no files found. For one of the files, I copied and pasted the file name using Ambari; after that, the file started being returned when using hdfs dfs -ls path/filename.
What is causing this issue?
When you execute hdfs dfs -ls path/filename, what you are telling HDFS is "show me all the files that are in this directory", and if the end of the path is a file then of course you are not listing anything. You must point to a directory, not a file.
@saravanan it seems like a permission issue if the file only shows up after using Ambari. Make sure the files have the correct ownership and permissions, then confirm with the commands. The ls command will list files and folders as per the documentation.
Here is the full documentation for the ls command:
[root@hdp ~]# hdfs dfs -help ls
-ls [-C] [-d] [-h] [-q] [-R] [-t] [-S] [-r] [-u] [-e] [<path> ...] :
  List the contents that match the specified file pattern. If path is not
  specified, the contents of /user/<currentUser> will be listed. For a directory a
  list of its direct children is returned (unless -d option is specified).
  Directory entries are of the form:
        permissions - userId groupId sizeOfDirectory(in bytes)
        modificationDate(yyyy-MM-dd HH:mm) directoryName
  and file entries are of the form:
        permissions numberOfReplicas userId groupId sizeOfFile(in bytes)
        modificationDate(yyyy-MM-dd HH:mm) fileName
  -C  Display the paths of files and directories only.
  -d  Directories are listed as plain files.
  -h  Formats the sizes of files in a human-readable fashion
      rather than a number of bytes.
  -q  Print ? instead of non-printable characters.
  -R  Recursively list the contents of directories.
  -t  Sort files by modification time (most recent first).
  -S  Sort files by size.
  -r  Reverse the order of the sort.
  -u  Use time of last access instead of modification for
      display and sorting.
  -e  Display the erasure coding policy of files and directories.
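As a quick check with hypothetical paths, listing the parent directory first shows the exact names HDFS has recorded (which is what copy-pasting from Ambari effectively did), so they can be compared against the file name being typed:
hdfs dfs -ls /user/saravanan/data        # hypothetical path: list the directory's children
hdfs dfs -ls -C /user/saravanan/data     # paths only, handy for copy-pasting the exact name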

How to remove all files matching specific file content in HDFS?

By mistake, using NiFi, a huge number of files got generated in an HDFS directory with the content "val3val2val1". I want to remove all files matching this content using an HDFS command. Please advise.
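The HDFS shell has no built-in way to filter by file content, but a hedged sketch of one approach (the directory /nifi/output is hypothetical; run it once with the -rm line commented out to verify the matches) is to list the files, grep each one for the unwanted content, and delete the matches:
# Assumes /nifi/output contains only files, not subdirectories.
hdfs dfs -ls -C /nifi/output | while read -r f; do
  if hdfs dfs -cat "$f" | grep -q 'val3val2val1'; then
    hdfs dfs -rm "$f"
  fi
done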

gsutil cp -R: uploading a folder with Spanish characters in the file name returns an error

I'm trying to upload a collection of folders with files (with different file extensions) to my bucket using gsutil. I'm using the following command:
gsutil -m cp -R -L dir gs://my_bucket
It uploads the documents fine until it encounters a file name ("Opinió ITAE3") that contains ó and other Spanish characters, and then gives me this error:
[Error 2] The system cannot find the file specified: u'C:\Users\anton\Desktop\Test\Test\Opinio\xb4 ITAE3.txt'
CommandException: 1 file/object could not be transferred.
Many of the files are pretty old. When I create a new file with a name like éóá.txt it works fine, but it doesn't work for that old file. It looks like it has something to do with encoding.
What can I do to upload these documents along with others?
As stated in the Cloud Storage documentation for Filename encoding and interoperability problems:
Users with files stored in other encodings (such as Latin 1) must convert those filenames to UTF-8 before attempting to upload the files.
And it suggests:
If you have too many files for that to be practical you can use a tool to convert the old character encoding to UTF-8. One such tool is native2ascii.
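As a hedged sketch of that conversion on a Linux machine, assuming the old names are Latin-1 encoded; convmv is one commonly used renaming tool for this, not necessarily the one the documentation refers to:
# Dry run first: show which file names would be re-encoded from Latin-1 to UTF-8.
convmv -f latin1 -t utf8 -r dir
# Apply the renames, then retry the upload.
convmv -f latin1 -t utf8 -r --notest dir
gsutil -m cp -R dir gs://my_bucket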

Hadoop command to ignore first / last line from input file while copying into HDFS

I have an input file in Linux and it has a header. I cannot modify this file since I have only read-only access to it.
I am able to copy this file successfully from Linux to HDFS using the copyFromLocal command.
But the header should not be present in the HDFS file, and I do not have access to modify the Linux input file to remove it.
Is there any other way to skip/ignore the header while copying the file from Linux to HDFS? Something like copyFromLocal -1 input_file_name hdfs_file_name?
Remove the first line using awk and pipe the rest into HDFS with put:
awk 'NR != 1 {print}' file.txt | hdfs dfs -put - hdfs://nn1/user/cloudera
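If the last line also needs to be dropped (the question title mentions both), a similar hedged variant with sed works, assuming the destination file name (here file_noheader.txt) is chosen explicitly:
# 1d deletes the first line, $d deletes the last; the result is streamed into HDFS.
sed '1d;$d' file.txt | hdfs dfs -put - hdfs://nn1/user/cloudera/file_noheader.txt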

Hadoop Put command for two files

A file named records.txt can be copied from local to HDFS using the command below:
hadoop dfs -put /home/cloudera/localfiles/records.txt /user/cloudera/inputfiles
By using the above command the file records.txt will be copied into HDFS with the same name.
But I want to store two files (records1.txt and demo.txt) in HDFS.
I know that we can use something like below
hadoop dfs -put /home/cloudera/localfiles/records* /user/cloudera/inputfiles
but is there any command that will help me copy two or more files with different names into HDFS?
The put command accepts one or multiple source files, as mentioned here. So try something like:
hadoop dfs -put /home/cloudera/localfiles/records* /home/cloudera/localfiles/demo* /user/cloudera/inputfiles
From the Hadoop shell command usage:
put
Usage: hadoop fs -put <localsrc> ... <dst>
Copy single src, or multiple srcs from local file system to the destination filesystem. Also reads input from stdin and writes to destination filesystem.
hadoop fs -put localfile /user/hadoop/hadoopfile
hadoop fs -put localfile1 localfile2 /user/hadoop/hadoopdir
hadoop fs -put localfile hdfs://nn.example.com/hadoop/hadoopfile
hadoop fs -put - hdfs://nn.example.com/hadoop/hadoopfile
Reads the input from stdin.
Exit Code:
Returns 0 on success and -1 on error.
It can be done using the copyFromLocal command as follows:
hduser@ubuntu:/usr/local/pig$ hadoop dfs -copyFromLocal /home/user/Downloads/records1.txt /home/user/Downloads/demo.txt /user/pig/output
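On newer Hadoop releases the hadoop dfs entry point is deprecated; the same multi-source copy can be written with hdfs dfs (or hadoop fs), for example:
hdfs dfs -put /home/user/Downloads/records1.txt /home/user/Downloads/demo.txt /user/pig/output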