Cannot use regular expression in hadoop in Linux command line - hdfs

I have a folder that contains a large number of subfolders whose names are dates from 2018. In my HDFS I have created a folder of just the December dates (formatted 2018-12-) and I need to delete specifically days 21 - 25. I copied this folder from my HDFS to my Docker container and used the command
rm -r *[21-25]
in that folder, and it worked as expected. But when I run the same command adapted to HDFS,
hdfs dfs -rm -r /home/cloudera/logs/2018-Dec/*[21-25]
it gives me the error
rm: `/home/cloudera/logs/2018-Dec/*[21-25]': No such file or directory
If you need something to be explained in more detail, leave a comment. I'm brand new to all of this and don't fully understand how to describe some of these things.

I figured it out with the help of @Barmer. I was referring to my local system's base directory instead of the HDFS path, and I also had to change the pattern to 2[1-5], since [21-25] is a bracket expression that matches a single character rather than the numbers 21 through 25. So the command ended up being hdfs dfs -rm -r /user/cloudera/logs/2018-Dec/*2[1-5].
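If anyone hits the same thing, it can be worth checking what the glob matches before deleting anything. A minimal sketch, assuming the same /user/cloudera/logs/2018-Dec layout as above:
hdfs dfs -ls /user/cloudera/logs/2018-Dec/*2[1-5]     # preview which date folders the pattern expands to
hdfs dfs -rm -r /user/cloudera/logs/2018-Dec/*2[1-5]  # then remove them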

Related

Is there a way of changing text file names in a folder using C++?

I am working with thousands of txt files in my project. Each txt file contains 'csv' information. The problem is that each txt file has a random name, so I cannot write code to load them. I therefore want to rename them following a particular pattern to make loading the files easier. I will use C++ to accomplish this task.
I put all the txt files in a folder, but I cannot see a way of renaming them using C++. How can I do this? Is there a way to do it? Can someone help me?
You can use std::filesystem::directory_iterator and std::filesystem::rename (C++17), as documented here.
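A minimal sketch of that approach, assuming the files sit in a directory named test (a placeholder) and that the numbering order does not matter, mirroring the bash answer below:
#include <filesystem>
#include <string>
#include <vector>

namespace fs = std::filesystem;

int main() {
    const fs::path dir = "test";  // placeholder folder holding the txt files

    // Snapshot the listing first; renaming entries while a directory_iterator
    // is still walking the directory gives unspecified results.
    std::vector<fs::path> files;
    for (const auto& entry : fs::directory_iterator(dir)) {
        if (entry.is_regular_file()) {
            files.push_back(entry.path());
        }
    }

    // Rename to 1.csv, 2.csv, ... inside the same directory.
    int counter = 1;
    for (const auto& p : files) {
        fs::rename(p, dir / (std::to_string(counter) + ".csv"));
        ++counter;
    }
    return 0;
}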
Disclaimer
This answer's validity is based on a comment where the author clarified they were not bound to the C++ language (it may be worth editing the question, the C++ tag, and the OS). This solution should work on UNIX systems supporting bash, that is, most Linux distributions and all releases of Apple's macOS prior to macOS Catalina (correct me if I'm wrong).
Bash command line
Using the following bash command should rename all the files in a folder with increasing numbers, that is:
toto.csv -> 1.csv
titi.csv -> 2.csv etc
It assumes the ordering is not important.
a=1; for i in *; do mv -n "$i" "$a.csv" ; let "a +=1"; done
To test it, you can prepare a test folder by opening a terminal and typing:
mkdir test
cd test
touch toto.csv titi.csv tata.csv
ls
Output:
tata.csv titi.csv toto.csv
Then you can run the following command:
a=1; for i in *; do mv -n "$i" "$a.csv" ; let "a +=1"; done
ls
Output:
1.csv 2.csv 3.csv
Explanation:
a=1 declares a counter variable.
for i in *; starts iterating over all files in the folder.
do mv -n "$i" "$a.csv" moves (renames) the current file (the variable $i) to the new name $a.csv.
let "a +=1" increments the counter a, and done closes the loop.
The -n option makes sure mv never overwrites an existing file.
I assumed there was no specific criterion for renaming the files. If there is a specific structure (pattern) to the renaming, the bash command can probably accommodate it, but the question should then give more details about these requirements :)
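For example, if the desired pattern were simply zero-padded numbers so that lexical and numeric order agree (an assumption on my part, not something the question states), the same loop could use printf:
a=1; for i in *; do mv -n "$i" "$(printf '%04d.csv' "$a")"; let "a += 1"; done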

HDFS dfs -ls path/filename

I have copied a few files to the path, but when I try to run the command hdfs dfs -ls path/filename it returns no file found.
hdfs dfs -ls up to the directory works, but when I use the file name it returns no files found. For one of the files, I copied and pasted the file name using Ambari; after that, the file started getting returned when using hdfs dfs -ls path/filename.
What is causing this issue?
Because when you execute hdfs dfs -ls path/filename, what you are telling HDFS is "show me all the files that are in the directory", and if the end of the path is a file, of course, you are not listing anything. You must point to a directory, not a file.
@saravanan it seems like a permission issue if the file shows up only after using Ambari. Make sure the files have the correct ownership to confirm the commands work. The ls command will list files and folders per the documentation.
Here is the full documentation for the ls command:
[root@hdp ~]# hdfs dfs -help ls
-ls [-C] [-d] [-h] [-q] [-R] [-t] [-S] [-r] [-u] [-e] [<path> ...] :
  List the contents that match the specified file pattern. If path is not
  specified, the contents of /user/<currentUser> will be listed. For a directory a
  list of its direct children is returned (unless -d option is specified).

  Directory entries are of the form:
        permissions - userId groupId sizeOfDirectory(in bytes)
        modificationDate(yyyy-MM-dd HH:mm) directoryName

  and file entries are of the form:
        permissions numberOfReplicas userId groupId sizeOfFile(in bytes)
        modificationDate(yyyy-MM-dd HH:mm) fileName

  -C  Display the paths of files and directories only.
  -d  Directories are listed as plain files.
  -h  Formats the sizes of files in a human-readable fashion
      rather than a number of bytes.
  -q  Print ? instead of non-printable characters.
  -R  Recursively list the contents of directories.
  -t  Sort files by modification time (most recent first).
  -S  Sort files by size.
  -r  Reverse the order of the sort.
  -u  Use time of last access instead of modification for
      display and sorting.
  -e  Display the erasure coding policy of files and directories.
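To double-check the ownership and permissions mentioned above, something along these lines should work (path and path/filename stand for the actual paths from the question):
hdfs dfs -ls path                          # shows permissions, owner and group for every entry in the directory
hdfs dfs -stat "%u %g %n" path/filename    # prints owner, group and name for the single file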

How to tar a folder in HDFS?

Just like the Unix command tar -czf xxx.tgz xxx/, is there a way to do the same thing in HDFS? I have a folder in HDFS that has over 100k small files, and I want to download it to the local file system as fast as possible. hadoop fs -get is too slow, and I know hadoop archive can output a har, but it doesn't seem to solve my problem.
From what I see here,
https://issues.apache.org/jira/browse/HADOOP-7519
it is not possible to perform a tar operation using hadoop commands. This has been filed as an improvement, as mentioned above, and is not resolved/available to use yet.
Hope this answers your question.
Regarding your scenario - having 100k small files in HDFS is not a good practice. You could find a way to merge them all (perhaps by creating tables through Hive or Impala from this data), or move all the small files to a single folder in HDFS and use hadoop fs -copyToLocal <HDFS_FOLDER_PATH> to get the whole folder to your local machine along with all the files in it.
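As a rough sketch of that last suggestion, assuming the folder is /user/cloudera/logs (a placeholder path), the copy-then-tar workflow could look like:
hadoop fs -copyToLocal /user/cloudera/logs /tmp/    # pull the whole folder down to /tmp/logs
tar -czf logs.tgz -C /tmp logs                      # then create the tarball on the local file system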

Delete files extracted with xorriso

I was looking for a way to extract an iso file without root access.
I succeeded using xorriso.
I used this command:
xorriso -osirrox on -indev image.iso -extract / extracted_path
Now when I want to delete the extracted files I get a permission denied error.
lsattr lists -------------e-- for all files.
ls -l lists -r-xr-xr-x for all files.
I tried chmod go+w on a test file but still can't delete it.
Can anyone help me out?
Obviously your files were marked read-only in the ISO. xorriso preserves
the permissions when extracting files.
The reason why you cannot remove the test file after chmod +w is that
the directory which holds that file is still read-only. (Anyway, your
chmod command did not give w-permission to the owner of the file, only to group and others.)
Try this tree changing command:
chmod -R u+w extracted_path
Have a nice day :)
Thomas
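Putting that together, the full cleanup should look something like this (extracted_path being the extraction directory from the question):
chmod -R u+w extracted_path    # give the owner write permission on every file and directory
rm -r extracted_path           # now the tree can be removed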

Unix: fast 'remove directory' for cleaning up daily builds

Is there a faster way to remove a directory than simply submitting
rm -r -f *directory*
? I am asking this because our daily cross-platform builds are really huge (e.g. 4 GB per build). So the hard disks on some of the machines are frequently running out of space.
This is namely the case for our AIX and Solaris platforms.
Maybe there are 'special' commands for directory remove on these platforms?
PASTE-EDIT (moved my own separate answer into the question):
I am generally wondering why 'rm -r -f' is so slow. Doesn't 'rm' just need to modify the '..' or '.' entries to de-allocate the filesystem entries?
something like
mv *directory* /dev/null
would be nice.
For deleting a directory from a filesystem, rm is your fastest option.
On Linux, we sometimes do our builds (a few GB) in a ramdisk, and it has a really impressive delete speed :) You could also try different filesystems, but on AIX/Solaris you may not have many options...
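For the ramdisk idea, a tmpfs mount is one way to get that behaviour on Linux; the size and path below are only illustrative:
sudo mount -t tmpfs -o size=6g tmpfs /path/to/builddir   # builds written here live in RAM
sudo umount /path/to/builddir                            # dropping the whole build area is near-instant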
If your goal is to have the directory $dir empty now, you can rename it, and delete it later from a background/cron job:
mv "$dir" "$dir.old"
mkdir "$dir"
# later
rm -r -f "$dir.old"
Another trick is to create a separate filesystem for $dir, and when you want to delete it, you simply re-create the filesystem. Something like this:
# initialization
mkfs.something /dev/device
mount /dev/device "$dir"
# when you want to delete it:
umount "$dir"
# re-init
mkfs.something /dev/device
mount /dev/device "$dir"
I forgot the source of this trick but it works:
EMPTYDIR=$(mktemp -d)
rsync -r --delete $EMPTYDIR/ dir_to_be_emptied/
On AIX at least, you should be using LVM, the Logical Volume Manager. All our systems bundle all the physical hard drives into a single volume group and then create one big honkin' file system out of that.
That way, you can add physical devices to your machine at will and increase the size of your file system to whatever you need.
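On Linux, the analogous LVM workflow would look roughly like this (the volume group, logical volume and device names are made up for illustration):
pvcreate /dev/sdc                      # register the new disk with LVM
vgextend buildvg /dev/sdc              # add it to the existing volume group
lvextend -L +50G /dev/buildvg/build    # grow the logical volume
resize2fs /dev/buildvg/build           # grow the ext filesystem to match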
One other solution I've seen is to allocate a trash directory on each file system and use a combination of mv and a find cron job to tackle the space problem.
Basically, have a cron job that runs every ten minutes and executes:
rm -rf /trash/*
rm -rf /filesys1/trash/*
rm -rf /filesys2/trash/*
Then, when you want your specific directory on that file system recycled, use something like:
mv /filesys1/overnight /filesys1/trash/overnight
and, within the next ten minutes your disk space will start being recovered. The filesys1/overnight directory will immediately be available for use even before the trashed version has started being deleted.
It's important that the trash directory be on the same filesystem as the directory you want to get rid of, otherwise you have a massive copy/delete operation on your hands rather than a relatively quick move.
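On systems with a Vixie-style cron, the every-ten-minutes job above could be wired up as a single crontab entry, for example:
*/10 * * * * rm -rf /trash/* /filesys1/trash/* /filesys2/trash/*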
rm -r directory works by recursing depth-first down through directory, deleting files, and deleting the directories on the way back up. It has to, since you cannot delete a directory that is not empty.
Long, boring details: each file system object is represented by an inode in the file system, which keeps a file-system-wide, flat array of inodes.[1] If you just deleted a directory without first deleting its children, then the children would remain allocated, but without any pointers to them. (fsck checks for that kind of thing when it runs, since it represents file system damage.)
[1] That may not be strictly true for every file system out there, and there may be a file system that works the way you describe. It would possibly require something like a garbage collector. However, all the common ones I know of act like fs objects are owned by inodes, and directories are lists of name/inode number pairs.
If rm -rf is slow, perhaps you are using a "sync" option or similar, which is writing to the disk too often. On Linux ext3 with normal options, rm -rf is very quick.
One option for fast removal which would work on Linux and presumably also on various Unixen is to use a loop device, something like:
hole temp.img $[5*1024*1024*1024] # create a 5Gb "hole" file
mkfs.ext3 temp.img
mkdir -p mnt-temp
sudo mount temp.img mnt-temp -o loop
The "hole" program is one I wrote myself to create a large empty file using a "hole" rather than allocated blocks on the disk, which is much faster and doesn't use any disk space until you really need it. http://sam.nipl.net/coding/c-examples/hole.c
I just noticed that GNU coreutils contains a similar program "truncate", so if you have that you can use this to create the image:
truncate --size=$[5*1024*1024*1024] temp.img
Now you can use the mounted image under mnt-temp for temporary storage, for your build. When you are done with it, do this to remove it:
sudo umount mnt-temp
rm temp.img
rmdir mnt-temp
I think you will find that removing a single large file is much quicker than removing lots of little files!
If you don't care to compile my "hole.c" program, you can use dd, but this is much slower:
dd if=/dev/zero of=temp.img bs=1024 count=$[5*1024*1024] # create a 5Gb allocated file
I think that there is actually nothing other than "rm -rf", as you quoted, to delete your directories.
To avoid doing it manually over and over, you can cron a daily script that recursively deletes all the build directories under your build root directory if they are "old enough", with something like:
find <buildRootDir>/* -prune -mtime +4 -exec rm -rf {} \;
(here -mtime +4 means "any file older than 4 days")
Another way would be to configure your builder (if it allows such things) to overwrite the previous build with the current one.
I was looking into this as well.
I had a dir with 600,000+ files.
rm * would fail, because there were too many entries (the argument list was too long).
find . -exec rm {} \; was nicer, deleting about 750 files every 5 seconds. I was checking the rm rate via another shell.
So instead I wrote a short script to rm many files at once, which managed about 1000 files every 5 seconds. The idea is to put as many files into one rm command as you can, to increase the efficiency.
#!/usr/bin/ksh
# Collect file names into batches and hand each batch to a single rm call,
# so that one rm invocation removes many files at once.
string=""
count=0
for i in $(cat filelist); do
    string="$string $i"
    count=$((count + 1))
    if [[ $count -eq 40 ]]; then
        rm $string
        string=""
        count=0
    fi
done
# remove any leftover files from the final, partial batch
if [[ -n "$string" ]]; then
    rm $string
fi
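A simpler way to get the same batching effect, assuming filelist contains one plain file name per line with no embedded spaces, is to let xargs build the batches:
xargs -n 40 rm < filelist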
On Solaris, this is the fastest way I have found.
find /dir/to/clean -type f|xargs rm
If you have files with odd paths (spaces or other special characters in the names), use
find /dir/to/clean -type f | while read -r line; do rm "$line"; done
Use
perl -e 'for(<*>){((stat)[9]<(unlink))}'
Please refer to the link below:
http://www.slashroot.in/which-is-the-fastest-method-to-delete-files-in-linux
I needed to delete 700 GB from dozens of directories on a 1 TB AWS EBS disk (ext3) before copying the remainder to a new 200 GB XFS volume. Deleting with rm was taking hours and leaving that volume at 100% wa. Since disk I/O and server time are not free, I instead hid each directory by mounting an empty volume over it, which took only a fraction of a second per directory:
directory_to_delete=/ebs/var/tmp/
mount /dev/sdb $directory_to_delete
nohup rsync -avh /ebs/ /ebs2/
where /dev/sdb is an empty volume of any size.
I coded a small Java application, RdPro (Recursive Directory Purge tool), which is faster than rm. It can also remove user-specified target directories under a root. It works on both Linux/Unix and Windows. It has both a command line version and a GUI version.
https://github.com/mhisoft/rdpro
I had to delete more than 300,000 files on Windows. I had Cygwin installed. Luckily I had all the primary directories in a database, so I created a for loop over those entries and deleted each one using rm -rf.
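A rough sketch of that loop, assuming the directory list has been exported from the database to a file called dirlist.txt (a made-up name) with one path per line:
while read -r dir; do rm -rf "$dir"; done < dirlist.txt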
I just use find ./ -delete in the folder I want to empty, and it deleted 620,000 directories (100 GB in total) in around 10 minutes.
Source: a comment on this site: https://www.slashroot.in/comment/1286#comment-1286