How can I use perl to delete files matching a regex

Due to a Makefile mistake, I have some fake files in my git repo...
$ ls
=0.1.1 =4.8.0 LICENSE
=0.5.3 =5.2.0 Makefile
=0.6.1 =7.1.0 pyproject.toml
=0.6.1, all_commands.txt README_git_workflow.md
=0.8.1 CHANGES.md README.md
=1.2.0 ciscoconfparse/ requirements.txt
=1.7.0 configs/ sphinx-doc/
=2.0 CONTRIBUTING.md tests/
=2.2.0 deploy_docs.py tutorial/
=22.2.0 dev_tools/ utils/
=22.8.0 do.py
=2.7.0 examples/
$
I tried this, but it seems that there may be some more efficient means to accomplish this task...
# glob "*" will list all files globbed against "*"
foreach my $filename (grep { /\W\d+\.\d+/ } glob "*") {
    my $cmd1 = "rm $filename";
    `$cmd1`;
}
Question:
I want a remove command that matches against a pcre.
What is a more efficient perl solution to delete the files matching this perl regex: /\W\d+\.\d+/ (example filename: '=0.1.1')?

Fetch a wider set of files and then filter them through whatever you want:
my @files_to_del = grep { /^\W[0-9]+\.[0-9]+/ and not -d } glob "$dir/*";
I added an anchor (^) so that the regex can only match a string that begins with that pattern; otherwise this could blow away files other than the ones intended. Reconsider what exactly you need.
Altogether, perhaps (or see a one-liner below †):
use warnings;
use strict;
use feature 'say';
use File::Glob ':bsd_glob'; # for better glob()
use Cwd qw(cwd); # current-working-directory
my $dir = shift // cwd; # cwd by default, or from input
my $re = qr/^\W[0-9]+\.[0-9]+/;
my @files_to_del = grep { /$re/ and not -d } glob "$dir/*";
say for @files_to_del; # please inspect first
#unlink or warn "Can't unlink $_: $!" for @files_to_del;
The * in that glob might as well include some pre-selection, if suitable. In particular, if the = is a literal character (and not an indicator printed by the shell; see footnote ‡), then glob "=*" fetches only files starting with it, and those can then be passed through the grep filter.
I exclude directories, identified by the -d filetest, since we are looking for files (and to avoid the scary language about directories from unlink; thanks to brian d foy's comment).
If you needed to scan subdirectories and do the same with them, perhaps recursively (which doesn't seem to be the case here), then we could employ this logic in File::Find::find (or File::Find::Rule, or yet others).
Or read the directory any other way (opendir+readdir, libraries like Path::Tiny), and filter.
† Or, a quick one-liner... print (to inspect) what's about to get blown away:
perl -wE'say for grep { /^\W[0-9]+\.[0-9]+/ and not -d } glob "*"'
and then delete 'em
perl -wE'unlink or warn "$_: $!" for grep /^\W[0-9]+\.[0-9]+/ && !-d, glob "*"'
(I switched to a more compact syntax just so. Not necessary)
If you'd like to be able to pass a directory to it (optionally, working in the current one by default), then do
perl -wE'$d = shift//q(.); ...' dirpath (a relative path is fine, and it's optional)
and then use glob "$d/*" in the code. This works the same way as in the script above: shift pulls the first element from @ARGV if anything was passed to the script on the command line; if @ARGV is empty, it returns undef, and the // (defined-or) operator picks up the string q(.).
‡ That leading = may be an "indicator" of a file type if ls has been aliased with ls -F, what can be checked by running ls with suppressed aliases, one way being \ls (or check alias ls).
If that is so, the = stands for it being a socket, what in Perl can be tested for by the -S filetest.
Then the \W in the proposed regex may need to be changed to \W?, to allow for no non-word character preceding the digits, along with a test for a socket. Like
my $re = qr/^\W? [0-9]+ \. [0-9]+/x;
my @files_to_del = grep { /$re/ and -S } glob "$dir/*";

Why not just:
$ rm =*
Sometimes, shell commands are the best option.

In these cases, I use perl to merely filter the list of files:
ls | perl -ne 'print if /\A\W\d+\.\d+/a' | xargs rm
And, when I do that, I feel guilty for not doing something simpler with an extended pattern in grep:
ls | grep -E '^\W\d+\.\d+' | xargs rm
Eventually I'll run into a problem where there's a directory so I need to be more careful about the file list:
find . -maxdepth 1 -type f | grep -E '^\./\W\d+\.\d+' | xargs rm
Or I need to allow rm to remove directories too should I want that:
ls | grep -E '^\W\d+\.\d+' | xargs rm -r

Here you go.
unlink( grep { /\W\d+\.\d+/ && !-d } glob( "*" ) );
This matches the filename, and excludes directories.

To delete filenames matching this PCRE, /\W\d+\.\d+/, use the following one-liners...
1. $fn is a filename... I'm also removing the my keywords, since the one-liner doesn't have to worry about Perl lexical scopes:
perl -e 'foreach $fn (grep { /\W\d+\.\d+/ } glob "*") {$cmd1="rm $fn";`$cmd1`;}'
2. Or, as Andy Lester responded, perhaps his answer is as efficient as we can make it...
perl -e 'unlink(grep { /\W\d+\.\d+/ } glob "*");'

Related

rename multiple files splitting filenames by '_' and retaining first and last fields

Say I have the following files:
a_b.txt a_b_c.txt a_b_c_d_e.txt a_b_c_d_e_f_g_h_i.txt
I want to rename them in such a way that I split their filenames by _ and I retain the first and last field, so I end up with:
a_b.txt a_c.txt a_e.txt a_i.txt
Thought it would be easy, but I'm a bit stuck...
I tried rename with the following regexp:
rename 's/^([^_]*).*([^_]*[.]txt)/$1_$2/' *.txt
But what I would really need to do is to actually split the filename, so I thought of awk, but I'm not so proficient with it... This is what I have so far (I know at some point I should specify FS="_" and grab the first and last fields somehow):
find . -name "*.txt" | awk -v mvcmd='mv "%s" "%s"\n' '{old=$0; <<split by _ here somehow and retain first and last fields>>; printf mvcmd,old,$0}'
Any help? I don't have a preferred method, but it would be nice to use this to learn awk. Thanks!
Your rename attempt was close; you just need to stop the greedy .* from swallowing the final group.
rename 's/^([^_]*).*_([^_]*[.]txt)$/$1_$2/' *_*_*.txt
I added a _ before the last opening parenthesis (this is the crucial fix) and a $ anchor at the end, and also extended the wildcard so that you don't process any files which don't contain at least two underscores.
The equivalent in Awk might look something like
find . -name "*_*_*.txt" |
awk -F _ '{ system("mv " $0 " " $1 "_" $(NF)) }'
This is somewhat brittle because of the system call; you might need to rethink your approach if your file names could contain whitespace or other shell metacharacters. You could add quoting to partially fix that, but then the command will fail if the file name contains literal quotes. You could fix that, too, but then this will be a little too complex for my taste.
Here's a less brittle approach which should cope with completely arbitrary file names, even ones with newlines in them:
find . -name "*_*_*.txt" -exec sh -c 'for f; do
mv "$f" "${f%%_*}_${f##*_}"
done' _ {} +
find will supply a leading path before each file name, so we don't need mv -- here (there will never be a file name which starts with a dash).
The parameter expansion ${f##pattern} produces the value of the variable f with the longest available match on pattern trimmed off from the beginning; ${f%%pattern} does the same, but trims from the end of the string.
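For example, with one of the sample names (the patterns here are `_*` and `*_`):

```shell
f=a_b_c_d_e.txt
echo "${f%%_*}"             # a      -- longest suffix match of _* trimmed
echo "${f##*_}"             # e.txt  -- longest prefix match of *_ trimmed
echo "${f%%_*}_${f##*_}"    # a_e.txt
```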
With your shown samples, please try the following pure Bash code (making good use of Bash's parameter-expansion capability). The *_*_*.txt glob will NOT pick files like a_b.txt; it only picks files which have more than one underscore in their name, as per the requirement.
for file in *_*_*.txt
do
    firstPart="${file%%_*}"
    secondPart="${file##*_}"
    newName="${firstPart}_${secondPart}"
    mv -- "$file" "$newName"
done
This answer works for your example, but @tripleee's find approach is more robust.
for f in a_*.txt; do mv "$f" "${f%%_*}_${f##*_}"; done
Details: https://www.gnu.org/software/bash/manual/html_node/Shell-Parameter-Expansion.html / https://www.gnu.org/software/bash/manual/html_node/Pattern-Matching.html
Here's an alternate regexp for the given samples:
$ rename -n 's/_.*_/_/' *.txt
rename(a_b_c_d_e_f_g_h_i.txt, a_i.txt)
rename(a_b_c_d_e.txt, a_e.txt)
rename(a_b_c.txt, a_c.txt)
A different rename regex
rename 's/(\S_)[a-z_]*(\S\.txt)/$1$2/'
Using the same regex with sed, or awk, within a loop:
for a in a_*; do
    name=$(echo "$a" | awk -F_ '{print $1 "_" $NF}'); # Or
    #name=$(echo "$a" | sed -E 's/(\S_)[a-z_]*(\S\.txt)/\1\2/g');
    mv "$a" "$name";
done

Find folders that contain multiple matches to a regex/grep

I have a folder structure encompassing many thousands of folders. I would like to be able to find all the folders that, for example, contain multiple .txt files, or multiple .jpeg, or whatever without seeing any folders that contain only a single file of that kind.
The folders should all have only one file of a specific type, but this is not always the case and it is tedious to try to find them.
Note that the folders may contain many other files.
If possible, I'd like to match "FILE.JPG" and "file.jpg" as both matching a query on "file" or "jpg".
What I have been doing is simply find . -iname "*file*" and going through it manually.
The folders contain folders, sometimes 3 or 4 levels deep:
first/
second/
README.txt
readme.TXT
readme.txt
foo.txt
third/
info.txt
third/fourth/
raksljdfa.txt
Should return
first/second/README.txt
first/second/readme.TXT
first/second/readme.txt
first/second/foo.txt
when searching for "txt"
and
first/second/README.txt
first/second/readme.TXT
first/second/readme.txt
when searching for "readme"
This pure Bash code should do it (with caveats, see below):
#! /bin/bash
fileglob=$1 # E.g. '*.txt' or '*readme*'
shopt -s nullglob # Expand to nothing if nothing matches
shopt -s dotglob # Match files whose names start with '.'
shopt -s globstar # '**' matches multiple directory levels
shopt -s nocaseglob # Ignore case when matching
IFS= # Disable word splitting
for dir in **/ ; do
    matching_files=( "$dir"$fileglob )
    (( ${#matching_files[*]} > 1 )) && printf '%s\n' "${matching_files[@]}"
done
Supply the pattern to be matched as an argument to the program when you run it. E.g.
myprog '*.txt'
myprog '*readme*'
(The quotes on the patterns are necessary to stop them matching files in the current directory.)
The caveats regarding the code are:
globstar was introduced with Bash 4.0. The code won't work with older Bash.
Prior to Bash 4.3, globstar matches followed symlinks. This could lead to duplicate outputs, or even failures due to circular links.
The **/ pattern expands to a list of all the directories in the hierarchy. This could take an excessively long time or use an excessive amount of memory if the number of directories is large (say, greater than ten thousand).
If your Bash is older than 4.3, or you have large numbers of directories, this code is a better option:
#! /bin/bash
fileglob=$1 # E.g. '*.txt' or '*readme*'
shopt -s nullglob # Expand to nothing if nothing matches
shopt -s dotglob # Match files whose names start with '.'
shopt -s nocaseglob # Ignore case when matching
IFS= # Disable word splitting
find . -type d -print0 \
| while read -r -d '' dir ; do
    matching_files=( "$dir"/$fileglob )
    (( ${#matching_files[*]} > 1 )) \
        && printf '%s\n' "${matching_files[@]}"
done
Something like this sounds like what you want:
find . -type f -print0 |
awk -v re='[.]txt$' '
    BEGIN {
        RS = "\0"
        IGNORECASE = 1
    }
    {
        dir  = gensub("/[^/]+$","",1,$0)
        file = gensub("^.*/","",1,$0)
    }
    file ~ re {
        dir2files[dir][file]
    }
    END {
        for (dir in dir2files) {
            if ( length(dir2files[dir]) > 1 ) {
                for (file in dir2files[dir]) {
                    print dir "/" file
                }
            }
        }
    }'
It's untested but should be close. It uses GNU awk for gensub(), IGNORECASE, true multi-dimensional arrays and length(array).

Bash: delete most files in directory

I have a directory full of mostly PostScript files, most of which I'm trying to erase: namely, those that don't have 000100, 000110, 000120 or 000200 in the second position of their name. I want to keep those.
Here is an excerpt from the directory:
0091_000100_0000_0000_0001_000000__66_5_32_6_9_82856598585_60_3560351294_L_40_1_52_9_42_97_58_53.ps
0091_000110_0000_0000_0002_000000__66_5_32_6_9_82856598585_60_3560351294_L_40_1_52_9_42_97_58_53.ps
0091_000120_0000_0000_0002_000000__66_5_32_6_9_82856598585_60_3560351294_L_40_1_52_9_42_97_58_53.ps
0091_000200_0000_0000_0002_000000__66_5_32_6_9_82856598585_60_3560351294_L_40_1_52_9_42_97_58_53.ps
0091_000300_0000_0000_0002_000000__66_5_32_6_9_82856598585_60_3560351294_L_40_1_52_9_42_97_58_53.ps
0091_000310_0000_0000_0002_000000__66_5_32_6_9_82856598585_60_3560351294_L_40_1_52_9_42_97_58_53.ps
0091_000320_0000_0000_0002_000000__66_5_32_6_9_82856598585_60_3560351294_L_40_1_52_9_42_97_58_53.ps
0091_000330_0000_0000_0002_000000__66_5_32_6_9_82856598585_60_3560351294_L_40_1_52_9_42_97_58_53.ps
0091_000400_0000_0000_0002_000000__66_5_32_6_9_82856598585_60_3560351294_L_40_1_52_9_42_97_58_53.ps
0091_000410_0000_0000_0002_000000__66_5_32_6_9_82856598585_60_3560351294_L_40_1_52_9_42_97_58_53.ps
0091_000420_0000_0000_0002_000000__66_5_32_6_9_82856598585_60_3560351294_L_40_1_52_9_42_97_58_53.ps
0091_001120_0102_0000_0003_000000__66_5_32_6_9_82856598585_60_3560351294_L_40_1_52_9_42_97_58_53.ps
0096_000100_0000_0000_0001_000000__85_5_2__2_37732144298_48_1790154593_L_52_26_17_77_41_43.ps
0096_000110_0000_0000_0002_000000__85_5_2__2_37732144298_48_1790154593_L_52_26_17_77_41_43.ps
0096_000120_0000_0000_0002_000000__85_5_2__2_37732144298_48_1790154593_L_52_26_17_77_41_43.ps
0096_000200_0000_0000_0002_000000__85_5_2__2_37732144298_48_1790154593_L_52_26_17_77_41_43.ps
0096_000300_0000_0000_0002_000000__85_5_2__2_37732144298_48_1790154593_L_52_26_17_77_41_43.ps
0096_000310_0000_0000_0002_000000__85_5_2__2_37732144298_48_1790154593_L_52_26_17_77_41_43.ps
0096_000320_0000_0000_0002_000000__85_5_2__2_37732144298_48_1790154593_L_52_26_17_77_41_43.ps
0096_000330_0000_0000_0002_000000__85_5_2__2_37732144298_48_1790154593_L_52_26_17_77_41_43.ps
0096_000400_0000_0000_0002_000000__85_5_2__2_37732144298_48_1790154593_L_52_26_17_77_41_43.ps
0096_000410_0000_0000_0002_000000__85_5_2__2_37732144298_48_1790154593_L_52_26_17_77_41_43.ps
0096_000420_0000_0000_0002_000000__85_5_2__2_37732144298_48_1790154593_L_52_26_17_77_41_43.ps
0096_000430_0000_0000_0002_000000__85_5_2__2_37732144298_48_1790154593_L_52_26_17_77_41_43.ps
0096_000440_0000_0000_0002_000000__85_5_2__2_37732144298_48_1790154593_L_52_26_17_77_41_43.ps
0096_000450_0000_0000_0002_000000__85_5_2__2_37732144298_48_1790154593_L_52_26_17_77_41_43.ps
0097_000100_0000_0000_0001_000000__81_5_46_2_48_2146991211_65_1953946853_L_44_6_72_1_58_71_77_49.ps
0097_000110_0000_0000_0002_000000__81_5_46_2_48_2146991211_65_1953946853_L_44_6_72_1_58_71_77_49.ps
0097_000120_0000_0000_0002_000000__81_5_46_2_48_2146991211_65_1953946853_L_44_6_72_1_58_71_77_49.ps
0097_000200_0000_0000_0002_000000__81_5_46_2_48_2146991211_65_1953946853_L_44_6_72_1_58_71_77_49.ps
0097_000300_0000_0000_0002_000000__81_5_46_2_48_2146991211_65_1953946853_L_44_6_72_1_58_71_77_49.ps
0097_000310_0000_0000_0002_000000__81_5_46_2_48_2146991211_65_1953946853_L_44_6_72_1_58_71_77_49.ps
0097_000320_0000_0000_0002_000000__81_5_46_2_48_2146991211_65_1953946853_L_44_6_72_1_58_71_77_49.ps
0097_000330_0000_0000_0002_000000__81_5_46_2_48_2146991211_65_1953946853_L_44_6_72_1_58_71_77_49.ps
0097_000400_0000_0000_0002_000000__81_5_46_2_48_2146991211_65_1953946853_L_44_6_72_1_58_71_77_49.ps
0097_000410_0000_0000_0002_000000__81_5_46_2_48_2146991211_65_1953946853_L_44_6_72_1_58_71_77_49.ps
0097_000420_0000_0000_0002_000000__81_5_46_2_48_2146991211_65_1953946853_L_44_6_72_1_58_71_77_49.ps
0097_000430_0000_0000_0002_000000__81_5_46_2_48_2146991211_65_1953946853_L_44_6_72_1_58_71_77_49.ps
This is what I'm trying to get:
0091_000100_0000_0000_0001_000000__66_5_32_6_9_82856598585_60_3560351294_L_40_1_52_9_42_97_58_53.ps
0091_000110_0000_0000_0002_000000__66_5_32_6_9_82856598585_60_3560351294_L_40_1_52_9_42_97_58_53.ps
0091_000120_0000_0000_0002_000000__66_5_32_6_9_82856598585_60_3560351294_L_40_1_52_9_42_97_58_53.ps
0091_000200_0000_0000_0002_000000__66_5_32_6_9_82856598585_60_3560351294_L_40_1_52_9_42_97_58_53.ps
0096_000100_0000_0000_0001_000000__85_5_2__2_37732144298_48_1790154593_L_52_26_17_77_41_43.ps
0096_000110_0000_0000_0002_000000__85_5_2__2_37732144298_48_1790154593_L_52_26_17_77_41_43.ps
0096_000120_0000_0000_0002_000000__85_5_2__2_37732144298_48_1790154593_L_52_26_17_77_41_43.ps
0096_000200_0000_0000_0002_000000__85_5_2__2_37732144298_48_1790154593_L_52_26_17_77_41_43.ps
0097_000100_0000_0000_0001_000000__81_5_46_2_48_2146991211_65_1953946853_L_44_6_72_1_58_71_77_49.ps
0097_000110_0000_0000_0002_000000__81_5_46_2_48_2146991211_65_1953946853_L_44_6_72_1_58_71_77_49.ps
0097_000120_0000_0000_0002_000000__81_5_46_2_48_2146991211_65_1953946853_L_44_6_72_1_58_71_77_49.ps
0097_000200_0000_0000_0002_000000__81_5_46_2_48_2146991211_65_1953946853_L_44_6_72_1_58_71_77_49.ps
My try so far works but is somewhat impractical:
#!/bin/sh
for f in *.ps; do
    case $f in
        (0091_000100*.ps|0091_000110*.ps|0091_000120*.ps|0091_000200*.ps)
            ;;
        (*)
            rm -- "$f";;
    esac
done
I have to write out every filename prefix I want to keep. One problem: the script doesn't match the 0096_* and 0097_* files, or all the others omitted for readability. The format of the filename is always the same up to the double underscore; the values in the number groups might change.
Is there a way to match for the second group? My experimentation wasn't successful so far.
Thank you for your help!
Seems like ls *.ps | awk -F_ '$2 < 100 || $2 > 200' might be the list of files you want to delete. After verifying that,
rm $(ls *.ps | awk -F_ '$2 < 100 || $2 > 200')
As long as no file has whitespace or glob characters in its name. (If they do, use xargs)
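A sketch of that xargs variant (assumption: GNU xargs, whose -d '\n' extension makes each line a single argument, so spaces in names survive; note the numeric comparison shares the caveat above that any second field between 100 and 200 is kept):

```shell
# Same awk filter as above, handed to rm via xargs one line per argument
ls *.ps | awk -F_ '$2 < 100 || $2 > 200' | xargs -d '\n' rm --
```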
I like using find for best performance when dealing with a large count of files.
This should yield the same results; note the -v (we delete everything that does not match the keep pattern) and the underscores anchoring the pattern to a whole number group:
find . -type f -name '*.ps' | egrep -v '_000(100|110|120|200)_' | xargs rm -f
Assuming a directory has only regular files...
ls *.ps | egrep -v '^[0-9]{4}_000100_|^[0-9]{4}_000110_|^[0-9]{4}_000120_|^[0-9]{4}_000200_' | xargs rm -f

Perl from command line: substitute regex only once in a file

I'm trying to replace a line in a configuration file. The problem is I only want to replace one occurrence. Part of the file looks like this; it is the gitolite default config file:
# -----------------------------------------------------------------
# suggested locations for site-local gitolite code (see cust.html)
# this one is managed directly on the server
# LOCAL_CODE => "$ENV{HOME}/local",
# or you can use this, which lets you put everything in a subdirectory
# called "local" in your gitolite-admin repo. For a SECURITY WARNING
# on this, see http://gitolite.com/gitolite/cust.html#pushcode
# LOCAL_CODE => "$rc{GL_ADMIN_BASE}/local",
# ------------------------------------------------------------------
I would like to set LOCAL_CODE to something else from the command line. I thought I might do it in perl to get pcre convenience. I'm new to perl though and can't get it working.
I found this:
perl -i.bak -p -e 's/old/new/' filename
The problem is that -p seems to have it loop over the file line by line, so an 'o' modifier won't have any effect. However, without the -p option it doesn't seem to work...
A compact way to do this is
perl -i -pe '$done ||= s/old/new/' filename
Yet another one-liner:
perl -i.bak -p -e '$i = s/old/new/ if !$i' filename
There are probably a large number of perl one liners that will do this, but here is one.
perl -i.bak -p -e '$x++ if $x==0 && s/old/new/;' filename

BASH: How to rename lots of files, inserting the folder name in the middle of the filename

(I'm in a Bash environment, Cygwin on a Windows machine, with awk, sed, grep, perl, etc...)
I want to add the last folder name to the filename, just before the last underscore (_) followed by numbers, or at the end if there are no numbers in the filename.
Here is an example of what I have (hundreds of files need to be reorganized):
./aaa/A/C_17x17.p
./aaa/A/C_32x32.p
./aaa/A/C.p
./aaa/B/C_12x12.p
./aaa/B/C_4x4.p
./aaa/B/C_A_3x3.p
./aaa/B/C_X_91x91.p
./aaa/G/C_6x6.p
./aaa/G/C_7x7.p
./aaa/G/C_A_113x113.p
./aaa/G/C_A_8x8.p
./aaa/G/C_B.p
./aab/...
I would like to rename all these files like this:
./aaa/C_A_17x17.p
./aaa/C_A_32x32.p
./aaa/C_A.p
./aaa/C_B_12x12.p
./aaa/C_B_4x4.p
./aaa/C_A_B_3x3.p
./aaa/C_X_B_91x91.p
./aaa/C_G_6x6.p
./aaa/C_G_7x7.p
./aaa/C_A_G_113x113.p
./aaa/C_A_G_8x8.p
./aaa/C_B_G.p
./aab/...
I tried many bash for loops with sed, and the last one was the following:
IFS=$'\n'
for ofic in `find * -type d -name 'A'`; do
fic=`echo $ofic|sed -e 's/\/A$//'`
for ftr in `ls -b $ofic | grep -E '.png$'`; do
nfi=`echo $ftr|sed -e 's/(_\d+[x]\d+)?/_A\1/'`
echo mv \"$ofic/$ftr\" \"$fic/$nfi\"
done
done
But so far with no success... The \1 does not get inserted into $nfi...
This is the last one I tried, working on only 1 folder (which is a subfolder of a huge folder collection), and after over 60 minutes of unsuccessful trials, I'm here with you guys.
I modified your script so that it works for all your examples.
IFS=$'\n'
for ofic in ???/?; do
    IFS=/ read fic fia <<<$ofic
    for ftr in `ls -b $ofic | grep -E '\.p.*$'`; do
        nfi=`echo $ftr|sed -e "s/_[0-9]*x[0-9]*/_$fia&/;t;s/\./_$fia./"`
        echo mv \"$ofic/$ftr\" \"$fic/$nfi\"
    done
done
# it's easier to change to here first
cd aaa
# process every file
for f in $(find . -type f); do
    # strip everything after the first / so this is our folder name
    foldername=${f/\/*/}
    # create the new filename from substrings of the
    # original filename concatenated to the folder name
    newfilename=".${f:1:3}${foldername}_${f:4}"
    # if you are satisfied with the output, just leave out the `echo`
    # below
    echo mv ${f} ${newfilename}
done
Might work for you.
See it here in action. (Slightly modified, as ideone.com handles STDIN/find differently...)