Use find to identify filenames that match the parent directory name - regex

I would like to use find to search for files in different subdirectories whose names match the same pattern as their parent directory.
example:
ls
Random1_fa Random2_fa Random3_fa
Inside these dirs there are different files, and I want to find exactly one of them in each:
cd Random1_fa
Random1.fa
Random1.fastq
Random1_match_genome.fa
Random1_unmatch_genome.fa
...
I want to "find" only the files with "filename".fa e.g:
/foo/bar/1_Random1/Random1_fa/Random1.fa
/foo/bar/2_Random2/Random2_fa/Random2.fa
/foo/bar/3_Random5/Random5_fa/Random5.fa
/foo/bar/10_Random99/Random99_fa/Random99.fa
I did:
ls | sed 's/_fa//' |find -name "*.fa"
but that is not what I was looking for.
I want to feed the result of sed to find as its name pattern.
Something "awk-like", such as:
ls| sed 's/_fa//' |find -name "$1.fa"
or
ls| sed 's/_fa/.fa/' |find -name "$1"

Why read from standard input and filter with sed when you can express the condition directly in find? First run a shell glob expansion for all directories ending in _fa, then derive from each directory's name the filename to use in the find expression. All you need to do is:
for dir in ./*_fa; do
    # Skip un-expanded globs from the for-loop: if nothing matched, the
    # literal string './*_fa' would fail the directory test (-d), so we
    # move on to the next iteration
    [ -d "$dir" ] || continue
    # The glob expansion returns each name as './name_fa'. Using the
    # built-in parameter expansion we remove the './' and '_fa' from it
    str="${dir##./}"
    name="${str%%_fa}"
    # We then use 'find' to identify the file 'name.fa' in the directory
    find "$dir" -type f -name "${name}.fa"
done
The -name pattern matches only the file named after the directory prefix and ending in .fa. Run this script from the top-level directory containing your *_fa directories to match all the files.
To copy the matched files elsewhere, add an -exec action (cp -t is a GNU coreutils option):
find "$dir" -type f -name "${name}.fa" -exec cp -t /home/destinationPath {} +

How can I use perl to delete files matching a regex

Due to a Makefile mistake, I have some fake files in my git repo...
$ ls
=0.1.1 =4.8.0 LICENSE
=0.5.3 =5.2.0 Makefile
=0.6.1 =7.1.0 pyproject.toml
=0.6.1, all_commands.txt README_git_workflow.md
=0.8.1 CHANGES.md README.md
=1.2.0 ciscoconfparse/ requirements.txt
=1.7.0 configs/ sphinx-doc/
=2.0 CONTRIBUTING.md tests/
=2.2.0 deploy_docs.py tutorial/
=22.2.0 dev_tools/ utils/
=22.8.0 do.py
=2.7.0 examples/
$
I tried this, but it seems that there may be some more efficient means to accomplish this task...
# glob "*" will list all files globbed against "*"
foreach my $filename (grep { /\W\d+\.\d+/ } glob "*") {
my $cmd1 = "rm $filename";
`$cmd1`;
}
Question:
I want a remove command that matches against a pcre.
What is a more efficient perl solution to delete the files matching this perl regex: /\W\d+\.\d+/ (example filename: '=0.1.1')?
Fetch a wider set of files and then filter through whatever you want
my @files_to_del = grep { /^\W[0-9]+\.[0-9]+/ and not -d } glob "$dir/*";
I added an anchor (^) so that the regex can only match a string that begins with that pattern; otherwise this could blow away files other than the intended ones. Reconsider what exactly you need.
Altogether perhaps (or see a one-liner below †)
use warnings;
use strict;
use feature 'say';
use File::Glob ':bsd_glob'; # for better glob()
use Cwd qw(cwd); # current-working-directory
my $dir = shift // cwd; # cwd by default, or from input
my $re = qr/^\W[0-9]+\.[0-9]+/;
my @files_to_del = grep { /$re/ and not -d } glob "$dir/*";
say for @files_to_del; # please inspect first
#unlink or warn "Can't unlink $_: $!" for @files_to_del;
where that * in glob might as well have some pre-selection, if suitable. In particular, if the = is a literal character (and not an indicator printed by the shell, see footnote)‡ then glob "=*" will fetch files starting with it, and then you can pass those through a grep filter.
I exclude directories, identified by -d filetest, since we are looking for files (and to not mix with some scary language about directories from unlink, thanks to brian d foy comment).
If you'd need to scan subdirectories and do the same with them, perhaps recursively -- which doesn't seem to be the case here -- then we could employ this logic in File::Find::find (or File::Find::Rule, or yet others).
Or read the directory any other way (opendir+readdir, libraries like Path::Tiny), and filter.
† Or, a quick one-liner ... print (to inspect) what's about to get blown away
perl -wE'say for grep { /^\W[0-9]+\.[0-9]+/ and not -d } glob "*"'
and then delete 'em
perl -wE'unlink or warn "$_: $!" for grep /^\W[0-9]+\.[0-9]+/ && !-d, glob "*"'
(I switched to a more compact syntax just so. Not necessary)
If you'd like to be able to pass a directory to it (optionally, or work in the current one) then do
perl -wE'$d = shift//q(.); ...' dirpath (relative path fine. optional)
and then use glob "$d/*" in the code. This works the same way as in the script above -- shift pulls the first element from @ARGV, if anything was passed to the script on the command line; if @ARGV is empty it returns undef and then the // (defined-or) operator picks up the string q(.).
‡ That leading = may be an "indicator" of a file type if ls has been aliased to ls -F, which can be checked by running ls with aliases suppressed, one way being \ls (or check alias ls).
If that is so, the = stands for the file being a socket, which in Perl can be tested for with the -S filetest.
Then the \W in the proposed regex may need to be changed to \W? to allow for no non-word character preceding a digit, along with a test for a socket. Like
my $re = qr/^\W? [0-9]+ \. [0-9]+/x;
my @files_to_del = grep { /$re/ and -S } glob "$dir/*";
Why not just:
$ rm =*
Sometimes, shell commands are the best option.
In these cases, I use perl to merely filter the list of files:
ls | perl -ne 'print if /\A\W\d+\.\d+/a' | xargs rm
And, when I do that, I feel guilty for not doing something simpler with an extended pattern in grep:
ls | grep -E '^\W[0-9]+\.[0-9]+' | xargs rm
Eventually I'll run into a problem where there's a directory, so I need to be more careful about the file list:
find . -maxdepth 1 -type f | grep -E '^\./\W[0-9]+\.[0-9]+' | xargs rm
Or, should I want that, I can allow rm to remove directories too:
ls | grep -E '^\W[0-9]+\.[0-9]+' | xargs rm -r
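If the filenames might contain spaces or newlines, a null-delimited variant of the same pipeline is safer; a sketch assuming GNU find, grep, and xargs:
find . -maxdepth 1 -type f -print0 |
grep -zE '^\./\W[0-9]+\.[0-9]+' |
xargs -0 rm --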
Here you go.
unlink( grep { /\W\d+\.\d+/ && !-d } glob( "*" ) );
This matches the filename, and excludes directories.
To delete filenames matching this: /\W\d+\.\d+/ pcre, use the following one-liners...
1> Here $fn is a filename. I'm also removing the my keywords, since the one-liner doesn't have to worry about Perl lexical scopes:
perl -e 'foreach $fn (grep { /\W\d+\.\d+/ } glob "*") {$cmd1="rm $fn";`$cmd1`;}'
2> Or as Andy Lester responded, perhaps his answer is as efficient as we can make it...
perl -e 'unlink(grep { /\W\d+\.\d+/ } glob "*");'

Why can't `find` actually find all of the directories matching a pattern?

I have directories matching the pattern foo[0-9]+ and foo-bar. I want to remove all directories matching the former pattern. My goal is to do this with find, but when I try to find directories matching the former pattern, nothing comes back:
$ mkdir foo{1..15} foo-bar
$ # yields nothing
$ find . -name "foo[0-9]+"
When I try to find everything that matches foo[^-], only some of the directories appear:
$ find . -name "foo[^-]"
./foo9
./foo7
./foo6
./foo1
./foo8
./foo4
./foo3
./foo2
./foo5
I've played with the -regex flag and all available -regextypes, but can't seem to get the magic right.
How can I list all of these directories?
This should work:
find -E . -regex '.*/foo[0-9]+'
You might want to limit the type: find -E . -type d -regex '.*/foo[0-9]+'
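Note that -E here is the BSD find flag for extended regular expressions (e.g. on macOS); with GNU find the equivalent spelling is:
find . -type d -regextype posix-extended -regex '.*/foo[0-9]+'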
This works:
$ ls -F
foo-bar/ foo10/ foo12/ foo14/ foo2/ foo4/ foo6/ foo8/
foo1/ foo11/ foo13/ foo15/ foo3/ foo5/ foo7/ foo9/
$ find . -name "foo[^-]*"
./foo1
./foo2
./foo3
./foo4
./foo5
./foo6
./foo7
./foo8
./foo9
./foo10
./foo11
./foo12
./foo13
./foo14
./foo15
Alternatively, if your goal is to list all directories that don't match foo-bar then you can simply use the -not operator:
$ find . -not -name foo-bar
.
./foo1
./foo2
./foo3
./foo4
./foo5
./foo6
./foo7
./foo8
./foo9
./foo10
./foo11
./foo12
./foo13
./foo14
./foo15
By the way, you were using file globbing and not regexes when you weren't using the -regex flag.
To find the files using globbing:
find . -name "foo[1-9]" -o -name "foo1[0-5]" -o -name "foo-bar"
There we match any files with name "foo" followed by "single digit between 1 and 9", or files named foo1 followed by "single digit between 0 and 5", or files named exactly "foo-bar".
Or if you know the directory won't have any numbered files aside from the ones you created:
find . -name "foo[1-9]*" -o -name "foo-bar'"
Here we find all files named "foo" followed by one digit, followed by any number of any characters, or the file named exactly foo-bar. Globbing is not as precise as regexes, but it's often sufficient, and it's pretty quick.
The * and ? in globbing are different from those in regexes. In globbing they themselves stand for the unknown characters in the string being matched, as well as their quantity. In regexes they modify the preceding atom and express the quantity of that atom.
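To make the difference concrete, here is "foo followed by digits" in both notations (the -regextype form assumes GNU find):
# glob: '[0-9]' matches one digit and '*' by itself matches any run of characters
find . -name 'foo[0-9]*'
# ERE: '+' quantifies the preceding bracket expression, so this allows digits only
find . -regextype posix-extended -regex '.*/foo[0-9]+'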

Run Regex using Grep/Sed recursively over files to store capture group

I have a file structure that looks like this:
Folder1
    file1.feature
    file2.feature
    file3.feature
Folder2
    file1.feature
    file2.feature
    ...etc.
The files are Behat feature files which look like this:
Scenario: I am filling out a form
    Given I am logged in as User
    And I fill in "Name" with "My name"
    Then I fill in "Email" with "myemail@example.com"
I am trying to iterate over each file within the file structure to get matches on my regex:
/I fill in "[^"]+" with "([^"]+)"/gm
The regex looks for I fill in "x" with "y", and I would like to store the capture group "y" from each file where a line in the file matches the expression.
So far I can iterate over the folders and print out the file names in my Bash script like so:
#!/bin/bash
cd behat/features
files="*/*.feature"
for f in $files
do
    echo ${f}
done
I am trying to retrieve the capture group using Sed currently by doing this in my loop:
sed -r 's/^I fill in \"[^)]+\" with \"([^)]+)\"$/\1/'
But I fear that I am going down the wrong track, as this is returning all of the file content throughout all the files.
You may use
cd behat/features && find . -name '*.feature' -type f -print0 | xargs -0 \
sed -E -n 's/.*I fill in "[^"]+" with "([^"]+)"/\1/p' > outfile
This command "goes" to behat/features directory, finds all files with feature extension (recursively) and then prints the capture group #1 values matched with your regex as -n option suppresses the output of lines and p flag only outputs what remains after a replacement.
See more specific solutions for recursive file matching at How to do a recursive find/replace of a string with awk or sed? if need be.
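If your grep has PCRE support (GNU grep built with -P), the same extraction can be done without sed, using \K to discard everything matched before the value of interest; a sketch, not from the original answer:
cd behat/features &&
grep -rhoP --include='*.feature' 'I fill in "[^"]+" with "\K[^"]+' . > outfile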

Find folders that contain multiple matches to a regex/grep

I have a folder structure encompassing many thousands of folders. I would like to be able to find all the folders that, for example, contain multiple .txt files, or multiple .jpeg, or whatever without seeing any folders that contain only a single file of that kind.
The folders should all have only one file of a specific type, but this is not always the case and it is tedious to try to find them.
Note that the folders may contain many other files.
If possible, I'd like "FILE.JPG" and "file.jpg" to both match a query on "file" or "jpg".
What I have been doing is simply find . -iname "*file*" and going through the output manually.
Folders contain folders, sometimes 3 or 4 levels deep:
first/
    second/
        README.txt
        readme.TXT
        readme.txt
        foo.txt
    third/
        info.txt
        fourth/
            raksljdfa.txt
Should return
first/second/README.txt
first/second/readme.TXT
first/second/readme.txt
first/second/foo.txt
when searching for "txt"
and
first/second/README.txt
first/second/readme.TXT
first/second/readme.txt
when searching for "readme"
This pure Bash code should do it (with caveats, see below):
#! /bin/bash
fileglob=$1 # E.g. '*.txt' or '*readme*'
shopt -s nullglob # Expand to nothing if nothing matches
shopt -s dotglob # Match files whose names start with '.'
shopt -s globstar # '**' matches multiple directory levels
shopt -s nocaseglob # Ignore case when matching
IFS= # Disable word splitting
for dir in **/ ; do
    matching_files=( "$dir"$fileglob )
    (( ${#matching_files[*]} > 1 )) && printf '%s\n' "${matching_files[@]}"
done
Supply the pattern to be matched as an argument to the program when you run it. E.g.
myprog '*.txt'
myprog '*readme*'
(The quotes on the patterns are necessary to stop them matching files in the current directory.)
The caveats regarding the code are:
globstar was introduced with Bash 4.0. The code won't work with older Bash.
Prior to Bash 4.3, globstar matches followed symlinks. This could lead to duplicate outputs, or even failures due to circular links.
The **/ pattern expands to a list of all the directories in the hierarchy. This could take an excessively long time or use an excessive amount of memory if the number of directories is large (say, greater than ten thousand).
If your Bash is older than 4.3, or you have large numbers of directories, this code is a better option:
#! /bin/bash
fileglob=$1 # E.g. '*.txt' or '*readme*'
shopt -s nullglob # Expand to nothing if nothing matches
shopt -s dotglob # Match files whose names start with '.'
shopt -s nocaseglob # Ignore case when matching
IFS= # Disable word splitting
find . -type d -print0 \
    | while read -r -d '' dir ; do
          matching_files=( "$dir"/$fileglob )
          (( ${#matching_files[*]} > 1 )) \
              && printf '%s\n' "${matching_files[@]}"
      done
Something like this sounds like what you want:
find . -type f -print0 |
awk -v re='[.]txt$' '
    BEGIN {
        RS = "\0"
        IGNORECASE = 1
    }
    {
        dir = gensub("/[^/]+$","",1,$0)
        file = gensub("^.*/","",1,$0)
    }
    file ~ re {
        dir2files[dir][file]
    }
    END {
        for (dir in dir2files) {
            if ( length(dir2files[dir]) > 1 ) {
                for (file in dir2files[dir]) {
                    print dir "/" file
                }
            }
        }
    }'
It's untested but should be close. It uses GNU awk for gensub(), IGNORECASE, true multi-dimensional arrays and length(array).
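With GNU find, a shorter pipeline reports the directories (rather than the files) that contain more than one match: -printf '%h\n' prints each matching file's directory, and uniq -d keeps only the names that appear more than once; a sketch:
find . -type f -iname '*.txt' -printf '%h\n' | sort | uniq -d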

BASH: How to rename lots of files, inserting the folder name in the middle of the filename

(I'm in a Bash environment, Cygwin on a Windows machine, with awk, sed, grep, perl, etc...)
I want to add the last folder name to the filename, just before the last underscore (_) followed by numbers, or at the end if there are no numbers in the filename.
Here is an example of what I have (hundreds of files needed to be reorganized) :
./aaa/A/C_17x17.p
./aaa/A/C_32x32.p
./aaa/A/C.p
./aaa/B/C_12x12.p
./aaa/B/C_4x4.p
./aaa/B/C_A_3x3.p
./aaa/B/C_X_91x91.p
./aaa/G/C_6x6.p
./aaa/G/C_7x7.p
./aaa/G/C_A_113x113.p
./aaa/G/C_A_8x8.p
./aaa/G/C_B.p
./aab/...
I would like to rename all these files like this:
./aaa/C_A_17x17.p
./aaa/C_A_32x32.p
./aaa/C_A.p
./aaa/C_B_12x12.p
./aaa/C_B_4x4.p
./aaa/C_A_B_3x3.p
./aaa/C_X_B_91x91.p
./aaa/C_G_6x6.p
./aaa/C_G_7x7.p
./aaa/C_A_G_113x113.p
./aaa/C_A_G_8x8.p
./aaa/C_B_G.p
./aab/...
I tried many Bash for loops with sed, and the last one was the following:
IFS=$'\n'
for ofic in `find * -type d -name 'A'`; do
    fic=`echo $ofic|sed -e 's/\/A$//'`
    for ftr in `ls -b $ofic | grep -E '.png$'`; do
        nfi=`echo $ftr|sed -e 's/(_\d+[x]\d+)?/_A\1/'`
        echo mv \"$ofic/$ftr\" \"$fic/$nfi\"
    done
done
But with no success yet... The \1 does not get inserted into $nfi...
This is the last one I tried, working on only one folder (a subfolder of a huge folder collection), and after over 60 minutes of unsuccessful trials, I'm here with you guys.
I modified your script so that it works for all your examples.
IFS=$'\n'
for ofic in ???/?; do
    IFS=/ read fic fia <<<$ofic
    for ftr in `ls -b $ofic | grep -E '\.p.*$'`; do
        nfi=`echo $ftr|sed -e "s/_[0-9]*x[0-9]*/_$fia&/;t;s/\./_$fia./"`
        echo mv \"$ofic/$ftr\" \"$fic/$nfi\"
    done
done
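For reference, here is a pure-Bash sketch of the same renaming logic that avoids parsing ls output; it assumes the parent/SUB/file.p layout from the question and uses [[ =~ ]] plus parameter expansion:
#!/bin/bash
shopt -s nullglob
for path in ./*/*/*.p; do
    dir=${path%/*}      # e.g. ./aaa/A
    parent=${dir%/*}    # e.g. ./aaa
    sub=${dir##*/}      # e.g. A
    file=${path##*/}    # e.g. C_17x17.p
    # insert the subfolder name before a trailing _NxN suffix if present,
    # otherwise before the extension
    if [[ $file =~ ^(.*)(_[0-9]+x[0-9]+\.[^.]+)$ ]]; then
        new="${BASH_REMATCH[1]}_${sub}${BASH_REMATCH[2]}"   # C_A_17x17.p
    else
        new="${file%.*}_${sub}.${file##*.}"                 # C_A.p
    fi
    echo mv "$path" "$parent/$new"   # drop 'echo' once the output looks right
done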
# it's easier to change to here first
cd aaa
# process every file
for f in $(find . -type f); do
    # strip the leading './' that find adds so the substring offsets
    # below line up (e.g. 'A/C_17x17.p')
    f=${f#./}
    # strip everything after the first '/': this is our folder name
    foldername=${f/\/*/}
    # create the new filename from substrings of the original
    # filename concatenated with the folder name
    newfilename=".${f:1:3}${foldername}_${f:4}"
    # if you are satisfied with the output, just leave out the `echo`
    # from below
    echo mv "${f}" "${newfilename}"
done
Might work for you.
See here in action. (slightly modified, as ideone.com handles STDIN/find differently...)