Using a script to compare directories with modified file names?

I want to write a script that compares two directories.
However, the file names are modified in one of them.
So directory A contains files like HouseFile.txt, CouchFile.txt, ChairFile.txt
Directory B contains House.txt, Couch.txt, Chair.txt (which should be seen as 'equivalent' to the above)
Both may also contain new, completely different files.
Could someone point me in the right direction here? It's been a while since I've done scripting.
I have tried using diff, and I know I need to use some form of regex to compare the file names, but I am not sure where to start.
Thank you!
Added for clarification:
diff, however, only compares the actual file names. I would like to know how to specify that file names such as "HouseFile.txt" and "House.txt" (as in the example) should be treated as equivalent.

If I understand correctly, this is a possible solution to compare a to b:
mkdir a b ; touch a/HouseFile.txt a/ChairFile.txt a/CouchFile.txt a/SomeFile.txt b/House.txt b/Chair.txt b/Couch.txt b/Sofa.txt
for file in a/*(.); do [[ ! -f b/${${file##*/}:fs#File#} ]] && echo $file ; done
Outputs:
a/SomeFile.txt
What is not clear to me: Is the difference pattern strictly 'File' or any arbitrary string?
EDIT: The previous was for zsh. Here is one for bash:
find a -maxdepth 1 -type f | while IFS= read -r file; do
check=$(echo "$file" | sed -r -e 's#(.*)/(.*)#\2#' -e 's#File##')
[[ ! -f b/${check} ]] && echo "$file"
done
Using parameter expansion instead of sed:
find a -maxdepth 1 -type f | while IFS= read -r file; do
check=${file/%File.txt/.txt} # replace a trailing "File.txt" with ".txt"
check=${check/#*\//} # delete the directory part (everything up to the last slash)
[[ ! -f b/${check} ]] && echo "$file"
done
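The question also mentions that both directories may contain new files, so the check may need to run in both directions. The following is a minimal sketch, assuming the only difference is a literal "File" suffix before ".txt"; the function names normalize_name and compare_dirs are made up for illustration.

```bash
#!/usr/bin/env bash
# Sketch: report files that have no counterpart in the other directory.
# Assumption: names in A differ from names in B only by a "File" suffix
# directly before the ".txt" extension.

# Map a name from directory A to the name expected in directory B.
normalize_name() {
    local name=$1
    printf '%s\n' "${name/%File.txt/.txt}"
}

# compare_dirs A B: list unmatched files from both sides.
compare_dirs() {
    local a=$1 b=$2 path name
    for path in "$a"/*; do
        [[ -f $path ]] || continue
        name=$(normalize_name "${path##*/}")
        [[ -f $b/$name ]] || echo "only in $a: ${path##*/}"
    done
    for path in "$b"/*; do
        [[ -f $path ]] || continue
        name=${path##*/}
        # Reverse mapping: House.txt -> HouseFile.txt
        [[ -f $a/$name || -f $a/${name%.txt}File.txt ]] || echo "only in $b: $name"
    done
}
```

With the test directories created above, compare_dirs a b would report a/SomeFile.txt and b/Sofa.txt as unmatched. If the differing part is an arbitrary string rather than "File", only normalize_name needs to change.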

Related

Bash: delete most files in directory

I have a directory full of mostly PostScript files, most of which I'm trying to delete: namely those that don't have 000100, 000110, 000120 or 000200 as the second group in their name. Files with one of those values should be kept.
Here is an excerpt from the directory:
0091_000100_0000_0000_0001_000000__66_5_32_6_9_82856598585_60_3560351294_L_40_1_52_9_42_97_58_53.ps
0091_000110_0000_0000_0002_000000__66_5_32_6_9_82856598585_60_3560351294_L_40_1_52_9_42_97_58_53.ps
0091_000120_0000_0000_0002_000000__66_5_32_6_9_82856598585_60_3560351294_L_40_1_52_9_42_97_58_53.ps
0091_000200_0000_0000_0002_000000__66_5_32_6_9_82856598585_60_3560351294_L_40_1_52_9_42_97_58_53.ps
0091_000300_0000_0000_0002_000000__66_5_32_6_9_82856598585_60_3560351294_L_40_1_52_9_42_97_58_53.ps
0091_000310_0000_0000_0002_000000__66_5_32_6_9_82856598585_60_3560351294_L_40_1_52_9_42_97_58_53.ps
0091_000320_0000_0000_0002_000000__66_5_32_6_9_82856598585_60_3560351294_L_40_1_52_9_42_97_58_53.ps
0091_000330_0000_0000_0002_000000__66_5_32_6_9_82856598585_60_3560351294_L_40_1_52_9_42_97_58_53.ps
0091_000400_0000_0000_0002_000000__66_5_32_6_9_82856598585_60_3560351294_L_40_1_52_9_42_97_58_53.ps
0091_000410_0000_0000_0002_000000__66_5_32_6_9_82856598585_60_3560351294_L_40_1_52_9_42_97_58_53.ps
0091_000420_0000_0000_0002_000000__66_5_32_6_9_82856598585_60_3560351294_L_40_1_52_9_42_97_58_53.ps
0091_001120_0102_0000_0003_000000__66_5_32_6_9_82856598585_60_3560351294_L_40_1_52_9_42_97_58_53.ps
0096_000100_0000_0000_0001_000000__85_5_2__2_37732144298_48_1790154593_L_52_26_17_77_41_43.ps
0096_000110_0000_0000_0002_000000__85_5_2__2_37732144298_48_1790154593_L_52_26_17_77_41_43.ps
0096_000120_0000_0000_0002_000000__85_5_2__2_37732144298_48_1790154593_L_52_26_17_77_41_43.ps
0096_000200_0000_0000_0002_000000__85_5_2__2_37732144298_48_1790154593_L_52_26_17_77_41_43.ps
0096_000300_0000_0000_0002_000000__85_5_2__2_37732144298_48_1790154593_L_52_26_17_77_41_43.ps
0096_000310_0000_0000_0002_000000__85_5_2__2_37732144298_48_1790154593_L_52_26_17_77_41_43.ps
0096_000320_0000_0000_0002_000000__85_5_2__2_37732144298_48_1790154593_L_52_26_17_77_41_43.ps
0096_000330_0000_0000_0002_000000__85_5_2__2_37732144298_48_1790154593_L_52_26_17_77_41_43.ps
0096_000400_0000_0000_0002_000000__85_5_2__2_37732144298_48_1790154593_L_52_26_17_77_41_43.ps
0096_000410_0000_0000_0002_000000__85_5_2__2_37732144298_48_1790154593_L_52_26_17_77_41_43.ps
0096_000420_0000_0000_0002_000000__85_5_2__2_37732144298_48_1790154593_L_52_26_17_77_41_43.ps
0096_000430_0000_0000_0002_000000__85_5_2__2_37732144298_48_1790154593_L_52_26_17_77_41_43.ps
0096_000440_0000_0000_0002_000000__85_5_2__2_37732144298_48_1790154593_L_52_26_17_77_41_43.ps
0096_000450_0000_0000_0002_000000__85_5_2__2_37732144298_48_1790154593_L_52_26_17_77_41_43.ps
0097_000100_0000_0000_0001_000000__81_5_46_2_48_2146991211_65_1953946853_L_44_6_72_1_58_71_77_49.ps
0097_000110_0000_0000_0002_000000__81_5_46_2_48_2146991211_65_1953946853_L_44_6_72_1_58_71_77_49.ps
0097_000120_0000_0000_0002_000000__81_5_46_2_48_2146991211_65_1953946853_L_44_6_72_1_58_71_77_49.ps
0097_000200_0000_0000_0002_000000__81_5_46_2_48_2146991211_65_1953946853_L_44_6_72_1_58_71_77_49.ps
0097_000300_0000_0000_0002_000000__81_5_46_2_48_2146991211_65_1953946853_L_44_6_72_1_58_71_77_49.ps
0097_000310_0000_0000_0002_000000__81_5_46_2_48_2146991211_65_1953946853_L_44_6_72_1_58_71_77_49.ps
0097_000320_0000_0000_0002_000000__81_5_46_2_48_2146991211_65_1953946853_L_44_6_72_1_58_71_77_49.ps
0097_000330_0000_0000_0002_000000__81_5_46_2_48_2146991211_65_1953946853_L_44_6_72_1_58_71_77_49.ps
0097_000400_0000_0000_0002_000000__81_5_46_2_48_2146991211_65_1953946853_L_44_6_72_1_58_71_77_49.ps
0097_000410_0000_0000_0002_000000__81_5_46_2_48_2146991211_65_1953946853_L_44_6_72_1_58_71_77_49.ps
0097_000420_0000_0000_0002_000000__81_5_46_2_48_2146991211_65_1953946853_L_44_6_72_1_58_71_77_49.ps
0097_000430_0000_0000_0002_000000__81_5_46_2_48_2146991211_65_1953946853_L_44_6_72_1_58_71_77_49.ps
This is what I'm trying to get:
0091_000100_0000_0000_0001_000000__66_5_32_6_9_82856598585_60_3560351294_L_40_1_52_9_42_97_58_53.ps
0091_000110_0000_0000_0002_000000__66_5_32_6_9_82856598585_60_3560351294_L_40_1_52_9_42_97_58_53.ps
0091_000120_0000_0000_0002_000000__66_5_32_6_9_82856598585_60_3560351294_L_40_1_52_9_42_97_58_53.ps
0091_000200_0000_0000_0002_000000__66_5_32_6_9_82856598585_60_3560351294_L_40_1_52_9_42_97_58_53.ps
0096_000100_0000_0000_0001_000000__85_5_2__2_37732144298_48_1790154593_L_52_26_17_77_41_43.ps
0096_000110_0000_0000_0002_000000__85_5_2__2_37732144298_48_1790154593_L_52_26_17_77_41_43.ps
0096_000120_0000_0000_0002_000000__85_5_2__2_37732144298_48_1790154593_L_52_26_17_77_41_43.ps
0096_000200_0000_0000_0002_000000__85_5_2__2_37732144298_48_1790154593_L_52_26_17_77_41_43.ps
0097_000100_0000_0000_0001_000000__81_5_46_2_48_2146991211_65_1953946853_L_44_6_72_1_58_71_77_49.ps
0097_000110_0000_0000_0002_000000__81_5_46_2_48_2146991211_65_1953946853_L_44_6_72_1_58_71_77_49.ps
0097_000120_0000_0000_0002_000000__81_5_46_2_48_2146991211_65_1953946853_L_44_6_72_1_58_71_77_49.ps
0097_000200_0000_0000_0002_000000__81_5_46_2_48_2146991211_65_1953946853_L_44_6_72_1_58_71_77_49.ps
My attempt so far works but is somewhat impractical:
#!/bin/sh
for f in *.ps; do
case $f in
(0091_000100*.ps|0091_000110*.ps|0091_000120*.ps|0091_000200*.ps)
;;
(*)
rm -- "$f";;
esac
done
I have to write every start of the filename I want to keep. One problem: The script doesn't match the 0096_* and 0097_* files and all the others omitted for readability. The format of the filename is always the same up to the double underscore. The values in the number groups might change.
Is there a way to match for the second group? My experimentation wasn't successful so far.
Thank you for your help!
Seems like ls *.ps | awk -F_ '$2 < 100 || $2 > 200' might be the list of files you want to delete. After verifying that,
rm $(ls *.ps | awk -F_ '$2 < 100 || $2 > 200')
As long as no file has whitespace or glob characters in its name. (If they do, use xargs)
I like using find for best performance when dealing with a large number of files.
This regex should yield the same results (note the -v, so that the matching files are kept and everything else is deleted):
find . -type f -name '*.ps' | grep -Ev '/[0-9]{4}_000(1[012]|200)_' | xargs rm -f
Assuming a directory has only regular files...
ls *.ps | egrep -v '^[0-9]{4}_000100_|^[0-9]{4}_000110_|^[0-9]{4}_000120_|^[0-9]{4}_000200_' | xargs rm -f
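The original question, whether the second group can be matched on its own, can also be answered with a plain glob in the case statement: ???? stands in for the first group. A minimal sketch, assuming every name starts with a four-character group followed by an underscore; the function names should_keep and prune_ps and the DRY_RUN variable are hypothetical.

```bash
#!/usr/bin/env bash
# Sketch: keep files whose second "_"-separated group is
# 000100, 000110, 000120 or 000200; delete the rest.

should_keep() {
    case $1 in
        ????_0001[012]0_*.ps | ????_000200_*.ps) return 0 ;;
        *) return 1 ;;
    esac
}

# prune_ps DIR: delete unwanted .ps files in DIR.
# Set DRY_RUN=1 to print the rm commands instead of running them.
prune_ps() {
    local dir=$1 f
    for f in "$dir"/*.ps; do
        [[ -f $f ]] || continue
        should_keep "${f##*/}" && continue
        if [[ ${DRY_RUN:-0} == 1 ]]; then
            echo "rm -- $f"
        else
            rm -- "$f"
        fi
    done
}
```

Running DRY_RUN=1 prune_ps . first shows what would be deleted. Because the loop never parses ls output, whitespace in file names is not a problem.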

Substring removal in bash

I'm currently trying to get into bash regular expressions to change multiple filenames at the same time. Here are the file names:
a_001_D_xy_S37_L003_R1_001.txt
a_001_D_xy_S37_L003_R2_001.txt
a_002_D_xy_S37_L006_R1_001.txt
a_002_D_xy_S37_L006_R2_001.txt
a_003_D_xy_S23_L003_R1_001.txt
a_003_D_xy_S23_L003_R2_001.txt
I want this as my result:
a_002_D_xy_R1.txt
a_002_D_xy_R2.txt
...
I only want to change those with *001.txt at the end. First I want to remove the _S.._L00. in the filenames and the 001 in the end. I split this procedure in two parts:
for file in *001.txt;
do
echo ${file#_S.._L..6}
done
This loop already does not work. As a second alternative I tried:
for file in *001.fastq.gz;
do
echo ${file/_S.._L00./}
done
but the filenames are again unchanged. (I just use echo here to see the results. If it works I will replace it with mv ${file} ${regularexpression})
Thanks for help!
Considering that you need lots of different fields it is possibly better to just split the filename and then reconstruct it as you wish.
I suggest using an array built by splitting the original filename with _. Then you just reconstruct the new name by using the fields that you wish.
for file in *001.txt; do
echo "FILE: $file"
IFS='_' read -r -a fileFields <<< "$file"
echo "FILE FIELDS: "
for index in "${!fileFields[@]}"; do
echo "- $index ${fileFields[index]}"
done
fileName="${fileFields[0]}_${fileFields[1]}_${fileFields[2]}_${fileFields[3]}_${fileFields[-2]}.txt"
echo "NEW FILE NAME: $fileName"
# mv -- "$file" "$fileName"
done
The echo commands are just for debugging; you can remove them all once you understand the code.
However, if you really need to split the string using BASH expressions you can check this post:
Extracting part of a string to a variable in bash or take a look at this BASH cheat sheet.
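For reference, the attempts in the question fail because parameter expansion uses glob patterns rather than regular expressions: a "." is a literal dot, "?" matches any single character, and "#" only strips from the start of the string. A minimal sketch of the same rename with two expansions, assuming the _Sxx_Lxxx block always has these widths (short_name is a made-up helper name):

```bash
#!/usr/bin/env bash
# Sketch: derive the short name with parameter expansion only.
short_name() {
    local name=$1
    name=${name/_S??_L???/}      # drop the _Sxx_Lxxx block (glob: ? = any one character)
    name=${name%_001.txt}.txt    # drop the trailing _001 before the extension
    printf '%s\n' "$name"
}

for file in *_001.txt; do
    [[ -e $file ]] || continue   # the glob matched nothing
    echo mv -- "$file" "$(short_name "$file")"   # remove "echo" to actually rename
done
```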
Try writing a function; you first have to determine the number (n) of files.
n=$(ls *_001.txt | wc -l)
functionRename(){
for(( i=1; i <=n; i++))
do
file=$(ls *_001.txt | head -n $i | tail -n 1)
mv "${file}" "${file%_S??_*}${file#???????????????????}"
file2=$(ls *_001.txt | head -n $i | tail -n 1)
mv "${file2}" "${file2%_001*}.txt"
done
}
functionRename
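Since the question mentions regular expressions: bash does support them, but only inside [[ ... =~ ... ]], with the captured groups placed in the BASH_REMATCH array. A sketch under the assumption that all names follow the layout shown in the question; regex_name is a hypothetical helper.

```bash
#!/usr/bin/env bash
# Sketch: build the new name from regex capture groups.
regex_name() {
    # Group 1: everything before _S<digits>; group 2: the R<digits> field.
    local re='^(.*)_S[0-9]+_L[0-9]+_(R[0-9]+)_001\.txt$'
    if [[ $1 =~ $re ]]; then
        printf '%s\n' "${BASH_REMATCH[1]}_${BASH_REMATCH[2]}.txt"
    else
        return 1    # the name does not follow the expected layout
    fi
}
```

Unlike the fixed-width approach, this does not depend on the number of characters in front of the R field, and names that do not match are left alone.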

List down all sub-Directories in Bash based on some criteria

I'm writing a script which navigates all subdirs named something like 12, 98, etc., and checks that each one contains a runs subdir. The check is needed for subsequent operations in the script. How can I do that? I managed to write this:
# check that I am in a multi-grid directory, with "runs" subdirectories
for grid in ??; do
cd $grid
cd ..
done
However, ?? also matches stuff like LS, which is not correct. Any ideas on how to fix it?
Next step: in each directory named dd (digit/digit), I need to check that there is a subdirectory named runs, or exit with error. Any idea on how to do that? I thought of using find -type d -name "runs", but it looks recursively inside subdirs, which is wrong, and anyway if find doesn't find a match, I have no idea on how to catch that inside the script.
Loop over the directories, report the missing subdir:
for dir in [0-9][0-9]/ ; do
[[ -d $dir/runs ]] || { echo $dir ; exit 1 ; }
done
You can use character classes in glob patterns. The / (not \) after the pattern makes it match only directories, i.e. a file named 42 will be skipped.
The next line reads "$dir/runs is a directory, or report it". [[ ... ]] introduces a condition, see man bash for details. -d tests whether a directory exists. || is "or", you can rephrase the line as
if [[ ! -d $dir/runs ]] ; then
echo $dir
exit 1
fi
where ! stands for "not".
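Putting the pieces of this answer together, here is a sketch that reports every two-digit directory lacking runs before failing, rather than stopping at the first one. The function name check_runs is made up, and nullglob is enabled on the assumption that "no two-digit directories at all" should simply mean there is nothing to check.

```bash
#!/usr/bin/env bash
# Sketch: verify that every two-digit directory contains a "runs" subdirectory.
check_runs() {
    local dir missing=0 restore
    restore=$(shopt -p nullglob)   # remember the current nullglob setting
    shopt -s nullglob              # an unmatched pattern expands to nothing
    for dir in [0-9][0-9]/; do
        if [[ ! -d ${dir}runs ]]; then
            echo "missing runs subdirectory in ${dir%/}" >&2
            missing=1
        fi
    done
    $restore                       # put nullglob back the way it was
    return "$missing"
}

# Usage inside the script: check_runs || exit 1
```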
First find all directories named runs using:
find . -type d -name runs
Note: to restrict the search to one level, use find with -maxdepth.
From this output, extract the parent directory by removing the last path component, for example:
sed 's,/*[^/]\+/*$,,'
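A sketch of this find-based approach restricted to one level, assuming a find that supports -mindepth and -maxdepth (GNU and BSD find both do). Instead of the sed expression it strips the trailing /runs with parameter expansion; dirs_with_runs is a hypothetical name.

```bash
#!/usr/bin/env bash
# Sketch: list two-digit directories that contain a "runs" subdirectory.
dirs_with_runs() {
    local path
    # Depth 2 relative to "." means ./<dir>/runs and nothing deeper.
    find . -mindepth 2 -maxdepth 2 -type d -name runs -path './[0-9][0-9]/runs' |
    while IFS= read -r path; do
        printf '%s\n' "${path%/runs}"   # drop the trailing /runs
    done
}
```

An empty result can be caught in a script with a test such as [[ -n $(dirs_with_runs) ]], which addresses the "no idea how to catch that" part of the question.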

Regex match for file and rename + overwrite old file

I'm trying to write a bash script that renames files which match my regex. If a file matches, I want to rename it using the regex and overwrite the old existing file.
I want to do this because I have a file on computer 1 and change it on computer 2. Later, when I go back to computer 1, the sync reports a conflict and saves both versions.
Example file:
acl_cam.MYI
Example file after conflict:
acl_cam (Example conflit with .... on 2015-08-20).MYI
I tried a lot of things, like rename, mv and a couple of other scripts, but nothing worked.
The regex I should use, in my opinion:
(.*)/s\(.*\)\.(.*)
Then rename the file to value1 . value2, replacing the old file (acl_cam.MYI), and do this for all files and directories below the starting directory.
Can you help me with this one?
The issue you have, if I understand your question correctly, has two parts: (1) what is the correct regex that will match the error string and produce a filename, and (2) how to use the returned filename to move or remove the offending file?
If the string at issue is:
acl_cam (Example conflit with .... on 2015-08-20).MYI
and you need to return the MySQL file name, then a regex similar to the following will work:
[ ][(].*[)]
The stream editor sed is about as good as anything else to return the filename from your string. Example:
$ printf "acl_cam (Example conflit with .... on 2015-08-20).MYI\n" | \
sed -e 's/[ ][(].*[)]//'
acl_cam.MYI
(shown with line continuation above)
Then it is up to you how you move or delete the file. The remaining question is where is the information (the error string) currently stored and how do you have access to it? If you have a file full of these errors, then you could do something like the following:
while read -r line; do
victim=$( printf "%s\n" "$line" | sed -e 's/[ ][(].*[)]//' )
## to move the file to /path/to/old
[ -e "$victim" ] && mv "$victim" /path/to/old
done <$myerrorfilename
(you could also feed the string to sed as a here-string, but omitted for simplicity)
You could also just delete the file if that suits your purpose. However, more information is needed to clarify how/where that information is stored and what exactly you want to do with it to provide any more specifics. Let me know if you have further questions.
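The here-string variant mentioned above could look like the sketch below. The helper name original_name is hypothetical, and it assumes the conflict marker is a single space followed by a parenthesised note.

```bash
#!/usr/bin/env bash
# Sketch: strip " (...)" from a conflict file name using a here-string.
original_name() {
    sed -e 's/[ ][(].*[)]//' <<< "$1"
}

original_name 'acl_cam (Example conflit with .... on 2015-08-20).MYI'
# prints: acl_cam.MYI
```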
Final solution for this question for people who are interested:
for i in *; do
# Wildcard check: does the current file name contain "(Exemplaar"?
if [[ $i == *"(Exemplaar"* ]]
then
# Derive the original name (without the Exemplaar conflict note)
NewFileName=$(echo "$i" | sed -E -e 's/[ ][(].*[)]//')
# Remove the original file
rm -f -- "$NewFileName"
# Copy the conflict file to the original file name
cp -a -- "$i" "$NewFileName"
# Delete the conflict file
rm -- "$i"
echo "Replaced file: $NewFileName with: $i"
fi
done
I used this code to replace the database conflict files that Dropbox created when syncing between different computers.
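The loop above only handles the current directory, while the question asked for all files and directories below the starting point. A sketch of a recursive variant using find, assuming conflict copies always contain the text " (Exemplaar" as in the final solution; resolve_conflicts is a made-up name, and a single mv -f stands in for the rm/cp/rm sequence.

```bash
#!/usr/bin/env bash
# Sketch: replace each original file with its conflict copy, recursively.
resolve_conflicts() {
    local root=$1 path dir base target
    find "$root" -type f -name '* (Exemplaar*' -print0 |
    while IFS= read -r -d '' path; do
        dir=${path%/*}
        base=${path##*/}
        target=$dir/${base/ \(*\)/}     # drop " (...)" from the file name only
        mv -f -- "$path" "$target"      # overwrites the old original, if any
        echo "Replaced: $target with: $base"
    done
}

# Usage: resolve_conflicts .
```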

How to move many files in multiple different directories (on Linux)

My problem is that I have too many files in a single directory. I cannot "ls" the directory because it is too large. I need to move all files into a better directory structure.
I'm using the last 3 digits of the ID as folders, in reverse order.
For example, ID 2018972 should go in /2/7/9/img_2018972.jpg.
I've created the directories, but now I need help with the bash script. I know the IDs; they are in the range 1,300,000 - 2,000,000. But I can't handle regular expressions.
I want to move all files like this:
/images/folder/img_2018972.jpg -> /images/2/7/9/img_2018972.jpg
I will appreciate any help on this subject. Thanks!
EDIT: after explanations in the comments, the following assumptions apply:
filenames are in the form of img_<id>.jpg or img_<id>_<size>.jpg
the new dir is the reverse order of the three last digits of the id
using Bash:
for file in /images/folder/*.jpg; do
fname="${file%.*}" # remove extension and _<size>
[[ "$fname" =~ img_[0-9]+_[0-9]+$ ]] && fname="${fname%_*}"
last0="${fname: -1:1}" # last letter/digit
last1="${fname: -2:1}" # last but one letter/digit
last2="${fname: -3:1}" # last but two letter/digit
newdir="/images/$last0/$last1/$last2"
# optionally check if the new dir exists, if not create it
[[ -d "$newdir" ]] || mkdir -p "$newdir"
mv "$file" "$newdir"
done
If * can't handle it (although I think * in a for loop has no limits),
use find as suggested by @Michał Kosmulski in the comments:
while read -r; do
fname="${REPLY%.*}" # remove extension and _<size>
[[ "$fname" =~ img_[0-9]+_[0-9]+$ ]] && fname="${fname%_*}"
last0="${fname: -1:1}" # last letter/digit
last1="${fname: -2:1}" # last but one letter/digit
last2="${fname: -3:1}" # last but two letter/digit
newdir="/images/$last0/$last1/$last2"
# optionally check if the new dir exists, if not create it
[[ -d "$newdir" ]] || mkdir -p "$newdir"
mv "$REPLY" "$newdir"
done < <(find /images/folder/ -maxdepth 1 -type f -name "*.jpg")
find /images/folder -maxdepth 1 -type f | while IFS= read -r file
do
filelen=${#file}
((rootn=filelen-5)) # index of the last digit before ".jpg"
((midn=filelen-6))
((topn=filelen-7))
root=${file:$rootn:1}
mid=${file:$midn:1}
top=${file:$topn:1}
mkdir -p "/images/${root}/${mid}/${top}"
mv -- "$file" "/images/${root}/${mid}/${top}"
done
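The directory computation shared by these variants can be isolated into one function, which makes it easy to check before moving anything. A minimal sketch, assuming names of the form img_<id>.jpg or img_<id>_<size>.jpg with IDs of at least three digits; target_dir is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# Sketch: compute the destination directory for an image file name.
target_dir() {
    local re='^img_([0-9]+)(_[0-9]+)?\.jpg$' id
    [[ ${1##*/} =~ $re ]] || return 1      # not an image name we recognise
    id=${BASH_REMATCH[1]}
    # Last three digits of the ID, in reverse order.
    printf '/images/%s/%s/%s\n' "${id: -1}" "${id: -2:1}" "${id: -3:1}"
}

target_dir /images/folder/img_2018972.jpg       # prints: /images/2/7/9
target_dir /images/folder/img_2018972_200.jpg   # prints: /images/2/7/9
```

Inside one of the loops above it could be used as newdir=$(target_dir "$file") || continue, so that files with unexpected names are skipped instead of being moved to a wrong place.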