Regex doesn't work using loop to go through filenames - regex

I've been working on this for a while now and I can't seem to crack it. I have a folder containing these files:
testing.test.S09E01.720p.HDTV.x264-TWIST
dmscript.sh
I have the script in there to test it before I set it up with download manager to run once an episode has been downloaded.
The script has the following code.
#!/bin/sh
#THIS SCRIPT WILL DETERMINE TV SHOWS AND EPISODES AND MOVE THEM TO THE CORRECT
#FOLDER WHICH WILL ALLOW EITHER SICKBEARD OR COUCHPOTATO TO RENAME AND MOVE THEM
regex="([Ss]?([0-9]{1,2})[-|.]?[x|Ee|]?(\d{2})|(0?\d{1})(\d{2}))"
#where the tv shows will be copied to
tvdir="/volume1/public/NZB/Complete/tv/processed/"
#this was a debug variable to allow me to see where the code fails
num="1"
#will change the * to full path of folder later, for now this works because script is running from
#inside folder
for filename in *
do
if [[ "$filename =~ $regex ]]
then
#display the current filename and the variable number
echo "$filename $num"
#commented on the following code so when the script works I will just uncomment
#mv $filename $tvdir
#change variable to 2, this is to see whether the if test will fail and skip the file
#that doesn't conform to the regex
num="2"
else
echo "nothing of use"
done
However, once I run the code, I get this
testing.test.S09E01.720p.HDTV.x264-TWIST 1
dmscript.sh 2
Obviously something goes wrong, as I just want it to display the first file in the list above and ignore the other.
I got the regex rule from http://regex101.com/r/qZ2eO9/1 , I ignored the /gim at the end as I am unsure whether this will work in shell, and just stuck with the Ss and Ee so it isn't case sensitive

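A few modifications:
(1) Replace \d with [0-9] in the expression; the POSIX extended regular expressions used by [[ =~ ]] have no \d shorthand.
(2) Remove the stray quotation mark " in the line if [[ "$filename =~ $regex ]]
(3) Add fi at the end of the if block.
With these changes the script should work (run it with bash, since [[ ... =~ ]] is a Bash feature, not POSIX sh):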
regex="([Ss]?([0-9]{1,2})[-|.]?[x|Ee|]?([0-9]{2})|(0?[0-9]{1})([0-9]{2}))"
#where the tv shows will be copied to
tvdir="/volume1/public/NZB/Complete/tv/processed/"
#this was a debug variable to allow me to see where the code fails
num="1"
#will change the * to full path of folder later, for now this works because script is running from
#inside folder
for filename in *
do
if [[ $filename =~ $regex ]]
then
#display the current filename and the variable number
echo "$filename $num"
#commented on the following code so when the script works I will just uncomment
#mv $filename $tvdir
#change variable to 2, this is to see whether the if test will fail and skip the file
#that doesn't conform to the regex
num="2"
else
echo "nothing of use"
fi
done
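As a minimal sketch of the same idea, assuming Bash and a simplified SxxEyy pattern rather than the regex101 one, the helper below uses [0-9] instead of \d and reads the season and episode numbers from BASH_REMATCH. The function name episode_tag is made up for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print "season episode" when the name contains an SxxEyy tag.
# The pattern is a simplified assumption, not the pattern from the question.
episode_tag() {
    local name=$1
    local re='[Ss]([0-9]{1,2})[Ee]([0-9]{2})'
    if [[ $name =~ $re ]]; then
        # BASH_REMATCH[1] and [2] hold the two capture groups
        printf '%s %s\n' "${BASH_REMATCH[1]}" "${BASH_REMATCH[2]}"
    else
        return 1
    fi
}

episode_tag "testing.test.S09E01.720p.HDTV.x264-TWIST"   # prints: 09 01
episode_tag "dmscript.sh" || echo "nothing of use"       # prints: nothing of use
```

Keeping the regex in a variable and leaving it unquoted on the right-hand side of =~ is what makes Bash treat it as a pattern rather than a literal string.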

Related

Bash script with regex and capturing group

I'm working on a bash script to automatically rename files on my Synology NAS.
I have a loop over the files and everything was fine until I wanted to make my script more efficient with regex.
I have several bits of code which work as expected:
filename="${filename//[-_.,\']/ }"
filename="${filename//[éèēěëê]/e}"
But I have this:
filename="${filename//t0/0}"
filename="${filename//t1/1}"
filename="${filename//t2/2}"
filename="${filename//t3/3}"
filename="${filename//t4/4}"
filename="${filename//t5/5}"
filename="${filename//t6/6}"
filename="${filename//t7/7}"
filename="${filename//t8/8}"
filename="${filename//t9/9}"
And I would like to use a capture group to have something like this:
filename="${filename//t([0-9]{1,2})/\1}"
filename="${filename//t([0-9]{1,2})/${BASH_REMATCH[1]}}"
I've been looking for a working syntax without success...
The shell's parameter expansion facility does not support regular expressions. But you can approximate it with something like
filename=$(sed 's/t\([0-9]\)/\1/g' <<<"$filename")
This will work regardless of whether the first digit is followed by additional digits or not, so dropping that requirement simplifies the code.
If you want the last or all t[0-9]{1,2}s replaced:
$ filename='abt1cdt2eft3gh'; [[ "$filename" =~ (.*)t([0-9]{1,2}.*) ]] && filename="${BASH_REMATCH[1]}${BASH_REMATCH[2]}"; echo "$filename"
abt1cdt2ef3gh
$ filename='abt1cdt2eft3gh'; while [[ "$filename" =~ (.*)t([0-9]{1,2}.*) ]]; do filename="${BASH_REMATCH[1]}${BASH_REMATCH[2]}"; done; echo "$filename"
ab1cd2ef3gh
Note that the "replace all" case above would keep iterating until all t[0-9]{1,2}s are changed, even ones that didn't exist in the original input but were being created by the loop, e.g.:
$ filename='abtt123de'; while [[ "$filename" =~ (.*)t([0-9]{1,2}.*) ]]; do filename="${BASH_REMATCH[1]}${BASH_REMATCH[2]}"; echo "$filename"; done
abt123de
ab123de
whereas the sed script in @tripleee's answer would not do that:
$ filename='abtt123de'; filename=$(sed 's/t\([0-9]\)/\1/g' <<<"$filename"); echo "$filename"
abt123de
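For comparison, here is a small sketch that wraps the sed approach in a function and keeps the {1,2} quantifier from the question. It assumes a sed that accepts -E for extended regular expressions; the function name strip_t is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: drop a "t" that directly precedes one or two digits.
# Assumption: the installed sed supports -E (extended regular expressions).
strip_t() {
    printf '%s\n' "$1" | sed -E 's/t([0-9]{1,2})/\1/g'
}

strip_t 'abt1cdt2eft3gh'   # prints: ab1cd2ef3gh
strip_t 'abtt123de'        # prints: abt123de
strip_t 'disc t12 of 20'   # prints: disc 12 of 20
```

Because sed makes a single left-to-right pass, a "t" that only ends up in front of a digit after an earlier replacement is left alone, unlike in the while loop above.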

How do I get terminal in OSX recognise Regex as filenames?

In OS X terminal I have the following:
for filename in ^.* 2\.jpeg$; do printf "$filename\n"; done;
which I want to match filenames in the current folder ending in the string " 2.jpeg"
but it's not being recognised as Regex and it's not searching the current directory. It simply prints the two strings:
^.*
2\.jpeg$
obviously there's more I want to do with these files but I can't get it to match. Putting the regex in inverted commas doesn't seem to help either.
You need to use a glob pattern; a regex doesn't work in a for ... in ... construct. And don't print variables by using them as the printf format string; use echo or printf '%s\n' "$variable".
for filename in ./*' '2.jpeg; do
echo "$filename"
done
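If a real regex is needed, one option is to loop over a plain glob and filter inside the loop with [[ =~ ]]. The sketch below assumes Bash; list_matching is a made-up name, and the demo directory is created only for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list the files in a directory whose names match an ERE.
list_matching() {
    local dir=$1 re=$2 path name
    for path in "$dir"/*; do
        [[ -e $path ]] || continue    # glob matched nothing: skip the literal pattern
        name=${path##*/}              # strip the directory part
        if [[ $name =~ $re ]]; then
            printf '%s\n' "$name"
        fi
    done
}

# Demo in a throwaway directory
tmp=$(mktemp -d)
touch "$tmp/holiday 2.jpeg" "$tmp/a2.jpeg" "$tmp/2.jpeg"
list_matching "$tmp" ' 2\.jpeg$'      # prints: holiday 2.jpeg
rm -rf "$tmp"
```

The glob does the directory listing and the regex only decides which names to keep, so file names containing spaces never get split.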
You can do the following:
for filename in *2.jpeg; do echo ${filename}; done
This gives the following for me:
for filename in *2.jpeg; do echo ${filename}; done
2.jpeg
a2.jpeg
In a directory with 3 files:
touch 1.jpeg
touch a2.jpeg
touch 2.jpeg

Regex pattern that recognises file extension in Bash script not accurate to capture compressed files

I created this little Bash script that has one argument (a filename) and the script is supposed to respond according to the extension of the file:
#!/bin/bash
fileFormat=${1}
if [[ ${fileFormat} =~ [Ff][Aa]?[Ss]?[Tt]?[Qq]\.?[[:alnum:]]+$ ]]; then
echo "its a FASTQ file";
elif [[ ${fileFormat} =~ [Ss][Aa][Mm] ]]; then
echo "its a SAM file";
else
echo "its not fasta nor sam";
fi
It's run like this:
sh script.sh filename.sam
If it's a fastq (or FASTQ, or fq, or FQ, or fastq.gz (compressed)) I want the script to tell me "it's a fastq". If it's a sam, I want it to tell me it's a sam, and if not, I want it to tell me it's neither sam nor fastq.
THE PROBLEM: when I didn't consider the .gz (compressed) scenario, the script ran well and gave the result I expected, but something is happening when I try to add that last part to account for that situation (see third line, the part where it says .?[[:alnum:]]+ ). This part is meant to say "in the filename, after the extension (fastq in this case), there might be a dot plus some word afterwards".
My input is this:
sh script.sh filename.fastq.gz
And it works. But if I put:
sh script.sh filename.fastq
It says it's not fastq. I wanted to put that last part as optional, but if I add a "?" at the end it doesn't work. Any thoughts? Thanks! My question would be to fix that part in order to work for both cases.
You may use this regex:
fileFormat="$1"
if [[ $fileFormat =~ [Ff]([Aa][Ss][Tt])?[Qq](\.[[:alnum:]]+)?$ ]]; then
echo "its a FASTQ file"
elif [[ $fileFormat =~ [Ss][Aa][Mm]$ ]]; then
echo "its a SAM file"
else
echo "its not fasta nor sam"
fi
Here (\.[[:alnum:]]+)? makes the last group optional; that group is a dot followed by one or more alphanumeric characters.
When you run it as:
./script.sh filename.fastq
its a FASTQ file
./script.sh fq
its a FASTQ file
./script.sh filename.fastq.gz
its a FASTQ file
./script.sh filename.sam
its a SAM file
./script.sh filename.txt
its not fasta nor sam
The immediate problem is that you are requiring at least one [[:alnum:]] character after .fastq. This is easy to fix per se with * instead of +.
Regex is not a particularly happy solution to this problem, though.
case $fileFormat in
*.[Ff][Aa][Ss][Tt][Qq] | *.[Ff][Aa][Ss][Tt][Qq].*)
echo "$0: $fileFormat is a FASTQ file" >&2 ;;
*.[Ss][Aa][Mm] )
echo "$0: $fileFormat is a SAM file" >&2 ;;
esac
is portable all the way back to the original Bourne sh. In Bash 4.x you could lowercase the filename before the comparison so as to simplify the glob patterns.
Notice also how the diagnostics contain the name of the script and print to standard error instead of standard output.
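The lowercasing idea could look roughly like this. It is only a sketch: it needs Bash 4.0 or later for ${var,,}, the function name classify is invented, and the extra .fq patterns are an assumption based on the extensions listed in the question.

```shell
#!/usr/bin/env bash
# Hypothetical helper: classify a file name by extension, case-insensitively.
# Requires Bash 4.0+ for the ${var,,} lowercase expansion.
classify() {
    local lower=${1,,}
    case $lower in
        *.fastq | *.fq | *.fastq.* | *.fq.*) echo fastq ;;
        *.sam)                               echo sam ;;
        *)                                   echo unknown ;;
    esac
}

classify filename.FASTQ.gz   # prints: fastq
classify reads.fq            # prints: fastq
classify aligned.SAM         # prints: sam
classify notes.txt           # prints: unknown
```

After lowercasing, each glob only has to be written once instead of with a [Ff][Aa]... bracket for every letter.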

How to capture the beginning of a filename using a regex in Bash?

I have a number of files in a directory named test_file_name_edit, each containing a ? in its name. I want to use a Bash script (edit_file_names.sh) to cut each file name off right before the ?. For example, these would be my current filenames:
test.file.1?twagdsfdsfdg
test.file.2?
test.file.3?.?
And these would be my desired filenames after running the script:
test.file.1
test.file.2
test.file.3
However, I can't seem to capture the beginning of the filenames in my regex to use in renaming the files. Here is my current script:
#!/bin/bash
cd test_file_name_edit/
regex="(^[^\?]*)"
for filename in *; do
$filename =~ $regex
echo ${BASH_REMATCH[1]}
done
At this point I'm just attempting to print off the beginnings of each filename so that I know that I'm capturing the correct string, however, I get the following error:
./edit_file_names.sh: line 7: test.file.1?twagdsfdsfdg: command not found
./edit_file_names.sh: line 7: test.file.2?: command not found
./edit_file_names.sh: line 7: test.file.3?.?: command not found
How can I fix my code to successfully capture the beginnings of these filenames?
Regex as such may not be the best tool for this job. Instead, I'd suggest using bash parameter expansion. For example:
#!/bin/bash
files=(test.file.1?twagdsfdsfdg test.file.2? test.file.3?.?)
for f in "${files[@]}"; do
echo "${f} shortens to ${f%%\?*}"
done
which prints
test.file.1?twagdsfdsfdg shortens to test.file.1
test.file.2? shortens to test.file.2
test.file.3?.? shortens to test.file.3
Here, ${f%%\?*} expands f and trims the longest suffix that matches a ? followed by any characters (the ? must be escaped since it's a wildcard character).
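To go from printing to actually renaming, something along these lines might work. This is a sketch under assumptions: the function name strip_after_qmark is made up, it skips a file rather than overwrite an existing target, and the demo directory exists only for the example.

```shell
#!/usr/bin/env bash
# Hypothetical helper: rename every file in a directory whose name contains "?"
# to the part of the name that comes before the first "?".
strip_after_qmark() {
    local dir=$1 path name short
    for path in "$dir"/*\?*; do          # \? matches a literal question mark
        [[ -e $path ]] || continue       # no matching files at all
        name=${path##*/}
        short=${name%%\?*}               # trim the longest suffix starting at "?"
        [[ -n $short ]] || continue      # name began with "?": nothing left to keep
        if [[ -e $dir/$short ]]; then
            printf 'skipping %s: %s already exists\n' "$name" "$short" >&2
            continue
        fi
        mv -- "$path" "$dir/$short"
    done
}

# Demo in a throwaway directory
tmp=$(mktemp -d)
touch "$tmp/test.file.1?twagdsfdsfdg" "$tmp/test.file.2?" "$tmp/test.file.3?.?"
strip_after_qmark "$tmp"
ls "$tmp"    # lists test.file.1, test.file.2 and test.file.3
rm -rf "$tmp"
```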
You are missing the test command [[ ]]:
for filename in *; do
[[ $filename =~ $regex ]] && echo "${BASH_REMATCH[1]}"
done

Regular expressions don't work as expected in bash if-else block's condition

My pattern defined to match in if-else block is :
pat="17[0-1][0-9][0-9][0-9].AUG"
nln=""
In my script, I'm taking user input which needs to be matched against the pattern, which if doesn't match, appropriate error messages are to be shown. Pretty simple, but giving me a hard time though. My code block from the script is this:
echo "How many days' AUDIT Logs need to be searched?"
read days
echo "Enter file name(s)[For multiple files, one file per line]: "
for(( c = 0 ; c < $days ; c++))
do
read elements
if [[ $elements =~ $pat ]];
then
array[$c]="$elements"
elif [[ $elements =~ $nln ]];
then
echo "No file entered.Run script again. Exiting"
exit;
else
echo "Invalid filename entered: $elements.Run script again. Exiting"
exit;
fi
done
The format I want from the user for filenames to be entered is this:
170402.AUG
So basically yymmdd.AUG (where y = year, m = month, d = day); trailing or leading spaces are fine. Anything other than that should throw the "Invalid filename entered: $elements.Run script again. Exiting" message. Also, I want to check whether it is a blank line (just "Enter" hit), in which case it should give an error saying "No file entered.Run script again. Exiting"
However my code, even if I enter something like "xxx" as filename, which should be throwing "Invalid filename entered: $elements.Run script again. Exiting", is actually checking true against a blank line, and throwing "No file entered.Run script again. Exiting"
Need some help with handling the regular expressions' check with user input, as otherwise rest of my script works just fine.
I think, as discussed in the comments, you are confusing a glob match with a regex match. What you have defined as pat is a glob pattern, which needs to be compared with the == operator:
pat="17[0-1][0-9][0-9][0-9].AUG"
string="170402.AUG"
[[ $string == $pat ]] && printf "Match success\n"
The equivalent =~ match would be something like
pat="17[[:digit:]]{4}\.AUG"
[[ $string =~ $pat ]] && printf "Match success\n"
As you can see, the . in the regex has been escaped to strip its special meaning (match any character) so that it matches only a literal dot. The POSIX character class [[:digit:]] with the repetition count {4} matches 4 digits, followed by .AUG. Note that =~ matches anywhere in the string unless the pattern is anchored with ^ and $.
And for the empty-string check, do as suggested in the comments by Cyrus or by Benjamin.W:
[[ $elements == "" ]]
(or)
[[ -z $elements ]]
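Putting the two checks together could look like the sketch below. It assumes the empty check should come first, since an empty regex (the nln variable in the question) matches every string, which would explain why "xxx" was reported as a blank line. The name check_name and the anchored pattern are illustrative, not taken from the original script.

```shell
#!/usr/bin/env bash
# Hypothetical helper: classify one line of user input as "empty", "valid" or "invalid".
check_name() {
    local input=$1
    # Anchored, with optional surrounding whitespace (allowed by the question)
    local re='^[[:space:]]*17[0-9]{4}\.AUG[[:space:]]*$'
    if [[ -z ${input//[[:space:]]/} ]]; then
        echo empty          # nothing left after removing all whitespace
    elif [[ $input =~ $re ]]; then
        echo valid
    else
        echo invalid
    fi
}

check_name "170402.AUG"      # prints: valid
check_name "  170402.AUG "   # prints: valid
check_name "xxx"             # prints: invalid
check_name ""                # prints: empty
```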
I would not bug the user with "how many days" (who wants to count 15 days or the like?). Also, why only one file per line? You should help the users, not bug them like microsoft...
For a start:
show_help() { cat <<'EOF'
bla bla....
EOF
}
show_files() { echo "${#files[@]} valid files entered: ${files[@]}"; }
while read -r -p 'files? (h-help)> ' line
do
case "$line" in
q) echo "quitting..." ; exit 0 ;;
h) show_help ; continue;;
'') (( ${#files[@]} )) && show_files; continue ;;
l) show_files ; continue ;;
p) (( ${#files[@]} )) && break || { echo "No files entered.. quitting" ; exit 1; } ;; # go to processing
esac
# select (grep) the valid patterns from the entered line
# and append them to the array
# using -P (if your grep supports it) you can construct very complex regexes
files+=( $(grep -oP '17\d{4}\.\w{3}' <<< "$line") )
done
echo "processing files ${files[@]}"
Using such logic you can build a really powerful and user-friendly app. Also, you can pass -e to read to enable the readline functions (cursor keys and the like)...
But :) Consider just creating a simple script which accepts arguments, without any dialogs and such. Example:
myscript -h
same as above, or some longer help text
myscript 170402.AUG 170403.AUG 170404.AUG 170405.AUG
will do whatever it should do with the files. Main benefit, you could use globbing in the filenames, like
myscript 1704*
and so on...
And if you really want the dialog, it could show it when someone runs the script without any argument, e.g.:
myscript
will run in interactive mode...
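The argument-handling part of such a script might be sketched as follows. Everything here is illustrative: valid_names is a made-up function, the pattern is the 17dddd.AUG form discussed above, and a real script would call it with "$@" instead of the sample arguments.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: keep only the arguments that look like 17dddd.AUG,
# warning about the rest on standard error.
valid_names() {
    local re='^17[0-9]{4}\.AUG$' arg
    for arg in "$@"; do
        if [[ $arg =~ $re ]]; then
            printf '%s\n' "$arg"
        else
            printf 'skipping invalid name: %s\n' "$arg" >&2
        fi
    done
}

# Example call; in a real script this would be: valid_names "$@"
valid_names 170402.AUG notes.txt 170403.AUG
# stdout: 170402.AUG and 170403.AUG, one per line; stderr: a warning about notes.txt
```

Because the shell expands globs before the script starts, a call such as myscript 1704* hands the function one argument per matching file with no extra work.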