Use regular expression in linux bash to change output filename - regex

I have downloaded a few epub files and I need to convert them to epub again so that my ebook reader can read them.
I can do conversion in batch fairly easily using R as below:
setwd('~/Downloads/pubmed')
epub.files = list.files('./',full.names = TRUE,pattern = 'epub$')
for (loop in (1:length(epub.files))) {
command = paste('ebook-convert ',
epub.files[loop],
gsub('\\.epub','.mod.epub',epub.files[loop]))
system(command)
}
But I don't know how to do it in bash. Specifically, I don't know (i) how to assign a variable within a for loop, and (ii) how to use a regular expression to replace a string in bash.
Can anyone help? Thanks.

You can also use bash's parameter substitution:
for i in *.epub; do
ebook-convert "$i" "${i/%.epub/.mod.epub}"
done
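As a minimal sketch of the two sub-questions: a variable is assigned inside a loop with an ordinary name=value, and a fixed suffix can be swapped with parameter expansion rather than a regex. The function name mod_name is made up for illustration, and the loop only prints the command instead of calling ebook-convert.

```shell
#!/usr/bin/env bash
# Hypothetical helper: derive the output name for one input file.
mod_name() {
    local in=$1
    # ${in%.epub} strips a trailing ".epub"; no regex is needed for this.
    printf '%s\n' "${in%.epub}.mod.epub"
}

# Assigning a variable inside a for loop is an ordinary assignment.
for f in "my book.epub" "paper.v2.epub"; do
    out=$(mod_name "$f")
    echo "would run: ebook-convert \"$f\" \"$out\""
done
```

Quoting "$f" and "$out" keeps names with spaces intact when the real command is substituted for the echo.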

You can use find and sed:
cd ~/Downloads/pubmed
for f in $(find . -regex '.*\.epub'); do
ebook-convert "$f" "$(echo "$f" | sed 's/\.epub$/.mod.epub/')"
done

Not sure what ebook-convert is, but if you're trying to rename the files, try the following. Paste it into a file with the .sh extension (to signify a shell script) and make sure it is executable (chmod +x your-file.sh).
#!/bin/bash
FILES=~/Downloads/pubmed/*.epub
for f in $FILES
do
# $f stores the current file name, =~ is the regex operator
# only rename non-modified epub files
if [[ ! "$f" =~ \.mod\.epub$ ]]
then
echo "Processing $f file..."
# take action on each file
mv "$f" "${f%.*}.mod.epub"
fi
done
You will need bash version 3 or greater for regex support. This can also be implemented without regular expressions.
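A regex-free variant of the same check might look like the sketch below. Inside [[ ]], the right-hand side of == and != is treated as a glob pattern when left unquoted. The helper name needs_rename is hypothetical, and the loop prints the intended rename rather than running mv.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed only for .epub files not already named *.mod.epub.
needs_rename() {
    [[ $1 == *.epub && $1 != *.mod.epub ]]
}

for f in a.epub a.mod.epub notes.txt; do
    if needs_rename "$f"; then
        echo "$f -> ${f%.epub}.mod.epub"
    else
        echo "skip $f"
    fi
done
```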

You can use GNU parallel in combination with find:
find ~/Downloads/pubmed -name '*.epub' | parallel --gnu ebook-convert {} {.}.mod.epub
It should be available on most distributions and may have speed advantages over an ordinary loop if you process a large number of files. Although speed was not part of the original question...

Related

bash compare regex expression and not the variable

I want to remove all files whose names contain a given substring and ignore the files that do not contain it, so I tried a regular expression:
str=9009
patt=*v[0-9]{3,}*.txt
for i in "${patt}"; do echo "$i"
if ! [[ "$i" =~ $str ]]; then rm "$i" ; fi
done
but I got an error :
*v[0-9]{3,}*.txt
rm: cannot remove '*v[0-9]{3,}*.txt': No such file or directory
The file names look like this: mari_v9009.txt femme_v9009.txt mari_v9010.txt femme_v9010.txt
bash filename expansion does not use regular expressions. See https://www.gnu.org/software/bash/manual/bash.html#Filename-Expansion
To find files with "v followed by 3 or more digits followed by .txt" you'll have to use bash's extended pattern matching.
A demonstration:
$ shopt -s extglob
$ touch mari_v9009.txt femme_v9009.txt mari_v9010.txt femme_v9010.txt
$ touch foo_v12.txt
$ for f in *v[0-9][0-9]+([0-9]).txt; do echo "$f"; done
femme_v9009.txt
femme_v9010.txt
mari_v9009.txt
mari_v9010.txt
What you have with this pattern for i in *v[0-9]{3,}*.txt is:
first, bash performs brace expansion which results in
for i in *v[0-9]3*.txt *v[0-9]*.txt
then, the first word *v[0-9]3*.txt results in no matches, and the default behaviour of bash is to leave the pattern as a plain string. rm tries to delete the file named literally "*v[0-9]3*.txt" and that gives you the "file not found error"
next, the second word *v[0-9]*.txt gets expanded, but the expansion will include files you don't want to delete.
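The brace-expansion step described above can be made visible on its own. The sketch below turns filename expansion off with set -f, so only the brace expansion is applied and the two resulting words are printed as-is. The function name show_brace_expansion is made up for this example.

```shell
#!/usr/bin/env bash
# Show what brace expansion alone does to the pattern.
# set -f turns off filename expansion so the intermediate words stay visible.
show_brace_expansion() {
    set -f
    printf '%s\n' *v[0-9]{3,}*.txt
    set +f
}

show_brace_expansion
```

The two printed words are the same ones listed in the explanation: one containing a literal 3, and one with the braces replaced by nothing.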
I missed the not from the question.
Try this: within [[ ... ]], the == and != operators are pattern-matching operators, and extended globbing is enabled by default.
keep_pattern='*v[0-9][0-9]+([0-9]).txt'
for file in *; do
if [[ $file != $keep_pattern ]]; then
echo rm "$file"
fi
done
But find would be preferable here, if it's OK to descend into subdirectories:
find . -regextype posix-extended '!' -regex '.*v[0-9]{3,}\.txt' -print
# ...............................^^^
If that returns the files you expect to delete, change -print to -delete
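If the regex syntax from the question is what you want to keep, bash can apply it to each name with the =~ operator instead of treating it as a glob. This is only a sketch: the helper name has_version is hypothetical, the file names are the samples from the question, and it assumes the goal is to act on files whose names contain $str. It prints candidates rather than deleting anything.

```shell
#!/usr/bin/env bash
# Hypothetical helper: true when the name ends in "v", 3+ digits, ".txt".
has_version() {
    local re='v[0-9]{3,}\.txt$'
    [[ $1 =~ $re ]]
}

str=9009
for f in mari_v9009.txt femme_v9010.txt foo_v12.txt; do
    # Only names that match the regex AND contain $str are candidates.
    if has_version "$f" && [[ $f == *"$str"* ]]; then
        echo "would remove: $f"
    fi
done
```

Keeping the regex in a variable and leaving it unquoted on the right of =~ avoids quoting surprises across bash versions.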
You need to remove the quotes in the for loop. Then the filename globs will be interpreted:
for i in ${patt}; do echo "$i"
I assume that you are using Python.
I have tested your regex code, and found the * character unnecessary.
The following seems to work fine: v[0-9]{3,}.txt
Can you please elaborate some more on the issue?
Thanks,
Bren.
I just piped the error message to /dev/null. This worked for me:
#!/bin/bash
str=9009
patt=*v[0-9]{3,}*.txt
rm $(eval ls $patt 2> /dev/null | grep $str)
This is not regex, this is globbing. Take a look what gets expanded:
# echo *v[0-9]{3,}*.txt
*v[0-9]3*.txt femme_v9009.txt femme_v9010.txt mari_v9009.txt mari_v9010.txt
*v[0-9]3*.txt obviously doesn't exist. Can you clarify which files you are trying to match with {3,}? Otherwise leave it out and it will match the kind of filenames you have specified.
http://tldp.org/LDP/abs/html/globbingref.html

Bash: Reposition substring within a string

I have some files which I would like to rename. The filenames look like below:
C18-02B-NEB-sktrim_1-20000000.fq
C18-02B-NEB-sktrim_1-30000000.fq
C18-02B-NEB-sktrim_1-50000000.fq
C18-02B-NEB-sktrim_2-20000000.fq
C18-02B-NEB-sktrim_2-30000000.fq
...
I would like to reposition the _digit part to before .fq like so.
C18-02B-NEB-sktrim-20000000_1.fq
C18-02B-NEB-sktrim-30000000_1.fq
C18-02B-NEB-sktrim-50000000_1.fq
C18-02B-NEB-sktrim-20000000_2.fq
C18-02B-NEB-sktrim-30000000_2.fq
...
I am able to capture my substring of interest as such:
find * | egrep -o '_[0-9]'
_1
_1
_1
_2
_2
I could also remove the substring from the string as such:
find * | sed 's/_[0-9]//'
C18-02B-NEB-sktrim-20000000.fq
C18-02B-NEB-sktrim-30000000.fq
C18-02B-NEB-sktrim-50000000.fq
C18-02B-NEB-sktrim-20000000.fq
but I am not sure how to move it over to the new position and then rename the files.
Use capture groups, e.g.:
sed 's/\(.*\)sktrim\(_[0-9]*\)\(.*\)\.fq/\1sktrim\3\2.fq/'
The Perl rename utility can turn this sed expression and a set of files into the corresponding set of renames.
Because rename performs those renames itself through the rename(2) system call rather than launching /bin/mv for each file, it will be faster than generating your own mv commands and running them from the shell.
This should do it:
find . -name '*.fq' -exec sh -c 'mv "$0" "$(echo "$0" |sed "s/^\(.*\)\(_[0-9]\)\(.*\)\.fq$/\1\3\2.fq/")"' {} \;
sed part:
sed "s/^\(.*\)\(_[0-9]\)\(.*\)\.fq$/\1\3\2.fq/"
Explanation:
find . -name '*.fq' searches for the glob pattern *.fq and then the -exec option executes a mv command per each file found.
The sh -c 'mv "$0" "$var"' {} construct is just a mv command with two arguments: find replaces {} with the name of the file it found, and the inner shell receives that name as $0.
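A variation on the same idea, offered as a sketch rather than a drop-in replacement: the file names can be passed as positional parameters ("$@") after a placeholder $0, which lets find hand over many files per inner shell with {} +. The function name rename_fq_in is hypothetical, and the stricter pattern (a dash and digits after the _digit tag) is an assumption based on the sample names in the question.

```shell
#!/usr/bin/env bash
# Sketch: rename every *.fq under a directory, passing names as "$@"
# to the inner shell instead of substituting them into $0.
rename_fq_in() {
    find "$1" -type f -name '*.fq' -exec sh -c '
        for path in "$@"; do
            new=$(printf "%s\n" "$path" |
                  sed "s/^\(.*\)\(_[0-9]\)\(-[0-9]*\)\.fq\$/\1\3\2.fq/")
            if [ "$path" != "$new" ]; then
                mv -- "$path" "$new"
            fi
        done
    ' sh {} +
}

# Try it on throwaway files in a temporary directory.
demo=$(mktemp -d)
touch "$demo/C18-02B-NEB-sktrim_1-20000000.fq" "$demo/C18-02B-NEB-sktrim_2-30000000.fq"
rename_fq_in "$demo"
ls "$demo"
```

Names that do not match the pattern come back from sed unchanged, so the comparison before mv leaves them alone.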
If file renaming is all you want, better use tools that are exclusively for file renaming though. rename is a pretty popular tool to do this kind of stuff, but I got my own tool: rnm.
With rnm, you can do:
rnm -rs '/^(.*)(_\d)(.*)\.fq$/\1\3\2.fq/' *.fq
The exact same BRE regex as in the sed command can no longer be used with rnm: BRE support was dropped in favor of PCRE2.
Using rename
rename 's/sktrim(_\d+)(.*)\.fq/sktrim$2$1.fq/' *.fq
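If neither rename nor rnm is installed, the same capture-group idea can be sketched in plain bash with =~ and the BASH_REMATCH array. The helper name move_tag is made up for this example, the pattern assumes the name shape shown in the question, and the loop is a dry run that prints mv commands instead of executing them.

```shell
#!/usr/bin/env bash
# Hypothetical helper: move the "_<digit>" tag to just before ".fq".
# Prints the name unchanged when it does not have the expected shape.
move_tag() {
    local re='^(.*)(_[0-9])(-[0-9]+)\.fq$'
    if [[ $1 =~ $re ]]; then
        printf '%s\n' "${BASH_REMATCH[1]}${BASH_REMATCH[3]}${BASH_REMATCH[2]}.fq"
    else
        printf '%s\n' "$1"
    fi
}

for f in C18-02B-NEB-sktrim_1-20000000.fq C18-02B-NEB-sktrim_2-30000000.fq; do
    echo "mv \"$f\" \"$(move_tag "$f")\""   # dry run: print instead of renaming
done
```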

bash script with simple regular expression

Consider the following bash script with a simple regular expression:
for f in "$FILES"
do
echo $f
sed -i '/HTTP|RT/d' $f
done
This script should read every file in the directory specified by FILES and remove the lines that contain 'HTTP' or 'RT'. However, it seems that the OR part of the regular expression is not working. That is, if I just have sed -i '/HTTP/d' $f, it removes all lines containing HTTP, but I cannot get it to remove lines containing either HTTP or RT.
What must I change in my regular expression so that lines with HTTP or RT are removed?
Thanks in advance!
Two ways of doing it (at least):
Having sed understand your regex (-E enables extended regular expressions, where | is alternation):
sed -E -i '/HTTP|RT/d' "$f"
Specifying each token separately:
sed -i '/HTTP/d;/RT/d' "$f"
Before you do anything, run with the opposite, and PRINT what you plan to DELETE:
sed -n -e '/HTTP/p' -e '/RT/p' "$f"
Just to be sure you are deleting only what you want to delete before actually changing the files.
"It's not a question of whether you are paranoid or not, but whether you are paranoid ENOUGH."
Well, first of all, it will process all WORDS in the FILES variable.
If you want it to do all files in the FILES directory, then you need something like this:
for f in $( find $FILES -maxdepth 1 -type f )
do
echo $f
sed -i -e '/HTTP/d' -e '/RT/d' $f
done
You just need two "-e" options to sed.
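The alternation can be checked on a few sample lines before any real file is touched. The sketch below filters standard input instead of editing in place. The function name drop_http_rt and the sample lines are made up for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: drop lines containing HTTP or RT from stdin.
# -E switches sed to extended regular expressions, where | means "or".
drop_http_rt() {
    sed -E '/HTTP|RT/d'
}

printf '%s\n' 'keep me' 'see HTTP link' 'RT @someone' 'also kept' | drop_http_rt
```

Only the two lines without HTTP or RT survive. Once the output looks right, the same expression can be used with -i on the real files.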

BASH script to *create* filenames with spaces from filenames with "%20"

First, I know this sounds ass backwards. It is. But I'm looking to convert (on the BASH command line) a bunch of script-generated thumbnail filenames that do have a "%20" in them to the equivalent filenames with spaces instead. In case you're curious, the reason is that the script I'm using created the thumbnail filenames from their current URLs, and it added the %20 in the process. But now WordPress is looking for files like "This%20Filename.jpg" and the browser is, of course, removing the escape sequence and replacing it with a space. Which is why one shouldn't have spaces in filenames.
But since I'm stuck here, I'd love to convert my existing thumbnails over. Next, I will post a question for help fixing the problem in the script mentioned above. What I'm looking for now is a quick script to do the bad thing and create filenames with spaces out of filenames with "%20"s.
Thanks!
If you only want to replace each literal %20 with one space:
for i in *%20*; do
mv "$i" "${i//\%20/ }"
done
(for instance this will rename file%with%20two%20spaces to file%with two spaces).
You'll probably need to apply %25->% too though, and other similar transforms.
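For the more general case mentioned above (%25 and other escapes), one possible sketch decodes every %HH sequence in a single pass so that decoded text is never decoded twice. The function name urldecode is hypothetical. It relies on bash's =~ operator and on printf '%b' understanding \xHH, and it leaves a % alone when it is not followed by two hex digits.

```shell
#!/usr/bin/env bash
# Hypothetical helper: decode every %HH escape in a string.
# A "%" that is not followed by two hex digits is left untouched.
urldecode() {
    local s=$1 out='' ch
    local re='^(.*)%([0-9A-Fa-f]{2})(.*)$'
    # The greedy (.*) makes each pass pick the LAST escape, so decoded
    # text is collected in "out" and never scanned a second time.
    while [[ $s =~ $re ]]; do
        printf -v ch '%b' "\\x${BASH_REMATCH[2]}"
        out="${ch}${BASH_REMATCH[3]}${out}"
        s=${BASH_REMATCH[1]}
    done
    printf '%s\n' "${s}${out}"
}

urldecode 'This%20Filename.jpg'
urldecode '100%25%20done.jpg'
```

In a rename loop this could be used as mv "$f" "$(urldecode "$f")", ideally after printing the planned renames once to check them.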
convmv can do this, no script needed.
$ ls
a%20b.txt
$ convmv --unescape *.txt --notest
mv "./a%20b.txt" "./a b.txt"
Ready!
$ ls
a b.txt
Personally, I don't like file names with spaces; beware that you will have to treat them specially in future scripts. Anyway, here is a script that does what you want to achieve.
#!/bin/sh
for fname in *%20*
do
newfname=$(echo "$fname" | sed 's/%20/ /g')
mv "$fname" "$newfname"
done
Save this to a file, add execute permission, and run it from the directory that contains the files with %20 in their names.
Code:
#!/bin/bash
# Note: this replaces %20 inside the contents of each file; it does not rename files.
# This is where your files currently are
DPATH="/home/you/foo/*.txt"
# This is where backup copies of the original files will be stored
BPATH="/home/you/new_foo"
TFILE="/tmp/out.tmp.$$"
mkdir -p "$BPATH"
for f in $DPATH
do
if [ -f "$f" ] && [ -r "$f" ]; then
/bin/cp -f "$f" "$BPATH"
sed "s/%20/ /g" "$f" > "$TFILE" && mv "$TFILE" "$f"
else
echo "Error: Cannot read $f"
fi
done
/bin/rm -f "$TFILE"
Not bash, but for the more general case of %hh (encoded hex) in names.
#!/usr/bin/perl
foreach $c(@ARGV){
$d=$c;
$d=~s/%([a-fA-F0-9][a-fA-F0-9])/my $a=pack('C',hex($1));$a="\\$a"/eg;
print `mv $c $d` if ($c ne $d);
}

Regexp for extensions tgz, tar.gz, TGZ and TAR.GZ

I'm trying to get a regexp (in bash) to identify files with only the following extensions:
tgz, tar.gz, TGZ and TAR.GZ.
I tried several but can't get it to work.
I'm using this regexp to select only files with those extensions so I can do some work with them:
if [ -f $myregexp ]; then
.....
fi
thanks.
Try this:
#!/bin/bash
# no case match
shopt -s nocasematch
matchRegex='\.(tgz|tar\.gz)$'
for f in *
do
# display filtered files; the regex must be unquoted, or bash matches it as a literal string
[[ -f "$f" ]] && [[ "$f" =~ $matchRegex ]] && echo "$f";
done
I have found an elegant way of doing this:
shopt -s nocasematch
for file in *;
do
[[ "$file" =~ \.(tar\.gz|tgz)$ ]] && echo "$file"
done
This may suit you since you seem to want to use if with a bash regex. The =~ operator checks whether a string matches a given regular expression. Also, shopt -s nocasematch has to be set to perform a case-insensitive match.
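To use the match inside an if, as the question asks, the test can be wrapped in a small function. This sketch lower-cases the name with ${1,,} (which assumes bash 4 or newer) instead of changing the global nocasematch setting. The helper name is_tarball and the sample names are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: true for names ending in .tgz or .tar.gz, any case.
is_tarball() {
    local re='\.(tgz|tar\.gz)$'
    # ${1,,} lower-cases the name (bash 4+), so nocasematch is not needed.
    [[ ${1,,} =~ $re ]]
}

for f in backup.tgz SITE.TAR.GZ notes.txt star.gz; do
    if is_tarball "$f"; then
        echo "match: $f"
    fi
done
```

Escaping both dots keeps a name such as star.gz from being accepted by accident.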
Use this pattern
.*\.{1}(tgz|tar\.gz)
But how to make a regular expression case-insensitive? It depends on the language you use. In JavaScript they use /pattern/i, in which, i denotes that the search should be case-insensitive. In C# they use RegexOptions enumeration.
Depends on where you want to use this regex. If with GREP, then use egrep with -i parameter, which stands for "ignore case"
egrep -i "\.(tgz|tar\.gz)$"
Write 4 regexes, and check whether the file name matches any of them. Or write 2 case-insensitive regexes.
This way the code will be much more readable (and easier) than writing 1 regex.
You can even do it without a regex (a bit wordy though):
for f in *.[Tt][Gg][Zz] *.[Tt][Aa][Rr].[Gg][Zz]; do
echo $f
done
In bash? Use curly brackets, *.{tar.gz,tgz,TAR.GZ,TGZ} or even *.{t{ar.,}gz,T{AR.,}GZ}. Thus, ls -l *.{t{ar.,}gz,T{AR.,}GZ} on the command-line will do a detailed listing of all files with the matching extensions.
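One caveat with the glob and brace approaches is that a pattern with no matches is passed through as a literal string. A possible sketch that avoids this uses the nullglob and nocaseglob shell options, set inside a subshell so they do not affect the caller. The function name list_tarballs is made up for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the tarballs directly inside directory $1.
list_tarballs() (
    # The ( ) body runs in a subshell, so the shopt changes do not leak out.
    shopt -s nullglob nocaseglob
    for f in "$1"/*.tgz "$1"/*.tar.gz; do
        if [[ -f $f ]]; then
            printf '%s\n' "${f##*/}"
        fi
    done
)

# Try it on throwaway files in a temporary directory.
demo=$(mktemp -d)
touch "$demo/a.tgz" "$demo/B.TAR.GZ" "$demo/c.txt"
list_tarballs "$demo"
```

With nullglob set, an unmatched pattern expands to nothing, so the loop body simply does not run for it.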