Using Awk to remove one of multiple file extensions? - regex

I'm working on some stuff related to converting files, and I'm trying to find a shell command to remove the original file extension.
For example, if I convert a file called text.rtf, it will be converted into text.rtf.mobi. I'd like to use something to remove the .rtf (or any other extension) so it's only text.mobi.
I've been playing with awk and sed but I couldn't get anything to work. I'm not sure how to get it to pick up both the original extension and the .mobi, but only remove the original extension.
Somewhat related, where should I be going to pick up regex and actually understand it instead of just immense amounts of Googling? Thanks.
EDIT: I was a little unclear in the original post so let me clarify. The shell command I need is for removing the original extension in a converted file, such as text.ANYTHING.mobi. Sorry about the confusion.

The classic way is the basename command:
file="text.rtf"
new=$(basename "$file" .rtf).mobi
The more modern way avoids exercising other programs:
file="text.rtf"
new="${file%.rtf}.mobi"
If you really must use awk, then I suppose you use:
file="text.rtf"
new=$(echo "$file" | awk '/\.rtf$/ { sub(/\.rtf$/, ".mobi"); } { print }')
For sed, you use:
file="text.rtf"
new=$(echo "$file" | sed 's/\.rtf$/.mobi/')
For a really good explanation of regular expressions, then you want Friedl's "Mastering Regular Expressions" book.
To convert text.rtf.mobi to text.mobi, you can use any of the tools previously shown with minor adaptations:
new=$(basename "$file" .rtf.mobi).mobi
new="${file%.rtf.mobi}.mobi"
new=$(echo "$file" | awk '/\.rtf\.mobi$/ { sub(/\.rtf\.mobi$/, ".mobi"); } { print }')
new=$(echo "$file" | sed 's/\.rtf\.mobi$/.mobi/')
And things are only marginally different if the .rtf can be any other extension, but you start to ask yourself "why doesn't he remove the original extension from the file before converting it, or use the file naming facilities in the converter to get the required output name?"
There is no longer a sensible way to do it with basename.
new="${file/.[!.]*.mobi/}" # bash
new=$(echo "$file" | awk '/\.[^.]+\.mobi$/ { sub(\.[^.]*\.mobi$/, ".mobi"); } { print }')
new=$(echo "$file" | sed 's/\.[^.]*\.mobi$/.mobi/')

Just remove all the extensions and then add back in .mobi
$ x=something.whatever.mobi
$ echo ${x%%.*}.mobi
something.mobi

Here's an example using xargs and basename to strip .wav from file names for batch conversion of wave files into mp3s using lame:
/bin/ls | xargs -I % basename % ".wav" | xargs -I $ lame --tl Xenologix --ta Drøn --tg IDM --tt $ $.wav $.mp3

for f in *.mobi
do
mv $f $(echo $f | cut -d'.' -f1).mobi
done

for f in *.mobi
do
mv $f `echo $f|awk -F"." '{print $1"."$3}'`
done

Related

How to find specific text in a text file, and append it to the filename?

I have a collection of plain text files which are named as yymmdd_nnnnnnnnnn.txt, which I want to append another number sequence to the filenames, so that they each become named as yymmdd_nnnnnnnnnn_iiiiiiiii.txt instead, where the iiiiiiiii is taken from the one line in each file which contains the text "GST: 123456789⏎" (or similar) at the end of the line. While I am sure that there will only be one such matching line within each file, I don't know exactly which line it will be on.
I need an elegant one-liner solution that I can run over the collection of files in a folder, from a bash script file, to rename each file in the collection by appending the specific GST number for each filename, as found within the files themselves.
Before even getting to the renaming stage, I have encountered a problem with this. Here is what I tried, which didn't work...
# awk '/\d+$/' | grep -E 'GST: ' 150101_2224567890.txt
The grep command alone works perfectly to find the relevant line within the file, but the awk doesn't return just the final digits group. It fails with the error "warning: regexp escape sequence \d is not a known regexp operator". I had assumed that this regex should return any number of digits which are at the end of the line. The text file in question contains a line which ends with "GST: 112060340⏎". Can someone please show me how to make this work, and maybe also to help with the appropriate coding to move the collection of files to the new filenames? Thanks.
Thanks to a comment from #Renaud, I now have the following code working to obtain just the GST registration number from within a text file, which puts me a step closer towards a workable solution.
awk '/GST: / {printf $NF}' 150101_2224567890.txt
I still need to loop this over the collection instead of just specifying one filename. I also need to be able to use the output from #Renaud's contribution, to rename the files. I'm getting closer to a working solution, thanks!
This awk should work for you:
awk '$1=="GST:" {fn=FILENAME; sub(/\.txt$/, "", fn); print "mv", FILENAME, fn "_" $2 ".txt"; nextfile}' *_*.txt | sh
To make it more readable:
awk '$1 == "GST:" {
fn = FILENAME
sub(/\.txt$/, "", fn)
print "mv", FILENAME, fn "_" $2 ".txt"
nextfile
}' *_*.txt | sh
Remove | sh from above to see all mv commands together.
You may try
for f in *_*.txt; do echo mv "$f" "${f%.txt}_$(sed '/.*GST: /!d; s///; q' "$f").txt"; done
Drop the echo if you're satisfied with the output.
As you are sure there is only one matching line, you can try:
$ n=$(awk '/GST:/ {print $NF}' 150101_2224567890.txt)
$ mv 150101_2224567890.txt "150101_2224567890_$n.txt"
Or, for all .txt files:
for f in *.txt; do
n=$(awk '/GST:/ {print $NF}' "$f")
if [[ -z "$n" ]]; then
printf '%s: GST not found\n' "$f"
continue
fi
mv "$f" "$f{%.txt}_$n.txt"
done
Another one-line solution to consider, although perhaps not so elegant.
for original_filename in *_*.txt; do \
new_filename=${original_filename%'.txt'}_$(
grep -E 'GST: ' "$original_filename" | \
sed -E 's/.*GST//g; s/[^0-9]//g'
)'.txt' && \
mv "$original_filename" "$new_filename"; \
done
Output:
150101_2224567890_123456789.txt
If you are open to a multi line script:-
#!/bin/sh
for f in *.txt; do
prefix=$(echo "${f}" | sed s'#\.txt##')
cp "${f}" f1
sed -i s'#GST#%GST#' "./f1"
cat "./f1" | tr '%' '\n' > f2
number=$(cat "./f2" | sed -n '/GST/'p | cut -d':' -f2 | tr -d ' ')
newname="${prefix}_${number}.txt"
mv -v "${f}" "${newname}"
rm -v "./f1"
rm -v "./f2"
done
In general, if you want to make your files easy to work with, then leave as many potential places for them to be split with newlines as possible. It is much easier to alter files by simply being able to put what you want to delete or print on its' own line, than it is to search for things horizontally with regular expressions.

replace string with underscore and dots using sed or awk

I have a bunch of files with filenames composed of underscore and dots, here is one example:
META_ALL_whrAdjBMI_GLOBAL_August2016.bed.nodup.sortedbed.roadmap.sort.fgwas.gz.r0-ADRL.GLND.FET-EnhA.out.params
I want to remove the part that contains .bed.nodup.sortedbed.roadmap.sort.fgwas.gz. so the expected filename output would be META_ALL_whrAdjBMI_GLOBAL_August2016.r0-ADRL.GLND.FET-EnhA.out.params
I am using these sed commands but neither one works:
stringZ=META_ALL_whrAdjBMI_GLOBAL_August2016.bed.nodup.sortedbed.roadmap.sort.fgwas.gz.r0-ADRL.GLND.FET-EnhA.out.params
echo $stringZ | sed -e 's/\([[:lower:]]\.[[:lower:]]\.[[:lower:]]\.[[:lower:]]\.[[:lower:]]\.[[:lower:]]\.[[:lower:]]\.\)//g'
echo $stringZ | sed -e 's/\[[:lower:]]\.[[:lower:]]\.[[:lower:]]\.[[:lower:]]\.[[:lower:]]\.[[:lower:]]\.[[:lower:]]\.//g'
Any solution is sed or awk would help a lot
Don't use external utilities and regexes for such a simple task! Use parameter expansions instead.
stringZ=META_ALL_whrAdjBMI_GLOBAL_August2016.bed.nodup.sortedbed.roadmap.sort.fgwas.gz.r0-ADRL.GLND.FET-EnhA.out.params
echo "${stringZ/.bed.nodup.sortedbed.roadmap.sort.fgwas.gz}"
To perform the renaming of all the files containing .bed.nodup.sortedbed.roadmap.sort.fgwas.gz, use this:
shopt -s nullglob
substring=.bed.nodup.sortedbed.roadmap.sort.fgwas.gz
for file in *"$substring"*; do
echo mv -- "$file" "${file/"$substring"}"
done
Note. I left echo in front of mv so that nothing is going to be renamed; the commands will only be displayed on your terminal. Remove echo if you're satisfied with what you see.
Your regex doesn't really feel too much more general than the fixed pattern would be, but if you want to make it work, you need to allow for more than one lower case character between each dot. Right now you're looking for exactly one, but you can fix it with \+ after each [[:lower:]] like
printf '%s' "$stringZ" | sed -e 's/\([[:lower:]]\+\.[[:lower:]]\+\.[[:lower:]]\+\.[[:lower:]]\+\.[[:lower:]]\+\.[[:lower:]]\+\.[[:lower:]]\+\.\)//g'
which with
stringZ="META_ALL_whrAdjBMI_GLOBAL_August2016.bed.nodup.sortedbed.roadmap.sort.fgwas.gz.r0-ADRL.GLND.FET-EnhA.out.params"
give me the output
META_ALL_whrAdjBMI_GLOBAL_August2016.r0-ADRL.GLND.FET-EnhA.out.params
Try this:
#!/bin/bash
for line in $(ls -1 META*);
do
f2=$(echo $line | sed 's/.bed.nodup.sortedbed.roadmap.sort.fgwas.gz//')
mv $line $f2
done

How to cut a string from a string

My script gets this string for example:
/dir1/dir2/dir3.../importance/lib1/lib2/lib3/file
let's say I don't know how long the string until the /importance.
I want a new variable that will keep only the /importance/lib1/lib2/lib3/file from the full string.
I tried to use sed 's/.*importance//' but it's giving me the path without the importance....
Here is the command in my code:
find <main_path> -name file | sed 's/.*importance//
I am not familiar with the regex, so I need your help please :)
Sorry my friends I have just wrong about my question,
I don't need the output /importance/lib1/lib2/lib3/file but /importance/lib1/lib2/lib3 with no /file in the output.
Can you help me?
I would use awk:
$ echo "/dir1/dir2/dir3.../importance/lib1/lib2/lib3/file" | awk -F"/importance/" '{print FS$2}'
importance/lib1/lib2/lib3/file
Which is the same as:
$ awk -F"/importance/" '{print FS$2}' <<< "/dir1/dir2/dir3.../importance/lib1/lib2/lib3/file"
importance/lib1/lib2/lib3/file
That is, we set the field separator to /importance/, so that the first field is what comes before it and the 2nd one is what comes after. To print /importance/ itself, we use FS!
All together, and to save it into a variable, use:
var=$(find <main_path> -name file | awk -F"/importance/" '{print FS$2}')
Update
I don't need the output /importance/lib1/lib2/lib3/file but
/importance/lib1/lib2/lib3 with no /file in the output.
Then you can use something like dirname to get the path without the name itself:
$ dirname $(awk -F"/importance/" '{print FS$2}' <<< "/dir1/dir2/dir3.../importance/lib1/lib2/lib3/file")
/importance/lib1/lib2/lib3
Instead of substituting all until importance with nothing, replace with /importance:
~$ echo $var
/dir1/dir2/dir3.../importance/lib1/lib2/lib3/file
~$ sed 's:.*importance:/importance:' <<< $var
/importance/lib1/lib2/lib3/file
As noted by #lurker, if importance can be in some dir, you could add /s to be safe:
~$ sed 's:.*/importance/:/importance/:' <<< "/dir1/dirimportance/importancedir/..../importance/lib1/lib2/lib3/file"
/importance/lib1/lib2/lib3/file
With GNU sed:
echo '/dir1/dir2/dir3.../importance/lib1/lib2/lib3/file' | sed -E 's#.*(/importance.*)#\1#'
Output:
/importance/lib1/lib2/lib3/file
pure bash
kent$ a="/dir1/dir2/dir3.../importance/lib1/lib2/lib3/file"
kent$ echo ${a/*\/importance/\/importance}
/importance/lib1/lib2/lib3/file
external tool: grep
kent$ grep -o '/importance/.*' <<<$a
/importance/lib1/lib2/lib3/file
I tried to use sed 's/.*importance//' but it's giving me the path without the importance....
You were very close. All you had to do was substitute back in importance:
sed 's/.*importance/importance/'
However, I would use Bash's built in pattern expansion. It's much more efficient and faster.
The pattern expansion ${foo##pattern} says to take the shell variable ${foo} and remove the largest matching glob pattern from the left side of the shell variable:
file_name="/dir1/dir2/dir3.../importance/lib1/lib2/lib3/file"
file_name=${file_name##*importance}
Removeing the /file at the end as you ask:
echo '<path>' | sed -r 's#.*(/importance.*)/[^/]*#\1#'
Input /dir1/dir2/dir3.../importance/lib1/lib2/lib3/file
Returns: /importance/lib1/lib2/lib3
See this "Match groups" tutorial.

Splitting a line in bash based on delimiter with Sed / Regex

Regex rookie and hoping to change that. I have the following seemingly very simple problem that I cannot figure the correct regex implementation to parse properly. Basically I have a file that has lines that looks like this:
time:3:35PM
I am just trying to cut out all characters up to and including ONLY FIRST ':' delimiter and keep the rest intact with sed so that I can process on many files with same format. What I am trying to get is this:
3:35PM
The below is the closest I got but is just using the last occurrence of the delimiter instead of the first.:
sed 's/.*://'
I have also tried with python but have challenges with applying a python function to iterate through all lines in many files as opposed to just one file.
Any help would be greatly appreciated.
You can do this in just about every text processing tool (many without using regular expressions at all).
ed
If the in-place editing is really important, the canonical correct way is not sed (the stream editor) but ed (the file editor).
ed "$file" << EOF
,s/^[^:]*://g
w
EOF
sed
(Pretty much the same commands as ed, formatted a little differently)
sed 's/^[^:]*://' < "$file" > "$file".new
mv "$file".new "$file"
BASH
This one doesn't cause any new processes to be spawned. (For whatever that's worth.)
while IFS=: read _ time; do
printf '%s\n' "$time"
done < "$file" > "$file".new
mv "$file".new "$file"
awk
awk -F: 'BEGIN{ OFS=":" } { print $2,$3 }' < "$file" > "$file".new
mv "$file".new "$file"
cut
cut -d: -f2- < "$file" > "$file".new
mv "$file".new "$file"
Since you don't need a regular expression to match a single, known character, consider using cut instead of sed.
This simple expression sets : as the d-elimiter and emits f-ields 2, onwards (-):
cut -d: -f2-
Example:
% echo 'time:3:35PM' | cut -d: -f2-
3:35PM
kojiro's answer has a plenty of great alternatives, but you have asked how to do that with regex. Here are some pure regex solutions:
grep -oP '[^:]*:\K.*' file.txt
\K makes it forget everything before the occurrence of \K.
But if you know the exact prefix length then you can use lookaround feature:
grep -oP '(?<=^time:).*' file.txt
Note that most of regex implementations do not support these features. You can use it in grep with -P flag and perl itself. I wonder if any other utility supports these.
To remove every instance up to : and including the : you could do..
sed -i.bak 's/^[^:]*://' file.txt
on multiple .txt files
sed -i.bak 's/^[^:]*://' *.txt
The -i option specifies that files are to be edited in-place. By creating a temporary file and sending output to this file rather than to the standard output.
Please consider my answer here:
How to use regex with cut at the command line?
You could for example just write:
echo 'time:3:35PM' | cutr -d : -f 2- -r :
In your particular case, you could simply use cut though:
echo 'time:3:35PM' | cut -d : -f 2-
Any feedback welcome. cutr isn't perfect yet, but before I invest too much time into it, I wanted to get some feedback.

bash script regex matching

In my bash script, I have an array of filenames like
files=( "site_hello.xml" "site_test.xml" "site_live.xml" )
I need to extract the characters between the underscore and the .xml extension so that I can loop through them for use in a function.
If this were python, I might use something like
re.match("site_(.*)\.xml")
and then extract the first matched group.
Unfortunately this project needs to be in bash, so -- How can I do this kind of thing in a bash script? I'm not very good with grep or sed or awk.
Something like the following should work
files2=(${files[#]#site_}) #Strip the leading site_ from each element
files3=(${files2[#]%.xml}) #Strip the trailing .xml
EDIT: After correcting those two typos, it does seem to work :)
xbraer#NO01601 ~
$ VAR=`echo "site_hello.xml" | sed -e 's/.*_\(.*\)\.xml/\1/g'`
xbraer#NO01601 ~
$ echo $VAR
hello
xbraer#NO01601 ~
$
Does this answer your question?
Just run the variables through sed in backticks (``)
I don't remember the array syntax in bash, but I guess you know that well enough yourself, if you're programming bash ;)
If it's unclear, dont hesitate to ask again. :)
I'd use cut to split the string.
for i in site_hello.xml site_test.xml site_live.xml; do echo $i | cut -d'.' -f1 | cut -d'_' -f2; done
This can also be done in awk:
for i in site_hello.xml site_test.xml site_live.xml; do echo $i | awk -F'.' '{print $1}' | awk -F'_' '{print $2}'; done
If you're using arrays, you probably should not be using bash.
A more appropriate example wold be
ls site_*.xml | sed 's/^site_//' | sed 's/\.xml$//'
This produces output consisting of the parts you wanted. Backtick or redirect as needed.