Bash: Batch Rename Appending File Extension - regex

I have a bunch of temperature logger data files in .csv format. The proprietary temp-logger software saves them with weird useless names. I want to name the files by their serial numbers (S/N). The S/N can be found in each of the files (in several places).
So, I need to extract the S/N and change the name of the file to {S/N}.csv.
I'm almost there, but can't figure out how to get the ".csv" file extension onto the end.
Here's my code:
for i in *.csv; do grep -Eo "S\/N\: [0-9]+" "$i" | cut -c 6- | head -1 | xargs mv "$i" ; done
Note the "cut" and "head" commands are necessary to get just the S/N number from the regular expression return, and to take only one (the S/N is listed several times in the file).
If anyone has a more elegant solution, I'd love to see it. All I really need though is to get that ".csv" onto the end of my new file names.
Thanks!

You can do it with xargs, but it's simpler to skip it and call mv directly. (You're only renaming one file per call to xargs anyway.)
for i in *.csv; do
    ser_num=$(grep -Eo "S\/N\: [0-9]+" "$i" | cut -c 6- | head -1)
    mv "$i" "$ser_num.csv"
done
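If your grep supports the -m flag (GNU grep does), a small variant of the same idea stops reading each file at the first match, instead of filtering with head afterwards:
for i in *.csv; do
    ser_num=$(grep -m1 -Eo 'S/N: [0-9]+' "$i" | cut -c 6-)
    mv "$i" "$ser_num.csv"
done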

Related

extract filename and path with slash delimiter from text using grep and lookaround

I am trying to write a bash script to automate checkin of changed files after comparing a local folder to another remote folder.
To achieve this I am trying to extract the filename with a portion of the path of the remote folder, to be used in the checkin commands. I am seeking assistance on extracting the filename with its path.
To achieve the comparison I used the diff command as follows:
diff --brief --suppress-common-lines -x '*.class' -ar ~/myprojects/company/apps/product/package/test/ $env_var/java/package/test/
The above command prints output in the following format:
Files /home/xxxx/myprojects/company/apps/product/package/test/fileName.java and /productdev/product/product121/java/package/test/filename.java differ
I want to extract the file name between 'and' & 'differ', so I used a lookaround regular expression in a grep command:
diff --brief --suppress-common-lines -x '*.class' -ar ~/myprojects/company/apps/product/package/test/ $env_var/java/package/test/ | grep -oP '(?<=and) .*(?=differ)'
which gave me:
/productdev/product/product121/java/package/test/filename.java
How can I display the path starting from java to the end of the text, as in: java/package/test/filename.java?
You could try the below grep command,
grep -oP 'and.*\/\Kjava.*?(?= differ)'
That is,
diff --brief --suppress-common-lines -x '*.class' -ar ~/myprojects/company/apps/product/package/test/ $env_var/java/package/test/ | grep -oP 'and.*\/\Kjava.*?(?=\s*differ)'
Here \K resets the start of the reported match, so everything matched before it (and.*\/) is consumed but excluded from the output; the lazy java.*? then grabs the shortest string up to the differ lookahead.
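For example, run against a made-up line of diff output (the paths here are hypothetical):
echo "Files /a/old.java and /productdev/product/java/package/test/f.java differ" | grep -oP 'and.*\/\Kjava.*?(?=\s*differ)'
java/package/test/f.java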
As I see it, you will be comparing all the files in both folders and will get several lines like the one you mentioned.
So the first step would be grepping for the lines with "differ" in them (in case the command produces any other kind of lines too).
You can ignore the above step if I am wrong and didn't understand it right.
So the next step is extracting both the paths. For that you can use this:
awk '{print $2,$4}'
This will print only the 2nd and 4th fields, i.e., both the paths.
awk splits fields on any run of whitespace, so extra spaces don't matter.
Another simple way of doing it is:
cut -d" " -f 2,4
This will also do the same.
Here the -d flag specifies the delimiter used to split each line, and the -f flag specifies which fields to pick (the 2nd and 4th).
Once you get these paths, you can always store them in two variables and cut or awk out whatever parts of them you want.
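Putting it together, here is one sketch of that last step using bash parameter expansion instead of further cut or awk calls (it assumes every relevant line ends in "differ" and the remote path always contains a /java/ component):
diff --brief --suppress-common-lines -x '*.class' -ar ~/myprojects/company/apps/product/package/test/ $env_var/java/package/test/ |
grep ' differ$' |
while read -r _ local_path _ remote_path _; do
    # strip everything up to and including the first "/java/", then restore the prefix
    echo "java/${remote_path#*/java/}"
done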

batch renaming of files with perl expressions

This should be a basic question for a lot of people, but I am a biologist with no programming background, so please excuse my question.
What I am trying to do is rename about 100,000 gzipped data files whose existing names are just a code (example: XG453834.fasta.gz). I'd like to rename them to something easily readable and parseable by me (example: Xanthomonas_galactus_str_453.fasta.gz).
I've tried to use sed, rename, and mmv, to no avail. If I use any of those commands in a one-off script they work fine; it's just when I try to incorporate variables into a shell script that I run into problems. I'm not getting any errors, just no names are changed, so I suspect it's an I/O error.
Here's what my files look like:
#! /bin/bash
# change a bunch of file names
file=names.txt
while IFS=' ' read -r r1 r2;
do
mmv ''$r1'.fasta.gz' ''$r2'.fasta.gz'
# or I tried many versions of: sed -i 's/"$r1"/"$r2"/' *.gz
# and I tried many versions of: rename -i 's/$r1/$r2/' *.gz
done < "$file"
...and here are the first lines of my txt file, with a single-space delimiter:
cat names.txt
#find #replace
code1 name1
code2 name2
code3 name3
I know I can do this with python or perl, but since I'm stuck here working on this particular script I want to find a simple solution to fixing this bash script and figure out what I am doing wrong. Thanks so much for any help possible.
Also, I tried to cat the names file (see comment from Ashoka Lella below) and then use awk to move/rename. Some of the files have variable names (but will always start with the code), so I am looking for a find & replace option to just replace the "code" with the "name" and preserve the file name structure.
I suspect I am not escaping the variable within the single quotes of the perl expression, but I have pored over a lot of manuals and I can't find the way to do this.
If you're absolutely sure that the filenames don't contain spaces or tabs, you can try the following:
xargs -n2 < names.txt echo mv
This is a DRY run (it will only print what it would do); if you're satisfied with the result, remove the echo.
If you want to be prompted before overwriting an existing target, use
xargs -n2 < names.txt echo mv -i
If you want to NEVER allow overwriting of the target, use
xargs -n2 < names.txt echo mv -n
Again, remove the echo if you're satisfied.
I don't think that you need to be using mmv, a simple mv will do. Also, there's no need to specify the IFS, the default will work for you:
while read -r src dest; do mv "$src" "$dest"; done < names.txt
I have double quoted the variable names as it is generally considered good practice but in this case, a space in either of the filenames will result in read not working as you expect.
You can put an echo before the mv inside the loop to ensure that the correct command will be executed.
Note that in your file names.txt, the .fasta.gz suffix is already included, so you shouldn't be adding it inside the loop as well. Perhaps that was your problem?
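Since you mention that some of the files have extra text after the code, a prefix-preserving variant along these lines might work (a sketch, untested; it assumes the columns in names.txt hold the bare code and the bare replacement name, without the suffix):
while read -r code name; do
    for f in "$code"*.fasta.gz; do
        # [ -e ] skips the header line and any codes with no matching files
        # (an unmatched glob stays literal)
        [ -e "$f" ] && mv "$f" "$name${f#"$code"}"
    done
done < names.txt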
This should rename all files in column 1 to the names in column 2 of names.txt, provided they are in the same folder as names.txt:
cat names.txt | awk '{print "mv "$1" "$2}' | sh

how to remove lines from file that don't match regex?

I have a big file that looks like this:
7f0c41d6-f9c6-47aa-a034-d40bc629c973.csv
159890
159891
24faaed6-62ee-4175-8430-5d73b09911c8.csv
159907
5bad221f-25ef-44fa-9086-fd152e697928.csv
642e4ac3-3d46-4b4c-b5c8-aa2fa54d0b04.csv
d0e145a5-ceb8-4d4b-ae47-11e0c9a6548d.csv
159929
ba678cbd-af57-493b-a69e-e7504b4bc328.csv
7750840f-9bf9-4a68-9f25-a2ba0968d481.csv
159955
159959
And I'm only interested in the *.csv files; can someone show me how to remove the lines that do not end with .csv?
Thank you.
grep "\.csv$" file
will pull out only those lines ending in .csv
Then, if you want to put them in a different file:
grep "\.csv$" file > newfile
sed is your friend:
sed -i.bak '/\.csv$/!d' file
-i.bak : in-place edit. creates backup file with .bak extension
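If you prefer awk, an equivalent that writes to a new file instead of editing in place would be:
awk '/\.csv$/' file > newfile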
([0-9a-zA-Z-]*\.csv$)
This regex selects only the filenames ending with the .csv extension (the dot is escaped so it matches a literal dot rather than any character).
Hope this will help you.
If you are familiar with the vim text editor (vim or vi is typically installed on many linux boxes), use the following vim Ex mode command to remove lines that don't match a particular pattern:
:v/<pattern>/d
For example, if I wanted to delete all lines that didn't contain "column" I would run:
:v/column/d
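For this question's data that would be:
:v/\.csv$/d
(:v deletes the lines that do not match; its counterpart :g/<pattern>/d deletes the lines that do.)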
Hope this helps.
If you do not want to have to save the names of the files in another file just to remove the unwanted ones, this may also be a solution for your needs (understanding that this is an old question).
This single-line for loop reuses the grep "\.csv" matching from above, so you don't need to manage file names being saved here or there:
for f in *; do if [ ! "$(echo "${f}" | grep -Eo '\.csv$')" == ".csv" ]; then rm "${f}"; fi; done
Tested against your sample output, it works as intended, removing all files except the csv files.
And here is a slightly shorter version of the single-line command:
for f in *; do if [ ! "$(echo "${f}" | grep -o '\.csv$')" ]; then rm "${f}"; fi; done
This was likewise tested against your sample's csv file names and some randomly generated text files.
The purpose of using such a loop with a conditional is to guarantee that you only rid yourself of the files you want gone (the non-csv files), and only in the current working directory, without parsing the ls command.
Hopefully this helps you and anyone else that is looking for a similar solution.
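For completeness: if GNU find is available, the same cleanup can be done without any loop at all (again assuming all the files live in the current directory):
find . -maxdepth 1 -type f ! -name '*.csv' -delete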

Merging files in Linux

I am using Cygwin to merge multiple files. However, I wanted to know if my approach is correct or not. This is both a question and a discussion :)
First, a little info about the files I have:
Both files have ASCII as well as non-ASCII characters.
File1 has 7899097 lines in it and a size of ~70.9 MB.
File2 has 14344391 lines in it and a size of ~136.6 MB.
File Encoding Information:
$ file -bi file1.txt
text/x-c++; charset=unknown-8bit
$ file -bi file2.txt
text/x-c++; charset=utf-8
$ file -bi output.txt
text/x-c++; charset=unknown-8bit
This is the method I am following to merge the two files, sort them and then remove all the duplicate entries:
I create a temp folder and place both the text files inside it.
I run the following commands to merge both the files but keep a line break between the two
for file in *.txt; do
    cat $file >> output.txt
    echo >> output.txt
done
The resulting output.txt file has 22243490 lines in it and a size of 207.5 MB.
Now, if I run the sort command on it as shown below, I get an error since there are non-ASCII characters (maybe unicode, wide characters) present inside it:
sort -u output.txt
string comparison failed: Invalid or incomplete multibyte or wide character
So, I set the environment variable LC_ALL to C and then run the command as follows:
cat output.txt | sort -u | uniq >> result.txt
And, the result.txt has 22243488 lines in it and a size of 207.5 MB.
So, result.txt is practically the same as output.txt.
Now, I already know that there are many duplicate entries in output.txt, so why are the above commands not able to remove them?
Also, considering the large size of the files, I wanted to know if this is an efficient method to merge multiple files, sort them and then unique them?
Hmm, I'd use
cat file1.txt file2.txt other-files.* | recode enc1..enc2 | sort | uniq > file3.txt
but watch out - this could cause problems with big file sizes, counted in gigabytes (or bigger); with hundreds of megabytes it should probably go fine. If I wanted real efficiency, e.g. with really huge files, I'd first remove single-file duplicates, then sort each file, merge them one after another, and then sort and remove duplicate lines again. Theoretically a uniq -c and grep filter could remove duplicates. Try to avoid falling into some unneeded sophistication of the solution :)
http://catb.org/~esr/writings/unix-koans/two_paths.html
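As a concrete sketch of that "dedupe each file first, then merge" idea (assuming GNU sort and the same LC_ALL=C workaround for the multibyte error):
LC_ALL=C sort -u file1.txt -o file1.sorted
LC_ALL=C sort -u file2.txt -o file2.sorted
LC_ALL=C sort -m -u file1.sorted file2.sorted > result.txt
The -m flag merges inputs that are already sorted without re-sorting them, which keeps the final pass cheap.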
edited:
mv file1.txt file1_iso1234.txt
mv file2.txt file2_latin7.txt
ls file*.txt | while read line; do cat $line | recode $(echo $line | cut -d'_' -f2 | cut -d'.' -f1)..utf8; done | sort | uniq > finalfile.txt

using find and rename for their intended use

Now, before you facepalm and click on duplicate entry or the like, read on; this question is both theoretical and practical.
From the title it is pretty obvious what I am trying to do: find some files, then rename them. The problem is that there are so many ways to do this that I finally decided to pick one and try to figure it out, theoretically.
Let me set the stage:
Let's say I have 100 files all named like this: Image_200x200_nnn_AlphaChars.jpg, where nnn is an incremental number and AlphaChars is a descriptive word, i.e.:
Image_200x200_001_BlueHat.jpg
Image_200x200_002_RedHat.jpg
...
Image_200x200_100_MyCat.jpg
Enter the stage: find. Now with a simple one-liner I can find all the image files in this directory. (Not sure how to do this case-insensitively.)
find . -type f -name "*.jpg"
Enter the stage: rename. On its own, rename expects you to do the following:
rename <search> <replace> <haystack>
When I try to combine the two with -print0 and xargs and some regular expressions, I get stuck, and I am almost sure it's because rename is looking for the haystack or the search part... (Please do explain if you understand what happens after the pipe.)
find . -type f -name "*.jpg" -print0 | xargs -0 rename "s/Image_200x200_(\d{3})/img/"
So the goal is to get find to give rename the original image name, and replace everything before the last underscore with img.
Yes, I know that duplicates will cause a problem, and yes, I know that spaces in the names will also make my life hell, and don't even start with subdirectories and the like. To keep it simple, we are talking about a single directory, and all filenames are unique and without special characters.
I need to understand the fundamental basics, before getting to the hardcore stuff. Anybody out there feel like helping?
Another approach is to avoid using rename -- bash is capable enough:
find ... -print0 | while read -r -d '' filename; do
    mv "$filename" "img_${filename##*_}"
done
The ${filename##*_} expansion removes all leading characters, up to and including the last underscore, from the value.
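For example, with a hypothetical value:
filename=./Image_200x200_001_BlueHat.jpg
echo "img_${filename##*_}"    # prints img_BlueHat.jpg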
If you don't need -print0 (i.e. you are sure your filenames don't contain newlines, spaces, or quote characters), you can just do:
find . -type f -name "*.jpg" | xargs rename 's/Image_200x200_(\d{3})/img/'
Which works for me:
~/tmp$ touch Image_200x200_001_BlueHat.jpg
~/tmp$ touch Image_200x200_002_RedHat.jpg
~/tmp$ touch Image_200x200_100_MyCat.jpg
~/tmp$ find . -type f -name "*.jpg" | xargs rename 's/Image_200x200_(\d{3})/img/'
~/tmp$ ls
img_BlueHat.jpg img_MyCat.jpg img_RedHat.jpg
What's happening after the pipe is that xargs is parsing the output of find and passing that in reasonable chunks to a rename command, which is executing a regex on the filename and renaming the file to the result.
update: I didn't try your version with the null-terminators at first, but it also works for me. Perhaps you tested with a different regex?
What's happening after the pipe:
find ... -print0 | xargs -0 rename "s/Image_200x200_(\d{3})/img/"
xargs is reading the filenames produced by the find command, and executing the rename command repeatedly, appending a few filenames at a time. The net effect will be something like:
rename '...' file001 file002 file003 file004 file005 file006 file007 file008 file009 file010
rename '...' file011 file012 file013 file014 file015 file016 file017 file018 file019 file020
rename '...' file021 file022 file023 file024 file025 file026 file027 file028 file029 file030
...
rename '...' file091 file092 file093 file094 file095 file096 file097 file098 file099 file100
The find -print0 | xargs -0 combination is handy for more safely handling files that may contain whitespace.
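For instance, with a file containing a space (a contrived example):
touch 'My Cat.jpg'
find . -name '*.jpg' | xargs ls          # fails: xargs splits "My Cat.jpg" into two arguments
find . -name '*.jpg' -print0 | xargs -0 ls   # works: names are separated by NUL bytes, which can't appear in filenames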