Renaming files by using a prefix from a text file - regex

I have a set of files named
sample_exp_A1_A01
sample_exp_A2_A02
sample_exp_A3_A03
sample_exp_A4_A04
sample_exp_A5_A05
And I have a text file with the following values
A01 170
A02 186
A03 165
A04 130
A05 120
I would like to rename the files based on the text file values like
FS_170_sample_exp_A1_A01
FS_186_sample_exp_A2_A02
FS_165_sample_exp_A3_A03
FS_130_sample_exp_A4_A04
FS_120_sample_exp_A5_A05
So, match the IDs in the text file (A01, A02, A03, A04, A05) and add the corresponding number as a prefix to each filename. In addition, also prefix all filenames with FS as shown above.
I tried doing it manually, but could only do one file at a time this way
rename 's/^/FS_170_/' *A01
to get
FS_170_sample_exp_A1_A01

Assuming your suffixes are in a file named suffix, you can do this:
for fname in sample*; do
    echo mv "$fname" FS_"$(awk -v pf="${fname##*_}" \
        '$1 == pf {print $2}' suffix)"_"$fname"
done
It loops over all your files; in the loop, it puts together the new file name by prepending FS_ and the output of
awk -v pf="${fname##*_}" '$1 == pf {print $2}' suffix
This assigns the last part of the input file name to the awk variable pf, and then, for lines in suffix where the first field matches that variable, prints the second field.
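For instance, ##*_ removes the longest prefix matching *_, so only the part after the last underscore remains; that value is then looked up in suffix:
$ fname=sample_exp_A1_A01
$ echo "${fname##*_}"
A01
$ awk -v pf=A01 '$1 == pf {print $2}' suffix
170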
Alternatively, if you have a grep that supports Perl compatible regular expressions, you can use grep -Po "${fname##*_} \K.*" suffix instead (using a variable-sized look-behind, \K):
for fname in sample*; do
    echo mv "$fname" FS_"$(grep -Po "${fname##*_} \K.*" suffix)"_"$fname"
done
The matched number is added to the new filename after FS_, and the rest of the new name is the complete old name.
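To see the effect of \K in isolation: everything matched before it must be present in the line, but it is excluded from what -o prints. With the sample suffix file:
$ grep -Po 'A01 \K.*' suffix
170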
For your input files, this results in
mv sample_exp_A1_A01 FS_170_sample_exp_A1_A01
mv sample_exp_A2_A02 FS_186_sample_exp_A2_A02
mv sample_exp_A3_A03 FS_165_sample_exp_A3_A03
mv sample_exp_A4_A04 FS_130_sample_exp_A4_A04
mv sample_exp_A5_A05 FS_120_sample_exp_A5_A05
To actually rename the files, the echo has to be removed.
If suffix is gigantic, you can accelerate this by having awk exit after the first match:
awk -v pf="${fname##*_}" '$1 == pf {print $2; exit}' suffix
or grep stop after the first match:
grep -m 1 -Po "${fname##*_} \K.*" suffix

Related

How to find specific text in a text file, and append it to the filename?

I have a collection of plain text files named yymmdd_nnnnnnnnnn.txt, and I want to append another number sequence to each filename, so that it becomes yymmdd_nnnnnnnnnn_iiiiiiiii.txt instead, where iiiiiiiii is taken from the one line in each file that contains the text "GST: 123456789⏎" (or similar) at the end of the line. While I am sure that there will only be one such matching line within each file, I don't know exactly which line it will be on.
I need an elegant one-liner solution that I can run over the collection of files in a folder, from a bash script file, to rename each file in the collection by appending the specific GST number to its filename, as found within the files themselves.
Before even getting to the renaming stage, I have encountered a problem with this. Here is what I tried, which didn't work...
# awk '/\d+$/' | grep -E 'GST: ' 150101_2224567890.txt
The grep command alone works perfectly to find the relevant line within the file, but the awk doesn't return just the final digits group. It fails with the error "warning: regexp escape sequence \d is not a known regexp operator". I had assumed that this regex would return any number of digits at the end of the line. The text file in question contains a line which ends with "GST: 112060340⏎". Can someone please show me how to make this work, and maybe also help with the appropriate coding to move the collection of files to the new filenames? Thanks.
Thanks to a comment from @Renaud, I now have the following code working to obtain just the GST registration number from within a text file, which puts me a step closer towards a workable solution.
awk '/GST: / {printf $NF}' 150101_2224567890.txt
I still need to loop this over the collection instead of just specifying one filename. I also need to be able to use the output from @Renaud's contribution to rename the files. I'm getting closer to a working solution, thanks!
This awk should work for you:
awk '$1=="GST:" {fn=FILENAME; sub(/\.txt$/, "", fn); print "mv", FILENAME, fn "_" $2 ".txt"; nextfile}' *_*.txt | sh
To make it more readable:
awk '$1 == "GST:" {
    fn = FILENAME
    sub(/\.txt$/, "", fn)
    print "mv", FILENAME, fn "_" $2 ".txt"
    nextfile
}' *_*.txt | sh
Remove | sh from above to see all mv commands together.
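For the example file from the question, the generated command would be the following, assuming the matching line consists of just the two fields GST: and the number (which is what the $1 == "GST:" test requires):
mv 150101_2224567890.txt 150101_2224567890_112060340.txt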
You may try
for f in *_*.txt; do echo mv "$f" "${f%.txt}_$(sed '/.*GST: /!d; s///; q' "$f").txt"; done
Drop the echo if you're satisfied with the output.
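Here the empty pattern in s/// reuses the preceding regex, so the sed script deletes lines without GST: , strips everything up to and including it, and quits after the first match. On the file from the question:
$ sed '/.*GST: /!d; s///; q' 150101_2224567890.txt
112060340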
As you are sure there is only one matching line, you can try:
$ n=$(awk '/GST:/ {print $NF}' 150101_2224567890.txt)
$ mv 150101_2224567890.txt "150101_2224567890_$n.txt"
Or, for all .txt files:
for f in *.txt; do
    n=$(awk '/GST:/ {print $NF}' "$f")
    if [[ -z "$n" ]]; then
        printf '%s: GST not found\n' "$f"
        continue
    fi
    mv "$f" "${f%.txt}_$n.txt"
done
Another one-line solution to consider, although perhaps not so elegant:
for original_filename in *_*.txt; do \
    new_filename=${original_filename%'.txt'}_$(
        grep -E 'GST: ' "$original_filename" | \
            sed -E 's/.*GST//g; s/[^0-9]//g'
    )'.txt' && \
    mv "$original_filename" "$new_filename"; \
done
Output:
150101_2224567890_123456789.txt
If you are open to a multi-line script:
#!/bin/sh
for f in *.txt; do
    prefix=$(echo "${f}" | sed 's#\.txt##')
    cp "${f}" f1
    sed -i 's#GST#%GST#' "./f1"
    cat "./f1" | tr '%' '\n' > f2
    number=$(cat "./f2" | sed -n '/GST/p' | cut -d':' -f2 | tr -d ' ')
    newname="${prefix}_${number}.txt"
    mv -v "${f}" "${newname}"
    rm -v "./f1"
    rm -v "./f2"
done
In general, if you want to make your files easy to work with, then leave as many potential places for them to be split with newlines as possible. It is much easier to alter files by simply putting what you want to delete or print on its own line than it is to search for things horizontally with regular expressions.

sed & regex expression

I'm trying to add a 'chr' string to the lines where it is not there. This operation is necessary only on the lines that do not have '##'.
At first I used grep + sed commands, as follows, but I want to run the command so that it overwrites the original file.
grep -v "^#" 5b110660bf55f80059c0ef52.vcf | grep -v 'chr' | sed 's/^/chr/g'
So, to run the command on the file, I wrote this:
sed -i -E '/^#.*$|^chr.*$/ s/^/chr/' 5b110660bf55f80059c0ef52.vcf
This is the content of the vcf file.
##FORMAT=<ID=DP4,Number=4,Type=Integer,Description="#ref plus strand,#ref minus strand, #alt plus strand, #alt minus strand">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 24430-0009S21_GM17-12140
1 955597 95692 G T 1382 PASS VARTYPE=1;BGN=0.00134309;ARL=150;DER=53;DEA=55;QR=40;QA=39;PBP=1091;PBM=300;TYPE=SNP;DBXREF=dbSNP:rs115173026,g1000:0.2825,esp5400:0.2755,ExAC:0.2290,clinvar:rs115173026,CLNSIG:2,CLNREVSTAT:mult,CLNSIGLAB:Benign;SGVEP=AGRN|+|NM_198576|1|c.45G>T|p.:(p.Pro15Pro)|synonymous GT:DP:AD:DP4 0/1:125:64,61:50,14,48,13
chr1 957898 82729935 G T 1214 off_target VARTYPE=1;BGN=0.00113362;ARL=149;DER=50;DEA=55;QR=38;QA=40;PBP=245;PBM=978;NVF=0.53;TYPE=SNP;DBXREF=dbSNP:rs2799064,g1000:0.3285;SGVEP=AGRN|+|NM_198576|2|c.463+56G>T|.|intronic GT:DP:AD:DP4 0/1:98:47,51:9,38,10,41
If I understand your expected result correctly, try:
sed -ri '/^(#|chr)/! s/^/chr/' file
Your question isn't clear and you didn't provide the expected output, so we can't test a potential solution, but if all you want is to add chr to the start of lines where it's not already present and which don't start with #, then that's just:
awk '!/^(#|chr)/{$0="chr" $0} 1' file
To overwrite the original file using GNU awk would be:
awk -i inplace '!/^(#|chr)/{$0="chr" $0} 1' file
and with any awk:
awk '!/^(#|chr)/{$0="chr" $0} 1' file > tmp && mv tmp file
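A quick sanity check on shortened stand-ins for the four kinds of lines in the sample (hypothetical snippets, not the real records):
$ printf '%s\n' '##FORMAT' '#CHROM' '1 955597' 'chr1 957898' | awk '!/^(#|chr)/{$0="chr" $0} 1'
##FORMAT
#CHROM
chr1 955597
chr1 957898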
This can be done with a single sed invocation. The script itself is something like the following.
If you have an input of the format
$ echo -e '#\n#\n123chr456\n789chr123\nabc'
#
#
123chr456
789chr123
abc
then prepending chr to non-commented, chr-less lines is done as
$ echo -e '#\n#\n123chr456\n789chr123\nabc' | sed '/^#/ {p
d
}
/chr/ {p
d
}
s/^/chr/'
which prints
#
#
123chr456
789chr123
chrabc
(Note the multiline sed script.)
Now you only need to run this script on a file in place (-i in modern sed versions).
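A sketch of the same logic as a one-line in-place command (GNU sed; b with no label branches past the substitution, mirroring the p and d pairs above):
sed -i '/^#/b; /chr/b; s/^/chr/' 5b110660bf55f80059c0ef52.vcf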

egrep to find largest suffix for file

There are files like this:
Report.cfg
Report.cfg.1
Report.cfg.2
Report.cfg.3
I want to fetch the max suffix, if it exists (i.e. 3), using egrep.
If I try a simple egrep:
ls | egrep Report.cfg.*
I get the full file name and the whole list, not the suffix only.
What could be an optimized egrep?
You can use this awk to find the greatest number from a list of files ending with a dot and a number:
printf '%s\n' *.cfg.[0-9] | awk -F '.' '$NF > max{max = $NF} END{print max}'
3
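If you want egrep itself in the pipeline, one alternative (a sketch, assuming the suffixes are plain trailing integers) is to extract the trailing digits and sort numerically:
$ ls Report.cfg.* | grep -Eo '[0-9]+$' | sort -n | tail -n 1
3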

bash search for multiple patterns on different lines in a file

I have a number of files and I want to filter out the ones that contain 2 patterns; however, these patterns are on different lines. I've tried it using grep and awk, but in both cases they only seem to match patterns on the same line. I know grep is line based, but I'm less familiar with awk. Here's what I came up with, but it only prints lines that match both strings:
awk '/string1/ && /string2/' file
Grep will easily handle this using xargs:
grep -l string1 * | xargs grep -l string2
Use this command in the directory where the files are located, and the resulting matches will be displayed.
Depending on whether you really want to search for regexps:
gawk -v RS='^$' '/regexp1/ && /regexp2/ {print FILENAME}' file
or for strings:
gawk -v RS='^$' 'index($0,"string1") && index($0,"string2") {print FILENAME}' file
The above uses GNU awk for multi-char RS to read the whole file as a single record.
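A minimal check of the single-record behavior (hypothetical file f with the two strings on different lines):
$ printf 'string1\nstring2\n' > f
$ gawk -v RS='^$' 'index($0,"string1") && index($0,"string2") {print FILENAME}' f
f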
You can do it with find:
find . -type f -exec bash -c 'grep -q string1 "$1" && grep -q string2 "$1" && echo "$1"' _ {} \;
Passing each filename to bash -c as a positional parameter, rather than embedding {} in the command string, keeps file names with special characters safe.
You could do it like this with GNU awk:
awk '/foo/{seenFoo++} /bar/{seenBar++} seenFoo&&seenBar{print FILENAME;seenFoo=seenBar=0;nextfile}' file*
That says... if you see foo, increment variable seenFoo, likewise if you see bar, increment variable seenBar. If, at any point, you have seen both foo and bar, print the name of the current file and skip to the next input file ignoring all remaining lines in current file, and, before you start the next file, clear the flags to say we have seen neither foo nor bar in the new file.
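With the per-file reset in place, a foo in one file and a bar in the next correctly produce no output (hypothetical test files):
$ printf 'foo\n' > f1; printf 'bar\n' > f2
$ awk 'FNR==1{seenFoo=seenBar=0} /foo/{seenFoo++} /bar/{seenBar++} seenFoo&&seenBar{print FILENAME; nextfile}' f1 f2
$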

get file names, time stamps and MD5 checksums from a log file

I want to write a bash script that will take the output of a log file and extract the relevant content to another log file, which I will use, for example, to do statistical analysis of the time it takes to send a file.
The content is as follows:
FileSize TimeStamp MD5 Full Path to File
4824597 2013-06-21 11:26 5a264...c11 ...45/.../.../ITAM.xml
4824597 2013-06-20 23:18 5a264...c11 ...48/.../.../1447_rO8iKD.TMP.ITAM.xml
I am trying to extract the TimeStamp and the Full Path to the File.
I am a beginner in scripting but so far I have tried:
cat "/var/log/Customer.log" | grep '2013* *11' >> test.txt
Are there other methods I'm missing? Thank you very much.
If you want to extract the TimeStamp and the Full Path for all entries, then this should work:
awk 'NR>1{print $2,$3,$NF}' inputFile > outputFile
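For the sample log, this prints:
2013-06-21 11:26 ...45/.../.../ITAM.xml
2013-06-20 23:18 ...48/.../.../1447_rO8iKD.TMP.ITAM.xml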
Code for GNU sed:
sed -nr '2,$ {s/\S+\s+(\S+)\s+(\S+)\s+\S+\s+(.*)/\1 \2\t\3/;p}' file
$ cat file
FileSize TimeStamp MD5 Full Path to File
4824597 2013-06-21 11:26 5a264...c11 ...45/.../.../ITAM.xml
4824597 2013-06-20 23:18 5a264...c11 ...48/.../.../1447_rO8iKD.TMP.ITAM.xml
$ sed -nr '2,$ {s/\S+\s+(\S+)\s+(\S+)\s+\S+\s+(.*)/\1 \2\t\3/;p}' file
2013-06-21 11:26 ...45/.../.../ITAM.xml
2013-06-20 23:18 ...48/.../.../1447_rO8iKD.TMP.ITAM.xml
Looks like this is what you want:
awk '$2 ~ /^2013/ && $4 ~ /11$/ { print $2, $3, $NF; }' /var/log/Customer.log > test.txt
$2 ~ /^2013/ matches dates beginning with 2013
$4 ~ /11$/ matches MD5 ending with 11
print $2, $3, $NF prints fields 2 (date), 3 (time), and the last field (pathname)
If these regular expressions are confusing to you, go to Regular-Expressions.info and read the tutorial.
Assuming the columns are tab-separated, you can just use cut:
cut -f2,4 /var/log/Customer.log | grep -v MD5 >> test.txt
will append columns 2 and 4 (counting starts at 1) to test.txt. Lines containing MD5 will be removed by the grep invocation.
You can do it like this:
awk 'NR!=1 {print $2 " " $3 "\t" $5}' Customer.log > stat.txt