Break down text file in bash - regex

I have a text file in the following format:
variableStep chrom=chr1 span=10
10161 1
10171 1
10181 2
10191 2
10201 2
10211 2
10221 2
10231 2
10241 2
10251 1
variableStep chrom=chr10 span=10
70711 1
70721 2
70731 2
70741 2
70751 2
70761 2
70771 2
70781 2
70791 1
71161 1
71171 1
71181 1
variableStep chrom=chr11 span=10
104731 1
104741 1
104751 1
104761 1
104771 1
104781 1
104791 1
104801 1
128711 1
128721 1
128731 1
I need a way to break this down into several files named for example "chr1.txt", "chr10.txt and "chr11.txt". How would I go about doing this?
I went about it the following way:
while IFS=$'\t' read -r -a rowArray; do
    echo -e "${rowArray[0]}\t${rowArray[1]}\t${rowArray[2]}"
done < file.txt > "$file.mod.txt"
That reads line by line and then saves line by line. However, I need something a little more elaborate that spans rows. "chr1.txt" would include everything from the row 10161 1 to row 10251 1, "chr10.txt" would include everything from the row 70711 1 to row 71181 1, etc. It's also specific in that I have to read in the actual chr# from each line as well, and save that as the file name.
Any help is really appreciated.

awk -F'[ =]' '
$1 == "variableStep" {file = $3 ".txt"; next}
file != "" {print > file}' < input.txt

This worked for me:
IFS=$'\n'
curfile=""
content=($(< file.txt))
for ((idx = 0; idx < ${#content[@]}; idx++)); do
    # bash's =~ uses POSIX EREs: no \b and no non-greedy .*?, so capture
    # everything after "chrom=" up to the next whitespace instead
    if [[ ${content[idx]} =~ chrom=([^[:space:]]+) ]]; then
        curfile="${BASH_REMATCH[1]}.txt"
        rm -f "${curfile}"
    elif [[ -n "${curfile}" ]]; then
        echo "${content[idx]}" >> "${curfile}"
    fi
done
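A streaming variant of the same idea, sketched below; it avoids slurping the whole file into an array, which matters for large genome tracks:
curfile=""
while IFS= read -r line; do
    if [[ $line =~ chrom=([^[:space:]]+) ]]; then
        curfile="${BASH_REMATCH[1]}.txt"
        : > "$curfile"    # truncate leftovers from a previous run
    elif [[ -n $curfile ]]; then
        printf '%s\n' "$line" >> "$curfile"
    fi
done < file.txt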

Awk is appropriate for this problem domain because the text file is already (more or less) organized into columns. Here's what I would use:
awk 'NF == 3 && index($2, "=") { filename = substr($2, index($2, "=") + 1) }
NF == 2 && filename { print $0 > (filename ".txt") }' < input.txt
Explanation:
Think of the lines starting with variableStep as "three columns" and the other lines as "two columns". The above script parses the text file line by line. If a line has three columns and the second column contains an '=' character, it assigns all of the characters in the second column that occur after the '=' to a variable called filename. If a line has two columns and the filename variable has been assigned, it writes the entire line to the file whose name is constructed by concatenating the string in filename with '.txt'.
Notes:
NF is a built-in variable in Awk that represents the "number of fields", where a "field" (in this case) can be thought of as a column of data.
$0 and $2 are built-in variables that represent the entire line and the second column of data, respectively. ($1 represents the first column, $3 represents the third column, etc...)
substr and index are built-in functions described here: http://www.gnu.org/software/gawk/manual/gawk.html#String-Functions
The redirection operator (>) acts differently in Awk than it does in a shell script; subsequent writes to the same file are appended.
String concatenation is performed simply by writing expressions next to each other. The parentheses ensure the concatenation happens before the file gets written to.
More details can be found here: http://www.gnu.org/software/gawk/manual/gawk.html#Two-Rules
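As a tiny demonstration of the last two notes (using a made-up two-line input): within a single awk run, successive prints to the same name append rather than overwrite, and the parentheses group the name concatenation:
$ printf '1 a\n2 b\n' | awk '{ print $0 > ("demo" ".txt") }'
$ cat demo.txt
1 a
2 b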

I used sed to filter the file into chunks.
Code:
Kaizen ~/so_test $ cat zsplit.sh
cntr=1
prev=1
for curr in $(nl ztmpfile2.txt | grep variableStep | tr -s " " | cut -d" " -f2 | sed -n 's/variableStep//p')
do
    sed -n "$prev,$(( curr - 1 ))p" ztmpfile2.txt > zchap$cntr.txt
    #echo "displaying : : zchap$cntr.txt "
    #cat zchap$cntr.txt
    prev=$curr; cntr=$(( cntr + 1 ))
done
sed -n "$prev,$ p" ztmpfile2.txt > zchap$cntr.txt
#echo "displaying : : zchap$cntr.txt " ;
#cat zchap$cntr.txt ;
output :
Kaizen ~/so_test $ ./zsplit.sh
+ ./zsplit.sh
zchap1.txt :: 1 :: 1
displaying : : zchap1.txt
variableStep chrom=chr1 span=10
zchap2.txt :: 1 :: 12
displaying : : zchap2.txt
variableStep chrom=chr1 span=10
10161 1
10171 1
10181 2
10191 2
10201 2
10211 2
10221 2
10231 2
10241 2
10251 1
zchap3.txt :: 12 :: 25
displaying : : zchap3.txt
variableStep chrom=chr10 span=10
70711 1
70721 2
70731 2
70741 2
70751 2
70761 2
70771 2
70781 2
70791 1
71161 1
71171 1
71181 1
displaying : : zchap4.txt
variableStep chrom=chr11 span=10
104731 1
104741 1
104751 1
104761 1
104771 1
104781 1
104791 1
104801 1
128711 1
128721 1
128731 1
From the resulting zchap* files, if you want, you can remove the variableStep header lines with: sed -i '/variableStep/d' zchap*
Does this help?
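For what it's worth, GNU csplit does the same chunking in one line (a sketch; the output files are named xx00, xx01, ... rather than per chromosome, and -z suppresses the empty leading piece):
csplit -z ztmpfile2.txt '/variableStep/' '{*}'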

Related

Awk to extract and format a highly variable text file

I'm dealing with a text file that's just a mess. It's the service record for a used RV that I'm buying, and it's a regex lover's nightmare.
It has both inconsistent field separators and an inconsistent number of fields, with the lines being one of two types:
Type 1 (11 columns):
UNIT Mile GnHr R.O. Ln Service Description Mechanic Hours $ Amt
7-9918;57878 1698 01633 021;0502-00C ENG OIL/ FILTERT IF NEEDED;M02 JOSE A. SANCHEZ;0.80;80.00
Type 2 (10 columns):
UNIT Mile GnHr R.O. Ln Service Description Hours $ Amt
7-9918;55007 1641 [9564 007;ELE-BAT-BAT-0-0AAA;BATTERY AAA ALL BRANDS;2;31.12
I've stripped out all the headings, but put them back here just for reference. In Type 2 lines, the Mechanic field is missing.
I replaced all occurrences of multiple spaces with semicolons, so what I have now is a file where some lines have 10 fields and some have 11, the field separator is sometimes a space and sometimes a semicolon, and some fields have legitimate embedded spaces (Description and Mechanic).
I'm trying to find a way with awk to:
1) Extract each field and be able to print it out with a uniform OFS (semicolon is preferred)
2) If the Mechanic field is missing, insert it and print N/A or -- for the Mechanic
I can deal with column headings and stuff myself, I just can't crack the code for how to deal with the FS problem and variable number of columns in this file. I can grep out specific information that I need, but would be thrilled to get it into a form where I can import it into a spreadsheet or DB.
Your input file's not so bad. Assuming your input file is semicolon-separated:
1) Replace all blank chars in $2 with a ; to split that up into separate fields for output, then
2) if there's a blank in $3, replace the first blank with a ; (since $3 contains both the service and the description, they need to be separated), otherwise
3) this is the line format with no mechanic specified, so add the empty-mechanic text after $4 (the description),
4) and then just print the line:
$ awk 'BEGIN{FS=OFS=";"} {gsub(/ /,OFS,$2)} !sub(/ /,OFS,$3){$4=$4 OFS "N/A"} 1' file
7-9918;57878;1698;01633;021;0502-00C;ENG OIL/ FILTERT IF NEEDED;M02 JOSE A. SANCHEZ;0.80;80.00
7-9918;55007;1641;[9564;007;ELE-BAT-BAT-0-0AAA;BATTERY AAA ALL BRANDS;N/A;2;31.12
and if you'd like to do anything with the individual fields:
$ cat tst.awk
BEGIN { FS=OFS=";" }
{ gsub(/ /,OFS,$2) }
!sub(/ /,OFS,$3) { $4 = $4 OFS "N/A" }
{
    # re-assigning $0 forces awk to re-split the record on FS, so the
    # semicolons added above become real field boundaries for the loop below
    $0 = $0
    print
    for (i=1; i<=NF; i++) {
        print NR, i, $i
    }
    print ""
}
$ awk -f tst.awk file
7-9918;57878;1698;01633;021;0502-00C;ENG OIL/ FILTERT IF NEEDED;M02 JOSE A. SANCHEZ;0.80;80.00
1;1;7-9918
1;2;57878
1;3;1698
1;4;01633
1;5;021
1;6;0502-00C
1;7;ENG OIL/ FILTERT IF NEEDED
1;8;M02 JOSE A. SANCHEZ
1;9;0.80
1;10;80.00
7-9918;55007;1641;[9564;007;ELE-BAT-BAT-0-0AAA;BATTERY AAA ALL BRANDS;N/A;2;31.12
2;1;7-9918
2;2;55007
2;3;1641
2;4;[9564
2;5;007
2;6;ELE-BAT-BAT-0-0AAA
2;7;BATTERY AAA ALL BRANDS
2;8;N/A
2;9;2
2;10;31.12
A friend of mine also sent me this solution, done in perl:
#!/usr/bin/env perl
use strict;
use warnings;
#                                                                                                     1         1         1         1         1
#           1         2         3         4         5         6         7         8         9         0         1         2         3         4
# 012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890
# Type 1:
# 7-9918  55007 1641 [9564 022            0211     INTERIOR MISC.                  M02 JOSE A. SANCHEZ               0.00        0.00
# Type 2:
# 7-9918  57878 1698 01633 001            FUE-LPG-LPG-S-GAS      PROPANE GAS BULK PURCHASE                             5        24.00
my $delim="\t";
while (<STDIN>) {
#print $_;
# Both formats are the same at this point
print substr($_, 0, 6) . $delim;
print substr($_, 8, 5) . $delim;
print substr($_, 14, 4) . $delim;
print substr($_, 19, 5) . $delim;
print substr($_, 25, 3) . $delim;
my $qty = substr($_, 109, 11);
$qty =~ s/^\s*//g;
$qty =~ s/\s*$//g;
if ($qty =~ /^\d+\.\d{2}$/) {
# Type 1
print substr($_, 40, 9) . $delim;
print substr($_, 49, 32) . $delim;
# print substr($_, 81, 32) . $delim; # Technician name
print $qty . $delim;
} elsif ($qty =~ /^[-]?\d+$/) {
# Type 2
print substr($_, 40, 23) . $delim;
print substr($_, 63, 46) . $delim;
print $qty . $delim;
}
print sprintf("%.2f", substr($_, 120, 11)) . "\n";
}
1;
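Usage would be along these lines (the script and file names here are hypothetical):
perl parse_records.pl < service_records.txt > service_records.tsv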

grep, cut, sed, awk a file for 3rd column, n lines at a time, then paste into repeated columns of n rows?

I have a file of the form:
#some header text
a 1 1234
b 2 3333
c 2 1357
#some header text
a 4 8765
b 1 1212
c 7 9999
...
with repeated data in n-row chunks separated by a blank line (with possibly some other header text). I'm only interested in the third column, and would like to do some grep, cut, awk, sed, paste magic to turn it in to this:
a 1234 8765 ...
b 3333 1212
c 1357 9999
where the third column of each subsequent n-row chunk is tacked on as a new column. I guess you could call it a transpose, just n-lines at a time, and only a specific column. The leading (a b c) column label isn't essential... I'd be happy if I could just grab the data in the third column
Is this even possible? It must be. I can get things chopped down to only the interesting columns with grep and cut:
cat myfile | grep -A2 ^a\ | cut -c13-15
but I can't figure out how to take these n-row chunks and sed/paste/whatever them into repeated n-row columns.
Any ideas?
This awk does the job:
awk 'NF<3 || /^(#|[[:blank:]]*$)/{next} !a[$1]{b[++k]=$1; a[$1]=$3; next}
{a[$1] = a[$1] OFS $3} END{for(i=1; i<=k; i++) print b[i], a[b[i]]}' file
a 1234 8765
b 3333 1212
c 1357 9999
awk '/#/{next}{a[$1] = a[$1] $3 "\t"}END{for(i in a){print i, a[i]}}' file
Would produce
a 1234 8765
b 3333 1212
c 1357 9999
You can change "\t" to a different output separator like " " if you like.
sub(/\t$/, "", a[i]); may be inserted before printif uf you don't like having trailing spaces. Another solution is to check if a[$1] already has a value where you decide if you have append to a previous value or not. It complicates the code a bit though.
Using bash 4.0 or newer (for associative arrays):
declare -A array
while read -r line
do
    if [[ $line && $line != \#* ]]; then
        c=$(echo "$line" | cut -f 1 -d ' ')
        value=$(echo "$line" | cut -f 3 -d ' ')
        array[$c]="${array[$c]} $value"
    fi
done < myFile.txt
for k in "${!array[@]}"
do
    echo "$k${array[$k]}"
done
Will produce:
a 1234 8765
b 3333 1212
c 1357 9999
It stores the letter as the key of the associative array and, in each iteration, appends the corresponding value to it.
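A variant sketch that skips the two cut subshells per line and lets read split the fields itself (same input assumptions as above):
declare -A array
while read -r c _ value; do
    [[ $c && $c != \#* ]] && array[$c]+=" $value"
done < myFile.txt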
$ awk -v RS= -F'\n' '{ for (i=2;i<=NF;i++) {split($i,f,/[[:space:]]+/); map[f[1]] = map[f[1]] " " f[3]} } END{ for (key in map) print key map[key]}' file
a 1234 8765
b 3333 1212
c 1357 9999

Print last match of a sed regex

I have the following:
cat /tmp/cluster_concurrentnodedump.out.20140501.103855 | sed -n '/Starting inject/s/.*[Ii]nject \([0-9]*\).*/\1/p'
Which gives a list of
0
1
2
..
How can I print only the last match with this sed?
Thanks.
Store the substitution results in the hold buffer then print it at the end:
sed -ne '
/Starting inject/ {
# do the substitution
s/.*[Ii]nject \([0-9]*\).*/\1/
# instead of printing, copy the results to the hold buffer
h
}
$ { # at the end of the file:
# copy the hold buffer back to the pattern buffer
x
# print the pattern buffer
p
}
' /tmp/cluster_concurrentnodedump.out.20140501.103855
Use tac to print the file in reverse (last line first) and quit after the first match. Note the braces: without them, the q would run on the very first line whether it matched or not:
tac /tmp/cluster_concurrentnodedump.out.20140501.103855 | sed -n '/Starting inject/{s/.*[Ii]nject \([0-9]*\).*/\1/p;q}'
The ;q inside the block is what quits after the first matching line.
Example
Print last number:
$ cat a
1
2
3
4
5
6
7
8
9
$ tac a | sed -n 's/\([0-9]\)/\1/p;q'
9
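For comparison, an awk take on the same "remember the last match, print at end of file" idea (a sketch, assuming the same log format):
awk '/Starting inject/ { n = $0; sub(/.*[Ii]nject /, "", n); sub(/[^0-9].*/, "", n); last = n }
END { if (last != "") print last }' /tmp/cluster_concurrentnodedump.out.20140501.103855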

awk with joined field

I am trying to extract data from one file, based on another.
The substring from file1 serves as an index to find matches in file2.
All works when the string to be searched for in file2 is between spaces or isolated, but when it is joined to other fields, awk cannot find it. Is there a way to have awk match any part of the strings in file2?
awk -v v1="$Var1" -v v2="$var2" '
NR==FNR {
    if ($4==v1 && $5==v2) {
        s = substr($0,4,8)
        # print s    # debugging; note awk has no "echo"
        a[s]++
    }
    next
}
!($1 in a) {
    print
}' /tmp/file1 /tmp/file2
example that works:
file1:
1 554545352014-01-21 2014-01-21T16:18:01 FS 14001 1 1.10
1 554545362014-01-21 2014-01-21T16:18:08 FS 14002 1 5.50
file2:
55454535 11 17 102 850Sande Fiambre 1.000
55454536 11 17 17 238Pesc. Dourada 1.000
example that does not work:
file2:
5545453501/21/20142 1716:18 1 1 116:18
5545453601/21/20142 1716:18 1 1 216:18
the string to be searched, for instance : 55454535 finds a match in the working example, but it doesn't in the bottom one.
You probably want to replace this:
!($1 in a) {
print
}
with this (or similar - your requirements are unclear):
{
    found = 0
    for (s in a) {
        if ($1 ~ "^"s) {
            found = 1
        }
    }
    if (!found) {
        print
    }
}
Use a regex comparison ~ instead of ==
ex. if ($4 ~ v1 && $5 ~ v2)
Prepend ^ to v1/v2 if you want the field to only begin with the string, and append $ if you want it to only end with it.
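In awk terms, the anchoring options look like this (dynamic regexes are built by string concatenation):
$4 ~ v1            # v1 matches anywhere in the field
$4 ~ "^" v1        # field begins with v1
$4 ~ v1 "$"        # field ends with v1
$4 ~ "^" v1 "$"    # field matches v1 exactly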

How do I count number of matched terms and return a value of zero if they don't match?

I am trying to count the number of matched terms between an input list (one term per line) and a data file, and to create an output file containing each grep'd term together with its match count, returning a value of zero where there is no match.
Input list:
+ 5S_rRNA
+ 7SK
+ AC001
+ AC000111.3
+ AC000111.6
The data.txt file:
chr10 101780038 101780209 5S_rRNA
chr10 103578280 103578430 5S_rRNA
chr10 112327234 112327297 5S_rRNA
chr10 120766459 120766601 7SK
chr10 127408228 127408317 7SK
chr10 127511874 127512063 AADAC
chr10 14614140 14614294 AC000111.3
I would like to create an output file containing all the terms, matched and unmatched, with the corresponding counts, looking like this:
+ 5S_rRNA 3
+ 7SK 2
+ AC001 0
+ AADAC 1
+ AC000111.3 1
+ AC000111.6 0
I can create an output file containing the matched terms and their counts, but I don't know how to return a zero value when there is no match, or how to print all of the output to a separate file.
This is the code I have used so far to handle the matched terms (thanks perreal and Mark Setchell):
#!/bin/bash
while read -r line
do
    line=${line##+ }    # Strip off leading + and space
    n=$(grep "$line" data.txt 2> /dev/null | wc -l)
    if [ "$n" -gt 0 ]; then
        echo "$line"
        echo "$n"
    fi
done < input_list.txt > output.txt
and
cut -d' ' -f2 input.txt | grep -o -f - data.txt | sort | uniq -c | \
sed 's/\s*\([0-9]*\)\s*\(.*\)/+ \2\t\1/' > output.txt
Any suggestions would be great. Thanks
Harriet
You can use this simple loop with grep -c:
while read -r l; do echo -n "+ $l "; grep -c "$l" file1; done < inputs
+ 5S_rRNA 3
+ 7SK 2
+ AC001 0
+ AC000111.3 1
+ AC000111.6 0
cut -d' ' -f2 input.txt | grep -o -f - data.txt | sort | uniq -c | \
sed 's/\s*\([0-9]*\)\s*\(.*\)/+ \2 \1/' | \
join -a 1 -e 0 -j 2 input.txt - -o '1.2 2.3' | \
sed 's/ /\t/;s/^/+ /'
When working with tab-, whitespace- or similarly delimited files, think awk. Perhaps this is what you're looking for. I have used a ternary operator, but you could use an if/else statement if you find it easier to read (see the sketch after the explanation below).
awk 'FNR==NR { a[$4]++; next } { print "+", $2, ($2 in a ? a[$2] : 0) }' data.txt inputlist.txt
Results:
+ 5S_rRNA 3
+ 7SK 2
+ AC001 0
+ AC000111.3 1
+ AC000111.6 0
($2 in a ? a[$2] : 0) means: if column two is a key in the array called a, return the value stored for that key; otherwise, return zero. HTH.
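The if/else form mentioned above would read like this (an equivalent sketch):
awk 'FNR==NR { a[$4]++; next }
{ if ($2 in a) print "+", $2, a[$2]; else print "+", $2, 0 }' data.txt inputlist.txt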