Error in writing output file through AWK scripting - regex

I have a AWK script to write specific values matching with specific pattern to a .csv file.
The code is as follows:
BEGIN{print "Query Start,Query End, Target Start, Target End,Score, E,P,GC"}
/^\>g/ { Query=$0 }
/Query =/{
split($0,a," ")
query_start=a[3]
query_end=a[5]
query_end=gsub(/,/,"",query_end)
target_start=a[8]
target_end=a[10]
}
/Score =/{
split($0,a," ")
score=a[3]
score=gsub(/,/,"",score)
e=a[6]
e=gsub(/,/,"",e)
p=a[9]
p=gsub(/,/,"",p)
gc=a[12]
printf("%s,%s,%s,%s,%s,%s,%s,%s\n",query_start, query_end,target_start,target_end,score,e,p,gc)
}
The input file is as follows:
>gi|ABCDEF|
Plus strand results:
Query = 100 - 231, Target = 100 - 172
Score = 20.92, E = 0.01984, P = 4.309e-08, GC = 51
But I received the output in a .csv file as provided below:
100 0 100 172 0 0 0 51
The program failed to copy the values of:
Query end
Score
E
P
(Note: all the failed values are present before comma (,))
Any help to obtain the right output will be great.
Best regards,
Amit

As #Jidder mentioned, you don't need to call split() and as #jaypal mentioned you're using gsub() incorrectly, but also you don't need to call gsub() at all if you just include , in your FS.
Try this:
BEGIN {
FS = "[[:space:],]+"
OFS = ","
print "Query Start","Query End","Target Start","Target End","Score","E","P","GC"
}
/^\>g/ { Query=$0 }
/Query =/ {
query_start=$4
query_end=$6
target_start=$9
target_end=$11
}
/Score =/ {
score=$4
e=$7
p=$10
gc=$13
print query_start,query_end,target_start,target_end,score,e,p,gc
}
That work? Note the field numbers are bumped out by 1 because when you don't use the default FS awk no longer skips leading white space so there's an empty field before the white space in your input.
Obviously, you are not using your Query variable so the line that populates it is redundant.

Related

GAWK - Multiple BEGIN and END sections

I'm trying to process a bunch of files extracting data using gawk.
File area fixed width space formatted file
I'm trying to extract data from two different lines matched by two different regular expressions but return the data from both of these lines on the ONE print statement.
I can achieve this with the following in a.awk file and use gawk -f to run it. the first BEGIN section setup up input file format (FIELDWIDTHs) and the second BEGIN I am trying to use a loop per file to output based on extracted data. The first END complete the inner BEGIN and the second to match the outer BEGIN.
However I can only apply this to one file at a time because if I apply to a bunch of files (as in gawk -f regex.awk km*.txt , I only get the last file's output.
Can I get a one line of output per file input without having to resort to a script file looping over the input files and running the awk script each time.
Thanks
BEGIN{
OFS=","; FIELDWIDTHS ="2 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12";
printf("Date, Turnover, SalesA, SalesB, SalesC, SalesD, Other Data\n");
}
BEGIN{ Sales = 0;
SalesA = 0;
SalesB = 0;
SalesC = 0;
SalesD = 0;
JointSales = 0;
Turnover = 0;
OtherData = 0;}
/^03/ || /^06/ {
if ($1 == "03") {
Sales = $15/100;
SalesA = $17/100;
SalesB = $26/100;
SalesC = $20/100;
SalesD = $22/100;
JointSales = SalesA - SalesB;
Turnover = JointSales + SalesB + SalesC + SalesD; }
else if ( $1 == "06") {
OtherData = substr($0,183,12)/100; }
# printf("%s, %10.2f, %10.2f, %10.2f, %10.2f, %10.2f, %10.2f\n", getDate(FILENAME), Sales, JointSales, SalesB, SalesC, SalesD, OtherData )
}
END{printf("%s, %10.2f, %10.2f, %10.2f, %10.2f, %10.2f, %10.2f\n", getDate(FILENAME), Sales, JointSales, SalesB, SalesC, SalesD, OtherData ) }
END {}
function getDate(str)
{ date = substr(str,3,6);
year = substr(date,1,2);
month= substr(date,3,2);
day=substr(date,5,2);
odate=(day"/"month"/"year);
return odate
}
If you are using gawk, you're in luck. In addition to BEGIN and END blocks, gawk implements BEGINFILE and ENDFILE blocks, which are executed just as you want: before and after processing each file. See the handy gawk programming guide.
Like all awk implementations, Gnu awk allows you to have multiple BEGIN and END blocks. All BEGIN blocks are run in order (first to last) before the first file is read, and all END blocks are run in the same first-to-last order after the last file is done. Since the same order is used for both types of special block, they don't "nest".
awk only allows one begin and end action set per run (though they can be spread across multiple blocks, they're all combined into one action set) and a run includes all files that you process.
If you want to do something between each file as well, the can use the ARGIND variable which holds the index of the current argument (zero-based). You just need to maintain the last argument index (initially zero) and, if the actual argument index is different, execute your special actions and update the last index.
With empty files (for which no code would be run), the current argument index may be more than one higher than the last so you may need to loop, incrementing the last index until it reaches the current one.
For example, let's print the lines of each file but with special markers for before, within and after. With the file a.in:
xyzzy
plugh
and a b.in file containing nothing, you can use the following script demo.awk:
function middleCheck() {
while (lastArgInd != ARGIND) {
print "MIDDLE after "lastArgInd":"ARGV[lastArgInd]
lastArgInd++
}
}
BEGIN { print "BEGIN"
lastArgInd = 1
}
{ middleCheck()
print " "$0
}
END { middleCheck()
print "END"
}
to effect an action between each file:
pax> vi demo.awk ; awk -f demo.awk b.in a.in a.in b.in a.in b.in b.in
BEGIN
MIDDLE after 1:b.in
xyzzy
plugh
MIDDLE after 2:a.in
xyzzy
plugh
MIDDLE after 3:a.in
MIDDLE after 4:b.in
xyzzy
plugh
MIDDLE after 5:a.in
MIDDLE after 6:b.in
END
You just have to make that action match what you need, your current "inner" end followed by your current "inner" begin.

Finding columns with only white space in a text file and replace them with a unique separator

I have a file like this:
aaa b b ccc 345
ddd fgt f u 3456
e r der der 5 674
As you can see the only way that we can separate the columns is by finding columns that have only one or more spaces. How can we identify these columns and replace them with a unique separator like ,.
aaa,b b,ccc,345
ddd,fgt,f u,3456
e r,der,der,5 674
Note:
If we find all continuous columns with one or more white spaces (nothing else) and replace them with , (all the column) the problem will be solved.
Better explanation of the question by josifoski :
Per block of matrix characters, if all are 'space' then all block should be replaced vertically with one , on every line.
$ cat tst.awk
BEGIN{ FS=OFS=""; ARGV[ARGC]=ARGV[ARGC-1]; ARGC++ }
NR==FNR {
for (i=1;i<=NF;i++) {
if ($i == " ") {
space[i]
}
else {
nonSpace[i]
}
}
next
}
FNR==1 {
for (i in nonSpace) {
delete space[i]
}
}
{
for (i in space) {
$i = ","
}
gsub(/,+/,",")
print
}
$ awk -f tst.awk file
aaa,b b,ccc,345
ddd,fgt,f u,3456
e r,der,der,5 674
Another in awk
awk 'BEGIN{OFS=FS=""} # Sets field separator to nothing so each character is a field
FNR==NR{for(i=1;i<=NF;i++)a[i]+=$i!=" ";next} #Increments array with key as character
#position based on whether a space is in that position.
#Skips all further commands for first file.
{ # In second file(same file but second time)
for(i=1;i<=NF;i++) #Loops through fields
if(!a[i]){ #If field is set
$i="," #Change field to ","
x=i #Set x to field number
while(!a[++x]){ # Whilst incrementing x and it is not set
$x="" # Change field to nothing
i=x # Set i to x so it doesnt do those fields again
}
}
}1' test{,} #PRint and use the same file twice
Since you have also tagged this r, here is a possible solution using the R package readr. It looks like you want to read a fix width file and convert it to a comma-seperated file. You can use read_fwf to read the fix width file and write_csv to write the comma-seperated file.
# required package
require(readr)
# read data
df <- read_fwf(path_to_input, fwf_empty(path_to_input))
# write data
write_csv(df, path = path_to_output, col_names = FALSE)

Parse text file to DGV, need to replace more then 1 consecutive periods

I am parsing a text file and importing it into a Data Grid View. The file is set up like so:
number 1 .......... 845.6
number 2 ....... 0.0001
col 1 col 2 col 3
1.233 4.55 1000
I need to get these values into a DGV then I can insert them into a Excel Template. I have everything working except for one line. the "number 1" line ends up all in one cell. I'll explain why.
I use the following code to process each line and sort of create a csv out of the data first.
TextLine = Regex.Replace(TextLine, " {2,}", " ")
TextLine = Replace(TextLine, " ", ",")
Since the data in the file is separated by more then one consecutive space on every line (except "number 1") I can just replace each occurrence with a comma and I get a nice result. What I was trying to do was replace consecutive periods with a space so "number 1" is separated from the value.
I have tried a few things:
TextLine = Regex.Replace(TextLine, "(.)\1{2,}", " ")
TextLine = Regex.Replace(TextLine, ".{2,}", " ")
the top one works, but also obviously gets rid of other characters that are consecutive. I also can't just remove periods, as some numbers have a decimal. I was thinking the solution might be using "Chr(46)" in that function, but I can't seem to make it work.
You need to escape the dot with a backslash so that it doesn't have a special meaning.
It looks like this is what you want:
Dim textLine As String = "number 1 .......... 845.6"
textLine = Regex.Replace(textLine, "\.{2,}", ",")
Console.WriteLine(textLine) ' outputs "number 1 , 845.6"
Or maybe Regex.Replace(textLine, "\.{2,}", " ") to replace two-or-more consecutive dots with a space.

Read fields from text file and store them in a structure

I am trying to read a file that looks as follows:
Data Sampling Rate: 256 Hz
*************************
Channels in EDF Files:
**********************
Channel 1: FP1-F7
Channel 2: F7-T7
Channel 3: T7-P7
Channel 4: P7-O1
File Name: chb01_02.edf
File Start Time: 12:42:57
File End Time: 13:42:57
Number of Seizures in File: 0
File Name: chb01_03.edf
File Start Time: 13:43:04
File End Time: 14:43:04
Number of Seizures in File: 1
Seizure Start Time: 2996 seconds
Seizure End Time: 3036 seconds
So far I have this code:
fid1= fopen('chb01-summary.txt')
data=struct('id',{},'stime',{},'etime',{},'seizenum',{},'sseize',{},'eseize',{});
if fid1 ==-1
error('File cannot be opened ')
end
tline= fgetl(fid1);
while ischar(tline)
i=1;
disp(tline);
end
I want to use regexp to find the expressions and so I did:
line1 = '(.*\d{2} (\.edf)'
data{1} = regexp(tline, line1);
tline=fgetl(fid1);
time = '^Time: .*\d{2]}: \d{2} :\d{2}' ;
data{2}= regexp(tline,time);
tline=getl(fid1);
seizure = '^File: .*\d';
data{4}= regexp(tline,seizure);
if data{4}>0
stime = '^Time: .*\d{5}';
tline=getl(fid1);
data{5}= regexp(tline,seizure);
tline= getl(fid1);
data{6}= regexp(tline,seizure);
end
I tried using a loop to find the line at which file name starts with:
for (firstline<1) || (firstline>1 )
firstline= strfind(tline, 'File Name')
tline=fgetl(fid1);
end
and now I'm stumped.
Suppose that I am at the line at which the information is there, how do I store the information with regexp? I got an empty array for data after running the code once...
Thanks in advance.
I find it the easiest to read the lines into a cell array first using textscan:
%// Read lines as strings
fid = fopen('input.txt', 'r');
C = textscan(fid, '%s', 'Delimiter', '\n');
fclose(fid);
and then apply regexp on it to do the rest of the manipulations:
%// Parse field names and values
C = regexp(C{:}, '^\s*([^:]+)\s*:\s*(.+)\s*', 'tokens');
C = [C{:}]; %// Flatten the cell array
C = reshape([C{:}], 2, []); %// Reshape into name-value pairs
Now you have a cell array C of field names and their corresponding (string) values, and all you have to do is plug it into struct in the correct syntax (using a comma-separated list in this case). Note that the field names have spaces in them, so this needs to be taken care of before they can be used (e.g replace them with underscores):
C(1, :) = strrep(C(1, :), ' ', '_'); %// Replace spaces with underscores
data = struct(C{:});
Here's what I get for your input file:
data =
Data_Sampling_Rate: '256 Hz'
Channel_1: 'FP1-F7'
Channel_2: 'F7-T7'
Channel_3: 'T7-P7'
Channel_4: 'P7-O1'
File_Name: 'chb01_03.edf'
File_Start_Time: '13:43:04'
File_End_Time: '14:43:04'
Number_of_Seizures_in_File: '1'
Seizure_Start_Time: '2996 seconds'
Seizure_End_Time: '3036 seconds'
Of course, it is possible to prettify it even more by converting all relevant numbers to numerical values, grouping the 'channel' fields together and such, but I'll leave this to you. Good luck!

Generating the shortest regex to match an arbitrary word list

I'm hoping someone might know of a script that can take an arbitrary word list and generated the shortest regex that could match that list exactly (and nothing else).
For example, suppose my list is
1231
1233
1234
1236
1238
1247
1256
1258
1259
Then the output should be:
12(3[13468]|47|5[589])
This is an old post, but for the benefit of those finding it through web searches as I did, there is a Perl module that does this, called Regexp::Optimizer, here: http://search.cpan.org/~dankogai/Regexp-Optimizer-0.23/lib/Regexp/Optimizer.pm
It takes a regular expression as input, which can consist just of the list of input strings separated with |, and outputs an optimal regular expression.
For example, this Perl command-line:
perl -mRegexp::Optimizer -e "print Regexp::Optimizer->new->optimize(qr/1231|1233|1234|1236|1238|1247|1256|1258|1259/)"
generates this output:
(?^:(?^:12(?:3[13468]|5[689]|47)))
(assuming you have installed Regex::Optimizer), which matches the OP's expectation quite well.
Here's another example:
perl -mRegexp::Optimizer -e "print Regexp::Optimizer->new->optimize(qr/314|324|334|3574|384/)"
And the output:
(?^:(?^:3(?:[1238]|57)4))
For comparison, an optimal trie-based version would output 3(14|24|34|574|84). In the above output, you can also search and replace (?: and (?^: with just ( and eliminate redundant parentheses, to obtain this:
3([1238]|57)4
You are probably better off saving the entire list, or if you want to get fancy, create a Trie:
1231
1234
1247
1
|
2
/ \
3 4
/ \ \
1 4 7
Now when you take a string check if it reaches a leaf node. It does, it's valid.
If you have variable length overlapping strings (eg: 123 and 1234) you'll need to mark some nodes as possibly terminal.
You can also use the trie to generate the regex if you really like the regex idea:
Nodes from the root to the first branching are fixed (eg: 12)
Branches create |: (eg: 12(3|4)
Leaf nodes generate a character class (or single character) that follows the parent node: (eg 12(3[14]|47))
This might not generate the shortest regex, to do that you'll might some extra work:
"Compact" ranges if you find them (eg [12345] becomes [1-4])
Add quantifiers for repeated elements (eg: [1234][1234] becomes [1234]{2}
???
I really don't think it's worth it to generate the regex.
This project generates a regexp from a given list of words: https://github.com/bwagner/wordhierarchy
It almost does the same as the above JavaScript solution, but avoids certain superfluous parentheses.
It only uses "|", non-capturing group "(?:)" and option "?".
There's room for improvement when there's a row of single characters:
Instead of e.g. (?:3|8|1|6|4) it could generate [38164].
The generated regexp could easily be adapted to other regexp dialects.
Sample usage:
java -jar dist/wordhierarchy.jar 1231 1233 1234 1236 1238 1247 1256 1258 1259
-> 12(?:5(?:6|9|8)|47|3(?:3|8|1|6|4))
Here's what I came up with (JavaScript). It turned a list of 20,000 6-digit numbers into a 60,000-character regular expression. Compared to a naive (word1|word2|...) construction, that's almost 60% "compression" by character count.
I'm leaving the question open, as there's still a lot of room for improvement and I'm holding out hope that there might be a better tool out there.
var list = new listChar("");
function listChar(s, p) {
this.char = s;
this.depth = 0;
this.parent = p;
this.add = function(n) {
if (!this.subList) {
this.subList = {};
this.increaseDepth();
}
if (!this.subList[n]) {
this.subList[n] = new listChar(n, this);
}
return this.subList[n];
}
this.toString = function() {
var ret = "";
var subVals = [];
if (this.depth >=1) {
for (var i in this.subList) {
subVals[subVals.length] = this.subList[i].toString();
}
}
if (this.depth === 1 && subVals.length > 1) {
ret = "[" + subVals.join("") + "]";
} else if (this.depth === 1 && subVals.length === 1) {
ret = subVals[0];
} else if (this.depth > 1) {
ret = "(" + subVals.join("|") + ")";
}
return this.char + ret;
}
this.increaseDepth = function() {
this.depth++;
if (this.parent) {
this.parent.increaseDepth();
}
}
}
function wordList(input) {
var listStep = list;
while (input.length > 0) {
var c = input.charAt(0);
listStep = listStep.add(c);
input = input.substring(1);
}
}
words = [/* WORDS GO HERE*/];
for (var i = 0; i < words.length; i++) {
wordList(words[i]);
}
document.write(list.toString());
Using
words = ["1231","1233","1234","1236","1238","1247","1256","1258","1259"];
Here's the output:
(1(2(3[13468]|47|5[689])))