Get matching group from previous line - regex

I'm working on a sed script to run through a file and make substitutions. The existing file will have a sequence of floating point numbers, and the sequence ends when a letter is found. Most of the substitutions are straightforward and look like this:
s/(-?[0-9]*\.?[0-9]*) (-?[0-9]*\.?[0-9]*) l/lineto(\1,\2);/g
This just replaces the raw command with a function call.
Some commands have no 1:1 equivalent to a function call, because they depend on coordinates found on the previous line.
So I need to turn this:
1.068 7.399 m
-11.794 13.153 -11.843 12.234 v
Into this:
move(1.068,7.399);
curveto(1.068,7.399,-11.794,13.153,-11.843,12.234);
The last set of coordinates from the previous line needs to be used as the first set of coordinates for this line. The coordinates in the previous line don't always end in the same token, so that this:
-7.451 17.792 -10.366 16.42 -11.198 14.444 c
-11.794 13.153 -11.843 12.234 v
Needs to become this:
curveto(-7.451,17.792,-10.366,16.42,-11.198,14.444);
curveto(-11.198,14.444,-11.794,13.153,-11.843,12.234);
Here's my attempt, which is not working (it is a one-liner, broken into lines here for readability):
s/
.*(-?[0-9]*\.?[0-9]*) (-?[0-9]*\.?[0-9]*) [a-zA-Z]$^(-?[0-9]*\.?[0-9]*) (-?[0-9]*\.?[0-9]*) (-?[0-9]*\.?[0-9]*) (-?[0-9]*\.?[0-9]*) y/
curveto(\1,\2,\3,\4,\5,\6);/
g
What's the correct way to do this?

For your problem, you could try something along these lines in awk:
$NF == "m" { print "move(" $1 "," $2 ");" }
$NF == "v" { print "curveto(" one "," two "," $1 "," $2 "," $3 "," $4 ");" }
$NF == "c" { print "curveto(" $1 "," $2 "," $3 "," $4 "," $5 "," $6 ");" }
{ one = $(NF-2); two = $(NF - 1) }
$NF is the last field of each line and is used to select which transformation to apply. The two fields preceding the command are assigned to the variables one and two (x and y might be a better choice).
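A minimal sketch of the same approach as a shell function, assuming that c lines carry six coordinates and v lines four, as in the question's examples. The function name path_to_calls is made up for illustration, and the l rule is taken from the sed command in the question. The last rule runs after the others, so on a v line x and y still hold the end point of the previous line.

```shell
# path_to_calls: hypothetical name; reads path commands from files or stdin.
path_to_calls() {
  awk '
    $NF == "m" { print "move(" $1 "," $2 ");" }
    $NF == "l" { print "lineto(" $1 "," $2 ");" }
    $NF == "c" { print "curveto(" $1 "," $2 "," $3 "," $4 "," $5 "," $6 ");" }
    # v reuses the end point of the previous line as its first coordinate pair
    $NF == "v" { print "curveto(" x "," y "," $1 "," $2 "," $3 "," $4 ");" }
    # remember the last coordinate pair of this line for the next one
    NF >= 3 { x = $(NF-2); y = $(NF-1) }
  ' "$@"
}
```

It can be used as path_to_calls input.txt, or at the end of a pipeline.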

Related

Bash - Extract a column from a tsv file whose header matches a given pattern

I've got a tab-delimited file called dataTypeA.txt. It looks something like this:
Probe_ID GSM24652 GSM24653 GSM24654 GSM24655 GSM24656 GSM24657
1007_s_at 1149.82818866431 1156.14191288693 743.515922643437 1219.55564561635 1291.68030259557 1110.83793199643
1053_at 253.507372571459 150.907554200493 181.107054946649 99.0610660103702 147.953428467212 178.841519788697
117_at 157.176825094869 147.807257232552 162.11169957066 248.732378039521 176.808414979907 112.885784025819
121_at 1629.87514240262 1458.34809770171 1397.36209234134 1601.83045996129 1777.53949459116 1256.89054921471
1255_g_at 91.9622298972477 29.644137111864 61.3949774595639 41.2554576367652 78.4403716513328 66.5624213750532
1294_at 313.633291641829 305.907304474766 218.567756319376 335.301256439494 337.349552407502 316.760658896597
1316_at 195.799277107983 163.176402437481 111.887056644528 194.008323756222 211.992656497053 135.013920706472
1320_at 34.5168433158599 19.7928225262233 21.7147425051394 25.3213322300348 22.4410631949167 29.6960283168278
1405_i_at 74.938724593443 24.1084307838881 24.8088845994911 113.28326338746 74.6406975005947 70.016519414531
1431_at 88.5010900723741 21.0652011409692 84.8954961447585 110.017339630928 84.1264201735067 49.8556999547353
1438_at 26.0276274326623 45.5977459152141 31.8633816890024 38.568939176828 43.7048363737468 28.5759163094148
1487_at 1936.80799770498 2049.19167519573 1902.85054762899 2079.84030768241 2088.91036902825 1879.84684705068
1494_f_at 358.11266607978 271.309665853292 340.738488775022 477.953251687206 388.441738062896 329.43505750512
1598_g_at 2908.90515715761 4319.04621682741 2405.62061966298 3450.85255814957 2573.97860992156 2791.38660060659
160020_at 416.089910909237 327.353902186303 385.030831004533 385.199279534446 256.512900212781 217.754025190117
1729_at 43.1079499314469 114.654670657195 133.191500889286 86.4106614983387 122.099426341898 218.536976034472
177_at 75.9653827137444 27.4348937420347 16.5837374743166 50.6758325717831 58.7568500760629 18.8061888366161
1773_at 31.1717741953018 158.225161489953 161.976679771553 139.173486349393 218.572194156366 103.916119454
179_at 1613.72113870554 1563.35465407698 1725.1817757679 1694.82209331327 1535.8108561345 1650.09670894426
Let's say I have a variable col="GSM24655". I want to extract the column from dataTypeA.txt that corresponds to this column name.
Additionally, I'd like to put this in a function, where I can just give it a file (e.g. dataTypeA.txt) and a column (e.g. GSM24655), and it'll return that column.
I'm not very proficient in Bash, so I've been having some trouble with this. I'd appreciate the help.
The awk script below does this.
col="GSM24655";
awk -v column_val="$col" '{ if (NR==1) {val=-1; for(i=1;i<=NF;i++) { if ($i == column_val) {val=i;}}} if(val != -1) print $val} ' dataTypeA.txt
How it works: the value of col is passed to the awk script with -v column_val="$col". On the first row (NR==1), the script loops over all fields (for(i=1;i<=NF;i++); the awk variable NF holds the number of fields) and compares each one with column_val (if ($i == column_val)). When a match is found, its field number is stored (val=i). That field is then printed for every row, including the header row (print $val).
If you copy the code below into a file called, say, find_column.sh, you can run sh find_column.sh GSM24655 dataTypeA.txt to display the column whose header matches the first parameter (GSM24655) from the file named by the second parameter (dataTypeA.txt). $1 and $2 are positional parameters; the lines column=$1 and file=$2 assign them to variables.
column=$1;
file=$2;
awk -v column_val="$column" '{ if (NR==1) {val=-1; for(i=1;i<=NF;i++) { if ($i == column_val) {val=i;}}} if(val != -1) print $val} ' "$file"
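Since the question asks for a function, here is a minimal sketch of one. The name get_column and the argument order (file first, then column name) are assumptions for illustration. It sets the field separator to a tab explicitly and, like the script above, prints the header cell as well as the values.

```shell
# get_column FILE NAME: print the column of FILE whose header equals NAME.
# Hypothetical helper; assumes a tab-delimited file with one header row.
get_column() {
  awk -F'\t' -v name="$2" '
    NR == 1 {
      # find the field number of the requested header
      for (i = 1; i <= NF; i++)
        if ($i == name) col = i
      if (!col) exit 1        # header not found: print nothing and fail
    }
    { print $col }
  ' "$1"
}
```

With the variables from the question, the call would be get_column dataTypeA.txt "$col".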
I would use the following; it is quick and easy.
In your script, take the file name as $1 and the column name as $2.
The loop below iterates over a hard-coded header and looks for "Probe_ID"; in a real script, replace the echo with head -1 "$1" and the literal "Probe_ID" with "$2". It prints the number of the matching column and stores it in col. The counter starts at 1 because cut numbers fields from 1.
c=1;
for each in `echo "Probe_ID GSM24652 GSM24653 GSM24654 GSM24655 GSM24656 GSM24657"`;do if [[ $each == "Probe_ID" ]];then
echo $c;
col=$c;
else c=$(( c + 1 ));
fi;
done
Right after this, run cut -d$'\t' -f"$col" "$1" to print that column.

How to represent many parts of awk sub/gsub's matched string

How can I refer to more than one part of the string matched by awk's sub or gsub?
For a regexp like "##code", if I want to insert a word between "##" and "code", I would like something like VSCode's syntax, in which $1 represents the first part and $2 represents the second part:
sub(/(##)(code)/, "$1before$2", str)
From awk's user manual, I found that awk uses & to represent the whole matched string. How can I refer to one, two or more parts of the matched string, as in VSCode?
sub(regexp, replacement [, target])
Search target, which is treated as a string, for the leftmost, longest substring matched by the regular expression regexp. Modify the entire string by replacing the matched text with replacement. The modified string becomes the new value of target. Return the number of substitutions made (zero or one).
The regexp argument may be either a regexp constant (/…/) or a string constant ("…"). In the latter case, the string is treated as a regexp to be matched. See Computed Regexps for a discussion of the difference between the two forms, and the implications for writing your program correctly.
This function is peculiar because target is not simply used to compute a value, and not just any expression will do—it must be a variable, field, or array element so that sub() can store a modified value there. If this argument is omitted, then the default is to use and alter $0. For example:
str = "water, water, everywhere"
sub(/at/, "ith", str)
sets str to ‘wither, water, everywhere’, by replacing the leftmost longest occurrence of ‘at’ with ‘ith’.
If the special character ‘&’ appears in replacement, it stands for the precise substring that was matched by regexp. (If the regexp can match more than one string, then this precise substring may vary.) For example:
{ sub(/candidate/, "& and his wife"); print }
changes the first occurrence of ‘candidate’ to ‘candidate and his wife’ on each input line.
The user manual's link is here
Your best option is to use GNU awk for either of these:
$ awk '{$0=gensub(/(##)(code)/,"\\1before\\2",1)} 1' <<<'##code'
##beforecode
$ awk 'match($0,/(##)(code)/,a){$0=a[1] "before" a[2]} 1' <<<'##code'
##beforecode
The first one only lets you move text segments around while the 2nd lets you call functions, perform math ops or do anything else on the matching text before moving it around in the original or doing anything else with it:
$ awk 'match($0,/(##)(code)/,a){$0=length(a[1])*10 "before" toupper(a[2])} 1' <<<'##code'
20beforeCODE
After thinking about this for a bit, I don't know how to get the desired behavior in any reasonable way using just POSIX awk constructs. Here's something I tried (the matches() function):
$ cat tst.awk
BEGIN {
    str = "foobar"
    re = "(f.*o)(b.*r)"
    printf "\nre \"%s\" matching string \"%s\"\n", re, str
    print "succ: gensub(): ", gensub(re,"<\\1> <\\2>",1,str)
    print "succ: match(): ", (match(str,re,a) ? "<" a[1] "> <" a[2] ">" : "")
    print "succ: matches(): ", (matches(str,re,a) ? "<" a[1] "> <" a[2] ">" : "")
    str = "foofoo"
    re = "(f.*o)(f.*o)"
    printf "\nre \"%s\" matching string \"%s\"\n", re, str
    print "succ: gensub(): ", gensub(re,"<\\1> <\\2>",1,str)
    print "succ: match(): ", (match(str,re,a) ? "<" a[1] "> <" a[2] ">" : "")
    print "fail: matches(): ", (matches(str,re,a) ? "<" a[1] "> <" a[2] ">" : "")
}
function matches(str,re,arr,    start,tgt,n,i,segs) {
    delete arr
    if ( start=match(str,re) ) {
        tgt = substr(str,RSTART,RLENGTH)
        n = split(re,segs,/[)(]+/) - 1
        for (i=1; RSTART && (i < n); i++) {
            if ( match(str,segs[i+1]) ) {
                arr[i] = substr(str,RSTART,RLENGTH)
                str = substr(str,RSTART+RLENGTH)
            }
        }
    }
    return start
}
$ awk -f tst.awk
re "(f.*o)(b.*r)" matching string "foobar"
succ: gensub(): <foo> <bar>
succ: match(): <foo> <bar>
succ: matches(): <foo> <bar>
re "(f.*o)(f.*o)" matching string "foofoo"
succ: gensub(): <foo> <foo>
succ: match(): <foo> <foo>
fail: matches(): <foofoo> <>
but of course that doesn't work for the 2nd case, as the first RE segment, f.*o, matches the whole string foofoo, and the same thing happens if you try to take the RE segments in reverse. I also considered getting the RE segments as above, but then building up a new string one char at a time from the string passed in and comparing the first RE segment to that until it matches, as that would be the shortest string matching the RE segment. But that would fail for a string+RE like:
str='foooobar'
re='(f.*o)(b.*r)'
since f.*o would match foo with that algorithm when it really needs to match foooo.
So - I guess you'd need to keep iterating (being careful of what direction you iterate in - from the end is correct I expect) till you get the string split up into segments that each match every RE segment in a left-most-longest fashion. Seems like a lot of work!
When you use GNU awk, you can use gensub for this purpose. Without gensub for any generic awk it becomes a bit more tedious. The procedure could be something like this:
ere="(ere1)(ere2)"
match(str,ere)
tmp=substr(str,RSTART,RLENGTH)
match(tmp,"ere1"); part1=substr(tmp,RSTART,RLENGTH)
part2=substr(tmp,RSTART+RLENGTH)
sub(ere,part1 "before" part2,str)
The problem with this is that it will not always work, and you have to engineer it a bit. A simple failure can be created by the greediness of the ERE:
str="foocode"
ere="(f.*o)(code)"
match(str,ere) # finds "foocode"
tmp=substr(str,RSTART,RLENGTH) # tmp <: "foocode"
match(tmp,"(f.*o)"); # greedy "fooco"
part1=substr(tmp,RSTART,RLENGTH) # part1 <: "fooco"
part2=substr(tmp,RSTART+RLENGTH) # part2 <: "de"
sub(ere,part1 "before" part2,str) # :> "foocobeforede"
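For the specific ##code case from the question, the first group is a fixed two-character literal, so the greediness problem does not arise and plain match() with substr() is enough. The sketch below assumes exactly that pattern; the helper name insert_before_code is made up, and this does not generalise to groups of variable length.

```shell
# insert_before_code: hypothetical helper name; reads lines on stdin.
insert_before_code() {
  awk '
    match($0, /##code/) {
      # "##" is always 2 characters long, so "code" starts at RSTART + 2
      $0 = substr($0, 1, RSTART + 1) "before" substr($0, RSTART + 2)
    }
    { print }
  '
}
```

It uses only RSTART, match() and substr(), so it should not need GNU awk.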

How to identify '\N' character in data using Pig

I'm getting a very weird character, '\N', in my data. I want to remove or replace this character in the data. Below is a data sample:
Girls Shoes,1325051884
\N,\N
Men's Shirts,\N
Delimiter : comma (,)
I tried a couple of ways to replace or identify this \N character, but none of them worked.
In Pig, positional notation is indicated with the dollar sign ($) and begins with zero (0); for example, $0, $1, $2.
So, in the data mentioned above, the first field is identified by $0 (for e.g. "Girls Shoes") and second is identified by $1 (for e.g. 1325051884).
The following script has logic to replace '\N':
A = LOAD '/data.txt' USING PigStorage(',');
B = FILTER A BY ($0 != '\\N') OR ($1 != '\\N');
dump B;
C = FOREACH B GENERATE ($0 == '\\N' ? '' : $0), ($1 == '\\N' ? '' : $1);
dump C;
Where '/data.txt' contains the following data:
Girl's Shoes,1325051884
\N,\N
Men's Shirts,\N
\N,Boy's Pants
Logic:
A = LOAD '/data.txt' USING PigStorage(',');
Loads data, by assuming the delimiter to be comma (,).
B = FILTER A BY ($0 != '\\N') OR ($1 != '\\N');
For each loaded record, filter the records by condition: $0 (first field) NOT EQUALS '\N' OR $1 (second field) NOT EQUALS '\N'
Output of this stage would be (2nd record containing both '\N' is filtered out):
(Girl's Shoes,1325051884)
(Men's Shirts,\N)
(\N,Boy's Pants)
C = FOREACH B GENERATE ($0 == '\\N' ? '' : $0), ($1 == '\\N' ? '' : $1);
For each of the records generated in the 2nd step, it checks: if $0 is equal to '\N'. If yes, it emits blank (''), else emits $0. Similar logic is applied to $1.
Output of this stage would be:
(Girl's Shoes,1325051884)
(Men's Shirts,)
(,Boy's Pants)
You can see that, '\N' is replaced by blank ('').
I am using Apache Pig 0.15. This script worked perfectly for your data.
A = FILTER data BY $1 == '\\N';
This lists all records whose second field is '\N'. With the two-field sample data, $1 is the last field ($2 would refer to a third field).

Finding columns with only white space in a text file and replace them with a unique separator

I have a file like this:
aaa b b ccc 345
ddd fgt f u 3456
e r der der 5 674
As you can see, the only way to separate the columns is to find the character columns that contain only spaces. How can we identify these columns and replace them with a unique separator such as , so that the file becomes:
aaa,b b,ccc,345
ddd,fgt,f u,3456
e r,der,der,5 674
Note:
If we find all runs of adjacent character columns that contain only white space on every line and replace each run with a single , the problem is solved.
Better explanation of the question by josifoski :
Treat the text as a character matrix: each vertical block whose characters are all spaces should be replaced with a single , on every line.
$ cat tst.awk
BEGIN{ FS=OFS=""; ARGV[ARGC]=ARGV[ARGC-1]; ARGC++ }
NR==FNR {
    for (i=1;i<=NF;i++) {
        if ($i == " ") {
            space[i]
        }
        else {
            nonSpace[i]
        }
    }
    next
}
FNR==1 {
    for (i in nonSpace) {
        delete space[i]
    }
}
{
    for (i in space) {
        $i = ","
    }
    gsub(/,+/,",")
    print
}
$ awk -f tst.awk file
aaa,b b,ccc,345
ddd,fgt,f u,3456
e r,der,der,5 674
Another in awk
awk 'BEGIN{OFS=FS=""} # Empty field separator, so each character is a field
FNR==NR{for(i=1;i<=NF;i++)a[i]+=$i!=" ";next} # First pass: a[i] counts the lines that have
# a non-space character at position i.
# Skips all further commands for the first file.
{ # Second pass (same file, read a second time)
for(i=1;i<=NF;i++) # Loops through the character positions
if(!a[i]){ # If no line has a non-space at position i
$i="," # Change the character to ","
x=i # Set x to the position
while(!a[++x]){ # While the next position is also all spaces
$x="" # Remove that character
i=x # Set i to x so it is not processed again
}
}
}1' test{,} # Print; in bash, test{,} expands to "test test"
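Both awk scripts above set FS to the empty string so that each character becomes a field. POSIX leaves that behaviour unspecified, although gawk and several other awks support it. Here is a sketch of the same two-pass idea that walks the characters with substr() instead; the function name mark_blank_columns is made up for illustration, and it assumes the columns are padded with spaces rather than tabs.

```shell
# mark_blank_columns FILE: hypothetical helper; reads FILE twice.
mark_blank_columns() {
  awk '
    NR == FNR {
      # first pass: remember every position that holds a non-space somewhere
      for (i = 1; i <= length($0); i++)
        if (substr($0, i, 1) != " ") used[i] = 1
      next
    }
    {
      # second pass: copy used positions, emit one comma per run of blank ones
      out = ""; in_gap = 0
      for (i = 1; i <= length($0); i++) {
        if (i in used) { out = out substr($0, i, 1); in_gap = 0 }
        else if (!in_gap) { out = out ","; in_gap = 1 }
      }
      print out
    }
  ' "$1" "$1"
}
```

Because the file is named twice on the command line, this works on regular files but not on a pipe.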
Since you have also tagged this r, here is a possible solution using the R package readr. It looks like you want to read a fixed-width file and convert it to a comma-separated file. You can use read_fwf to read the fixed-width file and write_csv to write the comma-separated file.
# required package
require(readr)
# read data
df <- read_fwf(path_to_input, fwf_empty(path_to_input))
# write data
write_csv(df, path = path_to_output, col_names = FALSE)

Replace several occurences of the same character in a different way in AWK

I want to replace several characters in a csv file depending on the characters around them using AWK.
For example in this line:
"Example One; example one; EXAMPLE ONE; E. EXAMPLE One"
I would like to replace all capital "E"s with "EE" if they are within a word that uses only capitals, and with "Ee" if they are in a word with upper and lower case letters or in an abbreviation (like the E.; it's an address file, so there are no cases where this could also be the end of a sentence), so it should look like this:
"Eexample One; example one; EEXAMPLEE ONEE; Ee. EEXAMPLEE One"
Now what I have tried is this:
{if ($0 ~/E[A-Z]+/)
$0 = gensub(/E/,"EE","g",$0)
else if ($0 ~/[A-Z]E/)
$0 = gensub(/E/,"EE","g",$0)
else
$0 = gensub(/E/,"Ee","g",$0)
}
This works fine in most cases, but for lines (or fields, for that matter) that contain several "E"s where I'd want one replaced with "Ee" and another with "EE", as in "E. EXAMPLE One", it matches the E in "EXAMPLE" and just replaces all "E"s in that line with "EE".
Is there a better way to do this? Can I maybe somehow use if within gensub?
ps: Hope this makes sense, I just started learning the basics of programming!
$ cat tst.awk
{
    head = ""
    tail = $0
    while ( match(tail,/[[:alpha:]]+\.?/) ) {
        tgt = substr(tail,RSTART,RLENGTH)
        add = (tgt ~ /^[[:upper:]]+$/ ? "E" : "e")
        gsub(/E/,"&"add,tgt)
        head = head substr(tail,1,RSTART-1) tgt
        tail = substr(tail,RSTART+RLENGTH)
    }
    print head tail
}
$ awk -f tst.awk file
Eexample One; example one; EEXAMPLEE ONEE; Ee. EEXAMPLEE One
It's not clear though how you distinguish a string of letters followed by a period as an abbreviation or just the end of a sentence.