Two float numbers are attached together in my output text file - regex

In my output file, two columns of float numbers have run together, forming one column. An example is shown here; is there any way to separate these two columns from each other?
Here, this is supposed to be 5 columns separated by whitespace, but the space between columns 3 and 4 is missing. Is there any way to correct this mistake with UNIX commands such as cut, awk, sed, or even regular expressions?
3.77388 0.608871 -8216.342.42161 1.88655
4.39243 0.625 -8238.241.49211 0.889258
4.38903 0.608871 -7871.71.52994 0.883976
4.286 0.653226 -8287.322.3195 2.13736
4.29313 0.629032 -7954.651.59168 1.02046
The corrected version should look like this:
3.77388 0.608871 -8216.34 2.42161 1.88655
4.39243 0.625 -8238.24 1.49211 0.889258
4.38903 0.608871 -7871.7 1.52994 0.883976
4.286 0.653226 -8287.32 2.3195 2.13736
4.29313 0.629032 -7954.65 1.59168 1.02046
More info: column 4 is always less than 10, so it only has one digit to the left of the decimal point.
I have tried to use awk:
tail -n 5 output.dat | awk '{print $3}'
-8216.342.42161
-8238.241.49211
-7871.71.52994
-8287.322.3195
-7954.651.59168
Is there any way to separate this column into two columns?

One solution:
sed 's/\(\.[0-9]*\)\([0-9]\.\)/\1 \2/'
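The first group greedily matches column 3's decimal point and its digits; backtracking hands the last digit before the second dot to the second group, which captures that digit (the start of column 4) together with column 4's decimal point, and the replacement re-inserts the missing space. A minimal usage sketch (fixed_output.dat is an illustrative name; -i assumes GNU sed):
sed 's/\(\.[0-9]*\)\([0-9]\.\)/\1 \2/' output.dat > fixed_output.dat
sed -i 's/\(\.[0-9]*\)\([0-9]\.\)/\1 \2/' output.dat    # or edit in place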

Using Perl one-liner:
perl -pe 's/(\d+\.\d+)(\d\.\d+)/$1 $2/' < output.dat > fixed_output.dat
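The same pattern spelled out with the /x modifier for readability (an annotated sketch; behavior is identical):
perl -pe 's/
    (\d+\.\d+)   # column 3: digits, dot, digits; the greedy match backtracks
                 # so that exactly one digit is left for the next group
    (\d\.\d+)    # column 4: one leading digit (it is always < 10), dot, digits
/$1 $2/x' < output.dat > fixed_output.dat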

Your input file
$ cat file
3.77388 0.608871 -8216.342.42161 1.88655
4.39243 0.625 -8238.241.49211 0.889258
4.38903 0.608871 -7871.71.52994 0.883976
4.286 0.653226 -8287.322.3195 2.13736
4.29313 0.629032 -7954.651.59168 1.02046
Awk approach
awk '{
n = index($3,".")                          # position of the decimal point in field 3
x = substr($3,1,n+3) ~ /\.$/ ? n+1 : n+2   # decide how many chars belong to field 3 (one or two decimals)
$3 = substr($3,1,x) OFS substr($3,x+1)     # split field 3 into two fields
$0 = $0                                    # force a re-split so NF is recalculated
$1 = $1                                    # rebuild the record with OFS (default: a single space)
}1' file
Resulting
3.77388 0.608871 -8216.34 2.42161 1.88655
4.39243 0.625 -8238.24 1.49211 0.889258
4.38903 0.608871 -7871.7 1.52994 0.883976
4.286 0.653226 -8287.32 2.3195 2.13736
4.29313 0.629032 -7954.65 1.59168 1.02046
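If GNU awk is available, the same split can be written as a single substitution with gensub(), which supports backreferences (a sketch under that assumption, not part of the answer above):
gawk '{ $3 = gensub(/(\.[0-9]+)([0-9]\.[0-9]+)$/, "\\1 \\2", 1, $3) } 1' file
Anchoring at the end of field 3 means a well-formed field (only one dot) is left untouched, and assigning to $3 makes awk rebuild the record with the default OFS, a single space.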

awk: how to extract 2 patterns from a single line and then concatenate them?

I want to find 2 patterns in each line and then print them with a dash between them as a separator. Here is a sample of lines:
20200323: #5357 BEAR_SPX_X15_NORDNET_D1 {CU=DKK, ES=E, II=DK0061205473, IR=NRB, LN=BEAR SPX X15 NORDNET D1, MIC=FNDK, NS=1, PC=C, SE=193133, SG=250, SN=193133, TK="0.01 to 100,0.05 to 500,0.1", TS=BEAR_SPX_X15_NORDNET_D1, TY=W, UQ=1}
20200323: #5358 BULL_SPX_X10_NORDNET_D2 {CU=DKK, ES=E, II=DK0061205556, IR=NRB, LN=BULL SPX X10 NORDNET D2, MIC=FNDK, NS=1, PC=P, SE=193132, SG=250, SN=193132, TK="0.01 to 100,0.05 to 500,0.1", TS=BULL_SPX_X10_NORDNET_D2, TY=W, UQ=1}
20200323: #5359 BULL_SPX_X12_NORDNET_D2 {CU=DKK, ES=E, II=DK0061205630, IR=NRB, LN=BULL SPX X12 NORDNET D2, MIC=FNDK, NS=1, PC=P, SE=193131, SG=250, SN=193131, TK="0.01 to 100,0.05 to 500,0.1", TS=BULL_SPX_X12_NORDNET_D2, TY=W, UQ=1}
Given the above lines, my desired output after running a script should look like this:
BEAR_SPX_X15_NORDNET_D1 - DK0061205473
BULL_SPX_X10_NORDNET_D2 - DK0061205556
BULL_SPX_X12_NORDNET_D2 - DK0061205630
The first alphanumeric value (e.g. BULL_SPX_X12_NORDNET_D2) is always in the 3rd position of a line.
The second alphanumeric value (e.g. DK0061205630) can be at various positions, but it's always preceded by "II=" and is always exactly 12 characters long.
I tried to implement my task with the following script:
13 regex='II=.\{12\}'
14 while IFS="" read -r line; do
15 matchedString=`grep -o $regex littletest.txt | tr -d 'II=,'`
16 awk /II=/'{print $3, " - ", $matchedString}' littletest.txt > temp.txt
17 done <littletest.txt
My thought process and intentions/assumptions:
Line 13 defines a regex pattern to match the alphanumeric string preceded by "II=".
In line 15, the variable matchedString is assigned a value extracted from a line via the regex, with the preceding "II=" deleted.
Line 16 uses an awk expression to detect all lines that contain "II=", print the third string found on each of the input file's lines, and also print the value of the matched pattern defined on the previous line of the script. So I expect that at this point a pair of extracted patterns (e.g. BEAR_SPX_X15_NORDNET_D1 - DK0061205473) should be transferred to the temp.txt file.
Line 17 is taking an input file for a script to consume.
However, after running the script I did not get the desired output. Here is a sample of what I got:
BEAR_SPX_X15_NORDNET_D1
20200323: #5357 BEAR_SPX_X15_NORDNET_D1 {CU=DKK, ES=E, II=DK0061205473, IR=NRB, LN=BEAR SPX X15 NORDNET D1, MIC=FNDK, NS=1, PC=C, SE=193133, SG=250, SN=193133, TK="0.01 to 100,0.05 to 500,0.1", TS=BEAR_SPX_X15_NORDNET_D1, TY=W, UQ=1}
How could I achieve my desired output that I described earlier?
$ awk -v OFS=' - ' 'match($0,/II=/){print $3, substr($0,RSTART+3,12)}' file
BEAR_SPX_X15_NORDNET_D1 - DK0061205473
BULL_SPX_X10_NORDNET_D2 - DK0061205556
BULL_SPX_X12_NORDNET_D2 - DK0061205630
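Here match() sets the built-in variable RSTART to the position where the pattern begins, so substr($0,RSTART+3,12) skips the three characters of II= and takes the next 12. If you would rather not hard-code the length, a variant that takes everything up to the next comma (a sketch, assuming the value is always comma-terminated; RLENGTH is the length of the whole match):
awk -v OFS=' - ' 'match($0,/II=[^,]+/){print $3, substr($0,RSTART+3,RLENGTH-3)}' file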
Just trying out awk. (Here "[II=, ]+" is a bracket expression matching runs of the characters I, =, comma and space, which happens to split this input so that the ISIN lands in $8.)
awk 'BEGIN{ FS="[II=, ]+" ; OFS=" - " } {print $3, $8}' file.txt
Using gawk (GNU awk), which supports a regex as the Field Separator (FS), and considering that each line in your file has exactly the same format / number of fields, this works fine in my tests:
awk '{print $3,$9}' FS="[ ]|II=|," OFS=" - " file1
#or FS="[[:space:]]+|II=|[,]" if you might have more than one space between fields
Results
BEAR_SPX_X15_NORDNET_D1 - DK0061205473
BULL_SPX_X10_NORDNET_D2 - DK0061205556
BULL_SPX_X12_NORDNET_D2 - DK0061205630
Since the II= part could be anywhere, this trick could also work, with the penalty of parsing the file twice:
paste -d "-" <(awk '{print $3}' file1) <(awk '/II/{print $2}' RS="[ ]" FS="=|," file1)
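One caveat: paste -d takes single characters, so the joined output uses a bare hyphen (BEAR_SPX_X15_NORDNET_D1-DK0061205473). Since neither value contains a hyphen itself, one extra substitution restores the spaced separator (a sketch):
paste -d "-" <(awk '{print $3}' file1) <(awk '/II/{print $2}' RS="[ ]" FS="=|," file1) | sed 's/-/ - /'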

How to use sed or awk to replace string in csv file

Sorry for a really basic question: how do I replace a particular column in a CSV file with some string?
e.g.
id, day_model,night_model
===========================
1 , ,
2 ,2_DAY ,
3 ,3_DAY ,3_NIGHT
4 , ,
(4 rows)
I want to replace any non-empty value in columns 2 and 3 with true and any empty one with false, leaving rows 1 and 2 (the header and the separator) and the end row untouched.
Output:
id, day_model,night_model
===========================
1 ,false ,false
2 ,true ,false
3 ,true ,true
4 ,false ,false
(4 rows)
What I tried is the following sample code (only trying to replace the string with "true" in column 3):
#awk -F, '$3!=""{$3="true"}' OFS=, file.csv > out.csv
But the out.csv is empty. Please give me some direction.
Many thanks!!
Since your field separator is a comma, the "empty" fields may contain spaces, particularly the 2nd field, so they might not compare equal to the empty string. Note also that your attempt never prints anything: the action only modifies $3, and with no print (or bare 1 pattern) awk produces no output, which is why out.csv is empty.
I would do this:
awk -F, -v OFS=, '
# skip the two header lines and the "(N rows)" trailer
NR>2 && !/^\([0-9]+ rows\)/ {
for (i=2; i<=NF; i++)
$i = ($i ~ /[^[:blank:]]/) ? "true" : "false"
}
{ print }
' file
Well, since you added sed to the tags and you have only three columns, here is a solution to your problem in four steps, because the replacement was not possible for all cases in just one regex. Since your 2nd and 3rd columns may contain only blanks, I wrote four sed commands, one for each kind of row. (Note: sed has no \d, so the commands below use -E with [0-9]; the \s and \S shorthands are GNU sed extensions.)
sed -E 's/^([0-9]+\s+,)\S+\s*,\S+\s*$/\1true,true/' file.txt
This will replace rows like 3 ,3_DAY ,3_NIGHT
Regex101 Demo
sed -E 's/^([0-9]+\s+,)\S+\s*,\s*$/\1true,false/' file.txt
This will replace rows like 2 ,2_DAY ,
Regex101 Demo
sed -E 's/^([0-9]+\s+,)\s*,\S+\s*$/\1false,true/' file.txt
This will replace rows like 5 , ,2_Day
Regex101 Demo
sed -E 's/^([0-9]+\s+,)\s*,\s*$/\1false,false/' file.txt
This will replace rows like 1 , ,
Regex101 Demo
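Because the four patterns are mutually exclusive (each one matches a different full-line shape, and a line rewritten by one is not matched by the later ones), they can also be chained into a single pass with -e (a sketch, again assuming GNU sed):
sed -E -e 's/^([0-9]+\s+,)\S+\s*,\S+\s*$/\1true,true/' \
       -e 's/^([0-9]+\s+,)\S+\s*,\s*$/\1true,false/' \
       -e 's/^([0-9]+\s+,)\s*,\S+\s*$/\1false,true/' \
       -e 's/^([0-9]+\s+,)\s*,\s*$/\1false,false/' file.txt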

Using awk to find a domain name containing the longest repeated word

For example, let's say there is a file called domains.csv with the following:
1,helloguys.ca
2,byegirls.com
3,hellohelloboys.ca
4,hellobyebyedad.com
5,letswelcomewelcomeyou.org
I'm trying to use Linux awk regular expressions to find the line that contains the longest repeated¹ word, so in this case, it will return the line
5,letswelcomewelcomeyou.org
How do I do that?
¹ Meaning "immediately repeated", i.e., abcabc, but not abcXabc.
A pure awk implementation would be rather long-winded as awk regexes don't have backreferences, the usage of which simplifies the approach quite a bit.
I've added one line to the example input file for the case of multiple longest words:
1,helloguys.ca
2,byegirls.com
3,hellohelloboys.ca
4,hellobyebyedad.com
5,letswelcomewelcomeyou.org
6,letscomewelcomewelyou.org
And this gets the lines with the longest repeated sequence:
cut -d ',' -f 2 infile | grep -Eo '(.*)\1' |
awk '{ print length(), $0 }' | sort -k 1,1 -nr |
awk 'NR==1 {prev=$1;print $2;next} $1==prev {print $2;next} {exit}' | grep -f - infile
Since this is pretty anti-obvious, let's split up what this does and look at the output at each stage:
Remove the first column with the line number, to avoid matches for line numbers with repeating digits:
$ cut -d ',' -f 2 infile
helloguys.ca
byegirls.com
hellohelloboys.ca
hellobyebyedad.com
letswelcomewelcomeyou.org
letscomewelcomewelyou.org
Get all lines with a repeated sequence, extract just that repeated sequence:
... | grep -Eo '(.*)\1'
ll
hellohello
ll
byebye
welcomewelcome
comewelcomewel
Get the length of each of those lines:
... | awk '{ print length(), $0 }'
2 ll
10 hellohello
2 ll
6 byebye
14 welcomewelcome
14 comewelcomewel
Sort by the first column, numerically, descending:
...| sort -k 1,1 -nr
14 welcomewelcome
14 comewelcomewel
10 hellohello
6 byebye
2 ll
2 ll
Print the second of these columns for all lines where the first column (the length) has the same value as on the first line:
... | awk 'NR==1{prev=$1;print $2;next} $1==prev{print $2;next} {exit}'
welcomewelcome
comewelcomewel
Pipe this into grep, using the -f - argument to read the pattern list from stdin:
... | grep -f - infile
5,letswelcomewelcomeyou.org
6,letscomewelcomewelyou.org
Limitations
While this can handle the bbwelcomewelcome case mentioned in comments, it will trip on overlapping patterns such as welwelcomewelcome, where it only finds welwel, but not welcomewelcome.
Alternative solution with more awk, less sort
As pointed out by tripleee in the comments, this can be simplified by combining the two awk steps and the sort step into a single awk step, likely improving performance:
$ cut -d ',' -f 2 infile | grep -Eo '(.*)\1' |
awk '{if (length()>ml) {ml=length(); delete a; i=1} if (length()>=ml){a[i++]=$0}}
END{for (i in a){print a[i]}}' |
grep -f - infile
Let's look at that awk step in more detail, with expanded variable names for clarity:
{
# New longest match: throw away stored longest matches, reset index
if (length() > max_len) {
max_len = length()
delete arr_longest
idx = 1
}
# Add line to longest matches
if (length() >= max_len)
arr_longest[idx++] = $0
}
# Print all the longest matches
END {
for (idx in arr_longest)
print arr_longest[idx]
}
Benchmarking
I've timed the two solutions on the top one million domains file mentioned in the comments:
First solution (with sort and two awk steps):
964438,abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijk.com
real 1m55.742s
user 1m57.873s
sys 0m0.045s
Second solution (just one awk step, no sort):
964438,abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijk.com
real 1m55.603s
user 1m56.514s
sys 0m0.045s
And the Perl solution by Casimir et Hippolyte:
964438,abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijk.com
real 0m5.249s
user 0m5.234s
sys 0m0.000s
What we learn from this: ask for a Perl solution next time ;)
Interestingly, if we know that there will be just one longest match and simplify the commands accordingly (just head -1 instead of the second awk command in the first solution, or no tracking of multiple longest matches with awk in the second), the time gained is only in the range of a few seconds.
Portability remark
Apparently, BSD grep can't do grep -f - to read from stdin. In this case, the output of the pipe until there has to be redirected to a temp file, and this temp file then used with grep -f.
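On such systems the same effect can be had with an explicit temporary file (a sketch):
patterns=$(mktemp)
cut -d ',' -f 2 infile | grep -Eo '(.*)\1' |
awk '{if (length()>ml) {ml=length(); delete a; i=1} if (length()>=ml){a[i++]=$0}}
END{for (i in a){print a[i]}}' > "$patterns"
grep -f "$patterns" infile
rm -f "$patterns"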
A way with perl:
perl -F, -ane 'if (@m = $F[1] =~ /(?=(.+)\1)/g) {
    @m = sort { length $b <=> length $a } @m;
    $cl = length $m[0];
    if ($l < $cl) { @res = ($_); $l = $cl; } elsif ($l == $cl) { push @res, $_; }
}
END { print @res; }' file
The idea is to find all longest overlapping repeated strings for each position in the second field; the match array is then sorted, and the longest substring becomes the first item in the array ($m[0]).
Once done, the length of the current repeated substring ($cl) is compared with the stored length (of the previous longest substring). When the current repeated substring is longer than the stored length, the result array is overwritten with the current line, when the lengths are the same, the current line is pushed into the result array.
Details:
Command-line options:
-F, sets the field separator to ,
-ane (e: execute the following code; n: read one line at a time into $_; a: autosplit, using the defined FS, putting the fields in the @F array)
The pattern:
/
(?= # open a lookahead assertion
(.+)\1 # capture group 1 and backreference to the group 1
) # close the lookahead
/g # all occurrences
This is a well-known pattern for finding all overlapping results in a string. The idea is to exploit the fact that a lookahead doesn't consume characters (a lookahead only means "check if this subpattern follows at the current position", but it doesn't match any characters). To obtain the characters matched in the lookahead, all you need is a capture group.
Since a lookahead matches nothing, the pattern is tested at each position (and doesn't care whether the characters have already been captured by group 1 before).
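To see what the lookahead captures, here is a standalone test on one of the sample domains (an illustrative one-liner; note that group 1 holds the repeated unit, not the doubled string):
$ echo 'letswelcomewelcomeyou.org' | perl -nle 'print for /(?=(.+)\1)/g'
welcome
The doubled string itself would be welcomewelcome; since the solution only compares lengths, ranking by the unit is equivalent.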

Awk 3 Spaces + 1 space or hyphen

I have a rather large chart to parse. Each column is separated by either 4 spaces or by 3 spaces and a hyphen (since the numbers in the chart can be negative).
cat DATA.txt | awk "{ print match($0,/\s\s/) }"
does nothing but print a slew of 0's. I'm trying to understand awk and when to escape, etc., but I'm not getting the hang of it. Help is appreciated.
A sample line (shown twice; the second copy has one sign flipped):
1979 1 -0.176 -0.185 -0.412 0.069 -0.129 0.297 -2.132 -0.334 -0.019
1979 1 -0.176 0.185 -0.412 0.069 -0.129 0.297 -2.132 -0.334 -0.019
I would like to get just, say, the second data column. I copied the line, flipping one sign, and I'd like to see -0.185 and 0.185.
You need to start by thinking about bash quoting, since it is bash which interprets the argument to awk which will be the awk program. Inside double-quoted strings, bash expands $0 to the name of the bash executable (or current script); that's almost certainly not what you want, since it will not be a quoted string. In fact, you almost never want to use double quotes around the awk program argument, so you should get into the habit of writing awk '...'.
Also, awk regular expressions don't understand \s (although GNU awk will handle that as an extension). And match returns the position of the match, which I don't think you care about either.
Since by default, awk considers any sequence of whitespace a field separator, you don't really need to play any games to get the fourth column. Just use awk '{print $4}'
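To see the quoting problem concretely (a toy example; the exact expansion of $0 depends on your shell):
$ echo 'a  b' | awk "{ print match($0, /  /) }"
0
$ echo 'a  b' | awk '{ print match($0, /  /) }'
2
In the first command the shell substitutes $0 (e.g. bash) before awk runs, so awk matches against an uninitialized variable and match() returns 0; in the second, awk's own $0 is the input line, and match() reports the position of the two spaces.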
Why not just use this simple awk
awk '$0=$4' Data.txt
-0.185
0.185
It sets $0 to the value of $4 and then performs the default action, print.
PS: do not use cat with a program that can read files itself, like awk.
In case field 4 contains 0 (the assignment would then evaluate as false and the line would not print), you can make it more robust like:
awk '{$0=$4}1' Data.txt
If you're trying to split the input according to 3 or 4 spaces, then you will get the expected output only from column 3.
$ awk -v FS=" {3,4}" '{print $3}' file
-0.185
0.185
FS=" {3,4}" here we pass a regex as FS value. This regex get parsed and set the Field Separator value to three or four spaces. In regex {min,max} called range quantifier which repeats the previous token from min to max times.

Add consecutive entries in a column

I have a file that has the format
0.99987799 17743.000
1.9996300 75.000000
2.9993899 75.000000
3.9991500 102.00000
4.9988999 131.00000
5.9986601 130.00000
6.9984102 152.00000
7.9981699 211.00000
8.9979200 256.00000
9.9976797 259.00000
10.997400 341.00000
11.997200 373.00000
What I would like to do is add the data in the second column, every four lines. So a desired output would be
1 17743+75+75+102
2 131+130+152+211
3 256+259+341+373
How can this be done in awk?
I know that I can find a specific element in the file using
awk 'FNR == 5 {print $2}' file
but I don't know how to add 4 elements in a row. If I try for instance
awk '$2 {print FNR == 5}' file
I get nothing but zeros, so I don't know how to parse the column first and then the line. I also tried
awk 'BEGIN{i=4}
{
for (NR>=1 || NR<=i)
{
print $2
}
}' filename
but I get a syntax error at NR<=i. I also don't have any idea how to loop over the entire file. Any help or idea would be more than welcome! Or would it perhaps be better to do it in C++? I don't know which is more convenient...
I also tried
awk 'BEGIN{sum=0} {{sum += $2} if(FNR%4 == 0) { print sum; sum=0}}' infile.dat
but it doesn't seem to work properly...
awk 'NR%4==1{sum=$2; next}{sum+=$2} NR%4==0{print ++j,sum;}' input.txt
Output:
1 17995
2 624
3 1229
For the first row of a group it stores the value of the second column in sum; for the next three rows it adds the second column to sum; on the last row of a group (NR%4==0) it prints the result.
If you don't need the row numbers before the sum results, just remove ++j,.
awk '{print $2}' file | paste -d+ - - - - | bc
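This prints the second column, then paste -d+ joins every four consecutive lines with + (each - consumes one input line per output line), producing one expression per group for bc to evaluate. The intermediate stream looks like this:
$ awk '{print $2}' file | paste -d+ - - - -
17743.000+75.000000+75.000000+102.00000
131.00000+130.00000+152.00000+211.00000
256.00000+259.00000+341.00000+373.00000
Note this assumes the line count is a multiple of four, prints no group numbers, and bc keeps the operands' decimal scale in the sums.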
This works fine for me:
awk '{sum += $2}
FNR%4==0 {print FNR/4, sum; sum = 0}
END {if(FNR%4){print int(FNR/4)+1, sum}}' awktest.txt
with the result of:
1 17995
2 624
3 1229