I have a large data set with some invalid rows. I want to copy to another file only the rows that start with a valid date (i.e. the first field is all digits).
Basically: check whether awk's $1 consists of digits ([0-9]); if yes, write the whole row ($0) to the output file; if not, skip the row and go to the next one.
This is how I imagined it (both versions give a syntax error):
awk '{if ($1 =~ [0-9]) print $0 }' >> output.txt
awk '$1 =~ [0-9] {print $0}' filename.txt
The following does print the first field, but I have no idea how to proceed from there:
awk '{ print $1 }' filename.txt
19780101
19780102
19780103
a
19780104
19780105
19780106
...
Full data set:
19780101 1 1 1 1 1
19780102 2 2 2 2 2
19780103 3 3 3 3 3
a a a a a a
19780104 4 4 4 4 4
19780105 5 5 5 5 5
19780106 6 6 6 6 6
19780107 7 7 7 7 7
19780108 8 8 8 8 8
19780109 9 9 9 9 9
19780110 10 10 10 10 10
19780111 11 11 11 11 11
19780112 12 12 12 12 12
19780113 13 13 13 13 13
19780114 14 14 14 14 14
19780115 15 15 15 15 15
19780116 16 16 16 16 16
a a a a a a
19780117 17 17 17 17 17
19780118 18 18 18 18 18
19780119 19 19 19 19 19
19780120 20 20 20 20 20
The data set can be reproduced with this R script:
library(dplyr)
library(DataCombine)
N <- 20
df = as.data.frame(matrix(seq(N),nrow=N,ncol=5))
df$date = format(seq.Date(as.Date('1978-01-01'), by = 'day', len = N), "%Y%m%d")
df <- df %>% select(date, everything())
df <- InsertRow(df, NewRow = rep("a", 6), RowNum = 4)
df <- InsertRow(df, NewRow = rep("a", 6), RowNum = 18)
write.table(df,"filename.txt", quote = FALSE, sep="\t",row.names=FALSE)
Questions about reading the first N rows don't address my need, because the invalid rows could be anywhere. This solution doesn't work for some reason.
Since you have a large data set and such a simple requirement, you could just use grep for this as it'd be faster than awk:
grep '^[0-9]' file
Based on your data, you can check whether the first column has exactly 8 digits, representing a date in YYYYMMDD format, with this command (note that in older gawk versions the {8} interval expression requires the --re-interval or --posix option):
awk '$1 ~ /^[0-9]{8}$/' file > output
You can just go with this:
awk '/^[0-9]+/' file.txt >> output.txt
By default awk works on lines, so this tells it to select the lines that start (^) with at least one digit ([0-9]+) and print them, redirecting the result to output.txt.
Hope this helps.
You can also try this:
sed '/^[0-9]/!d' inputfile > outputfile
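All three answers (grep, awk, sed) keep exactly the rows whose first character is a digit. A quick check on a cut-down replica of the sample data (saved here as filename.txt, matching the question):

```shell
# a small replica of the data set, with one invalid row in the middle
printf '19780101 1 1 1 1 1\na a a a a a\n19780102 2 2 2 2 2\n' > filename.txt

# keep only lines that start with a digit
grep '^[0-9]' filename.txt
# the awk and sed variants print the same two lines:
# awk '/^[0-9]/' filename.txt
# sed '/^[0-9]/!d' filename.txt
```

Each command prints the two date rows and drops the "a a a a a a" row.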
Related
How to get the next 5 lines after a certain pattern is matched in TCL
I have some 30 lines of output and need only a few lines in between...
Might be easier to split the output into a list of lines so you can use lsearch:
% set output [exec seq 10]
1
2
3
4
5
6
7
8
9
10
% set lines [split $output \n]
1 2 3 4 5 6 7 8 9 10
% set idx [lsearch -regexp $lines {4}]
3
% set wanted [lrange $lines $idx+1 $idx+5]
5 6 7 8 9
Just append something to your regular expression! Like this:
([^\n]*\n){5}
Glenn Jackman's solution is probably better, but the line processing command in fileutil can be preferable for some variations.
package require fileutil
Given a file that looks like this:
% cat file.txt
1
2
3
4
5
6
7
8
9
10
Now, for each line in the file
set n 0
set re 4
set nlines 5
::fileutil::foreachLine line file.txt {
if {$n > 0} {
puts $line
incr n -1
}
if {$n == 0 && [regexp $re $line]} {
set n $nlines
}
}
If the counter n is greater than 0, print the line and decrement. If n is equal to 0 and the regular expression matches the line, set n to $nlines (5).
# output:
5
6
7
8
9
Documentation: fileutil package, if, incr, package, puts, Syntax of Tcl regular expressions, regexp, set
I have a file that needs to be separated into multiple files based on a search pattern, with a different header for each output file. I can split the file, but I am unable to add a different header to each file. Here's the code I tried:
BEGIN {
{
a=substr($0,38,2)
if(a=="HD")
{
print"a","b","c"...
OFS="|"
}
if(a=="AS")
{
print"e","f","g"...
OFS="|"
}
}
}
{
a=substr($0,38,2)
if(a=="HD")
{
FIELDWIDTHS="10 8 10 9 2 1 1 11 14 14 14 14 14 14 14 14 8 60 30 30 32 32 27 18 11 346"
OFS="|"
}
if(a=="AS")
{
FIELDWIDTHS="10 8 10 9 2 1 7 30 14 14 14 14 625"
OFS="|"
}
}
{
$1=$1
print > a".txt"
}
Why don't you do it like the below? And as far as I can see, setting FIELDWIDTHS only works properly if it is set in the BEGIN block (or when changing input files...):
awk 'NR==1 { HEADER1 = "whatever" ; HEADER2 = "whatever2" ;
print HEADER1 > "HD.txt" ;
print HEADER2 > "AS.txt" ;
}
{ a = substr($0, 38, 2)
OFS = "|"
print $0 >> (a ".txt")
}' INPUTFILE
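A minimal sanity check of the substr-based routing (the sample records and filenames below are made up; the records are padded so the type code lands at columns 38-39):

```shell
# build two fake fixed-width records: 37 filler characters, then the 2-char type code
printf '%037dHD header-type payload\n' 1  > input.txt
printf '%037dAS detail-type payload\n' 2 >> input.txt

# route each line to <code>.txt based on the code at columns 38-39
awk '{ code = substr($0, 38, 2); print > (code ".txt") }' input.txt

cat HD.txt AS.txt
```

After the run, HD.txt holds the HD record and AS.txt holds the AS record; the per-file headers would be printed into the same files first, as in the answer above.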
I am trying to print specific parts of a file that looks like this:
99999 1 55 127
{bunch of numbers here}
99999 2 55 126
{bunch of numbers here}
99999 3 55 144
{bunch of numbers here}
and basically I am trying to print the "bunch of numbers" (along with the preceding line) when a specific sequence is met. The 99999 is always constant and I don't care about the number right after it, but I want to put conditions on the next two numbers.
#! /usr/bin/awk -f
BEGIN{}
{
if ( $3 == 55 && $4 = 100 )
{next
do{print $0}
while($1 != 99999}
}}
END{}
I'm quite new to awk and would really appreciate the help! Thanks
Your question is not clear to me...
I guess you want to print the block of lines starting at (inclusive) a 99999 x 55 100 line and ending before (exclusive) the next 99999 ... line.
I used your example (btw, you should provide a better example, along with the expected output), but I changed your criteria to $3==55 and $4==126 so that the block sits in the middle of your data.
awk '$1==99999{f=($3==55&&$4==126)?1:0}f' file
test:
kent$ cat f
99999 1 55 127
{bunch of numbers here}
1
2
99999 2 55 126
3
4
{bunch of numbers here}
99999 3 55 144
5
6
{bunch of numbers here}
kent$ awk '$1==99999{f=($3==55&&$4==126)?1:0}f' f
99999 2 55 126
3
4
{bunch of numbers here}
You can use a flag: set it when both the third and fourth fields match, and reset it (by exiting) when you reach the next line whose first field is 99999:
awk '
$1 == 99999 && flag == 1 { exit }
$3 == 55 && $4 == 126 { flag = 1 }
flag == 1 { print }
' infile
It yields:
99999 2 55 126
{bunch of numbers here}
I have a file with a correspondence key -> value:
sort keyFile.txt | head
ENSMUSG00000000001 ENSMUSG00000000001_Gnai3
ENSMUSG00000000003 ENSMUSG00000000003_Pbsn
ENSMUSG00000000003 ENSMUSG00000000003_Pbsn
ENSMUSG00000000028 ENSMUSG00000000028_Cdc45
ENSMUSG00000000028 ENSMUSG00000000028_Cdc45
ENSMUSG00000000028 ENSMUSG00000000028_Cdc45
ENSMUSG00000000031 ENSMUSG00000000031_H19
ENSMUSG00000000031 ENSMUSG00000000031_H19
ENSMUSG00000000031 ENSMUSG00000000031_H19
ENSMUSG00000000031 ENSMUSG00000000031_H19
And I would like to replace every occurrence of the "key" with its "value" in temp.txt:
head temp.txt
ENSMUSG00000000001:001 515
ENSMUSG00000000001:002 108
ENSMUSG00000000001:003 64
ENSMUSG00000000001:004 45
ENSMUSG00000000001:005 58
ENSMUSG00000000001:006 63
ENSMUSG00000000001:007 46
ENSMUSG00000000001:008 11
ENSMUSG00000000001:009 13
ENSMUSG00000000003:001 0
The result should be:
out.txt
ENSMUSG00000000001_Gnai3:001 515
ENSMUSG00000000001_Gnai3:002 108
ENSMUSG00000000001_Gnai3:003 64
ENSMUSG00000000001_Gnai3:004 45
ENSMUSG00000000001_Gnai3:005 58
ENSMUSG00000000001_Gnai3:006 63
ENSMUSG00000000001_Gnai3:007 46
ENSMUSG00000000001_Gnai3:008 11
ENSMUSG00000000001_Gnai3:009 13
ENSMUSG00000000001_Gnai3:001 0
I have tried a few variations following this AWK example but as you can see the result is not what I expected:
awk 'NR==FNR{a[$1]=$1;next}{$1=a[$1];}1' keyFile.txt temp.txt | head
515
108
64
45
58
63
46
11
13
0
My guess is that column 1 of temp.txt does not exactly match column 1 of keyFile.txt. Could someone please help me with this?
R/python/sed solutions are also welcome.
Use an awk command like this:
awk 'NR==FNR {a[$1]=$2;next} {
split($1, b, ":");
if (b[1] in a)
print a[b[1]] ":" b[2], $2;
else
print $0;
}' keyFile.txt temp.txt
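To see what the lookup does, here is the same program run on a two-line mock of each file (the NOKEY row is added here only to exercise the else branch; it is not in the question's data):

```shell
printf 'ENSMUSG00000000001 ENSMUSG00000000001_Gnai3\n' > keyFile.txt
printf 'ENSMUSG00000000001:001 515\nNOKEY:002 7\n' > temp.txt

awk 'NR==FNR {a[$1]=$2; next} {
    split($1, b, ":");
    if (b[1] in a)
        print a[b[1]] ":" b[2], $2;
    else
        print $0;
}' keyFile.txt temp.txt
```

The first line is rewritten to ENSMUSG00000000001_Gnai3:001 515, and the unmatched NOKEY line passes through unchanged.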
Code for GNU sed:
sed -nr '$!N;/^(.*)\n\1$/!bk;D;:k;s#\S+\s+(\w+)_(\w+)#/^\1/s/(\\w+)(:\\w+)\\s+(\\w+)/\\1_\2\\2 \\3/p#;P;s/^(.*)\n//' keyfile.txt|sed -nrf - temp.txt
ENSMUSG00000000001_Gnai3:001 515
ENSMUSG00000000001_Gnai3:002 108
ENSMUSG00000000001_Gnai3:003 64
ENSMUSG00000000001_Gnai3:004 45
ENSMUSG00000000001_Gnai3:005 58
ENSMUSG00000000001_Gnai3:006 63
ENSMUSG00000000001_Gnai3:007 46
ENSMUSG00000000001_Gnai3:008 11
ENSMUSG00000000001_Gnai3:009 13
ENSMUSG00000000003_Pbsn:001 0
Another awk option
awk -F: 'NR == FNR{split($0, a, " "); x[a[1]]=a[2]; next}{print x[$1]":"$2}' keyFile.txt temp.txt
Another awk version:
awk 'NR==FNR{a[$1]=$2;next}
{sub(/[^:]+/,a[substr($1,1,index($1,":")-1)])}1' keyFile.txt temp.txt
I have a string of digits in rows as below:
6390212345678912011012112121003574820069121409100000065471234567810
6390219876543212011012112221203526930428968109100000065478765432196
That I need to split into 6 columns as below:
639021234567891,201101211212100,3574820069121409,1000000,654712345678,10
639021987654321,201101211222120,3526930428968109,1000000,654787654321,96
Conditions:
Field 1 = 15 Char
Field 2 = 15 Char
Field 3 = 15 or 16 Char
Field 4 = 7 Char
Field 5 = 12 Char
Field 6 = 2 Char
Final Output:
639021234567891,3574820069121409,654712345678
639021987654321,3526930428968109,654787654321
It's not clear how to detect whether field 3 should have 15 or 16 chars. But as a draft for the first 3 fields you could use something like this:
echo 63902910069758520110121121210035748200670169758510 |
awk '{ printf("%s,%s,%s\n", substr($1,1,15), substr($1,16,15), substr($1,31,15)) }'
Or with sed:
echo $NUM | sed -r 's/^([0-9]{15})([0-9]{15})([0-9]{15,16}) ...$/\1,\2,\3, .../'
This will use 15 or 16 for the length of field 3, based on the length of the whole string.
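One way to derive that width per line: the other five fields are fixed at 15+15+7+12+2 = 51 characters, so field 3 gets whatever is left. A sketch along those lines, run on the two sample lines from the question (assuming every line is well-formed):

```shell
printf '%s\n' \
    6390212345678912011012112121003574820069121409100000065471234567810 \
    6390219876543212011012112221203526930428968109100000065478765432196 |
awk '{
    f3w = length($0) - 51               # fixed fields total 51 chars; the rest is field 3
    f1  = substr($0, 1, 15)             # field 1: chars 1-15
    f3  = substr($0, 31, f3w)           # field 3 starts after fields 1+2 (30 chars)
    f5  = substr($0, 31 + f3w + 7, 12)  # skip field 4 (7 chars) to reach field 5
    print f1 "," f3 "," f5
}'
```

This prints the two lines shown under "Final Output" above.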
If you're using gawk:
gawk -v f3w=16 'BEGIN {OFS=","; FIELDWIDTHS="15 15 " f3w " 7 12 2"} {print $1, $3, $5}'
Do you know ahead of time what the width of field 3 should be? Do you need it to be programmatically determined? How? Based on the total length of the line? Does it change line by line?
Edit:
If you don't have gawk, then this is a similar approach:
awk -v f3w=16 'BEGIN { OFS=","; FIELDWIDTHS="15 15 " f3w " 7 12 2"; n=split(FIELDWIDTHS,fw," ") }
{ p=1; r=$0; for (i=1;i<=n;i++) { $i=substr(r,p,fw[i]); p += fw[i] }; print $1,$3,$5 }'