Split file on pattern matching with different header into files using AWK - regex

I have a file that needs to be split into multiple files based on a search pattern, with a different header for each output file. I can split the file, but I am unable to add a different header to each file. Here's the code I tried:
BEGIN {
{
a=substr($0,38,2)
if(a=="HD")
{
print"a","b","c"...
OFS="|"
}
if(a=="AS")
{
print"e","f","g"...
OFS="|"
}
}
}
{
a=substr($0,38,2)
if(a=="HD")
{
FIELDWIDTHS="10 8 10 9 2 1 1 11 14 14 14 14 14 14 14 14 8 60 30 30 32 32 27 18 11 346"
OFS="|"
}
if(a=="AS")
{
FIELDWIDTHS="10 8 10 9 2 1 7 30 14 14 14 14 625"
OFS="|"
}
}
{
$1=$1
print > a".txt"
}

Why not do it like this? As far as I can see, setting FIELDWIDTHS only works properly if it is set in the BEGIN block (or when switching to a new input file...):
awk 'NR==1 { HEADER1 = "whatever" ; HEADER2 = "whatever2" ;
# write each header once, before any data line
print HEADER1 > "HD.txt" ;
print HEADER2 > "AS.txt" ;
}
{ a=substr($0,38,2)
# plain ">" keeps the file open, so later lines land after the header
print $0 > (a ".txt")
}' INPUTFILE
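The same idea can be sketched so that each header is written only when the first record of that type shows up. This is a minimal sketch under stated assumptions: the function name `split_by_type`, the output directory argument and the placeholder headers `a|b|c` / `e|f|g` are hypothetical (the real column names are elided in the question), and it copies each record unchanged rather than splitting it with FIELDWIDTHS, so it runs in any POSIX awk.

```shell
split_by_type() {
    # $1 = input file, $2 = output directory (both hypothetical arguments)
    awk -v outdir="$2" '
    BEGIN {
        # Placeholder headers; the real column names are elided in the question.
        header["HD"] = "a|b|c"
        header["AS"] = "e|f|g"
    }
    {
        type = substr($0, 38, 2)           # record type at columns 38-39
        if (!(type in header)) next        # skip unknown record types
        out = outdir "/" type ".txt"
        if (!(type in seen)) {             # first record of this type: emit header
            print header[type] > out
            seen[type] = 1
        }
        print $0 > out                     # same open file, so this lands after the header
    }' "$1"
}
```

Calling `split_by_type INPUTFILE .` would then create `HD.txt` and `AS.txt` in the current directory, each starting with its own header line.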

Bash select valid rows from file with awk

I have a large data set with some invalid rows. I want to copy to another file only the rows that start with a valid date (regex digits).
Basically: check whether awk's $1 is a digit ([0-9]); if yes, write the whole row ($0) to the output file; if not, skip the row and go to the next one.
This is how I imagine it (both versions give a syntax error):
awk '{if ($1 =~ [0-9]) print $0 }' >> output.txt
awk '$1 =~ [0-9] {print $0}' filename.txt
While this does print the first field, I have no idea how to proceed.
awk '{ print $1 }' filename.txt
19780101
19780102
19780103
a
19780104
19780105
19780106
...
Full data set:
19780101 1 1 1 1 1
19780102 2 2 2 2 2
19780103 3 3 3 3 3
a a a a a a
19780104 4 4 4 4 4
19780105 5 5 5 5 5
19780106 6 6 6 6 6
19780107 7 7 7 7 7
19780108 8 8 8 8 8
19780109 9 9 9 9 9
19780110 10 10 10 10 10
19780111 11 11 11 11 11
19780112 12 12 12 12 12
19780113 13 13 13 13 13
19780114 14 14 14 14 14
19780115 15 15 15 15 15
19780116 16 16 16 16 16
a a a a a a
19780117 17 17 17 17 17
19780118 18 18 18 18 18
19780119 19 19 19 19 19
19780120 20 20 20 20 20
The data set can be reproduced with R
library(dplyr)
library(DataCombine)
N <- 20
df = as.data.frame(matrix(seq(N),nrow=N,ncol=5))
df$date = format(seq.Date(as.Date('1978-01-01'), by = 'day', len = N), "%Y%m%d")
df <- df %>% select(date, everything())
df <- InsertRow(df, NewRow = rep("a", 6), RowNum = 4)
df <- InsertRow(df, NewRow = rep("a", 6), RowNum = 18)
write.table(df,"filename.txt", quote = FALSE, sep="\t",row.names=FALSE)
Questions about reading first N rows don't address my need, because my invalid rows could be anywhere. This solution doesn't work for some reason.
Since you have a large data set and such a simple requirement, you could just use grep for this as it'd be faster than awk:
grep '^[0-9]' file
Based on your data, you can check whether the first column consists of exactly 8 digits, i.e. a date in YYYYMMDD format, with this command:
awk '$1 ~ /^[0-9]{8}$/' file > output
You can just go with this:
awk '/^[0-9]+/' file.txt >> output.txt
By default awk works on lines, so you tell him (I am assuming he is a boy) to select the lines that start (^) with at least one digit ([0-9]+) and to print them, appending the result to output.txt.
Hope this helps.
You can also try this:
sed '/^[0-9]/!d' inputfile > outputfile
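For completeness, here is a sketch of the 8-digit check that avoids the `{8}` interval expression, which some older awk builds do not support. The function name `filter_dates` is hypothetical; the test itself only relies on `length` and a negated character class.

```shell
filter_dates() {
    # $1 = input file; prints rows whose first field is exactly 8 digits
    awk 'length($1) == 8 && $1 !~ /[^0-9]/' "$1"
}
```

A call such as `filter_dates filename.txt > output.txt` would keep the `1978...` rows and drop the `a a a a a a` rows. It does not check that the digits form a real calendar date.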

Regular Expression to get substrings in PowerShell

I need help with a regular expression. I have thousands of lines in a file with the following format:
+ + [COMPILED]\SRC\FileCheck.cs - TotalLine: 99 RealLine: 27 Braces: 18 Comment: 49 Empty: 5
+ + [COMPILED]\SRC\FindstringinFile.cpp - TotalLine: 103 RealLine: 26 Braces: 22 Comment: 50 Empty: 5
+ + [COMPILED]\SRC\findingstring.js - TotalLine: 91 RealLine: 22 Braces: 14 Comment: 48 Empty: 7
+ + [COMPILED]\SRC\restinpeace.h - TotalLine: 95 RealLine: 24 Braces: 16 Comment: 48 Empty: 7
+ + [COMPILED]\SRC\Getsomething.h++ - TotalLine: 168 RealLine: 62 Braces: 34 Comment: 51 Empty: 21
+ + [COMPILED]\SRC\MemDataStream.hh - TotalLine: 336 RealLine: 131 Braces: 82 Comment: 72 Empty: 51
+ + [CONTEXT]\SRC\MemDataStream.sql - TotalLine: 36 RealLine: 138 Braces: 80 Comment: 76 Empty: 59
I need a regular expression that can give me:
FilePath i.e. \SRC\FileMap.cpp
Extension i.e. .cpp
RealLine value i.e. 17
I'm using PowerShell to implement this and have been successful in getting the results back using the Get-Content (to read the file) and Select-String cmdlets.
The problem is that it's taking a long time to get the various substrings and then write them to the XML file. (I have not included the code for generating the XML.)
I've never used regular expressions before, but I know a regular expression would be an efficient way to get the strings.
Help would be appreciated.
The Select-String cmdlet accepts the regular expression to search for the string.
Current code is as follows:
function Get-SubString
{
Param ([string]$StringtoSearch, [string]$StartOfTheString, [string]$EndOfTheString)
If($StringtoSearch.IndexOf($StartOfTheString) -eq -1 )
{
return
}
[int]$StartOfIndex = $StringtoSearch.IndexOf($StartOfTheString) + $StartOfTheString.Length
[int]$EndOfIndex = $StringtoSearch.IndexOf($EndOfTheString , $StartOfIndex)
if( $StringtoSearch.IndexOf($StartOfTheString)-ne -1 -and $StringtoSearch.IndexOf($EndOfTheString) -eq -1 )
{
[string]$ExtractedString=$StringtoSearch.Substring($StartOfTheString.Length)
}
else
{
[string]$ExtractedString = $StringtoSearch.Substring($StartOfIndex, $EndOfIndex - $StartOfIndex)
}
Return $ExtractedString
}
function Get-FileExtension
{
Param ( [string]$Path)
[System.IO.Path]::GetExtension($Path)
}
#For each file extension we will be searching all lines starting with + +
$SearchIndividualLines = "+ + ["
$TotalLines = select-string -Pattern $SearchIndividualLines -Path $StandardOutputFilePath -allmatches -SimpleMatch
for($i = $TotalLines.GetLowerBound(0); $i -le $TotalLines.GetUpperBound(0); $i++)
{
$FileDetailsString = $TotalLines[$i]
#Get File Path
$StartStringForFilePath = "]"
$EndStringforFilePath = "- TotalLine"
$FilePathValue = Get-SubString -StringtoSearch $FileDetailsString -StartOfTheString $StartStringForFilePath -EndOfTheString $EndStringforFilePath
#Write-Host FilePathValue is $FilePathValue
#GetFileExtension
$FileExtensionValue = Get-FileExtension -Path $FilePathValue
#Write-Host FileExtensionValue is $FileExtensionValue
#GetRealLine
$StartStringForRealLine = "RealLine:"
$EndStringforRealLine = "Braces"
$RealLineValue = Get-SubString -StringtoSearch $FileDetailsString -StartOfTheString $StartStringForRealLine -EndOfTheString $EndStringforRealLine
if([string]::IsNullOrEmpty($RealLineValue))
{
continue
}
}
Assume you have those in C:\temp\sample.txt
Something like this?
PS> (get-content C:\temp\sample.txt) | % { if ($_ -match '.*COMPILED\](\\.*)(\.\w+)\s*.*RealLine:\s*(\d+).*') { [PSCustomObject]@{FilePath=$matches[1]; Extention=$Matches[2]; RealLine=$matches[3]} } }
FilePath              Extention RealLine
--------              --------- --------
\SRC\FileCheck        .cs       27
\SRC\FindstringinFile .cpp      26
\SRC\findingstring    .js       22
\SRC\restinpeace      .h        24
\SRC\Getsomething     .h        62
\SRC\MemDataStream    .hh       131
Update:
Anything inside parentheses is captured, so if you want to capture [COMPILED] as well, you just need to move that part inside the parentheses:
Instead of
$_ -match '.*COMPILED\](\\.*)
use
$_ -match '.*(\[COMPILED\]\\.*)
The link in the comment to your question includes a good primer on the regex.
UPDATE 2
Now that you want to capture the whole path, I am guessing your sample looks like this:
+ + [COMPILED]C:\project\Rom\Main\Plan\file1.file2.file3\Cmd\Camera.culture.less-Late-PP.min.js - TotalLine: 336 RealLine: 131 Braces: 82 Comment: 72 Empty: 51
The technique above will work, you just need to do a very slight adjustment for the first parenthesis like this:
$_ -match (\[COMPILED\].*)
This tells the regex to capture [COMPILED] and everything that comes after it, up to
(\.\w+)
i.e. up to the extension, which is a dot followed by one or more word characters (letters, digits or underscores, so .3gp still matches, but an extension such as .h++ is cut short at .h).
So your original one-liner would instead be:
(get-content C:\temp\sample.txt) | % { if ($_ -match '.(\[COMPILED\].*)(\.\w+)\s*.*RealLine:\s*(\d+).*') { [PSCustomObject]@{FilePath=$matches[1]; Extention=$Matches[2]; RealLine=$matches[3]} } }
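As a cross-check of what the three capture groups are meant to pull out, here is the same extraction sketched in awk instead of PowerShell. This is a swapped-in technique rather than the answer's method: the function name `extract_fields` and the `|`-separated output are assumptions, and it locates the fields with `index`, `split` and `match` instead of one large regex. Unlike the `\w+` group, it takes everything after the last dot as the extension, so `.h++` is kept whole.

```shell
extract_fields() {
    # $1 = report file; prints FilePath|Extension|RealLine for each report line
    awk '
    {
        start = index($0, "]")                 # end of the [COMPILED]/[CONTEXT] tag
        stop  = index($0, " - TotalLine:")     # start of the counters
        if (!start || stop <= start) next      # not a report line
        path = substr($0, start + 1, stop - start - 1)

        n = split(path, parts, ".")            # extension = text after the last dot
        ext = (n > 1) ? "." parts[n] : ""

        real = ""
        if (match($0, /RealLine: *[0-9]+/)) {
            real = substr($0, RSTART, RLENGTH)
            sub(/^[^0-9]*/, "", real)          # drop the "RealLine: " label
        }
        print path "|" ext "|" real
    }' "$1"
}
```

Here the path keeps its extension, as in the question's `\SRC\FileMap.cpp` example.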

awk with joined field

I am trying to extract data from one file, based on another.
The substring from file1 serves as an index to find matches in file2.
All works when the string to be searched for in file2 is between spaces or isolated, but when it is joined to other fields awk cannot find it. Is there a way to have awk match any part of the strings in file2?
awk -vv1="$Var1" -vv2="$var2" '
NR==FNR {
if ($4==v1 && $5==v2) {
s=substr($0,4,8)
echo $s
a[s]++
}
next
}
!($1 in a) {
print
}' /tmp/file1 /tmp/file2
example that works:
file1:
1 554545352014-01-21 2014-01-21T16:18:01 FS 14001 1 1.10
1 554545362014-01-21 2014-01-21T16:18:08 FS 14002 1 5.50
file2:
55454535 11 17 102 850Sande Fiambre 1.000
55454536 11 17 17 238Pesc. Dourada 1.000
example that does not work:
file2:
5545453501/21/20142 1716:18 1 1 116:18
5545453601/21/20142 1716:18 1 1 216:18
The string to be searched for, for instance 55454535, finds a match in the working example, but not in the bottom one.
You probably want to replace this:
!($1 in a) {
print
}
with this (or similar - your requirements are unclear):
{
found = 0
for (s in a) {
if ($1 ~ "^"s) {
found = 1
}
}
if (!found) {
print
}
}
Use a regex comparison (~) instead of ==, e.g.
if ($4 ~ v1 && $5 ~ v2)
Prepend ^ to v1/v2 if the field must begin with the string, and append $ if it must end with it.
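A prefix test can also be written without a regex, which avoids surprises if a key ever contains regex metacharacters. The sketch below is only an illustration under assumptions: the function name `match_by_prefix` is hypothetical, the first file is assumed to hold one key per line, and it prints the lines that do match (swap the logic to get the non-matching ones, as in the original `!($1 in a)` test).

```shell
match_by_prefix() {
    # $1 = file with one key per line (hypothetical layout), $2 = data file
    awk '
    NR == FNR { keys[$1]; next }           # first file: remember every key
    {
        for (k in keys)
            if (index($1, k) == 1) {       # $1 starts with key k (literal, no regex)
                print
                next
            }
    }' "$1" "$2"
}
```

With a key of `55454535`, the joined line `5545453501/21/20142 1716:18 1 1 116:18` is now found, because the key only has to be a prefix of the first field rather than equal to it.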

Using AWK to print lines between patterns after a pattern has been met

I am trying to print specific parts of a file that looks like this:
99999 1 55 127
{bunch of numbers here}
99999 2 55 126
{bunch of numbers here}
99999 3 55 144
{bunch of numbers here}
Basically, I am trying to print the "bunch of numbers" (along with the preceding line) when a specific sequence is met. The 99999 is always constant and I don't care about the number right after it, but I want to put a condition on the next two numbers.
#! /usr/bin/awk -f
BEGIN{}
{
if ( $3 == 55 && $4 = 100 )
{next
do{print $0}
while($1 != 99999}
}}
END{}
I'm quite new to awk and would really appreciate the help! Thanks
Your question is not clear to me...
I guess you want to print the block of lines starting at (and including) a 99999 x 55 100 line and ending before (excluding) the next 99999 ... line.
I used your example (by the way, you should provide a better example, with expected output), but I changed your criteria to $3==55 and $4==126 so that the block sits in the middle of your data.
awk '$1==99999{f=($3==55&&$4==126)?1:0}f' file
test:
kent$ cat f
99999 1 55 127
{bunch of numbers here}
1
2
99999 2 55 126
3
4
{bunch of numbers here}
99999 3 55 144
5
6
{bunch of numbers here}
kent$ awk '$1==99999{f=($3==55&&$4==126)?1:0}f' f
99999 2 55 126
3
4
{bunch of numbers here}
You can use a flag: set it when both the third and fourth fields match, and exit when the next line whose first field is 99999 is found while the flag is set:
awk '
$1 == 99999 && flag == 1 { exit }
$3 == 55 && $4 == 126 { flag = 1 }
flag == 1 { print }
' infile
It yields:
99999 2 55 126
{bunch of numbers here}
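If the two values need to vary, the flag approach can be wrapped so they are passed in rather than hard-coded. This is a sketch only: the function name `print_block` and its argument order are hypothetical, and, like the first one-liner above, it prints every matching block instead of stopping after the first.

```shell
print_block() {
    # $1 = file, $2 and $3 = required values of the third and fourth fields
    awk -v c3="$2" -v c4="$3" '
    $1 == 99999 { keep = ($3 == c3 && $4 == c4) }   # each marker line re-decides
    keep                                            # print while inside a wanted block
    ' "$1"
}
```

For example, `print_block infile 55 126` prints the `99999 2 55 126` line and everything up to, but not including, the next `99999` line.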

replace strings with lines from another text file by matching patterns

I have a file with a correspondence key -> value:
sort keyFile.txt | head
ENSMUSG00000000001 ENSMUSG00000000001_Gnai3
ENSMUSG00000000003 ENSMUSG00000000003_Pbsn
ENSMUSG00000000003 ENSMUSG00000000003_Pbsn
ENSMUSG00000000028 ENSMUSG00000000028_Cdc45
ENSMUSG00000000028 ENSMUSG00000000028_Cdc45
ENSMUSG00000000028 ENSMUSG00000000028_Cdc45
ENSMUSG00000000031 ENSMUSG00000000031_H19
ENSMUSG00000000031 ENSMUSG00000000031_H19
ENSMUSG00000000031 ENSMUSG00000000031_H19
ENSMUSG00000000031 ENSMUSG00000000031_H19
And I would like to replace every occurrence of "key" with the "value" in temp.txt:
head temp.txt
ENSMUSG00000000001:001 515
ENSMUSG00000000001:002 108
ENSMUSG00000000001:003 64
ENSMUSG00000000001:004 45
ENSMUSG00000000001:005 58
ENSMUSG00000000001:006 63
ENSMUSG00000000001:007 46
ENSMUSG00000000001:008 11
ENSMUSG00000000001:009 13
ENSMUSG00000000003:001 0
The result should be:
out.txt
ENSMUSG00000000001_Gnai3:001 515
ENSMUSG00000000001_Gnai3:002 108
ENSMUSG00000000001_Gnai3:003 64
ENSMUSG00000000001_Gnai3:004 45
ENSMUSG00000000001_Gnai3:005 58
ENSMUSG00000000001_Gnai3:006 63
ENSMUSG00000000001_Gnai3:007 46
ENSMUSG00000000001_Gnai3:008 11
ENSMUSG00000000001_Gnai3:009 13
ENSMUSG00000000003_Pbsn:001 0
I have tried a few variations following this AWK example but as you can see the result is not what I expected:
awk 'NR==FNR{a[$1]=$1;next}{$1=a[$1];}1' keyFile.txt temp.txt | head
515
108
64
45
58
63
46
11
13
0
My guess is that column 1 of temp does not match 'exactly' column 1 of keyValues. Could someone please help me with this?
R/python/sed solutions are also welcome.
Use an awk command like this:
awk 'NR==FNR {a[$1]=$2;next} {
split($1, b, ":");
if (b[1] in a)
print a[b[1]] ":" b[2], $2;
else
print $0;
}' keyFile.txt temp.txt
Code for GNU sed:
$ sed -nr '$!N;/^(.*)\n\1$/!bk;D;:k;s#\S+\s+(\w+)_(\w+)#/^\1/s/(\\w+)(:\\w+)\\s+(\\w+)/\\1_\2\\2 \\3/p#;P;s/^(.*)\n//' keyFile.txt|sed -nrf - temp.txt
ENSMUSG00000000001_Gnai3:001 515
ENSMUSG00000000001_Gnai3:002 108
ENSMUSG00000000001_Gnai3:003 64
ENSMUSG00000000001_Gnai3:004 45
ENSMUSG00000000001_Gnai3:005 58
ENSMUSG00000000001_Gnai3:006 63
ENSMUSG00000000001_Gnai3:007 46
ENSMUSG00000000001_Gnai3:008 11
ENSMUSG00000000001_Gnai3:009 13
ENSMUSG00000000003_Pbsn:001 0
Another awk option
awk -F: 'NR == FNR{split($0, a, " "); x[a[1]]=a[2]; next}{print x[$1]":"$2}' keyFile.txt temp.txt
Another awk version:
awk 'NR==FNR{a[$1]=$2;next}
{sub(/[^:]+/,a[substr($1,1,index($1,":")-1)])}1' keyFile.txt temp.txt