AWK regex split function using multiple delimiters

I'm trying to use Awk's split function to split input into three fields so that I can use the values as field[1], field[2], field[3]. The first field should be everything up to and including the colon, the second everything up to the first tab (\t) (the hex bytes), and the last field everything else.
I've tried multiple regexes and the closest I've come to solving this is:
echo -e "ffffffff81000000: 48 8d 25 51 3f 60 01\tleaq asdf asdf asdf" \
| awk '{split($0,field,/([:])([ ])|([\t])/); \
print "length of field:" length(field);for (x in field) print field[x]}'
But the result doesn't include the colon, and I'm not sure the regex I've written is a good one:
length of field:3
ffffffff81000000
48 8d 25 51 3f 60 01
leaq asdf asdf asdf
Thanks in advance.

Using gnu-awk's RS (for record separator) variable:
s=$'ffffffff81000000: 48 8d 25 51 3f 60 01\tleaq asdf asdf asdf'
awk -v RS='^\\S+|[^\t:]+' '{gsub(/^\s*|\s*$/, "", RT); print RT}' <<< "$s"
ffffffff81000000:
48 8d 25 51 3f 60 01
leaq asdf asdf asdf
Explanation:
RS='^\\S+|[^\t:]+': Sets RS as 1+ non-whitespace characters at the start OR 1+ of non-tab, non-colon characters
gsub(/^\s*|\s*$/, "", RT): removes whitespace at the start or end of the RT variable, which is populated with the text matched by RS
print RT: prints the RT variable
If you also want to print the number of fields, use:
awk -v RS='^\\S+|[^\t:]+' '{gsub(/^\s*|\s*$/, "", RT); print RT} END {print "length of field:", NR}' <<< "$s"
ffffffff81000000:
48 8d 25 51 3f 60 01
leaq asdf asdf asdf
length of field: 3
If you don't have gnu-awk then here is a POSIX awk solution for the same:
awk '{
while (match($0, /^[^[:blank:]]+|[^\t:]+/)) {
print substr($0, RSTART, RLENGTH)
$0 = substr($0, RSTART+RLENGTH)
}
}' <<< "$s"
ffffffff81000000:
48 8d 25 51 3f 60 01
leaq asdf asdf asdf
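For comparison, the same tokenising idea can be sketched in Python with re.finditer. This is only an illustration that assumes the input looks like the sample line; the function name split_fields is made up for this example.

```python
import re

def split_fields(line):
    # Same alternation as the awk answers above: a run of non-whitespace at
    # the start of the line (this keeps the colon), or a run of characters
    # that are neither tab nor colon.
    pattern = re.compile(r'^\S+|[^\t:]+')
    # strip() plays the role of the gsub() that trims surrounding whitespace.
    return [m.group(0).strip() for m in pattern.finditer(line)]

s = "ffffffff81000000: 48 8d 25 51 3f 60 01\tleaq asdf asdf asdf"
for field in split_fields(s):
    print(field)
```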

Using your awk code with some changes:
echo -e "ffffffff81000000: 48 8d 25 51 3f 60 01\tleaq asdf asdf asdf" | awk -v OFS='\n' '
{
sub(/: */,":\t")
split($0,field,/[\t]/)
print "length of field:" length(field), field[1], field[2],field[3]
}'
length of field:3
ffffffff81000000:
48 8d 25 51 3f 60 01
leaq asdf asdf asdf
As you can see:
added a tab with sub(),
so the separator for split() is only [\t],
and the OFS is \n.
And finally only a print.

Your regex can be simplified as:
split($0,field,/: |\t/)
but the result will be the same, without the colon character,
because the text matched by the delimiter pattern is not included in the split result.
If you want the delimiter to be a space preceded by a colon, without consuming
the colon itself, you need a lookbehind assertion, a PCRE-style feature that is not
supported by awk.
Here is an example with python:
#!/usr/bin/python
import re
s = "ffffffff81000000: 48 8d 25 51 3f 60 01\tleaq asdf asdf asdf"
print(re.split(r'(?<=:) |\t', s))
Output:
['ffffffff81000000:', '48 8d 25 51 3f 60 01', 'leaq asdf asdf asdf']
You'll see the colon is included in the result.

You can use sub to replace the first ": " with ":\t" and then gsub to replace every \t with \n. You will not find \n in a line of awk text unless your programming actions put it there; it is therefore a useful delimiter. You can now split on \n and your code will work as you imagine:
echo -e "ffffffff81000000: 48 8d 25 51 3f 60 01\tleaq asdf asdf asdf" \
| awk '{sub(/: /,":\t"); gsub(/\t/,"\n"); split($0,field,/\n/)
print "length of field:" length(field)
for (x=1; x<=length(field); x++) print field[x]}'
Prints:
length of field:3
ffffffff81000000:
48 8d 25 51 3f 60 01
leaq asdf asdf asdf
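The "normalise the separators first, then split on one character" idea is not specific to awk. A minimal Python sketch, assuming the first ": " in the line is the one that ends the address field (split_on_tab is a hypothetical name):

```python
def split_on_tab(line):
    # Turn the first ": " into ":\t" so that a tab is the only separator
    # left, then split on it. The colon stays attached to the first field.
    return line.replace(": ", ":\t", 1).split("\t")

s = "ffffffff81000000: 48 8d 25 51 3f 60 01\tleaq asdf asdf asdf"
print(split_on_tab(s))
```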

IMHO for a job like this you should use GNU awk for the 3rd arg to match() instead of split():
$ echo -e "ffffffff81000000: 48 8d 25 51 3f 60 01\tleaq asdf asdf asdf" |
awk '
match($0,/([^:]+:)\s*([^\t]+)\t(.*)/,field) {
print "length of field:" length(field);for (x in field) print x, field[x]
}
'
length of field:12
0start 1
0length 58
3start 40
1start 1
2start 19
3length 19
2length 20
1length 17
0 ffffffff81000000: 48 8d 25 51 3f 60 01 leaq asdf asdf asdf
1 ffffffff81000000:
2 48 8d 25 51 3f 60 01
3 leaq asdf asdf asdf
Note that the resultant array has a lot more information than just the 3 fields that get populated with the strings that match the regexp segments. Just ignore the extra fields if you don't need them:
$ echo -e "ffffffff81000000: 48 8d 25 51 3f 60 01\tleaq asdf asdf asdf" |
awk '
match($0,/([^:]+:)\s*([^\t]+)\t(.*)/,field) {
for (x=1; x<=3; x++) print x, field[x]
}
'
1 ffffffff81000000:
2 48 8d 25 51 3f 60 01
3 leaq asdf asdf asdf
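The capture-group approach carries over to other regex engines almost unchanged. As a rough Python sketch (the names FIELDS and parse are invented for this example), the same expression fills three groups:

```python
import re

# Same regexp as in the gawk match() call above.
FIELDS = re.compile(r'([^:]+:)\s*([^\t]+)\t(.*)')

def parse(line):
    # Returns a 3-tuple of the captured fields, or None if the line does
    # not have the "address: bytes<TAB>text" shape.
    m = FIELDS.match(line)
    return m.groups() if m else None

s = "ffffffff81000000: 48 8d 25 51 3f 60 01\tleaq asdf asdf asdf"
print(parse(s))
```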

Related

How to match using Regex till the invalid response (C#)

I need to write a regex that matches the following string till E 1 ERRORWARNING SET \n, (till the end of invalid response). M 1 CSD ... are valid response strings.
Scenario #1
"M 1 CSD 382 01 44 2B 54 36 7B 22 6A \n" +
"M 1 CSD 382 00 73 6F 6E 72 70 63 22 \n" +
"R OK \n" + // This could be any string not matching the pattern M 1 CSD ...
"E 1 ERRORWARNING SET \n" + // This could be any string not matching the pattern M 1 CSD ...
"M 1 CSD 382 00 3A 22 32 2E 30 22 2C \n" +
Scenario #2
"R OK \n" + // This could be any string not matching the pattern M 1 CSD ...
"E 1 ERRORWARNING SET \n" + // This could be any string not matching the pattern M 1 CSD ...
"M 1 CSD 382 00 3A 22 32 2E 30 22 2C \n" +
I know I can write something like (M 1 CSD (?:.{3}) (?:.{2}\s)+\n)* to match the M 1 CSD pattern but not sure how to match the invalid response. The best I am able to do is
(M 1 CSD (?:.{3}) (?:.{2}\s)+\r\n)*([^M].*\r\n)*. But what happens if the invalid response starts with M?
Of course it is possible that there is no invalid response; then the regex needs to match till the end, i.e. till M 1 CSD 382 02 30 33 22 7D 7D \n
"M 1 CSD 382 01 44 2B 54 36 7B 22 6A \n"
"M 1 CSD 382 00 73 6F 6E 72 70 63 22 \n"
"M 1 CSD 382 00 3A 22 32 2E 30 22 2C \n"
"M 1 CSD 382 00 22 69 64 22 3A 30 2C \n"
"M 1 CSD 382 00 22 72 65 73 75 6C 74 \n"
"M 1 CSD 382 00 22 3A 7B 22 53 65 72 \n"
"M 1 CSD 382 00 69 61 6C 4E 75 6D 62 \n"
"M 1 CSD 382 00 65 72 22 3A 22 32 32 \n"
"M 1 CSD 382 00 32 30 31 31 34 32 35 \n"
"M 1 CSD 382 02 30 33 22 7D 7D \n"

You can repeatedly match every line that does not contain ERRORWARNING SET; this also works when the invalid response starts with M:
^(?![\w ]* ERRORWARNING SET \r?\n).+(?:\r?\n(?![\w ]* ERRORWARNING SET \r?\n).+)*
The pattern matches:
^ Start of string
(?![\w ]* ERRORWARNING SET \r?\n) Assert that the string does not start with ERRORWARNING SET preceded by optional word chars and spaces
.+ Match the whole line with at least a single char
(?: Non capture group
\r?\n Match a newline
(?![\w ]* ERRORWARNING SET \r?\n) Assert that the next line does not start with ERRORWARNING SET preceded by optional word chars and spaces
.+ Match the whole line with at least a single char
)* Close non capture group and optionally repeat
Or a bit more strict to test that the string does not start with a single char A-Z followed by 1 and then ERRORWARNING SET
^(?![A-Z] 1 ERRORWARNING SET \r?\n).+(?:\r?\n(?![A-Z] 1 ERRORWARNING SET \r?\n).+)*
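The first pattern uses only constructs that Python's re module also supports, so its behaviour can be checked with a small sketch. This is a Python port for illustration, not the C# code the question asks for; LEADING_LINES and lines_before_error are hypothetical names. Note that the match stops before the ERRORWARNING SET line and does not include it.

```python
import re

LEADING_LINES = re.compile(
    r'^(?![\w ]* ERRORWARNING SET \r?\n).+'
    r'(?:\r?\n(?![\w ]* ERRORWARNING SET \r?\n).+)*'
)

def lines_before_error(text):
    # Returns the lines that precede the first "... ERRORWARNING SET" line,
    # or every line if no such line exists.
    m = LEADING_LINES.match(text)
    return m.group(0).splitlines() if m else []

scenario1 = (
    "M 1 CSD 382 01 44 2B 54 36 7B 22 6A \n"
    "M 1 CSD 382 00 73 6F 6E 72 70 63 22 \n"
    "R OK \n"
    "E 1 ERRORWARNING SET \n"
    "M 1 CSD 382 00 3A 22 32 2E 30 22 2C \n"
)
print(lines_before_error(scenario1))
```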

removing the dots and colons in the time field

I have the following contents from data.log file. I wish to extract the time value and part of the payload (after deadbeef in the payload, third row, starting second to last byte. Please refer to expected output).
data.log
print 1: file offset 0x0
ts=0x584819041ff529e0 2016-12-07 14:13:24.124834649 UTC
type: ERF Ethernet
dserror=0 rxerror=0 trunc=0 vlen=0 iface=1 rlen=96 lctr=0 wlen=68
pad=0x00 offset=0x00
dst=aa:bb:cc:dd:ee:ff src=ca:fe:ba:be:ca:fe
etype=0x0800
45 00 00 32 00 00 40 00 40 11 50 ff c0 a8 34 35 E..2..#.#.P...45
c0 a8 34 36 80 01 00 00 00 1e 00 00 08 08 08 08 ..46............
08 08 50 e6 61 c3 85 21 01 00 de ad be ef 85 d7 ..P.a..!........
91 21 6f 9a 32 94 fd 07 01 00 de ad be ef 85 d7 .!o.2...........
print 2: file offset 0x60
ts=0x584819041ff52b00 2016-12-07 14:13:24.124834716 UTC
type: ERF Ethernet
dserror=0 rxerror=0 trunc=0 vlen=0 iface=1 rlen=96 lctr=0 wlen=68
pad=0x00 offset=0x00
dst=aa:bb:cc:dd:ee:ff src=ca:fe:ba:be:ca:fe
etype=0x0800
45 00 00 32 00 00 40 00 40 11 50 ff c0 a8 34 35 E..2..#.#.P...45
c0 a8 34 36 80 01 00 00 00 1e 00 00 08 08 08 08 ..46............
08 08 68 e7 61 c3 85 21 01 00 de ad be ef 86 d7 ..h.a..!........
91 21 c5 34 77 bd fd 07 01 00 de ad be ef 86 d7 .!.4w...........
Expected output
I just want to replace the dots and colons in the time field (before UTC) and get the entire value.
141324124834649,85d79121
141324124834716,86d79121
What I have done so far
I have extracted the fields after "." but not sure how to replace the colons and get the entire time value.
awk -F '[= ]' '$NF == "UTC"{split($4,b,".");s=b[2]",";a=15} /de ad be ef/{s=s $a $(a+1);if(a==1)print s;a=1}' data.log
124834649,85d79121
124834716,86d79121
Any help is much appreciated.

awk '$NF == "UTC"{gsub("[.:]","",$3);s=$3",";a=15} /de ad be ef/{s=s $a $(a+1);if(a==1)print s;a=1}' data.log
Result:
141324124834649,85d79121
141324124834716,86d79121
PS: it can be simplified with getline :
awk '$NF == "UTC"{gsub("[.:]","",$3);s=$3","} /de ad be ef/{s=s $15 $16;getline;print(s $1 $2)}' data.log

You can extract the time part like this:
$ awk '/UTC/ {split($0,a); gsub(/[\.:]/,"",a[3]); print a[3]}' file
141324124834649
141324124834716

For the UTC part (the rest of the code is the same):
awk '/UTC$/{gsub(/[\.:]/,"");print $3}' YourFile
This just removes every ":" and "." from the line and then takes the field value; the other parts of the line don't contain those two characters, so they are not modified.
The $NF test is replaced by /UTC$/, which is a bit faster and simpler (IMHO).
The full code (with -F '[= ]' the time is the fourth field):
awk -F '[= ]' '/UTC$/{gsub(/[\.:]/,"");s=$4",";a=15} /de ad be ef/{s=s $a $(a+1);if(a==1)print s;a=1}' YourFile
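The same two steps (strip the dots and colons from the time, then pick four payload bytes) can be sketched in Python. This is only an illustration that assumes the log layout shown in the question: the time is the third whitespace-separated field of the line ending in UTC, the wanted bytes are the 15th and 16th on the first "de ad be ef" line, and the first two bytes on the line after it. The name extract is hypothetical.

```python
import re

def extract(lines):
    results = []
    stamp = None
    it = iter(lines)
    for line in it:
        fields = line.split()
        if fields and fields[-1] == "UTC":
            # 14:13:24.124834649 -> 141324124834649
            stamp = re.sub(r"[.:]", "", fields[2])
        elif stamp is not None and "de ad be ef" in line:
            # Bytes 15 and 16 of this row plus bytes 1 and 2 of the next row.
            nxt = next(it, "").split()
            results.append(stamp + "," + "".join(fields[14:16] + nxt[:2]))
            stamp = None
    return results

sample = [
    "ts=0x584819041ff529e0 2016-12-07 14:13:24.124834649 UTC",
    "etype=0x0800",
    "08 08 50 e6 61 c3 85 21 01 00 de ad be ef 85 d7 ..P.a..!........",
    "91 21 6f 9a 32 94 fd 07 01 00 de ad be ef 85 d7 .!o.2...........",
]
print("\n".join(extract(sample)))
```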

Capture two digit pairs from a text

I want to capture all two digits from the following header file:
#define KEYMAP( \
K00, K01, K02, K03, K04, K05, K06, K07, K08, K09, K0A, K0B, K0C, K0D, \
K10, K11, K12, K13, K14, K15, K16, K17, K18, K19, K1A, K1B, K1C, K1D, \
K20, K21, K22, K23, K24, K25, K26, K27, K28, K29, K2A, K2B, K2C, K2D, \
K30, K31, K32, K33, K34, K35, K36, K37, K38, K39, K3A, K3B, K3C, K3D, \
K40, K41, K42, K45, K49, K4A, K4B, K4C, K4D \
)
So I want to get a list containing 00,01,02.....4D. I tried to do this using the Select-String cmdlet:
gc 'y:\keyboard.h' | sls 'K'
But it doesn't give me the expected result.

Use a lookbehind assertion in the pattern and a proper hexadecimal capturing pattern (see regex101):
gc 'y:\keyboard.h' | select-string '(?<=K)([\da-f]{2})' -AllMatches | %{ $_.matches.value }
Select-String uses case-insensitive matching by default; use its -CaseSensitive switch if needed. The match can be made stricter to reject possible false positives from other parts of the file: '(?<=\bK)([\da-fA-F]{2})(?=[\s,]|$)' -CaseSensitive

I would use the static regex::Matches method:
$content = Get-Content 'y:\keyboard.h' -Raw
[regex]::Matches($content, '\bK([0-9A-F]{2})\b') | Foreach {
$_.Groups[1].Value
}
Output:
00 01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 10 11 12 13 14 15 16 17 18
19 1A 1B 1C 1D 20 21 22 23 24 25 26 27 28 29 2A 2B 2C 2D 30 31 32 33
34 35 36 37 38 39 3A 3B 3C 3D 40 41 42 45 49 4A 4B 4C 4D
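As a cross-check outside PowerShell, here is a small Python sketch of the same capture idea. It assumes every key name is a K followed by exactly two uppercase hex digits; HEADER and key_codes are names made up for this example.

```python
import re

HEADER = r"""#define KEYMAP( \
K00, K01, K02, K03, K04, K05, K06, K07, K08, K09, K0A, K0B, K0C, K0D, \
K10, K11, K12, K13, K14, K15, K16, K17, K18, K19, K1A, K1B, K1C, K1D, \
K20, K21, K22, K23, K24, K25, K26, K27, K28, K29, K2A, K2B, K2C, K2D, \
K30, K31, K32, K33, K34, K35, K36, K37, K38, K39, K3A, K3B, K3C, K3D, \
K40, K41, K42, K45, K49, K4A, K4B, K4C, K4D \
)
"""

def key_codes(text):
    # \b on both sides keeps "KEYMAP" from matching and does not require a
    # trailing comma, so the last entry of the macro is captured as well.
    return re.findall(r'\bK([0-9A-F]{2})\b', text)

print(",".join(key_codes(HEADER)))
```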

Grep: find lines only matching unknown character once

I have a list with hexadecimal lines. For example:
0b 5a 3f 5a 7d d0 5d e6 2b c4 7e 7d c2 c0 e6 9a
84 bd aa 74 f3 85 da 9d ac b6 e0 b6 62 0f b5 d5
c0 b0 f5 60 02 8b 1c a4 41 7c 53 f2 85 20 a0 d1
...
I'm trying to use grep to find all the lines in which some character occurs only once.
For example, 'd' occurs only once in the third line.
I tried this, but it's not working:
egrep '^.*([a-f0-9])[^\1]*$'

This can be done with a regex, but it has to be verbose.
It kind of can't be generalized.
# ^(?:[^a]*a[^a]*|[^b]*b[^b]*|[^c]*c[^c]*|[^d]*d[^d]*|[^e]*e[^e]*|[^f]*f[^f]*|[^0]*0[^0]*|[^1]*1[^1]*|[^2]*2[^2]*|[^3]*3[^3]*|[^4]*4[^4]*|[^5]*5[^5]*|[^6]*6[^6]*|[^7]*7[^7]*|[^8]*8[^8]*|[^9]*9[^9]*)$
^
(?:
[^a]* a [^a]*
| [^b]* b [^b]*
| [^c]* c [^c]*
| [^d]* d [^d]*
| [^e]* e [^e]*
| [^f]* f [^f]*
| [^0]* 0 [^0]*
| [^1]* 1 [^1]*
| [^2]* 2 [^2]*
| [^3]* 3 [^3]*
| [^4]* 4 [^4]*
| [^5]* 5 [^5]*
| [^6]* 6 [^6]*
| [^7]* 7 [^7]*
| [^8]* 8 [^8]*
| [^9]* 9 [^9]*
)
$
For discovery, you can put capture groups around the letters and digits
and use a branch reset (a PCRE feature), so that the unique character is always captured in group 1:
^
(?|
[^a]* (a) [^a]*
| [^b]* (b) [^b]*
| [^c]* (c) [^c]*
| [^d]* (d) [^d]*
| [^e]* (e) [^e]*
| [^f]* (f) [^f]*
| [^0]* (0) [^0]*
| [^1]* (1) [^1]*
| [^2]* (2) [^2]*
| [^3]* (3) [^3]*
| [^4]* (4) [^4]*
| [^5]* (5) [^5]*
| [^6]* (6) [^6]*
| [^7]* (7) [^7]*
| [^8]* (8) [^8]*
| [^9]* (9) [^9]*
)
$
This is the output:
** Grp 0 - ( pos 0 , len 50 )
0b 5a 3f 5a 7d d0 5d e6 2b c4 7e 7d c2 c0 e6 9a
** Grp 1 - ( pos 7 , len 1 )
f
-----------------------
** Grp 0 - ( pos 50 , len 51 )
84 bd aa 74 f3 85 da 9d ac b6 e0 b6 62 0f b5 d5
** Grp 1 - ( pos 77 , len 1 )
c
-----------------------
** Grp 0 - ( pos 101 , len 51 )
c0 b0 f5 60 02 8b 1c a4 41 7c 53 f2 85 20 a0 d1
** Grp 1 - ( pos 148 , len 1 )
d

I don't know a way to do it with a regex. However you can use this stupid awk script:
awk -F '' '{delete a;for(i=1;i<=NF;i++){a[$i]++};for(i in a){if(a[i]==1){print;next}}}' input
The script counts the number of occurrences of every character in the line, clearing the counts at the start of each line. At the end of the line it checks all totals and prints the line if at least one of those totals equals 1.
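The per-line counting described above is easy to sketch in Python with collections.Counter. This is a minimal illustration that ignores whitespace; the function name singletons is made up.

```python
from collections import Counter

def singletons(line):
    # Count every non-whitespace character of one line and return, in
    # sorted order, those that occur exactly once.
    counts = Counter(ch for ch in line if not ch.isspace())
    return sorted(ch for ch, n in counts.items() if n == 1)

lines = [
    "c0 b0 f5 60 02 8b 1c a4 41 7c 53 f2 85 20 a0 d1",
    "aa bb aa bb",
]
for line in lines:
    unique = singletons(line)
    if unique:
        print("".join(unique) + ":" + line)
```

A line is kept when the returned list is non-empty, which mirrors the "at least one total equals 1" test of the awk script.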

Here is a piece of code that uses a number of shell tools beyond grep.
It reads the input line by line. Generates a frequency table. Upon finding an element with frequency 1 it outputs the unique character and the entire line.
cat input | while read line ; do
export line ;
echo $line | grep -o . | sort | uniq -c | \
awk '/[ ]+1[ ]/ {print $2 ":" ENVIRON["line"] ; exit }' ;
done
Note that if you are interested in hexadecimal digits only, you could replace grep -o . with grep -o "[0-9a-f]"

Expect Script - detecting two unique instances of a pattern in one returned buffer

I'm trying to do two matches on one block of returned data inside an expect script. This is the returned data from a command that shows what this system is connected to (I changed the descriptions to protect sensitive information). I thought I could use expect_out(buffer), but I can't figure out how to parse the returned data to detect two unique instances of the patterns. I can re-run the command if I detect one instance of a pattern, but that won't let me detect the case where there are two unique instances of a pattern in the returned data, because expect{} would find the first pattern again. For example, 'abcd' and 'abcd'.
Case one: I will have zero instances of 'abcd', 'efgh', 'ijkl', 'mnop', or 'qurs' in the returned block - in that case nothing will be written to a file and that's fine.
Case two: I will have only one instance of 'abcd', 'efgh', 'ijkl', 'mnop', or 'qurs' in the file; the current code detects that case and then writes the existence of one pattern to a file for later processing.
Case three: I have two instances of the patterns 'abcd', 'efgh', 'ijkl', 'mnop', or 'qurs', in any combination of the pairs. I could have 'abcd', 'abcd'; 'abcd', 'efgh'; or 'ijkl', 'mnop'. If case 3 happens I need to write a different message to the file.
Can anyone help?
My data:
A4 | 48 48 changedToProtectPrivacy
A15 | 48 48 changedToProtectPrivacy
A16 | 48 48 changedToProtectPrivacy
A17 | 48 48 changedToProtectPrivacy
A18 | 48 48 changedToProtectPrivacy
A19 | 48 48 changedToProtectPrivacy
A20 | 48 48 changedToProtectPrivacy
A21 | 48 48 changedToProtectPrivacy
A24 | abcd
A24 | abcd
B1 | 48 48 changedToProtectPrivacy
B2 | 48 48 changedToProtectPrivacy
B3 | 48 48 changedToProtectPrivacy
B4 | 48 48 changedToProtectPrivacy
B5 | 48 48 changedToProtectPrivacy
B6 | 48 48 changedToProtectPrivacy
B21 | 48 48 changedToProtectPrivacy
B24 | abcd
B24 | abcd
D2 | 00 ... 1 changedToProtectPrivacy
D10 | 00 ... 1 changedToProtectPrivacy
E6 | 00 ... 1 changedToProtectPrivacy
-=- Current code snippet -=-
expect { "prompt" } send { "superSecretCommand" ; sleep 2 }
expect {
"abcd" { set infofile "info.$server" ;
set ::infofile [open $infofile a] ;
puts $::infofile "Connection detected" ;
close $::infofile ;
}
"efgh" { set infofile "info.$server" ;
set ::infofile [open $infofile a] ;
puts $::infofile "Connection detected" ;
close $::infofile ;
}
}

I guess what you need is like this:
[STEP 101] $ cat infile
A20 | 48 48 changedToProtectPrivacy
A21 | 48 48 changedToProtectPrivacy
A24 | abcd
A24 | abcd
B1 | 48 48 changedToProtectPrivacy
B6 | 48 48 changedToProtectPrivacy
B7 | ijkl
B21 | 48 48 changedToProtectPrivacy
B24 | efgh
B24 | abcd
D2 | 00 ... 1 changedToProtectPrivacy
D3 | efgh
D3 | abcd
D10 | 00 ... 1 changedToProtectPrivacy
D11 | ijkl
E6 | 00 ... 1 changedToProtectPrivacy
E7 | ijkl
[STEP 102] $ cat foo.exp
#!/usr/bin/expect
log_user 0
spawn -noecho cat infile
set pat1 {[\r\n]+[[:blank:]]*[A-Z][0-9]+[[:blank:]]*\|[[:blank:]]*}
set pat2 {[a-z]{4,4}}
expect {
-re "${pat1}($pat2)${pat1}($pat2)|${pat1}($pat2)" {
if {[info exists expect_out(3,string)]} {
send_user ">>> $expect_out(3,string)\n"
} else {
send_user ">>> $expect_out(1,string) $expect_out(2,string)\n"
}
array unset expect_out
exp_continue
}
}
[STEP 103] $ expect foo.exp
>>> abcd abcd
>>> ijkl
>>> efgh abcd
>>> efgh abcd
>>> ijkl
>>> ijkl
[STEP 104] $
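To see how the "pair first, single as fallback" alternation behaves without running expect, the pattern can be ported to Python. This is a sketch for illustration only: it works on a plain string instead of a spawned process, replaces [[:blank:]] with [ \t] because Python's re has no POSIX classes, and the names PAIR_OR_SINGLE and find_hits are invented.

```python
import re

PAT1 = r'[\r\n]+[ \t]*[A-Z][0-9]+[ \t]*\|[ \t]*'
PAT2 = r'[a-z]{4}'
# Two consecutive pattern lines (groups 1 and 2), else one line (group 3).
PAIR_OR_SINGLE = re.compile(f'{PAT1}({PAT2}){PAT1}({PAT2})|{PAT1}({PAT2})')

def find_hits(buffer):
    hits = []
    for m in PAIR_OR_SINGLE.finditer(buffer):
        if m.group(3) is not None:
            hits.append((m.group(3),))                   # case two: one instance
        else:
            hits.append((m.group(1), m.group(2)))        # case three: a pair
    return hits

buffer = "\n".join([
    "A21 | 48 48 changedToProtectPrivacy",
    "A24 | abcd",
    "A24 | abcd",
    "B6 | 48 48 changedToProtectPrivacy",
    "B7 | ijkl",
    "B21 | 48 48 changedToProtectPrivacy",
])
for hit in find_hits(buffer):
    print(">>>", " ".join(hit))
```

As in the expect version, a pattern on the very first line of the buffer is not found, because PAT1 requires a preceding line break.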
