Grep: find lines only matching unknown character once

I have a list with hexadecimal lines. For example:
0b 5a 3f 5a 7d d0 5d e6 2b c4 7e 7d c2 c0 e6 9a
84 bd aa 74 f3 85 da 9d ac b6 e0 b6 62 0f b5 d5
c0 b0 f5 60 02 8b 1c a4 41 7c 53 f2 85 20 a0 d1
...
I'm trying to find all the lines with grep, where there is a character that occurs only once in the line.
For example, 'd' occurs only once in the third line.
I tried this, but it's not working:
egrep '^.*([a-f0-9])[^\1]*$'

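This can be done with a regex, but it has to be verbose: a backreference cannot be used inside a character class, so [^\1] does not mean "anything except group 1", and each candidate character needs its own alternative.
It can't really be generalized. The patterns below use PCRE syntax, and the expanded form needs free-spacing mode.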
# ^(?:[^a]*a[^a]*|[^b]*b[^b]*|[^c]*c[^c]*|[^d]*d[^d]*|[^e]*e[^e]*|[^f]*f[^f]*|[^0]*0[^0]*|[^1]*1[^1]*|[^2]*2[^2]*|[^3]*3[^3]*|[^4]*4[^4]*|[^5]*5[^5]*|[^6]*6[^6]*|[^7]*7[^7]*|[^8]*8[^8]*|[^9]*9[^9]*)$
^
(?:
[^a]* a [^a]*
| [^b]* b [^b]*
| [^c]* c [^c]*
| [^d]* d [^d]*
| [^e]* e [^e]*
| [^f]* f [^f]*
| [^0]* 0 [^0]*
| [^1]* 1 [^1]*
| [^2]* 2 [^2]*
| [^3]* 3 [^3]*
| [^4]* 4 [^4]*
| [^5]* 5 [^5]*
| [^6]* 6 [^6]*
| [^7]* 7 [^7]*
| [^8]* 8 [^8]*
| [^9]* 9 [^9]*
)
$
For discovery, if you put capture groups around the letters and numbers,
and use a branch reset:
^
(?|
[^a]* (a) [^a]*
| [^b]* (b) [^b]*
| [^c]* (c) [^c]*
| [^d]* (d) [^d]*
| [^e]* (e) [^e]*
| [^f]* (f) [^f]*
| [^0]* (0) [^0]*
| [^1]* (1) [^1]*
| [^2]* (2) [^2]*
| [^3]* (3) [^3]*
| [^4]* (4) [^4]*
| [^5]* (5) [^5]*
| [^6]* (6) [^6]*
| [^7]* (7) [^7]*
| [^8]* (8) [^8]*
| [^9]* (9) [^9]*
)
$
This is the output:
** Grp 0 - ( pos 0 , len 50 )
0b 5a 3f 5a 7d d0 5d e6 2b c4 7e 7d c2 c0 e6 9a
** Grp 1 - ( pos 7 , len 1 )
f
-----------------------
** Grp 0 - ( pos 50 , len 51 )
84 bd aa 74 f3 85 da 9d ac b6 e0 b6 62 0f b5 d5
** Grp 1 - ( pos 77 , len 1 )
c
-----------------------
** Grp 0 - ( pos 101 , len 51 )
c0 b0 f5 60 02 8b 1c a4 41 7c 53 f2 85 20 a0 d1
** Grp 1 - ( pos 148 , len 1 )
d
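The alternation above is mechanical, so it can also be generated rather than typed out. The following is a minimal Python sketch of that idea, not part of the original answer; the names HEX_CHARS and build_once_pattern are made up for illustration, and it uses Python's re module instead of PCRE.

```python
import re

# Hypothetical helper names; the alphabet is the hex digits used in the question.
HEX_CHARS = "0123456789abcdef"

def build_once_pattern(chars=HEX_CHARS):
    """Build one '[^c]*c[^c]*' branch per character and join them with '|'."""
    branches = (f"[^{c}]*{c}[^{c}]*" for c in map(re.escape, chars))
    return re.compile("(?:" + "|".join(branches) + ")")

pattern = build_once_pattern()

lines = [
    "0b 5a 3f 5a 7d d0 5d e6 2b c4 7e 7d c2 c0 e6 9a",
    "84 bd aa 74 f3 85 da 9d ac b6 e0 b6 62 0f b5 d5",
    "c0 b0 f5 60 02 8b 1c a4 41 7c 53 f2 85 20 a0 d1",
]

for line in lines:
    # fullmatch plays the role of the ^ ... $ anchors in the pattern above
    print(bool(pattern.fullmatch(line)), line)
```

Generating the branches keeps the pattern in sync with the alphabet, but the resulting regex is just as long as the hand-written one.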

I don't know a way to do it with a regex. However you can use this stupid awk script:
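awk -F '' '{delete a; for(i=1;i<=NF;i++){a[$i]++};for(i in a){if(a[i]==1){print;next}}}' input
The script counts the number of occurrences of every character in the line, clearing the counts at the start of each line. At the end of the line it checks all totals and prints the line if at least one of those totals equals 1. Note that splitting into single characters with an empty field separator is an extension (it works in GNU awk) rather than POSIX-specified behaviour.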
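The same counting approach can be sketched in Python. This is only an illustration of the idea described above; the function name chars_seen_once is hypothetical, and unlike the awk one-liner it ignores whitespace, which is an assumption about what the asker wants.

```python
from collections import Counter

def chars_seen_once(line):
    """Return the set of non-whitespace characters that occur exactly once."""
    counts = Counter(ch for ch in line if not ch.isspace())
    return {ch for ch, n in counts.items() if n == 1}

sample = [
    "0b 5a 3f 5a 7d d0 5d e6 2b c4 7e 7d c2 c0 e6 9a",
    "84 bd aa 74 f3 85 da 9d ac b6 e0 b6 62 0f b5 d5",
    "c0 b0 f5 60 02 8b 1c a4 41 7c 53 f2 85 20 a0 d1",
]

for line in sample:
    once = chars_seen_once(line)
    if once:  # keep the line if at least one character is unique
        print("".join(sorted(once)), ":", line)
```

A fresh Counter is built for every line, so no state leaks from one line to the next.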

Here is a piece of code that uses a number of shell tools beyond grep.
It reads the input line by line. Generates a frequency table. Upon finding an element with frequency 1 it outputs the unique character and the entire line.
cat input | while read line ; do
export line ;
echo $line | grep -o . | sort | uniq -c | \
awk '/[ ]+1[ ]/ {print $2 ":" ENVIRON["line"] ; exit }' ;
done
Note that if you are interested in the letters a-f only, you could replace grep -o . with grep -o "[a-f]" (or with grep -o "[0-9]" for digits only).

AWK regex split function using multiple delimiters

I'm trying to use Awk's split function to split input into three fields in order to use the values as field[1], field[2], field[3]. The first field should be everything up to and including the colon, the second everything up to the first tab (\t) (the hex), and the last field everything else.
I've tried multiple regexes and the closest I've come to solving this is:
echo -e "ffffffff81000000: 48 8d 25 51 3f 60 01\tleaq asdf asdf asdf" \
| awk '{split($0,field,/([:])([ ])|([\t])/); \
print "length of field:" length(field);for (x in field) print field[x]}'
But the result doesn't include the colon, and I'm not sure whether the regex I've written is any good:
length of field:3
ffffffff81000000
48 8d 25 51 3f 60 01
leaq asdf asdf asdf
Thanks in advance.

Using gnu-awk's RS (for record separator) variable:
s=$'ffffffff81000000: 48 8d 25 51 3f 60 01\tleaq asdf asdf asdf'
awk -v RS='^\\S+|[^\t:]+' '{gsub(/^\s*|\s*$/, "", RT); print RT}' <<< "$s"
ffffffff81000000:
48 8d 25 51 3f 60 01
leaq asdf asdf asdf
Explanation:
RS='^\\S+|[^\t:]+': Sets RS as 1+ non-whitespace characters at the start OR 1+ of non-tab, non-colon characters
gsub(/^\s*|\s*$/, "", RT) removes whitespace at the start or end from the RT variable that gets populated because of RS
print RT prints the RT variable
If you want to print length of fields also then use:
awk -v RS='^\\S+|[^\t:]+' '{gsub(/^\s*|\s*$/, "", RT); print RT} END {print "length of field:", NR}' <<< "$s"
ffffffff81000000:
48 8d 25 51 3f 60 01
leaq asdf asdf asdf
length of field: 3
If you don't have gnu-awk then here is a POSIX awk solution for the same:
awk '{
while (match($0, /^[^[:blank:]]+|[^\t:]+/)) {
print substr($0, RSTART, RLENGTH)
$0 = substr($0, RSTART+RLENGTH)
}
}' <<< "$s"
ffffffff81000000:
48 8d 25 51 3f 60 01
leaq asdf asdf asdf

Using your awk code with some changes:
echo -e "ffffffff81000000: 48 8d 25 51 3f 60 01\tleaq asdf asdf asdf" | awk -v OFS='\n' '
{
sub(/: */,":\t")
split($0,field,/[\t]/)
print "length of field:" length(field), field[1], field[2],field[3]
}'
length of field:3
ffffffff81000000:
48 8d 25 51 3f 60 01
leaq asdf asdf asdf
As you can see:
added a tab with sub(),
so the separator for split() is only [\t],
and the OFS is \n.
And finally only a print.

Your regex can be simplified as:
split($0,field,/: |\t/)
but the result will be the same, without the colon character,
because the delimiter pattern is not included in the split result.
If you want to use a complex pattern such as a whitespace preceded by a colon
as a delimiter in the split function, you will need to use PCRE which is not
supported by awk.
Here is an example with python:
#!/usr/bin/python
import re
s = "ffffffff81000000: 48 8d 25 51 3f 60 01\tleaq asdf asdf asdf"
print(re.split(r'(?<=:) |\t', s))
Output:
['ffffffff81000000:', '48 8d 25 51 3f 60 01', 'leaq asdf asdf asdf']
You'll see the colon is included in the result.

You can use sub to replace the colon and space with :\t, and then gsub to replace each \t with \n. You will not find \n in a line of awk text unless your programming actions put it there; it is therefore a useful delimiter. You can now split on \n and your code will work as you imagine:
echo -e "ffffffff81000000: 48 8d 25 51 3f 60 01\tleaq asdf asdf asdf" \
| awk '{sub(/: /,":\t"); gsub(/\t/,"\n"); split($0,field,/\n/)
print "length of field:" length(field)
for (x=1; x<=length(field); x++) print field[x]}'
Prints:
length of field:3
ffffffff81000000:
48 8d 25 51 3f 60 01
leaq asdf asdf asdf

IMHO for a job like this you should use GNU awk for the 3rd arg to match() instead of split():
$ echo -e "ffffffff81000000: 48 8d 25 51 3f 60 01\tleaq asdf asdf asdf" |
awk '
match($0,/([^:]+:)\s*([^\t]+)\t(.*)/,field) {
print "length of field:" length(field);for (x in field) print x, field[x]
}
'
length of field:12
0start 1
0length 58
3start 40
1start 1
2start 19
3length 19
2length 20
1length 17
0 ffffffff81000000: 48 8d 25 51 3f 60 01 leaq asdf asdf asdf
1 ffffffff81000000:
2 48 8d 25 51 3f 60 01
3 leaq asdf asdf asdf
Note that the resultant array has a lot more information than just the 3 fields that get populated with the strings that match the regexp segments. Just ignore the extra fields if you don't need them:
$ echo -e "ffffffff81000000: 48 8d 25 51 3f 60 01\tleaq asdf asdf asdf" |
awk '
match($0,/([^:]+:)\s*([^\t]+)\t(.*)/,field) {
for (x=1; x<=3; x++) print x, field[x]
}
'
1 ffffffff81000000:
2 48 8d 25 51 3f 60 01
3 leaq asdf asdf asdf

How to match using Regex till the invalid response (C#)

I need to write a regex that matches the following string till E 1 ERRORWARNING SET \n, (till the end of invalid response). M 1 CSD ... are valid response strings.
Scenario #1
"M 1 CSD 382 01 44 2B 54 36 7B 22 6A \n" +
"M 1 CSD 382 00 73 6F 6E 72 70 63 22 \n" +
"R OK \n" + // This could be any string not matching the pattern M 1 CSD ...
"E 1 ERRORWARNING SET \n" + // This could be any string not matching the pattern M 1 CSD ...
"M 1 CSD 382 00 3A 22 32 2E 30 22 2C \n" +
Scenario #2
"R OK \n" + // This could be any string not matching the pattern M 1 CSD ...
"E 1 ERRORWARNING SET \n" + // This could be any string not matching the pattern M 1 CSD ...
"M 1 CSD 382 00 3A 22 32 2E 30 22 2C \n" +
I know I can write something like (M 1 CSD (?:.{3}) (?:.{2}\s)+\n)* to match the M 1 CSD pattern but not sure how to match the invalid response. The best I am able to do is
(M 1 CSD (?:.{3}) (?:.{2}\s)+\r\n)*([^M].*\r\n)*. But what happens if the invalid response starts with M?
Of course it is possible that there is no invalid response; then the regex needs to match till the end, i.e. till M 1 CSD 382 02 30 33 22 7D 7D \n
"M 1 CSD 382 01 44 2B 54 36 7B 22 6A \n"
"M 1 CSD 382 00 73 6F 6E 72 70 63 22 \n"
"M 1 CSD 382 00 3A 22 32 2E 30 22 2C \n"
"M 1 CSD 382 00 22 69 64 22 3A 30 2C \n"
"M 1 CSD 382 00 22 72 65 73 75 6C 74 \n"
"M 1 CSD 382 00 22 3A 7B 22 53 65 72 \n"
"M 1 CSD 382 00 69 61 6C 4E 75 6D 62 \n"
"M 1 CSD 382 00 65 72 22 3A 22 32 32 \n"
"M 1 CSD 382 00 32 30 31 31 34 32 35 \n"
"M 1 CSD 382 02 30 33 22 7D 7D \n"

You can repeatedly match every line that is not the ERRORWARNING SET line; this also works when the invalid response starts with M:
^(?![\w ]* ERRORWARNING SET \r?\n).+(?:\r?\n(?![\w ]* ERRORWARNING SET \r?\n).+)*
The pattern matches:
^ Start of string
(?![\w ]* ERRORWARNING SET \r?\n) Assert that the string does not start with ERRORWARNING SET preceded by optional word chars and spaces
.+ Match the whole line with at least a single char
(?: Non capture group
\r?\n Match a newline
(?![\w ]* ERRORWARNING SET \r?\n) Assert that the next line does not start with ERRORWARNING SET preceded by optional word chars and spaces
.+ Match the whole line with at least a single char
)* Close non capture group and optionally repeat
Or a bit more strict to test that the string does not start with a single char A-Z followed by 1 and then ERRORWARNING SET
^(?![A-Z] 1 ERRORWARNING SET \r?\n).+(?:\r?\n(?![A-Z] 1 ERRORWARNING SET \r?\n).+)*
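To see where the stricter pattern stops, here is a small Python sketch (the question is about C#, but the lookahead syntax used here behaves the same way in Python's re module). The variable names are made up for illustration, and the input is Scenario #1 from the question.

```python
import re

# The stricter pattern from above, split over two string literals for readability.
PATTERN = re.compile(
    r"^(?![A-Z] 1 ERRORWARNING SET \r?\n).+"
    r"(?:\r?\n(?![A-Z] 1 ERRORWARNING SET \r?\n).+)*"
)

response = (
    "M 1 CSD 382 01 44 2B 54 36 7B 22 6A \n"
    "M 1 CSD 382 00 73 6F 6E 72 70 63 22 \n"
    "R OK \n"
    "E 1 ERRORWARNING SET \n"
    "M 1 CSD 382 00 3A 22 32 2E 30 22 2C \n"
)

match = PATTERN.match(response)
matched_lines = match.group(0).splitlines() if match else []
for line in matched_lines:
    print(repr(line))
```

The match ends just before the ERRORWARNING SET line, so that line itself is not part of the result; if it should be included, it has to be matched separately after the repeated group.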

Regex / Python3 - re.findall() - Find all occurrences between opcodes

Background
I'm reverse engineering a TCP stream that uses a Type-Length-Value approach to encoding data.
Example:
TCP Payload: b'0000001f001270622e416374696f6e4e6f74696679425243080310840718880e20901c'
---------------------------------------------------------------------------------------
Type: 00 00 # New function call
Length: 00 1f # Length of Value (Length of Function + Function + Data)
Value: 00 12 # Length of Function
Value: 70 62 2e 41 63 74 69 6f 6e 4e 6f 74 69 66 79 42 52 43 # Function ->(hex2ascii)-> pb.ActionNotifyBRC
Value: 08 03 10 84 07 18 88 0e 20 90 1c # Data
However the Data is a data object that can include multiple variables with variable data lengths.
Data: 08 05 10 04 10 64 18 c8 01 20 ef 0f
----------------------------------------------
Opcode : Value
08 : 05 # var1 : 1 byte
10 : 04 # var2 : 1 byte
18 : c8 01 # var3 : 1-10 bytes
20 : ef 0f # var4 : 1-10 bytes
Currently I am parsing the Data using the following Python3 code:
############################### NOTES ###############################
# Opcodes sometimes rotate starting positions but the general order is always held:
# Data: 20 ef 0f 08 05 10 04 10 64 18 c8 01
#####################################################################
import re
import binascii
def dataVariable(data, start, end):
    p = re.compile(start + b'(.*?)' + end)
    return p.findall(data + data)

data = bytearray.fromhex('08051004106418c80120ef0f')
var3 = dataVariable(data, b'\x18', b'\x20')
print("Variable 3:", end=' ')
for item in set(var3):
    print(binascii.hexlify(item), end=' ')
----------------------------------------------------------------------------
[Output]: Variable 3: b'c801'
So far all good...
Problem
If an Opcode appears in the previous variables Value the code is no longer reliable.
Data: 08 05 10 04 10 64 18 c8 20 01 20 ef 0f
----------------------------------------------
Opcode : Value
08 : 05
10 : 04
18 : c8 20 01 # The Value includes the next opcode (20)
20 : ef 0f
----------------------------------------------------------------------------
[Output]: Variable 3: b'c8'
[Output]: Variable 4: b'0120ef0f'
I was expecting an output of:
[Output]: Variable 3: b'c8' b'c82001'
[Output]: Variable 4: b'0120ef0f' b'ef0f'
It seems like there is an issue with my regular expression?
Update
To further clarify, var3 and var4 are representing integers.
I have managed to figure out how the length of the Value was being encoded. The most significant bit was being used as a flag to inform me that another byte was coming. You can then strip the MSB of each byte, swap the endianness and convert to decimal.
data -> binary representation -> strip MSB and swap endianness -> decimal representation
ac d7 05 -> 10101100 11010111 00000101 -> 0001 01101011 10101100 -> 93100
e4 a6 04 -> 11100100 10100110 00000100 -> 0001 00010011 01100100 -> 70500
90 e1 02 -> 10010000 11100001 00000010 -> 10110000 10010000 -> 45200
dc 24 -> 11011100 00100100 -> 00010010 01011100 -> 4700
f0 60 -> 11110000 01100000 -> 00110000 01110000 -> 12400
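The decoding steps in the update can be sketched in a few lines of Python. This is an illustrative reading of the description above (most significant bit as continuation flag, 7 payload bits per byte, least significant group first); the function name decode_varint is hypothetical.

```python
def decode_varint(raw):
    """Decode an integer stored 7 bits per byte, least significant group first.

    The top bit of each byte only signals that another byte follows.
    """
    value = 0
    for index, byte in enumerate(raw):
        value |= (byte & 0x7F) << (7 * index)  # strip the MSB, place the 7 bits
        if not byte & 0x80:                    # MSB clear: this was the last byte
            break
    return value

for hex_str in ("acd705", "e4a604", "90e102", "dc24", "f060"):
    print(hex_str, "->", decode_varint(bytes.fromhex(hex_str)))
```

Because the continuation bit marks where a value ends, a parser that follows it does not need to guess the boundary from the next opcode.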

You may use
def dataVariable(data, start, end):
    p = re.compile(b'(?=(' + start + b'.*' + end + b'))')
    res = []
    for x in p.findall(data):
        cur = b''
        for i, m in enumerate([x[i:i+1] for i in range(len(x))]):
            if i == 0:
                continue
            if m == end and cur:
                res.append(cur)
            cur = cur + m
    return res
See the Python demo:
data = bytearray.fromhex('08051004106418c8200120ef0f0f') # => b'c82001' b'c8'
#data = bytearray.fromhex('185618205720') # => b'56182057' b'2057' b'5618'
var3 = dataVariable(data, b'\x18', b'\x20')
print("Variable 3:", end=' ')
for item in set(var3):
    print(binascii.hexlify(item), end=' ')
Output is Variable 3: b'c8' b'c82001' for '08051004106418c8200120ef0f0f' string and b'56182057' b'2057' b'5618' for 185618205720 input.
The pattern is of (?=(...)) type to find all overlapping matches. If you do not need the overlapping feature, remove these parts from the regex.
The point here is:
match all substrings starting with start and up to the last end with start + b'.*' + end pattern
iterate through the match dropping the first start byte and add an item to the resulting list when the end byte is found, adding up found bytes at each iteration (thus, getting all inner substrings inside the match).

Capture two digit pairs from a text

I want to capture all two-digit pairs from the following header file:
#define KEYMAP( \
K00, K01, K02, K03, K04, K05, K06, K07, K08, K09, K0A, K0B, K0C, K0D, \
K10, K11, K12, K13, K14, K15, K16, K17, K18, K19, K1A, K1B, K1C, K1D, \
K20, K21, K22, K23, K24, K25, K26, K27, K28, K29, K2A, K2B, K2C, K2D, \
K30, K31, K32, K33, K34, K35, K36, K37, K38, K39, K3A, K3B, K3C, K3D, \
K40, K41, K42, K45, K49, K4A, K4B, K4C, K4D \
)
So I want to get a list containing 00,01,02.....4D. I tried to do this using the Select-String cmdlet:
gc 'y:\keyboard.h' | sls 'K'
But this doesn't give me the expected result.

Use a lookbehind assertion in the pattern and a proper hexadecimal capturing pattern:
gc 'y:\keyboard.h' | select-string '(?<=K)([\da-f]{2})' -AllMatches | %{ $_.matches.value }
Select-String uses case-insensitive matching by default; use its -CaseSensitive switch if needed. It's possible to make matching stricter to reject possible false positives from other parts of the file: '(?<=\bK)([\da-fA-F]{2})(?:[\s,]|$)' -CaseSensitive

I would use the static regex::Matches method:
$content = Get-Content 'y:\keyboard.h' -Raw
[regex]::Matches($content, '\bK(..),') | Foreach {
$_.Groups[1].Value
}
Output:
00 01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 10 11 12 13 14 15 16 17 18
19 1A 1B 1C 1D 20 21 22 23 24 25 26 27 28 29 2A 2B 2C 2D 30 31 32 33
34 35 36 37 38 39 3A 3B 3C 3D 40 41 42 45 49 4A 4B 4C
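For comparison outside PowerShell, the same extraction can be sketched in Python. This is not from the original answers; the header text below is a shortened, made-up sample of the KEYMAP macro, and the pattern requires two hex digits rather than any two characters.

```python
import re

# Shortened sample of the header from the question (assumption: same layout).
header = """#define KEYMAP( \\
    K00, K01, K0A, K0D, \\
    K40, K4C, K4D \\
)
"""

# \bK keeps the K at the start of a word; the group captures exactly two hex digits.
pairs = re.findall(r"\bK([0-9A-Fa-f]{2})\b", header)
print(" ".join(pairs))
```

Because the pattern does not depend on a trailing comma, the last entry before the closing parenthesis is captured as well, and KEYMAP itself is skipped since EY is not a pair of hex digits.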

Expect Script - detecting two unique instances of a pattern in one returned buffer

I'm trying to do two matches on one block of returned data inside an expect script. This is the returned data from a command that shows what this system is connected to (I changed the descriptions to protect sensitive information). I thought I could use expect_out(buffer), but I can't figure out how to parse the returned data to detect two unique instances of the patterns. I can re-run the command if I detect one instance of a pattern, but that won't allow me to detect the case where I have two unique instances of a pattern in the returned data, as expect{} would re-find the first pattern. For example 'abcd' and 'abcd'.
Case one: I will have zero instances of 'abcd', 'efgh', 'ijkl', 'mnop', or 'qurs' in the returned block - in that case nothing will be written to a file and that's fine.
Case two: I will have only one instance of 'abcd', 'efgh', 'ijkl', 'mnop', or 'qurs' in the file; the current code detects that case and then writes the existence of one pattern to a file for later processing.
Case three: I have two instances of the patterns 'abcd', 'efgh', 'ijkl', 'mnop', or 'qurs', in any combination of the pairs. I could have 'abcd', 'abcd'; 'abcd', 'efgh'; or 'ijkl', 'mnop'. If case 3 happens I need to write a different message to the file.
Can anyone help?
My data:
A4 | 48 48 changedToProtectPrivacy
A15 | 48 48 changedToProtectPrivacy
A16 | 48 48 changedToProtectPrivacy
A17 | 48 48 changedToProtectPrivacy
A18 | 48 48 changedToProtectPrivacy
A19 | 48 48 changedToProtectPrivacy
A20 | 48 48 changedToProtectPrivacy
A21 | 48 48 changedToProtectPrivacy
A24 | abcd
A24 | abcd
B1 | 48 48 changedToProtectPrivacy
B2 | 48 48 changedToProtectPrivacy
B3 | 48 48 changedToProtectPrivacy
B4 | 48 48 changedToProtectPrivacy
B5 | 48 48 changedToProtectPrivacy
B6 | 48 48 changedToProtectPrivacy
B21 | 48 48 changedToProtectPrivacy
B24 | abcd
B24 | abcd
D2 | 00 ... 1 changedToProtectPrivacy
D10 | 00 ... 1 changedToProtectPrivacy
E6 | 00 ... 1 changedToProtectPrivacy
-=- Current code snippet -=-
expect { "prompt" } send { "superSecretCommand" ; sleep 2 }
expect {
"abcd" { set infofile "info.$server" ;
set ::infofile [open $infofile a] ;
puts $::infofile "Connection detected" ;
close $::infofile ;
}
"efgh" { set infofile "info.$server" ;
set ::infofile [open $infofile a] ;
puts $::infofile "Connection detected" ;
close $::infofile ;
}
}

I guess what you need is like this:
[STEP 101] $ cat infile
A20 | 48 48 changedToProtectPrivacy
A21 | 48 48 changedToProtectPrivacy
A24 | abcd
A24 | abcd
B1 | 48 48 changedToProtectPrivacy
B6 | 48 48 changedToProtectPrivacy
B7 | ijkl
B21 | 48 48 changedToProtectPrivacy
B24 | efgh
B24 | abcd
D2 | 00 ... 1 changedToProtectPrivacy
D3 | efgh
D3 | abcd
D10 | 00 ... 1 changedToProtectPrivacy
D11 | ijkl
E6 | 00 ... 1 changedToProtectPrivacy
E7 | ijkl
[STEP 102] $ cat foo.exp
#!/usr/bin/expect
log_user 0
spawn -noecho cat infile
set pat1 {[\r\n]+[[:blank:]]*[A-Z][0-9]+[[:blank:]]*\|[[:blank:]]*}
set pat2 {[a-z]{4,4}}
expect {
-re "${pat1}($pat2)${pat1}($pat2)|${pat1}($pat2)" {
if {[info exists expect_out(3,string)]} {
send_user ">>> $expect_out(3,string)\n"
} else {
send_user ">>> $expect_out(1,string) $expect_out(2,string)\n"
}
array unset expect_out
exp_continue
}
}
[STEP 103] $ expect foo.exp
>>> abcd abcd
>>> ijkl
>>> efgh abcd
>>> efgh abcd
>>> ijkl
>>> ijkl
[STEP 104] $
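The pair-or-single alternation used in foo.exp can be tried offline on a captured buffer. The following Python sketch is only a rough analogue for experimenting with the pattern logic, not a replacement for the expect script; the names LABEL, TOKEN and group_hits are hypothetical, and the buffer is a shortened sample of the data above.

```python
import re

# Same building blocks as pat1 and pat2 in foo.exp, written for Python's re.
LABEL = r"[\r\n]+[ \t]*[A-Z][0-9]+[ \t]*\|[ \t]*"
TOKEN = r"[a-z]{4}"
PAIR_OR_SINGLE = re.compile(f"{LABEL}({TOKEN}){LABEL}({TOKEN})|{LABEL}({TOKEN})")

def group_hits(buffer):
    """Return one tuple per hit: two names for a pair, one name for a single."""
    hits = []
    for m in PAIR_OR_SINGLE.finditer(buffer):
        if m.group(3) is not None:   # third group set: the single branch matched
            hits.append((m.group(3),))
        else:                        # otherwise groups 1 and 2 hold the pair
            hits.append((m.group(1), m.group(2)))
    return hits

buffer = (
    "\nA21 | 48 48 changedToProtectPrivacy"
    "\nA24 | abcd"
    "\nA24 | abcd"
    "\nB6 | 48 48 changedToProtectPrivacy"
    "\nB7 | ijkl"
    "\nB21 | 48 48 changedToProtectPrivacy\n"
)

for hit in group_hits(buffer):
    print(">>>", " ".join(hit))
```

The pair branch is listed first in the alternation, so two consecutive matching lines are reported together instead of as two singles.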
