Capture two digit pairs from a text - regex

I want to capture all two-character pairs (the part after each K) from the following header file:
#define KEYMAP( \
K00, K01, K02, K03, K04, K05, K06, K07, K08, K09, K0A, K0B, K0C, K0D, \
K10, K11, K12, K13, K14, K15, K16, K17, K18, K19, K1A, K1B, K1C, K1D, \
K20, K21, K22, K23, K24, K25, K26, K27, K28, K29, K2A, K2B, K2C, K2D, \
K30, K31, K32, K33, K34, K35, K36, K37, K38, K39, K3A, K3B, K3C, K3D, \
K40, K41, K42, K45, K49, K4A, K4B, K4C, K4D \
)
So I want to get a list containing 00, 01, 02, ..., 4D. I tried to do this using the Select-String cmdlet:
gc 'y:\keyboard.h' | sls 'K'
But it doesn't give me the expected result.

Use a lookbehind assertion in the pattern and a proper hexadecimal capturing pattern (see regex101):
gc 'y:\keyboard.h' | select-string '(?<=K)([\da-f]{2})' -AllMatches | %{ $_.matches.value }
Select-String uses case-insensitive matching by default; use its -CaseSensitive switch if needed. The match can also be made stricter to reject possible false positives from other parts of the file, for example: '(?<=\bK)[\dA-F]{2}(?=[\s,]|$)' -CaseSensitive

I would use the static regex::Matches method:
$content = Get-Content 'y:\keyboard.h' -Raw
[regex]::Matches($content, '\bK(..),') | Foreach {
    $_.Groups[1].Value
}
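Output (the last entry, 4D, is missing because it is not followed by a comma; a pattern such as '\bK([\dA-F]{2})\b' would include it):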
00 01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 10 11 12 13 14 15 16 17 18
19 1A 1B 1C 1D 20 21 22 23 24 25 26 27 28 29 2A 2B 2C 2D 30 31 32 33
34 35 36 37 38 39 3A 3B 3C 3D 40 41 42 45 49 4A 4B 4C
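For comparison, here is a minimal Python sketch of the same extraction. It is not part of either answer; the function name key_codes and the shortened sample text are made up for illustration. It uses a word boundary instead of a trailing comma, so the last entry is kept as well.

```python
import re

def key_codes(header_text):
    """Return the two hex characters that follow each K in the macro."""
    # \b before K skips the K in KEYMAP only because "EY" is not a hex pair;
    # \b after the pair ensures exactly two characters follow the K.
    return re.findall(r"\bK([0-9A-F]{2})\b", header_text)

# Shortened stand-in for the header shown in the question.
sample = r"""#define KEYMAP( \
K00, K01, K0A, \
K4C, K4D \
)"""
print(key_codes(sample))
```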

Related

How to match using Regex till the invalid response (C#)

I need to write a regex that matches the following string till E 1 ERRORWARNING SET \n, (till the end of invalid response). M 1 CSD ... are valid response strings.
Scenario #1
"M 1 CSD 382 01 44 2B 54 36 7B 22 6A \n" +
"M 1 CSD 382 00 73 6F 6E 72 70 63 22 \n" +
"R OK \n" + // This could be any string not matching the pattern M 1 CSD ...
"E 1 ERRORWARNING SET \n" + // This could be any string not matching the pattern M 1 CSD ...
"M 1 CSD 382 00 3A 22 32 2E 30 22 2C \n" +
Scenario #2
"R OK \n" + // This could be any string not matching the pattern M 1 CSD ...
"E 1 ERRORWARNING SET \n" + // This could be any string not matching the pattern M 1 CSD ...
"M 1 CSD 382 00 3A 22 32 2E 30 22 2C \n" +
I know I can write something like (M 1 CSD (?:.{3}) (?:.{2}\s)+\n)* to match the M 1 CSD pattern, but I am not sure how to match the invalid response. The best I am able to do is
(M 1 CSD (?:.{3}) (?:.{2}\s)+\r\n)*([^M].*\r\n)*. But what happens if the invalid response starts with M?
Of course it is possible that there is no invalid response; then the regex needs to match till the end, i.e. till M 1 CSD 382 02 30 33 22 7D 7D \n
"M 1 CSD 382 01 44 2B 54 36 7B 22 6A \n"
"M 1 CSD 382 00 73 6F 6E 72 70 63 22 \n"
"M 1 CSD 382 00 3A 22 32 2E 30 22 2C \n"
"M 1 CSD 382 00 22 69 64 22 3A 30 2C \n"
"M 1 CSD 382 00 22 72 65 73 75 6C 74 \n"
"M 1 CSD 382 00 22 3A 7B 22 53 65 72 \n"
"M 1 CSD 382 00 69 61 6C 4E 75 6D 62 \n"
"M 1 CSD 382 00 65 72 22 3A 22 32 32 \n"
"M 1 CSD 382 00 32 30 31 31 34 32 35 \n"
"M 1 CSD 382 02 30 33 22 7D 7D \n"
You can repeatedly match all lines that do not contain ERRORWARNING SET, which also covers the case where the invalid response starts with M:
^(?![\w ]* ERRORWARNING SET \r?\n).+(?:\r?\n(?![\w ]* ERRORWARNING SET \r?\n).+)*
The pattern matches:
^ Start of string
(?![\w ]* ERRORWARNING SET \r?\n) Assert that the string does not start with ERRORWARNING SET preceded by optional word chars and spaces
.+ Match the whole line with at least a single char
(?: Non capture group
\r?\n Match a newline
(?![\w ]* ERRORWARNING SET \r?\n) Assert that the next line does not start with ERRORWARNING SET preceded by optional word chars and spaces
.+ Match the whole line with at least a single char
)* Close non capture group and optionally repeat
.NET regex demo
Or, a bit stricter, assert that the line does not start with a single character A-Z followed by 1 and then ERRORWARNING SET:
^(?![A-Z] 1 ERRORWARNING SET \r?\n).+(?:\r?\n(?![A-Z] 1 ERRORWARNING SET \r?\n).+)*
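To see how the first pattern behaves, here is a minimal sketch that runs it with Python's re module instead of .NET (the lookahead syntax is the same in both). The names VALID_PREFIX and valid_prefix are made up for this example. Note that the match stops just before the ERRORWARNING SET line, and that a line such as R OK is still part of the match.

```python
import re

# First pattern from the answer, split over two string literals for readability.
VALID_PREFIX = re.compile(
    r"^(?![\w ]* ERRORWARNING SET \r?\n).+"
    r"(?:\r?\n(?![\w ]* ERRORWARNING SET \r?\n).+)*"
)

def valid_prefix(response):
    """Return the leading lines up to (not including) the ERRORWARNING SET line."""
    m = VALID_PREFIX.match(response)
    return m.group(0) if m else ""

# Scenario #1 from the question.
sample = (
    "M 1 CSD 382 01 44 2B 54 36 7B 22 6A \n"
    "M 1 CSD 382 00 73 6F 6E 72 70 63 22 \n"
    "R OK \n"
    "E 1 ERRORWARNING SET \n"
    "M 1 CSD 382 00 3A 22 32 2E 30 22 2C \n"
)
print(valid_prefix(sample).splitlines())
```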

Regex / Python3 - re.findall() - Find all occurrences between opcodes

Background
I'm reverse engineering a TCP stream that uses a Type-Length-Value approach to encoding data.
Example:
TCP Payload: b'0000001f001270622e416374696f6e4e6f74696679425243080310840718880e20901c'
---------------------------------------------------------------------------------------
Type: 00 00 # New function call
Length: 00 1f # Length of Value (Length of Function + Function + Data)
Value: 00 12 # Length of Function
Value: 70 62 2e 41 63 74 69 6f 6e 4e 6f 74 69 66 79 42 52 43 # Function ->(hex2ascii)-> pb.ActionNotifyBRC
Value: 08 03 10 84 07 18 88 0e 20 90 1c # Data
However the Data is a data object that can include multiple variables with variable data lengths.
Data: 08 05 10 04 10 64 18 c8 01 20 ef 0f
----------------------------------------------
Opcode : Value
08 : 05 # var1 : 1 byte
10 : 04 # var2 : 1 byte
18 : c8 01 # var3 : 1-10 bytes
20 : ef 0f # var4 : 1-10 bytes
Currently I am parsing the Data using the following Python3 code:
############################### NOTES ###############################
# Opcodes sometimes rotate starting positions but the general order is always held:
# Data: 20 ef 0f 08 05 10 04 10 64 18 c8 01
#####################################################################
import re
import binascii
def dataVariable(data, start, end):
    p = re.compile(start + b'(.*?)' + end)
    return p.findall(data + data)

data = bytearray.fromhex('08051004106418c80120ef0f')
var3 = dataVariable(data, b'\x18', b'\x20')
print("Variable 3:", end=' ')
for item in set(var3):
    print(binascii.hexlify(item), end=' ')
----------------------------------------------------------------------------
[Output]: Variable 3: b'c801'
So far all good...
Problem
If an Opcode appears in the previous variable's Value, the code is no longer reliable.
Data: 08 05 10 04 10 64 18 c8 20 01 20 ef 0f
----------------------------------------------
Opcode : Value
08 : 05
10 : 04
18 : c8 20 01 # The Value includes the next opcode (20)
20 : ef 0f
----------------------------------------------------------------------------
[Output]: Variable 3: b'c8'
[Output]: Variable 4: b'0120ef0f'
I was expecting an output of:
[Output]: Variable 3: b'c8' b'c82001'
[Output]: Variable 4: b'0120ef0f' b'ef0f'
It seems like there is an issue with my regular expression?
Update
To further clarify, var3 and var4 are representing integers.
I have managed to figure out how the length of the Value was being encoded. The most significant bit was being used as a flag to inform me that another byte was coming. You can then strip the MSB of each byte, swap the endianness and convert to decimal.
data -> binary representation -> strip MSB and swap endianness -> decimal representation
ac d7 05 -> 10101100 11010111 00000101 -> 0001 01101011 10101100 -> 93100
e4 a6 04 -> 11100100 10100110 00000100 -> 0001 00010011 01100100 -> 70500
90 e1 02 -> 10010000 11100001 00000010 -> 10110000 10010000 -> 45200
dc 24 -> 11011100 00100100 -> 00010010 01011100 -> 4700
f0 60 -> 11110000 01100000 -> 00110000 01110000 -> 12400
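The decoding described in this update (MSB as a continuation flag, 7 payload bits per byte, least significant group first) resembles a base-128 varint. A minimal sketch of it, assuming the input holds exactly one encoded integer; decode_varint is a made-up name:

```python
def decode_varint(data):
    """Decode one little-endian base-128 integer from a bytes-like object."""
    value = 0
    for index, byte in enumerate(data):
        # Keep the low 7 bits and place them above the bits read so far.
        value |= (byte & 0x7F) << (7 * index)
        # A clear MSB marks the last byte of the integer.
        if not byte & 0x80:
            break
    return value

for encoded in ("acd705", "e4a604", "90e102", "dc24", "f060"):
    print(encoded, "->", decode_varint(bytes.fromhex(encoded)))
```

The printed values agree with the decimal column in the table above.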
You may use
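def dataVariable(data, start, end):
    # Lookahead with a capture group, so overlapping candidates are all found.
    # re.escape guards against opcodes that are regex metacharacters, and
    # re.DOTALL lets '.' also match a 0x0a byte inside the data.
    p = re.compile(b'(?=(' + re.escape(start) + b'.*' + re.escape(end) + b'))', re.DOTALL)
    res = []
    for x in p.findall(data):
        cur = b''
        # Walk the match byte by byte, skipping the leading start opcode.
        for i, m in enumerate([x[i:i+1] for i in range(len(x))]):
            if i == 0:
                continue
            if m == end and cur:
                res.append(cur)
            cur = cur + m
    return res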
See the Python demo:
data = bytearray.fromhex('08051004106418c8200120ef0f0f') # => b'c82001' b'c8'
#data = bytearray.fromhex('185618205720') # => b'56182057' b'2057' b'5618'
var3 = dataVariable(data, b'\x18', b'\x20')
print("Variable 3:", end=' ')
for item in set(var3):
    print(binascii.hexlify(item), end=' ')
Output is Variable 3: b'c8' b'c82001' for '08051004106418c8200120ef0f0f' string and b'56182057' b'2057' b'5618' for 185618205720 input.
The pattern is of (?=(...)) type to find all overlapping matches. If you do not need the overlapping feature, remove these parts from the regex.
The point here is:
- match all substrings starting with start and running up to the last end, using the start + b'.*' + end pattern;
- iterate through the match, dropping the first start byte, and add an item to the resulting list each time the end byte is found, accumulating the bytes seen so far at each iteration (thus getting all inner substrings inside the match).

removing the dots and colons in the time field

I have the following contents from a data.log file. I wish to extract the time value and part of the payload (the bytes after de ad be ef, starting at the second-to-last byte of the third payload row; please refer to the expected output).
data.log
print 1: file offset 0x0
ts=0x584819041ff529e0 2016-12-07 14:13:24.124834649 UTC
type: ERF Ethernet
dserror=0 rxerror=0 trunc=0 vlen=0 iface=1 rlen=96 lctr=0 wlen=68
pad=0x00 offset=0x00
dst=aa:bb:cc:dd:ee:ff src=ca:fe:ba:be:ca:fe
etype=0x0800
45 00 00 32 00 00 40 00 40 11 50 ff c0 a8 34 35 E..2..#.#.P...45
c0 a8 34 36 80 01 00 00 00 1e 00 00 08 08 08 08 ..46............
08 08 50 e6 61 c3 85 21 01 00 de ad be ef 85 d7 ..P.a..!........
91 21 6f 9a 32 94 fd 07 01 00 de ad be ef 85 d7 .!o.2...........
print 2: file offset 0x60
ts=0x584819041ff52b00 2016-12-07 14:13:24.124834716 UTC
type: ERF Ethernet
dserror=0 rxerror=0 trunc=0 vlen=0 iface=1 rlen=96 lctr=0 wlen=68
pad=0x00 offset=0x00
dst=aa:bb:cc:dd:ee:ff src=ca:fe:ba:be:ca:fe
etype=0x0800
45 00 00 32 00 00 40 00 40 11 50 ff c0 a8 34 35 E..2..#.#.P...45
c0 a8 34 36 80 01 00 00 00 1e 00 00 08 08 08 08 ..46............
08 08 68 e7 61 c3 85 21 01 00 de ad be ef 86 d7 ..h.a..!........
91 21 c5 34 77 bd fd 07 01 00 de ad be ef 86 d7 .!.4w...........
Expected output
I just want to replace the dots and colons in the time field (before UTC) and get the entire value.
141324124834649,85d79121
141324124834716,86d79121
What I have done so far
I have extracted the fields after "." but not sure how to replace the colons and get the entire time value.
awk -F '[= ]' '$NF == "UTC"{split($4,b,".");s=b[2]",";a=15} /de ad be ef/{s=s $a $(a+1);if(a==1)print s;a=1}' data.log
124834649,85d79121
124834716,86d79121
Any help is much appreciated.
awk '$NF == "UTC"{gsub("[.:]","",$3);s=$3",";a=15} /de ad be ef/{s=s $a $(a+1);if(a==1)print s;a=1}' data.log
Result:
141324124834649,85d79121
141324124834716,86d79121
PS: it can be simplified with getline :
awk '$NF == "UTC"{gsub("[.:]","",$3);s=$3","} /de ad be ef/{s=s $15 $16;getline;print(s $1 $2)}' data.log
You can extract the time part like this:
$ awk '/UTC/ {split($0,a); gsub(/[\.:]/,"",a[3]); print a[3]}' file
141324124834649
141324124834716
For the UTC part (the rest of the code is the same):
awk '/UTC$/{gsub(/[\.:]/,"");print $3}' YourFile
This just removes the ":" and "." from the line and takes the field value; the other parts of the line don't contain those two characters, so they are not modified.
The $NF test is replaced by /UTC$/, which is a bit faster and simpler (IMHO).
The full code (with -F '[= ]' the time is the fourth field):
awk -F '[= ]' '/UTC$/{gsub(/[\.:]/,"");s=$4",";a=15} /de ad be ef/{s=s $a $(a+1);if(a==1)print s;a=1}' YourFile
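For readers who find the awk one-liners hard to follow, the same logic can be sketched in Python. This is only an illustration under the assumptions visible in the sample: the time is the third whitespace-separated field of the line ending in UTC, and each record has exactly two rows containing de ad be ef. The name extract_records is made up.

```python
import re

def extract_records(lines):
    """Return 'time,payload' strings, one per record in the dump."""
    results = []
    timestamp = None   # time field with '.' and ':' removed
    first_row = None   # bytes 15 and 16 of the first 'de ad be ef' row
    for line in lines:
        fields = line.split()
        if fields and fields[-1] == "UTC":
            timestamp = re.sub(r"[.:]", "", fields[2])
            first_row = None
        elif "de ad be ef" in line and timestamp is not None:
            if first_row is None:
                # Third payload row: keep the last two hex bytes.
                first_row = fields[14] + fields[15]
            else:
                # Fourth payload row: append its first two hex bytes.
                results.append(f"{timestamp},{first_row}{fields[0]}{fields[1]}")
                timestamp = None
    return results

sample = [
    "ts=0x584819041ff529e0 2016-12-07 14:13:24.124834649 UTC",
    "type: ERF Ethernet",
    "08 08 50 e6 61 c3 85 21 01 00 de ad be ef 85 d7 ..P.a..!........",
    "91 21 6f 9a 32 94 fd 07 01 00 de ad be ef 85 d7 .!o.2...........",
]
print(extract_records(sample))
```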

Grep: find lines only matching unknown character once

I have a list with hexadecimal lines. For example:
0b 5a 3f 5a 7d d0 5d e6 2b c4 7e 7d c2 c0 e6 9a
84 bd aa 74 f3 85 da 9d ac b6 e0 b6 62 0f b5 d5
c0 b0 f5 60 02 8b 1c a4 41 7c 53 f2 85 20 a0 d1
...
I'm trying to find all the lines with grep, where there is a character that occurs only once in the line.
For example: 'd' occurs only once in the third line.
I tried this, but it's not working:
egrep '^.*([a-f0-9])[^\1]*$'
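This can be done with a regex, but it has to be verbose; it can't really be generalized.
Note that the (?:...) and (?|...) groups used below are PCRE syntax, so with grep they need -P rather than egrep.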
# ^(?:[^a]*a[^a]*|[^b]*b[^b]*|[^c]*c[^c]*|[^d]*d[^d]*|[^e]*e[^e]*|[^f]*f[^f]*|[^0]*0[^0]*|[^1]*1[^1]*|[^2]*2[^2]*|[^3]*3[^3]*|[^4]*4[^4]*|[^5]*5[^5]*|[^6]*6[^6]*|[^7]*7[^7]*|[^8]*8[^8]*|[^9]*9[^9]*)$
^
(?:
[^a]* a [^a]*
| [^b]* b [^b]*
| [^c]* c [^c]*
| [^d]* d [^d]*
| [^e]* e [^e]*
| [^f]* f [^f]*
| [^0]* 0 [^0]*
| [^1]* 1 [^1]*
| [^2]* 2 [^2]*
| [^3]* 3 [^3]*
| [^4]* 4 [^4]*
| [^5]* 5 [^5]*
| [^6]* 6 [^6]*
| [^7]* 7 [^7]*
| [^8]* 8 [^8]*
| [^9]* 9 [^9]*
)
$
To discover which character is the unique one, put capture groups around the letters and digits
and use a branch reset:
^
(?|
[^a]* (a) [^a]*
| [^b]* (b) [^b]*
| [^c]* (c) [^c]*
| [^d]* (d) [^d]*
| [^e]* (e) [^e]*
| [^f]* (f) [^f]*
| [^0]* (0) [^0]*
| [^1]* (1) [^1]*
| [^2]* (2) [^2]*
| [^3]* (3) [^3]*
| [^4]* (4) [^4]*
| [^5]* (5) [^5]*
| [^6]* (6) [^6]*
| [^7]* (7) [^7]*
| [^8]* (8) [^8]*
| [^9]* (9) [^9]*
)
$
This is the output:
** Grp 0 - ( pos 0 , len 50 )
0b 5a 3f 5a 7d d0 5d e6 2b c4 7e 7d c2 c0 e6 9a
** Grp 1 - ( pos 7 , len 1 )
f
-----------------------
** Grp 0 - ( pos 50 , len 51 )
84 bd aa 74 f3 85 da 9d ac b6 e0 b6 62 0f b5 d5
** Grp 1 - ( pos 77 , len 1 )
c
-----------------------
** Grp 0 - ( pos 101 , len 51 )
c0 b0 f5 60 02 8b 1c a4 41 7c 53 f2 85 20 a0 d1
** Grp 1 - ( pos 148 , len 1 )
d
I don't know a way to do it with a regex. However you can use this stupid awk script:
awk -F '' '{delete a;for(i=1;i<=NF;i++){a[$i]++};for(i in a){if(a[i]==1){print;next}}}' input
The script counts the number of occurrences of every character in the line (the counts are reset for each line). At the end of the line it checks all totals and prints the line if at least one of those totals equals 1.
Here is a piece of code that uses a number of shell tools beyond grep.
It reads the input line by line. Generates a frequency table. Upon finding an element with frequency 1 it outputs the unique character and the entire line.
while read -r line ; do
  export line ;
  echo "$line" | grep -o . | sort | uniq -c | \
  awk '/[ ]+1[ ]/ {print $2 ":" ENVIRON["line"] ; exit }' ;
done < input
Note that if you are interested in the letters a-f only, you could replace grep -o . with grep -o "[a-f]"
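The counting approach used by the awk and shell answers can also be sketched in Python with collections.Counter. This is not from the answers; the function names are made up, and spaces are ignored on the assumption that only the hex characters matter.

```python
from collections import Counter

def unique_chars(line):
    """Return the characters (spaces ignored) that occur exactly once in line."""
    counts = Counter(line.replace(" ", ""))
    return sorted(ch for ch, n in counts.items() if n == 1)

def lines_with_unique_char(lines):
    """Keep only the lines that contain at least one such character."""
    return [line for line in lines if unique_chars(line)]

third = "c0 b0 f5 60 02 8b 1c a4 41 7c 53 f2 85 20 a0 d1"
print(unique_chars(third))
```

Unlike the alternation-based regex, which reports only the first alternative that matches, this lists every character that occurs once.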

Clean Hex dump of ASCII code

I've got a simple PDF file in Hex dump format, lines looking like this:
00000000: 25 50 44 46 2d 31 2e 34 0d 0a 25 e2 e3 cf d3 0d %PDF-1.4..%.....
00000010: 0a 31 20 30 20 6f 62 6a 0d 0a 3c 3c 20 2f 46 69 .1 0 obj..<< /Fi
00000020: 6c 74 65 72 20 2f 46 6c 61 74 65 44 65 63 6f 64 lter /FlateDecod
00000030: 65 20 2f 4c 65 6e 67 74 68 20 31 38 34 30 36 20 e /Length 18406
00000040: 3e 3e 0d 0a 73 74 72 65 61 6d 0d 0a 2b 2c 9c 77 >>..stream..+,.w
How can I clean up the file so I'm left with the Hex values only?
I'm guessing regular expressions but I fail to apply it successfully in an editor.
Am running on a Windows machine.
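In Notepad++:
Ctrl+F -> Replace
- Tick Regular Expression in the bottom left
Find: ^(.{73}).*$
Replace With: \1
This keeps the first 73 characters of each line; adjust the number to the column where the ASCII part starts in your file.
Or Alt+Drag for column selection & copy/paste.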
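If a script is an option, the same cleanup can be sketched in Python. This assumes the layout shown in the question: an offset ending in a colon, then up to 16 space-separated hex bytes, then the ASCII column. The name hex_only is made up. On a final line with fewer than 16 bytes, ASCII text that happens to look like hex pairs could be picked up by mistake.

```python
import re

HEX_BYTE = re.compile(r"[0-9a-fA-F]{2}")

def hex_only(line, bytes_per_line=16):
    """Strip the offset and the ASCII column, keeping only the hex bytes."""
    # Everything before the first ':' is the offset.
    _, _, rest = line.partition(":")
    kept = []
    for token in rest.split()[:bytes_per_line]:
        # Stop at the first token that is not exactly two hex digits.
        if not HEX_BYTE.fullmatch(token):
            break
        kept.append(token)
    return " ".join(kept)

line = "00000010: 0a 31 20 30 20 6f 62 6a 0d 0a 3c 3c 20 2f 46 69 .1 0 obj..<< /Fi"
print(hex_only(line))
```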