Regex / Python3 - re.findall() - Find all occurrences between opcodes

Background
I'm reverse engineering a TCP stream that uses a Type-Length-Value approach to encoding data.
Example:
TCP Payload: b'0000001f001270622e416374696f6e4e6f74696679425243080310840718880e20901c'
---------------------------------------------------------------------------------------
Type: 00 00 # New function call
Length: 00 1f # Length of Value (Length of Function + Function + Data)
Value: 00 12 # Length of Function
Value: 70 62 2e 41 63 74 69 6f 6e 4e 6f 74 69 66 79 42 52 43 # Function ->(hex2ascii)-> pb.ActionNotifyBRC
Value: 08 03 10 84 07 18 88 0e 20 90 1c # Data
However the Data is a data object that can include multiple variables with variable data lengths.
Data: 08 05 10 04 10 64 18 c8 01 20 ef 0f
----------------------------------------------
Opcode : Value
08 : 05 # var1 : 1 byte
10 : 04 # var2 : 1 byte
18 : c8 01 # var3 : 1-10 bytes
20 : ef 0f # var4 : 1-10 bytes
Currently I am parsing the Data using the following Python3 code:
############################### NOTES ###############################
# Opcodes sometimes rotate starting positions but the general order is always held:
# Data: 20 ef 0f 08 05 10 04 10 64 18 c8 01
#####################################################################
import re
import binascii

def dataVariable(data, start, end):
    p = re.compile(start + b'(.*?)' + end)
    # The data is doubled so that a value wrapping around the end
    # (rotated starting position, see NOTES) is still matched.
    return p.findall(data + data)

data = bytearray.fromhex('08051004106418c80120ef0f')
var3 = dataVariable(data, b'\x18', b'\x20')
print("Variable 3:", end=' ')
for item in set(var3):
    print(binascii.hexlify(item), end=' ')
----------------------------------------------------------------------------
[Output]: Variable 3: b'c801'
So far all good...
Problem
If an Opcode appears inside the previous variable's Value, the code is no longer reliable.
Data: 08 05 10 04 10 64 18 c8 20 01 20 ef 0f
----------------------------------------------
Opcode : Value
08 : 05
10 : 04
18 : c8 20 01 # The Value includes the next opcode (20)
20 : ef 0f
----------------------------------------------------------------------------
[Output]: Variable 3: b'c8'
[Output]: Variable 4: b'0120ef0f'
I was expecting an output of:
[Output]: Variable 3: b'c8' b'c82001'
[Output]: Variable 4: b'0120ef0f' b'ef0f'
It seems like there is an issue with my regular expression?
Update
To further clarify, var3 and var4 represent integers.
I have managed to figure out how the variable-length Values are encoded. The most significant bit (MSB) of each byte is a flag indicating that another byte follows. To decode a Value, strip the MSB from each byte, reverse the byte order, join the remaining 7-bit groups, and convert the result to decimal.
data -> binary representation -> strip MSB and swap endianness -> decimal representation
ac d7 05 -> 10101100 11010111 00000101 -> 0001 01101011 10101100 -> 93100
e4 a6 04 -> 11100100 10100110 00000100 -> 0001 00010011 01100100 -> 70500
90 e1 02 -> 10010000 11100001 00000010 -> 10110000 10010000 -> 45200
dc 24 -> 11011100 00100100 -> 00010010 01011100 -> 4700
f0 60 -> 11110000 01100000 -> 00110000 01110000 -> 12400
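The rule described above appears to be the scheme usually called a base-128 varint (Protocol Buffers uses it, for example). A minimal Python sketch of the decoding follows; the function name varint_to_int is made up for illustration, and the sample inputs are the five rows listed above.

```python
def varint_to_int(raw: bytes) -> int:
    """Decode one little-endian base-128 value (MSB set = more bytes follow)."""
    value = 0
    for index, byte in enumerate(raw):
        # Keep the low 7 bits and place them above the bits already read.
        value |= (byte & 0x7F) << (7 * index)
        if (byte & 0x80) == 0:
            # MSB clear: this was the last byte of the value.
            break
    return value


for sample in ("acd705", "e4a604", "90e102", "dc24", "f060"):
    print(sample, "->", varint_to_int(bytes.fromhex(sample)))
```

Shifting each 7-bit group by 7 * index has the same effect as "strip the MSB and swap the endianness": the first byte on the wire ends up in the lowest bits.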

You may use
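import re

def dataVariable(data, start, end):
    # A capturing group inside a lookahead finds overlapping matches.
    # re.DOTALL lets '.' match the 0x0a byte; re.escape guards against
    # opcodes that happen to be regex metacharacters.
    p = re.compile(b'(?=(' + re.escape(start) + b'.*' + re.escape(end) + b'))', re.DOTALL)
    res = []
    for x in p.findall(data):
        cur = b''
        # Skip the leading start byte, then walk the match one byte at a time.
        for i in range(1, len(x)):
            m = x[i:i+1]
            if m == end and cur:
                res.append(cur)
            cur = cur + m
    return res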
See the Python demo:
import binascii

data = bytearray.fromhex('08051004106418c8200120ef0f0f') # => b'c82001' b'c8'
#data = bytearray.fromhex('185618205720') # => b'56182057' b'2057' b'5618'
var3 = dataVariable(data, b'\x18', b'\x20')
print("Variable 3:", end=' ')
for item in set(var3):
    print(binascii.hexlify(item), end=' ')
The output is Variable 3: b'c8' b'c82001' for the '08051004106418c8200120ef0f0f' input, and b'56182057' b'2057' b'5618' for the '185618205720' input (the order may vary, since the results are printed from a set).
The pattern has the (?=(...)) form, a capturing group inside a lookahead, so that all overlapping matches are found. If you do not need overlapping matches, remove the (?=( and )) wrapper from the regex.
The point here is:
match all substrings starting with start and up to the last end with start + b'.*' + end pattern
iterate through the match dropping the first start byte and add an item to the resulting list when the end byte is found, adding up found bytes at each iteration (thus, getting all inner substrings inside the match).
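If the Values really are the MSB-flagged integers described in the question's update, the boundaries can also be found without a regex: read one opcode byte, then consume Value bytes until one has its MSB clear. The following is only a sketch under that assumption (one-byte opcodes, every Value encoded that way); read_varint and parse_fields are hypothetical helper names.

```python
def read_varint(buf: bytes, pos: int):
    """Read one MSB-flagged integer starting at pos; return (value, next_pos)."""
    value = 0
    shift = 0
    while True:
        if pos >= len(buf):
            raise ValueError("data ended in the middle of a value")
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if (byte & 0x80) == 0:  # MSB clear: last byte of this value
            return value, pos
        shift += 7


def parse_fields(data: bytes):
    """Walk the data as a sequence of (opcode byte, integer value) pairs."""
    fields = []
    pos = 0
    while pos < len(data):
        opcode = data[pos]
        value, pos = read_varint(data, pos + 1)
        fields.append((opcode, value))
    return fields


# Original example and the rotated variant from the NOTES comment.
print(parse_fields(bytes.fromhex("08051004106418c80120ef0f")))
print(parse_fields(bytes.fromhex("20ef0f08051004106418c801")))
```

Because each Value announces its own end, an opcode byte such as 0x20 appearing inside a Value no longer causes ambiguity, and a rotated starting position needs no special handling.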

Related

regex in python to remove 2 patterns

I want to write a regex that removes "2019" and the leading 0 from the left side of the string, and the last six zeroes from the right side of the string.
original value Dtype : class 'str'
original value: 2019 01 10 00 00 00
expected output is : 1 10
Using str.split with list slicing.
Ex:
s = "2019 01 10 00 00 00"
print(" ".join(s.split()[1:3]).lstrip("0"))
Using re.match
Ex:
import re
s = "2019 01 10 00 00 00"
m = re.match(r"\d{4}\b\s(?P<value>\d{2}\b\s\d{2}\b)", s)
if m:
    print(m.group("value").lstrip("0"))
Output:
1 10

c++ replacing order of items in byte array

I'm writing a code for Arduino C++.
I have a byte array with hex byte values, for example:
20 32 36 20 E0 EC 20 F9 F0 E9 E9 E3 F8 5C 70 5C 70 5C 73 20 E3 E2 EC 20 F8 E0 E5 E1 EF 20 39 31 5C
There are four ASCII digits in these bytes:
HEX 0x32 is number 2 in ascii code
HEX 0x35 is number 5 in ascii code
HEX 0x39 is number 9 in ascii code
and so on....
https://www.ascii-codes.com/cp862.html
So the hex values 32, 36 represent the number 26, and 39, 31 represent 91.
I want to find these numbers and reverse each group, so that (in this example) 62 and 19 are represented instead of 26 and 91.
The output would thus have to look like this:
20 36 32 20 E0 EC 20 F9 F0 E9 E9 E3 F8 5C 70 5C 70 5C 73 20 E3 E2 EC 20 F8 E0 E5 E1 EF 20 31 39 5C
The numbers don't have to be two digits but could be anything in 0-1000
I also know that each group of such numbers is preceded by the hex value 20, if that helps.
I have done this in C# (with some help of Stack overflow users :-) ):
string result = Regex.Replace(HexMessage1,
    @"(?<=20\-)3[0-9](\-3[0-9])*(?=\-20)",
    match => string.Join("-", Transform(match.Value.Split('-'))));

private static IEnumerable<string> Transform(string[] items)
{
    // Either terse Linq:
    // return items.Reverse();
    // Or good old for loop:
    string[] result = new string[items.Length];
    for (int i = 0; i < items.Length; ++i)
        result[i] = items[items.Length - i - 1];
    return result;
}
Can someone help me make it work on C++?
Loop over the array, element by element, looking for 0x32 or 0x39. If found, check the next byte (if within bounds) to see if it matches 0x36 or 0x31 (respectively). If it does then swap the current and the next byte. Continue the loop, skipping over the current and the next byte.
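The same idea extends to digit groups of any length: find each run of ASCII digit bytes (0x30-0x39) that directly follows a 0x20 byte and reverse it in place. The sketch below shows that logic in Python rather than Arduino C++, purely as an illustration of the steps; reverse_digit_runs is a hypothetical name, and the rule "a run counts only when preceded by 0x20" is taken from the question.

```python
import re

# A run of ASCII digits (0x30-0x39) immediately preceded by a 0x20 byte.
DIGIT_RUN = re.compile(rb"(?<=\x20)[0-9]+")


def reverse_digit_runs(data: bytes) -> bytes:
    """Return a copy of data with every matching digit run reversed."""
    return DIGIT_RUN.sub(lambda m: m.group()[::-1], data)


message = bytes.fromhex(
    "20 32 36 20 E0 EC 20 F9 F0 E9 E9 E3 F8 5C 70 5C 70 5C 73 20 "
    "E3 E2 EC 20 F8 E0 E5 E1 EF 20 39 31 5C"
)
print(" ".join(f"{b:02X}" for b in reverse_digit_runs(message)))
```

In C++ the equivalent would be a single pass that remembers where a digit run started and swaps the bytes from both ends once the run stops.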

removing the dots and colons in the time field

I have the following contents from data.log file. I wish to extract the time value and part of the payload (after deadbeef in the payload, third row, starting second to last byte. Please refer to expected output).
data.log
print 1: file offset 0x0
ts=0x584819041ff529e0 2016-12-07 14:13:24.124834649 UTC
type: ERF Ethernet
dserror=0 rxerror=0 trunc=0 vlen=0 iface=1 rlen=96 lctr=0 wlen=68
pad=0x00 offset=0x00
dst=aa:bb:cc:dd:ee:ff src=ca:fe:ba:be:ca:fe
etype=0x0800
45 00 00 32 00 00 40 00 40 11 50 ff c0 a8 34 35 E..2..@.@.P...45
c0 a8 34 36 80 01 00 00 00 1e 00 00 08 08 08 08 ..46............
08 08 50 e6 61 c3 85 21 01 00 de ad be ef 85 d7 ..P.a..!........
91 21 6f 9a 32 94 fd 07 01 00 de ad be ef 85 d7 .!o.2...........
print 2: file offset 0x60
ts=0x584819041ff52b00 2016-12-07 14:13:24.124834716 UTC
type: ERF Ethernet
dserror=0 rxerror=0 trunc=0 vlen=0 iface=1 rlen=96 lctr=0 wlen=68
pad=0x00 offset=0x00
dst=aa:bb:cc:dd:ee:ff src=ca:fe:ba:be:ca:fe
etype=0x0800
45 00 00 32 00 00 40 00 40 11 50 ff c0 a8 34 35 E..2..@.@.P...45
c0 a8 34 36 80 01 00 00 00 1e 00 00 08 08 08 08 ..46............
08 08 68 e7 61 c3 85 21 01 00 de ad be ef 86 d7 ..h.a..!........
91 21 c5 34 77 bd fd 07 01 00 de ad be ef 86 d7 .!.4w...........
Expected output
I just want to replace the dots and colons in the time field (before UTC) and get the entire value.
141324124834649,85d79121
141324124834716,86d79121
What I have done so far
I have extracted the fields after "." but not sure how to replace the colons and get the entire time value.
awk -F '[= ]' '$NF == "UTC"{split($4,b,".");s=b[2]",";a=15} /de ad be ef/{s=s $a $(a+1);if(a==1)print s;a=1}' data.log
124834649,85d79121
124834716,86d79121
Any help is much appreciated.
awk '$NF == "UTC"{gsub("[.:]","",$3);s=$3",";a=15} /de ad be ef/{s=s $a $(a+1);if(a==1)print s;a=1}' data.log
Result:
141324124834649,85d79121
141324124834716,86d79121
PS: it can be simplified with getline :
awk '$NF == "UTC"{gsub("[.:]","",$3);s=$3","} /de ad be ef/{s=s $15 $16;getline;print(s $1 $2)}' data.log
You can extract the time part like this:
$ awk '/UTC/ {split($0,a); gsub(/[\.:]/,"",a[3]); print a[3]}' file
141324124834649
141324124834716
For the UTC part (the rest of the code is the same):
awk '/UTC$/{gsub(/[\.:]/,"");print $3}' YourFile
Just remove the ":" and "." characters from the line and take the field value; the other parts of the line don't contain those two characters, so they are not modified.
The $NF test is replaced by /UTC$/, which is a bit faster and simpler (IMHO).
The full code (with -F '[= ]' the time is the fourth field, as in the question):
awk -F '[= ]' '/UTC$/{gsub(/[\.:]/,"");s=$4",";a=15} /de ad be ef/{s=s $a $(a+1);if(a==1)print s;a=1}' YourFile
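For comparison, the same extraction can be sketched in Python. This is an illustrative equivalent rather than part of the answers above; extract_records is a hypothetical name, and it assumes the layout shown in data.log (the time is the third whitespace-separated field of the UTC line, and the wanted payload bytes are the last two hex bytes of the first "de ad be ef" row plus the first two of the row after it).

```python
import re


def extract_records(lines):
    """Return 'time,payload' strings, one per packet in the log."""
    records = []
    stamp = None
    rows = iter(lines)
    for line in rows:
        fields = line.split()
        if fields and fields[-1] == "UTC":
            # "14:13:24.124834649" -> "141324124834649"
            stamp = re.sub(r"[.:]", "", fields[2])
        elif stamp is not None and "de ad be ef" in line:
            # Bytes 15 and 16 of this row, then bytes 1 and 2 of the next row.
            following = next(rows, "").split()
            if len(fields) >= 16 and len(following) >= 2:
                records.append(stamp + "," + fields[14] + fields[15]
                               + following[0] + following[1])
            stamp = None
    return records


sample = """\
ts=0x584819041ff529e0 2016-12-07 14:13:24.124834649 UTC
08 08 50 e6 61 c3 85 21 01 00 de ad be ef 85 d7 ..P.a..!........
91 21 6f 9a 32 94 fd 07 01 00 de ad be ef 85 d7 .!o.2...........
""".splitlines()
print(extract_records(sample))
```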

How to parse Hexcode of Network traffic

First of all, thanks to the community of Stack Overflow. I've found lots of competent answers here, so I'll try to ask my own question.
I have a WinPcap sample program which "dumps" my network traffic as Hexcode.
I made a simple server/client pair which periodically sends some text (22 bytes).
Problem:
Parsing this Hex-dump in C++ and generate an output like Wireshark or phd does (just in the console).
After reading some TCP/IP references (1), I'm not able to determine all bytes "values"
So for example there are 76 bytes, reading backwards:
the first 22 bytes are my Data;
then, there are 20 for the TCP header;
20 for the IP Header and then some other Bytes I don't know what they stand for. I'm not very knowledgeable about the IP protocol.
Here is an example in hex:
08 00 27 b3 23 63 f4 6d 04 2e 68 24 08 00 45 00
00 3e 31 c4 40 00 80 06 45 9e c0 a8 01 03 c0 a8
01 04 0b 27 04 d2 b0 f7 47 61 28 6c fd a7 50 18
fa f0 8e a0 00 00 48 61 6c 6c 6f 20 64 61 73 20
69 73 74 20 65 69 6e 20 54 65 73 74
Question:
Can someone tell me what these first bytes are for, and
where to get a (simple) description of how the network traffic is composed?
(1) TCP Reference, IP Reference
I used Packet Dump Decode (pdd) to decode this into the following data. pdd translates the hex dump into something you can use in Wireshark; I then worked backwards from the Wireshark info to break the packet down.
Ethernet Header
---------------
08 00 27 b3 23 63 f4 6d 04 2e 68 24 08 00
IP Header
---------
45 00 00 3e 31 c4 40 00 80 06 45 9e
c0 a8 01 03 c0 a8 01 04
TCP Header
----------
0b 27 04 d2 b0 f7 47 61 28 6c fd a7 50 18
fa f0 8e a0 00 00
TCP Data Payload
----------------
48 61 6c 6c 6f 20 64 61 73 20
69 73 74 20 65 69 6e 20 54 65 73 74
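The breakdown above can also be reproduced programmatically. The sketch below slices the same 76 bytes using the standard header layouts: a 14-byte Ethernet II header, an IPv4 header whose length comes from the IHL field, and a TCP header whose length comes from the data-offset field. It is written in Python for brevity although the question asks about C++; parse_frame and mac are hypothetical names, and the code assumes an untagged Ethernet II frame carrying IPv4 and TCP.

```python
import struct

HEX_DUMP = """
08 00 27 b3 23 63 f4 6d 04 2e 68 24 08 00 45 00
00 3e 31 c4 40 00 80 06 45 9e c0 a8 01 03 c0 a8
01 04 0b 27 04 d2 b0 f7 47 61 28 6c fd a7 50 18
fa f0 8e a0 00 00 48 61 6c 6c 6f 20 64 61 73 20
69 73 74 20 65 69 6e 20 54 65 73 74
"""
FRAME = bytes.fromhex("".join(HEX_DUMP.split()))


def mac(raw: bytes) -> str:
    return ":".join(f"{b:02x}" for b in raw)


def parse_frame(frame: bytes) -> dict:
    # Ethernet II: destination MAC, source MAC, EtherType (network byte order).
    dst, src, ethertype = struct.unpack("!6s6sH", frame[:14])
    ip = frame[14:]
    ip_header_len = (ip[0] & 0x0F) * 4           # IHL counts 32-bit words
    total_len = struct.unpack("!H", ip[2:4])[0]  # IP header + everything after it
    tcp = ip[ip_header_len:total_len]
    src_port, dst_port = struct.unpack("!HH", tcp[:4])
    tcp_header_len = (tcp[12] >> 4) * 4          # data offset counts 32-bit words
    return {
        "dst_mac": mac(dst),
        "src_mac": mac(src),
        "ethertype": ethertype,
        "protocol": ip[9],
        "src_ip": ".".join(str(b) for b in ip[12:16]),
        "dst_ip": ".".join(str(b) for b in ip[16:20]),
        "src_port": src_port,
        "dst_port": dst_port,
        "payload": tcp[tcp_header_len:],
    }


for key, value in parse_frame(FRAME).items():
    print(key, "=", value)
```

Reading the header lengths from the packet instead of hard-coding 20 bytes keeps the slicing correct when IP or TCP options are present.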
And a full wireshark decode...
Frame 1: 76 bytes on wire (608 bits), 76 bytes captured (608 bits)
WTAP_ENCAP: 1
Arrival Time: Nov 24, 2012 07:12:54.000000000 Central Standard Time
[Time shift for this packet: 0.000000000 seconds]
Epoch Time: 1353762774.000000000 seconds
[Time delta from previous captured frame: 0.000000000 seconds]
[Time delta from previous displayed frame: 0.000000000 seconds]
[Time since reference or first frame: 0.000000000 seconds]
Frame Number: 1
Frame Length: 76 bytes (608 bits)
Capture Length: 76 bytes (608 bits)
[Frame is marked: False]
[Frame is ignored: False]
[Protocols in frame: eth:ip:tcp:data]
Ethernet II, Src: AsustekC_2e:68:24 (f4:6d:04:2e:68:24), Dst: CadmusCo_b3:23:63 (08:00:27:b3:23:63)
Destination: CadmusCo_b3:23:63 (08:00:27:b3:23:63)
Address: CadmusCo_b3:23:63 (08:00:27:b3:23:63)
.... ..0. .... .... .... .... = LG bit: Globally unique address (factory default)
.... ...0 .... .... .... .... = IG bit: Individual address (unicast)
Source: AsustekC_2e:68:24 (f4:6d:04:2e:68:24)
Address: AsustekC_2e:68:24 (f4:6d:04:2e:68:24)
.... ..0. .... .... .... .... = LG bit: Globally unique address (factory default)
.... ...0 .... .... .... .... = IG bit: Individual address (unicast)
Type: IP (0x0800)
Internet Protocol Version 4, Src: 192.168.1.3 (192.168.1.3), Dst: 192.168.1.4 (192.168.1.4)
Version: 4
Header length: 20 bytes
Differentiated Services Field: 0x00 (DSCP 0x00: Default; ECN: 0x00: Not-ECT (Not ECN-Capable Transport))
0000 00.. = Differentiated Services Codepoint: Default (0x00)
.... ..00 = Explicit Congestion Notification: Not-ECT (Not ECN-Capable Transport) (0x00)
Total Length: 62
Identification: 0x31c4 (12740)
Flags: 0x02 (Don't Fragment)
0... .... = Reserved bit: Not set
.1.. .... = Don't fragment: Set
..0. .... = More fragments: Not set
Fragment offset: 0
Time to live: 128
Protocol: TCP (6)
Header checksum: 0x459e [correct]
[Good: True]
[Bad: False]
Source: 192.168.1.3 (192.168.1.3)
Destination: 192.168.1.4 (192.168.1.4)
[Source GeoIP: Unknown]
[Destination GeoIP: Unknown]
Transmission Control Protocol, Src Port: msrp (2855), Dst Port: search-agent (1234), Seq: 1, Ack: 1, Len: 22
Source port: msrp (2855)
Destination port: search-agent (1234)
[Stream index: 0]
Sequence number: 1 (relative sequence number)
[Next sequence number: 23 (relative sequence number)]
Acknowledgment number: 1 (relative ack number)
Header length: 20 bytes
Flags: 0x018 (PSH, ACK)
000. .... .... = Reserved: Not set
...0 .... .... = Nonce: Not set
.... 0... .... = Congestion Window Reduced (CWR): Not set
.... .0.. .... = ECN-Echo: Not set
.... ..0. .... = Urgent: Not set
.... ...1 .... = Acknowledgment: Set
.... .... 1... = Push: Set
.... .... .0.. = Reset: Not set
.... .... ..0. = Syn: Not set
.... .... ...0 = Fin: Not set
Window size value: 64240
[Calculated window size: 64240]
[Window size scaling factor: -1 (unknown)]
Checksum: 0x8ea0 [validation disabled]
[Good Checksum: False]
[Bad Checksum: False]
[SEQ/ACK analysis]
[Bytes in flight: 22]
Data (22 bytes)
0000 48 61 6c 6c 6f 20 64 61 73 20 69 73 74 20 65 69 Hallo das ist ei
0010 6e 20 54 65 73 74 n Test
Data: 48616c6c6f20646173206973742065696e2054657374
[Length: 22]

vbscript match within a match

Good day all.
I am running some Cisco show commands on a router and capturing the output to an array. I want to use a regex to find certain information in the output. The regex works in the sense that it finds the line containing the information, but there is not enough unique text to build the pattern around, so I end up with more than I want. Here is the output:
ROUTERNAME#sh diag
Slot 0:
C2821 Motherboard with 2GE and integrated VPN Port adapter, 2 ports
Port adapter is analyzed
Port adapter insertion time 18w4d ago
Onboard VPN : v2.3.3
EEPROM contents at hardware discovery:
PCB Serial Number : FOC1XXXXXXXXX
Hardware Revision : 1.0
Top Assy. Part Number : 800-26921-04
Board Revision : E0
Deviation Number : 0
Fab Version : 03
RMA Test History : 00
RMA Number : 0-0-0-0
RMA History : 00
Processor type : 87
Hardware date code : 20090816
Chassis Serial Number : FTXXXXXXXXXX
Chassis MAC Address : 0023.ebf4.5480
MAC Address block size : 32
CLEI Code : COMV410ARA
Product (FRU) Number : CISCO2821
Part Number : 73-8853-05
Version Identifier : V05
EEPROM format version 4
EEPROM contents (hex):
0x00: 04 FF C1 8B 46 4F 43 31 33 33 33 31 4E 36 34 40
0x10: 03 E8 41 01 00 C0 46 03 20 00 69 29 04 42 45 30
0x20: 88 00 00 00 00 02 03 03 00 81 00 00 00 00 04 00
0x30: 09 87 83 01 32 8F C0 C2 8B 46 54 58 31 33 33 36
0x40: 41 30 4C 41 C3 06 00 23 EB F4 54 80 43 00 20 C6
0x50: 8A 43 4F 4D 56 34 31 30 41 52 41 CB 8F 43 49 53
0x60: 43 4F 32 38 32 31 20 20 20 20 20 20 82 49 22 95
0x70: 05 89 56 30 35 20 D9 02 40 C1 FF FF FF FF FF FF
AIM Module in slot: 0
Hardware Revision : 1.0
Top Assy. Part Number : 800-27059-01
Board Revision : A0
Deviation Number : 0-0
Fab Version : 02
PCB Serial Number : FOXXXXXXXXX
RMA Test History : 00
RMA Number : 0-0-0-0
RMA History : 00
Product (FRU) Number : AIM-VPN/SSL-2
Version Identifier : V01
EEPROM format version 4
EEPROM contents (hex):
0x00: 04 FF 40 04 F4 41 01 00 C0 46 03 20 00 69 B3 01
0x10: 42 41 30 80 00 00 00 00 02 02 C1 8B 46 4F 43 31
0x20: 33 33 31 36 39 59 55 03 00 81 00 00 00 00 04 00
0x30: CB 8D 41 49 4D 2D 56 50 4E 2F 53 53 4C 2D 32 89
0x40: 56 30 31 00 D9 02 40 C1 FF FF FF FF FF FF FF FF
0x50: FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF
0x60: FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF
0x70: FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF
What I want to capture is the Model number that is contained in the 'Product (FRU) Number:' section. In this example 'CISCO2821'. I want to output or MsgBox just the CISCO2821 although other possibilities could be 'CISCO2911/K9' or something similar.
This is the regex pattern I am using:
Product\s\(FRU\)\sNumber\s*:\s*CIS.*
Using a regex testing tool I was able to match the entire line containing what I want but I want to write only the model number.
I looked at 'ltrim' and 'rtrim' but did not think that could do it.
Any help would be greatly appreciated.
Regards.
Ok, this is in VB.NET not vbscript, but this may help get you on your way:
Dim RegexObj As New Regex("Product\s\(FRU\)\sNumber[\s\t]+:\s(CIS.+?)$", RegexOptions.IgnoreCase Or RegexOptions.Multiline)
ResultString = RegexObj.Match(SubjectString).Groups(1).Value
Invest in 2 little helper functions:
Function qq(sT) : qq = """" & sT & """" : End Function
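Function newRE(sP, sF)
  ' sF holds three flag letters (G, I, M); a flag is switched on
  ' only when its letter is given in upper case.
  Set newRE = New RegExp
  newRE.Pattern = sP
  newRE.Global = "G" = Mid(sF, 1, 1)
  newRE.IgnoreCase = "I" = Mid(sF, 2, 1)
  newRE.MultiLine = "M" = Mid(sF, 3, 1)
End Function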
and use
' 3 ways to skin this cat
Dim sInp : sInp = Join(Array( _
"CLEI Code: COMV410ARA" _
, "Product (FRU) Number : CISCO2821" _
, "Part Number:73-8853-05" _
), vbCrLf) ' or vbLf, vbCr
WScript.Echo sInp
' (1) just search for CIS + sequence of non-spaces - risky if e.g. CLEI Code starts with CIS
WScript.Echo 0, "=>", qq(newRE("CIS\S+", "gim").Execute(sInp)(0).Value)
' (2) use a capture/group (idea stolen from skyburner; just 'ported' to VBScript)
WScript.Echo 1, "=>", qq(newRE("\(FRU\)[^:]+:\s(\S+)", "gim").Execute(sInp)(0).Value)
WScript.Echo 2, "=>", qq(newRE("\(FRU\)[^:]+:\s(\S+)", "gim").Execute(sInp)(0).SubMatches(0))
' (3) generalize & use a Dictionary
Dim dicProps : Set dicProps = CreateObject("Scripting.Dictionary")
Dim oMT
For Each oMT in newRe("^\s*(.+?)\s*:\s*(.+?)\s*$", "GiM").Execute(sInp)
  Dim oSM : Set oSM = oMT.SubMatches
  dicProps(oSM(0)) = oSM(1)
Next
Dim sName
For Each sName In dicProps.Keys
  WScript.Echo qq(sName), "=>", qq(dicProps(sName))
Next
to get this output:
CLEI Code: COMV410ARA
Product (FRU) Number : CISCO2821
Part Number:73-8853-05
0 => "CISCO2821"
1 => "(FRU) Number : CISCO2821"
2 => "CISCO2821"
"CLEI Code" => "COMV410ARA"
"Product (FRU) Number" => "CISCO2821"
"Part Number" => "73-8853-05"
and - I hope - some food for thought.
Important
a (plain) pattern matches/finds some part of the input
captures/groups/submatches/parentheses cut parts from this match
sometimes dealing with a generalized version of the problem gives you more gain for less work
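The point about captures cutting a part out of the match can be illustrated outside VBScript as well. Here is a minimal Python sketch that uses the question's pattern with a capture group in place of the trailing CIS.*; fru_number is a hypothetical name.

```python
import re

# The parenthesized \S+ captures only the model number, not the whole line.
FRU_LINE = re.compile(r"Product\s\(FRU\)\sNumber\s*:\s*(\S+)")


def fru_number(text):
    """Return the first 'Product (FRU) Number' value in text, or None."""
    match = FRU_LINE.search(text)
    return match.group(1) if match else None


sample = "\n".join([
    "CLEI Code : COMV410ARA",
    "Product (FRU) Number : CISCO2821",
    "Part Number : 73-8853-05",
])
print(fru_number(sample))
```

match.group(0) would still hold the whole matched line, while match.group(1) is the piece inside the parentheses, which corresponds to SubMatches(0) in the VBScript version.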