regex in python to remove 2 patterns

regex in python to remove 2 patterns - regex

Want to make a regex to remove 2019 and 0 from left side of the string and last six zeroes from right side of the string.
original value Dtype : class 'str'
original value: 2019 01 10 00 00 00
expected output is : 1 10

Using str.split with list slicing.
Ex:
s = "2019 01 10 00 00 00"
print(" ".join(s.split()[1:3]).lstrip("0"))
Using re.match
Ex:
import re
s = "2019 01 10 00 00 00"
m = re.match(r"\d{4}\b\s(?P<value>\d{2}\b\s\d{2}\b)", s)
if m:
print(m.group("value").lstrip("0"))
Output:
1 10

Related

Regex / Python3 - re.findall() - Find all occurrences between opcodes

Background
I'm reverse engineering a TCP stream that uses a Type-Length-Value approach to encoding data.
Example:
TCP Payload: b'0000001f001270622e416374696f6e4e6f74696679425243080310840718880e20901c'
---------------------------------------------------------------------------------------
Type: 00 00 # New function call
Length: 00 1f # Length of Value (Length of Function + Function + Data)
Value: 00 12 # Length of Function
Value: 70 62 2e 41 63 74 69 6f 6e 4e 6f 74 69 66 79 42 52 43 # Function ->(hex2ascii)-> pb.ActionNotifyBRC
Value: 08 03 10 84 07 18 88 0e 20 90 1c # Data
However the Data is a data object that can include multiple variables with variable data lengths.
Data: 08 05 10 04 10 64 18 c8 01 20 ef 0f
----------------------------------------------
Opcode : Value
08 : 05 # var1 : 1 byte
10 : 04 # var2 : 1 byte
18 : c8 01 # var3 : 1-10 bytes
20 : ef 0f # var4 : 1-10 bytes
Currently I am parsing the Data using the following Python3 code:
############################### NOTES ###############################
# Opcodes sometimes rotate starting positions but the general order is always held:
# Data: 20 ef 0f 08 05 10 04 10 64 18 c8 01
#####################################################################
import re
import binascii
def dataVariable(data, start, end):
p = re.compile(start + b'(.*?)' + end)
return p.findall(data + data)
data = bytearray.fromhex('08051004106418c80120ef0f')
var3 = dataVariable(data, b'\x18', b'\x20')
print("Variable 3:", end=' ')
for item in set(var3):
print(binascii.hexlify(item), end=' ')
----------------------------------------------------------------------------
[Output]: Variable 3: b'c801'
So far all good...
Problem
If an Opcode appears in the previous variables Value the code is no longer reliable.
Data: 08 05 10 04 10 64 18 c8 20 01 20 ef 0f
----------------------------------------------
Opcode : Value
08 : 05
10 : 04
18 : c8 20 01 # The Value includes the next opcode (20)
20 : ef 0f
----------------------------------------------------------------------------
[Output]: Variable 3: b'c8'
[Output]: Variable 4: b'0120ef0f'
I was expecting an output of:
[Output]: Variable 3: b'c8' b'c82001'
[Output]: Variable 4: b'0120ef0f' b'ef0f'
It seems like there is an issue with my regular expression?
Update
To further clarify, var3 and var4 are representing integers.
I have managed to figure out how the length of the Value was being encoded. The most significant bit was being used as a flag to inform me that another byte was coming. You can then strip the MSB of each byte, swap the endianness and convert to decimal.
data -> binary representation -> strip MSB and swap endianness -> decimal representation
ac d7 05 -> 10101100 11010111 00000101 -> 0001 01101011 10101100 -> 93100
e4 a6 04 -> 11100100 10100110 00000100 -> 0001 00010011 01100100 -> 70500
90 e1 02 -> 10010000 11100001 00000010 -> 10110000 10010000 -> 45200
dc 24 -> 11011100 00100100 -> 00010010 01011100 -> 4700
f0 60 -> 11110000 01100000 -> 00110000 01110000 -> 12400

You may use
def dataVariable(data, start, end):
p = re.compile(b'(?=(' + start + b'.*' + end + b'))')
res = []
for x in p.findall(data):
cur = b''
for i, m in enumerate([x[i:i+1] for i in range(len(x))]):
if i == 0:
continue
if m == end and cur:
res.append(cur)
cur = cur + m
return res
See the Python demo:
data = bytearray.fromhex('08051004106418c8200120ef0f0f') # => b'c82001' b'c8'
#data = bytearray.fromhex('185618205720') # => b'56182057' b'2057' b'5618'
var3 = dataVariable(data, b'\x18', b'\x20')
print("Variable 3:", end=' ')
for item in set(var3):
print(binascii.hexlify(item), end=' ')
Output is Variable 3: b'c8' b'c82001' for '08051004106418c8200120ef0f0f' string and b'56182057' b'2057' b'5618' for 185618205720 input.
The pattern is of (?=(...)) type to find all overlapping matches. If you do not need the overlapping feature, remove these parts from the regex.
The point here is:
match all substrings starting with start and up to the last end with start + b'.*' + end pattern
iterate through the match dropping the first start byte and add an item to the resulting list when the end byte is found, adding up found bytes at each iteration (thus, getting all inner substrings inside the match).

Write hex code to text file from integer value in python

Details: Ubuntu 14.04(LTS), Python(2.7)
I want to write hex code to a text file so I wrote this code:
import numpy as np
width = 28
height = 28
num = 10
info = np.array([num, width, height]).reshape(1,3)
info = info.astype(np.int32)
newfile = open('test.txt', 'w')
newfile.write(info)
newfile.close()
I expected like this:
00 00 00 0A 00 00 00 1C 00 00 00 1C
But this is my actual result:
0A 00 00 00 1C 00 00 00 1C 00 00 00
Why did this happen and how can I get my expected output?

If you want big endian binary data, call astype(">i") then tostring():
import numpy as np
width = 28
height = 28
num = 10
info = np.array([num, width, height]).reshape(1,3)
info = info.astype(np.int32)
info.astype(">i").tostring()
If you want hex text:
" ".join("{:02X}".format(x) for x in info.astype(">i").tostring())
the output:
00 00 00 0A 00 00 00 1C 00 00 00 1C

memory structure of a vector<T>

I am playing around with vectors in c++ and I've tried to figure out how a vector looks inside memory...
I made a vector like this: vector<int> numbers = { 1, 2, 3, 4, 5, 6 }; and extracted a couple of information about the vector
&numbers: 002EF8B8
begin()._Ptr: 0031A538
end()._Ptr: 0031A550
data at vector memory location 002EF8B8 :
00 9d 31 00 38 a5 31 00 50 a5 31 00 50 a5 31 00
cc cc cc cc 30 31 82 1f
found begin()._Ptr and end()._Ptr addresses stored there...
and integers found in that address range:
1st int at memory location: 0031A538 = 01 00 00 00
2nd int at memory location: 0031A53C = 02 00 00 00
3rd int at memory location: 0031A540 = 03 00 00 00
4th int at memory location: 0031A544 = 04 00 00 00
5th int at memory location: 0031A548 = 05 00 00 00
6th int at memory location: 0031A54C = 06 00 00 00
Question:
If 002EF8B8 is the memory location of the vector, 31 00 38 a5 and 31 00 50 a5 are beginning and the end of vector, what is 00 9d at the beginning and the data after? 31 00 50 a5 31 00 cc cc cc cc 30 31 82 1f
I got the size with numbers.size()*sizeof(int) but I'm almost sure that's not the actual size of vector in memory.
Can someone explain to me how can I get the size of actual vector, and what does each part of it represent?
something like:
data size [2 bytes] [4 bytes] [4 bytes] [? bytes]
data meaning [something] [begin] [end] [something else]
EDIT:
bcrist suggested the use of /d1reportAllClassLayout and it generated this output
1> class ?$_Vector_val#U?$_Simple_types#H#std## size(16):
1> +---
1> | +--- (base class _Container_base12)
1> 0 | | _Myproxy
1> | +---
1> 4 | _Myfirst
1> 8 | _Mylast
1> 12 | _Myend
1> +---
which is basically [_Myproxy] [_Myfirst] [_Mylast] [_Myend]

You misinterpret the bytes. On a little-endian machine, the value 0x0031A538 is represented with the sequence of bytes 38 A5 31 00. So, your highlights are shifted.
Actually you have four addresses here: 0x00319D00, 0x0031A538, 0x0031A550 and again 0x0031A550.
A vector minimally needs three values to control its data, one of them being, obviously, the vector base. The other two may be either pointers to the end of vector and the end of allocated area, or sizes.
0x0031A538 is obviously the vector base, 0x0031A550 is both its end and the end of allocated area. What still needs explanation, then, is the value 0x00319D00.

Binary File interpretation

I am reading in a binary file (in c++). And the header is something like this (printed in hexadecimal)
43 27 41 1A 00 00 00 00 23 00 00 00 00 00 00 00 04 63 68 72 31 FFFFFFB4 01 00 00 04 63 68 72 32 FFFFFFEE FFFFFFB7
when printed out using:
std::cout << hex << (int)mem[c];
Is there an efficient way to store 23 which is the 9th byte(?) into an integer without using stringstream? Or is stringstream the best way?
Something like
int n= mem[8]
I want to store 23 in n not 35.

You did store 23 in n. You only see 35 because you are outputting it with a routine that converts it to decimal for display. If you could look at the binary data inside the computer, you would see that it is in fact a hex 23.
You will get the same result as if you did:
int n=0x23;
(What you might think you want is impossible. What number should be stored in n for 1E? The only corresponding number is 31, which is what you are getting.)

Do you mean you want to treat the value as binary-coded decimal? In that case, you could convert it using something like:
unsigned char bcd = mem[8];
unsigned char ones = bcd % 16;
unsigned char tens = bcd / 16;
if (ones > 9 || tens > 9) {
// handle error
}
int n = 10*tens + ones;

vbscript match within a match

Good day all.
I am running some Cisco show commands on a router. I am capturing the output to an array. I want to use Regex to find certain information in the output. The Regex works in the sense that it find the line containing it however there is not enough unique information I can create my regex with so I end up with more that I want. Here is the output:
ROUTERNAME#sh diag
Slot 0:
C2821 Motherboard with 2GE and integrated VPN Port adapter, 2 ports
Port adapter is analyzed
Port adapter insertion time 18w4d ago
Onboard VPN : v2.3.3
EEPROM contents at hardware discovery:
PCB Serial Number : FOC1XXXXXXXXX
Hardware Revision : 1.0
Top Assy. Part Number : 800-26921-04
Board Revision : E0
Deviation Number : 0
Fab Version : 03
RMA Test History : 00
RMA Number : 0-0-0-0
RMA History : 00
Processor type : 87
Hardware date code : 20090816
Chassis Serial Number : FTXXXXXXXXXX
Chassis MAC Address : 0023.ebf4.5480
MAC Address block size : 32
CLEI Code : COMV410ARA
Product (FRU) Number : CISCO2821
Part Number : 73-8853-05
Version Identifier : V05
EEPROM format version 4
EEPROM contents (hex):
0x00: 04 FF C1 8B 46 4F 43 31 33 33 33 31 4E 36 34 40
0x10: 03 E8 41 01 00 C0 46 03 20 00 69 29 04 42 45 30
0x20: 88 00 00 00 00 02 03 03 00 81 00 00 00 00 04 00
0x30: 09 87 83 01 32 8F C0 C2 8B 46 54 58 31 33 33 36
0x40: 41 30 4C 41 C3 06 00 23 EB F4 54 80 43 00 20 C6
0x50: 8A 43 4F 4D 56 34 31 30 41 52 41 CB 8F 43 49 53
0x60: 43 4F 32 38 32 31 20 20 20 20 20 20 82 49 22 95
0x70: 05 89 56 30 35 20 D9 02 40 C1 FF FF FF FF FF FF
AIM Module in slot: 0
Hardware Revision : 1.0
Top Assy. Part Number : 800-27059-01
Board Revision : A0
Deviation Number : 0-0
Fab Version : 02
PCB Serial Number : FOXXXXXXXXX
RMA Test History : 00
RMA Number : 0-0-0-0
RMA History : 00
Product (FRU) Number : AIM-VPN/SSL-2
Version Identifier : V01
EEPROM format version 4
EEPROM contents (hex):
0x00: 04 FF 40 04 F4 41 01 00 C0 46 03 20 00 69 B3 01
0x10: 42 41 30 80 00 00 00 00 02 02 C1 8B 46 4F 43 31
0x20: 33 33 31 36 39 59 55 03 00 81 00 00 00 00 04 00
0x30: CB 8D 41 49 4D 2D 56 50 4E 2F 53 53 4C 2D 32 89
0x40: 56 30 31 00 D9 02 40 C1 FF FF FF FF FF FF FF FF
0x50: FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF
0x60: FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF
0x70: FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF
What I want to capture is the Model number that is contained in the 'Product (FRU) Number:' section. In this example 'CISCO2821'. I want to output or MsgBox just the CISCO2821 although other possibilities could be 'CISCO2911/K9' or something similar.
This is the regex pattern I am using:
Product\s\(FRU\)\sNumber\s*:\s*CIS.*
Using a regex testing tool I was able to match the entire line containing what I want but I want to write only the model number.
I looked at 'ltrim' and 'rtrim' but did not think that could do it.
Any help would be greatly appreciated.
Regards.

Ok, this is in VB.NET not vbscript, but this may help get you on your way:
Dim RegexObj As New Regex("Product\s\(FRU\)\sNumber[\s\t]+:\s(CIS.+?)$", RegexOptions.IgnoreCase Or RegexOptions.Multiline)
ResultString = RegexObj.Match(SubjectString).Groups(1).Value

Invest in 2 little helper functions:
Function qq(sT) : qq = """" & sT & """" : End Function
Function newRE(sP, sF)
Set newRE = New RegExp
newRE.Pattern = sP
newRE.Global = "G" = Mid(sF, 1, 1)
newRE.IgnoreCase = "I" = Mid(sF, 2, 1)
newRE.MultiLine = "M" = Mid(sF, 3, 1)
End Function
and use
' 3 ways to skin this cat
Dim sInp : sInp = Join(Array( _
"CLEI Code: COMV410ARA" _
, "Product (FRU) Number : CISCO2821" _
, "Part Number:73-8853-05" _
), vbCrLf) ' or vbLf, vbCr
WScript.Echo sInp
' (1) just search for CIS + sequence of non-spaces - risky if e.g. CLEI Code starts with CIS
WScript.Echo 0, "=>", qq(newRE("CIS\S+", "gim").Execute(sInp)(0).Value)
' (2) use a capture/group (idea stolen from skyburner; just 'ported' to VBScript)
WScript.Echo 1, "=>", qq(newRE("\(FRU\)[^:]+:\s(\S+)", "gim").Execute(sInp)(0).Value)
WScript.Echo 2, "=>", qq(newRE("\(FRU\)[^:]+:\s(\S+)", "gim").Execute(sInp)(0).SubMatches(0))
' (3) generalize & use a Dictionary
Dim dicProps : Set dicProps = CreateObject("Scripting.Dictionary")
Dim oMT
For Each oMT in newRe("^\s*(.+?)\s*:\s*(.+?)\s*$", "GiM").Execute(sInp)
Dim oSM : Set oSM = oMT.SubMatches
dicProps(oSM(0)) = oSM(1)
Next
Dim sName
For Each sName In dicProps.Keys
WScript.Echo qq(sName), "=>", qq(dicProps(sName))
Next
to get this output:
CLEI Code: COMV410ARA
Product (FRU) Number : CISCO2821
Part Number:73-8853-05
0 => "CISCO2821"
1 => "(FRU) Number : CISCO2821"
2 => "CISCO2821"
"CLEI Code" => "COMV410ARA"
"Product (FRU) Number" => "CISCO2821"
"Part Number" => "73-8853-05"
and - I hope - some food for thought.
Important
a (plain) pattern matches/finds some part of the input
captures/groups/submatches/parentheses cut parts from this match
sometimes dealing with a generalized version of the problem gives
you more gain for less work

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

regex in python to remove 2 patterns - regex

Want to make a regex to remove 2019 and 0 from left side of the string and last six zeroes from right side of the string. original value Dtype : class 'str' original value: 2019 01 10 00 00 00 expected output is : 1 10

Using str.split with list slicing. Ex: s = "2019 01 10 00 00 00" print(" ".join(s.split()[1:3]).lstrip("0")) Using re.match Ex: import re s = "2019 01 10 00 00 00" m = re.match(r"\d{4}\b\s(?P<value>\d{2}\b\s\d{2}\b)", s) if m: print(m.group("value").lstrip("0")) Output: 1 10

Related

Regex / Python3 - re.findall() - Find all occurrences between opcodes

Write hex code to text file from integer value in python

memory structure of a vector<T>

Binary File interpretation

vbscript match within a match

Categories

Resources