Regex for grabbing an INT from a SUBSTRING - regex

import re
data = []
tcp_dump = "17:18:38.877517 IP 192.168.0.15.43471 > 23.195.155.202.443: Flags [.], ack 1623866279, win 245, options [nop,nop,TS val 43001536 ecr 287517202], length 0"
regex = r'(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})|(^length (\d+))'
data_ready = re.findall(regex, tcp_dump)
print(data_ready)
data.append(data_ready)
print(data)
This code needs to grab two IPv4 addresses and the length of a packet and put them into a 2-D list. So far, the first half of my regex does just that with the IPv4 addresses. My problem comes down to grabbing the length. I get the output:
[('192.168.0.15', '', ''), ('23.195.155.202', '', '')]
instead of the desired output of:
['192.168.0.15', '23.195.155.202', '0']
Any way to fix the regex?
EDIT
It turns out the regex works separated (just the first half or just the second half); I can't seem to get the two halves to work combined.

This should do it. You just need to make some of your parentheses non-capturing and do some data cleaning:
import re
data = []
tcp_dump = "17:18:38.877517 IP 192.168.0.15.43471 > 23.195.155.202.443: Flags [.], ack 1623866279, win 245, options [nop,nop,TS val 43001536 ecr 287517202], length 0"
regex = r'(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})|(?:length (\d+))'
# make the returned tuples into two lists, one containing the IPs and the
# other containing the lengths. Finally, filter out empty strings.
data_ready, lengths = zip(*re.findall(regex, tcp_dump))
list_data = [ip for ip in list(data_ready) + list(lengths) if ip != '']
print(list_data)
data.append(list_data)
print(data)
output:
['192.168.0.15', '23.195.155.202', '0']

I wouldn't call this IP address matching (since 192.168.0.15.43471 is not a valid IP address) but rather text parsing/processing. An optimized solution with the re.search() function:
import re
tcp_dump = "17:18:38.877517 IP 192.168.0.15.43471 > 23.195.155.202.443: Flags [.], ack 1623866279, win 245, options [nop,nop,TS val 43001536 ecr 287517202], length 0"
# anchor on "length" so a multi-digit length is captured whole, not just its final digit
result = re.search(r'((?:\d{1,3}\.){3}\d{1,3})(?:\.\d+) > ((?:\d{1,3}\.){3}\d{1,3})(?:\.\d+).*length (\d+)$', tcp_dump)
result = list(result.groups())
print(result)
The output:
['192.168.0.15', '23.195.155.202', '0']
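The same extraction can also be written with named groups, which label each capture and avoid positional bookkeeping (a variant sketch, not from the answer above):

```python
import re

tcp_dump = "17:18:38.877517 IP 192.168.0.15.43471 > 23.195.155.202.443: Flags [.], ack 1623866279, win 245, options [nop,nop,TS val 43001536 ecr 287517202], length 0"

# Named groups label each capture; (?:...) groups stay non-capturing.
pattern = (r'(?P<src>(?:\d{1,3}\.){3}\d{1,3})(?:\.\d+) > '
           r'(?P<dst>(?:\d{1,3}\.){3}\d{1,3})(?:\.\d+)'
           r'.*length (?P<length>\d+)$')

m = re.search(pattern, tcp_dump)
result = [m.group('src'), m.group('dst'), m.group('length')]
print(result)  # ['192.168.0.15', '23.195.155.202', '0']
```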

Related

How to replace a string using input from a different text file

I have 2 files, file1.txt - which has 100's of IP Address line by line and on my second file(file2.txt), I have an entry ip_address which need to replaced by the actual ip address from the file1. How to do it in Python.
Your help is much appreciated
Eg:
less File1.txt
10.10.10.1
10.10.20.1
10.20.10.10 etc
less File2.txt
[/tmp/test/ip_address]
whitelist = *
I am looking for my output to be like this:
[/tmp/test/10.10.10.1]
whitelist = *
[/tmp/test/10.10.20.1]
whitelist = *
[/tmp/test/10.20.10.10]
whitelist = *
etc.
Using a simple iteration.
Ex:
with open("File1.txt") as infile, open("File2.txt", "w") as outfile:
    for line in infile:  # iterate over each line
        outfile.write("[/tmp/test/{}]\nwhitelist = *\n\n".format(line.strip()))  # write one block per IP
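If the block layout should come from the template in File2.txt rather than be hard-coded in the write call, another sketch is to read the template once and substitute the `ip_address` placeholder per IP (shown here with the sample data inlined as strings so it runs standalone; the placeholder literal "ip_address" is an assumption based on the question):

```python
# Template and IPs inlined; in practice they'd be read from File2.txt and File1.txt.
template = "[/tmp/test/ip_address]\nwhitelist = *\n"
ips = ["10.10.10.1", "10.10.20.1", "10.20.10.10"]

# One copy of the template per IP, placeholder replaced.
blocks = [template.replace("ip_address", ip) for ip in ips]
output = "\n".join(blocks)
print(output)
```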

Python exclude directory with fnmatch

I'm working with some legacy code that I can't change (for reasons).
It uses fnmatch.fnmatch to filter a list of paths, like so (simplified):
import fnmatch
paths = ['a/x.txt', 'b/y.txt']
for path in paths:
    if fnmatch.fnmatch(path, '*.txt'):
        print 'do things'
Via configuration I am able to change the pattern used to match the files. I need to exclude everything in b/, is that possible?
From reading the docs (https://docs.python.org/2/library/fnmatch.html) it does not appear to be, but I thought asking was worth a try.
From the fnmatch.fnmatch documentation:
Patterns are Unix shell style:
* matches everything
? matches any single character
[seq] matches any character in seq
[!seq] matches any char not in seq
When I run:
for path in paths:
    if fnmatch.fnmatch(path, '[!b]*'):
        print path
I get:
a/x.txt
Note that this approach only negates the single character right after the '!'. For example, in my case, from the list col_names
['# Spec No', 'Name', 'Date (DD/MM/YYYY)', 'Time (hh:mm:ss)', 'Year',
'Fractional day', 'Fractional time', 'Scans', 'Tint', 'SZA',
'NO2_UV.RMS', 'NO2_UV.RefZm', 'NO2_UV.RefNumber', 'NO2_UV.SlCol(bro)',
'NO2_UV.SlErr(bro)', 'NO2_UV.SlCol(ring)', 'NO2_UV.SlErr(ring)',
'NO2_UV.SlCol(HCHO)', 'NO2_UV.SlErr(HCHO)', 'NO2_UV.SlCol(O4)',
'NO2_UV.SlErr(O4)', 'NO2_UV.SlCol(O3a)', 'NO2_UV.SlErr(O3a)',
'NO2_UV.SlCol(O3223k)', 'NO2_UV.SlErr(O3223k)', 'NO2_UV.SlCol(NO2)',
'NO2_UV.SlErr(NO2)', 'NO2_UV.SlCol(no2a)', 'NO2_UV.SlErr(no2a)',
'NO2_UV.Offset (Constant)', 'NO2_UV.Err(Offset (Constant))',
'NO2_UV.Offset (Order 1)', 'NO2_UV.Err(Offset (Order 1))',
'NO2_UV.Shift(Spectrum)', 'NO2_UV.Stretch(Spectrum)1',
'NO2_UV.Stretch(Spectrum)2', 'HCHO.RMS', 'HCHO.RefZm', 'HCHO.RefNumber',
'HCHO.SlCol(bro)', 'HCHO.SlErr(bro)', 'HCHO.SlCol(ring)',
'HCHO.SlErr(ring)', 'HCHO.SlCol(HCHO)', 'HCHO.SlErr(HCHO)',
'HCHO.SlCol(O4)', 'HCHO.SlErr(O4)', 'HCHO.SlCol(O3a)',
'HCHO.SlErr(O3a)', 'HCHO.SlCol(O3223k)', 'HCHO.SlErr(O3223k)',
'HCHO.SlCol(NO2)', 'HCHO.SlErr(NO2)', 'HCHO.Offset (Constant)',
'HCHO.Err(Offset (Constant))', 'HCHO.Offset (Order 1)',
'HCHO.Err(Offset (Order 1))', 'HCHO.Shift(Spectrum)',
'HCHO.Stretch(Spectrum)1', 'HCHO.Stretch(Spectrum)2', 'Fluxes 318',
'Fluxes 330', 'Fluxes 390', 'Fluxes 440']
I wanted to search all the names that did not contain NO2_UV.
If I do
header_hcho = fnmatch.filter(col_names, '[!NO2_UV.]*')
it excludes the second element, "Name", because it starts with N. The result is the same as if I do
header_hcho = fnmatch.filter(col_names, '[!N]*')
So I went with a rather old-school method:
header_hcho = []
for idx in range(len(col_names)):
    if col_names[idx].find("NO2_UV") == -1:
        header_hcho.append(col_names[idx])
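Since the real condition is "does not contain the substring NO2_UV", a plain `in` test does it without fnmatch, and the loop above collapses to a list comprehension:

```python
# A shortened sample of col_names from the question.
col_names = ['# Spec No', 'Name', 'NO2_UV.RMS', 'NO2_UV.SlCol(bro)',
             'HCHO.RMS', 'Fluxes 318']

# Keep only the names that do not contain the substring "NO2_UV".
header_hcho = [name for name in col_names if "NO2_UV" not in name]
print(header_hcho)  # ['# Spec No', 'Name', 'HCHO.RMS', 'Fluxes 318']
```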

deobfuscate ip addresses in python dictionary

I need to parse a file that contains flat text and extract both valid ip addresses and obfuscated ip addresses.
(i.e. 192.168.1[.]1 or 192.168.1(.)1 or 192.168.1[dot]1 or 192.168.1(dot)1 or 192 . 168 . 1 . 1)
Once the data is extracted I need to convert them all to a valid format and remove duplicates.
My current code places the IP addresses into a string, though it should probably be a dict. I know I need to use some kind of recursion to set the key values, but I feel there is a more efficient and modular way to complete the task.
import json, ordereddict, re
# define the pattern of valid and obfuscated ips
pattern = r"((([01]?[0-9]?[0-9]|2[0-4][0-9]|25[0-5])[ (\[]?(\.|dot)[ )\]]?){3}([01]?[0-9]?[0-9]|2[0-4][0-9]|25[0-5]))"
# open data file that contains ip addresses and other text
with open("sample.txt", "r") as myfile:
    text = myfile.read().replace('\n', '')
# put non-normalized ip addresses in a dictionary
ips = {"data": [{"key1": match[0] for match in re.findall(pattern, text)}]}
# normalize the ip addresses
for name, datalist in ips.iteritems():
    for datadict in datalist:
        for key, value in datadict.items():
            if value in ("(dot)", "[dot]", " . ", " .", ". "):
                datadict[key] = "."
# write valid ip addresses to a json file
with open('test.json', 'w') as outfile:
    json.dump(ips, outfile)
Sample data file
These are valid ip addresses 192.168.1.1, 8.8.8.8
These are obfuscated 192.168.2[.]1 or 192.168.3(.)1 or 192.168.1[dot]1
192.168.1[dot]1 or 192.168.1(dot)1 or 192 .168 .1 .1 or 192. 168. 1. 1. or 192 . 168 . 1 . 1
This is what an invalid ip address looks like, they should be excluded 256.1.1.1 or 500.1.500.1 or 192.168.4.0
Expected result
192.168.1.1, 192.168.2.1, 192.168.3.1 , 8.8.8.8
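One sketch of the whole pipeline: normalize the obfuscated separators with `re.sub` first, then extract and deduplicate the valid addresses. The separator pattern below is an assumption covering only the variants shown in the samples above, and the sample text is inlined so the snippet runs standalone:

```python
import re

text = ("These are valid ip addresses 192.168.1.1, 8.8.8.8 "
        "These are obfuscated 192.168.2[.]1 or 192.168.3(.)1 or 192.168.1[dot]1 "
        "or 192 . 168 . 1 . 1 "
        "they should be excluded 256.1.1.1 or 500.1.500.1")

# Collapse every obfuscated separator variant ([.], (.), [dot], (dot), " . ")
# down to a plain dot.
normalized = re.sub(r'\s*[\[(]?\s*(?:\.|dot)\s*[\])]?\s*', '.', text)

# Extract only dotted quads whose octets are in range 0-255.
octet = r'(?:25[0-5]|2[0-4][0-9]|[01]?[0-9]?[0-9])'
candidates = re.findall(r'\b(?:%s\.){3}%s\b' % (octet, octet), normalized)

# Deduplicate while preserving first-seen order.
seen = []
for ip in candidates:
    if ip not in seen:
        seen.append(ip)
print(seen)  # ['192.168.1.1', '8.8.8.8', '192.168.2.1', '192.168.3.1']
```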

Remove first and last IP from netaddr result

I am writing a script to print all IPs in CIDR notation, but I do not want to print the first and last IPs, as they are not usable.
from netaddr import IPNetwork
ipc = raw_input('Enter The IP Range ')
n = 0
for ip in IPNetwork(ipc):
    n = n + 1
    print '%s' % ip
print 'Total No of IPs are ' + str(n)
This means that if I give 12.110.34.224/27 I should get 30 IPs as the result, removing the first and last IPs, since /27 means 32 IPs.
That should do it:
for ip in list(IPNetwork(ipc))[1:-1]:
    print '%s' % ip
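On Python 3, the standard-library `ipaddress` module does this directly, with no slicing needed: `hosts()` already yields only the usable addresses, excluding the network and broadcast addresses:

```python
import ipaddress

net = ipaddress.ip_network('12.110.34.224/27')

# hosts() skips the network address (.224) and the broadcast address (.255).
hosts = list(net.hosts())
print(len(hosts))           # 30
print(hosts[0], hosts[-1])  # 12.110.34.225 12.110.34.254
```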

How to split tokens, count number of tokens, and write in a file in python?

I have file which has data in lines as follows:
['Marilyn Manson', 'Web', 'Skydera Inc.', 'Stone Sour', 'The Smashing Pumpkins', 'Warner Bros. Entertainment', 'This is a good Beer']
['Voices Inside', 'Expressivista', 'The Kentucky Fried Movie', 'The Bridges of Madison County']
and so on. I want to re-write the data into a file whose lines keep only the tokens with fewer than 3 (or some other number of) words, e.g.:
['Marilyn Manson', 'Web', 'Skydera Inc.', 'Stone Sour']
['Voices Inside', 'Expressivista']
this is what I have tried so far:
for line in open(file):
    line = line.strip()
    line = line.rstrip()
    prog = re.compile("([a-z0-9]){32}")
    if line:
        line = line.replace('"', '')
        line = line.split(",")
        if re.match(prog, line[0]) and len(line) > 2:
            wo = []
            for words in line:
                word = words.split()
                if len(word) < 3:
                    print word.append(word)
But the output says None. Any thoughts on where I am making a mistake?
A better way to do what you're doing is to use ast.literal_eval, which automagically converts string representations of Python objects (e.g. lists) into actual Python objects.
import ast
# raw data
data = """
['Marilyn Manson', 'Web', 'Skydera Inc.', 'Stone Sour', 'The Smashing Pumpkins', 'Warner Bros. Entertainment','This is a good Beer']
['Voices Inside', 'Expressivista', 'The Kentucky Fried Movie', 'The Bridges of Madison County']
"""
# set threshold number of tokens
threshold = 3
# split into lines
lines = data.split('\n')
# parse non-blank lines into python lists
lists = [ast.literal_eval(line) for line in lines if line]
# for each list, keep only those tokens with less than `threshold` tokens
result = [[item for item in lst if len(item.split()) < threshold]
          for lst in lists]
# show result
for line in result:
    print(line)
Result:
['Marilyn Manson', 'Web', 'Skydera Inc.', 'Stone Sour']
['Voices Inside', 'Expressivista']
I think the reason your code isn't working is that you're trying to match line[0] against your regex prog, but line[0] isn't 32 characters long for either of your lines, so the regex never matches and the inner loop is never reached.
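There is also a second reason the output is None: list.append mutates the list in place and returns None, so `print word.append(word)` prints None even when that branch is reached. A minimal demonstration:

```python
word = []
result = word.append('Stone Sour')  # append mutates in place and returns None
print(result)  # None
print(word)    # ['Stone Sour']
```

Collect the kept tokens in a list first, then print the list after the loop.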