Regex for grabbing an INT from a SUBSTRING - regex

import re
data = []
tcp_dump = "17:18:38.877517 IP 192.168.0.15.43471 > 23.195.155.202.443: Flags [.], ack 1623866279, win 245, options [nop,nop,TS val 43001536 ecr 287517202], length 0"
regex = r'(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})|(^length (\d+))'
data_ready = re.findall(regex, tcp_dump)
print(data_ready)
data.append(data_ready)
print(data)
This code needs to grab two IPv4 addresses and the length of a packet and put them into a 2-D list. So far, the first half of my regex does just that with the IPv4 addresses. My problem comes down to grabbing the length. I get the output:
[('192.168.0.15', '', ''), ('23.195.155.202', '', '')]
instead of the desired output of:
['192.168.0.15', '23.195.155.202', '0']
Any way to fix the regex?
EDIT
It turns out the regex works separated (just the first half or just the second half); I can't seem to get the two halves to work combined.

This should do it. You just need to make some of your parentheses non-capturing and do some data cleaning:
import re
data = []
tcp_dump = "17:18:38.877517 IP 192.168.0.15.43471 > 23.195.155.202.443: Flags [.], ack 1623866279, win 245, options [nop,nop,TS val 43001536 ecr 287517202], length 0"
regex = r'(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})|(?:length (\d+))'
# make the returned tuples into two lists, one containing the IPs and the
# other containing the lengths. Finally, filter out empty strings.
data_ready, lengths = zip(*re.findall(regex, tcp_dump))
list_data = [ip for ip in list(data_ready) + list(lengths) if ip != '']
print(list_data)
data.append(list_data)
print(data)
output:
['192.168.0.15', '23.195.155.202', '0']

I wouldn't call this IP address matching (since 192.168.0.15.43471 is not a valid IP address) but rather text parsing/processing. An optimized solution with the re.search() function:
import re
tcp_dump = "17:18:38.877517 IP 192.168.0.15.43471 > 23.195.155.202.443: Flags [.], ack 1623866279, win 245, options [nop,nop,TS val 43001536 ecr 287517202], length 0"
# anchor on "length" so a multi-digit length is captured whole, not just its final digit
result = re.search(r'((?:\d{1,3}\.){3}\d{1,3})(?:\.\d+) > ((?:\d{1,3}\.){3}\d{1,3})(?:\.\d+).*length (\d+)$', tcp_dump)
result = list(result.groups())
print(result)
The output:
['192.168.0.15', '23.195.155.202', '0']
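The same extraction can also be written with named groups, which label each capture and avoid positional bookkeeping (a variant sketch, not from the answer above):

```python
import re

tcp_dump = "17:18:38.877517 IP 192.168.0.15.43471 > 23.195.155.202.443: Flags [.], ack 1623866279, win 245, options [nop,nop,TS val 43001536 ecr 287517202], length 0"

# Named groups label each capture; (?:...) groups stay non-capturing.
pattern = (r'(?P<src>(?:\d{1,3}\.){3}\d{1,3})(?:\.\d+) > '
           r'(?P<dst>(?:\d{1,3}\.){3}\d{1,3})(?:\.\d+)'
           r'.*length (?P<length>\d+)$')

m = re.search(pattern, tcp_dump)
result = [m.group('src'), m.group('dst'), m.group('length')]
print(result)  # ['192.168.0.15', '23.195.155.202', '0']
```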

Related

How to replace a string using input from a different text file

I have 2 files, file1.txt - which has 100's of IP Address line by line and on my second file(file2.txt), I have an entry ip_address which need to replaced by the actual ip address from the file1. How to do it in Python.
Your help is much appreciated
Eg:
less File1.txt
10.10.10.1
10.10.20.1
10.20.10.10 etc
less File2.txt
[/tmp/test/ip_address]
whitelist = *
I am looking for my output to be like this:
[/tmp/test/10.10.10.1]
whitelist = *
[/tmp/test/10.10.20.1]
whitelist = *
[/tmp/test/10.20.10.10]
whitelist = *
etc.
Using a simple iteration.
Ex:
with open("File1.txt") as infile, open("File2.txt", "w") as outfile:
    for line in infile:  # iterate over each line
        outfile.write("[/tmp/test/{}]\nwhitelist = *\n\n".format(line.strip()))  # write one block per IP
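If the block layout should come from the template in File2.txt rather than be hard-coded in the write call, another sketch is to read the template once and substitute the `ip_address` placeholder per IP (shown here with the sample data inlined as strings so it runs standalone; the placeholder literal "ip_address" is an assumption based on the question):

```python
# Template and IPs inlined; in practice they'd be read from File2.txt and File1.txt.
template = "[/tmp/test/ip_address]\nwhitelist = *\n"
ips = ["10.10.10.1", "10.10.20.1", "10.20.10.10"]

# One copy of the template per IP, placeholder replaced.
blocks = [template.replace("ip_address", ip) for ip in ips]
output = "\n".join(blocks)
print(output)
```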

Python exclude directory with fnmatch

I'm working with some legacy code that I can't change (for reasons).
It uses fnmatch.fnmatch to filter a list of paths, like so (simplified):
import fnmatch
paths = ['a/x.txt', 'b/y.txt']
for path in paths:
    if fnmatch.fnmatch(path, '*.txt'):
        print 'do things'
Via configuration I am able to change the pattern used to match the files. I need to exclude everything in b/, is that possible?
From reading the docs (https://docs.python.org/2/library/fnmatch.html) it does not appear to be, but I thought asking was worth a try.
From the fnmatch.fnmatch documentation:
Patterns are Unix shell style:
* matches everything
? matches any single character
[seq] matches any character in seq
[!seq] matches any char not in seq
When I run:
for path in paths:
    if fnmatch.fnmatch(path, '[!b]*'):
        print path
I get:
a/x.txt
Note that this approach only negates the single character right after the '!'. For example, in my case, from the list col_names
['# Spec No', 'Name', 'Date (DD/MM/YYYY)', 'Time (hh:mm:ss)', 'Year',
'Fractional day', 'Fractional time', 'Scans', 'Tint', 'SZA',
'NO2_UV.RMS', 'NO2_UV.RefZm', 'NO2_UV.RefNumber', 'NO2_UV.SlCol(bro)',
'NO2_UV.SlErr(bro)', 'NO2_UV.SlCol(ring)', 'NO2_UV.SlErr(ring)',
'NO2_UV.SlCol(HCHO)', 'NO2_UV.SlErr(HCHO)', 'NO2_UV.SlCol(O4)',
'NO2_UV.SlErr(O4)', 'NO2_UV.SlCol(O3a)', 'NO2_UV.SlErr(O3a)',
'NO2_UV.SlCol(O3223k)', 'NO2_UV.SlErr(O3223k)', 'NO2_UV.SlCol(NO2)',
'NO2_UV.SlErr(NO2)', 'NO2_UV.SlCol(no2a)', 'NO2_UV.SlErr(no2a)',
'NO2_UV.Offset (Constant)', 'NO2_UV.Err(Offset (Constant))',
'NO2_UV.Offset (Order 1)', 'NO2_UV.Err(Offset (Order 1))',
'NO2_UV.Shift(Spectrum)', 'NO2_UV.Stretch(Spectrum)1',
'NO2_UV.Stretch(Spectrum)2', 'HCHO.RMS', 'HCHO.RefZm', 'HCHO.RefNumber',
'HCHO.SlCol(bro)', 'HCHO.SlErr(bro)', 'HCHO.SlCol(ring)',
'HCHO.SlErr(ring)', 'HCHO.SlCol(HCHO)', 'HCHO.SlErr(HCHO)',
'HCHO.SlCol(O4)', 'HCHO.SlErr(O4)', 'HCHO.SlCol(O3a)',
'HCHO.SlErr(O3a)', 'HCHO.SlCol(O3223k)', 'HCHO.SlErr(O3223k)',
'HCHO.SlCol(NO2)', 'HCHO.SlErr(NO2)', 'HCHO.Offset (Constant)',
'HCHO.Err(Offset (Constant))', 'HCHO.Offset (Order 1)',
'HCHO.Err(Offset (Order 1))', 'HCHO.Shift(Spectrum)',
'HCHO.Stretch(Spectrum)1', 'HCHO.Stretch(Spectrum)2', 'Fluxes 318',
'Fluxes 330', 'Fluxes 390', 'Fluxes 440']
I wanted to search all the names that did not contain NO2_UV.
If I do
header_hcho = fnmatch.filter(col_names, '[!NO2_UV.]*')
it excludes the second element, "Name", because it starts with N. The result is the same as if I do
header_hcho = fnmatch.filter(col_names, '[!N]*')
So I went with a rather old-school method:
header_hcho = []
for idx in range(len(col_names)):
    if col_names[idx].find("NO2_UV") == -1:
        header_hcho.append(col_names[idx])
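Since the real condition is "does not contain the substring NO2_UV", a plain `in` test does it without fnmatch, and the loop above collapses to a list comprehension:

```python
# A shortened sample of col_names from the question.
col_names = ['# Spec No', 'Name', 'NO2_UV.RMS', 'NO2_UV.SlCol(bro)',
             'HCHO.RMS', 'Fluxes 318']

# Keep only the names that do not contain the substring "NO2_UV".
header_hcho = [name for name in col_names if "NO2_UV" not in name]
print(header_hcho)  # ['# Spec No', 'Name', 'HCHO.RMS', 'Fluxes 318']
```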

deobfuscate ip addresses in python dictionary

I need to parse a file that contains flat text and extract both valid ip addresses and obfuscated ip addresses.
(i.e. 192.168.1[.]1 or 192.168.1(.)1 or 192.168.1[dot]1 or 192.168.1(dot)1 or 192 . 168 . 1 . 1)
Once the data is extracted I need to convert them all to a valid format and remove duplicates.
My current code places the IP addresses into a string, though it should probably be a dict. I know I need to use some kind of recursion to set the key values, but I feel there is a more efficient and modular way to complete the task.
import json, ordereddict, re
# define the pattern of valid and obfuscated ips
pattern = r"((([01]?[0-9]?[0-9]|2[0-4][0-9]|25[0-5])[ (\[]?(\.|dot)[ )\]]?){3}([01]?[0-9]?[0-9]|2[0-4][0-9]|25[0-5]))"
# open data file that contains ip addresses and other text
with open("sample.txt", "r") as myfile:
    text = myfile.read().replace('\n', '')
# put non-normalized ip addresses in a dictionary
ips = {"data": [{"key1": match[0] for match in re.findall(pattern, text)}]}
# normalize the ip addresses
for name, datalist in ips.iteritems():
    for datadict in datalist:
        for key, value in datadict.items():
            if value in ("(dot)", "[dot]", " . ", " .", ". "):
                datadict[key] = "."
# write valid ip addresses to a json file
with open('test.json', 'w') as outfile:
    json.dump(ips, outfile)
Sample data file
These are valid ip addresses 192.168.1.1, 8.8.8.8
These are obfuscated 192.168.2[.]1 or 192.168.3(.)1 or 192.168.1[dot]1
192.168.1[dot]1 or 192.168.1(dot)1 or 192 .168 .1 .1 or 192. 168. 1. 1. or 192 . 168 . 1 . 1
This is what an invalid ip address looks like, they should be excluded 256.1.1.1 or 500.1.500.1 or 192.168.4.0
Expected result
192.168.1.1, 192.168.2.1, 192.168.3.1 , 8.8.8.8
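One sketch of the whole pipeline: normalize the obfuscated separators with `re.sub` first, then extract and deduplicate the valid addresses. The separator pattern below is an assumption covering only the variants shown in the samples above, and the sample text is inlined so the snippet runs standalone:

```python
import re

text = ("These are valid ip addresses 192.168.1.1, 8.8.8.8 "
        "These are obfuscated 192.168.2[.]1 or 192.168.3(.)1 or 192.168.1[dot]1 "
        "or 192 . 168 . 1 . 1 "
        "they should be excluded 256.1.1.1 or 500.1.500.1")

# Collapse every obfuscated separator variant ([.], (.), [dot], (dot), " . ")
# down to a plain dot.
normalized = re.sub(r'\s*[\[(]?\s*(?:\.|dot)\s*[\])]?\s*', '.', text)

# Extract only dotted quads whose octets are in range 0-255.
octet = r'(?:25[0-5]|2[0-4][0-9]|[01]?[0-9]?[0-9])'
candidates = re.findall(r'\b(?:%s\.){3}%s\b' % (octet, octet), normalized)

# Deduplicate while preserving first-seen order.
seen = []
for ip in candidates:
    if ip not in seen:
        seen.append(ip)
print(seen)  # ['192.168.1.1', '8.8.8.8', '192.168.2.1', '192.168.3.1']
```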

Remove first and last IP from netaddr result

I am writing a script to print all IPs in CIDR notation, but I do not want to print the first and last IPs, as they are not usable.
from netaddr import IPNetwork
ipc = raw_input('Enter The IP Range ')
n = 0
for ip in IPNetwork(ipc):
    n = n + 1
    print '%s' % ip
print 'Total No of IPs are ' + str(n)
This means that if I give 12.110.34.224/27 I should get 30 IPs as the result, removing the first and last IPs, since /27 means 32 IPs.
That should do it:
for ip in list(IPNetwork(ipc))[1:-1]:
    print '%s' % ip
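On Python 3, the standard-library `ipaddress` module does this directly, with no slicing needed: `hosts()` already yields only the usable addresses, excluding the network and broadcast addresses:

```python
import ipaddress

net = ipaddress.ip_network('12.110.34.224/27')

# hosts() skips the network address (.224) and the broadcast address (.255).
hosts = list(net.hosts())
print(len(hosts))           # 30
print(hosts[0], hosts[-1])  # 12.110.34.225 12.110.34.254
```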

How to split tokens, count number of tokens, and write in a file in python?

I have file which has data in lines as follows:
['Marilyn Manson', 'Web', 'Skydera Inc.', 'Stone Sour', 'The Smashing Pumpkins', 'Warner Bros. Entertainment', 'This is a good Beer']
['Voices Inside', 'Expressivista', 'The Kentucky Fried Movie', 'The Bridges of Madison County']
and so on. I want to re-write the data into a file whose lines keep only the tokens with fewer than 3 (or some other number of) words, e.g.:
['Marilyn Manson', 'Web', 'Skydera Inc.', 'Stone Sour']
['Voices Inside', 'Expressivista']
this is what I have tried so far:
for line in open(file):
    line = line.strip()
    line = line.rstrip()
    prog = re.compile("([a-z0-9]){32}")
    if line:
        line = line.replace('"', '')
        line = line.split(",")
        if re.match(prog, line[0]) and len(line) > 2:
            wo = []
            for words in line:
                word = words.split()
                if len(word) < 3:
                    print word.append(word)
But the output says None. Any thoughts on where I am making a mistake?
A better way to do what you're doing is to use ast.literal_eval, which automagically converts string representations of Python objects (e.g. lists) into actual Python objects.
import ast
# raw data
data = """
['Marilyn Manson', 'Web', 'Skydera Inc.', 'Stone Sour', 'The Smashing Pumpkins', 'Warner Bros. Entertainment','This is a good Beer']
['Voices Inside', 'Expressivista', 'The Kentucky Fried Movie', 'The Bridges of Madison County']
"""
# set threshold number of tokens
threshold = 3
# split into lines
lines = data.split('\n')
# parse non-blank lines into python lists
lists = [ast.literal_eval(line) for line in lines if line]
# for each list, keep only those tokens with less than `threshold` tokens
result = [[item for item in lst if len(item.split()) < threshold]
          for lst in lists]
# show result
for line in result:
    print(line)
Result:
['Marilyn Manson', 'Web', 'Skydera Inc.', 'Stone Sour']
['Voices Inside', 'Expressivista']
I think the reason your code isn't working is that you're trying to match line[0] against your regex prog, but line[0] isn't 32 characters long for either of your lines, so the regex never matches and the inner loop is never reached.
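There is also a second reason the output is None: list.append mutates the list in place and returns None, so `print word.append(word)` prints None even when that branch is reached. A minimal demonstration:

```python
word = []
result = word.append('Stone Sour')  # append mutates in place and returns None
print(result)  # None
print(word)    # ['Stone Sour']
```

Collect the kept tokens in a list first, then print the list after the loop.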