I need to parse a file that contains flat text and extract both valid IP addresses and obfuscated IP addresses
(i.e. 192.168.1[.]1 or 192.168.1(.)1 or 192.168.1[dot]1 or 192.168.1(dot)1 or 192 . 168 . 1 . 1).
Once the data is extracted I need to convert them all to valid format and remove duplicates.
My current code places the IP addresses into a string, which should be a dict? I know I need to use some kind of recursion to set the key values, but I feel there is a more efficient and modular way to complete the task.
import json, re
# define the pattern of valid and obfuscated ips
pattern = r"((([01]?[0-9]?[0-9]|2[0-4][0-9]|25[0-5])[ (\[]?(\.|dot)[ )\]]?){3}([01]?[0-9]?[0-9]|2[0-4][0-9]|25[0-5]))"
# open data file that contains ip addresses and other text
with open("sample.txt", "r") as myfile:
    text = myfile.read().replace('\n', '')
# put non normalized ip addresses in a dictionary
ips = {"data": [{"key1": match[0] for match in re.findall(pattern, text) }]}
# normalized ip addresses
for name, datalist in ips.iteritems():
    for datadict in datalist:
        for key, value in datadict.items():
            if value == "(dot)":
                datadict[key] = "."
            if value == "[dot]":
                datadict[key] = "."
            if value == " . ":
                datadict[key] = "."
            if value == " .":
                datadict[key] = "."
            if value == ". ":
                datadict[key] = "."
# write valid ip address to json file
with open('test.json', 'w') as outfile:
    json.dump(ips, outfile)
Sample data file
These are valid ip addresses 192.168.1.1, 8.8.8.8
These are obfuscated 192.168.2[.]1 or 192.168.3(.)1 or 192.168.1[dot]1
192.168.1[dot]1 or 192.168.1(dot)1 or 192 .168 .1 .1 or 192. 168. 1. 1. or 192 . 168 . 1 . 1
This is what an invalid ip address looks like, they should be excluded 256.1.1.1 or 500.1.500.1 or 192.168.4.0
Expected result
192.168.1.1, 192.168.2.1, 192.168.3.1 , 8.8.8.8
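One way to sketch the whole pipeline without the nested dict (this is a sketch, not the question's code; the octet range and separator shapes are taken from the question, and the sample text is inlined for illustration) is to collapse every obfuscated separator with re.sub and deduplicate with a set:

```python
import json
import re

# One octet, 0-255 (same value range the question's pattern allows)
octet = r"(?:25[0-5]|2[0-4][0-9]|[01]?[0-9]?[0-9])"
# Separator: ".", "[.]", "(.)", "[dot]", "(dot)", optionally padded with spaces
sep = r"\s*[\[\(]?\s*(?:\.|dot)\s*[\]\)]?\s*"
pattern = r"\b{o}(?:{s}{o}){{3}}\b".format(o=octet, s=sep)

def normalize(ip):
    # Collapse every obfuscated separator back to a plain "."
    return re.sub(sep, ".", ip)

text = ("These are valid ip addresses 192.168.1.1, 8.8.8.8\n"
        "These are obfuscated 192.168.2[.]1 or 192.168.3(.)1 or 192.168.1[dot]1\n"
        "192.168.1(dot)1 or 192 . 168 . 1 . 1\n"
        "invalid, should be excluded: 256.1.1.1 or 500.1.500.1")

# A set removes duplicates after normalization
ips = sorted({normalize(m) for m in re.findall(pattern, text)})
print(ips)  # ['192.168.1.1', '192.168.2.1', '192.168.3.1', '8.8.8.8']

with open("test.json", "w") as outfile:
    json.dump({"data": ips}, outfile)
```

Note this keeps anything that is syntactically a valid address, so 192.168.4.0 would survive; excluding network/broadcast-style addresses would need an extra filtering step.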
Related
I have 2 files: file1.txt, which has hundreds of IP addresses line by line, and a second file (file2.txt) with an entry ip_address which needs to be replaced by the actual IP addresses from file1. How do I do this in Python?
Your help is much appreciated
Eg:
less File1.txt
10.10.10.1
10.10.20.1
10.20.10.10 etc
less File2.txt
[/tmp/test/ip_address]
whitelist = *
I am looking for my output to be like this:
[/tmp/test/10.10.10.1]
whitelist = *
[/tmp/test/10.10.20.1]
whitelist = *
[/tmp/test/10.20.10.10]
whitelist = *
etc.
Using a simple iteration.
Ex:
with open("File1.txt") as infile, open("File2.txt", "w") as outfile:
    for line in infile:  # iterate over each IP line
        outfile.write("[/tmp/test/{}]\nwhitelist = *\n\n".format(line.strip()))  # write one section per IP
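If File2.txt is meant as a reusable template rather than hard-coding the section text, a sketch of the substitution (assuming the placeholder is the literal string ip_address; the expand_template helper name is illustrative, not from the question) could be:

```python
def expand_template(template, ips):
    # Repeat the template once per IP, substituting the placeholder text
    sections = [template.replace("ip_address", ip).strip() for ip in ips]
    return "\n\n".join(sections) + "\n"

template = "[/tmp/test/ip_address]\nwhitelist = *\n"
ips = ["10.10.10.1", "10.10.20.1", "10.20.10.10"]
print(expand_template(template, ips))
```

In practice, template would come from reading File2.txt and ips from the stripped lines of File1.txt.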
import re
data = []
tcp_dump = "17:18:38.877517 IP 192.168.0.15.43471 > 23.195.155.202.443: Flags [.], ack 1623866279, win 245, options [nop,nop,TS val 43001536 ecr 287517202], length 0"
regex = r'(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})|(^length (\d+))'
data_ready = re.findall(regex, tcp_dump)
print(data_ready)
data.append(data_ready)
print(data)
This code currently needs to grab 2 IPv4 addresses and the length of a packet and put them into a 2-D list. So far the first half of my regex does just that with the IPv4 addresses; my problem comes down to grabbing the length. I get the output:
[('192.168.0.15', '', ''), ('23.195.155.202', '', '')]
instead of the desired output of:
['192.168.0.15', '23.195.155.202', '0']
Any way to fix the regex?
EDIT
It turns out the regex works when separated (just the first half or just the second half); I can't seem to get them to work combined.
This should do it. You just need to make some of your parentheses non-capturing and do some data cleaning:
import re
data = []
tcp_dump = "17:18:38.877517 IP 192.168.0.15.43471 > 23.195.155.202.443: Flags [.], ack 1623866279, win 245, options [nop,nop,TS val 43001536 ecr 287517202], length 0"
regex = r'(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})|(?:length (\d+))'
# make the returned tuples into two lists, one containing the IPs and the
# other containing the lengths. Finally, filter out empty strings.
data_ready,lengths = zip(*re.findall(regex, tcp_dump))
list_data = [ip for ip in list(data_ready) + list(lengths) if ip != '']
print(list_data)
data.append(list_data)
print(data)
output:
['192.168.0.15', '23.195.155.202', '0']
I wouldn't call it IP address matching (as 192.168.0.15.43471 is not a valid IP address) but text parsing/processing. An optimized solution with the re.search() function:
import re
tcp_dump = "17:18:38.877517 IP 192.168.0.15.43471 > 23.195.155.202.443: Flags [.], ack 1623866279, win 245, options [nop,nop,TS val 43001536 ecr 287517202], length 0"
result = re.search(r'((?:\d{1,3}\.){3}\d{1,3})(?:\.\d+) > ((?:\d{1,3}\.){3}\d{1,3})(?:\.\d+).*length (\d+)$', tcp_dump)  # anchoring on "length" keeps multi-digit lengths intact
result = list(result.groups())
print(result)
The output:
['192.168.0.15', '23.195.155.202', '0']
I am writing a script to print all IPs in CIDR notation, but I do not want to print the first and last IPs as they are not usable.
from netaddr import IPNetwork
ipc = raw_input('Enter The IP Range ')
n = 0
for ip in IPNetwork(ipc):
    n = n + 1
    print '%s' % ip
print 'Total No of IPs are ' + str(n)
This means that if I give 12.110.34.224/27 I should get 30 IPs as the result, removing the first and last IPs, since /27 means 32 IPs.
That should do it.
for ip in list(IPNetwork(ipc))[1:-1]:
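As a side note, the standard library's ipaddress module (Python 3) does the same exclusion directly: hosts() skips the network and broadcast addresses, so no slicing is needed. A minimal sketch:

```python
import ipaddress

net = ipaddress.ip_network("12.110.34.224/27")
hosts = list(net.hosts())  # usable addresses only: .225 through .254
for ip in hosts:
    print(ip)
print('Total No of IPs are ' + str(len(hosts)))  # 30 for a /27
```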
Hi, I am a newbie to Python. I am creating a small program which parses the load log file of a particular website and stores the valid data in particular fields of a database, but some of the fields have weird characters like '推广频道' (non-ASCII character '\xe6').
i also tried to apply the solution of this question.
Inserting unicode into sqlite?
but this codec is unable to decode, giving: UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 84: character maps to <undefined>
import codecs
import re
import sqlite3

def db_log_entry(filename):
    conn = sqlite3.connect('log.db')
    c = conn.cursor()
    """ creating log table
    Table structure:
    Name           Type
    id             integer
    identifier     text
    url            text
    user_ip        text
    datetime       text
    timezone_mod   text
    query_type     text
    status         text
    opt_id         text
    user_agent     text
    opt_ip         text
    opt_ip_port    text
    """
    c.execute(''' CREATE TABLE log
        (id integer, identifier text, url text, user_ip text, datetime text, timezone_mod text, query_type text, query_url text, status text,
        opt_id text, user_agent text, opt_ip text, opt_ip_port text)''')
    f = codecs.open(filename, encoding='cp1252')  # opening file to read data
    loop_count = 0  # keeps a record of successful loop iterations
    id_count = 0  # value entered in the id field of the log table
    # Read each line of the log file, parse the valid data
    # and make an entry in the database
    for log_line in f:
        loop_count = loop_count + 1
        id_count = id_count + 1
        db_list = []
        # split log_line into two parts txt1, txt2 for comparison with reg1, reg2
        (txt1, txt2) = log_line.split('HTTP/')
        reg1 = r'(\d{6}_\d{5})\s([\w.-]+)\s([\d.]+)\s-\s-\s\[(\d{2}/\w{3}/\d{4}:\d{2}:\d{2}:\d{2})\s([+-]?\d+)\]\s"(\w+)\s([\w.\-\/]+)'
        reg2 = r'[1.1"]?[1.0"]?\s(\d*)\s(\d*)([\s\"\-]*)([\w\.\%\/\s\;\(\:\+\)\?\S\"]+)"\s(\d{2,3}.\d{2,3}.\d{2,3}.\d{2,3})\:(\d+)'
        print 'starting loop ', loop_count
        match = re.search(reg1, txt1)
        if match:  # if a regex match is found in txt1, append the data to db_list
            db_list.append(id_count)
            for group_no in range(1, 8):
                db_list.append(match.group(group_no))
                print match.group(group_no)
            print 'yes 1'
        else:
            print 'match not found in reg1'
        match = re.search(reg2, txt2)
        if match:  # if a regex match is found in txt2, append the data to db_list
            for group_no in (1, 2, 4, 5, 6):  # group 3 is filler and is skipped
                db_list.append(match.group(group_no))
                print match.group(group_no)
            print 'yes 2'
        else:
            print 'match not found in reg2'
        print 'inserting value of', loop_count
        c.execute('INSERT INTO log VALUES (?,?,?,?,?,?,?,?,?,?,?,?,?)', db_list)
        conn.commit()
        print 'success...'
    conn.commit()

def main():
    db_log_entry('access_log.txt')

if __name__ == '__main__':
    main()
That's because you're using the wrong character encoding to open the file.
You should be opening it with UTF-8, like this:
f = codecs.open(filename, encoding='utf-8')
This is because CP-1252 is an encoding for the Latin alphabet and as such doesn't cover Chinese characters like those above.
As I'm not sure what the original encoding (or language) is, UTF-8 is also a safe bet, as it supports all languages.
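A quick illustration of why (a small sketch in Python 3 syntax, using one of the characters from the question): the UTF-8 bytes for '道' contain 0x81, which is one of the bytes CP-1252 leaves undefined, and that is exactly the byte the traceback complains about:

```python
raw = "推广频道".encode("utf-8")  # the bytes as they sit in the log file
print(raw.decode("utf-8"))  # decodes cleanly as UTF-8

try:
    raw.decode("cp1252")
except UnicodeDecodeError as exc:
    # cp1252 has no character assigned to byte 0x81 (part of '道')
    print(exc)
```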
I'm trying to create a WiFi Log Scanner. Currently we go through logs manually using CTRL+F and our keywords. I just want to automate that process. i.e. bang in a .txt file and receive an output.
I've got the bones of the code and can work on making it pretty later, but I'm running into a small issue. I want the scanner to search the file (done), count instances of that string (done) and output the number of occurrences (done), followed by the full line where that string occurred last, including the line number (the line number is not essential; it just makes it easier to guesstimate which is the more recent issue if there are multiple).
Currently I'm getting an output of every line with the string in it. I know why this is happening; I just can't think of a way to specify outputting only the last line.
Here is my code:
import os
from Tkinter import Tk
from tkFileDialog import askopenfilename
def file_len(filename):
    # Count Number of Lines in File and Output Result
    with open(filename) as f:
        for i, l in enumerate(f):
            pass
    print('There are ' + str(i + 1) + ' lines in ' + os.path.basename(filename))

def file_scan(filename):
    # All Issues to Scan will go here
    print("DHCP was found " + str(filename.count('No lease, failing')) + " time(s).")
    for line in filename:
        if 'No lease, failing' in line:
            print line.strip()
    DNS = (filename.count('Host name lookup failure:res_nquery failed') + filename.count('HTTP query failed')) / 2
    print("DNS Failure was found " + str(DNS) + " time(s).")
    for line in filename:
        if 'Host name lookup failure:res_nquery failed' in line or 'HTTP query failed' in line:
            print line.strip()
    print("PSK= was found " + str(filename.count('psk=')) + " time(s).")
    for line in filename:
        if 'psk=' in line:
            print 'The length(s) of the PSK used is ' + str(line.count('*'))
Tk().withdraw()
filename=askopenfilename()
abspath = os.path.abspath(filename) #So that doesn't matter if File in Python Dir
dname = os.path.dirname(abspath) #So that doesn't matter if File in Python Dir
os.chdir(dname) #So that doesn't matter if File in Python Dir
print ('Report for ' + os.path.basename(filename))
file_len(filename)
file_scan(filename)
That's pretty much going to be my working code (I just have to add a few more issue searches). I have a version that searches a string instead of a text file here. This outputs the following:
Total Number of Lines: 38
DHCP was found 2 time(s).
dhcp
dhcp
PSK= was found 2 time(s).
The length(s) of the PSK used is 14
The length(s) of the PSK used is 8
I only have general stuff there, modified for it being a string rather than txt file, but the string I'm scanning from will be what's in the txt files.
Don't worry too much about PSK, I want all examples of that listed, I'll see If I can tidy them up into one line at a later stage.
As a side note, a lot of this is jumbled together from doing previous searches, so I have a good idea that there are probably neater ways of doing this. This is not my current concern, but if you do have a suggestion on this side of things, please provide an explanation/link to explanation as to why your way is better. I'm fairly new to python, so I'm mainly dealing with stuff I currently understand. :)
Thanks in advance for any help, if you need any further info, please let me know.
Joe
To search for and count string occurrences, I solved it in the following way:
'''---------------------Function--------------------'''
# Counting occurrences of "string" in a file
def count_string_occurrence():
    string = "test"
    f = open("result_file.txt")
    contents = f.read()
    f.close()
    # we are searching for `string` in the file "result_file.txt"
    print "Number of '" + string + "' in file", contents.count(string)
I can't comment on questions yet, but I think I could answer more specifically with some more information: which line do you want only one of?
For example, you can do something like:
search_str = 'find me'
count = 0
for line in file:
    if search_str in line:
        last_line = line
        count += 1
print '{0} occurrences of this line:\n{1}'.format(count, last_line)
I notice that in file_scan you are iterating twice through file. You can surely condense it into one iteration :).
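Putting those two points together, a single pass can count the matches and remember only the last matching line with its line number (a sketch; the scan_last name and the sample log lines are illustrative, not from the original script):

```python
def scan_last(lines, needle):
    # One pass: count matches, keep the last matching line and its number
    count, last_no, last_line = 0, None, None
    for line_no, line in enumerate(lines, start=1):
        if needle in line:
            count += 1
            last_no, last_line = line_no, line.rstrip()
    return count, last_no, last_line

log = ["ok", "No lease, failing on wlan0", "ok", "No lease, failing on wlan1"]
count, line_no, line = scan_last(log, "No lease, failing")
print("{0} occurrence(s); last at line {1}: {2}".format(count, line_no, line))
```

The same function works unchanged on an open file object, since iterating a file yields lines.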