Regex patterns to extract information from event logs - regex

I am working on event log data. As we all know, it's unstructured data, and I need to extract important pieces of information from these logs for better visualization. The data is tab-separated. I have created a data frame from these event logs, together with the expected output. The column Event_Message is the raw event log message, and the columns CtrJb_ID, Prcs_ID, LotID and Wafer_ID are the ones I would like to extract from these logs. If the condition is not met, the row should be None or empty. For example, if the lot ID exists in an event, extract the lot ID; if not, None.
data = {'Timestamp':['2009/8/22 08:02:29.862', '2009/8/22 08:02:30.706','2008/08/22 08:02:33.207','2008/08/22 08:02:37.551'],
'Event_Message':["2009/8/22 08:02:29.862 2009/8/22 08:02:29.862 123456 ControlJobStateTransition1 CWControlJobManager 'ControlJob named XYZ12345-20090822-0005 was created and is in the QUEUED state.' [] ['EventVariable ControlJobID 0 true XYZ12345-20090822-0005'' ControlJobID' 'EventVariable DataCollectionPlan 0 true ",
"2009/8/22 08:02:30.706 2009/8/22 08:02:30.315 123456 PRJobStateChange XYZ12ProcessJobManager 'Process Job 200908221102-2R34567.000-01 has changed state to PRJOBACTIVE/SETUP.' [] ['EventVariable ProcessJobID 0 true ''200908221102-2R34567.000-01'' ProcessJobID' 'EventVariable ProcessJobState 0 true ''1''",
"2008/08/22 08:02:33.207 2008/08/22 08:02:33.175 123456789 DAExtendPerResourceDAWaferCenterOffsetB TransferChamberSlotValvePM4 'DAPerResource EXTEND' [] ['StatusVariable Source 0 true ''TransferChamber-EndEffector2'' Source' 'StatusVariable Destination 0 true ''PM4'' Destination' 'StatusVariable WaferID 0 true ''1A234568ABC2'' WaferID' 'StatusVariable LotID 0 true ''200908221036-2R34567.000-01'' LotID'",
"2008/08/22 08:02:37.551 2008/08/22 08:02:37.404 12345678 RecipeStarted PM4 'Started processing recipe AB0-Z-65XYZ-ABCDE12XYZ1-2R34567000 on material 1A234568ABC2. ' [] ['StatusVariable RecipeName 0 true ''PM4-P-14LPP-PEBNS31JFA1-8R91721000'' RecipeName' 'StatusVariable MaterialID 0 true ''1A234568ABC2'' MaterialID' 'StatusVariable JobID 0 true ''201910021036-2R34567.000-01'' JobID' 'EventVariable WacID 0 true '''' WacID' 'StatusVariable LotID 0 true ''2R34567.000'' LotID' 'StatusVariable SlotID 0 true ''11"],
'CtrJb_ID': ['XYZ12345-20090822-0005', None, None, None],
'Prcs_ID': [None, '200908221102-2R34567.000', None, None],
'LotID': [None, None, '200908221036-2R34567.000-01', '2R34567.000'],
'Wafer_ID': [None, None, '1A234568ABC2', None ]}
df = pd.DataFrame(data)
I have read this event log message line by line and then tried to extract the fields using regex patterns, but have not been successful. Below is the code that I have tried so far.
import pandas as pd
import re

f = open("C:\ABCD\XYZ\egfh_ijk_lmn\log2009082212.txt")
lines = f.readlines()
for line in lines:
    print(line)

lot = re.compile(r'LotID\s+\d\s+\w+\s+(.*)\s+LotID')
for line in lines:
    if lot.search(str(line)):
        print(lot)
    else:
        print(None)
Output:
None
None
None
None
None
re.compile('LotID\\s+\\d\\s+\\w+\\s+(.*)\\s+LotID')
re.compile('LotID\\s+\\d\\s+\\w+\\s+(.*)\\s+LotID')
None
None
None
None
None
re.compile('LotID\\s+\\d\\s+\\w+\\s+(.*)\\s+LotID')
None
None
re.compile('LotID\\s+\\d\\s+\\w+\\s+(.*)\\s+LotID')

Since you ask for a working snippet:
import re
test = "2008/08/22 08:02:33.207 2008/08/22 08:02:33.175 123456789 DAExtendPerResourceDAWaferCenterOffsetB TransferChamberSlotValvePM4 'DAPerResource EXTEND' [] ['StatusVariable Source 0 true ''TransferChamber-EndEffector2'' Source' 'StatusVariable Destination 0 true ''PM4'' Destination' 'StatusVariable WaferID 0 true ''1A234568ABC2'' WaferID' 'StatusVariable LotID 0 true ''200908221036-2R34567.000-01'' LotID'"
lot = re.compile(r"LotID[^']+''([\d\-.A-Z]*)''[^']+LotID")
match = lot.search(test)
if match:
    print(match.group(1))
else:
    print("None")
Output:
200908221036-2R34567.000-01
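As a follow-up on why the original loop printed re.compile(...) lines: it prints the compiled pattern object (lot) rather than the match. A minimal sketch of the same loop printing the captured group instead; the two lines below are illustrative stand-ins, not your real log:

import re

lines = [
    "'StatusVariable LotID 0 true ''200908221036-2R34567.000-01'' LotID'",
    "'EventVariable DataCollectionPlan 0 true ",
]
lot = re.compile(r"LotID[^']+''([\d\-.A-Z]*)''[^']+LotID")
for line in lines:
    match = lot.search(line)
    # print the captured Lot ID when the pattern matches, otherwise None
    print(match.group(1) if match else None)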

There's no need to use look-ahead and look-behind for this sort of task. Not using them allows you to use quantifiers. In your version, \w... will not match 'true'. Also, your sample contains spaces instead of tab characters; I don't know whether that is the formatting on this site or your actual data.
This will match your sample:
LotID\s+\d\s+\w+\s+(.*)\s+LotID
Though if you know what characters you expect in the Lot ID it might be better expressed as
LotID[^']+(['\d\-R.]*)[^']+LotID
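To tie this back to the DataFrame in the question: one possible way to fill all four columns (CtrJb_ID, Prcs_ID, LotID, Wafer_ID) is pandas' Series.str.extract, which returns NaN for rows where a pattern does not match, which is essentially the None/empty behaviour you asked for. The patterns below are only modelled on the four sample messages, so treat them as an assumption about the full log format; note also that your expected Prcs_ID drops the trailing "-01", so that column may still need a trimming step:

import pandas as pd

# df is the DataFrame built from the data dict in the question
patterns = {
    'CtrJb_ID': r"ControlJobID\s+\d\s+\w+\s+'*([\w.-]+)'*\s+ControlJobID",
    'Prcs_ID':  r"ProcessJobID\s+\d\s+\w+\s+''([\w.-]+)''\s+ProcessJobID",
    'LotID':    r"LotID\s+\d\s+\w+\s+''([\w.-]+)''\s+LotID",
    'Wafer_ID': r"WaferID\s+\d\s+\w+\s+''([\w.-]+)''\s+WaferID",
}

# str.extract returns the first capture group, or NaN where the pattern does not match
extracted = pd.DataFrame({
    col: df['Event_Message'].str.extract(pat, expand=False)
    for col, pat in patterns.items()
})
print(extracted)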

Related

Importing Excel data to python with openpyxl

With openpyxl and
if sheet.cell(j, i).value
I am testing whether a certain cell in an Excel sheet has a value.
The problem: if there is a "0" in that cell, the request gives the output "no".
Question: how do I change it so that my request gives "Yes" as output if there is a "0" in that cell?
If you use only if sheet.cell(j, i).value as the logical check, it will give False for both 0 and blank. So you need to use if sheet.cell(j, i).value is None, in which case it will return True only when the cell is blank. A working example:
import openpyxl
wb = openpyxl.load_workbook('Sample.xlsx')
sheet = wb['Sheet2']
j = 10
for i in range(1, 9):
    print("Data in cell is : ", sheet.cell(j, i).value)
    if sheet.cell(j, i).value is None:
        print("This is blank")
    else:
        print("This is not blank")
Output:
Data in cell is : 1
This is not blank
Data in cell is : 2
This is not blank
Data in cell is : 3
This is not blank
Data in cell is : None
This is blank
Data in cell is : 4
This is not blank
Data in cell is : 5
This is not blank
Data in cell is : 0
This is not blank
Data in cell is : 8
This is not blank
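The underlying point is that both 0 and None are falsy in Python, so a plain truthiness test cannot tell an empty cell from a cell containing 0, while the identity check is None only flags the blank cell. A tiny illustration, independent of openpyxl:

for value in (0, None, 5):
    # bool(value) is the truthiness test; `value is None` only flags blanks
    print(value, bool(value), value is None)

This prints False for both 0 and None under bool(), but True only for None under the is None check.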

Python: Match a special character with a regular expression

Hi everyone, I'm using the re.match function to extract pieces of a string within each row of a file.
My code is as follows:
## fp_tmp => pointer of file
for x in fp_tmp:
    try:
        cpuOverall=re.match(r"(Overall CPU load average)\s+(\S+)(%)",x)
        cpuUsed=re.match(r"(Total)\s+(\d+)(%)",x)
        ramUsed=re.match(r"(RAM Utilization)\s+(\d+\%)",x)
        ####Not Work####
        if cpuUsed is not None: cpuused_new=cpuUsed.group(2)
        if ramUsed is not None: ramused_new=ramUsed.group(2)
        if cpuOverall is not None: cpuoverall_new=cpuOverall.group(2)
    except:
        searchbox_result = None
Each field is extracted from the following corresponding line:
ramUsed => RAM Utilization 2%
cpuUsed => Total 4%
cpuOverall => Overall CPU load average 12%
ramUsed, cpuUsed, cpuOverall are the variables where I want to write the results!!
The actual lines are:
(space undefined) RAM Utilization 2%
(space undefined) Total 4%
(space undefined) Overall CPU load average 12%
When I execute the script, all the variables return the value None.
With other variables the script works correctly.
Why does the code not work in this case? I use Python 3.
I think the problem is the % character, which is not being read.
Do you have any suggestions?
PROBLEM 2:
## fp_tmp => pointer of file
for x in fp_tmp:
    try:
        emailReceived=re.match(r".*(Messages Received)\s+\S+\s+\S+\s+(\S+)",x)
        ####Not Work####
        if emailReceived is not None: emailreceived_new=emailReceived.group(2)
    except:
        searchbox_result = None
The field is extracted from the following corresponding lines, which occur twice in the file:
[....]
Counters: Reset Uptime Lifetime
Receiving
Messages Received 3,406 1,558 3,406
[....]
Rates (Events Per Hour): 1-Minute 5-Minutes 15-Minutes
Receiving
Messages Received 0 0 0
Recipients Received 0 0 0
[....]
I want to extract only the second occurrence, that is:
Rates (Events Per Hour): 1-Minute 5-Minutes 15-Minutes
Receiving
Messages Received 0 0 0 <-this
Do you have any suggestions?
cpuOverall line: you forgot that there is more information at the start of the line. Change to
'.*(Overall CPU load average)\s+(\S+%)'
cpuUsed line: you forgot that there is more information at the start of the line. Change to
'.*(Total)\s+(\d+%)'
ramUsed line: you forgot that there is more information at the start of the line... Change to
'.*(RAM Utilization)\s+(\d+%)'
Remember that re.match looks for an exact match from the start:
If zero or more characters at the beginning of string match the regular expression pattern, return a corresponding match object. [..]
With these changes, your three variables are set to the percentages:
>>> print (cpuused_new,ramused_new,cpuoverall_new)
4% 2% 12%
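To make the anchoring behaviour concrete, here is a small self-contained check of the adjusted patterns against lines shaped like the ones in the question (the amount of leading whitespace is assumed, since the question only says "space undefined"):

import re

samples = [
    "   RAM Utilization 2%",
    "   Total 4%",
    "   Overall CPU load average 12%",
]
# re.match anchors at the start of the string, so the leading .* is what
# lets the pattern skip whatever comes before the label on each line.
print(re.match(r".*(RAM Utilization)\s+(\d+%)", samples[0]).group(2))           # 2%
print(re.match(r".*(Total)\s+(\d+%)", samples[1]).group(2))                     # 4%
print(re.match(r".*(Overall CPU load average)\s+(\S+%)", samples[2]).group(2))  # 12%

re.search would work here as well, since it scans the whole string instead of only matching from the start.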

How to get PyPDF2 to extract text from multiple sequential pages - in range?

I'm trying to get PyPDF2 to extract specific text throughout a document per the code below. It is pulling exactly what I need and eliminating the duplicates, but it is not getting me a list from each page; it seems to only be showing me the text from the last page. What am I doing wrong?
#import PyPDF2 and set extracted text as the page_content variable
import PyPDF2
pdf_file = open('enme2.pdf','rb')
read_pdf = PyPDF2.PdfFileReader(pdf_file)
number_of_pages = read_pdf.getNumPages()
#for loop to get number of pages and extract text from each page
for page_number in range(number_of_pages):
    page = read_pdf.getPage(page_number)
    page_content = page.extractText()

#initialize the user_input variable
user_input = ""

#function to get the AFE numbers from the pdf document
def get_afenumbers(Y):
    #initialize the afe and afelist variables
    afe = "A"
    afelist = ""
    x = ""
    #while loop to get only 6 digits after the "A"
    while True:
        if user_input.upper().startswith("Y") == True:
            #Return a list of AFE's
            import re
            afe = re.findall('[A][0-9]{6}', page_content)
            set(afe)
            print(set(afe))
            break
        else:
            afe = "No AFE numbers found..."
        if user_input.upper().startswith("N") == True:
            print("HAVE A GREAT DAY - GOODBYE!!!")
            break

#Build a while loop for initial question prompt (when Y or N is not True):
while user_input != "Y" and user_input != "N":
    user_input = input('List AFE numbers? Y or N: ').upper()
    if user_input not in ["Y","N"]:
        print('"',user_input,'"','is an invalid input')

get_afenumbers(user_input)
#FIGURE OUT HOW TO EXTRACT FROM ALL PAGES AND NOT JUST ONE
I'm quite new to this; I just learned about regex from a response to my question earlier today. Thanks for any help.
If you change it a little, it seems to work fine.
page_content = ""  # define the variable for use in the loop
for page_number in range(number_of_pages):
    page = read_pdf.getPage(page_number)
    page_content += page.extractText()  # concatenate the text from each page
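With page_content accumulated over all pages, the AFE extraction from the question then covers the whole document instead of only the last page. A small follow-up sketch of that final step, reusing page_content from the loop above (set() removes duplicates, sorted() is only for readable output):

import re

# page_content is the concatenated text built in the loop above
afe_numbers = sorted(set(re.findall('[A][0-9]{6}', page_content)))
print(afe_numbers)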

Dictvectorizer for list as one feature in Python Pandas and Scikit-learn

I have been trying to solve this for days, and although I have found a similar problem here How can i vectorize list using sklearn DictVectorizer, the solution is overly simplified.
I would like to fit some features into a logistic regression model to predict 'chinese' or 'non-chinese'. I have a raw_name from which I extract two features: 1) just the last name, and 2) a list of substrings of the last name; for example, 'Chan' will give ['ch', 'ha', 'an']. But it seems DictVectorizer doesn't take a list type as part of the dictionary. From the link above, I tried to create a function list_to_dict, which successfully returns some dict elements,
{'substring=co': True, 'substring=or': True, 'substring=rn': True, 'substring=ns': True}
but I have no idea how to incorporate that into the my_dict = ... before applying the DictVectorizer.
# coding=utf-8
import pandas as pd
from pandas import DataFrame, Series
import numpy as np
import nltk
import re
import random
from random import randint
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction import DictVectorizer
lr = LogisticRegression()
dv = DictVectorizer()
# Get csv file into data frame
data = pd.read_csv("V2-1_2000Records_Processed_SEP2015.csv", header=0, encoding="utf-8")
df = DataFrame(data)
# Pandas data frame shuffling
df_shuffled = df.iloc[np.random.permutation(len(df))]
df_shuffled.reset_index(drop=True)
# Assign X and y variables
X = df.raw_name.values
y = df.chineseScan.values
# Feature extraction functions
def feature_full_last_name(nameString):
    try:
        last_name = nameString.rsplit(None, 1)[-1]
        if len(last_name) > 1: # not accept name with only 1 character
            return last_name
        else: return None
    except: return None

def feature_twoLetters(nameString):
    placeHolder = []
    try:
        for i in range(0, len(nameString)):
            x = nameString[i:i+2]
            if len(x) == 2:
                placeHolder.append(x)
        return placeHolder
    except: return []

def list_to_dict(substring_list):
    try:
        substring_dict = {}
        for i in substring_list:
            substring_dict['substring='+str(i)] = True
        return substring_dict
    except: return None

list_example = ['co', 'or', 'rn', 'ns']
print list_to_dict(list_example)

# Transform format of X variables, and spit out a numpy array for all features
my_dict = [{'two-letter-substrings': feature_twoLetters(feature_full_last_name(i)),
            'last-name': feature_full_last_name(i), 'dummy': 1} for i in X]
print my_dict[3]
Output:
{'substring=co': True, 'substring=or': True, 'substring=rn': True, 'substring=ns': True}
{'dummy': 1, 'two-letter-substrings': [u'co', u'or', u'rn', u'ns'], 'last-name': u'corns'}
Sample data:
Raw_name chineseScan
Jack Anderson non-chinese
Po Lee chinese
If I have understood correctly, you want a way to encode list values in order to have a feature dictionary that DictVectorizer could use. (One year too late, but) something like this can be used, depending on the case:
my_dict_list = []
for i in X:
    # create a new feature dictionary
    feat_dict = {}
    # add the features that are straightforward
    feat_dict['last-name'] = feature_full_last_name(i)
    feat_dict['dummy'] = 1
    # for the features that have a list of values, iterate over the values and
    # create a custom feature for each value
    for two_letters in feature_twoLetters(feature_full_last_name(i)):
        # make sure the naming is unique enough so that no other feature
        # unrelated to this will have the same name/key
        feat_dict['two-letter-substrings-' + two_letters] = True
    # save it to the feature dictionary list that will be used in DictVectorizer
    my_dict_list.append(feat_dict)
print my_dict_list
from sklearn.feature_extraction import DictVectorizer
dict_vect = DictVectorizer(sparse=False)
transformed_x = dict_vect.fit_transform(my_dict_list)
print transformed_x
Output:
[{'dummy': 1, u'two-letter-substrings-er': True, 'last-name': u'Anderson', u'two-letter-substrings-on': True, u'two-letter-substrings-de': True, u'two-letter-substrings-An': True, u'two-letter-substrings-rs': True, u'two-letter-substrings-nd': True, u'two-letter-substrings-so': True}, {'dummy': 1, u'two-letter-substrings-ee': True, u'two-letter-substrings-Le': True, 'last-name': u'Lee'}]
[[ 1. 1. 0. 1. 0. 1. 0. 1. 1. 1. 1. 1.]
[ 1. 0. 1. 0. 1. 0. 1. 0. 0. 0. 0. 0.]]
Another thing you could do (but I don't recommend) if you don't want to create as many features as the values in your lists is something like this:
# sorting the values would be a good idea
feat_dict[frozenset(feature_twoLetters(feature_full_last_name(i)))] = True
# or
feat_dict[" ".join(feature_twoLetters(feature_full_last_name(i)))] = True
but the first one means that you can't have any duplicate values, and probably neither makes a good feature, especially if you need fine-tuned and detailed ones. Also, they reduce the possibility of two rows having the same combination of two-letter substrings, so the classification probably won't do well.
Output:
[{'dummy': 1, 'last-name': u'Anderson', frozenset([u'on', u'rs', u'de', u'nd', u'An', u'so', u'er']): True}, {'dummy': 1, 'last-name': u'Lee', frozenset([u'ee', u'Le']): True}]
[{'dummy': 1, 'last-name': u'Anderson', u'An nd de er rs so on': True}, {'dummy': 1, u'Le ee': True, 'last-name': u'Lee'}]
[[ 1. 0. 1. 1. 0.]
[ 0. 1. 1. 0. 1.]]
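Once the features are vectorized this way, they can go straight into the logistic regression from the question. A brief, hedged follow-up reusing lr and y from the question's code; note that the method for listing the learned feature names depends on your scikit-learn version (get_feature_names on older releases, get_feature_names_out on newer ones):

# Columns of transformed_x, in the order learned by the vectorizer
print(dict_vect.get_feature_names())
# Fit the classifier on the dense feature matrix and the chinese/non-chinese labels
lr.fit(transformed_x, y)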

Python: extract all placeholders from format string

I have to 'parse' a format string in order to extract the variables.
E.g.
>>> s = "%(code)s - %(description)s"
>>> get_vars(s)
'code', 'description'
I managed to do that by using regular expressions:
re.findall(r"%\((\w+)\)", s)
but I wonder whether there is a built-in solution (actually Python does parse the string in order to evaluate it!).
This seems to work great:
def get_vars(s):
    d = {}
    while True:
        try:
            s % d
        except KeyError as exc:
            # exc.args[0] contains the name of the key that was not found;
            # 0 is used because it appears to work with all types of placeholders.
            d[exc.args[0]] = 0
        else:
            break
    return d.keys()
gives you:
>>> get_vars('%(code)s - %(description)s - %(age)d - %(weight)f')
['age', 'code', 'description', 'weight']
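A small Python 3 oriented variant: there dict.keys() returns a view rather than a list, and wrapping the keys in sorted() gives a plain, deterministically ordered list (the sorting is my assumption about the desired output, not part of the original answer):

def get_vars_sorted(s):
    d = {}
    while True:
        try:
            s % d
        except KeyError as exc:
            # supply a dummy value for the missing key and retry
            d[exc.args[0]] = 0
        else:
            break
    return sorted(d)

print(get_vars_sorted('%(code)s - %(description)s - %(age)d - %(weight)f'))
# ['age', 'code', 'description', 'weight']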