Logic and spaces in Regex

I have the following Regex:
^(?P<port>[\w\-\>]+)(?P<virtual_route> +None|None)\s+(?P<port_state>\w)\s+(?P<link_state>[\w]+)\s+(?P<auto_neg>[\w]+)\s+(?P<cfg_speed>[\d]+) (?P<actual_speed>[\d\s]+)(?P<cfg_duplex>[\w]+)
That I use on the following configuration:
MLAG-ISC>None E A OFF 40000 40000 FULL FULL NONE 1 Q+CR4_1m
2 None E NP OFF 10000 FULL NONE
3 None E NP OFF 10000 FULL NONE
4 None E NP OFF 10000 FULL NONE
MLAG-ISC>None E A OFF 40000 40000 FULL FULL NONE 1 Q+CR4_1m
6 None E NP OFF 10000 FULL NONE
Which gives me this result (https://regex101.com):
This is the result I want; however, I would also like to capture the next FULL or empty, NONE or empty, 1 or empty, and Q+CR4_1m or NONE. I just can't seem to make it work because of the spaces in rows 2, 3, 4 and 6.
Note that I am using Python3.

If FULL, NONE and 1 are the only possible (though optional) values in those columns:
^(?P<port>[\w\-\>]+)(?P<virtual_route> +None|None)\s+(?P<port_state>\w)\s+(?P<link_state>[\w]+)\s+(?P<auto_neg>[\w]+)\s+(?P<cfg_speed>[\d]+) (?P<actual_speed>\d*)\s+(?P<cfg_duplex>[\w]+)\s+((?:FULL)?)\s+((?:NONE)?)\s+(1?)\s+([\w+]+)
otherwise:
^(?P<port>[\w\-\>]+)(?P<virtual_route> +None|None)\s+(?P<port_state>\w)\s+(?P<link_state>[\w]+)\s+(?P<auto_neg>[\w]+)\s+(?P<cfg_speed>[\d]+) (?P<actual_speed>\d*)\s+(?P<cfg_duplex>[\w]+)[^\S\r\n]+(\w*)[^\S\r\n]+(\w*)[^\S\r\n]+(\d*)[^\S\r\n]+([\w+]+)
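Since you mentioned Python 3, here is a minimal sketch of applying the second pattern with re.finditer. The sample line is the full row from the question; the sparse rows (2, 3, 4, 6) will only match if the original column alignment is preserved, so treat this as illustrative:
import re

# The second pattern from above, unchanged apart from string concatenation.
pattern = re.compile(
    r"^(?P<port>[\w\-\>]+)(?P<virtual_route> +None|None)\s+(?P<port_state>\w)\s+"
    r"(?P<link_state>[\w]+)\s+(?P<auto_neg>[\w]+)\s+(?P<cfg_speed>[\d]+) "
    r"(?P<actual_speed>\d*)\s+(?P<cfg_duplex>[\w]+)"
    r"[^\S\r\n]+(\w*)[^\S\r\n]+(\w*)[^\S\r\n]+(\d*)[^\S\r\n]+([\w+]+)",
    re.MULTILINE)

config = "MLAG-ISC>None  E  A  OFF  40000 40000  FULL  FULL  NONE  1  Q+CR4_1m"

for m in pattern.finditer(config):
    print(m.groupdict())    # the named columns
    print(m.groups()[-4:])  # the four extra, unnamed columns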


How to identify invalid patterns using regex?

I have a dataset such as below:
import math
import pandas as pd

dic = {"ID": [1, 2, 3, 4, 5, 6],
       "Size": ["3-4mm", "12mm", math.nan, "1 mm", "1mm, 2mm, 3mm", "13*18mm"]}
dt = pd.DataFrame(dic)
so, the dataset is:
ID Size
1 3-4mm
2 12mm
3 NaN
4 1 mm
5 1mm, 2mm, 3mm
6 13*18mm
In the column Size, I should have only 3 valid patterns, and anything except these 3 is invalid. These 3 patterns are as below:
3-4mm (int-intmm)
NaN
4mm (intmm)
I am wondering how I can write a function which returns the IDs of the rows that have an invalid Size pattern?
So, in my example:
ID
4
5
6
The reason is that their Size is not in a valid format.
I have no preference for the solution, but I guess the easiest solution comes from regex.
Using @CodeManiac's pattern, you can pass it to series.str.contains() and pass the na parameter as True since it is an actual NaN:
dt.loc[~dt.Size.str.contains('^(?:\d+-\d+mm|\d+mm)$',na=True),'ID']
3 4
4 5
5 6
Details:
executing: dt.Size.str.contains('^(?:\d+-\d+mm|\d+mm)$')
0 True
1 True
2 NaN
3 False
4 False
5 False
pass na=True to fill NaN as True:
dt.Size.str.contains('^(?:\d+-\d+mm|\d+mm)$',na=True)
0 True
1 True
2 True
3 False
4 False
5 False
Then use the invert operator ~ to flip True to False and vice versa (since we want the False values), and select the ID column with df.loc[].
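A minimal, runnable version of this approach, assuming the dataset from the question (note that math needs to be imported for math.nan):
import math
import pandas as pd

dic = {"ID": [1, 2, 3, 4, 5, 6],
       "Size": ["3-4mm", "12mm", math.nan, "1 mm", "1mm, 2mm, 3mm", "13*18mm"]}
dt = pd.DataFrame(dic)

# Rows whose Size matches neither "int-intmm" nor "intmm"; NaN is treated as valid via na=True.
print(dt.loc[~dt.Size.str.contains(r'^(?:\d+-\d+mm|\d+mm)$', na=True), 'ID'])
# 3    4
# 4    5
# 5    6
# Name: ID, dtype: int64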
A function that returns the IDs of rows with an invalid value in the 'Size' column:
import re  # standard Python regular expressions module

def get_invalid(dt):
    return dt[dt['Size'].apply(lambda r: re.match(r'^\d+-\d+mm|nan|\d+mm$', str(r), re.MULTILINE) is None)]['ID']
Output:
3 4
4 5
5 6
Name: ID, dtype: int64

python numpy probability output

Using Python 2.7 and numpy, I want to be able to print test results based on a probability percentage. Based on 10 events, I want to print out true if it passed and false if it failed. Below is pseudo code:
import numpy as np
probability = .33
I'm struggling with how to use the probability variable to determine whether a test has passed. So in this case the probability of a test passing ('True') is 33 percent. The probability does not change for each iteration; it's always going to be .33.
Ideally it should return something like this
false
false
false
true
false
true
false
false
true
true
You could use the built-in random module to generate a number between 0 and 1 (uniform distribution, so all numbers between 0 and 1 are "equally likely"). Then test if that number is less than your desired probability:
import random

def uniform_trials(probability, num_trials):
    for _ in range(num_trials):
        # True when the uniform draw falls below the desired probability
        print(random.uniform(0, 1) < probability)
Then just call uniform_trials(.33, 10) for your desired example (or any other probability and num_trials you'd like to output).
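Since the question already imports numpy, a roughly equivalent sketch with np.random, drawing all trials at once, might look like this (the function name is illustrative):
import numpy as np

def uniform_trials_np(probability, num_trials):
    # One uniform sample in [0, 1) per trial; each falls below `probability`
    # with exactly that probability.
    for result in np.random.uniform(0, 1, num_trials) < probability:
        print(result)

uniform_trials_np(.33, 10)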

Python, calculating time difference

I'm parsing logs generated from multiple sources and joined together to form a huge log file in the following format:
My_testNumber: 14, JobType = testx.
ABC 2234
**SR 111**
1483529571 1 1 Wed Jan 4 11:32:51 2017 0 4
datatype someRandomValue
SourceCode.Cpp 588
DBConnection failed
TB 132
**SR 284**
1483529572 0 1 Wed Jan 4 11:32:52 2017 5010400 4
datatype someRandomXX
SourceCode2.cpp 455
DBConnection Success
TB 102
**SR 299**
1483529572 0 1 **Wed Jan 4 11:32:54 2017** 5010400 4
datatype someRandomXX
SourceCode3.cpp 455
ConnectionManager Success
....
(there are dozens of SR Numbers here)
Now I'm looking for a smart way to parse the logs so that it calculates the time differences in seconds for each testNumber and SR number, like:
For My_testNumber: 14, it subtracts the SR 284 and SR 111 times (the difference would be 1 second here); for SR 284 and SR 299 it is 2 seconds, and so on.
You can parse your posted log file and save the corresponding data accordingly. Then, you can work with the data to get the time differences. The following should be a decent start:
from itertools import combinations
from itertools import permutations # if order matters
from collections import OrderedDict
from datetime import datetime
import re
sr_numbers = []
dates = []
# Loop through the file and get the test number and times
# Save the data in a list
pattern = re.compile(r"(.*)\*{2}(.*)\*{2}(.*)")
for line in open('/Path/to/log/file'):
    if '**' in line:
        # Get the data between the asterisks
        if 'SR' in line:
            sr_numbers.append(re.sub(pattern, "\\2", line.strip()))
        else:
            dates.append(datetime.strptime(re.sub(pattern, "\\2", line.strip()), '%a %b %d %H:%M:%S %Y'))
    else:
        continue
# Use a hashmap container (ordered dictionary) to make it easy to get the time differences
# Using OrderedDict here to maintain the order of the SR numbers along the file
log_dict = OrderedDict((k,v) for k,v in zip(sr_numbers, dates))
# Use combinations to get the possible combinations (or permutations if order matters) of time differences
time_differences = {"{} - {}".format(*x):(log_dict[x[1]] - log_dict[x[0]]).seconds for x in combinations(log_dict, 2)}
print(time_differences)
# {'SR 284 - SR 299': 2, 'SR 111 - SR 284': 1, 'SR 111 - SR 299': 3}
Edit:
Parsing the file without relying on the asterisks around the dates:
from itertools import combinations
from itertools import permutations # if order matters
from collections import OrderedDict
from datetime import datetime
import re
sr_numbers = []
dates = []
# Loop through the file and get the test number and times
# Save the data in a list
pattern = re.compile(r"(.*)\*{2}(.*)\*{2}(.*)")
for line in open('/Path/to/log/file'):
    if 'SR' in line:
        current_sr_number = re.sub(pattern, "\\2", line.strip())
        sr_numbers.append(current_sr_number)
    elif line.strip().count(":") > 1:
        try:
            dates.append(datetime.strptime(re.split(r"\s{3,}", line)[2].strip("*"), '%a %b %d %H:%M:%S %Y'))
        except IndexError:
            # print(re.split(r"\s{3,}", line))
            dates.append(datetime.strptime(re.split(r"\t+", line)[2].strip("*"), '%a %b %d %H:%M:%S %Y'))
    else:
        continue
# Use a hashmap container (ordered dictionary) to make it easy to get the time differences
# Using OrderedDict here to maintain the order of the SR numbers along the file
log_dict = OrderedDict((k,v) for k,v in zip(sr_numbers, dates))
# Use combinations to get the possible combinations (or permutations if order matters) of time differences
time_differences = {"{} - {}".format(*x):(log_dict[x[1]] - log_dict[x[0]]).seconds for x in combinations(log_dict, 2)}
print(time_differences)
# {'SR 284 - SR 299': 2, 'SR 111 - SR 284': 1, 'SR 111 - SR 299': 3}
I hope this proves useful.
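If you only need the gaps between consecutive SR numbers, as in the example from the question (SR 111 to SR 284 is 1 second, SR 284 to SR 299 is 2 seconds), a small sketch that pairs adjacent entries of the same log_dict built above:
# Pair each SR entry with the next one and subtract their timestamps.
entries = list(log_dict.items())
consecutive = {"{} - {}".format(prev_sr, curr_sr): (curr_time - prev_time).seconds
               for (prev_sr, prev_time), (curr_sr, curr_time) in zip(entries, entries[1:])}
print(consecutive)
# {'SR 111 - SR 284': 1, 'SR 284 - SR 299': 2}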

How do I iterate a loop over several data frames in a list in python

I am very new to programming and am working with Python. For a work project I am trying to read several .csv files, convert them to data frames, concatenate some of the fields into one for a column header, and then append all of the dataframes into one big DataFrame. I have searched extensively in StackOverflow as well as in other resources but I have not been able to find an answer. Here is the code I have thus far along with some abbreviated output:
import pandas as pd
import glob
# Read a directory of files to a list
csvlist = []
for f in glob.glob("AssayCerts/*"):
    csvlist.append(f)
csvlist
['AssayCerts/CH09051590.csv', 'AssayCerts/CH09051591.csv', 'AssayCerts/CH14158806.csv', 'AssayCerts/CH14162453.csv', 'AssayCerts/CH14186004.csv']
# Read .csv files and convert to DataFrames
dflist = []
for csv in csvlist:
    df = pd.read_csv(filename, header = None, skiprows = 7)
    dflist.append(df)
dflist
[ 0 1 2 3 4 5 \
0 NaN Au-AA23 ME-ICP41 ME-ICP41 ME-ICP41 ME-ICP41
1 SAMPLE Au Ag Al As B
2 DESCRIPTION ppm ppm % ppm ppm
#concatenates the cells in the first three rows of the last dataframe; need to apply this to all of the dataframes.
for df in dflist:
    column_names = df.apply(lambda x: str(x[1]) + '-'+str(x[2])+' - '+str(x[0]),axis=0)
column_names
0 SAMPLE-DESCRIPTION - nan
1 Au-ppm - Au-AA23
2 Ag-ppm - ME-ICP41
3 Al-% - ME-ICP41
I am unable to apply the last operation across all of the DataFrames. It seems I can only get it to apply to the last DataFrame in my list. Once I get past this point I will have to append all of the DataFrames to form one large DataFrame.
As Andy Hayden mentions in his comment, the reason your loop only appears to work on the last DataFrame is that you just keep assigning the result of df.apply( ... ) to column_names, which gets written over each time. So at the end of the loop, column_names always contains the results from the last DataFrame in the list.
But you also have some other problems in your code. In the loop that begins for csv in csvlist:, you never actually reference csv - you just reference filename, which doesn't appear to be defined. And dflist just appears to have one DataFrame in it anyway.
As written in your problem, the code doesn't appear to work. I'd advise posting the real code that you're using, and only what's relevant to your problem (i.e. if building csvlist is working for you, then you don't need to show it to us).
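For what it's worth, here is a hedged sketch of how the two loops might look once those issues are fixed (using the loop variable csv, and collecting one set of column names per DataFrame instead of overwriting them), assuming the rest of your setup is as posted:
import glob
import pandas as pd

dflist = []
for csv in glob.glob("AssayCerts/*"):
    df = pd.read_csv(csv, header=None, skiprows=7)  # use the loop variable, not `filename`
    dflist.append(df)

all_column_names = []
for df in dflist:
    column_names = df.apply(lambda x: str(x[1]) + '-' + str(x[2]) + ' - ' + str(x[0]), axis=0)
    all_column_names.append(column_names)  # keep each result instead of overwriting it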

Pandas Interpolate Returning NaNs

I'm trying to do basic interpolation of position data at 60hz (~16ms) intervals. When I try to use pandas 0.14 interpolation over the dataframe, it tells me I only have NaNs in my data set (not true). When I try to run it over individual series pulled from the dataframe, it returns the same series without the NaNs filled in. I've tried setting the indices to integers, using different methods, fiddling with the axis and limit parameters of the interpolation function - no dice. What am I doing wrong?
df.head(5):
x y ms
0 20.5815 14.1821 333.3333
1 NaN NaN 350
2 20.6112 14.2013 366.6667
3 NaN NaN 383.3333
4 20.5349 14.2232 400
df = df.set_index(df.ms) # set indices to milliseconds
When I try running
df.interpolate(method='values')
I get this error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-462-cb0f1f01eb84> in <module>()
12
13
---> 14 df.interpolate(method='values')
15
16
/Users/jsb/anaconda/lib/python2.7/site-packages/pandas/core/generic.pyc in interpolate(self, method, axis, limit, inplace, downcast, **kwargs)
2511
2512 if self._data.get_dtype_counts().get('object') == len(self.T):
-> 2513 raise TypeError("Cannot interpolate with all NaNs.")
2514
2515 # create/use the index
TypeError: Cannot interpolate with all NaNs.
I've also tried running it over individual series, which only returns what I put in:
temp = df.x
temp.interpolate(method='values')
333.333333 20.5815
350.000000 NaN
366.666667 20.6112
383.333333 NaN
400.000000    20.5349
Name: x, dtype: object
EDIT :
Props to Jeff for inspiring the solution.
Adding:
df[['x','y','ms']] = df[['x','y','ms']].astype(float)
before
df.interpolate(method='values')
interpolation did the trick.
Based on your edit (props to Jeff for inspiring the solution), adding:
df = df.astype(float)
before
df.interpolate(method='values')
interpolation did the trick for me as well. Unless you're sub-selecting a column set, you don't need to specify the columns.
I'm not able to reproduce the error (see below for a copy/paste-able example); can you make sure the data you show is actually representative of your data?
In [137]: from StringIO import StringIO
In [138]: df = pd.read_csv(StringIO(""" x y ms
...: 0 20.5815 14.1821 333.3333
...: 1 NaN NaN 350
...: 2 20.6112 14.2013 366.6667
...: 3 NaN NaN 383.3333
...: 4 20.5349 14.2232 400"""), delim_whitespace=True)
In [140]: df = df.set_index(df.ms)
In [142]: df.interpolate(method='values')
Out[142]:
x y ms
ms
333.3333 20.58150 14.18210 333.3333
350.0000 20.59635 14.19170 350.0000
366.6667 20.61120 14.20130 366.6667
383.3333 20.57305 14.21225 383.3333
400.0000 20.53490 14.22320 400.0000