evaluate mixed type with eval() - python-2.7

I have two variables that each contain a list of date/time values, and another variable containing the list of operators to apply to them. The format can be expressed as follows:
column1 = np.array([date1, date2,.......,dateN])
column2 = np.array([date1, date2,.......,dateN])
Both of the above variables are of date/time type. Then I have the following variable operator, which has the same length as column1 and column2:
operator = np.array(['>=','<=','==','!=',......])
I am getting "Invalid Token" with the following operation:
np.array([eval('{}{}{}'.format(v1,op,v2)) for v1,op,v2 in zip(column1,operator,column2)])
Any hints on how to get around this issue?
-------------------EDIT----------------------
With some sample data and without eval I get the following output:
np.array(['{} {} {}'.format(v1,op,v2) for v1,op,v2 in zip(datelist1,operator,datelist2)])
array(['2017-03-30 10:30:22.928000 <= 2012-05-23 00:00:00',
'2011-01-07 00:00:00 == 2017-03-30 10:31:14.477000'],
dtype='|S49')
Once I bring in eval(), I get the following error:
eval('2011-01-07 00:00:00 == 2017-03-30 10:31:14.477000')
File "<string>", line 1
2011-01-07 00:00:00 == 2017-03-30 10:31:14.477000
^
SyntaxError: invalid syntax
----------------------EDIT & CORRECTIONS ----------------------------
The date/time variables that I mentioned before are basically of numpy datetime64 type, and I am now getting the following issue while trying to compare two dates with eval:
np.array([(repr(d1)+op+repr(d2)) for d1,op,d2 in zip(${Column Name1},${Operator},${Column Name2})])
The above snippet is applied to a table with three columns, where ${Column Name1} and ${Column Name2} are of numpy.datetime64 type and ${Operator} is of string type. The result is as follows for one of the rows:
numpy.datetime64('2014-08-13T02:00:00.000000+0200')>=numpy.datetime64('2014-08-13T02:00:00.000000+0200')
Now I want to evaluate the above expression with function eval as follows:
np.array([eval(repr(d1)+op+repr(d2)) for d1,op,d2 in zip(${Column Name1},${Operator},${Column Name2})])
Eventually I get the following error:
NameError: name 'numpy' is not defined
I can guess the problem: the open-source tool that I am using imports numpy as np, whereas repr() produces expressions referring to numpy, a name it does not recognize. If this is the problem, how do I fix it?

datetime objects can be compared:
In [506]: datetime.datetime.today()
Out[506]: datetime.datetime(2017, 3, 30, 10, 43, 18, 363747)
In [507]: t1=datetime.datetime.today()
In [508]: t2=datetime.datetime.today()
In [509]: t1 < t2
Out[509]: True
In [510]: t1 == t2
Out[510]: False
Numpy's own version of datetime objects can also be compared:
In [516]: nt1 = np.datetime64('2017-03-30 10:30:22.928000')
In [517]: nt2 = np.datetime64('2017-03-30 10:31:14.477000')
In [518]: nt1 < nt2
Out[518]: True
In [519]: nt3 = np.datetime64('2012-05-23 00:00:00')
In [520]: [nt1 <= nt2, nt2==nt3]
Out[520]: [True, False]
Using the repr string version of a datetime object works:
In [524]: repr(t1)+'<'+repr(t2)
Out[524]: 'datetime.datetime(2017, 3, 30, 10, 47, 29, 69324)<datetime.datetime(2017, 3, 30, 10, 47, 33, 669494)'
In [525]: eval(repr(t1)+'<'+repr(t2))
Out[525]: True
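A side note on the NameError from the question's later edit: eval() resolves names in whatever globals mapping you hand it, so one way around "name 'numpy' is not defined" (a minimal sketch, not a recommendation to keep using eval) is to pass a namespace in which the name numpy exists:
import numpy as np

d1 = np.datetime64('2014-08-13T02:00:00')
d2 = np.datetime64('2014-08-13T02:00:00')
expr = repr(d1) + '>=' + repr(d2)   # "numpy.datetime64('...')>=numpy.datetime64('...')"
print(eval(expr, {'numpy': np}))    # True -- the name 'numpy' now resolves inside eval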
Not that I recommend that sort of construction. I like the dictionary mapping to an operator better.

You might want to use the Python operator module for this:
# import the comparison functions from the operator module
# (note: the array below is also named `operator`, so avoid `import operator` here)
from operator import ge, eq, le, ne
# build a lookup table from operator string to operator function
ops = {">=": ge, "==": eq, "<=": le, "!=": ne}
import numpy as np
# used some numbers to simplify the case, should work on datetime as well
a = np.array([1, 3, 5, 3])
b = np.array([2, 3, 2, 1])
operator = np.array(['>=','<=','==','!='])
# evaluate the operation
[ops[op](x, y) for op, x, y in zip(operator, a, b)]
# [False, True, False, True]
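The comment above says this should work on datetime as well; here is a small sketch with made-up datetime64 values (not the asker's data) using the same lookup table:
import numpy as np
from operator import ge, eq, le, ne

ops = {">=": ge, "==": eq, "<=": le, "!=": ne}

column1 = np.array(['2017-03-30T10:30:22', '2011-01-07T00:00:00'], dtype='datetime64[s]')
column2 = np.array(['2012-05-23T00:00:00', '2017-03-30T10:31:14'], dtype='datetime64[s]')
operators = ['<=', '==']

# look up each operator function and apply it to the corresponding pair of dates
result = np.array([ops[op](v1, v2) for v1, op, v2 in zip(column1, operators, column2)])
print(result)   # [False False]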


Finding max/min value of a list in pyspark

I know this is a very trivial question, and I am quite surprised I could not find an answer on the internet, but can one find the max or min value of a list in pyspark?
In Python it is easily done by
max(list)
However, when I try the same in pyspark I get the following error:
An error was encountered:
An error occurred while calling z:org.apache.spark.sql.functions.max. Trace:
py4j.Py4JException: Method max([class java.util.ArrayList]) does not exist
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:339)
at py4j.Gateway.invoke(Gateway.java:276)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
Any ideas as to what I am doing wrong?
UPDATE: Adding what exactly I did:
This is my list:
cur_datelist
Output:
['2020-06-10', '2020-06-11', '2020-06-12', '2020-06-13', '2020-06-14', '2020-06-15', '2020-06-16', '2020-06-17', '2020-06-18', '2020-06-19', '2020-06-20', '2020-06-21', '2020-06-22', '2020-06-23', '2020-06-24', '2020-06-25', '2020-06-26', '2020-06-27', '2020-06-28', '2020-06-29', '2020-06-30', '2020-07-01', '2020-07-02', '2020-07-03', '2020-07-04', '2020-07-05', '2020-07-06', '2020-07-07', '2020-07-08', '2020-07-09', '2020-07-10', '2020-07-11', '2020-07-12', '2020-07-13', '2020-07-14', '2020-07-15', '2020-07-16', '2020-07-17', '2020-07-18', '2020-07-19', '2020-07-20', '2020-07-21', '2020-07-22', '2020-07-23', '2020-07-24', '2020-07-25', '2020-07-26', '2020-07-27', '2020-07-28', '2020-07-29', '2020-07-30', '2020-07-31', '2020-08-01', '2020-08-02', '2020-08-03', '2020-08-04', '2020-08-05', '2020-08-06', '2020-08-07', '2020-08-08', '2020-08-09', '2020-08-10', '2020-08-11', '2020-08-12', '2020-08-13', '2020-08-14', '2020-08-15', '2020-08-16', '2020-08-17', '2020-08-18', '2020-08-19', '2020-08-20', '2020-08-21', '2020-08-22', '2020-08-23', '2020-08-24', '2020-08-25', '2020-08-26', '2020-08-27', '2020-08-28', '2020-08-29', '2020-08-30', '2020-08-31']
The class is 'list':
type(cur_datelist)
<class 'list'>
I assumed that to be a normal pythonic list.
So when I tried max(cur_datelist), I got the above-mentioned error.
max on a plain list is no different between PySpark and Python; the difference only shows up with a Spark column. This is the result from my PySpark session:
# just a list
l = [1, 2, 3]
print(max(l))
# 3
# dataframe with the array column
df = spark.createDataFrame([(1, [1, 2, 3]), (2, [4, 5, 6])]).toDF('id', 'list')
import pyspark.sql.functions as f
df.withColumn('max', f.array_max(f.col('list'))).show()
#+---+---------+---+
#| id| list|max|
#+---+---------+---+
#| 1|[1, 2, 3]| 3|
#| 2|[4, 5, 6]| 6|
#+---+---------+---+
Your error comes from a name clash between the native Python max and the Spark column function max. To avoid this, import the PySpark functions under an alias, as below; then the bare name max refers to the Python built-in again.
import pyspark.sql.functions as f
l = ['2020-06-10', '2020-06-11', '2020-06-12', '2020-06-13', '2020-06-14', '2020-06-15', '2020-06-16', '2020-06-17', '2020-06-18', '2020-06-19', '2020-06-20', '2020-06-21', '2020-06-22', '2020-06-23', '2020-06-24', '2020-06-25', '2020-06-26', '2020-06-27', '2020-06-28', '2020-06-29', '2020-06-30', '2020-07-01', '2020-07-02', '2020-07-03', '2020-07-04', '2020-07-05', '2020-07-06', '2020-07-07', '2020-07-08', '2020-07-09', '2020-07-10', '2020-07-11', '2020-07-12', '2020-07-13', '2020-07-14', '2020-07-15', '2020-07-16', '2020-07-17', '2020-07-18', '2020-07-19', '2020-07-20', '2020-07-21', '2020-07-22', '2020-07-23', '2020-07-24', '2020-07-25', '2020-07-26', '2020-07-27', '2020-07-28', '2020-07-29', '2020-07-30', '2020-07-31', '2020-08-01', '2020-08-02', '2020-08-03', '2020-08-04', '2020-08-05', '2020-08-06', '2020-08-07', '2020-08-08', '2020-08-09', '2020-08-10', '2020-08-11', '2020-08-12', '2020-08-13', '2020-08-14', '2020-08-15', '2020-08-16', '2020-08-17', '2020-08-18', '2020-08-19', '2020-08-20', '2020-08-21', '2020-08-22', '2020-08-23', '2020-08-24', '2020-08-25', '2020-08-26', '2020-08-27', '2020-08-28', '2020-08-29', '2020-08-30', '2020-08-31']
print(max(l))
# 2020-08-31
Or,
import builtins as p
print(p.max(l))
# 2020-08-31
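For completeness, a hypothetical notebook cell showing how the clash usually arises (assuming the session did a star import of the Spark functions):
from pyspark.sql.functions import *   # rebinds the name `max` to the Spark column function

cur_datelist = ['2020-06-10', '2020-08-30', '2020-08-31']
# max(cur_datelist) would now raise the Py4JException from the question,
# because `max` refers to pyspark.sql.functions.max, not the built-in.

import builtins
print(builtins.max(cur_datelist))   # 2020-08-31 -- explicitly the Python built-in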

XlsxWriter: set_column() with one format for multiple non-continuous columns

I want to write my Pandas dataframe to Excel and apply a format to multiple individual columns (e.g., A and C but not B) using a one-liner as such:
writer = pd.ExcelWriter(filepath, engine='xlsxwriter')
my_format = writer.book.add_format({'num_format': '#'})
writer.sheets['Sheet1'].set_column('A:A,C:C', 15, my_format)
This results in the following error:
File ".../python2.7/site-packages/xlsxwriter/worksheet.py", line 114, in column_wrapper
cell_1, cell_2 = [col + '1' for col in args[0].split(':')]
ValueError: too many values to unpack
It doesn't accept the syntax 'A:A,C:C'. Is it even possible to apply the same formatting without calling set_column() for each column?
If the column ranges are non-contiguous you will have to call set_column() for each range:
writer.sheets['Sheet1'].set_column('A:A', 15, my_format)
writer.sheets['Sheet1'].set_column('C:C', 15, my_format)
Note, to do this programmatically you can also use a numeric range:
for col in (0, 2):
    writer.sheets['Sheet1'].set_column(col, col, 15, my_format)
Or you could reference columns like this:
for col in ('X', 'Z'):
    writer.sheets['Sheet1'].set_column(col+':'+col, None, my_format)
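If you do this often, a small hypothetical helper (set_columns is not part of XlsxWriter) keeps it to one call per sheet:
# loop over an arbitrary set of non-contiguous column letters and apply one format
def set_columns(worksheet, col_letters, width, cell_format):
    for col in col_letters:
        worksheet.set_column('{0}:{0}'.format(col), width, cell_format)

set_columns(writer.sheets['Sheet1'], ('A', 'C'), 15, my_format)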

Not calculating sum for all columns in pandas dataframe

I'm pulling data from Impala using impyla and converting it to a dataframe using as_pandas. I'm using Pandas 0.18.0 and Python 2.7.9.
I'm trying to calculate the sum of all columns in a dataframe and trying to select the columns which are greater than the threshold.
self.data = self.data.loc[:,self.data.sum(axis=0) > 15]
But when I run this I get the error below:
pandas.core.indexing.IndexingError: Unalignable boolean Series key provided
Then I tried the following:
print 'length : ',len(self.data.sum(axis = 0)),' all columns : ',len(self.data.columns)
Then I get different lengths, i.e.
length : 78 all columns : 83
And I'm getting the warning below:
C:\Python27\lib\decimal.py:1150: RuntimeWarning: tp_compare didn't return -1 or -2 for exception
To achieve my goal I tried another way:
for column in self.data.columns:
    sum = self.data[column].sum()
    if sum < 15:
        self.data = self.data.drop(column, 1)
Now I get other errors, like below:
TypeError: unsupported operand type(s) for +: 'Decimal' and 'float'
C:\Python27\lib\decimal.py:1150: RuntimeWarning: tp_compare didn't return -1 or -2 for exception
Then I tried to get the data types of each column, like below:
print 'dtypes : ', self.data.dtypes
The result shows that all the columns are one of int64, object, and float64.
Then I thought of converting the columns that are of object dtype, like below:
self.data.convert_objects(convert_numeric=True)
I'm still getting the same errors. Please help me solve this.
Note: none of the columns contain strings (i.e., characters), missing values, or empty values. I have checked this using self.data.to_csv.
As I'm new to pandas and Python, please don't mind if it is a silly question. I just want to learn.
Please review the simple code below; it may help you understand the reason for the error.
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.random([3,3]))
df.iloc[0,0] = np.nan
print df
print df.sum(axis=0) > 1.5
print df.loc[:, df.sum(axis=0) > 1.5]
df.iloc[0,0] = 'string'
print df
print df.sum(axis=0) > 1.5
print df.loc[:, df.sum(axis=0) > 1.5]
          0         1         2
0       NaN  0.336250  0.801349
1  0.930947  0.803907  0.139484
2  0.826946  0.229269  0.367627
0     True
1    False
2    False
dtype: bool
          0
0       NaN
1  0.930947
2  0.826946
          0         1         2
0    string  0.336250  0.801349
1  0.930947  0.803907  0.139484
2  0.826946  0.229269  0.367627
1    False
2    False
dtype: bool
Traceback (most recent call last):
...
pandas.core.indexing.IndexingError: Unalignable boolean Series key provided
In short, you need additional preprocessing of your data. You can find the columns of object dtype with:
df.select_dtypes(include=['object'])
If these hold convertible string numbers, you can convert them with df.astype(); otherwise you should purge them.
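A minimal sketch of that preprocessing, assuming the Decimal values coming from Impala are what break the sum (df stands in for self.data): pd.to_numeric with errors='coerce' turns every object column into floats and replaces anything non-convertible with NaN, after which the boolean mask aligns with all columns.
import pandas as pd

# convert all object-dtype columns to numeric; unconvertible cells become NaN
obj_cols = df.select_dtypes(include=['object']).columns
df[obj_cols] = df[obj_cols].apply(pd.to_numeric, errors='coerce')

# now df.sum(axis=0) produces one value per column and the original filter works
df = df.loc[:, df.sum(axis=0) > 15]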

Find empty or NaN entry in Pandas Dataframe

I am trying to search through a Pandas Dataframe to find where it has a missing entry or a NaN entry.
Here is a dataframe that I am working with:
cl_id a c d e A1 A2 A3
0 1 -0.419279 0.843832 -0.530827 text76 1.537177 -0.271042
1 2 0.581566 2.257544 0.440485 dafN_6 0.144228 2.362259
2 3 -1.259333 1.074986 1.834653 system 1.100353
3 4 -1.279785 0.272977 0.197011 Fifty -0.031721 1.434273
4 5 0.578348 0.595515 0.553483 channel 0.640708 0.649132
5 6 -1.549588 -0.198588 0.373476 audio -0.508501
6 7 0.172863 1.874987 1.405923 Twenty NaN NaN
7 8 -0.149630 -0.502117 0.315323 file_max NaN NaN
NOTE: The blank entries are empty strings - this is because there was no alphanumeric content in the file that the dataframe came from.
If I have this dataframe, how can I find a list with the indexes where the NaN or blank entry occurs?
np.where(pd.isnull(df)) returns the row and column indices where the value is NaN:
In [152]: import numpy as np
In [153]: import pandas as pd
In [154]: np.where(pd.isnull(df))
Out[154]: (array([2, 5, 6, 6, 7, 7]), array([7, 7, 6, 7, 6, 7]))
In [155]: df.iloc[2,7]
Out[155]: nan
In [160]: [df.iloc[i,j] for i,j in zip(*np.where(pd.isnull(df)))]
Out[160]: [nan, nan, nan, nan, nan, nan]
Finding values which are empty strings could be done with applymap:
In [182]: np.where(df.applymap(lambda x: x == ''))
Out[182]: (array([5]), array([7]))
Note that using applymap requires calling a Python function once for each cell of the DataFrame. That could be slow for a large DataFrame, so it would be better if you could arrange for all the blank cells to contain NaN instead so you could use pd.isnull.
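A short sketch of that suggestion, using the question's df: convert empty strings to NaN first, then a single pd.isnull() pass reports every missing cell.
import numpy as np
import pandas as pd

# empty strings become NaN, so one isnull() check finds both kinds of missing data
rows, cols = np.where(pd.isnull(df.replace('', np.nan)))
missing = list(zip(rows, cols))   # (row index, column index) pairs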
Try this:
df[df['column_name'] == ''].index
and for NaNs you can try:
pd.isna(df['column_name'])
Check whether the columns contain NaN using .isnull() and check for empty strings using .eq(''), then join the two together with the bitwise OR operator |.
Sum along axis 0 to find the columns with missing data, then sum along axis 1 to get the index locations of the rows with missing data.
missing_cols, missing_rows = (
    (df2.isnull().sum(x) | df2.eq('').sum(x))
    .loc[lambda x: x.gt(0)].index
    for x in (0, 1)
)
>>> df2.loc[missing_rows, missing_cols]
A2 A3
2 1.10035
5 -0.508501
6 NaN NaN
7 NaN NaN
I've resorted to
df[ (df[column_name].notnull()) & (df[column_name]!=u'') ].index
lately. That filters out both the null and the empty-string cells in one go.
In my opinion, don't waste time and just replace the empty values with NaN! Then search for all entries with NaN. (This is correct because empty values are missing values anyway.)
import numpy as np # to use np.nan
import pandas as pd # to use replace
df = df.replace('', np.nan) # to get rid of empty values
nan_values = df[df.isna().any(axis=1)] # to get all rows with Na
nan_values # view df with NaN rows only
Partial solution: for a single string column
tmp = df['A1'].fillna(''); isEmpty = tmp==''
gives boolean Series of True where there are empty strings or NaN values.
You can also do something like this:
text_empty = df['column name'].str.len() < 1
df.loc[text_empty].index
The result will be the rows that are empty, along with their index numbers.
Another option, covering cases where there might be several spaces, is to use Python's isspace() function.
df[df.col_name.apply(lambda x: x.isspace() == False)] # will only return cases that are not just whitespace
Adding NaN values:
df[(df.col_name.apply(lambda x: x.isspace() == False)) & (~df.col_name.isna())]
To obtain all the rows that contain an empty cell in a particular column:
DF_new_row = DF_raw.loc[DF_raw['columnname'] == '']
This will give the subset of DF_raw that satisfies the checking condition.
You can use string methods with regex to find cells with empty strings:
df[~df.column_name.str.contains(r'\w')].column_name.count()

unable to read a tab delimited file into a numpy 2-D array

I am quite new to numpy and I am trying to read a tab (\t) delimited text file into a numpy 2-D array using the following code:
train_data = np.genfromtxt('training.txt', dtype=None, delimiter='\t')
File contents:
38 Private 215646 HS-grad 9 Divorced Handlers-cleaners Not-in-family White Male 0 0 40 United-States <=50K
53 Private 234721 11th 7 Married-civ-spouse Handlers-cleaners Husband Black Male 0 0 40 United-States <=50K
30 State-gov 141297 Bachelors 13 Married-civ-spouse Prof-specialty Husband Asian-Pac-Islander Male 0 0 40 India >50K
what I expect is a 2-D array matrix of shape (3, 15)
but with my above code I only get a single row array of shape (3,)
I am not sure why those fifteen fields of each row are not assigned a column each.
I also tried using numpy's loadtxt(), but it could not handle type conversions on my data, i.e., even though I gave dtype=None it tried to convert the strings to the default float type and failed.
Tried code:
train_data = np.loadtxt('try.txt', dtype=None, delimiter='\t')
Error:
ValueError: could not convert string to float: State-gov
Any pointers?
Thanks
Actually the issue here is that np.genfromtxt and np.loadtxt both return a structured array if the dtype is structured (i.e., has multiple types). Your array reports a shape of (3,) because technically it is a 1-D array of 'records'. These 'records' hold all your columns, but you can access all the data as if it were 2-D.
You are loading it correctly:
In [82]: d = np.genfromtxt('tmp',dtype=None)
As you reported, it has a 1d shape:
In [83]: d.shape
Out[83]: (3,)
But all your data is there:
In [84]: d
Out[84]:
array([ (38, 'Private', 215646, 'HS-grad', 9, 'Divorced', 'Handlers-cleaners', 'Not-in-family', 'White', 'Male', 0, 0, 40, 'United-States', '<=50K'),
(53, 'Private', 234721, '11th', 7, 'Married-civ-spouse', 'Handlers-cleaners', 'Husband', 'Black', 'Male', 0, 0, 40, 'United-States', '<=50K'),
(30, 'State-gov', 141297, 'Bachelors', 13, 'Married-civ-spouse', 'Prof-specialty', 'Husband', 'Asian-Pac-Islander', 'Male', 0, 0, 40, 'India', '>50K')],
dtype=[('f0', '<i8'), ('f1', 'S9'), ('f2', '<i8'), ('f3', 'S9'), ('f4', '<i8'), ('f5', 'S18'), ('f6', 'S17'), ('f7', 'S13'), ('f8', 'S18'), ('f9', 'S4'), ('f10', '<i8'), ('f11', '<i8'), ('f12', '<i8'), ('f13', 'S13'), ('f14', 'S5')])
The dtype of the array is structured as so:
In [85]: d.dtype
Out[85]: dtype([('f0', '<i8'), ('f1', 'S9'), ('f2', '<i8'), ('f3', 'S9'), ('f4', '<i8'), ('f5', 'S18'), ('f6', 'S17'), ('f7', 'S13'), ('f8', 'S18'), ('f9', 'S4'), ('f10', '<i8'), ('f11', '<i8'), ('f12', '<i8'), ('f13', 'S13'), ('f14', 'S5')])
And you can still access "columns" (known as fields) using the names given in the dtype:
In [86]: d['f0']
Out[86]: array([38, 53, 30])
In [87]: d['f1']
Out[87]:
array(['Private', 'Private', 'State-gov'],
dtype='|S9')
It's more convenient to give proper names to the fields:
In [104]: names = "age,military,id,edu,a,marital,job,fam,ethnicity,gender,b,c,d,country,income"
In [105]: d = np.genfromtxt('tmp',dtype=None, names=names)
So you can now access the 'age' field, etc.:
In [106]: d['age']
Out[106]: array([38, 53, 30])
In [107]: d['income']
Out[107]:
array(['<=50K', '<=50K', '>50K'],
dtype='|S5')
Or the incomes of people under 35
In [108]: d[d['age'] < 35]['income']
Out[108]:
array(['>50K'],
dtype='|S5')
and over 35
In [109]: d[d['age'] > 35]['income']
Out[109]:
array(['<=50K', '<=50K'],
dtype='|S5')
Updated answer
Sorry, I misread your original question:
what I expect is a 2-D array matrix of shape (3, 15)
but with my above code I only get a single row array of shape (3,)
I think you misunderstand what np.genfromtxt() will return. In this case, it will try to infer the type of each 'column' in your text file and give you back a structured / "record" array. Each row will contain multiple fields (f0...f14), each of which can contain values of a different type corresponding to a 'column' in your text file. You can index a particular field by name, e.g. data['f0'].
You simply can't have a (3,15) numpy array of heterogeneous types. You can have a (3,15) homogeneous array of strings, for example:
>>> string_data = np.genfromtxt('test', dtype=str, delimiter='\t')
>>> print string_data.shape
(3, 15)
Then of course you could manually cast the columns to whatever type you want, as in #DrRobotNinja's answer. However, you might as well let numpy create a structured array for you, then index it by field and assign the columns to new arrays.
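A brief sketch of that approach, assuming the tab-delimited training.txt from the question (the field names here are made up for illustration):
import numpy as np

names = "age,workclass,fnlwgt,edu,edu_num,marital,job,fam,ethnicity,gender,gain,loss,hours,country,income"
d = np.genfromtxt('training.txt', dtype=None, delimiter='\t', names=names)

# individual fields come back as ordinary 1-D arrays...
ages = d['age']                      # array([38, 53, 30])

# ...and several numeric fields can be stacked into a regular 2-D float array
numeric = np.column_stack([d['age'], d['fnlwgt'], d['edu_num']]).astype(float)
print(numeric.shape)                 # (3, 3) for the three sample rows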
I do not believe NumPy arrays can hold different datatypes within a single (non-structured) array. What you can do is load the entire array as strings, then convert the necessary columns to numbers as needed:
# Load data as strings
train_data = np.loadtxt('try.txt', dtype=np.str, delimiter='\t')
# Convert numeric strings into integers
first_col = train_data[:,0].astype(np.int)
third_col = train_data[:,2].astype(np.int)