How to stop printing using if statement with openpyxl - python-2.7

I'm reading values from an Excel workbook, and I'm trying to stop printing when the value of a specific cell equals a string. Here is my code:
import openpyxl

wb = openpyxl.load_workbook('data.xlsx')
for sheet in wb.worksheets:
    nrows = sheet.max_row
    ncolumns = sheet.max_column
    for rowNum in range(20, sheet.max_row):
        Sub = sheet.cell(row=rowNum, column=3).value
        Unit = sheet.cell(row=rowNum, column=6).value
        Concentration = sheet.cell(row=rowNum, column=9).value
        com = rowNum[3]
        if Sub == "User:":
            pass
        else:
            print(Sub, Concentration, Unit)
The problem is that the if statement doesn't work. When I use type(Sub), Python returns <type 'unicode'>.
Do you have any idea how to fix it?
Thanks

Sounds like your test is failing. All strings in Excel files are returned as unicode objects, and "User:" == u"User:" is True, so the comparison itself is fine. Maybe the cells you are looking at have some whitespace that isn't visible in your debug statement. In that case it's useful to embed the string in a list when printing it: print([Sub])
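A minimal sketch of both tricks, with a made-up cell value that carries invisible padding:

```python
# Hypothetical cell value with invisible whitespace padding
Sub = u"  User:  "

# Printing the bare string hides the whitespace; wrapping it in a
# list prints the repr, which makes the padding visible
print([Sub])

# Stripping whitespace before comparing makes the test robust
print(Sub.strip() == "User:")  # -> True
```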
Alternatively, and this looks to be the case, you are getting confused between Excel's 1-based indexing and Python's zero-based indexing. In your code the first cell to be looked at will be C20 (ws.cell(20, 3)), but rowNum[3] is actually D20.
I also recommend you avoid using Python's range and the max_column and max_row properties unless you really need them. In your case, ws.get_squared_range() makes more sense. Or, in openpyxl 2.4, ws.iter_rows(), which lets you specify the known edges of a range of cells.
for row in ws.get_squared_range(ws.min_column, 20, ws.max_column, ws.max_row):
    Sub = row[2].value  # 3rd item
    Unit = row[5].value
    Concentration = row[8].value
In openpyxl 2.4, get_squared_range can be replaced with:
for row in ws.iter_rows(min_row=20):

Related

Replace empty values based on part of the text from another variable in Pandas dataframe, using filter and regex expression

I want to replace empty values with part of the text I find in another variable in Pandas. To achieve this, I need a regex expression to extract the exact text value I want transferred, but also a filter so that only the rows that have no value to begin with are subject to change.
In SAS this is straightforward, but I am struggling to do the same in Python/pandas.
The example below is a simplified version of my problem. Specifically, I need to replace any empty values of the variable Mount with the part of the text in the variable Lens that is preceded by the word “til” (meaning “for” in English); in this example, second row, the word “Canon”. If Mount is not missing for a particular row, then nothing happens (as can be seen in the first row).
I have come up with a self-constructed solution below that sort of works, but I feel there is a more efficient way to do it. In particular, the temporary variable Mount_tmp seems unnecessary. Any thoughts and ideas to improve my code would be appreciated. Thanks.
import numpy as np
import pandas as pd

data = {'Lens': ['Canon EF 50mm f/1.8 STM', 'Zeiss Planar T* 85mm f/1.4 til Canon'],
        'Mount': ['Canon E', np.nan]}
frame = pd.DataFrame(data)

# Generate temporary variable
frame['Mount_tmp'] = frame['Lens'].str.extract(r'til (\w+\s*\w*)')

# Replace empty data in variable Mount with existing data from Mount_tmp
filt = frame['Mount'].isnull()
frame.loc[filt, 'Mount'] = frame.loc[filt, 'Mount_tmp']
frame.drop('Mount_tmp', axis=1, inplace=True)
Try:
mask = frame.Mount.isna()
frame.loc[mask, "Mount"] = frame.loc[mask, "Lens"].str.extract(r"til\s+(.*)")[0]
print(frame)
Prints:
                                   Lens    Mount
0               Canon EF 50mm f/1.8 STM  Canon E
1  Zeiss Planar T* 85mm f/1.4 til Canon    Canon
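The same replacement can also be written with fillna, which only touches the missing entries, so no explicit mask is needed. A sketch under the same assumptions as the data above:

```python
import numpy as np
import pandas as pd

data = {'Lens': ['Canon EF 50mm f/1.8 STM', 'Zeiss Planar T* 85mm f/1.4 til Canon'],
        'Mount': ['Canon E', np.nan]}
frame = pd.DataFrame(data)

# str.extract returns NaN where the pattern does not match, and
# fillna only fills the rows where Mount is already missing
frame['Mount'] = frame['Mount'].fillna(frame['Lens'].str.extract(r'til\s+(.*)')[0])
print(frame)
```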

Reference a list of dicts

Python 2.7 on Mint Cinnamon 17.3.
I have a bit of test code employing a list of dicts and despite many hours of frustration, I cannot seem to work out why it is not working as it should do.
blockagedict = {'location': None, 'timestamp': None, 'blocked': None}
blockedlist = [blockagedict]
blockagedict['location'] = 'A'
blockagedict['timestamp'] = '12-Apr-2016 01:01:08.702149'
blockagedict['blocked'] = True
blockagedict['location'] = 'B'
blockagedict['timestamp'] = '12-Apr-2016 01:01:09.312459'
blockagedict['blocked'] = False
blockedlist.append(blockagedict)
for test in blockedlist:
    print test['location'], test['timestamp'], test['blocked']
This always produces the following output and I cannot work out why and cannot see if I have anything wrong with my code. It always prints out the last set of dict values but should print all, if I am not mistaken.
B 12-Apr-2016 01:01:09.312459 False
B 12-Apr-2016 01:01:09.312459 False
I would be happy for someone to show me the error of my ways and put me out of my misery.
It is because the line blockedlist = [blockagedict] actually stores a reference to the dict, not a copy, in the list. Your code effectively creates a list that has two references to the very same object.
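A minimal sketch of the difference (the names here are illustrative, not from your program):

```python
d = {'x': 1}
lst = [d]            # stores a reference to d, not a copy
d['x'] = 2
print(lst[0]['x'])   # -> 2: the list sees the change, it holds the same object

lst2 = [dict(d)]     # dict(d) makes a shallow copy with its own storage
d['x'] = 3
print(lst2[0]['x'])  # -> 2: the copy is unaffected by later changes to d
```

In your loop, appending dict(blockagedict) instead of blockagedict would have given each list entry its own snapshot of the values.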
If you care about performance and will have 1 million dictionaries in a list, all with the same keys, you will be better off using a NumPy structured array. Then you can have a single, efficient data structure which is basically a matrix of rows and named columns of appropriate types. You mentioned in a comment that you may know the number of rows in advance. Here's a rewrite of your example code using NumPy instead, which will be massively more efficient than a list of a million dicts.
import numpy as np

dtype = [('location', str, 1), ('timestamp', str, 27), ('blocked', bool)]
count = 2  # will be much larger in the real program
blockages = np.empty(count, dtype)  # use zeros() instead if some data may never be populated
blockages[0]['location'] = 'A'
blockages[0]['timestamp'] = '12-Apr-2016 01:01:08.702149'
blockages[0]['blocked'] = True
blockages['location'][1] = 'B'  # n.b. indexing works this way too
blockages['timestamp'][1] = '12-Apr-2016 01:01:09.312459'
blockages['blocked'][1] = False

for test in blockages:
    print test['location'], test['timestamp'], test['blocked']
Note that the usage is almost identical. But the storage is in a fixed size, single allocation. This will reduce memory usage and compute time.
As a nice side effect, writing it as above completely sidesteps the issue you originally had, with multiple references to the same row. Now all the data is placed directly into the matrix with no object references at all.
Later in a comment you mention you cannot use NumPy because it may not be installed. Well, we can still avoid unnecessary dicts, like this:
from array import array

blockages = {'location': [], 'timestamp': [], 'blocked': array('B')}
blockages['location'].append('A')
blockages['timestamp'].append('12-Apr-2016 01:01:08.702149')
blockages['blocked'].append(True)
blockages['location'].append('B')
blockages['timestamp'].append('12-Apr-2016 01:01:09.312459')
blockages['blocked'].append(False)

# zip the columns by name: dict ordering is arbitrary in Python 2, so
# zip(*blockages.values()) could pair the columns in the wrong order
for location, timestamp, blocked in zip(blockages['location'],
                                        blockages['timestamp'],
                                        blockages['blocked']):
    print location, timestamp, blocked
Note I use array here for efficient storage of the fixed-size blocked values (this way each value takes exactly one byte).
You still end up with resizable lists that you could avoid, but at least you don't need to store a dict in every slot of the list. This should still be more efficient.
Ok, I have initialised the list of dicts right off the bat and this seems to work. Although I am tempted to write a class for this.
blockedlist = [{'location': None, 'timestamp': None, 'blocked': None} for k in range(2)]
blockedlist[0]['location'] = 'A'
blockedlist[0]['timestamp'] = '12-Apr-2016 01:01:08.702149'
blockedlist[0]['blocked'] = True
blockedlist[1]['location'] = 'B'
blockedlist[1]['timestamp'] = '12-Apr-2016 01:01:09.312459'
blockedlist[1]['blocked'] = False
for test in blockedlist:
    print test['location'], test['timestamp'], test['blocked']
And this produces what I was looking for:
A 12-Apr-2016 01:01:08.702149 True
B 12-Apr-2016 01:01:09.312459 False
I will be reading from a text file with 1 to 2 million lines, so converting the code to iterate through the lines won't be a problem.

KeyError Pandas Dataframe (encoding index)

I'm running the code below. It creates a couple of dataframes that take a column from another dataframe, which holds a list of conference names, as their index.
df_conf = pd.read_sql("select distinct Conference from publications where year>=1991 and length(conference)>1 order by conference", db)
for index, row in df_conf.iterrows():
    row[0] = row[0].encode("utf-8")

df2 = pd.DataFrame(index=df_conf['Conference'], columns=['Citation1991', 'Citation1992'])
df2 = df2.fillna(0)

df_if = pd.DataFrame(index=df_conf['Conference'], columns=['IF1994', 'IF1995'])
df_if = df_if.fillna(0)

df_pubs = pd.read_sql("select Conference, Year, count(*) as totalPubs from publications where year>=1991 group by conference, year", db)
for index, row in df_pubs.iterrows():
    row[0] = row[0].encode("utf-8")

df_pubs = df_pubs.pivot(index='Conference', columns='Year', values='totalPubs')
df_pubs.fillna(0)

for index, row in df2.iterrows():
    df_if.ix[index, 'IF1994'] = df2.ix[index, 'Citation1992'] / (df_pubs.ix[index, 1992] + df_pubs.ix[index, 1993])
The last line keeps giving me the following error:
KeyError: 'Analyse dynamischer Systeme in Medizin, Biologie und \xc3\x96kologie'
Not quite sure what I'm doing wrong. I tried encoding the indexes; it won't work. I even tried .at; it still won't work.
I know it has to do with encoding, as it always stops at indexes with non-ASCII characters.
I'm using Python 2.7.
I think the problem with this:
for index, row in df_conf.iterrows():
    row[0] = row[0].encode("utf-8")
is that it may or may not work: iterrows returns a copy of each row, so writing to it is not guaranteed to modify the original DataFrame. I'm surprised it didn't raise a warning.
Besides that it's much quicker to use the vectorised str method to encode the series:
df_conf['col_name'] = df_conf['col_name'].str.encode('utf-8')
If needed you can also encode the index in a similar fashion:
df.index = df.index.str.encode('utf-8')
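A small runnable sketch of the vectorised approach (the column name and data are made up):

```python
import pandas as pd

df = pd.DataFrame({'Conference': [u'\xd6kologie', u'Medizin']})

# encode every value in one vectorised call instead of mutating rows
# inside an iterrows() loop
encoded = df['Conference'].str.encode('utf-8')
print(encoded.iloc[0])  # the UTF-8 bytes for 'Ökologie'
```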
Does it happen at this line in the last part of the code?
df_if.ix[index, 'IF1994'] = df2.ix[index, 'Citation1992'] / (df_pubs.ix[index, 1992] + df_pubs.ix[index, 1993])
If so, try
df_if.ix[index, u'IF1994'] = df2.ix[index, u'Citation1992'] / (df_pubs.ix[index, 1992] + df_pubs.ix[index, 1993])
That should work. DataFrame indexing with UTF-8 strings works in strange ways, even when the script is declared with # -*- coding: utf-8 -*-. Just put the u prefix on UTF-8 strings when you use DataFrame columns and indexes with UTF-8 strings.

how to apply cell style when using `append` in openpyxl?

I am using openpyxl to create an Excel worksheet. I want to apply styles when I insert the data. The trouble is that the append method takes a list of data and automatically inserts them to cells. I cannot seem to specify a font to apply to this operation.
I can go back and apply a style to individual cells after-the-fact, but this requires overhead to find out how many data points were in the list, and which row I am currently appending to. Is there an easier way?
This illustrative code shows what I would like to do:
def create_xlsx(self, header):
    self.ft_base = Font(name='Calibri', size=10)
    self.ft_bold = self.ft_base.copy(bold=True)
    if header:
        self.ws.append(header, font=ft_bold)  # cannot apply style during append
ws.append() is designed for appending rows of data easily. It does, however, also allow you to include placeless cells within a row so that you can apply formatting while adding data. This is primarily of interest when using write_only=True but will work for normal workbooks.
Your code would look something like:
from openpyxl.cell import Cell
from openpyxl.styles import Font

data = [1, 3, 4, 9, 10]

def styled_cells(data):
    # ws is the worksheet you are appending to
    for c in data:
        if c == 1:
            c = Cell(ws, column="A", row=1, value=c)
            c.font = Font(bold=True)
        yield c

ws.append(styled_cells(data))
openpyxl will correct the coordinates of such cells.

Single quote replacement, handling of null integers in pandas/python2.7

New to Pandas/Python, and I'm having to write some kludgy code. I would appreciate any input on how you would do this and speed it up (I'll be doing this for gigabytes of data).
So, I'm using pandas/python for some ETL work. Row-wise calculations are performed, so I need the fields as numeric types within the process (I left this part out). I need to output some of the fields as an array and get rid of the single quotes, nan's, and ".0"'s.
First question: is there a way to vectorize these if/else statements, like ifelse in R? Second, surely there is a better way to remove the ".0". There seem to be major issues with how pandas/numpy handle nulls in numeric types.
Finally, the .replace does not seem to work on the DataFrame for single quotes. Am I missing something? Here's the sample code; please let me know if you have any questions about it:
import pandas as pd
from numpy import nan as NaN

# have some nulls and need it in integers
d = {'one': [1.0, 2.0, 3.0, 4.0], 'two': [4.0, 3.0, NaN, 1.0]}
dat = pd.DataFrame(d)

# make functions to get rid of the ".0", necessarily converting to strings
def removeforval(val):
    if str(val)[-2:] == ".0":
        val = str(val)[:len(str(val)) - 2]
    else:
        val = str(val)
    return val

def removeforcol(col):
    col = col.apply(removeforval)
    return col

dat = dat.apply(removeforcol, axis=0)

# remove the nan's
dat = dat.replace('nan', '')

# need some fields in arrays on a postgres database
quoted = ['{' + str(tuple(x))[1:-1] + '}' for x in dat.to_records(index=False)]
print "Before single quote removal"
print quoted

# try to replace single quotes using DataFrame's replace
quoted_df = pd.DataFrame(quoted).replace('\'', '')
print "DataFrame does not seem to work"
print quoted_df

# use a loop
for item in range(len(quoted)):
    quoted[item] = quoted[item].replace('\'', '')
print "This Works"
print quoted
Thank you!
You understand that it is very odd to construct a string exactly like this. It is not valid Python at all. What are you doing with it? Why are you stringifying it?
revised
In [144]: ["{%s , %s}" % tup[1:] for tup in df.replace(np.nan, 0).astype(int).replace(0, '').itertuples()]
Out[144]: ['{1 , 4}', '{2 , 3}', '{3 , }', '{4 , 1}']
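As to why the DataFrame replace appeared to do nothing: DataFrame.replace matches whole cell values by default, not substrings. For substring replacement you can pass regex=True, or use the vectorised .str.replace on the column. A sketch with made-up data:

```python
import pandas as pd

quoted_df = pd.DataFrame({'vals': ["{'1', '4'}", "{'3', ''}"]})

# replace() with regex=True does substring replacement...
via_replace = quoted_df.replace("'", "", regex=True)

# ...and .str.replace works on a single column of strings
via_str = quoted_df['vals'].str.replace("'", "", regex=False)

print(via_replace['vals'].tolist())  # -> ['{1, 4}', '{3, }']
print(via_str.tolist())              # -> ['{1, 4}', '{3, }']
```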