KeyError Pandas Dataframe (encoding index) - python-2.7

I'm running the code below. It creates a couple of dataframes that use a column from another dataframe, containing a list of conference names, as their index.
df_conf = pd.read_sql("select distinct Conference from publications where year>=1991 and length(conference)>1 order by conference", db)
for index, row in df_conf.iterrows():
    row[0] = row[0].encode("utf-8")

df2 = pd.DataFrame(index=df_conf['Conference'], columns=['Citation1991', 'Citation1992'])
df2 = df2.fillna(0)

df_if = pd.DataFrame(index=df_conf['Conference'], columns=['IF1994', 'IF1995'])
df_if = df_if.fillna(0)

df_pubs = pd.read_sql("select Conference, Year, count(*) as totalPubs from publications where year>=1991 group by conference, year", db)
for index, row in df_pubs.iterrows():
    row[0] = row[0].encode("utf-8")

df_pubs = df_pubs.pivot(index='Conference', columns='Year', values='totalPubs')
df_pubs.fillna(0)

for index, row in df2.iterrows():
    df_if.ix[index, 'IF1994'] = df2.ix[index, 'Citation1992'] / (df_pubs.ix[index, 1992] + df_pubs.ix[index, 1993])
The last line keeps giving me the following error:
KeyError: 'Analyse dynamischer Systeme in Medizin, Biologie und \xc3\x96kologie'
Not quite sure what I'm doing wrong. I tried encoding the indexes, but it won't work. I even tried .at; it still won't work.
I know it has to do with encoding, as it always stops at indexes with non-ASCII characters.
I'm using Python 2.7.

I think the problem with this:
for index, row in df_conf.iterrows():
    row[0] = row[0].encode("utf-8")
is that iterrows yields a copy of each row, so the assignment may or may not write back to the dataframe; I'm surprised it didn't raise a warning.
Besides that it's much quicker to use the vectorised str method to encode the series:
df_conf['col_name'] = df_conf['col_name'].str.encode('utf-8')
If needed you can also encode the index in a similar fashion:
df.index = df.index.str.encode('utf-8')
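For example, a minimal sketch (with made-up data) contrasting the two approaches:
# -*- coding: utf-8 -*-
import pandas as pd

# Hypothetical frame standing in for df_conf.
df = pd.DataFrame({'Conference': [u'Ökologie', u'ICML']})

# iterrows yields copies, so this assignment typically does not
# modify df itself:
for index, row in df.iterrows():
    row[0] = row[0].encode('utf-8')

# The vectorised version replaces the whole column in one step:
df['Conference'] = df['Conference'].str.encode('utf-8')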

Does it happen in this line in the last part of the code?
df_if.ix[index,'IF1994'] = df2.ix[index,'Citation1992'] / (df_pubs.ix[index,1992]+df_pubs.ix[index,1993])
If so, try:
df_if.ix[index,u'IF1994'] = df2.ix[index,u'Citation1992'] / (df_pubs.ix[index,1992]+df_pubs.ix[index,1993])
That should work. DataFrame indexing with UTF-8 strings behaves strangely in Python 2 even when the script is declared with "# -*- coding: utf-8 -*-". Just prefix string literals with u when you use UTF-8 strings as dataframe column and index labels.
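As a minimal sketch (the frame and its non-ASCII label here are made up):
# -*- coding: utf-8 -*-
import pandas as pd

df2 = pd.DataFrame({u'Citation1992': [3]}, index=[u'Ökologie'])
# u-prefixed lookups match the unicode labels; plain byte strings may not.
print(df2.ix[u'Ökologie', u'Citation1992'])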

Related

unexpected character after line continuation character. Also to keep rows after floating point rows in pandas dataframe

I have a dataset in which I want to keep the row just after a floating-point value row and remove the other rows.
For example, a column of the dataframe looks like this:
17.3
Hi Hello
Pranjal
17.1
[aasd]How are you
I am fine[:"]
Live Free
So in this I want to preserve:
Hi Hello
[aasd]How are you
and remove the rest. I tried it with the following code, but an error showed up saying "unexpected character after line continuation character". Also, I don't know if this code will solve my purpose.
Dropping extra rows:
for ind in data.index:
    if re.search((([1-9][0-9]*\.?[0-9]*)|(\.[0-9]+))([Ee][+-]?[0-9]+)?, ind):
        ind += 1
    else:
        data.drop(ind)
Your regex has to be a string; you can't just write it like that:
re.search('(([1-9][0-9]*\.?[0-9]*)|(\.[0-9]+))([Ee][+-]?[0-9]+)?', ind):
Edit: but actually I think the rest of your code is wrong too.
What you really want is something more like this:
import pandas as pd

l = ['17.3',
     'Hi Hello',
     'Pranjal',
     '17.1',
     '[aasd]How are you',
     'I am fine[:"]',
     'Live Free']
data = pd.DataFrame(l, columns=['col'])
data[data.col.str.match('\d+\.\d*').shift(1) == True]
Logic:
If you have a dataframe with a column that is all string type (it won't work for mixed decimal and string types), you can find the decimal/int entries with the regex '\d+\.?\d*'. If you shift this mask down by one, it marks the entries just after the matches; use that to select the rows you want in your dataframe.
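To make the shift step concrete, here is a quick sketch of the intermediate mask on the same made-up data:
mask = data.col.str.match('\d+\.\d*')  # True where the row looks like a decimal
shifted = mask.shift(1)                # True one row after each decimal
print(pd.concat([data.col, mask, shifted], axis=1))
# 'Hi Hello' and '[aasd]How are you' carry the shifted Trues,
# so data[shifted == True] keeps exactly those rows.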

Search pandas column and return all elements (rows) that contain any (one or more) non-digit character

Seems pretty straightforward. The column contains numbers in general, but for some reason some of them have non-digit characters. I want to find all of them. I am using this code:
df_other_values.total_count.str.contains('[^0-9]')
but I get the following error:
AttributeError: Can only use .str accessor with string values, which use np.object_ dtype in pandas
So I tried this:
df_other_values = df_other.total_countvalues
df_other_values.total_count.str.contains('[^0-9]')
but get the following error:
AttributeError: 'DataFrame' object has no attribute 'total_countvalues'
So instead of going further down the rabbit hole, I was thinking there must be a way to do this without having to change my dataframe into an np.object. Please advise.
Thanks.
I believe you need to cast to strings first with astype and then filter by boolean indexing:
df1 = df[df_other_values.total_count.astype(str).str.contains('[^0-9]')]
Alternative solution with isnumeric:
df1 = df[~df_other_values.total_count.astype(str).str.isnumeric()]
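A quick sketch on hypothetical data showing both filters:
import pandas as pd

# Made-up column: mostly numeric, with one stray entry.
df = pd.DataFrame({'total_count': [123, 456, '78a', 90]})

# Rows containing any non-digit character:
print(df[df.total_count.astype(str).str.contains('[^0-9]')])

# The same rows via isnumeric:
print(df[~df.total_count.astype(str).str.isnumeric()])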

Python 2.7 - How to call individual columns from transposed csv file

I understand that the csv module exists; however, for my current project we are not allowed to use the module to read csv files.
My code is as follows:
table = []
for line in open("data.csv"):
    data = line.split(",")
    table.append(data)

transposed = [[table[j][i] for j in range(len(table))] for i in range(len(table[0]))]
rows = transposed[1][1:]
rows = [float(i) for i in rows]
I'm really new to Python, so this is probably a massively basic question; I've been scouring the internet all day and have struggled to find a solution. All I need is to be able to pull the data from any individual column so I can analyse it. Thanks
Your data is organized in a list of lists, where each sub-list represents a row. To better illustrate this I would avoid using list comprehensions, because they are more difficult to read. Additionally, I would avoid variable names like i and j and instead use more descriptive names like row or column. Here is a simple example of how I would accomplish this:
def read_csv():
    table = []
    with open("data.csv") as fileobj:
        for line in fileobj.readlines():
            data = line.strip().split(',')
            table.append(data)
    return table

def get_column_data(data, column_index):
    column_data = []
    for row in data:
        cell_data = row[column_index]
        column_data.append(cell_data)
    return column_data

data = read_csv()
get_column_data(data, column_index=2)  # example usage
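If, as in your original snippet, a column holds numbers under a header row, you can convert it the same way (the column index here is just an example):
# Skip the header cell, then convert the remaining values to floats.
values = [float(v) for v in get_column_data(data, column_index=1)[1:]]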

How to stop printing using if statement with openpyxl

I'm reading values from an Excel workbook and I'm trying to stop printing when the value of a specific cell equals a string. Here is my code:
import openpyxl

wb = openpyxl.load_workbook('data.xlsx')
for sheet in wb.worksheets:
    nrows = sheet.max_row
    ncolumns = sheet.max_column
    for rowNum in range(20, sheet.max_row):
        Sub = sheet.cell(row=rowNum, column=3).value
        Unit = sheet.cell(row=rowNum, column=6).value
        Concentration = sheet.cell(row=rowNum, column=9).value
        com = rowNum[3]
        if Sub == "User:":
            pass
        else:
            print(Sub, Concentration, Unit)
The problem is that the if statement doesn't work. When I use type(Sub), Python returns <type 'unicode'>.
Do you have any idea how to do it ?
Thanks
Sounds like your test is failing. All strings in Excel files are returned as unicode objects, but "User:" == u"User:". Maybe the cells you are looking at have some whitespace that isn't visible in your debug statement. In that case it's useful to embed the string in a list when printing it: print([Sub]).
Alternatively, and this looks to be the case, you are getting confused between Excel's 1-based indexing and Python's zero-based indexing. In your code the first cell to be looked at will be C20 (ws.cell(20, 3)), but rowNum[3] is actually D20.
I also recommend you try to avoid using Python's range and the max_column and max_row properties unless you really need them. In your case ws.get_squared_range() makes more sense, or, in openpyxl 2.4, ws.iter_rows(), which allows you to specify the known edges of a range of cells.
for row in ws.get_squared_range(ws.min_column, 20, ws.max_column, ws.max_row):
    Sub = row[2].value  # 3rd item
    Unit = row[5].value
    Concentration = row[8].value
In openpyxl 2.4, get_squared_range can be replaced with:
for row in ws.iter_rows(min_row=20):
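Putting the pieces together, a minimal sketch using the 2.4 API (the strip() guard is my addition to cover the possible-whitespace case above):
for row in ws.iter_rows(min_row=20):
    Sub = row[2].value            # column C (index 2 in the 0-based tuple)
    Unit = row[5].value           # column F
    Concentration = row[8].value  # column I
    # strip() guards against invisible whitespace around the cell text.
    if Sub is not None and Sub.strip() == u"User:":
        continue
    print(Sub, Concentration, Unit)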

TypeError during executemany() INSERT statement using a list of strings

I am trying to do a basic INSERT operation to a PostgreSQL database through Python via the Psycopg2 module. I have read a great many of the questions already posted regarding this subject, as well as the documentation, but I seem to have done something uniquely wrong and none of the fixes seem to work for my code.
# API call + JSON decoding here
x = 0
for item in ulist:
    idValue = list['members'][x]['name']
    activeUsers.append(str(idValue))
    x += 1

dbShell.executemany("""INSERT INTO slickusers (username) VALUES (%s)""", activeUsers)
The loop creates a list of strings that looks like this when printed:
['b2ong', 'dune', 'drble', 'drars', 'feman', 'got', 'urbo']
I am just trying to have the code INSERT these strings as 1 row each into the table.
The error specified when running is:
TypeError: not all arguments converted during string formatting
I tried changing the INSERT to:
dbShell.executemany("INSERT INTO slackusers (username) VALUES (%s)", (activeUsers,) )
But that seems like it's merely treating the entire list as a single string as it yields:
psycopg2.DataError: value too long for type character varying(30)
What am I missing?
First, the code you pasted:
x = 0
for item in ulist:
    idValue = list['members'][x]['name']
    activeUsers.append(str(idValue))
    x += 1
is not the right way to accomplish what you are trying to do.
First, list is the name of a Python built-in type and you shouldn't use it as a variable name. I am assuming you meant ulist.
If you really need access to the index of an item in Python, you can use enumerate:
for x, item in enumerate(ulist):
But the best way to do what you are trying to do is something like:
for item in ulist:  # or list['members']; your example is kinda broken here
    activeUsers.append(str(item['name']))
Your first try was:
['b2ong', 'dune', 'drble', 'drars', 'feman', 'got', 'urbo']
Your second attempt was:
(['b2ong', 'dune', 'drble', 'drars', 'feman', 'got', 'urbo'], )
What I think you want is:
[['b2ong'], ['dune'], ['drble'], ['drars'], ['feman'], ['got'], ['urbo']]
You could get this many ways:
dbShell.executemany("INSERT INTO slackusers (username) VALUES (%s)", [ [a] for a in activeUsers] )
or even better:
for item in ulist:  # or list['members']; your example is kinda broken here
    activeUsers.append([str(item['name'])])

dbShell.executemany("""INSERT INTO slickusers (username) VALUES (%s)""", activeUsers)
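For reference, a minimal end-to-end sketch of the fix; the connection string is a placeholder, and wrapping each name in a one-element tuple works just as well as a list:
import psycopg2

# Placeholder connection details, not from the original question.
conn = psycopg2.connect("dbname=test user=postgres")
cur = conn.cursor()

active_users = ['b2ong', 'dune', 'drble']
# executemany expects one parameter sequence per row, hence the tuples.
cur.executemany("INSERT INTO slickusers (username) VALUES (%s)",
                [(name,) for name in active_users])
conn.commit()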