Pandas HD5-query, where expression fails - python-2.7

I want to query a HDF5-file. I do
df.to_hdf(pfad,'df', format='table')
to write the dataframe on disc.
To read I use
hdf = pandas.HDFStore(pfad)
I have a list that contains numpy.datetime64 values called expirations and try to read the portion of the hd5 table into a dataframe, that has values between expirations[1] and expirations[0] in column "expiration". Column expiration entries have the format Timestamp('2002-05-18 00:00:00').
I use the following command:
df = hdf.select('df',
where=['expiration<expiration[1]','expiration>=expirations[0]'])
However, this fails and produces a value error:
ValueError: The passed where expression: [expiration=expirations[0]]
contains an invalid variable reference
all of the variable refrences must be a reference to
an axis (e.g. 'index' or 'columns'), or a data_column
The currently defined references are: index,columns

Can you try this code:
df = hdf.select('df', where='expiration < expirations[1] and expiration >= expirations[0]')
or, as a query:
df = hdf.query('expiration < #expirations[1] and expiration >= #expirations[0]')
Not sure which one fits best your case, I noticed you are trying to use 'where' to filter rows, without a string or a list, does it make sense ?

Related

Randomly set one-third of na's in a column to one value and the rest to another value

I'm trying to impute missing values in a dataframe df. I have a column A with 300 NaN's. I want to randomly set 2/3rd of it to value1 and the rest to value2.
Please help.
EDIT: I'm actually trying to this on dask, which does not support item assignment. This is what I have currently. Initially, I thought I'll try to convert all NA's to value1
da.where(df.A.isnull() == True, 'value1', df.A)
I got the following error:
ValueError: need more than 0 values to unpack
As the comment suggests, you can solve this with Series.where.
The following will work, but I cannot promise how efficient this is. (I suspect it may be better to produce a whole column of replacements at once with numpy.choice.)
df['A'] = d['A'].where(~d['A'].isnull(),
lambda df: df.map(
lambda x: random.choice(['value1', 'value1', x])))
explanation: if the value is not null (NaN), certainly keep the original. Where it is null, replace with the corresonding values of the dataframe produced by the first lambda. This maps values of the dataframe (chunks) to randomly choose the original value for 1/3 and 'value1' for others.
Note that, depending on your data, this likely has changed the data type of the column.

Datetime object through 'datetime.strptime is not iterable'

i have a csv file containing years of data, and i need to calculate the difference between the max date and the min date, i am facing a real problem in how can i determine the max value of dates.
So, i am doing this to convert my dates into datetime object
Temps = datetime.strptime(W['datum'][i]+' '+W['timestamp'][i],'%Y-%m-%d %H:%M:%S')
Printing this line, gives me the exact result i want, but when i try to extract the max values of these dates using this line of code :
start = max(Temps)
I got this error : datetime.strptime' object is not iterable
where am i mistaken ?
The expression
datetime.strptime(W['datum'][i]+' '+W['timestamp'][i],'%Y-%m-%d %H:%M:%S')
produces a single value (a scalar). When you assign it to Temps this variable become a scalar not a list. It contains only one value.
Then when you try to evaluate max(Temps) max is expecting to find something with multiple values as its argument but, unfortunately, it finds what Temps was assigned most recently.
This was a single value, which is not 'iterable'.

TypeError during executemany() INSERT statement using a list of strings

I am trying to just do a basic INSERT operation to a PostgreSQL database through Python via the Psycopg2 module. I have read a great many of the questions already posted regarding this subject as well as the documentation but I seem to have done something uniquely wrong and none of the fixes seem to work for my code.
#API CALL + JSON decoding here
x = 0
for item in ulist:
idValue = list['members'][x]['name']
activeUsers.append(str(idValue))
x += 1
dbShell.executemany("""INSERT INTO slickusers (username) VALUES (%s)""", activeUsers
)
The loop creates a list of strings that looks like this when printed:
['b2ong', 'dune', 'drble', 'drars', 'feman', 'got', 'urbo']
I am just trying to have the code INSERT these strings as 1 row each into the table.
The error specified when running is:
TypeError: not all arguments converted during string formatting
I tried changing the INSERT to:
dbShell.executemany("INSERT INTO slackusers (username) VALUES (%s)", (activeUsers,) )
But that seems like it's merely treating the entire list as a single string as it yields:
psycopg2.DataError: value too long for type character varying(30)
What am I missing?
First in the code you pasted:
x = 0
for item in ulist:
idValue = list['members'][x]['name']
activeUsers.append(str(idValue))
x += 1
Is not the right way to accomplish what you are trying to do.
first list is a reserved word in python and you shouldn't use it as a variable name. I am assuming you meant ulist.
if you really need access to the index of an item in python you can use enumerate:
for x, item in enumerate(ulist):
but, the best way to do what you are trying to do is something like
for item in ulist: # or list['members'] Your example is kinda broken here
activeUsers.append(str(item['name']))
Your first try was:
['b2ong', 'dune', 'drble', 'drars', 'feman', 'got', 'urbo']
Your second attempt was:
(['b2ong', 'dune', 'drble', 'drars', 'feman', 'got', 'urbo'], )
What I think you want is:
[['b2ong'], ['dune'], ['drble'], ['drars'], ['feman'], ['got'], ['urbo']]
You could get this many ways:
dbShell.executemany("INSERT INTO slackusers (username) VALUES (%s)", [ [a] for a in activeUsers] )
or event better:
for item in ulist: # or list['members'] Your example is kinda broken here
activeUsers.append([str(item['name'])])
dbShell.executemany("""INSERT INTO slickusers (username) VALUES (%s)""", activeUsers)

KeyError Pandas Dataframe (encoding index)

I'm running the code below. It creates a couple of dataframes that takes a column in another dataframe that has a list of Conference Names, as its index.
df_conf = pd.read_sql("select distinct Conference from publications where year>=1991 and length(conference)>1 order by conference", db)
for index, row in df_conf.iterrows():
row[0]=row[0].encode("utf-8")
df2= pd.DataFrame(index=df_conf['Conference'], columns=['Citation1991','Citation1992'])
df2 = df2.fillna(0)
df_if= pd.DataFrame(index=df_conf['Conference'], columns=['IF1994','IF1995'])
df_if = df_if.fillna(0)
df_pubs=pd.read_sql("select Conference, Year, count(*) as totalPubs from publications where year>=1991 group by conference, year", db)
for index, row in df_pubs.iterrows():
row[0]=row[0].encode("utf-8")
df_pubs= df_pubs.pivot(index='Conference', columns='Year', values='totalPubs')
df_pubs.fillna(0)
for index, row in df2.iterrows():
df_if.ix[index,'IF1994'] = df2.ix[index,'Citation1992'] / (df_pubs.ix[index,1992]+df_pubs.ix[index,1993])
The last line keeps giving me the following error:
KeyError: 'Analyse dynamischer Systeme in Medizin, Biologie und \xc3\x96kologie'
Not quite sure what I'm doing wrong. I tried encoding the indexes. It won't work. I even tried .at still wont' work.
I know it has to do with encoding, as it always stops at indexes with non-ascii characters.
I'm using python 2.7
I think the problem with this:
for index, row in df_conf.iterrows():
row[0]=row[0].encode("utf-8")
is that it may or may not work, I'm surprised it didn't raise a warning.
Besides that it's much quicker to use the vectorised str method to encode the series:
df_conf['col_name'] = df_conf['col_name'].str.encode('utf-8')
If needed you can also encode the index in a similar fashion:
df.index = df.index.str.encode('utf-8')
It happens in the line in the last part of the code?
df_if.ix[index,'IF1994'] = df2.ix[index,'Citation1992'] / (df_pubs.ix[index,1992]+df_pubs.ix[index,1993])
if then, try
df_if.ix[index,u'IF1994'] = df2.ix[index,u'Citation1992'] / (df_pubs.ix[index,1992]+df_pubs.ix[index,1993])
It would work. Dataframe indexing in UTF8 works in strange way even though the script is declared with "# -- coding:utf8 --". Just put "u" in utf8 strings when you use dataframe columns and index with utf8 strings

Adding data to a Pandas dataframe

I have a dataframe that contains Physician_Profile_City, Physician_Profile_State and Physician_Profile_Zip_Code. I ultimately want to stratify an analysis based on state, but unfortunately not all of the Physician_Profile_States are filled in. I started looking around to try and figure out how to fill in the missing States. I came across the pyzipcode module which can take as an input a zip code and returns the state as follows:
In [39]: from pyzipcode import ZipCodeDatabase
zcdb = ZipCodeDatabase()
zcdb = ZipCodeDatabase()
zipcode = zcdb[54115]
zipcode.state
Out[39]: u'WI'
What I'm struggling with is how I would iterate through the dataframe and add the appropriate "Physician_Profile_State" when that variable is missing. Any suggestions would be most appreciated.
No need to iterate if the form of the data is a dict then you should be able to perform the following:
df['Physician_Profile_State'] = df['Physician_Profile_Zip_Code'].map(zcdb)
Otherwise you can call apply like so:
df['Physician_Profile_State'] = df['Physician_Profile_Zip_Code'].apply(lambda x: zcdb[x].state)
In the case where the above won't work as it can't generate a Series to align with you df you can apply row-wise passing axis=1 to the df:
df['Physician_Profile_State'] = df[['Physician_Profile_Zip_Code']].apply(lambda x: zcdb[x].state, axis=1)
By using double square brackets we return a df allowing you to pass the axis param