My original data looks like this:
SUBBASIN HRU HRU_SLP OV_N
1 1 0.016155144 0.15
1 2 0.015563287 0.14
2 1 0.010589782 0.15
2 2 0.011574839 0.14
3 1 0.013865396 0.15
3 2 0.01744597 0.15
3 3 0.018983217 0.14
3 4 0.013890315 0.05
3 5 0.011792533 0.05
I need to modify value of OV_N for each SUBBASIN number:
hru = pd.read_csv('hru.csv')
for i in hru.OV_N:
    hru.ix[hru.SUBBASIN.isin([76,65,64,72,81,84,60,46,37,1,2]), 'OV_N'] = i*(1+df21.value[12])
    hru.ix[hru.SUBBASIN.isin([80,74,75,66,55,53,57,63,61,41,38,27,26,45,40,34,35,31,33,21,20,17,18,19,23,14,13,8,7,11,6,4,3,5,12]), 'OV_N'] = i*(1+df23.value[12])
    hru.ix[hru.SUBBASIN.isin([85,58,78,54,59,51,52,30,28,16,15,77,79,71,70,86,73,68,69,56,67,62,82,87,83,91,89,90,43,36,39,47,32,49,42,48,50,49,29,22,24,25,9,10]), 'OV_N'] = i*(1+df56.value[12])
    hru.ix[hru.SUBBASIN.isin([92,88,95,94,93]), 'OV_N'] = i*(1+df58.value[12])
where df21.value[12] is a value from a txt file
The code results in an infinite value of OV_N for all subbasins, so I assume the loop over the file runs multiple times, but I can't find the mistake, and this code worked before with a different number of subbasins.
It is generally better not to loop over and index into the rows of a pandas DataFrame. Transforming the DataFrame with column operations is the more idiomatic pandas approach. A pandas DataFrame can be thought of as a zipped combination of pandas Series: each column is its own pandas Series, all sharing the same index. Operations can be applied to one or more pandas Series to create a new Series that shares the same index. Operations can also combine a Series with a one-dimensional numpy array to create a new Series. It is helpful to understand pandas indexing; however, this answer will just use sequential integer indexing.
To modify the value of OV_N for each SUBBASIN number:
Initialize the hru DataFrame by reading it in from the hru.csv as in the original question. Here we initialize it with the data given in the question.
import numpy as np
import pandas as pd
hru = pd.DataFrame({
    'SUBBASIN': [1,1,2,2,3,3,3,3,3],
    'HRU': [1,2,1,2,1,2,3,4,5],
    'HRU_SLP': [0.016155144,0.015563287,0.010589782,0.011574839,0.013865396,0.01744597,0.018983217,0.013890315,0.011792533],
    'OV_N': [0.15,0.14,0.15,0.14,0.15,0.15,0.14,0.05,0.05]})
Create one separate pandas Series that gathers all the values from the various DataFrames (i.e. df21, df23, df56, and df58) into one place. This will be used to look up values by index. Let's call it subbasin_multiplier_ds, and let's assume values of 21, 23, 56, and 58, respectively, were read from the txt file. Do replace these with the real values read in from the txt file.
subbasin_multiplier_ds = pd.Series([21]*96)
subbasin_multiplier_ds[[80,74,75,66,55,53,57,63,61,41,38,27,26,45,40,
    34,35,31,33,21,20,17,18,19,23,14,13,8,7,11,6,4,3,5,12]] = 23
subbasin_multiplier_ds[[85,58,78,54,59,51,52,30,28,16,15,77,79,71,70,
    86,73,68,69,56,67,62,82,87,83,91,89,90,43,36,39,47,32,49,42,48,50,
    49,29,22,24,25,9,10]] = 56
subbasin_multiplier_ds[[92,88,95,94,93]] = 58
Replace OV_N in hru DataFrame based on columns in the DataFrame and a lookup in subbasin_multiplier_ds by index.
hru['OV_N'] = hru['OV_N'] * (1 + subbasin_multiplier_ds[hru['SUBBASIN']].values)
The .values above converts the looked-up Series into a numpy array, so the multiplication aligns positionally with hru's rows rather than by the looked-up Series' own index, and the expected results are achieved. If you want to experiment, try removing .values to see what happens.
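As an aside (not part of the original answer), the same lookup can be written with Series.map, which brings the multipliers back already aligned with hru's own index, so no .values conversion is needed. This is an equivalent alternative to the replacement line above, not an additional step:
# Equivalent sketch using Series.map: each SUBBASIN value is looked up in
# subbasin_multiplier_ds and the result comes back aligned with hru's index.
hru['OV_N'] = hru['OV_N'] * (1 + hru['SUBBASIN'].map(subbasin_multiplier_ds))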
I am struggling to find an efficient way of retrieving the solution to an optimization problem. The solution consists of around 200K variables that I would like in a pandas DataFrame. After searching online, the only approach I found for accessing the variables was through a for loop, which looks something like this:
instance = M.create_instance('input.dat') # reading in a datafile
results = opt.solve(instance, tee=True)
results.write()
instance.solutions.load_from(results)
for v in instance.component_objects(Var, active=True):
    print("Variable", v)
    varobject = getattr(instance, str(v))
    for index in varobject:
        print("  ", index, varobject[index].value)
I know I can use this for loop to store them in a dataframe but this is pretty inefficient.
I found out how to access the indexes by using
import pandas as pd
index = pd.DataFrame(instance.component_objects(Var, active=True))
But I don't know how to get the solution.
There is actually a very simple and elegant solution: use pandas.DataFrame.from_dict combined with Var.extract_values().
from pyomo.environ import *
import pandas as pd
m = ConcreteModel()
m.N = RangeSet(5)
m.x = Var(m.N, rule=lambda _, el: el**2) # x = [1,4,9,16,25]
df = pd.DataFrame.from_dict(m.x.extract_values(), orient='index', columns=[str(m.x)])
print(df)
yields
x
1 1
2 4
3 9
4 16
5 25
Note that for Var we can use both get_values() and extract_values(); they seem to do the same thing. For Param there is only extract_values().
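For completeness, here is a small sketch (my addition, not from the original answer) showing the same from_dict pattern applied to a Param on the model defined above; the initialize dict is made up purely for illustration:
m.p = Param(m.N, initialize={i: 10 * i for i in m.N})  # hypothetical parameter
df_p = pd.DataFrame.from_dict(m.p.extract_values(), orient='index', columns=[str(m.p)])
print(df_p)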
Of course you can use instance.some_var.pprint() to print it on the screen.
But if you have a variable indexed by a large set, you can also write it to a separate file. The following code writes the result to a .txt file:
f = open('Result.txt', 'a')
instance.some_var.pprint(f)
f.close()
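A slightly safer variant of the same idea (my addition, not from the original answer) is to use a context manager, so the file is closed even if pprint raises an error:
# Same output as above, but the file handle is closed automatically.
with open('Result.txt', 'a') as f:
    instance.some_var.pprint(f)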
I had the same issue as Jasper and tried the suggested solutions. By doing so I noticed that the part writing the results takes most of the time. Maybe this is also true in Jasper's case.
results.write()
instance.solutions.load_from(results)
So I suggest suppressing these two lines if you can do so. Maybe someone has a suggestion on how to speed this up? Or an alternative method.
Also, I saw that in this post (Pyomo: Save results to CSV files) the "for loop" method is recommended. A pyomo developer states: "I think it's possible in option 2 for the indices and the variable slice to be iterated over in a different order which would invalidate your resulting array."
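If you do stick with the "for loop" method, one way to keep it reasonably fast (a sketch of my own, assuming instance and the imports from the question) is to collect plain records in the loop and build the DataFrame once at the end, instead of growing it row by row:
records = []
for v in instance.component_objects(Var, active=True):
    for index in v:
        # one record per variable index; v[index].value is the solved value
        records.append((str(v), index, v[index].value))
df = pd.DataFrame(records, columns=['variable', 'index', 'value'])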
For simplicity of code and to largely avoid for-loops, I found the pyomoio module in the urbs project, which has taken over the slightly deprecated code of pandaspyomo.py. It relies on each pyomo object's iteritems() method and handles multiple dimensions elegantly. It can extract sets, parameters, and variables as pandas objects.
If I set up a small pyomo model
from pyomo.environ import *
import pyomoio as po
import pandas as pd
# Define a model with 200k values
m = ConcreteModel()
m.ix = RangeSet(200000)
def idem(model, i):
    return i
m.a = Param(m.ix, rule=idem)
I can read in the parameter with just one line of code
%%timeit
a_po = po.get_entity(m, 'a')
# 110 ms ± 1.88 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
However, if I compare it to the approach in the original question, it is not faster; it is even a little slower:
%%timeit
val = []
ix = []
varobject = getattr(m, 'a')
for index in varobject:
    ix.append(index)
    val.append(varobject[index])
a = pd.Series(index=ix, data=val)
# 92.5 ms ± 1.57 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
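For what it's worth, the extract_values() route from the accepted answer also avoids the explicit loop here; a quick sketch of my own (not part of the original timing comparison):
# Build a Series straight from the Param's {index: value} dict.
a = pd.Series(m.a.extract_values())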
I have a pandas dataframe of timeseries weight data for over 100 scales, identified by "short_id". I am having trouble figuring out the best way to apply a moving filter to each scale's weight data to remove outliers.
Here is a sample of the data:
Out[159]:
published_at short_id weight
0 2017-11-08 16:03:36 INT16 50.35
1 2017-11-08 16:02:43 INT1 45.71
2 2017-11-08 16:02:10 NOT11 35.52
3 2017-11-08 16:01:07 INT7 50.03
4 2017-11-08 16:00:23 INT3 47.04
Converting the dataframe into a dictionary per "short_id" and applying a moving filter per dict item did not work out, nor did converting the data from "long" to "wide" format (using pandas.pivot_table).
It seems like it should be possible in one line using groupby and then applying the rolling function:
df['MovingFilt'] = df.groupby('short_id')['weight'].apply(pd.rolling(6).median())
but I receive an error: TypeError: incompatible index of inserted column with frame index... This is because sometimes there is weight data at the same time for certain scales, but not usually.
Is this the best way to approach the problem? Creating new dataframes per 'short_id' and then using the following seems not pythonic enough, although it runs fine:
INT16['MovingFilt'] = pd.Series.rolling(INT16['weight'], window=6, center=True).median()
The error is because you wrote the groupby wrong
df['MovingFilt'] = df.groupby('short_id')['weight'].rolling(6).median().values
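For what it's worth (this explanation and alternative are mine, not part of the original answer), the grouped rolling result carries a (short_id, original index) MultiIndex that pandas cannot align with the frame's plain index, which is what triggers the TypeError; .values sidesteps alignment by assigning the raw array. Dropping the group level instead realigns by the original row index, which is safer if rows from different scales are interleaved, since the grouped result comes back ordered by group:
# Drop the short_id level so the result realigns with df's own index.
df['MovingFilt'] = (df.groupby('short_id')['weight']
                      .rolling(6).median()
                      .reset_index(level=0, drop=True))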
I am working with a code in Pandas that involves reading a lot of files and then performing various operations on each file inside a loop (which iterates over a file list).
I am trying to convert this to a Dask-based approach instead of a Pandas-based approach and have the following attempt so far - I am new to Dask and need to ask about whether this is a reasonable approach.
Here is what the input data looks like:
A X1 X2 X3 A_d S_d
0 1.0 0.475220 0.839753 0.872468 1 1
1 2.0 0.318410 0.940817 0.526758 2 2
2 3.0 0.053959 0.056407 0.169253 3 3
3 4.0 0.900777 0.307995 0.689259 4 4
4 5.0 0.670465 0.939116 0.037865 5 5
Here is the code:
import dask.dataframe as dd
import numpy as np; import pandas as pd
def my_func(df, r):  # perform representative calculations
    q = df.columns.tolist()
    df2 = df.loc[:, q[1:]] / df.loc[:, q[1:]].sum()
    df2['A'] = df['A']
    df2 = df2[(df2['A'] >= r[0]) & (df2['A'] <= r[1])]
    c = q[1:-2]
    A = df2.loc[:, c].sum()
    tx = df2.loc[:, c].min() * df2.loc[:, c].max()
    return A - tx
list_1 = []
for j in range(1, 13):
    df = dd.read_csv('Test_file.csv')
    N = my_func(df, [751.7, 790.4])  # perform calculations
    out = ['X' + str(j) + '_2', df['A'].min()] + N.compute().tolist()
    list_1.append(out)
df_f = pd.DataFrame(list_1)
my_func returns a Dask Series N. Currently, I must .compute() the Dask Series before I can convert it into a list. I am having trouble overcoming this.
Is it possible to vertically append N (which is a Dask Series) as a row to a blank Dask DF? E.g. in Pandas, I tend to do this: df_N = pd.DataFrame() would go outside the for loop and then something like df_N = pd.concat([df_N, N], axis=0). This would allow a Dask DF to be built up in the for loop. After that (outside the loop), I could easily just horizontally concatenate the built-up Dask DF to pd.DataFrame(list_1).
Another approach is to create a single-row Dask DF from the Dask Series N, then vertically concatenate this single-row DF to a blank Dask DF (that was created outside the loop). Is it possible in Dask to create a single-row Dask DataFrame from a Series?
Additional Information (if needed):
In my real code, I am reading from a *.csv file inside a loop. For this reason, when I generated a sample dataset, I wrote it to a *.csv file in order to use dd.read_csv() inside the loop.
df2['A'] = df['A'] - this line is needed since the line above it omits column A (during the normalization of each column by its sum) and produces a new DataFrame; df2['A'] = df['A'] adds column A back to the new DataFrame.
You should never append rows one at a time to either a Pandas dataframe or a Dask dataframe. This is very inefficient. Instead, it is better to collect many pandas/dask dataframes together and then call the pd.concat or dd.concat function once.
Also, I note that you are calling compute within your for loop. It is recommended to call compute only after you have set up your entire computation, if possible. Otherwise you are probably not getting much parallelism.
Note: I haven't actually gone through the trouble of understanding your code. I'm just responding to the questions at the end. Hopefully someone else comes along with a more comprehensive answer.
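To make the advice about compute concrete, here is a rough sketch of my own (assuming my_func and the file from the question) that keeps every iteration lazy and triggers a single compute at the end:
import dask
import dask.dataframe as dd

lazy_results = []
for j in range(1, 13):
    df = dd.read_csv('Test_file.csv')
    # my_func returns a lazy Dask Series; nothing is computed yet
    lazy_results.append(my_func(df, [751.7, 790.4]))

# One compute for all twelve results, so Dask can share work and parallelize.
computed = dask.compute(*lazy_results)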
I have a dataset whose dimensions are around 2,000 (rows) x 120,000 (columns).
And I'd like to pick out certain columns (~8,000 columns).
So the output file dimensions would be 2,000 (rows) x 8,000 (columns).
Here is the code, written by a good man (I found it on Stack Overflow but I am sorry I have forgotten his name).
import pandas as pd
df = pd.read_csv('...mydata.csv')
my_query = pd.read_csv('...myquery.csv')
df[my_query['Name'].unique()].to_csv('output.csv')
However, the result shows a MemoryError in my console, which means the code may not work quite well.
So does anyone know how to improve the code with a more efficient way to select certain columns?
I think I found your source.
So, my solution uses read_csv with these arguments:
iterator=True - if True, return a TextFileReader object, to enable reading a file into memory piece by piece
chunksize=1000 - the number of rows to read per chunk; causes a TextFileReader object to be returned
usecols=subset - a subset of columns to return, resulting in much faster parsing time and lower memory usage
I filter the large dataset with usecols - I use only a (2 000, 8 000) dataset instead of (2 000, 120 000).
import pandas as pd

# read subset from csv and remove duplicate indices
subset = pd.read_csv('8kx1.csv', index_col=[0]).index.unique()
print(subset)

# use subset as filter of columns
tp = pd.read_csv('input.csv', iterator=True, chunksize=1000, usecols=subset)
df = pd.concat(tp, ignore_index=True)
print(df.head())
print(df.shape)

# write to csv in chunks
df.to_csv('output.csv', chunksize=1000)
I use this snippet for testing:
import pandas as pd
import io
temp=u"""A,B,C,D,E,F,G
1,2,3,4,5,6,7"""
temp1=u"""Name
B
B
C
B
C
C
E
F"""
subset = pd.read_csv(io.StringIO(temp1), index_col=[0]).index.unique()
print(subset)

# use subset as filter of columns
df = pd.read_csv(io.StringIO(temp), usecols=subset)
print(df.head())
print(df.shape)
I am very new to programming and am working with Python. For a work project I am trying to read several .csv files, convert them to DataFrames, concatenate some of the fields into one for a column header, and then append all of the DataFrames into one big DataFrame. I have searched extensively on StackOverflow as well as in other resources, but I have not been able to find an answer. Here is the code I have thus far, along with some abbreviated output:
import pandas as pd
import glob
# Read a directory of files to a list
csvlist = []
for f in glob.glob("AssayCerts/*"):
    csvlist.append(f)
csvlist
['AssayCerts/CH09051590.csv', 'AssayCerts/CH09051591.csv', 'AssayCerts/CH14158806.csv', 'AssayCerts/CH14162453.csv', 'AssayCerts/CH14186004.csv']
# Read .csv files and convert to DataFrames
dflist = []
for csv in csvlist:
    df = pd.read_csv(filename, header=None, skiprows=7)
    dflist.append(df)
dflist
[ 0 1 2 3 4 5 \
0 NaN Au-AA23 ME-ICP41 ME-ICP41 ME-ICP41 ME-ICP41
1 SAMPLE Au Ag Al As B
2 DESCRIPTION ppm ppm % ppm ppm
#concatenates the cells in the first three rows of the last dataframe; need to apply this to all of the dataframes.
for df in dflist:
    column_names = df.apply(lambda x: str(x[1]) + '-' + str(x[2]) + ' - ' + str(x[0]), axis=0)
column_names
0 SAMPLE-DESCRIPTION - nan
1 Au-ppm - Au-AA23
2 Ag-ppm - ME-ICP41
3 Al-% - ME-ICP41
I am unable to apply the last operation across all of the DataFrames. It seems I can only get it to apply to the last DataFrame in my list. Once I get past this point I will have to append all of the DataFrames to form one large DataFrame.
As Andy Hayden mentions in his comment, the reason your loop only appears to work on the last DataFrame is that you just keep assigning the result of df.apply( ... ) to column_names, which gets written over each time. So at the end of the loop, column_names always contains the results from the last DataFrame in the list.
But you also have some other problems in your code. In the loop that begins for csv in csvlist:, you never actually reference csv - you just reference filename, which doesn't appear to be defined. And dflist just appears to have one DataFrame in it anyway.
As written in your problem, the code doesn't appear to work. I'd advise posting the real code that you're using, and only what's relevant to your problem (i.e. if building csvlist is working for you, then you don't need to show it to us).
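For reference, here is a sketch of what the loop might look like once the csv loop variable is actually used and each DataFrame gets its combined header before a single concat at the end. This fills in details the question doesn't spell out (for example, dropping the three header rows), so treat it as an assumption about the intent rather than a definitive fix:
dflist = []
for csv in csvlist:
    df = pd.read_csv(csv, header=None, skiprows=7)  # use the loop variable, not filename
    # build the combined header from the first three rows, as in the question
    column_names = df.apply(lambda x: str(x[1]) + '-' + str(x[2]) + ' - ' + str(x[0]), axis=0)
    df.columns = column_names
    df = df.iloc[3:]  # assumption: drop the rows that were folded into the header
    dflist.append(df)

big_df = pd.concat(dflist, ignore_index=True)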