I am trying to round all values in this dataframe. However, the pandas round() method explodes my dataframe by a factor of roughly 50: from 150 rows to 7,518 rows.
It may be that there is something odd with the data in the dataframe, but then again, one would not expect a simple rounding function to do this.
Below, I replicate the error using 1) simulated data and 2) the data that actually produces the error.
This results in 150 rows, which is the correct number:
import pandas as pd
import numpy as np

# Simulated data: 150 rows of random floats plus a label column used as the index
df = pd.DataFrame(np.random.random([150, 4]), columns=['A', 'B', 'C', 'D'])
df["cat"] = "MID"
df.loc[:399, ["cat"]] = "LOW"
df.iloc[-400:, -1] = "HI"
df.cat.value_counts()

df.set_index("cat", inplace=True)
df.round(3)
Using the data from my Dropbox folder, the round function produces a whopping 7,518 rows:
dfb = pd.read_pickle('dfna.pkl')
dfb.round(3)
This is strange. I solved it for now using this rather ugly line:
dfb = dfb.reset_index().round({'A': 3, 'B': 3, 'C': 3, 'D': 3}).set_index('tricile')
However, this is not ideal, given that pandas' round method acts in mysterious ways and may affect future programs.
I think it is a bug in round with a duplicated CategoricalIndex, so I created pandas issue 21809 and pandas issue 21810.
A solution similar to yours is:
print (dfb.reset_index().round(3).set_index('tricile'))
Or remove the CategoricalIndex:
dfb.index = dfb.index.astype(str)
print (dfb.round(3))
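For context, here is a minimal sketch that builds a DataFrame with a duplicated CategoricalIndex similar in shape to the pickled data (the index name 'tricile' and the category labels are assumptions); on pandas versions affected by the issue, round() inflates the row count, while casting the index to str avoids it:
import numpy as np
import pandas as pd

# Assumed shape: 150 rows of floats with a duplicated CategoricalIndex named 'tricile'
idx = pd.CategoricalIndex(["LOW", "MID", "HI"] * 50, name="tricile")
dfb = pd.DataFrame(np.random.random([150, 4]), columns=list("ABCD"), index=idx)

print(dfb.shape)           # (150, 4)
print(dfb.round(3).shape)  # can be much larger on pandas versions hit by the bug

# Workaround: cast the index to plain strings before rounding
dfb.index = dfb.index.astype(str)
print(dfb.round(3).shape)  # (150, 4)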
I've been trying to get the nlargest rows per group by following the method from this question. The solution to that question is correct up to a point.
In this example, I group by column A and want to return the rows of C and D based on the top two values in B.
For some reason the index of grp_df is multilevel and includes both A and the original index of ddf.
I was hoping to simply reset_index() and drop the unwanted index and just keep A, but I get the following error:
ValueError: The columns in the computed data do not match the columns in the provided metadata
Here is a simple example reproducing the error:
import numpy as np
import dask.dataframe as dd
import pandas as pd
np.random.seed(42)
df = pd.DataFrame(np.random.randint(0,100,size=(100, 4)), columns=list('ABCD'))
ddf = dd.from_pandas(df, npartitions=3)
grp_df = ddf.groupby('A')[['B', 'C']].apply(lambda x: x.nlargest(2, columns=['B']),
                                            meta={"B": 'f8', "C": 'f8'})
# Print is successful and results are correct
print(grp_df.head())
grp_df = grp_df.reset_index()
# Print is unsuccessful and shows error below
print(grp_df.head())
I found an approach for a solution here.
The following code now allows reset_index() to work and gets rid of the original ddf index. I'm still not sure why the original ddf index came through the groupby in the first place, though.
meta = pd.DataFrame(columns=['B', 'C'], dtype=int, index=pd.MultiIndex(levels=[[], []], codes=[[], []], names=['A', None]))
grp_df = ddf.groupby('A')[['B','C']].apply(lambda x: x.nlargest(2, columns=['B']), meta=meta)
grp_df = grp_df.reset_index().drop('level_1', axis=1)
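As a quick sanity check (a sketch under the same setup), computing the result should now show only the group key A alongside B and C:
# With the explicit MultiIndex meta, reset_index() no longer raises, and only
# 'A', 'B' and 'C' remain after dropping the leftover 'level_1' column.
result = grp_df.compute()
print(result.columns.tolist())  # expected: ['A', 'B', 'C']
print(result.head())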
I am struggling to find an efficient way of retrieving the solution to an optimization problem. The solution consists of around 200K variables that I would like in a pandas DataFrame. After searching online, the only approach I found for accessing the variables was through a for loop, which looks something like this:
instance = M.create_instance('input.dat') # reading in a datafile
results = opt.solve(instance, tee=True)
results.write()
instance.solutions.load_from(results)
for v in instance.component_objects(Var, active=True):
    print("Variable", v)
    varobject = getattr(instance, str(v))
    for index in varobject:
        print("  ", index, varobject[index].value)
I know I can use this for loop to store them in a dataframe but this is pretty inefficient.
I found out how to access the indexes by using
import pandas as pd
index = pd.DataFrame(instance.component_objects(Var, active=True))
But I don't know how to get the solution values.
There is actually a very simple and elegant solution, using the method pandas.DataFrame.from_dict combined with the Var.extract_values() method.
from pyomo.environ import *
import pandas as pd
m = ConcreteModel()
m.N = RangeSet(5)
m.x = Var(m.N, rule=lambda _, el: el**2) # x = [1,4,9,16,25]
df = pd.DataFrame.from_dict(m.x.extract_values(), orient='index', columns=[str(m.x)])
print(df)
yields
x
1 1
2 4
3 9
4 16
5 25
Note that for Var we can use both get_values() and extract_values(); they seem to do the same thing. For Param there is only extract_values().
Of course you can use instance.some_var.pprint() to print it on the screen.
But if you have a variable indexed by a large set, you can also write it to a separate file. The following code writes the result to a .txt file:
f = open('Result.txt', 'a')
instance.some_var.pprint(f)
f.close()
I had the same issue as Jasper and tried the suggested solutions. By doing so, I noticed that the part writing the results takes most of the time. Maybe this is also true in Jasper's case.
results.write()
instance.solutions.load_from(results)
So I suggest suppressing these two lines if you can. Maybe someone has a suggestion for how to speed this up, or an alternative method.
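For example, here is a sketch of what I mean; note that passing load_solutions=True to solve() (it is the default in recent Pyomo versions) is an assumption you should check against your release:
# Let solve() load the solution into the instance directly, skipping
# results.write() and instance.solutions.load_from(results).
results = opt.solve(instance, tee=True, load_solutions=True)

# Then pull the values straight into a DataFrame as shown above
# (instance.some_var stands in for whatever variable you need).
df = pd.DataFrame.from_dict(instance.some_var.extract_values(), orient='index',
                            columns=[str(instance.some_var)])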
Also, I saw that in this post (Pyomo: Save results to CSV files) the "for loop" method is recommended. A Pyomo developer states: "I think it's possible in option 2 for the indices and the variable slice to be iterated over in a different order which would invalidate your resulting array."
For simplicity of code, and to largely avoid for loops, I found the pyomoio module in the urbs project, which has taken over the slightly deprecated code of pandaspyomo.py. It relies on each pyomo object's iteritems() method and handles multiple dimensions elegantly. It can extract sets, parameters, and variables as pandas objects.
If I set up a small pyomo model
from pyomo.environ import *
import pyomoio as po
import pandas as pd
# Define a model with 200k values
m = ConcreteModel()
m.ix = RangeSet(200000)
def idem(model, i):
    return i

m.a = Param(m.ix, rule=idem)
I can read in the parameter with just one line of code
%%timeit
a_po = po.get_entity(m, 'a')
# 110 ms ± 1.88 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
However, if I compare it to the approach in the original question, it is not faster; it is even a little slower:
%%timeit
val = []
ix = []
varobject = getattr(m, 'a')
for index in varobject:
    ix.append(index)
    val.append(varobject[index])
a = pd.Series(index=ix, data=val)
# 92.5 ms ± 1.57 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
I edited and saved a text file, "fullname", on my Windows 7 computer.
I ran the following two lines of code immediately after saving the edits to "fullname", and I expected both of the following lines of code to return almost the same number of seconds since the epoch:
print str(os.path.getmtime(fullname))
print str(time.mktime(t.timetuple()))
The second line of code was borrowed from How to convert a Python datetime object to seconds
The results were not even close:
"1494082110.0"
"1319180400.0"
I would like to know why the results were not close.
My ultimate goal is to know how to generate a float timestamp matching a calendar date of my choosing, for use in the context of:
win32file.SetFileTime(handle, CreatedTime , AccessTime , WrittenTime )
Any help in understanding these issues would be much appreciated.
You need to compare the current time with the time at which you saved the file. In this code I save a file, then I get the current time in t and display it, then I get the modification time for the file and display that. You may note that the two times differ by less than half a minute.
>>> import datetime
>>> import time
>>> import os
>>> fullname = 'temp.txt'
>>> open('temp.txt', 'w').write('something')
9
>>> t = datetime.datetime.now()
>>> time.mktime(t.timetuple())
1502039202.0
>>> os.path.getmtime(fullname)
1502039187.4629886
I notice too that,
>>> datetime.datetime.fromtimestamp(1319180400)
datetime.datetime(2011, 10, 21, 3, 0)
In other words, the second number in your question corresponds to a date well before you posted your question.
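As for the ultimate goal of generating a float timestamp for a calendar date of your choosing, here is a minimal sketch (the date itself is just an arbitrary example):
import datetime
import time

# Pick a calendar date and convert it to seconds since the epoch (local time)
chosen = datetime.datetime(2017, 5, 6, 14, 30, 0)
chosen_ts = time.mktime(chosen.timetuple())
print(chosen_ts)  # a float comparable to os.path.getmtime(), usable as a file time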
How do I unit test Python dataframes?
I have functions whose input and output are dataframes. Almost every function I have does this. Now if I want to unit test this, what is the best method of doing it? It seems like a lot of effort to create a new dataframe (with values populated) for every function.
Are there any materials you can refer me to? Should you write unit tests for these functions?
While Pandas' test functions are primarily used for internal testing, NumPy includes a very useful set of testing functions that are documented here: NumPy Test Support.
These functions compare NumPy arrays, but you can get the array that underlies a Pandas DataFrame using the values property. You can define a simple DataFrame and compare what your function returns to what you expect.
One technique you can use is to define one set of test data for a number of functions. That way, you can use Pytest Fixtures to define that DataFrame once, and use it in multiple tests.
In terms of resources, I found this article on Testing with NumPy and Pandas to be very useful. I also did a short presentation about data analysis testing at PyCon Canada 2016: Automate Your Data Analysis Testing.
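Here is a minimal sketch of this approach; the function under test (double_columns) and the column names are made up for illustration:
import numpy as np
import pandas as pd
import pytest

@pytest.fixture
def input_df():
    # One shared input DataFrame, reused across multiple tests
    return pd.DataFrame({"x": [1.0, 2.0, 3.0], "y": [4.0, 5.0, 6.0]})

def double_columns(df):
    # Stand-in for the function under test
    return df * 2

def test_double_columns(input_df):
    result = double_columns(input_df)
    expected = np.array([[2.0, 8.0], [4.0, 10.0], [6.0, 12.0]])
    # Compare the underlying NumPy arrays
    np.testing.assert_allclose(result.values, expected)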
You can use pandas' testing functions:
They give you more flexibility to compare your result with the expected result in different ways.
For example:
df1=pd.DataFrame({'a':[1,2,3,4,5]})
df2=pd.DataFrame({'a':[6,7,8,9,10]})
expected_res=pd.Series([7,9,11,13,15])
pd.testing.assert_series_equal((df1['a']+df2['a']),expected_res,check_names=False)
For more details, refer to this link.
If you are using pytest, PandasSnapshot will be useful.
# use with pytest
import pandas as pd
from snapshottest_ext.dataframe import PandasSnapshot
def test_format(snapshot):
    df = pd.DataFrame([['a', 'b'], ['c', 'd']],
                      columns=['col 1', 'col 2'])
    snapshot.assert_match(PandasSnapshot(df))
One big con is that the snapshot is not readable anymore. (Storing the content as CSV is more readable, but it is problematic.)
PS: I am the author of pytest snapshot extension.
I don't think it's hard to create small DataFrames for unit testing:
import pandas as pd
from nose.tools import assert_dict_equal
input_df = pd.DataFrame.from_dict({
    'field_1': [some, values],
    'field_2': [other, values]
})

expected = {
    'result': [...]
}

assert_dict_equal(expected, my_func(input_df).to_dict(), "oops, there's a bug...")
You could use snapshottest and do something like this:
def test_something_works(snapshot):  # snapshot is a pytest fixture from snapshottest
    data_frame = calc_something_and_return_pandas_dataframe()
    snapshot.assert_match(data_frame.to_csv(index=False), 'some_module_level_unique_name_for_the_snapshot')
This will create a snapshots folder with a file in it that contains the CSV output, which you can update with --snapshot-update when your code changes.
It works by comparing the data_frame variable to what is saved to disk.
Might be worth mentioning that your snapshots should be checked in to source control.
I would suggest writing the values as CSV in docstrings (or separate files if they're large) and parsing them using pd.read_csv(). You can parse the expected output from CSV too, and compare, or else use df.to_csv() to write a CSV out and diff it.
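A small sketch of what that can look like; the add_total function and the columns are made up for illustration:
import io
import pandas as pd

def add_total(df):
    # Stand-in for the function under test
    return df.assign(total=df["a"] + df["b"])

def test_add_total():
    input_csv = """a,b
1,2
3,4"""
    expected_csv = """a,b,total
1,2,3
3,4,7"""
    result = add_total(pd.read_csv(io.StringIO(input_csv)))
    expected = pd.read_csv(io.StringIO(expected_csv))
    pd.testing.assert_frame_equal(result, expected)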
Pandas has built-in testing functions, but I don't find the output easy to parse, so I created an open-source project called beavis with functions that output error messages that are easier for humans to read.
Here's an example of one of the built in testing methods:
df = pd.DataFrame({"col1": [1042, 2, 9, 6], "col2": [5, 2, 7, 6]})
pd.testing.assert_series_equal(df["col1"], df["col2"])
Here's the error message:
> ???
E AssertionError: Series are different
E
E Series values are different (50.0 %)
E [index]: [0, 1, 2, 3]
E [left]: [1042, 2, 9, 6]
E [right]: [5, 2, 7, 6]
It's not very easy to see which rows are mismatched, because the output isn't aligned.
Here's how you can write the same test with beavis.
import beavis
beavis.assert_pd_column_equality(df, "col1", "col2")
This'll give you a readable, aligned error message.
The built-in assert_frame_equal doesn't give a readable error message either. Here's how you can compare DataFrame equality with beavis.
df1 = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})
df2 = pd.DataFrame({'col1': [5, 2], 'col2': [3, 4]})
beavis.assert_pd_equality(df1, df2)
The frame-fixtures Python package (of which I am an author) is designed to make it easy to "create a new dataframe (with values populated)" for unit or performance tests.
For example, if you want to test against a DataFrame of floats and strings with a numerical index, you can use a compact string declaration to generate a DataFrame.
>>> import frame_fixtures as ff
>>> ff.Fixture.to_frame('i(I,int)|v(float,str)|s(4,2)').to_pandas()
0 1
34715 1930.40 zaji
-3648 -1760.34 zJnC
91301 1857.34 zDdR
30205 1699.34 zuVU
>>> ff.Fixture.to_frame('i(I,int)|v(float,str)|s(8,3)').to_pandas()
0 1 2
34715 1930.40 zaji 694.30
-3648 -1760.34 zJnC -72.96
91301 1857.34 zDdR 1826.02
30205 1699.34 zuVU 604.10
54020 268.96 zKka 1080.40
129017 3511.58 zJXD 2580.34
35021 1175.36 zPAQ 700.42
166924 2925.68 zyps 3338.48
I have a dataset which the dimension is around 2,000 (rows) x 120,000 (columns).
And I'd like to pick up certain columns (~8,000 columns).
So the file dimension would be 2,000 (rows) x 8,000 (columns).
Here is the code written by a kind person (I found it on Stack Overflow, but I'm sorry I have forgotten their name).
import pandas as pd
df = pd.read_csv('...mydata.csv')
my_query = pd.read_csv('...myquery.csv')
df[my_query['Name'].unique()].to_csv('output.csv')
However, running it raises a MemoryError in my console, which means the code does not work well for data this large.
So does anyone know how to improve the code, with a more efficient way to select the columns I need?
I think I found your source.
So, my solution uses read_csv with these arguments:
iterator=True - if True, return a TextFileReader to enable reading a file into memory piece by piece
chunksize=1000 - the number of rows used to "chunk" the file into pieces. Causes a TextFileReader object to be returned
usecols=subset - a subset of columns to return, results in much faster parsing time and lower memory usage
Source.
I filter the large dataset with usecols, so only a (2,000 x 8,000) dataset is read instead of the full (2,000 x 120,000) one.
import pandas as pd

# read subset from csv and remove duplicate indices
subset = pd.read_csv('8kx1.csv', index_col=[0]).index.unique()
print(subset)

# use subset as filter of columns
tp = pd.read_csv('input.csv', iterator=True, chunksize=1000, usecols=subset)
df = pd.concat(tp, ignore_index=True)
print(df.head())
print(df.shape)

# write to csv in chunks
df.to_csv('output.csv', chunksize=1000)
I use this snippet for testing:
import pandas as pd
import io
temp=u"""A,B,C,D,E,F,G
1,2,3,4,5,6,7"""
temp1=u"""Name
B
B
C
B
C
C
E
F"""
subset = pd.read_csv(io.StringIO(temp1), index_col=[0]).index.unique()
print(subset)

# use subset as filter of columns
df = pd.read_csv(io.StringIO(temp), usecols=subset)
print(df.head())
print(df.shape)