Pandas concatenating dataframes raises ValueError - python-2.7

I am looping over many small dataframes and concatenating them into a single dataframe using pandas.concat(). Partway through the loop an exception is raised with the message ValueError: Plan shapes are not aligned.
The failing dataframe contains a single row (like all the previous dataframes), and its columns are a subset of the accumulated dataframe's columns. A sample snippet of the code is below.
import os
import pandas as pd

df, failed = pd.DataFrame(), pd.DataFrame()
for _file in os.listdir(file_dir):
    _tmp = pd.read_csv(os.path.join(file_dir, _file))
    try:
        df = pd.concat([df, _tmp])
    except ValueError as e:
        if 'Plan shapes are not aligned' in str(e):
            failed = pd.concat([failed, _tmp])
print [x for x in failed.columns if x not in df.columns]
print len(df), len(failed)
And I end up with the result
Out[10]: []
118 1
Checking the failures, it is always the same dataframe, so that dataframe must be the problem. Printing it out I get
             timestamp    actual  average_estimate  median_estimate  \
0  1996-11-14 01:30:00  2.300000          2.380000         2.400000

   estimate1  estimate2  estimate3  estimate4  estimate5
0   2.400000   2.200000   2.500000   2.600000   2.200000
This has the same format as the other concatenated dataframes and as df itself. Is there something I'm missing?
Extra info: I am using pandas 0.16.0
Edit: full stack trace below with modifications for anonymity
Traceback (most recent call last):
  File "C:\Users\<user>\Documents\GitHub\<environment>\lib\site-packages\IPython\core\interactiveshell.py", line 3066, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-2-48539cb93d64>", line 37, in <module>
    df = pd.concat([df, _tmp])
  File "C:\Users\<user>\Documents\GitHub\<environment>\lib\site-packages\pandas\tools\merge.py", line 755, in concat
    return op.get_result()
  File "C:\Users\<user>\Documents\GitHub\<environment>\lib\site-packages\pandas\tools\merge.py", line 926, in get_result
    mgrs_indexers, self.new_axes, concat_axis=self.axis, copy=self.copy)
  File "C:\Users\<user>\Documents\GitHub\<environment>\lib\site-packages\pandas\core\internals.py", line 4040, in concatenate_block_managers
    for placement, join_units in concat_plan]
  File "C:\Users\<user>\Documents\GitHub\<environment>\lib\site-packages\pandas\core\internals.py", line 4258, in combine_concat_plans
    raise ValueError("Plan shapes are not aligned")
ValueError: Plan shapes are not aligned
Edit 2: Tried with 0.17.1 and 0.18.0 and still have the same error.
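A likely culprit, given the versions involved: pd.concat in this era raises exactly this ValueError when one of the inputs carries duplicate column labels, which the "columns not in df.columns" check would never catch. Below is a minimal diagnostic plus a workaround sketch under that assumption; collecting the frames in a list and concatenating once also avoids re-copying df on every iteration:
import os
import pandas as pd

# run inside the except branch, where _tmp is the frame that failed:
# duplicated column labels would not appear in the subset check above
print [c for c in _tmp.columns if list(_tmp.columns).count(c) > 1]
print [c for c in df.columns if list(df.columns).count(c) > 1]

# workaround sketch: build the frames in a list and concatenate once
frames = [pd.read_csv(os.path.join(file_dir, f)) for f in os.listdir(file_dir)]
df = pd.concat(frames, ignore_index=True)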

Related

How to resolve IndexError: list index out of range

I'm new to Python and I have a CSV file with 35 columns. To print the last column of each row, I'm running the script below:
import csv

filename = '/home/cloudera/PMGE/Bfr.csv'
with open(filename, 'r') as csvfile:
    datareader = csv.reader(csvfile)
    for row in datareader:
        print(row[34])
It prints the expected result, but at the end I get this error:
Traceback (most recent call last):
  File "barplot.py", line 15, in <module>
    print(row[34])
IndexError: list index out of range
Can anyone help me to understand this?
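A common cause of this exact pattern (every row prints, then it fails at the very end) is a trailing blank line in the CSV: csv.reader yields it as an empty list, so row[34] is out of range. A minimal guard, assuming that is what is happening here:
import csv

filename = '/home/cloudera/PMGE/Bfr.csv'
with open(filename, 'r') as csvfile:
    datareader = csv.reader(csvfile)
    for row in datareader:
        # a trailing newline yields an empty list; short rows are skipped too
        if len(row) > 34:
            print(row[34])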

Series regex extract producing a dataframe

I am working through a regex task on Dataquest. The following code snippet runs correctly inside the Dataquest IDE:
titles = hn["title"]
pattern = r'\[(\w+)\]'
tag_matches = titles.str.extract(pattern)
tag_freq = tag_matches.value_counts()
print(tag_freq, '\n')
However, on my PC running pandas 0.25.3 this exact same code block yields an error:
Traceback (most recent call last):
  File "C:/Users/Mark/PycharmProjects/main/main.py", line 63, in <module>
    tag_freq = tag_matches.value_counts()
  File "C:\Users\Mark\AppData\Local\Programs\Python\Python37-32\lib\site-packages\pandas\core\generic.py", line 5179, in __getattr__
    return object.__getattribute__(self, name)
AttributeError: 'DataFrame' object has no attribute 'value_counts'
Why is tag_matches coming back as a dataframe? I am running an extract against the series 'titles'.
From the docs:
pandas.Series.str.extract
A pattern with one group will return a Series if expand=False.
>>> s.str.extract(r'[ab](\d)', expand=False)
0      1
1      2
2    NaN
dtype: object
So you must be explicit and set expand=False to get a Series object. Since pandas 0.23 the default is expand=True, so a single-group extract returns a single-column DataFrame; the Dataquest IDE presumably runs an older version where the default still behaved like expand=False.
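Under that reading, a corrected version of the snippet (reusing titles and pattern from above) would be:
# with expand=False a single-group extract returns a Series,
# so value_counts() is available again
tag_matches = titles.str.extract(pattern, expand=False)
tag_freq = tag_matches.value_counts()
print(tag_freq, '\n')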

cannot reindex duplicate axis

I am trying to merge multiple csv files in a folder.
They look like this (in reality there are more than two dataframes):
df1
LCC acres
2 10
3 20
4 40
5 5
df2
LCC acres_2
2 4
3 2
4 40
5 6
6 7
I want to put all the dataframes into one list, and then merge them with reduce. To do this they need to have the same index.
I am trying this code:
combined = []
reindex = [2,3,4,5,6]
folder = r'C:\path_to_files'
for f in os.listdir(folder):
    #read each file
    df = pd.read_csv(os.path.join(folder,f))
    #check for duplicates - returns empty lists
    print df[df.index.duplicated()]
    #reindex
    df.set_index([df.columns[0]], inplace=True)
    df = df.reindex(reindex, fill_value=0)
    #append
    combined.append(df)
#merge on 'LCC' column
final = reduce(lambda left, right: pd.merge(left, right, on=['LCC'], how='outer'), combined)
but this still returns:
Traceback (most recent call last):
  File "<ipython-input-31-45f925f6d48d>", line 9, in <module>
    df=df.reindex(reindex, fill_value=0)
  File "C:\Users\spotter\AppData\Local\Continuum\Anaconda2_2\lib\site-packages\pandas\core\frame.py", line 2741, in reindex
    **kwargs)
  File "C:\Users\spotter\AppData\Local\Continuum\Anaconda2_2\lib\site-packages\pandas\core\generic.py", line 2229, in reindex
    fill_value, copy).__finalize__(self)
  File "C:\Users\spotter\AppData\Local\Continuum\Anaconda2_2\lib\site-packages\pandas\core\frame.py", line 2687, in _reindex_axes
    fill_value, limit, tolerance)
  File "C:\Users\spotter\AppData\Local\Continuum\Anaconda2_2\lib\site-packages\pandas\core\frame.py", line 2698, in _reindex_index
    allow_dups=False)
  File "C:\Users\spotter\AppData\Local\Continuum\Anaconda2_2\lib\site-packages\pandas\core\generic.py", line 2341, in _reindex_with_indexers
    copy=copy)
  File "C:\Users\spotter\AppData\Local\Continuum\Anaconda2_2\lib\site-packages\pandas\core\internals.py", line 3586, in reindex_indexer
    self.axes[axis]._can_reindex(indexer)
  File "C:\Users\spotter\AppData\Local\Continuum\Anaconda2_2\lib\site-packages\pandas\indexes\base.py", line 2293, in _can_reindex
    raise ValueError("cannot reindex from a duplicate axis")
ValueError: cannot reindex from a duplicate axis
The problem is that you need to check for duplicates in the index after setting the first column as the index.
#set index by first column
df.set_index([df.columns[0]], inplace=True)
#check for duplicates - returns NO empty lists
print df[df.index.duplicated()]
#reindex
df=df.reindex(reindex, fill_value=0)
Alternatively, check for duplicates in the first column instead of the index; the parameter keep=False returns all duplicates (if necessary):
#check duplicates in first column
print df[df.iloc[:, 0].duplicated(keep=False)]
#set index + reindex
df.set_index([df.columns[0]], inplace=True)
df=df.reindex(reindex, fill_value=0)
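If the duplicates are real data rather than read errors, a third option is to aggregate them away before reindexing. A sketch, assuming that summing acres per LCC value is the right aggregation rule for these files:
#collapse duplicate LCC values by aggregating, then reindex
#(this replaces the set_index step; groupby makes LCC the index)
df = df.groupby(df.columns[0]).sum()
df = df.reindex(reindex, fill_value=0)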

pandas .drop() memory error large file

For reference, this is all on a Windows 7 x64 bit machine in PyCharm Educational Edition 1.0.1, with Python 3.4.2 and Pandas 0.16.1
I have an ~791MB .csv file with ~3.04 million rows x 24 columns. The file contains liquor sales data for the state of Iowa from January 2014 to February 2015. If you are interested, the file can be found here: https://data.iowa.gov/Economy/Iowa-Liquor-Sales/m3tr-qhgy.
One of the columns, titled store location, holds the address including latitude and longitude. The purpose of the program below is to take the latitude and longitude out of the store location cell and place each in its own cell. When the file is cut down to ~1.04 million rows, my program works properly.
1 import pandas as pd
2
3 #import the original file
4 sales = pd.read_csv('Iowa_Liquor_Sales.csv', header=0)
5
6 #transfer the copies into lists
7 lat = sales['STORE LOCATION']
8 lon = sales['STORE LOCATION']
9
10 #separate the latitude and longitude from each cell into their own list
11 hold = [i.split('(', 1)[1] for i in lat]
12 lat2 = [i.split(',', 1)[0] for i in hold]
13 lon2 = [i.split(',', 1)[1] for i in hold]
14 lon2 = [i.split(')', 1)[0] for i in lon2]
15
16 #put the now separate latitude and longitude back into their own columns
17 sales['LATITUDE'] = lat2
18 sales['LONGITUDE'] = lon2
19
20 #drop the store location column
21 sales = sales.drop(['STORE LOCATION'], axis=1)
22
23 #export the new panda data frame into a new file
24 sales.to_csv('liquor_data2.csv')
However, when I try to run the code with the full 3.04 million line file, it gives me this error:
Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File "C:\Python34\lib\site-packages\pandas\core\generic.py", line 1595, in drop
    dropped = self.reindex(**{axis_name: new_axis})
  File "C:\Python34\lib\site-packages\pandas\core\frame.py", line 2505, in reindex
    **kwargs)
  File "C:\Python34\lib\site-packages\pandas\core\generic.py", line 1751, in reindex
    self._consolidate_inplace()
  File "C:\Python34\lib\site-packages\pandas\core\generic.py", line 2132, in _consolidate_inplace
    self._data = self._protect_consolidate(f)
  File "C:\Python34\lib\site-packages\pandas\core\generic.py", line 2125, in _protect_consolidate
    result = f()
  File "C:\Python34\lib\site-packages\pandas\core\generic.py", line 2131, in <lambda>
    f = lambda: self._data.consolidate()
  File "C:\Python34\lib\site-packages\pandas\core\internals.py", line 2833, in consolidate
    bm._consolidate_inplace()
  File "C:\Python34\lib\site-packages\pandas\core\internals.py", line 2838, in _consolidate_inplace
    self.blocks = tuple(_consolidate(self.blocks))
  File "C:\Python34\lib\site-packages\pandas\core\internals.py", line 3817, in _consolidate
    _can_consolidate=_can_consolidate)
  File "C:\Python34\lib\site-packages\pandas\core\internals.py", line 3840, in _merge_blocks
    new_values = _vstack([b.values for b in blocks], dtype)
  File "C:\Python34\lib\site-packages\pandas\core\internals.py", line 3870, in _vstack
    return np.vstack(to_stack)
  File "C:\Python34\lib\site-packages\numpy\core\shape_base.py", line 228, in vstack
    return _nx.concatenate([atleast_2d(_m) for _m in tup], 0)
MemoryError
I tried running the code line-by-line in the python console and found that the error occurs after the program runs the sales = sales.drop(['STORE LOCATION'], axis=1) line.
I have searched for similar issues elsewhere and the only answer I have come up with is chunking the file as it is read by the program, like this:
#import the original file
df = pd.read_csv('Iowa_Liquor_Sales7.csv', header=0, chunksize=chunksize)
sales = pd.concat(df, ignore_index=True)
My only problem with that is that I then get this error:
Traceback (most recent call last):
  File "C:/Users/Aaron/PycharmProjects/DATA/Liquor_Reasign_Pd.py", line 14, in <module>
    lat = sales['STORE LOCATION']
TypeError: 'TextFileReader' object is not subscriptable
My google-foo is all foo'd out. Anyone know what to do?
UPDATE
I should specify that with the chunking method, the error comes about when the program tries to duplicate the store location column.
So I found an answer to my issue. I ran the program in python 2.7 instead of python 3.4. The only change I made was deleting line 8, as it is unused. I don't know if 2.7 just handles the memory issue differently, or if I had improperly installed the pandas package in 3.4. I will reinstall pandas in 3.4 to see if that was the problem, but if anyone else has a similar issue, try your program in 2.7.
UPDATE Realized that I was running 32-bit Python on a 64-bit machine. I upgraded to 64-bit Python and it now runs without memory errors.
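For anyone hitting the TextFileReader error above: read_csv with chunksize returns an iterator of dataframes, not a dataframe, which is why sales['STORE LOCATION'] is not subscriptable. A sketch of how the chunked route could have worked, processing and appending one chunk at a time so the full 3 million rows are never in memory at once (the regex assumes the coordinates appear as a "(lat, lon)" pair in the cell):
import pandas as pd

chunks = pd.read_csv('Iowa_Liquor_Sales.csv', header=0, chunksize=100000)
first = True
for chunk in chunks:
    # vectorized extract of the "(lat, lon)" pair; rows without a
    # match simply get NaN instead of raising like split() would
    coords = chunk['STORE LOCATION'].str.extract(r'\(([^,]+),\s*([^)]+)\)')
    chunk['LATITUDE'] = coords[0]
    chunk['LONGITUDE'] = coords[1]
    chunk = chunk.drop(['STORE LOCATION'], axis=1)
    # append each processed chunk, writing the header only once
    chunk.to_csv('liquor_data2.csv', mode='a', header=first, index=False)
    first = False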

Python TypeError: list indices must be integers, not tuple

Using the python 2.7 shell on osx lion. The .csv file has 12 columns by 892 rows.
import csv as csv
import numpy as np

# Open up csv file into a Python object
csv_file_object = csv.reader(open('/Users/scdavis6/Documents/Kaggle/train.csv', 'rb'))
header = csv_file_object.next()
data = []
for row in csv_file_object:
    data.append(row)
data = np.array(data)

# Convert to float for numerical calculations
number_passengers = np.size(data[0::,0].astype(np.float))
And this is the error I get:
Traceback (most recent call last):
  File "<pyshell#5>", line 1, in <module>
    number_passengers = np.size(data[0::,0].astype(np.float))
TypeError: list indices must be integers, not tuple
What am I doing wrong?
Don't use csv to read the data into a NumPy array. Use numpy.genfromtxt; using dtype=None will cause genfromtxt to make an intelligent guess at the dtypes for you. By doing it this way you won't have to manually convert strings to floats.
data[0::, 0] just gives you the first column of data.
data[:, 0] would give you the same result.
The error message
TypeError: list indices must be integers, not tuple
suggests that for some reason your data variable might be holding a list rather than an ndarray. For example, the same exception can be produced like this:
In [73]: data = [1,2,3]
In [74]: data[1,2]
TypeError: list indices must be integers, not tuple
I don't know why that is happening, but if you post a sample of your CSV we should be able to help fix that.
Using np.genfromtxt, your current code could be simplified to:
import numpy as np
filename = '/Users/scdavis6/Documents/Kaggle/train.csv'
data = np.genfromtxt(filename, delimiter=',', skip_header=1, dtype=None)
number_passengers = np.size(data, axis=0)
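One caveat with this: when dtype=None meets mixed column types, genfromtxt returns a one-dimensional structured array, so columns are selected by field name rather than with data[:, 0]-style slicing. A sketch, assuming the file has a header row (the 'Survived' field name is a guess at the Kaggle column):
import numpy as np

filename = '/Users/scdavis6/Documents/Kaggle/train.csv'
# names=True takes the column names from the header row
data = np.genfromtxt(filename, delimiter=',', dtype=None, names=True)
number_passengers = data.shape[0]
# fields are then selected by name, e.g. (hypothetical column name):
# survived = data['Survived'].astype(np.float)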