pandas .drop() memory error large file - python-2.7

For reference, this is all on a 64-bit Windows 7 machine in PyCharm Educational Edition 1.0.1, with Python 3.4.2 and pandas 0.16.1.
I have an ~791MB .csv file with ~3.04 million rows x 24 columns. The file contains liquor sales data for the state of Iowa from January 2014 to February 2015. If you are interested, the file can be found here: https://data.iowa.gov/Economy/Iowa-Liquor-Sales/m3tr-qhgy.
One of the columns, titled store location, holds the address including latitude and longitude. The purpose of the program below is to take the latitude and longitude out of the store location cell and place each in its own cell. When the file is cut down to ~1.04 million rows, my program works properly.
1 import pandas as pd
2
3 #import the original file
4 sales = pd.read_csv('Iowa_Liquor_Sales.csv', header=0)
5
6 #transfer the copies into lists
7 lat = sales['STORE LOCATION']
8 lon = sales['STORE LOCATION']
9
10 #separate the latitude and longitude from each cell into their own list
11 hold = [i.split('(', 1)[1] for i in lat]
12 lat2 = [i.split(',', 1)[0] for i in hold]
13 lon2 = [i.split(',', 1)[1] for i in hold]
14 lon2 = [i.split(')', 1)[0] for i in lon2]
15
16 #put the now separate latitude and longitude back into their own columns
17 sales['LATITUDE'] = lat2
18 sales['LONGITUDE'] = lon2
19
20 #drop the store location column
21 sales = sales.drop(['STORE LOCATION'], axis=1)
22
23 #export the new panda data frame into a new file
24 sales.to_csv('liquor_data2.csv')
However, when I try to run the code with the full 3.04 million line file, it gives me this error:
Traceback (most recent call last):
File "<input>", line 1, in <module>
File "C:\Python34\lib\site-packages\pandas\core\generic.py", line 1595, in drop
dropped = self.reindex(**{axis_name: new_axis})
File "C:\Python34\lib\site-packages\pandas\core\frame.py", line 2505, in reindex
**kwargs)
File "C:\Python34\lib\site-packages\pandas\core\generic.py", line 1751, in reindex
self._consolidate_inplace()
File "C:\Python34\lib\site-packages\pandas\core\generic.py", line 2132, in _consolidate_inplace
self._data = self._protect_consolidate(f)
File "C:\Python34\lib\site-packages\pandas\core\generic.py", line 2125, in _protect_consolidate
result = f()
File "C:\Python34\lib\site-packages\pandas\core\generic.py", line 2131, in <lambda>
f = lambda: self._data.consolidate()
File "C:\Python34\lib\site-packages\pandas\core\internals.py", line 2833, in consolidate
bm._consolidate_inplace()
File "C:\Python34\lib\site-packages\pandas\core\internals.py", line 2838, in _consolidate_inplace
self.blocks = tuple(_consolidate(self.blocks))
File "C:\Python34\lib\site-packages\pandas\core\internals.py", line 3817, in _consolidate
_can_consolidate=_can_consolidate)
File "C:\Python34\lib\site-packages\pandas\core\internals.py", line 3840, in _merge_blocks
new_values = _vstack([b.values for b in blocks], dtype)
File "C:\Python34\lib\site-packages\pandas\core\internals.py", line 3870, in _vstack
return np.vstack(to_stack)
File "C:\Python34\lib\site-packages\numpy\core\shape_base.py", line 228, in vstack
return _nx.concatenate([atleast_2d(_m) for _m in tup], 0)
MemoryError
I tried running the code line-by-line in the python console and found that the error occurs after the program runs the sales = sales.drop(['STORE LOCATION'], axis=1) line.
I have searched for similar issues elsewhere and the only answer I have come up with is chunking the file as it is read by the program, like this:
#import the original file
df = pd.read_csv('Iowa_Liquor_Sales7.csv', header=0, chunksize=chunksize)
sales = pd.concat(df, ignore_index=True)
My only problem with that is then I get this error:
Traceback (most recent call last):
File "C:/Users/Aaron/PycharmProjects/DATA/Liquor_Reasign_Pd.py", line 14, in <module>
lat = sales['STORE LOCATION']
TypeError: 'TextFileReader' object is not subscriptable
My google-foo is all foo'd out. Anyone know what to do?
UPDATE
I should specify that with the chunking method, the error occurs when the program tries to duplicate the store location column.
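For what it is worth, read_csv with chunksize returns an iterator of DataFrames (a TextFileReader) rather than a DataFrame, which is why indexing it with ['STORE LOCATION'] fails. A minimal sketch of handling one chunk at a time is below. The sample rows, the regular expression, and the split_location helper are assumptions made for illustration and may not match the real file layout.

```python
import io

import pandas as pd

# Tiny stand-in for the real CSV; these rows are invented for illustration.
sample_csv = io.StringIO(
    'STORE,STORE LOCATION\n'
    '1,"123 MAIN ST DES MOINES 50309 (41.58, -93.62)"\n'
    '2,"456 OAK AVE AMES 50010 (42.03, -93.62)"\n'
    '3,"789 ELM RD DAVENPORT 52801 (41.52, -90.58)"\n'
)


def split_location(chunk):
    """Hypothetical helper: move '(lat, lon)' into two new columns."""
    # Two capture groups, so str.extract returns a DataFrame with columns 0 and 1.
    coords = chunk['STORE LOCATION'].str.extract(r'\(([^,]+),\s*([^)]+)\)', expand=True)
    chunk = chunk.drop('STORE LOCATION', axis=1)
    chunk['LATITUDE'] = coords[0]
    chunk['LONGITUDE'] = coords[1]
    return chunk


out = io.StringIO()  # a real run would write to a file path instead
reader = pd.read_csv(sample_csv, chunksize=2)
for n, chunk in enumerate(reader):
    # Write the header only once, then append the remaining chunks.
    split_location(chunk).to_csv(out, header=(n == 0), index=False)

result = out.getvalue()
print(result)
```

Only one chunk is held in memory at a time, so peak memory depends on chunksize rather than on the size of the whole file.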

So I found an answer to my issue. I ran the program in python 2.7 instead of python 3.4. The only change I made was deleting line 8, as it is unused. I don't know if 2.7 just handles the memory issue differently, or if I had improperly installed the pandas package in 3.4. I will reinstall pandas in 3.4 to see if that was the problem, but if anyone else has a similar issue, try your program in 2.7.
UPDATE: I realized that I was running 32-bit Python on a 64-bit machine. I upgraded my versions of Python and it now runs without memory errors.
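For anyone who hits the same MemoryError, here is a quick way to check whether the interpreter itself is a 32-bit or 64-bit build. A 32-bit process can only address a few gigabytes at most, no matter how much RAM the machine has. This is a small sketch using only the standard library; the variable names are just for illustration.

```python
import struct
import sys

# Size of a C pointer in bits: 32 on a 32-bit interpreter, 64 on a 64-bit one.
pointer_bits = struct.calcsize("P") * 8

# sys.maxsize is 2**31 - 1 on 32-bit builds and 2**63 - 1 on 64-bit builds.
is_64bit = sys.maxsize > 2**32

print("Python build: %d-bit" % pointer_bits)
```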

Related

Series regex extract producing a dataframe

I am working through a regex task on Dataquest. The following code snippet runs correctly inside of the Dataquest IDE:
titles = hn["title"]
pattern = r'\[(\w+)\]'
tag_matches = titles.str.extract(pattern)
tag_freq = tag_matches.value_counts()
print(tag_freq, '\n')
However, on my PC running pandas 0.25.3 this exact same code block yields an error:
Traceback (most recent call last):
File "C:/Users/Mark/PycharmProjects/main/main.py", line 63, in <module>
tag_freq = tag_matches.value_counts()
File "C:\Users\Mark\AppData\Local\Programs\Python\Python37-32\lib\site-packages\pandas\core\generic.py", line 5179, in __getattr__
return object.__getattribute__(self, name)
AttributeError: 'DataFrame' object has no attribute 'value_counts'
Why is tag_matches coming back as a dataframe? I am running an extract against the series 'titles'.
From the docs:
pandas.Series.str.extract
A pattern with one group will return a Series if expand=False.
>>> s.str.extract(r'[ab](\d)', expand=False)
0      1
1      2
2    NaN
dtype: object
So perhaps you must be explicit and set expand=False to get a series object?
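To make the difference concrete, here is a small sketch with invented titles (the real hn["title"] data is not shown in the question). In pandas 0.23 and later, expand defaults to True, so a single-group pattern returns a one-column DataFrame unless expand=False is passed.

```python
import pandas as pd

# Invented sample titles standing in for hn["title"].
titles = pd.Series(['[pdf] A paper', 'Show HN: a tool', '[video] A talk', '[pdf] Another paper'])
pattern = r'\[(\w+)\]'

as_frame = titles.str.extract(pattern)                 # one-column DataFrame
as_series = titles.str.extract(pattern, expand=False)  # Series

# value_counts on the Series gives the tag frequencies.
tag_freq = as_series.value_counts()
print(tag_freq, '\n')
```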

Trying to search excel file for a date

I'm searching for a date in an Excel file, using a string that I've converted to a Python date. I get an error when trying to convert the Excel cell values to dates with the following code:
from dateutil import parser
import xlrd
d = '4/8/2019'
dt_obj = parser.parse(d)
wbpath = 'XLSX FILE'
wb = xlrd.open_workbook(wbpath)
ws = wb.sheet_by_index(1)
for rowidx in range(ws.nrows):
    row = ws.row(rowidx)
    for colidx, cell in enumerate(row):
        if xlrd.xldate_as_tuple(cell.value, wb.datemode) == dt_obj:
            print(ws.name)
            print(colidx)
            print(rowidx)
ERROR I get:
Traceback (most recent call last):
File "C:/Users/DKisialeu/PycharmProjects/new/YIM.py", line 12, in <module>
if xlrd.xldate_as_tuple(cell.value, wb.datemode) == dt_obj:
File "C:\Users\DKisialeu\AppData\Local\Programs\Python\Python37\lib\site-packages\xlrd\xldate.py", line 95, in xldate_as_tuple
if xldate < 0.00:
TypeError: '<' not supported between instances of 'str' and 'float'
Make sure that the dates in your Excel spreadsheet are formatted as dates and not as text.
I get the same error when I run your code on a spreadsheet that contains any text-formatted cells at all.
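If the sheet cannot be cleaned up, another option is to skip cells that do not hold a numeric date serial before converting them. The sketch below uses plain Python lists instead of a workbook so that it runs on its own; serial_to_date and find_date are hypothetical helpers, and the rows are invented. It assumes Excel's 1900 date system, where serial 43563 corresponds to 2019-04-08. With xlrd itself, comparing cell.ctype against xlrd.XL_CELL_DATE is the usual way to tell date cells from text cells.

```python
from datetime import date, datetime, timedelta

# Day zero of Excel's 1900 date system (valid for serials of 61 and above).
EXCEL_1900_EPOCH = datetime(1899, 12, 30)


def serial_to_date(value):
    """Return a date for a numeric Excel serial, or None for anything else."""
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        return None  # text, empty, or boolean cells cannot be converted
    return (EXCEL_1900_EPOCH + timedelta(days=value)).date()


def find_date(rows, target):
    """Yield (row_index, col_index) for every cell whose date equals target."""
    for rowidx, row in enumerate(rows):
        for colidx, value in enumerate(row):
            if serial_to_date(value) == target:
                yield rowidx, colidx


# Invented cell values: a text cell that only looks like a date, and a real serial.
rows = [
    ["Date", "Amount"],
    ["4/8/2019", 10.0],
    [43563.0, 12.5],
]
matches = list(find_date(rows, date(2019, 4, 8)))
print(matches)
```

Note also that xldate_as_tuple returns a plain tuple, so comparing its result directly with a datetime object never matches. Converting both sides to the same type, as done here with date objects, avoids that.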

cannot reindex duplicate axis

I am trying to merge multiple csv files in a folder.
They look like this (there are more than two df's in actuality):
df1
LCC acres
2 10
3 20
4 40
5 5
df2
LCC acres_2
2 4
3 2
4 40
5 6
6 7
I want to put all the dataframes into one list, and then merge them with reduce. To do this they need to have the same index.
I am trying this code:
combined = []
reindex = [2,3,4,5,6]
folder = r'C:\path_to_files'
for f in os.listdir(folder):
    #read each file
    df = pd.read_csv(os.path.join(folder,f))
    #check for duplicates - returns empty lists
    print df[df.index.duplicated()]
    #reindex
    df.set_index([df.columns[0]], inplace=True)
    df=df.reindex(reindex, fill_value=0)
    #append
    combined.append(df)
#merge on 'LCC' column
final = reduce(lambda left, right: pd.merge(left, right, on=['LCC'], how='outer'), combined)
but this still returns:
Traceback (most recent call last):
File "<ipython-input-31-45f925f6d48d>", line 9, in <module>
df=df.reindex(reindex, fill_value=0)
File "C:\Users\spotter\AppData\Local\Continuum\Anaconda2_2\lib\site-packages\pandas\core\frame.py", line 2741, in reindex
**kwargs)
File "C:\Users\spotter\AppData\Local\Continuum\Anaconda2_2\lib\site-packages\pandas\core\generic.py", line 2229, in reindex
fill_value, copy).__finalize__(self)
File "C:\Users\spotter\AppData\Local\Continuum\Anaconda2_2\lib\site-packages\pandas\core\frame.py", line 2687, in _reindex_axes
fill_value, limit, tolerance)
File "C:\Users\spotter\AppData\Local\Continuum\Anaconda2_2\lib\site-packages\pandas\core\frame.py", line 2698, in _reindex_index
allow_dups=False)
File "C:\Users\spotter\AppData\Local\Continuum\Anaconda2_2\lib\site-packages\pandas\core\generic.py", line 2341, in _reindex_with_indexers
copy=copy)
File "C:\Users\spotter\AppData\Local\Continuum\Anaconda2_2\lib\site-packages\pandas\core\internals.py", line 3586, in reindex_indexer
self.axes[axis]._can_reindex(indexer)
File "C:\Users\spotter\AppData\Local\Continuum\Anaconda2_2\lib\site-packages\pandas\indexes\base.py", line 2293, in _can_reindex
raise ValueError("cannot reindex from a duplicate axis")
ValueError: cannot reindex from a duplicate axis
The problem is that you need to check for duplicates in the index after setting the first column as the index:
#set index by first column
df.set_index([df.columns[0]], inplace=True)
#check for duplicates - no longer returns an empty result
print df[df.index.duplicated()]
#reindex
df=df.reindex(reindex, fill_value=0)
Alternatively, check for duplicates in the first column instead of the index. The parameter keep=False returns all duplicated rows, if needed:
#check duplicates in first column
print df[df.iloc[:, 0].duplicated(keep=False)]
#set index + reindex
df.set_index([df.columns[0]], inplace=True)
df=df.reindex(reindex, fill_value=0)
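A small self-contained sketch of why the order of the check matters, using an invented frame with a repeated LCC value. Before set_index the frame still has the default integer index, which never contains duplicates. Summing the duplicated rows before reindexing is only one possible fix and is an assumption here; whether it is appropriate depends on the data.

```python
import pandas as pd

# Invented data: LCC value 3 appears twice.
df = pd.DataFrame({'LCC': [2, 3, 3, 4], 'acres': [10, 20, 5, 40]})

# The default RangeIndex has no duplicates, so this check finds nothing.
dups_before = df.index.duplicated().any()

df = df.set_index('LCC')

# Now the index holds the LCC values, and the duplicate shows up.
dups_after = df.index.duplicated().any()

# One possible fix (an assumption): combine duplicated LCC rows by summing.
deduped = df.groupby(level=0).sum()
reindexed = deduped.reindex([2, 3, 4, 5, 6], fill_value=0)
print(reindexed)
```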

Pandas concatenating dataframes raises ValueError

I am looping through many smaller dataframes and concatenating them into a single dataframe using pandas.concat(). In the middle of the looping an exception is raised with message ValueError: Plan shapes are not aligned.
The failed dataframe contains a single row (like all the previous dataframes) and the columns are a subset of the other dataframe. A sample snippet of the code is below.
import os
import pandas as pd
df, failed = pd.DataFrame(), pd.DataFrame()
for _file in os.listdir(file_dir):
    _tmp = pd.read_csv(file_dir + _file)
    try:
        df= pd.concat([df, _tmp])
    except ValueError as e:
        if 'Plan shapes are not aligned' in str(e):
            failed = pd.concat([failed, _tmp])
print [x for x in failed.columns if x not in df.columns]
print len(df), len(failed)
And I end up with the result
Out[10]: []
118 1
Checking the failures it is always the same dataframe, so the dataframe must be the problem. Printing out the dataframe I get
0 timestamp actual average_estimate median_estimate \
0 1996-11-14 01:30:00 2.300000 2.380000 2.400000
0 estimate1 estimate2 estimate3 estimate4 \
0 2.400000 2.200000 2.500000 2.600000
0 estimate5
0 2.200000
Which has a similar format to the other concatenated dataframes and the df dataframe. Is there something that I'm missing?
Extra info: I am using pandas 0.16.0
Edit: full stack trace below with modifications for anonymity
Traceback (most recent call last):
File "C:\Users\<user>\Documents\GitHub\<environment>\lib\site-packages\IPython\core\interactiveshell.py", line 3066, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-2-48539cb93d64>", line 37, in <module>
df = pd.concat([df, _tmp])
File "C:\Users\<user>\Documents\GitHub\<environment>\lib\site-packages\pandas\tools\merge.py", line 755, in concat
return op.get_result()
File "C:\Users\<user>\Documents\GitHub\<environment>\lib\site-packages\pandas\tools\merge.py", line 926, in get_result
mgrs_indexers, self.new_axes, concat_axis=self.axis, copy=self.copy)
File "C:\Users\<user>\Documents\GitHub\<environment>\lib\site-packages\pandas\core\internals.py", line 4040, in concatenate_block_managers
for placement, join_units in concat_plan]
File "C:\Users\<user>\Documents\GitHub\<environment>\lib\site-packages\pandas\core\internals.py", line 4258, in combine_concat_plans
raise ValueError("Plan shapes are not aligned")
ValueError: Plan shapes are not aligned
Edit 2: Tried with 0.17.1 and 0.18.0 and still have the same error.

Is it possible to make a time and date plot by reading the data from a temporary file?

I have Twitter data in a text file in the following format: "RT Find out how AZ is targeting escape pathways to further personalise breastcancer treatment SABCS14 Thu Dec 11 03:09:12 +0000 2014". From this I need to find tweets with the same text and get their times and dates. After that I have to make a time and date plot. Below is the code I have tried.
import matplotlib
import os
import tempfile
temp = tempfile.TemporaryFile(mode = 'w+t')
count = 0
f = open('bcd.txt','r')
lines = f.read().splitlines()
for i in range(len(lines)):
    line = lines[i]
    try:
        next_line = lines[i+1]
    except IndexError:
        continue
    if line == next_line:
        count +=1
        temp.writelines([line+'\n'])
temp.seek(0)
for line in temp:
    line.split('\t')
First I compared each line with the next line, then wrote the matching lines to a temporary file, and then tried to go through that temporary file and extract the time and date of particular tweets, but I was unsuccessful. Any kind of help would be appreciated. Thanks in advance.
link to data file
https://drive.google.com/file/d/0BxE3rA3-6F8eVGVwOFlVRjNFTUE/view?usp=sharing
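One possible approach that avoids the temporary file, sketched under the assumption that every line ends with a timestamp in the form shown above ("Thu Dec 11 03:09:12 +0000 2014"). The regular expression, the group_times_by_text helper, and the second and third sample lines are invented for illustration; the real file may need a different pattern.

```python
import re
from collections import defaultdict
from datetime import datetime, timezone

# Tweet text followed by a trailing timestamp such as "Thu Dec 11 03:09:12 +0000 2014".
LINE_PATTERN = re.compile(
    r'^(?P<text>.*?)\s+'
    r'(?P<stamp>\w{3} \w{3} \d{1,2} \d{2}:\d{2}:\d{2} [+-]\d{4} \d{4})$'
)


def group_times_by_text(lines):
    """Hypothetical helper: map each tweet text to the list of its timestamps."""
    times = defaultdict(list)
    for line in lines:
        match = LINE_PATTERN.match(line.strip())
        if match is None:
            continue  # skip lines that do not end in a timestamp
        stamp = datetime.strptime(match.group('stamp'), '%a %b %d %H:%M:%S %z %Y')
        times[match.group('text')].append(stamp)
    return times


tweet = ('RT Find out how AZ is targeting escape pathways to further '
         'personalise breastcancer treatment SABCS14')
lines = [
    tweet + ' Thu Dec 11 03:09:12 +0000 2014',
    'Some other tweet Thu Dec 11 04:00:00 +0000 2014',
    tweet + ' Fri Dec 12 10:15:00 +0000 2014',
]

times = group_times_by_text(lines)
# Keep only the texts that occur more than once.
repeated = {text: stamps for text, stamps in times.items() if len(stamps) > 1}
print(repeated)
```

The lists of datetime objects in repeated can then be passed to a plotting library such as matplotlib as the time axis.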