cannot reindex duplicate axis - python-2.7

I am trying to merge multiple csv files in a folder.
They look like this (there are more than two df's in actuality):
df1
LCC acres
2 10
3 20
4 40
5 5
df2
LCC acres_2
2 4
3 2
4 40
5 6
6 7
I want to put all the dataframes into one list, and then merge them with reduce. To do this they need to have the same index.
I am trying this code:
import os
import pandas as pd
# reduce is a builtin in Python 2.7; this import also works there
from functools import reduce

combined = []
reindex = [2, 3, 4, 5, 6]
folder = r'C:\path_to_files'
for f in os.listdir(folder):
    #read each file
    df = pd.read_csv(os.path.join(folder, f))
    #check for duplicates - returns empty lists
    print df[df.index.duplicated()]
    #reindex
    df.set_index([df.columns[0]], inplace=True)
    df = df.reindex(reindex, fill_value=0)
    #append
    combined.append(df)
#merge on 'LCC' column
final = reduce(lambda left, right: pd.merge(left, right, on=['LCC'], how='outer'), combined)
but this still returns:
Traceback (most recent call last):
File "<ipython-input-31-45f925f6d48d>", line 9, in <module>
df=df.reindex(reindex, fill_value=0)
File "C:\Users\spotter\AppData\Local\Continuum\Anaconda2_2\lib\site-packages\pandas\core\frame.py", line 2741, in reindex
**kwargs)
File "C:\Users\spotter\AppData\Local\Continuum\Anaconda2_2\lib\site-packages\pandas\core\generic.py", line 2229, in reindex
fill_value, copy).__finalize__(self)
File "C:\Users\spotter\AppData\Local\Continuum\Anaconda2_2\lib\site-packages\pandas\core\frame.py", line 2687, in _reindex_axes
fill_value, limit, tolerance)
File "C:\Users\spotter\AppData\Local\Continuum\Anaconda2_2\lib\site-packages\pandas\core\frame.py", line 2698, in _reindex_index
allow_dups=False)
File "C:\Users\spotter\AppData\Local\Continuum\Anaconda2_2\lib\site-packages\pandas\core\generic.py", line 2341, in _reindex_with_indexers
copy=copy)
File "C:\Users\spotter\AppData\Local\Continuum\Anaconda2_2\lib\site-packages\pandas\core\internals.py", line 3586, in reindex_indexer
self.axes[axis]._can_reindex(indexer)
File "C:\Users\spotter\AppData\Local\Continuum\Anaconda2_2\lib\site-packages\pandas\indexes\base.py", line 2293, in _can_reindex
raise ValueError("cannot reindex from a duplicate axis")
ValueError: cannot reindex from a duplicate axis

The problem is that you need to check for duplicates in the index after setting the first column as the index:
#set index by first column
df.set_index([df.columns[0]], inplace=True)
#check for duplicates - no longer returns empty results
print df[df.index.duplicated()]
#reindex
df=df.reindex(reindex, fill_value=0)
Alternatively, check for duplicates in the first column instead of the index; the parameter keep=False returns all duplicates (if needed):
#check duplicates in first column
print df[df.iloc[:, 0].duplicated(keep=False)]
#set index + reindex
df.set_index([df.columns[0]], inplace=True)
df=df.reindex(reindex, fill_value=0)
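For reference, the whole pipeline can be sketched end-to-end with in-memory frames. Here df1 and df2 are hypothetical stand-ins for the CSV files, with a duplicate LCC value planted in df2 (the situation that triggers the error), and duplicates are collapsed by summing - whether summing is the right aggregation depends on your data:

```python
from functools import reduce
import pandas as pd

# hypothetical stand-ins for the CSV files; df2 has a duplicate LCC value,
# which is what triggers "cannot reindex from a duplicate axis"
df1 = pd.DataFrame({'LCC': [2, 3, 4, 5], 'acres': [10, 20, 40, 5]})
df2 = pd.DataFrame({'LCC': [2, 3, 3, 4, 5, 6], 'acres_2': [4, 2, 9, 40, 6, 7]})

reindex = [2, 3, 4, 5, 6]
combined = []
for df in (df1, df2):
    # collapse duplicate LCC values (here: summing) so the index is unique
    df = df.groupby('LCC').sum()
    # now reindex succeeds, filling missing LCC values with 0
    df = df.reindex(reindex, fill_value=0)
    # move LCC back to a column so the merge on 'LCC' works
    combined.append(df.reset_index())

final = reduce(lambda left, right: pd.merge(left, right, on='LCC', how='outer'), combined)
print(final)
```

The groupby step is what removes the duplicate axis; without it, reindex raises the same ValueError as in the question.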

Related

RDD to DataFrame in pyspark (columns from rdd's first element)

I have created an RDD from a CSV file, and the first row is the header line of that CSV file. Now I want to create a DataFrame from that RDD and retain the columns from the first element of the RDD.
The problem is that I am able to create the DataFrame with columns from rdd.first(), but the created DataFrame has its first row as the header itself. How do I remove that?
lines = sc.textFile('/path/data.csv')
rdd = lines.map(lambda x: x.split('#####')) ###multiple char sep can be there #### or ### , so can't directly read csv to a dataframe
#rdd: [[u'mailid', u'age', u'address'], [u'satya', u'23', u'Mumbai'], [u'abc', u'27', u'Goa']] ###first element is the header
df = rdd.toDF(rdd.first()) ###retaining the columns from rdd.first()
df.show()
#mailid age address
mailid age address ####I don't want this as dataframe data
satya 23 Mumbai
abc 27 Goa
How to avoid that first element moving to dataframe data. Can I give any option in rdd.toDF(rdd.first()) to get that done??
Note: I can't collect the rdd to form a list, remove the first item from that list, parallelize the list back into an rdd again, and then toDF()...
Please suggest! Thanks
You will have to remove the header from your RDD. One way to do it is the following, using your rdd variable:
>>> header = rdd.first()
>>> header
# ['mailid', 'age', 'address']
>>> data = rdd.filter(lambda row : row != header).toDF(header)
>>> data.show()
# +------+---+-------+
# |mailid|age|address|
# +------+---+-------+
# | satya| 23| Mumbai|
# | abc| 27| Goa|
# +------+---+-------+
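An alternative that avoids comparing every row against the header is rdd.mapPartitionsWithIndex, which drops only the first element of partition 0. Without a Spark cluster at hand, the partition logic can be illustrated in plain Python; skip_header is a hypothetical helper mirroring the function you would pass to Spark:

```python
from itertools import islice

def skip_header(partition_index, iterator):
    # mirrors the function passed to rdd.mapPartitionsWithIndex in Spark:
    # drop the first element of partition 0 only
    return islice(iterator, 1, None) if partition_index == 0 else iterator

# simulate an RDD with two partitions; the header lives in partition 0
partitions = [
    [['mailid', 'age', 'address'], ['satya', '23', 'Mumbai']],
    [['abc', '27', 'Goa']],
]
rows = [row for i, part in enumerate(partitions)
        for row in skip_header(i, iter(part))]
print(rows)  # header row removed, data rows intact
```

In Spark itself this would be `rdd.mapPartitionsWithIndex(skip_header).toDF(header)`; it touches only one partition's first row instead of filtering the whole dataset.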

Pandas concatenating dataframes raises ValueError

I am looping through many smaller dataframes and concatenating them into a single dataframe using pandas.concat(). In the middle of the loop, an exception is raised with the message ValueError: Plan shapes are not aligned.
The failing dataframe contains a single row (like all the previous dataframes) and its columns are a subset of the other dataframe's. A sample snippet of the code is below.
import os
import pandas as pd

df, failed = pd.DataFrame(), pd.DataFrame()
for _file in os.listdir(file_dir):
    _tmp = pd.read_csv(file_dir + _file)
    try:
        df = pd.concat([df, _tmp])
    except ValueError as e:
        if 'Plan shapes are not aligned' in str(e):
            failed = pd.concat([failed, _tmp])

print [x for x in failed.columns if x not in df.columns]
print len(df), len(failed)
And I end up with the result
Out[10]: []
118 1
Checking the failures it is always the same dataframe, so the dataframe must be the problem. Printing out the dataframe I get
0 timestamp actual average_estimate median_estimate \
0 1996-11-14 01:30:00 2.300000 2.380000 2.400000
0 estimate1 estimate2 estimate3 estimate4 \
0 2.400000 2.200000 2.500000 2.600000
0 estimate5
0 2.200000
Which has a similar format to the other concatenated dataframes and the df dataframe. Is there something that I'm missing?
Extra info: I am using pandas 0.16.0
Edit: full stack trace below with modifications for anonymity
Traceback (most recent call last):
File "C:\Users\<user>\Documents\GitHub\<environment>\lib\site-packages\IPython\core\interactiveshell.py", line 3066, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-2-48539cb93d64>", line 37, in <module>
df = pd.concat([df, _tmp])
File "C:\Users\<user>\Documents\GitHub\<environment>\lib\site-packages\pandas\tools\merge.py", line 755, in concat
return op.get_result()
File "C:\Users\<user>\Documents\GitHub\<environment>\lib\site-packages\pandas\tools\merge.py", line 926, in get_result
mgrs_indexers, self.new_axes, concat_axis=self.axis, copy=self.copy)
File "C:\Users\<user>\Documents\GitHub\<environment>\lib\site-packages\pandas\core\internals.py", line 4040, in concatenate_block_managers
for placement, join_units in concat_plan]
File "C:\Users\<user>\Documents\GitHub\<environment>\lib\site-packages\pandas\core\internals.py", line 4258, in combine_concat_plans
raise ValueError("Plan shapes are not aligned")
ValueError: Plan shapes are not aligned
Edit 2: Tried with 0.17.1 and 0.18.0 and still have the same error.
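One thing worth checking (an assumption - the question does not confirm it) is whether the failing frame has duplicate column names: pandas of that era reportedly raised exactly this "Plan shapes are not aligned" error when concatenating a frame whose columns are not unique, which a malformed CSV header can easily produce. A quick diagnostic sketch, where the repeated 'timestamp' column is hypothetical:

```python
import pandas as pd

# hypothetical frame with a repeated column name, as a malformed CSV
# header might produce
bad = pd.DataFrame([[1, 2, 3]], columns=['timestamp', 'actual', 'timestamp'])

# list any repeated column names before concatenating
dupes = bad.columns[bad.columns.duplicated()].tolist()
print(dupes)  # ['timestamp']

# one possible fix: keep only the first occurrence of each column
deduped = bad.loc[:, ~bad.columns.duplicated()]
print(list(deduped.columns))  # ['timestamp', 'actual']
```

Running this check on `_tmp` inside the loop would confirm or rule out duplicate columns as the cause.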

how to select every 5th row in a .csv file using python

Just a normal .csv file, whose first row holds the titles for each column.
I wonder how to create a new .csv file that has the same header (first row) but contains only every 5th row of the original file?
Thank you!
This will take any text file and output the first and every 5th line after that. It doesn't have to be manipulated as a .csv, if the columns aren't being accessed:
with open('a.txt') as f:
    with open('b.txt', 'w') as out:
        for i, line in enumerate(f):
            if i % 5 == 0:
                out.write(line)
This will read the file one line at a time and only write rows 5, 10, 15, 20...
import csv

count = 0
# open files and handle headers
with open('input.csv') as infile:
    with open('output.csv', 'w') as outfile:
        reader = csv.DictReader(infile)
        writer = csv.DictWriter(outfile, fieldnames=reader.fieldnames)
        writer.writeheader()
        # iterate through file and write only every 5th row
        for row in reader:
            count += 1
            if not count % 5:
                writer.writerow(row)
(works with Python 2 and 3)
If you'd prefer to start with data row #1 and write rows 1, 6, 11, 16..., change the initialization at the top to:
count = -1
If you want to use the csv library, a tighter version would be...
import csv

# open files and handle headers
with open('input.csv') as infile:
    with open('output.csv', 'w') as outfile:
        reader = csv.DictReader(infile)
        writer = csv.DictWriter(outfile, fieldnames=reader.fieldnames)
        writer.writeheader()
        # iterate through file and write only every 5th row
        writer.writerows([x for i, x in enumerate(reader) if i % 5 == 4])
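If pandas is already a dependency (an assumption - the question only mentions plain .csv files), the same result is a one-liner with positional slicing. The 12-row CSV here is a hypothetical stand-in for pd.read_csv('input.csv'):

```python
import io
import pandas as pd

# hypothetical 12-row CSV; in practice this would be pd.read_csv('input.csv')
csv_text = 'a,b\n' + '\n'.join('%d,%d' % (i, i * 10) for i in range(12))
df = pd.read_csv(io.StringIO(csv_text))

# keep data rows 1, 6, 11, ... (the header is preserved automatically)
every_fifth = df.iloc[::5]

buf = io.StringIO()  # stands in for to_csv('output.csv', index=False)
every_fifth.to_csv(buf, index=False)
print(every_fifth['a'].tolist())  # [0, 5, 10]
```

Use `df.iloc[4::5]` instead if you want data rows 5, 10, 15, ... as in the csv-module versions above.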

pandas .drop() memory error large file

For reference, this is all on a Windows 7 x64 bit machine in PyCharm Educational Edition 1.0.1, with Python 3.4.2 and Pandas 0.16.1
I have an ~791MB .csv file with ~3.04 million rows x 24 columns. The file contains liquor sales data for the state of Iowa from January 2014 to February 2015. If you are interested, the file can be found here: https://data.iowa.gov/Economy/Iowa-Liquor-Sales/m3tr-qhgy.
One of the columns, titled store location, holds the address including latitude and longitude. The purpose of the program below is to take the latitude and longitude out of the store location cell and place each in its own cell. When the file is cut down to ~1.04 million rows, my program works properly.
1 import pandas as pd
2
3 #import the original file
4 sales = pd.read_csv('Iowa_Liquor_Sales.csv', header=0)
5
6 #transfer the copies into lists
7 lat = sales['STORE LOCATION']
8 lon = sales['STORE LOCATION']
9
10 #separate the latitude and longitude from each cell into their own list
11 hold = [i.split('(', 1)[1] for i in lat]
12 lat2 = [i.split(',', 1)[0] for i in hold]
13 lon2 = [i.split(',', 1)[1] for i in hold]
14 lon2 = [i.split(')', 1)[0] for i in lon2]
15
16 #put the now separate latitude and longitude back into their own columns
17 sales['LATITUDE'] = lat2
18 sales['LONGITUDE'] = lon2
19
20 #drop the store location column
21 sales = sales.drop(['STORE LOCATION'], axis=1)
22
23 #export the new panda data frame into a new file
24 sales.to_csv('liquor_data2.csv')
However, when I try to run the code with the full 3.04 million line file, it gives me this error:
Traceback (most recent call last):
File "<input>", line 1, in <module>
File "C:\Python34\lib\site-packages\pandas\core\generic.py", line 1595, in drop
dropped = self.reindex(**{axis_name: new_axis})
File "C:\Python34\lib\site-packages\pandas\core\frame.py", line 2505, in reindex
**kwargs)
File "C:\Python34\lib\site-packages\pandas\core\generic.py", line 1751, in reindex
self._consolidate_inplace()
File "C:\Python34\lib\site-packages\pandas\core\generic.py", line 2132, in _consolidate_inplace
self._data = self._protect_consolidate(f)
File "C:\Python34\lib\site-packages\pandas\core\generic.py", line 2125, in _protect_consolidate
result = f()
File "C:\Python34\lib\site-packages\pandas\core\generic.py", line 2131, in <lambda>
f = lambda: self._data.consolidate()
File "C:\Python34\lib\site-packages\pandas\core\internals.py", line 2833, in consolidate
bm._consolidate_inplace()
File "C:\Python34\lib\site-packages\pandas\core\internals.py", line 2838, in _consolidate_inplace
self.blocks = tuple(_consolidate(self.blocks))
File "C:\Python34\lib\site-packages\pandas\core\internals.py", line 3817, in _consolidate
_can_consolidate=_can_consolidate)
File "C:\Python34\lib\site-packages\pandas\core\internals.py", line 3840, in _merge_blocks
new_values = _vstack([b.values for b in blocks], dtype)
File "C:\Python34\lib\site-packages\pandas\core\internals.py", line 3870, in _vstack
return np.vstack(to_stack)
File "C:\Python34\lib\site-packages\numpy\core\shape_base.py", line 228, in vstack
return _nx.concatenate([atleast_2d(_m) for _m in tup], 0)
MemoryError
I tried running the code line-by-line in the python console and found that the error occurs after the program runs the sales = sales.drop(['STORE LOCATION'], axis=1) line.
I have searched for similar issues elsewhere and the only answer I have come up with is chunking the file as it is read by the program, like this:
#import the original file
df = pd.read_csv('Iowa_Liquor_Sales7.csv', header=0, chunksize=chunksize)
sales = pd.concat(df, ignore_index=True)
My only problem with that is then I get this error:
Traceback (most recent call last):
File "C:/Users/Aaron/PycharmProjects/DATA/Liquor_Reasign_Pd.py", line 14, in <module>
lat = sales['STORE LOCATION']
TypeError: 'TextFileReader' object is not subscriptable
My google-foo is all foo'd out. Anyone know what to do?
UPDATE
I should specify that with the chunking method, the error comes about when the program tries to duplicate the store location column.
So I found an answer to my issue. I ran the program in python 2.7 instead of python 3.4. The only change I made was deleting line 8, as it is unused. I don't know if 2.7 just handles the memory issue differently, or if I had improperly installed the pandas package in 3.4. I will reinstall pandas in 3.4 to see if that was the problem, but if anyone else has a similar issue, try your program in 2.7.
UPDATE Realized that I was running 32 bit python on a 64 bit machine. I upgraded my versions of python and it runs without memory errors now.
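For anyone hitting the same wall on 32-bit Python, a lower-memory route is to combine chunked reading with pandas' vectorized string methods instead of Python list comprehensions. This is a sketch under the assumption that each store location ends in "(lat, lon)"; the two-row CSV is a hypothetical miniature of the Iowa file:

```python
import io
import pandas as pd

# hypothetical miniature of the Iowa file; the real code would pass
# chunksize= to pd.read_csv('Iowa_Liquor_Sales.csv') and loop over chunks
csv_text = (
    'STORE LOCATION,SALE\n'
    '"123 Main St (41.6, -93.6)",10\n'
    '"456 Oak Ave (42.0, -91.7)",20\n'
)
chunks = []
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=1):
    # pull lat/lon out in one vectorized pass per chunk
    coords = chunk['STORE LOCATION'].str.extract(r'\(([^,]+),\s*([^)]+)\)')
    chunk['LATITUDE'] = coords[0]
    chunk['LONGITUDE'] = coords[1]
    # drop the original column chunk by chunk, never holding two
    # full-size copies of the data at once
    chunks.append(chunk.drop('STORE LOCATION', axis=1))

sales = pd.concat(chunks, ignore_index=True)
print(sales)
```

Because each chunk is transformed before the next is read, peak memory stays near one chunk plus the accumulated output, rather than several full copies of the 3-million-row frame.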

Extracting columnar data correctly as it is in the file

Suppose I have a tabular column as below. Now I want to extract the data column-wise. I tried extracting the data by creating a list. It extracts the first row correctly, but from the second row onwards there is a space (i.e. under CEN/4), so my code considers 5.000000E-01 to be the zeroth column from the second row and starts reading from there. How do I extract the data correctly, column-wise? The output is scrambled.
0 1 25 CEN/4 -5.000000E-01 -3.607026E+04 -5.747796E+03 -8.912796E+02 -88.3178
5.000000E-01 3.607026E+04 5.747796E+03 8.912796E+02 1.6822
27 -5.000000E-01 -3.641444E+04 -5.783247E+03 -8.912796E+02 -88.3347
5.000000E-01 3.641444E+04 5.783247E+03 8.912796E+02 1.6653
28 -5.000000E-01 -3.641444E+04 -5.712346E+03 -8.912796E+02 -88.3386
5.000000E-01 3.641444E+04 5.712346E+03 8.912796E+02
My code is:
f1 = open('newdata1.txt', 'w')
L = []
for index, line in enumerate(open('Trial_1.txt', 'r')):
    #print index
    if index < 0:  #skip first 5 lines
        continue
    else:
        line = line.split()
        L.append('%s\t%s\t %s\n' % (line[0], line[1], line[2]))
f1.writelines(L)
f1.close()
my output looks like this:
0 1 CEN/4 -5.000000E-01 -5.120107E+04
5.000000E-01 5.120107E+04 1.028093E+04 5.979930E+03 8.1461
I want the columnar data as it is in the file. How do I do that? I am a beginner.
It's hard to tell from the way the input data is presented in your question, but I'm guessing your file uses tabs to separate columns. In any case, consider using Python's csv module with the relevant delimiter:
import csv

with open('input.csv') as f_in, open('newdata1', 'w') as f_out:
    reader = csv.reader(f_in, delimiter='\t')
    writer = csv.writer(f_out, delimiter='\t')
    for row in reader:
        writer.writerow(row)
See the Python csv module documentation for further details.
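If the file is actually fixed-width rather than tab-delimited (which the blank leading fields in the sample suggest), splitting on whitespace will shift the short continuation rows; slicing by character position keeps blank columns in place. A minimal sketch with made-up column positions - the real colspecs would have to be read off the actual file, or inferred with pandas.read_fwf(..., colspecs='infer'):

```python
# hypothetical fixed-width lines; the second row's first column is blank,
# like the continuation rows under CEN/4 in the question
lines = [
    'A   1.0   2.0',
    '    3.0   4.0',
]
# made-up (start, end) character positions for each column
colspecs = [(0, 4), (4, 10), (10, 16)]
rows = [[line[a:b].strip() for a, b in colspecs] for line in lines]
print(rows)  # [['A', '1.0', '2.0'], ['', '3.0', '4.0']]
```

Unlike line.split(), this keeps the empty first field of the second row as an empty string instead of shifting every value one column to the left.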