Renaming columns using column data from another dataframe - python-2.7

I have two dataframes - an original file that looks like this:
Gene Symbol, 10555, 10529, 10519
Map7, .184, .026, .207
nan, .348, .041, .187
Cpm, .45, .278, .453
and a reference file that looks like this:
Experiment_Num, Microarray, Experiment_Name, Chip_Name
10555, Genechip, Famotidine-5d, RG230-2
10529, Genechip, Famotidine-3d, RG230-2
10519, MMchip, Dicyclomine-3d, R01
I am trying to merge them so that the header of the original file displays the Experiment_Name rather than just the Experiment_Num, as follows:
Gene symbol, Famotidine-5d, Famotidine-3d, Dicyclomine-3d
Map7, .184, .026, .207
nan, .348, .041, .187
Cpm, .45, .278, .453
My code is completely written using pandas and looks as follows:
import pandas as pd
df = pd.read_csv('ftp://anonftp.niehs.nih.gov/drugmatrix/Differentially_expressed_gene_lists_directly_from_DrugMatrix/Affymetrix/Affymetrix_annotation.txt', sep='\t', dtype=str)
# Reference file
df2.columns = df2.columns.to_series().replace(df.set_index('Experiment').Compound_Name)
#Original File
df2
I tried to convert the columns of the original DF to its series representation and then replace the old values, which were the Experiment_Num, with the new Experiment_Name retrieved from the reference DF, but I keep getting
KeyError: 'Experiment'
I tried to figure out what could be causing the KeyError, but there are so many possible causes, and none of the fixes I found resolves my particular issue.
Thanks for the help if possible!
Troy
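KeyError: 'Experiment' is what set_index raises when the frame has no column with exactly that name. Below is a minimal sketch of the renaming, assuming the reference columns are called Experiment_Num and Experiment_Name as in the sample above (the real annotation file may use different names, so it is worth checking df.columns first). The inline sample data and the names original, reference, mapping and renamed are illustrative only:

```python
import io
import pandas as pd

# Sample data copied from the question; in practice both frames come from files.
original = pd.read_csv(io.StringIO(u"""Gene Symbol,10555,10529,10519
Map7,.184,.026,.207
nan,.348,.041,.187
Cpm,.45,.278,.453"""))

# dtype=str keeps Experiment_Num as text so it matches the header labels.
reference = pd.read_csv(io.StringIO(u"""Experiment_Num,Microarray,Experiment_Name,Chip_Name
10555,Genechip,Famotidine-5d,RG230-2
10529,Genechip,Famotidine-3d,RG230-2
10519,MMchip,Dicyclomine-3d,R01"""), dtype=str)

# Build an {Experiment_Num: Experiment_Name} lookup from the reference frame.
mapping = reference.set_index('Experiment_Num')['Experiment_Name'].to_dict()

# Rename only the headers that appear in the lookup.
renamed = original.rename(columns=mapping)
```

Columns that have no entry in the mapping, such as Gene Symbol, are left unchanged by rename.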

Related

Python - AttributeError: 'DataFrame' object has no attribute

I have a CSV file with various columns, and everything worked perfectly for the past few months until I updated the file with new information. Now one column does not appear to be picked up by Python. I am using Python 2.7 and have made sure I have the latest version of pandas.
When I downloaded the csv file from Yahoo Finance, I opened it in Excel and made changes to the format of the columns in order to make it more readable as all information was in one cell. I used the "Text to Column" feature and split up the data based on where the commas were.
Then I made sure that in each column there were no white spaces in the beginning of the cell using the Trim function in excel and left-aligning the data.
I tried the following and still get the same or a similar error:
After df = pd.read_csv("KIO.csv") I tried reading the first few rows using df.head() - but still got the same error.
I tried renaming the problematic column as suggested in a similar post using:
df = df.rename(columns={"Close": "Closing"}) - here I got the same error again. "print df.columns" also led to the same issue.
"df[1]" - gave a long error with "KeyError: 1" at the end - I can post the entire thing if it will assist.
Adding "skipinitialspace=True" - no difference.
I thought the problem might be within the actual csv file information so I deleted all the columns and made my own information and I still got the same error.
Below is a portion of my code as the total code is very long:
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.dates as pltdate
import datetime
import matplotlib.animation as animation
import numpy as np
df = pd.read_csv("KIO.csv", skipinitialspace=True)
#df.head()
#Close = df.columns[0]
#df= df.rename(columns={"Close": "Closing"})
df1 = pd.read_csv("USD-ZAR.csv")
kio_close = pd.DataFrame(df.Close)
exchange = pd.DataFrame(df1.Value)
dates = df["Date"]
dates1 = df1["Date"]
The above variables have been used throughout the remaining code though so if this issue can be solved here the remaining code will be right.
This is copy/paste of the error:
Traceback (most recent call last):
File "C:/Users/User/Documents/PycharmProjects/Trading_GUI/GUI_testing.py", line 33, in <module>
kio_close = pd.DataFrame(df.Close)
File "C:\Python27\lib\site-packages\pandas\core\generic.py", line 4372, in __getattr__
return object.__getattribute__(self, name)
AttributeError: 'DataFrame' object has no attribute 'Close'
Thank you so much in advance.
#Rip_027 This is in regard to your last comment. I used to have the same issue whenever I opened a csv file by simply double-clicking the file icon. You need to launch Excel first, then get external data. The link below has more details, which will serve as a guideline. Hope this helps.
https://www.hesa.ac.uk/support/user-guides/import-csv
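One way to check whether editing in Excel changed the header is to look at the exact column labels and normalise them before using attribute access such as df.Close. This is a small sketch, assuming the problem is stray whitespace around the header cells; the inline CSV text and its values are made up for illustration:

```python
import io
import pandas as pd

# Hypothetical CSV text whose header cells carry stray spaces,
# as can happen after a file has been edited in a spreadsheet.
raw = u"Date , Close ,Volume\n2017-01-03,251.5,1000\n2017-01-04,255.0,1200\n"
df = pd.read_csv(io.StringIO(raw))

# Inspect the exact labels; repr-style output makes padding visible.
labels_before = df.columns.tolist()

# Strip leading and trailing whitespace from every label.
# (skipinitialspace=True would only drop spaces that follow a delimiter.)
df.columns = df.columns.str.strip()

# Attribute and key access now find the column.
kio_close = df[['Close']]
```

If labels_before shows the expected names with no padding, the cause lies elsewhere, for example in a changed delimiter or an extra header row.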

How to create a bag of words from csv file in python?

I am new to python. I have a csv file which has cleaned tweets. I want to create a bag of words of these tweets.
I have the following code, but it's not working correctly.
import pandas as pd
from sklearn import svm
from sklearn.feature_extraction.text import CountVectorizer
data = pd.read_csv(open("Twidb11.csv"), sep=' ')
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(data.Text)
count_vect.vocabulary_
Error:
.ParserError: Error tokenizing data. C error: Expected 19 fields in
line 5, saw 22
I think this is a duplicate. You can see the answer here; there are a lot of answers and comments.
So, a solution can be:
data = pd.read_csv('Twidb11.csv', error_bad_lines=False)
Or:
df = pandas.read_csv(fileName, sep='delimiter', header=None)
"In the code above, sep defines your delimiter and header=None tells pandas that your source data has no row for headers / column titles. Thus saith the docs: 'If file contains no header row, then you should explicitly pass header=None'. In this instance, pandas automatically creates whole-number indices for each field {0,1,2,...}."
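Here is a small sketch of skipping malformed rows, assuming the file is comma-separated and only a few lines carry extra fields. Note that error_bad_lines was deprecated in pandas 1.3 and removed in 2.0, where on_bad_lines='skip' takes its place. The inline data is hypothetical:

```python
import io
import pandas as pd

# Hypothetical file contents: the third line has one field too many.
raw = u"Id,Text\n1,good morning\n2,hello,world\n3,nice day\n"

# on_bad_lines='skip' needs pandas 1.3 or later;
# on older versions the equivalent is error_bad_lines=False.
data = pd.read_csv(io.StringIO(raw), on_bad_lines='skip')

# Only the well-formed rows remain.
texts = data.Text.tolist()
```

The surviving data.Text column can then be passed to CountVectorizer as in the question. Skipped rows are lost, so it is worth checking how many there are before relying on the result.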

How can I parse multiple date columns in Pandas?

I have a field/column in a .csv file that I am loading into Pandas that will not parse as a datetime data type in Pandas. I don't really understand why. I want both FirstTime and SecondTime to parse as datetime64 in Pandas DataFrame.
# Assigning a header for our data
header = ['FirstTime', 'Col1', 'Col2', 'Col3', 'SecondTime', 'Col4',
'Col5', 'Col6', 'Col7', 'Col8']
# Loading our data into a dataframe
df = pd.read_csv('MyData.csv', names=header, parse_dates=['FirstTime', 'SecondTime'])
The code above will only parse SecondTime as datetime64[ns]. FirstTime is left as an object data type. If I use the following code instead:
# Assigning a header for our data
header = ['FirstTime', 'Col1', 'Col2', 'Col3', 'SecondTime', 'Col4',
'Col5', 'Col6', 'Col7', 'Col8']
# Loading our data into a dataframe
df = pd.read_csv('MyData.csv', names=header, parse_dates=['FirstTime'])
It still will not parse FirstTime as a datetime64[ns].
The format for both columns is the same:
# Example FirstTime
# (%f is always .000)
2015-11-05 16:52:37.000
# Example SecondTime
# (%f is always .000)
2015-11-04 15:33:15.000
What am I missing here? Is the first column not able to be datetime by default or something in Pandas?
Did you try:
df = pd.read_csv('MyData.csv', names=header, parse_dates=True)
I had a similar problem, and it turned out that one of my date variables contained an integer cell. So Python recognized it as "object", while the other one was recognized as "int64". You need to make sure both variables are integers.
You can use df.dtypes to see the format of your variables.
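read_csv leaves a column as object when it cannot parse every value in it, so one way to find the cause is to convert the column explicitly and look at the cells that fail. This is a sketch under the assumption that some FirstTime cells are malformed; the shortened header and the inline rows are hypothetical:

```python
import io
import pandas as pd

# Shortened, hypothetical version of the header from the question.
header = ['FirstTime', 'Col1', 'SecondTime']
raw = (u"2015-11-05 16:52:37.000,a,2015-11-04 15:33:15.000\n"
       u"not-a-date,b,2015-11-04 15:34:15.000\n")
df = pd.read_csv(io.StringIO(raw), names=header)

# errors='coerce' turns unparseable cells into NaT instead of raising.
parsed = pd.to_datetime(df['FirstTime'], errors='coerce')

# Rows whose FirstTime could not be parsed, with their original text.
bad_rows = df[parsed.isna()]

# Replace the column once the offending cells are understood.
df['FirstTime'] = parsed
```

Inspecting bad_rows shows which lines of the file need cleaning; after that, parse_dates in read_csv should behave the same for both columns.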

How to improve the code with more elegant way and low memory consumed?

I have a dataset whose dimensions are around 2,000 (rows) x 120,000 (columns).
I'd like to pick out certain columns (~8,000 columns),
so the resulting file would be 2,000 (rows) x 8,000 (columns).
Here is the code, written by a kind person on Stack Overflow (I am sorry, I have forgotten his name).
import pandas as pd
df = pd.read_csv('...mydata.csv')
my_query = pd.read_csv('...myquery.csv')
df[list(my_query['Name'].unique())].to_csv('output.csv')
However, it raises a MemoryError in my console, so the code does not work for data of this size.
Does anyone know a more memory-efficient way to select these columns?
I think I found your source.
My solution uses read_csv with these arguments:
iterator=True - if True, return a TextFileReader to enable reading a file into memory piece by piece
chunksize=1000 - a number of rows to be used to “chunk” a file into pieces. Will cause a TextFileReader object to be returned
usecols=subset - a subset of columns to return, results in much faster parsing time and lower memory usage
Source.
I filter the large dataset with usecols, so I load only a (2 000, 8 000) dataset instead of (2 000, 120 000).
import pandas as pd
#read subset from csv and remove duplicate indices
subset = pd.read_csv('8kx1.csv', index_col=[0]).index.unique()
print subset
#use subset as filter of columns
tp = pd.read_csv('input.csv',iterator=True, chunksize=1000, usecols=subset)
df = pd.concat(tp, ignore_index=True)
print df.head()
print df.shape
#write to csv
df.to_csv('output.csv', chunksize=1000)
I use this snippet for testing:
import pandas as pd
import io
temp=u"""A,B,C,D,E,F,G
1,2,3,4,5,6,7"""
temp1=u"""Name
B
B
C
B
C
C
E
F"""
subset = pd.read_csv(io.StringIO(temp1), index_col=[0]).index.unique()
print subset
#use subset as filter of columns
df = pd.read_csv(io.StringIO(temp), usecols=subset)
print df.head()
print df.shape
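If concatenating all chunks still uses too much memory, each chunk can be written out as soon as it is read. The following is a sketch of that variation, not part of the answer above; the inline data, the wanted list and the temporary output path are hypothetical:

```python
import io
import os
import tempfile
import pandas as pd

# Hypothetical stand-ins for the large input file and the column subset.
source = u"A,B,C,D\n1,2,3,4\n5,6,7,8\n9,10,11,12\n"
wanted = ['B', 'D']

out_path = os.path.join(tempfile.mkdtemp(), 'output.csv')

# Read two rows at a time, keeping only the wanted columns.
reader = pd.read_csv(io.StringIO(source), usecols=wanted, chunksize=2)
for i, chunk in enumerate(reader):
    # The first chunk creates the file with a header; later chunks append.
    chunk.to_csv(out_path, mode='w' if i == 0 else 'a',
                 header=(i == 0), index=False)

# Read the result back only to confirm what was written.
result = pd.read_csv(out_path)
```

Only one chunk is held in memory at a time this way, at the cost of never having the full frame available for further processing.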

How do I iterate a loop over several data frames in a list in python

I am very new to programming and am working with Python. For a work project I am trying to read several .csv files, convert them to data frames, concatenate some of the fields into one for a column header, and then append all of the dataframes into one big DataFrame. I have searched extensively in StackOverflow as well as in other resources but I have not been able to find an answer. Here is the code I have thus far along with some abbreviated output:
import pandas as pd
import glob
# Read a directory of files to a list
csvlist = []
for f in glob.glob("AssayCerts/*"):
    csvlist.append(f)
csvlist
['AssayCerts/CH09051590.csv', 'AssayCerts/CH09051591.csv', 'AssayCerts/CH14158806.csv', 'AssayCerts/CH14162453.csv', 'AssayCerts/CH14186004.csv']
# Read .csv files and convert to DataFrames
dflist = []
for csv in csvlist:
    df = pd.read_csv(filename, header = None, skiprows = 7)
    dflist.append(df)
dflist
[ 0 1 2 3 4 5 \
0 NaN Au-AA23 ME-ICP41 ME-ICP41 ME-ICP41 ME-ICP41
1 SAMPLE Au Ag Al As B
2 DESCRIPTION ppm ppm % ppm ppm
#concatenates the cells in the first three rows of the last dataframe; need to apply this to all of the dataframes.
for df in dflist:
    column_names = df.apply(lambda x: str(x[1]) + '-'+str(x[2])+' - '+str(x[0]),axis=0)
column_names
0 SAMPLE-DESCRIPTION - nan
1 Au-ppm - Au-AA23
2 Ag-ppm - ME-ICP41
3 Al-% - ME-ICP41
I am unable to apply the last operation across all of the DataFrames. It seems I can only get it to apply to the last DataFrame in my list. Once I get past this point I will have to append all of the DataFrames to form one large DataFrame.
As Andy Hayden mentions in his comment, the reason your loop only appears to work on the last DataFrame is that you just keep assigning the result of df.apply( ... ) to column_names, which gets written over each time. So at the end of the loop, column_names always contains the results from the last DataFrame in the list.
But you also have some other problems in your code. In the loop that begins for csv in csvlist:, you never actually reference csv - you just reference filename, which doesn't appear to be defined. And dflist just appears to have one DataFrame in it anyway.
As written in your problem, the code doesn't appear to work. I'd advise posting the real code that you're using, and only what's relevant to your problem (i.e. if building csvlist is working for you, then you don't need to show it to us).
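A sketch of how the loop could keep a result for every DataFrame rather than overwriting column_names each time. It assumes each file starts with the three header rows shown in the question; the temporary directory, file names and values are hypothetical stand-ins for the AssayCerts folder:

```python
import glob
import os
import tempfile
import pandas as pd

# Hypothetical stand-in for the AssayCerts directory: two tiny files,
# each with three header rows followed by one data row.
tmpdir = tempfile.mkdtemp()
for name, value in [('a.csv', '0.1'), ('b.csv', '0.2')]:
    with open(os.path.join(tmpdir, name), 'w') as fh:
        fh.write(',Au-AA23,ME-ICP41\n'
                 'SAMPLE,Au,Ag\n'
                 'DESCRIPTION,ppm,ppm\n'
                 'S1,' + value + ',1.5\n')

frames = []
for path in sorted(glob.glob(os.path.join(tmpdir, '*.csv'))):
    # Use the loop variable itself when reading each file.
    df = pd.read_csv(path, header=None)
    # Combine the first three rows into one label per column.
    labels = df.apply(
        lambda col: str(col[1]) + '-' + str(col[2]) + ' - ' + str(col[0]),
        axis=0)
    # Keep only the data rows and attach the combined labels.
    data = df.iloc[3:].copy()
    data.columns = labels.tolist()
    frames.append(data)

# Stack all frames into one large DataFrame.
combined = pd.concat(frames, ignore_index=True)
```

Because the labels are applied inside the loop and each frame is appended to frames, nothing is lost between iterations, and pd.concat aligns the frames on the combined column names.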