I am not able to append dataframe to already created dataframe.
Expecting output as below:
a b
0 1 1
1 2 4
2 11 121
3 12 144
import pandas
def FunAppend(*kwargs):
if len(kwargs)==0:
Dict1={'a':[1,2],'b':[1,4]}
df=pandas.DataFrame(Dict1)
else:
Dict1={'a':[11,12],'b':[121,144]}
dfTemp=pandas.DataFrame(Dict1)
df=df.append(dfTemp,ignore_index=True)
return df
df=FunAppend()
df=FunAppend(df)
print df
print "Completed"
Any help would be appreciated.
You are modifying the global variable df inside the FunAppend function.
Python needs to be explicitly notified about this. Add inside the function:
global df
And it works as expected.
As df is not defined when you go to the else case, although you did sent the df to the function in the form of kwargs, so you can access to the df by typing kwargs[0]
Change this:
df=df.append(dfTemp,ignore_index=True)
To:
df=kwargs[0].append(dfTemp,ignore_index=True)
Related
How do I set the data within a dataframe to be left-aligned?
I'm using python 2.7.13.
This question has been asked before but the accepted answer didn't even work.
The answer given was:
df.style.set_properties(**{'text-align': 'left'})
It doesn't work, my data is still right aligned.
Does anyone know how? Do I have to import any modules other than pandas?
Case 1: Styling to print as html
df.style.set_properties returns an object of type pandas.io.formats.style.Styler
type(df.style.set_properties(**{'text-align': 'left'}))
Out[37]: pandas.io.formats.style.Styler
Which is meant to be rendered as an html string as follows:
s = df.style.set_properties(**{'text-align': 'left'})
s.render()
Then you can use the result of s.render() in your HTML file.
Case 2: Align data left as a dataframe
If you are looking for a way to remove left whitespaces from the values in your dataFrame and leave the data within a dataframe, here's an example on how to do that:
df = pd.DataFrame([[' a',' b'],[' c', ' d']], columns=list('AB'))
df = df.stack().str.lstrip().unstack()
output:
A B
0 a b
1 c d
I am trying to run Dickey-Fuller test in statsmodels in Python but getting error P
Running from python 2.7 & Pandas version 0.19.2. Dataset is from Github and imported the same
enter code here
from statsmodels.tsa.stattools import adfuller
def test_stationarity(timeseries):
print 'Results of Dickey-Fuller Test:'
dftest = ts.adfuller(timeseries, autolag='AIC' )
dfoutput = pd.Series(dftest[0:4], index=['Test Statistic','p-value','#Lags Used','Number of Observations Used'])
for key,value in dftest[4].items():
dfoutput['Critical Value (%s)'%key] = value
print dfoutput
test_stationarity(tr)
Gives me following error :
Results of Dickey-Fuller Test:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-15-10ab4b87e558> in <module>()
----> 1 test_stationarity(tr)
<ipython-input-14-d779e1ed35b3> in test_stationarity(timeseries)
19 #Perform Dickey-Fuller test:
20 print 'Results of Dickey-Fuller Test:'
---> 21 dftest = ts.adfuller(timeseries, autolag='AIC' )
22 #dftest = adfuller(timeseries, autolag='AIC')
23 dfoutput = pd.Series(dftest[0:4], index=['Test Statistic','p-value','#Lags Used','Number of Observations Used'])
C:\Users\SONY\Anaconda2\lib\site-packages\statsmodels\tsa\stattools.pyc in adfuller(x, maxlag, regression, autolag, store, regresults)
209
210 xdiff = np.diff(x)
--> 211 xdall = lagmat(xdiff[:, None], maxlag, trim='both', original='in')
212 nobs = xdall.shape[0] # pylint: disable=E1103
213
C:\Users\SONY\Anaconda2\lib\site-packages\statsmodels\tsa\tsatools.pyc in lagmat(x, maxlag, trim, original)
322 if x.ndim == 1:
323 x = x[:,None]
--> 324 nobs, nvar = x.shape
325 if original in ['ex','sep']:
326 dropidx = nvar
ValueError: too many values to unpack
tr must be a 1d array-like, as you can see here. I don't know what is tr in your case. Assuming that you defined tr as the dataframe that contains the time serie's data, you should do something like this:
tr = tr.iloc[:,0].values
Then adfuller will be able to read the data.
just change the line as:
dftest = adfuller(timeseries.iloc[:,0].values, autolag='AIC' )
It will work. adfuller requires a 1D array list. In your case you have a dataframe. Therefore mention the column or edit the line as mentioned above.
I am assuming since you are using the Dickey-Fuller test .you want to maintain the timeseries i.e date time column as index.So in order to do that.
tr=tr.set_index('Month') #I am assuming here the time series column name is Month
ts = tr['othercoulumnname'] #Just use the other column name here it might be count or anything
I hope this helps.
My pandas dataframe is structured as follows:
date tag
0 2015-07-30 19:19:35-04:00 E7RG6
1 2016-01-27 08:20:01-05:00 ER57G
2 2015-11-15 23:32:16-05:00 EQW7G
3 2016-07-12 00:01:11-04:00 ERV7G
4 2016-02-14 00:35:21-05:00 EQW7G
5 2016-03-01 00:08:59-05:00 EQW7G
6 2015-06-19 07:15:06-04:00 ER57G
7 2016-09-08 18:17:53-04:00 ER5TT
8 2016-09-03 01:53:45-04:00 EQW7G
9 2015-11-30 09:31:02-05:00 ER57G
10 2016-03-03 22:28:26-05:00 ES5TG
11 2016-02-11 10:39:24-05:00 E5P7G
12 2015-03-16 07:18:47-04:00 ER57G
...
[11015 rows x 2 columns]
date datetime64[ns, America/New_York]
tag object
dtype: object
I'm attempting to set the column 'date' as the index:
df = df.set_index(pd.DatetimeIndex(df['date']))
which yields the following error (using pandas 0.19)
File "pandas/tslib.pyx", line 3753, in pandas.tslib.tz_localize_to_utc (pandas/tslib.c:64516)
pytz.exceptions.AmbiguousTimeError: Cannot infer dst time from Timestamp('2015-11-01 01:38:12'), try using the 'ambiguous' argument
I've consulted this, but I'm still unable to work through this error. For example,
df = df.set_index(pd.DatetimeIndex(df['date']), ambiguous='infer')
yields:
File "pandas/tslib.pyx", line 3703, in pandas.tslib.tz_localize_to_utc (pandas/tslib.c:63553)
pytz.exceptions.AmbiguousTimeError: Cannot infer dst time from 2015-11-01 01:38:12 asthere are no repeated times
Any advice on how to convert the datetime column to the index would be greatly appreciated.
If your dtype for a column is already datetime then you can just call set_index without the need to try to construct a DatetimeIndex from the column:
df.set_index(df['date'], inplace=True)
should just work, the dtype for the index is sniffed out so there is no need to construct an index object from the Series/column here.
I am trying to automate 100 google searches (one per individual String in a row and return urls per each query) on a specific column in a csv (via python 2.7); however, I am unable to get Pandas to read the row contents to the Google Search automater.
*GoogleSearch source = https://breakingcode.wordpress.com/2010/06/29/google-search-python/
Overall, I can print Urls successfully for a query when I utilize the following code:
from google import search
query = "apples"
for url in search(query, stop=5, pause=2.0):
print(url)
However, when I add Pandas ( to read each "query") the rows are not read -> queried as intended. I.E. "data.irow(n)" is being queired instead of the row contents, one at a time.
from google import search
import pandas as pd
from pandas import DataFrame
query_performed = 0
querying = True
query = 'data.irow(n)'
#read the excel file at column 2 (i.e. "Fruit")
df = pd.read_csv('C:\Users\Desktop\query_results.csv', header=0, sep=',', index_col= 'Fruit')
# need to specify "Column2" and one "data.irow(n)" queried at a time
while querying:
if query_performed <= 100:
print("query")
query_performed +=1
else:
querying = False
print("Asked all 100 query's")
#prints initial urls for each "query" in a google search
for url in search(query, stop=5, pause=2.0):
print(url)
Incorrect output I receive at the command line:
query
Asked all 100 query's
query
Asked all 100 query's
Asked all 100 query's
http://www.irondata.com/
http://www.irondata.com/careers
http://transportation.irondata.com/
http://www.irondata.com/about
http://www.irondata.com/public-sector/regulatory/products/versa
http://www.irondata.com/contact-us
http://www.irondata.com/public-sector/regulatory/products/cavu
https://www.linkedin.com/company/iron-data-solutions
http://www.glassdoor.com/Reviews/Iron-Data-Reviews-E332311.htm
https://www.facebook.com/IronData
http://www.bloomberg.com/research/stocks/private/snapshot.asp?privcapId=35267805
http://www.indeed.com/cmp/Iron-Data
http://www.ironmountain.com/Services/Data-Centers.aspx
FYI: My Excel .CSV format is the following:
B
1 **Fruit**
2 apples
2 oranges
4 mangos
5 mangos
6 mangos
...
101 mangos
Any advice on next steps is greatly appreciated! Thanks in advance!
Here's what I got. Like I mentioned in my comment, I couldn't get the stop parameter to work like i thought it should. Maybe i'm misunderstanding how its used. I'm assuming you only want the first 5 urls per search.
a sample df
d = {"B" : ["mangos", "oranges", "apples"]}
df = pd.DataFrame(d)
Then
stop = 5
urlcols = ["C","D","E","F","G"]
# Here i'm using an apply() to call the google search for each 'row'
# and a list is built for the urls return by search()
df[urlcols] = df["B"].apply(lambda fruit : pd.Series([url for url in
search(fruit, stop=stop, pause=2.0)][:stop])) #get 5 by slicing
which gives you. Formatting is a bit rough on this
B C D E F G
0 mangos http://en.wikipedia.org/wiki/Mango http://en.wikipedia.org/wiki/Mango_(disambigua... http://en.wikipedia.org/wiki/Mangifera http://en.wikipedia.org/wiki/Mangifera_indica http://en.wikipedia.org/wiki/Purple_mangosteen
1 oranges http://en.wikipedia.org/wiki/Orange_(fruit) http://en.wikipedia.org/wiki/Bitter_orange http://en.wikipedia.org/wiki/Valencia_orange http://en.wikipedia.org/wiki/Rutaceae http://en.wikipedia.org/wiki/Cherry_Orange
2 apples https://www.apple.com/ http://desmoines.citysearch.com/review/692986920 http://local.yahoo.com/info-28919583-apple-sto... http://www.judysbook.com/Apple-Store-BtoB~Cell... https://tr.foursquare.com/v/apple-store/4b466b...
if you'd rather not specify the columns (i.e. ["C",D"..]) you could do the following.
df.join(df["B"].apply(lambda fruit : pd.Series([url for url in
search(fruit, stop=stop, pause=2.0)][:stop])))
I am able to convert a csv file to pandas DataFormat and able to print out the table, as seen below. However, when I try to print out the Height column I get an error. How can I fix this?
import pandas as pd
df = pd.read_csv('/path../NavieBayes.csv')
print df #this prints out as seen below
print df.Height #this gives me the "AttributeError: 'DataFrame' object has no attribute 'Height'
Height Weight Classifer
0 70.0 180 Adult
1 58.0 109 Adult
2 59.0 111 Adult
3 60.0 113 Adult
4 61.0 115 Adult
I have run into a similar issue before when reading from csv. Assuming it is the same:
col_name =df.columns[0]
df=df.rename(columns = {col_name:'new_name'})
The error in my case was caused by (I think) by a byte order marker in the csv or some other non-printing character being added to the first column label. df.columns returns an array of the column names. df.columns[0] gets the first one. Try printing it and seeing if something is odd with the results.
PS On above answer by JAB - if there is clearly spaces in your column names use skipinitialspace=True in read_csv e.g.
df = pd.read_csv('/path../NavieBayes.csv',skipinitialspace=True)
df = pd.read_csv(r'path_of_file\csv_file_name.csv')
OR
df = pd.read_csv('path_of_file/csv_file_name.csv')
Example:
data = pd.read_csv(r'F:\Desktop\datasets\hackathon+data+set.csv')
Try it, it will work.