Adding constant values at the beginning of a dataframe in pyspark - python-2.7

I am trying to read a CSV file from an HDFS location, and three columns (batchid, load timestamp, and a delete indicator) need to be added at the beginning. I am using Spark 2.3.2 and Python 2.7.5. Sample values for the three columns to be added are given below.
batchid - YYYYMMdd (int)
Load timestamp - current timestamp (timestamp)
delete indicator - blank (string)

Your question is a little obscure, but you can do something along these lines. First, create your timestamp using Python's standard library:
import time
import datetime
timestamp = datetime.datetime.fromtimestamp(time.time()).strftime('%Y-%m-%d %H:%M:%S')
Then, assuming you use the DataFrame API, you plug that into your columns:
import pyspark.sql.functions as psf
df = (df
      .withColumn('time',
                  psf.unix_timestamp(
                      psf.lit(timestamp), 'yyyy-MM-dd HH:mm:ss'
                  ).cast("timestamp"))
      .withColumn('batchid', psf.date_format('time', 'yyyyMMdd').cast('int'))
      .withColumn('delete', psf.lit('')))
To reorder your columns:
df = df.select(*["time", "batchid", "delete"] + [k for k in df.columns if k not in ["time", "batchid", "delete"]])
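If the goal is simply to stamp every row at load time, here is a minimal end-to-end sketch using Spark's built-in current_timestamp instead of a Python-side string. The HDFS path and column names are illustrative, and a SparkSession named spark is assumed:

import pyspark.sql.functions as psf

# Illustrative path; replace with your own HDFS location.
df = spark.read.csv('hdfs:///data/input.csv', header=True)

df = (df
      .withColumn('load_ts', psf.current_timestamp())  # current timestamp
      .withColumn('batchid',
                  psf.date_format('load_ts', 'yyyyMMdd').cast('int'))  # int YYYYMMdd
      .withColumn('delete_ind', psf.lit('')))  # blank string indicator

# Move the three new columns to the front.
df = df.select(['batchid', 'load_ts', 'delete_ind'] +
               [c for c in df.columns if c not in ['batchid', 'load_ts', 'delete_ind']])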

Related

How to create a bag of words from a csv file in Python?

I am new to Python. I have a csv file that contains cleaned tweets, and I want to create a bag of words from these tweets.
I have the following code, but it's not working correctly.
import pandas as pd
from sklearn import svm
from sklearn.feature_extraction.text import CountVectorizer
data = pd.read_csv(open("Twidb11.csv"), sep=' ')
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(data.Text)
count_vect.vocabulary_
Error:
.ParserError: Error tokenizing data. C error: Expected 19 fields in
line 5, saw 22
I think this is a duplicate; you can see the answer here. There are a lot of answers and comments.
So the solution can be:
data = pd.read_csv('Twidb11.csv', error_bad_lines=False)
Or:
df = pandas.read_csv(fileName, sep='delimiter', header=None)
"In the code above, sep defines your delimiter and header=None tells pandas that your source data has no row for headers / column titles. Thus saith the docs: "If file contains no header row, then you should explicitly pass header=None". In this instance, pandas automatically creates whole-number indeces for each field {0,1,2,...}."

How to export the data of a pivot table (of an existing excel sheet) to multiple excel sheets

I am a bit new to Python and am implementing a project to convert an excel sheet into a pivot-table excel sheet. Once the pivot-table sheet is created, I am trying to create a separate excel sheet for each row of the pivot data.
I am successful in getting to the pivot data in excel (Workbook3.xlsx), but from there, when I try a for loop, I get the error below:
"raise NotImplementedError("Writing as Excel with a MultiIndex is "
NotImplementedError: Writing as Excel with a MultiIndex is not yet implemented."
Can someone help me implement this last step?
Below is my code :
# Author: Abhishek
# Version: December 23, 2016
import pandas as pd
import xlrd
import xlsxwriter
import numpy as np
names=['Business Type','Trip ID','Status','Name of Customer','Driver Name','Trip Type','One way/two way','Channel of Lead','Payment Rate','Remark for Payment/Discount/Fixed Rate/Monthly Rate','Trip Date','Trip ID','Entry Date and Time','CSE','Trip Time(when customer required)','Begin Trip Time','End Trip Time','Hours','Minutes','Basic','Basics','Transit','Transits','Discount ','Discounts','Tax','Total','Wallet','Wallet Type','Cash with Driver','Adjustment','Remark','Blank','Basic','Zuver (20%)','Basic Earning','Transit','Total Earning','Cash Colleted','Balance','Inventive','Total Earning','Total Cash Collected','Total Balance','Total Incentive','Final Invoice']
df=pd.read_excel(r'path/28 Nov - 4 Dec _ Payment IC _ Mumbai.xlsx',sheetname='calc',header=None,names=names)
df = df[df.Status != 'Cancelled']
df = df[df.Status != 'Unfulfilled']
#print df['Begin Trip Time'].values
# Defining variables for the output Report
custname=df['Name of Customer'].values
drivername=df['Driver Name'].values
drivercontact=df['Trip Date'].values
status=df['Status'].values
tripstatus=df['Begin Trip Time'].values
triptype=df['End Trip Time'].values
starttime=df['Hours'].values
endtime=df['Minutes'].values
transit=df['Transits'].values
totalbill=df['Basic Earning'].values
totalearning=df['Transits'].values+df['Basic Earning'].values
cashcollected=df['Cash Colleted'].values
balance=df['Balance'].values
incentive= df['Inventive'].values
df1 = pd.DataFrame(zip(totalbill,drivername,custname,drivercontact,status,tripstatus,triptype,starttime,endtime,totalbill,transit,totalearning,cashcollected,balance,incentive))
# Create a Pandas Excel writer using XlsxWriter as the engine.
writer = pd.ExcelWriter('path/Workbook2.xlsx', engine='xlsxwriter')
df1.to_excel(writer, sheet_name='Sheet1',index=False,header=False)
# Close the Pandas Excel writer and output the Excel file.
writer.save()
#Pivot table
df2=pd.read_excel(r'path/Workbook2.xlsx',sheetname='Sheet1')
#table=pd.pivot_table(df2, index=['Driver Name','Name of Customer'])
df2['Inventive'].fillna(0, inplace=True)
df2 = pd.pivot_table(df2, values=['Hours', 'Minutes'],index=['Driver Name','Name of Customer','Inventive','Trip Date','Status','Begin Trip Time','End Trip Time','Basic Earning','Cash Colleted','Balance'],aggfunc=np.sum, fill_value='0', margins=True)
pivoted = pd.ExcelWriter('path/Workbook3.xlsx', engine='xlsxwriter')
df2.to_excel(pivoted, sheet_name='Sheet1')
pivoted.save()
df3=pd.read_excel(r'path/Workbook3.xlsx',sheetname='Sheet1')
for n in range(0, len(df3)):
    tempdata = df3.iloc[n]
    df4 = pd.DataFrame(tempdata)
    writer = pd.ExcelWriter('path/Final%s.xlsx' % n, engine='xlsxwriter')
    df4.to_excel(writer, sheet_name='Sheet 1')
    writer.save()
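The error itself points at the fix: this version of pandas cannot write a frame whose index is a MultiIndex to Excel, and the pivot table built above has a ten-level index. A minimal sketch of the usual workaround, assuming the MultiIndex comes from that pivot, is to flatten it with reset_index() before writing each row to its own file:

# Sketch only, not the asker's exact code: flatten the pivot's
# MultiIndex into ordinary columns, then write one file per row.
df2_flat = df2.reset_index()
for n in range(0, len(df2_flat)):
    writer = pd.ExcelWriter('path/Final%s.xlsx' % n, engine='xlsxwriter')
    df2_flat.iloc[[n]].to_excel(writer, sheet_name='Sheet 1', index=False)
    writer.save()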

NLTK applied to dataframes: how to iterate through a list

Apologies in advance as this is my first question. I am using nltk to tokenize a series of tweets from a csv that I have loaded into a df. The tokenization works fine and outputs something like this [[My, uncle, ...]] into a cell in a df. I want to then apply a POS tagger to the tokenized text for the whole column of the df. I use the code below to do it. The line I am having difficulty with is df['tagged'] = df['tokenized'].apply(lambda row: [nltk.pos_tag(row) for item in row]). I know that I am iterating on the wrong element (row versus item) but can't figure out the correct way to do it. The code is below:
import pandas as pd
import numpy as np
import nltk
from nltk.tokenize import word_tokenize,wordpunct_tokenize
from nltk.tag import pos_tag
read_test = pd.read_csv("simontwittertest.csv")
df = read_test
df['tokenized'] = df['content'].apply(lambda row: [nltk.wordpunct_tokenize(row) for item in row])
df['tagged'] = df['tokenized'].apply(lambda row: [nltk.pos_tag(row) for item in row])
print(df['tagged'])
Out of interest, I found a small bug where pos_tag only works with NLTK 3.1, not NLTK 3.2 (at least with Python 2.7).
Many thanks
The inner list comprehensions iterate over row but never use item, so each function runs once per element of row and the results get duplicated. Series.apply already passes one cell at a time, so you can hand the functions over directly:
df['tokenized'] = df['content'].apply(nltk.wordpunct_tokenize)
df['tagged'] = df['tokenized'].apply(nltk.pos_tag)
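A short sketch on toy data (assuming the averaged_perceptron_tagger NLTK data is already downloaded):

import pandas as pd
import nltk

df = pd.DataFrame({'content': ['My uncle loves trains', 'Python is fun']})
df['tokenized'] = df['content'].apply(nltk.wordpunct_tokenize)  # list of tokens per cell
df['tagged'] = df['tokenized'].apply(nltk.pos_tag)  # list of (token, tag) tuples per cell
print(df['tagged'])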

Freeze header in pandas dataframe

Is there a way to freeze the Pandas DataFrame header, as we do in Excel? Then, if it's a long dataframe with many rows, we can still see the headers once we scroll down. I am assuming an IPython notebook.
This function may do the trick:
from ipywidgets import interact, IntSlider
from IPython.display import display
def freeze_header(df, num_rows=30, num_columns=10, step_rows=1,
                  step_columns=1):
    """
    Freeze the headers (column and index names) of a Pandas DataFrame.
    A widget enables sliding through the rows and columns.

    Parameters
    ----------
    df : Pandas DataFrame
        DataFrame to display
    num_rows : int, optional
        Number of rows to display
    num_columns : int, optional
        Number of columns to display
    step_rows : int, optional
        Step in the rows
    step_columns : int, optional
        Step in the columns

    Returns
    -------
    Displays the DataFrame with the widget
    """
    @interact(last_row=IntSlider(min=min(num_rows, df.shape[0]),
                                 max=df.shape[0],
                                 step=step_rows,
                                 description='rows',
                                 readout=False,
                                 disabled=False,
                                 continuous_update=True,
                                 orientation='horizontal',
                                 slider_color='purple'),
              last_column=IntSlider(min=min(num_columns, df.shape[1]),
                                    max=df.shape[1],
                                    step=step_columns,
                                    description='columns',
                                    readout=False,
                                    disabled=False,
                                    continuous_update=True,
                                    orientation='horizontal',
                                    slider_color='purple'))
    def _freeze_header(last_row, last_column):
        display(df.iloc[max(0, last_row - num_rows):last_row,
                        max(0, last_column - num_columns):last_column])
Test it with:
import pandas as pd
df = pd.DataFrame(pd.np.random.RandomState(seed=0).randint(low=0,
                                                           high=100,
                                                           size=[200, 50]))
freeze_header(df=df, num_rows=10)
It results in a scrollable view with the headers frozen (in the original screenshot, the colors were customized in the ~/.jupyter/custom/custom.css file).
Old question but wanted to revisit it because I recently found a solution. Use the qgrid module: https://github.com/quantopian/qgrid
This will not only allow you to scroll with the headers frozen but also sort, filter, edit inline and some other stuff. Very helpful.
Try pandas' Sticky Headers:
import pandas as pd
import numpy as np
bigdf = pd.DataFrame(np.random.randn(16, 100))
bigdf.style.set_sticky(axis="index")
(this feature was introduced recently; I found it working on pandas 1.3.1, but not on 1.2.4)
A solution that would work in any editor is to select which rows you want to look at:
df.ix[100:110] # shows rows 101 to 110, keeping the header on top
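Note that .ix has since been removed from pandas; on modern versions the positional equivalent is iloc:

df.iloc[100:110]  # rows at positions 100-109, with the header always printed on top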

Python - Create An Empty Pandas DataFrame and Populate From Another DataFrame Using a For Loop

Using: Python 2.7 and Pandas 0.11.0 on Mac OSX Lion
I'm trying to create an empty DataFrame and then populate it from another dataframe, based on a for loop.
I have found that when I construct the DataFrame and then use the for loop as follows:
data = pd.DataFrame()
for item in cols_to_keep:
    if item not in dummies:
        data = data.join(df[item])
This results in an empty DataFrame, but with the headers of the appropriate columns from the other DataFrame.
That's because you are using join incorrectly: join aligns on the index, and an empty DataFrame has an empty index, so no rows survive the join.
You can use a list comprehension to restrict the DataFrame to the columns you want:
df[[col for col in cols_to_keep if col not in dummies]]
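To see why the join approach produces only headers, here is a small sketch with toy data (the column names are illustrative): join performs a left join on the index, and an empty frame has no index labels to align on.

import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

# Left join onto an empty frame: the column survives, the rows do not.
print(pd.DataFrame().join(df['a']))
# Empty DataFrame
# Columns: [a]
# Index: []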
What about just creating a new frame based off of the columns you know you want to keep, instead of creating an empty one first?
import pandas as pd
import numpy as np
df = pd.DataFrame({'a': np.random.randn(5),
                   'b': np.random.randn(5),
                   'c': np.random.randn(5),
                   'd': np.random.randn(5)})
cols_to_keep = ['a', 'c', 'd']
dummies = ['d']
not_dummies = [x for x in cols_to_keep if x not in dummies]
data = df[not_dummies]
data
a c
0 2.288460 0.698057
1 0.097110 -0.110896
2 1.075598 -0.632659
3 -0.120013 -2.185709
4 -0.099343 1.627839