RDD to DataFrame in pyspark (columns from rdd's first element) - python-2.7

I have created an RDD from a CSV file whose first row is the header line. Now I want to create a DataFrame from that RDD and take the column names from the RDD's first element.
The problem is that I can create the DataFrame with the columns from rdd.first(), but the resulting DataFrame also contains the header as its first row. How can I remove it?
lines = sc.textFile('/path/data.csv')
rdd = lines.map(lambda x: x.split('#####'))  # the separator can be several characters (#### or ###), so I can't read the CSV directly into a DataFrame
# rdd: [[u'mailid', u'age', u'address'], [u'satya', u'23', u'Mumbai'], [u'abc', u'27', u'Goa']]  # first element is the header
df = rdd.toDF(rdd.first())  # retaining the columns from rdd.first()
df.show()
# mailid age address
# mailid age address   <-- I don't want this as DataFrame data
# satya 23 Mumbai
# abc 27 Goa
How can I keep that first element from ending up in the DataFrame data? Is there an option I can pass to rdd.toDF(rdd.first()) to get that done?
Note: I can't collect the RDD into a list, remove the first item from that list, parallelize the list back into an RDD and then call toDF().
Please suggest! Thanks

You will have to remove the header from your RDD. One way to do it, given your rdd variable, is the following (note that the filter drops every row equal to the header, not only the first one):
>>> header = rdd.first()
>>> header
# ['mailid', 'age', 'address']
>>> data = rdd.filter(lambda row : row != header).toDF(header)
>>> data.show()
# +------+---+-------+
# |mailid|age|address|
# +------+---+-------+
# | satya| 23| Mumbai|
# |   abc| 27|    Goa|
# +------+---+-------+
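
Since the question mentions that the separator may be a run of three, four or five `#` characters, a regular-expression split could stand in for `x.split('#####')`. The following is only a sketch in plain Python 3: `split_fields` and `SEPARATOR` are hypothetical names, the sample lines are made up from the question's data, and the header filter is shown on a list rather than on a real RDD.

```python
import re

# Assumption: fields are separated by a run of three or more '#' characters.
SEPARATOR = re.compile(r'#{3,}')

def split_fields(line):
    """Split one text line on a run of three or more '#' characters."""
    return SEPARATOR.split(line)

# Made-up sample lines that mix the separator lengths from the question.
lines = ['mailid###age#####address', 'satya####23###Mumbai', 'abc#####27####Goa']
rows = [split_fields(line) for line in lines]

header = rows[0]
# Same idea as rdd.filter(lambda row: row != header) in the answer above.
data = [row for row in rows if row != header]
```

With Spark, the same helper could presumably be passed straight to the map step, as in `rdd = lines.map(split_fields)`.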

Related

Read multiple excel sheets on specific column and write them in one csv file using python

I have multiple sheets in one Excel file, such as Sheet1, Sheet2, Sheet3, etc. I need to list the values of one particular column from every sheet in a single CSV file. All the sheets share a common column "Attribute", and only its values should be written to the CSV file, one line per sheet (the first sheet's 'Attribute' values on the 1st line, the second sheet's on the 2nd line, and so on).
For instance,
Sheet1:
Attribute,Order
P,1
Emp_ID,2
DOJ,3
Name,4
Sheet2:
Attribute,Order
C,1
Emp_ID,2
Exp,3
LWD,4
Expected result (in some .csv file):
P,Emp_ID,DOJ,Name
C,Emp_ID,Exp,LWD
Note: the line starting with P should be the first line, the line starting with C the second, and so on.
Below is my code:
import pandas as pd
excel = 'E:\Python Utility\Inbound.xlsx'
K = 'E:\Python Utility\Headers_Files\All_Header.csv'
df = pd.read_excel(excel,sheet_name = None)
data = pd.DataFrame(df,columns=['Attribute']).T
print data
M = data.to_csv(K, encoding='utf-8',index=False,header=False)
print 'done'
The output is shown below:
Empty DataFrame
Columns: []
Index: [Attribute]
done
If I use sheet_name = 'Sheet1', the DataFrame works and the data is loaded into the CSV file as expected.
Thanks in advance
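
With `sheet_name=None`, `pd.read_excel` returns a dictionary that maps each sheet name to its own DataFrame, which is why wrapping the result in a single `pd.DataFrame(...)` comes out empty. One possible approach, sketched here in Python 3, is to loop over the sheets and write each sheet's 'Attribute' values as one CSV row. The names `attribute_rows` and `write_attribute_rows` are hypothetical, and the sketch assumes the dictionary preserves the workbook's sheet order.

```python
import csv

def attribute_rows(sheets):
    """Return one list of 'Attribute' values per sheet.

    sheets maps a sheet name to a table with an 'Attribute' column,
    for example the dict returned by pd.read_excel(..., sheet_name=None).
    """
    return [list(table['Attribute']) for table in sheets.values()]

def write_attribute_rows(excel_path, csv_path):
    """Read every sheet of excel_path and write one CSV line per sheet."""
    import pandas as pd  # imported here so attribute_rows has no pandas dependency

    sheets = pd.read_excel(excel_path, sheet_name=None)  # dict: name -> DataFrame
    with open(csv_path, 'w', newline='') as f:
        csv.writer(f).writerows(attribute_rows(sheets))

# Stand-in data shaped like the two sheets in the question.
example = {
    'Sheet1': {'Attribute': ['P', 'Emp_ID', 'DOJ', 'Name']},
    'Sheet2': {'Attribute': ['C', 'Emp_ID', 'Exp', 'LWD']},
}
rows = attribute_rows(example)
```

Calling `write_attribute_rows(excel, K)` with the two paths from the question would then be expected to produce the two lines shown under "Expected result".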

use python to write to a specific column in a .csv file

I have a .csv file where I need to overwrite a certain column with new values from a list.
Let's say I have the list L1 = ['La', 'Lb', 'Lc'] that I want to write in column no. 5 of the .csv file.
If I run:
L1 = ['La', 'Lb', 'Lc']
import csv
with open(r'C:\LIST.csv','wb') as f:
    w = csv.writer(f)
    for i in L1:
        w.writerow(i)
This will write the L1 values to the first and second column.
First column will be 'L', 'L', 'L' and second column 'a', 'b', 'c'
I could not find the syntax to write each element of the list to a specific column (this is in Python 2.7). Thank you for your help!
(For this script I must use IronPython, and only the built-in libraries that come with IronPython.)
Although you could certainly use Python's built-in csv module to read the data, modify it, and write it out, I'd recommend the excellent tablib module:
from tablib import Dataset
csv = '''Col1,Col2,Col3,Col4,Col5,Col6,Col7
a1,b1,c1,d1,e1,f1,g1
a2,b2,c2,d2,e2,f2,g2
a3,b3,c3,d3,e3,f3,g3
'''
# Read a hard-coded string just for test purposes.
# In your code, you would use open('...', 'rt').read() to read from a file.
imported_data = Dataset().load(csv, format='csv')
L1 = ['La', 'Lb', 'Lc']
for i in range(len(L1)):
    # Each row is a tuple, and tuples don't support assignment.
    # Convert to a list first so we can modify it.
    row = list(imported_data[i])
    # Put our value in the 5th column (index 4).
    row[4] = L1[i]
    # Store the row back into the Dataset.
    imported_data[i] = row
# Export to CSV. (Of course, you could write this to a file instead.)
print imported_data.export('csv')
# Output:
# Col1,Col2,Col3,Col4,Col5,Col6,Col7
# a1,b1,c1,d1,La,f1,g1
# a2,b2,c2,d2,Lb,f2,g2
# a3,b3,c3,d3,Lc,f3,g3
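
Because the question is limited to built-in libraries, the same idea can also be sketched with the csv module alone. The original attempt spread 'La' over two columns because `writerow` treats a bare string as a sequence of characters; a row has to be a list of cell values. The sketch below is Python 3 and uses in-memory buffers so it runs as-is; `overwrite_column` is a hypothetical helper, and the sample data assumes a file without a header row.

```python
import csv
import io

def overwrite_column(rows, col_index, values):
    """Return a copy of rows with column col_index replaced by values, row by row."""
    new_rows = [list(row) for row in rows]
    for row, value in zip(new_rows, values):
        row[col_index] = value
    return new_rows

# Stand-in for the existing file; a real script would open the file instead.
source = io.StringIO('a1,b1,c1,d1,e1\na2,b2,c2,d2,e2\na3,b3,c3,d3,e3\n')
rows = list(csv.reader(source))

# Column no. 5 is index 4.
updated = overwrite_column(rows, 4, ['La', 'Lb', 'Lc'])

# Stand-in for the output file.
target = io.StringIO()
csv.writer(target, lineterminator='\n').writerows(updated)
```

On Python 2.7 or IronPython the files would be opened in 'rb' and 'wb' mode, as in the question, instead of using `io.StringIO`.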

parsing records with key value pairs in python

I have a file with millions of records like this
2017-07-24 18:34:23|CN:SSL|RESPONSETIME:23|BYTESIZE:1456|CLIENTIP:127.0.0.9|PROTOCOL:SSL-V1.2
Each record contains around 30 key-value pairs separated by the "|" delimiter. The position of each key-value pair is not constant.
I am trying to parse these records using Python dictionaries or lists.
Note: the 1st column is not in key-value format.
Your file is basically a |-separated CSV file that holds the timestamp first, followed by key-value fields whose two parts are separated by ":".
So you could use the csv module to read the cells, then pass the result of str.split to dict in a generator expression to build the dictionary from all elements but the first one.
Then update the dict with the timestamp:
import csv
list_of_dicts = []
with open("input.txt") as f:
    cr = csv.reader(f,delimiter="|")
    for row in cr:
        d = dict(v.split(":") for v in row[1:])
        d["date"] = row[0]
        list_of_dicts.append(d)
list_of_dicts contains dictionaries like
{'date': '2017-07-24 18:34:23', 'PROTOCOL': 'SSL-V1.2', 'RESPONSETIME': '23', 'CN': 'SSL', 'CLIENTIP': '127.0.0.9', 'BYTESIZE': '1456'}
Repeat the process below for every line in your file. I was not clear about the date-time value, so I have left it out of the input; you can include it as you see fit.
import re
given = "CN:SSL|RESPONSETIME:23|BYTESIZE:1456|CLIENTIP:127.0.0.9|PROTOCOL:SSL-V1.2"
results = dict()
list_for_this_line = re.split(r'\|', given)
for i in range(len(list_for_this_line)):
    separated_k_v = re.split(':', list_for_this_line[i])
    results[separated_k_v[0]] = separated_k_v[1]
print results
Hope this helps!
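
Both answers split each field on every ":", which would fail for a value that itself contains a colon. A possible variant, sketched in Python 3, splits at the first colon only and yields records lazily, which may suit a file with millions of lines. `parse_record` and `parse_file` are hypothetical names, and the 'date' key follows the first answer.

```python
def parse_record(line):
    """Parse 'timestamp|KEY:VALUE|KEY:VALUE...' into a dict."""
    fields = line.rstrip('\n').split('|')
    record = {'date': fields[0]}  # the first column is not a key-value pair
    for field in fields[1:]:
        # partition splits at the first ':' only, so values may contain colons.
        key, _, value = field.partition(':')
        record[key] = value
    return record

def parse_file(path):
    """Yield one dict per line without loading the whole file into memory."""
    with open(path) as f:
        for line in f:
            yield parse_record(line)

line = ('2017-07-24 18:34:23|CN:SSL|RESPONSETIME:23|BYTESIZE:1456'
        '|CLIENTIP:127.0.0.9|PROTOCOL:SSL-V1.2')
record = parse_record(line)
```

Since the key position is not constant, looking values up by name, as in `record['CLIENTIP']`, works regardless of where the pair sits in the line.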

data frame with pandas not outputting tabular

I have been working on extracting data from a large number of files. I want to form a table of the data, with the file base name as the left most column and the numerical data in the next. So far, I have been testing on a folder containing 8 files, but am hoping to be able to read hundreds.
I have tried adding an index, but that seemed to cause more problems. I am attaching the closest working code I have come up with, alongside the output.
In:
import re, glob
import pandas as pd
pattern = re.compile('-\d+\D\d+\skcal/mol', flags=re.S)
for file in glob.glob('*rank_*.pdb'):
    with open(file) as fp:
        for result in pattern.findall(fp.read()):
            Dock_energy = {file:[],result:[]}
            df = pd.DataFrame(Dock_energy)
            df.append(df)
            df = df.append(df)
            print(df)
This seems to work for extracting the data, but it is not in the form I am looking for.
Out:
Empty DataFrame
Columns: [-10.02 kcal/mol, MII_rank_8.pdb]
Index: []
Empty DataFrame
Columns: [-12.51 kcal/mol, MII_rank_5.pdb]
Index: []
Empty DataFrame
Columns: [-13.47 kcal/mol, MII_rank_4.pdb]
Index: []
Empty DataFrame
Columns: [-14.67 kcal/mol, MII_rank_2.pdb]
Index: []
Empty DataFrame
Columns: [-13.67 kcal/mol, MII_rank_3.pdb]
Index: []
Empty DataFrame
Columns: [-14.80 kcal/mol, MII_rank_1.pdb]
Index: []
Empty DataFrame
Columns: [-11.45 kcal/mol, MII_rank_7.pdb]
Index: []
Empty DataFrame
Columns: [-12.47 kcal/mol, MII_rank_6.pdb]
Index: []
What is causing the fractured table, and why are my columns in reverse order from what I am hoping? Any help is greatly appreciated.
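Each DataFrame comes out empty because file and result are used as dictionary keys, which makes them column names with empty lists as their data, and a new DataFrame is created and printed for every match. The reversed column order is most likely due to pandas sorting the dictionary keys. Collecting the rows in a list and building a single DataFrame at the end should be closer to what you intend: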
all_data = []
for file in glob.glob('*rank_*.pdb'):
    with open(file) as fp:
        file_data = []
        for result in pattern.findall(fp.read()):
            file_data.append([file, result])
        all_data.extend(file_data)
df = pd.DataFrame(all_data, columns=['file', 'result'])
print(df)
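
If the energies are needed as numbers rather than strings such as '-10.02 kcal/mol', a capturing group can pull out just the numeric part. This is a sketch in Python 3 under the assumption that every energy is written with a decimal point followed by 'kcal/mol'; `ENERGY`, `energy_rows` and the sample text are made up for illustration.

```python
import re

# Assumption: energies look like "-10.02 kcal/mol"; the group captures only the number.
ENERGY = re.compile(r'(-?\d+\.\d+)\s+kcal/mol')

def energy_rows(name, text):
    """Return one (name, energy) tuple per energy found in text."""
    return [(name, float(value)) for value in ENERGY.findall(text)]

# Made-up file content; real .pdb files will look different.
sample = 'REMARK energy -10.02 kcal/mol\nREMARK energy -12.51 kcal/mol\n'
rows = energy_rows('MII_rank_8.pdb', sample)
```

The resulting list can be extended once per file and passed to `pd.DataFrame(rows, columns=['file', 'energy'])` in the same way as `all_data` above.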

Print columns of Pandas dataframe to separate files + dataframe with datetime (min/sec)

I am trying to print a Pandas dataframe's columns to separate *.csv files in Python 2.7.
Using this code, I get a dataframe with 4 columns and an index of dates:
import datetime as dt
import pandas as pd
import numpy as np
rows = 10
col_headers = list('ABCD')
dates = pd.date_range(dt.datetime.today().strftime("%m/%d/%Y"),periods=rows)
df2 = pd.DataFrame(np.random.randn(rows, 4), index=dates, columns = col_headers)
df = df2.tz_localize('UTC') #this does not seem to be giving me hours/minutes/seconds
I then remove the index and set it to a separate column:
df['Date'] = df.index
col_headers.append('Date') #update the column keys
At this point, I just need to print all 5 columns of the dataframe to separate files. Here is what I have tried:
for ijk in range(0,len(col_headers)):
    df.to_csv('output' + str(ijk) + '.csv', columns = col_headers[ijk])
I get the following error message:
KeyError: "[['D', 'a', 't', 'e']] are not in ALL in the [columns]"
If I say:
for ijk in range(0,len(col_headers)-1):
then it works, but it does not print the 'Date' column. That is not what I want. I need to print the date column as well.
Questions:
How do I get it to print the 'Date' column to a *.csv file?
How do I get the time with hours, minutes and seconds? If the number of rows is changed from 10 to 5000, will the seconds change from one row of the dataframe to the next?
EDIT:
Answer for Q2: in the case of my particular code, this works:
dates = pd.date_range(dt.datetime.today().strftime("%m/%d/%Y %H:%M"),periods=rows)
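I don't quite understand your logic, but the KeyError comes from passing a bare string to columns: it expects a list of column names, so 'Date' is read as the four names 'D', 'a', 't', 'e' (the single-letter names happen to work). The following is a simpler way to do it:
for col in df:
    df[col].to_csv('output' + col + '.csv')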
example:
In [41]:
for col in df:
    print('output' + col + '.csv')
outputA.csv
outputB.csv
outputC.csv
outputD.csv
outputDate.csv
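
Both questions can be illustrated together in a short sketch (Python 3 with a current pandas). It assumes a one-second step between rows, which is only one way to get timestamps with hours, minutes and seconds; `write_columns`, the fixed start time and the tiny frame are made up for the example.

```python
import os
import tempfile

import pandas as pd

def write_columns(df, out_dir, prefix='output'):
    """Write each column of df to its own CSV file and return the paths."""
    paths = []
    for col in df.columns:
        path = os.path.join(out_dir, '{}{}.csv'.format(prefix, col))
        # df[[col]] is a one-column DataFrame, so a header row is written too.
        df[[col]].to_csv(path)
        paths.append(path)
    return paths

# Assumption: one row per second, so consecutive rows differ in their seconds.
dates = pd.date_range('2017-01-01 08:30:00', periods=3,
                      freq=pd.Timedelta(seconds=1), tz='UTC')
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]}, index=dates)
df['Date'] = df.index

out_dir = tempfile.mkdtemp()
paths = write_columns(df, out_dir)
```

Passing a one-element list, as in `df.to_csv(path, columns=[col])`, would be another way to avoid the KeyError from the question.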