Getting an error while loading DataFrame data into an Impala table
DB = conn.cursor()
for row in fourth_set:
    SQL = ('''INSERT INTO Boots_retailer (sale_date, product, Assessment, weekno, store_Number, volume, turnover, turnover_missing, Inv_Cubic, XGB, KNN)
              VALUES (?,?,?,?,?,?,?,?,?,?,?)''')
    Values = (row['Sale_date'], row['product'], row['Assessment'], row['weekno'], row['store_number'],
              row['volume'], row['turnover'], row['turnover_missing'], row['Inv_Cubic'], row['XGB'], row['KNN'])
    har = DB.execute(SQL, Values)
conn.commit()
The error is raised on the line Values = row['Sale_date'], ...:
TypeError: string indices must be integers, not str
You're getting the TypeError because iterating over a DataFrame yields its column labels (strings), not its rows. Replace your second line with
for _, row in fourth_set.iterrows():
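Putting it together, a minimal sketch of the corrected loop (assuming the conn connection and fourth_set DataFrame from the question):
SQL = '''INSERT INTO Boots_retailer (sale_date, product, Assessment, weekno, store_Number,
         volume, turnover, turnover_missing, Inv_Cubic, XGB, KNN)
         VALUES (?,?,?,?,?,?,?,?,?,?,?)'''
DB = conn.cursor()
for _, row in fourth_set.iterrows():  # _ is the index label, row is a Series
    values = (row['Sale_date'], row['product'], row['Assessment'], row['weekno'],
              row['store_number'], row['volume'], row['turnover'],
              row['turnover_missing'], row['Inv_Cubic'], row['XGB'], row['KNN'])
    DB.execute(SQL, values)
conn.commit()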
I have two csv files, test1.csv and test2.csv, each containing rows of (time, altitude) values.
test1.csv is quite a bit larger than test2.csv.
I want to compare the altitudes at matching times.
I found this piece of code, which runs on Python 2:
import csv

with open('test1.csv', 'rb') as master:
    master_indices = dict((r[0], i) for i, r in enumerate(csv.reader(master)))

with open('test2.csv', 'rb') as hosts:
    with open('results.csv', 'wb') as results:
        reader = csv.reader(hosts)
        writer = csv.writer(results)
        writer.writerow(next(reader, []) + ['result'])
        for row in reader:
            index = master_indices.get(row[0])
            if index is not None:
                message = 'Same time is found (row {})'.format(index)
            else:
                message = 'No same time is found'
            writer.writerow(row + [message])
It works fine: it writes the row index from test1.csv where the same time was found.
The results csv contains the time and altitude from test2.csv, plus a message showing whether or not there is a match on the time value.
Since I'm quite new to Python, I'm trying to find a way to make results.csv also contain the altitude column from test1.csv.
I tried to replicate the above code for test1.csv in order to add that column, by appending the following code to the existing script:
with open('test1.csv', 'rb') as master:
    with open('results.csv', 'wb') as results:
        writer = csv.writer(results)
        reader2 = csv.reader(master)
        writer.writerow(next(reader2, []) + ['altitude'])
        for row in reader2:
            writer.writerow(row)
But I got a csv file without the previous result column, and a new but empty altitude column.
So eventually results.csv should contain the following columns:
time, altitude (from test2.csv), altitude (from test1.csv), result
How can this be achieved?
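One approach, sketched below under the assumption that time is the first column in both files: open results.csv only once, and store the altitude from test1.csv in the lookup dict alongside the row index, so both values can be written in the same pass. (Opening results.csv a second time with 'wb', as above, truncates the file, which is why the result column disappeared.)
import csv

# Map each time to (row index, altitude) from test1.csv.
with open('test1.csv', 'rb') as master:
    master_rows = dict((r[0], (i, r[1])) for i, r in enumerate(csv.reader(master)))

with open('test2.csv', 'rb') as hosts:
    with open('results.csv', 'wb') as results:
        reader = csv.reader(hosts)
        writer = csv.writer(results)
        writer.writerow(next(reader, []) + ['altitude', 'result'])
        for row in reader:
            match = master_rows.get(row[0])
            if match is not None:
                index, altitude = match
                writer.writerow(row + [altitude, 'Same time is found (row {})'.format(index)])
            else:
                writer.writerow(row + ['', 'No same time is found'])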
I am getting an error. One DataFrame, df, is read from a JSON API, and a second, df2, is read from a csv. I want to compare one column of the csv against the API data and then save the matched values into a new csv. Can anyone help me?
import numpy as np
import pandas as pd
import requests

df2 = pd.read_csv(file_path)
r = requests.get('https://data.ct.gov/resource/6tja-6vdt.json')
df = pd.DataFrame(r.json())
df['verified'] = np.where(df['salespersoncredential'] == df2['salespersoncredential'], 'True', 'False')
print(df)
Probably just change
df['verified'] = np.where(df['salespersoncredential'] == df2['salespersoncredential'], 'True', 'False')
to
df['verified'] = df['salespersoncredential'] == df2['salespersoncredential']
assuming the dtypes are correct.
If the indexes are different on the two dataframes, you might need to .reset_index().
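For illustration, a sketch of that index fix, assuming both frames have the same number of rows in matching order:
# Reset both indexes so the element-wise comparison aligns by position.
left = df['salespersoncredential'].reset_index(drop=True)
right = df2['salespersoncredential'].reset_index(drop=True)
df['verified'] = (left == right).values  # .values avoids realigning on assignment

# If df and df2 can have different lengths, a membership test is usually
# what is wanted instead:
# df['verified'] = df['salespersoncredential'].isin(df2['salespersoncredential'])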
I have multiple sheets in one excel file, like Sheet1, Sheet2, Sheet3, etc. I have to list one particular column from all of the sheets in a single csv file. Every sheet has a column named "Attribute", and only those values should be written to the csv file, line by line (the first sheet's 'Attribute' values on line 1, the second sheet's 'Attribute' values on line 2, and so on).
For instance:
Sheet1:
Attribute,Order
P,1
Emp_ID,2
DOJ,3
Name,4
Sheet2:
Attribute,Order
C,1
Emp_ID,2
Exp,3
LWD,4
Expected result (in some .csv file):
P,Emp_ID,DOJ,Name
C,Emp_ID,Exp,LWD
Note: The line starting with P should be on the first line, the line starting with C on the second, and so on.
Below is my code:
import pandas as pd
excel = r'E:\Python Utility\Inbound.xlsx'
K = r'E:\Python Utility\Headers_Files\All_Header.csv'
df = pd.read_excel(excel, sheet_name=None)
data = pd.DataFrame(df, columns=['Attribute']).T
print data
M = data.to_csv(K, encoding='utf-8', index=False, header=False)
print 'done'
The output shows as below:
Empty DataFrame Columns: [] Index: [Attribute] done
If I use sheet_name='Sheet1' then the DataFrame works fine and the data is loaded into the csv file as expected.
Thanks in advance
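A sketch of one way to do this, under the assumption that every sheet has an 'Attribute' column: with sheet_name=None, read_excel returns a dict mapping sheet name to DataFrame (passing that dict to pd.DataFrame is why the result above was empty), so iterate over the dict and write each sheet's 'Attribute' values as one csv row:
import csv
import pandas as pd

excel = r'E:\Python Utility\Inbound.xlsx'
K = r'E:\Python Utility\Headers_Files\All_Header.csv'

sheets = pd.read_excel(excel, sheet_name=None)  # dict: sheet name -> DataFrame
with open(K, 'wb') as f:  # on Python 3, use open(K, 'w', newline='')
    writer = csv.writer(f)
    for name, sheet in sheets.items():
        writer.writerow(sheet['Attribute'].tolist())  # one line per sheet
print 'done'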
I have a Glue DynamicFrame which contains error records. Please find the code below.
val rawDataFrame = glueContext.getCatalogSource(database = rawDBName, tableName = rawTBLName).getDynamicFrame();
println(s"RAW_DF-----count: ${rawDataFrame.count} errors: ${rawDataFrame.errorsCount}")
The above print statement prints as below.
RAW_DF-----count: 168456 errors: 4
I need to create a DynamicFrame which contains only the 168456 good records, eliminating the 4 error records. Kindly help.
Error records don't survive conversion to a Spark DataFrame, so transform your DynamicFrame to a DataFrame and back:
val noErrorsDyf = DynamicFrame(rawDataFrame.toDF(), glueContext)
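For reference, a sketch of the same round-trip in the Python Glue API; the glueContext, raw_db_name, and raw_tbl_name names here are assumptions standing in for your own job's equivalents:
from awsglue.dynamicframe import DynamicFrame

raw_dyf = glueContext.create_dynamic_frame.from_catalog(
    database=raw_db_name, table_name=raw_tbl_name)

# Converting to a Spark DataFrame drops the error records; wrap the result
# back into a DynamicFrame.
no_errors_dyf = DynamicFrame.fromDF(raw_dyf.toDF(), glueContext, "no_errors_dyf")
print("count: {} errors: {}".format(no_errors_dyf.count(), no_errors_dyf.errorsCount()))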
I have created an RDD from a csv file, and the first row of that csv file is the header line. Now I want to create a DataFrame from that RDD, taking the column names from the RDD's first element.
The problem: I am able to create the DataFrame with the columns from rdd.first(), but the created DataFrame has the header row as its first data row. How do I remove it?
lines = sc.textFile('/path/data.csv')
rdd = lines.map(lambda x: x.split('#####'))  # the separator can be multiple characters (#### or ###), so the csv can't be read directly into a DataFrame
# rdd: [[u'mailid', u'age', u'address'], [u'satya', u'23', u'Mumbai'], [u'abc', u'27', u'Goa']]  # the first element is the header
df = rdd.toDF(rdd.first())  # retaining the columns from rdd.first()
df.show()
# mailid age address
# mailid age address    <- I don't want this row as DataFrame data
# satya  23  Mumbai
# abc    27  Goa
How do I avoid that first element ending up in the DataFrame data? Is there any option I can pass to rdd.toDF(rdd.first()) to get that done?
Note: I can't collect the RDD into a list, remove the first item, parallelize the list back into an RDD, and then call toDF()...
Please suggest! Thanks
You will have to remove the header from your RDD. One way to do it is the following, using your rdd variable:
>>> header = rdd.first()
>>> header
# ['mailid', 'age', 'address']
>>> data = rdd.filter(lambda row : row != header).toDF(header)
>>> data.show()
# +------+---+-------+
# |mailid|age|address|
# +------+---+-------+
# | satya| 23| Mumbai|
# | abc| 27| Goa|
# +------+---+-------+
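Note that the filter drops every row that equals the header, not just the first one. If a data row could legitimately look like the header, a variant with zipWithIndex (a sketch, using the same rdd) removes only the physical first element:
header = rdd.first()
data = (rdd.zipWithIndex()                    # pair each element with its position
           .filter(lambda pair: pair[1] > 0)  # keep everything except position 0
           .map(lambda pair: pair[0])
           .toDF(header))
data.show()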