Loading data from on-premises HDFS to local SparkR

I'm trying to load data from an on-premises HDFS cluster into RStudio with SparkR.
When I do this:
df_hadoop <- read.df(sqlContext, "hdfs://xxx.xx.xxx.xxx:xxxx/user/lam/lamr_2014_09.csv",
                     source = "com.databricks.spark.csv")
and then this:
str(df_hadoop)
I get this:
Formal class 'DataFrame' [package "SparkR"] with 2 slots
..@ env: <environment: 0x000000000xxxxxxx>
..@ sdf:Class 'jobj' <environment: 0x000000000xxxxxx>
This is not, however, the data frame I'm looking for: the CSV I'm trying to load from HDFS has 13 fields.
I have a schema with the 13 fields of the CSV, but where or how do I pass it to SparkR?

If you try the following:
df <- createDataFrame(sqlContext,
                      data.frame(a = c(1, 2, 3),
                                 b = c(2, 3, 4),
                                 c = c(3, 4, 5)))
str(df)
You likewise get:
Formal class 'DataFrame' [package "SparkR"] with 2 slots
..@ env:<environment: 0x139235d18>
..@ sdf:Class 'jobj' <environment: 0x139230e68>
str() shows you the representation of df, which is a pointer to a Spark DataFrame rather than a local data.frame. To inspect the DataFrame itself, just use
df
or
show(df)

Related

How to remove error records from a DynamicFrame in AWS Glue?

I have a DynamicFrame which contains error records. Please find the code below.
val rawDataFrame = glueContext.getCatalogSource(database = rawDBName, tableName = rawTBLName).getDynamicFrame();
println(s"RAW_DF-----count: ${rawDataFrame.count} errors: ${rawDataFrame.errorsCount}")
The above print statement prints the following:
RAW_DF-----count: 168456 errors: 4
I need to create a DynamicFrame that contains only the 168456 valid records and eliminates the 4 error records. Kindly help.
Error records do not survive the conversion to a Spark DataFrame, so try transforming your DynamicFrame to a DataFrame and back:
val noErrorsDyf = DynamicFrame(rawDataFrame.toDF(), glueContext)
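If the Glue job is written in Python rather than Scala, the same round trip looks roughly like the sketch below; this assumes a Glue Python job where glueContext already exists, and raw_db_name / raw_tbl_name are placeholder names for the catalog database and table.
from awsglue.dynamicframe import DynamicFrame

# Assumed to already exist in the Glue job: glueContext, raw_db_name, raw_tbl_name.
raw_dyf = glueContext.create_dynamic_frame.from_catalog(
    database=raw_db_name, table_name=raw_tbl_name)

# toDF() drops the error records while converting to a Spark DataFrame;
# fromDF() wraps the clean rows back into a DynamicFrame.
no_errors_dyf = DynamicFrame.fromDF(raw_dyf.toDF(), glueContext, "no_errors_dyf")
print("count: {} errors: {}".format(no_errors_dyf.count(), no_errors_dyf.errorsCount()))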

python2.7: dataframe date_format="%Y-%m-%d" doesn't work

I want the following dataframe saved as a CSV file, with the datetime index formatted as
"%Y-%m-%d".
date        price_am  price_pm
2017-06-01  E         E
2017-06-02  D         E
2017-06-03  C         D
I used the following code, and it doesn't work:
df.to_csv('move.csv', date_format='%Y-%m-%d')
But when I opened the saved CSV file, the datetime format was shown as follows:
date      price_am  price_pm
2017/6/1  E         E
2017/6/2  D         E
2017/6/3  C         D
How do I change the datetime index format in the CSV file? Thanks!
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 2557 entries, 2011-01-01 to 2017-12-31
Data columns (total 2 columns):
price_am 2527 non-null object
price_pm 2526 non-null object
dtypes: object(2)
memory usage: 59.9+ KB
It looks like you are verifying the data in Excel, which changes how the original dates are displayed.
Try a text editor like Notepad++ instead; the data in the CSV file are correct.
If you use the default options:
print (df.head().to_csv())
date,price_am,price_pm
2017-06-01,E,E
2017-06-02,D,E
2017-06-03,C,D
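As a quick self-check, here is a minimal sketch with hypothetical data shaped like the question's dataframe; it shows that to_csv with date_format='%Y-%m-%d' writes ISO-formatted dates into the file itself, and only a spreadsheet's display changes them:
import pandas as pd

# Hypothetical data mirroring the question's layout.
idx = pd.DatetimeIndex(['2017-06-01', '2017-06-02', '2017-06-03'], name='date')
df = pd.DataFrame({'price_am': ['E', 'D', 'C'],
                   'price_pm': ['E', 'E', 'D']}, index=idx)

df.to_csv('move.csv', date_format='%Y-%m-%d')

# Read the raw file back instead of opening it in Excel.
print(open('move.csv').read())
# date,price_am,price_pm
# 2017-06-01,E,E
# 2017-06-02,D,E
# 2017-06-03,C,D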

Not able to append dataframe

I am not able to append a dataframe to an already created dataframe.
Expecting output as below:
a b
0 1 1
1 2 4
2 11 121
3 12 144
import pandas

def FunAppend(*kwargs):
    if len(kwargs) == 0:
        Dict1 = {'a': [1, 2], 'b': [1, 4]}
        df = pandas.DataFrame(Dict1)
    else:
        Dict1 = {'a': [11, 12], 'b': [121, 144]}
        dfTemp = pandas.DataFrame(Dict1)
        df = df.append(dfTemp, ignore_index=True)
    return df

df = FunAppend()
df = FunAppend(df)
print df
print "Completed"
Any help would be appreciated.
You are modifying the global variable df inside the FunAppend function.
Python needs to be told this explicitly. Add the following inside the function:
global df
And it works as expected.
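Applied to the question's code, the global-variable fix would look roughly like this sketch (same function, with the global declaration added; the Python 2 print statements are kept as in the original):
import pandas

def FunAppend(*kwargs):
    global df
    if len(kwargs) == 0:
        Dict1 = {'a': [1, 2], 'b': [1, 4]}
        df = pandas.DataFrame(Dict1)
    else:
        Dict1 = {'a': [11, 12], 'b': [121, 144]}
        dfTemp = pandas.DataFrame(Dict1)
        # df now refers to the module-level dataframe created by the first call
        df = df.append(dfTemp, ignore_index=True)
    return df

df = FunAppend()
df = FunAppend(df)
print df  # four rows, as in the expected output above
print "Completed"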
df is not defined inside the function when you reach the else case. However, you did send the dataframe to the function via kwargs, so you can access it as kwargs[0].
Change this:
df=df.append(dfTemp,ignore_index=True)
To:
df=kwargs[0].append(dfTemp,ignore_index=True)
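With that change, each call works on the dataframe passed in rather than on a global. A compact sketch of the corrected function under that approach (the calling code and prints stay as in the question):
import pandas

def FunAppend(*kwargs):
    if len(kwargs) == 0:
        # first call: build the initial dataframe
        df = pandas.DataFrame({'a': [1, 2], 'b': [1, 4]})
    else:
        # later calls: append to the dataframe handed in as kwargs[0]
        dfTemp = pandas.DataFrame({'a': [11, 12], 'b': [121, 144]})
        df = kwargs[0].append(dfTemp, ignore_index=True)
    return df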

Get Mean of a column in data frame

I'm using pyspark 1.6 and python 2.7.
I have a dataframe, and I want to get the mean of a particular column after grouping by other columns.
data is my dataframe.
For that I'm doing the following:
data.registerTempTable('dataframe')
query = 'select mean(Weight) as Weight, b, s from dataframe group by b, s'
df = sqlContext.sql(query)
Is there a better way of achieving this result?
The sample data looks like this:
s     b     Weight
7801  d9b4  0.12911255
7801  6b11  0.128151033
7801  dd1f  0.12791147
7801  c802  0.134295454
7801  1294  0.128722551
7801  4203  0.134276383
7801  accc  0.134290742
7801  aab9  0.129347649
7801  4546  0.126628807
It's pretty trivial to get a mean after grouping; see the pyspark documentation. Try something like the code below, though I believe the SQL you defined in the question should also be close to working as is.
data.groupBy('b', 's').agg({'Weight': 'mean'})
>>> # [Row(b=u'6b11', s=u'7801', avg(Weight)=0.128151033), ...]
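If you also want the aggregated column to come back named Weight, as in the SQL version, here is a hedged sketch using pyspark.sql.functions, assuming the dataframe is called data with columns s, b and Weight as in the question:
from pyspark.sql import functions as F

# mean() plus alias() keeps the column name used in the SQL query
result = data.groupBy('b', 's').agg(F.mean('Weight').alias('Weight'))
result.show()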

How to get this output from Pig Latin in MapReduce

I want to get the following output from Pig Latin / Hadoop
((39,50,60,42,15,Bachelor,Male),5)
((40,35,HS-grad,Male),2)
((39,45,15,30,12,7,HS-grad,Female),6)
from the following data sample:
[image: data sample for adult data]
I have written the following Pig Latin script:
sensitive = LOAD '/mdsba/sample2.csv' using PigStorage(',') as (AGE,EDU,SEX,SALARY);
BV= group sensitive by (EDU,SEX) ;
BVA= foreach BV generate group as EDU, COUNT (sensitive) as dd:long;
Dump BVA ;
Unfortunately, the results come out like this
((Bachelor,Male),5)
((HS-grad,Male),2)
Then try to project the AGE data too.
Something like this:
BVA = foreach BV generate
          sensitive.AGE as AGE,
          FLATTEN(group) as (EDU,SEX),
          COUNT(sensitive) as dd:long;
Another suggestion is to specify the datatype when you load the data.
sensitive = LOAD '/mdsba/sample2.csv' using PigStorage(',') as (AGE:int,EDU:chararray,SEX:chararray,SALARY:chararray);