How to send a pandas dataframe to a hive table?
I know that if I have a Spark DataFrame, I can register it as a temporary table using
df.registerTempTable("table_name")
sqlContext.sql("create table table_name2 as select * from table_name")
but when I try to call registerTempTable on a pandas DataFrame, I get the error below:
AttributeError: 'DataFrame' object has no attribute 'registerTempTable'
Is there a way for me to use a pandas DataFrame to register a temp table, or to convert it to a Spark DataFrame and then use that to register a temp table, so that I can send it back to Hive?
I guess you are trying to use a pandas DataFrame instead of a Spark DataFrame.
A pandas DataFrame has no registerTempTable method.
You can create a Spark DataFrame from the pandas DataFrame instead.
UPDATE:
I've tested this on Cloudera (with the Anaconda parcel installed, which includes the pandas module).
Make sure that PYSPARK_PYTHON points to your Anaconda Python installation (or another one containing the pandas module) on all your Spark workers (usually in spark-conf/spark-env.sh).
Here is the result of my test:
>>> import pandas as pd
>>> import numpy as np
>>> df = pd.DataFrame(np.random.randint(0,100,size=(10, 3)), columns=list('ABC'))
>>> sdf = sqlContext.createDataFrame(df)
>>> sdf.show()
+---+---+---+
| A| B| C|
+---+---+---+
| 98| 33| 75|
| 91| 57| 80|
| 20| 87| 85|
| 20| 61| 37|
| 96| 64| 60|
| 79| 45| 82|
| 82| 16| 22|
| 77| 34| 65|
| 74| 18| 17|
| 71| 57| 60|
+---+---+---+
>>> sdf.printSchema()
root
 |-- A: long (nullable = true)
 |-- B: long (nullable = true)
 |-- C: long (nullable = true)
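From here, the Spark DataFrame can be registered and persisted to Hive just as in the question (a minimal sketch; this assumes sqlContext is Hive-enabled, i.e. a HiveContext):
>>> sdf.registerTempTable("table_name")
>>> sqlContext.sql("create table table_name2 as select * from table_name")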
First you need to convert the pandas DataFrame to a Spark DataFrame:
from pyspark.sql import HiveContext
hive_context = HiveContext(sc)
df = hive_context.createDataFrame(pd_df)
Then you can create a temp table, which lives in memory:
df.registerTempTable('tmp')
Now you can use Hive QL to save the data into Hive:
hive_context.sql("""insert overwrite table target partition(p='p') select a, b from tmp""")
Note that the same hive_context must be used throughout!
I converted my pandas DataFrame to a temp table by:
1) Converting the pandas DataFrame to a Spark DataFrame:
spark_df = sqlContext.createDataFrame(Pandas_df)
2) Making sure that the data was migrated properly:
spark_df.select("*").show()
3) Converting the Spark DataFrame to a temp table for querying:
spark_df.registerTempTable("table_name")
Cheers.
By following the other answers here, I was able to convert a pandas DataFrame to a permanent Hive table as follows:
# sc is an existing SparkContext (Hive support comes from the HiveContext below)
from pyspark.sql import HiveContext
hc = HiveContext(sc)
# df is my pandas dataframe
hc.createDataFrame(df).registerTempTable('tmp')
# sch is the hive schema, and tabname is my new hive table name
hc.sql("create table sch.tabname as select * from tmp")
I need help: I want to filter some data from one dataframe using another dataframe as the criterion, but I don't want to use SQL commands.
df1
id ; create ; change ; name
1 ;2020-12-01;;Paul
2 ;2020-12-02;;Mary
3 ;2020-12-03;;David
4 ;2020-12-04;;Marley
df2
id ; create ; change ; name
1 ;2020-12-01;2020-12-30;Paul
2 ;2020-12-02;;Mary
3 ;2020-12-03;;David
4 ;2020-12-04;2020-12-30;Marley
5 ;2020-12-30;;Ted
df3
I want to create the df3 dataframe with the following rule: rows from df2 whose change column is pre-filled with the date 2020-12-30 and whose id also exists in df1 should not be inserted into df3.
id ; create ; change ; name
2 ;2020-12-02;;Mary
3 ;2020-12-03;;David
You can first do a semi-join of df2 with df1, and then filter the change column.
df3 = df2.join(df1, ['id', 'create', 'name'], 'semi') \
    .filter("change is null or change != '2020-12-30'") \
    .select('id', 'create', 'change', 'name')
df3.show()
+---+----------+------+-----+
| id| create|change| name|
+---+----------+------+-----+
| 2|2020-12-02| null| Mary|
| 3|2020-12-03| null|David|
+---+----------+------+-----+
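For completeness, a small, hypothetical setup to reproduce the snippet above (values taken from the question; dates are kept as plain strings and a SparkSession named spark is assumed):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

schema = 'id int, create string, change string, name string'

df1 = spark.createDataFrame(
    [(1, '2020-12-01', None, 'Paul'),
     (2, '2020-12-02', None, 'Mary'),
     (3, '2020-12-03', None, 'David'),
     (4, '2020-12-04', None, 'Marley')], schema)

df2 = spark.createDataFrame(
    [(1, '2020-12-01', '2020-12-30', 'Paul'),
     (2, '2020-12-02', None, 'Mary'),
     (3, '2020-12-03', None, 'David'),
     (4, '2020-12-04', '2020-12-30', 'Marley'),
     (5, '2020-12-30', None, 'Ted')], schema)
If 'semi' is not accepted as a join type on your Spark version, 'left_semi' should work.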
I have a fixed Spark DataFrame column order coming from the target table:
Target Spark DataFrame: (col1 string, col2 int, col3 string, col4 double)
Now, if the source data comes in a jumbled order:
Source Spark DataFrame: (col3 string, col2 int, col4 double, col1 string)
How can I rearrange the source DataFrame to match the column order of the target DataFrame using PySpark?
The source Spark DataFrame should be reordered like below to match the target DataFrame:
Output:
Updated Source Spark DataFrame: (col1 string, col2 int, col3 string, col4 double)
Scenario 2:
Source Dataframe =[a,c,d,e]
Target dataframe =[a,b,c,d]
In this scenario, the source DataFrame should be rearranged to [a,b,c,d,e]
Keep the order of the target columns
Change the datatypes of the source column to match the target dataframe
Add the new columns at the end
If the target column is not present in the source columns, then the column should still be added but filled with null values.
In the above example, after the source DataFrame is rearranged, it would have a b column added with null values.
This will ensure that when we use saveAsTable, the source DataFrame can easily be pushed into the table without breaking the existing table.
Suppose you had the following two DataFrames:
source.show()
#+---+---+---+---+
#| a| c| d| e|
#+---+---+---+---+
#| A| C| 0| E|
#+---+---+---+---+
target.show()
#+---+---+---+---+
#| a| b| c| d|
#+---+---+---+---+
#| A| B| C| 1|
#+---+---+---+---+
With the following data types:
print(source.dtypes)
#[('a', 'string'), ('c', 'string'), ('d', 'string'), ('e', 'string')]
print(target.dtypes)
#[('a', 'string'), ('b', 'string'), ('c', 'string'), ('d', 'int')]
If I understand your logic correctly, the following list comprehension should work for you:
from pyspark.sql.functions import col, lit
new_source = source.select(
    *(
        [
            col(t).cast(d) if t in source.columns else lit(None).alias(t)
            for t, d in target.dtypes
        ] +
        [s for s in source.columns if s not in target.columns]
    )
)
new_source.show()
#+---+----+---+---+---+
#| a| b| c| d| e|
#+---+----+---+---+---+
#| A|null| C| 0| E|
#+---+----+---+---+---+
And the resulting output will have the following schema:
new_source.printSchema()
#root
# |-- a: string (nullable = true)
# |-- b: null (nullable = true)
# |-- c: string (nullable = true)
# |-- d: integer (nullable = true)
# |-- e: string (nullable = true)
As you can see, column d's datatype changed from string to integer to match the target table's schema.
The logic is to first loop over the columns in target and select them if they exist in source.columns or create a column of nulls if it doesn't exist. Then add in the columns from source that don't exist in target.
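One small variation (not part of the original answer): if the null type of the placeholder column b causes problems when writing with saveAsTable, the placeholder can also be cast to the target column's type:
from pyspark.sql.functions import col, lit

# same comprehension, but missing columns get the target dtype instead of null type
new_source = source.select(
    *(
        [
            col(t).cast(d) if t in source.columns else lit(None).cast(d).alias(t)
            for t, d in target.dtypes
        ] +
        [s for s in source.columns if s not in target.columns]
    )
)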
I have following string value obtained from a pandas dataframe.
u'1:19 AM Eastern, Tuesday, May 16, 2017'
How do I convert it to a datetime.datetime(2017,5,16) object?
Thx.
You need to create a custom date parser; to give you some ideas, here's a reproducible example:
import pandas as pd
import datetime
from io import StringIO  # Python 3; on Python 2 use: from StringIO import StringIO
st = u'01:19 AM Eastern, Tuesday, May 16, 2017'
def parse_date(date_string):
    # keep only the "May 16, 2017" part and parse it
    date_string = ",".join(date_string.split(',')[-2:]).strip()
    return datetime.datetime.strptime(date_string, '%B %d, %Y')

df = pd.read_csv(StringIO(st), header=None, sep="|", date_parser=parse_date, parse_dates=[0])
If you print the DataFrame content as follows:
print("dataframe content")
print(df)
you will get this output:
dataframe content
0
0 2017-05-16
checking the dtypes confirms that the column is now of type datetime:
print("dataframe types")
print(df.dtypes)
output:
dataframe types
0 datetime64[ns]
dtype: object
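If you only have the single string (rather than a column read from a file), a minimal sketch of a direct parse, without going through read_csv, could look like this:
import datetime

s = u'1:19 AM Eastern, Tuesday, May 16, 2017'

# drop the time/timezone/weekday parts, keep "May 16, 2017", and parse it
date_part = ",".join(s.split(',')[-2:]).strip()
dt = datetime.datetime.strptime(date_part, '%B %d, %Y')
print(dt)  # 2017-05-16 00:00:00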
I'm using Spark SQL on pyspark to store some PostgreSQL tables in DataFrames and then build a query that generates several time series based on the start and stop columns, which are of type date.
Suppose that my_table contains:
start | stop
-------------------------
2000-01-01 | 2000-01-05
2012-03-20 | 2012-03-23
In PostgreSQL it's very easy to do that:
SELECT generate_series(start, stop, '1 day'::interval)::date AS dt FROM my_table
and it will generate this table:
dt
------------
2000-01-01
2000-01-02
2000-01-03
2000-01-04
2000-01-05
2012-03-20
2012-03-21
2012-03-22
2012-03-23
But how can I do that using plain Spark SQL? Is it necessary to use UDFs or some DataFrame methods?
EDIT
This creates a dataframe with one row containing an array of consecutive dates:
from pyspark.sql.functions import sequence, to_date, explode, col
spark.sql("SELECT sequence(to_date('2018-01-01'), to_date('2018-03-01'), interval 1 month) as date")
+------------------------------------------+
| date |
+------------------------------------------+
| ["2018-01-01","2018-02-01","2018-03-01"] |
+------------------------------------------+
You can use the explode function to "pivot" this array into rows:
spark.sql("SELECT sequence(to_date('2018-01-01'), to_date('2018-03-01'), interval 1 month) as date").withColumn("date", explode(col("date"))
+------------+
| date |
+------------+
| 2018-01-01 |
| 2018-02-01 |
| 2018-03-01 |
+------------+
(End of edit)
Spark v2.4 supports the sequence function:
sequence(start, stop, step) - Generates an array of elements from start to stop (inclusive), incrementing by step. The type of the returned elements is the same as the type of argument expressions.
Supported types are: byte, short, integer, long, date, timestamp.
Examples:
SELECT sequence(1, 5);
[1,2,3,4,5]
SELECT sequence(5, 1);
[5,4,3,2,1]
SELECT sequence(to_date('2018-01-01'), to_date('2018-03-01'), interval 1 month);
[2018-01-01,2018-02-01,2018-03-01]
https://docs.databricks.com/spark/latest/spark-sql/language-manual/functions.html#sequence
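Applied to the my_table example from the question (assuming it has been registered as a temp view called my_table), a minimal sketch on Spark 2.4+ would be:
# one output row per day between each row's start and stop dates (inclusive)
spark.sql("""
    SELECT explode(sequence(start, stop, interval 1 day)) AS dt
    FROM my_table
""").show()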
The existing answers will work, but are very inefficient. Instead, it is better to use range and then cast the data. In Python:
from pyspark.sql.functions import col
from pyspark.sql import SparkSession
def generate_series(start, stop, interval):
    """
    :param start: lower bound, inclusive
    :param stop: upper bound, exclusive
    :param interval: increment interval in seconds
    """
    spark = SparkSession.builder.getOrCreate()
    # Determine start and stop in epoch seconds
    start, stop = spark.createDataFrame(
        [(start, stop)], ("start", "stop")
    ).select(
        [col(c).cast("timestamp").cast("long") for c in ("start", "stop")]
    ).first()
    # Create range with increments and cast to timestamp
    return spark.range(start, stop, interval).select(
        col("id").cast("timestamp").alias("value")
    )
Example usage:
generate_series("2000-01-01", "2000-01-05", 60 * 60).show(5) # By hour
+-------------------+
| value|
+-------------------+
|2000-01-01 00:00:00|
|2000-01-01 01:00:00|
|2000-01-01 02:00:00|
|2000-01-01 03:00:00|
|2000-01-01 04:00:00|
+-------------------+
only showing top 5 rows
generate_series("2000-01-01", "2000-01-05", 60 * 60 * 24).show() # By day
+-------------------+
| value|
+-------------------+
|2000-01-01 00:00:00|
|2000-01-02 00:00:00|
|2000-01-03 00:00:00|
|2000-01-04 00:00:00|
+-------------------+
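If a date column is preferred over timestamps, the output of generate_series above can simply be cast (a small, optional addition):
from pyspark.sql.functions import col

# same day-level series, but as DateType instead of TimestampType
daily = generate_series("2000-01-01", "2000-01-05", 60 * 60 * 24) \
    .select(col("value").cast("date").alias("dt"))
daily.show()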
Rakesh's answer is correct, but I would like to share a less verbose solution:
import datetime

from pyspark.sql.types import ArrayType, DateType

# UDF
def generate_date_series(start, stop):
    return [start + datetime.timedelta(days=x) for x in range(0, (stop - start).days + 1)]

# Register UDF for later usage
spark.udf.register("generate_date_series", generate_date_series, ArrayType(DateType()))

# mydf is a DataFrame with columns `start` and `stop` of type DateType()
mydf.createOrReplaceTempView("mydf")

spark.sql("SELECT explode(generate_date_series(start, stop)) FROM mydf").show()
Suppose you have a DataFrame df with start and stop columns from Spark SQL; try this:
import datetime

from pyspark.sql import functions as F
from pyspark.sql import types as T

# build a list of `total` consecutive dates, starting at `start`
# (plain datetime arithmetic is used here; Column functions such as
# F.date_add cannot be evaluated inside a Python UDF)
def timeseriesDF(start, total):
    series = [start]
    for i in range(total - 1):
        series.append(series[-1] + datetime.timedelta(days=1))
    return series

df.withColumn(
    "t_series",
    F.udf(timeseriesDF, T.ArrayType(T.DateType()))(
        df.start, F.datediff(df.stop, df.start) + 1
    )
).select(F.explode("t_series")).show()
Building off of user10938362's answer, here is a way to use range without a UDF, provided you are trying to build a DataFrame of dates from some ingested dataset rather than from a hardcoded start/stop.
# start date is the min date in the ingested data (already in epoch seconds)
date_min = int(df.agg({'date': 'min'}).first()[0])

# end date is the current date (alternatively, use max the same way as above)
date_max = (
    spark.sql('select unix_timestamp(current_timestamp()) as date_max')
    .collect()[0]['date_max']
)

# range works on integers; unix time is in seconds, so 60*60*24 = one day
df = spark.range(date_min, date_max, 60*60*24).select('id')
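To turn the generated epoch seconds back into calendar dates, the id column can be cast, following the same pattern as the answer above; a small sketch:
from pyspark.sql.functions import col

# same range as above, converted from epoch seconds to actual dates
dates_df = (
    spark.range(date_min, date_max, 60 * 60 * 24)
    .select(col('id').cast('timestamp').cast('date').alias('date'))
)
dates_df.show(5)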
I have a pandas dataframe with timestamps as index:
I would like to convert it to a dataframe with daily values, but without having to resample the original dataframe (no summing or averaging of the hourly data). Ideally I would like to get the 24 hourly values in a vector for each day, for example:
Is there a method to do this quickly?
Thanks!
IIUC you can groupby on the date attribute of your index and then apply a lambda that aggregates the values into a list:
In [21]:
# generate some data
import datetime as dt
import numpy as np
import pandas as pd

df = pd.DataFrame({'GFS_rad': np.random.randn(100), 'GFS_tmp': np.random.randn(100)},
                  index=pd.date_range(dt.datetime(2016, 1, 1), freq='1h', periods=100))
df.groupby(df.index.date)[['GFS_rad', 'GFS_tmp']].agg(lambda x: list(x))
Out[21]:
GFS_rad \
2016-01-01 [-0.324115177542, 1.59297335764, 0.58118555943...
2016-01-02 [-0.0547016526463, -1.10093451797, -1.55790161...
2016-01-03 [-0.34751220092, 1.06246918632, 0.181218794826...
2016-01-04 [0.950977469848, 0.422905080529, 1.98339145764...
2016-01-05 [-0.405124861624, 0.141470757613, -0.191169333...
GFS_tmp
2016-01-01 [-2.36889710412, -0.557972678049, -1.293544410...
2016-01-02 [-0.125562429825, -0.018852674365, -0.96735945...
2016-01-03 [0.802961514703, -1.68049099535, -0.5116769061...
2016-01-04 [1.35789157665, 1.37583167965, 0.538638510171,...
2016-01-05 [-0.297611872638, 1.10546853812, -0.8726761667...
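As a usage note (assuming the grouped result above is assigned to a variable, say daily), each cell now holds that day's hourly values as a plain Python list and can be looked up by date:
import datetime as dt

daily = df.groupby(df.index.date)[['GFS_rad', 'GFS_tmp']].agg(lambda x: list(x))

# the 24 hourly GFS_rad values for 2016-01-02
rad_jan2 = daily.loc[dt.date(2016, 1, 2), 'GFS_rad']
print(len(rad_jan2))  # 24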