SparkSQL on pyspark: how to generate time series? - python-2.7

I'm using SparkSQL on pyspark to store some PostgreSQL tables into DataFrames and then build a query that generates several time series based on start and stop columns of type date.
Suppose that my_table contains:
start      | stop
-----------+-----------
2000-01-01 | 2000-01-05
2012-03-20 | 2012-03-23
In PostgreSQL it's very easy to do that:
SELECT generate_series(start, stop, '1 day'::interval)::date AS dt FROM my_table
and it will generate this table:
dt
------------
2000-01-01
2000-01-02
2000-01-03
2000-01-04
2000-01-05
2012-03-20
2012-03-21
2012-03-22
2012-03-23
but how can I do that using plain SparkSQL? Will it be necessary to use UDFs or some DataFrame methods?

EDIT
This creates a dataframe with one row containing an array of consecutive dates:
from pyspark.sql.functions import sequence, to_date, explode, col
spark.sql("SELECT sequence(to_date('2018-01-01'), to_date('2018-03-01'), interval 1 month) as date")
+------------------------------------------+
| date |
+------------------------------------------+
| ["2018-01-01","2018-02-01","2018-03-01"] |
+------------------------------------------+
You can use the explode function to "pivot" this array into rows:
spark.sql("SELECT sequence(to_date('2018-01-01'), to_date('2018-03-01'), interval 1 month) as date").withColumn("date", explode(col("date"))
+------------+
| date |
+------------+
| 2018-01-01 |
| 2018-02-01 |
| 2018-03-01 |
+------------+
(End of edit)
Spark v2.4 supports the sequence function:
sequence(start, stop, step) - Generates an array of elements from start to stop (inclusive), incrementing by step. The type of the returned elements is the same as the type of argument expressions.
Supported types are: byte, short, integer, long, date, timestamp.
Examples:
SELECT sequence(1, 5);
[1,2,3,4,5]
SELECT sequence(5, 1);
[5,4,3,2,1]
SELECT sequence(to_date('2018-01-01'), to_date('2018-03-01'), interval 1 month);
[2018-01-01,2018-02-01,2018-03-01]
https://docs.databricks.com/spark/latest/spark-sql/language-manual/functions.html#sequence
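Applying the same idea to the original my_table is straightforward. The following is a minimal sketch of my own (assuming Spark 2.4+ and that start and stop are already DateType columns), not part of the quoted answer:
from pyspark.sql.functions import expr, explode

# my_table has DateType columns `start` and `stop`
series_df = (
    my_table
    .withColumn("dt", explode(expr("sequence(start, stop, interval 1 day)")))
    .select("dt")
)
series_df.show()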

The existing answers will work, but are very inefficient. Instead, it is better to use range and then cast the data. In Python:
from pyspark.sql.functions import col
from pyspark.sql import SparkSession
def generate_series(start, stop, interval):
    """
    :param start: lower bound, inclusive
    :param stop: upper bound, exclusive
    :param interval: increment interval in seconds
    """
    spark = SparkSession.builder.getOrCreate()
    # Determine start and stop in epoch seconds
    start, stop = spark.createDataFrame(
        [(start, stop)], ("start", "stop")
    ).select(
        [col(c).cast("timestamp").cast("long") for c in ("start", "stop")]
    ).first()
    # Create range with increments and cast to timestamp
    return spark.range(start, stop, interval).select(
        col("id").cast("timestamp").alias("value")
    )
Example usage:
generate_series("2000-01-01", "2000-01-05", 60 * 60).show(5) # By hour
+-------------------+
| value|
+-------------------+
|2000-01-01 00:00:00|
|2000-01-01 01:00:00|
|2000-01-01 02:00:00|
|2000-01-01 03:00:00|
|2000-01-01 04:00:00|
+-------------------+
only showing top 5 rows
generate_series("2000-01-01", "2000-01-05", 60 * 60 * 24).show() # By day
+-------------------+
| value|
+-------------------+
|2000-01-01 00:00:00|
|2000-01-02 00:00:00|
|2000-01-03 00:00:00|
|2000-01-04 00:00:00|
+-------------------+
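If you want plain dates rather than timestamps, a small follow-up cast works. This is a sketch of my own on top of the function above (note that stop is exclusive, so push the last day out by one to include it):
from pyspark.sql.functions import col

# Cast the generated timestamps down to plain dates
generate_series("2000-01-01", "2000-01-06", 60 * 60 * 24) \
    .select(col("value").cast("date").alias("dt")) \
    .show()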

@Rakesh's answer is correct, but I would like to share a less verbose solution:
import datetime
from pyspark.sql.types import ArrayType, DateType

# UDF
def generate_date_series(start, stop):
    return [start + datetime.timedelta(days=x) for x in range(0, (stop - start).days + 1)]

# Register UDF for later usage
spark.udf.register("generate_date_series", generate_date_series, ArrayType(DateType()))

# mydf is a DataFrame with columns `start` and `stop` of type DateType()
mydf.createOrReplaceTempView("mydf")

spark.sql("SELECT explode(generate_date_series(start, stop)) FROM mydf").show()

Suppose you have a DataFrame df from Spark SQL with `start` and `stop` date columns. Try this:
import datetime
import pyspark.sql.functions as F
import pyspark.sql.types as T

def timeseriesDF(start, total):
    # Build the list of consecutive dates, inclusive of both endpoints
    return [start + datetime.timedelta(days=i) for i in range(total + 1)]

timeseries_udf = F.udf(timeseriesDF, T.ArrayType(T.DateType()))

df.withColumn(
    "t_series",
    timeseries_udf(df.start, F.datediff(df.stop, df.start))
).select(F.explode("t_series")).show()

Building off of user10938362's answer, this just shows a way to use range without a UDF, provided that you are trying to build a DataFrame of dates based on some ingested dataset rather than a hardcoded start/stop.
# start date is the min date (assumed to be stored as unix epoch seconds)
date_min = int(df.agg({'date': 'min'}).first()[0])

# end date is the current date; alternatively use max as above
date_max = (
    spark.sql('select unix_timestamp(current_timestamp()) as date_max')
    .collect()[0]['date_max']
)

# range works on ints; unix time is in seconds, so 60*60*24 = one day
df = spark.range(date_min, date_max, 60*60*24).select('id')
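The resulting id column still holds epoch seconds. A follow-up cast, my own addition in the spirit of the range-based answer above, turns it into usable timestamp and date columns:
from pyspark.sql.functions import col

# `id` is epoch seconds; cast it back to a timestamp and, if needed, a date
dates_df = df.select(
    col('id').cast('timestamp').alias('ts'),
    col('id').cast('timestamp').cast('date').alias('dt')
)
dates_df.show()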

Related

Oracle 18c - Alternative to REGEXP_REPLACE

After migrating to Oracle 18c Enterprise Edition, a function-based index fails to create.
Here is my index DDL:
CREATE INDEX my_index ON my_table
(UPPER( REGEXP_REPLACE ("DEPT_NUM",'[^[:alnum:]]',NULL,1,0)))
TABLESPACE my_tbspace
PCTFREE 10
INITRANS 2
MAXTRANS 255
STORAGE (
INITIAL 64K
MINEXTENTS 1
MAXEXTENTS UNLIMITED
PCTINCREASE 0
BUFFER_POOL DEFAULT
);
I get the following error:
ORA-01743: only pure functions can be indexed
01743. 00000 - "only pure functions can be indexed"
*Cause: The indexed function uses SYSDATE or the user environment.
*Action:   PL/SQL functions must be pure (RNDS, RNPS, WNDS, WNPS). SQL
           expressions must not use SYSDATE, USER, USERENV(), or anything
           else dependent on the session state. NLS-dependent functions
           are OK.
Is this a known bug in 18c? If this function-based index is no longer supported, what is another way to write this function?
The issue is that regexp_replace is not deterministic. The problem arises when changing NLS settings:
alter session set nls_language = english;
with rws as (
select 'STÜFF' v
from dual
)
select regexp_replace ( v, '[A-Z]+', '#' )
from rws;
REGEXP_REPLACE(V,'[A-Z]+','#')
#Ü#
alter session set nls_language = german;
with rws as (
select 'STÜFF' v
from dual
)
select regexp_replace ( v, '[A-Z]+', '#' )
from rws;
REGEXP_REPLACE(V,'[A-Z]+','#')
#
U-umlaut sorts at the end of the alphabet in English, but right after U in German. So the first statement doesn't replace it; the second does.
In Oracle Database 12.1 and earlier regexp_replace was incorrectly marked as deterministic. 12.2 fixed this by making it non-deterministic.
Consider carefully whether any workarounds manage diacritics correctly.
MOS note 2592779.1 discusses this further.
Most likely the REGEXP_REPLACE causes the problem; see Find out if a string contains only ASCII characters. You can bypass the limitation with a user-defined function (thanks to Bob Jarvis):
CREATE OR REPLACE FUNCTION KEEP_ALNUM(strIn IN VARCHAR2)
  RETURN VARCHAR2
  DETERMINISTIC
AS
BEGIN
  RETURN UPPER(REGEXP_REPLACE(strIn, '[^[:alnum:]]', NULL, 1, 0));
END KEEP_ALNUM;
/

CREATE INDEX DEPTS_1 ON DEPTS(KEEP_ALNUM(DEPT_NUM));
Just ensure the function has the keyword DETERMINISTIC; then you can define even useless functions like the one below and create a function-based index on it:
CREATE OR REPLACE FUNCTION SillyValue RETURN VARCHAR2 DETERMINISTIC
AS
BEGIN
RETURN DBMS_RANDOM.STRING('p', 20);
END;
/
There are a couple of workarounds.
The first one is a hack.
As you may know, when you create an FBI, Oracle creates a hidden column and an index on it.
Moreover, you can even specify the name of that column instead of the FBI expression, and Oracle will use the index.
set lines 70 pages 70
column column_name format a15
column data_type format a15
drop table my_table;
create table my_table(dept_num, dept_descr) as select rownum||'*', 'dummy' from dual connect by level <= 1e6;
create index my_index
on my_table(upper(regexp_replace(dept_num, '[^[:alnum:]]', null, 1, 0)));
select column_name, data_type from user_tab_cols where table_name = 'MY_TABLE';
explain plan for
select * from my_table where upper(regexp_replace(dept_num, '[^[:alnum:]]', null, 1, 0)) = '666';
select * from table(dbms_xplan.display(format => 'BASIC'));
explain plan for
select * from my_table where SYS_NC00003$ = '666';
select * from table(dbms_xplan.display(format => 'BASIC'));
Output
Table dropped.
Table created.
Index created.
COLUMN_NAME DATA_TYPE
--------------- ---------------
DEPT_NUM VARCHAR2
DEPT_DESCR CHAR
SYS_NC00003$ VARCHAR2
3 rows selected.
Explain complete.
PLAN_TABLE_OUTPUT
----------------------------------------------------------------------
Plan hash value: 2234884270
--------------------------------------------------------
| Id | Operation | Name |
--------------------------------------------------------
| 0 | SELECT STATEMENT | |
| 1 | TABLE ACCESS BY INDEX ROWID BATCHED| MY_TABLE |
| 2 | INDEX RANGE SCAN | MY_INDEX |
--------------------------------------------------------
9 rows selected.
Explain complete.
PLAN_TABLE_OUTPUT
----------------------------------------------------------------------
Plan hash value: 2234884270
--------------------------------------------------------
| Id | Operation | Name |
--------------------------------------------------------
| 0 | SELECT STATEMENT | |
| 1 | TABLE ACCESS BY INDEX ROWID BATCHED| MY_TABLE |
| 2 | INDEX RANGE SCAN | MY_INDEX |
--------------------------------------------------------
9 rows selected.
So to mimic an FBI you can create a hidden column and an index on top of it.
That can be done in Oracle 11g using dbms_stats.create_extended_stats:
drop index my_index;
begin
  for i in (select dbms_stats.create_extended_stats
                     (user, 'my_table', '(upper(regexp_replace("DEPT_NUM", ''[^[:alnum:]]'', null, 1, 0)))') as col_name
              from dual)
  loop
    execute immediate(utl_lms.format_message('alter table %s rename column "%s" to my_hidden_col', 'my_table', i.col_name));
  end loop;
end;
/
select column_name, data_type from user_tab_cols where table_name = 'MY_TABLE';
create index my_index on my_table(my_hidden_col);
explain plan for
select * from my_table where upper(regexp_replace(dept_num, '[^[:alnum:]]', null, 1, 0)) = '666';
select * from table(dbms_xplan.display(format => 'BASIC'));
explain plan for
select * from my_table where MY_HIDDEN_COL = '666';
select * from table(dbms_xplan.display(format => 'BASIC'));
Output
Index dropped.
PL/SQL procedure successfully completed.
COLUMN_NAME DATA_TYPE
--------------- ---------------
DEPT_NUM VARCHAR2
DEPT_DESCR CHAR
MY_HIDDEN_COL VARCHAR2
3 rows selected.
Index created.
Explain complete.
PLAN_TABLE_OUTPUT
----------------------------------------------------------------------
Plan hash value: 2234884270
--------------------------------------------------------
| Id | Operation | Name |
--------------------------------------------------------
| 0 | SELECT STATEMENT | |
| 1 | TABLE ACCESS BY INDEX ROWID BATCHED| MY_TABLE |
| 2 | INDEX RANGE SCAN | MY_INDEX |
--------------------------------------------------------
9 rows selected.
Explain complete.
PLAN_TABLE_OUTPUT
----------------------------------------------------------------------
Plan hash value: 2234884270
--------------------------------------------------------
| Id | Operation | Name |
--------------------------------------------------------
| 0 | SELECT STATEMENT | |
| 1 | TABLE ACCESS BY INDEX ROWID BATCHED| MY_TABLE |
| 2 | INDEX RANGE SCAN | MY_INDEX |
--------------------------------------------------------
9 rows selected.
Starting with Oracle 12c, invisible columns are documented, so it becomes even more straightforward:
alter table my_table add (my_hidden_col invisible as
(upper(regexp_replace(dept_num, '[^[:alnum:]]', null, 1, 0))) virtual);
create index my_index on my_table(my_hidden_col);
Another approach is to implement the same logic without a regex:
create index my_index on my_table(
  translate(upper(dept_num), '_'||translate(dept_num, '_ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789', '_'), '_')
);
But in this case you have to make sure that all regex expressions in your predicates are replaced with the new one.
The work-around I found easiest was to create the index using NLS_UPPER instead of UPPER:
CREATE INDEX my_index ON my_table
(REGEXP_REPLACE(NLS_UPPER("DEPT_NUM"), '[^[:alnum:]]', NULL, 1, 0))
TABLESPACE my_tbspace
PCTFREE 10
INITRANS 2
MAXTRANS 255
STORAGE (
INITIAL 64K
MINEXTENTS 1
MAXEXTENTS UNLIMITED
PCTINCREASE 0
BUFFER_POOL DEFAULT
);

Power BI: create measure from multiple tables

I want to create a measure that gives me the amount of written hours in a shift. The issue is that my DAX knowledge is not sufficient to create such a formula.
The tables are as follows:
TABLE 1 (Shift information)
END      | START    | EMPLOYEE
datetime | datetime | varchar

TABLE 2 (Written time)
END      | START    | DURATION | EMPLOYEE
datetime | datetime | duration | varchar
Now I have created similar SQL to produce this data; unfortunately that SQL is so intensive that it would take too long to run.
The formula in the SQL is as follows:
SUM(table2.duration) WHERE table2.end BETWEEN table1.start and table1.end
Any help would be appreciated!
Try DATESBETWEEN
=
CALCULATE (
    SUM ( table2[duration] ),
    DATESBETWEEN (
        table2[end],
        LASTDATE ( table1[start] ),
        LASTDATE ( table1[end] )
    )
)

Pandas dataframe in pyspark to hive

How to send a pandas dataframe to a hive table?
I know if I have a spark dataframe, I can register it to a temporary table using
df.registerTempTable("table_name")
sqlContext.sql("create table table_name2 as select * from table_name")
but when I try to call registerTempTable on the pandas DataFrame, I get the error below:
AttributeError: 'DataFrame' object has no attribute 'registerTempTable'
Is there a way for me to register a temp table from a pandas DataFrame, or to convert it to a Spark DataFrame and then register a temp table, so that I can send it back to Hive?
I guess you are trying to use a pandas DataFrame instead of a Spark DataFrame.
A pandas DataFrame has no such method as registerTempTable.
You may try to create a Spark DataFrame from the pandas DataFrame.
UPDATE:
I've tested it under Cloudera (with the Anaconda parcel installed, which includes the pandas module).
Make sure that you have set PYSPARK_PYTHON to your Anaconda Python installation (or another one containing the pandas module) on all your Spark workers (usually in spark-conf/spark-env.sh).
Here is result of my test:
>>> import pandas as pd
>>> import numpy as np
>>> df = pd.DataFrame(np.random.randint(0,100,size=(10, 3)), columns=list('ABC'))
>>> sdf = sqlContext.createDataFrame(df)
>>> sdf.show()
+---+---+---+
| A| B| C|
+---+---+---+
| 98| 33| 75|
| 91| 57| 80|
| 20| 87| 85|
| 20| 61| 37|
| 96| 64| 60|
| 79| 45| 82|
| 82| 16| 22|
| 77| 34| 65|
| 74| 18| 17|
| 71| 57| 60|
+---+---+---+
>>> sdf.printSchema()
root
|-- A: long (nullable = true)
|-- B: long (nullable = true)
|-- C: long (nullable = true)
First you need to convert the pandas DataFrame to a Spark DataFrame:
from pyspark.sql import HiveContext
hive_context = HiveContext(sc)
df = hive_context.createDataFrame(pd_df)
Then you can create a temp table, which lives in memory:
df.registerTempTable('tmp')
Now you can use Hive QL to save the data into Hive:
hive_context.sql("""insert overwrite table target partition(p='p') select a, b from tmp""")
Note that the hive_context must be kept the same throughout!
I converted my pandas df to a temp table by
1) Converting the pandas dataframe to spark dataframe:
spark_df=sqlContext.createDataFrame(Pandas_df)
2) Make sure that the data is migrated properly
spark_df.select("*").show()
3) Convert the spark dataframe to a temp table for querying.
spark_df.registerTempTable("table_name")
Cheers..
By following all the other answers here, I was able to convert a pandas DataFrame to a permanent Hive table as follows:
# sc is a SparkContext from a Hive-enabled Spark deployment
from pyspark.sql import HiveContext
hc = HiveContext(sc)
# df is my pandas dataframe
hc.createDataFrame(df).registerTempTable('tmp')
# sch is the hive schema, and tabname is my new hive table name
hc.sql("create table sch.tabname as select * from tmp")

Power Query: Date Time Zone operations

I am building a Power BI report based on data from SQL Server's User table, which contains a CreatedDate column of type DateTimeOffset(7).
I am looking for the Power Query equivalent of .NET's [TimeZoneInfo.FindSystemTimeZoneById]. My report data has time zone info stored as a standard ID (like Pacific Standard Time), and I need to convert the values in my DateTimeOffset column to that time zone. This is how I do it in C#:
TimeSpan timeZoneOffsetSpan = TimeZoneInfo.FindSystemTimeZoneById(settings.TimeZone).GetUtcOffset(UserCreatedDateTime);
This gives me the difference between UTC and the time zone stored in my settings for UserCreatedDateTime. I execute that line of code just once for all rows in my User table, because its purpose is to find the current offset, taking into account time zone features like daylight saving.
Now I can simply apply that offset (which could be positive or negative) to every value in my User.CreateDateTime column; in C# I do that using [DateTimeOffset.ToOffset]:
DateTimeOffset convertedOffset = UserCreatedDateTime.ToOffset(timeZoneOffsetSpan)
So I need to be able to do the same conversions in my Power BI report, but I could not find any Power Query function that does what [TimeZoneInfo.FindSystemTimeZoneById] does. If such a function existed, it would cover the first line of my C# code, and then I would need an equivalent to [DateTimeOffset.ToOffset]. [DateTimeZone.SwitchZone] is probably what I need, but I first need to know the offset corresponding to my time zone so that I can supply it as the 2nd and 3rd parameters to that function.
So, to summarize, I need a Power BI analog of [TimeZoneInfo.FindSystemTimeZoneById].
Can anybody help?
You could add a table to the data model with columns for time zone ID and time zone offset. In your main query you can join this table and then create a calculation to add or subtract the offset into a new column. After that, remove the joined column.
See here: How can I perform the equivalent of AddHours to a DateTime in Power Query?
Or you could use the DateTimeZone functions built into M.
You can take advantage of the AT TIME ZONE feature of T-SQL to create a table of offsets (in minutes) for each day, for each time zone, and import the resulting table into Power BI, so you can add/subtract the relevant offsets to your base time.
If I were in the OP's situation, I would import (convert) all times to UTC and then add the offset within Power BI, depending on the time zone you want to display (maybe using this method?).
The example below gives a table the offsets for Australian timezones against UTC.
If my target table is in UTC and I want to change the time in PowerBI, I just have to add the specific offset to my time...
+------------+-----------+----------+----------+----------+
| TargetDate | QLDOffset | NTOffset | SAOffset | WAOffset |
+------------+-----------+----------+----------+----------+
| 2018-04-02 | -600 | -570 | -570 | -480 |
| 2018-04-01 | -600 | -570 | -570 | -480 |
| 2018-03-31 | -600 | -570 | -630 | -480 |
| 2018-03-30 | -600 | -570 | -630 | -480 |
+------------+-----------+----------+----------+----------+
You can see the offset changing for SA on 1st April
WITH MostDays AS ( -- MostDays is a table of 10 years worth of dates going back from today.
    SELECT Today = CAST(DATEADD(DAY, -days, GETUTCDATE()) AS DATETIMEOFFSET)
    FROM (SELECT TOP 3650 Days = ROW_NUMBER() OVER (ORDER BY message_id) -- this will give a rolling window of 10 years
          FROM sys.messages
         ) x
)
-- This section takes the datetime from above, and calculates how many minutes difference there is between each timezone for each date.
Select TargetDate=CAST(Today AS DATE)
,QLDOffset = DATEDIFF(minute,CAST(Today at TIME ZONE 'UTC' AT TIME ZONE 'E. Australia Standard TIME' AS DateTime),CAST(Today at TIME ZONE 'UTC' AS DATETIME))
,NTOffset = DATEDIFF(minute,CAST(Today at TIME ZONE 'UTC' AT TIME ZONE 'AUS Central Standard Time' AS DateTime),CAST(Today at TIME ZONE 'UTC' AS DATETIME))
,SAOffset = DATEDIFF(minute,CAST(Today at TIME ZONE 'UTC' AT TIME ZONE 'Cen. Australia Standard TIME' AS DateTime),CAST(Today at TIME ZONE 'UTC' AS DATETIME))
,WAOffset = DATEDIFF(minute,CAST(Today at TIME ZONE 'UTC' AT TIME ZONE 'W. Australia Standard TIME' AS DateTime),CAST(Today at TIME ZONE 'UTC' AS DATETIME))
FROM MostDays
Obviously the table gets pretty big if you have no limits on dates or time zones.

How to calculate number of days in redshift

I would like to calculate the number of days between two dates in Redshift, but the function should take the time into account. That means day = 0 if there are less than 24 hours between the dates, like the TIMESTAMPDIFF function in MySQL. DATEDIFF is not relevant here.
Do a DATEDIFF(hours, start, end) to count the number of hours, then divide by 24 to get the number of whole days, e.g.:
db=# create table t (a TIMESTAMP, b TIMESTAMP);
CREATE TABLE
db=# insert into t values ('2015-08-19 03:56:39'::timestamp, '2015-08-23 02:22:18'::timestamp);
INSERT 0 1
db=# insert into t values ('2015-08-19 03:56:39'::timestamp, '2015-08-23 10:22:18'::timestamp);
INSERT 0 1
db=# select a, b, datediff(hours, a, b) / 24 as whole_days from t;
a | b | whole_days
---------------------+---------------------+------------
2015-08-19 03:56:39 | 2015-08-23 02:22:18 | 3
2015-08-19 03:56:39 | 2015-08-23 10:22:18 | 4
(2 rows)
See also:
Amazon Redshift Data Types documentation
Working with Dates and Times in PostgreSQL