Splitting an array into columns in Athena/Presto - amazon-athena

I feel this should be simple, but I've struggled to find the right terminology, so please bear with me.
I have two columns, timestamp and voltages, where voltages is an array.
If I do a simple
SELECT timestamp, voltages FROM table
Then I'd get a result of:
|timestamp | voltages |
|1544435470 |3.7352,3.749,3.7433,3.7533|
|1544435477 |3.7352,3.751,3.7452,3.7533|
|1544435484 |3.7371,3.749,3.7433,3.7533|
|1544435490 |3.7352,3.749,3.7452,3.7533|
|1544435497 |3.7352,3.751,3.7452,3.7533|
|1544435504 |3.7352,3.749,3.7452,3.7533|
But I want to split the voltages array so that each element becomes its own column.
|timestamp | v1 | v2 | v3 | v4 |
|1544435470 |3.7352 |3.749 |3.7433 |3.7533|
|1544435477 |3.7352 |3.751 |3.7452 |3.7533|
|1544435484 |3.7371 |3.749 |3.7433 |3.7533|
|1544435490 |3.7352 |3.749 |3.7452 |3.7533|
|1544435497 |3.7352 |3.751 |3.7452 |3.7533|
|1544435504 |3.7352 |3.749 |3.7452 |3.7533|
I know I can do this with:
SELECT timestamp, voltages[1] as v1, voltages[2] as v2 FROM table
But I'd need to be able to do this programmatically, as opposed to listing them out.
Am I missing something obvious?

This should serve your purpose if you have arrays of fixed length.
You first need to break each array element out into its own row. You can do this using the UNNEST operator in the following way:
SELECT timestamp, volt
FROM table
CROSS JOIN UNNEST(voltages) AS t(volt)
Using the resulting table, you can pivot (convert multiple rows with the same timestamp into multiple columns) by referring to Gordon Linoff's answer to "need to convert data in multiple rows with same ID into 1 row with multiple columns"; a sketch of that pivot is shown below.
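As a minimal sketch (it assumes the table really is named table and the arrays always hold exactly four voltages), UNNEST ... WITH ORDINALITY tags each element with its position, and conditional aggregation then pivots those positions into columns:
SELECT timestamp,
       MAX(CASE WHEN idx = 1 THEN volt END) AS v1,
       MAX(CASE WHEN idx = 2 THEN volt END) AS v2,
       MAX(CASE WHEN idx = 3 THEN volt END) AS v3,
       MAX(CASE WHEN idx = 4 THEN volt END) AS v4
FROM table
CROSS JOIN UNNEST(voltages) WITH ORDINALITY AS t(volt, idx)   -- idx is the 1-based element position
GROUP BY timestamp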

Related

Dynamic Google Sheets Column + Row formula

I have a Google Sheet where I want to grab the header, which is a date/time stamp, match it against another sheet, find the entries with that date, suburb, and type, and get an average cost.
My formula is =AVERAGEIFS(Sheet1!C:C,Sheet1!A:A, B11:B, Sheet1!F:F, C10), which gives me the average, but I've hard-coded the header date (C10).
What I want to do is dynamically pull in the date/time from the header row above instead of manually adding it in the formula, something like this:
=AVERAGEIFS(Sheet1!C:C,Sheet1!A:A, B11:B, Sheet1!F:F, =CHAR(COLUMN()+64) & 10)
Which would automatically grab the column + row 10 e.g C10, D10, E10.
If I put =CHAR(COLUMN()+64) & 10 in its own cell it works, but when I add it to the AVERAGEIFS condition it gives me a parsing error.
I'm expecting C10, D10, E10 from =CHAR(COLUMN()+64) & 10, which should allow me to dynamically filter data on the date in the header above it.
try:
=AVERAGEIFS(Sheet1!C:C, Sheet1!A:A, B11:B, Sheet1!F:F, INDIRECT(CHAR(COLUMN()+64)&10))
INDIRECT turns the text string built by CHAR(COLUMN()+64)&10 into an actual cell reference; the bare = inside the AVERAGEIFS arguments is what caused the parsing error.

Regex to filter a Google Sheets column

I am trying to extract the last two characters of each cell value across an entire column.
I have tried the formula below:
=FILTER(Data!H:H,REGEXEXTRACT(Data!H:H, "\(..\)$"))
But this gives me an error.
I have values like this:
Column H    My desired result
--------    -----------------
as/lk lk
dsfs fs
as*(& (&
asdda da
dasda da
This can be achieved with an array formula:
=ARRAYFORMULA(IFERROR(REGEXEXTRACT(Data!H2:H, "(..)$")))
The (..)$ pattern captures the last two characters of each cell, and IFERROR blanks any cell the pattern doesn't match.

Convert a number column into a time format in Power BI

I'm looking for a way to convert a decimal number into a valid HH:mm:ss format.
I'm importing data from an SQL database.
One of the columns in my database is labelled Actual Start Time.
The values in my database are stored in the following decimal format:
73758 // which translates to 07:37:58
114436 // which translates to 11:44:36
I cannot simply convert this Actual Start Time column into a Time format in my Power BI import as it returns errors for some values, saying it doesn't recognise 73758 as a valid 'time'. It needs to have a leading zero for cases such as 73758.
To combat this, I created a new Text column with the following code to append a leading zero:
Column = FORMAT([Actual Start Time], "000000")
This returns the following results:
073758
114436
-- which is perfect. Exactly what I needed.
I now want to convert these values into a Time.
Simply changing the data type field to Time doesn't do anything, returning:
Cannot convert value '073758' of type Text to type Date.
So I created another column with the following code:
Column 2 = FORMAT(TIME(LEFT([Column], 2), MID([Column], 3, 2), RIGHT([Column], 2)), "HH:mm:ss")
To pass the values 07, 37 and 58 into a TIME format.
This returns the following:
|Actual Start Time | Column | Column 2 |
|73758             | 073758 | 07:37:58 |
|114436            | 114436 | 11:44:36 |
Which is what I wanted but is there any other way of doing this? I want to ideally do it in one step without creating additional columns.
You could use a variable as suggested by Aldert, or you can replace the helper Column with the FORMAT function inline:
Time Format =
FORMAT(
    TIME(
        LEFT(FORMAT([Actual Start Time], "000000"), 2),
        MID(FORMAT([Actual Start Time], "000000"), 3, 2),
        RIGHT([Actual Start Time], 2)
    ),
    "hh:mm:ss"
)
Edit:
If you want to do this in Power Query, you can create a custom column with the following calculation:
Time.FromText(
if Text.Length([Actual Start Time])=5 then Text.PadStart( [Actual Start Time],6,"0")
else [Actual Start Time])
Once this column is created you can drop the old column, so that you only have one time column in the data. Hope this helps.
I deliberately show you the concept of variables so you can use it in the future with more complex queries.
TimeC =
VAR timeStr = FORMAT([Actual Start Time], "000000")
RETURN
    FORMAT(TIME(LEFT(timeStr, 2), MID(timeStr, 3, 2), RIGHT(timeStr, 2)), "HH:mm:ss")

Convert array of rows into array of strings in pyspark

I have a dataframe with 2 columns, and I got the array below by doing df.collect().
array = [Row(name=u'Alice', age=10), Row(name=u'Bob', age=15)]
Now I want to get an output array like below.
new_array = ['Alice', 'Bob']
Could anyone please let me know how to extract the above output using PySpark? Any help would be appreciated.
Thanks
# Creating the base dataframe.
values = [('Alice',10),('Bob',15)]
df = sqlContext.createDataFrame(values,['name','age'])
df.show()
+-----+---+
| name|age|
+-----+---+
|Alice| 10|
| Bob| 15|
+-----+---+
df.collect()
[Row(name='Alice', age=10), Row(name='Bob', age=15)]
# Use list comprehensions to create a list.
new_list = [row.name for row in df.collect()]
print(new_list)
['Alice', 'Bob']
I see two columns name and age in the df. Now, you want only the name column to be displayed.
You can select it like:
df.select("name").show()
This will show you only the names.
Tip: you can also use df.show() instead of df.collect(). That will show the data in tabular form instead of as Row(...) objects.

Search for specific characters within a column and then create different columns from it

I have a param_Value column that holds different kinds of values. I need to extract these values and create columns for each of them.
|PARAM_NAME |param_Value |
|Step 4     | SP:0.09    |
|Procedure  | MAX:125    |
|Step 4     | SP:Ambient |
|(null)     | +/-:N/A    |
|Steam      | SP:2       |
|Step 3     | MIN:0      |
|Step 4     | RDPHN427B  |
|Testing De | N/A        |
I only want columns for the values with these prefixes, and I want to give them these names:
SP: SET_POINT_VALUE,
MAX: MAX_LIMIT,
MIN: MIN_LIMIT,
+/-: UPPER_LOWER_LIMIT
So what I have so far is:
CREATE OR REPLACE FORCE VIEW PROCESS_STEPS
("PARAM_NAME", "SET_POINT_VALUE", "UPPER_LOWER_LIMIT", "MAX_VALUE", "MIN_VALUE")
AS
SELECT PARAM_NAME,
REGEXP_LIKE("param_Value", 'SP:') SET_POINT_VALUE,
REGEXP_LIKE("param_Value", '+/-:') UPPER_LOWER_LIMIT,
REGEXP_LIKE("param_Value", 'MAX:') MAX_VALUE,
REGEXP_LIKE("param_Value", 'MIN:') MIN_VALUE
FROM PROCESS_STEPS
;
I'm more familiar with TSQL and MySQL, but this ought to do what I think you're looking for. If it doesn't exactly, it should at least point you in the right direction.
CREATE OR REPLACE FORCE VIEW PROCESS_STEPS
("PARAM_NAME", "SET_POINT_VALUE", "UPPER_LOWER_LIMIT", "MAX_VALUE", "MIN_VALUE")
AS
SELECT PARAM_NAME
, CASE WHEN "param_Value" LIKE 'SP:%'
THEN SUBSTR("param_Value", INSTR("param_Value", ':')+1)
ELSE Null
END SET_POINT_VALUE
, CASE WHEN "param_Value" LIKE '+/-:%'
THEN SUBSTR("param_Value", INSTR("param_Value", ':')+1)
ELSE Null
END UPPER_LOWER_LIMIT
, CASE WHEN "param_Value" LIKE 'MAX:%'
THEN SUBSTR("param_Value", INSTR("param_Value", ':')+1)
ELSE Null
END MAX_VALUE
, CASE WHEN "param_Value" LIKE 'MIN:%'
THEN SUBSTR("param_Value", INSTR("param_Value", ':')+1)
ELSE Null
END MIN_VALUE
FROM PROCESS_STEPS
;
The basic concept here is identifying the information you want via LIKE, then using SUBSTR and INSTR to extract it. While LIKE is normally something to stay away from, there is no leading % in your case, so it's sargable and thus probably not a total efficiency sink.
Really, though, I have to ask you to question why you're laying out your data like this: substring operations are slow in any language, and a DB is no exception. Why not use another column for your limit type? Why not lay it out like the view you're currently looking at? A sketch of that alternative layout follows.
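As a minimal sketch only (the column names LIMIT_TYPE and LIMIT_VALUE are illustrative, and it assumes the same source with the quoted "param_Value" column), the limit type and its value could live in two columns instead of four mostly-NULL ones:
SELECT PARAM_NAME
     , SUBSTR("param_Value", 1, INSTR("param_Value", ':') - 1) AS LIMIT_TYPE   -- e.g. SP, MAX, MIN, +/-
     , SUBSTR("param_Value", INSTR("param_Value", ':') + 1)    AS LIMIT_VALUE  -- e.g. 0.09, 125, Ambient
FROM PROCESS_STEPS
WHERE INSTR("param_Value", ':') > 0   -- rows without a prefix (e.g. RDPHN427B) are left out here
;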