Athena Nested Array Structure Query - amazon-athena

I need some help, please.
I have a column in Athena of the following type:
array<struct<addedtitle:string,addedvalue:double,keytitle:string,key:string,recvalue:double,unit:string,isbalanced:boolean>>
For example, one of the rows is:
[{addedtitle=Sodium Carbonate, addedvalue=null, keytitle=Increase PH, key=p9, recvalue=0.8999999999999999, unit=lbs, isbalanced=null}, {addedtitle=Soduim Hypochlorite (12%), addedvalue=15.0, keytitle=Increase Chlorine, key=p8, recvalue=18.218999999999998, unit=fl oz, isbalanced=null}, {addedtitle=Sodium Bicarbonate, addedvalue=32.0, keytitle=Increase Alkalinity, key=p10, recvalue=33.6, unit=oz, isbalanced=null}, {addedtitle=Calcium Chloride (100%), addedvalue=86.0, keytitle=Increase Calcium Hardness, key=p6, recvalue=88.72002, unit=oz, isbalanced=null}, {addedtitle=Cyanuric Acid, addedvalue=10.0, keytitle=Increase Cyanuric Acid, key=p11, recvalue=11.7, unit=oz, isbalanced=null}]
How can I query this column in Athena so that each recvalue in the nested structure is returned in its own column?
As output, I should get one recommendation value per column:
recommendation0 recommendation1 recommendation2 recommendation3
0.8999999999999999 18.218999999999998 33.6 88.72002

Assuming the arrays always have exactly four elements, and that the elements are always in the correct order, you can just pick the elements out of the array like this:
SELECT
the_array_column[1].recvalue AS recommendation0,
the_array_column[2].recvalue AS recommendation1,
the_array_column[3].recvalue AS recommendation2,
the_array_column[4].recvalue AS recommendation3
FROM my_table
(You didn't provide a full schema, so I'm improvising the names of the table and column. Also note that array indexing starts at 1.)
However, your row example has five elements, and your output example has four, and the order also does not match. Perhaps you can clarify your question if the above does not solve your problem?
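If the number of elements differs from four, the column list can be generated instead of typed by hand. Below is a minimal sketch in Python, assuming the same placeholder names as above (the_array_column, my_table). The helper build_recvalue_query is hypothetical and only builds the SQL text; it does not run it.

```python
def build_recvalue_query(column: str, table: str, n_elements: int) -> str:
    """Build a SELECT that picks `recvalue` out of the first n array elements.

    Athena array subscripts are 1-based, while the output aliases in the
    question are 0-based, hence the `i - 1` in the alias.
    """
    select_items = [
        f"  {column}[{i}].recvalue AS recommendation{i - 1}"
        for i in range(1, n_elements + 1)
    ]
    return "SELECT\n" + ",\n".join(select_items) + f"\nFROM {table}"


# Placeholder names, as in the answer above.
query = build_recvalue_query("the_array_column", "my_table", 4)
print(query)
```

The generated query still relies on every row having at least n_elements entries, because a subscript past the end of an array fails in Athena.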

Related

Is it possible to move the content of one row to another row?

I have this data frame:
biopsia1
Name:Juan
Rut: 17006-9
Diagnostic:
Gallbladder Cancer
I split the data into two columns:
|text | text_2 |
|Name | Juan |
|Rut | 17006-9 |
|Diagnostic | NA |
|Gallbladder Cancer | NA |
The original format has the Diagnostic result on a different row (I cannot change it), so when I split the data, "Gallbladder Cancer" ends up in the row below "Diagnostic", but I want it in the same row:
|text | text_2 |
|Diagnostic | Gallbladder Cancer |
Is it possible to move "Gallbladder Cancer" to the same row?
Thanks for your help!
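One possible approach is to treat a line without a colon as the missing value of the preceding key. The question does not name its tooling, so this is only a sketch of that idea in plain Python. The function split_key_values is a hypothetical helper, and the input list mirrors the biopsia1 column shown above.

```python
def split_key_values(lines):
    """Split 'key: value' lines into (key, value) pairs.

    If a key has no value on its own line (e.g. 'Diagnostic:'), the next
    line without a colon is taken as its value.
    """
    pairs = []
    for line in lines:
        if ":" in line:
            key, value = line.split(":", 1)
            # An empty value is stored as None so it can be filled in later.
            pairs.append([key.strip(), value.strip() or None])
        elif pairs and pairs[-1][1] is None:
            # Continuation line: fill the pending empty value.
            pairs[-1][1] = line.strip()
        else:
            # Stray line with nothing to attach to: keep it, with no value.
            pairs.append([line.strip(), None])
    return [tuple(p) for p in pairs]


biopsia1 = ["Name:Juan", "Rut: 17006-9", "Diagnostic:", "Gallbladder Cancer"]
rows = split_key_values(biopsia1)
for key, value in rows:
    print(key, "|", value)
```

The same rule (fill an empty value from the following row, then drop that row) could be applied after the split in whatever data frame library is in use.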

Reversing a column when calculating slope for a line chart in Google Sheets

I have the following sheets:
https://docs.google.com/spreadsheets/d/1aC9lsmxVw0pYN_Wjk7gooB0c7CsvmkRsEeCUBEKUIlM/edit?usp=sharing
It should be fairly self-explanatory when you look at it. There is a sparkline which turns green if the trend is positive. From the data, it makes intuitive sense that the line should be trending up. However, due to the way I wrote the formula, the line is instead trending down and red. How can I reverse the columns being used in the formula?
Note: The data on the right hand side should remain in the same order.
Thanks for any help.
try:
=IFERROR(ARRAYFORMULA(SPARKLINE(
QUERY({B2:B, ROW(B2:B)}, "select Col1 order by Col2 desc"),
{"charttype", "line"; "color", IF(SLOPE(
QUERY({B2:B, ROW(B2:B)}, "select Col1 order by Col2 desc"),
ROW(A2:A)-1)>=0, "lime", "red"); "linewidth", 2})))
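The formula reverses column B (by sorting on row number in descending order) both for the chart and for the SLOPE call, and reversing a series flips the sign of its slope. A small Python sketch with made-up numbers illustrates this; the spreadsheet's real data is not reproduced here, and slope is a hand-written least-squares helper, not a Sheets function.

```python
def slope(ys):
    """Least-squares slope of ys against x = 0, 1, 2, ..."""
    n = len(ys)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den


# Hypothetical data, newest value first (as it might be stored in column B).
newest_first = [50, 40, 30, 20, 10]
oldest_first = newest_first[::-1]

for label, series in [("as stored", newest_first), ("reversed", oldest_first)]:
    # Same colour rule as the formula: non-negative slope is "lime", else "red".
    color = "lime" if slope(series) >= 0 else "red"
    print(label, slope(series), color)
```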

Convert array of rows into array of strings in pyspark

I have a dataframe with two columns, and I got the array below by calling df.collect().
array = [Row(name=u'Alice', age=10), Row(name=u'Bob', age=15)]
Now I want to get an output array like the one below.
new_array = ['Alice', 'Bob']
Could anyone please tell me how to produce this output using pyspark? Any help would be appreciated.
Thanks
# Creating the base dataframe.
values = [('Alice',10),('Bob',15)]
df = sqlContext.createDataFrame(values,['name','age'])
df.show()
+-----+---+
| name|age|
+-----+---+
|Alice| 10|
| Bob| 15|
+-----+---+
df.collect()
[Row(name='Alice', age=10), Row(name='Bob', age=15)]
# Use list comprehensions to create a list.
new_list = [row.name for row in df.collect()]
print(new_list)
['Alice', 'Bob']
I see two columns, name and age, in the df. If you only want the name column to be displayed, you can select it like this:
df.select("name").show()
This will show you only the names.
Tip: You can also use df.show() instead of df.collect(). That displays the data in tabular form instead of as Row(...) objects.

Splitting an array into columns in Athena/Presto

I feel this should be simple, but I've struggled to find the right terminology, please bear with me.
I have two columns, timestamp and voltages, where voltages is an array.
If I do a simple
SELECT timestamp, voltages FROM table
Then I'd get a result of:
|timestamp | voltages |
|1544435470 |3.7352,3.749,3.7433,3.7533|
|1544435477 |3.7352,3.751,3.7452,3.7533|
|1544435484 |3.7371,3.749,3.7433,3.7533|
|1544435490 |3.7352,3.749,3.7452,3.7533|
|1544435497 |3.7352,3.751,3.7452,3.7533|
|1544435504 |3.7352,3.749,3.7452,3.7533|
But I want to split the voltages array so each element in its array is its own column.
|timestamp | v1 | v2 | v3 | v4 |
|1544435470 |3.7352 |3.749 |3.7433 |3.7533|
|1544435477 |3.7352 |3.751 |3.7452 |3.7533|
|1544435484 |3.7371 |3.749 |3.7433 |3.7533|
|1544435490 |3.7352 |3.749 |3.7452 |3.7533|
|1544435497 |3.7352 |3.751 |3.7452 |3.7533|
|1544435504 |3.7352 |3.749 |3.7452 |3.7533|
I know I can do this with:
SELECT timestamp, voltages[1] as v1, voltages[2] as v2 FROM table
But I'd need to be able to do this programmatically, as opposed to listing them out.
Am I missing something obvious?
This should serve your purpose if you have arrays of fixed length.
You first need to break each array element out into its own row. You can do this using the UNNEST operator in the following way:
SELECT timestamp, volt
FROM table
CROSS JOIN UNNEST(voltages) AS t(volt)
Using the resultant table you can pivot (convert multiple rows with the same timestamp into multiple columns) by referring to Gordon Linoff's answer for "need to convert data in multiple rows with same ID into 1 row with multiple columns".
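The two steps (unnest into rows, then pivot back into columns) can be illustrated outside SQL. The sketch below is plain Python over two of the sample rows from the question and only mimics the logic; it is not Athena code. Note that it keeps each element's 1-based position so the pivot knows which column a value belongs to. The query above does not carry that position, but Presto's UNNEST ... WITH ORDINALITY can supply it.

```python
from collections import defaultdict

# Sample rows from the question: (timestamp, voltages array).
rows = [
    (1544435470, [3.7352, 3.749, 3.7433, 3.7533]),
    (1544435477, [3.7352, 3.751, 3.7452, 3.7533]),
]

# Step 1: "unnest" - one output row per array element, with its 1-based position.
unnested = [
    (ts, idx, volt)
    for ts, voltages in rows
    for idx, volt in enumerate(voltages, start=1)
]

# Step 2: "pivot" - gather the rows of each timestamp back into named columns.
pivoted = defaultdict(dict)
for ts, idx, volt in unnested:
    pivoted[ts][f"v{idx}"] = volt

for ts, cols in pivoted.items():
    print(ts, cols)
```

In SQL the set of output columns still has to be known when the query is written, which is why this approach assumes fixed-length arrays.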

Grouping Similar words/phrases

I have a frequency table of words that looks like this:
> head(freqWords)
employees work bose people company
1879 1804 1405 971 959
employee
100
> tail(freqWords)
youll younggood yoyo ytd yuorself zeal
1 1 1 1 1 1
I want to create another frequency table that combines similar words and adds up their frequencies.
In the example above, my new table should contain employee and employees as a single element with a frequency of 1979. For example:
> head(newTable)
employee,employees work bose people
1979 1804 1405 971
company
959
I know how to find similar words (using adist or stringdist), but I am unable to create the frequency table. For instance, I can use the following to get a list of similar words:
words <- names(freqWords)
lapply(words, function(x) words[stringdist(x, words) < 3])
and the following to get a list of similar two-word phrases:
lapply(words, function(x) words[stringdist2(x, words) < 3])
where stringdist2 is the following:
stringdist2 <- function(word1, word2){
min(stringdist(word1, word2),
stringdist(word1, gsub(word2,
pattern = "(.*) (.*)",
repl="\\2,\\1")))
}
I do not have any punctuation or special symbols in my words/phrases. (I do not know a lot of R; I created stringdist2 by tweaking an implementation of adist2 I found here, but I do not fully understand how pattern and repl work.)
So I need help creating the new frequency table.
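One way to build such a table is to walk the words from most to least frequent and merge each word into the first existing group whose most frequent word is close enough. The sketch below shows that idea in Python rather than R. It makes several assumptions: edit_distance is a plain Levenshtein distance (which may differ slightly from the default method of R's stringdist), group_similar is a hypothetical helper, and the threshold of 3 is taken from the question. The greedy rule is only one of several reasonable grouping strategies.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the classic dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        prev = curr
    return prev[-1]


def group_similar(freqs: dict, max_dist: int = 3) -> dict:
    """Greedily merge each word into the first group whose most frequent
    word is closer than max_dist; otherwise start a new group."""
    groups = []  # each entry: [representative, members, total frequency]
    for word, count in sorted(freqs.items(), key=lambda kv: -kv[1]):
        for group in groups:
            if edit_distance(word, group[0]) < max_dist:
                group[1].append(word)
                group[2] += count
                break
        else:
            groups.append([word, [word], count])
    return {",".join(sorted(members)): total for _, members, total in groups}


# The head(freqWords) values from the question.
freq_words = {"employees": 1879, "work": 1804, "bose": 1405,
              "people": 971, "company": 959, "employee": 100}
new_table = group_similar(freq_words)
print(new_table)
```

With a threshold this loose, short unrelated words can end up merged on a full vocabulary, so the cut-off probably needs tuning or scaling by word length.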