Load array field in csv data file into Athena table - amazon-web-services

This is a sample row in the input data file, with two fields, dept and names:
dept,names
Mathematics,[foo,bar,alice,bob]
Here, 'names' is an array of strings and I want to load it as an array of strings in Athena.
Any suggestion?

To have a valid CSV file, make sure you put quotes around your array:
Mathematics,"[foo,bar,alice,bob]"
If you can remove the "[" and "]", the solution below becomes even easier: you can just split without the regex.
Better: Mathematics,"foo,bar,alice,bob"
First create a simple table from CSV with just strings:
CREATE EXTERNAL TABLE IF NOT EXISTS test.mydataset (
  `dept` string,
  `names` string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  'serialization.format' = ',',
  'field.delim' = ',',
  'quoteChar' = '"',
  'separatorChar' = ',',
  'collection.delim' = ',',
  'mapkey.delim' = ':'
) LOCATION 's3://<your location>'
TBLPROPERTIES ('has_encrypted_data'='false')
Then create a view which uses a regex to remove your '[' and ']' characters, then splits the rest by ',' into an array.
CREATE OR REPLACE VIEW mydataview AS
SELECT dept,
  split(regexp_extract(names, '^\[(.*)\]$', 1), ',') AS names
FROM mydataset
Then use the view for your queries. I am not 100% sure as I've only spent like 12 hours using Athena.
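For example, to flatten the array back out into one row per name (a quick sketch; mydataview is the view defined above):
SELECT dept, name
FROM mydataview
CROSS JOIN UNNEST(names) AS t(name)
And if you removed the '[' and ']' from the data as suggested earlier, the view body reduces to split(names, ',') with no regex at all.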
--
Note that in order to use the quotes, you need to use OpenCSVSerde; the 'lazyserde' won't work as it does not support quotes. lazyserde DOES support internal arrays, but you can't use ',' as the separator in that case. If you want to try that, your data would look like:
Better: Mathematics,foo|bar|alice|bob
In that case this MIGHT work directly:
CREATE EXTERNAL TABLE IF NOT EXISTS test.mydataset (
  `dept` string,
  `names` array<string>
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
  'serialization.format' = ',',
  'field.delim' = ',',
  'collection.delim' = '|',
  'mapkey.delim' = ':'
) LOCATION 's3://<your location>'
TBLPROPERTIES ('has_encrypted_data'='false')
Note how collection.delim = '|', which should translate your field directly to an array.
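If it does, a quick way to inspect the array would be something like this (an untested sketch; note that Athena's query engine uses 1-based array indexing):
SELECT dept,
  names[1] AS first_name,
  cardinality(names) AS name_count
FROM test.mydataset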
Sorry I don't have time to test this, I'll be happy to update my answer if you can confirm what works. Hopefully this gets you started.

Related

How to resolve EOF in Athena? FAILED: ParseException line 1:6 missing EOF at '-' near 'cw'

My Athena query
CREATE EXTERNAL TABLE IF NOT EXISTS cwmilenko.new1 ( `ticket-id` string, `amount_stake_one` struct )
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = '1') location 's3://cw-milenko-tests/1507tick2.parquet/part-00000-bbab8f70-4758-4041-9f69-c17f21c916dac000.snappy.parquet'
Error
FAILED: ParseException line 1:6 missing EOF at '-' near 'cw'
How to fix this problem?
This works (note that the hyphenated `ticket-id` becomes `ticket_id`, and `amount_stake_one` gets a concrete type instead of a bare `struct`):
CREATE EXTERNAL TABLE parket3 (
  `amount_stake_one` double,
  `bet_event_etl_date` string,
  `ticket_id` string
) STORED AS PARQUET LOCATION 's3://myparket/' tblproperties ("parquet.compression"="SNAPPY")
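Also note that LOCATION now points at a folder rather than a single object, which is what Athena expects. A hypothetical smoke test once the table exists:
SELECT ticket_id, amount_stake_one
FROM parket3
LIMIT 10;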

Replace brackets and splitting a column into multiple rows based on a delimiter in Postgres

I have a table with a column of values separated by ';'. The data looks like this:
row_id col
1 p.[D389R;D393_W394delinsRD]
2 p.[D390R;D393_W394delinsRD]
3 p.D389R
4 p.[D370R;D393_W394delinsRD]
I would like to replace the '[]' brackets wherever they occur and fetch the text. Then I would like to split the string by ';', prepend 'p.' to each split value (if it is not already there), and create a new row for each.
The expected output is:
row_id new_col
1 p.D389R
2 p.D393_W394delinsRD
3 p.D390R
4 p.D393_W394delinsRD
5 p.D389R
6 p.D370R
7 p.D393_W394delinsRD
I have tried the query below to get the desired output.
SELECT *,
CASE
WHEN regexp_split_to_table(regexp_replace(col, '\[|\]', '', 'g'), E';') NOT LIKE 'p.[%'
THEN 'p.' || (regexp_split_to_table(regexp_replace(col, '\[|\]', '', 'g'), E';'))[1]
ELSE regexp_split_to_table(regexp_replace(col, '\[|\]', '', 'g'), E';')[2]
END AS new_col
FROM table;
Any suggestions would be really helpful.
I would first remove the constant values (p.[ and ]) from the string and then unnest it.
with clean as (
  select row_id, regexp_replace(col, '^p\.(\[){0,1}|\]$', '', 'g') as col
  from the_table
)
select row_id, 'p.' || t.c
from clean c
  cross join unnest(string_to_array(c.col, ';')) as t(c)
The CTE (with ...) isn't really necessary, but that way the unnest(...) stays readable.
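For reference, the same query inlined without the CTE:
select row_id, 'p.' || t.c
from the_table
  cross join unnest(string_to_array(regexp_replace(col, '^p\.(\[){0,1}|\]$', '', 'g'), ';')) as t(c)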

Remove repeated substring in column and only return words in between

I have the following dataframe:
Column1 Column2
0 .com<br><br>Finance<br><br><br><br><br><br><br><br><br><br><br><br> .comFinance
1 .com<br><br>Finance<br><br><br><br><br>DO<br><br><br><br><br><br><br> .comFinanceDO
2 <br><br>Finance<br><br><br>ISV<br><br>DO<br>DO Prem<br><br><br><br><br><br> FinanceISVDODO Prem
3 <br><br>Finance<br><br><br><br><br><br><br><br><br><br><br><br> Finance
4 <br><br>Finance<br><br><br>TTY<br><br><br><br><br><br><br><br><br> ConsultingTTY
I used the following line of code to get Column2:
df['Column2'] = df['Column1'].str.replace('<br>', '', regex=True)
I want to remove all instances of "<br>", so I want the column to look like this:
Column2
.com, Finance
.com, Finance, DO
Finance, ISV, DO, DO Prem
Finance
Consulting, TTY
Given the following dataframe:
Column1
.com<br><br>Finance<br><br><br><br><br><br><br><br><br><br><br><br>
.com<br><br>Finance<br><br><br><br><br>DO<br><br><br><br><br><br><br>
<br><br>Finance<br><br><br>ISV<br><br>DO<br>DO Prem<br><br><br><br><br><br>
<br><br>Finance<br><br><br><br><br><br><br><br><br><br><br><br>
<br><br>Finance<br><br><br>TTY<br><br><br><br><br><br><br><br><br>
df['Column2'] = df['Column1'].str.replace('<br>', ' ', regex=True).str.strip().replace('\\s+', ', ', regex=True) doesn't work because of sections like <br>DO Prem<br>, which will end up like DO, Prem, not DO Prem.
Split on <br> to make a list, then use a list comprehension to remove the empty '' strings.
This will preserve spaces where they're supposed to be.
Join the list values back into a string with (', ').join([...])
import pandas as pd
df['Column2'] = df['Column1'].str.split('<br>').apply(lambda x: (', ').join([y for y in x if y != '']))
# output
Column1 Column2
.com<br><br>Finance<br><br><br><br><br><br><br><br><br><br><br><br> .com, Finance
.com<br><br>Finance<br><br><br><br><br>DO<br><br><br><br><br><br><br> .com, Finance, DO
<br><br>Finance<br><br><br>ISV<br><br>DO<br>DO Prem<br><br><br><br><br><br> Finance, ISV, DO, DO Prem
<br><br>Finance<br><br><br><br><br><br><br><br><br><br><br><br> Finance
<br><br>Finance<br><br><br>TTY<br><br><br><br><br><br><br><br><br> Finance, TTY
### Replace br with space
df['Column 2'] = df['Column1'].str.replace('<br>', ' ')
### Get rid of spaces before and after the string
df['Column 2'] = df['Column 2'].str.strip()
### Replace the space with ,
df['Column 2'] = df['Column 2'].str.replace('\\s+', ',', regex=True)
As pointed out by TrentonMcKinney, his solution is better. This one doesn't handle values in Column1 that contain a space, such as DO Prem.
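If you prefer to stick with replace, a possible variant (my own sketch, not from the answers above) is to collapse each run of <br> tags into a single separator and then trim stray separators from the ends, which preserves inner spaces like the one in DO Prem:
# collapse one or more consecutive <br> tags into ', ', then strip
# any leading/trailing ', ' left over from tags at the string edges
df['Column2'] = (df['Column1']
                 .str.replace(r'(?:<br>)+', ', ', regex=True)
                 .str.strip(', '))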

AWS Glue pushdown predicate not working properly

I'm trying to optimize my Glue/PySpark job by using push down predicates.
from datetime import date
from pyspark.sql import Row

start = date(2019, 2, 13)
end = date(2019, 2, 27)
print(">>> Generate data frame for ", start, " to ", end, "... ")
relaventDatesDf = spark.createDataFrame([
Row(start=start, stop=end)
])
relaventDatesDf.createOrReplaceTempView("relaventDates")
relaventDatesDf = spark.sql("SELECT explode(generate_date_series(start, stop)) AS querydatetime FROM relaventDates")
relaventDatesDf.createOrReplaceTempView("relaventDates")
print("===LOG:Dates===")
relaventDatesDf.show()
flightsGDF = glueContext.create_dynamic_frame.from_catalog(database = "xxx", table_name = "flights", transformation_ctx="flights", push_down_predicate="""
querydatetime BETWEEN '%s' AND '%s'
AND querydestinationplace IN (%s)
""" % (start.strftime("%Y-%m-%d"), today.strftime("%Y-%m-%d"), ",".join(map(lambda s: str(s), arr))))
However, it appears that Glue still attempts to read data outside the specified date range:
INFO S3NativeFileSystem: Opening 's3://.../flights/querydestinationplace=12191/querydatetime=2019-03-01/part-00045-6cdebbb1-562c-43fa-915d-93b125aeee61.c000.snappy.parquet' for reading
INFO FileScanRDD: Reading File path: s3://.../flights/querydestinationplace=12191/querydatetime=2019-03-10/part-00021-34a13146-8fb2-43de-9df2-d8925cbe472d.c000.snappy.parquet, range: 0-11797922, partition values: [12191,17965]
WARN S3AbortableInputStream: Not all bytes were read from the S3ObjectInputStream, aborting HTTP connection. This is likely an error and may result in sub-optimal behavior. Request only the bytes you need via a ranged GET or drain the input stream after use.
INFO S3NativeFileSystem: Opening 's3://.../flights/querydestinationplace=12191/querydatetime=2019-03-10/part-00021-34a13146-8fb2-43de-9df2-d8925cbe472d.c000.snappy.parquet' for reading
WARN S3AbortableInputStream: Not all bytes were read from the S3ObjectInputStream, aborting HTTP connection. This is likely an error and may result in sub-optimal behavior. Request only the bytes you need via a ranged GET or drain the input stream after use.
Notice that querydatetime=2019-03-01 and querydatetime=2019-03-10 are outside the specified range of 2019-02-13 to 2019-02-27. Is that why the next line says "aborting HTTP connection"? It goes on to say "This is likely an error and may result in sub-optimal behavior". Is something wrong?
I wonder if the problem is that BETWEEN or IN is not supported inside the predicate?
The table's CREATE DDL:
CREATE EXTERNAL TABLE `flights`(
`id` string,
`querytaskid` string,
`queryoriginplace` string,
`queryoutbounddate` string,
`queryinbounddate` string,
`querycabinclass` string,
`querycurrency` string,
`agent` string,
`quoteageinminutes` string,
`price` string,
`outboundlegid` string,
`inboundlegid` string,
`outdeparture` string,
`outarrival` string,
`outduration` string,
`outjourneymode` string,
`outstops` string,
`outcarriers` string,
`outoperatingcarriers` string,
`numberoutstops` string,
`numberoutcarriers` string,
`numberoutoperatingcarriers` string,
`indeparture` string,
`inarrival` string,
`induration` string,
`injourneymode` string,
`instops` string,
`incarriers` string,
`inoperatingcarriers` string,
`numberinstops` string,
`numberincarriers` string,
`numberinoperatingcarriers` string)
PARTITIONED BY (
`querydestinationplace` string,
`querydatetime` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
's3://pinfare-glue/flights/'
TBLPROPERTIES (
'CrawlerSchemaDeserializerVersion'='1.0',
'CrawlerSchemaSerializerVersion'='1.0',
'UPDATED_BY_CRAWLER'='pinfare-parquet',
'averageRecordSize'='19',
'classification'='parquet',
'compressionType'='none',
'objectCount'='623609',
'recordCount'='4368434222',
'sizeKey'='86509997099',
'typeOfData'='file')
One of the issues I can see with the code is that you are using "today" instead of "end" in the BETWEEN clause. Though I don't see the today variable declared anywhere in your code, I am assuming it has been initialized with today's date. In that case the range will be different, and the partitions being read by the Glue Spark job are actually correct.
In order to push down your condition, you need to change the order of the columns in the PARTITIONED BY clause of your table definition. A condition with an IN predicate on the first partition column cannot be pushed down the way you are expecting; see the sketch below.
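For illustration, the reordered partition clause might look like this (hypothetical; note that reordering partition columns also implies the matching S3 path layout, i.e. .../querydatetime=.../querydestinationplace=...):
PARTITIONED BY (
  `querydatetime` string,
  `querydestinationplace` string)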
Let me know if it helps.
Pushdown predicates in a Glue DynamicFrame work fine with BETWEEN as well as with an IN clause, as long as you have the correct sequence of partition columns defined in the table definition and in the query.
I have a table with three levels of partitions:
s3://bucket/flights/year=2018/month=01/day=01 -> 50 records
s3://bucket/flights/year=2018/month=02/day=02 -> 40 records
s3://bucket/flights/year=2018/month=03/day=03 -> 30 records
Read the data into a DynamicFrame:
ds = glueContext.create_dynamic_frame.from_catalog(
database = "abc",table_name = "pqr", transformation_ctx = "flights",
push_down_predicate = "(year == '2018' and month between '02' and '03' and day in ('03'))"
)
ds.count()
Output:
30 records
So you will get the correct results if the sequence of columns is correctly specified. Also note that you need to quote the values in the IN clause, i.e. IN ('%s').
Partition columns in table:
querydestinationplace string,
querydatetime string
Data read in DynamicFrame:
flightsGDF = glueContext.create_dynamic_frame.from_catalog(database = "xxx", table_name = "flights", transformation_ctx="flights",
push_down_predicate=
"""querydestinationplace IN ('%s') AND
querydatetime BETWEEN '%s' AND '%s'
"""
%
( ",".join(map(lambda s: str(s), arr)),
start.strftime("%Y-%m-%d"), today.strftime("%Y-%m-%d")))
Try defining start and end as strings, like this:
start = str(date(2019, 2, 13))
end = str(date(2019, 2, 27))
# Set your push_down_predicate variable
pd_predicate = "querydatetime >= '" + start + "' and querydatetime < '" + end + "'"
#pd_predicate = "querydatetime between '" + start + "' AND '" + end + "'" # Or this one?
flightsGDF = glueContext.create_dynamic_frame.from_catalog(
database = "xxx"
, table_name = "flights"
, transformation_ctx="flights"
, push_down_predicate=pd_predicate)
The pd_predicate will be a string that will work as a push_down_predicate.
Here is a nice read about it if you like.
https://aws.amazon.com/blogs/big-data/work-with-partitioned-data-in-aws-glue/

How does Hive handle regular expressions with < and > symbols?

I am working on Hive and have run into the following problem.
sample.log: <ABC>
CREATE TABLE sample4 (num1 STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "<.*>",
  "output.format.string" = "%1$s"
) STORED AS TEXTFILE;
LOAD DATA LOCAL INPATH "../hive-0.9.0/sample.log" INTO TABLE sample4;
select * from sample4;
NULL
Expected output: ABC
Why does RegexSerDe not work with the regular expression <.*>?
How is it possible to remove the < and > symbols using a regular expression? Can you please provide a solution for this?
Try this:
hive> CREATE TABLE s (num1 STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "(<.*>)",
  "output.format.string" = "%1$s"
) STORED AS TEXTFILE;
Mind the parentheses around the regex.
You are getting a NULL value because you didn't include parentheses (a capturing group) in the regex definition. If you don't want the angle brackets included in the output, you need to put them outside the parentheses; only the part inside the parentheses is returned as output.
CREATE TABLE sample4 (num1 string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "<(.*)>",
  "output.format.string" = "%1$s"
)
STORED AS TEXTFILE;
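A quick check, reusing the LOAD and SELECT statements from the question (the sample row should now come back without the angle brackets):
LOAD DATA LOCAL INPATH '../hive-0.9.0/sample.log' INTO TABLE sample4;
SELECT * FROM sample4;
-- returns: ABC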