I have the following query used on one of my datasets in Athena.
CREATE TABLE clean_table
WITH (format='Parquet', external_location='s3://test123data') AS
SELECT npi,
       provider_last_name,
       provider_first_name,
       CASE
           WHEN REPLACE(num_entitlement_medicare_medicaid, ',', '') = '' THEN null
           ELSE CAST(REPLACE(num_entitlement_medicare_medicaid, ',', '') AS DECIMAL)
       END AS medicare_medicaid_entitlement,
       CASE
           WHEN REPLACE(total_submitted_charge_amount, ',', '') = '' THEN null
           ELSE CAST(REPLACE(total_submitted_charge_amount, ',', '') AS DECIMAL)
       END AS total_submitted_charge_amount
FROM cmsaggregatepayment2017
Unfortunately, after I run this query I get the following error:
GENERIC_INTERNAL_ERROR: Path is not absolute: s3://test123data. You may need to manually clean the data at location 's3://aws-athena-query-results-785609744360-us-east-1/Unsaved/2019/12/15/tables/03d3cedf-0101-43cb-91fd-cc8070db0e37' before retrying. Athena will not delete data in your account.
Can someone walk me through how to handle this?
What do I have to do on the bucket since it is empty?
Thanks in advance!
It appears that this message is referring to the Query result location in which Athena automatically stores the output of your queries.
This is useful for running queries on the results of queries, or for simply having a copy of the query output.
See: Working with Query Results, Output Files, and Query History - Amazon Athena
You can specify a new output location by clicking the settings link in the Athena console and then providing a Query result location path, such as: s3://my-bucket/athena-output/
I'm not sure what is causing your specific error, but make sure you append a trailing / to the location. You might also want to create a new bucket for that output.
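Applying that trailing-slash advice to the original CTAS, one common fix for "Path is not absolute" looks like this. A minimal sketch; the clean_table/ key prefix is hypothetical, not from the original post:
CREATE TABLE clean_table
-- trailing slash plus a key prefix under the bucket, rather than the bare bucket URI
WITH (format = 'Parquet', external_location = 's3://test123data/clean_table/') AS
SELECT npi, provider_last_name, provider_first_name
FROM cmsaggregatepayment2017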
Sorry, I'm including only minimal code here.
The issue occurs after relationalizing the dynamic frame, as in the example below:
df = testdfr.relationalize(...)
After relationalizing, the df schema has the keys:
{"root", "root_order", "root_order.item"}
itemdf = df.select("root_order.item") is not selecting anything; when I print the schema or do toDF().show():
Output:
root
—Empty table—
But it works if I select the root or root_order key.
Any suggestion on how to select a key when the key contains a period?
I also tried enclosing the key in backticks, both around the whole key and around parts of it, but the same empty result (null values) comes back.
I have two S3 buckets that I am looking to join on Athena. In the first bucket, I have an email address in a CSV file with an email column.
sample#email.com
In the other bucket, I have a JSON file with nested email addresses used by the client. The way this has been set up in Glue means the data looks like this:
[sample#email.com;email#sample.com;com#email.sample]
I am trying to join the data by finding the email from the first bucket inside of the string from the second bucket. I have tried:
REGEXP_LIKE(lower("emailaddress"), lower("emails"))
with no success. I have also tried:
select "source".*, "target".*
FROM "source"
inner join "target"
on "membername" = "first_name"
and "memberlastname" = "last_name"
and '%'||lower("emailaddress")||'%' like lower("emails")
with no success. I am doing something wrong, and where I am making the error is evading me.
It seems you need to reverse your like arguments:
-- sample data
WITH dataset (id, email) AS (
VALUES (1,'sample#email.com'),
(2,'non-present#email.com')
),
dataset2 (emails) as (
VALUES ('[sample#email.com;email#sample.com;com#email.sample]')
)
-- query
SELECT *
FROM dataset
INNER JOIN dataset2 on
lower(emails) like '%' || lower(email) || '%'
Output:

id | email            | emails
1  | sample#email.com | [sample#email.com;email#sample.com;com#email.sample]
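If the bracketed list is always delimited with semicolons, an exact-match join is also possible by splitting the string into an array. A sketch assuming the same sample data, using Presto's regexp_replace, split, and contains functions:
SELECT *
FROM dataset
INNER JOIN dataset2 ON
  contains(
    -- strip the surrounding brackets, then split on ';'
    split(regexp_replace(lower(emails), '[\[\]]', ''), ';'),
    lower(email)
  )
This avoids false positives where one address happens to be a substring of a longer one.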
In AWS Athena I want to filter logs between certain times. I need to add a check on the time column to the WHERE clause. I tried to find out how to do this, but I cannot find any examples.
I need something like this:
SELECT distinct(request_url) FROM "mylogs"."alb_logs"
where request_url like '%app%' and time >= date('2019-01-01')
order by request_url
You first need to parse the time using parse_datetime. Afterwards, you can use comparison functions.
SELECT distinct(request_url) FROM "mylogs"."alb_logs"
WHERE parse_datetime(time,'yyyy-MM-dd''T''HH:mm:ss.SSSSSS''Z')
> parse_datetime('2019-01-01-00:00:00','yyyy-MM-dd-HH:mm:ss')
AND request_url like '%app%'
order by request_url
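Since the ALB time column is in ISO 8601 format, Presto's from_iso8601_timestamp should also work and is a bit shorter. A sketch assuming the same table, not tested against real ALB logs:
SELECT DISTINCT request_url
FROM "mylogs"."alb_logs"
-- from_iso8601_timestamp returns a timestamp with time zone, so compare against a zoned literal
WHERE from_iso8601_timestamp(time) >= timestamp '2019-01-01 00:00:00 UTC'
  AND request_url LIKE '%app%'
ORDER BY request_url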
I ran a simple query using the Athena dashboard on data in CSV format. The result was a CSV with column headers.
When storing the results, Athena includes the column headers in S3. How can I skip storing the header column names, since I have to make a new table from the results and the headers are repetitive?
Try "skip.header.line.count"="1", This feature has been available on AWS Athena since 2018-01-19, here's a sample:
CREATE EXTERNAL TABLE IF NOT EXISTS tableName (
`field1` string,
`field2` string,
`field3` string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
'separatorChar' = ',',
'quoteChar' = '\"',
'escapeChar' = '\\'
)
LOCATION 's3://fileLocation/'
TBLPROPERTIES ('skip.header.line.count'='1')
You can refer to this question:
Aws Athena - Create external table skipping first row
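Once the header-skipping table exists, a CTAS statement can materialize a new table with no header rows at all, since columnar formats such as Parquet have no CSV-style headers. A sketch; new_table and the no-headers/ output prefix are hypothetical:
CREATE TABLE new_table
WITH (format = 'PARQUET', external_location = 's3://fileLocation/no-headers/') AS
SELECT * FROM tableName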
From an Eric Hammond post on AWS Forums:
...
WHERE
date NOT LIKE '#%'
...
I found this works! The steps I took:
Ran an Athena query, with the output going to Amazon S3
Created a new table pointing to this output based on How do I use the results of my Amazon Athena query in another query?, changing the path to the correct S3 location
Ran a query on the new table with the above WHERE <datefield> NOT LIKE '#%'
However, subsequent queries store even more data in that S3 directory, which confuses later executions.
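One way to avoid that pile-up is to copy the single output CSV into its own prefix and point the results table there, so later query runs cannot add files to it. A sketch mirroring the earlier OpenCSVSerde example; the table name, column, and location are all hypothetical:
CREATE EXTERNAL TABLE IF NOT EXISTS query_results (
  `datefield` string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
-- a dedicated prefix holding only the one copied output file
LOCATION 's3://my-bucket/results-table/'
TBLPROPERTIES ('skip.header.line.count'='1')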
So last week I began streaming my App Engine logs into BigQuery, and I am now attempting to pull some data out of the log entries into a table.
The data in protoPayload.resource is the page requested, with the query-string parameters included.
The contents of protoPayload.resource looks like the following examples:
/service.html?device_ID=123456
/service.html?v=2&device_ID=78ec9b4a56
I am getting close, but when there is another parameter before device_ID, I am not getting it. As you can see, I am not great with regex, but it is the only way I can think of to parse the data in the query. To get just the device ID from the first example, I was able to use the following query, which works great. My next challenge is to extract the ID when a second parameter exists. The device IDs can vary in length from about 10 to 26 characters.
SELECT
RIGHT(Regexp_extract(protoPayload.resource,r'[\?&]([^&]+)'),
length(Regexp_extract(protoPayload.resource,r'[\?&]([^&]+)'))-10) as Device_ID
FROM logs
What I would like is just the values from the querystring device_ID such as:
123456
78ec9b4a56
Assuming you have just one query string per record, you can do this:
SELECT REGEXP_EXTRACT(protoPayload.resource, r'device_ID=(.*)$') as device_id FROM mytable
The part within the parentheses will be captured and returned in the result.
If device_ID isn't guaranteed to be the last parameter in the string, then use something like this:
SELECT REGEXP_EXTRACT(protoPayload.resource, r'device_ID=([^\&]*)') as device_id FROM mytable
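To see the difference between the two patterns, here's a quick standard-SQL sanity check (the &x=1 trailing parameter is hypothetical, not from the original logs):
SELECT
  REGEXP_EXTRACT('/service.html?v=2&device_ID=78ec9b4a56&x=1', r'device_ID=(.*)$') AS greedy,    -- 78ec9b4a56&x=1
  REGEXP_EXTRACT('/service.html?v=2&device_ID=78ec9b4a56&x=1', r'device_ID=([^\&]*)') AS bounded -- 78ec9b4a56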
One approach is to split protoPayload.resource into multiple service entries and then apply the regexp. This way it will support an arbitrary number of device_IDs, i.e.
select regexp_extract(service_entry, r'device_ID=(.*)$') as device_id
from (
  select split(protoPayload.resource, ' ') as service_entry
  from (
    select '/service.html?device_ID=123456 /service.html?v=2&device_ID=78ec9b4a56'
      as protoPayload.resource
  )
)