Select logs after time in AWS Athena

In AWS Athena I want to filter logs to those after a certain time, so I need to add a check on the time column to the WHERE clause. I tried to find out how to do this, but I cannot find any examples.
I need something like this:
SELECT distinct(request_url) FROM "mylogs"."alb_logs"
where request_url like '%app%' and time >= date('2019-01-01')
order by request_url

You first need to parse the time using parse_datetime. Afterwards, you can use ordinary comparison operators.
SELECT DISTINCT request_url FROM "mylogs"."alb_logs"
WHERE parse_datetime(time,'yyyy-MM-dd''T''HH:mm:ss.SSSSSS''Z')
> parse_datetime('2019-01-01-00:00:00','yyyy-MM-dd-HH:mm:ss')
AND request_url LIKE '%app%'
ORDER BY request_url
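If your time column is in ISO 8601 format (as ALB access log timestamps are), a shorter alternative may be from_iso8601_timestamp, which parses that format directly. A minimal sketch, assuming the same table and column names as above:
-- from_iso8601_timestamp returns a timestamp with time zone on both sides,
-- so the comparison is well-typed
SELECT DISTINCT request_url FROM "mylogs"."alb_logs"
WHERE from_iso8601_timestamp(time) >= from_iso8601_timestamp('2019-01-01T00:00:00Z')
AND request_url LIKE '%app%'
ORDER BY request_url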

Related

Column does not exist AWS Timestream Query error

I am trying to apply a WHERE clause on a DIMENSION of the AWS Timestream records. However, I get the error: Column does not exist.
Here is my table schema: [screenshots of the table schema and table measures]
First, here is all the sample data I put in the table:
SELECT username, time, manual_usage
FROM "meter-reading"."meter-metrics"
ORDER BY time DESC
LIMIT 4
The result: [screenshot of the query results]
What I want to do is query and filter the records by Dimension ("username" specifically).
SELECT *
FROM "meter-reading"."meter-metrics"
WHERE username = "OnceADay"
ORDER BY time DESC LIMIT 10
Then I got the Error: Column 'OnceADay' does not exist
I searched for any quotas on Dimension names and checked my schema for errors:
https://docs.aws.amazon.com/timestream/latest/developerguide/ts-limits.html#limits.naming
https://docs.aws.amazon.com/timestream/latest/developerguide/ts-limits.html#limits.system_identifier
But I didn't find that my "username" dimension violates any of the above rules.
I also checked other example queries from an AWS blog post, where the author uses the WHERE clause for a Dimension filter normally:
https://aws.amazon.com/blogs/database/effective-queries-for-common-query-patterns-in-amazon-timestream/
I figured it out after trying the sample code. It turns out it was a silly mistake, I believe.
Using single quotes (') instead of double quotation marks (") solved my problem.
SELECT *
FROM "meter-reading"."meter-metrics"
WHERE username = 'OnceADay'
ORDER BY time DESC LIMIT 10
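The distinction matters because, in standard SQL (which Timestream's query language follows here), double quotes delimit identifiers such as column names, while single quotes delimit string literals. A minimal illustration of the two forms:
-- "username" is an identifier (a column reference); 'OnceADay' is a string literal
SELECT * FROM "meter-reading"."meter-metrics" WHERE "username" = 'OnceADay'
-- whereas username = "OnceADay" asks for a column named OnceADay, hence the error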

AWS Athena - How to handle GENERIC_INTERNAL_ERROR?

I have the following query used on one of my datasets in Athena.
CREATE TABLE clean_table
WITH (format='Parquet', external_location='s3://test123data') AS
SELECT npi,
    provider_last_name,
    provider_first_name,
    CASE
        WHEN REPLACE(num_entitlement_medicare_medicaid, ',', '') = '' THEN null
        ELSE CAST(REPLACE(num_entitlement_medicare_medicaid, ',', '') AS DECIMAL)
    END AS medicare_medicaid_entitlement,
    CASE
        WHEN REPLACE(total_submitted_charge_amount, ',', '') = '' THEN null
        ELSE CAST(REPLACE(total_submitted_charge_amount, ',', '') AS DECIMAL)
    END AS total_submitted_charge_amount
FROM cmsaggregatepayment2017
Unfortunately, after I run this query, I get the following error:
GENERIC_INTERNAL_ERROR: Path is not absolute: s3://test123data. You may need to manually clean the data at location 's3://aws-athena-query-results-785609744360-us-east-1/Unsaved/2019/12/15/tables/03d3cedf-0101-43cb-91fd-cc8070db0e37' before retrying. Athena will not delete data in your account.
Can someone walk me through how to handle this?
What do I have to do on the bucket since it is empty?
Thanks in advance!
It appears that this message is referring to the Query result location in which Athena automatically stores the output of your queries.
This is useful for running queries on the results of queries, or for simply having a copy of the query output.
See: Working with Query Results, Output Files, and Query History - Amazon Athena
You can specify a new output location by clicking the settings link in the Athena console and then providing a Query result location path, such as: s3://my-bucket/athena-output/
I'm not sure what is causing your specific error, but make sure you append a trailing / to the location. You might also want to create a new bucket for that output.
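As for the CTAS statement itself, a commonly reported fix for the "Path is not absolute" message is to point external_location at a prefix (with a trailing /) rather than the bare bucket root. A sketch, where clean_table/ is a hypothetical, empty prefix in your bucket, and the SELECT list is shortened here (the CASE expressions from your original query stay the same):
CREATE TABLE clean_table
WITH (format = 'Parquet',
      -- a subfolder with a trailing slash, not just the bucket root
      external_location = 's3://test123data/clean_table/') AS
SELECT npi,
       provider_last_name,
       provider_first_name
FROM cmsaggregatepayment2017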

How to setup an AWS Athena query with multiple regex replacements?

I have been working on an AWS Athena query and have gotten far enough to pull my data. However, I need to identify some patterns in the data and rewrite them in a uniform way in order to group those "similars". So I'm trying to use regexp_replace, but how can I apply multiple replacements to the same column in a single query?
Here's my query:
with q as (SELECT t.key,
t.otherid,
t.complexString,
minute(date_trunc('minute', from_iso8601_timestamp(t.time) AT TIME ZONE 'America/New_York')) AS minute,
hour(from_iso8601_timestamp(t.time) AT TIME ZONE 'America/New_York') AS hour,
day(from_iso8601_timestamp(t.time) AT TIME ZONE 'America/New_York') AS day
FROM requests0918 t
JOIN requests0918 t1 ON t.id = t1.id
WHERE t1.msg = 'response_written' AND t1.code = '200'
and t.otherid is not null
and t.key is not null
and t.path is not null
limit 10)
Select q.key, q.otherid, REGEXP_REPLACE(q.complexString, '\/accounts\/[0-9]+\/balances', '/accounts/.../balances' ) as path, q.minute, q.hour, q.day from q
So I'm successfully changing these strings to those, but I need to apply more patterns, all replacing into the same column. I could add more layers of with q as (...) for each rule, but that sounds pretty wrong.
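Since regexp_replace returns a varchar, one straightforward option is to nest the calls, so each pattern is applied to the output of the previous one and the whole chain is aliased once. A sketch using the CTE above; the second pattern (/users/.../profile) is a made-up example of an additional rule:
Select q.key, q.otherid,
REGEXP_REPLACE(
    -- inner call applies the first rule, outer call applies the second
    REGEXP_REPLACE(q.complexString, '\/accounts\/[0-9]+\/balances', '/accounts/.../balances'),
    '\/users\/[0-9]+\/profile', '/users/.../profile'
) as path,
q.minute, q.hour, q.day
from q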

How can I check the partition list from Athena in AWS?

I want to check the partition list in Athena.
I used a query like this:
show partitions table_name
But I want to check whether a specific partition exists.
So I used the query below, but no results were returned.
show partitions table_name partition(dt='2010-03-03')
That is because dt also contains the hour:
dt='2010-03-03-01', dt='2010-03-03-02', ...
So is there any way to input '2010-03-03' and have it match '2010-03-03-01', '2010-03-03-02', and so on?
Or do I have to split the partition key like this: dt='2010-03-03', dh='01'?
Also, show partitions table_name returns only 500 rows in Hive. Is it the same in Athena?
In Athena v2:
Use this SQL:
SELECT dt
FROM db_name."table_name$partitions"
WHERE dt LIKE '2010-03-03-%'
(see the official aws docs)
In Athena v1:
There is a way to return the partition list as a result set that can be filtered using LIKE, but you need to use the internal information_schema database, like this:
SELECT partition_value
FROM information_schema.__internal_partitions__
WHERE table_schema = '<DB_NAME>'
AND table_name = '<TABLE_NAME>'
AND partition_value LIKE '2010-03-03-%'
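If you do end up splitting the key into dt and dh as you suggested, the same $partitions table can be filtered on either column. A sketch assuming that hypothetical layout:
-- one row per partition; dt and dh become separate columns
SELECT dt, dh
FROM db_name."table_name$partitions"
WHERE dt = '2010-03-03'
ORDER BY dh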

Regex QueryString Parsing for a specific parameter in BigQuery

So last week I was able to begin streaming my App Engine logs into BigQuery and am now attempting to pull some data out of the log entries into a table.
The data in protoPayload.resource is the page requested, with the query string parameters included.
The contents of protoPayload.resource looks like the following examples:
/service.html?device_ID=123456
/service.html?v=2&device_ID=78ec9b4a56
I am getting close, but when there is another parameter before device_ID, I am not capturing it. As you can see, I am not great with regex, but it is the only way I think I can parse the data in the query. To get just the device ID from the first example, I was able to use the query below, which works great. My next challenge is handling the data when a second parameter exists. The device IDs can vary in length from about 10 to 26 characters.
SELECT
RIGHT(Regexp_extract(protoPayload.resource,r'[\?&]([^&]+)'),
length(Regexp_extract(protoPayload.resource,r'[\?&]([^&]+)'))-10) as Device_ID
FROM logs
What I would like is just the values from the querystring device_ID such as:
123456
78ec9b4a56
Assuming you have just one query string per record, then you can do this:
SELECT REGEXP_EXTRACT(protoPayload.resource, r'device_ID=(.*)$') as device_id FROM mytable
The part within the parentheses will be captured and returned in the result.
If device_ID isn't guaranteed to be the last parameter in the string, then use something like this:
SELECT REGEXP_EXTRACT(protoPayload.resource, r'device_ID=([^\&]*)') as device_id FROM mytable
One approach is to split protoPayload.resource into multiple service entries and then apply the regexp; this way it will support an arbitrary number of device_IDs, i.e.:
select regexp_extract(service_entry, r'device_ID=(.*$)') from
(select split(protoPayload.resource, ' ') service_entry from
(select
'/service.html?device_ID=123456 /service.html?v=2&device_ID=78ec9b4a56'
as protoPayload.resource))
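For reference, under BigQuery standard SQL the same idea can be written with REGEXP_EXTRACT_ALL, which returns every match in one pass instead of requiring a manual split. A sketch assuming a table mytable with a STRING column named resource (names are illustrative):
SELECT device_id
FROM mytable,
-- REGEXP_EXTRACT_ALL returns an ARRAY of capture-group matches; UNNEST flattens it to rows
UNNEST(REGEXP_EXTRACT_ALL(resource, r'device_ID=([^&\s]*)')) AS device_id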