I need to create a table with a specific condition that can be updated when the bucket is updated. This is an example:
CREATE EXTERNAL TABLE IF NOT EXISTS `default`.`cards-test` (
`id` bigint,
`created_at` timestamp,
`type` string,
`account_id` bigint,
`last_4_digits` string,
`is_active` boolean,
`status` string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION 's3://something/cards-bucket/'
TBLPROPERTIES ('classification' = 'parquet');
Now, suppose I want to restrict the table with a clause such as WHERE type = 'type_1'. Can I add it to this statement? If so, where?
If not, how should I create a table with such a condition from the data in the bucket?
No. As the docs show, the CREATE TABLE syntax has no option for filtering the data.
What you can do is create another table with the CREATE TABLE AS syntax and apply the filter there:
CREATE TABLE cards_test_type_1 WITH (
...
) AS
SELECT
*
FROM
"cards-test"
WHERE type = 'type_1'
Or create a view:
Creates a new view from a specified SELECT query. The view is a logical table that can be referenced by future queries. Views do not contain any data and do not write data. Instead, the query specified by the view runs each time you reference the view by another query.
CREATE VIEW cards_test_type_1 AS
SELECT
*
FROM
"cards-test"
WHERE type = 'type_1'
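The practical difference between the two options is freshness: a CTAS table is a snapshot written once, while a view re-runs its query on every reference and so picks up files added to the bucket later. Here is a minimal sketch of both, run as separate statements. The names cards_type_1_snapshot and cards_type_1_live and the external_location prefix are hypothetical, and the WITH properties shown are only one possible choice:

```sql
-- Snapshot: materializes the matching rows once as new Parquet files.
-- The external_location is a hypothetical, empty S3 prefix.
CREATE TABLE cards_type_1_snapshot
WITH (
  format = 'PARQUET',
  external_location = 's3://something/cards-type-1-snapshot/'
) AS
SELECT *
FROM "cards-test"
WHERE type = 'type_1';

-- Live: stores no data; the filter is re-evaluated on every query.
CREATE OR REPLACE VIEW cards_type_1_live AS
SELECT *
FROM "cards-test"
WHERE type = 'type_1';
```

If the result has to follow the bucket as it changes, the view is likely the better fit; the snapshot would need to be rebuilt to see new data.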
I used this article to read my VPC flow logs, and everything worked correctly.
https://aws.amazon.com/blogs/big-data/optimize-performance-and-reduce-costs-for-network-analytics-with-vpc-flow-logs-in-apache-parquet-format/
However, when I follow the official documentation instead and run its CREATE TABLE statement, queries against the table do not return any records.
CREATE EXTERNAL TABLE IF NOT EXISTS vpc_flow_logs (
`version` int,
`account_id` string,
`interface_id` string,
`srcaddr` string,
`dstaddr` string,
`srcport` int,
`dstport` int,
`protocol` bigint,
`packets` bigint,
`bytes` bigint,
`start` bigint,
`end` bigint,
`action` string,
`log_status` string,
`vpc_id` string,
`subnet_id` string,
`instance_id` string,
`tcp_flags` int,
`type` string,
`pkt_srcaddr` string,
`pkt_dstaddr` string,
`region` string,
`az_id` string,
`sublocation_type` string,
`sublocation_id` string,
`pkt_src_aws_service` string,
`pkt_dst_aws_service` string,
`flow_direction` string,
`traffic_path` int
)
PARTITIONED BY (
`aws-account-id` string,
`aws-service` string,
`aws-region` string,
`year` string,
`month` string,
`day` string,
`hour` string
)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
's3://DOC-EXAMPLE-BUCKET/prefix/AWSLogs/aws-account-id={account_id}/aws-service=vpcflowlogs/aws-region={region_code}/'
TBLPROPERTIES (
'EXTERNAL'='true',
'skip.header.line.count'='1'
)
official doc:
https://docs.aws.amazon.com/athena/latest/ug/vpc-flow-logs.html
This CREATE TABLE statement should work after replacing the placeholders DOC-EXAMPLE-BUCKET/prefix, account_id, and region_code. Why does a SELECT * query return 0 rows?
You need to load the partitions before you can query them.
From the docs:
After you create the table, you load the data in the partitions for querying. For Hive-compatible data, you run MSCK REPAIR TABLE. For non-Hive compatible data, you use ALTER TABLE ADD PARTITION to add the partitions manually.
So if your structure is Hive-compatible, you can just run:
MSCK REPAIR TABLE `table name`;
And this will load all your new partitions.
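To confirm that the repair actually registered something, you can list the partitions known to the catalog. This is a small sketch that assumes the table is the vpc_flow_logs table from the question; an empty listing suggests the S3 layout is not Hive-compatible:

```sql
-- Register any Hive-style (key=value) prefixes found under the table location.
MSCK REPAIR TABLE vpc_flow_logs;

-- List the partitions the catalog now knows about.
SHOW PARTITIONS vpc_flow_logs;
```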
Otherwise you'll have to add them manually using ADD PARTITION:
ALTER TABLE test ADD PARTITION (`aws-account-id`='1', `aws-service`='2' ...) LOCATION 's3://bucket/subfolder/data/accountid1/service2/'
Because manually adding partitions is tedious when your data structure is not Hive-compatible, I recommend using partition projection for your table.
To avoid having to manage partitions, you can use partition projection. Partition projection is an option for highly partitioned tables whose structure is known in advance. In partition projection, partition values and locations are calculated from table properties that you configure rather than read from a metadata repository. Because the in-memory calculations are faster than remote look-up, the use of partition projection can significantly reduce query runtimes.
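As a rough sketch of what that configuration looks like, the statement below enables projection on a hypothetical table example_logs that is partitioned by a single string column named day. The table name, date range, and S3 path are assumptions for illustration, not values taken from the flow-logs setup above:

```sql
-- Hypothetical table: example_logs, PARTITIONED BY (`day` string),
-- with objects stored under s3://DOC-EXAMPLE-BUCKET/prefix/yyyy/MM/dd/.
-- `day` is treated as a date whose values are computed from the range,
-- and storage.location.template maps each value to its S3 prefix.
ALTER TABLE example_logs SET TBLPROPERTIES (
  'projection.enabled' = 'true',
  'projection.day.type' = 'date',
  'projection.day.format' = 'yyyy/MM/dd',
  'projection.day.range' = '2021/01/01,NOW',
  'projection.day.interval' = '1',
  'projection.day.interval.unit' = 'DAYS',
  'storage.location.template' = 's3://DOC-EXAMPLE-BUCKET/prefix/${day}'
);
```

With projection enabled, neither MSCK REPAIR TABLE nor ALTER TABLE ADD PARTITION should be needed for that table.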
I've seen other questions where the query returns no results. That is not what is happening here: the query itself returns empty strings/results.
I have an 81.7 MB JSON file in my input bucket (input-data/test_data). I've set up the data source as JSON.
However, when I execute SELECT * FROM test_table; it shows (in green) that the data has been scanned, the query was successful, and there are results, but they are neither saved to the output bucket nor displayed in the GUI.
I'm not sure what I've done wrong in the setup.
This is my table creation:
CREATE EXTERNAL TABLE IF NOT EXISTS `test_db`.`test_data` (
`tbl_timestamp` timestamp,
`colmn1` string,
`colmn2` string,
`colmn3` string
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = '1'
) LOCATION 's3://input-data/test_data/'
TBLPROPERTIES ('has_encrypted_data'='false',
'skip.header.line.count'='1');
I resolved this issue. The labels of the table (i.e. the column names) need to be the same as the labels (keys) in the file itself. Simple really!
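To illustrate, here is a sketch with a hypothetical input record. The sample JSON line and its key names are assumptions, since the actual file is not shown in the question; the point is only that each column name equals a key in the data:

```sql
-- Hypothetical line in s3://input-data/test_data/:
--   {"tbl_timestamp": "2021-01-01 00:00:00", "colmn1": "a", "colmn2": "b", "colmn3": "c"}
-- The column names below match the JSON keys, so the SerDe can fill them.
-- A column whose name matches no key comes back empty.
CREATE EXTERNAL TABLE IF NOT EXISTS `test_db`.`test_data` (
  `tbl_timestamp` timestamp,
  `colmn1` string,
  `colmn2` string,
  `colmn3` string
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://input-data/test_data/';
```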
I am new to Athena and would appreciate some help.
I have multiple CSV files in the following format. Please note that all fields are in double quotes, and the total file size is about 5 GB. If possible, I would rather do this without Glue, unless there is a reason to spend money on running the crawlers.
"emailusername.string()","emaildomain.string()","name.string()","details.string()"
"myname1","website1.com","fullname1","address1 n details"
"myname2","website2.com","fullname2","address2 n details"
The following code on Athena works perfectly:
CREATE EXTERNAL TABLE IF NOT EXISTS db1.tablea (
`emailusername` string,
`emaildomain` string,
`name` string,
`details` string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES ("separatorChar" = ",", "escapeChar" = "\\")
LOCATION 's3://projectzzzz2/0001_aaaa_delme/'
TBLPROPERTIES ('has_encrypted_data'='false');
However, I am able to use neither clustering nor partitioning. The following code runs successfully, and after that I am also able to load partitions successfully. But no data is returned!
CREATE EXTERNAL TABLE IF NOT EXISTS db1.tablea (
`name` string,
`details` string
)
PARTITIONED BY (emaildomain string, emailusername string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES ("separatorChar" = ",", "escapeChar" = "\\")
LOCATION 's3://projectzzzz2/0001_aaaa_delme/'
TBLPROPERTIES ('has_encrypted_data'='false');
MSCK REPAIR TABLE tablea;
SELECT * FROM "db1"."tablea";
Result: Zero records returned
If your intention is to create partitions on emaildomain and emailusername, you don't need fields called emaildomain and emailusername in the table itself. However, you need two levels of directories, such as domain1/user1, under your S3 location,
e.g. s3://projectzzzz2/0001_aaaa_delme/domain1/user1
Make sure you copy your file to s3://projectzzzz2/0001_aaaa_delme/domain1/user1 (not to the top-level location s3://projectzzzz2/0001_aaaa_delme), because a partition only returns the data stored under its own location.
Then you can issue:
ALTER TABLE tablea ADD PARTITION (emaildomain = 'domain1', emailusername = 'user1') LOCATION 's3://projectzzzz2/0001_aaaa_delme/domain1/user1';
If you query tablea, you will see that new fields called emaildomain and emailusername have been added automatically.
As far as I know, whenever you add a new user or a new email domain, you need to copy your file into the new folder and issue the corresponding ALTER TABLE statement.
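If you control the upload paths, a Hive-style layout avoids the per-partition ALTER TABLE step, because MSCK REPAIR TABLE can then discover the partitions on its own. This sketch assumes the files are re-uploaded under key=value prefixes and that each file then holds only the name and details columns; the prefixes and file names are hypothetical:

```sql
-- Assumed S3 layout (hypothetical prefixes, key=value style):
--   s3://projectzzzz2/0001_aaaa_delme/emaildomain=website1.com/emailusername=myname1/file.csv
--   s3://projectzzzz2/0001_aaaa_delme/emaildomain=website2.com/emailusername=myname2/file.csv

-- Picks up every emaildomain=.../emailusername=... prefix under the table location.
MSCK REPAIR TABLE tablea;

-- Partition columns can then be used as filters like any other column.
SELECT name, details
FROM db1.tablea
WHERE emaildomain = 'website1.com';
```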
I have created a table by mapping the JSON data, but unfortunately I am not able to read the nested array within the JSON.
{
"total":10,
"count":100,
"values":{
"source":[{"sourceid":"10001","source":"ABC"},
{"sourceid":"10002","source":"XYZ"}
]}
}
Athena table:
CREATE EXTERNAL TABLE source_master_data(
total bigint,
count bigint,
values struct<source: array<struct<sourceid: string>>>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://sourcemaster/'
I am trying to read sourceid and source, but no luck. Can anyone help me out?
select t1.source.sourceid
from source_master_data
cross join UNNEST(source_master_data.Values) AS t1
UNNEST needs to be applied to the array type. In your query, you are trying to unnest the struct, which is not possible in Athena.
The second issue is the use of values without quotes. This also fails because values is a reserved word in Athena.
The overall query would look something like this.
select t1.source.sourceid
from source_master_data
cross join UNNEST(source_master_data."values".source) AS t1 (source)
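To return both sourceid and source, as the question asks, the inner struct has to declare both fields. This sketch assumes the values column is redefined as struct<source: array<struct<sourceid: string, source: string>>>; the alias names t and item are arbitrary:

```sql
-- Each array element becomes one row; item is a struct with two fields.
SELECT item.sourceid,
       item.source
FROM source_master_data
CROSS JOIN UNNEST(source_master_data."values".source) AS t (item);
```

For the sample document above, this should produce one row per element of the source array.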
I have a table that tracks user actions on a high-throughput site that is defined as (irrelevant fields, etc removed):
CREATE EXTERNAL TABLE `actions`(
`uuid` string COMMENT 'from deserializer',
`action` string COMMENT 'from deserializer',
`user` struct<id:int,username:string,country:string,created_at:string> COMMENT 'from deserializer')
PARTITIONED BY (
`ingestdatetime` string)
ROW FORMAT SERDE
'org.openx.data.jsonserde.JsonSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
's3://<path_to_bucket>'
TBLPROPERTIES (
'transient_lastDdlTime'='1506104792')
I want to add some more fields to the user data (e.g. level:int to track what level the user was at when they performed the action).
Is it possible to alter the table definition to include these new properties, and if so, is it possible to configure default values in the event that they aren't in the source data files?
No, you can't add a new field to a struct in Athena.
You can delete the table definition and then create a new table with the required columns.
Deleting the table or database won't affect your data, because Athena doesn't store the data itself; it just points to data in S3.
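Here is a sketch of that drop-and-recreate approach for the actions table, with level added to the struct. It assumes the rest of the definition stays as in the question. As far as I know there is no way to declare a default value in the DDL: records whose JSON has no level key are read as NULL, so any fallback has to be applied at query time, here with COALESCE and an arbitrary fallback of 0:

```sql
-- Removes only the metadata; the files in S3 are untouched.
DROP TABLE IF EXISTS actions;

CREATE EXTERNAL TABLE actions (
  `uuid` string,
  `action` string,
  `user` struct<id:int,username:string,country:string,created_at:string,level:int>
)
PARTITIONED BY (`ingestdatetime` string)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://<path_to_bucket>';

-- Partition metadata was dropped with the table, so it must be reloaded
-- (MSCK REPAIR TABLE for Hive-style layouts, otherwise ALTER TABLE ADD PARTITION).
MSCK REPAIR TABLE actions;

-- Older records have no level key and come back as NULL.
SELECT uuid,
       COALESCE("user".level, 0) AS level
FROM actions;
```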
Athena's ALTER TABLE ADD COLUMNS supports adding field(s) to a struct-type column.
Let's say you have a table actions with a struct-type column called user, and you want to add an int field level to user. You can do the following:
ALTER TABLE actions ADD COLUMNS (user.level:int)