I used this article to read my vpc flow logs and everything worked correctly.
https://aws.amazon.com/blogs/big-data/optimize-performance-and-reduce-costs-for-network-analytics-with-vpc-flow-logs-in-apache-parquet-format/
But my question is this: when I follow the documentation and run the create table statement below, it does not return any records.
CREATE EXTERNAL TABLE IF NOT EXISTS vpc_flow_logs (
`version` int,
`account_id` string,
`interface_id` string,
`srcaddr` string,
`dstaddr` string,
`srcport` int,
`dstport` int,
`protocol` bigint,
`packets` bigint,
`bytes` bigint,
`start` bigint,
`end` bigint,
`action` string,
`log_status` string,
`vpc_id` string,
`subnet_id` string,
`instance_id` string,
`tcp_flags` int,
`type` string,
`pkt_srcaddr` string,
`pkt_dstaddr` string,
`region` string,
`az_id` string,
`sublocation_type` string,
`sublocation_id` string,
`pkt_src_aws_service` string,
`pkt_dst_aws_service` string,
`flow_direction` string,
`traffic_path` int
)
PARTITIONED BY (
`aws-account-id` string,
`aws-service` string,
`aws-region` string,
`year` string,
`month` string,
`day` string,
`hour` string
)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
's3://DOC-EXAMPLE-BUCKET/prefix/AWSLogs/aws-account-id={account_id}/aws-service=vpcflowlogs/aws-region={region_code}/'
TBLPROPERTIES (
'EXTERNAL'='true',
'skip.header.line.count'='1'
)
official doc:
https://docs.aws.amazon.com/athena/latest/ug/vpc-flow-logs.html
This create table statement should work after changing the variables like DOC-EXAMPLE-BUCKET/prefix, account_id, and region_code. Why am I getting 0 rows returned for a select * query?
You need to manually load the partitions before you can use them.
From the docs:
After you create the table, you load the data in the partitions for querying. For Hive-compatible data, you run MSCK REPAIR TABLE. For non-Hive compatible data, you use ALTER TABLE ADD PARTITION to add the partitions manually.
So if your structure is Hive compatible, you can just run:
MSCK REPAIR TABLE `table name`;
And this will load all your new partitions.
Otherwise you'll have to manually load them using ADD PARTITION
ALTER TABLE test ADD PARTITION (`aws-account-id`='1', `aws-service`='2' ...) LOCATION 's3://bucket/subfolder/data/accountid1/service2/'
Because manually adding partitions is tedious when your data structure is not Hive compatible, I recommend you use partition projection for your table.
To avoid having to manage partitions, you can use partition projection. Partition projection is an option for highly partitioned tables whose structure is known in advance. In partition projection, partition values and locations are calculated from table properties that you configure rather than read from a metadata repository. Because the in-memory calculations are faster than remote look-up, the use of partition projection can significantly reduce query runtimes.
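For the VPC flow logs table above, a minimal partition-projection sketch might look like the following (the account ID, Region, year range, and the bucket/prefix in the storage template are placeholders you would replace with your own values):

-- Sketch only: partition projection for the Hive-compatible layout used above
ALTER TABLE vpc_flow_logs SET TBLPROPERTIES (
  'projection.enabled' = 'true',
  'projection.aws-account-id.type' = 'enum',
  'projection.aws-account-id.values' = '123456789012',
  'projection.aws-service.type' = 'enum',
  'projection.aws-service.values' = 'vpcflowlogs',
  'projection.aws-region.type' = 'enum',
  'projection.aws-region.values' = 'us-east-1',
  'projection.year.type' = 'integer',
  'projection.year.range' = '2021,2030',
  'projection.month.type' = 'integer',
  'projection.month.range' = '1,12',
  'projection.month.digits' = '2',
  'projection.day.type' = 'integer',
  'projection.day.range' = '1,31',
  'projection.day.digits' = '2',
  'projection.hour.type' = 'integer',
  'projection.hour.range' = '0,23',
  'projection.hour.digits' = '2',
  'storage.location.template' = 's3://DOC-EXAMPLE-BUCKET/prefix/AWSLogs/aws-account-id=${aws-account-id}/aws-service=${aws-service}/aws-region=${aws-region}/year=${year}/month=${month}/day=${day}/hour=${hour}'
);

With projection enabled, Athena computes the partitions at query time, so MSCK REPAIR TABLE and ALTER TABLE ADD PARTITION are no longer needed.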
Related
I need to create a table with a specific condition that can be updated when the bucket is updated. This is an example:
CREATE EXTERNAL TABLE IF NOT EXISTS `default`.`cards-test` (
`id` bigint,
`created_at` timestamp,
`type` string,
`account_id` bigint,
`last_4_digits` string,
`is_active` boolean,
`status` string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION 's3://something/cards-bucket/'
TBLPROPERTIES ('classification' = 'parquet');
Now, let's say I want a WHERE clause that says WHERE type = 'type_1', can I insert this here? If so, where?
If not, how should I create a table with such specific conditions out of the buckets?
No. As the docs show, the CREATE TABLE syntax has no option for filtering the data.
What you can do is create another table via the CREATE TABLE AS syntax with filtering applied:
CREATE TABLE "cards-test-type_1" WITH (
...
) AS
SELECT
*
FROM
"cards-test"
WHERE type = 'type_1'
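For reference, the WITH clause is where the new table's storage properties go. A hedged example (the output format and location below are assumptions, not taken from the question) could be:

-- Sketch only: CTAS with filtering; format and external_location are assumed values
CREATE TABLE "cards-test-type_1" WITH (
  format = 'PARQUET',
  external_location = 's3://something/cards-bucket-type-1/'  -- hypothetical output prefix
) AS
SELECT *
FROM "cards-test"
WHERE type = 'type_1'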
Or create a view:
Creates a new view from a specified SELECT query. The view is a logical table that can be referenced by future queries. Views do not contain any data and do not write data. Instead, the query specified by the view runs each time you reference the view by another query.
CREATE VIEW "cards-test-type_1" AS
SELECT
*
FROM
"cards-test"
WHERE type = 'type_1'
I am trying to load files from S3 into Athena to perform a query operation, but all the column values are getting added to the first column.
I have file in the following format:
id,user_id,personal_id,created_at,updated_at,active
34,34,43,31:28.4,27:07.9,TRUE
Table creation query:
CREATE EXTERNAL TABLE `testing`(
`id` string,
`user_id` string,
`personal_id` string,
`created_at` string,
`updated_at` string,
`active` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
's3://testing2fa/'
TBLPROPERTIES (
'transient_lastDdlTime'='1665356861')
Can someone please tell me where I am going wrong?
You should add skip.header.line.count to your table properties to skip the first row. Since you have defined all columns as the string data type, Athena cannot differentiate between the header and the first data row.
DDL with property added:
CREATE EXTERNAL TABLE `testing`(
`id` string,
`user_id` string,
`personal_id` string,
`created_at` string,
`updated_at` string,
`active` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
's3://testing2fa/'
TBLPROPERTIES ('skip.header.line.count'='1')
The SerDe needs some parameters to recognize CSV files, such as:
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
ESCAPED BY '\\'
LINES TERMINATED BY '\n'
See: LazySimpleSerDe for CSV, TSV, and custom-delimited files - Amazon Athena
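Putting it together, a complete DDL for the sample file might look like this (a sketch only, combining the delimiter settings above with the skip.header.line.count property from the previous answer; the bucket name is taken from the question):

CREATE EXTERNAL TABLE `testing` (
  `id` string,
  `user_id` string,
  `personal_id` string,
  `created_at` string,
  `updated_at` string,
  `active` string
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
  ESCAPED BY '\\'
  LINES TERMINATED BY '\n'
LOCATION 's3://testing2fa/'
TBLPROPERTIES ('skip.header.line.count'='1')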
An alternative method is to use AWS Glue to create the tables for you. In the AWS Glue console, you can create a Crawler and point it to your data. When you run the crawler, it will automatically create a table definition in Amazon Athena that matches the supplied data files.
I am new to Athena and would like to ask for some help.
I have multiple CSV files in the following format. Please note that all fields are in double quotes and the total file size is about 5 GB. If possible, I would rather do this without using Glue, unless there is a reason to spend money on running the crawlers.
"emailusername.string()","emaildomain.string()","name.string()","details.string()"
"myname1","website1.com","fullname1","address1 n details"
"myname2","website2.com","fullname2","address2 n details"
The following code on Athena works perfectly:
CREATE EXTERNAL TABLE IF NOT EXISTS db1.tablea (
`emailusername` string,
`emaildomain` string,
`name` string,
`details` string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES ("separatorChar" = ",", "escapeChar" = "\\")
LOCATION 's3://projectzzzz2/0001_aaaa_delme/'
TBLPROPERTIES ('has_encrypted_data'='false');
However, I am able neither to cluster nor to use partitioning. The following code runs successfully, and after that I am also able to load partitions successfully, but no data is returned!
CREATE EXTERNAL TABLE IF NOT EXISTS db1.tablea (
`name` string,
`details` string
)
PARTITIONED BY (emaildomain string, emailusername string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES ("separatorChar" = ",", "escapeChar" = "\\")
LOCATION 's3://projectzzzz2/0001_aaaa_delme/'
TBLPROPERTIES ('has_encrypted_data'='false');
MSCK REPAIR TABLE tablea;
SELECT * FROM "db1"."tablea";
Result: Zero records returned
If your intention is to create partitions on emaildomain and emailusername:
You don't need to have fields called emaildomain and emailusername in the table. However, you do need two directory levels, such as domain1/user1, under your S3 location.
e.g. s3://projectzzzz2/0001_aaaa_delme/domain1/user1
Make sure you copy your file into s3://projectzzzz2/0001_aaaa_delme/domain1/user1 (not into the top-level location s3://projectzzzz2/0001_aaaa_delme), since the partition location you register must actually contain the data. Then you can issue:
ALTER TABLE tablea ADD PARTITION (emaildomain='domain1', emailusername='user1') LOCATION 's3://projectzzzz2/0001_aaaa_delme/domain1/user1';
If you query the table tablea, you will see that new fields called emaildomain and emailusername have been added automatically.
To my knowledge, whenever you add a new user or a new email domain, you need to copy your file into the new folder and issue the corresponding ALTER TABLE statement.
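For instance, a hypothetical new domain2/user2 folder (the values here are made up for illustration) would be registered like this:

-- Hypothetical example: register a newly added domain/user folder as a partition
ALTER TABLE tablea ADD PARTITION (emaildomain='domain2', emailusername='user2')
LOCATION 's3://projectzzzz2/0001_aaaa_delme/domain2/user2/';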
I've created a table in Athena with the below config:
CREATE EXTERNAL TABLE `extern` (
  `FirstName` STRING, `LastName` STRING, `Email` STRING, `Phone` STRING,
  `AddressLine1` STRING, `City` STRING, `State` STRING, `PostalCode` STRING,
  `time_on_page` DECIMAL(10,3), `page` STRING, `login_time` TIMESTAMP
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
WITH SERDEPROPERTIES ('orc.column.index.access'='false')
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION 's3://location/'
TBLPROPERTIES ("orc.compress"="ZLIB")
When I query it with select * from extern, I get empty values for all columns except time_on_page, page, and login_time. However, it returns data in all columns if I use ('orc.column.index.access'='true').
The column names in the ORC file schema exactly match what's defined in the create statement.
File Version: 0.12 with ORC_135
Rows: 13559
Compression: ZLIB
Compression size: 131072
Type: struct<FirstName:string,LastName:string,Email:string,Phone:string,AddressLine1:string,City:string,State:string,PostalCode:string,time_on_page:decimal(10,3),page:string,login_time:timestamp>
The question is: could it be that when orc.column.index.access=false, the engine tries to match column names case-sensitively?
Apparently Amazon's version of Presto is based on PrestoDB, and this feature doesn't seem to be available in PrestoDB as of this writing; PrestoSQL, however, seems to have had this support from release 312 onwards (ref: https://github.com/prestosql/presto/issues/802).
The solution is to make sure the schema in the ORC file uses lowercase column names so they can be matched against the names from the Hive metastore (which are always lowercased).
Debug code for PrestoDB:
https://github.com/prestodb/presto/blob/95bcb3947cad1570e19a0adaebc58009aa362ada/presto-orc/src/main/java/com/facebook/presto/orc/OrcReader.java#L130
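Since the table does read the files correctly with 'orc.column.index.access'='true', one hedged way to produce files with lowercase column names is to rewrite the data with an Athena CTAS statement (the output table name and location below are assumptions, not an established fix from the docs):

-- Sketch only: rewrite the ORC data so the file schema has lowercase column names.
-- Assumes "extern" is currently defined with 'orc.column.index.access'='true'
-- so the existing files are read by position; the output location is hypothetical.
CREATE TABLE extern_lowercase
WITH (
  format = 'ORC',
  external_location = 's3://location-lowercase/'
) AS
SELECT
  firstname, lastname, email, phone, addressline1, city, state,
  postalcode, time_on_page, page, login_time
FROM extern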
I created a table with many partitions:
PARTITIONED BY (`year` string, `day` string, `month` string, `version_` string, `af_` string, `type_` string, `msm` string)
After that, I run:
MSCK REPAIR TABLE mytable;
When I launch the preview of mytable, I get 0 rows. I try:
select * from mytable
Also no results.
One solution is to use ALTER TABLE to add partitions with values, but should I run ALTER TABLE before every request?!
The cause is that your PARTITIONED BY statement has the fields in a different order than your directory hierarchy:
PARTITIONED BY (`year` string, `day` string, `month` string, `version_` string, `af_` string, `type_` string, `msm` string)
af_=4/type_=anchor/msm=1026355/year=2017/month=05/day=14/version_=1
You can fix it by listing the fields in PARTITIONED BY in the same order as the directory hierarchy.
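For the directory hierarchy shown above, the partition clause would need to be:

PARTITIONED BY (
  `af_` string,
  `type_` string,
  `msm` string,
  `year` string,
  `month` string,
  `day` string,
  `version_` string
)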
I did a small test where I had a partition working, but then recreated the table with a different order and it returned zero rows. (It also created new directories in the expected hierarchy -- weird!)