I am new to Athena and would appreciate some help.
I have multiple CSV files in the following format (note that all fields are in double quotes), with a total file size of about 5 GB. If possible, I would rather do this without Glue, unless there is a good reason to spend money on running the crawlers.
"emailusername.string()","emaildomain.string()","name.string()","details.string()"
"myname1","website1.com","fullname1","address1 n details"
"myname2","website2.com","fullname2","address2 n details"
The following code on Athena works perfectly:
CREATE EXTERNAL TABLE IF NOT EXISTS db1.tablea (
`emailusername` string,
`emaildomain` string,
`name` string,
`details` string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES ("separatorChar" = ",", "escapeChar" = "\\")
LOCATION 's3://projectzzzz2/0001_aaaa_delme/'
TBLPROPERTIES ('has_encrypted_data'='false');
However, I am not able to use either clustering or partitioning. The following code runs successfully, and after that I can also load partitions successfully, but no data is returned!
CREATE EXTERNAL TABLE IF NOT EXISTS db1.tablea (
`name` string,
`details` string
)
PARTITIONED BY (emaildomain string, emailusername string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES ("separatorChar" = ",", "escapeChar" = "\\")
LOCATION 's3://projectzzzz2/0001_aaaa_delme/'
TBLPROPERTIES ('has_encrypted_data'='false');
MSCK REPAIR TABLE tablea;
SELECT * FROM "db1"."tablea";
Result: Zero records returned
If your intention is to create partitions on emaildomain and emailusername, you don't need fields called emaildomain and emailusername in the table itself. However, you do need the two directory levels, such as domain1/user1, under your S3 location.
e.g. s3://projectzzzz2/0001_aaaa_delme/domain1/user1
Make sure you copy your file into s3://projectzzzz2/0001_aaaa_delme/domain1/user1 (not into the table root s3://projectzzzz2/0001_aaaa_delme); the partition's data has to live in the partition's own directory.
Then you can issue:
ALTER TABLE tablea ADD PARTITION (emaildomain = 'domain1', emailusername = 'user1') LOCATION 's3://projectzzzz2/0001_aaaa_delme/domain1/user1';
If you query the table tablea, you will see that the new fields emaildomain and emailusername have been added automatically.
To my knowledge, whenever you add a new user or email domain, you need to copy your file into the new folder and issue the corresponding ALTER TABLE ... ADD PARTITION statement.
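As a side note, MSCK REPAIR TABLE (which the question uses) only discovers partitions automatically when the directories follow the Hive key=value naming convention; with plain names like domain1/user1, every partition has to be registered by hand as above. A minimal sketch of the auto-discoverable layout, reusing the question's bucket with hypothetical file names:
-- Layout that MSCK REPAIR TABLE can pick up without ALTER TABLE statements:
--   s3://projectzzzz2/0001_aaaa_delme/emaildomain=domain1/emailusername=user1/part1.csv
--   s3://projectzzzz2/0001_aaaa_delme/emaildomain=domain2/emailusername=user2/part2.csv
MSCK REPAIR TABLE db1.tablea; -- registers every such partition in one statement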
So I have two Hive queries: one that creates the table, and one that reads Parquet data from another table and inserts the relevant columns into my new table. I would like this new Hive table to export its data to an S3 location in csv.gz format. My Hive queries running on EMR currently output files named 00000_0.gz, and I have to rename them to csv.gz using a bash script. This is quite hacky, as I have to mount my S3 directory in my terminal; it would be ideal if my queries could do this directly. Could someone please review my queries to see if there's any fault? Many thanks.
CREATE TABLE db.test (
app_id string,
app_account_id string,
sdk_ts BIGINT,
device_id string)
PARTITIONED BY (
load_date string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION "s3://test_unload/";
set hive.execution.engine=tez;
set hive.cli.print.header=true;
set hive.exec.compress.output=true;
set hive.merge.tezfiles=true;
set hive.merge.mapfiles=true;
set hive.merge.mapredfiles=true;
set hive.merge.smallfiles.avgsize=1024000000;
set hive.merge.size.per.task=1024000000;
set hive.exec.dynamic.partition.mode=nonstrict;
insert into db.test
partition(load_date)
select
'' as app_id,
'288' as app_account_id,
from_unixtime(CAST(event_epoch as BIGINT), 'yyyy-MM-dd HH:mm:ss') as sdk_ts,
device_id,
'20221106' as load_date
FROM processed_events.test
where load_date = '20221106';
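One fault worth flagging in the INSERT above: sdk_ts is declared BIGINT, but from_unixtime(..., 'yyyy-MM-dd HH:mm:ss') returns a formatted string, which Hive will silently cast to NULL when writing into a BIGINT column. A minimal fix, assuming the intent is to store the raw epoch value (alternatively, declare sdk_ts as string or timestamp and keep the formatted value):
insert into db.test
partition(load_date)
select
'' as app_id,
'288' as app_account_id,
CAST(event_epoch as BIGINT) as sdk_ts, -- keep the epoch itself; a formatted date string would not survive the BIGINT column
device_id,
'20221106' as load_date
FROM processed_events.test
where load_date = '20221106';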
I've seen other questions saying their query returns no results; that is not what is happening with my query. The query runs, but it returns empty strings/results.
I have an 81.7 MB JSON file in my input bucket (input-data/test_data). I've set up the data source as JSON.
However, when I execute SELECT * FROM test_table; Athena shows (in green) that the data has been scanned, the query was successful, and there are results, but they are not saved to the output bucket or displayed in the GUI.
I'm not sure what I've done wrong in the setup.
This is my table creation:
CREATE EXTERNAL TABLE IF NOT EXISTS `test_db`.`test_data` (
`tbl_timestamp` timestamp,
`colmn1` string,
`colmn2` string,
`colmn3` string
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = '1'
) LOCATION 's3://input-data/test_data/'
TBLPROPERTIES ('has_encrypted_data'='false',
'skip.header.line.count'='1');
Resolved this issue: the column labels in the table (i.e. the keys) need to match the labels in the file itself. Simple really!
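For illustration, here is a hypothetical record that the table above would map correctly, given that the OpenX JsonSerDe matches column names to JSON keys (case-insensitively by default):
-- One line in s3://input-data/test_data/ (keys must match the DDL columns):
-- {"tbl_timestamp": "2020-01-01 00:00:00.000", "colmn1": "a", "colmn2": "b", "colmn3": "c"}
-- If the file used keys such as "timestamp" or "column1" instead, those columns
-- would come back as NULL/empty strings rather than failing the query.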
Getting a timeout error for a full-text query in Athena like this:
SELECT count(textbody) FROM "email"."some_table" where textbody like '% some text to search%'
Is there any way to optimize it?
Update:
The create table statement:
CREATE EXTERNAL TABLE `email`.`email5_newsletters_04032019`(
`nesletterid` string,
`name` string,
`format` string,
`subject` string,
`textbody` string,
`htmlbody` string,
`createdate` string,
`active` string,
`archive` string,
`ownerid` string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
'serialization.format' = ',',
'field.delim' = ',',
'ESCAPED BY' = '\\'
) LOCATION 's3://some_bucket/email_backup_updated/email5/'
TBLPROPERTIES ('has_encrypted_data'='false');
And S3 bucket contents:
# aws s3 ls s3://xxx/email_backup_updated/email5/ --human
2020-08-22 15:34:44 2.2 GiB email_newsletters_04032019_updated.csv.gz
There are 11 million records in this file. The file can be imported into Redshift within 30 minutes and everything works fine there, but I would prefer to use Athena!
CSV is not a format that integrates very well with the Presto engine, as queries need to read the full row to reach a single column. One way to optimize your usage of Athena, which will also save you plenty in storage costs, is to switch to a columnar storage format such as Parquet or ORC, and you can actually do it with a query:
CREATE TABLE `email`.`email5_newsletters_04032019_orc`
WITH (
external_location = 's3://my_orc_table/',
format = 'ORC')
AS SELECT *
FROM `email`.`email5_newsletters_04032019`;
Then rerun your query above on the new table:
SELECT count(textbody) FROM "email"."email5_newsletters_04032019_orc" where textbody like '% some text to search%'
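Because ORC is columnar, count(textbody) on the new table reads only the textbody column instead of decompressing every full row. If your queries also filter on a particular column, CTAS can additionally partition the output; a sketch with a hypothetical table name and location (in Athena CTAS the partition columns must come last in the SELECT list, which ownerid already does here):
CREATE TABLE `email`.`email5_newsletters_04032019_orc_part`
WITH (
external_location = 's3://my_orc_table_partitioned/',
format = 'ORC',
partitioned_by = ARRAY['ownerid'])
AS SELECT *
FROM `email`.`email5_newsletters_04032019`;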
create external table reason ( reason_id int,
retailer_id int,
reason_code string,
reason_text string,
ordering int,
creation_date date,
is_active tinyint,
last_updated_by int,
update_date date
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
"separatorChar" = "\t",
"quoteChar" = "'",
"escapeChar" = "\\"
)
STORED AS TEXTFILE
location 's3://bucket_name/athena-workspace/athena-input/'
TBLPROPERTIES ("skip.header.line.count"="1");
The query above executes successfully; however, there are no files in the provided location!
Upon successful execution the table is created, but it is empty. How is this possible?
Even if I upload a file to the provided location, the created table is still empty!
Athena is not a data store; it is simply a serverless tool for reading data in S3 using SQL-like expressions.
Amazon Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL. Athena is serverless, so there is no infrastructure to manage, and you pay only for the queries that you run.
This query creates the table's metadata; it doesn't write to that location, it reads from it.
If you put a CSV into the location and run select * from reason, Athena will attempt to map any file under the prefix athena-workspace/athena-input/ in bucket bucket_name to your data format, using the ROW FORMAT and SERDEPROPERTIES to parse the files. It will also skip the first line of each file, assuming it's a header.
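For example, given the serde above (tab-separated fields, single-quote quoting, first line skipped as a header), a hypothetical object uploaded under that prefix could look like this, after which the table starts returning rows:
-- Hypothetical s3://bucket_name/athena-workspace/athena-input/reason.tsv (<TAB> marks a tab character):
-- reason_id<TAB>retailer_id<TAB>reason_code<TAB>reason_text<TAB>ordering<TAB>creation_date<TAB>is_active<TAB>last_updated_by<TAB>update_date
-- 1<TAB>42<TAB>'DMG'<TAB>'Damaged item'<TAB>1<TAB>2020-01-01<TAB>1<TAB>7<TAB>2020-01-02
SELECT * FROM reason; -- now returns the uploaded row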
I ran a simple query on CSV-format data using the Athena dashboard. The result was a CSV with column headers.
When storing the results, Athena writes the column headers to S3 as well. How can I skip storing the header column names, as I have to make a new table from the results and the repeated header gets in the way?
Try "skip.header.line.count"="1", This feature has been available on AWS Athena since 2018-01-19, here's a sample:
CREATE EXTERNAL TABLE IF NOT EXISTS tableName (
`field1` string,
`field2` string,
`field3` string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
'separatorChar' = ',',
'quoteChar' = '\"',
'escapeChar' = '\\'
)
LOCATION 's3://fileLocation/'
TBLPROPERTIES ('skip.header.line.count'='1');
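In the scenario above, LOCATION would point at the S3 prefix where Athena saved the earlier query's results (a hypothetical path such as s3://aws-athena-query-results-<account>-<region>/some-prefix/), and the skip.header.line.count property makes the new table ignore the header row that each result file contains.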
You can refer to this question:
Aws Athena - Create external table skipping first row
From an Eric Hammond post on AWS Forums:
...
WHERE
date NOT LIKE '#%'
...
I found this works! The steps I took:
1. Ran an Athena query, with the output going to Amazon S3
2. Created a new table pointing to this output, based on How do I use the results of my Amazon Athena query in another query?, changing the path to the correct S3 location
3. Ran a query on the new table with the above WHERE <datefield> NOT LIKE '#%' (put together below)
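Putting the steps together, the final query is shaped roughly like this (the table and column names here are hypothetical):
SELECT *
FROM my_results_table -- the table created over the Athena output location in step 2
WHERE datefield NOT LIKE '#%'; -- drops lines beginning with '#', which would otherwise parse as data rows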
However, subsequent queries store even more data in that S3 directory, which confuses later executions.