TL;DR: In Athena, a "select ... limit 10" query scans more data against the Parquet table than against the CSV table. Shouldn't it be the other way round?
I am using Athena (engine version 1) to query the following two datasets (the same data in two different file formats):
Format: CSV
Size: 91.3 MB
Athena DB name: nycitytaxi
Athena table name: data
Dataset description: NYC taxi trip data, present in a public S3 bucket

Format: Parquet
Size: 19.4 MB
Athena DB name: nycitytaxi
Athena table name: aws_glue_result_xxxx
Dataset description: the same data as above, converted to Parquet through a Glue job and stored in one of my S3 buckets
Now I am executing the following query on both tables:
select lpep_pickup_datetime, lpep_dropoff_datetime
from nycitytaxi.<table_name>
limit 10
When I run this query on the CSV-based table (table_name: data), the Athena console shows it scanned 721.96 KB of data.
When I run the same query on the Parquet-based table (table_name: aws_glue_result_xxxx), the Athena console shows it scanned 10.9 MB of data.
Shouldn't Athena scan far less data for the Parquet-based table, since Parquet is a columnar format, as opposed to the row-based storage of CSV?
It is due to your specific query.
select lpep_pickup_datetime, lpep_dropoff_datetime
from nycitytaxi.<table_name>
limit 10
In row-based formats like CSV, the data is stored row by row. When you ask for any 10 rows, Athena can start reading the CSV file from the beginning, return the first 10 rows it encounters, and stop, so very little data is scanned.
In columnar formats like Parquet, values are stored column by column. Suppose the data has three columns: id, name, and number. All the id values are stored together, all the name values are stored together, and all the number values are stored together. To return 10 rows, Athena therefore has to read a chunk of each requested column, and those chunks sit in different locations in the file; it cannot read just 10 individual values, so a tiny LIMIT query ends up scanning more data than it would from the CSV.
More on parquet pros and cons here.
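For contrast, a query that touches only a couple of columns across the whole dataset is where Parquet pays off. Here is a quick sketch against the tables from the question, reusing the same two columns (the exact bytes scanned will vary):

-- Full-table aggregation over just two columns. The CSV table has to be
-- read end to end (~91 MB), while the Parquet table only needs the
-- column chunks for these two columns.
SELECT min(lpep_pickup_datetime) AS first_pickup,
       max(lpep_dropoff_datetime) AS last_dropoff
FROM nycitytaxi.aws_glue_result_xxxx;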
Related
Given a data source of 1.4 TB of Parquet data on S3, partitioned by a timestamp field (so the partitions are year / month / day), I am querying a specific day of data (2.6 GB) and retrieving all available fields in the Parquet files via Redshift Spectrum with this query:
SELECT *
FROM my_external_schema.my_external_table
WHERE year = '2020' and month = '01' and day = '01'
The table is made available via a Glue crawler that points at the top-level "folder" in S3; this creates a database, and I then link that database to a new external schema with this command:
create external schema my_external_schema from data catalog
database 'my_external_schema'
iam_role 'arn:aws:iam::123456789:role/my_role'
region 'my-region-9';
Analysing the table in my IDE, I can see it was generated by this statement:
create external table my_external_schema.my_external_table
(
id string,
my_value string,
my_nice_value string
)
partitioned by (year string, month string, day string)
row format serde 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
with serdeproperties ('serialization.format'='1')
stored as
inputformat 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
outputformat 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
location 's3://my-bucket/my/location/'
table properties ('CrawlerSchemaDeserializerVersion'='1.0', 'CrawlerSchemaSerializerVersion'='1.0', 'UPDATED_BY_CRAWLER'='my_crawler');
When I analyse the query in Redshift, however, I see that ~86 GB of data was scanned.
How is that possible? It is a concern because Redshift bills based on the amount of data scanned, and it looks like the service is scanning more than 30 times the amount of data actually in that partition.
I also tried executing the same query in Athena, and there only 2.55 GB of data was scanned (definitely more reasonable).
I can't give too many details on the cluster size, but assume that those 86 GB of scanned data would fit in the cluster's memory.
The problem seems to be in the AWS Redshift console.
If I look at the "query details" in the Redshift console, "Total data scanned" reports 86 GB. As Vzarr mentioned, I ran the same query on Athena to compare: the execution time was basically the same, but the amount of data scanned was completely different, 2.55 GB.
I did the same comparison with other queries on the S3 external schema, with and without partition columns: the total GB scanned differed in every test, sometimes by a lot (320 MB in Redshift Spectrum vs 20 GB in Athena).
I then looked at the Redshift system tables to understand how the query on the external schema was actually working. I ran a very simple test using SVL_S3QUERY:
SELECT (cast(s3_scanned_bytes as double precision) / 1024 / 1024 / 1024) as gb_scanned,
s3_scanned_rows,
query
FROM SVL_S3QUERY
WHERE query = '<my-query-id>'
The result was completely different from what the AWS Redshift console reports for the same query. Not only was gb_scanned different, s3_scanned_rows was too. The query returned a total of 2.55 GB of data scanned, exactly what Athena reported.
To confirm the numbers in SVL_S3QUERY I used AWS Cost Explorer to double-check the total GB scanned in a day against how much we paid for Redshift Spectrum: the numbers were basically the same.
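For that daily cross-check against Cost Explorer, a rough sketch along these lines can be used (this assumes the SVL_S3QUERY_SUMMARY view and its starttime / s3_scanned_bytes columns; verify the granularity on your engine version before trusting the totals):

-- Approximate daily GB scanned by Redshift Spectrum, to compare with
-- the Spectrum charges shown in Cost Explorer.
SELECT trunc(starttime) AS scan_day,
       sum(s3_scanned_bytes) / 1024.0 / 1024 / 1024 AS gb_scanned
FROM svl_s3query_summary
GROUP BY trunc(starttime)
ORDER BY scan_day;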
At this point I don't know from where, or from which table, the AWS Redshift console takes its query details, but they seem to be completely wrong.
I'm adding files to Amazon S3 from time to time, and I'm using Amazon Athena to query this data and save the aggregated result as CSV in another S3 bucket. I'm trying to find a way for Athena to select only new data (data it has not queried before), in order to optimize cost and avoid duplication.
I tried updating records after they had been selected by Athena, but UPDATE queries are not supported in Athena.
Is there any way to solve this?
Athena does not keep track of files on S3; it only figures out which files to read when you run a query.
When planning a query, Athena looks at the table metadata for the table location, lists that location, and then reads all the files it finds during query execution. If the table is partitioned, it lists the locations of all partitions that match the query.
The only way to control which files Athena reads during query execution is to partition the table and ensure that your queries match the partitions you want it to read.
One common way of reading only new data is to put the data into prefixes on S3 that include the date, and to create tables partitioned by date. At query time you can then filter on the last week, month, or other time period to limit the amount of data read.
You can find more information about partitioning in the Athena documentation.
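As a minimal sketch (bucket, table, and column names here are hypothetical), a date-partitioned layout could look like this: each day's files land under their own dt=... prefix, new partitions are registered as they arrive, and the aggregation query filters on dt so only the days that have not been processed yet are scanned.

-- Hypothetical table over date-partitioned prefixes such as
-- s3://my-bucket/events/dt=2020-01-01/
CREATE EXTERNAL TABLE events (
  id string,
  payload string
)
PARTITIONED BY (dt string)
STORED AS PARQUET
LOCATION 's3://my-bucket/events/';

-- Pick up newly added dt=... prefixes (ALTER TABLE ADD PARTITION also works)
MSCK REPAIR TABLE events;

-- Aggregate only the days added since the last run
SELECT dt, count(*) AS events_per_day
FROM events
WHERE dt > '2020-01-01'
GROUP BY dt;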
I have partitioned the data by date and here is how it is stored in s3.
s3://dataset/date=2018-04-01
s3://dataset/date=2018-04-02
s3://dataset/date=2018-04-03
s3://dataset/date=2018-04-04
...
I created a Hive external table on top of this and am executing this query:
select count(*) from dataset where `date` ='2018-04-02'
This partition has two parquet files like this,
part1 -xxxx- .snappy.parquet
part2 -xxxx- .snappy.parquet
Each file is 297 MB, so these are not big files, and there are not many files to scan.
The query returns 12,201,724 records, but it takes 3.5 minutes to do so. Since a single partition already takes this long, running even the count query on the whole dataset (7 years of data) takes hours to return results. Is there any way I can speed this up?
Amazon Athena is, effectively, a managed Presto service. It can query data stored in Amazon S3 without having to run any clusters.
It is charged based upon the amount of data read from disk, so it runs very efficiently when using partitions and parquet files.
See: Analyzing Data in S3 using Amazon Athena | AWS Big Data Blog
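As a hedged illustration using the table and partition from the question above, the same count run in Athena benefits from both partition pruning (only the two files under date=2018-04-02 are considered) and the Parquet format (a bare count(*) needs very little column data, since row counts are kept in the file metadata):

-- Partition pruning limits the scan to s3://dataset/date=2018-04-02/;
-- count(*) over Parquet reads mostly footers and row-group metadata.
SELECT count(*)
FROM dataset
WHERE "date" = '2018-04-02';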
My use case is simple. I have 20 TB of raw, uncompressed CSV data in S3 with a partition folder structure by year (10 partitions for 10 years, each partition about 2 TB). I want to convert this data to Parquet (Snappy-compressed) and keep a similar partition/folder structure. I want ONE Parquet table with TEN partitions in Athena, which I will use to query the data by partition, and maybe get rid of the raw CSV data later. With Glue it seems like I will end up with 10 separate Parquet tables, which I can't use.
Is this doable in Glue? Instead of using EC2 or Hive/Spark I was looking for a simple solution. Any recommendations? Any help is much appreciated.
Assuming you have a Glue Catalog table over that data, you could load it as a dynamic frame and then write it back as Parquet to the new location:
# Assumes a Glue job (or dev endpoint) where `glue_context` is an existing
# GlueContext and the Data Catalog already has a table over the CSV data.
dynamic_frame = glue_context.create_dynamic_frame.from_catalog(
    database=glue_database_name,
    table_name=glue_table_name)

# Convert to a Spark DataFrame to control the output partitioning
data_frame = dynamic_frame.toDF()

# Write one Parquet dataset under a single prefix, partitioned by year
data_frame.repartition("year") \
    .write \
    .partitionBy("year") \
    .parquet('s3://target-bucket/prefix/')
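Note that this only writes the Parquet files. To end up with the single partitioned table in Athena, you would still need a table definition over the new location, either another crawler pointed at s3://target-bucket/prefix/ or a manual DDL roughly like the sketch below (column names are purely illustrative):

-- One external table over the Parquet output, partitioned by year.
CREATE EXTERNAL TABLE my_parquet_table (
  col_a string,
  col_b bigint
)
PARTITIONED BY (year string)
STORED AS PARQUET
LOCATION 's3://target-bucket/prefix/';

-- Register the year=... prefixes written by the job as partitions
MSCK REPAIR TABLE my_parquet_table;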
Athena creates an external table over data in S3 using the fields you define. I have done this using JSON data. Could you help me with how to create a table using Parquet data?
I have tried the following:
1. Converted sample JSON data to Parquet.
2. Uploaded the Parquet data to S3.
3. Created a table using the columns of the JSON data.
By doing this I am able to execute a query, but the result is empty.
Is this approach right, or is there another approach I should follow for Parquet data?
Sample json data:
{"_id":"0899f824e118d390f57bc2f279bd38fe","_rev":"1-81cc25723e02f50cb6fef7ce0b0f4f38","deviceId":"BELT001","timestamp":"2016-12-21T13:04:10:066Z","orgid":"fedex","locationId":"LID001","UserId":"UID001","SuperviceId":"SID001"},
{"_id":"0899f824e118d390f57bc2f279bd38fe","_rev":"1-81cc25723e02f50cb6fef7ce0b0f4f38","deviceId":"BELT001","timestamp":"2016-12-21T13:04:10:066Z","orgid":"fedex","locationId":"LID001","UserId":"UID001","SuperviceId":"SID001"}
If your data has been successfully stored in Parquet format, you would then create a table definition that references those files.
Here is an example statement that uses Parquet files:
CREATE EXTERNAL TABLE IF NOT EXISTS elb_logs_pq (
request_timestamp string,
elb_name string,
request_ip string,
request_port int,
...
ssl_protocol string )
PARTITIONED BY(year int, month int, day int)
STORED AS PARQUET
LOCATION 's3://athena-examples/elb/parquet/'
tblproperties ("parquet.compress"="SNAPPY");
This example was taken from the AWS blog post Analyzing Data in S3 using Amazon Athena that does an excellent job of explaining the benefits of using compressed and partitioned data in Amazon Athena.
If your table definition is valid but you are not getting any rows, try this:
-- The MSCK REPAIR TABLE command will load all partitions into the table.
-- This command can take a while to run depending on the number of partitions to be loaded.
MSCK REPAIR TABLE {tablename}
Steps:
1. Create your my_table_json.
2. Insert data into my_table_json (verify the existence of the created JSON files in the table's LOCATION).
3. Create my_table_parquet: the same CREATE statement as my_table_json, except you need to add the STORED AS PARQUET clause.
4. Run: INSERT INTO my_table_parquet SELECT * FROM my_table_json
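A minimal sketch of those four statements, using hypothetical column names taken from the sample JSON above and illustrative S3 locations (a real table would list every field):

-- Step 1: table over the JSON files
CREATE EXTERNAL TABLE my_table_json (
  `_id` string,
  deviceid string,
  `timestamp` string,
  orgid string,
  locationid string
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://my-bucket/json/';

-- Step 3: same columns, stored as Parquet in a different location
CREATE EXTERNAL TABLE my_table_parquet (
  `_id` string,
  deviceid string,
  `timestamp` string,
  orgid string,
  locationid string
)
STORED AS PARQUET
LOCATION 's3://my-bucket/parquet/';

-- Step 4: rewrite the JSON data as Parquet via Athena's INSERT INTO
INSERT INTO my_table_parquet SELECT * FROM my_table_json;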