Hive query on s3 partition is too slow - amazon-web-services

I have partitioned the data by date and here is how it is stored in s3.
s3://dataset/date=2018-04-01
s3://dataset/date=2018-04-02
s3://dataset/date=2018-04-03
s3://dataset/date=2018-04-04
...
I created a Hive external table on top of this, and I am executing this query:
select count(*) from dataset where `date` ='2018-04-02'
This partition has two Parquet files, like this:
part1 -xxxx- .snappy.parquet
part2 -xxxx- .snappy.parquet
Each file is 297 MB, so the files are neither large nor numerous.
The query returns 12201724 records, but it takes 3.5 minutes to do so. Since a single partition takes this long, running even the count query on the whole dataset (7 years of data) takes hours to return results. Is there any way I can speed this up?

Amazon Athena is, effectively, a managed Presto service. It can query data stored in Amazon S3 without having to run any clusters.
It is charged based upon the amount of data read from disk, so it runs very efficiently when using partitions and parquet files.
See: Analyzing Data in S3 using Amazon Athena | AWS Big Data Blog

Related

Is there any file format which represent a single database table?

The business logic is as follows:
The user uploads a CSV file;
The application converts the CSV file to a database table;
Later, the user can run SQL on the table to generate a BI report.
Currently, the solution is to save the table to MySQL. But as time goes on, the MySQL database has come to contain thousands of tables.
I want to find a file format that represents a table, can be put in an object store such as AWS S3, and can then have SQL run against it.
For example:
Datasource ds = new Datasource("s3://xxx/bbb/t1.tbl");
ResultSet rs = ds.runSQL("select c1, c2 from t1 where c3=8");
What are your ideas or solutions?
Amazon S3 can run an SQL query against a single CSV file. It uses a capability called S3 Select.
From Filtering and retrieving data using Amazon S3 Select - Amazon Simple Storage Service:
With Amazon S3 Select, you can use simple structured query language (SQL) statements to filter the contents of an Amazon S3 object and retrieve just the subset of data that you need. By using Amazon S3 Select to filter this data, you can reduce the amount of data that Amazon S3 transfers, which reduces the cost and latency to retrieve this data.
You can make an API call to S3 to perform the SQL query and retrieve the results. No database required. Just pay for the storage used by the CSV files (which can be gzipped to save space), plus $0.002 per GB scanned and $0.0007 per GB returned.
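As a rough illustration of the pricing quoted above, the following Python sketch estimates the S3 Select charge for one query. The function name and the example sizes are hypothetical, and the per-GB rates are simply the figures from this answer; they may differ by region or change over time.

```python
# Rates quoted in the answer above (USD per GB); treat them as assumptions.
SCAN_RATE_PER_GB = 0.002
RETURN_RATE_PER_GB = 0.0007


def s3_select_cost(gb_scanned: float, gb_returned: float) -> float:
    """Estimate the S3 Select charge for one query (storage and request fees excluded)."""
    return gb_scanned * SCAN_RATE_PER_GB + gb_returned * RETURN_RATE_PER_GB


# Hypothetical query: scan a 10 GB CSV file and return 1 GB of matching rows.
print(f"${s3_select_cost(10, 1):.4f}")
```

The query itself would be sent through the S3 SelectObjectContent API (in boto3, the `select_object_content` client method), which is not shown here.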
You can store the file as CSV in S3 and use S3 Select as mentioned in the other answer. Or you can store it as CSV or Parquet (a much more performant format) and run queries against it using AWS Athena.

AWS Athena partitioning - projected and otherwise - makes queries run forever

I'm trying to introduce partitioning into an Athena table. By way of background, what I have is a kinesis data stream receiving events that are JSON data, a delivery stream that converts the JSON from text to orc via a glue table we'll call glue-table and writes them to s3://bucket-name/path/year=!{timestamp:yyyy}/month=!{timestamp:MM}/day=!{timestamp:dd}/hour=!{timestamp:HH}/.
With this setup, and no partitioning properties or fields set up in the glue table at all, there is automatically an external table in Athena named glue-table and with ~50 sample events run through kinesis, I have data in the appropriate path in S3 (aws s3 ls s3://bucket-name/path/year=2022/month=08/day=31/hour=19 for example lists results). I can execute SELECT * FROM glue-table in Athena and get ~50 results in ~900ms.
If I execute MSCK REPAIR TABLE glue_table the result is Partitions not in metastore: glue_table:year=2022/month=08/day=31/hour=19 which implies Athena is able to see that there ought to be partitions based on the templates it finds in the S3 path.
If I then do this to create a partitioned table using projections:
CREATE EXTERNAL TABLE partitioned_table (
action string,
createdat string
)
PARTITIONED BY (
year integer,
month integer,
day integer,
hour integer
)
STORED AS ORC
LOCATION 's3://bucket-name/path/'
TBLPROPERTIES (
'classification'='orc',
'projection.year.type'='integer',
'projection.year.range'='2022,9999',
'projection.month.type'='integer',
'projection.month.range'='01,12',
'projection.day.type'='integer',
'projection.day.range'='01,31',
'projection.hour.type'='integer',
'projection.hour.range'='00,24',
'projection.enabled'='true',
'storage.location.template'='s3://bucket-name/path/year=${year}/month=${month}/day=${day}/hour=${hour}/');
and then execute SELECT action, year FROM partitioned_table WHERE year=2022; then Athena chews on the query for 3 minutes and then returns no results.
If I go modify the glue table so it has the same projection settings from TBLPROPERTIES in the above example set as Table Properties, and then execute SELECT action, year FROM partitioned_table WHERE year=2022; then Athena chews on the query for 3 minutes and then returns no results.
Interestingly, if I drop and recreate partitioned_table so that its location is s3://bucket-does-not-exist then SELECT action, year FROM partitioned_table WHERE year=2022; also runs for 3+ minutes and then returns no results.
I'm not able to see what I'm doing wrong here after referring to multiple pages of AWS documentation and a few official AWS videos describing this process. What's wrong here?
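One thing that may be worth checking, offered as a sketch rather than a confirmed diagnosis, is what the projection settings above expand to. The Python model below is not Athena code. It assumes that every value in each range becomes a candidate location, and that integer projections are rendered without leading zeros unless a `projection.<column>.digits` property is set, as the Athena documentation describes. The helper names `render` and `candidate_locations` are hypothetical.

```python
from typing import Dict, Optional

# Ranges copied from the TBLPROPERTIES above (treated as inclusive on both ends).
RANGES = {"year": (2022, 9999), "month": (1, 12), "day": (1, 31), "hour": (0, 24)}
TEMPLATE = "s3://bucket-name/path/year=${year}/month=${month}/day=${day}/hour=${hour}/"


def render(values: Dict[str, int], digits: Optional[Dict[str, int]] = None) -> str:
    """Fill the location template; `digits` maps a column to a zero-padded width."""
    digits = digits or {}
    path = TEMPLATE
    for column, value in values.items():
        text = str(value).zfill(digits.get(column, 0))
        path = path.replace("${" + column + "}", text)
    return path


def candidate_locations(fixed: Dict[str, int]) -> int:
    """Count the locations implied by the ranges once the `fixed` columns are pinned."""
    count = 1
    for column, (low, high) in RANGES.items():
        if column not in fixed:
            count *= high - low + 1
    return count


values = {"year": 2022, "month": 8, "day": 31, "hour": 19}
print(candidate_locations({"year": 2022}))  # locations to consider for WHERE year=2022
print(render(values))                       # no padding: month=8
print(render(values, {"month": 2, "day": 2, "hour": 2}))  # padded: month=08
```

Under these assumptions, a filter on year alone still leaves thousands of candidate locations, and an unpadded month=8 would not match the month=08 prefix that the delivery stream writes.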

How Redshift Spectrum scans data?

Given a data-source of 1.4 TB of Parquet data on S3 partitioned by a timestamp field (so partitions are year - month - day) I am querying a specific day of data (2.6 GB of data) and retrieving all available fields in the Parquet files via Redshift Spectrum with this query:
SELECT *
FROM my_external_schema.my_external_table
WHERE year = '2020' and month = '01' and day = '01'
The table is made available via a Glue Crawler that points at the top level "folder" in S3; this creates a Database and then via this command I link the Database to the new external schema:
create external schema my_external_schema from data catalog
database 'my_external_schema'
iam_role 'arn:aws:iam::123456789:role/my_role'
region 'my-region-9';
Analysing the table in my IDE I can see the table is generated by this statement:
create external table my_external_schema.my_external_table
(
id string,
my_value string,
my_nice_value string
)
partitioned by (year string, month string, day string)
row format serde 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
with serdeproperties ('serialization.format'='1')
stored as
inputformat 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
outputformat 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
location 's3://my-bucket/my/location/'
table properties ('CrawlerSchemaDeserializerVersion'='1.0', 'CrawlerSchemaSerializerVersion'='1.0', 'UPDATED_BY_CRAWLER'='my_crawler');
When I analyse the query from Redshift I see that it scanned ~86 GB of data instead.
How is that possible? It is a concern because Redshift Spectrum bills based on the amount of data scanned, and it looks like the service is scanning more than 30 times the amount of data that is actually in that partition.
I also tried to execute the same query in Athena and there I get only 2.55 GB of data scanned (definitely more reasonable).
I can't give too many details on the cluster size but assume that those 86GB of scanned data would fit in the cluster's Memory.
The problem seems to be in the AWS Redshift Console.
If we analyse the query from "query details" in the Redshift console, the "Total data scanned" reports 86 GB. As Vzarr mentioned, I ran the same query on Athena to compare the performance. The execution time was basically the same, but the amount of data scanned was completely different: 2.55 GB.
I did the same comparison with other queries on the S3 external schema, with and without partition columns. The number of GB scanned differed in every test, sometimes by a lot (320 MB in Redshift Spectrum, 20 GB in Athena).
I decided to look at the system tables in Redshift in order to understand how the query on the external schema was working. I did a very simple test using SVL_S3QUERY:
SELECT (cast(s3_scanned_bytes as double precision) / 1024 / 1024 / 1024) as gb_scanned,
s3_scanned_rows,
query
FROM SVL_S3QUERY
WHERE query = '<my-query-id>'
The result was completely different from what the AWS Redshift console reports for the same query: the console's figures for both the data scanned and the rows scanned did not match. SVL_S3QUERY reports a total of 2.55 GB of data scanned, exactly what Athena reported.
To confirm the numbers in the SVL_S3QUERY I used AWS Cost Explorer to double check the total of gb scanned in a day with how much we paid for Redshift Spectrum: the numbers were basically the same.
At this point, I don't know where, or from which table, the AWS Redshift console takes the query details, but they seem to be completely wrong.

How does Amazon Athena select new files/records from S3

I add files to Amazon S3 from time to time, and I use Amazon Athena to query this data and save the aggregated results to another S3 bucket in CSV format. I'm trying to find a way for Athena to select only new data (data it has not queried before), in order to reduce cost and avoid duplicated data.
I tried updating the records after Athena selected them, but Athena does not support update queries.
Is there a way to solve this?
Athena does not keep track of files on S3, it only figures out what files to read when you run a query.
When planning a query, Athena will look at the table metadata for the table location, list that location, and finally read all files that it finds during query execution. If the table is partitioned, it will list the locations of all partitions that match the query.
The only way to control which files Athena will read during query execution is to partition a table and ensure that queries match the partitions you want it to read.
One common way of reading only new data is to put data into prefixes on S3 that include the date, and create tables partitioned by date. At query time you can then filter on the last week, month, or other time period to limit the amount of data read.
You can find more information about partitioning in the Athena documentation.
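The date-prefix approach described above can be sketched in Python as follows. This is a minimal illustration, assuming a hypothetical partition column named `dt` stored as an ISO-formatted string (so that string comparison orders dates correctly) and a hypothetical bucket name; neither comes from the question.

```python
from datetime import date, timedelta


def partition_prefix(base: str, day: date) -> str:
    """Build a Hive-style S3 prefix for one day's files."""
    return f"{base.rstrip('/')}/dt={day.isoformat()}/"


def recent_days_filter(today: date, days: int, column: str = "dt") -> str:
    """Build a WHERE-clause fragment covering the last `days` days, including today."""
    start = today - timedelta(days=days - 1)
    return f"{column} BETWEEN '{start.isoformat()}' AND '{today.isoformat()}'"


today = date(2024, 1, 15)  # fixed example date so the output is reproducible
print(partition_prefix("s3://example-bucket/events/", today))
print("SELECT ... FROM events WHERE " + recent_days_filter(today, 7))
```

New files are written under the prefix for their date, and each scheduled query filters on the partition column so that only the matching prefixes are listed and read.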

Query 100 GB of S3 data in milliseconds

I have JSON data in S3. The data looks like this:
{
"act_timestamp": 1576480759864,
"action": 26,
"cmd_line": "\\??\\C:\\Windows\\system32\\conhost.exe 0xffffffff",
"guid": "45af94911fb911ea827300270e098ff0",
"md5": "d5669294f78a7d48c318ef22d5685ba7",
"name": "conhost.exe",
"path": "C:\\Windows\\System32\\conhost.exe",
"pid": 1968,
"sha2": "6bd1f5ab9250206ab3836529299055e272ecaa35a72cbd0230cb20ff1cc30902",
"proc_id": "45af94901fb911ea827300270e098ff0",
"proc_name": "gcxvdf.exe"
}
I have around 100 GB of such JSON documents stored in S3, in a folder structure like year/month/day/hour.
I have to query this data and get results in milliseconds.
Queries can look like:
select proc_id where name='conhost.exe',
select proc_id where cmd_line contains 'conhost.exe'.
I tried using AWS Athena and Redshift, but both return results in around 10-20 seconds. I even tried the Parquet and ORC file formats.
Is there any tool, technology, or technique that can be used to query this kind of data and get results in milliseconds?
(The response time needs to be in milliseconds because I am developing an interactive application.)
I think you are looking for a distributed search system like Solr or Elasticsearch (I am sure there are others, but those are the ones I am familiar with).
It is also worth considering whether you can reduce your data size at all. Is there any old or stale data in your 100 GB?
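The reason a search index can answer an exact-match query such as name='conhost.exe' quickly is that the lookup structure is built ahead of time, so the query becomes a key lookup rather than a scan. The toy Python sketch below illustrates that idea in memory only; it does not use the Solr or Elasticsearch APIs, the function name is hypothetical, and the second event is made up.

```python
from collections import defaultdict


def build_index(events):
    """Map each process name to the set of proc_id values seen with it."""
    index = defaultdict(set)
    for event in events:
        index[event["name"]].add(event["proc_id"])
    return index


events = [
    {"name": "conhost.exe", "proc_id": "45af94901fb911ea827300270e098ff0"},
    {"name": "example.exe", "proc_id": "hypothetical-proc-id"},  # made-up record
]
index = build_index(events)

# Answering the query is now a dictionary lookup, independent of data size.
print(index["conhost.exe"])
```

A "contains" query on cmd_line needs more than this, since real search engines tokenise and analyse text fields before indexing them.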
I was able to solve the above use case by using Presto and Hive on AWS EMR.
With Hive we can create a table over the data in S3, and by using Presto with Hive as the catalog we can query this data.
I found that Presto on EMR is much faster than AWS Athena
(strange, given that Athena uses Presto internally).
Create the table in Hive:
CREATE EXTERNAL TABLE `test_table`(
`field_name1` datatype,
`field_name2` datatype,
`field_name3` datatype
)
STORED AS ORC
LOCATION
's3://test_data/data/';
Query this table in Presto:
>presto-cli --catalog hive
>select field_name1 from test_table limit 5;