I have json data in s3. data looks like
{
"act_timestamp": 1576480759864,
"action": 26,
"cmd_line": "\\??\\C:\\Windows\\system32\\conhost.exe 0xffffffff",
"guid": "45af94911fb911ea827300270e098ff0",
"md5": "d5669294f78a7d48c318ef22d5685ba7",
"name": "conhost.exe",
"path": "C:\\Windows\\System32\\conhost.exe",
"pid": 1968,
"sha2": "6bd1f5ab9250206ab3836529299055e272ecaa35a72cbd0230cb20ff1cc30902",
"proc_id": "45af94901fb911ea827300270e098ff0",
"proc_name": "gcxvdf.exe"
}
I have around 100GB of such jsons stored in s3, in folder structure like year/month/day/hour.
I have to query this data and get results in milliseconds.
query can be like:-
select proc_id where name='conhost.exe',
select proc_id where cmd_line contains 'conhost.exe'.
I tried using AWS Athena and Redshift but both are giving results around 10-20 seconds. I even tried with Paraquet and orc file formats.
Is there any tool/technology/technique which can be used to query this kind of data and get results in milliseconds.
(Reason for response time to be in milliseconds is because I am developing interactive application.)
I think you are looking for a distributed search system like SOLR or elastic search (I am sure there are others, but those are the ones I am familiar with).
Also worth considering if you are able to reduce your data size at all. Any old or stale date in your 100GB?
I am able to solve above use case by using presto,hive on aws emr.
With help of hive we can create table on data in s3, and by using presto and hive as a catalog we can query this data.
Found out that Presto on emr is way too faster than compared to aws athena
(strange that athena uses presto internally)
create table in hive:-
CREATE EXTERNAL TABLE `test_table`(
`field_name1` datatype,
`field_name2` datatype,
`field_name3` datatype
)
STORED AS ORC
LOCATION
's3://test_data/data/';
query this table in presto:-
>presto-cli --catalog hive
>select field_name1 from test_table limit 5;
Related
The business logic is as below:
The user upload a csv file;
The application convert the csv file to a database table;
In the future, the user could run sql on the table to generate a BI report;
Currently, the solution is to save the table to MySQL. But as times goes on, the MySQL database contains thousands of tables.
I want find a file format, which represent a table and can be put to a object storage such as AWS S3, and then run an sql on the file.
For example:
Datasource ds = new Datasource("s3://xxx/bbb/t1.tbl");
ResultSet rs = ds.runSQL("select c1, c2 from t1 where c3=8");
What is your ideas or solutions?
Amazon S3 can run an SQL query against a single CSV file. It uses a capability called S3 Select.
From Filtering and retrieving data using Amazon S3 Select - Amazon Simple Storage Service:
With Amazon S3 Select, you can use simple structured query language (SQL) statements to filter the contents of an Amazon S3 object and retrieve just the subset of data that you need. By using Amazon S3 Select to filter this data, you can reduce the amount of data that Amazon S3 transfers, which reduces the cost and latency to retrieve this data.
You can make an API call to S3 to perform the SQL query and retrieve the results. No database required. Just pay for the storage used by the CSV files (which can be gzipped to save space), plus $0.002 per GB scanned and $0.0007 per GB returned.
You can store the file as CSV in S3 and use S3 Select as mentioned in the other answer. Or you can store it as CSV or Parquet (a much more performant format) and run queries against it using AWS Athena.
I am writing the AWS Glue ETL job and I have 2 options to construct the spark dataframe :
Use the AWS Glue Data Catalog as the metastore for Spark SQL
df = spark.sql("select name from bronze_db.table_tbl")
df.write.save("s3://silver/...")
another option is to read directly from s3 location like this
df = spark.read.format("parquet").load("s3://bronze/table_tbl/1.parquet","s3://bronze/table_tbl/2.parquet")
df.write.save("s3://silver/...")
should I consider reading files directly to save cost or any limit on the number of queries (select name from bronze_db.table_tbl) or to get better read performance?
I am not sure if this query will be run on Athena to return the results
If you only have one file and you know the schema there is no need for a table. A table is useful when there are multiple files, you don't know the schema (e.g. the table was set up and is populated by another process), or if you are querying the data from multiple engines (Athena, EMR, Redshift Spectrum, etc.)
Think of tables as an interoperability thing. Interoperability with other processes, other engines, etc.
everyone!
I'm working on a solution that intends to use Amazon Athena to run SQL queries from Parquet files on S3.
Those filed will be generated from a PostgreSQL database (RDS). I'll run a query and export data to S3 using Python's Pyarrow.
My question is: since Athena is schema-on-read, add or delete of columns on database will not be a problem...but what will happen when I get a column renamed on database?
Day 1: COLUMNS['col_a', 'col_b', 'col_c']
Day 2: COLUMNS['col_a', 'col_beta', 'col_c']
On Athena,
SELECT col_beta FROM table;
will return only data from Day 2, right?
Is there a way that Athena knows about these schema evolution or I would have to run a script to iterate through all my files on S3, rename columns and update table schema on Athena from 'col_a' to 'col_beta'?
Would AWS Glue Data Catalog help in any way to solve this?
I'll love to discuss more about this!
I recommend reading more about handling schema updates with Athena here. Generally Athena supports multiple ways of reading Parquet files (as well as other columnar data formats such as ORC). By default, using Parquet, columns will be read by name, but you can change that to reading by index as well. Each way has its own advantages / disadvantages dealing with schema changes. Based on your example, you might want to consider reading by index if you are sure new columns are only appended to the end.
A Glue crawler can help you to keep your schema updated (and versioned), but it doesn't necessarily help you to resolve schema changes (logically). And it comes at an additional cost, of course.
Another approach could be to use a schema that is a superset of all schemas over time (using columns by name) and define a view on top of it to resolve changes "manually".
You can set a granularity based on 'On Demand' or 'Time Based' for the AWS Glue crawler, so every time your data on the S3 updates a new schema will be generated (you can edit the schema on the data types for the attributes). This way your columns will stay updated and you can query on the new field.
Since AWS Athena reads data in CSV and TSV in the "order of the columns" in the schema and returns them in the same order. It does not use column names for mapping data to a column, which is why you can rename columns in CSV or TSV without breaking Athena queries.
I'm adding files on Amazon S3 from time to time, and I'm using Amazon Athena to perform a query on these data and save it in another S3 bucket as CSV format (aggregated data), I'm trying to find way for Athena to select only new data (which not queried before by Athena), in order to optimize the cost and avoid data duplication.
I have tried to update the records after been selected by Athena, but update query not supported in Athena.
Is any idea to solve this ?
Athena does not keep track of files on S3, it only figures out what files to read when you run a query.
When planning a query Athena will look at the table metadata for the table location, list that location, and finally read all files that it finds during query execution. If the table is partitioned it will list the locations of all partitions that matches the query.
The only way to control which files Athena will read during query execution is to partition a table and ensure that queries match the partitions you want it to read.
One common way of reading only new data is to put data into prefixes on S3 that include the date, and create tables partitioned by date. At query time you can then filter on the last week, month, or other time period to limit the amount of data read.
You can find more information about partitioning in the Athena documentation.
I have partitioned the data by date and here is how it is stored in s3.
s3://dataset/date=2018-04-01
s3://dataset/date=2018-04-02
s3://dataset/date=2018-04-03
s3://dataset/date=2018-04-04
...
Created hive external table on top of this. I am executing this query,
select count(*) from dataset where `date` ='2018-04-02'
This partition has two parquet files like this,
part1 -xxxx- .snappy.parquet
part2 -xxxx- .snappy.parquet
each file size is 297MB. , So not a big file and not many files to scan.
And the query is returning 12201724 records. However it takes 3.5 mins to return this, since one partition itself is taking this time, running even the count query on whole dataset ( 7 years ) of data takes hours to return the results. Is there anyway, I can speed up this ?
Amazon Athena is, effectively, a managed Presto service. It can query data stored in Amazon S3 without having to run any clusters.
It is charged based upon the amount of data read from disk, so it runs very efficiently when using partitions and parquet files.
See: Analyzing Data in S3 using Amazon Athena | AWS Big Data Blog