AWS QuickSIght : Appending multiple files of same schema to each other - amazon-web-services

My data is separated in multiple files to keep them easy. Now I wanted to do some analysis on all data using Quicksight.
All data means same data from a table kept in multiple files for easy accessibility.
When I am adding new data files one after other to Quicksight it is suggesting me joins (like other table which I don't want).
Please suggest me how can I append the all dataset together in Quicksight.

Related

Athena tables having history of records of every csv

I am uploading CSV files in the s3 bucket and creating tables through glue crawler and seeing the tables in Athena, making connection between Athena and Quicksight, and showing the result graphically there in quicksight.
But what I need to do now is keep the history of the files uploaded, instead of a new CSV file being uploaded and crawler updating the table, can I have crawler save each record separately? or is it even a reasonable thing to do? since I wonder it would then create so many tables and it'll be a mess?
I'm just trying to figure out a way to keep a history of previous records. how can i achieve this?
When you run an Amazon Athena query, Athena will look at the location parameter defined in the table's DDL. This specifies where the data is stored in an Amazon S3 bucket.
Athena will include all files in that location when it runs the query on that table. Thus, if you wish to add more data to the table, simply add another file in that S3 location. To replace data in that table, you can overwrite the file(s) in that location. To delete data, you can delete files from that location.
There is no need to run a crawler on a regular basis. The crawler can be used to create the table definition and it can be run again to update the table definition if anything has changed. But you typically only need to use the crawler once to create the table definition.
If you wish to preserve historical data in the table while adding more data to the table, simply upload the data to new files and keep the existing data files in place. That way, any queries will include both the historical data and the new data because Athena simply looks at all the files in that location.

How does Amazon Athena manage rename of columns?

everyone!
I'm working on a solution that intends to use Amazon Athena to run SQL queries from Parquet files on S3.
Those filed will be generated from a PostgreSQL database (RDS). I'll run a query and export data to S3 using Python's Pyarrow.
My question is: since Athena is schema-on-read, add or delete of columns on database will not be a problem...but what will happen when I get a column renamed on database?
Day 1: COLUMNS['col_a', 'col_b', 'col_c']
Day 2: COLUMNS['col_a', 'col_beta', 'col_c']
On Athena,
SELECT col_beta FROM table;
will return only data from Day 2, right?
Is there a way that Athena knows about these schema evolution or I would have to run a script to iterate through all my files on S3, rename columns and update table schema on Athena from 'col_a' to 'col_beta'?
Would AWS Glue Data Catalog help in any way to solve this?
I'll love to discuss more about this!
I recommend reading more about handling schema updates with Athena here. Generally Athena supports multiple ways of reading Parquet files (as well as other columnar data formats such as ORC). By default, using Parquet, columns will be read by name, but you can change that to reading by index as well. Each way has its own advantages / disadvantages dealing with schema changes. Based on your example, you might want to consider reading by index if you are sure new columns are only appended to the end.
A Glue crawler can help you to keep your schema updated (and versioned), but it doesn't necessarily help you to resolve schema changes (logically). And it comes at an additional cost, of course.
Another approach could be to use a schema that is a superset of all schemas over time (using columns by name) and define a view on top of it to resolve changes "manually".
You can set a granularity based on 'On Demand' or 'Time Based' for the AWS Glue crawler, so every time your data on the S3 updates a new schema will be generated (you can edit the schema on the data types for the attributes). This way your columns will stay updated and you can query on the new field.
Since AWS Athena reads data in CSV and TSV in the "order of the columns" in the schema and returns them in the same order. It does not use column names for mapping data to a column, which is why you can rename columns in CSV or TSV without breaking Athena queries.

Update BigQuery permanent external tables

I'm using BigQuery both to store data within "native" BigQuery tables and to query data stored in Google Cloud Storage. According to the documentation, it is possible to query external sources using two types of tables: permanent and temporary external tables.
Consider the following scenario: every day some parquet files are written in GCS, and with a certain frequency I want to do a JOIN between the data stored in a BigQuery table and the data stored in parquet files. If I create a permanent external table, and then I update the files below, is the content of the table automatically updated as well, or do I have to recreate it from the new files?
What are the best practices for such a scenario?
You don't have to re-create the external table again when you add new files into cloud storage bucket. The only exception is, if the number of columns is different in new file then the external table will not work as expected.
You need to use wildcard symbol to read files that matches to a specific pattern rather than providing a static file name. Example: "gs://bucketName/*.csv"

How to split data when archiving from AWS database to S3

For a project we've inherited we have a large-ish set of legacy data, 600GB, that we would like to archive, but still have available if need be.
We're looking at using the AWS data pipeline to move the data from the database to be in S3, according to this tutorial.
https://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-object-copyactivity.html
However, we would also like to be able to retrieve a 'row' of that data if we find the application is actually using a particular row.
Apparently that tutorial puts all of the data from a table into a single massive CSV file.
Is it possible to split the data up into separate files, with 100 rows of data in each file, and giving each file a predictable file name, such as:
foo_data_10200_to_10299.csv
So that if we realise we need to retrieve row 10239, we can know which file to retrieve, and download just that, rather than all 600GB of the data.
If your data is stored in CSV format in Amazon S3, there are a couple of ways to easily retrieve selected data:
Amazon Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL.
S3 Select (currently in preview) enables applications to retrieve only a subset of data from an object by using simple SQL expressions.
These work on compressed (gzip) files too, to save storage space.
See:
Welcome - Amazon Athena
S3 Select and Glacier Select – Retrieving Subsets of Objects

Data Transformation in AWS EMR without using Scala or Python

I have a star schema kind of database structure, like one fact table having all the id’s & skeys, whereas there are multiple dimension tables having the actual id, code, descriptions for the id’s referred in the fact table.
we are moving all these tables (fact & dimensions) to S3 (cloud) individually and each table data are split into multiple parquet files in S3 location (one S3 object per table)
Query: i need to perform a transformation on cloud (ie) i need strip of all the id’s & skeys referred in the fact table and replace it with the actual code that is residing in the dimension tables and create another file and store the final output back in S3 location. This file will later be consumed by Redshift for Analytics.
My Doubt:
Whats the best way to achieve this solution, cos i don’t need raw data (skeys & id’s) in Redshift for cost and storage optimization?
Do we need to first combine these split files (parquet) into one large file (ie) before performing the data transformation. Also, after data transformation, I am planning to save the final output file in parquet format, but the catch is, Redshift doesn’t allow copy of parquet file, so is there a workaround for that
I am not a hardcore programmer and want to avoid using scala/python in a EMR, but I am good at SQL, so is there a way to perform data transformation in cloud thru SQL thru EMR and save the output data into a file or files. Please advise
You should be able to run redshift type queries directly against your s3 parquet data by using amazon athena
some information on that
https://aws.amazon.com/blogs/big-data/10-best-practices-for-amazon-redshift-spectrum/