I am trying to load some Avro-format data into BigQuery through the API, and I need some partitioning. According to the documentation here
https://cloud.google.com/bigquery/docs/reference/rest/v2/tables#TimePartitioning
ingestion-time partitioning, which uses the _PARTITIONTIME column, creates only one partition per day. Is it possible to create multiple partitions per day by using a timestamp field?
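For reference, here is a minimal sketch of what a load job with column-based time partitioning looks like via the Python client; the project, dataset, bucket, and the event_ts TIMESTAMP field are all hypothetical placeholders:

from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical identifiers; replace with your own project, dataset, and bucket.
table_id = "my-project.my_dataset.events"
uri = "gs://my-bucket/events/*.avro"

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.AVRO,
    # Partition on a TIMESTAMP column from the Avro data instead of ingestion time.
    time_partitioning=bigquery.TimePartitioning(
        type_=bigquery.TimePartitioningType.DAY,
        field="event_ts",
    ),
)

load_job = client.load_table_from_uri(uri, table_id, job_config=job_config)
load_job.result()  # wait for the load to finish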
Another option I can think of is range partitioning, documented here
https://cloud.google.com/bigquery/docs/reference/rest/v2/JobConfiguration#RangePartitioning
however, it is marked as experimental. Is it safe for production use?
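And a sketch of the range-partitioned variant, again with hypothetical names; the customer_id column and the bucket boundaries are just placeholders:

from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.AVRO,
    # Integer range partitioning on an INT64 column of the Avro data.
    range_partitioning=bigquery.RangePartitioning(
        field="customer_id",
        range_=bigquery.PartitionRange(start=0, end=100000, interval=1000),
    ),
)

load_job = client.load_table_from_uri(
    "gs://my-bucket/events/*.avro",
    "my-project.my_dataset.events_by_customer",
    job_config=job_config,
)
load_job.result()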
I have an existing Athena table (w/ hive-style partitions) that's using the Avro SerDe. When I first created the table, I declared the Athena schema as well as the Athena avro.schema.literal schema per AWS instructions. Everything has been working great.
I now wish to add new columns that will apply going forward but not be present on the old partitions. I tried a basic ADD COLUMNS command that claims to succeed but has no impact on SHOW CREATE TABLE. I then wondered if I needed to change the Avro schema declaration as well, which I attempted to do but discovered that ALTER TABLE SET SERDEPROPERTIES DDL is not supported in Athena.
AWS claims I should be able to add columns when using Avro, but at this point I'm unsure how to do it. Even if I'm willing to drop the table metadata and redeclare all of the partitions, I'm not sure how to do it right since the schema is different on the historical partitions.
Looking for high-level guidance on the steps to be taken. Documentation is scant, and Athena seems to lack support for commands that are referenced for this same scenario in the vanilla Hive world. Thanks for any insights.
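In case it helps frame the discussion, here is a rough, hedged sketch of the drop-and-redeclare route mentioned above, driven through boto3. Every name (database, table, columns, bucket, partition key) is made up; the key assumption is that new fields get a null default in avro.schema.literal so that Avro schema resolution can still read the historical partitions:

import time
import boto3

athena = boto3.client("athena")
RESULTS = "s3://my-bucket/athena-results/"  # hypothetical query output location

def run(statement):
    """Submit a statement and block until Athena finishes it."""
    qid = athena.start_query_execution(
        QueryString=statement,
        ResultConfiguration={"OutputLocation": RESULTS},
    )["QueryExecutionId"]
    while True:
        state = athena.get_query_execution(QueryExecutionId=qid)["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            return state
        time.sleep(1)

# Superset table: old columns plus the new one, which defaults to null in the
# Avro schema so partitions written before the change still resolve cleanly.
create_table = r"""
CREATE EXTERNAL TABLE my_db.events (
  id bigint,
  payload string,
  new_col string
)
PARTITIONED BY (dt string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
WITH SERDEPROPERTIES ('avro.schema.literal'='{
  "type": "record", "name": "events", "fields": [
    {"name": "id", "type": "long"},
    {"name": "payload", "type": "string"},
    {"name": "new_col", "type": ["null", "string"], "default": null}
  ]}')
STORED AS AVRO
LOCATION 's3://my-bucket/events/'
"""

run("DROP TABLE IF EXISTS my_db.events")  # external table: the S3 data is untouched
run(create_table)
run("MSCK REPAIR TABLE my_db.events")     # re-register the hive-style partitions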
What is the best way to implement a log-like table in Redshift?
Example: I have a table where I periodically put some metrics.
I want to purge this table when data is older than 1 month. The table contains a timestamp field that I can use for this.
I could do this with a job that runs daily, purging data older than X. However, I would like to know if there are other built-in options.
Is there a way to define an automatic purge mechanism in Redshift, either by a condition on a field, or by number of records, or by table size?
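For reference, the daily-job version I have in mind is something like this; the cluster endpoint, credentials, and the metrics table with its created_at timestamp column are all hypothetical:

import psycopg2

# Hypothetical connection details and table/column names.
conn = psycopg2.connect(
    host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",
    port=5439,
    dbname="analytics",
    user="loader",
    password="...",  # placeholder
)

with conn, conn.cursor() as cur:
    # DATEADD/GETDATE are Redshift SQL functions; drop anything older than a month.
    cur.execute("DELETE FROM metrics WHERE created_at < DATEADD(month, -1, GETDATE());")

conn.close()
# Note: deleted rows are only marked for deletion; a separate VACUUM
# (run with autocommit enabled) reclaims the disk space.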
Hi everyone!
I'm working on a solution that uses Amazon Athena to run SQL queries over Parquet files on S3.
Those files will be generated from a PostgreSQL database (RDS). I'll run a query and export the data to S3 using Python's PyArrow.
My question is: since Athena is schema-on-read, adding or deleting columns in the database will not be a problem... but what will happen when a column gets renamed in the database?
Day 1: COLUMNS['col_a', 'col_b', 'col_c']
Day 2: COLUMNS['col_a', 'col_beta', 'col_c']
On Athena,
SELECT col_beta FROM table;
will return only data from Day 2, right?
Is there a way for Athena to know about this schema evolution, or would I have to run a script to iterate through all my files on S3, rename the columns, and update the table schema on Athena from 'col_a' to 'col_beta'?
Would AWS Glue Data Catalog help in any way to solve this?
I'd love to discuss this further!
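For context, the export step I have in mind looks roughly like this; the connection string, query, bucket, and object key are all placeholders:

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
from pyarrow import fs
from sqlalchemy import create_engine

# Hypothetical RDS connection and query.
engine = create_engine("postgresql://user:password@my-rds-host:5432/mydb")
df = pd.read_sql("SELECT col_a, col_b, col_c FROM metrics", engine)

table = pa.Table.from_pandas(df, preserve_index=False)

# Write the Parquet file straight to S3 (bucket/key, no s3:// prefix here).
s3 = fs.S3FileSystem(region="us-east-1")
pq.write_table(table, "my-bucket/exports/metrics/part-0.parquet", filesystem=s3)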
I recommend reading more about handling schema updates with Athena here. Generally, Athena supports multiple ways of reading Parquet files (as well as other columnar data formats such as ORC). By default, Parquet columns are read by name, but you can change that to reading by index. Each approach has its own advantages and disadvantages when dealing with schema changes. Based on your example, you might want to consider reading by index if you are sure new columns are only ever appended to the end.
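As an illustration, switching a Parquet table to index-based column access is a table-level property. A hedged sketch, issued through boto3; the database, table, columns, and location are hypothetical, and the property name is the Hive Parquet SerDe setting the schema-update doc refers to:

import boto3

athena = boto3.client("athena")

# Hypothetical table: read Parquet columns by position instead of by name,
# so a rename such as col_b -> col_beta keeps mapping to the same physical column.
ddl = """
CREATE EXTERNAL TABLE my_db.metrics (
  col_a string,
  col_beta string,
  col_c string
)
STORED AS PARQUET
LOCATION 's3://my-bucket/exports/metrics/'
TBLPROPERTIES ('parquet.column.index.access' = 'true')
"""

athena.start_query_execution(
    QueryString=ddl,
    ResultConfiguration={"OutputLocation": "s3://my-bucket/athena-results/"},
)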
A Glue crawler can help you to keep your schema updated (and versioned), but it doesn't necessarily help you to resolve schema changes (logically). And it comes at an additional cost, of course.
Another approach could be to use a schema that is a superset of all schemas over time (using columns by name) and define a view on top of it to resolve changes "manually".
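A sketch of that last idea with made-up names: the table declares both the old and the new column name (each file populates only one of them, the other reads as NULL), and a view coalesces them into a single logical column:

import boto3

athena = boto3.client("athena")

# Hypothetical view over a superset table that has both col_b and col_beta.
view_ddl = """
CREATE OR REPLACE VIEW my_db.metrics_v AS
SELECT
  col_a,
  COALESCE(col_beta, col_b) AS col_beta,  -- resolve the rename "manually"
  col_c
FROM my_db.metrics
"""

athena.start_query_execution(
    QueryString=view_ddl,
    ResultConfiguration={"OutputLocation": "s3://my-bucket/athena-results/"},
)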
You can schedule the AWS Glue crawler as 'On demand' or time-based, so every time your data on S3 is updated a new schema version is generated (you can edit the data types of the attributes in the schema). This way your columns stay up to date and you can query the new field.
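For example, a time-based crawler might be created like this; the name, IAM role, S3 path, and cron expression are placeholders:

import boto3

glue = boto3.client("glue")

# Hypothetical names; the schedule re-crawls the bucket nightly so new or
# renamed columns show up as a new schema version in the Data Catalog.
glue.create_crawler(
    Name="metrics-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",
    DatabaseName="my_db",
    Targets={"S3Targets": [{"Path": "s3://my-bucket/exports/metrics/"}]},
    Schedule="cron(0 2 * * ? *)",  # daily at 02:00 UTC; omit for on-demand runs
)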
AWS Athena reads CSV and TSV data in the order of the columns in the schema and returns them in the same order. It does not use column names to map data to a column, which is why you can rename columns in CSV or TSV without breaking Athena queries.
Google BigQuery (BQ) allows you to create a partition using timestamp or date types only.
99% of my data has a very clear selector, idClient. I've created views for my customers with a predicate like idClient = code so that privacy is guaranteed.
The problem with this strategy is that some customers have 5M rows and others 200K, and since BQ has no indexes, every customer's query ends up processing the other customers' data as well (and the costs are rising).
I intend to create a timestamp field where each customer gets a distinct timestamp value, repeated on every insert into each customer-sensitive table, so that I can filter by that fixed timestamp as I would with a standard ID.
Does this make any sense? If BQ were an indexed database I'd be concerned about skewed data, but since it always does a full table scan, I think I'd have only benefits and no downsides.
The solution to your problem is to add a clustering field to your table, which is roughly equivalent to an index in other databases.
This link provides the basics of how to use clustering fields:
Clustering can improve the performance of certain types of queries, such as queries that use filter clauses and queries that aggregate data. When data is written to a clustered table by a query job or a load job, BigQuery sorts the data using the values in the clustering columns.
Note: when using a clustering field, BigQuery's dryRun doesn't show the cost improvement, which can only be seen post-execution.
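A minimal sketch with the Python client, assuming a hypothetical table that is partitioned by a created_at timestamp and clustered on idClient (the schema here is invented for illustration):

from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical schema; the table is partitioned by day and clustered on
# idClient so per-customer filters scan fewer storage blocks.
table = bigquery.Table(
    "my-project.my_dataset.customer_events",
    schema=[
        bigquery.SchemaField("idClient", "INTEGER"),
        bigquery.SchemaField("created_at", "TIMESTAMP"),
        bigquery.SchemaField("metric", "FLOAT"),
    ],
)
table.time_partitioning = bigquery.TimePartitioning(
    type_=bigquery.TimePartitioningType.DAY,
    field="created_at",
)
table.clustering_fields = ["idClient"]
client.create_table(table)

Queries that then filter on idClient (for example inside the per-customer views) only read the blocks whose clustering range can match, which is where the cost reduction comes from.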
My goal is to take daily snapshots of an RDS table and put it in a DynamoDB table. The table should only contain data from a single day.
For this I have a Data Pipeline set up to query an RDS table and publish the results to S3 in CSV format.
Then a HiveActivity imports this CSV into a DynamoDB table by creating external tables for the file and an existing DynamoDB table.
This works great, but older entries from the previous day still exist in the DynamoDB table. I want to do this within Data Pipeline if at all possible. I need to:
1) Find a way to clear the DynamoDB table, or at least drop/recreate it (see the sketch after this question), or
2) Include an extra column of the snapshot date and find a way to clear out all older entries.
Any ideas on how I can do this?
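On option 1, a minimal boto3 sketch of the drop-and-recreate route (the table name and key schema are hypothetical), in case it can be wrapped in a pipeline activity or a small script:

import boto3

dynamodb = boto3.client("dynamodb")
TABLE = "daily_snapshot"  # hypothetical

# Drop the table and wait until it is gone...
dynamodb.delete_table(TableName=TABLE)
dynamodb.get_waiter("table_not_exists").wait(TableName=TABLE)

# ...then recreate it empty, ready for today's import.
dynamodb.create_table(
    TableName=TABLE,
    AttributeDefinitions=[{"AttributeName": "id", "AttributeType": "S"}],
    KeySchema=[{"AttributeName": "id", "KeyType": "HASH"}],
    BillingMode="PAY_PER_REQUEST",
)
dynamodb.get_waiter("table_exists").wait(TableName=TABLE)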
You can use DynamoDB Time to Live (TTL), which allows you to set an expiration time after which items are automatically deleted from the DynamoDB table. TTL is very useful for cases where data loses its relevance after a specific time period; in your case that can be the start of the next day.
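For instance, enabling TTL on a hypothetical epoch-seconds attribute called expires_at, and stamping each imported item with the start of the next day, might look like this:

import boto3
from datetime import datetime, timedelta, timezone

dynamodb = boto3.client("dynamodb")

# One-time setup: tell DynamoDB which attribute holds the expiry timestamp.
dynamodb.update_time_to_live(
    TableName="daily_snapshot",  # hypothetical table
    TimeToLiveSpecification={"Enabled": True, "AttributeName": "expires_at"},
)

# When writing items, set expires_at to the start of the next day (epoch seconds).
tomorrow = (datetime.now(timezone.utc) + timedelta(days=1)).replace(
    hour=0, minute=0, second=0, microsecond=0
)
dynamodb.put_item(
    TableName="daily_snapshot",
    Item={
        "id": {"S": "row-123"},
        "expires_at": {"N": str(int(tomorrow.timestamp()))},
    },
)

Keep in mind that TTL deletion is best-effort and can lag behind the expiry time, so queries that must exclude yesterday's rows should also filter on the attribute.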