I've recently discovered that AWS Glue's "Custom SQL Transform" node uses HiveQL rather than standard SQL, which makes it significantly less functional than normal SQL transforms. Is there a way to change the language library that it calls, or are there any alternatives? I need the ALTER TABLE statement, and HiveQL creates "views" rather than tables, which I cannot operate on in my use case.
I am building an ETL pipeline using primarily state machines, Athena, S3, and the Glue catalog. In general things work in the following way:
1. A table, partitioned by "version", exists in the Glue Catalog. The table represents the output destination of some ETL process.
2. A step function (managed by some other process) executes "INSERT INTO" Athena queries. The step function supplies a "version" that is used as part of the "INSERT INTO" query so that new data can be appended to the table defined in (1). The table contains all "versions": it's a historical table that grows over time.
My question is: What is a good way of exposing a view/table that allows someone (or something) to query only the latest "version" partition for a given historically partitioned table?
I've looked into other table types AWS offers, including Governed tables and Iceberg tables. Each seems to have some incompatibility with our existing or planned future architecture:
Governed tables do not support writes via Athena INSERT queries. Only Glue ETL/Spark seems to be supported at the moment.
Iceberg tables do not support Lake Formation data filters (which we'd like to use in the future to control data access).
Iceberg tables also seem to have poor performance. Anecdotally, it can take several seconds to insert a small handful of rows into a given Iceberg table. I'd worry about future performance when we want to insert a million rows.
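One commonly used shape for the "latest version only" view is a view that filters on the maximum partition value. The following is a minimal sketch of that idea, using Python's built-in sqlite3 as a stand-in for Athena. The table name `etl_output`, the view name `etl_output_latest`, and the sample rows are all hypothetical, and whether Athena prunes partitions efficiently for this kind of subquery is something you would need to measure on your own data.

```python
import sqlite3


def make_db():
    """Build an in-memory stand-in for the historical table (hypothetical names)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE etl_output (version INTEGER, payload TEXT)")
    conn.executemany(
        "INSERT INTO etl_output (version, payload) VALUES (?, ?)",
        [(1, "a"), (1, "b"), (2, "c"), (3, "d"), (3, "e")],
    )
    # The view filters on MAX(version), so it follows newly inserted
    # versions without having to be redefined after each load.
    conn.execute(
        """
        CREATE VIEW etl_output_latest AS
        SELECT * FROM etl_output
        WHERE version = (SELECT MAX(version) FROM etl_output)
        """
    )
    return conn


def latest_rows(conn):
    """Return the (version, payload) rows visible through the view."""
    return conn.execute(
        "SELECT version, payload FROM etl_output_latest ORDER BY payload"
    ).fetchall()
```

Consumers query only the view. After a new version is appended to the underlying table, the same view returns the new rows without any extra step.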
I have a table in BigQuery containing consumers' reviews. Some of them are in local languages, and I need to use a translation API to translate them and add a new column to the existing table containing the translated reviews. Can I automate this task, e.g. by using the Google Translate API in BigQuery?
An alternative solution: if the customer reviews contain only a limited set of distinct comments, you can create a BigQuery function that replaces each value with its translation.
Sample code is available in a GitHub repository.
If you want to use an external API in BigQuery, like a Language Translation API, you can use Remote Functions (a recent release).
In this GitHub repo you can see how to wrap the Azure Translator API (the same way you can use the Google Translate API) into a SQL function and use it in your queries.
Once you have created the translation SQL function, you can write an UPDATE statement (and run it periodically using scheduled queries) to achieve what you want.
UPDATE mytable SET translated_review_text=translation_function(review_text) WHERE translated_review_text IS NULL
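The `WHERE translated_review_text IS NULL` filter is what makes this statement safe to schedule: each run only touches rows that have not been translated yet. Here is a minimal sketch of that behaviour, with sqlite3 standing in for BigQuery and a hypothetical `fake_translate` Python function registered under the name `translation_function` in place of a real remote function.

```python
import sqlite3

UPDATE_SQL = """
UPDATE mytable
SET translated_review_text = translation_function(review_text)
WHERE translated_review_text IS NULL
"""


def fake_translate(text):
    """Hypothetical stand-in for the remote translation function."""
    return None if text is None else "[en] " + text


def make_db():
    """Create a small table with one untranslated and one translated row."""
    conn = sqlite3.connect(":memory:")
    # Register the Python stub as a one-argument SQL function.
    conn.create_function("translation_function", 1, fake_translate)
    conn.execute(
        "CREATE TABLE mytable (review_text TEXT, translated_review_text TEXT)"
    )
    conn.executemany(
        "INSERT INTO mytable VALUES (?, ?)",
        [("hola", None), ("bonjour", "hello")],
    )
    conn.commit()
    return conn


def run_scheduled_update(conn):
    """Run the UPDATE once and return how many rows it changed."""
    cursor = conn.execute(UPDATE_SQL)
    conn.commit()
    return cursor.rowcount
```

Running `run_scheduled_update` twice in a row changes one row the first time and none the second, so already translated reviews are never sent to the translation function again.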
I am trying to use pre-aggregations with Cloud SQL on Google Cloud Platform, but the database denies access and returns the error "Statement violates GTID consistency".
Any help is appreciated.
Cube.js builds pre-aggregations with CREATE TABLE ... SELECT, but you are using MySQL on Google Cloud SQL with --enforce-gtid-consistency, which restricts the statements that can be executed.
Since only transactionally safe statements can be logged, CREATE TABLE ... SELECT (and some other SQL statements) cannot be used, because this statement is actually logged as two separate events.
There are two ways to solve this issue:
1. Store pre-aggregations in an external database (the recommended way).
https://cube.dev/docs/pre-aggregations/#read-only-data-source-pre-aggregations
2. Use the undocumented flag loadPreAggregationWithoutMetaLock.
Attention: this flag is experimental and can be removed or changed in the future.
Take a look at the source code
You can pass it directly in the driver constructor. This will produce two SQL statements that work around the limitation:
CREATE TABLE
INSERT INTO
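To illustrate what that two-statement form looks like, here is a minimal sketch (not Cube.js's actual implementation) that rewrites a CREATE TABLE ... SELECT into a separate CREATE TABLE and INSERT INTO ... SELECT. The helper name `split_create_table_as`, the table names, and the columns are hypothetical, and sqlite3 is used only so the example can run. Identifiers are interpolated without quoting, so it assumes trusted input.

```python
import sqlite3


def split_create_table_as(table, column_defs, select_sql):
    """Return [CREATE TABLE, INSERT INTO ... SELECT] for the given query.

    column_defs is a list of (column_name, sql_type) pairs describing
    the columns that select_sql produces, in order.
    """
    columns = ", ".join(f"{name} {sql_type}" for name, sql_type in column_defs)
    names = ", ".join(name for name, _ in column_defs)
    return [
        f"CREATE TABLE {table} ({columns})",
        f"INSERT INTO {table} ({names}) {select_sql}",
    ]


def build_rollup():
    """Run both statements against a small in-memory orders table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (status TEXT, amount INTEGER)")
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?)",
        [("new", 10), ("new", 5), ("done", 7)],
    )
    statements = split_create_table_as(
        "orders_rollup",
        [("status", "TEXT"), ("total", "INTEGER")],
        "SELECT status, SUM(amount) FROM orders GROUP BY status",
    )
    for statement in statements:
        conn.execute(statement)
    return conn, statements
```

Each statement is logged as a single event, which is why this form is accepted where the combined CREATE TABLE ... SELECT is rejected.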
Thanks
At my organization, we are using a stack of AWS S3, AWS Glue, and Athena to drive some reporting of internal metrics. In general, this stack is great for quickly setting up reporting on raw data (stored in S3). The problem we've run into is what to do when we need to update data that's already stored in S3. For example, we may want to find the values in a column that contain a certain string and update them.
Unlike a database, we can't just run a query to update all the existing data. I've tried to see if we can utilize Glue Jobs to accomplish this, but from my limited understanding, it doesn't seem like it's meant to do ETL from a bucket back to the same bucket.
The only thing I can think of is to write a custom tool that iterates through an S3 bucket, loads each file, applies the transformation, and puts it back, overwriting the original. It seems there has to be a better way though.
Updates are not handled natively in a traditional Hive-like warehousing solution, which I consider Athena to be. A common solution is an engineering workaround where you "insert overwrite" a partition (borrowing Hive syntax, possible in Presto and hopefully also possible in Athena, which is based on Presto).
Other solutions include creating new tables and atomically replacing a view, which users are supposed to query, instead of querying the underlying table(s) directly.
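The view-replacement approach can be sketched as follows. This is a minimal illustration using sqlite3 as a stand-in. The table names `metrics_v1` and `metrics_v2`, the view name `metrics`, and the string being corrected are hypothetical. SQLite has no CREATE OR REPLACE VIEW, so the swap is done as DROP plus CREATE inside one explicit transaction. In an engine that supports CREATE OR REPLACE VIEW, that single statement would play the same role.

```python
import sqlite3


def publish(conn, view, table):
    """Point `view` at `table` inside one explicit transaction."""
    conn.execute("BEGIN")
    try:
        conn.execute(f"DROP VIEW IF EXISTS {view}")
        conn.execute(f"CREATE VIEW {view} AS SELECT * FROM {table}")
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")
        raise


def build_demo():
    """Publish v1, build a corrected v2 copy, then swap the view to v2."""
    # isolation_level=None puts the connection in autocommit mode, so the
    # BEGIN/COMMIT issued in publish() are the only transaction boundaries.
    conn = sqlite3.connect(":memory:", isolation_level=None)
    conn.execute("CREATE TABLE metrics_v1 (team TEXT, value INTEGER)")
    conn.executemany(
        "INSERT INTO metrics_v1 VALUES (?, ?)",
        [("ops-old", 1), ("dev", 2)],
    )
    publish(conn, "metrics", "metrics_v1")
    # Write the corrected data to a new table instead of updating in place.
    conn.execute(
        "CREATE TABLE metrics_v2 AS "
        "SELECT REPLACE(team, 'ops-old', 'ops') AS team, value FROM metrics_v1"
    )
    publish(conn, "metrics", "metrics_v2")
    return conn
```

Users keep querying `metrics` throughout. The original table is left untouched, so reverting is a matter of publishing `metrics_v1` again.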
As this is a common problem, there are also some ready-to-use solutions to it, but I do not know which of them, if any, work with Athena. They certainly work with Presto (Presto SQL):
Hive ACID transactional tables (updates currently require a Hive runtime)
Delta Lake (open sourced by Databricks; updates currently require a Spark runtime)
Hudi (I know little about this one)
We want to use AWS Athena for analytics and segmentation. Our problem is that our data is schemaless: rows differ from one another, with only some columns in common.
Is it possible to create table without defining all the columns?
When we query, we know the type (string/int) of each column, so it would be great if there were a way to define the types in the query.
We can structure the data in any way needed to support schemaless use, and in any format: CSV or JSON.
Is Athena an option for schemaless uses?
There are many ways to use Athena for schemaless data, but you need to give specific examples of the scenarios that you want to support efficiently. In Athena you pay based on the data that you scan, so optimizing your data to minimize the scan is critical to making it a useful tool at scale.
The simplest way to get you started as you are learning the tool, and the types of queries that you can run on your data, is to define a table with a single column ("line"), and then do the parsing of the data that you want using string functions, or JSON functions if the lines are in JSON format.
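The idea of keeping each row as one raw "line" and deciding the field and its type only at query time can be sketched in plain Python. This is an analogy for what Athena's JSON functions (for example `json_extract_scalar`) combined with a cast would do in SQL, not Athena code. The sample lines and the helper name `extract` are hypothetical.

```python
import json

# Rows with different shapes, stored as raw text lines.
RAW_LINES = [
    '{"user": "a", "age": "31", "plan": "pro"}',
    '{"user": "b", "clicks": 7}',
    "not json at all",
]


def extract(line, key, cast=str):
    """Parse one raw line and return cast(value) for key.

    Returns None when the line is not a JSON object, the key is
    missing, or the value cannot be converted with `cast`.
    """
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        return None
    if not isinstance(record, dict) or record.get(key) is None:
        return None
    try:
        return cast(record[key])
    except (TypeError, ValueError):
        return None


ages = [extract(line, "age", int) for line in RAW_LINES]
print(ages)  # [31, None, None]
```

The type is supplied by the caller for each query, which matches the situation described in the question: the schema is not declared up front, but the type of each column is known when querying.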
You will get good query times if you have multiple files, but it will be expensive, as you need to scan all your data for every query. I suggest that you start with these queries as a good way to define your requirements. As usage grows, start optimizing the use cases with CTAS (Create Table As Select) commands that generate Parquet versions of the original raw data to support the more popular (and expensive) use cases.
You are welcome to read my blog post that describes the strategy and tactics of a cloud environment using Athena and the other AWS tools around it.
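As a rough sketch of what such a CTAS command might look like, the helper below assembles an Athena-style statement that writes a Parquet copy of selected fields. The function name `build_ctas`, the table names, the S3 location, and the extracted fields are all hypothetical. The statement is only built as a string here, so check it against the Athena CTAS documentation before running it.

```python
def build_ctas(new_table, source_table, select_expressions, external_location):
    """Build a CTAS statement that stores the query result as Parquet.

    Identifiers and the location are interpolated as-is, so this
    sketch assumes trusted, already validated input.
    """
    select_list = ", ".join(select_expressions)
    return (
        f"CREATE TABLE {new_table} "
        f"WITH (format = 'PARQUET', external_location = '{external_location}') "
        f"AS SELECT {select_list} FROM {source_table}"
    )


# Hypothetical example: materialize two typed columns from the raw "line" table.
statement = build_ctas(
    "events_parquet",
    "events_raw",
    [
        "json_extract_scalar(line, '$.user') AS user_id",
        "CAST(json_extract_scalar(line, '$.age') AS integer) AS age",
    ],
    "s3://example-bucket/events_parquet/",
)
print(statement)
```

Queries that only need `user_id` and `age` can then read the smaller columnar copy instead of scanning every raw line.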