We want to execute a parameterized query in Athena using the JavaScript SDK from AWS.
It seems Athena's named queries may be the way to do this, but the documentation is very cryptic about how to go about it.
It would be great if someone could help us do the following:
What is the recommended way to avoid SQL injection in Athena?
Create a parameterized query like SELECT c FROM Country c WHERE c.name = :name
Pass the name parameter's value
Execute this query
Edit: this answer was written before Athena supported prepared statements.
Named queries are a weird feature of Athena that is not really useful for anything, unfortunately.
Athena does not support prepared statements the way many RDBMSs do. There are SQL libraries that support doing parameter expansion client side – Sequel for Ruby is one I have experience with – but unfortunately I can't give you a suggestion for JavaScript.
Escaping in Athena's SQL dialect isn't very complicated, however. In identifiers, double quotes need to be escaped as two double quotes, and in string literals, single quotes need to be escaped as two single quotes. Other data types just need to be validated, e.g. only digits for integers.
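As a rough illustration of those rules in JavaScript, a client-side escaping sketch might look like this (the helper names are my own, and this is not a substitute for a vetted library):

```js
// Minimal sketch of client-side escaping per the rules above.
function escapeIdentifier(name) {
  // In identifiers, double quotes are escaped as two double quotes.
  return '"' + String(name).replace(/"/g, '""') + '"';
}

function escapeStringLiteral(value) {
  // In string literals, single quotes are escaped as two single quotes.
  return "'" + String(value).replace(/'/g, "''") + "'";
}

function escapeInteger(value) {
  // Other data types just need to be validated, e.g. only digits for integers.
  if (!/^-?\d+$/.test(String(value))) {
    throw new Error(`Not a valid integer: ${value}`);
  }
  return String(value);
}

// Usage with the query from the question:
const name = "Côte d'Ivoire";
const sql = `SELECT c FROM Country c WHERE c.name = ${escapeStringLiteral(name)}`;
// => SELECT c FROM Country c WHERE c.name = 'Côte d''Ivoire'
```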
Also, keep in mind that in Athena, the dangers of SQL injection are different than in an RDBMS: Athena can't delete your data. If you set up your IAM permissions correctly the user can't even drop tables, and even if you for some reason run queries with a user that is allowed to drop tables, tables are just metadata and can easily be set up again.
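Also, given the edit at the top: if you are on a current SDK version, Athena's parameterized queries can be used from the JavaScript SDK v3 roughly like this. Treat it as a sketch – the region, workgroup, and bucket are assumptions, and the quoting convention for string parameters should be checked against the Athena docs for your engine version:

```js
const {
  AthenaClient,
  StartQueryExecutionCommand,
} = require('@aws-sdk/client-athena');

const athena = new AthenaClient({ region: 'us-east-1' }); // assumed region

async function findCountry(name) {
  const command = new StartQueryExecutionCommand({
    // Positional ? placeholders are filled from ExecutionParameters, so the
    // user-supplied value is never concatenated into the SQL string.
    QueryString: 'SELECT * FROM country WHERE name = ?',
    // String parameter values are conventionally passed quoted (e.g. "'GET'");
    // verify this against the Athena docs for your engine version.
    ExecutionParameters: [`'${name.replace(/'/g, "''")}'`],
    WorkGroup: 'primary', // assumed workgroup
    ResultConfiguration: {
      OutputLocation: 's3://my-athena-results-bucket/', // assumed bucket
    },
  });
  const { QueryExecutionId } = await athena.send(command);
  // Poll GetQueryExecution / GetQueryResults with this id to fetch rows.
  return QueryExecutionId;
}
```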
I've recently discovered that AWS Glue's "Custom SQL Transform" node uses HiveQL rather than SQL, which is significantly less functional than normal SQL transforms. I was wondering if there is a way to change the language library that it calls and uses, or if there are any alternatives. I need the ALTER TABLE SQL function, and HiveQL creates "views" rather than tables, which can't be operated on in my use case.
I preface all of this to say I’m still actively learning DynamoDB, and I think an answer to my question will help me understand a few things.
I have an analytics microservice that I’m pushing custom (internal) analytics events into a DynamoDB table. Columns in our Dynamo rows/items include data like:
User ID
IP Address
Event Action
Timestamp
Split Test ID
Split Test Value
One of the main questions we want to pull from this db is:
"How many users saw split test x with values y?"
I’m struggling to understand how I should index my database to account for this kind of request. I set up a “Keys Only” index targeting Split Test ID, and the query to gather these is fairly efficient, but it only pulls User ID and Split Test ID. Ideally I want an efficient query that returns the other associated values as well…
How do I achieve this? Do I need to be doing something much differently? Additionally, if any of my understanding of Dynamo, based on my explanations, sounds completely lacking in some regard, please point me in the right direction!
You're thinking of DynamoDB as a schema-less database, which it obviously is. However, that does not mean that a schema is not important. Schemas in NoSQL databases are usually more important than they are in SQL databases and they are usually less straightforward.
The most important factor in deciding how you will store your data is how you will access it. You will have to take into account all the ways you will want to access your data and ensure each of them is possible by creating the necessary data columns and indexes. In this case, if you want to know how many times two values are combined in a certain way, you could easily add a column that holds the combined values (e.g., splitId#splitValue) and use that in your indexes.
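To make that concrete, here is a minimal sketch with the JavaScript SDK v3; the table name, index name, and attribute names are all made up, and the GSI is assumed to project the attributes you need (ALL or INCLUDE) so queries return them directly rather than keys only:

```js
const { DynamoDBClient } = require('@aws-sdk/client-dynamodb');
const {
  DynamoDBDocumentClient,
  PutCommand,
  QueryCommand,
} = require('@aws-sdk/lib-dynamodb');

const ddb = DynamoDBDocumentClient.from(new DynamoDBClient({}));

// On write, store the combined attribute alongside the raw fields.
async function recordEvent(event) {
  await ddb.send(new PutCommand({
    TableName: 'analytics-events', // hypothetical table name
    Item: {
      ...event, // userId, ipAddress, eventAction, timestamp, splitTestId, splitTestValue
      splitIdValue: `${event.splitTestId}#${event.splitTestValue}`,
    },
  }));
}

// "How many users saw split test x with value y?" becomes a single Query
// against a GSI whose partition key is the combined attribute.
async function usersForSplitTest(splitTestId, splitTestValue) {
  const { Items } = await ddb.send(new QueryCommand({
    TableName: 'analytics-events',
    IndexName: 'splitIdValue-index', // hypothetical GSI name
    KeyConditionExpression: 'splitIdValue = :s',
    ExpressionAttributeValues: { ':s': `${splitTestId}#${splitTestValue}` },
  }));
  return Items;
}
```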
If you want to know more about advanced patterns and such, I advise you to watch this pretty famous re:Invent talk by Rick Houlihan or to read the DynamoDB book.
As a last note, I want to add that switching to an SQL database is usually not the solution. Picking NoSQL over SQL is usually based on non-functional requirements. There is a reason NoSQL databases are used in applications that require very low-latency retrieval from huge datasets, but as with everything, trade-offs are the name of the game.
I am trying to use pre-aggregations over Cloud SQL on Google Cloud Platform, but the database is denying access and giving the error "Statement violates GTID consistency".
Any help is appreciated.
Cube.js builds pre-aggregations with CREATE TABLE ... SELECT, but you are using MySQL on top of Google Cloud SQL with --enforce-gtid-consistency, which has limitations.
Since only transactionally safe statements can be logged, CREATE TABLE ... SELECT (and some other SQL) cannot be used, because that statement is actually logged as two separate events.
There are two ways to solve this issue:
1. Write pre-aggregations to an external database (the recommended way).
https://cube.dev/docs/pre-aggregations/#read-only-data-source-pre-aggregations
2. Use the undocumented flag loadPreAggregationWithoutMetaLock.
Attention: this flag is experimental and can be removed or changed in the future.
Take a look at the source code
You can pass it directly in the driver constructor (see the sketch below this list). This will produce two SQL statements to get around the limitation:
CREATE TABLE
INSERT INTO
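Here is roughly what that looks like in a cube.js configuration file; since the flag is undocumented and experimental, verify the exact option name against the source code linked above:

```js
// cube.js configuration file — a sketch, not a definitive setup.
const MySqlDriver = require('@cubejs-backend/mysql-driver');

module.exports = {
  driverFactory: () =>
    new MySqlDriver({
      host: process.env.CUBEJS_DB_HOST,
      database: process.env.CUBEJS_DB_NAME,
      user: process.env.CUBEJS_DB_USER,
      password: process.env.CUBEJS_DB_PASS,
      // Undocumented, experimental: splits pre-aggregation builds into a
      // CREATE TABLE followed by an INSERT INTO, which satisfies
      // --enforce-gtid-consistency.
      loadPreAggregationWithoutMetaLock: true,
    }),
};
```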
We want to use AWS Athena for analytics and segmentation. Our problem is that our data is schemaless: rows differ, with only some columns in common.
Is it possible to create table without defining all the columns?
When we query, we know the type (string/int) of each column, so if there is a way to define the types in the query, that would be great.
We can structure the data in any way needed to support schemaless use, and in any format: CSV / JSON.
Is Athena an option for schemaless uses?
There are many ways to use Athena for schemaless data, but you need to give specific examples of the scenarios you want to support more efficiently. In Athena you pay based on the data that you scan, and optimizing your data to minimize the data scanned is critical to making it a useful tool at scale.
The simplest way to get started, as you are learning the tool and the types of queries you can run on your data, is to define a table with a single column ("line") and then parse out the data you want using string functions, or JSON functions if the lines are in JSON format.
You will get good time performance if you have multiple files, but it will be expensive, as you need to scan all your data for every query. I suggest that you start with these queries as a good way to define your requirements. As usage grows, start optimizing the popular (and expensive) use cases with CTAS (Create Table As Select) commands that generate Parquet versions of the original raw data.
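Putting both steps together with the JavaScript SDK v3, a sketch might look like this; the database, buckets, table, and field names are all made up:

```js
const {
  AthenaClient,
  StartQueryExecutionCommand,
} = require('@aws-sdk/client-athena');

const athena = new AthenaClient({ region: 'us-east-1' }); // assumed region

// Small helper: fire a query and return its execution id.
async function runQuery(QueryString) {
  const { QueryExecutionId } = await athena.send(
    new StartQueryExecutionCommand({
      QueryString,
      QueryExecutionContext: { Database: 'analytics' }, // hypothetical database
      ResultConfiguration: { OutputLocation: 's3://my-results-bucket/' }, // assumed
    }),
  );
  return QueryExecutionId;
}

async function main() {
  // 1. One string column per raw line; the schema is applied at read time.
  await runQuery(`
    CREATE EXTERNAL TABLE IF NOT EXISTS raw_events (line string)
    LOCATION 's3://my-raw-data-bucket/events/'
  `);

  // 2. Parse and cast fields on the fly with JSON functions.
  await runQuery(`
    SELECT json_extract_scalar(line, '$.userId') AS user_id,
           CAST(json_extract_scalar(line, '$.value') AS integer) AS value
    FROM raw_events
  `);

  // 3. Once a use case stabilizes, materialize it as Parquet with CTAS so
  //    later queries scan far less data.
  await runQuery(`
    CREATE TABLE events_parquet
    WITH (format = 'PARQUET',
          external_location = 's3://my-curated-bucket/events/') AS
    SELECT json_extract_scalar(line, '$.userId') AS user_id,
           CAST(json_extract_scalar(line, '$.ts') AS timestamp) AS ts
    FROM raw_events
  `);
}

main().catch(console.error);
```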
You are welcome to read my blog post describing the strategy and tactics of a cloud environment using Athena and the other AWS tools around it.
The Cloud SQL documentation, specifically this link, https://cloud.google.com/sql/docs/mysql/features, mentions the following:
Unsupported statements for second-generation instances
The following statements are not compatible because second-generation instances use GTID replication:
CREATE TABLE ... SELECT statements
CREATE TEMPORARY TABLE statements within transactions
Transactions or statements that update both transactional and non-transactional tables
I often make updates to my data, so that would be a problem for my solution.
The Google Cloud documentation, at the same link mentioned above, refers to the official MySQL documentation, which mentions that the restriction on updates applies only to non-transactional engines.
Maybe I'm getting confused, but I really want to know whether I can update my data or not. Sorry for my poor English; obviously I haven't mastered it.
The way I interpret the line "Transactions or statements that update both transactional and nontransactional tables" is that a transaction or statement that updates BOTH transactional AND nontransactional tables at the same time is not supported. BOTH is the key word.
I would like to think that you can still update transactional tables and, in a separate operation, update non-transactional tables.
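To make the distinction concrete, here is a hypothetical sketch (the table names, engines, and use of the mysql2 client are all made up for illustration); assume orders is InnoDB (transactional) and legacy_log is MyISAM (non-transactional):

```js
const mysql = require('mysql2/promise');

async function main() {
  const conn = await mysql.createConnection({
    host: 'localhost',
    user: 'app',
    database: 'shop',
  });

  // NOT allowed under GTID consistency: a single statement (or transaction)
  // that updates both engines at once.
  // await conn.query(
  //   `UPDATE orders o JOIN legacy_log l ON l.order_id = o.id
  //    SET o.status = 'shipped', l.note = 'shipped'`
  // );

  // Fine: update each table in its own separate operation.
  await conn.query('UPDATE orders SET status = ? WHERE id = ?', ['shipped', 42]);
  await conn.query('UPDATE legacy_log SET note = ? WHERE order_id = ?', [
    'shipped',
    42,
  ]);

  await conn.end();
}

main().catch(console.error);
```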