How many values are allowed in an IN clause in Amazon Athena

Is there a limit on the number of values that can be passed in the IN clause of an Amazon Athena query? I tried to look that up in the documentation but could not find any reference. Thank you.
For example:
SELECT * FROM tablename WHERE columnName IN (1, 2, 3, ...);
How many values are allowed in the IN clause of the above statement?

The only limitation is the length of the query string, which is limited to 262,144 bytes:
https://docs.aws.amazon.com/athena/latest/ug/service-limits.html
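If the list of values is long enough to approach that limit, a common workaround (a sketch only; values_table and its id column are hypothetical) is to load the values into their own table and join against it instead of inlining them in the query text:
-- values_table holds one row per value that would otherwise go into the IN list
SELECT t.*
FROM tablename t
JOIN values_table v
    ON t.columnName = v.id;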

Related

Athena ignores LIMIT in some queries

I have a table with a lot of partitions (something that we're working on reducing)
When I query:
SELECT * FROM mytable LIMIT 10
I get:
"HIVE_EXCEEDED_PARTITION_LIMIT: Query over table 'mytable' can potentially read more than 1000000 partitions"
Why isn't the "LIMIT 10" part of the query sufficient for Athena to return a result without reading more that 1 or 3 partitions ?
ANSWER:
During the query planning phase, Athena attempts to list all partitions potentially needed to answer the query.
Since Athena doesn't know which partitions actually contain data (i.e., which are non-empty), it adds all partitions to the list.
Athena plans a query and then executes it. During planning it lists the partitions and all the files in those partitions. However, it does not know anything about the files, how many records they contain, etc.
When you say LIMIT 10 you're telling Athena you want at most 10 records in the result, and since you don't have any grouping or ordering you want 10 arbitrary records.
However, during the planning phase Athena can't know which partitions have files in them, nor how many of those files it will need to read to find 10 records. Without listing the partition locations it can't know whether the partitions are empty, and without reading the files it can't know whether they contain any records.
Therefore Athena first has to get the list of partitions, then list each partition's location on S3, even if you say you only want 10 arbitrary records.
In this case there are so many partitions that Athena short-circuits and says that you probably didn't mean to run this kind of query. If the table had fewer partitions Athena would execute the query and each worker would read as little as possible to return 10 records and then stop – but each worker would produce 10 records, because a worker can't assume that other workers would return any records. Finally the coordinator picks 10 records out of all the results from all workers to return as the final result.
LIMIT works on the output only, if I am not wrong, so the query will still read everything but only return 10 records.
Try to limit the data using a WHERE condition; that should solve the issue.
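For instance, if the table is partitioned by a date column (assumed here to be called dt; substitute the real partition key), filtering on it lets Athena prune partitions at planning time:
SELECT *
FROM mytable
WHERE dt = '2020-01-01'   -- restricts planning to a single partition
LIMIT 10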
I think Athena's workers try to read up to a maximum number of partitions (relative to how many partitions the table has) to get that arbitrary chunk of data, and they stop once the query is satisfied (which in your case means once the LIMIT is reached).
In your case, it doesn't even start that process because too many partitions are involved. Therefore, if Athena won't plan your arbitrary data selection query, you have to plan it explicitly and hand it over to the execution engine.
Something like:
SELECT *
FROM mytable
WHERE partition_column IN (
    -- LIMIT must be a constant in Athena, so pick a fixed number of partition values
    SELECT partition_column FROM mytable LIMIT 10
)
LIMIT 100

Athena Partitioning limitation and how to best approach the problem I am describing

So here is what is happening:
I have a Lambda function which reads a file of a certain size and pushes it to a server (this is the limitation, as the server has limited TPS).
The Lambda function therefore cannot read a large file on S3.
I am doing a CTAS and calculating the size for the buckets. So, for example, if I have S = 140M records and I need n records per file, my bucket count is S/n (total records divided by records per file).
However, Athena complains that it cannot write more than 100 partitions (which is confusing, since I am doing bucketing and not partitioning), and my bucket count comes to about 75K.
How do I handle this situation? Some things I can think of:
Have a Spark job which does repartitioning again.
Manipulate Glue to somehow allow more than 100 partitions
Neither approach appeals to me. There must be a simpler way.
CTAS queries are limited to writing at most 100 partitions. You need to split your query into a first CTAS writing up to 100 partitions, followed by INSERT INTO queries that each also write up to 100 partitions. An alternative approach, though, is to create the table directly in Glue and just do INSERT INTO queries (each writing at most 100 partitions).
Doc here https://docs.aws.amazon.com/athena/latest/ug/ctas-insert-into.html
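A rough sketch of that split, assuming a table partitioned by a single column part_col; all names, locations, and value ranges below are placeholders:
CREATE TABLE target_table
WITH (
    format = 'PARQUET',
    external_location = 's3://my-bucket/target/',
    partitioned_by = ARRAY['part_col']
) AS
SELECT col1, col2, part_col
FROM source_table
WHERE part_col BETWEEN 1 AND 100;       -- first batch: at most 100 partitions

INSERT INTO target_table
SELECT col1, col2, part_col
FROM source_table
WHERE part_col BETWEEN 101 AND 200;     -- each INSERT INTO also writes at most 100 partitions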

AWS Glue/Athena - S3 - Table partitioning

Let's assume I have an external table registered in AWS Glue, stored in S3 and queried by Athena.
The best practice is to partition the data. So in a normal case, I have two seemingly equivalent options:
1. /data/_path/yyyy/mm/dd/col1/col2/data.parquet
2. /data/_path/col1/col2/yyyy/mm/dd/data.parquet
I'd assume that either way the data scanned/queried by Athena is the same for a given col1 and/or col2.
But which one is preferred and why?
The preferred way is to go from the least granular variable to the most granular variable.
Usually that is the first option, because you have fewer years than months, fewer months than days, fewer days than col1 values, and fewer col1 values than col2 values.
But if you have a requirement that col1 and col2 come first, and then the years, that is not a problem either.
If the data grows quickly along yyyy/mm/dd, then option #1.
For example, data is generated on every single day of the month, from 01 to 30 (or 29, 31), so the pattern works well.
Or, as another example, data is generated at the hour level, so the pattern yyyy/mm/dd/hh would be a good fit.
If the data grows quickly along col1/col2, then option #2.
For example, the data varies by col1 (class id) / col2 (student id), and the data belonging to each student id follows yyyy/mm/dd, so you can go with col1/col2/yyyy/mm/dd.
Or, put another way: if your queries filter on col1/col2 more often, option #2 is a good option.
As for comparing the performance of the two options, I don't think the difference is significant.
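For illustration, option #1's ordering corresponds to declaring the partition columns in that same order when registering the table; a minimal sketch, assuming hypothetical column names (for non-Hive-style paths like these, the partition values would then be added with ALTER TABLE ADD PARTITION or partition projection):
CREATE EXTERNAL TABLE my_table (
    value double
)
PARTITIONED BY (year string, month string, day string, col1 string, col2 string)
STORED AS PARQUET
LOCATION 's3://my-bucket/data/_path/';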

Which query is faster: TOP X or LIMIT X when using ORDER BY in Amazon Redshift

Three options, on a table of events that are inserted with a timestamp.
Which query is faster/better?
Select a,b,c,d,e.. from tab1 order by timestamp desc limit 100
Select top 100 a,b,c,d,e.. from tab1 order by timestamp desc
Select top 100 a,b,c,d,e.. from tab1 order by timestamp desc limit 100
When you ask a question like that, the EXPLAIN syntax is helpful. Just add this keyword at the beginning of your query and you will see the query plan. In cases 1 and 2 the plans will be absolutely identical. They are variations of SQL syntax, but the internal SQL interpreter should produce the same query plan, which determines what operations are physically performed.
More about EXPLAIN command here: EXPLAIN in Redshift
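For example, using the queries from the question (column names as given there):
EXPLAIN
SELECT a, b, c, d, e FROM tab1 ORDER BY timestamp DESC LIMIT 100;

EXPLAIN
SELECT TOP 100 a, b, c, d, e FROM tab1 ORDER BY timestamp DESC;
Both should show the same plan, with the sort and the row limit applied in the same way.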
You can get the result by running these queries on a sample dataset. Here are my observations:
Type 1: 5.54s, 2.42s, 1.77s, 1.76s, 1.76s, 1.75s
Type 2: 5s, 1.77s, 1s, 1.75s, 2s, 1.75s
Type 3: an invalid SQL statement, since TOP and LIMIT can't be used in the same query
As you can observe, the results are the same for both the queries as both undergo internal optimization by the query engine.
Apparently both TOP and LIMIT do a similar job, so you shouldn't be worrying about which one to use.
More important is the design of your underlying table, especially if you are using WHERE and JOIN clauses. In that case, you should carefully choose your SORTKEY and DISTKEY, which will have much more impact on the performance of Amazon Redshift than a simple syntactical difference like TOP vs. LIMIT.
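For illustration, the sort and distribution keys are declared when the table is created; a minimal sketch, assuming hypothetical column names and types (only the timestamp column comes from the question):
CREATE TABLE tab1 (
    a           BIGINT,
    b           VARCHAR(64),
    "timestamp" TIMESTAMP
)
DISTKEY (a)
SORTKEY ("timestamp");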

Vtiger. Change query limit

The vtiger wiki says:
Query always limits its output to 100 records, client application can use limit operator to get different records.
This query does not work:
doQuery("select * from Leads limit='200';")
How do I specify the operator in a query?
The "limit" clause only works if the number given is lower than 100. You can't get more records than 100 using "limit" with 1 request.
To get more than 100 records from the vtiger web services you need to make multiple requests, using an offset in the "limit" clause.
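A sketch of that paging pattern, assuming the offset-then-count form of the "limit" clause (Leads is the module from the question):
select * from Leads limit 0, 100;
select * from Leads limit 100, 100;
select * from Leads limit 200, 100;
-- keep increasing the offset by 100 until a request returns fewer than 100 records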
If you read the wiki documentation, you'd see that you need to use:
select *
from Leads
limit 200;
Stop using unnecessary single quotes ('200') - the limit expects a numerical value, so there's no point in converting it to a string with quotes.
And drop the equal sign, too - it's not shown in the docs anywhere.
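Putting that together with the client call from the question (assuming the same doQuery helper):
doQuery("select * from Leads limit 200;");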