AWS Athena query giving different result while running query time [duplicate] - amazon-web-services

I have read over and over again that SQL, at its heart, is an unordered model. That means executing the same SQL query multiple times can return the result set in a different order, unless there's an "order by" clause included. Can someone explain why a SQL query can return its result set in a different order on different runs? It may not always be the case, but it's certainly possible.
Algorithmically speaking, doesn't the query plan play a role in determining the order of the result set when there is no "order by" clause? I mean, once a query has a plan, how can executing that plan not always return data in the same order?
Note: I'm not questioning the use of order by; I'm asking why there is no guarantee. That is, I'm trying to understand the challenges that make such a guarantee impossible.

Some SQL Server examples where the exact same execution plan can return differently ordered results are:
An unordered index scan might be carried out in either allocation order or key order, dependent on the isolation level in effect.
The merry-go-round scanning feature allows scans to be shared between concurrent queries.
Parallel plans are often non-deterministic, and the order of results might depend on the degree of parallelism selected at runtime and the concurrent workload on the server.
If the plan has nested loops with unordered prefetch, the inner side of the join can proceed using data from whichever I/Os happened to complete first.
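The parallelism point is the easiest to see outside a database. Here is a toy Python sketch (nothing SQL Server-specific; the partitioning is invented): several "scan workers" each read a slice of a table and append rows to a shared buffer, so the final order depends entirely on thread scheduling, even though the set of rows is stable.

```python
import threading

def parallel_scan(partitions):
    """Simulate a parallel scan: each worker reads one partition and
    appends rows to a shared result buffer as it goes."""
    results = []
    lock = threading.Lock()

    def scan(rows):
        for row in rows:
            with lock:
                results.append(row)  # arrival order depends on scheduling

    threads = [threading.Thread(target=scan, args=(p,)) for p in partitions]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

rows = parallel_scan([[1, 2], [3, 4], [5, 6]])
# The multiset of rows is always the same; their order is not guaranteed.
print(rows)
```

The same query (same "plan") run twice can interleave differently, which is exactly why a parallel plan's output order is not a contract.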

Martin Smith has some great examples, but the absolute dead-simple way to demonstrate when SQL Server will change the plan used (and therefore the ordering that a query without ORDER BY will produce, based on the different plan) is to add a covering index. Take this simple example:
CREATE TABLE dbo.floob
(
    blat INT PRIMARY KEY,
    x VARCHAR(32)
);
INSERT dbo.floob VALUES(1,'zzz'),(2,'aaa'),(3,'mmm');
This will order by the clustered PK:
SELECT x FROM dbo.floob;
Results:
x
----
zzz
aaa
mmm
Now, let's add an index that happens to cover the query above.
CREATE INDEX x ON dbo.floob(x);
The index causes a recompile of the above query when we run it again; now it orders by the new index, because that index provides a more efficient way for SQL Server to return the results to satisfy the query:
SELECT x FROM dbo.floob;
Results:
x
----
aaa
mmm
zzz
Take a look at the plans - neither has a sort operator, they are just - without any other ordering input - relying on the inherent order of the index, and they are scanning the whole index because they have to (and the cheapest way for SQL Server to scan the index is in order). (Of course even in these simple cases, some of the factors in Martin's answer could influence a different order; but this holds true in the absence of any of those factors.)
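The same effect can be reproduced outside SQL Server. Here is a small sketch using Python's built-in sqlite3 (a different engine, so the planner details differ, but the lesson is identical): without ORDER BY you get whatever order the chosen access path yields, and only ORDER BY pins it down.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE floob (blat INTEGER PRIMARY KEY, x TEXT)")
con.executemany("INSERT INTO floob VALUES (?, ?)",
                [(1, "zzz"), (2, "aaa"), (3, "mmm")])

# No ORDER BY: the order is whatever the access path happens to yield.
unordered = [r[0] for r in con.execute("SELECT x FROM floob")]

con.execute("CREATE INDEX ix_x ON floob (x)")
# After adding a covering index, the planner may choose a different access
# path, silently changing the "natural" order. Only ORDER BY is a contract:
ordered = [r[0] for r in con.execute("SELECT x FROM floob ORDER BY x")]
print(unordered, ordered)  # ordered is always ['aaa', 'mmm', 'zzz']
```

Run it twice, upgrade the library, or add another index, and `unordered` may change; `ordered` never will.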
As others have stated, the ONLY WAY TO RELY ON ORDER is to SPECIFY AN ORDER BY. Please write that down somewhere. It doesn't matter how many scenarios exist where this belief can break; the fact that there is even one makes it futile to try to find some guidelines for when you can be lazy and not use an ORDER BY clause. Just use it, always, or be prepared for the data to not always come back in the same order.
Some related thoughts on this:
Bad habits to kick : relying on undocumented behavior
Why people think some SQL Server 2000 behaviors live on… 12 years later

Quote from Wikipedia:
"As SQL is a declarative programming language, SELECT queries specify a result set, but do not specify how to calculate it. The database translates the query into a "query plan" which may vary between executions, database versions and database software. This functionality is called the "query optimizer" as it is responsible for finding the best possible execution plan for the query, within applicable constraints."
It all depends on what the query optimizer picks as a plan - table scan, index scan, index seek, etc.
Other factors that might influence picking a plan are table/index statistics and parameter sniffing to name a few.
In short, the order is never guaranteed without an ORDER BY clause.

It's simple: if you need the data ordered then use an ORDER BY. It's not hard!
It may not cause you a problem today or next week or even next month but one day it will.
I've been on a project where we needed to rewrite dozens (or maybe hundreds) of queries after an upgrade to Oracle 10g caused GROUP BY to be evaluated in a different way than it had been on Oracle 9i, meaning that the queries weren't necessarily ordered by the grouped columns anymore. Not fun, and so simple to have avoided.
Remember that SQL is a declarative language: you tell the DBMS what you want, and the DBMS works out how to get it. It will bring back the same results every time, but it may evaluate the query in a different way each time; there are no guarantees.
Just one simple example of where this might cause you problems: new rows appear at the end of the table when you select from it... until they don't, because you've deleted some rows and the DBMS has decided to fill in the empty space.
There are an unknowable number of ways it can go wrong unless you use ORDER BY.
Why does water boil at 100 degrees C? Because that's the way it's defined.
Why are there no guarantees about result ordering without an ORDER BY? Because that's the way it's defined.
The DBMS will probably use the same query plan the next time and that query plan will probably return the data in the same order: but that is not a guarantee, not even close to a guarantee.

If you don't specify an ORDER BY, then the order will depend on the plan the database uses. For example, if the query did a table scan and used no index, the result would be in the "natural order" or the order of the PK. However, if the plan decides to use IndexA, which is based on columnA, then the result would be in that index's order. Make sense?

Related

What is the scope of result rows in PDI Kettle?

Working with result rows in Kettle is the only way to pass lists internally in the program. But how does this work exactly? This topic has not been well documented, and there are a lot of questions.
For example, in a job containing two transformations, result rows can be sent from the first to the second. But what if there's a third transformation getting the result rows? What is the scope? Can you pass result rows to a sub-job as well? Can you clear the result rows based on logic inside a transformation?
Working with lists and arrays is useful and necessary in programming, but confusing in PDI Kettle.
I agree that working with result rows may be confusing, but you can be confident: it works.
Yes, you can pass them to a sub-job, and through a series of sub-jobs (define the scope as "valid in the java machine" for the first test).
And no, there is no way to clear the results in a transformation (and certainly not based on a formula). That would mean a terrible maintenance overhead.
Kettle is not an imperative language; it belongs more to the data-flow family. That means it is closer to the way you think when developing an ETL, and much, much more performant. The drawback is that lists and arrays have no meaning. There is only the flow of data.
And that is what a result set is: a flow of data, like the result set of a SQL query. The next job has to open it, pass each row to the transformation, and close it after the last row.
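A rough analogy in Python (the row contents and step names are hypothetical, nothing Kettle-specific): result rows behave like a generator stream handed from one step to the next, not like a list you can index into or clear mid-flight.

```python
def first_transformation():
    """Produce rows one at a time, like a step emitting result rows."""
    for row in [{"id": 1}, {"id": 2}, {"id": 3}]:
        yield row  # each row flows downstream as it is produced

def second_transformation(rows):
    """Consume the upstream flow row by row; no random access, no clearing."""
    for row in rows:
        row["doubled"] = row["id"] * 2
        yield row

# The "job" opens the stream, passes each row along, and closes it after
# the last row; no step ever holds the whole list at once.
result = list(second_transformation(first_transformation()))
print(result)
```

This is why "clear the result rows from inside a transformation" has no natural meaning: a step only ever sees the row currently flowing past it.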

sentiment analysis to find top 3 adjectives for products in tweets

There is a sentiment analysis tool to find out people's perceptions on social networks.
This tool can:
(1) Decompose a document into a set of sentences.
(2) Decompose each sentence into a set of words, and perform filtering such that only
product name and adjectives are preserved.
e.g. "This MacBook is awesome. Sony is better than Macbook."
After processing, we can get:
{MacBook, awesome}
{Sony, better}. (not the truth :D)
We just assume there exists a list of product names, P, that we will ever care about, and a list of adjectives, A, that we will ever care about.
My questions are:
Can we reduce this problem to a specialized association rule mining problem, and how? If yes, is there anything to watch out for, such as the reduction itself, parameter settings (minsup and minconf), additional constraints, and modifications to the Apriori algorithm to solve the problem?
Is there any way to artificially spam the result, like pushing "horrible" to the top adjective? And are there good ways to prevent this spam?
Thanks.
Have you considered counting?
For every product, count how often each adjective occurs.
Report the top-3 adjectives for each product.
Takes just one pass over your data, and does not use a lot of memory (unless you have millions of products to track).
There is no reason to use association rule mining. Association rule mining only pays off when you are looking for large itemsets (i.e. 4 or more terms) and they are equally important. If you know that one term is special (e.g. product name vs. adjectives), it makes sense to split the data set by this unique key, and then use counting.
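The counting approach described above fits in a few lines of Python (the sample (product, adjective) pairs are invented for illustration):

```python
from collections import Counter, defaultdict

# Hypothetical output of the filtering step: (product, adjective) pairs.
pairs = [
    ("MacBook", "awesome"), ("MacBook", "great"), ("MacBook", "awesome"),
    ("MacBook", "slow"), ("MacBook", "great"), ("MacBook", "awesome"),
    ("Sony", "better"), ("Sony", "light"), ("Sony", "better"),
]

# One pass over the data: count how often each adjective occurs per product.
counts = defaultdict(Counter)
for product, adjective in pairs:
    counts[product][adjective] += 1

# Report the top-3 adjectives for each product.
top3 = {p: [adj for adj, _ in c.most_common(3)] for p, c in counts.items()}
print(top3)
# {'MacBook': ['awesome', 'great', 'slow'], 'Sony': ['better', 'light']}
```

Memory is proportional to the number of distinct (product, adjective) combinations, not to the number of tweets, which is why this scales better than itemset mining for a two-role problem like this.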

JPA2 Should I prefer critieria queries to JPQL?

So I have three models: a Crag has one or more CragLocations, and each CragLocation has a Location. I can query for a certain subset of crags using
public List<Crag> getCragsWithGridRef() {
    /*
     * Equivalent SQL:
     * SELECT c.* FROM Crag c
     * JOIN CragLocation cl ON c.id = cl.cragId
     * JOIN Location l ON cl.locationId = l.id
     * WHERE LEN(l.gridReference) > 1
     */
    TypedQuery<Crag> query = em.createQuery(
        "SELECT c FROM Crag c JOIN c.CragLocations cl JOIN cl.location l"
            + " WHERE LENGTH(l.gridReference) > 1",
        Crag.class);
    return query.getResultList();
}
I'm largely querying this way because my brain can't handle criteria queries. I struggle to parse the meaning when I'm looking at them.
So is there a performance or maintainability (or other) reason to prefer criteria queries and if so how would you express this query?
No, there's no reason to prefer criteria queries over JPQL ones, especially if you consider JPQL queries easy to understand and thus to maintain, and criteria queries hard to understand and maintain (which I agree with).
Criteria queries, if you use the auto-generated metamodel, are hard to write, but once written, you can be sure that there is no syntax error. That doesn't mean that the query does what it's supposed to do, though. So in any case, you should unit-test the queries. If you have unit test covering the queries, then use what you find the most readable and maintainable. Even if there was a performance difference generating the underlying SQL query, this difference would be negligible compared to the cost of actually executing the query.
I use Criteria queries only in these two situations (and not even always):
the query is dynamically composed from a set of optional search criteria;
there are many similar queries sharing a common part, and I want to avoid repeating this common part in each and every query. Using a criteria query allows putting the common parts in a reusable method.
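For the first situation, the gain over string concatenation is easiest to see with a sketch. This is Python building pseudo-JPQL (the entity and parameter names are made up), just to show the shape of composing a query from optional criteria; a real Criteria API version would assemble Predicate objects instead of strings.

```python
def build_query(filters):
    """Compose a JPQL-like query string from optional search criteria.

    filters: dict of optional criteria, e.g. {"name": "Stanage"}.
    Returns (query_string, named_parameters).
    """
    clauses, params = [], {}
    if "name" in filters:
        clauses.append("c.name = :name")
        params["name"] = filters["name"]
    if "grade" in filters:
        clauses.append("c.grade >= :grade")
        params["grade"] = filters["grade"]

    jpql = "SELECT c FROM Crag c"
    if clauses:
        jpql += " WHERE " + " AND ".join(clauses)
    return jpql, params

print(build_query({"name": "Stanage"}))
# ('SELECT c FROM Crag c WHERE c.name = :name', {'name': 'Stanage'})
```

With many optional criteria, this clause-list pattern (or its Criteria API equivalent) avoids the combinatorial explosion of hand-written query variants.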

MySQL, C++ - Programmatically, How does MySQL Autoincrement Work?

From the latest source code (not certain if it's C or C++) of MySQL, how does it do an autoincrement? I mean, is it efficient in that it stores like a metadata resource on the table where it last left off, or does it have to do a table scan to find the greatest ID in use in the table? Also, do you see any negative aspects of using autoincrement when you look at how it's implemented versus, say, PostgreSQL?
That will depend on which engine the database is using. InnoDB stores the largest value in memory, not on disk. Very efficient. I would guess most engines do something similar, but I cannot guarantee it.
InnoDB's auto-increment runs the query below once when the database is loaded and stores the value in memory:
SELECT MAX(ai_col) FROM t FOR UPDATE;
Comparing that to PostgreSQL's complete lack of an auto_increment depends on how you would implement the field yourself. (At least it lacked one the last time I used it; that may have changed.) Most would create a SEQUENCE, which appears to be stored in an in-memory pseudo-table. I'd take InnoDB's to be the simpler, better way, and I'd guess InnoDB would be more efficient if they are not otherwise equal.
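SQLite (used here via Python purely as a stand-in; this is not MySQL's code) illustrates the same max-plus-one bookkeeping that the InnoDB query above seeds: the next generated id is one larger than the largest id currently in the table, with no table scan per insert.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, v TEXT)")
con.execute("INSERT INTO t (v) VALUES ('a')")           # id = 1
con.execute("INSERT INTO t (id, v) VALUES (10, 'b')")   # explicit id = 10
con.execute("INSERT INTO t (v) VALUES ('c')")           # id = max(id) + 1 = 11
ids = [row[0] for row in con.execute("SELECT id FROM t ORDER BY id")]
print(ids)  # [1, 10, 11]
```

Note how the explicit insert of 10 moved the counter forward; the engine tracks the high-water mark rather than scanning for gaps, which is the same trade-off InnoDB makes.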

How do we compare two Query result sets in coldfusion

I need to build a generic method in ColdFusion to compare two query result sets. Any ideas?
If you are looking to simply decide whether two queries are exactly alike, then you can do this:
if(serializeJSON(query1) eq serializeJSON(query2)) ...
This will convert both queries to strings and compare the strings.
If you're looking for more nuance, I believe Sergii's approach (convert to struct, compare keys) is probably the right one. You could "guard" it by adding simple checks first: do the column lists match? Is the recordcount the same? That way, if either of those checks fails, you know the queries can't possibly be equivalent, and so it's safe to return false, thereby avoiding the performance hit of a full compare.
If I understand you correctly, you have two result sets with the same structure but different data (like selecting with different clauses).
If this is correct, I believe the better (more efficient) way is to try to solve this task at the database level, maybe with temporary/cumulative tables and/or a stored procedure.
Doing it in CF will almost certainly need a ton of loops, which can be inappropriate for large datasets. Though I did something like this for small datasets using intermediate storage: I converted one result set into a structure and looped over the second query, checking the structure keys.
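The guard-then-compare idea translates directly into pseudocode. Here is a Python sketch (representing each result set as a list of row dicts, which is an assumption for illustration, not how ColdFusion stores queries):

```python
import json

def queries_equal(q1, q2):
    """Compare two result sets, running cheap guards before the full compare."""
    # Guard 1: the column lists must match.
    cols1 = set(q1[0]) if q1 else set()
    cols2 = set(q2[0]) if q2 else set()
    if cols1 != cols2:
        return False
    # Guard 2: the record counts must match.
    if len(q1) != len(q2):
        return False
    # Full compare: serialize both and compare strings, like serializeJSON.
    return json.dumps(q1, sort_keys=True) == json.dumps(q2, sort_keys=True)

a = [{"id": 1, "name": "x"}, {"id": 2, "name": "y"}]
b = [{"id": 1, "name": "x"}, {"id": 2, "name": "y"}]
c = [{"id": 1, "name": "x"}]
print(queries_equal(a, b), queries_equal(a, c))
# True False
```

The guards cost O(1) and O(columns), while the serialization pass costs O(rows × columns), so failing fast on mismatched shapes is where the savings come from.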