The query has around 40k rows, generally taken from a cached query. For whatever reason the QoQ is just SLOW. I have tried removing most of the logic (DISTINCT, grouping, etc.) to no avail, which leads me to believe something is wrong in the settings. Does anybody have an idea what is going on and how to speed this up?
subcats (Datasource=, Time=42979ms, Records=14)
SELECT
DISTINCT(SNGP.subtyp1) AS cat,
MIN(SNGP.sortposition) AS sortposition,
MIN(taxonomy.web_url) AS url
FROM
SNGP,
taxonomy
WHERE
SNGP.typ > ''
AND UPPER(SNGP.typ) <> 'EMPTY'
AND UPPER(SNGP.DEPT) = 'SHOES' AND UPPER(SNGP.TYP) = 'FASHION' AND SNGP.SUBTYP1 <> 'EMPTY'
GROUP BY SNGP.subtyp1
ORDER BY SNGP.sortposition ASC
Do you have to do a QoQ? Could your original query be amended to give you the data you need? Could you even cache all the possible QoQs you're doing, on a schedule?
You're selecting from two tables (SNGP, taxonomy), but I don't see a join between them (see the sketch after these points)
web_url sounds like a string; why are you doing a MIN() on it?
In your WHERE clause, put the most restrictive conditions first. E.g. if typ > '' restricts the results to 1000 rows, but UPPER(SNGP.typ) <> 'EMPTY' would restrict them to just 100 rows, then you should put the latter first. This is general SQL advice; I'm not sure how well it applies to QoQ.
40k rows to then select just 14 results sounds like quite a data mismatch; is there any other way you could restrict the data further before you run your QoQ?
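Something like this, assuming the two tables join on subtyp1 (that's a guess; use whatever your real key is):
SELECT
    SNGP.subtyp1 AS cat,
    MIN(SNGP.sortposition) AS sortposition,
    MIN(taxonomy.web_url) AS url
FROM
    SNGP,
    taxonomy
WHERE
    SNGP.subtyp1 = taxonomy.subtyp1
    AND UPPER(SNGP.DEPT) = 'SHOES'
    AND UPPER(SNGP.TYP) = 'FASHION'
    AND SNGP.SUBTYP1 <> 'EMPTY'
GROUP BY SNGP.subtyp1
ORDER BY SNGP.sortposition ASC
Note that DISTINCT is redundant once you GROUP BY the same column, and typ > '' / UPPER(typ) <> 'EMPTY' are already implied by UPPER(SNGP.TYP) = 'FASHION'. Without a join condition, the QoQ has to build the full cross product of the two record sets before filtering, which on its own could explain the 42-second runtime.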
I am running a query and the result set is too large. How can I retrieve the results in sections?
For example, 1 million rows at a time.
Thank you
Use a CURSOR (in PostgreSQL, for example, DECLARE must run inside a transaction block):
DECLARE my_cursor CURSOR FOR
    SELECT ....
FETCH FORWARD 1000000 FROM my_cursor;   -- first million rows
FETCH FORWARD 1000000 FROM my_cursor;   -- next million rows
...
CLOSE my_cursor;
Another, simpler approach is to use the LIMIT and OFFSET clauses; a sketch follows.
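A minimal sketch, assuming a PostgreSQL-style database and a hypothetical events table with an id column; keep a stable ORDER BY so the pages don't overlap or skip rows:
-- first million rows
SELECT *
FROM events
ORDER BY id
LIMIT 1000000 OFFSET 0;

-- second million rows
SELECT *
FROM events
ORDER BY id
LIMIT 1000000 OFFSET 1000000;
Keep in mind that OFFSET still reads and discards every skipped row, so deep pages get progressively slower; keyset pagination (WHERE id > last_seen_id ORDER BY id LIMIT 1000000) avoids that cost.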
Three options, on a table of events that are inserted with a timestamp.
Which query is faster/better?
Select a,b,c,d,e.. from tab1 order by timestamp desc limit 100
Select top 100 a,b,c,d,e.. from tab1 order by timestamp desc
Select top 100 a,b,c,d,e.. from tab1 order by timestamp desc limit 100
When you ask a question like that, the EXPLAIN syntax is helpful. Just add this keyword at the beginning of your query and you will see the query plan. In cases 1 and 2 the plans will be absolutely identical. These are variations of SQL syntax, but the SQL interpreter should produce the same query plan, which determines how the requested operations are physically performed.
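For example, with the first query from the question (table and column names as given there):
EXPLAIN
SELECT a, b, c, d, e
FROM tab1
ORDER BY timestamp DESC
LIMIT 100;
Running the same thing against the TOP 100 variant should show an identical plan.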
More about EXPLAIN command here: EXPLAIN in Redshift
You can get the result by running these queries on a sample dataset. Here are my observations:
Type 1: 5.54s, 2.42s, 1.77s, 1.76s, 1.76s, 1.75s
Type 2: 5s, 1.77s, 1s, 1.75s, 2s, 1.75s
Type 3: Is an invalid SQL statement, as it combines TOP and LIMIT in the same query
As you can observe, the results are the same for both valid queries, as both undergo the same internal optimization by the query engine.
Apparently both TOP and LIMIT do a similar job, so you shouldn't be worrying about which one to use.
More important is the design of your underlying table, especially if you are using WHERE and JOIN clauses. In that case, you should choose your SORTKEY and DISTKEY carefully, which will have much more impact on the performance of Amazon Redshift than a simple syntactical difference like TOP/LIMIT.
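As a hedged illustration of that point (the table and column names below are invented, not taken from the question): if the events table is usually joined on an account id and filtered or ordered by its timestamp, the Redshift DDL might look like this:
CREATE TABLE event_log (
    event_id    BIGINT,
    account_id  BIGINT,
    event_name  VARCHAR(64),
    event_ts    TIMESTAMP
)
DISTKEY (account_id)   -- the column you join on, so matching rows are co-located on a slice
SORTKEY (event_ts);    -- queries that filter or order by the timestamp can skip most blocks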
I ran the following code, and an hour later, just as the code was finishing, a sort execute error occurred. Is there something wrong with my code, or are my computer's processor and RAM insufficient?
proc sql;
create table today as
select a.account_number, a.client_type, a.device ,a.entry_date_est,
a.entry_time_est, a.duration_seconds, a.channel_name, b.esn, b.service_start_date,
b.service_end_date, b.product_name, b.billing_frequency_fee, b.plan_category,
b.plan_subtype, b.plan_type
from listen_nomiss a inner join service_nomiss b
on (a.account_number = b.account_number)
order by account_number;
quit;
That error is most commonly seen when you run out of utility space to perform the sort. A few suggestions for troubleshooting are available in this SAS KB post; the most useful suggestions:
options fullstimer msglevel=i; will give you a lot more information about what's going on behind the scenes, so you can troubleshoot what is causing the issue
proc options option=utilloc; run; will tell you where the utility directory is in which your temporary files will be created for the sort. Verify that about three times the space needed for the final table is available there; due to how the sort is processed, sorting requires roughly three times the size of the dataset.
options compress=yes; will save some (possibly a lot of) space if not already enabled.
proc options option=memsize; run; and proc options option=sortsize; run; will tell you how much memory is allocated to SAS, and at what size a sort is done in memory versus on disk. sortsize should be about 1/3 of memsize (given the requirement of 3x space to process the sort). If your final table is close to but just over sortsize, you may be better off increasing sortsize if the default is too low (same for memsize).
You could also have issues with permissions; some of the other suggestions in the KB article relate to verifying that you actually have permission to write to the utility directory, or that it exists at all.
I've had a project in the past where resources were an issue as well.
A couple of ways around it when sorting were:
Don't forget that proc sort has a TAGSORT option, which makes it sort only the BY-statement variables first and attach everything else afterwards. Useful when you have many columns that are not involved in the BY statement.
Indexes: if you build an index on exactly the variables in your BY statement, you can use a BY statement without sorting; SAS will rely on the index.
Split it up: you can split the dataset into multiple chunks and sort each chunk separately. Then you do a data step where you put them all in the SET statement. When you use a BY statement there as well, SAS will interleave the records so that the result also follows the BY statement.
Note that these approaches have a performance hit (maybe the third one only to a lesser extent) and indexes can give you headaches if you don't take them into account later on (or destroy them intentionally).
One note if/when you rewrite the whole join as a SAS merge: keep in mind that a SAS merge does not by itself mimic many-to-many joins (it does one-to-one, one-to-many, and many-to-one). Probably not the case here (it rarely is), but I mention it to be on the safe side.
I've had a query that has been running fine for about 2 years. The database table has about 50 million rows, and is growing slowly. This last week one of my queries went from returning almost instantly to taking hours to run.
Rank.objects.filter(site=Site.objects.get(profile__client=client, profile__is_active=False)).latest('id')
I have narrowed the slow query down to the Rank model. It seems to have something to do with using the latest() method. If I just ask for a queryset, it returns an empty queryset right away.
#count returns 0 and is fast
Rank.objects.filter(site=Site.objects.get(profile__client=client, profile__is_active=False)).count() == 0
Rank.objects.filter(site=Site.objects.get(profile__client=client, profile__is_active=False)) == [] #also very fast
Here are the results of running EXPLAIN. http://explain.depesz.com/s/wPh
And EXPLAIN ANALYZE: http://explain.depesz.com/s/ggi
I tried vacuuming the table, no change. There is already an index on the "site" field (ForeignKey).
Strangely, if I run this same query for another client that already has Rank objects associated with her account, then the query returns very quickly once again. So it seems that this is only a problem when there are no Rank objects for that client.
Any ideas?
Versions:
Postgres 9.1,
Django 1.4 svn trunk rev 17047
Well, you've not shown the actual SQL, so that makes it difficult to be sure. But, the explain output suggests it thinks the quickest way to find a match is by scanning an index on "id" backwards until it finds the client in question.
Since you said it has been fast until recently, this is probably not a silly choice. However, there is always the chance that a particular client's record will be right at the far end of this search.
So - try two things first:
Run an analyze on the table in question, see if that gives the planner enough info.
If not, increase the stats (ALTER TABLE ... SET STATISTICS) on the columns in question and re-analyze. See if that does it.
http://www.postgresql.org/docs/9.1/static/planner-stats.html
If that's still not helping, then consider an index on (client, id), and drop the index on id (if not needed elsewhere). That should give you lightning-fast answers; these steps are sketched below.
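A sketch of those steps in SQL, assuming the Rank model's table is named myapp_rank and the filtered foreign key column is site_id (both are guesses; adjust to your schema):
-- 1. refresh the planner's statistics
ANALYZE myapp_rank;

-- 2. if that isn't enough, gather more detailed stats on the filter column and re-analyze
ALTER TABLE myapp_rank ALTER COLUMN site_id SET STATISTICS 500;
ANALYZE myapp_rank;

-- 3. last resort: a composite index that matches the filter plus the "latest id" pattern
CREATE INDEX myapp_rank_site_id_id ON myapp_rank (site_id, id);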
latest() is normally used for date comparison; maybe you should try ordering by id descending and then limiting to one result.
I have been querying Geonames for parks per state. Mostly there are under 1000 parks per state, but I just queried Connecticut, and there are just under 1200 parks there.
I already got the 1-1000 with this query:
http://api.geonames.org/search?featureCode=PRK&username=demo&country=US&style=full&adminCode1=CT&maxRows=1000
But increasing maxRows to 1200 gives an error saying that I am querying for too many rows at once. Is there a way to query for rows 1000-1200?
I don't really see how to do it with their API.
Thanks!
You should be using the startRow parameter in the query to page results. The documentation notes that it takes an integer value (0-based indexing) and should be
Used for paging results. If you want to get results 30 to 40, use startRow=30 and maxRows=10. Default is 0.
So to get the next block of up to 1000 results (rows 1000-1999), you should change your query to
http://api.geonames.org/search?featureCode=PRK&username=demo&country=US&style=full&adminCode1=CT&maxRows=1000&startRow=1000
I'd suggest reducing the maxRows to something manageable as well - something that will put less of a load on their servers and make for quicker responses to your queries.
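For example, walking the roughly 1200 Connecticut results in pages of 200, changing only startRow between requests:
http://api.geonames.org/search?featureCode=PRK&username=demo&country=US&style=full&adminCode1=CT&maxRows=200&startRow=0
http://api.geonames.org/search?featureCode=PRK&username=demo&country=US&style=full&adminCode1=CT&maxRows=200&startRow=200
http://api.geonames.org/search?featureCode=PRK&username=demo&country=US&style=full&adminCode1=CT&maxRows=200&startRow=400
...and so on until a page comes back with fewer than 200 results.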