I had a look at the SAS documentation and around SO, but being a bit distant from the field and from SAS specifically, I wanted to ask for help.
I am looking at some SAS code, where this specific part is of interest:
SELECT A.*,
CASE WHEN A.all_111 > max(99.99, 0.025*AMOUNT)
This runs before a table of entries is created. The table is supposed to discard values below 100. I am presuming that's what the first argument (99.99) in max does.
However, I am not sure what the purpose of 0.025*AMOUNT is.
The max simply takes the maximum of 99.99 and 0.025*AMOUNT, i.e. 2.5% of the amount. So the effective threshold is 99.99 unless 2.5% of the amount is larger, in which case the threshold rises to that value. MAX here operates on each individual row value, not as an aggregate across rows.
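For reference, the fragment in the question is presumably part of a larger CASE expression roughly along these lines; the table names, the flag name, and the THEN/ELSE values below are made up, since the original snippet is cut off:

proc sql;
  create table work.flagged as                               /* hypothetical output table */
  select A.*,
         case
           when A.all_111 > max(99.99, 0.025*AMOUNT) then 1  /* hypothetical flag values  */
           else 0
         end as above_threshold                              /* hypothetical flag name    */
  from work.entries as A;                                    /* hypothetical input table  */
quit;

Because MAX is given more than one argument, PROC SQL treats it as the SAS MAX function rather than the SQL aggregate, which is why it is evaluated per row here.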
I have to select a set of rows (say, 200 unique rows) from 200 million rows at once, without ORDER BY, and it must be efficient.
As you are experiencing, sorting 200M rows can take a while, and if all you want is 200 rows then this is an expense you shouldn't need to pay. However, you do need to sort on a random value if you want to select 200 rows that are actually random. Otherwise the sort order of the base tables and the order of reply from the Redshift slices will meaningfully skew your sample.
You can get around this by sampling down (through a random process) to a much more manageable number of rows, then sorting by the random value and picking your final 200 rows. While this still sorts rows, it does so on a significantly smaller set, which speeds things up considerably.
select a, b
from (
    select a, b, random() as ranno
    from test_table
) t  -- Redshift requires an alias on a derived table
where ranno < .005
order by ranno
limit 200;
You start with 200M rows. The WHERE clause keeps roughly 0.5% of them, about 1 million rows. Then order this much smaller set before selecting 200. This should speed things up and maintain the randomness of the selection.
Sampling your data down to a reasonable percentage (10%, 5%, 1%, etc.) should bring the volume to a manageable size. Then you can order by the sample value and choose the number of rows you need.
select *
from (
    select *, random() as sample
    from "table"
) t  -- alias required for the derived table
where sample < .01
order by sample
limit 200
The following is an expansion on the question which I found useful and that others might find helpful as well. In my case, I had a huge table which I could split by a key field value into smaller subsets, but even after splitting, the volume per individual subset stayed very large (tens of millions of rows) and I still needed to sample it. I was initially concerned that the sampling wouldn't work on a subset created with a WITH clause, but it turned out this is not the case. Afterwards I compared the distribution of the sample across all meaningful keys between the full subset (20 million rows) and the sample (30K rows), and the distributions were almost identical, which worked great. Sample code below:
with subset as (
    select * from "table" where Key_field = 'XYZ'
)
select *
from (
    select *, random() as sample
    from subset
) s
where s.sample < .01
order by s.sample
limit 200
I am trying to work out if this is possible or not in DAX or M.
Basically I want to replicate this:
=IF(T9>0,T9-1,$Q$6)
This is the formula in T10. It counts down by one if the value above it is not 0; otherwise it puts in a value and the countdown starts again.
Here is some data and expected outcome:
When the stock on hand drops below 5000, it triggers the lead-time countdown. When that hits 0, stock is added to the SOH balance, 4000 in this case. If the stock is still below its reorder point, the countdown starts again.
We would need the data in order to answer your question properly. If you can't share the dataset, add an index column using
Transform data > Add Column > Index Column
Imagine a tiered revenue sharing scheme like this:
Revenue up to 10000 gets 100% of it.
Revenue up to 12000 gets the above plus 80% of the amount above 10000.
Revenue up to 14000 gets the above plus 60% of the amount above 12000.
Revenue up to 16000 gets the above plus 40% of the amount above 14000.
Revenue over 16000 gets the above plus 20% of the amount above 16000.
E.g. a revenue of 13000 will get you a share of 10000 + 0.8*2000 + 0.6*1000 = 12200.
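In other words, with tier boundaries $t_1 < t_2 < \dots$ (so $t_1 = 0$, $t_2 = 10000$, and the last boundary taken as $\infty$) and rates $r_1, r_2, \dots$, the share for a revenue $x$ is just a restatement of the tiers above:

$$\text{share}(x) = \sum_i r_i \cdot \max\bigl(0,\ \min(x, t_{i+1}) - t_i\bigr)$$

which for $x = 13000$ gives $1.0 \cdot 10000 + 0.8 \cdot 2000 + 0.6 \cdot 1000 = 12200$ as above.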
I tried making a table (each threshold a column), calculating the individual fractions using IF clauses and then adding them all up. It is very cumbersome. I would like to use only two cells, with the entire calculation done in one cell, hard-coded.
If at all possible, extra bonus points if I can have the threshold values (10000, 12000, etc) and fractions (100%, 80%, etc.) in separate cells as parameters for the calculation, maybe something like an array-function?
Thank you very much in advance!
Start your lookup table with a value of 0 and a rate of 100%. In this case, VLOOKUP() with the last parameter equal to 1 will correctly find the required row.
To avoid recalculating all the lower tiers for each value, calculate them in advance and place them in the lookup table as an additional column.
For the first row this value is 0; calculate all subsequent values with a formula like =C2+(A3-A2)*B2.
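For the tiers in this question, such a lookup table would look roughly like this (the exact cell layout is just an example; column C holds the precomputed payout at each threshold):

A (threshold)   B (rate)   C (precomputed payout)
0               100%       0
10000           80%        10000
12000           60%        11600
14000           40%        12800
16000           20%        13600

A revenue of 13000 then matches the 12000 row, giving (13000-12000)*60% + 11600 = 12200, the same result as the worked example in the question.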
For such a table, a not very complicated formula will return the correct result:
=(<revenue>-VLOOKUP(<revenue>;<lookup_table>;1;1))*VLOOKUP(<revenue>;<lookup_table>;2;1)+VLOOKUP(<revenue>;<lookup_table>;3;1)
The third parameter in the VLOOKUP() calls is the column index, increasing from left to right: 1 is the threshold (base amount), 2 is the rate, and 3 is the precomputed payout for reaching the previous levels.
For the data shown in the figure, the formula used is:
=(E2-VLOOKUP(E2;$A$2:$C$7;1;1))*VLOOKUP(E2;$A$2:$C$7;2;1)+VLOOKUP(E2;$A$2:$C$7;3;1)
I'm trying to count the days between a date in the column 'completionDate' and today.
The table name is 'Incidents (2)'.
I have a similar table called 'incidents' where this is working.
The code:
DaysClosed = DATEDIFF('Incidents (2)'[completionDate].[Dag];TODAY();DAY)
The error I get:
'A single value for variation 'Dag' for column 'completionDate' in table 'Incidents (2)' cannot be determined. This can happen when a measure formula refers to a column that contains many values without specifying an aggregation such as min, max, count, or sum to get a single result.'
The error you get strongly depends on how you are evaluating your formula; that's why it might work on another table but not on this one. As #JBfreefolks correctly pointed out, you are specifying a column where a scalar value is expected. Whether that works depends on the context you are evaluating your formula over (assuming it is a measure).
For instance, imagine a dataset with 100 rows equally divided into four categories A, B, C, and D. When you create a table visual with a row for each category, each row will have 25 underlying records that will be used in any calculation added to this row (either a measure or an aggregate of any value). This means that when you use a formula like DATEDIFF with a column reference, it will get 25 values for its second argument where it expects one.
There are several ways to solve the problem depending on your desired result.
Use an aggregation like MAX, as #JBfreefolks suggested, to make sure a single value is selected from the many underlying values. The measure will still be calculated over a group of records, but it summarizes them by taking the maximum date.
Make sure the visual you are using includes something like an ID so it doesn't group records and effectively shows one row per record. Any measure added to this visual will then be evaluated at that detail level as well.
Use a calculated column instead. Calculated columns are always evaluated in row context first, and their values can be aggregated in a visual later on. When using TODAY(), you will probably need to refresh your report at least every day to keep the column up to date.
A more complicated way is to use an iterator like SUMX() or AVERAGEX() to force row context evaluation inside a measure.
Good to see you solved it already, still posting as it might be helpful to others.
'Incidents (2)'[completionDate].[Dag] is referencing a column. In your computation it returns a table (multiple dates in the evaluation context) instead of the scalar needed in the DATEDIFF calculation.
You need to make sure that 'Incidents (2)'[completionDate].[Dag] returns a scalar value. To do that you can leverage row context, or an aggregation function like MAX.
I am trying to emulate pagination in BigQuery by grabbing a certain row number using an offset. It looks like the time to retrieve results steadily degrades as the offset increases, until it hits a ResourcesExceeded error. Here are a few example queries:
Is there a better way to use the equivalent of an "offset" with BigQuery without seeing performance degradation? I know this might be asking for a magic bullet that doesn't exist, but was wondering if there are workarounds to achieve the above. If not, if someone could suggest an alternative approach to getting the above (such as kinetica or cassandra or whatever other approach), that would be greatly appreciated.
Offset in systems like BigQuery works by reading and discarding all rows up to the offset.
You'll need to use a column as a lower limit so the engine can start directly from that part of the key range; you can't have the engine efficiently seek to an arbitrary point midway through a query.
For example, let's say you want to view taxi trips by rate code, pickup, and drop off time:
SELECT *
FROM [nyc-tlc:green.trips_2014]
ORDER BY rate_code ASC, pickup_datetime ASC, dropoff_datetime ASC
LIMIT 100
Doing this via OFFSET 100000 takes 4s, and the first row is:
pickup_datetime: 2014-01-06 04:11:34.000 UTC
dropoff_datetime: 2014-01-06 04:15:54.000 UTC
rate_code: 1
If, instead of the offset, I use those date and rate values as a lower bound, the query takes only 2.9s:
SELECT *
FROM [nyc-tlc:green.trips_2014]
WHERE rate_code >= 1
AND pickup_datetime >= "2014-01-06 04:11:34.000 UTC"
AND dropoff_datetime >= "2014-01-06 04:15:54.000 UTC"
ORDER BY rate_code ASC, pickup_datetime ASC, dropoff_datetime ASC
limit 100
So what does this mean? Rather than allowing the user to specify result ranges by row number (e.g. show rows starting at 100000), have them specify the range in a more natural form (e.g. show rides that started on January 6th, 2014).
If you want to get fancy and REALLY need to allow the user to specify actual row numbers, you can make this a lot more efficient by calculating row ranges in advance: query everything once and remember which row number sits at the start of each hour of the year (8,760 values), or even each minute (525,600 values). You could then use this to pick an efficient starting point: look up the closest hour/minute for a given row range (e.g. in Cloud Datastore), then convert the user's query into the more efficient version above.
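A rough sketch of that precomputation in standard SQL, assuming pickup_datetime is a TIMESTAMP (storing the lookup result, e.g. in Cloud Datastore, is left out):

-- count rows per pickup hour and keep a running total, so a requested
-- row number can be mapped to an approximate pickup_datetime lower bound
SELECT
  TIMESTAMP_TRUNC(pickup_datetime, HOUR) AS hour_start,
  COUNT(*) AS rows_in_hour,
  SUM(COUNT(*)) OVER (ORDER BY TIMESTAMP_TRUNC(pickup_datetime, HOUR)) AS rows_up_to_end_of_hour
FROM
  `nyc-tlc.green.trips_2014`
GROUP BY
  hour_start
ORDER BY
  hour_start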
As Dan already mentioned, you need to introduce a row number. But ROW_NUMBER() OVER () on the whole table exceeds resources, so you have to split up the work of counting rows:
decide on a few partitions, distributed as evenly as possible
count the rows of each partition
take a cumulative sum of the partition sizes, so you know later where each partition's row numbers start
split up the work of counting rows per partition
save a new table with the row number column for later use
As the partition key I used EXTRACT(month FROM pickup_datetime), as it distributes nicely:
WITH temp AS (
  SELECT
    *,
    -- cumulative sum of partition sizes so we know when to start counting rows here
    SUM(COALESCE(lagged, 0)) OVER (ORDER BY month RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS cumulative
  FROM (
    -- lag partition sizes to the next partition
    SELECT
      *,
      LAG(qty) OVER (ORDER BY month) AS lagged
    FROM (
      -- get partition sizes
      SELECT
        EXTRACT(month FROM pickup_datetime) AS month,
        COUNT(1) AS qty
      FROM
        `nyc-tlc.green.trips_2014`
      GROUP BY
        1
    )
  )
)
SELECT
  -- cumulative sum = last row of the former partition, add to the per-partition row count
  cumulative + ROW_NUMBER() OVER (PARTITION BY EXTRACT(month FROM pickup_datetime)) AS row,
  *
FROM
  `nyc-tlc.green.trips_2014`
-- import cumulative row counts
LEFT JOIN
  temp
ON
  (month = EXTRACT(month FROM pickup_datetime))
Once you have saved this as a new table, you can use the new row column to query without losing performance:
SELECT
  *
FROM
  `project.dataset.your_new_table`
WHERE
  row BETWEEN 10000001 AND 10000100
Quite a hassle, but does the trick.
Why not export the resulting table into GCS?
It will automatically split the table into multiple files if you use wildcards, and the export only has to be done once, instead of querying (and paying for the processing) every single time.
Then, instead of serving the result of the call to the BQ API, you simply serve the exported files.
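For example, one way to run the export from SQL is the EXPORT DATA statement (the bucket name, path, and format below are placeholders; the bq extract command works as well):

-- export the precomputed table to GCS; the single wildcard in the URI
-- lets BigQuery split the output across multiple files
EXPORT DATA OPTIONS (
  uri = 'gs://your-bucket/trips/rows-*.csv',   -- placeholder bucket/path
  format = 'CSV',
  overwrite = true,
  header = true
) AS
SELECT
  *
FROM
  `project.dataset.your_new_table`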