Creating a Table of Computed Statistics - sas

I am new to SAS and would like to create a table of summary statistics that I compute myself (not only the usual mean, median, etc.), laid out as shown below:
Statistic                         Value
Number of Scored Items            ##
Number of Examinees               ##
Mean                              ##.#%
Median                            ##.#%
Standard Deviation                ##.#%
Minimum                           ##.#%
Maximum                           ##.#%
Reliability Estimate              #.##
Standard Error of Measurement     #.##
I have tried using proc means, but it only lets me use the summary statistics that are built into the procedure. So, for instance, I don't know how I can use a formula to calculate the Reliability Estimate and then show it in a table along with other summary statistics, such as the number of unique observations.

Related

Best way to select random rows in redshift without order by

I have to select a set of rows (around 200 unique rows) from 200 million rows at once, without an ORDER BY, and it must be efficient.
As you are experiencing, sorting 200M rows can take a while, and if all you want is 200 rows then this is an expense you shouldn't need to pay. However, you do need to sort on a random value if you want to select 200 rows that are truly random. Otherwise the sort order of the base tables and the order of reply from the Redshift slices will meaningfully skew your sample.
You can get around this by sampling down (through a random process) to a much more manageable number of rows, then sorting by the random value and picking your final 200 rows. While this still sorts rows, it does so on a significantly smaller set, which speeds things up considerably.
select a, b from (
select a, b, random() as ranno
from test_table)
where ranno < .005
order by ranno
limit 200;
You start with 200M rows, select roughly 0.5% of them in the WHERE clause (about 1,000,000 rows), then order that much smaller set before selecting the final 200. This should speed things up while maintaining the randomness of the selection.
Sampling your data down to a reasonable percentage (10%, 5%, 1%, etc.) should bring the volume to a manageable size. Then you can order by the sample value and select the number of rows you need.
select * from (select *, random() as sample
from "table")
where sample < .01
order by sample limit 200
The following is an expansion on the question which I found useful and which others might find helpful as well. In my case, I had a huge table which I could split by a key field value into smaller subsets, but even after splitting it the volume per individual subset stayed very large (tens of millions of rows) and I still needed to sample it. I was initially concerned that the sampling wouldn't work on a subset created with a WITH statement, but it turned out this is not the case. I compared the distribution of the sample across all meaningful keys against the full subset (20 million rows vs. a 30K sample) and got almost exactly the same distribution, which worked great. Sample code below:
With subset as (select * from "table" Where Key_field='XYZ')
select * from (select *, random() as sample
from subset) s
where s.sample < .01
order by s.sample limit 200

How can I sort from highest to lowest column values in Power BI Matrix Visual

I am trying to figure out how I can sort column values from highest to lowest in a Power BI Matrix visual. I have a small matrix with 3 columns, "No", "Yes" and "Total", and on the rows I have the names of some people.
What I want to do is sort the values from highest to lowest in the "No" column, but when I click on "sort by" I only get the option to sort by the total count and by the names of the people. I have added a picture below for better context. Any help will be much appreciated!
This can be done by creating three measures. Don't use the implicit measures; always create your own.
First, create the measure that generates the total you use right now. It may be a count, it may be a sum, I cannot tell because I don't know your data source. Let's call that measure "total".
Assuming your data source has a column with the "yes" and "no" values, and assuming the name of that column is "status", you can then create two additional measures.
TotalYes = CALCULATE([total],'Table'[status]="yes")
TotalNo = CALCULATE([total],'Table'[status]="no")
Add these measures to the matrix and remove the status column from the columns well. You can now sort the matrix by the "TotalNo" column. Of course, you can rename the column in the matrix, so it just says "No".

Working with large offsets in BigQuery

I am trying to emulate pagination in BigQuery by grabbing a certain row number using an offset. It looks like the time to retrieve results steadily degrades as the offset increases, until it hits a ResourcesExceeded error. Here are a few example queries:
Is there a better way to use the equivalent of an "offset" with BigQuery without seeing performance degradation? I know this might be asking for a magic bullet that doesn't exist, but I was wondering if there are workarounds to achieve the above. If not, could someone suggest an alternative approach to achieving the above (such as Kinetica or Cassandra or whatever other approach)? That would be greatly appreciated.
Offset in systems like BigQuery works by reading and discarding all results up to the offset.
You'll need to use a column as a lower limit to enable the engine to start directly from that part of the key range; you can't have the engine efficiently seek to an arbitrary point midway through a query.
For example, let's say you want to view taxi trips by rate code, pickup, and drop off time:
SELECT *
FROM [nyc-tlc:green.trips_2014]
ORDER BY rate_code ASC, pickup_datetime ASC, dropoff_datetime ASC
LIMIT 100
If you did this via OFFSET 100000, it takes 4s and the first row is:
pickup_datetime: 2014-01-06 04:11:34.000 UTC
dropoff_datetime: 2014-01-06 04:15:54.000 UTC
rate_code: 1
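For reference, the OFFSET form of that query would look something like this (written in standard SQL; the examples above use legacy SQL table references):
SELECT *
FROM `nyc-tlc.green.trips_2014`
ORDER BY rate_code ASC, pickup_datetime ASC, dropoff_datetime ASC
LIMIT 100 OFFSET 100000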
If instead of offset, I had used those date and rate values, the query only takes 2.9s:
SELECT *
FROM [nyc-tlc:green.trips_2014]
WHERE rate_code >= 1
AND pickup_datetime >= "2014-01-06 04:11:34.000 UTC"
AND dropoff_datetime >= "2014-01-06 04:15:54.000 UTC"
ORDER BY rate_code ASC, pickup_datetime ASC, dropoff_datetime ASC
limit 100
So what does this mean? Rather than allowing the user to specify result-number ranges (e.g. new rows starting at row 100,000), have them specify it in a more natural form (e.g. rides that started on January 6th, 2014).
If you want to get fancy and REALLY need to let the user specify actual row numbers, you can make it a lot more efficient by calculating row ranges in advance: query everything once and remember which row number falls at the start of each hour of the year (8,760 values), or even each minute (525,600 values). You can then use this to make a better guess at an efficient starting point: look up the closest day/minute for a given row range (e.g. in Cloud Datastore), then convert the user's query into the more efficient version above.
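A rough sketch of that one-time precompute in standard SQL, using the same sort order as the query above (note that the single ROW_NUMBER() over the full table may itself run into the resource limits discussed in the next answer):
-- first row number at the start of each pickup hour
SELECT
  TIMESTAMP_TRUNC(pickup_datetime, HOUR) AS pickup_hour,
  MIN(rn) AS first_row_number
FROM (
  SELECT
    pickup_datetime,
    ROW_NUMBER() OVER (ORDER BY rate_code, pickup_datetime, dropoff_datetime) AS rn
  FROM `nyc-tlc.green.trips_2014`
)
GROUP BY pickup_hour
ORDER BY pickup_hour
Store the result somewhere cheap to query (e.g. Cloud Datastore, as suggested above) and, for a requested row number, look up the nearest preceding hour and use its values as the lower bound in the WHERE clause.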
As already mentioned by Dan, you need to introduce a row number, but a plain ROW_NUMBER() OVER () exceeds resources. This basically means you have to split up the work of counting rows:
decide on a few partitions that are as evenly distributed as possible
count the rows of each partition
take a cumulative sum of the partition sizes so you know later where each partition's row counting starts
split up the work of counting rows accordingly
save a new table with the row count column for later use
As the partition key I used EXTRACT(month FROM pickup_datetime), as it distributes nicely.
WITH temp AS (
  SELECT
    *,
    -- cumulative sum of partition sizes so we know when to start counting rows here
    SUM(COALESCE(lagged, 0)) OVER (ORDER BY month RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) cumulative
  FROM (
    -- lag partition sizes to next partition
    SELECT
      *,
      LAG(qty) OVER (ORDER BY month) lagged
    FROM (
      -- get partition sizes
      SELECT
        EXTRACT(month FROM pickup_datetime) month,
        COUNT(1) qty
      FROM
        `nyc-tlc.green.trips_2014`
      GROUP BY
        1)))
SELECT
  -- cumulative sum = last row of former partition, add to new row count
  cumulative + ROW_NUMBER() OVER (PARTITION BY EXTRACT(month FROM pickup_datetime)) row,
  *
FROM
  `nyc-tlc.green.trips_2014`
-- import cumulative row counts
LEFT JOIN
  temp
ON
  (month = EXTRACT(month FROM pickup_datetime))
Once you have saved it as a new table, you can use the new row column to query without losing performance:
SELECT
*
FROM
`project.dataset.your_new_table`
WHERE
row BETWEEN 10000001
AND 10000100
Quite a hassle, but does the trick.
Why not export the resulting table into GCS?
BigQuery will automatically split the table into multiple files if you use a wildcard in the destination URI, and the export only has to be done once, instead of querying every single time and paying for all that processing.
Then, instead of serving the result of the call to the BQ API, you simply serve the exported files.
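For example, with BigQuery's EXPORT DATA statement (the bucket path and table name below are just placeholders):
EXPORT DATA OPTIONS (
  uri = 'gs://your-bucket/trips_2014/part-*.csv',  -- wildcard splits the export into multiple files
  format = 'CSV',
  overwrite = true,
  header = true
) AS
SELECT *
FROM `project.dataset.your_new_table`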

Measure to sum another aggregated measure's data

I am working on a report that has data by month. I have created a measure that will calculate a cost per unit which divides the sum of dollars by the sum of production volume for the selected month(s):
Wtd Avg = SUM('GLData - Excel'[Amount])/SUM('GLData - Excel'[Production])
This works well and gives me the weighted average that I need per report category regardless of if I have one or multiple months selected. This actual and budget data is displayed below:
If you take the time to total the actual costs you get $3.180. Where I am running into trouble is writing a measure that sums up to that total for a visual (this visual sadly does not show a total). Basically I need to sum the aggregated values that we see above. If I use the Wtd Avg measure I get the average for the total data set, or 0.53. I have attempted another measure, but am not coming up with the correct answer:
Total Per Unit Cost = sumX('GLData - Excel','GLData - Excel'[Wtd Avg])/DISTINCTCOUNT('GLData - Excel'[Date])
Here I get $3.186. It is close, but it is not aggregating the right way to arrive at exactly $3.180:
My Total Per Unit Cost formula is off. Really I am simply after a measure that sums the post-aggregated Wtd Avg measure we see in the first graph and totals to $3.180 in this example.
Here is my data table:
As you probably know already, this is happening because measures are dynamic - if you are not grouping by a dimension, they will compute based on the overall table. What you want to do is to force a grouping on your categories, and then compute the sum of the measure for each category.
There are 2 ways to do this. One way is to create a new table in Power BI (Modeling tab -> New Table), and then use a SUMMARIZE() calculation similar to this one to define that table:
SUMMARIZE('GLData - Excel',[Category],[Month],[Actual/Budget],"Wtd Avg",[Wtd Avg])
Unfortunately I do not know your exact column names, so you will need to adjust this calculation to your context. Once your new table is created, you can use the values from that table to create your aggregate visual - in order to get the slicers to work, you may need to join this new table to your original table through the "Manage Relationships" option.
The second way to do this is via the same calculation, but without having to create a new table. This may be less of a hassle. Create a measure like this:
SUMX(SUMMARIZE('GLData - Excel',[Category],[Month],[Actual/Budget],"Wtd Avg",[Wtd Avg]),[Wtd Avg])
If this does not solve your issue, go ahead and show me a screenshot of your table and I may be able to help further.

Marketing penetration in OLAP cube - Help with specific MDX measure definition

I am pretty new to MDX, but I know what I want to accomplish; it's just proving very hard. Basically, I have a dataset where each row is a sale for a customer. I also have postcode data and the UK population of each ward.
The total population in each ward is then divided by the count of that ward code within the data set. E.g. ward A has a population of 1,000 and I have ten customers who live in ward A, so the population value is 1,000/10 = 100.
So as long as no other dimensions are selected, only the region hierarchy, I can drill up and down and the population penetration (count of customers / calculated population value) is correct. However, as soon as I introduce more dimensions the total population no longer sums to its true value.
So I need to do the calculation above within the cube, and I am trying to find the MDX function(s) to do this.
Essentially something like this:
Step 1) Sum the number of ward codes (the lowest level of the Geographic hierarchy), grouped by distinct ward code, e.g. wardcodeA = 5, wardcodeB = 10, etc.
Step 2) Take the population in each ward (which could be stored as the total at ward level, taking the average) and divide it by the result of the previous step.
Step 3) Sum the results from each ward at the currently selected Geographical level.
The fact that other dimensions change the value of customers / population means that something in your modeling is wrong.
You should have a fact table (it can be a view/concept) like this:
REGION_ID, CUSTOMER_COUNT, POPULATION_COUNT
Once you have this, create the fact table and specific measures for counting customers and population, with a single dimension linked. This is the main point: do not link your measures to dimensions that are not needed.
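As a rough sketch in SQL terms (the table and column names here are assumptions: a sales table with one row per sale and a ward_population table keyed by ward code), the fact view could look like this:
CREATE VIEW fact_ward_penetration AS
SELECT
  w.ward_code                   AS region_id,
  COUNT(DISTINCT s.customer_id) AS customer_count,
  MAX(w.population)             AS population_count
FROM ward_population AS w
LEFT JOIN sales AS s
  ON s.ward_code = w.ward_code
GROUP BY w.ward_code;
With the grain fixed at one row per ward, the customer and population counts stay correct however you drill the region hierarchy, as long as unrelated dimensions are not linked to this fact.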