Create bins for a variable with the same number of records in Tableau - grouping

I need to create groups for one variable that have the same number of records, instead of the same "size".
For example, I have 10,000 records of cars with prices between $5,000 and $10,000. Instead of a classic bin with a range of $1,000 each and an unequal number of records, I want bins that each contain exactly the same number of records, even if the price ranges are unequal.
Classic Bin
Desired Bin
Is this possible? How can I do that?

Tableau doesn't allow differently sized bins.
Try making the bins using Number of Records (2,000 per bin in this case, or the size Tableau suggests).
Then create table calculations for the Min and Max price within each bin to get each bin's price range.
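If you can reach the underlying database instead (for example through a Custom SQL data source), equal-count bins can also be computed with the ntile window function. A minimal sketch, assuming a table named cars with columns car_id and price (names not from the question):

-- 5 bins of roughly 2,000 rows each for 10,000 cars (cars/price are assumed names)
select
    car_id,
    price,
    ntile(5) over (order by price) as price_bin
from cars;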

Related

Redshift table size identification based on date

I would like to create a query in Redshift where I pass dates (between 25-07-2021 and 24-09-2022) and get a result in MB (table size) for a particular table between those dates.
I assume that by "get result in MB" you mean that, if those matching rows were all placed in a new table, you would like to know how many MB that table would occupy.
Data is stored in Amazon Redshift in different ways, based upon the particular compression type for each column, and therefore the storage taken on disk is specific to the actual data being stored.
The only way to know how much disk space would be occupied by these rows would be to actually create a table with those rows. It is not possible to accurately predict the storage any other way.
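A minimal sketch of that approach, assuming the table is called my_table and the date column is event_date (both names are assumptions, not from the question): materialize the matching rows, then read the new table's size from svv_table_info, which reports size in 1 MB blocks.

-- materialize the matching rows (my_table/event_date are assumed names)
create table my_table_subset as
select *
from my_table
where event_date between '2021-07-25' and '2022-09-24';

-- svv_table_info reports size in 1 MB blocks
select "table", size as size_mb
from svv_table_info
where "table" = 'my_table_subset';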
You could, of course, obtain an approximation by counting the number of rows matching the dates and then taking that as a proportion of the whole table size. For example, if the table contains 1 million rows and the dates matched 50,000 rows, then they would represent 50,000/1,000,000 (5%) of the table. However, this would not be a perfectly accurate measure.
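A sketch of that approximation with the same assumed names, taking the matching-row fraction times the full table's size from svv_table_info (add a schema filter if the table name is not unique):

-- approximate MB = matching-row fraction * full table size in MB
select
    (select size from svv_table_info where "table" = 'my_table')
    * sum(case when event_date between '2021-07-25' and '2022-09-24'
               then 1 else 0 end)::float
    / count(*) as approx_size_mb
from my_table;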

Best way to select random rows in Redshift without order by

I have to select a set of rows (say, 200 unique rows) from 200 million rows at once without order by, and it must be efficient.
As you are experiencing, sorting 200M rows can take a while, and if all you want is 200 rows then this is an expense you shouldn't need to pay. However, you do need to sort on a random value if you want the 200 rows you select to be random; otherwise the sort order of the base tables and the order of replies from the Redshift slices will meaningfully skew your sample.
You can get around this by first sampling down (through a random process) to a much more manageable number of rows, then sorting by the random value and picking your final 200 rows. This still sorts rows, but on a significantly smaller set, which speeds things up considerably.
select a, b
from (
    -- attach a uniform random value to every row
    select a, b, random() as ranno
    from test_table
) t                    -- Redshift requires an alias on a subquery in FROM
where ranno < .005     -- keep roughly 0.5% of the rows
order by ranno         -- sort only the sampled rows
limit 200;
You start with 200M rows, select roughly 0.5% of them in the WHERE clause (about 1 million rows), and then order only those rows before selecting the final 200. This should speed things up and maintain the randomness of the selection.
Sampling your data down to a reasonable percentage (10%, 5%, 1%, etc.) should bring the volume to a manageable size. Then you can order by the sample value and choose the number of rows you need.
select *
from (select *, random() as sample
      from "table") t   -- alias required for a subquery in FROM
where sample < .01
order by sample
limit 200;
The following is an expansion on the question which I found useful and which others might find helpful as well. In my case, I had a huge table that I could split by a key field value into smaller subsets, but even after splitting, the volume per individual subset stayed very large (tens of millions of rows) and I still needed to sample it. I was initially concerned that the sampling wouldn't work on a subset created with a WITH clause, but it turned out that this is not the case. Afterwards I compared the distribution of the sample across all the meaningful keys between the full subset (20 million rows) and the sample (30K rows), and I got almost exactly the same distribution, which worked great. Sample code below:
with subset as (
    select * from "table" where Key_field = 'XYZ'
)
select *
from (select *, random() as sample
      from subset) s
where s.sample < .01
order by s.sample
limit 200;

Is it possible to create a field well with the numbers 1-100 in it?

I am currently creating a pivot table in QuickSight, and I am looking to create a field well with a series of numbers from 1 to 100. They should be static and not affect the data in the table, though. I have managed to do this in Power BI before using the GENERATESERIES function.
Assuming you have 100 rows in your table, you could create a calculated field that uses denseRank() to assign every row a number from 1 to 100. You would need to make sure you rank on a field (or fields) without duplicates. An index-based rank could work, for example.
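For intuition, the same ranking idea expressed in SQL (a sketch with assumed table and column names) is a dense rank over a duplicate-free column:

-- my_table/row_id are assumed names; assigns 1..N in row_id order
select
    row_id,
    dense_rank() over (order by row_id) as static_index
from my_table;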

In DAX, how do I match row by row on a categorical column, instead of getting a single value, when using a numerical variable from another related table?

How can I use the values of a variable from table B in a related table A in such a way that it can be matched to each row of table A correspondingly?
I need to calculate a ratio like this:
sum(A[event_number]) / sum(B[client_number])
The two variables are from different, related tables. The relationship between the tables is one (A) to many (B).
When I put this ratio on a matrix built from variables and measures from table A, where the rows are stores, the denominator should be the number of clients per store, but I only get the sum of all clients instead.
For example, if a store "ASD" has 5 events, it should be divided by 20, which is the number of clients related to that store, and not by 500, which is the sum of all clients across all stores.
I have tried using RELATED when calculating the ratio, and ALLEXCEPT to create a column with the number of clients in A, but nothing has given the expected result. Please help.

How to sum two columns if one column might have two different variables which would call for a different calculation

I need to create a SUM() column (summing two columns). The issue I'm having is that the second column might contain a value which needs to multiply the first column, or it might contain a value which needs to be added to the first column.
Example
First Column (Estimated Costs)
50,000
100,000
5,000
Second Column (contingency amounts)
5%
10,000
10%
5,000
The first column will always have numbers. The second column will have either a percentage or an amount. I need to create a formula that handles either possibility. Is this possible, or do I need to create two different sum columns, one for each scenario?
I tried using an OR statement with the SUM() formula but it didn't seem to work.
I understand that you're providing the contingency cost of the estimate in two different ways: as a percentage of the estimate or as a lump-sum amount to be added to the estimate.
You should use IF() to differentiate the two operations you have to apply. Since the two types of values in the contingency column are very different, BigBen's suggestion of checking whether the value is greater than 1 should work perfectly well, as you won't be adding less than $1 as a contingency. The formula will therefore be:
=IF(B1>1,A1+B1,A1*(1+B1))
Column A is Estimated Costs, Column B is Contingency.
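For example, assuming a row pairs an estimate of 50,000 with a 5% contingency, the formula returns 50,000 * (1 + 0.05) = 52,500; assuming a row pairs 50,000 with a 10,000 lump sum, it returns 50,000 + 10,000 = 60,000.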