Using TABLESAMPLE with PERCENT returns all the records from table - amazon-web-services

I have a small test table with two fields, id and name, and 19 records in total. When I try to get 10 percent of the records from this table using the following query, I get ALL the records. I tried this on a large table as well, but the result is the same: all records are returned. The query:
select * from test tablesample (10 percent) s;
If I use ROWS instead of PERCENT (i.e. select * from test tablesample (10 rows) s;), it works fine and only 10 records are returned. How can I get just the necessary percentage of records?

You can refer to the link below:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Sampling
You must be using CombineHiveInputFormat, which does not go well with the ORC format. Hence you will never be able to save the output of a PERCENT query to a table.
To my knowledge, the best way to do this is with the rand() function. But again, you should not combine it with an ORDER BY clause, as that will hurt performance. Here is my sample query, which is time efficient:
SELECT * FROM table_name
WHERE rand() <= 0.0001
DISTRIBUTE BY rand()
SORT BY rand()
LIMIT 5000;
I tested this on a 900M-row table and the query executed in 2 minutes.
Hope this helps.

You can use PERCENT with TABLESAMPLE. For example:
SELECT * FROM TABLE_NAME
TABLESAMPLE(1 PERCENT) T;
This will select 1% of the input data by size, not necessarily 1% of the rows. More details can be found here.
But if you are really looking for a way to select a percentage of the number of rows, then you may have to use the LIMIT clause with the number of records you need to retrieve.
For example, if your table has 1000 records, then you can select a random 10% of the records as:
select * from table_name order by rand() limit 100;
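The difference between sampling by data size and sampling by row count can be illustrated with a small model. The Python sketch below is not Hive's implementation. It assumes, as a simplification, that a percent sample picks whole storage blocks until at least the requested share of the data is covered, and it uses row counts as a stand-in for bytes. The names block_sample and row_sample are made up for illustration.

```python
import random


def block_sample(blocks, percent):
    """Pick whole blocks until at least `percent` of the data is covered.

    `blocks` is a list of lists of rows; row count stands in for data size
    (an assumption made only for this sketch).
    """
    total = sum(len(block) for block in blocks)
    target = total * percent / 100.0
    picked, covered = [], 0
    for block in blocks:
        if covered >= target:
            break
        picked.extend(block)  # a block is taken whole or not at all
        covered += len(block)
    return picked


def row_sample(rows, fraction, seed=0):
    """Keep each row independently with probability `fraction`,
    mirroring the idea behind a rand() <= fraction filter."""
    rng = random.Random(seed)
    return [row for row in rows if rng.random() <= fraction]


# A 19-row table that fits in a single block comes back whole.
small_table = [list(range(19))]
print(len(block_sample(small_table, 10)))

# With many blocks, the sample shrinks to roughly the requested share.
big_table = [list(range(100)) for _ in range(10)]
print(len(block_sample(big_table, 10)))
```

Under this model the 19-row table always returns all 19 rows, because the smallest unit that can be picked is the one block holding the entire table. A per-row filter such as row_sample does not have that limitation, at the cost of returning only an approximate number of rows.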

In PowerBI what options are there to hide/redact/mask values in visuals to anonymise data

What options are there in PowerBI to suppress, redact or hide values to anonymise reports and visuals without losing detail, and to have that restriction apply to multiple pages in a report?
Cat         Count  %
Category 1  23     10
Category 2  2      0.9%
Category 3  4      1.7%
So that it's possible to keep the rows but end up with a placeholder where Count is <4, and likewise where % is greater than 1% but less than 2%:
Cat         Count  %
Category 1  23     10
Category 2  *      0.9%
Category 3  4      *
So far my experience has been:
A measure with a filter applied will hide rows, but you can't apply a measure filter to an entire page or to all report pages.
I've seen mention of using conditional formatting to hide the value by making the font and background the same colour, but that seems error-prone and labour intensive.
I also want it to be clear when a value has been suppressed or masked.
I suspect there is a better way, but I haven't been able to figure out where to even start.
OK, I have something working but you will need Tabular Editor to create a calculation group. Here are the steps.
1. I'm using the following table (named "Table") as the source data.
2. Add two measures (calculation groups only work on measures) as follows.
% Measure = SUM('Table'[%])
Count Measure = SUM('Table'[Count ])
3. Open Tabular Editor and create a new calculation group named "CG" with a calculation item named "Mask". Paste the following code into the calculation item.
IF (
    ( SELECTEDMEASURENAME() = "% Measure" && SELECTEDMEASURE() > 1 && SELECTEDMEASURE() < 2 )
        || ( SELECTEDMEASURENAME() = "Count Measure" && SELECTEDMEASURE() < 4 ),
    "*",
    SELECTEDMEASURE()
)
4. Save the calculation group and in Power BI drag the name column onto the filter for all pages as follows, ensuring it is selected:
The masking will now happen across all reports automatically. Below you can see the same table on two different reports which have now been successfully masked.
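To make the rule in the calculation item easier to check, here is a minimal Python sketch of the same decision logic, applied to the sample rows from the question. It is only an illustration of the thresholds, not something Power BI runs. The function name mask and the tuple layout are assumptions made for this sketch, and the % values are taken as plain numbers (0.9 for 0.9%).

```python
def mask(measure_name, value):
    """Return "*" when the value falls in the range to suppress,
    otherwise return the value unchanged."""
    if measure_name == "% Measure" and 1 < value < 2:
        return "*"
    if measure_name == "Count Measure" and value < 4:
        return "*"
    return value


# Sample rows from the question: (Cat, Count, %)
rows = [
    ("Category 1", 23, 10.0),
    ("Category 2", 2, 0.9),
    ("Category 3", 4, 1.7),
]

masked = [
    (cat, mask("Count Measure", count), mask("% Measure", pct))
    for cat, count, pct in rows
]

for row in masked:
    print(row)
```

Category 2 loses its count (2 is below 4) and Category 3 loses its percentage (1.7 lies between 1 and 2), which matches the target table in the question.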
It depends on your data connection type as to whether this is available, but a calculated column (instead of a measure) can be used as a filter at the "this page" or "all pages" level.
If this option is available, then you can find it right next to the "New Measure" field.
Using this and your sample data above, I created a couple of calculated columns and show the resulting table. You can then display these columns and use them as filters throughout the report. Your DAX may be slightly different depending on how the actual data is formatted and such.
Count Calculated Column
Masked Count =
IF(
'Table'[Count] < 4,
"*",
CONVERT('Table'[Count], STRING)
)
% Calculated Column
Masked % =
IF(
'Table'[%] > .01 && 'Table'[%] < .02,
"*",
CONVERT('Table'[%] * 100, STRING) & "%"
)
Resulting Table
Example of how the filter can be used
The values of these columns will update as your data source is refreshed in Power BI. However, calculated columns aren't available for Live Connection, in which case you would have to do this kind of logic at a lower level (in Analysis Services for example).
Additionally, you could potentially use Power Query Editor to accomplish this kind of thing.

Retrieving the row with the greatest timestamp in questDB

I'm currently running QuestDB 6.1.2 on Linux. How do I get the row with the maximum timestamp from a table? I have tried the following on a test table with around 5 million rows:
1. select * from table where cast(timestamp as symbol) in (select cast(max(timestamp) as symbol) from table );
2. select * from table inner join (select max(timestamp) mm from table ) on timestamp >= mm
3. select * from table where timestamp = max(timestamp)
4. select * from table where timestamp = (select max(timestamp) from table )
where 1 is correct but runs in ~5s, 2 is correct and runs in ~500ms but looks unnecessarily verbose for such a query, 3 compiles but returns an empty table, and 4 is a syntax error, although that's how SQL usually does it.
select * from table limit -1 works. QuestDB returns rows sorted by timestamp by default, and limit -1 takes the last row, which happens to be the row with the greatest timestamp. To be explicit about ordering by timestamp, select * from table order by timestamp limit -1 could be used instead. This query runs in around 300-400ms on the same table.
As a side note, the third query using timestamp=max(timestamp) doesn't work since QuestDB does not yet support subqueries in where (as of QuestDB 6.1.2).

How to create an Average measure on an aggregated table of both rows and columns in Power BI

So I have this
Dataset
My desired result is a measure that is the average of value 1 + Value 2, split by Date and Type.
I have created a working solution with a calculated table:
consolidateTable =
SUMMARIZE('Blad2';Blad2[Type];Blad2[Date];"summarized";sum(Blad2[value 1]) + sum(Blad2[Value 2]))
and the result looks like this:
SummarizedTable
with the measure
measure = AVERAGE(consolidateTable[summarized])
I get the desired
Result
I want to do the whole process in a single measure.
I'd be happy for any input on how to build this measure piece by piece, starting with filtering the table.
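For reference, the computation the calculated table and the measure perform together can be written as one step. The Python sketch below only restates that logic (sum value 1 + Value 2 per Type and Date, then average the per-group sums). It is not DAX, and the sample rows are hypothetical since the original dataset is not shown.

```python
from collections import defaultdict


def average_of_group_sums(rows):
    """Sum "value 1" + "Value 2" per (Type, Date) group,
    then average those per-group totals."""
    totals = defaultdict(float)
    for row in rows:
        totals[(row["Type"], row["Date"])] += row["value 1"] + row["Value 2"]
    return sum(totals.values()) / len(totals)


# Hypothetical rows, for illustration only.
rows = [
    {"Type": "A", "Date": "2019-01-01", "value 1": 1, "Value 2": 2},
    {"Type": "A", "Date": "2019-01-01", "value 1": 3, "Value 2": 4},
    {"Type": "B", "Date": "2019-01-01", "value 1": 5, "Value 2": 5},
    {"Type": "A", "Date": "2019-01-02", "value 1": 15, "Value 2": 25},
]

print(average_of_group_sums(rows))
```

The three groups here total 10, 10 and 40, so the average is taken over three group sums rather than over the four source rows.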

Creating a column with lookup from another table

I have a table of sales from multiple stores with the value of sales in dollars and the date and the corresponding store.
In another table I have the store name and the expected sales amount of each store.
I want to create a column in the main (first) table that evaluates the efficiency of sales based on the other table.
In other words, if store B made 500 in sales today, I want to look up its target in the lookup table, divide by it to obtain the efficiency, and then graph the efficiency of each store.
Thanks.
I tried creating some measures and columns but got stuck with circular dependencies.
I expect to add one column to my main table holding an integer from 0 to 100 that shows the efficiency.
You can merge the two tables. In the query editor go to Merge Queries > Merge Queries as New. Choose your relationship (match it on the StoreName column) and merge the two tables. You will get something like this (just a few rows of your sample data):
StoreName  ActualSaleAmount  ExpectedAmount
a          500               3000
a          450               3000
b          370               3500
c          400               5000
Now you can add a calculated column with your efficiency:
StoreName  ActualSaleAmount  ExpectedAmount  Efficiency
a          500               3000            500/3000
a          450               3000            450/3000
b          370               3500            370/3500
c          400               5000            400/5000
This would be:
Efficiency = [ActualSaleAmount] / [ExpectedAmount]
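The merge-then-divide idea can be checked outside Power BI with a small Python sketch that uses the sample rows above. The function name add_efficiency is made up for illustration, and rounding to a whole percentage is an assumption based on the question's request for an integer from 0 to 100.

```python
def add_efficiency(sales, expected):
    """Look up each store's expected amount and append the efficiency
    as a whole-number percentage."""
    result = []
    for store, actual in sales:
        target = expected[store]  # the lookup that the merge performs
        result.append((store, actual, target, round(100 * actual / target)))
    return result


# Sample data from the tables above.
expected = {"a": 3000, "b": 3500, "c": 5000}
sales = [("a", 500), ("a", 450), ("b", 370), ("c", 400)]

for row in add_efficiency(sales, expected):
    print(row)
```

Each output row carries the store, the actual amount, the looked-up target and the efficiency in percent, which is the shape of the merged table with its extra column.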

How to find calculated population in dax formula

I have a table with country and population for 2017, and another table with country and population growth rate (%). I also have a table of years (2018 to 2028). I am trying to calculate the projected population for each of these years from this data, the same way compound interest is calculated.
Because you are working with growth rates, it is very unlikely that you will want to do this calculation as a measure. Rates don't aggregate well.
So, the first thing you're going to want to do is get your data into one table. I would do this in query editor.
You'll need a Cartesian join between your list of countries and a list of years. The PowerBI method for this is a little non-intuitive. You add a custom column, and in the formula you just type in the name of the table.
The result is that every single row in the countries table will be matched with every single row from the years table. If you have 5 rows in one and 10 rows in the other, the resulting table is 50 rows.
Then merge in your table with the growth rates. Now you have a table that has the name of the country, the 2017 starting population, and the growth rate. This set of rows will be repeated for every year from 2018 to 2028.
There is a specific formula for cumulative (compounded) growth.
Principal * ( 1 + RatePerPeriod / NumberOfCompoundsPerPeriod) ^ (NumberOfPeriods * NumberOfCompoundsPerPeriod)
You're doing this annually, so it simplifies a bit:
Principal * ( 1 + Rate) ^ (NumberOfYears)
And the M will look like this (2018 is one year after the 2017 starting population, so the exponent is [Year] - 2017):
[2017 Population] * Number.Power(1 + [Growth], [Year] - 2017)
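As a quick way to sanity-check the numbers, the same steps (cross join of countries and years, then annual compounding) can be sketched in Python. The country name, population and rate below are hypothetical. The sketch also assumes that the growth rate is stored as a fraction (0.02 for 2%) and that the number of years is counted from the 2017 base year, so 2018 gets one year of growth.

```python
from itertools import product


def project_population(base_population, growth_rate, year, base_year=2017):
    """Annual compounding: Principal * (1 + Rate) ^ NumberOfYears."""
    return base_population * (1 + growth_rate) ** (year - base_year)


# Hypothetical inputs: country -> (2017 population, annual growth rate).
countries = {"Country A": (1_000_000, 0.02)}
years = range(2018, 2029)  # 2018 to 2028 inclusive

# Cartesian join: every country is paired with every year.
rows = [
    (name, year, project_population(population, rate, year))
    for (name, (population, rate)), year in product(countries.items(), years)
]

for name, year, population in rows:
    print(name, year, round(population))
```

With one country and eleven years the join yields eleven rows. If the growth table stores percentages such as 2 instead of 0.02, divide by 100 before applying the formula.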
Good Luck! Hope it helps.