Remove Rows With Similar Values in Power BI / Power Query - powerbi

I am working with a data set that has some duplicate rows. The rows are not straight duplicates, but have a time stamp less than a second apart. I'd like to remove these duplicates, but the question is how.
My current plan is to add two new columns, which are copies of the time stamp column but with one second added in one and one second subtracted in the other. I can then add steps to remove rows where all other values match but the time stamp equals another row's time stamp plus or minus one second. Doing one after the other should eliminate the duplicates without removing truly unique rows.
How can I accomplish this in Power Query?

I think your "current plan" approach is good - I would apply that in a separate Query, started "By Reference" to the original - I'd call it something like Non-duplicated time stamps.
I would duplicate the original time stamp column and then add the new +/- 1 minute columns. I would use Unpivot Only Selected Columns on the 3 added time stamp columns to convert them from columns to rows. Then I would select the generated Value column and apply Keep Duplicates. That will keep just the first row of any duplicates found amongst the 3 time stamps.
Then back in the original query, I would add a Merge Queries step to connect it to the Non-duplicated time stamps query. I would match on the original time stamp column, possibly on other columns if required. The Join Kind would be Left Anti (rows only in first). That should remove your duplicates.
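For reference, a minimal M sketch of that reference query, assuming the source query is called "Source Data" and the time stamp column is named "Timestamp" (hypothetical names - adjust to your model):
let
    Source = #"Source Data",
    // duplicate the time stamp and add copies shifted by +/- 1 second
    DupTS = Table.DuplicateColumn(Source, "Timestamp", "TS"),
    PlusOne = Table.AddColumn(DupTS, "TS plus 1s", each [Timestamp] + #duration(0, 0, 0, 1), type datetime),
    MinusOne = Table.AddColumn(PlusOne, "TS minus 1s", each [Timestamp] - #duration(0, 0, 0, 1), type datetime),
    // Unpivot Only Selected Columns on the three added time stamp columns
    Unpivoted = Table.Unpivot(MinusOne, {"TS", "TS plus 1s", "TS minus 1s"}, "Attribute", "Value"),
    // equivalent of Keep Duplicates on Value: keep rows whose Value occurs more than once
    ValueCounts = Table.Group(Unpivoted, {"Value"}, {{"Count", each Table.RowCount(_), Int64.Type}}),
    DupValues = Table.SelectRows(ValueCounts, each [Count] > 1)[Value],
    KeptDuplicates = Table.SelectRows(Unpivoted, each List.Contains(DupValues, [Value]))
in
    KeptDuplicates
The Left Anti merge back in the original query would then look roughly like Table.NestedJoin(Source, {"Timestamp"}, #"Non-duplicated time stamps", {"Value"}, "Removed", JoinKind.LeftAnti), followed by deleting the added "Removed" column.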

Related

PowerBi Pivot creates repeating values and inserts extra nulls

This is probably going to come out with a really simple solution, but I am having trouble with a simple pivot in PowerBI.
I have a table where I have the costs of different utilities in one column, by month. I want to pivot the different utility types into separate columns, so I just have one row of data per month, with the different utility types across the top.
A simple pivot for some reason puts a bunch of nulls in and repeats the Months column, and I am not sure where I am going wrong.
Original Table
Final Table with Problem
My eyes glossed over that first column, not realizing the unique IDs in that column were causing the problem. I got rid of EngieTable_ID and my table works now.
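For anyone hitting the same thing, a rough M sketch of that fix, assuming the query is called EngieTable and the other columns are named Month, Utility, and Cost (hypothetical names):
let
    Source = EngieTable,
    // drop the unique ID column so each month collapses to a single row
    RemovedId = Table.RemoveColumns(Source, {"EngieTable_ID"}),
    // pivot the utility types into columns, summing the cost values
    Pivoted = Table.Pivot(RemovedId, List.Distinct(RemovedId[Utility]), "Utility", "Cost", List.Sum)
in
    Pivoted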

Power BI DAX: Find earliest match and perform operation

I'm working in Power BI, and I need to do this in DAX to keep it from having to re-read the 200K PDF files (8 hr refresh time).
I have a table that has duplicated ID and Step values with different time stamps. I need to find the earliest time stamp for each ID and subtract it from the time stamp of every row with a matching ID. I can then use the newly found delta time value to filter the table.
I'm struggling because I need to compare one ID from the table against all of the other IDs, looking for matches.
Example Data:
Final Data:
This post got me close, but in the IF statements they compare it to "Yes" and not to the ID in the row: How to check for duplicates with an added condition
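For reference, a sketch of one way to express this as a calculated column, assuming the table is called 'Steps' with columns [ID] and [Timestamp] (hypothetical names):
Delta Time =
VAR EarliestForThisID =
    // earliest time stamp among all rows sharing this row's ID
    CALCULATE (
        MIN ( 'Steps'[Timestamp] ),
        ALLEXCEPT ( 'Steps', 'Steps'[ID] )
    )
RETURN
    'Steps'[Timestamp] - EarliestForThisID
The resulting delta column can then be used to filter the table.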

Creating a new table with data from tables of varying size

Hi, I have two tables. One has a large number of orders with a column for date. The second table has one column labeled month and another with hours, making for 12 rows in total. I want to make a new column by dividing the count of orders per month by the hours of that month from the second table.
In Excel I'd simply COUNTIF the orders that are in January from the first table and divide by the hours in January from the second.
I'm having trouble figuring out the best way to make this new table with calculated values from the existing tables.
Thanks for your time.
Below is a picture of table 2. The first table is a standard dataframe with thousands of rows.
Two options.
You can use the "Append Query" and create a new table that is combining all of your data.
You can also use CALCULATE(SUM(table[field]), filter(table, table[field] = table[monthfield]) /SUM(table[field])
If you could give an example of what you have, I could definitely show you how to accomplish this.
Here is a link to the solution file, showing both approaches: one by merging data, and one by using CALCULATE(SUM(), FILTER()).
https://drive.google.com/file/d/1yxpv62Dnv8LSNW_mxibPfL0aCMrepoCU/view?usp=sharing
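As a concrete sketch of the CALCULATE/FILTER option, assuming an Orders table with a [Date] column and a second table called 'Table2' with [Month] (full month names) and [Hours] columns (all hypothetical names), a calculated column on 'Table2' could look like:
Orders per Hour =
VAR CurrentMonth = 'Table2'[Month]
VAR OrdersThisMonth =
    // count of orders whose month name matches this row's month
    CALCULATE (
        COUNTROWS ( Orders ),
        FILTER ( ALL ( Orders ), FORMAT ( Orders[Date], "MMMM" ) = CurrentMonth )
    )
RETURN
    DIVIDE ( OrdersThisMonth, 'Table2'[Hours] )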

Time frames (aka Sprints) are out of order in Power BI visuals

I'm having issues with these two time frames in my Power BI dashboard that are out of order. I'm wondering what I can do to fix this issue. I already sorted the timeframes to ascending and it didn't do the trick for me, unfortunately. Thank you!
It's sorting alphabetically. To fix this, add a column in the query editor that is either just the start date or end date of the Time Frame (make sure the column is a date type) and then use the sort by column feature to sort your Time Frame column by the new date column you just created.
Note that this probably won't work if you add the column as a DAX calculated column (rather than in the query editor) because it will throw a circular logic error (since the calculated column is dependent on the Time Frame).
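For example, a query-editor step along these lines, assuming each Time Frame value is text like "1/1/2023 - 1/14/2023" (a purely hypothetical format - adjust the parsing to your data):
let
    Source = #"Previous Step",  // whatever your last applied step is
    AddedStartDate = Table.AddColumn(
        Source,
        "Sprint Start",
        each Date.FromText(Text.BeforeDelimiter([Time Frame], " - ")),
        type date
    )
in
    AddedStartDate
Then select the Time Frame column in the model and use Sort by Column to sort it by the new Sprint Start column.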

Working with large offsets in BigQuery

I am trying to emulate pagination in BigQuery by grabbing a certain row number using an offset. It looks like the time to retrieve results steadily degrades as the offset increases until it hits ResourcesExceeded error. Here are a few example queries:
Is there a better way to use the equivalent of an "offset" with BigQuery without seeing performance degradation? I know this might be asking for a magic bullet that doesn't exist, but was wondering if there are workarounds to achieve the above. If not, if someone could suggest an alternative approach to getting the above (such as kinetica or cassandra or whatever other approach), that would be greatly appreciated.
Offset in systems like BigQuery works by reading and discarding all results until the offset.
You'll need to use a column as a lower limit to enable the engine to start directly from that part of the key range; you can't have the engine efficiently seek to an arbitrary point midway through a query.
For example, let's say you want to view taxi trips by rate code, pickup, and drop off time:
SELECT *
FROM [nyc-tlc:green.trips_2014]
ORDER BY rate_code ASC, pickup_datetime ASC, dropoff_datetime ASC
LIMIT 100
If you did this via OFFSET 100000, it takes 4s and the first row is:
pickup_datetime: 2014-01-06 04:11:34.000 UTC
dropoff_datetime: 2014-01-06 04:15:54.000 UTC
rate_code: 1
If instead of offset, I had used those date and rate values, the query only takes 2.9s:
SELECT *
FROM [nyc-tlc:green.trips_2014]
WHERE rate_code >= 1
AND pickup_datetime >= "2014-01-06 04:11:34.000 UTC"
AND dropoff_datetime >= "2014-01-06 04:15:54.000 UTC"
ORDER BY rate_code ASC, pickup_datetime ASC, dropoff_datetime ASC
LIMIT 100
So what does this mean? Rather than allowing the user to specify result # ranges (e.g., view rows starting at 100000), have them specify it in a more natural form (e.g., show rides that started on January 6th, 2014).
If you want to get fancy and REALLY need to allow the user to specify actual row numbers, you can make it a lot more efficient by calculating row ranges in advance: say, query everything once and remember what row number is at the start of each hour of the year (8760 values), or even each minute (525600 values). You could then use this to better guess an efficient starting point. Do a look-up for the closest hour/minute for a given row range (e.g., in Cloud Datastore), then convert that user's query into the more efficient version above.
As already mentioned by Dan, you need to introduce a row number. But row_number() over () on the full table exceeds resources, which basically means you have to split up the work of counting rows:
decide on a few partitions that are as evenly distributed as possible
count the rows of each partition
take a cumulative sum of the partition sizes to know later where to start counting rows in each partition
split up the work of counting rows per partition
save a new table with the row count column for later use
As the partition key I used EXTRACT(month FROM pickup_datetime), as it distributes nicely:
WITH temp AS (
  SELECT
    *,
    -- cumulative sum of partition sizes so we know when to start counting rows here
    SUM(COALESCE(lagged, 0)) OVER (ORDER BY month RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) cumulative
  FROM (
    -- lag partition sizes to next partition
    SELECT
      *,
      LAG(qty) OVER (ORDER BY month) lagged
    FROM (
      -- get partition sizes
      SELECT
        EXTRACT(month FROM pickup_datetime) month,
        COUNT(1) qty
      FROM
        `nyc-tlc.green.trips_2014`
      GROUP BY
        1 ) ) )
SELECT
  -- cumulative sum = last row of former partition, add to new row count
  cumulative + ROW_NUMBER() OVER (PARTITION BY EXTRACT(month FROM pickup_datetime)) row,
  *
FROM
  `nyc-tlc.green.trips_2014`
  -- import cumulative row counts
LEFT JOIN
  temp
ON
  (month = EXTRACT(month FROM pickup_datetime))
Once you have saved it as a new table, you can use your new row column to query without losing performance:
SELECT
*
FROM
`project.dataset.your_new_table`
WHERE
row BETWEEN 10000001
AND 10000100
Quite a hassle, but does the trick.
Why not export the resulting table into GCS?
It will automatically split tables into files if you use wildcards, and this export only has to be done one time, instead of querying every single time and paying for all the processing.
Then, instead of serving the result of the call to the BQ API, you simply serve the exported files.
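For example, a one-time export could use the EXPORT DATA statement, assuming a hypothetical bucket named my-bucket and the row-numbered table from the previous answer; the * wildcard lets BigQuery split the output across multiple files:
EXPORT DATA OPTIONS (
  uri = 'gs://my-bucket/trips/rows-*.csv',
  format = 'CSV',
  header = true,
  overwrite = true
) AS
SELECT *
FROM `project.dataset.your_new_table`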