PowerBI Incremental Refresh Duplicates - powerbi

I have set up a Power BI dataset with incremental refresh following this guide https://learn.microsoft.com/en-us/power-bi/connect-data/incremental-refresh-configure and ensured that every table filters where the date is greater than RangeStart and less than or equal to RangeEnd, so that only one side of the comparison has the =. I then went through https://learn.microsoft.com/en-us/power-bi/connect-data/incremental-refresh-troubleshoot and noticed this comment:
With a refresh operation, only data that has changed at the data source is refreshed in the dataset. As the data is divided by a date, it’s recommended post (transaction) dates are not changed.
That sounds extremely limiting to me. Our data has two date/time columns, LastModified and RowCreatedAt. LastModified is the actual date/time of the last modification to the data in the row; RowCreatedAt is the date/time when that modification was persisted to the database. These can be very different (e.g., if a customer is new but has legacy data, LastModified may be very old while RowCreatedAt will be very recent).
I decided to go with RowCreatedAt, since that is something we control (e.g., if we partitioned on LastModified and then loaded in historical data, those rows would fall into old partitions and never be imported into Power BI after the initial refresh). Both LastModified and RowCreatedAt are updated when data changes in the system (e.g., a sales order gets a new line item added to it).
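For context, the filter step on each table looks roughly like this in Power Query (a sketch rather than our exact query; Source is whatever step precedes the filter):

FilteredRows = Table.SelectRows(Source, each [RowCreatedAt] > RangeStart and [RowCreatedAt] <= RangeEnd)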
My expectation was that when data changed and the partition date was updated, the refresh would properly update the data in the dataset (e.g., remove the old row and insert the new one, since the primary key is the same but other data has changed). That seems like completely normal, expected behavior, but from the documentation it sounds as though you can only import data that will never change, or you have to refresh your history back to the point where the change occurred. That seems like a crazy limitation (who has data that never changes, ever?), so I'm hopefully just misunderstanding something.

Related

Update BigQuery table soon after inserting records into the table

I have a requirement to load the data and then, a few minutes later, update the records. How can I achieve that? I am getting google.api_core.exceptions.BadRequest: 400 UPDATE or DELETE statement over table dataset.tablename would affect rows in the streaming buffer, which is not supported
Is there any way to flush the data from the streaming buffer to permanent storage?
I tried the option below, but this query gets the same error.
UPDATE dataset.tablename
SET _PARTITIONTIME = CURRENT_TIMESTAMP()
WHERE _PARTITIONTIME IS NULL
Streamed data is not immediately available for operations other than analysis (SELECT) for up to 90 minutes (typically much less). You can use streamingBuffer.oldestEntryTime in the tables.get response to see the age of the oldest row in the streaming buffer.
https://cloud.google.com/bigquery/docs/streaming-data-into-bigquery#dataavailability
As a potential workaround, you could create an independent table with desired changes and join it in a query/view with the table you're streaming to, to see newer values in your query results. Eventually, you could use the "change" table to merge changes into the original table.
https://cloud.google.com/bigquery/docs/reference/standard-sql/dml-syntax#merge_statement
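A rough sketch of that eventual merge (dataset, table, and column names below are placeholders):

MERGE dataset.tablename AS t
USING dataset.tablename_changes AS c
ON t.id = c.id
WHEN MATCHED THEN
  UPDATE SET value = c.value
WHEN NOT MATCHED THEN
  INSERT (id, value) VALUES (c.id, c.value)

This would only succeed once the affected rows have left the streaming buffer.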

Power BI - Copying data to new column before refreshing data

This is what I hope to be a very simple issue, I'm just having a hard time putting the right search terms together in order to find the answer.
Basically, I want to preserve the data from the last refresh before the data is refreshed again, in order to compare the difference.
Example:
I have a basic web scrape that grabs the latest stock price for Microsoft.
What I want to be able to do during the refresh is first copy the current value (283.85) to a new column and then refresh the data, so that I have the current and previous prices side by side.
Really tried to find an answer, but I don't think I'm using the correct terminology.
I have never used this method, but would it be easier to add a date column to your current table and make it your record table? That way you can do comparisons and build visuals from your data.
If you really want separate tables, you could update your table with the date column and then write a table query to get your latest stock price according to date.
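For the date-column idea, a minimal Power Query sketch that stamps each imported row with the refresh time (step and column names are just examples):

AddedSnapshotDate = Table.AddColumn(Source, "SnapshotDate", each DateTime.LocalNow(), type datetime)

A plain import refresh still replaces the previously loaded rows, so you'd also need somewhere to accumulate the stamped snapshots (for example, appending them to a history table outside the report) before a previous-versus-current comparison works.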

PowerBI / PowerPivot - Data not aggregating by time frame

I have created a Power Pivot model, included in the image below. I am trying to include the "IncurredLoss" value and have it sliced by time. Written Premium is in the fact table and displays correctly; I am aiming for IncurredLoss to display in a similar fashion.
I have tried the following solutions:
Add new related column: Related(LossSummary[IncurredLoss]). Result: No data
DAX Summary Measure: =CALCULATE(SUM(LossSummary[IncurredLoss])). Result: Sum of everything in LossSummary[IncurredLoss] (not time sliced)
Simply adding the Incurred Loss column to the Pivot Table panel. Result: Sum of everything in LossSummary[IncurredLoss] (not time sliced)
A few other notes:
LossKey joins LossSummary to PolicyPremiumFact
Reportdate joins PolicyPremiumFact to the Calendar.
There is 1 row in LossSummary per date and Policy. LossKey contains this information and is the PK on that table.
Any ideas, clarifications or pointers are most certainly welcome. Thank you!
The related column should work. I was able to get it to work in both Excel 2016 and Power BI Desktop. Rather than bombarding you with questions, I'll try and walk through how I would troubleshoot further, in the hopes it gets you to a solution faster:
First, check the PolicyPremiumFact table inside Power Pivot and see if the IncurredLossRelated field is blank or not. If it is consistently blank, then the related column isn't working. The primary reason the related column wouldn't work is if there's a problem with your relationships. Things I would check:
Ensure that the relationships are between the fields you think they are between (i.e. you didn't accidentally join LossKey in one table to a different field in the other table)
Ensure that the joined fields contain the same data (i.e. you didn't call a field LossKey, but in fact, it isn't the LossKey at all)
Ensure that the joined fields are the same data type in Power Pivot (this is most common with dates: e.g. joining a text field that looks like a date to an actual date field may work, but not act as expected)
If none of the above are the problem, it doesn't hurt to walk through your data for a given date in Power Pivot. E.g. filter your PolicyPremiumFact table to a specific date and look at the LossKeys. Then go to the LossSummary table and filter to those LossKeys. Stepping through like this might reveal an oversight (e.g. maybe the LossKeys weren't fully loaded into your model).
If none of the above reveals anything, or if the related column is not blank inside Power Pivot, my suggestion would be to try a newer version of Excel (e.g. Excel 2016), or the most recent version of Power BI Desktop.
If the issue still occurs in the most recent version of Excel/Power BI Desktop, then there's something else going on with your data model that's impacting the RELATED calculation. If that's the case, it would be very helpful if you could mock up your file with sample data that reproduces the problem and share it.
One final suggestion I have is to consider restructuring your tables before they arrive in your data model. In your case, I'd recommend restructuring PolicyPremiumFact to include all the facts from LossSummary, rather than having a separate table joined to your primary fact table. This is what you're doing with the RELATED field to some extent, but it's cleaner to do before or as your data is imported into Power Pivot (e.g. using SQL or Power Query) rather than in DAX.
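For example, a Power Query sketch of that kind of merge (step names are just examples; it assumes both tables are already loaded as queries):

Merged = Table.NestedJoin(PolicyPremiumFact, {"LossKey"}, LossSummary, {"LossKey"}, "LossSummaryData", JoinKind.LeftOuter),
Expanded = Table.ExpandTableColumn(Merged, "LossSummaryData", {"IncurredLoss"}, {"IncurredLoss"})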
Hope some of this helps.

Power Query Formula Language - Detect type of columns

In Power BI, I've got some query tables generated from imported data. All the data comes in as type 'Any', and I'm trying to automatically detect the type of the data in each column.
Some of the queries generate tables with columns based on the incoming data - I don't know what the columns are going to be until the query runs and sets up the table (the data comes from an Azure blob). Since I will have quite a few tables to maintain, whose columns can change (possibly with new columns being added) on any data refresh, it would be unmanageable to go through all of them each time and press 'Detect Data Type' on the columns.
So I'm trying to figure out how I can do a 'Detect Data Type' in the query formula language and attach it to the end of the query that generates the table columns. I've tried grabbing the first entry in a column and doing Value.Type(column{0}), but this comes out as 'Text' for a column that holds integers. Pressing 'Detect Data Type' does, however, correctly identify the type as 'Whole Number'.
Does anyone know how to detect a column's entry types?
P.S. I'm not too worried about a column possibly holding values of different data types
You seem to have multiple issues here, and your solution will be fragile - there's a better way. But let's first deal with column type detection. Power Query uses 'any' as its go-to data type. You can write a function that samples the rows of a column in a table, does a best-match data type detection, and then explicitly sets the data type of the column. This is messy and tricky, since you need to do it once per column. That might be workable for a fixed schema, but for a dynamic schema you'll run into a couple of things very quickly. First, you'll need to write some crazy PQ code to list all the columns and run your function on each. This will work the first time, but it might break on subsequent refreshes because data model changes are not allowed during refresh. If you're using a tool like Power BI Desktop, you'll be able to fix things up; if you publish your report to the Power BI service, you'll just see refresh errors.
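To illustrate the sampling idea, here's a rough sketch of such a function (GuessType and ApplyGuessedTypes are made-up names, and a real version would need more type checks and smarter sampling than just the first non-null value):

let
    // Guess a type from one sample value by attempted conversion.
    // Note: Number.From also succeeds on numeric-looking text, so adjust the ordering/checks for your data.
    GuessType = (sample) =>
        if sample = null then type text
        else if (try Number.From(sample))[HasError] = false then type number
        else if (try Date.From(sample))[HasError] = false then type date
        else type text,
    // Build a {column name, type} pair for every column from its first non-null value,
    // then set the column types explicitly.
    ApplyGuessedTypes = (tbl as table) as table =>
        Table.TransformColumnTypes(
            tbl,
            List.Transform(
                Table.ColumnNames(tbl),
                (name) => {name, GuessType(List.First(List.RemoveNulls(Table.Column(tbl, name)), null))}))
in
    ApplyGuessedTypes

You'd save that as its own query and invoke it as the last step of each table query, e.g. ApplyGuessedTypes(PreviousStep).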
Dynamic Schemas will suffer the same data model change issue I mentioned above.
The alternate solution that you won't have problems with is using a Direct Query data source instead of using Power Query. If you load your data into Azure SQL or a Tabular Model, the reporting layer will get the updated fields automatically so you don't have to try to work around using PQ.

Is there any other approach for updating a row in Big Query apart from overwriting the table?

I have a package data with some of its fields as following:
packageid-->string
status--->string
status_type--->string
scans--->record(repeated)
scanid--->string
status--->string
scannedby--->string
Per day I have data for 100,000 packages. The total package data size per day is about 100 MB, and for one month it is about 3 GB. For each package, 3-4 updates can come in. So do I have to overwrite the package table every time a package update (e.g. just a change in the status field) comes in?
Suppose I have data for 3 packages in the table and an update for the 2nd package comes in: do I have to overwrite the whole table (deleting and re-adding all the data is 2 transactions per package update)? For 100,000 packages, the total transactions would be 10^5 * 10^5 * 2 / 2.
Is there any other approach for atomic updates without overwriting the table? (If the table contains 1 million entries and a package update comes in, overwriting the whole table is a lot of overhead.)
Currently there is no way to update individual rows. We do see this use case somewhat often, and we recommend something similar to what Mikhail suggested. Basically, if you have some unique ID for a logical row, and a timestamp of the update time to the row data, you can simply add every update as a new row, and apply a view over the table to give you the desired rows.
Your view would look something like this:
SELECT *
FROM (
  SELECT
    *,
    MAX(<timestamp_column>) OVER (PARTITION BY <id_column>) AS max_timestamp
  FROM <table>
)
WHERE <timestamp_column> = max_timestamp
(cribbed from here: Return only the newest rows from a BigQuery table with duplicate items)
If your table is partitioned into daily tables (or becomes static after some period), you can then replace the view with the result of the view query after the table stabilizes, and improve your query efficiency.
e.g.
Add Data to TABLE_RAW.
Create view TABLE that performs the above query over TABLE_RAW
At some point after TABLE_RAW is stable, query TABLE with a destination table of TABLE, with write disposition WRITE_TRUNCATE.
Unfortunately, this does add a bit of overhead. That said, for your use case you may be able to just leave the view in place indefinitely, which would simplify things a bit.
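For reference, step 3 might look roughly like this with the bq command-line tool, querying over TABLE_RAW directly (mydataset, row_id, and updated_at are placeholders for your own names):

bq query --use_legacy_sql=false \
  --destination_table=mydataset.TABLE \
  --replace \
  'SELECT * EXCEPT(max_timestamp)
   FROM (
     SELECT *, MAX(updated_at) OVER (PARTITION BY row_id) AS max_timestamp
     FROM mydataset.TABLE_RAW
   )
   WHERE updated_at = max_timestamp'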
You cannot update a row in a BigQuery table; you can only add one.
Overwriting the table on each and every transaction doesn't make sense at all from any perspective.
I would suggest just adding each and every transaction as a new row.
Meantime, if for any reason (storage cost, query cost, query performance, etc.) you want to dedupe, you can do a batch dedupe periodically - say, daily. In this case, having the original data partitioned into daily tables will be beneficial: at any moment you will need only the latest deduped table and the recent daily table to query the latest transactions, and previous days' daily tables can be deleted if you are worried about storage cost.
BigQuery supports updates now (via DML), and supports transactions as well.
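For example, a single-row status update along the lines of the schema above (mydataset and the WHERE value are made up):

UPDATE mydataset.package
SET status = 'delivered', status_type = 'final'
WHERE packageid = 'PKG-001'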