If I have a list of many blobs in a container, can I set my RangeStart and RangeEnd parameters to be based on the modified timestamp of the CSV files? My blobs are partitioned by created date, but rows can be updated historically. I need to make sure that Power BI has the latest version of each row (based on the updated_at timestamp). My plan is to:
1- filter the blobs I want based on the blob prefix (virtual directory)
2- filter the blobs based on the Date modified attribute using the RangeStart and RangeEnd parameters (this greatly limits the number of blobs that need to be scanned)
3- sort the data and drop duplicates as a final step
Would this pattern work, and does it seem efficient? My problem with using the updated_at timestamp itself as the incremental column is that files created weeks or months ago might get updated (it depends purely on customer activity). It seems like Power BI would need to scan a lot of blobs just to find out which rows have been updated.
I tested this out and it works in Power BI Desktop, but I am not seeing the parameters show up in the Power BI service, which has me worried (the refresh has been running for ~4 hours so far).
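The three steps described above can be sketched in Power Query M roughly as follows (the container URL, prefix, and column names are hypothetical; RangeStart/RangeEnd are the standard incremental-refresh parameters, which must be of type DateTime):

```m
// Minimal sketch of the pattern, assuming hypothetical names throughout.
let
    Source = AzureStorage.Blobs("https://myaccount.blob.core.windows.net/mycontainer"),
    // 1- keep only blobs under the virtual directory (prefix filter)
    PrefixFiltered = Table.SelectRows(Source, each Text.StartsWith([Name], "events/")),
    // 2- filter the blob listing on Date modified with RangeStart/RangeEnd
    Changed = Table.SelectRows(PrefixFiltered,
        each [#"Date modified"] >= RangeStart and [#"Date modified"] < RangeEnd),
    // parse each CSV and combine the rows
    Parsed = Table.AddColumn(Changed, "Rows",
        each Table.PromoteHeaders(Csv.Document([Content]))),
    Combined = Table.Combine(Parsed[Rows]),
    // 3- sort newest-first, buffer to pin the sort order, then drop duplicate keys
    Sorted = Table.Buffer(Table.Sort(Combined, {{"updated_at", Order.Descending}})),
    Deduped = Table.Distinct(Sorted, {"id"})
in
    Deduped
```

One caution worth noting: step 3 only deduplicates within a single refresh partition. A row that gets updated later will land in a newer partition while its old version still sits in an older one, so duplicates can survive across partitions unless the old partitions are refreshed too.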
I have made a table as follows (https://apexinsights.net/blog/convert-date-range-to-list):
In this scenario, suppose I configure incremental refresh on the Start Date column: will Power BI support this correctly? I am asking because, say the refresh covers the last 2 days or the last 2 months, it will fetch the source rows and apply the transform to the partition. My concern is that I will have to put the date parameter filter on Start Date prior to any non-folding steps so that the query folds (alternatively, Power Query will auto-apply the date filter so that the query can fold).
So when it pulls the data based on Start Date and applies the transforms, I'm not able to think clearly about what kind of partitions it will create: will they be based on the Start Date or on the expanded date? Is query folding supported in this scenario?
This is quite a complicated scenario, where I would probably just avoid adding incremental refresh.
You would have to use the RangeStart/RangeEnd parameters twice in this query: once where the filter gets folded to the data source to retrieve the ranges that overlap the [RangeStart, RangeEnd) interval, and a second time after expanding the ranges, to filter out the individual rows that fall outside [RangeStart, RangeEnd).
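A sketch of what "using the parameters twice" could look like in M (the server, table, and column names are hypothetical, and the parameters are assumed to be of type Date here; with DateTime parameters you would compare on DateTime.Date(...) instead):

```m
let
    Source = Sql.Database("myserver", "mydb"),
    Ranges = Source{[Schema = "dbo", Item = "DateRanges"]}[Data],
    // First use (folds to the source): keep ranges overlapping [RangeStart, RangeEnd)
    Overlapping = Table.SelectRows(Ranges,
        each [StartDate] < RangeEnd and [EndDate] >= RangeStart),
    // Expand each range into one row per date (folding stops here)
    WithDates = Table.AddColumn(Overlapping, "Date",
        each List.Dates([StartDate],
            Duration.Days([EndDate] - [StartDate]) + 1, #duration(1, 0, 0, 0))),
    Expanded = Table.ExpandListColumn(WithDates, "Date"),
    // Second use: drop expanded rows that fall outside [RangeStart, RangeEnd)
    Final = Table.SelectRows(Expanded, each [Date] >= RangeStart and [Date] < RangeEnd)
in
    Final
```

The second filter matters because an overlapping range can extend beyond the partition boundary, and without it the same expanded dates would be loaded into more than one partition.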
This is what I hope to be a very simple issue, I'm just having a hard time putting the right search terms together in order to find the answer.
Basically, I want to preserve the data from the last refresh before the data is refreshed again, in order to compare the difference.
Example:
I have a basic web scrape that runs off and grabs the latest stock price for Microsoft:
What I want to be able to do during the refresh is to first copy the current value (283.85) to a new column and then refresh the data, so that I have a side by side current and previous price.
Really tried to find an answer, but I don't think I'm using the correct terminology.
I have never used this method. Would it be easier to add a date column to your current table and make it your record table? That way you can build comparisons and visuals from your data.
If you really want separate tables, you could update your table with the date column and then write a table query to get your latest stock price according to date.
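With a date column in place as suggested above, the "latest vs. previous" comparison could also be done with a pair of measures rather than a second table. This is only a sketch under the assumption of a hypothetical 'StockPrices' table with [Date] and [Price] columns:

```dax
Latest Price =
VAR LastDate = MAX ( StockPrices[Date] )
RETURN
    CALCULATE ( MAX ( StockPrices[Price] ), StockPrices[Date] = LastDate )

Previous Price =
VAR LastDate = MAX ( StockPrices[Date] )
VAR PrevDate =
    CALCULATE ( MAX ( StockPrices[Date] ), StockPrices[Date] < LastDate )
RETURN
    CALCULATE ( MAX ( StockPrices[Price] ), StockPrices[Date] = PrevDate )
```

Placing the two measures side by side in a visual then shows the current and prior scraped price without any copy-before-refresh step, as long as each refresh appends a new dated row instead of overwriting the old one.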
I have a report in Power BI that cannot refresh because the data from the table is too large:
The amount of data on the gateway client has exceeded the limit for a single table. Please consider reducing the use of highly repetitive strings values through normalized keys, removing unused columns, or upgrading to Power BI Premium
I have tried to shrink the columns used in the dataset to the best of my ability, but it is still too large to refresh. I did a test where, instead of using a single query to retrieve the data, I made two queries that split the columns roughly half and half and then linked them back together in Power BI using their ID column. It looked like the data refresh started working once I split the table's data into two separate queries.
Please correct me if there is a better method to trim the data down so the dataset can refresh, but for now this is the best solution I see. What I am wondering is: since my data is now split into two separate queries, what is the best way to adapt the existing visualizations, which are linked to the full, non-refreshable query, so they use the split, refreshable queries? It looks like I would have to recreate the visuals from scratch, but if there is a way to simply do a mass replace of the fields, that would save a lot of time. The split queries I created both have the same fields as the non-split query.
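For reference, the split described above could look like the following in M (the server, table, and column names are hypothetical; the key point is that both queries keep the ID column so the halves can be related 1:1 in the model):

```m
// Query 1: key column plus the first half of the columns.
let
    Source = Sql.Database("myserver", "mydb"),
    Orders = Source{[Schema = "dbo", Item = "Orders"]}[Data],
    Half1 = Table.SelectColumns(Orders, {"ID", "CustomerName", "OrderDate"})
in
    Half1

// Query 2 is identical except it selects the remaining columns:
//     Half2 = Table.SelectColumns(Orders, {"ID", "Amount", "Status"})
```

Each query then sends a smaller table through the gateway, which is what appeared to get the refresh under the single-table limit in the test.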
I'm using a calculated column that is an average, and the average is coming out above the range of possible values, which should be impossible. The column calculates the average star rating (on a scale of 1-5), and a visual shows the value as 6, which shouldn't be possible even if every rating were 5 stars (and they aren't). So there must be an outlier pushing the average above the valid range, but it isn't in the original data source that Power BI pulls from: the source shows an average of 4.1, which is within the expected range. Somewhere, Power BI's dataset has introduced an outlier (or is missing data) that pushed the average to 6.
I can elaborate on the DAX below, but what I want to do is pull the dataset down from Power BI to figure out why it's calculating the average that way. Looking at the source data, the average is 4.1 and there are no outliers, so the source data is not the problem. Basically, I want to find the outlier that's causing the average rating to differ in Power BI.
Avg Rating = IF(SUM(data[Total Reviews]) = 0, BLANK(), SUM(data[Monthly Stars])/SUM(data[Total Reviews]))
Here's a screencap that shows the two relevant columns.
Notice that I had to calculate these two columns manually (eyeballing the columns and typing the values into a calculator), which came out to ~4.6. I'm trying to download this dataset to explore it in further detail without having to eyeball it, as the source doesn't show this discrepancy.
To get to the data you have a number of options:
1- Create a new report in Power BI Desktop and use the connect to PBI Dataset option to access that data in, for example, a table. You can also create your own report based on the dataset in the service.
2- Access the data via Analyze in Excel, which should let you explore it in a pivot table.
3- Use the Export data option on the visual; this lets you download 30,000 rows into a CSV, or 150,000 into XLSX format.
Please note that these options may not be available to you if you do not have the right permissions in the workspace, or if they have been turned off in the Power BI admin tenant settings.
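Once connected (e.g. via option 1, or in DAX Studio), a DAX query along these lines could help locate the slice that drags the average above 5. It reuses the column names from the measure shown earlier; the grouping column data[Month] is a hypothetical stand-in for whatever granularity the data actually has:

```dax
EVALUATE
ADDCOLUMNS (
    SUMMARIZE ( data, data[Month] ),
    "Stars", CALCULATE ( SUM ( data[Monthly Stars] ) ),
    "Reviews", CALCULATE ( SUM ( data[Total Reviews] ) ),
    "Avg",
        DIVIDE (
            CALCULATE ( SUM ( data[Monthly Stars] ) ),
            CALCULATE ( SUM ( data[Total Reviews] ) )
        )
)
ORDER BY [Avg] DESC
```

Any group where [Avg] exceeds 5 has more star points than reviews, which would point at duplicated star rows or missing review counts in the loaded data rather than at the source.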
I'm brand new to Power BI and I'm used to setting up most of my data in SQL Server (for SSRS). I have a dataset and was able to add a Calendar table with my dates. My goal is to do a year-over-year comparison. I got the year-over-year part working with the help of a couple of tutorials, but I want to restrict the report output to data up to the last end of month (otherwise the YoY shows a case differential for dates out into 2021, which is not helpful). I need a dynamic filter, and all I seem to be able to set are static filters. This filter needs to be on the data itself; it is nothing a user can or should touch. Any help would be appreciated.
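One common way to get a dynamic, user-untouchable cutoff (a sketch of an approach, not something stated in the post) is a flag column on the Calendar table that is then applied as a report- or page-level filter. The table name 'Calendar' is assumed here:

```dax
-- Calculated column on the (assumed) 'Calendar' table: TRUE for every date
-- up to the end of the previous month. It re-evaluates on each dataset
-- refresh, so the cutoff moves forward automatically.
Up To Last Month End =
'Calendar'[Date] <= EOMONTH ( TODAY (), -1 )
```

Setting a report-level filter to Up To Last Month End = TRUE restricts all visuals to completed months, and since it lives in the filter pane configuration rather than a slicer, report readers never interact with it.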