I'm doing a project of decomposing the total power consumption signal to identify/ recognize each individual type of appliance being used. I would like to know if wavelet packet decomposition in Matlab be able to do such task. If so, please guide for methodology.
My input data set are the following:
1. The total power consumption of appliances (turning on/off randomly).
2. Power consumption of each individual appliance at the same time frame. That is, the sum of data for each appliance is equal to the total power consumption data in item 1.
Any help would be highly appreciated.
Related
Should we use the group by function in Power Query and create a new table, or is it better to create as many measures as we need ? (one measure for each column) ?
Which one is more powerful?
Thank you !
It depends on your purpose. If you have a granular fact table that you want to aggregate first before creating the data model, you can do that through Power Query before feeding the model. Even then, I would recommend doing it on the server-side if you are bringing a SQL table; so that you can perform a native SQL group by rather than having to do it through Power Query syntax solely. Power Query has some performance lagging and each nth step in PQ is evaluated from 1st step internally and it requires a full refresh of the table.
However, if you only want to perform group by to be utilized in an analysis, it is always a good idea to use DAX measures and refrain from using PQ. Also, you can't resort to PQ for different analysis scenarios. DAX is built for those scenarios and it is extremely powerful. DAX measures are the most powerful concept of Power BI. Also, they get evaluated in filter context/slicers; i.e. respond to the selection of values in slicers and / or whatever is present in the Axis (business case)
There are tons of supports for DAX measure optimization, such as SQLBI, Stack, Power BI community. If optimized correctly, DAX measures enhance report performance tremendously without creating any lagging in the report at all.
Few resources to look into
1
2
3
When you are creating a new table in power query, it means results are pre calculated and there will be some performance gain if we consider report usage. But, it will increase your Data Model size. Where as Measure will calculate things on the fly. This will keep your model size same but add some slowness in the presentation part. As a whole, there is no specific answer for your question as per my knowledge as it depends on so many other things like-
Your data size
How many measure you wants to create
How complex your logic inside measure's
How often you need reload your data
and so on...
I'm using a calculated column that is an average. The problem is, the average is above the range of possible values, which should be impossible. I made a calculated column that calculates the average star rating (out of a range of 1-5) and the value on a visual is coming up as 6, which shouldn't be possible, even if all the values were 5 stars, which it isn't. So there must be an outlier causing the average to be above the range of possible values, but it isn't in the original data source which Power BI pulls from. The original data source shows me a value of 4.1 as an average, which is within the expected range. But Power BI's dataset has introduced an outlier or (data is missing) that caused the average to become a 6.
I can elaborate on the dax below, but what I want to try to do is pull the dataset down from power bi to figure out why it's calculating its average that way. Looking at the source data, the average is 4.1 and there are no outliers in the source data. So, it's not the source data that's the problem. Basically, I want to find the outlier that's causing the average rating to differ in Power BI.
Avg Rating = IF(SUM(data[Total Reviews]) = 0, BLANK(), SUM(data[Monthly Stars])/SUM(data[Total Reviews]))
Here's a screencap that shows the two
relevant columns
Notice that I had to manually calculate (aka eyeball the columns and type into a calculator then calculate manually) these two columns, which came out to ~4.6. I'm trying to download this dataset to explore it in further detail without having to eyeball the dataset, as the source doesn't show this discrepancy.
To get to the data you have a number of options.
Create a new report in Power BI Desktop, and then use the connect to PBI Dataset option to access that data, in for example, a table. You can create your own report based on the dataset in the service as well.
Access that data via Analyze in Excel, which should allow you to access the data in a pivot table using Excel
Use the Export data from the visual option, using this you can download 30,000 rows into a csv, or 150,000 in to xlsx formats
Please note, that these options may not be available to you if you do not have the right permissions in the workspace, or options have been turned off in the Power BI Admin tenancy settings.
I want to import a 500 GB dataset into Power BI, but Power BI is limited 1 GB. How can I get the data into Power BI?
Thanks.
For 500GB I'd definitely recommend Direct Query mode (as Joe recommends) or a live connection to a SSAS cube. In these scenarios, the data model is hosted in a separate location (such as a database server) and Power BI sends its queries to that location and displays the returned results.
However, I'll add that the 1GB limit is the limit after compression. (Meaning you can fit more than 1GB of uncompressed data into the advertised 1GB dataset limit.)
While it would be incredibly difficult to reduce a 500GB dataset to 1GB (even with compression), there are things you can do once you understand how the compression works in Power BI.
In Power BI, compression is done by columns, not rows. So a column that has 800 million rows with identical values can see significant compression. Likewise, a column with a different value in every row cannot be compressed much at all.
Therefore:
Do not import columns you do not absolutely need for analysis (particularly identity columns, GUIDs, free-form text fields, or binary data such as images)
Look at columns with a high degree of variability and see if you can also eliminate them.
Reduce the variability of a column where possible. E.g. if you only need a date & not a time, do not import the time. If you only need the whole number, do not import 7 decimal places.
Bring in less rows. If you cannot eliminate high-variability columns, then importing 1 year of data instead of 17 (for example) will also reduce the data model size.
Marco Russo & the SQLBI team have a number of good resources for further optimizing the size of a data model (SSAS tabular, Power Pivot & Power BI all use the same underlying modelling engine). For example: Optimizing Multi-Billion Row Tables in Tabular
If possible given your source data, you could use Direct Query mode. The 1 GB limit does not apply to Direct Query. There are some limitations to Direct Query mode, so check the documentation to make sure that it will meet your needs.
Some documentation can be found here.
1) make Aggregation on data on sql side __reduce size
2) import only useful column____________reduce size
I have looked into the responses of "ItemSeach ()" and "lookUp()" functions in Amazon Advertising API and
could not find a possible way to get daily/monthly sales of an item.
Popular product research software like , JungleScout, ProfitPhonix, AMZ tracker etc do display Number of monthly sales but all of them show different results.
Does Amazon provide this information ? If not then how the above software are estimating it?
I think when they fetch the ASIN information, they do store "some thing" in their DB and next time when the same ASIN is pulled again then the estimated sales are roughly calculated based on DB previous value/score.
Any help will be highly appreciated .
Thanks
It is not a solution, but here is a reply from UnicornSmasher I found, it may help to save time searching for something that doesn't exist.
constantine We just took all of the bulk data from the products that are being tracked in AMZ Tracker and applied a formula to it all. If you have specific products that are way off please let us know! Certain categories we had less data on. This is version 1 of the research tool, so I'm sure it will continue to improve quickly over time.
Here is the link to question and answer:
amz forum
So, now, the question is 'What formula do they use?'
Let me know if you come up with an idea :)
Let me tell you first that if you're not the part of the Amazon data team you can't get the sales numbers of any product. And, its probably not easy to estimate sales using Amazon advertising API. You need to constantly track a huge number of products to estimate the sales. Here I can explain how AMZ Insight an Amazon tracking tool estimates the sales of any product.
They constantly track a few thousand products from all the categories and collect massive data. Then their in-house data scientist analyze the data to form the sales estimating algorithm. Relationship of multiple data points plots a scattered graph which means of course sales estimates are not 100 percent right.
Data is continuously gathered and analyzed by tracking the Best Seller Rank (BSR), Buybox, reviews and more factors. Then the relationship between this data is formed to come up with the unit sales. Once this relationship is in place then it is much easier to estimate monthly sales and revenue for the product.
I am currently doing a customer segmentation project in SAS.
I have identified 2700 customers who are have made a purchase in each of the 4 years I am analysing. For the cluster analysis the more purchases/customer each year the better the data quality is. However as I become more selective over number of purchases needed each year per customer, the less customers can be considered in the cluster analysis.
How should I go about choosing the cutoff point for the number of purchases necessary per customer per year to be considered for analysis. I am struggling with this trade off between data quality and having enough customers for analysis.
Thanks a lot! :)
There is no correct way. It entirely depends on your data.
Clustering such data is "magic" and the results tend to be all but statistically sound. More like random gueses.
Because of this, always try multiple parameters and carefully inspect the results. No equation ever will tell you what a good clustering is.