PCA analysis with the scikit-learn library

I am unable to perform a PCA analysis with n_components greater than the number of rows when a dataset has n_cols > n_rows.
Do you have any idea how to resolve this issue?
Thanks,
Richard
I have tried all the options for the n_components parameter.
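For reference, scikit-learn's PCA can return at most min(n_samples, n_features) components, so when a dataset has more columns than rows the row count is the hard ceiling. A minimal sketch with synthetic data (the shapes are chosen only for illustration):

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 100))        # 20 rows, 100 columns (n_cols > n_rows)

pca = PCA(n_components=min(X.shape))  # at most 20 components for this matrix
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)                # (20, 20)

# Requesting more components than min(n_samples, n_features), e.g.
# PCA(n_components=50).fit(X), raises a ValueError for this 20x100 matrix.

PCA itself cannot produce more components than min(n_samples, n_features), because that is the rank limit of the data matrix.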

VertexAI Tabular AutoML rejecting rows containing nulls

I am trying to build a binary classifier based on a tabular dataset that is rather sparse, but training is failing with the following message:
Training pipeline failed with error message: Too few input rows passed validation. Of 1169548 inputs, 194 were valid. At least 50% of rows must pass validation.
My understanding was that tabular AutoML should be able to handle null values, so I'm not sure what's happening here, and I would appreciate any suggestions. The documentation explicitly mentions reviewing each column's nullability, but I don't see any way to set or check a column's nullability on the dataset tab (perhaps the documentation is out of date?). Additionally, the documentation explicitly mentions that missing values are treated as null, which is how I've set up my CSV. The documentation for numeric columns, however, does not explicitly list support for missing values, just NaN and inf.
The dataset is 1 million rows and 34 columns, and only 189 rows are null-free. My sparsest column has data in only 5,000 rows, with the next two sparsest having data in 72k and 274k rows, respectively. The columns are a mix of categorical and numeric, with only a handful of columns without nulls.
The data is stored as a CSV, and the dataset import seems to run without issue. The "Generate statistics" step ran on the dataset, but for some reason the missing % column failed to populate. What might be the best way to address this? I'm not sure if this is a case where I need to change my null representation in the CSV, change some dataset/training setting, or if it's an AutoML bug (less likely). Thanks!
To allow invalid and null values during training and prediction, you have to explicitly set the "allow invalid values" flag to Yes during training. You can find this setting under the model training settings on the dataset page. The flag has to be set on a column-by-column basis.
I tried Kabilan Mohanraj's suggestion and it resolved my issue. What I had to do was click the dropdown to allow invalid values into training. After making this change, all rows passed validation and my model was able to train without issue. I'd initially assumed that missing values would not count as invalid, which was incorrect.
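For anyone training through the Python SDK rather than the console, the same per-column choice is expressed through the training job's transformation spec. This is only a rough sketch: the project, dataset and column names are placeholders, and the column_transformations format and the invalid_values_allowed field should be verified against the current Vertex AI documentation.

from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")

# Placeholder dataset ID; use the tabular dataset created from the CSV import.
dataset = aiplatform.TabularDataset("YOUR_DATASET_ID")

job = aiplatform.AutoMLTabularTrainingJob(
    display_name="sparse-binary-classifier",
    optimization_prediction_type="classification",
    # Per-column transformations; invalid_values_allowed is assumed here to
    # mirror the console's "allow invalid values" dropdown (check the docs).
    column_transformations=[
        {"numeric": {"column_name": "sparse_numeric_col", "invalid_values_allowed": True}},
        {"categorical": {"column_name": "sparse_categorical_col"}},
    ],
)

model = job.run(dataset=dataset, target_column="label")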

Sentiment Analysis PowerBI AI Insights Visualization

I have a dataset of online product reviews (without any grades/stars/etc.). To this dataset I applied the integrated Power BI AI Insights Text Analytics sentiment analysis model and got a sentiment score for each review. Next, I transformed the score into discrete textual values: POSITIVE, NEGATIVE and NEUTRAL.
The dataset was artificially created by me, so I know the polarity of each comment. Now I want to compare the predicted value to the actual value. I've done this by adding a new column that compares the actual value with the predicted value and displays "PREDICTED" if the correct value was predicted and "NOT PREDICTED" if the prediction was false (it doesn't matter whether it is positive, negative or neutral). My goal is to calculate some model metrics so I can evaluate the capabilities of this integrated Power BI model and visualize the results. How can I do this? Is accuracy the first thing I should start with? If so, how can I calculate and visualize a result like accuracy?
Thank you for all your answers in advance.
Yes, consider accuracy first. If you find that 70 or 80 percent or more of the results are accurate, you can rely on the Power BI AI Insights Text Analytics sentiment analysis and then create your visuals for the sentiment data. But if predicted and not-predicted results occur roughly 50-50, you may want to go for a third-party sentiment analysis service such as Google or Alchemy.
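For the metric itself: accuracy is simply the share of reviews where the predicted label matches the actual one, and in Power BI it can be a measure that divides the count of "PREDICTED" rows by the total row count. If you export the table, the same calculation (plus a per-class breakdown) looks like this in Python; the column names and labels below are made up for illustration:

import pandas as pd

df = pd.DataFrame({
    "actual":    ["POSITIVE", "NEGATIVE", "NEUTRAL", "POSITIVE"],
    "predicted": ["POSITIVE", "NEUTRAL",  "NEUTRAL", "POSITIVE"],
})

accuracy = (df["actual"] == df["predicted"]).mean()
print(f"accuracy = {accuracy:.2%}")   # share of correctly predicted reviews

# Confusion matrix: rows = actual class, columns = predicted class
print(pd.crosstab(df["actual"], df["predicted"]))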

Find position of a character in a string: M query function vs FIND in DAX

I tried to use the MID and FIND functions in Power BI the way they can be used in Excel. However, I got an error that 'find' wasn't recognized.
After searching for a while, I came to the conclusion that the FIND and MID functions work in DAX (in Excel and Power BI), but not in an M query (custom column in the Power Query Editor). Instead of FIND, in an M query we should use Text.PositionOfAny.
Here is an example:
DAX:
MID([TRAFFIC_SIGNAL], FIND("&", [TRAFFIC_SIGNAL]), 3)
M query:
Text.Combine({Text.Start(Text.Upper([TRAFFIC_SIGNAL]), 3), " ",
Text.Middle(Text.Upper([TRAFFIC_SIGNAL]),
Text.PositionOfAny([TRAFFIC_SIGNAL], {"&"})+1, 3)})
It works, so I would like to share it; I didn't know the difference between DAX and M query in Power BI before, but this example helped me understand it.
I am not sure what you are looking for, but I am just going to highlight the differences between M query and DAX.
M Query:
M query is used to bring the data into the model
This can be accessed using the Power Query Editor
DAX:
DAX is used to create measures and columns after the data is pulled
This is mainly used for summarizing the data
Hope this helps.
M and DAX are entirely different languages used for different purposes.
Primarily,
M is how you get and transform your data before loading it into the data model.
DAX is for reading the data in your data model and aggregating it to show in visuals.
There are plenty of things you can do with both where it isn't clear which is the better option. It can be highly case-dependent, but the above is a simple guide.
In any case, I wouldn't recommend that M code for that purpose. Something like the following should be simpler and more similar to the DAX code:
Text.Middle([TRAFFIC_SIGNAL], Text.PositionOf([TRAFFIC_SIGNAL],"&"), 3)

DAX way to return summarised data

I hope I'm not missing an easy solution; I am still getting used to DAX and can't yet find an appropriate logic.
I have a large dataset (>10m rows) which I want to test. An identifier column "DocumentNumber" might occur on multiple rows, and I want to find where the sum of "Value" over these rows for a given "DocumentNumber" is non-zero.
I tried to use Power Query: removed all but these two columns > Group By > DocumentNumber > Sum of Value. However, my 32-bit version of Excel appears to run out of memory performing this step: Expression.Error: Evaluation ran out of memory and can't continue.
I then wrote a DAX measure for the Sum of Values and dropped it into a pivot table with a view to filtering out the zero values, but when I try to drag DocumentNumber into the rows there are more than a million rows, so the table won't render.
Is there a logic I should follow in DAX that would achieve step 2 before bringing it to the pivot table? Can DAX actually create a new table in the data model containing the aggregated and filtered data, rather than using a pivot? I believe this is possible in Power BI but am not sure about the Excel environment.
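To pin down the target result independently of the tool: the logic being described is "group by DocumentNumber, sum Value, keep only the groups whose sum is non-zero". A small pandas sketch of that logic, purely as an illustration (the sample values are made up), not as an Excel solution:

import pandas as pd

df = pd.DataFrame({
    "DocumentNumber": ["A1", "A1", "B2", "B2", "C3"],
    "Value":          [100, -100, 50, -20, 0],
})

sums = df.groupby("DocumentNumber", as_index=False)["Value"].sum()
nonzero = sums[sums["Value"] != 0]
print(nonzero)   # only B2 remains: its values do not net to zero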

Calculate variance, quartiles in Weka explorer

I am using Weka for data mining on a dataset. I can find the median and standard deviation using the Explorer, but not the range, quartiles, variance or mode. Is there any configuration required in the tool for this, or is it just not possible with the tool?
You can use a filter, namely the unsupervised attribute filter "AddExpression" or "MathExpression", to calculate something like this for a single attribute.
Obviously, this is primitive, and you cannot do it for every attribute in one fell swoop.
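If exporting the dataset (for example to CSV) is acceptable, the missing statistics are easy to compute outside Weka. A small pandas sketch; the file and column names here are hypothetical:

import pandas as pd

df = pd.read_csv("my_dataset.csv")
col = df["some_numeric_attribute"]

print("variance :", col.var())
print("range    :", col.max() - col.min())
print("quartiles:", col.quantile([0.25, 0.5, 0.75]).to_dict())
print("mode     :", col.mode().tolist())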