Is there a way to replace existing values with NaN - python-2.7

I'm experimenting with the algorithms in IPython Notebooks and would like to know if I can replace the existing values in a dataset with NaN (about 50% or more) at random positions, with each column having a different proportion of NaN values.
I'm using the Iris dataset for this experimentation to see how the algorithms work and which one works the best.
Thanks in advance for the help.

There is a replace function in Python.
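For NaN at random positions, a mask per column may be more direct than a value-based replace. The following is only a minimal sketch, assuming the numeric columns are held in a NumPy array; the function name `mask_with_nan`, the `fractions` argument and the stand-in data are made up for illustration and are not from the question.

```python
import numpy as np

def mask_with_nan(data, fractions, seed=0):
    """Return a float copy of `data` with NaN at random rows of each column.

    fractions[j] is the share of rows in column j to replace with NaN.
    """
    rng = np.random.RandomState(seed)  # fixed seed so the result is repeatable
    out = np.array(data, dtype=float)  # copy; NaN requires a float dtype
    n_rows, n_cols = out.shape
    for j, frac in zip(range(n_cols), fractions):
        n_missing = int(round(frac * n_rows))
        # Pick distinct row positions so exactly n_missing cells are blanked.
        rows = rng.choice(n_rows, size=n_missing, replace=False)
        out[rows, j] = np.nan
    return out

# Stand-in for the four numeric Iris columns (150 rows); the values are made up.
data = np.arange(600, dtype=float).reshape(150, 4)
masked = mask_with_nan(data, fractions=[0.5, 0.6, 0.7, 0.8])
print(np.isnan(masked).mean(axis=0))  # share of NaN per column
```

The original array is left untouched, so the same data can be masked again with different proportions or a different seed.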

Related

Virtual array to be used into VLOOKUP and SUMIF formulas

I have the following tables
In Table 1 I have some items with relative quantity values. Beneath, I have a look-up table from which I can find the rule for summing up the item's quantities into the right Container.
Currently I use additional columns, one for each Container, to help accomplish the task.
Each cell in additional columns has the following formula (ex. E4):
=IF(VLOOKUP($B4,$D$12:$G$17,MATCH(E$2,$D$11:$G$11,0),0)="x",1,0)
Then, each Container has a Sum of Values calculated as follows (ex. E3):
=SUMPRODUCT($C$4:$C$9,E$4:E$9)
The question is... Is there a way (no VBA) to obtain the same result without using additional helping columns?
I would like to use something like this as the formula (but it doesn't work):
=SUMPRODUCT($C$4:$C$9,IF(VLOOKUP($B4:$B9,$D$12:$G$17,MATCH(E$2,$D$11:$G$11,0),0)="x",1,0))
In short, I don't know whether (and if so, how) the helper columns in my sheet can be calculated on the fly by Excel as virtual columns directly inside a cell formula.
There are no limitations on the use of VLOOKUP and SUMIF functions: SUMIF, SUMIFS, INDEX, MATCH and any other combination of Excel functions is fine as long as the goal of eliminating the helper columns is achieved.
Any help on this would be really appreciated.
Thanks in advance to everyone
Try this.
In E3, enter this modified version of your formula and copy it across to G3:
=SUMPRODUCT($C$4:$C$9,IF(VLOOKUP(T(IF({1},$B4:$B9)),$D$12:$G$17,MATCH(E$2,$D$11:$G$11,0),0)="x",1,0))
Or,
=SUMPRODUCT($C$4:$C$9*(VLOOKUP(T(IF({1},$B4:$B9)),$D$12:$G$17,MATCH(E$2,$D$11:$G$11,0),0)="x"))

Creating Matrix in Stata

I am simulating PGA tournaments using Stata. My simulation results table consists of:
column 1: the names of the 30 players in the tournament
columns 2 - 30,001: the four-round results of my Monte Carlo simulations.
What I am trying to do is create a 30 x 30 matrix with the golfers' names in column 1 and as the column names, where each cell represents the percentage of times Golfer A beat Golfer B outright across the 30,000 simulations. Is this possible to do in Stata? Thanks
I tend to say that everything is possible in every programming language, but some things are much more difficult to do in some languages than in others. I do not think that Stata is a great tool for what you intend to do.
You need to provide some code examples for us to be able to help you with your task, but here is one thing I can say. Stata has two programming languages. One is often called Stata (but is called ado on StataCorp's website) and the other is Mata. If for some reason you need to use Stata, you should do this in Mata, which has more matrix operators than ado. Also, in ado you can't store text in a matrix, so if you want to store the golfers' names you need to use Mata, though you can also use row and column indexes to keep track of the golfers.
With that said, Stata is primarily a tool for operating on and analyzing a single dataset loaded into memory (support for multiple datasets was added recently). So to answer your question: yes, this can be done in Stata, but you are probably much better off doing it in a language with more support for multidimensional arrays/vectors, for example R or Python.
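As a rough illustration of the Python route, here is a sketch using NumPy. It assumes the simulation table has been reduced to one total score per player per simulation, that lower scores win, and that ties do not count as a win for either player; the random `scores` array and the smaller simulation count are placeholders, not the asker's data.

```python
import numpy as np

rng = np.random.RandomState(0)
n_players, n_sims = 30, 1000  # the question has 30,000 simulations; fewer here to keep it quick

# Hypothetical stand-in for the simulation table:
# one total tournament score per player (row) per simulation (column).
scores = rng.randint(260, 300, size=(n_players, n_sims))

# beats[a, b] = percentage of simulations in which player a scored
# strictly lower than player b (lower is better in golf).
beats = np.zeros((n_players, n_players))
for a in range(n_players):
    # Broadcasting compares player a's scores against every player's scores.
    beats[a] = (scores[a] < scores).mean(axis=1) * 100.0

print(beats[:3, :3])
```

Player names can be kept in a separate list of length 30 and used to label the rows and columns, which is the index-based bookkeeping mentioned above.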

Large z-score values

We are working on large telecom datasets. When we standardized the data, we got large z-scores, ranging from -0.xxx to 300 or 400!
These attributes have, for example, a min of 0 and a max of about 4,000,000.
Yes, some variables have outliers. Will this give good clustering results without dealing with the outliers?
Running PROC FASTCLUS with 8 clusters puts most observations into one cluster (the seventh has 1,600,000 observations), and there is also a cluster with just 1 observation.
What’s our problem?
https://medium.com/p/6b6056224c54/info?source=email-75f4ab5a8577-1529361861973-activity.response_created
Your variables likely are very skewed.
The use of z standardization on such variables is questionable. You probably should look into Box-Cox transformations, too.
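To see the effect of skew on z-scores, here is a small sketch on made-up data. It uses `log1p` instead of a fitted Box-Cox transform, which is a swapped-in simplification chosen because the attributes have a minimum of 0 and Box-Cox itself needs strictly positive inputs; the lognormal sample is only a stand-in for a skewed telecom attribute.

```python
import numpy as np

rng = np.random.RandomState(0)
# Made-up right-skewed attribute: mostly small values plus a few huge ones.
x = rng.lognormal(mean=3.0, sigma=2.5, size=100000)

def zscore(v):
    """Standardize to zero mean and unit standard deviation."""
    return (v - v.mean()) / v.std()

z_raw = zscore(x)            # standardize the raw, skewed values
z_log = zscore(np.log1p(x))  # compress the long tail first, then standardize

print("max z on raw values:  ", z_raw.max())
print("max z after log1p:    ", z_log.max())
```

On skewed data like this, the largest z-score after the transform is far smaller than on the raw values, so a handful of extreme observations no longer dominates the distance calculations used in clustering.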

Converting mixed data set to numerical data set

In my project I have to work with a mixed dataset (i.e. it has both categorical and numerical data). Is there any algorithm or method for converting categorical values to numerical values, so that my final dataset contains only numerical values? Can anyone please help me out?
(I'm doing my project in MATLAB.)
Use one-hot encoding.
But don't expect the results to be very good. There is a lot of meaning lost this way.
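For reference, one-hot encoding can be sketched as follows. This is written in Python with NumPy rather than MATLAB, purely as an illustration of the idea; the helper `one_hot` and the `colour`/`weight` columns are hypothetical.

```python
import numpy as np

def one_hot(values):
    """Return (categories, matrix) where matrix[i, k] is 1 if values[i] == categories[k]."""
    # np.unique sorts the distinct categories and gives each value its category index.
    categories, codes = np.unique(values, return_inverse=True)
    matrix = np.zeros((len(values), len(categories)), dtype=int)
    matrix[np.arange(len(values)), codes] = 1
    return categories, matrix

# Hypothetical mixed dataset: one categorical column and one numeric column.
colour = np.array(["red", "green", "blue", "green"])
weight = np.array([1.5, 2.0, 0.7, 3.1])

categories, encoded = one_hot(colour)
numeric_only = np.column_stack([encoded, weight])  # indicator columns followed by the numeric column

print(categories)
print(numeric_only)
```

Each category becomes its own 0/1 column, so no artificial ordering is imposed on the categories, but the number of columns grows with the number of distinct values.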

Is there an absolute column limit for Google's Charts?

I have finally gotten a column chart working for my data set. However, it only outputs fifteen columns, and the data set has 36 columns. It will output fifteen columns (or fewer if I limit the set to only items that are non-zero... but my boss wants all of the data shown) no matter what width the graph is set to.
Is there an absolute hard-coded column limit for graphs made by Google's Charts API, and if not, is there a way I can tell the graph to output everything?
I've just run into this myself, almost 7 years after the original problem report. Columns representing the right-side of my data are being silently un-drawn.
Let's look at the big picture. Somebody provides a charting library. They should be expected to show the data as best they can. In the case of a column chart, that would be to show the first and last columns, and then choose which intermediate columns to show based on an algorithm that takes available pixels into account. It would then let the user zoom in to see the full set of columns within the selected range. This gives the developer using the chart the freedom to show an unlimited amount of data and not have to worry that someday columns at the end are simply not drawn.
Google is already choosing to not print some of the column labels due to space constraints, so they're already halfway to understanding the big picture.
Nowhere in the documentation does it explain this truncation of columns due to space constraints, or for any other chart type that I've seen. But you sure can choose your background colors in great levels of detail.
If I had known this restriction going in, I would have chosen a different chart package and not wasted my time. My choices now are to break my "Lifetime" data into yearly graphs that fit in the available space, which is clunky as hell, or migrate to a different chart package. Thanks Google. :^(
P.S. I tried to post this as a comment to the OP, but after using SO for years I don't have enough points...