ELKI data generator and outliers - data-mining

I want to make a test for LOF, showing how well it handles the dense-sparse problem in a dataset. The ELKI data generator tutorial shows how to build a dataset from an XML file like this, with 4 clusters:
<dataset random-seed="1" test-model="1">
<cluster name="Dense" size="290">
<normal mean="0.5" stddev="0.2"/>
<normal mean="0.5" stddev="0.2"/>
<clip min="0 0" max="1 1"/>
</cluster>
<cluster name="Sparse" size="100">
<normal mean="0.25" stddev="0.05"/>
<normal mean="0.75" stddev="0.05"/>
<clip min="0 0" max="1 1"/>
</cluster>
<cluster name="Middle" size="100">
<normal mean="0.75" stddev="0.05"/>
<normal mean="0.75" stddev="0.05"/>
<clip min="0 0" max="1 1"/>
</cluster>
<cluster name="Noise" size="10" density-correction="50">
<uniform min="0" max="1"/>
<uniform min="0" max="1"/>
</cluster>
</dataset>
But how do I get hold of the outliers? The ELKI tool wants a minority label for the outliers in order to show a ROC AUC curve, and the file produced from the XML specification is just a file of points.
Should I make a plot, identify the outliers myself, append a yes/no to each point saying whether it is an outlier, and set the minority label to yes (the outliers)? Or is there an easier way?

ELKI will default to using the smallest class for evaluation. (You can configure evaluation differently!)
ELKI will issue a warning if the outliers are more than 5% of the data, since it is assumed that outliers are rare (they should be much less than 5%, actually).
So on your data set, ELKI should default to using "Noise" as outlier class.
In your configuration Noise should be 2% of the data set, so it should not warn. It should simply work out of the box.
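To illustrate what that evaluation does, here is a sketch (not ELKI code) of ROC AUC computed with the smallest class, "Noise", as the positive/outlier class. The labels come from the generator's cluster names; the scores would come from LOF. The data values below are illustrative assumptions.

```python
def roc_auc(labels, scores, positive="Noise"):
    """ROC AUC via the rank-sum (Mann-Whitney U) formulation."""
    pos = [s for l, s in zip(labels, scores) if l == positive]
    neg = [s for l, s in zip(labels, scores) if l != positive]
    # Count pairs where the outlier scores higher; ties count half.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = ["Dense"] * 4 + ["Noise"] * 2
scores = [0.1, 0.2, 0.3, 0.9, 0.8, 1.0]  # LOF-like outlier scores
print(roc_auc(labels, scores))  # 0.875
```

An AUC of 1.0 would mean every Noise point outranks every cluster point; 0.5 is random ranking.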

Related

XSLT -- detect if node has already been copied to the result tree

Using xsltproc to clean up input XML.
Think about a part number referencing a part description from random locations in the document. My XML input is poorly designed and it has part number references to part descriptions all over with no real pattern as to where they are located. Some references are text in elements, some in attributes, sometimes the attribute changes meaning depending on context. The attribute containing the part number does not have a consistent name, the name used alters depending on the value of other attributes. Maybe I could build a key selecting the dozen varying places containing part number but it would be a mess. I would also worry about inadvertently selecting the wrong items with complex patterns.
So my goal is to only copy the referenced part descriptions to the output document once (not all descriptions are referenced). I can insert tests in all of the various templates to detect the part number in context. The very simple solution would be to just test if it has already been copied to the result tree and not copy it again. But there is no way to track this?
Plan B is to copy it multiple times to the result tree and then do a second pass over the output document to remove the duplicates.
The use of temporal language in the question ("has already been") is a good clue that you're thinking about this the wrong way. In a declarative language, you shouldn't be thinking in terms of the order of processing.
What you're probably looking for is something like this:
<xsl:variable name="first-of-a-kind-part-references" as="node()*">
<xsl:for-each-group select="f:all-part-references(/)"
group-by="f:get-referenced-part(.)/@id">
<xsl:sequence select="current-group()[1]"/>
</xsl:for-each-group>
</xsl:variable>
and then when processing a part reference
<xsl:if test=". intersect $first-of-a-kind-part-references">
...
</xsl:if>

XSLT - filter out elements that are not x-referenced

I have developed a (semi-)identity transformation from which I need to filter out elements that are unused.
The source XML contains 2001 "zones". No more, no less.
It also contains any number of devices, which are placed in these zones.
One specific example source XML contains 8800 of these devices.
More than one device can be placed in the same zone.
Zone 0 is a "null zone", meaning that a device placed in this zone is currently unassigned.
This means that the number of real zones is 2000.
Simplified source XML:
<configuration>
<zones>
<zone id="0">
...
<zone id="2000"/>
</zones>
<devices>
<device addr="1">
<zone>1</zone>
</device>
...
<device addr="8800">
<zone>1</zone>
</device>
</devices>
</configuration>
The problem we have is that out of the 2000 usable zones, most often only roughly 200 of these contain one or more devices.
I need to whittle out unused zones. There are reasons for this which serve only to detract from the question at hand, so if you don't mind I will not elaborate here.
I currently have this problem solved, like so:
<xsl:for-each select="zones/zone[@id > 0]">
<xsl:if test="/configuration/devices/device[zone=current()/@id]">
<xsl:call-template name="Zone"/>
</xsl:if>
</xsl:for-each>
And this works.
But on some of the larger projects the transformation takes absolute ages.
That is because in pseudo code this translates to:
for each <zone> in <zones>
find any <device> in <devices> with reference to <zone>
if found
apply zone template
endif
endfor
With 2000 zones to iterate over - and each iteration triggering up to 8800 searches for a qualifying device - you can imagine this taking a very long time.
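The cost structure can be seen in a small Python sketch (illustrative only, not XSLT): rescanning all devices for every zone is quadratic, while building a one-time index of referenced zone ids makes each check O(1). The data shapes below are scaled-down assumptions mirroring the question.

```python
# Scaled-down stand-ins for the 2000 zones / 8800 devices in the question.
zones = list(range(1, 501))
devices = [{"addr": a, "zone": (a % 50) + 1} for a in range(1, 2201)]

# Quadratic version: every zone rescans all devices.
used_slow = [z for z in zones
             if any(d["zone"] == z for d in devices)]

# Indexed version: build the set of referenced zone ids once, then each
# zone check is a constant-time lookup. This is the idea behind xsl:key.
referenced = {d["zone"] for d in devices}
used_fast = [z for z in zones if z in referenced]

print(used_slow == used_fast, len(used_fast))  # True 50
```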
And to compound problems, libxslt provides no API for progress reporting. This means that for a long time our application will appear frozen while it imports and converts the customer XML.
I do have the option to write every zone unconditionally, and upon the application bootstrapping from our (output) XML, remove or ignore any zones that have no devices placed in them.
And it may turn out that this may be the only option I have.
The downside to this is that my output XML then contains a lot of zones that are not referenced.
That makes it a bit difficult to consolidate what we have in our configuration and what parts of it the application is actually using.
My question to you is:
Have I got other options that ensure that the output XML only contains used zones?
I am not averse to performing a follow-up XSLT conversion.
I was for instance thinking that it may be possible(?) to write an attribute used="false" to each <Zone> element in my output.
Then as I go over the devices, I find the relevant zone in my output XML (providing it is assigned / zone is non-zero) and change this to used="true".
Then follow up with a quick second transformation to remove all zones which have used="false".
But, can I reference my own output elements during an XSLT transformation and change its contents?
You said you have a kind of identity transformation so I would use that as the starting point, plus a key:
<xsl:key name="zone-ref" match="device" use="zone"/>
and an empty template
<xsl:template match="zones/zone[not(key('zone-ref', @id))]"/>
that prevents unreferenced zones from being copied.
Or, if there are other conditions, then e.g.
<xsl:template match="zones/zone[@id > 0 and not(key('zone-ref', @id))]"/>

Find entries with zero distance variance and recorded watts

I'm a cyclist and a programmer. During my rides, I record data to XML files using a phone-based GPS tracker and a power meter. After the ride, I use the power meter software to merge the data and then upload it to a website. The website shows highly inaccurate data for WR Watts (a weighted average, also known as normalized power, which by definition is higher than average power and lower than my maximum recorded watts). See http://ridewithgps.com/trips/4834566 (Export TCX History to get the file I'm referring to). Searching for /<Watts>\d{4,} returns no results.
Calories: 1809
Max Watts: 676
Avg. Watts: 213 (170 with 0s)
WR Power 23487
Work 1681 kJ
Max Speed: 26.2 mph
Avg. Speed: 16.6 mph
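For reference, normalized ("weighted average") power is commonly computed as the 4th root of the mean 4th power of a 30-second rolling average of the 1 Hz power samples (Coggan's formula). The site's "WR Power" may use a variant, so treat the window and exponent here as assumptions.

```python
def normalized_power(watts, window=30):
    """4th root of the mean 4th power of a rolling average (1 Hz samples)."""
    if len(watts) < window:
        window = len(watts)
    rolling = [sum(watts[i:i + window]) / window
               for i in range(len(watts) - window + 1)]
    return (sum(r ** 4 for r in rolling) / len(rolling)) ** 0.25

# For steady riding NP equals average power; surges pull it above it.
steady = [200] * 120
surgy = [100] * 60 + [300] * 60
print(normalized_power(steady))  # 200.0 (up to float rounding)
print(normalized_power(surgy))   # higher than the 200 W average
```

By this definition a WR Power of 23487 against a max of 676 W is impossible, which supports the suspicion that bad samples are inflating the calculation.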
Here are two sample readings from the tcx history file.
<Trackpoint>
<Time>2015-05-30T11:35:50Z</Time>
<Position>
<LatitudeDegrees>41.96306</LatitudeDegrees>
<LongitudeDegrees>-87.645939</LongitudeDegrees>
</Position>
<AltitudeMeters>177.7</AltitudeMeters>
<DistanceMeters>71.5</DistanceMeters>
<Cadence>67</Cadence>
<Extensions>
<TPX xmlns="http://www.garmin.com/xmlschemas/ActivityExtension/v2">
<Watts>104</Watts>
</TPX>
</Extensions>
</Trackpoint>
<Trackpoint>
<Time>2015-05-30T11:35:51Z</Time>
<Position>
<LatitudeDegrees>41.963076</LatitudeDegrees>
<LongitudeDegrees>-87.646094</LongitudeDegrees>
</Position>
<AltitudeMeters>178.0</AltitudeMeters>
<DistanceMeters>75.7</DistanceMeters>
<Cadence>67</Cadence>
<Extensions>
<TPX xmlns="http://www.garmin.com/xmlschemas/ActivityExtension/v2">
<Watts>156</Watts>
</TPX>
</Extensions>
</Trackpoint>
I've reviewed every entry for <Watts>\d*</Watts> where the corresponding Cadence was zero (if I'm not pedaling, watts should be zero).
:g/<Cadence>0/.,+3s/<Watts>[1-9]\d*/<Watts>0/c
But that did not resolve the issue. My next step is to find entries where the distance between trackpoints does not change and which contain a wattage greater than zero. This attempt returns E65: Illegal back reference:
:g/<DistanceMeters>\(\d*\)/+1,+15s/<DistanceMeters>\1/
CLARIFICATION:
I'm looking for locations where Watts must be zero and are not. These would be where I am not pedaling (Cadence = 0) and also when I am not moving (Consecutive distance nodes that are identical). I've already corrected the Wattage for cadence = 0, but don't know how to find consecutive <DistanceMeters>N</DistanceMeters> nodes where N is unchanged.
Given enough determination, it can be done:
:%s/\m<DistanceMeters>\([0-9.]\+\)<\/DistanceMeters>\n\(.*\n\)\{1,15}\s\+<DistanceMeters>\1<\/DistanceMeters>\n\(.*\n\)\{3}\s\+<Watts>\zs[1-9]\d*\ze<\/Watts>/0/gc
However, why on Earth would you want to do that in Vim?
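In the spirit of that remark, the same fix is a few lines outside Vim. Here is a Python sketch that zeroes <Watts> whenever a trackpoint's <DistanceMeters> equals the previous one (i.e. the bike did not move). It deliberately mirrors the regex-over-text approach; the element names come from the sample above, everything else is an assumption.

```python
import re

def zero_stationary_watts(tcx_text):
    points = re.findall(r"<Trackpoint>.*?</Trackpoint>", tcx_text, re.S)
    prev_dist, fixed = None, []
    for tp in points:
        m = re.search(r"<DistanceMeters>([\d.]+)</DistanceMeters>", tp)
        dist = m.group(1) if m else None
        if dist is not None and dist == prev_dist:
            tp = re.sub(r"<Watts>\d+</Watts>", "<Watts>0</Watts>", tp)
        prev_dist = dist
        fixed.append(tp)
    # Note: returns only the trackpoints, not the surrounding document.
    return "\n".join(fixed)

sample = ("<Trackpoint><DistanceMeters>75.7</DistanceMeters>"
          "<Watts>156</Watts></Trackpoint>\n"
          "<Trackpoint><DistanceMeters>75.7</DistanceMeters>"
          "<Watts>140</Watts></Trackpoint>")
print("<Watts>0</Watts>" in zero_stationary_watts(sample))  # True
```

A proper XML parser (with the Garmin extension namespace) would be more robust for production use, but for a one-off cleanup this matches the Vim workflow.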

Finding a correlation between variable and class variable

I have a dataset which contains 7 numerical attributes and one nominal attribute, which is the class variable. I was wondering how I can find the best attribute to use for predicting the class attribute. Would finding the attribute with the largest information gain be the solution?
So the problem you are asking about falls under the domain of feature selection, and more broadly, feature engineering. There is a lot of literature online regarding this, and there are definitely a lot of blogs/tutorials/resources online for how to do this.
To give you a good link I just read through, here is a blog with a tutorial on some ways to do feature selection in Weka, and the same blog's general introduction on feature selection. Naturally there are a lot of different approaches, as knb's answer pointed out.
To give a short description though, there are a few ways to go about it: you can assign a score to each of your features (like information gain, etc) and filter out features with 'bad' scores; you can treat finding the best parameters as a search problem, where you take different subsets of the features and assess the accuracy in turn; and you can use embedded methods, which kind of learn which features contribute most to the accuracy as the model is being built. Examples of embedded methods are regularization algorithms like LASSO and ridge regression.
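To make the first (filter) approach concrete, here is a pure-Python sketch that scores a discrete attribute by information gain against the class, on a toy dataset of my own invention. In practice you would use something like Weka's InfoGainAttributeEval rather than hand-rolling this.

```python
from math import log2
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(values, labels):
    """H(class) - H(class | attribute), for a discrete attribute."""
    n = len(labels)
    cond = 0.0
    for v in set(values):
        subset = [l for x, l in zip(values, labels) if x == v]
        cond += len(subset) / n * entropy(subset)
    return entropy(labels) - cond

# Toy data: attribute a1 perfectly predicts the class, a2 is noise.
labels = ["yes", "yes", "no", "no"]
a1 = ["hot", "hot", "cold", "cold"]
a2 = ["x", "y", "x", "y"]
print(information_gain(a1, labels))  # 1.0
print(information_gain(a2, labels))  # 0.0
```

Ranking attributes by this score and keeping the top ones is exactly the filter method described above; numeric attributes would first need discretization.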
Do you just want that attribute's name, or do you also want a quantifiable metric (like a t-value) for this "best" attribute?
For a qualitative approach, you can generate a classification tree with just one split, two leaves.
For example, with Weka's "diabetes.arff" sample dataset (n = 768), which has a similar structure to your dataset (all attributes numeric, but the class attribute has only two distinct categorical outcomes), I can set the minNumObj parameter to, say, 200. This means: create a tree with a minimum of 200 instances in each leaf.
java -cp $WEKA_JAR/weka.jar weka.classifiers.trees.J48 -C 0.25 -M 200 -t data/diabetes.arff
Output:
J48 pruned tree
------------------
plas <= 127: tested_negative (485.0/94.0)
plas > 127: tested_positive (283.0/109.0)
Number of Leaves : 2
Size of the tree : 3
Time taken to build model: 0.11 seconds
Time taken to test model on training data: 0.04 seconds
=== Error on training data ===
Correctly Classified Instances 565 73.5677 %
This creates a tree with one split on the "plas" attribute. For interpretation, this makes sense, because indeed, patients with diabetes have an elevated concentration of glucose in their blood plasma. So "plas" is the most important attribute, as it was chosen for the first split. But this does not tell you how important.
For a more quantitative approach, maybe you can use (Multinomial) Logistic Regression. I'm not so familiar with this, but anyway:
In the Explorer GUI tool, choose "Classify" > Functions > Logistic.
Run the model. The odds ratios and the coefficients might contain what you need in a quantifiable manner. A lower odds ratio (but > 0.5) is better/more significant, but I'm not sure; maybe read up on interpreting logistic regression output.
java -cp $WEKA_JAR/weka.jar weka.classifiers.functions.Logistic -R 1.0E-8 -M -1 -t data/diabetes.arff
Here's the command line output
Options: -R 1.0E-8 -M -1
Logistic Regression with ridge parameter of 1.0E-8
Coefficients...
Class
Variable tested_negative
============================
preg -0.1232
plas -0.0352
pres 0.0133
skin -0.0006
insu 0.0012
mass -0.0897
pedi -0.9452
age -0.0149
Intercept 8.4047
Odds Ratios...
Class
Variable tested_negative
============================
preg 0.8841
plas 0.9654
pres 1.0134
skin 0.9994
insu 1.0012
mass 0.9142
pedi 0.3886
age 0.9852
=== Error on training data ===
Correctly Classified Instances 601 78.2552 %
Incorrectly Classified Instances 167 21.7448 %

Sitecore multivariate testing: how are values calculated?

I've created a multivariate test, but the values are stuck at zero. To make these values go up in accordance with my users' behavior when they convert, I have to do something I haven't done yet, and no documentation I can find describes just what this is.
I've tried configuring a goal with a rule of "item ID is equal to (some goal page's item ID)" and set the points value to e.g. 1 or 5. But when I visit the page in a browser, the values on my test stay stubbornly stuck at zero.
Is there something I'm not doing?
This is most likely caused by the Test Statistics cache, which is set to a 1 hour expiration time by default. To see your results instantly, change the setting in the web.config to:
<setting name="WebEdit.TestStatisticsCacheExpiration" value="00:00:00" />