OPTICSXi - ELKI ResultWriter - data-mining

I'm using ELKI to cluster, in a hierarchical way, a dataset of geolocations using OPTICSXi.
The result of the execution of the algorithm is a set of files.
The content of a file could be:
# Cluster: nameOfCluster
# OPTICSModel
# Parents: nameOfParents (this element doesn't exist for the root cluster)
# Children: nameOfChild_0, nameOfChild_1 ... nameOfChild_n, (optional)
ID=1 lat0 lon0 reachability=?
ID=3062 lat1 lon1 reachability=1.30972586 predecessor=1
ID=7383 lat2 lon2 reachability=2.56784445 predecessor=3062
ID=42839 lat3 lon3 reachability=4.05510623 predecessor=1
I don't understand whether the elements in each file (four in the example) belong to the same cluster or could belong to different clusters. In the latter case, do I need to write code that builds the clusters (for example, by looking at the predecessor of each node), or are there parameters I can specify in ELKI to obtain each single cluster?

By default, ELKI produces a directory with one file per cluster, unless the output file already exists, in which case all clusters are written into that same file, separated by comments as seen above.
With a hierarchical result, such as that of OPTICSXi, you should however also treat all members of the child clusters as part of the parent. These are clusters nested in the parent. They are not repeated in the parent's output, to reduce redundancy.
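To make the nesting concrete, here is a minimal Python sketch that collects the full membership of a cluster by following the "# Children:" header. It assumes the file layout shown in the question; the cluster names, coordinates, and the helper functions parse_cluster and all_members are made up for illustration and are not part of ELKI.

```python
import re

def parse_cluster(text):
    """Parse one cluster block into (name, children, ids)."""
    name, children, ids = None, [], []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("# Cluster:"):
            name = line.split(":", 1)[1].strip()
        elif line.startswith("# Children:"):
            # Tolerate the trailing comma seen in the example output.
            parts = line.split(":", 1)[1].split(",")
            children = [p.strip() for p in parts if p.strip()]
        elif line.startswith("ID="):
            ids.append(int(re.match(r"ID=(\d+)", line).group(1)))
    return name, children, ids

def all_members(name, clusters):
    """IDs listed in the cluster itself plus, recursively, those of its children."""
    children, ids = clusters[name]
    members = set(ids)
    for child in children:
        members |= all_members(child, clusters)
    return members

# Hypothetical toy blocks, one per output file.
blocks = [
    "# Cluster: root\n# OPTICSModel\n# Children: a, b,\n"
    "ID=1 45.0 9.0 reachability=?\n"
    "ID=3062 45.1 9.1 reachability=1.3 predecessor=1\n",
    "# Cluster: a\n# OPTICSModel\n# Parents: root\n# Children: c,\n"
    "ID=7383 45.2 9.2 reachability=2.5 predecessor=3062\n",
    "# Cluster: b\n# OPTICSModel\n# Parents: root\n"
    "ID=42839 45.3 9.3 reachability=4.0 predecessor=1\n",
    "# Cluster: c\n# OPTICSModel\n# Parents: a\n"
    "ID=5 45.4 9.4 reachability=0.5 predecessor=7383\n",
]

clusters = {}
for block in blocks:
    name, children, ids = parse_cluster(block)
    clusters[name] = (children, ids)

root_members = all_members("root", clusters)
a_members = all_members("a", clusters)
```

With this toy input, the root ends up containing every ID, while cluster a contains only its own point and the point of its child c.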
Compare the output of OPTICSXi to the OPTICS output. What the Xi approach does is split the data for you, based on sudden drops in reachability distance. All clusters of Xi should be subsequences of the original OPTICS cluster order.
In your case, you may have chosen minPts too small if your cluster has just 4 elements. (Although you may have truncated the file, or you may have a lot of elements in child clusters, so the output may be fine.)
Also note that you will usually want to check whether the first element(s) of your cluster should belong to the cluster or not; similarly for the last elements. OPTICSXi tends to err on the first elements, but not in a systematic way that would be trivial to fix. The first and last elements are those that bridge the gap from one cluster to another. You really should verify these manually (which is a good reason not to choose minPts too small).
I strongly recommend building or using a visualization for your specific use case. Then you could just load such a cluster into your visualization and visually inspect whether the result makes sense to you. I have used OPTICSXi on geographic data, and that worked very well for me.

So, if I've understood correctly, in the example above the cluster is composed of the elements
ID=1, ID=3062, ID=7383, ID=42839, and all the elements in nameOfChild_0, nameOfChild_1 ... nameOfChild_n.
Maybe I shouldn't join the children into the root element, because I guess I would obtain a single big cluster containing all my geolocations; in fact, I have 903 child elements and 18795 nodes (IDs).
I've done a lot of tests, choosing minPts = {2,5,10} and xi = {0.1, 0.01, 0.001, 0.0001, 0.00001, 0.000001}. I use a visualization of my clusters, but I can't find a good result. I'm having a lot of trouble.
Thanks to your reply, I've understood that I split my elements too much: I treated each file as a cluster, so I did not count the child elements as part of the parent and instead considered them separate clusters.
Moreover, I noticed that the first and the last elements are sometimes wrong. I thought of checking whether these elements are the predecessor of at least one element in the cluster, or whether at least one element in the cluster is a predecessor of them. Does this make sense?

How do I query Prometheus for the timeseries that was updated last?

I have 100 instances of a service that use one database. I want them to export a Prometheus metric with the number of rows in a specific table of this database.
To avoid hitting the database with 100 queries at the same time, I periodically elect one of the instances to do the measurement and set a Prometheus gauge to the number obtained. Different instances may be elected at different times. Thus, each of the 100 instances may have its own value of the gauge, but only one of them is “current” at any given time.
What is the best way to pick only this “current” value from the 100 gauges?
My first idea was to export two gauges from each instance: the actual measurement and its timestamp. Then perhaps I could take max(timestamp) and combine it with the actual metric using the and operator. But I can’t figure out how to do this in PromQL, because max will erase the instance label that I could match on.
My second idea was to reset the gauge to −1 (some sentinel value) at some time after the measurement. But this looks brittle, because if I don’t synchronize everything tightly, the “current” gauge could be reset before or after the “new” one is set, causing gaps or overlaps. Similar considerations go for explicitly deleting the metric and for exporting it with an explicit timestamp (to induce staleness).
I figured out the first idea (not tested yet):
avg(my_rows_count and on(instance) topk(1, my_rows_count_timestamp))
avg could just as well be max or min; it only serves to drop the instance label from the final result.
last_over_time should do the trick:
last_over_time(my_rows_count[1m])
given that only one of them is “current” at any given time, as you said.
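To spell out what the topk / and on(instance) query is meant to select, here is the same logic in plain Python (not PromQL). The instance names and numbers are hypothetical; the two dictionaries stand in for the my_rows_count and my_rows_count_timestamp gauges.

```python
def current_value(values, timestamps):
    """Return the gauge value of the instance with the newest timestamp."""
    # Equivalent in spirit to topk(1, my_rows_count_timestamp).
    latest_instance = max(timestamps, key=timestamps.get)
    # Equivalent in spirit to "my_rows_count and on(instance) ...".
    return values[latest_instance]

# Hypothetical scrape: three instances, each with its last measurement.
rows = {"svc-1": 1200, "svc-2": 1350, "svc-3": 1290}
measured_at = {"svc-1": 1700000000, "svc-2": 1700000600, "svc-3": 1700000300}

latest = current_value(rows, measured_at)
```

Here svc-2 measured most recently, so its value is the one reported.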

Slow insertion using Neptune and Gremlin

I'm having problems with insertion into Neptune using Gremlin.
I am trying to insert many nodes and edges, potentially hundreds of thousands, while checking for existence.
Currently, we are using inject to insert the nodes, and the problem is that it is slow.
After running the explain command, we figured out that the problem was the coalesce and where steps: they take more than 99.9% of the run duration.
I want to insert each node and edge only if it doesn’t already exist, which is why I am using the coalesce and where steps.
For example, this is the query we use to insert nodes with inject:
properties_list = [{'uid': '1642'}, {'uid': '1322'}, …]
g.inject(properties_list).unfold().as_('node')
 .sideEffect(__.V().where(P.eq('node')).by('uid').fold()
 .coalesce(__.unfold(), __.addV(label).property(Cardinality.single, 'uid', '1')))
With 1000 nodes in the graph and properties_list with 100 elements, running the query above takes around 30 seconds, and it gets slower as the number of nodes in the graph increases.
Running a naive injection with the same environment as the query above, without coalesce and where, takes less than 1 second.
I’d like to hear your suggestions and to know what the best practices are for inserting many nodes and edges (while checking for existence).
Thank you very much.
If you have a set of IDs that you want to check for existence, you can speed up the query significantly by also providing just a list of IDs to the query and calculating up front which of them already exist. Then, having calculated the set that needs updates, you can apply them in one go. This will make a big difference. The reason you are running into problems is that the mid-traversal V has a lot of work to do. In general it would be better to use actual IDs rather than properties (uid in your case). If that is not an option, the same technique will work for property-based IDs. The steps are:
1. Using inject or sideEffect, pass in the IDs to be found as one list, and the changes to be conditionally applied as a separate map.
2. Split the IDs into the ones that exist and those that do not.
3. Using the set of non-existing ones, apply the updates, using the values in the set to index into your map.
Here is a concrete example. I used the graph-notebook for this but you can do the same thing in code:
Given:
ids = "['1','2','9998','9999']"
and
data = "[['id':'1','value':'XYZ'],['id':'9998','value':'ABC'],['id':'9999','value':'DEF']]"
we can do something like this:
g.V().hasId(${ids}).id().fold().as('exist').
constant(${data}).
unfold().as('d').
where(without('exist')).by('id').by()
which correctly finds the ones that do not already exist:
{'id': 9998, 'value': 'ABC'}
{'id': 9999, 'value': 'DEF'}
You can use this pattern to construct your conditional inserts a lot more efficiently (I hope :-) ). So to add the new vertices you might do:
g.V().hasId(${ids}).id().fold().as('exist').
constant(${data}).
unfold().as('d').
where(without('exist')).by('id').by().
addV('test').
property(id,select('d').select('id')).
property('value',select('d').select('value'))
v[9998]
v[9999]
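The filtering that the where(without('exist')) step performs is a set difference. As a rough illustration of that logic outside Gremlin, here is a plain Python sketch; the function name missing_records and the in-memory stand-in for the graph are assumptions for the example only.

```python
def missing_records(existing_ids, data):
    """Keep only the records whose id is not already present."""
    existing = set(existing_ids)
    return [record for record in data if record["id"] not in existing]

# Stand-in for the graph: vertices 1 and 2 already exist.
graph_ids = {"1", "2"}

ids = ["1", "2", "9998", "9999"]
data = [
    {"id": "1", "value": "XYZ"},
    {"id": "9998", "value": "ABC"},
    {"id": "9999", "value": "DEF"},
]

# Step 1: one lookup for all candidate ids (g.V().hasId(ids) in the query).
exist = [i for i in ids if i in graph_ids]
# Step 2: records that still need to be created.
to_add = missing_records(exist, data)
```

The point of the pattern is that the existence check happens once for the whole batch instead of once per record.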
As a side note, we are adding two new steps to Gremlin, mergeV and mergeE, that will allow this to be done much more easily and in a more declarative style. Those new steps should be part of the TinkerPop 3.6 release.

Classify K-means in Text Mining

The goal is to create a computer-generated news site that aggregates headlines from different news sources around the world.
Looking at the centroid table results, I want to understand the following:
https://ibb.co/n1mvnbk
I used K=5
and I am using TF-IDF.
What do those numbers mean?
When an attribute is zero in multiple clusters, what does it mean?
When I sort the centroid table by each cluster in descending order, I find some words or attributes that have a higher value in this cluster and zero values in the other clusters. Does this mean that these words occur more or less frequently in this cluster?
How can I discuss the clustering model?
Do all the clusters make sense, and why?
Do you think k=5 is a good choice for this dataset, or do I need to choose 3? How can I decide that?
I believe K=5 denotes the number of clusters you are looking for in the current dataset. On that basis, 5 centroids will be placed, and the data will be grouped around them.
Do you think k=5 is a good choice for this dataset? It's hard to predict this way. It is all done by mathematical combination and permutation.
You might use the Elbow Method to identify the right number of clusters for a given dataset. This method is based on WCSS (Within-Cluster Sum of Squares), which sums the squared distances between the points and the centroid of their cluster.
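For reference, WCSS itself is simple to compute once you have a cluster assignment. The following is a minimal Python sketch on made-up 2-D points (not the news dataset); the function name wcss and the toy numbers are illustrative assumptions.

```python
def wcss(points, labels, centroids):
    """Sum of squared distances from each point to its cluster centroid."""
    total = 0.0
    for point, label in zip(points, labels):
        centroid = centroids[label]
        total += sum((p - c) ** 2 for p, c in zip(point, centroid))
    return total

# Toy data: two tight groups of two points each.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]

# k = 1: everything assigned to the overall mean.
wcss_k1 = wcss(points, [0, 0, 0, 0], [(5.0, 1.0)])

# k = 2: one centroid per group.
wcss_k2 = wcss(points, [0, 0, 1, 1], [(0.0, 1.0), (10.0, 1.0)])
```

The elbow method plots this value for increasing k and looks for the point where the drop flattens out; on this toy data the drop from k = 1 to k = 2 is large (104.0 to 4.0).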
Those numbers are the average tf-idf values of the cluster. So a 0 means that the word does not occur in the cluster, and the highest-valued words are the most characteristic words for the cluster.
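As a small illustration of how such a centroid row comes about, here is a Python sketch that averages the tf-idf vectors of the documents in one cluster. The vocabulary and the numbers are invented for the example.

```python
def centroid(vectors):
    """Component-wise mean of the tf-idf vectors in one cluster."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

# Hypothetical vocabulary and two documents assigned to the same cluster.
vocabulary = ["trump", "football", "election"]
cluster_docs = [
    [0.5, 0.0, 0.25],
    [1.0, 0.0, 0.75],
]

center = centroid(cluster_docs)

# Words sorted by their centroid weight, most characteristic first.
ranked = sorted(zip(vocabulary, center), key=lambda pair: pair[1], reverse=True)
```

Since tf-idf values are non-negative, an average of 0 for "football" can only arise when no document in the cluster contains that word.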
Note that for text you'll want to use spherical k-means rather than regular k-means.
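The core difference in spherical k-means is that vectors are compared by direction (cosine similarity) rather than by Euclidean distance, so document length does not matter. A minimal sketch of the assignment step, with made-up vectors and hypothetical helper names:

```python
import math

def normalize(vector):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm > 0 else list(vector)

def nearest_centroid(doc, centroids):
    """Index of the centroid with the highest cosine similarity to doc."""
    unit_doc = normalize(doc)
    similarities = [
        sum(a * b for a, b in zip(unit_doc, normalize(c))) for c in centroids
    ]
    return max(range(len(centroids)), key=similarities.__getitem__)

centroids = [[1.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

short_doc = [3.0, 4.0, 0.0]
long_doc = [30.0, 40.0, 0.0]   # same direction, ten times the length
other_doc = [0.0, 0.0, 5.0]

assignments = [nearest_centroid(d, centroids) for d in (short_doc, long_doc, other_doc)]
```

A full spherical k-means would also renormalize the centroids after each update step; that part is omitted here.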
Choosing k is a big problem. Forget the elbow method; it never works except for toy examples. Experiment with different k and choose the one that is most convincing or most useful. None of the usual heuristics for choosing the k in k-means will work here, I fear (VRC is IMHO the best). The main reason is that the data cannot be well partitioned into k clusters. There is no reason to assume there are exactly k topics in the world, nor that every document contains only one topic. Instead, the topics will form a complex structure themselves. For example, there is Trump, but there is also the Trump Erdogan meeting, and there is the impeachment. These are not disjoint. But you will also have articles that don't fit into any of these topics. This leads to the effect that the true best k would likely be very, very large, as large as the number of articles (and hence not useful).

Scaling down spanner nodes

What are the limitations / considerations in scaling down spanner nodes? Since there is a tight coupling of nodes to data stored - is it fair to say that it is highly scalable but not elastic? The following is a quote from the quizlet case study on GCP website...
"it might be impossible to reduce the number of nodes on your database, even if you previously ran the database with that number of nodes."
The word "might" needs some expanding
To expand on the "might": we restrict the reduction of nodes so that the instance stays within a limit of 2 TB per node. You can scale up and down, as long as the downsizing doesn't cross that threshold.
Hope this helps!
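As a back-of-the-envelope check, the per-node storage limit quoted above translates into a minimum node count as sketched below. The constant is simply the figure from this answer and may not match current Spanner limits; the function names are illustrative.

```python
import math

# Figure quoted in the answer above; treat it as an assumption.
STORAGE_LIMIT_TB_PER_NODE = 2

def min_nodes(stored_tb):
    """Smallest node count that keeps storage per node within the limit."""
    return max(1, math.ceil(stored_tb / STORAGE_LIMIT_TB_PER_NODE))

def can_scale_down(stored_tb, target_nodes):
    """True if the target node count still satisfies the storage limit."""
    return target_nodes >= min_nodes(stored_tb)

# Example: 5 TB stored needs at least 3 nodes under this limit.
needed = min_nodes(5)
allowed_3 = can_scale_down(5, 3)
allowed_2 = can_scale_down(5, 2)
```

So an instance holding 5 TB could go down to 3 nodes but not to 2, regardless of how it was sized in the past.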
One thing we would recommend in order to scale down effectively is deleting unused data (databases, tables, global indexes, rows, etc.). This data will be cleaned up within ~7 days, potentially allowing you to run with lower node counts.

Face Recognition Using Backpropagation Neural Network?

I'm very new to image processing, and my first assignment is to make a working program that can recognize faces and their names.
So far, I have successfully built a project that detects a face, crops the detected image, applies a Sobel filter to it, and converts the result to an array of floats.
But I'm very confused about how to implement a backpropagation MLP that learns from the image so it can recognize the correct name for the detected face.
I would be honored if the experts on Stack Overflow could give me some examples of how to feed the image array to a backpropagation network for learning.
It is a standard machine learning algorithm. You have a number of arrays of floats (instances in ML terms, or observations in statistics terms) and corresponding names (labels, class tags), one per array. This is enough for use in most ML algorithms. Specifically in an ANN, the elements of your array (i.e. the features) are the inputs of the network, and the labels (names) are its outputs.
If you are looking for a theoretical description of backpropagation, take a look at Stanford's ml-class lectures (ANN section). If you need a ready implementation, read this question.
You haven't specified what the elements of your arrays are. If you use just the pixels of the original image, this should work, but not very well. If you need a production-level system (though still using an ANN), try to extract higher-level features (e.g. Haar-like features, which OpenCV uses itself).
Have you tried writing your feature vectors to an ARFF file and feeding them to Weka, just to see if your approach might work at all?
Weka has a lot of classifiers integrated, including MLP.
From what I understand so far, I suspect the features and the classifier you have chosen will not work.
To your original question: have you made any attempts to implement a neural network on your own? If so, where did you get stuck? Note that this is not the place to request a complete working implementation from the audience.
To provide a general answer on a general question:
Usually you have nodes in an MLP: specifically input nodes, output nodes, and hidden nodes. These nodes are strictly organized in layers, with the input layer at the bottom, the output layer on top, and the hidden layers in between. The nodes are connected in a simple feed-forward fashion (output connections are allowed to the next higher layer only).
Then you connect each of your floats to a single input node and feed the feature vectors to your network. For backpropagation you need to supply an error signal that you specify for the output nodes. So if you have n names to distinguish, you may use n output nodes (i.e. one for each name). Make them, for example, return 1 in case of a match and 0 otherwise. You could also use a single output node and let it return n different values for the names. It would probably even be best to use n completely separate perceptrons, i.e. one for each name, to avoid some side effects (catastrophic interference).
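To give a feel for the mechanics, here is a deliberately tiny pure-Python sketch of one hidden layer trained with backpropagation, using one output node per name. It is a toy under stated assumptions (sigmoid units, squared error, no bias terms, two made-up 4-element feature vectors, arbitrary layer sizes and learning rate), not a face recognizer.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(x, w_hidden, w_out):
    """Feed-forward pass: inputs -> hidden layer -> output layer."""
    hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in w_hidden]
    out = [sigmoid(sum(w * h for w, h in zip(row, hidden))) for row in w_out]
    return hidden, out

def train_step(x, target, w_hidden, w_out, lr=0.5):
    """One backpropagation update; returns the squared error before the update."""
    hidden, out = forward(x, w_hidden, w_out)
    # Error signal at the output nodes (squared error times sigmoid derivative).
    delta_out = [(o - t) * o * (1 - o) for o, t in zip(out, target)]
    # Error signal propagated back to the hidden nodes (uses the old weights).
    delta_hidden = [
        h * (1 - h) * sum(delta_out[k] * w_out[k][j] for k in range(len(w_out)))
        for j, h in enumerate(hidden)
    ]
    # Gradient descent on both weight layers.
    for k, row in enumerate(w_out):
        for j in range(len(row)):
            row[j] -= lr * delta_out[k] * hidden[j]
    for j, row in enumerate(w_hidden):
        for i in range(len(row)):
            row[i] -= lr * delta_hidden[j] * x[i]
    return sum((o - t) ** 2 for o, t in zip(out, target)) / 2

random.seed(0)
n_in, n_hidden, n_out = 4, 3, 2
w_hidden = [[random.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_hidden)]
w_out = [[random.uniform(-0.5, 0.5) for _ in range(n_hidden)] for _ in range(n_out)]

# Two hypothetical feature vectors, each with a 1-of-n target (one node per name).
samples = [
    ([1.0, 0.0, 1.0, 0.0], [1.0, 0.0]),
    ([0.0, 1.0, 0.0, 1.0], [0.0, 1.0]),
]

losses = []
for epoch in range(2000):
    losses.append(sum(train_step(x, t, w_hidden, w_out) for x, t in samples))
```

In a real setup the input layer would have one node per element of your float array, and you would hold out part of the data to evaluate the trained network.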
Note, that the output of each node is a number, not a name. Therefore you need to use some sort of thresholds, to get a number-name relation.
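One simple way to get that number-name relation, sketched in Python with hypothetical names and an arbitrary threshold of 0.5: encode each name as a 1-of-n target for training, and at prediction time pick the strongest output node, rejecting the face as unknown if even that node is below the threshold.

```python
def one_hot(name, names):
    """Training target: 1.0 for the matching output node, 0.0 elsewhere."""
    return [1.0 if candidate == name else 0.0 for candidate in names]

def decode(outputs, names, threshold=0.5):
    """Map network outputs back to a name, or None if no node is confident."""
    best = max(range(len(outputs)), key=outputs.__getitem__)
    return names[best] if outputs[best] >= threshold else None

names = ["alice", "bob", "carol"]   # hypothetical people to recognize

target_for_bob = one_hot("bob", names)
recognized = decode([0.1, 0.8, 0.3], names)
unknown = decode([0.2, 0.3, 0.1], names)
```

The threshold is a tuning parameter; a higher value gives fewer wrong names at the cost of more faces reported as unknown.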
Also note that you need a lot of training data to train a large network (because of the curse of dimensionality). It would be interesting to know the size of your float array.
Indeed, for a complex decision you may need a larger number of hidden nodes or even more hidden layers.
Further note that you may need to do a lot of evaluation (e.g. cross-validation) to find the optimal configuration (number of layers, number of nodes per layer), or even to find any working configuration at all.
Good luck, anyway!