I have a dataset from which I want to predict whether a patient will get the disease. The graph below is the first step after:
scaling the 3 features
running scikit-learn's PCA routine
The original dataset has 25 features, but for our exercise we were asked to use only 3. These 3 features are then reduced to 2 through PCA.
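For reference, a minimal sketch of those two preprocessing steps in scikit-learn, assuming X holds the 3 selected feature columns (the variable names here are illustrative):

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# scale the 3 features, then project them down to 2 principal components
X_scaled = StandardScaler().fit_transform(X)
X_2d = PCA(n_components=2).fit_transform(X_scaled)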
When I look at this data, my first response is that the green dots (those not prone to kidney disease) are not separable.
Is my assumption correct?
Maybe your PCA is not yielding the best features to separate your data. I would suggest using something like Random Forest or XGBoost, where you can see feature importance easily, and then using the best 3 features to try to classify the data.
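A minimal sketch of that suggestion, assuming X is a pandas DataFrame with the 25 scaled features and y the binary disease labels (both hypothetical names):

import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# fit a forest and rank the features by impurity-based importance
clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(X, y)
importances = pd.Series(clf.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False).head(3))  # keep the top 3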
It is not possible to separate the dots.
I am new to SageMaker. I have a large csv dataset which I would like labelled:
sentence_id    sentence            pre_agreed_label
148392         A sentence          0
383294         Another sentence    1
For each sentence, I would like a) a yes/no binary classification in response to a question, and b) on a scale of 1-3, how obvious the classification was. I need the sentence id to map to other parts of the dataset, and will use the pre-agreed labels to assess accuracy.
I have identified SageMaker GroundTruth labelling jobs as a possible way to do this. Is this the best way? In trying to set it up I have run into a few problems.
The first problem is I can't find a way to display only the sentence column to the labellers, hiding the sentence_id and pre_agreed_labels.
The second is that there is either single labelling or multi labelling, but I would like a way to have two sets of single-selection labels:
Select one for binary classification:
Yes
No
Select one for difficulty of classification:
Easy
Medium
Hard
It seems as though this can be done using custom HTML, but I don't know how to do this; the template it gives you doesn't even render.
Finally, having not used Mechanical Turk before, are there ways of ensuring people take the work seriously and don't just select random answers? I can see there's an option to have x number of people answer the same question, but is there also a way to put in an obvious question to which we already have a pre_agreed_label every nth question, and kick people off the task if they get it wrong? There also appears to be a maximum of $1.20 per task, which seems odd.
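For the gold-question idea, a hedged sketch of the downstream check, assuming the labelling results can be exported to a CSV with hypothetical worker_id, sentence_id, and answer columns:

import pandas as pd

answers = pd.read_csv('worker_answers.csv')  # hypothetical export of the labelling job
gold = pd.read_csv('dataset.csv')[['sentence_id', 'pre_agreed_label']]

# score each worker against the pre-agreed labels on the planted questions
merged = answers.merge(gold, on='sentence_id')
merged['correct'] = merged['answer'] == merged['pre_agreed_label']
accuracy = merged.groupby('worker_id')['correct'].mean()
print(accuracy[accuracy < 0.8])  # workers to exclude; the 0.8 threshold is arbitrary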
I want to do a graph by group. For histograms, that can be done with the option hist ..., by(group). Unfortunately, Stata will attempt to fit all groups on the same figure, which becomes unreadable when there are many groups.
I'm looking for a solution that will allow me to fix the number of subplots per figure and create multiple figures, but I'd also be open to alternative solutions that are "industry standard". I'll try to make the solution work for a multitude of figure types, not only hist, and I appreciate approaches that don't rely on additional packages.
Here's a sample dataset, where -- for example -- I'd like to fix that no more than 4 groups are shown per figure.
sysuse educ99gdp
hist public, by(country)
I don't have any fixed idea on how exactly this should look; it should be scalable to 30-50 categorical values, though. So, grouping the countries and showing multiple histograms in the same subplot works fine when there are only 10, not so fine when there are 30.
One suggested solution - but I'm happy to have it another way that is more Stata-ish - would be to have 3 separate figures with 4 subplots on each, where the last figure would have two empty slots.
The goal is to create a computer-generated news site that aggregates headlines from different news sources around the world:
Taking a look at the centroid table results, I want to understand the following:
https://ibb.co/n1mvnbk
I used K=5
and I am using TF-IDF
What do those numbers mean?
When an attribute is zero in multiple clusters, what does it mean?
When I sort the centroid table by each cluster at a descending order, I find some words or attributes that have a higher value with this cluster while zero values in other clusters. Does this mean that these words occur more or less frequently in this cluster?
How can I discuss the clustering model?
Do all the clusters make sense, and why?
Do you think k=5 is a good choice for this dataset, or do I need to choose 3? How can I decide that?
I believe K=5 denotes the number of clusters you are looking for in the current dataset. On that basis, 5 centroids will be placed and the data will be grouped around them.
Do you think k=5 is a good choice for this dataset? It is hard to tell just by looking; it comes down to trying different mathematical combinations and permutations.
You might use the elbow method to identify the correct number of clusters for any given dataset. This methodology is based on WCSS (Within-Cluster Sum of Squares), which sums the squared distances between each point and its cluster centroid.
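A minimal sketch of the elbow method, assuming X is the TF-IDF matrix already built for the documents:

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# compute WCSS (KMeans exposes it as inertia_) for a range of k
wcss = []
ks = range(1, 11)
for k in ks:
    wcss.append(KMeans(n_clusters=k, random_state=0).fit(X).inertia_)

plt.plot(ks, wcss, marker='o')
plt.xlabel('k')
plt.ylabel('WCSS')
plt.show()  # look for the 'elbow' where the curve flattens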
Those numbers are the average tf-idf values of the cluster. So a 0 means that the word does not occur in that cluster's documents, and the highest-valued words are the most characteristic words of the cluster.
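A sketch of how to read the centroid table programmatically, assuming vectorizer is the fitted TfidfVectorizer and km the fitted k-means model (names are illustrative; get_feature_names_out requires a recent scikit-learn):

import numpy as np

terms = np.array(vectorizer.get_feature_names_out())
for i, centroid in enumerate(km.cluster_centers_):
    top = np.argsort(centroid)[::-1][:10]  # indices of the highest average tf-idf values
    print(f'cluster {i}:', terms[top])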
Note that for text you'll want to use spherical k-means rather than regular k-means.
Choosing k is a big problem. Forget the elbow method; it never works except on toy examples. Experiment with different k and choose the one that is most convincing or most useful. None of the usual heuristics for choosing the k in k-means will work here, I fear (VRC is IMHO the best). The main reason is that the data cannot be well partitioned into k clusters. There is no reason to assume there are exactly k topics in the world, nor that every document contains only one topic. Instead, the topics themselves form a complex structure. For example, there is Trump, but there is also the Trump-Erdogan meeting, and there is the impeachment. These are not disjoint. But you will also have articles that don't fit into any of these topics. The effect is that the true best k would likely be very, very large, perhaps as large as the number of articles (and hence not useful).
I am working on a multilabel text classification problem with 10 labels.
The dataset is small, +-7000 items and +-7500 labels in total. I am using Python scikit-learn, and something strange came up in the results. As a baseline I started out with the CountVectorizer, and was actually planning on using the TfidfVectorizer, which I thought would work better. But it doesn't: with the CountVectorizer I get an F1 score about 0.1 higher (0.76 vs 0.65).
I cannot wrap my head around why this could be the case.
There are 10 categories, and one is called miscellaneous. This one in particular gets much lower performance with tf-idf.
Does anyone know when tfidf could perform worse than count?
The question is, why not? They are simply different solutions.
What is your dataset, how many words does it have, how is it labelled, and how do you extract your features?
CountVectorizer simply counts the words; if it does a good job, so be it.
There is no reason why IDF would give more information for a classification task. It performs well for search and ranking, but classification needs to gather similarities, not singularities.
IDF is meant to spot the singularity between one sample and the rest of the corpus, whereas what you are looking for is the singularity between one sample and the other clusters. IDF smooths out the intra-cluster TF similarity.
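A minimal sketch of the comparison, assuming texts and labels hold your documents and targets (a single-label setup is shown for brevity; for the multilabel case, wrap the classifier in OneVsRestClassifier):

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# cross-validate the same classifier with both vectorizers
for vec in (CountVectorizer(), TfidfVectorizer()):
    pipe = make_pipeline(vec, LogisticRegression(max_iter=1000))
    scores = cross_val_score(pipe, texts, labels, scoring='f1_macro', cv=5)
    print(type(vec).__name__, scores.mean())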
I have generated the vectors for a list of tokens from a large document using word2vec. Given a sentence, is it possible to get the vector of the sentence from the vectors of the tokens in the sentence?
There are different methods to get sentence vectors:
Doc2Vec: you can train your dataset using Doc2Vec and then use the sentence vectors.
Average of Word2Vec vectors: you can just take the average of all the word vectors in a sentence. This average vector will represent your sentence vector.
Average of Word2Vec vectors with TF-IDF: this is one of the best approaches, and the one I would recommend. Take the word vectors, multiply each by its TF-IDF score, and take the average; the result will represent your sentence vector. A sketch of the last two options follows this list.
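A sketch of options 2 and 3, assuming model is a trained gensim Word2Vec model and tfidf a dict mapping each word to its TF-IDF score (both hypothetical):

import numpy as np

def sentence_vector(sentence, model, tfidf=None):
    words = [w for w in sentence.split() if w in model.wv]
    vectors = np.array([model.wv[w] for w in words])
    if tfidf is None:
        return vectors.mean(axis=0)  # plain average of the word vectors
    weights = np.array([tfidf.get(w, 1.0) for w in words])
    return vectors.T @ weights / weights.sum()  # TF-IDF weighted average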
There are several ways to get a vector for a sentence. Each approach has advantages and shortcomings. Choosing one depends on the task you want to perform with your vectors.
First, you can simply average the vectors from word2vec. According to Le and Mikolov, this approach performs poorly for sentiment analysis tasks, because it "loses the word order in the same way as the standard bag-of-words models do" and "fail[s] to recognize many sophisticated linguistic phenomena, for instance sarcasm". On the other hand, according to Kenter et al. 2016, "simply averaging word embeddings of all words in a text has proven to be a strong baseline or feature across a multitude of tasks", such as short text similarity tasks. A variant would be to weight word vectors with their TF-IDF to decrease the influence of the most common words.
A more sophisticated approach developed by Socher et al. is to combine word vectors in an order given by the parse tree of a sentence, using matrix-vector operations. This method works for sentence-level sentiment analysis, but it depends on having a parse of the sentence.
It is possible, but not from word2vec alone. The composition of word vectors in order to obtain higher-level representations for sentences (and further for paragraphs and documents) is a really active research topic. There is not one best solution; it really depends on the task you want to apply these vectors to. You can try concatenation, simple summation, pointwise multiplication, convolution, etc. There are several publications on this that you can learn from, but ultimately you just need to experiment and see what fits you best.
It depends on the usage:
1) If you only want to get sentence vectors for some known data, check out the paragraph vector in these papers:
Quoc V. Le and Tomas Mikolov. 2014. Distributed Representations of Sentences and Documents. Eprint Arxiv, 4:1188–1196.
A. M. Dai, C. Olah, and Q. V. Le. 2015. Document Embedding with Paragraph Vectors. ArXiv e-prints, July.
2) If you want a model to estimate sentence vectors for unknown (test) sentences with an unsupervised approach, you could check out this paper:
Steven Du and Xi Zhang. 2016. Aicyber at SemEval-2016 Task 4: i-vector based sentence representation. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval 2016), San Diego, US
3) Researchers are also looking at the output of certain layers in RNN or LSTM networks; a recent example is:
http://www.aaai.org/ocs/index.php/AAAI/AAAI16/paper/view/12195
4) For gensim's doc2vec, many researchers could not get good results. To overcome this problem, the following paper uses doc2vec based on pre-trained word vectors.
Jey Han Lau and Timothy Baldwin (2016). An Empirical Evaluation of doc2vec with Practical Insights into Document Embedding Generation. In Proceedings of the 1st Workshop on Representation Learning for NLP, 2016.
5) tweet2vec or sent2vec.
Facebook has SentEval project for evaluating the quality of sentence vectors.
https://github.com/facebookresearch/SentEval
6) There is more information in the following paper:
Neural Network Models for Paraphrase Identification, Semantic Textual Similarity, Natural Language Inference, and Question Answering
And these days you can use BERT: Google released the source code as well as pretrained models.
https://github.com/google-research/bert
And here is an example to run bert as a service:
https://github.com/hanxiao/bert-as-service
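A minimal client-side sketch for bert-as-service, assuming the server has already been started separately (e.g. with bert-serving-start and a downloaded BERT model):

from bert_serving.client import BertClient

bc = BertClient()
vectors = bc.encode(['First sentence.', 'Another sentence.'])
print(vectors.shape)  # one fixed-size vector per sentence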
You can get vector representations of sentences during the training phase (join the test and train sentences in a single file and run the word2vec code obtained from the following link).
Code for sentence2vec has been shared by Tomas Mikolov here.
It assumes the first word of each line to be the sentence id.
Compile the code using
gcc word2vec.c -o word2vec -lm -pthread -O3 -march=native -funroll-loops
and run it using
./word2vec -train alldata-id.txt -output vectors.txt -cbow 0 -size 100 -window 10 -negative 5 -hs 0 -sample 1e-4 -threads 40 -binary 0 -iter 20 -min-count 1 -sentence-vectors 1
EDIT
Gensim (development version) seems to have a method to infer vectors of new sentences. Check out model.infer_vector(NewDocument) method in https://github.com/gojomo/gensim/blob/develop/gensim/models/doc2vec.py
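A minimal sketch of that workflow, assuming sentences is a list of token lists:

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# tag each sentence with an id and train Doc2Vec on the result
docs = [TaggedDocument(words, [i]) for i, words in enumerate(sentences)]
model = Doc2Vec(docs, vector_size=100, window=5, min_count=1, epochs=20)

# infer a vector for a sentence that was not in the training data
new_vector = model.infer_vector(['this', 'is', 'a', 'new', 'sentence'])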
I've had good results from:
Summing the word vectors (with tf-idf weighting). This ignores word order, but for many applications is sufficient (especially for short documents)
Fastsent
Google's Universal Sentence Encoder embeddings are an updated solution to this problem. It doesn't use Word2vec but results in a competing solution.
Here is a walk-through with TFHub and Keras.
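A minimal sketch of using it directly, assuming TensorFlow 2 and tensorflow_hub are installed:

import tensorflow_hub as hub

embed = hub.load('https://tfhub.dev/google/universal-sentence-encoder/4')
embeddings = embed(['The quick brown fox.', 'Another sentence.'])
print(embeddings.shape)  # (2, 512): one 512-dimensional vector per sentence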
A deep averaging network (DAN) can provide sentence embeddings, in which word bi-grams are averaged and passed through a feedforward deep neural network (DNN).
It has been found that transfer learning using sentence embeddings tends to outperform word-level transfer, as it preserves the semantic relationship.
You don't need to start the training from scratch; pretrained DAN models are available for perusal (check the Universal Sentence Encoder module on Google's TF Hub).
Suppose this is the current sentence:

import numpy as np
from gensim.models import KeyedVectors

# load the pretrained word vectors (binary format)
model = KeyedVectors.load_word2vec_format('path of your training dataset', binary=True)

strr = 'i am'
strr2 = strr.split()
print(strr2)

# model[strr2] returns one vector per word; average them to get the sentence embedding
sentence_embedding = np.mean(model[strr2], axis=0)