Concept Behind The Transformed Data Of LDA Model - evaluation

My question is related to Latent Dirichlet Allocation (LDA). Suppose we fit an LDA model on our dataset and then call fit_transform on it.
The output is a matrix covering five documents, where each document is described by three topics. The output is below:
[[ 0.0922935 0.09218227 0.81552423]
[ 0.81396651 0.09409428 0.09193921]
[ 0.05265482 0.05240119 0.89494398]
[ 0.05278187 0.89455775 0.05266038]
[ 0.85209554 0.07338382 0.07452064]]
So, this is the matrix that will be sent to a classification method for evaluation purposes.
For the classification part, we need a label for each row. But we do not have the labels, which means I have to create them on my own.
One approach could be to take the topic with the highest probability in each row as the corresponding label.
For example, the labels may be like so:
[2, 0, 2, 1, 0]
However, this is a very simple example.
I could also consider the two highest probabilities for each document, if each document is assumed to have only two topics. The example would then look like this:
[[ 0.0922935 0 0.81552423]
[ 0.81396651 0.09409428 0]
[ 0.05265482 0 0.89494398]
[ 0.05278187 0.89455775 0]
[ 0.85209554 0 0.07452064]]
As you can see, the rule is to keep the original probabilities for the entries with the highest values and zero out the rest.
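For concreteness, here is a minimal NumPy sketch of how both labelings could be computed (the array is just the example matrix from above; the argmax and top-k masking are my own illustration, not an established recipe):

import numpy as np

# Example document-topic matrix from above (5 documents x 3 topics)
doc_topic = np.array([
    [0.0922935,  0.09218227, 0.81552423],
    [0.81396651, 0.09409428, 0.09193921],
    [0.05265482, 0.05240119, 0.89494398],
    [0.05278187, 0.89455775, 0.05266038],
    [0.85209554, 0.07338382, 0.07452064],
])

# Approach 1: one label per document, the topic with the highest probability
labels = doc_topic.argmax(axis=1)                # -> [2, 0, 2, 1, 0]

# Approach 2: keep the top-2 probabilities per document, zero out the rest
k = 2
top_k = np.argsort(doc_topic, axis=1)[:, -k:]    # column indices of the k largest entries per row
rows = np.arange(doc_topic.shape[0])[:, None]
masked = np.zeros_like(doc_topic)
masked[rows, top_k] = doc_topic[rows, top_k]

print(labels)
print(masked)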
Which approach is correct? Has anyone used any other approach that is more meaningful?
Many thanks in advance!

Related

remove-duplicates produces numbers that haven't been in the original list in NetLogo

I'm creating a list out of the patch variable "geb-id" (a 7-digit integer number) with the following line:
set geb-id-list [geb-id] of patches with [geb-id >= 0 AND residents != 0]
When I look at the produced list, all looks fine. Then I'm using
set geb-id-list remove-duplicates geb-id-list
to remove the duplicates, as there are several patches with the same geb-id but I only want one geb-id per list. The list no longer contains any duplicates, but some of the numbers have turned into seemingly random floating point numbers, and I cannot trace where these numbers suddenly came from (I checked the source data and it doesn't contain any floating point numbers).
For example:
The list with duplicates:
[ 133349 133349 133351 133351 133351 133360 133360 133360 133375 133375 133375 ]
The list without duplicates:
[ 133349 133351 133360 209587.61538461538 133915.6666666667 1518018.2 133375 ]
The floating point numbers that appear in the de-duplicated list do not appear in the original list (even when the digits after the decimal point are removed).
Here is the list with the geb-id's as a CSV file.
I think your source data does have floating point values in it. When I use your provided CSV and run this code:
extensions [csv]

to check-floats
  ; read the CSV, take the first column of each row,
  ; drop the "NaN" entries, and keep only the non-integer values
  let values-2d csv:from-file "geb-id-list.csv"
  let values map [ r -> item 0 r ] values-2d
  let nums filter [ v -> v != "NaN" ] values
  let floats filter [ n -> floor n != n ] nums
  show floats
end
Running check-floats outputs [2079045.4999999998 2111524.956521739 862483.9473684211 361412.1111111111 1278359.6666666667 2111602.5 1564756.1249999998 2111443.4285714286 2150385.3333333335 134019.9090909091 2111560.285714286 2111643.75 133585.33333333334 134012.2 133580.00000000003 1518018.2 133915.6666666667 1452081.6666666665 133894.57142857145 ... ].
If you are loading in those geb-id's from an external source and you are sure there are no floats in that source data, then something in your NetLogo model code must be altering them to that format before you try to use remove-duplicates on them. If you can't discover how the floats are getting in there, posting more of your model's code could help someone find the cause.
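If it helps, here is an independent cross-check outside NetLogo: a small Python sketch (assuming geb-id-list.csv has one geb-id per row, possibly with "NaN" placeholders) that lists every value that is numeric but not a whole number:

import csv
import math

floats = []
with open("geb-id-list.csv", newline="") as f:
    for row in csv.reader(f):
        if not row:
            continue
        try:
            value = float(row[0])
        except ValueError:
            continue                 # non-numeric cell
        if math.isnan(value):
            continue                 # "NaN" placeholder rows
        if not value.is_integer():
            floats.append(value)     # genuine non-integer geb-id

print(floats)

If this prints a non-empty list, the floats really are in the source file rather than being introduced by the model code.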

NetLogo - Delaying the execution of certain commands based on ticks

Hello NetLogo community,
I am trying to ask agents of the breed "users" to save a certain value (a string) of a variable for the last two ticks (the last two times the "Go" command is executed). However, users should only start storing these values after the first two ticks. Can anyone suggest a way out? I have tried implementing the following logic, but it does not seem to work.
ask users
[
  set history-length-TM 2
  if ticks > 2
  [
    set TM-history n-values history-length-TM [mode-taken]
    foreach TM-history [x = "car"]
    [
      commands that are to be executed
      .....
      ......
    ]
  ]
]
"history-length-TM" is the extent of ticks for which the values are to be stored. "TM-History" is the list to store the values of variable "mode-taken". Please advise a better method that could help me achieve the intent. Thanks in advance.
I am not sure I completely understand how ticks relates to this question. My suggestion would be something along these lines:
globals [history-length-TM]
users-own [TM-history]

to setup
  set history-length-TM 2
  ...
end

to go
  ; (the ask users block presumably lives in your go procedure)
  ask users
  [
    ....
    ; remember the newest mode-taken at the front of the list
    set TM-history fput mode-taken TM-history
    ; once the memory is longer than history-length-TM, drop the oldest entry
    if length TM-history > history-length-TM [ set TM-history but-last TM-history ]
  ]
end
The idea is that the memory fills up (using fput) by placing the new mode-taken at the front of the list. Once the memory is too long, the last item (which is the oldest) is dropped off the list.

Word2Vec: is it only for words in a sentence, or for features as well?

I would like to ask more about Word2Vec:
I am currently trying to build a program that checks the embedding vectors for a sentence. At the same time, I am also building a feature extractor using scikit-learn to extract lemma 0, lemma 1, and lemma 2 from the sentence.
From my understanding:
1) Feature extraction: lemma 0, lemma 1, lemma 2
2) Word embedding: vectors are embedded for each character (this can be achieved using gensim's word2vec, which I have tried)
More explanation:
Sentence = "I have a pen".
Word = token of the sentence, for example, "have"
1) Feature extraction
"I have a pen" --> lemma 0:I, lemma_1: have, lemma_2:a.......lemma 0:have, lemma_1: a, lemma_2:pen and so on.. Then when try to extract the feature by using one_hot then will produce:
[[0,0,1],
[1,0,0],
[0,1,0]]
2) Word embedding (Word2Vec)
"I have a pen" ---> "I", "have", "a", "pen" (tokenized); then word2vec from gensim will produce matrices, for example, if using window_size = 2:
[[0.31235,0.31345],
[0.31235,0.31345],
[0.31235,0.31345],
[0.31235,0.31345],
[0.31235,0.31345]
]
The floating-point and integer numbers are for explanation purposes only; the actual values will vary depending on the sentence. These are just dummy data.
Questions:
1) Is my understanding of Word2Vec correct? If yes, what is the difference between feature extraction and word2vec?
2) I am curious whether I can use word2vec for the feature extraction embedding too, since from my understanding word2vec only finds an embedding for each word and not for the features.
Hopefully someone can help me with this.
It's not completely clear what you're asking, as you seem to have several concepts mixed up. (Word2Vec gives vectors per word, not per character; word-embeddings are a kind of feature-extraction on words, rather than an alternative to 'feature extraction'; etc. So: I doubt your understanding is yet correct.)
"Feature extraction" is a very general term, meaning any and all ways of taking your original data (such as a sentence) and creating a numerical representation that's good for other kinds of calculation or downstream machine-learning.
One simple way to turn a corpus of sentences into numerical data is to use a "one-hot" encoding of which words appear in each sentence. For example, if you have the two sentences...
['A', 'pen', 'will', 'need', 'ink']
['I', 'have', 'a', 'pen']
...then you have 7 unique case-flattened words...
['a', 'pen', 'will', 'need', 'ink', 'i', 'have']
...and you could "one-hot" the two sentences as a 1-or-0 for each word they contain, and thus get the 7-dimensional vectors:
[1, 1, 1, 1, 1, 0, 0] # A pen will need ink
[1, 1, 0, 0, 0, 1, 1] # I have a pen
Even with this simple encoding, you can now compare sentences mathematically: a euclidean-distance or cosine-distance calculation between those two vectors will give you a summary distance number, and sentences with no shared words will have a high 'distance', and those with many shared words will have a small 'distance'.
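As a quick, hedged sketch of that encoding in Python (CountVectorizer with binary=True is just one convenient way to build these 1-or-0 vectors; the custom token_pattern only serves to keep the one-letter words "a" and "i"):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_distances

sentences = ["A pen will need ink", "I have a pen"]

# binary=True gives 1-or-0 vectors over the shared, case-flattened vocabulary
vectorizer = CountVectorizer(binary=True, token_pattern=r"(?u)\b\w+\b")
one_hot = vectorizer.fit_transform(sentences).toarray()

print(vectorizer.get_feature_names_out())   # the 7 unique words
print(one_hot)                              # one 7-dimensional vector per sentence
print(cosine_distances(one_hot)[0, 1])      # distance between the two sentences

(Dropping binary=True would give the plain word-count variant discussed next, and TfidfVectorizer covers the weighted-count case.)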
Other very-similar possible alternative feature-encodings of these sentences might involve counts of each word (if a word appeared more than once, a number higher than 1 could appear), or weighted-counts (where words get an extra significance factor by some measure, such as the common "TF/IDF" calculation, and thus values scaled to be anywhere from 0.0 to values higher than 1.0).
Note that you can't encode a single sentence as a vector that's just as wide as its own words, such as "I have a pen" into a 4-dimensional [1, 1, 1, 1] vector. That then isn't comparable to any other sentence. They all need to be converted to the same-dimensional-size vector, and in "one hot" (or other simple "bag of words") encodings, that vector is of dimensionality equal to the total vocabulary known among all sentences.
Word2Vec is a way to turn individual words into "dense" embeddings with fewer dimensions but many non-zero floating-point values in those dimensions. This is instead of sparse embeddings, which have many dimensions that are mostly zero. The 7-dimensional sparse embedding of 'pen' alone from above would be:
[0, 1, 0, 0, 0, 0, 0] # 'pen'
If you trained a 2-dimensional Word2Vec model, it might instead have a dense embedding like:
[0.236, -0.711] # 'pen'
All the 7 words would have their own 2-dimensional dense embeddings. For example (all values made up):
[-0.101, 0.271] # 'a'
[0.236, -0.711] # 'pen'
[0.302, 0.293] # 'will'
[0.672, -0.026] # 'need'
[-0.198, -0.203] # 'ink'
[0.734, -0.345] # 'i'
[0.288, -0.549] # 'have'
If you have Word2Vec vectors, then one alternative simple way to make a vector for a longer text, like a sentence, is to average together all the word-vectors for the words in the sentence. So, instead of a 7-dimensional sparse vector for the sentence, like:
[1, 1, 0, 0, 0, 1, 1] # I have a pen
...you'd get a single 2-dimensional dense vector like:
[ 0.28925, -0.3335 ] # I have a pen
And again different sentences may be usefully comparable to each other based on these dense-embedding features, by distance. Or these might work well as training data for a downstream machine-learning process.
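Here is a minimal gensim sketch of that word-vector-averaging idea (this assumes gensim 4.x, where the dimensionality parameter is vector_size; the toy corpus and vector_size=2 only mirror the made-up example above, and real models need far more data and far more dimensions):

from gensim.models import Word2Vec
import numpy as np

# Toy corpus: one list of lowercased tokens per sentence
corpus = [
    ["a", "pen", "will", "need", "ink"],
    ["i", "have", "a", "pen"],
]

# vector_size=2 to mirror the 2-dimensional example; min_count=1 keeps every word
model = Word2Vec(sentences=corpus, vector_size=2, window=2, min_count=1, seed=1)

def sentence_vector(words, model):
    # Average the vectors of the words the model knows about
    vectors = [model.wv[w] for w in words if w in model.wv]
    return np.mean(vectors, axis=0)

print(model.wv["pen"])                                    # dense vector for one word
print(sentence_vector(["i", "have", "a", "pen"], model))  # averaged sentence vector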
So, this is a form of "feature extraction" that uses Word2Vec instead of simple word-counts. There are many other more sophisticated ways to turn text into vectors; they could all count as kinds of "feature extraction".
Which works best for your needs will depend on your data and ultimate goals. Often the most-simple techniques work best, especially once you have a lot of data. But there are few absolute certainties, and you often need to just try many alternatives, and test how well they do in some quantitative, repeatable scoring evaluation, to find which is best for your project.

Multi-label text classification with scikit-learn

I'm new to machine learning and I'm having trouble adapting any examples that I've found to my specific problem. The official documentation for scikit is rather spartan and full of terminology I'm unfamiliar with, so I'm not really sure which algorithm I should be using, how to properly prepare my data for it, and how to get the predictions in the form I want.
I already have my feature extraction function for the text in place, which returns a tuple of floats ranging from 0.0 to 100.0. These represent the prevalence of a certain characteristic in the text as a percentage. So my features for a certain piece of text would look something like (0.0, 17.31, 57.0, 93.2, ...). I'm unsure of which algorithm would be the most suitable for this type of data.
As per the title, I also need the ability to classify a piece of text using more than one label. Reading some other SO questions clued me in that I need to use MultiLabelBinarizer and OneVsRestClassifier, but I'm still unsure how to apply them to my data and whichever algorithm I'll need to use.
I also didn't find any examples that would return prediction results for the multiple labels in the form I want them. That is, instead of a binary "is or isn't this label", I'd like a percentage chance that the text is of a certain label. So when doing something like classifier.predict(testData) I'd like a return like {"spam":87.3, "code":27.9, "urlList":3.12} instead of something like ["spam", "code", "urlList"]. That way I can make more precise decisions about what to do with a certain text.
I should probably also mention one characteristic of the dataset that I'm using, and that is that 85-90% of the text will be code, and therefore only have one tag, "code". I imagine there are some tweaks to the algorithm required to account for this?
Some simplified and probably unsuitable code:
possibleLabels = ["code", "spam", "urlList"]
trainData, trainLabels = [ (0.0, 17.31, 57.0, 93.2), ... ], [ ["spam"], ["code"], ["code", "urlList"], ... ]
testData, testLabels = [], [] # Separate batch of samples in the same format as above
# Not sure if this is the proper way to prepare my labels,
# nor how to later resolve the binarized versions to their string counterparts.
mlb = preprocessing.MultiLabelBinarizer()
fitTrainLabels = mlb.fit_transform(trainLabels)
# Feels like I need more to make it suitable for my data
classifier = OneVsRestClassifier()
classifier.fit(trainData, fitTrainLabels)
# Need return as a list of dicts containing probability of tags, ie. [ {"spam":87.3, "code":27.9, "urlList":3.12}, {...}, ... ]
predicted = classifier.predict(testData)
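For reference, here is a minimal sketch along the lines the question describes; LogisticRegression is just an assumed base estimator, the training numbers are dummy values, and predict_proba on OneVsRestClassifier yields one independent probability per label that can be zipped back to the label names:

from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression  # assumed base estimator, for illustration only

possibleLabels = ["code", "spam", "urlList"]
trainData = [(0.0, 17.31, 57.0, 93.2), (12.0, 2.1, 33.3, 5.5), (48.5, 1.2, 0.0, 60.0)]  # dummy values
trainLabels = [["spam"], ["code"], ["code", "urlList"]]
testData = [(3.1, 20.0, 55.0, 90.0)]

mlb = MultiLabelBinarizer(classes=possibleLabels)
y = mlb.fit_transform(trainLabels)

classifier = OneVsRestClassifier(LogisticRegression(max_iter=1000))
classifier.fit(trainData, y)

# One probability per label per sample, mapped back to the label names as percentages
probabilities = classifier.predict_proba(testData)
predicted = [dict(zip(mlb.classes_, (p * 100).round(2))) for p in probabilities]
print(predicted)  # e.g. [{'code': ..., 'spam': ..., 'urlList': ...}]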

Remove item from multiple lists

I have a subproblem for which the solution is way larger than what I would suspect is necessary.
The problem is defined as follows
Remove X from all groups where group has id between Y and Z or A and B
Expressed as a pseudo-query, it looks like this with Y, Z, A, B set to 0, 1, 3, 4:
remove(X,
[ period(0,1), period(3,4) ],
[
group(0, [ subgroup([_,_,X,_,_]), subgroup([X])]),
group(1, [ subgroup([X])]),
group(2, [ subgroup([_,_,X])]),
group(3, [ subgroup([_,X,_])]),
group(4, [ subgroup([X,_,_])])
], UpdatedGroups).
The result will be
UpdatedGroups = [
group(0, [ subgroup([_,_,_,_]), subgroup([])]),
group(1, [ subgroup([])]),
group(2, [ subgroup([_,_,X])]),
group(3, [ subgroup([_,_])]),
group(4, [ subgroup([_,_])])
]
So, my solution to this is:
While the start of the current period is less than or equal to the end of the current period, remove X from the matching groups while "incrementing" the start day. Repeat until there are no more periods.
The removal of X in groups is done by "looping" over all groups and checking whether each one matches the period; if it does, the user is removed from its subgroups, which again is done by "looping".
This is a very tedious but straightforward solution. My problem is that I quite often find myself writing code like this, and I cannot find a more concise approach.
Are there approaches other than mine that do not take 50+ lines?
Updated
Thanks a lot, the code became much cleaner. It could probably go further, but now it is short enough to actually post here (this is modified a bit, but the logic is there):
inPeriods(Day, [ period(Start,End) | _ ]) :- between(Start, End, Day).
inPeriods(Day, [ _ | RemainingPeriods ]) :- inPeriods(Day, RemainingPeriods).

ruleGroupsInPeriod(Periods, rulegroup(Day,_)) :- inPeriods(Day, Periods).

removeUserFromRelevantRuleGroups(UserId, Periods, RuleGroups, UpdatedRuleGroups) :-
    include(ruleGroupsInPeriod(Periods), RuleGroups, IncludedRuleGroups),
    exclude(ruleGroupsInPeriod(Periods), RuleGroups, ExcludedRuleGroups),
    maplist(updateRuleGroup(UserId), IncludedRuleGroups, UpdatedIncludedRuleGroups),
    append(UpdatedIncludedRuleGroups, ExcludedRuleGroups, UpdatedRuleGroups).

updateRuleGroup(UserId, rulegroup(X, RuleList), rulegroup(X, UpdatedRuleList)) :-
    maplist(updateRule(UserId), RuleList, UpdatedRuleList).

updateRule(UserId, rule(X, UserList), rule(X, UpdatedUserList)) :-
    delete(UserList, UserId, UpdatedUserList).
Yes.
The pattern you describe is very common, and all serious Prolog systems ship with powerful meta-predicates (i.e., predicates whose arguments denote predicates) that let you easily describe this and many other common situations in a flexible manner, using at most a few simple additional definitions for your concrete relations.
Richard O'Keefe's proposal for An elementary Prolog Library contains the descriptions and implementations of many such predicates, which are becoming increasingly available in all major Prolog implementations and also in the Prologue for Prolog.
In particular, you should study:
include/3
exclude/3
maplist/[2,3]
Note though that many of the described predicates are impure in the sense that they will destroy declarative properties of your code. In contrast to true relations, you won't be able to use them in all directions while preserving logical soundness.
Exercise: Which of the three predicates mentioned above, if any, preserve logical-purity when everything else is pure, and which, if any, do not?