AWS Machine Learning issue - amazon-web-services

I use AWS Machine Learning to predict if a tweet message is positive or negative.
I have a CSV file with about 1000 tweets (2 columns "message" TEXT and "is_postive" BINARY).
If the message contains some words that I've defined by my side, "is_positive" is set to 0 (else 1)
My issue is that evaluations always return 1 (even if I try a message with a "bad" word).
How can I have more relevant results?
Thanks for your help!

Navigate to your datasource and select your LM model. Clicking on the attributes will give you an idea of how "statistically relevant" the columns in your teaching data are. Your result is most probably due to your teaching data. Since the entire tweet message is in one column, the model is most likely looking for a correlation on all words in the sample tweets. A better model may be to use a "sentiment" library of which there are publicly available versions which would shift your model to look at each word in the tweet vs. the tweet as a whole as yours currently is.

Related

AWS GroundTruth text labeling - hide columns in the data, and checking quality of answers

I am new to SageMaker. I have a large csv dataset which I would like labelled:
sentence_id
sentence
pre_agreed_label
148392
A sentence
0
383294
Another sentence
1
For each sentence, I would like a) a yes/no binary classification in response to a question, and b) on a scale of 1-3, how obvious the classification was. I need the sentence id to map to other parts of the dataset, and will use the pre-agreed labels to assess accuracy.
I have identified SageMaker GroundTruth labelling jobs as a possible way to do this. Is this the best way? In trying to set it up I have run into a few problems.
The first problem is I can't find a way to display only the sentence column to the labellers, hiding the sentence_id and pre_agreed_labels.
The second is that there is either single labelling or multi labelling, but I would like a way to have two sets of single-selection labels:
Select one for binary classification:
Yes
No
Select one for difficulty of classification:
Easy
Medium
Hard
It seems as though this can be done using custom HTML, but I don't know how to do this - the template it gives you doesn't even render
Finally, having not used mechanical turk before, are there ways of ensuring people take the work seriously and don't just select random answers? I can see there's an option to have x number of people answer the same question, but is there also a way to put in an obvious question to which we already have a 'pre_agreed_label' every nth question, and kick people off the task if they get it wrong? There also appears to be a maximum of $1.20 per task which seems odd.

Android/ Java : IS there fast way to filter large data saved in a list ? and how to get high quality picture with small storage space in server?

I have two questions
the first one is:
I have large data come from the server I saved it in a list , the customer can filter this data by 7 filters and two by text watcher this thing caused filtering operation to slow it takes 4 seconds in each time
I tried to put the filter keywords like(length or width ...) in one if and (&&) between them
but it didn't give me a result, also I tried to replace the textwatcher by spinner but it's not
useful.
I'm using one (for loop)
So the question: how can I use multi filter for list contain up to 2000 row with mini or zero slow?
the second is:
I saved from 2 to 8 pictures in the server in string form
the question is when I get these pictures from the server how can I show them in high quality?
when I show them I can see the pixels and this is not good for the customer
I don't want these pictures to take large space in the server and at the same time I want it in good quality when I restore them to display
I'm using Android/ Java
Thank you
The answer on my first quistion is if you want using filter (like when you are using online clothes shop and you want to filter it by less price ) you should use the hash map, not ordinary list it will be faster
The answer on my second question is: if you want to save store images in a database you should save it as a link, not a string or any other datatype

How can I resolve INDEX MATCH errors caused by discrepancies in the spelling of names across multiple data sources?

I've set up a Google Sheets workbook that synthesizes data from a few different sources via manual input, IMPORTHTML and IMPORTRANGE. Once the data is populated, I'm using INDEX MATCH to filter and compare the information and to RANK each data set.
Since I have multiple data inputs, I'm running into a persistent issue of names not being written exactly the same between sources, even though they're the same person. First names are the primary culprit (i.e. Mary Lou vs Marylou vs Mary-Lou vs Mary Louise) but some last names with special symbols (umlauts, accents, tildes) are also causing errors. When Sheets can't recognize a match, the INDEX MATCH and RANK functions both break down.
I'm wondering how to better unify the data automatically so my Sheet understands that each occurrence is actually the same person (or "value").
Since you can't edit the results of an IMPORTHTML directly, I've set up "helper columns" and used functions like TRIM and SPLIT to try and fix instances as I go, but it seems like there must be a simpler path.
It feels like IFS could work but I can't figure how to integrate it. Also thinking this may require a script, which I'm just beginning to study.
Here's a simplified example of what I'm trying to achieve and the corresponding errors: Sample Spreadsheet
The first tab is attempting to pull and RANK data from tabs 2 and 3. Sample formulas from the Summary tab, row 3 (Amelia Rose):
Cell B3: =INDEX('Q1 Sales'!B:B, MATCH(A3,'Q1 Sales'!A:A,0))
Cell C3: =RANK(B3,$B$2:B,1)
Cell D3: =INDEX('Q2 Sales'!B:B, MATCH(A3,'Q2 Sales'!A:A,0))
Cell E3: =RANK(D3,$D$2:D,1)
I'd be grateful for any insight on how to best index 'Q2Sales'!B3 as the correct value for 'Summary'!D3. Thanks in advance - the thoughtful answers on Stack Overflow have gotten me this far!
to counter every possible scenario do it like this:
=ARRAYFORMULA(IFERROR(VLOOKUP(LOWER(REGEXREPLACE(A2:A, "-|\s", )),
{REGEXEXTRACT(LOWER(REGEXREPLACE('Q2 Sales'!A2:A, "-|\s", )),
TEXTJOIN("|", 1, LOWER(REGEXREPLACE(A2:A, "-|\s", )))), 'Q2 Sales'!B2:B}, 2, 0)))

MaxMind's GeoIPCity for a single country only?

Recently I have stumbled upon a problem - MaxMind's GeoIPCity file is way too big for our needs and contains A LOT of data we don't need and won't need.
The question is: is there a way to limit the City database to a single country? let's say, Canadian cities only?
You cannot just conveniently download the database for Canadian cities only, but you can certainly prune the database once you have downloaded and loaded it. This is true whether you use the MaxMind DB or download the CSV format, just trim out the lines that do not represent Canada's country code or geoname_id (depending on v1 or v2 of the dataset).
If you identify your specific coding environment and language, I'm certain someone can help you write a few lines of code that chops out all the fat.

Why Word2Vec's most_similar() function is giving senseless results on training?

I am running the gensim word2vec code on a corpus of resumes(stopwords removed) to identify similar context words in the corpus from a list of pre-defined keywords.
Despite several iterations with input parameters,stopword removal etc the similar context words are not at all making sense(in terms of distance or context)
Eg. correlation and matrix occurs in the same window several times yet matrix doesnt fall in the most_similar results for correlation
Following are the details of the system and codes
gensim 2.3.0 ,Running on Python 2.7 Anaconda
Training Resumes :55,418 sentences
Average words per sentence : 3-4 words(post stopwords removal)
Code :
wordvec_min_count=int()
size = 50
window=10
min_count=5
iter=50
sample=0.001
workers=multiprocessing.cpu_count()
sg=1
bigram = gensim.models.Phrases(sentences, min_count=10, threshold=5.0)
trigram = gensim.models.Phrases(bigram[sentences], min_count=10, threshold=5.0)
model=gensim.models.Word2Vec(sentences = trigram[sentences], size=size, alpha=0.005, window=window, min_count=min_count,max_vocab_size=None,sample=sample, seed=1, workers=workers, min_alpha=0.0001, sg=sg, hs=1, negative=0, cbow_mean=1,iter=iter)
model.wv.most_similar('correlation')
Out[20]:
[(u'rankings', 0.5009744167327881),
(u'salesmen', 0.4948525130748749),
(u'hackathon', 0.47931140661239624),
(u'sachin', 0.46358123421669006),
(u'surveys', 0.4472047984600067),
(u'anova', 0.44710394740104675),
(u'bass', 0.4449636936187744),
(u'goethe', 0.4413239061832428),
(u'sold', 0.43735259771347046),
(u'exceptional', 0.4313117265701294)]
I am lost as to why the results are so random ? Is there anyway to check the accuracy for word2vec ?
Also is there an alternative of word2vec for most_similar() function ? I read about gloVE but was not able to install the package.
Any information in this regard would be helpful
Enable INFO-level logging and make sure that it indicates real training is happening. (That is, you see incremental progress taking time over the expected number of texts, over the expected number of iterations.)
You may be hitting this open bug issue in Phrases, where requesting the Phrase-promotion (as with trigram[sentences]) only offers a single-iteration, rather than the multiply-iterable collection object that Word2Vec needs.
Word2Vec needs to pass over the corpus once for vocabulary-discovery, then iter times again for training. If sentences or the phrasing-wrappers only support single-iteration, only the vocabulary will be discovered – training will end instantly, and the model will appear untrained.
As you'll see in that issue, a workaround is to perform the Phrases-transformation and save the results into an in-memory list (if it fits) or to a separate text corpus on disk (that's already been phrase-combined). Then, use a truly restartable iterable on that – which will also save some redundant processing.