Google Cloud Natural Language API - Sentence Extraction (Python 2.7)

I am working with the Google Cloud Natural Language API. My goal is to extract the sentences inside a larger block of text and run sentiment analysis on each of them.
I am getting the following "unexpected indent" error. Based on my research, it doesn't appear to be a "basic" indentation error (such as a rogue space, etc.).
print('Sentence {} has a sentiment score of {}'.format(index, sentence_sentiment))
IndentationError: unexpected indent
The following line of code inside the for loop (see full code below) is causing the problem. If I remove it, the issue goes away.
print(sentence.content)
Also, if I move this print statement outside the loop, I don't get an error, but only the last sentence of the large block of text is printed (as would be expected).
I am totally new to programming, so if someone can explain what I am doing wrong in very simple terms and point me in the right direction, I would really appreciate it.
Full script below:
Mike
from google.cloud import language
text = 'Terrible, Terrible service. I cant believe how bad this was.'
client = language.Client()
document = client.document_from_text(text)
sent_analysis = document.analyze_sentiment()
sentiment = sent_analysis.sentiment
annotations = document.annotate_text(include_sentiment=True, include_syntax=True, include_entities=True)
print ('this is the full text to be analysed:')
print(text)
print('Here is the sentiment score and magnitude for the full text')
print(sentiment.score, sentiment.magnitude)
#now for the individual sentence analyses
for index, sentence in enumerate(annotations.sentences):
    sentence_sentiment = sentence.sentiment.score
    print(sentence.content)
    print('Sentence {} has a sentiment score of {}'.format(index, sentence_sentiment))

This looks completely correct, though there may be a tab/space issue lurking there that did not survive being posted in your question. Can you get your text editor to display whitespace characters? There is usually an option for that. If it is a Python-aware editor, there will be an option to convert tabs to spaces.
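Since this is Python 2, one quick diagnostic is the interpreter's standard -tt flag, which turns any inconsistent mixing of tabs and spaces into an explicit error (the script name here is just a stand-in for your file):
python -tt your_script.py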
You may be able to make the problem go away by deleting the line
print(sentence.content)
and changing the following one to
print('{}\nSentence {} has a sentiment score of {}'.format(sentence.content, index, sentence_sentiment))
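Put together, the loop would then look like this (a minimal sketch, assuming spaces-only indentation and the rest of your script unchanged):
for index, sentence in enumerate(annotations.sentences):
    sentence_sentiment = sentence.sentiment.score
    # One print per sentence: its content, then its index and score.
    print('{}\nSentence {} has a sentiment score of {}'.format(
        sentence.content, index, sentence_sentiment))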

Related

SAP Crystal Report If Then Else Statement Color Highlighting

So I have zero knowledge of how to operate SAP Crystal Reports but find myself needing to make a report for some everyday tasks. I found a similar report already created in the system and tried adapting it (after "saving as" so I don't break the original file) but am not getting the results I am looking for.
My goal is to highlight specific information when it appears. If a person makes a mistake and enters the wrong information in our software, the report will highlight the error for me.
All I would like it to do is highlight the information yellow or red, etc., whenever it doesn't match the formula.
For example, the code I've been trying to get to work is:
if ({vw_Tickets.SiteName} = "SCALE" and {vw_Tickets_Material_Detail.MaterialCode} = "22CD") then crred else crwhite;
I have different variations of the above code stacked on top of one another; the names change but that's about it.
I don't know if I'm using the formula wrong or if I'm typing it in the wrong location. To make my changes I'm in Section Expert > Details > Color > and then the red x-2 next to the color list > details > background color.
The Highlighting Expert doesn't do what I need it to do. I need it to highlight when something unusual happens.
I know that probably doesn't make a lot of sense, but any help or direction would be appreciated!
[Screenshot of the Crystal Reports formula]
The semicolons are not correct; a formula should return only one result. Try writing it like this:
If {siteName} = "TIGER-TECH" Then crlime Else
If {siteName} = "EMPLOYEE DUMP ACCOUNT" Then crlime Else
If {siteName} = "SCALE CACHE" ... Then crRed
// Last one: the default color
Else crWhite
If you want to leave the original color unchanged, use crNoColor.

Find All String Occurrences, Except The Last One Found, and Remove Them

I am using Google Docs to open Walmart receipts that I email to myself. The Walmart store that I use 99.9% of the time seems to have made some firmware update to the Ingenico POS terminal that makes it display a running SUBTOTAL after each item is identified by the scanner. Here are some images to support my question:
The POS terminal looks like this:
The second image is the electronic receipt, which I email myself from their iOS app. It is presumably taken from the POS terminal, because it has the extra running SUBTOTAL lines after each item just like the POS terminal screen shows. It has been doing this for a few months, and I've been given no reason to believe, by management, that it will be corrected any time soon.
The final image is my actual paper receipt. This is printed from the register; it's the one you walk out with and show the greeter/exit person to check your buggy and the items you've purchased.
Note that it does not show the extra SUBTOTAL.
I open the electronic receipt in a Google Document and their automatic OCR spits out the text of the receipt. It does a pretty darn good job; I'd say it's 95%+ accurate with these receipts. I apply a very crude little regex that reformats these electronic receipts so that I can enter them into a database and use that data for my family's budgeting, taxes, and so forth. That has been working very well for me, albeit I would like to further automate that process, but that's for a different question some day perhaps.
Right now, that little crude regex no longer formats the receipt into something usable for me.
What I would like to do is remove the extra SUBTOTALs from the (broken) electronic receipt but leave the last SUBTOTAL alone. I highlighted the last SUBTOTAL on the receipt, which is always there and should remain.
I have seen two other questions that are similar but I could not apply them to my situation. One of them was:
Remove all occurrences except the last one
What have I tried?
The following regex works in the online tester at regex101.com:
\nSUBTOTAL\t\d{1,3}(?:[.,]\d{3})*(?:[.,]\d{2})
It took me a while to come up with that regex from searching around, but essentially I want it to find all of the SUBTOTAL literals with a preceding newline and any decimal amount from 0.01 to 999.99, and I just want to replace what that finds with a newline. Then my other regex can work on the text like it used to before the firmware update to the POS terminal.
The regex correctly identifies every SUBTOTAL (including the last one) on the regex101.com site. I can apply a substitution of "\n" and I am back to seeing the receipt data I can work with, but there were two issues:
1) I can't replicate this using Google Apps Script.
Here is my example:
function myFunction() {
  var body = DocumentApp.getActiveDocument().getBody();
  var newText = body.getText()
    .match('\nSUBTOTAL\t\d{1,3}(?:[.,]\d{3})*(?:[.,]\d{2})')[1]
    .replace(/%/mgi, "%\n");
  body.clear();
  body.setText(newText);
}
2) If I were to get the above code to work, I still have the issue of wanting to leave the last SUBTOTAL intact.
Here is a Google Doc that I have set up to experiment with:
https://docs.google.com/document/d/11bOJp2rmWJkvPG1FCAGsQ_n7MqTmsEdhDQtDXDY-52s/edit?usp=sharing
I used this regular expression:
// JavaScript syntax
/\nSUBTOTAL\s\d{1,3}\.\d{2}| SUBTOTAL\n\d{1,3}\.\d{2}/g
I also made a script for Google Docs. You can use this Google Doc and see the results.
function deleting_subs() {
  var body = DocumentApp.getActiveDocument().getBody();
  var newText = body.getText();
  var out = newText.replace(/\nSUBTOTAL\s\d{1,3}\.\d{2}| SUBTOTAL\n\d{1,3}\.\d{2}/g, '');
  // This is needed to make the resulting text more readable.
  out = out.replace(/R /g, 'R\n');
  body.clear();
  body.setText(out);
}
To execute the script, open the Google Doc file and click on:
Add-ons -> Del_subs -> Deleting Subs.
Tip: after running the add-on (Deleting Subs), undo the document edit; that way other users can return to the previous version of the text.
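As for your second issue (keeping the last SUBTOTAL intact), one approach is to count the matches first and replace all but the final one. A minimal sketch of that idea, shown here in Python with made-up sample text (the same count-then-replace logic ports to Apps Script's JavaScript):
import re

receipt = ('MILK 2.49\nSUBTOTAL\t2.49\n'
           'EGGS 3.99\nSUBTOTAL\t6.48\n'
           'SUBTOTAL\t6.48\nTOTAL 6.48')
pattern = r'\nSUBTOTAL\t\d{1,3}(?:[.,]\d{3})*(?:[.,]\d{2})'

# Count every running SUBTOTAL, then remove all but the last one.
n = len(re.findall(pattern, receipt))
if n > 1:
    receipt = re.sub(pattern, '', receipt, count=n - 1)
print(receipt)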
Hope this helps.

Why is Word2Vec's most_similar() function giving senseless results after training?

I am running the gensim word2vec code on a corpus of resumes (stopwords removed) to identify similar context words in the corpus from a list of pre-defined keywords.
Despite several iterations with input parameters, stopword removal, etc., the similar context words are not at all making sense (in terms of distance or context).
E.g., correlation and matrix occur in the same window several times, yet matrix doesn't fall within the most_similar results for correlation.
Following are the details of the system and code:
gensim 2.3.0, running on Python 2.7 (Anaconda)
Training resumes: 55,418 sentences
Average words per sentence: 3-4 words (post stopword removal)
Code:
import multiprocessing
import gensim

# `sentences` is assumed to already hold the tokenized resume corpus.
wordvec_min_count = int()
size = 50
window = 10
min_count = 5
iter = 50
sample = 0.001
workers = multiprocessing.cpu_count()
sg = 1
bigram = gensim.models.Phrases(sentences, min_count=10, threshold=5.0)
trigram = gensim.models.Phrases(bigram[sentences], min_count=10, threshold=5.0)
model = gensim.models.Word2Vec(sentences=trigram[sentences], size=size, alpha=0.005,
    window=window, min_count=min_count, max_vocab_size=None, sample=sample, seed=1,
    workers=workers, min_alpha=0.0001, sg=sg, hs=1, negative=0, cbow_mean=1, iter=iter)
model.wv.most_similar('correlation')
model.wv.most_similar('correlation')
Out[20]:
[(u'rankings', 0.5009744167327881),
(u'salesmen', 0.4948525130748749),
(u'hackathon', 0.47931140661239624),
(u'sachin', 0.46358123421669006),
(u'surveys', 0.4472047984600067),
(u'anova', 0.44710394740104675),
(u'bass', 0.4449636936187744),
(u'goethe', 0.4413239061832428),
(u'sold', 0.43735259771347046),
(u'exceptional', 0.4313117265701294)]
I am lost as to why the results are so random. Is there any way to check the accuracy of word2vec?
Also, is there an alternative to word2vec for the most_similar() functionality? I read about GloVe but was not able to install the package.
Any information in this regard would be helpful.
Enable INFO-level logging and make sure that it indicates real training is happening. (That is, you see incremental progress taking time over the expected number of texts, over the expected number of iterations.)
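For example, a standard logging setup (gensim reports its training progress at INFO level):
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s',
                    level=logging.INFO)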
You may be hitting this open bug issue in Phrases, where requesting the phrase-promotion (as with trigram[sentences]) only offers a single iteration, rather than the multiply-iterable collection object that Word2Vec needs.
Word2Vec needs to pass over the corpus once for vocabulary-discovery, then iter times again for training. If sentences or the phrasing-wrappers only support a single iteration, only the vocabulary will be discovered; training will end instantly, and the model will appear untrained.
As you'll see in that issue, a workaround is to perform the Phrases-transformation and save the results into an in-memory list (if it fits) or to a separate text corpus on disk (that's already been phrase-combined). Then, use a truly restartable iterable on that, which will also save some redundant processing.
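A minimal sketch of that workaround, assuming `sentences` is your tokenized corpus and reusing your parameter values where possible:
import gensim

bigram = gensim.models.Phrases(sentences, min_count=10, threshold=5.0)
trigram = gensim.models.Phrases(bigram[sentences], min_count=10, threshold=5.0)

# Apply the phrase transforms once and keep the result as a plain list,
# which Word2Vec can iterate over as many times as it needs.
phrased_corpus = [trigram[bigram[sent]] for sent in sentences]

model = gensim.models.Word2Vec(phrased_corpus, size=50, window=10,
                               min_count=5, iter=50, sg=1)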

AWS Machine Learning issue

I use AWS Machine Learning to predict if a tweet message is positive or negative.
I have a CSV file with about 1000 tweets (2 columns: "message" TEXT and "is_positive" BINARY).
If the message contains certain words that I've defined myself, "is_positive" is set to 0 (else 1).
My issue is that evaluations always return 1 (even if I try a message with a "bad" word).
How can I have more relevant results?
Thanks for your help!
Navigate to your datasource and select your ML model. Clicking on the attributes will give you an idea of how "statistically relevant" the columns in your training data are. Your result is most probably due to your training data: since the entire tweet message is in one column, the model is most likely looking for a correlation across all the words in the sample tweets. A better model may be to use a "sentiment" library, of which there are publicly available versions, which would shift your model to look at each word in the tweet rather than the tweet as a whole, as yours currently does.
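To illustrate the word-level idea, here is a toy sketch (the word lists are invented placeholders, and this is plain Python rather than anything in the AWS console): score each word against a small sentiment lexicon instead of treating the whole tweet as one opaque feature.
# Toy lexicon-based scorer; the word sets below are made-up examples.
POSITIVE = {'good', 'great', 'love', 'happy'}
NEGATIVE = {'bad', 'terrible', 'awful', 'hate'}

def label_tweet(message):
    words = message.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return 1 if score >= 0 else 0  # 1 = positive, matching the CSV's binary column

print(label_tweet('I hate this terrible service'))  # prints 0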

How to randomly divide a huge corpus into 3?

I have a corpus (held in a serial datastore) of thousands of documents with annotations. Now I need to divide it into 3 smaller ones, by random picking. What is the easiest way to do this in GATE?
A piece of running code or a detailed guide would be most welcome!
I would use the Groovy console for this (load the "Groovy" plugin, then start the console from the Tools menu).
The following code assumes that
you have opened the datastore in GATE developer
you have loaded the source corpus, and its name is "fullCorpus"
you have created three (or however many you need) other empty corpora and saved them (empty) to the same datastore. These will receive the partitions
you have no other corpora open in GATE developer apart from these four
you have no documents open
Then you can run the following in the Groovy console:
def rnd = new Random()
def fullCorpus = corpora.find { it.name == 'fullCorpus' }
def parts = corpora.findAll { it.name != 'fullCorpus' }
fullCorpus.each { doc ->
  def targetCorpus = parts[rnd.nextInt(parts.size())]
  targetCorpus.add(doc)
  targetCorpus.unloadDocument(doc)
}
return null
The way this works is to iterate over the documents and pick a corpus at random for each document to be added to. The target sub-corpora should end up roughly (but not necessarily exactly) the same size.
The script does not save the final sub-corpora, so if it messes up you can just close them, re-open them (empty) from the original datastore, then fix and re-run the script. Once you're happy with the final result, right-click on each sub-corpus in turn in the left-hand tree and "save to its datastore" to write it all to disk.
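If you need the three parts to come out near-equal in size rather than just roughly equal, an alternative to a random draw per document is to shuffle the document list once and deal it out round-robin. A sketch of that idea, shown in plain Python rather than GATE's Groovy console API (the document names are stand-ins):
import random

docs = ['doc%03d' % i for i in range(10)]  # stand-ins for document references
random.shuffle(docs)
# Deal the shuffled documents into three near-equal partitions.
parts = [docs[i::3] for i in range(3)]
for p in parts:
    print(len(p), p)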