How to randomly divide a huge corpus into 3? - gate

I have a corpus (held in a JSerial datastore) of thousands of annotated documents. I now need to divide it into three smaller corpora by random picking. What is the easiest way to do this in GATE?
A piece of running code or a detailed guide would be most welcome!

I would use the Groovy console for this (load the "Groovy" plugin, then start the console from the Tools menu).
The following code assumes that:
- you have opened the datastore in GATE Developer
- you have loaded the source corpus, and its name is "fullCorpus"
- you have created three (or however many you need) other empty corpora and saved them (empty) to the same datastore; these will receive the partitions
- you have no other corpora open in GATE Developer apart from these four
- you have no documents open
Then you can run the following in the Groovy console:
def rnd = new Random()
def fullCorpus = corpora.find { it.name == 'fullCorpus' }
def parts = corpora.findAll { it.name != 'fullCorpus' }
fullCorpus.each { doc ->
    // pick one of the target corpora at random for this document
    def targetCorpus = parts[rnd.nextInt(parts.size())]
    targetCorpus.add(doc)
    // unload each document after adding it, to keep memory usage down
    targetCorpus.unloadDocument(doc)
}
return null
return null
The way this works is to iterate over the documents, picking a target corpus at random for each one. The sub-corpora should end up roughly (but not necessarily exactly) the same size.
The script does not save the final sub-corpora, so if it messes up you can simply close them, re-open them (empty) from the original datastore, then fix and re-run the script. Once you're happy with the final result, right-click on each sub-corpus in turn in the left-hand tree and choose "save to its datastore" to write it all to disk.

Related

Deleting previous results from Data Collection Results tab

I ran my PTV Vissim simulation a few times to see if it works. I noticed that the Data Collection Results tab collected all the data from these runs. How can I delete these previous data?
Edit:
I think I was able to figure it out: 1) you have to delete/move the files from the "results" folder in the same directory, 2) close out the simulation, 3) then open the simulation again...
Yes, one way is to simply delete the results folder and re-open the network.
However, from within Vissim you can go to the menu Evaluation → Result lists → Simulation runs and delete the obsolete (or all) simulation runs in the list that opens (this will also remove the corresponding evaluation results).
There is also the option in Evaluation → Configuration, Tab Result management, to keep the result of the "current (multi-)run only". Selecting this will automatically throw away existing results every time you start a new simulation.

Google Cloud Natural Language API - Sentence Extraction ( Python 2.7)

I am working with the Google Cloud Natural Language API . My goal is to extract the sentences and sentiment inside a larger block of text and run sentiment analysis on them.
I am getting the following "unexpected indent" error. Based on my research, it doesn't appear to be a "basic" indentation error (such as a rogue space, etc.).
print('Sentence {} has a sentiment score of {}'.format(index,sentence_sentiment)
IndentationError: unexpected indent
The following line of code inside the for loop (see full code below) is causing the problem. If I remove it, the issue goes away.
print(sentence.content)
Also, if I move this print statement outside the loop, I don't get an error, but only the last sentence of the large block of text is printed (as could be expected).
I am totally new to programming, so if someone can explain what I am doing wrong in very simple terms and point me in the right direction, I would be really appreciative.
Full script below
Mike
from google.cloud import language
text = 'Terrible, Terrible service. I cant believe how bad this was.'
client = language.Client()
document = client.document_from_text(text)
sent_analysis = document.analyze_sentiment()
sentiment = sent_analysis.sentiment
annotations = document.annotate_text(include_sentiment=True, include_syntax=True, include_entities=True)
print ('this is the full text to be analysed:')
print(text)
print('Here is the sentiment score and magnitude for the full text')
print(sentiment.score, sentiment.magnitude)
#now for the individual sentence analyses
for index, sentence in enumerate(annotations.sentences):
    sentence_sentiment = sentence.sentiment.score
    print(sentence.content)
    print('Sentence {} has a sentiment score of {}'.format(index, sentence_sentiment))
This looks completely correct, though there may be a tab/space issue lurking there that did not survive being posted in your question. Can you get your text editor to display whitespace characters? There is usually an option for that. If it is a Python-aware editor, there will be an option to convert tabs to spaces.
You may be able to make the problem go away by deleting the line
print(sentence.content)
and changing the following one to
print('{}\nSentence {} has a sentiment score of {}'.format(sentence.content, index, sentence_sentiment))
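If the culprit really is invisible whitespace, you can reproduce the effect in isolation. This minimal sketch (the snippet text is illustrative, not your actual file) compiles a string in which one line is indented with a tab while the line before it uses spaces; Python rejects the inconsistent indentation before the code ever runs:

```python
# Reproduce the effect of a stray tab in otherwise space-indented code.
source = (
    "for i in range(2):\n"
    "    print(i)\n"        # body indented with four spaces
    "\tprint('oops')\n"     # same block, but indented with a tab
)
try:
    compile(source, "<snippet>", "exec")
    result = "compiled fine"
except (TabError, IndentationError) as exc:
    # TabError is the subclass raised for inconsistent tabs/spaces
    result = type(exc).__name__
print(result)
```

Making your editor show whitespace characters exposes the same problem directly in the file, without needing to compile anything.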

Writing to Google Spreadsheet API Extremely Slow

I am trying to write data from here (http://acleddata.com/api/acled/read) to Google Sheets via its API. I'm using the gspread package to help.
Here is the code:
r = requests.get("http://acleddata.com/api/acled/read")
data = r.json()
data = data['data']
scope = ['https://spreadsheets.google.com/feeds']
credentials = ServiceAccountCredentials.from_json_keyfile_name('credentials.json', scope)
gc = gspread.authorize(credentials)
for row in data:
    sheet.append_row(row.values())
The data is a list of dictionaries, each dictionary representing a row in a spreadsheet. This is writing to my Google Sheet but it is unusably slow. It took easily 40 minutes to write a hundred rows, and then I interrupted the script.
Is there anything I can do to speed up this process?
Thanks!
Based on your code, you're using the older V3 Google Data API. For better performance, switch to the V4 API. A migration guide is available here.
Here is the faster solution:
# numberToLetters is a helper defined in the linked article that converts a
# column number to its spreadsheet letter (e.g. 1 -> 'A'). Note this is
# Python 2 code ('long', str.decode).
cell_list = sheet.range('A2:' + numberToLetters(num_columns) + str(num_lines + 1))
for cell in cell_list:
    val = df.iloc[cell.row - 2, cell.col - 1]
    if type(val) is str:
        val = val.decode('utf-8')
    elif isinstance(val, (int, long, float, complex)):
        val = int(round(val))
    cell.value = val
# one API call for the whole range, instead of one per row
sheet.update_cells(cell_list)
This is derived from https://www.dataiku.com/learn/guide/code/python/export-a-dataset-to-google-spreadsheets.html
I believe the change here is that this solution builds a cell_list object and writes it with a single API call.
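The same idea applies to appending: build the rows locally, split them into reasonably sized batches, and send each batch in one request instead of calling append_row once per row. A minimal sketch of the local batching (the commented write call assumes `sheet` is an authorized gspread worksheet as in the question, and that your gspread version provides `append_rows`):

```python
def to_batches(rows, size=500):
    """Split rows into chunks so each write is one modestly sized API call."""
    return [rows[i:i + size] for i in range(0, len(rows), size)]

# 1200 one-column rows of sample data
rows = [[str(n)] for n in range(1200)]
batches = to_batches(rows, 500)
print(len(batches))   # 3 write calls instead of 1200
# for batch in batches:
#     sheet.append_rows(batch)
```

Whether you use update_cells or batched appends, the win is the same: the per-request overhead is paid a handful of times rather than once per row.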
Based on this thread, the Google Spreadsheets API can be pretty slow depending on many factors, including your connection speed to the Google servers, use of a proxy, etc. Avoid calling gspread.login inside a loop, because this method is slow.
...get_all_records came to my rescue; it is much faster than range for the entire sheet.
I have also read in this forum that performance depends on the size of the worksheet, so as the rows increase, the program runs even slower.

AWS Machine Learning issue

I am using AWS Machine Learning to predict whether a tweet message is positive or negative.
I have a CSV file with about 1000 tweets (two columns: "message" (TEXT) and "is_positive" (BINARY)).
If the message contains one of the words I've defined on my side, "is_positive" is set to 0 (otherwise 1).
My issue is that evaluations always return 1 (even if I try a message with a "bad" word).
How can I get more relevant results?
Thanks for your help!
Navigate to your datasource and select your ML model. Clicking on the attributes will give you an idea of how "statistically relevant" the columns in your training data are. Your result is most probably due to your training data: since the entire tweet message is in one column, the model is most likely looking for a correlation across all the words in the sample tweets. A better approach may be to use a "sentiment" library, of which there are publicly available versions, which would shift your model to look at each word in the tweet rather than at the tweet as a whole, as yours currently does.
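To illustrate the word-level approach, here is a minimal sketch of a keyword-based labeler (the word list and function name are illustrative, not from any real sentiment library). Since the questioner's labels were generated from a fixed word list in the first place, a rule like this reproduces them exactly, whereas a model trained on only ~1000 whole-message examples may not:

```python
# Illustrative "bad word" list; in practice use a published sentiment lexicon.
NEGATIVE_WORDS = {"terrible", "bad", "awful", "horrible"}

def is_positive(message):
    """Return 1 unless the message contains a listed negative word (else 0)."""
    words = (w.strip(".,!?") for w in message.lower().split())
    return 0 if any(w in NEGATIVE_WORDS for w in words) else 1

print(is_positive("Terrible, terrible service."))  # 0
print(is_positive("Lovely staff, great food."))    # 1
```

Comparing the model's predictions against such a rule on a held-out sample is also a quick way to confirm whether the model learned anything from the word column at all.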

how to make layers combination in photoshop

I have multiple layers in a file and I want to create combinations of those layers, like layer 1 and layer 2, layer 1 and layer 3.
See the example image.
I want to automate this process: as you can see in the image, I have placed the first layer (to be used in every output) on the left side, and I need a tool which can turn off the layers on the right side one by one and save the images as layer01-layer02.jpg, layer01-layer03.jpg, and so on.
Kindly help me with this.
EDIT:
A script on GitHub helped me solve the problem:
Photoshop script that separately saves top level layers inside the selected group
Just place the layer which you want to keep in all images outside the group, and put all the other layers in a layer group.
Then select the group and run the script; it will save all combinations with the layer placed outside.
Now, if anyone here knows scripting, I have one question.
When we run the script, it asks for a file name and then adds incremental numbers after it: if we enter "abc" as the file name, it will save the images as abc1, abc2, and so on.
What I want is for it to add the layer name after the file name instead: if the layer names are japan and america, it should save the files as abcjapan and abcamerica.
Can it be done?
You will need to cycle through all the layers in your document and set the visibility as needed before exporting the image. To set the visibility you'll need something like this:
for (var layerIndex = 0; layerIndex < app.activeDocument.artLayers.length; layerIndex++) {
    var layer = app.activeDocument.artLayers[layerIndex];
    // do some logic to determine if the layer should be visible
    layer.visible = true;
}
And see the answer to this question re: saving your jpeg once you've set the layer visibility as needed.