Where can I find pre-trained word embeddings (English) in word2vec format with 50 dimensions? - word2vec

Preferably it should be a txt file rather than a binary file. All of the pre-trained word embeddings I found were of 300+ dimensions.
Thank you

http://nlp.stanford.edu/data/glove.6B.zip
Download this file, which is in GloVe format, and convert it to word2vec format using this script: https://github.com/jroakes/glove-to-word2vec
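That zip also contains a 50-dimensional file (glove.6B.50d.txt), which matches what you're asking for. If you'd rather not depend on the linked script, the conversion itself is tiny: the word2vec text format is the GloVe text format with a "<vocab size> <dimensions>" header line prepended. A rough sketch in Python (the file names are just examples):
# Convert GloVe text format to word2vec text format by prepending a
# "<vocab_size> <dimensions>" header line. File names are examples.
glove_path = 'glove.6B.50d.txt'
word2vec_path = 'glove.6B.50d.word2vec.txt'

with open(glove_path, 'r') as f:
    lines = f.readlines()

vocab_size = len(lines)
dimensions = len(lines[0].split()) - 1  # first token on each line is the word itself

with open(word2vec_path, 'w') as f:
    f.write('%d %d\n' % (vocab_size, dimensions))
    f.writelines(lines)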

It is plausibly the case that any 50 dimensions of a 300-dimensional model are still useful. So you could conceivably take a 300-dimensional set, in text, and patch the file to specify 50-dimensions and discard the last 250 dimensions of each line.
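For example, a rough sketch of that truncation on a word2vec-format text file (assuming the usual layout of a "vocab_size dims" header followed by one "word v1 v2 ... v300" line per word; file names are examples):
# Truncate a 300-dimensional word2vec text file to its first 50 dimensions.
keep = 50

with open('vectors.300d.txt', 'r') as src, open('vectors.50d.txt', 'w') as dst:
    vocab_size, dims = src.readline().split()
    dst.write('%s %d\n' % (vocab_size, keep))   # patched header: same vocab, 50 dims
    for line in src:
        parts = line.rstrip().split(' ')
        dst.write(' '.join(parts[:1 + keep]) + '\n')  # word plus first 50 values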

Related

How to find string matches between a text file and CSV?

I am currently working with Python 2.7 on a standalone system. My preferred method for solving this problem would be to use pandas DataFrames to compare, but I do not have access to install the library on the system I'm working with. So my question is: how else could I read a text file and look for matches of its strings in a CSV?
I have a main CSV file with many fields (for relevance, the first one is a timestamp) and several other text files that each contain a list of timestamps. How can I compare each of the txt files with the main CSV and, if a match is found on that specific field, grab the entire row from the CSV and output that result to another CSV?
Example:
example.csv
timestamp,otherfield,otherfield2
1.2345,,
2.3456,,
3.4567,,
5.7867,,
8.3654,,
12.3434,,
32.4355,,
example1.txt
2.3456
3.4565
example2.txt
12.3434
32.4355
If there are any questions I'm happy to answer them.
You can load all the files into lists, then search the lists, for example:
# Read the main CSV and one of the timestamp lists into memory.
with open('example.csv', 'r') as file_handle:
    example_file_content = file_handle.read().splitlines()
with open('example1.txt', 'r') as file_handle:
    example1_file_content = file_handle.read().splitlines()

# Compare the first field (the timestamp) of every CSV row against the list.
for line in example_file_content:
    if line.split(',')[0] in example1_file_content:
        print('match found; line is ' + line)
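Since you also want to write the matching rows out to another CSV, one way to continue the sketch above (carrying over the header row; the output file name is just an example):
# Collect the matching rows and write them to a new CSV,
# keeping the header row from the main file.
header = example_file_content[0]
matches = [line for line in example_file_content[1:]
           if line.split(',')[0] in example1_file_content]

with open('matches_example1.csv', 'w') as out_handle:
    out_handle.write(header + '\n')
    for line in matches:
        out_handle.write(line + '\n')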

How to modify only timecodes but not other numbers or letters in a text file?

I'm making a program to speed up and slow down parts of videos, and I want to support modifying times on subtitles to match. How can I search for only the timecodes in a text file and modify them?
This is for SRT subtitle files. Timecodes are in the format HH:MM:SS,mmm. The files contain other numbers (e.g. in hex colors), so I only want to match numbers in that specific timecode format.
I already have a function to take an input time in seconds and return an output time in seconds. It should also be fairly easy to convert between 'timecode' format and time in seconds.
This is an example of the text file:
1
00:00:00,000 --> 00:00:09,138
<font color="#CCCCCC">alexxa</font><font color="#E5E5E5"> who's your favorite president</font>
2
00:00:04,759 --> 00:00:12,889
<font color="#E5E5E5">George Washington</font><font color="#CCCCCC"> has my vote Alexa</font>
The only thing left is how to pick out only the timecodes and then replace them with new timecodes. I'm not sure where to go from here. It would also be good to avoid looping through the text file more than necessary, because there will be a lot of timecodes to change.
Given that it's a text format, the most efficient way to match (and replace) the timestamps in your file would be to use regular expressions: https://en.cppreference.com/w/cpp/regex
The algorithm would look like this: read line by line from your source file; for every line where the RE matches, replace the timestamps (i.e. craft a new line) and output it to a new file (or to a buffer, which can later be committed back into the source file after processing is done). Lines where the RE does not match you output intact, as they were read.
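The question doesn't pin down a language, so here is a rough sketch of that loop in Python; the remap_seconds function is a placeholder for your existing seconds-to-seconds mapping, and the input/output file names are just examples:
import re

# Matches SRT timecodes like 00:00:04,759 anywhere in a line;
# hex colors and bare numbers don't match because of the colons and comma.
TIMECODE_RE = re.compile(r'(\d{2}):(\d{2}):(\d{2}),(\d{3})')

def remap_seconds(seconds):
    # Placeholder for your existing speed-up/slow-down mapping.
    return seconds * 2.0

def remap_timecode(match):
    h, m, s, ms = (int(g) for g in match.groups())
    total = h * 3600 + m * 60 + s + ms / 1000.0
    total_ms = int(round(remap_seconds(total) * 1000))
    h, rem = divmod(total_ms, 3600000)
    m, rem = divmod(rem, 60000)
    s, ms = divmod(rem, 1000)
    return '%02d:%02d:%02d,%03d' % (h, m, s, ms)

with open('input.srt', 'r') as src, open('output.srt', 'w') as dst:
    for line in src:
        # Only timecode lines are changed; everything else is copied as is.
        dst.write(TIMECODE_RE.sub(remap_timecode, line))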

Next Word Prediction using corpus text file in c/c++/objective c

I am trying to make a next-word prediction program using a corpus text file.
As on Android/iPhone keyboards (QuickType in iOS 8): when the user types a word, the next-word suggestion is shown, and so on...
I have found that we have to make use of an n-gram algorithm:
https://www.cs.cornell.edu/courses/CS4740/2012sp/lectures/smoothing+backoff-1-4pp.pdf
but there is no code or example available to start with.
Can anyone help me with how to start implementing this algorithm in C/C++/Objective-C?
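To get started, here is a very small bigram-count sketch in Python; the counting logic translates directly to C++ (e.g. std::unordered_map) or Objective-C (NSMutableDictionary). The file name corpus.txt is a placeholder, and a real keyboard would add smoothing and backoff as described in the linked slides:
from collections import Counter, defaultdict

# Count, for every word in the corpus, which words follow it and how often.
following = defaultdict(Counter)

with open('corpus.txt', 'r') as f:
    words = f.read().lower().split()

for current_word, next_word in zip(words, words[1:]):
    following[current_word][next_word] += 1

def predict(word, n=3):
    # Return the n most frequent words seen after `word` in the corpus.
    return [w for w, _ in following[word].most_common(n)]

print(predict('the'))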

How to convert a text file into ARFF format?

I'm using the WEKA tool for text classification, and I have to convert plain text files into ARFF format. However, I don't know how to do that. Can anyone please help me convert a text file into ARFF format?
Thank you Renklauf for your response.
I didn't understand this point: "Since text editors like Notepad only allow a limited number of columns, you'll need to get something like Notepad++ to fit everything on one line." Can you please explain it briefly?
Suppose the text data is a simple sports article like:
" Basketball is a team sport, the objective being to shoot a ball through a basket horizontally positioned to score points while following a set of rules. Usually, two teams of five players play on a marked rectangular court with a basket at each width end. Basketball is one of the world's most popular and widely viewed sports" ...
This is my text document and I want to convert it to ARFF format, and after that I need to use that ARFF file for SVM text classification.
For a document classification task, each document becomes one instance: its text is the value of a string attribute and must be enclosed in quotes. Suppose you have a corpus of 10 sports articles tagged as either pro-Yankees or pro-Red Sox, for a classifier that automatically classifies sports articles as one or the other. You need to take each document, enclose it in quotes, place it on a single line, and then place your {yankees, red_sox} attribute value after the quote-enclosed string.
@relation yankeesOrRedSox
@attribute article string
@attribute yankeesOrSox { yankees, red_sox }
@data
"text of article 1 here", yankees
.
.
.
"text of article 10 here", red_sox
It's key that the article is placed on a single line. When I began using Weka for text classification, this is a point that caused me a lot of frustration at first. Since text editors like Notepad only allow a limited number of columns, you'll need to get something like Notepad++ to fit everything on one line. Notepad++ has a Join Lines function that allows you to place a lot of text on a single line.
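If you have many documents, it may be easier to script the conversion than to join lines by hand in an editor. A rough Python sketch, assuming each article is a separate .txt file in a folder named after its class label (the yankees/ and red_sox/ folders and the output file name are just examples):
import glob
import os

# Build an ARFF file from text files grouped into one folder per class label.
labels = ['yankees', 'red_sox']

with open('yankeesOrRedSox.arff', 'w') as out:
    out.write('@relation yankeesOrRedSox\n')
    out.write('@attribute article string\n')
    out.write('@attribute yankeesOrSox { yankees, red_sox }\n')
    out.write('@data\n')
    for label in labels:
        for path in glob.glob(os.path.join(label, '*.txt')):
            with open(path, 'r') as f:
                # Put the whole article on one line and avoid unescaped quotes.
                text = ' '.join(f.read().split()).replace('"', "'")
                out.write('"%s", %s\n' % (text, label))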
Hope this helps.

Regex to parse image

If I have a string in the form:
data:image/x-icon;base64,iVBORw0KGgoAAAANSUhEUgAAABAAAAAQCAYAAAAf8/9hAAAABmJLR0QAAAAAAAD5Q7t/AAAA2UlEQVQ4y8WSvQvCMBDFX2rFUvuFSAUFBQfBwUXQVfFfFpzdRV2c7O5UKmihX9E6RZo2pXbyTbmX3C+5uwD/FskG+76WsvX65n/3Lm0pdU214HOAbHIWwvzeYPL1p4cT4QCi5DIxEINIdWt+Hs9cXAtg3UOkIJAUpT5ADiho8kbD0NG0LB6Q76xIevwCpW+0bBvj7Y5wgCpI148RBxTmYo7Z1RGPkSk/kc4jgme0oHoJlmFUOC+8lUEMN0ASvyBpGha++IXCJrJyKJGhjIalyZVyNqufP9j/9AH0S0vqrU+YMgAAAABJRU5ErkJggg==
What is the best regex I can use to parse these elements into an array (so I can save the correct image)?
Update: I understand base64 encoding, but the question is actually how to parse these kinds of embedded icons in web pages, since I don't know whether people are using e.g. base62 ... or other image strings or even other formats to embed images, etc. I also see examples in pages where the identifier is image/x-icon but the string actually contains a PNG.
UPDATE: just to give something back, here is the code where I used this: http://plugins.svn.wordpress.org/wp-favicons/trunk/filters/search/filter_extract_from_page.php
I still have some questions, e.g. whether only base64 is used, etc., but time will tell in practice.
Can you see the base64 at the beginning? You don't need regex. You need to decode this base64 string into a byte stream and then save it as an image.
I have now saved the following text into a file icon.txt:
iVBORw0KGgoAAAANSUhEUgAAABAAAAAQCAYAAAAf8/9hAAAABmJLR0QAAAAAAAD5Q7t
/AAAA2UlEQVQ4y8WSvQvCMBDFX2rFUvuFSAUFBQfBwUXQVfFfFpzdRV2c7O5UKmihX9E6RZo2pXbyTbmX3C+5uwD
/FskG+76WsvX65n
/3Lm0pdU214HOAbHIWwvzeYPL1p4cT4QCi5DIxEINIdWt+Hs9cXAtg3UOkIJAUpT5ADiho8kbD0NG0LB6Q76xIevwCpW+0bBvj7Y5wgCpI148RBxTmYo7Z1RGPkSk
/kc4jgme0oHoJlmFUOC+8lUEMN0ASvyBpGha++IXCJrJyKJGhjIalyZVyNqufP9j
/9AH0S0vqrU+YMgAAAABJRU5ErkJggg==
And processed it with:
base64 -d icon.txt > icon.png
and it shows a red heart icon, 16x16 pixels.
This is the way you can decode it in the command line. Most programming languages offer good libraries to decode it directly in your program.
EDIT: If you use PHP, then have a look at base64_decode().
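For completeness, here is a short Python sketch of the same idea: split the data URI on the first comma, base64-decode the payload, and sniff the actual image type from the magic bytes (which covers the image/x-icon vs. PNG mismatch mentioned in the update). The output file name is just an example:
import base64

# The data URI from the question: declared as image/x-icon, but the bytes are a PNG.
data_uri = 'data:image/x-icon;base64,iVBORw0KGgoAAAANSUhEUgAAABAAAAAQCAYAAAAf8/9hAAAABmJLR0QAAAAAAAD5Q7t/AAAA2UlEQVQ4y8WSvQvCMBDFX2rFUvuFSAUFBQfBwUXQVfFfFpzdRV2c7O5UKmihX9E6RZo2pXbyTbmX3C+5uwD/FskG+76WsvX65n/3Lm0pdU214HOAbHIWwvzeYPL1p4cT4QCi5DIxEINIdWt+Hs9cXAtg3UOkIJAUpT5ADiho8kbD0NG0LB6Q76xIevwCpW+0bBvj7Y5wgCpI148RBxTmYo7Z1RGPkSk/kc4jgme0oHoJlmFUOC+8lUEMN0ASvyBpGha++IXCJrJyKJGhjIalyZVyNqufP9j/9AH0S0vqrU+YMgAAAABJRU5ErkJggg=='

header, payload = data_uri.split(',', 1)            # metadata before the comma, data after
mime_type = header[len('data:'):].split(';', 1)[0]  # 'image/x-icon'

# Only the base64 case is handled here; non-base64 data URIs contain percent-encoded text.
raw = base64.b64decode(payload)

# The declared mime type can lie, so check the magic bytes of the decoded data.
extension = 'png' if raw.startswith(b'\x89PNG') else 'ico'
with open('icon.' + extension, 'wb') as f:
    f.write(raw)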