How to work with Doc2Vec, and which approach is better: training the model on my dataset or using a pretrained model? - word2vec

I am building a classification model for a dataset of items. Basically, I have 2 columns, for example:

Item name            category
unsalted butter      dairy and eggs
cheese               dry grocery
peanut butter cream  dry grocery
I did the required preprocessing to clean the item name (my input) and one-hot encoded the category (my target output). I want to use the KNN algorithm to classify the item names, so I have to convert them to numbers.
I am struggling with the conversion model: I am not able to build the right model and check the word2vec accuracy results.
Would you please help me with this, since I am a beginner with word-embedding techniques?
I tried the following:
import gensim

def tagged_document(text):
    for i, sent in enumerate(text):
        for j, word in enumerate(sent.split()):
            yield gensim.models.doc2vec.TaggedDocument(word, [j])

data_for_training = list(tagged_document(df['item_name']))
print(data_for_training[3])

Output: [TaggedDocument(words='peanut', tags=[0]), TaggedDocument(words='butter', tags=[1]), TaggedDocument(words='cream', tags=[2])]

model = gensim.models.doc2vec.Doc2Vec(size=150, window=4, min_count=2, workers=10, epochs=30)
model.build_vocab(data_for_training)
model.train(data_for_training, total_examples=model.corpus_count, epochs=model.epochs)
model.save('model.bin')
print(model)
print(list(model.wv.vocab))
Output:
Doc2Vec(dm/m,d150,n5,w4,mc2,s0.001,t10)
['u', 'n', 's', 'a', 'l', 't', 'e', 'd', 'b', 'r', 'c', 'm', 'o', 'k', 'x', 'g', 'p', 'i', 'f', 'h', 'y', 'w', 'v', 'z', 'j', 'q', '7', '2', 'ü', '\x95', 'ñ', '1', '±', 'ç', '5', '4', '0', 'ã', 'ä', 'ù', 'ø', '8', '6', '²', '\x8a', 'ª', '\x82', '\x84', 'ð', '\x9f', '¥', '\x96', '§', '3', '\x91', '¯', '¬', '\xad', '¨', 'â', '\x80', '\x99', 'ï', '¿', '½', '\x93', '9', '©', '¢', '\x97', '\x94', '·', '\x88', '\x8d', '\x83', '\x98', '\x90', '®', 'å', 'é', '\x9d', 'æ', '¡', '¹', '´', '\x8c', '°', '¼', '\x87']

First and foremost, the words part of a TaggedDocument should be a list of words. If you provide a single word as a plain string, Python will treat it as a sequence of single-character 'words'.
So when you supply...
TaggedDocument(tags=[0], words='peanut')
...that's equivalent to...
TaggedDocument(tags=[0], words=['p', 'e', 'a', 'n', 'u', 't'])
That's why your final model has only single-character 'words' in it.
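You can see the same character-splitting effect directly in Python, since iterating over a string yields its characters:
>>> list('peanut')
['p', 'e', 'a', 'n', 'u', 't']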
If in fact you later want to look up Doc2Vec document-vectors by the 'Item name' values as look-up keys, you'll want to be sure your code instead creates TaggedDocuments more like:
TaggedDocument(tags=['unsalted butter'], words=['dairy', 'and', 'eggs'])
On the other hand, if you want to look up vectors by 'category' values as look-up keys, then you'll need the categories to be the tags:
TaggedDocument(tags=['dairy and eggs'], words=['unsalted', 'butter'])
Which really depends on what you're trying to achieve: which data is supposed to help you classify into which bins?
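For example, here is a minimal sketch of a corrected generator, assuming a pandas DataFrame df whose columns are named item_name and category (matching the code and table above), with the categories as tags:

import gensim

def tagged_documents(df):
    for _, row in df.iterrows():
        yield gensim.models.doc2vec.TaggedDocument(
            words=row['item_name'].split(),  # a list of word tokens, not a plain string
            tags=[row['category']])          # the category as the look-up key

data_for_training = list(tagged_documents(df))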
And it's not clear Doc2Vec will be helpful here at all, given the data you've shown and the task you've described (classification).
Doc2Vec helps turn texts of many words into shorter summary vectors. It's usually demonstrated on texts that are at least as long as sentences, but possibly paragraphs, articles, or even full books. With single words, or short phrases of just a few words, it will have a much harder time learning/providing meaningful vectors.
Do you already have a classifier of any type, even a poorly-performing one, working on this same data using simpler techniques, such as the "bag-of-words" representations available through Scikit-Learn classes like CountVectorizer?
If not, I suggest doing that first, to achieve actual classification on a simpler and more typical base.
Only with that baseline in place, then you could consider using features derived from Word2Vec or Doc2Vec, to see if they help. Unless you have longer multi-word product descriptions, they might not.
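For instance, a minimal baseline sketch, assuming Scikit-Learn is installed and df holds the two columns above (the split size and n_neighbors are arbitrary choices):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

X_train, X_test, y_train, y_test = train_test_split(
    df['item_name'], df['category'], test_size=0.2, random_state=42)

# Bag-of-words features feeding a KNN classifier
clf = make_pipeline(CountVectorizer(), KNeighborsClassifier(n_neighbors=5))
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))  # the baseline accuracy to beat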

Related

I want to get some letters using Regular Expressions

As I said in the title, I want to get some letters using 'Regular Expressions', but I don't know how. I tried:
re.findall("\d*\.?\d+[^Successful 50/50s]", a)
'Defence\nClean sheets\n53\nGoals conceded\n118\nTackles\n186\nTackle success %\n75%\nLast man tackles\n2\nBlocked shots\n24\nInterceptions\n151\nClearances\n805\nHeaded Clearance\n380\nClearances off line\n3\nRecoveries\n666\nDuels won\n435\nDuels lost\n330\nSuccessful 50/50s\n25\nAerial battles won\n206\nAerial battles lost\n193\nOwn goals\n1\nErrors leading to goal\n1Team Play\nAssists\n2\nPasses\n7,979\nPasses per match\n56.19\nBig chances created\n3\nCrosses\n48\nCross accuracy %\n25%\nThrough balls\n10\nAccurate long balls\n936Discipline\nYellow cards\n13\nRed cards\n0\nFouls\n48\nOffsides\n2Attack\nGoals\n6\nHeaded goals\n4\nGoals with right foot\n1\nGoals with left foot\n1\nHit woodwork\n3'
I want to get just the numbers, including floats and %, but excluding the 'Successful 50/50s'. I also want to keep the thousands separator, as in 7,979.
You can use this regex, which will match all numbers except those immediately preceded or followed by a slash, like 50/50:
(?<!/)\d*(?:,\d+)*\.?\d+\b(?!/)
Regex Demo
Your updated Python code:
import re

s = '''Defence\nClean sheets\n53\nGoals conceded\n118\nTackles\n186\nTackle success %\n75%\nLast man tackles\n2\nBlocked shots\n24\nInterceptions\n151\nClearances\n805\nHeaded Clearance\n380\nClearances off line\n3\nRecoveries\n666\nDuels won\n435\nDuels lost\n330\nSuccessful 50/50s\n25\nAerial battles won\n206\nAerial battles lost\n193\nOwn goals\n1\nErrors leading to goal\n1Team Play\nAssists\n2\nPasses\n7,979\nPasses per match\n56.19\nBig chances created\n3\nCrosses\n48\nCross accuracy %\n25%\nThrough balls\n10\nAccurate long balls\n936Discipline\nYellow cards\n13\nRed cards\n0\nFouls\n48\nOffsides\n2Attack\nGoals\n6\nHeaded goals\n4\nGoals with right foot\n1\nGoals with left foot\n1\nHit woodwork\n3'''

print(re.findall(r'(?<!/)\d*(?:,\d+)*\.?\d+\b(?!/)', s))
This prints all numbers except the 50/50 ones:
['53', '118', '186', '75', '2', '24', '151', '805', '380', '3', '666', '435', '330', '25', '206', '193', '1', '1', '2', '7,979', '56.19', '3', '48', '25', '10', '936', '13', '0', '48', '2', '6', '4', '1', '1', '3']

Arranging nested tuples

I know that this is probably a silly question and I apologize for that, but I am very new to Python and have tried to solve this for a long time now, with no success.
I have a list of tuples similar to the one below:
data = [('ralph picked', ['nose', '4', 'apple', '30', 'winner', '3']),
('aaron popped', ['soda', '1', 'popcorn', '6', 'pill', '4', 'question', '29'])]
I would like to sort each nested list in descending order of the numbers:
data = [('ralph picked', ['apple', '30', 'nose', '4', 'winner', '3']),
('aaron popped', ['question', '29', 'popcorn', '6', 'pill', '4', 'soda', '1'])]
I tried a simple
sorted(data)
but that only sorts the outer list by each tuple's first item. What am I missing here? Thanks a lot for any help.
Let's consider only the inner list. The first issue is that it seems like you want to keep word, number pairs together. We can use zip to combine them, remembering that seq[::2] gives us every second element starting at the 0th, and seq[1::2] gives us every second starting at the first:
>>> s = ['nose', '4', 'apple', '30', 'winner', '3']
>>> zip(s[::2], s[1::2])
<zip object at 0xb5e996ac>
>>> list(zip(s[::2], s[1::2]))
[('nose', '4'), ('apple', '30'), ('winner', '3')]
Now, as you've discovered, if you call sorted on a sequence, it sorts first by the first element, then by the second to break ties, etc., going as deep as it needs to. So if we call sorted on this:
>>> sorted(zip(s[::2], s[1::2]))
[('apple', '30'), ('nose', '4'), ('winner', '3')]
Well, that looks like it works, but only by fluke because apple-nose-winner is in alphabetical order. Really we want to sort by the second term. sorted takes a key parameter:
>>> sorted(zip(s[::2], s[1::2]), key=lambda x: x[1])
[('winner', '3'), ('apple', '30'), ('nose', '4')]
That didn't work either, because it's sorting the number strings lexicographically (dictionary-style, so '30' comes before '4'). We can tell it we want to use the numerical value, though:
>>> sorted(zip(s[::2], s[1::2]), key=lambda x: int(x[1]))
[('winner', '3'), ('nose', '4'), ('apple', '30')]
Almost there -- we want this reversed:
>>> sorted(zip(s[::2], s[1::2]), key=lambda x: int(x[1]), reverse=True)
[('apple', '30'), ('nose', '4'), ('winner', '3')]
And this is almost right, but we need to flatten it. We can use either a nested list comprehension:
>>> s2 = sorted(zip(s[::2], s[1::2]), key=lambda x: int(x[1]), reverse=True)
>>> [value for pair in s2 for value in pair]
['apple', '30', 'nose', '4', 'winner', '3']
or use itertools.chain:
>>> from itertools import chain
>>> list(chain.from_iterable(s2))
['apple', '30', 'nose', '4', 'winner', '3']
And I think that's where we wanted to go.
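To apply this to your full data structure, here's a short sketch wrapping the steps above in a helper:

def sort_pairs(lst):
    # Pair words with their numbers, sort numerically descending, then flatten.
    pairs = sorted(zip(lst[::2], lst[1::2]), key=lambda x: int(x[1]), reverse=True)
    return [value for pair in pairs for value in pair]

data = [(name, sort_pairs(lst)) for name, lst in data]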

Regex to eliminate SQL insert values with multiple lines

I'm migrating some databases to a new structure, and I need to make some changes along the way. To do it, I start from a backup containing the INSERT commands and, using Sublime Text 2 and RegReplace, I created some scripts to adapt the inserts.
The problem I have is when I need to delete one of the columns' values: some of the data is text that can span multiple lines, and I have multiple inserts.
I'm using this regex:
(.*table.*VALUES \(.*,)(.*,)(([\s\S]*,){12})(.*;)
Replace by: \1\3\5
And this is the data:
INSERT INTO table (cola, colb, colc, cold, cole, colf, colg, colg, colh, coli, colj, colk, coll, colm, coln, colo, colp, colq, colr, cols, colt, culu) VALUES (1, '2', 3, NULL, '5', 6, '7', '8', 9, NULL, NULL, 12, '13', '14', '15', '16', '17', '18', '19', 20, '21', 22');
INSERT INTO table (cola, colb, colc, cold, cole, colf, colg, colg, colh, coli, colj, colk, coll, colm, coln, colo, colp, colq, colr, cols, colt, culu) VALUES (1, '2', 3, NULL, '5', 6, '7', '8', 9, NULL, NULL, 12, '13', '14', '
15
', '16', '17', '18', '19', 20, '21', '22');
If I use the regex on just one line it eliminates column number 9, but when I run it in Sublime Text 2 on two or more lines together it does not work, because it doesn't separate the two INSERT INTO statements.
Here is an example of it not working.
Thanks for your help :)
Have you tried ungreedy (lazy) quantifiers?
(.*?table.*?VALUES \((?:[\s\S]*?,){14})([\s\S]*?,)(([\s\S]*?,)*?)(.*?;)
(if you want to test it with regex101, don't forget to add the g modifier)
Note:
with RegReplace, you can set greedy to false and remove all the question marks.
RegReplace supports the dotall modifier, so you can replace [\s\S] with . and add (?s) at the start.
Example applying both notes:
(?s)(.*table.*VALUES \((?:.*,){14})(.*,)((.*,)*)(.*;)
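If you want to test the lazy-quantifier version outside Sublime Text, here is a rough Python sketch (the backup.sql filename is hypothetical; re.sub keeps groups 1, 3 and 5 and drops group 2, mirroring the \1\3\5 replacement above):

import re

sql = open('backup.sql').read()  # hypothetical file holding the INSERT statements

pattern = r"(.*?table.*?VALUES \((?:[\s\S]*?,){14})([\s\S]*?,)(([\s\S]*?,)*?)(.*?;)"
print(re.sub(pattern, r'\1\3\5', sql))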

Django paginator to include all elements of all previous pages

I read the docs on Pagination with Django and can't find a solution to my problem there. I want to paginate a queryset (5 elements per page) so that my object_list contains all elements of all previous pages, up to and including the ones of the requested page.
This is what normally happens when I ask for the objects of page 2:
>>> p = Paginator(queryset, 5) # 5 elements per page
>>> page2 = p.page(2)
>>> page2.object_list
['6', '7', '8', '9', '10']
What I want to get is this:
['1', '2', '3', '4', '5', '6', '7', '8', '9', '10']
Any ideas?
That's normal, because this is what the Paginator object does:
>>> page1 = p.page(1)
>>> page1.object_list
['1', '2', '3', '4', '5']   # 5 items per page, from item 1 to item 5: the first page
>>> page2 = p.page(2)
>>> page2.object_list
['6', '7', '8', '9', '10']  # 5 items per page, from item 6 to item 10: the second page
The definition of the Paginator object, from the Django docs:
Give Paginator a list of objects, plus the number of items you'd like to have on each page, and it gives you methods for accessing the items for each page.
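If you really want every page's list to include all earlier items too, one possible sketch (assuming the queryset and Paginator p from the question) is to slice the original queryset with the page's end_index():

>>> page2 = p.page(2)
>>> queryset[:page2.end_index()]  # items 1..10 for page 2
['1', '2', '3', '4', '5', '6', '7', '8', '9', '10']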

Get data from string

I have a string like
A & A COMPUTERS INC [RC1058054]
and I want a regex to extract all the data inside [ ]. Any ideas?
To capture the data between [ and ] you can use the regex:
\[([^]]*)\]
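For instance, a quick check in Python (just one possible language, since the question doesn't name one):

>>> import re
>>> re.findall(r"\[([^]]*)\]", "A & A COMPUTERS INC [RC1058054]")
['RC1058054']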
Since the current version of the question leaves out the programming language, I'll just pick one.
>>> import re
>>> s = "A & A COMPUTERS INC [RC1058054]"
>>> re.search("\[(.*)\]", s).group(1)
'RC1058054'
>>> # If you want to "split all data" ...
>>> [x for x in re.search(r"\[(.*)\]", s).group(1)]
['R', 'C', '1', '0', '5', '8', '0', '5', '4']
This regex, (?<=\[)[^]]*(?=\]), captures all data between [ and ] on the .NET and Java platforms.
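The same lookaround pattern also works in Python's re module, since the lookbehind is fixed-width:

>>> import re
>>> re.search(r"(?<=\[)[^]]*(?=\])", "A & A COMPUTERS INC [RC1058054]").group()
'RC1058054'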