Quick implementation of character n-grams for word - python-2.7

I wrote the following code to compute character bigrams; its output is shown right below it. How do I get an output that excludes the trailing single character (i.e. 't')? And is there a quicker, more efficient way to compute character n-grams?
>>> b = 'student'
>>> y = []
>>> for x in range(len(b)):
...     n = b[x:x+2]
...     y.append(n)
...
>>> y
['st', 'tu', 'ud', 'de', 'en', 'nt', 't']
Here is the result I would like to get: ['st', 'tu', 'ud', 'de', 'en', 'nt']
Thanks in advance for your suggestions.

To generate bigrams:
In [8]: b='student'
In [9]: [b[i:i+2] for i in range(len(b)-1)]
Out[9]: ['st', 'tu', 'ud', 'de', 'en', 'nt']
To generalize to a different n:
In [10]: n=4
In [11]: [b[i:i+n] for i in range(len(b)-n+1)]
Out[11]: ['stud', 'tude', 'uden', 'dent']
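The two comprehensions above can be wrapped in one small helper. This is a minimal sketch, not part of the original answer; the name char_ngrams is made up for illustration.

```python
def char_ngrams(word, n=2):
    """Return all contiguous character n-grams of `word`, left to right."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # range() is empty when the word is shorter than n, so the result is [].
    return [word[i:i + n] for i in range(len(word) - n + 1)]


bigrams = char_ngrams('student', 2)
fourgrams = char_ngrams('student', 4)
too_short = char_ngrams('ab', 3)
print(bigrams)    # ['st', 'tu', 'ud', 'de', 'en', 'nt']
print(fourgrams)  # ['stud', 'tude', 'uden', 'dent']
print(too_short)  # []
```

Stopping the range at len(word) - n + 1 is what removes the trailing short fragment ('t') from the question's output.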

Try zip:
>>> def word2ngrams(text, n=3, exact=True):
...     """ Convert text into character ngrams. """
...     return ["".join(j) for j in zip(*[text[i:] for i in range(n)])]
...
>>> word2ngrams('foobarbarblacksheep')
['foo', 'oob', 'oba', 'bar', 'arb', 'rba', 'bar', 'arb', 'rbl', 'bla', 'lac', 'ack', 'cks', 'ksh', 'she', 'hee', 'eep']
But do note that it's slower than plain slicing:
import string, random, time

def zip_ngrams(text, n=3, exact=True):
    return ["".join(j) for j in zip(*[text[i:] for i in range(n)])]

def nozip_ngrams(text, n=3):
    return [text[i:i+n] for i in range(len(text)-n+1)]

# Generate 10000 random strings of length 100.
words = [''.join(random.choice(string.ascii_uppercase) for j in range(100)) for i in range(10000)]

# Time the zip-based version.
start = time.time()
x = [zip_ngrams(w) for w in words]
print(time.time() - start)

# Time the slicing version.
start = time.time()
y = [nozip_ngrams(w) for w in words]
print(time.time() - start)

# Both versions must produce the same n-grams.
print(x == y)
[out]:
0.314492940903
0.197558879852
True
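A single time.time() measurement can be noisy, so the comparison can also be repeated with the standard-library timeit module. The following is a sketch under assumptions: Python 3, a smaller sample of 1000 words, and the illustrative name slice_ngrams for the slicing version. Timings depend on the machine, so none are shown.

```python
import random
import string
import timeit


def zip_ngrams(text, n=3):
    # Zip n shifted copies of the text and join each tuple of characters.
    return ["".join(chars) for chars in zip(*[text[i:] for i in range(n)])]


def slice_ngrams(text, n=3):
    # Take a slice of length n at every valid start position.
    return [text[i:i + n] for i in range(len(text) - n + 1)]


random.seed(0)  # fixed seed so every run uses the same words
words = ["".join(random.choice(string.ascii_uppercase) for _ in range(100))
         for _ in range(1000)]

# Best of three repeats, five passes over the word list per repeat.
zip_time = min(timeit.repeat(lambda: [zip_ngrams(w) for w in words],
                             number=5, repeat=3))
slice_time = min(timeit.repeat(lambda: [slice_ngrams(w) for w in words],
                               number=5, repeat=3))
same = all(zip_ngrams(w) == slice_ngrams(w) for w in words)

print("zip:   %.4f s" % zip_time)
print("slice: %.4f s" % slice_time)
print("same output:", same)
```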

Although this answer is late: NLTK has a built-in function that generates n-grams.
# python 3
from nltk import ngrams
["".join(k1) for k1 in list(ngrams("hello world",n=3))]
['hel', 'ell', 'llo', 'lo ', 'o w', ' wo', 'wor', 'orl', 'rld']

This function gives you the n-grams of every order from 1 to n:
def getNgrams(sentences, n):
    ngrams = []
    for sentence in sentences:
        _ngrams = []
        for _n in range(1, n+1):
            # start at position 0 and include the last window of length _n
            for pos in range(len(sentence)-_n+1):
                _ngrams.append([sentence[pos:pos+_n]])
        ngrams.append(_ngrams)
    return ngrams
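For a single word, the same idea (every order from 1 up to a maximum) can be sketched as a flat list of strings. The name ngrams_up_to is hypothetical, and the flat return shape is a simplification rather than the structure returned by getNgrams above.

```python
def ngrams_up_to(word, max_n):
    """Return character n-grams of `word` for every n from 1 to max_n."""
    grams = []
    for n in range(1, max_n + 1):
        # One slice of length n at every valid start position.
        grams.extend(word[i:i + n] for i in range(len(word) - n + 1))
    return grams


result = ngrams_up_to('abc', 2)
print(result)  # ['a', 'b', 'c', 'ab', 'bc']
```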

Related

Using regex on spaCy PhraseMatcher [duplicate]

I am using regular expressions with spaCy to match numbers in a text and extract them into a pandas DataFrame.
Question: pandas picks up the numbers, but each value overwrites the previous one instead of being appended. How can I solve this?
(original code credit: yarongon)
from __future__ import unicode_literals
import spacy
import re
import pandas as pd
from datetime import date
nlp = spacy.load('en_core_web_sm', disable=['parser', 'tagger', 'ner'])
doc = nlp("This is a sample number: 11. This is second sample number: 1145.")
NUM_PATTERN = re.compile(r"\d+")
for match in re.finditer(NUM_PATTERN, doc.text):
    start, end = match.span()
    Number = doc.char_span(start, end)
    print(Number)
pandas_attributes = [Number,]
df = pd.DataFrame(pandas_attributes,
                  columns=['Number'])
print(df)
Output:
11
1145
  Number
0   1145
Expected output:
  Number
0     11
1   1145
Edit 1:
I am trying to match multiple patterns on a single text.
from __future__ import unicode_literals
import spacy
import re
import pandas as pd
from datetime import date
nlp = spacy.load('en_core_web_sm', disable=['parser', 'tagger', 'ner'])
doc = nlp("This is a sample-number: 11. This is second sample number: 1145.")
NUM_PATTERN = re.compile(r"\d+")
HYPH_PATTERN = re.compile(r'\w+(?:-)\w+')
for match in re.finditer(NUM_PATTERN, doc.text):
    start, end = match.span()
    Number = doc.char_span(start, end)
    print(Number)
for match in re.finditer(HYPH_PATTERN, doc.text):
    start, end = match.span()
    Hyph_word = doc.char_span(start, end)
    print(Hyph_word)
pandas_attributes = [Number, Hyph_word]
df = pd.DataFrame(pandas_attributes,
                  columns=['Number', 'Hyphenword'])
print(df)
Current output:
11
1145
sample-number
AssertionError: 2 columns passed, passed data had 3 columns
Expected output:
Number  Hyphen_word
11      sample-number
1145
Edit 2: output
   Number               Hyphenword
0  (11)                 (1145)
1  (sample, -, number)  None
Expected output:
   Number  Hyphenword
0  11      sample-number
1  1145    None
You need to append the values to a list inside the loop:
L = []
for match in re.finditer(NUM_PATTERN, doc.text):
    start, end = match.span()
    L.append(doc.char_span(start, end))
and then pass the list to the DataFrame constructor:
df = pd.DataFrame(L, columns=['Number'])
You can also append tuples with multiple values.
Sample:
L = []
for x in range(3):
    Number = x + 1
    Val = x + 4
    L.append((Number, Val))
print(L)
[(1, 4), (2, 5), (3, 6)]
df = pd.DataFrame(L, columns=['Number', 'Val'])
print(df)
   Number  Val
0       1    4
1       2    5
2       3    6
For several patterns, I believe you can append twice: once per match to an inner list, and once per pattern to the outer list:
PATTERNS = [NUM_PATTERN, HYPH_PATTERN]
pandas_attributes = []
for pat in PATTERNS:
    L = []
    for match in re.finditer(pat, doc.text):
        start, end = match.span()
        L.append(doc.char_span(start, end))
    pandas_attributes.append(L)
df = pd.DataFrame(pandas_attributes,
                  index=['Number', 'Hyphenword']).T
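The padding of uneven columns can also be sketched with the standard library alone. This is an illustration under assumptions, not the answer's method: it assumes Python 3 and that plain matched strings are acceptable in place of spaCy spans. The resulting rows could then be passed to pd.DataFrame(rows, columns=['Number', 'Hyphenword']).

```python
import re
from itertools import zip_longest

text = "This is a sample-number: 11. This is second sample number: 1145."
NUM_PATTERN = re.compile(r"\d+")
HYPH_PATTERN = re.compile(r"\w+-\w+")

# One list of matched strings per pattern.
numbers = NUM_PATTERN.findall(text)
hyphenated = HYPH_PATTERN.findall(text)

# Pair the columns row by row; the shorter column is padded with None.
rows = list(zip_longest(numbers, hyphenated))
print(rows)  # [('11', 'sample-number'), ('1145', None)]
```

Storing strings rather than span objects also avoids the tuple-like cells, such as (sample, -, number), shown in the question's second edit.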

Extracting elements from list using Python

How can I extract '1', '11' and '111' from this list?
T0 = ['4\t1\t\n', '0.25\t11\t\n', '0.2\t111\t\n']
To extract '4', '0.25' and '0.2' I used this:
def extract(T0):
    T1 = []
    for i in range(0, len(T0)):
        pos = T0[i].index('\t')
        T1.append(T0[i][0:pos])
    return T1
Then I got:
T1 = ['4', '0.25', '0.2']
But I don't know how to extract the rest. Can you help me, please?
Using your code as a base, it can be done as below. Each value is returned as a string if it is alphabetic, otherwise as a decimal integer.
def extract(T0):
    T1 = []
    for i in range(len(T0)):
        # the second tab-separated field of each line
        tmp = T0[i].split('\t')[1]
        if tmp.isalpha():
            T1.append(tmp)
        else:
            T1.append(int(tmp))
    return T1
Alternatively, try the more compact version below, which uses list comprehensions:
def extract(T0):
    # return each value as a string if it is alphabetic, else as a decimal integer
    # replace int with float to return floats instead
    tmp = [i.split('\t')[1] for i in T0]
    return [i if i.isalpha() else int(i) for i in tmp]
Example:
T0 = ['X\tY\tf(x.y)\n', '0\t0\t\n', '0.1\t10\t\n', '0.2\t20\t\n', '0.3\t30\t\n']
extract(T0)  # returns ['Y', 0, 10, 20, 30]
You can accomplish this with the re module and a list comprehension.
import re
# create a regular expression object matching integers and decimals
regex = re.compile(r'[0-9]{1,}\.{0,1}[0-9]{0,}')
# assign the input list
T0 = ['4\t1\t\n', '0.25\t11\t\n', '0.2\t111\t\n']
# get a list of extractions using the regex, one sublist per input line
extractions = [re.findall(regex, e) for e in T0]
print(extractions)
# => [['4', '1'], ['0.25', '11'], ['0.2', '111']]
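Since the question asks for the values as strings, splitting on the tab character and picking a column index is enough. This is a minimal sketch; the helper name extract_column is made up for illustration.

```python
T0 = ['4\t1\t\n', '0.25\t11\t\n', '0.2\t111\t\n']


def extract_column(lines, index):
    """Return the tab-separated field at `index` from every line, as strings."""
    return [line.split('\t')[index] for line in lines]


first = extract_column(T0, 0)
second = extract_column(T0, 1)
print(first)   # ['4', '0.25', '0.2']
print(second)  # ['1', '11', '111']
```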

Appending individual lists created from a list comprehension using values from input()

I created a list comprehension that gives me the following:
listoflists = [[] for i in range(252*5)]
I then reduced it, in the variable newlists, to only the first weeks lists, where weeks is a dynamic variable.
I want to fill these lists in a loop, moving on to the next list once the current one reaches a specified length. The values come from an input function. For instance, once the first list in newlists holds 5 values, the values entered after the 5th should be appended to the next list, and so on. The code I currently have is:
p = 0
singlist = []
listoflists = [[] for i in range(252*5)]
newlists = [listoflists[i] for i in range(weeks)]
while p < (int(people)*weeks):  # fix appending process
    for i in range(int(people)*weeks):
        weekly = input("Put your hours: ")
        singlist.append(int(weekly))
        p += 1
        if weekly.isalpha() == True:
            print("Not a valid amount of time")
for i in range(0, weeks):
    while len(newlists[i]) < int(people):
        newlists[i].append(singlist[i])
However, this code appends the same value repeatedly within each list for every list in range(weeks). What is the most efficient way to fix this? Thank you!
If singlist = [10, 15, 20, 25],
the desired output for newlists is: [[10, 15], [20, 25]]
How I've structured the program:
import sys
import numpy as np
import pandas as pd
from datetime import tzinfo,timedelta,datetime
import matplotlib.pyplot as plt
import itertools as it
from itertools import count,islice
team = []
y = 0
while y == 0:
    try:
        people = input("How many people are on your engagement? ")
        if people.isdigit() == True:
            y += 1
    except:
        print("Not a number try again")
z = 0
while z < int(people):
    for i in range(int(people)):
        names = input("Name: ")
        if names.isalpha() == False:
            team.append(names)
            z += 1
        elif names.isdigit() == True:
            print("Not a name try again")
ties = []  # fix looping for more than one person
e = 0
while e < int(people):
    for i in range(int(people)):
        title = input("What is their title: ")
        if title.isdigit() == True:
            print("Not a title try again")
        else:
            ties.append(title)
            e += 1
values = []  # fix looping for more than one person
t = 0
while t < int(people):
    for i in range(int(people)):
        charge = input("How much are you charging for them: ")
        if charge.isalpha() == True:
            print("Not a valid rate")
        else:
            values.append(int(charge))
            t += 1
weeks = int(input("How many weeks are you including: "))
days = []
x = 0
while x < weeks:  # include a parameter for dates of a 7 day difference to only be permitted
    try:
        for i in range(int(weeks)):
            dates = input("Input the dates (mm/dd/yy): ")
            dt_start = datetime.strptime(dates, '%m/%d/%y')
            days.append(dates)
            x += 1
    except:
        print("Incorrect format")
p = 0
singlist = []
listoflists = [[] for i in range(252*5)]
newlists = [listoflists[i] for i in range(weeks)]
while p < (int(people)*weeks):  # fix appending process
    for i in range(int(people)*weeks):
        weekly = input("Put your hours: ")
        singlist.append(int(weekly))
        p += 1
        if weekly.isalpha() == True:
            print("Not a valid amount of time")
def func(items, n):
    items = iter(items)
    for i in it.count():
        out = it.islice(items, weeks*i, weeks*i+n)
        if not out:
            break
output = list(func(singlist, weeks))
# items = [1,2,3,...n]
# output = [[1,2],[3,4],..], n = 2 elements each
items_ = iter(items)
outiter = iter(lambda: [next(items_) for i in range(n)],[])
outlist = list(outiter)
You can do the same thing with a while loop in place of count() and a list slice [a:b] in place of islice(), but iterators are very efficient.
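The slice-based alternative mentioned above can be sketched as follows. It uses a comprehension over a stepped range rather than an explicit while loop; the name chunk and the chunk size of 2 are assumptions chosen to match the question's example.

```python
def chunk(values, size):
    """Split `values` into consecutive lists of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    # Step through the start indices 0, size, 2*size, ... and slice from each.
    return [values[i:i + size] for i in range(0, len(values), size)]


singlist = [10, 15, 20, 25]
newlists = chunk(singlist, 2)
print(newlists)  # [[10, 15], [20, 25]]
```

If the number of values is not a multiple of size, this version keeps the shorter final chunk instead of dropping it.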

How to print out tags in python

If I have a string such as this:
text = "They refuse to permit us."
txt = nltk.word_tokenize(text)
With this if I print POS tags; nltk.pos_tag(txt) I get
[('They','PRP'), ('refuse', 'VBP'), ('to', 'TO'), ('permit', 'VB'), ('us', 'PRP')]
How can I print out only this:
['PRP', 'VBP', 'TO', 'VB', 'PRP']
You got a list of tuples; iterate through it and take only the second element of each tuple.
>>> tagged = nltk.pos_tag(txt)
>>> tags = [e[1] for e in tagged]
>>> tags
['PRP', 'VBP', 'TO', 'VB', 'PRP']
Take a look at "Unpacking a list / tuple of pairs into two lists / tuples":
>>> from nltk import pos_tag, word_tokenize
>>> text = "They refuse to permit us."
>>> tagged_text = pos_tag(word_tokenize(text))
>>> tokens, pos = zip(*tagged_text)
>>> pos
('PRP', 'VBP', 'TO', 'VB', 'PRP', '.')
At some point you may find that the POS tagger is slow, in which case you can load the tagger once and reuse it (see "Slow performance of POS tagging. Can I do some kind of pre-warming?"):
>>> from nltk import pos_tag, word_tokenize
>>> from nltk.tag import PerceptronTagger
>>> tagger = PerceptronTagger()
>>> text = "They refuse to permit us."
>>> tagged_text = tagger.tag(word_tokenize(text))
>>> tokens, pos = zip(*tagged_text)
>>> pos
('PRP', 'VBP', 'TO', 'VB', 'PRP', '.')
You can iterate over the tagged pairs like this:
print([x[1] for x in nltk.pos_tag(txt)])
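The two approaches from the answers (a list comprehension and zip unpacking) can be compared without installing NLTK. In this sketch the tagged list is hard-coded from the output shown in the question, so it is sample data and not the result of a live pos_tag call.

```python
# Sample data copied from the question's output; not produced by NLTK here.
tagged = [('They', 'PRP'), ('refuse', 'VBP'), ('to', 'TO'),
          ('permit', 'VB'), ('us', 'PRP')]

# Option 1: keep only the second element of every (token, tag) pair.
tags = [tag for _, tag in tagged]

# Option 2: transpose the pairs into one tuple of tokens and one of tags.
tokens, pos = zip(*tagged)

print(tags)       # ['PRP', 'VBP', 'TO', 'VB', 'PRP']
print(list(pos))  # ['PRP', 'VBP', 'TO', 'VB', 'PRP']
```

The comprehension returns a list directly, while zip(*tagged) yields tuples and raises a ValueError on unpacking when tagged is empty.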