Related
I am doing a text mining project that will analyze some speeches from the three remaining presidential candidates. I have completed POS tagging with OpenNLP and created a two column data frame with the results. I have added a variable, called pair. Here is a sample from the Clinton data frame:
V1 V2 pair
1 c( NN FALSE
2 "thank VBP FALSE
3 you PRP FALSE
4 so RB FALSE
5 much RB FALSE
6 . . FALSE
7 it PRP FALSE
8 is VBZ FALSE
9 wonderful JJ FALSE
10 to TO FALSE
11 be VB FALSE
12 here RB FALSE
13 and CC FALSE
14 see VB FALSE
15 so RB FALSE
16 many JJ FALSE
17 friends NNS FALSE
18 . . FALSE
19 ive JJ FALSE
20 spoken VBN FALSE
What I'm now trying to do is write a function that will iterate through the V2 POS column and evaluate it for specific pattern pairs. (These come from Turney's PMI article.) I'm not yet very knowledgeable when it comes to writing functions, so I'm certain I've done it wrong, but here is what I've got so far.
pairs <- function(x){
JJ <- "JJ" #adjectives
N <- "N[A-Z]" #any noun form
R <- "R[A-Z]" #any adverb form
V <- "V[A-Z]" #any verb form
for(i in 1:(length)(x) {
if(x == J && x+1 == N) { #i.e., if the first word = J and the next = N
pair[i] <- "JJ|NN" #insert this into the 'pair' variable
} else if (x == R && x+1 == J && x+2 != N) {
pair[i] <- "RB|JJ"
} else if (x == J && x+1 == J && x+2 != N) {
pair[i] <- "JJ|JJ"
} else if (x == N && x+1 == J && x+2 != N) {
pair[i] <- "NN|JJ"
} else if (x == R && x+1 == V) {
pair[i] <- "RB|VB"
} else {
pair[i] <- "FALSE"
}
}
}
# Run the function
cl.df.pairs <- pairs(cl.df$V2)
There are a number of (truly embarrassing) issues. First, when I try to run the function code, I get two Error: unexpected '}' in " }" errors at the end. I can't figure out why, because they match opening "{". I'm assuming it's because R is expecting something else to be there.
Also, and more importantly, this function won't exactly get me what I want, which is to extract the word pairs that match a pattern and then the pattern that they match. I honestly have no idea how to do that.
Then I need to figure out how to evaluate the semantic orientation of each word combo by comparing the phrases to the pos/neg lexical data sets that I have, but that's a whole other issue. I have the formula from the article, which I'm hoping will point me in the right direction.
I have looked all over and can't find a comparable function in any of the NLP packages, such as OpenNLP, RTextTools, etc. I HAVE looked at other SO questions/answers, like this one and this one, but they haven't worked for me when I've tried to adapt them. I'm fairly certain I'm missing something obvious here, so would appreciate any advice.
EDIT:
Here is the first 20 lines of the Sanders data frame.
head(sa.POS.df, 20)
V1 V2
1 the DT
2 american JJ
3 people NNS
4 are VBP
5 catching VBG
6 on RB
7 . .
8 they PRP
9 understand VBP
10 that IN
11 something NN
12 is VBZ
13 profoundly RB
14 wrong JJ
15 when WRB
16 , ,
17 in IN
18 our PRP$
19 country NN
20 today NN
And I've written the following function:
pairs <- function(x, y) {
require(gsubfn)
J <- "JJ" #adjectives
N <- "N[A-Z]" #any noun form
R <- "R[A-Z]" #any adverb form
V <- "V[A-Z]" #any verb form
for(i in 1:(length(x))) {
ngram <- c(x[[i]], x[[i+1]])
# the ngram consists of the word on line `i` and the word below line `i`
}
strapply(y[i], "(J)\n(N)", FUN = paste(ngram, sep = " "), simplify = TRUE)
ngrams.df = data.frame(ngrams=ngram)
return(ngrams.df)
}
So, what is SUPPOSED to happen is that when strapply matches the pattern (in this case, an adjective followed by a noun, it should paste the ngram. And all of the resulting ngrams should populate the ngrams.df.
So I've entered the following function call and get an error:
> sa.JN <- pairs(x=sa.POS.df$V1, y=sa.POS.df$V2)
Error in x[[i + 1]] : subscript out of bounds
I'm only just learning the intricacies of regular expressions, so I'm not quite sure how to get my function to pull the actual adjective and noun. Based on the data shown here, it should pull "american" and "people" and paste them into the data frame.
Okay, here we go. Using this data (shared nicely with dput()):
df = structure(list(V1 = structure(c(15L, 3L, 11L, 4L, 5L, 9L, 2L,
16L, 18L, 14L, 13L, 8L, 12L, 20L, 19L, 1L, 7L, 10L, 6L, 17L), .Label = c(",",
".", "american", "are", "catching", "country", "in", "is", "on",
"our", "people", "profoundly", "something", "that", "the", "they",
"today", "understand", "when", "wrong"), class = "factor"), V2 = structure(c(3L,
5L, 7L, 12L, 11L, 10L, 2L, 8L, 12L, 4L, 6L, 13L, 10L, 5L, 14L,
1L, 4L, 9L, 6L, 6L), .Label = c(",", ".", "DT", "IN", "JJ", "NN",
"NNS", "PRP", "PRP$", "RB", "VBG", "VBP", "VBZ", "WRB"), class = "factor")), .Names = c("V1",
"V2"), class = "data.frame", row.names = c("1", "2", "3", "4",
"5", "6", "7", "8", "9", "10", "11", "12", "13", "14", "15",
"16", "17", "18", "19", "20"))
I'll use the stringr package because of its consistent syntax so I don't have to look up the argument order for grep. We'll first detect the adjectives, then the nouns, and figure out where the line up (offsetting by 1). Then paste the words together that correspond to the matches.
library(stringr)
adj = str_detect(df$V2, "JJ")
noun = str_detect(df$V2, "NN")
pairs = which(c(FALSE, adj) & c(noun, FALSE))
ngram = paste(df$V1[pairs - 1], df$V1[pairs])
# [1] "american people"
Now we can put it in a function. I left the patterns as arguments (with adjective, noun as the defaults) for flexibility.
bigram = function(word, type, patt1 = "JJ", patt2 = "N[A-Z]") {
pairs = which(c(FALSE, str_detect(type, pattern = patt1)) &
c(str_detect(type, patt2), FALSE))
return(paste(word[pairs - 1], word[pairs]))
}
Demonstrating use on the original data
with(df, bigram(word = V1, type = V2))
# [1] "american people"
Let's cook up some data with more than one match to make sure it works:
df2 = data.frame(w = c("american", "people", "hate", "a", "big", "bad", "bank"),
t = c("JJ", "NNS", "VBP", "DT", "JJ", "JJ", "NN"))
df2
# w t
# 1 american JJ
# 2 people NNS
# 3 hate VBP
# 4 a DT
# 5 big JJ
# 6 bad JJ
# 7 bank NN
with(df2, bigram(word = w, type = t))
# [1] "american people" "bad bank"
And back to the original to test out a different pattern:
with(df, bigram(word = V1, type = V2, patt1 = "N[A-Z]", patt2 = "V[A-Z]"))
# [1] "people are" "something is"
I think the following is the code you wrote, but without throwing errors:
pairs <- function(x) {
J <- "JJ" #adjectives
N <- "N[A-Z]" #any noun form
R <- "R[A-Z]" #any adverb form
V <- "V[A-Z]" #any verb form
pair = rep("FALSE", length(x))
for(i in 1:(nrow(x)-2)) {
this.pos = x[i,2]
next.pos = x[i+1,2]
next.next.pos = x[i+2,2]
if(this.pos == J && next.pos == N) { #i.e., if the first word = J and the next = N
pair[i] <- "JJ|NN" #insert this into the 'pair' variable
} else if (this.pos == R && next.pos == J && next.next.pos != N) {
pair[i] <- "RB|JJ"
} else if (this.pos == J && next.pos == J && next.next.pos != N) {
pair[i] <- "JJ|JJ"
} else if (this.pos == N && next.pos == J && next.next.pos != N) {
pair[i] <- "NN|JJ"
} else if (this.pos == R && next.pos == V) {
pair[i] <- "RB|VB"
} else {
pair[i] <- "FALSE"
}
}
## then deal with the last two elements, for which you can't check what's up next
return(pair)
}
not sure what you mean by this, though:
Also, and more importantly, this function won't exactly get me what I
want, which is to extract the word pairs that match a pattern and then
the pattern that they match. I honestly have no idea how to do that.
Does Groovy have a smart way to check if a list is sorted? Precondition is that Groovy actually knows how to sort the objects, e.g. a list of strings.
The way I do right now (with just some test values for this example) is to copy the list to a new list, then sort it and check that they are equal. Something like:
def possiblySorted = ["1", "2", "3"]
def sortedCopy = new ArrayList<>(possiblySorted)
sortedCopy.sort()
I use this in unit tests in several places so it would be nice with something like:
def possiblySorted = ["1", "2", "3"]
possiblySorted.isSorted()
Is there a good way like this to check if a list is sorted in Groovy, or which is the preffered way? I would almost expect Groovy to have something like this, since it is so smart with collections and iteration.
If you want to avoid doing an O(n*log(n)) operation to check if a list is sorted, you can iterate it just once and check if every item is less or equals than the next one:
def isSorted(list) {
list.size() < 2 || (1..<list.size()).every { list[it - 1] <= list[it] }
}
assert isSorted([])
assert isSorted([1])
assert isSorted([1, 2, 2, 3])
assert !isSorted([1, 2, 3, 2])
Why not just compare it to a sorted instance of the same list?
def possiblySorted = [ 4, 2, 1 ]
// Should fail
assert possiblySorted == possiblySorted.sort( false )
We pass false to the sort method, so it returns a new list rather than modifying the existing one
You could add a method like so:
List.metaClass.isSorted = { -> delegate == delegate.sort( false ) }
Then, you can do:
assert [ 1, 2, 3 ].isSorted()
assert ![ 1, 3, 2 ].isSorted()
hi i have below code working fine:
if getattr(hotel_main, "X", 1):
hotels1 = hotels.filter(Q(X=True))
for hotel in hotels1:
if models.CalendarDay.objects.filter(hotel=hotel, date=date).count() == 0:
similar_venues.append(hotel)
I reused above code again and again to check different conditions like Q(Y=True),Q(Y=True),Q(Z=True)
if i can filter a list based on the condition i can get rid of repeating code... i want something like this: similar_venues.filter(Q(X=True)) Any help please...
If i understood correctly what you asked:
filter_on_x = [obj for obj in similar_venues if obj.X]
filter_on_y = [obj for obj in similar_venues if obj.Y]
and so on for all the X, Y, Z
You can write conditions in a list:
conditions = [ Q(Y=True),Q(Y=True),Q(Z=True) ]
if getattr(hotel_main, "X", 1):
q_date = Q( calendarday__date = date )
for q in conditions:
for hotel in hotels.filter( q_date & q).distinct():
similar_venues.append(hotel)
I'm working with a large list containing integers and I would like to do some pattern matching on them (like finding certain sequences). Regular expressions would be the perfect fit, except that they always seem to only handle lists of characters, a.k.a. strings. Is there any library (in any language) that can handle lists of an arbitrary type?
I'm aware that I could convert my integer list into a string and then do a normal regular expression search but that seems a bit wasteful and inelegant.
edit:
My requirements are fairly simple. No need for nested lists, no need for fancy character classes. Basically I'm just interested in occurrences of sequences that can get pretty complicated. (e.g. something like "[abc]{3,}.([ab]?[^a]{4,7})" etc. where a,b,c are integers). This should be possible to generalize over any type which can be checked for equality. For an enumerable type you could also get things like "[a-z]" to work.
Regular expressions match only strings, by definition.
Of course, in theory you could construct an equivalent grammar, say for lists of numbers. With new tokens like \e for even numbers, \o for odd numbers, \s for square numbers, \r for real numbers etc., so that
[1, 2, 3, 4, 5, 6]
would be matched by
^(\o\e)*$
or
[ln(3), math.pi, sqrt(-1)]
would be matched by
^\R*$
etc. Sounds like a fun project, but also like a very difficult one. And how this could be expanded to handle arbitrary lists, nested and all, is beyond me.
Some of the regex syntax generalize to generic sequences. Also, to be able to specify any object, strings is not the best medium for the expression themselves.
"Small" example in python:
def choice(*items):
return ('choice',[value(item) for item in items])
def seq(*items):
return ('seq',[value(item) for item in items])
def repeat(min,max,lazy,item):
return ('repeat',min,max,lazy,value(item))
def value(item):
if not isinstance(item,tuple):
return ('value',item)
return item
def compile(pattern):
ret = []
key = pattern[0]
if key == 'value':
ret.append(('literal',pattern[1]))
elif key == 'seq':
for subpattern in pattern[1]:
ret.extend(compile(subpattern))
elif key == 'choice':
jumps = []
n = len(pattern[1])
for i,subpattern in enumerate(pattern[1]):
if i < n-1:
pos = len(ret)
ret.append('placeholder for choice')
ret.extend(compile(subpattern))
jumps.append(len(ret))
ret.append('placeholder for jump')
ret[pos] = ('choice',len(ret)-pos)
else:
ret.extend(compile(subpattern))
for pos in jumps:
ret[pos] = ('jump', len(ret)-pos)
elif key == 'repeat':
min,max,lazy,subpattern = pattern[1:]
for _ in xrange(min):
ret.extend(compile(subpattern))
if max == -1:
if lazy:
pos = len(ret)
ret.append('placeholder for jump')
ret.extend(compile(subpattern))
ret[pos] = ('jump',len(ret)-pos)
ret.append(('choice',pos+1-len(ret)))
else:
pos = len(ret)
ret.append('placeholder for choice')
ret.extend(compile(subpattern))
ret.append(('jump',pos-len(ret)))
ret[pos] = ('choice',len(ret)-pos)
elif max > min:
if lazy:
jumps = []
for _ in xrange(min,max):
ret.append(('choice',2))
jumps.append(len(ret))
ret.append('placeholder for jump')
ret.extend(compile(subpattern))
for pos in jumps:
ret[pos] = ('jump', len(ret)-pos)
else:
choices = []
for _ in xrange(min,max):
choices.append(len(ret))
ret.append('placeholder for choice')
ret.extend(compile(subpattern))
ret.append(('drop,'))
for pos in choices:
ret[pos] = ('choice',len(ret)-pos)
return ret
def match(pattern,subject,start=0):
stack = []
pos = start
i = 0
while i < len(pattern):
instruction = pattern[i]
key = instruction[0]
if key == 'literal':
if pos < len(subject) and subject[pos] == instruction[1]:
i += 1
pos += 1
continue
elif key == 'jump':
i += instruction[1]
continue
elif key == 'choice':
stack.append((i+instruction[1],pos))
i += 1
continue
# fail
if not stack:
return None
i,pos = stack.pop()
return pos
def find(pattern,subject,start=0):
for pos1 in xrange(start,len(subject)+1):
pos2 = match(pattern,subject,pos1)
if pos2 is not None: return pos1,pos2
return None,None
def find_all(pattern,subject,start=0):
matches = []
pos1,pos2 = find(pattern,subject,start)
while pos1 is not None:
matches.append((pos1,pos2))
pos1,pos2 = find(pattern,subject,pos2)
return matches
# Timestamps: ([01][0-9]|2[0-3])[0-5][0-9]
pattern = compile(
seq(
choice(
seq(choice(0,1),choice(0,1,2,3,4,5,6,7,8,9)),
seq(2,choice(0,1,2,3)),
),
choice(0,1,2,3,4,5),
choice(0,1,2,3,4,5,6,7,8,9),
)
)
print find_all(pattern,[1,3,2,5,6,3,4,2,4,3,2,2,3,6,6,5,3,5,3,3,2,5,4,5])
# matches: (0,4): [1,3,2,5]; (10,14): [2,2,3,6]
A few points of improvement:
More constructs: classes with negation, ranges
Classes instead of tuples
If you really need a free grammar like in regular expressions, then you have to go a way as described in Tim's answer.
If you only have a fixed number of patterns to search for, then the easiest and fastest way is to write your own search/filter functions.
Interesting problem indeed.
Lateral thinking: download the .Net Framework Source code, lift the Regex source code and adapt it to work with integers rather than characters.
Just an idea.
Well, Erlang has pattern matching (of your type) built right in. I did something similar once in Ruby - a bit of probably not too well performing hackery, see http://radiospiel.org/0x16-its-a-bird
Clojure since version 1.9 has clojure.spec in the standard library, which can do exactly that and more. For example to describe a sequence of odd numbers that may end with one even number you'd write:
(require '[clojure.spec.alpha :as s])
(s/def ::odds-then-maybe-even (s/cat :odds (s/+ odd?)
:even (s/? even?)))
Then to get matching subsequences you'd do this:
(s/conform ::odds-then-maybe-even [1 3 5 100])
;;=> {:odds [1 3 5], :even 100}
(s/conform ::odds-then-maybe-even [1])
;;=> {:odds [1]}
And to find out why a sequence doesn't match your definition:
(s/explain ::odds-then-maybe-even [100])
;; In: [0] val: 100 fails spec: ::odds-then-maybe-even at: [:odds] predicate: odd?
See full documentation with examples at https://clojure.org/guides/spec
You can try pamatcher, it's a JavaScript library that generalize the notion of regular expressions for any sequence of items (of any type).
An example for "[abc]{3,}.([ab]?[^a]{4,7})" pattern matching, where a, b, c are integers:
var pamatcher = require('pamatcher');
var a = 10;
var b = 20;
var c = 30;
var matcher = pamatcher([
{ repeat:
{ or: [a, b, c] },
min: 3
},
() => true,
{ optional:
{ or: [a, b] }
},
{ repeat: (i) => i != a,
min: 4,
max: 7
}
]);
var input = [1, 4, 8, 44, 55];
var result = matcher.test(input);
if(result) {
console.log("Pattern matches!");
} else {
console.log("Pattern doesn't match.");
}
Note: I am the creator of this library.
If I have a list of items like this:
local items = { "apple", "orange", "pear", "banana" }
how do I check if "orange" is in this list?
In Python I could do:
if "orange" in items:
# do something
Is there an equivalent in Lua?
You could use something like a set from Programming in Lua:
function Set (list)
local set = {}
for _, l in ipairs(list) do set[l] = true end
return set
end
Then you could put your list in the Set and test for membership:
local items = Set { "apple", "orange", "pear", "banana" }
if items["orange"] then
-- do something
end
Or you could iterate over the list directly:
local items = { "apple", "orange", "pear", "banana" }
for _,v in pairs(items) do
if v == "orange" then
-- do something
break
end
end
Use the following representation instead:
local items = { apple=true, orange=true, pear=true, banana=true }
if items.apple then
...
end
You're seeing firsthand one of the cons of Lua having only one data structure---you have to roll your own. If you stick with Lua you will gradually accumulate a library of functions that manipulate tables in the way you like to do things. My library includes a list-to-set conversion and a higher-order list-searching function:
function table.set(t) -- set of list
local u = { }
for _, v in ipairs(t) do u[v] = true end
return u
end
function table.find(f, l) -- find element v of l satisfying f(v)
for _, v in ipairs(l) do
if f(v) then
return v
end
end
return nil
end
Write it however you want, but it's faster to iterate directly over the list, than to generate pairs() or ipairs()
#! /usr/bin/env lua
local items = { 'apple', 'orange', 'pear', 'banana' }
local function locate( table, value )
for i = 1, #table do
if table[i] == value then print( value ..' found' ) return true end
end
print( value ..' not found' ) return false
end
locate( items, 'orange' )
locate( items, 'car' )
orange found
car not found
Lua tables are more closely analogs of Python dictionaries rather than lists. The table you have create is essentially a 1-based indexed array of strings. Use any standard search algorithm to find out if a value is in the array. Another approach would be to store the values as table keys instead as shown in the set implementation of Jon Ericson's post.
This is a swiss-armyknife function you can use:
function table.find(t, val, recursive, metatables, keys, returnBool)
if (type(t) ~= "table") then
return nil
end
local checked = {}
local _findInTable
local _checkValue
_checkValue = function(v)
if (not checked[v]) then
if (v == val) then
return v
end
if (recursive and type(v) == "table") then
local r = _findInTable(v)
if (r ~= nil) then
return r
end
end
if (metatables) then
local r = _checkValue(getmetatable(v))
if (r ~= nil) then
return r
end
end
checked[v] = true
end
return nil
end
_findInTable = function(t)
for k,v in pairs(t) do
local r = _checkValue(t, v)
if (r ~= nil) then
return r
end
if (keys) then
r = _checkValue(t, k)
if (r ~= nil) then
return r
end
end
end
return nil
end
local r = _findInTable(t)
if (returnBool) then
return r ~= nil
end
return r
end
You can use it to check if a value exists:
local myFruit = "apple"
if (table.find({"apple", "pear", "berry"}, myFruit)) then
print(table.find({"apple", "pear", "berry"}, myFruit)) -- 1
You can use it to find the key:
local fruits = {
apple = {color="red"},
pear = {color="green"},
}
local myFruit = fruits.apple
local fruitName = table.find(fruits, myFruit)
print(fruitName) -- "apple"
I hope the recursive parameter speaks for itself.
The metatables parameter allows you to search metatables as well.
The keys parameter makes the function look for keys in the list. Of course that would be useless in Lua (you can just do fruits[key]) but together with recursive and metatables, it becomes handy.
The returnBool parameter is a safe-guard for when you have tables that have false as a key in a table (Yes that's possible: fruits = {false="apple"})
function valid(data, array)
local valid = {}
for i = 1, #array do
valid[array[i]] = true
end
if valid[data] then
return false
else
return true
end
end
Here's the function I use for checking if data is in an array.
Sort of solution using metatable...
local function preparetable(t)
setmetatable(t,{__newindex=function(self,k,v) rawset(self,v,true) end})
end
local workingtable={}
preparetable(workingtable)
table.insert(workingtable,123)
table.insert(workingtable,456)
if workingtable[456] then
...
end
The following representation can be used:
local items = {
["apple"]=true, ["orange"]=true, ["pear"]=true, ["banana"]=true
}
if items["apple"] then print("apple is a true value.") end
if not items["red"] then print("red is a false value.") end
Related output:
apple is a true value.
red is a false value.
You can also use the following code to check boolean validity:
local items = {
["apple"]=true, ["orange"]=true, ["pear"]=true, ["banana"]=true,
["red"]=false, ["blue"]=false, ["green"]=false
}
if items["yellow"] == nil then print("yellow is an inappropriate value.") end
if items["apple"] then print("apple is a true value.") end
if not items["red"] then print("red is a false value.") end
The output is:
yellow is an inappropriate value.
apple is a true value.
red is a false value.
Check Tables Tutorial for additional information.
function table.find(t,value)
if t and type(t)=="table" and value then
for _, v in ipairs (t) do
if v == value then
return true;
end
end
return false;
end
return false;
end
you can use this solution:
items = { 'a', 'b' }
for k,v in pairs(items) do
if v == 'a' then
--do something
else
--do something
end
end
or
items = {'a', 'b'}
for k,v in pairs(items) do
while v do
if v == 'a' then
return found
else
break
end
end
end
return nothing
A simple function can be used that :
returns nil, if the item is not found in table
returns index of item, if item is found in table
local items = { "apple", "orange", "pear", "banana" }
local function search_value (tbl, val)
for i = 1, #tbl do
if tbl[i] == val then
return i
end
end
return nil
end
print(search_value(items, "pear"))
print(search_value(items, "cherry"))
output of above code would be
3
nil