The Google Cloud Natural Language API can be used to analyse text and return a syntactic parse tree with each word labeled with part-of-speech tags.
Is there a way to determine if a noun is plural or not?
If Google Cloud NL is able to work out the lemma then perhaps the information is there but not returned through the API?
Update
With the NL API's GA launch, the annotateText endpoint now returns a number key for each token indicating whether the word is singular, plural, or dual. For the sentence "There are some cats here," the API returns the following token data for 'cats' (notice that number is PLURAL):
{
"text": {
"content": "cats",
"beginOffset": -1
},
"partOfSpeech": {
"tag": "NOUN",
"aspect": "ASPECT_UNKNOWN",
"case": "CASE_UNKNOWN",
"form": "FORM_UNKNOWN",
"gender": "GENDER_UNKNOWN",
"mood": "MOOD_UNKNOWN",
"number": "PLURAL",
"person": "PERSON_UNKNOWN",
"proper": "PROPER_UNKNOWN",
"reciprocity": "RECIPROCITY_UNKNOWN",
"tense": "TENSE_UNKNOWN",
"voice": "VOICE_UNKNOWN"
},
"dependencyEdge": {
"headTokenIndex": 1,
"label": "DOBJ"
},
"lemma": "cat"
}
See the full documentation here.
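For illustration, here is a minimal sketch using the Python client library (assuming google-cloud-language and default application credentials; the sentence is just the example above) that reads the number field for each token:

from google.cloud import language_v1

client = language_v1.LanguageServiceClient()
document = language_v1.Document(
    content="There are some cats here",
    type_=language_v1.Document.Type.PLAIN_TEXT,
)
response = client.analyze_syntax(request={"document": document})

for token in response.tokens:
    # number comes back as PartOfSpeech.Number.PLURAL for "cats" here
    if token.part_of_speech.number == language_v1.PartOfSpeech.Number.PLURAL:
        print(token.text.content, "is plural; lemma:", token.lemma)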
Thanks for trying out the NL API.
Right now there isn't a clean way to detect plurals other than to note that the base word is different from the lemma and guess whether it's plural (in English, perhaps it ends in an -s).
However, we plan to release a much better way of detecting morphological information like plurality, so stay tuned.
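In the meantime, that workaround amounts to a rough heuristic along these lines (English only; the trailing -s check is an assumption and will misfire on irregular plurals):

def looks_plural(word, lemma):
    # Guess plurality by comparing the surface form with the lemma.
    return word.lower() != lemma.lower() and word.lower().endswith("s")

print(looks_plural("cats", "cat"))    # True
print(looks_plural("sheep", "sheep")) # False (irregular plurals slip through)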
I have an ES DB storing history records from a process I run every day. Because I want to show only 20 records per page in the history (ordered by date), I was using pagination (size + from_) combined with scroll, which worked just fine. But when I wanted to use sort in the query it didn't work, and I found out that scroll and sort don't work together. Looking for an alternative I tried the ES helper scan, which works fine for scrolling and sorting the results, but with this solution pagination doesn't seem to work, which I don't understand, since the API says that scan sends all the parameters to the underlying search function. So my question is whether there is any way to combine the three options.
Thanks,
Ruben
When using the elasticsearch.helpers.scan function, you need to pass preserve_order=True to enable sorting.
(Tested using elasticsearch==7.5.1)
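For illustration, a minimal sketch assuming an index named history with a date field (both names are placeholders): sort server-side inside the query, keep that order with preserve_order=True, and page client-side by slicing the generator, since from_ is not honoured inside a scroll.

from itertools import islice

from elasticsearch import Elasticsearch
from elasticsearch.helpers import scan

es = Elasticsearch()

# preserve_order=True keeps the sort order across scroll batches
# (more expensive than the default unordered scan).
hits = scan(
    es,
    index="history",
    query={"query": {"match_all": {}}, "sort": [{"date": "desc"}]},
    preserve_order=True,
)

# Page client-side on the generator: 20 records per page.
page_number, page_size = 2, 20
page = list(islice(hits, page_number * page_size, (page_number + 1) * page_size))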
Yes, you can combine scroll with sort, but when you sort on a string field you will need to change the mapping for it to work properly. Documentation here:
In order to sort on a string field, that field should contain one term only: the whole not_analyzed string. But of course we still need the field to be analyzed in order to be able to query it as full text.
The naive approach to indexing the same string in two ways would be to include two separate fields in the document: one that is analyzed for searching, and one that is not_analyzed for sorting.
"tweet": {
"type": "string",
"analyzer": "english",
"fields": {
"raw": {
"type": "string",
"index": "not_analyzed"
}
}
}
The main tweet field is just the same as before: an analyzed full-text field.
The new tweet.raw subfield is not_analyzed.
Now, or at least as soon as we have reindexed our data, we can use the tweet field for search and the tweet.raw field for sorting:
GET /_search
{
"query": {
"match": {
"tweet": "elasticsearch"
}
},
"sort": "tweet.raw"
}
We have an OData-compliant API that delegates some of its full text search needs to an Elasticsearch cluster.
Since OData expressions can get quite complex, we decided to simply translate them into their equivalent Lucene query syntax and feed it into a query_string query.
We do support some text-related OData filter expressions, such as:
startswith(field,'bla')
endswith(field,'bla')
substringof('bla',field)
name eq 'bla'
The fields we're matching against can be analyzed, not_analyzed or both (i.e. via a multi-field).
The searched text can be a single token (e.g. table), only a part thereof (e.g. tab), or several tokens (e.g. table 1., table 10, etc).
The search must be case-insensitive.
Here are some examples of the behavior we need to support:
startswith(name,'table 1') must match "Table 1", "table 100", "Table 1.5", "table 112 upper level"
endswith(name,'table 1') must match "Room 1, Table 1", "Subtable 1", "table 1", "Jeff table 1"
substringof('table 1',name) must match "Big Table 1 back", "table 1", "Table 1", "Small Table12"
name eq 'table 1' must match "Table 1", "TABLE 1", "table 1"
So basically, we take the user input (i.e. what is passed as the 2nd parameter of startswith/endswith, the 1st parameter of substringof, or the right-hand side value of eq) and try to match it exactly, whether the tokens fully match or only partially.
Right now, we're getting away with a clumsy solution, highlighted below, which works pretty well but is far from ideal.
In our query_string, we match against a not_analyzed field using the regular expression syntax. Since the field is not_analyzed and the search must be case-insensitive, we do our own tokenizing while preparing the regular expression to feed into the query. For example, the following is equivalent to the OData filter endswith(name,'table 8') (=> match all documents whose name ends with "table 8"):
"query": {
"query_string": {
"query": "name.raw:/.*(T|t)(A|a)(B|b)(L|l)(E|e) 8/",
"lowercase_expanded_terms": false,
"analyze_wildcard": true
}
}
So, even though this solution works pretty well and the performance is not too bad (which came as a surprise), we'd like to do it differently and leverage the full power of analyzers in order to shift all this burden to indexing time instead of search time. However, since reindexing all our data will take weeks, we'd like to first investigate whether there's a good combination of token filters and analyzers that would help us achieve the same search requirements enumerated above.
My thinking is that the ideal solution would contain some wise mix of shingles (i.e. several tokens together) and edge-nGram (i.e. to match at the start or end of a token). What I'm not sure of, though, is whether it is possible to make them work together in order to match several tokens, where one of the tokens might not be fully typed by the user. For instance, if the indexed name field is "Big Table 123", I need substringof('table 1',name) to match it, so "table" is a fully matched token, while "1" is only a prefix of the next token.
Thanks in advance for sharing your braincells on this one.
UPDATE 1: after testing Andrei's solution
=> Exact match (eq) and startswith work perfectly.
A. endswith glitches
Searching for substringof('table 112', name) yields 107 docs. Searching for a more specific case such as endswith(name, 'table 112') yields 1525 docs, while it should yield fewer docs (suffix matches should be a subset of substring matches). Checking in more depth, I've found some mismatches, such as "Social Club, Table 12" (doesn't contain "112") or "Order 312" (contains neither "table" nor "112"). I guess it's because they end with "12" and that's a valid gram for the token "112", hence the match.
B. substringof glitches
Searching for substringof('table',name) matches "Party table", "Alex on big table" but doesn't match "Table 1", "table 112", etc. Searching for substringof('tabl',name) doesn't match anything.
UPDATE 2
It was sort of implied but I forgot to explicitly mention that the solution will have to work with the query_string query, mainly due to the fact that the OData expressions (however complex they might be) will keep getting translated into their Lucene equivalent. I'm aware that we're trading the power of the Elasticsearch Query DSL for Lucene's query syntax, which is a bit less powerful and less expressive, but that's something that we can't really change. We're pretty d**n close, though!
UPDATE 3 (June 25th, 2019):
ES 7.2 introduced a new data type called search_as_you_type that allows this kind of behavior natively. Read more at: https://www.elastic.co/guide/en/elasticsearch/reference/7.2/search-as-you-type.html
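A minimal sketch of that approach with the Python client (index and field names are made up; requires Elasticsearch 7.2+):

from elasticsearch import Elasticsearch

es = Elasticsearch()

# Index with a search_as_you_type field.
es.indices.create(index="items", body={
    "mappings": {
        "properties": {
            "name": {"type": "search_as_you_type"}
        }
    }
})

# Prefix-style matching across the automatically generated subfields.
resp = es.search(index="items", body={
    "query": {
        "multi_match": {
            "query": "table 1",
            "type": "bool_prefix",
            "fields": ["name", "name._2gram", "name._3gram"]
        }
    }
})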
This is an interesting use case. Here's my take:
{
"settings": {
"analysis": {
"analyzer": {
"my_ngram_analyzer": {
"tokenizer": "my_ngram_tokenizer",
"filter": ["lowercase"]
},
"my_edge_ngram_analyzer": {
"tokenizer": "my_edge_ngram_tokenizer",
"filter": ["lowercase"]
},
"my_reverse_edge_ngram_analyzer": {
"tokenizer": "keyword",
"filter" : ["lowercase","reverse","substring","reverse"]
},
"lowercase_keyword": {
"type": "custom",
"filter": ["lowercase"],
"tokenizer": "keyword"
}
},
"tokenizer": {
"my_ngram_tokenizer": {
"type": "nGram",
"min_gram": "2",
"max_gram": "25"
},
"my_edge_ngram_tokenizer": {
"type": "edgeNGram",
"min_gram": "2",
"max_gram": "25"
}
},
"filter": {
"substring": {
"type": "edgeNGram",
"min_gram": 2,
"max_gram": 25
}
}
}
},
"mappings": {
"test_type": {
"properties": {
"text": {
"type": "string",
"analyzer": "my_ngram_analyzer",
"fields": {
"starts_with": {
"type": "string",
"analyzer": "my_edge_ngram_analyzer"
},
"ends_with": {
"type": "string",
"analyzer": "my_reverse_edge_ngram_analyzer"
},
"exact_case_insensitive_match": {
"type": "string",
"analyzer": "lowercase_keyword"
}
}
}
}
}
}
}
my_ngram_analyzer is used to split every text into small pieces; how large the pieces are depends on your use case (for testing purposes, I chose 25 chars). lowercase is used since you said case-insensitive. Basically, this is the analyzer used for substringof('table 1',name). The query is simple:
{
"query": {
"term": {
"text": {
"value": "table 1"
}
}
}
}
my_edge_ngram_analyzer is used to split the text starting from the beginning and this is specifically used for the startswith(name,'table 1') use case. Again, the query is simple:
{
"query": {
"term": {
"text.starts_with": {
"value": "table 1"
}
}
}
}
I found this to be the trickiest part - the one for endswith(name,'table 1'). For this I defined my_reverse_edge_ngram_analyzer, which uses a keyword tokenizer together with lowercase and an edgeNGram filter preceded and followed by a reverse filter. What this analyzer basically does is split the text into edgeNGrams, but with the edge at the end of the text, not the start (as with the regular edgeNGram).
The query:
{
"query": {
"term": {
"text.ends_with": {
"value": "table 1"
}
}
}
}
For the name eq 'table 1' case, a simple keyword tokenizer together with a lowercase filter should do it.
The query:
{
"query": {
"term": {
"text.exact_case_insensitive_match": {
"value": "table 1"
}
}
}
}
Regarding query_string, this changes the solution a bit, because I was counting on term to not analyze the input text and to match it exactly with one of the terms in the index.
But this can be "simulated" with query_string if the appropriate analyzer is specified for it.
The solution would be a set of queries like the following (always use that analyzer, changing only the field name):
{
"query": {
"query_string": {
"query": "text.starts_with:(\"table 1\")",
"analyzer": "lowercase_keyword"
}
}
}
I am trying to download a Facebook discussion using the Graph API. The problem is that the discussion is located on a page and is structured in a tree-style manner, meaning that there are two types of comments: "main" comments, replying to the first message, and "subcomments" replying to the main comments themselves.
It seems that the graph result only shows the "main" comments and doesn't show the subcomments. Here's an example of a comment it returns:
{
"id": "53526364352_1574091",
"can_remove": false,
"created_time": "2014-02-05T10:46:37+0000",
"from": {
"name": "Main commenter",
"id": "5345353"
},
"like_count": 163,
"message": "I am a main comment",
"user_likes": false
},
There is no link or whatever to the subcomments of this main comment (and there are many).
Is there a way to get the subcomments?
If 10101140614002197_8831228 is an ID of a root comment, then you can check for subcomments/replies by requesting COMMENT_ID/comments.
For example:
the root comment: http://graph.facebook.com/10101140614002197_8831228
the subcomment: http://graph.facebook.com/10101140614002197_8831228/comments
this root comment has no subcomments so the data list is empty: https://graph.facebook.com/10101140614002197_8831286/comments
You can use field expansion (curly braces in the URL) to get nested data:
http://graph.facebook.com/{object-id}/comments?fields=id,message,comments{id,message,comments{id,message,comments}}
More info here in the section labeled Nested requests (a.k.a. field expansion).
If you want to traverse and flatten the tree, you can do this:
def get_all_comments(post_or_comment_id):
    # Walk the comment tree iteratively, starting from the post (or comment) id.
    next_ids = [post_or_comment_id]
    results = []
    while next_ids:
        next_id = next_ids.pop()
        comments = get_comments_from_facebook(next_id)  # Facebook API call
        results += comments
        # Queue every returned comment so its replies get fetched too.
        next_ids.extend(c["id"] for c in comments)
    return results
Make sure to add parent to the API call so you can replicate the tree.
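A possible implementation of the get_comments_from_facebook placeholder used above — a sketch assuming the requests library, a valid access token, and the COMMENT_ID/comments edge described earlier (parent is requested so the tree can be rebuilt):

import requests

ACCESS_TOKEN = "..."  # placeholder; a valid token is required

def get_comments_from_facebook(object_id):
    # Fetch every comment on the given post or comment, following pagination.
    url = "https://graph.facebook.com/{}/comments".format(object_id)
    params = {"fields": "id,message,parent", "access_token": ACCESS_TOKEN}
    comments = []
    while url:
        data = requests.get(url, params=params).json()
        comments.extend(data.get("data", []))
        url = data.get("paging", {}).get("next")  # next page, if any
        params = {}  # the "next" URL already carries the query string
    return comments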
Regex always confuses me, plus super simple syntaxes are hard to Google. I am using regex here strictly with find and replace; no need for any languages, I just want to save time editing a lot of data :)
I have a huge JSON file; these are only two pieces of data, but they're enough for this example.
[
{
name: 'John',
team: 'Wolves',
team_id: 1,
number: 24
},
{
name: 'Kevin',
team: 'Rockets',
team_id: 1,
number: 6
}
]
Inside my JSON I need to put double quotes around pretty much every key:value pair; numbers are optional.
I need to get rid of the single quotes, then put double quotes around everything.
The final result should look like this.
[
{
"name": "John",
"team": "Wolves",
"team_id": "1",
"number": "24"
},
{
"name": "Kevin",
"team": "Rockets",
"team_id": "1",
"number": "6"
}
]
Again, numbers are optional, but it would be nice to know how to double-quote those too.
Extra: I vaguely remember doing something like this a while back, but can't find where I found that information. This would be a nice reference. Does anyone have any good links to the basics of regex? I just want to save time when working with a lot of data. Thanks.
Try something along the lines of this:
(\w+):\s*('?)([^']+?)\2(?=[\n,]) and replace by "\1": "\3"
Demo: http://regex101.com/r/pX9xX6
Edit:
Just tested in Sublime, seems to work fine.
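If you want to sanity-check the pattern outside an editor, here is a quick test of the same find/replace in Python (purely illustrative; the data is a trimmed copy of the example above):

import re

data = """{
    name: 'John',
    team: 'Wolves',
    team_id: 1,
    number: 24
},"""

pattern = r"(\w+):\s*('?)([^']+?)\2(?=[\n,])"
print(re.sub(pattern, r'"\1": "\3"', data))
# Every key and value ends up double-quoted, e.g. "team_id": "1"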
Well, the exact syntax depends on the tool. If you were using vim, for instance:
:%s/'\([^']*\)'/"\1"/g
and
:%s/^\([ ^I]*\)\([^ ^I]*\):/\1"\2":/
would probably do the trick, although you'd want to do a manual check for any quoted quotes.
I admin a couple of pages on FB and we recently got hit by a supposedly fake page.
http://www.facebook.com/pages/Duke-St-Rollins/478408292178951
The page is supposedly a duplicate of this user:
http://www.facebook.com/DukeStRollins
However, when I entered this into graph.facebook.com/478408292178951 I got this returned:
{
"name": "Duke St. Rollins",
"is_published": true,
"talking_about_count": 2,
"category": "Public figure",
"id": "478408292178951",
"link": "http://www.facebook.com/pages/Duke-St-Rollins/478408292178951",
"likes": 2
}
When I entered THIS into graph.facebook.com/Duke-St-Rollins I got this returned:
{
"name": "Duke St. Rollins",
"is_published": true,
"username": "DukeStRollins",
"about": "World famous troll and nemesis of teabaggers.",
"bio": "Press!\n\nhttp://blogs.phoenixnewtimes.com/bastard/2012/07/duke_st_rollins_on_jan_brewer.php \n\nhttp://madmikesamerica.com/2012/07/an-interview-with-duke-st-rollins/\n\nYouTube Channel\nhttp://www.youtube.com/channel/UC_xk6GQzKacHImYl3Vns4VQ\n",
"personal_info": "Follow me on Twitter ",
"talking_about_count": 6450,
"category": "Public figure",
"id": "204170076355643",
"link": "http://www.facebook.com/DukeStRollins",
"likes": 9459,
"cover": {
"cover_id": 261500633955920,
"source": "http://sphotos-a.xx.fbcdn.net/hphotos-ash4/s720x720/376513_261500633955920_779910133_n.jpg",
"offset_y": 92
}
}
If I am understanding how this works correctly, and did this right, does this mean the supposed 'fake' FB page is actually owned by the 'real' Duke?
If I have this wrong (and I hope I do), can someone please explain this to me slowly, like you are talking to a kid, as I am TOTALLY new to doing the FB page stuff and, until yesterday, never even knew about the graph.facebook stuff.
Consider me a noob. Because I am. But I'd really like to know if what I think I am seeing, is what I fear.
No, they aren't the same. The former is a page/public figure. The latter is a user. You can tell them apart by their different IDs (478408292178951 / 204170076355643). They share the same name but can't share the same graph api address because hyphens are ignored (try http://graph.facebook.com/Duke-------------StRollins), which means DukeStRollins and Duke-St-Rollins are effectively identical.
This is, in my opinion, a glitch in the API: a query by name should be able to distinguish between these two resources, even if a hyphen is the only difference between their names; the fact that it cannot just makes it easier for the spoofer to confuse people.
You've probably already seen this: https://developers.facebook.com/docs/reference/api/
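If you want to check this programmatically, here is a quick sketch with the requests library (the access token is a placeholder; current Graph API versions require one even for public objects):

import requests

def fetch(object_id, token="..."):  # token is a placeholder
    url = "https://graph.facebook.com/{}".format(object_id)
    return requests.get(url, params={"access_token": token}).json()

page = fetch("478408292178951")  # the suspected fake page
user = fetch("DukeStRollins")    # the original account

print(page.get("id"), page.get("name"))
print(user.get("id"), user.get("name"))
# Different ids (478408292178951 vs 204170076355643) => two distinct objects.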
In http://www.facebook.com/pages/Duke-St-Rollins/478408292178951, "Duke-St-Rollins" is just a dummy name created by Facebook using the title of the page. You can use anything in its place and it will redirect you to the same page. The links below will all redirect you to http://www.facebook.com/pages/Duke-St-Rollins/478408292178951
http://www.facebook.com/pages/page-name/478408292178951
http://www.facebook.com/pages/page-title-name/478408292178951
http://www.facebook.com/pages/dummy-name/478408292178951
http://www.facebook.com/478408292178951 (note that here Facebook recognizes 478408292178951 as the username for the page since the page didn't set its username yet)
For the other page, http://www.facebook.com/Duke-St-Rollins, "Duke-St-Rollins" is a username set by the page, and hence Facebook uses "dukestrollins" as the Graph node to recognize the page. (Note that in the username any dots or hyphens will be removed automatically.) All the links below redirect you to http://www.facebook.com/DukeStRollins
http://www.facebook.com/duke.st.rollins
http://www.facebook.com/dukestrollins
http://www.facebook.com/Duke-StRollins
http://www.facebook.com/204170076355643
http://www.facebook.com/pages/some-thing-here/204170076355643