Mixed word and geo search - geocoding

How can I make a search query that:
- finds everything around a location (using aroundLatLng and aroundRadius),
- is biased by a word search,
- doesn't require results to include the words (i.e. they're used for relevance, but a record that matches no words is not excluded a priori)?
So for example, if I query for "great view" near {lat: 48.8, lng: 2.3}, it should rank by word relevance first, then by distance, and records that don't match any words but are near Paris should still come up in the results.

To make sure the query words are not all mandatory, you can use the optionalWords feature to mark every query word as optional (beware: at least one will always need to match):
var query = "great view";
index.search(query, {
  optionalWords: query,     // make every word of the query optional
  aroundLatLng: '48.8,2.3',
  aroundRadius: 10000       // meters
}).then(....);
Then you can optionally move the geo criterion of the default ranking formula down, just before the custom one, so textual relevance is applied before distance.
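With the JavaScript client, that settings change could look like the following sketch (the ranking array assumes Algolia's documented default order of typo, geo, words, filters, proximity, attribute, exact, custom):
// Move "geo" down so textual relevance is applied before distance.
index.setSettings({
  ranking: ["typo", "words", "filters", "proximity", "attribute", "exact", "geo", "custom"]
});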

Remove columns by name based on pattern

How can I remove a large number of columns by name based on a pattern?
A data set exported from Jira has a ton of extra columns that I have no interest in: 400 log entries, 50 comments, dozens of links or attachments. The problem is that they get random numbers assigned, which means that removing them with hardcoded column names will not work. That would look like this and break as the numbers change:
= Table.RemoveColumns(#"Previous Step",{"Watchers", "Watchers_10", "Watchers_11", "Watchers_12", "Watchers_13", "Watchers_14", "Watchers_15", "Watchers_16", "Watchers_17", "Watchers_18", "Watchers_19", "Watchers_20", "Watchers_21", "Watchers_22", "Watchers_23", "Watchers_24", "Watchers_25", "Watchers_26", "Watchers_27", "Watchers_28", "Log Work", "Log Work_29", "Log Work_30", "Log Work_31", "Log Work_32", ...
How can I remove a large number of columns by using a pattern in the name, i.e. remove all "Log Work" columns?
The best way I've found is to use List.FindText on Table.ColumnNames to get a list of column names dynamically based on a target string:
= Table.RemoveColumns(#"Previous Step", List.FindText(Table.ColumnNames(#"Previous Step"), "Log Work"))
This works by first grabbing the full list of column names and keeping only the ones that match the search string. That list is then passed to RemoveColumns as normal.
The limitation appears to be that List.FindText doesn't offer complex pattern matching.
Of course, when you want to remove a lot of different patterns, having individual steps isn't very practical. A way to combine them is to use List.Combine to join the resulting lists of column names together.
That becomes:
= Table.RemoveColumns(L, List.Combine({
    List.FindText(Table.ColumnNames(L), "Watchers_"),
    List.FindText(Table.ColumnNames(L), "Log Work"),
    List.FindText(Table.ColumnNames(L), "Comment"),
    List.FindText(Table.ColumnNames(L), "issue link"),
    List.FindText(Table.ColumnNames(L), "Attachment")
}))
So what's actually written there is:
Table.RemoveColumns(PreviousStep, List.Combine({ foundList1, foundList2, ... }))
Note the { } that signifies a list: List.Combine only accepts a single argument, which must itself be a list of lists, so both the braces and the Combine call are required here.
Also note the L here instead of #"Previous Step"; that's just to make the whole thing more readable, achieved by inserting a step named L that contains nothing but = #"Promoted Headers".
This allows relatively maintainable removal of multiple columns by name, but it's far from perfect.
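If you need more than List.FindText's plain substring search, one possible workaround is List.Select with a custom predicate (a sketch; Text.StartsWith is standard M, adapt the condition to your own patterns):
// Same query as above, with L as the renamed previous step.
// Text.StartsWith matches column-name prefixes instead of substrings anywhere.
= Table.RemoveColumns(L,
    List.Select(Table.ColumnNames(L), each Text.StartsWith(_, "Watchers")))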

NLP with fewer than 20 words on Google Cloud

According to this documentation, the classifyText method requires at least 20 words:
https://cloud.google.com/natural-language/docs/classifying-text#language-classify-content-nodejs
If I send in fewer than 20 words, I get this no matter how clear the content is:
Invalid text content: too few tokens (words) to process.
I'm looking for a way to force this through without disrupting the NLP too much. Are there neutral vector words that can be appended to short phrases that would allow classifyText to process anyway?
ex.
async function quickstart() {
  const language = require('@google-cloud/language');
  const client = new language.LanguageServiceClient();

  // fewer than 20 words. What if I append some other neutral words,
  // e.g. a, of, it, to? Or would it be better to repeat the phrase?
  const text = 'The Atlanta Braves is the best team.';
  const document = {
    content: text,
    type: 'PLAIN_TEXT',
  };

  const [classification] = await client.classifyText({document});
  console.log('Categories:');
  classification.categories.forEach(category => {
    console.log(`Name: ${category.name}, Confidence: ${category.confidence}`);
  });
}
quickstart();
The problem with this is that you're adding bias no matter what kind of text you send.
Your only chance is to fill your string up to the minimum word limit with empty words that will be filtered out by the preprocessor and tokenizer before they reach the neural network.
I would try to add a string suffix at the end of the sentence with just stopwords from NLTK, like this:
document.content += ". and ourselves as herself for each all above into through nor me and then by doing"
Why the end? Because text usually has more information at the beginning.
In case Google does not filter stopwords behind the scenes (which I doubt), this would add just white noise where the network has no focus or attention.
Remember: DO NOT add this string when you already have enough words, because you are billed per block of 1,000 characters before they are filtered.
I would also add that string suffix to the sentences in your train/test/validation set that have fewer than 20 words and see how it works. The network should learn to ignore the appended sentence.
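As a sketch of the whole idea (the padForClassification helper and its threshold are just the suggestion above wrapped in a function, not part of the Google API):
// Sketch: pad short texts with NLTK stopwords before classification.
// The 20-word threshold is the documented classifyText minimum.
const PAD = '. and ourselves as herself for each all above into through nor me and then by doing';

function padForClassification(text) {
  const wordCount = text.trim().split(/\s+/).length;
  // Only pad when needed: the extra characters are billed too.
  return wordCount < 20 ? text + PAD : text;
}

const document = {
  content: padForClassification('The Atlanta Braves is the best team.'),
  type: 'PLAIN_TEXT',
};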

Lucene: search for sentences that contain one term twice

I have pairs of search strings, and I want to use Lucene to search for sentences that contain all the terms contained in these strings. So for example, if I have the two search strings "white shark" and "fish", I want all sentences containing "white", "shark", and "fish". Apparently, with Lucene this can be done rather easily by means of a boolean query; this is how I do it in my code:
String search = str1 + " " + str2;
BooleanQuery booleanQuery = new BooleanQuery();
QueryParser queryParser = new QueryParser(...);
queryParser.setDefaultOperator(QueryParser.Operator.AND);
booleanQuery.add(queryParser.parse(search), BooleanClause.Occur.MUST);
However, I also have pairs of search strings where one string is a subpart of the other, such as "timber wolf" and "wolf", and in these cases I would like to only get sentences that contain "wolf" at least twice (and "timber" at least once). Is there any way to achieve this with Lucene? Many thanks in advance for your answers!
Keep in mind that a doc that has both "timber wolf" and a separate "wolf" will get scored higher, all else being equal, since the term "wolf" occurs twice, giving it a higher tf score. In most cases like this, having those results come first is acceptable, usually even desirable.
That said, I believe you could get what you want using a phrase query with slop, and having the slop set adequately high. Something like:
"timber wolf wolf"~10000
Which is probably high enough for most cases. This would require two instances of wolf and one of timber.
However, if you need timber wolf to appear (that is, those two terms adjacent and in order), you'll need to abandon the query parser, and construct the appropriate queries yourself. SpanQueries, specifically.
import org.apache.lucene.index.Term;
import org.apache.lucene.search.spans.*;

SpanQuery wolfQuery = new SpanTermQuery(new Term("myField", "wolf"));
SpanQuery[] timberWolfSubQueries = {
    new SpanTermQuery(new Term("myField", "timber")),
    new SpanTermQuery(new Term("myField", "wolf"))
};
// arguments "0, true" mean 0 slop and in order (respectively)
SpanQuery timberWolfQuery = new SpanNearQuery(timberWolfSubQueries, 0, true);

SpanQuery[] finalSubQueries = { wolfQuery, timberWolfQuery };
// arguments "10000, false" mean 10000 slop and not (necessarily) in order
SpanQuery finalQuery = new SpanNearQuery(finalSubQueries, 10000, false);
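To actually run the combined query, something like this sketch should work (assuming an existing IndexReader named reader over your sentence index; a SpanQuery is a regular Query, so it can be passed to search directly):
// hypothetical searcher over the index that contains "myField"
IndexSearcher searcher = new IndexSearcher(reader);
TopDocs hits = searcher.search(finalQuery, 10); // top 10 matching sentences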

How to change a node's property based on one of its other properties in Neo4j

I just started using Neo4j server 2.0.1. I am having trouble writing a Cypher script to change one of a node's properties based on one of its already defined properties.
So say I created these nodes:
CREATE (:Post {uname:'user1', content:'Bought a new pair of pants today', kw:''}),
(:Post {uname:'user2', content:'Catching up on Futurama', kw:''}),
(:Post {uname:'user3', content:'The last episode of Game of Thrones was awesome', kw:''})
I want the script to look at the content property, pick out the word "Bought", and set the kw property to that, using a regular expression to pick out word(s) longer than five characters. So, user2's post kw would be "Catching, Futurama" and user3's post kw would be "episode, Thrones, awesome".
Any help would be greatly appreciated.
You could do something like this:
MATCH (p:Post { uname:'user1' })
WHERE p.content =~ "Bought .+"
SET p.kw=filter(w in split(p.content," ") WHERE length(w) > 5)
If you want to do that for all posts (which might not be the fastest operation), drop the outer WHERE clause:
MATCH (p:Post)
SET p.kw=filter(w in split(p.content," ") WHERE length(w) > 5)
split splits a string into a collection of parts, in this case words separated by spaces.
filter filters a collection by the condition after WHERE; only the elements that fulfill the condition are kept.
Probably you'd rather want to create nodes for those keywords and link the post to the keyword nodes, as sketched below.
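A minimal sketch of that alternative (note: UNWIND requires Neo4j 2.1 or later, and the Keyword label and HAS_KEYWORD relationship type are just hypothetical names):
MATCH (p:Post)
UNWIND filter(w IN split(p.content, " ") WHERE length(w) > 5) AS word
MERGE (k:Keyword {name: word})
MERGE (p)-[:HAS_KEYWORD]->(k)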

map-reduce function in CouchDB

I have a Java program that reads all the words of a PDF file. I saved the words with their page numbers in a database (CouchDB). Now I want to write a map and a reduce function which list each word with the page numbers where the word occurs, but if a word occurs more than once on a page, I want just one entry. The result should be a row with the word and a second row with the list of page numbers (a string separated by commas). Each word with its page number is a separate document in CouchDB.
How can I do this with a map-reduce function (filtering out duplicate page-number entries)?
Thanks for your help.
Surely there is more than one way of doing it; I'd go for something simple. Let's say your documents look somewhat like this:
{ 'type': 'word-index', 'word': 'Great', 'page_number': 45 }
This is a result of finding the word 'Great' on page 45. Now your view index is created by a view function:
function map(doc) {
  if (doc.type == 'word-index') {
    emit([doc.word, doc.page_number], null);
  }
}
For the reduce part, just use the "_count" builtin.
Now to get the list of all the occurrences of the word "Great" in your book, just query your view with startkey=["Great"] and endkey=["Great", {}]. The result would look somewhat like:
["Great", 45], 4
["Great", 70], 7
This means that the word "Great" appeared 4 times on page 45 and 7 times on page 70. You can extract the comma-separated list you need from it; the number of occurrences is a bonus.
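For example, turning those rows into the comma-separated page list could look like this (a sketch, assuming the rows come back as shown above):
// rows as returned by the view query with group_level=2 (see the edit below)
var rows = [
  { key: ["Great", 45], value: 4 },
  { key: ["Great", 70], value: 7 }
];
var pages = rows.map(function (r) { return r.key[1]; }).join(", "); // "45, 70"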
--edit--
You also have to use group_level=2 in your query. If you don't, the result of the query will simply be a single row with the count of all the documents you have.
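Put together, the full view query would look somewhat like this (a sketch; the database name mydb and the design document words with view by_word_page are hypothetical):
GET /mydb/_design/words/_view/by_word_page?startkey=["Great"]&endkey=["Great",{}]&group_level=2
(Remember to URL-encode the brackets, quotes, and braces when sending this over HTTP.)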