How to properly escape fulltext index query parameter in Spring Data Neo4j? - spring-data-neo4j

In a Spring Data Neo4j Repository I have a custom query defined as:
@Query(
value = "CALL db.index.fulltext.queryNodes('mimir_NorseDeity_label_fulltext', $label) YIELD node AS n RETURN n SKIP $skip LIMIT $limit",
countQuery = "CALL db.index.fulltext.queryNodes('mimir_NorseDeity_label_fulltext', $label) YIELD node AS n RETURN count(n)"
)
Page<NorseDeity> fulltextFindByLabel(@Param("label") String label, Pageable page);
But the $label parameter needs to be escaped, because many characters have a special meaning in the Lucene query language.
How do I escape them without adding boilerplate to the Repository interface?
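One possible approach is a small escaping helper applied before the value reaches the repository method, for example from a default method on the interface or from the calling service. The sketch below is a hand-rolled version that backslash-escapes the characters treated as special by the classic Lucene query syntax; the class name LuceneQueryEscaper is hypothetical and is not part of Spring Data Neo4j. If the lucene-queryparser artifact is on the classpath, its QueryParser.escape(String) should do the same job. Note that the keywords AND, OR and NOT are not handled here.

```java
// Hypothetical helper; not part of Spring Data Neo4j or the Neo4j driver.
final class LuceneQueryEscaper {

    // Characters with special meaning in the classic Lucene query syntax:
    // \ + - ! ( ) : ^ [ ] " { } ~ * ? | & /
    private static final String SPECIAL = "\\+-!():^[]\"{}~*?|&/";

    private LuceneQueryEscaper() {
    }

    /** Returns the input with every special character prefixed by a backslash. */
    static String escape(String input) {
        StringBuilder out = new StringBuilder(input.length() + 8);
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (SPECIAL.indexOf(c) >= 0) {
                out.append('\\');
            }
            out.append(c);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // prints: Thor \(god of thunder\)
        System.out.println(escape("Thor (god of thunder)"));
    }
}
```

The escaped string would then be passed as the label argument of fulltextFindByLabel, so the query text itself stays unchanged.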

Related

Spark Scala: SQL rlike vs Custom UDF

I have a scenario where 10K+ regular expressions are stored in a table along with various other columns, and this table needs to be joined against an incoming dataset. Initially I used the Spark SQL rlike method as shown below, and it handled the load as long as the incoming record count stayed below 50K.
PS: The regular expression reference data is a broadcasted dataset.
dataset.join(regexDataset.value, expr("input_column rlike regular_exp_column"))
Then I wrote a custom UDF to transform them using Scala native regex search as below,
Below val collects the reference data as Array of tuples.
val regexPreCalcArray: Array[(Int, Regex)] = {
regexDataset.value
.select( "col_1", "regex_column")
.collect
.map(row => (row.get(0).asInstanceOf[Int],row.get(1).toString.r))
}
Implementation of Regex matching UDF,
def findMatchingPatterns(regexDSArray: Array[(Int,Regex)]): UserDefinedFunction = {
udf((input_column: String) => {
for {
text <- Option(input_column)
matches = regexDSArray.filter(regexDSValue => regexDSValue._2.findFirstIn(text).isDefined)
if matches.nonEmpty
} yield matches.map(x => x._1).min
}, IntegerType)
}
The join is done as below. When several regexes match, the UDF returns the smallest unique ID from the reference data, and that ID is then joined against the reference data to retrieve the other columns needed for the result:
dataset.withColumn("min_unique_id", findMatchingPatterns(regexPreCalcArray)($"input_column"))
.join(regexDataset.value, $"min_unique_id" === $"unique_id" , "left")
But this also becomes very slow, with skewed execution (one executor task runs for a very long time), once the record count rises above 1M. Spark advises against UDFs because they degrade performance. Are there other best practices I should apply here, or is there a better API for Scala regex matching than what I've written? Any suggestions for doing this efficiently would be very helpful.
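One thing that may reduce the per-row work in the UDF above: only the smallest matching ID is needed, so the patterns can be sorted by ID once and the scan can stop at the first hit instead of testing all 10K+ expressions on every row. A minimal Java sketch of that idea using java.util.regex follows; the class and method names are hypothetical, and this does not address the task skew by itself.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Optional;
import java.util.regex.Pattern;

// Hypothetical helper illustrating "sort once, stop at first match".
final class MinIdRegexMatcher {

    /** One reference row: a unique ID plus its precompiled pattern. */
    static final class Rule {
        final int id;
        final Pattern pattern;

        Rule(int id, String regex) {
            this.id = id;
            this.pattern = Pattern.compile(regex); // compiled once, reused per row
        }
    }

    private final List<Rule> rulesById;

    MinIdRegexMatcher(List<Rule> rules) {
        List<Rule> sorted = new ArrayList<>(rules);
        sorted.sort(Comparator.comparingInt((Rule r) -> r.id));
        this.rulesById = sorted;
    }

    /** Returns the smallest ID whose pattern is found in the text, if any. */
    Optional<Integer> minMatchingId(String text) {
        if (text == null) {
            return Optional.empty();
        }
        for (Rule rule : rulesById) {
            // find() is an unanchored search, like Scala's findFirstIn
            if (rule.pattern.matcher(text).find()) {
                return Optional.of(rule.id); // ascending order: first hit is the minimum
            }
        }
        return Optional.empty();
    }
}
```

Rows that match nothing still pay for every pattern, so the benefit depends on how often a low ID matches.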

Does @DynamoDBAttribute support document paths in the attribute name?

I've checked the DynamoDB documentation, and I can't find anything to confirm or deny whether this is allowed.
Is it valid to use a Document Path for the attributeName of @DynamoDBAttribute, as in this code snippet?
@DynamoDBDocument
public class MyClass {
@DynamoDBAttribute(attributeName="object.nestedObject.myAttribute")
private String myAttribute;
.
.
.
// Getters & Setters, etc
}
Edit: Just to be clear, I am specifically trying to find out whether document paths are valid in the @DynamoDBAttribute Java annotation as a way to directly access a nested value. I know that document paths work in general when specifying a query, but this question is specifically about DynamoDBMapper annotations.
Yes, the attribute name can contain a dot. However, in my opinion, a dot in an attribute name is not recommended, because the dot is normally used to navigate the tree of a Map attribute.
The following are the naming rules for DynamoDB:
All names must be encoded using UTF-8, and are case-sensitive.
Table names and index names must be between 3 and 255 characters long,
and can contain only the following characters:
a-z
A-Z
0-9
_ (underscore)
- (dash)
. (dot)
Attribute names must be between 1 and 255 characters long.
Accessing Map Elements:-
The dereference operator for a map element is . (a dot). Use a dot as
a separator between elements in a map:
MyMap.nestedField
MyMap.nestedField.deeplyNestedField
I can create an item with an attribute name containing a dot and query the item using a FilterExpression successfully.
It works the same way in the AWS SDKs for all languages. As long as the data type is defined as String, it works as expected.
Some JS examples:-
Create Item:-
var table = "Movies";
var year = 2017;
var title = "putitem data test 2";
var dotAttr = "object.nestedObject.myAttribute";
var params = {
TableName:table,
Item:{
"yearkey": year,
"title": title,
"object.nestedObject.myAttribute": "S123"
},
ReturnValues : 'NONE'
};
Update:-
It works fine with the @DynamoDBAttribute annotation as well.
private String dotAttr;
@DynamoDBAttribute(attributeName = "object.nestedObject.myAttribute")
public String getDotAttr() {
return dotAttr;
}
It is not possible to reference a nested path using the attribute name in a @DynamoDBAttribute. I needed to use a POJO type with an added @DynamoDBDocument annotation to represent each level of nesting.

Is There an HBase 1.2 Equivalent of Accumulo's RegExFilter?

I have a background in Accumulo 1.6, and I've been directed to analyze the API of HBase 1.2. Does HBase 1.2 have a built-in functional equivalent to the RegExFilter in Accumulo? I've not seen one, so below I've attempted to hack up something that's functionally equivalent.
On the Accumulo side...
Accumulo's RegExFilter supports regular expressions in any combination of these fields:
Row
Column Family
Column Qualifier
Value
RegExFilter has an option to support "OR" behavior across the expressions -- to return the entry if any of the expressions match.
RegExFilter has an option to support matches on substrings.
On the HBase 1.2 side, my understanding is that...
A RegexStringComparator can be used to filter a field by using it with a CompareFilter implementation.
A RowFilter can filter based on the row key portion of the key
A QualifierFilter can filter based on the column qualifier portion of the key
A ValueFilter can filter based on the value
Multiple filters can be added to a FilterList, itself a subclass of Filter, and AND/OR behavior can be defined among the Filters in the FilterList.
In HBase, to mimic Accumulo's RegExFilter, I could do something like the following:
List<Filter> filters = new ArrayList<Filter>();
// Regex for row
if (rowRegEx != null) {
filters.add(new RowFilter(CompareOp.EQUAL,
new RegexStringComparator(rowRegEx)));
}
// Regex for column family
if (colFamRegEx != null) {
filters.add(new FamilyFilter(CompareOp.EQUAL,
new RegexStringComparator(colFamRegEx)));
}
// Regex for column qualifier
if (colQualRegEx != null) {
filters.add(new QualifierFilter(CompareOp.EQUAL,
new RegexStringComparator(colQualRegEx)));
}
// Regex for value
if (valRegEx != null) {
filters.add(new ValueFilter(CompareOp.EQUAL,
new RegexStringComparator(valRegEx)));
}
// For "or" behavior, use FilterList.Operator.MUST_PASS_ONE
FilterList filterList = new FilterList(
FilterList.Operator.MUST_PASS_ALL, filters);
Scan scan = new Scan();
scan.setFilter(filterList);
// TODO How to mimic "match on substring" behavior of RegExFilter?
Question 1 Does my (untested) code snippet seem like a legitimate HBase 1.2 equivalent to Accumulo 1.6's RegExFilter? If not, then what could be improved?
Question 2 I assume Accumulo's RegExFilter's "match on substrings" option could be implemented in HBase 1.2 using a slightly differently defined regular expression. Is there a less clunky way to do this, so that ideally I could use the same regular expression in both Accumulo and HBase?
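On Question 2, the difference most likely comes down to java.util.regex semantics: Matcher.matches() requires the whole input to match, while Matcher.find() accepts a match anywhere in the input. As far as I can tell, HBase's RegexStringComparator uses find()-style matching with its default Java engine, and Accumulo's RegExFilter uses matches() unless the substring option is set; both points are assumptions that should be checked against the 1.2 and 1.6 sources. Under that assumption, the same expression can be shared by anchoring it explicitly when whole-value matching is wanted. A plain-Java sketch with hypothetical helper names:

```java
import java.util.regex.Pattern;

// Hypothetical helper; uses only java.util.regex, no HBase or Accumulo classes.
final class RegexMatchModes {

    private RegexMatchModes() {
    }

    /** True if the regex matches anywhere inside the input (substring match). */
    static boolean matchesSubstring(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    /** True only if the regex matches the entire input. */
    static boolean matchesWhole(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    /**
     * Wraps a regex so that a find()-style engine only accepts whole-input matches.
     * The non-capturing group keeps alternations such as a|b intact.
     */
    static String anchorWhole(String regex) {
        return "\\A(?:" + regex + ")\\z";
    }
}
```

With this, an expression passed through anchorWhole behaves the same under find() and matches(), while the unwrapped expression gives substring behavior under find().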

PIG regex extract then filter the unnamed regex tuple

I have a string as:
[["structure\/","structure\/home_page\/","structure\/home_page\/headline_list\/","structure\/home_page\/latest\/","topic\/","topic\/location\/","topic\/location\/united_states\/","topic\/location\/united_states\/ohio\/","topic\/location\/united_states\/ohio\/franklin\/","topic\/news\/","topic\/news\/politics\/","topic\/news\/politics\/elections\/,topic\/news\/politics\/elections\/primary\/"]]
I want to use REGEX_EXTRACT_ALL to turn it into elements of a tuple, separated by ",". Then I need to keep only the ones that don't contain structure or location.
However, I get an error saying that the regex result type can't be filtered. Any ideas?
By the way, the end goal is to parse out the longest hierarchy, like (topic|news|politics|elections|primary)
update the script:
data = load '/web/visit_log/20160303'
USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad') as json:map[];
a = foreach data generate json#section as sec_type;
b = foreach act_flt GENERATE ..host, REGEX_EXTRACT_ALL(act_type, 'topic..(?!location)(.*?)"') as extr;
store b into '/user/tad/sec_hir';
The syntax of the filter with matches seems incorrect. The data doesn't seem to have () in it.
c = filter b by not extr matches '(structure|location)';
Try changing this to
c = filter b by not (extr matches 'structure|location');

How to change a node's property based on one of its other properties in Neo4j

I just started using Neo4j server 2.0.1. I am having trouble writing a Cypher script that changes one of a node's properties to a value based on one of its already defined properties.
So if I created these nodes:
CREATE (:Post {uname:'user1', content:'Bought a new pair of pants today', kw:''}),
(:Post {uname:'user2', content:'Catching up on Futurama', kw:''}),
(:Post {uname:'user3', content:'The last episode of Game of Thrones was awesome', kw:''})
I want the script to look at the content property, pick out the word "Bought", and set the kw property to it, using a regular expression to pick out words longer than five characters. So, user2's post kw would be "Catching, Futurama" and user3's post kw would be "episode, Thrones, awesome".
Any help would be greatly appreciated.
You could do something like this:
MATCH (p:Post { uname:'user1' })
WHERE p.content =~ "Bought .+"
SET p.kw=filter(w in split(p.content," ") WHERE length(w) > 5)
If you want to do that for all posts, which might not be the fastest operation:
MATCH (p:Post)
WHERE p.content =~ "Bought .+"
SET p.kw=filter(w in split(p.content," ") WHERE length(w) > 5)
split splits a string into a collection of parts, in this case words separated by space
filter filters a collection by a condition behind WHERE, only the elements that fulfill the condition are kept
Probably you'd rather want to create nodes for those keywords and link the post to the keyword nodes.
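To make the split and filter steps concrete, here is the same logic as a small Java sketch: split the content on spaces and keep the words that are strictly longer than a threshold. The class and method names are hypothetical; the sample strings are the post contents from the question.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of split(content, " ") followed by a length filter.
final class KeywordExtractor {

    private KeywordExtractor() {
    }

    /** Returns the space-separated words of content that are longer than threshold. */
    static List<String> keywords(String content, int threshold) {
        List<String> result = new ArrayList<>();
        for (String word : content.split(" ")) {
            if (word.length() > threshold) {
                result.add(word);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // prints: [Catching, Futurama]
        System.out.println(keywords("Catching up on Futurama", 5));
    }
}
```

As in the Cypher version, the result is a collection of words rather than a single comma-separated string.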