Is There an HBase 1.2 Equivalent of Accumulo's RegExFilter?

I have a background in Accumulo 1.6, and I've been directed to analyze the API of HBase 1.2. Does HBase 1.2 have a built-in functional equivalent to the RegExFilter in Accumulo? I've not seen one, so below I've attempted to hack up something that's functionally equivalent.
On the Accumulo side...
Accumulo's RegExFilter supports regular expressions in any combination of these fields:
Row
Column Family
Column Qualifier
Value
RegExFilter has an option to support "OR" behavior across the expressions -- to return the entry if any of the expressions match.
RegExFilter has an option to support matches on substrings.
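For concreteness, setting that up in Accumulo 1.6 looks roughly like this (a from-memory sketch; the priority, iterator name, and scanner variable are illustrative, and I believe the matchSubstring overload exists but haven't re-checked):
import org.apache.accumulo.core.client.IteratorSetting;
import org.apache.accumulo.core.client.Scanner;
import org.apache.accumulo.core.iterators.user.RegExFilter;

IteratorSetting setting = new IteratorSetting(30, "regexFilter", RegExFilter.class);
// Any of the four regexes may be null to skip that field; the final
// boolean enables "OR" behavior across the non-null expressions.
RegExFilter.setRegexs(setting, rowRegEx, colFamRegEx, colQualRegEx, valRegEx, true);
// An overload adds a trailing matchSubstring flag for substring matching.
scanner.addScanIterator(setting);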
On the HBase 1.2 side, my understanding is that...
A RegexStringComparator can be used to filter a field by pairing it with a CompareFilter implementation.
A RowFilter can filter based on the row key portion of the key.
A FamilyFilter can filter based on the column family portion of the key.
A QualifierFilter can filter based on the column qualifier portion of the key.
A ValueFilter can filter based on the value.
Multiple filters can be added to a FilterList, itself a subclass of Filter, and AND/OR behavior can be defined among the Filters in the FilterList.
In HBase, to mimic Accumulo's RegExFilter, I could do something like the following:
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.CompareFilter.CompareOp;
import org.apache.hadoop.hbase.filter.FamilyFilter;
import org.apache.hadoop.hbase.filter.Filter;
import org.apache.hadoop.hbase.filter.FilterList;
import org.apache.hadoop.hbase.filter.QualifierFilter;
import org.apache.hadoop.hbase.filter.RegexStringComparator;
import org.apache.hadoop.hbase.filter.RowFilter;
import org.apache.hadoop.hbase.filter.ValueFilter;

List<Filter> filters = new ArrayList<Filter>();
// Regex for row
if (rowRegEx != null) {
    filters.add(new RowFilter(CompareOp.EQUAL,
            new RegexStringComparator(rowRegEx)));
}
// Regex for column family
if (colFamRegEx != null) {
    filters.add(new FamilyFilter(CompareOp.EQUAL,
            new RegexStringComparator(colFamRegEx)));
}
// Regex for column qualifier
if (colQualRegEx != null) {
    filters.add(new QualifierFilter(CompareOp.EQUAL,
            new RegexStringComparator(colQualRegEx)));
}
// Regex for value
if (valRegEx != null) {
    filters.add(new ValueFilter(CompareOp.EQUAL,
            new RegexStringComparator(valRegEx)));
}
// For "OR" behavior, use FilterList.Operator.MUST_PASS_ONE instead
FilterList filterList = new FilterList(
        FilterList.Operator.MUST_PASS_ALL, filters);
Scan scan = new Scan();
scan.setFilter(filterList);
// TODO How to mimic the "match on substring" behavior of RegExFilter?
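For completeness, the Scan would then be wired up along these lines (untested; the table name is just a placeholder):
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Table;

try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
     Table table = conn.getTable(TableName.valueOf("myTable"));
     ResultScanner results = table.getScanner(scan)) {
    for (Result result : results) {
        // Process each row that passed the regex filters.
    }
}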
Question 1: Does my (untested) code snippet seem like a legitimate HBase 1.2 equivalent to Accumulo 1.6's RegExFilter? If not, then what could be improved?
Question 2: I assume Accumulo's RegExFilter's "match on substrings" option could be implemented in HBase 1.2 using a slightly differently defined regular expression. Is there a less clunky way to do this, so that ideally I could use the same regular expression in both Accumulo and HBase?

Related

Replace expression format xx-xx-xxxx_12345678

IDENTIFIER
31-03-2022_13636075
01-04-2022_13650262
04-04-2022_13663174
05-04-2022_13672025
20220099001
11614491_R
10781198
00000000000
11283627_P
11614491_R
-1
How can I remove (only) the "XX-XX-XXXX_" part in certain values of a column in SSIS, without affecting values that don't have this format? For example, "21-05-2022_12345678" should become "12345678", but the other values should be left untouched. These are just examples of the many rows in this column, so I want only the values that match this format to be affected.
SELECT REVERSE(substring(REVERSE('09-03-2022_13481330'),0,CHARINDEX('_',REVERSE('09-03-2022_13481330'),0)))
result
13481330
But this also affects other values. Also, this is in SSMS, not SSIS, because I am not sure how to translate this expression into SSIS code.
Update: The corrected code in SSIS goes as follows:
(FINDSTRING(IDENTIFIER,"__-__-____[_]",1) == 1) ? SUBSTRING(IDENTIFIER,12,LEN(IDENTIFIER) - 11) : IDENTIFIER
Do you have access to the SQL source? You can do this in SQL by using LIKE and crafting a match pattern with the single-character wildcard _; please see the example below:
DECLARE @Value VARCHAR(50) = '09-03-2022_13481330'
SELECT CASE WHEN @Value LIKE '__-__-____[_]%' THEN
    SUBSTRING(@Value,12,LEN(@Value)-11) ELSE @Value END
Please see the Microsoft documentation on LIKE and using single-character wildcards.
If you don't have access to the source SQL, it gets a bit trickier, as you might need to use regex in a Script Task, or maybe there is an expression you can apply.
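If you do go the script route, the core of it is a single anchored regex replace. SSIS Script Tasks are written in C# or VB.NET, but the pattern logic carries over directly; here is a minimal sketch of that logic in Java (class and method names are mine):
import java.util.regex.Pattern;

public class StripDatePrefix {
    // Anchored so only values that start with a dd-dd-dddd_ prefix are touched.
    private static final Pattern PREFIX = Pattern.compile("^\\d{2}-\\d{2}-\\d{4}_");

    static String clean(String identifier) {
        // Non-matching values pass through unchanged.
        return PREFIX.matcher(identifier).replaceFirst("");
    }

    public static void main(String[] args) {
        System.out.println(clean("09-03-2022_13481330")); // 13481330
        System.out.println(clean("11614491_R"));          // 11614491_R
    }
}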

How to normalize fields delimited by colon that are inside a single column in Informatica Cloud

I need help to normalize the field "DSC_HASH", which is inside a single column and delimited by colons.
Input and expected output (shown as screenshots in the original post, not reproduced here): DSC_HASH contains semicolon-separated entries, each a colon-delimited pair, and each entry should produce its own output row.
I achieved what I needed with a Java transformation:
1) In the Java transformation I created 4 output columns: COD1_out, COD2_out, COD3_out, and DSC_HASH_out.
2) Then I put in the following code:
// Split DSC_HASH into its semicolon-separated entries, then emit one
// output row per entry, copying the other columns through unchanged.
String[] column_split;
String column_delimiter = ";";
String[] column_data;
String data_delimiter = ":";

column_split = DSC_HASH.split(column_delimiter);
COD1_out = COD1;
COD2_out = COD2;
COD3_out = COD3;
for (int i = 0; i < column_split.length; i++) {
    // Each entry is a colon-delimited pair; keep the part before the colon.
    column_data = column_split[i].split(data_delimiter);
    DSC_HASH_out = column_data[0];
    generateRow();
}
There are no generic parsers or loop constructs in Informatica that can take one record and output an arbitrary number of records.
There are some ways you can bypass this limitation:
Using the Java Transformation, as you did, which is probably the easiest... if you know Java :) There may be limitations to performance or multi-threading.
Using a Router or a Normalizer with a fixed number of output records, high enough to cover all your cases, then filtering out empty records. The expressions to extract the fields are a bit complex to write (and maintain).
Using the XML Parser, but you have to convert your data to XML first, and design an XML schema. For example, your first line would be changed into (split across multiple lines for readability):
<e><n>2320</n><h>-1950312402</h></e>
<e><n>410</n><h>103682488</h></e>
<e><n>4301</n><h>933882987</h></e>
<e><n>110</n><h>-2069728628</h></e>
Using SQL Transformation or Stored Procedure Transformation to use standard or custom database functions, but that would result in one SQL query per input row, which is bad performance-wise.
Using a Custom Transformation. Does anyone want to write C++ for that?
The Java Transformation is clearly a good solution for this situation.

SPARQL: combine and exclude regex filters

I want to filter my SPARQL query for specific keywords while at the same time excluding other keywords. I thought this could easily be accomplished with FILTER (regex(str(?var),"includedKeyword","i") && !regex(str(?var),"excludedKeyword","i")). It works without the "!" condition, but not with it. I also tried separating the FILTER statements, but to no avail.
I used this query on http://europeana.ontotext.com/ :
PREFIX dc: <http://purl.org/dc/elements/1.1/>
PREFIX edm: <http://www.europeana.eu/schemas/edm/>
PREFIX ore: <http://www.openarchives.org/ore/terms/>
SELECT DISTINCT ?CHO
WHERE {
?proxy dc:subject ?subject .
FILTER ( regex(str(?subject),"gemälde","i") && !regex(str(?subject),"Fotografie","i") )
?proxy edm:type "IMAGE" .
?proxy ore:proxyFor ?CHO.
?agg edm:aggregatedCHO ?CHO; edm:country "germany".
}
But I always get, as the first result, a row with the title "Gemäldegalerie", which has a dc:subject of "Fotografie" (the one I want excluded). I think the problem lies in the fact that one object in the Europeana database can have more than one dc:subject property, so maybe the filter only looks at one of these properties while ignoring the others.
Any ideas? Would be very thankful!
The problem is that your combined filter checks for the same binding of ?subject. So it succeeds if at least one value of ?subject matches both conditions (which is almost always true, because the string "Gemäldegalerie", for example, matches your first regex and does not match the second).
So for the negative condition, you need to formulate something that checks for all possible values, rather than just one particular value. You can do this using SPARQL's NOT EXISTS function, for example like this:
PREFIX dc: <http://purl.org/dc/elements/1.1/>
PREFIX edm: <http://www.europeana.eu/schemas/edm/>
PREFIX ore: <http://www.openarchives.org/ore/terms/>
SELECT DISTINCT ?CHO
WHERE {
?proxy edm:type "IMAGE" .
?proxy ore:proxyFor ?CHO.
?agg edm:aggregatedCHO ?CHO; edm:country "germany".
?proxy dc:subject ?subject .
FILTER(regex(str(?subject),"gemälde","i"))
FILTER NOT EXISTS {
?proxy dc:subject ?otherSubject.
FILTER(regex(str(?otherSubject),"Fotografie","i"))
}
}
As an aside: since you are doing regular expression checks, and are now combining them with a NOT EXISTS operator, this is likely to become very expensive for the query processor quite quickly. You may want to think about alternative ways to formulate your query, for example using the exact subject string to include or exclude, which eliminates the regex (the exclusion could then be written as FILTER NOT EXISTS { ?proxy dc:subject "Fotografie" }). You could even have a look at some non-standard extensions the SPARQL endpoint might provide: OWLIM, for example, the store on which the Europeana endpoint runs, supports various full-text-search extensions, though I am not sure they are enabled in the Europeana endpoint.

Setting a regex optional group to None in Scala

I have the following regular expression pattern that matches fully qualified Microsoft SQL Server table names ([dbName].[schemaName].[tableName]), where the schema name is optional:
val tableNamePattern = """\[(\w+)\](?:\.\[(\w+)\])?\.\[(\w+)\]""".r
I am using it like this:
val tableNamePattern(database, schema, tableName) = fullyQualifiedTableName
When the schema name is missing (e.g.: [dbName].[tableName]), the schema value gets set to null.
Is there a Scala idiomatic way to set it to None instead, and to Some(schema) when the schemaName is provided?
Some people, when confronted with a problem, think
“I know, I'll use regular expressions.” Now they have two problems.
-- Jamie Zawinski
I'm going to copy the code from the accepted answer on the linked question, without giving credit either. Here it is:
object Optional {
  def unapply[T](a: T) = if (null == a) Some(None) else Some(Some(a))
}

val tableNamePattern(database, Optional(schema), tableName) = fullyQualifiedTableName
For "[dbName].[tableName]", schema is now bound to None; when a schema name is present, it is bound to Some("schemaName").
PS: Just today I wondered on Twitter whether creating special-case extractors is as common in practice as it is in suggested answers. :)

How does Lucene score results in a RegexQuery?

I can see how two values, when doing a regular/fuzzy full-text search, can be compared to determine which one is "better" (i.e. one value contains more keywords than the other, or fewer non-keywords than the other).
However, how does Lucene compute the score when doing regex queries using RegexQuery? It is a boolean query - a field's value is either compatible with the regex or not. Lucene can't take keywords from my regex query and do its usual magic...
There are two passes. In the first, Lucene generates a list of all terms in the index that match the regex. In the second, it finds all documents containing those terms.
The major code you want to look at is in MultiTermQuery:
public Query rewrite(IndexReader reader) throws IOException {
    FilteredTermEnum enumerator = getEnum(reader);
    BooleanQuery query = new BooleanQuery();
    try {
        do {
            Term t = enumerator.term();
            if (t != null) {
                TermQuery tq = new TermQuery(t);                    // found a match
                tq.setBoost(getBoost() * enumerator.difference());  // set the boost
                query.add(tq, false, false);                        // add to query
            }
        } while (enumerator.next());
    } finally {
        enumerator.close();
    }
    return query;
}
Two things:
The boolean query is instantiated with coord on, so the standard coord scoring applies (i.e. the more of the expanded terms a document matches, the better; see the coord sketch after the next snippet).
The boost of the term query is given by enumerator.difference(). However, as of 3.0.1 this just returns 1:
@Override
public final float difference() {
    // TODO: adjust difference based on distance of searchTerm.text() and term().text()
    return 1.0f;
}
So at some point this will return the distance (probably Levenshtein) between the terms. But for now it does nothing.
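For reference on the coord factor from point 1, in Lucene 3.x the default implementation is just the fraction of the rewritten query's terms that a document matches (paraphrased from DefaultSimilarity; class and variable names here are mine):
public class CoordSketch {
    // coord(overlap, maxOverlap): a document matching more of the expanded
    // terms gets a proportionally larger multiplier on its score.
    static float coord(int overlap, int maxOverlap) {
        return overlap / (float) maxOverlap;
    }

    public static void main(String[] args) {
        // e.g. a document containing 3 of the 4 terms the regex expanded to:
        System.out.println(coord(3, 4)); // 0.75
    }
}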
This is just a wild guess, but one possible metric could be the number of backtracking steps the regex engine needs to take to match your search strings.
Of course, these values also depend mightily on the quality of your regex, but when comparing several matches, the one that was "easier to match" could be considered a better match than the one that the regex engine had to go through contortions for.