How to use regular expressions in HBase - regex

I am new to HBase and trying to do some scan query. Below is my sample data:
2470883371 column=card info:CARD_TYPE, timestamp=1439291958723, value=MASTERCARD
2470883371 column=card info:UNIQUE_NO, timestamp=1439291958767, value=991-761-828-450
2470883371 column=card info:EXPIRY_DATE, timestamp=1439291958747, value=Wed Oct 03 18:09:34 IST 2018
3495415072 column=card info:CARD_TYPE, timestamp=1439291958835, value=MASTERCARD
3495415072 column=card info:UNIQUE_NO, timestamp=1439291959618, value=973-470-914-600
3495415072 column=card info:EXPIRY_DATE, timestamp=1439291958850, value=Wed Oct 03 18:09:34 IST 2018
I want to run queries like these:
Retrieve all results whose row key starts with 2470883 (the actual value is 2470883371)
Retrieve all results whose unique number starts with 991-761-828 (the actual value is 991-761-828-450)
Is this possible in HBase using a scan? Basically, I want to know how to use a regular expression.

Take a look at Filters and Comparators, for example RegexStringComparator https://hbase.apache.org/0.94/apidocs/org/apache/hadoop/hbase/filter/RegexStringComparator.html
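As a rough illustration of what such filters do, the sketch below mimics the two requested scans in plain Python over the sample rows. It does not talk to HBase: the ROWS dictionary and both helper functions are made up for this example. In a real scan the same patterns would be handed to a RegexStringComparator, typically inside a RowFilter (for the row key) or a SingleColumnValueFilter (for a cell value). The sketch assumes the comparator reports a match when the pattern is found anywhere in the value, which is why the patterns carry an explicit ^ anchor.

```python
import re

# Sample rows copied from the question: row key -> {qualifier: value}.
ROWS = {
    "2470883371": {"CARD_TYPE": "MASTERCARD", "UNIQUE_NO": "991-761-828-450"},
    "3495415072": {"CARD_TYPE": "MASTERCARD", "UNIQUE_NO": "973-470-914-600"},
}


def rows_with_key_matching(rows, pattern):
    """Return row keys that match the regex (the role a row-key filter plays)."""
    rx = re.compile(pattern)
    return [key for key in rows if rx.search(key)]


def rows_with_value_matching(rows, qualifier, pattern):
    """Return row keys whose given column value matches the regex."""
    rx = re.compile(pattern)
    return [
        key
        for key, columns in rows.items()
        if qualifier in columns and rx.search(columns[qualifier])
    ]


print(rows_with_key_matching(ROWS, r"^2470883"))
print(rows_with_value_matching(ROWS, "UNIQUE_NO", r"^991-761-828"))
```

Both calls select only the row 2470883371 from the sample data.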

This regular expression should match what you need:
((?:^2470883.*?$)|(?:^.*?value=991-761-828.*?$))

How to reformat this datetime without regex in Google Sheets?

In Google Sheets I want to reformat the datetime Mon, 08 Mar 2021 10:57:15 GMT into 08/03/2021.
Using a regex I achieve the goal with
=to_date(datevalue(REGEXEXTRACT("Mon, 08 Mar 2021 10:57:15 GMT","\b[0-9]{2}\s\D{3}\s[0-9]{4}\b")))
But how can I do it without a regex? This datetime format seems to be a classic one. Can it really be that no built-in formula can handle it? I suspect I am just missing the right knowledge here...
Please try the following formula and format the result as a date:
=TRIM(LEFT(INDEX(SPLIT(K13,","),,2),12))*1
(do adjust according to your locale)
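For readers who want to see the steps of that formula spelled out, here is a small Python sketch of the same idea: split on the comma, keep the first 12 characters of the second piece, trim, and convert to a date. The function name reformat is made up for this example, and strptime stands in for the "*1" coercion that Sheets performs.

```python
from datetime import datetime


def reformat(raw: str) -> str:
    # Like SPLIT(raw, ",") followed by INDEX(..., 2): the text after the comma.
    after_comma = raw.split(",")[1]
    # Like LEFT(..., 12) followed by TRIM: keeps "08 Mar 2021".
    date_text = after_comma[:12].strip()
    # Sheets coerces the text with "*1"; here strptime does the parsing.
    # %b expects English month abbreviations under Python's default C locale.
    parsed = datetime.strptime(date_text, "%d %b %Y")
    return parsed.strftime("%d/%m/%Y")


print(reformat("Mon, 08 Mar 2021 10:57:15 GMT"))  # 08/03/2021
```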
Another option is to use Custom Script.
Example:
Code:
function formatDate(date) {
  // "yyyy" is the calendar year; "YYYY" is the week-based year and can be off around New Year.
  return Utilities.formatDate(new Date(date), "GMT", "dd/MM/yyyy");
}
Formula in B1: =formatDate(A1)
Reference:
Custom Functions in Google Sheets

Simple Neptune Gremlin query to perform date comparison degrades due to large join

We have a graph that contains both customer and product vertices. For a given product, we want to find out how many customers who signed up before DATE have purchased this product. My query looks something like:
g.V('PRODUCT_GUID') // get product vertex
.out('product-customer') // get all customers who ever bought this product
.has('created_on', gte(datetime('2020-11-28T00:33:44.536Z'))) // see if the customer was created after a given date
.count() // count the results
This query is incredibly slow, so I looked at the Neptune profiler and saw something odd. Below is the full profiler output. Ignore the elapsed time in the profiler: this run came after many attempts at the same query, so the cache is warm. In the wild, it can take 45 seconds or more.
*******************************************************
Neptune Gremlin Profile
*******************************************************
Query String
==================
g.V('PRODUCT_GUID').out('product-customer').has('created_on', gte(datetime('2020-11-28T00:33:44.536Z'))).count()
Original Traversal
==================
[GraphStep(vertex,[PRODUCT_GUID]), VertexStep(OUT,[product-customer],vertex), HasStep([created_on.gte(Sat Nov 28 00:33:44 UTC 2020)]), CountGlobalStep]
Optimized Traversal
===================
Neptune steps:
[
NeptuneCountGlobalStep {
JoinGroupNode {
PatternNode[(?1=<PRODUCT_GUID>, ?5=<product-customer>, ?3, ?6) . project ?1,?3 . IsEdgeIdFilter(?6) .], {estimatedCardinality=30586, expectedTotalOutput=30586, indexTime=0, joinTime=14, numSearches=1, actualTotalOutput=13424}
PatternNode[(?3, <created_on>, ?7, ?) . project ask . CompareFilter(?7 >= Sat Nov 28 00:33:44 UTC 2020^^<DATETIME>) .], {estimatedCardinality=1285574, indexTime=10, joinTime=140, numSearches=13424}
}, annotations={path=[Vertex(?1):GraphStep, Vertex(?3):VertexStep], joinStats=true, optimizationTime=0, maxVarId=8, executionTime=165}
}
]
Physical Pipeline
=================
NeptuneCountGlobalStep
|-- StartOp
|-- JoinGroupOp
|-- SpoolerOp(1000)
|-- DynamicJoinOp(PatternNode[(?1=<PRODUCT_GUID>, ?5=<product-customer>, ?3, ?6) . project ?1,?3 . IsEdgeIdFilter(?6) .], {estimatedCardinality=30586, expectedTotalOutput=30586})
|-- SpoolerOp(1000)
|-- DynamicJoinOp(PatternNode[(?3, <created_on>, ?7, ?) . project ask . CompareFilter(?7 >= Sat Nov 28 00:33:44 UTC 2020^^<DATETIME>) .], {estimatedCardinality=1285574})
Runtime (ms)
============
Query Execution: 164.996
Traversal Metrics
=================
Step Count Traversers Time (ms) % Dur
-------------------------------------------------------------------------------------------------------------
NeptuneCountGlobalStep 1 1 164.919 100.00
>TOTAL - - 164.919 -
Predicates
==========
# of predicates: 131
Results
=======
Count: 1
Output: [22]
Index Operations
================
Query execution:
# of statement index ops: 13425
# of unique statement index ops: 13425
Duplication ratio: 1.0
# of terms materialized: 0
In particular
DynamicJoinOp(PatternNode[(?3, <created_on>, ?7, ?) . project ask . CompareFilter(?7 >= Sat Nov 28 00:33:44 UTC 2020^^<DATETIME>) .], {estimatedCardinality=1285574})
This line surprises me. The way I read it, Neptune is ignoring the vertices coming from ".out('product-customer')" when satisfying the ".has('created_on'...)" requirement, and is instead joining on every single customer vertex that has the created_on attribute.
I would have expected that the cardinality is only the number of customers with an edge from the product, not every single customer.
I'm wondering if there's a way to only run this comparison on the customers coming from the "out('product-customer')" step.
Neptune actually must solve the first pattern,
(?1=<PRODUCT_GUID>, ?5=<product-customer>, ?3, ?6)
before it can solve the second,
(?3, <created_on>, ?7, ?)
Each quad pattern is an indexed lookup bound by at least two fields. So the first lookup uses the SPOG index in Neptune bound by the Subject (the ID) and the Predicate (the edge label). This will return a set of Objects (the vertex IDs for the vertices at the other end of the product-customer edges) and references them via the ?3 variable for the next pattern.
In the next pattern those vertex IDs (?3) are bound with the Predicate (the property key created_on) to evaluate the date condition. Because this is a conditional evaluation, each vertex in the set of ?3 has to be evaluated (the created_on property on each of those vertices has to be read). This is why the profile reports numSearches=13424 for the second pattern: one lookup per customer returned by the first pattern.
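The two-step evaluation described above can be pictured with a toy model in Python. This is only a sketch of the access pattern, not Neptune's implementation: the EDGES and CREATED_ON dictionaries, the IDs in them, and the function name are all hypothetical stand-ins for indexed statements.

```python
from datetime import datetime, timezone

# Hypothetical stand-ins for indexed statements (not real Neptune structures).
EDGES = {("product-1", "product-customer"): ["c1", "c2", "c3"]}
CREATED_ON = {
    "c1": datetime(2020, 11, 1, tzinfo=timezone.utc),
    "c2": datetime(2020, 12, 5, tzinfo=timezone.utc),
    "c3": datetime(2021, 1, 9, tzinfo=timezone.utc),
}


def count_created_since(product_id, cutoff):
    """Return (matching customers, number of lookups performed)."""
    # First pattern: one lookup bound by the product ID and the edge label.
    lookups = 1
    customers = EDGES.get((product_id, "product-customer"), [])
    matches = 0
    for customer in customers:
        # Second pattern: one lookup per customer to read created_on.
        lookups += 1
        created = CREATED_ON.get(customer)
        if created is not None and created >= cutoff:
            matches += 1
    return matches, lookups


cutoff = datetime(2020, 11, 28, 0, 33, 44, tzinfo=timezone.utc)
print(count_created_since("product-1", cutoff))  # (2, 4)
```

In this model a product with 13,424 customers costs 13,425 lookups, which lines up with the "# of statement index ops" figure in the profile.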

Extract Date and Time in ABAP via Regex

I want to separate the time and the date from this string using a regex, because I feel it is the only way to do it. I am not really familiar with how to do that, so maybe someone can help me out here.
The original string: Your item was delivered in or at the mailbox at 3:34 pm on September 1, 2016 in TEXAS, MT 59102
The output I want to achieve/populate:
lv_time = 3:34 pm
lv_date = September 1, 2016
With my current code I am only able to cut it like this:
lv_status = Your item was delivered in or at the mailbox at
lv_time = 3
lv_date = :34 pm on September 1, 2016 in TEXAS, MT 59102.
Here's the code I have so far:
DATA: lv_status TYPE string,
lv_time TYPE string,
lv_date TYPE string,
lv_off TYPE i.
lv_status = 'Your item was delivered in or at the mailbox at 3:34 pm on September 1, 2016 in TEXAS, MT 59102.'.
FIND REGEX '(\d+)\s*(.*)' IN lv_status SUBMATCHES lv_time lv_date MATCH OFFSET lv_off.
lv_status = lv_status(lv_off).
You asked for it, here it comes:
\b((1[0-2]|0?[1-9]):([0-5][0-9]) ([AaPp][Mm])) on (January|February|March|April|May|June|July|August|September|October|November|December)\D?(\d{1,2}\D?)?\D?((?:19[7-9]\d|20\d{2})|\d{2})
This accepts times in HH:MM am/pm format, and dates with a full month name (January to December), an optional day, and a year from 1970 to 2099 (or a two-digit year).
Each part is captured in its own group.
The demo shows a version that allows abbreviated month names:
Demo
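To check which group holds what, the pattern can be tried outside ABAP. The Python sketch below applies it unchanged to the string from the question; the function name split_time_and_date is made up here, and taking the date as the slice from the month group to the year group is one possible way to reassemble it, not something the answer prescribes.

```python
import re

PATTERN = re.compile(
    r"\b((1[0-2]|0?[1-9]):([0-5][0-9]) ([AaPp][Mm])) on "
    r"(January|February|March|April|May|June|July|August|September|October|November|December)"
    r"\D?(\d{1,2}\D?)?\D?((?:19[7-9]\d|20\d{2})|\d{2})"
)


def split_time_and_date(status):
    """Return (time, date) found in the status text, or None if nothing matches."""
    match = PATTERN.search(status)
    if match is None:
        return None
    time_text = match.group(1)  # whole time, e.g. "3:34 pm"
    # The date spans from the month name (group 5) to the year (group 7).
    date_text = status[match.start(5):match.end(7)]
    return time_text, date_text


status = (
    "Your item was delivered in or at the mailbox at 3:34 pm "
    "on September 1, 2016 in TEXAS, MT 59102."
)
print(split_time_and_date(status))  # ('3:34 pm', 'September 1, 2016')
```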

Regex for getting the date

The query
SELECT REGEXP_SUBSTR('Outstanding Trade Ticket Report_08 Apr 14.xlsx', '\_(.*)\.') AS FILE_DATE FROM DUAL
gives the OUTPUT:
_08 Apr 14.
Please advise the correct regex to use for getting the date without the surrounding characters.
I can use RTRIM and LTRIM but want to try it using regex.
You can use:
SELECT REGEXP_SUBSTR('Outstanding Trade Ticket Report_08 Apr 14.xlsx', '\_(.*)\.',
1, 1, NULL, 1) from dual
The last argument is used to determine which matched group to return.
Link to Fiddler
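The same idea, returning only the captured group instead of the whole match, looks like this in Python. The snippet is just an illustration of group extraction; the final conversion assumes the "14" in the file name is a two-digit year.

```python
import re
from datetime import datetime

file_name = "Outstanding Trade Ticket Report_08 Apr 14.xlsx"

# group(0) would be "_08 Apr 14." (the whole match); group(1) is only the part in parentheses.
match = re.search(r"_(.*)\.", file_name)
file_date = match.group(1) if match else None
print(file_date)  # 08 Apr 14

# Assumption: the trailing "14" is a two-digit year, so %y is used.
parsed = datetime.strptime(file_date, "%d %b %y").date()
print(parsed.isoformat())  # 2014-04-08
```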

manipulation on a file - vb.net plus some regex

The content of my file (which is already sorted) is shown below. Whatever is between the square brackets relates to one transaction. The transactions can be in groupa, groupb, groupc, etc.
Jan 2012 02:10:12 [5678](groupa):Part 1:data1
Jan 2012 02:10:12 [5678](groupa):Part 2:data2
Jan 2012 02:10:12 [5678](groupa):Part 3:data3
Jan 2012 02:10:12 [5678](groupa):Part 4:data4
Jan 2012 02:13:14 [12308](groupa):Part 1:data1
Jan 2012 02:13:14 [12308](groupa):Part 2:data2
Jan 2012 02:13:24 [34517](groupb):Part 1:data1
Jan 2012 02:13:24 [34517](groupb):Part 2:data2
I want to write the data below to another file using VB.NET. For each transaction (identified by the number inside the square brackets), the output should contain the transaction group followed by the time, taken from the first row of that transaction. The next line should concatenate the data (the text after "Part [1-9]:") of all rows in that transaction. For the above contents, the output should be:
groupa at Jan 2012 02:10:12
data1data2data3data4
groupa at Jan 2012 02:13:14
data1data2
groupb at Jan 2012 02:13:24
data1data2
So first let's create a class to represent that data. It will make the data easier to work with. Here is what mine looks like:
Public Class LogEntry
    Public Property DateTime As DateTime
    Public Property Id As Integer
    Public Property Group As String
    Public Property Part As String
    Public Property Data As String
End Class
Now that we have that, let's parse each line with a regular expression. They aren't my strength, but in this case it works:
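' Needs: Imports System.IO and Imports System.Text.RegularExpressions
Dim text = File.ReadAllLines("log.log")
Dim rx As New Regex("^(?<date>.+)\s\[(?<id>\d+)\]\((?<group>.+)\):(?<part>.+):(?<data>.+)$")
Dim logEntries As New List(Of LogEntry)
For Each line In text
    Dim match = rx.Match(line)
    ' Skip lines that do not follow the expected layout.
    If Not match.Success Then Continue For
    ' HH is the 24-hour clock (hh would reject hours above 12).
    ' InvariantCulture keeps "Jan" parseable regardless of the machine's locale.
    Dim entry As New LogEntry With _
    {
        .DateTime = DateTime.ParseExact(match.Groups("date").Value, "MMM yyyy HH:mm:ss", System.Globalization.CultureInfo.InvariantCulture),
        .Id = Int32.Parse(match.Groups("id").Value),
        .Group = match.Groups("group").Value.Trim(),
        .Part = match.Groups("part").Value.Trim(),
        .Data = match.Groups("data").Value.Trim()
    }
    logEntries.Add(entry)
Next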
Here we are loading the text from a file, but it doesn't matter where the text comes from. After that we iterate over each line and gather the information with a regular expression. Once a line is parsed, we create a LogEntry and add it to a list. Having the entries in a list makes them easier to work with. We can use LINQ to group them, then print the result:
Dim grouped = logEntries _
    .GroupBy(Function(x) New With {Key .Id = x.Id, Key .Group = x.Group, Key .DateTime = x.DateTime}) _
    .OrderBy(Function(x) x.Key.DateTime)
For Each group In grouped
    ' HH prints the hour on a 24-hour clock, matching the input format.
    Console.WriteLine("{0} at {1:MMM yyyy HH:mm:ss}", group.Key.Group, group.Key.DateTime)
    Console.WriteLine(String.Join("", group.Select(Function(x) x.Data)))
Next
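For comparison, here is a compact Python sketch of the same parse-then-group approach, reusing the regular expression from above. It is a variation rather than a translation: it relies on the file already being sorted (as the question states), so consecutive lines with the same ID and group are treated as one transaction, and it keeps the timestamp as text instead of parsing it. The names LINE_RX, LINES and summarize are made up for this example.

```python
import re
from itertools import groupby

LINE_RX = re.compile(
    r"^(?P<date>.+)\s\[(?P<id>\d+)\]\((?P<group>.+)\):(?P<part>.+):(?P<data>.+)$"
)

LINES = [
    "Jan 2012 02:10:12 [5678](groupa):Part 1:data1",
    "Jan 2012 02:10:12 [5678](groupa):Part 2:data2",
    "Jan 2012 02:10:12 [5678](groupa):Part 3:data3",
    "Jan 2012 02:10:12 [5678](groupa):Part 4:data4",
    "Jan 2012 02:13:14 [12308](groupa):Part 1:data1",
    "Jan 2012 02:13:14 [12308](groupa):Part 2:data2",
    "Jan 2012 02:13:24 [34517](groupb):Part 1:data1",
    "Jan 2012 02:13:24 [34517](groupb):Part 2:data2",
]


def summarize(lines):
    """Return the output lines: a header per transaction, then its joined data."""
    # Lines that do not match the expected layout are skipped.
    entries = [m.groupdict() for m in map(LINE_RX.match, lines) if m]
    output = []
    # groupby only merges consecutive items, which is enough for sorted input.
    for (_, group), items in groupby(entries, key=lambda e: (e["id"], e["group"])):
        items = list(items)
        output.append(f"{group} at {items[0]['date']}")
        output.append("".join(e["data"] for e in items))
    return output


print("\n".join(summarize(LINES)))
```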