Great Expectations: expect_column_values_to_match_regex rule gives an error for every regex

NOTE: Running on Snowflake
I actually need the regex to check for SSN. The one I'm using is -
Regex used: '^(?!666|000|9\\d{2})\\d{3}-(?!00)\\d{2}-(?!0{4})\\d{4}$'
Error message:
Invalid regular expression: '^(?!666|000|9\d{2})\d{3}-(?!00)\d{2}-(?!0{4})\d{4}$', no argument for repetition operator: ?\n[SQL: SELECT sum(CASE WHEN (ssn IS NOT NULL AND NOT (ssn RLIKE %(param_1)s)) THEN %(param_2)s ELSE %(param_3)s END) AS "column_values.match_regex.unexpected_count", sum(CASE WHEN (ssn IS NULL) THEN %(param_4)s ELSE %(param_5)s END) AS "column_values.nonnull.unexpected_count" \nFROM ge_temp_4b471582]\n[parameters: {'param_1': '^(?!666|000|9\\d{2})\\d{3}-(?!00)\\d{2}-(?!0{4})\\d{4}$', 'param_2': 1, 'param_3': 0, 'param_4': 1, 'param_5': 0}]
Rule used:
{"expectation_type": "expect_column_values_to_match_regex", "kwargs": {"column": "ssn", "regex": "^(?!666|000|9\d{2})\d{3}-(?!00)\d{2}-(?!0{4})\d{4}$"}, "meta": {}}
A couple of months ago, when I ran this, it worked perfectly. Is there something I'm doing wrong here? Or is it a problem on Great Expectations' side?
So, I figured there must be a problem with the regex. But, using a basic regex like the one mentioned below, I'm getting a similar error.
regex used: '[Aa-Zz]'
Error message:
Invalid regular expression: '[Aa-Zz]', invalid character class range: a-Z\n[SQL: SELECT first AS unexpected_values \nFROM ge_temp_de90d2b5 \nWHERE first IS NOT NULL AND NOT (first RLIKE %(param_1)s)\n LIMIT %(param_2)s]\n[parameters: {'param_1': '[Aa-Zz]', 'param_2': 20}]\n(Background on this error at: https://sqlalche.me/e/14/f405)
Thanks in advance! :)
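The first error text ("no argument for repetition operator: ?") suggests that the regex engine behind RLIKE does not accept the `(?!...)` lookahead syntax, and the second says that `a-Z` is not a valid range. Assuming both readings are right, one possible workaround is a lookahead-free rewrite of the SSN pattern plus the usual `[A-Za-z]` letter class. The sketch below is only checked with Python's `re`, not on Snowflake, and the names `AREA`, `GROUP`, `SERIAL` and `PLAIN_SSN` are hypothetical.

```python
import re

# Original pattern (uses lookaheads, which the backend appears to reject).
LOOKAHEAD_SSN = re.compile(r'^(?!666|000|9\d{2})\d{3}-(?!00)\d{2}-(?!0{4})\d{4}$')

# Hypothetical lookahead-free rewrite: every exclusion is spelled out as alternatives.
AREA = r'(00[1-9]|0[1-9][0-9]|[1-5][0-9]{2}|6[0-57-9][0-9]|66[0-57-9]|[78][0-9]{2})'
GROUP = r'(0[1-9]|[1-9][0-9])'
SERIAL = r'([0-9]{3}[1-9]|[0-9]{2}[1-9][0-9]|[0-9][1-9][0-9]{2}|[1-9][0-9]{3})'
PLAIN_SSN = re.compile('^' + AREA + '-' + GROUP + '-' + SERIAL + '$')


def same_verdict(ssn):
    """Return True when both patterns agree on whether ssn is acceptable."""
    return bool(LOOKAHEAD_SSN.match(ssn)) == bool(PLAIN_SSN.match(ssn))


# Compare the two patterns on every area, group and serial value in turn.
agree = (
    all(same_verdict(f'{a:03d}-12-3456') for a in range(1000))
    and all(same_verdict(f'123-{g:02d}-3456') for g in range(100))
    and all(same_verdict(f'123-45-{s:04d}') for s in range(10000))
)
print(agree)

# 'a-Z' runs backwards in character order; '[A-Za-z]' is the usual spelling.
LETTER = re.compile(r'[A-Za-z]')
```

The rewrite uses `[0-9]` instead of `\d`, so it does not depend on how the backend treats backslashes in string literals.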

Related

Regexp expression from Oracle SQL to Big Query

I previously had help here with a regexp expression in Oracle SQL which worked great. However, our place is converting to BigQuery and the regexp does not seem to work anymore.
In my tables, I have the following data:
WC 12/10 change FC from 24 to 32
W/C 12/10 change fc from 401 to 340
W/C12/10 18-26
This Oracle SQL split each comment up to give me the before number (24), the after number (32) and the date (12/10).
cast(REGEXP_SUBSTR(Line_Comment, '((\d+ |\d+)(change )?(- |-|to |to|too|too )(\d+))', 1, 1, 'i',2) as Int) as Before,
cast(REGEXP_SUBSTR(Line_Comment, '((\d+ |\d+)(change )?(- |-|to |to|too|too )(\d+))', 1, 1, 'i', 5) as Int) as After,
REGEXP_SUBSTR(Line_Comment, '((\d+)(\/|-|.| )(\d+)(\/|-|.| )(\d+))|(\d+)(\/|-|.| )(\d+)', 1, 1, 'i') as WC_Date,
I totally understand that the comments are not consistent and this may not always work, but if it works more than 80% of the time, which it has, then we are fine with this.
Since moving to BigQuery, I'm getting this error message. In Oracle, the columns were VARCHAR, but after the migration to BigQuery they are now strings. Could this be the reason why it's broken? Is there anyone who can help with this? This is way over my head.
No matching signature for function REGEXP_SUBSTR for argument types:
STRING, STRING, INT64, INT64, STRING, INT64. Supported signatures:
REGEXP_SUBSTR(STRING, STRING, [INT64], [INT64]); REGEXP_SUBSTR(BYTES,
BYTES, [INT64], [INT64]) at [69:12]
Since Google BigQuery's REGEXP_SUBSTR doesn't support the subexpr parameter of Oracle's REGEXP_SUBSTR, you need to modify your regexes to take advantage of the fact that:
If the regular expression contains a capturing group, the function returns the substring that is matched by that capturing group.
So for each value you are trying to extract, you need to make that the only capturing group in the regex:
cast(REGEXP_SUBSTR(Line_Comment, r'(?:(\d+ |\d+)(?:change )?(?:- |-|to |to|too|too )(?:\d+))', 1, 1) as Int) as Before,
cast(REGEXP_SUBSTR(Line_Comment, r'(?:(?:\d+ |\d+)(?:change )?(?:- |-|to |to|too|too )(\d+))', 1, 1) as Int) as After,
REGEXP_SUBSTR(Line_Comment, r'(?:\d+)(?:\/|-|.| )(?:\d+)(?:\/|-|.| )(?:\d+)|(?:\d+)(?:\/|-|.| )(?:\d+)', 1, 1) as WC_Date,
Note you can substantially simplify your regexes as below:
(\d+) ?(?:change )?(?:-|too?) ?(?:\d+)
(?:\d+) ?(?:change )?(?:-|too?) ?(\d+)
(?:\d+)(?:[\/.-](?:\d+)){1,2}
Regex demos on regex101: numbers, date
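As a quick check of the three simplified patterns, here is a sketch that runs them over the sample comments with Python's `re` rather than BigQuery. `re.I` stands in for Oracle's 'i' flag, and `first_group` is a hypothetical helper that mimics "return the capturing group if there is one, else the whole match".

```python
import re

# The three simplified patterns from the answer above.
BEFORE = re.compile(r'(\d+) ?(?:change )?(?:-|too?) ?(?:\d+)', re.I)
AFTER = re.compile(r'(?:\d+) ?(?:change )?(?:-|too?) ?(\d+)', re.I)
WC_DATE = re.compile(r'(?:\d+)(?:[\/.-](?:\d+)){1,2}')

comments = [
    'WC 12/10 change FC from 24 to 32',
    'W/C 12/10 change fc from 401 to 340',
    'W/C12/10 18-26',
]


def first_group(pattern, text):
    """Return the single capturing group if the pattern has one, else the whole match."""
    m = pattern.search(text)
    if m is None:
        return None
    return m.group(1) if pattern.groups else m.group(0)


rows = [
    (first_group(WC_DATE, c), first_group(BEFORE, c), first_group(AFTER, c))
    for c in comments
]
for row in rows:
    print(row)
```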
Based on the sample data you provided in the comment section, you can try below query:
with t1 as (
select 'WC 12/10 change FC from 24 to 32' as Comment
union all select 'W/C 12/10 change fc from 401 to 340' as Comment
union all select 'W/C12/10 18-26' as Comment
)
select Comment,
regexp_extract(t1.Comment, r'(\d+\/\d+)') as WC,
regexp_extract(t1.Comment, r'.+\s(\d{1,3})[\s|\-]') as Before,
regexp_extract(t1.Comment, r'.+[\sto\s|\-](\d{1,3})$') as After
from t1
Consider below super simple approach
select Comment,
format('%s/%s', arr[offset(0)], arr[safe_offset(1)]) as wc,
arr[safe_offset(2)] as before,
arr[safe_offset(3)] as after
from your_table, unnest([struct(regexp_extract_all(Comment, r'\d+') as arr)])
if applied to sample data in your question - output is
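A rough Python equivalent of this approach (not BigQuery SQL; `split_comment` is a hypothetical helper) shows what the digit-run extraction yields for the three sample rows. It assumes the first two numbers are always the date and the next two are the before and after values.

```python
import re

comments = [
    'WC 12/10 change FC from 24 to 32',
    'W/C 12/10 change fc from 401 to 340',
    'W/C12/10 18-26',
]


def split_comment(comment):
    """Pull every run of digits, then assign the runs by position."""
    arr = re.findall(r'\d+', comment)

    def safe(i):
        # Mimics safe_offset: None instead of an error when the index is missing.
        return arr[i] if i < len(arr) else None

    wc = f'{arr[0]}/{arr[1]}' if len(arr) >= 2 else None
    return {'wc': wc, 'before': safe(2), 'after': safe(3)}


results = [split_comment(c) for c in comments]
for r in results:
    print(r)
```

Assigning by position is simple, but any extra number earlier in a comment shifts every later field.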

Fluentd Parsing

Hi, I'm trying to parse a single-line log using Fluentd. Here is the log I'm trying to parse:
F2:4200000000000000,F3:000000,F4:000000060000,F6:000000000000,F7:000000000,F8..........etc
It should be parsed into something like this:
{ "F2" : "4200000000000000", "F3" : "000000", "F4" : "000000060000" ............etc }
I tried to use regex, but it's confusing and forces me to write multiple regexes for different keys and values. Is there an easier way to achieve this?
EDIT1: I will make this more detailed. I'm currently tailing logs with Fluentd and sending them to Elasticsearch+Kibana. Here is an unparsed example log that Fluentd sends to Elasticsearch:
21/09/02 16:36:09.927238: 1 frSMS:0:13995:#HTF4J::141:141:msg0210,00000000000000000,000000,000000,007232,00,#,F2:00000000000000000,F3:002000,F4:000000820000,F6:Random message and strings,F7:.......etc
Message received by Elasticsearch:
{"message":"frSMS:0:13995:#HTF4J::141:141:msg0210,00000000000000000,000000,000000,007232,00,#,F2:00000000000000000,F3:002000,F4:000000820000,F6:Random
digits and chars,F7:.......etc"}
This log has only a message key, so I can't index it or build a dashboard from the whole message field. What I'm trying to achieve is to capture only the useful fields, add a key to any value that has none, and make indexing easier.
Expected output:
{"logdate" : "21/09/02 16:36:09.927238",
"source" : "frSMS",
"UID" : "#HTF4J",
"statuscode" : "msg0210",
"F2": "00000000000000000",
"F3": "randomchar314516",.....}
I used the regex plugin to parse the log into this, but it was too overwhelming. Here is what I did so far:
^(?<logDate>\d{2}.\d{2}.\d{2}\s\d{2}:\d{2}:\d{2}.\d{6}\b)....(?<source>fr[A-Z]{3,4}|to[A-Z]{3,4}\b).(?<status>\d\b).(?<dummyfield>\d{5}\b).(?<HUID>.[A-Z]{5}\b)..(?<d1>\d{3}\b).(?<d2>\d{3}\b).(?<msgcode>msg\d{4}\b).(?<dummyfield1>\d{16}\b).(?<dummyfield2>\d{6}\b).(?<dummyfield3>\d{6,7}\b).(?<dummyfield4>\d{6}\b).(?<dummyfield5>\d{2}\b)...
Which results to :
"logDate": "21/09/02 16:36:09.205706",
"source": "toSMS" ,
"status": "0",
"dummyfield": "13995" ,
"UID" : "#HTFAA" ,
"d1" : "156" ,
"d2" : "156" ,
"msgcode" : "msg0210",
"dummyfield1" :"0000000000000000" ,
"dummyfield2" :"002000",
"dummyfield3" :"2000000",
"dummyfield4" :"00",
"dummyfield5" :"2000000" ,
"dummyfield6" :"867202"
This only applies to the example log and has useless fields like field1, dummyfield, dummyfield1, etc.
Other logs have the useful values and keys (date, source, msgcode, UID, F1, F2 fields) as I showed in the expected output. The non-useful fields are not static (they can be missing, or have fewer or more digits and characters), so they trigger the pattern not matched error.
So the questions are:
How do I capture the useful fields that I mentioned using regex?
How do I capture the F1, F2, F3, ... fields, whose values follow different patterns, such as mixed characters and strings?
PS: I wrapped the regex I wrote in an HTML snippet so the <> capture-group names don't get deleted.
Regex pattern to use:
(F[\d]+):([\d]+)
This pattern will catch all the 'F' keys with whatever digits come after; yes, even F105 still works. The whole 'F105' will be stored as the first group of your regex match.
The right part of the pattern will catch the run of digits following ':' up to the first character that is not a digit (',', 'F', etc.) and will store it as the second group of your regex match.
Use
Depending on your programming language, you will have to iterate over your regex matches and extract group 1 and group 2 respectively.
Python example:
import re
log = 'F2:4200000000000000,F3:000000,F4:000000060000,F6:000000000000,F7:000000000,F105:9726450'
# Raw string so that \d reaches the regex engine unchanged
pattern = r'(F[\d]+):([\d]+)'
matches = re.finditer(pattern, log)
log_dict = {}
for match in matches:
    # group(1) is the key (e.g. 'F2'), group(2) is its digit value
    log_dict[match.group(1)] = match.group(2)
print(log_dict)
Output
{'F2': '4200000000000000', 'F3': '000000', 'F4': '000000060000', 'F6': '000000000000', 'F7': '000000000', 'F105': '9726450'}
Assuming the logdate is static (pattern-wise), you can skip the useless values with the ".+" regex and collect the useful values by their patterns. So the regex will look like this:
(?<logdate>\d{2}.\d{2}.\d{2}\s\d{2}:\d{2}:\d{2}.\d{6}\b).+(?<source>fr[A-Z]{3,4}|to[A-Z]{3,4}).+(?<UID>#[A-Z0-9]{5}).+(?<statuscode>msg\d{4})
And output will be like:
{"logdate" : "21/09/02 16:36:09.927238", "source" : "frSMS",
"UID" : "#HTF4J","statuscode" : "msg0210"}
And I'm working on getting F2,F3,FN keys and values.
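One way the remaining step might be sketched, in Python rather than in a Fluentd config: keep the header regex above for the fixed fields and pull every `Fn:value` pair with a second pattern that accepts any characters up to the next comma. This assumes values never contain a comma. `parse_line`, `HEADER` and `FIELD` are hypothetical names, and Python spells named groups `(?P<name>...)` instead of `(?<name>...)`.

```python
import re

# Header fields, same structure as the regex above (Python named-group syntax).
HEADER = re.compile(
    r'(?P<logdate>\d{2}.\d{2}.\d{2}\s\d{2}:\d{2}:\d{2}.\d{6}\b).+'
    r'(?P<source>fr[A-Z]{3,4}|to[A-Z]{3,4}).+'
    r'(?P<UID>#[A-Z0-9]{5}).+'
    r'(?P<statuscode>msg\d{4})'
)
# An F-key at the start of the line or after a comma; value runs to the next comma.
FIELD = re.compile(r'(?:^|,)(F\d+):([^,]*)')


def parse_line(line):
    """Combine the header fields and all Fn:value pairs into one dict."""
    record = {}
    header = HEADER.search(line)
    if header:
        record.update(header.groupdict())
    record.update(FIELD.findall(line))  # list of (key, value) tuples
    return record


sample = ('21/09/02 16:36:09.927238: 1 frSMS:0:13995:#HTF4J::141:141:msg0210,'
          '00000000000000000,000000,000000,007232,00,#,'
          'F2:00000000000000000,F3:002000,F4:000000820000,'
          'F6:Random message and strings')
parsed = parse_line(sample)
print(parsed)
```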

Exact match of string in pandas python

I have a column in a data frame, for example df:
A
0 Good to 1. Good communication EI : tathagata.kar#ae.com
1 SAP ECC Project System EI: ram.vaddadi#ae.com
2 EI : ravikumar.swarna Role:SSE Minimum Skill
I have a list of strings
ls=['tathagata.kar#ae.com','a.kar#ae.com']
Now if i want to filter out
for i in range(len(ls)):
    df1 = df[df['A'].str.contains(ls[i])]
    if len(df1.columns != 0):
        print ls[i]
I get the output
tathagata.kar#ae.com
a.kar#ae.com
But I need only tathagata.kar#ae.com
How can it be achieved?
As you can see, I've tried str.contains, but I need something for an exact match.
You could simply use ==
string_a == string_b
It should return True if the two strings are equal. But this does not solve your issue.
Edit 2: You should use len(df1.index) instead of len(df1.columns). Indeed, len(df1.columns) will give you the number of columns, and not the number of rows.
Edit 3: After reading your second post, I've understood your problem. The solution you propose could lead to some errors.
For instance, if you have:
ls=['tathagata.kar#ae.com','a.kar#ae.com', 'tathagata.kar#ae.co']
the first and the third element will match str.contains(r'(?:\s|^|Ei:|EI:|EI-)'+ls[i])
And this is an unwanted behaviour.
You could add a check on the end of the string: str.contains(r'(?:\s|^|Ei:|EI:|EI-)'+ls[i]+r'(?:\s|$)')
Like this:
for i in range(len(ls)):
    df1 = df[df['A'].str.contains(r'(?:\s|^|Ei:|EI:|EI-)'+ls[i]+r'(?:\s|$)')]
    if len(df1.index) != 0:
        print (ls[i])
(Remove parenthesis in the "print" if you use python 2.7)
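To see the boundary check in isolation, here is a sketch that uses plain `re` instead of pandas, with the three sample rows typed in by hand. `whole_token_pattern` is a hypothetical helper. It also wraps the address in `re.escape`, which is an addition to the pattern above, so that the dots match literally. The same pattern string could be passed to str.contains.

```python
import re

rows = [
    'Good to 1. Good communication EI : tathagata.kar#ae.com',
    'SAP ECC Project System EI: ram.vaddadi#ae.com',
    'EI : ravikumar.swarna Role:SSE Minimum Skill',
]
ls = ['tathagata.kar#ae.com', 'a.kar#ae.com', 'tathagata.kar#ae.co']


def whole_token_pattern(item):
    """Match item only when it is delimited on both sides."""
    # re.escape keeps the dots literal instead of meaning "any character".
    return re.compile(r'(?:\s|^|Ei:|EI:|EI-)' + re.escape(item) + r'(?:\s|$)')


found = [
    item for item in ls
    if any(whole_token_pattern(item).search(row) for row in rows)
]
print(found)
```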
Thanks for the help. But seems like I found a solution that is working as of now.
Must use str.contains(r'(?:\s|^|Ei:|EI:|EI-)'+ls[i])
This seems to solve the problem.
Although thanks to #IsaacDj for his help.
Why not just use:
df1 = df[df['A'].str.match(ls[i])]
It's the equivalent of regex match.

How to do SPARQL query using bif:regexp_match on Jena

I have the following SPARQL query on Virtuoso:
PREFIX wd: <http://www.wikidata.org/entity/>
SELECT DISTINCT ?p, ?title WHERE {
?p a ?c.
?c rdfs:subClassOf* wd:Q2431196.
?p rdfs:label ?title .
FILTER (bif:regexp_match("^Vamp( [(].*[)])?$", ?title))
}
On this SPARQL endpoint, it works fine. It returns the tv show Vamp and also Vamp (telenovela) as expected.
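The pattern itself can be sanity-checked outside SPARQL. This sketch uses Python's `re`, on the assumption that such a simple pattern behaves the same way there. The last two titles are made-up negative examples.

```python
import re

TITLE = re.compile(r'^Vamp( [(].*[)])?$')

# 'Vampire' and 'The Vamp' are hypothetical titles that should be rejected.
titles = ['Vamp', 'Vamp (telenovela)', 'Vampire', 'The Vamp']
matching = [t for t in titles if TITLE.search(t)]
print(matching)
```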
Now I'm trying to do the same on Java, using Jena API, and it fails as follows.
Exception in thread "main" com.hp.hpl.jena.query.QueryParseException: Line 10, column 204: Unresolved prefixed name: bif:regexp_match
I found a solution to get rid of the Jena exception, as suggested for bif:contains. The query would then be as follows:
PREFIX wd: <http://www.wikidata.org/entity/>
SELECT DISTINCT ?p, ?title WHERE {
?p a ?c.
?c rdfs:subClassOf* wd:Q2431196.
?p rdfs:label ?title .
?title <bif:regexp_match> "^Vamp( [(].*[)])?$"
}
However, that query does not return any elements as the previous query did. It doesn't return any elements on the SPARQL endpoint web interface either (as the previous query did)
Am I doing something wrong? How can I regex it properly?
ps: using FILTER REGEX( ?title, "^Vamp( [(].*[)])?$") works on the web SPARQL endpoint, but throws the following error when on Java/Jena:
Sep 16, 2015 3:16:32 PM org.apache.jena.riot.system.ErrorHandlerFactory$ErrorLogger logError
SEVERE: Invalid byte 2 of 3-byte UTF-8 sequence.
I think this error has to do with the ( ) characters..
Use PREFIX bif:<bif:>
instead of PREFIX bif:<> for Jena.
You were right in your regex pattern, just a little editing when it comes to java.
For it to work in java, just put the left parentheses ( after ^ and put the right parentheses ) before $.
Your regex pattern should be like this:
"^(Vamp( [(].*[)])?)$";
hope this helps
You can use the following prefix declaration as a workaround.
PREFIX bif: <bif:>
Live Link demonstrating workaround in action.
Live Virtuoso SPARQL Query Editor Link showcasing workaround.
Ultimately, the URI for the Prefix declaration should be:
PREFIX bif: <http://www.openlinksw.com/schemas/bif#>
Which I explain in a Twitter Thread about the same issue i.e., we are working to rectify the regression associated with the standard prefix declaration above.
Jena will fail to parse your SPARQL as it is invalid.
The main issue is that bif: is a built in prefix in Virtuoso.
To allow Jena to parse it you need to add
PREFIX bif:<>
to your query.
As AndyS answered, the problem is that bif is a Virtuoso-specific feature, so you should use QueryEngineHTTP instead of QueryExecutionFactory.sparqlService. This submits your query directly to the endpoint without passing it through the Jena parser.
QueryEngineHTTP query_engine = new QueryEngineHTTP(endpoint, query);

gst regular expression mismatch of group generates exception

I have a simple example in GNU Smalltalk 3.2.5 of attempting to group match on a key value setting:
st> m := 'a=b' =~ '(.*?)=(.*)'
MatchingRegexResults:'a=b'('a','b')
The above example works just as expected. However, if there is no match to the second group (.*), an exception is generated:
st> m := 'a=' =~ '(.*?)=(.*)'
Object: Interval new "<-0x4ce2bdf0>" error: Invalid index 1: index out of range
SystemExceptions.IndexOutOfRange(Exception)>>signal (ExcHandling.st:254)
SystemExceptions.IndexOutOfRange class>>signalOn:withIndex: (SysExcept.st:660)
Interval>>first (Interval.st:245)
Kernel.MatchingRegexResults>>at: (Regex.st:382)
Kernel.MatchingRegexResults>>printOn: (Regex.st:305)
Kernel.MatchingRegexResults(Object)>>printString (Object.st:534)
Kernel.MatchingRegexResults(Object)>>printNl (Object.st:571)
I don't understand this behavior. I would have expected the result to be ('a', nil) and that m at: 2 to be nil. I tried a different approach as follows:
st> 'a=' =~ '(.*?)=(.*)' ifMatched: [ :m | 'foo' printNl ]
'foo'
'foo'
Which determines properly that there's a match to the regex. But I still can't check if a specific group is nil:
st> 'a=' =~ '(.*?)=(.*)' ifMatched: [ :m | (m at: 2) ifNotNil: [ (m at: 2) printNl ] ]
Object: Interval new "<-0x4ce81b58>" error: Invalid index 1: index out of range
SystemExceptions.IndexOutOfRange(Exception)>>signal (ExcHandling.st:254)
SystemExceptions.IndexOutOfRange class>>signalOn:withIndex: (SysExcept.st:660)
Interval>>first (Interval.st:245)
Kernel.MatchingRegexResults>>at: (Regex.st:382)
optimized [] in UndefinedObject>>executeStatements (a String:1)
Kernel.MatchingRegexResults>>ifNotMatched:ifMatched: (Regex.st:322)
Kernel.MatchingRegexResults(RegexResults)>>ifMatched: (Regex.st:188)
UndefinedObject>>executeStatements (a String:1)
nil
st>
I don't understand this behavior. I would have expected the result to be ('a', nil) and that m at: 2 to be nil. At least that's the way it works in any other language I've used regex in. This makes me think maybe I'm not doing something correct with my syntax.
My question is: do I have the correct syntax for attempting to match ASCII key value pairs like this (for example, in parsing environment settings)? And if I do, why is an exception being generated, or is there a way I can have it provide a result that I can check without generating an exception?
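For comparison only (this says nothing about how the GNU Smalltalk Regex package is meant to behave), here is how Python's `re` reports the same match. A group that matches the empty string comes back as an empty string, while a nil-like value appears only when an optional group takes no part in the match.

```python
import re

# Same pattern and input as the failing Smalltalk example.
pair = re.match(r'(.*?)=(.*)', 'a=').groups()

# An optional group that does not participate in the match is reported as None.
optional = re.match(r'(.*?)=(x)?', 'a=').groups()

print(pair)
print(optional)
```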
I found a related issue reported at gnu.org from Dec 2013 with no responses.
The issue had been fixed in master after the above report was received. The commit can be seen here. A stable release is currently blocked by the glib event loop integration.
ValidationExpression="[0-9]{2}[(a-z)(A-Z)]{5}\d{4}[(a-z)(A-Z)]{1}\d{1}Z\d{1}"
SetFocusOnError="true" ControlToValidate="txtGST" Display="Dynamic" runat="server" ErrorMessage="Invalid GST No." ValidationGroup="Add" ForeColor="Red"></asp:RegularExpressionValidator>