LOAD CSV conditional handling with regex (syntax problem)

A CSV file contains, in one column, either a date like "2016-12-01T00:00:00+01" or some other value such as an integer.
My idea was to add a switch, like an if-else statement, while running LOAD CSV: either turn the date into a Unix timestamp or leave the value unchanged. To detect whether the value is a date, I tried to use a regex.
I came up with the following statement:
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:///xxx.csv" AS row
FIELDTERMINATOR ';'
FOREACH(n IN (CASE row.dataValue=~ "/(\d{4})-(\d{2})-(\d{2})T(\d{2})\:(\d{2})\:(\d{2})[+-](\d{2})\" THEN [] else [row.dataValue= apoc.date.parse(row.dataValue, "s", "yyyy-mm-dd'T'HH:mm:ss+01")] END) |
CREATE (d:datapoint {data: row.dataValue})
return d
This throws an error:
Invalid input 'd': expected ... which seems to refer to the first letter d in the regex.
a) What would be the correct syntax?
b) Is the statement correct at all for what I want?
Any hint is much appreciated.

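Try using [0-9] instead of \d. I don't know whether that is your issue, but your regex otherwise seems fine, except for the stray forward slash at the very beginning. (In a Cypher string literal the backslash is itself an escape character, so \d would have to be written as \\d; [0-9] sidesteps that.)
Try something like this: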
[0-9]{4}-[0-9]{2}-[0-9]{2}T(?:[0-9]{2}:){2}[0-9]{2}[-+][0-9]{2}
https://regex101.com/r/fqydFq/1
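The same if-else idea can be sketched outside Cypher. The following is a minimal Python illustration, not the Cypher fix itself: it reuses the [0-9]-based pattern above and assumes the offset is always a whole number of hours, as in the sample value. The function name to_unix_or_keep is made up for this example.

```python
import re
from datetime import datetime, timedelta, timezone

# Same shape as the suggested pattern, with two capture groups added:
# group 1 is the local date-time, group 2 the signed hour offset.
ISO_WITH_HOUR_OFFSET = re.compile(
    r"([0-9]{4}-[0-9]{2}-[0-9]{2}T(?:[0-9]{2}:){2}[0-9]{2})([-+][0-9]{2})"
)


def to_unix_or_keep(value):
    """Return a Unix timestamp for date-like values, else the value unchanged."""
    match = ISO_WITH_HOUR_OFFSET.fullmatch(value)
    if match is None:
        return value  # not a date: leave it alone
    local = datetime.strptime(match.group(1), "%Y-%m-%dT%H:%M:%S")
    offset = timezone(timedelta(hours=int(match.group(2))))
    return int(local.replace(tzinfo=offset).timestamp())


for raw in ("2016-12-01T00:00:00+01", "42"):
    print(raw, "->", to_unix_or_keep(raw))
```

Values that do not match the whole pattern, such as "42", are returned as they came in, which mirrors the "do not change the value at all" branch of the question.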

Related

Having difficulty in pattern matching Postal Codes for an oracle regexp_like command

The Problem:
All I'm trying to do is come up with a pattern matching string for my regular expression that lets me select Canadian postal codes in this format: 'A1A-2B2' (for example).
The types of data I am trying to insert:
Insert Into Table
(Table_Number, Person_Name, EMail_Address, Street_Address, City, Province, Postal_Code, Hire_Date)
Values
(87, 'Tommy', 'mobster@gmail.com', '123 Street', 'location', 'ZY', 'T4X-1S2', To_Date('30-Aug-2020 08:50:56');
This is a slightly modified, generic version to protect some of the data. All of the other columns load without complaint, but the postal code is rejected when I run a load data script.
The Column & Constraint in question:
Postal_Code varchar2(7) Constraint Table_Postal_Code Null
Constraint CK_Postal_Code Check ((Regexp_like (Postal_Code, '^\[[:upper:]]{1}[[:digit:]]{1}[[:upper:]][[:punct:]]{1}[[:digit:]]{1}[[:upper:]](1}[[:digit:]]{1}$')),
My logic here: following the regular expression documentation:
I have:
an opening quote
a caret (^) to indicate the start of the string
a backslash (I think to interpret a string literal)
1 uppercase letter, 1 digit, 1 uppercase letter, 1 :punct: to account for the hyphen, 1 digit, 1 uppercase letter, 1 digit
$ to indicate the end of the string
a closing quote
In my mind, something like this should work: it accounts for every single character and the range each one has to be in. But something is off in how I formatted this pattern matching string.
The error I get is:
ORA-02290: check constraint (user.CK_POSTAL_CODE) violated
(slightly modified once more to protect my identity)
This tells me that the insert statement is tripping my check constraint, and that's about it. So it's an issue with the condition of the constraint itself, i.e. the string I'm using to match. My instructor has told me that the insert data is valid and doesn't need any fix-up, so I'm at a loss.
Limits/Rules: The hyphen has to be there and be matched, to my understanding of the problem. The letters are all uppercase in the dataset, so I don't have to worry about lowercase for this example.
I have tried countless variations of this regexp statement to see if anything at all would work, including:
changing all the uppers to :alpha:, then using 'i' to ignore case for the time being
removing the {1} in case that was redundant
using \- (backslash hyphen), to turn it into a string literal maybe
using only the hyphen by itself
even removing regexp altogether and trying a LIKE [A-Z][0-9][A-Z]-[0-9][A-Z][0-9] etc.
keeping the uppers and turning the :digit:'s into [0-9] to see if that would maybe work
The only logical thing I can think of now is that the check constraint is actually working fine and tripping when the data matches my pattern, but I didn't write it clearly enough to say "IGNORE these cases and only trip if the data doesn't meet these conditions".
But I'm at my wits' end and asking here as a last resort. I wouldn't if I thought I could eventually see my mistake, but I have probably tried everything I can think of. I'm sure it's some tiny formatting rule I just can't see (I can feel it). Thank you kindly to anyone who knows how to format a pattern matching string like this properly.
It looks like you may have been overcomplicating the regex a bit. The regex below matches the format you described in your first list:
REGEXP_LIKE (postal_code, '^[A-Z]\d[A-Z]-\d[A-Z]\d$')
I see two problems with that regexp.
Firstly, you have a spurious \ at the start. It serves you no purpose, get rid of it.
Secondly, the second-to-last {1} appears in your code with mismatched brackets, as (1}. I get the error ORA-12725: unmatched parentheses in regular expression because of this.
To be honest, you don't need the {1}s at all: they just tell the regular expression that you want one of the previous item, which is exactly what you'd get without them.
So you can fix the regexp in your constraint by getting rid of the \ and removing the {1}s, including the one with mismatched parentheses.
Here's a demo of the fixed constraint in action:
SQL> CREATE TABLE postal_code_test (
2 Postal_Code varchar2(7) Constraint Table_Postal_Code Null
3 Constraint CK_Postal_Code Check ((Regexp_like (Postal_Code, '^[[:upper:]][[:digit:]][[:upper:]][[:punct:]][[:digit:]][[:upper:]][[:digit:]]$'))));
Table created.
SQL> INSERT INTO postal_code_test (postal_code) VALUES ('T4X-1S2');
1 row created.
SQL> INSERT INTO postal_code_test (postal_code) VALUES ('invalid');
INSERT INTO postal_code_test (postal_code) VALUES ('invalid')
*
ERROR at line 1:
ORA-02290: check constraint (user.CK_POSTAL_CODE) violated
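For readers who want to try the pattern outside the database, here is a rough Python equivalent of the fixed check. It is only a sketch: Python's re module has no POSIX classes such as [[:upper:]], so explicit ranges and a literal hyphen are used instead, and the name is_valid_postal_code is invented for the example.

```python
import re

# Letter, digit, letter, hyphen, digit, letter, digit (uppercase only).
POSTAL_CODE = re.compile(r"[A-Z][0-9][A-Z]-[0-9][A-Z][0-9]")


def is_valid_postal_code(code):
    # fullmatch plays the role of the ^ and $ anchors in the constraint.
    return POSTAL_CODE.fullmatch(code) is not None


for candidate in ("T4X-1S2", "invalid", "t4x-1s2"):
    print(candidate, is_valid_postal_code(candidate))
```

Only the first candidate passes; the lowercase variant is rejected because the ranges are uppercase only, as in the constraint.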
You do not need the backslash and you have (1} instead of {1}.
You can simplify the expression to:
Postal_Code varchar2(7)
Constraint Table_Postal_Code Null
Constraint CK_Postal_Code Check (
REGEXP_LIKE(Postal_Code, '^[A-Z]\d[A-Z][[:punct:]]\d[A-Z]\d$')
)
or:
Constraint CK_Postal_Code Check (
REGEXP_LIKE(
Postal_Code,
'^[A-Z][0-9][A-Z][[:punct:]][0-9][A-Z][0-9]$'
)
)
or:
Constraint CK_Postal_Code Check (
REGEXP_LIKE(
Postal_Code,
'^[[:upper:]][[:digit:]][[:upper:]][[:punct:]][[:digit:]][[:upper:]][[:digit:]]$'
)
)
or (although the {1} syntax is redundant here):
Constraint CK_Postal_Code Check (
REGEXP_LIKE(
Postal_Code,
'^[[:upper:]]{1}[[:digit:]]{1}[[:upper:]]{1}[[:punct:]]{1}[[:digit:]]{1}[[:upper:]]{1}[[:digit:]]{1}$'
)
)
Regarding "removing regexp altogether and trying a LIKE [A-Z][0-9][A-Z]-[0-9][A-Z][0-9] etc":
That will not work, as the LIKE operator does not match regular expression patterns.

Wrong regexp query for elasticsearch

I have some problems with a regexp query for Elasticsearch. In my index there's a text field with comma-separated numeric values (IDs), e.g.
2,140,3,2495
And I have the following query term:
"regexp" : {
"myIds" : {
"value" : "^2495,|,2495,|,2495$|^2495$",
"boost" : 1
}
}
But my result list is empty.
I know that regexp queries are rather slow, but the index already exists and is filled with millions of documents, so unfortunately restructuring it is not an option. I need a regex solution.
In Elasticsearch regex, patterns are anchored by default, and ^ and $ are treated as literal characters.
What you mean to use is "2495,.*|.*,2495,.*|.*,2495|2495" - 2495, at the start of string, ,2495, in the middle, ,2495 at the end or a whole string equal to 2495.
Or, you may use a simpler
"(.*,)?2495(,.*)?"
That means
(.*,)? - an optional text (not including line breaks) ending with ,
2495 - your value
(,.*)? - an optional text (not including line breaks) starting with ,
Here is an online demo showing how this expression works (not a proof though).
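The anchored behaviour can be approximated in Python with re.fullmatch, which also requires the whole value to match. This is only an approximation of the idea (Python's regex engine is not the one Elasticsearch uses), and field_matches is a made-up helper name.

```python
import re

PATTERN = r"(.*,)?2495(,.*)?"


def field_matches(pattern, field_value):
    # re.fullmatch emulates implicit anchoring: the entire value must match.
    return re.fullmatch(pattern, field_value) is not None


for value in ("2,140,3,2495", "2495", "2495,7", "1,2495,7", "12495,7", "1,24950"):
    print(value, field_matches(PATTERN, value))
```

The last two values are rejected because 2495 only appears there as part of a longer number, not as a complete comma-separated entry.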
OK, I got it to work but ran into another problem. I built the string as follows:
(.*,)?2495(,.*)?|(.*,)?10(,.*)?|(.*,)?898(,.*)?
It works well for a few IDs, but if I have, say, 50 IDs, ES throws an exception saying that the regexp is too complex to process.
Is there a way to simplify the regexp or restructure the query itself?
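One possible simplification is to factor out the shared prefix and suffix so that they appear once, with only the IDs in the alternation: (.*,)?(2495|10|898)(,.*)?. The Python sketch below builds such a pattern and checks it with re.fullmatch. Whether the shorter pattern stays under Elasticsearch's complexity limit for 50 IDs has not been verified here, and build_id_pattern is a hypothetical helper.

```python
import re


def build_id_pattern(ids):
    # Only the IDs are alternated; the optional prefix and suffix appear once.
    alternatives = "|".join(str(int(i)) for i in ids)
    return "(.*,)?(" + alternatives + ")(,.*)?"


pattern = build_id_pattern([2495, 10, 898])
print(pattern)

for value in ("2,140,3,2495", "7,10", "1,100,8981"):
    print(value, re.fullmatch(pattern, value) is not None)
```

The pattern grows by one short alternative per ID instead of repeating the two wildcard groups each time.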

Why is this seemingly correct Regex not working correctly in Rascal?

I have the following code:
set[str] noNnoE = { v | str v <- eu, (/\b[^eEnN]*\b/ := v) };
The goal is to filter out of a set of strings (called 'eu'), those strings that have no 'e' or 'n' in them (both upper- and lowercase). The regular expression I've provided:
/\b[^eEnN]?\b/
seems to work as it should when I try it in an online regex tester.
When I try it in the Rascal terminal, it doesn't seem to work:
rascal>/\b[^eEnN]*\b/ := "Slander";
bool: true
I expected no match. What am I missing here? I'm using the latest (stable) Rascal release in Eclipse Oxygen1a.
Actually, the online regex-tester is giving the same match that we are giving. You can look at the match as follows:
if (/<w1:\b[^eEnN]?\b>/ := "Slander")
println("The match is: |<w1>|");
This is assigning the matched string to w1 and then printing it between the vertical bars, assuming the match succeeds (if it doesn't, it returns false, so the body of the if will not execute). If you do this, you will get back a match to the empty string:
The match is: ||
The online regex tester says the same thing:
Match 1
Full match 0-0 ''
If you want to prevent this, you can force at least one occurrence of the characters you are looking for by using a + instead of a ? (or *):
rascal>/\b[^eEnN]+\b/ := "Slander";
bool: false
Note that you can also make the regex match case insensitive by following it with an i, like so:
/\b[^en]+\b/i
This may make it easier to write if you need to add more characters into the character class.
This solution (/\b[^en]+\b/i) doesn't work for strings consisting of two words, such as the Czech Republic.
Try /\b[^en]+\b$/i. That seems to work for me.
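The empty match and the two-word problem can both be reproduced in Python, whose \b and character classes behave the same way for these inputs. The sketch also shows an alternative that is not from the answers above: requiring the whole string to consist of allowed characters. The helper name has_no_e_or_n is made up.

```python
import re

# With * (or ?), the pattern can match the empty string at position 0.
empty = re.search(r"\b[^eEnN]*\b", "Slander")
print(repr(empty.group()))

# With +, "Slander" no longer matches at all.
print(re.search(r"\b[^eEnN]+\b", "Slander"))

# But in a two-word string the single space between the words still matches.
print(repr(re.search(r"\b[^eEnN]+\b", "Czech Republic").group()))


def has_no_e_or_n(text):
    # Alternative: every character of the string must be outside the class.
    return re.fullmatch(r"[^eEnN]+", text) is not None


for name in ("Italy", "Slander", "Czech Republic"):
    print(name, has_no_e_or_n(name))
```

Matching the whole string avoids relying on word boundaries, so multi-word values are judged as a unit.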

PL/SQL: Find all cyrillic (or non-latin1) signs via regex

I'm currently trying to figure out a way to output the IDs of all rows within a table that contain any Cyrillic (or non-Latin-1) letters, no matter what column they're in.
I've inherited a script that uses cursors to iterate through the tables and columns and searches for the Cyrillic characters via a regex statement using unistr(), but I can't figure out why it no longer seems to work on our Oracle 12 DB.
The statement is as follows:
stmt := 'select ID from '||table_name || ' where regexp_LIKE('||table_name||'.'||column_name||','||stmt_template|| ')';
table_name and column_name should be self-explanatory; stmt_template is a template that is defined earlier and contains my problem. 'stmt' is used as follows (and works):
OPEN stmt_cursor for stmt;
LOOP [some code]
The stmt_template is defined as follows and always throws an error:
stmt_template VARCHAR(32767) := '^[''||unistr(''\20AC'')||unistr(''\1EF8'')||''-''||unistr(''\1EF9'')||unistr(''\1EF2'')||''-''||unistr(''\1EF3'')||unistr(''\1EE4'')||''-''||unistr(''\1EE5'')||unistr(''\1ED6'')||''-''||unistr(''\1ED7'')||unistr(''\1ECA'')||''-''||unistr(''\1ECF'')||unistr(''\1EC4'')||''-''||unistr(''\1EC5'')||unistr(''\1EBD'')||unistr(''\1EAA'')||''-''||unistr(''\1EAC'')||unistr(''\1EA0'')||''-''||unistr(''\1EA1'')||unistr(''\1E9E'')||unistr(''\1E9B'')||unistr(''\1E8C'')||''-''||unistr(''\1E93'')||unistr(''\1E80'')||''-''||unistr(''\1E85'')||unistr(''\1E6A'')||''-''||unistr(''\1E6B'')||unistr(''\1E60'')||''-''||unistr(''\1E63'')||unistr(''\1E56'')||''-''||unistr(''\1E57'')||unistr(''\1E44'')||''-''||unistr(''\1E45'')||unistr(''\1E40'')||''-''||unistr(''\1E41'')||unistr(''\1E30'')||''-''||unistr(''\1E31'')||unistr(''\1E24'')||''-''||unistr(''\1E27'')||unistr(''\1E1E'')||''-''||unistr(''\1E21'')||unistr(''\1E10'')||''-''||unistr(''\1E11'')||unistr(''\1E0A'')||''-''||unistr(''\1E0B'')||unistr(''\1E02'')||''-''||unistr(''\1E03'')||unistr(''\0292'')||unistr(''\0259'')||unistr(''\022A'')||''-''||unistr(''\0233'')||unistr(''\01FA'')||''-''||unistr(''\021F'')||unistr(''\01F7'')||unistr(''\01F4'')||''-''||unistr(''\01F5'')||unistr(''\01E2'')||''-''||unistr(''\01EF'')||unistr(''\01DE'')||''-''||unistr(''\01DF'')||unistr(''\01CD'')||''-''||unistr(''\01D4'')||unistr(''\01BF'')||unistr(''\01B7'')||unistr(''\01AF'')||''-''||unistr(''\01b0'')||unistr(''\01A0'')||''-''||unistr(''\01A1'')||unistr(''\018F'')||unistr(''\0187'')||''-''||unistr(''\0188'')||unistr(''\0134'')||''-''||unistr(''\017f'')||unistr(''\00AE'')||''-''||unistr(''\0131'')||unistr(''\00A1'')||''-''||unistr(''\00AC'')||unistr(''\0009'')||unistr(''\000A'')||unistr(''\000D'')||unistr(''\0020'')||''-''||unistr(''\007E'')||'']*$'')';
This is supposed to search for a long list of Cyrillic letters and other special characters, but it throws the following:
ORA-00936: missing expression
I've already tried to search for everything outside the ASCII table using
stmt_template VARCHAR(32767) :='''[^-~]''';
but this doesn't return the test tuples I prepared (containing some Cyrillic characters, a € sign and so on); instead it returns some rows that don't contain any 'illegal' characters.
stmt_template VARCHAR(32767) := '''[^.' || CHR (1) || '-' || CHR (255) || ']''';
doesn't work either, as it gives the same result as the above.
Can anyone help me identify my mistake, typo or whatever error there is in the first regex statement?
If you need any more information, please tell me. Thanks in advance.
Your statement evaluates to:
select ID from table_name where regexp_LIKE(table_name.column_name,,'^['||unistr('\20AC')||unistr('\1EF8')||'-'||unistr('\1EF9')||unistr('\1EF2')||'-'||unistr('\1EF3')||unistr('\1EE4')||'-'||unistr('\1EE5')||unistr('\1ED6')||'-'||unistr('\1ED7')||unistr('\1ECA')||'-'||unistr('\1ECF')||unistr('\1EC4')||'-'||unistr('\1EC5')||unistr('\1EBD')||unistr('\1EAA')||'-'||unistr('\1EAC')||unistr('\1EA0')||'-'||unistr('\1EA1')||unistr('\1E9E')||unistr('\1E9B')||unistr('\1E8C')||'-'||unistr('\1E93')||unistr('\1E80')||'-'||unistr('\1E85')||unistr('\1E6A')||'-'||unistr('\1E6B')||unistr('\1E60')||'-'||unistr('\1E63')||unistr('\1E56')||'-'||unistr('\1E57')||unistr('\1E44')||'-'||unistr('\1E45')||unistr('\1E40')||'-'||unistr('\1E41')||unistr('\1E30')||'-'||unistr('\1E31')||unistr('\1E24')||'-'||unistr('\1E27')||unistr('\1E1E')||'-'||unistr('\1E21')||unistr('\1E10')||'-'||unistr('\1E11')||unistr('\1E0A')||'-'||unistr('\1E0B')||unistr('\1E02')||'-'||unistr('\1E03')||unistr('\0292')||unistr('\0259')||unistr('\022A')||'-'||unistr('\0233')||unistr('\01FA')||'-'||unistr('\021F')||unistr('\01F7')||unistr('\01F4')||'-'||unistr('\01F5')||unistr('\01E2')||'-'||unistr('\01EF')||unistr('\01DE')||'-'||unistr('\01DF')||unistr('\01CD')||'-'||unistr('\01D4')||unistr('\01BF')||unistr('\01B7')||unistr('\01AF')||'-'||unistr('\01b0')||unistr('\01A0')||'-'||unistr('\01A1')||unistr('\018F')||unistr('\0187')||'-'||unistr('\0188')||unistr('\0134')||'-'||unistr('\017f')||unistr('\00AE')||'-'||unistr('\0131')||unistr('\00A1')||'-'||unistr('\00AC')||unistr('\0009')||unistr('\000A')||unistr('\000D')||unistr('\0020')||'-'||unistr('\007E')||']*$'))
Which, with the guts of the regular expression removed looks like:
REGEXP_LIKE(table_name.column_name,,'your regex...'))
You need to remove the duplicate comma from the start of the regular expression string and the duplicate closing round bracket from the end.
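One way to spot this kind of problem is to print the assembled statement before opening the cursor. The Python sketch below mirrors the string concatenation with a short made-up template in place of the real character list; build_statement, parens_balanced, the table and column names, and both templates are hypothetical, and the parenthesis count ignores quoting, which is good enough for this illustration.

```python
def build_statement(table_name, column_name, stmt_template):
    # Mirrors the PL/SQL concatenation: the surrounding code already adds
    # the comma before the template and the closing parenthesis after it.
    return (
        "select ID from " + table_name
        + " where regexp_LIKE(" + table_name + "." + column_name
        + "," + stmt_template + ")"
    )


def parens_balanced(statement):
    return statement.count("(") == statement.count(")")


# Hypothetical short templates standing in for the long character class.
template_with_extra_paren = "'^[a-z]*$')"
template_without_extra_paren = "'^[a-z]*$'"

for template in (template_with_extra_paren, template_without_extra_paren):
    statement = build_statement("my_table", "my_column", template)
    print(statement, "| balanced:", parens_balanced(statement))
```

A template that closes the function call itself ends up with one parenthesis too many once the surrounding code appends its own.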
Change your definition of stmt_template to
stmt_template VARCHAR(32767) := '^[''''||unistr(''\20AC'')||unistr(''\1EF8'')||''-''||
unistr(''\1EF9'')||unistr(''\1EF2'')||''-''||
unistr(''\1EF3'')||unistr(''\1EE4'')||''-''||
unistr(''\1EE5'')||unistr(''\1ED6'')||''-''||
unistr(''\1ED7'')||unistr(''\1ECA'')||''-''||
unistr(''\1ECF'')||unistr(''\1EC4'')||''-''||
unistr(''\1EC5'')||unistr(''\1EBD'')||unistr(''\1EAA'')||''-''||
unistr(''\1EAC'')||unistr(''\1EA0'')||''-''||
unistr(''\1EA1'')||unistr(''\1E9E'')||unistr(''\1E9B'')||unistr(''\1E8C'')||''-''||
unistr(''\1E93'')||unistr(''\1E80'')||''-''||
unistr(''\1E85'')||unistr(''\1E6A'')||''-''||
unistr(''\1E6B'')||unistr(''\1E60'')||''-''||
unistr(''\1E63'')||unistr(''\1E56'')||''-''||
unistr(''\1E57'')||unistr(''\1E44'')||''-''||
unistr(''\1E45'')||unistr(''\1E40'')||''-''||
unistr(''\1E41'')||unistr(''\1E30'')||''-''||
unistr(''\1E31'')||unistr(''\1E24'')||''-''||
unistr(''\1E27'')||unistr(''\1E1E'')||''-''||
unistr(''\1E21'')||unistr(''\1E10'')||''-''||
unistr(''\1E11'')||unistr(''\1E0A'')||''-''||
unistr(''\1E0B'')||unistr(''\1E02'')||''-''||
unistr(''\1E03'')||unistr(''\0292'')||unistr(''\0259'')||unistr(''\022A'')||''-''||
unistr(''\0233'')||unistr(''\01FA'')||''-''||
unistr(''\021F'')||unistr(''\01F7'')||unistr(''\01F4'')||''-''||
unistr(''\01F5'')||unistr(''\01E2'')||''-''||
unistr(''\01EF'')||unistr(''\01DE'')||''-''||
unistr(''\01DF'')||unistr(''\01CD'')||''-''||
unistr(''\01D4'')||unistr(''\01BF'')||unistr(''\01B7'')||unistr(''\01AF'')||''-''||
unistr(''\01b0'')||unistr(''\01A0'')||''-''||
unistr(''\01A1'')||unistr(''\018F'')||unistr(''\0187'')||''-''||
unistr(''\0188'')||unistr(''\0134'')||''-''||
unistr(''\017f'')||unistr(''\00AE'')||''-''||
unistr(''\0131'')||unistr(''\00A1'')||''-''||
unistr(''\00AC'')||unistr(''\0009'')||unistr(''\000A'')||unistr(''\000D'')||unistr(''\0020'')||''-''||
unistr(''\007E'')||'''']*$'')';
It appears that the original definition left an unbalanced single quote at the beginning and end of the string. I'm still not certain that this will work, as there appears to be an unmatched right parenthesis at the very end of the string, but it might be better.
Best of luck.
This should give you data that isn't within the 7-bit ASCII range chr(32) - chr(127):
select col1
from my_table
where regexp_like(col1, '[^'||chr(32)||'-'||chr(127)||']')
Note that I'm excluding control characters (less than dec 32) and extended ascii (> 127) in my range.
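The effect of the range inside the negated class can be checked quickly in Python. The sketch assumes the intended range is space through tilde (the printable ASCII characters, one short of chr(127)); it also shows that a class written as [^-~], without the leading space, only excludes the hyphen and the tilde and therefore matches almost every row. Both pattern names are made up.

```python
import re

NON_PRINTABLE_ASCII = re.compile(r"[^ -~]")  # anything outside space..tilde
HYPHEN_TILDE_ONLY = re.compile(r"[^-~]")     # anything except '-' and '~'

plain = "plain text"
with_euro = "price: 5 \u20ac"  # ends with a euro sign

print(NON_PRINTABLE_ASCII.search(plain))      # no match for plain ASCII text
print(NON_PRINTABLE_ASCII.search(with_euro))  # finds the euro sign
print(HYPHEN_TILDE_ONLY.search(plain))        # matches the very first letter
```

As with the SQL version, control characters such as tab and newline fall outside the range and are reported too.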

Getting a specific tag and combining if multiple same tags are found together

I want to keep the words with the tag NA. If more than one such word appears in a row, I want to combine them into one word.
Example:
%if i have
a='[The/D, handle/NA, of/NS, the/NaAq, hair/NA, brush/NA, is/NaAZ broken/A]'
% the output I want:
output={'handle', 'hair brush'}
I tried searching for /NA, but the problem is that there are false positives, namely 'the' and 'is'.
Currently my code is:
g=split(a(2:end-1));
b= strfind(g,'/NA');
g(~cellfun(@isempty, b))
Any ideas how to proceed? Any one-line regular expression will be very helpful if possible.
Looks like a nice NLP problem. Maybe this gets you started:
a='[The/D, handle/NA, of/NS, the/NaAq, hair/NA, brush/NA, is/NaAZ broken/A]';
output={'handle', 'hair brush'};
expr = '(\S+/NA, )+'; % look for words followed by '/NA, '
match = regexp(a,expr,'match');
output = strtrim(strrep(match,'/NA,','')) % strrep: get rid of tag - strtrim: get rid of trailing blank
Note that this approach will fail if the last word is tagged with /NA. You can catch that case independently though.
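A different approach, sketched here in Python rather than MATLAB, avoids the last-word problem by splitting the string into word/tag pairs and grouping consecutive NA tokens. It is an illustration under the assumption that every token has the form word/TAG; na_phrases is a made-up name.

```python
import re
from itertools import groupby


def na_phrases(tagged):
    # Drop the surrounding brackets, then split on whitespace with an
    # optional preceding comma (the sample has one token without a comma).
    tokens = re.split(r",?\s+", tagged.strip("[]"))
    pairs = [token.rpartition("/") for token in tokens]  # (word, "/", tag)
    phrases = []
    # Group consecutive tokens by whether their tag is exactly "NA".
    for is_na, run in groupby(pairs, key=lambda pair: pair[2] == "NA"):
        if is_na:
            phrases.append(" ".join(word for word, _, _ in run))
    return phrases


a = "[The/D, handle/NA, of/NS, the/NaAq, hair/NA, brush/NA, is/NaAZ broken/A]"
print(na_phrases(a))
print(na_phrases("[a/D, b/NA, c/NA]"))
```

Because the tag is compared as a whole, tags such as NaAq or NaAZ are not mistaken for NA, and a run of NA words at the very end of the input is kept.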