Equivalent regex in T-SQL with start/end input markers - regex

I'm trying to get the following RegEx to work:
^[a-zA-Z][a-zA-Z ''.-]+[a-zA-Z]$
It should allow any alphas, space, apostrophe, full stop and hyphen as long as the beginning and last chars as alphas.
John - ok
John Smith - ok
John-Smith - ok
John.Smith - ok
.John Smith - not ok
John Smith. - not ok
When I use this in T-SQL it doesn't seem to work and I'm not sure if it the input start/end markers that are not compatible in T-SQL. How do I translate this to valid T-SQL?:
CREATE Function [dbo].[IsValidName](#value VarChar(MAX))
RETURNS INT
AS
Begin
DECLARE #temp INT
SET #temp = (
SELECT
CASE WHEN #value LIKE '%^[a-zA-Z][a-zA-Z ''.-]+[a-zA-Z]$%' THEN 1
ELSE 0
END
)
RETURN #Temp
End

T-SQL doesn't support regular expressions "out of the box". Depending on what environment you are using, there are different solutions, but none will probably be "pure T-SQL". In a Microsoft environment you can use CLR procedures to achieve this.
See SQL Server Regular expressions in T-SQL for some options.

I made SOMETHING like this to scrub data, to remove non-alpha characters, I've slightly modified it to fit your needs
CREATE Function [dbo].Func (#Temp VarChar(1000))
Returns VarChar(1000)
AS
BEGIN
DECLARE #Len INT = LEN(#Temp)
DECLARE #RETURN INT
Declare #KeepValues as varchar(50)
Set #KeepValues = '%[^a-z^ ]%'
IF PatIndex(#KeepValues, #Temp) = 1
BEGIN
Set #RETURN = 0
END
IF PATINDEX(#KeepValues, #Temp) = #Len
BEGIN
SET #RETURN = 0
END
IF PATINDEX(#KeepValues, #Temp) = 0
SET #RETURN = 1
IF #RETURN IS NULL
BEGIN
SET #Return = 1
END
RETURN #RETURN
END
This is assuming you would not need to do any sort of data scrubbing for restricted characters. If you need to scrub for restricted characters let me know we can add a little more in there but based on your dataset this will return the correct answers

Related

Regular expression match U-SQL

I need a help in writing in U-SQL to output records to two different files based on a regular expression output.
Let me explain my scenario in detail.
Let us assume my input file has two columns, "Name" and person identification number ("PIN"):
Name , PIN
John ,12345
Harry ,01234
Tom, 24659
My condition for PIN is it should start with either 1 or 2. In the above case records 1 & 3 are valid and record 2 is invalid.
I need to output record 1 & 3 to my output processed file and 2 to my error file
How can I do this and also can I use Regex.Match to validate the regular expression?
//posting my code
#person =
EXTRACT UserId int,
PNR string,
UID String,
FROM "/Samples/Data/person.csv"
USING Extractors.csv();
#rs1=select UserId,PNR,UID,Regex.match(PNR,'^(19|20)[0-9]{2}((0[1-9])$') as pnrval,Regex.match(UID,'^(19|20)[0-9]{2}$') as uidval
from #person
#rs2 = select UserId,PNR,UID from #rs1 where pnrval=true or uidval=true
#rs3 = select UserId,PNR,UID from #rs1 where uidval=false or uidval= false
OUTPUT #rs2
TO "/output/sl.csv"
USING Outputters.Csv();
OUTPUT #rs3
TO "/output/error.csv"
USING Outputters.Csv();
But I'm receiving this error:
Severity Code Description Project File Line Suppression State Error
E_CSC_USER_INVALIDCOLUMNTYPE: 'System.Text.RegularExpressions.Match'
cannot be used as column type.
#someData =
SELECT * FROM
( VALUES
("John", "12345"),
("Harry", "01234"),
("Tom", "24659")
) AS T(Name, pin);
#result1 =
SELECT Name,
pin
FROM #someData
WHERE pin.StartsWith("1") OR pin.StartsWith("2");
#result2 =
SELECT Name,
pin
FROM #someData
WHERE !pin.StartsWith("1") AND !pin.StartsWith("2");
#person =
EXTRACT UserId int,
PNR string,
UID String,
FROM "/Samples/Data/person.csv"
USING Extractors.csv();
#rs1=select UserId,PNR,UID,Regex.Ismatch(PNR,'^(19|20)[0-9]{2}((0[1-9])$') as pnrval,Regex.Ismatch(UID,'^(19|20)[0-9]{2}$') as uidval
from #person
#rs2 = select UserId,PNR,UID from #rs1 where pnrval=true or uidval=true
#rs3 = select UserId,PNR,UID from #rs1 where pnrval=false or uidval= false
OUTPUT #rs2
TO "/output/sl.csv"
USING Outputters.Csv();
OUTPUT #rs3
TO "/output/error.csv"
USING Outputters.Csv();
This worked for my requirement. Thanks for the support and suggestions
Considering your input, I would use
.*\s*,\s*[12]\d+
.* matches any amount of characters and is needed to match everything before the comma
\s*,\s* matches a comma optionally preceded and or followed by any amount of blanks (\s matches a blank)
[12] matches a single digit, equal to 1 or 2; this satisfies your requirement about PINs
\d+ matches one or more digits
Live demo here.
As far as using Regex.Match, I'll quote this answer on StackOverflow:
System.Text.RegularExpressions.Match is not part of the built-in U-SQL types.
So what I would do here is pre-parsing your CSV in C#; something like:
Regex CurrentRegex = new Regex(#".*\s*,\s*[12]\d+", RegexOptions.IgnoreCase);
foreach (var LineOfText in File.ReadAllLines(InputFilePath))
{
Match CurrentMatch = CurrentRegex.Match(LineOfText);
if (CurrentMatch.Success)
{
// Append line to success file
}
else
{
// Append line to error file
}
CurrentMatch = CurrentMatch.NextMatch();
}

VBA: Filtering by multiple criteria (more than 2) using wildcards [duplicate]

Right now I am doing coding to set a filter for a data chart. Basically, I don't know how to post the data sheet up here so just try to type them ):
(starting from the left is column A)
Name * BDevice * Quantity * Sale* Owner
Basically I need to filter out for 2 column:
-The BDevice with any word contain "M1454" or "M1467" or "M1879" (It means that M1454A or M1467TR would still fit in)
-The Owner with PROD or RISK
Here is the code I wrote:
Sub AutoFilter()
ActiveWorkbook.ActiveSheet..Range(B:B).Select
Selection.Autofilter Field:=1 Criteria1:=Array( _
"*M1454*", "*M1467*", "*M1879*"), Operator:=xlFilterValues
Selection.AutoFilter Field:=4 Criteria1:="=PROD" _
, Operator:=xlOr, Criteria2:="=RISK"
End Sub
When I run the code, the machine returns error 1004 and the part which seems to be wrong is the Filter part 2 ( I am not sure about the use of Field, so I can not say it for sure)
Edit; Santosh: When I try your code, the machine gets error 9 subscript out of range. The error came from the with statement. (since the data table has A to AS column so I just change to A:AS)
While there is a maximum of two direct wildcards per field in the AutoFilter method, pattern matching can be used to create an array that replaces the wildcards with the Operator:=xlFilterValues option. A Select Case statement helps the wildcard matching.
The second field is a simple Criteria1 and Criteria2 direct match with a Operator:=xlOr joining the two criteria.
Sub multiWildcardFilter()
Dim a As Long, aARRs As Variant, dVALs As Object
Set dVALs = CreateObject("Scripting.Dictionary")
dVALs.CompareMode = vbTextCompare
With Worksheets("Sheet1")
If .AutoFilterMode Then .AutoFilterMode = False
With .Cells(1, 1).CurrentRegion
'build a dictionary so the keys can be used as the array filter
aARRs = .Columns(2).Cells.Value2
For a = LBound(aARRs, 1) + 1 To UBound(aARRs, 1)
Select Case True
Case aARRs(a, 1) Like "MK1454*"
dVALs.Add Key:=aARRs(a, 1), Item:=aARRs(a, 1)
Case aARRs(a, 1) Like "MK1467*"
dVALs.Add Key:=aARRs(a, 1), Item:=aARRs(a, 1)
Case aARRs(a, 1) Like "MK1879*"
dVALs.Add Key:=aARRs(a, 1), Item:=aARRs(a, 1)
Case Else
'no match. do nothing
End Select
Next a
'filter on column B if dictionary keys exist
If CBool(dVALs.Count) Then _
.AutoFilter Field:=2, Criteria1:=dVALs.keys, _
Operator:=xlFilterValues, VisibleDropDown:=False
'filter on column E
.AutoFilter Field:=5, Criteria1:="PROD", Operator:=xlOr, _
Criteria2:="RISK", VisibleDropDown:=False
'data is filtered on MK1454*, MK1467* or MK1879* (column B)
'column E is either PROD or RISK
'Perform work on filtered data here
End With
If .AutoFilterMode Then .AutoFilterMode = False
End With
dVALs.RemoveAll: Set dVALs = Nothing
End Sub
If exclusions¹ are to be added to the filtering, their logic should be placed at the top of the Select.. End Select statement in order that they are not added through a false positive to other matching criteria.
                                Before applying AutoFilter Method
                                After applying AutoFilter w/ multiple wildcards
¹ See Can Advanced Filter criteria be in the VBA rather than a range? and Can AutoFilter take both inclusive and non-inclusive wildcards from Dictionary keys? for more on adding exclusions to the dictionary's filter set.
For using partial strings to exclude rows and include blanks you should use
'From Jeeped's code
Dim dVals As Scripting.Dictionary
Set dVals = CreateObject("Scripting.Dictionary")
dVals.CompareMode = vbTextCompare
Dim col3() As Variant
Dim col3init As Integer
'Swallow row3 into an array; start from 1 so it corresponds to row
For col3init = 1 to Sheets("Sheet1").UsedRange.Rows.count
col3(col3init) = Sheets("Sheet1").Range(Cells(col3init,3),Cells(col3init,3)).Value
Next col3init
Dim excludeArray() As Variant
'Partial strings in below array will be checked against rows
excludeArray = Array("MK1", "MK2", "MK3")
Dim col3check As Integer
Dim excludecheck as Integer
Dim violations As Integer
For col3check = 1 to UBound(col3)
For excludecheck = 0 to UBound(excludeArray)
If Instr(1,col3(col3check),excludeArray(excludecheck)) <> 0 Then
violations = violations + 1
'Sometimes the partial string you're filtering out for may appear more than once.
End If
Next col3check
If violations = 0 and Not dVals.Exists(col3(col3check)) Then
dVals.Add Key:=col3(col3check), Item:=col3(col3check) 'adds keys for items where the partial strings in excludeArray do NOT appear
ElseIf col3(col3check) = "" Then
dVals.Item(Chr(61)) = Chr(61) 'blanks
End If
violations = 0
Next col3check
The dVals.Item(Chr(61)) = Chr(61) idea came from Jeeped's other answer here
Multiple Filter Criteria for blanks and numbers using wildcard on same field just doesn't work
Try below code :
max 2 wildcard expression for Criteria1 works. Refer this link
Sub AutoFilter()
With ThisWorkbook.Sheets("sheet1").Range("A:E")
.AutoFilter Field:=2, Criteria1:=Array("*M1454*", "*M1467*"), Operator:=xlFilterValues
.AutoFilter Field:=5, Criteria1:="=PROD", Operator:=xlOr, Criteria2:="=RISK"
End With
End Sub

How to find all the source lines containing desired table names from user_source by using 'regexp'

For example we have a large database contains lots of oracle packages, and now we want to see where a specific table resists in the source code. The source code is stored in user_source table and our desired table is called 'company'.
Normally, I would like to use:
select * from user_source
where upper(text) like '%COMPANY%'
This will return all words containing 'company', like
121 company cmy
14 company_id, idx_name %% end of coding
453 ;companyname
1253 from db.company.company_id where
989 using company, idx, db_name,
So how to make this result more intelligent using regular expression to parse all the source lines matching a meaningful table name (means a table to the compiler)?
So normally we allow the matched word contains chars like . ; , '' "" but not _
Can anyone make this work?
To find company as a "whole word" with a regular expression:
SELECT * FROM user_source
WHERE REGEXP_LIKE(text, '(^|\s)company(\s|$)', 'i');
The third argument of i makes the REGEXP_LIKE search case-insensitive.
As far as ignoring the characters . ; , '' "", you can use REGEXP_REPLACE to suck them out of the string before doing the comparison:
SELECT * FROM user_source
WHERE REGEXP_LIKE(REGEXP_REPLACE(text, '[.;,''"]'), '(^|\s)company(\s|$)', 'i');
Addendum: The following query will also help locate table references. It won't give the source line, but it's a start:
SELECT *
FROM user_dependencies
WHERE referenced_name = 'COMPANY'
AND referenced_type = 'TABLE';
If you want to identify the objects that refer to your table, you can get that information from the data dictionary:
select *
from all_dependencies
where referenced_owner = 'DB'
and referenced_name = 'COMPANY'
and referenced_type = 'TABLE';
You can't get the individual line numbers from that, but you can then either look at user_source or use a regexp on the specific source code, which woudl at least reduce false positives.
SELECT * FROM user_source
WHERE REGEXP_LIKE(text,'([^_a-z0-9])company([^_a-z0-9])','i')
Thanks #Ed Gibbs, with a little trick this modified answer could be more intelligent.

PL/SQL optimize searching a date in varchar

I have a table, that contains date field (let it be date s_date) and description field (varchar2(n) desc). What I need is to write a script (or a single query, if possible), that will parse the desc field and if it contains a valid oracle date, then it will cut this date and update the s_date, if it is null.
But there are one more condition - there are must be exactly one occurence of a date in the desc. If there are 0 or >1 - nothing should be updated.
By the time I came up with this pretty ugly solution using regular expressions:
----------------------------------------------
create or replace function to_date_single( p_date_str in varchar2 )
return date
is
l_date date;
pRegEx varchar(150);
pResStr varchar(150);
begin
pRegEx := '((0[1-9]|[12][0-9]|3[01])[.](0[1-9]|1[012])[.](19|20)\d\d)((.|\n|\t|\s)*((0[1-9]|[12][0-9]|3[01])[.](0[1-9]|1[012])[.](19|20)\d\d))?';
pResStr := regexp_substr(p_date_str, pRegEx);
if not (length(pResStr) = 10)
then return null;
end if;
l_date := to_date(pResStr, 'dd.mm.yyyy');
return l_date;
exception
when others then return null;
end to_date_single;
----------------------------------------------
update myTable t
set t.s_date = to_date_single(t.desc)
where t.s_date is null;
----------------------------------------------
But it's working extremely slow (more than a second for each record and i need to update about 30000 records). Is it possible to optimize the function somehow? Maybe it is the way to do the thing without regexp? Any other ideas?
Any advice is appreciated :)
EDIT:
OK, maybe it'll be useful for someone. The following regular expression performs check for valid date (DD.MM.YYYY) taking into account the number of days in a month, including the check for leap year:
(((0[1-9]|[12]\d|3[01])\.(0[13578]|1[02])\.((19|[2-9]\d)\d{2}))|((0[1-9]|[12]\d|30)\.(0[13456789]|1[012])\.((19|[2-9]\d)\d{2}))|((0[1-9]|1\d|2[0-8])\.02\.((19|[2-9]\d)\d{2}))|(29\.02\.((1[6-9]|[2-9]\d)(0[48]|[2468][048]|[13579][26])|((16|[2468][048]|[3579][26])00))))
I used it with the query, suggested by #David (see accepted answer), but I've tried select instead of update (so it's 1 regexp less per row, because we don't do regexp_substr) just for "benchmarking" purpose.
Numbers probably won't tell much here, cause it all depends on hardware, software and specific DB design, but it took about 2 minutes to select 36K records for me. Update will be slower, but I think It'll still be a reasonable time.
I would refactor it along the lines of a single update query.
Use two regexp_instr() calls in the where clause to find rows for which a first occurrence of the match occurs and a second occurrence does not, and regexp_substr() to pull the matching characters for the update.
update my_table
set my_date = to_date(regexp_subtr(desc,...),...)
where regexp_instr(desc,pattern,1,1) > 0 and
regexp_instr(desc,pattern,1,2) = 0
You might get even better performance with:
update my_table
set my_date = to_date(regexp_subtr(desc,...),...)
where case regexp_instr(desc,pattern,1,1)
when 0 then 'N'
else case regexp_instr(desc,pattern,1,2)
when 0 then 'Y'
else 'N'
end
end = 'Y'
... as it only evaluates the second regexp if the first is non-zero. The first query might also do that but the optimiser might choose to evaluate the second predicate first because it is an equality condition, under the assumption that it's more selective.
Or reordering the Case expression might be better -- it's a trade-off that's difficult to judge and probably very dependent on the data.
I think there's no way to improve this task. Actually, in order to achieve what you want it should get even slower.
Your regular expression matches text like 31.02.2013, 31.04.2013 outside the range of the month. If you put year in the game,
it gets even worse. 29.02.2012 is valid, but 29.02.2013 is not.
That's why you have to test if the result is a valid date.
Since there isn't a full regular expression for that, you would have to do it by PLSQL really.
In your to_date_single function you return null when a invalid date is found.
But that doesn't mean there won't be other valid dates forward on the text.
So you have to keep trying until you either find two valid dates or hit the end of the text:
create or replace function fn_to_date(p_date_str in varchar2) return date is
l_date date;
pRegEx varchar(150);
pResStr varchar(150);
vn_findings number;
vn_loop number;
begin
vn_findings := 0;
vn_loop := 1;
pRegEx := '((0[1-9]|[12][0-9]|3[01])[.](0[1-9]|1[012])[.](19|20)\d\d)';
loop
pResStr := regexp_substr(p_date_str, pRegEx, 1, vn_loop);
if pResStr is null then exit; end if;
begin
l_date := to_date(pResStr, 'dd.mm.yyyy');
vn_findings := vn_findings + 1;
-- your crazy requirement :)
if vn_findings = 2 then
return null;
end if;
exception when others then
null;
end;
-- you have to keep trying :)
vn_loop := vn_loop + 1;
end loop;
return l_date;
end;
Some tests:
select fn_to_date('xxxx29.02.2012xxxxx') c1 --ok
, fn_to_date('xxxx29.02.2012xxx29.02.2013xxx') c2 --ok, 2nd is invalid
, fn_to_date('xxxx29.02.2012xxx29.02.2016xxx') c2 --null, both are valid
from dual
As you are going to have to do try and error anyway one idea would be to use a simpler regular expression.
Something like \d\d[.]\d\d[.]\d\d\d\d would suffice. That would depend on your data, of course.
Using #David's idea you could filter the ammount of rows to apply your to_date_single function (because it's slow),
but regular expressions alone won't do what you want:
update my_table
set my_date = fn_to_date( )
where regexp_instr(desc,patern,1,1) > 0

Validating a Salesforce Id

Is there a way to validate a Salesforce ID, maybe using RegEx? They are normally 15 chars or 18 chars but do they follow a pattern that we can use to check that it's a valid id.
There are two levels of validating salesforce id:
check format using regular expression [a-zA-Z0-9]{15}|[a-zA-Z0-9]{18}
for 18-characted ids you can check the the 3-character checksum:
Code examples provided in comments:
C#
Go
Javascript
Ruby
Something like this should work:
[a-zA-Z0-9]{15,18}
It was suggested that this may be more correct because it prevents Ids with lengths of 16 and 17 characters to be rejected, also we try to match against 18 char length first with 15 length as a fallback:
[a-zA-Z0-9]{18}|[a-zA-Z0-9]{15}
Just use instanceOf to check if the string is an instance of Id.
String s = '1234';
if (s instanceOf Id) System.debug('valid id');
else System.debug('invalid id');
The easiest way I've come across, is to create a new ID variable and assign a String to it.
ID MyTestID = null;
try {
MyTestID = MyTestString; }
catch(Exception ex) { }
If MyTestID is null after trying to assign it, the ID was invalid.
This regex has given me the optimal results so far.
\b[a-z0-9]\w{4}0\w{12}|[a-z0-9]\w{4}0\w{9}\b
You can also check for 15 chars, and then add an extra 3 chars optional, with an expression similar to:
^[a-z0-9]{15}(?:[a-z0-9]{3})?$
on i mode, or not:
^[A-Za-z0-9]{15}(?:[A-Za-z0-9]{3})?$
Demo
If you wish to simplify/modify/explore the expression, it's been explained on the top right panel of regex101.com. If you'd like, you can also watch in this link, how it would match against some sample inputs.
RegEx Circuit
jex.im visualizes regular expressions:
Javascript: /^(?=.*?\d)(?=.*?[a-z])[a-z\d]{18}$/i
These were the Salesforce Id validation requirements for me.
18 characters only
At least one digit
At least one alphabet
Case insensitive
Test cases
Should fail
1
a
1234
abgcde
1234aDcde
12345678901234567*
123456789012345678
abcDefghijabcdefgh
Should pass
1234567890abcDeFgh
1234abcd1234abcd12
abcd1234abcd1234ab
1abcDefhijabcdefgf
abcDefghijabcdefg1
12345678901234567a
a12345678901234567
For understanding the regex, please refer this thread
The regex provided by Daniel Sokolowski works perfectly to verify if the id is in the correct format.
If you want to verify if an id corresponds to an actual record in the database, you'll need to first find the object type from the first three characters (commonly known as prefix) and then query the object type:
boolean isValidAndExists(String key) {
Map<String, Schema.SObjectType> objTypes = Schema.getGlobalDescribe();
for (Schema.SObjectType objType : objTypes.values()) {
Schema.DescribeSObjectResult objDesc = objType.getDescribe();
if (objDesc.getKeyPrefix() == key.substring(0,3)) {
String objName = objDesc.getName();
String query = 'SELECT Id FROM ' + objName + ' WHERE Id = \'' + key + '\'';
SObject[] objs = Database.query(query);
return !objs.isEmpty();
}
}
return false;
}
Be aware that Schema.getGlobalDescribe can be an expensive operation and degrade the performance of your application if you use that often.
If you need to check that often, I recommend creating a Custom Setting or Custom Metadata to store the relation between prefixes and object types.
Assuming you want to validate Ids in Apex, there are a few approaches discussed in the other answers. Here is an alternative, with notes on the various approaches.
The try-catch method (credit to #matt_k) certainly works, but some folks worry about overhead, especially if testing many Ids.
I used instanceof Id for a long time (credit to #melani_s), until I discovered that it sometimes gives the wrong answer (e.g., '481D0B74-41CF-47E9').
Multiple answers suggest regexen. As the accepted answer correctly points out (credit to #zacheusz), 18 character Ids are only valid if their checksums are correct, which means the regex solutions can be wrong. That answer also helpfully provides code in several languages to test Id checksums. But not in Apex.
I was going to implement the checksum code in Apex, but then I realized the Salesforce had already done the work, so instead I just convert 18 digit Ids to 15 digit Ids (via .to15() which uses the checksum to fix capitalization, as opposed to truncating the string) and then back to 18 digits to let SF do the checksum calc, then I compare the original checksum and the new one. This is my method:
static Pattern ID_REGEX = Pattern.compile('[a-zA-Z0-9]{15}(?:[A-Z0-5]{3})?');
/**
* #description Determines if a string is a valid SalesforceId. Confirms checksum of 18 digit Ids.
* Works for cases where `x instanceof id` returns the wrong answer, like '481D0B74-41CF-47E9'.
* Does NOT check for the existence of a record with the given Id.
* #param s a string to validate
*
* #return true if the string `s` is a valid Salesforce Id.
*/
public static Boolean isValidId(String s) {
Matcher m = ID_REGEX.matcher(s);
if (m.matches() == false) return false; // if it doesn't match the regex it cannot be valid
if (s.length() == 15) return true; // if 15 char string matches the regex, assume it must be valid
String check = (Id)((Id)s).to15(); // Convert to 15 char Id, then to Id and back to string, giving correct 18-char Id
return s.right(3) == check.right(3); // if 18 char string matches the regex, valid if checksum correct
}
Additionally checking getSObjectType() != null would be perfect if we are dealing with Salesforce records
public static boolean isRecordId(string recordId){
try{
return string.isNotBlank(recordId) && ((Id)recordId.trim()).getSObjectType() != null;
}catch(Exception ex){
return false;
}
}