Regular expression match U-SQL - regex

I need a help in writing in U-SQL to output records to two different files based on a regular expression output.
Let me explain my scenario in detail.
Let us assume my input file has two columns, "Name" and person identification number ("PIN"):
Name , PIN
John ,12345
Harry ,01234
Tom, 24659
My condition for PIN is it should start with either 1 or 2. In the above case records 1 & 3 are valid and record 2 is invalid.
I need to output record 1 & 3 to my output processed file and 2 to my error file
How can I do this and also can I use Regex.Match to validate the regular expression?
//posting my code
#person =
EXTRACT UserId int,
PNR string,
UID String,
FROM "/Samples/Data/person.csv"
USING Extractors.csv();
#rs1=select UserId,PNR,UID,Regex.match(PNR,'^(19|20)[0-9]{2}((0[1-9])$') as pnrval,Regex.match(UID,'^(19|20)[0-9]{2}$') as uidval
from #person
#rs2 = select UserId,PNR,UID from #rs1 where pnrval=true or uidval=true
#rs3 = select UserId,PNR,UID from #rs1 where uidval=false or uidval= false
OUTPUT #rs2
TO "/output/sl.csv"
USING Outputters.Csv();
OUTPUT #rs3
TO "/output/error.csv"
USING Outputters.Csv();
But I'm receiving this error:
Severity Code Description Project File Line Suppression State Error
E_CSC_USER_INVALIDCOLUMNTYPE: 'System.Text.RegularExpressions.Match'
cannot be used as column type.

#someData =
SELECT * FROM
( VALUES
("John", "12345"),
("Harry", "01234"),
("Tom", "24659")
) AS T(Name, pin);
#result1 =
SELECT Name,
pin
FROM #someData
WHERE pin.StartsWith("1") OR pin.StartsWith("2");
#result2 =
SELECT Name,
pin
FROM #someData
WHERE !pin.StartsWith("1") AND !pin.StartsWith("2");

#person =
EXTRACT UserId int,
PNR string,
UID String,
FROM "/Samples/Data/person.csv"
USING Extractors.csv();
#rs1=select UserId,PNR,UID,Regex.Ismatch(PNR,'^(19|20)[0-9]{2}((0[1-9])$') as pnrval,Regex.Ismatch(UID,'^(19|20)[0-9]{2}$') as uidval
from #person
#rs2 = select UserId,PNR,UID from #rs1 where pnrval=true or uidval=true
#rs3 = select UserId,PNR,UID from #rs1 where pnrval=false or uidval= false
OUTPUT #rs2
TO "/output/sl.csv"
USING Outputters.Csv();
OUTPUT #rs3
TO "/output/error.csv"
USING Outputters.Csv();
This worked for my requirement. Thanks for the support and suggestions

Considering your input, I would use
.*\s*,\s*[12]\d+
.* matches any amount of characters and is needed to match everything before the comma
\s*,\s* matches a comma optionally preceded and or followed by any amount of blanks (\s matches a blank)
[12] matches a single digit, equal to 1 or 2; this satisfies your requirement about PINs
\d+ matches one or more digits
Live demo here.
As far as using Regex.Match, I'll quote this answer on StackOverflow:
System.Text.RegularExpressions.Match is not part of the built-in U-SQL types.
So what I would do here is pre-parsing your CSV in C#; something like:
Regex CurrentRegex = new Regex(#".*\s*,\s*[12]\d+", RegexOptions.IgnoreCase);
foreach (var LineOfText in File.ReadAllLines(InputFilePath))
{
Match CurrentMatch = CurrentRegex.Match(LineOfText);
if (CurrentMatch.Success)
{
// Append line to success file
}
else
{
// Append line to error file
}
CurrentMatch = CurrentMatch.NextMatch();
}

Related

Fluentd Parsing

Hi i'm trying to parse single line log using fluentd. Here is log i'm trying to parse.
F2:4200000000000000,F3:000000,F4:000000060000,F6:000000000000,F7:000000000,F8..........etc
This will parse into like this:
{ "F2" : "4200000000000000", "F3" : "000000", "F4" : "000000060000" ............etc }
I tried to use regex but it's confusing and making me write multiple regexes for different keys and values. Is there any easier way to achieve this ?
EDIT1: Heya! I will make this more detailed. I'm currently tailing logs using fluentd to Elasticsearch+Kibana. Here is unparsed example log that fluentd sending to Elasticsearch:
21/09/02 16:36:09.927238: 1 frSMS:0:13995:#HTF4J::141:141:msg0210,00000000000000000,000000,000000,007232,00,#,F2:00000000000000000,F3:002000,F4:000000820000,F6:Random message and strings,F7:.......etc
Elasticsearch recived message:
{"message":"frSMS:0:13995:#HTF4J::141:141:msg0210,00000000000000000,000000,000000,007232,00,#,F2:00000000000000000,F3:002000,F4:000000820000,F6:Random
digits and chars,F7:.......etc"}
This log has only message key so i can't index and create dashboard on only using whole message field. What am i trying to achieve is catch only useful fields, add key into it if it has no key and make indexing easier.
Expected output:
{"logdate" : "21/09/02 16:36:09.927238",
"source" : "frSMS",
"UID" : "#HTF4J",
"statuscode" : "msg0210",
"F2": "00000000000000000",
"F3": "randomchar314516",.....}
I used regex plugin to parse into this but it was too overwhelming and . Here is what i did so far:
^(?<logDate>\d{2}.\d{2}.\d{2}\s\d{2}:\d{2}:\d{2}.\d{6}\b)....(?<source>fr[A-Z]{3,4}|to[A-Z]{3,4}\b).(?<status>\d\b).(?<dummyfield>\d{5}\b).(?<HUID>.[A-Z]{5}\b)..(?<d1>\d{3}\b).(?<d2>\d{3}\b).(?<msgcode>msg\d{4}\b).(?<dummyfield1>\d{16}\b).(?<dummyfield2>\d{6}\b).(?<dummyfield3>\d{6,7}\b).(?<dummyfield4>\d{6}\b).(?<dummyfield5>\d{2}\b)...
Which results to :
"logDate": "21/09/02 16:36:09.205706",
"source": "toSMS" ,
"status": "0",
"dummyfield": "13995" ,
"UID" : "#HTFAA" ,
"d1" : "156" ,
"d2" : "156" ,
"msgcode" : "msg0210",
"dummyfield1" :"0000000000000000" ,
"dummyfield2" :"002000",
"dummyfield3" :"2000000",
"dummyfield4" :"00",
"dummyfield5" :"2000000" ,
"dummyfield6" :"867202"
Which only applies to example log and has useless fields like field1, dummyfield, dummyfield1 etc.
Other logs has the useful values and keys(date,source,msgcode,UID,F1,F2 fields) like i showcased on expected output. Not useful fields are not static(they can be none, or has less|more digits and chars) so they trigger the pattern not matched error.
So the question is:
How do i capture useful fields that i mentioned using regex?
How do i capture F1,F2,F3...... fields that has different value
patterns like char string mixed?
PS: I wraped the regex i wrote into html snippet so the <> capturing fields don't get deleted
Regex pattern to use:
(F[\d]+):([\d]+)
This pattern will catch all the 'F' values with whatever digit that comes after - yes even if it's F105 it still works. This whole 'F105' will be stored as the first group in your regex match expression
The right part of the above pattern will catch the value of all the digits following ':' up until any charachter that is not a digit. i.e. ',', 'F', etc.. and will store it as the second group in your regex match
Use
Depending on your coding language you will have to access your regex matches variable with an iterator and extract group 1 and group 2 respectivly
Python example:
import re
log = 'F2:4200000000000000,F3:000000,F4:000000060000,F6:000000000000,F7:000000000,F105:9726450'
pattern = '(F[\d]+):([\d]+)'
matches = re.finditer(pattern,log)
log_dict = {}
for match in matches:
log_dict[match.group(1)] = match.group(2)
print(log_dict)
Output
{'F2': '4200000000000000', 'F3': '000000', 'F4': '000000060000', 'F6': '000000000000', 'F7': '000000000', 'F105': '9726450'}
Assuming the logdate will be static(in pattern wise) You can ignore useless values using ".+" regex and get collect the useful values by their patterns. So the regex will be like this :
(?\d{2}.\d{2}.\d{2}\s\d{2}:\d{2}:\d{2}.\d{6}\b).+(?fr[A-Z]{3,4}|to[A-Z]{3,4}).+(?#[A-Z0-9]{5}).+(?msg\d{4})
And output will be like:
{"logdate" : "21/09/02 16:36:09.927238", "source" : "frSMS",
"UID" : "#HTF4J","statuscode" : "msg0210"}
And I'm working on getting F2,F3,FN keys and values.

How to insert a new line after each occurrence of a particular format in a text field

I have a system that I can output a spreadsheet from. I then take this outputted spreadsheet and import it into MS Access. There, I run some basic update queries before merging the final result into a SharePoint 2013 Linked List.
The spreadsheet I output has an unfortunate Long Text field which has some comments in it, which are vital. On the system that hosts the spreadsheet, these comments are nicely formatted. When the spreadsheet it output though, the field turns into a long, very unpretty string like so:
09:00 on 01/03/2017, Firstname Surname. :- Have responded to request for more information. 15:12 on 15/02/2017, Firstname Surname. :- Need more information to progress request. 17:09 on 09/02/2017, Firstname Surname. :- Have placed request.
What I would like to do is run a query (either in MS Access or MS Excel) which can scan this field, detect occurrences of "##:## on ##/##/####, Firstname Surname. :-" and then automatically insert a line break before them, so this text is more neatly formatted. It would obviously skip the first occurrence of this format, as otherwise it would enter a new line at the start of the field. Ideal end result would be:
09:00 on 01/03/2017, Firstname Surname. :- Have responded to request
for more information.
15:12 on 15/02/2017, Firstname Surname. :- Need more information to progress request.
17:09 on 09/02/2017, Firstname Surname. :- Have placed request.
To be honest, I haven't tried much myself so far, as I really don't know where to start. I don't know if this can be done without regular expressions, or within a simple query versus VBA code.
I did start building a regular expression, like so:
[0-9]{2}:[0-9]{2}\s[o][n]\s[0-9]{2}\/[0-9]{2}\/[0-9]{4}\,\s
But this looks a little ridiculous and I'm fairly certain I'm going about it in a very unnecessary way. From what I can see from the text, detecting the next occurrence of "##:## on ##/##/####" should be enough. If I take a new line after this, that will suffice.
You have your RegExp pattern, now you need to create a function to append found items with your extra delimiter.
look at this function. It takes, your long string and finds your date-stamp using your pattern and appends with your delimiter.
Ideally, i would run each line twice and add delimiters after each column so you have a string like,
datestamp;firstname lastname;comment
you can then use arr = vba.split(text, ";") to get your data into an array and use it as
date-stamp = arr(0)
name = arr(1)
comment = arr(2)
Public Function FN_REGEX_REPLACE(iText As String, iPattern As String, iDelimiter As String) As String
Dim objRegex As Object
Dim allmatches As Variant
Dim I As Long
On Error GoTo FN_REGEX_REPLACE_Error
Set objRegex = CreateObject("vbscript.regexp")
With objRegex
.Multiline = True
.Global = True
.IgnoreCase = True
.Pattern = iPattern
If .test(iText) Then
Set allmatches = .Execute(iText)
If allmatches.count > 0 Then
For I = 1 To allmatches.count - 1 ' for i = 0 to count will start from first match
iText = VBA.Replace(iText, allmatches.item(I), iDelimiter & allmatches.item(I))
Next I
End If
End If
End With
FN_REGEX_REPLACE = Trim(iText)
Set objRegex = Nothing
On Error GoTo 0
Exit Function
FN_REGEX_REPLACE_Error:
MsgBox Err.description
End Function
use above function as
mPattern = "[0-9]{2}:[0-9]{2}\s[o][n]\s[0-9]{2}\/[0-9]{2}\/[0-9]{4}\,"
replacedText = FN_REGEX_REPLACE(originalText,mPattern,vbnewline)
Excel uses LF for linebreaks, Access uses CRLF.
So it should suffice to run a simple replacement query:
UPDATE myTable
SET LongTextField = Replace([LongTextField], Chr(10), Chr(13) & Chr(10))
WHERE <...>
You need to make sure that this runs only once on newly imported records, not repeatedly on all records.

Equivalent regex in T-SQL with start/end input markers

I'm trying to get the following RegEx to work:
^[a-zA-Z][a-zA-Z ''.-]+[a-zA-Z]$
It should allow any alphas, space, apostrophe, full stop and hyphen as long as the beginning and last chars as alphas.
John - ok
John Smith - ok
John-Smith - ok
John.Smith - ok
.John Smith - not ok
John Smith. - not ok
When I use this in T-SQL it doesn't seem to work and I'm not sure if it the input start/end markers that are not compatible in T-SQL. How do I translate this to valid T-SQL?:
CREATE Function [dbo].[IsValidName](#value VarChar(MAX))
RETURNS INT
AS
Begin
DECLARE #temp INT
SET #temp = (
SELECT
CASE WHEN #value LIKE '%^[a-zA-Z][a-zA-Z ''.-]+[a-zA-Z]$%' THEN 1
ELSE 0
END
)
RETURN #Temp
End
T-SQL doesn't support regular expressions "out of the box". Depending on what environment you are using, there are different solutions, but none will probably be "pure T-SQL". In a Microsoft environment you can use CLR procedures to achieve this.
See SQL Server Regular expressions in T-SQL for some options.
I made SOMETHING like this to scrub data, to remove non-alpha characters, I've slightly modified it to fit your needs
CREATE Function [dbo].Func (#Temp VarChar(1000))
Returns VarChar(1000)
AS
BEGIN
DECLARE #Len INT = LEN(#Temp)
DECLARE #RETURN INT
Declare #KeepValues as varchar(50)
Set #KeepValues = '%[^a-z^ ]%'
IF PatIndex(#KeepValues, #Temp) = 1
BEGIN
Set #RETURN = 0
END
IF PATINDEX(#KeepValues, #Temp) = #Len
BEGIN
SET #RETURN = 0
END
IF PATINDEX(#KeepValues, #Temp) = 0
SET #RETURN = 1
IF #RETURN IS NULL
BEGIN
SET #Return = 1
END
RETURN #RETURN
END
This is assuming you would not need to do any sort of data scrubbing for restricted characters. If you need to scrub for restricted characters let me know we can add a little more in there but based on your dataset this will return the correct answers

First instance after specific string

So, I have this RegEx that captures a specific string I need (thanks to Shawn Mehan):
>?url:\'\/watch\/(video[\w-\/]*)
It works great, but now I need to mod my criteria. I need to capture ONLY the first URL after EACH instance of: videos:[{title:. Bolded all instances below and also bolded the first URL I'd want captured as an example.
How might I approach this? I have a VBScript that will dump each URL to a text file, so I just need help selecting the correct URLs from the blob below. Thinking something like, "if this string is found, do this, loop". Setting the regex global to false should only grab the first instance each round, right? A basic example would help.
I believe I have all of the pieces I need, but I'm not quite sure how to put them together. I'm expecting the code below to loop through and find the index of each instance of "videos:[{title:", then the regex to grab the first URL after (regexp global set to false) based on the pattern, then write the found URL to my text file, loop until all are found. Not working...
(larger portion of html_dump: http://pastebin.com/6i5gmeTB)
Set objWshShell = Wscript.CreateObject("Wscript.Shell")
Set fso = CreateObject("Scripting.FileSystemObject")
Set objRegExp = new RegExp
objRegExp.Global = False
objRegExp.Pattern = ">?url:\'\/watch\/(video[\w-\/]*)"
filename = fso.GetParentFolderName(WScript.ScriptFullName) & "\html_dump.txt" 'Text file contains html
set urldump = fso.opentextfile(filename,1,true)
do until urldump.AtEndOfStream
strLine = urldump.ReadLine()
strSearch = InStrRev(strLine, "videos:[{title:") 'Attempting to find the position of "videos:[{title:" to grab the first URL after.
If strSearch >0 then
Set myMatches = objRegExp.Execute(strLine) 'This matches the URL pattern.
For Each myMatch in myMatches
strCleanURL = myMatch.value
next
'===Writes clean urls to txt file...or, it would it if worked===
filename1 = fso.GetParentFolderName(WScript.ScriptFullName) & "\URLsClean.txt" 'Creates and writes to this file
set WriteURL = fso.opentextfile(filename1,2,true)
WriteURL.WriteLine strCleanURL
WriteURL.Close
else
End if
loop
urldump.close
var streams = [ {streamID:138, cards:[{cardId: 59643,cardTypeId: 48,clickCount: 84221,occurredOn: '2015-08-17T15:30:17.000-07:00',expiredOn: '',header: 'Latest News Headlines', subHeader: 'Here are some of the latest headlines from around the world.', link: '/watch/playlist/544/Latest-News-Headlines', earn: 3, playlistRevisionID: 3427, image: 'http%3A%2F%2Fpthumbnails.5min.com%2F10380591%2F519029502_3_o.jpg', imageParamPrefix: '?', size: 13, durationMin: 15, durationTime: '14:34',pos:0,trkId:'2gs55j6u0nz8', true,videos:[{title:'World\'s First Sky Pool Soon To Appear In South London',thumbnail:'http%3A%2F%2Fpthumbnails.5min.com%2F10380509%2F519025436_c_140_105.jpg',durationTime:'0:39',url:'/watch/video/716424/worlds-first-sky-pool-soon-to-appear-in-south-london',rating:'4.2857'},{title:'Treasure Hunters Find $4.5 Million in Spanish Coins',thumbnail:'http%3A%2F%2Fpthumbnails.5min.com%2F10380462%2F519023092_3.jpg',durationTime:'0:54',url:'/watch/video/715927/treasure-hunters-find-4-5-million-in-spanish-coins',rating:'4.25'},{title:'Former President Jimmy Carter Says Cancer Has Spread to Brain',thumbnail:'http%3A%2F%2Fpthumbnails.5min.com%2F10380499%2F519024920_c_140_105.jpg',durationTime:'1:59',url:'/watch/video/716363/former-president-jimmy-carter-says-cancer-has-spread-to-brain',rating:'2.8889'},{title:'Josh Duggar Had Multiple Accounts on AshleyMadison.Com',thumbnail:'http%3A%2F%2Fpthumbnails.5min.com%2F10380505%2F519025222_c_140_105.jpg',durationTime:'1:30',
Assuming that your input comes from a file and is in correct JSON format you could do something like this in PowerShell:
$jsonfile = 'C:\path\to\input.txt'
$json = Get-Content $jsonfile -Raw | ConvertFrom-Json
$json.streams.cards | ForEach-Object { $_.videos[0].url }
The above is assuming that streams is the topmost key in your JSON data.
Note that the code requires at least PowerShell v3.

How to find all the source lines containing desired table names from user_source by using 'regexp'

For example we have a large database contains lots of oracle packages, and now we want to see where a specific table resists in the source code. The source code is stored in user_source table and our desired table is called 'company'.
Normally, I would like to use:
select * from user_source
where upper(text) like '%COMPANY%'
This will return all words containing 'company', like
121 company cmy
14 company_id, idx_name %% end of coding
453 ;companyname
1253 from db.company.company_id where
989 using company, idx, db_name,
So how to make this result more intelligent using regular expression to parse all the source lines matching a meaningful table name (means a table to the compiler)?
So normally we allow the matched word contains chars like . ; , '' "" but not _
Can anyone make this work?
To find company as a "whole word" with a regular expression:
SELECT * FROM user_source
WHERE REGEXP_LIKE(text, '(^|\s)company(\s|$)', 'i');
The third argument of i makes the REGEXP_LIKE search case-insensitive.
As far as ignoring the characters . ; , '' "", you can use REGEXP_REPLACE to suck them out of the string before doing the comparison:
SELECT * FROM user_source
WHERE REGEXP_LIKE(REGEXP_REPLACE(text, '[.;,''"]'), '(^|\s)company(\s|$)', 'i');
Addendum: The following query will also help locate table references. It won't give the source line, but it's a start:
SELECT *
FROM user_dependencies
WHERE referenced_name = 'COMPANY'
AND referenced_type = 'TABLE';
If you want to identify the objects that refer to your table, you can get that information from the data dictionary:
select *
from all_dependencies
where referenced_owner = 'DB'
and referenced_name = 'COMPANY'
and referenced_type = 'TABLE';
You can't get the individual line numbers from that, but you can then either look at user_source or use a regexp on the specific source code, which woudl at least reduce false positives.
SELECT * FROM user_source
WHERE REGEXP_LIKE(text,'([^_a-z0-9])company([^_a-z0-9])','i')
Thanks #Ed Gibbs, with a little trick this modified answer could be more intelligent.