I have a table where the edit history for each file is stored, one row per file. Most rows are pipe delimited, so I transform the table into a 'one row per edit' shape by parsing out the fields with this sort of thing:
LATERAL FLATTEN (INPUT => SPLIT(x.User,'|')) a
However, annoyingly, one of the fields doesn't use pipes; instead it has a timestamp between edits (that's so people reading the file can see the edit history). In our SAS world I have a job (below) that parses it out with a regex, looping around to do the parse/transpose. Is such a thing doable in Snowflake?
data notesdata_parsed;
    /* pattern for the timestamp delimiter, e.g. " 5/17/2021 10:31:22 AM -" */
    rx_date = prxparse('/[ ]\d+[\/]\d+[\/](2020|2021)[ ]\d+[:]\d+[:]\d+( AM -| PM -)/');
    set notesdata;
    where textfield ne '';
    do while(1);
        rx_pos = prxmatch(rx_date, textfield);
        if rx_pos = 0 then do;
            /* no more timestamps: output the remaining text and stop */
            textfield_new = textfield;
            output;
            leave;
        end;
        /* output the text before this timestamp, then rescan the rest */
        textfield_new = substr(textfield, 1, rx_pos-1);
        textfield = substr(textfield, rx_pos+1);
        output;
    end;
    drop rx_date textfield rx_pos;
run;
I'm not sure of the exact regex you'd need in Snowflake, but you could leverage the REGEXP_REPLACE function to turn each timestamp into a pipe and then do your existing LATERAL FLATTEN kind of thing.
Something along the lines of:
LATERAL FLATTEN (INPUT => SPLIT(REGEXP_REPLACE(x.User,'{regex expression}','|'),'|')) a
The regex syntax in Snowflake might be a little different, so I just used a placeholder there; I'm not a regex expert.
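If it helps, here is an untested sketch that translates the timestamp pattern from the SAS job above into Snowflake's syntax (x.User and the alias a are from your example; backslashes are doubled because the pattern sits inside a string literal):
LATERAL FLATTEN (INPUT => SPLIT(
    REGEXP_REPLACE(x.User,
        ' \\d+/\\d+/(2020|2021) \\d+:\\d+:\\d+ (AM|PM) -',
        '|'),
    '|')) a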
I have a large dataset where one column contains free text. I wish to create a new column based on whether this free text contains a match for a regular expression.
E.g. I want to know whether the column contains the text GnRH, in any letter case, and create a new column with a flag indicating whether it does.
FIND or INDEX work as well, and are slightly easier to understand:
DUMMY = find(text, "gnrh", 'it') > 0;
Try this
data have;
input text $20.;
datalines;
Not in this line
In GnRH this line
Not here either
This one GNRH too
;
data want;
set have;
dummy = prxmatch('/gnrh/i', text) > 0;
run;
Below is how the data looks:
Flight Number: SSSVAD123X Date: 2/8/2020 1:04:40 PM Page[s] Printed: 1 Document Name: DownloadAttachment Print Driver: printermodel (printer driver)
I need help creating an Athena SQL CREATE TABLE statement that yields the below format:
Flight Number Date Pages Printed Document Name Print Driver
SSSVAD123X 2/8/2020 1:04:40 PM 1 DownloadAttachment printermodel
This is new to me; any direction toward a solution will help.
You may be able to use a regex serde to parse your files. It depends on the shape of your data; you only provided a single line, so this assumes that every line in your data files looks the same.
Here's the Athena documentation for the feature: https://docs.aws.amazon.com/athena/latest/ug/apache.html
You should be able to do something like the following:
CREATE EXTERNAL TABLE flights (
flight_number STRING,
`date` STRING,
pages_printed INT,
document_name STRING,
print_driver STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "^Flight Number:\\s+(\\S+)\\s+Date:\\s+(\\S+)\\s+Page\\[s\\] Printed:\\s+(\\S+)\\s+Document Name:\\s+(\\S+)\\s+Print Driver:\\s+(\\S+)\\s+\\(printer driver\\)$"
) LOCATION 's3://example-bucket/some/prefix/'
Each capture group in the regex will map to a column, in order; the Date group uses three \\S+ tokens because the timestamp value contains spaces.
Since I don't have access to your data I can't test the regex, unfortunately, so there may be errors in it. Hopefully this example is enough to get you started.
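Once the table exists, a quick query should show one value per column. Note that Athena DML uses double quotes, not backticks, for reserved words like date:
SELECT flight_number, "date", pages_printed, document_name, print_driver
FROM flights
LIMIT 10;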
First, make sure your data uses tab characters between columns, because your sample doesn't seem to have a consistent separator.
Flight Number Date Pages Printed Document Name Print Driver
SSSVAD123X 2/8/2020 1:04:40 PM 1 DownloadAttachment printermodel
As per the AWS documentation, use LazySimpleSerDe for CSV, TSV, and custom-delimited files when your data does not include values enclosed in quotes; you don't need to complicate things with a regex.
Reference: https://docs.aws.amazon.com/athena/latest/ug/supported-serdes.html
Since LazySimpleSerDe is the default SerDe in Athena, you don't even need to declare it. See the CREATE TABLE statement for your data sample:
CREATE EXTERNAL TABLE IF NOT EXISTS `mydb`.`mytable` (
`Flight Number` STRING,
`Date` STRING,
`Pages Printed` INT,
`Document Name` STRING,
`Print Driver` STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
ESCAPED BY '\\'
LINES TERMINATED BY '\n'
LOCATION
's3://awsexamplebucket1-logs/AWSLogs/'
You can use an online generator to help you in the future: https://www.hivetablegenerator.com/
From the generator page: "Easily convert any JSON (even complex Nested ones), CSV, TSV, or Log sample file to an Apache HiveQL DDL create table statement."
I can get my code to work, but only if I split it into two data steps; when I combine them, I get an empty set. Ultimately I want one month of data, but only the records that contain one of six keywords.
Example of code that does not work:
data d_prep;
set DataTable;
where
(CREATIONTIME ge "01DEC2017"d) and (CREATIONTIME le "31DEC2017"d)
and
(TEXT contains 'Initialize VNC'
or TEXT contains 'FT-Download'
or TEXT contains 'FT-Upload'
or TEXT contains 'Remote Session'
or TEXT contains 'Login of user'
or TEXT contains 'Create AdHoc-Action');
run;
This gives me zero observations. However, if I split it into two steps, it works:
data d_prep;
set DataTable;
where
(CREATIONTIME ge "01DEC2017"d) and (CREATIONTIME le "31DEC2017"d);
run;
data d_prep_1;
set d_prep;
where
TEXT contains 'Initialize VNC'
or TEXT contains 'FT-Download'
or TEXT contains 'FT-Upload'
or TEXT contains 'Remote Session'
or TEXT contains 'Login of user'
or TEXT contains 'Create AdHoc-Action';
run;
Good day. Since you are using a WHERE clause, there is a handy feature: the ? operator, which is essentially CONTAINS:
data wanted;
set begin;
where var1 <10000 and var2=2016 and (TEXT ? 'mobile' or TEXT ? 'scout');
run;
I tested this with data and it seems to work.
Check the SAS documentation for this trick.
I want to delete some tables and wrote this procedure:
set serveroutput on
declare
    type namearray is table of varchar2(50);
    total integer;
    name  namearray;
begin
    -- select statement here ..., please see below
    total := name.count;
    dbms_output.put_line(total);
    for i in 1 .. total loop
        dbms_output.put_line(name(i));
        -- execute immediate 'drop table ' || name(i) || ' purge';
    end loop;
end;
/
The idea is to drop all tables whose names match this pattern:
ERROR_REPORT[2 digit][3 Capital characters][10 digits]
example: ERROR_REPORT16MAY2014122748
However, I am not able to come up with the correct regexp. Below are my select statements and results:
select table_name bulk collect into name from user_tables where regexp_like(table_name, '^ERROR_REPORT[0-9{2}A-Z{3}0-9{10}]');
The results included all the table names I needed plus ERROR_REPORT311AUG20111111111. This should not be showing up in the result.
The following select statement showed the same result, which meant the A-Z{3} had no effect on the regexp.
select table_name bulk collect into name from user_tables where regexp_like(table_name, '^ERROR_REPORT[0-9{2}0-9{10}]');
My question is what would be the correct regexp, and what's wrong with mine?
Thanks,
Alex
The correct regex is
'^ERROR_REPORT[0-9]{2}[A-Z]{3}[0-9]{10}'
In your version the outer square brackets define a single character class, so {2}, {3}, and the rest are just literal characters inside that class: the pattern matches ERROR_REPORT followed by any one character from the set. That is why ERROR_REPORT311AUG20111111111 slipped through, and why removing A-Z{3} made no difference.
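Plugged back into your select (with a trailing $ added so the name must end right after the ten digits):
select table_name bulk collect into name
from user_tables
where regexp_like(table_name, '^ERROR_REPORT[0-9]{2}[A-Z]{3}[0-9]{10}$');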
This regex should work:
^ERROR_REPORT[0-9]{2}[A-Z]{3}[0-9]{10}
Double-check the counts against your real table names, though: the example ERROR_REPORT16MAY2014122748 has two digits, then three capital letters, then ten digits, which is what the pattern assumes.
I have two tables with exactly the same column headers and one row each. I have code to concatenate them, which works fine.
data concatenation;
set CURR_CURR CURR_30;
run;
However, there is no index in the output to say which row corresponds to which table.
I've tried 'create index' and 'index create' already, but neither is valid syntax. Simply put, I'd just like to add a column of strings and move it in front of all the other columns in the data set.
Use the INDSNAME option on the SET statement plus a variable to store the information.
If you place the LENGTH statement ahead of your SET statement, the new variable is created as the first column.
Just a note that this isn't the same thing as an 'index': an index in SAS means something different, and it isn't what you're trying to create here.
data concatenation;
length dset source $50.;
set CURR_CURR CURR_30 indsname=source;
/* the INDSNAME variable is temporary, so copy it to keep it in the output */
dset=source;
run;
Reeza's answer is very similar to something I figured out that worked as well. Here's my version as an alternative.
data concatenation;
length id $ 10;
set CURR_CURR (in=a) CURR_30 (in=b);
if a then id = 'curr_curr';
else if b then id = 'curr_30';
run;