sas, regex, numbers, substring, prxchange - regex

I need help with the below code. I do not see how this is extracting the number from this address line text. When it (the pattern) says s/\D/ / I thought this replaces the digits with a space. I know the second part here is taking the substring up to the first space in the address line text. But, then I do not see how this is extracting the numbers. I pulled up the data set and it looks like this does work. Please help me understand how this is working.
DATA OUT.REQ_1_2_03;
SET OUT.REQ_1_2_02;
/* GET STREET NUMBER*/
PRE_RCV_ST_NB=PRXCHANGE('s/\D/ /',-1,SUBSTR(PRE_RCV_ADDRESSS_LINE_1,1,PRXMATCH('/\s/',PRE_RCV_ADDRESSS_LINE_1)));
POST_RCV_ST_NB=PRXCHANGE('s/\D/ /',-1,SUBSTR(POST_RCV_ADDRESSS_LINE_1,1,PRXMATCH('/\s/',POST_RCV_ADDRESSS_LINE_1)));
PRE_HOST_ST_NB=PRXCHANGE('s/\D/ /',-1,SUBSTR(PRE_HOST_ADDR_LINE_1,1,PRXMATCH('/\s/',PRE_HOST_ADDR_LINE_1)));
POST_HOST_ST_NB=PRXCHANGE('s/\D/ /',-1,SUBSTR(POST_HOST_ADDR_LINE_1,1,PRXMATCH('/\s/',POST_HOST_ADDR_LINE_1)));
RUN;

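Try working through an example:
PRE_RCV_ADDRESSS_LINE_1 = "123hello Village st"
Read the expression from the inside out.
First, PRXMATCH('/\s/', ...) returns the position of the first whitespace character (\s), which is the space right after 123hello (position 9).
SUBSTR then takes characters 1 through 9, giving "123hello " (the space itself is included).
Finally, PRXCHANGE with -1 replaces every match of \D. Note that \D matches anything that is NOT a digit (\d is the one that matches digits), so the digits are left alone and every other character is replaced.
With 's/\D/ /' each non-digit becomes a space, giving 123 followed by blanks; SAS pads character values with trailing blanks anyway, so the value looks like 123. With 's/\D//' the non-digits are deleted outright.
To sum it up by example:
"123hello Village st" -- PRXMATCH finds the first space (\s), and SUBSTR up to that space gives "123hello "
"123hello " -- PRXCHANGE replaces everything other than a digit (\D), leaving "123"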
/* run this step to understand it better*/
data want ;
PRE_RCV_ADDRESSS_LINE_1 = "123hello Village st";
test1= SUBSTR(PRE_RCV_ADDRESSS_LINE_1,1,PRXMATCH('/\s/',PRE_RCV_ADDRESSS_LINE_1));
PRE_RCV_ST_NB= PRXCHANGE('s/\D//',-1,SUBSTR(PRE_RCV_ADDRESSS_LINE_1,1,PRXMATCH('/\s/',PRE_RCV_ADDRESSS_LINE_1)));
run;
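For readers who do not have SAS at hand, the same two steps can be sketched in Python. This is only an illustration of the logic, not a replacement for the SAS code: the function name street_number and its replacement parameter are made up for this example, and falling back to the whole string when there is no whitespace is an assumption.

```python
import re

def street_number(address: str, replacement: str = "") -> str:
    """Cut the string at the first whitespace, then replace every non-digit."""
    m = re.search(r"\s", address)  # plays the role of PRXMATCH('/\s/', ...)
    # Like SUBSTR(..., 1, position): the whitespace character itself is kept.
    # Assumption: if there is no whitespace at all, use the whole string.
    head = address[: m.end()] if m else address
    # Plays the role of PRXCHANGE('s/\D/.../', -1, ...): replace ALL non-digits.
    return re.sub(r"\D", replacement, head)

print(repr(street_number("123hello Village st")))       # '123'
print(repr(street_number("123hello Village st", " ")))  # '123      '
```

The second call mirrors the 's/\D/ /' pattern from the question: the digits survive and every other character becomes a blank.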

Retrieving the 12th through 14th characters from a long string using ONLY regex - Grafana variable

I have a small issue, I am trying to get specific characters from a long string using regex but I am having trouble.
Workflow
Prometheus --> Grafana --> Variable (using regex)
I can't use anything other than Regex expressions to achieve this result
I am currently using this expression to grab the long string from some json output:
.*channel_id="(.*?)".*
FROM THIS
{account_id="XXXXXXX-xxxx-xxxx-xxxx-xxxxxxxxxx",account_name="testalpha",channel_id="s0022110430col0901241usa",channel_abbr="s0022109430col}
This returns a string that's ALWAYS 24 characters long:
s0022110430col0901241usa
PROBLEM:
I need to grab the two 3-letter codes 'col' and 'usa', as they are the two teams that are playing. Ideally I would be able to pipe the result of the first regex into a second step to get these values. Position is key: the first value will ALWAYS be the 12th-14th characters and the second value is the last 3 characters. If possible, I would like to output these values in uppercase with the string "vs" in between, to create a string such as:
COL vs USA
or
ARG vs BRA
I am open to any and every suggestion anyone may have
Thank you!
PS - The uppercase thing is 'nice to have' BUT not needed
I'm still learning RegEx, so this is all I could come up with:
For the col (first team):
(?<=(channel_id=".{11}))\w{3}
For the usa (second team):
(?<=(channel_id=".{21}))\w{3}
Can you define the channel_id?
It begins with 's' and then there are many numbers. If they are always numbers, you can use this regex:
channel_id=".[0-9]+([a-z]+)[0-9]+([a-z]+)
You will get 2 groups, one with "col" and the other with "usa".
Edit:
Or, if you know that the string always has the same length, you can use something like:
channel_id=".{11}([a-z]+).{7}([a-z]+)
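To check the fixed-position pattern outside Grafana, here is a small Python sketch. It assumes the channel_id is always 24 characters with the team codes at positions 12-14 and 22-24, as described in the question; the names CHANNEL_RE and matchup are made up for this example, and the uppercase formatting is done in Python, not by the regex.

```python
import re

# 11 leading characters, 3 letters, 7 more characters, 3 letters = 24 in total
CHANNEL_RE = re.compile(r'channel_id=".{11}([a-z]{3}).{7}([a-z]{3})"')

def matchup(label_text: str):
    """Return e.g. 'COL vs USA', or None if the pattern does not match."""
    m = CHANNEL_RE.search(label_text)
    if m is None:
        return None
    return f"{m.group(1).upper()} vs {m.group(2).upper()}"

sample = 'account_name="testalpha",channel_id="s0022110430col0901241usa"'
print(matchup(sample))  # COL vs USA
```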

Removing unmatched text and building a table with the remaining matches

I have 30000 lines that look like the one below.
342800005013000 CON N GORE PT LOT 31 RP 11R2284 PT PART 1 RP 11R4541 PT PART 2
I would like to capture the 15 digit number at the beginning and any "11R***" numbers.
In Notepad++ I've used \d{15}|(11R\d*)* to match everything that I want. Ultimately I would like to get all the matched results into excel. What would be the best way to do so?
Thanks for your help.
You could try this one
(^[0-9]*)|(11R[0-9A-Za-z]*)
Edit: check it now; the code formatting displays the regex correctly.
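If a script is an option, one way to get the matches into Excel is to write them out as CSV. The sketch below is only one possible approach, not something taken from the answer: TOKEN_RE is a slightly stricter variant of the patterns above, and the names extract_row and to_csv are made up for this example.

```python
import csv
import io
import re

# 15 digits at the start of the line, or any token beginning with 11R
TOKEN_RE = re.compile(r"^\d{15}|\b11R\w*")

def extract_row(line: str) -> list:
    """Return the leading 15-digit number followed by every 11R... token."""
    return TOKEN_RE.findall(line)

def to_csv(lines) -> str:
    """Build CSV text with one output row per input line."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    for line in lines:
        writer.writerow(extract_row(line))
    return buf.getvalue()

line = "342800005013000 CON N GORE PT LOT 31 RP 11R2284 PT PART 1 RP 11R4541 PT PART 2"
print(to_csv([line]), end="")  # 342800005013000,11R2284,11R4541
```

Writing the returned text to a .csv file gives something Excel can open, with one column per match.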

Extract numbers from line not starting with comment symbol using regex

I'm trying to replace all numbers that are not in the comment section. Here is a sample of the file to fix:
/* 2018-01-01 06:00:55 : realtime(0.002) --status(10)-- ++numretLines(0)++ --IP(192.168.1.5) PORT(22)-- queryNo(2) comment[TO: Too much time] TYPE[QUERY 4.2] */
select count(*) from table where id1 = 41111 and id2 = 221144
GO
Basically, I would like to replace numbers in strings not beginning with "/*".
I came up with the following regex: /^(?!\/\*)(?:.+\K(\d+?))/gmU
But I only manage to extract the first number of each line not starting with "/*". How could I extend this to get all the numbers of those rows?
Thanks!
Assuming your regex engine (which you haven't specified) supports lookbehind, including variable-length lookbehind, and lookahead, you can use this regex:
(?<!^\/\*.*)(?:(?<=\s)\d+(?=\s))+
The regex starts with a negative lookbehind that rejects any position preceded by a start of line, a slash and a star, followed by anything.
Then it uses a positive lookbehind for a whitespace character, matches one or more digits, and uses a positive lookahead for a whitespace character. This group is repeated any number of times.
You need to set the global and multiline flags.
The regex skips numbers not surrounded by whitespace (for instance the 1 in 'id1').
Based on Wiktor Stribiżew comment, I used \/\*.*?\*\/(*SKIP)(*F)|-?\b\d+(\.\d+)? to extract the numbers, including decimals and negative values.
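The (*SKIP)(*F) verbs are PCRE-specific; Python's re module, for example, does not support them. A common substitute, sketched here as an assumption rather than as the method used above, is to match the comment in one alternative and capture the number in the other, then keep only the captures. The names NUMBER_RE and extract_numbers are made up for this example.

```python
import re

# Alternative 1 consumes a whole /* ... */ comment without capturing anything.
# Alternative 2 captures a number (optional sign, optional decimals).
NUMBER_RE = re.compile(r"/\*.*?\*/|(-?\b\d+(?:\.\d+)?)", re.DOTALL)

def extract_numbers(text: str) -> list:
    """Return the numbers found outside /* ... */ comments."""
    # findall returns the capture group; it is empty whenever a comment matched.
    return [num for num in NUMBER_RE.findall(text) if num]

sample = (
    "/* 2018-01-01 06:00:55 : realtime(0.002) --status(10)-- queryNo(2) */\n"
    "select count(*) from table where id1 = 41111 and id2 = 221144\n"
    "GO\n"
)
print(extract_numbers(sample))  # ['41111', '221144']
```

The digits in id1 and id2 are skipped because \b requires a word boundary before the number.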

Using Regex to clean a csv file in R

This is my first post so I hope it is clear enough.
I am having a problem regarding cleaning my CSV files before I can read them into R and have spent the entire day trying to find a solution.
My data is supposed to be in the form of two columns. The first column is a timestamp consisting of 10 digits, and the second is an ID consisting of 11 or 12 letters and numbers (the first 6 are always numbers).
For example:
logger10 |
0821164100 | 010300033ADD
0821164523 | 010300033ADD
0821164531 | 010700EDDA0F0831102744
010700EDDA0F|
would become:
0821164100 | 010300033ADD
0821164523 | 010300033ADD
0821164531 | 010700EDDA0F
0831102744 | 010700EDDA0F
(please excuse the lines in the middle, that was my attempt at separating the columns...).
The csv file seems to occasionally be missing a comma which means that sometimes one row will end up like this:
0923120531,010300033ADD0925075301,010700EDD00A
My hardware also adds the word logger10 (or whichever number logger this is) whenever it restarts which gives a similar problem e.g. logger10logger100831102744.
I think I have managed to solve the logger text problem (see code) but I am sure this could be improved. Also, I really don't want to delete any of the data.
My real trouble is making sure there is a line break in the right place after the ID and, if not, I would like to add one. I thought I could use regex for this but I'm having difficulty understanding it.
Any help would be greatly appreciated!
Here is my attempt:
temp <- list.files(pattern="*.CSV") #list of each csv/logger file
for(i in temp){
#clean each csv
tmp<-readLines(i) #check each line in file
tmp<-gsub("logger([0-9]{2})","",tmp) #remove logger text
pattern <- ("[0-9]{10}\\,[0-9]{6}[A-Z,0-9]{5,6}") #regex pattern ??
if (tmp!= pattern){
#I have no idea where to start here...
}
}
here is some raw data:
logger01
0729131218,020700EE1961
0729131226,020700EE1961
0831103159,0203000316DB
0831103207,0203000316DB0831103253,010700EDE28C
0831103301,010700EDE28C
0831103522,010300029815
0831103636,010300029815
0831103657,020300029815
If you want to do this in a single pass:
(?:logger\d\d )?([\dA-F]{10}),?([\dA-F]{12}) ?
can be replaced with
\1\t\2\n
What this does is optionally look for any of those rogue logger01 entries (including the space after it). The trailing ? after the group means it can match 0 or 1 times: if the text is there, it is matched; if not, the match just keeps going anyway.
Following that, you look for (and capture) exactly 10 hex characters (either digits or A-F). The ,? means that a comma is matched if it exists but is otherwise optional.
Following that, look for (and capture) exactly 12 hex characters. Finally, to get rid of any stray trailing space, the final " ?" (a space character followed by ?) optionally matches it.
Your replacement will replace the first captured group (the 10 hex digits), add in a tab, replace the second captured group (the 12 hex digits), and then a newline.
You can see this in use on regex101 to see the results. You can use code generator on the left side of that page to get some preformatted PHP/Javascript/Python that you can just drop into a script.
If you're doing this from the command line, perl could be used:
perl -pe 's/(?:logger\d\d )?([\dA-F]{10}),?([\dA-F]{12}) ?/$1\t$2\n/g'
If another language, you may need to adapt it slightly to fit your needs.
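As one possible adaptation, here is a Python sketch of the same idea applied to the raw data from the question. It is not the exact substitution above: it first strips the loggerNN markers and then collects (timestamp, ID) pairs, which also splits rows where the line break is missing. It assumes every ID is exactly 12 characters, as in the raw sample; the names PAIR_RE and clean are made up for this example.

```python
import re

# 10 hex characters, an optional comma, then 12 hex characters
PAIR_RE = re.compile(r"([\dA-F]{10}),?([\dA-F]{12})")

def clean(raw: str) -> list:
    """Return (timestamp, id) pairs, even when two records share one line."""
    text = re.sub(r"logger\d\d", "", raw)  # drop the restart markers
    return PAIR_RE.findall(text)

raw = """logger01
0729131218,020700EE1961
0831103207,0203000316DB0831103253,010700EDE28C
0831103301,010700EDE28C
"""
for timestamp, logger_id in clean(raw):
    print(timestamp, logger_id, sep="\t")
```

The second input line yields two separate rows, so no data is deleted.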
EDIT
Re-reading the OP and comments, a slightly more rigid regex could be
(?:logger\d\d\ )?([\dA-F]{10}),?(\d{6}[\dA-F]{5,6})\ ?
I updated the regex101 link with the changes.
This still looks for the first 10 hex values, but now looks for exactly 6 digits, followed by 5-6 hex values, so the total number of characters matched is 11 or 12.
The replacement would be the same.
Paste your regex into https://regex101.com/ to check whether it catches all cases. The 5 or 6 letters or digits could pose an issue, as the pattern may capture the first digit of the next timestamp when the logger leaves out a comma. Appending a '\n' to the end of the tmp string should work, provided the regex catches all cases.

extract number from string in Oracle

I am trying to extract a specific text from an Outlook subject line. This is required to calculate turn around time for each order entered in SAP. I have a subject line as below
SO# 3032641559 FW: Attached new PO 4500958640- 13563 TYCO LJ
My final output should be like this: 3032641559
I have been able to do this in MS excel with the formulas like this
=IFERROR(INT(MID([#[Normalized_Subject]],SEARCH(30,[#[Normalized_Subject]]),10)),"Not Found")
In the above formula, [#[Normalized_Subject]] is the name of the column in which the SO number exists. I have been asked to do this in Oracle, but I am very new to it. Your help on this would be greatly appreciated.
Note: in the above subject line the number 30 is common in every subject line.
The last parameter of REGEXP_SUBSTR() indicates the sub-expression you want to pick. In this case you can't just match 30 then some more numbers as the second set of digits might have a 30. So, it's safer to match the following, where x are more digits.
SO# 30xxxxxx
As a regular expression this becomes:
SO#\s30\d+
where \s indicates a space, \d indicates a numeric character, and the + means you want to match as many as there are. But we can use the sub-expression substringing available; to do that you need sub-expressions, i.e. create groups where you want to split the string:
(SO#\s)(30\d+)
Put this in the function call and you have it:
regexp_substr(str, '(SO#\s)(30\d+)', 1, 1, 'i', 2)
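To try the grouping idea outside the database, here is a minimal Python sketch of the same pattern. Picking group 2 plays the role of the last REGEXP_SUBSTR argument, and re.IGNORECASE stands in for the 'i' flag. The function name sales_order is made up for this example.

```python
import re

def sales_order(subject: str):
    """Return the digits that follow 'SO# ' when they start with 30, else None."""
    m = re.search(r"(SO#\s)(30\d+)", subject, flags=re.IGNORECASE)
    return m.group(2) if m else None

subject = "SO# 3032641559 FW: Attached new PO 4500958640- 13563 TYCO LJ"
print(sales_order(subject))  # 3032641559
```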