Replacing any single digit in string with leading 0 in SAS - regex

I have a variable with values such as t14-1-1, t14-1-1A, t14-2-1-1, t14-2-4-15A, etc., as listed in the cards statement below.
What I need is to pad every single-digit number in the string with a leading 0, as the SAS format z2. would do.
data test01;
input have $40.;
want02=prxchange('s/(^|-)\d($|-)*/\10\2/',-1,strip(have));
want03=prxchange('s/(^|-)\d($|-)*(.+)/\10\2/',-1,strip(have));
cards;
t14-1-1
t14-1-1A
t14-2-1-1
t14-2-1-1A
t14-2-4-15A
t14-2-4-15B
t14-2-4-16
t14-2-4-17
t14-2-4-17A
t14-2-4-17B
l16-2-9-1-1
l16-2-9-2-1
l16-2-9-2-2
;
run;
What I need is the following:
t14-01-01
t14-01-01A
t14-02-01-01
t14-02-01-01A
t14-02-04-15A
t14-02-04-15B
t14-02-04-16
t14-02-04-17
t14-02-04-17A
t14-02-04-17B
l16-02-09-01-01
l16-02-09-02-01
l16-02-09-02-02
I know I can do this with an array and the scan, length and tranwrd functions. I was just wondering whether it can be done with prxchange (a regular expression) in fewer steps and with less complexity.
I have tried many different permutations and combinations with no luck.
Thanks in advance for the help!

I don't know if the SAS regex flavour supports lookaround, but if it does, this should do the job:
search: (?<=-)(\d)(?!\d)
replace: 0$1
Where:
(?<=-) is a lookbehind that makes sure there is a dash before the digit
(\d) is a single digit captured in group 1
(?!\d) is a negative lookahead that makes sure there is no digit after it
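
To see the lookaround pattern at work outside SAS, here is a minimal Python sketch. It uses the same search pattern; the function name pad_single_digits is only an illustrative helper, and Python writes the replacement as 0\1 rather than 0$1.

```python
import re

# Same pattern as above: one digit preceded by a dash and not followed by another digit.
SINGLE_DIGIT = re.compile(r"(?<=-)(\d)(?!\d)")


def pad_single_digits(value: str) -> str:
    """Prefix each lone digit that follows a dash with a 0 (hypothetical helper)."""
    return SINGLE_DIGIT.sub(r"0\1", value)


for sample in ["t14-1-1", "t14-1-1A", "t14-2-4-15A", "l16-2-9-1-1"]:
    print(sample, "->", pad_single_digits(sample))
```

Two-digit parts such as 15A are left alone because the 1 is followed by a digit and the 5 is not preceded by a dash.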

Related

extract text data using regexp in MATLAB

I'm trying to extract visibility data from METAR (airport weather observation) reports.
Visibility is a 4-digit (0-9) value, and can also be expressed as 'CAVOK' when visibility is good.
But it's quite tricky to extract with regexp, because METAR data has many variations.
Data sample (MET_VIS) below:
201903072300 METAR RKPC 072300Z 17003KT 110V210 CAVOK 05/02 Q1026 NOSIG=
201903062000 METAR RKPC 062000Z 33018G29KT 4000 BR FEW012 SCT025 08/04 Q1018 WS R13 R31 NOSIG=
201903062200 METAR RKPC 062200Z 33015KT 290V350 9999 SCT030 07/03 Q1019 NOSIG=
201903080000 METAR RKPC 080000Z 29002KT CAVOK 08/02 Q1027 NOSIG=
I want to extract CAVOK, 4000, 9999, CAVOK, one value from each line.
I tried the following, but it doesn't work for line 3 :( It returns blank.
regexp(MET_VIS(i),'((?<=KT\s)\d{4})|CAVOK','match')
In the third line the value is not directly preceded by KT. You could add a second positive lookbehind that checks for KT, a space, exactly 7 characters from A-Z0-9 (such as 290V350) and a whitespace character before the value.
Then match either 4 digits or CAVOK using an alternation, (?:\d{4}|CAVOK); alternatively, you could match CAVOK anywhere in the string.
Add a word boundary after it to prevent the match from being part of a larger word.
(?:(?<=KT\s)|(?<=KT [A-Z0-9]{7}\s))(?:\d{4}|CAVOK)\b
Regex demo
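
As a quick check of the two-lookbehind pattern, here is a small Python sketch run against the four sample lines from the question. The helper name visibility is hypothetical; Python's re module accepts this pattern because each lookbehind has a fixed width.

```python
import re

# Either "KT " directly before the value, or "KT ", a 7-character group, then whitespace.
VISIBILITY = re.compile(r"(?:(?<=KT\s)|(?<=KT [A-Z0-9]{7}\s))(?:\d{4}|CAVOK)\b")

reports = [
    "201903072300 METAR RKPC 072300Z 17003KT 110V210 CAVOK 05/02 Q1026 NOSIG=",
    "201903062000 METAR RKPC 062000Z 33018G29KT 4000 BR FEW012 SCT025 08/04 Q1018 WS R13 R31 NOSIG=",
    "201903062200 METAR RKPC 062200Z 33015KT 290V350 9999 SCT030 07/03 Q1019 NOSIG=",
    "201903080000 METAR RKPC 080000Z 29002KT CAVOK 08/02 Q1027 NOSIG=",
]


def visibility(report: str):
    """Return the first visibility token in a report, or None (hypothetical helper)."""
    match = VISIBILITY.search(report)
    return match.group(0) if match else None


print([visibility(r) for r in reports])
```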
You could also make an assumption about how many "words" may follow your target before the end of the line. For example:
/\b(?:\d{4}|CAVOK)\b(?=(?: \S+){3,9}$)/gm
See regex demo.
Here we're looking for a four-digit number or the word CAVOK, but only if it is followed by 3 to 9 space-separated non-space substrings of variable length before the end of the line.
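
The count-from-the-end approach can be tried in Python roughly as follows. This is a sketch: the /gm flags become re.MULTILINE plus findall, and the variable names are illustrative only.

```python
import re

# A 4-digit number or CAVOK, followed by 3 to 9 more space-separated tokens up to the line end.
FROM_END = re.compile(r"\b(?:\d{4}|CAVOK)\b(?=(?: \S+){3,9}$)", re.MULTILINE)

met_vis = "\n".join([
    "201903072300 METAR RKPC 072300Z 17003KT 110V210 CAVOK 05/02 Q1026 NOSIG=",
    "201903062000 METAR RKPC 062000Z 33018G29KT 4000 BR FEW012 SCT025 08/04 Q1018 WS R13 R31 NOSIG=",
    "201903062200 METAR RKPC 062200Z 33015KT 290V350 9999 SCT030 07/03 Q1019 NOSIG=",
    "201903080000 METAR RKPC 080000Z 29002KT CAVOK 08/02 Q1027 NOSIG=",
])

# The pattern has no capturing groups, so findall returns the full matches.
matches = FROM_END.findall(met_vis)
print(matches)
```

The 3 and 9 bounds come from the sample data, so reports with more trailing groups would need a wider range.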

Regex expression for date within dates range

I need to validate with regex a date in format yyyy-mm-dd (2019-12-31) that should be within the range 2019-12-20 - 2020-01-10.
What would be the regex for this?
Thanks
Regexes only deal with characters, so we have to work out which characters are valid at each position in the date.
The first part is easy: the first two characters have to be 20.
Now it gets complicated. The next character can be a 1 or a 2, but what follows depends on its value, so we split the rest of the regex into two alternatives: the first for when the third character is 1 and the second for when it is 2.
If the third character is a 1, then what must follow is 9-12-, as the range starts at 2019-12-20. Now for the day part: the 9th character is the tens digit of the day, which can only be 2 or 3, as we are already in the last month and the minimum date is the 20th. The last character can be any digit 0-9. This gives us a day match of [23][0-9]. Putting this together, the 2019 dates (after the leading 20) match 19-12-[23][0-9].
If the third character is a 2, then we can again match everything up to the day part of the date, as the range ends in January. This gives us a partial match of 20-01-, leaving us to work on the day part. Here we know that the first character of the day can be either a 0 or a 1; if it's a 1 then the last character must be a 0, and if it's a 0 then the last character can only be in the range 1 to 9. This gives us another alternation, (?:0[1-9]|10). Putting the second part together we get 20-01-(?:0[1-9]|10).
Combining these gives the final regex 20(?:19-12-[23][0-9]|20-01-(?:0[1-9]|10))
Note that I'm assuming the date you are testing is already a valid date; on its own, [23][0-9] also accepts impossible days such as 32.
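
A pattern like this can be checked by brute force. The Python sketch below is an assumption-laden variation rather than the answer's exact regex: it tightens the December day part to (?:2[0-9]|3[01]) so that days 32 to 39 are rejected, uses fullmatch to anchor the pattern, and compares the result with a real date comparison for every day around the range.

```python
import re
from datetime import date, timedelta

# Variation on the regex above, with the December day part limited to 20-31.
IN_RANGE = re.compile(r"20(?:19-12-(?:2[0-9]|3[01])|20-01-(?:0[1-9]|10))")

START, END = date(2019, 12, 20), date(2020, 1, 10)


def in_range(text: str) -> bool:
    """True if the whole string is a date inside the range (hypothetical helper)."""
    return IN_RANGE.fullmatch(text) is not None


# Compare the regex with a real date comparison, from 30 days before to 30 days after.
day = START - timedelta(days=30)
while day <= END + timedelta(days=30):
    assert in_range(day.isoformat()) == (START <= day <= END)
    day += timedelta(days=1)

print(in_range("2019-12-20"), in_range("2020-01-11"), in_range("2019-12-32"))
```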
Try this:
(2019|2020)\-(12|01)\-([0-3][0-9]|[0-9])
But be aware that, for the dd value, this allows any number whose first digit is between zero and three and whose second digit is between zero and nine. You could instead list every day you want to allow (20 to 31 and 01 to 10) like this (20|21|22|23|24|25|26|27|28|29|30|31|01|1|02|2|03|3|04|4|05|5|06|6|07|7|08|8|09|9|10).
(2019|2020)\-(12|01)\-(20|21|22|23|24|25|26|27|28|29|30|31|01|1|02|2|03|3|04|4|05|5|06|6|07|7|08|8|09|9|10)
But honestly... regular expressions are not the right tool for this. A regex gives you a mask for the text, not a logical context. Use a regex to extract the values from the string and validate those values in another language.
The second regex above will, for example, match your dates, but also values outside the range, since nothing links the first group 2019|2020 to the second group 12|01; it therefore also matches values like 2019-01-05 or 2020-12-25.
To match only the values you want, you need a really large regex like this (inner brackets only if you need them): ((2019)-(12)-(20)|(2019)-(12)-(21)|(2019)-(12)-(22)|...), continuing with all possible dates. Ask yourself: what would you do if you found such a regex in a project you had to work with ;)
Better solution (quick and dirty, there might be better solutions):
(?<yyyy>20[0-9]{2})\-(?<mm>[01][0-9]|[0-9])\-(?<dd>[0-3][0-9]|[0-9])
This way you have three named groups (yyyy, mm, dd) that you can access and validate. The regex is smaller, the association between code and regex is clearer, and both are easier to maintain.
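
The extract-then-validate idea could look like this in Python. It is a minimal sketch: Python spells named groups (?P<name>...), and the function name is_in_range and the constants START and END are illustrative assumptions, not part of the answer.

```python
import re
from datetime import date

# Named groups only pull the three parts out; the range logic lives in code.
DATE_PARTS = re.compile(
    r"(?P<yyyy>20[0-9]{2})-(?P<mm>[01][0-9]|[0-9])-(?P<dd>[0-3][0-9]|[0-9])"
)

START, END = date(2019, 12, 20), date(2020, 1, 10)


def is_in_range(text: str) -> bool:
    """Extract the parts with the regex, then validate them as a real date."""
    match = DATE_PARTS.fullmatch(text)
    if match is None:
        return False
    try:
        parsed = date(int(match["yyyy"]), int(match["mm"]), int(match["dd"]))
    except ValueError:
        # The mask matched, but the value is not a calendar date (e.g. 2020-02-30).
        return False
    return START <= parsed <= END


print(is_in_range("2019-12-31"), is_in_range("2020-12-11"))
```

Changing the allowed range then means editing two date constants instead of rewriting the regex.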

Regex for validation of a street number

I'm using an online tool to create contests. In order to send prizes, there's a form in there asking for user information (first name, last name, address,... etc).
There's an option to use regular expressions to validate the data entered in this form.
I'm struggling with the regular expression to put for the street number (I'm located in Belgium).
A street number can be the following:
1234
1234a
1234a12
begins with a number (max 4 digits)
can have letters as well (max 2 chars)
can have numbers after the letter(s) (max 3 digits)
I came up with the following expression:
^([0-9]{1,4})([A-Za-z]{1,2})?([0-9]{1,3})?$
But the problem is that, as the letters and the second group of digits are both optional, it allows numbers with up to 7 digits, which is not what I want:
1234 (first group)(no letters in the second group) 567 (third group)
If anyone can give me a tip on how to achieve the expected result, it would be greatly appreciated!
You might use this regex:
^\d{1,4}([a-zA-Z]{1,2}\d{1,3}|[a-zA-Z]{1,2}|)$
where:
\d{1,4} - 1-4 digits
([a-zA-Z]{1,2}\d{1,3}|[a-zA-Z]{1,2}|) - optional group, which can be
[a-zA-Z]{1,2}\d{1,3} - 1-2 letters + 1-3 digits
or
[a-zA-Z]{1,2} - 1-2 letters
or
empty
\d{0,4}[a-zA-Z]{0,2}\d{0,3}
\d{0,4} The first group matches a number with at most 4 digits
[a-zA-Z]{0,2} The second group matches at most 2 letters
\d{0,3} The third group matches a number with at most 3 digits
You have to keep the last two groups together, so that the last one cannot be present if the second isn't, e.g.
^\d{1,4}(?:[a-zA-Z]{1,2}\d{0,3})?$
or, a little less optimized (but showing the approach a bit better):
^\d{1,4}(?:[a-zA-Z]{1,2}(?:\d{1,3})?)?$
As you are using this for validation, I assumed that you don't need the capturing groups and replaced them with non-capturing ones.
You might want to change the first number check to [1-9]\d{0,3} to disallow leading zeros.
Thank you so much for your answers! I tried Sebastian's solution:
^\d{1,4}(?:[a-zA-Z]{1,2}\d{0,3})?$
And it works like a charm! I still don't really understand what the ":" stands for, but I'll try to figure it out next time I have to fiddle with regex!
Have a nice day,
Stan
The first digit cannot be 0.
There shouldn't be other symbols before and after the number.
So:
^[1-9]\d{0,3}(?:[a-zA-Z]{1,2}\d{0,3})?$
The ?: at the start of a group makes it non-capturing: the parentheses still group the pattern, but they do not create a captured substring.
Here is the regex with tests for it.
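
For readers without the linked tests, here is a short Python sketch that exercises the pattern on the cases discussed in this thread. The helper name is_street_number is hypothetical.

```python
import re

# 1-4 digits without a leading zero, then optionally 1-2 letters and up to 3 more digits.
STREET_NUMBER = re.compile(r"^[1-9]\d{0,3}(?:[a-zA-Z]{1,2}\d{0,3})?$")


def is_street_number(text: str) -> bool:
    """Validate a street number against the pattern above (hypothetical helper)."""
    return STREET_NUMBER.fullmatch(text) is not None


for candidate in ["1234", "1234a", "1234a12", "12345678", "0123", "12abc"]:
    print(candidate, is_street_number(candidate))
```

Because the trailing digits sit inside the optional letter group, a second run of digits can only appear after at least one letter.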

Why my regex is failing for single digits but working for double digits?

I have a requirement to validate a String containing two numbers separated by a dash (-) or a comma (,). Valid values are:
23.98-34.76 or 23.98,34.76
23-34 or 23,34
5-6 or 5,6
I have the following regex which is a slight modification of the answer that I received here in SO. It is covering the 1st and 2nd case above but not the third case involving single digits only.
The modified regex String that I am using is :
(\d+\.?\d+?)([-,])(\d+\.?\d+?)
Where did my regex go wrong?
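Your regex fails on single digits because \d+? is not an optional run of digits: the ? after + only makes the quantifier lazy, so \d+\.?\d+? still requires at least two digits per number.
The correct regex should be like this:
(\d+(\.\d+)?)[-,](\d+(\.\d+)?)
i.e. if there is a period then it is always followed by 1 or more digits.
Otherwise, for example if you simply changed \d+? to \d*, it would also match strings like 123.,789.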
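
The difference between the two patterns can be seen with a few whole-string matches in Python. This is a sketch only; the names ORIGINAL, FIXED and the use of fullmatch for validation are my assumptions.

```python
import re

# The pattern from the question: \d+? is a lazy "one or more", not an optional part.
ORIGINAL = re.compile(r"(\d+\.?\d+?)([-,])(\d+\.?\d+?)")
# The corrected pattern: the fractional part is optional as a whole.
FIXED = re.compile(r"(\d+(\.\d+)?)[-,](\d+(\.\d+)?)")

for text in ["23.98-34.76", "23,34", "5-6", "123.,789."]:
    print(
        text,
        "original:", ORIGINAL.fullmatch(text) is not None,
        "fixed:", FIXED.fullmatch(text) is not None,
    )
```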

Excel Sort by 2nd character in alphanumeric string

I have a column in an Excel spreadsheet that contains the following:
### - 3-digit number
#### - 4-digit number
A### - character with 3-digits
#A## - digit followed by character then 2 more digits
There may also be superfluous characters to the right of these strings.
I would like to sort the entire spreadsheet by this column in the following order (ascending or descending):
the first three types of strings alphabetically as expected (NOT ASCII-Betically!)
Then the #A## by the character first, then by the first digit.
Example:
000...999, 0000...9999, A000...Z999, 0A00...9A99, 0B00...9B99...9Z99
I feel there is a very simple solution using a regular expression or macro, but my VBA and RegExp are pretty rusty (a friend asked me for this, but I'm more of a C guy these days). I have read some solutions which involve splitting the data into additional columns, which I would be fine with.
I would settle for a link to a good guide. Eternal thanks in advance.
If you want to sort by second character regardless of the content ahead and behind, then regex ^.(.) represents second character match...