Remove "+" from output in SAS Content Categorization - regex

I have a small problem in SAS Content Categorization. I'm trying to extract two values, value 1 and value 2.
I use a predicate_rule, so when I click on the matched string in the program I get
ARGUMENT 0 [val1]: 4
ARGUMENT 1 [val2]: 4
ARGUMENT 2 [valName]: Score
Here 4 is just an example value. My problem is that when the text reads 4+4 (no space between 4, + and 4) I can't get the second value WITHOUT the plus symbol, so I get this output:
ARGUMENT 0 [val1]: 4
ARGUMENT 1 [val2]: +4
ARGUMENT 2 [valName]: Score
I only manage to get the value printed correctly if there is a space between the numbers and the plus symbol.
I have now created two regexes and two predicate_rules.
This one is for the first value (val1), called: Regex1
REGEX:[1-5]
This one is for the second value (val2), called: Regex2
REGEX:\+[1-5]
I know that the plus symbol is printed because of Regex2, but I can't manage to get the second value without writing the regex this way.
In the main concept I have created two predicate_rules: one that handles scores with spaces between the numbers and the plus symbol, and one that handles scores with no spaces.
#With space
PREDICATE_RULE:(valName,val1,val2):(ORDDIST_4, "_valName{valName}", "_val1{Regex1}", "+", "_val2{Regex1}")
#Without space
PREDICATE_RULE:(valName,val1,val2):(ORDDIST_3, "_valName{valName}", "_val1{Regex1}", "_val2{Regex2}")
valName only contains terms that should be within distance of the arguments, so it matches correctly.
Thanks in advance.

I think you can alter the 2nd regex used in the predicate_rule. Since a text pattern like 4+4 is the issue, you could use a positive lookbehind. A positive lookbehind requires certain text to appear immediately before your main expression without including that text in the result.
Patterns like the ones below can all be handled with a positive lookbehind:
4+4
4 + 4
4 +4
4 4
Try the following regex for the 2nd predicate_rule:
(?<=[\+ ])[\d]
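As a rough illustration of what the lookbehind does, here is a minimal sketch using Python's re module as a stand-in. Whether the regex engine in SAS Content Categorization accepts lookbehind syntax is an assumption that would need checking in the product itself; the helper name extract_second is made up for this example.

```python
import re

# Pattern from the answer: one digit that is immediately preceded by "+" or a space.
# The lookbehind only checks the preceding character; it is not part of the match.
SECOND_VALUE = re.compile(r"(?<=[\+ ])[\d]")


def extract_second(text):
    """Return the first digit preceded by '+' or a space, or None."""
    match = SECOND_VALUE.search(text)
    return match.group(0) if match else None


for sample in ["4+4", "4 + 4", "4 +4", "4 4"]:
    print(repr(sample), "->", extract_second(sample))
```

Note that the pattern also matches a digit that follows any other space, so in running text it may need the surrounding rule (valName and the distance operator) to narrow things down.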

Related

How to regex extract only numbers up to the first comma or after a specific keyword?

I'm having trouble trying to regex extract the 'positions' from the following types of strings:
6 red players position 5, button 2
earn $50 pos3, up to $1,000
earn $50 pos 2, up to $500
table button 4, before Jan 21
I want to get the number that comes after 'pos' or 'position', and if there's no such keyword, get the last number before the first comma. The position value can be a number between 1 and 100. So 'position' for each of the previous rows would be:
Input text | Desired match (position)
6 red players position 5, button 2 | 5
earn $50 pos3, up to $1,000 | 3
earn $50 pos 2, up to $500 | 2
table button 4, before Jan 21 | 4
I have a big data set (in BigQuery) populated with basically those 4 types of strings.
I've already searched for this type of problem but found no solution or point to start from.
I've tried .+?(?=,) which extracts everything up to the first comma (,), but then I'm not sure how to go about extracting only the numbers from this.
I've tried (?:position|pos)\s?(\d) which extracts what I want in group 1 (with the keyword in a non-capturing group), but doesn't solve the 4th type of string.
I feel like there's a way to combine these two, but I just don't know how to get there yet.
And so, after the two things I've tried, I have two questions:
Is this possible with only regex? If so, how?
What would I need to do in SQL to make my life easier at getting these values?
I'd appreciate the help/guidance with this. Thanks a ton!
You can use
^(?:[^,]*[^0-9,])?(\d+),
See the RE2 regex demo. Details:
^ - start of string
(?:[^,]*[^0-9,])? - an optional sequence of:
[^,]* - zero or more chars other than comma
[^0-9,] - a char other than a digit and comma
(\d+) - Group 1: one or more digits
, - a comma
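To see the pattern at work on the four sample rows, here is a small sketch in Python. Python's re is used only for illustration; the pattern avoids lookarounds, so it should behave the same in RE2. The function name extract_position is invented for this example.

```python
import re

# Last run of digits before the first comma, captured in group 1.
POSITION = re.compile(r"^(?:[^,]*[^0-9,])?(\d+),")


def extract_position(text):
    """Return the number directly before the first comma, or None."""
    match = POSITION.search(text)
    return int(match.group(1)) if match else None


rows = [
    "6 red players position 5, button 2",
    "earn $50 pos3, up to $1,000",
    "earn $50 pos 2, up to $500",
    "table button 4, before Jan 21",
]
for row in rows:
    print(row, "->", extract_position(row))
```

In BigQuery, REGEXP_EXTRACT returns the captured group when the pattern has exactly one capturing group, so the same pattern should carry over; that part is untested here.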
Use look ahead for a comma, with a look behind requiring the previous char to be a space or a letter to prevent matching the “1” in “$1,000”:
(?<=[ a-z])(\d+)(?=,)
See live demo.
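The lookaround version can be tried out in Python as sketched below. One caveat worth stating: RE2, the engine behind BigQuery's regex functions, does not support lookahead or lookbehind, so this pattern is shown with Python's re only. The function name is an assumption for the example.

```python
import re

# Digits preceded by a space or lowercase letter and followed by a comma.
# The lookbehind rejects the "1" in "$1,000" because "$" precedes it.
POSITION = re.compile(r"(?<=[ a-z])(\d+)(?=,)")


def extract_position(text):
    """Return the first number that sits right before a comma, or None."""
    match = POSITION.search(text)
    return int(match.group(1)) if match else None


print(extract_position("earn $50 pos3, up to $1,000"))
```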

Regex in sqlite

I have been thinking about this a lot.
I want to create a table which contains a password.
The password should be at least 6 characters long and contain a minimum of 2 numbers.
My version was:
create table User (
passwort varchar(80) not null check (length(passwort) >= 6 and passwort like '%[0-9]%[0-9]%')
);
The problem with this approach is that the password has to contain the literal text [0-9] twice instead of actual digits. Does anyone know how to get rid of that problem?
Thanks in advance.
How about .*?\d.*?\d.*?
This requires at least 2 digits, each of which may be surrounded by zero or more other characters (including digits).
That said, I still recommend you split the work in two, as per my comment, i.e.:
Check the length of the string.
Use the actual expression to check if the string contains 2 numbers.
If you want a single expression, you can use: ^(?=.{6,}).*?\d.*?\d.*?$. It looks ahead for a minimum of 6 characters and then checks that the string contains at least 2 digits, which can be separated by 0 or more characters.
An example of the expression is available here.
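A minimal sketch of both routes, using Python's re and sqlite3 modules. The first part checks the single expression above in application code. The second part is a swapped-in technique, not what the answer proposes: SQLite's LIKE has no character classes, but its GLOB operator does, so the CHECK constraint can be written with GLOB. The helper names is_valid_password and try_insert are made up for this example.

```python
import re
import sqlite3

# Route 1: the single expression from the answer, checked in application code.
PASSWORD_RE = re.compile(r"^(?=.{6,}).*?\d.*?\d.*?$")


def is_valid_password(password):
    """True if the password has at least 6 characters and at least 2 digits."""
    return PASSWORD_RE.search(password) is not None


# Route 2 (assumption: GLOB instead of LIKE): "*" matches any sequence of
# characters and "[0-9]" matches one digit, so the pattern needs two digits.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE User (
        passwort VARCHAR(80) NOT NULL
            CHECK (length(passwort) >= 6 AND passwort GLOB '*[0-9]*[0-9]*')
    )
    """
)


def try_insert(password):
    """Insert the password; return False if the CHECK constraint rejects it."""
    try:
        with conn:
            conn.execute("INSERT INTO User (passwort) VALUES (?)", (password,))
        return True
    except sqlite3.IntegrityError:
        return False
```

Both routes should agree on inputs such as abc1d2 (accepted) and abcde1 (rejected, only one digit).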

Regex expression for date within dates range

I need to validate with regex a date in format yyyy-mm-dd (2019-12-31) that should be within the range 2019-12-20 - 2020-01-10.
What would be the regex for this?
Thanks
Regexes only deal with characters, so we have to work out which characters are valid at each position in the date.
The first part is easy: the first two characters have to be 20.
Now it gets complicated. The next character can be a 1 or a 2, but what follows depends on the value of that character, so we split the rest of the regex into two sections: the first for when the third character is 1 and the second for when it is 2.
If the third character is a 1, then what must follow is the characters 9-12-, as the range starts at 2019-12-20. Now for the day part. The 9th character is the tens digit of the day; this can only be 2 or 3, as we are already in the last month and the minimum date is the 20th. The last character can be any digit 0-9. This gives us a day match of [23][0-9]. Putting this together, we now have a pattern for dates in 2019 (after the leading 20): 19-12-[23][0-9]
If the third character is a 2, then we can again match up to the day part of the date, as the range ends in January. This gives us a partial match of 20-01-, leaving us to work on the day part. Here we know that the first character of the day can be either a 1 or a 0; however, if it's a 1 then the last character must be a 0, and if it's a 0 then the last character can only be in the range 1 to 9. This gives us another alternation, (?:0[1-9]|10). Putting the second part together we get 20-01-(?:0[1-9]|10).
Combining these together gives the final regex 20(?:19-12-[23][0-9]|20-01-(?:0[1-9]|10))
Note that I'm assuming that the date you are testing against is a validly formatted date.
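A quick way to check the combined pattern against the boundaries of the range, sketched in Python. Using fullmatch (so the whole string must match) is a choice made for this example, as is the function name.

```python
import re

# Final regex from the answer; fullmatch() anchors it at both ends.
DATE_RANGE = re.compile(r"20(?:19-12-[23][0-9]|20-01-(?:0[1-9]|10))")


def in_range(date_text):
    """True if date_text matches the 2019-12-20 .. 2020-01-10 pattern."""
    return DATE_RANGE.fullmatch(date_text) is not None


for candidate in ["2019-12-19", "2019-12-20", "2020-01-10", "2020-01-11"]:
    print(candidate, in_range(candidate))
```

As the note above says, the pattern relies on the input already being a valid date: a string such as 2019-12-39 also matches because of [23][0-9].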
Try this:
(2019|2020)\-(12|01)\-([0-3][0-9]|[0-9])
But be aware that this allows any dd value whose first digit is between zero and three and whose second digit is between zero and nine. You could instead list every day number you want to allow (20 to 31 and 1 to 10) like this: (20|21|22|23|24|25|26|27|28|29|30|31|01|1|02|2|03|3|04|4|05|5|06|6|07|7|08|8|09|9|10).
(2019|2020)\-(12|01)\-(20|21|22|23|24|25|26|27|28|29|30|31|01|1|02|2|03|3|04|4|05|5|06|6|07|7|08|8|09|9|10)
But honestly... regular expressions are not the right tool for this. A regex describes a mask for text, not a logical context. Use regex to extract the data/value from a string and validate those values using another language.
The 2nd regex above will, for example, match your dates, but also values outside of this range, since there is no link between the first group 2019|2020 and the second group 12|01, so it matches values like 2019-12-11 but also 2020-12-11.
To match only the values you want, you would need a really large regex like this (inner brackets only if you need them) ((2019)-(12)-(20)|(2019)-(12)-(21)|(2019)-(12)-(22)|...), continuing with all possible dates. Ask yourself: what would you do if you found such a regex in a project you have to work with ;)
Better solution (quick and dirty, there might be better solutions):
(?<yyyy>20[0-9]{2})\-(?<mm>[01][0-9]|[0-9])\-(?<dd>[0-3][0-9]|[0-9])
This way you have three named groups (yyyy, mm, dd) you can access and validate the matched values... The regex is smaller, you have a better association between code and regex and both are easier to maintain.
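Here is one way the extract-then-validate idea could look, sketched in Python. Python spells named groups as (?P<name>...) rather than (?<name>...), so the pattern is adapted accordingly; the range constants and the function name are assumptions taken from the question.

```python
import re
from datetime import date

# Same structure as the regex above, with Python's named-group syntax.
DATE_RE = re.compile(
    r"(?P<yyyy>20[0-9]{2})-(?P<mm>[01][0-9]|[0-9])-(?P<dd>[0-3][0-9]|[0-9])"
)
START = date(2019, 12, 20)
END = date(2020, 1, 10)


def in_range(text):
    """Extract year, month and day with the regex, then validate in code."""
    match = DATE_RE.fullmatch(text)
    if match is None:
        return False
    try:
        parsed = date(
            int(match.group("yyyy")),
            int(match.group("mm")),
            int(match.group("dd")),
        )
    except ValueError:
        # The regex accepts shapes such as 2019-12-32 that are not real dates.
        return False
    return START <= parsed <= END
```

The regex only has to recognise the shape of the string; the range check and the calendar rules live in ordinary code, which makes changing the range a one-line edit.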

Regex to match position from right to left

I hope you can help me. I'm learning how to write regexes and now I have this problem:
Write a regex that accepts strings of 0s and 1s that have a 1 in position 5 from right to left.
e.g. 10000 is accepted because it has a 1 in position 5 from right to left; 010000, 0010000 and 1110000 are accepted too.
I was thinking of something like: (0+1)*+1(0+1)(0+1)(0+1)(0+1)(0+1)
You can use this regex:
1[01]{4}$
If you want to match full input then use:
^[01]*1[01]{4}$
Here 1[01]{4}$ ensures that exactly 4 digits, each 0 or 1, follow the matched 1 before the end of the string, which puts that 1 at the 5th position from right to left.
RegEx Demo
Well - think of it this way. It needs to be as many 1s and 0s as you please, followed by a 1, followed by 4 more ones or zeroes.
So:
my_regex =
"^[01]*" + // Starts with One or zero, zero or more times
"1" + // Followed by a one
"[01]{4}$" // Followed by four things, which could be either zero or one, before ending.
Your (0 + 1) syntax looks foreign to me. I'm using character classes to specify the [01] things but you could use (0|1) in their place, which is what your attempt looks more like.
The full thing, together, is ^[01]*1[01]{4}$
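A short check of the full pattern against the examples from the question, sketched in Python (the function name is made up for this example):

```python
import re

# Any number of 0s and 1s, then a 1, then exactly four more 0s or 1s.
FIFTH_FROM_RIGHT = re.compile(r"^[01]*1[01]{4}$")


def accepts(bits):
    """True if bits is a binary string with a 1 at position 5 from the right."""
    return FIFTH_FROM_RIGHT.search(bits) is not None


for sample in ["10000", "010000", "0010000", "1110000", "00000", "1111"]:
    print(sample, accepts(sample))
```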

regex find match within the first n items

I have a string of 8 separated hexadecimal numbers, such as:
3E%12%3%1F%3E%6%1%19
And I need to check if the number 12 is located within the first 4 numbers.
I'm guessing this shouldn't be all that complex, but my searches turned up empty. Regular expressions are always trouble for me, but I don't have access to anything else in this scenario. Any help would be appreciated.
^([^%]+%){0,3}12%
See it in action
The idea is:
^ - from the start
[^%]+% - match multiple non % characters, followed by a % character
{0,3} - between 0 and 3 of those
12% - 12% after that
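The breakdown above can be exercised with a few inputs, sketched here in Python. The sample strings other than the one from the question are invented to probe the edges (12 in fifth place, 312 and 123 as near misses).

```python
import re

# Up to three "number%" groups from the start, then the number 12 followed by %.
FIRST_FOUR = re.compile(r"^([^%]+%){0,3}12%")


def has_12_in_first_four(text):
    """True if 12 is one of the first four %-separated numbers."""
    return FIRST_FOUR.search(text) is not None


print(has_12_in_first_four("3E%12%3%1F%3E%6%1%19"))
print(has_12_in_first_four("3E%3%1F%3E%12%6%1%19"))
```

Because each group must end in % and 12 must be followed directly by %, values such as 312 or 123 are not mistaken for 12.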
Here you go
^([^%]*%){4}(?<=.*12.*)
This will match both of the following, if that is what is intended:
1%312%..
1%123%..
Check whether it is acceptable for your case that %123% is matched.
If the number 12 should stand on its own, then use:
^([^%]*%){4}(?<=.*\b12\b.*)