prxchange function - sas

I have a question about the prxchange function.
I have a variable containing alphanumeric characters, special characters and blanks. I want to strip the digits and special characters but keep the spaces.
I use: UPCASE(prxchange(" s/[^A-Z]//i",-1,variable));
but the values I obtain have no spaces.
What I have: PROVA DB2.? RACF2
What I want: PROVA DB RACF
What I obtain with my function: PROVADBRACF
What can I do? Thank you,
Marina

You can use:
UPCASE(prxchange(" s/[^A-Z\s\t]*//i",-1,variable));
\t -- matches a tab
\s -- matches any whitespace character (space, tab, newline), so \t is redundant here but harmless
Code:
data l;
variable="PROVA DB2.? RACF2";
v2=UPCASE(prxchange(" s/[^A-Z\s\t]//i",-1,variable));
run;
Result:
PROVA DB RACF

Add a space to the character class. I put the space after the Z, although it could also have gone before the A:
UPCASE(prxchange("s/[^A-Z ]//i",-1,variable));
A quick summary of regex in SAS can be found in the Perl Regular Expressions Tip Sheet.
The negated character class [^...] matches any single character that is NOT listed inside it, so in a substitution the listed characters are the ones that are kept.
[^A-Z]
matches anything that is not a letter from A to Z. A space is not a letter, so the spaces are matched and removed too.
In other words, the substitution you wrote removes every character that is not a letter.
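The same character-class behaviour can be checked outside SAS. The following is a minimal sketch in Python's re module (not SAS; the variable names are only illustrative), contrasting the class without and with the space:

```python
import re

value = "PROVA DB2.? RACF2"

# Without a space in the class, a space counts as "not a letter" and is removed.
no_spaces = re.sub(r"[^A-Z]", "", value, flags=re.IGNORECASE).upper()

# With a space added to the class, spaces are no longer matched, so they survive.
with_spaces = re.sub(r"[^A-Z ]", "", value, flags=re.IGNORECASE).upper()

print(no_spaces)    # PROVADBRACF
print(with_spaces)  # PROVA DB RACF
```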

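A simple compress function is enough in this scenario, as mentioned by #dork horsten. Note that listing only '.?' would leave the digits in place; the 'k' (keep) and 'a' (alphabetic) modifiers keep just the letters plus the listed space:
data test;
variable="PROVA DB2.? RACF2";
v2= compress(variable,' ','ka');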
run;

Related

Converting a normal regular RegEx to one in SAS

Despite reading the SAS documentation and various example pages, I am struggling to convert a slightly more complicated regex to SAS syntax. I am using the prxchange function. This is what I have come up with so far to convert a filename string like pre_31DEC2019_299792458.xls to an integer number (of length 8) 299792458 inside a SAS data step:
tmp=prxchange('s/pre_([a-zA-Z0-9]{8,9})_([0-9]{1,16})\.xls/\2/g',-1,have);
want=input(tmp,8.);
The error message I have points to somewhere else in the code, but I am rather certain that it is those two lines which cause a problem since leaving out the two quoted lines makes the SAS error message vanish.
References
An unofficial SAS how-to on regex suggests that I could use standard regexes.
Why use regex at all?
want = input(scan(have,-2,'._'),32.);
You can use
tmp = prxchange('s/^pre_[A-Za-z0-9]+_([0-9]+)\.xls$/$1/', -1, have);
See the regex demo
Details
s/ - substitution action (we are replacing the match)
^ - start of string
pre_ - a literal prefix
[A-Za-z0-9]+ - one or more alphanumeric ASCII chars (note you may simply use .* here instead if there can be anything)
_ - an underscore
([0-9]+) - Group 1: one or more digits
\.xls$ - .xls at the end of string
$1 - the replacement: the whole match (here, the whole string) is replaced with the contents of Group 1.
As far as the prxchange function is concerned, note that it replaces all occurrences of the pattern once you pass -1 as the times argument, thus, no g flag is necessary.
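For comparison, here is a small sketch of the same two ideas in Python rather than SAS: the anchored substitution and the scan-style split on "." and "_". Python writes the back-reference as \1 instead of $1, and the variable names simply mirror the ones in the question.

```python
import re

have = "pre_31DEC2019_299792458.xls"

# Anchored substitution: the whole string is replaced by capture group 1.
tmp = re.sub(r"^pre_[A-Za-z0-9]+_([0-9]+)\.xls$", r"\1", have)
want = int(tmp)

# Scan-style alternative: split on "." and "_" and take the second-to-last piece.
want_scan = int(re.split(r"[._]", have)[-2])

print(want, want_scan)
```

If the string does not match the anchored pattern, re.sub returns it unchanged and int() raises a ValueError, so real input may need a check first.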
Many ways you could try:
data _null_;
a="pre_31DEC2019_299792458.xls";
b=input(prxchange('s/.*\_(.*)\..*/$1/',-1,a),12.);
c=input(prxchange('s/.*(\d{9}).*/$1/',-1,a),12.);
d=input(prxchange('s/.*(?<=\_)(\d+).*/$1/',-1,a),12.);
put _all_;
run;
.* means any character, repeated zero or more times. For b, the number you need sits between _ and .; for c, it is the 9 digits; for d, a lookbehind on "_" finds the digits that follow it.

End a regular expression pattern with a string

Hi all. I have spent some time now learning regular expressions, but there is a problem I cannot solve properly.
Let's assume the following 'string' (html-extract):
"{'2018-05-02', '2018-01-05', r, '2018-07-01', '2017-07-02', '2016-07-31' random_text XYCCC Letters and 55565798 ]}"
My intention is to extract all values from '2018-05-02' up to (and excluding) random_text. I tried to achieve this by choosing the "anything but" structure [^a] (not a):
\'[^random]*
The above does not do the job, because random is not a string, but a set of characters, hence the 'r' in the string will split my extracted value.
If there is no r in the text before the word random_text, this would work fine:
\'[^r]*
Is there any way to use a specific string as the end of my sequence? E.g.
start: \'
repeated characters unlike string: [^{my_string}]*
Appreciate any insight :)
This regex will do the job:
'.+'(?= random)
Just replace random with the string you want to exclude at the end.
Demo & explanation
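As a rough illustration of the lookahead approach in Python, the sketch below builds the pattern from a stop word. The helper name extract_before is hypothetical, and re.escape is added on the assumption that the stop word should be taken literally:

```python
import re

text = ("{'2018-05-02', '2018-01-05', r, '2018-07-01', '2017-07-02', "
        "'2016-07-31' random_text XYCCC Letters and 55565798 ]}")

def extract_before(text, stop_word):
    """Return everything from the first quote to the last quote that is
    directly followed by a space and stop_word (hypothetical helper)."""
    # The lookahead (?= ...) checks for the stop word without consuming it.
    pattern = r"'.+'(?= " + re.escape(stop_word) + r")"
    match = re.search(pattern, text)
    return match.group(0) if match else None

result = extract_before(text, "random")
print(result)
```

Because the lookahead matches a whole word rather than a set of characters, the lone r earlier in the string no longer cuts the match short.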

Regex parse a command line string but don't return spaces between quotes

I am using python to parse a string that is passed in by the optparse module.
I want to split the string on certain delimiters but not in between quote marks.
A sample string is:
--state-basedir /dir/dir/dir/ --cmd=\"param load $v2param\" --master=/dev/ttyUSB0 --console --map --out=udp:192.168.1.1:14550
This string is passed in as a single optparse argument, I am then going to pass it to another process.
I have been trying various things at http://pythex.org/
The closest I have gotten is:
`(?<!")[\s=](?![\s0-9a-zA-Z\$\\]*")`
The issue is that the = sign after --cmd and the space before --master are not matched.
In plain English, this is how I am reading my regex:
match either a space character or an equals sign, as long as it is not preceded by a quotation mark and not followed by any combination of letters, numbers and punctuation ending in another quotation mark
I had a feeling that there was something else I was missing, like greediness, so I tried adding ? after my look-ahead and look-behind terms. If I put a ? after my look-behind one I can get the space before --master but if I put the ? after my look-ahead term I get the spaces in the quotation marks now, which I don't want.
The idea here is that I am going to use re.split to handle things.
Thanks for any explanations as to what I am doing wrong.
This is not a regex answer and it's also not pretty, but it is one line.
sum([[x] if '"' in x else re.split(' |=',x) for x in re.split('=(\".+?\" )',a)],[])
output:
['--state-basedir', '/dir/dir/dir/', '--cmd', '"param load $v2param" ', '--master', '/dev/ttyUSB0', '--console', '--map', '--out', 'udp:192.168.1.1:14550']
Starting from re.split('=(\".+?\" )',a): this splits out text surrounded by quotes (more specifically ="something another thing"). The split pieces are then split further with re.split(' |=',x) if they do not have a " in them, or are returned as is in a one-element list [x] if they do. The last step flattens the resulting list of lists by calling sum with an empty list as the start value: sum(two_d_list,[]).
I hope this answer helps but I understand if it isn't what you're looking for
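For readability, the one-liner can be unrolled into a loop. This is only a sketch of the same logic; the function name split_args is made up for illustration, and the sample string is the one from the question with the quotes unescaped:

```python
import re

def split_args(a):
    """Split on spaces and '=' but leave ="quoted text" chunks intact."""
    parts = []
    # The capturing group keeps the quoted chunk in the split result.
    for chunk in re.split(r'=(".+?" )', a):
        if '"' in chunk:
            parts.append(chunk)                     # quoted chunk: keep as is
        else:
            parts.extend(re.split(r' |=', chunk))   # plain chunk: split further
    return parts

a = ('--state-basedir /dir/dir/dir/ --cmd="param load $v2param" '
     '--master=/dev/ttyUSB0 --console --map --out=udp:192.168.1.1:14550')
parts = split_args(a)
print(parts)
```

As in the one-liner, the quoted chunk keeps its trailing space, so a caller may want to strip it.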

split text into words and exclude hyphens

I want to split a text into its individual words using regular expressions. The obvious solution would be to use the regex \\b, but unfortunately it also splits words on the hyphen.
So I am looking for an expression that does exactly the same as \\b but does not split on hyphens.
Thanks for your help.
Example:
String s = "This is my text! It uses some odd words like user-generated and need therefore a special regex.";
String [] b = s.split("\\b+");
for (int i = 0; i < b.length; i++){
System.out.println(b[i]);
}
Output:
This
is
my
text
!
It
uses
some
odd
words
like
user
-
generated
and
need
therefore
a
special
regex
.
Expected output:
...
like
user-generated
and
....
#Matmarbon's solution is already quite close, but not a 100% fit; it gives me
...
like
user-
generated
and
....
This should do the trick, even if lookaheads are not available:
[^\w\-]+
For anyone who needs this for another purpose (e.g. inserting something), this is closer to an equivalent of the \b solutions:
([^\w\-]|$|^)+
because:
There are three different positions that qualify as word boundaries:
Before the first character in the string, if the first character is a word character.
After the last character in the string, if the last character is a word character.
Between two characters in the string, where one is a word character and the other is not a word character.
--- http://www.regular-expressions.info/wordboundaries.html
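To see what splitting on [^\w\-]+ produces, here is a short sketch in Python rather than Java. Unlike the \b split, it drops the punctuation tokens instead of returning them, and the empty-string filter is an addition to discard the trailing piece left by the final period:

```python
import re

s = ("This is my text! It uses some odd words like user-generated "
     "and need therefore a special regex.")

# Split on runs of characters that are neither word characters nor hyphens.
words = [w for w in re.split(r"[^\w\-]+", s) if w]
print(words)
```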
You can use this:
(?<!-)\\b(?!-)

Extract numbers between brackets within a string [duplicate]

Possible Duplicate:
Extract info inside all parenthesis in R (regex)
I imported data from Excel, and one cell consists of long strings that contain numbers and letters. Is there a way to extract only the numbers from such a string and store them in a new variable? Unfortunately, some of the entries have two sets of brackets, and I would only want the second one. Could I use grep for that?
The strings look more or less like this; the length of the strings varies, however:
"East Kootenay C (5901035) RDA 01011"
or like this:
"Thompson-Nicola J (Copper Desert Country) (5933039) RDA 02020"
All I want from this is 5901035 and 5933039
Any hints and help would be greatly appreciated.
There are many possible regular expressions to do this. Here is one:
x=c("East Kootenay C (5901035) RDA 01011","Thompson-Nicola J (Copper Desert Country) (5933039) RDA 02020")
> gsub('.+\\(([0-9]+)\\).+?$', '\\1', x)
[1] "5901035" "5933039"
Let's break down the syntax of that first expression '.+\\(([0-9]+)\\).+?$'
.+ one or more of anything. Because it is greedy, it grabs as much as it can, which is what makes the match land on the LAST set of numbers in parens, as noted in the comments.
\\( parentheses are special characters in a regular expression, so to represent a literal ( I need to escape it with a \. I have to escape it again for R (hence the two \s).
([0-9]+) I mentioned special characters; here I use two. The first is the parentheses, which indicate a group I want to keep. The second is [ and ], which surround a set of characters to match. See ?regex for more information.
.+?$ The final piece matches whatever follows the closing paren, up to the end of the string, so the whole string is replaced.
I could also use * instead of + which would mean 0 or more rather than one or more, in case your paren string comes at the beginning or end of a string.
The second argument of gsub is what I am replacing the match with. I used: \\1. This says use group 1 (the stuff inside the ( ) from above). I need to escape it twice again, once for the regex and once for R.
Clear as mud to be sure! Enjoy your data munging project!
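The same pattern can be tried in Python, shown here only as a cross-check of the regex rather than as R code. With a raw string the backslashes are not doubled, and the list comprehension stands in for gsub's vectorised behaviour:

```python
import re

x = [
    "East Kootenay C (5901035) RDA 01011",
    "Thompson-Nicola J (Copper Desert Country) (5933039) RDA 02020",
]

# The greedy leading .+ pushes the match to the last "(digits)" group;
# the whole string is then replaced by capture group 1.
codes = [re.sub(r".+\(([0-9]+)\).+?$", r"\1", s) for s in x]
print(codes)  # ['5901035', '5933039']
```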
Here is a gsubfn solution:
library(gsubfn)
strapplyc(x, "[(](\\d+)[)]", simplify = TRUE)
[(] matches an open paren, (\\d+) matches a string of digits creating a back-reference owing to the parens around it and finally [)] matches a close paren. The back-reference is returned.