Transpose column with particular character to rows - sas

How can we transpose several variables in SAS that begin with a specific character e.g. _ without naming each variable separately? These variables only have _ in common and rets of the variable names are different. Once transposed, I would like to rename the column as new_col such that all the variables will fall into separate rows below this column.

You can use a variable short cut list, specifically the colon which allows you to match based on the same prefix, in this case the underscore.
Mentioning _: will refer to all variables that start with _. If you need to exclude any this will not work.

Related

update where symbol length less than 3 characters KDB+/Q

I have a table like this:
test:([]column1:`A`B`C`D`E;column2:`Consumer`RealEstate`27`85`Technology)
I need to write an update query where the character count of column2 is 2 or less, but I couldn't find any way to reference the character length of a symbol to use in a where clause. How would I write a query like that so my result is the following?
test:([]column1:`A`B`C`D`E;column2:`Consumer`RealEstate`NewCategory`NewCategory`Technology)
One way to accomplish what you want is to convert the symbols into strings (i.e. lists of chars) and apply the where clause to the count of each of those strings:
update column2:`NewCategory from `test where 3>count each string column2
(using a backtick on the table name here, assuming you want to apply the change in place)
Another option is to use a vector conditional
update column2:?[3>count each string column2;`NewCatergory;column2] from test

SAS search and replace the last word using regex [duplicate]

Match strings ending in certain character
I am trying to get create a new variable which indicates if a string ends with a certain character.
Below is what I have tried, but when this code is run, the variable ending_in_e is all zeros. I would expect that names like "Alice" and "Jane" would be matched by the code below, but they are not:
proc sql;
select *,
case
when prxmatch("/e$/",name) then 1
else 0
end as ending_in_e
from sashelp.class
;quit;
You should account for the fact that, in SAS, strings are of char type and spaces are added up to the string end if the actual value is shorter than the buffer.
Either trim the string:
prxmatch("/e$/",trim(name))
Or add a whitespace pattern:
prxmatch("/e\s*$/",name)
^^^
to match 0 or more whitespaces.
SAS character variables are fixed length. So you either need to trim the trailing spaces or include them in your regular expression.
Regular expressions are powerful, but they might be confusing to some. For such a simple pattern it might be clearer to use simpler functions.
proc print data=sashelp.class ;
where char(name,length(name))='e';
run;

Remove substrings that vary in value in Oracle

I have a column in Oracle which can contain up to 5 separate values, each separated by a '|'. Any of the values can be present or missing. Here are come examples of how the data might look:
100-1
10-3|25-1|120/240
15-1|15-3|15-2|120/208
15-1|15-3|15-2|120/208|STA-2
112-123|120/208|STA-3
The values are arbitrary except for the order. The numerical values separated by dashes always come first. There can be 1 to 3 of these values present. The numerical values separated by a slash (if it is present) is next. The string, 'STA', and a numerical value separated by a dash is always last, if it is present.
What I would like to do is reformat this column to only ever include the first three possible values, those being the three numerical values separated by dashes. Afterwards, I want to replace 2nd numeric in each value (the numeric after the dash) using the following pattern:
1 = A
2 = B
3 = C
I would also like to remove the dash afterwards, but not the '|' that separates the values unless there is a trailing '|'.
To give you an idea, here's how the values at the beginning of the post would look after the reformatting:
100A
10C|25A
15A|15C|15B
15A|15C|15B
112ABC
I'm thinking this can be done with regex expressions but it's got me a little confused. Does anyone have a solution?
If I have to solve this problem I will solve it in following ways.
SELECT
REGEXP_REPLACE(column,'\|\d+\/\d+(\|STA-\d+)?',''),
REGEXP_REPLACE(column,'(\d+)-(1)([^\d])','\1A\3'),
REGEXP_REPLACE(column,'(\d+)-(2)([^\d])','\1B\3'),
REGEXP_REPLACE(column,'(\d+)-(3)([^\d])','\1C\3'),
REGEXP_REPLACE(column,'(\d+)-(123)([^\d])','\1ABC')
FROM table;
Explanation: Let us break down each REGEXP_REPLACE statement one by one.
REGEXP_REPLACE(column,'\|\d+\/\d+(\|STA-\d+)?','')
This will replace the end part like 120/208|STA-2 with empty string so that further processing is easy.
Finding match was easy but replacing A for 1, B for 2 and C for 3 was not possible ( as per my knowledge ) So I did those matching and replacements separately.
In each regex from second statement (\d+)-(yourNumber)([^\d]) first group is number before - then yourNumber is either 1,2,3 or 123 followed by |.
So the replacement will be according to yourNumber.
All demos here from version 1 to 5.
Note:- I have just done replacement for combination of yourNUmber for those present in question. You can do likewise for other combinations too.
you can do this in one line, but you can write simple function to do that
SELECT str, REGEXP_REPLACE(str,'(\|\d+\/\d+)?(\|STA-\d+)?','') cut
, REGEXP_REPLACE(REGEXP_REPLACE(str,'(\|\d+\/\d+)?(\|STA-\d+)?',''), '(\-)([1,2]*)(3)([1,2]*)', '\1\2C\4') rep3toC
, REGEXP_REPLACE(REGEXP_REPLACE(REGEXP_REPLACE(str,'(\|\d+\/\d+)?(\|STA-\d+)?',''), '(\-)([1,2]*)(3)([1,2]*)', '\1\2C\4'), '(\-)([1,C]*)(2)([1,C]*)', '\1\2B\4') rep2toB
, REGEXP_REPLACE(REGEXP_REPLACE(REGEXP_REPLACE(REGEXP_REPLACE(str,'(\|\d+\/\d+)?(\|STA-\d+)?',''), '(\-)([1,2]*)(3)([1,2]*)', '\1\2C\4'), '(\-)([1,C]*)(2)([1,C]*)', '\1\2B\4'), '(\-)([B,C]*)(1)([B,C]*)', '\1\2A\4') rep1toA
, REGEXP_REPLACE(REGEXP_REPLACE(REGEXP_REPLACE(REGEXP_REPLACE(REGEXP_REPLACE(str,'(\|\d+\/\d+)?(\|STA-\d+)?',''), '(\-)([1,2]*)(3)([1,2]*)', '\1\2C\4'), '(\-)([1,C]*)(2)([1,C]*)', '\1\2B\4'), '(\-)([B,C]*)(1)([B,C]*)', '\1\2A\4'), '-', '') "rep-"
FROM (
SELECT '100-1' str FROM dual UNION
SELECT '10-3|25-1|120/240' str FROM dual UNION
SELECT '15-1|15-3|15-2|120/208' str FROM dual UNION
SELECT '15-1|15-3|15-2|120/208|STA-2' str FROM dual UNION
SELECT '112-123|120/208|STA-3' FROM dual
) tab

Subsetting a string based on pre- and suffix

I have a column with these type of names:
sp_O00168_PLM_HUMAM
sp_Q8N1D5_CA158_HUMAN
sp_Q15818_NPTX1_HUMAN
tr_Q6FGH5_Q6FGH5_HUMAN
sp_Q9UJ99_CAD22_HUMAN
I want to remove everything before, and including, the second _ and everything after, and including, the third _.
I do not which to remove based on number of characters, since this is not a fixed number.
The output should be:
PLM
CA158
NPTX1
Q6FGH5
CAD22
I have played around with these, but don't quite get it right..
library(stringer)
str_sub(x,-6,-1)
That’s not really a subset in programming terminology1, it’s a substring. In order to extract partial strings, you’d usually use regular expressions (pretty much regardless of language); in R, this is accessible via sub and other related functions:
pattern = '^.*_.*_([^_]*)_.*$'
result = sub(pattern, '\\1', strings)
1 Aside: taking a subset is, as the name says, a set operation, and sets are defined by having no duplicate elements and there’s no particular order to the elements. A string by contrast is a sequence which is a very different concept.
Another possible regular expression is this:
sub("^(?:.+_){2}(.+?)_.+", "\\1", vec)
# [1] "PLM" "CA158" "NPTX1" "Q6FGH5" "CAD22"
where vec is your vector of strings.
A visual explanation:
> gsub(".*_.*_(.*)_.*", "\\1", "sp_O00168_PLM_HUMAM")
[1] "PLM"

postgresql: How to concatenate two regexp_matches()

I'm trying to extract both ints and chars from names such as 123A America, 234B Britania.
I only want the the number and the attached letter (i.e. 123A) .
I'm using regexp_matches(name, '(\d+)(\D)') and it results as:
{123,A},
{456,B}
I thought using concatenation, getting the first element of an array and the second element using two different functions
(regexp_matches(name, '(\d+)(\D)' )) [1] || (regexp_matches(name, '(\d+)(\D)' )) [2]
But it generates an error:
ERROR: functions and operators can take at most one set argument
How can I get the two element as one string?
You don't have to get the two items you're searching for as different sets, just get them as a single set. Remove the )( between \d+ and \D and that will return a set containing the entire string you're looking for.
Results in this -
regexp_matches('123A America, 234B Britania', '(\d+\D)' )
This will only find the first match. To get all matching substrings, use the g flag -
regexp_matches('123A America, 234B Britania', '(\d+\D)', 'g')
good answer by #Scott S however if you can't achieve what you need within one capture group the solution is to write a function, assign the regexp result to a variable and then use it.
CREATE OR REPLACE FUNCTION do_something(_input character varying)
RETURNS character varying AS
$BODY$
DECLARE
matches text[];
BEGIN
matches := regexp_matches(_input, '^([0-9]{1,}_[^_]{1,})_[a-z]{1,}(.*)$','i');
return substring(matches[1], 0, 24)||matches[2];
END
$BODY$
LANGUAGE plpgsql;