One of my variables is coded as a messy mix of numeric values, text, parentheses and so on. I only need to extract the numeric value at the start, which is recorded as, for example, 12345 (the number of digits is not fixed), followed by || and then a description that may also contain numeric values. When I applied the SAS compress function, newvar = compress(oldvar, '', 'a'), newvar kept ALL the numbers from oldvar, so it looks like 12345|||(789)|| etc. The number of '|' signs (which are control characters to indicate line breaks etc.?) varies, though.
I only need to extract the first numeric value, the one before the first '|' sign. Any help, please?
Thanks in advance.
Use the SCAN() function to extract the values. It will result in a character value and converting to a numeric should be straightforward.
new_var = input(scan(old_var, 1, "|"), best12.);
This should do it (FIND takes the string to search first and the substring to look for second):
substr("12345||45||89||...", 1, find("12345||45||89||...", "|") - 1)
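For readers who want to check the logic outside SAS, here is a rough Python sketch of the same idea: take everything before the first '|' and convert it only if it is all digits. The function name leading_number is made up for illustration. One difference to keep in mind: SAS SCAN skips leading delimiters, while this sketch returns None if the value starts with '|'.

```python
def leading_number(raw, delimiter="|"):
    """Return the integer before the first delimiter, or None if that piece is not all digits."""
    first = raw.split(delimiter, 1)[0].strip()
    # isascii() guards against non-ASCII digit characters that int() cannot parse
    if first.isascii() and first.isdigit():
        return int(first)
    return None


print(leading_number("12345|||(789)||"))
```

Digits that appear later in the description, such as the 789 above, are ignored because only the first piece is inspected.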
Why does this code not need two trim statements, one for first and one for last name? Does the length statement remove blanks?
data work.maillist; set cert.maillist;
length FullName $ 40;
fullname=trim(firstname)||' '||lastname;
run;
length is a declarative statement and introduces a variable to the Program Data Vector (PDV) with the specific length you specify. When an undeclared variable is used in a formula SAS will assign it a default length depending on the formula or usage context.
Character variables in SAS have a fixed length and are padded with spaces on the right. That is why trim(firstname) is needed before the || lastname concatenation. Without it, the right padding of firstname would become part of the concatenated value, and the result could exceed the length of the variable receiving it.
There are concatenation functions that can simplify string operations
CAT same as using <var>|| operator
CATT same as using trim(<var>)||
CATS same as using trim(left(<var>))||
CATX same as using CATS with a delimiter.
STRIP same as trim(left(<var>))
Your expression could be re-coded as:
fullname = catx(' ', firstname, lastname);
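The padding behaviour described in this answer can be imitated outside SAS. The Python sketch below is only a model, not SAS itself: the helper fixed() is hypothetical and stands in for a fixed-length character variable, and the lengths 10 and 40 are assumptions chosen for the example.

```python
def fixed(value, length):
    """Mimic a fixed-length character variable: truncate, then pad with spaces on the right."""
    return value[:length].ljust(length)


firstname = fixed("Ann", 10)   # stored as 'Ann' followed by 7 spaces
lastname = fixed("Lee", 10)

# Without trimming, the padding of firstname ends up in the middle of the result.
untrimmed = fixed(firstname + " " + lastname, 40)

# Trimming only firstname is enough: the padding of lastname just merges
# with the padding that the 40-character target would add anyway.
trimmed = fixed(firstname.rstrip() + " " + lastname, 40)

print(repr(untrimmed.rstrip()))
print(repr(trimmed.rstrip()))
```

This is also why a second trim on lastname makes no visible difference: its trailing spaces are indistinguishable from the padding of the receiving variable.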
Is there a reason you think it should? Can you see trailing spaces in the surname, have you tried a length() function?
I could be wrong here but sometimes when you apply a function (put especially) or import data you can inadvertently store leading or trailing spaces. Trailing spaces are a mystery because you don't realise they are there until you try to do something else with the data.
A length statement should allow you to store exactly the data you give it providing you use a number/character variable correctly with truncation only occurring if the length value is too short.
I've found the compress() function to be the most convenient for dealing with white space and punctuation, particularly if you are concatenating variables.
https://www.geeksforgeeks.org/sas-compress-function-with-examples/
All the best,
Phil
Because SAS will truncate the value when it is too long to fit into FULLNAME. And when it is too short it will fill in the rest of FULLNAME with spaces anyway so there is no need to remove them.
It would only be an issue if the length of FULLNAME is smaller than the sum of the lengths of FIRSTNAME and LASTNAME plus one. Otherwise the result cannot be too long to fit into FULLNAME, even if there are no trailing spaces in either FIRSTNAME or LASTNAME.
Try it yourself with non-blank values so it is easier to see what is happening.
data test;
  length one $1 two $2 three $3;
  one = 'ABCD';
  two = 'ABCD';
  three = 'ABCD';
  put (_all_) (=);
run;
one=A two=AB three=ABC
NOTE: The data set WORK.TEST has 1 observations and 3 variables.
I would like to remove dashes from 3 to 9 digit numbers. A certain percentage of those numbers have leading zeros in them. I tried using the Compress function, but this stripped the zeros as well. What would be the best function to use?
I understand your "numbers" are actually codes with digits and dashes and you want to keep only the digits, so what you need is string processing.
The compress function in SAS has a second (optional) parameter. If you don't specify it, the function will remove all blanks. If you do, it will remove the characters specified instead. So try
no_dash = compress(with_dash, '-');
Alternatively you could remove all non digit characters, using a third (also optional) parameter
no_dash = compress(with_dash, '0123456789', 'k');
The k specifies to keep instead of remove the characters specified. You can shorten this by adding the d to the third parameter, telling SAS to add all digits to the second:
no_dash = compress(with_dash, '', 'dk');
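As a cross-check of the "keep only digits" idea, here is a small Python sketch; it is an illustration rather than SAS, and keep_digits is a made-up name. It shows why the result has to stay a string: converting it to a number is what drops the leading zeros.

```python
def keep_digits(code):
    """Keep only the characters 0-9, preserving their order and any leading zeros."""
    return "".join(ch for ch in code if ch in "0123456789")


cleaned = keep_digits("000-90-123")
print(cleaned)        # still a string, zeros intact
print(int(cleaned))   # numeric conversion discards the leading zeros
```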
If you have stored the compressed result (with implicit conversion) in a numeric variable, that variable may need a format to get the result you want.
data _null_;
my_dashed_text = '000-90-123';
my_compressed_text = compress(my_dashed_text, '-');
attrib my_num_var
length = 8
format = z9.
;
my_num_var = compress(my_dashed_text, '-');
put (_all_) (=/);
run;
------ LOG -----
NOTE: Character values have been converted to numeric values at the places given by:
(Line):(Column).
36:16
my_dashed_text=000-90-123
my_compressed_text=00090123
my_num_var=000090123
The Z numeric format tells SAS to add leading zeros that fill out to the specified width when displaying the number. The format has a fixed width, so a my_num_var derived from either "123-456" or "0-1-2-3-45-6" will display the Z9 formatted value 000123456. A single format specification (such as Z9) can't make one numeric value render as 123456 and another as 0123456.
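The fixed-width point can be seen in a short Python sketch that imitates zero-padded display. The names z_format and digits_to_number are hypothetical, and Python's format specification is used here only as a stand-in for the SAS Z format.

```python
def digits_to_number(code):
    """Strip everything except digits and convert to an integer (leading zeros are lost)."""
    return int("".join(ch for ch in code if ch in "0123456789"))


def z_format(number, width=9):
    """Render an integer zero-padded to a fixed width, similar in spirit to Z9."""
    return f"{number:0{width}d}"


for code in ("123-456", "0-1-2-3-45-6"):
    print(code, "->", z_format(digits_to_number(code)))
```

Both inputs end up as the same number, so any single fixed-width format renders them identically; the original count of leading zeros is only recoverable from the character value.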
I have a table with a column po_number of type varchar in Postgres 8.4. It stores alphanumeric values with some special characters. I want to ignore the characters [/alpha/?/$/encoding/.] and check whether the column contains a number or not. If it's a number, it needs to be cast to a number; otherwise I want to pass null, as my output field po_number_new is a number field.
Below is the example:
SQL Fiddle.
I tried this statement:
select
(case when regexp_replace(po_number,'[^\w],.-+\?/','') then po_number::numeric
else null
end) as po_number_new from test
But I got an error for explicit cast:
Simply:
SELECT NULLIF(regexp_replace(po_number, '\D','','g'), '')::numeric AS result
FROM tbl;
\D being the class shorthand for "not a digit".
And you need the 4th parameter 'g' (for "globally") to replace all occurrences.
Details in the manual.
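To make the behaviour of the NULLIF(regexp_replace(...), '') pattern concrete, here is a rough Python equivalent. It is a sketch for illustration, not Postgres: digits_or_none is a made-up name, None plays the role of NULL, and [^0-9] is used instead of \D so that only ASCII digits are kept.

```python
import re
from decimal import Decimal


def digits_or_none(po_number):
    """Strip every non-digit; return a Decimal, or None when no digits remain."""
    if po_number is None:
        return None  # NULL in, NULL out
    digits = re.sub(r"[^0-9]", "", po_number)  # re.sub replaces all occurrences, like the 'g' flag
    return Decimal(digits) if digits else None  # the empty-string check mirrors NULLIF(..., '')


for value in ("PO-12/34", "abc", None):
    print(repr(value), "->", digits_or_none(value))
```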
For a known, limited set of characters to replace, plain string manipulation functions like replace() or translate() are substantially cheaper. Regular expressions are just more versatile, and we want to eliminate everything but digits in this case. Related:
Regex remove all occurrences of multiple characters in a string
PostgreSQL SELECT only alpha characters on a row
Is there a regexp_replace equivalent for postgresql 7.4?
But why Postgres 8.4? Consider upgrading to a modern version.
Consider pitfalls for outdated versions:
Order varchar string as numeric
WARNING: nonstandard use of escape in a string literal
I think you want something like this:
select (case when regexp_replace(po_number, '[^\w],.-+\?/', '') ~ '^[0-9]+$'
then regexp_replace(po_number, '[^\w],.-+\?/', '')::numeric
end) as po_number_new
from test;
That is, you need to do the conversion on the string after replacement.
Note: This assumes that the "number" is just a string of digits.
The logic I would use to determine if the po_number field contains numeric digits is that its length should decrease when attempting to remove numeric digits.
If so, then all non numeric digits ([^\d]) should be removed from the po_number column. Otherwise, NULL should be returned.
select case when char_length(regexp_replace(po_number, '\d', '', 'g')) < char_length(po_number)
then regexp_replace(po_number, '[^0-9]', '', 'g')
else null
end as po_number_new
from test
If you want to extract floating numbers try to use this:
SELECT NULLIF(regexp_replace(po_number, '[^\.\d]','','g'), '')::numeric AS result FROM tbl;
It's the same as Erwin Brandstetter's answer but with a different expression:
[^...] - match any character except those in a list of excluded characters; put the excluded characters in place of ...
\. - point character (also you can change it to , char)
\d - digit character
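One caveat with keeping the decimal point: a value with more than one point (or with only a point) leaves a string that is not a valid number, and a numeric cast of such a string fails. The Python sketch below illustrates the idea with a guard for that case; decimal_or_none is a hypothetical name, and returning None for unparseable input is a choice made for this example, not something the query above does.

```python
import re
from decimal import Decimal, InvalidOperation


def decimal_or_none(text):
    """Keep digits and decimal points; return a Decimal, or None if nothing parseable remains."""
    cleaned = re.sub(r"[^0-9.]", "", text)
    if not cleaned:
        return None
    try:
        return Decimal(cleaned)
    except InvalidOperation:
        # e.g. '1.2.3' or '.' survive the cleanup but are not valid numbers
        return None


for value in ("USD 12.50", "v1.2.3", "n/a"):
    print(repr(value), "->", decimal_or_none(value))
```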
Since version 12 - that's 2 years + 4 months ago at the time of writing (but after the last edit that I can see on the accepted answer), you can use a GENERATED column to do this quite easily on a one-time basis, rather than having to calculate it each time you wish to SELECT a new po_number.
Furthermore, you can use the TRANSLATE function to extract your digits, which is less expensive than the REGEXP_REPLACE solution proposed by @ErwinBrandstetter!
I would do this as follows (all of the code below is available on the fiddle here):
CREATE TABLE s
(
num TEXT,
new_num INTEGER GENERATED ALWAYS AS
(NULLIF(TRANSLATE(num, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ. ', ''), '')::INTEGER) STORED
);
You can add to the 'ABCDEFG... string in the TRANSLATE function as appropriate - I have decimal point (.) and a space ( ) at the end - you may wish to have more characters there depending on your input!
And checking:
INSERT INTO s VALUES ('2'), (''), (NULL), (' ');
INSERT INTO t VALUES ('2'), (''), (NULL), (' ');
SELECT * FROM s;
SELECT * FROM t;
Result (same for both):
num new_num
2 2
NULL
NULL
NULL
So, I wanted to check how efficient my solution was, so I ran the following test inserting 10,000 records into both tables s and t as follows (from here):
EXPLAIN (ANALYZE, BUFFERS, VERBOSE)
INSERT INTO t
with symbols(characters) as
(
VALUES ('ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789')
)
select string_agg(substr(characters, (random() * length(characters) + 1) :: INTEGER, 1), '')
from symbols
join generate_series(1,10) as word(chr_idx) on 1 = 1 -- word length
join generate_series(1,10000) as words(idx) on 1 = 1 -- # of words
group by idx;
The differences weren't that huge but the regex solution was consistently slower by about 25% - even changing the order of the tables undergoing the INSERTs.
However, where the TRANSLATE solution really shines is when doing a "raw" SELECT as follows:
EXPLAIN (ANALYZE, BUFFERS, VERBOSE)
SELECT
NULLIF(TRANSLATE(num, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ. ', ''), '')::INTEGER
FROM s;
and the same for the REGEXP_REPLACE solution.
The differences were very marked, the TRANSLATE taking approx. 25% of the time of the other function. Finally, in the interests of fairness, I also did this for both tables:
EXPLAIN (ANALYZE, BUFFERS, VERBOSE)
SELECT
num, new_num
FROM t;
Both extremely quick and identical!
I have an issue with an amount field that holds data like (- 98765.00), i.e. minus, {spaces}, {number}. I need to remove the space between the minus and the number and get (-98765.00). How do I do it in an Expression transformation?
The field datatype is decimal (8,2).
Thanks,
Kiran
output_port: TO_DECIMAL(REPLACECHR(FALSE,input_port,' ',''))
REPLACECHR replaces the blanks with empty character, essentially removing them. The first argument can be TRUE/FALSE to specify case sensitive or not, but it is not important in this case.
You can use the REG_REPLACE function to remove the space.
To achieve this, follow the steps below:
* Create two variable ports.
* REG_REPLACE requires a string column, so first convert the decimal column to a string using the TO_CHAR function.
First variable port (string) - TO_CHAR(column_name)
* The previous port holds the value as a string; now apply REG_REPLACE to it and convert the result back to decimal.
Second variable port (decimal) - TO_DECIMAL(REG_REPLACE(first_variable_port, '\s+', ''))
\s - matches white space in an Informatica regular expression
See the image below; it uses the same number that you provided. Use the same data type and function.
The debugger shows the expected result, with the white space removed, in the image below.
Maybe the issue comes from other transformations that the data passes through. Debug and verify the data once.
Hope that helps; if you have any issues, feel free to ask.
To enjoy Informatica, have fun at https://etlinfromatica.wordpress.com/
If my understanding is correct, you need to replace both the spaces and the brackets. Here's the expression:
TO_DECIMAL(
REPLACECHR(0,
REPLACECHR(0, '(- 98765.00)', ' ', '') -- this part does the space replacement
, '()', '') -- this part replaces the brackets
)
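The nested REPLACECHR logic (drop spaces, drop brackets, then convert) can be checked with a small Python sketch. This is an illustration of the transformation, not Informatica code, and clean_amount is a made-up name.

```python
from decimal import Decimal

# Translation table that deletes spaces and both brackets in one pass
_REMOVE = str.maketrans("", "", " ()")


def clean_amount(raw):
    """Remove spaces and brackets, then convert what is left to a Decimal."""
    return Decimal(raw.translate(_REMOVE))


print(clean_amount("(- 98765.00)"))
```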
I have data in a tab-delimited file that uses commas as the decimal separator. When I imported it, the values came into SAS as character variables that still contain the commas,
like 23,1 53,2
I now want to convert these to numeric, displayed with either a . or a comma. How do I do it?
If I use
want=input(have,comma.);
informat want comma.;
format want comma.;
I get missing values!
You can use the NUMXw.d informat to input numbers with commas as the decimal separator.
want = input(have, numx4.1);
or just use that on the initial input statement and you don't have to convert it.
NUMXw.d also is a format, so you can use it to display the variable with a comma if that's how you are more comfortable viewing decimals.
You can use a TRANWRD function to replace the comma with a period, then wrap this within an INPUT function to convert the new character value to numeric.
F2 = INPUT(TRANWRD(F1,',','.'),4.1);
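The replace-then-convert approach is easy to verify outside SAS. The Python sketch below mirrors it under the assumption that the comma is only ever a decimal separator (no thousands separators); comma_decimal_to_float is a made-up name.

```python
def comma_decimal_to_float(text):
    """Convert a string such as '23,1' (comma as decimal separator) to a float."""
    return float(text.strip().replace(",", "."))


for value in ("23,1", "53,2"):
    print(value, "->", comma_decimal_to_float(value))
```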