I can't figure out why the WHERE clause does not filter the selected rows.
The neT evaluation is correct in the SELECT assignment, but it does not work in the WHERE clause (all 5 rows are selected!).
data a;
input a $ b $; datalines;
abcd abdef
abcd abcdefg
abcdef abcd
xxyyzz .
. xxyyzz
run;
proc sql;
create table b as
select *
, a neT b as neT
, a eqT b as eqT
from a
where a neT b
;
quit;
Table B
You need STRIP() on both operands:
where strip(a) neT strip(b)
This is also relevant to the topic.
data _null_;
a='12';
b='1 ';
x = a eq: b;
z = strip(a) eq: strip(b);
put _all_;
run;
a=12 b=1 x=0 z=1 _ERROR_=0 _N_=1
This is confusingly documented, but may be correct behavior (though I have no idea why).
In the documentation of the SQL truncating expressions:
The Base SAS WHERE processor handles truncated comparisons differently than PROC SQL does. The Base SAS WHERE processor truncates comparisons based on the actual length of a string, even if a string includes blanks at the end. PROC SQL trims trailing blanks from the string values before it truncates comparisons. PROC SQL truncates string comparisons similar to other SQL processors that conform, in various degrees, to the INCITS/ISO/IEC/ANSI SQL:2011 Standards.
Now, this is confusing of course because the documentation says that base SAS Where would work the way you describe in the second example, i.e.:
data b;
set a;
where a ne: b;
run;
Per the documentation, that is expected to return 5 rows. However, it's unclear why the SQL where behaves this way and not the way the documentation says SQL handles things - but it's also not really surprising, as I'd always understood where to typically work the same way in both PROC SQL and the data step. You might want to try raising a track with tech support - or if you don't know how to do that, I can, to verify that this is intended behavior and not unintended.
Of course, why this is considered correct behavior (even in the data step where), I have no idea. It makes the truncating expressions basically useless for comparing variables to each other, without adding the strip or trim to both ... which is the normal behavior in other modes, anyway. I'm guessing this is some sort of backwards compatibility issue gone haywire, and I'd encourage you to create a SASWare Ballot idea to have this behavior changed.
You can work around this either as Data _Null explains - by using strip on both arguments, or trim - or, if you actually are using the neT in a column expression as well as the where, you can use calculated:
proc sql;
create table b as
select *
, a neT b as _neT
, a eqT b as _eqT
from a
where calculated _net eq 1
;
quit;
Suppose I have a column called ABC whose values look like this:
123_112233_66778_1122 or
123_112233_1122_11232 or
1122_112233_66778_123
I want to generate the desired value, 1122, in a new column. I have a long list of values that I need to check against the column ABC; when there is an exact match, the new column should hold that value. I do not want a partial match such as 112233, because it is not the value I am looking for.
For example, in all three lines above the only piece I want to match is "1122".
I have no idea how to solve this. I tried wildcards without much success. Any help would be much appreciated.
It is hard to tell from your description, but from the values you show it looks like you want the INDEXW() function. It lets you search a string for matching words, with an option to specify which characters are to be considered separators between the words. The result is the position where the word starts within the longer string. When the word is not found, the result is zero.
Let's create a simple example to demonstrate.
data have;
input abc $30. ;
cards;
123_112233_66778_1122
123_112233_1122_11232
1122_112233_66778_123
;
data want;
set have ;
location = indexw(trim(abc),'1122','_');
run;
Note that SAS will consider any value other than zero (or missing) as TRUE so you can just use the INDEXW() function call in a WHERE statement.
data want;
set have;
where indexw(trim(abc),'1122','_');
run;
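As a rough illustration of why INDEXW() fits this problem better than a plain substring search, the sketch below runs INDEX() and INDEXW() side by side on the same three values and fills a new column only on a whole-word hit. The names compare, pos_index, pos_indexw, and match are made up for this example.

```sas
/* same sample values as above */
data have;
  input abc $30.;
  cards;
123_112233_66778_1122
123_112233_1122_11232
1122_112233_66778_123
;

data compare;
  set have;
  length match $8;
  /* substring search: also hits the 1122 inside 112233 */
  pos_index  = index(abc, '1122');
  /* whole-word search, with underscore as the only word separator */
  pos_indexw = indexw(trim(abc), '1122', '_');
  /* populate the new column only when the whole word is present */
  if pos_indexw then match = '1122';
run;
```

For the first row, INDEX() reports position 5, which is inside 112233, while INDEXW() reports position 18, the start of the standalone 1122.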
I have an input file with a lot of dollar amounts given like this:
$433.5B $41.1B $331.1B $407.4B
$110.8B $19B $2,265.8B $170.1B
where the 'B' character stands for "billions". I do not have other suffixes like M or k. I need to read in this file using an INPUT statement in SAS inside a DATA step, and these figures should be numerics. There are several challenges to overcome, as well as a couple of features of the data to note:
There are dollar signs everywhere.
Some of the numbers have decimal points, and some don't, so we're dealing with variable-length data.
There are commas inside the numbers, such as $2,265.8B.
The most pesky aspect of this data are the B's after each amount.
The B's are always in the same columns.
What informat should I use to read in this numerical data?
I thought of using something along the lines of :DOLLAR4.1, like this:
Data bigcompanies;
Infile 'path\bigcompanies.dat' MISSOVER;
Input (sales profits assets market_value) (:DOLLAR4.1);
Run;
but it gives me nothing (as in, I get periods for those numbers). I don't know how to handle the B, which is, I think, the crux of the problem. The SAS documentation on the DOLLAR informat is rather sparse, unfortunately.
Many thanks for your help!
If the data is in fixed columns then just skip the columns where the B appears.
data test;
input sales dollar6. +1 profits dollar8. +1 assets dollar9. +1 market_value dollar9. +1 ;
*---+---10----+---20----+---30----+---40 ;
cards;
$433.5B   $41.1B   $331.1B   $407.4B
$110.8B     $19B $2,265.8B   $170.1B
;
proc print;
run;
Results
                                     market_
Obs    sales    profits    assets     value
  1    433.5      41.1      331.1     407.4
  2    110.8      19.0     2265.8     170.1
Note that you normally never want to add a decimal part to an informat. The decimal part tells SAS where to place the decimal point when one does not appear in the source text. So values written without a decimal point, such as $19, will be divided by that power of 10.
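A small sketch of that effect, reading the same four columns twice: once with DOLLAR4.1 and once with DOLLAR4. The dataset name demo and the variables with_d and without_d are invented for the illustration.

```sas
data demo;
  /* read columns 1-4 twice: with a decimal spec, then without one */
  input @1 with_d dollar4.1 @1 without_d dollar4.;
  datalines;
$19
$1.9
;

proc print data=demo;
run;
```

For $19 the first read gives 1.9 (19 divided by 10) and the second gives 19. For $1.9 both reads give 1.9, because the decimal part of the informat is ignored when the text already contains a decimal point.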
Is there an alternative to the SQL LIKE operator in a SAS data step?
I am using the code below for my requirement, but it is not working.
IF var1 ne : 'ABC' then new_var=XYZ;
Please anyone suggest what is wrong in this or suggest to me what the correct usage is for this situation.
Thanks,
In a data step, IF can be combined with INDEX, FIND, or FINDW. If you want to use LIKE, you must use it in a WHERE statement.
data want;
set sashelp.class;
where name like 'A%';
run;
You can use the FIND function, e.g.:
data want;
set sashelp.class;
if find(name,'e') then new_var='Y';
run;
Used as in your code, the colon modifier turns the comparison into a prefix test: SAS truncates the longer value to the length of the shorter one before comparing. So if every value of var1 is longer than 3 characters, each one is cut to its first 3 characters and then compared with 'ABC'.
It therefore differs from the LIKE operator in SQL, where the position of the % wildcard determines whether the pattern is matched at the beginning, at the end, or anywhere in the string.
To replicate LIKE, you need a function such as FIND, as recommended by @Amir, or INDEX, which is also commonly used in this situation.
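As a hedged sketch, three common LIKE patterns could be mapped to data step IF conditions roughly as follows. The dataset name like_demo and the flag variables starts_a, contains_e, and ends_y are invented for the example, and all three tests are case-sensitive.

```sas
data like_demo;
  set sashelp.class;
  length starts_a contains_e ends_y $1;
  /* like 'A%' : prefix test with the colon modifier */
  if name =: 'A' then starts_a = 'Y';
  /* like '%e%' : FIND returns 0 when the substring is absent */
  if find(name, 'e') then contains_e = 'Y';
  /* like '%y' : compare the last non-blank character */
  if substr(name, length(name), 1) = 'y' then ends_y = 'Y';
run;
```

LENGTH() ignores trailing blanks, which is why the last test looks at the final visible character rather than at the end of the padded variable.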
I'd like to use the following syntax
data new;
set old (where=(mystring in ('string1','string2',...,'string500')));
run;
in order to filter a very large input data set. The 500 strings at first are contained as numeric values in the variable "bbb" in the dataset "aux". So far I have created a macro variable which contains the required list of the 500 strings the following way:
proc sql noprint;
select bbb into :StringList1 separated by "',' "
from work.aux;
quit;
data _null_; call symputx('StringList2',compress("'&StringList1'")); run;
data new;
set old (where=(mystring in (&StringList2)));
run;
... which seems to work. But there is a warning telling me that
The quoted string currently being processed has become more than 262
characters long. You might have unbalanced quotation marks.
Results still seem to be plausible. Should I be worried that one day results might become wrong?
More importantly: I try to find a way to avoid using the compress function by setting up the
separated by "',' "
option in a way that does not contain blanks in the first place. Unfortunately the following seems not to work:
separated by "','"
It doesn't give me an error message, but when I look at the macro variable there is a multipage mess of red line numbers (the color that usually denotes error messages), empty rows, minus signs, and so on. The following screenshot shows part of the log after running this code:
proc sql noprint;
select vnr into :StringVar1 separated by "','"
from work.var_nr_import;
quit;
%put &StringVar1.;
I have already tried to make use of the %STR() function, but with no success so far.
I cannot replicate your error messages in SAS 9.3.
If your variable is numeric you don't need quotes in the macro variable.
If it is character try using the QUOTE() function.
proc sql noprint;
select quote(bbb) into :StringList1 separated by " "
from work.aux;
quit;
A macro variable can only contain 65,534 characters. So if there are too many values of BBB then your macro variable value will be truncated. This could lead to unbalanced quotes. That is most likely the source of your errors.
Note that you can turn off the warning about the length of the quoted strings by using the NOQUOTELENMAX system option, but in this application you wouldn't want to because the individual quoted strings are not that long.
You will be better served to use another method to subset your data if lists this long are required.
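One such method, sketched here under assumptions, is to skip the macro variable and let PROC SQL look the values up directly in the auxiliary table. The code assumes that bbb is numeric, as described in the question, and that converting it with BEST12. and stripping blanks reproduces the text stored in mystring; check that assumption against the real data.

```sas
proc sql;
  create table new as
  select *
  from old
  /* the subquery replaces the 500-item quoted list */
  where mystring in
        (select strip(put(bbb, best12.)) from work.aux);
quit;
```

Because no list is built in a macro variable, neither the 65,534-character limit nor the quoted-string warning comes into play.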
This will work.
For double quotes:
proc sql noprint;
select quote(bbb) into :StringList1 separated by ","
from work.aux;
quit;
For single quotes:
proc sql noprint;
select "'"||bbb||"'" into :StringList1 separated by ","
from work.aux;
quit;
I am stuck on one particular point. I have a character variable whose observations were extracted from an RTF document. I need to keep only the observations from obs A to obs B. The FIRSTOBS= and OBS= options do not help here because we do not know the observation numbers beforehand; all we know is the two unique strings. For example, in the dataset I need to create a dataset with observations 11 to 16. This is only part of the dataset; the original has over 1500 observations, which is why we use unique text rather than observation numbers to mark the range.
Thank you all in advance.
You don't explain enough, but odds are you can do something sort of like this if I understand you right (you have a "start" and a "stop" string in the document).
data want;
set have;
retain keep 0;
if strvar = "keepme" then keep=1;
if keep=1;
if strvar = "lastone" then keep=0;
run;
That is, have some condition set the keep variable to 1, then test it, then put the off condition after that (assuming you want to keep the row that triggers the off condition). Use string functions like INDEX, FIND, or SCAN to search for your particular string if it is only part of the value. You could also use regular expressions if necessary.
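A minimal sketch of the same pattern using FIND() for markers that are only part of the line. The sample rows and the marker strings 'Table 1:' and 'End of Table 1' are invented for illustration, as is the flag name in_range.

```sas
data have;
  infile datalines truncover;
  input strvar $40.;
  datalines;
Header text
Table 1: Demographics
row one
row two
End of Table 1
Footer text
;

data want;
  set have;
  retain in_range 0;
  /* switch on when the start marker appears anywhere in the line */
  if find(strvar, 'Table 1:') > 0 then in_range = 1;
  /* subsetting IF: rows outside the range stop here */
  if in_range;
  /* switch off after the stop row, which is itself kept */
  if find(strvar, 'End of Table 1') > 0 then in_range = 0;
  drop in_range;
run;
```

With these sample rows, WANT holds the four lines from the start marker through the stop marker. Moving the switch-off test above the subsetting IF would drop the stop row instead.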