Dropping observations in Stata based on length?

I have a string variable in Stata called Cod. I want to drop the observations in which Cod has fewer than 16 characters. Any suggestions?

You should look at help string functions to learn basic syntax here.
drop if length(Cod) < 16
may be what you seek.

Strange sorting in Stata after doing encode

Variable X used to be a string, so I used the encode command to make it numeric.
But when I sort it afterwards, it is sorted like this:
1000
10000
10001
10003
10005
1003
But usually, it should be sorted like
1000
1001
1003
1005
Why is the sorting so strange after using encode?
It also appears that the 1003 created by encode and the 1003 in the using dataset are treated as different numbers.
Not strange at all. Right near the top of help encode Stata tells you "Do not use encode if varname contains numbers that merely happen to be stored as strings".
encode maps strings in alphabetical (here alphanumeric) order to the integers 1, 2, 3, and so on (unless you specify otherwise with a label() option).
So "1000" will sort before "10000" before "1001", and so forth.
You probably need destring but why was the variable read as string? That's what you need to worry about.
encode is for strings when you want a numeric equivalent. So "cat" "dog" "frog" "toad" will map to 1 2 3 4 and the string values will become value labels.
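The ordering is easy to reproduce outside Stata. The following is a Python sketch, not Stata code: the sample values are partly taken from the question, the variable names are made up for illustration, and the mapping assumes encode is run with default options.

```python
# Values as they might be stored in the string variable.
codes = ["1003", "1000", "10001", "10000", "1001", "10005", "10003", "1005"]

# Sorting strings compares character by character; this is the order
# in which encode (by assumption, with default options) hands out 1, 2, 3, ...
as_strings = sorted(codes)
encoded = {text: number for number, text in enumerate(as_strings, start=1)}

# Converting to numbers first (roughly what destring does) gives numeric order.
as_numbers = sorted(int(c) for c in codes)

print(as_strings)
print(encoded["1003"])  # the stored value is a small integer, not 1003
print(as_numbers)
```

Under these assumptions the string "1003" ends up stored as a small integer with "1003" as its value label, which would explain why it does not match a genuine 1003 in another dataset.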
destring is for mistaken strings. The variable should be numeric, but something went wrong on reading the data. So, what was it that went wrong? Common errors include
Header data from a spreadsheet that should be a variable label (or ignored) got read in as data.
Codes for missing data such as NA that make sense to people or to some other program but do not correspond to Stata representations of missing.
Garbage of some kind.
To check for problems, you could look at the values that wouldn't translate to numbers:
tab whatever if missing(real(whatever))
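The same kind of check can be sketched in Python for readers who want to see the logic spelled out. The helper name to_number and the sample values are hypothetical; note also that Python's float() accepts a few words such as "nan" and "inf", so this is only an approximation of what Stata's real() does.

```python
def to_number(text):
    """Return the numeric value of text, or None if it does not parse."""
    try:
        return float(text.strip())
    except ValueError:
        return None

# Hypothetical raw values: a stray header, a missing-data code, garbage.
raw = ["1000", "1003", "NA", "Cod", " 1005 ", "", "12a"]

# Keep only the values that would NOT translate to numbers.
problems = [value for value in raw if to_number(value) is None]
print(problems)
```

Listing the offending values first makes it easier to decide whether they are headers, missing-data codes, or garbage before converting anything.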

Extract left part of the string in SAS?

Is there a function in SAS proc SQL that I can use to extract the left part of a string? I am after something similar to the LEFT function in SQL Server. In SQL I have left(11111111, 4) * 9 = 9999, and I would like to do something similar in SAS proc SQL. Any help will be appreciated.
I had the impression you might want to repeat the substring instead of multiplying it, so I'm adding the REPEAT function just out of curiosity.
proc sql;
  select
    INPUT(SUBSTR('11111111', 1, 4), 4.) * 9 /* if source is char */
    , INPUT(SUBSTR(PUT(11111111, 16. -L), 1, 4), 4.) * 9 /* if source is number */
    , REPEAT(SUBSTR(PUT(11111111, 16. -L), 1, 4), 9) /* repeat instead of multiply; REPEAT(x, n) returns n+1 copies */
  from SASHELP.CLASS (obs=1)
  ;
quit;
substr("some text",1,4) will give you "some". This function works the same way in a lot of SQL implementations.
Also, note that this is a string function, but in your example you're applying it to a number. SAS will let you do this, but in general it's wise to control your conversions between strings and numbers with the put() and input() functions, to keep your log clean and to be sure that you're only converting where you actually intend to.
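The idea of converting explicitly in both directions can be shown in a few lines of Python. This is only an illustration of the logic, not SAS syntax; the variable names are made up.

```python
value = 11111111

# Explicit number -> string conversion (the role PUT() plays in SAS),
# followed by taking the leftmost four characters.
left_part = str(value)[:4]

# Explicit string -> number conversion (the role INPUT() plays in SAS),
# then the multiplication from the question.
result = int(left_part) * 9

print(left_part)
print(result)
```

Keeping each conversion visible makes it obvious where a string becomes a number and back again, which is the point of using put() and input() deliberately.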
You might be looking for the SUBSTRN function.
SUBSTRN(string, position <, length>)
Arguments
string: specifies a character or numeric constant, variable, or expression. If string is numeric, then it is converted to a character value that uses the BEST32. format. Leading and trailing blanks are removed, and no message is sent to the SAS log.
position: is an integer that specifies the position of the first character in the substring.
length: is an integer that specifies the length of the substring. If you do not specify length, the SUBSTRN function returns the substring that extends from the position that you specify to the end of the string.
As others have pointed out, substr() is the function you are looking for, although I feel that a more useful answer would also 'teach you how to fish'.
A great way to find out about SAS functions is to google sas functions by category which at the time of writing this post will direct you here:
SAS Functions and CALL Routines by Category
It's worth scanning through this list at least once just to get an idea of all of the functions available.
If you're after a specific version, you may want to include the SAS version number in your search. Note that the link above is for 9.2.
If you have scanned through all the functions and still can't find what you are looking for, then your next option may be to write your own SAS function using proc fcmp. If you ever need assistance with doing this, then I suggest posting a new question.

How do I use numeric functions to correct date typos?

I know it's easy enough to do manual corrections on date typos, but I want to automate such corrections using one or more SAS functions, given that my dataset is large and typos are frequent.
For instance, it seems that whoever created the dataset I am cleaning often transposed digits in the year of someone's birthdate (e.g., '2102' rather than '2012', '2110' instead of '2010', etc.). I'm aware of string functions such as INDEX() that find certain character values or strings and then allow for the replacement of said characters in the same position (i.e., replace "ABCD" with "ABBB", regardless of the string's location in a value). Can the same process be replicated with numeric (and specifically date) values?
I don't think SAS has any functions that would check numeric values for digit patterns. I often do data cleaning and address this issue by making a character variable out of the numeric date variable, then using character functions and Perl regex to clean the character values, and then storing the cleaned values as numeric date.
For specifically date values, you could try using SAS date functions (e.g. DAY(), MONTH(), YEAR(), MDY(), etc.) to extract parts of the date value, error-check them, and put them all back together into a date value. This could be a good quick solution if you expect a limited set of typos and you roughly know what they are. For a more thorough error check, converting the numeric values to character and using char or regex functions would give you more options.
The only really concise suggestion I can imagine is using mdy (assuming these are date, not datetime, variables).
For example:
data want;
  set have;
  if year(datevar) > 2100 then
    datevar = mdy(month(datevar), day(datevar), year(datevar) - 90);
run;
would correct any '2104' to '2014'. That's a very simple correction (and may well do as much harm as good, since '2114' is also a possible typo), but things along those lines - break the date up into its pieces, verify the pieces, reconstruct using mdy.
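The "break it up, verify, reconstruct" idea can be sketched in Python as follows. The function name fix_year and its cutoff and shift parameters are hypothetical, and the fixed 90-year shift is the same crude rule as above, so it inherits the same ambiguity. One extra wrinkle is handled explicitly: 29 February has no counterpart when the corrected year is not a leap year, and this sketch returns None in that case so the value can be reviewed by hand.

```python
from datetime import date

def fix_year(d, cutoff=2100, shift=90):
    """Shift implausibly late years back by a fixed amount.

    Returns the date unchanged when its year is at or below the cutoff,
    and None when the shifted date does not exist (29 Feb in a non-leap year).
    """
    if d.year <= cutoff:
        return d
    try:
        # Rebuild the date from its pieces with the corrected year.
        return d.replace(year=d.year - shift)
    except ValueError:
        return None

print(fix_year(date(2104, 5, 17)))
print(fix_year(date(2012, 1, 1)))
print(fix_year(date(2104, 2, 29)))
```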

SAS - selecting character observations from position 1 to position 2

I am stuck on one particular point. I have a character variable with observations extracted from an RTF document. I need to keep only the observations from obs A to obs B. The firstobs and obs options are not helpful here because we do not know the observation numbers beforehand. All we know is the two unique strings. For example, in the dataset I need to create a dataset with observations from obs 11 to 16. This is only part of the dataset; the original has over 1500 observations, which is why we use unique text to capture the range instead of observation numbers.
Thank you all in advance.
You don't explain enough, but odds are you can do something sort of like this if I understand you right (you have a "start" and a "stop" string in the document).
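data want;
  set have;
  retain keep 0;                        /* flag persists across rows */
  if strvar = "keepme" then keep = 1;   /* start marker switches the flag on */
  if keep = 1;                          /* subsetting IF: output only flagged rows */
  if strvar = "lastone" then keep = 0;  /* stop row is kept, then the flag goes off */
run;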
That is, have some condition set the keep variable to 1, then test for it, then put the off condition after that (assuming you want to keep the row that triggers the off condition). Use string functions like index, find, or scan to search for your particular string if it is not the entire value. You could also use regular expressions if necessary.
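The flag logic is the same in any language. Here is a minimal Python sketch of it, with a hypothetical function name keep_between and made-up marker strings; it assumes the markers match whole values exactly, as in the data step above.

```python
def keep_between(rows, start, stop):
    """Return the rows from the start marker through the stop marker, inclusive."""
    keeping = False
    kept = []
    for row in rows:
        if row == start:      # start marker switches the flag on
            keeping = True
        if keeping:           # only flagged rows are output
            kept.append(row)
        if row == stop:       # stop marker is kept, then the flag goes off
            keeping = False
    return kept

rows = ["header", "keepme", "line 1", "line 2", "lastone", "footer"]
print(keep_between(rows, "keepme", "lastone"))
```

The order of the three tests matters: checking the stop marker after the output step is what keeps the final row in the result.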

Is there a limit to the value levels in a proc format statement?

proc format;
value $STNAME 'AL'='Alabama'
'AK'='Alaska
'AR'='Arkansas'
'AZ'='Arizona'
'CA'='California'
'CO'='Colorado'
'CT'='Connecticut'
'DC'='DistrictOfColumbia'
'DE'='Deleware'
'FL'='Florida'
'GA'='Georgia'
'HI'='Hawaii'
'IA'='Iowa'
'ID'='Idaho'
'IL'='Illinois'
'IN'='Indiana'
'KS'='Kansas'
'KY'='Knetucky'
'LA'='Louisiana'
'MA'='Massachusetts'
'MD'='Maryland'
'ME'='Maine'
'MI'='Michigan'
'MN'='Minnesota'
'MO'='Missouri'
'MS'='Mississippi'
'MT'='Montana'
'NC'='North Carolina'
'ND'='North Dakota'
'NE'='Nebraska'
'NH'='New Hampshire'
'NJ'='New Jersey'
'NM'='New Mexico'
'NY'='New York'
'NV'='Nevada'
'OH'='Ohio'
'OK'='Oklahoma'
'OR'='Oregon'
'PA'='Pennsylvania'
'RI'='Rhode Island'
'SC'='South Carolina'
'SD'='South Dakota'
'TN'='Tennessee'
'TX'='Texas'
'UT'='Utah'
'VA'='Virginia'
'VT'='Vermont'
'WA'='Washington'
'WI'='Wisconsin'
'WV'='West Virginia'
'WY'='Wyoming';
run;
It freezes up in the middle of the proc format step. If I split or shorten it, it runs fine.
Anyone aware how to get around this?
You are missing a closing quote on Alaska. I placed the code in my IDE and I could tell from the highlighting.
As long as your hard drive can hold the SAS program file, there is no limit on the number of unique values inside a proc format step or on the amount of memory needed to load it. As @Carolina has suggested, you are missing an end quote for Alaska. If there is no end quote, the states after Alaska are shown in a different color. After you add the end quote, the highlighting after Alaska should change to a uniform color.
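If no syntax-highlighting editor is at hand, a rough check for this kind of mistake can be scripted. The Python sketch below uses a hypothetical helper, lines_with_unbalanced_quotes, and simply flags lines containing an odd number of single quotes. It assumes every quoted literal starts and ends on the same line and that no literal contains a lone apostrophe, so treat any hit as a hint rather than a diagnosis.

```python
def lines_with_unbalanced_quotes(source, quote="'"):
    """Return (line number, text) pairs for lines with an odd number of quotes."""
    return [
        (number, line)
        for number, line in enumerate(source.splitlines(), start=1)
        if line.count(quote) % 2 == 1
    ]

# First three lines of the value statement from the question.
snippet = """value $STNAME 'AL'='Alabama'
'AK'='Alaska
'AR'='Arkansas'"""

print(lines_with_unbalanced_quotes(snippet))
```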
It might be better to use more conventional spacing for better readability.
Also, you might want spaces in 'DistrictOfColumbia', and Kentucky ('Knetucky') and Delaware ('Deleware') are spelt incorrectly.
Hope this helps.