replace a variable's value if it contains a certain string - regex

I have a variable called diagnosis and I want to replace everything that contains the word "pneumonia" with just "pneumonia"
I tried this:
replace diagnosis = "Pneumonia" if regexm (diagnosis, "pneumonia")
But I got an error: unrecognized command: regexm
I have Stata/IC version 12.1 for Windows.

Answer:
Take out the space between regexm and (diagnosis, "pneumonia").
Additional suggestions:
regexm takes a long time, so I would do something more like
replace diagnosis = "Pneumonia" if diagnosis == "pneumonia"
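which gives the same result only when the value is exactly "pneumonia" (it does not catch longer values such as "Bronchial pneumonia"), or if you want to do this more generally you can write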
replace diagnosis = strproper(diagnosis)
which has the same results in your example.

Another approach.
. clonevar newdiag = diag
. replace newdiag = "pneumonia" if strpos(strlower(diag),"pneumonia")>0
(3 real changes made)
. list, clean noobs
diag                   newdiag
pneumonia              pneumonia
Pneumonia              pneumonia
Bronchial pneumonia    pneumonia
Flu and pneumonia      pneumonia
earache                earache
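For readers outside Stata, the same case-insensitive "contains" test can be sketched in Python. This is only an illustration of the logic in the strpos(strlower(...)) line above; the function name collapse_diagnosis and the sample list are made up for the example.

```python
def collapse_diagnosis(value, keyword="pneumonia"):
    """Return the keyword itself if it occurs anywhere in value, ignoring case."""
    # Lower-casing first mirrors strlower(); "in" mirrors strpos(...) > 0.
    return keyword if keyword in value.lower() else value


# Sample values taken from the listing above.
diagnoses = [
    "pneumonia",
    "Pneumonia",
    "Bronchial pneumonia",
    "Flu and pneumonia",
    "earache",
]
cleaned = [collapse_diagnosis(d) for d in diagnoses]
print(cleaned)
```

A plain substring test is enough here; a regular expression is only needed if the match has to be anchored or more selective.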

Related

Unwanted cut words in the output

I tried to submit the code below, but SAS doesn't show the complete values in the output. It cuts off everything after the blank in the values of the variable "COMPLICATION".
Can anyone fix it?
data complication;
length SUBJECT 8 COMPLICATION $ 15.5;
input SUBJECT COMPLICATION $ ;
datalines;
2076 Pneumonia
3585 DVT(Lower)
3630 DVT(Lower)
4585 Compartment
4599 Aspiration
4760 Acute Resp
4775 Pneumonia
2076 Heart Attack
3585 Pneumonia
3660 Heart Attack
4585 Pneumonia
4599 Pneumonia
4760 Pneumonia
4775 DVT(Lower)
2076 Renal Fail
3585 Renal Fail
3630 Pancreatit
4585 Skin Break
4599 Renal Fail
4760 Renal Fail
4775 Pneumonia
3630 Pneumonia
4775 Renal Fail
;
run;
The default value (or field) separator is a space, so the character input is 'clipped' at the first space. Use the list input modifier & after a variable name to make INPUT treat two or more consecutive spaces as the delimiter, which allows single embedded spaces in the field value.
The LENGTH statement should not have $15.5; change it to $15. The INPUT statement already knows that COMPLICATION is a character variable because the preceding LENGTH statement set it up, so change COMPLICATION $ to COMPLICATION &.
So you want:
data complication;
length SUBJECT 8 COMPLICATION $ 15;
input SUBJECT COMPLICATION & ;
...
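The clipping itself is easy to reproduce outside SAS. The Python sketch below is not SAS code and only mimics the two delimiter rules described above; the helper names read_list_input and read_with_ampersand are hypothetical.

```python
import re


def read_list_input(line):
    """Mimic default list input: every blank ends a field."""
    subject, complication = line.split()[:2]
    return int(subject), complication


def read_with_ampersand(line):
    """Mimic the & modifier: the field ends at two or more blanks, or at end of line."""
    subject, rest = line.split(maxsplit=1)
    complication = re.split(r" {2,}", rest, maxsplit=1)[0].rstrip()
    return int(subject), complication


print(read_list_input("2076 Heart Attack"))      # second word is lost
print(read_with_ampersand("2076 Heart Attack"))  # embedded single blank is kept
```

With the second rule, a value such as "Heart Attack" survives because only a run of two or more blanks ends the field.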

How to remove repeated words or phrases within the same string

I am working with a string variable response in Stata. This variable stores complete sentences, and many of these sentences have repeated phrases.
For example:
how do you know how do you know what it is?
it was during the during the past thirty days
well well I would hope I would hope that they're doing that
I want to clean these strings by removing all repeated phrases.
In other words, I want to transform this sentence:
how do you know how do you know what it is?
to the one below:
how do you know what it is?
So far, I have tried to fix each case individually, but this is incredibly time-consuming as there are thousands of repeated words/phrases.
I would like to run code that can identify when a phrase is repeated within the same observation / string, and then remove one instance of that phrase (or word).
I imagine regular expressions would help, but I cannot figure out much more than this.
The following works for me:
clear
input str80 string
"Pearly Spencer how do you know how do you know what it is?"
"it was during the during the past thirty days"
"well well I would hope I would hope that they're doing that"
"well well they're doing that I would hope I would hope "
"well well I would hope I would hope that they're doing that but but they don't"
end
clonevar wanted = string
local stop = 0
while `stop' == 0 {
    generate dup = ustrregexs(2) if ustrregexm(wanted, "(\W|^)(.+)\s\2")
    replace wanted = subinstr(wanted, dup, "", 1)
    capture assert dup == ""
    if _rc == 0 local stop = 1
    else drop dup
}
replace wanted = strtrim(stritrim(wanted))
list wanted
     +----------------------------------------------------------+
     | wanted                                                   |
     |----------------------------------------------------------|
  1. | Pearly Spencer how do you know what it is?               |
  2. | it was during the past thirty days                       |
  3. | well I would hope that they're doing that                |
  4. | well they're doing that I would hope                     |
  5. | well I would hope that they're doing that but they don't |
     +----------------------------------------------------------+
The above solution uses a regular expression to first identify a repeated word or phrase. It then removes the first occurrence from the string by substituting an empty string in its place, which can leave doubled spaces behind.
Because this regular expression finds only one repeated element per pass (for example, the last observation contains three: "well", "I would hope" and "but"), the process is repeated in a while loop until no repeated elements remain in any string.
In the final step, all unnecessary spaces are deleted to bring the string back to shape.
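The same idea can be sketched in Python for comparison. Note that this is a slightly different technique from the Stata code: it replaces each match directly with a backreference instead of extracting the duplicate and calling subinstr, and it uses \b where the Stata pattern uses (\W|^). The function name remove_repeats is hypothetical.

```python
import re

# A phrase starting at a word boundary, followed by whitespace and the same phrase.
REPEAT = re.compile(r"\b(.+)\s+\1\b")


def remove_repeats(text):
    """Collapse immediately repeated words or phrases, looping until nothing changes."""
    previous = None
    while previous != text:
        previous = text
        text = REPEAT.sub(r"\1", text)
    # Squeeze runs of whitespace and trim, like strtrim(stritrim()).
    return " ".join(text.split())


samples = [
    "Pearly Spencer how do you know how do you know what it is?",
    "it was during the during the past thirty days",
    "well well I would hope I would hope that they're doing that",
    "well well they're doing that I would hope I would hope ",
    "well well I would hope I would hope that they're doing that but but they don't",
]
for sample in samples:
    print(remove_repeats(sample))
```

Like the Stata version, this also collapses legitimate repetitions (for example "that that"), so it is worth spot-checking the results on real data.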

Removing unmatched text and building a table with the remaining matches

I have 30000 lines that look like the one below.
342800005013000 CON N GORE PT LOT 31 RP 11R2284 PT PART 1 RP 11R4541 PT PART 2
I would like to capture the 15 digit number at the beginning and any "11R***" numbers.
In Notepad++ I've used \d{15}|(11R\d*)* to match everything that I want. Ultimately I would like to get all the matched results into Excel. What would be the best way to do so?
Thanks for your help.
You could try this one
(^[0-9]*)|(11R[0-9A-Za-z]*)
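If a short script is an option, one way to get the matches into something Excel can open is to write them out as CSV. The sketch below is only one possible approach, assuming Python is available; the pattern is a tightened variant of the ones above, and the in-memory buffer stands in for a real output file.

```python
import csv
import io
import re

# 15 digits at the start of the line, or any token beginning with 11R.
PATTERN = re.compile(r"^\d{15}|11R\w*")


def extract_matches(line):
    """Return the leading 15-digit number followed by every 11R... token."""
    return PATTERN.findall(line)


lines = [
    "342800005013000 CON N GORE PT LOT 31 RP 11R2284 PT PART 1 RP 11R4541 PT PART 2",
]
rows = [extract_matches(line) for line in lines]

# Write one CSV row per input line; swap the buffer for a file opened
# with newline="" to produce a file that Excel can open.
buffer = io.StringIO()
csv.writer(buffer).writerows(rows)
print(buffer.getvalue())
```

Each input line becomes one row, with the 15-digit number in the first column and the 11R values in the columns after it.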

Regex for getting just name of street and number from messy address

I have this list of messy addresses, some are clean some aren't:
Av. Chorrillos # 1759 Local 1082 Exterior Jumbo
Av. Balmaceda N° 2355 Local BS - 121 / Subterráneo sector servicios
Tarapaca N° 729
The structure is usually name of street + N°|#|nothing + number + extra stuff
I'd like to erase this extra stuff so that the expected output from the above list is:
Av. Chorrillos # 1759
Av. Balmaceda N° 2355
Tarapaca N° 729
I tried using a combination of letters and lookback:
([a-zA-Z\s]+\d+)
But the # and N° gave me trouble, so I tried also including them
([(\w|°|#)\s]+\d+)
but still no luck.
I know regex on addresses is a nightmare, but any regex that fits those three cases above would fit 95% of my list, which is good enough for me!
I'm using this with python regex in case that matters.
You can find the list of addresses and my regex attempt on regex101
(Some addresses have extra info BEFORE the relevant information of street + number, but I'm fine with screwing up those)
Based on your specifications, I came up with this regex.
Regex: ^.*?(?:[N°#Nº]\s*)?\d+
Explanation:
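^.*? lazily consumes characters from the beginning of the string, taking only as many as are needed for the rest of the pattern, (?:[N°#Nº]\s*)?\d+, to match.
(?:[N°#Nº]\s*)? optionally matches one character from the set N, °, #, º, followed by zero or more whitespace characters.
\d+ matches one or more digits, i.e. the street number.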
Regex101 Demo
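Since the question mentions Python, here is a minimal sketch of applying that regex with the re module to the three sample addresses. The function name street_and_number is hypothetical, and falling back to the original string when nothing matches is an assumption, not something the question specifies.

```python
import re

# The regex from the answer, unchanged.
ADDRESS = re.compile(r"^.*?(?:[N°#Nº]\s*)?\d+")


def street_and_number(address):
    """Return everything up to and including the first number, or the input if none."""
    match = ADDRESS.match(address)
    return match.group(0) if match else address


addresses = [
    "Av. Chorrillos # 1759 Local 1082 Exterior Jumbo",
    "Av. Balmaceda N° 2355 Local BS - 121 / Subterráneo sector servicios",
    "Tarapaca N° 729",
]
trimmed = [street_and_number(a) for a in addresses]
for line in trimmed:
    print(line)
```

Because the match always stops at the first run of digits, addresses whose street name itself contains a number will be cut short.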

substitution with eval and repeat the character by grouping string length?

My input is as follows:
my $s = '<B>Estimated:</B>
The N-terminal of the sequence considered is M (Met).
The estimated half-life is: 30 hours (mammalian reticulocytes, in vitro).
>20 hours (yeast, in vivo).
>10 hours (Escherichia coli, in vivo).
<B>Instability index:</B>
The instability index (II) is computed to be 31.98
This classifies the protein as stable.';
I want to remove the <B></B> tags from the string and underline the text that was inside them.
My expected output is:
Estimated:
----------
The N-terminal of the sequence considered is M (Met).
The estimated half-life is: 30 hours (mammalian reticulocytes, in vitro).
>20 hours (yeast, in vivo).
>10 hours (Escherichia coli, in vivo).
Instability index:
------------------
The instability index (II) is computed to be 31.98
This classifies the protein as stable.
For this I tried the following regex, but I don't know what the problem is.
$s=~s/<B>(.+?)<\/B>/"$1\n";"-" x length($1)/seg; # $1\n is not working
In the above regex, how do I include the "$1\n"? And how can I use several statements in the substitution, separated by ; or anything else?
How can I fix it?
With the e modifier, the replacement is the value of the last statement evaluated, so
$s=~s/<B>(.+?)<\/B>/"$1\\n";"-" x length($1)/seg;
throws away the "$1\\n" (which should really be "$1\n")
This works:
$s=~s/<B>(.+?)<\/B>/"$1\n" . "-" x length($1)/seg;
The reason I was asking about your Perl version was to assess if it was possible to do what is effectively a variable-length lookbehind with \K:
$s=~s/<B>(.+?)<\/B>\K/ "\n" . "-" x length($1)/seg;
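\K is available in Perl 5.10 and later. Note that \K keeps everything matched before it, including the <B> and </B> tags, so this variant appends the underline but does not strip the tags.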
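For comparison, the same "compute the replacement from the match" idea looks like this in Python, where a function passed to re.sub plays the role of Perl's e modifier. This is a sketch of the equivalent logic, not a translation the answer itself provides; underline_bold is a hypothetical name and the input is shortened from the question.

```python
import re


def underline_bold(text):
    """Replace <B>heading</B> with the heading plus a dashed underline of equal length."""
    def heading_with_rule(match):
        heading = match.group(1)
        # One expression builds the whole replacement: heading, newline, dashes.
        return heading + "\n" + "-" * len(heading)

    # re.DOTALL plays the role of Perl's /s; re.sub replaces every match, like /g.
    return re.sub(r"<B>(.+?)</B>", heading_with_rule, text, flags=re.DOTALL)


s = (
    "<B>Estimated:</B>\n"
    "The N-terminal of the sequence considered is M (Met).\n"
    "<B>Instability index:</B>\n"
    "The instability index (II) is computed to be 31.98\n"
)
result = underline_bold(s)
print(result)
```

As in the working Perl version, the heading and the dashes are joined into a single value, so nothing is thrown away.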