I have several string variables that I would like to combine into a single comma-separated string variable. When I use egen concat with the punct(", ") option, I get trailing commas whenever a row has missing entries, which is common in my data.
I thought that I could remove the trailing commas with regexr() in a loop, but my concatenated string variable doesn't change.
How do I get this regex to match in Stata? (Or maybe I'm on totally the wrong path.)
clear
input str5 name1 str5 name2 str5 name3
Tom Dick Harry
Tom "" ""
end
ds name*
local n: word count `r(varlist)'
display `n'
egen names = concat(name*), punct(", ")
generate names2 = names
forvalues i = 1/`n' {
replace names2 = regexr(names2, ",.$", "")
}
list
This provides:
. list
+-------------------------------------------------------------+
| name1 name2 name3 names names2 |
|-------------------------------------------------------------|
1. | Tom Dick Harry Tom, Dick, Harry Tom, Dick, Harry |
2. | Tom Tom, , Tom, , |
+-------------------------------------------------------------+
egen's concat() function just implements a loop. You can write your own instead:
* Start from the first variable, then append each later non-missing name
gen names = name1
forval j = 2/3 {
    replace names = cond(mi(names), name`j', names + ", " + name`j') if !mi(name`j')
}
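For readers who want to see the idea outside Stata, here is a minimal sketch of the same logic in Python (not the thread's language): keep only the non-missing entries in a row and join those, so no trailing separator can appear. The function name `join_nonmissing` is hypothetical, and missing values are assumed to be empty strings, as in the example data.

```python
def join_nonmissing(values, sep=", "):
    """Join the non-empty strings in `values`, skipping missing ("") entries."""
    return sep.join(v for v in values if v != "")


# Rows mirror the example data in the question (assumed: "" means missing).
rows = [["Tom", "Dick", "Harry"], ["Tom", "", ""]]
names = [join_nonmissing(row) for row in rows]
print(names)
```

Because the separator is only ever placed between two retained values, there is nothing to clean up afterwards with a regular expression.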
Does something like this work for your data?
clear
input str5 name1 str5 name2 str5 name3 str5 name4
Tom Dick Harry Hank
Tom "" "" Hank
Tom "" Harry Hank
Tom "" "" ""
end
list
egen names = concat(name*), punct(" ")
gen names2 = subinstr(trim(itrim(names)), " ", ", ", .)
list
If your string variables have spaces, e.g. "Hank and Gloria", that will fail.
I want to generate a dummy variable that is 1 if there is any match between two variables. These two variables are generated by egen concat, and each contains a group of languages used in a country.
For example, var1 has the value apc apc apc apc and var2 has the value apc, or var1 is apc fra nya and var2 is apc. In either case, neither fndmtch2 nor egen anymatch gives me 1. Is there any way I can get 1 for each case?
Your data example can be simplified to
sysuse auto
egen var1 = concat(mpg foreign), punct(" ")
egen var2 = concat(trunk foreign), punct(" ")
as mapping to string in this instance is not needed for mpg trunk any more than it was needed for foreign. concat() maps to string on the fly, and the only issues with numeric variables (neither applying here) are if fractional parts are present or you want to see value labels.
Now that it is confirmed that multiple words can be present, we can work with a slightly more interesting example.
Here are two methods. One is to loop over the words in one variable and also the words in the other variable to check if there are any matches.
Stata's definition of a word here is that words are delimited by spaces. That being so, we can check for the occurrence of " word " within " variable ", where the leading and trailing spaces are needed because in say "frog toad newt" neither "frog" nor "newt" occurs with both leading and trailing spaces. In the OP's example the check may not be needed, but it often is, just as a search for "1" or "2" or "3" finds any of those within "11 12 13", which is wrong if you seek any as a word and not as a single character.
More is said on search for words within strings in a paper in press at the Stata Journal and likely to appear in 22(4) 2022.
* Example generated by -dataex-. For more info, type help dataex
clear
input str8 var1 str5 var2
"FR DE" "FR"
"FR DE GB" "GB"
"GB" "FR"
"IT FR" "GB DE"
end
gen wc = wordcount(var1)
su wc, meanonly
local max1 = r(max)
replace wc = wordcount(var2)
su wc, meanonly
local max2 = r(max)
drop wc
gen match = 0
quietly forval i = 1/`max1' {
forval j = 1/`max2' {
replace match = 1 if word(var1, `i') == word(var2, `j') & word(var1, `i') != ""
}
}
gen MATCH = 0
forval i = 1/`max1' {
replace MATCH = 1 if strpos(" " + var2 + " ", " " + word(var1, `i') + " ")
}
list
+----------------------------------+
| var1 var2 match MATCH |
|----------------------------------|
1. | FR DE FR 1 1 |
2. | FR DE GB GB 1 1 |
3. | GB FR 0 0 |
4. | IT FR GB DE 0 0 |
+----------------------------------+
EDIT
replace MATCH = 1 if strpos(" " + var2 + " ", " " + word(var1, `i') + " ") & !missing(var1, var2)
is better code to avoid the uninteresting match of " " with " ".
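As a language-neutral illustration of the padded-space check described above, here is a minimal sketch in Python (not Stata). The helper names `has_word` and `any_shared_word` are hypothetical, and words are assumed to be separated by single spaces, as in the example data.

```python
def has_word(text, word):
    """True if `word` occurs in `text` as a whole space-delimited word."""
    if not text or not word:
        # Guard against the uninteresting match of " " with " ".
        return False
    # Padding both sides lets the first and last words match too.
    return f" {word} " in f" {text} "


def any_shared_word(a, b):
    """True if any space-delimited word of `a` is also a word of `b`."""
    return any(has_word(b, w) for w in a.split())


pairs = [("FR DE", "FR"), ("FR DE GB", "GB"), ("GB", "FR"), ("IT FR", "GB DE")]
matches = [int(any_shared_word(a, b)) for a, b in pairs]
print(matches)
```

Searching for "1" inside "11 12 13" with this helper returns False, which is the behaviour wanted when the target is a word rather than a single character.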
I have some really messed-up names from a system, and I'm trying to match them to first and last names in AD. I just need to parse the strings. I have names such as:
Hagstrom, N.P., Ana (Analise)
Banas, R.N., Cynthia
Saltzmann, N.P., April
Lee, Christopher
Rajaram, Pharm.D., Sharmee
Goode Jr, John (Jack) L
Reyes, R.N., Meghan
Miller, M.S., Adrienne M
Chavez, Gabriela
Stevens, MS, CCC-SLP, Christopher
Lockwood Flores, R.N., Jessica
I have tried this, but for some reason, the GivenName isn't being returned properly.
$Name = "Saltzmann, N.P., April"
$GivenName = $Name.Split(",")[$Name.Split(",").GetUpperBound(0)]
$SN = $Name.Split(",")[0]
If ($SN.IndexOf("-") -gt -1) {
$HypenLast = $SN.Split("-")[0]
$SNName = $SN.Split("-")[1]
}
If ($GivenName.IndexOf(" ") -gt -1) {
$GivenName = $GivenName.Replace("(","").Replace(")","").Split(" ")[0]
$MiddleName =$GivenName.Replace("(","").Replace(")","").Split(" ")[1]
}
Trying to take everything before the first comma and everything after last comma, but take letters before the second space of the first name.
Trying to get LastName FirstName but then need to flip it to FirstName LastName. Thanks.
All of the names could be piped to a script block that uses a regex with some named capture groups. The named capture group values can be extracted to rebuild the name you need using string interpolation.
$nameList | ForEach-Object {
$match = [Text.RegularExpressions.Regex]::Match($_, "(?<last>[\w\s]+),(?:.*,)?(?:\s*)(?<first>\w+)")
$lastName = $match.Groups["last"].Value
$firstName = $match.Groups["first"].Value
"$firstName $lastName"
}
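To make the behaviour of that pattern easier to check, here is a rough Python equivalent (not PowerShell). It uses the same regular expression, with Python's `(?P<name>...)` spelling for the named groups. The function name `first_last` is hypothetical, and returning the input unchanged when nothing matches is an assumption, not something the answer specifies.

```python
import re

# Same pattern as in the answer, written with Python's named-group syntax.
NAME_RE = re.compile(r"(?P<last>[\w\s]+),(?:.*,)?\s*(?P<first>\w+)")


def first_last(raw):
    """Return 'First Last' from 'Last, [titles,] First ...' (assumed layout)."""
    m = NAME_RE.search(raw)
    if m is None:
        return raw  # assumption: leave unparseable names untouched
    return f"{m.group('first')} {m.group('last').strip()}"


samples = [
    "Hagstrom, N.P., Ana (Analise)",
    "Lee, Christopher",
    "Goode Jr, John (Jack) L",
    "Stevens, MS, CCC-SLP, Christopher",
    "Lockwood Flores, R.N., Jessica",
]
flipped = [first_last(s) for s in samples]
print(flipped)
```

The optional `(?:.*,)?` group is greedy, so it swallows everything up to the last comma. That is why any number of title fields between the surname and the given name is skipped.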
Update: the first version of this question was implicitly asking how to extract a substring if it has ANY match in another vector, for which @Colonel Beauvel provided an elegant response:
This does the trick, base R:
newname = sapply(nametitle, function(u){
bool = sapply(name, function(x) grepl(x, u))
if(any(bool)) name[bool][1] else NA })
newname
John Smith, MD PhD Jane Doe, JD
"John" "Jane"
However, I did not realize that I was actually asking for a way to find exact matches until the function kindly contributed did not work for all elements in my vector. Therefore, the following is my revised question.
Say I have the following character vector of generic names and their academic degrees:
nametitle <- c("John Smith, MD PhD", "Jane Doe, JD", "John-Paul Jones, MS")
And I have a "look-up" vector of first names:
name <- c("John", "Jane", "Mark", "Steve")
What I want to do is search each element of nametitle, and if part of the element (i.e., a substring of each string) is an exact match of an element from name, then in a new vector newname, write that element of nametitle with the corresponding element of name, or if there is no exact match, write the original value from nametitle.
Therefore, what I'd expect the proper function to do is return newname with the three elements below:
[1] "John" [2] "Jane" [3] "John-Paul Jones, MS"
I've attempted the following using the function contributed above:
newname = sapply(nametitle, function(u){
bool = sapply(name, function(x) grepl(x, u))
if(any(bool)) name[bool][1] else NA })
Which performs just fine for elements "John Smith, MD PhD" and "Jane Doe, JD", but not for "John-Paul Jones, MS" -- this element is replaced with "John" in the new vector newname.
There may be a simple change that can be made to the original function contributed by @Colonel Beauvel to resolve this issue, but using nested sapply functions is throwing me through a loop (pun intended?). Thanks.
This does the trick, base R:
newname = sapply(nametitle, function(u){
bool = sapply(name, function(x) grepl(x, u))
if(any(bool)) name[bool][1] else NA
})
#>newname
#John Smith, MD PhD Jane Doe, JD
# "John" "Jane"
Here's an easy way. First, create a regex pattern based on your name vector:
pattern <- paste0(".*(?<=\\s|^)(", paste(name, collapse = "|"), ")(?=\\s|$).*")
# [1] ".*(?<=\\s|^)(John|Jane|Mark|Steve)(?=\\s|$).*"
If you use this pattern, a single sub command will do the trick:
sub(pattern, "\\1", nametitle, perl = TRUE)
# [1] "John" "Jane" "John-Paul Jones, MS"
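The same whole-word idea can be sketched in Python (not R). One swap is needed: Python's `re` module rejects the variable-width lookbehind `(?<=\s|^)`, so this sketch uses the negative lookarounds `(?<!\S)` and `(?!\S)`, meaning "not preceded by" and "not followed by" a non-space character. The function name `first_name_or_original` is hypothetical. Unlike the greedy `.*` in the pattern above, it returns the first whole-word match in each string.

```python
import re

name = ["John", "Jane", "Mark", "Steve"]
nametitle = ["John Smith, MD PhD", "Jane Doe, JD", "John-Paul Jones, MS"]

# A name counts only if whitespace or a string edge surrounds it.
alternatives = "|".join(re.escape(n) for n in name)
pattern = re.compile(rf"(?<!\S)({alternatives})(?!\S)")


def first_name_or_original(s):
    """Return the matched look-up name, or `s` itself if there is none."""
    m = pattern.search(s)
    return m.group(1) if m else s


newname = [first_name_or_original(s) for s in nametitle]
print(newname)
```

"John-Paul" fails the check because the hyphen after "John" is a non-space character, so the third element is returned as it was.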
What is the best way to convert an array of values in ColdFusion such as
[ Fed Jones, John Smith, George King, Wilma Abby]
to a list where the last comma is replaced by "or":
Fed Jones, John Smith, George King or Wilma Abby
I thought REReplace might work but haven't found the right expression yet.
If you've got an array, combining the last element with an ArrayToList is the simplest way (as per Henry's answer).
If you've got it as a string, using rereplace is a valid method, and would work like so:
<cfset Names = rereplace( Names , ',(?=[^,]+$)' , ' or ' ) />
Which says match a comma, then check (without matching) that there are no more commas until the end of the string (which of course will only apply for the last comma, and it will thus be replaced).
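The lookahead can be tried out with a short Python sketch (not ColdFusion). The pattern is the same. The replacement here is " or" with no trailing space, on the assumption that the input string already has a space after every comma. The function name `last_comma_to_or` is hypothetical.

```python
import re


def last_comma_to_or(s):
    """Replace only the final comma in `s` with ' or'.

    The lookahead (?=[^,]+$) succeeds only when no further comma
    appears before the end of the string, i.e. at the last comma.
    """
    return re.sub(r",(?=[^,]+$)", " or", s)


names = "Fed Jones, John Smith, George King, Wilma Abby"
result = last_comma_to_or(names)
print(result)
```

A string with no comma at all is returned unchanged, because the pattern never matches.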
It'd be easier to manipulate in the array level first, before converting into a list.
names = ["Fed Jones", "John Smith", "George King", "Wilma Abby"];
lastIndex = arrayLen(names);
last = names[lastIndex];
arrayDeleteAt(names, lastIndex);
result = arrayToList(names, ", ") & " or " & last;
// result == "Fed Jones, John Smith, George King or Wilma Abby"
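The array-level approach looks much the same in other languages. Here is a minimal Python sketch (not ColdFusion). The function name `join_with_or` is hypothetical, and the handling of empty and one-element arrays is an added assumption that the snippet above does not cover.

```python
def join_with_or(names):
    """Join names with ', ' but put ' or ' before the last one."""
    if not names:
        return ""          # assumption: empty input gives an empty string
    if len(names) == 1:
        return names[0]    # assumption: a single name needs no 'or'
    return ", ".join(names[:-1]) + " or " + names[-1]


names = ["Fed Jones", "John Smith", "George King", "Wilma Abby"]
result = join_with_or(names)
print(result)
```

Slicing instead of deleting the last element leaves the original list unmodified, which may matter if the array is reused later.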
Another option is to work with the list as a string, using listLast and the Java lastIndexOf() method on the result string.
<cfscript>
names = ["Fed Jones", "John Smith", "George King", "Wilma Abby"];
result = arraytoList(names,', ');
last = listLast(result);
result = listLen(result) gt 1 ? mid(result, 1, result.lastIndexOf(',')) & ' or' & last : result;
</cfscript>
<cfoutput>#result#</cfoutput>
Result:
Fed Jones, John Smith, George King or Wilma Abby
I have a comma-separated list of first and last names which I need to convert to SQL
(whitespace exists after the comma):
joe, cool
alice, parker
etc.
should become:
( firstname ='joe' and lastname = 'cool' ) or
( firstname ='alice' and lastname = 'parker' )
How can I achieve this with a regular expression?
In Perl you can do this:
s/(\S+),\s*(\S+)/( firstname ='\1' and lastname = '\2' )/
From the command line:
> perl -pe "s/(\S+),\s*(\S+)/( firstname ='\1' and lastname = '\2' )/" input.txt
Input:
joe, cool
alice, parker
Output:
( firstname ='joe' and lastname = 'cool' )
( firstname ='alice' and lastname = 'parker' )
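The output above lacks the "or" that the question asks for between clauses. A minimal sketch in Python (not Perl) that uses the same capture-group pattern and then joins the clauses with " or" could look like this. The function name `to_where_clause` is hypothetical. The values are pasted into the SQL text as-is, so the sketch assumes the names contain no quotes.

```python
import re

# Same pattern as the Perl substitution: first name, comma, last name.
LINE_RE = re.compile(r"(\S+),\s*(\S+)")


def to_where_clause(lines):
    """Build '( firstname =... ) or ( ... )' from 'first, last' lines."""
    clauses = []
    for line in lines:
        m = LINE_RE.search(line)
        if m:  # skip lines that do not look like 'first, last'
            first, last = m.group(1), m.group(2)
            clauses.append(f"( firstname ='{first}' and lastname = '{last}' )")
    # Joining puts ' or' between clauses but not after the last one.
    return " or\n".join(clauses)


sql = to_where_clause(["joe, cool", "alice, parker"])
print(sql)
```

Joining the clauses at the end, rather than appending " or" inside the substitution, avoids a dangling "or" after the final line.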