Counting first and last names appearing more than once - stata

I have the following names:
clear
input str25 names
"Trenton Mercer"
"Carissa Moyer"
"Timothy Delgado"
"Kaylynn Payne"
"Harry Patton"
"Charlie Dudley"
"Harry Schmitt"
"Wyatt Hammond"
"Kasen Delgado"
"Katherine Noble"
"Julius Jarvis"
"Harry Carney"
"Wyatt Holden"
"Megan Wilson"
"Priscilla Shaffer"
"Savanah Marshall"
"Harry Delgado"
"Harper Ballard"
"Harry Mcmahon"
"Alejandro Jarvis"
end
How can I identify which first and last names (separately) come up more than once?
I would also like to count how many times these appear.

Pearly's solution (with split as the definitely best choice for the issue) appears reasonable. But there are still some unnecessary contours. For example, generating tag, b1, b2 variables seems not really needed.
And more important, the final output is not thoroughly consistent, with the counting info just in line with seemingly-random order, which is also different from the original one without clear explanation.
Thus, I try to contribute a solution (which must also have defects), just as a way to avoid those issues while still providing the output that you are seeking for.
split names
foreach v in `r(varlist)' {
egen TotalAppear_`v' = total(`v' != ""), by(`v')
egen LastAppear_`v' = max(_n), by(`v')
replace LastAppear_`v' = LastAppear_`v'==_n
list `v' TotalAppear_`v' if LastAppear_`v' == 1 & TotalAppear_`v' >1
}
It should be noted your description leads to assumptions made in my code as well as in Pearly's solution:
Every name has only 2 parts, i.e. first name and last name, so not including any middle name(s).
You just want to compare within each group (each first name among first names, last name among last names), not comparing any one with those from the other group.

Related

Remove columns by name based on pattern

How can I remove a large number of columns by name based on a pattern?
A data set exported from Jira has a ton of extra columns that I've no interest in. 400 Log entries, 50 Comments, dozens of links or attachments. Problem is that they get random numbers assigned which means that removing them with hardcoded column names will not work. That would look like this and break as the numbers change:
= Table.RemoveColumns(#"Previous Step",{"Watchers", "Watchers_10", "Watchers_11", "Watchers_12", "Watchers_13", "Watchers_14", "Watchers_15", "Watchers_16", "Watchers_17", "Watchers_18", "Watchers_19", "Watchers_20", "Watchers_21", "Watchers_22", "Watchers_23", "Watchers_24", "Watchers_25", "Watchers_26", "Watchers_27", "Watchers_28", "Log Work", "Log Work_29", "Log Work_30", "Log Work_31", "Log Work_32", ...
How can I remove a large number of columns by using a pattern in the name? i.e. remove all "Log Work" columns.
The best way I've found is to use List.FindText on Table.ColumnNames to get a list of column names dynamically based on target string:
= Table.RemoveColumns(#"Previous Step", List.FindText(Table.ColumnNames(#"Previous Step"), "Log Work")
This works by first grabbing the full list of Column Names and keeping only the ones that match the search string. That's then sent to RemoveColumns as normal.
Limitation appears to be that FindText doesn't offer complex pattern matching.
Of course, when you want to remove a lot of different patterns, having individual steps isn't very interesting. A way to combine this is to use List.Combine to join the resulting column names together.
That becomes:
= Table.RemoveColumns(L, List.Combine({ List.FindText(Table.ColumnNames(L), "Watchers_"), List.FindText(Table.ColumnNames(L), "Log Work"), List.FindText(Table.ColumnNames(L), "Comment"), List.FindText(Table.ColumnNames(L), "issue link"), List.FindText(Table.ColumnNames(L), "Attachment")} ))
SO what's actually written there is:
Table.RemoveColumns(PreviousStep, List.Combine({ foundList1, foundlist2, ... }))
Note the { } that signifies a list! You need to use this as List.Combine only accepts a single argument which is itself already a List of lists. And the Combine call is required here.
Also note the L here instead of #"Previous Step". That's used to make the entire thing more readable. Achieved by inserting a step named "L" that just has = #"Promoted Headers".
This allows relatively maintainable removal of multiple columns by name, but it's far from perfect.

Googlesheet IF with multiple cases

in my Google sheet table I have the first list with summary of invoices which are then separated to 4 lists according to parameters (manually). I need to know about all invoices from the first list, on which category/list they are.
So for example - lists: Alphabet, abc, def, mno, xyz. In Alphabet is column "list".
How to write function which found invoice on another list according to ID (column B) from Alphabet and write name of the correct list to column "list". I tried to write this function using IF, match, etc. But I still don't have solution. Can you help me please? Sorry for my English :-)
So here is an example which you could adapt. In columns E:H on the first sheet (and I could hide these columns later, starting in row2 and dragging down as needed, I put the following formulas:
=IF(LEN(iferror(query(abc!$A$2:$A,"select A where A='" & $A2 &"'"),""))>0,"abc","")
=IF(LEN(iferror(query(def!$A$2:$A,"select A where A='" & $A2 &"'"),""))>0,"def","")
=IF(LEN(iferror(query(mno!$A$2:$A,"select A where A='" & $A2 &"'"),""))>0,"mno","")
=IF(LEN(iferror(query(xyz!$A$2:$A,"select A where A='" & $A2 &"'"),""))>0,"xyz","")
Probably I could have simplified a little by putting the sheet names in E1:H1, but you get the idea.
Each of these looks for the ID. If the query succeeds, it returns the name of the sheet. If it fails, it returns the empty string.
Now in column B where I actually want the results, I put this formula in B2 and drag to copy as needed.
=if(E2&F2&G2&H2="","nowhere",E2&F2&G2&H2)
It says put those strings together, and if there is nothing there say nowhere, otherwise say the list. If it appears on more than one, and that can really happen, you could use JOIN instead.

giving a string variable values conditional on another variable

I am using Stata 14. I have US states and corresponding regions as integer.
I want create a string variable that represents the region for each observation.
Currently my code is
gen div_name = "A"
replace div_name = "New England" if div_no == 1
replace div_name = "Middle Atlantic" if div_no == 2
.
.
replace div_name = "Pacific" if div_no == 9
..so it is a really long code.
I was wondering if there is a shorter way to do this where I can automate assigning values rather than manually hard coding them.
You can define value labels in one line with label define and then use decode to create the string variable. See the help for those commands.
If the correspondence was defined in a separate dataset you could use merge. See e.g. this FAQ
There can't be a short-cut here other than typing all the names at some point or exploiting the fact that someone else typed them earlier into a file.
With nine or so labels, typing them yourself is quickest.
Note that you type one statement more than you need, even doing it the long way, as you could start
gen div_name = "New England" if div_no == 1

Searching column with string for wildcard match

I can't seem to figure out the VLOOKUP magic needed to make this work as I want it to.
See, what I've got is a column B containing filenames, like this:
[COLUMN B]
./11001 Boogie Oogie Oogie (A Taste Of Honey).wav
./11001 Rescue Me (A Taste Of Honey).wav
./11001 Sukiyaki (A Taste Of Honey).wav
./11002 Memory (Acker Bilk).wav
./11002 Stuck On You (Acker Bilk).wav
./11002 Could I Have This Dance (Acker Bilk).wav
./11002 Do That To Me One More Time (Acker Bilk).wav
./11002 This Masquerade (Acker Bilk).wav
./11002 Just Once (Acker Bilk).wav
And so on for 6220 entries.
I have another column, Column E, which contains a TRACK NAME which is present within the filename. Looks like this:
American Patrol
Artistry In Rhythm
Begin The Beguine
Big John's Special
Cherokee
For example. So what I want to do is, in another column I want to search through Column B using the strings from Column E and then returning the matched string from Column B.
So if we imagine I put this formula in the C Column starting in the same row as the American Patrol track name, it would search through the range in Column B and return this:
./11249 American Patrol (BBC Big Band).wav
./11249 Artistry In Rhythm (BBC Big Band).wav
./11249 Begin The Beguine (BBC Big Band).wav
And so on.
I tried doing this formula
=VLOOKUP(E2;B2:B6235;2;TRUE)
So, this returns a file name, but it seems to have matched all the filenames and are just returning whichever result I specify in the col_index variable, so now it returns the second match (basically, just the second row in Column B) and if I put a 3 instead, it would just return the third hit, again having matched all the file names, it seems..
I'm not that familiar with Excel functions, so I'm not sure where to look for the solution beyond this.
You should not be using TRUE as a VLOOKUP function's range_lookup parameter on unsorted data. You can, however, wrap your track title in wildcards to achieve the search you are looking for.
      
The formula in C1 is,
=INDEX(B:B, MATCH("*"&E1&"*",B:B, 0))
... or,
=VLOOKUP("*"&E1&"*",B:B, 1, FALSE)
They accomplish the same thing.

Stata: Efficient way to replace numerical values with string values

I have code that currently looks like this:
replace fname = "JACK" if id==103
replace lname = "MARTIN" if id==103
replace fname = "MICHAEL" if id==104
replace lname = "JOHNSON" if id==104
And it goes on for multiple pages like this, replacing an ID name with a first and last name string. I was wondering if there is a more efficient way to do this en masse, perhaps by using the recode command?
I will echo the other answers that suggest a merge is the best way to do this.
But if you absolutely must code the lines item-wise (again, messy) you can generate a long list ("pages") of replace commands by using MS Excel to "help" you write the code. Here is a picture of your Excel sheet with one example, showing the MS Excel formula:
columns:
A B C D
row: 1 last first id code
2 MARTIN JACK 103 ="replace fname=^"&B2&"^ if id=="&C2
You type that in, make sure it looks like Stata code when the formula calculates (aside from the carets), and copy the formula in column D down to the end of your list. Then copy the whole block of Stata code in column D generated by the formulas into your do-file, and do a find and replace (be careful here if you are using the caret elsewhere for mathematical uses!!) for all ^ to be replaced with ", which will end up generating proper Stata syntax.
(This is truly a brute force way of doing this, and is less dynamic in the case that there are subsequent changes to your generation list. All--apologies in advance for answering a question here advocating use of Excel :) )
You don't explain where the strings you want to add come from, but what is generally the best technique is explained at
http://www.stata.com/support/faqs/data-management/group-characteristics-for-subsets/index.html
Create an associative array of ids vs Fname,Lname
103 => JACK,MARTIN
104 => MICHAEL,JOHNSON
...
Replace
id => hash{id} ( fname & lname )
The efficiency of doing this will be taken care by the programming language used