Find most frequent word in string variables - stata

I have a string variable with different colors:
gen cols="red green red red blue maroon green pink"
I want to find which color in this list appears most frequently.
I tried the count command but this produces wrong results.

There is a community-contributed command that does this in one step: tabsplit, from the tab_chi package on SSC, is designed for this purpose.
clear
input strL (colors numbers)
"red green red red blue maroon green pink" "87 45 65 87 98 12 90 43"
end
tabsplit colors, sort
colors | Freq. Percent Cum.
------------+-----------------------------------
red | 3 37.50 37.50
green | 2 25.00 62.50
blue | 1 12.50 75.00
maroon | 1 12.50 87.50
pink | 1 12.50 100.00
------------+-----------------------------------
Total | 8 100.00
tabsplit numbers, sort
numbers | Freq. Percent Cum.
------------+-----------------------------------
87 | 2 25.00 25.00
12 | 1 12.50 37.50
43 | 1 12.50 50.00
45 | 1 12.50 62.50
65 | 1 12.50 75.00
90 | 1 12.50 87.50
98 | 1 12.50 100.00
------------+-----------------------------------
Total | 8 100.00
EDIT As documented in its help, tabsplit allows options of tabulate as appropriate, including those for saving results. However, that is not especially helpful here as matrow() won't work for string variables. That isn't documented directly but follows from the principle that Stata matrices can't hold strings. matcell() does work here, but knowing the frequencies alone is not especially helpful. The overarching principle is that for many questions involving words within strings a structure with single words in each value of a string variable is much easier to work with.
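For readers who want to cross-check the tabulation outside Stata, the same "one word per value" idea can be sketched in Python. This is only an illustration, not part of tabsplit; the helper name most_frequent_words is hypothetical.

```python
from collections import Counter

def most_frequent_words(text):
    """Return (words, count) for the most frequent whitespace-separated word(s)."""
    # Splitting first gives one word per element, mirroring the
    # "single word in each value" structure recommended above.
    counts = Counter(text.split())
    if not counts:
        return [], 0
    top = max(counts.values())
    # Keep every word that reaches the top count, so ties are not hidden.
    return [word for word, n in counts.items() if n == top], top

cols = "red green red red blue maroon green pink"
print(most_frequent_words(cols))
```

Returning a list rather than a single word is a deliberate choice: a string can have several equally frequent words.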

Not understanding group/value/capture attributes of Powershell object matches method

Because of my lack of understanding of PowerShell objects, my question may not be worded accurately. From the PowerShell 7.3 ForEach-Object documentation, I gather that I am using a script block and the PowerShell automatic variable $_, but that is about as far as those docs relate to my example.
I'm trying to access each of two parts of a collection of name/address listings from a text file: the first three listings (001 - 003) or the second three (004 - 006).
Using $regexListings and $testListings, I have tested that I can access the first three or second three listings using references to the capture groups, e.g. $1 and $2. See this example running here: regex101
When I run the following Powershell code:
$regexListings = '(?s)(001.*?003.*?$)|(004.*?006.*?$)'
$testListings =
'001 AALTON Alan 25 Every Street
002 BROWN James 101 Browns Road
003 BROWN Jemmima 101 Browns Road
004 BROWN John 101 Browns Road
005 CAMPBELL Colin 57 Camp Avenue
006 DONNAGAN Dolores 11 Main Road'
$testListings | Select-String -AllMatches -Pattern $regexListings | ForEach-Object {$_.Matches}
Output is:
Groups : {0, 1, 2}
Success : True
Name : 0
Captures : {0}
Index : 0
Length : 204
Value : 001 AALTON Alan 25 Every Street
002 BROWN James 101 Browns Road
003 BROWN Jemmima 101 Browns Road
004 BROWN John 101 Browns Road
005 CAMPBELL Colin 57 Camp Avenue
006 DONNAGAN Dolores 11 Main Road
ValueSpan :
My interpretation of the Powershell output is:
there are 3 match groups?
no captures available
the value is all of it?
Why does the Powershell script output Captures {0} when the link page (regex101) above describes two capture groups which I can access?
The documentation Groups, Captures, and Substitutions is helpful but doesn't address this kind of issue. I have gone on using trial & error examples like:
ForEach-Object {$_.Matches.Groups}
ForEach-Object {$_.Matches.Captures}
ForEach-Object {$_.Matches.Value}
And I'm still none the wiser.
Information overflow. What's being output is what's relevant to us, the administrators. Capture group 0 is the entire value since $regexListings indeed matches the entire string. This is where PowerShell attempts to be helpful with its rich type system and displays what we may find useful, although this may just be how the creators of the cmdlet implemented it. So you were on the right track with $_.Matches.Groups, which should have exposed the capture groups and their values for the regex match.
If you're looking to access those values, as mentioned above, you'd have to iterate over .Matches.Groups within that ForEach-Object. What you're passing to that cmdlet isn't the individual captures, but rather the match for the expression as a whole. This is why you're better off saving to a variable and indexing into the groups, such as $var.Matches.Groups[0] or $var.Matches.Groups[1], etc. You can also use the automatic variable $Matches to clear up some of the confusion: it's populated by the -match operator, and you can index into the captures with $Matches[n]. Using your same example:
$regexListings = '(?s)(001.*?003.*?$)|(004.*?006.*?$)'
$testListings =
'001 AALTON Alan 25 Every Street
002 BROWN James 101 Browns Road
003 BROWN Jemmima 101 Browns Road
004 BROWN John 101 Browns Road
005 CAMPBELL Colin 57 Camp Avenue
006 DONNAGAN Dolores 11 Main Road'
$testListings -match $regexListings
$Matches
Which outputs:
True # this is output by -match letting you know it's succeeded in matching.
Name Value
---- -----
1 001 AALTON Alan 25 Every Street ...
0 001 AALTON Alan 25 Every Street ...
Now you have a hashtable that gives a clearer picture of what the pattern matched.
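The relationship between group 0 and the numbered groups can also be checked outside PowerShell. The sketch below uses Python's re module, which is a different engine from .NET's but treats this particular pattern the same way; the variable names are illustrative only.

```python
import re

test_listings = "\n".join([
    "001 AALTON Alan 25 Every Street",
    "002 BROWN James 101 Browns Road",
    "003 BROWN Jemmima 101 Browns Road",
    "004 BROWN John 101 Browns Road",
    "005 CAMPBELL Colin 57 Camp Avenue",
    "006 DONNAGAN Dolores 11 Main Road",
])

# Same pattern as in the question. With (?s) only, "$" anchors at the end
# of the whole string, so the first alternative runs to the very end.
pattern = re.compile(r"(?s)(001.*?003.*?$)|(004.*?006.*?$)")
m = pattern.search(test_listings)

whole = m.group(0)   # group 0: the entire match
first = m.group(1)   # group 1: the alternative that took part in the match
second = m.group(2)  # group 2: never tried, so Python reports None

print(whole == test_listings)
print(first == whole)
print(second is None)
```

Because the first alternative already consumes everything, the second alternative never gets a chance to match, which is why only entries 0 and 1 show up in $Matches.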
In order to access each of two parts of the listings I needed to be able to see them in the output using:
$regexListings = '(?ms)(001.*?003.*?$)|(004.*?006.*?$)'
$testListings | Select-String -AllMatches -Pattern $regexListings | ForEach-Object {$_.Matches.Captures}
Groups : {0, 1, 2}
Success : True
Name : 0
Captures : {0}
Index : 0
Length : 102
Value : 001 AALTON Alan 25 Every Street
002 BROWN James 101 Browns Road
003 BROWN Jemmima 101 Browns Road
ValueSpan :
Groups : {0, 1, 2}
Success : True
Name : 0
Captures : {0}
Index : 103
Length : 101
Value : 004 BROWN John 101 Browns Road
005 CAMPBELL Colin 57 Camp Avenue
006 DONNAGAN Dolores 11 Main Road
ValueSpan :
The differences from the code in the question are:
using multi line modifier (?ms) instead of (?s) in the regex
using {$_.Matches.Captures} as the regex contains capture grouping
These captures can be accessed by assigning the result to a variable and then indexing, e.g.:
$result = $testListings | Select-String -AllMatches -Pattern $regexListings | ForEach-Object {$_.Matches.Captures}
$result[1]
$result[0]
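The effect of adding the multiline modifier can be reproduced in a small Python sketch. It assumes Python's re module as a stand-in for the .NET engine; for this pattern the two agree that (?m) lets $ match at the end of each line, which is what splits the text into two matches. Offsets and lengths may differ from the PowerShell output above depending on line endings.

```python
import re

test_listings = "\n".join([
    "001 AALTON Alan 25 Every Street",
    "002 BROWN James 101 Browns Road",
    "003 BROWN Jemmima 101 Browns Road",
    "004 BROWN John 101 Browns Road",
    "005 CAMPBELL Colin 57 Camp Avenue",
    "006 DONNAGAN Dolores 11 Main Road",
])

# (?m) makes "$" match at every line end, so the lazy ".*?" after 003
# stops at the end of that line instead of the end of the string.
pattern = re.compile(r"(?ms)(001.*?003.*?$)|(004.*?006.*?$)")
result = [m.group(0) for m in pattern.finditer(test_listings)]

print(len(result))
print(result[0])
print(result[1])
```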

Matching values in a variable by year

I have the following minimal example:
input str5 name year match1 match2 match3
Alice 2000 . . .
Alice 2000 . . .
Bob 2000 . . .
Carol 2001 0 . .
Alice 2002 0 1 .
Carol 2002 1 0 .
Bob 2003 0 0 1
Bob 2003 0 0 1
end
I have data on name and year, and I want to create binary variables match1, match2, etc., where matchj equals 1 if the name also appears in the data j years earlier. For example, looking at the first observation in Stata, match1 is a binary variable that equals 1 if Alice appears in 1999, match2 is a binary variable that equals 1 if Alice appears in 1998, and so on.
If there is no year prior to that year (in this case there is no 1999 or 1998), the binary variable will be missing.
How can I construct these match variables? Note that I have millions of unique names, and using levelsof name, local(match) fails with a "macro substitution results in line that is too long" error. Also note that a name is sometimes duplicated within a given year, and some names may be missing in a given year.
Thanks for the data example. Here is some technique using rangestat from SSC. I don't understand your rule on which values should be 0 and which missing.
* Example generated by -dataex-. For more info, type help dataex
clear
input str5 name float year
"Alice" 2000
"Alice" 2000
"Alice" 2002
"Bob" 2000
"Bob" 2003
"Bob" 2003
"Carol" 2001
"Carol" 2002
end
gen one = 1
forval j = 1/3 {
    rangestat (max) match`j'=one, int(year -`j' -`j') by(name)
}
drop one
sort name year
list, sepby(year)
+-----------------------------------------+
| name year match1 match2 match3 |
|-----------------------------------------|
1. | Alice 2000 . . . |
2. | Alice 2000 . . . |
|-----------------------------------------|
3. | Alice 2002 . 1 . |
|-----------------------------------------|
4. | Bob 2000 . . . |
|-----------------------------------------|
5. | Bob 2003 . . 1 |
6. | Bob 2003 . . 1 |
|-----------------------------------------|
7. | Carol 2001 . . . |
|-----------------------------------------|
8. | Carol 2002 1 . . |
+-----------------------------------------+
As the original author of levelsof I find it a little melancholy to see it pressed into service where it is of little or no help.
Here is an alternative approach using frames:
keep name year
frame copy default prev
frame prev: duplicates drop
frame prev: rename year myear
gen myear=.
forvalues i=1/3 {
    replace myear = year-`i'
    frlink m:1 name myear, frame(prev) generate(match`i')
    replace match`i' = 1 if match`i'!=.
}
drop myear
Output:
name year match1 match2 match3
1. Alice 2000 . . .
2. Alice 2000 . . .
3. Bob 2000 . . .
4. Carol 2001 . . .
5. Alice 2002 . 1 .
6. Carol 2002 1 . .
7. Bob 2003 . . 1
8. Bob 2003 . . 1
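Both answers amount to a lookup of (name, year - j) among the distinct (name, year) pairs. As a cross-check outside Stata, that logic can be sketched in Python; the function name lag_matches is hypothetical, and, like the answers above, the sketch yields missing (None) rather than 0 when there is no match.

```python
rows = [
    ("Alice", 2000), ("Alice", 2000), ("Alice", 2002),
    ("Bob", 2000), ("Bob", 2003), ("Bob", 2003),
    ("Carol", 2001), ("Carol", 2002),
]

# Distinct (name, year) pairs; duplicates within a year collapse here,
# much as "duplicates drop" does in the frames approach.
seen = set(rows)

def lag_matches(name, year, max_lag=3):
    """Return [match1, ..., match<max_lag>]: 1 if the name appears j years earlier, else None."""
    return [1 if (name, year - j) in seen else None
            for j in range(1, max_lag + 1)]

result = [(name, year, *lag_matches(name, year)) for name, year in rows]
for row in result:
    print(row)
```

A set lookup avoids building any list of all names, which is where levelsof ran into the macro length limit.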

Group together records that share either one of two variables

I am working with Stata and I have a large data set where I need to group records together if they share one of two variables.
For example, take the following three observations:
Observation | matching var1 | matching var2
1 xxx aaa
2 xxx bbb
3 yyy bbb
If I were to group the records by var1, the first two observations will be in the same group and the last observation will be in a separate group. Similarly, if I were to group using var2, observations two and three will be in the same group and observation one will be in a separate group. However, if I were to group the records based on a match on either var1 or var2, all observations will be in the same group.
I would like to create a 'group id' variable that will take the same value across all these records.
Any suggestions on how I should go about it?
The community-contributed command group_twoway (available from SSC) can group observations that match on either of two variables:
ssc install group_twoway
Using an example extended from yours:
clear
input str3(var1 var2)
"xxx" "aaa"
"yyy" "bbb"
"mmm" "ccc"
"nnn" "ccc"
"mmm" "ddd"
"ooo" "ff"
"pp" "eee"
"qq" "ff"
"rr" "u"
"xxx" "bbb"
end
group_twoway var1 var2, generate(group_id)
Result # of obs.
-----------------------------------------
not matched 0
matched 10
-----------------------------------------
list, sepby(group_id) constant
+------------------------+
| var1 var2 group_id |
|------------------------|
1. | xxx aaa 1 |
2. | yyy bbb 1 |
|------------------------|
3. | mmm ccc 2 |
4. | nnn ccc 2 |
5. | mmm ddd 2 |
|------------------------|
6. | ooo ff 3 |
|------------------------|
7. | pp eee 4 |
|------------------------|
8. | qq ff 3 |
|------------------------|
9. | rr u 5 |
|------------------------|
10. | xxx bbb 1 |
+------------------------+
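Grouping on a match in either variable is a connected-components problem, and one common way to solve it is a union-find structure. The Python sketch below illustrates that idea only; it is not a description of how group_twoway is implemented, and numbering groups by first appearance is an assumption that happens to reproduce the ids listed above.

```python
def group_ids(pairs):
    """Assign a group id to each (var1, var2) row; rows sharing either value share an id."""
    parent = {}

    def find(x):
        # Locate the representative of x, shortening the path as we go.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    # Tag values with their variable so "ff" in var1 and var2 stay distinct.
    for v1, v2 in pairs:
        union(("var1", v1), ("var2", v2))

    ids = {}
    out = []
    for v1, _ in pairs:
        root = find(("var1", v1))
        ids.setdefault(root, len(ids) + 1)  # number groups by first appearance
        out.append(ids[root])
    return out

pairs = [("xxx", "aaa"), ("yyy", "bbb"), ("mmm", "ccc"), ("nnn", "ccc"),
         ("mmm", "ddd"), ("ooo", "ff"), ("pp", "eee"), ("qq", "ff"),
         ("rr", "u"), ("xxx", "bbb")]
print(group_ids(pairs))
```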

Right Align Columns in Text File with Sed

I have a file containing a lot of information that I want to put into a specific format, i.e. with a specific number of spaces between the different columns. I can add the same number of spaces to every line, but some of the columns need to be right-aligned, meaning that I might need to add more spaces in some lines. I have no idea how to do this, and awk doesn't seem to work since I have more than two lines to modify.
Here's an example:
I have managed to get a file looking something like this
apple 1 33.413 C cat 10
banana 2 21.564 B horse 356
cherry 3 43.223 D cow 32
pear 4 26.432 A goat 22
raspberry 5 72.639 C eagle 4
watermelon 6 54.436 A fox 976
pumpkin 7 42.654 B mouse 1
peanut 8 36.451 B dog 56
orange 9 57.333 C elephant 32
coconut 10 10.445 A frog 3
blueberry 11 46.435 B camel 446
But I want to get the file in this format
apple 1 33.413 C cat 10
banana 2 21.564 B horse 356
cherry 3 43.223 D cow 32
pear 4 26.432 A goat 22
raspberry 5 72.639 C eagle 4
watermelon 6 54.436 A fox 976
pumpkin 7 42.654 B mouse 1
peanut 8 36.451 B dog 56
orange 9 57.333 C elephant 32
coconut 10 10.445 A frog 3
blueberry 11 46.435 B camel 446
What bash command can I use to right align the second and fifth columns?
You can use awk's printf with whatever field widths you want, like this:
awk '{printf "%-15s%3d%10s%2s%15s %-5d\n", $1, $2, $3, $4, $5, $6}' file
apple            1    33.413 C            cat 10
banana           2    21.564 B          horse 356
cherry           3    43.223 D            cow 32
pear             4    26.432 A           goat 22
raspberry        5    72.639 C          eagle 4
watermelon       6    54.436 A            fox 976
pumpkin          7    42.654 B          mouse 1
peanut           8    36.451 B            dog 56
orange           9    57.333 C       elephant 32
coconut         10    10.445 A           frog 3
blueberry       11    46.435 B          camel 446
Feel free to adjust widths to tweak the output.
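For comparison, the same field widths can be expressed with Python format specifiers. This is only a sketch for readers who want to experiment with the widths outside awk; the function name align is hypothetical, and the widths are copied from the printf format above.

```python
def align(line):
    """Format one six-column line with the same widths as the awk printf above."""
    c1, c2, c3, c4, c5, c6 = line.split()
    # <15: left-align in 15 chars; 3d: right-align integer in 3;
    # >10, >2, >15: right-align strings; <5d: left-align integer in 5.
    return f"{c1:<15}{int(c2):3d}{c3:>10}{c4:>2}{c5:>15} {int(c6):<5d}"

sample = [
    "apple 1 33.413 C cat 10",
    "coconut 10 10.445 A frog 3",
]
for line in sample:
    print(align(line))
```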

Stata Regular expressions extracting numerical values

I have some data that looks like this
var1
h 01 .00 .0 abc
d 1.0 .0 14.0abc
1,0.0 0.0 .0abc
It should be noted that the last three alphabetic characters are the same in every value, and I am hoping to extract all the numerical values within the string. The code that I'm using looks like this
gen x1=regexs(1) if regexm(var1,"([0-9]+) [ ]*(abc)*$")
However, this code only extracts the digits immediately before the abc term and stops at a space or a period (.). For example, only the 0 before abc is extracted from the first value. I was wondering whether there is a way to handle this and extract all the numerical values before the alpha characters.
As @Roberto Ferrer points out, your question isn't very clear, but here is an example using moss from SSC:
. clear
. input str16 var1
var1
1. "h 01 .00 .0 abc"
2. "d 1.0 .0 14.0abc"
3. "1,0.0 0.0 .0abc"
4. end
. moss var1, regex match("([0-9]+\.*[0-9]*|\.[0-9]+)")
. l _match*
+---------------------------------------+
| _match1 _match2 _match3 _match4 |
|---------------------------------------|
1. | 01 .00 .0 |
2. | 1.0 .0 14.0 |
3. | 1 0.0 0.0 .0 |
+---------------------------------------+
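The regular expression passed to moss can be tried out in other engines as well. As a rough cross-check, here is the same pattern applied with Python's re.findall; this assumes the pattern behaves the same way in both engines, which holds for the three example values.

```python
import re

# Digits with an optional fractional part, or a bare ".digits" value.
number = re.compile(r"[0-9]+\.*[0-9]*|\.[0-9]+")

values = ["h 01 .00 .0 abc", "d 1.0 .0 14.0abc", "1,0.0 0.0 .0abc"]
matches = [number.findall(v) for v in values]
for v, found in zip(values, matches):
    print(v, "->", found)
```

Note that the comma in the third value is not part of the pattern, so "1,0.0" is split into "1" and "0.0", just as in the moss output.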