Group together records that share either one of two variables - stata

I am working with Stata and I have a large data set where I need to group records together if they share one of two variables.
For example, take the following three observations:
Observation | matching var1 | matching var2
1 xxx aaa
2 xxx bbb
3 yyy bbb
If I were to group the records by var1, the first two observations would be in the same group and the last observation would be in a separate group. Similarly, if I were to group using var2, observations two and three would be in the same group and observation one would be in a separate group. However, if I were to group the records based on a match on either var1 or var2, all observations would be in the same group.
I would like to create a 'group id' variable that will take the same value across all these records.
Any suggestions on how I should go about it?

The community-contributed group_twoway (available in SSC) can match two variables:
ssc install group_twoway
Using an example extended from yours:
clear
input str3(var1 var2)
"xxx" "aaa"
"yyy" "bbb"
"mmm" "ccc"
"nnn" "ccc"
"mmm" "ddd"
"ooo" "ff"
"pp" "eee"
"qq" "ff"
"rr" "u"
"xxx" "bbb"
end
group_twoway var1 var2, generate(group_id)
Result # of obs.
-----------------------------------------
not matched 0
matched 10
-----------------------------------------
list, sepby(group_id) constant
+------------------------+
| var1 var2 group_id |
|------------------------|
1. | xxx aaa 1 |
2. | yyy bbb 1 |
|------------------------|
3. | mmm ccc 2 |
4. | nnn ccc 2 |
5. | mmm ddd 2 |
|------------------------|
6. | ooo ff 3 |
|------------------------|
7. | pp eee 4 |
|------------------------|
8. | qq ff 3 |
|------------------------|
9. | rr u 5 |
|------------------------|
10. | xxx bbb 1 |
+------------------------+
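
For readers who want to see the idea behind this kind of grouping, here is a minimal Python sketch using a union-find structure. It is not the implementation of group_twoway, only an illustration of linking records that share a value in either variable. The function name group_by_either is hypothetical, and the sketch assumes that a value in var1 never links to the same text in var2.

```python
def group_by_either(records):
    """Give one group id to records linked by a shared var1 or var2 value."""
    parent = list(range(len(records)))

    def find(i):
        # Follow parent links to the root, shortening the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(i, j):
        ri, rj = find(i), find(j)
        if ri != rj:
            # Keep the earlier record as the root of the merged group.
            parent[max(ri, rj)] = min(ri, rj)

    first_seen = {}
    for idx, (v1, v2) in enumerate(records):
        # Tag each value with its variable so var1 and var2 stay separate.
        for key in (("var1", v1), ("var2", v2)):
            if key in first_seen:
                union(idx, first_seen[key])
            else:
                first_seen[key] = idx

    # Number the groups 1, 2, 3, ... in order of first appearance.
    ids = {}
    return [ids.setdefault(find(i), len(ids) + 1) for i in range(len(records))]


records = [
    ("xxx", "aaa"), ("yyy", "bbb"), ("mmm", "ccc"), ("nnn", "ccc"),
    ("mmm", "ddd"), ("ooo", "ff"), ("pp", "eee"), ("qq", "ff"),
    ("rr", "u"), ("xxx", "bbb"),
]
print(group_by_either(records))  # [1, 1, 2, 2, 2, 3, 4, 3, 5, 1]
```

On the ten records above, the sketch reproduces the group_id column shown in the listing.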

Matching values in a variable by year

I have the following minimal example:
input str5 name year match1 match2 match3
Alice 2000 . . .
Alice 2000 . . .
Bob 2000 . . .
Carol 2001 0 . .
Alice 2002 0 1 .
Carol 2002 1 0 .
Bob 2003 0 0 1
Bob 2003 0 0 1
end
I have data on name and year, and I want to create binary variables match1, match2, and so on, where matchj equals 1 if the same name appears in the data j years earlier. For example, for the first observation, match1 equals 1 if Alice appears in 1999, match2 equals 1 if Alice appears in 1998, etc.
If that earlier year is not in the data at all (in this case there is no 1999 or 1998), the binary variable will be missing.
How can I construct these match variables? Note that I have millions of unique names, so levelsof name, local(match) fails with a "macro substitution results in line that is too long" error. Also note that a name is sometimes duplicated within a given year, and some names may be missing in a given year.
Thanks for the data example. Here is one technique using rangestat from SSC. I don't understand your rule for which values should be 0 and which should be missing.
* Example generated by -dataex-. For more info, type help dataex
clear
input str5 name float year
"Alice" 2000
"Alice" 2000
"Alice" 2002
"Bob" 2000
"Bob" 2003
"Bob" 2003
"Carol" 2001
"Carol" 2002
end
gen one = 1
forval j = 1/3 {
rangestat (max) match`j'=one, int(year -`j' -`j') by(name)
}
drop one
sort name year
list, sepby(year)
+-----------------------------------------+
| name year match1 match2 match3 |
|-----------------------------------------|
1. | Alice 2000 . . . |
2. | Alice 2000 . . . |
|-----------------------------------------|
3. | Alice 2002 . 1 . |
|-----------------------------------------|
4. | Bob 2000 . . . |
|-----------------------------------------|
5. | Bob 2003 . . 1 |
6. | Bob 2003 . . 1 |
|-----------------------------------------|
7. | Carol 2001 . . . |
|-----------------------------------------|
8. | Carol 2002 1 . . |
+-----------------------------------------+
As the original author of levelsof I find it a little melancholy to see it pressed into service where it is of little or no help.
Here is an alternative approach using frames:
keep name year
frame copy default prev
frame prev: duplicates drop
frame prev: rename year myear
gen myear=.
forvalues i=1/3 {
replace myear = year-`i'
frlink m:1 name myear, frame(prev) generate(match`i')
replace match`i' = 1 if match`i'!=.
}
drop myear
Output:
name year match1 match2 match3
1. Alice 2000 . . .
2. Alice 2000 . . .
3. Bob 2000 . . .
4. Carol 2001 . . .
5. Alice 2002 . 1 .
6. Carol 2002 1 . .
7. Bob 2003 . . 1
8. Bob 2003 . . 1
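
The lookup that both answers perform can be sketched in plain Python. This is only an illustration of the logic, not a replacement for rangestat or frlink. The function name lag_matches is hypothetical, and None stands in for Stata's missing value.

```python
def lag_matches(rows, max_lag=3):
    """For each (name, year) row, flag whether the name appears `lag` years earlier."""
    # A set collapses duplicate (name, year) pairs, much like `duplicates drop`.
    seen = set(rows)
    result = []
    for name, year in rows:
        # 1 if the name exists `lag` years back, otherwise None (missing).
        flags = [
            1 if (name, year - lag) in seen else None
            for lag in range(1, max_lag + 1)
        ]
        result.append((name, year, *flags))
    return result


rows = [
    ("Alice", 2000), ("Alice", 2000), ("Alice", 2002), ("Bob", 2000),
    ("Bob", 2003), ("Bob", 2003), ("Carol", 2001), ("Carol", 2002),
]
for row in lag_matches(rows):
    print(row)
```

Because membership tests on a set take constant time on average, this needs one pass over the rows per lag, with no list of distinct names held in a macro.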

How to add a space between the words when there is a special character in pyspark dataframe using regex?

I have a dataframe which consists of reviews and has special characters in between the words. I want to add a space.
For example,
Spark)NLP -> Spark ) NLP
Machine-Learning -> Machine - Learning
Below is my dataframe
temp = spark.createDataFrame([
(0, "This is 5years of Spark)world 5-6"),
(1, "I wish Java-DL could use case-classes"),
(2, "Data-science is cool"),
(3, "Machine")
], ["id", "words"])
+---+-------------------------------------+
|id |words |
+---+-------------------------------------+
|0 |This is 5years of Spark)world 5-6 |
|1 |I wish Java-DL could use case-classes|
|2 |Data-science is cool |
|3 |Machine |
+---+-------------------------------------+
I used the code below, but it does not work:
temp_1 = temp.withColumn('words', F.regexp_replace('words', r'(?<! )(?=[.,!?()\/\-\+\'])|(?<=[.,!?()\/\-\+\'])(?! )', '$1 $2 $3'))
Desired output:
+---+-----------------------------------------+
|id |words |
+---+-----------------------------------------+
|0 |This is 5years of Spark ) world 5 - 6 |
|1 |I wish Java - DL could use case - classes|
|2 |Data - science is cool |
|3 |Machine |
+---+-----------------------------------------+
You can use
\b[^\w\s]\b|_
and replace with " $0 " (the whole match with a space on each side).
If you do not consider an underscore to be a special char, just use \b[^\w\s]\b that matches any char other than word and whitespace chars between word chars. Note word chars include underscores.
If there must be letters or digits on each side, replace word boundaries with lookarounds: (?<=[^\W_])[^\w\s](?=[^\W_])|_. To only find special chars between letters: (?<=[^\W\d_])[^\w\s](?=[^\W\d_])|_ or (?<=\p{L})[^\w\s](?=\p{L})|_.
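
To check the first pattern without a Spark session, here is a small sketch using Python's re module. Spark's regexp_replace uses Java regex syntax, where $0 refers to the whole match; in Python the equivalent replacement reference is \g<0>. The function name pad_special_chars is hypothetical, and the two engines may differ on edge cases such as non-ASCII word characters.

```python
import re

# A non-word, non-space character sitting between two word characters,
# or an underscore anywhere.
PATTERN = re.compile(r"\b[^\w\s]\b|_")


def pad_special_chars(text):
    """Put a space on each side of every matched special character."""
    return PATTERN.sub(r" \g<0> ", text)


for sample in [
    "This is 5years of Spark)world 5-6",
    "I wish Java-DL could use case-classes",
    "Data-science is cool",
    "Machine",
]:
    print(pad_special_chars(sample))
```

Punctuation that is already followed by a space, as in "Hello, world", is left alone because there is no word boundary after the comma.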

Regular expression in Oracle to filter particular characters only

I have a scenario
Case 1: "NO 41 ABC STREET"
Case 2: "42 XYZ STREET"
My table has almost 100,000 rows like these.
I want a regexp that omits 'NO 41' and leaves ABC STREET as output in case 1, whereas in case 2 I want '42 XYZ STREET' as output.
regexp_replace('NO 41 ABC STREET', 'NO [0-9]+ |([0-9]+)', '\1') outputs ABC STREET.
regexp_replace('42 XYZ STREET', 'NO [0-9]+ |([0-9]+)', '\1') outputs 42 XYZ STREET.
You have provided only two examples of the data in your table. Assuming that you only want to strip the prefix from values that start with "NO", followed by digits and then a space before the remaining characters, you could use this.
Query:
select s,REGEXP_REPLACE(s,'^NO +\d+ +') as r FROM data
Results:
| S | R |
|------------------|---------------|
| NO 41 ABC STREET | ABC STREET |
| 42 XYZ STREET | 42 XYZ STREET |
If you have more complex data to be filtered, please edit your question and describe it clearly.
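
The same anchored pattern can be tried outside the database. This is a minimal Python sketch, assuming the pattern ^NO +\d+ + behaves the same way in Python's re module as in Oracle for these inputs. The function name strip_no_prefix is hypothetical.

```python
import re


def strip_no_prefix(address):
    """Remove a leading 'NO', the number after it, and the spaces that follow."""
    # ^ anchors the match to the start, so digits elsewhere are untouched.
    return re.sub(r"^NO +\d+ +", "", address)


print(strip_no_prefix("NO 41 ABC STREET"))  # ABC STREET
print(strip_no_prefix("42 XYZ STREET"))     # 42 XYZ STREET
```

Anchoring with ^ is what keeps a value such as '42 XYZ STREET' intact: nothing is replaced unless the string begins with NO followed by a number.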

How to remove all of the contents of a string except the first character?

I have a data set with first name, middle name, and last name. I'm going to merge it with another data set matching on the same variables.
In one data set the variable mi looks like:
Lowell
Ann
Carl
A
Fran
Allen
And I want it to look like:
L
A
C
A
F
A
I tried this:
gen mi2 = substr(mi, 2, length(mi))
This does the opposite of what I want, but it's the closest I've been able to get. I know this is probably a really easy problem, but I'm stumped at the moment.
You are on the right track with substr. See the example below:
clear
input str10 mi
Lowell
Ann
Carl
A
Fran
Allen
end
gen mi2 = substr(mi,1,1)
list, sep(0)
+--------------+
| mi mi2 |
|--------------|
1. | Lowell L |
2. | Ann A |
3. | Carl C |
4. | A A |
5. | Fran F |
6. | Allen A |
+--------------+
The second and third arguments to substr are the starting position and number of characters respectively. In this case, you want to start at the first character, and take one character, so substr(mi, 1, 1) is what you need.
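
For comparison, the same start-and-length idea expressed with Python slicing. This is only an analogy to substr(mi, 1, 1): Python positions start at 0 rather than 1, and the function name first_initial is hypothetical.

```python
def first_initial(name):
    """Return the first character of name, or an empty string if name is empty."""
    # The slice [:1] takes at most one character and is safe on empty strings.
    return name[:1]


names = ["Lowell", "Ann", "Carl", "A", "Fran", "Allen"]
print([first_initial(n) for n in names])  # ['L', 'A', 'C', 'A', 'F', 'A']
```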

Why doesn't this command line (Batch file) work?

I have a command line which looks for certain IDs (2 IDs) in the 2nd column. But I want this command to search all the columns, not just the second column.
Can anyone help?
The command line for searching 2nd column is:
findstr /rb /c:"[^|]*| *ID1 *|" /c:"[^|]*| *ID2 *|" "src.txt" >" dest.txt"
Can someone modify it so that it searches all the columns instead of just the second and also give 2 command lines which will:
(1) Searches all the columns instead of just 2nd.
(2) Searches only for 1 ID.
(3) Searches only for 3 IDs.
src.txt -
The text is in this manner:
Ja | 11 | xxx
Jn | 19 | yyy
Jx | 21 | yyyas | sas
Also, a few lines may have more columns, like the last one.
Thanks!
Suppose src.txt contains the lines
Ja | 11 | xxx
Jn | 19 | yyy
nJ | 19 | yyy
Ax | 21 | Jyyas | sas
Ax | 23 | yyJas | sas
and only the 3 lines in which a value within a column starts with J should be written to the file dest.txt:
Ja | 11 | xxx
Jn | 19 | yyy
Ax | 21 | Jyyas | sas
The following command can be used:
findstr /R /C:"^J" /C:"\| *J" "src.txt" >"dest.txt"
^J is for finding lines starting with J and \| *J is for finding lines having a value starting with J after 0 or more spaces in a different column than first column.
Please note that the parameter /B is removed. /B matches a pattern only at the beginning of a line, so with it the search would never reach the other columns.
/rb in your example is /R and /B combined in one parameter string.
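
The two search strings can be checked against the sample lines with a short Python sketch. It only mimics the matching logic of the findstr command above, combining both /C: patterns into one alternation; findstr's regular expression dialect is more limited than Python's, so this is an approximation. The function name matching_lines is hypothetical.

```python
import re

# Either the line starts with J, or a column separator is followed by
# zero or more spaces and then J.
COLUMN_STARTS_WITH_J = re.compile(r"^J|\| *J")


def matching_lines(lines):
    """Keep the lines in which some column value starts with J."""
    return [line for line in lines if COLUMN_STARTS_WITH_J.search(line)]


src = [
    "Ja | 11 | xxx",
    "Jn | 19 | yyy",
    "nJ | 19 | yyy",
    "Ax | 21 | Jyyas | sas",
    "Ax | 23 | yyJas | sas",
]
for line in matching_lines(src):
    print(line)
```

To search for a specific ID instead of a leading J, the J in both alternatives would be replaced by that ID.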