Creating a variable from conditional values in another variable (if-statement)

I have a large conflict dataset (71 million observations) with many variables and a daily date.
It comes from the GDELT project. The dataset is structured so that, for each day, there is a source country and a target country of aggression. For example, on 1 January 2000 many countries engaged in aggressive behaviour against others or themselves, and the dataset tracks this.
It looks like this:
clear
input long date_01 str18 source_01 str19 target_01 str4 cameocode_01
20000101 "AFG" "AFGGOV" "020"
20000101 "AFG" "AFGGOV" "0841"
20000101 "AFG" "ARE" "036"
20000101 "AFG" "CVL" "043"
20000101 "AFG" "GOV" "010"
20000101 "AFG" "GOV" "043"
20000101 "AFGGOV" "kasUAF" "0353"
20000101 "AFGGOV" "kasUAF" "084"
20000101 "AFG" "IGOUNO" "030"
20000101 "AFG" "IND" "042"
20000101 "AFG" "IND" "043"
end
What I would like to do is to isolate these events per country.
For instance, I would like to create a variable for the US where, for each date, I have all the times that the US was either a target or a source, and their respective cameo code. I have a considerable number of countries but only need a subset of them and I know their names in advance.
As you can see in the example, the first variable is the date, which for these cells is always 20000101; after a couple of hundred observations it changes to 20000102, denoting a change of day.
The second variable source_01 is a country attacking another one. In the example, IND is India, AFG is Afghanistan and the other codes are other countries.
The third variable target_01 is just the victim of the conflict.
Finally, cameocode_01 is a level of intensity of conflict measured with some algorithm that tracks the news in each language.
What I am after is to create a new (per country) variable that extracts the cameo code of that event if a specific country is involved either as source or target.
For this specific example, below is my desired output for the case of India (code IND), which is involved in two events on the specific date:
date INDIAcameo
20000101 "042"
20000101 "043"
I have tried this:
replace INDIA cameo=cameocode if "target" ~ "source" ==IND
However, it says type mismatch and I doubt it would give me what I am looking for anyway.

If you know in advance the countries you are interested in, then the following will work:
clear
input long date_01 str18 source_01 str19 target_01 str4 cameocode_01
20000101 "AFG" "AFGGOV" "020"
20000101 "AFG" "IND" "043"
20000101 "AFG" "AFGGOV" "0841"
20000101 "AFG" "ARE" "036"
20000101 "AFG" "CVL" "043"
20000101 "AFG" "GOV" "010"
20000101 "AFG" "GOV" "043"
20000101 "AFGGOV" "kasUAF" "0353"
20000101 "AFGGOV" "kasUAF" "084"
20000101 "AFG" "IGOUNO" "030"
20000102 "AFG" "IND" "042"
end
foreach c in AFG IND ARE {
    // strmatch() without wildcards matches the whole string only, so AFGGOV does not count as AFG
    generate ind_`c' = cameocode_01 if strmatch(source_01, "`c'") | ///
        strmatch(target_01, "`c'")
}
Note that I have slightly modified your example for better illustration.
To see the results:
list, sepby(date) abbreviate(15)
     +-------------------------------------------------------------------------------+
     |  date_01   source_01   target_01   cameocode_01   ind_AFG   ind_IND   ind_ARE |
     |-------------------------------------------------------------------------------|
  1. | 20000101         AFG      AFGGOV            020       020                     |
  2. | 20000101         AFG         IND            043       043       043           |
  3. | 20000101         AFG      AFGGOV           0841      0841                     |
  4. | 20000101         AFG         ARE            036       036                 036 |
  5. | 20000101         AFG         CVL            043       043                     |
  6. | 20000101         AFG         GOV            010       010                     |
  7. | 20000101         AFG         GOV            043       043                     |
  8. | 20000101      AFGGOV      kasUAF           0353                               |
  9. | 20000101      AFGGOV      kasUAF            084                               |
 10. | 20000101         AFG      IGOUNO            030       030                     |
     |-------------------------------------------------------------------------------|
 11. | 20000102         AFG         IND            042       042       042           |
     +-------------------------------------------------------------------------------+
or
foreach v of varlist ind* {
    sort date `v'
    list date `v' if !missing(`v'), sepby(date) abbreviate(15)
}
+--------------------+
| date_01 ind_AFG |
|--------------------|
3. | 20000101 010 |
4. | 20000101 020 |
5. | 20000101 030 |
6. | 20000101 036 |
7. | 20000101 043 |
8. | 20000101 043 |
9. | 20000101 043 |
10. | 20000101 0841 |
|--------------------|
11. | 20000102 042 |
+--------------------+
+--------------------+
| date_01 ind_IND |
|--------------------|
10. | 20000101 043 |
|--------------------|
11. | 20000102 042 |
+--------------------+
+--------------------+
| date_01 ind_ARE |
|--------------------|
10. | 20000101 036 |
+--------------------+
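
For readers who want to check the selection rule outside Stata, here is a minimal Python sketch of the same logic. The function name cameo_by_country and the tuple layout are assumptions made for illustration and are not part of the answer above. It uses exact equality, mirroring strmatch() without wildcards, so AFGGOV is not treated as AFG.

```python
# Rows taken from the modified example: (date, source, target, cameo code).
rows = [
    (20000101, "AFG", "AFGGOV", "020"),
    (20000101, "AFG", "IND", "043"),
    (20000101, "AFG", "ARE", "036"),
    (20000101, "AFGGOV", "kasUAF", "0353"),
    (20000102, "AFG", "IND", "042"),
]


def cameo_by_country(rows, country):
    """Return (date, cameo) pairs where country is exactly the source or the target."""
    return [
        (date, cameo)
        for date, source, target, cameo in rows
        if source == country or target == country
    ]


print(cameo_by_country(rows, "IND"))
```

Calling the function once per country of interest plays the role of the foreach loop over AFG, IND and ARE.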

Related

How to create column with name of column with the highest value per each ID in SAS Enterprise Guide / PROC SQL?

I have a table in SAS Enterprise Guide like the one below:
ID | COL_A | COL_B | COL_C
-----|-------|-------|------
111 | 10 | 20 | 30
222 | 15 | 80 | 10
333 | 11 | 10 | 20
444 | 20 | 5 | 20
Requirements:
I need to create a new column "TOP" that holds the name of the column with the highest value for each ID.
If two or more columns share the highest value, take the first one in alphabetical order.
Desired output:
ID | COL_A | COL_B | COL_C | TOP
-----|-------|-------|--------|-------
111 | 10 | 20 | 30 | COL_C
222 | 15 | 80 | 10 | COL_B
333 | 11 | 10 | 20 | COL_C
444 | 20 | 5 | 20 | COL_A
Because:
for ID = 111 the highest value is in COL_C, so the name "COL_C" goes in column "TOP"
for ID = 444 two columns share the highest value, so by the alphabetical criterion column "TOP" holds the name "COL_A"
How can I do that in SAS Enterprise Guide or in PROC SQL?

You can do this with functions. Use MAX() to find the largest value. Use WHICHN() to find the index number of the first variable with that value. Use the VNAME() function to get the name of the variable with that index. Because the array lists the columns in alphabetical order, a tie resolves to the first name alphabetically.
data want;
  set have;
  length TOP $32;
  array list col_a col_b col_c;
  top = vname(list[whichn(max(of list[*]),of list[*])]);
run;
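
As a cross-check of the tie-breaking rule, the same logic can be sketched in Python. This is only an illustration of the max / first-match / name steps, not SAS code; the helper name top_column and the list-of-dicts layout are assumptions.

```python
def top_column(row, columns):
    """Name of the first column, in the given order, that holds the row's maximum."""
    best = max(row[c] for c in columns)
    for c in columns:
        if row[c] == best:
            return c


have = [
    {"ID": 111, "COL_A": 10, "COL_B": 20, "COL_C": 30},
    {"ID": 222, "COL_A": 15, "COL_B": 80, "COL_C": 10},
    {"ID": 333, "COL_A": 11, "COL_B": 10, "COL_C": 20},
    {"ID": 444, "COL_A": 20, "COL_B": 5, "COL_C": 20},
]

# Listing the columns alphabetically makes ties resolve to the first name.
columns = ["COL_A", "COL_B", "COL_C"]
tops = {row["ID"]: top_column(row, columns) for row in have}
print(tops)
```

For ID 444 both COL_A and COL_C hold 20, and the scan order decides in favour of COL_A, which is the same reason the array order matters in the data step.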

Filling in blank entries

I am working with a Stata dataset that tracks a company's contract year.
However, systematically I am missing a year:
Is there code I could run to replace the missing year with the year from the previous observation?

The following works for me:
clear
input var year
564 2029
597 2029
653 .
342 2041
456 2041
end
replace year = year[_n-1] if missing(year)
list
+------------+
| var year |
|------------|
1. | 564 2029 |
2. | 597 2029 |
3. | 653 2029 |
4. | 342 2041 |
5. | 456 2041 |
+------------+
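
The replace statement works through the observations in order, so a run of consecutive missing values is also filled from the last known year, while a missing value in the very first observation stays missing. A minimal Python sketch of that carry-forward behaviour, with None standing in for Stata's missing value and fill_from_previous as a hypothetical helper name:

```python
def fill_from_previous(values):
    """Replace each None with the most recent earlier value, scanning in order."""
    filled = []
    for value in values:
        if value is None and filled:
            # Uses the already-filled previous entry, so gaps of any length are covered.
            value = filled[-1]
        filled.append(value)
    return filled


print(fill_from_previous([2029, 2029, None, 2041, 2041]))
print(fill_from_previous([2029, None, None, 2041]))
```

If the data are not already sorted so that the previous observation is the right donor, they need to be sorted first, in Stata as in this sketch.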

Checking for a Range of Values

In SQL I can check for a range of values with the BETWEEN operator.
MySQL [distributor]> select prod_name, prod_price from products where prod_price between 3.49 and 11.99;
+---------------------+------------+
| prod_name | prod_price |
+---------------------+------------+
| Fish bean bag toy | 3.49 |
| Bird bean bag toy | 3.49 |
| Rabbit bean bag toy | 3.49 |
| 8 inch teddy bear | 5.99 |
| 12 inch teddy bear | 8.99 |
| 18 inch teddy bear | 11.99 |
| Raggedy Ann | 4.99 |
| King doll | 9.49 |
| Queen doll | 9.49 |
+---------------------+------------+
9 rows in set (0.005 sec)
I referred to the Django docs and found gte, gt, lt and lte, but no between.
How can I achieve the BETWEEN functionality?

Use the range lookup in the Django ORM. It is inclusive at both ends, like BETWEEN:
products.objects.filter(prod_price__range=(3.49, 11.99))
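
The inclusive behaviour of BETWEEN can be checked without a Django project. The sketch below uses Python's built-in sqlite3 module rather than MySQL or the ORM, and the rows named "Hypothetical ..." are invented for illustration. It shows that BETWEEN returns the same rows as a pair of >= and <= conditions, which in Django terms corresponds to combining the gte and lte lookups.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (prod_name TEXT, prod_price REAL)")
conn.executemany(
    "INSERT INTO products VALUES (?, ?)",
    [
        ("Fish bean bag toy", 3.49),
        ("8 inch teddy bear", 5.99),
        ("18 inch teddy bear", 11.99),
        ("Hypothetical cheap item", 2.99),      # invented row below the range
        ("Hypothetical pricey item", 12.99),    # invented row above the range
    ],
)

low, high = 3.49, 11.99
between = [r[0] for r in conn.execute(
    "SELECT prod_name FROM products WHERE prod_price BETWEEN ? AND ? ORDER BY prod_name",
    (low, high),
)]
bounds = [r[0] for r in conn.execute(
    "SELECT prod_name FROM products WHERE prod_price >= ? AND prod_price <= ? ORDER BY prod_name",
    (low, high),
)]
print(between == bounds, between)
```

Both endpoints (3.49 and 11.99) are included in the result, and the two queries agree.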

Using REGEXP in a loop reading from other table in MySQL

I have a question about using REGEXP in a query. I have a table in my database with nearly 553,000 records (34,517 x 16), and a list of values that I need to find in that table. Using REGEXP, I have managed to find some values with this statement:
SELECT * FROM `TableA` WHERE ((*desiredvalue* REGEXP 'a|b|c|d...'));
Now the list of desired values has grown from 20 to 1700. Is there some way to put these new values in a single column in a Table B and search TableA by looping over the new table? My first instinct was to save the query and paste in all 1700 records, but the idea is to do it automatically whenever Table B is updated.
Here is an example of my initial matrix (all values are 14-character strings):
+-----+---+---+---+---+-----+----+
|Group|SP1|SP2|SP3|SP4|.....|SP15|
+-----+---+---+---+---+-----+----+
|G1 |a |b |c |d |.....|x |
|G2 | |b |h |d |.....|z |
|G4 |a |b | |m |.....|r |
|G5 |o |p |q |r |.....|h |
+-----+---+---+---+---+-----+----+
The idea is that, given a list of values val=(a,c,h,r,p), I obtain this result:
+---+-----+
|val|Group|
+---+-----+
|a |G1 |
|a |G4 |
|c |G1 |
|h |G2 |
|r |G4 |
|r |G5 |
|p |G5 |
+---+-----+
Thanks!
Christian

Stata: reshaping dataset from wide to long

Say I have a data set of country GDPs formatted like this:
---------------------------------
| Year | Country A | Country B |
| 1990 | 128 | 243 |
| 1991 | 130 | 212 |
| 1992 | 187 | 207 |
How would I use Stata's reshape command to change this into a long table with country-year rows, like the following?
----------------------
| Country| Year | GDP |
| A | 1990 | 128 |
| A | 1991 | 130 |
| A | 1992 | 187 |
| B | 1990 | 243 |
| B | 1991 | 212 |
| B | 1992 | 207 |

It is recommended that you try to solve the problem on your own first. Although you might have tried, you show no sign that you did. For future questions, please post the code you attempted, and why it didn't work for you.
The following gives what you ask for:
clear all
set more off
input ///
Year CountryA CountryB
1990 128 243
1991 130 212
1992 187 207
end
list
reshape long Country, i(Year) j(country) string
rename Country GDP
order country Year GDP
sort country Year
list, sep(0)
Note: you need the string option here because your stub suffixes are strings (i.e. "A" and "B"). See help reshape for the details.
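
To make the role of the stub and the string suffix concrete, here is a small Python sketch of the same wide-to-long transformation. It is an illustration of what reshape long does with these data, not a replacement for it; the list-of-dicts layout and the variable names are assumptions.

```python
wide = [
    {"Year": 1990, "CountryA": 128, "CountryB": 243},
    {"Year": 1991, "CountryA": 130, "CountryB": 212},
    {"Year": 1992, "CountryA": 187, "CountryB": 207},
]

stub = "Country"  # common prefix of the wide variables
long_rows = []
for row in wide:
    for name, value in row.items():
        if name.startswith(stub):
            # The text after the stub ("A", "B") becomes the new identifier.
            long_rows.append(
                {"country": name[len(stub):], "Year": row["Year"], "GDP": value}
            )

long_rows.sort(key=lambda r: (r["country"], r["Year"]))
for r in long_rows:
    print(r["country"], r["Year"], r["GDP"])
```

The suffixes here are letters rather than numbers, which is the situation the string option handles in Stata.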
