Can bitwise operators be used to quickly find a value from a permutation? - bit-manipulation

Let's say I have two metrics, age (1-20) and size (SMALL, MEDIUM, LARGE), and I want to find the relative oldness (Young, Adult, Senior, Very Old). (Think "1-year-old mouse vs. 1-year-old elephant": the oldness differs depending on size.)
Age | SMALL    | MEDIUM   | LARGE
1   | Young    | Young    | Young
2   | Young    | Young    | Young
3   | Adult    | Young    | Young
4   | Adult    | Young    | Young
5   | Senior   | Young    | Young
6   | Very Old | Adult    | Young
7   | Very Old | Adult    | Young
8   | Very Old | Adult    | Young
9   | Very Old | Senior   | Young
10  | Very Old | Senior   | Adult
11  | Very Old | Senior   | Adult
12  | Very Old | Senior   | Adult
13  | Very Old | Very Old | Adult
14  | Very Old | Very Old | Adult
15  | Very Old | Very Old | Senior
16  | Very Old | Very Old | Senior
17  | Very Old | Very Old | Senior
18  | Very Old | Very Old | Senior
19  | Very Old | Very Old | Senior
20  | Very Old | Very Old | Very Old
I could simply use a relational table like this to find the relative oldness given the age and size:
size age oldness
-------------------------
small 1 young
medium 8 adult
large 17 senior
This requires calculating and storing every permutation, and then searching against all of them. Admittedly, databases are good at this. But if I want to add another size category, say VERY LARGE, the number of permutations I have to calculate, store and search against goes way up.
I could see a system working where size is just an age multiplier. For instance, if Small = x5, Medium = x2 and Large = x1, then the age 2 becomes 10, 4 and 2 respectively. You could then simplify the logic to "if multiplied age <= 10, oldness = young." However, the data might not always fit such a nice model. For instance, above, Larges are only young up to 9, not 10.
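To make the multiplier idea concrete, here is a minimal Python sketch. The multipliers and the cutoff of 10 come from the paragraph above; the names `MULTIPLIER`, `is_young_by_multiplier`, `LAST_YOUNG_IN_TABLE` and `mismatches` are made up for illustration, and only the "Young" rule is modelled because that is the only rule stated.

```python
# Multipliers and cutoff as described in the question (only "Young" is modelled).
MULTIPLIER = {"SMALL": 5, "MEDIUM": 2, "LARGE": 1}
YOUNG_CUTOFF = 10

def is_young_by_multiplier(size, age):
    """Apply the rule 'if multiplied age <= 10, oldness = young'."""
    return age * MULTIPLIER[size] <= YOUNG_CUTOFF

# Last age that the table in the question labels "Young", per size.
LAST_YOUNG_IN_TABLE = {"SMALL": 2, "MEDIUM": 5, "LARGE": 9}

def mismatches():
    """Return (size, age) pairs where the multiplier rule disagrees with the table."""
    out = []
    for size, last_young in LAST_YOUNG_IN_TABLE.items():
        for age in range(1, 21):
            if is_young_by_multiplier(size, age) != (age <= last_young):
                out.append((size, age))
    return out

print(mismatches())  # [('LARGE', 10)]
```

The single mismatch is the Large case at age 10 mentioned above: the multiplier rule calls it young, while the table calls it Adult.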
I feel like octals and bitwise operators might be a solution here, but I'm struggling to make it work. I'm thinking about chmod and how a permission like 755 holds 9 different pieces of information (one bit per permission), and that you can ask simple questions like "can the file's owner execute?" by just computing 0755 & 0100 (both numbers in octal). I think that perhaps something like this could help me, but I haven't been able to crack this nut.
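For reference, the chmod check can be sketched in Python as follows. The constant and function names are illustrative only; the bit values are the standard Unix permission bits. Note that the numbers must be read as octal: in decimal, 755 & 100 gives 96, which answers nothing.

```python
MODE = 0o755         # rwxr-xr-x, written as an octal literal
OWNER_EXEC = 0o100   # execute bit for the file's owner
GROUP_WRITE = 0o020  # write bit for the group

def has_permission(mode, flag):
    """True if the given permission bit is set in mode."""
    return (mode & flag) != 0

print(has_permission(MODE, OWNER_EXEC))   # True
print(has_permission(MODE, GROUP_WRITE))  # False
```

This works because each of the 9 bits is an independent yes/no flag. Oldness, by contrast, is one of four labels chosen by a pair (size, age), so a single AND against a packed number does not obviously map onto it.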
Any ideas?
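One possible direction, offered only as a sketch and not as a bitwise solution: since oldness never decreases with age within a size, it may be enough to store three upper bounds per size instead of every (size, age) row. The bounds below are read off the table above; `LABELS`, `UPPER_BOUNDS` and `oldness` are hypothetical names.

```python
from bisect import bisect_left

LABELS = ["Young", "Adult", "Senior", "Very Old"]

# Last age (inclusive) labelled Young, Adult and Senior for each size,
# read off the table in the question. Anything above the last bound is Very Old.
UPPER_BOUNDS = {
    "SMALL": [2, 4, 5],
    "MEDIUM": [5, 8, 12],
    "LARGE": [9, 14, 19],
}

def oldness(size, age):
    """Look up the label by finding the first bound that is >= age."""
    return LABELS[bisect_left(UPPER_BOUNDS[size], age)]

print(oldness("SMALL", 1))    # Young
print(oldness("MEDIUM", 8))   # Adult
print(oldness("LARGE", 17))   # Senior
```

Under this layout, adding a size such as VERY LARGE would mean adding one more row of three bounds rather than twenty more rows.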

Related

How can I transpose multiple columns at once?

I am trying to transpose three columns by two variables.
My current dataset looks like:
Person Date Company Industry Number
John 2017 Apple Tech 5
John 2017 Starbucks Beverages 3
Kim 2014 Hilton Hotels 9
I would like my output data set to look like:
Person | Date | Company1 | Industry1 | Number1 | Company2  | Industry2 | Number2
John   | 2017 | Apple    | Tech      | 5       | Starbucks | Beverages | 3
Kim    | 2014 | Hilton   | Hotels    | 9       | -         | -         | -
As you can see, I would like each observation to be unique by name and date.
Any suggestions?

How to delete variables which occur in column x but not in column y?

How can I delete duplicates which occur in column x but not in column y?
My dataset is as follows:
+-------+---+---+
| year | x | y |
+-------+---+---+
| 2001 | 1 | 2 |
| 2001 | 2 | 3 |
| 2001 | 2 | 3 |
| 2001 | 4 | 6 |
| 2001 | 5 | 9 |
| 2001 | 4 | 2 |
| 2001 | 4 | 9 |
+-------+---+---+
What I want is to remove the entries which occur in column y from the ones in column x.
My result would be: 1,4,5
I am currently learning Stata and would love to know a good source for all possible commands, if one exists, so I can learn better on my own. Currently I have trouble finding good sources.
In Stata what you call columns are always called variables.
See http://www.statalist.org/forums/help#stata for general advice on how to present data examples in Stata questions. (The comments on CODE delimiters don't apply here.)
This may help. I didn't understand the role of year in your problem.
clear
input year x y
2001 1 2
2001 2 3
2001 2 3
2001 4 6
2001 5 9
2001 4 2
2001 4 9
end
rename x Datax
rename y Datay
gen long obs = _n
reshape long Data, i(obs) j(which) string
bysort Data (which) : drop if which[_N] == "y"
list
     +---------------------------+
     | obs   which   year   Data |
     |---------------------------|
  1. |   1       x   2001      1 |
  2. |   4       x   2001      4 |
  3. |   7       x   2001      4 |
  4. |   6       x   2001      4 |
  5. |   5       x   2001      5 |
     +---------------------------+
All possible commands aren't documented in a single place. Someone could write new commands at any time, and they would not be documented anywhere except in their help files. Did you mean that? Nor are all existing commands documented in one place: many are user-written, and most of those are documented only by their help files.
Most of the official commands in Stata as supplied by StataCorp are documented in the manuals. There are also, literally, undocumented commands (I am not inventing this: see help undocumented), and there are nondocumented commands that are known about only because StataCorp mention them in talks or emails. To be as positive as possible: start with the manuals, bundled with your copy of Stata as .pdf files.

Reshaping when year and countries are both columns

I am trying to reshape some data. The issue is that usually data is either long or wide but this seems to be set up in a way that I cannot figure out how to reshape. The data looks as follows:
year australia canada denmark ...
1999 10 15 20
2000 12 16 25
2001 14 18 40
And I would like to get it into a panel format like the following:
year country gdppc
1999 australia 10
2000 australia 12
2001 australia 14
1999 canada 15
2000 canada 16
The problem is just in the variable names. See e.g. this FAQ for the advice that you may need to rename first before you can reshape.
For more complicated variants of this problem with similar data, see e.g. this paper.
clear
input year australia canada denmark
1999 10 15 20
2000 12 16 25
2001 14 18 40
end
rename (australia-denmark) gdppc=
reshape long gdppc , i(year) string j(country)
sort country year
list, sepby(country)
     +--------------------------+
     | year     country   gdppc |
     |--------------------------|
  1. | 1999   australia      10 |
  2. | 2000   australia      12 |
  3. | 2001   australia      14 |
     |--------------------------|
  4. | 1999      canada      15 |
  5. | 2000      canada      16 |
  6. | 2001      canada      18 |
     |--------------------------|
  7. | 1999     denmark      20 |
  8. | 2000     denmark      25 |
  9. | 2001     denmark      40 |
     +--------------------------+

How can I match two columns of data by name?

I have two sets of data that look something like this:
Bill | 7
Sam | 13
Chuck | 9
and
Bill | 6
Sam | 3
Beth | 6
and I want:
Beth | 0 | 6
Bill | 7 | 6
Chuck| 9 | 0
Sam | 13 | 3
I don't even care if the data ends up looking like this:
Bill  | 7  | Bill  | 6
      |    | Beth  | 6
Sam   | 13 | Sam   | 3
Chuck | 9  | Chuck | 0
I just would like to match up the names.
I've never seen an ordering like your desired outcome in "real life practice".
To work with the data, I would use an operating system tool to combine the source files
(for example: copy file1 + file2 newfile.csv; the new file extension makes the file easy for OOo Calc to recognize).
In Calc you can then sort or filter to show a person's data together, or sum and calculate with it.
If you want standard operations, like SUM per person, check out the pivot table feature.
HTH
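If a short script is an option instead of a spreadsheet, the first desired layout amounts to matching both lists on the union of names and filling the gaps with 0. A minimal Python sketch, with the two lists from the question typed in as dictionaries (`first`, `second` and `match_by_name` are illustrative names, not part of any tool mentioned above):

```python
# The two data sets from the question, keyed by name.
first = {"Bill": 7, "Sam": 13, "Chuck": 9}
second = {"Bill": 6, "Sam": 3, "Beth": 6}

def match_by_name(a, b, missing=0):
    """Return (name, value_in_a, value_in_b) for every name in either set, sorted by name."""
    names = sorted(set(a) | set(b))
    return [(name, a.get(name, missing), b.get(name, missing)) for name in names]

for name, left, right in match_by_name(first, second):
    print(f"{name:<5} | {left:<2} | {right}")
```

The rows come out in the order Beth, Bill, Chuck, Sam, with 0 wherever a name is missing from one of the sets.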

Generating a new variable by selection from multiple variables

I have some data on diseases and age of diagnosis. Each participant was asked what diseases they have had and at what age that disease was diagnosed.
There is a set of variables disease1-28 with a numeric code for each disease and another set age1-28 with the age at diagnosis in years. The diseases are placed in successive variables in the order recalled; the age of diagnosis is placed in the corresponding age variable.
I would like to generate a new variable for each of several diseases giving the age of diagnosis of that disease, e.g. asthma_age_at_diagnosis.
Can I do this without having 28 replace statements?
Example of the data:
+-------------+----------+----------+----------+------+------+------+
| Participant | Disease1 | Disease2 | Disease3 | Age1 | Age2 | Age3 |
+-------------+----------+----------+----------+------+------+------+
| 1 | 123 | 3 | . | 30 | 2 | . |
| 2 | 122 | 123 | 5 | 23 | 51 | 44 |
| 3 | 5 | . | . | 50 | . | . |
+-------------+----------+----------+----------+------+------+------+
I give a general heads-up that a question of this form without any code of your own is often considered off-topic for Stack Overflow. Still, the Stata users around here are the people answering Stata questions (surprise) and we usually indulge questions like this if interesting and well-posed.
I'd advise a different data structure, period. With your example data
clear
input Patient Disease1 Disease2 Disease3 Age1 Age2 Age3
1 123 3 . 30 2 .
2 122 123 5 23 51 44
3 5 . . 50 . .
end
You can reshape
reshape long Disease Age, i(Patient) j(Order)
drop if missing(Disease)
list, sep(0)
     +---------------------------------+
     | Patient   Order   Disease   Age |
     |---------------------------------|
  1. |       1       1       123    30 |
  2. |       1       2         3     2 |
  3. |       2       1       122    23 |
  4. |       2       2       123    51 |
  5. |       2       3         5    44 |
  6. |       3       1         5    50 |
     +---------------------------------+
With the data in this form you can now answer lots of questions easily. I don't see that a whole bunch of new variables would make many analyses easier. Another way to see this is that you have hinted that the order in which diseases are coded is arbitrary; that being so, wiring that into the data structure is ill-advised. Even if order is important, it is still accessible as part of the dataset (variable Order).
Hint: If you still want separate variables for some purposes, look at separate.