Is using prime numbers a good way to determine the "on" / "true" values for a set of conditions?

I want to check whether a list contains an element, and instead of traversing the list every single time, I want a quick solution. Suppose there is a list of rules r1, r2, r3, r4, r5, with ids 3, 5, 7, 11, 13 (all prime numbers). I receive different user ids, each with a list of that user's specific rules/characteristics.
I have different groupings of rules for bucketing users according to certain conditions. For example, group1 = r1 & r2, group2 = r3, group3 = r3, r4 & r5; here groups serve as buckets of users with a combination of certain rules/characteristics.
Now if a user u1 satisfies rules/characteristics r1, r2 & r3, he should fall in group1 & group2.
And if a user u2 satisfies rules/characteristics r3, r4, r5, he should fall in group2 & group3.
In my solution I compute the product of the rule ids (which are prime numbers) for a user. For u1 that is r1 * r2 * r3 = 3*5*7 = 105. I do this calculation only once and then compute the product of the rule ids per group.
group1 -> r1 * r2 = 3*5 = 15
group2 -> r3 = 7
group3 -> r3*r4*r5 = 7*11*13 = 1001
I can also do the above calculation (for finding the group product) at group creation time, to avoid traversing the rule ids in groups every time.
Now for the main part, I will check
for all groups {
    if (user rule product % group product == 0)
        then the user lies in that group
}
Now since I am using only prime numbers, if a rule id is present in a group but not in a user's rules, then the user's rule product % group product can never be zero.
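The divisibility check described above can be sketched in Python as follows. This is only an illustration under the assumptions stated in the question (rule ids 3, 5, 7, 11, 13 and the three example groups); the names RULE_IDS, GROUPS and groups_for are hypothetical.

```python
from math import prod

# Rule ids are distinct primes, as in the question.
RULE_IDS = {"r1": 3, "r2": 5, "r3": 7, "r4": 11, "r5": 13}

# Group products, computed once at group creation time.
GROUPS = {
    "group1": prod(RULE_IDS[r] for r in ("r1", "r2")),        # 3 * 5
    "group2": prod(RULE_IDS[r] for r in ("r3",)),             # 7
    "group3": prod(RULE_IDS[r] for r in ("r3", "r4", "r5")),  # 7 * 11 * 13
}

def groups_for(user_rules):
    """Return the groups whose rule product divides the user's rule product."""
    user_product = prod(RULE_IDS[r] for r in user_rules)
    return [name for name, group_product in GROUPS.items()
            if user_product % group_product == 0]

print(groups_for(["r1", "r2", "r3"]))  # ['group1', 'group2']
print(groups_for(["r3", "r4", "r5"]))  # ['group2', 'group3']
```

Python integers do not overflow, so the sketch hides the range problem that fixed-width integers would have with many rules.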
I hope my question is not too vague. I look forward to hearing from people who have done something like this or who think it could work.

Found your question while browsing around.
You wish to distinguish several rules: using primes would work but costs a lot. The simplest way is to add powers of 2: 1 + 2 + 4 + 8, etc. (yes, binary!). To find whether a rule or a set of rules is included, do an AND: if result = question, OK.
This way, 32 bits hold 32 rules. With primes, the product 2 × 3 × 5 × 7 × 11 × 13 × 17 × 19 × 23 × 29 (10 rules only) already takes more than 32 bits. Quite simple: a new rule multiplies the range by 2 in the first case, but by the next (growing) prime otherwise, so the product can quickly go out of range.
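A minimal sketch of the power-of-2 approach, assuming the five rules and three groups from the question. The names RULE_BITS, GROUP_MASKS, mask and groups_for_mask are hypothetical, chosen for illustration.

```python
# Each rule gets its own bit: r1 -> 1, r2 -> 2, r3 -> 4, r4 -> 8, r5 -> 16.
RULE_BITS = {f"r{i + 1}": 1 << i for i in range(5)}

def mask(rules):
    """Combine the bits of the given rules into one integer."""
    m = 0
    for rule in rules:
        m |= RULE_BITS[rule]
    return m

# Group masks, computed once at group creation time.
GROUP_MASKS = {
    "group1": mask(["r1", "r2"]),
    "group2": mask(["r3"]),
    "group3": mask(["r3", "r4", "r5"]),
}

def groups_for_mask(user_rules):
    """A user is in a group when ANDing with the group mask returns the group mask."""
    user_mask = mask(user_rules)
    return [name for name, group_mask in GROUP_MASKS.items()
            if (user_mask & group_mask) == group_mask]

print(groups_for_mask(["r1", "r2", "r3"]))  # ['group1', 'group2']
print(groups_for_mask(["r3", "r4", "r5"]))  # ['group2', 'group3']
```

The AND test plays the role of the modulo test in the prime version, but each additional rule costs exactly one bit.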
hth

How to merge these two records into one row, removing the null values, in Informatica using a transformation? Please see the snapshot for the scenario.

Input-
Code  value  Min   Max
A     abc    10    null
A     abc    null  20
Output-
Code  value  Min   Max
A     abc    10    20
You can use an Aggregator transformation to remove the nulls and get a single row. This solution is based on your sample data only.
Use an Aggregator with the ports below:
inout_Code (group by)
inout_value (group by)
in_Min
in_Max
out_Min= MAX(in_Min)
out_Max = MAX(in_Max)
Then connect out_Min, out_Max, code and value to the target.
You will get one record per combination of code and value, and the null values will be gone, because MAX ignores nulls.
Now, if you have more rows (4, 5, 6 or more) per code/value combination, some of the Min/Max columns are null, and you want multiple records in the output, you need more complex mapping logic. Let me know if this helps. :)
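To make the grouping logic concrete outside Informatica, here is a rough Python sketch of what the aggregator does with the sample data: group by (Code, value) and keep the largest non-null Min and Max. The function name merge_rows and the use of None for null are assumptions made for this illustration, not Informatica behavior being reproduced exactly.

```python
rows = [
    {"Code": "A", "value": "abc", "Min": 10, "Max": None},
    {"Code": "A", "value": "abc", "Min": None, "Max": 20},
]

def merge_rows(rows):
    """Group by (Code, value); per group keep the largest non-null Min and Max."""
    merged = {}
    for row in rows:
        key = (row["Code"], row["value"])
        out = merged.setdefault(
            key, {"Code": row["Code"], "value": row["value"], "Min": None, "Max": None}
        )
        for col in ("Min", "Max"):
            # Nulls (None) are skipped, mirroring an aggregate that ignores nulls.
            if row[col] is not None and (out[col] is None or row[col] > out[col]):
                out[col] = row[col]
    return list(merged.values())

print(merge_rows(rows))
# [{'Code': 'A', 'value': 'abc', 'Min': 10, 'Max': 20}]
```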

How to distribute values into group in python

I have a dataset of actions performed over time, with an attribute 'Hour' (containing values from 0 to 23). Now I want to create another attribute, say 'PartOfDay', which groups the 24 hours into 4 parts. For rows with an 'Hour' value of 0 to 5, the 'PartOfDay' value should be 1; if the 'Hour' value is in [6, 11], the 'PartOfDay' value should be 2; and so on. How can I do this?
This code does it:
train['PartOfDay']=1
train.loc[(train.Hour>=6) & (train.Hour<=11),'PartOfDay']=2
train.loc[(train.Hour>=12) & (train.Hour<=17),'PartOfDay']=3
train.loc[(train.Hour>=18) & (train.Hour<=23),'PartOfDay']=4
but it does not look very elegant; I would like to know a neater way if possible.
Thank you for all your support!
While it is not clear what train.loc represents, a general approach to your problem is to use integer (floor) division to set the RHS:
1 + train.Hour // 6
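A small plain-Python check of that mapping, as a sketch. The helper name part_of_day is hypothetical, and the range check is an added assumption about valid input.

```python
def part_of_day(hour):
    """Map an hour in 0..23 to a part of day in 1..4 using floor division."""
    if not 0 <= hour <= 23:
        raise ValueError(f"hour out of range: {hour}")
    return 1 + hour // 6

hours = [0, 5, 6, 11, 12, 17, 18, 23]
print([part_of_day(h) for h in hours])  # [1, 1, 2, 2, 3, 3, 4, 4]
```

If train is a pandas DataFrame, the same expression should work column-wise, for example train['PartOfDay'] = 1 + train['Hour'] // 6, since floor division is applied element by element.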

How to write loop across Hierarchical Data (household-individual) in stata?

I'm now working on a household survey data set and I'd like to give certain members extra IDs according to their relationship to the household head. More specifically, I need to identify the adult children of household head and his/her spouse, if married, and assign them "sub-household IDs".
The variables are: hhid - household ID; pid -individual ID; relhead - relationship with head.
Regarding relhead, a 1 represents the head, a 6 represents a child, and a 7 represents a child-in-law. Below is some example data, including the desired outcome in the last column. I assume that whenever a 6 is followed by a 7, they constitute a couple and belong to the same sub-household.
hhid  pid  relhead  sub_hhid (desired)
50    1    1        1
50    2    3        1
50    3    6        2
50    4    6        3
50    5    7        3
-----------------------------------------------
67    1    1        1
67    3    6        2
67    4    7        2
Here are some thoughts:
There may be married and unmarried adult children within one household, and the family structure is a little complicated, so I want to write a loop across the members of a household.
The basic idea is that in the outer loop we identify the children staying at home and then check whether a spouse is present. If there is one, we give the couple an indicator; if not, we continue and give the single stay-at-home child another indicator. After walking through all the possible members within a household, we get a series of within-household IDs. To facilitate further analysis, I need some kind of external ID variable to separate the sub-families.
* Define N as the total number of households, n as the number of individuals in a household
* sty_chil is an indicator for an adult child living with the parents (head)
* sty_chil_sp is an indicator for an adult child's spouse
* "hid" and "ind_id" are local macros
* (pseudocode, not runnable Stata)
forvalues hid = 1/N {
    forvalues ind_id = 1/n {
        if sty_chil[`ind_id'] == 1 {
            if sty_chil_sp[`ind_id' + 1] == 1 {
                * a 6-7 pair, identified as a couple
                assign sub_hhid to this couple
            }
            else {
                * a single 6, identified as a single child
                assign sub_hhid to this child
            }
        }
        else {
            * relationships other than 6: move on to the next member of the household
        }
    }
    * move on to the next household
}
Stata's built-in by, sort: is pretty powerful, but here I want to treat only the family members who meet certain criteria and leave the others untouched, so an if-else type of loop feels more natural to me (even if by: may achieve my goal, it becomes too tricky when the situation is not so simple, and we cannot enumerate all the possible household patterns).
An immediate problem is that I don't know how to write a loop across household IDs and individual IDs, because I used to get the household size (the increment of the outer loop) using the by command (I'm not sure whether in this case it is 1 or the number of family members), and I'm not sure whether mixing by and if loops is good programming practice, so I favor writing a "full loop" in this case. Please give me some clues on how to achieve my goal and provide illustrative pseudocode.
An extra question: I cannot find the ado file that contains the by command. Does it exist?
I will abstract from the issue of whether the assumption used to create matches is a sensible one or not. Rather, let this be an example of reaching the desired results without using explicit loops. Some logic and the use of subscripting (see help subscripting) can get you far.
clear
set more off
*----- example data -----
input ///
hhid pid relhead sub_hhid
50 1 1 1
50 3 6 2
50 4 6 3
50 5 7 3
67 1 1 1
67 3 6 2
67 4 7 2
67 5 6 3
end
list, sepby(hhid)
*----- what you want -----
bysort hhid (pid): gen hhid2 = sum( !(relhead == 7 & relhead[_n-1] == 6) )
list, sepby(hhid)
As you can see, one line of code gets you there. The reasoning is the following:
sum() gives the running sum. The arguments to sum(), being conditions, can either be True or False. The ! denotes the logical not (see help operators).
If it is not the case that the relationship is daughter/son-in-law AND the previous relationship is daughter/son, the condition evaluates to True and takes on the value of 1, increasing the running sum by 1. If it evaluates to False, meaning that the relationship is daughter/son-in-law AND the previous relationship is daughter/son, then it takes on the value of 0 and the running sum will not increase. This gives the result you seek.
You do this using the by: prefix, since you want to check each original household independently, so to speak.
For the first observation of each original household, the condition always evaluates to True. This is because there exists no "previous" observation (relationship), so Stata considers relhead[_n-1] to be missing (., a very large number) and therefore not equal to 6. This takes the running sum from 0 to 1 for the first observation of each by-group, and so on.
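For readers who do not use Stata, the running-sum logic above can be sketched in Python. This is an illustration of the same idea, not a translation of Stata internals; the function name sub_household_ids and the tuple layout (hhid, pid, relhead) are assumptions.

```python
from itertools import groupby

# (hhid, pid, relhead), same example data as in the Stata code above.
rows = [
    (50, 1, 1), (50, 3, 6), (50, 4, 6), (50, 5, 7),
    (67, 1, 1), (67, 3, 6), (67, 4, 7), (67, 5, 6),
]

def sub_household_ids(rows):
    """Within each household, start a new sub-household at every member
    except a child-in-law (7) who directly follows a child (6)."""
    result = []
    ordered = sorted(rows, key=lambda r: (r[0], r[1]))  # like bysort hhid (pid)
    for hhid, members in groupby(ordered, key=lambda r: r[0]):
        running = 0       # the running sum restarts for each household
        prev_rel = None   # no previous member yet, like a missing relhead[_n-1]
        for _, pid, relhead in members:
            joins_previous = relhead == 7 and prev_rel == 6
            if not joins_previous:
                running += 1
            result.append((hhid, pid, relhead, running))
            prev_rel = relhead
    return result

for row in sub_household_ids(rows):
    print(row)
```

The last element of each printed tuple corresponds to hhid2 in the Stata code.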
Bottom line: learn how to use by: and take advantage of the features offered by Stata. Do not swim against the current; not here.
Edit
Please note that instead of progressively changing your example data set, you should provide a representative example from the beginning. Not doing so can render answers that are initially OK, completely inadequate.
For your modified example, add:
replace hhid2 = 1 if !inlist(relhead,6,7)
That will simply assign anyone not 6 or 7 to the same household as the head. The head is assumed to always have hhid2 == 1. If the head can have hhid2 != 1, then
bysort hhid (relhead): replace hhid2 = hhid2[1] if !inlist(relhead,6,7)
should work.
You can follow with:
bysort hhid (pid): replace hhid2 = hhid2[_n-1] + 1 if hhid2 != hhid2[_n-1] & _n > 1
but because they are IDs, it's not really necessary.
Finally, use:
gen hhid3 = string(hhid) + "_" + string(hhid2)
to create IDs with the form 50_1, 50_2, 50_3, etc.
Like I said before, if your data presents more complications, you should present a relevant example.

Create prioritization log in Excel - Two lists

I am trying to create a prioritization list. I have 6 distinct values that the user inputs into a worksheet (by way of a VBA GUI). Excel uses these values to calculate a prioritization number. I need to list the projects (through one or more functions) in two tables. The problem comes into play when there are duplicate values (e.g. ProjA = 23 and ProjB = 23).
I don't care which one is listed first, but everything I have tried has secondary issues. There are two sheets in my workbook. The first is where the "raw" data is entered and the second is where I would like the two lists to be located. Note: I do not want to use pivot tables for these lists.
Priority Number    Proj Name
57                 Project Alpha c
57                 DUI Button Project
56                 asdf
57                 asdfsdfg
56                 asdfasdf
56                 Project Alpha a
56                 Project Alpha b
18                 Project BAS
List A would include a value range of 1-20, and
List B would include a value range of 20 - inf.
So, I want it to look like this:
Table 1 (High Priority)    Table 2 (Low Priority)
Project BAS                Project Alpha c
                           DUI Button Project
Etc.
Generally, open-ended questions like this aren't well received on Stack Overflow. You should make an attempt to demonstrate what you've tried so far, and exactly where you're becoming confused. Otherwise people are doing your work for you, rather than trying to solve specific errors.
However, because you're new here, I've made an exception.
You can begin solving your issue by looping through the priority list and copy the values into the appropriate lists. For starters, I assumed that priority values begin at cell A2 and project names begin at cell B2 (the cells A1 and B1 would be the headers). I also assumed we're using a worksheet called Sheet1.
Now I need to know the length of the priority/project name list. I can determine this by using a variable called maxRows, calculated by Worksheets("Sheet1").Cells(1, 1).End(xlDown).Row. This gives the number of rows in the main table (including the header, A1).
I continue by setting the columns for each priority list (high/low). In my example, I set these to columns 3 and 4. Then I clear these columns to remove any values that already existed there.
Then I create some tracking variables that will help me determine how many items I've already added to each list (highPriorityCount and lowPriorityCount).
Finally, I loop through the original list and check whether the priority number is below 20 or not (the else condition). Following the example in the question, where Project BAS (18) sits in Table 1 (High Priority), numbers below 20 go to the high-priority list and all others go to the low-priority list. The project names are placed into the appropriate column, using the tracking variables I created above.
Note: Anywhere that uses a 2 as an offset is due to the fact that I am accounting for the header cells (row 1).
Option Explicit
Sub CreatePriorityTables()
    With Worksheets("Sheet1")
        ' Determine the length of the main table
        ' (Long rather than Integer, since row numbers can exceed 32,767)
        Dim maxRows As Long
        maxRows = .Cells(1, 1).End(xlDown).Row
        ' Set the location of the priority lists
        Dim highPriorityColumn As Long
        Dim lowPriorityColumn As Long
        highPriorityColumn = 3
        lowPriorityColumn = 4
        ' Empty the priority lists
        .Columns(highPriorityColumn).Clear
        .Columns(lowPriorityColumn).Clear
        ' Create headers for priority lists
        .Cells(1, highPriorityColumn).Value = "Table 1 (High Priority)"
        .Cells(1, lowPriorityColumn).Value = "Table 2 (Low Priority)"
        ' Create some useful counts to track
        Dim highPriorityCount As Long
        Dim lowPriorityCount As Long
        highPriorityCount = 0
        lowPriorityCount = 0
        ' Loop through all values and copy into priority lists
        Dim i As Long
        For i = 2 To maxRows
            ' Priority numbers below 20 go to the high-priority list
            If .Cells(i, 1).Value < 20 Then
                .Cells(highPriorityCount + 2, highPriorityColumn).Value = .Cells(i, 2).Value
                highPriorityCount = highPriorityCount + 1
            Else
                .Cells(lowPriorityCount + 2, lowPriorityColumn).Value = .Cells(i, 2).Value
                lowPriorityCount = lowPriorityCount + 1
            End If
        Next i
    End With
End Sub
This should produce the expected behavior.
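The same split can be sketched outside Excel to check the logic, including the duplicate priority numbers. This is a Python illustration only; the function name split_by_priority and the threshold parameter are hypothetical, and it assumes, as in the question's example, that numbers below 20 belong in the high-priority table.

```python
# (priority number, project name), taken from the example in the question.
projects = [
    (57, "Project Alpha c"), (57, "DUI Button Project"), (56, "asdf"),
    (57, "asdfsdfg"), (56, "asdfasdf"), (56, "Project Alpha a"),
    (56, "Project Alpha b"), (18, "Project BAS"),
]

def split_by_priority(projects, threshold=20):
    """Return (high, low) name lists; numbers below the threshold count as high priority."""
    high, low = [], []
    for number, name in projects:
        if number < threshold:
            high.append(name)
        else:
            low.append(name)
    return high, low

high, low = split_by_priority(projects)
print(high)  # ['Project BAS']
print(low)
```

Duplicate priority numbers cause no trouble here, because each row is copied by position rather than looked up by its number.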

How do I calculate the maximum or minimum seen so far in a sequence, and its associated id?

From this Stata FAQ, I know the answer to the first part of my question. But here I'd like to go a step further. Suppose I have the following data (already sorted by a variable not shown):
id v1
A 9
B 8
C 7
B 7
A 5
C 4
A 3
A 2
To calculate the minimum in this sequence, I do
generate minsofar = v1 if _n==1
replace minsofar = min(v1[_n-1], minsofar[_n-1]) if missing(minsofar)
To get
id v1 minsofar
A 9 9
B 8 9
C 7 8
B 7 7
A 5 7
C 4 5
A 3 4
A 2 3
Now I'd like to generate a variable, call it id_min that gives me the ID associated with minsofar, so something like
id v1 minsofar id_min
A 9 9 A
B 8 9 A
C 7 8 B
B 7 7 C
A 5 7 C
C 4 5 A
A 3 4 C
A 2 3 A
Note that C is associated with 7, because 7 is first associated with C in the current sorting. And just to be clear, my ID variable here shows as a string variable just for the sake of readability -- it's actually numeric.
Ideas?
EDIT:
I suppose
gen id_min = id if _n<=2
replace id_min = id[_n-1] if v1[_n-1]<minsofar[_n-1] & missing(id_min)
replace id_min = id_min[_n-1] if missing(id_min)
does the job at least for the data in this example. Don't know if it would work for more complex cases.
This works for your example. It uses the user-written command vlookup, which you can install by running findit vlookup and following the link that appears.
clear
set more off
input ///
str1 id v1
A 9
B 8
C 7
B 7
A 5
C 4
A 3
A 2
end
encode id, gen(id2)
order id2
drop id
list
*----- what you want -----
// your code
generate minsofar = v1 if _n==1
replace minsofar = min(v1[_n-1], minsofar[_n-1]) if missing(minsofar)
// save original sort
gen osort = _n
// group values of v1 but respecting original sort so values of
// id2 don't jump around
sort v1 osort
// set obs after first as missing so id2 is unique within v1
gen v2 = v1
by v1: replace v2 = . if _n > 1
// lookup
vlookup minsofar, gen(idmin) key(v2) value(id2)
// list
sort osort
drop osort v2
list, sep(0)
Your code has generate minsofar = v1 if _n==1 which is better coded as generate minsofar = v1 in 1, because it is more efficient.
Your minsofar variable is just a displaced copy of v1, so if this is always the case, there should be simpler ways of handling your problem. I suspect your problem is easier than you have acknowledged until now, and that has come through your post. Perhaps giving more context, expanded example data, etc. could get you better advice.
This is both easier and a little more challenging than implied so far. Given value (a little more evocative than the OP's v1) and a desire to keep track of minimum so far, that's for example
generate min_so_far = value[1]
replace min_so_far = min(value, min_so_far[_n-1]) in 2/L
where the second statement exploits the unsurprising fact that Stata replaces in the current order of observations. [_n-1] is the index of the previous observation and in 2/L implies a loop over all observations from the second to the last.
Note that the OP's version is buggy: by always looking at the previous observation, the code never looks at the very last value and will overlook that if it is a new minimum. It may be that the OP really wants "minimum before now" but that is not what I understand by "minimum so far".
If we have missing values in value they will not enter the comparison in any malign way: missing is always regarded as arbitrarily large by Stata, so missings will be recorded if and only if no non-missings are present so far, which is as it should be.
The identifier of that minimum at first sight yields to the same logic
generate min_so_far = value[1]
gen id_min = id[1]
replace min_so_far = min(value, min_so_far[_n-1]) in 2/L
replace id_min = cond(value < min_so_far[_n-1], id, id_min[_n-1]) in 2/L
There are at least two twists that might bite. The OP mentions a possibility that the identifier might be missing so that we might have a new minimum but not know its identifier. The code just given will use a missing identifier, but if the desire is to keep separate track of the identifier of the minimum value with known identifiers, different code is needed.
A twist not mentioned to date is that observations with different identifier might all have the same minimum so far. The code above replaces the identifier only the first time a particular minimum is seen; if the desire is to record the identifier of the last occurrence the < in the last code line above should be replaced with <=. If the desire is to keep track of the all the identifiers of the minimum so far, then a string variable is needed to concatenate all the identifiers.
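The tie-handling choice can be illustrated with a short Python sketch. It is an illustration of the logic only, not Stata code; it assumes no missing values, and the function name min_so_far_with_id and the keep_last flag are hypothetical.

```python
def min_so_far_with_id(ids, values, keep_last=False):
    """Return (minimum so far, id of that minimum) for each position.

    keep_last=False keeps the id of the first occurrence of a tied minimum
    (the '<' comparison); keep_last=True keeps the last occurrence ('<=').
    """
    out = []
    best_value, best_id = None, None
    for ident, value in zip(ids, values):
        is_new = (
            best_value is None
            or value < best_value
            or (keep_last and value == best_value)
        )
        if is_new:
            best_value, best_id = value, ident
        out.append((best_value, best_id))
    return out

ids = ["A", "B", "C", "B", "A", "C", "A", "A"]
values = [9, 8, 7, 7, 5, 4, 3, 2]

print(min_so_far_with_id(ids, values))
print(min_so_far_with_id(ids, values, keep_last=True))
```

With the example data the two variants differ only at the fourth observation, where the tied value 7 is credited to C in the first case and to B in the second.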
With a structure of panel or longitudinal data the whole thing is done under the aegis of by:.
I can't see a need to resort to user-written extensions here.