I have a use case with employee data for a company across different age groups.
I need to find the highest salary of male and female employees in each of three age-group categories.
For details, please see the link below:
http://www.myhadoopexamples.com/2014/03/01/hadoop-mapreduce-example-with-partitioner/
My question is: here the mapper emits only two keys, i.e. male and female,
and we have set 3 reducers in the driver class, so 3 partitions will be created.
There could be two possibilities:
Three reducers will run, one for each of the 3 partitions; each finds the max female and male salary in its partition and gives the expected result shown in the link above.
Only two reducers will actually run, one for male and one for female, and they do the calculation.
If you want to know the real number of reducers, you'd better run it on a cluster.
As said in "Number of reducer in map reduce", it will launch 3 reducers, and 1 reducer will process no data. If you want to use all three reducers, you may change the Partitioner class, for example partitioning the data by age group.
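A rough sketch of what such an age-group Partitioner could look like (this is not the code from the linked example; the Text key/value types, the "name,age,salary" record layout and the age boundaries are assumptions):

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class AgeGroupPartitioner extends Partitioner<Text, Text> {

    @Override
    public int getPartition(Text key, Text value, int numReduceTasks) {
        // value is assumed to look like "name,age,salary"; adjust the index
        // to wherever the age field sits in your records
        String[] fields = value.toString().split(",");
        int age = Integer.parseInt(fields[1].trim());

        if (numReduceTasks == 0) {
            return 0;                   // map-only job, nothing to partition
        }
        if (age <= 20) {
            return 0;                   // age group 1 -> reducer 0
        } else if (age <= 30) {
            return 1 % numReduceTasks;  // age group 2 -> reducer 1
        } else {
            return 2 % numReduceTasks;  // age group 3 -> reducer 2
        }
    }
}

The driver would then register it with job.setPartitionerClass(AgeGroupPartitioner.class) next to job.setNumReduceTasks(3), so each of the three reducers receives one age group.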
Subject ID   Condition   Task A   Task B   First Task
1001         1
1002         2
1003         1
This is a within-subjects design. Each participant took part in tasks A and B; however, the order in which the tasks were presented (first or second) depends upon condition (e.g., those in condition 1 perform task A first followed by task B, and vice versa). Note that the task columns do have their own scores, but I cannot add them here.
Is it possible to produce an elegant piece of code that mutates a new column/variable called 'first task'? For subjects in condition 1, their score from task A should be put into this new 'first task' column; for subjects in condition 2, their score from task B should be put there instead (because those in condition 2 received task B first).
I hope this makes sense. I have been trying to combine mutate with case_when, group_by and if/if_else to achieve this, but have not succeeded.
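A minimal dplyr sketch of the kind of mutate()/case_when() combination being described (the column names subject_id, condition, task_a, task_b and the scores are made up for illustration):

library(dplyr)

df <- data.frame(
  subject_id = c(1001, 1002, 1003),
  condition  = c(1, 2, 1),
  task_a     = c(10, 12, 9),    # hypothetical Task A scores
  task_b     = c(14, 11, 13)    # hypothetical Task B scores
)

df <- df %>%
  mutate(first_task = case_when(
    condition == 1 ~ task_a,    # condition 1 performed task A first
    condition == 2 ~ task_b     # condition 2 performed task B first
  ))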
Consider the fictional data below, which illustrates my problem; the real data contains thousands of rows.
Figure 1
Each individual is characterized by values attached to A, B, C, D, E. In Figure 1, I show 3 individuals for which some characteristics are missing. Do you have any idea how I can get the completed table shown in Figure 2?
Figure 2
With an ID in Figure 1 I could have used the carryforward command to fill in the values. But since each individual has a different number of rows, I don't know how to create the ID.
Edit: all individuals share the characteristic "A".
Edit: the existing order of observations is informative.
To detect where a new individual (id) starts, the idea is to check, in each row, whether the preceding value of char is >= the current one.
This works only if your data are ordered, but that appears to be the case in your data.
* flag the first row of each individual: a new individual starts whenever
* the preceding value of char is >= the current one (requires ordered data)
gen id = 1 if (char[_n-1] >= char[_n]) | _n == 1
* turn the flags into a running count, giving each individual a unique id
replace id = sum(id) if id == 1
* carry the id down to the remaining rows of the same individual
replace id = id[_n-1] if missing(id)
* create every id x char combination, then drop the _fillin marker
fillin id char
drop _fillin
If one individual has only the characteristics A and C and the next has only D and E, this won't work, but such a case seems impossible to detect from your data.
SELECT
a.id,
b.url as codingurl
FROM fact_A a
INNER JOIN dim_B b
ON strpos(a.url,b.url)> 0
Record count in Fact_A: 2 million
Record count in Dim_B: 1,500
Time taken to execute: 10 minutes
Number of nodes: 2
Could someone help me understand why the above query takes so long to execute?
We have declared a distribution key on Fact_A so that the records are distributed evenly across both nodes, and a sort key is created on URL in Fact_A.
Dim_B table is created with DISTRIBUTION ALL.
Redshift does not have full-text search indexes or prefix indexes, so a query like this (with strpos used in the join filter) results in a full table scan, executing strpos roughly 3 billion times (2 million rows × 1,500 rows).
Depending on which urls are in dim_B, you might be able to optimise this by extracting prefixes into separate columns. For example, if you always compare subpaths of the form http[s]://hostname/part1/part2/part3 then you can extract "part1/part2/part3" as a separate column both in fact_A and dim_B, and make it the dist and sort keys.
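For what it's worth, a rough sketch of that prefix-extraction idea, assuming every dim_B URL is matched as a path of the form part1/part2/part3 (the regular expression, the new table name and the column name url_path are assumptions, not taken from the question):

-- materialise fact_A with the path extracted, distributed and sorted on it
CREATE TABLE fact_a_paths
DISTKEY (url_path)
SORTKEY (url_path)
AS
SELECT id,
       url,
       REGEXP_REPLACE(url, '^https?://[^/]+/', '') AS url_path
FROM fact_A;

-- equi-join on the extracted path instead of running strpos 3 billion times
SELECT a.id,
       b.url AS codingurl
FROM fact_a_paths a
JOIN (SELECT url,
             REGEXP_REPLACE(url, '^https?://[^/]+/', '') AS url_path
      FROM dim_B) b
  ON a.url_path = b.url_path;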
You can also rely on the parallelism of Redshift. If you resize your cluster from 2 nodes to 20 nodes, you should see an immediate performance improvement of 8-10 times, as this kind of query can (for the most part) be executed by each node in parallel.
I'm working with our IT group to develop an optimizer for logistics operations. The basic design is that it will look at shipments, run a search for additional shipments originating within XX miles of the previous shipment's destination, and link them together in a loop. It will continue to do this until it hits a user-defined number of shipment legs, with the loop ending at or close to the first shipment's origin.
The issue we are facing is that the materials we ship are chemicals, which can interact if placed in a tank that previously contained chemical XX. The obvious solution is to use a different tank or wash it out, but we also need the optimizer to compute solutions that respect these restrictions before resorting to that.
My problem is that, currently, there is nothing on the market that does this prior-product optimization.
The question is: is there some kind of logic-table function I can write that will allow the optimizer to see an element in the data set (say, Product Family 1), pull from a product database containing predefined product families (e.g., PF 1 = Chemicals A1-B7, PF 2 = Chemicals B8-J8, etc.), and then check against a logic table that defines a do-not-ship-with list (e.g., PF 1 cannot ship if PF 2 was on the previous leg)?
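One way this is sometimes modelled is with two lookup tables, sketched below in SQL; every table and column name here is hypothetical and would need to be mapped onto whatever the optimizer actually stores:

CREATE TABLE product_family_map (
    chemical_code VARCHAR(10) PRIMARY KEY,  -- e.g. 'A1' ... 'J8'
    family_id     INT NOT NULL              -- e.g. 1 for PF 1, 2 for PF 2
);

CREATE TABLE family_incompatibility (
    family_id       INT NOT NULL,   -- family of the candidate next shipment
    prior_family_id INT NOT NULL,   -- family carried in the tank on the previous leg
    PRIMARY KEY (family_id, prior_family_id)
);

-- example rule: PF 1 cannot ship if PF 2 was on the previous leg
INSERT INTO family_incompatibility (family_id, prior_family_id) VALUES (1, 2);

-- keep only candidate shipments with no conflict against the previous leg,
-- here assumed to have carried family 2
SELECT c.shipment_id
FROM candidate_shipments c
JOIN product_family_map pf
  ON pf.chemical_code = c.chemical_code
LEFT JOIN family_incompatibility x
  ON x.family_id = pf.family_id
 AND x.prior_family_id = 2
WHERE x.family_id IS NULL;

The optimizer's search step would then only consider shipments that survive this filter when linking the next leg onto the loop.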
UPDATE:
I solved the first part of the problem. I created unique ids for each observation:
gen id=_n
Then, I used
fillin id categ
which essentially created what I was looking for.
However, for the rest of the variables (everything except id and categ), the values are now missing in almost all observations. I need your help duplicating the rest of the variables instead of having them missing.
Just as an example, each observation is associated with a particular week, and I am now missing most of those values. Another dummy variable indicates whether a purchase was made at a drug store or a grocery store; most of those are missing too.
Thanks!
ORIGINAL MESSAGE:
Need your help in Stata!
Each observation in my database is a 1-unit purchase of a beer product made by a customer. These purchases are grouped into 8 general categories, so the variable "categ" takes values from 1 to 8 (1=import, 2=craft, 3=premium, 4=light, etc.).
For my multinomial logit model, I need each observation expanded so that every category is observed as either purchased or not purchased by the customer.
Assume, this is my initial dataset:
customer id    beer category    units purchased
1              1                1
2              3                1
3              2                1
This is what I am looking for:
customer id    beer category    units purchased
1              1                1
1              2                0
1              3                0
2              1                0
2              2                0
2              3                1
3              1                0
3              2                1
3              3                0
Currently, my dataset is 600,000 obs. After this procedure, I should have 600,000*8=4,800,000 obs.
When constructing this code, it is necessary that all other variables in the dataset are duplicated into the added rows for each beer category.
I assume that "fillin", and less likely "expand", might work.
Your help would be tremendously appreciated.
Thanks!
This is an old question, but I'll post a possible answer in case someone else is having this problem.
In this case, you could generate an indicator variable for every option of your "choice" variable, and after that apply the reshape long command:
* create one 0/1 indicator per beer category: b1, b2, ..., b8
tab beercategory, gen(b)
* go long: one row per customer per category; j() holds the category number
reshape long b, i(customerid) j(newvarname)
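If you instead stay with the fillin route from the update, the remaining variables can be spread within each id afterwards; a sketch, using a hypothetical variable called week (missing sorts last in Stata, so week[1] is the id's original non-missing value):

* copy each id's original week value into the rows created by fillin
bysort id (week): replace week = week[1]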
Greetings