I have the following dataset:
clear
input float(department employee expertise_area share)
1 56 334 1
1 143 389 .04
1 143 334 .18
1 143 383 .02
1 143 398 .1
1 143 414 .02
1 143 396 .08
1 143 385 .08
1 143 403 .3
1 143 409 .02
1 143 373 .02
1 143 392 .06
1 143 397 .06
1 143 394 .02
1 214 373 1
4 145 399 .029
4 145 409 .7681
4 145 311 .0145
4 145 403 .1884
4 161 62 .4
4 161 373 .6
4 285 355 .5333
4 285 392 .0333
4 285 304 .0333
4 285 310 .2333
4 285 73 .0333
4 285 331 .0333
4 285 399 .0333
4 285 414 .0667
186 161 62 .4
186 161 373 .6
186 247 409 .0025
186 247 311 .0025
186 247 338 .25
186 247 298 .0051
186 247 334 .649
186 247 337 .0051
186 247 404 .0076
186 247 339 .0051
186 247 301 .0025
186 247 403 .0631
186 247 347 .0025
186 247 336 .0051
186 285 304 .0333
186 285 399 .0333
186 285 355 .5333
186 285 392 .0333
186 285 310 .2333
186 285 73 .0333
186 285 414 .0667
186 285 331 .0333
end
I would like to compute the difference between the distributions of prior experience of the employees in a team (or department).
This is the mean Euclidean distance, which measures the separation of individuals within a team:
Here, p_ij and p_kj are the shares of employee i's and k's expertise in area j over their careers, and n equals the team size.
For example, in department 1, employee 143 has worked 18% of his career on area 334 (this corresponds to observation 3). The team size for department 1 is 3, that is, n=3 for department 1.
In summary, I want to calculate the Euclidean distance for each department (1, 4, 186) for 3 points (or employees) each [(56, 143, 214), (145, 161, 285), (161, 247, 285) respectively] with 13, 13 and 22 dimensions (or expertise_area) respectively. Note that I should be able to produce output even if a department has more than 3 employees (or points).
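As an illustration of the inputs, under the standard Euclidean distance between two employees' share vectors (treating any area an employee has no observation for as a share of 0), the pairwise distance between employees 56 and 143 in department 1 is sqrt((1 - .18)^2 + .04^2 + .02^2 + .1^2 + .02^2 + .08^2 + .08^2 + .3^2 + .02^2 + .02^2 + .06^2 + .06^2 + .02^2) = sqrt(.796) ≈ .892, i.e. the distance for a single pair of team members.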
The output should look as follows:
+------------+--------------------+
| department | euclidean_distance |
+------------+--------------------+
| 1 | .4022 |
| 4 | .4131 |
| 186 | .3882 |
+------------+--------------------+
How can I compute this in Stata?
I am currently practicing SAS programming with two SAS datasets (sample and master). Below are hypothetical (dummy) data created for illustration. I would like to extract, for each ID in the sample dataset, the last 12 months of information from the master dataset, based on the YEARMONTH value (the desired output is given in the third table).
Similarly, I have many other columns for which I need 12 months of data for each ID and YEARMONTH.
I have written code with a DO loop that iterates over each row of the sample dataset, finds the matching data in the master table between the YEARMONTH and its 12-month-ago date for each iteration, and then transposes it using PROC TRANSPOSE. I then merge the sample dataset with the transposed data in a DATA step merge by ID and YEARMONTH. But I feel this code is not optimized, because it loops through the master table once for every row of the sample dataset. Can anyone help me solve this problem in SAS in a more optimized way?
The sample dataset (dataset name: sample):
ID YEARMONTH NO_OF_CUST
1 200909 50
1 201005 65
1 201008 78
1 201106 95
2 200901 65
2 200902 45
2 200903 69
2 201005 14
2 201006 26
2 201007 98
3 201011 75
3 201012 75
The master dataset (dataset name: master; a huge dataset covering each ID from the start of the account to date):
ID YEARMONTH NO_OF_CUST
1 200808 125
1 200809 125
1 200810 111
1 200811 174
1 200812 98
1 200901 45
1 200902 74
1 200903 73
1 200904 101
1 200905 164
1 200906 104
1 200907 22
1 200908 35
1 200909 50
1 200910 77
1 200911 86
1 200912 95
1 201001 95
1 201002 87
1 201003 79
1 201004 71
1 201005 65
1 201006 66
1 201007 66
1 201008 78
1 201009 88
1 201010 54
1 201011 45
1 201012 100
1 201101 136
1 201102 111
1 201103 17
1 201104 77
1 201105 111
1 201106 95
1 201107 79
1 201108 777
1 201109 758
1 201110 32
1 201111 15
1 201112 22
2 200711 150
2 200712 150
2 200801 44
2 200802 385
2 200803 65
2 200804 66
2 200805 200
2 200806 333
2 200807 285
2 200808 265
2 200809 222
2 200810 220
2 200811 205
2 200812 185
2 200901 65
2 200902 45
2 200903 69
2 200904 546
2 200905 21
2 200906 256
2 200907 214
2 200908 14
2 200909 44
2 200910 65
2 200911 88
2 200912 79
2 201001 65
2 201002 45
2 201003 69
2 201004 54
2 201005 14
2 201006 26
2 201007 98
3 200912 77
3 201001 66
3 201002 69
3 201003 7
3 201004 7
3 201005 7
3 201006 65
3 201007 75
3 201008 85
3 201009 89
3 201010 100
3 201011 75
3 201012 75
Below is the sample output I am trying to produce for each ID in the sample dataset.
Without sample code of what you've been trying so far it is a bit difficult to figure out what you want, but a SAS way of getting the same result as the image might be the following.
EDIT: Edited my answer so it takes the last 12 months by ID
data test;
infile datalines;
input ID YEARMONTH NO_OF_CUST;
datalines;
1 200808 125
1 200809 125
1 200810 111
1 200811 174
1 200812 98
1 200901 45
1 200902 74
1 200903 73
1 200904 101
1 200905 164
1 200906 104
1 200907 22
1 200908 35
1 200909 50
1 200910 77
1 200911 86
1 200912 95
1 201001 95
1 201002 87
1 201003 79
1 201004 71
1 201005 65
1 201006 66
1 201007 66
1 201008 78
1 201009 88
1 201010 54
1 201011 45
1 201012 100
1 201101 136
1 201102 111
1 201103 17
1 201104 77
1 201105 111
1 201106 95
1 201107 79
1 201108 777
1 201109 758
1 201110 32
1 201111 15
1 201112 22
2 200711 150
2 200712 150
2 200801 44
2 200802 385
2 200803 65
2 200804 66
2 200805 200
2 200806 333
2 200807 285
2 200808 265
2 200809 222
2 200810 220
2 200811 205
2 200812 185
2 200901 65
2 200902 45
2 200903 69
2 200904 546
2 200905 21
2 200906 256
2 200907 214
2 200908 14
2 200909 44
2 200910 65
2 200911 88
2 200912 79
2 201001 65
2 201002 45
2 201003 69
2 201004 54
2 201005 14
2 201006 26
2 201007 98
3 200912 77
3 201001 66
3 201002 69
3 201003 7
3 201004 7
3 201005 7
3 201006 65
3 201007 75
3 201008 85
3 201009 89
3 201010 100
3 201011 75
3 201012 75
;
run;
proc sort data=test;
by id yearmonth;
run;
data result;
set test;
/* PREV_MONTH_0 holds the current month, PREV_MONTH_1-PREV_MONTH_12 the previous 12 months */
array prev_month {13} PREV_MONTH_0-PREV_MONTH_12;
by id;
/* reset the rolling window at the start of each ID */
if first.id then do;
do i = 1 to 13;
prev_month(i) = 0;
end;
end;
/* shift the window back one month, then store the current value in slot 1 */
/* (assumes one row per consecutive month within each ID) */
do i = 13 to 2 by -1;
prev_month(i) = prev_month(i-1);
end;
prev_month(1) = NO_OF_CUST;
drop i PREV_MONTH_0;
/* retain so the array values carry over from one observation to the next */
retain PREV_MONTH:;
run;
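If the final goal is to keep only the ID/YEARMONTH combinations that appear in the sample dataset, a possible last step (a sketch, assuming the sample dataset is named sample) is a DATA step merge:
proc sort data=sample;
by id yearmonth;
run;
data want;
merge sample(in=insample) result;
by id yearmonth;
if insample;
run;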
I'm new to R. This is my data (using dplyr):
> withCommas
Source: local data frame [326 x 1]
NA
1 16,244,600
2 8,227,103
3 5,959,718
4 3,428,131
5 2,612,878
6 2,471,784
7 2,252,664
8 2,014,775
9 2,014,670
10 1,841,710
.. ...
Classes ‘tbl_df’ and 'data.frame': 326 obs. of 1 variable:
$ : Factor w/ 207 levels ""," 1,008 "," 1,129 ",..: 40 178 143 100 66 63 61 58 57 16 ...
I'm trying to get rid of the commas (so the first row should be 16244600). So I tried the following:
#1st try
noCommas <- gsub("([0-9]+)\\,([0-9])", "\\1\\2", withCommas)
#2nd try
noCommas <- gsub(",", "", withCommas)
In both cases, I got this output:
[1] "c(40 178 143 100 66 63 61 58 57 16 14 11 9 6 4 182 176 174 170 161 148 147 139 137 136 134 118 117 116 114 113 109 107 105 95 93 92 90 89 88 87 84 83 78 75 74 73 72 71 70 56 55 49 47 43 42 39 28 25 24 23 190 188 181 172 165 163 162 160 153 152 151 150 149 146 145 144 138 132 131 130 129 128 127 126 125 124 115 112 111 110 106 98 97 96 94 86 85 82 81 80 77 76 69 68 54 52 51 50 46 45 44 41 \n38 37 36 35 34 33 32 31 30 29 27 26 22 21 20 19 18 17 187 186 185 184 183 179 177 169 168 167 166 159 158 157 156 155 142 141 140 122 121 120 119 104 103 102 101 99 67 65 64 62 60 59 15 13 12 10 8 7 5 3 2 189 180 175 173 173 171 164 154 135 133 108 91 79 53 48 123 1 191 191 191 191 191 1 191 191 191 191 191 191 191 191 191 191 191 191 191 191 191 191 191 191 1 206 1 205 200 202 198 201 196 \n195 204 194 199 193 203 197 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1)"
This seems very strange to me as I don't understand where the numbers are coming from. Any help appreciated.
Edit:
Only the first 225 rows of the variable withCommas have values. After that, the values of the column are empty.
Source: http://data.worldbank.org/data-catalog/GDP-ranking-table
CSV: https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2FGDP.csv
What about this solution? I think the main problem arises because a data frame is a list and gsub expects a character vector; when the whole data frame is passed, it is coerced with as.character, so the function sees the deparsed column (for a factor, its integer codes) rather than the individual values. That is the reason for the apply call. Of course, if there is only one column, you can instead pass just that column as a vector with ddf$column_with_commas, as in the solution provided by other users.
as.data.frame(apply(ddf, 2, function(x) as.numeric(gsub(",", "", x))))
NA.
1 16244600
2 8227103
3 5959718
4 3428131
5 2612878
6 2471784
7 2252664
8 2014775
9 2014670
10 1841710
Data
ddf <- structure(list(NA. = structure(c(2L, 10L, 9L, 8L, 7L, 6L, 5L,
4L, 3L, 1L), .Label = c("1,841,710", "16,244,600", "2,014,670",
"2,014,775", "2,252,664", "2,471,784", "2,612,878", "3,428,131",
"5,959,718", "8,227,103"), class = "factor")), .Names = "NA.", class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6", "7", "8", "9", "10"))
data <- read.table(header=F, text="1 16,244,600
2 8,227,103
3 5,959,718
4 3,428,131
5 2,612,878
6 2,471,784
7 2,252,664
8 2,014,775
9 2,014,670
10 1,841,710 ")
colnames(data) <- c("a","b")
data$b <- as.numeric(gsub(",", "", data$b))
Output:
a b
1 1 16244600
2 2 8227103
3 3 5959718
4 4 3428131
5 5 2612878
6 6 2471784
7 7 2252664
8 8 2014775
9 9 2014670
10 10 1841710
I need to perform histogram matching between an image and a histogram I receive as text.
I compute the CDF for both of them:
//Calculating cumulative histogram of src
double total = src.rows*src.cols;
double probSrc[256]; // 256 bins, one per intensity value
int newValuesSrc[256];
double cuml = 0;
for(int j = 0; j < 256; j++)
{
probSrc[j] = imageHistogram[j]/total; // Probability of each value in image
cuml = cuml + probSrc[j]; // Cumulative probability of current and all previous values
double cdfmax = cuml * 255; // Cumulative probability * max value
newValuesSrc[j] = (int) round(cdfmax);
cout << imageHistogram[j] << " "<< probSrc[j] << " " << newValuesSrc[j] << endl;
}
readHistogramFromFile();
//Calculating cumulative histogram from file
double probDst[256];
int newValuesDst[256];
cuml = 0;
for(int j = 0; j < 256; j++)
{
probDst[j] = receivedHistogram[j]/total; // Probability of each value in the received histogram
cuml = cuml + probDst[j]; // Cumulative probability of current and all previous values
double cdfmax = cuml * 255; // Cumulative probability * max value
newValuesDst[j] = (int) round(cdfmax);
cout << receivedHistogram[j] << " "<< probDst[j] << " " << newValuesDst[j] << endl;
}
and I get these values:
For the src image:
207677 0.0901376 23
37615 0.016326 27
19098 0.00828906 29
11955 0.0051888 31
8744 0.00379514 32
7386 0.00320573 32
6546 0.00284115 33
6178 0.00268142 34
5967 0.00258984 34
5437 0.00235981 35
5280 0.00229167 36
5127 0.00222526 36
5002 0.00217101 37
4839 0.00210026 37
4754 0.00206337 38
4676 0.00202951 38
4547 0.00197352 39
4517 0.0019605 39
4484 0.00194618 40
4290 0.00186198 40
4197 0.00182161 41
4188 0.00181771 41
4265 0.00185113 42
4229 0.0018355 42
4233 0.00183724 43
4245 0.00184245 43
4358 0.00189149 44
4330 0.00187934 44
4400 0.00190972 45
4474 0.00194184 45
4519 0.00196137 46
4415 0.00191623 46
4477 0.00194314 47
4468 0.00193924 47
4580 0.00198785 48
4416 0.00191667 48
4558 0.0019783 49
4674 0.00202865 49
4705 0.0020421 50
4998 0.00216927 50
4848 0.00210417 51
4782 0.00207552 51
4883 0.00211936 52
4989 0.00216536 52
4957 0.00215148 53
4987 0.0021645 53
5133 0.00222786 54
4967 0.00215582 54
5217 0.00226432 55
5185 0.00225043 56
5140 0.0022309 56
5236 0.00227257 57
5291 0.00229644 57
5458 0.00236892 58
5473 0.00237543 59
5464 0.00237153 59
5495 0.00238498 60
5439 0.00236068 60
5458 0.00236892 61
5557 0.00241189 62
5881 0.00255252 62
5900 0.00256076 63
5935 0.00257595 64
5902 0.00256163 64
6040 0.00262153 65
6203 0.00269227 66
6146 0.00266753 66
6140 0.00266493 67
6075 0.00263672 68
6054 0.0026276 68
6238 0.00270747 69
6060 0.00263021 70
6153 0.00267057 70
6303 0.00273568 71
6231 0.00270443 72
6278 0.00272483 72
6360 0.00276042 73
6359 0.00275998 74
6368 0.00276389 75
6438 0.00279427 75
6329 0.00274696 76
6408 0.00278125 77
6360 0.00276042 77
6378 0.00276823 78
6329 0.00274696 79
6394 0.00277517 79
6517 0.00282856 80
6521 0.0028303 81
6707 0.00291102 82
6788 0.00294618 82
6761 0.00293446 83
6878 0.00298524 84
7004 0.00303993 85
6963 0.00302214 85
7050 0.0030599 86
6940 0.00301215 87
6875 0.00298394 88
7073 0.00306988 89
7035 0.00305339 89
7146 0.00310156 90
7007 0.00304123 91
7159 0.0031072 92
7089 0.00307682 92
7185 0.00311849 93
7410 0.00321615 94
7237 0.00314106 95
7334 0.00318316 96
7364 0.00319618 97
7452 0.00323437 97
7760 0.00336806 98
7839 0.00340234 99
7882 0.00342101 100
7885 0.00342231 101
8055 0.00349609 102
7923 0.0034388 103
8165 0.00354384 103
8306 0.00360503 104
8271 0.00358984 105
8275 0.00359158 106
8634 0.0037474 107
8684 0.0037691 108
8752 0.00379861 109
9080 0.00394097 110
8958 0.00388802 111
9094 0.00394705 112
9279 0.00402734 113
9234 0.00400781 114
9348 0.00405729 115
9440 0.00409722 116
9431 0.00409332 117
9662 0.00419358 118
9842 0.0042717 119
9816 0.00426042 121
9957 0.00432161 122
10353 0.00449349 123
10626 0.00461198 124
10764 0.00467187 125
10832 0.00470139 126
10767 0.00467318 128
11222 0.00487066 129
11469 0.00497786 130
11661 0.0050612 131
11731 0.00509158 133
12023 0.00521832 134
12086 0.00524566 135
12094 0.00524913 137
12362 0.00536545 138
12364 0.00536632 139
12659 0.00549436 141
12587 0.00546311 142
12776 0.00554514 144
13037 0.00565842 145
13252 0.00575174 147
13425 0.00582682 148
13595 0.00590061 150
13795 0.00598741 151
14308 0.00621007 153
14232 0.00617708 154
14657 0.00636155 156
14966 0.00649566 157
14867 0.00645269 159
15051 0.00653255 161
15510 0.00673177 162
15357 0.00666536 164
15326 0.00665191 166
15308 0.0066441 168
15316 0.00664757 169
15321 0.00664974 171
15298 0.00663976 173
15435 0.00669922 174
15496 0.00672569 176
15307 0.00664366 178
15343 0.00665929 179
15356 0.00666493 181
15315 0.00664714 183
15444 0.00670312 185
15346 0.00666059 186
15583 0.00676345 188
15429 0.00669661 190
15641 0.00678863 191
15661 0.00679731 193
15638 0.00678733 195
15689 0.00680946 197
15866 0.00688628 198
15552 0.00675 200
15150 0.00657552 202
15185 0.00659071 203
14941 0.00648481 205
14989 0.00650564 207
14585 0.0063303 208
14718 0.00638802 210
14553 0.00631641 212
14612 0.00634201 213
14520 0.00630208 215
14358 0.00623177 216
13931 0.00604644 218
13580 0.0058941 220
13370 0.00580295 221
13281 0.00576432 222
13053 0.00566536 224
12711 0.00551693 225
12556 0.00544965 227
12556 0.00544965 228
12125 0.00526259 229
12184 0.00528819 231
11975 0.00519748 232
12198 0.00529427 233
11919 0.00517318 235
11898 0.00516406 236
11589 0.00502995 237
11348 0.00492535 239
11011 0.00477908 240
10523 0.00456727 241
10388 0.00450868 242
9795 0.0042513 243
9251 0.00401519 244
9014 0.00391233 245
8436 0.00366146 246
8266 0.00358767 247
7851 0.00340755 248
7299 0.00316797 249
6996 0.00303646 250
6303 0.00273568 250
5625 0.00244141 251
5375 0.0023329 251
5102 0.00221441 252
4747 0.00206033 253
4313 0.00187196 253
3809 0.00165321 253
3307 0.00143533 254
2756 0.00119618 254
2276 0.000987847 254
1935 0.000839844 255
1617 0.000701823 255
1087 0.000471788 255
547 0.000237413 255
217 9.4184e-05 255
31 1.34549e-05 255
4 1.73611e-06 255
0 0 255
0 0 255
0 0 255
0 0 255
0 0 255
0 0 255
0 0 255
0 0 255
0 0 255
0 0 255
0 0 255
0 0 255
0 0 255
0 0 255
0 0 255
0 0 255
0 0 255
And for the histogram I receive:
10 4.34028e-06 0
11 4.77431e-06 0
12 5.20833e-06 0
13 5.64236e-06 0
14 6.07639e-06 0
15 6.51042e-06 0
16 6.94444e-06 0
17 7.37847e-06 0
18 7.8125e-06 0
19 8.24653e-06 0
20 8.68056e-06 0
22 9.54861e-06 0
24 1.04167e-05 0
26 1.12847e-05 0
28 1.21528e-05 0
30 1.30208e-05 0
34 1.47569e-05 0
38 1.64931e-05 0
42 1.82292e-05 0
50 2.17014e-05 0
60 2.60417e-05 0
70 3.03819e-05 0
80 3.47222e-05 0
90 3.90625e-05 0
100 4.34028e-05 0
120 5.20833e-05 0
140 6.07639e-05 0
160 6.94444e-05 0
160 6.94444e-05 0
150 6.51042e-05 0
140 6.07639e-05 0
130 5.64236e-05 0
120 5.20833e-05 0
110 4.77431e-05 0
100 4.34028e-05 0
90 3.90625e-05 0
80 3.47222e-05 0
70 3.03819e-05 0
60 2.60417e-05 0
50 2.17014e-05 0
40 1.73611e-05 0
30 1.30208e-05 0
20 8.68056e-06 0
10 4.34028e-06 0
10 4.34028e-06 0
10 4.34028e-06 0
10 4.34028e-06 0
11 4.77431e-06 0
12 5.20833e-06 0
13 5.64236e-06 0
14 6.07639e-06 0
15 6.51042e-06 0
16 6.94444e-06 0
17 7.37847e-06 0
18 7.8125e-06 0
19 8.24653e-06 0
20 8.68056e-06 0
22 9.54861e-06 0
24 1.04167e-05 0
26 1.12847e-05 0
28 1.21528e-05 0
30 1.30208e-05 0
34 1.47569e-05 0
38 1.64931e-05 0
42 1.82292e-05 0
50 2.17014e-05 0
60 2.60417e-05 0
70 3.03819e-05 0
80 3.47222e-05 0
90 3.90625e-05 0
100 4.34028e-05 0
120 5.20833e-05 0
140 6.07639e-05 0
160 6.94444e-05 0
160 6.94444e-05 0
150 6.51042e-05 0
140 6.07639e-05 0
130 5.64236e-05 1
120 5.20833e-05 1
110 4.77431e-05 1
100 4.34028e-05 1
90 3.90625e-05 1
80 3.47222e-05 1
70 3.03819e-05 1
60 2.60417e-05 1
50 2.17014e-05 1
40 1.73611e-05 1
30 1.30208e-05 1
20 8.68056e-06 1
10 4.34028e-06 1
10 4.34028e-06 1
11 4.77431e-06 1
12 5.20833e-06 1
13 5.64236e-06 1
14 6.07639e-06 1
15 6.51042e-06 1
16 6.94444e-06 1
17 7.37847e-06 1
18 7.8125e-06 1
19 8.24653e-06 1
20 8.68056e-06 1
22 9.54861e-06 1
24 1.04167e-05 1
26 1.12847e-05 1
28 1.21528e-05 1
30 1.30208e-05 1
34 1.47569e-05 1
38 1.64931e-05 1
42 1.82292e-05 1
50 2.17014e-05 1
60 2.60417e-05 1
70 3.03819e-05 1
80 3.47222e-05 1
90 3.90625e-05 1
100 4.34028e-05 1
120 5.20833e-05 1
140 6.07639e-05 1
160 6.94444e-05 1
160 6.94444e-05 1
150 6.51042e-05 1
140 6.07639e-05 1
130 5.64236e-05 1
120 5.20833e-05 1
110 4.77431e-05 1
100 4.34028e-05 1
90 3.90625e-05 1
80 3.47222e-05 1
70 3.03819e-05 1
60 2.60417e-05 1
50 2.17014e-05 1
40 1.73611e-05 1
30 1.30208e-05 1
20 8.68056e-06 1
10 4.34028e-06 1
10 4.34028e-06 1
10 4.34028e-06 1
10 4.34028e-06 1
11 4.77431e-06 1
12 5.20833e-06 1
13 5.64236e-06 1
14 6.07639e-06 1
15 6.51042e-06 1
16 6.94444e-06 1
17 7.37847e-06 1
18 7.8125e-06 1
19 8.24653e-06 1
20 8.68056e-06 1
22 9.54861e-06 1
24 1.04167e-05 1
26 1.12847e-05 1
28 1.21528e-05 1
30 1.30208e-05 1
34 1.47569e-05 1
38 1.64931e-05 1
42 1.82292e-05 1
50 2.17014e-05 1
60 2.60417e-05 1
70 3.03819e-05 1
80 3.47222e-05 1
90 3.90625e-05 1
100 4.34028e-05 1
120 5.20833e-05 1
140 6.07639e-05 1
160 6.94444e-05 1
160 6.94444e-05 1
150 6.51042e-05 1
140 6.07639e-05 1
130 5.64236e-05 1
120 5.20833e-05 1
110 4.77431e-05 1
100 4.34028e-05 1
90 3.90625e-05 1
80 3.47222e-05 1
70 3.03819e-05 1
60 2.60417e-05 1
50 2.17014e-05 1
40 1.73611e-05 1
30 1.30208e-05 1
20 8.68056e-06 1
10 4.34028e-06 1
10 4.34028e-06 1
10 4.34028e-06 1
20 8.68056e-06 1
30 1.30208e-05 1
40 1.73611e-05 1
50 2.17014e-05 1
60 2.60417e-05 1
70 3.03819e-05 1
80 3.47222e-05 1
90 3.90625e-05 1
100 4.34028e-05 1
120 5.20833e-05 1
140 6.07639e-05 1
160 6.94444e-05 1
160 6.94444e-05 1
150 6.51042e-05 1
140 6.07639e-05 1
130 5.64236e-05 1
120 5.20833e-05 1
110 4.77431e-05 1
100 4.34028e-05 1
90 3.90625e-05 1
80 3.47222e-05 1
70 3.03819e-05 1
60 2.60417e-05 1
50 2.17014e-05 1
40 1.73611e-05 1
30 1.30208e-05 1
20 8.68056e-06 1
10 4.34028e-06 1
10 4.34028e-06 1
10 4.34028e-06 1
20 8.68056e-06 1
30 1.30208e-05 1
40 1.73611e-05 1
40 1.73611e-05 1
50 2.17014e-05 1
55 2.38715e-05 1
60 2.60417e-05 1
65 2.82118e-05 1
70 3.03819e-05 1
75 3.25521e-05 1
80 3.47222e-05 1
85 3.68924e-05 2
90 3.90625e-05 2
95 4.12326e-05 2
90 3.90625e-05 2
80 3.47222e-05 2
70 3.03819e-05 2
60 2.60417e-05 2
50 2.17014e-05 2
40 1.73611e-05 2
30 1.30208e-05 2
20 8.68056e-06 2
10 4.34028e-06 2
10 4.34028e-06 2
10 4.34028e-06 2
20 8.68056e-06 2
30 1.30208e-05 2
40 1.73611e-05 2
40 1.73611e-05 2
50 2.17014e-05 2
55 2.38715e-05 2
60 2.60417e-05 2
65 2.82118e-05 2
70 3.03819e-05 2
75 3.25521e-05 2
80 3.47222e-05 2
85 3.68924e-05 2
90 3.90625e-05 2
95 4.12326e-05 2
100 4.34028e-05 2
105 4.55729e-05 2
110 4.77431e-05 2
115 4.99132e-05 2
120 5.20833e-05 2
As you can see, the cumulative values for the src image are spread over a much wider range than those of the histogram I receive ([0-255] versus [0-2]).
My question is: what do I do now? How do I match them?
Why don't you scale the [0-2] histogram to [0-255]? oldValue * 255 / 2.
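A minimal sketch of that scaling, assuming newValuesDst is the received CDF lookup table computed above:
// Rescale the received CDF so both lookup tables cover the same [0, 255] range
int dstMax = newValuesDst[255]; // largest cumulative value of the received histogram (2 in the output above)
for(int j = 0; j < 256; j++)
{
newValuesDst[j] = newValuesDst[j] * 255 / dstMax;
}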
I have a file that looks like this:
gene_id_100100 sp|Q53IZ1|ASDP_PSESP 35.81 148 90 2 13 158 6 150 6e-27 109 158 531
gene_id_100600 sp|Q49W80|Y1834_STAS1 31.31 99 63 2 1 95 279 376 7e-07 50.1 113 402
gene_id_100 sp|A7TSV7|PAN1_VANPO 36.36 44 24 1 41 80 879 922 1.9 32.3 154 1492
gene_id_10100 sp|P37348|YECE_ECOLI 32.77 177 104 6 3 172 2 170 2e-13 71.2 248 272
gene_id_101100 sp|B0U4U5|SURE_XYLFM 29.11 79 41 3 70 148 143 206 0.14 35.8 175 262
gene_id_101600 sp|Q5AWD4|BGLM_EMENI 35.90 39 25 0 21 59 506 544 4.9 30.4 129 772
gene_id_102100 sp|P20374|COX1_APILI 38.89 36 22 0 3 38 353 388 0.54 32.0 92 521
gene_id_102600 sp|Q46127|SYW_CLOLO 79.12 91 19 0 1 91 1 91 5e-44 150 92 341
gene_id_103100 sp|Q9UJX6|ANC2_HUMAN 53.57 28 13 0 11 38 608 635 2.1 28.9 42 822
gene_id_103600 sp|C1DA02|SYL_LARHH 35.59 59 30 2 88 138 382 440 4.6 30.8 140 866
gene_id_104100 sp|B8DHP2|PROB_LISMH 25.88 85 50 2 37 110 27 109 0.81 32.3 127 276
gene_id_105100 sp|A1ALU1|RL3_PELPD 31.88 69 42 2 14 77 42 110 2.2 31.6 166 209
gene_id_105600 sp|P59696|T200_SALTY 64.00 125 45 0 5 129 3 127 9e-58 182 129 152
gene_id_10600 sp|G3XDA3|CTPH_PSEAE 28.38 74 48 1 4 77 364 432 0.56 31.6 81 568
gene_id_106100 sp|P94369|YXLA_BACSU 35.00 100 56 3 25 120 270 364 4e-08 53.9 120 457
gene_id_106600 sp|P34706|SDC3_CAEEL 60.00 20 8 0 18 37 1027 1046 2.3 32.7 191 2150
Now, I need to extract the gene ID, which is the part between the | characters in the second column. In other words, I need output that looks like this:
Q53IZ1
Q49W80
A7TSV7
P37348
B0U4U5
Q5AWD4
P20374
Q46127
Q9UJX6
C1DA02
B8DHP2
A1ALU1
P59696
G3XDA3
P94369
P34706
I have been trying to do it using the following command:
awk '{for(i=1;i<=NF;++i){ if($i==/[A-Z][A-Z0-9][A-Z0-9][A-Z0-9][A-Z0-9][A-Z0-9]/){print $i} } }'
but it doesn't seem to work.
Pattern matching is not really necessary. I'd suggest
awk -F\| '{print $2}' filename
This splits the line into |-delimited fields and prints the second of them.
Alternatively,
cut -d\| -f 2 filename
achieves the same.
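If other columns could ever contain | characters as well, a safer variant (a sketch) splits only the second whitespace-separated field:
awk '{split($2, a, "|"); print a[2]}' filename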