shell join wrongly matching letters with emoji - regex

I have two files, each with two columns (Count | Term), where the Term column is Unicode text and some terms are emoji.
I am trying to join the two files on the Term column (the second column) using this command:
join -j 2 -o 1.1,1.2,2.1,2.2 <(sort -k2 File1.txt) <(sort -k2 File2.txt) > join_File1_File2.txt
When I join the two files the output is mostly correct, except for some wrong lines where emoji are matched with letters and other characters, like:
11 𝑠𝑖𝑛𝑦𝑑𝑥𝑑𝑦∫ 4 🖐️🖐️🖐️🖐️🖐️
484 ……… 79 ✊✊✊
27 ———————————————— 25 🇨🇦🇨🇦🇨🇦🇨🇦🇨🇦🇨🇦
I even have some wrongly matched emoji, like this:
39 🌊 726 🚴
And some wrongly matched characters, like:
2 °′ 1 °☀
Here is a sample of the files: File1, File2 and JoinFile.
File1:
1 ”…………
1369 i
1347 …
1339 it
8 ⋅𝑑𝑠⃗
1322 with
1 𝑎−‾√𝑏‾√∴
1302 are
1299 your
1276 my
39 🌊
1272 with
1261 from
1255 this
1244 what
File2:
1437 to
1435 your
1433 are
1421 in
83 ⛔️🇮🇹
1411 and
1404 for
1 ”😻😻😻
1384 you
1373 on
726 🚴
1347 …
13 ❤️💜🧡💛💚💙
1333 this
1322 with
Join_File1_File2:
1 ”………… 1 ”😻😻😻
1347 … 1347 …
39 🌊 726 🚴
8 ⋅𝑑𝑠⃗ 83 ⛔️🇮🇹
1 𝑎−‾√𝑏‾√∴ 13 ❤️💜🧡💛💚💙
1302 are 1433 are
1255 this 1333 this
1272 with 1322 with
1322 with 1322 with
1299 your 1435 your
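For what it's worth, a likely cause is locale collation rather than the data itself: in many UTF-8 locales, symbols and emoji like these have no defined collation weight, so sort and join can treat distinct terms as equal. A minimal sketch of the same pipeline forced to bytewise comparison:
# Force bytewise collation so sort and join agree and distinct
# code points never compare equal:
LC_ALL=C join -j 2 -o 1.1,1.2,2.1,2.2 \
    <(LC_ALL=C sort -k2 File1.txt) \
    <(LC_ALL=C sort -k2 File2.txt) > join_File1_File2.txt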

How can I replace codes of a variable (food Ingredient) with value (calorie) from another dataset?

I have a Stata dataset with six pairs of variables holding consumed food ingredient codes and the ingredients' weights in grams. I have another dataset with food ingredient codes and the corresponding calories per 100 g. I need to replace the codes with calories to calculate total calorie consumption.
How can I do that (by replacing, or by generating new variables)?
My first (master) dataset is
clear
input double hhid int(Ingredient_1_code Ingredient_1_weight Ingredient_2_code incredient_2_weight Ingredient_3_code ingredient_3_weight Ingredient_4_code Ingredient_4_weight Ingredient_5_code Ingredient_5_weight Ingredient_6_code Ingredient_6_weight)
1 269 8 266 46 . . . . . . . .
1 315 19 . . . . . . . . . .
1 316 9 . . . . . . . . . .
1 2522 3 . . . . . . . . . .
1 1 1570 . . . . . . . . . .
1 1 530 . . . . . . . . . .
1 61 262 64 23 57 17 31 8 2522 5 . .
1 130 78 64 23 57 17 2521 2 31 15 248 1
1 228 578 64 138 57 37 248 3 2521 14 31 35
2 142 328 . . . . . . . . . .
2 272 78 . . . . . . . . . .
2 1 602 . . . . . . . . . .
2 51 344 61 212 246 2 64 50 65 11 2522 10
2 176 402 44 348 61 163 57 17 248 2 64 71
3.1 1 1219 . . . . . . . . . .
3.1 1 410 . . . . . . . . . .
3.1 54 130 52 60 61 32 51 23 21 17 57 4
3.1 44 78 130 44 57 3 248 4 31 49 2522 6
3.1 231 116 904 119 61 220 57 22 248 3 254 6
3.2 156 396 . . . . . . . . . .
3.2 272 78 . . . . . . . . . .
end
My second dataset with food ingredient codes and calorie per 100 g is
clear
input str39 Ingredient int(Ingredient_codes Calorie_per_100gm)
"Parboiled rice (coarse)" 1 344
"Non-parboiled rice (coarse)" 2 344
"Fine rice" 3 344
"Rice flour" 4 366
"Suji (cream of wheat/barley)" 5 364
"Wheat" 6 347
"Atta" 7 334
"Maida (wheat flour w/o bran)" 8 346
"Semai/noodles" 9 347
"Chaatu" 10 324
"Chira (flattened rice)" 11 356
"Muri/Khoi (puffed rice)" 12 361
"Barley" 13 324
"Sagu" 14 346
"Corn" 15 355
"Cerelac" 16 418
"Lentil" 21 317
"Chick pea" 22 327
"Anchor daal" 23 375
"Black gram" 24 317
"Khesari" 25 352
"Mung" 26 161
end
I want to get the calories per 100 g into the master dataset according to the ingredient codes.
I agree with Nick's comment that it is better to first make this data long. Read why that is a much better practice here: https://worldbank.github.io/dime-data-handbook/processing.html#making-data-tidy
However, it can be done in the current un-tidy wide format if, for some reason, you must keep your data like that. The code below shows how that can be done.
clear
input str39 Ingredient int(Ingredient_codes Calorie_per_100gm)
"Parboiled rice (coarse)" 1 344
"Non-parboiled rice (coarse)" 2 344
"Fine rice" 3 344
"Rice flour" 4 366
"Suji (cream of wheat/barley)" 5 364
"Wheat" 6 347
"Atta" 7 334
"Maida (wheat flour w/o bran)" 8 346
"Semai/noodles" 9 347
"Chaatu" 10 324
"Chira (flattened rice)" 11 356
"Muri/Khoi (puffed rice)" 12 361
"Barley" 13 324
"Sagu" 14 346
"Corn" 15 355
"Cerelac" 16 418
"Lentil" 21 317
"Chick pea" 22 327
"Anchor daal" 23 375
"Black gram" 24 317
"Khesari" 25 352
"Mung" 26 161
end
drop Ingredient
tempfile code_calories
save `code_calories'
clear
input double hhid int(Ingredient_1_code Ingredient_1_weight Ingredient_2_code incredient_2_weight Ingredient_3_code ingredient_3_weight Ingredient_4_code Ingredient_4_weight Ingredient_5_code Ingredient_5_weight Ingredient_6_code Ingredient_6_weight)
1 269 8 266 46 . . . . . . . .
1 315 19 . . . . . . . . . .
1 316 9 . . . . . . . . . .
1 2522 3 . . . . . . . . . .
1 1 1570 . . . . . . . . . .
1 1 530 . . . . . . . . . .
1 61 262 64 23 57 17 31 8 2522 5 . .
1 130 78 64 23 57 17 2521 2 31 15 248 1
1 228 578 64 138 57 37 248 3 2521 14 31 35
2 142 328 . . . . . . . . . .
2 272 78 . . . . . . . . . .
2 1 602 . . . . . . . . . .
2 51 344 61 212 246 2 64 50 65 11 2522 10
2 176 402 44 348 61 163 57 17 248 2 64 71
3.1 1 1219 . . . . . . . . . .
3.1 1 410 . . . . . . . . . .
3.1 54 130 52 60 61 32 51 23 21 17 57 4
3.1 44 78 130 44 57 3 248 4 31 49 2522 6
3.1 231 116 904 119 61 220 57 22 248 3 254 6
3.2 156 396 . . . . . . . . . .
3.2 272 78 . . . . . . . . . .
end
*Standardize varnames
rename incredient_2_weight Ingredient_2_weight
rename ingredient_3_weight Ingredient_3_weight
*Loop over all six ingredient slots
forvalues var_num = 1/6 {
    *Rename to match the name in the code_calories dataset
    rename Ingredient_`var_num'_code Ingredient_codes
    *Merge calories for this ingredient
    merge m:1 Ingredient_codes using `code_calories', keep(master matched) nogen
    *Calculate calories for this ingredient (weight is in grams, calories are per 100 g)
    gen Calories_`var_num' = Calorie_per_100gm * Ingredient_`var_num'_weight / 100
    *Order the new variables and restore the variable names
    order Calorie_per_100gm Calories_`var_num', after(Ingredient_`var_num'_weight)
    rename Ingredient_codes Ingredient_`var_num'_code
    rename Calorie_per_100gm Calorie_per_100gm_`var_num'
}
*Sum calories across all ingredients
egen total_calories = rowtotal(Calories_?)
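For reference, an untested sketch of the long-format approach recommended above. It assumes the two rename commands have been run and the code_calories tempfile still exists; the names row_id and slot are just for illustration:
* Long format: one row per (data row, ingredient slot)
gen long row_id = _n
reshape long Ingredient_@_code Ingredient_@_weight, i(row_id) j(slot)
rename (Ingredient__code Ingredient__weight) (Ingredient_codes Ingredient_weight)
drop if missing(Ingredient_codes)
merge m:1 Ingredient_codes using `code_calories', keep(master matched) nogen
*Weights are in grams and calories are per 100 g
gen Calories = Calorie_per_100gm * Ingredient_weight / 100
*Total calorie consumption per household
bysort hhid: egen total_calories = total(Calories)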

How do you Resample daily with a conditional statement in pandas

I have the pandas data frame below (it does have other columns, but these are the important ones); the Date column is the index:
Number_QA_VeryGood Number_Valid_Cells Time
Date
2015-01-01 91 92 18:55
2015-01-02 6 6 18:00
2015-01-02 13 13 19:40
2015-01-03 106 106 18:45
2015-01-05 68 68 18:30
2015-01-06 111 117 19:15
2015-01-07 89 97 18:20
2015-01-08 86 96 19:00
2015-01-10 9 16 18:50
I need to resample daily: the first two columns should be aggregated with sum.
The Time column needs to look at the highest daily value of the Number_Valid_Cells column and use that row's time as the value.
Example output should be (2015-01-02 is the line that changed):
Number_QA_VeryGood Number_Valid_Cells Time
Date
2015-01-01 91 92 18:55
2015-01-02 19 19 19:40
2015-01-03 106 106 18:45
2015-01-05 68 68 18:30
2015-01-06 111 117 19:15
2015-01-07 89 97 18:20
2015-01-08 86 96 19:00
2015-01-10 9 16 18:50
What is the best way to get this to work?
Or you can try
df.groupby(df.index).agg({'Number_QA_VeryGood':'sum','Number_Valid_Cells':'sum','Time':'last'})
Out[276]:
Time Number_QA_VeryGood Number_Valid_Cells
Date
2015-01-01 18:55 91 92
2015-01-02 19:40 19 19
2015-01-03 18:45 106 106
2015-01-05 18:30 68 68
2015-01-06 19:15 111 117
2015-01-07 18:20 89 97
2015-01-08 19:00 86 96
2015-01-10 18:50 9 16
Update: sort_values first
df.sort_values('Number_Valid_Cells').groupby(df.sort_values('Number_Valid_Cells').index)\
.agg({'Number_QA_VeryGood':'sum','Number_Valid_Cells':'sum','Time':'last'})
Out[314]:
Time Number_QA_VeryGood Number_Valid_Cells
Date
1/1/2015 18:55 91 92
1/10/2015 18:50 9 16
1/2/2015 16:40#here.changed 19 19
1/3/2015 18:45 106 106
1/5/2015 18:30 68 68
1/6/2015 19:15 111 117
1/7/2015 18:20 89 97
1/8/2015 19:00 86 96
Data input :
Number_QA_VeryGood Number_Valid_Cells Time
Date
1/1/2015 91 92 18:55
1/2/2015 6 6 18:00
1/2/2015 13 13 16:40#I change here
1/3/2015 106 106 18:45
1/5/2015 68 68 18:30
1/6/2015 111 117 19:15
1/7/2015 89 97 18:20
1/8/2015 86 96 19:00
1/10/2015 9 16 18:50
You can use groupby sum for the first two columns; if the values of Number_Valid_Cells are sorted, then:
ndf = df.reset_index().groupby('Date').sum()
ndf['Time'] = df.reset_index().drop_duplicates(subset='Date',keep='last').set_index('Date')['Time']
Number_QA_VeryGood Number_Valid_Cells Time
Date
2015-01-01 91 92 18:55
2015-01-02 19 19 19:40
2015-01-03 106 106 18:45
2015-01-05 68 68 18:30
2015-01-06 111 117 19:15
2015-01-07 89 97 18:20
2015-01-08 86 96 19:00
2015-01-10 9 16 18:50
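An order-independent alternative, as a sketch (assuming the frame is named df as above): take the sums, then locate each day's row with the largest Number_Valid_Cells via idxmax and copy its Time.
import pandas as pd

# Work on a temporary integer index so idxmax returns unambiguous row labels.
tmp = df.reset_index()
out = tmp.groupby('Date')[['Number_QA_VeryGood', 'Number_Valid_Cells']].sum()
best = tmp.loc[tmp.groupby('Date')['Number_Valid_Cells'].idxmax()]
out['Time'] = best.set_index('Date')['Time']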

Verifying output to "Find the numbers between 1 and 1000 whose prime factors' sum is itself a prime" from Allain's Jumping into C++ (ch7 #3)

The question:
Design a program that finds all numbers from 1 to 1000 whose prime factors, when added
together, sum up to a prime number (for example, 12 has prime factors of 2, 2, and 3, which
sum to 7, which is prime). Implement the code for that algorithm.
I modified the problem to only sum unique factors, because I don't see why you'd count a factor twice, as in his example using 12.
My solution. Is there any good (read: automated) way to verify the output of my program?
Sample output for 1 to 1000:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
17
19
20
22
23
24
25
26
28
29
30
31
34
37
40
41
43
44
46
47
48
49
52
53
58
59
60
61
63
67
68
70
71
73
76
79
80
82
83
88
89
92
94
96
97
99
101
103
107
109
113
116
117
118
120
121
124
127
131
136
137
139
140
142
147
148
149
151
153
157
160
163
164
167
169
171
172
173
176
179
181
184
188
189
191
192
193
197
198
199
202
207
210
211
212
214
223
227
229
232
233
239
240
241
244
251
252
257
261
263
268
269
271
272
273
274
275
277
279
280
281
283
286
289
292
293
294
297
298
306
307
311
313
317
320
325
331
332
333
334
337
347
349
351
352
353
358
359
361
367
368
369
373
376
379
382
383
384
388
389
394
396
397
399
401
404
409
412
414
419
421
423
424
425
428
431
433
439
443
449
454
457
459
461
462
463
464
467
468
472
475
478
479
480
487
491
495
499
503
509
513
521
522
523
524
529
531
538
539
541
544
546
547
548
549
550
557
560
561
562
563
567
569
571
572
575
577
587
588
593
594
599
601
603
604
605
607
612
613
617
619
621
622
628
631
639
640
641
643
646
647
651
652
653
659
661
664
668
673
677
683
684
691
692
694
701
704
709
712
714
718
719
725
726
727
733
736
738
739
741
743
751
752
756
757
759
761
764
765
768
769
772
773
775
777
783
787
792
797
798
801
809
811
821
823
825
827
828
829
833
837
838
839
841
846
847
848
850
853
856
857
859
862
863
873
877
881
883
887
891
892
903
904
907
908
909
911
918
919
922
925
928
929
932
937
941
944
947
953
954
957
960
961
966
967
971
975
977
981
983
991
997
999
Update: I have solved my problem and verified the output of my program using an OEIS series, as suggested by @MVW (referenced in the source of my new GitHub solution). In the future, I will aim to test my programs by doing zero or more of the following (depending on the scope/importance of the problem):
google keywords for an existing solution to the problem, comparing it against my solution if I find it
unit test components for correctness as they're built and integrated, comparing these tests with known correct outputs
Some suggestions:
You need to check the properties of your calculated numbers.
Here that means
calculating the prime factors and
calculating their sum and
testing if that sum is a prime number.
Which is what your program should do in the first place, by the way.
So one nice option for checking is comparing your output with a known solution or with the output of another program which is known to work. The tricky bit is to have such a solution or program available. And I'm neglecting that your comparison could be plagued by errors as well :-)
If you just compare it with other implementations, e.g. programs from other folks here, it is more of a vote than a proof. It just gives increased probability that your program is correct if several independent implementations come up with the same result. Of course, all implementations could err :-)
The more implementations that agree, the better.
And the more diverse the implementations are, the better.
E.g. you could use different programming languages, algebraic systems or a friend with time and paper and pencil and Wikipedia. :-)
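For instance, a tiny independent checker in another language; a sketch using Python's sympy, whose primefactors returns the distinct prime factors, matching the modified problem:
from sympy import primefactors, isprime

# Numbers in 2..1000 whose distinct prime factors sum to a prime.
print([n for n in range(2, 1001) if isprime(sum(primefactors(n)))])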
Another means is to add checks to your intermediate steps, to get more confidence in your result. Kind of building a chain of trust.
You could output the prime factors you determined and compare them with the output of a prime factorization program which is known to work.
Then you check if your summing works.
Finally you could check if the primality test you apply to the candidate sums is working correctly by feeding it with known prime numbers and non prime numbers and so on.
That is kind of what folks do with unit testing, for example: trying to confirm most parts of the code as working, hoping that if the parts work, the whole will work.
Or you could formally prove your program step by step, using Hoare Calculus for example or another formal method.
But that is tricky, and you might end up shifting program errors to errors in the proof.
And today, in the era of the internet, you could of course search for the solution online:
Try searching for sum of prime factors is prime in the Online Encyclopedia of Integer Sequences, which should give you series A100118. :-)
That is the problem with multiplicity, but it shows you what the number theory pros do, with Mathematica and program fragments to calculate the series, the argument for the case of 1, and literature. Quite impressive.
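A corresponding sketch for the multiplicity version of A100118, where sympy's factorint returns prime-exponent pairs:
from sympy import factorint, isprime

# Sum the prime factors with multiplicity, as in OEIS A100118.
print([n for n in range(2, 1001)
       if isprime(sum(p * k for p, k in factorint(n).items()))])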
Here's the answer I get. I exclude 1 as it has no prime divisors so their sum is 0, not a prime.
Haskell> filter (isPrime . sum . map fst . primePowers) [2..1000]
[2,3,4,5,6,7,8,9,10,11,12,13,16,17,18,19,20,22,23,24,25,27,29,31,32,34,36,37,40,
41,43,44,47,48,49,50,53,54,58,59,61,64,67,68,71,72,73,79,80,81,82,83,88,89,96,97
,100,101,103,107,108,109,113,116,118,121,125,127,128,131,136,137,139,142,144,149
,151,157,160,162,163,164,165,167,169,173,176,179,181,191,192,193,197,199,200,202
,210,211,214,216,223,227,229,232,233,236,239,241,242,243,250,251,256,257,263,269
,271,272,273,274,277,281,283,284,288,289,293,298,307,311,313,317,320,324,328,331
,337,343,345,347,349,352,353,358,359,361,367,373,379,382,383,384,385,389,390,394
,397,399,400,401,404,409,419,420,421,428,431,432,433,435,439,443,449,454,457,461
,462,463,464,467,472,478,479,484,486,487,491,495,499,500,503,509,512,521,523,529
,538,541,544,547,548,557,561,562,563,568,569,570,571,576,577,578,587,593,595,596
,599,601,607,613,617,619,622,625,630,631,640,641,643,647,648,651,653,656,659,661
,665,673,677,683,691,694,701,704,709,714,715,716,719,727,729,733,739,743,751,757
,759,761,764,768,769,773,777,780,787,788,795,797,798,800,808,809,811,819,821,823
,825,827,829,838,839,840,841,853,856,857,858,859,862,863,864,877,881,883,885,887
,903,907,908,911,919,922,924,928,929,930,937,941,944,947,953,956,957,961,967,968
,971,972,977,983,991,997,1000]
Haskell> primePowers 12
[(2,2),(3,1)]
Haskell> primePowers 14
[(2,1),(7,1)]
You could hard-code this list in and test against it. I'm pretty confident these results are without error.
(Read the . as "of".)

Select rows from data.frame ending with a specific character string in R

I'm using R and I have a data.frame with nearly 2,000 entries that looks as follows:
> head(PVs,15)
LogFreq Word PhonCV FreqDev
1593 140 was CVC 5.480774
482 139 had CVC 5.438114
1681 138 zou CVVC 5.395454
1662 137 zei CVV 5.352794
1619 136 werd CVCC 5.310134
1592 135 waren CVV-CV 5.267474
620 134 kon CVC 5.224814
646 133 kwam CCVC 5.182154
483 132 hadden CVC-CV 5.139494
436 131 ging CVC 5.096834
734 130 moest CVVCC 5.054174
1171 129 stond CCVCC 5.011514
1654 128 zag CVC 4.968854
1620 127 werden CVC-CV 4.926194
1683 126 zouden CVV-CV 4.883534
What I want to do is create a new data.frame equal to PVs, except that all entries whose "Word" column does NOT end in either "te" or "de" are removed. I.e., all words not ending in either "de" or "te" should be removed from the data.frame.
I know how to selectively remove entries from data.frames using logical operators, but those work when you're setting numeric criteria. I think I need to use regular expressions to do this, but sadly R is the only programming language I "know", so I'm far from knowing what type of code to use here.
I appreciate your help.
Thanks in advance.
Method 1
You can use grepl with an appropriate regular expression. Consider the following:
x <- c("blank","wade","waste","rubbish","dedekind","bated")
grepl("^.+(de|te)$",x)
[1] FALSE TRUE TRUE FALSE FALSE FALSE
The regular expression says: start at the beginning (^), match any character one or more times (.+), then find either de or te ((de|te)), then the end ($).
So for your data.frame try,
subset(PVs,grepl("^.+(de|te)$",Word))
Method 2
To avoid the regexp method you can use a substr method instead.
# substr the last two characters and test
substr(x,nchar(x)-1,nchar(x)) %in% c("de","te")
[1] FALSE TRUE TRUE FALSE FALSE FALSE
So try:
subset(PVs,substr(Word,nchar(Word)-1,nchar(Word)) %in% c("de","te"))
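Method 3
If your R is at least 3.3.0, base R's endsWith avoids both the regex and the substr arithmetic. A sketch (Word is wrapped in as.character in case it is stored as a factor):
# Keep rows whose Word ends in "de" or "te"
subset(PVs, endsWith(as.character(Word), "de") | endsWith(as.character(Word), "te"))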
I modified the data a bit so that there were words that ended in te or de.
> PV
LogFreq Word PhonCV FreqDev
1593 140 blahte CVC 5.480774
482 139 had CVC 5.438114
1681 138 aaaade CVVC 5.395454
1662 137 zei CVV 5.352794
1619 136 werd CVCC 5.310134
1592 135 waren CVV-CV 5.267474
620 134 kon CVC 5.224814
646 133 kwamde CCVC 5.182154
483 132 hadden CVC-CV 5.139494
436 131 ging CVC 5.096834
734 130 moeste CVVCC 5.054174
1171 129 stond CCVCC 5.011514
1654 128 zagde CVC 4.968854
1620 127 werden CVC-CV 4.926194
1683 126 zouden CVV-CV 4.883534
# Add a column to PV so that you can visually check the regular expression matches.
PV$Match <- grepl(pattern = "(de|te)$", PV$Word)
# Subset PV data frame to show only TRUE matches
PV <- PV[PV$Match == TRUE, ]
The result is shown below
LogFreq Word PhonCV FreqDev Match
1593 140 blahte CVC 5.480774 TRUE
1681 138 aaaade CVVC 5.395454 TRUE
646 133 kwamde CCVC 5.182154 TRUE
734 130 moeste CVVCC 5.054174 TRUE
1654 128 zagde CVC 4.968854 TRUE
Using grep
grep -xvE '.{17}(de|te).*' file.txt

Qt application killed because Out Of Memory (OOM)

I am running a Qt application on an embedded Linux platform. The system has 128 MB RAM, 512 MB NAND, and no swap. The application uses a custom library for the peripherals; the rest are all Qt and C/C++ libs. The application uses SQLite3 as well.
After 2-3 hours, the machine starts running very slowly; shell commands take 10 or so seconds to respond. Eventually the machine hangs, and finally the OOM killer kills the application, after which the system behaves at normal speed again.
Observing system memory with the top command reveals that while the application is running, free memory keeps decreasing while slab keeps increasing. Snapshots of top are given below; the application is named xyz.
At application start:
Mem total:126164 anon:3308 map:8436 free:32456
slab:60936 buf:0 cache:27528 dirty:0 write:0
Swap total:0 free:0
PID VSZ VSZRW^ RSS (SHR) DIRTY (SHR) STACK COMMAND
776 29080 9228 8036 528 968 0 84 ./xyz -qws
781 3960 736 1976 1456 520 0 84 sshd: root@notty
786 3676 680 1208 764 416 0 88 /usr/libexec/sftp-server
770 3792 568 1948 1472 464 0 84 {sshd} sshd: root@pts/0
766 3792 568 956 688 252 0 84 /usr/sbin/sshd
388 1864 284 552 332 188 0 84 udevd --daemon
789 2832 272 688 584 84 0 84 top
774 2828 268 668 560 84 0 84 -sh
709 2896 268 556 464 80 0 84 /usr/sbin/inetd
747 2828 268 596 516 68 0 84 /sbin/getty -L ttymxc0 115200 vt100
777 2824 264 444 368 68 0 84 tee out.log
785 2824 264 484 416 68 0 84 sh -c /usr/libexec/sftp-server
1 2824 264 556 488 64 0 84 init
After some time :
Mem total:126164 anon:3312 map:8440 free:9244
slab:83976 buf:0 cache:27584 dirty:0 write:0
Swap total:0 free:0
PID VSZ VSZRW^ RSS (SHR) DIRTY (SHR) STACK COMMAND
776 29080 9228 8044 528 972 0 84 ./xyz -qws
781 3960 736 1976 1456 520 0 84 sshd: root@notty
786 3676 680 1208 764 416 0 88 /usr/libexec/sftp-server
770 3792 568 1948 1472 464 0 84 {sshd} sshd: root@pts/0
766 3792 568 956 688 252 0 84 /usr/sbin/sshd
388 1864 284 552 332 188 0 84 udevd --daemon
789 2832 272 688 584 84 0 84 top
774 2828 268 668 560 84 0 84 -sh
709 2896 268 556 464 80 0 84 /usr/sbin/inetd
747 2828 268 596 516 68 0 84 /sbin/getty -L ttymxc0 115200 vt100
777 2824 264 444 368 68 0 84 tee out.log
785 2824 264 484 416 68 0 84 sh -c /usr/libexec/sftp-server
1 2824 264 556 488 64 0 84 init
Funnily enough though, I cannot see any major changes in the top output for the application itself. Eventually the application is killed; top output after that:
Mem total:126164 anon:2356 map:916 free:2368
slab:117944 buf:0 cache:1580 dirty:0 write:0
Swap total:0 free:0
PID VSZ VSZRW^ RSS (SHR) DIRTY (SHR) STACK COMMAND
781 3960 736 708 184 520 0 84 sshd: root@notty
786 3724 728 736 172 484 0 88 /usr/libexec/sftp-server
770 3792 568 648 188 460 0 84 {sshd} sshd: root@pts/0
766 3792 568 252 0 252 0 84 /usr/sbin/sshd
388 1864 284 188 0 188 0 84 udevd --daemon
819 2832 272 676 348 84 0 84 top
774 2828 268 512 324 96 0 84 -sh
709 2896 268 80 0 80 0 84 /usr/sbin/inetd
747 2828 268 68 0 68 0 84 /sbin/getty -L ttymxc0 115200 vt100
785 2824 264 68 0 68 0 84 sh -c /usr/libexec/sftp-server
1 2824 264 64 0 64 0 84 init
The dmesg shows :
sh invoked oom-killer: gfp_mask=0xd0, order=2, oomkilladj=0
[<c002d4c4>] (unwind_backtrace+0x0/0xd4) from [<c0073ac0>] (oom_kill_process+0x54/0x1b8)
[<c0073ac0>] (oom_kill_process+0x54/0x1b8) from [<c0073f14>] (__out_of_memory+0x154/0x178)
[<c0073f14>] (__out_of_memory+0x154/0x178) from [<c0073fa0>] (out_of_memory+0x68/0x9c)
[<c0073fa0>] (out_of_memory+0x68/0x9c) from [<c007649c>] (__alloc_pages_nodemask+0x3e0/0x4c8)
[<c007649c>] (__alloc_pages_nodemask+0x3e0/0x4c8) from [<c0076598>] (__get_free_pages+0x14/0x4c)
[<c0076598>] (__get_free_pages+0x14/0x4c) from [<c002f528>] (get_pgd_slow+0x14/0xdc)
[<c002f528>] (get_pgd_slow+0x14/0xdc) from [<c0043890>] (mm_init+0x84/0xc4)
[<c0043890>] (mm_init+0x84/0xc4) from [<c0097b94>] (bprm_mm_init+0x10/0x138)
[<c0097b94>] (bprm_mm_init+0x10/0x138) from [<c00980a8>] (do_execve+0xf4/0x2a8)
[<c00980a8>] (do_execve+0xf4/0x2a8) from [<c002afc4>] (sys_execve+0x38/0x5c)
[<c002afc4>] (sys_execve+0x38/0x5c) from [<c0027d20>] (ret_fast_syscall+0x0/0x2c)
Mem-info:
DMA per-cpu:
CPU 0: hi: 0, btch: 1 usd: 0
Normal per-cpu:
CPU 0: hi: 42, btch: 7 usd: 0
Active_anon:424 active_file:11 inactive_anon:428
inactive_file:3 unevictable:0 dirty:0 writeback:0 unstable:0
free:608 slab:29498 mapped:14 pagetables:59 bounce:0
DMA free:692kB min:268kB low:332kB high:400kB active_anon:0kB inactive_anon:0kB active_file:4kB inactive_file:0kB unevictable:0kB present:24384kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 103 103
Normal free:1740kB min:1168kB low:1460kB high:1752kB active_anon:1696kB inactive_anon:1712kB active_file:40kB inactive_file:12kB unevictable:0kB present:105664kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 0 0
DMA: 3*4kB 3*8kB 5*16kB 2*32kB 4*64kB 2*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 692kB
Normal: 377*4kB 1*8kB 4*16kB 1*32kB 2*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 1740kB
30 total pagecache pages
0 pages in swap cache
Swap cache stats: add 0, delete 0, find 0/0
Free swap = 0kB
Total swap = 0kB
32768 pages of RAM
687 free pages
1306 reserved pages
29498 slab pages
59 pages shared
0 pages swap cached
Out of memory: kill process 774 (sh) score 339 or a child
Killed process 776 (xyz)
So it's obvious that there is a memory leak, and it must be my app, since my app is the one killed. But I am not doing any mallocs from the program, and I have taken care to limit the scope of variables so that they are deallocated after they are used. So I am at a complete loss as to why slab keeps increasing in the top output. I have tried http://valgrind.org/docs/manual/faq.html#faq.reports but it didn't work.
Currently I am trying to use Valgrind on the desktop to check my business logic (since I have read that on ARM it only works for Cortex cores).
Additional info:
root@freescale ~/Application/app$ uname -a
Linux freescale 2.6.31-207-g7286c01 #2053 Fri Jun 22 10:29:11 IST 2012 armv5tejl GNU/Linux
Compiler : arm-none-linux-gnueabi-4.1.2 glibc2.5
cpp libs : libstdc++.so.6.0.8
Qt : 4.7.3 libs
Any pointers would be greatly appreciated...
I don't think the problem is directly in your code.
The reason is obvious: your application's memory space does not increase (neither RSS nor VSZ increases).
However, you do see the number of slabs increasing. You cannot use or increase the number of slabs from your application - it's a kernel-only thingie.
Some obvious causes of slab growth, off the top of my head:
you never really close network sockets
you read many files, but never close them
you use many ioctls
I would run strace and look at its output for a while. strace intercepts the application's interactions with the kernel. If you have memory issues, I'd expect repeated calls to brk(). If you have other issues, you'll see repeated calls to open() without matching close().
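A sketch of how one might watch the slab growth and count the app's syscalls (standard procfs/procps tools; slabtop may not be included in every embedded build):
# Which slab cache is growing? Its name hints at the kernel subsystem.
slabtop -s c            # sorted by cache size, if slabtop is available
cat /proc/slabinfo      # raw per-cache counts; sample twice and compare

# Attach to the running app and count its syscalls for a while (Ctrl-C to stop):
strace -c -f -p "$(pidof xyz)"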
If you have some data structure allocation, check the correctness of adding children, etc. I had a similar bug in my code. Also, if you make big queries to the database, it may use more RAM. Try a memory leak detector to see whether there is any leak.