I'm trying to do memory profiling on Linux, and for that I'm using the command adb shell perfdump meminfo. I don't understand the PSS column of the output: the total PSS is not equal to the sum of the individual PSS values. Below is the process map I got after running the command.
MEMORY OF /usr/bin/PuffinApp (pid 2376)
TOTAL MEMORY USAGE (kB):
Pss 66695
SwapPss 11388
Graphics 0
------
TOTAL (kB) 78083
OTHER MEMORY STATS (kB):
Vss 2250440
Rss 76560
Uss 64132
CachedPss 12371
NonCachedPss 54324
Swap 11388
SwapUss 11388
PROCESS STATS:
Maj faults 33592
Min faults 4533273
Threads 253
PROCESS MAPS:
PSS SwapPSS TotalPSS PrivateClean PrivateDirty SharedClean SharedDirty Referenced Name
------ ------ ------ ------ ------ ------ ------ ------ ------
45752 5208 50960 68 45684 0 0 45524 [anon:rw-p]
4452 1948 6400 224 4228 0 0 4380 [heap]
1612 1988 3600 0 1612 0 0 1568 [anon:rwxp]
36 0 36 20 16 0 0 36 [anon:-w-p]
4 28 32 0 4 0 0 4 [stack]
2116 108 2224 1968 148 0 0 1716 /usr/lib/libpryon.so
1984 4 1988 1936 48 0 0 1104 /usr/bin/PuffinApp
832 0 832 804 20 16 0 516 /usr/lib/libReggaeMediaLib.so
756 32 788 624 84 96 0 384 /usr/lib/libavformat.so.58.12.100
484 156 640 392 92 0 0 400 /usr/lib/libReggaeWidevine.so
522 48 570 280 180 124 0 568 /usr/lib/libavcodec.so.58.18.100
500 0 500 0 468 0 64 532 /dev/shm/puffin-micStream
344 0 344 332 12 0 0 136 /usr/lib/libLocaleWakewordAdapter.so
338 4 342 0 88 1148 0 1236 /usr/lib/libcrypto.so.1.1
280 8 288 276 4 0 0 132 /usr/lib/libSpotifyAdapter.so
264 0 264 184 12 136 0 208 /usr/lib/libreggae-core.so
247 8 255 8 28 976 0 784 /usr/lib/libPuffinExternalCapabilityAPI.so
4 248 252 4 0 0 0 4 /usr/lib/libavfilter.so.7.16.100
222 4 226 200 0 68 0 268 /usr/lib/libopus.so.0.8.0
202 0 202 0 0 0 404 404 /dev/shm/audio_playback_stream_2
176 16 192 168 8 0 0 96 /usr/lib/libxml2.so.2.9.13
183 0 183 120 4 128 0 64 /usr/lib/libDefaultClient.so
142 0 142 88 12 172 0 256 /usr/lib/libacsdkAudioPlayer.so
140 0 140 136 4 0 0 44 /usr/lib/libacsdkVisualCharacteristics.so
------ ------ ------ ------ ------ ------ ------ ------
66613 11388 78001 10728 53404 11912 512 71972 TOTAL
If I sum the individual PSS values (1st column), I get 61592, which is around 5 MB less than the total PSS. Can someone explain why these two values differ, and what is using the remaining memory?
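For reference, the arithmetic can be checked with a short script. This is only a sketch: the values are copied by hand from the PSS column of the PROCESS MAPS listing above, and the variable names are made up for illustration. It adds up the listed rows and compares the result with the PSS value in the TOTAL row.

```python
# PSS values (kB) copied from the first column of the PROCESS MAPS rows above.
listed_pss = [
    45752, 4452, 1612, 36, 4, 2116, 1984, 832, 756, 484, 522, 500,
    344, 338, 280, 264, 247, 4, 222, 202, 176, 183, 142, 140,
]
reported_total = 66613  # PSS value from the TOTAL row

listed_sum = sum(listed_pss)
gap = reported_total - listed_sum

print(f"sum of listed rows: {listed_sum} kB")
print(f"reported total:     {reported_total} kB")
print(f"not in the listing: {gap} kB")
```

The script only quantifies the gap; it does not say which mappings account for it.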
I have N buckets. Each bucket can contain 0 or 1. C is the length of a run of consecutive 1s (e.g. for C=3 the run is 111).
E.g. for N=5 and C=2, the total number of combinations is 19 (here C=2, so a combination must always contain at least two 1s in a row, i.e. 11):
And this is the calculation for the first 20 values of N and C (the case above is marked in yellow):
How can I derive a formula that depends on C and N?
This Python program
import scipy.special
import fractions

def bi(n, m):
    return scipy.special.comb(n, m, exact=True)

def fr(*args):
    return fractions.Fraction(*args)

def f(N, k):
    N = fr(N)
    k = fr(k)
    s = 1
    m = 0
    while m <= k - 1:
        if m % k == N % k:
            x = (N - m)/k
            s -= bi(m, x) * (-1)**x * 2**(-(k + 1)*x)
        m += 1
    while m <= N:
        if m % k == N % k:
            x = (N - m)/k
            s -= (bi(m, x) - fr(1, 2)**k * bi(m - k, x) ) * (-1)**x * 2**(-(k + 1)*x)
        m += 1
    return(s * 2**N)

for N in range(1, 20):
    for C in range(1, N + 1):
        print("%6.d" % f(N, C), end = ' ')
    print()
Outputs:
1
3 1
7 3 1
15 8 3 1
31 19 8 3 1
63 43 20 8 3 1
127 94 47 20 8 3 1
255 201 107 48 20 8 3 1
511 423 238 111 48 20 8 3 1
1023 880 520 251 112 48 20 8 3 1
2047 1815 1121 558 255 112 48 20 8 3 1
4095 3719 2391 1224 571 256 112 48 20 8 3 1
8191 7582 5056 2656 1262 575 256 112 48 20 8 3 1
16383 15397 10616 5713 2760 1275 576 256 112 48 20 8 3 1
32767 31171 22159 12199 5984 2798 1279 576 256 112 48 20 8 3 1
65535 62952 46023 25888 12880 6088 2811 1280 576 256 112 48 20 8 3 1
131071 126891 95182 54648 27553 13152 6126 2815 1280 576 256 112 48 20 8 3 1
262143 255379 196132 114832 58631 28240 13256 6139 2816 1280 576 256 112 48 20 8 3 1
524287 513342 402873 240335 124192 60320 28512 13294 6143 2816 1280 576 256 112 48 20 8 3 1
The formula is from Markus Scheuer.
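As an independent cross-check of the table, the counts can also be obtained by brute force for small N. The sketch below is not the formula above: it simply enumerates all 0/1 strings of length N and counts those containing at least C consecutive ones, which is how I read the question. The function name count_with_run is made up for illustration.

```python
from itertools import product

def count_with_run(n, c):
    """Count 0/1 strings of length n that contain at least c consecutive ones."""
    target = "1" * c
    return sum(target in "".join(bits) for bits in product("01", repeat=n))

# Print the first rows of the triangle (row n, columns c = 1..n).
for n in range(1, 9):
    print(" ".join(f"{count_with_run(n, c):4d}" for c in range(1, n + 1)))
```

The enumeration takes 2^N steps, so it is only practical for small N, but it agrees with the rows printed above (for example, 19 for N=5, C=2).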
I wrote a for loop to calculate the population growth of an alien species. This is the loop:
int mind = 96;
int aliens = 1;
for (int i = 0; i <= mind; i++)
{
    aliens = aliens * 2;
}
cout << aliens;
Oddly, cout prints 0, which makes no sense to me; it should print a very large value. Is the loop badly coded?
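The issue is simple: aliens is an int (most likely a 32-bit signed integer). The operation you're performing (x2 each iteration) can be expressed as an arithmetic shift left by one bit.
Beware the powers of 2! Doubling 1073741824 (2^30) gives 2^31, which does not fit in a 32-bit signed integer: the value typically wraps around to -2147483648, and doubling that once more gives 0. (Strictly speaking, signed overflow is undefined behavior in C++; the wrap-around shown here is just what usually happens in practice.)
Let's see how your loop goes (iteration i, followed by the value of aliens after that iteration):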
0 2
1 4
2 8
3 16
4 32
5 64
6 128
7 256
8 512
9 1024
10 2048
11 4096
12 8192
13 16384
14 32768
15 65536
16 131072
17 262144
18 524288
19 1048576
20 2097152
21 4194304
22 8388608
23 16777216
24 33554432
25 67108864
26 134217728
27 268435456
28 536870912
29 1073741824
30 -2147483648 // A.K.A. overflow
31 0
At this point I don't think I need to tell you that 0 x 2 = 0.
The point being: use a double, or an integer type wide enough to hold the result. The loop runs mind + 1 times, so the result is 2^(mind + 1), which needs at least mind + 2 bits.
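The trace above can be reproduced with a small simulation. This is a sketch in Python rather than C++, and it assumes two's-complement wrap-around for a 32-bit signed int, which C++ does not guarantee for signed overflow. The helper names double_as_int32 and simulate are made up for illustration.

```python
def double_as_int32(value):
    """Double value and wrap the result into the signed 32-bit range."""
    wrapped = (value * 2) & 0xFFFFFFFF
    # Values with the top bit set represent negative numbers.
    return wrapped - 0x100000000 if wrapped >= 0x80000000 else wrapped

def simulate(mind):
    """Return the value of aliens after each iteration of the loop."""
    aliens = 1
    trace = []
    for _ in range(mind + 1):  # same bounds as: for (int i = 0; i <= mind; i++)
        aliens = double_as_int32(aliens)
        trace.append(aliens)
    return trace

trace = simulate(96)
print(trace[29], trace[30], trace[31], trace[-1])
print(2 ** 97)  # the value the loop was meant to produce
```

Once the value reaches 0 at i = 31 it stays there for the remaining iterations, which matches the 0 printed by the original program.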
Which is faster in GLSL:
pow(x, 3.0f);
or
x*x*x;
?
Does exponentiation performance depend on hardware vendor or exponent value?
I wrote a small benchmark, because I was interested in the results.
In my personal case, I was most interested in exponent = 5.
Benchmark code (running in Rem's Studio / LWJGL):
package me.anno.utils.bench

import me.anno.gpu.GFX
import me.anno.gpu.GFX.flat01
import me.anno.gpu.RenderState
import me.anno.gpu.RenderState.useFrame
import me.anno.gpu.framebuffer.Frame
import me.anno.gpu.framebuffer.Framebuffer
import me.anno.gpu.hidden.HiddenOpenGLContext
import me.anno.gpu.shader.Renderer
import me.anno.gpu.shader.Shader
import me.anno.utils.types.Floats.f2
import org.lwjgl.opengl.GL11.*
import java.nio.ByteBuffer
import kotlin.math.roundToInt

fun main() {
    fun createShader(code: String) = Shader(
        "", null, "" +
                "attribute vec2 attr0;\n" +
                "void main(){\n" +
                " gl_Position = vec4(attr0*2.0-1.0, 0.0, 1.0);\n" +
                " uv = attr0;\n" +
                "}", "varying vec2 uv;\n", "" +
                "void main(){" +
                code +
                "}"
    )
    fun repeat(code: String, times: Int): String {
        return Array(times) { code }.joinToString("\n")
    }
    val size = 512
    val warmup = 50
    val benchmark = 1000
    HiddenOpenGLContext.setSize(size, size)
    HiddenOpenGLContext.createOpenGL()
    val buffer = Framebuffer("", size, size, 1, 1, true, Framebuffer.DepthBufferType.NONE)
    println("Power,Multiplications,GFlops-multiplication,GFlops-floats,GFlops-ints,GFlops-power,Speedup")
    useFrame(buffer, Renderer.colorRenderer) {
        RenderState.blendMode.use(me.anno.gpu.blending.BlendMode.ADD) {
            for (power in 2 until 100) {
                // to reduce the overhead of other stuff
                val repeats = 100
                val init = "float x1 = dot(uv, vec2(1.0)),x2,x4,x8,x16,x32,x64;\n"
                val end = "gl_FragColor = vec4(x1,x1,x1,x1);\n"
                val manualCode = StringBuilder()
                for (bit in 1 until 32) {
                    val p = 1.shl(bit)
                    val h = 1.shl(bit - 1)
                    if (power == p) {
                        manualCode.append("x1=x$h*x$h;")
                        break
                    } else if (power > p) {
                        manualCode.append("x$p=x$h*x$h;")
                    } else break
                }
                if (power.and(power - 1) != 0) {
                    // not a power of two, so the result isn't finished yet
                    manualCode.append("x1=")
                    var first = true
                    for (bit in 0 until 32) {
                        val p = 1.shl(bit)
                        if (power.and(p) != 0) {
                            if (!first) {
                                manualCode.append('*')
                            } else first = false
                            manualCode.append("x$p")
                        }
                    }
                    manualCode.append(";\n")
                }
                val multiplications = manualCode.count { it == '*' }
                // println("$power: $manualCode")
                val shaders = listOf(
                    // manually optimized
                    createShader(init + repeat(manualCode.toString(), repeats) + end),
                    // can be optimized
                    createShader(init + repeat("x1=pow(x1,$power.0);", repeats) + end),
                    // can be optimized, int as power
                    createShader(init + repeat("x1=pow(x1,$power);", repeats) + end),
                    // slightly different, so it can't be optimized
                    createShader(init + repeat("x1=pow(x1,${power}.01);", repeats) + end),
                )
                for (shader in shaders) {
                    shader.use()
                }
                val pixels = ByteBuffer.allocateDirect(4)
                Frame.bind()
                glClearColor(0f, 0f, 0f, 1f)
                glClear(GL_COLOR_BUFFER_BIT or GL_DEPTH_BUFFER_BIT)
                for (i in 0 until warmup) {
                    for (shader in shaders) {
                        shader.use()
                        flat01.draw(shader)
                    }
                }
                val flops = DoubleArray(shaders.size)
                val avg = 10 // for more stability between runs
                for (j in 0 until avg) {
                    for (index in shaders.indices) {
                        val shader = shaders[index]
                        GFX.check()
                        val t0 = System.nanoTime()
                        for (i in 0 until benchmark) {
                            shader.use()
                            flat01.draw(shader)
                        }
                        // synchronize
                        glReadPixels(0, 0, 1, 1, GL_RGBA, GL_UNSIGNED_BYTE, pixels)
                        GFX.check()
                        val t1 = System.nanoTime()
                        // the first one may be an outlier
                        if (j > 0) flops[index] += multiplications * repeats.toDouble() * benchmark.toDouble() * size * size / (t1 - t0)
                        GFX.check()
                    }
                }
                for (i in flops.indices) {
                    flops[i] /= (avg - 1.0)
                }
                println(
                    "" +
                            "$power,$multiplications," +
                            "${flops[0].roundToInt()}," +
                            "${flops[1].roundToInt()}," +
                            "${flops[2].roundToInt()}," +
                            "${flops[3].roundToInt()}," +
                            (flops[0] / flops[3]).f2()
                )
            }
        }
    }
}
Each shader is drawn over 512² pixels 1000 times per measurement, for 9 measured rounds (the first of 10 rounds is discarded), and evaluates the function 100 times per pixel.
I ran this code on my RX 580 (8 GB, from Gigabyte) and collected the following results:
Power  #Mult  GFlops*  GFlopsFp  GFlopsInt  GFlopsPow  Speedup
2      1      1246     1429      1447       324        3.84
3      2      2663     2692      2708       651        4.09
4      2      2682     2679      2698       650        4.12
5      3      2766     972       974        973        2.84
6      3      2785     978       974        976        2.85
7      4      2830     1295      1303       1299       2.18
8      3      2783     2792      2809       960        2.90
9      4      2836     1298      1301       1302       2.18
10     4      2833     1291      1302       1298       2.18
11     5      2858     1623      1629       1623       1.76
12     4      2824     1302      1295       1303       2.17
13     5      2866     1628      1624       1626       1.76
14     5      2869     1614      1623       1611       1.78
15     6      2886     1945      1943       1953       1.48
16     4      2821     1305      1300       1305       2.16
17     5      2868     1615      1625       1619       1.77
18     5      2858     1620      1625       1624       1.76
19     6      2890     1949      1946       1949       1.48
20     5      2871     1618      1627       1625       1.77
21     6      2879     1945      1947       1943       1.48
22     6      2886     1944      1949       1952       1.48
23     7      2901     2271      2269       2268       1.28
24     5      2872     1621      1628       1624       1.77
25     6      2886     1942      1943       1942       1.49
26     6      2880     1949      1949       1953       1.47
27     7      2891     2273      2263       2266       1.28
28     6      2883     1949      1946       1953       1.48
29     7      2910     2279      2281       2279       1.28
30     7      2899     2272      2276       2277       1.27
31     8      2906     2598      2595       2596       1.12
32     5      2872     1621      1625       1622       1.77
33     6      2901     1953      1942       1949       1.49
34     6      2895     1948      1939       1944       1.49
35     7      2895     2274      2266       2268       1.28
36     6      2881     1937      1944       1948       1.48
37     7      2894     2277      2270       2280       1.27
38     7      2902     2275      2264       2273       1.28
39     8      2910     2602      2594       2603       1.12
40     6      2877     1945      1947       1945       1.48
41     7      2892     2276      2277       2282       1.27
42     7      2887     2271      2272       2273       1.27
43     8      2912     2599      2606       2599       1.12
44     7      2910     2278      2284       2276       1.28
45     8      2920     2597      2601       2600       1.12
46     8      2920     2600      2601       2590       1.13
47     9      2925     2921      2926       2927       1.00
48     6      2885     1935      1955       1956       1.47
49     7      2901     2271      2279       2288       1.27
50     7      2904     2281      2276       2278       1.27
51     8      2919     2608      2594       2607       1.12
52     7      2902     2282      2270       2273       1.28
53     8      2903     2598      2602       2598       1.12
54     8      2918     2602      2602       2604       1.12
55     9      2932     2927      2924       2936       1.00
56     7      2907     2284      2282       2281       1.27
57     8      2920     2606      2604       2610       1.12
58     8      2913     2593      2597       2587       1.13
59     9      2925     2923      2924       2920       1.00
60     8      2930     2614      2606       2613       1.12
61     9      2932     2946      2946       2947       1.00
62     9      2926     2935      2937       2947       0.99
63     10     2958     3258      3192       3266       0.91
64     6      2902     1957      1956       1959       1.48
65     7      2903     2274      2267       2273       1.28
66     7      2909     2277      2276       2286       1.27
67     8      2908     2602      2606       2599       1.12
68     7      2894     2272      2279       2276       1.27
69     8      2923     2597      2606       2606       1.12
70     8      2910     2596      2599       2600       1.12
71     9      2926     2921      2927       2924       1.00
72     7      2909     2283      2273       2273       1.28
73     8      2909     2602      2602       2599       1.12
74     8      2914     2602      2602       2603       1.12
75     9      2924     2925      2927       2933       1.00
76     8      2904     2608      2602       2601       1.12
77     9      2911     2919      2917       2909       1.00
78     9      2927     2921      2917       2935       1.00
79     10     2929     3241      3246       3246       0.90
80     7      2903     2273      2276       2275       1.28
81     8      2916     2596      2592       2589       1.13
82     8      2913     2600      2597       2598       1.12
83     9      2925     2931      2926       2913       1.00
84     8      2917     2598      2606       2597       1.12
85     9      2920     2916      2918       2927       1.00
86     9      2942     2922      2944       2936       1.00
87     10     2961     3254      3259       3268       0.91
88     8      2934     2607      2608       2612       1.12
89     9      2918     2939      2931       2916       1.00
90     9      2927     2928      2920       2924       1.00
91     10     2940     3253      3252       3246       0.91
92     9      2924     2933      2926       2928       1.00
93     10     2940     3259      3237       3251       0.90
94     10     2928     3247      3247       3264       0.90
95     11     2933     3599      3593       3594       0.82
96     7      2883     2282      2268       2269       1.27
97     8      2911     2602      2595       2600       1.12
98     8      2896     2588      2591       2587       1.12
99     9      2924     2939      2936       2938       1.00
As you can see, a pow() call takes almost exactly as long as 9 multiplication instructions. Therefore every manual rewrite of a power that needs fewer than 9 multiplications is faster.
Only the exponents 2, 3, 4, and 8 are optimized by my driver. The optimization is independent of whether you use the .0 suffix for the exponent.
For exponent = 2, my implementation seems to have lower performance than the driver's. I am not sure why.
The speedup is the manual implementation compared to pow(x, exponent + 0.01), which cannot be optimized by the compiler.
Because the multiplications and the speedup align so perfectly, I created a graph to show the relationship. This relationship kind of shows that my benchmark is trustworthy :).
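The #Mult column follows directly from how the benchmark builds its manual code: repeated squaring up to the highest set bit of the exponent, then one multiplication for each additional set bit. A minimal sketch in Python (not part of the benchmark; the names multiplications and POW_COST are made up, and the break-even of 9 is only what was measured on this particular GPU and driver):

```python
def multiplications(power):
    """Multiplications used by square-and-multiply for x**power (power >= 2)."""
    squarings = power.bit_length() - 1    # x^2, x^4, ... up to the top bit
    combines = bin(power).count("1") - 1  # multiply the needed pieces together
    return squarings + combines

POW_COST = 9  # measured break-even on the RX 580 above; an assumption elsewhere

# Exponents below 100 where the manual product needs more multiplications
# than the measured cost of pow().
slower = [p for p in range(2, 100) if multiplications(p) > POW_COST]
print(slower)
```

On other hardware the break-even may be different, so POW_COST should be re-measured rather than reused.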
Operating System: Windows 10 Personal
GPU: RX 580 8GB from Gigabyte
Processor: Ryzen 5 2600
Memory: 16 GB DDR4 3200
GPU Driver: 21.6.1 from 17th June 2021
LWJGL: Version 3.2.3 build 13
While this can definitely be hardware/vendor/compiler dependent, advanced mathematical functions like pow() tend to be considerably more expensive than basic operations.
The best approach is of course to try both, and benchmark. But if there is a simple replacement for an advanced mathematical function, I don't think you can go very wrong by using it.
If you write pow(x, 3.0), the best you can probably hope for is that the compiler will recognize the special case, and expand it. But why take the risk, if the replacement is just as short and easy to read? C/C++ compilers don't always replace pow(x, 2.0) by a simple multiplication, so I wouldn't necessarily count on all GLSL compilers to do that.
Today I ran into this problem and couldn't solve it even after spending some time on it. I need some help.
I have a number N. The problem is to find the next higher number (> N) with exactly one zero bit in its binary representation.
Example:
Number 1 is represented in binary as 1.
The next higher number with exactly one zero bit is 2 (binary 10).
A few other examples:
N = 2 (10), next higher number with one zero bit is 5 (101)
N = 5 (101), next higher number is 6 (110)
N = 7 (111), next higher number is 11 (1011)
List of the first 200 numbers (those with exactly one zero bit are marked with "- 1"):
1 1
2 10 - 1
3 11
4 100
5 101 - 1
6 110 - 1
7 111
8 1000
9 1001
10 1010
11 1011 - 1
12 1100
13 1101 - 1
14 1110 - 1
15 1111
16 10000
17 10001
18 10010
19 10011
20 10100
21 10101
22 10110
23 10111 - 1
24 11000
25 11001
26 11010
27 11011 - 1
28 11100
29 11101 - 1
30 11110 - 1
31 11111
32 100000
33 100001
34 100010
35 100011
36 100100
37 100101
38 100110
39 100111
40 101000
41 101001
42 101010
43 101011
44 101100
45 101101
46 101110
47 101111 - 1
48 110000
49 110001
50 110010
51 110011
52 110100
53 110101
54 110110
55 110111 - 1
56 111000
57 111001
58 111010
59 111011 - 1
60 111100
61 111101 - 1
62 111110 - 1
63 111111
64 1000000
65 1000001
66 1000010
67 1000011
68 1000100
69 1000101
70 1000110
71 1000111
72 1001000
73 1001001
74 1001010
75 1001011
76 1001100
77 1001101
78 1001110
79 1001111
80 1010000
81 1010001
82 1010010
83 1010011
84 1010100
85 1010101
86 1010110
87 1010111
88 1011000
89 1011001
90 1011010
91 1011011
92 1011100
93 1011101
94 1011110
95 1011111 - 1
96 1100000
97 1100001
98 1100010
99 1100011
100 1100100
101 1100101
102 1100110
103 1100111
104 1101000
105 1101001
106 1101010
107 1101011
108 1101100
109 1101101
110 1101110
111 1101111 - 1
112 1110000
113 1110001
114 1110010
115 1110011
116 1110100
117 1110101
118 1110110
119 1110111 - 1
120 1111000
121 1111001
122 1111010
123 1111011 - 1
124 1111100
125 1111101 - 1
126 1111110 - 1
127 1111111
128 10000000
129 10000001
130 10000010
131 10000011
132 10000100
133 10000101
134 10000110
135 10000111
136 10001000
137 10001001
138 10001010
139 10001011
140 10001100
141 10001101
142 10001110
143 10001111
144 10010000
145 10010001
146 10010010
147 10010011
148 10010100
149 10010101
150 10010110
151 10010111
152 10011000
153 10011001
154 10011010
155 10011011
156 10011100
157 10011101
158 10011110
159 10011111
160 10100000
161 10100001
162 10100010
163 10100011
164 10100100
165 10100101
166 10100110
167 10100111
168 10101000
169 10101001
170 10101010
171 10101011
172 10101100
173 10101101
174 10101110
175 10101111
176 10110000
177 10110001
178 10110010
179 10110011
180 10110100
181 10110101
182 10110110
183 10110111
184 10111000
185 10111001
186 10111010
187 10111011
188 10111100
189 10111101
190 10111110
191 10111111 - 1
192 11000000
193 11000001
194 11000010
195 11000011
196 11000100
197 11000101
198 11000110
199 11000111
200 11001000
There are three cases.
1. The number x has more than one zero bit in its binary representation. All but one of these zero bits must be "filled in" with 1 to obtain the required result. Notice that every number obtained by taking x and filling in one or more of its low-order zero bits is numerically closer to x than the number obtained by filling just the topmost zero bit. Therefore the answer is the number x with all but one of its zero bits filled: only its topmost zero bit remains unfilled. For example, if x=110101001 then the answer is 110111111. To get the answer, find the index i of the topmost zero bit of x, and then calculate the bitwise OR of x and 2^i - 1.
C code for this case:
// warning: this assumes x is known to have *some* (>1) zeros!
unsigned next(unsigned x)
{
    unsigned topmostzero = 0;
    unsigned bit = 1;
    while (bit && bit <= x) {
        if (!(x & bit)) topmostzero = bit;
        bit <<= 1;
    }
    return x | (topmostzero - 1);
}
2. The number x has no zero bits in binary. This means that x=2^n - 1 for some number n. By the same reasoning as above, the answer is then 2^n + 2^(n-1) - 1. For example, if x=111, then the answer is 1011.
3. The number x has exactly one zero bit in its binary representation. We know that the result must be strictly larger than x, so x itself is not allowed to be the answer. If the only zero of x is its least-significant bit, then this case reduces to case #2. Otherwise, the zero should be moved one position to the right. Assuming x has its zero in the i-th bit, the answer should have its zero in the (i-1)-th bit. For example, if x=11011, then the result is 11101.
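Putting the three cases together, the whole procedure can be sketched as follows. This is written in Python rather than C for brevity, the function names are made up for illustration, and it assumes x >= 1. A brute-force search is included only to check the case analysis.

```python
def has_one_zero(v):
    """True if the binary representation of v contains exactly one 0 bit."""
    return bin(v)[2:].count("0") == 1

def next_one_zero(x):
    """Smallest number > x with exactly one zero bit (assumes x >= 1)."""
    n = x.bit_length()
    zeros = bin(x)[2:].count("0")
    all_ones_answer = (1 << n) + (1 << (n - 1)) - 1
    if zeros == 0:
        return all_ones_answer                 # no zero bits: x = 2^n - 1
    top = max(i for i in range(n) if not (x >> i) & 1)  # topmost zero bit
    if zeros > 1:
        return x | ((1 << top) - 1)            # fill everything below the topmost zero
    if top == 0:
        return all_ones_answer                 # only zero is the lowest bit
    return (x | (1 << top)) & ~(1 << (top - 1))  # move the zero one place right

def brute_force(x):
    """Reference implementation: try every candidate in turn."""
    y = x + 1
    while not has_one_zero(y):
        y += 1
    return y

for x in (1, 2, 5, 7):
    print(x, next_one_zero(x))
```

For the examples in the question this prints 2, 5, 6 and 11 respectively.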
You could also use another approach:
Every number with exactly one zero bit can be represented as
2^n - 1 - 2^m
Now the task is easy:
1. Find the smallest n for which 2^n-1-2^0>x holds, which is equivalent to 2^n>x+2
2. Find the greatest m (at most n-2, so that the leading bit stays 1) for which 2^n-1-2^m is still greater than x.
As code:
#include <cstdio>
#include <cmath>
using namespace std;

// print the binary representation of n
void bin(unsigned n)
{
    for (int i = floor(log2(n)); i >= 0; --i)
        (n & (1 << i)) ? printf("1") : printf("0");
}

// returns the next greater int to x with exactly one 0 in binary representation
int nextHigherOneZero(int x)
{
    // smallest n with 2^n > x + 2
    unsigned int n = 0;
    while ((1 << n) <= x + 2) ++n;
    // greatest m <= n - 2 for which 2^n - 1 - 2^m is still greater than x
    unsigned int m = 0;
    while ((1 << n) - 1 - (1 << (m + 1)) > x && m < n - 2)
        ++m;
    return (1 << n) - 1 - (1 << m);
}

int main()
{
    int r = 0;
    for (int i = 1; i < 100; ++i) {
        r = nextHigherOneZero(i);
        printf("\nX: %i=", i);
        bin(i);
        printf(";\tnextHigherOneZero(x):%i=", r);
        bin(r);
        printf("\n");
    }
    return 0;
}
You can try it here (with some additional Debug-Output):
http://ideone.com/6w3fAN
As a note: it's probably possible to get m and n faster with some good binary logic, feel free to contribute...
Pro of this approach:
No assumptions need to be made
Cons:
Ugly while loops
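Regarding the note about finding n and m without loops: one possible way is to read both off from bit lengths. The sketch below is in Python, where int.bit_length() is built in; the function names are hypothetical, and this is just one way to do it, not necessarily what the author had in mind. It uses the facts that the smallest n with 2^n > y is the bit length of y, and that 2^n - 1 - 2^m > x is the same as 2^m <= 2^n - 2 - x.

```python
def next_higher_one_zero(x):
    """Smallest number > x of the form 2^n - 1 - 2^m with 0 <= m <= n - 2 (x >= 0)."""
    n = (x + 2).bit_length()            # smallest n with 2^n > x + 2
    d = (1 << n) - 2 - x                # need 2^m <= d, and d >= 1 by choice of n
    m = min(d.bit_length() - 1, n - 2)  # largest such m, keeping the leading bit set
    return (1 << n) - 1 - (1 << m)

def brute_force(x):
    """Reference: step upwards until exactly one zero bit is found."""
    y = x + 1
    while bin(y)[2:].count("0") != 1:
        y += 1
    return y

for x in (1, 2, 5, 7, 14):
    print(x, next_higher_one_zero(x))
```

In C or C++ the same idea would need a count-leading-zeros style operation in place of bit_length().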
I couldn't miss the opportunity to brush up on binary logic :), so here's my solution.
Here's main:
#include <iostream>

bool oneZero(int i); // defined below

int main(int argc, char** argv)
{
    int i = 139261;
    i++;
    while (!oneZero(i))
    {
        i++;
    }
    std::cout << i;
}
And here's the logic that checks whether a number has exactly one zero bit:
bool oneZero(int i)
{
    int count = 0;
    while (i != 0)
    {
        // check last bit if it is zero
        if ((1 & i) == 0) {
            count++;
            if (count > 1) return false;
        }
        // make the number shorter :)
        i = i >> 1;
    }
    return (count == 1);
}