Memory usage during and after drop_duplicates() - python-2.7

I am working with a data frame that takes up roughly 2 GB of memory (according to htop) and has dimensions (6287475, 19). The data frame is heterogeneous in data type, which probably does not matter. Immediately after loading the data frame I drop duplicate rows using the command
df.drop_duplicates(inplace=True)
During the execution of this command memory usage jumps to about 7 GB. After the command completes, memory usage drops to almost 5 GB, which is more than twice the memory required to store a single instance of the data frame. If I then delete the data frame with del df, memory usage decreases to about 3 GB.
The behavior is the same if I do the following:
df2 = df.drop_duplicates()
del df
del df2
Running gc.collect() does nothing; memory usage only returns to its baseline level after I terminate the Python session. Does this look like a memory leak? Has anyone seen similar behavior?
Environment:
64-bit linux
python 2.7.7 (64-bit)
pandas 0.14.1
numpy 1.8.2
IPython 2.2.0 (the behavior is the same in plain CPython)
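For reference, a minimal sketch of how the numbers above can be reproduced (psutil is an assumption on my part; the original measurements came from htop, and the CSV path is a placeholder):
import gc
import pandas as pd
import psutil

def rss_gb():
    # resident set size of the current process, in GB
    return psutil.Process().memory_info().rss / float(1024 ** 3)

df = pd.read_csv('data.csv')  # placeholder for however the frame is actually loaded
print 'after load: %.1f GB' % rss_gb()
df.drop_duplicates(inplace=True)
print 'after drop_duplicates: %.1f GB' % rss_gb()
del df
gc.collect()
print 'after del + gc.collect(): %.1f GB' % rss_gb()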

Related

AWS DMS replication instance out of memory

I recently started to work with AWS Data Migration Service (DMS) and am running into some issues.
I am currently attempting to migrate a 10 GB Oracle DB to AWS RDS Postgres. It works, but has crazy(?) memory requirements; it feels like it loads the entire DB into memory... I started with dms.r4.large (15.5 GB) but it cannot allocate memory after approx. 98%; it runs smoothly with dms.r4.xlarge (30.5 GB).
As you can see in the screenshot (freeable memory, minimum), the instance constantly runs "full" before all memory gets released when the task finishes (or crashes).
Is there any setting to change this, and why does it behave like this? It makes the whole task unnecessarily expensive...
As confirmed by AWS, this was indeed a bug in the latest engine (v3.1.3). The following additional insights were provided by AWS to estimate the actual memory requirements:
Full LOB mode (using single row insert+update, commit rate)
Memory: (# of LOB columns in a table) x (number of tables in parallel, default is 8) x (LOB chunk size) x (commit rate during full load) = 2 * 8 * 64(k) * 10000k
Note: You may consider reducing the "Commit rate during full load" value, because memory is allocated roughly according to the method above.
Limited LOB mode (using array)
Memory: (# of LOB columns in a table) x (number of tables in parallel, default is 8) x maxLobSize x bulkArraySize = 2 * 8 * 4096(k) * 1000
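Expressed as a small sketch (this only restates the two formulas above; interpreting the chunk/LOB sizes as KB is my reading of the AWS note, and the helper names are mine):
# rough DMS memory estimates per the AWS formulas above (results in KB)
def full_lob_mode_kb(lob_columns, tables_in_parallel, lob_chunk_size_kb, commit_rate):
    return lob_columns * tables_in_parallel * lob_chunk_size_kb * commit_rate

def limited_lob_mode_kb(lob_columns, tables_in_parallel, max_lob_size_kb, bulk_array_size):
    return lob_columns * tables_in_parallel * max_lob_size_kb * bulk_array_size

# example values from the AWS reply: 2 LOB columns, 8 tables in parallel,
# 64 KB chunk size / 10000 commit rate, 4096 KB max LOB size / 1000 array size
print full_lob_mode_kb(2, 8, 64, 10000) / (1024.0 ** 2), 'GB'      # full LOB mode
print limited_lob_mode_kb(2, 8, 4096, 1000) / (1024.0 ** 2), 'GB'  # limited LOB mode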

Fine-tuning VGG raises memory error

Hi, I'm trying to fine-tune VGG on my problem, but when I try to train the net I get this error:
OOM when allocating tensor with shape[25088,4096]
The net has the structure of this TensorFlow pretrained VGG implementation, which I took from this site.
I only added this procedure to train the net:
with tf.name_scope('joint_loss'):
    joint_loss = ya_loss + yb_loss + yc_loss + yd_loss + ye_loss + yf_loss + yg_loss + yh_loss + yi_loss + yl_loss + ym_loss + yn_loss
    # Loss with weight decay
    l2_loss = tf.add_n([tf.nn.l2_loss(v) for v in tf.trainable_variables()])
    self.joint_loss = joint_loss + self.weights_decay * l2_loss
    self.optimizer = tf.train.AdamOptimizer(learning_rate=self.learning_rate).minimize(joint_loss)
I tried reducing the batch size to 2, but that doesn't work; I get the same error. The error is due to a big tensor that cannot be allocated in memory. I only get this error during training, because if I just feed a value without minimizing, the net works. How can I avoid this error? How can I save memory on the graphics card (Nvidia GeForce GTX 970)?
UPDATE: if I use GradientDescentOptimizer the training process starts, whereas if I use AdamOptimizer I get the memory error; it seems that GradientDescentOptimizer uses less memory.
Without a backward pass ("feed a value without minimizing"), TensorFlow can immediately de-allocate intermediate activations. With a backward pass, the graph has a giant U-shape, where activations from the forward pass need to be kept in memory for the backward pass. There are some tricks (such as swapping to host memory), but in general backprop means that memory usage will be higher.
Adam does keep some extra bookkeeping variables around, so it will increase memory usage proportional to the amount of memory your weight variables are already using. If your training steps take quite a while (in which case having the variable updates on the GPU isn't important), you could instead locate the optimization ops in host memory.
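As a rough back-of-the-envelope sketch (the layer shape comes from the error above; float32 storage is an assumption):
# the [25088, 4096] fully connected layer alone
params = 25088 * 4096          # ~103 million weights
weight_bytes = params * 4      # float32 -> ~0.4 GB just for this layer's weights
adam_extra = 2 * weight_bytes  # Adam's m and v slots -> ~0.8 GB on top of that
print weight_bytes / 1024.0 ** 3, adam_extra / 1024.0 ** 3  # sizes in GB
This is roughly why plain gradient descent can fit on a 4 GB GTX 970 where Adam does not.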
If you need a larger batch size and can't reduce image resolution or model size, combining gradients from multiple workers/GPUs using something like SyncReplicasOptimizer can be a good option. Looking at the paper associated with this model, it looks like they were training on 4 GPUs each with 12GB of memory.

Python program unexpectedly dies

I am training an LSTM network using Lasagne, but my program dies without any error during training. If I use a much smaller dataset (only 13 data points) the entire code runs.
I was initially using my GPU and thought that the memory was running out, but it dies even when I use the CPU.
My final attempt was to cap the process memory with the resource module, but to no avail:
import resource
# limit the data segment size (soft limit 1024, hard limit unlimited)
rsrc = resource.RLIMIT_DATA
resource.setrlimit(rsrc, (1024, resource.RLIM_INFINITY))
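If the intent was to cap memory rather than effectively disable allocation, note that RLIMIT_DATA is expressed in bytes, so the snippet above limits the data segment to 1 KB. A sketch with a more realistic cap (the 4 GB value is purely illustrative):
import resource

four_gb = 4 * 1024 ** 3  # illustrative limit, not from the original post
resource.setrlimit(resource.RLIMIT_DATA, (four_gb, resource.RLIM_INFINITY))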

Text classification process gets killed when I am using linear SVM for 10000 rows

I am programming in Python 2.7 with the NLTK library for both text preprocessing and classification in sentiment analysis. I am using the NLTK wrapper of scikit-learn algorithms. The code below runs after preprocessing and separation into train and test sets.
import nltk
from nltk.classify.scikitlearn import SklearnClassifier
from sklearn.svm import SVC, LinearSVC, NuSVC
training_set = nltk.classify.util.apply_features(extractFeatures, trainTweets)
testing_set = nltk.classify.util.apply_features(extractFeatures, testTweets)
#LinearSVC
LinearSVC_classifier = SklearnClassifier(LinearSVC())
LinearSVC_classifier.train(training_set)
LinearSVCAccuracy = nltk.classify.accuracy(LinearSVC_classifier, testing_set)*100
print "LinearSVC accuracy percentage:" + str(LinearSVCAccuracy)
It works fine when the number of rows is around 4000 tweets for training, but when it increases to, for example, 10000 tweets, the process gets killed with the following error.
Memory cgroup out of memory: Kill process 24293 (python) score 848 or
sacrifice child
Killed process 24293, UID 29091, (python) total-vm:14569168kB,
anon-rss:14206656kB, file-rss:3412kB
Clocksource tsc unstable (delta = -17179861691 ns). Enable clocksource
failover by adding clocksource_failover kernel parameter.
My PC has 8 GB of RAM, and I even tried with 16 GB, but the problem persists. How can I classify this number of tweets without any problem?
Which OS are you running? Which Python distribution? Try installing Cython and/or using scikit-learn directly. Have a look at scikit-learn's optimization techniques.
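For illustration, a minimal sketch of the "use scikit-learn directly" suggestion (I am assuming trainTweets/testTweets are lists of (text, label) pairs; the per-tweet feature dicts from apply_features are replaced by a sparse hashing vectorizer, which is usually what keeps memory under control):
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.svm import LinearSVC

# assumed data layout: lists of (text, label) tuples
train_texts, train_labels = zip(*trainTweets)
test_texts, test_labels = zip(*testTweets)

# HashingVectorizer produces a sparse matrix with a bounded memory footprint,
# unlike dense per-tweet feature dicts, which blow up at 10000 rows
vectorizer = HashingVectorizer(n_features=2 ** 18)
X_train = vectorizer.transform(train_texts)
X_test = vectorizer.transform(test_texts)

clf = LinearSVC()
clf.fit(X_train, train_labels)
print "LinearSVC accuracy percentage:" + str(clf.score(X_test, test_labels) * 100)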

Not enough space to cache rdd in memory warning

I am running a Spark job, and I got a "Not enough space to cache rdd_128_17000 in memory" warning. However, the attached file obviously says only 90.8 G out of 719.3 G is used. Why is that? Thanks!
15/10/16 02:19:41 WARN storage.MemoryStore: Not enough space to cache rdd_128_17000 in memory! (computed 21.4 GB so far)
15/10/16 02:19:41 INFO storage.MemoryStore: Memory use = 4.1 GB (blocks) + 21.2 GB (scratch space shared across 1 thread(s)) = 25.2 GB. Storage limit = 36.0 GB.
15/10/16 02:19:44 WARN storage.MemoryStore: Not enough space to cache rdd_129_17000 in memory! (computed 9.4 GB so far)
15/10/16 02:19:44 INFO storage.MemoryStore: Memory use = 4.1 GB (blocks) + 30.6 GB (scratch space shared across 1 thread(s)) = 34.6 GB. Storage limit = 36.0 GB.
15/10/16 02:25:37 INFO metrics.MetricsSaver: 1001 MetricsLockFreeSaver 339 comitted 11 matured S3WriteBytes values
15/10/16 02:29:00 INFO s3n.MultipartUploadOutputStream: uploadPart /mnt1/var/lib/hadoop/s3/959a772f-d03a-41fd-bc9d-6d5c5b9812a1-0000 134217728 bytes md5: qkQ8nlvC8COVftXkknPE3A== md5hex: aa443c9e5bc2f023957ed5e49273c4dc
15/10/16 02:38:15 INFO s3n.MultipartUploadOutputStream: uploadPart /mnt/var/lib/hadoop/s3/959a772f-d03a-41fd-bc9d-6d5c5b9812a1-0001 134217728 bytes md5: RgoGg/yJpqzjIvD5DqjCig== md5hex: 460a0683fc89a6ace322f0f90ea8c28a
15/10/16 02:42:20 INFO metrics.MetricsSaver: 2001 MetricsLockFreeSaver 339 comitted 10 matured S3WriteBytes values
This is likely to be caused by the configuration of spark.storage.memoryFraction being too low. Spark will only use this fraction of the allocated memory to cache RDDs.
Try either of the following (a minimal sketch is given after the list):
increasing the storage fraction
rdd.persist(StorageLevel.MEMORY_ONLY_SER) to reduce memory usage by serializing the RDD data
rdd.persist(StorageLevel.MEMORY_AND_DISK) to spill to disk when the memory limit is reached.
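A minimal PySpark sketch of those two knobs (the configuration key and storage levels are the ones named above; the input path and the 0.8 value are placeholders):
from pyspark import SparkConf, SparkContext, StorageLevel

# give the block store a larger share of executor memory
# (spark.storage.memoryFraction is the Spark 1.x setting; its default is 0.6)
conf = SparkConf().set("spark.storage.memoryFraction", "0.8")
sc = SparkContext(conf=conf)

rdd = sc.textFile("hdfs:///some/input/path")  # placeholder input
# let partitions that do not fit in memory spill to disk instead of being dropped;
# StorageLevel.MEMORY_ONLY_SER is the serialized-in-memory alternative
rdd.persist(StorageLevel.MEMORY_AND_DISK)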
This could be due to the following issue if you're loading lots of avro files:
https://mail-archives.apache.org/mod_mbox/spark-user/201510.mbox/%3CCANx3uAiJqO4qcTXePrUofKhO3N9UbQDJgNQXPYGZ14PWgfG5Aw#mail.gmail.com%3E
With a PR in progress at:
https://github.com/databricks/spark-avro/pull/95
I have a Spark-based batch application (a JAR with a main() method, not written by me; I'm not a Spark expert) that I run in local mode without spark-submit, spark-shell, or spark-defaults.conf. When I tried to use the IBM JRE (like one of my customers) instead of the Oracle JRE (same machine and same data), I started getting those warnings.
Since the memory store is a fraction of the heap (see the page that Jacob suggested in his comment), I checked the heap size: the IBM JRE uses a different strategy to decide the default heap size, and it was too small, so I simply added appropriate -Xms and -Xmx params and the problem disappeared: now the batch works fine with both the IBM and the Oracle JRE.
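For example (a hypothetical invocation; the JAR name and heap sizes are placeholders, not the values from my setup):
java -Xms2g -Xmx8g -jar spark-batch-app.jar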
My usage scenario is not typical, I know, however I hope this can help someone.