AWS SageMaker Studio Local RAM Overload

Every time I open Studio and get to the JupyterLab page, about 10 seconds later the browser starts to eat local RAM at ~0.1 GB/s and one or two cores max out at 100% utilization.
It keeps doing this until all the RAM on my local computer has been used up; then it overloads the swap and everything freezes, requiring a reboot.
This happens with Firefox, Chrome, and the GNOME browser. I have yet to try this on Windows; if possible, I would like to get this working with my current environment.
If it helps:
Ubuntu 20.04.1 LTS, 64 bit
Intel i7-7700HQ CPU @ 2.80GHz × 8
GeForce GTX 1060
In all cases there is a rather large spike in bandwidth corresponding with the start of the issue:
~7+ MB/s for ~2 seconds. This seems to correspond with a fetch request being made.
The process consuming the memory and CPU when using Chrome appears to be:
chrome --type=renderer followed by a bunch of other stuff
Occasionally the page will crash before using all the RAM; the error message in Chrome is:
Error code: SIGTRAP
If the SageMaker JupyterLab tab is killed or the browser is closed before RAM maxes out,
the cores it is maxing out return to normal utilization and the RAM is released.

Related

How to minimize Google Cloud launch latency

I have a persistent server that unpredictably receives new data from users, needing about 10 GPU instances to crank at the problem for about 5 minutes, and I send the answer back to the users. The server itself is a cheap always-persistent single CPU Google Cloud instance. When a user request comes in, my code launches my 10 created but stopped Google Cloud GPU instances with
gcloud compute instances start (instance list)
In the rare case that the stopped instances don't exist (sometimes they get wiped), that's detected and they're recreated with
gcloud beta compute instances create (...)
This system all works fine. My only complaint is that even with created but stopped instances, the launch time before my GPU code finally starts to run is about 5 minutes. Most of this is just the time for the instance itself to launch its Ubuntu host and call my code. The delay to start the GPU once Ubuntu is running is only about 10 seconds.
How can I reduce this 5 minute delay? I imagine most of it comes from Google having to copy over the 4GB of instance data to the target machine, but the startup time of (vanilla) Ubuntu probably adds 1 more minute. I'm not even sure I could quantify these two numbers independently; I can only measure the combined 3-7 minute delay from the launch until my code starts responding.
I don't think Ubuntu OS startup time is the major latency contributor, since I timed an actual machine with the same Ubuntu and the same GPU on my desk from power-on, and it began running my GPU code in 46 seconds.
My goal is to get results back to my users as soon as possible, and that 5 minute startup delay is a bottleneck.
Would making a smaller instance SIZE of say 2GB help? What else can I do to reduce the latency?
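One way to split the delay into parts is to time several milestones from a single start point: when the start command returns, when SSH (port 22) starts accepting connections as a rough sign the guest OS has booted, and when your own GPU service starts accepting connections. The Python sketch below assumes your service listens on a TCP port; the instance name, IP address and port 8080 in the usage comment are hypothetical placeholders, and an open port is only an approximation of "ready".

```python
import socket
import subprocess
import time
from typing import Callable, Dict, List


def port_is_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def gcloud_start(instances: List[str]) -> None:
    """Start stopped instances; gcloud waits for the operation unless --async is given."""
    subprocess.run(["gcloud", "compute", "instances", "start", *instances], check=True)


def wait_for(probe: Callable[[], bool], t0: float,
             poll_interval: float = 1.0, timeout: float = 900.0) -> float:
    """Poll probe() until it returns True; return seconds elapsed since t0."""
    while not probe():
        if time.monotonic() - t0 > timeout:
            raise TimeoutError("milestone not reached before timeout")
        time.sleep(poll_interval)
    return time.monotonic() - t0


def measure_launch(start: Callable[[], None],
                   milestones: Dict[str, Callable[[], bool]],
                   poll_interval: float = 1.0,
                   timeout: float = 900.0) -> Dict[str, float]:
    """Time each milestone, in insertion order, from one shared start time."""
    t0 = time.monotonic()
    start()
    timings = {"start_returned": time.monotonic() - t0}
    for name, probe in milestones.items():
        timings[name] = wait_for(probe, t0, poll_interval, timeout)
    return timings


# Hypothetical usage (instance name, IP and port 8080 are placeholders):
# print(measure_launch(
#     lambda: gcloud_start(["gpu-worker-1"]),
#     {"ssh_up": lambda: port_is_open("10.128.0.5", 22),
#      "gpu_service_up": lambda: port_is_open("10.128.0.5", 8080)}))
```

Comparing ssh_up with gpu_service_up roughly separates provisioning plus OS boot from your own start-up work; start_returned shows how long gcloud itself waited.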
2GB is large. That's a heckuva big image. You should be able to cut that down to 100MB, perhaps using Alpine instead of Ubuntu.
Copying 4GB of data is also less than ideal. Given that, I suspect the solution will be more of an architecture change than a code change.
But if you want to take a whack at everything which is NOT about your 4GB of data, you can prepare a custom image for your VMs. If you can build a slim custom image, that will help.
There are good resources for learning more; the two I would start with are:
- Improve GCE Boot Times with Custom Images
- Three steps to Compute Engine startup-time bliss: Google Cloud Performance Atlas

ColdFusion Production Server - Tuning for Performance

I'm trying to determine the optimal settings for my ColdFusion PRODUCTION server. The server has the following specs.
ColdFusion: Enterprise Version 10
O/S: Windows Server 2012 R2 Standard
Processor: Intel(R) Xeon(R) CPU E5-2660 v2 @ 2.20GHz
Installed Memory (RAM): 20.0 GB
System Type: 64-bit Operating System, x64-based processor
My Java and JVM settings from the CFIDE are:
Minimum Heap Size (in MB): 2048
Maximum Heap Size (in MB): 4096
JVM Arguments
-server -XX:MaxPermSize=192m -XX:+UseParallelGC -Xbatch -Dcoldfusion.home={application.home} -Dcoldfusion.rootDir={application.home} -Dcoldfusion.libPath={application.home}/lib -Dorg.apache.coyote.USE_CUSTOM_STATUS_MSG_IN_HEADER=true -Dcoldfusion.jsafe.defaultalgo=FIPS186Random
I have multiple websites running on this production server, all of which use ColdFusion. The database server is completely separate, so all that this server is responsible for is the ColdFusion application and web server processes.
The websites are completely data-driven, all pulling from the database located on my production database server. Lately, I've been seeing the ColdFusion service locking up, as it is maxing out the CPU. The memory is stable, it's only the CPU that is maxing out.
Can anyone make suggestions as to how I can tune it to improve overall performance while reducing strain on the CPU?
Java Version
java version "1.8.0_73"
Java(TM) SE Runtime Environment (build 1.8.0_73-b02)
Java HotSpot(TM) Client VM (build 25.73-b02, mixed mode)
Thank you!
Are you running all of these sites on a single instance of ColdFusion? If so, I would recommend running multiple instances of CF. Each instance can run the same JVM settings given your total available memory.
Minimum Heap Size (in MB): 2048
Maximum Heap Size (in MB): 4096
So that's a max of about 16GB of memory allocated to a total of 4 instances of CF. Then you balance out the volume of sites run on each instance based on site usage. You might have one that needs its own instance and the rest can be spread across the other three.
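The memory arithmetic behind this suggestion can be checked with a small worst-case budget. The Python sketch below assumes every instance grows to its 4096 MB maximum heap on the 20 GB (20480 MB) server and adds a hypothetical 512 MB per instance for non-heap JVM memory (class metadata, thread stacks, code cache); that overhead is a placeholder, not a measured figure.

```python
def heap_budget(instances: int, max_heap_mb: int,
                overhead_per_instance_mb: int, physical_mb: int) -> dict:
    """Worst-case memory if every CF instance grows to its maximum heap.

    overhead_per_instance_mb is an assumed allowance for non-heap JVM memory;
    replace it with what you measure on your own server.
    """
    heap_total = instances * max_heap_mb
    jvm_total = instances * (max_heap_mb + overhead_per_instance_mb)
    return {
        "heap_total_mb": heap_total,
        "jvm_total_mb": jvm_total,
        "os_headroom_mb": physical_mb - jvm_total,
    }


# 4 instances x 4096 MB max heap on a 20480 MB server,
# with a hypothetical 512 MB of non-heap overhead per instance.
print(heap_budget(4, 4096, 512, 20480))
# {'heap_total_mb': 16384, 'jvm_total_mb': 18432, 'os_headroom_mb': 2048}
```

If the headroom left for Windows and the web server looks too thin once you measure real overhead, reduce the instance count or the maximum heap per instance.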
It's also possible to run all of the sites on all of the instances, using a load balancer to pass requests between instances. Either approach should ensure that one site doesn't cause the rest to run poorly or not at all.

Google Compute Engine - Low on Resource Utilisation

I use a VM Instance provided by Google Compute Engine.
Machine Type: n1-standard-8 (8 vCPUs, 30 GB memory).
When I check the CPU utilisation, it never goes above 12%. I use my VM for running Jupyter Notebook. I have tried loading dataframes that take up 7.5 GiB (and simple operations on the data take a long time), but the utilisation stays the same.
How can I utilise ~100% of the CPU power?
Or does my program use only 1 of the 8 CPUs, i.e. (1/8) × 100 = 12.5%?
You can run the stress command to impose a configurable amount of CPU, memory, I/O, and disk stress on the system.
Example to stress 4 cores for 90 seconds:
stress --cpu 4 --timeout 90
In the meantime, go to the Google Cloud Console in your browser to check the CPU usage of your VM, or open a new SSH connection to your VM and run the top command to see your CPU status.
After running those commands, if your CPU can reach over 99%, your instance is working fine, and you need to check your application to find out why it is restricted and cannot use more than 12% of the CPU.
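If stress does reach ~100% while the notebook stays near 12.5%, the question's own guess is the likely explanation: one fully loaded core out of eight shows as 1/8 = 12.5%, and most pandas operations run on a single core. The Python sketch below spreads CPU-bound work over all vCPUs with the standard library's concurrent.futures.ProcessPoolExecutor; busy_work is a hypothetical stand-in for your own per-chunk processing.

```python
import math
import os
from concurrent.futures import ProcessPoolExecutor
from typing import List


def expected_utilisation(busy_cores: int, total_cores: int) -> float:
    """Overall CPU % shown when busy_cores are fully loaded out of total_cores."""
    return 100.0 * busy_cores / total_cores


def busy_work(n: int) -> float:
    """Hypothetical CPU-bound task standing in for your real per-chunk work."""
    return sum(math.sqrt(i) for i in range(n))


def run_parallel(chunks: List[int], workers: int = 0) -> List[float]:
    """Run busy_work on each chunk in its own process; defaults to one per vCPU."""
    workers = workers or os.cpu_count() or 1
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(busy_work, chunks))
```

Call run_parallel([5_000_000] * 8) from a script guarded by if __name__ == "__main__": and watch top. In a notebook, worker processes may fail to find functions defined in cells; if that happens, move busy_work into an importable .py module.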

ColdFusion server crashing on hourly basis

I am facing a serious ColdFusion server crashing issue. I have many live sites on that server, so this is serious and urgent.
Following are the system specs:
Windows Server 2003 R2, Enterprise X64 Edition, Service Pack 2
ColdFusion (8,0,1,195765) Enterprise Edition
Following are the hardware specs:
Intel(R) Xeon(R) CPU E7320 @ 2.13GHz, 2.13 GHz
31.9 GB of RAM
It is crashing on an hourly basis. Can somebody help me find the exact issue? I tried to find it through the ColdFusion log files, but I did not find anything there. Every time it crashes, I have to restart the ColdFusion services to get it back.
Edit1
When I looked at the runtime log file "ColdFusion-out165.log", I found the following errors:
error ROOT CAUSE:
java.lang.OutOfMemoryError: Java heap space
javax.servlet.ServletException: ROOT CAUSE:
java.lang.OutOfMemoryError: Java heap space
04/18 16:19:44 error ROOT CAUSE:
java.lang.OutOfMemoryError: GC overhead limit exceeded
javax.servlet.ServletException: ROOT CAUSE:
java.lang.OutOfMemoryError: GC overhead limit exceeded
Here are my current JVM settings:
Minimum JVM Heap Size (MB): 512
Maximum JVM Heap Size (MB): 1024
JVM Arguments
-server -Dsun.io.useCanonCaches=false -XX:MaxPermSize=512m -XX:+UseParallelGC -Dcoldfusion.rootDir={application.home}/../ -Dcoldfusion.libPath={application.home}/../lib
Note: when I tried to increase the Maximum JVM Heap Size to 1536 and restart the ColdFusion services, they would not start and gave the following error:
"Windows could not start the ColdFusion MX Application Server on Local Computer. For more information, review the System Event Log. If this is a non-Microsoft service, contact the service vendor, and refer to service-specific error code 2."
Shouldn't I be able to set my maximum heap size to 1.8 GB, since I am using a 64-bit operating system?
How much memory you can give to your JVM is predicated on the bitness of your JVM, not your OS. Are you running a 64-bit CF install? It was an uncommon thing to do back in the CF8 days, so it's worth asking.
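One plausible reason the service refuses to start at 1536 MB, assuming this CF8 install runs on a 32-bit JVM: by default a 32-bit Windows process that is not large-address-aware has only 2 GB (2048 MB) of user-mode address space, and the JVM reserves room for the maximum heap and the maximum PermGen when it starts. The Python sketch below does that arithmetic with the settings from the question; it ignores thread stacks, DLLs and JIT-compiled code, which also need room.

```python
# Default user-mode address space of a 32-bit Windows process that is not
# large-address-aware; an assumption about how this CF8 install runs.
USER_SPACE_32BIT_MB = 2048


def address_space_left(max_heap_mb: int, max_perm_mb: int,
                       user_space_mb: int = USER_SPACE_32BIT_MB) -> int:
    """MB left after reserving the maximum heap and PermGen; <= 0 cannot work.

    Thread stacks, DLLs and JIT-compiled code also need address space, so the
    real margin is smaller than this figure.
    """
    return user_space_mb - (max_heap_mb + max_perm_mb)


print(address_space_left(1024, 512))  # current settings   -> 512
print(address_space_left(1536, 512))  # attempted settings -> 0
```

If that is the cause, a 64-bit JVM removes the ceiling; on a 32-bit JVM, any heap increase has to come out of MaxPermSize or the remaining headroom.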
Basically the error is stating you're using too much RAM for how much you have available (which you know). I'd be having a look at how much stuff you're putting into session and application scope, and culling back stuff that's not necessary.
Objects in session scope are particularly bad: they have a far bigger footprint than one might think, and cause more trouble than they're worth.
I'd also look at how many inactive but not yet timed-out sessions you have, with a view to being far more aggressive with your session time-outs.
Have a look at your queries: get rid of any SELECT * you have, and cut them back to just the columns you need. Push data processing back into the DB rather than doing it in CF.
Farm scheduled tasks off onto a different CF instance.
Are you doing anything with large files? Either reading and processing them, or serving them via <cfcontent>? That can chew memory very quickly.
Are all your function-local variables in CFCs properly VARed? Especially ones in CFCs which end up in shared scopes.
Do you accidentally have debugging switched on?
Are you making heavy use of custom tags or files called in with <cfmodule>? I have heard apocryphal stories of custom tags causing memory leaks.
Get hold of Mike Brunt or Charlie Arehart to have a look at your server config / app (they will obviously charge consultancy fees).
I will update this as I think of more things to look out for.
Turn on the ColdFusion Server Monitor in the Administrator. Use it to observe behavior, and find long-running processes and errors.
Also, make sure that memory monitoring is turned off in the ColdFusion Server Monitor. That will bring down a production server easily.
@Adil,
I had the same kind of issue, but instead of crashing, CPU usage went as high as 100%. Not sure it's relevant to your issue, but it's at least worth a look.
See this question:
Strange JRUN issue. JRUN eating up 50% of memory for every two hours
My blog entry for this
http://www.thecfguy.com/post.cfm/strange-coldfusion-issue-jrun-eating-up-to-50-of-cpu
For me it was a high-traffic site storing client variables in the registry that was making things go wrong.
Hope this helps.

How to keep a VMWare VM's clock in sync?

I have noticed that our VMWare VMs often have the incorrect time on them. No matter how many times I reset the time they keep on desyncing.
Has anyone else noticed this? What do other people do to keep their VM time in sync?
Edit: These are CLI Linux VMs, btw.
If your host time is correct, you can set the following .vmx configuration file option to enable periodic synchronization:
tools.syncTime = true
By default, this synchronizes the time every minute. To change the periodic rate, set the following option to the desired sync interval in seconds:
tools.syncTime.period = 60
For this to work, you need to have VMware Tools installed in your guest OS.
See http://www.vmware.com/pdf/vmware_timekeeping.pdf for more information
According to VMware's knowledge base, the actual solution depends on the Linux distro and release. In RHEL 5.3 I usually edit /etc/grub.conf and append these parameters to the kernel entry: divider=10 clocksource=acpi_pm
Then enable NTP, disable VMware time synchronization from vmware-toolbox, and finally reboot the VM.
A complete table with guidelines for each Linux distro can be found here:
TIMEKEEPING BEST PRACTICES FOR LINUX GUESTS
http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1006427
I'll answer for Windows guests. If you have VMware Tools installed, then the taskbar's notification area (near the clock) has an icon for VMware Tools. Double-click that and set your options.
If you don't have VMware Tools installed, you can still set the clock's option for internet time to sync with some NTP server. If your physical machine serves the NTP protocol to your guest machines then you can get that done with host-only networking. Otherwise you'll have to let your guests sync with a genuine NTP server out on the internet, for example time.windows.com.
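To check how far a guest has actually drifted, whichever sync method you choose, you can query an NTP server directly. Below is a minimal SNTP client sketch in Python following the standard NTP packet layout (RFC 4330); the server names are only examples, it needs outbound UDP port 123, and it makes a single query with no filtering, so treat the result as a rough measurement.

```python
import socket
import struct
import time

# Seconds between the NTP epoch (1900-01-01) and the Unix epoch (1970-01-01).
NTP_TO_UNIX = 2208988800


def build_request() -> bytes:
    """48-byte SNTP client request: LI=0, version 3, mode 3 (client)."""
    return b"\x1b" + bytes(47)


def ntp_to_unix(seconds: int, fraction: int) -> float:
    """Convert a 64-bit NTP timestamp (seconds, 32-bit fraction) to Unix time."""
    return seconds - NTP_TO_UNIX + fraction / 2**32


def clock_offset(t1: float, t2: float, t3: float, t4: float) -> float:
    """Standard NTP offset; positive means the local clock is behind the server."""
    return ((t2 - t1) + (t3 - t4)) / 2


def offset_from_reply(reply: bytes, t1: float, t4: float) -> float:
    """Read receive (bytes 32-39) and transmit (40-47) timestamps from a reply."""
    rx_s, rx_f, tx_s, tx_f = struct.unpack("!4I", reply[32:48])
    return clock_offset(t1, ntp_to_unix(rx_s, rx_f), ntp_to_unix(tx_s, tx_f), t4)


def query_offset(server: str = "pool.ntp.org", timeout: float = 2.0) -> float:
    """Send one SNTP request over UDP port 123 and return the offset in seconds."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        t1 = time.time()
        sock.sendto(build_request(), (server, 123))
        reply, _ = sock.recvfrom(48)
        t4 = time.time()
    return offset_from_reply(reply, t1, t4)


# Example: print(f"offset: {query_offset('time.windows.com'):+.3f} s")
```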
Something to note here: we had the same issue with Windows VMs running on an ESXi host. Time sync was turned on in VMware Tools on the guest, but the guest clocks were consistently off (by about 30 seconds) from the host clock. The ESXi host was configured to get time updates from an internal time server.
It turns out we had the Internet Time setting turned on in the Windows VMs (Control Panel > Date and Time > Internet Time tab), so the guest was getting time updates from two places and the internet time was winning. We turned that off and now the guest clocks are good, getting their time exclusively from the ESXi host.
In my case we are running VMware Server 2.02 on Windows Server 2003 R2 Standard. The host is also Windows Server 2003 R2 Standard. I had VMware Tools installed and set to sync the time. I did everything imaginable that I found on various internet sites. We still had horrendous drift, although it had shrunk from 15 minutes or more down to the 3 or 4 minute range.
Finally, I found this entry in vmware.log (which resides in the same folder as the .vmx file):
"Your host system does not guarantee synchronized TSCs across different CPUs, so please set the /usepmtimer option in your Windows Boot.ini file to ensure that timekeeping is reliable. See Microsoft KB http://support.microsoft.com/kb... for details and Microsoft KB http://support.microsoft.com/kb... for additional information."
Cause: This problem occurs when the computer has the AMD Cool'n'Quiet technology (AMD dual cores) enabled in the BIOS or some Intel multi core processors. Multi core or multiprocessor systems may encounter Time Stamp Counter (TSC) drift when the time between different cores is not synchronized. The operating systems which use TSC as a timekeeping resource may experience the issue. Newer operating systems typically do not use the TSC by default if other timers are available in the system which can be used as a timekeeping source. Other available timers include the PM_Timer and the High Precision Event Timer (HPET).
Resolution: To resolve this problem, check with the hardware vendor to see if a new driver/firmware update is available to fix the issue.
Note: The driver installation may add the /usepmtimer switch in the Boot.ini file.
Once the /usepmtimer switch was set, the clock was dead on time.
This documentation solved this problem for me.
The CPU speed varies due to power saving. I originally noticed this because VMware gave me a helpful tip on my laptop, but this page mentions the same thing:
Quote from VMware tips and tricks:
Power saving (SpeedStep, C-states, P-States,...)
Your power saving settings may interfere significantly with vmware's performance. There are several levels of power saving.
CPU frequency
This should not lead to performance degradation, outside of having the obvious lower performance when running the CPU at a lower frequency (either manually or via governors like "ondemand" or "conservative"). The only problem with varying the CPU speed while VMware is running is that the Windows clock will gain or lose time. To prevent this, specify your full CPU speed in kHz in /etc/vmware/config:
host.cpukHz = 2167000
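On a Linux host, the cpufreq sysfs interface reports the CPU's maximum frequency in kHz, the same unit host.cpukHz uses (2167000 kHz is 2.167 GHz). The Python sketch below turns that value into a line in the same format as the quoted tip. It assumes cpufreq is available at the path shown (it is not on every host), and on some systems this value includes turbo frequencies, so compare it with the CPU's rated speed before adding the line to /etc/vmware/config yourself.

```python
from pathlib import Path

# cpufreq reports frequencies in kHz; this path is absent if cpufreq is not loaded.
MAX_FREQ_PATH = Path("/sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_max_freq")


def parse_khz(text: str) -> int:
    """Parse the single integer (kHz) stored in a cpufreq sysfs file."""
    return int(text.strip())


def cpukhz_line(khz: int) -> str:
    """Format the /etc/vmware/config entry used in the quoted tip."""
    return f"host.cpukHz = {khz}"


if __name__ == "__main__":
    if MAX_FREQ_PATH.exists():
        print(cpukhz_line(parse_khz(MAX_FREQ_PATH.read_text())))
    else:
        print("cpufreq not available; look up the CPU's rated speed instead")
```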
VMware experiences a lot of clock drift. This Google search for 'vmware clock drift' links to several articles.
The first hit may be the most useful for you: http://www.fjc.net/linux/linux-and-vmware-related-issues/linux-2-6-kernels-and-vmware-clock-drift-issues
When installing VMware Tools on a Windows Guest, “Time Synchronisation” is not enabled by default.
However – “best practise” is to enable time synch on Windows Guests.
There are several ways to do this from outside the VM, but I wanted to find a way to enable time sync from within the guest itself, either during or after the tools install.
Surprisingly, this wasn’t quite as straightforward as I expected.
(I assumed it would be possible to set this as a parameter / config option during the tools install.)
After a bit of searching, I found a way to do this in a VMware article called "Using the VMware Tools Command-Line Interface".
So, if time sync is disabled, you can enable it by running the following command line in the guest:
VMwareService.exe --cmd "vmx.set_option synctime 0 1"
Additional Notes
For some (IMHO stupid) reason, this utility requires you to specify the current as well as the new value
0 = disabled
1 = enabled
So if you run this command on a machine which already has this set, you will get an error saying "Invalid old value".
Obviously you can “ignore” this error when run (so not a huge deal) but the current design seems a bit dumb.
IMHO it would be much more sensible if you could simply specify the value you want to set and not require the current value to be specified.
i.e.
VMwareService.exe --cmd "vmx.set_option synctime <0|1>"
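Until the tool accepts just the target value, a wrapper can treat the "Invalid old value" error as "already enabled". Below is a minimal Python sketch; it assumes the error text appears in the tool's output as quoted above and that VMwareService.exe is on the PATH in the guest. The run parameter is only there so the logic can be exercised without VMware Tools installed.

```python
import subprocess
from typing import Callable, List, Tuple

# Current value 0 (disabled) -> new value 1 (enabled), as in the command above.
ENABLE_SYNC = ["VMwareService.exe", "--cmd", "vmx.set_option synctime 0 1"]

Runner = Callable[[List[str]], Tuple[int, str]]


def run_command(argv: List[str]) -> Tuple[int, str]:
    """Run argv and return (exit code, combined stdout and stderr)."""
    result = subprocess.run(argv, capture_output=True, text=True)
    return result.returncode, result.stdout + result.stderr


def ensure_time_sync(run: Runner = run_command) -> str:
    """Enable VMware Tools time sync; treat 'Invalid old value' as already on."""
    code, output = run(ENABLE_SYNC)
    if "invalid old value" in output.lower():
        return "already enabled"
    if code != 0:
        raise RuntimeError(f"VMwareService.exe failed ({code}): {output.strip()}")
    return "enabled"
```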
In Active Directory environment, it's important to know:
All member machines synchronize with any domain controller.
In a domain, all domain controllers synchronize from the PDC Emulator (PDCe) of that domain.
The PDC Emulator of a domain should synchronize with its local clock or an external NTP source.
It's important to consider this when setting the time in VMware or configuring the time sync.
Extracted from: http://www.sysadmit.com/2016/12/vmware-esxi-configurar-hora.html
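I added the following job to crontab. It is hacky, but I think it should work.
*/5 * * * * service ntpd stop; ntpdate pool.ntp.org; service ntpd start
Every five minutes it stops the ntpd service, sets the clock once from pool.ntp.org with ntpdate, and starts ntpd again. Separating the commands with ; rather than && means ntpd is restarted even if ntpdate fails.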