MemSQL server takes up large disk space

I have a MemSQL server that holds roughly 1 MB of data across all its tables, and its memory usage is low and acceptable. But the problem is that the server has recently run out of disk space.
Looking into it in more detail, I found the following disk space usage:
52G /var/lib/memsql/master-3306/plancache/
34G /var/lib/memsql/leaf-3307/plancache/
These directories contain many files like the following:
Select_digin_componentheader_72f4f4454iusudf9sdf98tkjnm432436thss4dgfdhjfd45tr44.cc
Select_digin_componentheader_72f4f4454iusudf9sdf98tkjnm432436thss4dgfdhjfd45tr44.ok
Select_digin_componentheader_72f4f4454iusudf9sdf98tkjnm432436thss4dgfdhjfd45tr44.so
Here digin_componentheader is one of our table names, and similar files exist for the other tables as well. We do many insertions, but we also delete the rows after 5 minutes; that is the only workload we have been running for 6 months.
We have not changed any configuration; the MemSQL server is running with all the defaults.
Can anybody explain why it needs 90 GB to store this? Or is it a configuration issue?

MemSQL won't delete the files in the plancache on its own; they're required to run queries. Old or stale files can accumulate over time for queries you may no longer be running (or after ALTER TABLE operations). You can delete the contents of the plancache directory when the server is stopped (don't do it while MemSQL is running). MemSQL will rebuild any files it needs the first time a query is run after you restart it. Be aware that this will make the first execution of your queries after the restart slower (code-generation cost; this is much faster in MemSQL 5).
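As a minimal sketch of that cleanup step, assuming the node layout from the question (master-3306 and leaf-3307) and that both MemSQL nodes have already been stopped:

#!/usr/bin/env python3
# Clear MemSQL plancache directories.
# Run ONLY while the MemSQL nodes are stopped; MemSQL regenerates the
# compiled query plans the first time each query runs after restart.
# The paths are the node layout from the question; adjust for your install.
import os
import shutil

PLANCACHE_DIRS = [
    "/var/lib/memsql/master-3306/plancache",
    "/var/lib/memsql/leaf-3307/plancache",
]

for cache_dir in PLANCACHE_DIRS:
    if not os.path.isdir(cache_dir):
        continue
    for entry in os.listdir(cache_dir):
        path = os.path.join(cache_dir, entry)
        # Remove both the generated source/object files and any subdirectories.
        if os.path.isdir(path):
            shutil.rmtree(path)
        else:
            os.remove(path)
    print(f"Cleared {cache_dir}")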

These are the generated and compiled code files that MemSQL creates for every query. Based on the contents of the plancache directory, you are running MemSQL 4. I recommend upgrading to MemSQL 5, which does not use g++ to compile queries.

Related

Out of space: SAS Studio engine V9 on Mac

I was practicing with a dataset of more than 100K rows, and SAS UE shows an out-of-space error when I try to run statistical analysis. After some searching I found suggestions such as extending the disk space of the VM and cleaning the work libraries (I did clean the work library using "proc datasets library=WORK kill; run; quit;", but the issue remains the same). However, I am not sure how to increase the disk space, or how to redirect the work library to local storage on my Mac. I have not found or understood any clear guidelines from searching. Please help.
You can set the VM to use 2 cores and increase the RAM in the Oracle VirtualBox settings. You cannot increase the size of the VM, and 100K rows should not be problematic unless you're not cleaning up after your processes.
Yes, SAS UE does have a tendency not to clean up after crashes, so if you've crashed it multiple times you'll eventually have to reinstall to clean up. You can get around this by reassigning the work library. A quick way to do this is, in the projects that will be affected, to set the USER library to your myfolders directory or another location on your computer:
libname user '/folders/myfolders/tempWSpace';
Make sure you create the folder under myfolders first. Then any single-level data set (one referenced with no libname) will automatically be stored in the USER library, and you should be able to run your code.

How should I parallelize a mix of CPU- and network-intensive tasks (in Celery)?

I have a job that scans a network file system (which can be remote), pulls many files, runs a computation on them, and pushes the results (per file) into a DB. I am in the process of moving this to Celery so that it can be scaled up. The number of files can get really huge (1M+).
I am not sure what design approach to take, specifically:
Uniform "end2end" tasks
A task gets a batch (a list of N files), pulls them, computes, and uploads the results; a rough sketch of this option is included at the end of the question.
(Using batches rather than individual files is meant to optimize the connections to the remote file system and the DB, although it is purely a heuristic at this point.)
Clearly, a task would spend a large part of its time waiting for I/O, so we'll need to play with the number of worker processes (much more than the number of CPUs) so that enough tasks are running (computing) concurrently.
Pro: simple design, easier coding and control.
Con: the process pool size will probably need to be tuned individually per installation, since it depends on the environment (network, machines, etc.).
Split into dedicated smaller tasks
download, compute, upload (again, batches).
This option is intuitively appealing, but I don't actually see the advantage.
I'd be glad to get some references to tutorials on concurrency design, as well as design suggestions.
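For concreteness, here is a minimal sketch of what the "uniform end2end" option could look like. The Celery app, the broker URL, and the fetch_file/compute/store_result helpers are all illustrative stand-ins for the real remote pull, computation, and DB upload, not a tested design.

# celery_tasks.py -- sketch of the "uniform end2end" option.
from celery import Celery

app = Celery("scanner", broker="redis://localhost:6379/0")

def fetch_file(path):
    # Hypothetical stand-in for pulling a file from the remote file system.
    with open(path, "rb") as f:
        return f.read()

def compute(data):
    # Hypothetical stand-in for the per-file computation.
    return len(data)

def store_result(results):
    # Hypothetical stand-in for the DB upload.
    print("stored", len(results), "results")

@app.task(bind=True, acks_late=True)
def process_batch(self, file_paths):
    # Pull a batch of files, compute on each, and push the results.
    results = []
    for path in file_paths:
        data = fetch_file(path)        # network-bound
        results.append(compute(data))  # CPU-bound
    store_result(results)              # one DB round trip per batch
    return len(results)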
How long does it take to scan the network file system, compared to the per-file computation?
What does the hierarchy of the remote file system look like? Are the files evenly distributed? How can you use this to your advantage?
I would follow a process like this:
1. In one process, list the first two levels of the root remote target folder.
2. For each of the discovered folders, spin up a separate Celery process that further lists the contents of those folders. You may also want to save the locations of the discovered files, just in case things go wrong.
3. After you have listed the contents of the remote file system, and all the Celery processes that list files have terminated, you can go into processing mode.
4. You may want to keep listing files with 2 processes and use the rest of your cores to start doing the per-file work.
NB: Before doing everything in Python, I would also investigate how shell tools like xargs and find work together for remote file discovery. xargs lets you spin up multiple C processes that do what you want, so it might be the most efficient way to do the remote file discovery, after which you can pipe everything to your Python code.
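As a rough sketch of steps 1-3 above, assuming the remote file system is reachable through a local mount point and a Celery app is already configured (the broker URL and folder handling below are placeholders):

# discovery.py -- sketch of the two-level discovery step.
import os
from celery import Celery

app = Celery("discovery", broker="redis://localhost:6379/0")

@app.task
def list_folder(folder):
    # List one discovered folder and return the file paths found in it.
    found = []
    for root, _dirs, files in os.walk(folder):
        found.extend(os.path.join(root, name) for name in files)
    return found  # could also be appended to a checkpoint file, in case things go wrong

def discover(root):
    # List the first two levels in-process, then fan out one listing task per folder.
    second_level = []
    for name in os.listdir(root):
        first = os.path.join(root, name)
        if os.path.isdir(first):
            second_level.extend(
                os.path.join(first, sub)
                for sub in os.listdir(first)
                if os.path.isdir(os.path.join(first, sub))
            )
    return [list_folder.delay(folder) for folder in second_level]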
Instead of Celery, you can write a simple Python script that runs k * cpu_count threads just to connect to the remote servers and fetch files, without Celery.
Personally, I have found that a k value between 4 and 7 gives better results in terms of CPU utilization for I/O-bound tasks. Depending on the number of files produced, or the rate at which you want to consume them, you can choose a suitable number of threads.
Alternatively, you can use Celery + gevent, or Celery with threads, if your tasks are I/O-bound.
For the computation and the DB updates you can use Celery, so that you can scale dynamically as your requirements change. If you have too many tasks at a time that need a DB connection, you should use DB connection pooling for the workers.
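A minimal sketch of the threaded fetcher half of this suggestion, assuming the remote files can be read through a mounted path; the value of k and the queue hand-off are placeholders to tune for your environment:

# fetcher.py -- plain-thread downloader feeding a queue, no Celery involved.
import os
import queue
from concurrent.futures import ThreadPoolExecutor

K = 5  # somewhere between 4 and 7, as suggested above
NUM_THREADS = K * (os.cpu_count() or 1)

results = queue.Queue()

def fetch(path):
    # I/O-bound: read one remote file and hand it to the consumers.
    with open(path, "rb") as f:
        results.put((path, f.read()))

def fetch_all(paths):
    with ThreadPoolExecutor(max_workers=NUM_THREADS) as pool:
        pool.map(fetch, paths)
    results.put(None)  # sentinel: no more files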

MySQL crash on DROP FUNCTION

I have created a UDF through the CREATE FUNCTION command, and now when I try to drop it the server crashes. According to the docs, this is a known issue:
To upgrade the shared library associated with a UDF, issue a DROP FUNCTION statement, upgrade the shared library, and then issue a CREATE FUNCTION statement. If you upgrade the shared library first and then use DROP FUNCTION, the server may crash.
It does, indeed, crash, and afterwards any attempt to remove the function crashes, even if I completely remove the DLL from the plugin directory. During development I'm continually replacing the library that defines the UDF functions. I've already re-installed MySQL from scratch once today and would rather not do it again. Aside from being more careful, is there anything I can do to e.g. clean up the mysql.* tables manually so as to remove the function?
Edit: after some tinkering, the database seems to have settled into a pattern of crashing until I have removed the offending DLL, and after that issuing Error Code: 1305: FUNCTION [schema].[functionName] does not exist. If I attempt to drop the function as root, I get the same message but without the schema prefix.
SELECT * from mysql.func shows the function. If I remove the record by hand, I get the same 1305 error.
Much of the data in the system tables in the mysql schema is cached in memory on first touch. After that, modifying the tables by hand may not have the expected effect unless the server is restarted.
For the grant tables, a mechanism for flushing any cached data is provided -- FLUSH PRIVILEGES -- but for other tables, like func and the time zone tables, the only certain way to ensure that manual changes are all taken into consideration is to restart the server process.
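For illustration, a minimal sketch of that manual cleanup using the mysql-connector-python driver; the credentials and the UDF name my_udf are placeholders, and the DELETE only takes full effect after mysqld is restarted, as noted above:

# cleanup_udf.py -- remove a stale UDF registration by hand.
import mysql.connector

conn = mysql.connector.connect(user="root", password="secret", host="localhost")
cur = conn.cursor()
cur.execute("DELETE FROM mysql.func WHERE name = %s", ("my_udf",))
conn.commit()
cur.close()
conn.close()
# The func table is cached in memory on first touch, so restart mysqld
# afterwards for the manual change to be picked up.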

Why can't sqlite3 work with NFS?

I switched to sqlite3 instead of MySQL because I had to run many jobs on a PBS system which doesn't have MySQL. Of course, on my machine I do not have NFS, while the PBS system does. After spending lots of time switching to sqlite3, I went to run many jobs and corrupted my database.
Of course, the sqlite3 FAQ does mention NFS, but I didn't even think about this when I started.
I can copy the database at the beginning of the job but it will turn into a merging nightmare!
I would never recommend sqlite to any of my colleagues for this simple reason: "sqlite doesn't work (on the machines that matter)"
I have read rants about NFS not being up to par and it being their fault.
I have tried a few workarounds, but as this post suggests, it is not possible.
Isn't there a workaround which sacrifices performance?
So what do I do? Try some other db software? Which one?
You are using the wrong tool. Saying "I would never recommend sqlite ..." based on this experience is a bit like saying "I would never recommend glass bottles" after they keep breaking when you use them to hammer in a nail.
You need to specify your problem more precisely. My attempt to read between the lines of your question gives me something like this:
You have many nodes that get work through some unspecified path, and produce output. The jobs do not interact because you say you can copy the database. The output from all the jobs can be merged after they are finished. How do you effectively produce the merged output?
Given that as the question, this is my advice:
Have each job produce its output in a structured file, unique to each job. After the jobs are finished, write a program to parse each file and insert it into an sqlite3 database. This uses NFS in a way it can handle (a single process writing sequentially to a file) and uses sqlite3 in a way that is also sensible (a single process writing to a database on a local filesystem). This avoids NFS locking issues while the jobs run, and should improve throughput because there is no contention on the sqlite3 database.
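As a minimal sketch of the merge step, assuming each job wrote its results to a two-column CSV file under output/; the table layout and file pattern are placeholders:

# merge_results.py -- single process, local filesystem, so sqlite3 is happy.
import csv
import glob
import sqlite3

conn = sqlite3.connect("merged.db")  # local disk, not NFS
conn.execute("CREATE TABLE IF NOT EXISTS results (job TEXT, key TEXT, value TEXT)")

for path in glob.glob("output/job_*.csv"):
    with open(path, newline="") as f:
        rows = [(path, key, value) for key, value in csv.reader(f)]
    conn.executemany("INSERT INTO results VALUES (?, ?, ?)", rows)

conn.commit()
conn.close()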

How does rsync behave for concurrent file access?

I'm using rsync to back up my machine twice a day, and the ten to fifteen minutes it spends scanning my files for modifications, slowing everything down considerably, are starting to get on my nerves.
Now I'd like to use the inotify interface of my kernel (I'm running Linux) to write a small background app that collects notifications about modified files and adds their pathnames to a list which is then processed regularly by a call to rsync.
Now, because this process by definition always works on files I've just been - and might still be - working on, I'm wondering whether I'll get loads of corrupted / partially updated files in my backup as rsync copies the files while I'm writing to them.
I couldn't find anything in the man page and have so far been unsuccessful in searching for the answer. I could go read the source, but that might take quite a while. Does anybody know how concurrent file access is handled inside rsync?
It's not handled at all: rsync opens the file, reads as much as it can and copies that over.
So it depends on how your applications handle this: do they rewrite the file in place (not creating a new one), or do they create a temp file and rename it once all the data has been written (as they should)?
In the first case, there is little you can do: if two processes access the same data without any kind of synchronization, the result will be a mess. What you could do is defer the rsync for N minutes, assuming that the writing process will finish before then. Reschedule the file if it changes again within this time limit.
In the second case, you must tell rsync to ignore temp files (*.tmp, *~, etc.).
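For the applications you control yourself, a minimal sketch of that temp-file-and-rename pattern (the .tmp suffix matches the kind of exclude pattern mentioned above):

# atomic_write.py -- write to a temp file, then rename it into place.
import os

def atomic_write(path, data: bytes):
    tmp = path + ".tmp"          # rsync can be told to ignore *.tmp
    with open(tmp, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())     # make sure the bytes hit the disk
    os.replace(tmp, path)        # atomic rename on POSIX filesystems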
It isn't handled in any way. If it is a problem, you can use e.g. LVM snapshots, and take the backup from the snapshot. That won't in itself guarantee that the files will be in a usable state, but it does guarantee that, as the name implies, it's a snapshot at a specific time.
Note that this doesn't have anything to do with whether you let rsync handle the change detection itself or use your own app. Your app, or rsync itself, just produces a list of files that have changed, and then for each file the rsync binary diff algorithm is run. The problem arises if the file is changed while the rsync algorithm runs, not while the file list is being produced.