I have a program that writes data to a MySQL database and also writes huge amounts of logs to a file. I have noticed that if I give it huge amounts of input data, i.e. data that produces logs as big as 70 GB and pushes the count(*) of the table I use above 1,000,000 rows, the whole program slows down after some time.
Initially the reports were collected at a rate of around 1000/min, but this drops below 400/min once the data reaches the size described above. Is it the database writes or the file writes that make the program slower?
The logs are just cout output from my program redirected to a file; no extra buffering is done there.
There's an easy way to test for this. If you create a table using the BLACKHOLE storage engine, MySQL will pretend to do everything but never actually write any data to disk.
Create blackhole table(s) with the same structure as your normal table(s), make a copy of the logs, and then write to the blackhole table(s) just as you would to the real ones. If it's much faster, it's MySQL that's giving you grief.
See: http://dev.mysql.com/doc/refman/5.5/en/blackhole-storage-engine.html
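If you want to drive the comparison from your C++ program, a rough timing harness along these lines could work. The connection details and the two-column schema are placeholders, not your real table; in practice you can mirror the real schema with CREATE TABLE ... LIKE followed by ALTER TABLE ... ENGINE=BLACKHOLE.

// Rough timing harness for the blackhole test, using the MySQL C API.
// Connection details and the table schema below are placeholders.
#include <mysql.h>            // may be <mysql/mysql.h> depending on your setup
#include <chrono>
#include <cstdio>

int main() {
    MYSQL *conn = mysql_init(nullptr);
    if (!mysql_real_connect(conn, "localhost", "user", "password", "testdb", 0, nullptr, 0)) {
        std::fprintf(stderr, "connect failed: %s\n", mysql_error(conn));
        return 1;
    }

    // Twin table that accepts writes but discards the data.
    mysql_query(conn, "CREATE TABLE IF NOT EXISTS reports_blackhole "
                      "(id INT, payload VARCHAR(255)) ENGINE=BLACKHOLE");

    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < 10000; ++i) {
        // Issue the same kind of INSERT your program normally issues.
        mysql_query(conn, "INSERT INTO reports_blackhole VALUES (1, 'sample report')");
    }
    auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(
                  std::chrono::steady_clock::now() - start).count();
    std::printf("10000 blackhole inserts took %lld ms\n", (long long)ms);

    mysql_close(conn);
}

Compare that timing against the same loop pointed at your real table; a large gap points at MySQL rather than the log file.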
I am looking for a way to save the result set of an ANALYZE COMPRESSION command to a table, or to join it to another table, in order to automate compression scripts.
Is that possible, and how?
You can always run ANALYZE COMPRESSION from an external program (a bash script is my go-to), read the results, and store them back in Redshift with INSERTs. This is usually the easiest and fastest approach when I run into these "no route from leader to compute node" issues on Redshift. These are often one-off scripts that don't need automation or support.
If I need something programmatic I'll usually write a Lambda function (or possibly a Python program on an EC2 instance). It's fairly easy and execution speed is high, but it does require an external tool, and some users are not happy running things outside the database.
If it needs to stay entirely inside Redshift, I write a stored procedure that keeps the results of the leader-only query in a cursor and then loops over the cursor, inserting the data into a table. It's basically the same as reading the data out and inserting it back in, except the data never leaves Redshift. This isn't too difficult, but it is slow to execute: looping over a cursor and inserting one row at a time is not efficient. The last one of these I did took 25 seconds for 1,000 rows. That was fast enough for the application, but if you need to do this on 100,000 rows you will be waiting a while. I've never done this with ANALYZE COMPRESSION before, so there could be some issue, but it's definitely worth a shot if this needs to be SQL-initiated.
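For the external-program route, here is a minimal C++ sketch using libpq (Redshift accepts connections over the PostgreSQL wire protocol). The connection string, the target table compression_results (assumed to have four text columns), and the assumption that ANALYZE COMPRESSION returns four columns per row are illustrative only; I haven't run this against a real cluster.

// Sketch: run the leader-only command on the client, then insert the rows back.
#include <libpq-fe.h>
#include <cstdio>
#include <string>

int main() {
    // Assumed connection details; port 5439 is the Redshift default.
    PGconn *conn = PQconnectdb("host=my-cluster.example.com port=5439 "
                               "dbname=dev user=admin password=secret");
    if (PQstatus(conn) != CONNECTION_OK) {
        std::fprintf(stderr, "connection failed: %s\n", PQerrorMessage(conn));
        return 1;
    }

    PGresult *res = PQexec(conn, "ANALYZE COMPRESSION my_table;");
    if (PQresultStatus(res) == PGRES_TUPLES_OK) {
        for (int i = 0; i < PQntuples(res); ++i) {
            // No escaping here; the ANALYZE COMPRESSION output contains no quotes.
            std::string sql = "INSERT INTO compression_results VALUES ('"
                + std::string(PQgetvalue(res, i, 0)) + "', '"
                + std::string(PQgetvalue(res, i, 1)) + "', '"
                + std::string(PQgetvalue(res, i, 2)) + "', '"
                + std::string(PQgetvalue(res, i, 3)) + "');";
            PGresult *ins = PQexec(conn, sql.c_str());
            PQclear(ins);
        }
    }
    PQclear(res);
    PQfinish(conn);
}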
I was recently asked a question in an interview; I hope someone can help me figure it out.
Suppose we have 100 files, and a process reads a file, parses it, and writes the data into a database.
Now let's say the process was at file number 60 and the power went off. How would you design a system such that, when the power comes back up, the process resumes writing data into the database where it left off before the shutdown?
This would be one way (a sketch in code follows below):
Loop over:
Pick up a file.
Check it hasn't already been processed, with a query to the database.
Process the file.
Write the parsed data to the database.
Record the processed file in a log table in the database.
Commit.
Move the file out of the non-processed queue.
You can also log the file entry to some other persistent resource.
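Here is a minimal sketch of that loop. SQLite and std::filesystem are used only as stand-ins for "the database" and "the queue"; the table names and the queue/done directories are assumptions.

// Assumes ./queue holds unprocessed files and ./done already exists.
#include <sqlite3.h>
#include <filesystem>
#include <string>

namespace fs = std::filesystem;

// Returns true if the file was already recorded in the processed_files log table.
static bool already_processed(sqlite3 *db, const std::string &name) {
    sqlite3_stmt *stmt = nullptr;
    sqlite3_prepare_v2(db, "SELECT 1 FROM processed_files WHERE name = ?", -1, &stmt, nullptr);
    sqlite3_bind_text(stmt, 1, name.c_str(), -1, SQLITE_TRANSIENT);
    bool found = (sqlite3_step(stmt) == SQLITE_ROW);
    sqlite3_finalize(stmt);
    return found;
}

int main() {
    sqlite3 *db = nullptr;
    sqlite3_open("work.db", &db);
    sqlite3_exec(db, "CREATE TABLE IF NOT EXISTS processed_files(name TEXT PRIMARY KEY);"
                     "CREATE TABLE IF NOT EXISTS parsed_data(file TEXT, line TEXT);",
                 nullptr, nullptr, nullptr);

    for (const auto &entry : fs::directory_iterator("queue")) {       // pick up a file
        const std::string name = entry.path().filename().string();
        if (already_processed(db, name))                              // skip files finished before the power cut
            continue;

        sqlite3_exec(db, "BEGIN", nullptr, nullptr, nullptr);
        // ... parse the file and insert its rows into parsed_data here ...
        sqlite3_stmt *log = nullptr;
        sqlite3_prepare_v2(db, "INSERT INTO processed_files(name) VALUES (?)", -1, &log, nullptr);
        sqlite3_bind_text(log, 1, name.c_str(), -1, SQLITE_TRANSIENT);
        sqlite3_step(log);
        sqlite3_finalize(log);
        sqlite3_exec(db, "COMMIT", nullptr, nullptr, nullptr);        // data and log commit atomically

        fs::rename(entry.path(), fs::path("done") / name);            // move it out of the queue
    }
    sqlite3_close(db);
}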
Q: What if there are many files? Doesn't writing to the logs slow down the process?
A: Probably not much; it's just one extra entry in the database per file. It's the cost of resilience.
Q: What if the files are so small it's almost only updating one row per file?
A: Make your update query idempotent. Don't log, but ensure that files are removed from the queue once the transaction is complete.
Q: What if there are many lines in a file? Do you really want to restart with the first line of a file?
A: It depends on the cost/benefit. You could split the file into smaller ones prior to processing each sub-file. If power outages happen all the time, that's a good compromise. If they happen very rarely, the extra work by the system may not be worth it.
Q: What if there is a mix of small and large files?
A: Put the files into separate queues that handle them accordingly.
The UPS idea by @Tim Biegeleisen is very good, though:
"Well actually it is about that, because unplugging a database in the middle of a lengthy transaction might result in corrupted data." – Tim Biegeleisen Feb 22 '20 at 10:24
I've experienced the failure of one such UPS, so you'll need two.
I think you must:
Store somewhere a reference to the file (an ID or index of the processed file, depending on the case).
Define the boundaries of a single transaction. Let it be the full processing of one file: read the file, parse it, store the data in the database, and update the reference to the file you processed. If all of that succeeds, you can commit the transaction to the database.
Your main task, which processes all the files, should look into the reference table and fetch the next file based on its state.
In this case you create a transaction around the processing of a single file. If anything goes wrong there, you can always rerun the processing job and it will start where it left off.
Please be aware that this is a very simple example; in most scenarios you want to keep transactions as thin as possible.
I need to keep reading data from a very large file in a streaming style (like a database: we keep querying data from that file). Is it good practice to keep the file open, or to open it when needed and then close it every time?
Assume the query is frequent.
If I keep it open, is there anything dangerous I need to pay attention to?
ADD:
To clarify my use case:
Only one process will keep querying data from that file. The file is about 5 GB, the query frequency is about once per second, and each query fetches a chunk of about 10 MB from the file.
The file won't change. We only do reading operation.
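For what it's worth, a bare-bones version of the "open once, seek per query" pattern could look like this; the file name and chunk size are just taken from the numbers above.

#include <fstream>
#include <vector>
#include <cstdint>

class ChunkReader {
public:
    explicit ChunkReader(const char *path)
        : file_(path, std::ios::binary) {}            // opened once, kept open

    // Reads `size` bytes starting at `offset`; returns fewer bytes at EOF.
    std::vector<char> read(std::uint64_t offset, std::size_t size) {
        std::vector<char> buf(size);
        file_.clear();                                // clear any EOF flag from a previous read
        file_.seekg(static_cast<std::streamoff>(offset));
        file_.read(buf.data(), static_cast<std::streamsize>(size));
        buf.resize(static_cast<std::size_t>(file_.gcount()));
        return buf;
    }

private:
    std::ifstream file_;
};

int main() {
    ChunkReader reader("big_data.bin");               // ~5 GB read-only file
    auto chunk = reader.read(0, 10 * 1024 * 1024);    // ~10 MB per query, about once a second
    (void)chunk;
}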
Is there a way to Import an entire CSV file into SQLite through the C Interface?
I'm aware of the command-line import that looks like this,
sqlite> .mode csv
sqlite> .import <filename> <table>
but I need to be able to do this in my program.
I should also note that I have successfully created a CSV reader in C++ that reads in a CSV file and inserts its content to a table line by line.
This gets the job done, but with a CSV containing 730k lines this method takes ~20 minutes to load, which is WAY too long. (This is going to be around the average size of the files being processed.)
(Machine: Intel(R) Core(TM)2 Duo CPU E8500 @ 3.16 GHz, 4.0 GB RAM, Windows 7 64-bit, Visual Studio 2010)
This is unacceptable for my project so I need a faster way, something taking around 2-3 minutes.
Is there a way to reference the file's memory location so the import isn't necessary? If so, is access to the information slow?
Can SQLite take the CSV file as binary data? Would this make importing the file any faster?
Ideas?
Note: I'm using the ":memory:" option with the C Interface to load the DB in memory to increase speed (I hope).
EDIT
After doing some more optimizing I found this. It explains how you can group INSERT statements into one transaction, like so:
BEGIN TRANSACTION;
INSERT INTO TABLE VALUES(...);
... a million more INSERT statements ...
INSERT INTO TABLE VALUES(...);
COMMIT;
This created a HUGE improvement in performance.
Useful Related Side Note
Also, if you're looking to create a table from a query's results, or to insert query results into a table, try this for creating tables or this for inserting results into a table.
The insert link might not be obvious for inserting into a table. The query to do that looks like this.
INSERT INTO [TABLE] [QUERY]
where [TABLE] is the table that the results of [QUERY], the query you're running, should go into.
I have successfully created a CSV reader in C++ that reads in a CSV file and inserts its content to a table line by line... takes ~20 minutes to load
Put all your inserts into a single transaction - or at least batch up 100 or 1000 rows per transaction - and I would expect your program to run much faster.
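A minimal sketch of that advice through the SQLite C interface might look like the following. The three-column table and the naive comma splitting are assumptions; plug in your own schema and your existing CSV reader.

#include <sqlite3.h>
#include <fstream>
#include <sstream>
#include <string>

int main() {
    sqlite3 *db = nullptr;
    sqlite3_open(":memory:", &db);                        // in-memory DB, as in the question
    sqlite3_exec(db, "CREATE TABLE t(a TEXT, b TEXT, c TEXT);", nullptr, nullptr, nullptr);

    // One prepared statement reused for every row, one transaction for the whole file.
    sqlite3_stmt *stmt = nullptr;
    sqlite3_prepare_v2(db, "INSERT INTO t VALUES (?,?,?)", -1, &stmt, nullptr);
    sqlite3_exec(db, "BEGIN TRANSACTION", nullptr, nullptr, nullptr);

    std::ifstream csv("data.csv");
    std::string line;
    while (std::getline(csv, line)) {
        std::istringstream fields(line);
        std::string a, b, c;
        std::getline(fields, a, ',');
        std::getline(fields, b, ',');
        std::getline(fields, c);                          // naive split: no quoted commas handled

        sqlite3_bind_text(stmt, 1, a.c_str(), -1, SQLITE_TRANSIENT);
        sqlite3_bind_text(stmt, 2, b.c_str(), -1, SQLITE_TRANSIENT);
        sqlite3_bind_text(stmt, 3, c.c_str(), -1, SQLITE_TRANSIENT);
        sqlite3_step(stmt);
        sqlite3_reset(stmt);                              // reuse the statement for the next row
    }

    sqlite3_exec(db, "COMMIT", nullptr, nullptr, nullptr);  // one commit for all 730k rows
    sqlite3_finalize(stmt);
    sqlite3_close(db);
}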
I'm working on a Qt GUI for visualizing 'live' data which is received via a TCP/IP connection. The issue is that the data is arriving rather quickly (a few dozen MB per second) - it's coming in faster than I'm able to visualize it even though I don't do any fancy visualization - I just show the data in a QTableView object.
As if that's not enough, the GUI also allows pressing a 'Freeze' button which suspends updating the GUI (but it keeps receiving data in the background). As soon as the Freeze option is disabled, the data that has accumulated in the background should be visualized.
What I'm wondering is: since the data is coming in so quickly, I can't possibly hold all of it in the memory. The customer might even keep the GUI running over night, so gigabytes of data will accumulate. What's a good data storage system for writing this data to disk? It should have the following properties:
It shouldn't be too much work to use it on a desktop system
It should be fast at appending new data at the end. I never need to touch previously written data anymore, so writing into anywhere but the end is not needed.
It should be possible to randomly access records in the data. This is because scrolling around in my GUI will make it necessary to quickly display the N to N+20 (or whatever the height of my table is) entries in the data stream.
The data which is coming in can be separated into records, but unfortunately the records don't have a fixed size. I'd rather not impose a maximum size on them (at least not if it's possible to get good performance without doing so).
Maybe some SQL database, or something like CouchDB? It would be great if somebody could share their experience with such scenarios.
I think that SQLite might do the trick. It seems to be fast. Unfortunately I haven't had a data flow like yours, but it works well as the backend for a log recorder; I have a GUI where you can view logs n to n+k.
You can also try SOCI as a C++ database access API; it seems to work fine with SQLite (I have not used it yet but plan to).
my2c
I would recommend a simple file based solution.
If you can use fixed-size records: if you get the data continuously at a constant sample rate, random access to the data is easy and very fast when you know the timestamp of the first data point and the sample rate. If the sample rate varies, then write a timestamp with each data point; random access then requires a binary search, but it is still fast enough.
If you have variable-size records: write the variable-size data to one file, and to another file write fixed-size index entries (offsets) pointing into the data file. If the sample rate varies, write timestamps too. Now you can do fast random access using the index file.
If you are using Qt to implement this kind of solution, you need two sets of QFile and QDataStream instances, one for writing and one for reading.
And a note about performance: don't flush the file after every data point write. But remember to flush the file before doing any random access to it.
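A bare-bones sketch of the variable-size-record scheme could look like this, with separate QFile instances for writing and reading as suggested above. The class name and file layout are illustrative assumptions.

#include <QFile>
#include <QDataStream>
#include <QByteArray>

class RecordStore {
public:
    RecordStore(const QString &dataPath, const QString &indexPath)
        : wData_(dataPath), wIndex_(indexPath), rData_(dataPath), rIndex_(indexPath) {
        wData_.open(QIODevice::WriteOnly | QIODevice::Append);
        wIndex_.open(QIODevice::WriteOnly | QIODevice::Append);
        rData_.open(QIODevice::ReadOnly);
        rIndex_.open(QIODevice::ReadOnly);
        nextOffset_ = wData_.size();                 // resume appending after existing data
    }

    // Append one variable-size record; its start offset becomes a fixed-size index entry.
    void append(const QByteArray &record) {
        QDataStream idx(&wIndex_);
        idx << nextOffset_;                          // 8-byte offset per record
        QDataStream out(&wData_);
        out << record;                               // QDataStream writes a 4-byte length prefix
        nextOffset_ += qint64(sizeof(quint32)) + record.size();
    }

    // Random access: look up record n's offset in the index file, then read the record.
    QByteArray read(qint64 n) {
        wData_.flush();                              // flush before random access, as noted above
        wIndex_.flush();
        rIndex_.seek(n * qint64(sizeof(qint64)));
        qint64 offset = 0;
        QDataStream idx(&rIndex_);
        idx >> offset;
        rData_.seek(offset);
        QDataStream in(&rData_);
        QByteArray record;
        in >> record;
        return record;
    }

private:
    QFile wData_, wIndex_, rData_, rIndex_;
    qint64 nextOffset_ = 0;
};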