How does the PDI Kettle table input step load the data? - kettle

I am new to Kettle. I want to know if the table input step will load all the data from the table into memory at once. If yes, how do I avoid a memory overflow?

If it worked like that how would people use it in a big data scenario? :) Don't worry it will be fine. PDI doesn't the data into local memory if it can be avoided. That's the same for all input steps. It creates a cursor (like a stream) and just gets the data you work currently on and then releases it from memory when it reaches the end of the pipeline.

Related

Dynamic spool space allocation in teradata

I have huge data and am importing it from teradata to hdfs. While doing so, usually spool space is insufficient and the job thus fails. Is there any solution for this? Can spool space be allocated dynamically as per data size? Or can we load the data from sqoop import into a temporary buffer and then write it to hdfs?
If you're running out of SPOOL, it could be any of these scenarios:
Incorrectly written query (i.e. unintended CROSS JOIN)
Do an EXPLAIN on your query and check for a product join or anything that looks like it will take a long time
Inefficient query plan
Run an EXPLAIN and see if there are any long estimates. Also, you can try DIAGNOSTIC HELPSTATS ON FOR SESSION. When you enable this flag, any time you run an EXPLAIN, at the bottom you will get a bunch of recommended statistics to collect. Some of these suggestions may be useful
Tons of data
Not much you can do here. Maybe try to do the import in batches.
Also, you can check to see what the MaxSpool parameter is for the user running the query. You could try to increase the MaxSpool value to see if that helps. Keep in mind, the actual spool available will be capped by the amount of unallocated PERM space.

Why is loading dashDB analytics by trickle feed a bad idea?

I have a use case where I continuously need to trickle feed data into dashDB, however I have been informed that this is not optimal for dashDB.
Why is this not optimal? Is there a workaround?
Columnar warehouses are great for reads, but if you insert a single row into an N column table then the system has to cut the row into pieces and do N separate writes to disk. This makes small inserts relatively inefficient and things can slow down as a result.
You may want to do an initial batch load of data. Currently the compression dictionary is built only for bulk loads, so if you start with a new table and populate it only using inserts then the data doesn't get compressed at all.
Try to structure the loading into microbatches with a 2-5 minute load cycle.
What is the use case here? Check if dashDB Transactional can solve your need. DashDB transactional is tuned for OLTP and point of sale transactions which is what you are trying to feed.

Solution to work disk space not enough in sas

I have more than 50 tables running in work. Before, it worked well.
But recently, there are some errors like:
ERROR: An I/O error has occurred on file
WORK.'SASTMP-000000030'n.UTILITY. ERROR: File
WORK.'SASTMP-000000030'n.UTILITY is damaged. I/O processing did not
complete. NOTE: Error was encountered during utility-file processing.
You may be able to execute the SQL statement successfully if you
allocate more space to the WORK library. ERROR: There is not enough WORK disk space to store the results of an internal sorting
phase. ERROR: An error has occurred.
Does anyone know how to solve this error?
Your disk is full. If this is running on a server, ask your system administrator to investigate the problem.
If this is your desktop, find and delete un-needed files to free up space.
Clean out old SAS Work Folders
Often, old SAS Work folders do not get cleared when SAS closes. You can get back a lot of disk space by going to the path defined for SAS Work, and deleting all the old folders.
In SAS
%put %sysfunc(pathname(work));
will show you where the current WORK library is located. One level up is where all SAS Work folders are created.
On my system, that returns:
C:\Users\dpazzula\AppData\Local\Temp\SAS Temporary Files\_TD9512_GXM2L12-PAZZULA_
That means that I should look in "C:\Users\dpazzula\AppData\Local\Temp\SAS Temporary Files\" to find old folders to delete.
Your work space is full.
Your SAS server uses a dedicated directory where all SAS sessions store their temporary files: All files in the work libraries, as well as temp files as used while sorting, joining etc.
Solutions:
Have more space allocated.
Make certain only to put necessary files into work/ clean up/ close old sessions.
Run less processes.
Replace interim datasets with views instead, especially if you're using large source datasets :
data master /view=master ;
set lib.monthlydata20: ; /* all datasets since Jan 2000 */
run ;
proc sql ;
create table want as
select *
from master
where ID in(select ID from lookup) ;
quit ;
try to compress all datasets using this option
OPTIONS COMPRESS=YES REUSE=YES;
this should be in the very beginning of your code. it will compress all datasets by nearly 98%.It will also make your code run faster. It will consume more CPU but will decrease size.
In some cases, this might not help if the compressed data sets exceed the hard disk space.
Also, change your work directory to the biggest drive that has disk space.
Study your code.
Create a Data Flow Diagram to determine WHEN each file is created, where it is used downstream. Find out when a data set is no longer needed and DELETE it. If you have 50 data sets, chances are numerous data sets are 'value-added' by a subsequent step, and can go away freeing up your work space. A cute trick is to REUSE some of the data set names - to keep the number of unneeded data sets in check.
Rule of thumb: leave the environment the way you found it - if there were no files in WORK to start, manually clean up after yourself. Unless it is a Stored Process, which starts a completely new SAS job, and will clean up after itself upon completion of the job.

How to optimize writing this data to a postgres database

I'm parsing poker hand histories, and storing the data in a postgres database. Here's a quick view of that:
I'm getting a relatively bad performance, and parsing files will take several hours. I can see that the database part takes 97% of the total program time. So only a little optimization would make this a lot quicker.
The way I have it set-up now is as follows:
Read next file into a string.
Parse one game and store it into object GameData.
For every player, check if we have his name in the std::map. If so; store the playerids in an array and go to 5.
Insert the player, add it to the std::map, store the playerids in an array.
Using the playerids array, insert the moves for this betting round, store the moveids in an array.
Using the moveids array, insert a movesequence, store the movesequenceids in an array.
If this isn't the last round played, go to 5.
Using the movesequenceids array, insert a game.
If this was not the final game, go to 2.
If this was not the last file, go to 1.
Since I'm sending queries for every move, for every movesequence, for every game, I'm obviously doing too many queries. How should I bundle them for best performance? I don't mind rewriting a bit of code, so don't hold back. :)
Thanks in advance.
CX
It's very hard to answer this without any queries, schema, or a Pg version.
In general, though, the answer to these problems is to batch the work into bigger coarser batches to avoid repeating lots of work, and, most importantly, by doing it all in one transaction.
You haven't said anything about transactions, so I'm wondering if you're doing all this in autocommit mode. Bad plan. Try wrapping the whole process in a BEGIN and COMMIT. If it's a seriously long-running process the COMMIT every few minutes / tens of games / whatever, write a checkpoint file or DB entry your program can use to resume the import from that point, and open a new transaction to carry on.
It'll help to use multi-valued inserts where you're inserting multiple rows to the same table. Eg:
INSERT INTO some_table(col1, col2, col3) VALUES
('a','b','c'),
('1','2','3'),
('bork','spam','eggs');
You can improve commit rates with synchronous_commit=off and a commit_delay, but that's not very useful if you're batching work into bigger transactions.
One very good option will be to insert your new data into UNLOGGED tables (PostgreSQL 9.1 or newer) or TEMPORARY tables (all versions, but lost when session disconnects), then at the end of the process copy all the new rows into the main tables and drop the import tables with commands like:
INSERT INTO the_table
SELECT * FROM the_table_import;
When doing this, CREATE TABLE ... LIKE is useful.
Another option - really a more extreme version of the above - is to write your results to CSV flat files as you read and convert them, then COPY them into the database. Since you're working in C++ I'm assuming you're using libpq - in which case you're hopefully also using libpqtypes. libpq offers access to the COPY api for bulk-loading, so your app wouldn't need to call out to psql to load the CSV data once it'd produced it.

What portable data backends are there which have fast append and random access?

I'm working on a Qt GUI for visualizing 'live' data which is received via a TCP/IP connection. The issue is that the data is arriving rather quickly (a few dozen MB per second) - it's coming in faster than I'm able to visualize it even though I don't do any fancy visualization - I just show the data in a QTableView object.
As if that's not enough, the GUI also allows pressing a 'Freeze' button which will suspend updating the GUI (but it will keep receiving data in the background). As soon as the Freeze option was disabled, the data which has been accumulated in the background should be visualized.
What I'm wondering is: since the data is coming in so quickly, I can't possibly hold all of it in the memory. The customer might even keep the GUI running over night, so gigabytes of data will accumulate. What's a good data storage system for writing this data to disk? It should have the following properties:
It shouldn't be too much work to use it on a desktop system
It should be fast at appending new data at the end. I never need to touch previously written data anymore, so writing into anywhere but the end is not needed.
It should be possible to randomly access records in the data. This is because scrolling around in my GUI will make it necessary to quickly display the N to N+20 (or whatever the height of my table is) entries in the data stream.
The data which is coming in can be separated into records, but unfortunately the records don't have a fixed size. I'd rather not impose a maximum size on them (at least not if it's possible to get good performance without doing so).
Maybe some SQL database, or something like CouchDB? It would be great if somebody could share his experience with such scenarios.
I think that sqlite might do the trick. It seems to be fast. Unfortunately, I have no data flow like yours, but it works well as a backend for a log recorder. I have a GUI where you can view the n, n+k logs.
You can also try SOCI as a C++ database access API, it seems to work fine with sqlite (I have not used it for now but plan to).
my2c
I would recommend a simple file based solution.
If you can use fixed size records: If the you get the data continuously with constant sample rate, random access to data is easy and very fast when you know the time stamp of first data point and the sample rate. If the sample rate varies, then write time stamp with each data point. Now random access requires binary search, but it is still fast enough.
If you have variable size records: Write the variable size data to one file and to other file write indexes (which are fixed size) to the data file. And if the sample rate varies, write time stamps too. Now you can do the random access fast using the index file.
If you are using Qt to implement this kind of solution, you need two sets of QFile and QDataStream instances, one for writing and one for reading.
And a note about performance: don't flush the file after every data point write. But remember to flush the file before doing any random access to it.