Faster bulk inserts in sqlite3? - c++

I have a file of about 30000 lines of data that I want to load into a sqlite3 database. Is there a faster way than generating insert statements for each line of data?
The data is space-delimited and maps directly to an sqlite3 table. Is there any sort of bulk insert method for adding volume data to a database?
Has anyone devised some deviously wonderful way of doing this if it's not built in?
I should preface this by asking, is there a C++ way to do it from the API?

Wrap all INSERTs in a single transaction; even if there's a single user, it's far faster.
Use prepared statements.
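Since the question asks about doing this from the C++ API, here is a minimal sketch combining both tips via the C API (callable from C++). The table name, column layout, and line parsing are assumptions for illustration, not taken from the question, and error handling is kept minimal.

#include <sqlite3.h>
#include <fstream>
#include <sstream>
#include <string>

// Hypothetical loader: assumes a table mytable(col1 INT, col2 INT) and a
// space-delimited input file; adjust the schema and parsing to your data.
bool load_file(sqlite3* db, const char* path) {
    sqlite3_stmt* stmt = nullptr;
    if (sqlite3_prepare_v2(db, "INSERT INTO mytable (col1, col2) VALUES (?, ?);",
                           -1, &stmt, nullptr) != SQLITE_OK)
        return false;

    sqlite3_exec(db, "BEGIN TRANSACTION;", nullptr, nullptr, nullptr);

    std::ifstream in(path);
    std::string line;
    while (std::getline(in, line)) {
        std::istringstream fields(line);
        int col1 = 0, col2 = 0;
        fields >> col1 >> col2;

        sqlite3_bind_int(stmt, 1, col1);   // bind this row's values
        sqlite3_bind_int(stmt, 2, col2);
        sqlite3_step(stmt);                // execute the INSERT
        sqlite3_reset(stmt);               // reuse the same compiled statement
    }

    sqlite3_exec(db, "COMMIT;", nullptr, nullptr, nullptr);
    sqlite3_finalize(stmt);
    return true;
}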

You want to use the .import command. For example:
$ cat demotab.txt
44 92
35 94
43 94
195 49
66 28
135 93
135 91
67 84
135 94
$ echo "create table mytable (col1 int, col2 int);" | sqlite3 foo.sqlite
$ echo ".import demotab.txt mytable" | sqlite3 foo.sqlite
$ sqlite3 foo.sqlite
-- Loading resources from /Users/ramanujan/.sqliterc
SQLite version 3.6.6.2
Enter ".help" for instructions
Enter SQL statements terminated with a ";"
sqlite> select * from mytable;
col1 col2
44 92
35 94
43 94
195 49
66 28
135 93
135 91
67 84
135 94
Note that this bulk loading command is not SQL but rather a custom feature of SQLite. As such it has a weird syntax because we're passing it via echo to the interactive command line interpreter, sqlite3.
In PostgreSQL the equivalent is COPY FROM:
http://www.postgresql.org/docs/8.1/static/sql-copy.html
In MySQL it is LOAD DATA LOCAL INFILE:
http://dev.mysql.com/doc/refman/5.1/en/load-data.html
One last thing: remember to be careful with the value of .separator. That is a very common gotcha when doing bulk inserts.
sqlite> .show
echo: off
explain: off
headers: on
mode: list
nullvalue: ""
output: stdout
separator: "\t"
width:
You should explicitly set the separator to be a space, tab, or comma before doing .import.

I've tested some pragmas proposed in the answers here:
synchronous = OFF
journal_mode = WAL
journal_mode = OFF
locking_mode = EXCLUSIVE
synchronous = OFF + locking_mode = EXCLUSIVE + journal_mode = OFF
Here are my numbers for different numbers of inserts per transaction:
Increasing the batch size can give you a real performance boost, while turning off the journal and synchronization or acquiring an exclusive lock gives only an insignificant gain. The points around ~110k inserts show how random background load can affect your database performance.
It is also worth mentioning that journal_mode=WAL is a good alternative to the defaults: it gives some gain but does not reduce reliability.
The benchmark code is C#.

You can also try tweaking a few parameters to get extra speed out of it. Specifically, you probably want PRAGMA synchronous = OFF;.

Increase PRAGMA cache_size to a much larger number. This will increase the number of pages cached in memory. NOTE: cache_size is a per-connection setting.
Wrap all inserts into a single transaction rather than one transaction per row.
Use compiled (prepared) SQL statements to do the inserts.
Finally, as already mentioned, if you are willing to forgo full ACID compliance, set PRAGMA synchronous = OFF;.
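For reference, a small sketch of what setting those pragmas looks like from the C/C++ API before the insert loop starts; the cache_size value here is an arbitrary example, not a recommendation.

// Assumes db is an open sqlite3* handle; run these once per connection,
// before beginning the bulk insert.
sqlite3_exec(db, "PRAGMA cache_size = 100000;", nullptr, nullptr, nullptr);
sqlite3_exec(db, "PRAGMA synchronous = OFF;", nullptr, nullptr, nullptr);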

RE: "Is there a faster way that generating insert statements for each line of data?"
First: Cut it down to 2 SQL statements by making use of Sqlite3's Virtual table API e.g.
create virtual table vtYourDataset using yourModule;
-- Bulk insert
insert into yourTargetTable (x, y, z)
select x, y, z from vtYourDataset;
The idea here is that you implement a C interface that reads your source data set and presents it to SQLite as a virtual table, and then you do a SQL copy from the source to the target table in one go. It sounds harder than it really is, and I've measured huge speed improvements this way.
Second: make use of the other advice provided here, i.e. the pragma settings and the use of a transaction.
Third: perhaps see if you can do away with some of the indexes on the target table. That way SQLite will have fewer indexes to update for each row inserted.

There is no way to bulk insert, but there is a way to write large chunks to memory, then commit them to the database. For the C/C++ API, just do:
sqlite3_exec(db, "BEGIN TRANSACTION", NULL, NULL, NULL);
... (INSERT statements)
sqlite3_exec(db, "COMMIT TRANSACTION", NULL, NULL, NULL);
Assuming db is your database pointer.

A good compromise is to wrap your INSERTs between the BEGIN; and END; keywords, i.e.:
BEGIN;
INSERT INTO table VALUES ();
INSERT INTO table VALUES ();
...
END;
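If the whole data set is too large to sit comfortably in one transaction, a common variant (a sketch under that assumption, not part of the original answer) is to commit every few thousand rows and immediately open a new transaction:

// Assumes db is an open sqlite3* handle and insert_next_row() is a
// hypothetical helper that binds and steps one prepared INSERT,
// returning false when the input is exhausted.
const int batch_size = 10000;
int rows_in_batch = 0;
sqlite3_exec(db, "BEGIN;", nullptr, nullptr, nullptr);
while (insert_next_row()) {
    if (++rows_in_batch == batch_size) {
        sqlite3_exec(db, "END;", nullptr, nullptr, nullptr);
        sqlite3_exec(db, "BEGIN;", nullptr, nullptr, nullptr);
        rows_in_batch = 0;
    }
}
sqlite3_exec(db, "END;", nullptr, nullptr, nullptr);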

Depending on the size of the data and the amount of RAM available, one of the best performance gains will occur by setting sqlite to use an all-in-memory database rather than writing to disk.
For in-memory databases, pass NULL as the filename argument to sqlite3_open and make sure that TEMP_STORE is defined appropriately.
(All of the above text is excerpted from my own answer to a separate sqlite-related question)

I found this to be a good mix for a one-shot long import.
.echo ON
.read create_table_without_pk.sql
PRAGMA cache_size = 400000;
PRAGMA synchronous = OFF;
PRAGMA journal_mode = OFF;
PRAGMA locking_mode = EXCLUSIVE;
PRAGMA count_changes = OFF;
PRAGMA temp_store = MEMORY;
PRAGMA auto_vacuum = NONE;
.separator "\t"
.import a_tab_separated_table.txt mytable
BEGIN;
.read add_indexes.sql
COMMIT;
.exit
source: http://erictheturtle.blogspot.be/2009/05/fastest-bulk-import-into-sqlite.html
some additional info: http://blog.quibb.org/2010/08/fast-bulk-inserts-into-sqlite/

If you are only inserting once, I may have a dirty trick for you.
The idea is simple: first insert into an in-memory database, then back it up, and finally restore the backup into your original database file.
I wrote the detailed steps on my blog. :)
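A minimal sketch of that idea using SQLite's online backup API; "target.sqlite" is a placeholder file name and error handling is omitted for brevity.

// Build the data in an in-memory database, then copy it to disk with the
// online backup API. The table creation and INSERTs are elided here.
sqlite3* mem_db = nullptr;
sqlite3_open(":memory:", &mem_db);
// ... create the table and run all the INSERTs against mem_db ...

sqlite3* file_db = nullptr;
sqlite3_open("target.sqlite", &file_db);

sqlite3_backup* backup = sqlite3_backup_init(file_db, "main", mem_db, "main");
if (backup != nullptr) {
    sqlite3_backup_step(backup, -1);   // copy every page in one pass
    sqlite3_backup_finish(backup);
}
sqlite3_close(file_db);
sqlite3_close(mem_db);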

I do a bulk insert with this method:
colnames = ['col1', 'col2', 'col3']
nrcols = len(colnames)
qmarks = ",".join(["?" for i in range(nrcols)])
stmt = "INSERT INTO tablename VALUES(" + qmarks + ")"
vals = [[val11, val12, val13], [val21, val22, val23], ..., [valn1, valn2, valn3]]
conn.executemany(stmt, vals)
colnames must be in the order of the column names in the table.
vals is a list of db rows.
Each row must have the same length and contain the values in the correct order.
Note that we use executemany, not execute.

Related

Map values in a list to a new value with PySpark

I'm trying to recode a list of values using Pyspark to create a new column. I've set my mapping up with nested dictionaries, but can't get the mapping syntax figured out. The original data has several string values that need to get recoded to a new value, then I want to give the column a new name. The original column values will get grouped several different ways to create different new columns.
The df will have several thousand columns, so I need the code to be as efficient as possible.
I have a different scenario with a 1-1 mapping where I was able to create my expression with:
#expr = [ create_map([lit(x) for x in chain(*values.items())])[orig_df[key]].cast(IntegerType()).alias('new_name') for key, values in my_dict.items() if key in orig_df.columns]
I just can't figure out the syntax for mapping the many to one.
Here's what I've tried:
grouping_dict = {'orig_col_n': {'new_col_n_a': {'20': ['011','012','013'], '30': ['014','015','016']},
                                'new_col_n_b': {'25': ['011','013','015'], '35': ['012','014','016']}}}
expr = [ f.when(f.col(key) == f.lit(old_val),f.lit(new_value))
.cast(IntegerType())
.alias(new_var_name)
for key, new_var_names_dict in grouping_dict.items()
for new_var_name,mapping_dict in new_var_names_dict.items()
for new_value,old_value_list in mapping_dict.items()
for old_val in old_value_list
if key in original_df.columns]
new_df = original_df.select(*expr)
This expression isn't quite right: it creates multiple columns with the same name as it loops through the values that need to be mapped.
Any suggestions for restructuring my dictionary or how to fix my syntax would be greatly appreciated!
Desired output:
orig_col_n new_col_n_a new_col_n_b
011 20 25
012 20 35
013 20 25
014 30 35
015 30 25
016 30 35

pandas 'outer' merge of multiple csvs using too much memory

I am new to coding and have a lot of big data to deal with. Currently I am trying to merge 26 tsv files (each has two columns without a header: one is a contig_number, the other is a count).
If a tsv does not have a count for a particular contig_number, it does not have that row, so I am attempting to use how='outer' and fill in the missing values with 0 afterwards.
I have been successful for the tsvs which I have subsetted to run the initial tests, but when I run the script on the actual data, which is large (~40,000 rows, two columns), more and more memory is used...
I got to 500Gb of RAM on the server and called it a day.
This is the code that is successful on the subsetted csvs:
files = glob.glob('*_count.tsv')
data_frames = []
logging.info("Reading in sample files and adding to list")
for fp in files:
    # read in the files and put them into dataframes
    df = pd.read_csv(fp, sep='\t', header=None, index_col=0)
    # rename the columns so we know what file they came from
    df = df.rename(columns={1: str(fp)}).reset_index()
    df = df.rename(columns={0: "contig"})
    # append the dataframes to a list
    data_frames.append(df)
logging.info("Merging the tables on contig, and fill in samples with no counts for contigs")
# merge the tables on contig with how='outer', which will include all rows but leave empty space where there is no data
df = reduce(lambda left, right: pd.merge(left, right, how='outer', on="contig"), data_frames)
# this bit is important to fill missing data with a 0
df.fillna(0, inplace=True)
logging.info("Writing concatenated count table to file")
# write the dataframe to file
df.to_csv("combined_bamm_filter_count_file.tsv",
          sep='\t', index=False, header=True)
I would appreciate any advice or suggestions! Maybe there is just too much to hold in memory, and I should be trying something else.
Thank you!
I usually do these types of operations with pd.concat. I don't know the exact details of why it's more efficient, but pandas has some optimizations for combining indices.
I would do
data_frames = []
for fp in files:
    # read in the files and put them into dataframes
    df = pd.read_csv(fp, sep='\t', header=None, index_col=0)
    # rename the columns so we know what file they came from,
    # and just keep the contig as the index
    df = df.rename(columns={1: str(fp)})
    data_frames.append(df)
df_full = pd.concat(data_frames, axis=1)
and then df_full = df_full.fillna(0) if you want to.
In fact, since each of your files has only one column (plus an index), you may do even better by treating them as Series instead of DataFrames.

How to load specific columns with varying location from a text file in python?

I'm trying to read the discharge data of 346 US rivers stored online in textfiles. The files are more or less in this format:
Measurement_number Date Gage_height Discharge_value
1 2017-01-01 10 1000
2 2017-01-20 15 2000
# etc.
I only want to read the gage height and discharge value columns.
The problem is that in most files additional metadata columns are added in front of the 'Gage_height' column, so I cannot simply read the 3rd and 4th columns, because their index varies.
I'm trying to find a way to say 'read the columns with the name 'Gage_height' and 'Discharge_value'', but I haven't succeeded yet.
I hope anyone can help. I'm currently trying to load the text files with numpy.genfromtxt so it would be great to find a solution with that package but other solutions are also more than welcome.
This is my code so far
data_url = urllib2.urlopen(...)  # the url of this specific site
data = np.genfromtxt(data_url, skip_header=1, comments='#', usecols=[2, 3])
You can use the names=True option to genfromtxt, and then use the column names to select which columns you want to read with usecols.
For example, to read 'Gage_height' and 'Discharge_value' from your data file:
data = np.genfromtxt(filename, names=True, usecols=['Gage_height', 'Discharge_value'])
Note that you don't need to set skip_header=1 if you use names=True.
You can then access the columns using their names:
gage_height = data['Gage_height'] # == array([ 10., 15.])
discharge_value = data['Discharge_value'] # == array([ 1000., 2000.])
See the docs here for more information.

MonetDB create 100.000 columns

I am trying to create a MonetDB database that shall hold 100k columns and approximately 2M rows of smallint type.
To generate 100k columns I am using a C code, i.e., a loop that performs the following sql request:
ALTER TABLE test ADD COLUMN s%d SMALLINT;
where %d is a number from 1 till 100000.
I observed that after 80000 sql requests each transaction takes about 15s, meaning that I need a lot of time to complete the table creation.
Could you tell me if there is a simple way of creating 100k columns?
Also, do you know what exactly is going on with MonetDB?
You should use only one CREATE TABLE statement.
In a shell script (bash):
#!/bin/bash
fic="/tmp/100k.sql"
col=1
echo "CREATE TABLE bigcol (" > $fic
while [[ $col -lt 100000 ]]
do
echo "field$col SMALLINT," >> $fic
col=$(($col + 1))
done
echo "field$col SMALLINT);" >> $fic
And on the command line:
sh 100k.sh
mclient yourbdd < /tmp/100k.sql
wait about 2 minutes :D
mclient yourbdd
> \d bigcol
[ ... ... ...]
"field99997" SMALLINT,
"field99998" SMALLINT,
"field99999" SMALLINT,
"field100000" SMALLINT
);
DROP TABLE bigcol, on the other hand, takes a very long time; I do not know why.
I also think this is not a good idea, but it answers your question.
Pierre

RRD DB fake value generator

I want to generate fake values in an RRD database for a period of 1 month, with data collected every 5 seconds. Is there any tool that would fill an RRD database with fake data for a given time duration?
I Googled a lot but did not find any such tool.
Please help.
I would recommend the following one liner:
perl -e 'my $start = time - 30 * 24 * 3600; print join " ","update","my.rrd",(map { ($start+$_*5).":".rand} 0..(30*24*3600/5))' | rrdtool -
This assumes you have an RRD file called my.rrd and that it contains just one data source expecting GAUGE-type data.