How to speed up the execution of inserting values in postgresql - python-2.7

data_input = open(ratingsfilepath, 'r')
for row in data_input:
    fields = row.split('::')
    cur_load.execute("INSERT INTO " + ratingstablename + " VALUES (%s, %s, %s)",
                     (fields[0], fields[1], fields[2]))
I have 10 million records in a .dat file and I am loading them into a table using a Python script, but it takes nearly an hour. Is there anything I can do to reduce the time?

Inserting 10 million records one row at a time will take a long time no matter what, but you can still speed it up: use your Python script to convert the data file into a CSV format that matches your table structure, then load it into the table in one go with the COPY FROM SQL command.
Using COPY is considerably faster than 10 million INSERTs.
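With psycopg2, the conversion and the load can live in the same script via copy_from. This is only a minimal sketch under a few assumptions: the connection string is a placeholder, ratingsfilepath and ratingstablename are the variables from the question, and the intermediate file name is made up.

import psycopg2  # assumed driver, matching the %s placeholders in the question

conn = psycopg2.connect("dbname=ratings user=postgres")  # placeholder credentials
cur = conn.cursor()

# Rewrite the '::'-delimited .dat file as a tab-separated file whose columns
# match the table definition.
with open(ratingsfilepath) as src, open('ratings.tsv', 'w') as dst:
    for row in src:
        dst.write('\t'.join(row.rstrip('\n').split('::')[:3]) + '\n')

# A single COPY statement then loads the whole file server-side.
with open('ratings.tsv') as tsv:
    cur.copy_from(tsv, ratingstablename, sep='\t')
conn.commit()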

Related

AWS Athena - how to process huge results file

Looking for a way to process a ~4 GB file that is the result of an Athena query, and I am trying to find out:
Is there some way to split Athena's query result file into small pieces? As I understand it, this is not possible from the Athena side. It also looks like it is not possible to split it with Lambda - the file is too large, and it seems s3.open(input_file, 'r') does not work in Lambda :(
Are there other AWS services that can solve this issue? I want to split this CSV file into small pieces (about 3-4 MB) to send them to an external source (POST requests)
You can use CTAS with Athena and take advantage of its built-in partitioning capabilities.
A common way to use Athena is to ETL raw data into a more optimized and enriched format. You can turn every SELECT query that you run into a CREATE TABLE ... AS SELECT (CTAS) statement that will transform the original data into a new set of files in S3 based on your desired transformation logic and output format.
It is usually advised to store the newly created table in a compressed format such as Parquet; however, you can also define it as CSV ('TEXTFILE').
Lastly, it is advised to split a large table into meaningful partitions to reduce the cost of querying the data, especially in Athena, which charges by data scanned. The right partitioning depends on your use case and on how you want to split your data. The most common approach is time-based partitioning, such as yearly, monthly, weekly, or daily. Use the logic by which you would like to split your files as the partition key of the newly created table.
CREATE TABLE random_table_name
WITH (
format = 'TEXTFILE',
external_location = 's3://bucket/folder/',
partitioned_by = ARRAY['year','month'])
AS SELECT ...
When you go to s3://bucket/folder/ you will have a long list of folders and files based on the selected partition.
Note that you might have different sizes of files based on the amount of data in each partition. If this is a problem or you don't have any meaningful partition logic, you can add a random column to the data and partition with it:
substr(to_base64(sha256(some_column_in_your_data)), 1, 1) as partition_char
Or you can use bucketing and provide how many buckets you want:
WITH (
format = 'TEXTFILE',
external_location = 's3://bucket/folder/',
bucketed_by = ARRAY['column_with_high_cardinality'],
bucket_count = 100
)
You won't be able to do this with Lambda as your memory is maxed out around 3GB and your file system storage is maxed out at 512 MB.
Have you tried just running the split command on the filesystem (if you are using a Unix based OS)?
If this job is recurring and needs to be automated, and you want to stay "serverless", you could create a Docker image that contains a script to perform this task and then run it via a Fargate task.
As for the specifics of how to use split, this other Stack Overflow question may help:
How to split CSV files as per number of rows specified?
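If you end up scripting the split yourself (for example inside that Fargate task), here is a minimal Python sketch of the same row-count split that split -l performs; the input file name, output naming scheme, and rows-per-chunk value are all hypothetical, and the CSV header is copied into every chunk so each piece can be posted on its own.

# Hypothetical input name and chunk size; adjust to the actual Athena result file.
INPUT = 'athena-results.csv'
ROWS_PER_CHUNK = 50000

with open(INPUT) as src:
    header = src.readline()                 # repeat the CSV header in every chunk
    chunk, rows = 0, 0
    out = open('chunk-%05d.csv' % chunk, 'w')
    out.write(header)
    for line in src:
        if rows == ROWS_PER_CHUNK:          # roll over to a new output file
            out.close()
            chunk += 1
            rows = 0
            out = open('chunk-%05d.csv' % chunk, 'w')
            out.write(header)
        out.write(line)
        rows += 1
    out.close()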
You can ask S3 for a range of the file with the Range option. This is a byte range (inclusive), for example bytes=0-1000 to get the first 1000 bytes.
If you want to process the whole file in the same Lambda invocation, you can request a range that is about what you think you can fit in memory, process it, and then request the next. Stop at the last line break you see and prepend the trailing partial line to the next chunk. As long as you make sure that the previous chunk gets garbage collected and you don't aggregate a huge data structure, you should be fine.
You can also run multiple invocations in parallel, each processing its own chunk. You could have one invocation check the file size and then invoke the processing function as many times as necessary to ensure each gets a chunk it can handle.
Just splitting the file into equal parts won't work, though: you have no way of knowing where lines end, so a chunk may split a line in half. If you know the maximum byte size of a line, you can pad each chunk by that amount at both the beginning and the end. When you read a chunk, skip ahead until you see the last line break in the start padding, and skip everything after the first line break inside the end padding (with special handling of the first and last chunk, obviously).
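A minimal sketch of the sequential version with boto3 (which the Lambda Python runtime ships with); the bucket, key, and chunk size are made-up placeholders. It streams complete lines by carrying the trailing partial line over to the next ranged request, so no more than one chunk is held in memory at a time.

import boto3

s3 = boto3.client('s3')
BUCKET, KEY = 'my-bucket', 'athena-results.csv'   # hypothetical names
CHUNK = 8 * 1024 * 1024                           # bytes per ranged GET

def iter_lines(bucket, key):
    # Yield complete lines from a large S3 object using ranged GETs.
    size = s3.head_object(Bucket=bucket, Key=key)['ContentLength']
    start, leftover = 0, b''
    while start < size:
        end = min(start + CHUNK - 1, size - 1)
        rng = 'bytes=%d-%d' % (start, end)
        body = s3.get_object(Bucket=bucket, Key=key, Range=rng)['Body'].read()
        lines = (leftover + body).split(b'\n')
        leftover = lines.pop()          # partial line at the end of this chunk
        for line in lines:
            yield line
        start = end + 1
    if leftover:
        yield leftover

Each yielded line can then be buffered into 3-4 MB batches and sent off as POST requests.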

VoltDB is exhausting the RAM while loading the data

I am trying to load database tables into a VoltDB database using VoltDB's csvloader utility. When I try to load one table of about 5 GB, VoltDB eats RAM so fast that free RAM drops from 55 GB to 200 MB, and then the VoltDB process gets killed by the system.
What can be the reason for this, and what are the recommended settings for VoltDB to avoid it?
Is the table you are loading partitioned? That's the first thing to check, because if you have the default sitesperhost=8 on a single server, and the table is not partitioned, there will be a complete copy of the table in each of the 8 partitions. If the table is partitioned, the data is distributed among the partitions based on the hashing assignment of the values of the partitioning key column.
If it's partitioned and you still can't load all of the data, the next thing to look at would be the schema. There are formulas in the Planning Guide that describe the memory usage for given datatypes and for indexes. The VMC interface also has a sizing worksheet that gives you the mins and maxes based on the schema. You could also post the definition of the table you are trying to load, along with any indexes you have defined on it, and we can explain more about the bytes it would use per row.

Advice for handling a 200 GB CSV with geometry

I have a 200 GB CSV file that represents locations (points) around the globe. Each entry (row) has 64 columns and contains redundant information. By my calculations the file holds approximately 800 million rows. My first approach was to push all the data into Postgres + PostGIS. The data is not very clean and some rows do not match the expected datatypes, so I made an ORM implementation to first validate and fix the datatype inconsistencies and handle exceptions.
The ORM I used was Django > 1.5 and it took approx. 3 hours to process less than 0.1% of the total dataset.
I also tried to partition the dataset into different files so that I can process them little by little, pushing them into the database. I used common Unix commands like sed, cat, awk, and head to do this, but it takes so much time!
My questions are the following:
Does using the Django ORM sound like a good approach?
What about SQLAlchemy? Could it help make the insertions faster?
How can I split the dataset in less time?
I recently saw Pandas (a Python library for data analysis). Can it help with this task, maybe by making the queries easier once the data is stored in the database?
Which other tools would you recommend to work with this massive amount of data?
Thank you for your help and reading the long post.

Hive -- split data across files

Is there a way to instruct Hive to split data into multiple output files? Or maybe cap the size of the output files.
I'm planning to use Redshift, which recommends splitting data into multiple files to allow parallel loading http://docs.aws.amazon.com/redshift/latest/dg/t_splitting-data-files.html
We preprocess all our data in Hive, and I'm wondering if there's a way to create, say, ten 1 GB files, which might make copying to Redshift faster.
I was looking at https://cwiki.apache.org/Hive/adminmanual-configuration.html and https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties but I can't find anything
There are a couple of ways you could go about splitting Hive output. The first and easiest way is to set the number of reducers. Since each reducer writes to its own output file, the number of reducers you specify will correspond to the number of output files written. Note that some Hive queries will not result in the number of reducers you specify (for example, SELECT COUNT(*) FROM some_table always results in one reducer). To specify the number of reducers, run this before your query:
set mapred.reduce.tasks=10;
Another way to split into multiple output files would be to have Hive insert the results of your query into a partitioned table. This results in at least one file per partition. For this to make sense, you must have some reasonable column to partition on. For example, you wouldn't want to partition on a unique id column, or you would have one file for each record. This approach will guarantee at least one output file per partition, and at most numPartitions * numReducers. Here's an example (don't worry too much about hive.exec.dynamic.partition.mode; it needs to be set for this query to work).
set hive.exec.dynamic.partition.mode=nonstrict;

CREATE TABLE table_to_export_to_redshift (
  id INT,
  value INT
)
PARTITIONED BY (country STRING);

INSERT OVERWRITE TABLE table_to_export_to_redshift
PARTITION (country)
SELECT id, value, country
FROM some_table;
To get more fine grained control, you can write your own reduce script to pass to hive and have that reduce script write to multiple files. Once you are writing your own reducer, you can do pretty much whatever you want.
Finally, you can forgo trying to maneuver Hive into outputting your desired number of files and just break them apart yourself once Hive is done. By default, Hive stores its tables uncompressed and in plain text in its warehouse directory (e.g., /apps/hive/warehouse/table_to_export_to_redshift). You can use Hadoop shell commands, a MapReduce job, Pig, or pull them into Linux and break them apart however you like.
I don't have any experience with Redshift, so some of my suggestions may not be appropriate for consumption by Redshift for whatever reason.
A couple of notes: splitting data into more, smaller files is generally bad for Hadoop. You might get a speed increase for Redshift, but if the files are consumed by other parts of the Hadoop ecosystem (MapReduce, Hive, Pig, etc.) you might see a performance loss if the files are too small (though 1 GB would be fine). Also make sure that the extra processing/developer time is worth the time savings you get from parallelizing your Redshift data load.

C++ SQLite importing entire CSV file in C Interface

Is there a way to Import an entire CSV file into SQLite through the C Interface?
I'm aware of the commandline import that looks like this,
sqlite> .mode csv <table>
sqlite> .import <filename> <table>
but I need to be able to do this in my program.
I should also note that I have successfully created a CSV reader in C++ that reads in a CSV file and inserts its content to a table line by line.
This gets the job done, but with a CSV containing 730k lines this method takes ~20 minutes to load, which is WAY too long. (This is going to be around the average size of the files being processed.)
(Machine: Intel(R) Core(TM)2 Duo CPU E8500 @ 3.16GHz 3.17GHz, 4.0 GB RAM, Windows 7 64-bit, Visual Studio 2010)
This is unacceptable for my project so I need a faster way, something taking around 2-3 minutes.
Is there a way to reference the file's memory location so Import isn't necessary? If so is access of the information slow?
Can SQLite take the CSV file as binary data? Would this make importing the file any faster?
Ideas?
Note: I'm using the ":memory:" option with the C Interface to load the DB in memory to increase speed (I hope).
EDIT
After doing some more optimizing, I found this. It explains how you can group insert statements into one transaction by writing:
BEGIN TRANSACTION;
INSERT into TABLE VALUES(...);
-- ...millions more INSERT statements
INSERT into TABLE VALUES(...);
COMMIT;
This created a HUGE improvement in performance.
Useful Related Side Note
Also, if you're looking to create a table from a query's results or insert query results into a table, try this for creating tables or this for inserting results into a table.
The insert link might not make it obvious how to insert into a table. The query to do that looks like this:
INSERT INTO [TABLE] [QUERY]
where [TABLE] is the table you want the results of [QUERY] (the query you're running) to go into.
I have successfully created a CSV reader in C++ that reads in a CSV file and inserts its content to a table line by line... takes ~20 minutes to load
Put all your inserts into a single transaction - or at least batch up 100 or 1000 rows per transaction - and I would expect your program to run much faster.
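For reference, here is the same batching idea sketched with Python's built-in sqlite3 module rather than the C interface (the table layout and file name are made up); in the C interface the equivalent is to issue BEGIN/COMMIT around the loop of prepared-statement inserts.

import csv
import sqlite3

conn = sqlite3.connect(':memory:')                 # same in-memory option as the question
conn.execute('CREATE TABLE ratings (a, b, c)')     # hypothetical table layout

with open('data.csv') as f:                        # hypothetical CSV file
    rows = csv.reader(f)
    with conn:                                     # one transaction around all the inserts
        conn.executemany('INSERT INTO ratings VALUES (?, ?, ?)', rows)

Wrapping the whole load in a single transaction (or committing every few thousand rows) avoids the per-statement transaction overhead that makes row-by-row inserts so slow.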