How to split a gz csv file by its data field efficiently - python-2.7

I have a very large gzipped csv file. I would like to split it into two gz files based on a string pattern in a particular column. I know it is possible to loop through the content and create two files, but is there a more efficient way to do it in Python?
In addition, the original file has a one-line header. I would like to either keep the header in each of the two result files, or remove it altogether.
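For reference, a minimal sketch of the single-pass loop described above, streaming through the gzip module so the file is never fully loaded into memory. The function name split_gz_csv, its parameters, and the choice to treat the pattern as a regular expression are assumptions made for illustration; nothing here comes from an accepted answer. It is written for Python 3 (text-mode gzip.open); on Python 2.7 the files would need to be opened in binary mode ("rb"/"wb") instead.

```python
import csv
import gzip
import re


def split_gz_csv(src_path, match_path, rest_path, column, pattern, keep_header=True):
    """Route each row to match_path or rest_path, depending on whether
    row[column] matches pattern. Returns (matched_count, rest_count).

    All names and parameters are hypothetical; column is a 0-based index.
    """
    regex = re.compile(pattern)
    matched_count = rest_count = 0
    with gzip.open(src_path, "rt", newline="") as src, \
            gzip.open(match_path, "wt", newline="") as matched, \
            gzip.open(rest_path, "wt", newline="") as rest:
        reader = csv.reader(src)
        match_writer = csv.writer(matched)
        rest_writer = csv.writer(rest)

        # Assumes the first line is the one-line header mentioned in the question.
        header = next(reader, None)
        if header is not None and keep_header:
            match_writer.writerow(header)
            rest_writer.writerow(header)

        # Rows are read and written one at a time, so memory use stays flat.
        for row in reader:
            if regex.search(row[column]):
                match_writer.writerow(row)
                matched_count += 1
            else:
                rest_writer.writerow(row)
                rest_count += 1
    return matched_count, rest_count
```

Passing keep_header=False drops the header from both outputs. Since decompression and compression dominate the run time, a single streaming pass like this is probably close to what any pure-Python approach can achieve.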

Related

import multiple csv files as list in R and change the name of one cell in each file

I have been searching for days but can't find an answer to my question.
I need to change a single cell currently named "29" to "Si29" in hundreds of csv files.
The position of the cell is the same in every file: [3,7].
Then I need to save the files again (can be under the same name).
For one file I would do:
read_data[3,7] <- "Si29"
However, I have no clue how to apply this to multiple files.
Cheers

How can I explicitly specify the size of the files to be split or the number of files?

Situation:
If I specify only the partition clause, the output is divided into multiple files. Each file is smaller than 1 MB (about 40 files).
What I am thinking of:
I want to explicitly specify the size or the number of the output files when registering data with CTAS or INSERT INTO.
I have read this article: https://aws.amazon.com/premiumsupport/knowledge-center/set-file-number-size-ctas-athena/
Problem:
Using the bucketing method (as described in the article above) lets me specify the number of files or the file size. However, the article also says: "Note: The INSERT INTO statement isn't supported on bucketed tables". I would like to register data daily in the data mart with Athena's INSERT INTO.
What is the best way to build a partitioned data mart without compromising search efficiency? Is it best to register the data with Glue and save it as one file?

How can I manipulate CSVs from within C++

I am trying to create a program that writes its output to a CSV (comma-separated) file. Is there a way to control, say, the column width or whether a cell is left- or right-justified from my code, so that when I open the file in Excel it looks better than a bunch of strings cramped into tiny cells? My goal is for the user to do as little thinking as possible. If they open the file and have to resize everything just to see it, that seems a little crummy.

CSV is a plain text file format. It doesn't support any visual formatting. For that, you need to write the data to another file format such as .xlsx or .ods.

Can I use gzseek to update a file compressed using gzwrite (C++)?

I have a file written using gzwrite. Now I want to edit this file and insert some data in the middle by seeking. Is this possible with gzseek/gzwrite in C++?

No, it isn't possible. You have to create a new file by successively writing the pieces.
So it is not much different from inserting data in the middle of an uncompressed file, except for one thing: with the uncompressed file, you could leave a hole of the right size (a series of spaces, for example) and later on overwrite that with the data to be inserted, but of course that is not possible with the compressed file because you cannot predict its compressed length.
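The "create a new file by successively writing the pieces" approach can be sketched as follows. The question is about zlib's C API, but the sketch uses Python's gzip module to show the same idea; the function name insert_into_gzip and its parameters are hypothetical, and offset is assumed to be a position in the uncompressed data.

```python
import gzip
import shutil


def insert_into_gzip(src_path, dst_path, offset, payload, chunk_size=64 * 1024):
    """Write a new gzip file containing src[:offset] + payload + src[offset:].

    offset counts uncompressed bytes; payload is a bytes object.
    The source file is left untouched.
    """
    with gzip.open(src_path, "rb") as src, gzip.open(dst_path, "wb") as dst:
        # Piece 1: copy the first `offset` uncompressed bytes in chunks.
        remaining = offset
        while remaining > 0:
            chunk = src.read(min(chunk_size, remaining))
            if not chunk:
                raise ValueError("offset is past the end of the uncompressed data")
            dst.write(chunk)
            remaining -= len(chunk)
        # Piece 2: the inserted data.
        dst.write(payload)
        # Piece 3: everything that followed the insertion point.
        shutil.copyfileobj(src, dst, chunk_size)
```

Once the new file is complete it can be renamed over the original. The whole stream is decompressed and recompressed, which is the cost the answer above refers to.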

Generate dictionary file from Stata data

I know that I can create a dta file if I have a dat file and a dictionary (dct) file. However, I want to know whether the reverse is also possible. In particular, if I have a dta file, is it possible to generate a dct file along with a dat file? (Stata has an export command that allows exporting as an ASCII file, but I haven't found a way to generate a dct file.) StatTransfer does generate dct and dat files, but I was wondering if it is possible without using StatTransfer.

Yes. outfile will create dictionaries as well as export data in ASCII (text) form.
If you want dictionaries and dictionaries alone, you would need to delete the data part.
If you really want two separate files, you would need to split each file produced by outfile.
Either is programmable in Stata, or you could just use your favourite text editor or scripting language.
Dictionaries are in some ways a very good idea, but they are not as important to Stata as they were in early versions.
