I want to create a CSV file that contains the results of a query.
This CSV file will live in Google Cloud Storage, and the query result is around 15 GB. I need it to be a single file. Is this possible, and if so, how?
CREATE OR REPLACE TABLE `your-project.your-dataset.chicago_taxitrips_mod` AS (
WITH
taxitrips AS (
SELECT
trip_start_timestamp,
trip_end_timestamp,
trip_seconds,
trip_miles,
pickup_census_tract,
dropoff_census_tract,
pickup_community_area,
dropoff_community_area,
fare,
tolls,
extras,
trip_total,
payment_type,
company,
pickup_longitude,
pickup_latitude,
dropoff_longitude,
dropoff_latitude,
IF((tips/fare >= 0.2),
1,
0) AS tip_bin
FROM
`bigquery-public-data.chicago_taxi_trips.taxi_trips`
WHERE
trip_miles > 0
AND fare > 0)
SELECT
trip_start_timestamp,
trip_end_timestamp,
trip_seconds,
trip_miles,
pickup_census_tract,
dropoff_census_tract,
pickup_community_area,
dropoff_community_area,
fare,
tolls,
extras,
trip_total,
payment_type,
company,
tip_bin,
ST_AsText(ST_SnapToGrid(ST_GeogPoint(pickup_longitude,
pickup_latitude), 0.1)) AS pickup_grid,
ST_AsText(ST_SnapToGrid(ST_GeogPoint(dropoff_longitude,
dropoff_latitude), 0.1)) AS dropoff_grid,
ST_Distance(ST_GeogPoint(pickup_longitude,
pickup_latitude),
ST_GeogPoint(dropoff_longitude,
dropoff_latitude)) AS euclidean,
CONCAT(ST_AsText(ST_SnapToGrid(ST_GeogPoint(pickup_longitude,
pickup_latitude), 0.1)), ST_AsText(ST_SnapToGrid(ST_GeogPoint(dropoff_longitude,
dropoff_latitude), 0.1))) AS loc_cross
FROM
taxitrips
LIMIT
100000000
)
If BigQuery needs to output multiple files, you can then concatenate them into a single one with a gsutil operation for files in GCS:
gsutil compose gs://bucket/obj1 [gs://bucket/obj2 ...] gs://bucket/composite
https://cloud.google.com/storage/docs/gsutil/commands/compose
Note that there is a limit (currently 32) to the number of components that can be composed in a single operation.
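When the export produces more shards than the per-operation limit, one way around it is to compose in rounds: merge batches into intermediate objects, then merge those. The sketch below only plans the gsutil compose commands and does not call GCS. The function names, the object names, and the ".tmp-" naming scheme are hypothetical, and the limit of 32 is the figure quoted above. It is also worth checking whether each shard carries its own CSV header row, since composing would then repeat it in the merged file.

```python
from typing import List, Tuple

COMPOSE_LIMIT = 32  # per-operation component limit quoted above


def plan_compose(sources: List[str], final: str,
                 limit: int = COMPOSE_LIMIT) -> List[Tuple[List[str], str]]:
    """Return (inputs, output) compose steps that merge sources into final.

    Source order is preserved, so the merged object keeps the shard order.
    Intermediate object names are hypothetical and should be deleted afterwards.
    """
    if not sources:
        raise ValueError("need at least one source object")
    if limit < 2:
        raise ValueError("limit must be at least 2")

    steps: List[Tuple[List[str], str]] = []
    current = list(sources)
    round_no = 0
    # Keep merging batches until what is left fits in one compose call.
    while len(current) > limit:
        next_round = []
        for i in range(0, len(current), limit):
            batch = current[i:i + limit]
            tmp = f"{final}.tmp-{round_no}-{i // limit}"
            steps.append((batch, tmp))
            next_round.append(tmp)
        current = next_round
        round_no += 1
    steps.append((current, final))
    return steps


def as_gsutil(steps: List[Tuple[List[str], str]]) -> List[str]:
    """Render each planned step as a gsutil compose command line."""
    return ["gsutil compose " + " ".join(inputs + [output])
            for inputs, output in steps]


if __name__ == "__main__":
    shards = [f"gs://bucket/part-{i:03d}.csv" for i in range(70)]
    for command in as_gsutil(plan_compose(shards, "gs://bucket/all.csv")):
        print(command)
```

With 70 shards this plans three intermediate compose calls followed by one final call that merges the three intermediates.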
Exporting 15 GB to a single CSV file is not possible, although exporting to multiple files is. I ran your query (bytes processed: 15.66 GB) and then tried to export the result to a single CSV file in GCS, but it failed with this error:
Table gs://[my_bucket]/bq_export/test.csv too large to be exported to a single file. Specify a uri including a * to shard export. See 'Exporting data into one or more files' in https://cloud.google.com/bigquery/docs/exporting-data.
According to the BigQuery documentation, you can export at most 1 GB of table data to a single file. Since the table exceeds 1 GB, you have to use a wildcard URI such as:
gs://your-bucket-name/csvfilename*.csv
I'm not sure why you need the export to be a single CSV file, but IMHO it is too large for one file. Writing it to multiple files will be a lot faster, since BigQuery can use its parallelism to write the output.
Situation:
If I specify only the partition clause, the output is divided into multiple files, each smaller than 1 MB (about 40 files).
What I am thinking of:
I want to explicitly specify the file size or the number of files when writing data with CTAS or INSERT INTO.
I have read this article: https://aws.amazon.com/premiumsupport/knowledge-center/set-file-number-size-ctas-athena/
Problem:
Bucketing, as described in the article above, lets me specify the number of files or the file size. However, the article also says: "Note: The INSERT INTO statement isn't supported on bucketed tables". I would like to load data daily into the data mart with Athena's INSERT INTO.
What is the best way to build a partitioned data mart without compromising query efficiency? Is it best to load the data with Glue and save it as one file?
Say I have a BigQuery table that contains 3M rows, and I want to export it to GCS.
What I do is the standard bq extract <flags> ... <project_id>:<dataset_id>.<table_id> gs://<bucket>/file_name_*.<extension>
I am bound by a limit on the number of rows a file (part) can have. Is there a way to set a hard limit on the size of a file part?
For example, I want each part to be no larger than 10 MB or, even better, to set the maximum number of rows allowed in a file part. The documentation doesn't seem to mention any flags for this purpose.
You can't do this with the BigQuery extract API.
You can script it (export a few thousand rows at a time in a loop), but then you have to pay for the data processed, whereas the extract itself is free. You can also set up a Dataflow job for this, but that is not free either.
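A different, purely local option is to let the extract produce its shards and then re-split them after download, so that no part exceeds a row limit. This is not the loop-of-exports approach described above and it is not a BigQuery feature. The sketch uses only the Python standard library, the function names are hypothetical, and it assumes the first line of the CSV is a header that should be repeated in every part.

```python
import csv
import io
from typing import Iterable, Iterator, List


def split_rows(rows: Iterable[List[str]], max_rows: int) -> Iterator[List[List[str]]]:
    """Yield consecutive chunks of at most max_rows rows."""
    if max_rows < 1:
        raise ValueError("max_rows must be at least 1")
    chunk: List[List[str]] = []
    for row in rows:
        chunk.append(row)
        if len(chunk) == max_rows:
            yield chunk
            chunk = []
    if chunk:
        yield chunk


def split_csv_text(text: str, max_rows: int) -> List[str]:
    """Split CSV text into parts of at most max_rows data rows each.

    Assumes the first line is a header; it is copied into every part.
    """
    reader = csv.reader(io.StringIO(text))
    header = next(reader, None)
    if header is None:
        return []  # empty input, nothing to split
    parts = []
    for chunk in split_rows(reader, max_rows):
        buf = io.StringIO()
        writer = csv.writer(buf, lineterminator="\n")
        writer.writerow(header)
        writer.writerows(chunk)
        parts.append(buf.getvalue())
    return parts


if __name__ == "__main__":
    sample = "id,fare\n1,5.0\n2,7.5\n3,3.2\n"
    for n, part in enumerate(split_csv_text(sample, max_rows=2)):
        print(f"part {n}:\n{part}")
```

For real exports the same chunking would be applied to file handles instead of in-memory strings, so that a large shard is never held in memory at once.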
The actual data in my CSV extracts starts at line 10. How can I skip the first few lines when loading into Snowflake with COPY or any other utility? Is there anything similar to SKIP_HEADER?
My files are on S3, which is my stage. I will be creating a Snowpipe on this data source later.
Yes, there is a SKIP_HEADER option for CSV that lets you skip a specified number of rows when you define a file format. Please have a look here:
https://docs.snowflake.net/manuals/sql-reference/sql/create-file-format.html#type-csv
So you create a file format associated with the csv files you have in mind and then use this when calling the copy commands.
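As a rough illustration of the two steps, the helper below renders the statements as strings. Data starting at line 10 means 9 lines are skipped. The helper, the file format name, the table name, and the stage name are all hypothetical, and the identifiers are interpolated without any sanitizing, so this is a sketch for generating statements by hand rather than something to feed untrusted input.

```python
from typing import List


def skip_header_statements(format_name: str, table: str, stage: str,
                           first_data_line: int) -> List[str]:
    """Render a CREATE FILE FORMAT and a COPY INTO statement for Snowflake.

    first_data_line is 1-based, so the number of skipped lines is one less.
    All identifiers are hypothetical placeholders and are not sanitized.
    """
    if first_data_line < 1:
        raise ValueError("first_data_line is 1-based")
    skip = first_data_line - 1
    return [
        f"CREATE OR REPLACE FILE FORMAT {format_name} "
        f"TYPE = CSV SKIP_HEADER = {skip};",
        f"COPY INTO {table} FROM @{stage} "
        f"FILE_FORMAT = (FORMAT_NAME = '{format_name}');",
    ]


if __name__ == "__main__":
    for statement in skip_header_statements(
            "csv_skip_preamble", "my_table", "my_s3_stage", first_data_line=10):
        print(statement)
```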
Why Data Fusion? Because I need to run several more steps (run Dataproc clusters), insert into databases, and do it all on a schedule. The data could also grow (tens of TB) or shrink (tens of GB).
Stacking several TB of files into one object isn't a good idea. The Cloud Storage size limit per object is 5 TB.
I don't know why you need to stack the files.
BigQuery may be a solution: you can load your CSV files easily and then query a subset of the data for further processing. But querying tens of TB is expensive ($5 per TB)!
For more help, add more detail on what you want to achieve.
I have multiple small Parquet files generated as the output of a HiveQL job, and I would like to merge them into a single Parquet file.
What is the best way to do this using HDFS or Linux commands?
We used to merge text files with the cat command, but will that work for Parquet as well?
Can we do it in HiveQL itself when writing the output files, the way we do with the repartition or coalesce methods in Spark?
According to https://issues.apache.org/jira/browse/PARQUET-460, you can now download the source code and compile parquet-tools, which has a built-in merge command:
java -jar ./target/parquet-tools-1.8.2-SNAPSHOT.jar merge /input_directory/ /output_idr/file_name
Or use a tool like https://github.com/stripe/herringbone
You can also do it using HiveQL itself, if your execution engine is mapreduce.
You can set a flag for your query, which causes hive to merge small files at the end of your job:
SET hive.merge.mapredfiles=true;
or
SET hive.merge.mapfiles=true;
if your job is a map-only job.
This will cause the Hive job to automatically merge many small Parquet files into fewer big files. You can control the number of output files by adjusting the hive.merge.size.per.task setting. If you want just one file, set it to a value that is always larger than the size of your output. Also make sure to adjust hive.merge.smallfiles.avgsize accordingly: set it to a very low value if you want Hive to always merge files. You can read more about these settings in the Hive documentation.
Using DuckDB:
import duckdb
duckdb.execute("""
COPY (SELECT * FROM '*.parquet') TO 'merge.parquet' (FORMAT 'parquet');
""")