I have a file containing holidays, and a UDF needs this file in order to calculate the business days between two given dates. The issue is that when I add the file, it goes to a working directory, but this directory differs every session.
This is unlike the example below from Hive Resources:
hive> add FILE /tmp/tt.py;
hive> list FILES;
/tmp/tt.py
hive> select from networks a
MAP a.networkid
USING 'python tt.py' as nn where a.ds = '2009-01-04' limit 10;
This is what I am getting instead, and the alphanumeric part keeps changing:
/mnt/tmp/a17b43d5-df53-4eea-8e2c-565471b49d25_resources/holiday2021.csv
I need this file to live in a more permanent folder, so that this Hive SQL can be executed on any of the 18 nodes.
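One commonly used workaround, sketched here under the assumption that the file has first been copied to a shared HDFS location (the path below is hypothetical), is to ADD the file from DFS instead of from a single node's local disk. Hive still localizes the resource into a per-session directory, but the canonical copy is reachable from all 18 nodes, and the UDF can open the file by its basename because added resources are shipped to each task's working directory:
hive> add FILE hdfs:///user/hive/resources/holiday2021.csv;
hive> list FILES;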
Is there any way to provide a suffix for paths when doing a partitioned unload to S3?
e.g. I want to use the output of *several* queries for batch jobs, where query outputs are partitioned by date.
Currently I have a structure in S3 like:
s3://bucket/path/queryA/key=1/*.parquet
s3://bucket/path/queryA/key=2/*.parquet
s3://bucket/path/queryB/key=1/*.parquet
s3://bucket/path/queryB/key=2/*.parquet
But ideally, I would like to have:
s3://bucket/path/key=1/queryA/*.parquet
s3://bucket/path/key=2/queryA/*.parquet
s3://bucket/path/key=1/queryB/*.parquet
s3://bucket/path/key=2/queryB/*.parquet
So that I can then use the following as input paths to batch processing jobs (e.g. on SageMaker):
s3://bucket/path/key=1/
s3://bucket/path/key=2/
Such that each batch job has the output of all queries for the particular day that the batch job is computing for.
Currently, I reshape the data in S3 after unloading, but it would be much faster and more convenient if I could specify a suffix for Redshift to append to S3 unload paths *after* the partition suffix.
From the UNLOAD docs I'm assuming that this isn't possible, and I'm unable to post on AWS forums.
But perhaps there's some other command or a connection variable that I can use, a hack involving something like a literal value for a second partition key, or a totally different strategy altogether?
You could add an artificial column q to mark the query, and then use it as a second partition column; that would effectively add a q=queryA prefix to your path.
But Redshift does not allow UNLOADing into a non-empty location unless you provide the ALLOWOVERWRITE option.
And since you don't control the unloaded filenames (they depend on the slice count and the maximum file size), allowing overwrite may cause your data to actually be overwritten if you happen to use the same partition keys.
To work around that, you could add one more artificial partitioning column that adds a unique component to your path (the same value for the whole unload). I used RANDOM in my example for that; you could use something more clash-resistant.
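For instance, combining a timestamp with the random hash would make collisions even less likely. A minimal sketch using standard Redshift functions (the alias r is just illustrative):
-- hypothetical clash-resistant unique component: unload timestamp plus a random hash
select to_char(getdate(), 'YYYYMMDDHH24MISS') || '-' || md5(random()) as r;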
Below is an example query, which unloads data without overwriting results even if unloaded multiple times. I ran it for different part and q values.
unload ($$
WITH
-- computed once per unload, so every row shares the same value
rand(rand) as (select md5(random())),
input(val, part) as (
  select 1, 'p1' union all
  select 1, 'p2'
)
SELECT
  val,
  part,
  'queryB' as q,  -- artificial column marking which query produced the rows
  rand as r       -- artificial column making the path unique per unload
FROM input, rand
$$)
TO 's3://XXX/partitioned_unload/'
IAM_ROLE 'XXX'
PARTITION BY (part, q, r)
ALLOWOVERWRITE;
These are the files produced by 3 runs:
aws s3 ls s3://XXX/partitioned_unload/ --recursive
2020-06-29 08:29:14 2 partitioned_unload/part=p1/q=queryA/r=b43e3ff9b6b271387e2ca5424c310bb5/0001_part_00
2020-06-29 08:28:58 2 partitioned_unload/part=p1/q=queryA/r=cfcd208495d565ef66e7dff9f98764da/0001_part_00
2020-06-29 08:29:54 2 partitioned_unload/part=p1/q=queryB/r=24a4976a535a584dabdf8861548772d4/0001_part_00
2020-06-29 08:29:54 2 partitioned_unload/part=p2/q=queryB/r=24a4976a535a584dabdf8861548772d4/0001_part_00
2020-06-29 08:29:14 2 partitioned_unload/part=p3/q=queryA/r=b43e3ff9b6b271387e2ca5424c310bb5/0002_part_00
2020-06-29 08:28:58 2 partitioned_unload/part=p3/q=queryA/r=cfcd208495d565ef66e7dff9f98764da/0001_part_00
My actual data in the CSV extracts starts from line 10. How can I skip the top few lines in a Snowflake load using COPY or any other utility? Do we have anything similar to SKIP_HEADER?
I have files on S3, and that is my stage. I will be creating a Snowpipe on this data source later.
Yes, there is a SKIP_HEADER option for CSV that lets you skip a specified number of rows when defining a file format. Please have a look here:
https://docs.snowflake.net/manuals/sql-reference/sql/create-file-format.html#type-csv
So you create a file format for the CSV files you have in mind and then reference it when calling the COPY command, as sketched below.
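A minimal sketch, with hypothetical names for the file format, table, and stage (data starting on line 10 means the first 9 lines are skipped):
-- my_csv_format, my_table, and my_s3_stage are hypothetical names
create or replace file format my_csv_format
  type = 'CSV'
  skip_header = 9;  -- data starts on line 10, so skip the first 9 lines

copy into my_table
  from @my_s3_stage
  file_format = (format_name = 'my_csv_format');
The same file format can later be referenced from the Snowpipe's COPY INTO statement.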
I want to create a CSV file which contains the results of a query.
This CSV file will live in Google Cloud Storage. (The query result is around 15 GB.) I need it to be a single file. Is that possible, and if so, how?
CREATE OR REPLACE TABLE `your-project.your-dataset.chicago_taxitrips_mod` AS (
WITH
taxitrips AS (
SELECT
trip_start_timestamp,
trip_end_timestamp,
trip_seconds,
trip_miles,
pickup_census_tract,
dropoff_census_tract,
pickup_community_area,
dropoff_community_area,
fare,
tolls,
extras,
trip_total,
payment_type,
company,
pickup_longitude,
pickup_latitude,
dropoff_longitude,
dropoff_latitude,
IF((tips/fare >= 0.2),
1,
0) AS tip_bin
FROM
`bigquery-public-data.chicago_taxi_trips.taxi_trips`
WHERE
trip_miles > 0
AND fare > 0)
SELECT
trip_start_timestamp,
trip_end_timestamp,
trip_seconds,
trip_miles,
pickup_census_tract,
dropoff_census_tract,
pickup_community_area,
dropoff_community_area,
fare,
tolls,
extras,
trip_total,
payment_type,
company,
tip_bin,
ST_AsText(ST_SnapToGrid(ST_GeogPoint(pickup_longitude,
pickup_latitude), 0.1)) AS pickup_grid,
ST_AsText(ST_SnapToGrid(ST_GeogPoint(dropoff_longitude,
dropoff_latitude), 0.1)) AS dropoff_grid,
ST_Distance(ST_GeogPoint(pickup_longitude,
pickup_latitude),
ST_GeogPoint(dropoff_longitude,
dropoff_latitude)) AS euclidean,
CONCAT(ST_AsText(ST_SnapToGrid(ST_GeogPoint(pickup_longitude,
pickup_latitude), 0.1)), ST_AsText(ST_SnapToGrid(ST_GeogPoint(dropoff_longitude,
dropoff_latitude), 0.1))) AS loc_cross
FROM
taxitrips
LIMIT
100000000
)
If BigQuery needs to output multiple files, you can then concatenate them into a single one with a gsutil operation for files in GCS:
gsutil compose gs://bucket/obj1 [gs://bucket/obj2 ...] gs://bucket/composite
https://cloud.google.com/storage/docs/gsutil/commands/compose
Note that there is a limit (currently 32) to the number of components that can be composed in a single operation.
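For instance, two exported shards could be merged like this (the object names are hypothetical; BigQuery replaces the * in a wildcard URI with a zero-padded sequence number):
gsutil compose gs://bucket/csvfilename000000000000.csv gs://bucket/csvfilename000000000001.csv gs://bucket/csvfilename-merged.csv
Also note that compose concatenates the objects byte for byte, so if each shard was exported with a CSV header row, the header will be repeated inside the merged file.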
Exporting 15 GB to a single CSV file is not possible (exporting to multiple files is). I tried your same query (bytes processed: 15.66 GB), then tried to export it to a CSV file in GCS, but it failed with this error:
Table gs://[my_bucket]/bq_export/test.csv too large to be exported to a single file. Specify a uri including a * to shard export. See 'Exporting data into one or more files' in https://cloud.google.com/bigquery/docs/exporting-data.
The BQ documentation only allows you to export up to 1 GB of table data to a single file. Since the table exceeds 1 GB, you have to use a wildcard, like:
gs://your-bucket-name/csvfilename*.csv
I'm not sure why you would like the exported CSV to be a single file, but IMHO it's too large to be in a single file. Writing it to multiple files will be a lot faster, since BQ can use its parallelism to write the output with multiple workers.
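For reference, a sharded export can also be expressed directly in SQL with an EXPORT DATA statement. A sketch, reusing the table created in the question (the bucket name is hypothetical):
EXPORT DATA OPTIONS (
  uri = 'gs://your-bucket-name/csvfilename*.csv',  -- the * lets BigQuery shard the output
  format = 'CSV',
  overwrite = true,
  header = true
) AS
SELECT *
FROM `your-project.your-dataset.chicago_taxitrips_mod`;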
Is there any way, option, or workaround to skip an entire file that contains bad entries while loading the data from S3 to Redshift?
Please note that I am not talking about skipping the invalid entries in the file, but the entire file that contains a bad entry or record.
By default, Redshift fails the entire file if you don't supply the MAXERROR option in the COPY command; that is its default behavior.
copy catdemo from 's3://awssampledbuswest2/tickit/category_pipe.txt' iam_role 'arn:aws:iam::<aws-account-id>:role/<role-name>' region 'us-west-2';
The above command will fail the entire file and will not load any data from the given file. Read the documentation for more information.
Only if you specify the MAXERROR option does it ignore records, up to that number, from a particular file.
copy catdemo from 's3://awssampledbuswest2/tickit/category_pipe.txt' iam_role 'arn:aws:iam::<aws-account-id>:role/<role-name>' region 'us-west-2' MAXERROR 500;
In the above example, Redshift will tolerate up to 500 bad records.
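When a load is tolerated this way, the bad records, and the files they came from, can be inspected afterwards in the STL_LOAD_ERRORS system table. A minimal sketch (column names as documented by Redshift):
-- show the most recent load errors and which file(s) contained the bad records
select filename, line_number, err_reason
from stl_load_errors
order by starttime desc
limit 10;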
I hope this answers your question, but if it doesn't, please update the question and I will refocus the answer.