Call the added file in Hive to use for UDF - amazon-web-services

I have a file which contains holidays, and the UDF needs this file in order to run and calculate the business days between two dates. The issue I have is that when I add the file, it goes to a working directory, but this directory differs every session.
This is unlike the example below from Hive Resources - that is not what is happening for me.
hive> add FILE /tmp/tt.py;
hive> list FILES;
/tmp/tt.py
hive> select from networks a
MAP a.networkid
USING 'python tt.py' as nn where a.ds = '2009-01-04' limit 10;
This is what I am getting, and the alphanumeric part keeps changing:
/mnt/tmp/a17b43d5-df53-4eea-8e2c-565471b49d25_resources/holiday2021.csv
I need this file to be located in a more permanent folder so that this Hive SQL can be executed on any of the 18 nodes.
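One possible approach (a sketch only; the paths, UDF name, and table below are placeholders, and it assumes the file can be staged in a shared location such as S3 or HDFS) is to keep the holiday file in a permanent location and ADD FILE from there at the start of every session:
-- Keep holiday2021.csv in a permanent shared location and add it in each session,
-- so the same statement works regardless of which node runs the query.
ADD FILE s3://my-bucket/hive-resources/holiday2021.csv;   -- or: ADD FILE hdfs:///apps/hive/resources/holiday2021.csv;
LIST FILES;
-- Added resources are localized into each task's working directory, so the UDF can
-- usually reference the file by its base name (business_days_udf and my_table are placeholders).
SELECT business_days_udf(start_dt, end_dt, 'holiday2021.csv') FROM my_table;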

Related

AWS S3: How to plug in a dynamic file name in the S3 directory in COPY command

I have a job in Redshift that is responsible for pulling 6 files every month from S3. File names follow a standard naming convention such as "file_label_MonthNameYYYY_Batch01.CSV". I'd like to modify the COPY command below to change the file naming in the S3 directory dynamically, so I won't have to hard-code the month name, YYYY, and batch number. The batch number ranges from 1 to 6.
Currently, here is what I have, which is not efficient:
COPY tbl_name ( column_name1, column_name2, column_name3 )
FROM 'S3://bucket_name/folder_name/Static_File_Label_July2021_Batch01.CSV'
CREDENTIALS 'aws_access_key_id = xxx;aws_secret_access_key = xxxxx'
removequotes
EMPTYASNULL
BLANKSASNULL
DATEFORMAT 'MM/DD/YYYY'
delimiter ','
IGNOREHEADER 1;
COPY tbl_name ( column_name1, column_name2, column_name3 )
FROM 'S3://bucket_name/folder_name/Static_File_Label_July2021_Batch02.CSV'
CREDENTIALS 'aws_access_key_id = xxx;aws_secret_access_key = xxxxx'
removequotes
EMPTYASNULL
BLANKSASNULL
DATEFORMAT 'MM/DD/YYYY'
delimiter ','
IGNOREHEADER 1;
The dynamic file name will change to August2021_Batch01 & August2021_Batch02 next month, and so forth. Is there a way to do this? Thank you in advance.
There are lots of approaches to this. Which one is best for your case will depend on your circumstances. You need a layer in your process that controls configuring the SQL for each month. Here are some ways to consider:
1. Use a manifest file - this file will have the S3 object names to load, and your processing / file prep can update it (see the sketch at the end of this answer).
2. Use a fixed load folder where the files are located for COPY, then move these files to a permanent storage location after the COPY.
3. Use variables in your workbench to set the Month value and substitute it when the SQL is issued to Redshift.
4. Write some code (Lambda?) to issue the SQL you are looking for.
5. Last I checked, you could leave the object name incomplete and all matching objects would be loaded. Leave off the batch number and suffix and load all the files with one text change.
It is desirable to load multiple files with a single COPY command (it uses more nodes in parallel), and options 1, 2, and 5 do this.
When specifying the FROM location of files to load, you can specify a partial filename.
Here is an example from COPY examples - Amazon Redshift:
The following example loads the SALES table with tab-delimited data from lzop-compressed files in an Amazon EMR cluster. COPY loads every file in the myoutput/ folder that begins with part-.
copy sales
from 'emr://j-SAMPLE2B500FC/myoutput/part-*'
iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
delimiter '\t' lzop;
Therefore, you could specify:
FROM 'S3://bucket_name/folder_name/Static_File_Label_July2021_*'
You would just need to change the Month & Year identifier. All files with that prefix would be loaded in one batch.
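For option 1, a minimal sketch of the manifest approach (the manifest object name and its contents below are illustrative, not from the original answer; your file-prep step would regenerate the manifest each month):
-- Hypothetical manifest object, e.g. s3://bucket_name/folder_name/current_month.manifest, containing:
-- { "entries": [
--     { "url": "s3://bucket_name/folder_name/Static_File_Label_August2021_Batch01.CSV", "mandatory": true },
--     { "url": "s3://bucket_name/folder_name/Static_File_Label_August2021_Batch02.CSV", "mandatory": true }
-- ] }
COPY tbl_name ( column_name1, column_name2, column_name3 )
FROM 's3://bucket_name/folder_name/current_month.manifest'
CREDENTIALS 'aws_access_key_id = xxx;aws_secret_access_key = xxxxx'
MANIFEST
removequotes
EMPTYASNULL
BLANKSASNULL
DATEFORMAT 'MM/DD/YYYY'
delimiter ','
IGNOREHEADER 1;
Because the manifest lists all six batches, COPY loads them in parallel in a single command.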

Redshift: Possibility to specify suffix for paths when doing PARTITIONED UNLOAD to S3?

Is there any way to provide a suffix for paths when doing a partitioned unload to S3?
e.g. if I want to use the output of several queries for batch jobs, where query outputs are partitioned by date.
Currently I have a structure in S3 like:
s3://bucket/path/queryA/key=1/ *.parquet
s3://bucket/path/queryA/key=2/ *.parquet
s3://bucket/path/queryB/key=1/ *.parquet
s3://bucket/path/queryB/key=2/ *.parquet
But ideally, I would like to have:
s3://bucket/path/key=1/queryA/ *.parquet
s3://bucket/path/key=2/queryA/ *.parquet
s3://bucket/path/key=1/queryB/ *.parquet
s3://bucket/path/key=2/queryB/ *.parquet
So that I can then use these as input paths for batch processing jobs (e.g. on SageMaker!):
s3://bucket/path/key=1/
s3://bucket/path/key=2/
Such that each batch job has the output of all queries for the particular day that the batch job is computing for.
Currently, I re-shape the data in S3 after unloading, but it would be much faster and more convenient if I could specify a suffix for Redshift to append to S3 unload paths, after the partition suffix.
From the UNLOAD docs I'm assuming that this isn't possible, and I'm unable to post on AWS forums.
But perhaps there's some other command or a connection variable that I can use, a hack involving something like a literal value for a second partition key, or a totally different strategy altogether?
You could add an artificial column q to mark the query, and then use it as a second partition - that would effectively add a q=queryA segment to your path.
BUT, Redshift does not allow UNLOAD into a non-empty location unless you provide the ALLOWOVERWRITE option.
Then, since you don't control the unloaded filenames (they depend on the slice count and max file size), allowing overwrite may cause your data to really be overwritten if you happen to have the same partition keys.
To work around that, you could add one more artificial partitioning column which adds a unique component to your path (the same value for the whole unload). I used RANDOM in my example for that - you could use something more clash-safe.
Below is an example query, which unloads data without overwriting results even if unloaded multiple times. I ran it for different part and q values.
unload ($$
  WITH
    rand(rand) as (select md5(random())),
    input(val, part) as (
      select 1, 'p1' union all
      select 1, 'p2'
    )
  SELECT
    val,
    part,
    'queryB' as q,
    rand as r
  FROM input, rand
$$)
TO 's3://XXX/partitioned_unload/'
IAM_ROLE 'XXX'
PARTITION BY (part, q, r)
ALLOWOVERWRITE;
These are the files produced by 3 runs:
aws s3 ls s3://XXX/partitioned_unload/ --recursive
2020-06-29 08:29:14 2 partitioned_unload/part=p1/q=queryA/r=b43e3ff9b6b271387e2ca5424c310bb5/0001_part_00
2020-06-29 08:28:58 2 partitioned_unload/part=p1/q=queryA/r=cfcd208495d565ef66e7dff9f98764da/0001_part_00
2020-06-29 08:29:54 2 partitioned_unload/part=p1/q=queryB/r=24a4976a535a584dabdf8861548772d4/0001_part_00
2020-06-29 08:29:54 2 partitioned_unload/part=p2/q=queryB/r=24a4976a535a584dabdf8861548772d4/0001_part_00
2020-06-29 08:29:14 2 partitioned_unload/part=p3/q=queryA/r=b43e3ff9b6b271387e2ca5424c310bb5/0002_part_00
2020-06-29 08:28:58 2 partitioned_unload/part=p3/q=queryA/r=cfcd208495d565ef66e7dff9f98764da/0001_part_00

BigQuery export to CSV file via SQL

I want to create a CSV file which contains the results of a query.
This CSV file will live in Google Cloud Storage. (The query result is around 15 GB.) I need it to be a single file. Is this possible, and if so, how?
CREATE OR REPLACE TABLE `your-project.your-dataset.chicago_taxitrips_mod` AS (
WITH
taxitrips AS (
SELECT
trip_start_timestamp,
trip_end_timestamp,
trip_seconds,
trip_miles,
pickup_census_tract,
dropoff_census_tract,
pickup_community_area,
dropoff_community_area,
fare,
tolls,
extras,
trip_total,
payment_type,
company,
pickup_longitude,
pickup_latitude,
dropoff_longitude,
dropoff_latitude,
IF((tips/fare >= 0.2),
1,
0) AS tip_bin
FROM
`bigquery-public-data.chicago_taxi_trips.taxi_trips`
WHERE
trip_miles > 0
AND fare > 0)
SELECT
trip_start_timestamp,
trip_end_timestamp,
trip_seconds,
trip_miles,
pickup_census_tract,
dropoff_census_tract,
pickup_community_area,
dropoff_community_area,
fare,
tolls,
extras,
trip_total,
payment_type,
company,
tip_bin,
ST_AsText(ST_SnapToGrid(ST_GeogPoint(pickup_longitude,
pickup_latitude), 0.1)) AS pickup_grid,
ST_AsText(ST_SnapToGrid(ST_GeogPoint(dropoff_longitude,
dropoff_latitude), 0.1)) AS dropoff_grid,
ST_Distance(ST_GeogPoint(pickup_longitude,
pickup_latitude),
ST_GeogPoint(dropoff_longitude,
dropoff_latitude)) AS euclidean,
CONCAT(ST_AsText(ST_SnapToGrid(ST_GeogPoint(pickup_longitude,
pickup_latitude), 0.1)), ST_AsText(ST_SnapToGrid(ST_GeogPoint(dropoff_longitude,
dropoff_latitude), 0.1))) AS loc_cross
FROM
taxitrips
LIMIT
100000000
)
If BigQuery needs to output multiple files, you can then concatenate them into a single one with a gsutil operation for files in GCS:
gsutil compose gs://bucket/obj1 [gs://bucket/obj2 ...] gs://bucket/composite
https://cloud.google.com/storage/docs/gsutil/commands/compose
Note that there is a limit (currently 32) to the number of components that can be composed in a single operation.
Exporting 15 GB to a single CSV file is not possible (exporting to multiple files is possible). I tried your same query (bytes processed: 15.66 GB), then tried to export it to a CSV file in GCS, but it failed with this error:
Table gs://[my_bucket]/bq_export/test.csv too large to be exported to a single file. Specify a uri including a * to shard export. See 'Exporting data into one or more files' in https://cloud.google.com/bigquery/docs/exporting-data.
The BQ documentation states that only up to 1 GB of table data can be exported to a single file. Since the table exceeds 1 GB, you have to use a wildcard, like:
gs://your-bucket-name/csvfilename*.csv
Not sure why you would like the exported CSV to be a single file, but IMHO it's too large to be in a single file. Writing it to multiple files will be a lot faster, since BQ can use its parallelism to write the output with multiple workers.
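As a side note, if the EXPORT DATA statement is available in your project, the export can also be driven from SQL itself; the bucket and path below are placeholders, and a wildcard URI is still required because results over 1 GB are sharded into multiple files:
-- Sketch: export the query result (or the saved table) to sharded CSV files in GCS.
EXPORT DATA OPTIONS(
  uri = 'gs://your-bucket-name/bq_export/csvfilename*.csv',
  format = 'CSV',
  overwrite = true,
  header = true,
  field_delimiter = ',') AS
SELECT * FROM `your-project.your-dataset.chicago_taxitrips_mod`;
The resulting shards can then be combined with the gsutil compose operation mentioned above, subject to its 32-component limit.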

adjusting the project.xml file in a SAS Enterprise Guide project outside SAS EG

We are going to migrate our EG projects (over 1000 projects) to a new environment.
In the old environment we use "W-Latin" as encoding on the Teradata database.
In the new environment we will start using "UTF-8" as encoding on the Teradata database.
And a lot of other changes which I believe are not relevant for this question.
To prevent data issues, we will have to replace functions like REVERSE with their K equivalents such as KREVERSE.
We could do this by opening all projects and clicking through them to change the functions in the expression builder.
This would be really time-consuming, considering that we have over 1000 .egp files.
We already have a code scanner that unzips the .egp file and detects all uses of these functions in the project.xml file.
The next step could be that we find and replace the functions and put the project.xml file back in the .egp file.
Who can tell me how to put the project.xml file back into the .egp file without corrupting it?
I was able to do this.
tl;dr -- Zip the files back up and change the extension to .egp.
Created a new EG project and added a code node to create sample data:
data test;
  do cat = "A", "B", "C";
    do i = 1 to 10;
      r = rannor(123);
      output;
    end;
  end;
  drop i;
run;
I then added a Query node to the output to do a "SUM" of the r column by cat.
Ran the flow and got expected output.
Saved the EG project.
Opened the EG Project in 7zip and extracted the archive to a location.
In project.xml, I found the section for the Query and changed the SUM to MEAN
<Expression>
  <LHS_TYPE>LHS_FUNCTION</LHS_TYPE>
  <LHS_DMCOLGROUP>Numeric</LHS_DMCOLGROUP>
  <RHS_TYPE>RHS_COLUMN</RHS_TYPE>
  <RHS_DMCOLGROUP>Numeric</RHS_DMCOLGROUP>
  <InFormat />
  <LHS_String>MEAN</LHS_String>
  <LHS_Calc />
  <OutputType>OPTYPE_NOTSET</OutputType>
  <RHS_StringOne>r</RHS_StringOne>
  <RHS_StringTwo />
</Expression>
Selected the files and added them to an archive using 7zip. Selected "zip" compression and saved the file with the ".egp" extension.
I opened the project in EG and ran the flow. The output was now the MEAN of R and not the SUM.

how to merge multiple parquet files to single parquet file using linux or hdfs command?

I have multiple small Parquet files generated as the output of a Hive QL job, and I would like to merge them into a single Parquet file.
What is the best way to do it using HDFS or Linux commands?
We used to merge text files using the cat command, but will this work for Parquet as well?
Can we do it using HiveQL itself when writing output files, like how we do it with the repartition or coalesce method in Spark?
According to https://issues.apache.org/jira/browse/PARQUET-460, you can now download the source code and compile parquet-tools, which has a built-in merge command.
java -jar ./target/parquet-tools-1.8.2-SNAPSHOT.jar merge /input_directory/ /output_idr/file_name
Or using a tool like https://github.com/stripe/herringbone
You can also do it using HiveQL itself, if your execution engine is MapReduce. You can set a flag for your query that causes Hive to merge small files at the end of your job:
SET hive.merge.mapredfiles=true;
or
SET hive.merge.mapfiles=true;
if your job is a map-only job.
This will cause the Hive job to automatically merge many small Parquet files into fewer big files. You can control the number of output files by adjusting the hive.merge.size.per.task setting. If you want to have just one file, make sure you set it to a value that is always larger than the size of your output. Also, make sure to adjust hive.merge.smallfiles.avgsize accordingly: the merge job is triggered when the average output file size is below this threshold, so set it to a very high value if you want to make sure that Hive always merges files. You can read more about these settings in the Hive documentation.
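A minimal sketch of how these settings might be combined (the table names and size values are illustrative, not from the original answer):
-- Merge small files at the end of map-only and map-reduce jobs.
SET hive.merge.mapfiles=true;
SET hive.merge.mapredfiles=true;
-- Target size of the merged files; make it larger than the expected total output
-- if you want a single file (256 MB here is just an example).
SET hive.merge.size.per.task=268435456;
-- The merge job is triggered when the average output file is smaller than this threshold.
SET hive.merge.smallfiles.avgsize=134217728;
-- Rewriting the table (placeholder names) then produces the merged Parquet output.
INSERT OVERWRITE TABLE my_merged_table SELECT * FROM my_small_files_table;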
Using duckdb:
import duckdb
duckdb.execute("""
COPY (SELECT * FROM '*.parquet') TO 'merge.parquet' (FORMAT 'parquet');
""")