Impala external table reading uncompressed files whose names end with (*.csv.gz) - compression

I have a data source in HDFS whose files are NOT compressed, even though their names end with (*.csv.gz), and because of the extension Impala does not recognize that they are uncompressed. Is there a way to read those files in the external table without having to rename all the current files? And if there is not, what is the best practice for renaming all the current files in HDFS?
Here is the current creation query for the table:
CREATE EXTERNAL TABLE db.table1(
col1 type,
col2 type
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\u0001'
WITH SERDEPROPERTIES ('field.delim'='\u0001', 'serialization.format'=';')
STORED AS TEXTFILE
LOCATION 'hdfs://servicename/user/directory'
Example of current file names (they are text files and are not actually compressed at the content level):
-rw-rw-r--+ /final/file11_20210601_0000.csv.gz
-rw-rw-r--+ /final/file12_20210601_0015.csv.gz
-rw-rw-r--+ /final/file12_20210601_0045.csv.gz
-rw-rw-r--+ /final/file1_20210601_0015.csv.gz

So far, I have found no external table property that lets Impala read these ".gz"-named files as plain text, but I could write a shell script that renames all the files and strips the ".gz" from their ends:
# list the files (oldest first) and strip the trailing ".gz" from each name
for f in $(hdfs dfs -ls -t -r /user/dir/ | awk '{print $8}'); do
  v="${f%.gz}"   # remove only the final ".gz" suffix, keeping ".csv"
  hdfs dfs -mv "$f" "$v"
done
But I am still open to solutions to read the .gz files directly in the external table.

Related

Call the added file in Hive to use for UDF

I have a file that contains holidays, and my UDF needs this file in order to calculate the number of business days between two given dates. The issue I have is that when I add the file, it goes to a working directory, but this directory differs with every session.
Unlike the example below from Hive Resources, that is not what happens in my case.
hive> add FILE /tmp/tt.py;
hive> list FILES;
/tmp/tt.py
hive> select from networks a
MAP a.networkid
USING 'python tt.py' as nn where a.ds = '2009-01-04' limit 10;
This is what I am getting, and the alphanumeric part keeps changing:
/mnt/tmp/a17b43d5-df53-4eea-8e2c-565471b49d25_resources/holiday2021.csv
I need this file to be located in a more permanent folder, and this Hive SQL can be executed on any of the 18 nodes.
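One possible angle, sketched below under an assumption rather than taken from the question: files registered with ADD FILE are shipped to each task's working directory, so a transform script can usually open them by basename instead of the session-specific ..._resources path. The script name business_days.py, the date format, and the column layout are hypothetical.
#!/usr/bin/env python
# business_days.py - illustrative sketch, not a confirmed solution.
# Assumption: after `ADD FILE .../holiday2021.csv`, the file sits in the
# task's working directory, so it can be opened by basename here.
import csv
import sys
from datetime import date, timedelta

# Load the holiday dates that were shipped alongside this script.
with open("holiday2021.csv") as fh:
    holidays = {date.fromisoformat(row[0]) for row in csv.reader(fh) if row}

def business_days(start, end):
    """Count weekdays in [start, end] that are not holidays."""
    count, day = 0, start
    while day <= end:
        if day.weekday() < 5 and day not in holidays:
            count += 1
        day += timedelta(days=1)
    return count

# Hive TRANSFORM streams tab-separated rows on stdin; emit one count per row.
for line in sys.stdin:
    start_s, end_s = line.rstrip("\n").split("\t")[:2]
    print(business_days(date.fromisoformat(start_s), date.fromisoformat(end_s)))
Both the CSV and the script would then be registered with ADD FILE and wired in with USING 'python business_days.py', in the same way as the tt.py example above.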

AWS S3: How to plug in a dynamic file name in the S3 directory in COPY command

I have a job in Redshift that is responsible for pulling 6 files every month from S3. File names follow a standard naming convention such as "file_label_MonthNameYYYY_Batch01.CSV". I'd like to modify the COPY command below so that the file name in the S3 path is built dynamically and I don't have to hard-code the month name, the year, and the batch number. Batch numbers range from 1 to 6.
Currently, here is what I have, which is not efficient:
COPY tbl_name ( column_name1, column_name2, column_name3 )
FROM 'S3://bucket_name/folder_name/Static_File_Label_July2021_Batch01.CSV'
CREDENTIALS 'aws_access_key_id = xxx;aws_secret_access_key = xxxxx'
removequotes
EMPTYASNULL
BLANKSASNULL
DATEFORMAT 'MM/DD/YYYY'
delimiter ','
IGNOREHEADER 1;
COPY tbl_name ( column_name1, column_name2, column_name3 )
FROM 'S3://bucket_name/folder_name/Static_File_Label_July2021_Batch02.CSV'
CREDENTIALS 'aws_access_key_id = xxx;aws_secret_access_key = xxxxx'
removequotes
EMPTYASNULL
BLANKSASNULL
DATEFORMAT 'MM/DD/YYYY'
delimiter ','
IGNOREHEADER 1;
Next month the dynamic file name will change to August2021_Batch01 & August2021_Batch02, and so forth. Is there a way to do this? Thank you in advance.
There are lots of approaches to this. Which one is best for your case will depend on your circumstances. You need a layer in your process that configures the SQL for each month. Here are some ways to consider:
1. Use a manifest file - this file will list the S3 object names to load. Your processing / file prep can update this file.
2. Use a fixed load folder where the files are located for COPY, then move these files to a permanent storage location after the COPY.
3. Use variables in your bench (SQL client) to set the month value and substitute it when the SQL is issued to Redshift.
4. Write some code (a Lambda?) to issue the SQL you are looking for (see the sketch at the end of this answer).
5. Last I checked, you could leave the object name incomplete and all matching objects would be loaded. Leave off the batch number and suffix and load all the files with one text change.
It is desirable to load multiple files with a single COPY command (it uses more nodes in parallel), and options 1, 2, and 5 do this.
When specifying the FROM location of files to load, you can specify a partial filename.
Here is an example from COPY examples - Amazon Redshift:
The following example loads the SALES table with tab-delimited data from lzop-compressed files in an Amazon EMR cluster. COPY loads every file in the myoutput/ folder that begins with part-.
copy sales
from 'emr://j-SAMPLE2B500FC/myoutput/part-*'
iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
delimiter '\t' lzop;
Therefore, since COPY from Amazon S3 matches every object whose key starts with the path you give, you could specify:
FROM 'S3://bucket_name/folder_name/Static_File_Label_July2021_'
You would just need to change the month and year identifier. All files sharing that prefix would be loaded in one COPY.
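As a rough sketch of option 4 (an illustration, not part of the original answer): a small Python routine, for example in a Lambda, could build the month/year portion of the key prefix and hand the COPY statement to whatever Redshift client the job already uses (psycopg2, the Redshift Data API, etc.). The table, bucket, and folder names come from the question; the IAM role is the placeholder from the docs example above.
from datetime import date

def build_copy_sql(run_date, batch=None):
    """Build a COPY statement for the given month; leave batch as None to
    load every file that shares the month/year prefix in a single COPY."""
    month_year = run_date.strftime("%B%Y")                     # e.g. "July2021"
    suffix = "Batch{:02d}.CSV".format(batch) if batch else ""  # empty -> prefix match
    s3_path = "s3://bucket_name/folder_name/Static_File_Label_{}_{}".format(month_year, suffix)
    return """
        COPY tbl_name (column_name1, column_name2, column_name3)
        FROM '{}'
        IAM_ROLE 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
        REMOVEQUOTES EMPTYASNULL BLANKSASNULL
        DATEFORMAT 'MM/DD/YYYY' DELIMITER ',' IGNOREHEADER 1;
    """.format(s3_path)

# Example: the statement for the current month, loading all batches at once.
print(build_copy_sql(date.today()))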

How to create a dask dataframe from a CSV file stored in HDFS (many part files)

I am trying to create a Dask dataframe from a CSV file in HDFS. The CSV stored in HDFS is actually made up of many part files.
On the read_csv API call:
dd.read_csv("hdfs:<some path>/data.csv")
The following error occurs:
OSError: Could not open file: <some path>/data.csv, mode: rb Path is not a file: <some path>/data.csv
In fact, /data.csv is a directory containing many part files. I'm not sure if there is a different API to read such an HDFS CSV.
Dask does not know which files you intend to read from when you pass only a directory name. You should pass either a glob string used to search for files or an explicit list of files, e.g.,
df = dd.read_csv("hdfs:///some/path/data.csv/*.csv")
Note the leading '/' after the colon: all hdfs paths begin this way.
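For completeness, a sketch of both forms (the paths are placeholders; Hadoop part files are often named part-00000 and so on without a .csv extension, in which case the glob needs to match that pattern instead):
import dask.dataframe as dd

# Form 1: a glob over the part files inside the "data.csv" directory.
df = dd.read_csv("hdfs:///some/path/data.csv/part-*")

# Form 2: an explicit list of files, if the parts need to be hand-picked.
df = dd.read_csv([
    "hdfs:///some/path/data.csv/part-00000",
    "hdfs:///some/path/data.csv/part-00001",
])

print(df.head())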

Parquet partitioning and HDFS filesize

My data are in the form of relatively small Avro records, written in Parquet files (on average < 1mb).
Up to now, I have used my local filesystem to do some tests with Spark.
I partitioned the data using a hierarchy of directories.
I wonder if it would be better to "build" the partitioning onto the Avro record and accumulate bigger files... However, I imagine that partitioned Parquet files would "map" onto HDFS partitioned files too.
What approach would be best?
Edit (clarifying based on comments):
"build the partitioning onto the Avro record": imagine that my directory structure is P1=/P2=/file.avro and that the Avro record contains the fields F1 and F2. I could save all of that in a single Avro file containing the fields P1, P2, F1 and F2. I.e., there is no need for a partitioning structure with directories, as it is all present in the Avro records.
About Parquet partitions and HDFS partitions: will HDFS split a big Parquet file across different machines, and will that correspond to distinct Parquet partitions? (I don't know if that clarifies my question - if not, it means I don't really understand.)
The main reasoning behind partitioning at the folder level is that when Spark, for instance, reads the data and there is a filter on the partitioned column (extracted from the folder name, as long as the format is path/partitionName=value), it will only read the needed folders instead of reading everything and then applying the filter. So if you want to use this mechanism, use a hierarchy in your folder structure (I use it often).
Generally speaking, I would recommend avoiding many folders with little data in them (not sure if that is the case here).
About Spark input partitioning (same word, different meaning): when reading from HDFS, Spark will try to read files so that its partitions match the files on HDFS (to prevent shuffling), so if the data is split into files on HDFS, Spark will match the same partitions. To my knowledge, HDFS does not partition files, rather it replicates them (to increase reliability), so I think a single large Parquet file will translate to a single file on HDFS, which will be read into a single partition unless you repartition it or define the number of partitions when reading (there are several ways to do that depending on the Spark version; see this).
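To make the pruning point concrete, here is a minimal PySpark sketch (not from the original answer; the HDFS path is a placeholder and the column names P1, P2, F1, F2 are borrowed from the question): writing with partitionBy produces the path/P1=value/P2=value layout, and a filter on P1 then only touches the matching folders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partition-pruning-sketch").getOrCreate()

# A toy dataframe with partition columns P1, P2 and payload columns F1, F2.
df = spark.createDataFrame(
    [("a", "x", 1, 10.0), ("b", "y", 2, 20.0)],
    ["P1", "P2", "F1", "F2"],
)

# partitionBy creates the hierarchy .../P1=a/P2=x/part-*.parquet on write.
df.write.mode("overwrite").partitionBy("P1", "P2").parquet("hdfs:///tmp/demo_table")

# A filter on the partition column lets Spark read only the matching folders
# (partition pruning) instead of scanning everything and filtering afterwards.
subset = spark.read.parquet("hdfs:///tmp/demo_table").where("P1 = 'a'")
subset.show()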

How to merge multiple parquet files into a single parquet file using Linux or HDFS commands?

I have multiple small Parquet files generated as the output of a Hive QL job, and I would like to merge the output files into a single Parquet file.
What is the best way to do it using HDFS or Linux commands?
We used to merge text files using the cat command, but will this work for Parquet as well?
Can we do it using HiveQL itself when writing the output files, like we do with the repartition or coalesce methods in Spark?
According to https://issues.apache.org/jira/browse/PARQUET-460, you can now download the source code and compile parquet-tools, which has a built-in merge command:
java -jar ./target/parquet-tools-1.8.2-SNAPSHOT.jar merge /input_directory/ /output_idr/file_name
Or using a tool like https://github.com/stripe/herringbone
You can also do it using HiveQL itself, if your execution engine is mapreduce.
You can set a flag for your query, which causes hive to merge small files at the end of your job:
SET hive.merge.mapredfiles=true;
or
SET hive.merge.mapfiles=true;
if your job is a map-only job.
This will cause the Hive job to automatically merge many small Parquet files into fewer big files. You can control the number of output files by adjusting the hive.merge.size.per.task setting. If you want to have just one file, make sure you set it to a value that is always larger than the size of your output. Also, make sure to adjust hive.merge.smallfiles.avgsize accordingly; set it to a very low value if you want to make sure that Hive always merges files. You can read more about these settings in the Hive documentation.
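As a hedged end-to-end sketch of the above, assuming the statements are issued from Python through PyHive (the host, database, and table names are placeholders and the byte sizes are only illustrative; the merge itself is the Hive behaviour described above):
from pyhive import hive  # assumes a reachable HiveServer2 and the PyHive client

# Hypothetical connection details; adjust to your cluster.
conn = hive.connect(host="hiveserver2-host", port=10000, database="default")
cur = conn.cursor()

# Ask Hive to merge small output files at the end of the (MapReduce) job.
cur.execute("SET hive.merge.mapredfiles=true")
cur.execute("SET hive.merge.mapfiles=true")
# Target size per merged file; set it larger than the total output for one file.
cur.execute("SET hive.merge.size.per.task=256000000")
cur.execute("SET hive.merge.smallfiles.avgsize=134217728")

# Rewriting the data (here into a hypothetical compacted copy) triggers the merge.
cur.execute("INSERT OVERWRITE TABLE my_table_compacted SELECT * FROM my_table")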
Using duckdb:
import duckdb
duckdb.execute("""
COPY (SELECT * FROM '*.parquet') TO 'merge.parquet' (FORMAT 'parquet');
""")