Is it possible to validate the column order when uploading data from flat files using aws copy command - amazon-web-services

I'm uploading data from zipped flat files to Redshift using the COPY command. Is there any way to validate that the column order of the files is correct? (For example, if the fields are all VARCHAR, the data could be loaded into the wrong columns.)
The COPY command documentation shows that you can specify the column order, but not for flat files, so I was wondering whether there are any other approaches that would let me check how the columns have been supplied (for example, loading only the header row into a dummy table to check, but that doesn't seem to be possible).

You can't really do this inside Redshift. COPY doesn't provide any options to only load a specific number of rows or perform any validation.
Your best option would be to do this in the tool where you schedule the loads. You can get the first line from a compressed file easily enough (zcat < file.z|head -1) but for a file on S3 you may have to download the whole thing first.
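For example, a minimal sketch of such a pre-load check in Python, assuming gzip-compressed files and hypothetical bucket, key and column names:
import gzip
import boto3

EXPECTED_COLUMNS = ["id", "name", "sur_name"]  # order the target table expects (hypothetical)

s3 = boto3.client("s3")
obj = s3.get_object(Bucket="my-load-bucket", Key="incoming/file.csv.gz")

# Decompress the start of the object and read the header line
with gzip.GzipFile(fileobj=obj["Body"]) as f:
    header = f.readline().decode("utf-8").strip()

columns = [c.strip() for c in header.split(",")]
if columns != EXPECTED_COLUMNS:
    raise ValueError(f"Unexpected column order: {columns}")
# ...otherwise go ahead and issue the COPY command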
FWIW, the process generating the load file should be fully automated in such a way that the column order can't change. If these files are being manually prepared you're asking for all sorts of trouble.

Related

Amazon Redshift: Check column names during COPY

Can I check column names during COPY from S3 to Redshift?
For example, I have "good" CSV:
name ,sur_name
BOB , FISCHER
And I have "wrong" CSV:
sur_name,name
FISCHER , BOB
Can I check the column names during the COPY command?
I don't want to use AWS Glue or AWS Lambda for checks because I don't want to open/load/save the same file many times.
(The same problem for other files with columns names.)
This is a very simple check, so Redshift should allow it, but I can't find any information about it.
Or is this simply not possible? Can you give me some idea of how to do it without reading the whole file?
(For example, a Lambda function that reads only the headers without downloading the entire file.)
From Column mapping options - Amazon Redshift:
You can specify a comma-separated list of column names to load source data fields into specific target columns. The columns can be in any order in the COPY statement, but when loading from flat files, such as in an Amazon S3 bucket, their order must match the order of the source data.
Therefore, the only way to read such files would be to specify the column names in their correct order. This requires you to look inside the file to determine the order of the columns.
When reading an object from Amazon S3, it is possible to specify the range of bytes to be read. So, instead of reading the entire file, it could read just the first 200 bytes (or whatever size would be sufficient to include the header row). An AWS Lambda function could read these bytes, extract the column names, then generate a COPY command that would import the columns in the correct order (without having to read the entire file first).
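A rough sketch of that approach, assuming an uncompressed CSV whose header fits in the first 200 bytes; the bucket, key, table and IAM role names here are hypothetical:
import boto3

s3 = boto3.client("s3")

# Fetch only the first 200 bytes of the object instead of the whole file
resp = s3.get_object(Bucket="my-bucket", Key="incoming/data.csv", Range="bytes=0-199")
header = resp["Body"].read().decode("utf-8").splitlines()[0]
columns = [c.strip() for c in header.split(",")]

# Build a COPY statement that lists the columns in the order the file supplies them
copy_sql = (
    f"COPY my_table ({', '.join(columns)}) "
    "FROM 's3://my-bucket/incoming/data.csv' "
    "IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole' "
    "CSV IGNOREHEADER 1;"
)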

What is AWS S3 dataset?

Looking at documentation of awswrangler.s3.to_csv or awswrangler.s3.to_parquet, there is a dataset parameter.
From testing, it looks like setting dataset=True allows, among other things, appending new data to an already existing set. It also looks like when dataset=True, I can't specify the file name and AWS autogenerates the names of the files which are added to the specified path.
Apart from that, I can't find more information on what dataset means. Is it just referring to the general concept or is there a specific meaning within the context of AWS? What exactly is dataset and when should it be set to True?
The dataset=True option allows you to store the entire dataset, including all metadata, indexes, etc.
The dataset parameter documentation:
dataset (bool) – If True store as a dataset instead of ordinary file(s) If True, enable all follow arguments: partition_cols, mode, database, table, description, parameters, columns_comments, concurrent_partitioning, catalog_versioning, projection_enabled, projection_types, projection_ranges, projection_values, projection_intervals, projection_digits, catalog_id, schema_evolution.
Note all those extra things that get saved when you save a dataset. All that information, like columns_comments, concurrent_partitioning, projection_values, will be lost when you save to CSV or Parquet. But on the other hand, those values are probably only useful if you plan to do further manipulation of the data via awswrangler/pandas at some later date.
Also note that if you set dataset=True you have to give it a file name prefix instead of a single file name, because the output generated will be spread across multiple files.
If you want to use the data in any other tool besides Pandas, such as loading the CSV into Excel, then you most likely want to set dataset=False and output to a single file.
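As an illustration, here is a minimal sketch of the two modes; the bucket paths and column names are hypothetical:
import awswrangler as wr
import pandas as pd

df = pd.DataFrame({"name": ["BOB"], "sur_name": ["FISCHER"]})

# dataset=False (the default): write exactly one file at the path you give
wr.s3.to_csv(df, path="s3://my-bucket/exports/people.csv", index=False)

# dataset=True: treat the prefix as a dataset; file names are auto-generated,
# and options such as mode="append" and partition_cols become available
wr.s3.to_parquet(
    df,
    path="s3://my-bucket/people_dataset/",
    dataset=True,
    mode="append",
    partition_cols=["sur_name"],
)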

Why are tables segmented when exporting to parquet from AWS RDS

We use Python's boto3 library to execute start_export_task to trigger an RDS snapshot export (to S3). This successfully generates a directory in S3 that has a predictable, expected structure. Traversing down through that directory to any particular table directory (as in export_identifier/database_name/schema_name.table_name/) I see several .parquet files.
I download several of these files and convert them to pandas dataframes so I can look at them. They are all structured the same and clearly seem to be pieces of the same table, but they range in size from 100 KB to 8 MB in seemingly unpredictable segments. Do these files/'pieces' of the table account for all its rows? Do they repeat/overlap at all? Why are they segmented so (seemingly) randomly? What parameters control this segmenting?
Ultimately I'm looking for documentation on this part of parquet folder/file structure. I've found plenty of information on how individual files are structured and partitioning. But I think this falls slightly outside of those topics.
You're not going to like this, but from AWS' perspective this is an implementation detail and according to the docs:
The file naming convention is subject to change. Therefore, when reading target tables we recommend that you read everything inside the base prefix for the table.
— docs
Most of the tools that work with Parquet don't really care about the number or file names of the parquet files. You just point something like Spark or Athena to the prefix of the table and it will read all the files and figure out how they fit together.
In the API there are also no parameters to influence this behavior. If you prefer a single file for aesthetic reasons or others, you could use something like a Glue Job to read the table prefixes, coalesce the data per table in a single file and write it to S3.
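As a minimal sketch of that idea, using awswrangler rather than a Glue job and a hypothetical bucket name (everything has to fit in memory this way, which is why a Spark-based Glue job scales better for large tables):
import awswrangler as wr

# Read every parquet file under the table's base prefix; the individual file
# names and sizes don't matter, the pieces are combined into one dataframe
prefix = "s3://my-export-bucket/export_identifier/database_name/schema_name.table_name/"
df = wr.s3.read_parquet(path=prefix, dataset=True)

# Optionally rewrite the table as a single coalesced file elsewhere
wr.s3.to_parquet(df, path="s3://my-export-bucket/coalesced/table_name.parquet")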

BigQuery EXPORT DATA statement creating multiple files with no data and just a header record

I have read about a similar issue here but am not able to tell whether it has been fixed.
Google bigquery export table to multiple files in Google Cloud storage and sometimes one single file
I am using the BigQuery EXPORT DATA OPTIONS below to export the data from 2 tables into a file. I have written a SELECT query for this.
EXPORT DATA OPTIONS(
uri='gs://whr-asia-datalake-dev-standard/outbound/Adobe/Customer_Master_'||CURRENT_DATE()||'*.csv',
format='CSV',
overwrite=true,
header=true,
field_delimiter='|') AS
SELECT
My SELECT query returns only 2 rows, so I assume that only one file should be created in Google Cloud Storage. As I understand it, multiple files are created only when the data is more than 1 GB.
However, 3 files were created in Cloud Storage: 2 files just had the header record, and the third file has 3 records (one header and 2 actual data records).
radhika_sharma_ibm#cloudshell:~ (whr-asia-datalake-nonprod)$ gsutil ls gs://whr-asia-datalake-dev-standard/outbound/Adobe/
gs://whr-asia-datalake-dev-standard/outbound/Adobe/
gs://whr-asia-datalake-dev-standard/outbound/Adobe/Customer_Master_2021-02-04000000000000.csv
gs://whr-asia-datalake-dev-standard/outbound/Adobe/Customer_Master_2021-02-04000000000001.csv
gs://whr-asia-datalake-dev-standard/outbound/Adobe/Customer_Master_2021-02-04000000000002.csv
Why are empty files getting created?
Can anyone please help? We don't want to create empty files. I believe only one file should be created when the data is under 1 GB; over 1 GB, we should have multiple files, but not empty ones.
You have to force all data to be loaded into one worker. In this way you will be exporting only one file (if < 1 GB).
My workaround: add a SELECT DISTINCT * on top of the SELECT statement.
Under the hood, BigQuery utilizes multiple workers to read and process different sections of data and when we use wildcards, each worker would create a separate output file.
Currently BigQuery produces empty files even if no data is returned and thus we get multiple empty files. The Bigquery product team is aware of this issue and they are working to fix this, however there is no ETA which can be shared.
There is a public issue tracker that will be updated with periodic progress. You can star the issue to receive automatic updates and give it traction.
However for the time being I would like to provide a workaround as follows:
If you know that the output will be less than 1 GB, you can specify a single URI to get a single output file. However, the EXPORT DATA statement doesn't support a single URI.
You can use the bq extract command to export the BQ table.
bq --location=location extract \
--destination_format format \
--compression compression_type \
--field_delimiter delimiter \
--print_header=boolean \
project_id:dataset.table \
gs://bucket/filename.ext
In fact, bq extract should not have the empty-file issue that the EXPORT DATA statement has, even when you use a wildcard URI.
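If you are scripting this, a rough Python equivalent of that bq extract call using the google-cloud-bigquery client is sketched below; the project, dataset, table and location are hypothetical, and the bucket path follows the question's example:
from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.ExtractJobConfig(
    destination_format="CSV",
    field_delimiter="|",
    print_header=True,
)

# A single (non-wildcard) destination URI yields a single output file,
# provided the exported data stays under the 1 GB per-file limit
extract_job = client.extract_table(
    "my-project.my_dataset.customer_master",
    "gs://whr-asia-datalake-dev-standard/outbound/Adobe/Customer_Master.csv",
    job_config=job_config,
    location="US",
)
extract_job.result()  # wait for the export to finish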
I faced the same empty-files issue when using EXPORT DATA.
After a bit of R&D I found a solution: put LIMIT xxx in your SELECT SQL and it will do the trick.
You can find the row count and use that as the LIMIT value.
SELECT ....
FROM ...
WHERE ...
LIMIT xxx
It turns out you need to allow for multiple files by using the wildcard syntax: either a wildcard file name for CSV or a folder for other formats like AVRO.
The uri option must be a single-wildcard URI as described
https://cloud.google.com/bigquery/docs/reference/standard-sql/other-statements
Specifying a wildcard seems to start several workers on the extract, and, as per the documentation, the size of the exported files will vary.
Zero-length files are unusual but technically possible if the first worker is done before any others really get started. That is why the wildcard is expected to be used only when you think your exported data will be larger than 1 GB.
I have just faced the same issue with Parquet but found that the bq CLI works, which should do for any format.
See (and star for traction) https://issuetracker.google.com/u/1/issues/181016197

Google Dataprep: Save GCS file name as one of the column

I have a Dataprep flow configured. The Dataset is a GCS folder (all files from it). Target is BigQuery table.
Since data is coming from multiple files, I want to have the filename as one of the columns in the resulting data.
Is that possible?
UPDATE: There's now a source metadata reference called $filepath—which, as you would expect, stores the local path to the file in Cloud Storage (starting at the top-level bucket). You can use this in formulas or add it to a new formula column and then do anything you want in additional recipe steps. (If your data source sample was created before this feature, you'll need to generate a new sample in order to see it in the interface)
Full notes for these metadata fields are available here: https://cloud.google.com/dataprep/docs/html/Source-Metadata-References_136155148
Original Answer
This is not currently possible out of the box. If you're manually merging datasets with UNION, you could first process them to add a column with the source so that it's then present in the combined output.
If you're bulk-ingesting files, that doesn't help, but there is an open feature request that you can comment on and/or follow for updates:
https://issuetracker.google.com/issues/74386476