Load parquet files to bq with no rows - google-cloud-platform

I am trying to load Parquet files into BigQuery with the bq command. These files have no rows but do have columns. The columns change dynamically, so I can't provide a fixed schema. Is there any other way to do it?
I already tried the bq command line with a schema JSON file as a parameter, but the columns change dynamically.
Secondly, auto-detect won't work because there are no rows to detect the schema from.

Related

Loading data from Google Cloud Storage to Bigquery while adding additional columns

I want to load data from a Parquet file in Google Cloud Storage into BigQuery. While loading, I also want to add a couple of extra columns that are not present in the source file, such as insert_time_stamp and source_file_name. After doing some research I found these options:
Create a temporary table linked to the file in GCS, then load the data from the temporary table, along with the additional columns, into the final BigQuery table.
Load the data from the Parquet file into a pandas DataFrame, add the two extra columns, and then use pandas.DataFrame.to_gbq or client.load_table_from_dataframe to load the data into the BigQuery table.
Load the data from the Parquet file into a staging table (by this I mean a normal table) in BigQuery, then populate the final table from it with "insert into final_table select *, current_timestamp as insert_time_stamp, <file_name> as source_file_name from staging_table", and finally drop the staging table.
If the source file has millions of rows, which approach would be best?
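The staging-table option can be sketched as a small helper that builds the INSERT statement. This is only an illustration: the function name and the table names in the usage line are hypothetical, and in real code a query parameter would be a safer way to pass the file name than string formatting.

```python
def build_insert_sql(final_table: str, staging_table: str, source_file_name: str) -> str:
    """Build an INSERT ... SELECT that copies staged rows and adds two columns.

    The table names and the file name are supplied by the caller (hypothetical here).
    """
    # Escape backslashes and single quotes so the file name is a valid string literal.
    escaped = source_file_name.replace("\\", "\\\\").replace("'", "\\'")
    return (
        f"INSERT INTO `{final_table}`\n"
        f"SELECT *, CURRENT_TIMESTAMP() AS insert_time_stamp, "
        f"'{escaped}' AS source_file_name\n"
        f"FROM `{staging_table}`"
    )


# Hypothetical names, for illustration only.
print(build_insert_sql("my_dataset.final_table", "my_dataset.staging_table", "gs://my-bucket/data.parquet"))
```

Because the statement has no column list, it assumes the final table's columns are the staging columns followed by insert_time_stamp and source_file_name, in that order.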

Big query table update from reference CSV file

I have a BigQuery table and I want to update the contents of a few of its rows from a reference CSV file. This CSV file is uploaded to a Google Cloud Storage bucket.
When you use an external table over Cloud Storage, you can only read the CSV files, not update them.
However, you can load your CSV into a native BigQuery table, perform the update with DML, and then export the table back to CSV. That only works if you have a single CSV file.
If you have several CSV files, you can at least select the pseudo column _FILE_NAME to identify the files where the change is needed. The change itself will still have to be made manually or with the previous solution (native table).

AWS Glue crawler - Order of columns in input files

I have created two partitions in an S3 bucket and am loading a CSV file into each folder. I then run the Glue crawler on these files, which registers them as a table in the Glue catalog that I am able to query via Athena.
Partition-1: Loading a CSV file into S3; the file has 5 columns
Partition-2: Loading a CSV file into S3; the file has the same 5 columns as above, but in a different order compared to (1)
When I run the crawler the first time on (1), it creates the Glue table/schema. Later, when I upload the same data with the columns in a different order to a different partition as (2) and run the crawler, it just tries to map the second file to the schema already created for (1), which results in data issues.
Is the order of columns important in Glue? Does the crawler not identify the columns by name automatically, instead of expecting (2) to have the same order as (1)?
Order is important in CSV files. Any change makes the crawler treat the schema as different. However, if you use Parquet files, the column order can vary.
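If the files have to stay in CSV, one workaround is to rewrite each file into a single canonical column order before the crawler sees it. The sketch below is a minimal illustration using only the Python standard library; it assumes every file has a header row, and the function name is made up for this example.

```python
import csv
import io


def reorder_csv(text: str, canonical: list[str]) -> str:
    """Rewrite CSV text so its columns follow the canonical order.

    Assumes the first line of `text` is a header row naming every canonical column.
    """
    reader = csv.DictReader(io.StringIO(text))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=canonical, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        # Look each value up by column name, so the input order does not matter.
        writer.writerow({name: row[name] for name in canonical})
    return out.getvalue()


# Columns arrive as b,a and are written back as a,b.
print(reorder_csv("b,a\n2,1\n", ["a", "b"]))
```

A missing column raises KeyError, which is deliberate here: a file that does not match the expected schema should fail loudly rather than be loaded with shifted values.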

Export athena table to S3 as one readable file

I am baffled: I cannot figure out how to export the result of a successfully run CREATE TABLE statement to a single CSV.
The query "saves" the result of my CREATE TABLE command in an appropriately named S3 bucket, partitioned into 60 (!) files. Alas, these files are not readable text files.
CREATE TABLE targetsmart_idl_data_pa_mi_deduped_maid AS
SELECT *
FROM targetsmart_idl_data_pa_mi_deduped_aaid
UNION ALL
SELECT *
FROM targetsmart_idl_data_pa_mi_deduped_idfa
How can I save this table to S3, as a single file, CSV format, without having to download and re-upload it?
If you want the result of a CTAS query to be written into a single file, you need to bucket by one of the columns in your resulting table. To get the resulting files in CSV format, you need to specify the table's format and field delimiter properties.
CREATE TABLE targetsmart_idl_data_pa_mi_deduped_maid
WITH (
format = 'TEXTFILE',
field_delimiter = ',',
external_location = 's3://my_athena_results/ctas_query_result_bucketed/',
bucketed_by = ARRAY['__SOME_COLUMN__'],
bucket_count = 1)
AS (
SELECT *
FROM targetsmart_idl_data_pa_mi_deduped_aaid
UNION ALL
SELECT *
FROM targetsmart_idl_data_pa_mi_deduped_idfa
);
Athena is a distributed system, and it scales the execution of your query by a mechanism you cannot observe. Note that even when you explicitly specify a bucket count of one, you might still get multiple files [1].
See the Athena documentation for more information on the syntax and on what can be specified within the WITH clause. Also, don't forget the considerations and limitations for CTAS queries, e.g. the external_location for storing CTAS query results in Amazon S3 must be empty.
Update 2019-08-13
Apparently, the results of CTAS statements are compressed with GZIP by default. I couldn't find in the documentation how to change this behavior, so all you need to do is decompress the files after downloading them locally. NOTE: the decompressed files won't have a .csv file extension, but you will still be able to open them with text editors.
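For the decompression step, a minimal sketch with the Python standard library could look like the following. The function name and both paths are placeholders for this example; the source path stands for a result file already downloaded from S3.

```python
import gzip
import shutil


def gunzip_to_csv(src_path: str, dest_path: str) -> None:
    """Decompress a downloaded gzip file and write it under a new name."""
    with gzip.open(src_path, "rb") as src, open(dest_path, "wb") as dest:
        # Stream in chunks so large result files are not held in memory.
        shutil.copyfileobj(src, dest)
```

Calling it as `gunzip_to_csv("downloaded_result.gz", "result.csv")` (both names hypothetical) also gives the output the .csv extension that the raw file lacks.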
Update 2019-08-14
You won't be able to preserve column names inside the files if you save them in CSV format. Instead, they are recorded in the AWS Glue metadata catalog, together with other information about the newly created table.
If you want to preserve column names in the output files after executing CTAS queries, then you should consider file formats which inherently do that, e.g. JSON, Parquet etc. You can do that by using format property within WITH clause. Choice of file format really depends on a use case and size of data. Go with JSON if your files are relatively small and you want to download and be able to read their content virtually from anywhere. If files are big and you are planning to keep them on S3 and query with Athena, then go with Parquet.
Athena stores query results in Amazon S3.
A results file is stored automatically in CSV format (*.csv), so results can be exported to a CSV file without a CREATE TABLE statement (https://docs.aws.amazon.com/athena/latest/ug/querying.html).
Execute the Athena query using the StartQueryExecution API; the results .csv can be found at the output location specified in the API call (https://docs.aws.amazon.com/athena/latest/APIReference/API_StartQueryExecution.html).
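The API route can be sketched as below. The helper name is made up, and `athena` is assumed to be a client created elsewhere with `boto3.client("athena")`; the block itself only relies on the client's `start_query_execution` and `get_query_execution` calls. Since no database context is passed, table names in the SQL are assumed to be database-qualified.

```python
import time


def run_query(athena, sql: str, output_location: str, poll_seconds: float = 1.0) -> str:
    """Run a query and return the S3 path of its CSV result file.

    `athena` is assumed to be a boto3 Athena client supplied by the caller.
    """
    started = athena.start_query_execution(
        QueryString=sql,
        ResultConfiguration={"OutputLocation": output_location},
    )
    query_id = started["QueryExecutionId"]
    while True:
        execution = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]
        state = execution["Status"]["State"]
        if state == "SUCCEEDED":
            # Athena reports the full path of the result file here.
            return execution["ResultConfiguration"]["OutputLocation"]
        if state in ("FAILED", "CANCELLED"):
            raise RuntimeError(f"query {query_id} ended in state {state}")
        time.sleep(poll_seconds)
```

The result is still a single CSV only for a plain SELECT; this does not change how CTAS writes its output files.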

Find the source of athena query result

We have thousands of files stored in S3. These files are exposed to Athena so that we can query them. While debugging, I found that Athena returns multiple blank rows when I query for a specific id. Given that there are thousands of files, I am not sure where that data is coming from.
Is there a way to see the source file for each row in the Athena result?
There is a hidden column exposed by the Presto Hive connector: "$path"
This column exposes the path of the file that a particular row was read from.
Note: the column name is actually $path, but you need to double-quote it in SQL ("$path"), because $ is otherwise illegal in an identifier.
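As an illustration, the helper below builds a query that groups the matching rows by their source file, which narrows thousands of files down to the ones that actually contain the suspicious rows. The function name, the table name and the filter in the usage line are placeholders, not taken from the question.

```python
def source_file_query(table: str, condition: str) -> str:
    """Build a query that counts matching rows per source file.

    `table` and `condition` are placeholders supplied by the caller.
    """
    # "$path" must stay inside double quotes, otherwise $ is not a valid identifier character.
    return (
        'SELECT "$path" AS source_file, COUNT(*) AS row_count\n'
        f"FROM {table}\n"
        f"WHERE {condition}\n"
        'GROUP BY "$path"'
    )


# Hypothetical table and filter, for illustration only.
print(source_file_query("my_db.my_table", "id = '12345'"))
```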