Big query table update from reference CSV file - google-cloud-platform

I have a BigQuery table and I want to update the content of a few rows in it from a reference CSV file. This CSV file is uploaded to a Google Cloud Storage bucket.

When you use an external table over Cloud Storage, you can only read the CSVs, not update them.
However, you can load your CSV into a BigQuery native table, perform the update with DML, and then export the table back to CSV. That only works if you have a single CSV.
If you have several CSV files, you can at least query the pseudo column _FILE_NAME to identify the files where you need to perform the change, but the change itself will have to be performed manually or with the previous solution (native table).
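A minimal sketch of that native-table flow in BigQuery SQL, assuming hypothetical names (my_dataset.target_table, my_dataset.reference_data, gs://my_bucket/reference.csv, and a shared key column id):

-- Load the reference CSV from Cloud Storage into a native table
LOAD DATA OVERWRITE my_dataset.reference_data
FROM FILES (
  format = 'CSV',
  skip_leading_rows = 1,
  uris = ['gs://my_bucket/reference.csv']
);

-- Update the target rows from the reference table with DML
UPDATE my_dataset.target_table t
SET value = r.value
FROM my_dataset.reference_data r
WHERE t.id = r.id;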

Related

Athena tables having history of records of every csv

I am uploading CSV files to an S3 bucket, creating tables through a Glue crawler, viewing the tables in Athena, connecting Athena to QuickSight, and showing the results graphically there.
But what I need to do now is keep a history of the files uploaded. Instead of a new CSV file being uploaded and the crawler updating the table, can I have the crawler save each record separately? Is that even a reasonable thing to do? I worry it would create so many tables that it would become a mess.
I'm just trying to figure out a way to keep a history of previous records. How can I achieve this?
When you run an Amazon Athena query, Athena will look at the location parameter defined in the table's DDL. This specifies where the data is stored in an Amazon S3 bucket.
Athena will include all files in that location when it runs the query on that table. Thus, if you wish to add more data to the table, simply add another file in that S3 location. To replace data in that table, you can overwrite the file(s) in that location. To delete data, you can delete files from that location.
There is no need to run a crawler on a regular basis. The crawler can be used to create the table definition and it can be run again to update the table definition if anything has changed. But you typically only need to use the crawler once to create the table definition.
If you wish to preserve historical data in the table while adding more data to the table, simply upload the data to new files and keep the existing data files in place. That way, any queries will include both the historical data and the new data because Athena simply looks at all the files in that location.
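For illustration, a minimal Athena DDL sketch with hypothetical names, showing that the table is just a schema plus an S3 location, and that every file under that location is read at query time:

CREATE EXTERNAL TABLE my_db.uploads (
  id string,
  value double,
  loaded_at timestamp
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION 's3://my-bucket/uploads/';
-- Any new file placed under s3://my-bucket/uploads/ is included in the next query,
-- and keeping the older files in place preserves the history.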

Loading data from Google Cloud Storage to Bigquery while adding additional columns

I want to load data from a Parquet file in Google Cloud Storage into BigQuery. While loading, I also want to add a couple of extra columns to the table that are not present in the source file, such as insert_time_stamp and source_file_name. After doing some research I found these options:
Create a temporary table linked to the file in GCS and then load the data from the temporary table, along with the additional columns, into the final BigQuery table (a sketch of this follows below).
Load the data from the Parquet file into a pandas DataFrame, add the two extra columns, and then use pandas.DataFrame.to_gbq or client.load_table_from_dataframe to load the data into the BigQuery table.
Load the data from the Parquet file into a staging table (by this I mean a normal table) in BigQuery, then use this table to populate the final table, as in "insert into final_table select *, current_timestamp as insert_time_stamp, <file_name> as source_file_name from staging_table", and finally drop the staging table.
If the number of rows in the source file is in the millions, what would be the best approach to take?
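A minimal BigQuery SQL sketch of the first option, with hypothetical names (my_dataset.temp_ext, my_dataset.final_table, gs://my_bucket/path/); it assumes final_table's schema ends with the two extra columns and relies on the _FILE_NAME pseudo column available for Cloud Storage-backed external tables:

-- External table over the Parquet file(s) in GCS (schema is read from the files)
CREATE EXTERNAL TABLE my_dataset.temp_ext
OPTIONS (
  format = 'PARQUET',
  uris = ['gs://my_bucket/path/*.parquet']
);

-- Copy into the final table, adding the two extra columns
INSERT INTO my_dataset.final_table
SELECT
  *,
  CURRENT_TIMESTAMP() AS insert_time_stamp,
  _FILE_NAME AS source_file_name
FROM my_dataset.temp_ext;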

create table and load data in redshift from Parquet file in S3

I have my Parquet file in S3. I want to load it into a Redshift table. I don't know the schema of the Parquet file.
Is there any command to create a table and then copy the Parquet data into it?
Also, I want to add a default time column: date timestamp DEFAULT to_char(CURRDATE, 'YYYY-MM-DD').
You first need to create an external schema; normal columnar schemas do not support Parquet files.
Then you need to create an external table matching the columns of the S3 file, so look for a good mapping between the file and the data types. I usually avoid small ints, for example, and map floats to the REAL data type.
Then run the COPY command below:
COPY schema_name.table_name
FROM 'bucket_object'
IAM_ROLE 'user_arn'
FORMAT AS PARQUET;
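For illustration, a sketch of those first two steps with hypothetical names (spectrum_schema, spectrum_db, the role ARN, and the column list; adjust the data types to your file):

CREATE EXTERNAL SCHEMA spectrum_schema
FROM DATA CATALOG
DATABASE 'spectrum_db'
IAM_ROLE 'arn:aws:iam::123456789012:role/my-spectrum-role'
CREATE EXTERNAL DATABASE IF NOT EXISTS;

CREATE EXTERNAL TABLE spectrum_schema.my_parquet_table (
  id bigint,
  name varchar(256),
  amount real
)
STORED AS PARQUET
LOCATION 's3://my-bucket/path/';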

Export athena table to S3 as one readable file

I am baffled: I cannot figure out how to export a successfully run CREATE TABLE statement to a single CSV.
The query "saves" the result of my CREATE TABLE command in an appropriately named S3 bucket, partitioned into 60 (!) files. Alas, these files are not readable text files.
CREATE TABLE targetsmart_idl_data_pa_mi_deduped_maid AS
SELECT *
FROM targetsmart_idl_data_pa_mi_deduped_aaid
UNION ALL
SELECT *
FROM targetsmart_idl_data_pa_mi_deduped_idfa
How can I save this table to S3, as a single file, CSV format, without having to download and re-upload it?
If you want the result of a CTAS query statement to be written into a single file, you need to bucket by one of the columns in your resulting table. To get the resulting files in CSV format, you need to specify the table's format and field delimiter properties.
CREATE TABLE targetsmart_idl_data_pa_mi_deduped_maid
WITH (
format = 'TEXTFILE',
field_delimiter = ',',
external_location = 's3://my_athena_results/ctas_query_result_bucketed/',
bucketed_by = ARRAY['__SOME_COLUMN__'],
bucket_count = 1)
AS (
SELECT *
FROM targetsmart_idl_data_pa_mi_deduped_aaid
UNION ALL
SELECT *
FROM targetsmart_idl_data_pa_mi_deduped_idfa
);
Athena is a distributed system, and it will scale the execution of your query by some unobservable mechanism. Note that even explicitly specifying a bucket count of one might still produce multiple files [1].
See the Athena documentation for more information on its syntax and what can be specified within the WITH directive. Also, don't forget about the considerations and limitations for CTAS queries, e.g. the external_location for storing CTAS query results in Amazon S3 must be empty.
Update 2019-08-13
Apparently, the results of CTAS statements are compressed with the GZIP algorithm by default. I couldn't find in the documentation how to change this behavior, so all you need to do is uncompress the output after downloading it locally. NOTE: the uncompressed files won't have a .csv file extension, but you will still be able to open them with a text editor.
Update 2019-08-14
You won't be able to preserve column names inside the files if you save them in CSV format. Instead, they are recorded in the AWS Glue metadata catalog, together with other information about the newly created table.
If you want to preserve column names in the output files after executing CTAS queries, consider file formats that inherently do that, e.g. JSON or Parquet, by using the format property within the WITH clause. The choice of file format really depends on the use case and the size of the data. Go with JSON if your files are relatively small and you want to download them and read their content virtually anywhere. If the files are big and you plan to keep them on S3 and query them with Athena, go with Parquet.
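For example, the same CTAS could write Parquet instead of CSV just by changing the format property (the output location here is hypothetical and, per the limitation above, must be empty):

CREATE TABLE targetsmart_idl_data_pa_mi_deduped_maid_parquet
WITH (
format = 'PARQUET',
external_location = 's3://my_athena_results/ctas_query_result_parquet/',
bucketed_by = ARRAY['__SOME_COLUMN__'],
bucket_count = 1)
AS (
SELECT *
FROM targetsmart_idl_data_pa_mi_deduped_aaid
UNION ALL
SELECT *
FROM targetsmart_idl_data_pa_mi_deduped_idfa
);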
Athena stores query results in Amazon S3.
A results file is stored automatically in CSV format (*.csv), so results can be exported to a CSV file without a CREATE TABLE statement (https://docs.aws.amazon.com/athena/latest/ug/querying.html).
Execute the Athena query using the StartQueryExecution API, and the results .csv can be found at the output location specified in the API call.
(https://docs.aws.amazon.com/athena/latest/APIReference/API_StartQueryExecution.html)

Big Query can't query some csvs in Cloud Storage bucket

I created a permanent BigQuery table that reads some CSV files from a Cloud Storage bucket sharing the same prefix (filename*.csv) and the same schema.
There are some CSVs, however, that make BigQuery queries fail with a message like the following: "Error while reading table: xxxx.xxxx.xxx, error message: CSV table references column position 5, but line starting at position:10 contains only 2 columns."
By moving the CSVs out of the bucket one by one, I identified the one responsible.
This CSV file doesn't have 10 lines...
I found this ticket, BigQuery error when loading csv file from Google Cloud Storage, so I thought the issue was an empty line at the end. But other CSVs in my bucket have one too, so this can't be the reason.
On the other hand, this CSV is the only one with content type text/csv; charset=utf-8; all the others are text/csv, application/vnd.ms-excel, or application/octet-stream.
Furthermore, after downloading this CSV to my local Windows machine and uploading it again to Cloud Storage, the content type is automatically converted to application/vnd.ms-excel.
Then, even with the missing line, BigQuery can query the permanent table based on filename*.csv.
Is it possible that BigQuery has issues querying CSVs with UTF-8 encoding, or is it just a coincidence?
Use Google Cloud Dataprep to load your CSV file. Once the file is loaded, analyze the data and clean it if required.
Once all the rows are cleaned, you can sink that data into BigQuery.
Dataprep is a GUI-based ETL tool, and it runs a Dataflow job internally.
Do let me know if any more clarification is required.
Just to note the issue: the CSV file had gzip set as its content encoding, which was the reason BigQuery didn't interpret it as a CSV file.
According to the documentation, BigQuery expects CSV data to be UTF-8 encoded:
"encoding": "UTF-8"
In addition, since this issue is related to the metadata of the files in GCS, you can edit the metadata directly from the Console.