Google BigQuery EXPORT DATA to CSV file on Storage - special characters written badly - google-cloud-platform

I need to export data of a BigQuery table into csv on Google Cloud Storage.
I used the following:
EXPORT DATA
OPTIONS(
uri=concat(path_file_output,'_*.csv'),
format='CSV',
overwrite=true,
header=true,
field_delimiter=';'
)
AS
SELECT * FROM my_bigquery_table
In my_bigquery_table there are string columns containing the character '€' that get mangled during the export:
for example, a field with '1234.56 €' comes out as '1234.56 â'.
Is there a way to avoid this?
In the Google documentation (https://cloud.google.com/bigquery/docs/reference/standard-sql/other-statements)
there aren't any other options for the export.

Microsoft will always be Microsoft... Reading the comments, the problem comes from Excel and its default encoding.
So, let me explain. Your system doesn't use UTF-8 as its encoding. In France, my system uses an ISO 8859 encoding, and when you open the file with Excel, it doesn't understand it. Same thing if you import a comma separated value file (the meaning of CSV) into Excel: it doesn't work in France (we are used to semicolon separated values).
Anyway, there is no straightforward way to open the file with Excel, but you can do it.
Open Excel and create a blank workbook
Go to Data, Get Data, From Text
Select your file and click on "Get Data"
Then you can configure your import. Select UTF-8 as the File Origin
Then continue with the other parameters. You can see a sample of your file and the result you will get.
Note: I have nothing against Microsoft, but when it comes to development, Microsoft is a nest of traps...
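A scripted alternative, if you do not want to repeat the import wizard for every file: BigQuery writes plain UTF-8, and Excel auto-detects UTF-8 only when the file starts with a BOM, so you can prepend one before opening the file. A minimal Python sketch, assuming the export has already been downloaded locally (the file names are placeholders):
# Re-encode a BigQuery CSV export as UTF-8 with a BOM so Excel detects the
# encoding automatically (file names are placeholders).
import codecs

src = "export_000000000000.csv"   # shard downloaded from Cloud Storage
dst = "export_for_excel.csv"

with open(src, "rb") as f:
    data = f.read()

with open(dst, "wb") as f:
    f.write(codecs.BOM_UTF8 + data)   # BigQuery output is already UTF-8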

Related

What is AWS S3 dataset?

Looking at the documentation for awswrangler.s3.to_csv or awswrangler.s3.to_parquet, there is a dataset parameter.
From testing, it looks like setting dataset=True allows, among other things, appending new data to an already existing set. It also looks like when dataset=True, I can't specify the file name and AWS autogenerates the names of the files that are added to the specified path.
Apart from that, I can't find more information on what dataset means. Is it just referring to the general concept or is there a specific meaning within the context of AWS? What exactly is dataset and when should it be set to True?
The dataset=True option allows you to store the entire dataset, including all metadata, indexes, etc.
The dataset parameter documentation:
dataset (bool) – If True store as a dataset instead of ordinary file(s) If True, enable all follow arguments: partition_cols, mode, database, table, description, parameters, columns_comments, concurrent_partitioning, catalog_versioning, projection_enabled, projection_types, projection_ranges, projection_values, projection_intervals, projection_digits, catalog_id, schema_evolution.
Note all those extra things that get saved when you save a dataset. All that information, like columns_comments, concurrent_partitioning, projection_values, will be lost when you save to CSV or Parquet. But on the other hand, those values are probably only useful if you plan to do further manipulation of the data via awswrangler/pandas at some later date.
Also note that if you set dataset=True you have to give it a file name prefix instead of a single file name, because the output generated will be spread across multiple files.
If you want to use the data in any other tool besides Pandas, such as loading the CSV into Excel, then you most likely want to set dataset=False and output to a single file.
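To make the difference concrete, here is a minimal sketch of both modes with awswrangler (the bucket and prefix names are made up for the example):
import awswrangler as wr
import pandas as pd

df = pd.DataFrame({"region": ["eu", "us"], "amount": [10.5, 20.0]})

# dataset=False (the default): one ordinary file whose name you choose.
wr.s3.to_csv(df, path="s3://my-bucket/exports/sales.csv", index=False)

# dataset=True: path is a prefix, file names are auto-generated, and the
# dataset-only arguments (mode, partition_cols, ...) become available.
wr.s3.to_csv(
    df,
    path="s3://my-bucket/sales_dataset/",
    dataset=True,
    mode="append",              # append to whatever is already stored there
    partition_cols=["region"],  # optional Hive-style partitioning
    index=False,
)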

BigQuery EXPORT DATA statement creating multiple files with no data and just a header record

I have read about a similar issue here but am not able to tell whether it has been fixed.
Google bigquery export table to multiple files in Google Cloud storage and sometimes one single file
I am using the BigQuery EXPORT DATA OPTIONS below to export the data from 2 tables into a file. I have written a SELECT query for this.
EXPORT DATA OPTIONS(
uri='gs://whr-asia-datalake-dev-standard/outbound/Adobe/Customer_Master_'||CURRENT_DATE()||'*.csv',
format='CSV',
overwrite=true,
header=true,
field_delimiter='|') AS
SELECT
I have only 2 rows returned by my SELECT query, so I assume that only one file should be created in Google Cloud Storage; multiple files are created only when the data is more than 1 GB. That's what I understand.
However, 3 files got created in Cloud Storage: 2 files had just the header record and the third file has 3 records (one header and 2 actual data records).
radhika_sharma_ibm#cloudshell:~ (whr-asia-datalake-nonprod)$ gsutil ls gs://whr-asia-datalake-dev-standard/outbound/Adobe/
gs://whr-asia-datalake-dev-standard/outbound/Adobe/
gs://whr-asia-datalake-dev-standard/outbound/Adobe/Customer_Master_2021-02-04000000000000.csv
gs://whr-asia-datalake-dev-standard/outbound/Adobe/Customer_Master_2021-02-04000000000001.csv
gs://whr-asia-datalake-dev-standard/outbound/Adobe/Customer_Master_2021-02-04000000000002.csv
Why are empty files getting created?
Can anyone please help? We don't want empty files. I believe only one file should be created when the data is under 1 GB; above 1 GB we should have multiple files, but NOT empty ones.
You have to force all the data to be loaded into one worker. This way you will export only one file (if < 1 GB).
My workaround: add a SELECT DISTINCT * on top of the SELECT statement.
Under the hood, BigQuery uses multiple workers to read and process different sections of the data, and when we use wildcards, each worker creates a separate output file.
Currently BigQuery produces empty files even if no data is returned, and thus we get multiple empty files. The BigQuery product team is aware of this issue and is working to fix it; however, there is no ETA that can be shared.
There is a public issue tracker that will be updated with periodic progress. You can STAR the issue to receive automatic updates and give it traction.
However for the time being I would like to provide a workaround as follows:
If you know that the output will be less than 1 GB, you can specify a single URI to get a single output file. However, the EXPORT DATA statement doesn't support a single URI.
You can use the bq extract command to export the BQ table.
bq --location=location extract \
--destination_format format \
--compression compression_type \
--field_delimiter delimiter \
--print_header=boolean \
project_id:dataset.table \
gs://bucket/filename.ext
In fact, bq extract should not have the empty-file issue that the EXPORT DATA statement has, even when you use a wildcard URI.
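If you would rather drive the extract from code than from the bq CLI, the BigQuery Python client exposes the same job type through extract_table. A rough sketch, with the project, dataset, table, location and bucket names as placeholders; note that, like bq extract, it exports a table rather than an arbitrary query, so you would first materialize your SELECT into a (temporary) table:
from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.ExtractJobConfig(
    destination_format=bigquery.DestinationFormat.CSV,
    field_delimiter="|",
    print_header=True,
)

extract_job = client.extract_table(
    "my-project.my_dataset.customer_master",         # placeholder table
    "gs://my-bucket/outbound/Customer_Master_*.csv",  # placeholder URI
    location="US",                                    # dataset location
    job_config=job_config,
)
extract_job.result()  # wait for the export job to finish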
I faced the same empty files issue when using EXPORT DATA.
After doing a bit of R&D I found a solution: put LIMIT xxx in your SELECT SQL and it will do the trick.
You can find the row count and use that as the LIMIT value.
SELECT ....
FROM ...
WHERE ...
LIMIT xxx
It turns out you need to allow for multiple files by using the wildcard syntax: a file name prefix for CSV, or a folder for other formats like Avro.
The uri option must be a single-wildcard URI, as described in
https://cloud.google.com/bigquery/docs/reference/standard-sql/other-statements
Specifying a wildcard seems to start several workers on the extract, and as per the documentation, the size of the exported files will vary.
Zero-length files are unusual but technically possible if the first worker finishes before any other really gets started. Hence the wildcard is expected to be used only when you think your exported data will be larger than 1 GB.
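If you stay with EXPORT DATA and the wildcard, one pragmatic workaround is to clean up the shards that contain no data rows once the export has finished. A sketch with the Cloud Storage Python client, reusing the bucket and prefix from the question (adjust as needed); it treats a shard with at most one line, i.e. just the header or nothing at all, as empty:
from google.cloud import storage

client = storage.Client()
bucket_name = "whr-asia-datalake-dev-standard"
prefix = "outbound/Adobe/Customer_Master_"

for blob in client.list_blobs(bucket_name, prefix=prefix):
    content = blob.download_as_bytes()
    if len(content.splitlines()) <= 1:   # header only, or completely empty
        blob.delete()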
I have just faced the same issue with Parquet but found that the bq CLI works, which should do for any format.
See (and star for traction) https://issuetracker.google.com/u/1/issues/181016197

Upload and parse file in Oracle APEX

I'm trying to find the best way to upload, parse, and work with a text file in Oracle APEX (current version 20.1). Business case: I must upload a text file whose first line will be saved to table A.
The rest of the lines contain records (columns are pipe-delimited) that should be validated. After that, correct records should be saved to table B, or, if there is an error, to table C (error log).
I tried to do something with the Data Loading wizard but it doesn't fit to my requirements.
Right now I added a "File browse..." item to page, and after page submit I can find this file in APEX_APPLICATION_TEMP_FILES in blob_content.
Is there any other option to work with that file than using blob_content from APEX_APPLICATION_TEMP_FILES? I find it difficult to work with that type of data.
Text file look something like that:
2020-06-05 info: header line
2020-06-05|columnAValue|columnBValue|
2020-06-05|columnAValue||columnCValue
2020-06-05|columnAValue|columnBValue|columnCValue
Have a look at the APEX_DATA_PARSER.PARSE table function. It parses the CSV file and returns the values as rows and columns. It's described in more detail in this blog post:
https://blogs.oracle.com/apex/super-easy-csv-xlsx-json-or-xml-parsing-about-the-apex_data_parser-package
Simply pass "file.csv" (literally) as the p_file_name argument. APEX_DATA_PARSER does not care about the "real" file name....
The function uses the file extension only to differentiate between delimited, XLSX, XML or JSON files. So simply pass in a static file name like "file.csv". That should be enough.

Exporting from pgadmin reads line breaks in field cells and creates unreadable Excel

I'm new to this, so I am sure it is a silly question, but I have read through every related question on the site and can't find anything!
I am exporting from pgadmin. A few of the columns have line breaks within the cells, so the exported data is very choppy. Does anyone know how to fix this? Is there a way to make it so the line breaks within cells are not read?
I know I am using the right settings for exporting, but basically what happens is that the header names are there, along with one row of content for each column, and then column A will have 20 more rows beneath it because of line breaks from the first cell in column E.
Any help would be much appreciated!
I assume that you're referring to the Query --> Execute to file command in the Query window. I don't think it's a bug that pgAdmin doesn't escape line breaks within strings in its CSV output, and Excel can read the file correctly anyway.
In the export options, please make sure that you use commas as column separators and double quotes as quote characters.
Additionally, when you load your CSV into Excel, please don't use Data -> From Text. This one doesn't parse CSV with line breaks correctly. Just open the file directly in Excel (via Open within Excel, or by right clicking it in Windows Explorer and choosing Open With -> Microsoft Excel).
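If you prefer to produce the CSV outside the pgAdmin dialog altogether, the same properly quoted output (with line breaks kept inside quoted fields) can be generated with a plain COPY through psycopg2. A small sketch; the connection settings and table name are placeholders:
import psycopg2

conn = psycopg2.connect("dbname=mydb user=me password=secret host=localhost")

# Standard CSV quoting keeps embedded newlines inside double-quoted fields.
copy_sql = """
    COPY (SELECT * FROM my_table)
    TO STDOUT WITH (FORMAT csv, HEADER true, QUOTE '"', DELIMITER ',')
"""

with conn, conn.cursor() as cur, open("export.csv", "w", newline="") as f:
    cur.copy_expert(copy_sql, f)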

Facebook Ads Insights API reportstats endpoint

I'm using the reportstats edge to download some reports in CSV format (it probably applies to XLS as well).
What I've noticed:
headers have different descriptions than the data_columns parameters - is there a resource describing the mapping? (e.g. adgroup_id -> 'Ad ID', adgroup_name -> 'Ad Name', unique_impressions -> 'Reach'...)
will the order of the CSV columns be as defined in the data_columns param?
some columns are not returned in CSV format - two I've identified so far are inline_actions and unique_social_clicks - the column is skipped in the CSV format but available in JSON - is it a bug or is there a reason for that?
general question - does the CSV format require pagination or will I always get all of the data?
value mapping - the constant values in CSV/XLS format have different labels, e.g. placement (desktop_feed -> 'News Feed on Desktop Computers'). Is there a resource describing all the possible values?
asynchronous report requests - it happens quite often that although I'm checking the report_run_id for async_percent_completion, the data is still not available when it should be. I'm getting the text response "No data available." I need to retry and then it's usually available. Is this expected?
Thanks!
different names in API and XLS are intentional; API developers prefer naming consistent with the rest of Ads API, but people using XLS exports are often not developers and prefer human-friendly naming
you can use export_columns to define the order
inline_actions/unique_social_clicks - not sure, maybe these might be deprecated
it will give you all of the data
I don't think there's public resource for mapping between placement values :-(
you need to check the report_run_id for the job status (field "async_status"); that should work reliably. Once it's "Job Completed" you should be able to get the data
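For the last point, here is a rough polling sketch using requests. The Graph API version, the endpoint shape (the current ad-report-run node) and the token/report_run_id values are assumptions, so adapt it to the reportstats edge you are actually calling:
import time
import requests

ACCESS_TOKEN = "EAAB..."       # placeholder token
REPORT_RUN_ID = "1234567890"   # id returned when the async report was created
BASE = "https://graph.facebook.com/v19.0"

# Poll until the job reports "Job Completed" before asking for the data.
while True:
    status = requests.get(
        f"{BASE}/{REPORT_RUN_ID}",
        params={
            "fields": "async_status,async_percent_completion",
            "access_token": ACCESS_TOKEN,
        },
    ).json()
    if status.get("async_status") == "Job Completed":
        break
    time.sleep(5)

# Only now fetch the finished report's rows.
rows = requests.get(
    f"{BASE}/{REPORT_RUN_ID}/insights",
    params={"access_token": ACCESS_TOKEN},
).json()
print(rows)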