Why did a Transfer in GCP fail on a CSV file, and where is the error log?

I am testing out the transfer function in GCP:
This is the open data in csv, https://www.stats.govt.nz/assets/Uploads/Annual-enterprise-survey/Annual-enterprise-survey-2018-financial-year-provisional/Download-data/annual-enterprise-survey-2018-financial-year-provisional-csv.csv
My configuration in GCP:
The transfer failed as below:
Question 1: Why did the transfer fail?
Question 2: where is the error log?
Thank you very much.
[UPDATE]:
I checked log history, nothing was captured:
[Update 2]:
Error details:
Details: First line in URL list must be TsvHttpData-1.0 but it is: Year,Industry_aggregation_NZSIOC,Industry_code_NZSIOC,Industry_name_NZSIOC,Units,Variable_code,Variable_name,Variable_category,Value,Industry_code_ANZSIC06
I noticed that in the transfer service, the third option for the source reads the URL of a TSV file. TSV and PSV are essentially just variants of CSV, and I have no problem retrieving the source CSV file. The error details seem to imply that something unexpected was found there.

The problem is that in your example, you are pointing to a data file as the source of the transfer. If we read the documentation on GCS transfer, we find that we must instead specify a file that lists the URLs we want to copy.
This file is in Tab-Separated Values (TSV) format and contains a number of parameters, including:
The URL of the source of the file.
The size in bytes of the source file.
An MD5 hash of the content of the source file.
What you specified (just the URL of the source file) ... is not what is required.
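As a rough illustration of the file described above, here is a minimal Python sketch that builds such a URL list. The first line, TsvHttpData-1.0, comes straight from the error message in the question; the column order (URL, size, MD5) follows the list above. The function name, the example URL, and the Base64 encoding of the MD5 value are assumptions for illustration, so check them against the "Creating a URL list" documentation referenced below.

```python
def build_url_list(entries):
    """Return the text of a URL list file.

    entries: iterable of (url, size_in_bytes, md5_base64) tuples.
    The Base64 form of the MD5 is an assumption; verify it in the docs.
    """
    # The error message in the question shows this must be the first line.
    lines = ["TsvHttpData-1.0"]
    for url, size, md5_b64 in entries:
        # One tab-separated row per source file.
        lines.append("\t".join([url, str(size), md5_b64]))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical entry, for illustration only.
    print(build_url_list([
        ("https://example.com/data.csv", 5, "XUFAKrxLKna5cZ2REBfFkg=="),
    ]), end="")
```

The resulting text would be saved as a .tsv file at a URL the transfer service can read, and that URL (not the data file's URL) is what goes into the source field.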
One possible solution would be to use gsutil. It has an option of taking a stream as input and writing that stream to a given object. For example:
curl http://[URL]/[PATH] | gsutil cp - gs://[BUCKET]/[OBJECT]
References:
Creating a URL list
Can I upload files to google cloud storage from url?

Related

Test data to requests for Postman monitor

I run my collection using test data from a CSV file. However, there is no option to upload the test data file when adding a monitor for the collection. Searching the internet, I found that the test data file has to be provided as a URL (saved in the cloud, e.g. Google Drive). But I couldn't find out how to provide this URL to the collection. Can anyone please help?
https://www.postman.com/praveendvd-public/workspace/postman-tricks-and-tips/request/8296678-d06b3fc0-6b8b-4370-9847-aee0f526e7db
You cannot use a CSV file in a monitor, but you could store the content of the CSV as a variable and use that to drive the monitor. An example can be seen in the public workspace linked above.

Google BigQuery cannot read some ORC data

I'm trying to load ORC data files stored in GCS into BigQuery via bq load/bq mk and am getting the error below. The data files were copied via the hadoop distcp command from an on-prem cluster's Hive instance, version 1.2. Most of the ORC files load successfully, but a few do not. There is no problem when I read this data from Hive.
Command I used:
$ bq load --source_format ORC hadoop_migration.pm hive/part-v006-o000-r-00000_a_17
Upload complete.
Waiting on bqjob_r7233761202886bd8_00000175f4b18a74_1 ... (1s) Current status: DONE
BigQuery error in load operation: Error processing job '<project>-af9bd5f6:bqjob_r7233761202886bd8_00000175f4b18a74_1': Error while reading data, error message:
The Apache Orc library failed to parse metadata of stripes with error: failed to open /usr/share/zoneinfo/GMT-00:00 - No such file or directory
Indeed, there is no such file and I believe it shouldn't be.
A Google search turns up nothing for this error message, but I found a similar problem here: https://issues.apache.org/jira/browse/ARROW-4966. There is a workaround for on-prem servers: creating a symlink for /usr/share/zoneinfo/GMT-00:00. But I'm in the cloud.
Additionally, I found that if I extract the data from the ORC file into JSON format via orc-tools, I can load that JSON file into BigQuery. So I suspect that the problem is not in the data itself.
Has anybody come across this problem?
The official Google support position is below. In short, BigQuery doesn't understand some time zone descriptions, and support suggested changing them in the data. Our workaround was to convert the ORC data to Parquet and then load it into the table.
Indeed this error can happen. Also when you try to execute a query from the BigQuery Cloud Console such as:
select timestamp('2020-01-01 00:00:00 GMT-00:00')
you’ll get the same error. It is not just related to the ORC import, it’s how BigQuery understands timestamps. BigQuery supports a wide range of representations as described in [1]. So:
“2020-01-01 00:00:00 GMT-00:00” -- incorrect timestamp string literal
“2020-01-01 00:00:00 abcdef” -- incorrect timestamp string literal
“2020-01-01 00:00:00-00:00” -- correct timestamp string literal
In your case the problem is with the representation of the time zone within the ORC file. I suppose it was generated that way by some external system. If you were able to get the “GMT-00:00” string with preceding space replaced with just “-00:00” that would be the correct name of the time zone. Can you change the configuration of the system which generated the file into having a proper time zone string?
Creating a symlink is only masking the problem and not solving it properly. In case of BigQuery it is not possible.
Best regards,
Google Cloud Support
[1] https://cloud.google.com/bigquery/docs/reference/standard-sql/data-types#time_zones
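The string rewrite that support describes (dropping the space and the "GMT" prefix so that " GMT-00:00" becomes "-00:00") can be sketched in Python as below. This is only an illustration for timestamp strings, for example in a JSON export produced with orc-tools; it does not touch the time zone stored in ORC metadata. The function name and the regular expression are assumptions.

```python
import re

# Matches a trailing " GMT+HH:MM" or " GMT-HH:MM" and captures the offset.
_GMT_OFFSET = re.compile(r"\s+GMT([+-]\d{2}:\d{2})$")


def normalize_gmt_offset(timestamp: str) -> str:
    """Rewrite '... GMT-00:00' as '...-00:00'; leave other strings unchanged."""
    return _GMT_OFFSET.sub(r"\1", timestamp)


if __name__ == "__main__":
    print(normalize_gmt_offset("2020-01-01 00:00:00 GMT-00:00"))
```

Strings that already use a bare offset, or that end in something other than a GMT offset, pass through untouched.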

Google Cloud - Download large file from web

I'm trying to download the GhTorrent dump from http://ghtorrent-downloads.ewi.tudelft.nl/mysql/mysql-2020-07-17.tar.gz which is about 127 GB.
I tried in the cloud, but it stops after 6 GB. I believe there is a size limit when using curl:
curl http://ghtorrent... | gsutil cp - gs://MY_BUCKET_NAME/mysql-2020-07-17.tar.gz
I cannot use Data Transfer, as I need to specify the URL, the size in bytes (which I have), and the MD5 hash, which I don't have and can only generate by having the file on my disk. I think(?)
Is there any other option to download and upload the file directly to the cloud?
My total disk size is 117gb sad beep
Worked for me with Storage Transfer Service: https://console.cloud.google.com/transfer/
Have a look at the pricing before moving TBs, especially if your target is nearline/coldline: https://cloud.google.com/storage-transfer/pricing
Simple example that copies a file from a public url, to my bucket using a Transfer Job:
Create a file theTsv.tsv and specify the complete list of files that must be copied. This example contains just one file:
TsvHttpData-1.0
http://public-url-pointing-to-the-file
Upload the theTsv.tsv file to your bucket or any publicly accessible url. In this example I am storing my .tsv file on my bucket https://storage.googleapis.com/<my-bucket-name>/theTsv.tsv
Create a transfer job - List of object URLs
Add the url that points to the theTsv.tsv file in the URL of TSV file field;
Select the target bucket
Run immediately
My file, named MD5SUB was copied from the source url into my bucket, under an identical directory structure.
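The example above lists only the URL in the .tsv file. Should the size and MD5 also be wanted, they do not require the file to sit on disk: a hash can be computed incrementally while reading a stream. Below is a minimal Python sketch; the function name and chunk size are assumptions, and the Base64 encoding of the digest is an assumption about what the URL list expects.

```python
import base64
import hashlib


def size_and_md5_base64(stream, chunk_size=1024 * 1024):
    """Read a binary stream to the end; return (byte_count, base64_md5)."""
    digest = hashlib.md5()
    total = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:  # empty bytes means end of stream
            break
        digest.update(chunk)  # only one chunk is held in memory at a time
        total += len(chunk)
    return total, base64.b64encode(digest.digest()).decode("ascii")


if __name__ == "__main__":
    import io
    print(size_and_md5_base64(io.BytesIO(b"hello")))
```

Any object with a read() method works, including the response returned by urllib.request.urlopen. Note that this still transfers the whole file over the network once; it only avoids storing it.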

can not add file in aws s3 bucket using postman

I am trying to add a file in s3-bucket in my AWS account using postman. see below screenshot.
I pass Host in the header as divyesh.vkinds.com.s3.amazonaws.com, where divyesh.vkinds.com is my bucket name, and in the Body I am providing index.html as a file, as in the image below.
But it gives me this error: The provided 'x-amz-content-sha256' header does not match what was computed.
I searched for it but can't find anything.
Please check content-header. Add Content-Type as text/plain and date in this format XX-XX-XXXX
I faced the same problem. The issue was that Postman does not calculate the SHA. It defaults to the SHA of an empty string, e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
So in the Postman headers, add an explicit key x-amz-content-sha256. Calculate the SHA-256 value for your file using a sha command and provide it as the value. The command below works on Linux flavors:
shasum -a 256 index.html
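Where shasum is not available, the same value can be computed with Python's standard library. This is a small sketch; the function name is hypothetical. The hex digest it returns is what would go into the x-amz-content-sha256 header.

```python
import hashlib


def sha256_hex(path, chunk_size=65536):
    """Return the hex SHA-256 digest of the file at path."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in chunks so large files are not loaded into memory at once.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    import sys
    if len(sys.argv) > 1:
        print(sha256_hex(sys.argv[1]))
```

For an empty file this returns the e3b0c442... value quoted above, which is a quick way to confirm that a mismatch comes from the body being hashed as empty.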
Couple of other observations in the question.
You can change the Body as binary and choose the file you want to upload.
Provide the complete path including the file name in the upload URL. E.g. if you provide the URL as <your bucket name>.s3.<region>.amazonaws.com/test/index.html, the file will be copied to test directory in the bucket with name as index.html
I encountered this situation recently, and the issue was that I was copying an active log file which changed between when my side calculated the hash and when the file was actually uploaded. My solution was to copy the file to a temporary location, then upload that stable file.

Downloading files from a remote server directory listing and import into HDFS

I have been given access to a server that provides a directory listing of files, which I will download and import into HDFS. What I am currently doing is hitting the server with an HTTP GET, downloading the HTML directory listing, and then using jsoup to parse all the links to the files I need to download. Once I have a complete list, I download each file one by one and then import each into HDFS. I don't believe that Flume is able to read and parse HTML to download files. Is there an easier, cleaner way to do what I am describing?
With Flume I would do the following:
1) Have a process grep your URLs and store the dumped HTML file in a directory.
2) Configure a SpoolDir source pointing to that directory with a custom deserializer:
deserializer (default: LINE): Specify the deserializer used to parse the file into events. Defaults to parsing each line as an event. The class specified must implement EventDeserializer.Builder.
That deserializer reads the HTML file and extracts the links with JSoup. The extracted bits are then converted to multiple events in the desired format and sent to HDFSSink.
That's basically it.
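The link-extraction step that the question does with jsoup can be sketched with Python's standard library as follows. This is a stand-in for illustration, not Flume or jsoup code: the class and function names are hypothetical, and skipping links that end in "/" is an assumption about how the listing marks directories.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect absolute URLs from the href attributes of <a> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the listing's own URL.
                self.links.append(urljoin(self.base_url, value))


def extract_file_links(html, base_url):
    """Return file URLs found in a directory listing page."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    # Assumption: entries ending in "/" are directories (including the parent link).
    return [url for url in collector.links if not url.endswith("/")]


if __name__ == "__main__":
    page = '<a href="../">Parent</a><a href="a.csv">a.csv</a><a href="sub/">sub/</a>'
    print(extract_file_links(page, "http://example.com/data/"))
```

The resulting list is what would then be downloaded file by file and written to HDFS.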