Two steps to create/upload a file to HDFS with WebHDFS/HttpFS

I used Apache HttpClient and WebHDFS to create and upload a file (a .jar) to HDFS. The final request responded with 201 Created and Content-Length: 0.
But when I found the file on HDFS, its length was 0. What's wrong? And what should I do to upload the jar file?
I made the REST API calls just as the documentation shows.
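A 201 response with a zero-length file usually means the jar bytes were never delivered to a DataNode. Per the WebHDFS documentation, file creation is a two-step protocol: the first PUT to the NameNode carries no data and returns a 307 redirect whose Location header points at a DataNode; the second PUT sends the actual bytes to that DataNode URL. A minimal sketch with curl, assuming hypothetical host names and paths:

```shell
# Hypothetical cluster address and target path -- replace with your own.
NAMENODE="namenode.example.com:9870"
HDFS_PATH="/user/me/app.jar"

# Step 1: ask the NameNode where to write. No file data is sent here;
# the response is a 307 Temporary Redirect with a DataNode URL.
CREATE_URL="http://${NAMENODE}/webhdfs/v1${HDFS_PATH}?op=CREATE&overwrite=true"
LOCATION=$(curl -s -i --connect-timeout 5 -X PUT "$CREATE_URL" \
  | awk -F': ' 'tolower($1) == "location" {print $2}' | tr -d '\r')

# Step 2: PUT the actual jar bytes to the DataNode URL taken from the
# Location header. Only after this request does the file have content.
curl -s -i --connect-timeout 5 -X PUT -T app.jar "$LOCATION"
```

If only the first PUT reaches the cluster (or the HTTP client follows the redirect but drops the request body, which Apache HttpClient can do on redirects), HDFS creates an empty file, which would match the 201 and Content-Length: 0 you saw.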

Related

Why does my single part MP3 file fail to upload to AWS S3 due to an InvalidPart multipart file upload error?

I have a 121MB MP3 file I am trying to upload to my AWS S3 so I can process it via Amazon Transcribe.
The MP3 file comes from an MP4 file I stripped the audio from using FFmpeg.
When I try to upload the MP3, using the S3 object upload UI in the AWS console, I receive the below error:
InvalidPart
One or more of the specified parts could not be found. The part may not have been uploaded, or the specified entity tag may not match the part's entity tag.
The error refers to the MP3 as a multipart upload with a missing part, but it is not a multipart file.
I have re-run the MP4 file through FFmpeg 3 times in case the 1st file was corrupt, but that has not fixed anything.
I have searched a lot on Stack Overflow and have not found a similar case where anyone uploading a single 5 MB+ file received the error I am getting.
I've also ruled out FFmpeg as the issue by saving the audio as an MP3 with VLC instead, but I receive the exact same error.
What is the issue?
121 MB is below the 160 GB S3 console single-object upload limit, the 5 GB single-object upload limit for the REST API / AWS SDKs, and the 5 TB limit for multipart uploads, so I really can't see the issue.
Assuming the file exists and you have a stable internet connection (no corrupted uploads), you may have incomplete multipart upload parts in your bucket that are somehow conflicting with the upload. Either follow this guide to remove them and try again, or try creating a new folder/bucket and re-uploading.
You may also have a browser caching issue or a browser-extension conflict, so try incognito mode (with extensions disabled) or another browser if re-uploading to another bucket/folder doesn't work.
Alternatively, try the AWS CLI s3 cp command, or a quick "S3 file upload" application in a supported SDK language, to rule out a console UI issue.
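The CLI route above can be sketched as follows; the bucket name and upload ID here are hypothetical placeholders, and the commands assume credentials are already configured:

```shell
# Hypothetical bucket and object names -- replace with your own.
BUCKET="my-example-bucket"
KEY="audio.mp3"

# List any incomplete multipart uploads lingering in the bucket.
aws s3api list-multipart-uploads --bucket "$BUCKET"

# Abort a stale one, using the Key and UploadId values from the listing.
aws s3api abort-multipart-upload --bucket "$BUCKET" \
  --key "$KEY" --upload-id "EXAMPLE_UPLOAD_ID"

# Then retry the upload from the CLI instead of the console UI.
aws s3 cp audio.mp3 "s3://${BUCKET}/${KEY}"
```

If the CLI upload succeeds where the console fails, that points at the console/browser rather than the bucket itself.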

Google Cloud - Download large file from web

I'm trying to download the GHTorrent dump from http://ghtorrent-downloads.ewi.tudelft.nl/mysql/mysql-2020-07-17.tar.gz, which is about 127 GB.
I tried in the cloud, but it stops after about 6 GB; I believe there is a size limit on using curl this way:
curl http://ghtorrent... | gsutil cp - gs://MY_BUCKET_NAME/mysql-2020-07-17.tar.gz
I cannot use Data Transfer, as I need to specify the URL, the size in bytes (which I have), and an MD5 hash, which I don't have and can only generate by having the file on my disk. I think(?)
Is there any other option to download and upload the file directly to the cloud?
My total disk size is 117 GB, sad beep.
Worked for me with Storage Transfer Service: https://console.cloud.google.com/transfer/
Have a look on the pricing before moving TBs especially if your target is nearline/coldline: https://cloud.google.com/storage-transfer/pricing
A simple example that copies a file from a public URL to my bucket using a Transfer Job:
Create a file theTsv.tsv and specify the complete list of files that must be copied. This example contains just one file:
TsvHttpData-1.0
http://public-url-pointing-to-the-file
Upload the theTsv.tsv file to your bucket or any publicly accessible URL. In this example I stored my .tsv file in my bucket: https://storage.googleapis.com/<my-bucket-name>/theTsv.tsv
Create a transfer job - List of object URLs
Add the url that points to the theTsv.tsv file in the URL of TSV file field;
Select the target bucket
Run immediately
My file, named MD5SUB, was copied from the source URL into my bucket under an identical directory structure.

Why Transfer in GCP failed on csv file and where is the error log?

I am testing out the transfer function in GCP:
This is the open data in csv, https://www.stats.govt.nz/assets/Uploads/Annual-enterprise-survey/Annual-enterprise-survey-2018-financial-year-provisional/Download-data/annual-enterprise-survey-2018-financial-year-provisional-csv.csv
My configuration in GCP:
The transfer failed as below:
Question 1: why the transfer failed?
Question 2: where is the error log?
Thank you very much.
[UPDATE]:
I checked log history, nothing was captured:
[Update 2]:
Error details:
Details: First line in URL list must be TsvHttpData-1.0 but it is: Year,Industry_aggregation_NZSIOC,Industry_code_NZSIOC,Industry_name_NZSIOC,Units,Variable_code,Variable_name,Variable_category,Value,Industry_code_ANZSIC06
I noticed that in the Transfer Service, if you choose the third option for the source, it reads "URL of TSV file". Essentially TSV and PSV are just variants of CSV, and I have no problem retrieving the source CSV file. The error details seem to implicate something unexpected there.
The problem is that in your example you are pointing to a data file as the source of the transfer. If we read the documentation on GCS transfers, we find that we must specify a file which contains the list of URLs that we want to copy.
The format of this file is Tab-Separated Values (TSV), and each entry contains a number of fields, including:
The URL of the source of the file.
The size in bytes of the source file.
An MD5 hash of the content of the source file.
What you specified (just the URL of the source file) ... is not what is required.
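Putting those fields together, a complete URL list file might look like this (tab-separated, with a hypothetical URL, byte count, and base64-encoded MD5; per the URL-list format, the size and MD5 columns are optional):

```
TsvHttpData-1.0
https://example.com/data/annual-survey.csv	1357	wHENa08V36iPYAsOa2JAdw==
```

The first line must be exactly `TsvHttpData-1.0`, which is why the transfer failed when it found the CSV's own header row there instead.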
One possible solution would be to use gsutil. It has an option of taking a stream as input and writing that stream to a given object. For example:
curl http://[URL]/[PATH] | gsutil cp - gs://[BUCKET]/[OBJECT]
References:
Creating a URL list
Can I upload files to google cloud storage from url?

AWS Sagemaker unable to parse csv

I'm trying to run a training job on AWS Sagemaker, but it keeps failing giving the following error:
ClientError: Unable to parse csv: rows 1-5000, file /opt/ml/input/data/train/KMeans_data.csv
I've selected 'text/csv' as the content type and my CSV file contains 5 columns with numerical content and text headers.
Can anyone point out what could be going wrong here?
Thanks!
Per https://docs.aws.amazon.com/sagemaker/latest/dg/cdf-training.html, the CSV must not have headers:
Amazon SageMaker requires that a CSV file doesn't have a header record ...
Try removing the header row.
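Stripping the header row is a one-liner before re-uploading; the file name here matches the one in the error message:

```shell
# Drop the first line (the header row) and keep everything else.
tail -n +2 KMeans_data.csv > KMeans_data_noheader.csv
```

Upload the headerless file to S3 and point the training job at it.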
Also make sure that there are no files other than the training file in the training folder of the S3 bucket.

Zip file uploaded through Lambda+API gateway with POSTMAN is corrupted

I am doing the following steps:
I have an API Gateway (PUT method) which is integrated with an AWS Lambda.
It is a direct mapping of multipart/form-data (so no big logic is happening here).
Now, through Postman, the file is uploaded, and the upload succeeds.
When I download this ZIP file, it says "End-of-central-directory signature not found. Either this file is not a Zip file, or it constitutes one disk of a multi-part Zip file."
Then I opened the ZIP in Notepad++ (yes, I did that); I can see only a few lines of binary data, while my original file has a lot more.
Please help, and let me know if more information is needed.
I had the same issue, and it turned out I was sending the request without all the needed headers; it worked fine in Postman itself. Please check that your request contains the same headers Postman sends, in particular the multipart/form-data Content-Type header with its boundary parameter.
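As a cross-check outside Postman, the same upload can be tried with curl, which builds the multipart body and sets the matching Content-Type header (including the boundary) automatically; the endpoint URL and field name here are hypothetical:

```shell
# -F builds a multipart/form-data body and sets the Content-Type
# header with the correct boundary parameter for you.
curl -s -X PUT --connect-timeout 5 \
  -F "file=@archive.zip;type=application/zip" \
  "https://abc123.execute-api.us-east-1.amazonaws.com/prod/upload"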