I am trying to unzip zip files uploaded to Cloud Storage; each archive contains only image files, with no folders inside.
I was able to do this with Cloud Functions, but I run into memory-related issues when the files get bigger. I found a Dataflow template (Bulk Decompress Cloud Storage Files) for this specific case and tried to run some jobs with parameters similar to the ones below.
{
  "jobName": "unique_job_name",
  "environment": {
    "bypassTempDirValidation": false,
    "numWorkers": 2,
    "tempLocation": "gs://bucket_name/temp",
    "ipConfiguration": "WORKER_IP_UNSPECIFIED",
    "additionalExperiments": []
  },
  "parameters": {
    "inputFilePattern": "gs://bucket_name/root_path/zip_to_extract.zip",
    "outputDirectory": "gs://bucket_name/root_path/",
    "outputFailureFile": "gs://bucket_name/root_path/failure.csv"
  }
}
As output, I only get one file with the same name as my zip file, without a file extension and with the content type text/plain.
Is this expected behaviour? If someone could help me unzip the file with Dataflow, I would be glad.
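For reference, a streaming variant of the Cloud Functions approach would look roughly like this (a sketch only, using the google-cloud-storage Python client and the bucket/object names from the parameters above):

# Hedged sketch of a streaming alternative (not the Dataflow template): read the
# archive straight from GCS so the whole zip never has to sit in memory.
# Bucket and object names below are the placeholders from the job parameters.
import zipfile
from google.cloud import storage

def unzip_in_place(bucket_name: str, zip_path: str, dest_prefix: str) -> None:
    client = storage.Client()
    bucket = client.bucket(bucket_name)

    # blob.open("rb") returns a seekable, file-like reader, so zipfile can pull
    # byte ranges on demand instead of downloading the whole archive first.
    with bucket.blob(zip_path).open("rb") as archive, zipfile.ZipFile(archive) as zf:
        for name in zf.namelist():
            if name.endswith("/"):
                continue  # skip directory entries; the archives hold only images
            # Each image is read individually, so memory use stays per-file.
            bucket.blob(dest_prefix + name).upload_from_string(zf.read(name))

unzip_in_place("bucket_name", "root_path/zip_to_extract.zip", "root_path/")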
Related
I'm new to Dataflow. I am sending JSON data to Pub/Sub using a Python script, and I am using the "Pub/Sub to Text Files on Cloud Storage" template that I created in Dataflow.
When it writes to Cloud Storage, it writes to a folder named .temp-beam instead of the output path I specified in the bucket. I know this folder is part of Beam's fault-tolerance mechanism.
In the Dataflow logs, I get the following error:
Error message from worker: java.lang.RuntimeException: org.apache.beam.sdk.util.UserCodeException: java.io.FileNotFoundException: gs://mybucket/data/.temp-beam/d96626fc549ca4e3-77c2
Example data in Pub/Sub:
{"productFullName": "watch", "productBrand": "Rolex", "productPrice": "1089.00", "productRating": "100", "productRatingCount": "15", "productDealer": "WatchCenter", "dealerRating": "100"}
I tried everything on the permission side.
I've verified from .temp-beam that my data is coming in correctly.
I tried both txt and json as the file suffix.
The bucket path is set exactly as desired (gs://bucket/data/).
Dataflow SDK version: Apache Beam SDK for Java 2.36.0
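For reference, the publishing side of the setup can be as simple as this sketch (google-cloud-pubsub Python client; the project and topic IDs are placeholders):

# Hedged sketch of the publishing script: send one JSON record per Pub/Sub message.
# Project and topic IDs are placeholders.
import json
from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-project", "my-topic")

record = {
    "productFullName": "watch",
    "productBrand": "Rolex",
    "productPrice": "1089.00",
    "productRating": "100",
    "productRatingCount": "15",
    "productDealer": "WatchCenter",
    "dealerRating": "100",
}

# The template reads the raw message payload, so the JSON is sent as UTF-8 bytes.
future = publisher.publish(topic_path, json.dumps(record).encode("utf-8"))
print(future.result())  # blocks until the message ID comes back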
I am new to QuickSight and was just test-driving it on the QuickSight web console (I'm not using the command line anywhere in this) with some data (which I can't share, as it's confidential business info). I have a strange issue: when I create a dataset by uploading the file, which is only 50 MB, it works fine, I can see a preview of the table, and I am able to proceed to the visualization. But when I upload the same file to S3, make a manifest, and submit it using the 'Use S3' option in the create-dataset window, I get the INCORRECT_FIELD_COUNT error.
Here's the manifest file:
{
  "fileLocations": [
    {
      "URIs": [
        "s3://testbucket/analytics/mydata.csv"
      ]
    },
    {
      "URIPrefixes": [
        "s3://testbucket/analytics/"
      ]
    }
  ],
  "globalUploadSettings": {
    "format": "CSV",
    "delimiter": ",",
    "containsHeader": "true"
  }
}
I know the data is not fully structured, with some rows where a few columns are missing, but how is it possible for QuickSight to automatically infer and insert NULLs for the shorter rows when the file is uploaded from my local machine, but not when it comes from S3 with the manifest? Are there some different settings that I'm missing?
I'm getting the same thing - this looks like fairly new code. It'd be useful to know what the expected field count is, especially as the error doesn't say whether there are too few or too many fields (both are wrong). One of those technologies that looks promising, but I'd say a little maturing is required.
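If it helps to narrow this down, a quick way to see which rows have a different field count than the header is a few lines of Python (the file name is a placeholder for the CSV behind the manifest; this assumes a plain comma-delimited file with no embedded newlines):

# Hedged sketch: report rows whose field count differs from the header's.
import csv
from collections import Counter

with open("mydata.csv", newline="") as f:
    reader = csv.reader(f)
    header = next(reader)
    expected = len(header)
    counts = Counter()
    # start=2 because line 1 is the header; assumes one physical line per row
    for line_no, row in enumerate(reader, start=2):
        counts[len(row)] += 1
        if len(row) != expected:
            print(f"line {line_no}: {len(row)} fields (expected {expected})")

print(dict(counts))  # distribution of field counts across the file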
I am trying to add my Postman scripts to an Azure pipeline.
To do this I am trying out Newman.
I use the Postman API to get the latest collection as well as the correct environment, using the uid and an API key I have created. All good so far.
However, my collection includes some calls that do file uploads.
In Postman I tested those by simply selecting the body of the call, choosing form-data and picking a sample file located in the default "Postman files" folder.
When testing Newman on my local machine, I need to copy all the sample files I want to use for uploads into the same folder that I run Newman from.
This solution is not quite right for me, though, as I use the Postman API to get the correct collections and environments. I need to be able to get those files from an alternative remote location as well (such as Azure Blob Storage).
I have found some guides that describe how you can edit the Postman collection file to point the "src" at a remote file. However, I cannot find any way to do this directly in Postman, such that when Newman fetches the collection file from the API the correct location is already in place.
"request": {
"method": "POST",
"header": [],
"body": {
"mode": "formdata",
"formdata": [
{
"key": "files",
"type": "file",
"src": "sample.pdf"
}
]
},
Above is the extract from the collection file.
Is there a way I can make that change directly in Postman?
Postman scripts have access to files in their working directory. We solved this by keeping a picture in a folder in our Git repo, downloading the scripts to that folder, and referring to that file. This is the task we used in the build pipeline:
- task: CopyFiles@2
  inputs:
    SourceFolder:
    Contents: |
      **/PostmanTests/test-image.jpg
    TargetFolder: '$(Build.ArtifactStagingDirectory)/postman'
    OverWrite: true
    flattenFolders: true
You can then use the Postman API to download the collection and environment files to the artifact folder created above.
One trick here is that we used a "container" file: we replaced the filename with {{file}} (instead of something like example.pdf) and passed the actual name in the environment file. See the JSON below:
"body": {
"mode": "formdata",
"formdata": [
{
"key": "upload",
"type": "file",
"src": "{{file}}"
}
]
}
The environment file would then have the name of the file, in this case test-image.jpg.
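If it helps, here is a rough sketch of the download step in Python (collection/environment UIDs, the API key variable and the target folder are placeholders; the "collection"/"environment" wrapper keys are what the Postman API returns as far as I recall):

# Hedged sketch: pull the collection and environment from the Postman API into
# the artifact folder. UIDs, the API key variable and folder are placeholders.
import json
import os
import requests

API_KEY = os.environ["POSTMAN_API_KEY"]
TARGET_DIR = "postman"  # e.g. the $(Build.ArtifactStagingDirectory)/postman folder above
COLLECTION_UID = "your-collection-uid"
ENVIRONMENT_UID = "your-environment-uid"

def download(url: str, wrapper_key: str, out_name: str) -> None:
    resp = requests.get(url, headers={"X-Api-Key": API_KEY}, timeout=30)
    resp.raise_for_status()
    # The Postman API wraps the payload, e.g. {"collection": {...}}.
    with open(os.path.join(TARGET_DIR, out_name), "w") as f:
        json.dump(resp.json()[wrapper_key], f)

download(f"https://api.getpostman.com/collections/{COLLECTION_UID}",
         "collection", "collection.json")
download(f"https://api.getpostman.com/environments/{ENVIRONMENT_UID}",
         "environment", "environment.json")

You then run Newman from that folder (or, on recent Newman versions, point its --working-dir option at it) so that a relative src such as {{file}} resolves to the test-image.jpg the CopyFiles task dropped there.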
My Goal: I have hundreds of Google Cloud Storage folders with hundreds of images in them. I need to be able to zip them up and email a user a link to a single zip file.
I made an attempt to zip these files on an external server using PHP's zip functions, but that proved fruitless given the ultimate size of the zip files I'm creating.
I have since found that Google Cloud offers a Bulk Compress Cloud Storage Files utility (docs are at https://cloud.google.com/dataflow/docs/guides/templates/provided-utilities#api). I was able to call this utility successfully, but it compresses each file into its own bzip2 or gzip file.
For instance, if I had the following files in the folder I'm attempting to zip:
apple.jpg
banana.jpg
carrot.jpg
The resulting outputDirectory would have:
apple.bzip2
banana.bzip2
carrot.bzip2
Ultimately, I'm hoping to create a single file named fruits.bzip2 that can be unzipped to reveal these three files.
Here's an example of the request parameters I'm making to https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/templates:launch?gcsPath=gs://dataflow-templates/latest/Bulk_Compress_GCS_Files
{
  "jobName": "ziptest15",
  "environment": {
    "zone": "us-central1-a"
  },
  "parameters": {
    "inputFilePattern": "gs://PROJECT_ID.appspot.com/testing/samplefolder1a/*.jpg",
    "outputDirectory": "gs://PROJECT_ID.appspot.com/testing/zippedfiles/",
    "outputFailureFile": "gs://PROJECT_ID.appspot.com/testing/zippedfiles/failure.csv",
    "compression": "BZIP2"
  }
}
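For completeness, a request like that can be sent from Python with Application Default Credentials; this is only a sketch, with PROJECT_ID and the gs:// paths as the placeholders from above:

# Hedged sketch: launch the Bulk_Compress_GCS_Files template via the Dataflow REST API.
import google.auth
from google.auth.transport.requests import AuthorizedSession

PROJECT_ID = "PROJECT_ID"
URL = (
    f"https://dataflow.googleapis.com/v1b3/projects/{PROJECT_ID}/templates:launch"
    "?gcsPath=gs://dataflow-templates/latest/Bulk_Compress_GCS_Files"
)

credentials, _ = google.auth.default(scopes=["https://www.googleapis.com/auth/cloud-platform"])
session = AuthorizedSession(credentials)

body = {
    "jobName": "ziptest15",
    "environment": {"zone": "us-central1-a"},
    "parameters": {
        "inputFilePattern": f"gs://{PROJECT_ID}.appspot.com/testing/samplefolder1a/*.jpg",
        "outputDirectory": f"gs://{PROJECT_ID}.appspot.com/testing/zippedfiles/",
        "outputFailureFile": f"gs://{PROJECT_ID}.appspot.com/testing/zippedfiles/failure.csv",
        "compression": "BZIP2",
    },
}

response = session.post(URL, json=body, timeout=60)
response.raise_for_status()
print(response.json())  # metadata of the created job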
The best way to achieve that is to create an app that does the following (a sketch is shown after this list):
Download locally all the files under a GCS prefix (what you call a "directory", but directories don't exist in GCS, only objects sharing the same prefix)
Create an archive (it can be a ZIP or a TAR; ZIP won't really compress the images, because the image format is already a compressed format. The point is simply to end up with one file that contains all the images)
Upload the archive to GCS
Clean up the local files
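Here is a rough sketch of such an app in Python with the google-cloud-storage client (bucket name, prefixes and the output object are placeholders taken from the question):

# Hedged sketch: pull every object under a prefix, zip it locally,
# and push the archive back to GCS. Names below are placeholders.
import os
import tempfile
import zipfile
from google.cloud import storage

def archive_prefix(bucket_name: str, src_prefix: str, dest_object: str) -> None:
    client = storage.Client()
    bucket = client.bucket(bucket_name)

    with tempfile.TemporaryDirectory() as workdir:
        archive_path = os.path.join(workdir, "archive.zip")
        # ZIP_STORED: JPEGs are already compressed, so recompressing buys nothing.
        with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_STORED) as zf:
            for blob in client.list_blobs(bucket_name, prefix=src_prefix):
                if blob.name.endswith("/"):
                    continue  # zero-byte "folder" placeholder objects
                local_path = os.path.join(workdir, os.path.basename(blob.name))
                blob.download_to_filename(local_path)
                zf.write(local_path, arcname=os.path.basename(blob.name))
                os.remove(local_path)  # free local disk as we go

        bucket.blob(dest_object).upload_from_filename(archive_path)
        # The TemporaryDirectory context cleans up the local archive afterwards.

archive_prefix(
    "PROJECT_ID.appspot.com",
    "testing/samplefolder1a/",
    "testing/zippedfiles/fruits.zip",
)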
Now you have to choose where to run this app.
On Cloud Run, you are limited by the space you have in memory (for now; new features are coming). Currently you are limited to 8 GB of memory (and soon 16 GB), so your app will be able to process a total image size of about 45% of the memory capacity (45% for the images, 45% for the archive, 10% for the app's memory footprint); with 8 GB that is roughly 3.6 GB of images per job. Set the concurrency parameter to 1.
If you need more space, you can use Compute Engine.
Set up a startup script that runs your app and stops the VM automatically at the end. The script reads its parameters from the metadata server and runs your app with the correct values.
Before each run, update the Compute Engine metadata with the directory to process (and maybe other app parameters).
-> The issue is that you can only run one process at a time, unless you create a VM for each job and then delete the VM at the end of the startup script instead of stopping it.
A side solution is to use Cloud Build: run a build with the parameters in the substitution variables and perform the job inside Cloud Build. You are limited to 10 concurrent builds. Use the diskSizeGb build option to set the disk size according to your file size requirements.
The Dataflow template only compresses each file individually; it doesn't create an archive.
I'm trying to build an Alexa prototype for a client using this tutorial: https://developer.amazon.com/public/community/post/Tx3DVGG0K0TPUGQ/New-Alexa-Skills-Kit-Template:-Step-by-Step-Guide-to-Build-a-Fact-Skill
I am getting errors when I upload the zip file with the Alexskill.js and index.js files in it. I believe these are in the system itself and have nothing to do with my code. Here is a screen grab of my browser console:
(screen grab of browser console errors)
There's no way to see whether the zip file you upload has been successful (frustrating) - but this looks bad, right?
Obviously, when I try to test the Lambda function I get this error:
{
  "errorMessage": "Cannot find module 'index'",
  "errorType": "Error",
  "stackTrace": [
    "Function.Module._load (module.js:276:25)",
    "Module.require (module.js:353:17)",
    "require (internal/module.js:12:17)"
  ]
}
I desperately need to get this working. Has anyone got the code in one file that I can use with the inline code editor? I am using the FactSkill demo, which is very basic.
This is one of those 'I want to kick myself around the room' moments. The article tells you to download the ZIP archive from GitHub and then upload it to the Lambda control panel. When you do that on a Mac, it unzips the archive into a folder for you. I then zipped that folder back up and uploaded it. That was my problem...
You need to zip the two files inside the folder, not the folder itself!
Then Lambda can see the module in the archive.
DOH!!!
But, still ... Amazon, wtf is going on with all those errors?
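In case it helps anyone else, here is a hedged sketch (Python's zipfile, file names as given in the question) that builds the archive with the modules at the root:

# Hedged sketch: build the Lambda deployment zip with the .js files at the
# archive root. Run it from inside the folder that holds the two files.
import zipfile

with zipfile.ZipFile("fact-skill.zip", "w", zipfile.ZIP_DEFLATED) as zf:
    zf.write("index.js", arcname="index.js")          # must sit at the root, not inside a folder
    zf.write("Alexskill.js", arcname="Alexskill.js")  # file name as given in the question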