How do I unzip a .zip file in google cloud storage? - google-cloud-platform

How do I unzip a .zip file in a Google Cloud Storage bucket? (If there is a tool for this, like 'CloudBerry Explorer' for AWS, that would be great.)

You can use Python, e.g. from a Cloud Function:
from google.cloud import storage
from zipfile import ZipFile
from zipfile import is_zipfile
import io

def zipextract(bucketname, zipfilename_with_path):

    storage_client = storage.Client()
    bucket = storage_client.get_bucket(bucketname)

    destination_blob_pathname = zipfilename_with_path

    blob = bucket.blob(destination_blob_pathname)
    zipbytes = io.BytesIO(blob.download_as_string())

    if is_zipfile(zipbytes):
        with ZipFile(zipbytes, 'r') as myzip:
            for contentfilename in myzip.namelist():
                contentfile = myzip.read(contentfilename)
                blob = bucket.blob(zipfilename_with_path + "/" + contentfilename)
                blob.upload_from_string(contentfile)

zipextract("mybucket", "path/file.zip")  # if the file is gs://mybucket/path/file.zip

If you ended up with a zip file in your Google Cloud Storage bucket because you had to move large files from another server with the gsutil cp command, you can instead gzip the files when copying: they are transferred in compressed form, stored with Content-Encoding: gzip, and decompressed on the fly when downloaded from the bucket.
This is built into gsutil cp via the -Z argument.
E.g.
gsutil cp -Z largefile.txt gs://bucket/largefile.txt

Here is some code I created to run as a Firebase Cloud Function. It listens for files uploaded to a bucket with the content type 'application/zip' and extracts them in place.
const functions = require('firebase-functions');
const admin = require("firebase-admin");
const path = require('path');
const fs = require('fs');
const os = require('os');
const unzip = require('unzipper');

admin.initializeApp();
const storage = admin.storage();

const runtimeOpts = {
  timeoutSeconds: 540,
  memory: '2GB'
};

exports.unzip = functions.runWith(runtimeOpts).storage.object().onFinalize((object) => {
  return new Promise((resolve, reject) => {
    //console.log(object)
    if (object.contentType !== 'application/zip') {
      reject(); // not a zip file, nothing to do
    } else {
      const bucket = storage.bucket(object.bucket);
      const remoteFile = bucket.file(object.name);
      const remoteDir = object.name.replace('.zip', '');

      console.log(`Downloading ${object.name}`);

      remoteFile.createReadStream()
        .on('error', err => {
          console.error(err);
          reject(err);
        })
        .on('response', response => {
          // Server connected and responded with the specified status and headers.
          //console.log(response)
        })
        .on('end', () => {
          // The file is fully downloaded.
          console.log("Finished downloading.");
          resolve();
        })
        .pipe(unzip.Parse())
        .on('entry', entry => {
          const file = bucket.file(`${remoteDir}/${entry.path}`);
          entry.pipe(file.createWriteStream())
            .on('error', err => {
              console.log(err);
              reject(err);
            })
            .on('finish', () => {
              console.log(`Finished extracting ${remoteDir}/${entry.path}`);
            });
          // Each entry is consumed by the pipe above, so entry.autodrain() is not needed.
        });
    }
  });
});

In a shell, you can use the following command to decompress a gzip-compressed object:
gsutil cat gs://bucket/obj.csv.gz | zcat | gsutil cp - gs://bucket/obj.csv

There is no built-in mechanism in GCS to unzip files. A feature request for this has already been forwarded to the Google development team.
As an alternative, you can upload the ZIP files to the GCS bucket, download them to a persistent disk attached to a VM instance, unzip them there, and upload the unzipped files with the gsutil tool, as sketched below.
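For reference, here is a minimal sketch of that workflow scripted with the google-cloud-storage Python client instead of gsutil (the bucket name, object path, and scratch directory below are placeholder assumptions):
# Sketch only: download a zip from GCS, extract it to local disk, re-upload the
# extracted files. Bucket name, object path, and scratch dir are placeholders.
import os
import zipfile

from google.cloud import storage

client = storage.Client()
bucket = client.bucket("mybucket")          # placeholder bucket name
zip_blob = bucket.blob("path/file.zip")     # placeholder object path

local_zip = "/mnt/disks/scratch/file.zip"   # e.g. a persistent disk mount
os.makedirs(os.path.dirname(local_zip), exist_ok=True)
zip_blob.download_to_filename(local_zip)

extract_dir = "/mnt/disks/scratch/extracted"
with zipfile.ZipFile(local_zip) as zf:
    zf.extractall(extract_dir)

# Upload each extracted file next to the original zip, under path/file/
for root, _, files in os.walk(extract_dir):
    for name in files:
        local_path = os.path.join(root, name)
        rel_path = os.path.relpath(local_path, extract_dir)
        bucket.blob(f"path/file/{rel_path}").upload_from_filename(local_path)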

There are Dataflow templates in Google Cloud Dataflow that can zip/unzip files in Cloud Storage (the Bulk Decompress template); a sketch of launching it from Python follows after the requirements below.
This template stages a batch pipeline that decompresses files on Cloud Storage to a specified location. This functionality is useful when you use compressed data to minimize network bandwidth costs.
The pipeline automatically handles multiple compression modes during a single execution and determines the decompression mode to use based on the file extension (.bzip2, .deflate, .gz, .zip).
Pipeline requirements:
The files to decompress must be in one of the following formats: Bzip2, Deflate, Gzip, Zip.
The output directory must exist prior to pipeline execution.
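If you would rather launch the Bulk Decompress template programmatically than from the console, something along these lines should work via the Dataflow templates.launch API; this is only a sketch, and the project, region, bucket, and job name are placeholder assumptions:
# Sketch: launch the Bulk Decompress template via the Dataflow API.
# Project, region, bucket and job name below are placeholders.
from googleapiclient.discovery import build

project = "my-project"          # placeholder
region = "europe-west6"         # placeholder
bucket = "my-bucket"            # placeholder

dataflow = build("dataflow", "v1b3")
request = dataflow.projects().locations().templates().launch(
    projectId=project,
    location=region,
    gcsPath=f"gs://dataflow-templates-{region}/latest/Bulk_Decompress_GCS_Files",
    body={
        "jobName": "bulk-decompress-example",
        "parameters": {
            "inputFilePattern": f"gs://{bucket}/*.zip",
            "outputDirectory": f"gs://{bucket}/decompressed/",
            "outputFailureFile": f"gs://{bucket}/decompressed/failures.csv",
        },
    },
)
print(request.execute())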

I'm afraid that Google Cloud provides no built-in program that can do this, but you can get this functionality, for example, using Python.
This is a universal method, available on any machine where Python is installed (so also on Google Cloud):
You need to enter the following commands:
python
or if you need administrator rights:
sudo python
and then in the Python Interpreter:
>>> from zipfile import ZipFile
>>> zip_file = ZipFile('path_to_file/t.zip', 'r')
>>> zip_file.extractall('path_to_extract_folder')
and finally, press Ctrl+D to exit the Python Interpreter.
The unpacked files will end up in the location you specify (provided, of course, that you have the appropriate permissions for it).
The above method works identically for Python 2 and Python 3.
Enjoy it to the fullest! :)

Enable the Dataflow API in your Google Cloud console.
Create a temp dir in your bucket (you can't use the bucket root).
Replace YOUR_REGION (e.g. europe-west6) and YOUR_BUCKET in the command below and run it with the gcloud CLI (the assumption is that the gz files are at the bucket root - change the pattern if not):
gcloud dataflow jobs run unzip \
--gcs-location gs://dataflow-templates-YOUR_REGION/latest/Bulk_Decompress_GCS_Files \
--region YOUR_REGION \
--num-workers 1 \
--staging-location gs://YOUR_BUCKET/temp \
--parameters inputFilePattern=gs://YOUR_BUCKET/*.gz,outputDirectory=gs://YOUR_BUCKET/,outputFailureFile=gs://YOUR_BUCKET/decomperror.txt

Another fast way to do it using Python in version 3.2 or higher:
import shutil
shutil.unpack_archive('filename')
The method also allows you to indicate the destination folder:
shutil.unpack_archive('filename', 'extract_dir')
The above method works not only for zip archives, but also for tar, gztar, bztar, or xztar archives.
If you need more options, look into the documentation of the shutil module: shutil.unpack_archive

Related

How to upload a zip to S3 with CDK

I'm working on building a CDK library and am trying to upload a zip folder to S3 that I can then use for a Lambda deployment later. I've found a lot of direction online to use aws_s3_deployment.
The problem with that construct is that it loads the contents of a zip rather than a zip itself. I've tried to zip a zip inside a zip and that doesn't work. I've also tried to zip a folder and that doesn't work either. The behavior I see is that nothing shows up in S3 and there are no errors from CDK. Is there another way to load a zip to S3?
What you're looking for is the aws-s3-assets module. It allows you to define either directories (which will be zipped) or regular files as assets that the CDK will upload to S3 for you. You can then refer to the assets via their attributes.
The documentation has this example for it:
import { Asset } from 'aws-cdk-lib/aws-s3-assets';
// Archived and uploaded to Amazon S3 as a .zip file
const directoryAsset = new Asset(this, "SampleZippedDirAsset", {
  path: path.join(__dirname, "sample-asset-directory")
});

// Uploaded to Amazon S3 as-is
const fileAsset = new Asset(this, 'SampleSingleFileAsset', {
  path: path.join(__dirname, 'file-asset.txt')
});
In order to upload the zip file to a given bucket, I ended up using BucketDeployment with a custom ILocalBundling. The custom bundler compresses the files and puts them in an assets directory for CDK to upload. The important part is to set output_type=BundlingOutput.NOT_ARCHIVED; this way CDK will not try to unzip the file.
# Imports assume aws-cdk-lib (CDK v2) style; adjust to aws_cdk.core for CDK v1.
import pathlib
import subprocess

import jsii
from aws_cdk import (
    BundlingOptions,
    BundlingOutput,
    DockerImage,
    ILocalBundling,
    aws_s3_deployment as s3_deployment,
)

@jsii.implements(ILocalBundling)
class LocalBundling:
    @jsii.member(jsii_name="tryBundle")
    def try_bundle(self, output_dir: str, image: DockerImage) -> bool:
        cwd = pathlib.Path.cwd()
        print(f"bundling to {output_dir}...")
        build_dir = f"{cwd}/directory/to"
        command = ["zip", "-r", f"{output_dir}/python.zip", "zip"]
        print(command)
        output = subprocess.run(command, capture_output=True, check=True, cwd=build_dir)
        # print(output.stdout.decode("utf-8"))
        return True

local_bundling = LocalBundling()
s3_deployment.BucketDeployment(
    self,
    "SomeIdForBucketDeployment",
    sources=[
        s3_deployment.Source.asset(
            "directory/to/zip",
            bundling=BundlingOptions(
                command=["none"],  # never used because local bundling succeeds
                image=DockerImage.from_registry("lm"),  # dummy image, also never used
                local=local_bundling,
                output_type=BundlingOutput.NOT_ARCHIVED,
            ),
        )
    ],
    destination_bucket=some_bucket,
    destination_key_prefix=some_key_prefix,
)

Trigger python codes from google cloud bucket to run against CSV file from another bucket

I have few python scripts which will process CSV files sent to a cloud bucket and uploads the output file into another bucket.
1. init.py (main file)
2. google_client.py (reads the input file and uploads the output file)
3. DP_Workflow.py (submits the file to the DP workflow to generate the output file)
This works fine locally, but I am trying to find a way to upload these scripts into a bucket and run them against a CSV file whenever one gets uploaded to another bucket. Is there a way to trigger these files at once?
You should create a Cloud Function; it will be triggered whenever a CSV file is uploaded to your bucket. To deploy a Cloud Function that responds to a new file in a bucket, you can use the following command:
gcloud functions deploy YOUR_FUNCTION \
  --entry-point=handler \
  --runtime=python37 \
  --trigger-resource=YOUR_TRIGGER_BUCKET_NAME \
  --trigger-event=google.storage.object.finalize
Then rename your init.py to main.py (Cloud Functions requires this filename) and put the following into that file as the entry point (a fuller sketch follows after the directory layout below):
def handler(data, context):
    bucket = data['bucket']
    file = data['name']
    ....
    # whatever processing you want here
Your directory layout:
main.py
requirements.txt
google_client.py
DP_Workflow.py
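For illustration, a fuller (hypothetical) main.py handler might look like the sketch below; the OUTPUT_BUCKET name and the process_csv helper are placeholders standing in for your own google_client / DP_Workflow code:
# Sketch of main.py: triggered on object finalize, downloads the CSV,
# processes it, and uploads the result to another bucket.
# OUTPUT_BUCKET and process_csv are placeholders for your own code.
from google.cloud import storage

OUTPUT_BUCKET = "my-output-bucket"  # placeholder

def process_csv(local_path):
    # Stand-in for your DP_Workflow / google_client logic.
    out_path = local_path + ".out"
    with open(local_path) as src, open(out_path, "w") as dst:
        dst.write(src.read())
    return out_path

def handler(data, context):
    client = storage.Client()
    src_bucket = client.bucket(data["bucket"])
    name = data["name"]

    local_path = f"/tmp/{name.replace('/', '_')}"
    src_bucket.blob(name).download_to_filename(local_path)

    out_path = process_csv(local_path)
    client.bucket(OUTPUT_BUCKET).blob(f"processed/{name}").upload_from_filename(out_path)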

google cloud platform blob.download_to_file where the file is downloaded

gcloud beta functions deploy gcp-data-transfer --project probable-scout-216702 --runtime python37 --trigger-bucket connector-test-data-securonix --entry-point detect_file
Above is the Google Cloud Function deployment command I am using; the function is a trigger on my Google Cloud Storage bucket. My function is running, but I don't know where the files are supposed to be downloaded. I was able to store one in a /tmp/ directory, but it's still not on my own system and I have no idea which /tmp/ it is downloading into. I am using the following code:
def detect_file(file, context):
    destination_file_name = "/Securonix/file1.txt"
    f = open("/Securonix/file1.txt", 'wb')
    bucket = validate_message(file, 'bucket')
    name = validate_message(file, 'name')
    print("here")

    storage_client = storage.Client()
    bucket = storage_client.get_bucket(bucket)
    blob = bucket.blob(name)

    blob.download_to_file(f)

    print('Blob {} downloaded to {}.'.format(
        name,
        destination_file_name))

def validate_message(message, param):
    var = message.get(param)
    if not var:
        raise ValueError('{} is not provided. Make sure you have \
            property {} in the request'.format(param, param))
    return var
I am getting the error:
FileNotFoundError: [Errno 2] No such file or directory: '/Securonix/file1.txt'
Use /tmp/ followed by your file name; this is because on Cloud Functions you can't create files anywhere except in the /tmp/ folder.
Your path should look like:
/tmp/Securonix/file1.txt
The Cloud Functions execution environment is read-only except for /tmp/, so you must download to that directory.
A second issue (and the reason why you saw FileNotFoundError instead of PermissionError) is that the /Securonix/ directory does not exist in the Cloud Functions runtime. open() does not create new directories for you; you must create the directory before you can create a file in it. A corrected sketch follows below.
Also: don't forget to delete the file when you are done, as /tmp/ is stored in memory.
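Putting both points together, a corrected version of the function could look like this sketch (the Securonix subdirectory name is simply carried over from the question):
# Sketch: download into /tmp/, creating the subdirectory first,
# and clean up afterwards since /tmp/ counts against function memory.
import os

from google.cloud import storage

def detect_file(file, context):
    bucket_name = validate_message(file, 'bucket')
    name = validate_message(file, 'name')

    destination_file_name = "/tmp/Securonix/file1.txt"
    os.makedirs(os.path.dirname(destination_file_name), exist_ok=True)

    storage_client = storage.Client()
    bucket = storage_client.get_bucket(bucket_name)
    blob = bucket.blob(name)

    blob.download_to_filename(destination_file_name)
    print('Blob {} downloaded to {}.'.format(name, destination_file_name))

    # ... process the file here ...

    os.remove(destination_file_name)  # free the in-memory /tmp/ space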

Accessing data in Google Cloud bucket for a python Tensorflow learning program

I’m working through the Google quick start examples for Cloud Learning / Tensorflow as shown here: https://cloud.google.com/ml/docs/quickstarts/training
I want my python program to access data that I have stored in a Google Cloud bucket such as gs://mybucket. How do I do this inside of my python program instead of calling it from the command line?
Specifically, the quickstart example for cloud learning utilizes data they provided but what if I want to provide my own data that I have stored in a bucket such as gs://mybucket?
I noticed a similar post here: How can I get the Cloud ML service account programmatically in Python? ... but I can’t seem to install the googleapiclient module.
Some posts seem to mention Apache Beam, though I can't tell whether that's relevant to me, and in any case I can't figure out how to download or install it.
If I understand your question correctly, you want to programmatically talk to GCS in Python.
The official docs are a good place to start.
First, grab the module using pip:
pip install --upgrade google-cloud-storage
Then:
from google.cloud import storage
client = storage.Client()
bucket = client.get_bucket('bucket-id-here')
# Then do other things...
blob = bucket.get_blob('remote/path/to/file.txt')
print(blob.download_as_string())
blob.upload_from_string('New contents!')
blob2 = bucket.blob('remote/path/storage.txt')
blob2.upload_from_filename(filename='/local/path.txt')
Assuming you are using Ubuntu/Linux as the OS and already have data in a GCS bucket, execute the following commands from a terminal.
Installation:
1. Install the storage module:
pip install google-cloud-storage
2. To verify that it is installed, type the command:
gsutil
(the output will show the available options)
Copy data from the GCS bucket:
First check whether you are able to get information about the bucket:
gsutil acl get gs://BucketName
Now copy the file from the GCS bucket to your machine:
gsutil cp gs://BucketName/FileName /PathToDestinationDir/
In this way, you will be able to copy data from the bucket to your machine for further processing.
NOTE: all of the above commands can be run from a Jupyter Notebook by prefixing them with !, e.g.
!gsutil cp gs://BucketName/FileName /PathToDestinationDir/

Call aws-cli from AWS Lambda

is there ANY way to execute aws-cli inside AWS Lambda?
It doesn't seem to be pre-installed.
(I've checked with "which aws" via a Node.js child process, and it didn't exist.)
Now we can use Layers inside Lambda. A Bash layer with aws-cli is available at https://github.com/gkrizek/bash-lambda-layer
handler () {
    set -e

    # Event Data is sent as the first parameter
    EVENT_DATA=$1

    # This is the Event Data
    echo $EVENT_DATA

    # Example of command usage
    EVENT_JSON=$(echo $EVENT_DATA | jq .)

    # Example of an AWS command whose output will show up in CloudWatch Logs
    aws s3 ls

    # This is the return value because it's being sent to stderr (>&2)
    echo "{\"success\": true}" >&2
}
Not unless you include it (and all of its dependencies) as part of your deployment package. Even then you would have to call it from within Python, since Lambda doesn't allow you to execute shell commands. Even if you get there, I would not recommend trying to do a sync in a Lambda function, since you're limited to a maximum of 5 minutes of execution time. On top of that, the additional spin-up time just isn't worth it in many cases, since you're paying for every 100 ms chunk.
So you can, but you probably shouldn't.
EDIT: Lambda does allow you to execute shell commands
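For example, if the aws binary is bundled in your deployment package (or provided by a layer under /opt), a Python handler can shell out to it with subprocess; this is only a sketch, and the binary path and bucket name are placeholder assumptions:
# Sketch: call a bundled/layer-provided aws binary from a Python Lambda handler.
# The path to the binary and the bucket name are placeholders.
import subprocess

AWS_BIN = "/opt/awscli/aws"  # placeholder: wherever your package/layer puts it

def handler(event, context):
    result = subprocess.run(
        [AWS_BIN, "s3", "ls", "s3://my-bucket/"],  # placeholder bucket
        capture_output=True,
        text=True,
        check=True,
    )
    print(result.stdout)
    return {"statusCode": 200, "body": result.stdout}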
aws-cli is a Python package. To make it available in an AWS Lambda function, you need to pack it with your function's zip file.
1) Start an EC2 instance with 64-bit Amazon Linux;
2) Create a python virtualenv:
mkdir ~/awscli_virtualenv
virtualenv ~/awscli_virtualenv
3) Activate virtualenv:
cd ~/awscli_virtualenv/bin
source activate
4) Install aws-cli and pyyaml:
pip install awscli
python -m easy_install pyyaml
5) Change the first line of the aws python script:
sed -i '1 s/^.*$/\#\!\/usr\/bin\/python/' aws
6) Deactivate virtualenv:
deactivate
7) Make a dir with all the files you need to run aws-cli on lambda:
cd ~
mkdir awscli_lambda
cd awscli_lambda
cp ~/awscli_virtualenv/bin/aws .
cp -r ~/awscli_virtualenv/lib/python2.7/dist-packages .
cp -r ~/awscli_virtualenv/lib64/python2.7/dist-packages .
8) Create a function (python or nodejs) that will call aws-cli:
For example (nodejs):
var Q = require('q');
var path = require('path');
var spawn = require('child-process-promise').spawn;

exports.handler = function(event, context) {

    var folderpath = '/folder/to/sync';
    var s3url = 's3://name-of-your-bucket/path/to/folder';

    var libpath = path.join(__dirname, 'lib');

    var env = Object.create(process.env);
    env.LD_LIBRARY_PATH = libpath;

    var command = path.join(__dirname, 'aws');
    var params = ['s3', 'sync', '.', s3url];
    var options = { cwd: folderpath, env: env };

    var spawnp = spawn(command, params, options);

    spawnp.childProcess.stdout.on('data', function (data) {
        console.log('[spawn] stdout: ', data.toString());
    });

    spawnp.childProcess.stderr.on('data', function (data) {
        console.log('[spawn] stderr: ', data.toString());
    });

    return spawnp
        .then(function(result) {
            if (result['code'] != 0) throw new Error(["aws s3 sync exited with code", result['code']].join(' '));
            return result;
        });
}
Create the index.js file (with the code above or your own code) at ~/awscli_lambda/index.js.
9) Zip everything (aws-cli files and dependencies and your function):
cd ~
zip -r awscli_lambda.zip awscli_lambda
Now you can also simply package your function as a Docker container image for Lambda and install the AWS CLI inside the image.
You can use the AWS node.js SDK which should be available in Lambda without installing it.
var AWS = require('aws-sdk');
var lambda = new AWS.Lambda();

lambda.invoke({
    FunctionName: 'arn:aws:lambda:us-west-2:xxxx:function:FN_NAME',
    Payload: JSON.stringify({}),
}, function(err, result) {
    ...
});
As far as I can tell, you get most, if not all, of the CLI functionality. See the full documentation here: http://docs.aws.amazon.com/AWSJavaScriptSDK/latest/AWS/Lambda.html
You can try this; I got it working for me.
1. Add the AWS CLI layer:
https://harishkm.in/2020/06/16/run-aws-cli-in-a-lambda-function/
2. Add a Lambda function and use the following handler to run any AWS CLI command line:
https://harishkm.in/2020/06/16/run-bash-scripts-in-aws-lambda-functions/
function handler () {
    EVENT_DATA=$1
    DATA=`/opt/awscli/aws s3 ls`
    RESPONSE="{\"statusCode\": 200, \"body\": \"$DATA\"}"
    echo $RESPONSE
}
If you are provisioning your Lambda using code, then this is the easiest way:
lambda_function.add_layers(AwsCliLayer(scope, "AwsCliLayer"))
Ref: https://pypi.org/project/aws-cdk.lambda-layer-awscli/
I think you should separate your trigger logic from the action.
Put a container with the aws cli on another EC2 instance, and use AWS Lambda to trigger it into action; one possible wiring is sketched below.
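One possible way to wire that up (my assumption; the answer does not name a mechanism) is to have the Lambda run the aws-cli command on the EC2 instance via SSM Run Command; the instance ID and command below are placeholders:
# Sketch: Lambda triggers an aws-cli command on an EC2 instance via SSM Run Command.
# The instance ID and the command are placeholders; the instance needs the SSM agent
# and an instance profile that allows it to be managed by SSM.
import boto3

ssm = boto3.client("ssm")

def handler(event, context):
    response = ssm.send_command(
        InstanceIds=["i-0123456789abcdef0"],  # placeholder instance ID
        DocumentName="AWS-RunShellScript",
        Parameters={"commands": ["aws s3 sync /data s3://my-bucket/data"]},  # placeholder
    )
    return response["Command"]["CommandId"]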