Use pydub with AWS-stored files - django

I want to concatenate two audio files I'm storing in AWS, save the result as .wav, and pass it to IBM's Speech-to-Text API.
This is what a normal call to IBM looks like:
from os.path import dirname, join

with open(join(dirname(__file__), './.', 'audio-file.wav'),
          'rb') as audio_file:
    recognition_job = speech_to_text.create_job(
        audio_file,
        content_type='audio/wav',
        timestamps=True
    ).get_result()
Can pydub export directly to AWS, since in production I can't have the file stored locally?
Thank you in advance!

When you say "export to AWS" I assume you mean to Amazon S3; from there you want to invoke IBM's speech-to-text API. To interact with Amazon S3 in Python you should use the boto3 SDK.
You don't need to export your data to a temporary local file at all; you can keep the data in memory in Python.
import io

import boto3
from pydub import AudioSegment
from ibm_watson import SpeechToTextV1

speech_to_text = SpeechToTextV1()
s3r = boto3.resource("s3")
bucket = "randall-stackoverflow"

# Download both objects into in-memory buffers
file1 = io.BytesIO()
s3r.Object(bucket, "file1.wav").download_fileobj(file1)
file1.seek(0)  # rewind so pydub reads from the start

file2 = io.BytesIO()
s3r.Object(bucket, "file2.wav").download_fileobj(file2)
file2.seek(0)

sound1 = AudioSegment.from_wav(file1)
sound2 = AudioSegment.from_wav(file2)
combined = sound1.append(sound2)  # maybe add crossfade

# Export the concatenation as a real WAV file (with headers) into memory;
# combined.raw_data would be headerless PCM, which doesn't match audio/wav
wav_buffer = io.BytesIO()
combined.export(wav_buffer, format="wav")
wav_buffer.seek(0)

recognition_job = speech_to_text.create_job(
    wav_buffer,
    content_type='audio/wav',
    timestamps=True
)
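create_job is asynchronous, so the transcription isn't available immediately. A minimal polling sketch, reusing the recognition_job returned above; it assumes the check_job call and the job field names from the ibm_watson SDK, so verify them against your SDK version:
import time

# create_job returns a DetailedResponse; get_result() gives the job dict with its id
job = recognition_job.get_result()

while job['status'] not in ('completed', 'failed'):
    time.sleep(5)
    job = speech_to_text.check_job(job['id']).get_result()

if job['status'] == 'completed':
    print(job['results'])  # recognition results, including timestamps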
I'd be remiss if I didn't mention Amazon Transcribe, which would let you do all of this within the AWS cloud.
transcribe = boto3.client("transcribe")
url = "{}/{}/{}".format(
    s3r.meta.client.meta.endpoint_url,
    bucket,
    "file1.wav"
)
transcribe.start_transcription_job(
    TranscriptionJobName="ExampleJob",
    Media={"MediaFileUri": url},
    LanguageCode="en-US",
    MediaFormat="wav"
)
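start_transcription_job only kicks the job off; you would then poll until it finishes and fetch the transcript. A hedged sketch reusing the transcribe client and job name above (the response fields follow the standard boto3 Transcribe API):
import time

# Poll the job until Amazon Transcribe reports a terminal state
while True:
    job = transcribe.get_transcription_job(TranscriptionJobName="ExampleJob")
    status = job["TranscriptionJob"]["TranscriptionJobStatus"]
    if status in ("COMPLETED", "FAILED"):
        break
    time.sleep(10)

if status == "COMPLETED":
    # The transcript itself is a JSON document at this URI
    print(job["TranscriptionJob"]["Transcript"]["TranscriptFileUri"])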

Related

Does the Google Cloud Vision API (source path: gcsSource) support image detection (images containing text) in a PDF file?

I am using OCR with TEXT_DETECTION and DOCUMENT_TEXT_DETECTION to process a PDF file (InputConfig mimeType "application/pdf"). Currently, images are being skipped during processing. Is there any way to process images (containing text) in the PDF file?
To answer your question: yes, there is a way to process images with text in PDF files. According to Google's official documentation, this is normally done using OCR DOCUMENT_TEXT_DETECTION [1].
The Vision API can detect and transcribe text from PDF and TIFF files stored in Cloud Storage. Document text detection from PDF and TIFF must be requested using the files:asyncBatchAnnotate function, which performs an offline (asynchronous) request and provides its status using the operations resources. The output from a PDF/TIFF request is written to a JSON file created in the specified Cloud Storage bucket.[2]
[1] https://cloud.google.com/vision/docs/ocr#optical_character_recognition_ocr
[2] https://cloud.google.com/vision/docs/pdf#vision_text_detection_pdf_gcs-gcloud
EDIT
I don't know what language you are using, but I tried this Python code and it processes a PDF with images without skipping them.
You need to install google-cloud-storage and google-cloud-vision.
In gcs_source_uri you have to specify your bucket name and the PDF file you are using.
In gcs_destination_uri you only have to specify your bucket name and leave pdf_result as it is.
import os
import re
import json

from google.cloud import vision
from google.cloud import storage

# pip install --upgrade google-cloud-storage
# pip install --upgrade google-cloud-vision

credential_path = 'your_path'
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = credential_path

client = vision.ImageAnnotatorClient()
batch_size = 2
mime_type = 'application/pdf'
feature = vision.Feature(
    type_=vision.Feature.Type.DOCUMENT_TEXT_DETECTION)

gcs_source_uri = 'gs://your_bucketname/your_pdf_File.pdf'
gcs_source = vision.GcsSource(uri=gcs_source_uri)
input_config = vision.InputConfig(gcs_source=gcs_source, mime_type=mime_type)

gcs_destination_uri = 'gs://your_bucketname/pdf_result'
gcs_destination = vision.GcsDestination(uri=gcs_destination_uri)
output_config = vision.OutputConfig(gcs_destination=gcs_destination, batch_size=batch_size)

async_request = vision.AsyncAnnotateFileRequest(
    features=[feature], input_config=input_config, output_config=output_config
)
operation = client.async_batch_annotate_files(requests=[async_request])
operation.result(timeout=180)

storage_client = storage.Client()
match = re.match(r'gs://([^/]+)/(.+)', gcs_destination_uri)
bucket_name = match.group(1)
prefix = match.group(2)
bucket = storage_client.get_bucket(bucket_name)

# List objects with the given prefix
blob_list = list(bucket.list_blobs(prefix=prefix))
print('Output files: ')
for blob in blob_list:
    print(blob.name)

output = blob_list[0]
json_string = output.download_as_string()
response = json.loads(json_string)
first_page_response = response['responses'][0]
annotation = first_page_response['fullTextAnnotation']
print('Full text:\n')
print(annotation['text'])
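Note that the snippet only prints the first page of the first output file. If the PDF has more pages than batch_size, the results are split across several JSON files in the destination bucket; a small follow-on sketch (reusing blob_list and json from above) to walk all of them:
# Walk every output JSON file and every per-page response inside it
for blob in blob_list:
    response = json.loads(blob.download_as_string())
    for page_response in response['responses']:
        annotation = page_response.get('fullTextAnnotation')
        if annotation:
            print(annotation['text'])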

How can I get reports from Google Cloud Storage using Google's API

I have to create a program that gets information on a daily basis about installations of a group of apps on the App Store and the Play Store.
For the Play Store, using Google Cloud Storage, I followed the instructions on this page, using the client library, a Service Account method, and the Python code example:
https://support.google.com/googleplay/android-developer/answer/6135870?hl=en&ref_topic=7071935
I slightly changed the given code to make it work, since the documentation doesn't look up to date. I managed to connect to the API and it seems to connect correctly.
My problem is that I don't understand what object I get and how to use it. It's not a report; it just looks like file properties in a dict.
This is my code (private data hidden):
import json
from httplib2 import Http
from oauth2client.service_account import ServiceAccountCredentials
from googleapiclient.discovery import build
client_email = '************.iam.gserviceaccount.com'
json_file = 'PATH/TO/MY/JSON/FILE'
cloud_storage_bucket = 'pubsite_prod_rev_**********'
report_to_download = 'stats/installs/installs_****************_202005_app_version.csv'
private_key = json.loads(open(json_file).read())['private_key']
credentials = ServiceAccountCredentials.from_json_keyfile_name(json_file, scopes='https://www.googleapis.com/auth/devstorage.read_only')
storage = build('storage', 'v1', http=credentials.authorize(Http()))
supposed_to_be_report = storage.objects().get(bucket=cloud_storage_bucket, object=report_to_download).execute()
When I print supposed_to_be_report (which is a dictionary), I only get what I understand to be metadata about the report, like this:
{'kind': 'storage#object', 'id': 'pubsite_prod_rev_***********/stats/installs/installs_****************_202005_app_version.csv/1591077412052716',
'selfLink': 'https://www.googleapis.com/storage/v1/b/pubsite_prod_rev_***********/o/stats%2Finstalls%2Finstalls_*************_202005_app_version.csv',
'mediaLink': 'https://storage.googleapis.com/download/storage/v1/b/pubsite_prod_rev_***********/o/stats%2Finstalls%2Finstalls_****************_202005_app_version.csv?generation=1591077412052716&alt=media',
'name': 'stats/installs/installs_***********_202005_app_version.csv',
'bucket': 'pubsite_prod_rev_***********',
'generation': '1591077412052716',
'metageneration': '1',
'contentType': 'text/csv; charset=utf-16le',
'storageClass': 'STANDARD', 'size': '378', 'md5Hash': '*****==', 'contentEncoding': 'gzip', ...
I am not sure I'm using it correctly. Could you please explain where I am wrong and/or how to get the install reports correctly?
Thanks.
I can see that you are using the googleapiclient.discovery client. This is not an issue, but the recommended way to access Google Cloud APIs programmatically is to use the client libraries.
Second, you are only retrieving the object's metadata. You need to download the object to get access to the file contents; here is a sample using the client library:
from google.cloud import storage

def download_blob(bucket_name, source_blob_name, destination_file_name):
    """Downloads a blob from the bucket."""
    # bucket_name = "your-bucket-name"
    # source_blob_name = "storage-object-name"
    # destination_file_name = "local/path/to/file"
    storage_client = storage.Client()
    bucket = storage_client.bucket(bucket_name)
    blob = bucket.blob(source_blob_name)
    blob.download_to_filename(destination_file_name)
    print(
        "Blob {} downloaded to {}.".format(
            source_blob_name, destination_file_name
        )
    )
Sample taken from official docs.
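If you'd rather not write the report to disk at all, you can download it into memory and parse it directly. A rough sketch, where the bucket name and report path are hypothetical placeholders (the question masks the real ones) and the UTF-16 decoding follows the contentType shown in the metadata above:
import csv
import gzip
import io

from google.cloud import storage

storage_client = storage.Client()
bucket = storage_client.bucket('pubsite_prod_rev_XXXXXXXX')  # hypothetical bucket name
blob = bucket.blob('stats/installs/installs_XXXXXXXX_202005_app_version.csv')  # hypothetical report path

# Download the object's contents (not just its metadata) into memory
data = blob.download_as_string()

# The object is stored with contentEncoding: gzip; the client normally returns it
# decompressed, but decompress manually if the gzip magic bytes are still there
if data[:2] == b'\x1f\x8b':
    data = gzip.decompress(data)

# Play Console reports are UTF-16 encoded CSVs (see contentType above)
text = data.decode('utf-16')
for row in csv.reader(io.StringIO(text)):
    print(row)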

How to pass a dynamically created PowerPoint file to Google Cloud Storage in Python

I'm trying to build a PowerPoint in my Python 2.7 application and upload it on the fly to Google Cloud Storage.
I can create the ppt, store it on my local hard drive as an intermediate step, and then pick it up from there to upload to Google Cloud Storage. This works well. However, my production application will run on Google App Server, so I want to be able to create the PowerPoint and upload it to Google Storage directly (without the intermediate step).
Any ideas how to do this? blob.upload_from_file() seems to only be able to pick up files that are physically stored somewhere, but since my app is building these PowerPoints in memory, I don't know what to pass to blob.upload_from_file as an argument. I tried to use the StringIO module but it's generating the error message below.
from google.cloud import storage
from pptx import Presentation
from StringIO import StringIO
prs = Presentation()
title_slide_layout = prs.slide_layouts[0]
slide = prs.slides.add_slide(title_slide_layout)
title = slide.shapes.title
subtitle = slide.placeholders[1]
title.text = "Hello, World!"
subtitle.text = "python-pptx was here!"
out_file = StringIO()
prs.save(out_file)
client = storage.Client()
bucket = client.get_bucket([GCP_Bucket_Name])
blob = bucket.blob('test.pptx')
blob.upload_from_file(out_file)
print blob.public_url
ValueError: Stream must be at beginning.
You are uploading the out_file object with its stream position still at the end of the data that prs.save() just wrote, which is what the "Stream must be at beginning" error is complaining about.
One way around it is to:
Save the Presentation object to the file system with prs.save(savingPath)
Read the file back into a StringIO object.
Finally, upload the StringIO object.
Here is the code:
from google.cloud import storage
from pptx import Presentation
from StringIO import StringIO

# Create the presentation
prs = Presentation()
title_slide_layout = prs.slide_layouts[0]
slide = prs.slides.add_slide(title_slide_layout)
title = slide.shapes.title
subtitle = slide.placeholders[1]
title.text = "Hello, World!"
subtitle.text = "python-pptx was here!"
prs.save('test.pptx')

client = storage.Client()
bucket = client.get_bucket('yourBucket')
blob = bucket.blob('test.pptx')
with open('test.pptx', 'rb') as ppt:
    out_file = StringIO(ppt.read())
blob.upload_from_file(out_file)
print blob.public_url
Finally, you mentioned the code will run on Google App Server; this product does not exist. If you meant Google App Engine, keep in mind that not all of its runtimes have access to the file system, so you might not be able to save the pptx file to the application's file system.
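For what it's worth, a rough in-memory variant that avoids the file system entirely; it reuses the question's Python 2.7 / StringIO setup, and the content type value and bucket name are just placeholders:
from StringIO import StringIO

from google.cloud import storage
from pptx import Presentation

# Build the presentation as in the question
prs = Presentation()
slide = prs.slides.add_slide(prs.slide_layouts[0])
slide.shapes.title.text = "Hello, World!"

# Save into an in-memory buffer and rewind it, otherwise upload_from_file
# raises "ValueError: Stream must be at beginning"
out_file = StringIO()
prs.save(out_file)
out_file.seek(0)

client = storage.Client()
bucket = client.get_bucket('yourBucket')  # placeholder bucket name, as above
blob = bucket.blob('test.pptx')
blob.upload_from_file(
    out_file,
    content_type='application/vnd.openxmlformats-officedocument.presentationml.presentation')
print blob.public_url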
I solved this problem by moving my GAE application from the Standard to the Flexible environment. The Flexible environment allows writing the presentation file to the root directory; I pick it up from there, upload it to Google Storage, and then delete the file from the root. This worked for me.

How to extract files in S3 on the fly with boto3?

I'm trying to find a way to extract .gz files in S3 on the fly, that is, without needing to download them locally, extract them, and then push them back to S3.
With boto3 + Lambda, how can I achieve my goal?
I didn't see any extract functionality in the boto3 documentation.
You can use BytesIO to stream the file from S3, run it through gzip, then pipe it back up to S3 using upload_fileobj to write the BytesIO.
# python imports
import boto3
from io import BytesIO
import gzip

# setup constants
bucket = '<bucket_name>'
gzipped_key = '<key_name.gz>'
uncompressed_key = '<key_name>'

# initialize s3 client, this is dependent upon your aws config being done
s3 = boto3.client('s3', use_ssl=False)  # optional
s3.upload_fileobj(                      # upload a new obj to s3
    Fileobj=gzip.GzipFile(              # read in the output of gzip -d
        None,                           # just return output as BytesIO
        'rb',                           # read binary
        fileobj=BytesIO(s3.get_object(Bucket=bucket, Key=gzipped_key)['Body'].read())),
    Bucket=bucket,                      # target bucket, writing to
    Key=uncompressed_key)               # target key, writing to
Ensure that your key is reading in correctly:
# read the body of the s3 key object into a string to ensure download
s = s3.get_object(Bucket=bucket, Key=gzipped_key)['Body'].read()
print(len(s)) # check to ensure some data was returned
The above answers are for gzip files; for zip files, you may try:
import boto3
import zipfile
from io import BytesIO

bucket = 'bucket1'
s3 = boto3.client('s3', use_ssl=False)
Key_unzip = 'result_files/'
prefix = "folder_name/"

zipped_keys = s3.list_objects_v2(Bucket=bucket, Prefix=prefix, Delimiter="/")
file_list = []
for key in zipped_keys['Contents']:
    file_list.append(key['Key'])
# This will give you a list of files in the folder you mentioned as prefix

s3_resource = boto3.resource('s3')
# Now create a zip object one by one; this below is for the 1st file in file_list
zip_obj = s3_resource.Object(bucket_name=bucket, key=file_list[0])
print(zip_obj)
buffer = BytesIO(zip_obj.get()["Body"].read())
z = zipfile.ZipFile(buffer)
for filename in z.namelist():
    file_info = z.getinfo(filename)
    s3_resource.meta.client.upload_fileobj(
        z.open(filename),
        Bucket=bucket,
        Key='result_files/' + f'{filename}')
This will work for your zip file, and the resulting unzipped data will be in the result_files folder. Make sure to increase memory and time on AWS Lambda to the maximum, since some files are pretty large and need time to write.
Amazon S3 is a storage service. There is no built-in capability to manipulate the content of files.
However, you could use an AWS Lambda function to retrieve an object from S3, decompress it, then upload the content back up again. Please note that Lambda has a default limit of 512 MB of temporary disk space, so avoid decompressing too much data at the same time.
You could configure the S3 bucket to trigger the Lambda function when a new file is created in the bucket (a sketch of such a handler follows the sample code below). The Lambda function would then:
Use boto3 to download the new file
Use the gzip Python library to extract files
Use boto3 to upload the resulting file(s)
Sample code:
import gzip
import io

import boto3

bucket = '<bucket_name>'
key = '<key_name>'

s3 = boto3.client('s3', use_ssl=False)
compressed_file = io.BytesIO(
    s3.get_object(Bucket=bucket, Key=key)['Body'].read())
uncompressed_file = gzip.GzipFile(None, 'rb', fileobj=compressed_file)
s3.upload_fileobj(Fileobj=uncompressed_file, Bucket=bucket, Key=key[:-3])
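And for the trigger itself, a hedged sketch of the Lambda handler that the S3 event notification would invoke; the event fields follow the standard S3 notification payload, and stripping the .gz suffix for the output key is just a convention, not a requirement:
import gzip
import io
import urllib.parse

import boto3

s3 = boto3.client('s3')

def lambda_handler(event, context):
    # S3 event notifications carry the bucket name and the URL-encoded object key
    for record in event['Records']:
        bucket = record['s3']['bucket']['name']
        key = urllib.parse.unquote_plus(record['s3']['object']['key'])
        if not key.endswith('.gz'):
            continue  # only handle gzip objects

        compressed = io.BytesIO(
            s3.get_object(Bucket=bucket, Key=key)['Body'].read())
        uncompressed = gzip.GzipFile(None, 'rb', fileobj=compressed)

        # Upload the decompressed content, dropping the .gz suffix
        s3.upload_fileobj(Fileobj=uncompressed, Bucket=bucket, Key=key[:-3])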

django boto3: NoCredentialsError -- Unable to locate credentials

I am trying to use boto3 in my Django project to upload files to Amazon S3. Credentials are defined in settings.py:
AWS_ACCESS_KEY = xxxxxxxx
AWS_SECRET_KEY = xxxxxxxx
S3_BUCKET = xxxxxxx
In views.py:
import boto3
s3 = boto3.client('s3')
path = os.path.dirname(os.path.realpath(__file__))
s3.upload_file(path+'/myphoto.png', S3_BUCKET, 'myphoto.png')
The system complains about "Unable to locate credentials". I have two questions:
(a) It seems that I am supposed to create a credentials file ~/.aws/credentials. But in a Django project, where should I put it?
(b) The S3 method upload_file takes a file path/name as its first argument. Is it possible to provide a file stream obtained from a form input element <input type="file" name="fileToUpload">?
This is what I use for a direct upload (note that it uses the older boto library rather than boto3); I hope it provides some assistance.
import boto
from boto.exception import S3CreateError
from boto.s3.connection import S3Connection
from django.conf import settings

conn = S3Connection(settings.AWS_ACCESS_KEY,
                    settings.AWS_SECRET_KEY,
                    is_secure=True)
try:
    bucket = conn.create_bucket(settings.S3_BUCKET)
except S3CreateError as e:
    bucket = conn.get_bucket(settings.S3_BUCKET)

k = boto.s3.key.Key(bucket)
k.key = filename
k.set_contents_from_filename(filepath)
Not sure about (a), but Django is very flexible with file management.
Regarding (b), you can also sign the upload and do it directly from the client to reduce bandwidth usage; it's quite sneaky and secure too. You need to use some JavaScript to manage the upload. If you want details I can include them here.
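For reference, a boto3 sketch touching both parts of the question: passing the credentials from settings.py explicitly for (a), and streaming the uploaded form file (or pre-signing the upload, as mentioned above) for (b). The view and function names here are made up for illustration:
import boto3
from django.conf import settings
from django.http import HttpResponse

def s3_client():
    # Pass the keys explicitly instead of relying on ~/.aws/credentials (question a)
    return boto3.client(
        's3',
        aws_access_key_id=settings.AWS_ACCESS_KEY,
        aws_secret_access_key=settings.AWS_SECRET_KEY,
    )

def upload_view(request):
    # upload_fileobj accepts any file-like object, so Django's UploadedFile from
    # <input type="file" name="fileToUpload"> can be streamed directly (question b)
    uploaded = request.FILES['fileToUpload']
    s3_client().upload_fileobj(uploaded, settings.S3_BUCKET, uploaded.name)
    return HttpResponse('uploaded')

def presigned_post():
    # Alternative: sign the upload server-side and let the browser POST straight to S3
    return s3_client().generate_presigned_post(
        Bucket=settings.S3_BUCKET,
        Key='myphoto.png',
        ExpiresIn=3600,
    )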