Custom vocabulary with AWS Transcribe - Japanese language in AWS

While using AWS Transcribe, I want to create a custom vocabulary, but I am not able to create one with Japanese words, nor can I find any sample of a custom vocabulary phrases file.
I tried the character codes from the table as well as an array of the Japanese words themselves as strings. Neither worked.
I get the error: "The vocabulary that you’re trying to create contains invalid characters or incorrectly formatted terms. See the developer guide for more information."
Here is my code:
import boto3

transcribe = boto3.client('transcribe')

response = transcribe.create_vocabulary(
    VocabularyName='vocab2',
    LanguageCode='ja-JP',
    Phrases=["0x3005 0x3005"]
)
Any leads would be appreciated!

Upload to S3 first, forget the upload file button
AWS provides two ways to create a custom vocabulary in the console: upload a file directly, or fetch it from S3. For the same file, I failed when uploading directly but succeeded when uploading it to S3 first. I guess it's a bug in AWS, but we have to live with it.
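If you want to do the same thing from code rather than the console, here is a minimal sketch of the S3-first approach with boto3. It assumes you have already prepared a vocabulary file per the developer guide; the bucket name, key, and local file name are placeholders.

# Sketch only: upload the vocabulary file to S3 first, then point
# create_vocabulary at it via VocabularyFileUri instead of passing Phrases.
# Bucket, key, and local file name below are placeholders.
import boto3

s3 = boto3.client('s3')
transcribe = boto3.client('transcribe')

bucket = 'my-transcribe-assets'      # hypothetical bucket
key = 'vocabularies/ja-vocab.txt'    # hypothetical key
s3.upload_file('ja-vocab.txt', bucket, key)

response = transcribe.create_vocabulary(
    VocabularyName='vocab2',
    LanguageCode='ja-JP',
    VocabularyFileUri=f's3://{bucket}/{key}'
)
print(response['VocabularyState'])   # PENDING while AWS processes the file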

AWS S3 filename

I'm trying to build an application with a Java backend that allows users to create a text with images in it (something like a personal blog). I'm planning to store these images in an S3 bucket. When uploading image files to the bucket, I hash the original name and store the hashed one in the bucket. Images are for display purposes only; no user will be able to download them. The frontend displays these images by getting a path to them from the server. So the question is: is there any need to store the original name of the image file in the database? And what are the reasons, if any, for doing so?
I guess in general it is not needed, because what matters more is how these resources are used and managed in the system.
Assuming your service is something like data access (similar to Google Drive), I don't think it's necessary to store it in the DB, unless you want to make faster search queries.
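If you do want to avoid a DB column entirely, one option is to attach the original name as S3 object metadata at upload time. Here is a minimal sketch in Python/boto3 (the same idea applies to the Java SDK); the bucket and helper names are made up for illustration, and note that S3 user metadata should stay within US-ASCII, so non-ASCII original names may still warrant a database column.

# Sketch only: keep the hashed name as the S3 key and store the original
# filename as user-defined object metadata. Bucket and helper names are placeholders.
import hashlib
import boto3

s3 = boto3.client('s3')
bucket = 'my-image-bucket'  # hypothetical bucket

def upload_image(path, original_name):
    hashed_key = hashlib.sha256(original_name.encode('utf-8')).hexdigest()
    s3.upload_file(
        path, bucket, hashed_key,
        ExtraArgs={'Metadata': {'original-name': original_name}}
    )
    return hashed_key

def original_name_of(key):
    # head_object returns the metadata without downloading the image itself
    return s3.head_object(Bucket=bucket, Key=key)['Metadata'].get('original-name')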

How can I use GCS Delete in Data Fusion Studio?

Apologies if this is very simple but I am a complete beginner at GCP.
I've created a pipeline that picks up multiple CSVs from a bucket, wrangles them, and then writes them into BigQuery. I want it to then delete the contents of the bucket folder the files came from. So let's say I pulled the CSVs using gs://bucket/Data/Country/*.CSV, can I use GCS Delete to get rid of all the CSVs in there?
As a desperate attempt :D, in the Objects to delete field I specified gs://bucket/Data/Country/*.*, but this didn't do a thing.
According to the Google Cloud Storage Delete plugin documentation, it's necessary to list each object, separated by commas.
There is a feature request asking for the possibility to allow suffixes and prefixes when using this plugin; you can use the +1 button and provide your feedback about how this feature could be useful.
On the other hand, I thought of a workaround that could work for you. Using the GCS documentation, I created a script to list all CSV objects in a bucket; you only have to copy & paste the output into the Objects to Delete property of the plugin. It's important to mention that I used this workaround with roughly 100 files; I'm not sure whether it's feasible with a much larger number of files.
from google.cloud import storage

bucket_name = "MY_BUCKET"
file_format = "csv"

def list_csv(bucket_name):
    # List every object in the bucket and print the ones whose name contains
    # "csv", formatted as gs:// URIs followed by a comma so the output can be
    # pasted into the plugin's Objects to Delete property.
    storage_client = storage.Client()
    blobs = storage_client.list_blobs(bucket_name)
    for blob in blobs:
        if file_format in blob.name:
            print("gs://" + bucket_name + "/" + blob.name + ",")
    return None

list_csv(bucket_name)

What is the correct way to set up S3 for loading content in the browser?

I want to do the following: a user in a browser types some text, and after he presses a 'Save' button, the text should be saved in a file (for example content.txt) in a folder (for example /username_text) at the root of an S3 bucket.
I also want the user, when he visits the same page, to be able to load the content from S3 and continue working on the file. Then, if he/she is done, save the file to S3 again.
Probably important to mention: I plan on using NodeJS for my back-end...
My question now is: what is the best way to set up this storing-and-retrieving flow? Do I create an API Gateway + Lambda function to GET and POST files through that? Or do I, for example, use the aws-sdk in Node to directly push and pull files from S3? Or is there a better way to do this?
I looked at the following two guides:
Using AWS S3 Buckets in a NodeJS App – Codebase – Medium
Image Upload and Retrieval from S3 Using AWS API Gateway and Lambda
Welcome to StackOverflow!
I think you are worrying too much about the not-so-important stuff. S3 is nothing but a storage system. You could have decided to store the contents of these files in DynamoDB, RDS, etc. What would you do if you stored the contents in those real databases? You'd fetch the data and display it to the user, wouldn't you?
This is what you need to do with S3! S3 is a smart choice in your scenario because your "file" can grow very big, and S3 is a great place for storing files. However, apparently, you're not actually storing files (think .pdf, .mp4, .mov, etc.); you're essentially only storing human-readable text.
So here's one approach on how to solve your problem:
FETCHING FILE CONTENT
User logs in
You fetch the user's personal information based on some token. You can store all the metadata in DynamoDB, where, given a user_id, you fetch all the "files" for this user. These "files" (metadata only) hold the bucket and key of the actual file on S3.
You use the getObject API from S3 to fetch the file based on your query and display the body of your file to your user in a RESTful way. Your response should look something like this:
{
    "content": "some content"
}
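As a rough illustration of this fetch step (not a definitive implementation), a Lambda handler in Python/boto3 could look like the sketch below; the table name, key schema, and event shape are assumptions.

# Sketch of the fetch step; table name, key schema, and event fields are placeholders.
import json
import boto3

s3 = boto3.client('s3')
files_table = boto3.resource('dynamodb').Table('user-files')  # hypothetical table

def get_content(event, context):
    # Look up the file's metadata (bucket/key) for this user, then read the text from S3
    item = files_table.get_item(
        Key={'user_id': event['user_id'], 'file_id': event['file_id']}
    )['Item']
    obj = s3.get_object(Bucket=item['bucket'], Key=item['key'])
    return {
        'statusCode': 200,
        'body': json.dumps({'content': obj['Body'].read().decode('utf-8')})
    }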
SAVING FILE CONTENT
User logs in
The user writes anything in a form and submits it. In your Lambda function, you grab the content of this form and process it. This request should look something like this:
{
    "file_id": "some-id",
    "user_id": "some-id",
    "content": "some-content"
}
If the file_id exists, update the content in S3. Otherwise, upload a new file to S3 and then create a new entry in DynamoDB. You'd then, of course, have to check whether the user submitting the changes actually owns the file; if you're using UUIDs it shouldn't be much of a problem, but it's still worth checking in case an ID is leaked somehow.
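Continuing with the same placeholder names as the fetch sketch above, the save step could be sketched like this (again, an illustration under assumed names, not a prescribed design).

# Sketch of the save step; bucket/table names and event fields are placeholders.
import json
import boto3

s3 = boto3.client('s3')
files_table = boto3.resource('dynamodb').Table('user-files')  # hypothetical table

def save_content(event, context):
    key = f"{event['user_id']}/{event['file_id']}.txt"
    s3.put_object(
        Bucket='my-content-bucket',  # hypothetical bucket
        Key=key,
        Body=event['content'].encode('utf-8'),
        ContentType='text/plain'
    )
    # Record (or refresh) the metadata row so the fetch step can find the object
    files_table.put_item(Item={
        'user_id': event['user_id'],
        'file_id': event['file_id'],
        'bucket': 'my-content-bucket',
        'key': key,
    })
    return {'statusCode': 200, 'body': json.dumps({'file_id': event['file_id']})}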
This way, you don't need to worry about uploading/downloading files, as those are CPU-intensive tasks, so you can keep your costs low as well as use very little RAM in your functions (128 MB should be more than enough); after all, you're now only serving text. Not only will this simplify your design, it will also make things simpler both in API Gateway and in your code, as you won't have to deal with binary types. The most you'll do is convert the buffer from S3 to a String when serving some content, but this should be completely fine.
EDIT
On your question regarding whether you should upload it from the browser or not, I suggest you take a look at this answer, where I cover the pros/cons of doing it via API Gateway vs. from the browser.

extract all aws transcribe results using boto3

I have a couple hundred transcribed results in AWS Transcribe, and I would like to get all the transcribed text and store it in one file.
Is there any way to do this without clicking on each transcribed result and copying and pasting the text?
You can do this via the AWS APIs.
For example, if you were using Python, you can use the Python boto3 SDK:
list_transcription_jobs() will return a list of Transcription Job Names
For each job, you could then call get_transcription_job(), which will provide the TranscriptFileUri, the location where the transcript is stored.
You can then use get_object() to download the file from Amazon S3
Your program would then need to combine the content from each file into one file.
See how you go with that. If you run into any specific difficulties, post a new Question with the code and an explanation of the problem.
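As a rough sketch of those steps, assuming the transcripts are reachable via the TranscriptFileUri returned by get_transcription_job (an HTTPS link that can be downloaded with the Requests package; if your jobs write to your own bucket you could swap in get_object), something like the following could collect everything into one file. The output file name is a placeholder.

# Sketch only: list completed jobs, fetch each transcript, and append the text to one file.
import boto3
import requests

transcribe = boto3.client('transcribe')

def collect_transcripts(output_path='all_transcripts.txt'):
    job_names = []
    next_token = None
    while True:
        kwargs = {'Status': 'COMPLETED'}
        if next_token:
            kwargs['NextToken'] = next_token
        page = transcribe.list_transcription_jobs(**kwargs)
        job_names += [j['TranscriptionJobName'] for j in page['TranscriptionJobSummaries']]
        next_token = page.get('NextToken')
        if not next_token:
            break

    with open(output_path, 'w', encoding='utf-8') as out:
        for name in job_names:
            job = transcribe.get_transcription_job(TranscriptionJobName=name)
            uri = job['TranscriptionJob']['Transcript']['TranscriptFileUri']
            transcript = requests.get(uri).json()
            out.write(f"--- {name} ---\n")
            out.write(transcript['results']['transcripts'][0]['transcript'] + "\n\n")

collect_transcripts()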
I put an example on GitHub that shows how to:
run an AWS Transcribe job,
use the Requests package to get the output,
write output to the console.
You ought to be able to refit it pretty easily for your purposes. Here's some of the code, but it'll make more sense if you check out the full example:
# start_job, get_job, and TranscribeCompleteWaiter are helpers from the full example linked above.
job_name_simple = f'Jabber-{time.time_ns()}'
print(f"Starting transcription job {job_name_simple}.")
start_job(
    job_name_simple, f's3://{bucket_name}/{media_object_key}', 'mp3', 'en-US',
    transcribe_client)
transcribe_waiter = TranscribeCompleteWaiter(transcribe_client)
transcribe_waiter.wait(job_name_simple)
job_simple = get_job(job_name_simple, transcribe_client)
transcript_simple = requests.get(
    job_simple['Transcript']['TranscriptFileUri']).json()
print(f"Transcript for job {transcript_simple['jobName']}:")
print(transcript_simple['results']['transcripts'][0]['transcript'])

.csv upload not working in Amazon Web Services Machine Learning

I have uploaded a simple 10-row CSV file (via S3) to the AWS ML website. It keeps giving me the error:
"We cannot find any valid records for this datasource."
There are records there, and the Y variable is continuous (not binary). I am pretty much stuck at this point because there is only one button to move forward with building the machine learning model. Does anyone know what I should do to fix it? Thanks!
The only way I have been able to upload .csv files I created on my own to S3 is by downloading an existing .csv file from my S3 bucket, modifying the data, uploading it, and then changing the name in the S3 console.
Could you post the first few lines of the contents of the .csv file? I am able to upload my own .csv file along with a schema that I have created, and it is working. However, I did have issues in that Amazon ML was unable to create the schema for me.
Also, did you try saving the data in something like Sublime, Notepad++, etc. in order to get a different format? On my Mac with Microsoft Excel, the CSV did not work, but when I tried LibreOffice on my Windows machine, the same file worked perfectly.
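If the problem turns out to be line endings or encoding (a common side effect of saving CSVs from Excel on a Mac), a small script like the sketch below can normalize the file before you upload it to S3. It assumes the file is already UTF-8 or ASCII, and the paths are placeholders.

# Sketch only: rewrite the CSV with LF line endings and UTF-8 (no BOM).
# Paths are placeholders; adjust the source encoding if your file isn't UTF-8/ASCII.
def normalize_csv(src_path, dst_path):
    with open(src_path, 'r', encoding='utf-8-sig', newline='') as src:
        data = src.read()
    data = data.replace('\r\n', '\n').replace('\r', '\n')
    with open(dst_path, 'w', encoding='utf-8', newline='') as dst:
        dst.write(data)

normalize_csv('input.csv', 'normalized.csv')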