AWS Comprehend output - amazon-web-services

I am new to using AWS Comprehend to search for PII. I can get the job to run against an S3 bucket, but I can't read the output. The output lands in another bucket that I specified, and all of the output files have .out as the extension. I was expecting output in report form, or at least the ability to open the output files and verify the PII. One example of the output is a .png file that ends up with the extension .png.out.
I do not want to redact the PII at this point. I just want to identify it. Any help would be appreciated.
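For reference, here is a minimal sketch of how one might inspect those .out files, assuming the job was a PII entities detection job and the output objects contain JSON with an Entities list (the exact field names may differ; the bucket and prefix names below are placeholders, not from the question):

    # Sketch: download Comprehend's .out files and list the detected PII entities.
    # Assumes each output object is JSON (one document per line) with an "Entities"
    # list; bucket and prefix names are placeholders.
    import json
    import boto3

    s3 = boto3.client("s3")
    OUTPUT_BUCKET = "my-comprehend-output-bucket"   # hypothetical
    OUTPUT_PREFIX = "pii-job-output/"               # hypothetical

    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=OUTPUT_BUCKET, Prefix=OUTPUT_PREFIX):
        for obj in page.get("Contents", []):
            if not obj["Key"].endswith(".out"):
                continue
            body = s3.get_object(Bucket=OUTPUT_BUCKET, Key=obj["Key"])["Body"].read()
            for line in body.decode("utf-8").splitlines():
                if not line.strip():
                    continue
                doc = json.loads(line)
                for entity in doc.get("Entities", []):
                    print(obj["Key"], entity.get("Type"),
                          entity.get("BeginOffset"), entity.get("EndOffset"),
                          entity.get("Score"))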

Related

Get only the file names from an S3 bucket folder and write them to a text file in another bucket using Glue

In Glue we are performing an ETL operation. After the input files are supplied and the Glue job runs successfully, the output files are added to an output bucket. We now need to write extra code that copies the names of those output files into a text file in another bucket.
E.g.: suppose today 5 files were added to the output bucket; their names should immediately be written, along with a timestamp, to a text file in another bucket.
Can anyone please tell me how to solve this?
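A rough sketch of the kind of extra step described above, using boto3 (the bucket names and the key layout are placeholders, not from the question):

    # Sketch: list today's objects in the output bucket and write their names
    # into a timestamped text file in another bucket.
    from datetime import datetime, timezone
    import boto3

    s3 = boto3.client("s3")
    OUTPUT_BUCKET = "my-glue-output-bucket"    # hypothetical
    MANIFEST_BUCKET = "my-manifest-bucket"     # hypothetical

    now = datetime.now(timezone.utc)
    names = []
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=OUTPUT_BUCKET):
        for obj in page.get("Contents", []):
            # keep only objects written today
            if obj["LastModified"].date() == now.date():
                names.append(obj["Key"])

    manifest_key = f"file-lists/output-files-{now:%Y-%m-%d-%H%M%S}.txt"
    s3.put_object(Bucket=MANIFEST_BUCKET, Key=manifest_key,
                  Body="\n".join(names).encode("utf-8"))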

How should I handle public (TCGA) data on AWS?

I'm new to AWS development and studying how to use TCGA data (https://registry.opendata.aws/tcga/) on my EC2 instance.
$ aws s3 ls s3://tcga-2-open/ gives me millions of files. However, the XML listing at http://tcga-2-open.s3.amazonaws.com shows me only 1000 entries. Is there a full list of the files, ideally describing their hierarchy, so that I can look up the coverage of this resource?
All TCGA files have a prefix, and the prefixes appear to be GDC UUIDs (https://docs.gdc.cancer.gov/Encyclopedia/pages/UUID/). However, some files have UUIDs that I cannot find in the original source. Downloading a 'manifest' from https://portal.gdc.cancer.gov/repository gives me a list of UUIDs and what they refer to. However, some UUIDs appear in the manifest file but not in the XML listing, and vice versa. So how do I find out what a file such as http://tcga-2-open.s3.amazonaws.com/00002fe8-ec8e-4e0e-a174-35f039c15d06/6057825020_R01C01_Grn.idat is, when I cannot find '00002fe8-ec8e-4e0e-a174-35f039c15d06' in the ground-truth manifest file?
Is there any step-by-step tutorial for using Open Data on AWS?
Any help to get me through this would be really appreciated.
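On the 1000-entry point: a plain GET on the bucket URL returns only the first page of an S3 listing (at most 1,000 keys per request), so building a full inventory requires pagination. A minimal sketch with boto3 follows; unsigned access is assumed because this is an AWS Open Data bucket.

    # Sketch: page through the public tcga-2-open bucket and collect the
    # top-level prefixes (the UUID "folders").  Unsigned access is assumed
    # since this is an open-data bucket.
    import boto3
    from botocore import UNSIGNED
    from botocore.config import Config

    s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))
    paginator = s3.get_paginator("list_objects_v2")

    prefixes = []
    for page in paginator.paginate(Bucket="tcga-2-open", Delimiter="/"):
        prefixes.extend(p["Prefix"] for p in page.get("CommonPrefixes", []))

    print(len(prefixes), "top-level prefixes")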

Specify output filename of Cloud Vision request

So I'm working with Google Cloud Vision (for Node.js), trying to dynamically upload a document to a Google Cloud bucket, process it with the Cloud Vision API, and then download the resulting .json afterwards. However, when Cloud Vision processes my request and places the saved text extraction in my bucket, it appends output-1-to-n.json to the end of the filename. So if I'm processing a file called foo.pdf that's 8 pages long, the output will not be foo.json (even though I specified that), but rather foooutput-1-to-8.json.
Of course, this could be remedied by checking the page count of the PDF before uploading it and appending that to the path I search for when downloading, but that seems like an unnecessarily hacky solution. I can't seem to find anything in the documentation about not appending output-1-to-n to outputs. Extremely happy for any pointers!
You can't specify a single output file for asyncBatchAnnotate because, depending on your input, many files may get created. The output config is only a prefix, and you have to do a wildcard search in GCS for your given prefix (so you should make sure your prefix is unique).
For more details see this answer.
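For the prefix search the answer describes, here is a minimal sketch; the question uses Node.js, but this illustrates the same idea with the Python GCS client, and the bucket name and prefix are placeholders:

    # Sketch: collect every output shard Cloud Vision wrote under a unique
    # prefix and merge the JSON responses.  Bucket and prefix are placeholders.
    import json
    from google.cloud import storage

    client = storage.Client()
    bucket = client.bucket("my-vision-output-bucket")   # hypothetical

    responses = []
    for blob in client.list_blobs(bucket, prefix="extractions/foo/"):  # unique per input
        if blob.name.endswith(".json"):
            responses.append(json.loads(blob.download_as_text()))

    print("collected", len(responses), "output shards")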

Error Tracking in Amazon SageMaker

I am trying to create a custom Image Classifier in Amazon SageMaker. It is giving me the following error:
"ClientError: Data download failed:NoSuchKey (404): The specified key does not exist."
I'm assuming this means one of the pictures in my .lst file is missing from the directory. Is there some way to find out which .lst listing it is specifically having trouble with?
Upon further examination (of the log files), it appears the issue does not lie with the .lst file itself, but with the image files it references (which leaves me wondering why AWS doesn't just say that instead of saying the .lst file is corrupt). I'm going through the image files one by one to verify they are correct; hopefully that will solve the problem.
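A sketch that could automate that one-by-one check: read the .lst file (tab-separated: index, label(s), relative image path) and issue a HEAD request for each referenced key to find the missing objects. The bucket name, prefix, and .lst filename are placeholders.

    # Sketch: find which entries in the .lst file point at missing S3 objects.
    import boto3
    from botocore.exceptions import ClientError

    s3 = boto3.client("s3")
    BUCKET = "my-training-bucket"    # hypothetical
    IMAGE_PREFIX = "train/"          # hypothetical prefix the .lst paths are relative to

    with open("train.lst") as lst:
        for line in lst:
            parts = line.rstrip("\n").split("\t")
            if len(parts) < 3:
                continue
            key = IMAGE_PREFIX + parts[-1]   # image path is the last column
            try:
                s3.head_object(Bucket=BUCKET, Key=key)
            except ClientError as err:
                if err.response["Error"]["Code"] in ("404", "NoSuchKey", "NotFound"):
                    print("missing:", key)
                else:
                    raise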

.csv upload not working in Amazon Web Services Machine Learning - AWS

I have uploaded a simple 10-row CSV file (via S3) into the AWS ML website. It keeps giving me the error,
"We cannot find any valid records for this datasource."
There are records there, and the Y variable is continuous (not binary). I am pretty much stuck at this point because there is only one button to move forward with building the machine learning model. Does anyone know what I should do to fix it? Thanks!
The only way I have been able to upload .csv files I created on my own is by downloading an existing .csv file from my S3 bucket, modifying the data, uploading it, and then changing the name in the S3 console.
Could you post the first few lines of the .csv file? I am able to upload my own .csv file along with a schema that I have created, and it works. However, I did have issues in that Amazon ML was unable to create the schema for me.
Also, did you try saving the data in something like Sublime, Notepad++, etc. in order to get a different format? On my Mac with Microsoft Excel, the CSV did not work, but when I tried LibreOffice on my Windows machine, the same file worked perfectly.
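If the difference really is the file format (older Mac versions of Excel, for example, can save CSVs with CR-only line endings), a quick sketch to rewrite the file as UTF-8 with plain LF endings before uploading; the filenames are placeholders, and this is only one possible cause:

    # Sketch: normalize a CSV to UTF-8 with Unix line endings before uploading,
    # in case the original editor saved it with unusual endings or encoding.
    import csv

    # Default newline handling on read translates CR / CRLF / LF endings;
    # lineterminator="\n" on write produces plain LF output.
    with open("original.csv", encoding="utf-8-sig") as src, \
         open("normalized.csv", "w", newline="", encoding="utf-8") as dst:
        writer = csv.writer(dst, lineterminator="\n")
        for row in csv.reader(src):
            writer.writerow(row)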