How should I handle public (TCGA) data on AWS?

I'm new to AWS development and studying how to use TCGA data (https://registry.opendata.aws/tcga/) on my EC2 instance.
$ aws s3 ls s3://tcga-2-open/ gives me millions of files. However, the XML file at http://tcga-2-open.s3.amazonaws.com shows me only 1000 entries. Is there a full list of files, ideally one describing their hierarchy, so that I can look up the coverage of this resource?
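A bare GET on the bucket URL only returns the first page of the listing (S3 caps each list response at 1,000 keys), so a full enumeration has to paginate. A minimal boto3 sketch, assuming nothing beyond the public bucket name above:

    import boto3
    from botocore import UNSIGNED
    from botocore.config import Config

    # The bucket is public, so unsigned requests are enough.
    s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))

    keys = 0
    top_level_prefixes = set()
    for page in s3.get_paginator("list_objects_v2").paginate(Bucket="tcga-2-open"):
        for obj in page.get("Contents", []):
            keys += 1
            top_level_prefixes.add(obj["Key"].split("/", 1)[0])  # the UUID-looking prefix

    print(keys, "objects under", len(top_level_prefixes), "top-level prefixes")

Enumerating millions of keys this way takes a while; aws s3 ls s3://tcga-2-open/ --recursive --no-sign-request performs the same pagination under the hood.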
All TCGA files have a prefix, and the prefixes appear to be GDC UUIDs (https://docs.gdc.cancer.gov/Encyclopedia/pages/UUID/). However, some files have UUIDs that I cannot find in the original source. Downloading a 'manifest' from https://portal.gdc.cancer.gov/repository gives me a list of UUIDs and what they refer to, but some UUIDs appear only in the manifest file and others only in the XML listing. So how do I find out what a file such as http://tcga-2-open.s3.amazonaws.com/00002fe8-ec8e-4e0e-a174-35f039c15d06/6057825020_R01C01_Grn.idat is, when I cannot find '00002fe8-ec8e-4e0e-a174-35f039c15d06' in the ground-truth manifest file?
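For mapping an unknown UUID back to something meaningful, one option (a sketch, not verified against this particular file) is to ask the GDC API directly. The UUID below is the one from the question; the field names come from the GDC files endpoint documentation, and the /legacy/files endpoint is included as a guess for files that only ever existed in the GDC Legacy Archive:

    import requests

    uuid = "00002fe8-ec8e-4e0e-a174-35f039c15d06"
    fields = "file_name,data_type,data_format,cases.project.project_id"

    # Try the harmonized endpoint first, then the legacy archive.
    for endpoint in ("https://api.gdc.cancer.gov/files/",
                     "https://api.gdc.cancer.gov/legacy/files/"):
        r = requests.get(endpoint + uuid, params={"fields": fields})
        if r.ok:
            print(endpoint, r.json())
            break
        print(endpoint, "->", r.status_code)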
Is there any step-by-step tutorial on using Open Data on AWS?
Any help to get me through this would be really appreciated.

Related

AWS Comprehend output

I am new to using AWS Comprehend to search for PII. I can get the job to run against an S3 bucket, but I can't read the output. The output is in another bucket that I specified. All of the output files have .out as the extension. I was expecting output in report form, or at least the ability to open the output files and verify the PII. One example of the output is a PNG file that ends up with the extension .png.out.
I do not want to redact the PII at this point. I just want to identify it. Any help would be appreciated.
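If the .out objects turn out to be JSON Lines (one JSON document per input line, each with an Entities array, which is how Comprehend's asynchronous detection jobs typically write results), a sketch like this could pull one down and list the detected PII without redacting anything. The bucket and key names are only placeholders:

    import json
    import boto3

    s3 = boto3.client("s3")

    # Placeholder names; use the output location reported by the Comprehend job.
    body = s3.get_object(Bucket="my-output-bucket",
                         Key="comprehend-output/example.png.out")["Body"].read()

    for line in body.decode("utf-8").splitlines():
        if not line.strip():
            continue
        record = json.loads(line)
        for entity in record.get("Entities", []):
            print(entity.get("Type"), entity.get("Score"),
                  entity.get("BeginOffset"), entity.get("EndOffset"))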

How to download a file in Amazon S3 storage using only a unique id, without the subfolder name?

I want to download a file from Amazon S3 using only a certain unique id via its API, without using a folder or subfolder name. I created a folder/subfolder structure with hierarchy levels to organize the files.
This is the same as what I did with the Google Drive API v3: regardless of the folder or subfolder name, or the hierarchy level where the file was saved, I can download the file using only the fileId.
I haven't read the file-versioning docs yet since there is a lot to read.
Any help would be greatly appreciated. Thank you.
You can't do this with S3. You need to know the bucket name (--bucket) and full key (--key) of the file you want to download. Since a given file can have multiple versions, you can also provide a version id (--version-id).
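The same call expressed with boto3, with placeholder names, in case that is easier than the CLI:

    import boto3

    s3 = boto3.client("s3")

    # S3 addresses an object by bucket + full key (plus an optional version id);
    # there is no Drive-style fileId lookup. All names here are placeholders.
    resp = s3.get_object(
        Bucket="my-bucket",
        Key="folder/subfolder/report.pdf",   # the full key, prefixes included
        # VersionId="...",                   # optional: fetch a specific version
    )
    with open("report.pdf", "wb") as f:
        f.write(resp["Body"].read())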

Is there a pseudocolumn in Hive/Presto to get the "last modified" timestamp of a given file?

I have an external table in Athena linked to a folder in S3. There are some pseudocolumns in Presto that allow me to get metadata about the files sitting in that folder (for example, the $path pseudocolumn).
I wonder if there is a pseudocolumn where I can get the last modified timestamp of a file in S3 by using a query in AWS Athena.
This seems like a reasonable feature request. Please file an issue and include details about your use case (it's possible there is a better approach).
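Until something like that exists, one possible workaround (just a sketch, with placeholder bucket and prefix) is to fetch LastModified straight from S3 and join it against the "$path" values a query returns:

    import boto3

    s3 = boto3.client("s3")

    # Placeholder bucket/prefix: point these at the table's S3 location.
    bucket, prefix = "my-data-bucket", "my-table/"

    last_modified = {}
    for page in s3.get_paginator("list_objects_v2").paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            last_modified[f"s3://{bucket}/{obj['Key']}"] = obj["LastModified"]

    # last_modified now maps each "$path" value to its S3 LastModified timestamp.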

Specify output filename of Cloud Vision request

So I'm working with Google Cloud Vision (for Node.js), trying to dynamically upload a document to a Google Cloud bucket, process it with the Cloud Vision API, and then download the resulting .json. However, when Cloud Vision processes my request and places the result in my bucket for saved text extractions, it appends output-1-to-n.json to the end of the filename. So if I'm processing a file called foo.pdf that is 8 pages long, the output will not be foo.json (even though I specified that), but rather foooutput1-to-8.json.
Of course, this could be remedied by checking the page count of the PDF before uploading it and appending that to the path I search for when downloading, but that seems like an unnecessarily hacky solution. I can't seem to find anything in the documentation about not appending output-1-to-n to outputs. Extremely happy for any pointers!
You can't specify a single output file for asyncBatchAnnotate because, depending on your input, many files may get created. The output config is only a prefix, and you have to do a wildcard search in GCS for your given prefix (so you should make sure your prefix is unique).
For more details see this answer.
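A sketch of the prefix search described above, using the Python GCS client (bucket and prefix are placeholders; the Node.js client has an equivalent listing call):

    from google.cloud import storage

    client = storage.Client()

    # Placeholder names: use the gcsDestination prefix passed to asyncBatchAnnotate.
    shards = sorted(client.list_blobs("my-vision-output-bucket", prefix="results/foo"),
                    key=lambda b: b.name)

    for shard in shards:
        print(shard.name)                 # the output-1-to-n.json shards Vision wrote
        data = shard.download_as_bytes()  # each shard is one JSON response chunk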

.csv upload not working in Amazon Web Services Machine Learning

I have uploaded a simple 10-row CSV file (in S3) into the AWS ML website. It keeps giving me the error,
"We cannot find any valid records for this datasource."
There are records there, and the Y variable is continuous (not binary). I am pretty much stuck at this point because there is only one button to move forward to build the machine learning model. Does anyone know what I should do to fix it? Thanks!
The only way I have been able to upload .csv files that I created myself is by downloading an existing .csv file from my S3 bucket, modifying the data, uploading it, and then changing the name in the S3 console.
Could you post the first few lines of the .csv file? I am able to upload my own .csv file along with a schema that I created, and it is working. However, I did have issues in that Amazon ML was unable to create the schema for me.
Also, did you try saving the data in something like Sublime, Notepad++, etc. in order to get a different format? On my Mac with Microsoft Excel, the CSV did not work, but when I tried LibreOffice on my Windows machine, the same file worked perfectly.
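The Excel-on-Mac vs. LibreOffice difference in the last comment often comes down to line endings (older Excel-for-Mac CSV exports use bare carriage returns), so it may be worth normalizing the file before re-uploading. A small sketch with a placeholder filename, offered only as a guess at the cause:

    # Rewrite the CSV with plain \n line endings before uploading it again.
    with open("data.csv", "r", newline="") as f:
        text = f.read()

    text = text.replace("\r\n", "\n").replace("\r", "\n")

    with open("data_fixed.csv", "w", newline="") as f:
        f.write(text)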