AWS Ground Truth text classification manifest using "source-ref" not displaying text - amazon-web-services

Background
I'm trying out SageMaker Ground Truth, an AWS service to help you label your data before using it in your ML algorithms.
The labeling job requires a manifest file containing one JSON object per line, each with a "source" or "source-ref" key; see also the Input Data section of the documentation.
Setup
A "source-ref" is a reference to where the document is located in an S3 bucket, like so:
my-bucket/data/manifest.json
my-bucket/data/123.txt
my-bucket/data/124.txt
...
The manifest file looks like this (based on the blog example):
{"source-ref": "s3://my-bucket/data/123.txt"}
{"source-ref": "s3://my-bucket/data/124.txt"}
...
The problem
When I create the job, all I get is the source-ref value, s3://my-bucket/data/123.txt, as the text; the contents of the file are not displayed.
I have tried creating jobs using a manifest that omits the s3:// protocol prefix, but I get the same result.
Is this a bug on their end, or am I missing something?
Observations
I tried making all the files public, thinking there might be a permissions issue, but that did not help.
I ensured that the content type of the files was text (S3 -> object -> properties -> metadata).
If I use "source" and inline the text, it works properly, but I should be able to use individual documents, since there is a limit on the manifest file size, especially if I have to label many or large documents!

I am a member of the AWS SageMaker Ground Truth team. Sorry to hear that you are having difficulties with certain features of our product.
From your post I presume you have multiple text files, and each text file contains multiple lines. For text classification, in order to show a preview in the console, we currently support only the inline mode, using "source" to embed each line.
We understand it is not convenient to create such a manifest with embedded text, as doing so is non-trivial and time consuming. That is why we provide a crawling feature in the console (see the "create input manifest" link over the input manifest box): it takes an input S3 prefix, crawls all text files (with extensions .txt, .csv) under that prefix, reads each line of each file, and creates a manifest with one {"source": "<line>"} object per line. Please let us know whether crawling works to create your manifest.
Please note that, currently, the crawler will only work if you created the s3://my-bucket/data/ folder from the console and then uploaded all the text files into this folder (instead of using the S3 CLI sync tool to upload a local data/ directory). If the crawler does not apply to your layout, you can build the inline manifest yourself, as sketched below.
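A minimal sketch of that workaround, assuming the bucket/prefix names from the question and that every .txt file holds one document per line; it uses boto3:

import json
import boto3

s3 = boto3.client("s3")
bucket, prefix = "my-bucket", "data/"  # names taken from the question

with open("manifest.json", "w") as out:
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            if not obj["Key"].endswith(".txt"):
                continue
            body = s3.get_object(Bucket=bucket, Key=obj["Key"])["Body"].read()
            for line in body.decode("utf-8").splitlines():
                if line.strip():
                    out.write(json.dumps({"source": line}) + "\n")

Upload the resulting manifest.json back to S3 and point the labeling job at it.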
Sorry if our documents are not clear and we are definitely taking your feedback to improve our product. For any question, please reach us here: https://aws.amazon.com/contact-us/

The problem is with your preprocessing Lambda. The preprocessing Lambda receives the objects from the manifest (in batches, as far as I know), i.e. the S3 sources. The preprocessing Lambda must read the files and return the actual content. It sounds like yours is passing the file's location instead of its content. Refer to the documentation; any example preprocessing Lambda for text should be easily adjustable to your case, along the lines of the sketch below.
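For illustration, a minimal sketch of such a pre-annotation Lambda, assuming the event/response shapes used by custom Ground Truth workflows (the taskObject key is whatever your custom task template expects):

import boto3

s3 = boto3.client("s3")

def lambda_handler(event, context):
    data_object = event["dataObject"]
    text = data_object.get("source")
    if text is None:
        # "source-ref" points at an S3 object; fetch its contents instead
        bucket, _, key = data_object["source-ref"][len("s3://"):].partition("/")
        text = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")
    # Hand the resolved text to the worker task template
    return {"taskInput": {"taskObject": text}, "isHumanAnnotationRequired": "true"}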

Related

How to write file-wide metadata into Parquet files with Apache Parquet in C++

I use Apache Parquet to create Parquet tables with process information from a machine, and I need to store file-wide metadata (machine ID and machine name).
It is stated that Parquet files are capable of storing file-wide metadata; however, I couldn't find anything about it in the documentation.
There is another Stack Overflow post that explains how it is done with pyarrow. As far as I can tell from that post, I need some kind of key-value pairs (maybe map<string, string>) added to the schema somehow.
I found a class inside the Parquet source code called parquet::FileMetaData that may be used for this purpose; however, there is nothing in the docs about it.
Is it possible to store file-wide metadata with C++?
Currently I am using the stream_reader_writer example for writing Parquet files.
You can pass the file-level metadata when calling parquet::ParquetFileWriter::Open; see the source code here.
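For comparison, the pyarrow route mentioned in the question looks roughly like this; a minimal sketch with made-up column and metadata values (Parquet key-value metadata is stored as byte strings):

import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"temperature": [20.5, 21.0]})  # hypothetical process data

# Merge custom key-value pairs into the existing schema metadata
custom = {b"MachineID": b"1234", b"MachineName": b"press-01"}
table = table.replace_schema_metadata({**(table.schema.metadata or {}), **custom})
pq.write_table(table, "process.parquet")

# Read the file-wide metadata back
print(pq.read_schema("process.parquet").metadata[b"MachineID"])

The C++ writer exposes the same mechanism: ParquetFileWriter::Open takes a key_value_metadata argument alongside the schema and writer properties.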

How can one extract text using PowerShell from zip files stored on dropbox?

I have been asked by a client to extract text from PDF files stored in zip archives on Dropbox. I want to know how (and whether) it is possible to access these files using PowerShell. (I've read about APIs you can use to access things on Dropbox, but I have no idea how they could integrate into a PowerShell script.) I'd ideally like to avoid downloading them, as there are around 7,000 of them. What I want is a script that reads the content of these files online, in Dropbox, and then processes the relevant data (text) into a spreadsheet.
Just to reiterate: (i) is it possible to access PDF files (and the text in them) that are stored in zip archives on Dropbox, and (ii) how can one go about this using PowerShell, i.e. what sort of script/instructions does one need to write?
Note: I am still finding my way around PowerShell, so it is hard for me to elaborate; however, as I become more familiar, I will happily update this post.
The only officially supported programmatic interface for Dropbox is the Dropbox API:
https://www.dropbox.com/developers
It does let you access file contents, e.g., using /files (GET):
https://www.dropbox.com/developers/core/docs#files-GET
However, it doesn't offer any ability to interact with the contents of zip files remotely. (Dropbox just considers zip files as blobs of data like any other file.) That being the case, exactly what you want isn't possible, since you can't look inside the zip files without downloading them first. (Likewise, even if the PDF files weren't in zip files, the Dropbox API doesn't currently offer any ability to search the text in the PDF files remotely. You would still need to download them.)
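To make the "download first, then look inside" point concrete, here is a minimal sketch using the official Dropbox Python SDK (the same HTTP API can be called from PowerShell with Invoke-RestMethod); the access token and path are placeholders:

import io
import zipfile
import dropbox  # pip install dropbox

dbx = dropbox.Dropbox("YOUR_ACCESS_TOKEN")  # placeholder token

# Download one zip archive and list the PDFs inside it
_, resp = dbx.files_download("/archives/batch1.zip")  # hypothetical path
with zipfile.ZipFile(io.BytesIO(resp.content)) as zf:
    pdfs = [name for name in zf.namelist() if name.lower().endswith(".pdf")]
    print(pdfs)

Extracting the text from each PDF would then require a PDF library on top of this; the Dropbox API itself never inspects the archive's contents.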

OSX - Auto-delete file after x time

Can we add metadata to a file so that it is unlinked/removed automatically after some amount of time? That is, the system automatically removes the file if it finds that particular metadata attached to it.
Note: the file can be present at any location, and the user may move it anywhere on their system, but based on that metadata the file should still get deleted (i.e. the system should call unlink/remove on it).
Is there a Cocoa/Objective-C/C++ API to set such metadata/attributes on a file?
The main point is: I am creating an application through which I provide some trial files to the user, and those files are also usable by other applications that recognise them. After the trial expires, I want to delete those files, but the user can always move my files to a different location and use them forever. How do I protect those files from permanent use?
No, there is no built-in mechanism to auto-delete a file based on some metadata.
You could add the feature yourself, with an accompanying agent that would trawl for files with the metadata and delete them when the time came.
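A minimal sketch of such an agent, assuming the expiry is stamped into an extended attribute (the attribute name and swept directory are made up; macOS ships the xattr command used here):

import os
import subprocess
import time

ATTR = "com.example.expires-at"  # hypothetical attribute holding a Unix timestamp
# Stamp a file with:  xattr -w com.example.expires-at 1735689600 /path/to/file

def sweep(root):
    for dirpath, _, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            try:
                stamp = subprocess.check_output(
                    ["xattr", "-p", ATTR, path], stderr=subprocess.DEVNULL)
            except subprocess.CalledProcessError:
                continue  # file has no expiry attribute
            if float(stamp.decode().strip()) < time.time():
                os.remove(path)

sweep("/Users/Shared/trial")  # run periodically, e.g. from a launchd agent

As noted below, this is housekeeping, not protection: the attribute can be stripped and the file moved out of the agent's reach.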
If you are doing this for good housekeeping, you can follow @Petesh's answer.
If you are doing this because you really want those files gone, then no. The user could move the file to a USB stick and remove it, or edit the metadata, etc.
Your earlier question "Completely restricting all types of access to a folder" seems to be addressing the same issue, and the suggestions are the same as given there - use encryption or implement your own file system.
E.g. have a special "trial file" format which is the same as the ordinary format - readable by other apps - but encrypted and including an expiry date. Your app then decrypts the file, checks the date, and either does its thing or reports to the user that the file is out of date.
The system isn't unbreakable, but it's a reasonable barrier - easy for you to do, and too hard for the average user to break.
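A minimal sketch of that trial-file idea, assuming a symmetric key embedded in (or derived by) the app; the function names and wrapper layout are made up:

import base64
import json
import time
from cryptography.fernet import Fernet  # pip install cryptography

key = Fernet.generate_key()  # in a real app this would be fixed/derived, not per-run
f = Fernet(key)

def make_trial_file(payload: bytes, days: float) -> bytes:
    # Wrap the payload with an expiry timestamp, then encrypt the whole wrapper
    wrapper = {"expires": time.time() + days * 86400,
               "data": base64.b64encode(payload).decode()}
    return f.encrypt(json.dumps(wrapper).encode())

def open_trial_file(blob: bytes) -> bytes:
    wrapper = json.loads(f.decrypt(blob))
    if time.time() > wrapper["expires"]:
        raise PermissionError("trial file has expired")
    return base64.b64decode(wrapper["data"])

Because the expiry date travels inside the encrypted blob, moving the file elsewhere does not defeat the check; only the decrypting app can open it.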

SAS Folder mapping

I have created a SAS folder, say "/Public Development/Area Name/Project Name", under the "Folders" tab of SAS Management Console.
In SAS EG this folder shows up under the "SAS Folder" option. I'm able to save EGP projects and stored processes in this folder, but not SAS code, logs, etc.
I believe it's just a folder at the metadata level, and only items registered in metadata can be saved there.
So what approach should I take to organize my other project items, like code, jobs, macros, reports, etc.?
The Enterprise Guide model includes storing your code as part of your EGP project. You put code modules in process flows, and log and output are stored alongside them (somewhat similar to running them in batch mode: log, output, and program are effectively grouped as one entity).
Your organization may have specific rules for how code/etc. is stored, such as storing it in a SVN repository or similar, so you should check with your manager or site SAS admin to get a more complete answer that is specific to your site.
I tend to keep metadata folders for storing metadata objects (stored processes, DI jobs, etc.), and I use the OS file system for storing code (.sas files), .log files, etc., and .egp projects. Generally I don't store code as part of the EG project; instead, the project just links to code that is sitting in the OS file system. So basically, I store my code, logs, macros, format catalogs, output reports, etc. the same way as I did when I was using PC SAS.

batch export psd files to png

I have thousands of PSD files to save as PNG. The PSD files are identical except for a small piece of text in the center of the image. Is there a way to automate the job?
Yes. Open your Actions window. Create a new action. Record yourself opening a file, saving it as PNG, and closing it.
Then go to File -> Automate -> Batch. Point it at your PSD folder and select your action. It should run through the files, saving them as PNGs.
A quick Google search may help if you're new to actions.
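If you'd rather script the conversion outside Photoshop entirely, here is a minimal Python sketch using Pillow, which can read the flattened composite of many (though not all) PSD files; the folder names are placeholders:

from pathlib import Path
from PIL import Image  # pip install Pillow

src = Path("psd_folder")   # hypothetical input folder
dst = Path("png_folder")
dst.mkdir(exist_ok=True)

for psd in src.glob("*.psd"):
    # Pillow loads the merged composite of the PSD (read-only)
    Image.open(psd).convert("RGBA").save(dst / (psd.stem + ".png"))

Test it on a handful of files first; PSDs saved without a composite image may not open.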
XnView does the job pretty well. It can batch convert most files into most formats. It also has batch transformations and batch renaming among other things.
I use it regularly to convert PSDs to JPG/PNG/GIF.
I would use IrfanView's powerful batch engine. Free and super-fast.
Go to the folder in IrfanView Thumbnails
Select all files
Right-click and choose "Start batch dialog with selected files"
Select PNG as the output format.
Yes, you can make a Photoshop action to save the PNG and then run it via Batch. Unfortunately, this gets tricky when you want to use the action's save options while also specifying the destination where the processed files are saved.
Enter Dr. (Russell) Brown's Image Processor Pro, an extension for Photoshop that does exactly what most people need. It's dead simple and can even stack multiple processes and output formats/destinations for each file.
It's part of Dr. Brown’s Services
- http://russellbrown.com/scripts.html