How can one extract text using PowerShell from zip files stored on Dropbox?

I have been asked by a client to extract text from PDF files stored in zip archives on Dropbox. I want to know how (and whether) it is possible to access these files using PowerShell. (I've read about APIs you can use to access things on Dropbox, but have no idea how these could be integrated into a PowerShell script.) I'd ideally like to avoid downloading them, as there are around 7,000 of them. What I want is a script that reads the content of these files online, in Dropbox, and then processes the relevant data (text) into a spreadsheet.
Just to reiterate - (i) Is it possible to access PDF files (and the text in them) that are stored in zip archives on Dropbox, and (ii) How can one go about this using PowerShell - what sort of script/instructions does one need to write?
Note: I am still finding my way around PowerShell, so it is hard for me to elaborate - however, as and when I become more familiar, I will happily update this post.

The only officially supported programmatic interface for Dropbox is the Dropbox API:
https://www.dropbox.com/developers
It does let you access file contents, e.g., using /files (GET):
https://www.dropbox.com/developers/core/docs#files-GET
However, it doesn't offer any ability to interact with the contents of zip files remotely. (Dropbox just considers zip files as blobs of data like any other file.) That being the case, exactly what you want isn't possible, since you can't look inside the zip files without downloading them first. (Likewise, even if the PDF files weren't in zip files, the Dropbox API doesn't currently offer any ability to search the text in the PDF files remotely. You would still need to download them.)
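For what it's worth, here is a minimal sketch of that "download, then inspect the zip locally" approach in Python (rather than PowerShell). It assumes an OAuth access token and the Core API content endpoint documented at the link above; the token, path, and exact URL are placeholders to check against the docs.

import io
import zipfile

import requests

ACCESS_TOKEN = "<your OAuth access token>"     # placeholder
ZIP_PATH = "/reports/archive-0001.zip"          # placeholder Dropbox path

# Pull one zip down via the Core API /files (GET) endpoint mentioned above.
resp = requests.get(
    "https://api-content.dropbox.com/1/files/auto" + ZIP_PATH,
    headers={"Authorization": "Bearer " + ACCESS_TOKEN},
)
resp.raise_for_status()

# Only once the whole archive is local can its contents be inspected.
with zipfile.ZipFile(io.BytesIO(resp.content)) as archive:
    pdf_names = [n for n in archive.namelist() if n.lower().endswith(".pdf")]
    print(pdf_names)

# Extracting text from the PDFs themselves would still need a separate PDF
# library (pdfminer or similar) once they are unpacked.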

Related

Concatenate 1000 CSV files directly in Google Cloud Storage, without duplicated headers?

Is it possible to concatenate 1000 CSV files that each have a header into one file, with no duplicated header, directly in Google Cloud Storage? I could easily do this by downloading the files to my local hard drive, but I would prefer to do it natively in Cloud Storage.
They all have the same columns and a header row.
I wrote an article on handling CSV files with BigQuery. To avoid ending up with several files, and if the volume is less than 1 GB, the recommended approach is the following:
Create a temporary table in BigQuery with all your CSVs.
Use the Export API (not the export function).
Let me know if you need more guidance.
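To make that concrete, here is a rough sketch with the BigQuery Python client. The project, dataset, table, and bucket names are placeholders, and the single-destination export only works while the result stays under the ~1 GB limit mentioned above.

from google.cloud import bigquery

client = bigquery.Client()
table_id = "my-project.tmp_dataset.merged_csv"          # placeholder

# 1. Load every CSV (skipping each file's header row) into one temporary table.
load_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    autodetect=True,
)
client.load_table_from_uri(
    "gs://my-bucket/data/*.csv", table_id, job_config=load_config
).result()

# 2. Export the table back to Cloud Storage as a single CSV with one header row.
client.extract_table(
    table_id, "gs://my-bucket/merged/all_rows.csv"
).result()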
The problem with most solutions is that you still end up with a large number of split files that you then have to strip the headers from and join, etc.
Any method of avoiding multiple files also tends to involve quite a lot of extra work.
It gets to be quite a hassle, especially when BigQuery spits out 3500 split gzipped CSV files.
I needed a simple method for this that could be automated from a batch file.
I therefore wrote CSV Merge (sorry, Windows only though) to solve exactly this problem.
https://github.com/tcwicks/DataUtilities
Download latest release, unzip and use.
I also wrote an article with scenario and usage examples:
https://medium.com/#TCWicks/merge-multiple-csv-flat-files-exported-from-bigquery-redshift-etc-d10aa0a36826
Hope it is of use to someone.
P.S. I recommend tab-delimited over CSV as it tends to have fewer data issues.

AWS Ground Truth text classification manifest using "source-ref" not displaying text

Background
I'm trying out SageMaker Ground Truth, an AWS service to help you label your data before using it in your ML algorithms.
The labeling job requires a manifest file which contains a JSON object per row that contains a source or a source-ref, see also the Input Data section of the documentation.
Setup
source-ref is a reference to where the document is located in an S3 bucket, like so:
my-bucket/data/manifest.json
my-bucket/data/123.txt
my-bucket/data/124.txt
...
The manifest file looks like this (based on the blog example):
{"source-ref": "s3://my-bucket/data/123.txt"}
{"source-ref": "s3://my-bucket/data/124.txt"}
...
The problem
When I create the job, all I get is the source-ref value, s3://my-bucket/data/123.txt, shown as the text; the contents of the file are not displayed.
I have tried creating jobs using a manifest that does not contain the s3 protocol, but I get the same result.
Is this a bug on their end, or am I missing something?
Observations
I have tried making all the files public, thinking there might be a permissions issue, but no luck.
I ensured that the content type of the file was text (s3 -> object -> properties -> metadata)
If I use "source" and inline the text, it works properly, but I should be able to use individual documents, as there is a limit on the file size, especially if I have to label many or large documents!
I am a member of AWS SageMaker Ground Truth team. Sorry to hear that you have difficulties in using certain features of our product.
From your post I presume you have multiple text files, and each text file contains multiple lines. For text classification, in order to show a preview in the console, we currently support only the inline mode, using "source" to carry each line.
We understand it is not convenient to create such a manifest with embedded text, as it is not trivial and is time consuming. That is why we have provided a crawling feature in the console (please see the "create input manifest" link over the input manifest box): it takes an input S3 prefix, crawls all text files (with extensions .txt, .csv) under that prefix, reads each line of each of those text files, and creates a manifest with each line as {"source": "..."}. Please let us know if you are able to crawl to create your manifest.
Please note that, currently, the crawler will only work if you have created the s3://my-bucket/data/ folder from the console and then uploaded all the text files into this folder (instead of using the S3 CLI sync tool to upload a local data/ directory).
Sorry if our documentation is not clear; we are definitely taking your feedback to improve our product. For any questions, please reach us here: https://aws.amazon.com/contact-us/
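If the console crawler does not fit your layout, a rough equivalent can be scripted. Below is a sketch with boto3 that builds the same kind of inline-"source" manifest the crawler produces; the bucket, prefix, and output key are placeholders.

import json

import boto3

s3 = boto3.client("s3")
bucket, prefix = "my-bucket", "data/"            # placeholders

lines = []
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
    for obj in page.get("Contents", []):
        if not obj["Key"].endswith((".txt", ".csv")):
            continue
        body = s3.get_object(Bucket=bucket, Key=obj["Key"])["Body"].read()
        # One manifest entry per non-empty line, with the text inlined.
        for line in body.decode("utf-8").splitlines():
            if line.strip():
                lines.append(json.dumps({"source": line}))

s3.put_object(Bucket=bucket, Key="data/manifest.jsonl",
              Body="\n".join(lines).encode("utf-8"))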
The problem is with your pre-processing Lambda. The pre-processing Lambda receives the objects from the manifest (in batches, as far as I know), i.e. the S3 sources. The pre-processing Lambda must read the files and return their actual content. It sounds like your pre-processing is passing the file location instead of the content. Refer to the documentation; any example pre-processing Lambda for text should be easily adjustable to your case.
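As a rough illustration of that idea, a pre-annotation Lambda along these lines would resolve each "source-ref" to the object's text before handing it to the task template. The exact event and response field names should be checked against the Ground Truth documentation; they are assumptions here.

import boto3

s3 = boto3.client("s3")

def _read_s3_text(uri):
    # Split "s3://bucket/key" into bucket and key, then fetch the object body.
    bucket, key = uri.replace("s3://", "", 1).split("/", 1)
    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
    return body.decode("utf-8")

def lambda_handler(event, context):
    data_object = event["dataObject"]
    source = data_object.get("source")
    if source is None:
        # The manifest used "source-ref", so resolve it to the actual text.
        source = _read_s3_text(data_object["source-ref"])
    # Return the content, not the S3 URI, as the task input.
    return {"taskInput": {"taskObject": source}}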

How do I zip a directory or multiple files with zlib, using C/C++?

I did search for this topic, but I didn't find any relevant clues.
Can anyone give me some tips or demo code that can solve the problem?
Thanks in advance.
---FYI---
What I want to do here is zip files and upload them to a remote PC.
I think it will take the following steps:
a) initialize a zipped file header, send it to the remote PC, and save that zipped file header.
b) open a file, read a portion of its data, and zip that data locally.
c) send the zipped data through a pipe (TCP or UDP, for example) to the remote PC.
d) save the zipped data from the pipe on the remote PC.
e) if there are multiple files, go back to b).
f) when all files have been zipped and transferred to the remote PC, close the zipped file.
Two questions here:
a) compression/decompression
b) file format
Thanks guys!
zlib zips a single stream. If you want to zip multiple files, you need to do one of two things:
Define a format (or use an existing format) that combines multiple files into one stream, then zip that; or
Zip each file individually, then use some format to combine those into one output file.
If you take the first option, using the existing tar format to combine the files, you will be producing a .tar.gz file (assuming you use zlib's gzip framing) which can be extracted with standard tools, so this is a good way to go. You can use libtar to generate a tar archive.
I have built a wrapper around minizip, adding some features that I needed and making it nicer to use. It uses the latest C++11 and is developed using Visual Studio 2013 (it should be portable, but I haven't tested it on Unix).
There's a full description here: https://github.com/sebastiandev/zipper
but it is as simple as you can get:
Zipper zipper("ziptest.zip");
zipper.add("somefile.txt");
zipper.add("myFolder");
zipper.close();
You can zip entire folders, streams, vectors, etc. Another nice feature is being able to do everything entirely in memory.

Django package multiple files in a single download

In Django I'm looking for a way to serve several different files at once. I can't use static archives (.zip, .tar, etc.) because I don't have enough storage to cache these files, and it would take far too long to generate them on the fly (each could be hundreds of megabytes).
Is there a way I can indicate to the browser that several files are coming its way? Perhaps there is a container format that I can indicate before streaming files to the user?
Edit: There could be hundreds of files in each package so asking the user to download each one is very time consuming.
Ah, the .tar file format can be streamed. I'll experiment with this for now.
http://docs.python.org/library/tarfile.html
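As a rough sketch of that idea (the file paths and view name are made up for illustration), Django's StreamingHttpResponse can be fed chunks as tarfile writes them, so the archive is never built on disk or held in memory in full:

import os
import tarfile

from django.http import StreamingHttpResponse


class _ChunkBuffer:
    """Minimal file-like object that tarfile can write into."""
    def __init__(self):
        self._chunks = []

    def write(self, data):
        self._chunks.append(data)
        return len(data)

    def pop(self):
        data = b"".join(self._chunks)
        self._chunks = []
        return data


def _tar_stream(paths):
    buf = _ChunkBuffer()
    # mode "w|" writes an uncompressed tar as a non-seekable stream
    with tarfile.open(mode="w|", fileobj=buf) as tar:
        for path in paths:
            tar.add(path, arcname=os.path.basename(path))
            yield buf.pop()   # flush whatever tarfile has written so far
    yield buf.pop()           # trailing blocks written when the archive closes


def download_package(request):
    paths = ["/srv/files/a.bin", "/srv/files/b.bin"]   # placeholders
    response = StreamingHttpResponse(_tar_stream(paths),
                                     content_type="application/x-tar")
    response["Content-Disposition"] = 'attachment; filename="package.tar"'
    return response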

Combining two PDF files in C++

In C++ I'm generating a PDF report with libHaru. I'm looking for some way to append two pages from an existing PDF file to the end of my report. Is there any free way to do that?
Thanks.
Try PoDoFo
http://podofo.sourceforge.net/
You should be able to open both of the PDFs as PdfMemDocuments using PdfMemDocument.Load( filename ).
Then, acquire references to the two pages you want to copy and add them to the end of the document using InsertPages, or optionally, remove all but the last two pages of the source document, then call PdfDocument::Append and pass in that document. Hard to say which would be faster or more stable.
Hope that helps,
Troy
You can use the Ghostscript utility pdf2ps to convert the PDF files to PostScript, append the PostScript files, and then convert them back to a PDF using ps2pdf.