How to load data into a CrowdFlower job using GATE's crowdsourcing plugin?

I am trying to create a job on CrowdFlower using the GATE crowdsourcing plugin. My problem is that I cannot load any data into the job at all. What I have done so far in creating the job is:
1. Create the job builder PR.
2. Right-click the job builder and choose to create a new CrowdFlower job. The job appears in my job list on CrowdFlower.
3. Populate a corpus with some documents, pre-processing them with some ANNIE components, e.g. the tokenizer and sentence splitter.
4. Add the job builder to a corpus pipeline and edit its parameters so they match the initial annotations (tokens and sentences).
5. Run the pipeline (making sure, of course, that the job ID matches).
After doing all of this, the job still has 0 rows of data. I am wondering if I have done something wrong, because I am sure I followed all the instructions in this tutorial, specifically pages 28 to 35. Any advice on this?

I bet you have a typo in one of the job builder's runtime parameters :)
Double-check the names of the annotations and annotation sets, and make sure all of them exist in your documents. If they exist and the builder found them, a cf_..._id feature should appear on each entity annotation.
If the job builder found any annotations, it would call the CrowdFlower API and throw an exception if the upload failed. It really sounds like it is not sending any requests at all, and the only reason I can see for that is that it cannot find the annotations.

Related

How do I write a Google Cloud Dataflow transform mapping?

I'm upgrading a Google Cloud Dataflow job from the Dataflow Java SDK 1.8 to version 2.4 and then trying to update the existing Dataflow job on Google Cloud using the --update and --transformNameMapping arguments, but I can't figure out how to write the transform name mappings properly so that the upgrade succeeds and passes the compatibility check.
My code fails at the compatibility check with the error:
Workflow failed. Causes: The new job is not compatible with 2018-04-06_13_48_04-12999941762965935736. The original job has not been aborted., The new job is missing steps BigQueryIO.writeTableRows/BigQueryIO.StreamWithDeDup/Reshuffle/GroupByKey, PubsubIO.readStrings. If these steps have been renamed or deleted, please specify them with the update command.
The dataflow transform names for the existing, currently running job are:
PubsubIO.Read
ParDo(ExtractJsonPath) - A custom function we wrote
ParDo(AddMetadata) - Another custom function we wrote
BigQueryIO.Write
In my new code that uses the 2.4 SDK, I've changed the 1st and 4th transforms/functions because some libraries were renamed and some of the old SDK's functions were deprecated in the new version.
You can see the specific transform code below:
The 1.8 SDK version:
PCollection<String> streamData =
    pipeline
        .apply(PubsubIO.Read
            .timestampLabel(PUBSUB_TIMESTAMP_LABEL_KEY)
            //.subscription(options.getPubsubSubscription())
            .topic(options.getPubsubTopic()));
streamData
    .apply(ParDo.of(new ExtractJsonPathFn(pathInfos)))
    .apply(ParDo.of(new AddMetadataFn()))
    .apply(BigQueryIO.Write
        .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
        .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
        .to(tableRef));
The 2.4 SDK version I rewrote:
PCollection<String> streamData =
    pipeline
        .apply("PubsubIO.readStrings", PubsubIO.readStrings()
            .withTimestampAttribute(PUBSUB_TIMESTAMP_LABEL_KEY)
            //.subscription(options.getPubsubSubscription())
            .fromTopic(options.getPubsubTopic()));
streamData
    .apply(ParDo.of(new ExtractJsonPathFn(pathInfos)))
    .apply(ParDo.of(new AddMetadataFn()))
    .apply("BigQueryIO.writeTableRows", BigQueryIO.writeTableRows()
        .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
        .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
        .to(tableRef));
So it seems to me like PubsubIO.Read should map to PubsubIO.readStrings and BigQueryIO.Write should map to BigQueryIO.writeTableRows. But I could be misunderstanding how this works.
I've been trying a wide variety of things. I tried to give the two transforms that I'm failing to remap explicit names, as they formerly were not named, so I updated my applys to .apply("PubsubIO.readStrings" and .apply("BigQueryIO.writeTableRows", and then set my transformNameMapping argument to:
--transformNameMapping={\"BigQueryIO.Write\":\"BigQueryIO.writeTableRows\",\"PubsubIO.Read\":\"PubsubIO.readStrings\"}
or
--transformNameMapping={\"BigQueryIO.Write/BigQueryIO.StreamWithDeDup/Reshuffle/GroupByKey\":\"BigQueryIO.writeTableRows/BigQueryIO.StreamWithDeDup/Reshuffle/GroupByKey\",\"PubsubIO.Read\":\"PubsubIO.readStrings\"}
or even trying to remap all the internal transforms inside the composite transform
--transformNameMapping={\"BigQueryIO.Write/BigQueryIO.StreamWithDeDup/Reshuffle/GroupByKey\":\"BigQueryIO.writeTableRows/BigQueryIO.StreamWithDeDup/Reshuffle/GroupByKey\",\"BigQueryIO.Write/BigQueryIO.StreamWithDeDup/Reshuffle\":\"BigQueryIO.writeTableRows/BigQueryIO.StreamWithDeDup/Reshuffle\",\"BigQueryIO.Write/BigQueryIO.StreamWithDeDup\":\"BigQueryIO.writeTableRows/BigQueryIO.StreamWithDeDup\",\"BigQueryIO.Write\":\"BigQueryIO.writeTableRows\",\"PubsubIO.Read\":\"PubsubIO.readStrings\"}
but I seem to get the same exact error no matter what:
The new job is missing steps BigQueryIO.writeTableRows/BigQueryIO.StreamWithDeDup/Reshuffle/GroupByKey, PubsubIO.readStrings.
Am I doing something seriously wrong? Is there anybody who has written a transform mapping before who would be willing to share the format they used? I can't find any examples online at all besides the main Google documentation on updating Dataflow jobs, which doesn't really cover anything but the simplest case, --transformNameMapping={"oldTransform1":"newTransform1","oldTransform2":"newTransform2",...}, and doesn't make the example very concrete.
It turns out there was additional information that I was missing in the logs on the Google Cloud web console's Dataflow job details page. I needed to change the log level from info to show any log level, and then I found several step fusion messages, for example (although there were far more):
2018-04-16 (13:56:28) Mapping original step BigQueryIO.Write/BigQueryIO.StreamWithDeDup/Reshuffle/GroupByKey to write/StreamingInserts/StreamingWriteTables/Reshuffle/GroupByKey in the new graph.
2018-04-16 (13:56:28) Mapping original step PubsubIO.Read to PubsubIO.Read/PubsubUnboundedSource in the new graph.
Instead of trying to map PubsubIO.Read to PubsubIO.readStrings, I needed to map to the steps that I found mentioned in that additional logging. In this case I got past my errors by mapping PubsubIO.Read to PubsubIO.Read/PubsubUnboundedSource and BigQueryIO.Write/BigQueryIO.StreamWithDeDup to BigQueryIO.Write/StreamingInserts/StreamingWriteTables. So try mapping your old steps to the ones mentioned in the full logs just before the job failure message.
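Based on those log messages, the mapping argument ends up taking roughly this shape (a sketch only; the exact step names have to come from your own job's logs):
--transformNameMapping={\"PubsubIO.Read\":\"PubsubIO.Read/PubsubUnboundedSource\",\"BigQueryIO.Write/BigQueryIO.StreamWithDeDup\":\"BigQueryIO.Write/StreamingInserts/StreamingWriteTables\"}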
Unfortunately, I'm now working through a failure of the compatibility check due to a change in the coder used between the old code and the new code, but my missing-step errors are solved.

Solr/Lucene "kit" to test searching?

Is there a "code free" way to get SOLR/LUCENE (or something similar) pointed at a set of word docs to make them quickly searchable by a user?
I am prototyping, seeing if there is value in, a system to search through some homegrown news articles. Before I stand up code to handle search string input and document indexing, I wanted to see if it was even worth it before I starting trying to figure it all out.
Thanks,
Judd
Using the bin/post tool of Solr and the Tika handler (named the ExtractingRequestHandler), you should be able to get something up and running for prototyping rather quickly.
See the introduction of Uploading Data with Solr Cell using Apache Tika. Tika is used to process a wide range of different document types.
You can give the Solr post tool a directory or a list of files to submit to the index.
Automatically detect content types in a folder, and recursively scan it for documents for indexing into gettingstarted.
bin/post -c gettingstarted afolder/
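Once a few documents are indexed, you can sanity-check the results from a tiny script before writing any real search code. A minimal sketch, assuming a default Solr install on localhost:8983, the gettingstarted collection from above, and the Python requests package:

import requests  # assumes the 'requests' package is installed

def search(query, rows=10):
    # Query the standard /select handler of the 'gettingstarted' collection.
    resp = requests.get(
        "http://localhost:8983/solr/gettingstarted/select",
        params={"q": query, "rows": rows, "wt": "json"},
    )
    resp.raise_for_status()
    for doc in resp.json()["response"]["docs"]:
        # With bin/post, 'id' is the path of the indexed file; other field
        # names depend on what Tika extracted from each document.
        print(doc["id"])

search("quarterly budget")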

Export single column from DynamoDB to csv (or the like)

My DynamoDB table is quite large and I don't particularly want to dump the whole thing. There is one column that I want to test on, so I would like a dump of all of its values that I could have locally to code/test with. However, I am not finding anything that lets me do this.
I found RazorSQL and it semi-worked, in the sense that it let me pull down just one column of information from the table, but it clearly didn't pull down all the data.
I also found a Data Pipeline template on AWS, but from what I can tell this will dump the entire table. I am relatively new to AWS, so it's possible I'm not understanding something about pipelines properly.
I'm okay with writing to S3, because I can pull down all the data from there, but anything that gets the data onto my local machine is fine by me.
Thanks for the help!
UPDATE: This tutorial looks promising, but I want to achieve the same effect non-interactively.
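For reference, the kind of non-interactive, single-column dump described above can be sketched with boto3's scan and a ProjectionExpression; the table name, attribute name, and output file below are placeholders:

import csv
import boto3  # assumes AWS credentials are already configured

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("my-table")  # placeholder table name

# Scan only the one attribute of interest, paginating until the scan is exhausted.
kwargs = {"ProjectionExpression": "#c", "ExpressionAttributeNames": {"#c": "myColumn"}}
with open("column_dump.csv", "w", newline="") as f:
    writer = csv.writer(f)
    while True:
        page = table.scan(**kwargs)
        for item in page.get("Items", []):
            if "myColumn" in item:  # items missing the attribute are skipped
                writer.writerow([item["myColumn"]])
        if "LastEvaluatedKey" not in page:
            break
        kwargs["ExclusiveStartKey"] = page["LastEvaluatedKey"]

Note that a scan with a ProjectionExpression still reads every item in the table (and consumes read capacity accordingly); it only trims what is returned and written out.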

AnthillPro - CCTray integration

Does anyone know if you can use CCTray (or an equivalent) with AnthillPro? I'm not finding a lot of documentation and am new to using AHP.
Thanks.
You should be able to use CCTray-type tools with AnthillPro. You would need to create a custom report to generate the XML, though.
Shoot me an email at eric#urbancode.com; I may be able to write this later in the week.
Otherwise, you could experiment with report writing.
You can find the cctray XML format here: http://confluence.public.thoughtworks.org/display/CI/Multiple+Project+Summary+Reporting+Standard
Example AP report code that iterates over each build workflow and spits out data about the latest build is here: https://bugs.urbancode.com/browse/AHPSCRIPTS-13
The "Recent Build Life Activity (RSS)" report that I think ships with the product would give you an XML example.

Django -- printing lots of documents?

I have a Django app that stores client data. Currently, there are just over 1,000 clients in the database. Twice a year, I need to print a semi-customized letter for each client. Ideally, I want to be able to click a button/link and have the entire batch sent to the printer; I don't want to have to click "print" for each letter, since that would be absurdly time consuming.
One idea I have thought of is using Celery to chug through the process of printing all the documents, but I don't know how that would be accomplished. I would have to 'build' each document and send it to the printer without the user seeing this happen.
The other idea I had was to create a "web page" that contains all the letters on one page. Then the user can hit "Print" and the pages would come out of the printer as a collection of letters. This seems sloppy, though.
Any ideas?
Thanks
I would advise using wkhtmltopdf for this task. You can then create the required letters from one long HTML document with page breaks (or separately) and print them as you would regularly print PDFs.
http://code.google.com/p/wkhtmltopdf/
As "wk" stands for WebKit, it produces exceptionally good quality PDFs. It's a command-line tool that you can just download and run. A small tutorial is here for you:
http://shivul.posterous.com/django-create-dynamic-pdfs-using-wkhtmltopdf
ReportLab is also a good option, but I personally don't want to write raw PDF syntax, and Pisa, the HTML library for ReportLab, is not really that good. wkhtmltopdf is much better and easier to use.
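A minimal sketch of that approach, assuming wkhtmltopdf is installed on the server and a hypothetical letters/letter.html template that wraps each letter in a block styled with "page-break-after: always":

import subprocess
import tempfile

from django.template.loader import render_to_string

def build_letters_pdf(clients):
    # Render all letters into one long HTML document.
    html = render_to_string("letters/letter.html", {"clients": clients})
    with tempfile.NamedTemporaryFile(suffix=".html", delete=False, mode="w") as src:
        src.write(html)
        html_path = src.name
    pdf_path = html_path.replace(".html", ".pdf")
    # Assumes the wkhtmltopdf binary is on the server's PATH.
    subprocess.check_call(["wkhtmltopdf", html_path, pdf_path])
    return pdf_path

The resulting PDF can then be served as a download, or handed to a background task (e.g. Celery) that sends it to a printer.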
I'd suggest using something like ReportLab to create the whole thing as a single PDF document that you can send to the printer in one go.
See the docs on generating PDFs from Django.
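For completeness, a tiny sketch of the ReportLab route (one page per letter; field names and layout are placeholders):

from reportlab.lib.pagesizes import letter
from reportlab.pdfgen import canvas

def build_letters_pdf(clients, path="letters.pdf"):
    c = canvas.Canvas(path, pagesize=letter)
    for client in clients:
        # Placeholder layout: a greeting line and body text per page.
        c.drawString(72, 720, "Dear %s," % client.name)
        c.drawString(72, 696, "...")
        c.showPage()  # start a new page for the next letter
    c.save()
    return path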