I am running processDocument with the expense parser, as in the example here. Since the billing costs too much, instead of sending the documents one by one, I combine 10 documents into one PDF and call processDocument once. However, Document AI sees the 10 separate receipts we combined as a single receipt and, instead of returning a separate total_amount entity for each receipt, it returns only one total_amount. I want to combine 10 documents into one PDF and send that, for a lower billing cost, and I am also looking for a way to treat each document independently of the others and extract its entities separately. Will batch processing work for me? What can I do? Can you help me, please?
Unfortunately, there is no way to make the billing cheaper, because Document AI pricing is calculated on a per-page/per-document basis. See Document AI pricing.
With regard to your question:
I am looking for a way to think of each document independently from
each other and extract its entities separately. Will batch processing
work for me?
Yes, batch processing will work for you, but the pricing is the same as for processDocument. See the pricing info linked above.
The only difference between batch processing and processDocument is that instead of sending one request per document, batch processing sends all your documents in a single request. The output is then stored in a GCS bucket that you define in the batch process options. See the batch process sample code.
Another thing to note is that batch processing handles the documents asynchronously. This means that once the request is sent, the processing happens on the backend, and you can poll the status of the operation to see whether it is still processing or done.
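For reference, a minimal sketch of such a batch request with the Python client could look like the following; the project, location, processor ID, and bucket URIs are placeholders you would replace with your own.

    from google.cloud import documentai_v1 as documentai

    # Placeholder values -- replace with your own project, processor, and buckets.
    PROJECT_ID = "my-project"
    LOCATION = "us"
    PROCESSOR_ID = "my-expense-processor-id"
    GCS_INPUT_PREFIX = "gs://my-input-bucket/receipts/"   # one PDF per receipt
    GCS_OUTPUT_URI = "gs://my-output-bucket/docai-output/"

    client = documentai.DocumentProcessorServiceClient()
    name = client.processor_path(PROJECT_ID, LOCATION, PROCESSOR_ID)

    request = documentai.BatchProcessRequest(
        name=name,
        input_documents=documentai.BatchDocumentsInputConfig(
            gcs_prefix=documentai.GcsPrefix(gcs_uri_prefix=GCS_INPUT_PREFIX)
        ),
        document_output_config=documentai.DocumentOutputConfig(
            gcs_output_config=documentai.DocumentOutputConfig.GcsOutputConfig(
                gcs_uri=GCS_OUTPUT_URI
            )
        ),
    )

    # batch_process_documents is asynchronous: it returns a long-running operation
    # that you can poll (or block on) until the batch finishes.
    operation = client.batch_process_documents(request=request)
    operation.result(timeout=600)
    # The Document JSON for each input file is now under GCS_OUTPUT_URI,
    # with its own entities (e.g. total_amount) per document.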
Related
We are on Google Cloud Platform, so technologies there would be a good win. We have a huge file that comes in, and Dataflow scales on the input to break up the file quite nicely. After that, however, it streams through many systems: microservice1 over to data connectors grabbing related data, over to ML, and finally over to a final microservice.
Since the final stage could be around 200-1000 servers depending on load, how can we take all the requests coming in (yes, we have a file id attached to every request, including a customerRequestId in case a file is dropped multiple times) and make sure every line with the same customerRequestId is written to the same file on output?
What is the best method to do this? The resulting file is almost always a csv file.
Any ideas or good options I can explore? If Dataflow is good at ingestion and reading a massively large file in parallel, is it also good at taking in various inputs on a cluster of nodes (not a single node, which would bottleneck us)?
EDIT: I seem to recall that HDFS files are partitioned across nodes and, I think, can be written by many nodes at the same time (one node per partition). Does anyone know if Google Cloud Storage files work this way as well? Is there a way to have 200 nodes writing to 200 partitions of the same file in Google Cloud Storage in such a way that it is all one file?
EDIT 2:
I see that there is a streaming Pub/Sub to BigQuery option that could be done as one stage in this list: https://cloud.google.com/dataflow/docs/guides/templates/provided-streaming
HOWEVER, in this list there is no batch BigQuery-to-CSV option (which is what our customer wants). I do see a BigQuery-to-Parquet option, though, here: https://cloud.google.com/dataflow/docs/guides/templates/provided-batch
I would prefer to go directly to CSV, though. Is there a way?
thanks,
Dean
Your case is complex and hard (and expensive) to reproduce. My first idea is to use BigQuery: sink all the data into the same table with Dataflow.
Then, create a temporary table with only the data to export to CSV, like this:
CREATE TABLE `myproject.mydataset.mytemptable`
OPTIONS(
expiration_timestamp=TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR)
) AS
SELECT ....
Then export the temporary table to CSV. If the table is less than 1 GB, only one CSV file will be generated.
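As a rough sketch of that export step (assuming the temporary table above and a destination bucket of your own), the BigQuery Python client could be used like this; exports default to CSV:

    from google.cloud import bigquery

    client = bigquery.Client()

    # Placeholder names -- use your own table and bucket.
    table_id = "myproject.mydataset.mytemptable"
    destination_uri = "gs://my-export-bucket/output/result.csv"

    # extract_table defaults to CSV output. Tables over ~1 GB need a wildcard
    # URI (e.g. gs://my-export-bucket/output/result-*.csv) and produce several files.
    extract_job = client.extract_table(table_id, destination_uri)
    extract_job.result()  # wait for the export to finish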
If you need to orchestrate these steps, you can use Workflows.
The plan was to get data from AWS Data Exchange, move it to an S3 bucket, and then query it with AWS Athena for a data API. Everything works; it just feels a bit slow.
No matter the dataset or the query, I can't get below 2 seconds of Athena response time, which is a lot for an API. I checked the best practices, but it seems those are also above 2 seconds.
So my question:
Is 2 seconds the minimum response time for Athena?
If so, then I have to switch to Postgres.
Athena is indeed not a low latency data store. You will very rarely see response times below one second, and often they will be considerably longer. In the general case Athena is not suitable as a backend for an API, but of course that depends on what kind of an API it is. If it's some kind of analytics service, perhaps users don't expect sub-second response times? I have built APIs that use Athena that work really well, but those were services where response times in seconds were expected (and even considered fast), and I got help from the Athena team to tune our account to our workload.
To understand why Athena is "slow", we can dissect what happens when you submit a query to Athena (a minimal sketch of this flow in code follows the list):
1. Your code starts a query by using the StartQueryExecution API call
2. The Athena service receives the query, and puts it on a queue. If you're unlucky your query will sit in the queue for a while
3. When there is available capacity the Athena service takes your query from the queue and makes a query plan
4. The query plan requires loading table metadata from the Glue catalog, including the list of partitions, for all tables included in the query
5. Athena also lists all the locations on S3 it got from the tables and partitions to produce a full list of files that will be processed
6. The plan is then executed in parallel, and depending on its complexity, in multiple steps
7. The results of the parallel executions are combined and a result is serialized as CSV and written to S3
8. Meanwhile your code checks if the query has completed using the GetQueryExecution API call, until it gets a response that says that the execution has succeeded, failed, or been cancelled
9. If the execution succeeded your code uses the GetQueryResults API call to retrieve the first page of results
10. To respond to that API call, Athena reads the result CSV from S3, deserializes it, and serializes it as JSON for the API response
11. If there are more than 1000 rows the last steps will be repeated
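As a rough boto3 sketch of steps 1, 8, and 9 (the database, output location, and polling interval below are placeholders, not recommendations):

    import time
    import boto3

    athena = boto3.client("athena")

    # Step 1: start the query (placeholder database and result location).
    start = athena.start_query_execution(
        QueryString="SELECT NOW()",
        QueryExecutionContext={"Database": "my_database"},
        ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
    )
    query_id = start["QueryExecutionId"]

    # Step 8: poll until the execution reaches a terminal state.
    while True:
        execution = athena.get_query_execution(QueryExecutionId=query_id)
        state = execution["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(0.1)  # every missed poll adds latency, as discussed below

    # Step 9: fetch the first page of results (at most 1000 rows per page).
    if state == "SUCCEEDED":
        results = athena.get_query_results(QueryExecutionId=query_id, MaxResults=1000)
        for row in results["ResultSet"]["Rows"]:
            print([col.get("VarCharValue") for col in row["Data"]])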
A Presto expert could probably give more detail about steps 4-6, even though they are probably a bit modified in Athena's version of Presto. The details aren't very important for this discussion though.
If you run a query over a lot of data, tens of gigabytes or more, the total execution time will be dominated by step 6. If the result is also big, step 7 will be a factor.
If your data set is small, and/or involves thousands of files on S3, then steps 4-5 will instead dominate.
Here are some reasons why Athena queries can never be fast, even if they wouldn't touch S3 (for example SELECT NOW()):
There will be at least three API calls before you get the response: a StartQueryExecution, a GetQueryExecution, and a GetQueryResults. Just their round-trip time (RTT) would add up to more than 100 ms.
You will most likely have to call GetQueryExecution multiple times, and the delay between calls puts a bound on how quickly you can discover that the query has succeeded. For example, if you call it every 100 ms, you will on average add half of 100 ms + RTT to the total time, because on average you'll miss the actual completion time by that much.
Athena writes the results to S3 before it marks the execution as succeeded, and since it produces a single CSV file this is not done in parallel. A big response takes time to write.
GetQueryResults must read the CSV from S3, parse it, and serialize it as JSON. Subsequent pages must skip ahead in the CSV, and may be even slower.
Athena is a multi-tenant service, all customers are competing for resources, and your queries will get queued when there aren't enough resources available.
If you want to know what affects the performance of your queries you can use the ListQueryExecutions API call to list recent query execution IDs (I think you can go back 90 days at the most), and then use GetQueryExecution to get query statistics (see the documentation for QueryExecution.Statistics for what each property means). With this information you can figure out if your slow queries are because of queueing, execution, or the overhead of making the API calls (if it's not the first two, it's likely the last).
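As a rough illustration of pulling those statistics with boto3 (the property names come from QueryExecution.Statistics; the MaxResults value is arbitrary):

    import boto3

    athena = boto3.client("athena")

    # List recent query execution IDs (optionally scoped with the WorkGroup parameter).
    execution_ids = athena.list_query_executions(MaxResults=50)["QueryExecutionIds"]

    for query_id in execution_ids:
        execution = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]
        stats = execution.get("Statistics", {})
        print(
            query_id,
            stats.get("QueryQueueTimeInMillis"),      # time spent queued
            stats.get("EngineExecutionTimeInMillis"), # time spent actually executing
            stats.get("TotalExecutionTimeInMillis"),  # end-to-end time for the execution
        )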
There are some things you can do to cut some of the delays, but these tips are unlikely to get you down to sub-second latencies:
If you query a lot of data use file formats that are optimized for that kind of thing, Parquet is almost always the answer – and also make sure your file sizes are optimal, around 100 MB.
Avoid lots of files, and avoid deep hierarchies. Ideally have just one or a few files per partition, and don't organize files in "subdirectories" (S3 prefixes with slashes) except for those corresponding to partitions.
Avoid running queries at the top of the hour; this is when everyone else's scheduled jobs run, and there is significant contention for resources during the first minutes of every hour.
Skip GetQueryResults and download the result CSV from S3 directly (the output location is included in the GetQueryExecution response); a rough sketch of this follows the list. The GetQueryResults call is convenient if you want to know the data types of the columns, but if you already know, or don't care, reading the data directly can save you some precious tens of milliseconds. If you need the column data types you can get the ….csv.metadata file that is written alongside the result CSV; it's undocumented Protobuf data, see here and here for more information.
Ask the Athena service team to tune your account. This might not be something you can get without higher tiers of support; I don't really know the politics of this, and you need to start by talking to your account manager.
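A rough sketch of the read-directly-from-S3 shortcut mentioned above (the bucket and key are parsed from whatever output location your query used):

    import csv
    import io
    import boto3

    athena = boto3.client("athena")
    s3 = boto3.client("s3")

    def read_results_directly(query_execution_id):
        # The result CSV lives at the OutputLocation reported by GetQueryExecution.
        execution = athena.get_query_execution(QueryExecutionId=query_execution_id)
        output_location = execution["QueryExecution"]["ResultConfiguration"]["OutputLocation"]
        bucket, key = output_location[len("s3://"):].split("/", 1)

        # Download and parse the CSV instead of paging through GetQueryResults.
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")
        return list(csv.reader(io.StringIO(body)))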
I have data coming from multiple machines, and I would like to aggregate it by user. I'm thinking of producing batches of 1000 "rows", or 10 seconds of data (whichever comes first), per user.
I do have some experience with AWS Kinesis and Lambdas, but in my experience we don't have much control over how the aggregation is done. All machines would send the data through Kinesis, with the user id as the partition key. Then AWS would call our Lambda with small batches. This has been great for some other use cases, but here, if I receive 100 records I don't know what to do (I would like to "wait" to receive more, or wait until 10 seconds have elapsed since the date of the first record).
Also, I'm not sure how the aggregation "by user id" would work. So far, on a Lambda, I would split the records in the batch by user id, but if I get called with a batch of 100 records, even though the partition key is the user id, there is no guarantee that those 100 records would all be for one user. Maybe I will get 100 records from 100 different users, and then there is no "aggregation" help at all.
Any idea if Kinesis + Lambda is suited for this? I did look at the AWS documentation, but I don't see my scenario. It looks like they also have a tool called "Data Streams", but it's hard for me to tell whether it would work for my case.
Thanks!
Your understanding is correct. AWS Lambda + Kinesis alone will not be sufficient for aggregation. The AWS Lambda programming model is stateless, so you can only aggregate based on the batch of records received in that particular invocation (GetRecords API call); the sketch below illustrates this limitation. Furthermore, the batch size configured on the function does not guarantee that you will get that number of records; it is merely the maximum number of records (MaxRecords) you can get per invocation.
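A minimal sketch of a Kinesis-triggered Lambda handler, assuming each record's payload is JSON with a user_id field (that field name is an assumption for illustration):

    import base64
    import json
    from collections import defaultdict

    def handler(event, context):
        # Each invocation only sees the records of this one batch;
        # there is no state carried over to the next invocation.
        by_user = defaultdict(list)
        for record in event["Records"]:
            payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
            by_user[payload["user_id"]].append(payload)  # "user_id" is an assumed field name
        # Anything aggregated here is limited to this batch, which may mix many users.
        return {"users_in_batch": len(by_user)}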
What you need is some kind of windowing mechanism, either row-based or time-based. Kinesis Analytics would be the easiest and fastest way to get on board with this. You can use either SQL or Flink with Kinesis Analytics. You can even send the output to AWS Lambda for post-processing.
Another way would be to use a Spark streaming job (you can run it on AWS EMR) and use windowing in your application; a rough sketch follows.
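For the time-based part only (the stream source, schema, paths, and the 10-second/30-second durations are placeholders, and the "1000 rows, whichever comes first" cap would need custom stateful logic on top), a Structured Streaming job could group by user roughly like this:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, collect_list, window

    spark = SparkSession.builder.appName("user-aggregation").getOrCreate()

    # Placeholder source: on EMR you would plug in a Kinesis connector here;
    # the point of this sketch is only the windowed aggregation below.
    events = (
        spark.readStream.format("json")
        .schema("user_id STRING, payload STRING, event_time TIMESTAMP")
        .load("s3://my-bucket/incoming/")
    )

    # Time-based windowing: one aggregate per user per 10-second window.
    aggregated = (
        events.withWatermark("event_time", "30 seconds")
        .groupBy(window(col("event_time"), "10 seconds"), col("user_id"))
        .agg(collect_list("payload").alias("rows"))
    )

    query = (
        aggregated.writeStream.outputMode("append")
        .format("parquet")
        .option("path", "s3://my-bucket/aggregated/")
        .option("checkpointLocation", "s3://my-bucket/checkpoints/")
        .start()
    )
    query.awaitTermination()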
Let's say I upload 10,000 documents to CloudSearch. CloudSearch would take some time to index them, and I already have another 10,000 documents lined up to be uploaded. Now, the problem is that my ingestion flow checks whether any of the documents in the second batch already exist in my domain. If they do, it merges those documents and then uploads them. However, if indexing is still in progress, my ingestion flow might miss some records, and they will be overwritten by the second batch.
How can I solve this problem?
Can I know if the first batch has finished indexing before I start ingestion of the second batch?
In the Authorize.net API, when getSettledBatchList returns a settlementState of settlementError, is that the final state for the batch? What should I expect to happen to the batched transactions?
Is the same batch processed again the following day, using the same batch id, possibly resulting in a settlementState of settledSuccessfully? Or are the affected transactions automatically included in a new batch with a new batch id?
If the transactions are included in a new batch, would they then be included in multiple batches? If transactions are included in multiple batches, would getTransactionList for each of these batches return the exact same transactionStatus for transactions that were included in multiple batches, regardless of which batch id was used to make the getTransactionList request?
The question was originally asked at https://community.developer.authorize.net/t5/Integration-and-Testing/What-happens-to-a-batch-having-a-settlementState-of/td-p/58993. If the question is answered there, I'll also add the answer here.
Here's the answer posted in the Authorize.Net community for those who did not follow the link in the question:
A batch status of "settlement error" means that the batch failed. There are different reasons a batch could fail, depending on the processor the merchant is using, and different causes of failure. A failed batch needs to be reset, which means the merchant will need to contact Authorize.Net to request a batch reset. It is important to note that batches over 30 days old cannot be reset. When resetting a batch, the merchant first needs to confirm with their MSP (Merchant Service Provider) that the batch was not funded and that the error that failed the batch has been fixed, before submitting a ticket for the batch to be reset.
Resetting a batch doesn't really modify the batch; what it does is take the transactions from the batch and put them back into unsettled, so they settle with the next batch. Transactions that were in the failed batch will still have their original submit date.
Authorize.Net just sends the batch to your MSP; you'll have to contact your MSP and have them get on a three-way call with Authorize.Net to sort it out.