Amazon Textract: How to select 'Raw text' option - amazon-web-services

We are trying integrate amazon Textract api in our node.js application. we are facing some issue, FeatureType parameter while processing image. we need to achieve the below option via api:
We are not finding the option in the AWS JavaScript SDK.
export type FeatureType = "TABLES"|"FORMS"|string;
I'm trying this code:
const params = {
Document: {
/* required */
Bytes: Buffer.from(fileData)
},
FeatureTypes: [""] // here i'm facing issue, if i pass "TABLES"|"FORMS" it working
};
var textract = new AWS.Textract({
region: awsConfig.awsRegion,
accessKeyId: awsConfig.awsAccesskeyID,
secretAccessKey: awsConfig.awsSecretAccessKey
})
textract.analyzeDocument(params, (err, data) => {
console.log(err, data)
if (err) {
return resolve(err)
} else {
resolve(data)
}
})
Getting this error:
InvalidParameterType: Expected params.FeatureTypes[0] to be a string
If I pass "TABLES"|"FORMS" its working but I need Raw Text option.
Thanks in advance

You have been calling the analyzeDocument() function:
Analyzes an input document for relationships between detected items.
It returns various types of text:
'BlockType': 'KEY_VALUE_SET'|'PAGE'|'LINE'|'WORD'|'TABLE'|'CELL'|'SELECTION_ELEMENT',
The LINE and WORD blocks seem to match your requirements.
Alternatively, there is also a detectDocumentText() function:
Detects text in the input document. Amazon Textract can detect lines of text and the words that make up a line of text.

Related

Get generated API key from AWS AppSync API created with CDK

I'm trying to access data from my stack where I'm creating an AppSync API. I want to be able to use the generated Stacks' url and apiKey but I'm running into issues with them being encoded/tokenized.
In my stack I'm setting some fields to the outputs of the deployed stack:
this.ApiEndpoint = graphAPI.url;
this.Authorization = graphAPI.graphqlApi.apiKey;
When trying to access these properties I get something like ${Token[TOKEN.209]} and not the values.
If I'm trying to resolve the token like so: this.resolve(graphAPI.graphqlApi.apiKey) I instead get { 'Fn::GetAtt': [ 'AppSyncAPIApiDefaultApiKey537321373E', 'ApiKey' ] }.
But I would like to retrieve the key itself as a string, like da2-10lksdkxn4slcrahnf4ka5zpeemq5i.
How would I go about actually extracting the string values for these properties?
The actual values of such Tokens are available only at deploy-time. Before then you can safely pass these token properties between constructs in your CDK code, but they are opaque placeholders until deployed. Depending on your use case, one of these options can help retrieve the deploy-time values:
If you define CloudFormation Outputs for a variable, CDK will (apart from creating it in CloudFormation), will, after cdk deploy, print its value to the console and optionally write it to a json file you pass with the --outputs-file flag.
// AppsyncStack.ts
new cdk.CfnOutput(this, 'ApiKey', {
value: this.api.apiKey ?? 'UNDEFINED',
exportName: 'api-key',
});
// at deploy-time, if you use a flag: --outputs-file cdk.outputs.json
{
"AppsyncStack": {
"ApiKey": "da2-ou5z5di6kjcophixxxxxxxxxx",
"GraphQlUrl": "https://xxxxxxxxxxxxxxxxx.appsync-api.us-east-1.amazonaws.com/graphql"
}
}
Alternatively, you can write a script to fetch the data post-deploy using the listGraphqlApis and listApiKeys commands from the appsync JS SDK client. You can run the script locally or, for advanced use cases, wrap the script in a CDK Custom Resource construct for deploy-time integration.
Thanks to #fedonev I was able to extract the API key and url like so:
const client = new AppSyncClient({ region: "eu-north-1" });
const command = new ListGraphqlApisCommand({ maxResults: 1 });
const res = await client.send(command);
if (res.graphqlApis) {
const apiKeysCommand = new ListApiKeysCommand({
apiId: res.graphqlApis[0].apiId,
});
const apiKeyResponse = await client.send(apiKeysCommand);
const urls = flatMap(res.graphqlApis[0].uris);
if (apiKeyResponse.apiKeys && res.graphqlApis[0].uris) {
sendSlackMessage(urls[1], apiKeyResponse.apiKeys[0].id || "");
}
}

How to return an entire Datastore table by name using Node.js on a Google Cloud Function

I want to retrieve a table (with all rows) by name. I want to HTTP request using something like this on the body {"table": user}.
Tried this code without success:
'use strict';
const {Datastore} = require('#google-cloud/datastore');
// Instantiates a client
const datastore = new Datastore();
exports.getUsers = (req, res) => {
//Get List
const query = this.datastore.createQuery('users');
this.datastore.runQuery(query).then(results => {
const customers = results[0];
console.log('User:');
customers.forEach(customer => {
const cusKey = customer[this.datastore.KEY];
console.log(cusKey.id);
console.log(customer);
});
})
.catch(err => { console.error('ERROR:', err); });
}
Google Datastore is a NoSQL database that is working with entities and not tables. What you want is to load all the "records" which are "key identifiers" in Datastore and all their "properties", which is the "columns" that you see in the Console. But you want to load them based the "Kind" name which is the "table" that you are referring to.
Here is a solution on how to retrieve all the key identifiers and their properties from Datastore, using HTTP trigger Cloud Function running in Node.js 8 environment.
Create a Google Cloud Function and choose the trigger to HTTP.
Choose the runtime to be Node.js 8
In index.js replace all the code with this GitHub code.
In package.json add:
{
"name": "sample-http",
"version": "0.0.1",
"dependencies": {
"#google-cloud/datastore": "^3.1.2"
}
}
Under Function to execute add loadDataFromDatastore, since this is the name of the function that we want to execute.
NOTE: This will log all the loaded records into the Stackdriver logs
of the Cloud Function. The response for each record is a JSON,
therefore you will have to convert the response to a JSON object to
get the data you want. Get the idea and modify the code accordingly.

Google Cloud Datastore query values in array

I have entities that look like that:
{
name: "Max",
nicknames: [
"bestuser"
]
}
how can I query for nicknames to get the name?
I have created the following index,
indexes:
- kind: users
properties:
- name: name
- name: nicknames
I use the node.js client library to query the nickname,
db.createQuery('default','users').filter('nicknames', '=', 'bestuser')
the response is only an empty array.
Is there a way to do that?
You need to actually fetch the query from datastore, not just create the query. I'm not familiar with the nodejs library, but this is the code given on the Google Cloud website:
datastore.runQuery(query).then(results => {
// Task entities found.
const tasks = results[0];
console.log('Tasks:');
tasks.forEach(task => console.log(task));
});
where query would be
const query = db.createQuery('default','users').filter('nicknames', '=', 'bestuser')
Check the documentation at https://cloud.google.com/datastore/docs/concepts/queries#datastore-datastore-run-query-nodejs
The first point to notice is that you don't need to create an index to this kind of search. No inequalities, no orders and no projections, so it is unnecessary.
As Reuben mentioned, you've created the query but you didn't run it.
ds.runQuery(query, (err, entities, info) => {
if (err) {
reject(err);
} else {
response.resultStatus = info.moreResults;
response.cursor = info.moreResults == TNoMoreResults? null: info.endCursor;
resolve(entities);
};
});
In my case, the response structure was made to collect information on the cursor (if there is more data than I've queried because I've limited the query size using limit) but you don't need to anything more than the resolve(entities)
If you are using the default namespace you need to remove it from your query. Your query needs to be like this:
const query = db.createQuery('users').filter('nicknames', '=', 'bestuser')
I read the entire plob as a string to get the bytes of a binary file here. I imagine you simply parse the Json per your requirement

Amazon Rekognition for text detection

I have images of receipts and I want to store the text in the images separately. Is it possible to detect text from images using Amazon Rekognition?
Update from November 2017:
Amazon Rekognition announces real-time face recognition, Text in Image
recognition, and improved face detection
Read the announcement here: https://aws.amazon.com/about-aws/whats-new/2017/11/amazon-rekognition-announces-real-time-face-recognition-text-in-image-recognition-and-improved-face-detection/
Proof:
No, Amazon Rekognition not provide Optical Character Recognition (OCR).
At the time of writing (March 2017), it only provides:
Object and Scene Detection
Facial Analysis
Face Comparison
Facial Recognition
There is no AWS-provided service that offers OCR. You would need to use a 3rd-party product.
Amazon doesn't provide an OCR API. You can use Google Cloud Vision API for Document Text Recognition. It costs $3.5/1000 images though. To test Google's open this link and paste the code below in the the test request body on the right.
https://cloud.google.com/vision/docs/reference/rest/v1/images/annotate
{
"requests": [
{
"image": {
"source": {
"imageUri": "JPG_PNG_GIF_or_PDF_url"
}
},
"features": [
{
"type": "DOCUMENT_TEXT_DETECTION"
}
]
}
]
}
You may get better results with Amazon Textract although it's currently only available in limited preview.
It's possible to detect text in an image using the AWS JS SDK for Rekognition but your results may vary.
/* jshint esversion: 6, node:true, devel: true, undef: true, unused: true */
// Import libs.
const AWS = require('aws-sdk');
const axios = require('axios');
// Grab AWS access keys from Environmental Variables.
const { S3_ACCESS_KEY, S3_SECRET_ACCESS_KEY, S3_REGION } = process.env;
// Configure AWS with credentials.
AWS.config.update({
accessKeyId: S3_ACCESS_KEY,
secretAccessKey: S3_SECRET_ACCESS_KEY,
region: S3_REGION
});
const rekognition = new AWS.Rekognition({
apiVersion: '2016-06-27'
});
const TEXT_IMAGE_URL = 'https://loremflickr.com/g/320/240/text';
(async url => {
// Fetch the URL.
const textDetections = await axios
.get(url, {
responseType: 'arraybuffer'
})
// Convert to base64 Buffer.
.then(response => new Buffer(response.data, 'base64'))
// Pass bytes to SDK
.then(bytes =>
rekognition
.detectText({
Image: {
Bytes: bytes
}
})
.promise()
)
.catch(error => {
console.log('[ERROR]', error);
return false;
});
if (!textDetections) return console.log('Failed to find text.');
// Output the raw response.
console.log('\n', 'Text Detected:', '\n', textDetections);
// Output to Detected Text only.
console.log('\n', 'Found Text:', '\n', textDetections.TextDetections.map(t => t.DetectedText));
})(TEXT_IMAGE_URL);
See more examples of using Rekognition with NodeJS in this answer.
public async Task<List<string>> IdentifyText(string filename)
{
// Using USWest2, not the default region
AmazonRekognitionClient rekoClient = new AmazonRekognitionClient("Access Key ID", "Secret Access Key", RegionEndpoint.USEast1);
Amazon.Rekognition.Model.Image img = new Amazon.Rekognition.Model.Image();
byte[] data = null;
using (FileStream fs = new FileStream(filename, FileMode.Open, FileAccess.Read))
{
data = new byte[fs.Length];
fs.Read(data, 0, (int)fs.Length);
}
img.Bytes = new MemoryStream(data);
DetectTextRequest dfr = new DetectTextRequest();
dfr.Image = img;
var outcome = rekoClient.DetectText(dfr);
return outcome.TextDetections.Select(x=>x.DetectedText).ToList();
}

Copying one table to another in DynamoDB

What's the best way to identically copy one table over to a new one in DynamoDB?
(I'm not worried about atomicity).
Create a backup(backups option) and restore the table with a new table name. That would get all the data into the new table.
Note: Takes considerable amount of time depending on the table size
I just used the python script, dynamodb-copy-table, making sure my credentials were in some environment variables (AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY), and it worked flawlessly. It even created the destination table for me.
python dynamodb-copy-table.py src_table dst_table
The default region is us-west-2, change it with the AWS_DEFAULT_REGION env variable.
AWS Pipeline provides a template which can be used for this purpose: "CrossRegion DynamoDB Copy"
See: http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-crossregion-ddb-create.html
The result is a simple pipeline that looks like:
Although it's called CrossRegion you can easily use it for the same region as long the destination table name is different (Remember that table names are unique per account and region)
You can use Scan to read the data and save it to the new table.
On the AWS forums a guy from the AWS team posted another approach using EMR: How Do I Duplicate a Table?
Here's one solution to copy all items from one table to another, just using shell scripting, the AWS CLI and jq. Will work OK for smallish tables.
# exit on error
set -eo pipefail
# tables
TABLE_FROM=<table>
TABLE_TO=<table>
# read
aws dynamodb scan \
--table-name "$TABLE_FROM" \
--output json \
| jq "{ \"$TABLE_TO\": [ .Items[] | { PutRequest: { Item: . } } ] }" \
> "$TABLE_TO-payload.json"
# write
aws dynamodb batch-write-item --request-items file://"$TABLE_TO-payload.json"
# clean up
rm "$TABLE_TO-payload.json"
If you both tables to be identical, you'd want to delete all items in TABLE_TO first.
DynamoDB now supports importing from S3.
https://aws.amazon.com/blogs/database/amazon-dynamodb-can-now-import-amazon-s3-data-into-a-new-table/
So, probably in almost all use cases, the easiest and cheapest way to replicate a table is
Use "Export to S3" feature to dump entire table into S3. Since this uses backup to generate the dump, table's throughput is not affected, and is very fast as well. You need to have backups (PITR) enabled. See https://aws.amazon.com/blogs/aws/new-export-amazon-dynamodb-table-data-to-data-lake-amazon-s3/
Use "Import from S3" to import the dump created in step 1. This automatically requires you to create a new table.
Use this node js module : copy-dynamodb-table
This is a little script I made to copy the contents of one table to another.
It's based on the AWS-SDK v3. Not sure how well it would scale to big tables but as a quick and dirty solution it does the job.
It gets your AWS credentials from a profile in ~/.aws/credentials change default to the name of the profile you want to use.
Other than that it takes two args one for the source table and one for destination
const { fromIni } = require("#aws-sdk/credential-providers");
const { DynamoDBClient, ScanCommand, PutItemCommand } = require("#aws-sdk/client-dynamodb");
const ddbClient = new DynamoDBClient({
credentials: fromIni({profile: "default"}),
region: "eu-west-1",
});
const args = process.argv.slice(2);
console.log(args)
async function main() {
const { Items } = await ddbClient.send(
new ScanCommand({
TableName: args[0],
})
);
console.log("Successfully scanned table")
console.log("Copying", Items.length, "Items")
const putPromises = [];
Items.forEach((item) => {
putPromises.push(
ddbClient.send(
new PutItemCommand({
TableName: args[1],
Item: item,
})
)
);
});
await Promise.all(putPromises);
console.log("Successfully copied table")
}
main();
Usage
node copy-table.js <source_table_name> <destination_table_name>
Python + boto3 🚀
The script is idempotent as far as you maintain the same Keys.
import boto3
def migrate(source, target):
dynamo_client = boto3.client('dynamodb', region_name='us-east-1')
dynamo_target_client = boto3.client('dynamodb', region_name='us-west-2')
dynamo_paginator = dynamo_client.get_paginator('scan')
dynamo_response = dynamo_paginator.paginate(
TableName=source,
Select='ALL_ATTRIBUTES',
ReturnConsumedCapacity='NONE',
ConsistentRead=True
)
for page in dynamo_response:
for item in page['Items']:
dynamo_target_client.put_item(
TableName=target,
Item=item
)
if __name__ == '__main__':
migrate('awesome-v1', 'awesome-v2')
On November 29th, 2017 Global Tables was introduced. This may be useful depending on your use case, which may not be the same as the original question. Here are a few snippets from the blog post:
Global Tables – You can now create tables that are automatically replicated across two or more AWS Regions, with full support for multi-master writes, with a couple of clicks. This gives you the ability to build fast, massively scaled applications for a global user base without having to manage the replication process.
...
You do not need to make any changes to your existing code. You simply send write requests and eventually consistent read requests to a DynamoDB endpoint in any of the designated Regions (writes that are associated with strongly consistent reads should share a common endpoint). Behind the scenes, DynamoDB implements multi-master writes and ensures that the last write to a particular item prevails. When you use Global Tables, each item will include a timestamp attribute representing the time of the most recent write. Updates are propagated to other Regions asynchronously via DynamoDB Streams and are typically complete within one second (you can track this using the new ReplicationLatency and PendingReplicationCount metrics).
Another option is to download the table as a .csv file and upload it with the following snippet of code.
This also eliminates the need for providing your AWS credentials to a packages such as the one #ezzat suggests.
Create a new folder and add the following two files and your exported table
Edit uploadToDynamoDB.js and add the filename of the exported table and your table name
Run npm install in the folder
Run node uploadToDynamodb.js
File: Package.json
{
"name": "uploadtodynamodb",
"version": "1.0.0",
"description": "",
"main": "uploadToDynamoDB.js",
"author": "",
"license": "ISC",
"dependencies": {
"async": "^3.1.1",
"aws-sdk": "^2.624.0",
"csv-parse": "^4.8.5",
"fs": "0.0.1-security",
"lodash": "^4.17.15",
"uuid": "^3.4.0"
}
}
File: uploadToDynamoDB.js
var fs = require('fs');
var parse = require('csv-parse');
var async = require('async');
var _ = require('lodash')
var AWS = require('aws-sdk');
// If your table is in another region, make sure to update this
AWS.config.update({ region: "eu-central-1" });
var ddb = new AWS.DynamoDB({ apiVersion: '2012-08-10' });
var csv_filename = "./TABLE_CSV_EXPORT_FILENAME.csv";
var tableName = "TABLENAME"
function prepareData(data_chunk) {
const items = data_chunk.map(obj => {
const keys = Object.keys(obj)
let attr = Object.values(obj)
attr = attr.map(a => {
let newAttr;
// Can we make this an integer
if (isNaN(Number(a))) {
newAttr = { "S": a }
} else {
newAttr = { "N": a }
}
return newAttr
})
let item = _.zipObject(keys, attr)
return {
PutRequest: {
Item: item
}
}
})
var params = {
RequestItems: {
[tableName]: items
}
};
return params
}
rs = fs.createReadStream(csv_filename);
parser = parse({
columns : true,
delimiter : ','
}, function(err, data) {
var split_arrays = [], size = 25;
while (data.length > 0) {
split_arrays.push(data.splice(0, size));
}
data_imported = false;
chunk_no = 1;
async.each(split_arrays, function(item_data, callback) {
const params = prepareData(item_data)
ddb.batchWriteItem(
params,
function (err, data) {
if (err) {
console.log("Error", err);
} else {
console.log("Success", data);
}
});
}, function() {
// run after loops
console.log('all data imported....');
});
});
rs.pipe(parser);
It's been a very long time since the question was posted and AWS has been continuously improvising features. At the time of writing this answer, we have the option to export the Table to S3 bucket then use the import feature to import this data from S3 into a new table which automatically will re-create a new table with the data. Plese refer this blog for more idea on export & import
Best part is that you get to change the name, PK or SK.
Note: You have to enable PITR (might incur additional costs). Always best to refer documents.
Here is another simple python util script for this: ddb_table_copy.py. I use it often.
usage: ddb_table_copy.py [-h] [--dest-table DEST_TABLE] [--dest-file DEST_FILE] source_table
Copy all DynamoDB items from SOURCE_TABLE to either DEST_TABLE or DEST_FILE. Useful for migrating data during a stack teardown/re-creation.
positional arguments:
source_table Name of source table in DynamoDB.
optional arguments:
-h, --help show this help message and exit
--dest-table DEST_TABLE
Name of destination table in DynamoDB.
--dest-file DEST_FILE
2) a valid file path string to save the items to, e.g. 'items.json'.