Cloud Vision PDF unsupported input file format - google-cloud-platform

I'm using Cloud Vision to detect text in a PDF file. I've used the code provided in the documentation, but it throws an error saying "unsupported input file format". I'm 100% sure the file is a PDF; I even used the sample resource file https://github.com/GoogleCloudPlatform/python-docs-samples/blob/master/vision/cloud-client/detect/resources/kafka.pdf. What should I do?
EDIT
This is the code I used, taken straight from the documentation.
const vision = require('@google-cloud/vision').v1;
const client = new vision.ImageAnnotatorClient();

// bucketName, fileName and outputPrefix are defined earlier in my code
// (my bucket, the PDF's name in the bucket, and a prefix for the output JSON).
const gcsSourceUri = `gs://${bucketName}/${fileName}`;
const gcsDestinationUri = `gs://${bucketName}/${outputPrefix}/`;

const inputConfig = {
  // Supported mime_types are: 'application/pdf' and 'image/tiff'
  mimeType: 'application/pdf',
  gcsSource: {
    uri: gcsSourceUri,
  },
};
const outputConfig = {
  gcsDestination: {
    uri: gcsDestinationUri,
  },
};
const features = [{type: 'DOCUMENT_TEXT_DETECTION'}];
const request = {
  requests: [
    {
      inputConfig: inputConfig,
      features: features,
      outputConfig: outputConfig,
    },
  ],
};

const [operation] = await client.asyncBatchAnnotateFiles(request);
const [filesResponse] = await operation.promise();
const destinationUri =
  filesResponse.responses[0].outputConfig.gcsDestination.uri;
console.log('Json saved to: ' + destinationUri);

I tried moving that kafka.pdf to my GCS bucket and ran the Python sample code, which worked as expected. Maybe something went wrong with the kafka.pdf file when you moved it into your GCS bucket.
Try the sample file they provide to see if it works for you: gs://cloud-samples-data/vision/pdf_tiff/census2010.pdf. The census file works for me as well.
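With the code in the question, that just means pointing gcsSourceUri at the sample file, for example:

// Rule out problems with your own upload by using Google's sample PDF.
const gcsSourceUri = 'gs://cloud-samples-data/vision/pdf_tiff/census2010.pdf';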

I was getting the same response from the batch annotation service on otherwise valid PDF files. In my case it came down to copy/pasting the example code from the Node sample for uploading files to Google Cloud Storage, which includes the gzip and cacheControl options.
It doesn't look like you've included those values, but after a lot of head scratching I found that if I uploaded my PDFs without those options, the annotation service accepted them. Not an exact reproduction of your problem, but I hope it leads to progress for you :)
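A sketch of the difference with @google-cloud/storage (bucket and file names are placeholders): the stock upload sample passes gzip: true and a metadata.cacheControl value, while the plain call below passes neither; those were the two options whose removal made my PDFs acceptable.

const { Storage } = require('@google-cloud/storage');
const storage = new Storage();

async function uploadPdf(bucketName, localPath, destination) {
  // Deliberately no `gzip: true` and no `metadata: { cacheControl: ... }` here.
  await storage.bucket(bucketName).upload(localPath, { destination });
}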

Related

Upload image to Amazon S3 using @aws-sdk/client-s3 and get its location

I am trying to upload an image file to S3 but get this error:
ERROR: MethodNotAllowed: The specified method is not allowed against this resource.
My code uses the @aws-sdk/client-s3 package to upload, with this code:
const { S3 } = require('@aws-sdk/client-s3');
const fs = require('fs');
// config and logger come from elsewhere in the app

const s3 = new S3({
  region: 'us-east-1',
  credentials: {
    accessKeyId: config.accessKeyId,
    secretAccessKey: config.secretAccessKey,
  }
});

exports.uploadFile = async options => {
  options.internalPath = options.internalPath || (`${config.s3.internalPath + options.moduleName}/`);
  options.ACL = options.ACL || 'public-read';
  logger.info(`Uploading [${options.path}]`);
  const params = {
    Bucket: config.s3.bucket,
    Body: fs.createReadStream(options.path),
    Key: options.internalPath + options.fileName,
    ACL: options.ACL
  };
  try {
    const s3Response = await s3.completeMultipartUpload(params);
    if (s3Response) {
      logger.info(`Done uploading, uploaded to: ${s3Response.Location}`);
      return { url: s3Response.Location };
    }
  } catch (err) {
    logger.error(err, 'unable to upload:');
    throw err;
  }
};
I am not sure what this error means, and once the file is uploaded I need to get its location in S3.
Thanks for any help.
For uploading a single image file you need to be calling s3.upload() not s3.completeMultipartUpload().
If you had very large files and wanted to upload them in multiple parts, the workflow would look like this (a rough sketch follows the list of calls):
s3.createMultipartUpload()
s3.uploadPart()
s3.uploadPart()
...
s3.completeMultipartUpload()
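As a rough sketch of that flow with the question's v3 client (not production code: each part except the last normally has to be at least 5 MB, and all names here are placeholders):

async function multipartUpload(s3, bucket, key, partBodies) {
  // 1. Start the multipart upload and remember its UploadId.
  const { UploadId } = await s3.createMultipartUpload({ Bucket: bucket, Key: key });

  // 2. Upload each part, collecting the ETags the service returns.
  const parts = [];
  for (let i = 0; i < partBodies.length; i++) {
    const { ETag } = await s3.uploadPart({
      Bucket: bucket,
      Key: key,
      UploadId,
      PartNumber: i + 1,
      Body: partBodies[i],
    });
    parts.push({ ETag, PartNumber: i + 1 });
  }

  // 3. Only now call completeMultipartUpload, referencing every uploaded part.
  return s3.completeMultipartUpload({
    Bucket: bucket,
    Key: key,
    UploadId,
    MultipartUpload: { Parts: parts },
  });
}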
Looking at the official documentation, it looks like the new way to do a simple S3 upload in the JavaScript SDK is this:
s3.send(new PutObjectCommand(uploadParams));
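Filling that in a bit, here is a sketch adapted to the question's v3 client (the region, bucket, and key values are placeholders, and the returned URL is built by hand because PutObject does not return a Location):

const { S3, PutObjectCommand } = require('@aws-sdk/client-s3');
const fs = require('fs');

const s3 = new S3({ region: 'us-east-1' });

async function uploadFile(localPath, bucket, key) {
  // Simple single-request upload; no multipart calls involved.
  await s3.send(new PutObjectCommand({
    Bucket: bucket,
    Key: key,
    Body: fs.createReadStream(localPath),
  }));
  // Construct the object URL yourself, e.g. the virtual-hosted-style form.
  return { url: `https://${bucket}.s3.us-east-1.amazonaws.com/${key}` };
}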

AWS S3 files uploaded partially

I am using the AWS JavaScript SDK v2 to upload files from my web application. While uploading a large number of files, like 200 or more, it shows success, but the files are not displayed in the AWS console; many files are missing.
I am also making a head-object call to verify whether each file was uploaded successfully, which returns success, yet files are still missing. Below is my code:
// Upload file
const params = {
  Bucket: bucket,
  Key: directory + fileName,
  Body: file,
};
await s3Client.upload(params).promise();

// Check if uploaded successfully
const headParams = {
  Bucket: bucket,
  Key: directory + fileName,
};
const fileDetails = await s3Client.headObject(headParams).promise();
if (fileSize === fileDetails.ContentLength) {
  // Uploaded successfully
}
Is there anything I am missing?
Thanks!

s3 SignedUrl x-amz-security-token

const AWS = require('aws-sdk');

export function main(event, context, callback) {
  const s3 = new AWS.S3();
  const data = JSON.parse(event.body);
  const s3Params = {
    Bucket: process.env.mediaFilesBucket,
    Key: data.name,
    ContentType: data.type,
    ACL: 'public-read',
  };
  const uploadURL = s3.getSignedUrl('putObject', s3Params);
  callback(null, {
    statusCode: 200,
    headers: {
      'Access-Control-Allow-Origin': '*'
    },
    body: JSON.stringify({ uploadURL: uploadURL }),
  });
}
When I test it locally it works fine, but after deployment the signed URL includes an x-amz-security-token query parameter, and then I get an access denied response. How can I get rid of this x-amz-security-token?
I was having the same issue. Everything was working flawlessly using serverless-offline, but when I deployed to Lambda I started receiving AccessDenied errors on the URL. When comparing the URLs returned by the serverless-offline and AWS deployments, I noticed the only difference was the inclusion of X-Amz-Security-Token as a query string parameter. After some digging I discovered the token being assigned was based on the assumed role the Lambda function had. All I had to do was grant the appropriate S3 policies to that role and it worked.
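A sketch of the kind of statement the role needs for the question's putObject signed URL (the exact policy may differ; the bucket ARN below is a placeholder, and s3:PutObjectAcl is included only because the question sets ACL: 'public-read'):

{
  "Effect": "Allow",
  "Action": [
    "s3:PutObject",
    "s3:PutObjectAcl"
  ],
  "Resource": "arn:aws:s3:::YOUR_MEDIA_FILES_BUCKET/*"
}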
I just solved a very similar, probably the same, issue as you have. I say probably because you don't say what deployment entails for you. I am assuming you are deploying to Lambda, but you may not be, so this may or may not apply; if you are using temporary credentials it will.
I initially used the method you use above, but then tried the npm module aws-signature-v4 to see if it behaved differently, and got the same error you are getting.
You will need the token: it is required when you have signed a request with temporary credentials. In Lambda's case the credentials are in the runtime, including the session token, which you need to pass. The same is most likely true elsewhere as well, but I'm not sure; I haven't used EC2 in a few years.
Buried in the docs (and sorry, I cannot find the place this is stated) it is pointed out that some services require the session token to be processed with the other canonical query params. The module I'm using was tacking it on at the end, as the sig v4 instructions seem to imply, so I modified it so the token is canonical and it works.
We've updated the live version of the aws-signature-v4 module to reflect this change and now it works nicely for signing your s3 requests.
Signing is discussed here.
I would use the module I did as I have a feeling the sdk is doing the wrong thing for some reason.
Usage example (this is wrapped in a multipart upload, hence the part number and upload ID):
const sig4 = require('aws-signature-v4');

function createBaseUrl(bucketName, uploadId, partNumber, objectKey) {
  let url = sig4.createPresignedS3URL(objectKey, {
    method: "PUT",
    bucket: bucketName,
    expires: 21600,
    query: `partNumber=${partNumber}&uploadId=${uploadId}`
  });
  return url;
}
I was facing the same issue. I'm creating a signed URL using the Boto3 library in Python 3.7.
Although this is not a recommended way to solve it, it worked for me.
The request method should be POST with content type multipart/form-data.
Create a client like this:
import boto3

# Do not hard code credentials
client = boto3.client(
    's3',
    # Hard coded strings as credentials, not recommended.
    aws_access_key_id='YOUR_ACCESS_KEY',
    aws_secret_access_key='YOUR_SECRET_ACCESS_KEY'
)
Return the response:
bucket_name = BUCKET
acl = {'acl': 'public-read-write'}
file_path = str(file_name)  # the file you want to upload
response = client.generate_presigned_post(bucket_name,
                                           file_path,
                                           Fields={"Content-Type": ""},
                                           Conditions=[acl,
                                                       {"Content-Type": ""},
                                                       ["starts-with", "$success_action_status", ""],
                                                       ],
                                           ExpiresIn=3600)

How can I download file inside a folder in Google Cloud Storage using Cloud Functions?

I am using an HTTPS-triggered Google Cloud Function that is supposed to download a file from Google Cloud Storage (and then combine it with data from req.body). While it seems to work as long as the downloaded file is in the root directory, I am having problems accessing the same file when it is placed inside a folder. The path to the file is documents/someTemplate.docx.
'use strict';
const functions = require('firebase-functions');
const path = require('path');
const os = require('os');
const fs = require('fs');

const gcconfig = {
  projectId: "MYPROJECTNAME",
  keyFilename: "KEYNAME.json"
};
const Storage = require('@google-cloud/storage')(gcconfig);
const bucketPath = 'MYPROJECTNAME.appspot.com';
const bucket = Storage.bucket(bucketPath);

exports.getFileFromStorage = functions.https.onRequest((req, res) => {
  let fileName = 'documents/someTemplate.docx';
  let tempFilePath = path.join(os.tmpdir(), fileName);
  return bucket.file(fileName)
    .download({
      destination: tempFilePath,
    })
    .then(() => {
      console.log(fileName + ' downloaded locally to', tempFilePath);
      let content = fs.readFileSync(tempFilePath, 'binary');
      // do stuff with the file and data from req.body
      return;
    })
    .catch(err => {
      res.status(500).json({
        error: err
      });
    });
});
What I don't understand is that when I move the file to the root directory and use the file name someTemplate.docx instead, the code works.
Google's documentation states that
Objects added to a folder appear to reside within the folder in the GCP Console. In reality, all objects exist at the bucket level, and simply include the directory structure in their name. For example, if you create a folder named pets and add a file cat.jpeg to that folder, the GCP Console makes the file appear to exist in the folder. In reality, there is no separate folder entity: the file simply exists in the bucket and has the name pets/cat.jpeg.
This seems to be correct as in the metadata the file name is indeed documents/someTemplate.docx. Therefore I don't understand why the code above does not work.
Posting comment answer from @James Poag for visibility:
Also, perhaps the directory doesn't exist on the temp folder location? Maybe try let tempFilePath = path.join(os.tmpdir(), 'tempkjhgfhjnmbvgh.docx'); – James Poag Aug 21 at 17:10
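Building on that comment, a small sketch of two ways to make the local path valid (my own illustration, not the original poster's code):

const path = require('path');
const os = require('os');
const fs = require('fs');

const fileName = 'documents/someTemplate.docx';

// Option 1: drop the folder part and keep only the base name locally.
const flatPath = path.join(os.tmpdir(), path.basename(fileName));

// Option 2: recreate the folder structure under the temp directory first.
const nestedPath = path.join(os.tmpdir(), fileName);
fs.mkdirSync(path.dirname(nestedPath), { recursive: true });
// ...then pass flatPath or nestedPath as the download destination.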

"We can not access the URL currently."

When I call the Google Vision API, it returns "We can not access the URL currently." But the resource does exist and can be accessed.
https://vision.googleapis.com/v1/images:annotate
request content:
{
  "requests": [
    {
      "image": {
        "source": {
          "imageUri": "http://yun.jybdfx.com/static/img/homebg.jpg"
        }
      },
      "features": [
        {
          "type": "TEXT_DETECTION"
        }
      ],
      "imageContext": {
        "languageHints": [
          "zh"
        ]
      }
    }
  ]
}
response content:
{
  "responses": [
    {
      "error": {
        "code": 4,
        "message": "We can not access the URL currently. Please download the content and pass it in."
      }
    }
  ]
}
As of August, 2017, this is a known issue with the Google Cloud Vision API (source). It appears to repro for some users but not deterministically, and I've run into it myself with many images.
Current workarounds include either uploading your content to Google Cloud Storage and passing its gs:// uri (note it does not have to be publicly readable on GCS) or downloading the image locally and passing it to the vision API in base64 format.
Here's an example in Node.js of the latter approach:
// `client` is a vision.ImageAnnotatorClient and `image` is the remote image URL.
const request = require('request-promise-native').defaults({
  encoding: 'base64'
});

const data = await request(image);
const response = await client.annotateImage({
  image: {
    content: data
  },
  features: [
    { type: vision.v1.types.Feature.Type.LABEL_DETECTION },
    { type: vision.v1.types.Feature.Type.CROP_HINTS }
  ]
});
I faced the same issue when I was trying to call the API using the Firebase Storage download URL (although it worked initially).
After looking around I found the example below in the API docs for Node.js.
Node.js example:
// Imports the Google Cloud client libraries
const vision = require('@google-cloud/vision');

// Creates a client
const client = new vision.ImageAnnotatorClient();

/**
 * TODO(developer): Uncomment the following lines before running the sample.
 */
// const bucketName = 'Bucket where the file resides, e.g. my-bucket';
// const fileName = 'Path to file within bucket, e.g. path/to/image.png';

// Performs text detection on the gcs file
const [result] = await client.textDetection(`gs://${bucketName}/${fileName}`);
const detections = result.textAnnotations;
console.log('Text:');
detections.forEach(text => console.log(text));
For me, the only thing that worked was uploading the image to Google Cloud Storage and passing its URI in the request parameters.
In my case, I tried retrieving an image used by Cloudinary our main image hosting provider.
When I accessed the same image but hosted on our secondary Rackspace powered CDN, Google OCR was able to access the image.
Not sure why Cloudinary didn't work when I was able to access the image via my web browser, but just my little workaround situation.
I believe the error is caused by the Cloud Vision API refusing to download images on a domain whose robots.txt file blocks Googlebot or Googlebot-Image.
The workaround that others mentioned is in fact the proper solution: download the images yourself and either pass them in the image.content field or upload them to Google Cloud Storage and use the image.source.gcsImageUri field.
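For the REST request in the question, that means swapping source.imageUri for a base64 content field, roughly like this (the base64 string below is a truncated placeholder):

{
  "requests": [
    {
      "image": {
        "content": "/9j/4AAQSkZJRg...base64-encoded-image-bytes..."
      },
      "features": [
        {
          "type": "TEXT_DETECTION"
        }
      ],
      "imageContext": {
        "languageHints": [
          "zh"
        ]
      }
    }
  ]
}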
For me, I resolved this issue by passing the gs:// URI (e.g. gs://bucketname/filename.jpg) instead of the public URL or authenticated URL.
const vision = require('@google-cloud/vision');

function uploadToGoogleCloudlist(req, res, next) {
  const originalfilename = req.file.originalname;
  const bucketname = "yourbucketname";
  const imageURI = "gs://" + bucketname + "/" + originalfilename;
  const client = new vision.ImageAnnotatorClient(
    {
      projectId: 'yourprojectid',
      keyFilename: './router/fb/yourprojectid-firebase.json'
    }
  );
  var visionjson;

  async function getimageannotation() {
    const [result] = await client.imageProperties(imageURI);
    visionjson = result;
    console.log("vision result: " + JSON.stringify(visionjson));
    return visionjson;
  }

  getimageannotation().then(function (result) {
    var datatoup = {
      url: imageURI || ' ',
      filename: originalfilename || ' ',
      available: true,
      vision: result,
    };
  })
  .catch(err => {
    console.error('ERROR CODE:', err);
  });

  next();
}
I faced the same issue several days ago.
In my case the problem happened because we were using queues and sending API requests at the same time from the same IP. After changing the number of parallel processes from 8 to 1, the share of such errors dropped from ~30% to less than 1%.
Maybe it will help somebody. I think there are some internal limits on Google's side for loading remote images (because, as people have reported, using Google Storage also solves the problem).
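If it helps, a minimal sketch of what "one at a time" looks like with the Node client used elsewhere in this thread (my own illustration, not the code we actually ran):

async function annotateSequentially(client, imageUris) {
  const results = [];
  for (const uri of imageUris) {
    // One request at a time instead of firing them all in parallel.
    const [result] = await client.annotateImage({
      image: { source: { imageUri: uri } },
      features: [{ type: 'TEXT_DETECTION' }],
    });
    results.push(result);
  }
  return results;
}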
My hypothesis is that an overall (short) timeout exists on the Google API side which limits the number of files that can actually be retrieved.
Sending 16 images for batch labeling is possible, but only 5 or 6 will be labelled, because the origin web server hosting the images was unable to return all 16 files within <Google-Timeout> milliseconds.
In my case, the image URI that I was specifying in the request pointed at a large image, roughly 4000 px x 6000 px. When I changed it to a smaller version of the image, the request succeeded.
The very same request works for me. It is possible that the image host was temporarily down and/or had issues on their side. If you retry the request it will most likely work for you.