Is there a way to pass credentials programmatically for using google documentAI without reading from a disk? - google-cloud-platform

I am trying to run the demo code given for PDF parsing with GCP Document AI. Exporting the Google credentials environment variable on the command line works fine for running the code. The problem comes when the code needs to run entirely in memory, so no credential file can be read from disk. Is there a way to pass the credentials directly to the Document AI parsing function?
The sample code of google:
def main(project_id='YOUR_PROJECT_ID',
         input_uri='gs://cloud-samples-data/documentai/invoice.pdf'):
    """Process a single document with the Document AI API, including
    text extraction and entity extraction."""
    client = documentai.DocumentUnderstandingServiceClient()
    gcs_source = documentai.types.GcsSource(uri=input_uri)
    # mime_type can be application/pdf, image/tiff,
    # image/gif, or application/json
    input_config = documentai.types.InputConfig(
        gcs_source=gcs_source, mime_type='application/pdf')
    # Location can be 'us' or 'eu'
    parent = 'projects/{}/locations/us'.format(project_id)
    request = documentai.types.ProcessDocumentRequest(
        parent=parent,
        input_config=input_config)
    document = client.process_document(request=request)
    # All text extracted from the document
    print('Document Text: {}'.format(document.text))

    def _get_text(el):
        """Convert text offset indexes into text snippets."""
        response = ''
        # If a text segment spans several lines, it will
        # be stored in different text segments.
        for segment in el.text_anchor.text_segments:
            start_index = segment.start_index
            end_index = segment.end_index
            response += document.text[start_index:end_index]
        return response

    for entity in document.entities:
        print('Entity type: {}'.format(entity.type))
        print('Text: {}'.format(_get_text(entity)))
        print('Mention text: {}\n'.format(entity.mention_text))

When you run your workloads on GCP, you don't need a service account key file. In fact, you shouldn't use one!
Why? Two reasons:
It's unnecessary, because every GCP product has at least a default service account, and most of the time you can customize it. You can have a look at the Cloud Function identity in your case.
A service account key file is a file. That means a lot: you can copy it, send it by email, commit it to a Git repository... many people can end up with access to it and you lose control of this secret. And because it is a secret, you have to store it securely and rotate it regularly (at least every 90 days, per Google's recommendation). It's a nightmare! When you can, don't use service account key files!
What do the client libraries do?
They check whether the GOOGLE_APPLICATION_CREDENTIALS env var exists.
They look in the "well known" location (when you perform a gcloud auth application-default login to allow a local application to use your credentials to access Google resources, a file is created in a standard location on your computer).
If not, they check whether the metadata server exists (only on GCP). This server provides authentication information to the libraries.
Otherwise, they raise an error.
So, simply run your function with the correct service account and grant it the role required to achieve what you want to do.
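If the credentials really must be supplied in memory (for example, a key JSON string fetched from a secret store rather than read from disk), the client constructor also accepts an explicit credentials object. A minimal sketch, assuming the beta documentai_v1beta2 import that the sample above appears to rely on and a placeholder key_json_string variable:
import json
from google.oauth2 import service_account
# Same beta Document AI import the sample above relies on; adjust if your
# installed library version differs.
from google.cloud import documentai_v1beta2 as documentai

# Preferred on GCP: no key file at all. Application Default Credentials pick
# up the identity of the attached (default) service account automatically.
client = documentai.DocumentUnderstandingServiceClient()

# Only if credentials must come from memory (key_json_string is a placeholder
# for key content you already hold, e.g. fetched from a secret store):
info = json.loads(key_json_string)
credentials = service_account.Credentials.from_service_account_info(info)
client = documentai.DocumentUnderstandingServiceClient(credentials=credentials)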

Related

How to get google cloud project number programmaticaly?

I want to use Google Secret Manager in my project. To access a saved secret it is necessary to provide a secret name, which contains the Google project number. It would be convenient to get this number programmatically to form the secret name instead of saving it in an environment variable. I use the Node.js runtime for my project. I know there is a library, google-auth-library, which allows getting the project ID. Is it possible to get the project number somehow?
You can access secrets by project_id or project_number. The following are both valid resource IDs that point to the same secret:
projects/my-project/secrets/my-secret
projects/1234567890/secrets/my-secret
You can get metadata, including project_id and project_number from the metadata service. There are many default values. The ones you're looking for are numeric-project-id and project-id.
Here is an example using curl to access the metadata service. You would run this inside your workload, typically during initial boot:
curl "https://metadata.google.internal/computeMetadata/v1/project/project-id" \
--header "Metadata-Flavor: Google"
Note: the Metadata-Flavor: Google header is required.
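The same call works for the project number by requesting numeric-project-id instead of project-id. As a rough Python illustration of that HTTP call (the requests dependency and function names are just for the sketch; it only works from inside a GCP workload where the metadata server is reachable):
import requests

METADATA_URL = 'http://metadata.google.internal/computeMetadata/v1/project/'
HEADERS = {'Metadata-Flavor': 'Google'}  # required header

def get_project_number():
    # 'numeric-project-id' returns the project number
    resp = requests.get(METADATA_URL + 'numeric-project-id', headers=HEADERS)
    resp.raise_for_status()
    return resp.text

def get_project_id():
    # 'project-id' returns the project ID string
    resp = requests.get(METADATA_URL + 'project-id', headers=HEADERS)
    resp.raise_for_status()
    return resp.text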
To access these values from Node, you can construct your own http client. Alternatively, you can use the googleapis/gcp-metadata package:
const gcpMetadata = require('gcp-metadata');

async function projectID() {
  const id = await gcpMetadata.project('project-id');
  return id;
}
You can send a GET request to the Resource Manager API
https://cloudresourcemanager.googleapis.com/v1/projects/PROJECT_ID?alt=json
Not sure if the following method can be useful in your case, but I put it here, just in case:
gcloud projects list --filter="$PROJECT_ID" --format="value(PROJECT_NUMBER)"
It should return the project number based on the project identifier (in the PROJECT_ID variable), under the assumption that the user (or service account) who runs the command has the relevant permissions.
If you're doing this from outside a Cloud VM, so that the metadata service is not available, you can use the Resource Manager API to convert the project name to project number:
const {ProjectsClient} = require('@google-cloud/resource-manager').v3;
const resourcemanagerClient = new ProjectsClient();

let projectId = 'your-project-id-123'; // TODO: replace with your project ID
const [response] = await resourcemanagerClient.getProject({name: `projects/${projectId}`});
let projectNumber = response.name.split('/')[1];

PBI services Odbc.DataSource dynamic dsn name

In the PBI Desktop file there are no errors; the error appears only in the PBI Service on refresh.
ERROR:
Query contains unsupported function. Function name: Odbc.DataSource
Parameter1 ' text parameter holding the value Mydsn
DSN used as a literal string - not dynamic:
= Odbc.DataSource("dsn=Mydsn", [HierarchicalNavigation=true]) ' no error
DSN used via the text parameter - not dynamic:
= Odbc.DataSource("dsn=" & Parameter1, [HierarchicalNavigation=true]) ' no error
Odbc_dsn ' query reading the value from a CSV file
= settings[Column2]{0} ' from csv
DSN taken from the CSV query:
= Odbc.DataSource("dsn=" & Odbc_dsn, [HierarchicalNavigation=true]) ' Query contains unsupported function. Function name: Odbc.DataSource
DSN taken directly from the CSV table:
= Odbc.DataSource("dsn=" & settings[Column2]{0}, [HierarchicalNavigation=true]) ' Query contains unsupported function. Function name: Odbc.DataSource
None of the privacy settings changed anything; I tried all available options (none, private, organizational, public, disabling the privacy settings, etc.).
How can I use an ODBC DSN name that comes from a CSV file?
(Answer to be expanded with additional info provided - see comments on original question)
While I have never imported a DSN name through a CSV, your saying that it works on your local machine makes me accept that this is at least possible so we'll instead focus on issues with the gateway.
My first impression here as to why this might not be working is simply permissions and visibility.
Having worked with a number of PowerBI Service setups, the issue with an unrecognized ODBC DSN usually falls into the following issues:
Is the DSN set up as a system DSN?
Is the gateway set up with a LocalService account or a PowerBI gateway host account?
Does the user the gateway runs under actually have permissions to the directory holding the data source (or custom connector) that the connection depends on?
So:
Fairly straightforward: all gateway-accessible ODBC sources need to be set up on the gateway host as system DSNs, not user DSNs. Check this in the ODBC Data Source Administrator on the gateway host.
Confirm the On-Premises Gateway "Log on" user on the gateway's host machine. Generally I recommend going to Windows Services and making sure the service uses the "Local System account" (to inherit permissions), but just keep this in mind during the next step of checking local permissions.
This applies to anything which is "self-hosted" on the local machine that is the gateway host: whichever account hosts the PowerBI gateway service must also be given explicit permissions to the local resources needed. For example, if you add a custom connector to the Documents directory on the gateway host under your user account, make sure the PowerBI default user has access to that directory and file, i.e. File properties -> Security -> user permissions, etc.
In my experience, 9/10 times one of these things isn't setup right.
Additional note: every time you upgrade or re-install a PowerBI gateway host, you will have to change the service login account and double-check all permissions. I don't know why, but the installer overwrites that setting by default, disabling all refreshes until it is restored.
Edit:
After further thinking, I believe you will eventually run into a roadblock regardless: the PowerBI Service's gateway data source mappings are 1-to-1. After upload, the dataset settings screen requires that the data source has already been defined in the PowerBI Service's settings.
I don't believe it is currently possible to make that definition a variably composed string per user request.
The DSN name can only be a static string.

How to sign gcs blob from the dataflow worker

My Beam Dataflow job succeeds locally (with DirectRunner) but fails in the cloud (with DataflowRunner).
The issue is localized to this code snippet:
class SomeDoFn(beam.DoFn):
    ...
    def process(self, gcs_blob_path):
        gcs_client = storage.Client()
        bucket = gcs_client.get_bucket(BUCKET_NAME)
        blob = Blob(gcs_blob_path, bucket)
        # NEXT LINE IS CAUSING ISSUES! (when run remotely)
        url = blob.generate_signed_url(datetime.timedelta(seconds=300), method='GET')
and Dataflow points to the error: "AttributeError: you need a private key to sign credentials.the credentials you are currently using just contains a token."
My Dataflow job uses a service account (the appropriate service_account_email is provided in the PipelineOptions), but I don't see how I could pass the .json credentials file of that service account to the Dataflow job. I suspect the job runs successfully locally because I set the environment variable GOOGLE_APPLICATION_CREDENTIALS=<path to local file with service account credentials>, but how do I set it similarly for the remote Dataflow workers? Or maybe there is another solution. Any help would be appreciated.
You can see an example here on how to add custom options to your Beam pipeline. With this we can create a --key_file argument that will point to the credentials stored in GCS:
parser.add_argument('--key_file',
                    dest='key_file',
                    required=True,
                    help='Path to service account credentials JSON.')
This will allow you to add the --key_file gs://PATH/TO/CREDENTIALS.json flag when running the job.
Then, you can read it from within the job and pass it as a side input to the DoFn that needs to sign the blob. Starting from the example here we create a credentials PCollection to hold the JSON file:
credentials = (p
               | 'Read Credentials from GCS' >> ReadFromText(known_args.key_file))
and we broadcast it to all workers processing the SignFileFn function:
(p
 | 'Read File from GCS' >> beam.Create([known_args.input])
 | 'Sign File' >> beam.ParDo(SignFileFn(), pvalue.AsList(credentials)))
Inside the ParDo, we build the JSON object to initialize the client (using the approach here) and sign the file:
class SignFileFn(beam.DoFn):
    """Signs a GCS file with GCS-stored credentials."""
    def process(self, gcs_blob_path, creds):
        import datetime
        import json
        import logging
        from google.cloud import storage
        from google.oauth2 import service_account

        credentials_json = json.loads('\n'.join(creds))
        credentials = service_account.Credentials.from_service_account_info(credentials_json)
        gcs_client = storage.Client(credentials=credentials)
        # gcs_blob_path looks like gs://bucket/path/to/file
        bucket = gcs_client.get_bucket(gcs_blob_path.split('/')[2])
        blob = bucket.blob('/'.join(gcs_blob_path.split('/')[3:]))
        url = blob.generate_signed_url(datetime.timedelta(seconds=300), method='GET')
        logging.info(url)
        yield url
See full code here
You will need to provide the service account JSON key similarly to what you are doing locally using the env variable GOOGLE_APPLICATION_CREDENTIALS.
To do so you can follow a few of the approaches mentioned in the answers to this question, such as passing it using PipelineOptions.
However, keep in mind that the safest way is to store the JSON key in, say, a GCS bucket and read the file from there.
The easy but less safe workaround is to get the key, open it, and build a JSON object from its contents in your code so it can be passed along later.
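A rough sketch of that workaround, assuming the key content has already been read into a key_json_string variable (that variable and the bucket and object names below are placeholders; keeping key material in memory or in pipeline options is exactly what makes this approach less safe):
import datetime
import json

from google.cloud import storage
from google.oauth2 import service_account

# key_json_string is a placeholder for the service account key JSON you
# already hold in memory (e.g. read from a pipeline option or from GCS).
info = json.loads(key_json_string)
credentials = service_account.Credentials.from_service_account_info(info)

# A client built from these credentials has a private key available, so
# generate_signed_url no longer fails with the AttributeError above.
gcs_client = storage.Client(project=info['project_id'], credentials=credentials)
bucket = gcs_client.get_bucket('my-bucket')      # placeholder bucket name
blob = bucket.blob('path/to/file.pdf')           # placeholder object path
url = blob.generate_signed_url(datetime.timedelta(seconds=300), method='GET')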

How can I grant individual permissions in Google Cloud Platform for BigQuery users using python

I need to set up very fine-grained access control for user accounts in GCP using a Python script.
I know that via the UI/gcloud util I can give it the role roles/bigquery.user, but that role has a lot of other permissions I don't want this service account to have.
How can I grant individual permissions via a Python script?
Go to your BigQuery console, click the arrow at the right of a dataset and then click Share dataset.
Then add the e-mail address of the user.
You can choose one of the 3 available roles: Viewer/Owner/Editor.
Do this in every dataset for every user.
Update: how to do it via a Python script
You can do it with a Python script following this small tutorial.
The code will be something like:
from google.cloud import bigquery

client = bigquery.Client()
dataset = client.get_dataset(client.dataset('dataset1'))

entry = bigquery.AccessEntry(
    role='READER',
    entity_type='userByEmail',
    entity_id='user1@example.com')
assert entry not in dataset.access_entries

entries = list(dataset.access_entries)
entries.append(entry)
dataset.access_entries = entries

dataset = client.update_dataset(dataset, ['access_entries'])  # API request
# assert entry in dataset.access_entries

Specify Maximum File Size while uploading a file in AWS S3

I am creating temporary credentials via AWS Security Token Service (AWS STS)
and using these credentials to upload a file to S3 from the S3 Java SDK.
I need some way to restrict the size of the uploaded file.
I was trying to add a policy (with s3:content-length-range) while creating the user, but that doesn't seem to work.
Is there any other way to specify the maximum file size a user can upload?
An alternative method would be to generate a pre-signed URL instead of temporary credentials. It will be good for one file with a name you specify. You can also enforce a content-length range when you generate the URL. Your user will get the URL and will have to use a specific method (POST/PUT/etc.) for the request. They set the content while you set everything else.
I'm not sure how to do that with Java (it doesn't seem to have support for conditions), but it's simple with Python and boto3:
import boto3

# Get the service client
s3 = boto3.client('s3')

# Keep the uploaded object private
fields = {"acl": "private"}

# Ensure that the ACL isn't changed and restrict the upload to a content
# length between 10 and 100 bytes.
conditions = [
    {"acl": "private"},
    ["content-length-range", 10, 100]
]

# Generate the POST attributes
post = s3.generate_presigned_post(
    Bucket='bucket-name',
    Key='key-name',
    Fields=fields,
    Conditions=conditions
)
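For completeness, here is a sketch of how the client could then perform the upload with the returned URL and fields, assuming the requests library and a local example.pdf file (both are illustrative):
import requests

# `post` is the dict returned by generate_presigned_post above; it contains
# the target URL and the form fields that must accompany the upload.
with open('example.pdf', 'rb') as f:
    response = requests.post(
        post['url'],
        data=post['fields'],
        files={'file': ('example.pdf', f)},
    )

# S3 answers 204 by default on success; an upload outside the
# 10-100 byte content-length-range is rejected.
print(response.status_code)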
When testing this, make sure every single form field and header matches, or you'll get vague access denied errors. It can take a while to match everything completely.
I believe there is no way to limit the object size before it is uploaded, so you can only react to it afterwards. A workaround would be to create an S3 event notification that triggers your code, through a Lambda function or an SNS topic. That code could, for example, validate or delete the object and notify the user.
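A minimal sketch of such a check as a Python Lambda handler on an s3:ObjectCreated event (the 100-byte limit and the delete-only behaviour are illustrative choices, not part of the original answer):
import boto3
from urllib.parse import unquote_plus

MAX_SIZE_BYTES = 100  # illustrative limit
s3 = boto3.client('s3')

def handler(event, context):
    # Each record describes one newly created object.
    for record in event['Records']:
        bucket = record['s3']['bucket']['name']
        key = unquote_plus(record['s3']['object']['key'])  # keys arrive URL-encoded
        size = record['s3']['object']['size']
        if size > MAX_SIZE_BYTES:
            # Too large: remove the object (a notification to the user
            # could be sent here as well, e.g. via SNS).
            s3.delete_object(Bucket=bucket, Key=key)
            print(f"Deleted {bucket}/{key}: {size} bytes exceeds the limit")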