I'm utilizing the NGDBC driver (SAP HANA JDBC driver) with an AWS Glue Notebook. I'm using the following line once I include the JAR file to access data from SAP HANA in our environment.
df = glueContext.read.format("jdbc").option("driver", jdbc_driver_name).option("url", db_url).option("dbtable", "KNA1").option("user", db_username).option("password", db_password).load()
In this example, it simply downloads the KNA1 table, but I have yet to see any documentation that tells me how to actually query the SAP HANA instance through these options. I attempted to use a "query" option, but that didn't seem to be available via the JAR.
Am I to understand that I have to simply get entire tables, then query against the DataFrame? That seems expensive and not what I want to do. Maybe someone can provide some insight.
Try it like this:
df = glueContext.read.format("jdbc").option("driver", jdbc_driver_name).option("url", db_url).option("dbtable", "(select name1 from kna1 where kunnr='1111') as name").option("user", db_username).option("password", db_password).load()
i.e. wrap the query in parentheses and give it an alias, as the documentation for the "dbtable" option suggests.
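To make the wrapping less error-prone, here is a minimal sketch of a helper that turns a SELECT statement into a value for "dbtable". The helper names pushdown_table and jdbc_options are hypothetical, not part of Spark, Glue, or the SAP driver. As a side note, Spark 2.4 and later also accept a "query" option on the JDBC source, so whether that is available depends on the Spark version behind your Glue environment.

```python
def pushdown_table(query, alias="t"):
    """Wrap a SELECT statement as "(query) as alias" for the JDBC dbtable option.

    Hypothetical helper: Spark places this string in a FROM clause,
    so the database runs the query and only the result is transferred.
    """
    cleaned = query.strip().rstrip(";")  # a trailing semicolon would break the subquery
    return "({}) as {}".format(cleaned, alias)


def jdbc_options(driver, url, user, password, query, alias="t"):
    """Collect the reader options used in the answer above into one dict."""
    return {
        "driver": driver,
        "url": url,
        "dbtable": pushdown_table(query, alias),
        "user": user,
        "password": password,
    }


# Usage inside the notebook (assumes glueContext and the connection
# variables from the question already exist):
# opts = jdbc_options(jdbc_driver_name, db_url, db_username, db_password,
#                     "select name1 from kna1 where kunnr='1111'", "name")
# df = glueContext.read.format("jdbc").options(**opts).load()
```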
I am trying to figure out how to specify the dataset location in a BigQuery API query using v0.27 of the BigQuery API.
I have a dataset located in northamerica-northeast1 and the BigQuery API is returning 404 errors since this is not the default multi-regional location "US."
I am using the run_async_query method to execute my queries but based on documentation am unsure how to add a location to this field to make it location aware.
I have also tried to previously update my client instantiation like this:
def _get_client(self):
    bigquery.Client.SCOPE = (
        'https://www.googleapis.com/auth/bigquery',
        'https://www.googleapis.com/auth/cloud-platform',
        'https://www.googleapis.com/auth/drive')
    client = bigquery.Client.from_service_account_json(_KEY_FILE)
    if self._params['bq_data_location'].strip():
        client.location = self._params['bq_data_location']
    return client
However, it does not appear that this is the correct way to inform the BigQuery API of a dataset location.
For additional context, in my SQL that I am passing to the BigQuery API, I am already specifying the PROJECT_ID.DATASET_ID.TABLE_ID, however, this does not seem to be sufficient to find regional data.
Furthermore, I am making this request from Google App Engine using the CRMint open source data flow platform.
Can you please help me with an example of how location can be added to the BigQuery API for v0.27 so that the API does not return 404?
Thank you!
From the code sample it seems you're likely talking about google-cloud-bigquery 0.27, which was released in Aug 2017 and predates location support (as well as many other features).
Your best bet is to update that dependency to something more recent.
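For reference, a minimal sketch of what this looks like after upgrading, assuming a recent google-cloud-bigquery release in which Client.query() accepts a location argument. The function name run_query_in_location is hypothetical, and the client is passed in so the snippet does not depend on how you build it.

```python
def run_query_in_location(client, sql, location):
    """Run sql as a query job in the given location and wait for the rows.

    Assumes `client` is a google.cloud.bigquery.Client from a release
    recent enough to support the `location` keyword on query().
    """
    job = client.query(sql, location=location)
    return job.result()  # blocks until the job has finished


# Usage (assumes the upgraded library is installed and _KEY_FILE is the
# service account key path from the question):
# from google.cloud import bigquery
# client = bigquery.Client.from_service_account_json(_KEY_FILE)
# rows = run_query_in_location(
#     client,
#     "SELECT 1",
#     "northamerica-northeast1",
# )
```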
Since our schema is constant, we are using spark.read(), which is much faster than creating a dynamic frame with from_options when the data is stored in S3.
Now we want to read data from the Glue catalog.
Using a dynamic frame takes a lot of time,
so we want to read using the Spark read API:
Dataframe.read.format("").option("url","").option("dbtable",schema.table name).load()
What should go in the format and url options, and is anything else required?
Short answer:
If you read/load the data directly using a SparkSession/SparkContext you'll get a
pure Spark DataFrame instead of a DynamicFrame.
Different options when reading from spark:
format: the source format you are reading from, for example parquet, csv, or json.
load: the path to the source file or files you are reading from; it can be a local path, an S3 path, a Hadoop path, and so on.
options: there are plenty of these, such as inferSchema if you want Spark to do its best to guess the schema from a sample of the data, or header = true for CSV files.
An example:
df = spark.read.format("csv").option("header", True).option("inferSchema", True).load("s3://path")
No DynamicFrame has been created in the previous example, so df will be a DataFrame unless you convert it into a DynamicFrame using the Glue API.
Long answer:
The Glue catalog is essentially AWS's own Hive metastore implementation. You create a Glue catalog by defining a schema, a type of reader, and mappings if required, and it then becomes available to different AWS services such as Glue, Athena, or Redshift Spectrum. The only benefit I see from using Glue catalogs is the integration with those AWS services.
I think you get the most out of data catalogs by using crawlers and the integrations with Athena and Redshift Spectrum, as well as by loading them into Glue jobs through a unified API.
You can always read directly from different sources and formats with the from_options Glue method, so you won't lose any of the great tools Glue has, and the data will still be read as a DynamicFrame.
If for any reason you don't want to read that data through Glue, you can specify a DataFrame schema and read directly with a SparkSession. Keep in mind that you won't have access to bookmarks and other Glue tools, although you can convert that DataFrame into a DynamicFrame.
An example of reading from S3 with Spark directly into a DataFrame (for example in parquet, csv, or json format) would be:
df = spark.read.parquet("s3://path/file.parquet")
df = spark.read.csv("s3a://path/*.csv")
df = spark.read.json("s3a://path/*.json")
That won't create any DynamicFrame unless you want to convert it to it, you'll get a pure Spark DataFrame.
Another way of doing it is using the format() method.
df = spark.read.format("csv").option("header", True).option("inferSchema", True).load("s3://path")
Keep in mind that there are several options, such as "header" or "inferSchema" for CSV, and you'll need to decide whether you want to use them. It is best practice to define the schema explicitly in production environments instead of using inferSchema, although there are use cases for both.
And furthermore you can always convert that pure DataFrame to a DynamicFrame if needed using:
DynamicFrame.fromDF(df, glue_context, ..)
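As for reading a Glue catalog table through the plain Spark API, one possible route is a sketch like the following. It assumes the job or notebook was started with the --enable-glue-datacatalog parameter, so that Spark SQL uses the Glue Data Catalog as its Hive metastore. The function name read_catalog_table and the database and table names in the usage are hypothetical.

```python
def read_catalog_table(spark, database, table):
    """Return a plain Spark DataFrame for a table registered in the catalog.

    Assumes `spark` is a SparkSession whose metastore is the Glue Data
    Catalog; without that setup the table name will not resolve.
    """
    qualified_name = "{}.{}".format(database, table)
    return spark.table(qualified_name)


# Usage inside a Glue job (assumes glue_context already exists):
# spark = glue_context.spark_session
# df = read_catalog_table(spark, "my_database", "my_table")
# df.printSchema()
```

No format or url option is needed in this variant, because the catalog entry already records where the data lives and how it is stored.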
Is there a way to bulk tag bigquery tables with python google.cloud.datacatalog?
If you want to look at sample code that uses the Python google.cloud.datacatalog client library, I've put together an open source utility script that creates Tags in bulk using a CSV as the source. If you want to use a different source, you can use this script as a reference. Hope it helps.
create bulk tags from csv
For this purpose you may consider using the DataCatalogClient class from the google.cloud.datacatalog_v1 module, which is part of the google-cloud-datacatalog package on PyPI and wraps the Google Cloud Data Catalog API.
First, enable the Data Catalog and BigQuery APIs in your project.
Install the Python Cloud Client Library for the Data Catalog API:
pip install --upgrade google-cloud-datacatalog
Set up authentication by exporting the GOOGLE_APPLICATION_CREDENTIALS environment variable, pointing it at the JSON file that contains your service account key:
export GOOGLE_APPLICATION_CREDENTIALS="/home/user/Downloads/[FILE_NAME].json"
Refer to this example from the official documentation, which shows how to create a Data Catalog tag template with the create_tag_template() function and attach a tag with the appropriate fields to the target BigQuery table.
If you have any doubts, feel free to extend your initial question or add a comment below this answer so that we can address your particular use case.
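To give an idea of the bulk part, here is a rough sketch that loops over a list of tables and attaches the same tag to each. It assumes a tag template already exists, that all of its fields are string fields, and that a recent google-cloud-datacatalog release is installed. The function names linked_resource and tag_tables are hypothetical.

```python
def linked_resource(project, dataset, table):
    """Build the resource name Data Catalog uses to look up a BigQuery table."""
    return "//bigquery.googleapis.com/projects/{}/datasets/{}/tables/{}".format(
        project, dataset, table)


def tag_tables(tables, template_name, field_values):
    """Attach one tag per table; `tables` is a list of (project, dataset, table).

    Assumption: every field in the template is a string field, and
    field_values maps field IDs to the string to store.
    """
    # Imported here so the helper above works without the library installed.
    from google.cloud import datacatalog_v1

    client = datacatalog_v1.DataCatalogClient()
    for project, dataset, table in tables:
        entry = client.lookup_entry(
            request={"linked_resource": linked_resource(project, dataset, table)})
        tag = datacatalog_v1.types.Tag()
        tag.template = template_name
        for field_id, value in field_values.items():
            tag.fields[field_id] = datacatalog_v1.types.TagField()
            tag.fields[field_id].string_value = value
        client.create_tag(parent=entry.name, tag=tag)
```

The template name passed in is the full resource name of the template you created in the previous step.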
I want to import CSVs in my s3 bucket into my MySql rds instance using jdbc. It is a one time process and not an ongoing one. Interested in knowing the end to end process.
Since you mentioned it's a one-time activity, I would suggest importing the CSVs into MySQL directly rather than using JDBC, unless you have a specific reason for JDBC that isn't mentioned in the question.
Here is an approach you could use:
loop over your files in S3, then use the following command to import each one into MySQL RDS.
mysqlimport [options] db_name textfile1 [textfile2 ...]
Please refer to the following for more details.
https://dev.mysql.com/doc/en/mysqlimport.html
I hope this gives you a way forward. If I'm missing something, rephrase your question and I can revise the answer.
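As a rough sketch of that loop in Python, assuming the CSV files have already been copied from S3 to a local directory, are comma-separated, and start with a header row. The function names are hypothetical, and the mysqlimport options should be adjusted to match your files.

```python
import os


def table_name_for(csv_path):
    """mysqlimport derives the table name from the file name without its extension."""
    return os.path.splitext(os.path.basename(csv_path))[0]


def build_mysqlimport_cmd(host, user, db_name, csv_path):
    """Build the mysqlimport argument list for one local CSV file."""
    return [
        "mysqlimport",
        "--local",                   # read the file from the client machine
        "--host=" + host,            # the RDS endpoint
        "--user=" + user,
        "--password",                # no value given, so mysqlimport prompts for it
        "--fields-terminated-by=,",  # assumption: comma-separated values
        "--ignore-lines=1",          # assumption: first line is a header
        db_name,
        csv_path,
    ]


# Usage (assumes the target tables already exist and are named after the files):
# import glob, subprocess
# for path in glob.glob("/tmp/csvs/*.csv"):
#     subprocess.run(build_mysqlimport_cmd("my-rds-host", "admin", "mydb", path),
#                    check=True)
```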
I am currently using protege 5.0 and have created a very simple ontology (the pizza example). I was wondering how I would export this ontology to dynamodb on AWS. I was hoping someone could post a link to a good tutorial on protege 5.0 or walk me through this. Thanks!
If you are using dynamodb just to store the content of a file and to be able to access the file at a specific URL, then the process required is just the same as for any other file type you would store on dynamodb. The default way for Protege and most other OWL related tools to access an ontology is a simple HTTP get from a provided IRI.