I have a BigQuery table filled with product data for a series of clients. The data has been flattened using a query. I want to export the data for each client to a Google Cloud Storage bucket in csv format - so each client has its own individual csv.
There are just over 100 clients, each with a client_id and the table itself is 1GB in size. I've looked into querying the table using a cloud function, but this would cost over 100,000 GB of data. I've also looked at importing the clients to individual tables directly from the source, but I would need to run the flattening query on each - again incurring a high data cost.
Is there a way of doing this that will limit data usage?
Have you thought about Dataproc?
You could write a simple PySpark script that loads the data from BigQuery and writes it to the bucket, split by client_id. Something like this:
"""
File takes 3 arguments:
  BIGQUERY-SOURCE-TABLE
    desc: source table in BigQuery
    format: project.dataset.table (str)
  BUCKET-DEST-FOLDER
    desc: path to the bucket folder where the CSV files will be stored
    format: gs://bucket/folder/ (str)
  SPLITER
    desc: name of the column to split on when saving the data
    format: column-name (str)
"""
import sys

from pyspark.sql import SparkSession

if len(sys.argv) != 4:
    raise Exception("""Usage:
    filename.py BIGQUERY-SOURCE-TABLE BUCKET-DEST-FOLDER SPLITER"""
    )


def main():
    spark = SparkSession.builder.getOrCreate()
    # Read the whole source table through the spark-bigquery connector.
    df = (
        spark.read
        .format("bigquery")
        .load(sys.argv[1])  # BIGQUERY-SOURCE-TABLE (first argument)
    )
    # Write one <column>=<value> folder per distinct value of the split column.
    (
        df
        .write
        .partitionBy(sys.argv[3])  # SPLITER (third argument)
        .format("csv")
        .option("header", True)
        .mode("overwrite")
        .save(sys.argv[2])  # BUCKET-DEST-FOLDER (second argument)
    )


if __name__ == "__main__":
    main()
You will need to:
1. Save this script in a Google Cloud Storage bucket.
2. Create a Dataproc cluster (it is only needed temporarily).
3. Run the command below.
4. Delete the Dataproc cluster.
Let's say your setup is as follows:
bigquery: myproject:mydataset.mytable
bucket: gs://mybucket/
dataproc cluster: my-cluster
So you will need to run the following command:
gcloud dataproc jobs submit pyspark gs://mybucket/script-from-above.py \
--cluster my-cluster \
--region [region-of-cluster] \
--jars gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar \
-- \
myproject:mydataset.mytable gs://mybucket/destination/ client_id
This will save the data in gs://mybucket/destination/, split by client_id, with one folder per client, each holding that client's CSV part files:
client_id=1
client_id=2
...
client_id=n
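Each client_id=... entry is a folder that contains one or more part-*.csv files, so you may want to collect the files per client afterwards. A minimal sketch of that grouping, assuming you already have a listing of object names under the destination folder (the sample names and the helper group_csv_by_client are made up for illustration):

```python
import re
from collections import defaultdict

# Matches the "client_id=<value>/" folder that Spark's partitionBy creates.
PARTITION_RE = re.compile(r"(?:^|/)client_id=([^/]+)/")


def group_csv_by_client(object_names):
    """Map each client_id value to the CSV part files written under its folder."""
    grouped = defaultdict(list)
    for name in object_names:
        match = PARTITION_RE.search(name)
        if match and name.endswith(".csv"):
            grouped[match.group(1)].append(name)
    return dict(grouped)


# Hypothetical listing of gs://mybucket/destination/ after the job has run.
sample = [
    "destination/_SUCCESS",
    "destination/client_id=1/part-00000-abc.csv",
    "destination/client_id=1/part-00001-abc.csv",
    "destination/client_id=2/part-00000-abc.csv",
]
print(group_csv_by_client(sample))
```

Keep in mind that partitionBy moves the partition column into the folder name, so the client_id column does not appear inside the CSV files themselves.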
As mentioned by @Mr.Batra, you can partition your table by client_id to limit the cost and the amount of data queried.
Implementing a Cloud Function that loops over each client ID without partitions will cost more, since each SELECT * FROM table WHERE client_id=xxx query scans the full table.
I'm trying to load a table in a Spark EMR cluster from the Glue catalog, in Apache Iceberg format, that is stored in S3. The table is correctly created because I can query it from AWS Athena. On cluster creation I have set this configuration:
[{"classification":"iceberg-defaults","properties":{"iceberg.enabled":"true"}}]
I have tried running SQL queries from Spark against tables in other formats (CSV) and they work, but when I try to read Iceberg tables I get this error:
org.apache.hadoop.hive.ql.metadata.HiveException: Unable to fetch table table_name. StorageDescriptor#InputFormat cannot be null for table: table_name(Service: null; Status Code: 0; Error Code: null; Request ID: null; Proxy: null)
This is the code in the notebook:
%%configure -f
{
    "conf": {
        "spark.sql.extensions": "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
        "spark.sql.catalog.dev": "org.apache.iceberg.spark.SparkCatalog",
        "spark.sql.catalog.dev.type": "hadoop",
        "spark.sql.catalog.dev.warehouse": "s3://pyramid-streetfiles-sbx/iceberg_test/"
    }
}
from pyspark.conf import SparkConf
from pyspark.sql import SparkSession
import pyspark.sql.functions as f
import pyspark.sql.types as t
spark = SparkSession.builder.getOrCreate()
# This query works and shows the Iceberg table I want to read
spark.sql("show tables from iceberg_test").show(truncate=False)
# Here shows the error
spark.sql("select * from iceberg_test.table_name limit 10").show(truncate=False)
How can I read apache iceberg tables in EMR cluster with Spark and glue catalog?
You need to qualify the table name with the Glue catalog name.
Example: glue_catalog.<your_database_name>.<your_table_name>
https://docs.aws.amazon.com/pt_br/glue/latest/dg/aws-glue-programming-etl-format-iceberg.html
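For the catalog prefix to resolve, the Spark session also needs a catalog of that name backed by Glue. A rough sketch of the settings, assuming the catalog is called glue_catalog and reusing the warehouse path from the question (the helper names are made up; the property keys follow the Iceberg AWS integration docs, so check them against your EMR and Iceberg versions):

```python
import json


def glue_iceberg_conf(catalog_name, warehouse):
    """Build Spark settings for an Iceberg catalog backed by the Glue Data Catalog."""
    prefix = f"spark.sql.catalog.{catalog_name}"
    return {
        "spark.sql.extensions": "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.catalog-impl": "org.apache.iceberg.aws.glue.GlueCatalog",
        f"{prefix}.io-impl": "org.apache.iceberg.aws.s3.S3FileIO",
        f"{prefix}.warehouse": warehouse,
    }


def qualified_table(catalog_name, database, table):
    """Return the three-part name to use in spark.sql queries."""
    return f"{catalog_name}.{database}.{table}"


conf = glue_iceberg_conf("glue_catalog", "s3://pyramid-streetfiles-sbx/iceberg_test/")
# Paste this JSON into the %%configure -f cell of the notebook.
print(json.dumps({"conf": conf}, indent=4))
print(f"select * from {qualified_table('glue_catalog', 'iceberg_test', 'table_name')} limit 10")
```

The printed query is what you would then pass to spark.sql(...) instead of the unqualified iceberg_test.table_name.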
I have AWS Glue Crawler which runs twice a day and populates data in Athena.
Quicksight takes data from Athena and shows it in a dashboard.
I am implementing LastDataRefresh (Datetime) to show in a Quicksight dashboard. Is there a way I can get the last crawler run datetime so that I can store it in an Athena table and show in Quicksight ?
Any other suggestions are also welcome.
TL;DR Extract the crawler last run time from Glue's CloudWatch logs
Glue sends a series of events to CloudWatch during each crawler run. Extract and process the "finished running" logs from /aws-glue/crawlers log group to get the latest for each crawler.
Logs for a single crawler run:
2021-12-15T12:08:54.448+01:00 [7dd..] BENCHMARK : Running Start Crawl for Crawler lorawan_datasets_bucket_crawler
2021-12-15T12:09:12.559+01:00 [7dd..] BENCHMARK : Classification complete, writing results to database jokerman_events_database
2021-12-15T12:09:12.560+01:00 [7dd..] INFO : Crawler configured with SchemaChangePolicy {"UpdateBehavior":"UPDATE_IN_DATABASE","DeleteBehavior":"DEPRECATE_IN_DATABASE"}.
2021-12-15T12:09:27.064+01:00 [7dd..] BENCHMARK : Finished writing to Catalog
2021-12-15T12:12:13.768+01:00 [7dd..] BENCHMARK : Crawler has finished running and is in state READY
Extract and process the BENCHMARK : Crawler has finished running and is in state READY logs:
import boto3
from datetime import datetime, timedelta


def get_last_runs():
    session = boto3.Session(profile_name='sandbox', region_name='us-east-1')
    logs = session.client('logs')
    startTime = datetime.now() - timedelta(days=14)

    # https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/logs.html#CloudWatchLogs.Client.filter_log_events
    # Note: only the first page of results is read here; follow nextToken if there are more.
    filtered_events = logs.filter_log_events(
        logGroupName="/aws-glue/crawlers",
        # https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/FilterAndPatternSyntax.html#matching-terms-events
        filterPattern="BENCHMARK state READY",  # match "BENCHMARK : Crawler has finished running and is in state READY" messages
        startTime=int(startTime.timestamp()*1000)
    )

    # the log stream name is the crawler name; timestamps are in milliseconds
    completed_runs = [
        {"crawler": m.get("logStreamName"), "timestamp": datetime.fromtimestamp(m.get("timestamp")/1000).isoformat()}
        for m in filtered_events["events"]
    ]

    # rework the list to get a dictionary of the last runs by crawler
    crawlers = set([r['crawler'] for r in completed_runs])
    last_runs = dict()
    for n in crawlers:
        last_runs[n] = max([d["timestamp"] for d in completed_runs if d["crawler"] == n])
    print(last_runs)


get_last_runs()
Output:
{
'lorawan_datasets_bucket_crawler': '2021-12-15T12:12:13.768000',
'jokerman_lorawan_events_table_crawler': '2021-12-15T12:12:12.007000'
}
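If you plan to store the result in an Athena table, it may be easier to keep the timestamps in UTC and to compare the raw millisecond values rather than formatted strings. A small sketch of that reduction step on its own, assuming events shaped like the entries in filtered_events["events"] (the function name and the sample events are made up):

```python
from datetime import datetime, timezone


def latest_run_per_crawler(events):
    """Return {crawler name: ISO-8601 UTC time of its most recent finished run}."""
    latest_ms = {}
    for event in events:
        name = event["logStreamName"]   # the log stream is named after the crawler
        ts = event["timestamp"]         # milliseconds since the epoch
        if ts > latest_ms.get(name, -1):
            latest_ms[name] = ts
    return {
        name: datetime.fromtimestamp(ms / 1000, tz=timezone.utc).isoformat()
        for name, ms in latest_ms.items()
    }


# Hypothetical events, in the shape returned by filter_log_events.
sample_events = [
    {"logStreamName": "crawler_a", "timestamp": 1639566733768},
    {"logStreamName": "crawler_a", "timestamp": 1639480333768},
    {"logStreamName": "crawler_b", "timestamp": 1639566732007},
]
print(latest_run_per_crawler(sample_events))
```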
I am new to AWS Glue, and my aim is to extract, transform and load files uploaded to an S3 bucket into an RDS instance. I also need to transfer the files into separate S3 buckets based on the Glue job status (Success/Failure). More than one file will be uploaded to the initial S3 bucket. How can I get the names of the uploaded files so that I can transfer them to the appropriate buckets?
Step 1: Upload files to S3 bucket1.
Step 2: Trigger Lambda function to call Job1
Step 3: On success of job1 transfer file to S3 bucket2
Step 4: On failure transfer to another S3 bucket
Have a Lambda event trigger listening on the S3 folder you are uploading the files to. In the Lambda, use the AWS Glue API to run the Glue job (essentially a Python script in AWS Glue).
In the Glue Python script, use an appropriate library, such as pymysql, packaged with your script as an external library.
Perform the data load operations from S3 to your RDS tables. If you are using Aurora MySQL, AWS provides a nice "load from S3" feature, so you can load the file directly into the tables (you may need to do some configuration in the parameter group / IAM roles).
Lambda script to call glue job:
import boto3

s3 = boto3.client('s3')
glue = boto3.client('glue')


def lambda_handler(event, context):
    gluejobname = "<YOUR GLUE JOB NAME>"
    try:
        # Start the job, then read back its state (it will usually still be starting or running)
        runId = glue.start_job_run(JobName=gluejobname)
        status = glue.get_job_run(JobName=gluejobname, RunId=runId['JobRunId'])
        print("Job Status : ", status['JobRun']['JobRunState'])
    except Exception as e:
        print(e)
        raise e
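To get the names of the uploaded files, you can read them from the S3 event that triggers the Lambda. A minimal sketch, assuming the standard S3 notification event shape; the helper names and the --source_bucket / --source_key argument names are made up for illustration:

```python
from urllib.parse import unquote_plus


def uploaded_objects(event):
    """Return (bucket, key) pairs for every object in an S3 notification event."""
    objects = []
    for record in event.get("Records", []):
        s3_info = record.get("s3", {})
        bucket = s3_info.get("bucket", {}).get("name")
        key = s3_info.get("object", {}).get("key")
        if bucket and key:
            # Keys arrive URL-encoded in the event (e.g. spaces as "+").
            objects.append((bucket, unquote_plus(key)))
    return objects


def job_arguments(bucket, key):
    """Build Glue job arguments; the argument names are hypothetical."""
    return {"--source_bucket": bucket, "--source_key": key}


# Trimmed-down example of what S3 sends to the Lambda.
sample_event = {
    "Records": [
        {"s3": {"bucket": {"name": "bucket1"}, "object": {"key": "incoming/my+file.csv"}}}
    ]
}
for bucket, key in uploaded_objects(sample_event):
    print(job_arguments(bucket, key))
```

You could then pass the dictionary as Arguments=job_arguments(bucket, key) to glue.start_job_run, and read the values inside the Glue script with getResolvedOptions, so the job (or a later step) knows which file to move to the success or failure bucket.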
Glue Script:
import mysql.connector
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from awsglue.transforms import *
from pyspark.sql.types import StringType
from pyspark.sql.types import DateType
from pyspark.sql.functions import unix_timestamp, from_unixtime
from pyspark.sql import SQLContext

# Create a Glue context
glueContext = GlueContext(SparkContext.getOrCreate())

url = "<RDS URL>"
uname = "<USER NAME>"
pwd = "<PASSWORD>"
dbase = "DBNAME"


def connect():
    conn = mysql.connector.connect(host=url, user=uname, password=pwd, database=dbase)
    cur = conn.cursor()
    return cur, conn


def create_stg_table():
    cur, conn = connect()
    # Placeholder: put your CREATE TABLE statement here if a staging table is required
    createStgTable1 = "<CREATE STAGING TABLE IF REQUIRED>"
    loadQry = "LOAD DATA FROM S3 PREFIX 'S3://PATH FOR YOUR CSV' REPLACE INTO TABLE <DB.TABLENAME> FIELDS TERMINATED BY '|' LINES TERMINATED BY '\n' IGNORE 1 LINES (@var1, @var2, @var3, @var4, @var5, @var6, @var7, @var8) SET ......;"
    cur.execute(createStgTable1)
    cur.execute(loadQry)
    conn.commit()
    conn.close()
You can then create a CloudWatch alert that checks the Glue job status and, depending on the status, copies the files between S3 buckets. We have a similar setup in production.
Regards
Yuva
I have about 200k CSVs (all with the same schema). I wrote a Cloud Function to insert them into BigQuery, so that as soon as I copy a CSV to a bucket, the function is executed and the data is loaded into the BigQuery dataset.
I basically used the same code as in the documentation.
import logging

from google.cloud import bigquery

bigquery_client = bigquery.Client()

dataset_id = 'my_dataset'  # replace with your dataset ID
table_id = 'my_table'  # replace with your table ID
table_ref = bigquery_client.dataset(dataset_id).table(table_id)
table = bigquery_client.get_table(table_ref)  # API request


def bigquery_csv(data, context):
    job_config = bigquery.LoadJobConfig()
    job_config.write_disposition = bigquery.WriteDisposition.WRITE_APPEND
    job_config.skip_leading_rows = 1
    # The source format defaults to CSV, so the line below is optional.
    job_config.source_format = bigquery.SourceFormat.CSV
    uri = 'gs://{}/{}'.format(data['bucket'], data['name'])
    # Despite its name, this holds the load job object, not a list of errors
    errors = bigquery_client.load_table_from_uri(uri,
                                                 table_ref,
                                                 job_config=job_config)  # API request
    logging.info(errors)
    #print('Starting job {}'.format(load_job.job_id))
    # load_job.result() # Waits for table load to complete.
    logging.info('Job finished.')
    destination_table = bigquery_client.get_table(table_ref)
    logging.info('Loaded {} rows.'.format(destination_table.num_rows))
However, when I copied all the CSVs to the bucket(which were about 43 TB), not all data was added to BigQuery and only about 500 GB was inserted.
I can't figure what's wrong. No insert jobs are being shown in Stackdriver Logging and no functions are running once the copy job is complete.
You are hitting a BigQuery load limit, as defined in this link.
You should split your files into smaller files, and the upload should then work.
I'm trying to connect to Amazon Redshift via Spark, so I can join data we have on S3 with data on our RS cluster. I found some very spartan documentation here for the capability of connecting to JDBC:
https://spark.apache.org/docs/1.3.1/sql-programming-guide.html#jdbc-to-other-databases
The load command seems fairly straightforward (although I don't know how I would enter AWS credentials here, maybe in the options?).
df = sqlContext.load(source="jdbc", url="jdbc:postgresql:dbserver", dbtable="schema.tablename")
And I'm not entirely sure how to deal with the SPARK_CLASSPATH variable. I'm running Spark locally for now through an iPython notebook (as part of the Spark distribution). Where do I define that so that Spark loads it?
Anyway, for now, when I try running these commands, I get a bunch of undecipherable errors, so I'm kind of stuck for now. Any help or pointers to detailed tutorials are appreciated.
Although this is a very old post, for anyone still looking for an answer, the steps below worked for me!
Start the shell including the jar.
bin/pyspark --driver-class-path /path_to_postgresql-42.1.4.jar --jars /path_to_postgresql-42.1.4.jar
Create a df by giving appropriate details:
myDF = spark.read \
.format("jdbc") \
.option("url", "jdbc:redshift://host:port/db_name") \
.option("dbtable", "table_name") \
.option("user", "user_name") \
.option("password", "password") \
.load()
Spark Version: 2.2
It turns out you only need a username/pwd to access Redshift in Spark, and it is done as follows (using the Python API):
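from pyspark.sql import SQLContext

# sc is the SparkContext provided by the pyspark shell / notebook
sqlContext = SQLContext(sc)
# read.load() takes the data source name as "format"; the other keywords are passed as options
df = sqlContext.read.load(
    format="jdbc",
    url="jdbc:postgresql://host:port/dbserver?user=yourusername&password=secret",
    dbtable="schema.table"
)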
Hope this helps someone!
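If your password contains characters such as & or ?, embedding it in the URL gets awkward. One possible alternative is to pass the credentials as separate JDBC options, as one of the other answers does. A small sketch that only builds the options (the helper name and the sample values are made up):

```python
def redshift_jdbc_options(host, port, database, table, user, password):
    """Build the option dict for Spark's built-in JDBC data source."""
    return {
        "url": f"jdbc:postgresql://{host}:{port}/{database}",
        "dbtable": table,
        # Passed as separate options, so no URL escaping is needed.
        "user": user,
        "password": password,
    }


# Placeholder connection details for illustration only.
opts = redshift_jdbc_options("host", 5439, "dbserver", "schema.table", "yourusername", "s&cret?")
print(opts["url"])
```

With a running session you would then load the table with spark.read.format("jdbc").options(**opts).load().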
If you're using Spark 1.4.0 or newer, check out spark-redshift, a library which supports loading data from Redshift into Spark SQL DataFrames and saving DataFrames back to Redshift. If you're querying large volumes of data, this approach should perform better than JDBC because it will be able to unload and query the data in parallel.
If you still want to use JDBC, check out the new built-in JDBC data source in Spark 1.4+.
Disclosure: I'm one of the authors of spark-redshift.
You first need to download Postgres JDBC driver. You can find it here: https://jdbc.postgresql.org/
You can either define your environment variable SPARK_CLASSPATH in .bashrc, conf/spark-env.sh or similar file or specify it in the script before you run your IPython notebook.
You can also define it in your conf/spark-defaults.conf in the following way:
spark.driver.extraClassPath /path/to/file/postgresql-9.4-1201.jdbc41.jar
Make sure it is reflected in the Environment tab of your Spark WebUI.
You will also need to set appropriate AWS credentials in the following way:
sparkContext.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "***")
sparkContext.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "***")
The simplest way to make a jdbc connection to Redshift using python is as follows:
# -*- coding: utf-8 -*-
from pyspark.sql import SparkSession
jdbc_url = "jdbc:redshift://xxx.xxx.redshift.amazonaws.com:5439/xxx"
jdbc_user = "xxx"
jdbc_password = "xxx"
jdbc_driver = "com.databricks.spark.redshift"
spark = SparkSession.builder.master("yarn") \
.config("hive.metastore.client.factory.class", "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory") \
.enableHiveSupport().getOrCreate()
# Read data from a query
df = spark.read \
.format(jdbc_driver) \
.option("url", jdbc_url + "?user="+ jdbc_user +"&password="+ jdbc_password) \
.option("query", "your query") \
.load()
This worked for me in Scala in AWS Glue with Spark 2.4:
val spark: SparkContext = new SparkContext()
val glueContext: GlueContext = new GlueContext(spark)
Job.init(args("JOB_NAME"), glueContext, args.asJava)
val sqlContext = new org.apache.spark.sql.SQLContext(spark)
val jdbcDF = sqlContext.read.format("jdbc").options(
Map("url" -> "jdbc:postgresql://HOST:PORT/DBNAME?user=USERNAME&password=PASSWORD",
"dbtable" -> "(SELECT a.row_name FROM schema_name.table_name a) as from_redshift")).load()
// back to DynamicFrame
val datasource0 = DynamicFrame(jdbcDF, glueContext)
Works with any SQL query.