I am interfacing Elasticsearch with Spark using the Elasticsearch-Hadoop plugin, and I am having difficulty writing a DataFrame with a timestamp-type column to Elasticsearch.
The problem occurs when I try to write using dynamic/multi-resource formatting to create a daily index.
From the relevant documentation I get the impression that this is possible; however, the Python example below fails to run unless I change the column type to date.
import pyspark
from pyspark.sql import SparkSession

conf = pyspark.SparkConf()
conf.set('spark.jars', 'elasticsearch-spark-20_2.11-6.1.2.jar')
conf.set('es.nodes', '127.0.0.1:9200')
conf.set('es.read.metadata', 'true')
conf.set('es.nodes.wan.only', 'true')

# Build the session from the configuration above
spark = SparkSession.builder.config(conf=conf).getOrCreate()

from datetime import datetime, timedelta
now = datetime.now()
before = now - timedelta(days=1)
after = now + timedelta(days=1)

cols = ['idz', 'name', 'time']
vals = [(0, 'maria', before), (1, 'lolis', after)]
time_df = spark.createDataFrame(vals, cols)
When I try to write, I use the following:
time_df.write.mode('append').format(
    'org.elasticsearch.spark.sql'
).options(
    **{'es.write.operation': 'index'}
).save('xxx-{time|yyyy.MM.dd}/1')
Unfortunately, this produces an error:
.... Caused by: java.lang.IllegalArgumentException: Invalid format:
"2018-03-04 12:36:12.949897" is malformed at " 12:36:12.949897" at
org.joda.time.format.DateTimeFormatter.parseDateTime(DateTimeFormatter.java:945)
On the other hand, this works perfectly fine if I use dates when I create my DataFrame:
cols = ['idz', 'name', 'time']
vals = [(0,'maria', before.date()), (1, 'lolis', after.date())]
time_df = spark.createDataFrame(vals, cols)
Is it possible to format a dataframe timestamp to be written to daily indexes with this method, without also keeping a date column around? How about monthly indexes?
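One workaround sketch that occurs to me (hedged, not verified against this exact es-hadoop release): rewrite the timestamp column as an ISO-8601 formatted string before writing, so that the Joda parser behind the {time|yyyy.MM.dd} pattern can handle it; the same idea with a yyyy.MM pattern would give monthly indices.

from pyspark.sql.functions import date_format, col

# Overwrite 'time' with an ISO-8601 string (the 'T' separator is what the default parser expects).
iso_df = time_df.withColumn('time', date_format(col('time'), "yyyy-MM-dd'T'HH:mm:ss.SSS"))

iso_df.write.mode('append').format(
    'org.elasticsearch.spark.sql'
).options(
    **{'es.write.operation': 'index'}
).save('xxx-{time|yyyy.MM.dd}/1')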
PySpark version:
spark version 2.2.1
Using Scala version 2.11.8, OpenJDK 64-Bit Server VM, 1.8.0_151
Elasticsearch version:
number "6.2.2", build_hash "10b1edd", build_date "2018-02-16T19:01:30.685723Z",
build_snapshot false, lucene_version "7.2.1",
minimum_wire_compatibility_version "5.6.0", minimum_index_compatibility_version "5.0.0"
Related
I wrote a query for one of my BigQuery tables, called historical, and I would like to copy the result of this query into a new BigQuery table called historical_recent. I am having difficulty figuring out how to do this operation with Python. Right now, I am able to execute my query and get the expected result:
SELECT * FROM `gcp-sandbox.dailydev.historical`
WHERE (date BETWEEN '2015-11-05 00:00:00' AND '2015-11-07 23:00:00')
I am also able to copy my BigQuery table without making any changes using this script:
from google.cloud import bigquery
client = bigquery.Client()
job = client.copy_table(
    'gcp-sandbox.dailydev.historical',
    'gcp-sandbox.dailydev.historical_copy')
How can I combine both using Python?
You can use an INSERT statement, as in the example below:
INSERT `gcp-sandbox.dailydev.historical_recent`
SELECT *
FROM `gcp-sandbox.dailydev.historical`
WHERE date BETWEEN '2015-11-05 00:00:00' AND '2015-11-07 23:00:00'
Using Python to save your query result.
from google.cloud import bigquery
client = bigquery.Client()
# Target table to save results
table_id = "gcp-sandbox.dailydev.historical_recent"
job_config = bigquery.QueryJobConfig(
    allow_large_results=True,
    destination=table_id,
    use_legacy_sql=False
)

sql = """
    SELECT * FROM `gcp-sandbox.dailydev.historical`
    WHERE (date BETWEEN '2015-11-05 00:00:00' AND '2015-11-07 23:00:00')
"""
query = client.query(sql, job_config=job_config)
query.result()
print("Query results loaded to the table {}".format(table_id))
This example is based on the Google documentation.
I am running the following Cloud Function. It runs successfully and indicates that data was loaded to the table. But when I query BigQuery, no data has been added. I am getting no errors and no indication that it isn't working.
from google.cloud import bigquery
import pandas as pd

def download_data(event, context):
    df = pd.read_csv('https://covid.ourworldindata.org/data/ecdc/full_data.csv')

    # Create an empty list
    Row_list = []

    # Iterate over each row
    for index, rows in df.iterrows():
        # Create list for the current row
        my_list = [rows.date, rows.location, rows.new_cases, rows.new_deaths, rows.total_cases, rows.total_deaths]
        #print(my_list)

    # append the list to the final list
    Row_list.append(my_list)

    ## Get BigQuery set up
    client = bigquery.Client()
    table_id = "<project_name>.raw.daily_load"
    table = client.get_table(table_id)

    print(client)
    print(table_id)
    print(table)

    errors = client.insert_rows(table, Row_list)  # Make an API request.
    if errors == []:
        print("New rows have been added.")
Attempted so far:

- Check data was being pulled -> PASSED, I printed out Row_list and the data is there
- Run locally from my machine -> PASSED, data appeared when I ran it from a Python terminal
- Print out the table details -> PASSED, see attached screenshot, it all appears in the logs
- Confirm it is able to find the table -> PASSED, I changed the name of the table to one that didn't exist and it failed

Not sure what is next; any advice would be greatly appreciated.
Maybe this post in Google Cloud documentation could help.
https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-csv#loading_csv_data_into_a_table
You can directly stream the data from the website to BigQuery using Cloud Functions, but the data should be clean and conform to BigQuery standards, otherwise the insertion will fail. Another point to note is that the dataframe columns must match the table columns for the data to be successfully inserted; I tested this out and saw insertion errors returned by the client when the column names didn't match.
Writing the function
I have created a simple Cloud Function using the documentation and pandas example. The dependencies that need to be included are google-cloud-bigquery and pandas.
main.py
from google.cloud import bigquery
import pandas as pd

def hello_gcs(event, context):
    df = pd.read_csv('https://people.sc.fsu.edu/~jburkardt/data/csv/airtravel.csv')
    # Rename the columns if necessary so they match the BigQuery table
    df.set_axis(["Month", "Year_1", "Year_2", "Year_3"], axis=1, inplace=True)

    table_id = "project.dataset.airtravel"

    ## Get BigQuery set up
    client = bigquery.Client()
    table = client.get_table(table_id)

    errors = client.insert_rows_from_dataframe(table, df)  # Make an API request.
    if all(not chunk for chunk in errors):  # one error list is returned per insert chunk
        print("Data Loaded")
        return "Success"
    else:
        print(errors)
        return "Failed"
requirements.txt
# Function dependencies, for example:
# package>=version
google-cloud-bigquery
pandas
Now you can directly deploy the function.
This assumes that the App Engine default service account has the default Editor role assigned and that you have a very simple schema for the BigQuery table. For example:
Field name      Type      Mode
date            STRING    NULLABLE
location        STRING    NULLABLE
new_cases       INTEGER   NULLABLE
new_deaths      INTEGER   NULLABLE
total_cases     INTEGER   NULLABLE
total_deaths    INTEGER   NULLABLE
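For reference, here is a sketch of creating a table with that schema via the Python client (the project, dataset, and table names below are placeholders, not taken from the original post):

from google.cloud import bigquery

client = bigquery.Client()
schema = [
    bigquery.SchemaField("date", "STRING"),
    bigquery.SchemaField("location", "STRING"),
    bigquery.SchemaField("new_cases", "INTEGER"),
    bigquery.SchemaField("new_deaths", "INTEGER"),
    bigquery.SchemaField("total_cases", "INTEGER"),
    bigquery.SchemaField("total_deaths", "INTEGER"),
]
# Placeholder table ID; replace with your own project, dataset, and table.
table = client.create_table(bigquery.Table("my-project.raw.daily_load", schema=schema))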
The following modification of your code should work for an HTTP-triggered function. Notice that you were not including the Row_list.append(my_list) call inside the for loop to populate your list with the elements, and that, according to the samples in the documentation, you should be using a list of tuples:
from google.cloud import bigquery
import pandas as pd

client = bigquery.Client()
table_id = "[PROJECT-ID].[DATASET].[TABLE]"

def download_data(request):
    df = pd.read_csv('https://covid.ourworldindata.org/data/ecdc/full_data.csv')

    # Create an empty list
    Row_list = []

    # Iterate over each row
    for index, rows in df.iterrows():
        # Create a tuple for the current row
        my_list = (rows.date, rows.location, rows.new_cases, rows.new_deaths, rows.total_cases, rows.total_deaths)
        # Append the tuple to the final list
        Row_list.append(my_list)

    ## Get BigQuery set up
    table = client.get_table(table_id)

    errors = client.insert_rows(table, Row_list)  # Make an API request.
    if errors == []:
        print("New rows have been added.")
With the very simple requirements.txt file:
# Function dependencies, for example:
# package>=version
pandas
google-cloud-bigquery
I have an Airflow DAG on GCP Composer that runs every 5 minutes. I would like to create a BigQuery table that will have the time when the DAG starts to run and a flag identifying whether it is a successful or failed run. For example, if the DAG runs at 2020-03-23 02:30 and the run fails, the BigQuery table will have a time column with 2020-03-23 02:30 and a flag column with 1. If it is a successful run, the table will have a time column with 2020-03-23 02:30 and a flag column with 0. The table should append new rows.
Thanks in advance
You can use the list_dag_runs CLI command to list the DAG runs for a given dag_id. The information returned includes the state of each run.
Another option is retrieving the information via Python code in a few different ways. One such way that I've used in the past is the find method in airflow.models.dagrun.DagRun:
from airflow.models import DagRun

dag_id = 'my_dag'
dag_runs = DagRun.find(dag_id=dag_id)
for dag_run in dag_runs:
    print(dag_run.state)
Finally, use the BigQuery operator to write the DAG information into a BigQuery table. You can find an example of how to use the BigQueryOperator here.
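As a rough sketch of that last step (assuming Airflow 1.10.x on Composer; the task, dataset, and table names are illustrative, and the flag value would come from your own success/failure logic):

from airflow.contrib.operators.bigquery_operator import BigQueryOperator

write_dag_status = BigQueryOperator(
    task_id='write_dag_status',
    # {{ ts }} is the templated run timestamp; the hard-coded 0 flag is only a placeholder.
    sql="SELECT TIMESTAMP('{{ ts }}') AS time, 0 AS flag",
    destination_dataset_table='my_project.my_dataset.dag_status',  # hypothetical target table
    write_disposition='WRITE_APPEND',
    use_legacy_sql=False,
)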
Based on the solution by @Enrique, here is my final solution.
def status_check(**kwargs):
    import pandas as pd
    import pandas_gbq
    from google.cloud import bigquery

    dag_id = 'dag_id'
    dag_runs = DagRun.find(dag_id=dag_id)

    arr = []
    arr1 = []
    for dag_run in dag_runs:
        arr.append(dag_run.state)
        arr1.append(dag_run.execution_date)

    data1 = {'dag_status': arr, 'time': arr1}
    df = pd.DataFrame(data1)

    project_name = "project_name"
    dataset = "Dataset"
    outputBQtableName = '{}'.format(dataset) + '.dag_status_tab'

    df.to_gbq(outputBQtableName, project_id=project_name, if_exists='replace',
              progress_bar=False,
              table_schema=[{'name': 'dag_status', 'type': 'STRING'},
                            {'name': 'time', 'type': 'TIMESTAMP'}])
    return None

Dag_status = PythonOperator(
    task_id='Dag_status',
    python_callable=status_check,
)
I am using
Python 2.7
cx_Oracle 6.0.2
I am doing something like this in my code
import cx_Oracle
connection_string = "%s:%s/%s" % ("192.168.8.168", "1521", "xe")
connection = cx_Oracle.connect("system", "oracle", connection_string)
cur = connection.cursor()
print "Connection Version: {}".format(connection.version)
query = "select *from product_information"
cur.execute(query)
result = cur.fetchone()
print result
I got the following output:
Connection Version: 11.2.0.2.0
(1, u'????????????', 'test')
I am using the following query to create the table in the Oracle database:
CREATE TABLE product_information
( product_id NUMBER(6)
, product_name NVARCHAR2(100)
, product_description VARCHAR2(1000));
I used the following query to insert data
insert into product_information values(2, 'दुःख', 'teting');
Edit 1
Query: SELECT * from NLS_DATABASE_PARAMETERS WHERE parameter IN ( 'NLS_LANGUAGE', 'NLS_TERRITORY', 'NLS_CHARACTERSET');
Result
NLS_LANGUAGE: AMERICAN, NLS_TERRITORY: AMERICA, NLS_CHARACTERSET:
AL32UTF8
I solved the problem.
First, I added NLS_LANG=.AL32UTF8 as an environment variable on the system where Oracle is installed (a Python-only alternative is sketched after the code below).
Second, I passed the encoding and nencoding parameters to the cx_Oracle connect function, as below:
cx_Oracle.connect(username, password, connection_string,
                  encoding="UTF-8", nencoding="UTF-8")
This issue is also discussed at https://github.com/oracle/python-cx_Oracle/issues/157
How do I add a current timestamp (extra column) in the Glue job so that the output data has an extra column? In this case:
Schema of the source table:
Col1, Col2
Schema of the destination, after the Glue job:
Col1, Col2, Update_Date (current timestamp)
We do the following, and it works great without converting to a DataFrame with toDF():
from datetime import datetime

datasource0 = glueContext.create_dynamic_frame.from_catalog(...)

def AddProcessedTime(r):
    r["jobProcessedDateTime"] = datetime.today()  # timestamp of when we ran this
    return r

mapped_dyF = Map.apply(frame=datasource0, f=AddProcessedTime)
I'm not sure if there's a Glue-native way to do this with the DynamicFrame, but you can easily convert to a Spark DataFrame and then use the withColumn method. You will need to use the lit function to put literal values into the new column, as below.
from datetime import datetime
from pyspark.sql.functions import lit
glue_df = glueContext.create_dynamic_frame.from_catalog(...)
spark_df = glue_df.toDF()
spark_df = spark_df.withColumn('some_date', lit(datetime.now()))
Some references:
Glue DynamicFrame toDF()
Spark Dataframe withColumn()
In my experience working with Glue, the timezone where Glue runs is GMT, but my timezone is CDT. So, to get the CDT timezone I need to convert the time within the SparkContext. This specific case is for adding a last_load_date column to the target/sink.
So I created a function:
from datetime import datetime as dt
from pyspark.sql import SQLContext
from pyspark.sql.functions import from_utc_timestamp

def convert_timezone(sc):
    sqlContext = SQLContext(sc)
    local_time = dt.now().strftime('%Y-%m-%d %H:%M:%S')
    local_time_df = sqlContext.createDataFrame([(local_time,)], ['time'])
    CDT_time_df = local_time_df.select(from_utc_timestamp(local_time_df['time'], 'CST6CDT').alias('cdt_time'))
    CDT_time = [i['cdt_time'].strftime('%Y-%m-%d %H:%M:%S') for i in CDT_time_df.collect()][0]
    return CDT_time
And then call the function like ...
job_run_time = date_config.convert_timezone(sc)
datasourceDF0 = datasource0.toDF()
datasourceDF1 = datasourceDF0.withColumn('last_updated_date',lit(job_run_time))
Since I have not seen a proper answer to this issue, I will try to explain my solution to this problem:
First, to clarify: the withColumn function is a good way to do this, but it is important to mention that this function belongs to Spark's own DataFrame and is not part of the Glue DynamicFrame, which is AWS Glue's own library, so you need to convert between the two frames to do this.
The first step is to get the Spark DataFrame from the DynamicFrame; the Glue library does this with the toDF() function. Once you have the Spark frame you can add the column and/or do whatever manipulation you require.
Glue then expects its own frame type back, so we need to transform the Spark DataFrame back into Glue's proprietary frame. To do so you can use the apply function of the DynamicFrame, which requires importing the object:
import com.amazonaws.services.glue.DynamicFrame
and use the glueContext, which you should already have, like:
DynamicFrame(sparkDataFrame, glueContext)
In summary, the code should look like:
import org.apache.spark.sql.functions._
import com.amazonaws.services.glue.DynamicFrame
...
val sparkDataFrame = datasourceToModify.toDF().withColumn("created_date", current_date())
val finalDataFrameForGlue = DynamicFrame(sparkDataFrame, glueContext)
...
Note: the import org.apache.spark.sql.functions._ brings in the current_date() function used to add the date column.
Hope this helps....
Use Spark's current_timestamp() function:
import org.apache.spark.sql.functions._
...
val timestampedDf = source.toDF().withColumn("Update_Date", current_timestamp())
val timestamped = DynamicFrame(timestampedDf, glueContext)
You can now supposedly do this with built-in functionality: see here...
Look for the glueContext.add_ingestion_time_columns section.
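A minimal sketch of that built-in approach, based on my reading of the Glue documentation (the database name, table name, and granularity value below are assumptions):

from awsglue.dynamicframe import DynamicFrame

# Read the source as usual (placeholder names).
datasource0 = glueContext.create_dynamic_frame.from_catalog(
    database="my_database",
    table_name="my_table")

# add_ingestion_time_columns operates on a Spark DataFrame and appends columns such as
# ingest_year, ingest_month, ingest_day, ingest_hour derived from the ingestion time.
timestamped_df = glueContext.add_ingestion_time_columns(datasource0.toDF(), "hour")

# Convert back to a DynamicFrame if the rest of the job expects one.
timestamped_dyf = DynamicFrame.fromDF(timestamped_df, glueContext, "timestamped_dyf")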