How to Compose Query in BigQuery with Destination Table? - python-2.7

I am trying to query a BQ table and load the queried data into a destination table, using legacy SQL.
Code:
bigquery_client = bigquery.Client.from_service_account_json(config.ZF_FILE)
job_config = bigquery.QueryJobConfig()
job_config.use_legacy_sql = True
# Allow for query results larger than the maximum response size
job_config.allow_large_results = True
# When large results are allowed, a destination table must be set.
dest_dataset_ref = bigquery_client.dataset('datasetId')
dest_table_ref = dest_dataset_ref.table('datasetId:mydestTable')
job_config.destination = dest_table_ref
query =""" SELECT abc FROM [{0}] LIMIT 10 """.format(mySourcetable_name)
# run the Query here now
query_job = bigquery_client.query(query, job_config=job_config)
Error:
google.api_core.exceptions.BadRequest: 400 POST : Invalid dataset ID "datasetId:mydestTable". Dataset IDs must be alphanumeric (plus underscores, dashes, and colons) and must be at most 1024 characters long.
The job_config.destination gives :
print job_config.destination
TableReference(u'projectName', 'projectName:dataset', 'projectName:dataset.mydest_table')
The datasetId is correct on my side, so why do I still get the error?
How do I set the destination table properly?

This may be helpful to someone in the future.
It worked once I passed only the bare names of the dataset and table, instead of their full IDs, as below:
dest_dataset_ref = bigquery_client.dataset('dataset_name')
dest_table_ref = dest_dataset_ref.table('mydestTable_name')
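If the destination arrives as a full legacy-style ID such as projectName:dataset.mydest_table, the bare names can be split out before they are passed to dataset() and table(). This is a minimal sketch; the helper name split_legacy_table_id is hypothetical and not part of the BigQuery client library.

```python
def split_legacy_table_id(full_id):
    """Split 'project:dataset.table' (or 'dataset.table') into its parts.

    Hypothetical helper, not part of google-cloud-bigquery.
    Returns (project, dataset, table); project is None when absent.
    """
    # Split on the last colon so that the project part may itself contain one.
    project, _, rest = full_id.rpartition(":")
    dataset, dot, table = rest.partition(".")
    if not dot or not dataset or not table:
        raise ValueError("expected [project:]dataset.table, got %r" % full_id)
    return (project or None), dataset, table


project, dataset, table = split_legacy_table_id("projectName:dataset.mydest_table")
# The bare names are what the answer above passes to the client:
#   dest_table_ref = bigquery_client.dataset(dataset).table(table)
```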

Related

Not able to get all the columns while using group by in Pandas df

controller.py
def consolidated_universities_data_by_country(countries,universities):
    cursor = connection.cursor()
    query = None
    if countries == str(1):
        query = f"""
        #sql_query#
        """
        result_data=cursor.execute(query)
        result=dict_fetchall_rows(result_data)
        consolidated_df_USA=pd.DataFrame(result).fillna('NULL').replace( {True : 1, False : 0}).groupby('CourseId')['ApplicationDeadline'].apply(', '.join).reset_index()
        return consolidated_df_USA
With this code I get the desired output, i.e. the deadlines from n rows merged into one row for a given CourseId, but I do not get the rest of the columns.
consolidated_df_USA=pd.DataFrame(result).fillna('NULL').replace( {True : 1, False : 0}).groupby('CourseId')['ApplicationDeadline','CourseName'].agg(', '.join).reset_index()
return consolidated_df_USA
With this I get some of the columns, but others are dropped. I also get the warning below.
FutureWarning: Dropping invalid columns in SeriesGroupBy.agg is deprecated. In a future version, a TypeError will be raised. Before calling .agg, select only columns which should be valid for the aggregating function.
How do I get all the columns returned by the SQL query?

Snowflake table is not accepting null values in date field

I have one table in snowflake, I am performing bulk load using.
One of the columns in the table is a date, but the source table, which is on SQL Server, has null values in that date column.
The flow of data is:
sql_server-->S3 buckets -->snowflake_table
I am able to run the Sqoop job in EMR, but I cannot load the data into the Snowflake table because it does not accept null values in the date column.
The error is :
Date '' is not recognized File 'schema_name/table_name/file1', line 2, character 18 Row 2,
column "table_name"["column_name":5] If you would like to continue loading when an error is
encountered, use other values such as 'SKIP_FILE' or 'CONTINUE' for the ON_ERROR option.
Can anyone help me find what I am missing?
Using the command below, you can see the values in the staged file:
select t.$1, t.$2 from @mystage1 (file_format => 'myformat') t;
Based on the data, you can change your COPY command as below:
COPY INTO my_table(col1, col2, col3) from (select $1, $2, try_to_date($3) from @mystage1)
file_format=(type = csv FIELD_DELIMITER = '\u00EA' SKIP_HEADER = 1 NULL_IF = ('') ERROR_ON_COLUMN_COUNT_MISMATCH = false EMPTY_FIELD_AS_NULL = TRUE)
on_error='continue'
The error shows that the dates are not arriving as nulls. Rather, they're arriving as blank strings. You can address this a few different ways.
The cleanest way is to use the TRY_TO_DATE function on that column in your COPY INTO statement. This function returns NULL when it tries to convert a blank string into a date:
https://docs.snowflake.com/en/sql-reference/functions/try_to_date.html#try-to-date
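The behaviour the answer relies on (a blank or unparseable string becomes null instead of raising an error) can be illustrated outside Snowflake. The sketch below is a Python analogue only, not Snowflake code; the fixed format string is an assumption, and Snowflake's own format detection works differently.

```python
from datetime import date, datetime


def try_to_date(value, fmt="%Y-%m-%d"):
    """Return a date, or None when the text cannot be parsed.

    Python analogue of the idea behind TRY_TO_DATE; the default
    format is an assumption made for this illustration.
    """
    try:
        return datetime.strptime(value.strip(), fmt).date()
    except (ValueError, AttributeError):
        # ValueError: blank or malformed text; AttributeError: value is None
        return None


parsed = [try_to_date(v) for v in ["2021-06-16", "", None]]
```

Here the blank string and the None entry both come back as None, which is what lets the load continue with a null date for those rows.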

Expression Error Key didn't Match Any Rows

I am trying to get today's date and format it as yyMMdd because my table name changes daily, e.g. MICRINFO210616 today and MICRINFO210617 tomorrow.
When I run the code below I get the following error:
Expression.Error: The key didn't match any rows in the table.
Key=
Schema=dbo
Item=MICRINFO210617
Table=[Table]
code:
let
    Source = Sql.Database("TEST", "TEST"),
    formattedDate = Date.ToText(DateTime.Date(DateTime.LocalNow()), "yyMMdd"),
    combine = "MICRINFO" & formattedDate,
    dbo_MICRINFO210616 = Source{[Schema="dbo", Item=combine]}[Data]
in
    dbo_MICRINFO210616
Make sure the account you're using has at least read permissions (to the new table).
Check if the structure of both tables is the same (same number of columns, same datatype).
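For reference, the name-building step from the question can be reproduced in Python with a fixed date. The error message already shows Item=MICRINFO210617, which suggests the name itself was built as intended and that the lookup failed for a different reason, such as the table not being visible to the account. The function name daily_table_name is hypothetical.

```python
from datetime import date


def daily_table_name(day, prefix="MICRINFO"):
    """Build the daily table name: prefix followed by the date as yyMMdd."""
    return prefix + day.strftime("%y%m%d")


name = daily_table_name(date(2021, 6, 17))
```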

Find number of objects inside an Item of DynamoDB table using Lambda function (Python/Node)

I am new to the AWS world and I need to get a count of data from a DynamoDB table.
My table structure is like this.
It has 2 attributes (columns in MySQL terms), say A and B.
A - stores the user IDs (primary partition key).
B - stores the user profiles; the number of profiles associated with a user ID varies.
Suppose A contains a user ID 3435 and it has 3 profiles ({"21btet3","3sd4","adf11"})
My requirement is to get the count 3 to the output as a JSON in the format :
How do I set the parameters for this scan?
Can anyone please help?
DynamoDB is NoSQL, so there are some limitations on how you can query the data. In your case you have to scan the entire table, as below:
import boto3

def ScanDynamoData(lastEvaluatedKey):
    # Add your region and table name
    table = boto3.resource("dynamodb", "eu-west-1").Table('TableName')
    if lastEvaluatedKey:
        # Continue the scan from where the previous page stopped
        return table.scan(
            ExclusiveStartKey=lastEvaluatedKey
        )
    else:
        return table.scan()
Call this method in a loop until the response no longer contains LastEvaluatedKey (to scan all the records), like this:
response = ScanDynamoData(None)
totalUserIds = response["Count"]
# response["Items"] holds the items of this page; count user IDs and profiles here
while "LastEvaluatedKey" in response:
    response = ScanDynamoData(response["LastEvaluatedKey"])
    totalUserIds += response["Count"]
    # Add the counts for this page here as well
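The "count user IDs and profiles" step left open in the comments above could look like the sketch below. It assumes each scanned item is a dict with the user ID under A and the profiles under B as a set or list, which is how the boto3 resource API returns string sets; count_profiles is a hypothetical name.

```python
def count_profiles(items, user_key="A", profiles_key="B"):
    """Map each user ID to the number of profiles stored on its item.

    Assumes items shaped like {"A": 3435, "B": {"21btet3", "3sd4", "adf11"}}.
    """
    counts = {}
    for item in items:
        # A missing or empty profiles attribute counts as zero profiles.
        profiles = item.get(profiles_key) or ()
        counts[item[user_key]] = len(profiles)
    return counts


page_items = [{"A": 3435, "B": {"21btet3", "3sd4", "adf11"}}, {"A": 7}]
counts = count_profiles(page_items)
```

Calling it on response["Items"] for each page and merging the dictionaries would give the per-user counts for the whole table.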
You should not do a full table scan on a regular basis.
If your requirement is to get this count frequently, you should subscribe a Lambda function to DynamoDB Streams and update the count as and when new records are inserted into DynamoDB. This ensures that:
you pay less
you do not have to scan the table to calculate this number.
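A stream-triggered handler along those lines might be sketched as follows. It assumes the stream is configured to include the new image of each item, that A is stored as a number or string, and that B is a string set; where the counts are persisted is left out, and both function names are hypothetical.

```python
def profile_counts_from_stream(event, user_key="A", profiles_key="B"):
    """Extract {user_id: profile_count} from a DynamoDB Streams event.

    Assumes the stream view type includes NewImage and that the
    profiles attribute is a string set ("SS").
    """
    counts = {}
    for record in event.get("Records", []):
        if record.get("eventName") not in ("INSERT", "MODIFY"):
            continue  # REMOVE records carry no new image
        image = record.get("dynamodb", {}).get("NewImage", {})
        user_attr = image.get(user_key, {})
        # Stream images use DynamoDB JSON, so numbers arrive as strings.
        user_id = user_attr.get("N") or user_attr.get("S")
        if user_id is None:
            continue
        counts[user_id] = len(image.get(profiles_key, {}).get("SS", []))
    return counts


def lambda_handler(event, context):
    counts = profile_counts_from_stream(event)
    # Writing the counts to a summary table or cache would go here (not shown).
    return counts
```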

(AWS) Athena: Query Results seem too short

My Athena queries appear to return too few results, and I am trying to figure out why.
Setup:
Glue Catalogs (118.6 Gig in size).
Data: Stored in S3 in both CSV and JSON format.
Athena Query: When I query data for a whole table, I only get 40K results per query. There should be about 121 million records on average for one month's data.
Does Athena cap query result data? Is this a service limit? (The documentation does not suggest this to be the case.)
So, getting 1000 results at a time obviously doesn't scale. Thankfully, there's a simple workaround. (Or maybe this is how it was supposed to be done all along.)
When you run an Athena query, you should get a QueryExecutionId. This Id corresponds to the output file you'll find in S3.
Here's a snippet I wrote:
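import time
from typing import Dict

import boto3
import pandas as pd

# The original snippet omitted the enclosing def; this function name is illustrative.
def run_athena_query(query: str) -> pd.DataFrame:
    s3 = boto3.resource("s3")
    athena = boto3.client("athena")
    response: Dict = athena.start_query_execution(QueryString=query, WorkGroup="<your_work_group>")
    execution_id: str = response["QueryExecutionId"]
    print(execution_id)
    # Wait until the query is finished; stop waiting if it fails or is cancelled
    while True:
        state = athena.get_query_execution(QueryExecutionId=execution_id)["QueryExecution"]["Status"]["State"]
        if state == "SUCCEEDED":
            break
        if state in ("FAILED", "CANCELLED"):
            raise RuntimeError(f"Athena query {execution_id} ended in state {state}")
        time.sleep(5)
    # Athena writes the result as <QueryExecutionId>.csv in the query result location
    local_filename: str = "temp/athena_query_result_temp.csv"
    s3.Bucket("athena-query-output").download_file(execution_id + ".csv", local_filename)
    return pd.read_csv(local_filename)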
Make sure the corresponding WorkGroup has "Query result location" set, e.g. "s3://athena-query-output/"
Also see this thread with similar answers: How to Create Dataframe from AWS Athena using Boto3 get_query_results method
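Rather than hard-coding the bucket name, the result file's location can be read from the OutputLocation field that get_query_execution returns under QueryExecution, ResultConfiguration, and then split into bucket and key. The sketch below covers only the splitting step; split_s3_uri is a hypothetical helper name.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/some/key.csv' into ('bucket', 'some/key.csv')."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError("not an S3 URI: %r" % uri)
    return parsed.netloc, parsed.path.lstrip("/")


bucket, key = split_s3_uri("s3://athena-query-output/abc-123.csv")
# bucket and key can then be passed to s3.Bucket(bucket).download_file(key, local_filename)
```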
It seems that there is a limit of 1000.
You should use NextToken to iterate over the results.
Quoting the GetQueryResults documentation:
MaxResults: The maximum number of results (rows) to return in this request.
Type: Integer
Valid Range: Minimum value of 0. Maximum value of 1000.
Required: No
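Iterating with NextToken, as suggested above, can be sketched like this. The function takes any client object exposing get_query_results (for example a boto3 Athena client); the name fetch_all_rows is hypothetical.

```python
def fetch_all_rows(athena_client, query_execution_id, page_size=1000):
    """Collect every row of a finished query by following NextToken."""
    rows = []
    kwargs = {"QueryExecutionId": query_execution_id, "MaxResults": page_size}
    while True:
        page = athena_client.get_query_results(**kwargs)
        rows.extend(page["ResultSet"]["Rows"])
        token = page.get("NextToken")
        if not token:
            return rows
        # Ask for the page that follows the one just received.
        kwargs["NextToken"] = token
```

For a SELECT query, the first row of the first page holds the column names, so callers will usually want to treat rows[0] as the header. For very large results, downloading the CSV from S3 is likely to be faster than paging 1000 rows at a time.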
Another option is a paginate-and-count approach.
I don't know whether there is a better way to do it, such as select count(*) from table.
Here is a complete example using the Python boto3 Athena API.
I used a paginator, converted the result to a list of dicts, and returned the count along with the result.
Below are 2 methods:
The first one paginates.
The second one converts the paginated result to a list of dicts and calculates the count.
Note: converting to a list of dicts is not necessary in this case. If you don't want it, you can modify the code to return only the count.
import time

def get_athena_results_paginator(params, athena_client):
    """
    Run the query in params['query'] and page through its results.

    :param params: dict with 'query', 'database' and 'workgroup' keys
    :param athena_client: boto3 Athena client
    :return: (results, count)
    """
    query_id = athena_client.start_query_execution(
        QueryString=params['query'],
        QueryExecutionContext={
            'Database': params['database']
        }
        # ,
        # ResultConfiguration={
        #     'OutputLocation': 's3://' + params['bucket'] + '/' + params['path']
        # }
        , WorkGroup=params['workgroup']
    )['QueryExecutionId']
    # Poll until the query leaves the QUEUED/RUNNING states
    query_status = None
    while query_status == 'QUEUED' or query_status == 'RUNNING' or query_status is None:
        query_status = athena_client.get_query_execution(QueryExecutionId=query_id)['QueryExecution']['Status']['State']
        if query_status == 'FAILED' or query_status == 'CANCELLED':
            raise Exception('Athena query with the string "{}" failed or was cancelled'.format(params.get('query')))
        time.sleep(10)
    results_paginator = athena_client.get_paginator('get_query_results')
    results_iter = results_paginator.paginate(
        QueryExecutionId=query_id,
        PaginationConfig={
            'PageSize': 1000
        }
    )
    count, results = result_to_list_of_dict(results_iter)
    return results, count
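def result_to_list_of_dict(results_iter):
    """
    Convert paginated Athena results to a list of dicts.

    :param results_iter: page iterator from the get_query_results paginator
    :return: (count, results), where count is the number of data rows
    """
    results = []
    column_names = None
    count = 0
    # The original printed len(list(results_iter)) here, which re-iterated
    # the paginator on every page; that line has been removed.
    for results_page in results_iter:
        for row in results_page['ResultSet']['Rows']:
            column_values = [col.get('VarCharValue', None) for col in row['Data']]
            if not column_names:
                # The first row of the first page holds the column names
                column_names = column_values
            else:
                count = count + 1  # count data rows only, not the header
                results.append(dict(zip(column_names, column_values)))
    return count, results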