Django-restframework with pandas - django

I am using django-rest-framework, pandas, and django-pandas to build an API, and I am getting the following output.
This is the data for a single user, and each label represents a column name, but I want the output in the following format.
Can anyone help me get the data in the desired format?
My code is:
views.py
@api_view(['GET'])
def my_view(request, id):
    qs = Health.objects.filter(id=id)
    df = read_frame(qs)
    df['x-Mean'] = abs(df['Age'] - df['Age'].mean())
    df['1.96*std'] = 1.96 * df['Age'].std()
    df['Outlier'] = abs(df['Age'] - df['Age'].mean()) > 1.96 * df['Age'].std()
    df['bmi'] = df['Weight'] / (df['Height'] / 100) ** 2
    a = df.fillna(0)
    return Response(a)

This is happening because a is a pandas.DataFrame, which corresponds to a table, so during serialization it tries to represent all data for each table column. The DataFrame does not know that you have only one value per column.
The values have to be extracted manually:
a = {column: values[0] for column, values in df.fillna(0).to_dict(orient='list').items()}
return Response(a)
For more details check http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_dict.html
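If the queryset is guaranteed to return exactly one row, an equivalent shortcut (a sketch, not part of the original answer) is to take that single row directly:

a = df.fillna(0).to_dict(orient='records')[0]  # list of row dicts; take the only row
return Response(a)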

Related

Not able to get all the columns while using group by in Pandas df

controller.py
def consolidated_universities_data_by_country(countries, universities):
    cursor = connection.cursor()
    query = None
    if countries == str(1):
        query = f"""
        #sql_query#
        """
    result_data = cursor.execute(query)
    result = dict_fetchall_rows(result_data)
    consolidated_df_USA = (
        pd.DataFrame(result)
        .fillna('NULL')
        .replace({True: 1, False: 0})
        .groupby('CourseId')['ApplicationDeadline']
        .apply(', '.join)
        .reset_index()
    )
    return consolidated_df_USA
With the code above I am able to get the desired output, i.e. merging the deadlines from n rows into one row per CourseId, but I am not able to get the rest of the columns, so I tried:
consolidated_df_USA = (
    pd.DataFrame(result)
    .fillna('NULL')
    .replace({True: 1, False: 0})
    .groupby('CourseId')['ApplicationDeadline', 'CourseName']
    .agg(', '.join)
    .reset_index()
)
return consolidated_df_USA
With this I am able to get some columns, but the other columns are being dropped, and I am also getting the warning below:
FutureWarning: Dropping invalid columns in SeriesGroupBy.agg is deprecated. In a future version, a TypeError will be raised. Before calling .agg, select only columns which should be valid for the aggregating function.
How can I get all the columns returned by the SQL query?
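No answer is quoted above, but one possible fix (a sketch using only the column names mentioned in the question; any other columns from the SQL query would need their own entries) follows the warning's advice and gives each column an explicit aggregation instead of indexing the groupby with a tuple:

consolidated_df_USA = (
    pd.DataFrame(result)
    .fillna('NULL')
    .replace({True: 1, False: 0})
    .groupby('CourseId', as_index=False)
    .agg({
        'ApplicationDeadline': ', '.join,  # merge the n deadlines into one string
        'CourseName': 'first',             # keep a single value per CourseId
    })
)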

return type of annotate

I want to get the date when the first order was placed for each item in the database.
If I use this:
event_queryset3 = OrderItemTable.objects.filter(filter).annotate(f_date=Min(f_date_ord))
this gives dates only for items which have been ordered.
If I use this:
event_queryset3 = OrderItemTable.objects.annotate(f_date=Min(f_date_ord, filter=filter))
this gives the desired date (the first order date) for all items, but as type 'str', which I cannot use.
Can anyone help, please? My variables are:
filter = Q(prev_order_items__order_id__order_type='Normal') & Q(prev_order_items__order_id__final_approval='Yes')
f_date_ord = F('prev_order_items__order_id__created_at')
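No answer is quoted here; one thing worth trying (an assumption, not something stated in the thread) is to declare the output type of the filtered aggregate explicitly, in case the string result comes from Django being unable to infer the annotation's field type:

from django.db.models import DateTimeField, F, Min, Q

filter = Q(prev_order_items__order_id__order_type='Normal') & \
         Q(prev_order_items__order_id__final_approval='Yes')
f_date_ord = F('prev_order_items__order_id__created_at')

# Hypothetical fix: tell Django the aggregate returns a datetime.
event_queryset3 = OrderItemTable.objects.annotate(
    f_date=Min(f_date_ord, filter=filter, output_field=DateTimeField())
)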

Parse schema of a dynamic dataframe in AWS Glue

I have a dynamic frame in AWS Glue which I created using the piece of code below.
val rawDynamicDataFrame = glueContext.getCatalogSource(
  database = rawDBName,
  tableName = rawTableName,
  redshiftTmpDir = "",
  transformationContext = "rawDynamicDataFrame"
).getDynamicFrame()
In order to get the schema of the above dynamic frame, I used the below piece of code:
val x = rawDynamicDataFrame.schema
Now x is of type com.amazonaws.services.glue.schema.Schema. How can I parse the schema object?
To check if a field exists in the schema, use containsField(fieldPath):
if (rawDynamicDataFrame.schema.containsField("app_name")) {
  // do something
}
Maybe you can use field_names = [field.name for field in rawDynamicDataFrame.schema().fields] to get a list of field names.
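Building on the attribute names used in that suggestion (schema().fields, field.name, and a dataType attribute are assumptions here, not verified against a particular aws-glue-libs version), a small sketch that dumps the name and type of every top-level field could look like:

def print_glue_schema(dynamic_frame):
    # Iterate over the top-level fields of the DynamicFrame schema
    # and print "name: type" for each one.
    for field in dynamic_frame.schema().fields:
        print(f"{field.name}: {field.dataType}")

print_glue_schema(rawDynamicDataFrame)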

How to Compose Query in BigQuery with Destination Table?

I am trying to run a query on a BQ table and load the queried data into a destination table using legacy SQL.
Code:
bigquery_client = bigquery.Client.from_service_account_json(config.ZF_FILE)
job_config = bigquery.QueryJobConfig()
job_config.use_legacy_sql = True
# Allow for query results larger than the maximum response size
job_config.allow_large_results = True
# When large results are allowed, a destination table must be set.
dest_dataset_ref = bigquery_client.dataset('datasetId')
dest_table_ref = dest_dataset_ref.table('datasetId:mydestTable')
job_config.destination = dest_table_ref
query =""" SELECT abc FROM [{0}] LIMIT 10 """.format(mySourcetable_name)
# run the Query here now
query_job = bigquery_client.query(query, job_config=job_config)
Error:
google.api_core.exceptions.BadRequest: 400 POST : Invalid dataset ID "datasetId:mydestTable". Dataset IDs must be alphanumeric (plus underscores, dashes, and colons) and must be at most 1024 characters long.
The job_config.destination gives:
print job_config.destination
TableReference(u'projectName', 'projectName:dataset', 'projectName:dataset.mydest_table')
The datasetId is correct on my side, so why the error?
How can I get the proper destination table reference?
This may be helpful to someone in the future.
It worked by using just the names, instead of the full IDs, of the dataset and table, as below:
dest_dataset_ref = bigquery_client.dataset('dataset_name')
dest_table_ref = dest_dataset_ref.table('mydestTable_name')
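Putting the fix together (a sketch; dataset_name and mydestTable_name are the placeholders from the answer above, and query_job.result() is used here to wait for the job):

dest_dataset_ref = bigquery_client.dataset('dataset_name')   # dataset name only, no project prefix
dest_table_ref = dest_dataset_ref.table('mydestTable_name')  # table name only, no "dataset:" prefix
job_config.destination = dest_table_ref

query_job = bigquery_client.query(query, job_config=job_config)
query_job.result()  # block until the query finishes and the destination table is populated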

(AWS) Athena: Query Results seem too short

My Athena queries appear to return far fewer results than expected, and I am trying to figure out why.
Setup:
Glue Catalogs (118.6 Gig in size).
Data: Stored in S3 in both CSV and JSON format.
Athena query: When I query a whole table, I only get about 40K results per query; there should be roughly 121 million records for that query, on average, for one month's data.
Does Athena cap query result data? Is this a service limit? (The documentation does not suggest this to be the case.)
So, getting 1000 results at a time obviously doesn't scale. Thankfully, there's a simple workaround. (Or maybe this is how it was supposed to be done all along.)
When you run an Athena query, you should get a QueryExecutionId. This Id corresponds to the output file you'll find in S3.
Here's a snippet I wrote:
import time
from typing import Dict

import boto3
import botocore.exceptions
import pandas as pd

def fetch_athena_query_results(query: str) -> pd.DataFrame:
    s3 = boto3.resource("s3")
    athena = boto3.client("athena")
    response: Dict = athena.start_query_execution(QueryString=query, WorkGroup="<your_work_group>")
    execution_id: str = response["QueryExecutionId"]
    print(execution_id)
    # Wait until the query is finished
    while True:
        try:
            athena.get_query_results(QueryExecutionId=execution_id)
            break
        except botocore.exceptions.ClientError:
            time.sleep(5)
    local_filename: str = "temp/athena_query_result_temp.csv"
    s3.Bucket("athena-query-output").download_file(execution_id + ".csv", local_filename)
    return pd.read_csv(local_filename)
Make sure the corresponding WorkGroup has "Query result location" set, e.g. "s3://athena-query-output/"
Also see this thread with similar answers: How to Create Dataframe from AWS Athena using Boto3 get_query_results method
It seems that there is a limit of 1000.
You should use NextToken to iterate over the results.
Quote from the GetQueryResults documentation:
MaxResults: The maximum number of results (rows) to return in this request.
Type: Integer
Valid Range: Minimum value of 0. Maximum value of 1000.
Required: No
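For completeness, a minimal sketch of that NextToken loop (reusing the athena client and execution_id from the earlier snippet, and assuming the query has already succeeded):

rows = []
kwargs = {"QueryExecutionId": execution_id, "MaxResults": 1000}
while True:
    page = athena.get_query_results(**kwargs)  # one page of up to 1000 rows
    rows.extend(page["ResultSet"]["Rows"])
    token = page.get("NextToken")              # present only if more pages remain
    if not token:
        break
    kwargs["NextToken"] = token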
Another option is a paginate-and-count approach:
I don't know whether there is a better way to do it, such as a select count(*) from table ... query.
Here is complete example code, ready to use, based on the Python boto3 Athena API.
I used a paginator, converted the result to a list of dicts, and also return the count along with the result.
Below are the two methods:
The first one paginates.
The second one converts the paginated result into a list of dicts and calculates the count.
Note: converting to a list of dicts is not necessary in this case. If you don't want that, you can modify the code to return only the count.
import time

def get_athena_results_paginator(params, athena_client):
    """
    :param params: dict with 'query', 'database' and 'workgroup' keys
    :param athena_client: boto3 Athena client
    :return: (results, count)
    """
    query_id = athena_client.start_query_execution(
        QueryString=params['query'],
        QueryExecutionContext={
            'Database': params['database']
        },
        # ResultConfiguration={
        #     'OutputLocation': 's3://' + params['bucket'] + '/' + params['path']
        # },
        WorkGroup=params['workgroup']
    )['QueryExecutionId']
    query_status = None
    # Poll until the query leaves the QUEUED/RUNNING states.
    while query_status == 'QUEUED' or query_status == 'RUNNING' or query_status is None:
        query_status = athena_client.get_query_execution(QueryExecutionId=query_id)['QueryExecution']['Status']['State']
        if query_status == 'FAILED' or query_status == 'CANCELLED':
            raise Exception('Athena query with the string "{}" failed or was cancelled'.format(params.get('query')))
        time.sleep(10)
    results_paginator = athena_client.get_paginator('get_query_results')
    results_iter = results_paginator.paginate(
        QueryExecutionId=query_id,
        PaginationConfig={
            'PageSize': 1000
        }
    )
    count, results = result_to_list_of_dict(results_iter)
    return results, count


def result_to_list_of_dict(results_iter):
    """
    :param results_iter: page iterator returned by the get_query_results paginator
    :return: (count, results) where results is a list of dicts keyed by column name
    """
    results = []
    column_names = None
    count = 0
    for results_page in results_iter:
        for row in results_page['ResultSet']['Rows']:
            column_values = [col.get('VarCharValue', None) for col in row['Data']]
            if not column_names:
                # The first row of the result set holds the column headers.
                column_names = column_values
            else:
                results.append(dict(zip(column_names, column_values)))
                count = count + 1
    return count, results
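A hypothetical usage example (the query, database, and workgroup values are placeholders):

import boto3

athena_client = boto3.client('athena')
params = {
    'query': 'SELECT * FROM my_table',  # placeholder query
    'database': 'my_database',          # placeholder database name
    'workgroup': 'primary',
}
results, count = get_athena_results_paginator(params, athena_client)
print('fetched', count, 'rows')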