GBQ: SchemaField REPEATED mode writes empty array - google-cloud-platform

I am trying to write a simple DataFrame to BigQuery (GBQ). One of the fields is an array, so I tried using REPEATED mode, but nothing seems to be written to the array column:
import pandas as pd
from google.cloud import bigquery

df = pd.DataFrame(
    {
        'my_string': ['a', 'b', 'c'],
        'my_int64': [1, 2, 3],
        'my_float64': [4.0, 5.0, 6.0],
        'my_timestamp': [
            pd.Timestamp("1998-09-04T16:03:14"),
            pd.Timestamp("2010-09-13T12:03:45"),
            pd.Timestamp("2015-10-02T16:00:00")
        ],
        'my_array': [
            [1, 2, 3],
            [4, 5, 6],
            [7, 8, 9]
        ]
    }
)
client = bigquery.Client()
table_id = 'junk.pshah2_new_table'
# Since string columns use the "object" dtype, pass in a (partial) schema
# to ensure the correct BigQuery data type.
job_config = bigquery.LoadJobConfig(schema=[
    bigquery.SchemaField("my_string", "STRING"),
    bigquery.SchemaField("my_array", "INTEGER", mode="REPEATED"),
])
job = client.load_table_from_dataframe(
    df, table_id, job_config=job_config
)
# Wait for the load job to complete.
job.result()
In BigQuery, my_array is not written at all.
Am I doing something wrong here?

Related

Athena-express query returns nested array as a string

I have this JSON data in AWS S3. It is an array of objects:
[{"usefulOffer": "Nike shoe","webStyleId": "123","skus": [{"rmsSkuId": "456","eventIds": ["", "7", "8", "9"]},{"rmsSkuId": "777","eventIds": ["B", "Q", "W", "H"]}],"timeStamp": "4545"},
{"usefulOffer": "Adidas pants","webStyleId": "35","skus": [{"rmsSkuId": "16","eventIds": ["2", "4", "boo", "la"]}],"timeStamp": "999"},...]
This is the query I used to create the table/schema in Athena for the data above:
CREATE EXTERNAL TABLE IF NOT EXISTS table (
    usefulOffer STRING,
    webStyleId STRING,
    skus array<struct<rmsSkuId: STRING, eventIds: array<STRING>>>,
    `timeStamp` STRING
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES ('ignore.malformed.json' = 'true')
LOCATION 's3://...'
When I query Athena through athena-express with 'SELECT * FROM table', it returns nicely formatted JSON, except that the nested array comes back as a string:
[
    {
        usefuloffer: 'Nike shoe',
        webstyleid: '123',
        skus: '[{rmsskuid=456, eventids=[, 7, 8, 9]}, {rmsskuid=777, eventids=[B, Q, W, H]}]',
        timestamp: '4545'
    },
    {
        usefuloffer: 'Adidas pants',
        webstyleid: '35',
        skus: '[{rmsskuid=16, eventids=[2, 4, boo, la]}]',
        timestamp: '999'
    },
I also tried creating the table/schema without the option WITH SERDEPROPERTIES ('ignore.malformed.json' = 'true'), but then the result came back badly formatted altogether.
How can I get the nested array as an array rather than as a string?
Thank you for your help!

Creating a PCollection from a JSON string input

I am trying to create a unit test for my Dataflow code, but I get an error when creating a PCollection with beam.Create().
In the original function, I use json.loads(element), which takes a JSON string as input and returns a dictionary as output.
Test code:
def test(self):
    input = '{"name": "xyz"}'
    expected_output = {'name': 'xyz'}
    input_string = p | beam.Create(input)  # pipeline object is already defined as p
    output = input_string | beam.ParDo(_splitdata)  # calling the original function here
    assert_that(output, equal_to(expected_output))
Errors:
beam.Create() does not accept a string as input; it only takes iterables. If I make the input a list or any other iterable instead, json.loads() inside _splitdata() still expects the JSON object as a str.
How do I resolve this issue? Please help.
I ran into a similar scenario today; here is my solution:
import apache_beam as beam

data = [
    {
        'id': 1,
        'name': 'abc',
    },
    {
        'id': 2,
        'name': 'xyz'
    }
]
with beam.Pipeline() as pipeline:
    plant_details = (
        pipeline
        | 'Read Input Data' >> beam.Create(data)
        | beam.Map(print))
Output:
{'id': 1, 'name': 'abc'}
{'id': 2, 'name': 'xyz'}

How to remove extra square brackets in a response in Django REST framework

I created a stored function in PostgreSQL that returns a table, and I call that function from Django REST framework like this:
from django.db import connection

def getSomeFunc(self):
    with connection.cursor() as cursor:
        cursor.execute('select json_agg(myfunctionpsql) from myfunctionpsql')
        table = cursor.fetchall()
    return table
This function is called in a views file, as in the code below:
from rest_framework import status, views
from rest_framework.response import Response

class myview(views.APIView):
    def get(self, request):
        fund = getSomeFunc(self)
        return Response({'data': fund}, status=status.HTTP_200_OK)
The response looks like this:
{
    "data": [
        [ // I want to delete this
            [ // I want to delete this
                {
                    "id": 21,
                    "somedata": "FIX A",
                    "somedata": "FIX A",
                    "sometag": 0.95,
                    "somdata": "005.119.745/0001-98",
                    "somedatayear": 1.57,
                    "somedata12": 4.11,
                    "somedata36": 19.58,
                    "somedata60": 51.9,
                    "datarisk": 0
                }
            ] // I want to delete this
        ] // I want to delete this
    ]
}
and I need the response below:
{
    "data": [
        {
            "id": 21,
            "somedata": "FIX A",
            "somedata": "FIX A",
            "sometag": 0.95,
            "somdata": "005.119.745/0001-98",
            "somedatayear": 1.57,
            "somedata12": 4.11,
            "somedata36": 19.58,
            "somedata60": 51.9,
            "datarisk": 0
        }
    ]
}
I tried this:
class myview(views.APIView):
    def get(self, request):
        fund = getSomeFunc(self)
        resp = str(fund)[1:-1]
        return Response({'data': resp}, status=status.HTTP_200_OK)
but then the whole response is converted to, and returned as, a string. How can I delete the extra square brackets in the response?
Regards.
If your variable fund is:
[
    [
        [
            {
                "id": 21,
                "somedata": "FIX A",
                "somedata": "FIX A",
                "sometag": 0.95,
                "somdata": "005.119.745/0001-98",
                "somedatayear": 1.57,
                "somedata12": 4.11,
                "somedata36": 19.58,
                "somedata60": 51.9,
                "datarisk": 0
            }
        ]
    ]
]
you can write a generator to flatten the items into a simple list. I'm assuming there could be multiple objects in your list, so using indices to slice the list would be dangerous.
Try this recursive generator:
def flatten_list(data):
    if isinstance(data, list):
        for item in data:
            yield from flatten_list(item)
    else:
        yield data
and call it like this:
list(flatten_list(fund))
example:
data = [[[1, 4], 2], 3]
list(flatten_list(data))
# result: [1, 4, 2, 3]
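One caveat worth checking: DB-API cursors commonly return each row from fetchall() as a tuple rather than a list, in which case a check against list alone would leave the row tuples untouched. The following is a minimal sketch that treats both lists and tuples as containers. The name flatten_rows and the sample value of fund are assumptions for illustration; the sample assumes the database driver has already decoded the json_agg column into a Python list of dictionaries.

```python
def flatten_rows(data):
    """Recursively yield the items nested inside lists or tuples."""
    if isinstance(data, (list, tuple)):
        for item in data:
            yield from flatten_rows(item)
    else:
        # Dictionaries (and any other non-sequence values) are yielded as-is.
        yield data


# Hypothetical shape of cursor.fetchall() for a single json_agg() column:
# a list of row tuples whose only column is already a list of dicts.
fund = [([{"id": 21, "sometag": 0.95}, {"id": 22, "sometag": 0.5}],)]

print(list(flatten_rows(fund)))
```

If the query is known to return exactly one row with one column, indexing with fund[0][0] (or using cursor.fetchone()[0]) would give the same inner list without any flattening.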

Query Django JSONFields that are a list of dictionaries

Given a Django JSONField that is structured as a list of dictionaries:
# JSONField "materials" on MyModel:
[
{"some_id": 123, "someprop": "foo"},
{"some_id": 456, "someprop": "bar"},
{"some_id": 789, "someprop": "baz"},
]
and given a list of values to look for:
myids = [123, 789]
I want to query for all MyModel instances that have a matching some_id anywhere in those lists of dictionaries. I can do this to search in dictionaries one at a time:
# Search inside the third dictionary in each list:
MyModel.objects.filter(materials__2__some_id__in=myids)
But I can't seem to construct a query to search in all dictionaries at once. Is this possible?
Given the clue here from Davit Tovmasyan to do this by incrementing through the match_targets and building up a set of Q queries, I wrote this function that takes a field name to search, a property name to search against, and a list of target matches. It returns a new list containing the matching dictionaries and the source objects they come from.
from iris.apps.claims.models import Claim
from django.db.models import Q


def json_list_search(
    json_field_name: str,
    property_name: str,
    match_targets: list
) -> list:
    """
    Args:
        json_field_name: Name of the JSONField to search in
        property_name: Name of the dictionary key to search against
        match_targets: List of possible values that should constitute a match
    Returns:
        List of dictionaries: [
            {"claim_id": 123, "json_obj": {"foo": "y"}},
            {"claim_id": 456, "json_obj": {"foo": "z"}}
        ]
    Example:
        results = json_list_search(
            json_field_name="materials_data",
            property_name="material_id",
            match_targets=[1, 22]
        )
        # (results truncated):
        [
            {
                "claim_id": 1,
                "json_obj": {
                    "category": "category_kmimsg",
                    "material_id": 1,
                },
            },
            {
                "claim_id": 2,
                "json_obj": {
                    "category": "category_kmimsg",
                    "material_id": 23,
                }
            },
        ]
    """
    # OR together one containment lookup per target value.
    q_keys = Q()
    for match_target in match_targets:
        kwargs = {
            f"{json_field_name}__contains": [{property_name: match_target}]
        }
        q_keys |= Q(**kwargs)
    claims = Claim.objects.filter(q_keys)
    # Now we know which ORM objects contain references to any of the match_targets
    # in any of their dictionaries. Extract *relevant* objects and return them
    # with references to the source claim.
    results = []
    for claim in claims:
        data = getattr(claim, json_field_name)
        for datum in data:
            if datum.get(property_name) and datum.get(property_name) in match_targets:
                results.append({"claim_id": claim.id, "json_obj": datum})
    return results
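The in-Python extraction step at the end of that function can be tried without a database. The following is a minimal sketch rather than the answer's own code: matching_materials is a hypothetical helper name, and the sample data is the list from the question. It tests for the key explicitly, so a stored value such as 0 or an empty string still counts as a match, whereas a truthiness check on datum.get(property_name) would skip it.

```python
def matching_materials(materials, property_name, match_targets):
    """Return the dictionaries whose `property_name` value is one of the targets."""
    return [
        item
        for item in materials
        # Skip non-dict entries and dicts that lack the key entirely.
        if isinstance(item, dict)
        and property_name in item
        and item[property_name] in match_targets
    ]


materials = [
    {"some_id": 123, "someprop": "foo"},
    {"some_id": 456, "someprop": "bar"},
    {"some_id": 789, "someprop": "baz"},
]

print(matching_materials(materials, "some_id", [123, 789]))
```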
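contains might help you. It should be something like this:
from django.db.models import Q

q_keys = Q()
for _id in myids:
    # The field's top level is a list, so wrap the dictionary in a list;
    # PostgreSQL containment does not match a bare object against an array.
    q_keys |= Q(materials__contains=[{'some_id': _id}])
MyModel.objects.filter(q_keys)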

Represent a hierarchical data.frame as a nested list

How to nicely convert a data.frame with hierarchical information to a JSON (or nested list)?
Let's say we have the following data.frame:
df <- data.frame(
id = c('1', '1.1', '1.1.1', '1.2'),
value = c(10, 5, 5, 5))
# id value
# 1 10
# 1.1 5
# 1.1.1 5
# 1.2 5
Then I would like to end up with the following JSON:
{
    "id": "1",
    "value": 10,
    "children": [
        {
            "id": "1.1",
            "value": 5,
            "children": [
                {
                    "id": "1.1.1",
                    "value": 5
                }
            ]
        },
        {
            "id": "1.2",
            "value": 5
        }
    ]
}
Where id defines the hierarchical structure, and . is a delimiter.
My intention is to easily be able to convert data from R to hierarchical D3 visualisations (e.g. Partition Layout or Zoomable Treemaps). It would also be nice if it is possible to add more "value"-columns; e.g value, size, weight, etc.
Thank you!
EDIT: I reverted to the original question, so it is easier to follow all the answers (sorry for all the editing).
I tend to have RJSONIO installed which does this:
R> df <- data.frame(id = c('1', '1.1', '1.1.1', '1.2'), value = c(10, 5, 5, 5))
R> RJSONIO::toJSON(df)
[1] "{\n \"id\": [ \"1\", \"1.1\", \"1.1.1\", \"1.2\" ],\n\"value\": [ 10, 5, 5, 5 ] \n}"
R> cat(RJSONIO::toJSON(df), "\n")
{
"id": [ "1", "1.1", "1.1.1", "1.2" ],
"value": [ 10, 5, 5, 5 ]
}
R>
That is not your desired output but the desired nesting / hierarchy was not present in the data.frame. I think if you nest a data.frame inside a list you will get there.
Edit: For your revised question, here is the R output of reading your specified JSON back in:
R> RJSONIO::fromJSON("/tmp/foo.json")
$id
[1] "1"
$value
[1] 10
$children
$children[[1]]
$children[[1]]$id
[1] "1.1"
$children[[1]]$value
[1] 5
$children[[1]]$children
$children[[1]]$children[[1]]
$children[[1]]$children[[1]]$id
[1] "1.1.1"
$children[[1]]$children[[1]]$value
[1] 5
$children[[2]]
$children[[2]]$id
[1] "1.2"
$children[[2]]$value
[1] 5
R>
A possible solution.
First I define the following functions:
# Function to get the number of hierarchical dimensions (occurrences of "." + 1)
ch_dim <- function(x, delimiter = ".") {
  x <- as.character(x)
  chr.count <- function(x) length(which(unlist(strsplit(x, NULL)) == delimiter))
  if (length(x) > 1) {
    sapply(x, chr.count) + 1
  } else {
    chr.count(x) + 1
  }
}
# Function to convert a hierarchical data.frame to a nested list
lst_fun <- function(ch, id_col = "id", num = min(d), stp = max(d)) {
  # Convert data.frame to character
  ch <- data.frame(lapply(ch, as.character), stringsAsFactors = FALSE)
  # Get number of hierarchical dimensions
  d <- ch_dim(ch[[id_col]])
  # Convert to list
  lapply(ch[d == num, ][[id_col]], function(x) {
    # Descendants of x: ids that start with x followed by a literal ".".
    # (An unescaped "." in a regex would also match ids such as "10".)
    tt <- ch[startsWith(ch[[id_col]], paste0(x, ".")), ]
    current <- ch[ch[[id_col]] == x, ]
    if (stp != num && nrow(tt) > 0) {
      c(current, list(children = lst_fun(tt, id_col, num + 1, stp)))
    } else { current }
  })
}
then convert the data.frame to a list:
lst <- lst_fun(df, "id")
and finally, the JSON:
s <- RJSONIO::toJSON(lst)
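For readers who want to check the nesting logic outside R, the same idea can be sketched in Python: each row is attached to the row whose id equals its own id with the last delimited segment removed. This is only an illustration under stated assumptions, not a translation of the R answer. The function name nest_by_id is hypothetical, the rows are the data.frame from the question written as dictionaries, and a row whose parent is missing is treated as a root.

```python
import json


def nest_by_id(rows, id_key="id", delimiter="."):
    """Nest flat rows under their parents, where "1.1" is a child of "1"."""
    # Copy each row so the input dictionaries are not modified.
    nodes = {row[id_key]: dict(row) for row in rows}
    roots = []
    for node_id, node in nodes.items():
        # "1.1.1" -> parent "1.1"; an id without a delimiter has no parent.
        parent_id, sep, _ = node_id.rpartition(delimiter)
        if sep and parent_id in nodes:
            nodes[parent_id].setdefault("children", []).append(node)
        else:
            roots.append(node)
    return roots


rows = [
    {"id": "1", "value": 10},
    {"id": "1.1", "value": 5},
    {"id": "1.1.1", "value": 5},
    {"id": "1.2", "value": 5},
]

print(json.dumps(nest_by_id(rows), indent=2))
```

Because whole rows are copied, any extra columns (size, weight, and so on) are carried along into the nested output unchanged.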