Is there a way to add a new column to an existing table in DynamoDB on Amazon's AWS?
Googling didn't help, and the UpdateTable operation documented at http://docs.aws.amazon.com/cli/latest/reference/dynamodb/update-table.html?highlight=update%20table has no information related to adding a new column.
DynamoDB does not require a schema definition, so there is no such thing as a "column". You can just add a new item with a new attribute.
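A minimal boto3 sketch of that idea (the table name Invoices, the key invoiceId, and the attribute names are hypothetical, not from the question):
import boto3

# Assumes a table "Invoices" with partition key "invoiceId" already exists.
dynamodb = boto3.resource("dynamodb", region_name="us-west-2")
table = dynamodb.Table("Invoices")

# Earlier items only carried invoiceId and amount.
table.put_item(Item={"invoiceId": "INV-1", "amount": 100})

# A later item can simply include a brand-new attribute; no table-level change is needed.
table.put_item(Item={"invoiceId": "INV-2", "amount": 250, "status": "PAID"})

# To add the attribute to an existing item, update that item in place.
table.update_item(
    Key={"invoiceId": "INV-1"},
    UpdateExpression="SET #s = :s",
    ExpressionAttributeNames={"#s": "status"},
    ExpressionAttributeValues={":s": "OPEN"},
)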
Well, let's not get dragged into a semantic discussion about the difference between "fields" and "columns". The word "column" does remind us of relational databases, which DynamoDB is not; in essence, that means DynamoDB does not have foreign keys.
DynamoDB does have "primary partition keys" and "index partition keys" though, just as relational databases have keys (although there are other strategies as well).
You do need to respect those keys when you add data, but aside from that requirement you don't have to predefine your fields.
Assuming that you are new to this, some additional good practices:
Add a numeric field to each record to store the creation time in epoch seconds. DynamoDB has an optional cleanup feature (Time To Live), which requires this type of field in your data (see the sketch after this list).
DynamoDB has no native date type, so you have to store dates as numeric fields or as strings. Given the previous remark, you may prefer a numeric type for them.
Don't store big documents in it, because there is a maximum fetch size of 16 MB and a maximum item size of 400 KB. Fortunately, AWS offers S3 storage and other kinds of databases (e.g. DocumentDB) for that.
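A short sketch of that TTL practice, assuming a hypothetical table Invoices and an epoch-seconds attribute named expires_at (both names are illustrative); update_time_to_live is the boto3 call that switches the cleanup feature on:
import time
import boto3

client = boto3.client("dynamodb", region_name="us-west-2")
table = boto3.resource("dynamodb", region_name="us-west-2").Table("Invoices")

# Tell DynamoDB which numeric attribute holds the expiry time (epoch seconds).
client.update_time_to_live(
    TableName="Invoices",
    TimeToLiveSpecification={"Enabled": True, "AttributeName": "expires_at"},
)

# Store the creation time plus an expiry 30 days out; DynamoDB will
# eventually delete the item after expires_at has passed.
now = int(time.time())
table.put_item(
    Item={"invoiceId": "INV-3", "created_at": now, "expires_at": now + 30 * 24 * 3600}
)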
There are many strategies for table keys:
If you only declare the partition key, then it acts like a primary key (e.g. partition-key=invoiceId). That's fine.
If your object has a parent reference (e.g. invoices belong to a customer), then you probably want to add a sort key (e.g. partition-key=customerId; sort-key=invoiceId). Together they behave like a composite key. The advantage is that you can do a lookup using both keys or just the partition key: request a specific invoice for a specific customer, or all invoices for that customer, as shown in the sketch below.
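A hedged boto3 sketch of that lookup pattern, again assuming a hypothetical Invoices table keyed by customerId (partition key) and invoiceId (sort key):
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb", region_name="us-west-2").Table("Invoices")

# All invoices for one customer: query on the partition key only.
all_for_customer = table.query(
    KeyConditionExpression=Key("customerId").eq("CUST-42")
)["Items"]

# One specific invoice for that customer: partition key plus sort key.
one_invoice = table.query(
    KeyConditionExpression=Key("customerId").eq("CUST-42") & Key("invoiceId").eq("INV-7")
)["Items"]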
I installed NoSQL Workbench, connected to an existing DynamoDB table, and tried to update an existing item by adding a new attribute.
I found that you can only add a new attribute with one of the set types: "SS", "NS", or "BS" (String Set, Number Set, Binary Set).
In Workbench, we can generate code for the chosen operation.
So I scanned my DynamoDB table and, for each item, added the new attribute with type "SS"; then I scanned again and changed the newly added attribute to type "S" in order to create a global secondary index (GSI) with the primary key "pk2NewAttr" (a condensed sketch follows below).
NoSQL Workbench related video - https://www.youtube.com/watch?v=Xn12QSNa4RE&feature=youtu.be&t=3666
Example in Python of how to scan a whole DynamoDB table: https://gist.github.com/pgolding
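A condensed sketch of that scan-and-update pass, assuming a hypothetical table named MyTable with partition key id, and using the pk2NewAttr attribute mentioned above (adjust the names to your own schema):
import boto3

table = boto3.resource("dynamodb", region_name="us-west-2").Table("MyTable")

# Scan every item, following pagination via LastEvaluatedKey.
scan_kwargs = {}
while True:
    page = table.scan(**scan_kwargs)
    for item in page["Items"]:
        # Write the new attribute onto each existing item.
        table.update_item(
            Key={"id": item["id"]},  # replace "id" with your partition key
            UpdateExpression="SET pk2NewAttr = :v",
            ExpressionAttributeValues={":v": "some-value"},
        )
    if "LastEvaluatedKey" not in page:
        break
    scan_kwargs["ExclusiveStartKey"] = page["LastEvaluatedKey"]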
You can achieve the same by doing the following:
Open DynamoDB and click on the Tables option in the left sidebar menu.
Search for your table by name and click on it.
Now select the orange button named Explore table items.
Scroll down and click on Create item.
You will now see an editor with a JSON value; click the Form button on the right side to add a new column and its type.
Note: this will insert one new record, and you can now see the new column as well.
A way to add a new column to an existing table in DynamoDB on Amazon's AWS:
We can store values in DynamoDB in two ways:
(i) In an RDBMS-like structure, we can add a new column simply by issuing the same write command that created the existing records, this time including the new column's attribute. DynamoDB happily stores records/rows that have values for some columns while other columns have no values at all.
(ii) In a NoSQL-like structure, we store a JSON string within a single column to hold all the attributes required. Here we generate the JSON string, add the new attribute to it, and write it back to that same column.
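A brief sketch of approach (ii), assuming a hypothetical table Docs with partition key docId and a string attribute payload that holds the JSON blob:
import json
import boto3

table = boto3.resource("dynamodb", region_name="us-west-2").Table("Docs")

# Read the item, add the new attribute inside the JSON blob, write it back.
item = table.get_item(Key={"docId": "D-1"})["Item"]
payload = json.loads(item.get("payload", "{}"))
payload["newAttr"] = "new value"  # the "new column" lives inside the blob
table.update_item(
    Key={"docId": "D-1"},
    UpdateExpression="SET payload = :p",
    ExpressionAttributeValues={":p": json.dumps(payload)},
)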
This script will either delete a record, or add a ttl field. You might want to tailor it to your column name and remove the delete stuff.
Usage:
usage: add_ttl.py [-h] --profile PROFILE [-d] [--force_delete_all] [-v] [-q]
Procedurally modify DynamoDB
optional arguments:
-h, --help show this help message and exit
--profile PROFILE AWS profile name
-d, --dryrun Dry run, take no action
--force_delete_all Delete all records, including valid, unexpired
-v, --verbose set loglevel to DEBUG
-q, --quiet set loglevel to ERROR
Script:
#!/usr/bin/env python3
# pylint:disable=duplicate-code
import argparse
import logging
import sys
from collections import Counter
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from functools import cached_property
from typing import Dict, Optional
import boto3
from dateutil.parser import isoparse
from tqdm import tqdm
LOGGER = logging.getLogger(__name__)
DATE_FORMAT = "%Y-%m-%d %H:%M:%S"
LOG_FORMAT = "[%(asctime)s] %(levelname)s:%(name)s:%(message)s"
def setup_logging(loglevel=None, date_format=None, log_format=None):
"""Setup basic logging.
Args:
loglevel (int): minimum loglevel for emitting messages
"""
logging.basicConfig(
level=loglevel or logging.INFO,
stream=sys.stdout,
format=log_format or LOG_FORMAT,
datefmt=date_format or DATE_FORMAT,
)
def parse_args():
"""
Extract the CLI arguments from argparse
"""
parser = argparse.ArgumentParser(description="Procedurally modify DynamoDB")
parser.add_argument(
"--profile",
help="AWS profile name",
required=True,
)
parser.add_argument(
"-d",
"--dryrun",
action="store_true",
default=False,
help="Dry run, take no action",
)
parser.add_argument(
"--force_delete_all",
action="store_true",
default=False,
help="Delete all records, including valid, unexpired",
)
parser.add_argument(
"-v",
"--verbose",
dest="loglevel",
help="set loglevel to DEBUG",
action="store_const",
const=logging.DEBUG,
)
parser.add_argument(
"-q",
"--quiet",
dest="loglevel",
help="set loglevel to ERROR",
action="store_const",
const=logging.ERROR,
)
return parser.parse_args()
def query_yes_no(question, default="yes"):
"""Ask a yes/no question via input() and return their answer.
"question" is a string that is presented to the user.
"default" is the presumed answer if the user just hits <Enter>.
It must be "yes" (the default), "no" or None (meaning
an answer is required of the user).
The "answer" return value is True for "yes" or False for "no".
"""
valid = {"yes": True, "y": True, "ye": True, "no": False, "n": False}
if default is None:
prompt = " [y/n] "
elif default == "yes":
prompt = " [Y/n] "
elif default == "no":
prompt = " [y/N] "
else:
raise ValueError("invalid default answer: '%s'" % default)
while True:
sys.stdout.write(question + prompt)
choice = input().lower()
if default is not None and choice == "":
return valid[default]
if choice in valid:
return valid[choice]
sys.stdout.write("Please respond with 'yes' or 'no' " "(or 'y' or 'n').\n")
@dataclass
class Table:
"""Class that wraps dynamodb and simplifies pagination as well as counting."""
region_name: str
table_name: str
_counter: Optional[Counter] = None
def __str__(self):
out = "\n" + ("=" * 80) + "\n"
for key, value in self.counter.items():
out += "{:<20} {:<2}\n".format(key, value)
return out
def str_table(self):
keys = list(self.counter.keys())
# Set the names of the columns.
fmt = "{:<20} " * len(keys)
return f"\n\n{fmt}\n".format(*keys) + f"{fmt}\n".format(
*list(self.counter.values())
)
@cached_property
def counter(self):
if not self._counter:
self._counter = Counter()
return self._counter
@cached_property
def client(self):
return boto3.client("dynamodb", region_name=self.region_name)
@cached_property
def table(self):
dynamodb = boto3.resource("dynamodb", region_name=self.region_name)
return dynamodb.Table(self.table_name)
@property
def items(self):
response = self.table.scan()
self.counter["Fetched Pages"] += 1
data = response["Items"]
with tqdm(desc="Fetching pages") as pbar:
while "LastEvaluatedKey" in response:
response = self.table.scan(
ExclusiveStartKey=response["LastEvaluatedKey"]
)
self.counter["Fetched Pages"] += 1
data.extend(response["Items"])
pbar.update(500)
self.counter["Fetched Items"] = len(data)
return data
@cached_property
def item_count(self):
response = self.client.describe_table(TableName=self.table_name)
count = int(response["Table"]["ItemCount"])
self.counter["Total Rows"] = count
return count
def delete_item(table, item):
return table.table.delete_item(
Key={
"tim_id": item["tim_id"],
}
)
def update_item(table: Table, item: Dict, ttl: int):
return table.table.update_item(
Key={"tim_id": item["tim_id"]},
UpdateExpression="set #t=:t",
ExpressionAttributeNames={
"#t": "ttl",
},
ExpressionAttributeValues={
":t": ttl,
},
ReturnValues="UPDATED_NEW",
)
def main():
setup_logging()
args = parse_args()
if not query_yes_no(
f"Performing batch operations with {args.profile}; is this correct?"
):
sys.exit(1)
sys.stdout.write(f"Setting up connection with {args.profile}\n")
boto3.setup_default_session(profile_name=args.profile)
table = Table(region_name="us-west-2", table_name="TimManager")
now = datetime.now(timezone.utc).replace(microsecond=0)
buffer = timedelta(days=7)
# #TODO list comprehension
to_update = []
to_delete = []
for item in tqdm(table.items, desc="Inspecting items"):
ttl_dt = isoparse(item["delivery_stop_time"])
if ttl_dt > now - buffer and not args.force_delete_all:
to_update.append(item)
else:
to_delete.append(item)
table.counter["Identified for update"] = len(to_update)
table.counter["Identified for delete"] = len(to_delete)
table.counter["Performed Update"] = 0
table.counter["Performed Delete"] = 0
if to_update and query_yes_no(
f"Located {len(to_update)} records to update with {args.profile}"
):
for item in tqdm(to_update, desc="Updating items"):
if not args.dryrun:
ttl_dt = isoparse(item["delivery_stop_time"])
response = update_item(table, item, int((ttl_dt + buffer).timestamp()))
if response.get("ResponseMetadata", {}).get("HTTPStatusCode") == 200:
table.counter["Updated"] += 1
if to_delete and query_yes_no(
f"Located {len(to_delete)} records to delete with {args.profile}"
):
for item in tqdm(to_delete, desc="Deleting items"):
if not args.dryrun:
table.counter["Deleted"] += 1
response = delete_item(table, item)
if response.get("ResponseMetadata", {}).get("HTTPStatusCode") == 200:
table.counter["Deleted"] += 1
sys.stdout.write(str(table))
if __name__ == "__main__":
main()
Related
I'm using a Jython InvokeScriptedProcessor to transform data from a JSON structure to a SQL statement. I'm having trouble with a specific function: json.loads. json.loads does not handle special characters like ñ, é, á, í...
It writes them in a mangled form, and I haven't found any way around it.
e.g. (very simple)
{"id":"ÑUECO","value":3.141592,"datetime":"....","location":"ÑUECO"}
If we try to write it in sql like
INSERT INTO .... (id, value) VALUES ("...",3.141592);
It will fail. It fails for me; I cannot route the data to any relationship, success or failure, whatever NiFi version I use. Here is my code:
def process(self, inputStream, outputStream):
# read input json data from flowfile content
text = IOUtils.toString(inputStream, StandardCharsets.UTF_8)
data = json.loads(text)
Neither
data = json.loads(text.encode("utf-8"))
works properly. text comes in unicode.
def __generate_sql_transaction(input_data):
""" Generate SQL statement """
sql = """
BEGIN;"""
_id = input_data.get("id")
_timestamp = input_data.get("timestamp")
_flowfile_metrics = input_data.get("metrics")
_flowfile_metadata = input_data.get("metadata")
self.valid = __validate_metrics_type(_flowfile_metrics)
if self.valid is True:
self.log.error("generate insert")
sql += """
INSERT INTO
{0}.{1} (id, timestamp, metrics""".format(schema, table)
if _flowfile_metadata:
sql += ", metadata"
sql += """)
VALUES
('{0}', '{1}', '{2}'""".format(_id.encode("utf-8"), _timestamp, json.dumps(_flowfile_metrics))
self.log.error("generate metadata")
if _flowfile_metadata:
sql += ", '{}'".format(json.dumps(_flowfile_metadata).encode("utf-8"))
sql += """)
ON CONFLICT ({})""".format(on_conflict)
if not bool(int(self.update)):
sql += """
DO NOTHING;"""
else:
sql += """
DO UPDATE
SET"""
if bool(int(self.preference)):
sql += """
metrics = '{2}' || {0}.{1}.metrics;""".format(schema, table, json.dumps(_flowfile_metrics))
else:
sql += """
metrics = {0}.{1}.metrics || '{2}';""".format(schema, table, json.dumps(_flowfile_metrics))
else:
return ""
sql += """
COMMIT;"""
return sql
I send the data to NiFi again with:
output = __generate_sql_transaction(data)
self.log.error("post generate_sql_transaction")
self.log.error(output.encode("utf-8"))
# If no sql_transaction is generated because requisites weren't met,
# set the processor output with the original flowfile input.
if output == "":
output = text
# write new content to flowfile
outputStream.write(
output.encode("utf-8")
)
That output seems like
INSERT INTO .... VALUES ("ÃUECO","2020-01-01T10:00:00",'{"value":3.1415}','{"location":"\u00d1UECO"}');
I have "Ñueco" also in metadata, and it doesn't works fine with id nor metadata
NOTE: It seems that InvokeScriptedProcessor works fine using Groove instead of Python. But my problem is I know nothing about Groovy...
Does anybody found a similar issue? How did you solve it?
Update:
Input Example:
{"id":"ÑUECO",
"metrics":{
"value":3.1415
},
"metadata":{
"location":"ÑUECO"
},
"timestamp":"2020-01-01 00:00:00+01:00"
}
Desired Output:
BEGIN;
INSERT INTO Table (id, timestamp, metrics, metadata)
VALUES ('ÑUECO',
'2020-01-01T00:00:00+01:00',
'{"value":3.1415}',
'{"location":"ÑUECO"}')
ON CONFLICT (id, timestamp)
DO UPDATE
SET
metrics='{"value":3.1415}' || Table.metrics;
COMMIT;
Real Output:
BEGIN;
INSERT INTO Table (id, timestamp, metrics, metadata)
VALUES ('ÃUECO',
'2020-01-01T00:00:00+01:00',
'{"value":3.1415}',
'{"location":"\u00d1UECO"}')
ON CONFLICT (id, timestamp)
DO UPDATE
SET
metrics='{"value":3.1415}' || Table.metrics;
COMMIT;
Update:
Jython does not work correctly with byte strings, so don't use .encode('utf-8').
Use Java methods to write content back to the flow file with a specific encoding.
Below is a sample that reads and writes non-ASCII characters, including Ñ, correctly.
Use an ExecuteScript processor with Jython and replace the body of the _transform(text) function:
import traceback
from org.apache.commons.io import IOUtils
from java.nio.charset import StandardCharsets
from org.apache.nifi.processor.io import StreamCallback
class FlowWriter(StreamCallback):
def _transform(self, text):
# transform incoming text here
return '####' + text + '****'
def process(self, inputStream, outputStream):
text = IOUtils.toString(inputStream, StandardCharsets.UTF_8)
new_text = self._transform(text)
IOUtils.write(new_text, outputStream, StandardCharsets.UTF_8)
flowFile = session.get()
if flowFile != None:
try:
flowFile = session.write(flowFile, FlowWriter())
flowFile = session.putAttribute(flowFile, "filename", 'headerfile.xml')
session.transfer(flowFile, REL_SUCCESS)
session.commit()
except Exception as e:
log.error("{}\n{}".format(e,traceback.format_exc()))
session.rollback(True) # put file back and penalize it
I've recently found this answer:
https://stackoverflow.com/a/35882335/7634711
It's not a problem with NiFi; it's a problem with Python 2 and how it works with the json library. The problems will also appear in Python 3 if special characters occur in dict keys.
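As a side note (not part of the linked answer): the \u00d1 escapes in the generated SQL come from json.dumps, which escapes non-ASCII characters by default; passing ensure_ascii=False keeps the characters literal. A minimal sketch:
# -*- coding: utf-8 -*-
import json

metadata = {u"location": u"ÑUECO"}

# Default behaviour: non-ASCII characters are escaped.
print(json.dumps(metadata))                      # {"location": "\u00d1UECO"}

# ensure_ascii=False keeps the characters as-is (a unicode string under Python 2).
print(json.dumps(metadata, ensure_ascii=False))  # {"location": "ÑUECO"}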
What I'm trying to do
I'm trying to enter a list of tags in Flask that should become passable as a list, but I can't figure out how to do it in Flask, nor can I find documentation for adding lists (of strings) in flask_wtf. Does anyone have experience with this?
Ideally I would like the tags to be selectively deletable after you have entered them.
The problem
Thus far my form is static. You enter stuff, hit submit, and it gets processed into a .json file. The tags list is the last element I can't figure out. I don't even know if Flask can do this.
A little demo of how I envisioned the entry process:
How I envisioned the entry process:
The current tags are displayed and an entry field to add new ones.
[Tag1](x) | [Tag2](x)
Enter new Tag: [______] (add)
Hit (add)
[Tag1](x) | [Tag2](x)
Enter new Tag: [Tag3__] (add)
New Tag is added
[Tag1](x) | [Tag2](x) | [Tag3](x)
Enter new Tag: [______]
How I envisioned the deletion process:
Hitting the (x) on the side of the tag should kill it.
[Tag1](x) | [Tag2](x) | [Tag3](x)
Hit (x) on Tag2. Result:
[Tag1](x) | [Tag3](x)
The deletion is kind of icing on the cake and could probably be done, once I have a list I can edit, but getting there seems quite hard.
I'm at a loss here.
I basically want to know if it's possible to enter lists in general, since there does not seem to be documentation on the topic.
Your description is not really clear (is Tag1 the key in the JSON, or is Tag the key and 1 the index?).
But I had a similar issue recently, where I wanted to submit a basic list in JSON and let WTForms handle it properly.
For instance, this:
{
"name": "John",
"tags": ["code", "python", "flask", "wtforms"]
}
So, I had to rewrite the way FieldList works because WTForms, for some reason, wants a list as "tags-1=XXX,tags-2=xxx".
from wtforms import FieldList
class JSONFieldList(FieldList):
def process(self, formdata, data=None):
self.entries = []
if data is None or not data:
try:
data = self.default()
except TypeError:
data = self.default
self.object_data = data
if formdata:
for (index, obj_data) in enumerate(formdata.getlist(self.name)):
self._add_entry(formdata, obj_data, index=index)
else:
for obj_data in data:
self._add_entry(formdata, obj_data)
while len(self.entries) < self.min_entries:
self._add_entry(formdata)
def _add_entry(self, formdata=None, data=None, index=None):
assert not self.max_entries or len(self.entries) < self.max_entries, \
'You cannot have more than max_entries entries in this FieldList'
if index is None:
index = self.last_index + 1
self.last_index = index
name = '%s-%d' % (self.short_name, index)
id = '%s-%d' % (self.id, index)
field = self.unbound_field.bind(form=None, name=name, id=id, prefix=self._prefix, _meta=self.meta,
translations=self._translations)
field.process(formdata, data)
self.entries.append(field)
return field
On Flask's end to handle the form:
from flask import request
from werkzeug.datastructures import ImmutableMultiDict
@app.route('/add', methods=['POST'])
def add():
form = MyForm(ImmutableMultiDict(request.get_json()))
# process the form, form.tags.data is a list
And the form (notice the use of JSONFieldList):
class MonitorForm(BaseForm):
name = StringField(validators=[validators.DataRequired(), validators.Length(min=3, max=5)], filters=[lambda x: x or None])
tags = JSONFieldList(StringField(validators=[validators.DataRequired(), validators.Length(min=1, max=250)], filters=[lambda x: x or None]), validators=[Optional()])
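For what it's worth, a hedged usage sketch with Flask's test client, assuming the /add route and the form above are registered on an app object:
# Hypothetical smoke test for the JSON list handling above.
with app.test_client() as client:
    client.post(
        "/add",
        json={"name": "John", "tags": ["code", "python", "flask", "wtforms"]},
    )
# Inside the view, form.tags.data should be ['code', 'python', 'flask', 'wtforms'],
# because ImmutableMultiDict maps a list value to multiple entries for "tags".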
I found a viable solution in this 2015 book, where a tagging system is built for Flask as part of a blog-building exercise.
It's based on Flask_SQLAlchemy.
Entering lists is therefore possible with WTForms / Flask by submitting the items to the database (e.g. via FieldList) and, in the use case of a tagging system, reading them back from the database to render them in the UI.
If, however, you don't want to deal with O'Reilly's paywall (I'm sorry, I can't post copyrighted material here) and all you want is a way to add tags, check out taggle.js by Sean Coker. It's not Flask but JavaScript, and it does the job.
My goal is to make an alert in ElastAlert for this scenario: no events have occurred between midnight and 2 am (for any date). The problem is how to write a query to Elasticsearch that matches any date but a specific time window, because you cannot use regexp or wildcard on a timestamp field of type 'date'. Any suggestions?
This code returns "Parse failure":
"range": {
"timestamp": {
"gte": "20[0-9]{2}-[0-9]{2}-[0-9]{2}T00:00:00.000Z",
"lt": "20[0-9]{2}-[0-9]{2}-[0-9]{2}T02:00:00.000Z"
}
}
Handling it in a custom rule is ideal.
I wrote the following to do the same kind of filtering:
Note: the dependencies used (dateutil, elastalert.utils) are already bundled with the ElastAlert framework.
import dateutil.parser
from ruletypes import RuleType
# elastalert.util includes useful utility functions
# such as converting from timestamp to datetime obj
from util import ts_to_dt
# Modified version of http://elastalert.readthedocs.io/en/latest/recipes/adding_rules.html#tutorial
# to catch events happening outside a certain time range
class OutOfTimeRangeRule(RuleType):
""" Match if input time is outside the given range """
# Time range specified by including the following properties in the rule:
required_options = set(['time_start', 'time_end'])
# add_data will be called each time Elasticsearch is queried.
# data is a list of documents from Elasticsearch, sorted by timestamp,
# including all the fields that the config specifies with "include"
def add_data(self, data):
for document in data:
# Convert the timestamp to a time object
login_time = document['@timestamp'].time()
# Convert time_start and time_end to time objects
time_start = dateutil.parser.parse(self.rules['time_start']).time()
time_end = dateutil.parser.parse(self.rules['time_end']).time()
# If time is outside office hours
if login_time < time_start or login_time > time_end:
# To add a match, use self.add_match
self.add_match(document)
# The results of get_match_str will appear in the alert text
def get_match_str(self, match):
return "logged in outside %s and %s" % (self.rules['time_start'], self.rules['time_end'])
def garbage_collect(self, timestamp):
pass
I didn't have permission to write custom rules, so my solution was to make a change in Logstash: I added a field hour_of_day whose value is derived from the timestamp. This makes it possible to create a flatline rule with a filter like this:
filter:
- query:
query_string:
query: "hour_of_day: 0 OR hour_of_day: 1"
I've followed the guide in the QuerySet documentation (https://docs.djangoproject.com/en/1.10/ref/models/querysets/#update-or-create), but I think I'm getting something wrong:
My script checks an inbox for maintenance emails from our ISP, then sends us a calendar invite if you are subscribed and adds the maintenance to the database.
Sometimes we get updates on already-planned maintenance, in which case I need to update the database with the new date and time. So I'm trying to use update_or_create for the queryset, using the reference number from the email to update or create the record.
#Maintenance
if sender.lower() == 'maintenance@isp.com':
print 'Found maintenance in mail: {0}'.format(subject)
content = Message.getBody(mail)
postcodes = re.findall(r"[A-Z]{1,2}[0-9R][0-9A-Z]? [0-9][A-Z]{2}", content)
if postcodes:
print 'Found Postcodes'
else:
error_body = """
Email titled: {0}
With content: {1}
Failed processing, could not find any postcodes in the email
""".format(subject,content)
SendMail(authentication,site_admins,'Unprocessed Email',error_body)
Message.markAsRead(mail)
continue
times = re.findall("\d{2}/\d{2}/\d{4} \d{2}:\d{2}", content)
if times:
print 'Found event Times'
e_start_time = datetime.strftime(datetime.strptime(times[0], "%d/%m/%Y %H:%M"),"%Y-%m-%dT%H:%M:%SZ")
e_end_time = datetime.strftime(datetime.strptime(times[1], "%d/%m/%Y %H:%M"),"%Y-%m-%dT%H:%M:%SZ")
subscribers = []
clauses = (Q(site_data__address__icontains=p) for p in postcodes)
query = reduce(operator.or_, clauses)
sites = Circuits.objects.filter(query).filter(circuit_type='MPLS', provider='KCOM')
subject_text = "Maintenance: "
m_ref = re.search('\[(.*?)\]',subject).group(1)
if not len(sites):
#try use first part of postcode
h_pcode = postcodes[0].split(' ')
sites = Circuits.objects.filter(site_data__postcode__startswith=h_pcode[0]).filter(circuit_type='MPLS', provider='KCOM')
if not len(sites):
#still cant find a site, send error
error_body = """
Email titled: {0}
With content: {1}
I have found a postcode, but could not find any matching sites to assign this maintenance too, therefore no meeting has been sent
""".format(subject,content)
SendMail(authentication,site_admins,'Unprocessed Email',error_body)
Message.markAsRead(mail)
continue
else:
#have site(s) send an invite and create record
for s in sites:
#create record in circuit maintenance
maint = CircuitMaintenance(
circuit = s,
ref = m_ref,
start_time = e_start_time,
end_time = e_end_time,
notes = content
)
maint, CircuitMaintenance.objects.update_or_create(ref=m_ref)
#create subscribers for maintenance
m_ref is the unique field that should match for the update, but every time I run this in tests I get:
sites_circuitmaintenance.start_time may not be NULL
but I've set it?
If you want to update certain fields provided that a record with certain values exists, you need to explicitly provide the defaults as well as the lookup field names.
Your code should look like this:
CircuitMaintenance.objects.update_or_create(
    defaults={'circuit': s, 'start_time': e_start_time, 'end_time': e_end_time, 'notes': content},
    ref=m_ref)
The particular error you are seeing occurs because update_or_create is creating an object (one with ref=m_ref does not exist yet), but you are not passing in values for all the NOT NULL fields. The code above will fix that.
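For completeness, a hedged sketch of how the loop in the question might consume the (object, created) tuple that update_or_create returns, using the same field names as above:
# Inside the `for s in sites:` loop from the question.
maint, created = CircuitMaintenance.objects.update_or_create(
    ref=m_ref,
    defaults={
        'circuit': s,
        'start_time': e_start_time,
        'end_time': e_end_time,
        'notes': content,
    },
)
# `created` is True when a new maintenance record was inserted,
# False when an existing record with this ref was updated instead.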
I have a Python app that runs a query on BigQuery and appends the results to a file. I've run this on a Mac workstation (Yosemite) and on a GCE instance (Ubuntu 14.1), and the floating point results differ. How can I make them the same? The Python environments are the same on both.
Run on the Google Cloud instance:
1120224,2015-04-06,23989,866,55159.71274162368,0.04923989554019882,0.021414467106578683,0.03609987911125933,63.69481840834143
54897577,2015-04-06,1188089,43462,2802473.708558333,0.051049132980100984,0.021641920553251377,0.03658143455582873,64.4810111950286
Run on the Mac workstation:
1120224,2015-04-06,23989,866,55159.712741623654,0.049239895540198794,0.021414467106578683,0.03609987911125933,63.694818408341405
54897577,2015-04-06,1188089,43462,2802473.708558335,0.05104913298010102,0.021641920553251377,0.03658143455582873,64.48101119502864
import sys
import pdb
import json
from collections import OrderedDict
from csv import DictWriter
from pprint import pprint
from apiclient import discovery
from apiclient.errors import HttpError
from oauth2client import tools
from oauth2client.client import AccessTokenRefreshError
import functools
import argparse
import httplib2
import time
from subprocess import call
def authenticate_SERVICE_ACCOUNT(service_acct_email, private_key_path):
""" Generic authentication through a service accounts.
Args:
service_acct_email: The service account email associated
with the private key private_key_path: The path to the private key file
"""
from oauth2client.client import SignedJwtAssertionCredentials
with open(private_key_path, 'rb') as pk_file:
key = pk_file.read()
credentials = SignedJwtAssertionCredentials(
service_acct_email,
key,
scope='https://www.googleapis.com/auth/bigquery')
http = httplib2.Http()
auth_http = credentials.authorize(http)
return discovery.build('bigquery', 'v2', http=auth_http)
def create_query(number_of_days_ago):
""" Create a query
Args:
number_of_days_ago: Default value of 1 gets yesterday's data
"""
q = 'SELECT xxxxxxxxxx'
return q;
def translate_row(row, schema):
"""Apply the given schema to the given BigQuery data row.
Args:
row: A single BigQuery row to transform.
schema: The BigQuery table schema to apply to the row, specifically
the list of field dicts.
Returns:
Dict containing keys that match the schema and values that match
the row.
Adapted from bigquery client
https://github.com/tylertreat/BigQuery-Python/blob/master/bigquery/client.py
"""
log = {}
#pdb.set_trace()
# Match each schema column with its associated row value
for index, col_dict in enumerate(schema):
col_name = col_dict['name']
row_value = row['f'][index]['v']
if row_value is None:
log[col_name] = None
continue
# Cast the value for some types
if col_dict['type'] == 'INTEGER':
row_value = int(row_value)
elif col_dict['type'] == 'FLOAT':
row_value = float(row_value)
elif col_dict['type'] == 'BOOLEAN':
row_value = row_value in ('True', 'true', 'TRUE')
log[col_name] = row_value
return log
def extractResult(queryReply):
""" Extract a result from the query reply. Uses schema and rows to translate.
Args:
queryReply: the object returned by bigquery
"""
#pdb.set_trace()
result = []
schema = queryReply.get('schema', {'fields': None})['fields']
rows = queryReply.get('rows',[])
for row in rows:
result.append(translate_row(row, schema))
return result
def writeToCsv(results, filename, ordered_fieldnames, withHeader=True):
""" Create a csv file from a list of rows.
Args:
results: list of rows of data (first row is assumed to be a header)
ordered_fieldnames: a dict with names of fields in the order desired - names must exist in results header
withHeader: a boolean to indicate whether to write out the header -
Set to false if you are going to append data to existing csv
"""
try:
the_file = open(filename, "w")
writer = DictWriter(the_file, fieldnames=ordered_fieldnames)
if withHeader:
writer.writeheader()
writer.writerows(results)
the_file.close()
except:
print "Unexpected error:", sys.exc_info()[0]
raise
def runSyncQuery (client, projectId, query, timeout=0):
results = []
try:
print 'timeout:%d' % timeout
jobCollection = client.jobs()
queryData = {'query':query,
'timeoutMs':timeout}
queryReply = jobCollection.query(projectId=projectId,
body=queryData).execute()
jobReference=queryReply['jobReference']
# Timeout exceeded: keep polling until the job is complete.
while(not queryReply['jobComplete']):
print 'Job not yet complete...'
queryReply = jobCollection.getQueryResults(
projectId=jobReference['projectId'],
jobId=jobReference['jobId'],
timeoutMs=timeout).execute()
# If the result has rows, print the rows in the reply.
if('rows' in queryReply):
#print 'has a rows attribute'
#pdb.set_trace();
result = extractResult(queryReply)
results.extend(result)
currentPageRowCount = len(queryReply['rows'])
currentRow = currentPageRowCount
# Loop through each page of data
while('rows' in queryReply and currentRow < int(queryReply['totalRows'])):
queryReply = jobCollection.getQueryResults(
projectId=jobReference['projectId'],
jobId=jobReference['jobId'],
startIndex=currentRow).execute()
if('rows' in queryReply):
result = extractResult(queryReply)
results.extend(result)
currentRow += len(queryReply['rows'])
except AccessTokenRefreshError:
print ("The credentials have been revoked or expired, please re-run"
"the application to re-authorize")
except HttpError as err:
print 'Error in runSyncQuery:'
pprint(err.content)
except Exception as err:
print 'Undefined error: %s' % err
return results;
# Main
if __name__ == '__main__':
# Name of file
FILE_NAME = "results.csv"
# Default prior number of days to run query
NUMBER_OF_DAYS = "1"
# BigQuery project id as listed in the Google Developers Console.
PROJECT_ID = 'xxxxxx'
# Service account email address as listed in the Google Developers Console.
SERVICE_ACCOUNT = 'xxxxxx#developer.gserviceaccount.com'
KEY = "/usr/local/xxxxxxxx"
query = create_query(NUMBER_OF_DAYS)
# Authenticate
client = authenticate_SERVICE_ACCOUNT(SERVICE_ACCOUNT, KEY)
# Get query results
results = runSyncQuery (client, PROJECT_ID, query, timeout=0)
#pdb.set_trace();
# Write results to csv without header
ordered_fieldnames = OrderedDict([('f_split',None),('m_members',None),('f_day',None),('visitors',None),('purchasers',None),('demand',None), ('dmd_per_mem',None),('visitors_per_mem',None),('purchasers_per_visitor',None),('dmd_per_purchaser',None)])
writeToCsv(results, FILE_NAME, ordered_fieldnames, False)
# Backup current data
backupfilename = "data_bk-" + time.strftime("%y-%m-%d") + ".csv"
call(['cp','../data/data.csv',backupfilename])
# Concatenate new results to data
with open("../data/data.csv", "ab") as outfile:
with open("results.csv","rb") as infile:
line = infile.read()
outfile.write(line)
You mention that these come from aggregate sums of floating point data. As Felipe mentioned, floating point is awkward; it violates some of the mathematical identities that we tend to assume.
In this case, the associative property is the one that bites us. That is, usually (A+B)+C == A+(B+C). However, in floating point math, this isn't the case. Each operation is an approximation; you can see this better if you wrap with an 'approx' function: approx(approx(A+B) + C) is clearly different from approx(A + approx(B+C)).
If you think about how bigquery computes aggregates, it builds an execution tree, and computes the value to be aggregated at the leaves of the tree. As those answers are ready, they're passed back up to the higher levels of the tree and aggregated (let's say they're added). The "when they're ready" part makes it non-deterministic.
A node may get results back in the order A, B, C the first time and C, A, B the second time. This means that the order of the additions will change, since you'll get approx(approx(A + B) + C) the first time and approx(approx(C + A) + B) the second time. Note that since we're dealing with ordering, it may look like the commutative property is the problematic one, but it isn't; A+B in floating point math is the same as B+A. The problem is really that you're adding partial results, which aren't associative.
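A tiny self-contained illustration of that non-associativity (the numbers here are just an example, not taken from the question's data):
a, b, c = 0.1, 0.2, 0.3

left = (a + b) + c    # 0.6000000000000001
right = a + (b + c)   # 0.6

print(left == right)  # False: the grouping changes the rounded result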
Floating point math has all sorts of nasty properties and should usually be avoided if you rely on precision.
Assume floating point is non-deterministic:
https://randomascii.wordpress.com/2013/07/16/floating-point-determinism/
“the IEEE standard does not guarantee that the same program will
deliver identical results on all conforming systems.”