Data modeling with document database? - document-database

I am new to working with Data.
So I have a lot of data based on time.
Data row for every 15 mins. Should I compute the data and store data for every 1 hour, 1 day, 1 month on the database?
if I do would this schema be good.
{
_id: "joe",
name: "Joe Bookreader",
time min: [
{
time: "1",
steps: "10"
},
{
time: "2",
steps: "4"
}
]
time day: [
{
time: "1",
steps: "30"
},
{
time: "2",
steps: "30"
}
]
}
If you have any advice on how I can improve my data modeling knowledge with document databases, I would be really grateful.

For a minute step away from programmatic approach to the problem and think about the task at hand.
How are you going to use that data after you stored it? When you use the data it is important for you to know exactly number of steps for a particular user or you want to see a big picture based on the time particular sample points in time.
If you care for per user perspective then your scheme above will work. On the other hand if you want to run global reports like how far along users were on average (or total) during certain time,then I would opt in for schema where your document is time (point in time or range in time), while user and steps are your properties.
Another important concept in database is not to statically store data that can be calculated on the fly. As with any rules there are some exceptions to this. Like Cached values that are short lived and will not have major effect on your application if they are incorrect. Another one is reports, you produced a report for the user based on current values and stored it. If user feels like getting fresh data, user will re-run the report. (I am sure there are few other)
But in most cases the risk that comes with serving stale/wrong data resulting in wrong decision based on that data will outweigh performance benefit of avoiding extra calculations.
The reason I am mentioning this, is because you are storing time min and time day. If time day can be calculated based on time min you should not store it in the database, but rather calculate it on the fly. You can write queries that will produce actual result of time day without using any extra computational power on your application node. All computations will be done on the data node, much more efficiently than a compute node and without network penalties.
I realize this post is a bit old, but I hope my answer will help someone.

Related

Optimizing / speeding up calculation time in Google Sheets

I have asked a few questions related to this personal project of mine already on this platform, and this should be the last one since I am so close to finishing. Below is the link to a mock example spreadsheet I've created, which mimics what my actual project does but it contains less sensitive information and is also smaller in size.
Mock Spreadsheet
Basic rundown of the spreadsheet:
Pulls data from a master schedule which is controlled/edited by another party into the Master Schedule tab.
In the columns adjacent to the imported data, an array formula expands the master schedule by classroom in case some of the time slots designate multiple rooms. Additional formulas adjust the date, start time, and end time to be capped within the current day's 24-hour period. The start time of each class is also made to be an hour earlier.
In the Room Schedule tab, an hourly calendar is created based on the room number in the first column, and only corresponds to the current day.
I have tested the spreadsheet extensively with multiple scenarios, and I'm happy with how everything works except for the calculation time. I figured the two volatile functions I use would take some processing time just by themselves, and I certainly didn't expect this to be lightning-fast especially without using a script, but the project that I am actually implementing this method for is much larger and takes a very long time to update. The purpose of this spreadsheet is to allow users to find an open room and "reserve" it by clicking the checkbox next to it (which will consequently color the entire row red) allowing everyone else to know that it is now taken.
I'd like to know if there is any way to optimize / speed up my spreadsheet, or to not update it every time a checkbox is clicked and instead update it "manually", similar to what OP is asking here. I am not familiar with Apps Script nor am I well-versed in writing code overall, but I am willing to learn - I just need a push in the right direction since I am going into this blind. I know the number of formulas in the Room Schedule tab is probably working against me yet I am so close to what I wanted the final product to be, so any help or insight is greatly appreciated!
Feel free to ask any questions if I didn't explain this well enough.
to speed up things you should avoid usage of the same formulae per each row and make use of arrayformulas. for example:
=IF(AND(TEXT(K3,"m/d")<>$A$1,(M3-L3)<0),K3+1,K3+0)
=ARRAYFORMULA(IF(K3:K<>"",
IF((TEXT(K3:K, "m/d")<>$A$1)*((M3:M-L3:L)<0), K3:K+1, K3:K+0), ))
=IF(AND(TEXT(K3,"m/d")=$A$1,(M3-L3)<0),TIMEVALUE("11:59:59 PM"),M3+0)
=ARRAYFORMULA(IF(K3:K<>"",
IF((TEXT(K3,"m/d")=$A$1)*((M3-L3)<0), TIMEVALUE("11:59:59 PM"), M3:M+0), ))

AWS Machine Learning Data

I'm using the AWS Machine Learning regression to predict the waiting time in a line of a restaurant, in a specific weekday/time.
Today I have around 800k data.
Example Data:
restaurantID (rowID)weekDay (categorical)time (categorical)tablePeople (numeric)waitingTime (numeric - target)1 sun 21:29 2 23
2 fri 20:13 4 43
...
I have two questions:
1)
Should I use time as Categorical or Numeric?
It's better to split into two fields: minutes and seconds?
2)
I would like in the same model to get the predictions for all my restaurants.
Example:
I expected to send the rowID identifier and it returns different predictions, based on each restaurant data (ignoring others data).
I tried, but it's returning the same prediction for any rowID. Why?
Should I have a model for each restaurant?
There are several problems with the way you set-up your model
1) Time in the form you have it should never be categorical. Your model treats times 12:29 and 12:30 as two completely independent attributes. So it will never use facts it learn about 12:29 to predict what's going to happen at 12:30. In your case you either should set time to be numeric. Not sure if amazon ML can convert it for you automatically. If not just multiply hour by 60 and add minutes to it. Another interesting thing to do is to bucketize your time, by selecting which half hour or wider interval. You do it by dividing (h*60+m) by some number depending how many buckets you want. So to try 120 to get 2 hr intervals. Generally the more data you have the smaller intervals you can have. The key is to have a lot of samples in each bucket.
2) You should really think about removing restaurantID from your input data. Having it there will cause the model to over-fit on it. So it will not be able to make predictions about restaurant with id:5 based on the facts it learn from restaurants with id:3 or id:9. Having restaurant id there might be okay if you have a lot of data about each restaurant and you don't care about extrapolating your predictions to the restaurants that are not in the training set.
3) You never send restaurantID to predict data about it. The way it usually works you need to pick what are you trying to predict. In your case probably 'waitingTime' is most useful attribute. So you need to send weekDay, time and number of people and the model will output waiting time.
You should think what is relevant for the prediction to be accurate, and you should use your domain expertise to define the features/attributes you need to have in your data.
For example, time of the day, is not just a number. From my limited understanding in restaurant, I would drop the minutes, and only focus on the hours.
I would certainly create a model for each restaurant, as the popularity of the restaurant or the type of food it is serving is having an impact on the wait time. With Amazon ML it is easy to create many models as you can build the model using the SDK, and even schedule retraining of the models using AWS Lambda (that mean automatically).
I'm not sure what the feature called tablePeople means, but a general recommendation is to have as many as possible relevant features, to get better prediction. For example, month or season is probably important as well.
In contrast with some answers to this post, I think resturantID helps and it actually gives valuable information. If you have a significant amount of data per each restaurant then you can train a model per each restaurant and get a good accuracy, but if you don't have enough data then resturantID is very informative.
1) Just imagine what if you had only two columns in your dataset: restaurantID and waitingTime. Then wouldn't you think the restaurantID from the testing data helps you to find a rough waiting time? In the simplest implementation, your waiting time per each restaurantID would be the average of waitingTime. So definitely restaurantID is a valuable information. Now that you have more features in your dataset, you need to check if restaurantID is as effective as the other features or not.
2) If you decide to keep restaurantID then you must use it as a categorical string. It should be a non-parametric feature in your dataset and maybe that's why you did not get a proper result.
On the issue with day and time I agree with other answers and considering that you are building your model for the restaurant, hourly time may give a more accurate result.

Amount of Test Data needed for load testing of a web service

I am currently working on a project that requires load testing of web services.
One of the services is being called 60,000 times in the production during Busy-Day/Busy-HR.
{PerfTest Env=PROD}
Input Account Number
Output AccountDetails
Do I really need 60,000 unique account numbers(TEST DATA) for this loadrunner script to simulate the production scenario?
If unique data is required, for endurance test I will have to prepare lot of test data for each web service.
If I don't get that much test data, what is the chance of Load Test being affected due to Application Server Cache mechanism??
Can somebody help me?
Thanks
Ram
Are you simulating a day or the highest volume hour in the last year? This can help you to shape the amount of data that you need. Rarely would you start with a 24 hour test. Instead you would be looking at your high water test of an hour with a ramp up and ramp down, so you would need approximately 1.333* your high water hour's worth of data.
So this can drop your 60K to (potentially) 20K(?) I am making an assumption that your worst hour over the last year is somewhere around 1/3 of your traditional day. I have observed this pattern over and over again in different environments over the past two decades. You will want to objectively verify this with log data or query data to support the number in your environment.
Next up, how many of these inquiries are actually unique? You are really going to need a log of the queries across a day (or your high water hour) to determine this. Log processing tools such as Microsoft Logparser or Splunk/Splunk Storm can help you to pull the observed distribution of unique account references within your data, including counts of those which are multiple. Once you know this you can simply use a data file with a fixed block size for each user for unique data and once the data is exhausted the user exits.

Incremental update of millions of records, indexed vs. join

I'm currently developing a strategy for an incremental update of our user data. We assume 100_000_000 records in our database of which approximately 1_000_000 records are updated per workflow.
The idea is to update records in a MapReduce job. Is it useful to use an indexed storage (eg. Cassandra) to be able to access current records randomly? Or is it preferable to retrieve data from HDFS and join new information to existing records.
The record size is O(200 Bytes). The user data has a fixed length but should be extendable. The log events have a similar but not equal structure. The number of user records is likely to grow. Near real-time updates are desirable, ie. a 3 hour time gap is not acceptable, few minutes is OK.
Have you made any experiences with either of these strategies and data of this size?
Is the pig JOIN fast enough? Is it a bottleneck always to read all records? Is Cassandra able to hold this amount of data efficiently? Which solution is scalable? What about the complexity of the system?
You need to define your requirements first. Your record volumes are not a problem, but you don't give a record length. Are they fixed length, fixed field number, likely to change format over time? Are we talking 100 byte records or 100,000 byte records? You need an index on a field/column if you wish to query by that field/column, unless you do all your work using map/reduce. Will the number of user records stay at 100mill (1 server will probably suffice) or will it grow 100% per year ( probably multiple servers adding new ones over time).
How you access records for updating depends on whether you need to update them in real-time or whether you can run a batch job. Will updates be every minute, or hour, or month?
I would strongly suggest you do some experimenting. Have you done any testing already? This will give you a context for your questions and this will lead to more objective questions and answers. It is unlikely that you can 'whiteboard' a solution based on your question.

Amazon SimpleDB Woes: Implementing counter attributes

Long story short, I'm rewriting a piece of a system and am looking for a way to store some hit counters in AWS SimpleDB.
For those of you not familiar with SimpleDB, the (main) problem with storing counters is that the cloud propagation delay is often over a second. Our application currently gets ~1,500 hits per second. Not all those hits will map to the same key, but a ballpark figure might be around 5-10 updates to a key every second. This means that if we were to use a traditional update mechanism (read, increment, store), we would end up inadvertently dropping a significant number of hits.
One potential solution is to keep the counters in memcache, and using a cron task to push the data. The big problem with this is that it isn't the "right" way to do it. Memcache shouldn't really be used for persistent storage... after all, it's a caching layer. In addition, then we'll end up with issues when we do the push, making sure we delete the correct elements, and hoping that there is no contention for them as we're deleting them (which is very likely).
Another potential solution is to keep a local SQL database and write the counters there, updating our SimpleDB out-of-band every so many requests or running a cron task to push the data. This solves the syncing problem, as we can include timestamps to easily set boundaries for the SimpleDB pushes. Of course, there are still other issues, and though this might work with a decent amount of hacking, it doesn't seem like the most elegant solution.
Has anyone encountered a similar issue in their experience, or have any novel approaches? Any advice or ideas would be appreciated, even if they're not completely flushed out. I've been thinking about this one for a while, and could use some new perspectives.
The existing SimpleDB API does not lend itself naturally to being a distributed counter. But it certainly can be done.
Working strictly within SimpleDB there are 2 ways to make it work. An easy method that requires something like a cron job to clean up. Or a much more complex technique that cleans as it goes.
The Easy Way
The easy way is to make a different item for each "hit". With a single attribute which is the key. Pump the domain(s) with counts quickly and easily. When you need to fetch the count (presumable much less often) you have to issue a query
SELECT count(*) FROM domain WHERE key='myKey'
Of course this will cause your domain(s) to grow unbounded and the queries will take longer and longer to execute over time. The solution is a summary record where you roll up all the counts collected so far for each key. It's just an item with attributes for the key {summary='myKey'} and a "Last-Updated" timestamp with granularity down to the millisecond. This also requires that you add the "timestamp" attribute to your "hit" items. The summary records don't need to be in the same domain. In fact, depending on your setup, they might best be kept in a separate domain. Either way you can use the key as the itemName and use GetAttributes instead of doing a SELECT.
Now getting the count is a two step process. You have to pull the summary record and also query for 'Timestamp' strictly greater than whatever the 'Last-Updated' time is in your summary record and add the two counts together.
SELECT count(*) FROM domain WHERE key='myKey' AND timestamp > '...'
You will also need a way to update your summary record periodically. You can do this on a schedule (every hour) or dynamically based on some other criteria (for example do it during regular processing whenever the query returns more than one page). Just make sure that when you update your summary record you base it on a time that is far enough in the past that you are past the eventual consistency window. 1 minute is more than safe.
This solution works in the face of concurrent updates because even if many summary records are written at the same time, they are all correct and whichever one wins will still be correct because the count and the 'Last-Updated' attribute will be consistent with each other.
This also works well across multiple domains even if you keep your summary records with the hit records, you can pull the summary records from all your domains simultaneously and then issue your queries to all domains in parallel. The reason to do this is if you need higher throughput for a key than what you can get from one domain.
This works well with caching. If your cache fails you have an authoritative backup.
The time will come where someone wants to go back and edit / remove / add a record that has an old 'Timestamp' value. You will have to update your summary record (for that domain) at that time or your counts will be off until you recompute that summary.
This will give you a count that is in sync with the data currently viewable within the consistency window. This won't give you a count that is accurate up to the millisecond.
The Hard Way
The other way way is to do the normal read - increment - store mechanism but also write a composite value that includes a version number along with your value. Where the version number you use is 1 greater than the version number of the value you are updating.
get(key) returns the attribute value="Ver015 Count089"
Here you retrieve a count of 89 that was stored as version 15. When you do an update you write a value like this:
put(key, value="Ver016 Count090")
The previous value is not removed and you end up with an audit trail of updates that are reminiscent of lamport clocks.
This requires you to do a few extra things.
the ability to identify and resolve conflicts whenever you do a GET
a simple version number isn't going to work you'll want to include a timestamp with resolution down to at least the millisecond and maybe a process ID as well.
in practice you'll want your value to include the current version number and the version number of the value your update is based on to more easily resolve conflicts.
you can't keep an infinite audit trail in one item so you'll need to issue delete's for older values as you go.
What you get with this technique is like a tree of divergent updates. you'll have one value and then all of a sudden multiple updates will occur and you will have a bunch of updates based off the same old value none of which know about each other.
When I say resolve conflicts at GET time I mean that if you read an item and the value looks like this:
11 --- 12
/
10 --- 11
\
11
You have to to be able to figure that the real value is 14. Which you can do if you include for each new value the version of the value(s) you are updating.
It shouldn't be rocket science
If all you want is a simple counter: this is way over-kill. It shouldn't be rocket science to make a simple counter. Which is why SimpleDB may not be the best choice for making simple counters.
That isn't the only way but most of those things will need to be done if you implement an SimpleDB solution in lieu of actually having a lock.
Don't get me wrong, I actually like this method precisely because there is no lock and the bound on the number of processes that can use this counter simultaneously is around 100. (because of the limit on the number of attributes in an item) And you can get beyond 100 with some changes.
Note
But if all these implementation details were hidden from you and you just had to call increment(key), it wouldn't be complex at all. With SimpleDB the client library is the key to making the complex things simple. But currently there are no publicly available libraries that implement this functionality (to my knowledge).
To anyone revisiting this issue, Amazon just added support for Conditional Puts, which makes implementing a counter much easier.
Now, to implement a counter - simply call GetAttributes, increment the count, and then call PutAttributes, with the Expected Value set correctly. If Amazon responds with an error ConditionalCheckFailed, then retry the whole operation.
Note that you can only have one expected value per PutAttributes call. So, if you want to have multiple counters in a single row, then use a version attribute.
pseudo-code:
begin
attributes = SimpleDB.GetAttributes
initial_version = attributes[:version]
attributes[:counter1] += 3
attributes[:counter2] += 7
attributes[:version] += 1
SimpleDB.PutAttributes(attributes, :expected => {:version => initial_version})
rescue ConditionalCheckFailed
retry
end
I see you've accepted an answer already, but this might count as a novel approach.
If you're building a web app then you can use Google's Analytics product to track page impressions (if the page to domain-item mapping fits) and then to use the Analytics API to periodically push that data up into the items themselves.
I haven't thought this through in detail so there may be holes. I'd actually be quite interested in your feedback on this approach given your experience in the area.
Thanks
Scott
For anyone interested in how I ended up dealing with this... (slightly Java-specific)
I ended up using an EhCache on each servlet instance. I used the UUID as a key, and a Java AtomicInteger as the value. Periodically a thread iterates through the cache and pushes rows to a simpledb temp stats domain, as well as writing a row with the key to an invalidation domain (which fails silently if the key already exists). The thread also decrements the counter with the previous value, ensuring that we don't miss any hits while it was updating. A separate thread pings the simpledb invalidation domain, and rolls up the stats in the temporary domains (there are multiple rows to each key, since we're using ec2 instances), pushing it to the actual stats domain.
I've done a little load testing, and it seems to scale well. Locally I was able to handle about 500 hits/second before the load tester broke (not the servlets - hah), so if anything I think running on ec2 should only improve performance.
Answer to feynmansbastard:
If you want to store huge amount of events i suggest you to use distributed commit log systems such as kafka or aws kinesis. They allow to consume stream of events cheap and simple (kinesis's pricing is 25$ per month for 1K events per seconds) – you just need to implement consumer (using any language), which bulk reads all events from previous checkpoint, aggregates counters in memory then flushes data into permanent storage (dynamodb or mysql) and commit checkpoint.
Events can be logged simply using nginx log and transfered to kafka/kinesis using fluentd. This is very cheap, performant and simple solution.
Also had similiar needs/challenges.
I looked at using google analytics and count.ly. the latter seemed too expensive to be worth it (plus they have a somewhat confusion definition of sessions). GA i would have loved to use, but I spent two days using their libraries and some 3rd party ones (gadotnet and one other from maybe codeproject). unfortunately I could only ever see counters post in GA realtime section, never in the normal dashboards even when the api reported success. we were probably doing something wrong but we exceeded our time budget for ga.
We already had an existing simpledb counter that updated using conditional updates as mentioned by previous commentor. This works well, but suffers when there is contention and conccurency where counts are missed (for example, our most updated counter lost several million counts over a period of 3 months, versus a backup system).
We implemented a newer solution which is somewhat similiar to the answer for this question, except much simpler.
We just sharded/partitioned the counters. When you create a counter you specify the # of shards which is a function of how many simulatenous updates you expect. this creates a number of sub counters, each which has the shard count started with it as an attribute :
COUNTER (w/5shards) creates :
shard0 { numshards = 5 } (informational only)
shard1 { count = 0, numshards = 5, timestamp = 0 }
shard2 { count = 0, numshards = 5, timestamp = 0 }
shard3 { count = 0, numshards = 5, timestamp = 0 }
shard4 { count = 0, numshards = 5, timestamp = 0 }
shard5 { count = 0, numshards = 5, timestamp = 0 }
Sharded Writes
Knowing the shard count, just randomly pick a shard and try to write to it conditionally. If it fails because of contention, choose another shard and retry.
If you don't know the shard count, get it from the root shard which is present regardless of how many shards exist. Because it supports multiple writes per counter, it lessens the contention issue to whatever your needs are.
Sharded Reads
if you know the shard count, read every shard and sum them.
If you don't know the shard count, get it from the root shard and then read all and sum.
Because of slow update propogation, you can still miss counts in reading but they should get picked up later. This is sufficient for our needs, although if you wanted more control over this you could ensure that- when reading- the last timestamp was as you expect and retry.