How to filter on a collection's field or property when querying on the inventory table (Nexthink's NQL) - nql

Suppose I am trying to filter based on the gpus field/property, which is a collection in the device namespace.
I have a query which gives results (so I believe it's syntactically and semantically correct):
devices | where gpus != null | list gpus
and get results like:
gpus
Nvidia xyz
GeForce abc; ASUS GTx
But neither of the following queries gives a result. Why?
devices | where gpus == "*Nvidia*" | list gpus
devices | where gpus == "Nvidia xyz" | list gpus

Yes, all three queries below are syntactically and semantically correct. They work fine for me and return accurate results.
devices | where gpus != null | list gpus
devices | where gpus == "*Nvidia*" | list gpus
devices | where gpus == "Nvidia xyz" | list gpus
I suggest checking the documentation for comparison operators and verifying whether your GPU name contains spaces or other special characters, which could explain why no results are returned. Even in that case the second query should return results, so I am curious what the difference is. Keep us posted.

gpus is a collection of JSON objects stored as a string in the database. So when querying on the collection field, either pass the entire JSON collection as a string or use a wildcard pattern, like below:
devices | where gpus == "*Nvidia*" | list gpus

Related

How do I keep a running count in DynamoDB without a hot table row?

We have a completely serverless architecture and have been using DynamoDB almost since it was released, but I am struggling to see how to deal with tabulating global numbers at a large scale. Say we have users who choose to do either A or B. We want to keep track of how many users do each, and these actions could happen at a high rate. According to DynamoDB best practices, you are not supposed to write continually to a single row. What is the best way to handle this without using another service like CouchDB or ElastiCache?
You could bucket your users by the first letter of their usernames (or something similar) as the partition key, and either A or B as the sort key, with a regular attribute holding the count.
For example:
PARTITION KEY | SORT KEY | COUNT
--------------------------------
a | A | 5
a | B | 7
b | B | 15
c | A | 1
c | B | 3
The advantage is that you can reduce the risk of hot partitions by spreading your writes across multiple partitions.
Of course, you're trading hot writes for more expensive reads, since now you'll have to scan + filter(A) to get the total count that chose A, and another scan + filter(B) for the total count of B. But if you're writing a bunch and only reading on rare occasions, this may be ok.
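Purely as an illustration of that read side, here is a minimal sketch using the AWS SDK for C++ (the question doesn't specify a language): scan the table and sum the count attribute for one choice. The table name ("choice_counts") and attribute names ("choice", "cnt") are assumptions for the example, not anything from the question, and the caller is assumed to have called Aws::InitAPI already.

#include <aws/core/Aws.h>
#include <aws/dynamodb/DynamoDBClient.h>
#include <aws/dynamodb/model/ScanRequest.h>
#include <aws/dynamodb/model/AttributeValue.h>
#include <cstdlib>
#include <iostream>

// Sum the bucketed counters for one choice ("A" or "B") with a filtered scan.
long long TotalForChoice(const Aws::DynamoDB::DynamoDBClient& client,
                         const Aws::String& choice)
{
    long long total = 0;
    Aws::DynamoDB::Model::ScanRequest request;
    request.SetTableName("choice_counts");       // hypothetical table name
    request.SetFilterExpression("choice = :c");  // sort key holds "A" or "B"
    Aws::Map<Aws::String, Aws::DynamoDB::Model::AttributeValue> values;
    values[":c"].SetS(choice);
    request.SetExpressionAttributeValues(values);

    // A scan may be paginated; keep going until LastEvaluatedKey is empty.
    do {
        auto outcome = client.Scan(request);
        if (!outcome.IsSuccess()) {
            std::cerr << outcome.GetError().GetMessage() << std::endl;
            break;
        }
        const auto& result = outcome.GetResult();
        for (const auto& item : result.GetItems())
            total += std::atoll(item.at("cnt").GetN().c_str());  // "cnt" = count attribute
        if (result.GetLastEvaluatedKey().empty())
            break;
        request.SetExclusiveStartKey(result.GetLastEvaluatedKey());
    } while (true);
    return total;
}

The write side is just an atomic ADD increment against one of the bucketed rows, so the expensive part really is this read.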

DynamoDB Schema Design

I'm thinking of using Amazon AWS DynamoDB for a project that I'm working on. Here's the gist of the situation:
I'm going to be gathering a ton of energy usage data for hundreds of machines (energy readings are taken around every 5 minutes). Each machine is in a zone, and each zone is in a network.
I'm then going to roll up these individual readings by zone and network, by hour and day.
My thinking is that by doing this, I'll be able to perform one query against DynamoDB on the network_day table, and return the energy usage for any given day quickly.
Here's my schema at this point:
table_name | hash_key | range_key | attributes
______________________________________________________
machine_reading | machine.id | epoch | energy_use
machine_hour | machine.id | epoch_hour | energy_use
machine_day | machine.id | epoch_day | energy_use
zone_hour | machine.id | epoch_hour | energy_use
zone_day | machine.id | epoch_day | energy_use
network_hour | machine.id | epoch_hour | energy_use
network_day | machine.id | epoch_day | energy_use
I'm not immediately seeing that great of performance in tests when I run the rollup cronjob, so I'm just wondering if someone with more experience could comment on my key design? The only experience I have so far is with RDS, but I'm very much trying to learn about DynamoDB.
EDIT:
Basic structure for the cronjob that I'm using for rollups:
foreach network
    foreach zone
        foreach machine
            add_unprocessed_readings_to_dynamo()
            roll_up_fixture_hours_to_dynamo()
            roll_up_fixture_days_to_dynamo()
        end
        roll_up_zone_hours_to_dynamo()
        roll_up_zone_days_to_dynamo()
    end
    roll_up_network_hours_to_dynamo()
    roll_up_network_days_to_dynamo()
end
I use the previous rollup's values in Dynamo for the next rollup, i.e. I use zone hours to roll up zone days, and I then use zone days to roll up network days.
This is what (I think) is causing a lot of unnecessary reads/writes. Right now I can manage with low throughputs because my sample size is only 100 readings. My concerns begin when this scales to what is expected to be around 9,000,000 readings.
First things first, time series data in DynamoDB is hard to do right, but not impossible.
DynamoDB uses the hash key to shard the data, so using machine.id means that some of your keys are going to be hot. However, this is really a function of the amount of data and what you expect your IOPS to be. DynamoDB doesn't create a 2nd shard until you push past 1000 read or write IOPS. If you expect to be well below that level you may be fine, but if you expect to scale beyond that then you may want to redesign; specifically, include a date component in your hash key to break things up.
Regarding performance, are you hitting your provisioned read or write throughput level? If so, raise them to some crazy high level and re-run the test until the bottleneck becomes your code. This could be as simple as setting the throughput level appropriately.
However, regarding your actual code, without seeing the actual DynamoDB queries you are performing, a possible issue would be reading too much data. Make sure you are not reading more data than you need from DynamoDB. Since your range key is a date field, use a range condition (not a filter) to reduce the number of records you need to read.
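As an illustration of that last point, here is a rough sketch (AWS SDK for C++) of a Query that restricts the range key with a key condition instead of filtering after the read. The attribute names (machine_id, epoch) are assumptions based on the schema above, and #ts is just an alias to avoid any reserved-word clash.

#include <aws/core/Aws.h>
#include <aws/dynamodb/DynamoDBClient.h>
#include <aws/dynamodb/model/QueryRequest.h>
#include <aws/dynamodb/model/AttributeValue.h>
#include <string>

// Fetch only the readings for one machine inside [start, end] (epoch seconds).
// The key condition is applied before items are read, unlike a filter expression.
Aws::Vector<Aws::Map<Aws::String, Aws::DynamoDB::Model::AttributeValue>>
ReadingsInRange(const Aws::DynamoDB::DynamoDBClient& client,
                const Aws::String& machineId, long long start, long long end)
{
    Aws::DynamoDB::Model::QueryRequest request;
    request.SetTableName("machine_reading");
    request.SetKeyConditionExpression("machine_id = :m AND #ts BETWEEN :start AND :end");

    Aws::Map<Aws::String, Aws::String> names;
    names["#ts"] = "epoch";  // alias for the range key attribute
    request.SetExpressionAttributeNames(names);

    Aws::Map<Aws::String, Aws::DynamoDB::Model::AttributeValue> values;
    values[":m"].SetS(machineId);
    values[":start"].SetN(std::to_string(start).c_str());
    values[":end"].SetN(std::to_string(end).c_str());
    request.SetExpressionAttributeValues(values);

    auto outcome = client.Query(request);
    if (!outcome.IsSuccess())
        return {};  // real code should inspect outcome.GetError()
    return outcome.GetResult().GetItems();
}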
Make sure your code executes the rollup using multiple threads. If you are not able to saturate the DynamoDB provisioned capacity, the issue may not be DynamoDB; it may be your code. By performing the rollups using multiple threads in parallel you should be able to see some performance gains.
What's the provisioned throughput on the tables you are using? How are you performing the rollup? Are you reading everything and filtering, or filtering on range keys, etc.?
Do you even need a rollup cron job in this situation?
Why not use a table for the readings
machine_reading | machine.id | epoch_timestamp | energy_use
and a table for the aggregates
hash_key can be aggregate type and range key can be aggregate name
example:
zone, zone1
zone, zone3
day, 03/29/1940
When getting machine data, dump it into the first table, and after that use atomic counters to increment the entries in the 2nd table:
http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/WorkingWithItems.html#WorkingWithItems.AtomicCounters
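For reference, a minimal sketch (AWS SDK for C++) of the atomic-counter increment described in that link. The table name ("aggregates") and attribute names ("agg_type", "agg_name", "energy_use") are assumptions for the example.

#include <aws/core/Aws.h>
#include <aws/dynamodb/DynamoDBClient.h>
#include <aws/dynamodb/model/UpdateItemRequest.h>
#include <aws/dynamodb/model/AttributeValue.h>
#include <string>

// Atomically add `delta` to the running total for one aggregate row,
// e.g. IncrementAggregate(client, "zone", "zone1", 42).
bool IncrementAggregate(const Aws::DynamoDB::DynamoDBClient& client,
                        const Aws::String& aggType,
                        const Aws::String& aggName,
                        long long delta)
{
    Aws::DynamoDB::Model::UpdateItemRequest request;
    request.SetTableName("aggregates");
    request.AddKey("agg_type", Aws::DynamoDB::Model::AttributeValue().SetS(aggType));
    request.AddKey("agg_name", Aws::DynamoDB::Model::AttributeValue().SetS(aggName));

    // ADD creates the attribute if it does not exist and increments it atomically.
    request.SetUpdateExpression("ADD energy_use :d");
    Aws::Map<Aws::String, Aws::DynamoDB::Model::AttributeValue> values;
    values[":d"].SetN(std::to_string(delta).c_str());
    request.SetExpressionAttributeValues(values);

    return client.UpdateItem(request).IsSuccess();
}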

Fast way to lookup entries with duplicates

I am looking for a way to help me quickly look up duplicate entries in some sort of table structure in a very efficient way. Here's my problem:
I have objects of type Cargo where each Cargo has a list of other associated Cargo. Each cargo also has a type associated with it.
so for example:
#include <list>

class Cargo {
public:
    int cargoType;
    std::list<Cargo*> associated;  // other Cargo this one is evaluated against
};
Now, for each cargo object and its associated cargo, there is a certain value assigned based on their types. This evaluation is performed by classes that implement CargoEvaluator.
Now, I have a CargoEvaluatorManager which basically handles connecting everything together. CargoEvaluators are registered with CargoEvaluatorManager, then, to evaluate cargo, I call CargoEvaluatorManager.EvaluateCargo(Cargo* cargo).
Here's the current state
#include <vector>

class CargoEvaluatorManager {
    // evaluators[typeA][typeB] -> evaluator for that pair of cargo types (or nullptr)
    std::vector<std::vector<CargoEvaluator*>> evaluators;

public:
    double EvaluateCargo(Cargo* cargo)
    {
        double value = 0.0;
        for (auto& associated : cargo->associated) {
            auto evaluator = evaluators[cargo->cargoType][associated->cargoType];
            if (evaluator != nullptr)
                value += evaluator->Evaluate(cargo, associated);
        }
        return value;
    }
};
So to recap and mention a few extra points:
CargoEvaluatorManager stores CargoEvaluators in a 2-D array using cargo types as indices. The entire 2D vector is initialized with nullptrs. When a CargoEvaluator is registered, resizing the array and the other bits and pieces I haven't shown here are handled appropriately.
I had tried using a map with std::pair as a key to look up different evaluators, but it is too slow.
This only allows one CargoEvaluator per combination of cargo types. I potentially want to have multiple cargo evaluators as well.
I am calling EvaluateCargo tens of billions of times. I am aware my current implementation is not the most efficient and am looking for alternatives.
What I am looking for
As stated above, I want to do much of what I've outlined, with the exception that I want to allow multiple Evaluators for each pair of Cargo types. What I envision, naively, is a table like this:
--------------------------------------------
|int type 1 | int type 2 | CargoEvaluator* |
|------------------------------------------|
| 5 | 7 | Eval1* |
| 5 | 7 | Eval2* |
| 4 | 6 | Eval3* |
--------------------------------------------
The lookup should be symmetric in that the set (5,7) and (7,5) should resolve to the same entries. For speed I don't mind preloading duplicate entries.
There are maybe 3x or more associated Cargo in the list than there are Evaluators, if that factors into things.
Performance is crucial, as mentioned above!
For bonus points, each cargo evaluator may have an additional penalty value associated with it that does not depend on how many associated Cargo a Cargo has. In other words: if a row in the table above is looked up, I want to call double Evaluator->AddPenalty() once and only once each time EvaluateCargo is called. I cannot store any instance variables since it would cause multithreading issues.
One added constraint is I need to be able to identify the CargoEvaluators associated with a particular cargotype, meaning that hashing the two cargotypes together is not a viable option.
If any further clarification is needed, I'll gladly try to help.
Thank you all in advance!
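(Not an answer from the original thread, but a minimal sketch of the kind of table described above, assuming cargo types are small dense integers: a 2-D grid where each cell holds a vector of evaluators, with registration mirrored so that (5,7) and (7,5) resolve to the same entries. All names here are hypothetical.)

#include <vector>
#include <cstddef>

class CargoEvaluator;  // from the question above

// Hypothetical symmetric lookup table: one vector of evaluators per (typeA, typeB) pair.
class EvaluatorTable {
    std::vector<std::vector<std::vector<CargoEvaluator*>>> table;  // [typeA][typeB] -> evaluators

public:
    explicit EvaluatorTable(std::size_t maxTypes)
        : table(maxTypes, std::vector<std::vector<CargoEvaluator*>>(maxTypes)) {}

    // Register under both (a,b) and (b,a) so lookups are symmetric.
    void Register(int a, int b, CargoEvaluator* evaluator)
    {
        table[a][b].push_back(evaluator);
        if (a != b)
            table[b][a].push_back(evaluator);
    }

    // All evaluators registered for this pair, in registration order.
    const std::vector<CargoEvaluator*>& Find(int a, int b) const
    {
        return table[a][b];
    }
};

EvaluateCargo would then loop over Find(cargo->cargoType, associated->cargoType), and the once-per-call penalty could be handled by collecting the distinct evaluators touched during a single EvaluateCargo call in a small local container and calling AddPenalty on each afterwards, which avoids storing per-evaluator state.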

What do ECU units, CPU cores and memory mean when I launch an instance

When I launch an instance on EC2, it gives me options like t1.micro, m1.small, m1.large etc. There is a comparison chart of vCPU, ECU, CPU cores, Memory, Instance store. Is this memory the RAM of the system?
I am not able to understand what all these terms refer to. Can anyone give me a clear picture of what these terms mean?
ECU = EC2 Compute Unit. More from here: http://aws.amazon.com/ec2/faqs/#What_is_an_EC2_Compute_Unit_and_why_did_you_introduce_it
Amazon EC2 uses a variety of measures to provide each instance with a consistent and predictable amount of CPU capacity. In order to make it easy for developers to compare CPU capacity between different instance types, we have defined an Amazon EC2 Compute Unit. The amount of CPU that is allocated to a particular instance is expressed in terms of these EC2 Compute Units. We use several benchmarks and tests to manage the consistency and predictability of the performance from an EC2 Compute Unit. One EC2 Compute Unit provides the equivalent CPU capacity of a 1.0-1.2 GHz 2007 Opteron or 2007 Xeon processor. This is also the equivalent to an early-2006 1.7 GHz Xeon processor referenced in our original documentation. Over time, we may add or substitute measures that go into the definition of an EC2 Compute Unit, if we find metrics that will give you a clearer picture of compute capacity.
For Linux instances I've figured out that ECU can be estimated with sysbench:
sysbench --num-threads=128 --test=cpu --cpu-max-prime=50000 --max-requests=50000 run
The ECU value can then be derived from the total time t (in seconds) with the formula:
ECU = 1925 / t
And my example test results:
| instance type | time (s) | ECU |
|---------------|----------|-----|
| m1.small      | 1735.62  | 1   |
| m3.xlarge     | 147.62   | 13  |
| m3.2xlarge    | 74.61    | 26  |
| r3.large      | 295.84   | 7   |
| r3.xlarge     | 148.18   | 13  |
| m4.xlarge     | 146.71   | 13  |
| m4.2xlarge    | 73.69    | 26  |
| c4.xlarge     | 123.59   | 16  |
| c4.2xlarge    | 61.91    | 31  |
| c4.4xlarge    | 31.14    | 62  |
Responding to the forum thread for the sake of completeness: Amazon has stopped using the ECU (EC2 Compute Unit) and moved on to a vCPU-based measure, so ignoring the ECU you can pretty much compare EC2 instance sizes by CPU clock speed, number of vCPUs, RAM, storage, etc.
Every instance family's configurations are published as a number of vCPUs together with the underlying physical processor. Detailed info and screenshots can be obtained here: http://aws.amazon.com/ec2/instance-types/#instance-type-matrix
ECUs (EC2 Compute Units) are a rough measure of processor performance that was introduced by Amazon to let you compare their EC2 instances ("servers").
CPU performance is of course a multi-dimensional measure, so putting a single number on it (like "5 ECU") can only be a rough approximation. If you want to know more exactly how well a processor performs for a task you have in mind, you should choose a benchmark that is similar to your task.
In early 2014, there was a nice benchmarking site comparing cloud hosting offers by tens of different benchmarks, over at CloudHarmony benchmarks. However, this seems gone now (and archive.org can't help as it was a web application). Only an introductory blog post is still available.
Also useful: ec2instances.info, which at least aggregates the ECU information of different EC2 instances for comparison. (Add column "Compute Units (ECU)" to make it work.)

How can I code this problem? (C++)

I am writing a simple game which stores datasets in a 2D grid (like a chess board). Each cell in the grid may contain a single integer (0 means the cell is empty). If the cell contains a number > 0, it is said to be "filled". The set of "filled" cells on the grid is known as a "configuration".
My problem is being able to "recognize" a particular configuration, regardless of where the configuration of cells sits in the MxN grid.
The problem (in my mind), breaks down into the following 2 sub problems:
Somehow "normalising" the position of a configuration (for e.g. "rebasing" its position to (0,0), such that the smallest rectangle containing all cells in the configuration has its left vertice at (0,0) in the MxN grid
Computing some kind of similarity metric (or maybe simply set difference?), to determine if the current "normalised" configuration is one of the known configurations (i.e. "recognized")
I suspect that I will need to use std::set (one of the few STL containers I haven't used so far, in all my years as a C++ coder!). I would appreciate any ideas/tips from anyone who has solved such a problem before. Any code snippets, pseudocode and/or links would be very useful indeed.
Similarity metrics are a massive area of academic research. You need to decide what level of sophistication is required for your task. It may be that you can simply drag a "template pattern" across your board, raster style, and for each location score +1 for a hit and -1 for a miss, then sum the score. Then find the location where the template got the highest score. This score_max is then your similarity metric for that template. If this method is inadequate then you may need to go into more detail about the precise nature of the game.
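A minimal sketch of that raster-style template match, assuming the board and template are stored as row-major vectors of ints where non-zero means "filled" (the names and the interpretation of "hit" as cells agreeing are my own, not from the answer):

#include <vector>
#include <algorithm>
#include <limits>

// Slide `tmpl` (tw x th) over `board` (bw x bh); at each offset score +1 per
// matching filled/empty cell and -1 per mismatch, and return the best score.
int BestTemplateScore(const std::vector<int>& board, int bw, int bh,
                      const std::vector<int>& tmpl, int tw, int th)
{
    int best = std::numeric_limits<int>::min();
    for (int oy = 0; oy + th <= bh; ++oy) {
        for (int ox = 0; ox + tw <= bw; ++ox) {
            int score = 0;
            for (int y = 0; y < th; ++y)
                for (int x = 0; x < tw; ++x) {
                    bool boardFilled = board[(oy + y) * bw + (ox + x)] > 0;
                    bool tmplFilled  = tmpl[y * tw + x] > 0;
                    score += (boardFilled == tmplFilled) ? 1 : -1;
                }
            best = std::max(best, score);
        }
    }
    return best;
}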
Maybe you could use some hash function to identify configurations. Since you need to recognize patterns even if they are not at the same position on the board, this hash should not depend on the position of the cells but on the way they are organized.
If you store your 2D grid in a 1D array, you would need to find the first filled cell and start calculating the hash from here, until the last filled cell.
Ex:
-----------------
| | | | |
-----------------
| | X | X | |
-----------------
| | | X | |
-----------------
| | | | |
-----------------
----------------+---------------+---------------+----------------
| | | | | | X | X | | | | X | | | | | |
----------------+---------------+---------------+----------------
|_______________________|
|
Compute hash based on this part of the array
However, there are cases where this won't work, e.g. when the pattern is shifted across lines:
-----------------
| | | | X |
-----------------
| X | | | |
----------------- Different configuration in 2D.
| X | | | |
-----------------
| | | | |
-----------------
----------------+---------------+---------------+----------------
| | | | X | X | | | | X | | | | | | | |
----------------+---------------+---------------+----------------
|_______________________|
Seems similar in 1D
So you'll need some way of dealing with these cases. I don't have any solution yet, but I'll try to find something if my schedule allows it!
After thinking a bit about it, maybe you could use two different representations for the grid: one where the lines are appended in a 1D array, and another one where the columns are appended in a 1D array. The hash would be calculated with these two representations, and that would (I think) resolve the problem raised above.
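A rough sketch of that double-representation idea, assuming the grid is a row-major vector of ints: hash the span from the first to the last filled cell in both row-major and column-major order and combine the two results (the specific hashing scheme here is my own illustration, not from the answer):

#include <vector>
#include <cstdint>
#include <cstddef>

// Hash the slice of `cells` between the first and last filled entries (FNV-1a style).
static std::uint64_t HashSpan(const std::vector<int>& cells)
{
    std::size_t first = 0, last = cells.size();
    while (first < cells.size() && cells[first] == 0) ++first;
    while (last > first && cells[last - 1] == 0) --last;

    std::uint64_t h = 0xcbf29ce484222325ull;
    for (std::size_t i = first; i < last; ++i) {
        h ^= static_cast<std::uint64_t>(cells[i] > 0);  // only filled/empty matters
        h *= 0x100000001b3ull;
    }
    return h;
}

// Combine a row-major and a column-major hash so that the line-wrap ambiguity
// described above is much less likely to produce a collision. `grid` has w*h cells.
std::uint64_t ConfigurationHash(const std::vector<int>& grid, int w, int h)
{
    std::vector<int> byColumns(grid.size());
    for (int x = 0; x < w; ++x)
        for (int y = 0; y < h; ++y)
            byColumns[x * h + y] = grid[y * w + x];
    return HashSpan(grid) ^ (HashSpan(byColumns) * 0x9e3779b97f4a7c15ull);
}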
This may be overkill for a small application, but OpenCV has some excellent image recognition and blob finding routines. If you treat the 2D board as an image, and the integer as brightness, it should be possible to use functions from that library.
And the link:
http://opencv.willowgarage.com/documentation/index.html
You can use a neural network for that job.
If you look for neural network shape recognition I think you can find something useful. You can find tons of libraries that may help you, but if you have no experience with NNs this could be a little hard. Still, I think it is the easiest way.
Sounds like you want to feed your chessboard to a neural network trained to recognize the configuration.
This is very similar to the classic examples of image classification, with the only complication being that you don't know exactly where your configuration will appear in the grid, unless you're always considering the whole grid; in that case a classic 2-layer network should work.
HTM neural networks implementations solve the offset problem out-of-the-box. You can find plenty of ready-to-use implementations starting from here. Obviously you'll have to tweak the heck out of the implementations you'll find but you should be able to achieve exactly what you're asking for to my understanding.
If you want to further investigate this route the Numenta forum will be a good starting point.
This reminds me of HashLife which uses QuadTrees. Check out the wikipedia pages on HashLife and Quadtrees.
There's also a link at the bottom of the Wikipedia page to a DrDobbs article about actually implementing such an algorithm: http://www.ddj.com/hpc-high-performance-computing/184406478
I hope those links help. They're interesting reads if nothing else.
As to the first part of the question, i.e. rebasing, try this: make a structure with 2 integers. Declare a pointer of that struct type. Input (or compute) the number of live cells and allocate that much storage (using routines like calloc). Input the coordinates into the structures. Compute the minimum x coordinate and minimum y coordinate. In the universe, assign [x][y] (the user-given or current coordinates) to [x-minx][y-miny]. Though expensive when reading from an already-filled grid, it works and helps with the subsequent part of the question.
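A small sketch of that rebasing step in more idiomatic C++, assuming the live cells are given as a list of coordinates (std::vector instead of the calloc-style allocation described above):

#include <vector>
#include <algorithm>

struct Cell { int x, y; };

// Shift a configuration so that its bounding rectangle starts at (0,0).
std::vector<Cell> Rebase(std::vector<Cell> cells)
{
    if (cells.empty())
        return cells;
    int minX = cells[0].x, minY = cells[0].y;
    for (const Cell& c : cells) {
        minX = std::min(minX, c.x);
        minY = std::min(minY, c.y);
    }
    for (Cell& c : cells) {
        c.x -= minX;
        c.y -= minY;
    }
    return cells;
}

Two normalised configurations can then be compared as sets, for example by sorting the cells or by inserting the coordinate pairs into a std::set as the question already suggests.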
make a structure with 2 integers .Declare a pointer of that struct type. Input (or compute number of live cells) and assign that much storage (using routines like calloc) .Input the coordinates in the structure. Compute the minimum x coordinate and minimum y coordinate. In the universe assign [x][y] (user given values or current coordinate) to [x-minx][y-miny] Though expensive when reading from an already filled grid,but works and helps in the subsequent part of the question.