I am streaming AWS CloudWatch logs (from a Node.js Lambda application) to an AWS Elasticsearch cluster, so that I can view metrics in Kibana.
Some of the data I was streaming was numeric, but was being logged as strings. I've updated the application code to log these as numeric values; however, I can't use numeric visualizations in Kibana on those fields because the field type is now mixed -- i.e. in Kibana settings it says 13 fields are defined as several types (string, integer, etc.) across the indices that match this pattern...
Is there a straightforward way to force ES / Kibana to treat that field as always numeric? Or convert all of the older logged data from string to number?
My searches have indicated I can do this with some kind of mutation using the ES API, but I can't track down what this API call would actually look like. Disclaimer: Elasticsearch noob.
There are two approaches here:
Convert all the data from strings to numeric values. Essentially, you'll have to reindex the whole data set (you can't just change a field's type with one click), making sure that the strings are converted / typecast to numeric values. The best way to reindex is to use Ingest Node Pipelines (a minimal sketch follows after this list).
Pros: Visualizations built on this data will be fast, as the data is already in numeric format.
Cons: If the data set is huge, this conversion can take a long time.
Keep all data in string format as is, and use Scripted Fields in Kibana to convert the data to numeric format at runtime, e.g. whenever you visualize.
Pros: No need to set up a whole new pipeline to convert the data.
Cons: Visualizations on large timeframes might be too slow / heavy for your infrastructure.
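As a minimal sketch of approach 1 (not from the original answer; the field name "myfield" and the index names are placeholders), you could define a convert processor and run it through the _reindex API:

PUT _ingest/pipeline/string-to-double
{
  "description": "convert myfield from string to double",
  "processors": [
    { "convert": { "field": "myfield", "type": "double", "ignore_missing": true } }
  ]
}

POST _reindex
{
  "source": { "index": "logs-old" },
  "dest": { "index": "logs-old-converted", "pipeline": "string-to-double" }
}

After reindexing, point your Kibana index pattern (or an alias) at the converted index so old and new data share one numeric mapping.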
Here is the scripted field I created, thanks to Abhishek's answer:
String key = 'myfield';
// If the .keyword sub-field exists, this index mapped the value as a string, so parse it.
if (doc.containsKey(key + '.keyword')) {
    key += '.keyword';
    if (doc[key].size() != 0 && doc[key] != null) {
        if (doc[key].value instanceof String) {
            return Double.parseDouble(doc[key].value);
        }
    }
} else if (doc.containsKey(key) && doc[key].size() != 0 && doc[key] != null) {
    // Otherwise the field is already numeric in this index; return it as-is.
    return doc[key].value;
}
Related
I have data that I have to join at the record level. For example, data about users is coming in from different source systems, but there is no common primary key or user identifier.
Example Data
Source System 1:
{userid = 123, first_name="John", last_name="Smith", many other columns...}
Source System 2:
{userid = EFCBA-09DA0, fname="J.", lname="Smith", many other columns...}
There are about 100 rules I can use to compare one record to another
to see if the customer in source system 1 is the same as the one in source system 2.
Some rules may be able to infer record values and add data to a master record about a customer.
Because some rules may infer/add data to any particular record, the rules must be re-applied when a record changes.
We have millions of records per day we'd have to unify.
Apache Beam / Dataflow implementation
An Apache Beam DAG is, by definition, acyclic, but I could just republish the data through Pub/Sub to the same DAG to make the algorithm effectively cyclic.
I could create a PCollection of hashmaps that continuously do a self-join against all other elements, but this seems like an inefficient method.
Immutability of a PCollection is a problem if I want to be constantly modifying things as they go through the rules. This sounds like it would be more efficient with Flink Gelly or Spark GraphX.
Is there any way you know of in Dataflow to process such a problem efficiently?
Other thoughts
Prolog: I tried running on a subset of this data with a subset of the rules, but SWI-Prolog did not seem scalable, and I could not figure out how I would continuously emit the results to other processes.
Drools/Jess/Rete: Forward chaining would be perfect for the inference and efficient partial application, but this algorithm is more about applying many, many rules to individual records, rather than inferring record information from possibly related records.
Graph database: Something like Neo4j or Datomic would be nice, since joins are at the record level rather than row/column scans, but I don't know if it's possible in Beam to do something similar.
BigQuery or Spanner: Brute-forcing these rules in SQL and doing full table scans per record is really slow. It would be much preferable to keep the graph of all records in memory and compute in-memory. We could also try to concatenate all columns and run multiple compare-and-update passes across all columns.
Or maybe there's a more standard way of solving this class of problems.
It is hard to say what solution works best for you from what I can read so far. I would try to split the problem further and try to tackle different aspects separately.
From what I understand, the goal is to combine the matching records that represent the same thing in different sources:
records come from a number of sources:
it is logically the same data but formatted differently;
there are rules to tell if the records represent the same entity:
the collection of rules is static;
So, the logic probably roughly goes like:
read a record;
try to find existing matching records;
if matching record found:
update it with new data;
otherwise save the record for future matching;
repeat;
To me this looks very high level and there's probably no single 'correct' solution at this level of detail.
I would probably try to approach this by first understanding it in more detail (maybe you already do); a few thoughts:
what are the properties of the data?
are there patterns? E.g. when one system publishes something, do you expect something else from other systems?
what are the requirements in general?
latency, consistency, availability, etc;
how is data read from the sources?
can all the systems publish the records in batches in files, or submit them into PubSub, or does your solution need to poll them, etc.?
can the data be read in parallel or is it a single stream?
Then the main question of how you can efficiently match a record in general will probably look different under different assumptions and requirements as well. For example, I would think about:
can you fit all data in memory;
are your rules dynamic? Do they change at all, and what happens when they do;
can you split the data into categories that can be stored separately and matched efficiently, e.g. if you know you can try to match some things by id field, some other things by hash of something, etc. (a rough sketch of this idea follows after this list);
do you need to match against all of historical/existing data?
can you have some quick elimination logic to not do expensive checks?
what is the output of the solution? What are the requirements for the output?
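To make the "split into categories" idea above concrete, here is a very rough sketch in the Beam Python SDK (this is not your implementation: the record fields, the blocking key, and the apply_rules placeholder are all hypothetical):

import apache_beam as beam

# Hypothetical blocking key: normalised last name + first initial, so the
# expensive ~100 rules only run within small candidate groups.
def blocking_key(record):
    last = (record.get("last_name") or record.get("lname") or "").lower()
    first = (record.get("first_name") or record.get("fname") or "")[:1].lower()
    return (last, first)

def apply_rules(records):
    # Placeholder for the real matching/merging rules; here we just emit the group.
    yield list(records)

records = [
    {"userid": 123, "first_name": "John", "last_name": "Smith"},
    {"userid": "EFCBA-09DA0", "fname": "J.", "lname": "Smith"},
]

with beam.Pipeline() as p:
    (
        p
        | "Read" >> beam.Create(records)  # replace with your real source, e.g. Pub/Sub
        | "KeyByBlock" >> beam.Map(lambda r: (blocking_key(r), r))
        | "GroupByBlock" >> beam.GroupByKey()
        | "MatchWithinBlock" >> beam.FlatMap(lambda kv: apply_rules(kv[1]))
        | "Print" >> beam.Map(print)
    )

Whether this is efficient depends heavily on how selective the blocking key is and on whether matching can be restricted to a bounded window of history rather than all existing data.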
I want to store arbitrary key value pairs. For example,
{:foo "bar" ; string
:n 12 ; long
:p 1.2 ; float
}
In datomic, I'd like to store it as something like:
[{:kv/key "foo"
:kv/value "bar"}
{:kv/key "n"
:kv/value 12}
{:kv/key "p"
:kv/value 1.2}]
The problem is :kv/value can only have one type in Datomic. A solution is to split :kv/value into :kv/value-string, :kv/value-long, :kv/value-float, etc. It comes with its own issues, like making sure only one value attribute is used at a time. Suggestions?
If you could give more details on your specific use-case it might be easier to figure out the best answer. At this point it is a bit of a mystery why you may want to have an attribute that can sometimes be a string, sometimes an int, etc.
From what you've said so far, your only real answer is to have different attributes like value-string etc. This is like a SQL DB, where you have only one type per table column and would need different columns to store a string, an integer, etc.
As your problem shows, any tool (such as a DB) is designed with certain assumptions. In this case the DB assumes that each "column" (attribute in Datomic) is always of the same type. The DB also assumes that you will (usually) want to have data in all columns/attrs for each record/entity.
In your problem you are contradicting both of these assumptions. While you can still use the DB to store information, you will have to write custom functions to ensure only one attribute (value-string, value-int, etc.) is in use at a time. You probably want custom insertion functions like "insert-str-val", "insert-int-val", etc., as well as custom read functions like "read-str-val". It might also be a good idea to have a validation function that could accept any record/entity and verify that exactly one "type" was in use at any given time.
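As a minimal sketch of such a validation function (plain Clojure, not a Datomic API; the attribute names just follow the value-string / value-long / value-float split discussed above):

(def value-attrs #{:kv/value-string :kv/value-long :kv/value-float})

(defn assert-single-value-type
  "Throws unless exactly one typed :kv/value-* attribute is present on the entity map."
  [entity]
  (let [present (filterv (partial contains? entity) value-attrs)]
    (when (not= 1 (count present))
      (throw (ex-info "Entity must use exactly one typed value attribute"
                      {:entity entity :present present})))
    entity))

;; e.g. (assert-single-value-type {:kv/key "n" :kv/value-long 12}) returns the entity unchanged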
You can emulate a key-value store with heterogeneous values by making :kv/key a :db.unique/identity attribute, and by making :kv/value either bytes-typed or string-typed and encoding the values in the format you like (e.g. Fressian / Nippy for :db.type/bytes, EDN / JSON for :db.type/string). I advise that you set :db/index to false for :kv/value in this case.
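For illustration, a sketch of the string-typed variant of that schema (attribute definitions only; the EDN encoding choice is just an example):

[{:db/ident       :kv/key
  :db/valueType   :db.type/string
  :db/cardinality :db.cardinality/one
  :db/unique      :db.unique/identity}
 {:db/ident       :kv/value
  :db/valueType   :db.type/string   ; holds pr-str'd EDN, e.g. "\"bar\"", "12", "1.2"
  :db/cardinality :db.cardinality/one
  :db/index       false}]

Writes would pr-str the value, and reads would clojure.edn/read-string it back.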
Notes:
you will have limited query power, as the values will not be indexed and will need to be deserialized for each query.
If you want to run transaction functions which read or write the values (e.g. for data migrations), you should make your encoding / decoding library available to the Transactor as well.
If the values are large (say, over 20kb), don't store them in Datomic; use a complementary storage service like AWS S3 and store a URL.
I am currently using the BETA 2 version of WSO2 ESB tooling. It is a great improvement over the previous version I used, which was the ALPHA version. My question is: is there a way to manipulate an attribute that was mapped from the Data Mapper? For example, if I have a response field "minute" with a data type of integer and a value of 150, and I want to concatenate a string onto that integer, so the result should be "150 min".
For this, you can do a type conversion using the 'ToString' operation and then use the 'Concat' operation to concatenate the two strings. You can find more details on the latest improvements of Data Mapper at: https://nuwanpallewela.wordpress.com/2016/07/16/understanding-wso2-data-mapper-5-0-0/
I am new to Flume (and to HDFS), so I hope my question is not stupid.
I have a multi-tenant application (about 100 different customers as of now).
I have 16 different data types.
(In production, we have approx. 15 million messages/day through our
RabbitMQ)
I want to write all my events to HDFS, separated by tenant, data type, and date, like this:
/data/{tenant}/{data_type}/2014/10/15/file-08.csv
Is it possible with one sink definition? I don't want to duplicate
configuration, and new clients arrive every week or so.
In the documentation, I see
agent1.sinks.hdfs-sink1.hdfs.path = hdfs://server/events/%Y/%m/%d/%H/
Is this possible?
agent1.sinks.hdfs-sink1.hdfs.path = hdfs://server/events/%tenant/%type/%Y/%m/%d/%H/
I want to write to different folders according to my incoming data.
Yes, this is indeed possible. You can use either the metadata or some field in the incoming data to decide which folder the output goes to.
For example, in my case I am getting different types of log data, and I want to store it in the respective folders accordingly. Also, in my case the first word in my log lines is the file name. Here is the config snippet for that.
Interceptor:
dataplatform.sources.source1.interceptors = i3
dataplatform.sources.source1.interceptors.i3.type = regex_extractor
dataplatform.sources.source1.interceptors.i3.regex = ^(\\w*)\t.*
dataplatform.sources.source1.interceptors.i3.serializers = s1
dataplatform.sources.source1.interceptors.i3.serializers.s1.name = filename
HDFS Sink:
dataplatform.sinks.sink1.type = hdfs
dataplatform.sinks.sink1.hdfs.path = hdfs://server/events/provider=%{filename}/years=%Y/months=%Y%m/days=%Y%m%d/hours=%H
Hope this helps.
A possible solution may be to write an interceptor which passes the tenant value as an event header.
Please refer to the link below:
http://hadoopi.wordpress.com/2014/06/11/flume-getting-started-with-interceptors/
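For example, a rough sketch along the same lines as the regex_extractor example above (the regex and the assumption that each event body starts with "<tenant>,<type>," are placeholders; adjust to however the tenant is actually carried in your events):

agent1.sources.source1.interceptors = i1
agent1.sources.source1.interceptors.i1.type = regex_extractor
agent1.sources.source1.interceptors.i1.regex = ^([^,]+),([^,]+),.*
agent1.sources.source1.interceptors.i1.serializers = s1 s2
agent1.sources.source1.interceptors.i1.serializers.s1.name = tenant
agent1.sources.source1.interceptors.i1.serializers.s2.name = type
agent1.sinks.hdfs-sink1.hdfs.path = hdfs://server/events/%{tenant}/%{type}/%Y/%m/%d/%H/

Note that the time-based escapes (%Y, %m, ...) require a timestamp header on the events, e.g. via the timestamp interceptor or hdfs.useLocalTimeStamp = true.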
I have recently started looking into the Google Charts API for possible use within the product I'm working on. When constructing the URL for a given chart, the data points can be specified in three different formats: unencoded, using simple encoding, and using extended encoding (http://code.google.com/apis/chart/formats.html). However, there seems to be no way around the fact that the highest value possible to specify for a data point is achieved using extended encoding and is in that case 4095 (encoded as "..").
Am I missing something here or is this limit for real?
When using the Google Chart API, you will usually need to scale your data yourself so that it fits within the 0-4095 range required by the API.
For example, if you have data values from 0 to 1,000,000 then you could divide all your data by 245 so that it fits within the available range (1000000 / 245 ≈ 4081).
Regarding data scaling, this may also help you:
http://code.google.com/apis/chart/formats.html#data_scaling
Note the chds parameter option.
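For example, with text-encoded data you can pass the raw values and let chds define the scale (a made-up URL just to show the parameters):

http://chart.apis.google.com/chart?cht=lc&chs=300x200&chd=t:0,250000,500000,1000000&chds=0,1000000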
You may also wish to consider leveraging a wrapper API that abstracts away some of these ugly details. They are listed here:
http://groups.google.com/group/google-chart-api/web/useful-links-to-api-libraries
I wrote charts4j, which has functionality to help you deal with data scaling.