I am trying to integrate InfluxDB with my application and process the output. I am importing the InfluxDBClient package to connect to an InfluxDB instance running on my local machine, and I am using query(), which returns data in the influxdb.resultset.ResultSet format.
However, I want to be able to pick each element specifically from the ResultSet for my computations. I tried different functions like keys(), items(), and values() from the influxdb-python manual here, but to no avail:
http://influxdb-python.readthedocs.io/en/latest/api-documentation.html
This is the sample output of the query():
Result: ResultSet({'(u'cpu', None)': [{u'usage_guest_nice': 0, u'usage_user': 0.90783871790308868, u'usage_nice': 0, u'usage_steal': 0, u'usage_iowait': 0.056348610076366427, u'host': u'xxx.xxx.hostname.com', u'usage_guest': 0, u'usage_idle': 98.184322579062794, u'usage_softirq': 0.0062609566755314457, u'time': u'2016-06-26T16:25:00Z', u'usage_irq': 0, u'cpu': u'cpu-total', u'usage_system': 0.84522915123660536}]})
I am also finding it hard to get the data in JSON format using the raw attribute mentioned in the above link. Any pointers on how to process the above output would be great.
items() returns tuples in the format ((u'cpu', None), <generator>), where the generator can be used to loop over the actual data as dictionaries. Took some time for me to figure out, but it was fun!
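A minimal sketch of that loop, with rs being the ResultSet returned by query() above:

for (measurement, tags), points in rs.items():
    for point in points:  # points is a generator of dicts
        print(measurement, point['time'], point['usage_user'])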
According to the docs, you can use the get_points() function to retrieve results from an InfluxDB ResultSet. The function allows you to filter by measurement, by tag, by both measurement AND tag, or simply get all the results without any filtering.
Getting all points
Using rs.get_points() will return a generator for all the points in the ResultSet.
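For example (a minimal sketch mirroring the filtered cases below, where cli is the connected InfluxDBClient):

rs = cli.query("SELECT * from cpu")
all_points = list(rs.get_points())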
Filtering by measurement
Using rs.get_points('cpu') will return a generator for all the points that are in a series with the measurement name cpu, no matter the tags.
rs = cli.query("SELECT * from cpu")
cpu_points = list(rs.get_points(measurement='cpu'))
Filtering by tags
Using rs.get_points(tags={'host_name': 'influxdb.com'}) will return a generator for all the points that are tagged with the specified tags, no matter the measurement name.
rs = cli.query("SELECT * from cpu")
cpu_influxdb_com_points = list(rs.get_points(tags={"host_name": "influxdb.com"}))
Filtering by measurement and tags
Using a measurement name and tags will return a generator for all the points that are in a series with the specified measurement name AND whose tags match the given tags.
rs = cli.query("SELECT * from cpu")
points = list(rs.get_points(measurement='cpu', tags={'host_name': 'influxdb.com'}))
I have a vector of nominal values and I need to know the probability of each of the nominal values occurring. Basically, I need those to obtain the min, max, mean, and std of the probability of observing the nominal values, and to get the class entropy value.
For example, let's assume there is a dataset in which the target is 0, 1, or 2. In the training dataset, we can count the number of records whose target is 1 and call it n_1; similarly we can define n_0 and n_2. Then the probability of observing class 1 in the training dataset is simply p_1 = n_1 / (n_0 + n_1 + n_2). Once p_0, p_1, and p_2 are obtained, one can get the min, max, mean, and std of these probabilities.
It is easy to get that in Python with pandas, but I want to avoid reading the dataset twice. I was wondering if there is any CAS action in SAS that can provide it for me. Note that I use the Python API of SAS through swat, so I need the answer in Python.
I found the following solution and it works fine. It uses s.dataPreprocess.highcardinality to get the number of classes and then uses s.dataPreprocess.binning to obtain the number of observations within each class. Then, there is just some straightforward calculation.
import numpy as np
import pandas as pd
import swat

# connect to a CAS server ('server' and 'port' are your CAS host and port)
s = swat.CAS(server, port)

# load the table
tbl_name = 'hmeq'
s.upload("./data/hmeq.csv", casout=dict(name=tbl_name, replace=True))

# load the action set used below
s.loadactionset(actionset="dataPreprocess")

# get the number of classes of the target variable ('target_var' is its column name)
cardinality_result = s.dataPreprocess.highcardinality(table=tbl_name, vars=[target_var])
cardinality_result_df = pd.DataFrame(cardinality_result["HighCardinalityDetails"])
number_of_classes = int(cardinality_result_df["CardinalityEstimate"])

# call the dataPreprocess.binning action to get the number of observations in each class
result_binning = s.dataPreprocess.binning(table=tbl_name, vars=[target_var], nBinsArray=[number_of_classes])
result_binning_df = pd.DataFrame(result_binning["BinDetails"])

# class probabilities and their summary statistics
probs = result_binning_df["NInBin"] / result_binning_df["NInBin"].sum()
prob_min = probs.min()
prob_max = probs.max()
prob_mean = probs.mean()
prob_std = probs.std()
entropy = -sum(probs * np.log2(probs))
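For comparison, the pandas-only version mentioned in the question would look roughly like this (a sketch, assuming the training data is already in a DataFrame df with target column target_var):

probs = df[target_var].value_counts(normalize=True)  # class probabilities
prob_min = probs.min()
prob_max = probs.max()
prob_mean = probs.mean()
prob_std = probs.std()
entropy = -sum(probs * np.log2(probs))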
I need help to normalize the field "DSC_HASH" inside a single column, delimited by colons.
Input:
Output:
I achieved what I needed with a Java transformation:
1) In the Java transformation, I created 4 output columns: COD1_out, COD2_out, COD3_out, and DSC_HASH_out
2) Then I added the following code:
String [] column_split;
String column_delimiter = ";";
String [] column_data;
String data_delimiter = ":" ;
column_split = DSC_HASH.split(column_delimiter);
COD1_out = COD1;
COD2_out = COD2;
COD3_out = COD3;
for (int i = 0; i < column_split.length; i++){
column_data = column_split[i].split(data_delimiter);
DSC_HASH_out = column_data[0];
generateRow();
}
There is no generic parser or loop construct in Informatica that can take one record and output an arbitrary number of records.
There are some ways you can bypass this limitation:
Using the Java Transformation, as you did, which is probably the easiest... if you know Java :) There may be limitations to performance or multi-threading.
Using a Router or a Normalizer with a fixed number of output records, high enough to cover all your cases, then filtering out empty records. The expressions to extract the fields are a bit complex to write (and maintain).
Using the XML Parser, but you have to convert your data to XML first, and design an XML schema. For example, your first line would be changed into (shown on multiple lines for readability):
<e><n>2320</n><h>-1950312402</h></e>
<e><n>410</n><h>103682488</h></e>
<e><n>4301</n><h>933882987</h></e>
<e><n>110</n><h>-2069728628</h></e>
Using an SQL Transformation or a Stored Procedure Transformation to use database standard or custom functions, but that would result in an SQL query for each input row, which is bad performance-wise.
Using a Custom Transformation. Does anyone want to write C++ for that?
The Java Transformation is clearly a good solution for this situation.
I'm using Python 2.7 with the elasticsearch-dsl package to query my Elasticsearch cluster.
I'm trying to add "from" and "limit" capabilities to the query in order to have pagination in my front end, which presents the documents Elasticsearch returns, but 'from' doesn't work right (i.e., I'm probably not using it correctly).
The relevant code is:
s = Search(using=elastic_conn, index='my_index'). \
filter("terms", organization_id=org_list)
hits = s[my_from:my_size].execute()  # if from = 10 and size = 10, I get 0 documents, although 100 documents match the filters.
My index contains 100 documents.
Even when my filter matches all results (i.e., nothing is filtered out), if I use my_from = 10 and my_size = 10, for instance, I get nothing in hits (no matching documents).
Why is that? Am I misusing from?
Documentation states:
from and size parameters. The from parameter defines the offset from the first result you want to fetch. The size parameter allows you to configure the maximum amount of hits to be returned.
So it seems really straightforward, what am I missing?
The answer to this question can be found in their documentation under the Pagination Section of the Search DSL:
Pagination
To specify the from/size parameters, use the Python slicing API:
s = s[10:20]
# {"from": 10, "size": 10}
The correct usage of these parameters with the Search DSL is just as you would with a Python list, slicing from the starting index to the end index. The size parameter would be implicitly defined as the end index minus the start index.
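Applied to the variables in the question, the end of the slice is the offset plus the page size, not the page size itself; a minimal sketch:

hits = s[my_from:my_from + my_size].execute()  # e.g. from=10, size=10 -> s[10:20]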
Hope this clears things up!
Try to pass from and size params as below:
search = Search(using=elastic_conn, index='my_index'). \
filter("terms", organization_id=org_list). \
extra(from_=10, size=20)
result = search.execute()
I am trying to read values from Google Pub/Sub and Google Cloud Storage and insert those values into BigQuery based on a count condition, i.e., if a value already exists it should not be inserted, otherwise it can be inserted.
My code looks like this:
p_bq = beam.Pipeline(options=pipeline_options1)
logging.info('Start')
"""Pipeline starts. Create creates a PCollection from what we read from Cloud storage"""
test = p_bq | beam.Create(data)
"""The pipeline then reads from pub sub and then combines the pub sub with the cloud storage data"""
BQ_data1 = (p_bq
            | 'readFromPubSub' >> beam.io.ReadFromPubSub('mytopic')
            | beam.Map(parse_pubsub, param=AsList(test)))
where 'data' is the value from Google Cloud Storage and what is read from Pub/Sub is the value from Google Analytics. parse_pubsub returns 2 values: one is the dictionary and the other is a count (which states whether the value is present in the table or not):
count=comparebigquery(insert_record)
return (insert_record,count)
How can I provide the condition for the BigQuery insertion, since the value is in a PCollection?
New edit:
from apache_beam import pvalue

class Process(beam.DoFn):
    def process(self, element, trans):
        if element['id'] in trans:
            # Record id already exists: emit it to the 'present' output.
            yield pvalue.TaggedOutput('present', element)
        else:
            # Record id not found: emit it to the 'absent' output.
            yield pvalue.TaggedOutput('absent', element)
test1 = p_bq | "TransProcess" >> beam.Create(trans)
where trans is the list
BQ_data2 = BQ_data1 | beam.ParDo(Process(), trans=AsList(test1)).with_outputs('present', 'absent')
present_value = BQ_data2.present
absent_value = BQ_data2.absent
Thank you in advance
You could use
beam.Filter(lambda_function)
after the beam.Map step to filter out elements that return False when passed to the lambda_function.
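For instance, a minimal sketch assuming parse_pubsub returns (insert_record, count) pairs and that count is 0 when the record is not yet in the table:

new_records = (BQ_data1
               | 'KeepAbsent' >> beam.Filter(lambda rec_count: rec_count[1] == 0)  # assumes count == 0 means "not in the table"
               | 'ExtractRecord' >> beam.Map(lambda rec_count: rec_count[0]))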
You can split the PCollection in a ParDo function using additional outputs, based on a condition.
Don't forget to provide output tags to the ParDo function with .with_outputs().
And when writing elements of a PCollection to a specific output, use .TaggedOutput().
Then you select the PCollection you need and write it to BigQuery.
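For instance, building on the names from the question's edit (absent_value being the 'absent' tagged output), the write step could look roughly like this; the project, dataset, and table names are placeholders:

absent_value | 'WriteNewRecords' >> beam.io.WriteToBigQuery(
    table='my_table',        # placeholder table name
    dataset='my_dataset',    # placeholder dataset name
    project='my_project',    # placeholder project id
    write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND)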
I am dealing with a high-dimensional and large dataset, so I need to get just the top N outliers from the output of ResultWriter.
Is there some option in ELKI to get just the top N outliers from this output?
The ResultWriter is some of the oldest code in ELKI, and needs to be rewritten. It's rather generic - it tries to figure out how to best serialize output as text.
If you want some specific format, or a specific subset, the proper way is to write your own ResultHandler. There is a tutorial for writing a ResultHandler.
If you want to find the input coordinates in the result,
Database db = ResultUtil.findDatabase(baseResult);
Relation<NumberVector> rel = db.getRelation(TypeUtil.NUMBER_VECTOR_VARIABLE_LENGTH);
will return the first relation containing numeric vectors.
To iterate over the objects sorted by their outlier score, use:
OrderingResult order = outlierResult.getOrdering();
DBIDs ids = order.order(order.getDBIDs());
for (DBIDIter it = ids.iter(); it.valid(); it.advance()) {
  // Output as desired; e.g., stop after the first N iterations to keep only the top N outliers.
}