How to use ILP data types with QuestDB data types?

I use QuestDB with Influx Line Protocol. When I send a new metric, QuestDB automatically creates a table and uses the DOUBLE type for all my numbers. It also creates SYMBOL for tags, STRING for other string fields in the message, and LONG for numbers ending with 'i'.
So I can send
sensors,location=ny temperature=22,flag=6i,name="out" 1465839830100400200
and it will create
location SYMBOL
temperature DOUBLE
flag LONG
name STRING
How can I control what type is created if I want to have FLOAT instead of DOUBLE or INT, SHORT or BYTE instead of LONG?

Influx Line Protocol does not have the necessary granularity of types to cover all QuestDB types. The way to use different data types is to pre-create the table in QuestDB before sending ILP metrics data; note that the table name must match the ILP measurement name. So if you create a table like this
create table sensors (
location SYMBOL,
temperature FLOAT,
flag BYTE,
name STRING,
date DATE,
ts TIMESTAMP
) timestamp(ts) partition by MONTH
you can then send this ILP data
sensors,location=ny temperature=22,flag=6i,name="out",date=1465839830100400 1465839830100400200
Here date is an epoch value in milliseconds, and ts, after the space, is the designated timestamp in nanoseconds.
The trailing nanosecond timestamp can also be omitted, in which case the server timestamps each message itself at the moment it appends it to the table.
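To double-check which types the table (and therefore ILP) ended up with, you can query the table metadata. A quick check, assuming the table_columns() meta function and SHOW COLUMNS support in recent QuestDB versions (the exact output columns may vary by version):
select * from table_columns('sensors');
-- newer versions also accept: SHOW COLUMNS FROM sensors;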

Related

Storing unix timestamp as an IntegerField [duplicate]

Which one is best to use, DateTime or INT (Unix Timestamp) or anything else to store the time value?
I think INT will be better for performance and also more universal, since it can easily be converted to many time zones (my web visitors from all around the world can see the time without confusion).
But I'm still in doubt about it.
Any suggestions?
I wouldn't use INT or TIMESTAMP to save your datetime values. There is the "Year-2038-Problem"! You can use DATETIME and save your datetimes for a long time.
With TIMESTAMP or numeric column types you can only store a range of years from 1970 to 2038. With the DATETIME type you can save dates with years from 1000 to 9999.
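As a quick illustration of that 2038 boundary (not part of the original answer), the largest signed 32-bit Unix timestamp converts like this in MySQL:
SELECT FROM_UNIXTIME(2147483647);
-- 2038-01-19 03:14:07 when the session time zone is UTC; anything later
-- does not fit in a signed 32-bit TIMESTAMP/INT value.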
It is not recommended to use a numeric column type (INT) to store datetime information. MySQL (and other systems too) provides many functions to handle datetime information. These functions are faster and more optimized than custom functions or calculations: https://dev.mysql.com/doc/refman/5.7/en/date-and-time-functions.html
To convert the time zone of your stored value to the client time zone you can use CONVERT_TZ. In this case you need to know the time zone of the server and the time zone of your client. To get the time zone of the server you can see some possibilities on this question.
Changing the client time zone: The server interprets TIMESTAMP values in the client's current time zone, not its own. Clients in different time zones should set their zone so that the server can properly interpret TIMESTAMP values for them.
And if you want to get the value in a certain time zone you can do this:
CONVERT_TZ(dt, 'US/Central', 'Europe/Berlin') AS Berlin
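Put together, a minimal sketch (the table and column names here are made up for illustration): store the value in a DATETIME column and convert it for the client on read. Named zones require the MySQL time zone tables to be loaded.
CREATE TABLE events (id INT PRIMARY KEY, event_time DATETIME NOT NULL);
INSERT INTO events VALUES (1, '2021-03-01 14:30:00');
-- convert from the zone the value was stored in to the client's zone
SELECT CONVERT_TZ(event_time, 'US/Central', 'Europe/Berlin') AS berlin_time FROM events;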
I wouldn't store it as an INT. You should check out MySQL Cookbook by Paul DuBois; he covers lots of things in it, and there is a big section about your question.

Compare batches of average values with each other in WSO2 Stream Processor

I've written some code in Siddhi that logs/prints the average of a batch of the last 100 events, so the average for events 0-100, 101-200, etc. I now want to compare these averages with each other to find some kind of trend. In the first place I just want to see if there is a simple downward or upward trend over a certain number of averages. For example, I want to compare each average value with the upcoming 1-10 average values.
I've looked into the Siddhi documentation but I did not find the answer I wanted. I tried some solutions with partitioning, but this did not work. The code below is what I have right now.
define stream HBStream(ID int, DateTime string, Result double);
@info(name = 'Average100Query')
from HBStream#window.lengthBatch(100)
select ID, DateTime, Result, avg(Result)
insert into OutputStream;
Siddhi sequences can be used to match the averages and to identify a trend: https://siddhi.io/en/v5.1/docs/query-guide/#sequence
from every e1=OutputStream, e2=OutputStream[e2.avgResult > e1.avgResult], e3=OutputStream[e3.avgResult > e2.avgResult]
select e1.ID, e3.avgResult - e1.avgResult as tempDiff
insert into TempDiffStream;
Please note you have to use a partition to apply this pattern per ID if you need the averages to be calculated per sensor. In your app, also use group by if you need the average per sensor:
@info(name = 'Average100Query')
from HBStream#window.lengthBatch(100)
select ID, DateTime, Result, avg(Result) as avgResult
group by ID
insert into OutputStream;

AWS Redshift: How to store text field with size greater than 100K

I have a text field in a parquet file with max length 141598. I am loading the parquet file into Redshift and got an error while loading, as the maximum a VARCHAR can store is 65535.
Is there any other datatype I can use or another alternative to follow?
Error while loading:
S3 Query Exception (Fetch). Task failed due to an internal error. The length of the data column friends is longer than the length defined in the table. Table: 65535, Data: 141598
No, the maximum length of a VARCHAR data type is 65535 bytes, and that is the longest character type that Redshift is capable of storing. Note that the length is in bytes, not characters, so the actual number of characters stored depends on their byte length.
If the data is already in parquet format then possibly you don't need to load this data into a Redshift table at all; instead you could create a Spectrum external table over it. The external table definition will only support a VARCHAR definition of 65535, the same as a normal table, and any query against the column will silently truncate additional characters beyond that length. However, the original data will be preserved in the parquet file and potentially accessible by other means if needed.
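A rough sketch of the Spectrum route (the external schema name, table, column, and S3 path below are made up; the external schema itself must first be created with CREATE EXTERNAL SCHEMA):
CREATE EXTERNAL TABLE spectrum.friends_data (
    id BIGINT,
    friends VARCHAR(65535)  -- still capped at 65535; longer values are truncated on read
)
STORED AS PARQUET
LOCATION 's3://my-bucket/friends/';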

what is the best practice for storing date and time in class/object?

Recently I started connecting to PostgreSQL, and I need to store date/time in my object to pass to queries for inserting into and updating some tables.
But there is no clear way in C++ to store and retrieve date/time.
Any comments?
PostgreSQL time format (version 9+): https://www.postgresql.org/docs/9.1/datatype-datetime.html
Timestamps are stored as an 8-byte (64-bit) integer with microsecond precision, in UTC without a time zone (see the top of the table in the docs).
When you create a table you can either let PostgreSQL timestamp the record with CURRENT_TIMESTAMP, or insert the value yourself as a 64-bit integer in microseconds. Since PostgreSQL has several time formats, you should decide which format you want to read back from the table.
PostgreSQL approach (CREATE, INSERT, RETRIEVE):
"CREATE TABLE example_table(update_time_column TIMESTAMP DEFAULT CURRENT_TIMESTAMP)"
"INSERT INTO example_table(update_time_column) VALUES update_time_column=CURRENT_TIMESTAMP"
"SELECT (EXTRACT(epoch FROM update_time_column)*1000000) FROM example_table"
C++ approach
int64_t cppTime = std::chrono::duration_cast<std::chrono::microseconds>(
    std::chrono::system_clock::now().time_since_epoch()).count();  // microseconds since the Unix epoch, via <chrono>
something similar to this answer: Getting an accurate execution time in C++ (micro seconds)
Then push your object/record to PostgreSQL. When you retrieve it in microseconds, adjust the precision as needed (divide by 1000 for milliseconds, etc.).
Just don't forget to keep the PostgreSQL and C++ timestamp widths in sync (e.g. 8 bytes on each side); otherwise your value will be truncated on one side and you will lose precision or get unexpected times.
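For completeness, a small sketch of the second route described above: storing the 64-bit microsecond count produced on the C++ side and reading it back in the same unit (to_timestamp() takes seconds, hence the division; AT TIME ZONE 'UTC' keeps the round trip consistent with the UTC-based EXTRACT):
-- value produced by the C++ side: microseconds since the Unix epoch
INSERT INTO example_table(update_time_column)
VALUES (to_timestamp(1465839830100400 / 1000000.0) AT TIME ZONE 'UTC');
-- read it back as microseconds, matching the C++ representation
SELECT (EXTRACT(epoch FROM update_time_column) * 1000000)::bigint FROM example_table;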

Redshift performance for simple time series data

I am using an AWS Redshift table that holds information about invocations of functions.
Each row has a date (of timestamp type), a UID (varchar), and several fields such as duration, error code.
The size of the table is ~25 million rows of ~1000 different functions (each with a different UID).
My problem is that simple queries, such as a count of invocations of several functions in a time window, take a long time - usually 5-30 seconds.
I have tried different combinations of sort keys and dist key, but the performance seems to remain quite similar:
Setting the function UID as dist key
Setting a compound sort key of the date, the function UID and a combination of both in any order.
I have run VACUUM and ANALYZE on the table.
I also tried adding/removing column compression.
I am using only a single dc2.large node.
EDIT:
The table DDL is:
create table public."invocations_metrics_$mig"(
"function_uid" varchar(256) NOT NULL encode RAW DISTKEY
,"date" timestamp encode zstd
,"duration" double precision encode zstd
,"used_memory" integer encode zstd
,"error" smallint encode zstd
,"has_early_exit" boolean encode zstd
,"request_id" varchar(256) encode zstd
)
SORTKEY(date,function_uid);
An example of a row:
"aca500c9-27cc-47f8-a98f-ef71cbc7c0ef","2018-08-15 13:43:28.718",0.17,27,0,false,"30ee84e1-a091-11e8-ba47-b110721c41bc"
The query:
SELECT
count(invocations_metrics_backup.function_uid) AS invocations,
max(invocations_metrics_backup.date) AS last_invocation,
invocations_metrics_backup.function_uid AS uid
FROM
invocations_metrics_backup
WHERE
function_uid IN (
<10 UIDs>
)
AND DATE >= '2018-08-20T10:55:20.222812'::TIMESTAMP
GROUP BY
function_uid
Total time is 5 seconds. The count in each query is ~5000.
For the same query with a ~1M count it takes 30 seconds.
First, you need to use at least 2 nodes. A single node has to do double duty as leader and compute. With 2 or more nodes you get a free leader node.
Then, change your DDL as follows, removing compression on the sort key:
CREATE TABLE public."invocations_metrics_$mig" (
"function_uid" varchar(256) NOT NULL ENCODE ZSTD,
"date" timestamp ENCODE RAW,
"duration" double precision ENCODE ZSTD,
"used_memory" integer ENCODE ZSTD,
"error" smallint ENCODE ZSTD,
"has_early_exit" boolean ENCODE ZSTD,
"request_id" varchar(256) ENCODE ZSTD
)
DISTSTYLE KEY
DISTKEY( function_uid )
SORTKEY ( date )
;
You may also improve performance by mapping unique UIDs to an integer ID value and using that in your queries. UID values are quite inefficient to work with. The values occur randomly and are relatively wide with very high entropy. They are expensive during sorts, hash aggregations, and hash joins.
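A sketch of that last suggestion, with made-up names for the mapping table and surrogate key column:
-- build a small dimension table that assigns each UID a compact integer key
CREATE TABLE function_dim AS
SELECT function_uid,
       ROW_NUMBER() OVER (ORDER BY function_uid) AS function_id
FROM (SELECT DISTINCT function_uid FROM invocations_metrics_backup) u;

-- filter and group on the narrow integer key instead of the wide UID
SELECT d.function_uid,
       COUNT(*) AS invocations,
       MAX(m.date) AS last_invocation
FROM invocations_metrics_backup m
JOIN function_dim d ON d.function_uid = m.function_uid
WHERE d.function_id IN (1, 2, 3)
  AND m.date >= '2018-08-20'::timestamp
GROUP BY d.function_uid;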