I have a problem with ingestion time when inserting rows into a QuestDB table.
Table definition:
create table trade1 (
id symbol,
buy_order_id string,
currency string,
price float,
quantity float,
instrument_id symbol,
sell_order_id string,
status string,
subtype symbol,
"type" string,
transact_time timestamp,
buy_trader_id string,
sell_trader_id string
) timestamp(transact_time) PARTITION BY DAY;
I have an ETL process that extracts data from CSV files and inserts it using the JDBC Postgres driver.
When I insert data into the empty table from the first file, it takes ~60s for ~300k rows.
However, the second file takes significantly longer: ~180s.
The fourth file takes over 10 minutes.
All files are similar in number of rows.
Also, when I keep only one symbol column it is faster, but the speed still decreases as more rows are inserted:
create table trade1 (
id string,
buy_order_id string,
currency string,
price float,
quantity float,
instrument_id symbol,
sell_order_id string,
status string,
subtype string,
"type" string,
transact_time timestamp,
buy_trader_id string,
sell_trader_id string
) timestamp(transact_time) PARTITION BY DAY;
Insert times: 15s, 19s, 29s, 37s, 35s, 59s, 62s, 74s, so it's continuously growing.
It seems that ingestion time grows with the number of rows already inserted, but how is that possible when there isn't even an index defined?
server.conf:
data:
  server.conf: |
    cairo.sql.append.page.size = 256
    pg.worker.affinity = 1,2,3,4
    pg.worker.count = 4
    shared.worker.count = 2
QuestDB is deployed on Kubernetes using the Helm chart.
Am I missing some core concept?
Does it still exhibit the slowness if you change instrument_id from SYMBOL to STRING?
If you can ingest your source CSV "as is", the REST import is ridiculously fast.
In my benchmarks, using the PostgreSQL wire protocol was the slowest (taking 2-3x longer) compared to REST /imp and InfluxDB Line Protocol.
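If the source CSV can be placed on the QuestDB server itself, another option worth trying is QuestDB's SQL COPY statement for bulk import. A minimal sketch, assuming the file has been copied under the directory configured by cairo.sql.copy.root and that the file name is a placeholder:
-- bulk-import a CSV from the server-side import root into the existing table
COPY trade1 FROM 'trade1.csv' WITH HEADER true;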
I used this article to read my VPC flow logs and everything worked correctly:
https://aws.amazon.com/blogs/big-data/optimize-performance-and-reduce-costs-for-network-analytics-with-vpc-flow-logs-in-apache-parquet-format/
But my question is that when I follow the documentation and run the create table statement below, querying the table does not return any records.
CREATE EXTERNAL TABLE IF NOT EXISTS vpc_flow_logs (
`version` int,
`account_id` string,
`interface_id` string,
`srcaddr` string,
`dstaddr` string,
`srcport` int,
`dstport` int,
`protocol` bigint,
`packets` bigint,
`bytes` bigint,
`start` bigint,
`end` bigint,
`action` string,
`log_status` string,
`vpc_id` string,
`subnet_id` string,
`instance_id` string,
`tcp_flags` int,
`type` string,
`pkt_srcaddr` string,
`pkt_dstaddr` string,
`region` string,
`az_id` string,
`sublocation_type` string,
`sublocation_id` string,
`pkt_src_aws_service` string,
`pkt_dst_aws_service` string,
`flow_direction` string,
`traffic_path` int
)
PARTITIONED BY (
`aws-account-id` string,
`aws-service` string,
`aws-region` string,
`year` string,
`month` string,
`day` string,
`hour` string
)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
's3://DOC-EXAMPLE-BUCKET/prefix/AWSLogs/aws-account-id={account_id}/aws-service=vpcflowlogs/aws-region={region_code}/'
TBLPROPERTIES (
'EXTERNAL'='true',
'skip.header.line.count'='1'
)
official doc:
https://docs.aws.amazon.com/athena/latest/ug/vpc-flow-logs.html
This create table statement should work after changing the variables like DOC-EXAMPLE-BUCKET/prefix, account_id, and region_code. Why am I getting 0 rows returned for a select * query?
You need to load the partitions first before you can use them.
From the docs:
After you create the table, you load the data in the partitions for querying. For Hive-compatible data, you run MSCK REPAIR TABLE. For non-Hive compatible data, you use ALTER TABLE ADD PARTITION to add the partitions manually.
So if your structure is Hive compatible, you can just run:
MSCK REPAIR TABLE `table name`;
And this will load all your new partitions.
Otherwise you'll have to manually load them using ADD PARTITION:
ALTER TABLE test ADD PARTITION (`aws-account-id`='1', `aws-service`='2' ...) LOCATION 's3://bucket/subfolder/data/accountid1/service2/'
Because manually adding partitions is tedious when your data structure is not Hive compatible, I recommend using partition projection for your table.
To avoid having to manage partitions, you can use partition projection. Partition projection is an option for highly partitioned tables whose structure is known in advance. In partition projection, partition values and locations are calculated from table properties that you configure rather than read from a metadata repository. Because the in-memory calculations are faster than remote look-up, the use of partition projection can significantly reduce query runtimes.
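For example, here is a rough sketch of projection properties for the table above. The property keys mirror the partition column names from the DDL; the ranges and enum values are placeholders to adjust for your account, and this assumes the Hive-compatible key=value folder layout under the table LOCATION:
ALTER TABLE vpc_flow_logs SET TBLPROPERTIES (
  'projection.enabled' = 'true',
  -- string partitions: enumerate known values, or mark them 'injected'
  -- (injected columns must then be pinned with an equality predicate in every query)
  'projection.aws-account-id.type' = 'injected',
  'projection.aws-service.type' = 'enum',
  'projection.aws-service.values' = 'vpcflowlogs',
  'projection.aws-region.type' = 'enum',
  'projection.aws-region.values' = 'us-east-1',
  -- date parts: plain integers with fixed digit counts
  'projection.year.type' = 'integer',
  'projection.year.range' = '2021,2030',
  'projection.month.type' = 'integer',
  'projection.month.range' = '1,12',
  'projection.month.digits' = '2',
  'projection.day.type' = 'integer',
  'projection.day.range' = '1,31',
  'projection.day.digits' = '2',
  'projection.hour.type' = 'integer',
  'projection.hour.range' = '0,23',
  'projection.hour.digits' = '2'
);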
I have a big CSV text file uploaded weekly to an S3 path partitioned by upload date (maybe not important). The schema of these files is the same, the formatting is the same, and the naming conventions are the same. Each file contains ~100 columns and ~1M rows of mixed text/numeric types. The raw data looks like this:
id,date,string,int_values,double_values
"6F87U",2021-03-21,"Text",0,1.1483
"8DU87",2021-03-22,"More text, oh yes",1,2.525
"79LO2",2021-03-23,"Moar, give me moar, text",2,3.485489
When I run a Crawler with everything at defaults and query with Athena like so:
select * from tb_csv_data
...the results in Athena are thus:
id      | date       | string      | int_values   | double_values
"6F87U" | 2021-03-21 | "Text"      | 0            | 1.1483
"8DU87" | 2021-03-22 | "More text  | oh yes"      | 1
"79LO2" | 2021-03-23 | "Moar       | give me moar | text
The problem at this level seems to be with the proper handling (read: ignoring) of commas as delimiters when they appear inside quotation marks. So I created a CSV classifier with the following characteristics, attached it to the Crawler, ran the Crawler again, and the resulting table properties are these:
Input format: org.apache.hadoop.mapred.TextInputFormat
Output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Serde serialization lib: org.apache.hadoop.hive.serde2.OpenCSVSerde
Serde parameters:
  quoteChar: "
  separatorChar: ,
Table properties:
  sizeKey: 4356512114
  objectCount: 3
  UPDATED_BY_CRAWLER: crawler-name
  CrawlerSchemaSerializerVersion: 1.0
  recordCount: 3145398
  averageRecordSize: 1384
  CrawlerSchemaDeserializerVersion: 1.0
  compressionType: none
  columnsOrdered: true
  areColumnsQuoted: true
  delimiter: ,
  typeOfData: file
The resulting table with the same simple Athena query as above seems to be correct:
id    | date       | string                   | int_values | double_values
6F87U | 2021-03-21 | Text, yes                | 0          | 1.1483
8DU87 | 2021-03-22 | More text, oh yes        | 1          | 2.525
79LO2 | 2021-03-23 | Moar, give me moar, text | 2          | 3.485489
The expected automatic inference of data types is supposed to be this (let's simplify and presume the date is correct as a string):
Column name   | Data type
id            | string
date          | string
string        | string
int_values    | bigint (or long)
double_values | double
...but instead they're all strings!
Column name   | Data type
id            | string
date          | string
string        | string
int_values    | string
double_values | string
I need this data to be accurately queryable from Athena as it is, where it is, so what can I do without further processing of the raw data? I suppose I could manually adjust the table properties in the Console, but is that really correct when I need the entire pipeline to be automated? I also want to avoid having to cast types 80+ times in every query, since most of these columns are numeric. What can I do?
Thank you!
The limitation comes from the SerDe that you are using. Refer to the note section in this doc, which has the following explanation:
When you use Athena with OpenCSVSerDe, the SerDe converts all column types to STRING. Next, the parser in Athena parses the values from STRING into actual types based on what it finds. For example, it parses the values into BOOLEAN, BIGINT, INT, and DOUBLE data types when it can discern them. If the values are in TIMESTAMP in the UNIX format, Athena parses them as TIMESTAMP. If the values are in TIMESTAMP in Hive format, Athena parses them as INT. DATE type values are also parsed as INT.
For a date type to be detected, it has to be in UNIX numeric format, such as 1562112000, according to the doc.
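In practice, that means you can keep OpenCSVSerDe for the quoted fields and simply declare (in DDL, or by editing the crawler-created schema in the Glue console) the column types you actually want; Athena then parses the string values into those types at query time. A sketch for the sample above, with the table name and S3 path as placeholders:
CREATE EXTERNAL TABLE IF NOT EXISTS tb_csv_data_typed (
  `id` string,
  `date` string,
  `string` string,
  `int_values` bigint,
  `double_values` double
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  'separatorChar' = ',',
  'quoteChar' = '"'
)
LOCATION 's3://DOC-EXAMPLE-BUCKET/path/to/csv/'
TBLPROPERTIES ('skip.header.line.count' = '1');
The `date` column stays a string here because, per the note above, DATE values would otherwise need to be in UNIX numeric format.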
Below is how the data looks:
Flight Number: SSSVAD123X Date: 2/8/2020 1:04:40 PM Page[s] Printed: 1 Document Name: DownloadAttachment Print Driver: printermodel (printer driver)
I need help creating an Athena create table statement that yields the format below:
Flight Number | Date                | Pages Printed | Document Name      | Print Driver
SSSVAD123X    | 2/8/2020 1:04:40 PM | 1             | DownloadAttachment | printermodel
This is new to me; any direction towards a solution will work.
You may be able to use a regex SerDe to parse your files. It depends on the shape of your data. You only provided a single line, so this assumes that every line in your data files looks the same.
Here's the Athena documentation for the feature: https://docs.aws.amazon.com/athena/latest/ug/apache.html
You should be able to do something like the following:
CREATE EXTERNAL TABLE flights (
flight_number STRING,
`date` STRING,
pages_printed INT,
document_name STRING,
print_driver STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "^Flight Number:\\s+(\\S+)\\s+Date:\\s+(\\S+)\\s+Page\\[s\\] Printed:\\s+(\\S+)\\s+Document Name:\\s+(\\S+)\\s+Print Driver:\\s+(\\S+)\\s+\\(printer driver\\)$"
) LOCATION 's3://example-bucket/some/prefix/'
Each capture group in the regex will map to a column, in order.
Since I don't have access to your data I can't test the regex, unfortunately, so there may be errors in it. Hopefully this example is enough to get you started.
First, make sure your data format uses tab spacing between columns because your sample doesn't seem to have a consistent separator.
Flight Number Date Pages Printed Document Name Print Driver
SSSVAD123X 2/8/2020 1:04:40 PM 1 DownloadAttachment printermodel
As per the AWS documentation, use LazySimpleSerDe for CSV, TSV, and custom-delimited files if your data does not include values enclosed in quotes. You don't need to complicate it with a regex.
Reference: https://docs.aws.amazon.com/athena/latest/ug/supported-serdes.html
As LazySimpleSerDe is the default used by AWS Athena, you don't even need to declare it; see the create table statement for your data sample:
CREATE EXTERNAL TABLE IF NOT EXISTS `mydb`.`mytable` (
`Flight Number` STRING,
`Date` STRING,
`Pages Printed` INT,
`Document Name` STRING,
`Print Driver` STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
ESCAPED BY '\\'
LINES TERMINATED BY '\n'
LOCATION
's3://awsexamplebucket1-logs/AWSLogs/'
You can use an online generator to help you in the future: https://www.hivetablegenerator.com/
From the generator page: "Easily convert any JSON (even complex Nested ones), CSV, TSV, or Log sample file to an Apache HiveQL DDL create table statement."
I am working on Stream Processor 4.3.0. I have come across a scenario where I am putting some data feeds into an RDBMS table using a Siddhi app. Using the Siddhi app, I am entering the data in the RDBMS table as below.
Now I am using another Siddhi app to retrieve the data, but I want to fetch it in the shape shown below,
where the common columns are collapsed into one row and the count column is summed to give the final sum of all counts.
Can someone please guide me on how to proceed here?
Thanks in advance
Here is the app to get the total sum:
#App:name("IncomingStream3")
#App:description("Description of the plan")
-- Please refer to https://docs.wso2.com/display/SP400/Quick+Start+Guide on getting started with SP editor.
--#store(type = 'rdbms', datasource = 'APIM_ANALYTICS_DB')
--#purge(enable='false', interval='60 min', #retentionPeriod(sec='1 day', min='72 hours', hours='90 days', days='1 year', months='2 years', years='3 years'))
define stream TempStatsStream (AGG_TIMESTAMP long, AGG_EVENT_TIMESTAMP long, apiName string, apiVersion string, apiResourcePath string,apiCreator string,username string, applicationConsumerKey string, AGG_LAST_EVENT_TIMESTAMP long, applicationName string, dateTime string, AGG_COUNT int);
define aggregation StatsToCal
from TempStatsStream
select apiName, apiVersion, apiResourcePath, apiCreator, username, applicationName,
applicationConsumerKey, SUM (AGG_COUNT) as totalRequestCount, dateTime
group by apiName, apiVersion, apiResourcePath, username, applicationConsumerKey
aggregate by dateTime every days;
The only change I have made here is that instead of fetching the values from the DB table, I am treating them as a stream (since the aggregation can only be done over a stream, I suppose).
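(For reference, a named aggregation like StatsToCal is normally read back with a within/per join from some other stream; below is a rough sketch reusing the attribute names above, where the triggering stream and the time range are placeholders made up for illustration:)
-- hypothetical stream used only to trigger the retrieval
define stream RequestStream (apiName string);

from RequestStream as r join StatsToCal as s
on r.apiName == s.apiName
within "2019-01-01 00:00:00 +00:00", "2020-01-01 00:00:00 +00:00"
per "days"
select s.apiName, s.totalRequestCount
insert into TotalCountStream;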
It seems you have to group by API, Name1, Name2, and ID? You can use group by, similar to SQL's GROUP BY:
from TriggerStream join APITable
select APIName, Name1, Name2, ID, sum(Count) as totalCount
group by API, Name1, Name2, ID
insert into OutputStream;
Table: Customer
Hashkey: email
Other Attributes: name, address, purchasedamount, datecreated
Sample Data:
"xxx1.xxx.com", "XXXXX1", "no1.street",2500,"10-01-2017 01:02:03"
"xxx2.xxx.com", "XXXXX2", "no2.street",2000,"11-01-2017 04:05:06"
"xxx3.xxx.com", "XXXXX3", "no3.street",4050,"10-02-2017 07:08:09"
"xxx4.xxx.com", "XXXXX4", "no4.street",2800,"11-02-2017 10:11:12"
How do I fetch the customers whose purchase date is between "11-01-2017 00:00:00" and "10-02-2017 00:00:00"?
Looking at your sample data, I don't see an easy way to do it unfortunately. I would say you need to do it in code (Scan all items and filter at the application level).
If changing the data model is an option:
The easiest and recommended approach for dates/times in DynamoDB is to store them in ISO 8601 format, using the String data type.
ISO8601: Date and time values are ordered from the largest to smallest unit of time: year, month (or week), day, hour, minute, second, and fraction of second. The lexicographical order of the representation thus corresponds to chronological order, except for date representations involving negative years. This allows dates to be naturally sorted by, for example, DynamoDB.
If you use your date attribute as a sort key / LSI, you can ask DynamoDB to do the heavy lifting of querying between two dates (within a partition key) by using the BETWEEN comparison operator.
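For illustration, if datecreated were stored as an ISO 8601 string and used as the sort key (or an LSI sort key) under the email partition key, fetching one customer's records between two dates could look roughly like this in PartiQL for DynamoDB (names as above, dates as placeholders):
SELECT *
FROM "Customer"
WHERE email = 'xxx1.xxx.com'
AND datecreated BETWEEN '2017-01-11T00:00:00' AND '2017-02-10T00:00:00'
Because the partition key is pinned with an equality and the sort key uses BETWEEN, this runs as a Query rather than a Scan. Note it still covers only one partition key value at a time; fetching all customers in a date range regardless of email would need a GSI keyed differently, or the Scan-and-filter approach mentioned above.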