MapReduce data splitting based on time

I am parsing the mlabs data from Google; the parsed data gives one text record for each packet. I want to split the data into half-hour buckets so that what goes to a reducer is half an hour of data. Is this the best way to get half-hour data, or is there a better way? Can anyone suggest how I can do that?
The parsed data will be in the format:
src dest startTime endTime bytesTransferred
34.456.67.88 23.456.78.9 3453453454555 3453453994555 4564
Thanks

You can use the first second of the 30-minute (1800-second) interval of the epoch timestamp as the key emitted by the Mapper, with the value being the data record (or just the parsed fields of it that you care about).
That way, the Reducer will see (key, List[DataRecord]) pairs like this:
(30-minute-interval-One-start-second) [(Data Record 1a, Data Record 1b, ... Data Record 1k)]
(30-minute-interval-Two-start-second) [(Data Record 2a, Data Record 2b, ... Data Record 2k)]
...
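A minimal mapper sketch of that idea (a sketch only: the class name, the whitespace-split parsing, and the assumption that startTime is an epoch timestamp in milliseconds are illustrative, not from the original post):

// Key each record by the first second of the 30-minute interval its startTime falls into.
// Assumes input lines look like: src dest startTime endTime bytesTransferred
// and that startTime is an epoch timestamp in milliseconds (an assumption).
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class HalfHourBucketMapper extends Mapper<LongWritable, Text, LongWritable, Text> {

    private static final long BUCKET_SECONDS = 1800L; // 30 minutes

    private final LongWritable bucketStart = new LongWritable();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String[] fields = line.toString().trim().split("\\s+");
        if (fields.length < 5) {
            return; // skip malformed records
        }
        long startSeconds = Long.parseLong(fields[2]) / 1000L;
        // First second of the 30-minute (1800-second) interval.
        bucketStart.set((startSeconds / BUCKET_SECONDS) * BUCKET_SECONDS);
        context.write(bucketStart, line);
    }
}

With a plain grouping Reducer, each reduce() call then receives every record whose startTime falls in the same half-hour interval, which is exactly the (key, List[DataRecord]) shape shown above.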

Related

Very slow ingestion to QuestDB when using Postgres wire protocol

I have a problem with ingestion time when inserting rows into a QuestDB table.
Table definition:
create table trade1 (
id symbol,
buy_order_id string,
currency string,
price float,
quantity float,
instrument_id symbol,
sell_order_id string,
status string,
subtype symbol,
"type" string,
transact_time timestamp,
buy_trader_id string,
sell_trader_id string
) timestamp(transact_time) PARTITION BY DAY;
I have an ETL process which extracts data from CSV files and inserts it using the JDBC Postgres driver.
When I insert data into an empty table from the first file, it takes ~60s for ~300k rows.
However, for the second file it takes significantly longer: 180s.
The fourth file takes over 10 minutes.
All files are similar in number of rows.
Also, when I keep only one symbol column it seems to be faster, but the speed still decreases as more rows are inserted:
create table trade1 (
id string,
buy_order_id string,
currency string,
price float,
quantity float,
instrument_id symbol,
sell_order_id string,
status string,
subtype string,
"type" string,
transact_time timestamp,
buy_trader_id string,
sell_trader_id string
) timestamp(transact_time) PARTITION BY DAY;
Insert times: 15s, 19s, 29s, 37s, 35s, 59s, 62s, 74s, so it is continuously growing.
It seems that ingestion time grows together with the number of rows inserted, but how is that possible when there is not even an index defined?
server.conf (from the Helm chart values):
data:
  server.conf: |
    cairo.sql.append.page.size = 256
    pg.worker.affinity = 1,2,3,4
    pg.worker.count = 4
    shared.worker.count = 2
QuestDB is deployed on Kubernetes using Helm chart.
Am I missing some core concept?
Does it still exhibit the slowness if you change instrument_id from SYMBOL to STRING?
If you can ingest your source CSV "as is", the REST import is ridiculously fast.
In my benchmarks, using the PostgreSQL wire protocol was the slowest (taking 2-3x longer) compared to REST /imp and InfluxDB Line Protocol.
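For reference, here is a rough Java sketch of what the REST route could look like. It is only a sketch under assumptions: QuestDB reachable at localhost:9000, a source file named trades.csv, and the /imp endpoint taking the file as a multipart form field named data (the name query parameter for the target table is also an assumption to verify against the QuestDB REST docs for your version):

// Sketch: upload a CSV to QuestDB's REST import endpoint (/imp).
// Assumes QuestDB is reachable at localhost:9000 and the file is trades.csv (hypothetical names).
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class QuestDbCsvImport {
    public static void main(String[] args) throws Exception {
        Path csv = Path.of("trades.csv");
        String boundary = "----questdb-import-boundary";

        // java.net.http has no built-in multipart support, so build a minimal
        // multipart/form-data body by hand: header part, file bytes, closing boundary.
        byte[] head = ("--" + boundary + "\r\n"
                + "Content-Disposition: form-data; name=\"data\"; filename=\"trades.csv\"\r\n"
                + "Content-Type: text/csv\r\n\r\n").getBytes(StandardCharsets.UTF_8);
        byte[] file = Files.readAllBytes(csv);
        byte[] tail = ("\r\n--" + boundary + "--\r\n").getBytes(StandardCharsets.UTF_8);

        byte[] body = new byte[head.length + file.length + tail.length];
        System.arraycopy(head, 0, body, 0, head.length);
        System.arraycopy(file, 0, body, head.length, file.length);
        System.arraycopy(tail, 0, body, head.length + file.length, tail.length);

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:9000/imp?name=trade1"))
                .header("Content-Type", "multipart/form-data; boundary=" + boundary)
                .POST(HttpRequest.BodyPublishers.ofByteArray(body))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body()); // /imp responds with a plain-text import summary
    }
}

Trying the same upload with curl first (something like curl -F data=@trades.csv http://localhost:9000/imp) is an easy way to sanity-check the endpoint before wiring it into the ETL job.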

Kinesis Analytics Session or Stagger Window Batching Without Aggregation

I'm looking to use Kinesis Data Analytics (or some other AWS managed service) to batch records based on a filter criteria. The idea would be that as records come in, we'd start a session window and batch any matching records for 15 min.
The stagger window is exactly what we'd like except we're not looking to aggregate the data, but rather just return the records all together.
Ideally...
100 records spread over 15 min. (20 matching criteria) with first one at 10:02
|
v
At 10:17, the 20 matching records would be sent to the destination
I've tried doing something like:
CREATE OR REPLACE STREAM "DESTINATION_SQL_STREAM" (
"device_id" INTEGER,
"child_id" INTEGER,
"domain" VARCHAR(32),
"category_id" INTEGER,
"posted_at" DOUBLE,
"block" TIMESTAMP
);
-- Create pump to insert into output
CREATE OR REPLACE PUMP "STREAM_PUMP" AS INSERT INTO "DESTINATION_SQL_STREAM"
-- Select all columns from source stream
SELECT STREAM
"device_id",
"child_id",
"domain",
"category_id",
"posted_at",
FLOOR("SOURCE_SQL_STREAM_001".ROWTIME TO MINUTE) as block
FROM "SOURCE_SQL_STREAM_001"
WHERE "category_id" = 888815186
WINDOWED BY STAGGER (
PARTITION BY "child_id", FLOOR("SOURCE_SQL_STREAM_001".ROWTIME TO MINUTE)
RANGE INTERVAL '15' MINUTE);
I continue to get errors for all the columns not in the aggregation:
From line 6, column 5 to line 6, column 12: Expression 'domain' is not being used in PARTITION BY sub clause of WINDOWED BY clause
Kinesis Firehose was a suggested solution, but it's a blind window across all child_ids, so it could cut a session up into multiple pieces, and that's what I'm trying to avoid.
Any suggestions? Feels like this might not be the right tool.
Try LAST_VALUE("domain") AS domain in the SELECT clause (and likewise for the other non-partition columns); with a STAGGER window, every selected column that is not in the PARTITION BY clause has to be wrapped in an aggregate function, and LAST_VALUE simply keeps the most recent value seen in the window.

Athena Schema creation when log format has missing fields

I have a custom log format where the log entries vary by request type, so certain rows have more fields.
Can we specify certain fields as optional so that in rows where they are missing, the values will be set to some default (null, 0)?
Here are some hypothetical log entries:
{"data":"[2017-09-10 10:44:54.448998 -0000] info ip=773.555.557.445 cluster=\"production\" query=uris type=TXT class=IN rcode=NXDOMAIN cnt=0 offset=74","header":{"recvtime":"2017-09-10 10:45:02","server":"m0107481","refid":"ABC-123"}}
{"data":"[2017-09-10 10:44:54.457718 -0000] info ip=991.509.704.832 cluster=\"inbound\" query=dnsbl type=A class=IN rcode=NOERROR cnt=1 offset=90 score=400","header":{"recvtime":"2017-09-10 10:45:02","server":"m010748","refid":"ABC-123"}}
{"data":"[2017-09-10 10:44:54.457718 -0000] info ip=971.509.704.832 cluster=\"inbound\" query=dnsbl type=A class=IN rcode=REFUSED cnt=1","header":{"recvtime":"2017-09-10 10:45:02","server":"m010574","refid":"ABC-123"}}
Note that each row of the log data is in JSON format, and the header part is fixed. If query in data is dnsbl, then sometimes the row has a score field, but other times it is missing. I am planning to use Athena to parse this type of data from S3 and query for some stats along the lines of: what % of the data are DNS queries and what % have a score above 300.
It looks like your data is JSON with embedded structured logging in the data field. As long as the data is well-formed JSON with one object per line, you should be able to create a JSON table and then use functions to extract the other pieces out of the data field. You can create a view that does the extraction so that you don't have to do it in every query.
I'm thinking something like this:
CREATE EXTERNAL TABLE raw_log_entries (
data string,
header struct<recvtime: string, server: string, refid: string>
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
LOCATION 's3://some-bucket/and/path/';
CREATE VIEW log_entries AS
SELECT
header.recvtime,
header.server,
header.refid,
regexp_extract(data, 'query=(\S+)', 1) AS query,
regexp_extract(data, 'type=(\S+)', 1) AS type,
regexp_extract(data, 'score=(\S+)', 1) AS score
-- and so on
FROM raw_log_entries;
You'll have to experiment with the regexes; since I don't have your data, I can't be sure they will work for all cases, but I hope you get the idea.

DB2 The syntax of the string representation of a datetime value is incorrect

We have a staging table that's used to load raw data from our suppliers.
One column is used to capture a timestamp, but its data type is varchar(265). The data is dirty: about 40% of the time it is garbage; otherwise it is timestamp data like this:
2011/11/15 20:58:48.041
I have to create a report that filters some dates/timestamps out of that column, but when I try to cast it, I get an error:
db2 => select cast(loadedon as timestamp) from automation
1
--------------------------
SQL0180N The syntax of the string representation of a datetime value is incorrect. SQLSTATE=22007
What do I need to do in order to parse/cast the timestamp string?
The string format for a DB2 timestamp is either:
'2002-10-20-12.00.00.000000'
or
'2002-10-20 12:00:00'
You have to get your date string into one of these formats.
Also, DB2 runs on a 24-hour clock even though the output sometimes uses a 12-hour clock (AM/PM).
So '2002-10-20 14:49:50' for 2:49:50 PM,
or '2002-10-20 00:00:00' for midnight (the output would be 12:00:00 AM).
It seems you have a lot of garbage data, so first of all you should check whether the data is a valid timestamp in the format you expect ('2011/11/15 20:58:48.041'). A simple check is to replace all digits with '0' and compare the result against the expected format:
TRANSLATE(timestamp_column,'0','0123456789','0') = '0000/00/00 00:00:00.000'
If the format is the expected one, you can convert it to a DB2 timestamp. In DB2 for iSeries there is a built-in function, TIMESTAMP_FORMAT, available since V6R1. In your case it will look like this:
TIMESTAMP_FORMAT('2011/11/15 20:58:48.041','YYYY/MM/DD HH24:MI:SS.NNNNNN')
So the combined solution query should look something like this:
SELECT
CASE
WHEN TRANSLATE(timestamp_column,'0','0123456789','0') = '0000/00/00 00:00:00.000'
THEN TIMESTAMP_FORMAT(timestamp_column,'YYYY/MM/DD HH24:MI:SS.NNNNNN')
ELSE NULL
END
FROM
your_table_with_bad_data
EDIT
I just saw your comment that the provider agreed to clean the data. You could use the solution above to speed up the process and clean the data yourself:
ALTER TABLE your_table_with_bad_data ADD COLUMN clean_timestamp TIMESTAMP DEFAULT NULL;
UPDATE your_table_with_bad_data
SET clean_timestamp =
CASE
WHEN TRANSLATE(timestamp_column,'0','0123456789','0') = '0000/00/00 00:00:00.000'
THEN TIMESTAMP_FORMAT(timestamp_column,'YYYY/MM/DD HH24:MI:SS.NNNNNN')
ELSE NULL
END;

How to update a stream with the response from another stream where the sink type is "http-response"

I am trying to enrich my input stream with an additional attribute which gets populated via the "http-response" sink.
I have tried using a "join" with a window attribute, and also the "every" keyword, to merge the two streams and insert the resulting merged stream into another stream to enrich it.
The window attributes (window.time(1 sec) or window.length(1)) and the "every" keyword work well when the incoming events arrive at a regular interval of 1 second or more.
But when many events (say 10 or 100) are sent at the same time (within a second), the result of the merge is not as expected.
The one with "window" attribute (join)
from EventInputStreamOne#window.time(1 sec) as i
join EventInputStreamTwo as s
on i.variable2 == s.variable2
select i.variable1 as variable1, i.variable2 as variable2, s.variable2 as variable2
insert into EventOutputStream;
The one with the "every" keyword
from every e1=EventInputStream,e2=EventResponseStream
select e1.variable1 as variable1, e1.variable2 as variable2, e2.variable3 as variable3
insert into EventOutputStream;
Is there any better way to merge the two streams in order to update a third stream?
To get the original request attributes, you can use custom mapping as follows:
@source(type='http-call-response', sink.id='source-1',
@map(type='json', @attributes(name='name', id='id', volume='trp:volume', price='trp:price')))
define stream responseStream(name string, id int, headers string, volume long, price float);
Here, the request attributes can be accessed with trp:attributeName; in this sample, only name is taken from the response, while price and volume come from the request.
The syntax in your 'every' keyword approach isn't quite right. Have you tried something like this:
from every (e1 = event1) -> e2=event2[e1.variable == e2.variable]
select e1.variable1, e2.variable1, e2.variable2
insert into outputEvent;
This document might help.