I'm using Hive for batch-processing of my spatial database. My trace table looks something like this:
object | lat | long | timestamp
1 | X11 | X12 | T11
1 | X21 | X22 | T12
2 | X11 | X12 | T21
1 | X31 | X22 | T13
2 | X21 | X22 | T22
I want to map each lat/long point of each object to a number (think map-matching, for example), but the algorithm needs to consider a number of adjacent data points to produce the result. For example, I need all 3 data points of object 1 in order to map each of those 3 data points to a number; I can't process them one by one.
I'm thinking of using map-reduce in Hive with TRANSFORM, but I'm not sure how to do this. Can someone please help me out?
You can use the custom map/reduce functionality in Hive with the following query:
add file /some/path/identity.pl;
add file /some/path/collect.pl;
from (
  from trace_input
  MAP id, lat, lon, ts
  USING './identity.pl'
  as id, lat, lon, ts
  CLUSTER BY id) map_output
REDUCE id, lat, lon, ts
USING './collect.pl' as id, list;
trace_input contains your trace data as described above:
create table trace_input(id string, lat string, lon string, ts string)
  row format delimited
  fields terminated by '\t'
  stored as textfile;
identity.pl is a simple script to dump out each line (could also be a script to select just the lat, long fields):
#!/usr/bin/perl
while (<STDIN>) {
    print;
}
collect.pl is a simple script which collects consecutive lines with the same object id, keeps the remainder of each line, and emits one line per id containing the id and a comma-separated list of the collected values (tab-separated).
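Since the linked sample isn't reproduced here, a minimal sketch of what collect.pl could look like, assuming tab-separated input that is already clustered/sorted by id:
#!/usr/bin/perl
# Sketch of collect.pl: group consecutive lines sharing an id and emit
# "id<TAB>comma-separated values". Assumes input is already clustered by id.
use strict;
use warnings;

my ($cur_id, @vals);
while (<STDIN>) {
    chomp;
    my ($id, @rest) = split /\t/;
    if (defined $cur_id && $id ne $cur_id) {
        print "$cur_id\t", join(',', @vals), "\n";
        @vals = ();
    }
    $cur_id = $id;
    push @vals, @rest;
}
print "$cur_id\t", join(',', @vals), "\n" if defined $cur_id;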
The CLUSTER BY clause ensures the reducers get the correctly sorted input needed by the collect script.
The output of the user scripts is treated as tab-separated STRING columns.
Running the query results in the following output:
1 X11,X12,T11,X21,X22,T12,X31,X22,T13
2 X11,X12,T21,X21,X22,T22
You can modify the map script to limit the columns, and/or modify the reduce script to add results or separate the lat, lon from the ts, etc.
If this form is sufficient, you could insert directly into a result table by adding an insert before the reduce:
from (
  from trace_input
  MAP id, lat, lon, ts
  USING './identity.pl'
  as id, lat, lon, ts
  CLUSTER BY id) map_output
INSERT overwrite table trace_res
REDUCE id, lat, lon, ts
USING './collect.pl';
The fields will be converted from string fields to match the schema of trace_res as necessary.
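For example, trace_res might be declared like this; the schema below is only an illustrative guess matching the two tab-separated columns the reduce script emits:
-- hypothetical result table; column names and types are illustrative
create table trace_res(id int, list string);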
If you use collection types like I do, you can also do something like:
create table trace_res as
select sq.id, split(sq.list, ",")
from (
  from (
    from trace_input
    MAP id, lat, lon, ts
    USING './identity.pl'
    as id, lat, lon, ts
    CLUSTER BY id) map_output
  REDUCE id, lat, lon, ts
  USING './collect.pl' as (id int, list string)
) sq;
The second field in the created table will be a list of all the lat, lon, ts values; in practice you will probably want a more complex table than that.
I'm looking for a way to convert a decimal number into a valid HH:mm:ss format.
I'm importing data from an SQL database.
One of the columns in my database is labelled Actual Start Time.
The values in my database are stored in the following decimal format:
73758 // which translates to 07:37:58
114436 // which translates to 11:44:36
I cannot simply convert this Actual Start Time column into a Time format in my Power BI import as it returns errors for some values, saying it doesn't recognise 73758 as a valid 'time'. It needs to have a leading zero for cases such as 73758.
To combat this, I created a new Text column with the following code to append a leading zero:
Column = FORMAT([Actual Start Time], "000000")
This returns the following results:
073758
114436
-- which is perfect. Exactly what I needed.
I now want to convert these values into a Time.
Simply changing the data type field to Time doesn't do anything, returning:
Cannot convert value '073758' of type Text to type Date.
So I created another column with the following code:
Column 2 = FORMAT(TIME(LEFT([Column], 2), MID([Column], 3, 2), RIGHT([Column], 2)), "HH:mm:ss")
To pass the values 07, 37 and 58 into a TIME format.
This returns the following:
+-------------------+--------+----------+
| Actual Start Time | Column | Column 2 |
+-------------------+--------+----------+
| 73758             | 073758 | 07:37:58 |
| 114436            | 114436 | 11:44:36 |
+-------------------+--------+----------+
Which is what I wanted but is there any other way of doing this? I want to ideally do it in one step without creating additional columns.
You could use a variable as suggested by Aldert, or you can replace [Column] with the FORMAT function directly:
Time Format = FORMAT(
TIME(
LEFT(FORMAT([Actual Start Time],"000000"),2),
MID(FORMAT([Actual Start Time],"000000"),3,2),
RIGHT([Actual Start Time],2)),
"hh:mm:ss")
Edit:
If you want to do this in Power Query, you can create a custom column with the following calculation:
Time.FromText(
if Text.Length([Actual Start Time])=5 then Text.PadStart( [Actual Start Time],6,"0")
else [Actual Start Time])
Once this column is created you can drop the old column, so that you only have one time column in the data. Hope this helps.
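For reference, the corresponding step in the Power Query Advanced Editor might look roughly like this; the step and column names are illustrative, and Text.From is added on the assumption that the source column may still be numeric:
#"Added Start Time" = Table.AddColumn(
    #"Previous Step",
    "Start Time",
    each Time.FromText(Text.PadStart(Text.From([Actual Start Time]), 6, "0")),
    type time)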
I show the concept of variables on purpose, so you can use it in the future with more complex queries.
TimeC =
var timeStr = FORMAT([Actual Start Time], "000000")
return FORMAT(TIME(LEFT(timeStr, 2), MID(timeStr, 3, 2), RIGHT(timeStr, 2)), "HH:mm:ss")
I feel this should be simple, but I've struggled to find the right terminology, please bear with me.
I have two columns, timestamp and voltages, where voltages is an array.
If I do a simple
SELECT timestamp, voltages FROM table
Then I'd get a result of:
|timestamp | voltages |
|1544435470 |3.7352,3.749,3.7433,3.7533|
|1544435477 |3.7352,3.751,3.7452,3.7533|
|1544435484 |3.7371,3.749,3.7433,3.7533|
|1544435490 |3.7352,3.749,3.7452,3.7533|
|1544435497 |3.7352,3.751,3.7452,3.7533|
|1544435504 |3.7352,3.749,3.7452,3.7533|
But I want to split the voltages array so each element in its array is its own column.
|timestamp | v1 | v2 | v3 | v4 |
|1544435470 |3.7352 |3.749 |3.7433 |3.7533|
|1544435477 |3.7352 |3.751 |3.7452 |3.7533|
|1544435484 |3.7371 |3.749 |3.7433 |3.7533|
|1544435490 |3.7352 |3.749 |3.7452 |3.7533|
|1544435497 |3.7352 |3.751 |3.7452 |3.7533|
|1544435504 |3.7352 |3.749 |3.7452 |3.7533|
I know I can do this with:
SELECT timestamp, voltages[1] as v1, voltages[2] as v2 FROM table
But I'd need to be able to do this programmatically, as opposed to listing them out.
Am I missing something obvious?
This should serve your purpose if you have arrays of fixed length.
You first need to break down each array element into its own row. You can do this using the UNNEST operator in the following way:
SELECT timestamp, volt
FROM table
CROSS JOIN UNNEST(voltages) AS t(volt)
Using the resultant table you can pivot (convert multiple rows with the same timestamp into multiple columns) by referring to Gordon Linoff's answer for "need to convert data in multiple rows with same ID into 1 row with multiple columns".
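As a sketch of that pivot step (assuming a Presto/Athena-style engine, the table and column names from the question, and exactly four voltages per array), WITH ORDINALITY exposes each element's position so conditional aggregation can spread the values back into fixed columns:
SELECT timestamp,
       max(CASE WHEN idx = 1 THEN volt END) AS v1,
       max(CASE WHEN idx = 2 THEN volt END) AS v2,
       max(CASE WHEN idx = 3 THEN volt END) AS v3,
       max(CASE WHEN idx = 4 THEN volt END) AS v4
FROM table
CROSS JOIN UNNEST(voltages) WITH ORDINALITY AS t(volt, idx)
GROUP BY timestamp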
I ran a Cypher query to delete all duplicate relationships with the same name from my graph. A relationship has the properties (name, confidence, time). I kept the relationship with the highest confidence value and collected all time values, using the following query:
MATCH (e0:Entity)-[r:REL]-(e1:Entity)
WITH e0, r.name AS relation, COLLECT(r) AS rels,
     COLLECT(r.confidence) AS relConf,
     MAX(r.confidence) AS maxConfidence,
     COLLECT(r.time) AS relTime, e1
WHERE SIZE(rels) > 1
SET (rels[0]).confidence = maxConfidence, (rels[0]).time = relTime
FOREACH (rel in tail(rels) | DELETE rel)
RETURN rels, relation, relConf, maxConfidence, relTime
Old Data:
name,confidence,time
likes, 0.87, 20111201010900
likes, 0.97, 20111201010600
New data:
name,confidence,time
likes, 0.97, [20111201010900,20111201010600]
Could anyone please suggest a MATCH query to find relationships containing the year 2011 in the new "time" property? (I converted time using toInt while loading from a CSV.)
Your new data structure definitely doesn't make such searches easy, but it is possible on medium-sized graphs:
MATCH (n:Entity)-[r:REL]->(x)
WHERE ANY(
  t IN extract(v IN r.time | toString(v))
  WHERE t STARTS WITH "2011"
)
RETURN r
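If you are on a newer Neo4j version where extract() has been removed, the same filter can be written with a list comprehension (this variant is mine, not part of the original answer):
MATCH (n:Entity)-[r:REL]->(x)
WHERE ANY(t IN [v IN r.time | toString(v)] WHERE t STARTS WITH "2011")
RETURN r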
I have a lot of undocumented and uncommented SQL queries. I would like to extract some information from the SQL statements. In particular, I'm interested in DB names, table names, and if possible column names. The queries usually have the following syntax.
SELECT *
FROM mydb.table1 m
LEFT JOIN mydb.sometable o ON m.id = o.id
LEFT JOIN mydb.sometable t ON p.id=t.id
LEFT JOIN otherdb.sometable s ON s.column='test'
Usually the statements involve several DBs and tables. I would like to extract only the DBs and tables, without any other information. I wondered whether it is possible to first extract the text that follows FROM, JOIN, and LEFT JOIN; this is usually db.table, while single letters such as o, t, s are aliases that already refer to those tables and are, I suppose, difficult to capture. What I tried, without any success, is something like:
gsub(".*FROM \\s*|WHERE|ORDER|GROUP.*", "", vec)
This assumes that each statement ends with WHERE/where, ORDER/order, or GROUP..., but it doesn't work out as expected.
You haven't indicated which database system you are using, but virtually all such systems have introspection facilities that let you get this information far more easily and reliably than attempting to parse SQL statements. The following code assumes SQLite; it can likely be adapted to your situation by getting a list of your databases and then looping over them, using dbConnect to connect to each one in turn and running code such as this:
library(gsubfn)
library(RSQLite)
con <- dbConnect(SQLite()) # use in memory database for testing
# create two tables for purposes of this test
dbWriteTable(con, "BOD", BOD, row.names = FALSE)
dbWriteTable(con, "iris", iris, row.names = FALSE)
# get all table names and columns
tabinfo <- Map(function(tab) names(fn$dbGetQuery(con, "select * from $tab limit 0")),
               dbListTables(con))
dbDisconnect(con)
giving an R list whose names are the table names and whose entries are the column names:
> tabinfo
$BOD
[1] "Time" "demand"
$iris
[1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" "Species"
or perhaps long form output is preferred:
setNames(stack(tabinfo), c("column", "table"))
giving:
column table
1 Time BOD
2 demand BOD
3 Sepal.Length iris
4 Sepal.Width iris
5 Petal.Length iris
6 Petal.Width iris
7 Species iris
You could use the stringi package for this.
library(stringi)
# Your string vector
myString <- "SELECT *
FROM mydb.table1 m
LEFT JOIN mydb.sometable o ON m.id = o.id
LEFT JOIN mydb.sometable t ON p.id=t.id
LEFT JOIN otherdb.sometable s ON s.column='test'"
# Three stringi functions are used:
# stri_extract_all_regex extracts the substrings consisting of FROM or JOIN followed by text up to the next space
# stri_replace_all_regex strips the leading "FROM " or "JOIN " from each match
# stri_unique keeps only the unique strings
t <- stri_unique(stri_replace_all_regex(stri_extract_all_regex(myString, "((FROM|JOIN) [^\\s]+)", simplify = TRUE),
"(FROM|JOIN) ", ""))
> t
[1] "mydb.table1" "mydb.sometable" "otherdb.sometable"
I've got a C++ program that needs to access this wind data, refreshed every 6 hours. As clients of the server need the data, the server queries the database and provides the data to the client. The client will use lat, lon, and mb as keys to find the 5 values.
+------------+-------+-----+-----+----------+----------+-------+------+------+
| id | lat | lon | mb | wind_dir | wind_spd | uv | vv | ts |
+------------+-------+-----+-----+----------+----------+-------+------+------+
| 1769584117 | -90.0 | 0.0 | 100 | 125 | 9 | -3.74 | 2.62 | 2112 |
| 1769584118 | -90.0 | 0.5 | 100 | 125 | 9 | -3.76 | 2.59 | 2112 |
| 1769584119 | -90.0 | 1.0 | 100 | 124 | 9 | -3.78 | 2.56 | 2112 |
Because the data changes so infrequently, I'd like the data to be cached by the server so if a client needs data previously queried, a second SQL query is not necessary.
I'm trying to determine the most efficient in-memory data structure, in terms of storage/speed, but more importantly, ease of access.
My initial thought was a map keyed by lat, containing a map keyed by lon, containing a map keyed by mb for which the value is a map containing the wind_dir, wind_speed, uv, vv and ts fields.
However, that gets complicated fast. Another thought of course is a 3-dimensional array (lat, lon, mb indices) containing a struct of the last 5 fields.
As I'm sitting here, I came up with the thought of combining lat, lon and mb into a string, which could be used as an index into a map, given that I'm 99% sure the combination of lat, lon and mb would always be unique.
What other ideas make sense?
Edit: More detail from comment below
In terms of data, there are 3,119,040 rows in the data set. That will be fairly constant, though it may slowly grow over the years as new reporting stations are added. There are generally between 700 and 1500 clients requesting the data. The clients are flight simulators. They'll request the data every 5 minutes by default, though the maximum possible frequency is every 30 seconds. There is no additional information; what you see above is the data to be returned.
One final note I forgot to mention: I'm quite rusty in my C++ and especially STL stuff, so the simpler, the better.
You can use std::map with a three-part key and a suitable less-than operator (this is what Crazy Eddie proposed, extended with a few lines of code):
struct key
{
    double mLat;
    double mLon;
    double mMb;
    key(double lat, double lon, double mb) :
        mLat(lat), mLon(lon), mMb(mb) {}
};

bool operator<(const key& a, const key& b)
{
    return (a.mLat < b.mLat ||
            (a.mLat == b.mLat && a.mLon < b.mLon) ||
            (a.mLat == b.mLat && a.mLon == b.mLon && a.mMb < b.mMb));
}
Defining and inserting into the map would look like:
std::map<key, your_wind_struct> values;
values[key(-90.0, 0.0, 100)] = your_wind_struct(1769584117, 125, ...);
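A small usage sketch for lookups (find() avoids inserting a default-constructed entry on a miss, which operator[] would do):
auto it = values.find(key(-90.0, 0.0, 100));
if (it != values.end()) {
    const your_wind_struct& w = it->second;  // wind_dir, wind_spd, uv, vv, ts
}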
A sorted vector also makes sense. You can feed it a less-than predicate that compares your three-part key; you could do the same with a map or set. A hash-based container is another option. Which container you choose depends on a lot of factors.
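A minimal sketch of the sorted-vector idea, reusing the key struct and operator< from above (your_wind_struct stands in for whatever record type you use; the helper names are illustrative):
#include <algorithm>
#include <utility>
#include <vector>

typedef std::pair<key, your_wind_struct> row;

bool row_less(const row& a, const row& b) { return a.first < b.first; }
bool row_less_key(const row& a, const key& k) { return a.first < k; }

// call once after loading all rows from the database
void build_index(std::vector<row>& table) {
    std::sort(table.begin(), table.end(), row_less);
}

// binary-search lookup; returns NULL if the key is not present
const your_wind_struct* find_row(const std::vector<row>& table, const key& k) {
    std::vector<row>::const_iterator it =
        std::lower_bound(table.begin(), table.end(), k, row_less_key);
    if (it != table.end() && !(k < it->first))
        return &it->second;
    return NULL;
}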
Another option is the C++11 unordered_set, which uses a hash table instead of a red-black tree as its internal data structure and gives (I believe) an amortized lookup time of O(1) vs. O(log n) for the red-black tree. Which data structure you use depends on the characteristics of the data in question: how many pieces of data there are, how often a particular record is likely to be accessed, and so on. I agree with several commenters that using a structure as a key is the cleanest way to go. It also lets you alter the unique key more simply should it change in the future; you would just need to add a member to your key structure rather than create a whole new level of maps.
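As a rough sketch of that hash-table route, here is an std::unordered_map keyed by the same three fields (a close cousin of the unordered_set mentioned above; the type and member names are illustrative, not from the original post, and exact double comparison is fine here because the lat/lon values come straight from the table's fixed grid):
#include <cstddef>
#include <functional>
#include <unordered_map>

struct wind_key {
    double lat;
    double lon;
    int mb;
    bool operator==(const wind_key& o) const {
        return lat == o.lat && lon == o.lon && mb == o.mb;
    }
};

struct wind_key_hash {
    std::size_t operator()(const wind_key& k) const {
        // simple hash combine; a production version might use a stronger mix
        std::size_t h = std::hash<double>()(k.lat);
        h ^= std::hash<double>()(k.lon) + 0x9e3779b9 + (h << 6) + (h >> 2);
        h ^= std::hash<int>()(k.mb) + 0x9e3779b9 + (h << 6) + (h >> 2);
        return h;
    }
};

struct wind_record {
    int wind_dir;
    int wind_spd;
    double uv;
    double vv;
    int ts;
};

std::unordered_map<wind_key, wind_record, wind_key_hash> cache;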