Comparing tuples from two files in Pig

I want to compare tuples from two different files using Pig. If a tuple in one file is the mirror image of a tuple in the other, I want that tuple written to a third file, f3.
If f1 has the following tuples
(1 2)
(3 4)
and f2 has the following tuples
(5 6)
(4 3)
Since (3 4) is a mirror image of (4 3), we need to store this value in f3. Thus, f3 would be
(3 4)

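You can simply inner join the two data sets, matching each tuple in the first against the reversed tuple in the second:
data1 = LOAD '$data1' USING AvroStorage();
data2 = LOAD '$data2' USING AvroStorage();
-- join (a, b) from data1 with (b, a) from data2
joined = JOIN data1 BY ($0, $1), data2 BY ($1, $0);
-- after the join, the first two positions hold the data1 fields
mirrored = FOREACH joined GENERATE $0, $1;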
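
To check the join logic on small inputs outside Pig, the same idea can be sketched in Python. This is only an illustration of the matching rule, not a replacement for the Pig script; the function name mirrored_pairs and the in-memory lists f1 and f2 are assumptions standing in for the two files.

```python
def mirrored_pairs(first, second):
    """Return the pairs from `first` whose reverse appears in `second`."""
    # Reverse every pair of the second data set once, so each lookup is O(1).
    reversed_second = {(b, a) for a, b in second}
    return [pair for pair in first if pair in reversed_second]


# Sample data from the question (hypothetical in-memory stand-ins for the files).
f1 = [(1, 2), (3, 4)]
f2 = [(5, 6), (4, 3)]
f3 = mirrored_pairs(f1, f2)
print(f3)  # [(3, 4)]
```

Like the inner join, this keeps only tuples from the first data set that have a reversed partner in the second.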

Related

PostgreSQL empty list VALUES expression

I am trying to take a list of points, and query a geospatial database, to find all matching rows.
I have a computed SQL statement that looks like this:
from django.db import connection
from psycopg2.extensions import AsIs

cursor = connection.cursor()
cursor.execute(
'''
SELECT g.ident
FROM (VALUES %s) AS v (lon, lat)
LEFT JOIN customers g
ON (ST_Within(ST_SetSRID(ST_MakePoint(v.lon, v.lat), %s), g.poly_home));
''', [AsIs(formatted_points), SRID]
)
Here is an example of what the formatted_points variable looks like:
(-115.062,38.485), (-96.295,43.771)
So, when that is inserted into the SQL statement, the VALUES expression reads:
(VALUES (-115.062,38.485), (-96.295,43.771)) AS v (lon, lat)
So far so good. However, when the list of points is empty, the VALUES expression looks like this:
(VALUES ) AS v (lon, lat)
which causes this error:
django.db.utils.ProgrammingError: syntax error at or near ")"
In other words, (VALUES ) is not legal SQL.
Here's the question: How do I represent an empty list using VALUES? I could special case this, and just return an empty list when this function is passed an empty list, but that doesn't seem very elegant.
I have looked at the PostgreSQL manual page for VALUES, but I don't understand how to construct an empty VALUES expression.

If you can put your lons and lats in separate arrays, you could use arrays with unnest:
select * from unnest(ARRAY[1, 2, 3]::int[], ARRAY[4, 5, 6]::int[]) as v(lon, lat);
 lon | lat
-----+-----
   1 |   4
   2 |   5
   3 |   6
(3 rows)
select * from unnest(ARRAY[]::int[], ARRAY[]::int[]) as v(lon, lat);
 lon | lat
-----+-----
(0 rows)
You'll have to cast the arrays to the appropriate type (for coordinates, probably float8[] rather than int[]). Postgres will guess the type if the arrays aren't empty, but it will throw an error if they are empty and you don't cast them to a specific type.
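
Applied to the query from the question, the unnest approach might look like the sketch below. It assumes psycopg2's default adaptation of Python lists to PostgreSQL arrays, a float8[] cast for the coordinates, and PostgreSQL 9.4 or later for the multi-array form of unnest; the helper name build_points_query is hypothetical. The function only builds the statement and its parameters, so nothing is interpolated with AsIs.

```python
def build_points_query(points, srid):
    """Build (sql, params) for a list of (lon, lat) pairs; an empty list is fine."""
    # Split the pairs into two parallel lists, one per unnest argument.
    lons = [float(lon) for lon, _ in points]
    lats = [float(lat) for _, lat in points]
    sql = """
        SELECT g.ident
        FROM unnest(%s::float8[], %s::float8[]) AS v (lon, lat)
        LEFT JOIN customers g
          ON ST_Within(ST_SetSRID(ST_MakePoint(v.lon, v.lat), %s), g.poly_home)
    """
    # The explicit cast lets an empty list be typed as an empty float8[] array.
    return sql, [lons, lats, srid]


sql, params = build_points_query([(-115.062, 38.485), (-96.295, 43.771)], 4326)
empty_sql, empty_params = build_points_query([], 4326)
```

The result can be passed straight to the cursor, for example cursor.execute(*build_points_query(points, SRID)). With an empty list the query should return zero rows instead of raising a syntax error.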

Python: how to keep leading zeros with dataframe.to_csv [duplicate]

After reading the data from a txt file, I have a dataframe (df1) like the following:
name l1 l2
a 00000 00000
b 00010 00002
c 00000 01218
I write it out with the following Python code:
dataframe.to_csv('test.csv', index=False)
Then I read it back with:
df = pd.read_csv('test.csv')
The resulting dataframe (df2) looks like this:
name l1 l2
a 0 0
b 10 2
c 0 1218
But I want to keep the leading zeros in the dataframe, as in df1.
Thanks!

The leading zeros are being removed because pandas implicitly converts the values to integer types when it reads the file. You want to read the data as string types, which you can do by specifying dtype=str:
pd.read_csv('test.csv', dtype=str)
Update, in case it helps others:
To read only selected columns as str, you can do this:
# list of column names that need to be read as strings
lst_str_cols = ['prefix', 'serial']
# use a dictionary comprehension to build the dict of dtypes
dict_dtypes = {x: 'str' for x in lst_str_cols}
# pass the dict as dtype
pd.read_csv('sample.csv', dtype=dict_dtypes)
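
A small round trip makes the difference visible. This is a sketch that assumes the columns in df1 are already strings; it uses an in-memory buffer in place of test.csv, and the names df_default, df_str and df_partial are introduced only for illustration.

```python
import io

import pandas as pd

# Sample frame from the question, with the codes stored as strings.
df1 = pd.DataFrame({
    "name": ["a", "b", "c"],
    "l1": ["00000", "00010", "00000"],
    "l2": ["00000", "00002", "01218"],
})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
df1.to_csv(buffer, index=False)

buffer.seek(0)
df_default = pd.read_csv(buffer)                     # numeric-looking columns become integers

buffer.seek(0)
df_str = pd.read_csv(buffer, dtype=str)              # every column is read as a string

buffer.seek(0)
df_partial = pd.read_csv(buffer, dtype={"l1": str})  # only l1 is forced to string

print(df_default["l1"].tolist())  # [0, 10, 0]
print(df_str["l1"].tolist())      # ['00000', '00010', '00000']
```

The CSV text itself still contains the zeros; they are only dropped when the reader infers a numeric type.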

Use lapply to plot data in a list and use names of list elements as plot titles [duplicate]

If I have the following list:
comp.surv <- list(a = 1:4, b = c(1, 2, 4, 8), c = c(1, 3, 8, 27))
comp.surv
# $a
# [1] 1 2 3 4
#
# $b
# [1] 1 2 4 8
#
# $c
# [1] 1 3 8 27
I can use lapply to plot each list element:
lapply(comp.surv, function(x) plot(x))
However, I want to include the name of each list element as the plot title (main). For my example data, the titles of the graphs would be a, b and c respectively. To start with, I have a gsub rule that, given comp.surv$a, returns a:
gsub("comp.surv\\$([a-z]+)", "\\1", deparse(substitute(comp.surv$a)))
# "a"
Which is good. However, I cannot embed this result into my lapply statement above. Any ideas?
In the meantime I have tried to get around this by creating a function that sets the main parameter:
splot <- function(x){
plot(x, main = gsub("comp.surv\\$([a-z]+)", "\\1", deparse(substitute(x))))
}
lapply(comp.surv, function(x) splot(x))
This will plot each sub-variable of comp.surv, but all the titles are blank.
Am I on the right track?

One possibility would be to loop over the names of the list:
lapply(names(comp.surv), function(x) plot(comp.surv[[x]], main = x))
Or slightly more verbose, loop over the list indices:
lapply(seq_along(comp.surv), function(x) plot(comp.surv[[x]], main = names(comp.surv)[x]))

Is that what you want?
ns=names(comp.surv)
lapply(ns, function(x) plot(comp.surv[[x]], main=x,ylab="y"))

Multiple lists of the same length to csv

I have a couple List<string>s, with the format like this:
List 1 List 2 List 3
1 A One
2 B Two
3 C Three
4 D Four
5 E Five
So in code form, it's like:
List<string> list1 = new List<string> {"1","2","3","4","5"};
List<string> list2 = new List<string> {"A","B","C","D","E"};
List<string> list3 = new List<string> {"One","Two","Three","Four","Five"};
My questions are:
How do I transform those three lists to a CSV format?
list1,list2,list3
1,A,One
2,B,Two
3,C,Three
4,D,Four
5,E,Five
Should I append , to the end of each index or make the delimiter its own index within the multidimensional list?

If performance is your main concern, I would use an existing csv library for your language, as it's probably been pretty well optimized.
If that's too much overhead, and you just want a simple function, I use the same concept in some of my code. I use the join/implode function of a language to create a list of comma separated strings, then join that list with \n.
I'm used to doing this in a dynamic language, but you can see the concept in the following pseudocode example. Note that each CSV row takes one element from every list, so the lists are walked by index rather than joined whole:
header = {"List1", "List2", "List3"};
list1 = {"1","2","3","4","5"};
list2 = {"A","B","C","D","E"};
list3 = {"One","Two","Three","Four","Five"};
rows = {header.join(",")};
for i in 0 .. list1.length - 1
    rows.append({list1[i], list2[i], list3[i]}.join(","));
csv = rows.join("\n");
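
As a concrete version of the library route, here is a minimal Python sketch using the standard csv module, which also takes care of quoting values that contain commas. The helper name lists_to_csv is hypothetical, and "\n" is assumed as the line terminator to match the format in the question.

```python
import csv
import io


def lists_to_csv(header, *columns):
    """Turn equal-length column lists into CSV text, one row per index."""
    if len({len(column) for column in columns}) > 1:
        raise ValueError("all columns must have the same length")
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    # zip(*columns) transposes the columns into rows.
    writer.writerows(zip(*columns))
    return buffer.getvalue()


list1 = ["1", "2", "3", "4", "5"]
list2 = ["A", "B", "C", "D", "E"]
list3 = ["One", "Two", "Three", "Four", "Five"]
csv_text = lists_to_csv(["list1", "list2", "list3"], list1, list2, list3)
print(csv_text)
```

Keeping the delimiter out of the data, and letting the writer insert it, answers the last part of the question: there is no need to append commas to the list elements or to store them as separate entries.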

How to transform a matrix with 2 columns into a multimap like structure?

I am wondering if there is a way to transform a matrix of 2 columns into a multimap or list of list.
The first column of the matrix is an id (with possibly duplicated entries) and the 2nd column is some value.
For example,
if I have the following matrix
m <- matrix(c(1,2,1,3,2,4), nrow = 3, ncol = 2)
I would like to transform it into the following list
[[1]]
3,4
[[2]]
2

With base functions, you can do something like this:
tapply(m[,2], m[,1], `[`) # outputs an array
by(m, m[,1], function(m) m[,2]) # outputs a by object, which is a list
You could use plyr:
dlply(m, 1, function(m) m[,2]) # outputs a list
dlply(m, 1, `[`, 2) # another way to do it...
