For some RDF like this:
<?xml version="1.0"?>
<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:blah="http://www.something.org/stuff#">
  <rdf:Description rdf:about="http://www.something.org/stuff/some_entity1">
    <blah:stringid>string1</blah:stringid>
    <blah:uid>1</blah:uid>
    <blah:myitems rdf:parseType="Collection">
      <blah:myitem>
        <blah:myitemvalue1>7</blah:myitemvalue1>
        <blah:myitemvalue2>8</blah:myitemvalue2>
      </blah:myitem>
      ...
      <blah:myitem>
        <blah:myitemvalue1>7</blah:myitemvalue1>
        <blah:myitemvalue2>8</blah:myitemvalue2>
      </blah:myitem>
    </blah:myitems>
  </rdf:Description>
  <rdf:Description rdf:about="http://www.something.org/stuff/some__other_entity2">
    <blah:stringid>string2</blah:stringid>
    <blah:uid>2</blah:uid>
    <blah:myitems rdf:parseType="Collection">
      <blah:myitem>
        <blah:myitemvalue1>7</blah:myitemvalue1>
        <blah:myitemvalue2>8</blah:myitemvalue2>
      </blah:myitem>
      ....
      <blah:myitem>
        <blah:myitemvalue1>7</blah:myitemvalue1>
        <blah:myitemvalue2>8</blah:myitemvalue2>
      </blah:myitem>
    </blah:myitems>
  </rdf:Description>
</rdf:RDF>
I'm using Jena/SPARQL and I'd like to use a SELECT query to retrieve the myitems node for an entity with a particular stringid, then extract it from the result set and iterate through it to get the values for each myitem node. Order isn't important.
So I have two questions:
Do I need to specify in my query that blah:myitems is a list?
How can I parse a list in a ResultSet?
Selecting Lists (and Elements) in SPARQL
Let's address the SPARQL issue first. I've modified your data just a little bit so that the elements have different values, so it will be easier to see them in the output. Here's the data in N3 format, which is a bit more concise, especially when representing lists:
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix blah: <http://www.something.org/stuff#> .

<http://www.something.org/stuff/some_entity1>
    blah:myitems ( [ a blah:myitem ;
                     blah:myitemvalue1 "1" ;
                     blah:myitemvalue2 "2" ]
                   [ a blah:myitem ;
                     blah:myitemvalue1 "3" ;
                     blah:myitemvalue2 "4" ] ) ;
    blah:stringid "string1" ;
    blah:uid "1" .

<http://www.something.org/stuff/some__other_entity2>
    blah:myitems ( [ a blah:myitem ;
                     blah:myitemvalue1 "5" ;
                     blah:myitemvalue2 "6" ]
                   [ a blah:myitem ;
                     blah:myitemvalue1 "7" ;
                     blah:myitemvalue2 "8" ] ) ;
    blah:stringid "string2" ;
    blah:uid "2" .
You mentioned in the question selecting the myitems node, but myitems is actually the property that relates the entity to the list. You can select properties in SPARQL, but I'm guessing that you actually want to select the head of the list, i.e., the value of the myitems property. That's straightforward. You don't need to specify that it's an rdf:List, but if the value of myitems could also be a non-list, then you should specify that you're only looking for rdf:Lists. (For developing the SPARQL queries, I'll just run them using Jena's ARQ command line tools, because we can move them to the Java code easily enough afterward.)
prefix blah: <http://www.something.org/stuff#>
select ?list where {
  [] blah:myitems ?list .
}
$ arq --data data.n3 --query items.sparql
--------
| list |
========
| _:b0 |
| _:b1 |
--------
The heads of the lists are blank nodes, so this is the sort of result that we're expecting. From these results, you could get the resource from a result set and then start walking down the list, but since you don't care about the order of the nodes in the list, you might as well just select them in the SPARQL query, and then iterate through the result set, getting each item. It also seems likely that you might be interested in the entity whose items you're retrieving, so that's in this query too.
prefix blah: <http://www.something.org/stuff#>
prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
select ?entity ?list ?item ?value1 ?value2 where {
  ?entity blah:myitems ?list .
  ?list rdf:rest* [ rdf:first ?item ] .
  ?item a blah:myitem ;
        blah:myitemvalue1 ?value1 ;
        blah:myitemvalue2 ?value2 .
}
order by ?entity ?list
$ arq --data data.n3 --query items.sparql
----------------------------------------------------------------------------------------
| entity | list | item | value1 | value2 |
========================================================================================
| <http://www.something.org/stuff/some__other_entity2> | _:b0 | _:b1 | "7" | "8" |
| <http://www.something.org/stuff/some__other_entity2> | _:b0 | _:b2 | "5" | "6" |
| <http://www.something.org/stuff/some_entity1> | _:b3 | _:b4 | "3" | "4" |
| <http://www.something.org/stuff/some_entity1> | _:b3 | _:b5 | "1" | "2" |
----------------------------------------------------------------------------------------
By ordering the results by entity and by list (in case some entity has multiple values for the myitems property), you can iterate through the result set and be assured of getting, in order, all the elements in a list for an entity. Since your question was about lists in result sets, and not about how to work with result sets, I'll assume that iterating through the results isn't a problem.
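The question also asked about restricting this to an entity with a particular stringid. Neither query above does that, but you can add a triple pattern such as ?entity blah:stringid "string1" to the where clause of either query to select only that entity's list ("string1" here is just the example value from the data above).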
Working with Lists in Jena
The following example shows how you can work with lists in Java. The first part of the code is just the boilerplate to load the model and run the SPARQL query. Once you're getting the results of the query back, you can either treat the resource as the head of a linked list and use the rdf:first and rdf:rest properties to iterate manually, or you can cast the resource to Jena's RDFList and get an iterator out of it.
import java.io.IOException;
import java.io.InputStream;
import com.hp.hpl.jena.query.QueryExecutionFactory;
import com.hp.hpl.jena.query.QuerySolution;
import com.hp.hpl.jena.query.ResultSet;
import com.hp.hpl.jena.rdf.model.Model;
import com.hp.hpl.jena.rdf.model.ModelFactory;
import com.hp.hpl.jena.rdf.model.Property;
import com.hp.hpl.jena.rdf.model.RDFList;
import com.hp.hpl.jena.rdf.model.RDFNode;
import com.hp.hpl.jena.rdf.model.Resource;
import com.hp.hpl.jena.util.iterator.ExtendedIterator;
import com.hp.hpl.jena.vocabulary.RDF;
public class SPARQLListExample {
    public static void main(String[] args) throws IOException {
        // Create a model and load the data
        Model model = ModelFactory.createDefaultModel();
        try ( InputStream in = SPARQLListExample.class.getClassLoader().getResourceAsStream( "SPARQLListExampleData.rdf" ) ) {
            model.read( in, null );
        }

        String blah = "http://www.something.org/stuff#";
        Property myitemvalue1 = model.createProperty( blah + "myitemvalue1" );
        Property myitemvalue2 = model.createProperty( blah + "myitemvalue2" );

        // Run the SPARQL query and get some results
        String getItemsLists = "" +
                "prefix blah: <http://www.something.org/stuff#>\n" +
                "\n" +
                "select ?list where {\n" +
                "  [] blah:myitems ?list .\n" +
                "}";
        ResultSet results = QueryExecutionFactory.create( getItemsLists, model ).execSelect();

        // For each solution in the result set
        while ( results.hasNext() ) {
            QuerySolution qs = results.next();
            Resource list = qs.getResource( "list" ).asResource();

            // Once you've got the head of the list, you can either process it manually
            // as a linked list, using RDF.first to get elements and RDF.rest to get
            // the rest of the list...
            for ( Resource curr = list;
                  !RDF.nil.equals( curr );
                  curr = curr.getRequiredProperty( RDF.rest ).getObject().asResource() ) {
                Resource item = curr.getRequiredProperty( RDF.first ).getObject().asResource();
                RDFNode value1 = item.getRequiredProperty( myitemvalue1 ).getObject();
                RDFNode value2 = item.getRequiredProperty( myitemvalue2 ).getObject();
                System.out.println( item+" has:\n\tvalue1: "+value1+"\n\tvalue2: "+value2 );
            }

            // ...or you can make it into a Jena RDFList that can give you an iterator
            RDFList rdfList = list.as( RDFList.class );
            ExtendedIterator<RDFNode> items = rdfList.iterator();
            while ( items.hasNext() ) {
                Resource item = items.next().asResource();
                RDFNode value1 = item.getRequiredProperty( myitemvalue1 ).getObject();
                RDFNode value2 = item.getRequiredProperty( myitemvalue2 ).getObject();
                System.out.println( item+" has:\n\tvalue1: "+value1+"\n\tvalue2: "+value2 );
            }
        }
    }
}
Related
I have a pyspark dataframe with two columns, ID and Elements. The "Elements" column contains a list in each row. It looks like this,
ID | Elements
_______________________________________
X |[Element5, Element1, Element5]
Y |[Element Unknown, Element Unknown, Element_Z]
I want to form a column with the most frequent element in the column 'Elements.' Output should look like,
ID | Elements | Output_column
__________________________________________________________________________
X |[Element5, Element1, Element5] | Element5
Y |[Element Unknown, Element Unknown, Element_Z] | Element Unknown
How can I do that using pyspark?
Thanks in advance.
We can use higher order functions here (available from Spark 2.4+).
First use transform and aggregate to get counts for each distinct value in the array.
Then sort the array of structs in descending order of count and take the first element.
from pyspark.sql import functions as F
temp = (df.withColumn("Dist", F.array_distinct("Elements"))
          .withColumn("Counts", F.expr("""transform(Dist, x ->
                          aggregate(Elements, 0, (acc, y) -> IF(y = x, acc + 1, acc)))"""))
          .withColumn("Map", F.arrays_zip("Dist", "Counts"))
       ).drop("Dist", "Counts")
out = temp.withColumn("Output_column",
                      F.expr("""element_at(array_sort(Map, (first, second) ->
                          CASE WHEN first['Counts'] > second['Counts'] THEN -1 ELSE 1 END), 1)['Dist']"""))
Output:
Note that I have added a row with an empty array for ID Z to test. You can also drop the Map column by adding .drop("Map") to the output.
out.show(truncate=False)
+---+---------------------------------------------+--------------------------------------+---------------+
|ID |Elements |Map |Output_column |
+---+---------------------------------------------+--------------------------------------+---------------+
|X |[Element5, Element1, Element5] |[{Element5, 2}, {Element1, 1}] |Element5 |
|Y |[Element Unknown, Element Unknown, Element_Z]|[{Element Unknown, 2}, {Element_Z, 1}]|Element Unknown|
|Z |[] |[] |null |
+---+---------------------------------------------+--------------------------------------+---------------+
For lower Spark versions, you can use a UDF with statistics.mode:
from pyspark.sql import functions as F,types as T
from statistics import mode
u = F.udf(lambda x: mode(x) if len(x)>0 else None,T.StringType())
df.withColumn("Output",u("Elements")).show(truncate=False)
+---+---------------------------------------------+---------------+
|ID |Elements |Output |
+---+---------------------------------------------+---------------+
|X |[Element5, Element1, Element5] |Element5 |
|Y |[Element Unknown, Element Unknown, Element_Z]|Element Unknown|
|Z |[] |null |
+---+---------------------------------------------+---------------+
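A small caveat that isn't in the original answer: on Python versions before 3.8, statistics.mode raises StatisticsError when several values are tied for most frequent, so a Counter-based UDF is a bit more forgiving. A minimal sketch, assuming the same df as above:
from collections import Counter
from pyspark.sql import functions as F, types as T

def most_common(x):
    # Counter.most_common(1) returns [(element, count)]; guard against empty arrays
    return Counter(x).most_common(1)[0][0] if x else None

u = F.udf(most_common, T.StringType())
df.withColumn("Output", u("Elements")).show(truncate=False)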
You can use pyspark sql functions to achieve that (the transform and aggregate wrappers used below were added to pyspark.sql.functions in Spark 3.1; on 2.4 you would need F.expr with the SQL forms instead).
Here is a generic function that adds a new column containing the most common element of another array column:
import pyspark.sql.functions as sf
def add_most_common_val_in_array(df, arraycol, drop=False):
    """Takes a spark df column of ArrayType() and returns the most common element
    in the array in a new column of the df called f"MostCommon_{arraycol}"

    Args:
        df (spark.DataFrame): dataframe
        arraycol (ArrayType()): array column in which you want to find the most common element
        drop (bool, optional): Drop the arraycol after finding most common element. Defaults to False.

    Returns:
        spark.DataFrame: df with additional column containing most common element in arraycol
    """
    dvals = f"distinct_{arraycol}"
    dvalscount = f"distinct_{arraycol}_count"
    startcols = df.columns
    df = df.withColumn(dvals, sf.array_distinct(arraycol))
    df = df.withColumn(
        dvalscount,
        sf.transform(
            dvals,
            lambda uval: sf.aggregate(
                arraycol,
                sf.lit(0),
                lambda acc, entry: sf.when(entry == uval, acc + 1).otherwise(acc),
            ),
        ),
    )
    countercol = f"ReverseCounter{arraycol}"
    df = df.withColumn(countercol, sf.map_from_arrays(dvalscount, dvals))
    mccol = f"MostCommon_{arraycol}"
    df = df.withColumn(mccol, sf.element_at(countercol, sf.array_max(dvalscount)))
    df = df.select(*startcols, mccol)
    if drop:
        df = df.drop(arraycol)
    return df
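For reference, a minimal usage sketch (my own example, not from the original answer) on data shaped like the question's, with an ID column and an Elements array column:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("X", ["Element5", "Element1", "Element5"]),
     ("Y", ["Element Unknown", "Element Unknown", "Element_Z"])],
    ["ID", "Elements"])

# Adds a MostCommon_Elements column next to ID and Elements
add_most_common_val_in_array(df, "Elements").show(truncate=False)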
I have a list in which each item is further split into 3 fields, separated by ' | '.
Suppose my list is:
['North America | 23 | United States', 'South America | 12 | Brazil',
 'Europe | 51 | Greece', ...] and so on
Using this list, I want to create a dictionary that would make the first field in each item the value, and the second field in each item the key.
How can I add these list items to a dictionary using a for loop?
My expected outcome would be
{'23': 'North America', '12': 'South America', '51': 'Europe'}
How about something like this:
var myList = new List<string>() { "North America | 23 | United States", "South America | 12 | Brazil", "Europe | 51 | Greece" };
var myDict = myList.Select(x => x.Split('|')).ToDictionary(a => a[1].Trim(), a => a[0].Trim());
Assuming you're in Python, if you know what the delimiter is, you can use string.split() to break the string up into a list then go from there.
my_dict = {}
for val in my_list:
    words = val.split(" | ")
    my_dict[words[1]] = words[0]
For other languages, you can take the index of the first "|", substring from the beginning to give you your value, then take the index of the second "|" to give you the key. In Java this would look like:
Map<String, String> myDict = new HashMap<String, String>();
for (String s : myArray) {
    int pos = s.indexOf("|");
    String val = s.substring(0, pos - 1);
    String rest = s.substring(pos + 2);
    String key = rest.substring(0, rest.indexOf("|") - 1);
    myDict.put(key, val);
}
EDIT: there may well be more efficient ways of solving the problem in other languages; this is just the simplest method I know off the top of my head.
Using a Python dict comprehension
data = ['North America | 23 | United States', 'South America | 12 | Brazil',]
# Split each string in the list on " | "; index 1 becomes the dict key and index 0 the dict value
res = {i.split(" | ")[1]: i.split(" | ")[0] for i in data}
print (res)
I hope this helps and counts!
Let's say you have a Spark dataframe with multiple columns and you want to return the rows where the columns contain specific characters. Specifically, you want to return the rows where at least one of the fields contains ( ) , [ ] % or +.
What is the proper syntax in case you want to use Spark SQL rlike function?
import spark.implicits._
val dummyDf = Seq(("John[", "Ha", "Smith?"),
                  ("Julie", "Hu", "Burol"),
                  ("Ka%rl", "G", "Hu!"),
                  ("(Harold)", "Ju", "Di+")
                 ).toDF("FirstName", "MiddleName", "LastName")
dummyDf.show()
+---------+----------+--------+
|FirstName|MiddleName|LastName|
+---------+----------+--------+
| John[| Ha| Smith?|
| Julie| Hu| Burol|
| Ka%rl| G| Hu!|
| (Harold)| Ju| Di+|
+---------+----------+--------+
Expected Output
+---------+----------+--------+
|FirstName|MiddleName|LastName|
+---------+----------+--------+
| John[| Ha| Smith?|
| Ka%rl| G| Hu!|
| (Harold)| Ju| Di+|
+---------+----------+--------+
My attempts so far return errors or unexpected results, even when I just try to search for (.
I know that I could use the simple like construct multiple times, but I am trying to figure out how to do it in a more concise way with a regex and Spark SQL.
You can try this using the rlike method:
dummyDf.show()
+---------+----------+--------+
|FirstName|MiddleName|LastName|
+---------+----------+--------+
| John[| Ha| Smith?|
| Julie| Hu| Burol|
| Ka%rl| G| Hu!|
| (Harold)| Ju| Di+|
| +Tim| Dgfg| Ergf+|
+---------+----------+--------+
import org.apache.spark.sql.functions.{col, lit}

val df = dummyDf.withColumn("hasSpecial", lit(false))

val result = df.dtypes
  .collect { case (dn, dt) => dn }
  .foldLeft(df)((accDF, c) =>
    accDF.withColumn("hasSpecial", col(c).rlike(".*[\\(\\)\\[\\]%+]+.*") || col("hasSpecial")))

result.filter(col("hasSpecial")).show(false)
Output:
+---------+----------+--------+----------+
|FirstName|MiddleName|LastName|hasSpecial|
+---------+----------+--------+----------+
|John[ |Ha |Smith? |true |
|Ka%rl |G |Hu! |true |
|(Harold) |Ju |Di+ |true |
|+Tim |Dgfg |Ergf+ |true |
+---------+----------+--------+----------+
You can also drop the hasSpecial column if you want.
Try this regex: .*[()\[\]%\+,.]+.*
.* matches any character zero or more times
[()\[\]%\+,.]+ matches any of the characters inside the brackets one or more times
.* matches any character zero or more times
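If you happen to be in PySpark rather than Scala, here is a rough equivalent of the same idea (a sketch, assuming a PySpark DataFrame with the same column names as dummyDf above):
from functools import reduce
from pyspark.sql import functions as F

pattern = ".*[()\\[\\]%+]+.*"
cols = ["FirstName", "MiddleName", "LastName"]

# Keep rows where at least one of the columns matches the pattern
condition = reduce(lambda acc, c: acc | F.col(c).rlike(pattern), cols, F.lit(False))
dummyDf.filter(condition).show(truncate=False)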
I'm looking for a way to add a new column to a Spark DataFrame from a list of strings, in just one simple line.
Given :
rdd = sc.parallelize([((u'2016-10-19', u'2016-293'), 40020),
                      ((u'2016-10-19', u'2016-293'), 143938),
                      ((u'2016-10-19', u'2016-293'), 135891225.0)
                     ])
This is my code to structure my rdd and get a DataFrame:
def structurate_CohortPeriod_metrics_by_OrderPeriod(line):
    ((OrderDate, OrderPeriod), metrics) = line
    metrics = str(metrics)
    return OrderDate, OrderPeriod, metrics

(rdd
 .map(structurate_CohortPeriod_metrics_by_OrderPeriod)
 .toDF(['OrderDate', 'OrderPeriod', 'MetricValue'])
 .show())
Result :
+----------+-----------+-----------+
| OrderDate|OrderPeriod|MetricValue|
+----------+-----------+-----------+
|2016-10-19| 2016-293| 40020|
|2016-10-19| 2016-293| 143938|
|2016-10-19| 2016-293|135891225.0|
+----------+-----------+-----------+
I want to add a new column giving the metric's name. This is what I've done:
def structurate_CohortPeriod_metrics_by_OrderPeriod(line):
    (((OrderDate, OrderPeriod), metrics), index) = line
    metrics = str(metrics)
    return OrderDate, OrderPeriod, metrics, index

df1 = (rdd
       .zipWithIndex()
       .map(structurate_CohortPeriod_metrics_by_OrderPeriod)
       .toDF(['OrderDate', 'OrderPeriod', 'MetricValue', 'index']))
Then
from pyspark.sql.types import StructType, StructField, StringType

df2 = sqlContext.createDataFrame(
    sc.parallelize([('0', 'UsersNb'),
                    ('1', 'VideosNb'),
                    ('2', 'VideosDuration')]),
    StructType([StructField('index', StringType()),
                StructField('MetricName', StringType())]))
df2.show()
+-----+--------------+
|index| MetricName|
+-----+--------------+
| 0| UsersNb|
| 1| VideosNb|
| 2|VideosDuration|
+-----+--------------+
And finally:
(df1
 .join(df2, df1.index == df2.index)
 .drop(df2.index)
 .select('index', 'OrderDate', 'OrderPeriod', 'MetricName', 'MetricValue')
 .show())
+-----+----------+-----------+--------------+-----------+
|index| OrderDate|OrderPeriod| MetricName|MetricValue|
+-----+----------+-----------+--------------+-----------+
| 0|2016-10-19| 2016-293| VideosNb| 143938|
| 1|2016-10-19| 2016-293| UsersNb| 40020|
| 2|2016-10-19| 2016-293|VideosDuration|135891225.0|
+-----+----------+-----------+--------------+-----------+
This is my expected output, but this method takes considerably longer. I want to do this in just one or two lines, something like, for example, the lit method:
from pyspark.sql.functions import lit
df1.withColumn('MetricName', lit('my_string'))
But of course I need to put in 3 different strings: 'VideosNb', 'UsersNb' and 'VideosDuration'.
Ideas? Thank you very much!
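One lighter-weight option (a sketch, not from the original post; it assumes the df1 built above with its index column, and the index-to-name mapping shown in df2) is to derive MetricName directly from index with when/otherwise, instead of building and joining a second DataFrame:
from pyspark.sql import functions as F

df_named = df1.withColumn(
    'MetricName',
    F.when(F.col('index') == 0, 'UsersNb')
     .when(F.col('index') == 1, 'VideosNb')
     .otherwise('VideosDuration'))
df_named.show()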
I have a web application that uses RavenDB on the backend and allows the user to keep track of inventory. The three entities in my domain are:
public class Location
{
    public string Id { get; set; }
    public string Name { get; set; }
}

public class ItemType
{
    public string Id { get; set; }
    public string Name { get; set; }
}

public class Item
{
    public string Id { get; set; }
    public DenormalizedRef<Location> Location { get; set; }
    public DenormalizedRef<ItemType> ItemType { get; set; }
}
On my web app, there is a page for the user to see a summary breakdown of the inventory they have at the various locations. Specifically, it shows the location name, item type name, and then a count of items.
The first approach I took was a map/reduce index on InventoryItems:
this.Map = inventoryItems =>
    from inventoryItem in inventoryItems
    select new
    {
        LocationName = inventoryItem.Location.Name,
        ItemTypeName = inventoryItem.ItemType.Name,
        Count = 1
    };

this.Reduce = indexEntries =>
    from indexEntry in indexEntries
    group indexEntry by new
    {
        indexEntry.LocationName,
        indexEntry.ItemTypeName,
    } into g
    select new
    {
        g.Key.LocationName,
        g.Key.ItemTypeName,
        Count = g.Sum(entry => entry.Count),
    };
That is working fine, but it only displays rows for Location/ItemType pairs that have a non-zero count of items. I need it to show all Locations and, for each location, all ItemTypes, even those that don't have any items associated with them.
I've tried a few different approaches but no success so far. My thought was to turn the above into a Multi-Map/Reduce index and just add another map that would give me the cartesian product of Locations and ItemTypes but with a Count of 0. Then I could feed that into the reduce and would always have a record for every location/itemtype pair.
this.AddMap<object>(docs =>
    from itemType in docs.WhereEntityIs<ItemType>("ItemTypes")
    from location in docs.WhereEntityIs<Location>("Locations")
    select new
    {
        LocationName = location.Name,
        ItemTypeName = itemType.Name,
        Count = 0
    });
This isn't working though so I'm thinking RavenDB doesn't like this kind of mapping. Is there a way to get a cross join / cartesian product from RavenDB? Alternatively, any other way to accomplish what I'm trying to do?
EDIT: To clarify, Locations, ItemTypes, and Items are documents in the system that the user of the app creates. Without any Items in the system, if the user enters three Locations "London", "Paris", and "Berlin" along with two ItemTypes "Desktop" and "Laptop", the expected result is that when they look at the inventory summary, they see a table like so:
| Location | Item Type | Count |
|----------|-----------|-------|
| London | Desktop | 0 |
| London | Laptop | 0 |
| Paris | Desktop | 0 |
| Paris | Laptop | 0 |
| Berlin | Desktop | 0 |
| Berlin | Laptop | 0 |
Here is how you can do this with all the empty locations as well:
this.AddMap<InventoryItem>(inventoryItems =>
    from inventoryItem in inventoryItems
    select new
    {
        LocationName = inventoryItem.Location.Name,
        Items = new[]
        {
            new
            {
                ItemTypeName = inventoryItem.ItemType.Name,
                Count = 1
            }
        }
    });

this.AddMap<Location>(locations =>
    from location in locations
    select new
    {
        LocationName = location.Name,
        Items = new object[0]
    });

this.Reduce = results =>
    from result in results
    group result by result.LocationName into g
    select new
    {
        LocationName = g.Key,
        Items = from item in g.SelectMany(x => x.Items)
                group item by item.ItemTypeName into gi
                select new
                {
                    ItemTypeName = gi.Key,
                    Count = gi.Sum(x => x.Count)
                }
    };