Creating Dictionary from nth list items - list

I have a list in which each item is split into 3 fields, separated by ' | '.
Suppose my list is:
['North America | 23 | United States', 'South America | 12 | Brazil',
'Europe | 51 | Greece'] and so on.
Using this list, I want to create a dictionary that would make the first field in each item the value, and the second field in each item the key.
How can I add these list items to a dictionary using a for loop?
My expected outcome would be:
{'23': 'North America', '12': 'South America', '51': 'Europe'}

How about something like this:
var myList = new List<string>() { "North America | 23 | United States", "South America | 12 | Brazil", "Europe | 51 | Greece" };
var myDict = myList.Select(x => x.Split('|')).ToDictionary(a => a[1].Trim(), a => a[0].Trim()); // Trim() strips the spaces around the '|' delimiter

Assuming you're in Python, if you know what the delimiter is, you can use str.split() to break each string up into a list, then go from there.
my_dict = {}
for val in my_list:
    words = val.split(" | ")
    my_dict[words[1]] = words[0]
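For example, a self-contained run with the sample list from the question:
my_list = ['North America | 23 | United States', 'South America | 12 | Brazil', 'Europe | 51 | Greece']
my_dict = {}
for val in my_list:
    words = val.split(" | ")       # e.g. ['North America', '23', 'United States']
    my_dict[words[1]] = words[0]   # second field becomes the key, first the value
print(my_dict)  # {'23': 'North America', '12': 'South America', '51': 'Europe'}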
For other languages, you can take the index of the first "|", substring from the beginning to give you your value, then take the index of the second "|" to give you the key. In Java this would look like (using a HashMap from java.util; trim() makes it robust to the spaces around the delimiter):
Map<String, String> myDict = new HashMap<>();
for (String s : myArray) {
    int pos = s.indexOf("|");
    String val = s.substring(0, pos).trim();                  // first field
    String rest = s.substring(pos + 1);
    String key = rest.substring(0, rest.indexOf("|")).trim(); // second field
    myDict.put(key, val);
}
EDIT: there could well be more efficient ways of solving the problem in other languages; that's just the simplest method I know off the top of my head.

Using a Python dict comprehension:
data = ['North America | 23 | United States', 'South America | 12 | Brazil']
# Split each string in the list by " | ", using index 1 as the key and index 0 as the value
res = {i.split(" | ")[1]: i.split(" | ")[0] for i in data}
print(res)
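If you'd rather split each string only once, the same result can be built from pre-split fields (a minor variant, not part of the original answer):
res = {fields[1]: fields[0] for fields in (i.split(" | ") for i in data)}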
I hope this helps and counts!

Related

How do I find the most frequent element in a list in pyspark?

I have a pyspark dataframe with two columns, ID and Elements. Column "Elements" holds a list in each row. It looks like this:
ID | Elements
-----------------------------------------------
X  | [Element5, Element1, Element5]
Y  | [Element Unknown, Element Unknown, Element_Z]
I want to form a column with the most frequent element in the column 'Elements'. The output should look like:
ID | Elements                                      | Output_column
-------------------------------------------------------------------
X  | [Element5, Element1, Element5]                | Element5
Y  | [Element Unknown, Element Unknown, Element_Z] | Element Unknown
How can I do that using pyspark?
Thanks in advance.
We can use higher-order functions here (available from Spark 2.4+; note that the lambda-comparator form of array_sort used below was only added in Spark 3.0).
First use transform and aggregate to get counts for each distinct value in the array.
Then sort the array of structs in descending order and take the first element.
from pyspark.sql import functions as F

temp = (df.withColumn("Dist", F.array_distinct("Elements"))
          .withColumn("Counts", F.expr("""transform(Dist, x ->
                          aggregate(Elements, 0, (acc, y) -> IF(y = x, acc + 1, acc)))"""))
          .withColumn("Map", F.arrays_zip("Dist", "Counts"))
       ).drop("Dist", "Counts")

out = temp.withColumn("Output_column",
        F.expr("""element_at(array_sort(Map, (first, second) ->
                  CASE WHEN first['Counts'] > second['Counts'] THEN -1 ELSE 1 END), 1)['Dist']"""))
Output:
Note that I have added a blank array for ID Z to test. You can also drop the Map column by adding .drop("Map") to the output.
out.show(truncate=False)
+---+---------------------------------------------+--------------------------------------+---------------+
|ID |Elements |Map |Output_column |
+---+---------------------------------------------+--------------------------------------+---------------+
|X |[Element5, Element1, Element5] |[{Element5, 2}, {Element1, 1}] |Element5 |
|Y |[Element Unknown, Element Unknown, Element_Z]|[{Element Unknown, 2}, {Element_Z, 1}]|Element Unknown|
|Z |[] |[] |null |
+---+---------------------------------------------+--------------------------------------+---------------+
For lower versions, you can use a UDF with statistics.mode:
from pyspark.sql import functions as F, types as T
from statistics import mode
u = F.udf(lambda x: mode(x) if len(x)>0 else None,T.StringType())
df.withColumn("Output",u("Elements")).show(truncate=False)
+---+---------------------------------------------+---------------+
|ID |Elements |Output |
+---+---------------------------------------------+---------------+
|X |[Element5, Element1, Element5] |Element5 |
|Y |[Element Unknown, Element Unknown, Element_Z]|Element Unknown|
|Z |[] |null |
+---+---------------------------------------------+---------------+
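One caveat with the above: on Python versions before 3.8, statistics.mode raises StatisticsError when the list has a tie, so a Counter-based UDF is a safer drop-in (a sketch, reusing the F and T imports above):
from collections import Counter
u = F.udf(lambda x: Counter(x).most_common(1)[0][0] if x else None, T.StringType())
df.withColumn("Output", u("Elements")).show(truncate=False)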
You can use PySpark SQL functions to achieve that (note: the Python-side transform and aggregate used below were added in PySpark 3.1, so this variant needs a newer Spark than the expr-based answer above).
Here is a generic function that adds a new column containing the most common element in another array column:
import pyspark.sql.functions as sf

def add_most_common_val_in_array(df, arraycol, drop=False):
    """Takes a spark df column of ArrayType() and returns the most common element
    in the array in a new column of the df called f"MostCommon_{arraycol}"

    Args:
        df (spark.DataFrame): dataframe
        arraycol (ArrayType()): array column in which you want to find the most common element
        drop (bool, optional): Drop the arraycol after finding most common element. Defaults to False.

    Returns:
        spark.DataFrame: df with additional column containing most common element in arraycol
    """
    dvals = f"distinct_{arraycol}"
    dvalscount = f"distinct_{arraycol}_count"
    startcols = df.columns
    df = df.withColumn(dvals, sf.array_distinct(arraycol))
    df = df.withColumn(
        dvalscount,
        sf.transform(
            dvals,
            lambda uval: sf.aggregate(
                arraycol,
                sf.lit(0),
                lambda acc, entry: sf.when(entry == uval, acc + 1).otherwise(acc),
            ),
        ),
    )
    countercol = f"ReverseCounter{arraycol}"
    df = df.withColumn(countercol, sf.map_from_arrays(dvalscount, dvals))
    mccol = f"MostCommon_{arraycol}"
    df = df.withColumn(mccol, sf.element_at(countercol, sf.array_max(dvalscount)))
    df = df.select(*startcols, mccol)
    if drop:
        df = df.drop(arraycol)
    return df
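A quick usage sketch (the toy frame and the SparkSession name spark are assumptions for illustration):
df = spark.createDataFrame(
    [("X", ["Element5", "Element1", "Element5"]),
     ("Y", ["Element Unknown", "Element Unknown", "Element_Z"])],
    ["ID", "Elements"],
)
add_most_common_val_in_array(df, "Elements").show(truncate=False)
# adds MostCommon_Elements: Element5 for X, Element Unknown for Y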

Converting a dataframe column with values to a list using spark and scala

+-----------------------------------------------------------------------------------------------------------------------------------------------+
|Texts |
+----------------------------------------------------------------------------------------------------------------------------------------------+
|RT #xxxxxx: post aqwe qwqq ssdd qaAQ WQWQW CSDWDW!!!
must RT ! |
|RT #xxxxx: aaa in ssss ssss ss sqqq this qqq in "sss" should xxxx xx at xx xaaaa aqw |
|RT #xxxx: QWW sadad jkhj to hjyhy a eryr rrryryry? ersfsfdsgdgdgg t rtrt ytyyryr.
sadwf wwewe ewewe jyiopo;l dwewre etet of the ddgdg-we dfdfdf, #b… |
+-----------------------------------------------------------------------------------------------------------------------------------------------+
I want to have these rows of values from the Texts column in a list, using Scala and Spark.
1. val newList = myDataframe.select("Texts").rdd.map(_(0)).collect.toList
2. val newList = myDataframe.select("Texts").collect().map(_(0)).toList
newList.foreach(println)
Neither way gives any output, and the program doesn't terminate either. No errors are thrown.
Expected output
["RT #xxxxxx: post aqwe qwqq ssdd qaAQ WQWQW CSDWDW!!! must RT !", "RT #xxxxx: aaa in ssss ssss ss sqqq this qqq in "sss" should xxxx xx at xx xaaaa aqw", "RT #xxxx: QWW sadad jkhj to hjyhy a eryr rrryryry? ersfsfdsgdgdgg t rtrt ytyyryr.
sadwf wwewe ewewe jyiopo;l dwewre etet of the ddgdg-we dfdfdf, #b…"]
Please note that a sentence in each row of the dataframe may contain newlines,
e.g. I am going to the shop.\n It's very expensive
will be displayed as
I am going to the shop
It's very expensive
but both lines belong to the same row.
The methods below are correct for converting a column of a dataframe into a list:
1. val newList = myDataframe.select("Texts").rdd.map(_(0)).collect.toList
2. val newList = myDataframe.select("Texts").collect().map(_(0)).toList
But the dataframe in the question says each row may contain newlines, so the above methods won't work directly. The newlines should be removed first.
val singleLineDataframe = myDataframe.withColumn("Texts", regexp_replace(col("Texts"), "[\\r\\n]", "."))
val sentenceList = singleLineDataframe.select("Texts").rdd.map(r => r(0)).collect.toList
for (element <- sentenceList)
  println(element)
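For reference, the equivalent steps in PySpark (a sketch, assuming the same myDataframe with a Texts column):
from pyspark.sql import functions as F
singleLine = myDataframe.withColumn("Texts", F.regexp_replace("Texts", r"[\r\n]", "."))
sentenceList = [row[0] for row in singleLine.select("Texts").collect()]
for element in sentenceList:
    print(element)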

spark regex while join data frame

I need to write a regex for a condition check in Spark while doing a join.
My regex should match the strings below:
n3_testindia1 = test-india-1
n2_stagamerica2 = stag-america-2
n1_prodeurope2 = prod-europe-2
df1.select("location1").distinct.show()
+----------------+
|location1       |
+----------------+
|n3_testindia1   |
|n2_stagamerica2 |
|n1_prodeurope2  |
+----------------+
df2.select("loc1").distinct.show()
+--------------+
| loc1 |
+--------------+
|test-india-1 |
|stag-america-2|
|prod-europe-2 |
+--------------+
I want to join based on location columns like below
val joindf = df1.join(df2, df1("location1") == regex(df2("loc1")))
Based on the information above, you can do that in Spark 2.4.0 using
val joindf = df1.join(df2,
  regexp_extract(df1("location1"), """[^_]+_(.*)""", 1)
    === translate(df2("loc1"), "-", ""))
Or in prior versions something like
val joindf = df1.join(df2,
  df1("location1").substr(lit(4), length(df1("location1")))
    === translate(df2("loc1"), "-", ""))
You can split location1 by "_" and take the 2nd element, then match it against loc1 with the "-" characters removed. Check this out:
scala> val df1 = Seq(("n3_testindia1"),("n2_stagamerica2"),("n1_prodeurope2")).toDF("location1")
df1: org.apache.spark.sql.DataFrame = [location1: string]
scala> val df2 = Seq(("test-india-1"),("stag-america-2"),("prod-europe-2")).toDF("loc1")
df2: org.apache.spark.sql.DataFrame = [loc1: string]
scala> df1.join(df2,split('location1,"_")(1) === regexp_replace('loc1,"-",""),"inner").show
+---------------+--------------+
| location1| loc1|
+---------------+--------------+
| n3_testindia1| test-india-1|
|n2_stagamerica2|stag-america-2|
| n1_prodeurope2| prod-europe-2|
+---------------+--------------+
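The same join, ported to PySpark for anyone not on Scala (a sketch, assuming df1 and df2 as above):
from pyspark.sql import functions as F
joindf = df1.join(
    df2,
    F.split(df1["location1"], "_").getItem(1) == F.regexp_replace(df2["loc1"], "-", ""),
    "inner",
)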

Unique list from dynamic range table with possible blanks

I have an Excel table in Sheet1 in which column A contains (with blanks in between):
Name of company
Company 1
Company 2

Company 3
Company 1

Company 4
Company 1
Company 3
I want to extract a unique list of company names to Sheet2, also in column A. I can only do this with the help of a helper column if I don't have any blanks between company names; when I do, I get one extra "company" which is a blank.
Also, the examples I've researched were for non-dynamic tables, so they don't work because I don't know the length of my column.
I want in Sheet2 column A:
Name of company
Company 1
Company 2
Company 3
Company 4
Looking for the solution that requires the least computational power, whether Excel or Excel-VBA. The final order in which they appear in Sheet2 doesn't really matter.
Using a slight modification to Recorder-generated code:
Sub Macro1()
    Sheets("Sheet1").Range("A:A").Copy Sheets("Sheet2").Range("A1")
    Sheets("Sheet2").Range("A:A").RemoveDuplicates Columns:=1, Header:=xlYes
    With Sheets("Sheet2").Sort
        .SortFields.Clear
        .SortFields.Add Key:=Range("A2:A" & Rows.Count) _
            , SortOn:=xlSortOnValues, Order:=xlAscending, DataOption:=xlSortNormal
        .SetRange Range("A2:A" & Rows.Count)
        .Header = xlGuess
        .MatchCase = False
        .Orientation = xlTopToBottom
        .SortMethod = xlPinYin
        .Apply
    End With
End Sub
Sample Sheet1:
Sample Sheet2:
The sort removes the blanks.
EDIT#1:
If the original data in Sheet1 was derived from formulas, then using PasteSpecial will remove unwanted formula copying. There is also a final sweep for empty cells:
Sub Macro1_The_Sequel()
    Dim rng As Range
    Sheets("Sheet1").Range("A:A").Copy
    Sheets("Sheet2").Range("A1").PasteSpecial Paste:=xlPasteValues
    Sheets("Sheet2").Range("A:A").RemoveDuplicates Columns:=1, Header:=xlYes
    Set rng = Sheets("Sheet2").Range("A2:A" & Rows.Count)
    With Sheets("Sheet2").Sort
        .SortFields.Clear
        .SortFields.Add Key:=rng, SortOn:=xlSortOnValues, Order:=xlAscending, DataOption:=xlSortNormal
        .SetRange rng
        .Header = xlGuess
        .MatchCase = False
        .Orientation = xlTopToBottom
        .SortMethod = xlPinYin
        .Apply
    End With
    Call Kleanup
End Sub

Sub Kleanup()
    Dim N As Long, i As Long
    With Sheets("Sheet2")
        N = .Cells(Rows.Count, "A").End(xlUp).Row
        For i = N To 1 Step -1
            If .Cells(i, "A").Value = "" Then
                .Cells(i, "A").Delete shift:=xlUp
            End If
        Next i
    End With
End Sub
All of these answers use VBA. The easiest way to do this is to use a pivot table.
First, select your data, including the header row, and go to Insert -> PivotTable:
Then you will get a dialog box. You don't need to select any of the options here, just click OK. This will create a new sheet with a blank pivot table. You then need to tell Excel what data you're looking for: on the right-hand side of Excel you will see a new section named PivotTable Fields, and in this case you only want the Name of company in the Rows section, so simply click and drag that header to the Rows section:
This will give a result with just the unique names and an entry with (blank) at the bottom:
If you don't want to use the Pivot Table further, simply copy and paste the result rows you're interested in (in this case, the unique company names) into a new column or sheet to get just those without the pivot table attached. If you want to keep the pivot table, you can right click on Grand Total and remove that, as well as filter the list to remove the (blank) entry.
Either way, you now have your list of unique results without blanks and it didn't require any formulas or VBA, and it took relatively few resources to complete (far fewer than any VBA or formula solution).
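For comparison, if the workbook can be processed outside Excel, the same de-duplication is a one-liner in pandas (a sketch; the filename is an assumption, and the column header matches the sample above):
import pandas as pd
companies = pd.read_excel("Book1.xlsx", sheet_name="Sheet1")["Name of company"].dropna().unique().tolist()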
Here's another method using Excel's built-in Remove Duplicates feature, and a programmed method to remove the blank lines:
EDIT
I have deleted the code using the above methodology as it takes too long to run. I have replaced it with a method that uses VBA's collection object to compile a unique list of companies.
The first method, on my machine, took about two seconds to run; the method below: about 0.02 seconds.
Sub RemoveDups()
    Dim wsSrc As Worksheet, wsDest As Worksheet
    Dim rRes As Range
    Dim I As Long, S As String
    Dim vSrc As Variant, vRes() As Variant, COL As Collection

    Set wsSrc = Worksheets("sheet1")
    Set wsDest = Worksheets("sheet2")
    Set rRes = wsDest.Cells(1, 1)

    'Get the source data
    With wsSrc
        vSrc = .Range(.Cells(1, 1), .Cells(.Rows.Count, 1).End(xlUp))
    End With

    'Collect unique list of companies
    Set COL = New Collection
    On Error Resume Next
    For I = 2 To UBound(vSrc, 1) 'Assume Row 1 is the header
        S = CStr(Trim(vSrc(I, 1)))
        If Len(S) > 0 Then COL.Add S, S
    Next I
    On Error GoTo 0

    'Populate results array
    ReDim vRes(0 To COL.Count, 1 To 1)

    'Header
    vRes(0, 1) = vSrc(1, 1)

    'Companies
    For I = 1 To COL.Count
        vRes(I, 1) = COL(I)
    Next I

    'Set results range
    Set rRes = rRes.Resize(UBound(vRes, 1) + 1)

    'Write the results
    With rRes
        .EntireColumn.Clear
        .Value = vRes
        .EntireColumn.AutoFit
        'Uncomment the below line if you want
        '.Sort key1:=.Columns(1), order1:=xlAscending, MatchCase:=False, Header:=xlYes
    End With
End Sub
NOTE: You wrote you didn't care about the order, but if you want to Sort the results, that added about 0.03 seconds to the routine.
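On the same theme, the keyed-Collection approach ports directly to Python if you are scripting the workbook externally; a minimal sketch with openpyxl (the filename is an assumption, sheet names as in this answer):
from openpyxl import load_workbook

wb = load_workbook("Book1.xlsx")              # assumed filename
src, dest = wb["Sheet1"], wb["Sheet2"]

seen = {}                                     # dict keys act like VBA's keyed Collection
for (value,) in src.iter_rows(min_row=2, max_col=1, values_only=True):
    if value:                                 # skip blank cells
        seen[value] = True

dest["A1"] = src["A1"].value                  # copy the header
for i, name in enumerate(seen, start=2):
    dest.cell(row=i, column=1, value=name)

wb.save("Book1.xlsx")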
With two sheets named 1 and 2
Inside sheet named: 1
+----+-----------------+
| | A |
+----+-----------------+
| 1 | Name of company |
| 2 | Company 1 |
| 3 | Company 2 |
| 4 | |
| 5 | Company 3 |
| 6 | Company 1 |
| 7 | |
| 8 | Company 4 |
| 9 | Company 1 |
| 10 | Company 3 |
+----+-----------------+
Result in sheet named: 2
+---+-----------------+
| | A |
+---+-----------------+
| 1 | Name of company |
| 2 | Company 1 |
| 3 | Company 2 |
| 4 | Company 3 |
| 5 | Company 4 |
+---+-----------------+
Use this code in a regular module:
Sub extractUni()
    Dim objDic
    Dim Cell
    Dim Area As Range
    Dim i
    Dim Value
    Set Area = Sheets("1").Range("A2:A10") 'this is where your data is located
    Set objDic = CreateObject("Scripting.Dictionary") 'use a Dictionary!
    For Each Cell In Area
        If Not objDic.Exists(Cell.Value) Then
            objDic.Add Cell.Value, Cell.Address
        End If
    Next
    i = 2 '2 because of the heading
    For Each Value In objDic.Keys
        If Not Value = Empty Then
            Sheets("2").Cells(i, 1).Value = Value 'store the data in column A below the heading
            i = i + 1
        End If
    Next
End Sub
The code returns the data unsorted, just in the order it appears.
If you want a sorted list, just add this code before the last line:
Dim sht As Worksheet
Set sht = Sheets("2")
sht.Activate
With sht.Sort
    .SetRange Range("A:A")
    .Header = xlYes
    .MatchCase = False
    .Orientation = xlTopToBottom
    .SortMethod = xlPinYin
    .Apply
End With
This way the result will always be sorted.
(The subroutine would then look like this:)
Sub extractUni()
    Dim objDic
    Dim Cell
    Dim Area As Range
    Dim i
    Dim Value
    Dim sht As Worksheet
    Set Area = Sheets("1").Range("A2:A10") 'this is where your data is located
    Set objDic = CreateObject("Scripting.Dictionary") 'use a Dictionary!
    For Each Cell In Area
        If Not objDic.Exists(Cell.Value) Then
            objDic.Add Cell.Value, Cell.Address
        End If
    Next
    i = 2 '2 because of the heading
    For Each Value In objDic.Keys
        If Not Value = Empty Then
            Sheets("2").Cells(i, 1).Value = Value 'store the data in column A below the heading
            i = i + 1
        End If
    Next
    Set sht = Sheets("2")
    sht.Activate
    With sht.Sort
        .SetRange Range("A:A")
        .Header = xlYes
        .MatchCase = False
        .Orientation = xlTopToBottom
        .SortMethod = xlPinYin
        .Apply
    End With
End Sub
If you have any questions about the code, I will be glad to explain.

How to do a cross join / cartesian product in RavenDB?

I have a web application that uses RavenDB on the backend and allows the user to keep track of inventory. The three entities in my domain are:
public class Location
{
    public string Id { get; set; }
    public string Name { get; set; }
}
public class ItemType
{
    public string Id { get; set; }
    public string Name { get; set; }
}
public class Item
{
    public string Id { get; set; }
    public DenormalizedRef<Location> Location { get; set; }
    public DenormalizedRef<ItemType> ItemType { get; set; }
}
On my web app, there is a page for the user to see a summary breakdown of the inventory they have at the various locations. Specifically, it shows the location name, item type name, and then a count of items.
The first approach I took was a map/reduce index on InventoryItems:
this.Map = inventoryItems =>
    from inventoryItem in inventoryItems
    select new
    {
        LocationName = inventoryItem.Location.Name,
        ItemTypeName = inventoryItem.ItemType.Name,
        Count = 1
    };

this.Reduce = indexEntries =>
    from indexEntry in indexEntries
    group indexEntry by new
    {
        indexEntry.LocationName,
        indexEntry.ItemTypeName,
    } into g
    select new
    {
        g.Key.LocationName,
        g.Key.ItemTypeName,
        Count = g.Sum(entry => entry.Count),
    };
That is working fine but it only displays rows for Location/ItemType pairs that have a non-zero count of items. I need to have it show all Locations and for each location, all item types even those that don't have any items associated with them.
I've tried a few different approaches but no success so far. My thought was to turn the above into a Multi-Map/Reduce index and just add another map that would give me the cartesian product of Locations and ItemTypes but with a Count of 0. Then I could feed that into the reduce and would always have a record for every location/itemtype pair.
this.AddMap<object>(docs =>
    from itemType in docs.WhereEntityIs<ItemType>("ItemTypes")
    from location in docs.WhereEntityIs<Location>("Locations")
    select new
    {
        LocationName = location.Name,
        ItemTypeName = itemType.Name,
        Count = 0
    });
This isn't working though so I'm thinking RavenDB doesn't like this kind of mapping. Is there a way to get a cross join / cartesian product from RavenDB? Alternatively, any other way to accomplish what I'm trying to do?
EDIT: To clarify, Locations, ItemTypes, and Items are documents in the system that the user of the app creates. Without any Items in the system, if the user enters three Locations "London", "Paris", and "Berlin" along with two ItemTypes "Desktop" and "Laptop", the expected result is that when they look at the inventory summary, they see a table like so:
| Location | Item Type | Count |
|----------|-----------|-------|
| London | Desktop | 0 |
| London | Laptop | 0 |
| Paris | Desktop | 0 |
| Paris | Laptop | 0 |
| Berlin | Desktop | 0 |
| Berlin | Laptop | 0 |
Here is how you can do this with all the empty locations as well:
this.AddMap<InventoryItem>(inventoryItems =>
    from inventoryItem in inventoryItems
    select new
    {
        LocationName = inventoryItem.Location.Name,
        Items = new[]
        {
            new { ItemTypeName = inventoryItem.ItemType.Name, Count = 1 }
        }
    });

this.AddMap<Location>(locations =>
    from location in locations
    select new
    {
        LocationName = location.Name,
        Items = new object[0]
    });

this.Reduce = results =>
    from result in results
    group result by result.LocationName into g
    select new
    {
        LocationName = g.Key,
        Items = from item in g.SelectMany(x => x.Items)
                group item by item.ItemTypeName into gi
                select new
                {
                    ItemTypeName = gi.Key,
                    Count = gi.Sum(x => x.Count)
                }
    };