How to extract a value after a particular string in Scala (Spark)? - regex

I have a dataframe with a column:
df =
itemType             count
it_shampoo           5
it_books             5
it_mm                5
{it_mm}              5
it_books it_books    5
{=it_books} it_books 5
I need to get:
itemType   count
it_shampoo 5
it_books   5
it_mm      5
it_mm      5
it_books   5
it_books   5
How do I replace it_books it_books and {=it_books} it_books with it_books? The item type will always start with it_.

Try the regex ^.*?(it_[\w]+).*$ on itemType and replace with the first captured group, $1.
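For reference, a minimal sketch of how that regex could be applied with Spark's regexp_replace (assuming the df from the question; the column names follow the question):
scala> import org.apache.spark.sql.functions.regexp_replace
scala> // the lazy ^.*? skips any prefix such as { or {=, and $1 keeps the first it_… token
scala> df.withColumn("itemType", regexp_replace($"itemType", "^.*?(it_[\\w]+).*$", "$1")).show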

The regex below also works: the greedy leading .* means the capture group grabs the last whitespace-delimited token, which in your data is always the final it_… value.
scala> val df = Seq(("it_shampoo",5),
| ("it_books",5),
| ("it_mm",5),
| ("{it_mm}",5),
| ("it_books it_books",5),
| ("{=it_books} it_books",5)).toDF("itemType","count")
df: org.apache.spark.sql.DataFrame = [itemType: string, count: int]
scala> df.select(regexp_replace('itemType, """.*\b(\S+)\b(.*)$""", "$1").as("replaced"), 'count).show
+----------+-----+
|  replaced|count|
+----------+-----+
|it_shampoo|    5|
|  it_books|    5|
|     it_mm|    5|
|     it_mm|    5|
|  it_books|    5|
|  it_books|    5|
+----------+-----+
scala>

Related

Conditional count Measure

I have data looking like this:
| ID |OpID|
| -- | -- |
| 10 | 1 |
| 10 | 2 |
| 10 | 4 |
| 11 |null|
| 12 | 3 |
| 12 | 4 |
| 13 | 1 |
| 13 | 2 |
| 13 | 3 |
| 14 | 2 |
| 14 | 4 |
Here OpID 4 stands for both 1 and 2.
I would like to count, per distinct ID, the occurrences of 1, 2 and 3 in OpID.
With that rule the count for OpID 1 would be 4, for 2 it would be 4, and for 3 it would be 2.
If an ID has OpID 4 but already has rows for both 1 and 2, the 4 isn't counted. But if 4 exists and only 1 (or 2) is there, the count for 2 (or 1) is incremented.
The expected output would be:
|OpID|Count|
| 1 | 4 |
| 2 | 4 |
| 3 | 2 |
(Going to be using the results in a column chart)
Hope this makes sense...
Edit: there are other columns too, and an (ID, OpID) pair can be duplicated, hence the need to group first.
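This question is about a Power BI measure, but to make the counting rule concrete, here is a hypothetical Spark sketch of the same logic over the sample data (expand OpID 4 into 1 and 2, then count distinct IDs per OpID); it is only a model of the logic, not a Power BI measure, and on this sample it reproduces the expected 4, 4, 2:
scala> import org.apache.spark.sql.functions._
scala> val tx = Seq((10,Some(1)),(10,Some(2)),(10,Some(4)),(11,None),(12,Some(3)),(12,Some(4)),
| (13,Some(1)),(13,Some(2)),(13,Some(3)),(14,Some(2)),(14,Some(4))).toDF("ID","OpID")
scala> tx.na.drop()                                             // drop the null OpID row
|   .select($"ID", explode(when($"OpID" === 4, array(lit(1), lit(2)))
|     .otherwise(array($"OpID"))).as("OpID"))                   // 4 stands for both 1 and 2
|   .groupBy("OpID").agg(countDistinct("ID").as("Count"))       // distinct IDs absorb duplicate rows
|   .orderBy("OpID").show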

Removal of special characters in a txt file using spark

I have a txt file which comprises data like
3,3e,4,5
3,5s,4#,5
5,6,2,4
and so on
Now what I have to do is remove these characters using Spark and then add all the values into an aggregated sum.
How do I remove the special characters and sum all the values?
I have created a dataframe and used regexp_replace to remove the special characters.
But with a .withColumn clause I can only remove the special characters one by one, not all at once, which I believe is not optimal.
Secondly, I have to add all the values into an aggregated sum. How do I get the aggregate value?
If you have a fixed number of columns in the input data, you can use the approach below.
//Input Text file
scala> val rdd = sc.textFile("/spath/stack.txt")
scala> rdd.collect()
res108: Array[String] = Array("3,3e,4,5 ", "3,5s,4#,5 ", 5,6,2,4)
//remove special characters
scala> val rdd1 = rdd.map{x => x.replaceAll("[^,0-9]", "")}
scala> rdd1.collect
res109: Array[String] = Array(3,3,4,5, 3,5,4,5, 5,6,2,4)
//Convert the RDD into a DataFrame
scala> val df = rdd1.map(_.split(",")).map(x => (x(0).toInt,x(1).toInt,x(2).toInt,x(3).toInt)).toDF
scala> df.show(false)
+---+---+---+---+
|_1 |_2 |_3 |_4 |
+---+---+---+---+
|3  |3  |4  |5  |
|3  |5  |4  |5  |
|5  |6  |2  |4  |
+---+---+---+---+
//UDF to sum the four integer columns of a row
scala> import org.apache.spark.sql.Row
scala> val sumUDF = udf((r: Row) => {
| r.getAs[Int]("_1") + r.getAs[Int]("_2") + r.getAs[Int]("_3") + r.getAs[Int]("_4")
| })
//Expected DataFrame
scala> val finaldf = df.withColumn("sumcol", sumUDF(struct(df.columns map col: _*)))
scala> finaldf.show(false)
+---+---+---+---+------+
|_1 |_2 |_3 |_4 |sumcol|
+---+---+---+---+------+
|3  |3  |4  |5  |15    |
|3  |5  |4  |5  |17    |
|5  |6  |2  |4  |17    |
+---+---+---+---+------+
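As a side note (not part of the original answer), the per-row sum can also be computed without a UDF by adding the columns directly, and the overall aggregated sum the question asks for then follows from sum(); a minimal sketch against the same df:
scala> import org.apache.spark.sql.functions.{col, sum}
scala> // add all columns of df element-wise instead of going through a Row-based UDF
scala> val finaldf2 = df.withColumn("sumcol", df.columns.map(col).reduce(_ + _))
scala> finaldf2.agg(sum("sumcol")).show   // single aggregated total over all rows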

Replace substring containing dollar sign ($) with other column value pyspark [duplicate]

This question already has answers here:
Spark column string replace when present in other column (row)
(2 answers)
Closed 3 years ago.
I am trying to replace the substring '$NUMBER' with the value in the column 'number' for each row.
I tried
from pyspark.sql.functions import udf, col
from pyspark.sql.types import StringType
replace_udf = udf(
    lambda long_text, number: long_text.replace("$NUMBER", number),
    StringType()
)
df = df.withColumn('long_text', replace_udf(col('long_text'), col('number')))
and
from pyspark.sql.functions import expr
df = df.withColumn('long_text',expr("regexp_replace(long_text, '$NUMBER', number)"))
but nothing works. I can't figure out how another column can be the replacement for the substring.
SAMPLE:
df1 = spark.createDataFrame(
    [
        ("hahaha the $NUMBER is good", 3),
        ("i dont know about $NUMBER", 2),
        ("what is $NUMBER doing?", 5),
        ("ajajaj $NUMBER", 2),
        ("$NUMBER dwarfs", 1)
    ],
    ["long_text", "number"]
)
INPUT:
+--------------------------+------+
|                 long_text|number|
+--------------------------+------+
|hahaha the $NUMBER is good|     3|
|    what is $NUMBER doing?|     5|
|            ajajaj $NUMBER|     2|
+--------------------------+------+
EXPECTED OUTPUT:
+--------------------+------+
|           long_text|number|
+--------------------+------+
|hahaha the 3 is good|     3|
|    what is 5 doing?|     5|
|            ajajaj 2|     2|
+--------------------+------+
Similar question where the answers didn't cover the column replacement:
Spark column string replace when present in other column (row)
The problem is that $ has a special meaning in regular expressions, which means match the end of the line. So your code:
regexp_replace(long_text, '$NUMBER', number)
is trying to match the pattern: end of line followed by the literal string NUMBER (which can never match anything).
In order to match a $ (or any other regex special character), you have to escape it with a \.
from pyspark.sql.functions import expr
df = df.withColumn('long_text',expr("regexp_replace(long_text, '\$NUMBER', number)"))
df.show()
#+--------------------+------+
#| long_text|number|
#+--------------------+------+
#|hahaha the 3 is good| 3|
#| what is 5 doing?| 5|
#| ajajaj 2| 2|
#+--------------------+------+
You have to cast the number column to a string with str() before you can use it with replace in your lambda:
from pyspark.sql import types as T
from pyspark.sql import functions as F
l = [('hahaha the $NUMBER is good', 3),
     ('what is $NUMBER doing?', 5),
     ('ajajaj $NUMBER ', 2)]
df = spark.createDataFrame(l,['long_text','number'])
#Just added str() to your function
replace_udf = F.udf(lambda long_text, number: long_text.replace("$NUMBER", str(number)), T.StringType())
df.withColumn('long_text',replace_udf(F.col('long_text'),F.col('number'))).show()
+--------------------+------+
|           long_text|number|
+--------------------+------+
|hahaha the 3 is good|     3|
|    what is 5 doing?|     5|
|           ajajaj 2 |     2|
+--------------------+------+

How to create a table (or list) with the order codes of orders with both products

I have a Transactions table with the following structure:
ID | Product | OrderCode | Value
 1 |       8 |       ABC |   100
 2 |       5 |       ABC |   150
 3 |       4 |       ABC |    80
 4 |       5 |       XPT |   100
 5 |       6 |       XPT |   100
 6 |       8 |       XPT |   100
 7 |       5 |       XYZ |   100
 8 |       8 |       UYI |    90
How do I create a table (or list) with the order codes of orders with both products 5 and 8?
In the example above it should be the orders ABC and XPT.
There are probably many ways to do this, but here's a fairly general solution that I came up with:
FilteredList =
VAR ProductList = {5, 8}
VAR SummaryTable =
    SUMMARIZE(
        Transactions,
        Transactions[OrderCode],
        "Test",
        COUNTROWS(INTERSECT(ProductList, VALUES(Transactions[Product])))
            = COUNTROWS(ProductList)
    )
RETURN
    SELECTCOLUMNS(FILTER(SummaryTable, [Test]), "OrderCode", Transactions[OrderCode])
The key here is that if the set of products for a particular order code contains both 5 and 8, then the intersection of VALUES(Transactions[Product]) with the set {5, 8} is exactly that set and has a count of 2. If it doesn't have both, the count will be 1 or 0 and the test fails.
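For comparison (not part of the original answer), the same intersection test can be sketched outside DAX: group the products per order and keep the orders whose product set contains all of {5, 8}. A hypothetical Spark version over the sample data (array_intersect requires Spark 2.4+):
scala> import org.apache.spark.sql.functions._
scala> val tx = Seq((1,8,"ABC",100),(2,5,"ABC",150),(3,4,"ABC",80),(4,5,"XPT",100),
| (5,6,"XPT",100),(6,8,"XPT",100),(7,5,"XYZ",100),(8,8,"UYI",90))
|   .toDF("ID","Product","OrderCode","Value")
scala> tx.groupBy("OrderCode")
|   .agg(collect_set("Product").as("products"))               // product set per order
|   .filter(size(array_intersect($"products", array(lit(5), lit(8)))) === 2)
|   .select("OrderCode").show                                 // keeps ABC and XPT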
Please elaborate on your question. From your post, what I understood is that you want to filter the list; for that, you can use code like the below (note this keeps transactions whose product is 5 or 8; you would still need to group by OrderCode to keep only orders containing both):
List<Transactions> filtered = listTransactions.FindAll(x => x.Product == 5 || x.Product == 8);

Stata: Cumulative number of new observations

I would like to check if a value has appeared in some previous row of the same column.
At the end I would like to have a cumulative count of the number of distinct observations.
Is there any solution other than concatenating all _n rows and using regular expressions? I'm getting there with concatenation, but given the limit of 244 characters for string variables (in Stata <13), this is sometimes not applicable.
Here's what I'm doing right now:
gen tmp=x
replace tmp = tmp[_n-1]+ "," + tmp if _n > 1
gen cumu=0
replace cumu=1 if regexm(tmp[_n-1],x+"|"+x+",|"+","+x+",")==0
replace cumu= sum(cumu)
Example
     +-----+
     |   x |
     |-----|
  1. |  12 |
  2. |  32 |
  3. |  12 |
  4. |  43 |
  5. |  43 |
  6. |   3 |
  7. |   4 |
  8. |   3 |
  9. |   3 |
 10. |   3 |
     +-----+
becomes
     +-----+--------------------------+
     |   x | tmp                      |
     |-----+--------------------------|
  1. |  12 | 12                       |
  2. |  32 | 12,32                    |
  3. |  12 | 12,32,12                 |
  4. |  43 | 12,32,12,43              |
  5. |  43 | 12,32,12,43,43           |
  6. |   3 | 12,32,12,43,43,3         |
  7. |   4 | 12,32,12,43,43,3,4       |
  8. |   3 | 12,32,12,43,43,3,4,3     |
  9. |   3 | 12,32,12,43,43,3,4,3,3   |
 10. |   3 | 12,32,12,43,43,3,4,3,3,3 |
     +-----+--------------------------+
and finally
     +-----+------+
     |   x | cumu |
     |-----+------|
  1. |  12 |    1 |
  2. |  32 |    2 |
  3. |  12 |    2 |
  4. |  43 |    3 |
  5. |  43 |    3 |
  6. |   3 |    4 |
  7. |   4 |    5 |
  8. |   3 |    5 |
  9. |   3 |    5 |
 10. |   3 |    5 |
     +-----+------+
Any ideas how to avoid the 'middle step'? (For me that becomes very important when x holds strings instead of numbers.)
Thanks!
Regular expressions are great, but here, as often elsewhere, simple calculations suffice. With your sample data
. input x
x
1. 12
2. 32
3. 12
4. 43
5. 43
6. 3
7. 4
8. 3
9. 3
10. 3
11. end
end of do-file
you can identify first occurrences of each distinct value:
. gen long order = _n
. bysort x (order) : gen first = _n == 1
. sort order
. l
     +--------------------+
     |  x   order   first |
     |--------------------|
  1. | 12       1       1 |
  2. | 32       2       1 |
  3. | 12       3       0 |
  4. | 43       4       1 |
  5. | 43       5       0 |
     |--------------------|
  6. |  3       6       1 |
  7. |  4       7       1 |
  8. |  3       8       0 |
  9. |  3       9       0 |
 10. |  3      10       0 |
     +--------------------+
The number of distinct values seen so far is then just a cumulative sum of first using sum(): gen cumu = sum(first). This works with string variables too. In fact this problem is one of several discussed in
http://www.stata-journal.com/sjpdf.html?articlenum=dm0042
which is accessible to all as a .pdf. search distinct would have pointed you to this article.
Becoming fluent with what you can do with by:, sort, _n and _N is an important skill in Stata. See also
http://www.stata-journal.com/sjpdf.html?articlenum=pr0004
for another article accessible to all.