Pyspark: Create additional column based on Regex

I recently started using PySpark and I'm trying to figure out regex matching.
I've created a list of regexes, and if any item in the list is found in the name column, the added column must be true. The matching must not be case sensitive, as seen in the example below.
I have a table with the following format:
seqno | name
1 | john jones
2 | John Jones
3 | John Stones
4 | Mary Wild
5 | William Wurt
6 | steven wurt
I need to change the Table above to the format of the Table below. This is just a small part of the actual table so hard coding is not going to cut it unfortunately.
seqno | name | regex
1 | john jones | True
2 | John Jones | True
3 | John Stones | True
4 | Mary Wild | False
5 | William Wurt | True
6 | steven wurt | True
Here is the code to create part of the Table:
regex_list = ['john', 'wurt']
columns = ['seqno', 'name']
data = [('1', 'john jones'),
('2', 'John Jones'),
('3', 'John Stones'),
('4', 'Mary Wild'),
('5', 'William Wurt'),
('6', 'steven wurt')]
df = spark.createDataFrame(data=data, schema=columns)
I've tried numerous approaches with .isin and .rlike but can't seem to make it work. Any help would be greatly appreciated.
Thanks in advance!

Use rlike to test whether the name matches any of the listed regexes. To make the match case insensitive, convert both the column and the patterns to the same case before testing. Code below:
from pyspark.sql.functions import col, upper
df.withColumn('regex', upper(col('name')).rlike('|'.join(x.upper() for x in regex_list))).show()
+-----+------------+-----+
|seqno|        name|regex|
+-----+------------+-----+
|    1|  john jones| true|
|    2|  John Jones| true|
|    3| John Stones| true|
|    4|   Mary Wild|false|
|    5|William Wurt| true|
|    6| steven wurt| true|
+-----+------------+-----+
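
As a possible alternative to upper-casing both sides, the patterns can be joined into one alternation and prefixed with the inline case-insensitive flag (?i). The sketch below only demonstrates the pattern construction with Python's re module on the names from the question. It assumes that the same pattern string could be passed to rlike, which relies on Java regular expressions also accepting (?i). Upper-casing a pattern can also change its meaning when it contains escapes such as \d, which the inline flag avoids.

```python
import re

# Patterns and names taken from the question.
regex_list = ["john", "wurt"]
names = ["john jones", "John Jones", "John Stones",
         "Mary Wild", "William Wurt", "steven wurt"]

# One alternation with the inline case-insensitive flag, so neither
# the data nor the patterns have to be converted to upper case.
pattern = "(?i)(?:" + "|".join(regex_list) + ")"

# re.search looks for the pattern anywhere in the string.
flags = [bool(re.search(pattern, name)) for name in names]

for name, flag in zip(names, flags):
    print(name, flag)
```

In PySpark this would presumably become df.withColumn('regex', col('name').rlike(pattern)), though that call is not exercised here.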

Related

How to create a measure in power bi that counts yes and no values based on a country

I am new to Power BI and I am trying to use "state cards" by OKViz to create multiple dynamic cards.
I have the following example data structure (table 1):
Country | Answer
America | Yes
America | NO
America | YES
America | Yes
Brazil | NO
Brazil | NO
Brazil | NO
Brazil | NO
Brazil | yes
How do I create a measure in Power BI that counts the yes and no answers per country and gives me the following output?
Country | Answer |count
America |Yes |3
America |No |1
Brazil |Yes |1
Brazil |No |4

First, standardize the Answer column. This is not necessary, but I don't like having several spellings of the same answer (Yes, YES, yes).
You don't need a measure to achieve this. You can drag the column into the visual and choose the count option. But if you really need a measure, use the following expression:
Answer Count = COUNT( 'Table'[Answer] )
Both ways you obtain the same result.
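
To illustrate the two steps outside Power BI (standardize the answers, then count per country and answer), here is a minimal Python sketch using the data from the question. It is not DAX and not part of the answer above. The variable names are made up for the example.

```python
from collections import Counter

# Data from the question, with the inconsistent spellings of yes/no.
rows = [
    ("America", "Yes"), ("America", "NO"), ("America", "YES"), ("America", "Yes"),
    ("Brazil", "NO"), ("Brazil", "NO"), ("Brazil", "NO"), ("Brazil", "NO"),
    ("Brazil", "yes"),
]

# Step 1: standardize the answer ("YES", "yes" -> "Yes").
# Step 2: count each (country, answer) pair.
counts = Counter((country, answer.strip().capitalize()) for country, answer in rows)

for (country, answer), n in sorted(counts.items()):
    print(country, "|", answer, "|", n)
```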

Contracting dataset to only hold unique values for a variable

Assuming I have the following dataset as a toy example:
clear
input str32 Country Population_1 Population_2
"United States of America" 3999 .
"United States of America" . 3447
"Afghanistan" 544 .
"Afghanistan" . 727
"Belgium" 7546 .
"Belgium" . 992
"China" 10000 .
"China" . 12000
end
I want to shrink the dataset so that there is just one unique value for country.
My final dataset should look as follows:
Country Population_1 Population_2
United States of America 3999 3447
Afghanistan 544 727
Belgium 7546 992
China 10000 12000
I tried to use the collapse command but did not get the expected outcome. The command duplicates drop does not work either, as it does not keep the observations from Population_2.

This works for me:
collapse Pop*, by(Country)
list, abbreviate(12)
     +--------------------------------------------------------+
     |                  Country   Population_1   Population_2 |
     |--------------------------------------------------------|
  1. |              Afghanistan            544            727 |
  2. |                  Belgium           7546            992 |
  3. |                    China          10000          12000 |
  4. | United States of America           3999           3447 |
     +--------------------------------------------------------+

The following works for me:
generate Population_ = .
by Country, sort: replace Population_ = Population_2 if Population_1 == .
by Country, sort: replace Population_ = Population_1 if Population_2 == .
by Country: generate time = _n
drop Population_1 Population_2
reshape wide Population_, i(Country) j(time)

The community-contributed command gcollapse can also preserve wanted variables in the dataset:
gcollapse (sum) Pop*, merge replace by(Country)
duplicates drop Country, force
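
The idea shared by the answers above is to end up with one row per country that keeps the non-missing value of each population variable. A minimal sketch of that logic in plain Python, using the toy data from the question, might look like this. It is only an illustration of the idea, not a translation of any of the Stata commands, and None stands in for Stata's missing value.

```python
# Toy data from the question; None plays the role of a missing value.
rows = [
    ("United States of America", 3999, None),
    ("United States of America", None, 3447),
    ("Afghanistan", 544, None),
    ("Afghanistan", None, 727),
    ("Belgium", 7546, None),
    ("Belgium", None, 992),
    ("China", 10000, None),
    ("China", None, 12000),
]

# One entry per country; each slot is filled by the first row
# that has a non-missing value for that variable.
collapsed = {}
for country, pop1, pop2 in rows:
    entry = collapsed.setdefault(country, [None, None])
    if pop1 is not None and entry[0] is None:
        entry[0] = pop1
    if pop2 is not None and entry[1] is None:
        entry[1] = pop2

for country, (pop1, pop2) in collapsed.items():
    print(country, pop1, pop2)
```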

Count the number of distinct strings and their occurrence in a variable

I have a variable called Category that specifies the category of observations. The problem is that some observations have multiple categories. For example:
id Category
1 Economics
2 Biology
3 Psychology; Economics
4 Economics; Psychology
There is no meaning in the order of categories. They are always separated by ";". There are 250 categories, so creating dummy variables might be tricky. I have the complete list of categories in a separate Excel file if this might help.
What I want is simply to summarize my dataset by unique categories such as Economics (3), Psychology (2), Biology (1) (so the sum of all counts can be greater than the number of observations).

tabsplit from the tab_chi package on SSC will do this for you.
clear
input id str42 Category
1 "Economics"
2 "Biology"
3 "Psychology; Economics"
4 "Economics; Psychology"
end
capture ssc install tab_chi
tabsplit Category, p(;)
   Category |      Freq.     Percent        Cum.
------------+-----------------------------------
    Biology |          1       16.67       16.67
  Economics |          3       50.00       66.67
 Psychology |          2       33.33      100.00
------------+-----------------------------------
      Total |          6      100.00
Note: You can count semi-colons and thus phrases like this.
gen count = 1 + length(Category) - length(subinstr(Category, ";", "", .))
The logic is that you compare the length of the string with the length it would have if the semi-colons ; were replaced by empty strings (namely, removed). The difference is the number of semi-colons, to which you add 1.
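
The same length-difference trick can be checked quickly in Python on the example data. This is just a sketch of the arithmetic, not Stata code.

```python
categories = [
    "Economics",
    "Biology",
    "Psychology; Economics",
    "Economics; Psychology",
]

# Length of the string minus its length with semi-colons removed
# gives the number of semi-colons; adding 1 gives the number of phrases.
counts = [1 + len(c) - len(c.replace(";", "")) for c in categories]

print(counts)
```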
EDIT: How to get to a different data structure, starting with the data example above.
. split Category, p(;)
variables created as string:
Category1 Category2
. drop Category
. reshape long Category, i(id) j(mention)
(note: j = 1 2)
Data wide -> long
-----------------------------------------------------------------------------
Number of obs. 4 -> 8
Number of variables 3 -> 3
j variable (2 values) -> mention
xij variables:
Category1 Category2 -> Category
-----------------------------------------------------------------------------
. drop if missing(Category)
(2 observations deleted)
. list, sepby(id)
     +----------------------------+
     | id   mention      Category |
     |----------------------------|
  1. |  1         1     Economics |
     |----------------------------|
  2. |  2         1       Biology |
     |----------------------------|
  3. |  3         1    Psychology |
  4. |  3         2     Economics |
     |----------------------------|
  5. |  4         1     Economics |
  6. |  4         2    Psychology |
     +----------------------------+
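
For comparison, the split, reshape and tabulate steps above can be sketched in Python: split each string on ";", strip the surrounding blanks, and count the pieces. The names used here are made up for the illustration.

```python
from collections import Counter

data = {
    1: "Economics",
    2: "Biology",
    3: "Psychology; Economics",
    4: "Economics; Psychology",
}

# Long form: one (id, mention, category) row per category mentioned.
long_rows = [
    (obs_id, mention, part.strip())
    for obs_id, text in data.items()
    for mention, part in enumerate(text.split(";"), start=1)
]

# Frequency of each distinct category across all mentions.
freq = Counter(category for _, _, category in long_rows)

for category, n in freq.most_common():
    print(category, n)
```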

Generate id by groups

I have an issue in Stata I can't solve. My data set looks like the first two columns of the following block, and I would like to add the third column, where newvar resets itself any time id changes its value. It is important for newvar to keep the order of the observations, so I cannot sort by group to generate it.
|id|group|newvar
|7 |10 |1
|7 |10 |1
|7 |10 |1
|7 |5 |2
|7 |5 |2
|7 |8 |3

I guess you don't mean what you say as your example shows the new variable changing even though id does not.
You can always ensure that the current order is taken literally by working with a variable that tracks observation order:
gen long obs = _n
Then I guess what you want is
bysort id (obs) : gen newvar = sum(group != group[_n-1])
This is rather a basic question considering the aim of this forum at professional and enthusiast programmers who are expected to have read documentation and show attempts at code. See e.g. https://stackoverflow.com/help/mcve for what defines a good question here.
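
The running-sum logic in the answer above (start at 1 for each id, add 1 whenever group differs from the previous observation) can be sketched in Python as follows. The sketch assumes that all rows for a given id are already next to each other, which is what sorting by id guarantees in the Stata version. The function name is made up for the example.

```python
def number_runs(rows):
    """Return a counter per row that restarts at 1 for each id and
    increases by 1 whenever group changes within that id."""
    result = []
    prev_id = prev_group = None
    counter = 0
    for row_id, group in rows:
        if row_id != prev_id:
            counter = 1          # new id: restart the counter
        elif group != prev_group:
            counter += 1         # same id, group changed: next run
        result.append(counter)
        prev_id, prev_group = row_id, group
    return result


rows = [(7, 10), (7, 10), (7, 10), (7, 5), (7, 5), (7, 8)]
newvar = number_runs(rows)
print(newvar)
```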

Using count in Doctrine2

Count row by Doctrine
I have a table like this:
id | name
---------
1 | john
2 | ken
3 | john
4 | ken
5 | ken
6 | haku
When I run this query
$em->createQuery("SELECT c.id FROM UserBundle:customer c group by c.name")->getResult()
I get the first id for each name.
1 | john
2 | ken
6 | haku
However, I would like to get the count of how many times each name appears, like below.
1 | john | 2
2 | ken | 3
6 | haku | 1
How can I do this?

For complicated queries, I'd probably use PDO directly, especially if you don't need to map to an entity.
Here is an example of how to use PDO directly:
$q = "your query here";
$stmt = $em->getConnection()->prepare($q);
$stmt->execute();
$results = $stmt->fetchAll(\PDO::FETCH_ASSOC);

Try this:
$em->createQuery("SELECT c.name, COUNT(c.id) as cnt FROM UserBundle:customer c group by c.name")->getResult();
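
To see what the grouped query returns, here is a small sketch that runs the equivalent plain SQL against an in-memory SQLite database from Python. SQLite only stands in for the real database here, and the table name customer and its columns are assumptions based on the question. MIN(id) is added to reproduce the first id per name shown in the desired output.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO customer (id, name) VALUES (?, ?)",
    [(1, "john"), (2, "ken"), (3, "john"), (4, "ken"), (5, "ken"), (6, "haku")],
)

# One row per name: the smallest id and the number of occurrences.
results = conn.execute(
    "SELECT MIN(id), name, COUNT(id) AS cnt "
    "FROM customer GROUP BY name ORDER BY MIN(id)"
).fetchall()

for first_id, name, cnt in results:
    print(first_id, "|", name, "|", cnt)

conn.close()
```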
