BeautifulSoup and List indexing - list

As I am still quite new to web scraping, I am currently practicing some basics such as this one. I have scraped the categories from the 'th' tags and the players from the 'tr' tags and appended them to a couple of empty lists. The categories come out fine from get_text(), but when I print the players, each one has a rank number before the first letter of the name and the player's team abbreviation after the last name.
There are 3 things I am trying to do:
1) Output only the first and last name of each player. I tried slicing the list, but I cannot figure out an easy way to do it. There is probably a quicker way inside the tags, such as calling the class or using soup.findAll again on the HTML, or something else I am unaware of, but I currently do not know how or what I am missing.
2) Take the rank numbers before the names and append them to an empty list.
3) Take the last 3 abbreviated letters and append them to an empty list.
Any suggestions would be much appreciated!
from bs4 import BeautifulSoup as bs4
import requests
import pandas as pd
from time import sleep
players = []
categories = []
url ='https://www.espn.com/nba/stats/player/_/table/offensive/sort/avgPoints/dir/desc'
source = requests.get(url)
soup = bs4(source.text, 'lxml')
for i in soup.findAll('th'):
    c = i.get_text()
    categories.append(c)
for i in soup.findAll('tr'):
    player = i.get_text()
    players.append(player)
players = players[1:51]
print(categories)
print(players)

APIs are always the best way to go, in my opinion.
However, this can also be done with pandas' .read_html() (it uses lxml or BeautifulSoup under the hood to parse the table).
import pandas as pd
url = 'https://www.espn.com/nba/stats/player/_/table/offensive/sort/avgPoints/dir/desc'
dfs = pd.read_html(url)
dfs[0][['Name','Team']] = dfs[0]['Name'].str.extract('^(.*?)([A-Z]+)$', expand=True)
df = dfs[0].join(dfs[1])
Output:
print (df[['RK','Name','Team','POS']])
RK Name Team POS
0 1 James Harden HOU SG
1 2 Stephen Curry GS PG
2 3 Bradley Beal WSH SG
3 4 Trae Young ATL PG
4 5 Kevin Durant BKN SF
5 6 CJ McCollum POR SG
6 7 Kyrie Irving BKN PG
7 8 Jaylen Brown BOS SG
8 9 Giannis Antetokounmpo MIL PF
9 10 Jayson Tatum BOS PF
10 11 Damian Lillard POR PG
11 12 Luka Doncic DAL PG
12 13 Collin Sexton CLE PG
13 14 Paul George LAC SG
14 15 Brandon Ingram NO SF
15 16 Nikola Jokic DEN C
16 17 LeBron James LAL SF
17 18 Zach LaVine CHI SG
18 19 Christian Wood HOU PF
19 20 Kawhi Leonard LAC SF
20 21 Joel Embiid PHI C
21 22 Jerami Grant DET PF
22 23 Anthony Davis LAL PF
23 24 Jamal Murray DEN PG
24 25 Julius Randle NY PF
25 26 Malcolm Brogdon IND PG
26 27 Fred VanVleet TOR SG
27 28 Nikola Vucevic ORL C
28 28 Donovan Mitchell UTAH SG
29 30 Terry Rozier CHA PG
30 31 Devin Booker PHX SG
31 32 Khris Middleton MIL SF
32 33 Terrence Ross ORL SG
33 33 Victor Oladipo IND SG
34 35 Russell Westbrook WSH PG
35 36 Domantas Sabonis IND PF
36 36 De'Aaron Fox SAC PG
37 38 Zion Williamson NO SF
38 39 Tobias Harris PHI SF
39 40 Bam Adebayo MIA C
40 41 DeMar DeRozan SA SG
41 41 D'Angelo Russell MIN SG
42 43 Gordon Hayward CHA SF
43 44 Kyle Lowry TOR PG
44 44 Shai Gilgeous-Alexander OKC SG
45 46 Mike Conley UTAH PG
46 47 Malik Beasley MIN SG
47 48 RJ Barrett NY SG
48 49 Thomas Bryant WSH C
49 50 Pascal Siakam TOR PF
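The regular expression passed to str.extract above can be tried on its own with Python's standard re module. This is a minimal sketch, assuming the scraped cell text looks like "James HardenHOU" (the name immediately followed by an upper-case team code); split_name_team is a hypothetical helper name, not part of pandas:

```python
import re

# Same pattern as in the pandas answer: a lazy name group followed by
# a run of upper-case letters anchored at the end of the string.
PATTERN = re.compile(r"^(.*?)([A-Z]+)$")

def split_name_team(cell):
    """Split 'James HardenHOU' into ('James Harden', 'HOU').

    Hypothetical helper for illustration; returns None if nothing matches.
    """
    match = PATTERN.match(cell)
    if match is None:
        return None
    name, team = match.groups()
    return name, team

for cell in ["James HardenHOU", "CJ McCollumPOR", "Shai Gilgeous-AlexanderOKC"]:
    print(split_name_team(cell))
```

Because the team code is simply "all trailing capitals", a name that itself ends in capitals (for example a suffix such as II) would be split in the wrong place, so the result is worth checking against the data.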

Is this what you want?
import requests
from bs4 import BeautifulSoup
from tabulate import tabulate
url = "https://www.espn.com/nba/stats/player/_/table/offensive/sort/avgPoints/dir/desc"
soup = BeautifulSoup(requests.get(url).text, "html.parser")
table_data = [
    [r // 2, i.find("a").getText(), i.find("span").getText()]
    for r, i in enumerate(soup.find_all("td", class_="Table__TD"), start=1)
    if i.find("a") and i.find("span")
]
print(tabulate(table_data, headers=["Rank", "Name", "Team"], tablefmt="pretty"))
Output:
| Rank | Name | Team |
+------+-------------------------+------+
| 1 | James Harden | HOU |
| 2 | Stephen Curry | GS |
| 3 | Bradley Beal | WSH |
| 4 | Trae Young | ATL |
| 5 | Kevin Durant | BKN |
| 6 | CJ McCollum | POR |
| 7 | Kyrie Irving | BKN |
| 8 | Jaylen Brown | BOS |
| 9 | Giannis Antetokounmpo | MIL |
| 10 | Jayson Tatum | BOS |
| 11 | Damian Lillard | POR |
| 12 | Luka Doncic | DAL |
| 13 | Collin Sexton | CLE |
| 14 | Paul George | LAC |
| 15 | Brandon Ingram | NO |
| 16 | Nikola Jokic | DEN |
| 17 | LeBron James | LAL |
| 18 | Zach LaVine | CHI |
| 19 | Christian Wood | HOU |
| 20 | Kawhi Leonard | LAC |
| 21 | Joel Embiid | PHI |
| 22 | Jerami Grant | DET |
| 23 | Anthony Davis | LAL |
| 24 | Jamal Murray | DEN |
| 25 | Julius Randle | NY |
| 26 | Malcolm Brogdon | IND |
| 27 | Fred VanVleet | TOR |
| 28 | Nikola Vucevic | ORL |
| 29 | Donovan Mitchell | UTAH |
| 30 | Terry Rozier | CHA |
| 31 | Devin Booker | PHX |
| 32 | Khris Middleton | MIL |
| 33 | Terrence Ross | ORL |
| 34 | Victor Oladipo | IND |
| 35 | Russell Westbrook | WSH |
| 36 | Domantas Sabonis | IND |
| 37 | De'Aaron Fox | SAC |
| 38 | Zion Williamson | NO |
| 39 | Tobias Harris | PHI |
| 40 | Bam Adebayo | MIA |
| 41 | DeMar DeRozan | SA |
| 42 | D'Angelo Russell | MIN |
| 43 | Gordon Hayward | CHA |
| 44 | Kyle Lowry | TOR |
| 45 | Shai Gilgeous-Alexander | OKC |
| 46 | Mike Conley | UTAH |
| 47 | Malik Beasley | MIN |
| 48 | RJ Barrett | NY |
| 49 | Thomas Bryant | WSH |
| 50 | Pascal Siakam | TOR |
+------+-------------------------+------+

Always ask yourself: is there an easier way?
Yes, there is, and you should use it :)
If you want to scrape, first check whether you really have to scrape the content from the website or whether there is an API that provides the information in a well-structured form.
Example: requesting the API
import requests
import pandas as pd
url = "https://site.web.api.espn.com/apis/common/v3/sports/basketball/nba/statistics/byathlete?region=us&lang=en&contentorigin=espn&isqualified=true&page=1&limit=50&sort=offensive.avgPoints%3Adesc&season=2021&seasontype=2"
headers = {"user-agent": "Mozilla/5.0"}
response = requests.get(url, headers=headers)
response.raise_for_status()
ranking=[]
for i,player in enumerate(response.json()['athletes'], start=1):
    rank = i
    name = player['athlete']['displayName']
    team = player['athlete']['teamShortName']
    category = player['athlete']['position']['abbreviation']
    ranking.append({'rank':rank, 'name':name, 'team':team, 'category':category})
df = pd.DataFrame(ranking)
df
Output data frame
rank name team category
1 James Harden HOU SG
2 Stephen Curry GS PG
3 Bradley Beal WSH SG
4 Trae Young ATL PG
5 Kevin Durant BKN SF
6 CJ McCollum POR SG
7 Kyrie Irving BKN PG
8 Jaylen Brown BOS SG
9 Giannis Antetokounmpo MIL PF
10 Jayson Tatum BOS PF
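The loop above relies on every entry of athletes having a nested athlete object with all the expected fields. Below is a minimal sketch of the same extraction on a hand-written sample payload. The sample data and the helper name extract_ranking are assumptions for illustration; only the key names are taken from the answer. Using .get() keeps one missing field from raising a KeyError:

```python
def extract_ranking(payload):
    """Build rank/name/team/category rows from an API-style payload.

    Hypothetical helper; the key names mirror the ones used in the answer.
    """
    rows = []
    for rank, entry in enumerate(payload.get("athletes", []), start=1):
        athlete = entry.get("athlete", {})
        position = athlete.get("position", {})
        rows.append({
            "rank": rank,
            "name": athlete.get("displayName"),
            "team": athlete.get("teamShortName"),
            "category": position.get("abbreviation"),
        })
    return rows

# Hand-written sample, not a real API response.
sample = {
    "athletes": [
        {"athlete": {"displayName": "James Harden", "teamShortName": "HOU",
                     "position": {"abbreviation": "SG"}}},
        {"athlete": {"displayName": "Stephen Curry", "teamShortName": "GS"}},
    ]
}

for row in extract_ranking(sample):
    print(row)
```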
But to answer your question
You can also do it with BeautifulSoup, but it is much more error-prone in my opinion:
from bs4 import BeautifulSoup
import requests
import pandas as pd
data = []
url ='https://www.espn.com/nba/stats/player/_/table/offensive/sort/avgPoints/dir/desc'
source = requests.get(url)
soup = BeautifulSoup(source.text, 'lxml')
for i in soup.select('tr')[1:]:
    if i.select_one('td'):
        rank = i.select_one('td').get_text()
    if i.select_one('div > a'):
        player = i.select_one('div > a').get_text()
    if i.select_one('div > span'):
        team = i.select_one('div > span').get_text()
        # rows without a team span (e.g. stats rows) are skipped
        data.append({'rank':rank, 'player':player, 'team':team})
pd.DataFrame(data)
If you do not want to use CSS selectors, you can also do:
for i in soup.find_all('tr')[1:]:
    if i.find('td'):
        rank = i.find('td').get_text()
    if i.find('a'):
        player = i.find('a').get_text()
    if i.find('span'):
        team = i.find('span').get_text()

How to extract all values that contain part of a particular number and then delete them?

How do you extract all values containing part of a particular number and then delete them?
I have data where the IDs have different lengths, and I want to extract all the IDs that contain a particular number. For example, if the ID ends in "-00", "02" or "-01", I want to pull those rows to see the hit rate that includes them, and then remove that suffix from the ID. Is there a more efficient way to write this code?
I tried to use the SUBSTR function to slice the ID, but other IDs are also affected at the specified position.
Code:
Proc sql;
Create table work.data1 AS
SELECT Product, Amount_sold, Price_per_unit,
CASE WHEN Product Contains "Pen" and Length(ID) >= 9 Then SUBSTR(ID,1,9)
WHEN Product Contains "Book" and Length(ID) >= 11 Then SUBSTR(ID,1,11)
WHEN Product Contains "Folder" and Length(ID) >= 12 Then SUBSTR(ID,1,12)
...
END AS ID
FROM A;
Quit;
Have:
+------------------+-----------------+-------------+----------------+
| ID | Product | Amount_sold | Price_per_unit |
+------------------+-----------------+-------------+----------------+
| 123456789 | Pen | 30 | 2 |
| 63495837229-01 | Book | 20 | 5 |
| ABC134475472 02 | Folder | 29 | 7 |
| AB-1235674467-00 | Pencil | 26 | 1 |
| 69598346-02 | Correction pen | 15 | 1.50 |
| 6970457688 | Highlighter | 15 | 2 |
| 584028467 | Color pencil | 15 | 10 |
+------------------+-----------------+-------------+----------------+
Wanted the final result:
+------------------+-----------------+-------------+----------------+
| ID | Product | Amount_sold | Price_per_unit |
+------------------+-----------------+-------------+----------------+
| 123456789 | Pen | 30 | 2 |
| 63495837229 | Book | 20 | 5 |
| ABC134475472 | Folder | 29 | 7 |
| AB-1235674467 | Pencil | 26 | 1 |
| 69598346 | Correction pen | 15 | 1.50 |
| 6970457688 | Highlighter | 15 | 2 |
| 584028467 | Color pencil | 15 | 10 |
+------------------+-----------------+-------------+----------------+
Just test whether the string has any embedded spaces or hyphens and whether the last word, when delimited by space or hyphen, is 00, 01 or 02. If so, chop off the last three characters.
data have;
infile cards dsd dlm='|' truncover ;
input id :$20. product :$20. amount_sold price_per_unit;
cards;
123456789 | Pen | 30 | 2 |
63495837229-01 | Book | 20 | 5 |
ABC134475472 02 | Folder | 29 | 7 |
AB-1235674467-00 | Pencil | 26 | 1 |
69598346-02 | Correction pen | 15 | 1.50 |
6970457688 | Highlighter | 15 | 2 |
584028467 | Color pencil | 15 | 10 |
;
data want;
set have ;
if indexc(trim(id),'- ') and scan(id,-1,'- ') in ('00' '01' '02') then
id = substrn(id,1,length(id)-3)
;
run;
Result
amount_ price_
Obs id product sold per_unit
1 123456789 Pen 30 2.0
2 63495837229 Book 20 5.0
3 ABC134475472 Folder 29 7.0
4 AB-1235674467 Pencil 26 1.0
5 69598346 Correction pen 15 1.5
6 6970457688 Highlighter 15 2.0
7 584028467 Color pencil 15 10.0
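For comparison, roughly the same rule as in the data step above (the last word delimited by a space or hyphen is 00, 01 or 02, so drop the last three characters) can be sketched in Python with a single regular expression. This is an illustration in a different language, not SAS code, and strip_suffix is a hypothetical name:

```python
import re

# A hyphen or space followed by 00, 01 or 02 at the very end of the ID.
SUFFIX = re.compile(r"[- ](?:00|01|02)$")

def strip_suffix(identifier):
    """Remove a trailing '-00', '-01' or '-02' (or the space variants)."""
    return SUFFIX.sub("", identifier)

ids = ["123456789", "63495837229-01", "ABC134475472 02",
       "AB-1235674467-00", "69598346-02", "6970457688", "584028467"]
for identifier in ids:
    print(identifier, "->", strip_suffix(identifier))
```

IDs without one of the three listed suffixes are returned unchanged, including ones that merely contain a hyphen elsewhere.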
There may be other solutions, but you have to use some string functions. Here I used the functions substr, reverse (reversing the string) and indexc (position of the first of the given characters in the string):
data have;
input text $20.;
datalines;
12345678
AB-142353 00
AU-234343-02
132453 02
221344-09
;
run;
data want (drop=reverted pos);
set have;
if countw(text) gt 1
then do;
reverted=strip(reverse(text));
pos=indexc(reverted,'- ')+1;
new=strip(reverse(substr(reverted,pos)));
end;
else new=text;
run;
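The reverse-and-search idea above amounts to finding the last hyphen or space and keeping everything before it. Here is a rough Python sketch of that logic (an illustration, not SAS; drop_last_word is a hypothetical name). Note that, like the data step above, it removes any trailing word, so 221344-09 also loses its suffix:

```python
def drop_last_word(text):
    """Cut the text at its last hyphen or space, if it has one."""
    text = text.strip()
    cut = max(text.rfind("-"), text.rfind(" "))
    return text[:cut].strip() if cut != -1 else text

for text in ["12345678", "AB-142353 00", "AU-234343-02", "132453 02", "221344-09"]:
    print(text, "->", drop_last_word(text))
```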

Plot graph from within Mata

Consider the following toy matrix in mata:
mata: A
1 2
+-----------------+
1 | 6555 140 |
2 | 7205 135 |
3 | 6255 140 |
4 | 7272 138 |
5 | 10283 133 |
6 | 8244 136 |
7 | 6909 144 |
8 | 7645 138 |
9 | 12828 134 |
10 | 6538 137 |
+-----------------+
If I want to draw a scatter plot using this matrix, I first need to transfer it
to Stata and then also convert it to variables with the svmat command:
mata: st_matrix("A", A)
svmat A
list, separator(0)
+-------------+
| A1 A2 |
|-------------|
1. | 6555 140 |
2. | 7205 135 |
3. | 6255 140 |
4. | 7272 138 |
5. | 10283 133 |
6. | 8244 136 |
7. | 6909 144 |
8. | 7645 138 |
9. | 12828 134 |
10. | 6538 137 |
+-------------+
twoway scatter A1 A2
Is there a way to directly draw the graph without leaving mata?
One can plot a mata matrix without first converting it to Stata variables as follows:
twoway scatter matamatrix(A)
See help twoway_mata for more details.
Edit by @PearlySpencer:
This can be run directly from within Mata using the stata() function:
mata: stata("twoway scatter matamatrix(A)")
An alternative approach is to use the community-contributed mata function mm_plot():
mata: mm_plot(A, "scatter")
This is part of the moremata collection of functions and must thus be downloaded first:
ssc install moremata

Change variable value based on what a string contains

Suppose you have several variables:
+--------------------------+------------+------------+-----------+-------+
| Country | Population | Median_Age | Sex_Ratio | GDP |
+--------------------------+------------+------------+-----------+-------+
| United States of America | 3999 | | 1.01 | 16000 |
+--------------------------+------------+------------+-----------+-------+
| Afghanistan | 544 | 19 | 0.97 | 4456 |
+--------------------------+------------+------------+-----------+-------+
| China | 5000 | 26 | 0.96 | 10000 |
+--------------------------+------------+------------+-----------+-------+
Let us suppose that Median_Age under United States of America is empty.
How do I replace this missing value with 27 if Country contains United, or United States?
Here's a modified example that better illustrates the solution:
clear
input strL Country Population Median_Age Sex_Ratio GDP
"United States of America" 3999 . 1.01 5000
"Afghanistan" 544 19 0.97 457
"United Emirates" 7546 44 7.01 2000
"China" 10000 26 0.96 3400
"United Fictionary Nation" 6789 . 8.03 7689
end
list, abbreviate(10)
+-----------------------------------------------------------------------+
| Country Population Median_Age Sex_Ratio GDP |
|-----------------------------------------------------------------------|
1. | United States of America 3999 . 1.01 5000 |
2. | Afghanistan 544 19 .97 457 |
3. | United Emirates 7546 44 7.01 2000 |
4. | China 10000 26 .96 3400 |
5. | United Fictionary Nation 6789 . 8.03 7689 |
+-----------------------------------------------------------------------+
replace Median_Age = 27 if ( strmatch(Country, "*United States*") | ///
strmatch(Country, "*United*") ) & ///
missing(Median_Age)
list, abbreviate(10)
+-----------------------------------------------------------------------+
| Country Population Median_Age Sex_Ratio GDP |
|-----------------------------------------------------------------------|
1. | United States of America 3999 27 1.01 5000 |
2. | Afghanistan 544 19 .97 457 |
3. | United Emirates 7546 44 7.01 2000 |
4. | China 10000 26 .96 3400 |
5. | United Fictionary Nation 6789 27 8.03 7689 |
+-----------------------------------------------------------------------+
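The condition in the replace command can be mirrored in Python with the standard fnmatch module, whose * wildcard plays the same role as the one in strmatch(). This is a minimal sketch on hand-typed rows; the list-of-dicts layout and the name fill_median_age are assumptions for illustration. It also shows that the pattern *United* already covers every string matched by *United States*, so a single pattern is enough:

```python
from fnmatch import fnmatchcase

def fill_median_age(rows, pattern="*United*", value=27):
    """Set a missing Median_Age to `value` where Country matches `pattern`."""
    for row in rows:
        if row["Median_Age"] is None and fnmatchcase(row["Country"], pattern):
            row["Median_Age"] = value
    return rows

rows = [
    {"Country": "United States of America", "Median_Age": None},
    {"Country": "Afghanistan", "Median_Age": 19},
    {"Country": "United Emirates", "Median_Age": 44},
    {"Country": "China", "Median_Age": 26},
    {"Country": "United Fictionary Nation", "Median_Age": None},
]

for row in fill_median_age(rows):
    print(row["Country"], row["Median_Age"])
```

Rows that already have a value, such as United Emirates, are left alone because the missing check comes first.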

Creating a variable that increments by one if new value found in another

I have the following (sorted) variable:
35
35
37
37
37
40
I want to create a new variable which will increment by one when a new number comes up in the original variable.
For example:
35 1
35 1
37 2
37 2
37 2
40 3
I thought about using the by or bysort commands but none of them seems to solve the problem. This looks like something many people need, but I couldn't find an answer.
You are just counting how often a value differs from the previous value. This works also for observation 1 as any reference to a value for observation 0 is returned as missing, so in your example 35 is not equal to missing.
clear
input x
35
35
37
37
37
40
end
gen new = sum(x != x[_n-1])
list, sepby(new)
+----------+
| x new |
|----------|
1. | 35 1 |
2. | 35 1 |
|----------|
3. | 37 2 |
4. | 37 2 |
5. | 37 2 |
|----------|
6. | 40 3 |
+----------+
by would be pertinent if you had blocks of observations to be treated separately. One underlying principle here is that true or false comparisons (here, whether two values are unequal) are evaluated as 1 if true and 0 if false.
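The same principle, a running sum of true-or-false comparisons with the previous value, can be sketched in Python. None stands in for the missing value that Stata returns for observation 0, and change_counter is a hypothetical name:

```python
from itertools import accumulate

def change_counter(values):
    """Running count of how often a value differs from the one before it."""
    previous = [None] + list(values[:-1])  # lagged copy, like x[_n-1]
    changed = [int(v != p) for v, p in zip(values, previous)]  # True -> 1
    return list(accumulate(changed))

x = [35, 35, 37, 37, 37, 40]
print(change_counter(x))  # [1, 1, 2, 2, 2, 3]
```

As with the Stata one-liner, this counts changes between neighbours, so it only numbers distinct values correctly when the data are sorted.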
@Nick beat me to it by a couple of minutes but here's another -cleaner- way of doing this:
clear
input foo
35
35
37
37
37
40
end
egen counter = group(foo)
list
+---------------+
| foo counter |
|---------------|
1. | 35 1 |
2. | 35 1 |
3. | 37 2 |
4. | 37 2 |
5. | 37 2 |
|---------------|
6. | 40 3 |
+---------------+
This approach uses the egen command and its associated group() function.
There are also a couple of options for this function, with missing being perhaps the most useful.
From the command's help file:
"...missing indicates that missing values in varlist (either . or "") are to be treated like any other value when assigning groups, instead of as missing values being assigned to the group missing..."
clear
input foo
35
35
.
37
37
37
40
.
end
egen counter = group(foo), missing
sort foo
list
+---------------+
| foo counter |
|---------------|
1. | 35 1 |
2. | 35 1 |
3. | 37 2 |
4. | 37 2 |
5. | 37 2 |
|---------------|
6. | 40 3 |
7. | . 4 |
8. | . 4 |
+---------------+
Instead of:
drop counter
egen counter = group(foo)
sort foo
list
+---------------+
| foo counter |
|---------------|
1. | 35 1 |
2. | 35 1 |
3. | 37 2 |
4. | 37 2 |
5. | 37 2 |
|---------------|
6. | 40 3 |
7. | . . |
8. | . . |
+---------------+
Another option is label:
"... The label option returns integers from 1 up according to the distinct groups of varlist in sorted order. The integers are labeled with the values of varlist or the value labels, if they exist..."
Using the example without the missing values:
egen counter = group(foo), label
list
+---------------+
| foo counter |
|---------------|
1. | 35 35 |
2. | 35 35 |
3. | 37 37 |
4. | 37 37 |
5. | 37 37 |
|---------------|
6. | 40 40 |
+---------------+
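What group() does, with and without the missing option, can be approximated in Python: distinct values are numbered from 1 in sorted order, and missing values either stay missing or form the last group. This is a sketch under the assumption that None represents a missing value; group_ids is a hypothetical name, not a Stata function:

```python
def group_ids(values, missing=False):
    """Number distinct values from 1 in sorted order.

    None plays the role of a missing value: it stays None unless
    missing=True, in which case it forms the highest-numbered group.
    """
    distinct = sorted({v for v in values if v is not None})
    ids = {v: i for i, v in enumerate(distinct, start=1)}
    if missing:
        ids[None] = len(distinct) + 1
    return [ids.get(v) for v in values]

foo = [35, 35, None, 37, 37, 37, 40, None]
print(group_ids(foo))                # [1, 1, None, 2, 2, 2, 3, None]
print(group_ids(foo, missing=True))  # [1, 1, 4, 2, 2, 2, 3, 4]
```

Unlike the running-sum approach, this numbering does not require the data to be sorted first.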

How to subtract across columns

I want to subtract the values by apid in the table below:
-----------------------------------------------
| apid | AB | AS | BS | CS | DS | difference |
|-------|----|----|----|----|----|----------- |
| AP013 | 43 | 36 | | | | 7 |
-----------------------------------------------
For example, for "AP013", the difference is AB minus AS (43 - 36 = 7).
The new value also needs to be saved in a new column called diff.
Can you please tell me how to do this in Stata?
You just generate a new variable diff:
clear
input str5 apid AB AS
"AP013" 43 36
end
generate diff = AB - AS
list
+------------------------+
| apid AB AS diff |
|------------------------|
1. | AP013 43 36 7 |
+------------------------+
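One detail worth knowing: in Stata, arithmetic with a missing value gives a missing result, so diff would be missing for any row where AB or AS is empty. Here is a small Python sketch of that behaviour, using None for missing. The row layout, the second row and the name difference are assumptions for illustration:

```python
def difference(ab, as_):
    """Return ab - as_, or None if either value is missing."""
    if ab is None or as_ is None:
        return None
    return ab - as_

rows = [
    {"apid": "AP013", "AB": 43, "AS": 36},
    {"apid": "AP014", "AB": 50, "AS": None},  # made-up row with a gap
]
for row in rows:
    row["diff"] = difference(row["AB"], row["AS"])
    print(row)
```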