As a project, I'm coding a web scraper for a site with statistics on certain monsters from a game. The problem is that when I append the data to a list, it gets printed as one very long line.
I already tried .append(clean_data.getText().replace('\n', "\\n")).
Note that if I don't use .getText(), I append a lot of [td] and [tr] tags to the list and it gets very messy.
I think the problem is that the text I'm getting is treated as plain text, so when I replace \n with \\n it is inserted literally as \\n and is not recognized as a newline.
My code:
import requests
import pandas as pd
from bs4 import BeautifulSoup
import csv
url = 'https://guildstats.eu/monsters?world=Yonabra'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
monsters = ('adult goannas', 'young goannas', 'manticores', 'feral sphinxes', 'ogre ruffians', 'ogre rowdies', 'ogre sages', 'dogs')
finding_td = soup.find_all('td', string=monsters)
list_of_monsters = []
for looking_for_parent in finding_td:
    parent_tr = looking_for_parent.find_parents('tr')
    for clean_data in parent_tr:
        list_of_monsters.append(clean_data.getText().replace('\n', " "))
print(list_of_monsters)
It gives the following output:
[' 7 adult goannas 2020-05-28 1519 0 736893 133 ', ' 222 dogs 2020-05-27 143 0 40043 0 ', ' 298 feral sphinxes 2020-05-28 1158 1 480598 152 ', ' 498 manticores 2020-05-28 961 1 299491 68 ', ' 581 ogre rowdies 2020-05-28 306 0 188324 13 ', ' 582 ogre ruffians 2020-05-29 217 0 121964 7 ', ' 583 ogre sages 2020-05-28 156 0 63489 8 ', ' 911 young goannas 2020-05-28 1880 0 972217 74 ']
I want it to be more like this:
[' 7 adult goannas 2020-05-28 1519 0 736893 133 '
' 222 dogs 2020-05-27 143 0 40043 0 '
' 298 feral sphinxes 2020-05-28 1158 1 480598 152 '
' 498 manticores 2020-05-28 961 1 299491 68 '
' 581 ogre rowdies 2020-05-28 306 0 188324 13 '
' 582 ogre ruffians 2020-05-29 217 0 121964 7 '
' 583 ogre sages 2020-05-28 156 0 63489 8 '
' 911 young goannas 2020-05-28 1880 0 972217 74 ']
What you want is to change the delimiter used when the list is printed: a newline instead of a comma. As @QHarr mentioned, you can use Python's pprint to print the results in a more readable format.
Try:
import requests
import pandas as pd
from bs4 import BeautifulSoup
import csv
from pprint import pprint
url = 'https://guildstats.eu/monsters?world=Yonabra'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
monsters = ('adult goannas', 'young goannas', 'manticores', 'feral sphinxes', 'ogre ruffians', 'ogre rowdies', 'ogre sages', 'dogs')
finding_td = soup.find_all('td', string=monsters)
list_of_monsters = []
for looking_for_parent in finding_td:
    parent_tr = looking_for_parent.find_parents('tr')
    for clean_data in parent_tr:
        list_of_monsters.append(clean_data.getText().replace("\n", " "))
pprint(list_of_monsters)
This gives:
[' 7 adult goannas 2020-05-28 1519 0 736893 133 ',
' 222 dogs 2020-05-27 143 0 40043 0 ',
' 298 feral sphinxes 2020-05-28 1158 1 480598 152 ',
' 498 manticores 2020-05-28 961 1 299491 68 ',
' 581 ogre rowdies 2020-05-28 306 0 188324 13 ',
' 582 ogre ruffians 2020-05-29 217 0 121964 7 ',
' 583 ogre sages 2020-05-28 156 0 63489 8 ',
' 911 young goannas 2020-05-28 1880 0 972217 74 ']
The \n characters you obtained are already newline characters, so there is no need to add an extra escape character in Python. As you found, replace("\n", " ") already gives the replacement you want. Also, because you are printing a list, each element is shown using its repr, so a newline inside an element is displayed as \n rather than as a line break. pprint does not modify the original list; it only prints it in a more readable format.
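To see why the list prints on one line, here is a minimal sketch using two of the rows from the output above (hard-coded as an assumption, so no network access is needed). print() on a list shows the repr of every element, while joining the elements with a newline prints one row per line.

```python
# Two rows copied from the output above; hard-coded for illustration only.
rows = [
    ' 7 adult goannas 2020-05-28 1519 0 736893 133 ',
    ' 222 dogs 2020-05-27 143 0 40043 0 ',
]

# str() of a list uses repr() for each element, so everything stays on one
# line and a real newline inside an element is displayed as \n.
as_list = str(rows)
escaped = str(['a\nb'])  # the newline is shown as the two characters \ and n

# Joining the elements with a real newline gives one row per line.
per_line = '\n'.join(row.strip() for row in rows)
print(per_line)
```

If the brackets and quotes should stay visible, pprint as shown above is the simpler choice; the join is only useful when plain rows are wanted.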
I've got a simple need.
Given this input (string): 10 20 30 40 65 45 44 67 100 200 65 40 66 88 65
I need to get all numbers between 65 and 66.
The problem arises when there are multiple occurrences of either limit.
With a regex like (65).+(66), I capture 65 45 44 67 100 200 65 40 66, but I would like to get only 40.
How could I achieve this?
https://regex101.com/r/9HoKxr/1
Sounds like you want to exclude any '65' from the numbers your pattern matches, up to the first occurrence of '66'? It's a bit verbose, but what about:
\b65((?:\s(?:\d|[1-57-9]\d|6[0-47-9]|\d{3,}))+?)\s66\b
See an online demo
\b65 - Start with '65' preceded by a word boundary;
( - Open capture group;
(?:\s - Non-capture group starting with a single whitespace char;
(?:\d|[1-57-9]\d|6[0-47-9]|\d{3,}) - Nested non-capture group to match any integer but '65' or '66';
)+?) - Close non-capture group and match it at least once but as few times as possible. Then close the capture group;
\s66\b - Match another space followed by '66' and word-boundary.
Note:
Leading spaces are handled with the Trim() function from the strings package;
In my examples I have used '10 20 30 40 65 45 44 40 66 200 65 40 66 88 65', which returns multiple matches. In that case I assume the OP is looking for the 'shortest' matching substring;
By 'shortest' I mean the fewest elements when the substring is split on spaces (using the Fields function from the above-mentioned strings package). Therefore '123456' is preferred over '1 2 3' despite being the 'longer' substring in terms of characters;
Try:
package main
import (
    "fmt"
    "regexp"
    "strings"
)
func main() {
    s := `10 20 30 40 65 45 44 40 66 200 65 40 66 88 65`
    re := regexp.MustCompile(`\b65((?:\s(?:\d|[1-57-9]\d|6[0-47-9]|\d{3,}))+?)\s66\b`)
    matches := re.FindAllStringSubmatch(s, -1) // Retrieve all matches
    shortest := ``
    for i := range matches { // Loop over the matches
        // Keep the match with the fewest space-separated elements
        if shortest == `` || len(strings.Fields(matches[i][1])) < len(strings.Fields(shortest)) {
            shortest = strings.Trim(matches[i][1], ` `)
        }
    }
    fmt.Println(shortest)
}
Try it for yourself here.
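For comparison, the same idea can be sketched in Python. This is only an illustration, assuming Python's re module accepts the pattern unchanged; the function name shortest_between is made up for this example and is not part of the answer above.

```python
import re

# Same pattern as in the Go answer above.
PATTERN = re.compile(r'\b65((?:\s(?:\d|[1-57-9]\d|6[0-47-9]|\d{3,}))+?)\s66\b')

def shortest_between(s: str) -> str:
    """Return the match with the fewest space-separated numbers, or ''."""
    # With one capture group, findall returns the captured text of each match.
    candidates = [m.strip() for m in PATTERN.findall(s)]
    # The key counts the numbers in each candidate, not its characters.
    return min(candidates, key=lambda c: len(c.split()), default='')

print(shortest_between('10 20 30 40 65 45 44 40 66 200 65 40 66 88 65'))
```

The key function counts elements rather than characters, which matches the definition of 'shortest' given in the notes.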
I have created a directory listing of my Google Drive in Cloudflare index.
The file sorting logic is pretty weird for some reason.
It sorts files digit by digit: it compares the first digit starting from 0, and if those match, it compares the second digit starting from 0, and so on.
Currently the files show up in this order from top to bottom:
1, 10, 100, 101..109, 11, 110, 111..119, 12
There's an easy way to fix it, but it requires me to manually rename each file, adding leading zeros based on the number of digits in the greatest number, and there are hundreds or thousands of files.
I will be using JavaScript to rename all my files. It accepts new names in the following format (oA is the array where I put the new name for each file).
I was wondering if any awk/perl/regex function can produce the expected output when executed on file.txt
Example 1
cat file.text
oA=['Lecture 7 - Topic.mp4','Lecture 56 - Topic.mp4','Lecture 3 - Topic.mp4','Lecture 4 - Topic.mp4']
Expected Output
oA=['Lecture 07 - Topic.mp4','Lecture 56 - Topic.mp4','Lecture 03 - Topic.mp4','Lecture 04 - Topic.mp4']
Example 2
cat file.txt
oA=['Lecture 3 - Topic.mp4','Lecture 116 - Topic.mp4','Lecture 46 - Topic.mp4','Lecture 112 - Topic.mp4']
Expected output
oA=['Lecture 003 - Topic.mp4','Lecture 116 - Topic.mp4','Lecture 046 - Topic.mp4','Lecture 112 - Topic.mp4']
Example 3
cat file.txt
oA=['Lecture 8 - Topic.mp4','Lecture 1165 - Topic.mp4','Lecture 667 - Topic.mp4','Lecture 12 - Topic.mp4']
Expected output
oA=['Lecture 0008 - Topic.mp4','Lecture 1165 - Topic.mp4','Lecture 0667 - Topic.mp4','Lecture 0012 - Topic.mp4']
As you might have noticed, only leading zeros should be added to each number as required; the order of the lectures is still preserved (it's important).
To explain it as steps:
1) Grab the greatest number after the word Lecture and check its number of digits.
2) Give every number the same number of digits as the greatest number by adding leading zeros as necessary.
In Perl, the solution boils down to the repetition operator x. In the code below, the crucial line is
my $padding = "0" x ($maxlen-$thislen);
The Perl documentation on operators says this about x: "In scalar context or if the left operand is not enclosed in parentheses, it returns a string consisting of the left operand repeated the number of times specified by the right operand."
So it will repeat the digit 0 enough times to make a number of length $thislen into a number of length $maxlen.
The code gives the correct output for each of the examples.
$ cat file.text
oA=['Lecture 7 - Topic.mp4','Lecture 56 - Topic.mp4','Lecture 3 - Topic.mp4','Lecture 4 - Topic.mp4']
iA=['Lecture 3 - Topic.mp4','Lecture 116 - Topic.mp4','Lecture 46 - Topic.mp4','Lecture 112 - Topic.mp4']
anyname=['Lecture 8 - Topic.mp4','Lecture 1165 - Topic.mp4','Lecture 667 - Topic.mp4','Lecture 12 - Topic.mp4']
$ ./padding.pl file.text
oA=['Lecture 07 - Topic.mp4','Lecture 56 - Topic.mp4','Lecture 03 - Topic.mp4','Lecture 04 - Topic.mp4']
iA=['Lecture 003 - Topic.mp4','Lecture 116 - Topic.mp4','Lecture 046 - Topic.mp4','Lecture 112 - Topic.mp4']
anyname=['Lecture 0008 - Topic.mp4','Lecture 1165 - Topic.mp4','Lecture 0667 - Topic.mp4','Lecture 0012 - Topic.mp4']
Here is the full code that performs the requested task.
#!/usr/bin/perl
# Usage:
# padding.pl [file1.text [file2.text [...]]]
use List::Util qw(max);
use strict;
my $varname = "";
my @oA = ();
# loop over lines in input file(s)
while ($_ = <>) {
    # Put data in @oA array.
    # You'll need to decide what assumptions to make
    # about your input data.
    chomp;
    ($varname) = /^([^=]*)=/;
    s/^$varname=//g;
    if (/^\['.*'\]$/) {
        s/^\['|'\]$//g;
        @oA = split( /','/, $_ );
    }
    # extract the numbers, find the max
    my @oA_nums = map { /Lecture (\d+)/; $1 } @oA;
    my $maxlen = max map(length, @oA_nums); # pad all oA to this length
    # replace the numbers with padded versions
    foreach my $i (0 .. $#oA) { # loop from 0 to "num elements - 1"
        my $thislen = length($oA_nums[$i]);
        my $padding = "0" x ($maxlen-$thislen); # THIS IS IT!
        my $padded_num = $padding . $oA_nums[$i];
        $oA[$i] =~ s/Lecture \d+/Lecture $padded_num/;
    }
    print "$varname=['";
    print join "','", @oA;
    print "']\n";
}
Given that the script will be running in GoogleApp, see the following JavaScript solution. It will create oA from iA:
find the longest sequence
loop over document, replace sequence with zero-padded sequence, put in oA
The console.log is for verification. Remove, and use the rename method you already have after testing.
iA=['Lecture 7 - Topic.mp4','Lecture 56 - Topic.mp4','Lecture 3 - Topic.mp4','Lecture 4 - Topic.mp4']
let seq_len=1
// Collect sequence, find largest
for (doc of iA) {
    let seq = doc.match("\\d+")[0]
    if ( seq.length > seq_len ) seq_len = seq.length
}
oA=[]
for (doc of iA) {
    let old_seq = doc.match("\\d+")[0]
    let new_seq = old_seq
    while ( new_seq.length < seq_len ) new_seq = "0" + new_seq
    oA.push( doc.replace(old_seq, new_seq))
}
console.log(seq_len)
console.log (oA)
Alternative solution - Perl.
#! /usr/bin/perl
use List::Util qw(max) ;
while ( <> ) {
    if ( s/^iA=/oA=/ ) {
        my $maxlen = max(map { length } /Lecture (\d+)/g) ;
        # $1 already ends with a space, so the format must not add another one
        s/(Lecture )(\d+)/sprintf("%s%0${maxlen}d", $1, $2)/eg ;
        print ;
    }
}
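The two steps from the question (find the widest lecture number, then left-pad every number to that width) can also be sketched in Python. This is a minimal sketch, assuming each line holds one list and every name contains "Lecture <number>"; the function name pad_lecture_numbers is hypothetical.

```python
import re

def pad_lecture_numbers(line: str) -> str:
    """Zero-pad every 'Lecture N' in the line to the width of the largest N."""
    numbers = re.findall(r'Lecture (\d+)', line)
    if not numbers:
        return line  # nothing to pad
    # Step 1: width of the longest number on this line.
    width = max(len(n) for n in numbers)
    # Step 2: rewrite each number in place, so the original order is kept.
    return re.sub(
        r'Lecture (\d+)',
        lambda m: 'Lecture ' + m.group(1).zfill(width),
        line,
    )

print(pad_lecture_numbers("oA=['Lecture 7 - Topic.mp4','Lecture 56 - Topic.mp4']"))
```

Because the substitution works in place, the order of the entries and the rest of each file name are left untouched.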
I am trying to replace whitespace characters with '\t' string. The text file looks like this:
255 255 255 white
0 0 0 black
47 79 79 dark slate gray
47 79 79 DarkSlateGray
47 79 79 DarkSlateGrey
105 105 105 dim gray
My code looks like:
import re
with open('rgb.txt', 'r') as f:
    for line in f:
        print(re.sub(r'\s+', r'\\t', line))
The above code gives:
255\t255\t255\twhite
\t0\t0\t0\tblack
\t47\t79\t79\tdark\tslate\tgray
\t47\t79\t79\tDarkSlateGray
\t47\t79\t79\tDarkSlateGrey
105\t105\t105\tdim\tgray
However, I only want to replace the whitespace after the first number up to the color name, and not the whitespace within the color name. The output I want is:
255\t255\t255\twhite
0\t0\t0\tblack
47\t79\t79\tdarkslategray
47\t79\t79\tDarkSlateGray
47\t79\t79\tDarkSlateGrey
105\t105\t105\tdimgray
You can match whitespace immediately following a digit, which should solve the problem:
>>> txt = """255 255 255 white
... 0 0 0 black
... 47 79 79 dark slate gray
... 47 79 79 DarkSlateGray
... 47 79 79 DarkSlateGrey
... 105 105 105 dim gray"""
>>> for line in txt.split('\n'):
... print(re.sub(r'[0-9]\s+', lambda m:m.group(0)[0]+r'\t', line))
...
255\t255\t255\twhite
0\t0\t0\tblack
47\t79\t79\tdark slate gray
47\t79\t79\tDarkSlateGray
47\t79\t79\tDarkSlateGrey
105\t105\t105\tdim gray
I couldn't find a quick way to just ignore the digit in the replacement, so I just made a lambda instead that takes the digit that was matched and appends a \t to it.
I suggest using nested re.subs:
re.sub(r'^[\d\s]+', lambda x: re.sub(r'\s+', '\t', x.group()), line)
To get rid of spaces at start use line.lstrip() before running the regex:
re.sub(r'^[\d\s]+', lambda x: re.sub(r'\s+', '\t', x.group()), line.lstrip())
The first ^[\d\s]+ matches all digits and spaces at the start of line and the second re.sub replaces whitespace strings with a single tab.
Output (for lines without .lstrip()):
255\t255\t255\twhite
\t0\t0\t0\tblack
\t47\t79\t79\tdark slate gray
\t47\t79\t79\tDarkSlateGray
\t47\t79\t79\tDarkSlateGrey
105\t105\t105\tdim gray
Output (for lines with .lstrip()):
255\t255\t255\twhite
0\t0\t0\tblack
47\t79\t79\tdark slate gray
47\t79\t79\tDarkSlateGray
47\t79\t79\tDarkSlateGrey
105\t105\t105\tdim gray
I'm not familiar enough with Python to answer quickly and accurately in it, but here's JavaScript showing the regex implementation. If the first three fields will always be strings of digits, you can handle it this way.
var input = `255 255 255 white
0 0 0 black
47 79 79 dark slate gray
47 79 79 DarkSlateGray
47 79 79 DarkSlateGrey
105 105 105 dim gray`
var output = input.replace(/(\d+)\s+/g, '$1\\t')
console.log(output)
You can do it in two passes:
import re
txt = """
255 255 255 white
0 0 0 black
47 79 79 dark slate gray
47 79 79 DarkSlateGray
47 79 79 DarkSlateGrey
105 105 105 dim gray
"""
for line in txt.split('\n'):
    line = re.sub(r'^\s+', '', line) # remove leading spaces
    print(re.sub(r'(?<![a-zA-Z])(\s+)', r'\\t', line)) # change other spaces by \t when not preceded by a letter
Output:
255\t255\t255\twhite
0\t0\t0\tblack
47\t79\t79\tdark slate gray
47\t79\t79\tDarkSlateGray
47\t79\t79\tDarkSlateGrey
105\t105\t105\tdim gray
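An alternative that avoids regular expressions is to split each line a fixed number of times. This is only a sketch, assuming every line has exactly three numeric columns followed by the color name; the function name tabify is made up for this example.

```python
def tabify(line: str) -> str:
    # split(None, 3) splits on runs of whitespace at most three times, so the
    # color name stays in one piece, inner spaces included.
    fields = line.strip().split(None, 3)
    # r'\t' is the two-character string backslash + t, as the question asks.
    return r'\t'.join(fields)

for line in [' 47 79 79 dark slate gray', '255 255 255 white']:
    print(tabify(line))
```

Stripping the line first removes the leading whitespace and the trailing newline, so no separate pass is needed for them.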
I have some string like this
' 12 2 89 29 11 92 92 10'
(all the numbers are positive integers so no - and no .), and I want to extract all numbers from it, edit some of the numbers, and then put them all together with the same whitespaces. For example, if I change the number 11 to 22, I want the final string as
' 12 2 89 29 22 92 92 10'
I did some search and most questions disregard the whitespaces and only care about the numbers. I tried
match = re.match(r'(\s*(\d+)){8}', str)
but match.group(0) gives me the whole string, match.group(1) gives me the first match \ 12 (I added the \ otherwise the website won't show the leading whitespace), and match.group(2) gives me 12. But it won't give me any numbers after that; any index higher than 2 gives me an error. I don't think my approach is the correct one. What is the right way to do this?
I just tried re.split('(\d+)', str) and that seems to be what I need.
I'd recommend using a regular expression with non-capturing groups, to get a list of 'space' parts and 'number' parts:
In [15]: text = ' 12 2 89 29 11 92 92 10'
In [16]: parts = re.findall('((?: +)|(?:[0-9]+))', text)
In [17]: parts
Out[17]: [' ', '12', ' ', '2', ' ', '89', ' ', '29', ' ',
'11', ' ', '92', ' ', '92', ' ', '10']
Then you can do:
for index, part in enumerate(parts):
    if part == '11':
        parts[index] = '22'
replaced = ''.join(parts)
(or whatever match and replacement you want to do).
Match all numbers with spaces, change desired number and join array.
import re
newNum = '125'
text = ' 12 2 89 29 11 92 92 10'
# marray[6] is the second ' 92'
marray = re.findall(r'\s+\d+', text)
marray[6] = re.sub(r'\d+', newNum, marray[6])
print(marray)
[' 12', ' 2', ' 89', ' 29', ' 11', ' 92', ' 125', ' 10']
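The re.split idea mentioned in the question can be sketched as follows. With a capturing group in the pattern, re.split keeps the numbers in the result, so whitespace and numbers alternate and can be joined back unchanged. The function name replace_number is hypothetical, and this version replaces every number equal to old, which is an assumption about the intended behavior.

```python
import re

def replace_number(text: str, old: str, new: str) -> str:
    # Because of the capturing group, the result alternates between
    # whitespace (even indices) and numbers (odd indices).
    parts = re.split(r'(\d+)', text)
    for i in range(1, len(parts), 2):
        if parts[i] == old:
            parts[i] = new
    # Joining the parts restores the original spacing exactly.
    return ''.join(parts)

print(repr(replace_number('  12 2   89 29  11 92 92 10', '11', '22')))
```

To change a number by position rather than by value, index the odd positions directly: the k-th number (counting from 0) is parts[2 * k + 1].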
I am currently trying to extract the following sentence:
This is a rectangle. Its height is 193, its width is 193 and the word number is 12.
from the following line:
ID: 1 x: 1232 y: 2208 w: 193 h: 390 wn: 12 ln: 13 c: This is a rectangle. Its height is 193, its width is 193 and the word number is 12 !
I have to do this using QRegularExpressions. Therefore, my code is as following:
regularExpression.setPattern("[c:](?:\\s*)$");
QRegularExpressionMatch match = regularExpression.match("ID: 2 x: 845 y: 1633 w: 422 h: 491 wn: 78 ln: 12 c: qsdfgh");
if (match.hasMatch()) {
    QString id = match.captured(0);
    qDebug()<<"The annotation is:"<<id;
    return id;
}
return 0;
However, it does not work at all and I do not understand why (maybe my regular expression is not correct). I have been stuck on this problem for several days now.
Could you help me, please?
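Use the following regex to capture everything after c: while skipping any whitespace at the start of the captured text. In a C++ string literal the backslash must be doubled, and the captured text is then available as match.captured(1):
regularExpression.setPattern("c:\\s*(.*$)");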
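The pattern itself can be checked quickly outside Qt. The sketch below uses Python's re module purely to illustrate the two patterns on the sample line from the question; QRegularExpression is a different engine, so treat this as a check of the pattern, not of the Qt API.

```python
import re

line = 'ID: 2 x: 845 y: 1633 w: 422 h: 491 wn: 78 ln: 12 c: qsdfgh'

# The pattern from the question: [c:] is a character class that matches ONE
# character ('c' or ':'), and it must sit at the end of the line (followed
# only by whitespace), so it finds nothing here.
original = re.search(r'[c:](?:\s*)$', line)

# The suggested pattern: the literal text "c:", optional whitespace, then
# everything up to the end of the line in capture group 1.
match = re.search(r'c:\s*(.*)$', line)
annotation = match.group(1) if match else ''
print(annotation)
```

The wanted text is in capture group 1, not group 0, because group 0 is the whole match including the leading c:.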