Whitespaces in scraping results (python)

I'm trying to scrape a website with Python 2.7 and BeautifulSoup 4. The code I'm using works on one machine, but on the other the resulting 'soup' comes back with three whitespace characters added between the letters. Any idea what's causing this? I get something like the following, both in the terminal and in Eclipse/PyDev:
i f ( w i n d o w . D o m L o a d e d )
{
D o m L o a d e d . l o a d ( f u n c t i o n ( ) { b a n n e r S y n c ( ' t b ' ) ; } ) ;
d o c u m e n t . w r i t e ( ' d i v i d = " d o m L o a d e d " s t y l e = " d i s p l a y : n o n e " > \ / d i v > ' ) ;
}
/ s c r i p t >
! - - S e r v e r : P h o b o s , S e r v e r t i m e : 0 , 0 9 2 7 s ( C : 0 , 0 5 2 0 ; Q : 7 ; 0 , 0 0 2 2 ; E : 5 2 ; 0 , 0 3 1 1 s , M : 3 ; 0 , 0 0 1 1 s , A : 0 ; 0 , 0 0 0 0 s ) , M e m : 1 2 3 0 1 K B , E n g i n e s : ( S ) p h o b o s ( 5 2 ) - - >
/ b o d y >
/ h t m l >

It's quite possible that the two machines have different HTML parser libraries installed; please check this link. Different parsers can produce different parse results, especially for ill-formed HTML.
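One way to compare the two machines is to check which parser backends each interpreter can import. The sketch below is only an illustration: the helper name available_parsers is hypothetical, and it assumes the candidates worth checking are lxml and html5lib, the two third-party parsers BeautifulSoup can use besides the standard library's html.parser. It is written to run on both Python 2.7 and Python 3.

```python
# Hypothetical helper: list the HTML parser backends this interpreter can import.
def available_parsers(candidates=("lxml", "html5lib")):
    # "html.parser" is BeautifulSoup's name for the standard-library parser,
    # which is always available.
    found = ["html.parser"]
    for name in candidates:
        try:
            __import__(name)
        except ImportError:
            continue  # not installed on this machine
        found.append(name)
    return found


print(available_parsers())
```

If the two machines print different lists, passing the parser name explicitly, for example BeautifulSoup(markup, "html.parser"), should make both use the same parser.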

Related

Demo response different from runtime response

I'm trying to use the Google Cloud Vision API to OCR this image:
I'm using the following code to make the request:
const resp = await fetch(
  `https://vision.googleapis.com/v1/images:annotate?key=${KEY}`, {
    method: 'POST',
    body: JSON.stringify({
      requests: [{
        image: {content: encoded},
        features: [{type: "TEXT_DETECTION"}],
      }]
    }),
  });
This works but there is some information missing from the result. If we look at the text field:
Dog Search
D G O OD D ODG O O D D O
O D O O G G G D O D G OG G
OGD GOGD GO G GO G D
D D D G D DO DOO G D O O
O DGOGG D O O G G O O D
DOG
Here's that visualized:
There are boxes around the characters which were recognized. But, if we put this image into the gcv demo application, we get this instead:
And this is what text looks like:
Dog Search
D GOOD D 0 D GOOD DO
0 D 0 0 G G G DOD GO GG
o G O G D 0 0 D G 0 0 D D D
D G D o o o G G o o G D Go
0 G D G O G D G O G G O G D
D D D G D DO DO O G D 0 0
O D GO G G D 0 0 G G 0 0 D
DOG
Here's a gist with the requests + responses. I'm authenticating using an API token.
Why are the responses different? The requests are slightly different but not in a way which should affect the output. Right?

How to place the elements of a global list into a variable?

For instance, I know that I can use a global list as column names for a defined matrix, e.g.
global letter = "a b c d e f g h"
matrix colnames mymatrx = $letter
However, I want to create a Stata variable that holds the elements of my global macro, something like this:
gen myvar = $letter (Note: this doesn't work)

It is not clear exactly what you want, but all three of these interpretations are legal:
clear
set obs 8
global letter "a b c d e f g h"
gen letter1 = "$letter"
gen letter2 = "$letter" in 1
gen letter3 = word("$letter", _n)
list, sep(0)
     +---------------------------------------------+
     |         letter1           letter2   letter3 |
     |---------------------------------------------|
  1. | a b c d e f g h   a b c d e f g h         a |
  2. | a b c d e f g h                           b |
  3. | a b c d e f g h                           c |
  4. | a b c d e f g h                           d |
  5. | a b c d e f g h                           e |
  6. | a b c d e f g h                           f |
  7. | a b c d e f g h                           g |
  8. | a b c d e f g h                           h |
     +---------------------------------------------+
Without the quotation marks, Stata will try to make sense of a as a variable or scalar name, and bail out if that does not work. Even if that works, it will not be able to make sense of how you want to combine it with b, and will bail out then.
In short, you usually need " " to deal with literal strings. The matrix *names commands are special, because their inputs are necessarily literal strings (even when they are numeric characters).

Python, reverse some lines down to up

I don't know how to do this. I have:
a b c d e 1
a b c d e 2
a b c d e 3
...
a b c d e n
and need:
a b c d e n
....
a b c d e 3
a b c d e 2
a b c d e 1
I tried something like:
newcontent=codecs.encode(content,'utf_8','replace')
for i in reversed(newcontent):
    newcontent.append(i)
print newcontent
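The attempt above fails for two reasons: codecs.encode returns a string, which has no append method, and iterating over a string with reversed yields single characters rather than lines. A minimal sketch of one way to reverse the line order, assuming the content is already held in memory as one string (the function name reverse_lines and the sample data are illustrative, not from the original post):

```python
def reverse_lines(text):
    """Return text with its lines in reverse order."""
    lines = text.splitlines()          # split on line breaks, dropping them
    return "\n".join(reversed(lines))  # rejoin from last line to first


sample = "a b c d e 1\na b c d e 2\na b c d e 3"
print(reverse_lines(sample))
```

The call form print(...) is used so the sketch runs on both Python 2.7 and Python 3.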

How to delete lines starting with certain numbers in a file?

Simple question here but I'm kinda stuck.
Let's say I have a file with 20 lines and 4 columns. The first column is a number (1 to 20).
I have another file with a few numbers in it, like this:
1
4
19
Now, how can I delete the lines in the first file that start with the numbers in the second file? My main problem is that if I use sed, the number 1 also matches 10, 11, 12, and so on. How can I do this the right way?
Thanks a lot!
EDIT: examples
file1
1 a a a
2 b b b
3 c c c
4 d d d
5 e e e
6 f f f
7 g g g
8 h h h
9 i i i
10 j j j
11 k k k
12 l l l
13 m m m
14 n n n
15 o o o
16 p p p
17 q q q
18 r r r
19 s s s
20 t t t
file2
1
4
19
the result I want:
2 b b b
3 c c c
5 e e e
6 f f f
7 g g g
8 h h h
9 i i i
10 j j j
11 k k k
12 l l l
13 m m m
14 n n n
15 o o o
16 p p p
17 q q q
18 r r r
20 t t t

You can use awk for this:
awk 'FNR==NR{a[$1]; next} !($1 in a)' file2 file1
2 b b b
3 c c c
5 e e e
6 f f f
7 g g g
8 h h h
9 i i i
10 j j j
11 k k k
12 l l l
13 m m m
14 n n n
15 o o o
16 p p p
17 q q q
18 r r r
20 t t t
Breakdown of the awk command:
FNR == NR {   # while processing file2
    a[$1]     # store the 1st field as a key in array 'a'
    next      # move to the next record
}
# while processing file1
!($1 in a)    # print a row from file1 if its 1st field is not in array 'a'
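For readers more at home in Python, the same two-pass idea can be sketched as below. This is an illustration rather than a drop-in replacement for the awk one-liner: the function name filter_lines is hypothetical, and it takes the lines of both files as lists instead of reading file2 and file1 directly.

```python
def filter_lines(key_lines, data_lines):
    """Keep the data lines whose first field is not among the keys."""
    def first_field(line):
        fields = line.split()
        return fields[0] if fields else ""

    # First pass: collect the first field of every key line (file2).
    keys = set(first_field(line) for line in key_lines)
    # Second pass: keep data lines (file1) whose whole first field is not a key.
    return [line for line in data_lines if first_field(line) not in keys]


keys = ["1", "4", "19"]
data = ["1 a a a", "4 d d d", "10 j j j", "19 s s s", "20 t t t"]
for line in filter_lines(keys, data):
    print(line)
```

Because the comparison is on the whole first field rather than on a prefix, the key 1 does not remove the lines starting with 10, 11, and so on.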

You can use sed to create a sed script that deletes the given lines:
sed 's=^=/^=;s=$=\\s/d=' numbers
It creates the following sed script:
/^1\s/d
/^4\s/d
/^19\s/d
I.e. delete the line if it starts with a 1, 4, or 19, followed by whitespace.
You can directly pipe it to sed to run it:
sed 's=^=/^=;s=$=\\s/d=' numbers | sed -f- input-file

Extracting words from a file in C

Could anyone help me correct the following code? I need to extract words (sequences of non-whitespace characters up to a whitespace character or a newline character). Here the code prints each letter of the extracted word 3 times.
#include<stdio.h>
#include<string.h>
main()
{
    FILE *fp1,*fp2,*fp3;
    char ch,str[10],lab[10],opc[10],opd[10];
    int i;
    fp1=fopen("ma.dat","r");
    while((ch = fgetc(fp1)) != EOF)
    {
        i=0;
        if(ch!=' ' || '\n' || -1)
        {
            lab[i++]=ch;
        }
        lab[i]='\0';
        i=0;
        if(ch!=' ' || '\n' || -1)
        {
            opc[i++]=ch;
        }
        opc[i]='\0';
        i=0;
        if(ch!=' ' || '\n' || -1)
        {
            opd[i++]=ch;
        }
        opd[i]='\0';
        printf("%s %s %s ",lab,opc,opd);
    }
    fcloseall();
}
and here is my input :
copy start 1000
lda alpha
lda five
sta six
six word 4
alpha rword 5
five byte c'eof'
end
and the output is :
c c c o o o p p p y y y s s s t t t a a a r r r t t t 1 1 1 0 0 0 0 0 0 0 0 0
l l l d d d a a a a a a l l l p p p h h h a a a
l l l d d d a a a f f f i i i v v v e e e
s s s t t t a a a s s s i i i x x x
s s s i i i x x x w w w o o o r r r d d d 4 4 4
a a a l l l p p p h h h a a a r r r w w w o o o r r r d d d 5 5 5
f f f i i i v v v e e e b b b y y y t t t e e e c c c ' ' ' e e e o o o f f f ' ' '
e e e n n n d d d
Here I used the logic of scanning until EOF is reached and (tried) to get separate words until a space or newline is reached.

Do you have a debugger? Set a breakpoint and step through the program, line by line. You'll find that there is at least one statement in your loop that makes no sense. Hint: Why do you have the i=0 statement there inside the loop?
If this isn't just a typo after too many days of coding, you may want to read up on how a while loop works (especially which commands get repeated) and how if works (especially the difference between conditional statements like "if" and loop statements like "while").
PS - I'm obviously biased, but if you're looking for a good C tutorial, try my C tutorial http://masters-of-the-void.com - It's written for the Mac, but you already have your compiler up and running and you've compiled your own programs with it, so just doing the samples on Linux should be well within your skills.
