Issue with regular expressions while parsing source code [closed]

Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 3 years ago.
I'm trying to get some information from a page's source code.
For example, let's take this Amazon product:
https://www.amazon.com/gp/product/B07PWCJZJ6?pf_rd_p=2d1ab404-3b11-4c97-b3db-48081e145e35&pf_rd_r=0PF9KX04Y9GAPGCXBDAP
We can check the source code with
view-source:https://www.amazon.com/gp/product/B07PWCJZJ6?pf_rd_p=2d1ab404-3b11-4c97-b3db-48081e145e35&pf_rd_r=0PF9KX04Y9GAPGCXBDAP
My objective is to get some data, such as the product descriptions (1366x768 LED display, for example).
I'm basically taking the whole source code and then using regular expressions to get the data I need.
I'm doing something like this:
import re
import requests
source = requests.get(someUrl)
data = re.findall(r'<span class=\"a-list-item\">(.*?)<\/span><\/li>', source.content)
This should give me every product description, but I keep getting "TypeError: cannot use a string pattern on a bytes-like object".
I don't know whether my regex is wrong or source.content is not giving me the source code.

As the diagnostic explains, the re module will not apply a string pattern to bytes input: the pattern and the input must both be str or both be bytes.
The requests documentation is pretty clear:
... access the response body as bytes, for non-text requests:
>>> r.content
Given that you retrieved some HTML text
you will want to decode it,
or let the library decode it for you:
>>> source.content.decode(source.encoding)
or
>>> source.text
Both expressions return a Unicode string,
which would be the perfect input for that regex.
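A minimal sketch of the difference, using a hard-coded bytes value as a stand-in for source.content so that it runs without network access. The sample markup and the UTF-8 encoding are assumptions for illustration and are not taken from the real page.

```python
import re

# Hypothetical stand-in for source.content (requests returns bytes there).
content = b'<ul><li><span class="a-list-item">1366x768 LED display</span></li></ul>'
pattern = r'<span class="a-list-item">(.*?)</span></li>'

# A str pattern applied to bytes reproduces the error from the question.
error = None
try:
    re.findall(pattern, content)
except TypeError as exc:
    error = str(exc)

# Decoding first (this is what source.text does for you) makes the types match.
text = content.decode("utf-8")
items = re.findall(pattern, text)

# Alternatively, keep the bytes and use a bytes pattern; the matches are bytes too.
bytes_items = re.findall(pattern.encode("ascii"), content)

print(error)
print(items)
print(bytes_items)
```

Either way works; decoding once and staying with str is usually less error-prone than carrying bytes through the rest of the program.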
Separately: make soup, not regexes. bs4 (Beautiful Soup) is the more appropriate tool here.
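To illustrate the parser-based approach without requiring a third-party install, here is a rough sketch that swaps in the standard library's html.parser in place of bs4. The class name ListItemCollector, the helper extract_items, and the sample markup are all hypothetical; with bs4 the same idea would be considerably shorter.

```python
from html.parser import HTMLParser


class ListItemCollector(HTMLParser):
    """Collects the text inside every <span class="a-list-item">."""

    def __init__(self):
        super().__init__()
        self.depth = 0   # greater than 0 while inside a matching span
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.depth or "a-list-item" in classes:
            if not self.depth:
                self.items.append("")   # start a new item
            self.depth += 1             # track nested spans

    def handle_endtag(self, tag):
        if tag == "span" and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.items[-1] += data


def extract_items(html):
    """Return the stripped text of each matching span, in document order."""
    collector = ListItemCollector()
    collector.feed(html)
    return [item.strip() for item in collector.items]


# Hypothetical sample markup; the real page is far larger.
sample = (
    '<ul>'
    '<li><span class="a-list-item"> 1366x768 LED display </span></li>'
    '<li><span class="a-list-item">8 GB <span>RAM</span></span></li>'
    '<li><span class="other">skip me</span></li>'
    '</ul>'
)
print(extract_items(sample))
```

Unlike the regex, this does not depend on the exact attribute quoting or on the span being immediately followed by a closing li tag, and it copes with nested spans.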

Related

Statistica VB - Including an external macro [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 4 years ago.
I'm using Statistica 64 VB. I wrote a function "Public MyFunction()" in FileLibrary.svb (a collection of useful functions) that I want to be called by a function in FileDoStuff.svb (an analysis).
I tried to include FileLibrary.svb like this in FileDoStuff.svb:
'#Language "WWB-COM"
'#Uses "U:\TestSVB\FileLibrary.svb"
This is the result when I run Main() in FileDoStuff, and the result is the same even if I have FileLibrary open in the application.
"Script error in FileDoStuff.svb
Macro/module does not exist."
Statistica is on the E: drive. However, FileLibrary opens a spreadsheet on U: and has no problem with it. I am able to open FileLibrary from Statistica and test it.
Why would it work to open an external spreadsheet but not call an external macro? The FileLibrary is not saved within Statistica, but neither is the analysis in FileDoStuff. What am I doing wrong?
Also, what's the difference between an SVB and an SVX file?

You know what really helps, as I discovered after hours of trying everything?
Try spelling the entire path name and the entire file name correctly, including spaces, etc. And make sure the slashes go the right way, too. (In my real path/file there are spaces.)
As much as I'd like to delete this whole question, I'm leaving it here to remind us all that sometimes the answer is just that simple. Also, I want to draw more people out who are using Statistica VB because I know there will be more questions.

How to design syntax for complex command line options? [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 6 years ago.
Suppose my command line utility can send messages with the following structure:
struct Message {
    uint32_t ip;
    string id;
};
The user must be able to specify a host (IPv4 address plus port) and, per host, filters on ip and id
(the network addresses and ids that are allowed to be sent). How can I design
a clear syntax for such a complex option?
The best I have come up with is:
--msg-send="192.168.10.2:8080;ip_isin=10.0.0.0/8,172.16.0.1/16;id=one,two"
But something is wrong with it. For example, the = sign inside the value is
annoying. Does anybody know "the silver bullet" for command line arguments with a complex structure?
Another variant is better:
--msg-send="192.168.10.2:8080{ip 10.0.0.0/8,172.168.0.0/16}{id one,two}"
UPD: msg-send is plural; the user can set several hosts with different filters.

There's no silver bullet. However, when interacting with humans you should try to follow the human way of thinking. You're trying to make the user compose a complex structure, somewhat resembling JSON, by hand. Humans are bad at this. From your explanation I gather that this structure has three components:
host+port
list of IPs (identified by subnets, CIDR notation, ranges, whatever your program can handle)
list of string ids
Thus it might be logical to require the user to enter these parameters separately, for example:
msg_util.exe --host 192.168.10.2:8080 --allowed_ips 10.0.0.0/8,172.16.0.1/16 --allowed_ids one,two
If you have many hosts, each with its own allowed IPs and ids, it would be quite awkward to enter them all on the command line alone, and many network utilities (like dig) resort to reading input from files. For example, you could have
msg_util.exe --file --host hosts_file --allowed_ips ips_file --allowed_ids ids_file
where each line of hosts_file has its corresponding options on the same line of ips_file and ids_file.
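The separate-options layout can be sketched with Python's argparse as follows. This is only an illustration under stated assumptions: the helper names (comma_list, network_list, build_parser, is_allowed) are hypothetical, an empty filter is assumed to mean "no restriction", and networks are parsed with strict=False so that a value such as 172.16.0.1/16 is accepted and normalised to 172.16.0.0/16.

```python
import argparse
import ipaddress


def comma_list(value):
    """Split 'one,two' into ['one', 'two'], ignoring empty pieces."""
    return [item for item in value.split(",") if item]


def network_list(value):
    """Parse comma-separated CIDR blocks; strict=False tolerates host bits."""
    return [ipaddress.ip_network(item, strict=False) for item in comma_list(value)]


def build_parser():
    parser = argparse.ArgumentParser(prog="msg_util")
    parser.add_argument("--host", required=True, help="destination as ip:port")
    parser.add_argument("--allowed_ips", type=network_list, default=[])
    parser.add_argument("--allowed_ids", type=comma_list, default=[])
    return parser


def is_allowed(args, ip, msg_id):
    """Check one message (ip as a 32-bit integer, id as a string) against the filters.

    Assumption: an empty filter list places no restriction on that field.
    """
    address = ipaddress.ip_address(ip)
    ip_ok = not args.allowed_ips or any(address in net for net in args.allowed_ips)
    id_ok = not args.allowed_ids or msg_id in args.allowed_ids
    return ip_ok and id_ok


args = build_parser().parse_args([
    "--host", "192.168.10.2:8080",
    "--allowed_ips", "10.0.0.0/8,172.16.0.1/16",
    "--allowed_ids", "one,two",
])
print(args.host, args.allowed_ips, args.allowed_ids)
```

Because each option carries one kind of value, the user never has to quote or escape nested delimiters, and a malformed network is reported by argparse as an error on that specific option.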

Open up text file in Notepad after it was just made [closed]

Closed. This question needs debugging details. It is not currently accepting answers.
Closed 7 years ago.
I have a little C++ program that takes in some text, sorts it, and formats it the way I want. One of my functions saves the new text file and lets the user enter the name of the file.
At the end of the program, I want it to open Notepad displaying that new text file. I know you can use system("notepad.exe (txt file)"), but I can't put a string variable in place of the txt file. It requires the name of the text file, and the file name could be anything depending on the user.
Any help or a link to where I can read about it would be great!
Thanks

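Assemble the command in a std::string and then pass the result of its c_str() member function to system. For example, assuming the file name is held in a std::string called filename: system(("notepad.exe " + filename).c_str()).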

Is there a C++ library to extract text from a PDF file like PDFBox for Java? [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
Closed 7 years ago.
Last year, I made an application in Java using PDFBox to get the raw text in some PDF files and I need to port that application to C++ now.
I wanted to know what was the best C++ alternative to accomplish what I need.
I'll give an example in case it helps:
Most files will look like this: http://www.jumbala.net/backup/league.pdf
With PDFBox, using that file, each line read on page 2 and most of page 3 would output all the data of a line, separated by a space instead of keeping it in a grid like it is now.
So the first relevant line in page 2 would look like this:
FB 847 - Tremblay, Gérard 179,63 56 16167 90 268 s27 p3 669 s14 199 223 193 615
or something like that since there are minor changes in the order they appear, but I don't care about that as long as similar lines output the same since I just parse them and put the values I need in different variables.
So, knowing all of that, is there a library that I can use in a C++ program to get similar results?
Edit: After looking at sacredFaith's link at http://www.codeproject.com/Articles/7056/Code-to-extract-plain-text-from-a-PDF-file and trying it, I'm getting a weird output like such for the example file I mentioned earlier:
http://www.jumbala.net/backup/league.pdf.txt
The parts I actually need are in the weird characters at the beginning. Using Adobe Acrobat Reader X and using Save As... Text (accessible), I get the following result:
http://www.jumbala.net/backup/league_good.pdf.txt
Which is approximately what I get in Java using PDFBox and what I want to get as output in C++.

Xpdf is a C++ application/library which includes tools to extract plain text from a PDF file.

Since that's what you're looking for: PoDoFo is a C++ library to parse, read, modify, or create PDF files. The library is cross-platform.

I've never used the following, but after some Googling I found this:
http://www.codeproject.com/Articles/7056/Code-to-extract-plain-text-from-a-PDF-file

Turn .txt file into .pdf file on the fly? [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 5 months ago.
We're trying to figure out whether there is a way to convert a .txt file to a .pdf file. Here's the catch: this needs to be done behind the scenes, and on the fly. That is, with a radio control selected, OnOK would create a .txt file, and at run time we would like that .txt file to be converted to a .pdf file. Ideally this would be done by running an executable in the background. The executable would take "File.txt" as input and output "File.pdf". We're using C++ and Visual Studio 6.
Does anyone have any experience with this? Is this possible?

libHaru may do what you want. Demo.

This a2pdf tool will probably do the trick with minimal effort. Just be sure to turn off perl syntax highlighting.
http://perl.jonallen.info/projects/a2pdf

I recommend using this open source library.
Once you have the base for generating PDF documents programmatically, you still need a method for converting the text to PDF elements while keeping the text flow and word wrapping. This article may help. Pay particular attention to the DoText(StreamReader sr) function. It takes the text and breaks it into separate lines within the PDF document, keeping the rendered text within the margins.

One of the simpler methods, which has worked for three decades (that is, more than a quarter of a century), is to place a PostScript header before the text and then run it through Ghostscript's ps2pdf. It is the same method used by some commercial apps such as Acrobat.
At its most basic:
copy /b heading.ps+file.txt printfile.ps
gs -sDEVICE=pdfwrite -o printfile.pdf printfile.ps
Master Example can be seen here
How to modify this plaintext-to-PDF-converting PostScript from 1992 to actually specify a page size?