Is there any way to extract URL from text in C++ [closed] - c++

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
Questions asking for code must demonstrate a minimal understanding of the problem being solved. Include attempted solutions, why they didn't work, and the expected results. See also: Stack Overflow question checklist
Closed 9 years ago.
In PHP I can write a regex to extract URLs from text.
Is there a comparable class or method available in C++?
I am working with streaming data, which may contain URLs. I want to extract each URL together with its count.
I can use a vector or another data structure for later processing; the question is the one in the title.

C++11 introduced <regex> as part of the standard library.
Let's take a look at how to use it.
First we need to include the header.
#include <regex>
Now let's declare our URL regex. For now we'll use something very simple (it matches any string that contains a dot). I'll leave it up to you to replace it with a more complete regex. Notice how we use \\ instead of just \ to escape things: \ is itself the escape character in C++ string literals, so it has to be escaped.
std::regex url(".*\\..*");
Let's create a string to test this against.
std::string url_test = "example.com";
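Now let's check if url_test matches url and print out a message accordingly. Note that regex_match only succeeds when the whole string matches the pattern; to find a pattern inside a longer text, use regex_search instead.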
if(regex_match(url_test, url)) {
    std::cout << "It's a url!" << std::endl;
} else {
    std::cout << "Oh snap! It's not a url!" << std::endl;
}
Our complete program:
#include <iostream>
#include <regex>
#include <string>

int main()
{
    std::regex url(".*\\..*");
    std::string url_test = "example.com";
    if(regex_match(url_test, url)) {
        std::cout << "It's a url!" << std::endl;
    } else {
        std::cout << "Oh snap! It's not a url!" << std::endl;
    }
}
Read more at http://www.cplusplus.com/reference/regex/
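The question asks for extraction with counts rather than a yes/no match, so here is a minimal sketch of that using std::sregex_iterator. The function name count_urls and the pattern are assumptions made for illustration; the pattern only recognises http:// and https:// links and will keep trailing punctuation such as a final full stop.

```cpp
#include <map>
#include <regex>
#include <string>

// Hypothetical helper: counts every URL-like token found in `text`.
// The pattern is a deliberately simple assumption: "http://" or "https://"
// followed by characters that are not whitespace, quotes or angle brackets.
std::map<std::string, int> count_urls(const std::string& text) {
    static const std::regex url_re(R"(https?://[^\s"'<>]+)");
    std::map<std::string, int> counts;
    // sregex_iterator walks over all non-overlapping matches in the text.
    for (std::sregex_iterator it(text.begin(), text.end(), url_re), end;
         it != end; ++it) {
        ++counts[it->str()];
    }
    return counts;
}
```

For streaming data, one option is to call this on each chunk (for example each line) and merge the resulting maps; a URL split across two chunks would be missed by this sketch.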

As for the regex itself, I use the following to match a wide range of links:
\b(?:(?:(?:https?|ftp|file)://|www\.|ftp\.)[-A-Z0-9+&@#/%?=~_|$!:,.;]*[-A-Z0-9+&@#/%=~_|$]
| ((?:mailto:)?[A-Z0-9._%+-]+@[A-Z0-9._%-]+\.[A-Z]{2,4})\b)
|"(?:(?:https?|ftp|file)://|www\.|ftp\.)[^"\r\n]+"?
|'(?:(?:https?|ftp|file)://|www\.|ftp\.)[^'\r\n]+'
This matches web links with or without http/https at the beginning, email addresses with or without a leading mailto:, ftp links and file links, as well as links inside single or double quotes. The character classes list only A-Z, so the pattern is presumably meant to be applied case-insensitively.
I have not used the regex functionality of C++ (<regex>) myself, but I will have a look at it today and get back to you with some code examples.
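In the meantime, here is a rough sketch of how the first alternative of such a pattern might be used with std::regex. It is not the full expression: the mailto and quoted-link branches are left out, the name find_links is made up for the example, and the letter ranges are spelled A-Za-z as a precaution in addition to the icase flag.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: collects links that start with a scheme
// (http, https, ftp, file) or with "www." / "ftp.".
// Only the first branch of the longer pattern is used here (an assumption
// made to keep the sketch short).
std::vector<std::string> find_links(const std::string& text) {
    static const std::regex link_re(
        R"(\b(?:(?:https?|ftp|file)://|www\.|ftp\.)[-A-Za-z0-9+&@#/%?=~_|$!:,.;]*[-A-Za-z0-9+&@#/%=~_|$])",
        std::regex::ECMAScript | std::regex::icase);
    std::vector<std::string> links;
    for (std::sregex_iterator it(text.begin(), text.end(), link_re), end;
         it != end; ++it) {
        links.push_back(it->str());
    }
    return links;
}
```

The final character class omits punctuation such as ',' and '.', which is why a link at the end of a sentence is returned without the trailing punctuation.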

Related

how to remove a substring of unknown length from a string - C++ [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 3 years ago.
I have a string (URL) which I know part of, but not all of it. Some of the string I want to remove, but the string can vary in length, so I can't use the erase() function from <string>.
An example URL I may work with:
https://github.com/username/repo-name
After parsing, I would like to be left with just repo-name
I know I can do this easily in shell with the sed command:
echo "https://github.com/username/repo-name" | sed 's/.*\///'
Ideally, I would like something I could turn into a function, so I could do something like:
string substring = sed("s/.*\///",var);
Please keep in mind that I do not know the full string to be removed. In the case of the URL I showed, I want to remove everything before, and including, the final '/', but the username is subject to change. So, from what I understand, I can't use things such as erase() or rfind().
EDIT
All URLs I am parsing are GitHub URLs, so they will all be similar to this one. Another part of the program will ensure they all follow this same syntax.
Use rfind to get the location of the last slash then substr to return the right side of the string. Since substr's second argument defaults to "the end of the string" you only need to provide the beginning position:
#include <string>
std::string GetURLFilePath(const std::string& url) {
    auto last_slash_location = url.rfind('/');
    if(last_slash_location == std::string::npos || last_slash_location + 1 >= url.size()) {
        return "";
    }
    return url.substr(last_slash_location + 1);
}
You didn't really say whether you are expecting query parameters (as in https://www.youtube.com/watch?v=YnWhqhNdYyk), so this will not be exactly what you want if those exist in the string, but it should put you on the right track.
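If you would rather stay close to the sed one-liner, std::regex_replace can do much the same thing. The sketch below is only one way to do it; both function names are invented for the example, and the second function reflects my assumption that anything from '?' or '#' onward should be dropped.

```cpp
#include <regex>
#include <string>

// Rough equivalent of sed 's/.*\///': removes everything up to and
// including the last '/' (the greedy ".*" reaches the final slash).
std::string strip_to_last_slash(const std::string& url) {
    static const std::regex up_to_slash(".*/");
    return std::regex_replace(url, up_to_slash, "",
                              std::regex_constants::format_first_only);
}

// Assumed variation: cut off a query string or fragment first, then
// take the last path segment.
std::string last_segment_without_query(const std::string& url) {
    // find_first_of returns npos when neither character is present,
    // in which case substr keeps the whole string.
    return strip_to_last_slash(url.substr(0, url.find_first_of("?#")));
}
```

For a fixed, simple rule like "after the last slash", the rfind/substr version above is likely the cheaper choice; the regex form is mainly useful if the rule is expected to grow.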

Parsing /etc/passwd with regex_*, unstandard behavior C++ [duplicate]

This question already has answers here:
Is gcc 4.8 or earlier buggy about regular expressions?
(3 answers)
Closed 7 years ago.
Let's assume I have this line in my /etc/passwd:
xuser01:*:111000:201:User Name, School Info, Year:/homes/pc/xu/xuser01:/bin/ksh
I read the file line by line.
From the program's parameters I get usernames/user IDs that tell me which lines to store in a variable.
Using both regex_match and regex_search I got no results, whereas on online regex testers the pattern works perfectly. Any idea why this is not working?
regExpr = "^(xuser01|xuser02)+:((.*):?)+";
if(regex_search(line, regex(regExpr)))
{
    cout << "Boom I got you!" << endl;
}
line contains the line currently being read; the program loops through the whole file and never finds the string. I tried regex_match too, with the same result.
I also tried other regular expressions, such as (xuser01|xuser02)+ and similar, designed to be an almost certain match (while still matching what I need). None of them works in my C++ program, though they do on online regex testers.
Any advice?
Thanks in advance!
It looks like the + quantifier is preventing C++ from getting your matches. I think it is redundant in your regex anyway, since each line starts with exactly one "xuser" name.
This code works alright, gets to the cout line:
string line( "xuser01:*:111000:201:User Name, School Info, Year:/homes/pc/xu/xuser01:/bin/ksh" );
regex regExpr("^(xuser01|xuser02):((.*):?)");
if(regex_search(line, regExpr))
{
    cout << "Boom I got you!" << endl;
}
However, you did not indicate what you are looking for. Currently, it will only capture 3 groups:
xuser01
*:111000:201:User Name, School Info, Year:/homes/pc/xu/xuser01:/bin/ksh
*:111000:201:User Name, School Info, Year:/homes/pc/xu/xuser01:/bin/ksh
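If the goal is to get at the individual fields, a regex may not be needed at all. As a sketch of a different technique (plain splitting on ':' with std::getline, not what the question used), something like the following might do; the names split_passwd_line and is_wanted_user are made up for the example.

```cpp
#include <algorithm>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: splits one passwd-style line into its ':' separated
// fields. Note that a trailing empty field (line ending in ':') is not
// reported by this simple loop.
std::vector<std::string> split_passwd_line(const std::string& line) {
    std::vector<std::string> fields;
    std::istringstream in(line);
    std::string field;
    while (std::getline(in, field, ':')) {
        fields.push_back(field);
    }
    return fields;
}

// Returns true if the first field (the user name) is one of `names`.
bool is_wanted_user(const std::string& line,
                    const std::vector<std::string>& names) {
    const std::vector<std::string> fields = split_passwd_line(line);
    return !fields.empty() &&
           std::find(names.begin(), names.end(), fields[0]) != names.end();
}
```

This sidesteps differences between regex implementations, at the cost of hard-coding the ':' separated layout.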

C++ \n in a string (Not working) [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
This question appears to be off-topic because it lacks sufficient information to diagnose the problem. Describe your problem in more detail or include a minimal example in the question itself.
Closed 9 years ago.
I'm learning C++. Here's my problem.
http://prntscr.com/2m5flm
I created a class that can read prop files like this one (you can set the file's beginning and ending tags with one function, and search for a specified tag with another function, which returns a string containing the result).
Here's the main.cpp
#include <iostream>
#include "m_PFile_r.h"
using namespace std;
int main(int argc, const char * argv[])
{
m_PFile_r k;
k.open("prop.arg");
k.imNotaFag(true);
k.setOpenArg("$FILE_BEGIN$");
k.setCloseArg("$FILE_END$");
string lS;
lS=k.getArg("launchSentence");
cout << lS << endl;
string menu;
menu=k.getArg("progMenu");
cout << menu;
return 0;
}
MY QUESTION IS: Why doesn't it print the \n as a line break?
Thanks :)
The file does contain real newline characters: they are what ends each line. But when you type \n into the file, it is stored not as a newline character but as the two individual characters "\" and "n". Escape sequences such as \n are only translated by the compiler inside string literals in your source code, so when you read the file, those two characters are read in just like any others.
You are overcomplicating this problem. If you would like to print out various phrases to the user, just include those phrases as string variables in your program.
std::string launchSentence = "This is the launch sentence.";
std::string progMenu = "Hit 1 For Add\nHit 2 for Subtract";
These could then be printed with the normal cout << progMenu method.
If your purpose with the text file is to keep all of the possible text strings isolated in one easy location, why not create a TextCommandPrompts.h, fill it with the string variables and include that in your main?
Edit (because I can't comment yet and I want to respond to one): I thought that whatever text editor he was writing the file in, line by line, was the source of the confusion. That is, the editor already inserts real newlines at the end of each line, and when he typed the characters '\' and 'n', nothing special happened and they stayed as regular characters.
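If the strings do have to stay in the file, one option is to translate the two-character sequence into a real newline after reading. This is only a sketch: the function name unescape_newlines is made up, and it handles \n only, not other escape sequences.

```cpp
#include <string>

// Hypothetical helper: replaces each literal backslash followed by 'n'
// with a real newline character. Everything else is copied unchanged.
std::string unescape_newlines(const std::string& raw) {
    std::string out;
    out.reserve(raw.size());
    for (std::size_t i = 0; i < raw.size(); ++i) {
        if (raw[i] == '\\' && i + 1 < raw.size() && raw[i + 1] == 'n') {
            out += '\n';
            ++i;  // skip the 'n' that was just consumed
        } else {
            out += raw[i];
        }
    }
    return out;
}
```

With that in place, the value returned for a tag could be passed through unescape_newlines before being sent to cout.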

replace string through regex using boost C++

I have a string in which tags like this occur (there are multiple such tags):
|{{nts|-2605.2348}}
I want to use boost regex to remove |{{nts| and }}, so that the whole tag shown above is replaced with
-2605.2348
in the original string.
To make it clearer:
Suppose the string is:
number is |{{nts|-2605.2348}}
I want the string to become:
number is -2605.2348
I am quite new to boost regex and have read many things online, but I have not been able to find an answer to this. Any help would be appreciated.
It really depends on how specific you want to be. Do you want to always remove exactly |{{nts|, or do you want to remove a pipe, followed by {{, followed by any number of letters, followed by a pipe? Or do you want to remove everything that isn't whitespace between the last space and the first part of the number?
One of the many ways to do this would be something like:
#include <iostream>
#include <string>
#include <boost/regex.hpp>

int main()
{
    std::string str = "number is |{{nts|-2605.2348}}";
    // pipe, then anything that cannot start a number, then the number, then }}
    boost::regex re("\\|[^-\\d.]*(-?[\\d.]*)\\}\\}");
    std::cout << regex_replace(str, re, "$1") << '\n';
}
online demo: http://liveworkspace.org/code/2B290X
However, since you're using boost, consider the much simpler and faster parsers generated by boost.spirit.
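For the most specific reading of the question (always remove exactly |{{nts| and the closing }}), a stricter pattern could look like the sketch below. It uses std::regex from the standard library rather than Boost, which is my substitution, and the function name strip_nts_tags is invented for the example.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: replaces every "|{{nts|VALUE}}" tag with VALUE.
// Only the literal tag name "nts" is accepted; VALUE is anything up to
// the first '}'.
std::string strip_nts_tags(const std::string& text) {
    static const std::regex nts_re(R"(\|\{\{nts\|([^}]*)\}\})");
    // "$1" refers to the first capture group, i.e. VALUE.
    return std::regex_replace(text, nts_re, "$1");
}
```

Because regex_replace substitutes every match by default, several tags in one string are handled in a single call.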

C++ Screen Scraping from HTML

I'm trying to extract the text "Lady Gaga Fame Monster" from the HTML below using substr and find, but I wasn't able to retrieve it.
<div class="album-name"><strong>Album</strong> > Lady Gaga Fame Monster</div>
I tried to extract the whole line first, but cout << line_found only prints up to Album, as there is a space that prevents it from going further.
When I try cout << extract_line, I see no spaces in the extracted HTML code.
I followed the tutorial at http://www.cplusplus.com/reference/string/string/substr/ and it works there, even with spaces. I'm following it closely, but my program stops extracting once it hits a space. Please help; it would be really appreciated. Thanks. I have spent 2 days on this without finding a solution.
here's the source code:
#include "parser.h"
#include <stdlib.h>
#include <iostream>
#include <fstream>
#include <string>
#include <cstring>
using namespace std;
int main() {
    string line_found, extract_line, result, finalResult="";
    int firstPosition, secondPosition, input, location;
    ifstream sourceFile ("cd1.htm"); // extracts from sourcefile
    while(!sourceFile.eof())
    {
        sourceFile >> extract_line;
        location = extract_line.find("album-name");
        // cout << extract_line;
        if (location >=0)
        {
            line_found = extract_line.substr(location);
            cout << line_found << endl;
            firstPosition= line_found.find_first_of(">");
            result = line_found.substr(firstPosition);
        }
    }
    return 0;
}
The >> operator doesn't fetch lines. It fetches whitespace-separated tokens. Use std::getline instead.
Better still, don't use string searching tools to parse HTML. It's a disaster waiting to happen. In fact, it's happening to you right now. Note that there is more than one instance of > in your line, so you will probably find the wrong one and get yourself in a complete muddle trying to skip all the ones that don't matter. You could try looking for " > ", but what if you encounter this: ...class="album-name" > <strong>..., which is perfectly valid HTML?
If the HTML is proper XHTML, use an XML parser instead. Expat, for instance, is small, fast and (relatively) simple to use. You can find a nice, easy intro here.
If the HTML is messy, you're going to struggle with C++. There's a related SO question here. Alternatively, use a language with a good HTML library such as Python (Beautiful Soup), which you can call from C++.
Another lightweight and simple option could be to use a regex. VS2010 and VS2008 (SP1 IIRC) come with the <regex> header, which should allow much more control and flexibility than your approach.
It wouldn't be as robust as Marcelo's approach but would be quicker to get started with.
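A rough sketch of what that could look like with std::regex follows. The pattern is an assumption tailored to the single sample line in the question (including an optional > or &gt; separator), and album_from_line is a made-up name.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: extracts the text after <strong>...</strong> inside
// an element with class "album-name". Returns "" when the line does not match.
std::string album_from_line(const std::string& line) {
    static const std::regex album_re(
        R"(class="album-name"[^>]*>\s*<strong>[^<]*</strong>\s*(?:>|&gt;)?\s*([^<]+))");
    std::smatch m;
    if (std::regex_search(line, m, album_re)) {
        return m[1].str();  // first capture group: the album text
    }
    return "";
}
```

Each line would still need to be read with std::getline before being passed in, since the pattern relies on the spaces being preserved.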