How can I extract html with fscanf

I have a file in which each line holds a div element:
<div style="random properties" id="keyword1:string id:int">text</div>
<div style="random properties" id="keyword1:string id:int">text</div>
<div style="random properties" id="keyword2:string id:int">text</div>
<div style="random properties" id="keyword2:string id:int">text</div>
Can I use fscanf to return a list of the text and id values for lines matching keyword1 and keyword2?

You can simply read it with regex:
std::string s;
// [^\"]*? is lazy so that (\\d+) captures the whole trailing number, not just its last digit
std::regex r( "<div style=\"[^\"]*\" id=\"[^\"]*?(\\d+)\">((?:(?!</div>).)*)</div>" );
while( std::getline(in, s) ) {
std::smatch m;
if( std::regex_match(s, m, r) ) {
std::cout << "id = " << m.str(1) << ", text = " << m.str(2) << std::endl;
} else {
std::cout << "invalid pattern" << std::endl;
}
}
To read more about std::regex, see http://en.cppreference.com/w/cpp/regex
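The loop above prints every line. To get what the question asks for (the id and text for one keyword), the matches can be collected into a list instead. The following is a minimal sketch, assuming the id attribute has the form `keyword 42` (keyword, one space, integer). The function name extractByKeyword and that id layout are assumptions made for illustration; the question does not pin down the exact format.

```cpp
#include <regex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: returns (id, text) for every line whose id attribute
// starts with the requested keyword. Assumes id="<keyword> <integer>".
std::vector<std::pair<int, std::string>>
extractByKeyword(const std::vector<std::string>& lines, const std::string& keyword)
{
    // Group 1: keyword, group 2: numeric id, group 3: text between the tags.
    static const std::regex r(
        R"re(<div style="[^"]*" id="(\w+) (\d+)">(.*?)</div>)re");

    std::vector<std::pair<int, std::string>> result;
    for (const std::string& line : lines) {
        std::smatch m;
        // regex_match requires the whole line to fit the pattern.
        if (std::regex_match(line, m, r) && m.str(1) == keyword)
            result.emplace_back(std::stoi(m.str(2)), m.str(3));
    }
    return result;
}
```

Lines that do not fit the pattern are silently skipped here; whether that or an error is the right behaviour depends on how trustworthy the input file is.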

Related

C++ How to format a text file into an html file

I am very new to C++. I have the building blocks for what I need my program to do, which is: Read a text file inputted by command line (ex. ./textToHtml.exe Alien.txt) and create an html file from the text file by implementing the correct HTML tags when necessary.
I have provided the code below, as well as the html file structure. This is for a project and I need to have the line breaks <br> after each paragraph and each blank line. I have provided the last HTML structure as what I want it to look like.
Be advised, I am sure I have some unnecessary lines or redundant code.
#include <iostream>
#include <string>
#include <fstream>
using namespace std;
int main(int argc, char *argv[])
{
std::ifstream txtFile(argv[1]);
std::string fn = argv[1];
std::string fileName = fn.substr(0, fn.size() - 4);
if (txtFile)
{
std::ofstream html(fileName + "1.txt");
if (html)
{
html << "<HTML>\n"
<< "<head>\n"
<< "<title>";
std::string line{};
if (std::getline(txtFile, line))
{
html << line << "</title>" << '\n';
}
html << "</head>\n"
<< "<body>\n";
while (std::getline(txtFile, line))
{
html << line << "<br>" << '\n';
}
html << "</body>" << '\n'
<< "</html>" << '\n';
}
}
return 0;
}
The HTML file looks like this:
<HTML>
<head>
<title>Are These Aliens Martians?</title>
</head>
<body>
<br>
<br>
an adaptation<br>
<br>
The men from Earth stared at the aliens.<br>
The little green men had pointed heads and<br>
orange toes with one long curly hair on each toe.<br>
<br>
H. G. Wells' novel The War of the Worlds (1898)<br>
has had an extraordinary influence on science fiction. <br>
Wells' Martians are a technologically advanced species<br>
with an ancient civilization. They somewhat resemble<br>
cephalopods, with large, bulky brown bodies and<br>
sixteen snake-like tentacles, in two groups of eight,<br>
around a quivering V-shaped mouth; they move around in<br>
100 feet tall tripod fighting-machines they assemble<br>
upon landing, killing everything in their path.<br>
<br>
<br>
<br>
by your name<br>
<br>
</body>
</html>
What I need the HTML file to look like:
<HTML>
<head>
<title>Are These Aliens Martians?</title> //my output has <br> here
<br>
<br>
an adaptation<br>
<br>
The men from Earth stared at the aliens.
The little green men had pointed heads and
orange toes with one long curly hair on each toe.<br> //Just need a <br> at the end of each paragraph
<br>
H. G. Wells' novel The War of the Worlds (1898)
has had an extraordinary influence on science fiction.
Wells' Martians are a technologically advanced species
with an ancient civilization. They somewhat resemble
cephalopods, with large, bulky brown bodies and
sixteen snake-like tentacles, in two groups of eight,
around a quivering V-shaped mouth; they move around in
100 feet tall tripod fighting-machines they assemble
upon landing, killing everything in their path.<br> //Again just one <br> here not one each line
<br>
<br>
<br>
by your name<br>
<br>
</body>
</html>

Try something more like this:
#include <iostream>
#include <string>
#include <fstream>
void htmlencode(std::string &s)
{
std::string::size_type pos = 0;
while ((pos = s.find_first_of("<>&", pos)) != std::string::npos)
{
std::string replacement;
switch (s[pos])
{
case '<':
replacement = "&lt;";
break;
case '>':
replacement = "&gt;";
break;
case '&':
replacement = "&amp;";
break;
}
s.replace(pos, 1, replacement);
pos += replacement.size();
}
}
int main(int argc, char* argv[])
{
std::string fn = argv[1];
std::ifstream txtFile(fn);
if (txtFile)
{
std::string fileName = fn.substr(0, fn.rfind('.'));
std::ofstream html(fileName + "1.html");
if (html)
{
html << "<HTML>\n"
<< "<head>\n"
<< "<title>";
std::string line;
if (std::getline(txtFile, line))
{
htmlencode(line);
html << line;
}
html << "</title>\n"
<< "</head>\n"
<< "<body>\n";
bool lastLineNotEmpty = false;
while (std::getline(txtFile, line))
{
if (line.empty())
{
if (lastLineNotEmpty)
html << "<br>\n";
html << "<br>\n";
lastLineNotEmpty = false;
}
else
{
if (lastLineNotEmpty)
html << '\n';
htmlencode(line);
html << line;
lastLineNotEmpty = true;
}
}
if (lastLineNotEmpty)
html << "<br>\n";
html << "</body>\n"
<< "</html>\n";
}
}
return 0;
}
However, HTML has <p></p> tags that are specifically designed for paragraphs, so you should consider using them instead of <br>, e.g.:
#include <iostream>
#include <string>
#include <fstream>
void htmlencode(std::string &s)
{
std::string::size_type pos = 0;
while ((pos = s.find_first_of("<>&", pos)) != std::string::npos)
{
std::string replacement;
switch (s[pos])
{
case '<':
replacement = "&lt;";
break;
case '>':
replacement = "&gt;";
break;
case '&':
replacement = "&amp;";
break;
}
s.replace(pos, 1, replacement);
pos += replacement.size();
}
}
int main(int argc, char* argv[])
{
std::string fn = argv[1];
std::ifstream txtFile(fn);
if (txtFile)
{
std::string fileName = fn.substr(0, fn.rfind('.'));
std::ofstream html(fileName + "1.html");
if (html)
{
html << "<HTML>\n"
<< "<head>\n"
<< "<title>";
std::string line;
if (std::getline(txtFile, line))
{
htmlencode(line);
html << line;
}
html << "</title>\n"
<< "</head>\n"
<< "<body>\n";
bool inParagraph = false;
while (std::getline(txtFile, line))
{
if (line.empty())
{
if (inParagraph)
{
inParagraph = false;
html << "</p>\n";
}
}
else
{
if (!inParagraph)
{
inParagraph = true;
html << "<p>\n";
}
htmlencode(line);
html << line << '\n';
}
}
if (inParagraph)
html << "</p>\n";
html << "</body>\n"
<< "</html>\n";
}
}
return 0;
}

Replace a string to another string using C++

The problem is that I don't know the length of the input string.
My function only works when the input string is exactly "yyyy". One solution I can think of is to first reduce the input string to "yyyy" and then use my function to do the rest.
Here's my function:
void findAndReplaceAll(std::string & data, std::string toSearch, std::string replaceStr)
{
// Get the first occurrence
size_t pos = data.find(toSearch);
// Repeat till end is reached
while( pos != std::string::npos)
{
// Replace this occurrence of Sub String
data.replace(pos, toSearch.size(), replaceStr);
// Get the next occurrence from the current position
pos = data.find(toSearch, pos + replaceStr.size());
}
}
My main function
std::string format = "yyyyyyyyyydddd";
findAndReplaceAll(format, "yyyy", "%Y");
findAndReplaceAll(format, "dd", "%d");
My expected output should be :
%Y%d

Use regular expressions.
Example:
#include <iostream>
#include <string>
#include <regex>
int main(){
std::string text = "yyyyyy";
std::string sentence = "This is a yyyyyyyyyyyy.";
std::cout << "Text: " << text << std::endl;
std::cout << "Sentence: " << sentence << std::endl;
// Regex
std::regex y_re("y+"); // matches a run of one or more 'y' characters
// replacing
std::string r1 = std::regex_replace(text, y_re, "%y"); // using lowercase
std::string r2 = std::regex_replace(sentence, y_re, "%Y"); // using uppercase
// showing result
std::cout << "Text replace: " << r1 << std::endl;
std::cout << "Sentence replace: " << r2 << std::endl;
return 0;
}
Output:
Text: yyyyyy
Sentence: This is a yyyyyyyyyyyy.
Text replace: %y
Sentence replace: This is a %Y.
If you want to make it even better you can use:
// Regex
std::regex y_re("[yY]+");
That will match any mix of lowercase and uppercase for any number of 'y's.
Example output with that Regex:
Sentence: This is a yYyyyYYYYyyy.
Sentence replace: This is a %Y.
This is just a simple example of what you can do with regex. I'd recommend reading up on the topic itself; there is plenty of info here on SO and on other sites.
Extra:
If you want to inspect the match before replacing, so that the replacement depends on it, you can do something like:
// Regex
std::string text = "yyaaaa";
std::cout << "Text: " << text << std::endl;
std::regex y_re("y+"); // matches a run of one or more 'y' characters
std::string output = "";
std::smatch ymatches;
if (std::regex_search(text, ymatches, y_re)) {
if (ymatches[0].length() == 2 ) {
output = std::regex_replace(text, y_re, "%y");
} else {
output = std::regex_replace(text, y_re, "%Y");
}
}
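For the format string in the question, two chained regex_replace calls are enough, because each run of a letter collapses to a single specifier regardless of its length. This is a minimal sketch; the function name toStrftimeFormat is made up for illustration, and it assumes that any run of y should become %Y and any run of d should become %d.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: collapses every run of 'y' to "%Y" and every run of 'd' to "%d".
std::string toStrftimeFormat(const std::string& format)
{
    static const std::regex years("y+");
    static const std::regex days("d+");
    // "%Y" contains no 'd', so the second pass cannot touch the first pass's output.
    std::string out = std::regex_replace(format, years, "%Y");
    return std::regex_replace(out, days, "%d");
}
```

Calling toStrftimeFormat("yyyyyyyyyydddd") yields %Y%d, which is the output the question expects.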

parse inFile line by line while using regex to execute function

I sort of painted myself into a corner and need some guidance. I'm doing some parsing using regex while reading from an ifstream. What I want to do is:
while(getLine(inFile, str)) {
search(str) for regex match
if(match)
str = my_function(str)
outFile << str
else
outFile unchanged str
}
As it stands, I'm doing this:
while(getLine(inFile, str)) {
auto des = std::sregex_iterator(str.cbegin(), str.cend(), dest);
auto reg_it = std::sregex_iterator();
std::for_each(des , reg_it, [](std::smatch const& m){
str = my_function(str)
outFile << str
});
}
Unfortunately, this doesn't let me edit the file and write it back out in order, as the way I'm doing it only gives me access to the returned matches.
Any guidance would be greatly appreciated!

This works:
std::smatch matches;
if (std::regex_match(str.cbegin(), str.cend(), matches, my_regex)) {
std::string baz;
baz = my_function(str); // we have a match
outFile << baz;
}
else outFile << str << std::endl;
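Put together with the read loop, the whole filter might look like the sketch below. It is written against generic streams so it can be tried without files. The names transformMatchingLines and transform (standing in for my_function) are illustrative, and the sketch uses regex_search, so a match anywhere in the line counts; regex_match would instead require the whole line to match.

```cpp
#include <functional>
#include <regex>
#include <sstream> // std::istream / std::ostream, and string streams for trying it out
#include <string>

// Copies in to out line by line, in order. Lines containing a match for re are
// passed through transform first; all other lines are written unchanged.
void transformMatchingLines(std::istream& in, std::ostream& out,
                            const std::regex& re,
                            const std::function<std::string(const std::string&)>& transform)
{
    std::string line;
    while (std::getline(in, line)) {
        if (std::regex_search(line, re))
            out << transform(line) << '\n';
        else
            out << line << '\n';
    }
}
```

Because every line is written exactly once, in the order it was read, the output keeps the structure of the input file.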

Would this work?
while (std::getline(inFile, str)) {
auto des = std::sregex_iterator(str.cbegin(), str.cend(), dest);
auto reg_it = std::sregex_iterator();
int matches = 0;
// Capture by reference so the lambda can reach str, outFile and matches.
// str is not modified here because des still holds iterators into it.
std::for_each(des, reg_it, [&](std::smatch const&) {
outFile << my_function(str);
matches++;
});
if (matches == 0) {
outFile << str;
}
}

How to match multiple results using std::regex

For example, if I have a string like "first second third forth", I want to match every single word in one operation and output them one by one.
I just thought that "(\\b\\S*\\b){0,}" would work. But actually it did not.
What should I do?
Here's my code:
#include<iostream>
#include<string>
#include<regex>
using namespace std;
int main()
{
regex exp("(\\b\\S*\\b)");
smatch res;
string str = "first second third forth";
regex_search(str, res, exp);
cout << res[0] <<" "<<res[1]<<" "<<res[2]<<" "<<res[3]<< endl;
}

Simply iterate over your string while regex_searching, like this:
{
regex exp("(\\b\\S*\\b)");
smatch res;
string str = "first second third forth";
string::const_iterator searchStart( str.cbegin() );
while ( regex_search( searchStart, str.cend(), res, exp ) )
{
cout << ( searchStart == str.cbegin() ? "" : " " ) << res[0];
searchStart = res.suffix().first;
}
cout << endl;
}
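The same loop can also be written with std::sregex_iterator, which advances through the matches itself. This is a minimal sketch, assuming words are simply runs of non-whitespace (\S+ rather than the \b\S*\b from the question); splitWords is an illustrative name.

```cpp
#include <regex>
#include <string>
#include <vector>

// Returns every run of non-whitespace characters in text, in order.
std::vector<std::string> splitWords(const std::string& text)
{
    static const std::regex word(R"(\S+)");
    std::vector<std::string> words;
    // A default-constructed sregex_iterator is the end-of-sequence marker.
    for (std::sregex_iterator it(text.begin(), text.end(), word), end; it != end; ++it)
        words.push_back(it->str());
    return words;
}
```

Using \S+ instead of \S* means the pattern can never match an empty string, so each match consumes at least one character.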

This can be done in regex of C++11.
Two methods:
You can use () in a regex to define your captures (sub-expressions).
Like this:
string var = "first second third forth";
const regex r("(.*) (.*) (.*) (.*)");
smatch sm;
if (regex_search(var, sm, r)) {
for (int i=1; i<sm.size(); i++) {
cout << sm[i] << endl;
}
}
See it live: http://coliru.stacked-crooked.com/a/e1447c4cff9ea3e7
You can use sregex_token_iterator():
string var = "first second third forth";
regex wsaq_re("\\s+");
copy( sregex_token_iterator(var.begin(), var.end(), wsaq_re, -1),
sregex_token_iterator(),
ostream_iterator<string>(cout, "\n"));
See it live: http://coliru.stacked-crooked.com/a/677aa6f0bb0612f0

sregex_token_iterator appears to be the ideal, efficient solution, but the example given in the selected answer leaves much to be desired. Instead, I found some great examples here:
http://www.cplusplus.com/reference/regex/regex_token_iterator/regex_token_iterator/
For your convenience, I've copy-pasted the sample code shown by that page. I claim no credit for the code.
// regex_token_iterator example
#include <iostream>
#include <string>
#include <regex>
int main ()
{
std::string s ("this subject has a submarine as a subsequence");
std::regex e ("\\b(sub)([^ ]*)"); // matches words beginning by "sub"
// default constructor = end-of-sequence:
std::regex_token_iterator<std::string::iterator> rend;
std::cout << "entire matches:";
std::regex_token_iterator<std::string::iterator> a ( s.begin(), s.end(), e );
while (a!=rend) std::cout << " [" << *a++ << "]";
std::cout << std::endl;
std::cout << "2nd submatches:";
std::regex_token_iterator<std::string::iterator> b ( s.begin(), s.end(), e, 2 );
while (b!=rend) std::cout << " [" << *b++ << "]";
std::cout << std::endl;
std::cout << "1st and 2nd submatches:";
int submatches[] = { 1, 2 };
std::regex_token_iterator<std::string::iterator> c ( s.begin(), s.end(), e, submatches );
while (c!=rend) std::cout << " [" << *c++ << "]";
std::cout << std::endl;
std::cout << "matches as splitters:";
std::regex_token_iterator<std::string::iterator> d ( s.begin(), s.end(), e, -1 );
while (d!=rend) std::cout << " [" << *d++ << "]";
std::cout << std::endl;
return 0;
}
Output:
entire matches: [subject] [submarine] [subsequence]
2nd submatches: [ject] [marine] [sequence]
1st and 2nd submatches: [sub] [ject] [sub] [marine] [sub] [sequence]
matches as splitters: [this ] [ has a ] [ as a ]

You could use the suffix() function, and search again until you don't find a match:
int main()
{
regex exp("(\\b\\S*\\b)");
smatch res;
string str = "first second third forth";
while (regex_search(str, res, exp)) {
cout << res[0] << endl;
str = res.suffix();
}
}

My code will capture all groups in all matches:
vector<vector<string>> U::String::findEx(const string& s, const string& reg_ex, bool case_sensitive)
{
// icase must be set only when the search is NOT case sensitive
regex rx(reg_ex, case_sensitive ? regex_constants::ECMAScript
: regex_constants::ECMAScript | regex_constants::icase);
vector<vector<string>> captured_groups;
vector<string> captured_subgroups;
const std::sregex_token_iterator end_i;
for (std::sregex_token_iterator i(s.cbegin(), s.cend(), rx);
i != end_i;
++i)
{
captured_subgroups.clear();
string group = *i;
smatch res;
// re-run the regex on each full match to recover its capture groups
if(regex_search(group, res, rx))
{
for(size_t j=0; j<res.size() ; j++)
captured_subgroups.push_back(res[j]);
if(captured_subgroups.size() > 0)
captured_groups.push_back(captured_subgroups);
}
}
return captured_groups;
}

My reading of the documentation is that regex_search searches for the first match and that none of the functions in std::regex do a "scan" of the kind you are looking for. However, the Boost library seems to support this, as described in "C++ tokenize a string using a regular expression".

Simple parser and generator

I need to parse and generate some texts from and to c++ objects.
The syntax is:
command #param #param #param
There is a set of commands; some of them have no params.
Params are mainly numbers.
The question is: should I use Boost Spirit for this task? Or should I simply tokenize each line, decide which function to call by comparing the first token with the command names, read the additional parameters, and create the C++ object from them?
If you suggest using Spirit or any other solution, it would be nice if you could provide some examples similar to my problem. I've read and tried all the examples from the Boost Spirit docs.

I implemented more or less precisely this in a previous answer to the question " Using boost::bind with boost::function: retrieve binded variable type ".
The complete working sample program (which expects a very similar grammar) using Boost Spirit is here: https://gist.github.com/1314900. You'd just want to remove the /execute literals for your grammar, so edit Line 41 from
if (!phrase_parse(f,l, "/execute" > (
to
if (!phrase_parse(f,l, (
The example script
WriteLine "bogus"
Write "here comes the answer: "
Write 42
Write 31415e-4
Write "that is the inverse of" 24 "and answers nothing"
Shutdown "Bye" 9
Shutdown "Test default value for retval"
Now results in the following output after execution:
WriteLine('bogus');
Write(string: 'here comes the answer: ');
Write(double: 42);
Write(double: 3.1415);
Write(string: 'that is the inverse of');
Write(double: 24);
Write(string: 'and answers nothing');
Shutdown(reason: 'Bye', retval: 9)
Shutdown(reason: 'Test default value for retval', retval: 0)
Full Code
For archival purposes:
#include <boost/spirit/include/qi.hpp>
#include <boost/spirit/include/phoenix.hpp>
#include <algorithm>
#include <fstream>
#include <iostream>
#include <string>
namespace qi = boost::spirit::qi;
namespace phx = boost::phoenix;
///////////////////////////////////
// 'domain classes' (scriptables)
struct Echo
{
void WriteLine(const std::string& s) { std::cout << "WriteLine('" << s << "');" << std::endl; }
void WriteStr (const std::string& s) { std::cout << "Write(string: '" << s << "');" << std::endl; }
void WriteInt (int i) { std::cout << "Write(int: " << i << ");" << std::endl; }
void WriteDbl (double d) { std::cout << "Write(double: " << d << ");" << std::endl; }
void NewLine () { std::cout << "NewLine();" << std::endl; }
} echoService;
struct Admin
{
void Shutdown(const std::string& reason, int retval)
{
std::cout << "Shutdown(reason: '" << reason << "', retval: " << retval << ")" << std::endl;
// exit(retval);
}
} adminService;
void execute(const std::string& command)
{
typedef std::string::const_iterator It;
It f(command.begin()), l(command.end());
using namespace qi;
using phx::bind;
using phx::ref;
rule<It, std::string(), space_type> stringlit = lexeme[ '"' >> *~char_('"') >> '"' ];
try
{
if (!phrase_parse(f,l, /*"/execute" >*/ (
(lit("WriteLine")
> stringlit [ bind(&Echo::WriteLine, ref(echoService), _1) ])
| (lit("Write") >> +(
double_ [ bind(&Echo::WriteDbl, ref(echoService), _1) ] // the order matters
| int_ [ bind(&Echo::WriteInt, ref(echoService), _1) ]
| stringlit [ bind(&Echo::WriteStr, ref(echoService), _1) ]
))
| (lit("NewLine") [ bind(&Echo::NewLine, ref(echoService)) ])
| (lit("Shutdown") > (stringlit > (int_ | attr(0)))
[ bind(&Admin::Shutdown, ref(adminService), _1, _2) ])
), space))
{
if (f!=l) // allow whitespace only lines
std::cerr << "** (error interpreting command: " << command << ")" << std::endl;
}
}
catch (const expectation_failure<It>& e)
{
std::cerr << "** (unexpected input '" << std::string(e.first, std::min(e.first+10, e.last)) << "') " << std::endl;
}
if (f!=l)
std::cerr << "** (warning: skipping unhandled input '" << std::string(f,l) << "')" << std::endl;
}
int main()
{
std::ifstream ifs("input.txt");
std::string command;
while (std::getline(ifs/*std::cin*/, command))
execute(command);
}

For simple formatted, easily tested input, tokenizing should be enough.
When tokenizing, you can read a line from the input and put it in a stringstream (iss). From iss, you read the first word and pass it to a command factory, which creates the right command for you. Then you can pass iss to the readParameters function of the new command, so each command can parse its own parameters and check whether all of them are valid.
Untested code sample:
std::string line;
std::getline(inputStream, line);
std::istringstream iss(line);
std::string strCmd;
iss >> strCmd;
try
{
std::unique_ptr<Cmd> newCmd = myCmdFactory(strCmd);
newCmd->readParameters(iss);
newCmd->execute();
//...
}
catch (std::exception& e)
{
std::cout << "Issue with received command: " << e.what() << "\n";
}
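Because the question also asks for the reverse direction, here is a small sketch of both parsing and generating for the command-plus-parameters syntax. It assumes parameters are whitespace-separated numbers with no literal # in the text. ParsedCommand, parseCommand and formatCommand are illustrative names, not part of any library.

```cpp
#include <cstddef>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical in-memory form of one line: a command name plus numeric parameters.
struct ParsedCommand {
    std::string name;
    std::vector<double> params;
};

// Parses "name p1 p2 ...". Throws std::invalid_argument on an empty line or on a
// parameter that is not entirely a number (std::stod may also throw std::out_of_range).
ParsedCommand parseCommand(const std::string& line)
{
    std::istringstream iss(line);
    ParsedCommand cmd;
    if (!(iss >> cmd.name))
        throw std::invalid_argument("empty command line");

    std::string token;
    while (iss >> token) {
        std::size_t used = 0;
        double value = std::stod(token, &used); // throws if no number can be read
        if (used != token.size())               // trailing junk such as "1x"
            throw std::invalid_argument("bad parameter: " + token);
        cmd.params.push_back(value);
    }
    return cmd;
}

// Generates the text form again: the name followed by space-separated parameters.
std::string formatCommand(const ParsedCommand& cmd)
{
    std::ostringstream oss;
    oss << cmd.name;
    for (double p : cmd.params)
        oss << ' ' << p;
    return oss.str();
}
```

A command factory like the one described above could take ParsedCommand::name as its key and hand the params vector to the concrete command, which keeps the tokenizing step separate from per-command validation.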
