I have a text file with an index of students that looks something like this:
Anna Baker
Class 1B
Long description text about the student lorem ipsum dolor sit amet, consetetur sadipscing elitr.
Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt.
Lorem ipsum dolor sit amet.
Lorem ipsum dolor sit amet.
Lorem ipsum dolor sit amet, consetetur sadipscing elitr.
#####
Rick Bell
Class 2A
Long description text about the student lorem ipsum dolor sit amet.
At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet.
Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor.
#####
etc.
I have a class Student and I need to extract the information from the text file and put it in student objects.
class Student {
private:
    string name;
    string className;   // `class` is a reserved keyword in C++ and cannot be a member name
    string description;
};
Name and class worked fine so far but I'm struggling with extracting the description text. "#####" serves as a delimiter. I use:
while (???){
getline(inFile, word3);
word3=word3.substr(0,word3.find(delimiter));
}
I need a while loop to read all the lines up to the delimiter and I can't find the right statement for it. Can you help me?
A quick look at cppreference suggests that std::string::find returns std::string::npos when the substring is not found. So you can use something like
bool keepReading = true;   // `continue` is a reserved keyword, so the flag needs another name
while (keepReading && getline(inFile, word3))   // also stop at end of file
{
    keepReading = word3.find(delimiter) == std::string::npos;
    // do something else
}
You can use ifstream to read the file:
std::ifstream student_file("file.txt");
You can then read the file line by line:
string info;
if (student_file.is_open()) {
    while (getline(student_file, info)) {
        if (info == "#####")   // the delimiter in the file has five '#' characters
            continue;          // end of the current student record
        else {
            // this is student info
        }
    }
}
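The loop logic from the answers above (read lines until the "#####" delimiter, then start a new record) can be sketched in a few lines. This is only an illustration in Python, not the C++ code from the question; the function name read_students and the shortened sample text are hypothetical, and it assumes every record starts with a name line followed by a class line.

```python
import io

DELIMITER = "#####"

def read_students(lines):
    """Yield (name, class_name, description) tuples from delimiter-separated records."""
    record = []
    for raw in lines:
        line = raw.rstrip("\n")
        if line.strip() == DELIMITER:
            # Delimiter reached: the collected lines form one student record.
            if len(record) >= 2:
                yield record[0], record[1], "\n".join(record[2:])
            record = []
        else:
            record.append(line)
    # A final record without a trailing delimiter is still emitted.
    if len(record) >= 2:
        yield record[0], record[1], "\n".join(record[2:])

# Shortened, made-up sample in the same layout as the file in the question.
sample = io.StringIO(
    "Anna Baker\nClass 1B\nFirst line.\nSecond line.\n#####\n"
    "Rick Bell\nClass 2A\nOnly line.\n#####\n"
)
students = list(read_students(sample))
print(students)
```

The description keeps its line breaks; join with a space instead of "\n" if a single-line description is preferred.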
I have a text file with a lot of content. I want to extract the following text fragments beginning with TXT_ and ending with a ).
e.g.
Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. (TXT_I_WANT_TO_EXTRACT_THIS). At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet. (TXT_AND_THIS) Lorem ipsum dolor sit amet.
Expected result:
TXT_I_WANT_TO_EXTRACT_THIS
TXT_AND_THIS
I just need the regex for the result.
Thank you so much for your help.
Greetings
Not sure in which language you are working. In R, for example, you can use str_extract_all:
str_extract_all(txt, "TXT\\w+")
[[1]]
[1] "TXT_I_WANT_TO_EXTRACT_THIS" "TXT_AND_THIS"
Even if you don't work in R, the pattern will not change much, and it is simple: assuming that all target strings start with the same literal prefix, here "TXT", and otherwise contain only letters, digits, and the underscore, the part after "TXT" can be matched by \\w+, the character class for alphanumeric characters and the underscore. The backslash is doubled only because R string literals require it; the regex itself is TXT\w+.
txt <- "Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. (TXT_I_WANT_TO_EXTRACT_THIS). At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet. (TXT_AND_THIS) Lorem ipsum dolor sit amet."
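For readers not using R, the same idea might look like this in Python. This is a sketch under the assumption that each target is wrapped in parentheses, as in the sample; the sample text is shortened and the variable names are made up for illustration.

```python
import re

# Shortened version of the sample text from the question.
text = (
    "sed diam voluptua. (TXT_I_WANT_TO_EXTRACT_THIS). At vero eos et accusam. "
    "(TXT_AND_THIS) Lorem ipsum dolor sit amet."
)

# Capture TXT_ followed by word characters, but only between literal parentheses.
matches = re.findall(r"\((TXT_\w+)\)", text)
print(matches)
```

Requiring the surrounding parentheses makes the match stricter than TXT\w+ alone; drop them from the pattern if the fragments can also appear without parentheses.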
Extracting from text
For example, the following sentence contains items that each begin with a capital letter. How can I separate them?
Text:
A. lorem ipsum dolor sit B . 41dipiscing elit sedC. lorem ipsum dolor
sit amet D. 35 Consectetur adipiscing E .Sed do eiusmod tempor
Goal:
A. lorem ipsum dolor sit
B . 41dipiscing elit sed
C. lorem ipsum dolor sit amet
D. 35 Consectetur adipiscing
E .Sed do eiusmod tempor
What have I done?
^(([a-zA-Z]{1}|[0-9]+)\s*[.,]{1})(.*)$
Result:
https://regex101.com/r/4HB0oD/1
But my regex only detects the first item and nothing after it. What is the reason for this?
Maybe,
(?=[A-Z]\s*\.)
might work OK.
Test
import re
string = '''
A. lorem ipsum dolor sit B . 41dipiscing elit sedC. lorem ipsum dolor sit amet D. 35 Consectetur adipiscing E .Sed do eiusmod tempor
'''
print(re.sub(r'(?=[A-Z]\s*\.)', '\n', string))
Output
A. lorem ipsum dolor sit
B . 41dipiscing elit sed
C. lorem ipsum dolor sit amet
D. 35 Consectetur adipiscing
E .Sed do eiusmod tempor
If you wish to simplify, update, or explore the expression, it is explained in the top right panel of regex101.com. You can watch the matching steps or modify them in this debugger link, if you are interested. The debugger demonstrates how a regex engine might consume some sample input strings step by step and perform the matching process.
This pattern should do what you're looking for:
[A-Z\d] ?\..+?(?=$|[A-Z\d] ?\.)
https://regex101.com/r/i92QR1/1
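A quick way to try this pattern outside regex101 is a short Python check. It is only a sketch: it assumes the text is on a single line (the pattern's . does not cross newlines), and the strip() call is an addition here to drop the trailing space each match keeps.

```python
import re

# The sample text from the question, joined onto one line.
text = (
    "A. lorem ipsum dolor sit B . 41dipiscing elit sedC. lorem ipsum dolor "
    "sit amet D. 35 Consectetur adipiscing E .Sed do eiusmod tempor"
)

# Each match starts at "<capital or digit>[space]." and extends lazily
# until the next such marker or the end of the string.
pattern = re.compile(r"[A-Z\d] ?\..+?(?=$|[A-Z\d] ?\.)")
items = [m.strip() for m in pattern.findall(text)]
for item in items:
    print(item)
```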
Let's suppose we have a paragraph like this:
Lorem ipsum, sit amet consectetur adipiscing elit. Lorem - ipsum, sit
amet. Morbi a suscipit sem, quis finibus turpis. Lorem ipsum: sit
amet. Proin suscipit ac arcu pharetra tincidunt. Lorem ipsum. sit
amet. Pellentesque eu lacinia metus. sit amet: Lorem ipsum. Lorem
turpis ipsum, sit amet.
I need a case-insensitive PCRE regex pattern that only selects the words
1 lorem
2 ipsum
3 sit
4 amet
in that specific order, ignoring punctuation and ignoring occurrences like
Sit amet lorem ipsum
Lorem turpis ipsum, sit amet
A simple, straightforward approach with specific punctuation characters. You can add any other punctuation character inside the []:
([Ll]orem)[\s,.!:\-()?]+(ipsum)[\s,.!:\-()?]+(sit)[\s,.!:\-()?]+(amet)
or allow any run of non-word characters (anything outside [A-Za-z0-9_], which already includes whitespace) between the words:
([Ll]orem)[\s\W]+(ipsum)[\s\W]+(sit)[\s\W]+(amet)
Case-insensitive matching is usually a flag you can switch on, depending on the programming language. Otherwise you have to add every relevant variation manually, like ([Ll]orem). Note that a | inside a character class is a literal pipe, not an alternation.
Regex101 Example
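To check the second variant against the paragraph from the question, a small Python test could look like the following. It is a sketch, not a PCRE session: it uses Python's re module with the IGNORECASE flag instead of spelling out [Ll], shortens [\s\W]+ to the equivalent \W+, and adds \b word boundaries as an extra safeguard that is not part of the answer above.

```python
import re

paragraph = (
    "Lorem ipsum, sit amet consectetur adipiscing elit. Lorem - ipsum, sit\n"
    "amet. Morbi a suscipit sem, quis finibus turpis. Lorem ipsum: sit\n"
    "amet. Proin suscipit ac arcu pharetra tincidunt. Lorem ipsum. sit\n"
    "amet. Pellentesque eu lacinia metus. sit amet: Lorem ipsum. Lorem\n"
    "turpis ipsum, sit amet."
)

# Four words in fixed order, separated only by non-word characters.
ordered = re.compile(r"\b(lorem)\W+(ipsum)\W+(sit)\W+(amet)\b", re.IGNORECASE)
hits = [m.group(0) for m in ordered.finditer(paragraph)]
print(len(hits))
```

The sequences "sit amet: Lorem ipsum" and "Lorem turpis ipsum, sit amet" are not matched, because the words are out of order or interrupted by another word.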
Below is my test string:
Object: TLE-234DSDSDS324-234SDF324ER
Page location: SDEWRSD3242SD-234/324/234 (1)
org-chart Lorem ipsum dolor consectetur adipiscing # Colorado
234DSDSDS324-32-4/2/7-page2 (2) loc log Apr 18 21:42:49 2017 1
Page information: 3.32.232.212.23, Error: fatal, Technique: color
Comments: Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.
Validation status: Lorem ipsums dolors sits amets, consectetur adipiscing elit
Positive control-export: Validated
Page location: SDEWRSD3242SD-SDF/234/324 (5)
org-chart Lorem ipsum dolor consectetur adipiscin # Arizona
234DSDSDS324-23-11/1/0-page1 (1) loc log Apr 18 21:42:49 2017 1
Page information: 3.32.232.212.23, Error: log, Technique: color
Comments: Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.
Validation status: Lorem ipsums dolors sits amets, consectetur adipiscing elit
Positive control-export: Validated
I need to capture strings after the "Page location: ", "Object: " and "Comments: "
For example:
Object: TLE-234DSDSDS324-234SDF324ER - Group 1
Page location: SDEWRSD3242SD-234/324/234 (1) - Group 2
Page location: SDEWRSD3242SD-SDF/234/324 (5) - Group 3
Comments: Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. - Group 4
Comments: Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. - Group 5
Here is my regex URL.
I am able to capture the strings, but the regex won't capture them if any one of the strings is repeated.
(See comments below the question for the problem description.)
The data is in a multi-line string, with multiple sections starting with Object:. Within each section there are multiple lines starting with the phrases Page location: and Comments:. The rest of the line after each of these needs to be captured, and everything organized by Object.
Instead of attempting a tortured multi-line "single" regex, break the string into lines and process section by section. This way the problem becomes a very simple one.
The results are stored in an array of hashrefs; each has for keys the shown phrases. Since they can appear more than once per section their values are arrayrefs (with what follows them on the line).
use warnings;
use strict;
use feature 'say';
my $input_string = '...';
my @lines = split /\n/, $input_string;
my $patt = qr/Object|Page location|Comments/;
my @sections;
for (@lines)
{
    next if not /^\s*($patt):\s*(.*)/;
    push @sections, {} if $1 eq 'Object';
    push @{ $sections[-1]->{$1} }, $2;
}
foreach my $sec (@sections) {
    foreach my $key (sort keys %$sec) {
        say "$key:";
        say "\t$_" for @{$sec->{$key}};
    }
}
With the input string copied (suppressed above for brevity), the output is (keys appear in sorted order)
Comments:
	Lorem ipsum dolor sit amet, [...]
	Lorem ipsum dolor sit amet, [...]
Object:
	TLE-234DSDSDS324-234SDF324ER
Page location:
	SDEWRSD3242SD-234/324/234 (1)
	SDEWRSD3242SD-SDF/234/324 (5)
A few comments.
Once the Object line is found we add a new hashref to @sections. Then the matched phrase is used as a key and the rest of its line is added to that key's arrayref value. This is done for the current (so last) element of @sections.
This adds an empty string if a phrase had nothing following it. To disallow that, add next if not $2; after the match.
Note. An easy and common way to print complex data structures is via the core module Data::Dumper. But also see Data::Dump for a much more compact printout.
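The same line-by-line grouping can be sketched in Python for comparison. This is an illustration under assumptions, not a port of the exact program: the function name group_sections and the shortened sample values are hypothetical, and labelled lines that appear before the first Object line are simply ignored.

```python
import re
from collections import defaultdict

# A labelled line: one of the three phrases, a colon, then the rest of the line.
LINE_RE = re.compile(r"^\s*(Object|Page location|Comments):\s*(.*)")

def group_sections(text):
    """Return one dict per Object section, mapping each phrase to a list of values."""
    sections = []
    for line in text.splitlines():
        m = LINE_RE.match(line)
        if not m:
            continue  # unlabelled lines are skipped
        key, value = m.groups()
        if key == "Object":
            sections.append(defaultdict(list))  # a new section starts here
        if not sections:
            continue  # labelled line before any Object line
        sections[-1][key].append(value)
    return [dict(s) for s in sections]

# Shortened, made-up sample in the same layout as the test string.
sample = """Object: TLE-1
Page location: LOC-A (1)
org-chart ignored line
Comments: first comment
Page location: LOC-B (5)
Comments: second comment
"""
result = group_sections(sample)
print(result)
```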
I have this
The body:
<body><p>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Praesent leo leo, ultrices eu venenatis et, rutrum fringilla dolor.</p></body>
The code:
HtmlNode body = doc.DocumentNode.SelectSingleNode("//body");
Dictionary<HtmlNode, HtmlNode> toReplace = new Dictionary<HtmlNode, HtmlNode>();
// I do some logic here adding nodes to the toReplace dictionary.
foreach (HtmlNode replaceNode in toReplace.Keys)
{
    replaceNode.ParentNode.ReplaceChild(toReplace[replaceNode], replaceNode);
}
After I do this, the InnerHtml of the body node remains the same as at the beginning, although OuterHtml and InnerText show the correct result. Is there something wrong with my code?
The result:
// body.InnerHtml
<p>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Praesent leo leo, ultrices eu venenatis et, rutrum fringilla dolor.</p>
// body.OuterHtml
<body><p>Lorem ipsum dolor sit amet...</p></body>
I think it may be something to do with the way you are adding nodes to replace old nodes. See if this solution works for you to truncate the text node. I did a quick test and all three gave me the same results.
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(htmlString);
HtmlNode body = doc.DocumentNode.SelectSingleNode("//body");
foreach (var paragraph in body.Descendants("p"))
{
paragraph.InnerHtml = paragraph.InnerHtml.Substring(0, 25) + "...";
}
Console.WriteLine(body.InnerHtml);
Console.WriteLine(body.InnerText);
Console.WriteLine(body.OuterHtml);