In the program I'm working on, I'm attempting to read in a txt file and then print the bits of text contained in pairs of quotation marks.
Assuming I've read the txt file into an array with each line as an array element, this is what I assumed would work, but alas, no luck:
txt file contents:
Lorem ipsum dolor sit amet
consectetur "adipisicing elit"
sed "do" eiusmod tempor incididunt
ut "labore et dolore" magna aliqua
CODE:
foreach (@arr)
{
    print $1 if /("*")/g;
}
Output:
""
...
foreach (@arr) {
    print "$_\n" for /(".*?")/g;
}
...
#!/usr/bin/perl
use strict;
use warnings;
foreach (<DATA>) {
    print "$1\n" if /(".*")/;
}
__DATA__
txt file contents:
Lorem ipsum dolor sit amet
consectetur "adipisicing elit"
sed "do" eiusmod tempor incididunt
ut "labore et dolore" magna aliqua
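The non-greedy capture the answers arrive at carries over to other regex engines unchanged; here is a minimal Python sketch over the same sample lines (it strips the quotes from the captures, which the Perl versions keep):

```python
import re

lines = ['Lorem ipsum dolor sit amet',
         'consectetur "adipisicing elit"',
         'sed "do" eiusmod tempor incididunt',
         'ut "labore et dolore" magna aliqua']

# ".*?" is non-greedy, so each quoted run on a line is matched separately
quoted = [q for line in lines for q in re.findall(r'"(.*?)"', line)]
print(quoted)
```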
I'm trying to extract some text from a column on a CSV file. Here is an example:
"Lorem ipsum dolor sit amet (2015), consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua (2000)."
I want to get a new column with "amet (2015)" and "aliqua (2000)". This expression gives me the (2015) and (2000) values: value.find(/\(.*?\)/)
But how can I also get the word before the parentheses?
Here is the regex you are looking for: /\w* \([^\)]*\)/gm
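A quick way to sanity-check that pattern is to run it over the sample sentence from the question; a Python sketch:

```python
import re

value = ("Lorem ipsum dolor sit amet (2015), consectetur adipiscing elit, "
         "sed do eiusmod tempor incididunt ut labore et dolore magna aliqua (2000).")

# \w* grabs the word before the space, \([^)]*\) the parenthesised part
matches = re.findall(r'\w* \([^)]*\)', value)
print(matches)
```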
I have a String
Lorem ipsum dolor sit amet
*consectetur adipiscing elit
sed do eiusmod tempor incididunt*
ut labore et dolore magna aliqua
Ut enim ad minim veniam.
Now I want to select the content between the asterisks (* content *).
My current regex is [*](.*?)[*], but it only works when the match is on a single line:
*consectetur adipiscing elit*
How do I make it multiline?
This regex worked for me: [*]([\\s\\S]*?)[*]
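The [\s\S] trick works in any engine that lacks (or where you cannot pass) a DOTALL flag, since the class matches every character including newlines. A Python sketch of the same match over the sample string:

```python
import re

text = ("Lorem ipsum dolor sit amet\n"
        "*consectetur adipiscing elit\n"
        "sed do eiusmod tempor incididunt*\n"
        "ut labore et dolore magna aliqua")

# [\s\S] matches any character, newlines included, unlike the default "."
m = re.search(r'[*]([\s\S]*?)[*]', text)
print(m.group(1))
```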
Below is my test string:
Object: TLE-234DSDSDS324-234SDF324ER
Page location: SDEWRSD3242SD-234/324/234 (1)
org-chart Lorem ipsum dolor consectetur adipiscing # Colorado
234DSDSDS324-32-4/2/7-page2 (2) loc log Apr 18 21:42:49 2017 1
Page information: 3.32.232.212.23, Error: fatal, Technique: color
Comments: Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.
Validation status: Lorem ipsums dolors sits amets, consectetur adipiscing elit
Positive control-export: Validated
Page location: SDEWRSD3242SD-SDF/234/324 (5)
org-chart Lorem ipsum dolor consectetur adipiscin # Arizona
234DSDSDS324-23-11/1/0-page1 (1) loc log Apr 18 21:42:49 2017 1
Page information: 3.32.232.212.23, Error: log, Technique: color
Comments: Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.
Validation status: Lorem ipsums dolors sits amets, consectetur adipiscing elit
Positive control-export: Validated
I need to capture the strings after "Page location: ", "Object: " and "Comments: ".
For example:
Object: TLE-234DSDSDS324-234SDF324ER - Group 1
Page location: SDEWRSD3242SD-234/324/234 (1) - Group 2
Page location: SDEWRSD3242SD-SDF/234/324 (5) - Group 3
Comments: Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. - Group 4
Comments: Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. - Group 5
Here is my regex URL.
I am able to capture the strings, but the regex won't capture anything when one of the strings is repeated.
(See comments below the question for the problem description.)
The data is in a multi-line string, with multiple sections starting with Object:. Within each section there are multiple lines starting with the phrases Page location: and Comments:. The rest of the line for each of these needs to be captured, all organized by Object.
Instead of attempting a tortured multi-line "single" regex, break the string into lines and process section by section. This way the problem becomes a very simple one.
The results are stored in an array of hashrefs; each has the shown phrases for keys. Since a phrase can appear more than once per section, its value is an arrayref (holding whatever follows the phrase on each line).
use warnings;
use strict;
use feature 'say';
my $input_string = '...';
my @lines = split /\n/, $input_string;
my $patt = qr/Object|Page location|Comments/;
my @sections;

for (@lines) {
    next if not /^\s*($patt):\s*(.*)/;
    push @sections, {} if $1 eq 'Object';
    push @{ $sections[-1]->{$1} }, $2;
}

foreach my $sec (@sections) {
    foreach my $key (sort keys %$sec) {
        say "$key:";
        say "\t$_" for @{$sec->{$key}};
    }
}
With the input string copied (suppressed above for brevity), the output is
Comments:
Lorem ipsum dolor sit amet, [...]
Lorem ipsum dolor sit amet, [...]
Page location:
SDEWRSD3242SD-234/324/234 (1)
SDEWRSD3242SD-SDF/234/324 (5)
Object:
TLE-234DSDSDS324-234SDF324ER
A few comments.
Once the Object line is found we add a new hashref to @sections. Then each matched phrase is set as a key, and the rest of its line is pushed onto that key's arrayref value. This is done for the current (so last) element of @sections.
This adds an empty string if a phrase had nothing following it. To disallow that, add next if not $2;
Note. An easy and common way to print complex data structures is via the core module Data::Dumper. But also see Data::Dump for a much more compact printout.
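The same line-by-line, section-by-section approach translates directly to other languages; here is a minimal Python sketch (the field names come from the question, but the short sample report is made up for illustration):

```python
import re

def parse_sections(text):
    """Collect the text after each labelled phrase, grouped per Object."""
    patt = re.compile(r'^\s*(Object|Page location|Comments):\s*(.*)')
    sections = []
    for line in text.splitlines():
        m = patt.match(line)
        if not m:
            continue
        key, rest = m.group(1), m.group(2)
        if key == 'Object':
            sections.append({})          # each Object line starts a new section
        sections[-1].setdefault(key, []).append(rest)
    return sections

report = """Object: TLE-1
Page location: A/1 (1)
Comments: first comment
Page location: B/2 (5)
Comments: second comment"""
print(parse_sections(report))
```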
I have a column of titles in a table and would like to delete all words that are listed in a separate table/vector.
For example, table of titles:
"Lorem ipsum dolor"
"sit amet, consectetur adipiscing"
"elit, sed do eiusmod tempor"
"incididunt ut labore"
"et dolore magna aliqua."
To be deleted: c("Lorem", "dolore", "elit")
output:
"ipsum dolor"
"sit amet, consectetur adipiscing"
", sed do eiusmod tempor"
"incididunt ut labore"
"et magna aliqua."
The blacklisted words can occur multiple times.
The tm package has this functionality, but only when applied to a wordcloud. What I need is to leave the column intact rather than joining all the rows into one string of characters. Regex functions (gsub()) don't seem to work when given a vector of values as the pattern. An Oracle SQL solution would also be interesting.
lorem <- c("Lorem ipsum dolor",
"sit amet, consectetur adipiscing",
"elit, sed do eiusmod tempor",
"incididunt ut labore",
"et dolore magna aliqua.")
to.delete <- c("Lorem", "dolore", "elit")
output <- lorem
for (i in to.delete) {
output <- gsub(i, "", output)
}
This gives:
[1] " ipsum dolor" "sit amet, consectetur adipiscing"
[3] ", sed do eiusmod tempor" "incididunt ut labore"
[5] "et magna aliqua."
First read the data:
dat <- c("Lorem ipsum dolor",
"sit amet, consectetur adipiscing",
"elit, sed do eiusmod tempor",
"incididunt ut labore",
"et dolore magna aliqua.")
todelete <- c("Lorem", "dolore", "elit")
We can avoid loops with a little smart pasting. The | is a regex "or", so we can collapse the words into one alternation pattern, removing the need for any loop:
gsub(paste0(todelete, collapse = "|"), "", dat)
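The same collapse-to-alternation trick works outside R too; a Python sketch (re.escape is added as a safety net in case a word contains regex metacharacters, which the plain paste0 version does not guard against):

```python
import re

lorem = ["Lorem ipsum dolor",
         "sit amet, consectetur adipiscing",
         "elit, sed do eiusmod tempor"]
to_delete = ["Lorem", "dolore", "elit"]

# one alternation pattern instead of a loop over the words
pattern = "|".join(re.escape(w) for w in to_delete)
cleaned = [re.sub(pattern, "", s) for s in lorem]
print(cleaned)
```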
You could also use stri_replace_all_fixed (with vectorize_all = FALSE, so that every pattern is applied to every string):
library(stringi)

lorem <- c("Lorem ipsum dolor",
           "sit amet, consectetur adipiscing",
           "elit, sed do eiusmod tempor",
           "incididunt ut labore",
           "et dolore magna aliqua.")

to.delete <- c("Lorem", "dolore", "elit")

stri_replace_all_fixed(lorem, to.delete, '', vectorize_all = FALSE)
Output:
[1] " ipsum dolor" "sit amet, consectetur adipiscing" ", sed do eiusmod tempor"
[4] "incididunt ut labore" "et magna aliqua."
The tm package has a function implemented for exactly that:
tm:::removeWords.character
It is implemented as follows:
foo <- function(x, words){
gsub(sprintf("(*UCP)\\b(%s)\\b", paste(sort(words, decreasing = TRUE),
collapse = "|")), "", x, perl = TRUE)
}
Which gives you
gsub("(*UCP)\\b(Lorem|elit|dolore)\\b","", x, perl = TRUE)
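The \b word boundaries are what distinguish this from the plain gsub approaches above: only whole words are removed, never substrings of other words. A Python sketch of the same idea (the sample sentence is made up):

```python
import re

words = ["Lorem", "dolore", "elit"]
pattern = r"\b(" + "|".join(words) + r")\b"

# "dolore" is deleted, but the distinct word "dolor" survives
result = re.sub(pattern, "", "Lorem ipsum dolor et dolore")
print(result)
```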
I'm trying to read a file line by line to pull out all anchor tags in captured groups.
So far, I have:
regex="(<a href=\")([A-Za-z0-9:/._-]+)\".*(<\/a>)"
while read line; do
if [[ $line =~ $regex ]]; then
#echo ${BASH_REMATCH}
href=${BASH_REMATCH[2]}
echo $href
fi
done < file.txt
And while this almost works, in that I am capturing the URL as required, the problem I'm having is that when a line contains two or more anchor <a> tags my regex is ineffective: only the first anchor tag is captured.
So, unknown to me, there must be a way of capturing all the repeated groups.
Example text would be:
This paragraph has only one anchor tag, google, lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.
Some paragraph with a lot of anchor tags, regular expression, lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Bash. Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. asking, lorem ipsum dolor sit amet wikipedia
You will find that the result of running my bash script on the above text as file.txt is:
http://google.com
http://en.wikipedia.org/wiki/Regular_expression
...and if you uncomment the line #echo ${BASH_REMATCH}, you'll see the whole paragraph is matched, with only the first anchor captured.
How can I continue to capture all anchor patterns in the paragraph?
Thanks for your time!
You can use a while loop to capture all the matches. The trailing (.*$) group captures the remainder of the line, which becomes the new $line for the next pass:
regex="<a href=\"([A-Za-z0-9:/._-]+)\"[^<]*<\/a>(.*$)"
while read line; do
while [[ $line =~ $regex ]]; do
href=${BASH_REMATCH[1]}
line=${BASH_REMATCH[2]}
echo $href
done
done < file.txt
prints
http://google.com
http://en.wikipedia.org/wiki/Regular_expression
http://stackoverflow.com/questions/ask
http://en.wikipedia.org
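In a language with an iterate-all-matches primitive the consuming loop disappears entirely; a Python sketch using the same character class (and sharing the same fragility on real HTML):

```python
import re

line = ('See <a href="http://google.com">google</a> and '
        '<a href="http://en.wikipedia.org">wikipedia</a>.')

# finditer yields every non-overlapping match, so no manual re-matching loop
hrefs = [m.group(1)
         for m in re.finditer(r'<a href="([A-Za-z0-9:/._-]+)"[^<]*</a>', line)]
print(hrefs)
```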
Did you try grep -o? That prints only the matched parts.
grep -Po '(?<=<a href=\")([A-Za-z0-9:/._-]+)(?=\".*?<\/a>)' file.txt
-P turns on perl compatible regex
-o returns only the matched patterns not whole lines
(?<=...) positive lookbehind: matches a position that is preceded by this pattern
(?=...) positive lookahead: matches a position that is followed by this pattern
.*? non greedy matching: so you won't end up with a match from the first opening <a> tag to the last closing </a> tag
Using lookahead and lookbehind you do not consume the surrounding pattern; you just require its presence. This makes grep -o output exactly what you need.
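The lookaround mechanics are identical in other PCRE-style engines; a Python sketch (with the trailing lookahead simplified to just the closing quote):

```python
import re

line = '<a href="http://google.com">google</a>'

# the lookbehind and lookahead assert context without consuming it,
# so only the URL itself ends up in the match
urls = re.findall(r'(?<=<a href=")[A-Za-z0-9:/._-]+(?=")', line)
print(urls)
```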
Just a note: this approach is very fragile; comments etc. are not understood. If you need this tool for something important, use an XML/HTML parser instead.