I'm not too familiar with regex, but I know what I need to find.
I have a long list of data separated by newlines, and I need to delete all the lines of data that contain a string "(V)". The lines are of variable length, so I guess something to do with selecting everything between two newline characters if there's a (V) inside?
Try searching for this regular expression:
^.*\(V\).*$
Explanation:
^ start of line
.* any characters apart from new line
\( open parenthesis (escaped to avoid special behaviour)
V V
\) close parenthesis (escaped to avoid special behaviour)
.* any characters apart from new line
$ end of line (not strictly needed here, included only for clarity)
Depending on your language you may need to add delimiters such as / and/or quotes " around the regular expression and you may need to enable multiline mode.
Here's an online example showing it working: Rubular
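If you want to try the pattern outside an editor, here is a minimal Python sketch of the same search. Python is only my choice for illustration and is not part of the answer above. It assumes the data is already in a string; the sample text and the name remove_marked_lines are made up. The \n? at the end is my addition so that the line break is deleted along with the line.

```python
import re

# ^ and $ match at every line boundary because of re.MULTILINE.
# The optional \n after $ is an addition to the pattern above so that
# the line break goes away together with the matched line.
MARKED_LINE = re.compile(r"^.*\(V\).*$\n?", re.MULTILINE)

def remove_marked_lines(text):
    """Return text without the lines that contain the literal string (V)."""
    return MARKED_LINE.sub("", text)

sample = "alpha\nbeta (V) gamma\ndelta\n(V) only\n"
print(remove_marked_lines(sample), end="")
```

Without the multiline flag, ^ and $ would only match at the very start and end of the whole string, which is why the answer mentions enabling multiline mode.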
If the data is indeed rather large, then running a single regex against the whole string would be a bad idea. Instead, a simple solution like this Perl script could work for you:
open my $fh, '<', 'data.txt' or die $!;
while (my $line = <$fh>) {
    if ($line =~ m/\(V\)/) {
        next;    # skip lines that contain "(V)"
    }
    print $line;
}
close $fh;
This script reads the data file one line at a time and prints the lines that do not contain "(V)" to stdout. (You could obviously replace the print with a different data-processing task.)
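The same line-at-a-time idea can be sketched in Python as a generator. This is an illustration under my own assumptions, not the answerer's code: the name lines_without is hypothetical, and it uses a plain substring test rather than a regex.

```python
def lines_without(lines, needle="(V)"):
    """Yield only the lines that do not contain needle (plain substring test)."""
    for line in lines:
        if needle in line:
            continue  # skip the line, like `next` in the Perl loop
        yield line

# Any iterable of lines works; an open file object is read lazily, one line at a time:
#     with open("data.txt") as fh:
#         for line in lines_without(fh):
#             print(line, end="")
demo = ["keep 1\n", "drop (V) me\n", "keep 2\n"]
print("".join(lines_without(demo)), end="")
```

Because only one line is held at a time, memory use stays flat no matter how large the file is.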
Use the UNIX command grep, if you have access to such a system.
$ grep -v '(V)' data.txt
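grep looks for lines containing "(V)" in data.txt, and the -v option inverts the selection so that only the non-matching lines are shown. In grep's default (basic) regex syntax unescaped parentheses are literal, so no escaping is needed.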
I have to clean several CSV files before I put them in a database. Some of the files have an unexpected line break in the middle of a line. As each line should always end with a number, I managed to fix the files with this one-liner:
perl -pe 's/[^0-9]\r?\n//g'
While it did work, it also removes the last character before the line break:
foob
ar
turns into
fooar
Is there a Perl one-liner I can call that follows the same rule without removing the last character before the line break?
A negative lookbehind, which is an assertion and does not consume characters, can also be used.
(?<!\d)\R
\d is shorthand for a digit
\R matches any linebreak sequence
See this demo at regex101
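Here is a rough Python sketch of the lookbehind idea, in case you want to test it on a string. Note the assumptions: Python's re module has no \R, so I spell the line break as \r?\n, and I added \r to the lookbehind so that the \n of a "\r\n" pair following a digit is not matched on its own. The name join_broken_lines is made up.

```python
import re

# A line break (\n or \r\n) that is NOT preceded by a digit.
# \r is included in the lookbehind so that the \n of a "\r\n" pair
# that follows a digit is not matched on its own.
BROKEN_BREAK = re.compile(r"(?<![\d\r])\r?\n")

def join_broken_lines(text):
    """Remove line breaks that do not follow a digit, keeping all other characters."""
    return BROKEN_BREAK.sub("", text)

print(join_broken_lines("foob\nar,12\nbaz,3\n"), end="")
```

Since the lookbehind only asserts and does not consume, the character before the line break (the b in foob) stays in place.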
One way is to use \K, which works like a lookbehind:
perl -pe 's/[^0-9]\K\r?\n//g'
The \K drops everything matched so far from the match, so only what follows it is replaced.
However, I'd rather recommend processing your CSV with a library, even if it's a little more code. There has already been one problem, the linefeed inside a field; what else may be there? A good library can handle a variety of irregularities.
A simple example with Text::CSV
use warnings;
use strict;
use feature 'say';
use Text::CSV;
my $file = shift or die "Usage: $0 file.csv\n";
my $csv = Text::CSV->new({ binary => 1, auto_diag => 1 });
open my $fh, '<', $file or die "Can't open $file: $!";
while (my $row = $csv->getline($fh)) {
    s/\n+//g for @$row;    # remove linefeeds inside each field
    $csv->say(\*STDOUT, $row);
}
Consider other constructor options, also available via accessors, that are good for all kinds of unexpected problems. Like allow_whitespace for example.
This can be done as a command-line program ("one-liner") as well, if there is a reason for that. The library's functional interface via csv is then convenient
perl -MText::CSV=csv -we'
    csv in => *ARGV, on_in => sub { s/\n+//g for @{$_[1]} }' filename
With *ARGV the input is taken either from a file named on command line or from STDIN.
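For comparison, the "let a CSV library do the parsing" approach looks like this with Python's standard csv module. This is only a sketch under assumptions: it helps when the field containing the line feed is quoted, so the parser can tell it apart from a row boundary, and the function name and sample data are invented.

```python
import csv
import io

def strip_newlines_in_fields(src, dst):
    """Copy CSV rows from src to dst, deleting line feeds found inside fields."""
    writer = csv.writer(dst, lineterminator="\n")
    for row in csv.reader(src):
        writer.writerow([field.replace("\n", "") for field in row])

# The second field of the first row is quoted and contains a line feed.
source = io.StringIO('1,"foo\nbar",2\n3,baz,4\n')
result = io.StringIO()
strip_newlines_in_fields(source, result)
print(result.getvalue(), end="")
```

With real files you would pass file objects opened with newline="" instead of the StringIO buffers, as the csv module documentation recommends.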
Just capture the last char and put it back:
perl -pe 's/([^0-9])\r?\n/$1/g'
I need to match sentences which contain both the / and . characters in the same sentence. How do I do it with Perl?
Suppose my file has the following sentences:
ttterfasghti.
ddseghies/affag
hhail/afgsh.
asfsdgagh/
adterhjc/sgsagh.
My expected output should be :
hhail/afgsh.
adterhjc/sgsagh.
Given a clarification from a comment
Any order but the matching line should contain both / and .
an easy way
perl -wne'print if m{/} and m{\.}' filename
This is inefficient in the sense that it starts the regex engine twice and scans each string twice. However, in most cases that is unnoticeable, while this code is much clearer than a single regex for the task.
I use {} delimiters so as not to have to escape the /, in which case the m in front is compulsory. Then I use the same m{...} on the other pattern for consistency.
A most welcome inquiry comes that this be done in a script, not one-liner! Happy to oblige.
use warnings;
use strict;
my $file = shift || die "Usage: $0 file\n";
open my $fh, '<', $file or die "Can't open $file: $!";
while (<$fh>) {
    print if m{/} and m{\.};    # line contains both a / and a literal .
}
close $fh;
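To make the "two simple checks versus one regex" trade-off concrete, here is a small Python sketch that does it both ways on the sample lines from the question. It is an illustration only; the function names are hypothetical, and the single-regex version uses two lookaheads, which is one of several ways to write it.

```python
import re

# Single-regex alternative: two lookaheads anchored at the start of the line.
BOTH = re.compile(r"^(?=.*/)(?=.*\.)")

def has_both_simple(line):
    """Two independent substring checks, one for / and one for a literal dot."""
    return "/" in line and "." in line

def has_both_regex(line):
    """The same test expressed as one regular expression."""
    return BOTH.search(line) is not None

lines = ["ttterfasghti.", "ddseghies/affag", "hhail/afgsh.", "asfsdgagh/", "adterhjc/sgsagh."]
for line in lines:
    if has_both_simple(line):
        print(line)
```

Both functions accept the characters in any order; the two-check version is arguably easier to read, which is the point made above.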
This feels like a duplicate, but I just can't find a good previous question for this.
For / there are two ways:
use m// operator with different separator characters, e.g. m,<regex with />,, m{<regex with />}, or
escape it, i.e. /\//
For . use escaping.
Note that inside a character class ([...]) many special characters no longer need escaping.
Hence we get:
$ perl <dummy.txt -ne 'print "$1: $_" if m,(\w+/\w*\.),'
hhail/afgsh.: hhail/afgsh.
adterhjc/sgsagh.: adterhjc/sgsagh.
i.e. the line is printed if it contains one or more word characters, followed by a /, zero or more word characters, and then a literal '.'.
Recommended reading perlrequick, perlretut & perlre.
UPDATE after OP clarified the requirement in a comment:
$ perl <dummy.txt -ne 'print if m,/, && m{\.}'
hhail/afgsh.
adterhjc/sgsagh.
I am trying to replace quotes within a pipe delimited and quote encapsulated file without replacing the quotes providing the encapsulation.
I have tried using the below Perl line to replace the quotes with a back tick ` but am not sure how to replace only the quotes and not the entire group 1.
Sample data (test.txt):
"1"|"Text"|"a"\n
"2"|""Text in quotes""|"ab"\n
"3"|"Text "around" quotes"|"abc"\n
perl -pi.bak -e 's/(?<=\|")(.*)(?="\|)/\1`/' test.txt
Here is what is happening:
"1"|"`"|"a"\n
"2"|"`"|"ab"\n
"3"|"`"|"abc"\n
Here is what I am trying to achieve:
"1"|"Text"|"a"\n
"2"|"`Text in quotes`"|"ab"\n
"3"|"Text `around` quotes"|"abc"\n
With Perl 5.14 and newer, you may use
perl -pi.bak -e 's/(?:^|\|)(")?\K(.*?)(?=\1(?:$|\|))/$2=~s#"|(`)#`$1#gr/ge' test.txt
See the regex demo and an online demo.
The point here is that you match the fields with the first regex, and then you deal with double quotation marks and backticks using the second regex run on the match part.
Details
(?:^|\|) - matches the start of a string or |
(")? - an optional Group 1 matching a "
\K - match reset operator discarding all text in the current match buffer
(.*?) - Group 2: any 0+ chars other than line break chars
(?=\1(?:$|\|)) - a positive lookahead that makes sure there is the same value as in Group 1 and then the end of string or | immediately to the right of the current location.
So, Group 2 is the cell contents, with no enclosing double quotation marks. $2=~s#"|(`)#`$1#gr replaces all " with ` and duplicates all found literal backticks in the Group 2 value (see this regex demo). The "|(`) pattern matches a " or a backtick (capturing the latter into Group 1), and the `$1 replacement replaces the match with a backtick followed by the contents of Group 1.
Updated for clarification that backticks that are already present should be doubled
One way is to split on | and strip the enclosing quotes to make the remaining regex simple, then assemble the string back. That may lose some efficiency in comparison with a single regex but is much simpler to maintain
perl -F"\|" -wlanE'
    say join "\|",
        map { s/^"|"$//g; s/`/``/g; s/"([^"]+)"/`$1`/g; qq("$_") } @F
' data.txt
The -a option makes it "autosplit" each line so in the program the line tokens are available in @F, and the -F specifies the pattern to split on (other than default). The -l handles newlines. See Command switches in perlrun.
In the map the enclosing "s are removed and any existing backticks doubled; then " around patterns are changed, globally. Then the quotes are put back and the returned list join-ed. The | in the join is escaped so as to sneak it through the shell to the Perl program; if this goes into a script (instead of a one-liner), which I'd always recommend, change that \| to |.
I don't know the typical data and possible edge cases regarding quoting, but if there may be loose (single, unpaired) quotes the above will have problems and may produce wrong output, and quietly; just as any procedure that expects paired quotes would, without an extremely detailed analysis.
It may be overall safer to simply replace all "s (other than enclosing ones), with
map { s/^"|"$//g; s/`/``/g; s/"/`/g; qq("$_") }
(or with tr instead of regex s///g). That also adds some measure of efficiency.
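The split-and-reassemble idea, in its "replace all inner quotes" variant, can be sketched in Python like this. It is not the answerer's code: the name requote_fields is hypothetical, and it assumes the line has already had its newline removed and that no field contains a | of its own.

```python
def requote_fields(line, sep="|"):
    """Replace inner double quotes with backticks, doubling existing backticks."""
    fixed = []
    for field in line.split(sep):
        # Strip one enclosing quote from each end, if present.
        if field.startswith('"'):
            field = field[1:]
        if field.endswith('"'):
            field = field[:-1]
        field = field.replace("`", "``")   # double backticks that were already there
        field = field.replace('"', "`")    # every remaining quote becomes a backtick
        fixed.append('"' + field + '"')
    return sep.join(fixed)

for row in ['"1"|"Text"|"a"',
            '"2"|""Text in quotes""|"ab"',
            '"3"|"Text "around" quotes"|"abc"']:
    print(requote_fields(row))
```

Doubling the existing backticks before converting the quotes matters: done in the other order, the freshly inserted backticks would be doubled as well.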
Another way to get to the "meat" of the data is to use Text::CSV, which allows a delimiter other than (the default) comma and absorbs the enclosing quotes. Having quotes inside fields is considered bad CSV but the module can parse that just fine as well, with choices below.
use warnings;
use strict;
use feature 'say';
use Text::CSV;
my $file = shift || 'data.txt';
my $outfile = 'new_' . $file;
my $csv = Text::CSV->new( {
    binary => 1, sep_char => '|',
    allow_loose_quotes => 1, escape_char => '',  # quotes inside fields
    always_quote => 1                            # output as desired
} ) or die "Can't do CSV: ", Text::CSV->error_diag;
open my $fh, '<', $file or die "Can't open $file: $!";
open my $out_fh, '>', $outfile or die "Can't open $outfile: $!";
while (my $row = $csv->getline($fh)) {
    s/`/``/g for @$row;    # double the backticks already present
    tr/"/`/ for @$row;     # turn inner quotes into backticks
    $csv->say($out_fh, $row);
}
To work with quotes inside fields the escape_char needs to differ from quote_char; I've simply set it to '' here. The output is handled by the module as well, and always_quote attribute is for that (to quote all fields, needed or not). Please see documentation.
If the purpose of the question is precisely to clean up a file format where same quoting is used both for fields and inside the fields, I'd suggest to do it all with the module. This approach allows one to cleanly and consistently set up all kinds of options, both for input and output, and is maintainable.
A few questions
What kind of data is there and is it possible to have a stray quote? Then what? This can affect even the choice of the optimal approach as it may require a detailed analysis.
If the quest here is to straighten CSV-style data, then why not double the quotes inside fields, as common and proper in CSV, instead of replacing them (and potentially hurting their textual meaning)? See module's docs, for instance.
Perl uses $1 as the placeholder for the first capturing group in the replacement part of the regex instead of \1 (used in the matching part of the regex). Your regex wasn't matching the inner quotes and would fail to match the first or last field of your pipe delimited data. Your substitution also failed to include a quote character before the captured group.
Try:
perl -pi.bak -e 's/(?<=(?:^|\|)")"([^"]*)"(?="(?:$|\|))/`$1`/' test.txt
Another Perl approach. After autosplitting the line into the array @F, replace every " that is not at the beginning or end of an element.
perl -F"\|" -lane ' for(@F) { s/(?<!^)"(?!$)/`/g }; print join("|",@F) '
with the given inputs
$ cat grasshopper.txt
"1"|"Text"|"a"
"2"|""Text in quotes""|"ab"
"3"|"Text "around" quotes"|"abc"
$ perl -F"\|" -lane ' for(@F) { s/(?<!^)"(?!$)/`/g }; print join("|",@F) ' grasshopper.txt
"1"|"Text"|"a"
"2"|"`Text in quotes`"|"ab"
"3"|"Text `around` quotes"|"abc"
$
I am a computer enthusiast but have never studied programming.
I am trying to learn Perl. I found it interesting after learning to use a few Perl-flavored regular expressions, because I needed to replace words in certain parts of strings, and that is how I found Perl.
I don't know anything about programming, though, so I would like to see simple examples of how to use regular expressions from the shell (terminal) or in basic scripts.
For example, suppose I have a text document called input.txt in a folder.
How can I perform the following regex?
Text to match:
text text text
text text text
What I want: change the second occurrence of the word "text" to the word "changed".
(\A.*?tex.*?)text(.*?)$
replace for : \1changed\3
expected result:
text changed text
text changed text
In a text editor, that would be done with the multi-line and global modifiers.
Now, how can I run this from the shell?
cd to the path, and then what?
Or with a script? What should it contain to make it work?
Please bear in mind that I don't know anything about Perl, only its regexp syntax.
The regular expression part is easy.
s/\btext\b.*?\K\btext\b/changed/;
However, how to apply it if you're learning perl... that's the hard part. One could demonstrate a one liner, but that's not that helpful.
perl -i -pe 's/\btext\b.*?\K\btext\b/changed/;' file.txt
So instead, I'd recommend looking at the perlfaq5 entry "How do I change, delete, or insert a line in a file, or append to the beginning of a file?". Ultimately what you need to learn is how to open a file for reading and iterate over the lines, and how to open a file for writing. With these two tools, you can do a lot.
use strict;
use warnings;
use autodie;
my $file = 'blah.txt';
my $newfile = 'new_blah.txt';
open my $infh, '<', $file;
open my $outfh, '>', $newfile;
while (my $line = <$infh>) {
    # Any manipulation to $line here, such as that regex:
    # $line =~ s/\btext\b.*?\K\btext\b/changed/;
    print $outfh $line;
}
close $infh;
close $outfh;
Update to explain regex
s{
\btext\b # Find the first 'text' not embedded in another word
.*? # Non-greedily skip characters
\K # Keep the stuff left of the \K, don't include it in replacement
\btext\b # Match 2nd 'text' not embedded in another word
}{changed}x; # Replace with 'changed' /x modifier to allow whitespace in LHS.
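If it helps to see the same substitution without \K, here is a small Python sketch. Python's re module does not support \K, so as an assumption of this port the text before the second "text" is captured in a group and written back. The name change_second is made up.

```python
import re

# Python's re module has no \K, so the part before the second "text"
# is captured in group 1 and written back in the replacement.
SECOND_TEXT = re.compile(r"(\btext\b.*?)\btext\b")

def change_second(line):
    """Replace the second standalone word 'text' on a line with 'changed'."""
    return SECOND_TEXT.sub(r"\1changed", line, count=1)

for line in ["text text text", "text text text"]:
    print(change_second(line))
```

Applying the function to each line separately mirrors what perl -p does, so no multi-line or global modifier is needed.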
I have a file of the following:
Question:What color is the sky?
Explanation:The sky reflects the ocean.
Question:Why did the chicken cross the road?
Explanation:He was hungry.
What I'm trying to obtain is a list of ("What color is the sky?", "Why did the chicken cross the road")
I'm trying to use perl regex to parse this file, but with no luck.
I have the entire contents of my file in a string called $file, and this is what I'm trying
my @questions = ($file =~ /Question:(.*)\n/g);
But this always just returns the entire $file string to me.
Your (.*) is greedily matching the whole line until it gets to the \n, which is probably a result of how you are getting the string.
You can add a ? to make the match not greedy.
So try
my @questions = ($file =~ /Question:(.*?\?)/g);
Notice that I escaped the ? as \?, so the regex will match up to and including the question mark.
Putting the whole file into a variable will use a lot of memory if the file is large; a better way is to process the file line by line.
For example you could do something like
my @questions;
while (<>) {
    chomp;
    if (m/Question:(.*)/) {
        push @questions, $1;    # text after "Question:" on this line
    }
}
Some explanations:
I/O Operators of perlop:
Input from <> comes either from standard input, or from each file listed on the command line.
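The line-by-line version can be sketched the same way in Python, for anyone who wants to compare. It is an illustration only: extract_questions is a hypothetical name, and the sample string simply repeats the file from the question.

```python
import re

def extract_questions(lines):
    """Collect the text after 'Question:' from each line that contains it."""
    questions = []
    for line in lines:
        match = re.search(r"Question:(.*)", line.rstrip("\n"))
        if match:
            questions.append(match.group(1))
    return questions

sample = (
    "Question:What color is the sky?\n"
    "Explanation:The sky reflects the ocean.\n"
    "Question:Why did the chicken cross the road?\n"
    "Explanation:He was hungry.\n"
)
print(extract_questions(sample.splitlines(keepends=True)))
```

In real use you would pass an open file object instead of the split string, so only one line is in memory at a time.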