I'm trying to reformat some data that I have that isn't playing well when I copy text from a pdf.
Cordless
9B12071R
CHARGER, 3.6V,LI-ION
Cordless
9B12073R
CHARGER,NI-CD,FRAMER
Framing / Sheathing tools
F28WW
WIRE COLLATED FRAMIN
Framing / Sheathing tools
N89C-1
COIL FRAMING NAILR
Framing / Sheathing tools
N80CB-HQ
I want to have it formatted like this:
Cordless 9B12071R CHARGER, 3.6V,LI-ION
Cordless 9B12073R CHARGER,NI-CD,FRAMER
....
What I'm trying to do is a find and replace that replaces the first two newlines ("\n") of each group with a tab ("\t") and leaves the third "\n" intact.
The first thing I do is replace all "\n" with "\t", which is easy. After that, I want to replace every third "\t" with "\n". How would I do that using regex?
For EditPadPro, paste this into the Search box
([A-Za-z /]+)
([A-Za-z0-9_-]+)
(.*)
Paste this into the Replace box
\1 \2 \3
And that should do it. Basically you can add carriage returns and tabs using Ctrl+Enter and Ctrl+Tab in EditPadPro.
I had to add a carriage return to your text in the question as it's missing the last line I think. All the others are in triples of data.
Alright here is the php code that does exactly as you want:
<?php
$s = "Cordless
9B12071R
CHARGER, 3.6V,LI-ION
Cordless
9B12073R
CHARGER,NI-CD,FRAMER";
$p = '/(Cordless.*?)\\n(.+?)\\n(CHARGER.+?)(\\n|$)/s';
$r = '\\1' . "\t" . '\\2' . "\t" . '\\3' . "\n";
echo preg_replace($p, $r, $s);
?>
OUTPUT:
>php -q regex.php
Cordless 9B12071R CHARGER, 3.6V,LI-ION
Cordless 9B12073R CHARGER,NI-CD,FRAMER
Is this a regex job or can you rely on the line number?
$ perl -nE 'chomp; print $_, $.%3? "\t": "\n"' file
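For readers who do not have Perl at hand, the same line-counting idea can be sketched in Python. This is only an illustration: the function name `join_triples` and the inline sample are made up for the example, and it assumes the input really does come in groups of three lines.

```python
def join_triples(lines):
    """Join every three input lines into one tab-separated record."""
    records = []
    for i in range(0, len(lines), 3):
        # lines[i:i + 3] is one category / part number / description group
        records.append("\t".join(lines[i:i + 3]))
    return records


sample = """Cordless
9B12071R
CHARGER, 3.6V,LI-ION
Cordless
9B12073R
CHARGER,NI-CD,FRAMER""".splitlines()

for record in join_triples(sample):
    print(record)
```

Unlike the one-liner above, an incomplete final group is joined without a trailing tab.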
EDIT (after comment)
If you have to do this in an editor, then this works in vim:
%s/\(.\+\)\n\(\C[A-Z0-9-]\+\)\n\(.\+\)/\1^I\2^I\3/
The important bit here is the assumption that a line that consists entirely of A-Z, 0-9 and - constitutes a part number. ^I is a tab, you type tab and vim prints ^I. (I hope your editor has this many steroids!)
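The same substitution can be sketched outside an editor, here in Python. It relies on the same assumption as the vim command (the middle line consists only of A-Z, 0-9 and "-"); the names `TRIPLE` and `tabify` are invented for this example.

```python
import re

# Assumption carried over from the vim command: the middle line of each
# group is a part number made only of A-Z, 0-9 and "-".
TRIPLE = re.compile(r"^(.+)\n([A-Z0-9-]+)\n(.+)$", re.MULTILINE)


def tabify(text):
    """Replace the two newlines inside each three-line group with tabs."""
    return TRIPLE.sub(r"\1\t\2\t\3", text)


text = (
    "Framing / Sheathing tools\nF28WW\nWIRE COLLATED FRAMIN\n"
    "Framing / Sheathing tools\nN89C-1\nCOIL FRAMING NAILR\n"
)
print(tabify(text), end="")
```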
I am trying to strip everything matching <math>.*?</math> from a file. This is easy when each element sits on a single line, but how can it be done when an element spans several lines, and when one line may contain more or fewer tags?
I prepared some test text from Wikipedia to show the problem:
: <math>A =
\begin{bmatrix}
a_{1,1} & a_{1,2} & \dots \\
a_{2,1} & a_{2,2} & \dots \\
\vdots & \vdots & \ddots
\end{bmatrix}
</math> oraz <math>B =
\begin{bmatrix}
b_{1,1} & b_{1,2} & \dots \\
b_{2,1} & b_{2,2} & \dots \\
\vdots & \vdots & \ddots
\end{bmatrix}
=
\begin{bmatrix}
B_1 \\
B_2 \\
\vdots
\end{bmatrix}
</math>,
We discussed this problem on Stack Overflow and received some good solutions, but they do not work when a line contains adjacent tags such as </math> oraz <math>. That input is valid, since the tags are still paired, but the solutions fail on it.
I am not an expert in awk, sed or perl; I only know regex very well.
Perl suggestion (not working on this example):
cat dirt-math-2.txt | perl -wlne '
unless(((/.*<math>/../<\/math>/)||0) > 1){s/<math>//;print}
' | less
Awk suggestion (not working on this example):
cat dirt-math-2.txt | awk '
sub(/<math>.*/, "") {print; cut=1}
/<\/math>/ {cut=0; next}
!cut' | less
The file to parse is the whole of the Polish-language Wikipedia, so it needs to be parsed without loading 6 GB into memory. Thank you in advance for any suggestions. I asked a similar question before, but it is not the same.
Here's a Perl solution. It works by accumulating lines from the file into a buffer $text and then removing all <math>...</math> pairs. If the resulting buffer has no opening <math> tag then it is printed and emptied. That way, text from the file is only kept in memory while it contains an unpaired <math> tag, and normally the buffer will hold only a single line of input.
The program expects the path to the input file as a parameter on the command line. It has been tested against your sample data in this and your previous questions, and works fine.
use strict;
use warnings;
my $text;
while ( <> ) {
$text .= $_;
$text =~ s/<math>.*?<\/math>//sg;
if ( $text !~ /<math>/ ) {
print $text;
$text = '';
}
}
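The buffering idea is not specific to Perl. Below is a rough Python sketch of the same approach; the names `MATH_PAIR` and `strip_math` are invented here, and the final flush of an unterminated <math> block is my own addition rather than something the Perl program does.

```python
import io
import re

# Non-greedy match across newlines, like s/<math>.*?<\/math>//sg in Perl.
MATH_PAIR = re.compile(r"<math>.*?</math>", re.DOTALL)


def strip_math(lines):
    """Yield text with <math>...</math> removed, buffering only while a tag is open."""
    buffer = ""
    for line in lines:
        buffer += line
        buffer = MATH_PAIR.sub("", buffer)
        if "<math>" not in buffer:
            yield buffer
            buffer = ""
    if buffer:
        # Assumption: emit leftover text if the input ends inside <math>.
        yield buffer


sample = ": <math>A =\nx\n</math> oraz <math>B =\ny\n</math>,\nplain\n"
cleaned = "".join(strip_math(io.StringIO(sample)))
print(cleaned, end="")
```

Because any iterable of lines works, an open file object can be passed in directly, so the whole dump never has to be read into memory at once.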
A way with sed:
sed -r ':a;/<math>/{:b;s!<math>([^<]|<[^/]|</[^m]|</m[^a]|</ma[^t]|</mat[^h]|</math[^>])*</math>!!g;ta;N;bb;}' file
details:
:a; # defines the label "a"
/<math>/ { # condition: if the pattern space contains "<math>"
:b; # defines the label "b"
# try to replace (the ugly alternation "emulate" a non greedy quantifier)
s!<math>([^<]|<[^/]|</[^m]|</m[^a]|</ma[^t]|</mat[^h]|</math[^>])*</math>!!g;
ta; # if something is replaced go to label "a"
N; # else append the next line to the pattern space
bb; # and go to label "b"
}
I'm trying to filter out all the words that contain any character other than a letter from a text file. I've looked around stackoverflow, and other websites, but all the answers I found were very specific to a different scenario and I wasn't able to replicate them for my purposes; I've only recently started learning about Unix tools.
Here's an example of what I want to do:
Input:
#derik I was there and it was awesome! !! http://url.picture.whatever #hash_tag
Output:
I was there and it was awesome!
So words with punctuation can stay in the file (in fact I need them to stay) but any substring with special characters (including those of punctuation) needs to be trimmed away. This can probably be done with sed, but I just can't figure out the regex. Help.
Thanks!
Here is how it could be done using Perl:
perl -ane 'for $f (@F) {print "$f " if $f =~ /^([a-zA-Z-\x27]+[?!;:,.]?|[\d.]+)$/} print "\n"' file
I am using this input text as my test case:
Hello,
How are you doing?
I'd like 2.5 cups of piping-hot coffee.
#derik I was there; it was awesome! !! http://url.picture.whatever #hash_tag
output:
Hello,
How are you doing?
I'd like 2.5 cups of piping-hot coffee.
I was there; it was awesome!
Command-line options:
-n loop around every line of the input file, do not automatically print it
-a autosplit mode – split input lines into the @F array. Defaults to splitting on whitespace
-e execute the perl code
The perl code splits each input line into the @F array, then loops over every field $f and decides whether or not to print it.
At the end of each line, print a newline character.
The regular expression ^([a-zA-Z-\x27]+[?!;:,.]?|[\d.]+)$ is used on each whitespace-delimited word
^ starts with
[a-zA-Z-\x27]+ one or more lowercase or capital letters or a dash or a single quote (\x27)
[?!;:,.]? zero or one of the following punctuation: ?!;:,.
(|) alternately match
[\d.]+ one or more numbers or .
$ end
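To make the breakdown above concrete, here is a small Python sketch that applies the same per-word test. The names `WORD` and `keep_plain_words` are made up for the example, and the hyphen is moved to the end of the character class so that it is unambiguously a literal.

```python
import re

# Letters, dash or single quote (\x27), optionally followed by one
# punctuation mark; or a number made of digits and dots.
WORD = re.compile(r"^([a-zA-Z\x27-]+[?!;:,.]?|[\d.]+)$")


def keep_plain_words(line):
    """Keep only the whitespace-delimited words that match WORD."""
    return " ".join(w for w in line.split() if WORD.match(w))


tweet = "#derik I was there; it was awesome! !! http://url.picture.whatever #hash_tag"
print(keep_plain_words(tweet))
```

Unlike the one-liner, this joins the kept words without a trailing space.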
Your requirements aren't clear at all but this MAY be what you want:
$ awk '{rec=sep=""; for (i=1;i<=NF;i++) if ($i~/^[[:alpha:]]+[[:punct:]]?$/) { rec = rec sep $i; sep=" "} print rec}' file
I was there and it was awesome!
sed -E 's/[[:space:]][^a-zA-Z0-9[:space:]][^[:space:]]*//g' will get rid of any words starting with punctuation. Which will get you half way there.
[[:space:]] is any whitespace character
[^a-zA-Z0-9[:space:]] is any special character
[^[:space:]]* is any number of non whitespace characters
Do it again with a ^ in place of the first [[:space:]] to remove those same words at the start of the line.
I am investigating a regexp mystery. I am tired so I may be missing
something obvious - but I can't see any reason for this.
In the examples below, I use perl - but I first saw this in VIM,
so I am guessing it is something related to more than one regexp-engines.
Assume we have this file:
$ cat data
1 =2 3 =4
5 =6 7 =8
We can then delete the whitespace in front of the '=' with...
$ cat data | perl -ne 's,(.)\s+=(.),\1=\2,g; print;'
1=2 3=4
5=6 7=8
Notice that in every line, all instances of the match are replaced ;
we used the /g search modifier, which doesn't stop at the first replace,
and instead goes on replacing till the end of the line.
For example, both the space before the '=2' and the space before
the '=4' were removed ; in the same line.
Why not use simpler constructs like 's, =,=,g'? Well, we were
preparing for more difficult scenarios... where the right-hand side
of the assignments are quoted strings, and can be either
single or double-quoted:
$ cat data2
1 ="2" 3 ='4 ='
5 ='6' 7 ="8"
To do the same work (remove the whitespace before the equal sign),
we have to be careful, since the strings may contain the equal
sign - so we mark the first quote we see, and look for it
via back-references:
$ cat data2 | perl -ne 's,(.)\s+=(.)([^\2]*)\2,\1=\2\3\2,g; print;'
1="2" 3='4 ='
5='6' 7="8"
We used the back-reference \2 to search for anything that is not
the same quote as the one we first saw, any number of times ([^\2]*).
We then searched for the original quote itself (\2). If found,
we used back references to refer to the matched parts in the replace
target.
Now look at this:
$ cat data3
posAndWidth ="40:5 =" height ="1"
posAndWidth ="-1:8 ='" textAlignment ="Right"
What we want here, is to drop the last space character that exists
before all the instances of '=' in every line. Like before, we can't use
a simple 's, =",=",g', because the strings themselves may contain
the equal sign.
So we follow the same pattern as we did above, and use back-references:
$ cat data3 | perl -ne "s,(\w+)(\s*) =(['\"])([^\3]*)\3,\1\2=\3\4\3,g; print;"
posAndWidth="40:5 =" height ="1"
posAndWidth="-1:8 ='" textAlignment ="Right"
It works... but only on the first match of the line!
The space following 'textAlignment' was not removed, and neither was the one
on top of it (the 'height' one).
Basically, it seems that /g is not functional anymore: running the same
replace command without /g produces exactly the same output:
$ cat data3 | perl -ne "s,(\w+)(\s*) =(['\"])([^\3]*)\3,\1\2=\3\4\3,; print;"
posAndWidth="40:5 =" height ="1"
posAndWidth="-1:8 ='" textAlignment ="Right"
It appears that in this regexp, the /g is ignored.
Any ideas why?
Inserting some debug characters in your substitution sheds some light on the issue:
use strict;
use warnings;
while (<DATA>) {
s,(\w+)(\s*) =(['"])([^\3]*)\3,$1$2=$3<$4>$3,g;
print; # here -^ -^
}
__DATA__
posAndWidth ="40:5 =" height ="1"
posAndWidth ="-1:8 ='" textAlignment ="Right"
Output:
posAndWidth="<40:5 =" height ="1>"
posAndWidth="<-1:8 ='" textAlignment ="Right>"
# ^--------- match ---------------^
Note that the match goes through both quotes at once. It would seem that [^\3]* does not do what you think it does.
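The effect is easy to reproduce elsewhere. Python's re module documents that numeric escapes inside [...] are treated as characters, so [^\3] there means "anything except the character with code 3"; Perl's output above is consistent with the same reading. A small sketch, with variable names chosen just for this example:

```python
import re

line = 'posAndWidth ="40:5 =" height ="1"'

# Inside a character class, \3 is not a back-reference but an octal escape,
# so [^\3]* runs straight across the quote characters.
greedy = re.compile(r"""(\w+)(\s*) =(['"])([^\3]*)\3""")

marked = greedy.sub(r"\1\2=\3<\4>\3", line)
matches = greedy.findall(line)

print(marked)
print(len(matches), "match on this line")
```

Since the single match swallows the rest of the line up to the last quote, there is nothing left for /g to find.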
Regex is not the best tool here. Use a parser that can handle quoted strings, such as Text::ParseWords:
use strict;
use warnings;
use Data::Dumper;
use Text::ParseWords;
while (<DATA>) {
chomp;
my @a = quotewords('\s+', 1, $_);
print Dumper \@a;
print "@a\n";
}
__DATA__
posAndWidth ="40:5 =" height ="1"
posAndWidth ="-1:8 ='" textAlignment ="Right"
Output:
$VAR1 = [
'posAndWidth',
'="40:5 ="',
'height',
'="1"'
];
posAndWidth ="40:5 =" height ="1"
$VAR1 = [
'posAndWidth',
'="-1:8 =\'"',
'textAlignment',
'="Right"'
];
posAndWidth ="-1:8 ='" textAlignment ="Right"
I included the Dumper output so you can see how the strings are split.
I will elaborate on my comment to TLP's answer:
ttsiodras you are asking two questions:
1- why does your regex not produce the desired result? why does the g flag not work?
The answer is because your regular expression contains this part [^\3] which is not handled correctly: \3 is not recognised as a back reference. I looked for it but could not find a way to have a back reference in character class.
2- how do you remove the space preceding an equal sign and leave alone the part that comes after and is between quotes?
This would be a way to do it (see this reference):
$ cat data3 | perl -pe "s,(([\"']).*?\2)| (=),\1\3,g"
posAndWidth="40:5 =" height="1"
posAndWidth="-1:8 ='" textAlignment="Right"
The 1st part of the regex catches whatever is between quotes (single or double) and is replaced by the match, the second part corresponds to the equal sign preceded by a space that you are looking for.
Please note that this solution only works around the "interesting" behaviour of the negated character class with a back reference, [^\3], by using the non-greedy quantifier *? instead.
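For comparison, here is a sketch of the same "match the quoted string first, then the thing you actually want" trick in Python. The names `PATTERN` and `tighten` are invented for the example, and a replacement function is used so that the unmatched group is handled explicitly.

```python
import re

# First alternative: a complete quoted string, which is put back unchanged.
# Second alternative: a space followed by "=" outside any quoted string.
PATTERN = re.compile(r"""((["']).*?\2)| (=)""")


def tighten(line):
    """Drop the space before "=" unless it sits inside a quoted string."""
    return PATTERN.sub(lambda m: m.group(1) or m.group(3), line)


print(tighten('posAndWidth ="40:5 =" height ="1"'))
print(tighten('posAndWidth ="-1:8 =\'" textAlignment ="Right"'))
```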
Finally, if you want to pursue the negative lookahead solution:
$ cat data3 | perl -pe 's,(\w+)(\s*) =(["'"'"'])((?:(?!\3).)*)\3,\1\2=\3\4\3,g'
posAndWidth="40:5 =" height="1"
posAndWidth="-1:8 ='" textAlignment="Right"
The part with the quotes between square brackets still means "[\"']" but I had to use single quotes around the whole perl command otherwise the negative lookahead (?!...) syntax returns an error in bash.
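If shell quoting gets in the way, the lookahead version can also be tried from a script. A minimal Python sketch of the same pattern follows; `ASSIGN` and `tighten_assignments` are names made up for the example.

```python
import re

# (?:(?!\3).)* consumes one character at a time, as long as that character
# is not the quote captured in group 3. Outside a character class, \3 is a
# real back-reference.
ASSIGN = re.compile(r"""(\w+)(\s*) =(["'])((?:(?!\3).)*)\3""")


def tighten_assignments(line):
    """Remove the last space before "=" in every name ="value" pair."""
    return ASSIGN.sub(r"\1\2=\3\4\3", line)


print(tighten_assignments('posAndWidth ="40:5 =" height ="1"'))
print(tighten_assignments('posAndWidth ="-1:8 =\'" textAlignment ="Right"'))
```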
EDIT Corrected the regex with negative lookahead: notice the non-greedy operator *? again and the g flag.
EDIT Took ttsiodras's comment into account: removed the non-greedy operator.
EDIT Took TLP's comment into account
How do you say the following in regex:
foreach line
look at the beginning of the string and convert every group of 3 spaces to a tab
Stop once a character other than a space is found
This is what i have so far:
/^ +/\t/g
However, this converts every space to 1 tab
Any help would be appreciated.
With Perl:
perl -pe '1 while s/\G {3}/\t/gc' input.txt >output.txt
For example, with the following input
nada
   three spaces
    four spaces
   three   in the middle
      six spaces
the output (TABs replaced by \t) is
$ perl -pe '1 while s/\G {3}/\t/gc' input | perl -pe 's/\t/\\t/g'
nada
\tthree spaces
\t four spaces
\tthree   in the middle
\t\tsix spaces
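Python's standard re module has no \G anchor, so a direct port is not possible there. A rough equivalent, assuming the goal is exactly the output shown above, is to match the whole leading run of spaces once and compute the replacement in a function; `leading_spaces_to_tabs` and its `width` parameter are invented for this sketch.

```python
import re


def leading_spaces_to_tabs(line, width=3):
    """Turn each full group of `width` leading spaces into one tab."""
    def repl(match):
        tabs, rest = divmod(len(match.group()), width)
        # Leftover spaces that do not fill a whole group are kept.
        return "\t" * tabs + " " * rest
    return re.sub(r"^ +", repl, line)


for text in ["nada", "   three spaces", "    four spaces", "      six spaces"]:
    print(repr(leading_spaces_to_tabs(text)))
```

Spaces after the first non-space character are never touched, because the pattern is anchored at the start of the line.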
I know this is an old question but I thought I'd give a full regex answer that works (well it worked for me).
s/\t* {3}/\t/g
I usually use this to convert a whole document in vim, where it looks like this:
:%s/\t* \{3\}/\t/g
Hope it still helps someone.
You probably want /^(?: {3})*/\t/g
edit: fixed
I have been trying to remove the text before and after a particular keyword in each line of a text file. It would be very hard to do manually, since the file contains 5000 lines and I need to remove the text before that keyword in each line. Any software that could do it would be great, as would any Perl script that could run on Windows. I run Perl scripts in ActivePerl, so scripts that could do this and run on ActivePerl would be helpful.
Thanks
I'd use this:
$text =~ s/ .*? (keyword) .* /$1/gx;
You don't need separate software; you can make this part of your existing script. Use a regex replace along the lines of /a(b)c/, then back-reference b in the replacement with $1. Without knowing more about the text you're working with, it's hard to guess what the actual pattern would be.
Presuming that you have the following:
text1 text2 keyword text3 text4 text5 keyword text6 text7
and what you want is
s/.*?keyword(.*?)keyword.*/keyword$1keyword/;
otherwise you can just replace the whole line with keyword
An example of the data may help us be clearer
I'd say, that if $text contains your whole text, you can do :
$text =~ s/^.*(keyword1|keyword2).*$/$1/mg;
The m modifier makes ^ and $ match at the beginning and end of each line rather than of the whole string, and the g modifier applies the replacement to every line instead of stopping after the first match.
Assuming you want to remove all text to the left of keyword1 and all text to the right of keyword2:
while (<>) {
s/.*(keyword1)/$1/;
s/(keyword2).*/$1/;
print;
}
Put this into a perl script and run it like this:
fix.pl original.txt > new.txt
Or if you just want to do this inplace, perhaps on several files at once:
perl -i.bak -pe 's/.*(keyword1)/$1/; s/(keyword2).*/$1/;' original.txt original2.txt
This will do inplace editing, renaming the original to have a .bak extension, use an implicit while-loop with print and execute the search and replace pattern before each print.
To be safe, verify it without the -i option first, or at the very least on only one file...
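If Perl quoting on the Windows command line is a nuisance, the two substitutions are easy to try elsewhere first. Here is a minimal Python sketch of the same two-step trim; `trim_line` and its parameter names are made up for the example, and keyword1/keyword2 are the same placeholders as above.

```python
import re


def trim_line(line, first="keyword1", last="keyword2"):
    """Drop everything before `first` and everything after `last`."""
    # Greedy .* means the last occurrence of `first` wins, as in the Perl.
    line = re.sub(r".*(" + re.escape(first) + r")", r"\1", line)
    # "." does not match a newline, so a trailing "\n" is preserved.
    line = re.sub(r"(" + re.escape(last) + r").*", r"\1", line)
    return line


print(trim_line("aaa keyword1 bbb keyword2 ccc"))
```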