Regular Expression over multiple lines - regex

I've been stuck on this for several hours now and have cycled through a wealth of different tools to get the job done, without success. It would be fantastic if someone could help me out with this.
Here is the problem:
I have a very large CSV file (400mb+) that is not formatted correctly. Right now it looks something like this:
This is a long abstract describing something. What follows is the tile for this sentence."
,Title1
This is another sentence that is running on one line. On the next line you can find the title.
,Title2
As you can probably see the titles ",Title1" and ",Title2" should actually be on the same line as the foregoing sentence. Then it would look something like this:
This is a long abstract describing something. What follows is the tile for this sentence.",Title1
This is another sentence that is running on one line. On the next line you can find the title.,Title2
Please note that the end of the sentence can contain quotes or not. In the end they should be replaced too.
Here is what I came up with so far:
sed -n '1h;1!H;${;g;s/\."?.*,//g;p;}' out.csv > out1.csv
This should actually get the job done of matching the expression over multiple lines. Unfortunately it doesn't :)
The expression is looking for the dot at the end of the sentence and the optional quotes plus a newline character that I'm trying to match with .*.
Help much appreciated. And it doesn't really matter what tool gets the job done (awk, perl, sed, tr, etc.).

Multiline processing in sed isn't necessarily tricky per se; it just relies on commands that most people aren't familiar with and that have certain side effects. For example, when you use 'N' to append the next line to the pattern space, the current line is separated from the next one by a '\n'.
Anyway, it's much easier if you match on a line that starts with a comma to decide whether or not to remove the newline, so that's what I did here:
sed 'N;/\n,/s/"\? *\n//;P;D' title_csv
Input
$ cat title_csv
don't touch this line
don't touch this line either
This is a long abstract describing something. What follows is the tile for this sentence."
,Title1
seriously, don't touch this line
This is another sentence that is running on one line. On the next line you can find the title.
,Title2
also, don't touch this line
Output
$ sed 'N;/\n,/s/"\? *\n//;P;D' title_csv
don't touch this line
don't touch this line either
This is a long abstract describing something. What follows is the tile for this sentence.,Title1
seriously, don't touch this line
This is another sentence that is running on one line. On the next line you can find the title.,Title2
also, don't touch this line
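Since the question allows any tool, a streaming awk variant may also be worth a look for a file of this size. This is only a sketch, assuming the same rule as above (a line that starts with a comma belongs to the previous line); the sample lines and the names prev and joined are made up for illustration:

```shell
# Hypothetical sample data mirroring the question.
joined=$(printf '%s\n' \
  'keep this line' \
  'An abstract that ends with a quote."' \
  ',Title1' \
  'Another abstract without a quote.' \
  ',Title2' |
awk '
  # Line starts with a comma: strip an optional trailing quote (and spaces)
  # from the buffered line, then glue the current line onto it.
  NR > 1 && /^,/ { sub(/"? *$/, "", prev); prev = prev $0; next }
  # Any other line: the buffered line is complete, so print it.
  NR > 1 { print prev }
  # Buffer the current line until the next one has been seen.
  { prev = $0 }
  # Flush the last buffered line.
  END { if (NR) print prev }
')
printf '%s\n' "$joined"
```

Only one line is buffered at a time, so memory use should stay flat regardless of the file size.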

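Yours works with a couple of small changes:
sed -n '1h;1!H;${;g;s/\."\?\n,/.,/g;p;}' inputfile
The ? needs to be escaped (in a basic regular expression a bare ? is a literal question mark), and the newline should be matched explicitly with \n, because a greedy .* would swallow everything up to the last comma in the file. The replacement also has to put the period and the comma back; an empty replacement would delete them together with the newline. Keep in mind that this approach collects the whole file in the hold space before substituting.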
Here's another way to do it which doesn't require using the hold space:
sed -n '${p;q};N;/\n,/{s/"\?\n//p;b};P;D' inputfile
Here is a commented version:
sed -n '
# for the last input line: print and quit
${
p
q
}
# otherwise, append the next line
N
# if the appended line starts with a comma
/\n,/{
# delete an optional quote and the newline, and print the result
s/"\?\n//p
# branch to the end of the script to read the next line
b
}
# the appended line does not start with a comma, so print the first line
P
# delete the first line of the pair (it has just been printed) and loop to the top
D
' inputfile

Related

Add text to the end if not already added

I have the following lines:
source = "git::ssh://git#github.abc.com/test//bar"
source = "git::ssh://git#github.abc.com/test//foo?ref=tf12"
resource = "bar"
I want to update any lines that contain the words source and git by adding ?ref=tf12 to the end of the line, but inside the closing ". If the line already contains ?ref=tf12, it should be skipped. The expected result is:
source = "git::ssh://git#github.abc.com/test//bar?ref=tf12"
source = "git::ssh://git#github.abc.com/test//foo?ref=tf12"
resource = "bar"
I have the following expression using sed, but it outputs wrongly
sed 's#source.*git.*//.*#&?ref=tf12#' file.tf
source = "git::ssh://git#github.abc.com/test//bar"?ref=tf12
source = "git::ssh://git#github.abc.com/test//foo"?ref=tf12?ref=tf12
resource = "bar"
Using simple regular expressions for this is rather brittle; if at all possible, using a more robust configuration file parser would probably be a better idea. If that's not possible, you might want to tighten up the regular expressions to make sure you don't modify unrelated lines. But here is a really simple solution, at least as a starting point.
sed -e '/^ *source *= *"git/!b' -e '/?ref=tf12" *$/b' -e 's/" *$/?ref=tf12"/' file.tf
This consists of three commands. Remember that sed examines one line at a time.
/^ *source *= *"git/!b - if this line does not begin with source = "git (with optional spaces between the tokens), leave it alone. (! means "does not match" and b means "branch (to the end of this script)", i.e. skip this line and fetch the next one.)
/?ref=tf12" *$/b similarly says to leave alone lines which match this regex. In other words, if the line ends with ?ref=tf12" (with optional spaces after), don't modify it.
s/" *$/?ref=tf12"/ says to replace the last double quote (and any trailing spaces) with ?ref=tf12". This will only happen on lines which were not skipped by the two previous commands.
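As a quick check, the three commands can be run against the sample lines from the question. This is just a sketch; the variable name result is arbitrary:

```shell
# The three sample lines from the question, fed to sed on stdin.
result=$(printf '%s\n' \
  'source = "git::ssh://git#github.abc.com/test//bar"' \
  'source = "git::ssh://git#github.abc.com/test//foo?ref=tf12"' \
  'resource = "bar"' |
sed -e '/^ *source *= *"git/!b' \
    -e '/?ref=tf12" *$/b' \
    -e 's/" *$/?ref=tf12"/')
# Only the first line should gain the suffix, inside the closing quote.
printf '%s\n' "$result"
```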
sed '/?ref=tf12"/!s#\(source.*git.*//.*\)"#\1?ref=tf12"#' file.tf
/?ref=tf12"/! Only run the substitute command if this pattern (?ref=tf12") doesn't match.
\(...\)", \1 Instead of appending to the entire line using &, only match the line up to the last ". Use parentheses to capture everything before that " in a group, which I can then refer to with \1 in the replacement (where the " is re-added so that it doesn't get lost).

Remove newline after incorrect field splitting in csv file

I use linux and I'm trying to use sed for this. I download a CSV from an institutional site providing some data to be analyzed. There are several thousand lines per CSV, and many columns per row (I haven't counted them, but I think the number is useless). The fields are separated by semicolons and quoted, so the format per line is:
"Field 1";"Field 2";"Field 3"; .... ;"Field X";
Each correct line ends with semicolon and '\n'. The problem is that, from time to time, there's some field that incorrectly has a newline, and the solution is to delete the newline character, so the two lines go back to be together into only one. Example of an incorrect line:
"Field 1";"Field 2";"Fi
eld 3";"Field X";
I've found that there can be a \n right after the opening quote or somewhere between the quotes.
I've found a way to handle the first case, where the newline is right after the quote:
sed ':a;N;$!ba;s/";"\n/";"/g' file.csv
but not for "any number of alphabet characters after the quote not ending in semicolon". I have a pattern file (to be used with -f) with these lines:
:a;N;$!ba;s/";"\n/";"/g
:a;N;$!ba;s/\([A-z]\)\n/\1/g
:a;N;$!ba;s/\([:alpha:]\)\n/\1/g
The first line of the pattern file works, but I've tried combinations of the second and third and I always get an empty file.
If current line doesn't end with a semicolon, read and append next line to pattern space and remove line break.
sed '/[^;]$/{N;s/\n//}' file
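The one-liner above joins a single continuation line per broken record. If a record can be split more than once, a looping variant might help. This is a sketch under that assumption; the label a and the sample data are made up for illustration:

```shell
# Sample with one record broken across three lines, then an intact record.
fixed=$(printf '%s\n' \
  '"Field 1";"Field 2";"Fi' \
  'eld 3";"Fie' \
  'ld X";' \
  '"A";"B";' |
# While the pattern space does not end with a semicolon and more input
# exists: append the next line, remove the newline, and test again.
sed -e ':a' -e '/[^;]$/{' -e '$!{' -e 'N' -e 's/\n//' -e 'ba' -e '}' -e '}')
printf '%s\n' "$fixed"
```

The $! guard stops the loop at the last line, so a truncated final record is printed as it is instead of being lost.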

How do i delete first 2 lines which match with a text given by me ( using sed )?

How do i delete first 2 lines which match with a text given by me ( using sed ! )
E.g :
#file.txt contains following lines :
abc
def
def
abc
abc
def
And i want to delete first 2 "abc"
Using "sed"
While @EdMorton has pointed out that sed is not the best tool for this job (if you wonder why exactly, see my answer below and compare it to the awk code), my research showed that the solution to the generalized problem
Delete occurrences "N" through "M" of a line matching a given pattern using sed
is indeed a very tricky one, in my opinion. There seem to be many suggestions for how to replace the "N"th occurrence of a matching pattern with sed, but I found that deleting a specific matching line (or a range of lines) is a much more complex undertaking.
While the generalized problem with arbitrary values for N, M, and the pattern would probably be solved best by writing a "sed script generator" on the basis of a Finite State Machine, the solution to the special case asked by the OP is still simple enough to be coded by hand. I must admit that I wasn't very familiar with the obfuscated intricacies of the sed command syntax before, but I found this challenge to be quite useful for gaining more experience with non-trivial sed usage.
Anyway, here's my solution for deleting the first two occurrences of a line containing "abc" in a file. If there's a simpler approach, I'm eager to learn about it, as this has taken me some time now.
A final caveat: this assumes GNU sed, as I was unable to find a solution with POSIX sed:
sed -n ':1;/abc/{n;b2;};p;$b4;n;b1;:2;/abc/{n;b3;};p;$b4;n;b2;:3;p;$b4;n;b3;:4;q' file
or, in more verbose syntax:
sed -n '
# BEGIN - look for first match
:first;
/abc/ {
# First match found. Skip line and jump to second section
n; bsecond;
};
# Line does not match. Print it and quit if end-of-file reached
p; $bend;
# Advance to next line and start over
n; bfirst;
# END - look for first match
# BEGIN - look for second match
:second;
/abc/ {
# Second match found. Skip line and jump to final section
n; bfinal;
}
# Line does not match. Print it and quit if end-of-file reached
p; $bend;
# Advance to next line and start over
n; bsecond;
# END - look for second match
# BEGIN - both matches found; print remaining lines
:final;
# Print line and quit if end-of-file reached
p; $bend;
# Advance to next line and start over
n; bfinal;
# END - print remaining lines
# QUIT
:end;
q;
' file
sed is for simple substitutions on individual lines, that is all. For anything else you should be using awk:
$ awk '!(/abc/ && ++c<3)' file
def
def
abc
def
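The same idea seems to extend to the generalized problem mentioned earlier (deleting occurrences N through M). A minimal sketch, where the variable names n, m and pat are made up for illustration:

```shell
# n and m are the first and last matching occurrence to delete (inclusive);
# pat is the pattern to look for. Here: delete the 2nd and 3rd "abc".
result=$(printf '%s\n' abc def def abc abc def |
awk -v n=2 -v m=3 -v pat='abc' '
  # Count matching lines; skip those whose count falls within [n, m].
  $0 ~ pat && ++c >= n && c <= m { next }
  # Print everything else.
  { print }
')
printf '%s\n' "$result"
```

With n=1 and m=2 this should behave like the one-liner above.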

egrep search: every letter appears maximum 2 times

Ok, I've got this issue where I have a list of md5's and a word next to it separated with a space, and I need to filter out some lines.
Example snippet:
...
F08A4C9263AD215D70B9C216F0B385CB wrapup
7B286E6F0615D64ACD4A7BC3578871DD wrath
8E35BA3D27A7730840EB1694386F69A0 wrathful
096762EA6790EDA22BF2369347FD53AC wreak
56AC6677205EB591A7173BADBB610F5C wreath
A85C0CB6C0367AF9D23442DF56EC9E1C wreathe
9E44AAE612306D44B91C4162DB5C26B7 wreck
6D9C795CBB3075DC1A482F6F78DC6D68 wreckage
BD907BC4DC65934D133BD5C472B78CC0 wrench
758C70E9B6F437D49D98D92E28E95939 wrest
81A4471F73DFDA0B534F58F4E1501FAB wrestle
94183CC7C7A66338DE89DB9C7460A8A2 wretch
AFEED5CE5BACCEC17EC54E68A97CCD0F wriggle
...
I need a regular expression for (e)grep that pulls out every line where every letter (so [A-F]) appears only 2 times maximum.
so an example for that would be:
4F2048B829C2834A23832F28928DE38E turtle
If anyone can help me with this i'd appreciate it very much!
You could use:
egrep -v "^\S*([A-F])\S*\1\S*\1" inputfile
That lists every line in which no letter A-F occurs three or more times in the hash, i.e. in the first, whitespace-free field.
EDIT: changed to avoid matching uppercase characters in the words...
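If \S is not available (it is a GNU extension), a basic regular expression with a bracket expression may work instead. This is a sketch, assuming the hash and the word are separated by a single space as in the question:

```shell
# Two lines from the question: the first has three C characters in its hash,
# the second has no letter more than twice.
kept=$(printf '%s\n' \
  'F08A4C9263AD215D70B9C216F0B385CB wrapup' \
  '4F2048B829C2834A23832F28928DE38E turtle' |
# Drop lines whose first field contains the same letter A-F three times.
grep -v '^[^ ]*\([A-F]\)[^ ]*\1[^ ]*\1')
printf '%s\n' "$kept"
```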
you mentioned:
pulls out every line where every letter (so [A-F]) appears only 2
times maximum.
So my understanding is that, in a selected line, each letter in [A-F] may appear 0-2 times. Based on this, the following awk one-liner should do the job:
awk 'BEGIN{FS=""}{delete a;for(i=1;i<=NF;i++)if($i~/[A-F]/){a[$i]++;if(a[$i]>2)next}}1' file
Test
Note that no line in the given input satisfies your requirement, so I added the 'turtle' line at the end:
kent$ echo "F08A4C9263AD215D70B9C216F0B385CB wrapup
7B286E6F0615D64ACD4A7BC3578871DD wrath
8E35BA3D27A7730840EB1694386F69A0 wrathful
096762EA6790EDA22BF2369347FD53AC wreak
56AC6677205EB591A7173BADBB610F5C wreath
A85C0CB6C0367AF9D23442DF56EC9E1C wreathe
9E44AAE612306D44B91C4162DB5C26B7 wreck
6D9C795CBB3075DC1A482F6F78DC6D68 wreckage
BD907BC4DC65934D133BD5C472B78CC0 wrench
758C70E9B6F437D49D98D92E28E95939 wrest
81A4471F73DFDA0B534F58F4E1501FAB wrestle
94183CC7C7A66338DE89DB9C7460A8A2 wretch
AFEED5CE5BACCEC17EC54E68A97CCD0F wriggle
4F2048B829C2834A23832F28928DE38E turtle"|awk 'BEGIN{FS=""}{delete a;for(i=1;i<=NF;i++)if($i~/[A-F]/){a[$i]++;if(a[$i]>2)next}}1'
4F2048B829C2834A23832F28928DE38E turtle

What does this sed command do? Please explain its bits and pieces

Please explain this sed command?
sed -n "s/[^>]*>/ /gp"
What is gp?
It looks for non-greater-than characters preceding a greater-than symbol, and changes all of them to a single space. Thus, it will convert this input (where I've used _ to indicate a space):
foo>_bar> b
x>>_a
to
___b
___a
As Mark notes, "g" means global, and "p" means "print the line".
g means global, i.e. replace all occurrences, not just the first.
p means to print the modified line. Otherwise due to the -n switch it would not be printed.
The command finds all lines containing at least one > and prints some spaces followed by the text after the final >. The number of spaces printed is the number of > in the line.
For example if this line is in the input file:
123>456>789
Then this is printed (two leading spaces, one per >, followed by 789):
  789
I was typing up a long explanation, but Brian beat me to it. To clarify a tiny bit, the "p" prints the modified / matching line. The "-n" in your command line tells sed to "not print the file". Combined with the "p", it works kinda like grep, but within the scope of the script (ie, anything it changes/matches).
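To see the interplay of -n, g and p in one go, here is a small sketch with made-up input; the second line contains no > and is therefore not printed at all:

```shell
# Each "run of non-> characters followed by >" becomes one space (g),
# and only lines where a substitution happened are printed (-n plus p).
out=$(printf '%s\n' '123>456>789' 'no marker on this line' |
sed -n 's/[^>]*>/ /gp')
# Brackets make the two leading spaces visible.
printf '[%s]\n' "$out"
```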