I have a CSV file that uses semicolons as separators, and I need to remove every line break that follows any character other than ; and ".
I have managed to find the positions, but removing the line breaks doesn't seem to work.
What I have:
100138;"Some data";"AB";"My text goes here";
100139;"Some data 2";"CH";"My text goes here";
100140;"Some data 3";"CH";"My text goes here
And it has new line here
But it is still part of quoted data
and ends here";
100141;"Some data 4";"CH";"Another nice text without semicolon"
What I need:
100138;"Some data";"AB";"My text goes here";
100139;"Some data 2";"CH";"My text goes here";
100140;"Some data 3";"CH";"My text goes here And it has new line here But it is still part of quoted data and ends here";
100141;"Some data 4";"CH";"Another nice text without semicolon"
I used (?<=[^("|;)])$ to find it but \n doesn't seem to change anything.
I am using Notepad++ for this.
$(?<=[^;])(?<=[^"])\R
$ Find end of line
(?<=[^;]) Must not end with ;
(?<=[^"]) Must not end with "
\R Match linebreak character(s)
Have a try with:
Find what: (?<![;"])\R
Replace with: NOTHING
This will remove all line breaks that aren't preceded by ; or ".
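For anyone who wants to do the same outside Notepad++, here is a minimal Python sketch of the negative-lookbehind idea. The function name join_broken_records and the filler parameter are made up for illustration. Python's re module has no \R, so the line-break alternatives are spelled out, and the filler defaults to a space because the desired output in the question has spaces where the breaks were; pass "" to mirror the "replace with nothing" behaviour.

```python
import re

# A line break that is NOT preceded by ';' or '"'.
# '\r' is also excluded in the lookbehind so that the '\n' of a
# ';\r\n' pair is not matched on its own.
BROKEN_BREAK = re.compile(r'(?<![;"\r])(?:\r\n|\n|\r)')


def join_broken_records(text: str, filler: str = " ") -> str:
    """Replace line breaks inside a record with `filler` (hypothetical helper)."""
    return BROKEN_BREAK.sub(filler, text)


sample = (
    '100139;"Some data 2";"CH";"My text goes here";\n'
    '100140;"Some data 3";"CH";"My text goes here\n'
    'And it has new line here\n'
    'and ends here";\n'
)
print(join_broken_records(sample))
```

Like the regex above, this only looks at the character before the break, so a quoted field whose broken line happens to end in ; or " would still be left split.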
I have large text files, in which sometimes long lines are broken into multiple lines by writing a = and then a newline character. (Enron email data from Kaggle). Since even words are broken this way and I want to do some machine learning with the data, I'd like to remove those breaks. As far as I can see the combination =\n is only used for these breaks, so if I remove those, I have the same information without the breaks and nothing gets lost.
I cannot use tr because it only replaces 1 character, but I have two characters to replace.
The sed command I am using so far to no avail is:
sed --in-place --quiet --regexp-extended 's/=\n//g' email_aa_edit
where email_aa_edit is a part of the Enron mail data (I used split to split it) and is my input file. However, this only produces an empty file and I am not sure why. Afaik = is not a special character by itself and the newline should be \n.
What is the correct way of removing those =\n occurrences?
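You can't remove newline characters this way: sed works line by line and strips the trailing newline before the pattern is applied, so s/=\n// never sees one. (The empty file comes from --quiet, which suppresses the automatic printing, so --in-place writes back nothing.) It is possible if you append the next line to the pattern space: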
sed ':a;/=$/{N;s/=\n//;ta}' file
details:
:a; # defines a label "a"
/=$/ { # if the line ends with =
N; # append the next line to the pattern space
s/=\n//; # replace the =\n
ta # jump to label "a" when something is replaced (that's always the case
# except if the last line ends with =)
}
Note: if your file uses the Windows newline sequence, change \n to \r\n.
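Since the data is headed for machine learning anyway, the same cleanup can be sketched in Python. This is only an illustration under the question's assumption that =\n is used exclusively for these soft breaks; the function name remove_soft_breaks is made up, and the optional \r covers Windows line endings.

```python
import re


def remove_soft_breaks(text: str) -> str:
    """Drop every '=' that is immediately followed by a line break."""
    # \r? makes the pattern work for both LF and CRLF files.
    return re.sub(r'=\r?\n', '', text)


sample = "This is a long li=\nne that was wra=\npped.\nNext line = fine\n"
print(remove_soft_breaks(sample))
```

Unlike sed, this works on the whole text at once, so for very large files you may prefer to process them in the pieces produced by split.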
Please see my text file data below:
roydwk27:teenaibuchytilibu5762sumonkhan:IJQRiq&76:8801627574057
deonnarsi15:latashajcclaypoolejcv5946sumonkhan:JKVWjv&20:8801627573929
ernaalo68:lindaohschletteoha1797sumonkhan:OPYZoy&84:8801628302709
dorathyshi56:fredrickaslperkinsonsle8932sumonkhan:STJKsj&30:8801621846709
londassg15:nataliaunmcredmondung5478sumonkhan:UVDEud&61:8801624792536
xiaoexu39:miriamfyboatwrightfyr3810sumonkhan:IJZAiz&47:8801626854856
I want to delete the first word of each line, up to and including the first colon,
like:
roydwk27:
deonnarsi15:
ernaalo68:
dorathyshi56:
What I actually want: if a line already starts with the part that contains sumonkhan, that is fine, but if something followed by : comes before that part, it needs to be removed.
This is how the data should look in my .txt file:
nataliaunmcredmondung5478sumonkhan:UVDEud&61:8801624792536
miriamfyboatwrightfyr3810sumonkhan:IJZAiz&47:8801626854856
Every line contains sumonkhan. If the line already starts with that part, it is fine; otherwise delete the leading word and its colon, not the whole line.
I hope this regex helps. It deletes everything up to and including the first colon (:).
If you are reading a file, read it line by line and run the following substitution on each line.
$str = 'roydwk27:teenaibuchytilibu5762sumonkhan:IJQRiq&76:8801627574057';
$str =~ s/^(?:.*?):(.*)/$1/g;
This code is in Perl; you can write equivalent code in any other language.
See this demo at regex101.com.
^[\w\d]+:(.*)
^ // match the beginning of a line
[\w\d]+ // match one or more word characters (letters, digits, underscore); \d is redundant since \w already includes digits
: // match ":" literally
( // start of the capturing group
.* // match any characters
) // end of capturing group
Each match now holds the text you want in its first capturing group. Note the g (global) and m (multiline) modifiers.
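A rough Python equivalent, in case it helps. It assumes one reading of the question: the leading field is removed only when it does not itself contain sumonkhan, so lines that are already clean are left alone. The names MARKER and strip_leading_field are made up for this sketch.

```python
import re

MARKER = "sumonkhan"                      # taken from the sample data
FIRST_FIELD = re.compile(r'^([^:]*):')    # everything up to the first colon


def strip_leading_field(line: str) -> str:
    """Remove the first 'word:' unless that word already contains MARKER."""
    m = FIRST_FIELD.match(line)
    if m is None or MARKER in m.group(1):
        return line                       # no colon, or already in the wanted form
    return line[m.end():]


line = "londassg15:nataliaunmcredmondung5478sumonkhan:UVDEud&61:8801624792536"
print(strip_leading_field(line))
```

Because of the MARKER check, running it a second time on the cleaned line changes nothing.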
I have a huge file, around half a gigabyte, with records like the ones shown below:
44 ,1577,23GRE ,GREASE THE ENGINE
44 ,1577,23GRE ,GREASE THE ENGINE
44 ,1577,24GRE ,GREASE THE WHEELS
I want to remove the whitespace before the commas and the whitespace after content such as "GREASE THE ENGINE", and convert the file as shown below using vi:
"44","1577","23GRE","GREASE THE ENGINE"
"44","1577","23GRE","GREASE THE ENGINE"
"44","1577","24GRE","GREASE THE WHEELS"
I tried removing the whitespace with the command :1,$s/ //g. This removes all the whitespace and renders the file as shown below, which defeats the purpose. I want GREASE THE ENGINE to keep its spaces.
44,1577,23GRE,GREASETHEENGINE
44,1577,23GRE,GREASETHEENGINE
44,1577,24GRE,GREASETHEWHEELS
Appreciate any or all help.
Thanks
You can use these substitute commands in vi:
:1,$s/[[:blank:]]*$//
:1,$s/[[:blank:]]*,/,/g
:1,$s/[^,][^,]*/"&"/g
First we remove trailing whitespace, then replace any whitespace followed by , with a single ,. Finally we match each field (a run of one or more characters that are not a comma) and wrap it in quotes.
[[:blank:]] will match a space or tab.
For your input it gives:
"44","1577","23GRE","GREASE THE ENGINE"
"44","1577","23GRE","GREASE THE ENGINE"
"44","1577","24GRE","GREASE THE WHEELS"
Another possibility:
%s/\s*,/","/g
%s/^/"/
%s/\s*$/"/
Replace each comma, together with any whitespace characters (space, tab, etc.) before it, with ",";
Add " at the beginning of every line;
Replace any trailing whitespace with " at the end of every line.
If you had an empty line at the end of the file, you will end up with a line of "" which is easy to delete.
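If editing half a gigabyte interactively turns out to be slow, the same transformation can be sketched in Python and run line by line. This is only an illustration: quote_fields is a made-up name, strip() trims whitespace on both sides of each field (the vi commands only trim before commas and at the end of the line), and embedded quotes or commas inside fields are not handled.

```python
def quote_fields(line: str) -> str:
    """Trim each comma-separated field and wrap it in double quotes."""
    fields = [field.strip() for field in line.rstrip("\r\n").split(",")]
    return ",".join(f'"{field}"' for field in fields)


print(quote_fields("44  ,1577,23GRE  ,GREASE THE ENGINE   \n"))
```

Applied to each line of an open file in a loop, this never holds more than one record in memory.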
I know how to add something to the end of every line, but how do I add text only at the end of the lines containing specific words?
Some line of text here
Tomatoes Oranges
Mili Deci Centi
Some line of text there
Fire Flame
Dog Cat
Tall Small
Some line of text with more text
Mother farher
-------
I want to add characters at the end of the lines containing "Some line", something like this:
Some line of text here EXTRATEXT
Tomatoes Oranges
Mili Deci Centi
Some line of text there EXTRATEXT
Fire Flame
Dog Cat
Tall Small
Some line of text with more text EXTRATEXT
Mother farher
-------
The lines end in different characters, so I need to search for a pattern that is inside the line, and add text at the end of those lines.
Replace the following pattern:
Some line.*
With:
$0 EXTRATEXT
This matches from Some line up to the end of the line (.*, as . matches any character but a newline).
You can then replace the whole match ($0) with itself followed by the extra text you want.
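The same idea as a minimal Python sketch, for scripted use. The function name and its needle/extra parameters are invented here; a function is used as the replacement so that backslashes in the extra text are not interpreted.

```python
import re


def append_to_matching_lines(text: str, needle: str = "Some line",
                             extra: str = " EXTRATEXT") -> str:
    """Append `extra` to every line that contains `needle`."""
    # '.' does not match '\n' by default, so '.*' stops at the end of the line.
    pattern = re.compile(re.escape(needle) + r'.*')
    return pattern.sub(lambda m: m.group(0) + extra, text)


sample = "Some line of text here\nTomatoes Oranges\nSome line of text there\n"
print(append_to_matching_lines(sample))
```

With Windows line endings the \r counts as an ordinary character for '.', so the extra text would land after it; normalising the line endings first avoids that.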
Try [a-zA-Z]+\n or \w+\n, or add \n+ at the end if you want to clean up empty lines too. Finally, if it matters that the first letter of the word is a capital: [A-Z][a-zA-Z]+\n
Why don't you try delimiting the regex pattern with a line break or a carriage return?
I think it might be achieved with \r\n at the end of the regex, on Notepad++.
I am reading a file and matching a regex against lines that have a hex number at the start, followed by a few dot-separated hex values, followed by an optional array name which may contain an optional index. For example:
010c10 00000000.00000000.0000a000.02300000 myFooArray[0]
while (my $rdLine = <RDHANDLE>) {
    chomp $rdLine;
    if ($rdLine =~ m/^([0-9a-z]+)[ \t]+([0-9.a-z]+)[ \t]*([A-Za-z_0-9]*)\[*[0-9]*\]*$/) {
        ...
My source file containing these hex strings is also script generated. This match works fine for some files but other files produced thru the exact same script (ie no extra spaces, formats etc) do not match when the last $ is present on the match condition.
If I modify the condition to not have the end $, lines match as expected.
Another curious thing is for debugging this, I added a print statement like this:
if ($rdLine =~ m/^([0-9a-z]+)[ \t]+/) {
    print "Hey first part matched for $rdLine \n";
}
if ($rdLine =~ m/^([0-9a-z]+)[ \t]+([0-9.a-z]+)/) {
    print "Hey second part matched for $rdLine \n";
}
The output on the terminal for the following input eats the first character :
010000 00000000 foo
"ey first part matched for 010000 00000000 foo
ey second part matched for 010000 00000000 foo"
If I remove the chomp, it prints the Hey correctly instead of just ey.
Any clues appreciated!
"other files produced thru the exact same script (ie no extra spaces, formats etc) do not match when the last $ is present on the match condition"
Although you deny it, I am certain that your file contains a single space character directly before the end of the line. You should check by using Data::Dump to display the true contents of each file record. Like this
use Data::Dump;
dd \$read_line;
It is probably best to use
$read_line =~ s/\s+\z//;
in place of chomp. That will remove all spaces and tabs, as well as line endings like carriage-return and linefeed from the end of each line.
"If I remove the chomp, it prints the Hey correctly instead of just ey."
It looks like you are working on a Linux machine, processing a file that was generated on a Windows platform. Windows uses the two characters CR LF as a record separator, whereas Linux uses just LF, so a chomp removes just the trailing LF, leaving CR to cause the start of the string to be overwritten.
If it wasn't for your secondary problem of having trailing whitespace, the best solution here would be to replace chomp $read_line with $read_line =~ s/\R\z//. The \R escape matches the Unicode idea of a line break sequence, and was introduced in version 10 of Perl 5. However, the aforementioned s/\s+\z// will deal with your line endings as well, and should be all that you need.
Borodin is right, \r\n is the culprit.
I used a less elegant solution, but it works:
$rdLine =~ s/\r//g;
followed by:
chomp $rdLine;
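To see the effect in isolation, here is a small Python sketch that reproduces it with the regex from the question. It assumes the file really has CRLF line endings, as suggested above; rstrip("\n") only approximates chomp on Linux, and rstrip() plays the role of s/\s+\z//.

```python
import re

# The pattern from the question, unchanged.
LINE = re.compile(
    r'^([0-9a-z]+)[ \t]+([0-9.a-z]+)[ \t]*([A-Za-z_0-9]*)\[*[0-9]*\]*$'
)

raw = "010c10 00000000.00000000.0000a000.02300000 myFooArray[0]\r\n"

only_lf_removed = raw.rstrip("\n")   # leaves the '\r' behind
fully_trimmed = raw.rstrip()         # removes '\r', '\n', spaces and tabs

print(repr(only_lf_removed))
print(LINE.match(only_lf_removed))           # the stray '\r' blocks the final $
print(LINE.match(fully_trimmed).groups())
```

The leftover \r is also what makes the terminal jump back to the start of the line, so the text printed after $rdLine overwrites the H of Hey.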