How to split a string or file that may be delimited by a combination of comments and spaces, tabs, newlines, commas, or other characters - regex

If file: list.txt contains really ugly data like so:
aaaa
#bbbb
cccc, dddd; eeee
ffff;
#gggg hhhh
iiii
jjjj,kkkk ;llll;mmmm
nnnn
How do we parse/split that file, excluding the commented lines, delimiting it by all commas, semicolons, and all white-space (including tabs, spaces, and newline and carriage-return characters) with a bash script?

Using shell commands:
grep -v '^[[:space:]]*#' file | tr ';,' '\n\n' | awk '$1=$1'

It can be done with the following code:
#!/bin/bash
### read file:
file="list.txt"
array=()
while IFS= read -r line || [[ -n "$line" ]]; do
    ### skip lines that begin with a "#" or "<whitespace>#"
    match_pattern='^[[:space:]]*#'
    if [[ "$line" =~ $match_pattern ]]; then
        continue
    fi
    ### replace semicolons and commas with a space everywhere...
    temp_line=${line//[;,]/ }
    ### ...then split the line at whitespace (including a trailing carriage return);
    ### IFS is set for this one read only, so it never has to be saved and restored
    IFS=$' \t\r\n' read -r -a split_line_arr <<< "$temp_line"
    ### push each word in the split_line_arr onto the final array
    array+=("${split_line_arr[@]}")
done < "$file"
echo "Array items:"
for item in "${array[@]}"; do
    printf " %s\n" "$item"
done
This was not posed as a question, but rather as a better solution to what others have touched upon when answering other related questions. The bit that is unique here is that those other questions/solutions did not really address how to split a string when it is delimited with a combination of spaces, characters, and comments; this is one solution that addresses all three simultaneously...
Related questions:
How to split one string into multiple strings separated by at least one space in bash shell?
How do I split a string on a delimiter in Bash?
Additional notes:
Why do this with bash when other scripting languages are better suited for splitting? A bash script is more likely to have all the libraries it needs when running from a basic upstart or cron (sh) shell, compared with a perl program for example. An argument list is often needed in these situations and we should expect the worst from people who maintain those lists...
Hopefully this post will save bash newbies a lot of time in the future (including me)... Good luck!
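A more compact variant of the script above can be sketched by listing every delimiter in IFS for a single read call, instead of replacing characters first. This assumes bash; split_tokens is a hypothetical helper name, not part of the original post. Because a non-whitespace IFS character such as "," or ";" can delimit an empty field (for example in "a,,b"), empty words are dropped explicitly.

```bash
#!/bin/bash
# Hypothetical helper (not from the original post): read text on stdin and
# print one token per line, skipping comment lines.
split_tokens() {
    local line word
    local -a words
    while IFS= read -r line || [[ -n $line ]]; do
        # skip lines that begin with "#" or "<whitespace>#"
        if [[ $line =~ ^[[:space:]]*# ]]; then
            continue
        fi
        # IFS applies to this read only: split on space, tab, CR, ";" and ","
        IFS=$' \t\r;,' read -r -a words <<< "$line"
        for word in "${words[@]}"; do
            # drop empty fields such as the one produced by ",,"
            if [[ -n $word ]]; then
                printf '%s\n' "$word"
            fi
        done
    done
}

printf 'aaaa\n  #bbbb\ncccc, dddd; eeee\nffff;\njjjj,kkkk ;llll;mmmm\n' | split_tokens
```

To collect the tokens in an array instead of printing them, something like mapfile -t array < <(split_tokens < list.txt) should work in bash 4 or later.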

sed 's/[# \t,]/REPLACEMENT/g' input.txt
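The command above replaces comment characters ('#'), spaces (' '), tabs ('\t'), and commas (',') with an arbitrary string ('REPLACEMENT'). The \t escape inside the brackets is understood by GNU sed; other implementations may need a literal tab.
To replace newlines as well, you can pipe the result through tr. Because tr translates character for character, the newline can only be replaced by a single character, for example a space:
sed 's/[# \t,]/REPLACEMENT/g' input.txt | tr '\n' ' '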
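If the newlines should also become the arbitrary string, one option is to do the whole job in awk. The sketch below is only an illustration: replace_delims is a hypothetical name, and the replacement is assumed to contain no "&" or backslash, since both are special in awk's -v assignment and in gsub's replacement text.

```bash
#!/bin/bash
# Hypothetical helper: replace each delimiter character, and each newline,
# with the string given as the first argument.
replace_delims() {
    awk -v rep="$1" '
    {
        gsub(/[# \t,]/, rep)       # same character set as the sed command above
        printf "%s%s", $0, rep     # emit rep where the newline used to be
    }'
}

printf 'aaaa, bbbb\n#cccc\n' | replace_delims '|'
# prints: aaaa||bbbb||cccc|
```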

if you have Ruby on your system
File.open("file").each_line do |line|
next if line[/^\s*#/]
puts line.split(/\s+|[;,]/).reject{|c|c.empty?}
end
output
# ruby test.rb
aaaa
cccc
dddd
eeee
ffff
iiii
jjjj
kkkk
llll
mmmm
nnnn

Replace newline in quoted strings in huge files

I have a few huge files with values separated by a pipe (|) sign.
The strings are quoted, but sometimes there is a newline inside a quoted string.
I need to read these files with an external table from Oracle, but it raises errors on the newlines. So I need to replace them with a space.
I run some other perl commands on these files to fix other errors, so I would like a solution as a one-line perl command.
I've found some other similar questions on stackoverflow, but they don't quite do the same thing, and I can't solve my problem with the solutions mentioned there.
The statement I tried but that isn't working:
perl -pi -e 's/"(^|)*\n(^|)*"/ /g' test.txt
Sample text:
4454|"test string"|20-05-1999|"test 2nd string"
4455|"test newline
in string"||"test another 2nd string"
4456|"another string"|19-03-2021|"here also a newline
"
4457|.....
Should become:
4454|"test string"|20-05-1999|"test 2nd string"
4455|"test newline in string"||"test another 2nd string"
4456|"another string"|19-03-2021|"here also a newline "
4457|.....
Sounds like you want a CSV parser like Text::CSV_XS (Install through your OS's package manager or favorite CPAN client):
$ perl -MText::CSV_XS -e '
my $csv = Text::CSV_XS->new({sep => "|", binary => 1});
while (my $row = $csv->getline(*ARGV)) {
$csv->say(*STDOUT, [ map { tr/\n/ /r } @$row ])
}' test.txt
4454|"test string"|20-05-1999|"test 2nd string"
4455|"test newline in string"||"test another 2nd string"
4456|"another string"|19-03-2021|"here also a newline "
This one-liner reads each record using | as the field separator instead of the normal comma, and for each field, replaces newlines with spaces, and then prints out the transformed record.
In your specific case, you can also consider a workaround using GNU sed or awk.
An awk command will look like
awk 'NR==1 {print;next;} /^[0-9]{4,}\|/{print "\n" $0;next;} {print " " $0}' ORS="" file > newfile
The ORS (output record separator) is set to an empty string, which means that \n is only added before lines starting with four or more digits followed with a | char (matched with a ^[0-9]{4,}\| POSIX ERE pattern). Every other line is a continuation of the previous one and is appended after a single space.
A GNU sed command will look like
sed -i ':a;$!{N;/\n[0-9]\{4,\}|/!{s/\n/ /;ba}};P;D' file
This reads two consecutive lines into the pattern space, and if the second line doesn't start with four or more digits followed with a | char (see the [0-9]\{4,\}| POSIX BRE regex pattern), the line break between the two is replaced with a space. The search and replace repeats until there is no match or the end of the file is reached.
With perl, if the file is huge but it can still fit into memory, you can use a short one-liner:
perl -0777 -pi -e 's/\R++(?!\d{4,}\|)/ /g' file
With -0777, perl slurps the whole file, and the \R++(?!\d{4,}\|) pattern matches one or more line breaks (\R++) not followed with four or more digits followed with a | char. The ++ possessive quantifier is required so that the (?!...) negative lookahead cannot succeed by backtracking into the line break match.
With your shown samples, this can be done simply in awk. Written and tested in GNU awk; it should work in any awk. It should also be fast on huge files, since it does not slurp the whole file into memory (the OP mentioned that the files may be huge).
awk 'gsub(/"/,"&")%2!=0{if(val==""){val=$0} else{print val " " $0;val=""};next} 1' Input_file
Explanation: a detailed breakdown of the command above.
awk '                             ##Start the awk program.
gsub(/"/,"&")%2!=0{               ##Count the " characters on the line; an ODD count means a quoted string is opened or closed across a line break.
  if(val==""){ val=$0 }           ##If val is empty, this line opens the string: save the line in val.
  else {print val " " $0;val=""}  ##Otherwise this line closes the string: print val, a space, and the current line, then reset val.
  next                            ##Skip the remaining statements for this line.
}
1                                 ##Lines with an EVEN number of " skip the gsub condition above and are printed as they are.
' Input_file                      ##Name of the input file.
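The command above pairs one opening line with one closing line. If a quoted string may span more than two lines, the same quote-counting idea can be sketched with a running buffer, as below. This assumes that quotes are never escaped inside a field; join_quoted is a hypothetical name, not something from the answers here.

```bash
#!/bin/bash
# Hypothetical helper: join physical lines until the number of double quotes
# collected so far is even, replacing each embedded newline with a space.
join_quoted() {
    awk '
    {
        buf = (buf == "" ? $0 : buf " " $0)      # append this line to the pending record
        n = gsub(/"/, "&", buf)                  # count the quotes; buf itself is unchanged
        if (n % 2 == 0) { print buf; buf = "" }  # quotes are balanced: the record is complete
    }
    END { if (buf != "") print buf }             # flush a record that was never closed
    ' "$@"
}

printf '%s\n' '4455|"test newline' 'in string"||"test another 2nd string"' | join_quoted
# prints: 4455|"test newline in string"||"test another 2nd string"
```

Like the one-liner, this reads the file as a stream, so memory use stays small unless a quote is left unclosed.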

BASH - Replacement of regex match within a file

Given the following files:
input_file:
My inputfile, contains multiple line
and also special characters {}[]ä/
template_file:
Contains multiple lines,
also special characters {}[]ä/
##regex_match## <= must be replaced by input_file
Content goes on
abc
output_file:
Contains multiple lines,
also special characters {}[]ä/
My inputfile, contains multiple line
and also special characters {}[]ä/
Content goes on
abc
I thought about sed but that would be very cumbersome because of escaping and newlines. Is there any other solution in BASH?
perl solution just for variety's sake.
perl -0777 -lpe'
BEGIN {
open $fh, "<", "input_file";
$input = $fh->getline
}
s/##regex_match##/$input/
' < template_file > output_file
sed -n -e '/##regex_match##/{r input_file' -e 'b' -e '}; p' template_file
If the regex is matched, read and output the input file and branch (end processing of the line and don't print it). Otherwise print the line.
The use of -e delimits parts of the sed commands so that the r command which reads the input file knows where the name of the file ends. Otherwise it would greedily consume the following sed commands as if they were part of the file name.
The curly braces delimit a block in the program that's like an if statement.
I tested this on MacOS, but it should be pretty similar for GNU. MacOS sed is much pickier about -e (among other differences which don't come into play here).
A very slight variation on the technique Dennis Williamson already posted, merely for discussion purposes -
sed '/##regex_match##/ {
r input_file
d
}' template_file
Contains multiple lines,
also special characters {}[]ä/
My inputfile, contains multiple line
and also special characters {}[]ä/
Content goes on
abc
c.f. the manual.
He used -e options to pass commands, where I separated them with newlines. Usually a semicolon is enough, but apparently r makes other commands on the same line get ignored.
The d prevents the tag pattern from being printed.
With any awk in any shell on every UNIX box and with any characters:
$ awk 'NR==FNR{rec=rec sep $0; sep=ORS; next} /##regex_match##/{$0=rec} 1' input_file template_file
Contains multiple lines,
also special characters {}[]ä/
My inputfile, contains multiple line
and also special characters {}[]ä/
Content goes on
abc
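Since the question asked for something in bash, here is a pure-bash sketch of the same replacement, for comparison with the sed and awk answers. fill_template is a hypothetical name. The tag is matched as a plain substring and the input file is copied with cat, so nothing in either file needs escaping.

```bash
#!/bin/bash
# Hypothetical helper: print TEMPLATE, replacing every line that contains
# the tag with the full contents of INPUT.
# usage: fill_template TEMPLATE INPUT
fill_template() {
    local line
    while IFS= read -r line || [[ -n $line ]]; do
        if [[ $line == *'##regex_match##'* ]]; then
            cat -- "$2"               # emit the input file verbatim
        else
            printf '%s\n' "$line"     # copy every other template line unchanged
        fi
    done < "$1"
}
```

It would be called as fill_template template_file input_file > output_file. A read loop like this is slower than sed or awk on large files, so it is mainly useful when no external tools should be involved.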

Getting rid of all words that contain a special character in a textfile

I'm trying to filter out all the words that contain any character other than a letter from a text file. I've looked around stackoverflow, and other websites, but all the answers I found were very specific to a different scenario and I wasn't able to replicate them for my purposes; I've only recently started learning about Unix tools.
Here's an example of what I want to do:
Input:
#derik I was there and it was awesome! !! http://url.picture.whatever #hash_tag
Output:
I was there and it was awesome!
So words with punctuation can stay in the file (in fact I need them to stay) but any substring with special characters (including those of punctuation) needs to be trimmed away. This can probably be done with sed, but I just can't figure out the regex. Help.
Thanks!
Here is how it could be done using Perl:
perl -ane 'for $f (@F) {print "$f " if $f =~ /^([a-zA-Z-\x27]+[?!;:,.]?|[\d.]+)$/} print "\n"' file
I am using this input text as my test case:
Hello,
How are you doing?
I'd like 2.5 cups of piping-hot coffee.
#derik I was there; it was awesome! !! http://url.picture.whatever #hash_tag
output:
Hello,
How are you doing?
I'd like 2.5 cups of piping-hot coffee.
I was there; it was awesome!
Command-line options:
-n loop around every line of the input file, do not automatically print it
-a autosplit mode: split input lines into the @F array. Defaults to splitting on whitespace
-e execute the perl code
The perl code splits each input line into the @F array, then loops over every field $f and decides whether or not to print it.
At the end of each line, print a newline character.
The regular expression ^([a-zA-Z-\x27]+[?!;:,.]?|[\d.]+)$ is used on each whitespace-delimited word
^ starts with
[a-zA-Z-\x27]+ one or more lowercase or capital letters or a dash or a single quote (\x27)
[?!;:,.]? zero or one of the following punctuation: ?!;:,.
(|) alternately match
[\d.]+ one or more numbers or .
$ end
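The same word-by-word test can be sketched in bash with the =~ operator, using an ERE equivalent of the pattern broken down above. filter_words is a hypothetical name, and the letter ranges assume an ASCII-like locale.

```bash
#!/bin/bash
# Hypothetical helper: keep only the words made of letters, dashes and
# apostrophes (optionally followed by one punctuation mark), or of digits and dots.
filter_words() {
    # ERE version of the Perl pattern; $'...' quoting lets \' stand for an apostrophe
    local re=$'^([A-Za-z\'-]+[?!;:,.]?|[0-9.]+)$'
    local line word
    local -a words kept
    while IFS= read -r line || [[ -n $line ]]; do
        read -r -a words <<< "$line"      # split the line on whitespace
        kept=()
        for word in "${words[@]}"; do
            if [[ $word =~ $re ]]; then
                kept+=("$word")
            fi
        done
        printf '%s\n' "${kept[*]}"        # join the surviving words with single spaces
    done
}

printf '%s\n' '#derik I was there; it was awesome! !! http://url.picture.whatever #hash_tag' | filter_words
# prints: I was there; it was awesome!
```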
Your requirements aren't clear at all but this MAY be what you want:
$ awk '{rec=sep=""; for (i=1;i<=NF;i++) if ($i~/^[[:alpha:]]+[[:punct:]]?$/) { rec = rec sep $i; sep=" "} print rec}' file
I was there and it was awesome!
sed -E 's/[[:space:]][^a-zA-Z0-9[:space:]][^[:space:]]*//g' will get rid of any words starting with punctuation, which gets you halfway there.
[[:space:]] is any whitespace character
[^a-zA-Z0-9[:space:]] is any special character
[^[:space:]]* is any number of non whitespace characters
Do it again with a ^ in place of the first [[:space:]] to remove those same words at the start of the line.

tr '\n\t+' command not working in shell bash?

Text1 Text2
(3 tabs) text 3
(4 tabs) text 4
(2 tabs) text 5
Text2 Text7
(2 tabs) Text8
I have a text file in the above format. I want to replace each newline followed by one or more tabs with a special character. I am using this command:
tr '\n\t+' '#'
I am expecting this output
Text1 Text2#text 3#text 4#text 5
Text2 Text7#Text8
This regex works fine with Eclipse find and replace (and also with EditPlus). However, tr puts everything on one line.
Can anyone tell me what the problem is with tr and this regex, and how to resolve it?
That is the wrong use of the tr command. It lets you translate one character (or character class) into another, but you cannot use it for regex string replacements like this.
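A small demonstration of that point: tr reads its first argument as a set of three separate characters (newline, tab, plus sign), not as a pattern. The second set is written out at full length here because padding of a shorter set differs between tr implementations.

```bash
#!/bin/bash
# Each character of the first set maps to the character at the same position
# in the second set; tr never treats "\n\t+" as a regular expression.
printf 'Text1\n\t\ttext 3\n' | tr '\n\t+' '###'
# prints: Text1###text 3#
```

Every newline becomes a "#", including those not followed by a tab, which is why the whole file ends up on one line.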
You can use gnu sed instead:
sed ':a;N;$!ba;s/\n\t\+/#/g;' file
Text1 Text2#text 3#text 4#text 5
Text2 Text7#Text8
There are 2 parts of this sed command:
:a;N;$!ba;: Appends each following line to the pattern space via the N command (a loop that reads the entire input up front before the string substitution is applied)
s/\n\t\+/#/g; Replaces every newline followed by 1 or more tabs by #
EDIT: Here is a non-gnu sed version that worked on OSX also:
sed -e ':a' -e 'N' -e '$!ba' -e $'s/\\n\t\t*/#/g' file
@anubhava's helpful answer explains why tr doesn't work here, but the pure sed solution has a slight drawback (aside from being somewhat difficult to understand): it reads the entire input file into memory before performing the desired string substitution (which may be perfectly fine for smaller files).
IF you:
have GNU awk or mawk
and don't mind combining awk and sed
here's a solution that doesn't read the entire input all at once:
awk -v RS='\n\t+' -v ORS=# '1' file | sed '$d'
-v RS='\n\t+' assigns to RS, the [input] record separator, which breaks the input (potentially across lines) into records separated by a newline followed by at least 1 tab. Note that it's the use of a regex as the record separator that is not POSIX-compliant and thus requires GNU awk or mawk.
-v ORS=# assigns # to variable ORS, the output record separator.
1 constitutes the entire awk program in this case: it is a common shortcut that is effectively the same as {print}, i.e., it simply outputs each input record, followed by ORS, the output record separator.
However, since every record, including the last one, is terminated with ORS, we end up with \n# at the end of the output, which is undesired.
sed '$d' simply deletes that last line from the output ($ matches the last line, and d deletes it).
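If neither GNU awk nor mawk is available, the same streaming behaviour can be sketched with plain POSIX awk features by holding back one output line at a time. join_indented is a hypothetical name, and this is only one possible way to write it.

```bash
#!/bin/bash
# Hypothetical helper: append every line that starts with tabs to the line
# before it, separated by "#", without reading the whole input into memory.
join_indented() {
    awk '
    /^\t+/ && NR > 1 { sub(/^\t+/, ""); buf = buf "#" $0; next }  # continuation: glue onto buf
    NR > 1           { print buf }                                # previous record is complete
                     { buf = $0 }                                 # start a new record
    END              { if (NR) print buf }                        # flush the last record
    ' "$@"
}

printf 'Text1 Text2\n\t\t\ttext 3\n\t\ttext 5\nText2 Text7\n\t\tText8\n' | join_indented
# prints:
# Text1 Text2#text 3#text 5
# Text2 Text7#Text8
```

No trailing "#" is produced, so the extra sed '$d' step is not needed with this variant.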

How to rejoin words that are split across lines with a hyphen in a text file

OCR texts often have words that flow from one line to another with a hyphen at the end of the first line (i.e., the word has '-\n' inserted in it).
I would like to rejoin all such split words in a text file (in a linux environment).
I believe this should be possible with sed or awk, but the syntax for these is dark magic to me! I knew a text editor in windows that did regex search/replace with newlines in the search expression, but am unaware of such in linux.
Make sure to back up ocr_file before running as this command will modify the contents of ocr_file:
perl -i~ -e 'BEGIN{$/=undef} ($f=<>) =~ s#-\s*\n\s*(\S+)#$1\n#mg; print $f' ocr_file
This answer is relevant, because I want the words joined together... not just a removal of the dash character.
cat file| perl -CS -pe's/-\n//'|fmt -w52
is the short answer, but uses fmt to reform paragraphs after the paragraphs were mangled by perl.
without fmt, you can do
#!/usr/bin/perl
use open qw(:std :utf8);
undef $/; $_=<>;
s/-\n(\w+\W+)\s*/$1\n/sg;
print;
also, if you're doing OCR, you can use this perl one-liner to convert unicode utf-8 dashes to ascii dash characters. note the -CS option to tell perl about utf-8.
# 0x2010 - 0x2015 hyphens and dashes to ascii dash (0x2009 is a thin space, not a dash)
perl -CS -pe 'tr/\x{2010}-\x{2015}/-/'
cat file | perl -p -e 's/-\n//'
If the file has windows line endings, you'll need to catch the cr-lf with something like:
cat file | perl -p -e 's/-\s\n//'
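For anyone who would rather not use perl, the rejoining can also be sketched as a streaming awk filter that moves the first word of the next line up onto the hyphenated line, so that line lengths stay roughly the same. dehyphenate is a hypothetical name. The sketch treats every hyphen at the end of a line as a split word, which will be wrong for text that legitimately ends a line with a dash.

```bash
#!/bin/bash
# Hypothetical helper: rejoin words that were split across two lines with "-".
dehyphenate() {
    awk '
    {
        sub(/\r$/, "")                              # tolerate Windows line endings
        if (split_word) {                           # the previous line ended in "-"
            split_word = 0
            if (match($0, /^[^ \t]+[ \t]*/)) {      # first word plus the gap after it
                head = substr($0, 1, RLENGTH)
                $0 = substr($0, RLENGTH + 1)        # the rest stays on this line
                sub(/[ \t]+$/, "", head)
                print held head                     # complete the word on the upper line
                if ($0 == "") next                  # nothing else was on this line
            } else {
                print held "-"                      # no word follows: keep the hyphen
            }
        }
        if ($0 ~ /-$/) { held = substr($0, 1, length($0) - 1); split_word = 1 }
        else print
    }
    END { if (split_word) print held "-" }          # the last line ended in a hyphen
    ' "$@"
}

printf 'an exam-\nple of OCR\ntext\n' | dehyphenate
# prints:
# an example
# of OCR
# text
```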
Hey this is my first answer post, here goes:
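I suspect '-\n' is a hyphen followed by the line-feed character. You can use sed to remove these. Note that the test below uses a literal backslash followed by the letter n; sed reads its input one line at a time, so a real line break never appears in the pattern space unless lines are joined first (for example with the N command or GNU sed's -z option). You could try the following as a test: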
1) create a test file:
echo "hello this is a test -\n" > testfile
2) check the file has the expected contents:
cat testfile
3) test the sed command, this sends the edited text stream to standard out (ie your active console window) without overwriting anything:
sed 's/-\\n//g' testfile
(you should just see 'hello this is a test ' printed to the console without the '-\n')
If I build up the command:
a) First off you have the sed command itself:
sed
b) Secondly the expression and sed specific controls need to be in quotations:
sed 'sedcontrols+regex' (the text in quotations isn't what you'll actually enter, we'll fill this in as we go along)
c) Specify the file you are reading from:
sed 'sedcontrols+regex' testfile
d) To delete the string in question, sed needs to be told to substitute the unwanted characters with nothing (null,zero), so you use 's' to substitute, forward-slash, then the unwanted string (more on that in a sec), then forward-slash again, then nothing (what it's being substituted with), then forward-slash, and then the flags. In this case I will select 'g', which represents global, as in every match on each line rather than only the first one. So now we have:
sed 's/regex//g' testfile
e) We need to add in the unwanted string, but it gets confusing because if there is a backslash in your string, it needs to be escaped with another backslash. So, the unwanted string
-\n ends up looking like -\\n
We can output the edited text stream to stdout as follows:
sed 's/-\\n//g' testfile
To save the results without overwriting anything (assuming testfile2 doesn't exist) we can redirect the output to a file:
sed 's/-\\n//g' testfile >testfile2
sed -z 's/-\n//g' file_with_hyphens