I have an xml format file for bulk loading in SQL Server 2008. I need to replace the instance of TERMINATOR="\r\n" with TERMINATOR="\n". I'm trying to use the command line utility FART.exe. No matter how I try, I cannot seem to get the utility to recognize the characters.
I have tried passing the following combinations as the find string and replace string:
\r\n \n
"\r\n" "\n"
""\r\n"" ""\n""
^\r^\n ^\n
TERMINATOR=""\r\n"" TERMINATOR=""\n""
TERMINATOR=""^\r^\n"" TERMINATOR=""^\n""
and many, many more combinations. Every article I look up about escaping DOS commands tells me something different and I've tried them all with no luck. The truly strange thing is that not only does using \r\n \n not do what I want it to do (replace the literals) but it does not replace the carriage return + line feed characters at the end of each line in the file.
The format file I need to edit. I believe the line I've underlined in red is causing a problem:
The command window output from calling Jrepl. It's obvious that the utility has a problem with the red-underlined line from the other screenshot:
To replace all \r\n literals with \n, simply use:
fart yourFile.xml \r\n \n
To only replace within larger string TERMINATOR="\r\n", then use:
fart -C yourfile.xml TERMINATOR=\x22\\r\\n\x22 TERMINATOR=\x22\\n\x22
I prefer to use JREPL.BAT. It is much more powerful than FART in that it can use regular expressions, plus it has a number of additional advanced features. The only limitation relative to FART is that it can only process one file at a time.
To replace all \r\n literals with \n:
jrepl \r\n \n /l /f yourFile.xml /o -
To be more specific using a literal search string with quote escaped as \q and backslash as \\:
jrepl "TERMINATOR=\q\\r\\n\q" "TERMINATOR=\q\\n\q" /x /l /f yourFile.xml /o -
or using a regular expression:
jrepl "(TERMINATOR=\q)\\r(?=\\n\q)" "$1" /x /f yourFile.xml /o -
The simplest and quickest approach is the FART binary method, which replaces actual CR+LF byte pairs (0x0D 0x0A) with a single LF, for example:
fart.exe -C D:\sample.txt \x0d\x0a \x0a
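For comparison, here is a minimal Python sketch of the same literal replacement. It assumes the format file contains the four characters \r\n as text inside the attribute (not real CR/LF bytes); the function name fix_terminators and the sample FIELD line are made up for illustration.

```python
def fix_terminators(xml_text: str) -> str:
    """Replace the literal attribute TERMINATOR="\\r\\n" with TERMINATOR="\\n"."""
    # Raw strings keep the backslashes literal: we are matching the
    # characters backslash-r-backslash-n, not a real CR+LF pair.
    old = r'TERMINATOR="\r\n"'
    new = r'TERMINATOR="\n"'
    return xml_text.replace(old, new)


# Hypothetical line from a bulk-load format file.
sample = r'<FIELD ID="1" xsi:type="CharTerm" TERMINATOR="\r\n" MAX_LENGTH="12"/>'
print(fix_terminators(sample))
```

Because the search string is a plain literal, real line endings in the file are left alone, which is the same distinction the FART and JREPL commands above make between \r\n and \\r\\n.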
I have some 'fastq' format DNA sequence files (basically just text files) like this:
#Sample_1
ACTGACTGACTGACTGACTGACTGACTG
ACTGACTGACTGACTGACTGACTGACTG
+
BBBBBBBBBBBBEEEEEEEEEEEEEEEE
EHHHHKKKKKKKKKKKKKKNQQTTTTTT
#
+
#
+
#Sample_4
ACTGACTGACTGACTGACTGACTGACTG
ACTGACTGACTGACTGACTGACTGACTG
+
BBBBBBBBBBBBEEEEEEEEEEEEEEEE
EHHHHKKKKKKKKKKKKKKNQQTTTTTT
My ultimate goal is to turn these into 'fasta' format files, but to do that I need to get rid of the two empty sequences in the middle.
EDIT
The desired output would look like this:
#Sample_1
ACTGACTGACTGACTGACTGACTGACTG
ACTGACTGACTGACTGACTGACTGACTG
+
BBBBBBBBBBBBEEEEEEEEEEEEEEEE
EHHHHKKKKKKKKKKKKKKNQQTTTTTT
#Sample_4
ACTGACTGACTGACTGACTGACTGACTG
ACTGACTGACTGACTGACTGACTGACTG
+
BBBBBBBBBBBBEEEEEEEEEEEEEEEE
EHHHHKKKKKKKKKKKKKKNQQTTTTTT
All of the dedicated software I tried (Biopython, standalone programs, Perl scripts posted by others) crashes at the empty sequences. This is really just a problem of searching for the string #\n+ and replacing it with nothing. I googled this and read several posts and tried about a million options with sed and couldn't figure it out. Here are some things that didn't work:
sed s/'#'/,/'+'// test.fastq > test.fasta
sed s/'#,+'// test.fastq > test.fasta
Any insights would be greatly appreciated.
PS. I've got a Mac.
Try:
sed "/^[#+]*$/d" test.fastq > test.fasta
The d command tells sed to "delete" the matching line (i.e. not print it).
^ and $ anchor the pattern to the start and end of the line, so the whole line must consist only of # and + characters (because of the *, empty lines match as well).
So, the above command basically says:
Print every line that does not consist solely of # and + characters, and write the result to test.fasta.
Edit: I misunderstood the question slightly, sorry. If you want to only remove pairs of consecutive lines like
#
+
then you need to perform a multi-line search and replace.
Although this can be done with sed, it's perhaps easier to use something like a perl script instead:
perl -0pe 's/^#\n\+\n//gm' test.fastq > test.fasta
The -0 option sets Perl's input record separator to the NUL character, so a text file containing no NUL bytes is read in one shot instead of line by line (-0777 is the explicit "slurp" form). This enables multi-line search and replace.
The -p option loops over the input and prints the result after running the code given with -e (the pattern match and replacement in this case).
^#\n\+\n is the pattern to match, which we are replacing with nothing (i.e. deleting).
The /g modifier replaces every occurrence, and /m lets ^ match at the start of every line rather than only at the start of the input.
You could also pass -i as the first parameter to perl to edit the file in place, in which case the output redirection is not needed.
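The same whole-file substitution can be sketched in Python, in case Perl is not to hand. This is only an illustration: the function name drop_empty_records is made up, and the header marker is a parameter that defaults to # because that is what the sample in the question shows (standard FASTQ headers begin with @).

```python
import re


def drop_empty_records(text: str, marker: str = "#") -> str:
    """Remove every pair of consecutive lines that is just `marker` then '+'."""
    # re.MULTILINE makes ^ match at the start of each line, like Perl's /m.
    pattern = re.compile(r"^" + re.escape(marker) + r"\n\+\n", re.MULTILINE)
    return pattern.sub("", text)


# Shortened version of the sample from the question.
sample = (
    "#Sample_1\nACTG\n+\nBBBB\n"
    "#\n+\n"
    "#\n+\n"
    "#Sample_4\nACTG\n+\nEEEE\n"
)
print(drop_empty_records(sample), end="")
```

Reading the whole file into one string is what makes the match across the line break possible, which is the same reason the Perl one-liner needs -0.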
This may not be the most elegant solution in the world, but you can use tr to replace each \n with a NUL character, run sed over the resulting single line, and then convert the NULs back.
cat test.fastq | tr '\n' '\0' | sed 's/#\x0+\x0//g' | tr '\0' '\n' > test.fasta
Try this:
sed '/^#$/{N;/\n+$/d;}' file
When a line consisting only of # is found, N appends the next line to the pattern space.
If that next line consists only of +, the d command deletes both lines.
I know this type of search has been addressed in a few other questions here, but for some reason I cannot get it to work in my scenario.
I have a text file that contains something similar to the following pattern:
some text here done
12345678_123456 226-
more text
some more text here done
12345678_234567 226-
I'm trying to find all cases where done is followed by 226- on the next line, along with the 16 characters preceding it. I tried grep -Pzo and pcregrep -M but all return nothing.
I attempted multiple combinations of regex to take in account the 2 lines and the 16 chars in between. This is one of the examples I tried with grep:
grep -Pzo '(?s)done\n.\{16\}226-' filename
Related posts:
How to find patterns across multiple lines using grep?
Regex (grep) for multi-line search needed [duplicate]
How can I search for a multiline pattern in a file?
Generalize it to this (?m)done$\s+.*226-$
Because requiring a \n after 226- at end of string is a bad thing.
And not requiring a \n after 226- is also a bad thing.
Thus, the paradox is solved with (\n|$) but why the \n at all?
Both problems solved with multiline and $.
https://regex101.com/r/A33cj5/1
You must not escape { and } while using -P (PCRE) option in grep. That escaping is only for BRE.
You can use:
grep -ozP 'done\R.{16}226-\R' file
done
12345678_123456 226-
done
12345678_234567 226-
\R will match any unicode newline character. If you are only dealing with \n then you may just use:
grep -ozP 'done\n.{16}226-\n' file
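If grep's -z and -P options are not available, the same multi-line match can be sketched in Python. This is a rough equivalent under the assumption that the file uses plain \n line endings; the name find_done_blocks is made up, and the capture group just returns the 16 characters in front of 226-.

```python
import re

# "done" at the end of a line, then exactly 16 characters on the next line,
# then "226-" at the end of that line. Note the unescaped {16}, as with grep -P.
PATTERN = re.compile(r"done\n(.{16})226-$", re.MULTILINE)


def find_done_blocks(text: str) -> list:
    """Return the 16-character prefix of every '226-' line that follows 'done'."""
    return [m.group(1) for m in PATTERN.finditer(text)]


sample = (
    "some text here done\n"
    "12345678_123456 226-\n"
    "more text\n"
    "some more text here done\n"
    "12345678_234567 226-\n"
)
print(find_done_blocks(sample))
```

Using $ with re.MULTILINE instead of a trailing \n means the last record still matches when the file does not end with a newline.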
I'm trying to write a bash function that would escape all double quotes within single quotes, eg:
'I need to escape "these" quotes with backslashes'
would become
'I need to escape \"these\" quotes with backslashes'
My take on it was:
Find pairs of single quotes in the input and extract them with grep
Pipe into sed, escape double quotes
Sed again the whole input and replace grep match with sedded match
I managed to get it working to the part of having correctly escaped quotes section, but replacing it in the whole input fails.
The script code copypaste:
# $1 - Full name, $2 - minified name
adjust_quotes ()
{
SINGLE_QUOTES=`grep -Eo "'.*'" $2`
ESCAPED_QUOTES=`echo $SINGLE_QUOTES | sed 's|"|\\\\"|g'`
sed -r "s|'.*'|$ESCAPED_QUOTES|g" "$2" > "$2.escaped"
mv "$2.escaped" $2
echo "Quotes escaped within single quotes on $2"
}
Random additional questions:
In the console, escaping the quote with only two backslashes works, but when the code is put in the script I need four. I'd love to know why.
Could I modify this code into a loop to escape all pairs of single quotes, one after another until EOF?
Thanks!
P.S. I know this would probably be easier to do in eg. python, but I really need to keep it in bash.
Using BASH string replacement:
s='I need to escape "these" quotes with backslashes'
r="${s//\"/\\\"}"
echo "$r"
I need to escape \"these\" quotes with backslashes
Here's a pure bash solution, which does the transformation on stdin, printing to stdout. It reads the entire input into memory, so it won't work with really enormous files.
escape_enclosed_quotes() (
IFS=\'
read -d '' -r -a fields
for ((i=1; i<${#fields[@]}; i+=2)); do
fields[i]=${fields[i]//\"/\\\"}
done
printf %s "${fields[*]}"
)
I deliberately enclosed the body of the function in parentheses rather than braces, in order to force the body to run in a subshell. That limits the modification of IFS to the body, as well as implicitly making the variables used local.
The function uses the read builtin to read the entire input (since the line delimiter is set to NUL with -d '') into an array (-a) using a single quote as the field separator (IFS=\'). The result is that the parts of the input surrounded with single quotes are in the odd positions of the array, so the function loops over the odd indices to do the substitution only for those fields. I use bash's find-and-replace syntax instead of deferring to an external utility like sed.
This being bash, there are a couple of gotchas:
If the file contains a NUL, the rest of the file will be ignored.
If the last line of the file does not end with a newline, and the last character of that line is a single quote, it will not be output.
Both of the above conditions are impossible in a portable text file, so it's probably OK. All the same, worth taking note.
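For readers who want to check the odd-index idea outside of bash, here is a small Python sketch of the same logic. It is only an illustration of the technique (the question asks for bash), and it shares the same assumption as the function above: single quotes are balanced, so every odd-numbered field sits inside a quoted section.

```python
def escape_enclosed_quotes(text: str) -> str:
    """Backslash-escape double quotes that appear inside single-quoted sections."""
    # Splitting on the single quote puts the quoted sections at odd indices,
    # just like reading into an array with IFS set to a single quote.
    fields = text.split("'")
    for i in range(1, len(fields), 2):
        fields[i] = fields[i].replace('"', '\\"')
    return "'".join(fields)


sample = """say "hi" and 'I need to escape "these" quotes' done"""
print(escape_enclosed_quotes(sample))
```

If the input has an unmatched single quote, everything after it is treated as quoted, so the result is only meaningful for balanced input.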
The supplementary question: why are the extra backslashes needed in
ESCAPED_QUOTES=`echo $SINGLE_QUOTES | sed 's|"|\\\\"|g'`
Answer: It has nothing to do with that line being in a script. It has to do with your use of backticks (`...`) for command substitution, and the idiosyncratic and often unpredictable handling of backslashes inside backticks. This syntax is deprecated. Do not use it. (Not even if you see someone else using it in some random example on the internet.) If you had used the recommended $(...) syntax for command substitution, it would have worked as expected:
ESCAPED_QUOTES=$(echo $SINGLE_QUOTES | sed 's|"|\\"|g')
(More information is in the Bash FAQ linked above.)
I think my problem has something to do with escaping differences between using a regex within PHP versus using it at Bash commandline.
Here is my regex that is working in PHP:
$emailregex = '^[_a-z0-9-]+(\.[_a-z0-9-]+)*@[a-z0-9-]+(\.[a-z0-9-]+)*(\.[a-z]{2,6})$';
So I try giving the following at commandline and it doesn't seem to match anything.
(where emails.txt is a long plain text file with thousands of (possibly badly-formed) email addresses, one per line).
[root@host dir]# egrep '^[_a-z0-9-]+(\.[_a-z0-9-]+)*@[a-z0-9-]+(\.[a-z0-9-]+)*(\.[a-z]{2,6})$' emails.txt
I have tried surrounding the regex with double-quotemarks instead of single-quotemarks, but it made no difference.
Do I need to add some backslashes into the regex?
SOLVED! Thank you!
My file was created in Windows and the extra CR in the END-OF-LINE markers did not agree with the dollar sign in the regex.
Single quotes should work with bash...
It works for me with this simple case:
echo test@test.com | egrep '^[_a-z0-9-]+(\.[_a-z0-9-]+)*@[a-z0-9-]+(\.[a-z0-9-]+)*(\.[a-z]{2,6})$'
In your text file, the line has to only contain the email address. Any additional spaces on the line will throw it off. For example this doesn't print anything:
echo " test@test.com" | egrep '^[_a-z0-9-]+(\.[_a-z0-9-]+)*@[a-z0-9-]+(\.[a-z0-9-]+)*(\.[a-z]{2,6})$'
Your problem might be that you have a DOS-formatted file. In that case the extra \r will make it so that the regex doesn't match since it will think there's an extra character at the end of the line. You can run dos2unix against it, or make your regex less restrictive by removing the beginning and end markers from your regex:
egrep '[_a-z0-9-]+(\.[_a-z0-9-]+)*@[a-z0-9-]+(\.[a-z0-9-]+)*(\.[a-z]{2,6})'
Works for me:
JPP-MacBookPro-4:tmp jpp$ cat emails.txt
aa@bb.com
bb@cc.com
not an email
cc@dd.ee.ff
JPP-MacBookPro-4:tmp jpp$ egrep '^[_a-z0-9-]+(\.[_a-z0-9-]+)*@[a-z0-9-]+(\.[a-z0-9-]+)*(\.[a-z]{2,6})$' emails.txt
aa@bb.com
bb@cc.com
cc@dd.ee.ff
JPP-MacBookPro-4:tmp jpp$
Beware trailing whitespace, tabs, and carriage returns: they have a way of biting regexes
There is a great ref on shell quoting here http://www.mpi-inf.mpg.de/~uwe/lehre/unixffb/quoting-guide.html
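To see the CR problem in isolation, here is a short Python sketch that applies the same pattern with and without stripping the line ending. It assumes addresses use the usual @ separator; the names EMAIL_RE and valid_emails are made up for the example.

```python
import re

# The regex from the question, unchanged apart from being compiled in Python.
EMAIL_RE = re.compile(
    r"^[_a-z0-9-]+(\.[_a-z0-9-]+)*@[a-z0-9-]+(\.[a-z0-9-]+)*(\.[a-z]{2,6})$"
)


def valid_emails(lines):
    """Return the lines that match EMAIL_RE once any CR/LF ending is removed."""
    result = []
    for line in lines:
        # A Windows file leaves "\r" in front of "\n"; strip both before
        # matching, otherwise the "\r" sits between the address and the $.
        candidate = line.rstrip("\r\n")
        if EMAIL_RE.match(candidate):
            result.append(candidate)
    return result


dos_lines = ["aa@bb.com\r\n", "not an email\r\n", "cc@dd.ee.ff\r\n", " test@test.com\r\n"]
print(valid_emails(dos_lines))
print(EMAIL_RE.match("aa@bb.com\r"))  # the stray CR alone defeats the $ anchor
```

This has the same effect as running dos2unix first: the pattern stays strict, and only the line ending is normalised.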
I am looking for a way to remove 'stray' carriage returns occurring at the beginning or end of a file. ie:
\r\n <-- remove this guy
some stuff to say \r\n
some more stuff to say \r\n
\r\n <-- remove this guy
How would you match \r\n followed by 'nothing' or preceded by 'nothing'?
Try this regular expression:
^(\r\n)+|\r\n(\r\n)+$
Depending on the language, either the following regex with multiline mode off:
^\r\n|\r\n$
Or this regex:
\A\r\n|\r\n\z
The first one works in e.g. Perl, where ^ and $ match the beginning/end of the whole string by default and the beginning/end of each line only in multiline mode (/m). The latter works in e.g. Ruby, where ^ and $ always match at line boundaries.
Here's a sed version that should strip the file in place:
sed -i .bak -e '/./,$!d' -e :a -e '/^\n*$/{$d;N;ba' -e '}' foo.txt
The -i tells it to perform the edit in-place and the .bak tells it to back up the original with a .bak extension first. If disk space is a concern, you can use '' instead of .bak and no backup will be made. I don't recommend that unless absolutely necessary, though.
The first command ('/./,$!d') should get rid of all leading blank lines, and the rest handles all trailing blank lines.
See this list of handy sed 1-liners for other interesting things you can chain together.
^\s+|\s+$
\s is whitespace (space, \r, \n, tab)
+ is saying 1 or more
$ is saying at the end of the input
^ is saying at the start of the input
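As a concrete check of the anchored approach, here is a small Python sketch. It is an assumption-laden illustration rather than any of the answers above verbatim: it removes runs of \r\n only at the very start and very end of the text, uses Python's \A and \Z (where \Z plays the role of \z in Perl and Ruby), and the name strip_edge_crlf is made up.

```python
import re

# \A and \Z anchor to the start and end of the whole string regardless of
# flags, so CRLFs in the middle of the text are never touched.
EDGE_CRLF = re.compile(r"\A(?:\r\n)+|(?:\r\n)+\Z")


def strip_edge_crlf(text: str) -> str:
    """Remove runs of CRLF at the very beginning and very end of text."""
    return EDGE_CRLF.sub("", text)


sample = "\r\nsome stuff to say\r\nsome more stuff to say\r\n\r\n"
print(repr(strip_edge_crlf(sample)))
```

Note that this also removes the terminator of the last real line. Unlike ^\s+|\s+$, it leaves spaces and tabs at the edges alone, which may or may not be what you want.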