How to remove the milliseconds from timestamps with sed? - regex

My input file is as follows:
12/13/2011,07:14:13.724,12/13/2011 07:14:13.724,231.56.3.245,LasVegas,US
I wish to get the following:
12/13/2011,07:14:13,12/13/2011 07:14:13,231.56.3.245,LasVegas,US
I tried this, but with no success:
sed "s/[0-9]{2}\:[0-9]{2}\:[0-9]{2}\(\.[0-9]{1,3}\)/\1/g" input_file.csv > output.csv

sed 's/\(:[0-9][0-9]\)\.[0-9]\{3\}/\1/g' input_file.csv > output.csv
You were almost there. In classic sed, you have to put backslashes in front of parentheses and braces to make them into metacharacters. Many versions of sed have an option (-E, or -r in GNU sed) that inverts this, so that braces and parentheses are metacharacters by default, but historically that has not been reliable across platforms.
Also (strong recommendation): use single quotes around the sed command. Otherwise, the shell gets a crack at interpreting those backslashes (and any $ signs, etc) before sed sees it. Usually, this confuses the coder (and especially the maintaining coder). In fact, use single quotes around arguments to programs whenever you can. Don't get paranoid about it - if you need to interpolate a variable, do so. But single-quoting is generally easier to code, and ultimately easier to understand.
I chose to work on just one time unit; you were working on three. Ultimately, given systematically formed input data, there is no difference in the result - but there is a (small) difference in the readability of the script.
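A minimal sketch of the two spellings, run against the sample line from the question. The variable names are only for illustration, and the second command assumes a sed that accepts -E (GNU and BSD sed do):

```shell
line='12/13/2011,07:14:13.724,12/13/2011 07:14:13.724,231.56.3.245,LasVegas,US'

# Basic regular expressions: \( \) and \{ \} need backslashes to be special.
bre=$(printf '%s\n' "$line" | sed 's/\(:[0-9][0-9]\)\.[0-9]\{3\}/\1/g')

# Extended regular expressions (-E): ( ) and { } are special as written.
ere=$(printf '%s\n' "$line" | sed -E 's/(:[0-9]{2})\.[0-9]{3}/\1/g')

printf '%s\n%s\n' "$bre" "$ere"
```

Both commands should print the line with the milliseconds removed. The IP address is left alone because none of its dots follow a colon and two digits.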

Try:
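sed 's,\(:[0-9]\{2\}\)\.[0-9]\{3\},\1,g'
Note that \d is generally not a shorthand for [0-9] in sed (GNU sed, for example, reads \dNNN as a character given by its decimal code), so stick with [0-9] or [[:digit:]].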

You were near but some characters are special in sed (in my version, at least): {, }, (, ), but not :. So you need to escape them with a back-slash.
And \1 refers to the expression between parentheses; it should capture the first part, up to the seconds, not the milliseconds.
A modification of your version could be:
sed "s/\([0-9]\{2\}:[0-9]\{2\}:[0-9]\{2\}\)\.[0-9]\{1,3\}/\1/g" input_file.csv > output.csv

This might work for you:
sed 's/\....//;s/\....//' input_file.csv >output_file.csv

Since the sed solution has already been posted, here is an alternate awk solution:
[jaypal:~/Temp] cat inputfile
12/13/2011,07:14:13.724,12/13/2011 07:14:13.724,231.56.3.245,LasVegas,US
[jaypal:~/Temp] awk -F"," -v ORS="," '
{for(i=1;i<NF;i+=1)
if (i==2||i==3) {sub(/\..*/,"",$i);print $i}
else print $i;printf "%s\n", $NF}' inputfile
12/13/2011,07:14:13,12/13/2011 07:14:13,231.56.3.245,LasVegas,US
Explanation:
Set the Field Separator to , and Output Record Separator to ,.
Using a for loop we loop over each field.
Using an if statement we apply the substitution when the for loop reaches the second and third fields.
If the fields are not 2nd and 3rd then we just print out the fields.
Lastly since we have used the for loop for <NF we just print out $NF which is the last field. This won't cause a , to be printed after last field.
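A shorter variant of the same idea, as a sketch: it sets the output field separator (OFS) instead of ORS, so awk rebuilds the record itself and the last field needs no special handling. It assumes, as above, that the timestamps are always in fields 2 and 3.

```shell
line='12/13/2011,07:14:13.724,12/13/2011 07:14:13.724,231.56.3.245,LasVegas,US'

trimmed=$(printf '%s\n' "$line" | awk -F, -v OFS=, '{
  for (i = 2; i <= 3; i++)
    sub(/\.[0-9]+$/, "", $i)   # drop the trailing ".724" from the field
  print                        # $0 is rebuilt with OFS after a field changes
}')

printf '%s\n' "$trimmed"
```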

Related

BASH escaping double quotes within single quotes

I'm trying to write a bash function that would escape all double quotes within single quotes, eg:
'I need to escape "these" quotes with backslashes'
would become
'I need to escape \"these\" quotes with backslashes'
My take on it was:
Find pairs of single quotes in the input and extract them with grep
Pipe into sed, escape double quotes
Sed again the whole input and replace grep match with sedded match
I managed to get it working to the part of having correctly escaped quotes section, but replacing it in the whole input fails.
The script code copypaste:
# $1 - Full name, $2 - minified name
adjust_quotes ()
{
SINGLE_QUOTES=`grep -Eo "'.*'" $2`
ESCAPED_QUOTES=`echo $SINGLE_QUOTES | sed 's|"|\\\\"|g'`
sed -r "s|'.*'|$ESCAPED_QUOTES|g" "$2" > "$2.escaped"
mv "$2.escaped" $2
echo "Quotes escaped within single quotes on $2"
}
Random additional questions:
In the console, escaping the quote with only two backslashes works, but when the code is put in the script I need four. I'd love to know why.
Could I modify this code into a loop to escape all pairs of single quotes, one after another until EOF?
Thanks!
P.S. I know this would probably be easier to do in eg. python, but I really need to keep it in bash.
Using BASH string replacement:
s='I need to escape "these" quotes with backslashes'
r="${s//\"/\\\"}"
echo "$r"
I need to escape \"these\" quotes with backslashes
Here's a pure bash solution, which does the transformation on stdin, printing to stdout. It reads the entire input into memory, so it won't work with really enormous files.
escape_enclosed_quotes() (
IFS=\'
read -d '' -r -a fields
for ((i=1; i<${#fields[@]}; i+=2)); do
fields[i]=${fields[i]//\"/\\\"}
done
printf %s "${fields[*]}"
)
I deliberately enclosed the body of the function in parentheses rather than braces, in order to force the body to run in a subshell. That limits the modification of IFS to the body, as well as implicitly making the variables used local.
The function uses the read builtin to read the entire input (since the line delimiter is set to NUL with -d '') into an array (-a) using a single quote as the field separator (IFS=\'). The result is that the parts of the input surrounded with single quotes are in the odd positions of the array, so the function loops over the odd indices to do the substitution only for those fields. I use bash's find-and-replace syntax instead of deferring to an external utility like sed.
This being bash, there are a couple of gotchas:
If the file contains a NUL, the rest of the file will be ignored.
If the last line of the file does not end with a newline, and the last character of that line is a single quote, it will not be output.
Both of the above conditions are impossible in a portable text file, so it's probably OK. All the same, worth taking note.
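A quick usage check of the function, assuming bash (it relies on arrays and on read -d ''). The function is repeated so the snippet runs on its own; the only change is a "|| true" after read, an addition of mine so that the expected non-zero status at end of input cannot trip a caller running under set -e. The sample string is made up for illustration.

```shell
#!/usr/bin/env bash
escape_enclosed_quotes() (
  IFS=\'
  # read hits EOF before finding a NUL, so it returns non-zero; that is expected.
  read -d '' -r -a fields || true
  for ((i=1; i<${#fields[@]}; i+=2)); do
    fields[i]=${fields[i]//\"/\\\"}
  done
  printf %s "${fields[*]}"
)

# Only the double quotes inside the single-quoted part should be escaped.
sample="say 'a \"b\" c' and \"d\""
escaped=$(printf '%s\n' "$sample" | escape_enclosed_quotes)
printf '%s\n' "$escaped"
```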
The supplementary question: why are the extra backslashes needed in
ESCAPED_QUOTES=`echo $SINGLE_QUOTES | sed 's|"|\\\\"|g'`
Answer: It has nothing to do with that line being in a script. It has to do with your use of backticks (`...`) for command substitution, and the idiosyncratic and often unpredictable handling of backslashes inside backticks. This syntax is deprecated. Do not use it. (Not even if you see someone else using it in some random example on the internet.) If you had used the recommended $(...) syntax for command substitution, it would have worked as expected:
ESCAPED_QUOTES=$(echo $SINGLE_QUOTES | sed 's|"|\\"|g')
(More information is in the Bash FAQ linked above.)
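The difference can be seen side by side in a small sketch (the variable names and the sample string are made up for illustration). Inside backticks the shell removes one level of backslashes before sed runs, so four backslashes arrive as two, and two arrive as one:

```shell
input='say "hi"'

# Backticks with four backslashes: sed receives \\" and emits \"
four=`printf '%s\n' "$input" | sed 's|"|\\\\"|g'`

# Backticks with two backslashes: sed receives \" which is just a literal "
two=`printf '%s\n' "$input" | sed 's|"|\\"|g'`

# $(...) leaves the backslashes alone, so two are enough
dollar=$(printf '%s\n' "$input" | sed 's|"|\\"|g')

printf '%s\n%s\n%s\n' "$four" "$two" "$dollar"
```

The first and third results should have the quotes escaped, while the second comes back unchanged.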

sed: Replacing a double quote in a quoted field within a delimited record

Given an optionally quoted, pipe delimited file with the following records:
"foo"|"bar"|123|"9" Nails"|"2"
"blah"|"blah"|456|"Guns "N" Roses"|"7"
"brik"|"brak"|789|""BB" King"|"0"
"yin"|"yang"|789|"John "Cougar" Mellencamp"|"5"
I want to replace any double quotes not next to a delimiter.
I used the following and it almost works. With one exception.
sed "s/\([^|]\)\"\([^|]\)/\1'\2/g" a.txt
The output looks like this:
"foo"|"bar"|123|"9' Nails"|"2"
"blah"|"blah"|456|"Guns 'N" Roses"|"7"
"brik"|"brak"|789|"'BB' King"|"0"
"yin"|"yang"|789|"John 'Cougar' Mellencamp"|"5"
It doesn't catch the second set of quotes if they are separated by a single character, as in Guns "N" Roses. Does anyone know why that is and how it can be fixed? In the meantime I'm just piping the output to a second regex to handle the special case. I'd prefer to do this in one pass since some of the files can be largish.
Thanks in advance.
You can use substitution twice in sed:
sed -r "s/([^|])\"([^|])/\1'\2/g; s/([^|])\"([^|])/\1'\2/g" file
"foo"|"bar"|123|"9' Nails"|"2"
"blah"|"blah"|456|"Guns 'N' Roses"|"7"
"brik"|"brak"|789|"'BB' King"|"0"
"yin"|"yang"|789|"John 'Cougar' Mellencamp"|"5"
sed kind of implements a "while" loop:
sed ':a; s/\([^|]\)"\([^|]\)/\1'\''\2/g; ta' file
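The t command branches to the label a if the previous s/// command replaced something, so the replacement repeats until no more matches are found. The loop is needed because /g only finds non-overlapping matches: in "N", the N is consumed as \2 of the first match, so it cannot also serve as \1 for the quote that follows it.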
Also, perl handles your case without looping, thanks to zero-width look-ahead:
perl -pe 's/[^|]\K"(?!\||$)/'\''/g'
But that doesn't handle consecutive double quotes, so loop instead:
perl -pe 's//'\''/g while /[^|]\K"(?!\||$)/' file
You may like to use \x27 instead of the awkward '\'' method to insert a single quote in a single quoted string. Works with perl and GNU sed.
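To compare the single pass with the sed loop on the problem record, here is a small sketch. The label and branch are passed as separate -e arguments, which I am assuming is the more portable spelling for seds that do not accept ":a;" on one line; the sed script is double-quoted so the single quote can be written directly.

```shell
rec='"blah"|"blah"|456|"Guns "N" Roses"|"7"'

# One pass: /g never rescans text that an earlier match already consumed.
once=$(printf '%s\n' "$rec" | sed "s/\([^|]\)\"\([^|]\)/\1'\2/g")

# Loop: repeat the substitution until it no longer changes the line.
looped=$(printf '%s\n' "$rec" | sed -e ':a' -e "s/\([^|]\)\"\([^|]\)/\1'\2/g" -e 'ta')

printf '%s\n%s\n' "$once" "$looped"
```

The first result should still contain N" while the second has both inner quotes replaced.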

Why do these two sed commands give different results?

A csv file example.csv, it has
hello,world,wow
this,is,amazing
I want to get the first column elements, at the beginning I wrote a sed command like:
sed -n 's/\([^,]*\),*/\1/p' example.csv
output:
helloworld,wow
thisis,amazing
Then I modified my command to the following and get what I want:
sed -n 's/\([^,]*\).*/\1/p' example.csv
output:
hello
this
command1 I used comma(,) and command2 I replaced comma with dot(.), and it works as expected, can anyone explain how sed really works to get the 1st output? What's the story behind? Is it because of the dot(.) or because of the substitution group & back-reference?
In both regexes, ([^,]*) will consume the same part of the string - all the symbols preceding the first encountered comma. Apparently the difference is how are the remaining parts of those regexes treated.
In the first one, it's ,* - zero or more comma symbols. Obviously all it might consume is the comma itself - the rest of the line isn't covered by the pattern.
In the second one, it's .* - zero or more of any symbols. It's not a big surprise that it covers the remaining string completely, as it has nothing to stop at; any is, well, any.
In both cases the pattern-covered part of the string is replaced by the contents of the capturing group (and that's, as I said already, 'all the symbols before the first comma') - and what's covered by the remaining part of the regex is just removed. So in first case the very first comma is erased, in the second - the comma and the rest of the string.
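The two substitutions can be compared directly on one sample line; this is only a sketch with made-up variable names:

```shell
line='hello,world,wow'

# ,* matches only the run of commas right after the first field.
commas=$(printf '%s\n' "$line" | sed -n 's/\([^,]*\),*/\1/p')

# .* matches everything after the first field, to the end of the line.
dot=$(printf '%s\n' "$line" | sed -n 's/\([^,]*\).*/\1/p')

# A longer run of commas is swallowed by ,* in the same way.
run=$(printf '%s\n' 'hello,,,,world' | sed -n 's/\([^,]*\),*/\1/p')

printf '%s\n%s\n%s\n' "$commas" "$dot" "$run"
```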
The reason is that the pattern matches only the first part of the line, i.e. only the hello, part is replaced. The ,* part takes an arbitrary number of commas, and nothing is specified after it, so nothing else is covered by the pattern. For example:
hello,,,,,,,,,,,,,,,,,,world
would be replaced to
helloworld
A good example would be
sed -n 's/\([^,]*\),*$/\1/p' example.csv
This will work if and only if all the commas are at the end of the line and will trim them, e.g.
hello,,,,,,
Hope this makes the problem a bit clearer.
In a regex, the . (dot) is a placeholder for one single character.
Can I suggest not using sed?
cut -d, -f1 example.csv
Personally, I'm a huge sed fan, but cut is much more appropriate in this instance.
If you just want the first field, why not use awk:
awk -F, '{print $1}' file
hello
this
Using sed with back reference
sed -nr 's/([^,]*),.*/\1/p' file
hello
this
To make it work you need the .* so that the pattern covers the whole line.
The -r option means you do not need to escape the parentheses as \( and \).

Sed wildcards- replace in the middle of certain characters

I have a line like this
"RU_28" "CDM_279" "CDI_45"
"RU_567" "CDM_528" "CDI_10000"
I want to obtain the result below
"RU_28" "CDM_Unusued" "CDI_45"
"RU_567" "CDM_Unusued" "CDI_10000"
Do this for all the lines in the file
I'm using this commands:
sed 's/\"CDM_\w*\"/\"Unusued\"/g' File1.txt > File2.txt
It doesn't seem to work.
Thanks in advance!!!
You can use:
sed -i.bak 's/"\(CDM_\)[^"]*"/"\1Unused"/' file1.txt
You're not actually leaving "CDM_" on the right side. Your substitution says "Replace CDM_ and any number of word characters with Unusued." The common way to do what you actually want is to put parentheses around the section on the left that you want to keep, and then use backreferences to indicate where it goes on the right. In this case, you just need a single backreference, indicated by \1:
sed 's/"\(CDM_\)\w*"/"\1Unusued"/g' File1.txt > File2.txt
Note the backslashes before the parentheses on the left side; these can be omitted if using sed with -r (I think), for extended regexps, but as-is, they're necessary so sed knows they're not literal.
Edit: I've updated the command in response to the accurate comment by Birei, who noted the extraneous escapes for the double quotes. (Note that the ones for the parentheses are still necessary.)
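A small check of the backreference approach on the first sample line. It is a sketch that uses [^"]* rather than \w*, on the assumption that not every sed understands \w, and it uses the spelling "Unused":

```shell
line='"RU_28" "CDM_279" "CDI_45"'

# Keep the CDM_ prefix via \1 and replace whatever follows it up to the closing quote.
result=$(printf '%s\n' "$line" | sed 's/"\(CDM_\)[^"]*"/"\1Unused"/')

printf '%s\n' "$result"
```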

Awk/etc.: Extract Matches from File

I have an HTML file and would like to extract the text between <li> and </li> tags. There are of course a million ways to do this, but I figured it would be useful to get more into the habit of doing this in simple shell commands:
awk '/<li[^>]+><a[^>]+>([^>]+)<\/a>/m' cities.html
The problem is, this prints everything whereas I simply want to print the match in parentheses -- ([^>]+) -- either awk doesn't support this, or I'm incompetent. The latter seems more likely. If you wanted to apply the supplied regex to a file and extract only the specified matches, how would you do it? I already know a half dozen other ways, but I don't feel like letting awk win this round ;)
Edit: The data is not well-structured, so using positional matches ($1, $2, etc.) is a no-go.
If you want to do this in the general case, where your list tags can contain any legal HTML markup, then awk is the wrong tool. The right tool for the job would be an HTML parser, which you can trust to get correct all of the little details of HTML parsing, including variants of HTML and malformed HTML.
If you are doing this for a special case, where you can control the HTML formatting, then you may be able to make awk work for you. For example, let's assume you can guarantee that each list element never occupies more than one line, is always terminated with </li> on the same line, never contains any markup (such as a list that contains a list), then you can use awk to do this, but you need to write a whole awk program that first finds lines that contain list elements, then uses other awk commands to find just the substring you are interested in.
But in general, awk is the wrong tool for this job.
gawk -F'<li>' -v RS='</li>' 'RT{print $NF}' file
Worked pretty well for me.
Going by your script, and assuming the <li> and <a> tags are on the same line, you can get what you want with:
$ cat test.html | awk 'sub(/<li[^>]*><a[^>]*>/,"")&&sub(/<\/a>.*/,"")'
or
$ cat test.html | gawk '/<li[^>]*><a[^>]*>([^<]*)<\/a>.*/&&$0=gensub(/<li[^>]*><a[^>]*>([^<]*)<\/a>.*/,"\\1", 1)'
First one is for every awk, second one is for gnu awk.
There are several issues that I see:
The pattern has a trailing 'm' which is significant for multi-line matches in Perl, but Awk does not use Perl-compatible regular expressions. (At least, standard (non-GNU) awk does not.)
Ignoring that, the pattern seems to search for a 'start list item' followed by an anchor '<a>' to '</a>', not the end list item.
You search for anything that is not a '>' as the body of the anchor; that's not automatically wrong, but it might be more usual to search for anything that is not '<', or anything that is neither.
Awk does not do multi-line searches.
In Awk, '$1' denotes the first field, where the fields are separated by the field separator characters, which default to white space.
Classic nawk (as documented in the 'sed & awk' book, vintage 1991) does not have a mechanism for pulling sub-fields out of matches, etc.
It is not clear that Awk is the right tool for this job. Indeed, it is not entirely clear that regular expressions are the right tool for this job.
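For the narrow case where each list item sits on one line and holds a single plain-text link, one workaround is awk's match() function, which sets RSTART and RLENGTH, followed by substr() and sub() to trim the tags. This is a sketch under those assumptions, with made-up sample HTML; it is not a general HTML extractor.

```shell
html='<li class="x"><a href="/nyc">New York</a></li>
<li><a href="/la">Los Angeles</a></li>
<p>not a list item</p>'

cities=$(printf '%s\n' "$html" | awk '
  match($0, /<li[^>]*><a[^>]*>[^<]*<\/a>/) {
    hit = substr($0, RSTART, RLENGTH)   # the matched run of tags and text
    sub(/^<li[^>]*><a[^>]*>/, "", hit)  # strip the opening tags
    sub(/<\/a>$/, "", hit)              # strip the closing tag
    print hit
  }')

printf '%s\n' "$cities"
```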
Don't really know awk, how about Perl instead?
tr -d '\012' < the.html | perl \
-e '$text = <>;' -e 'while ( length( $text) > 0)' \
-e '{ last unless $text =~ /<li>(.*?)<\/li>(.*)/; $target = $1; $text = $2; print "$target\n" }'
1) remove newlines from file, pipe through perl
2) initialize a variable with the complete text, start a loop until text is gone
3) do a "non greedy" match for stuff bounded by list-item tags, save and print the target, set up for next pass
Make sense? (warning, did not try this code myself, need to go home soon...)
P.S. - "perl -n" is Awk (nawk?) mode. Perl is largely a superset of Awk, so I never bothered to learn Awk.