I have a string like
July 20th 2017, 11:03:37.620 fc384c3d-9a75-459d-ba92-99069db0e7bf
I need to remove everything from the beginning of the line till the UUID substring (it's a tab, \t just before the UUID).
My regex looks like that:
^\s*July(.*)\t
When I test it in regex101 it all works beautifully: https://regex101.com/r/eZ1gT7/1077
However, when I plonk that into a sed command it doesn't do any substitution:
less pensionQuery.txt | sed -e 's/^\s*July(.*)\t//'
where pensionQuery.txt is a file full of lines similar to the one above. So the command simply spits out the unmodified file content.
Is my sed command wrong?
Any ideas?
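The regex is right, but you are not running sed with --regexp-extended. Without that option sed uses basic regular expressions, where unescaped ( and ) are literal characters, so the pattern never matches. From the manual: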
'-E'
'--regexp-extended'
Use extended regular expressions rather than basic regular
expressions. Extended regexps are those that egrep accepts; they
can be clearer because they usually have fewer backslashes.
Historically this was a GNU extension, but the -E extension has
since been added to the POSIX standard.
echo -e $'July 20th 2017, 11:03:37.620\tfc384c3d-9a75-459d-ba92-99069db0e7bf' |
sed -E 's/^\s*July(.*)\t//'
fc384c3d-9a75-459d-ba92-99069db0e7bf
Also, a short read-up on basic (BRE) and extended (ERE) regular expressions:
Basic and extended regular expressions are two variations on the syntax of the specified pattern. Basic Regular Expression (BRE) is the default in sed (and similarly in grep). Extended Regular Expression syntax (ERE) is activated by using the -r or -E options (and similarly, grep -E).
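To make the difference concrete, here is a minimal sketch that runs the question's substitution three ways: unescaped parentheses in BRE (no match), escaped parentheses in BRE, and bare parentheses with -E. It assumes a sed that accepts -E. The variable names are illustrative only, and a literal tab plus [[:space:]] stand in for \t and \s, which are GNU extensions.

```shell
# Build a sample line with a real tab before the UUID.
tab=$(printf '\t')
line="July 20th 2017, 11:03:37.620${tab}fc384c3d-9a75-459d-ba92-99069db0e7bf"

# BRE with unescaped parentheses: ( and ) are literal, so nothing matches.
unchanged=$(printf '%s\n' "$line" | sed "s/^[[:space:]]*July(.*)${tab}//")

# BRE with escaped parentheses: \( \) form a group.
bre_out=$(printf '%s\n' "$line" | sed "s/^[[:space:]]*July\(.*\)${tab}//")

# ERE (-E): bare parentheses form a group.
ere_out=$(printf '%s\n' "$line" | sed -E "s/^[[:space:]]*July(.*)${tab}//")

printf '%s\n%s\n%s\n' "$unchanged" "$bre_out" "$ere_out"
```

The first line of output is the untouched input; the other two are just the UUID.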
I have a working (in macOS app Patterns) RegExp that reformats GeoJSON MultiPolygon coordinates, but don't know how to escape it for sed.
The file I'm working on is over 90 MB in size, so a bash terminal looks like the ideal place and sed the perfect tool for the job.
Search Text Example:
[[[379017.735,6940036.7955],[379009.8431,6940042.5761],[379000.4869,6940048.9545],[378991.5455,6940057.8128],[378984.0665,6940066.0744],[378974.7072,6940076.2152],[378962.8639,6940090.5283],[378954.5822,6940101.4028],[378947.9369,6940111.3128],[378941.4564,6940119.5094],[378936.2565,6940128.1229],[378927.6089,6940141.4764],[378919.6611,6940154.0312],[378917.21,6940158.7053],[378913.7614,6940163.4443],[378913.6515,6940163.5893],[378911.4453,6940166.3531],
Desired outcome:
[[[37.9017735,69.400367955],[37.90098431,69.400425761],[37.90004869,69.400489545],[37.89915455,69.400578128],[37.89840665,69.400660744],[37.89747072,69.400762152],[37.89628639,69.400905283],[37.89545822,69.401014028],[37.89479369,69.401113128],[37.89414564,69.401195094],[37.89362565,69.401281229],[37.89276089,69.401414764],[37.89196611,69.401540312],[37.891721,69.401587053],[37.89137614,69.401634443],[37.89136515,69.401635893],[37.89114453,69.401663531],
My current RegExp:
((?:\[)[0-9]{2})([0-9]+)(\.)([0-9]+)(,)([0-9]{2})([0-9]+)(\.)([0-9]+(?:\]))
and reformatting:
$1\.$2$4,$6.$7$9
The command should be something along these lines:
sed -i -e 's/ The RegExp escaped /$1\.$2$4,$6.$7$9/g' large_file.geojson
But what should be escaped in the RegExp to make it work?
My attempts always complain of being unbalanced.
I'm sorry if this has already been answered elsewhere, but I couldn't find it even after extensive searching.
Edit: 2017-01-07: I didn't make it clear that the file contains properties other than just the GPS-points. One of the other example values picked from GeoJSON Feature properties is "35.642.1.001_001", which should be left unchanged. The braces check in my original regex is there for this reason.
That regex is not legal in sed; since it uses Perl syntax, my recommendation would be to use perl instead. The regular expression works exactly as-is, and even the command line is almost the same; you just need to add the -p option to get perl to operate in filter mode (which sed does by default). I would also recommend adding an argument suffix to the -i option (whether using sed or perl), so that you have a backup of the original file in case something goes horribly wrong. As for quoting, all you need to do is put the substitution command in single quotation marks:
perl -p -i.bak -e \
's/((?:\[)[0-9]{2})([0-9]+)(\.)([0-9]+)(,)([0-9]{2})([0-9]+)(\.)([0-9]+(?:\]))/$1\.$2$4,$6.$7$9/g' \
large_file.geojson
If your data is just like you showed, you needn't worry about the brackets. You may use a POSIX ERE enabled with -E (or -r in some other distributions) like this:
sed -i -E 's/([0-9]{2})([0-9]*)\.([0-9]+)/\1.\2\3/g' large_file.geojson
Or a POSIX BRE:
sed -i 's/\([0-9]\{2\}\)\([0-9]*\)\.\([0-9]\+\)/\1.\2\3/g' large_file.geojson
See an online demo.
You may see how this regex works here (just a demo, not proof).
Note that in POSIX BRE you need to escape { and } in limiting / range quantifiers and ( and ) in grouping constructs; otherwise they denote literal symbols. The same goes for +: unescaped it is a literal plus sign, and the \+ quantifier used above is a GNU extension to BRE. In POSIX ERE you do not need to escape these characters to make them special; this flavor is closer to modern regexes.
Also, you need to use \n notation inside the replacement pattern, not $n.
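As a quick check of the escaping rules above, the following sketch applies the ERE and a BRE form of the same substitution to a two-point sample taken from the question. It is only an illustration: the variable names are made up, and the BRE variant writes [0-9][0-9]* instead of \+ so that it does not rely on the GNU extension.

```shell
sample='[[[379017.735,6940036.7955],[379009.8431,6940042.5761]]]'

# ERE: groups and intervals are written bare; the replacement uses \1 \2 \3.
ere_out=$(printf '%s\n' "$sample" |
    sed -E 's/([0-9]{2})([0-9]*)\.([0-9]+)/\1.\2\3/g')

# BRE: \( \) and \{ \} are escaped; [0-9][0-9]* stands in for the GNU-only \+.
bre_out=$(printf '%s\n' "$sample" |
    sed 's/\([0-9]\{2\}\)\([0-9]*\)\.\([0-9][0-9]*\)/\1.\2\3/g')

printf '%s\n%s\n' "$ere_out" "$bre_out"
```

Both commands print [[[37.9017735,69.400367955],[37.90098431,69.400425761]]], which agrees with the first two points of the desired outcome.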
A simple sed will do it:
$ echo "$var"
[[[379017.735,6940036.7955],[379009.8431,6940042.5761],[379000.4869,6940048.9545],[378991.5455,6940057.8128],[378984.0665,6940066.0744],[378974.7072,6940076.2152],[378962.8639,6940090.5283],[378954.5822,6940101.4028],[378947.9369,6940111.3128],[378941.4564,6940119.5094],[378936.2565,6940128.1229],[378927.6089,6940141.4764],[378919.6611,6940154.0312],[378917.21,6940158.7053],[378913.7614,6940163.4443],[378913.6515,6940163.5893],[378911.4453,6940166.3531],
$ echo "$var" | sed 's/\([0-9]\{3\}\)\./.\1/g'
[[[379.017735,6940.0367955],[379.0098431,6940.0425761],[379.0004869,6940.0489545],[378.9915455,6940.0578128],[378.9840665,6940.0660744],[378.9747072,6940.0762152],[378.9628639,6940.0905283],[378.9545822,6940.1014028],[378.9479369,6940.1113128],[378.9414564,6940.1195094],[378.9362565,6940.1281229],[378.9276089,6940.1414764],[378.9196611,6940.1540312],[378.91721,6940.1587053],[378.9137614,6940.1634443],[378.9136515,6940.1635893],[378.9114453,6940.1663531],
I need a sed command that takes a string and removes all copies of the first character from the beginning (but not from the rest of the string).
For instance, AAABAC should produce BAC, because the first letter is A, so we remove the entire run of A's from the beginning.
My original thought was:
data=$(echo $data | sed 's/^.\+\(.*\)/\1/')
but this doesn't work (outputs empty string). If I replace the first . with a specific character, it will successfully work just for that character, but I can't get it to wildcard properly.
What I think is that the . matches the first character like I want, but then the + doesn't remember the letter I want and continues accepting every character until the end of the string, so that the parentheses contain nothing and so the whole string gets replaced with nothing. How can I initially accept any character, but then "lock in" that character for the +?
You can use:
$> s='AAABAC'
$> sed -E 's/^(.)\1*//' <<< "$s"
BAC
(.) will match the first character and captures it in group #1
\1* will match 0 or more instances of same character
Alternatively here is a pure BASH way of doing the same:
$> shopt -s extglob
$> echo "${s##+(${s:0:1})}"
BAC
${s:0:1} gives us the first character of $s and ##+(${s:0:1}) removes all the instances of first char from the start.
To provide a road map to the existing answers with respect to portability:
Note: It can be inferred from the syntax used in the question and from what answer was accepted that GNU sed is being used, but the question isn't tagged as such, and it may be of broader interest.
anubhava's helpful answer works with GNU sed, but not with (more) strictly POSIX-compliant sed implementations such as the one found on macOS.
Benjamin W.'s helpful answer works with GNU grep, due to requiring the -P option for PCRE support, which other grep implementations, such as the one found on macOS, do not support.
soronta's helpful answer works on platforms that use the GNU regular-expression libraries (most Linux distros), or, more generally, on platforms whose ERE (extended regular expression) syntax supports backreferences, as a nonstandard extension to the POSIX spec.
Note that =~, Bash's regex-matching operator, is one of the rare Bash features whose behavior is platform-dependent, due to using the respective platform's regex libraries.
Here's a POSIX-compliant solution that should work on all modern Unix-like platforms, because it uses BREs (basic regular expressions), for which POSIX does mandate backreference support:
$ echo 'AAABAC' | sed 's/^\(.\)\1*//'
BAC
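A small sketch of the same BRE command wrapped in a function, to show how it behaves on a few edge cases. The function name strip_leading_run is hypothetical and chosen only for this illustration.

```shell
# strip_leading_run is a hypothetical helper name used for this sketch.
strip_leading_run() {
    # \(.\) captures the first character; \1* consumes any further copies.
    printf '%s\n' "$1" | sed 's/^\(.\)\1*//'
}

r1=$(strip_leading_run 'AAABAC')   # run of three A's removed
r2=$(strip_leading_run 'ABAC')     # run of one A removed
r3=$(strip_leading_run 'BBBB')     # whole string is one run

printf '[%s] [%s] [%s]\n' "$r1" "$r2" "$r3"
```

This prints [BAC] [BAC] []: a string made up entirely of its first character is reduced to the empty string.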
You can do it with grep, if your grep understands Perl compatible regular expressions:
$ grep -Po '^(.)\1*\K.*' <<< 'AABAC'
BAC
or
$ grep -Po '^(.)\1*\K.*' <<< 'ABAC'
BAC
-o retains only the match, and \K acts like a variable-length look-behind: everything matched before it is dropped from the reported match, which removes as many identical characters from the beginning of the string as possible.
Bash also supports regular expressions:
$ m='(.)(\1+)(.+)'; [[ AAAAABAC =~ $m ]]; printf '%s' "${BASH_REMATCH[3]}"
BAC
This relies on backreferences in ERE, which the GNU regex library supports; since Bash uses the system's regex library, the behavior varies with the system.
I want to do string replacement using regular expressions in sed. Now, I'm aware that the behavior of sed is funky on a Mac. I've often seen workarounds using egrep when I want to just examine a certain pattern in a line. But, in this case I want to do string replacement.
I want to replace cp an and cp <tab or newline> an with gggg. I tried the following, which would work under extended regular expressions:
sed -i'_backup' 's/cp\s+an/gggg/g'
But of course this does nothing. I tried egrepping, and of course it picks out the lines with cp <one or more space characters> an.
How do I get sed to do replacement using extended regular expressions? Or what is a better way to do replacement using regular expressions?
I'm on Mac OS X.
On OS X, the following command works, using -E for extended regex support:
sed -i.backup -E 's/cp[[:blank:]]+an/gggg/g'
POSIX Character Class Reference
Since you mentioned you want <newline> to be handled, you'll need to coax sed a bit. Your exact requirements aren't too clear to me but the following example illustrates that sed can easily handle certain cases in which a newline is in the "target" regex:
$ echo $'cp\nancp an' | sed -E '/cp/{N; s/cp(\n|[[:blank:]])an/gggg/g;}'
gggggggg
(Note to non-Mac readers: If your sed does not support -E, try -r instead.)
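The two cases can be tried separately with a short sketch like the one below. It assumes a sed that accepts -E; the sample inputs and variable names are invented for illustration, and the second command only handles a cp that sits at the very end of a line.

```shell
tab=$(printf '\t')

# Same line: [[:blank:]]+ matches one or more spaces or tabs between cp and an.
same_line=$(printf 'cp  an\ncp%san\n' "$tab" |
    sed -E 's/cp[[:blank:]]+an/gggg/g')

# Split across lines: when a line ends in cp, N appends the next line so the
# embedded newline can be matched as \n.
split_line=$(printf 'x cp\nan y\n' |
    sed -E '/cp$/{N; s/cp\nan/gggg/;}')

printf '%s\n%s\n' "$same_line" "$split_line"
```

The first command yields two lines reading gggg; the second joins the two input lines into x gggg y.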
I have a large file with many scattered file paths that look like
lolsed_bulsh.png
I want to prepend these file names with an extended path like:
/full/path/lolsed_bulsh.png
I'm having a hard time matching and capturing these. Currently I'm trying variations of:
cat myfile.txt| sed s/\(.+\)\.png/\/full\/path\/\1/g | ack /full/path
I think sed has some regex or capture-group behavior I'm not understanding.
In your regex, replace + with *, quote the expression, and put the extension back in the replacement:
sed -E "s/(.*)\.png/\/full\/path\/\1.png/g" <<< "lolsed_bulsh.png"
It prints:
/full/path/lolsed_bulsh.png
NOTE: The non-standard -E option avoids having to escape ( and ).
Save yourself some escaping by choosing a different separator (and using the -E option), for example:
cat myfile.txt | sed -E "s|(..*)\.png|/full/path/\1.png|g" | ack /full/path
Note that, where supported, the -E option means ( and ) don't need escaping.
sed uses POSIX BRE by default, and BRE doesn't support the one-or-more quantifier +. That quantifier is only available in POSIX ERE, and POSIX sed as currently specified uses BRE with no option to switch to ERE.
Use ..* to simulate .+ if you want to maintain portability.
Or, if you can assume that the code always runs on GNU sed, you can use the GNU extension \+. Alternatively, you can use the GNU -r flag to switch to POSIX ERE. The -E flag in higuaro's answer has been tagged for inclusion in POSIX.1 Issue 8, and exists in POSIX.1-202x Draft 1 (June 2020).
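For reference, a minimal sketch of the portable BRE spelling, using | as the delimiter and keeping the extension inside the group. The function name prefix_png is hypothetical, and the sketch assumes one file name per line, which may not hold for the real file.

```shell
# prefix_png is a hypothetical helper; /full/path comes from the question.
prefix_png() {
    # "|" as the delimiter avoids escaping the slashes in the replacement.
    # ..* means "one character, then zero or more": the BRE spelling of .+
    sed 's|\(..*\.png\)|/full/path/\1|'
}

with_name=$(printf '%s\n' 'lolsed_bulsh.png' | prefix_png)
bare_ext=$(printf '%s\n' '.png' | prefix_png)

printf '%s\n%s\n' "$with_name" "$bare_ext"
```

The first result is /full/path/lolsed_bulsh.png. The second input, .png, is left alone because ..* requires at least one character before the extension.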