Convert spaces to tabs in RegEx - regex

How do do you say the following in regex:
foreach line
look at the beginning of the string and convert every group of 3 spaces to a tab
Stop once a character other than a space is found
This is what i have so far:
/^ +/\t/g
However, this converts every space to 1 tab
Any help would be appreciated.

With Perl:
perl -pe '1 while s/\G {3}/\t/gc' input.txt >output.txt
For example, with the following input
nada
three spaces
four spaces
three in the middle
six space
the output (TABs replaced by \t) is
$ perl -pe '1 while s/\G {3}/\t/gc' input | perl -pe 's/\t/\\t/g'
nada
\tthree spaces
\t four spaces
\tthree in the middle
\t\tsix spaces

I know this is an old question but I thought I'd give a full regex answer that works (well it worked for me).
s/\t* {3}/\t/g
I usually use this to convert a whole document in vim do this in vim it looks like this:
:%s/\t* \{3\}/\t/g
Hope it still helps someone.

You probably want /^(?: {3})*/\t/g
edit: fixed

Related

Replacing spaces with underscores within quotes

I need to replace within a large text file all occurrences such as 'yw234DV w-23-sDf wef23s-d-f' with the same strings but with underscores instead of spaces for all spaces within quotes, without replacing any spaces outside quotes with underscores.
I'm trying to find a solution for substitution within vim, but a sed solution would also be much appreciated. The number of tokens in each quote-delimited string may vary.
I've been playing with some regexes in vim, but they're pretty elementary and seem to be missing what I need.
My current attempt:
%s/'{[:alnum:] }*/'\0\_/g
And I'm experimenting with variations on that.
This is most similar to my question, though it is Java:
Replacing spaces within quotes
Sample Input:
'wiUEF7-gvouw ow wo24-RTeih we', 'yt23IT iug-76'
Sample Output:
'wiUEF7-gvouw_ow_wo24-RTeih_we', 'yt23IT_iug-76'
You may try this with VIM, tried this on Macvim:
%s/\%('[^']*'\)*\('[^']*'\)/\=substitute(submatch(1), ' ', '_', 'g')/g
Much simpler solution , Thanks to #SergioAraujo:
#%s/\v%(('[^']*'))/\=substitute(submatch(1),' ', '_', 'g')/g
Not sure however, if below is the outcome you have expected
Output:
'wiUEF7-gvouw_ow_wo24-RTeih_we', 'yt23IT_iug-76'
In perl:
perl -i -pe's{(\x27.*?\x27)}{ (my $subst = $1) =~ tr/ /_/ }ge' yourfile
or with perl5.14 or above:
perl -i -pe's{(\x27.*?\x27)}{ $1 =~ tr/ /_/r }ge'
With this the input file:
$ cat file
'wiUEF7-gvouw ow wo24-RTeih we', 'yt23IT iug-76'
We can convert all spaces inside of single-quotes into underscores with:
$ sed -E ":a; s/^(([^']*'[^']*')*[^']*'[^']*)[[:space:]]/\1_/; ta" file
'wiUEF7-gvouw_ow_wo24-RTeih_we', 'yt23IT_iug-76'
How it works
:a
This creates a label a.
s/^(([^']*'[^']*')*[^']*'[^']*)[[:space:]]/\1_/
This inserts the underscores where we want them.
^(([^']*'[^']*')*[^']*'[^']*)[[:space:]]
This looks for any odd number of single quotes followed by any number of non-quote characters followed by a space. Everything before that space is saved in group 1.
\1_
This replaces the matched text with group 1 followed by an underscore.
ta
If the previous command put any new underscores in the string, then jump back to label a and try again.
Using FPAT variable in gnu awk you can do this:
awk -v OFS=', ' -v FPAT="'[^']*'" '{for (h=1; h<=NF; h++)
{gsub(/[[:blank:]]/, "_", $h); printf "%s%s", $h, (h < NF ? OFS : ORS)}}' file
'wiUEF7-gvouw_ow_wo24-RTeih_we', 'yt23IT_iug-76'

Getting rid of all words that contain a special character in a textfile

I'm trying to filter out all the words that contain any character other than a letter from a text file. I've looked around stackoverflow, and other websites, but all the answers I found were very specific to a different scenario and I wasn't able to replicate them for my purposes; I've only recently started learning about Unix tools.
Here's an example of what I want to do:
Input:
#derik I was there and it was awesome! !! http://url.picture.whatever #hash_tag
Output:
I was there and it was awesome!
So words with punctuation can stay in the file (in fact I need them to stay) but any substring with special characters (including those of punctuation) needs to be trimmed away. This can probably be done with sed, but I just can't figure out the regex. Help.
Thanks!
Here is how it could be done using Perl:
perl -ane 'for $f (#F) {print "$f " if $f =~ /^([a-zA-z-\x27]+[?!;:,.]?|[\d.]+)$/} print "\n"' file
I am using this input text as my test case:
Hello,
How are you doing?
I'd like 2.5 cups of piping-hot coffee.
#derik I was there; it was awesome! !! http://url.picture.whatever #hash_tag
output:
Hello,
How are you doing?
I'd like 2.5 cups of piping-hot coffee.
I was there; it was awesome!
Command-line options:
-n loop around every line of the input file, do not automatically print it
-a autosplit mode – split input lines into the #F array. Defaults to splitting on whitespace
-e execute the perl code
The perl code splits each input line into the #F array, then loops over every field $f and decides whether or not to print it.
At the end of each line, print a newline character.
The regular expression ^([a-zA-z-\x27]+[?!;:,.]?|[\d.]+)$ is used on each whitespace-delimited word
^ starts with
[a-zA-Z-\x27]+ one or more lowercase or capital letters or a dash or a single quote (\x27)
[?!;:,.]? zero or one of the following punctuation: ?!;:,.
(|) alternately match
[\d.]+ one or more numbers or .
$ end
Your requirements aren't clear at all but this MAY be what you want:
$ awk '{rec=sep=""; for (i=1;i<=NF;i++) if ($i~/^[[:alpha:]]+[[:punct:]]?$/) { rec = rec sep $i; sep=" "} print rec}' file
I was there and it was awesome!
sed -E 's/[[:space:]][^a-zA-Z0-9[:space:]][^[:space:]]*//g' will get rid of any words starting with punctuation. Which will get you half way there.
[[:space:]] is any whitespace character
[^a-zA-Z0-9[:space:]] is any special character
[^[:space:]]* is any number of non whitespace characters
Do it again without a ^ instead of the first [[:space:]] to get remove those same words at the start of the line.

Simplify points in KML using regex

I am trying to cut down the file size of a kml file I have.
The coordinates for the polygons are this accurate:
-113.52106535153605,53.912817815321503,0.
I am not very good with regex, but I think it would be possible to write one that selects the eight characters before the commas. I'd run a search and replace so the result would be
-113.521065,53.9128178,0.
Any regex experts out there think this is possible?
Try this
\d{8}(?=,)
and replace with an empty string
See it here on Regexr
Here is something that might work. Replaces 8 chars and the coma with a coma: s/(.{8}),/,/g;
echo "-113.52106535153605,53.912817815321503,0." | sed 's/.\{8\},/,/'
So you can cat the file you have to a sed command like this:
cat file.kml | sed 's/.\{8\},/,/' > newfile.kml
I Just had to do the same thing. This is perl instead of sed, but it will look for a string of eight uninterrupted digits and then replace any number of uninterrupted digits after that with nothing. It worked great.
cat originalfile.kml | perl -pe 's/(?<=\d{8})\d*//g' > shortenedfile.kml

using sed to copy lines and delete characters from the duplicates

I have a file that looks like this:
#"Afghanistan.png",
#"Albania.png",
#"Algeria.png",
#"American_Samoa.png",
I want it to look like this
#"Afghanistan.png",
#"Afghanistan",
#"Albania.png",
#"Albania",
#"Algeria.png",
#"Algeria",
#"American_Samoa.png",
#"American_Samoa",
I thought I could use sed to do this but I can't figure out how to store something in a buffer and then modify it.
Am I even using the right tool?
Thanks
You don't have to get tricky with regular expressions and replacement strings: use sed's p command to print the line intact, then modify the line and let it print implicitly
sed 'p; s/\.png//'
Glenn jackman's response is OK, but it also doubles the rows which do not match the expression.
This one, instead, doubles only the rows which matched the expression:
sed -n 'p; s/\.png//p'
Here, -n stands for "print nothing unless explicitely printed", and the p in s/\.png//p forces the print if substitution was done, but does not force it otherwise
That is pretty easy to do with sed and you not even need to use the hold space (the sed auxiliary buffer). Given the input file below:
$ cat input
#"Afghanistan.png",
#"Albania.png",
#"Algeria.png",
#"American_Samoa.png",
you should use this command:
sed 's/#"\([^.]*\)\.png",/&\
#"\1",/' input
The result:
$ sed 's/#"\([^.]*\)\.png",/&\
#"\1",/' input
#"Afghanistan.png",
#"Afghanistan",
#"Albania.png",
#"Albania",
#"Algeria.png",
#"Algeria",
#"American_Samoa.png",
#"American_Samoa",
This commands is just a replacement command (s///). It matches anything starting with #" followed by non-period chars ([^.]*) and then by .png",. Also, it matches all non-period chars before .png", using the group brackets \( and \), so we can get what was matched by this group. So, this is the to-be-replaced regular expression:
#"\([^.]*\)\.png",
So follows the replacement part of the command. The & command just inserts everything that was matched by #"\([^.]*\)\.png", in the changed content. If it was the only element of the replacement part, nothing would be changed in the output. However, following the & there is a newline character - represented by the backslash \ followed by an actual newline - and in the new line we add the #" string followed by the content of the first group (\1) and then the string ",.
This is just a brief explanation of the command. Hope this helps. Also, note that you can use the \n string to represent newlines in some versions of sed (such as GNU sed). It would render a more concise and readable command:
sed 's/#"\([^.]*\)\.png",/&\n#"\1",/' input
I prefer this over Carles Sala and Glenn Jackman's:
sed '/.png/p;s/.png//'
Could just say it's personal preference.
or one can combine both versions and apply the duplication only on lines matching the required pattern
sed -e '/^#".*\.png",/{p;s/\.png//;}' input

How can I add characters at the beginning and end of every non-empty line in Perl?

I would like to use this:
perl -pi -e 's/^(.*)$/\"$1\",/g' /path/to/your/file
for adding " at beginning of line and ", at end of each line in text file. The problem is that some lines are just empty lines and I don't want these to be altered. Any ideas how to modify above code or maybe do it completely differently?
Others have already answered the regex syntax issue, let's look at that style.
s/^(.*)$/\"$1\",/g
This regex suffers from "leaning toothpick syndrome" where /// makes your brain bleed.
s{^ (.+) $}{ "$1", }x;
Use of balanced delimiters, the /x modifier to space things out and elimination of unnecessary backwhacks makes the regex far easier to read. Also the /g is unnecessary as this regex is only ever going to match once per line.
perl -pi -e 's/^(.+)$/\"$1\",/g' /your/file
.* matches 0 or more characters; .+ matches 1 or more.
You may also want to replace the .+ with .*\S.* to ensure that only lines containing a non-whitespace character are quoted.
change .* to .+
In other words lines must contain at 1 or more characters. .* represents zero or more characters.
You should be able to just replace the * (0 or more) with a + (1 or more), like so:
perl -pi -e 's/^(.+)$/\"$1\",/g' /path/to/your/file
all you are doing is adding something to the front and back of the line, so there is no need for regex. Just print them out. Regex for such a task is expensive if your file is big.
gawk
$ awk 'NF{print "\042" $0 "\042,"}' file
or Perl
$ perl -ne 'chomp;print "\042$_\042,\n" if ($_ ne "") ' file
sed -r 's/(.+)/"\1"/' /path/to/your/file