perl alternative for sed to split multiple | - regex

I was able to do this with sed, but could not get it working in Perl. I would like to add spaces between pipe characters that sit next to each other with no spaces or alphanumerics between them.
input ==> a|123|##||||
expected output ==> a|123|##| | | |
This sed command works fine:
echo "a|123|##||||" | sed 's/\([^[:blank:][:alnum:]]\)|/\1 | /g'
output for above command ==> a|123|## | | | |
In Perl, I could not get the same expression to work:
echo "a|123|##||||" | perl -pe 's/\([^[:blank:][:alnum:]]\)|/\1 | /g'
The output of the above command is:
| a | | | 1 | 2 | 3 | | | # | # | | | | | | | | |

To add a space only between | characters that are directly next to each other:
echo "a|123|##||||" | perl -pe 's/\|(?=\|)/| /g'
I use a lookahead so that consecutive (and overlapping!) pairs are detected when more than two | are strung together. Only the first | of each pair is consumed by the match; the second one is merely asserted by the lookahead, so it is still available as the start of the next match if yet another | follows it.

Another way using both lookahead and lookbehind.
$ echo "a|123|##||||" | perl -pe's/(?<=\|)(?=\|)/ /g '
a|123|##| | | |
$

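Correct Perl syntax would be:
echo "a|123|##||||" | perl -pe 's/([^\s\w])\|/$1 | /g'
The differences from the sed version:
The pipe character must be escaped, because an unescaped | means alternation in Perl.
Capture groups are written with plain ( ) instead of \( \).
$1 is used for the 1st group match in the replacement.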

Related

How to perform a sed transform within a matching part of a line

It's easy to do a sed transform within a line matching a certain pattern, but what if we only want to transform something in a certain part of the line?
Simple example
Suppose we want to make all characters uppercase in all lines starting with #. We could do that with a command of the following form.
sed '/^#/ y/abcdef/ABCDEF/'
Suppose we only want to turn the first word in these lines uppercase. How would we go about that using a sed translation?
More advanced application
I want to interchange slashes with backslashes in the graph part of the output of git --no-pager log --all --graph --decorate --oneline --color=always | tac.
Before
| * | | 279e9ad (tag: v0.0.4.334, origin/DR) asdfasdf
| | |/ /
| |/| / /
| | |/ / /
| | |\ \ \
| | * | | 1fc7ab7 (tag: v0.0.4.337) Merge branch 'DR' into NextMajor
| | | * | d24e21d (tag: v0.0.4.341, origin/DR-01) DR-010728 Updated unit tests
| | |\ \
| | * | 8c01099 (tag: v0.0.4.338, tag: 0.0.4_MILESTONE_RELEASE) Merge
After
| * | | 279e9ad (tag: v0.0.4.334, origin/DR) asdfasdf
| | |\ \
| |\| \ \
| | |\ \ \
| | |/ / /
| | * | | 1fc7ab7 (tag: v0.0.4.337) Merge branch 'DR' into NextMajor
| | | * | d24e21d (tag: v0.0.4.341, origin/DR-01) DR-010728 Updated unit tests
| | |/ /
| | * | 8c01099 (tag: v0.0.4.338, tag: 0.0.4_MILESTONE_RELEASE) Merge
Notice that any slashes in the commit messages are kept the same, but the slashes in the graphical part are transformed.

Keep it simple, just use awk. e.g. with GNU awk for the 3rd arg to match():
$ cat tst.awk
{
    match($0,/([| *\/\\]+)(.*)/,a)
    gsub(/\//,RS,a[1])
    gsub(/\\/,"/",a[1])
    gsub(RS,"\\",a[1])
    print a[1] a[2]
}
$ awk -f tst.awk file
| * | | 279e9ad (tag: v0.0.4.334, origin/DR) asdfasdf
| | |\ \
| |\| \ \
| | |\ \ \
| | |/ / /
| | * | | 1fc7ab7 (tag: v0.0.4.337) Merge branch 'DR' into NextMajor
| | | * | d24e21d (tag: v0.0.4.341, origin/DR-01) DR-010728 Updated unit tests
| | |/ /
| | * | 8c01099 (tag: v0.0.4.338, tag: 0.0.4_MILESTONE_RELEASE) Merge
With any awk and comments added in case it's not obvious what the script does:
$ cat tst.awk
{
    match($0,/[| *\/\\]+/)                # find the segment of text you want
    tgt = substr($0,RSTART,RLENGTH)       # save that segment in a variable tgt
    gsub(/\//,RS,tgt)                     # change all /s to newlines in tgt
    gsub(/\\/,"/",tgt)                    # change all \s to /s in tgt
    gsub(RS,"\\",tgt)                     # change all newlines to \s in tgt
    print tgt substr($0,RSTART+RLENGTH)   # print tgt plus rest of the line
}
We use newlines as the tmp value during the character swap since there's guaranteed to not already be a newline present in the line.
To turn the first word of each line that starts with # to uppercase, btw, might just be:
awk '/^#/{$1=toupper($1)}1' file
or:
awk '/^#/{$2=toupper($2)}1' file
depending on your input data, definition of a word, and white space requirements.
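A quick sketch of how the two variants differ, depending on whether the # is attached to the first word (function names are hypothetical):

```shell
# Hypothetical helpers wrapping the two awk one-liners.
upper_first_attached() { awk '/^#/{$1=toupper($1)}1'; }   # "#foo bar"
upper_first_detached() { awk '/^#/{$2=toupper($2)}1'; }   # "# foo bar"

printf '%s\n' '#foo bar'  | upper_first_attached   # #FOO bar
printf '%s\n' '# foo bar' | upper_first_detached   # # FOO bar
```

Keep in mind that assigning to a field makes awk rebuild the line with single spaces between fields, so runs of white space on the modified lines are collapsed.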
If the text you want to match can contain control characters, as it sounds like from your comments, then just allow that in the regexp, e.g.:
match($0,/([[:space:][:cntrl:]|*\/\\]+)(.*)/,a)

Here's a simple sed solution that should be portable (i.e. it works in sed variants other than GNU). It swaps slashes that do not follow a lowercase letter (which works in your sample data at least).
sed -e 's:\([^a-z]\)/:\1\\:g;t' -e 's:\([^a-z]\)\\:\1/:g' file
The breakdown of this goes a little like this:
s:\([^a-z]\)/:\1\\:g - replace forward slashes with backslashes
t - If we just did a substitution, skip to the end (avoiding the next substitution)
s:\([^a-z]\)\\:\1/:g - replace backslashes with forward slashes.
The reason to split this into two -e expressions is that some variants of sed require the branch name to be at the end of a line in the script. The end of a -e expression is deemed equivalent to the end of a line.
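To illustrate what the t is for, here is a small comparison with and without it (the function names are made up):

```shell
# Hypothetical helpers: the same two substitutions, with and without "t".
swap_with_t()    { sed -e 's:\([^a-z]\)/:\1\\:g;t' -e 's:\([^a-z]\)\\:\1/:g'; }
swap_without_t() { sed -e 's:\([^a-z]\)/:\1\\:g'   -e 's:\([^a-z]\)\\:\1/:g'; }

printf '%s\n' '| | |/ /' | swap_with_t      # | | |\ \
printf '%s\n' '| | |/ /' | swap_without_t   # | | |/ /
```

Without the t, the second substitution turns the freshly written backslashes straight back into slashes. One consequence of this design is that a line containing both kinds of slash only gets the first substitution, so it relies on each graph line using one kind, as in the sample data.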

This might work for you (GNU sed):
sed '/^#/s/\w\+/\U&/' file
or:
sed '/^#/!b;s/\w\w*/&\n/;h;y/abcdef/ABCDEF/;G;s/\n.*\n//' file

If your version of sed supports it, you can use \U to transform text to uppercase:
sed -r 's/(^# *)([^ ]*)/\1\U\2/'
This captures the first part of any line starting with # (including optional spaces), then anything up to the next space character. The second capture group is transformed to uppercase.
If it doesn't support it, then you can always use perl:
perl -pe 's/(^#\s*)([\S]*)/$1\U$2/'
I've used \s and \S in this version, which are equivalent to [[:space:]] (space characters) and [^[:space:]] (non-space characters) respectively. You might want to use a slightly different pattern depending on the specifics of the files you're working with.

Perl regex nested grouping results

I have files like this:
mu (micro) | 10^(-6) | millionth
m (milli) | 0.001 | thousandth
k (kilo) | 10^3 | thousand
M (mega) | 10^6 | million
And I would like to produce files like:
| $mu (micro)$ | $10^(-6)$ | $millionth$ |
| $m (milli)$ | $0.001$ | $thousandth$ |
| $k (kilo)$ | $10^3$ | $thousand$ |
| $M (mega)$ | $10^6$ | $million$ |
I'm trying to use a Perl regex, and so far the best expression I could come up with is:
perl -lpe '(([[:alnum:][:punct:]\s]+)\s+|\|\s*([[:alnum:][:punct:]\s]+)\s*\||\s*([[:alnum:][:punct:]\s]+))'
I know it has a few redundant \s+, but when I tried removing them the result was worse. Currently it only separates the line into two parts:
mu (micro) | 10^(-6) |
millionth
So how can I improve upon this, to get the desired result? I know I can use s/foo/bar/g to replace it but I can't get the expression to separate properly. Also how will I access the nested groups?
Perhaps there is a better way to do this, I'm open to suggestions.

perl -lpe '$_ = "| " . join(" | ", map "\$$_\$", split / \| /) . " |"'
In words: Split each line into fields (on " | "), wrap each field in $...$, join the fields with " | ", and add a | at the beginning and end.
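The same split, wrap and join steps can also be spelled out as a loop. This is just an awk sketch of the idea, assuming fields are separated by exactly " | " as in the sample; the function name is hypothetical:

```shell
# Hypothetical helper: wrap every " | "-separated field in $...$.
wrap_fields() {
  awk -F ' [|] ' '{
    out = "|"                                           # leading pipe
    for (i = 1; i <= NF; i++) out = out " $" $i "$ |"   # wrap and join each field
    print out
  }'
}

printf '%s\n' 'mu (micro) | 10^(-6) | millionth' | wrap_fields
# | $mu (micro)$ | $10^(-6)$ | $millionth$ |
```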

perl -pi -e 's/^(\S+ +\S+) +\| +(\S+) +\| +(\S+)$/| \$$1\$ | \$$2\$ | \$$3\$ |/g'

Get characters between two exact pipe | character in unix [duplicate]

I need to grep all characters between second and third | (pipe) character from a file.
Let's say we have a file with string like below (two lines):
abc123 | def123 | ghi123 | jkl123 | mno123
abc123 | def123 | jkl123 | ghi123 | mno123
After I use grep/sed/awk command I should get like
ghi123
jkl123
I would appreciate any clue or help.

If you always want the third field, you can try:
echo "abc123 | def123 | ghi123 | jkl123 | mno123" | awk -F ' [|] ' '{print $3}'
Or:
echo "abc123 | def123 | ghi123 | jkl123 | mno123" | cut -d '|' -f 3 | tr -d ' '
Output:
ghi123
For a string with many words between | you can use:
echo "abc123 | def123 | foo bar | jkl123 buz | mno123" | cut -d '|' -f 3 | sed -e 's/^ //' | sed -e 's/ $//'
Output:
foo bar
Note that sed -e 's/^ //' | sed -e 's/ $//' is used to remove the leading and trailing space, because tr -d ' ' would remove every space from the string.
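As a possible shortcut, awk can do the split and the trimming in one step if the field separator is a regex that swallows the spaces around each pipe. A minimal sketch (the function name is made up):

```shell
# Hypothetical helper: the separator is a pipe plus any surrounding spaces.
third_field() { awk -F ' *[|] *' '{ print $3 }'; }

printf '%s\n' 'abc123 | def123 | foo bar | jkl123 buz | mno123' | third_field   # foo bar
```

Spaces inside a field are kept; only the ones directly next to a pipe are treated as part of the separator.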

How to remove words of a line upto specific character pattern...Regex

I want the words that come after the word "test" on each line of a file. In other words, I don't want the words that come before "test".
That's the pattern.
e.g:
Input:
***This is a*** test page.
***My*** test work of test is complete.
Output:
test page.
work of test is complete.

Using sed:
sed -n 's/^.*test/test/p' input
If you want to print non-matching lines, untouched:
sed 's/^.*test/test/' input
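The commands above are greedy: they remove all text up to the last test on a line. If you want to cut at the first test instead, use potong's suggestion (GNU sed, because of the \n in the replacement). As written it also removes that first test itself; use \n& instead of &\n to keep it: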
sed -n 's/test/&\n/;s/.*\n//p' input

A pure bash one-liner:
while IFS= read -r x; do [[ $x =~ test.* ]] && echo "${BASH_REMATCH[0]}"; done <infile
Input: infile
This is a test page.
My test work of test is complete.
Output:
test page.
test work of test is complete.
It reads all lines from file infile, checks if the line contains the string test and then prints the rest of the line (including test).
The same in sed:
sed 's/.*\(test.*\)/\1/' infile
(Oops! This is wrong! .* is greedy, so it cuts too much from the 2nd example line.) This works well:
sed -e 's/\(test.*\)/\x03&/' -e 's/.*\x03//' infile
I did some speed testing (for the original (wrong) sed version). The result is that for small files the bash solution performs better. For larger files sed is better. I also tried this awk version, which is even better for big files:
awk 'match($0,"test.*"){print substr($0,RSTART)}' infile
Similar in perl:
perl -ne 's/(.*?)(test.*)/$2/ and print' infile
I started from the two-line example input file and doubled its size for each row. Every version was run 1000 times. The results:
Size | bash | sed | awk | perl
[B] | [sec] | [sec] | [sec] | [sec]
------------------------------------------
55 | 0.420 | 10.510 | 10.900 | 17.911
110 | 0.460 | 10.491 | 10.761 | 17.901
220 | 0.800 | 10.451 | 10.730 | 17.901
440 | 1.780 | 10.511 | 10.741 | 17.871
880 | 4.030 | 10.671 | 10.771 | 17.951
1760 | 8.600 | 10.901 | 10.840 | 18.011
3520 | 17.691 | 11.460 | 10.991 | 18.181
7040 | 36.042 | 12.401 | 11.300 | 18.491
14080 | 72.355 | 14.461 | 11.861 | 19.161
28160 |145.950 | 18.621 | 12.981 | 20.451
56320 | | | 15.132 | 23.022
112640 | | | 19.763 | 28.402
225280 | | | 29.113 | 39.203
450560 | | | 47.634 | 60.652
901120 | | | 85.047 |103.997

Regex replacement of a specific string using sed

I have been facing issues with using regex with sed.
I have a string like :
Call stack: [thread 0xac0aaa28]: | start | main main.m:37 | UIApplicationMain | GSEventRun | GSEventRunModal | CFRunLoopRunInMode | CFRunLoopRunSpecific | __CFRunLoopRun | __CFRunLoopDoSource1 | __CFRUNLOOP_IS_CALLING_OUT_TO_A_SOURCE1_PERFORM_FUNCTION__ | mshMIGPerform | _XCopyAttributeValue | _AXXMIGCopyAttributeValue | _copyAttributeValueCallback | -[NSObject(AXPrivCategory) accessibilityAttributeValue:] | -[UITableViewCellAccessibilityElement _accessibilityIsTableCell] | -[UITableViewCellAccessibilityElement tableViewCell] | -[UITableViewAccessibility(Accessibility) accessibilityCellForRowAtIndexPath:] | -[UITableView(UITableViewInternal) _createPreparedCellForRowAtIndexPath:] | -[UITableView(UITableViewInternal) _createPreparedCellForGlobalRow:withIndexPath:] | -[MailViewController tableView:cellForRowAtIndexPath:] | +[NICellFactory tableViewModel:cellForTableView:atIndexPath:withObject:] NICellFactory.m:89 | +[NICellFactory cellWithClass:tableView:object:] NICellFactory.m:67 | -[SwipableTableViewCell shouldUpdateCellWithObject:] | -[SwipableTableViewCell updateCellWithObject:] | -[ThreadCellFrontView updateCellWithObject:] | -[ThreadSummaryView updateWithNugget:] | -[JavaUtilLinkedList init] LinkedList.m:49 | -[JavaUtilLinkedList initJavaUtilLinkedList] LinkedList.m:40 | +[NSObject alloc] | +[NSObject allocWithZone:] | _objc_rootAllocWithZone | class_createInstance | calloc | malloc_zone_calloc
which has instances like main.m:37 |, LinkedList.m:95 |, NICellFactory.mm:89 | etc
In TextMate I can match these occurrences using the regex
[a-zA-z]+[.][m]+[:]+[0-9]+[ |]+
Now when I try to do the same thing in sed using
sed 's/\[a-zA-z]+[.][m]+[:]+[0-9][ |]+/ /g'
Sed does not seem to replace these instances.
I have tried using backslashes too, i.e.
sed 's/\[a-zA-z\]+\[\.\]\[m\]+\[:\]+\[0-9\]+\[ |\]+/ /g'
Still sed does not replace such occurrences.
Can someone help me understand what I am doing wrong?
Thanks

The backslashes you added for no good reason are the problem. Also your sed dialect may not support + repetition out of the box - try with * instead, or look for a -r or -E option in your sed manual page.
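For example, with extended regular expressions switched on, the pattern from the question works almost as written once the stray backslashes are gone. This is a sketch assuming your sed accepts -E (GNU and BSD sed both do); the function name is made up, and A-z has been changed to A-Z:

```shell
# Hypothetical helper: remove "File.m:NN | " style references.
strip_file_refs() { sed -E 's/[a-zA-Z]+[.]m+:[0-9]+[ |]+/ /g'; }

printf '%s\n' 'start | main main.m:37 | UIApplicationMain' | strip_file_refs
# start | main  UIApplicationMain
```

Inside a bracket expression the | is an ordinary character, so [ |]+ needs no escaping even in extended mode.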

The following works for me:
sed -i.bck "s/[a-zA-Z][a-zA-Z]*\.mm*::*[0-9][0-9]*\s|/ /g" prova_sed.txt
It creates a backup file just in case.
My sed doesn't seem to support the +, \w and \d syntax, so I've used [a-zA-Z][a-zA-Z]* instead of [a-zA-Z]+, mm* instead of m+, and so on.
Also note that you don't need to put single characters inside brackets, so [.][m]+[:]+ can be replaced with \.mm*::*
If your sed version supports the -r option, the whole thing could be simplified to the following (the | must then be escaped, because an unescaped | means alternation in extended regular expressions):
sed -r -i.bck "s/[a-zA-Z]+\.m+:+[0-9]+\s\|/ /g" prova_sed.txt
