I'm after something that I guess is pretty complex, and I'm pretty bad with regexes, so you guys might be able to help.
See this data source:
User ID:
a123456
a12345f
a1234e6
d123d56
b12c456
c1b3456
ba23456
Basically, what I want to do is use a regex/sed to replace all occurrences of letters with numbers EXCEPT the first letter. Each letter always maps to its alphabet position, e.g. a = 1, b = 2, c = 3, etc.
So the result set should look like this:
User ID:
a123456
a123456
a123456
d123456
b123456
c123456
b123456
There will also never be any letters other than a-j, and the string will always be 7 chars long.
Can anyone shed some light? Thanks! :)
Here's one way you could do it using standard tools cut, paste and tr:
$ paste -d'\0' <(cut -c1 file) <(cut -c2- file | tr 'abcdef' '123456')
a123456
a123456
a123456
d123456
b123456
c123456
b123456
This joins the first character of the line with the result of tr on the rest of the line, using the null string. tr replaces each element found in the first list with the corresponding element of the second list.
To replace a-j letters in a line by the corresponding digits except the first letter using perl:
$ perl -pe 'substr($_, 1) =~ tr/a-j/0-9/' input_file
a=0, not a=1 because j would be 10 (two digits) otherwise.
j = 0, and no, only numbers 0-9 are used, and letters simply replace their number counterpart, so there will never be a letter greater than j.
To make j=0 and a=1:
$ perl -pe 'substr($_, 1) =~ tr/ja-i/0-9/' input_file
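For comparison, here is a minimal Python sketch of the same idea: leave the first character alone and transliterate a-j in the rest of the line, with j=0 and a=1 as in the Perl one-liner above. The names LETTER_TO_DIGIT and normalize_id are made up for this illustration.

```python
# Hypothetical helper names; the mapping mirrors tr/ja-i/0-9/ from the Perl answer.
LETTER_TO_DIGIT = str.maketrans("jabcdefghi", "0123456789")


def normalize_id(line: str) -> str:
    """Keep the first character and translate letters a-j in the remainder."""
    return line[:1] + line[1:].translate(LETTER_TO_DIGIT)


for sample in ["a12345f", "d123d56", "ba23456"]:
    print(normalize_id(sample))
```

Like the Perl version, this translates every line it is given, so a header line would be altered as well.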
sed '/[a-j][0-9a-j]\{6\}$/{h;y/abcdefghij/1234567890/;G;s/.\(.\{6\}\).\(.\).*/\2\1/;}' YourFile
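only act on lines that look like an ID (a letter a-j followed by six characters from 0-9a-j)
save the line in the hold space (h), to keep the original first letter
change every letter to a digit (y), including the first one
append the saved original line as a second line in the pattern space (G)
take the first letter of the second line and the last six characters of the first line, reorder them, and drop all other characters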
$ awk 'BEGIN{FS=OFS=""} NR>1{for (i=2;i<=NF;i++) if(p=index("jabcdefghi",$i)) $i=p-1} 1' file
User ID:
a123456
a123456
a123456
d123456
b123456
c123456
b123456
Note that the above reproduces the header line User ID: as-is. So far, best I can tell, all of the other posted solutions would change the header line to Us5r ID: since they would do the letter-to-number translation on it just like on all of the subsequent lines.
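To illustrate the header issue, here is a small Python sketch that only converts lines shaped like an ID and passes everything else through untouched. The shape test (one letter a-j followed by six characters from 0-9a-j) is borrowed from the sed answer above; the names ID_SHAPE, TABLE and convert_lines are hypothetical.

```python
import re

# Assumed ID shape: one letter a-j, then exactly six characters from 0-9 or a-j.
ID_SHAPE = re.compile(r"[a-j][0-9a-j]{6}")
TABLE = str.maketrans("jabcdefghi", "0123456789")


def convert_lines(lines):
    """Translate letters after the first character, but only on ID-shaped lines."""
    result = []
    for line in lines:
        if ID_SHAPE.fullmatch(line):
            line = line[0] + line[1:].translate(TABLE)
        result.append(line)
    return result


print("\n".join(convert_lines(["User ID:", "a1234e6", "c1b3456"])))
```

Filtering on the line shape rather than on the line number means the header survives wherever it appears in the file.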
I don't see the complexity. Your samples look like you just want to replace six of seven characters with the numbers 1-6:
s/^\([a-j0-9]\)[a-j0-9]\{6\}/\1123456/
Since the numbers to put there are defined by position, we don't care what the letter was (or even if it was a letter). The downside here is that we don't preserve the numbers, but they never varied in your sample data.
If we want to replace only letters, the first method I can think of involves simply using multiple substitutions:
s/^\([a-j0-9]\{1\}\)[a-j]/\11/
s/^\([a-j0-9]\{2\}\)[a-j]/\12/
s/^\([a-j0-9]\{3\}\)[a-j]/\13/
s/^\([a-j0-9]\{4\}\)[a-j]/\14/
s/^\([a-j0-9]\{5\}\)[a-j]/\15/
s/^\([a-j0-9]\{6\}\)[a-j]/\16/
Replacing letters with specific digits, excluding the first letter:
s/\(.\)a/\11/g
This pattern will replace two character sequences, preserving the first, so would have to be run twice for each letter. Using hold space we could store the first character and use a simple transliteration. The tricky part is joining the two sections, whereupon sed injects an unwanted newline.
# Store in hold space
h
# Remove the first character
s/^.//
# Transliterate letters
y/jabcdefghi/0123456789/
# Exchange pattern and hold space
x
# Keep the first character
s/^\(.\).*$/\1/
# Print it
#P
# Join
G
# Remove the newline
s/^\(.\)./\1/
Still learning about sed's capabilities :)
I want to grep all the lines in a file that contain symbols (non-alphanumeric characters), start with a number, or have spaces in them:
grep -i "^[0-9]\|[^a-zA-Z0-9]\| "
I have written the grep command above, which works perfectly. However, I also wish to include lines whose length is outside a particular range; for example, all lines shorter than 3 or longer than 15 characters should be matched.
How can I include that length limit as well in one command?
I tried using
{3,15}
and so on, but could not get the desired output.
sample input
aa
9dsa
abcd
abc#$
ab d
Sample output
aa //because length less than 3
ab d //because has space in between
9dsa // because starts with a number
abc#$ //because has special symbols in it
For clarity, simplicity, robustness, portability, etc. just use awk instead of grep to search for non-trivial conditions:
$ awk 'length()<3 || length()>15 || /[^[:alnum:]]/ || /[[:space:]]/ || /^[0-9]/' file
aa
9dsa
abc#$
ab d
I mean seriously, that couldn't get much clearer/simpler and it will work in any POSIX awk and it's trivial to change if/when your requirements change.
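The same set of conditions can also be sketched in Python, one clause per condition, which makes it easy to see how each requirement maps to a test. This assumes ASCII input, so [[:alnum:]] is approximated by A-Za-z0-9; the function name is_flagged is made up for this example.

```python
import re


def is_flagged(line: str) -> bool:
    """Return True if the line meets any of the conditions from the awk one-liner."""
    return (
        len(line) < 3                                   # shorter than 3 characters
        or len(line) > 15                               # longer than 15 characters
        or re.search(r"[^A-Za-z0-9]", line) is not None  # any non-alphanumeric, incl. spaces
        or re.match(r"[0-9]", line) is not None          # starts with a digit
    )


sample = ["aa", "9dsa", "abcd", "abc#$", "ab d"]
print([line for line in sample if is_flagged(line)])
```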
The expression below should help you find the required lines. I am assuming you will use grep -E so the alternation will work properly
^[[:digit:]]|[@#$%^&*()]|^.{0,2}$|^.{16,}$
Below is the explanation for the regex
^[[:digit:]] - Match a line that starts with a number
[@#$%^&*()] - Match a line containing the specified symbols.
Alternatively you can use [^[:alnum:]], if you want
the symbol to match any non alpha numeric character.
Beware that a space, underscore, tab, quote, etc are all
examples of non alpha numeric characters
^.{0,2}$ - Match a line containing less than 3 characters
^.{16,}$ - Match a line containing more than 15 characters
I want to find words in a document using only the letters in a given pattern, but those letters can appear at most once.
Suppose document.txt consists of "abcd abbcd"
What pattern (and what concepts are involved in writing such a pattern) will return "abcd" and not "abbcd"?
You could check if a character appears more than once and then negate the result (in your source code):
split your document into words
check each word with ([a-z])[a-z]*\1 (that matches abbcd, but not abcd)
negate the result
Explanation:
([a-z]) matches any single character
[a-z]* allows none or more characters after the one matched above
\1 is a back reference to the character found at ([a-z])
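The three steps above can be sketched in Python roughly as follows. Note that this only rejects words with a repeated letter; it does not check that the word uses only letters from a given pattern. The names REPEATED and words_without_repeats are hypothetical.

```python
import re

# Same expression as in the answer: a letter, any letters, then the same letter again.
REPEATED = re.compile(r"([a-z])[a-z]*\1")


def words_without_repeats(text: str):
    """Split into words, test each for a repeated letter, and negate the result."""
    return [word for word in text.split() if not REPEATED.search(word)]


print(words_without_repeats("abcd abbcd"))
```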
There were already some good ideas here, but I wanted to offer an example implementation in python. This isn't necessarily optimal, but it should work. Usage would be:
$ python find.py -p abcd < file.txt
And the implementation of find.py is:
import argparse
import sys
from itertools import cycle

parser = argparse.ArgumentParser()
parser.add_argument('-p', required=True)
args = parser.parse_args()

for line in sys.stdin:
    for candidate in line.split():
        present = dict(zip(args.p, cycle((0,))))  # initialize dict of letter:count
        for ch in candidate:
            if ch in present:
                present[ch] += 1
        if all(x <= 1 for x in present.values()):
            print(candidate)
This handles your requirement of matching each character in the pattern at most once, i.e. it allows for zero matches. If you wanted to match each character exactly once, you'd change the second-to-last line to:
if all(x == 1 for x in present.values()):
Melpomene is right, regexps are not the best instrument to solve this task. Regexp is essentially a finite state machine. In your case current state can be defined as the combination of presence flags for each of the letters from your alphabet. Thus the total number of internal states in regex will be 2^N where N is the number of allowed letters.
The easiest way to define such a regex is to list all possible permutations of the available letters (using ? so that shorter sequences do not have to be listed separately). For three letters (a,b,c) the regex looks like:
a?b?c?|a?c?b?|b?a?c?|b?c?a?|c?a?b?|c?b?a?
For the four letters (a,b,c,d) it becomes much longer:
a?b?c?d?|a?b?d?c?|a?c?b?d?|a?c?d?b?|a?d?b?c?|a?d?c?b?|b?a?c?d?|b?a?d?c?|b?c?a?d?|b?c?d?a?|b?d?a?c?|b?d?c?a?|c?a?b?d?|c?a?d?b?|c?b?a?d?|c?b?d?a?|c?d?a?b?|c?d?b?a?|d?a?b?c?|d?a?c?b?|d?b?a?c?|d?b?c?a?|d?c?a?b?|d?c?b?a?
As you can see, not that convenient.
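To make the growth concrete, here is a short Python sketch that generates this kind of alternation for any set of letters. It is only an illustration of the permutation idea described above; the function name at_most_once_pattern is made up.

```python
import re
from itertools import permutations


def at_most_once_pattern(letters: str) -> str:
    """Build one 'x?y?z?' alternative per permutation of the given letters."""
    alternatives = (
        "".join(re.escape(ch) + "?" for ch in perm)
        for perm in permutations(letters)
    )
    return "|".join(alternatives)


print(at_most_once_pattern("abc"))
print(len(at_most_once_pattern("abcd").split("|")))  # number of alternatives for 4 letters
```

The number of alternatives is N factorial, so the pattern becomes impractical quickly as letters are added.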
The solution without regexps depends on your toolset. I would write a simple program that processes the input text word by word. At the start of each word a BitSet is created, in which each bit represents the presence of the corresponding letter of the desired alphabet. While traversing the word, if the bit for the current letter is zero, it is set to one. If an already-set bit is encountered, or the letter is not in the alphabet, the word is skipped. If the whole word is traversed without that happening, it is "valid".
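A minimal Python sketch of that bit-set idea might look like this, using a plain integer as the set of flags. The function name is_valid_word and the way the alphabet is passed in are assumptions made for this example.

```python
def is_valid_word(word: str, alphabet: str) -> bool:
    """Return True if every letter is in the alphabet and none occurs twice."""
    # One bit per allowed letter.
    bit_for = {ch: 1 << i for i, ch in enumerate(alphabet)}
    seen = 0
    for ch in word:
        bit = bit_for.get(ch)
        if bit is None or seen & bit:
            # Letter outside the alphabet, or its bit is already set: skip the word.
            return False
        seen |= bit
    return True


print([w for w in "abcd abbcd".split() if is_valid_word(w, "abcd")])
```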
grep -Pwo '[abc]+' | grep -Pv '([abc]).*\1' | awk 'length==3'
where:
first grep: a word composed by the pattern letters...
second grep: ... with no repeated letters ...
awk: ...whose length is the number of letters
I have text like this;
2500.00 $120.00 4500 12.00 $23.00 50.0989
I've written a regex:
/(?!$)\d+\.\d{2}/g
I want it to match only 2500.00 and 12.00, nothing else.
The requirement is to add the '$' sign to numeric values that have exactly two digits after the decimal point. The current regex adds an extra '$' to the ones that already have a '$' sign. The real case is longer, but I'm describing it briefly. I know I could use one regex to remove the '$' and then another regex to add '$' to all the desired numbers.
Any help would be appreciated, thanks!
To answer your question, you need to look before the pos where the first digit is.
(?<!\$)
But that alone is not going to work, as it will match 23.45 of $123.45 and change it into $1$23.45, and it will match 123.45 of 123.456 and change it into $123.456. You want to make sure there are no digits before or after what you match.
s/(?<![\$\d])(\d+\.\d{2})(?!\d)/\$$1/g;
Or the quicker
s/(?<![\$\d])(?=\d+\.\d{2}(?!\d))/\$/g;
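The same lookbehind and lookahead can be tried out in Python's re module, which supports both constructs. This is only a sketch for experimenting with the pattern; PRICE and add_dollar are hypothetical names.

```python
import re

# Not preceded by '$' or a digit, exactly two decimals, not followed by a digit.
PRICE = re.compile(r"(?<![$\d])(\d+\.\d{2})(?!\d)")


def add_dollar(text: str) -> str:
    """Prefix '$' to bare numbers that have exactly two digits after the point."""
    return PRICE.sub(r"$\1", text)


print(add_dollar("2500.00 $120.00 4500 12.00 $23.00 50.0989"))
```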
This is tricky only because you are trying to include too many functionalities in your single regex. If you manipulate the string first to isolate each number, this becomes trivial, as this one-liner demonstrates:
$ perl -F"(\s+)" -lane's/^(?=\d+\.\d{2}$)/\$/ for @F; print @F;'
2500.00 $120.00 4500 12.00 $23.00 50.0989
$2500.00 $120.00 4500 $12.00 $23.00 50.0989
The full code for this would be something like:
while (<>) {                         # or whatever file handle or input you read from
    my @line = split /(\s+)/;
    s/^(?=\d+\.\d{2}$)/\$/ for @line;
    print @line;                     # or select your desired means of output
    # my $out = join "", @line;      # as string
}
Note that this split is non-destructive because we use parentheses to capture our delimiters. So for our sample input, the resulting list looks like this when printed with Data::Dumper:
$VAR1 = [
'2500.00',
' ',
'$120.00',
' ',
'4500',
' ',
'12.00',
' ',
'$23.00',
' ',
'50.0989'
];
Our regex here is simply anchored in both ends, and allowed to contain numbers, followed by a period . and two numbers, and nothing else. Because we use a look-ahead assertion, it will insert the dollar sign at the beginning, and keep everything else. Because of the strictness of our regex, we do not need to worry about checking for any other characters, and because we split on whitespace, we do not need to check for any such.
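The split-then-match approach carries over to Python as well, since re.split also keeps the delimiters when the separator pattern is wrapped in a capturing group. A rough sketch, with the made-up name add_dollar_by_token:

```python
import re


def add_dollar_by_token(line: str) -> str:
    """Split on whitespace (keeping it), then fix each token independently."""
    tokens = re.split(r"(\s+)", line)  # the capture group preserves the whitespace
    fixed = [
        "$" + token if re.fullmatch(r"\d+\.\d{2}", token) else token
        for token in tokens
    ]
    return "".join(fixed)


print(add_dollar_by_token("2500.00 $120.00 4500 12.00 $23.00 50.0989"))
```

Because each token is matched in full, no lookarounds are needed to rule out neighbouring characters.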
You can use this pattern:
s/(?<!\S)\d+\.\d{2}(?!\S)/\$${^MATCH}/gp
or
s/(?<!\S)(?=\d+\.\d{2}(?!\S))/\$/g
I think it is the shorter way.
(?<!\S) not preceded by a non-whitespace character
(?!\S) not followed by a non-whitespace character
The main advantage of these double negations is that they automatically cover the beginning-of-string and end-of-string cases.
I have a poorly formatted csv file of Korean words with English definitions. I'd like to add a new line before each Korean word. For example:
# I'd like to change this
하다,to do,크기,size,대기,on hold,
# Into this
하다,to do,
크기,size,
대기,on hold,
Using the regex ([^\x00-\x7F]*) I was able to highlight all instances of Korean words but when I try to replace them with \n$1 it works only for the first word after my last cursor position and then inserts a newline after each character.
Use + instead of *, (the former means 1 or more, the latter means 0 or more). Otherwise you get zero-width matches at every position:
[^\x00-\x7F]+
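The difference between * and + is easy to see in Python, whose re module uses the same quantifier semantics. This is only a demonstration of the zero-width-match effect, not of what any particular editor does; break_before_non_ascii is a hypothetical name.

```python
import re

line = "하다,to do,크기,size,대기,on hold,"


def break_before_non_ascii(text: str) -> str:
    """Insert a newline before each run of one or more non-ASCII characters."""
    return re.sub(r"[^\x00-\x7F]+", r"\n\g<0>", text)


# With *, the pattern also matches the empty string between ASCII characters.
print(re.findall(r"[^\x00-\x7F]*", "하다,to"))
# With +, only the real run of non-ASCII characters matches.
print(re.findall(r"[^\x00-\x7F]+", "하다,to"))
print(break_before_non_ascii(line))
```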
I would like to match any character and any whitespace except comma with regex. Only matching any character except comma gives me:
[^,]*
but I also want to match any whitespace characters, tabs, space, newline, etc. anywhere in the string.
EDIT:
This is using sed in vim via :%s/foo/bar/gc.
I want to find starting from func up until the comma, in the following example:
func("bla bla bla"
"asdfasdfasdfasdfasdfasdf"
"asdfasdfasdf", "more strings")
To work with multiple lines in sed using regular expressions, you should look here.
EDIT:
In sed, working with newlines is a bit different. sed supports three commands for multi-line operations: N, P and D. To see how they work, see this explanation (Working with Multiple Lines), where all three are discussed.
My guess is that the N command is what is missing here. Adding N makes it possible to match \n in the pattern space.
An example from here:
Occasionally one wishes to use a new line character in a sed script.
Well, this has some subtle issues here. If one wants to search for a
new line, one has to use "\n." Here is an example where you search for
a phrase, and delete the new line character after that phrase -
joining two lines together.
(echo a;echo x;echo y) | sed '/x$/ {
N
s:x\n:x:
}'
which generates
a
xy
However, if you are inserting a new line, don't use "\n" - instead
insert a literal new line character:
(echo a;echo x;echo y) | sed 's:x:X\
:'
generates
a
X

y
So basically you're trying to match a pattern over multiple lines.
Here's one way to do it in sed (pretty sure these are not useable within vim though, and I don't know how to replicate this within vim)
sed '
/func/{
:loop
/,/! {N; b loop}
s/[^,]*/func("ok"/
}
' inputfile
Let's say inputfile contains these lines
func("bla bla bla"
"asdfasdfasdfasdfasdfasdf"
"asdfasdfasdf", "more strings")
The output is
func("ok", "more strings")
Details:
If a line contains func, enter the braces.
:loop is a label named loop
If the line does not contain , (that's what /,/! means)
append the next line to pattern space (N)
branch to / go to loop label (b loop)
So it will keep on appending lines and looping until , is found, upon which the s command is run which matches all characters before the first comma against the (multi-line) pattern space, and performs a replacement.
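The append-and-loop logic can be mirrored in Python to make the control flow explicit. This is a sketch of the same idea, not a replacement for the sed script; the function name collapse_func_args is made up, and the behaviour at end of input (leave the buffer unchanged if no comma is ever found) is an assumption.

```python
import re


def collapse_func_args(lines):
    """Join lines after a 'func' line until a comma appears, then replace up to it."""
    result = []
    iterator = iter(lines)
    for line in iterator:
        if "func" not in line:
            result.append(line)
            continue
        buffer = line
        found_comma = True
        while "," not in buffer:          # like /,/! {N; b loop}
            following = next(iterator, None)
            if following is None:         # ran out of input without seeing a comma
                found_comma = False
                break
            buffer += "\n" + following
        if found_comma:
            # [^,]* also matches newlines here, so it spans the joined lines.
            buffer = re.sub(r"[^,]*", 'func("ok"', buffer, count=1)
        result.append(buffer)
    return result


source = [
    'func("bla bla bla"',
    '"asdfasdfasdfasdfasdfasdf"',
    '"asdfasdfasdf", "more strings")',
]
print("\n".join(collapse_func_args(source)))
```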