Remove "." from digits - regex

I have a string like this:
"lmn abc 4.0mg 3.50 mg over 12 days. Standing nebs."
I want to convert it into:
"lmn abc 40mg 350 mg over 12 days. Standing nebs."
That is, I only want to convert a.b -> ab where a and b are integers.
waiting for help

Assuming you are using Python, you can use capture groups in the regex, either numbered or named, and then reference the groups in the replacement while leaving out the "." between them.
import re
text = "lmn abc 4.0mg 3.50 mg over 12 days. Standing nebs."
Numbered: you reference each pattern group (the content in parentheses) by its index.
text = re.sub(r"(\d+)\.(\d+)", r"\1\2", text)
Named: you reference each pattern group by a name you specify.
text = re.sub(r"(?P<before>\d+)\.(?P<after>\d+)", r"\g<before>\g<after>", text)
Either one returns:
print(text)
> lmn abc 40mg 350 mg over 12 days. Standing nebs.
However, be aware that removing the . from decimal numbers changes their value (4.0 becomes 40), so be careful with whatever you do with these numbers afterwards.

Using any sed in any shell on every Unix box:
$ sed 's/\([0-9]\)\.\([0-9]\)/\1\2/g' file
"lmn abc 40mg 350 mg over 12 days. Standing nebs."

Using sed
$ cat input_file
"lmn abc 4.0mg 3.50 mg over 12 days. Standing nebs. a.b.c."
$ sed 's/\([a-z0-9]*\)\.\([a-z0-9]\)/\1\2/g' input_file
"lmn abc 40mg 350 mg over 12 days. Standing nebs. abc."

echo '1.2 1.23 12.34 1. .2' |
ruby -p -e '$_.gsub!(/\d+\K\.(?=\d+)/, "")'
Output
12 123 1234 1. .2
If performance matters:
echo '1.2 1.23 12.34 1. .2' |
ruby -p -e 'BEGIN{$regex = /\d+\K\.(?=\d+)/; $empty_string = ""}; $_.gsub!($regex, $empty_string)'
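For comparison, here is a rough Python sketch of the same lookaround idea. Python's re module has no \K, so this assumes a lookbehind is an acceptable substitute; the names DOT_BETWEEN_DIGITS and strip_inner_dots are made up for the example. Compiling the pattern once plays the same role as the BEGIN block above.

```python
import re

# Match a "." only when there is a digit directly before and after it.
# Compiled once so repeated calls reuse the same pattern object.
DOT_BETWEEN_DIGITS = re.compile(r"(?<=\d)\.(?=\d)")

def strip_inner_dots(text):
    """Remove dots that sit between two digits; leave all other dots alone."""
    return DOT_BETWEEN_DIGITS.sub("", text)

print(strip_inner_dots("1.2 1.23 12.34 1. .2"))
# prints: 12 123 1234 1. .2
```

Since nothing is captured, the replacement is simply the empty string.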

Related

I want to find some string in front of another string pattern, how to do it?

I want to use the bash shell to split strings like:
Calcipotriol - Daivonex Cream 50mcg/1g 30 g [1]
Aspirin - DBL Aspirin 100mg [1] tablet
I want to get the brand names "Daivonex Cream" and "DBL Aspirin",
that is, the name in front of the pattern ***mg, ***mcg or ***g.
How can I do it?
If your sample input is representative, awk may offer the simplest solution:
awk -F'- | [0-9]+(mc?)?g' '{ print $2 }' <<'EOF'
Calcipotriol - Daivonex Cream 50mcg/1g 30 g [1]
Aspirin - DBL Aspirin 100mg [1] tablet
Foo - Foo Bar 22g [1] other
EOF
yields:
Daivonex Cream
DBL Aspirin
Foo Bar
In Bash you can do:
while IFS= read -r line || [[ -n "$line" ]]; do
    if [[ "$line" =~ ^([[:alpha:]]+)[[:space:][:punct:]]+([[:alpha:][:space:]]+)[[:space:]](.*)$ ]]
    then
        printf "1:'%s' 2:'%s' 3:'%s'\n" "${BASH_REMATCH[1]}" "${BASH_REMATCH[2]}" "${BASH_REMATCH[3]}"
    fi
done <<<"Calcipotriol - Daivonex Cream 50mcg/1g 30 g [1]
Aspirin - DBL Aspirin 100mg [1] tablet"
Prints:
1:'Calcipotriol' 2:'Daivonex Cream' 3:'50mcg/1g 30 g [1]'
1:'Aspirin' 2:'DBL Aspirin' 3:'100mg [1] tablet'
You can use sed this way:
sed -E 's/^[[:alpha:]]+ - ([[:alpha:] ]+) [[:digit:]]+.*/\1/' <<< "Calcipotriol - Daivonex Cream 50mcg/1g 30 g [1]"
=> Daivonex Cream
^[[:alpha:]]+ - => matches all the characters until the pattern we need to extract
([[:alpha:] ]+) => this is the part we want to extract
[[:digit:]]+.* => this is everything that comes after; we assume this part starts with a space and one or more digits, followed by any number of characters
\1 => the part extracted by the (...) expression above;
we replace the entire string with the matched part
You can check out this site to learn more about regexes: http://regexr.com/
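The pattern breakdown above carries over to other regex engines almost unchanged. As a minimal sketch in Python (the names BRAND and brand_name are hypothetical, and [A-Za-z] is assumed to be a good enough stand-in for [[:alpha:]] on this input):

```python
import re

# generic name, " - ", then the brand (letters and spaces), then a space and digits
BRAND = re.compile(r"^[A-Za-z]+ - ([A-Za-z ]+) \d+")

def brand_name(line):
    """Return the brand name captured by group 1, or None if the line does not fit."""
    match = BRAND.match(line)
    return match.group(1) if match else None

for line in ("Calcipotriol - Daivonex Cream 50mcg/1g 30 g [1]",
             "Aspirin - DBL Aspirin 100mg [1] tablet"):
    print(brand_name(line))
```

The greedy group first swallows the trailing space as well, then backtracks by one character so that the literal space and the digits can match.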

Remove duplicate words and just print lines in which this occurs

I have a challenge: check whether a sentence in a file contains 2 identical consecutive words. If so, print the sentence with the repeated word removed; otherwise, don't print the sentence.
Example:
abc2 1 def2 3 abc2
F4
--------------
dea 123 123 zy45
12 12
abc cd abc cd
xyz%$#! xyz%$#! kk
xyzxyz
abc h h h h
After running the program the output will be:
dea 123 zy45
12
xyz%$#! kk
abc h h h
3
This is what I have so far:
sed '/\([^\([^ ]\+\)[ ]\+\1]\)/d' F4 >|tmp
I got this so far but this is only separating between the sentences that have the double word and sentences that don't.
Your sed expression was close. However, it needed some adjustments to make it work (GNU sed):
$ sed -nr 's/\b(\S+)\s+\1(\s|$)/\1\2/p' file
dea 123 zy45
12
xyz%$#! kk
abc h h h
The idea is the one you already implemented: match a given word with [^ ] and see if you match it again with \1. What I added is replacing all of this with \1\2, that is, the word once plus whatever followed it (a whitespace character, or nothing at the end of the line), so the repeated block disappears.
Instead of [^ ] it is also useful to use \S, and instead of [ ], \s. Note also the usage of \b as a word boundary to prevent false positives like fedorqui qui, and the usage of \1(\s|$) to prevent other false positives like hello helloa (thanks WalterA for the examples!). \s|$ matches either a whitespace character or the end of the line. A trailing \b would not do here: \b only matches between a word character and a non-word character, so it fails after a token such as xyz%$#! that ends in punctuation.
To prevent all lines from being printed, we use sed -n. This way, we just print (with p) the lines on which the substitution succeeded.
Note the usage of -r, which avoids having to escape the capture groups, + and |. Without it, the command would be:
sed -n 's/\b\(\S\+\)\s\+\1\(\s\|$\)/\1\2/p' file
Let's test it with a more comprehensive input:
$ cat a
abc2 1 def2 3 abc2
F4
--------------
dea 123 123 zy45
12 12
abc cd abc cd
xyz%$#! xyz%$#! kk
xyzxyz
fedorqui qui
hello helloa
abc h h h h
$ sed -nr 's/\b(\S+)\s+\1(\s|$)/\1\2/p' a
dea 123 zy45
12
xyz%$#! kk
abc h h h
I was looking for a sed solution, which seemed easy, but perhaps awk is better in this case (F4 is the input file):
awk '{
    for (i=2; i<=NF; i++) {
        if ($(i-1)==$i) {
            $i="";
            printf("%s\n", $0);
            break;
        }
    }
}' F4
I am not completely happy with this solution, since it leaves a doubled field separator in $0 after deleting the repeated word, but strictly speaking the OP did not say that a space or tab should be deleted too.
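The same task can be sketched in Python, which sidesteps both the word-boundary caveat and the leftover separator. This is only an illustration: DUPLICATE and drop_first_duplicate are invented names, and "word" is assumed to mean any run of non-whitespace characters.

```python
import re

# A whitespace-delimited token followed by whitespace and the same token again.
# (?<!\S) and (?!\S) require the pair to start and end at token edges, so
# "fedorqui qui" and "hello helloa" are not treated as duplicates.
DUPLICATE = re.compile(r"(?<!\S)(\S+)\s+\1(?!\S)")

def drop_first_duplicate(line):
    """Return the line with its first repeated word removed, or None if it has none."""
    new_line, count = DUPLICATE.subn(r"\1", line, count=1)
    return new_line if count else None

sample = ["dea 123 123 zy45", "12 12", "abc cd abc cd", "xyz%$#! xyz%$#! kk",
          "xyzxyz", "fedorqui qui", "hello helloa", "abc h h h h"]
for line in sample:
    result = drop_first_duplicate(line)
    if result is not None:
        print(result)
```

Because the lookahead does not consume the following whitespace, the rest of the line keeps its original spacing.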

Regex for soccer data

Why isn't my regex working? It just returns back the original file. My file looks like this (for a few hundred lines):
1 Germany 1765 0 Equal
2 Argentina 1631 0 Equal
3 Colombia 1488 1 Up
4 Netherlands 1456 -1 Down
5 Belgium 1444 0 Equal
6 Brazil 1291 1 Up
7 Uruguay 1243 -1 Down
8 Spain 1228 -1 Down
9 France 1202 1 Up
...
192 US Virgin Islands 28 -1 Down
And I want this:
Germany,1
Argentina,2
Colombia,3
...
US Virgin Islands,192
This is the regex I tried:
sed 's/\([0-9]*\)\t\([a-zA-Z]*\)/\2,\1/g' <fifa.csv >fifa.csv
But it just returns the original file.
EDIT:
Now I tried
sed 's/\([0-9]*\)\t\([a-zA-Z]*\)/\2,\1/g' <fifa.csv >fifa.csv
and got
,1 Germany,,1765Equal,0,
,2 Argentina,,1631Equal,0,
,3 Colombia,,1488Up,1,
,4 Netherlands,,1456-Down,1,
,5 Belgium,,1444Equal,0,
You could try the below sed command if the fields are tab-separated.
sed 's/^\([0-9]\+\)\t\([^\t]*\).*/\2,\1/' file
Add the inline-edit option -i to save the changes made.
sed -i 's/^\([0-9]\+\)\t\([^\t]*\).*/\2,\1/' file
^ anchors the match at the start of the line. \+ repeats the previous element one or more times; sed uses BRE by default, so the + has to be escaped to get that meaning (this, like \t, is a GNU extension). [^\t]* matches zero or more characters other than a tab.
The following is what you are looking for. The -i option specifies that files are to be edited in-place.
sed -i 's/^\([0-9]\+\)\t\([^\t]*\).*/\2,\1/' fifa.csv
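awk -F'\t' '{print $2 "," $1}' YourFile
Not sed, but easier to manage. Assuming the fields are tab-separated, -F'\t' keeps multi-word names such as US Virgin Islands in a single field.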
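If the file really is tab-separated, as the answers above assume, splitting on tabs avoids regexes altogether. A small Python sketch (the function name name_and_rank is made up for illustration):

```python
def name_and_rank(line):
    """Turn 'rank<TAB>name<TAB>...' into 'name,rank'."""
    # Assumes at least two tab-separated fields; later fields are ignored.
    rank, name = line.rstrip("\n").split("\t")[:2]
    return f"{name},{rank}"

rows = ["1\tGermany\t1765\t0\tEqual",
        "192\tUS Virgin Islands\t28\t-1\tDown"]
for row in rows:
    print(name_and_rank(row))
# prints:
# Germany,1
# US Virgin Islands,192
```

When applying this to a real file, write the result to a different file than the one being read; redirecting output onto the input file, as in the question, truncates it before it is read.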

RegEx to extract numeric value immediately before a search character /string found

What would be the best regex to extract the number (digits only) immediately before a search string?
ABC Y C S 1 $ 46CC MAN 25/ 31
I need to extract 25 in this case, but it is not of fixed length. Any help?
'\d+(?=/)'
should work. see test with grep:
kent$ echo "ABC Y C S 1 $ 46CC MAN 25/ 31 "|grep -Po '\d+(?=/)'
25
Perl regex:
while ($subject =~ m!\d+(?=.*/)!g) {
# matched text = $&
}
Output:
1
46
25
So it basically keeps matching as long as a / exists somewhere later in the string.
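The difference between the two lookaheads is easy to see side by side. A short Python sketch using the sample string from the question (the variable names are only for illustration):

```python
import re

subject = "ABC Y C S 1 $ 46CC MAN 25/ 31"

# Digits directly followed by "/": only the number touching the slash.
immediately_before = re.findall(r"\d+(?=/)", subject)

# Digits followed by "/" anywhere later in the string.
anywhere_before = re.findall(r"\d+(?=.*/)", subject)

print(immediately_before)  # ['25']
print(anywhere_before)     # ['1', '46', '25']
```

Only the first pattern answers the question as asked; the second also returns every number that appears earlier on the line.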

Find and Replace using Notepad++

I'm trying to use Notepad++ to do some find and replace, as I'm dealing with up to a few thousand lines of data.
Below is an example of the data structure that I am dealing with.
A = can be any letter
X = can be any digit 0-9
RX = number that I want to replace with another value.
AAAAA X.XXXXXX X.XXXXX X X X X X XX:XX:XX:XX.XXX XXX RXRXRXRXRXRX XXXXXX XXXXXX
Actual Example
werwer 2.178924 1.17892 1 1 1 1 1 12:14:44:59.123 123 0123123 123345 123123
gret 2.178975 1.15731 1 1 1 1 1 12:14:44:59.123 123 0123 123345 123123
sdfwe 2.123245 1.15171 1 1 1 1 1 12:14:44:59.123 123 0555312 123345 123123
Is there a shortcut I can use?
N++ is not the tool for the job as it has very limited regexp capabilities. In a decent editor, you could replace
((?:[a-zA-Z0-9:\.]+\s+){10})\d+(.*)
with
\1your_text\2
but the Notepad++ regex syntax I tried supports neither (?:) nor {10}.
There are lots of regex tools out there; choose whichever you like.
P.S. I also tried repeating the first pattern ten times to emulate {10}; strangely, it still did not work.
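As one example of such a tool, the pattern above works as written in Python's re module. The sketch below is only illustrative: FIELD_11 and replace_field_11 are hypothetical names, and the ^ and $ anchors are an addition so that the match always starts at the first field.

```python
import re

# Ten leading fields, each followed by whitespace, then the digits to replace,
# then the rest of the line.
FIELD_11 = re.compile(r"^((?:[a-zA-Z0-9:.]+\s+){10})\d+(.*)$")

def replace_field_11(line, new_value):
    """Replace the 11th whitespace-separated field (digits only) with new_value."""
    # A function replacement avoids "\1" running into the digits of new_value.
    return FIELD_11.sub(lambda m: m.group(1) + new_value + m.group(2), line)

line = "gret 2.178975 1.15731 1 1 1 1 1 12:14:44:59.123 123 0123 123345 123123"
print(replace_field_11(line, "9999"))
# prints: gret 2.178975 1.15731 1 1 1 1 1 12:14:44:59.123 123 9999 123345 123123
```

Lines that do not have the expected shape are returned unchanged.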
This looks like the sort of job that is perfect for awk. The number to replace is the 11th whitespace-separated field, so assign the new value to $11 and print the line:
awk '{ $11 = "NUMBERS I'\''M CHANGING"; print }' < file.txt > newfile.txt
You can also try vim's Block Highlighting and Insert/Change functionality. I don't know if notepad++ has anything similar to it.
Press Ctrl+F and then go to the Replace tab.
Refer to whitequark's solution at https://superuser.com/questions/214079/find-and-replace-using-notepad