This is data present in my .txt file
+919000009998 SMS +919888888888
+919000009998 MMS +91988 88888 88
+919000009998 MMS abcd google
+919000009998 MMS amazon
I want to convert my .txt like this
919000009998 SMS 919888888888
919000009998 MMS 919888888888
919000009998 MMS abcd google
919000009998 MMS amazon
That is, remove the + symbol, and remove the spaces in the third column only if it is a number; if it is a string, nothing should change.
Is there a regex for this that I can use in Notepad++'s search and replace?
Ctrl+H
Find what: \+|(?<=\d)\h+(?=\d)
Replace with: LEAVE EMPTY
check Wrap around
check Regular expression
Replace all
Explanation:
\+ # + sign
| # OR
(?<=\d) # positive lookbehind, make sure we have a digit before
\h+ # 1 or more horizontal spaces
(?=\d) # positive lookahead, make sure we have a digit after
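For readers who would rather script the same replacement, here is a minimal Python sketch of the pattern explained above. Python's re module has no \h, so [ \t] stands in for it here as an assumption, and the function name strip_plus_and_digit_gaps is hypothetical.

```python
import re

# \+ matches a plus sign; the second alternative matches runs of
# spaces/tabs that have a digit on both sides.
# Python's re module has no \h, so [ \t] is used instead (assumption).
PATTERN = re.compile(r"\+|(?<=\d)[ \t]+(?=\d)")

def strip_plus_and_digit_gaps(text: str) -> str:
    """Remove every '+' and every horizontal gap between two digits."""
    return PATTERN.sub("", text)

sample = (
    "+919000009998 SMS +919888888888\n"
    "+919000009998 MMS +91988 88888 88\n"
    "+919000009998 MMS abcd google\n"
    "+919000009998 MMS amazon"
)
print(strip_plus_and_digit_gaps(sample))
```

The space between the first column and SMS/MMS survives because it is not followed by a digit.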
All the previous answers will work perfectly.
However, I'm adding this just in case you need it:
If for some reason the third column holds non-phone values separated by spaces (a street address comes to mind, e.g. +919000009998 MMS street foo nº 123 4º-B), you may use this regex instead (it joins the numbers only when the third column starts with +):
Search: ^[+](\S+\s+\S+\s++)(?:([^+][^\n]*)|[+])|\G\s*(\d+)
Replace by: \1\2\3
That will avoid joining the 3 and 4 in my previous example.
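Python's re module has no \G anchor, so the regex above cannot be carried over directly. The sketch below swaps in a different technique with the same intent: split each line into three columns and join the digits only when the third column starts with +. The function name normalize_line is hypothetical, and whitespace-separated columns are an assumption.

```python
import re

def normalize_line(line: str) -> str:
    """Strip '+' from column 1; join column 3 only if it starts with '+'."""
    parts = line.split(None, 2)  # at most three whitespace-separated columns
    if len(parts) < 3:
        return line  # not the expected shape, leave untouched
    sender, kind, rest = parts
    sender = sender.lstrip("+")
    if rest.startswith("+"):
        # Phone-like third column: drop the '+' and all inner whitespace.
        rest = re.sub(r"[+\s]", "", rest)
    return f"{sender} {kind} {rest}"

print(normalize_line("+919000009998 MMS +91988 88888 88"))
print(normalize_line("+919000009998 MMS street foo nº 123 4º-B"))
```

The street line keeps the space between 123 and 4º-B because its third column does not start with +.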
You have a demo here.
Addresses generally come comma-separated and can be split with a simple regex, e.g.
123 Main St, Los Angeles, CA, 90210
We can apply a regex here and split on the commas. But in my database, addresses are stored without commas, e.g.
A Better Property Management<br/> 6621 E PACIFIC COAST HWY<br/> STE 255<br/> LONG BEACH CA 90803-4241
And I want to put a comma after the city (before the state). Something like this:
A Better Property Management<br/> 6621 E PACIFIC COAST HWY<br/> STE 255<br/> LONG BEACH ,CA 90803-4241
I was thinking about finding the last two-letter word from the end and putting a comma before it using a regex. But I also need to account for cases where the address is incomplete or the city and postal code are missing. Is there a way to do this? I only found solutions that split on commas, not the reverse.
I was thinking we could select the last two letters before the numbers with something like [A-Za-z]{2} (I don't know if this is correct), and at the same time do this only if the string ends with numbers.
I tried
(\b(AL|AK|AS|AZ|AR|CA|CO|CT|DE|DC|FM|FL|GA|GU|HI|ID|IL|IN|IA|KS|KY|LA|ME|MH|MD|MA|MI|MN|MS|MO|MT|NE|NV|NH|NJ|NM|NY|NC|ND|MP|OH|OK|OR|PW|PA|PR|RI|SC|SD|TN|TX|UT|VT|VI|VA|WA|WV|WI|WY|Alabama|Alaska|Arizona|Arkansas|California|Colorado|Connecticut|Delaware|District of Columbia|Florida|Georgia|Hawaii|Idaho|Illinois|Indiana|Iowa|Kansas|Kentucky|Louisiana|Maine|Maryland|Massachusetts|Michigan|Minnesota|Mississippi|Missouri|Montana|Nebraska|Nevada|New Hampshire|New Jersey|New Mexico|New York|North Carolina|North Dakota|Ohio|Oklahoma|Oregon|Pennsylvania|Rhode Island|South Carolina|South Dakota|Tennessee|Texas|Utah|Vermont|Virginia|Washington|West Virginia|Wisconsin|Wyoming)\b)
https://regex101.com/r/75fqO6/1
You can use
[a-zA-Z]+\s+\d(?:[\d-]*\d)?$
Replace with ,$0.
See the regex demo. Details:
[a-zA-Z]+ - one or more letters
\s+ - one or more whitespaces
\d - a digit
(?:[\d-]*\d)? - an optional substring of zero or more digits/hyphens and then a digit
$ - end of string.
The $0 in the replacement is a backreference to the whole match value: all text matched by the regex is put back where it was found, with a comma prepended.
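As a rough illustration, the same replacement can be sketched in Python, where the whole-match backreference is written \g<0> rather than $0. The function name add_comma_before_state is hypothetical.

```python
import re

# Letters, whitespace, then a digit run (hyphens allowed inside) at the end.
PATTERN = re.compile(r"[a-zA-Z]+\s+\d(?:[\d-]*\d)?$")

def add_comma_before_state(address: str) -> str:
    # \g<0> re-inserts the whole match, here with a comma in front of it.
    return PATTERN.sub(r",\g<0>", address)

address = (
    "A Better Property Management<br/> 6621 E PACIFIC COAST HWY<br/> "
    "STE 255<br/> LONG BEACH CA 90803-4241"
)
print(add_comma_before_state(address))
```

An address that does not end in digits is returned unchanged, which covers the incomplete-address case raised in the question.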
Hi guys, I'm trying to get a substring as well as the corresponding number from this string:
text = "Milk for human consumption may be taken only from cattle from 80 hours after the last treatment."
I want to select the word milk and the corresponding number 80 from this sentence. This is part of a larger file, and I want a generic solution that finds the word milk in a line and then the first number that occurs anywhere after it on that line.
(Milk+)\d
This is what I came up with, thinking I could make a group for milk and then check for digits, but I'm stumped on how to search for a number anywhere on the line and not just immediately after the word milk. Also, is there any way to make the search case-insensitive?
Edit: I'm looking to get both the word and the number if possible, e.g. "milk" "80", and I'm using Python.
/(?<!\p{L})([Mm]ilk)(?!\p{L})\D*(\d+)/
This matches the following strings, with the match and the contents of the two capture groups noted.
"The Milk99" # "Milk99" 1:"Milk" 2:"99"
"The milk99 is white" # "milk99" 1:"milk" 2:"99"
"The 8 milk is 99" # "milk is 99" 1:"milk" 2:"99"
"The 8milk is 45 or 73" # "milk is 45" 1:"milk" 2:"45"
The following strings are not matched.
"The Milk is white"
"The OJ is 99"
"The milkman is 37"
"Buttermilk is 99"
"MILK is 99"
This regular expression could be made self-documenting by writing it in free-spacing mode:
/
(?<!\p{L}) # the following match is not preceded by a Unicode letter
([Mm]ilk) # match 'M' or 'm' followed by 'ilk' in capture group 1
(?!\p{L}) # the preceding match is not followed by a Unicode letter
\D* # match zero or more characters other than digits
(\d+) # match one or more digits in capture group 2
/x # free-spacing regex definition mode
\D* could be replaced with .*? (the ? makes the match non-greedy). If the greedy variant were used (.*), the second capture group for "The 8milk is 45 or 73" would contain "3".
To match "MILK is 99", change ([Mm]ilk) to (?i)(milk).
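Since the question asks for Python, here is a minimal sketch of the same idea with the standard re module. That module does not support \p{L}, so the character class [^\W\d_] (a word character that is neither a digit nor an underscore) is used as a stand-in for "letter"; that substitution and the function name find_milk_number are assumptions, not part of the answer above.

```python
import re

LETTER = r"[^\W\d_]"  # stand-in for \p{L}, which Python's re lacks
PATTERN = re.compile(
    rf"(?<!{LETTER})(milk)(?!{LETTER})\D*(\d+)",
    re.IGNORECASE,  # also matches "Milk", "MILK", ...
)

def find_milk_number(line: str):
    """Return (word, number) for the first match on the line, or None."""
    m = PATTERN.search(line)
    return (m.group(1), m.group(2)) if m else None

text = ("Milk for human consumption may be taken only from cattle "
        "from 80 hours after the last treatment.")
print(find_milk_number(text))
```

Because the whole pattern is compiled case-insensitively, "MILK is 99" matches here, unlike with the case-sensitive ([Mm]ilk) version.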
This seems to do what you want in Java (I overlooked that the asker wanted Python, or the question was edited later):
import java.util.regex.Matcher;
import java.util.regex.Pattern;

String example =
"Test 40\n" +
"Test Test milk for human consumption may be taken only from cattle from hours after the last treatment." +
"\nTest Milk for human consumption may be taken only from cattle from 80 hours after the last treatment." +
"\nTest miLk for human consumption may be taken only from cattle from 80 hours after the last treatment.";
// (?i) makes the match case-insensitive; group 2 is the word, group 3 the first number after it
Matcher m = Pattern.compile("((?i)(milk).*?(\\d+).*\n?)+").matcher(example);
if (m.find()) { // only read the groups when a match was found
System.out.print(m.group(2) + m.group(3));
}
Note how it tests, case-insensitively, whether the word "milk" appears anywhere before a number on the same line, and prints only those two values. It also prints only the first occurrence found (finding all occurrences is easy with a small modification of the code).
I hope this way of extracting both values from a matching pattern fits your task.
You should try this one
(Milk).*?(\d+)
Based on your language, you can also specify a case-insensitive search. Example in JS: /(Milk).*?(\d+)/i, the final i makes the search case insensitive.
Note the *?, the most important part! This is a lazy quantifier. In other words, it consumes any character, but stops as soon as the next part of the pattern can match. Here, as soon as a digit can be read, it is read. A plain greedy * would instead run to the end of the line and backtrack, so the second group would capture only the last digit of the last number after Milk.
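The lazy versus greedy difference can be seen in a short Python sketch. The sample line is made up for illustration (the original sentence contains only one number), and re.IGNORECASE plays the role of the /i flag.

```python
import re

# Hypothetical line with two numbers after "Milk" so the difference shows.
line = "Milk may be taken from 80 hours after treatment, or 12 days for calves."

lazy = re.search(r"(milk).*?(\d+)", line, re.IGNORECASE)
greedy = re.search(r"(milk).*(\d+)", line, re.IGNORECASE)

print(lazy.groups())    # ('Milk', '80')
print(greedy.groups())  # ('Milk', '2')
```

The greedy version swallows the rest of the line and backtracks only far enough to leave one digit for (\d+), which is why it ends up with "2" from "12".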
I want to delete everything except IPs.
For example
1 138.68.161.60:1080 SOCKS5 HIA United States (New York NY) 138.68.161.60 (DigitalOcean, LLC) 0.143 75% (3) - 12-jan-2018 14:37 (10 minutes ago)
2 174.64.234.29:17501 SOCKS5 HIA United States wsip-174-64-234-29.sd.sd.cox.net (Cox Communications Inc.) 0.956
100% (5) - 12-jan-2018 14:36 (10 minutes ago)
3 45.79.219.154:63189 SOCKS5 HIA United States (Atlanta GA) li1318-154.members.linode.com (Linode, LLC) 6.973
90% (103) - 12-jan-2018 14:36 (11 minutes ago)
to
138.68.161.60:1080
174.64.234.29:17501
45.79.219.154:63189
I need a regex to do this conversion.
In Notepad++, it requires some finesse to delete text not containing matched strings, but you can choose Find, Mark, then check the Regular expression box and use the regex:
([0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}+) and Mark All to bookmark all rows containing IP addresses.
Then select Find, Replace, enter ^[0-9]\W in Find what:, and Replace All with nothing.
Then select Find, Replace, enter \w+S.+ in Find what:, and Replace All with nothing.
Then, go to Search, Bookmark, Remove Unmarked Lines.
Et Voilà!
You could use this regex in Notepad++ and replace each match with capture group 1, \1
(?s)\d+ (\d+\.\d+\.\d+\.\d+:\d+).*?\(\d+ minutes ago\)
The regex matches all the text of each of the 3 blocks in your example, with a capturing group around the text that you want to keep. The replacement then uses only that captured group, which holds your data.
Explanation
Inline modifier to make the dot match a line break (?s)
The leading row number and a space, matched outside the group so they are dropped \d+
Group 1 with the pattern that you want to capture (\d+\.\d+\.\d+\.\d+:\d+)
Match any character zero or more times non greedy .*?
The pattern that is at the end of every part \(\d+ minutes ago\)
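Outside Notepad++, it is often simpler to collect the matches than to delete everything around them. The following Python sketch takes that approach; the pattern is a simplified assumption (it checks only the shape ip:port, not that each octet is at most 255), and extract_endpoints is a hypothetical name.

```python
import re

# Four dot-separated groups of 1-3 digits, a colon, then the port.
ENDPOINT = re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}:\d+")

def extract_endpoints(text: str) -> list[str]:
    """Return every ip:port found in the text, in order of appearance."""
    return ENDPOINT.findall(text)

listing = """1 138.68.161.60:1080 SOCKS5 HIA United States (New York NY) 138.68.161.60 (DigitalOcean, LLC) 0.143 75% (3) - 12-jan-2018 14:37 (10 minutes ago)
2 174.64.234.29:17501 SOCKS5 HIA United States wsip-174-64-234-29.sd.sd.cox.net (Cox Communications Inc.) 0.956
100% (5) - 12-jan-2018 14:36 (10 minutes ago)
3 45.79.219.154:63189 SOCKS5 HIA United States (Atlanta GA) li1318-154.members.linode.com (Linode, LLC) 6.973
90% (103) - 12-jan-2018 14:36 (11 minutes ago)"""

print("\n".join(extract_endpoints(listing)))
```

The bare address 138.68.161.60 later on the first line is skipped because it has no :port suffix.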
I need to parse the following expression:
Fertilizer abc 7-15-15 5KG BOX 250 KG
in 3 fields:
The product description: Fertilizer abc 7-15-15
Size: 250
Size unit: KG
I do not know how to proceed. Any help and explanation would be appreciated.
Try this in the Alteryx RegEx tool with Parse selected as the Method:
([A-Za-z ]* [\d-]{6,8}) ([A-Z\d]{2,6}) (.{1,5}?) (\d*) ([A-Z]*)
You can test it at Regexpal to see the breakdown of each group, but essentially the first set of brackets gets you the product description (letters and spaces, followed by 6-8 characters made up of digits and dashes), the 2nd and 3rd groups absorb the extra information that you don't want, the 4th group is just digits, and the 5th group is any uppercase text afterwards.
Note that this will change dramatically if your data has digits where there are currently letters, and so on.
You can always break it up into even smaller groups and then concatenate back together as well.
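To check how the groups line up outside Alteryx, the pattern can be tried in a small Python sketch. The first character class is written as [A-Za-z ] here because a range like [A-z] would also admit a few punctuation characters; the function name parse_product is hypothetical.

```python
import re

PATTERN = re.compile(
    r"([A-Za-z ]* [\d-]{6,8})"  # 1: description ending in e.g. 7-15-15
    r" ([A-Z\d]{2,6})"          # 2: unwanted pack size, e.g. 5KG
    r" (.{1,5}?)"               # 3: unwanted packaging word, e.g. BOX
    r" (\d*)"                   # 4: size
    r" ([A-Z]*)"                # 5: size unit
)

def parse_product(line: str):
    """Return (description, size, unit), or None if the line does not fit."""
    m = PATTERN.match(line)
    if m is None:
        return None
    description, _pack, _packaging, size, unit = m.groups()
    return description, size, unit

print(parse_product("Fertilizer abc 7-15-15 5KG BOX 250 KG"))
# ('Fertilizer abc 7-15-15', '250', 'KG')
```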
I found somewhat similar questions (R - Select string text between two values; regex for n characters or at least m characters), but I'm still having trouble.
Say I have a string in R:
testing_String <- "AK ADAK NAS PADK ADK 70454 51 53N 176 39W 4 X T 7"
I need to pull everything between the first element of the string, which has 2 characters (AK), and PADK ADK. The values PADK and ADK will vary, but they will always be 4 and 3 characters long respectively.
So I would need to pull
ADAK NAS
I came up with this, but it's picking up everything from AK to ADK:
^[A-Za-z0_9_]{2}(.*?) +[A-Za-z0_9_]{4}|[A-Za-z0_9_]{3,}
If I understood your question correctly, this should do the trick:
\b[A-Z]{2}\s+(.+?)\s+[A-Z]{4}\s+[A-Z]{3}\b
You'll have to switch on the perl = TRUE option (to use a decent regex engine).
\b means word boundary. So this pattern looks for a match starting with a 2-letter word and ending with a 4-letter word followed by a 3-letter word. Your value will be in the first group.
Alternatively, you can write the following to avoid using the capturing group:
\b[A-Z]{2}\s+\K.+?(?=\s+[A-Z]{4}\s+[A-Z]{3}\b)
But I'd prefer the first method because it's easier to read.
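As a quick cross-check of the capturing-group version, here is the same pattern run through Python's re module (used only for illustration; the question itself is about R). The \K variant is not shown because Python's re does not support \K. The function name between_codes is hypothetical.

```python
import re

# 2-letter word, the wanted text (lazy), then a 4-letter and a 3-letter word.
PATTERN = re.compile(r"\b[A-Z]{2}\s+(.+?)\s+[A-Z]{4}\s+[A-Z]{3}\b")

def between_codes(s: str):
    """Return the text captured by group 1, or None if nothing matches."""
    m = PATTERN.search(s)
    return m.group(1) if m else None

testing_string = "AK ADAK NAS PADK ADK 70454 51 53N 176 39W 4 X T 7"
print(between_codes(testing_string))  # ADAK NAS
```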
Lookbehind is supported for perl=TRUE, so this regex will do what you want:
(?<=\w{2}\s).*?(?=\s+[^\s]{4}\s[^\s]{2})