I have part of the following text that I'm reading with C#
"I have to see your driver’s license and print you an ID tag before I can send you through," he said in a flat, automatic sort of way, staring at the horns with blank-eyed fascination.
I'm reading in some lines of this one book, and I'd like to create strings out of all the words, including those with apostrophes. I'd like to split the lines based on non word characters, but I want apostrophes to be included with the word characters, so I ultimately get a list of strings with just words, so that the word "driver's" is together.
I'm using Sublime to test out the expressions, but when I do (\W+|\'), apostrophes are still captured. I don't want to split something like "you'd" into two strings. \W+ is perfect, but I'd just like to include apostrophes. How could I do that?
If you're looking for a regex matching "between" the words:
[^\w']+
should do.
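As a minimal sketch of how that separator pattern behaves, here it is applied with Python's re.split (Python rather than C# is a choice made here purely for illustration). One assumption is labeled in the code: the sample sentence uses the typographic apostrophe ’ (U+2019) rather than the ASCII ', so the character class lists both; with only ' in the class, "driver’s" would still be split.

```python
import re

line = ("\"I have to see your driver’s license and print you an ID tag "
        "before I can send you through,\" he said")

# Separators are runs of characters that are neither word characters nor
# apostrophes. Assumption: the curly apostrophe U+2019 is added to the class
# because the sample text uses it instead of the ASCII apostrophe.
SEPARATORS = re.compile(r"[^\w'’]+")

# A leading or trailing separator produces an empty string, so filter those out.
words = [w for w in SEPARATORS.split(line) if w]
print(words)
```

The same character class should carry over to other regex flavors, since it uses nothing beyond a negated class and \w.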
You can try String.Split, for example:
string _input = "I have to see your driver’s license and print you an ID tag before I can send you through";
string[] _words = _input.Split(' ');
If you also want to remove other characters, for example the single quote (apostrophe) "'" and the comma ",", use Replace() before splitting:
_input = _input.Replace("'", String.Empty).Replace(",", String.Empty);
_words = _input.Split(' ');
You can also use Regex, but it is slower than these methods (if that matters).
Also, as an example you can try my 'semantic analyzer' app at: http://webinfocentral.com/TECH/SemanticAnalyzer.aspx . It does all of that and much more (the characters to exclude are listed in the left pane).
Related
I have a database table that I have exported. I need to replace the spaces in the image file names with underscores and would like to use Notepad++ and regex to do so. I have:
'data/green tea powder.jpg'
'data/prod_img/lumina herbal shampoo.JPG'
'data/ALL GREEN HERBS.jpeg'
'data/prod_img/PSORIASIS KIT (640x530) (2).jpg'
and need to make them look like this:
'data/green_tea_powder.jpg'
'data/prod_img/lumina_herbal_shampoo.JPG'
'data/ALL_GREEN_HERBS.jpeg'
'data/prod_img/PSORIASIS_KIT_(640x530)_(2).jpg'
I just want to change the spaces between the quotes (I don't want to change the capitalization). To be more specific I would like to replace any and all spaces between 'data/ and ' because there are other spaces between quotes in the DB, for example:
'data/ REPLACE ANY SPACE HERE '
I found this:
\s(?!(?:[^']*'[^']*')*[^']*$)
but there are other places where there are spaces between quotes, so I'd like to search for data/ at the beginning and not just a single quote, but I can't figure out how. I tried \s(?!(?:[^'data\/]*'[^']*')*[^']*$) but it didn't work, and I am not familiar enough with regex to make it do so.
An example of a full line from the database is:
(712, 'GRTE-P', '', 'data/green tea powder.jpg', '2014-03-12 22:52:03'),
I don't want to replace the spaces in the date and time stamp at the end of the line, just the ones in the image file names.
Thanks in advance for your help!
You have to use a \G based pattern to ensure that matches are contiguous.
search: (?:\G(?!^)|'data/)[^' ]*\K[ ]
replace: _
The first match uses the second branch of the alternation, then the next matches are contiguous and use the first branch.
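Outside Notepad++, the same replacement can be scripted. Python's built-in re module supports neither \G nor \K, so the sketch below swaps in a different technique: match each whole 'data/...' literal and replace the spaces inside it with a callback. The function name underscore_image_paths is made up for this illustration.

```python
import re

# Matches one quoted literal that starts with 'data/ and runs to the next quote.
IMAGE_PATH = re.compile(r"'data/[^']*'")

def underscore_image_paths(line: str) -> str:
    """Replace spaces with underscores, but only inside 'data/...' literals."""
    return IMAGE_PATH.sub(lambda m: m.group(0).replace(" ", "_"), line)

row = "(712, 'GRTE-P', '', 'data/green tea powder.jpg', '2014-03-12 22:52:03'),"
print(underscore_image_paths(row))
```

Spaces in other quoted fields, such as the timestamp, are left alone because those fields never match the 'data/ prefix.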
I am working with government measures and am required to parse a string that contains variable information based on delimiters that come from issuing bodies associated with the FDA.
I am trying to retrieve the delimiter and the value after the delimiter. I have searched for hours to find a regex solution to retrieve both the delimiter and the value that follows it and, though there seem to be posts that handle this, the code found in those posts hasn't worked.
One of the major issues in this task is that the delimiters often have repeated characters. For instance: delimiters are used such as "=", "=,", "/=". In this case I would need to tell the difference between "=" and "=,".
Is there a regex that would handle all of this?
Here is an example of the string :
=/A9999XYZ=>100T0479&,1Blah
Notice the delimiters are:
"=/"
"=>"
"&,1"
Any help would be appreciated.
You can use a regex like this
(=/|=>|&,1)|(\w+)
The idea is that the first group contains the delimiters and the second group the content. I assume the content consists of word characters (letters, digits, and underscore). You then have to grab the content of every capturing group.
You need to capture both the delimiter and the value as group 1 and 2 respectively.
If your values are all alphanumeric, use this:
(&,1|\W+)(\w+)
If your values can contain non-alphanumeric characters, it gets complicated:
(=/|=>|=,|=|&,1)((?:.(?!=/|=>|=,|=|&,1))+.)
Code the delimiters longest first, eg "=," before "=", otherwise the alternation, which matches left to right, will match "=" and the comma will become part of the value.
This uses a negative look ahead to stop matching past the next delimiter.
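A minimal Python sketch of the longest-first rule and the lookahead, using the delimiters mentioned in the question. The names DELIMITERS, PAIR and parse are invented for this example, and sorting by length is just one way to guarantee that "=," is tried before "=".

```python
import re

# Delimiters taken from the question; the real set is presumably larger.
DELIMITERS = ["=", "=,", "=/", "=>", "&,1"]

# Longest first, so "=," and "=/" are tried before the bare "=".
alternation = "|".join(
    re.escape(d) for d in sorted(DELIMITERS, key=len, reverse=True)
)

# Group 1: the delimiter. Group 2: every character that is not directly
# followed by a delimiter, plus the one character that is.
PAIR = re.compile(rf"({alternation})((?:.(?!{alternation}))+.)")

def parse(text: str) -> list[tuple[str, str]]:
    """Return (delimiter, value) pairs in the order they appear."""
    return PAIR.findall(text)

print(parse("=/A9999XYZ=>100T0479&,1Blah"))
```

Because the value part is written as "one or more characters, then one more", this pattern only matches values of at least two characters.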
Given a text "article_utf8", I want to remove a list of words:
remove = "el|la|de|que|y|a|en|un|ser|se|no|haber|..."
regex = re.compile(r'\b('+remove+r')\b', flags=re.IGNORECASE)
article_out = regex.sub("", article_utf8)
however this is incorrectly removing some words and parts of words for example:
1- aseguro becomes seguro
2- sería becomes í
3- coma becomes com
4- miercoles becomes 'ercoles'
Technically parts of a word can match a regexp. To solve this you would have to make sure that whatever sequence of letters your regexp matches is a single word and not part of it.
One way would be to make the regexp contain leading and trailing spaces, but words could also be separated with periods or commas so you would have to take those into account too if you want to catch all instances.
Alternatively, you can try splitting the text into words first using the built-in split method (https://docs.python.org/2/library/stdtypes.html#str.split). Then I would check each word in the resulting list, remove the ones I don't want, and rejoin the rest. This method, however, doesn't even need regexps, so it's probably not what you intended despite being simple and practical.
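The split-and-filter idea could look roughly like this. It is a sketch under assumptions: the stop-word set contains only the words visible in the question (the original list is truncated), the name remove_stopwords is invented here, and stripping a few punctuation marks before the comparison is an addition to handle words followed by a period or comma.

```python
# Only the words shown in the question; the original list is truncated.
STOPWORDS = {"el", "la", "de", "que", "y", "a", "en", "un", "ser", "se", "no", "haber"}

def remove_stopwords(text: str) -> str:
    """Drop whole words that are in STOPWORDS and rejoin the rest with spaces."""
    kept = [
        word
        for word in text.split()
        # Compare without surrounding punctuation and without regard to case.
        if word.strip(".,;:!?").lower() not in STOPWORDS
    ]
    return " ".join(kept)

print(remove_stopwords("aseguro que sería la coma de miercoles"))
```

Since each word is compared as a whole, a word such as "aseguro" can never lose its leading "a".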
After much testing, the following will remove the small words in a natural language string, without removing them from parts of other words:
regex = re.compile(r'[\s]?\b('+remove+')[\b\s\.\,]', flags=re.IGNORECASE)
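For comparison, here is a sketch of the plain \b approach in Python 3, assuming the article is a decoded str rather than UTF-8 bytes. For str patterns, \w and \b are Unicode-aware by default, so accented letters count as word characters. The word list again contains only the entries visible in the question, and strip_words is a name made up for this example.

```python
import re

# Only the words shown in the question; the original list is truncated.
remove = "el|la|de|que|y|a|en|un|ser|se|no|haber"

# \b on both sides requires the match to be a whole word.
regex = re.compile(r"\b(?:" + remove + r")\b", flags=re.IGNORECASE)

def strip_words(text: str) -> str:
    """Remove the listed words, then collapse the leftover runs of whitespace."""
    return " ".join(regex.sub("", text).split())

print(strip_words("Aseguro que sería la coma de miercoles"))
```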
I'm trying to use a regular expression to select all of each word except the first character, much as @mahdaeng wanted to do here. The solution offered to his question was to use \B[a-z]. This works fine, except when a word contains some form of punctuation, such as "Jack's" and "merry-go-round". Is there a way to select the entire word including any contained punctuation? (Not including outside punctuation such as "? , ." etc.)
If you can enumerate the acceptable in-word punctuation, you could just expand upon the answer you linked:
\B[a-zA-Z'-]+
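A quick way to check this pattern, sketched in Python since the question names no language:

```python
import re

# \B requires a position that is NOT a word boundary, so a match cannot
# begin at the first letter of a word.
TAIL = re.compile(r"\B[a-zA-Z'-]+")

tails = TAIL.findall("Jack's merry-go-round revolves way too fast!")
print(tails)
```

One caveat: \B also holds between two non-word characters, so a hyphen or apostrophe standing alone between spaces would be matched as well.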
A regex really isn't necessary here, since you can just split your word on spaces and deal with each word accordingly. Since you don't mention an underlying language, here's an implementation in Perl:
use strict;
use warnings;
$_="Jack's merry-go-round revolves way too fast!";
my @words=split /\s+/;
foreach my $word(@words)
{
my $stripped_word=substr($word,1);
$stripped_word=~s/[^a-z]$//i; #stripping out end punctuation
print "$stripped_word\n";
}
The output is:
ack's
erry-go-round
evolves
ay
oo
ast
\B[^\s]+
(where ^\s means "not whitespace") should get you what you want assuming the words are whitespace-delimited. If they're also punctuation-delimited, you might need to enumerate the punctuation:
\B[^\s,.?!]+
<![Apple]!>some garbage text may be here<![Banana]!>some garbage text may be here<![Orange]!><![Pear]!><![Pineapple]!>
In the above string, I would like a regex that matches every <![FruitName]!>. Between these tags there may be some garbage text. My first attempt is like this:
<!\[[^\]!>]+\]!>
It works, but as you can see I've used this part:
[^\]!>]+
This kills some innocents. If the fruit name contains any one of these characters: ] ! > It'd be discarded and we love eating fruit so much that this should not happen.
How do we construct a regex that disallows exactly this string ]!> in the FruitName while all these can still be obtained?
The above example is just made up by me, I just want to know what the regex would look like if it has to be done in regex.
The simplest way would be <!\[.+?]!> - just don't care about what is matched between the two delimiters at all. Only make sure that it always matches the closing delimiter at the earliest possible opportunity - therefore the ? to make the quantifier lazy.
(Also, no need to escape the ])
About the specification that the sequence ]!> should be "disallowed" within the fruit name - well that's implicit since it is the closing delimiter.
To match a fruit name, you could use:
<!\[(.*?)]!>
After the opening <![, this matches the least amount of text that's followed by ]!>. By using .*? instead of .*, the least possible amount of text is matched.
Here's a full regex to match each fruit with the following text:
<!\[(.*?)]!>(.*?)(?=(<!\[)|$)
This uses positive lookahead (?=xxx) to match the beginning of the next tag or end-of-string. Positive lookahead matches but does not consume, so the next fruit can be matched by another application of the same regex.
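Both patterns can be tried out in Python as a rough sketch. The second pattern below drops the inner capture group around <!\[ so that findall returns plain (name, text) pairs. The second sample string, with ] and ! inside the fruit name, is invented for the test.

```python
import re

text = ("<![Apple]!>some garbage text may be here<![Banana]!>"
        "some garbage text may be here<![Orange]!><![Pear]!><![Pineapple]!>")

# Lazy match: the name ends at the first "]!>" that follows "<![".
NAME = re.compile(r"<!\[(.*?)]!>")
names = NAME.findall(text)

# Name plus the text that follows it, up to the next "<![" or end of string.
NAME_AND_TEXT = re.compile(r"<!\[(.*?)]!>(.*?)(?=<!\[|$)")
pairs = NAME_AND_TEXT.findall(text)

# Hypothetical name containing "]" and "!" on their own, which is still captured.
odd = NAME.findall("<![Straw]berry!]!>")

print(names)
print(pairs)
print(odd)
```

Note that . does not match a newline by default, so input spanning several lines would need the re.DOTALL flag.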
Depending on what language you are using, you can use the string methods your language provides to do simple splitting (with a simple regex that is more understandable). Split your string using "!>" as the separator. Go through each field and check for <!. If it is found, remove all characters from the start of the field up to and including <!. This will give you all the fruits. I use gawk to demonstrate, but the algorithm can be implemented in your language.
eg gawk
# set field separator as !>
awk -F'!>' '
{
# for each field
for(i=1;i<=NF;i++){
# check if there is <!
if($i ~ /<!/){
# if <! is found, substitute from front till <!
gsub(/.*<!/,"",$i)
}
# print result
print $i
}
}
' file
output
# ./run.sh
[Apple]
[Banana]
[Orange]
[Pear]
[Pineapple]
No complicated regex needed.