So basically I have a CSV like:
121\sdf\ 34 4333DSssD,23233,TECH,32, ...
That first string is the ID, but it's supposed to contain + characters rather than spaces. The + characters got lost, so on each line I need to replace every space before the first comma with +.
I was thinking of using a regex with re.sub for this (I'm processing the file in Python), but I'm having trouble matching only those spaces.
Was hoping StackOverflow could help :D
This can be done without a regex; just partition on the first comma and manipulate the left partition:
with open('path/to/input') as infile:
    for line in infile:
        left, comma, right = line.partition(',')
        # line already ends with a newline, so suppress the one print adds
        print(left.replace(' ', '+') + comma + right, end='')
Here is one solution without regular expressions (assuming you have a string with a single line called line, this would probably be inside of a for loop that is iterating over the file object):
pieces = line.split(',', 1)
pieces[0] = pieces[0].replace(' ', '+')
line = ','.join(pieces)
Or with regular expressions:
import re
line = re.sub(r'^[^,]*', lambda m: m.group(0).replace(' ', '+'), line)
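As a quick sanity check, the three approaches above can be run side by side on the sample line from the question. This is only a minimal sketch; the helper names (fix_partition, fix_split, fix_regex) are made up for illustration.

```python
import re

def fix_partition(line):
    # Split once at the first comma; only the left part is modified.
    left, comma, right = line.partition(',')
    return left.replace(' ', '+') + comma + right

def fix_split(line):
    # maxsplit=1 keeps everything after the first comma in one piece.
    pieces = line.split(',', 1)
    pieces[0] = pieces[0].replace(' ', '+')
    return ','.join(pieces)

def fix_regex(line):
    # ^[^,]* matches the text before the first comma (possibly empty).
    return re.sub(r'^[^,]*', lambda m: m.group(0).replace(' ', '+'), line)

# Raw string so the backslashes in the ID are kept literally.
sample = r'121\sdf\ 34 4333DSssD,23233,TECH,32'
results = [f(sample) for f in (fix_partition, fix_split, fix_regex)]
print(results[0])
```

All three should give the same line, with only the spaces in the ID replaced.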
I need to split a CSV file at commas, but the problem is that the fields themselves can contain commas. For example:
one,two,tree,"four,five","six,seven".
The file uses double quotes to escape such fields, but I could not get the split to work.
I tried the following regex, but I got an error: REGEX_TOO_COMPLEX.
data: lv_sep type string,
lv_rep_pat type string.
data(lv_row) = iv_row.
"Define a separator to replace commas in double quotes
lv_sep = cl_abap_conv_in_ce=>uccpi( uccp = 10 ).
concatenate '$1$2' lv_sep into lv_rep_pat.
"replace all commas that are separator with the new separator
replace all occurrences of regex '(?:"((?:""|[^"]+)+)"|([^,]*))(?:,|$)' in lv_row with lv_rep_pat.
split lv_row at lv_sep into table rt_cells.
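Use this regex instead, which matches only the commas that are followed by an even number of double quotes, i.e. the commas outside quoted fields: ,(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)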
DATA: lv_sep TYPE string,
lv_rep_pat TYPE string.
DATA(lv_row) = 'one,two,tree,"four,five","six,seven"'.
"Define a separator to replace commas in double quotes
lv_sep = cl_abap_conv_in_ce=>uccpi( uccp = 10 ).
CONCATENATE '$1$2' lv_sep INTO lv_rep_pat.
"replace all commas that are separator with the new separator
REPLACE ALL OCCURRENCES OF REGEX ',(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)' IN lv_row WITH lv_rep_pat.
SPLIT lv_row AT lv_sep INTO TABLE data(rt_cells).
LOOP AT rt_cells INTO DATA(cells).
  WRITE cells.
  SKIP.
ENDLOOP.
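To see what the lookahead does outside ABAP, here is a small Python sketch that splits the sample row with the same pattern. It assumes the quotes in the row are balanced; the name SEPARATOR is made up for illustration.

```python
import re

# A comma counts as a separator only if the rest of the line
# contains an even number of double quotes.
SEPARATOR = re.compile(r',(?=(?:[^"]*"[^"]*")*[^"]*$)')

row = 'one,two,tree,"four,five","six,seven"'
cells = SEPARATOR.split(row)
print(cells)
# ['one', 'two', 'tree', '"four,five"', '"six,seven"']
```

Note that the surrounding quotes stay in the cells, and that the lookahead rescans the rest of the line for every comma, which may get slow on very long rows.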
I have never touched ABAP, so please treat this as pseudocode.
I'd recommend using a non-regex solution here:
data: checkedOffsetComma type i,
checkedOffsetQuotes type i,
baseOffset type i,
testString type string value 'val1, "val2, val21", val3'.
LOOP AT SomeFancyConditionYouDefine.
checkedOffsetComma = baseOffset.
checkedOffsetQuotes = baseOffset.
find FIRST OCCURRENCE OF ','(or end of line here) in testString match OFFSET checkedOffsetComma.
write checkedOffsetComma.
find FIRST OCCURRENCE OF '"' in testString match OFFSET checkedOffsetQuotes.
write checkedOffsetQuotes.
*if the next comma is closer than the next quotes
IF checkedOffsetComma < checkedOffsetQuotes.
REPLACE SECTION checkedOffsetComma 1 OF ',' WITH lv_rep_pat.
baseOffset = checkedOffsetComma.
ELSE.
*if we found quotes, we go to the next quotes afterwards and then continue as before after that position
find FIRST OCCURRENCE OF '"' in testString match OFFSET checkedOffsetQuotes.
write baseOffset.
ENDIF.
ENDLOOP.
This assumes that there are no quotes in quotes thingies. Didn't test, didn't validate in any way. I'd be happy if this at least partly compiles :)
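The same non-regex idea can be sketched in Python (used here only for illustration, since the code above is untested pseudocode): walk the row character by character, track whether the current position is inside quotes, and split only on commas that are outside them. The function name split_csv_row is hypothetical, and like the answer above it assumes there are no quotes nested inside quoted fields.

```python
def split_csv_row(row, sep=',', quote='"'):
    """Split one CSV row on separators that are outside quoted fields."""
    cells = []
    current = []
    in_quotes = False
    for ch in row:
        if ch == quote:
            # Toggle the quoted state; the quote character itself is dropped.
            in_quotes = not in_quotes
        elif ch == sep and not in_quotes:
            # A separator outside quotes closes the current cell.
            cells.append(''.join(current))
            current = []
        else:
            current.append(ch)
    cells.append(''.join(current))
    return cells

print(split_csv_row('one,two,tree,"four,five","six,seven"'))
# ['one', 'two', 'tree', 'four,five', 'six,seven']
```

Whitespace around the separators is kept as-is, so the test string from the pseudocode would yield cells with leading spaces.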
I have a pattern such as word-\nword, i.e. words are hyphenated and separated by a newline character.
I would like the output to be word-word, but I get word-\nword with the code below.
import re
text_string = "word-\nword"
result=re.findall("[A-Za-z]+-\n[A-Za-z]+", text_string)
print(result)
I also tried this, but it did not work; I get no result.
text_string = "word-\nword"
result=re.findall("[A-Za-z]+-(?=\n)[A-Za-z]+", text_string)
print(result)
How can I achieve this?
Thank you!
Edit:
Would it be efficient to do a replace and then run a simple regex
text_string = "aaa bbb ccc-\nddd eee fff"
replaced_text = text_string.replace('-\n', '-')
result = re.findall(r"\w+-\w+", replaced_text)
print(result)
or to use the method suggested by CertainPerformance?
text_string = "word-\nword"
result=re.sub(r"(?i)(\w+)-\n(\w+)", r'\1-\2', text_string)
print(result)
You should use re.sub instead of re.findall:
result = re.sub(r"(?<=-)\n+", "", text_string)
This matches any newlines that directly follow a - and replaces them with an empty string.
You can alternatively use
(?<=-)\n(?=\w)
which matches a newline only if it is preceded by a - and followed by a word character.
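The lookahead attempt in the question finds nothing because a lookahead does not consume any characters: after (?=\n) succeeds, [A-Za-z]+ still has to match at the newline itself, which it cannot. A minimal sketch of the difference, using the sample string from the question's edit:

```python
import re

text_string = "aaa bbb ccc-\nddd eee fff"

# The lookahead only peeks, so the letters would have to start at "\n": no match.
no_match = re.findall(r"[A-Za-z]+-(?=\n)[A-Za-z]+", text_string)

# Remove a newline only when it sits between "-" and a word character.
joined = re.sub(r"(?<=-)\n(?=\w)", "", text_string)

# With the newline gone, a simple pattern finds the hyphenated word.
hyphenated = re.findall(r"\w+-\w+", joined)

print(no_match)      # []
print(repr(joined))  # 'aaa bbb ccc-ddd eee fff'
print(hyphenated)    # ['ccc-ddd']
```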
If the string is composed of just that, then a pure regex solution is to use re.sub, capture the first word and the second word in a group, then echo those two groups back (without the dash and newline):
result=re.sub("(?i)([a-z]+)-\n([a-z]+)", r'\1\2', text_string)
Otherwise, if there is other stuff in the string, iterate over each match and join the groups:
text_string = "wordone-\nwordtwo wordthree-\nwordfour"
result=re.findall("(?i)([a-z]+)-\n([a-z]+)", text_string)
for match in result:
    print(''.join(match))
You can simply replace any occurrences of '-\n' with '-' instead:
result = text_string.replace('-\n', '-')
I have the following code, which returns every line of a text in which a certain word occurs:
import re

with open('/Users/Statistical_NLP/Project/text.txt') as f:
    haystack = f.read()

with open('/Users/Statistical_NLP/Project/test.txt') as f:
    for line in f:
        needle = line.strip()
        pattern = '^.*{}.*$'.format(re.escape(needle))
        for match in re.finditer(pattern, haystack, re.MULTILINE):
            print(match.group(0))
How can I search for a word and return not the whole line, but just the three words before and the three words after that word?
I assume this line of my code has to change:
pattern = '^.*{}.*$'.format(re.escape(needle))
Thanks a lot
The following regex will help you achieve what you want.
((?:\w+\s+){3}YOUR_WORD_HERE(?:\s+\w+){3})
For a better understanding of the regex, I suggest you go to the following page and experiment with it.
https://regex101.com/r/eS8zW5/3
This will match the three words before, the matched word and three words after.
The following will match up to three words before and after, if they exist:
((?:\w+\s+){0,3}YOUR_WORD_HERE(?:\s+\w+){0,3})
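A sketch of how the second pattern could be plugged into the pattern line from the question, with the search word escaped as before. The function name words_around and the sample sentences are made up for illustration. Two caveats: \s also matches newlines, so the context can span lines, and the word is not anchored with \b, so it can also match inside a longer word.

```python
import re

def words_around(needle, haystack, n=3):
    # Up to n words before and after the (escaped) search word.
    # Doubled braces are literal braces in str.format, giving {0,n}.
    pattern = r'(?:\w+\s+){{0,{n}}}{word}(?:\s+\w+){{0,{n}}}'.format(
        n=n, word=re.escape(needle))
    return [m.group(0) for m in re.finditer(pattern, haystack)]

print(words_around("needle", "one two three four needle five six seven eight"))
# ['two three four needle five six seven']
print(words_around("needle", "a needle b"))
# ['a needle b']
```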
I am searching a text file consisting of single words on each line for the following:
Lines that have two consecutive a’s in them but which don’t start with an a
import re
import sys
pattern = '^[^Aa][A-Za-z]*[Aa]{2}'
regexp = re.compile(pattern)
inFile = open('words.txt', 'r')
outFile = open('exercise04.log', 'w')
for line in inFile:
    match = regexp.search(line)
    if match:
        outFile.write(line)
inFile.close()
outFile.close()
My main concern is my regex search pattern rather than the python itself. I understand the ^[^Aa] at the start stops the first character from being 'A' or 'a', but is there a better way of breaking out of this statement to check for two consecutive 'a's in each word than I have used?
Your pattern looks fine.
If you want to make sure the first character is a letter, use
pattern = '^[B-Zb-z][A-Za-z]*[Aa]{2}'
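To compare the two patterns, here is a small sketch that runs both against a handful of made-up words (the word list and variable names are assumptions, not taken from words.txt). The only difference shows up for words whose first character is not a letter.

```python
import re

original = re.compile(r'^[^Aa][A-Za-z]*[Aa]{2}')
letters_only = re.compile(r'^[B-Zb-z][A-Za-z]*[Aa]{2}')

words = ['bazaar', 'aardvark', 'Isaac', 'banana', '1aa']
# Map each word to (matches original, matches letters_only).
results = {
    word: (bool(original.search(word)), bool(letters_only.search(word)))
    for word in words
}
for word, flags in results.items():
    print(word, flags)
# bazaar (True, True)
# aardvark (False, False)
# Isaac (True, True)
# banana (False, False)
# 1aa (True, False)
```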
I have multiple lines in some text files such as
.model sdata1 s tstonefile='../data/s_element/isdimm_rcv_via_2port_via_minstub.s50p' passive=2
I want to extract the text between the single quotes in MATLAB.
Much help would be appreciated.
To get all of the text inside multiple '' blocks, regexp can be used as follows:
regexp(txt,'''(.[^'']*)''','tokens')
This says to get text surrounded by ' characters, where the captured text does not itself include a '. For example, consider this file with two lines (I made up a different file name for the second one),
txt = ['.model sdata1 s tstonefile=''../data/s_element/isdimm_rcv_via_2port_via_minstub.s50p'' passive=2 ', char(10), ...
'.model sdata1 s tstonefile=''../data/s_element/isdimm_rcv_via_3port_via_minstub.s00p'' passive=2']
>> stringCell = regexp(txt,'''(.[^'']*)''','tokens');
>> stringCell{:}
ans =
'../data/s_element/isdimm_rcv_via_2port_via_minstub.s50p'
ans =
'../data/s_element/isdimm_rcv_via_3port_via_minstub.s00p'
>>
Trivia:
char(10) gives a newline character because 10 is the ASCII code for newline.
The . character in a regexp (regex in the rest of the coding world) pattern usually does not match a newline, which would make this a safer pattern. In MATLAB, however, a dot in regexp does match a newline by default; to disable this, we could add 'dotexceptnewline' as the last input argument to regexp. This is a convenient way to ensure we don't pick up the text outside of the quotes instead, but it is not needed here, since the first match sets the precedent.
Instead of excluding a ' from the match with [^''], the match can be made non-greedy with ? as follows, regexp(txt,'''(.*?)''','tokens').
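The difference between the greedy, non-greedy, and negated-class variants is easiest to see on a line with two quoted strings. The sketch below uses Python's re module as a stand-in, on the assumption that MATLAB's regexp treats these constructs the same way; the sample line is made up.

```python
import re

line = "a='first' b='second'"

# Greedy: .* runs to the last quote on the line, swallowing both strings.
greedy = re.findall(r"'(.*)'", line)

# Non-greedy: .*? stops at the first closing quote it can.
lazy = re.findall(r"'(.*?)'", line)

# Negated class: the capture simply cannot contain a quote.
negated = re.findall(r"'([^']*)'", line)

print(greedy)   # ["first' b='second"]
print(lazy)     # ['first', 'second']
print(negated)  # ['first', 'second']
```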
If you plan to use textscan:
fid = fopen('data.txt','r');
rawdata = textscan(fid,'%s','delimiter','''');
fclose(fid);
output = rawdata{:}(2)
As in the other answers, the single apostrophe ' is represented by a doubled one, '', e.g. for delimiters.
Considering the comment:
fid = fopen('data.txt','r');
rawdata = textscan(fid,'%s','delimiter','\n');
fclose(fid);
lines = rawdata{1,1};
L = size(lines,1);
output = cell(L,1);
for ii=1:L
    temp = textscan(lines{ii},'%s','delimiter','''');
    output{ii,1} = temp{:}(2);
end
One easy way is to split the string at the single quotes and take the even-numbered strings in the output:
str = fileread('test.txt');
out = regexp(str, '''', 'split');
out = out(2:2:end);
You can do this using regular expressions. Assuming that there is only one occurrence of text between quotation marks:
% select all chars between single quotation marks.
out = regexp(inputString,'''(.*)''','tokens','once');
After identifying which lines you want to extract info from, you could tokenize them or, if they all have the same form, do something like this:
test='.model sdata1 s tstonefile=''../data/s_element/isdimm_rcv_via_2port_via_minstub.s50p'' passive=2';
a=strfind(test,'''')
% keep only the text between the first two quotes, excluding the quotes themselves
test=test(a(1)+1:a(2)-1)