R - How to extract file from directory based on user input - regex

I am relatively new to R and struggling with file extraction.
I have a list of CSV files (e.g. 001.csv, 002.csv, ....) in my directory xyz and need to extract a specific file based on the input given by the user.
User input is in the form of 1, 2 ... (stored in y), which I tried to convert by padding with leading zeros.
When I run the code
filename = as.character(formatC(y, width=3, flag=0))
list.files(directory,pattern = "^",filename,"\\.csv$")
I get the result
character(0)
which implies my pattern is incorrect. I want a file such as 001.csv to be returned.
Can anybody help me out?

It seems you need a single pattern string that matches any file name that starts with filename, continues with any 0+ characters, and ends with .csv.
To build it, use paste0:
files <- list.files(directory, pattern = paste0("^", filename, ".*\\.csv$"))
Where:
"^" - start of the file name string
filename - the filename you pass
".*\\.csv$" - any 0+ characters (.*) followed by .csv (\\.csv) at the end of the string ($).

filename = as.character(formatC(y, width=3, flag=0))
The formatC flag 0 seems to work only for numerical objects; if you read the user input y with e.g. y = readline(), y is of type "character". You get the desired formatting with
filename = formatC(as.integer(y), width=3, flag=0)
(as.character() isn't needed because the formatC() value already has that type).
list.files(directory,pattern = "^",filename,"\\.csv$")
This isn't a correct usage of
list.files(path = ".", pattern = NULL, all.files = FALSE,
full.names = FALSE, recursive = FALSE,
ignore.case = FALSE)
- surely you meant to concatenate "^", filename and "\\.csv$".
All told, I suggest building the whole filename pattern with sprintf(), i.e.:
filename = sprintf("^%03d\\.csv$", as.integer(y))
list.files(directory, filename)
The ^ and $ anchors keep longer names such as 1001.csv from matching.
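For comparison, the same idea can be sketched in Python: zero-pad the number, escape the dot, and anchor both ends of the pattern. The helper name find_numbered_csv and the sample file names are made up for illustration; in practice the names would come from a directory listing.

```python
import re

def find_numbered_csv(names, y, width=3):
    """Return the names that are exactly the zero-padded number plus '.csv'."""
    # int() also accepts user input that arrives as text, e.g. "1".
    pattern = re.compile(rf"^{int(y):0{width}d}\.csv$")
    return [name for name in names if pattern.match(name)]

# Hypothetical directory contents, used only to show the anchoring.
names = ["001.csv", "002.csv", "1001.csv", "001.csv.bak"]
print(find_numbered_csv(names, "1"))
```

Without the anchors, 1001.csv and 001.csv.bak would also be candidates, which is why both ends are pinned.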

Related

Conditionally extracting the beginning of a regex pattern

I have a list of strings containing the names of actors in a movie that I want to extract. In some cases, the actor's character name is also included which must be ignored.
Here are a couple of examples:
# example 1
input = 'Levan Gelbakhiani as Merab\nAna Javakishvili as Mary\nAnano Makharadze'
expected_output = ['Levan Gelbakhiani', 'Ana Javakishvili', 'Anano Makharadze']
# example 2
input = 'Yoosuf Shafeeu\nAhmed Saeed\nMohamed Manik'
expected_output = ['Yoosuf Shafeeu', 'Ahmed Saeed', 'Mohamed Manik']
Here is what I've tried to no avail:
import re
output = re.findall(r'(?:\\n)?([\w ]+)(?= as )?', input)
output = re.findall(r'(?:\\n)?([\w ]+)(?: as )?', input)
output = re.findall(r'(?:\\n)?([\w ]+)(?:(?= as )|(?! as ))', input)
The \n in the input string are new line characters. We can make use of this fact in our regex.
Essentially, each line always begins with the actor's name. After the actor's name, there could be either the word as, or the end of the line.
Using this info, we can write the regex like this:
^(?:[\w ]+?)(?:(?= as )|$)
First, we assert that we must be at the start of the line ^. Then we match some word characters and spaces lazily [\w ]+?, until we see (?:(?= as )|$), either as or the end of the line.
In code,
output = re.findall(r'^(?:[\w ]+?)(?:(?= as )|$)', input, re.MULTILINE)
Remember to use the multiline option. That is what makes ^ and $ mean "start/end of line".
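As a quick check, the pattern can be run against both examples from the question. This is only a sketch; the function name actor_names is made up here, and it assumes actor names consist of word characters and spaces only.

```python
import re

# Same pattern as above; MULTILINE makes ^ and $ work per line.
ACTOR = re.compile(r'^(?:[\w ]+?)(?:(?= as )|$)', re.MULTILINE)

def actor_names(text):
    """Return the leading actor name of every line in text."""
    return ACTOR.findall(text)

example_1 = 'Levan Gelbakhiani as Merab\nAna Javakishvili as Mary\nAnano Makharadze'
example_2 = 'Yoosuf Shafeeu\nAhmed Saeed\nMohamed Manik'
print(actor_names(example_1))
print(actor_names(example_2))
```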
You can do this without using a regular expression as well.
Here is the code:
output = [x.split(' as ')[0] for x in input.split('\n')]
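A variant of the same non-regex idea uses str.partition, which cuts at the first separator only and returns the whole line when the separator is absent. The function name is made up for this sketch, and it assumes no actor name itself contains " as " surrounded by spaces.

```python
def actor_names_no_regex(text):
    """Keep the part of each line before the first ' as ' separator."""
    # partition() returns (head, sep, tail); head is the full line
    # when ' as ' does not occur, so lines without a role pass through.
    return [line.partition(' as ')[0] for line in text.splitlines()]

print(actor_names_no_regex('Levan Gelbakhiani as Merab\nAnano Makharadze'))
```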
I guess you can combine the values obtained from two regex matches:
re.findall('(?:\\n)?(.+)(?:\W[a][s].*?)|(?:\\n)?(.+)$', input)
gives
[('Levan Gelbakhiani', ''), ('Ana Javakishvili', ''), ('', 'Anano Makharadze')]
from which you filter the empty strings out
output = list(map(lambda x : list(filter(len, x))[0], output))
gives
['Levan Gelbakhiani', 'Ana Javakishvili', 'Anano Makharadze']

Changing a pipe delimited file to comma delimited in VB.net

So I have a set of pipe delimited inputs which are something like this:
"787291 | 3224325523" | 37826427 | 2482472 | "46284729|46246" | 24682
| 82524 | 6846419 | 68247
and I am converting them to comma delimited using the code given below:
Dim line As String
Dim fields As String()
Using sw As New StreamWriter("c:\test\output.txt")
    Using tfp As New FileIO.TextFieldParser("c:\test\test.txt")
        tfp.TextFieldType = FileIO.FieldType.Delimited
        tfp.Delimiters = New String() {"|"}
        tfp.HasFieldsEnclosedInQuotes = True
        While Not tfp.EndOfData
            fields = tfp.ReadFields
            line = String.Join(",", fields)
            sw.WriteLine(line)
        End While
    End Using
End Using
So far so good. It only considers the delimiters that are present outside the quotes and changes them to the comma delimiter. But trouble starts when I have input with a stray quotation like below:
"787291 | 3224325523" | 37826427 | 2482472 | "46284729|46246" | 24682
| "82524 | 6846419 | 68247
Here the code gives
MalformedLineException
I realize this is due to the stray quotation mark in my input, but I am a total noob at RegEx, so I am not able to use it here. If anyone has any idea, it would be much appreciated.
Here is the coded procedure described in the comments:
1. Read all the lines of the original input file.
2. Fix the faulty lines (with Regex or anything else that fits).
3. Use TextFieldParser to parse the corrected input.
4. Join() the input parts created by TextFieldParser using , as separator.
5. Save the fixed, reconstructed input lines to the final output file.
I'm using Wiktor Stribiżew's Regex pattern: it looks like it should work given the description of the problem.
Note:
Of course I don't know whether a specific Encoding should be used.
Here, the Encoding is the default UTF-8 no-BOM, in and out.
"FaultyInput.txt" is the corrupted source file.
"FixedInput.txt" is the file containing the input lines fixed (hopefully) by the Regex. You could also use a MemoryStream.
"FixedOutput.txt" is the final CSV file, containing comma separated fields and the correct values.
These files are all read/written in the executable startup path.
Dim input As List(Of String) = File.ReadAllLines("FaultyInput.txt").ToList()
For line As Integer = 0 To input.Count - 1
    input(line) = Regex.Replace(input(line), "(""\b.*?\b"")|""", "$1")
Next
File.WriteAllLines("FixedInput.txt", input)
Dim output As List(Of String) = New List(Of String)
Using tfp As New FileIO.TextFieldParser("FixedInput.txt")
    tfp.TextFieldType = FileIO.FieldType.Delimited
    tfp.Delimiters = New String() {"|"}
    tfp.HasFieldsEnclosedInQuotes = True
    While Not tfp.EndOfData
        Dim fields As String() = tfp.ReadFields
        output.Add(String.Join(",", fields))
    End While
End Using
File.WriteAllLines("FixedOutput.txt", output)
'Eventually...
'File.Delete("FixedInput.txt")
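The same repair-then-parse procedure can be sketched in Python to see what the Regex does to the sample input. This is an illustration only: Python's csv module stands in for TextFieldParser, the function name pipe_to_comma is made up, and skipinitialspace plus strip() are assumptions made so that the padded sample fields parse cleanly.

```python
import csv
import re

# Keeps a well-formed "quoted field" (group 1) and drops any other quote.
STRAY_QUOTE = re.compile(r'("\b.*?\b")|"')

def pipe_to_comma(lines):
    """Repair stray quotes, split on unquoted pipes, rejoin with commas."""
    fixed = [STRAY_QUOTE.sub(r'\1', line) for line in lines]
    reader = csv.reader(fixed, delimiter='|', quotechar='"',
                        skipinitialspace=True)
    return [','.join(field.strip() for field in fields) for fields in reader]

sample = [
    '"787291 | 3224325523" | 37826427 | 2482472 | "46284729|46246" | 24682',
    '| "82524 | 6846419 | 68247',
]
for row in pipe_to_comma(sample):
    print(row)
```

Pipes inside the two properly quoted fields survive, while the unmatched quote before 82524 is removed before parsing. A field that itself contains a comma would need quoting on output, which this sketch does not attempt.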
Sub ReadMalformedCSV()
    Dim s$
    Dim pattern$ = "(?x)" + vbCrLf +
        "\b #word boundary" + vbCrLf +
        "(?'num'\d+) #any number of digits" + vbCrLf +
        "\b #word boundary"
    '// Use "ReadLines" as it will lazily read one line at time
    For Each line In File.ReadLines("c:\test\output.txt")
        '// Cast(Of Match)() keeps the LINQ call valid where MatchCollection is not generic
        s = String.Join(",", Regex.Matches(line, pattern).
            Cast(Of Match)().
            Select(Function(e) e.Groups("num").Value))
        WriteLine(s)
    Next
End Sub

.txt filename iteration vb.net

I've got a little problem with regards to iterating the filename of the txt files. I've got a filename format that goes like this: <date>-<year>_filename-<number>.txt. The problem is that when <number> reaches 9, the filename stops iterating.
The filenames go like this:
31-2014_filename-1
31-2014_filename-2
31-2014_filename-3
31-2014_filename-4
31-2014_filename-5
31-2014_filename-6
31-2014_filename-7
31-2014_filename-8
31-2014_filename-9
31-2014_filename-10
The function only detects up to 9. Anything beyond that number is ignored.
Below is the code
Dim lastreport As Integer = 1
Public Sub GetLastNo(ByVal filePath As String)
    Dim lastFile As String = 1
    Dim files() As String = Directory.GetFiles(filePath, "*.txt")
    For Each File As String In files
        File = Path.GetFileNameWithoutExtension(File)
        Dim numbers As MatchCollection = Regex.Matches(File, "(?<num>[\d]+)")
        For Each number In numbers
            number = CInt(number.ToString())
            If number > 0 And number < 1000 And number > lastFile Then
                lastFile = number
            End If
            lastreport = number
        Next
    Next
End Sub
Here it is:
(?<num>\d+(?=$))
This makes sure that the digits are followed by $ (the end of the string), so only the last set of digits is matched.
It would really help to see some real filenames, including some that fail to match (your description is not completely unambiguous: for example what is <date> if it does not include the year?).
But assuming files like:
30May-2014_Stuff-1.txt
30May-2014_Stuff-3.txt
30May-2014_Stuff-5.txt
30May-2014_Stuff-7.txt
30May-2014_Stuff-9.txt
30May-2014_Stuff-11.txt
then using the .NET regex engine (from PowerShell (PSH) here, as it is quicker to test with):
(?<num>\d+)$
should match the final digits ($ matches the end of the string) of the filename without extension (BaseName in PSH):
dir | foreach { if ($_.BaseName -match '(?<num>\d+)$') { $matches['num'] } }
gives:
1
11
3
5
7
9
So all filenames are matched, and the final number of their basenames is matched by group "num" of the regex.
I think there is something else going on in your approach: I would suggest changing to only get a single match per filename (and use Regex.Match rather than Matches to be consistent).
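The single-match suggestion can be sketched as follows, in Python for brevity. The function name last_report_number is made up, and the sketch assumes the counter is always the final run of digits in the name before the extension. Values are compared as integers so that 10 ranks above 9.

```python
import re
from pathlib import PurePath

# One match per name: the digits that end the base name.
TRAILING_NUMBER = re.compile(r'(?P<num>\d+)$')

def last_report_number(filenames):
    """Return the highest trailing number found, or 0 if none match."""
    highest = 0
    for name in filenames:
        match = TRAILING_NUMBER.search(PurePath(name).stem)
        if match:
            highest = max(highest, int(match.group('num')))
    return highest

names = ['31-2014_filename-%d.txt' % n for n in range(1, 11)]
print(last_report_number(names))
```

Anchoring on the end of the base name means the 31 and 2014 parts are never considered, so no range check on the number is needed.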

Matlab regexp; I would like to catch words between specific words

I would like to catch words between specific words in Matlab regular expression.
For example, If line = 'aaaa\bbbbb\ccccc....\wwwww.xyz' is given,
I would like to catch only wwwww.xyz.
The parts aaaa through wwwww.xyz do not stand for specific words or a specific number of characters: each can be any sequence of one or more characters other than a backslash. wwwww.xyz always comes after the last backslash. My problem is that regexp(line,'\\.+\.xyz','match') does not always work, since wwwww sometimes contains special characters such as '-'.
Any suggestion is appreciated.
If you must use regex for this, this regex should work:
[\\]?(?!.+\\)([^.]+\.[a-z]{3})
Working regex example:
http://regex101.com/r/fL5oS5
Example data:
aaaa\bbbbb\ccccc\ww%20-www.xyz
www-654_33.xyz
Matches:
1. ww%20-www.xyz
2. www-654_33.xyz
No solution provided here is likely to be 100% reliable unless you know that your data is carefully formatted (has the path string been escaped?). The question boils down to finding a word that is a valid path in a line of text. It is not so easy. We'll assume that all files have file extensions (this is not necessarily true in the context of paths). An arbitrary path might then look like any of the following:
'wwwww.x'
'wwwww.xyz'
'\wwwww.xyz'
'ccccc\wwwww.xyz'
'\ccccc\wwwww.xyz'
...
str = 'The quick brown fox aaaa\bbbbb\ccccc\wwwww.xyz jumped over the lazy dog.';
matches = regexp(str,'\s\\?([^.\s\\]+\\)*([^.\s]+\.\w+)\s','tokens');
file_name = matches{1}(2)
which returns the following for all of the cases above (the extension is slightly different for the first case, though):
file_name =
'wwwww.xyz'
If you know the filename extension is '.xyz', then you can use this instead:
matches = regexp(str,'\s\\?([^.\s\\]+\\)*([^.\s]+\.xyz)\s','tokens');
By the way, for a path, the fileparts function can be used:
str = 'aaaa\bbbbb\ccccc\wwwww.xyz'; % A Windows-only path
% str = 'aaaa/bbbbb/ccccc/wwwww.xyz'; % A UNIX or OS X path (works on Windows too)
[path_str,file_name,file_ext] = fileparts(str)
which returns
path_str =
aaaa\bbbbb\ccccc
file_name =
wwwww
file_ext =
.xyz
You can then get the filename with extension via
file_name_ext = [file_name file_ext];
Note also that path_str omits the trailing file separator.
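For readers outside MATLAB, a rough Python counterpart of the fileparts approach is pathlib.PureWindowsPath, which splits on backslashes regardless of the platform the script runs on. The wrapper name split_windows_path is made up for this sketch.

```python
from pathlib import PureWindowsPath

def split_windows_path(path_text):
    """Return (directory, base name, extension) for a Windows-style path."""
    p = PureWindowsPath(path_text)
    return str(p.parent), p.stem, p.suffix

folder, name, ext = split_windows_path(r'aaaa\bbbbb\ccccc\wwwww.xyz')
print(folder)
print(name + ext)  # file name with extension
```

As with fileparts, the directory part comes back without a trailing separator.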
Assuming that the only thing that your strings have in common is that there is a file path separator, and you are interested in everything "from the last file path separator to the first whitespace", then you could try
['[\' filesep ']([^\' filesep ']+?)(?:\s|$)']
which on Windows platform would reduce to
\\([^\\]+?)(?:\s|$)
Demo:
http://regex101.com/r/jW5tT1
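The Windows form of this pattern can be tried outside MATLAB as well. Below is a small Python sketch; the function name last_component and its optional extension parameter are made up here, and the sketch assumes file names contain no whitespace.

```python
import re

def last_component(text, extension=None):
    """Return what follows the last backslash, up to whitespace or the end."""
    # Optionally require a literal extension such as 'xyz'.
    tail = r'\.' + re.escape(extension) if extension else ''
    pattern = r'\\([^\\]+?' + tail + r')(?:\s|$)'
    match = re.search(pattern, text)
    return match.group(1) if match else None

print(last_component(r'aaaa\bbbbb\ccccc\ww-www.xyz'))
print(last_component(r'fox aaaa\bbbbb\wwwww.xyz jumped', extension='xyz'))
```

Earlier backslashes cannot start a match because the text after them runs into another backslash before any whitespace, so only the final component is captured.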
If you want to match the extension literally (.xyz in your example), change it to
\\([^\\]+?\.xyz)(?:\s|$)
"Find a backslash followed by the fewest (+?) number of "not backslash" until literal .xyz followed by a white space or end of string"

Changing substring in a String

I've got a variable "Variable" in VBScript that will receive different values, based on names that come from XML files I don't trust. I can't let "Variable" contain forbidden characters (<, >, :, ", /, \, |, ?, * ) or characters with accents (I think they are called accents in English) like (Á, á, É, é, Â, â, Ê, ê, ñ, ã).
So, my question is: How can I create a script that finds and replaces any of these characters in the variable I have? I'm using a Replace function found in the MSDN Library, but it won't let me replace several characters in the way I'm using it.
Example:
(Assuming a Node.Text value of "Example A/S")
For Each Node In xmlDoc.SelectNodes("//NameUsedToRenameFile")
Variable = Node.Text
Next
Result = Replace(Variable, "<", "-")
Result = Replace(Variable, "/", "-")
WScript.Echo Result
The Echo above returns "Example A-S", but if I change the order of my Replace calls, like:
Result = Replace(Variable, "/", "-")
Result = Replace(Variable, "<", "-")
I get "Example A/S". How should I write it so that it handles any of these characters? Thanks!
As discussed, it might be easier to do things the other way around: create a list of allowed characters, as VBScript is not so good at handling Unicode-like characters. Whilst the characters you have listed may be fine, you may run into issues with certain character sets. Here's an example routine that could help your cause:
Consider this command:
wscript.echo ValidateStr("This393~~_+'852Is0909A========Test|!:~#$%####")
Using the sample routine below, it should produce the following results:
This393852Is0909ATest
The sample routine:
Function ValidateStr (vsVar)
    Dim vsAllowed, vscan, vsValid, vsaCount
    vsAllowed = "ABCDEFGHIJKLMNOPQRSTUVWXYZ1234567890"
    ValidateStr = ""
    If vartype(vsvar) = vbString then
        If len(vsvar) > 0 then
            For vscan = 1 To Len(vsvar)
                vsValid = False
                vsaCount = 1
                ' Stop scanning the allowed list as soon as the character is found
                Do While vsValid = false and vsaCount <= len(vsAllowed)
                    If UCase(Mid(vsVar, vscan, 1)) = Mid(vsAllowed, vsaCount, 1) Then vsValid = True
                    vsaCount = vsaCount + 1
                Loop
                If vsValid Then ValidateStr = ValidateStr & Mid(vsVar, vscan,1)
            Next
        End If
    End If
End Function
I hope this helps you with your quest. Enjoy!
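The allow-list idea is not specific to VBScript. As a compact illustration, here is a Python sketch of the same filter; the name validate_str mirrors the routine above, and the allowed set (ASCII letters and digits) is the same assumption that routine makes.

```python
import string

# Allow-list: ASCII letters (either case) and digits.
ALLOWED = set(string.ascii_letters + string.digits)

def validate_str(value):
    """Drop every character that is not in the allow-list."""
    if not isinstance(value, str):
        return ''
    return ''.join(ch for ch in value if ch in ALLOWED)

print(validate_str("This393~~_+'852Is0909A========Test|!:~#$%####"))
```

Anything outside the allow-list, including accented letters, is dropped without having to be listed explicitly.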
EDIT: If you wish to continue with your original path, you will need to fix your Replace calls. They are not working because each line starts again from Variable, which discards the previous replacement. You'll need to pass in Variable the first time, then use Result every subsequent time.
You had:
Result = Replace(Variable, "/", "-")
Result = Replace(Variable, "<", "-")
You need to change this to:
Result = Replace(Variable, "/", "-")
Result = Replace(Result, "<", "-")
Result = Replace(Result, ...etc..)
Result = Replace(Result, ...etc..)
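The chaining rule (feed each result into the next call) is easiest to see as a loop. Here is a sketch in Python; the function name replace_forbidden is made up, the forbidden list is the one from the question, and "-" is used as the replacement as in the question's example.

```python
# The forbidden characters listed in the question.
FORBIDDEN = '<>:"/\\|?*'

def replace_forbidden(name, placeholder='-'):
    """Replace each forbidden character, carrying the result forward."""
    result = name
    for ch in FORBIDDEN:
        # Reuse result, not name, so earlier replacements are kept.
        result = result.replace(ch, placeholder)
    return result

print(replace_forbidden('Example A/S'))
```

Because every step starts from the previous result, the order of the characters no longer matters.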
Edit: You could try Ansgar's Regex, as the code is far simpler, but I am not sure it will work if, for example, you had simplified Chinese characters in your string.
I agree with Damien that replacing everything but known-good characters is the better approach. I would, however, use a regular expression for this, because it greatly simplifies the code. I would also recommend not removing "bad" characters but replacing them with a known-good placeholder (an underscore, for instance), because removing characters might yield undesired results.
Function SanitizeString(str)
    Set re = New RegExp
    re.Pattern = "[^a-zA-Z0-9]"
    re.Global = True
    SanitizeString = re.Replace(str, "_")
End Function
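To see why a placeholder can be preferable to removal, the same character class can be tried in Python. This is a sketch only; sanitize_string and its placeholder parameter are illustrative names, not part of the VBScript above.

```python
import re

def sanitize_string(value, placeholder='_'):
    """Replace every character outside a-z, A-Z, 0-9 with the placeholder."""
    return re.sub(r'[^a-zA-Z0-9]', placeholder, value)

print(sanitize_string('Example A/S'))      # placeholder keeps positions
print(sanitize_string('Example A/S', ''))  # removal merges the pieces
```

With removal, distinct inputs such as "A/S" and "AS" collapse into the same output, which is one of the undesired results a placeholder avoids.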