Getting specific columns from a CSV that contains empty cells - MapReduce

I am trying to parse specific columns from a CSV file using MapReduce. The CSV has some empty cells. When I run the job on it, it throws exceptions, but when I run it on a CSV with no empty cells I get the expected output.
sss xx asdas sddf
saq asdsds sdsds
ewe zz asdsd
aaa qq adsd ssdds
My CSV looks like the above, and my mapper output must look like this:
sss xx sddf 1
saq sdsds 1
ewe zz 1
aaa qq ssdds 1
My mapper code is:
public class DataMap extends MapReduceBase implements
        Mapper<LongWritable, Text, Text, IntWritable> {
    public final static IntWritable one = new IntWritable(1);

    @Override
    public void map(LongWritable key, Text value,
            OutputCollector<Text, IntWritable> output, Reporter r)
            throws IOException {
        String a = value.toString();
        String[] temp = a.split("\t");
        output.collect(new Text(temp[0] + "\t" + temp[1] + "\t" + temp[3]), one);
    }
}
How can I overcome this?
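The question does not show the exception text, so the following is an assumption: a likely cause is an ArrayIndexOutOfBoundsException on temp[3]. Java's String.split(regex) drops trailing empty strings, so a row whose last cell is empty yields fewer than four elements. Passing a negative limit keeps them. Below is a minimal sketch without the Hadoop classes; the class and method names (ColumnPicker, pickColumns, cell) are made up for illustration.

```java
class ColumnPicker {
    // Returns columns 0, 1 and 3 of a tab-separated line, joined by tabs.
    static String pickColumns(String line) {
        // A limit of -1 tells split to keep trailing empty strings.
        String[] cells = line.split("\t", -1);
        return cell(cells, 0) + "\t" + cell(cells, 1) + "\t" + cell(cells, 3);
    }

    // Guards against rows that are shorter than expected.
    static String cell(String[] cells, int index) {
        return index < cells.length ? cells[index] : "";
    }

    public static void main(String[] args) {
        System.out.println(pickColumns("sss\txx\tasdas\tsddf"));
        System.out.println(pickColumns("saq\t\tasdsds\tsdsds")); // empty second cell
        System.out.println(pickColumns("ewe\tzz\tasdsd\t"));     // empty last cell
        System.out.println(pickColumns("ewe\tzz\tasdsd"));       // last cell missing entirely
    }
}
```

Inside the mapper, the result of pickColumns(value.toString()) could be wrapped in a Text and collected together with the constant one.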

Related

Open a recordset in VS2017 like I could in VB6 for MS Access table

I am a long-time VB6 guy and feel squeezed into VS2017.
I need your help with a VS2017 equivalent of:
dim db as database
dim rs as recordset
db=opendatabase("path to .MDB")
rs = db.openrecordset("select * where myfield ="mine", order by fieldage") 'This SQL bit is easy.
'Then rs.movenext or rs.moveprevious etc etc
'Also it would help if I could say Textbox1.text = rs(3)
This is to be rolled out to more than one PC, so having to set a specific data connection in the PC config is not practical.
Thanks for reading this far.
I don't know if this would tick off all your needs, but it's longer than a comment, so I'm surfacing it as a possible solution.
It looks like you're doing VB.net? I encountered a similar challenge when I converted directly from VB6 -> C# using a tool I wrote and then made public.
https://github.com/bhoogter/VB6TocSharp (Yes, I wrote this. Yes, it's free).
To get around it, we used the standard System.Data and System.Data.OleDb packages. These did not have the convenience methods provided like .MoveNext() or .MovePrevious(), and also, as you pointed out, could not be referenced by rs(3), let alone some of the easy positioning and filtering we were converting from.
That said, we chose to wrap the objects returned from the database calls in a Recordset class, and use that to provide the interface we wanted.
Of course, the original was written in C#, available here:
https://github.com/bhoogter/VB6TocSharp/blob/master/extras/Recordset.cs
But I put together a VB.NET port of the same, if interested. It is linked, and also included, here.
https://github.com/bhoogter/VB6TocSharp/blob/master/extras/vb.net/recordset.vb
Of note is the use of a few sub-classes contained within the Recordset class that can make life easier. What is available right away are the methods you mentioned (MoveNext, MovePrevious, RS('fieldname')). Note that in VB6, RS(f) would return a Field object, and the default property of that Field object was .Value. This class tries to maintain backwards compatibility by bypassing the Field object and just returning .Fields[i].Value, so you don't have to change your existing code, which saves time and effort in conversion. And if there is some interface you're missing, you control this layer, so you can add to or modify it as suits your conversion needs.
The VB.net ported version is as follows... It is a conversion of the C# one, so I can't guarantee it's 100% perfect, but it should be enough to demonstrate the point. YMMV.
Imports System.Data
Imports System.Data.OleDb
' Recordset object. Used to wrap other data objects to 'simulate' VB6.
Public Class Recordset
Public Source As String = ""
Public Parameters As Dictionary(Of Object, Object)
Public Database As String = ""
Public QuietErrors As Boolean = False
Private mAddingRow As Boolean = False
Public ReadOnly Property AddingRow As Boolean
Get
Return mAddingRow
End Get
End Property
Private connection As OleDbConnection
Private adapter As OleDbDataAdapter
Private table As DataTable
Private filteredTable As DataTable
Private mFilter As String
Public Sub New()
End Sub
Public Sub New(table As DataTable, adapter As OleDbDataAdapter, connection As OleDbConnection)
Me.connection = connection
Me.adapter = adapter
Me.table = table
End Sub
Public Sub New(SQL As String, File As String, Optional QuietErrors As Boolean = False, Optional Parameters As Dictionary(Of Object, Object) = Nothing)
Me.Source = SQL
Me.Parameters = Parameters
Me.Database = File
Me.QuietErrors = QuietErrors
Open()
End Sub
Public Sub Close()
Try
connection?.Close()
Catch
' just suppress
End Try
connection = Nothing
adapter = Nothing
table = Nothing
filteredTable = Nothing
End Sub
Public Shared Sub sqlExecutionError(mSQL As String, e As Exception)
Dim T As String = ""
T &= "getRecordSet Failed: " & e.Message & vbCrLf
T &= vbCrLf
T &= mSQL & vbCrLf
T &= vbCrLf
T &= "ERROR:" & e.Message
T = T.Replace("$EDESC", e.Message)
'ErrMsg = Replace(ErrMsg, "$ENO", Err().Number)
T = T.Replace("$ESRC", e.Source)
MsgBox("Database Error: " + T, 0, "Error")
'CheckStandardErrors() ' Bookmark/updateable query
End Sub
Private Function ConnectionString(file As String) As String
Return "PROVIDER=Microsoft.Jet.OLEDB.4.0;Data Source=" + file
End Function
Public Property AbsolutePosition As Integer = -1
Public Property Position As Integer
Get
Return AbsolutePosition
End Get
Set(value As Integer)
AbsolutePosition = value
End Set
End Property
Public ReadOnly Property RecordCount As Integer
Get
If Not table Is Nothing Then
If Not table.Rows Is Nothing Then
Return table.Rows.Count
End If
End If
Return 0
End Get
End Property
Public ReadOnly Property EOF As Boolean
Get
Return AbsolutePosition >= RecordCount
End Get
End Property
Public ReadOnly Property BOF As Boolean
Get
Return AbsolutePosition = 0
End Get
End Property
Public Function FieldExists(F As String) As Boolean
If Not table Is Nothing Then
If Not table.Columns Is Nothing Then
Return table.Columns.Contains(F)
End If
End If
Return False
End Function
Public Function MoveFirst() As Integer
AbsolutePosition = 0
Return 0
End Function
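Public Function MoveNext() As Integer
' VB.NET has no ++ operator, so advance explicitly and stop at RecordCount (EOF).
AbsolutePosition = Math.Min(AbsolutePosition + 1, RecordCount)
Return AbsolutePosition
End Function
Public Function MovePrevious() As Integer
' VB.NET has no -- operator either; step back explicitly and stop at 0.
AbsolutePosition = Math.Max(AbsolutePosition - 1, 0)
Return AbsolutePosition
End Function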
Public Function MoveLast() As Integer
AbsolutePosition = RecordCount - 1
Return AbsolutePosition
End Function
Public ReadOnly Property Fields As RecordsetFields
Get
If AbsolutePosition >= 0 And AbsolutePosition < RecordCount Then Return New RecordsetFields(table.Rows(AbsolutePosition))
Throw New ArgumentOutOfRangeException("Either EOF or BOF is true.")
End Get
End Property
Public ReadOnly Property FieldNames As List(Of String)
Get
If IsNothing(table) Then Return Nothing
Dim result As List(Of String) = New List(Of String)
For Each item As DataColumn In table.Columns
result.Add(item.ColumnName)
Next
Return result
End Get
End Property
Public ReadOnly Property Field As PropIndexer(Of Object, Object)
Get
Return New PropIndexer(Of Object, Object)(
Function(k As Object)
Return Fields(k).Value
End Function,
Function(k As Object, v As Object)
Fields(k).Value = v
End Function
)
End Get
End Property
Default Property Item(field As Object) As Object
Get
Return GetField(field)
End Get
Set
SetField(field, Value)
End Set
End Property
Public Function GetField(key As Object) As Object
Return Fields(key).Value
End Function
Public Sub SetField(key As Object, value As Object)
Fields(key).Value = value
End Sub
Public Function GetRows() As List(Of List(Of Object))
Dim tableEnumerable As Object = table.AsEnumerable()
Dim tableList As Object = tableEnumerable.ToArray().ToList()
Return tableList.ToList() _
.Select(Function(r As Object)
Return r.ItemArray.ToList()
End Function) _
.ToList()
End Function
Public Property Filter As String
Get
Return mFilter
End Get
Set(value As String)
mFilter = value
If String.IsNullOrEmpty(value) Then
filteredTable = Nothing
Return
End If
filteredTable = table.Select(mFilter).CopyToDataTable()
End Set
End Property
Protected Function Find(v As String) As Boolean
' Search with the supplied criteria and keep the original DataRow
' references so that IndexOf can locate the match in the table.
Dim matches() As DataRow = table.Select(v)
If matches.Length = 0 Then Return False
AbsolutePosition = table.Rows.IndexOf(matches(0))
Return True
End Function
Private Sub Open()
Const maxTries = 5
If Dir(Database) = "" Then
MsgBox("Database Not Found: " + Database)
Return
End If
Dim result As DataSet = New DataSet()
connection = New OleDbConnection(ConnectionString(Database))
Dim Command As OleDbCommand = New OleDbCommand(Source, connection)
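' Parameters is optional, so it may be Nothing.
If Parameters IsNot Nothing Then
For Each Key In Parameters.Keys
Dim param As OleDbParameter = Command.CreateParameter()
param.ParameterName = Key.ToString()
param.Value = Parameters(Key)
Command.Parameters.Add(param)
Next
End If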
adapter = New OleDbDataAdapter(Command)
Try
connection.Open()
adapter.FillSchema(result, SchemaType.Source)
adapter.Fill(result, "Default")
Catch e As Exception
If Not QuietErrors Then sqlExecutionError(Source, e)
Finally
connection.Close()
End Try
table = result.Tables("Default")
End Sub
Public Sub Update()
Dim cb As OleDbCommandBuilder = New OleDbCommandBuilder(adapter)
cb.QuotePrefix = "["
cb.QuoteSuffix = "]"
Try
connection.Open()
adapter.UpdateCommand = cb.GetUpdateCommand()
adapter.Update(table)
Catch e As Exception
If Not QuietErrors Then sqlExecutionError(adapter.UpdateCommand?.CommandText, e)
Finally
connection.Close()
End Try
mAddingRow = False
End Sub
Public Sub AddNew()
Dim newRow As DataRow = table.NewRow()
table.Rows.InsertAt(newRow, table.Rows.Count)
AbsolutePosition = table.Rows.Count - 1
mAddingRow = True
End Sub
Public Sub Delete()
Dim cb As OleDbCommandBuilder = New OleDbCommandBuilder(adapter)
Try
connection.Open()
adapter.DeleteCommand = cb.GetDeleteCommand()
adapter.Update(table)
Catch e As Exception
If Not QuietErrors Then sqlExecutionError(adapter.DeleteCommand?.CommandText, e)
Finally
connection.Close()
End Try
End Sub
Public Class RecordsetFields
Implements ICollection
Private row As DataRow = Nothing
Public Sub New(row As DataRow)
Me.row = row
End Sub
Public ReadOnly Property Count As Integer
Get
Return row.Table.Columns.Count
End Get
End Property
Public SyncRoot As Object = Nothing
Public IsSynchronized As Boolean = False
Private Sub ICollection_CopyTo(array As Array, index As Integer) Implements ICollection.CopyTo
Throw New InvalidOperationException("Not valid on object")
End Sub
Private Function IEnumerable_GetEnumerator() As IEnumerator Implements IEnumerable.GetEnumerator
Return row.Table.Columns.GetEnumerator()
End Function
Default Public ReadOnly Property Item(x As Object) As RecordsetField
Get
Dim C As DataColumn = row.Table.Columns(x)
Return New RecordsetField(row, x)
End Get
End Property
Private ReadOnly Property ICollection_Count As Integer Implements ICollection.Count
Get
Throw New NotImplementedException()
End Get
End Property
Private ReadOnly Property ICollection_IsSynchronized As Boolean Implements ICollection.IsSynchronized
Get
Throw New NotImplementedException()
End Get
End Property
Private ReadOnly Property ICollection_SyncRoot As Object Implements ICollection.SyncRoot
Get
Throw New NotImplementedException()
End Get
End Property
End Class
Public Class RecordsetField
Public Const adSmallInt As Integer = 2 ' Integer SmallInt
Public Const adInteger As Integer = 3 ' AutoNumber
Public Const adSingle As Integer = 4 ' Single Real
Public Const adDouble As Integer = 5 ' Double Float Float
Public Const adCurrency As Integer = 6 ' Currency Money
Public Const adDate As Integer = 7 ' Date DateTime
Public Const adIDispatch As Integer = 9 '
Public Const adBoolean As Integer = 11 ' YesNo Bit
Public Const adVariant As Integer = 12 ' Sql_Variant(SQL Server 2000 +) VarChar2
Public Const adDecimal As Integer = 14 ' Decimal *
Public Const adUnsignedTinyInt As Integer = 17 ' Byte TinyInt
Public Const adBigInt As Integer = 20 ' BigInt(SQL Server 2000 +)
Public Const adGUID As Integer = 72 ' ReplicationID(Access 97 (OLEDB)), (Access 2000 (OLEDB)) UniqueIdentifier (SQL Server 7.0 +)
Public Const adWChar As Integer = 130 ' NChar(SQL Server 7.0 +)
Public Const adChar As Integer = 129 ' Char Char
Public Const adNumeric As Integer = 131 ' Decimal(Access 2000 (OLEDB)) Decimal
Public Const adBinary As Integer = 128 ' Binary
Public Const adDBTimeStamp As Integer = 135 ' DateTime(Access 97 (ODBC)) DateTime
Public Const adVarChar As Integer = 200 ' Text(Access 97) VarChar VarChar
Public Const adLongVarChar As Integer = 201 ' Memo(Access 97)
Public Const adVarWChar As Integer = 202 ' Text(Access 2000 (OLEDB)) NVarChar (SQL Server 7.0 +) NVarChar2
Public Const adLongVarWChar As Integer = 203 ' Memo(Access 2000 (OLEDB))
Public Const adVarBinary As Integer = 204 ' ReplicationID(Access 97) VarBinary
Public Const adLongVarBinary As Integer = 205 ' OLEObject Image Long Raw *
Private Row As DataRow = Nothing
Public Name As Object = ""
Public Size As Integer = 0
Public Sub New(Row As DataRow, Name As Object)
Me.Row = Row
Me.Name = Name
End Sub
Public Property Value As Object
Get
Return Row(Name)
End Get
Set(value As Object)
Row(Name) = value
End Set
End Property
Public ReadOnly Property Type As Object
Get
Return Row.Table.Columns(Name).DataType
End Get
End Property
End Class
Public Class PropIndexer(Of I, V)
Public Delegate Sub setProperty(idx As I, value As V)
Public Delegate Function getProperty(idx As I) As V
Public getter As getProperty
Public setter As setProperty
Public Sub New(g As getProperty, s As setProperty)
getter = g
setter = s
End Sub
Public Sub New(g As getProperty)
getter = g
setter = AddressOf setPropertyNoop
End Sub
Public Sub New()
getter = AddressOf getPropertyNoop
setter = AddressOf setPropertyNoop
End Sub
Private Sub setPropertyNoop(idx As I, value As V)
' NOOP. Intentionally left blank.
End Sub
Private Function getPropertyNoop(idx As I) As V
Return CType(Nothing, V)
End Function
Default Public Property Item(ByVal nIndex As I) As V
Get
Return getter.Invoke(nIndex)
End Get
Set
setter.Invoke(nIndex, Value)
End Set
End Property
End Class
End Class

Pig - Remove line feed, return and tab

I'm trying to remove the characters: \n, \t and \r from a column in Pig but I'm getting the wrong output.
Here is what I'm doing:
qr_1 = LOAD 'hdfs://localhost:9000/sample.csv' USING PigStorage(',') as (Id:int,PostTypeId:int,AcceptedAnswerId:int,ParentId:int,CreationDate:chararray,DeletionDate:chararray,Score:int,ViewCount:int,Body:chararray,OwnerUserId:int,OwnerDisplayName:chararray,LastEditorUserId:int,LastEditorDisplayName:chararray,LastEditDate:chararray,LastActivityDate:chararray,Title:chararray,Tags:chararray,AnswerCount:int,CommentCount:int,FavoriteCount:int,ClosedDate:chararray,CommunityOwnedDate:chararray);
qr_1 = FOREACH qr_1 GENERATE Id .. ViewCount, REPLACE(Body,'\n','') as Body, OwnerUserId .. ;
qr_1 = FOREACH qr_1 GENERATE Id .. ViewCount, REPLACE(Body,'\r','') as Body, OwnerUserId .. ;
qr_1 = FOREACH qr_1 GENERATE Id .. ViewCount, REPLACE(Body,'\t','') as Body, OwnerUserId .. ;
Input:
5585779,1,5585800,,2011-04-07 18:27:54,,1432,3090250,"<p>How can I convert a <code>String</code> to an <code>int</code> in Java?</p>
<p>My String contains only numbers and I want to return the number it represents.</p>
<p>For example, given the string <code>""""1234""""</code> the result should be the number <code>1234</code>.</p>",537967,,2756409,user166390,2015-09-10 21:30:42,2016-03-07 00:42:49,Converting String to Int in Java?,<java><string><type-conversion>,12,0,239
Output:
(5585779,1,5585800,,2011-04-07 18:27:54,,1432,3090250,"<p>How can I convert a <code>String</code> to an <code>int</code> in Java?</p>,,,,,,,,,,,,,)
(,,,,,,,,,,,,,,,,,,,,,)
(,,,,,,,,,,,,,,,,,,,,)
(,,,,,,,,,,,,,,,,,,,,,)
(,,537967,,2756409,user166390,,,Converting String to Int in Java?,,12,0,239,,,,,,,,,)
What am I doing wrong?
Thanks.
Also, using "\\n" doesn't make a difference.
There are commas in your data, which is why the fields and the schema do not match. Use CSVLoader, and then use the REPLACE function to replace '\\t', '\\n' and '\\r'.

Remove text between two tags

I'm trying to remove some text between two tags, [ and ]:
[13:00:00]
I want to remove 13:00:00 from inside the [] tags.
The value is different each time.
It's always a time of day, so it contains only digits and : symbols.
Can someone help me?
UPDATE:
I forgot to mention something. The time (13:00:00) comes from a log file that looks like this:
[10:56:49] [Client thread/ERROR]: Item entity 26367127 has no item?!
[10:57:25] [Dbutant] misterflo13 : ils coute chere les enchent aura de feu et T2 du spawn??*
[10:57:35] [Amateur] firebow ?.SkyLegend.? : ouai 0
[10:57:38] [Novice] iPasteque : ils sont gratuit me
[10:57:41] [Novice] iPasteque : ils sont gratuit mec *
[10:57:46] [Dbutant] misterflo13 : on ma dit k'ils etait payent :o
[10:57:57] [Novice] iPasteque : on t'a mytho alors
Ignore the other text; I just want to remove the time between [ and ] (the result needs to look like []). The time between [ and ] is updated every second.
It looks like your log has a specific format, and you seem to want to get rid of the time and keep all the other information. See the comments in the code.
I didn't test it, but it should work:
' Read log
Dim logLines() As String = File.ReadAllLines("File_path")
If logLines.Length = 0 Then Return
' Prepare an array to hold the sliced data
Dim lines(logLines.Length - 1) As String
For i As Integer = 0 To logLines.Length - 1
    ' Cut off the 10-character time part and add empty brackets for each line
    lines(i) = "[]" & logLines(i).Substring(10)
Next
As shown above, if you know that your file comes in a certain format, you can just cut the string at a fixed position.
Note: the code above can be written in one line using LINQ.
If you want to actually get the data out of it, use IndexOf. Since you are looking for the first occurrence of "[" or "]", just use start index 0.
' get position of open bracket in string
Dim openBracketPos As Integer = myString.IndexOf("[", 0, StringComparison.OrdinalIgnoreCase)
' get position of close bracket in string
Dim closeBracketPos As Integer = myString.IndexOf("]", 0, StringComparison.OrdinalIgnoreCase)
' get string between open and close bracket (the second argument is a length, not an end index)
Dim data As String = myString.Substring(openBracketPos + 1, closeBracketPos - openBracketPos - 1)
This is another possibility using Regex:
Public Function ReplaceTime(ByVal Input As String) As String
Dim m As Match = Regex.Match(Input, "(\[)(\d{1,2}\:\d{1,2}(\:\d{1,2})?)(\])(.+)")
Return m.Groups(1).Value & m.Groups(4).Value & m.Groups(5).Value
End Function
It's more of a readability nightmare but it's efficient and it takes only the brackets containing a time value.
I also took the liberty of making it match for example 13:47 as well as 13:47:12.
Test: http://ideone.com/yogWfD
(EDIT) Multiline example:
You can combine this with File.ReadAllLines() (if that's what you prefer) and a For loop to get the replacement done.
Public Function ReplaceTimeMultiline(ByVal TextLines() As String) As String
For x = 0 To TextLines.Length - 1
TextLines(x) = ReplaceTime(TextLines(x))
Next
Return String.Join(Environment.NewLine, TextLines)
End Function
Above code usage:
Dim FinalT As String = ReplaceTimeMultiline(File.ReadAllLines(<file path here>))
Another multiline example:
Public Function ReplaceTimeMultiline(ByVal Input As String) As String
Dim ReturnString As String = ""
Dim Parts() As String = Input.Split({Environment.NewLine}, StringSplitOptions.None)
For x = 0 To Parts.Length - 1
ReturnString &= ReplaceTime(Parts(x)) & If(x < (Parts.Length - 1), Environment.NewLine, "")
Next
Return ReturnString
End Function
Multiline test: http://ideone.com/nKZQHm
If your problem is to remove numeric strings in the format of 99:99:99 that appear inside [], I would do:
//assuming you want to replace the [......] numeric string with an empty []. Should you want to completely remove the tag, just replace with string.Empty
Here's a demo (in C#, not VB, but you get the point; you need the regex, not the syntax, anyway):
List<string> list = new List<string>
{
"[13:00:00]",
"[4:5:0]",
"[5d2hu2d]",
"[1:1:1000]",
"[1:00:00]",
"[512341]"
};
string s = string.Join("\n", list);
Console.WriteLine("Original input string:");
Console.WriteLine(s);
Regex r = new Regex(@"\[\d{1,2}?:\d{1,2}?:\d{1,2}?\]");
foreach (Match m in r.Matches(s))
{
Console.WriteLine("{0} is a match.", m.Value);
}
Console.WriteLine();
Console.WriteLine("String with occurrences replaced with an empty string:");
Console.WriteLine(r.Replace(s, string.Empty).Trim());

Can we extract dynamic data from a string using regex?

I want to validate and get the data for the following tags (9F03, 9F02, 9C) using regex:
9F02060000000060009F03070000000010009C0101
The string above is in tag-length-value format.
9F02, 9F03 and 9C are the tags. Each tag has a fixed length, but its position and value in the string can vary.
Right after the tag comes the length, in bytes, of the value that the tag stores.
For example:
9F02=tag
06=Length in bytes
000000006000= value
Thanks,
Ashutosh
Standard regex doesn't know how to count very well; it behaves like a state machine in that way.
What you can do, though, if the number of possibilities is small, is represent each possible length as an alternative in the regex and use a separate regex query for each tag:
/9F02(01..|02....|03......)/
/9C(01..|02....)/
... And so on.
Example here.
http://rubular.com/r/euHRxeTLqH
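The one-alternative-per-length idea can also be generated programmatically. The sketch below is only an illustration: the names TagLengthRegex, patternFor and firstValue are invented here, and it assumes the length is a single byte written as two hex digits, with two hex digits per value byte. Note that find() can also hit tag digits that happen to occur inside another tag's value, so this is not a full TLV parser.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TagLengthRegex {
    // Builds one alternative per allowed length, e.g. for tag 9C and maxBytes = 2:
    // 9C(01[0-9A-F]{2}|02[0-9A-F]{4})
    static Pattern patternFor(String tag, int maxBytes) {
        StringBuilder alternatives = new StringBuilder();
        for (int n = 1; n <= maxBytes; n++) {
            if (n > 1) {
                alternatives.append('|');
            }
            // Two hex digits for the length, then two hex digits per value byte.
            alternatives.append(String.format("%02X[0-9A-F]{%d}", n, 2 * n));
        }
        return Pattern.compile(tag + "(" + alternatives + ")");
    }

    // Returns the value of the first match for the tag, or null if there is none.
    static String firstValue(String tag, int maxBytes, String data) {
        Matcher m = patternFor(tag, maxBytes).matcher(data);
        // group(1) is length + value; drop the two length digits.
        return m.find() ? m.group(1).substring(2) : null;
    }

    public static void main(String[] args) {
        String data = "9F02060000000060009F03070000000010009C0101";
        System.out.println("9F02 = " + firstValue("9F02", 8, data));
        System.out.println("9C = " + firstValue("9C", 2, data));
    }
}
```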
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegEx {
    public static void main(String[] args) {
        String s = "9F02060000000060009F03070000000010009C0101";
        String regEx = "(9F02|9F03|9C)";
        Pattern p = Pattern.compile(regEx);
        Matcher m = p.matcher(s);
        while (m.find()) {
            System.out.println("Tag : " + m.group());
            // The two characters right after the tag hold the length
            String length = s.substring(m.end(), m.end() + 2);
            System.out.println("Length : " + length);
            // The length is used here as a character count, not a byte count
            int valueEndIndex = m.end() + 3 + Integer.parseInt(length);
            String value = s.substring(m.end() + 3, valueEndIndex);
            System.out.println("Value : " + value);
        }
    }
}
This code will give you the following output:
Tag : 9F02
Length : 06
Value : 000000
Tag : 9F03
Length : 07
Value : 0000000
Tag : 9C
Length : 01
Value : 1
I am not sure about the byte length you mention, but I guess this code will help you get started!
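If the length really is a byte count, as the question says, one alternative is to walk the string from left to right instead of searching for tags. The sketch below is not taken from any answer in this thread. It assumes the length is one byte written as two hex digits and that every value byte is two hex digits; TlvParser and its error messages are hypothetical. The sample string is an adjusted version of the question's: the question's own string appears to declare length 07 for a 6-byte 9F03 value, which this strict parser would reject.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class TlvParser {
    // Tags named in the question; anything else is rejected.
    private static final String[] KNOWN_TAGS = {"9F02", "9F03", "9C"};

    static Map<String, String> parse(String data) {
        Map<String, String> values = new LinkedHashMap<>();
        int pos = 0;
        while (pos < data.length()) {
            // Find which known tag starts at the current position.
            String tag = null;
            for (String candidate : KNOWN_TAGS) {
                if (data.startsWith(candidate, pos)) {
                    tag = candidate;
                    break;
                }
            }
            if (tag == null) {
                throw new IllegalArgumentException("Unknown tag at offset " + pos);
            }
            pos += tag.length();
            if (pos + 2 > data.length()) {
                throw new IllegalArgumentException("Missing length for tag " + tag);
            }
            // Assumption: the length is one byte written as two hex digits.
            int byteLength = Integer.parseInt(data.substring(pos, pos + 2), 16);
            pos += 2;
            // Assumption: each value byte takes two hex digits.
            int end = pos + 2 * byteLength;
            if (end > data.length()) {
                throw new IllegalArgumentException("Truncated value for tag " + tag);
            }
            values.put(tag, data.substring(pos, end));
            pos = end;
        }
        return values;
    }

    public static void main(String[] args) {
        // Adjusted sample: 9F03 declares 06 so that it matches its 6-byte value.
        String data = "9F0206000000006000" + "9F0306000000001000" + "9C0101";
        parse(data).forEach((tag, value) -> System.out.println(tag + " = " + value));
    }
}
```

Because the parser jumps over each value using its declared length, tag-like digits inside a value are never mistaken for a new tag.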

Use case for String split

I want a code snippet for splitting the string below:
Input : select * from table where a=? and b=? and c=?
Output:
Str1: select * from table where a=?
Str2: and b=?
Str3: and c=?
I do not want to use indexOf for now. Can StringUtils or regex help me here? I looked at StringUtils but did not find anything suitable. Your input is appreciated.
This should suffice:
inputStr.split("(?=\\band\\b)")
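A minimal runnable sketch of the lookahead split, assuming Java (the class and method names are made up for illustration). In a Java string literal the backslashes have to be doubled, otherwise \b is read as a backspace character. The lookahead is zero-width, so the space before each "and" stays on the preceding part and is trimmed afterwards.

```java
class SplitBeforeAnd {
    static String[] splitBeforeAnd(String sql) {
        // Zero-width lookahead: split just before each standalone word "and".
        String[] parts = sql.split("(?=\\band\\b)");
        for (int i = 0; i < parts.length; i++) {
            parts[i] = parts[i].trim(); // drop the space left before "and"
        }
        return parts;
    }

    public static void main(String[] args) {
        String[] parts = splitBeforeAnd("select * from table where a=? and b=? and c=?");
        for (int i = 0; i < parts.length; i++) {
            System.out.println("Str" + (i + 1) + ": " + parts[i]);
        }
    }
}
```

The word boundaries keep words that merely contain "and", such as "band", from triggering a split.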
If you are trying it in PHP, then try:
$myvar = explode('and','select * from table where a=? and b=? and c=? ');
$str1 = $myvar[0];
$str2 = $myvar[1];
$str3 = $myvar[2];
I got it; we can use:
String str = "select * from table where a=? and b=? and c=?";
String[] tokens = str.split("\\?");
for (String string : tokens) {
    System.out.print("tokens: " + string);
}