Hi,
I'm writing a program that should first remove all stop words ("a" and "the") and then remove all tabs ('\t') from a text file. I'm having trouble performing one action after the other. When I run both steps with the following code...
import java.util.*;
import java.io.*;
public class RemoveAndUntabify
{
    public static void main(String args[]) throws IOException
    {
        // Expect exactly three arguments: inFile, outFile, stopWordFile
        if (args.length == 3)
        {
            String input = args[0];
            String output = args[1];
            String wordFile = args[2];
            String[] tabs = {input, output};
            Untabify.main(tabs);
            String[] words = {input, output, wordFile};
            RemoveStopWords.main(words);
        }
        else
        {
            System.out.println("You must supply ''inFile, outFile'' ''stopWordFile''");
        }
    }
}
... it appears as if ONLY the "RemoveStopWords" class has been run: the output file looks as though the untabify step never happened, and only the "a"s and "the"s are gone. After checking, I believe both are actually being run, so I think something is wrong with the classes being called and how they're accessing/closing the files...
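Looking at the driver above, both Untabify.main and RemoveStopWords.main are handed the same input path, and new FileWriter(args[1]) truncates an existing file, so the second step re-reads the original text and overwrites whatever the first step wrote. A minimal sketch of chaining the two steps through an intermediate file follows. The class ChainSketch, its method names, and the simplified stand-in logic for each step are hypothetical and only illustrate the file flow; they are not the TextProcessing methods from this post.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: all names here are illustrative assumptions.
class ChainSketch {

    // Stand-in for the untabify step: replaces each tab with one space.
    static void untabify(Path in, Path out) throws IOException {
        List<String> result = new ArrayList<>();
        for (String line : Files.readAllLines(in, StandardCharsets.UTF_8)) {
            result.add(line.replace("\t", " "));
        }
        Files.write(out, result, StandardCharsets.UTF_8);
    }

    // Stand-in for the stop-word step: drops space-separated words found in stopWords.
    static void removeStopWords(Path in, Path out, Set<String> stopWords) throws IOException {
        List<String> result = new ArrayList<>();
        for (String line : Files.readAllLines(in, StandardCharsets.UTF_8)) {
            StringBuilder sb = new StringBuilder();
            for (String word : line.split(" ")) {
                if (!stopWords.contains(word)) {
                    if (sb.length() > 0) {
                        sb.append(' ');
                    }
                    sb.append(word);
                }
            }
            result.add(sb.toString());
        }
        Files.write(out, result, StandardCharsets.UTF_8);
    }

    // The second step reads what the first step wrote: input -> temp -> output.
    static void process(Path input, Path output, Set<String> stopWords) throws IOException {
        Path temp = Files.createTempFile("untabified", ".txt");
        try {
            untabify(input, temp);
            removeStopWords(temp, output, stopWords);
        } finally {
            Files.deleteIfExists(temp);
        }
    }
}
```

The same idea would apply to the driver in this post by passing {input, temp} to the first step and {temp, output, wordFile} to the second, where temp is some intermediate file name.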
RemoveStopWords...
import java.util.*;
import java.io.*;
public class RemoveStopWords
{
    public static void main(String args[]) throws IOException
    {
        // Expect exactly three arguments: inFile, outFile, stopWordFile
        if (args.length == 3)
        {
            FileReader reader = new FileReader(args[0]);
            // wrap reader into a BufferedReader so we can use readLine()
            BufferedReader in = new BufferedReader(reader);
            FileWriter writer = new FileWriter(args[1]);
            // wrap writer into a PrintWriter so we can use println()
            PrintWriter out = new PrintWriter(writer);
            FileReader stopWordsReader = new FileReader(args[2]);
            // wrap reader into a BufferedReader so we can use readLine()
            BufferedReader stopWords = new BufferedReader(stopWordsReader);
            // read one stop word per line
            Collection col = new ArrayList();
            String input;
            while ((input = stopWords.readLine()) != null)
                col.add(input);
            TextProcessing.removeStopWords(out, in, col);
            in.close();
            out.close();
            stopWords.close();
            //Test
            writer.close();
            reader.close();
            stopWordsReader.close();
        }
        else
        {
            System.out.println("You must supply ''inFile, outFile'' ''stopWordFile''");
        }
    }
}
Untabify...
import java.util.*;
import java.io.*;
public class Untabify
{
    public static void main(String args[]) throws IOException
    {
        // Expect exactly two arguments: inFile, outFile
        if (args.length == 2)
        {
            FileReader reader = new FileReader(args[0]);
            // wrap reader into a BufferedReader so we can use readLine()
            BufferedReader in = new BufferedReader(reader);
            FileWriter writer = new FileWriter(args[1]);
            // wrap writer into a PrintWriter so we can use println()
            PrintWriter out = new PrintWriter(writer);
            TextProcessing.untabify(out, in);
            in.close();
            out.close();
            //Test
            writer.close();
            reader.close();
        }
        else
        {
            System.out.println("You must supply ''inFile, outFile''");
        }
    }
}
and finally...
import java.util.*;
import java.io.*;
public class TextProcessing {
    static final String delims = " .,;:!?-`'\"()";
    // indexOf returns 0 for the first delimiter (the space), so test with >= 0
    static boolean isDelim(String str) { return (delims.indexOf(str) >= 0); }
    public static String removeStopWords(String str, Collection stopWords)
    {
        StringTokenizer tokenizer = new StringTokenizer(str, delims, true);
        StringBuffer sb = new StringBuffer();
        while (tokenizer.hasMoreTokens())
        {
            String token = tokenizer.nextToken();
            // keep delimiters and every token that is not a stop word
            if (isDelim(token) || (!stopWords.contains(token)))
                sb.append(token);
        }
        return (sb.toString());
    }
    // Task 1 - 2
    public static void removeStopWords(PrintWriter out, BufferedReader in, Collection stopWords) throws IOException
    {
        String input;
        while ((input = in.readLine()) != null)
        {
            StringTokenizer tokenizer = new StringTokenizer(input, delims, true);
            StringBuffer sb = new StringBuffer();
            while (tokenizer.hasMoreTokens())
            {
                String token = tokenizer.nextToken();
                if (isDelim(token) || (!stopWords.contains(token)))
                    sb.append(token);
            }
            out.println(sb.toString());
        }
    }
    // Task 1 - 3
    //Doesn't work quite right - it swaps 'cloud' for eight blank spaces instead of \t char for now
    public static void untabify(PrintWriter out, BufferedReader in) throws IOException
    {
        String input;
        Collection stopWords = new ArrayList();
        stopWords.add("cloud");
        while ((input = in.readLine()) != null)
        {
            StringTokenizer tokenizer = new StringTokenizer(input, delims, true);
            StringBuffer sb = new StringBuffer();
            while (tokenizer.hasMoreTokens())
            {
                String token = tokenizer.nextToken();
                if (isDelim(token) || (!stopWords.contains(token)))
                    sb.append(token);
                else
                    sb.append(" ");
            }
            out.println(sb.toString());
        }
    }
}
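On the untabify comment in the listing above: '\t' is not one of the characters in delims, so StringTokenizer never returns a tab as a separate token and the stopWords check cannot match it. A minimal sketch of one alternative, replacing tabs directly on each line, might look like the following. TabSketch, untabifyLine, and the width parameter are hypothetical names, and the sketch assumes a fixed number of spaces per tab rather than alignment to tab stops.

```java
// Hypothetical sketch: names and the fixed-width behaviour are assumptions.
class TabSketch {

    // Replaces every tab in the line with `width` spaces (no tab-stop alignment).
    static String untabifyLine(String line, int width) {
        StringBuilder spaces = new StringBuilder();
        for (int i = 0; i < width; i++) {
            spaces.append(' ');
        }
        return line.replace("\t", spaces.toString());
    }
}
```

Inside a readLine() loop like the one in untabify, each line could be passed through untabifyLine before being printed.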
Sorry it's a lot of code there guys, but I think it's really close to working right; it would really make my day if anyone's got any ideas on how to fix it.
Cheers,
Peter