
Help. Recognising white space in a string

 
Frank Mills
Greenhorn
Posts: 4
I have a program that takes a string of words from a file and basically copies it into another file.
However, I want to break the string up into individual words and write them to database.txt, one word per line, each followed by the name of the file it came from. For example,
if I have a txt file called webpage.txt that contains the following text:
Frank likes music
--------------------
The program should produce a file called master.txt (or, if possible, add to an existing master.txt file) containing:
Frank webpage.txt
likes webpage.txt
music webpage.txt
If this process were repeated with another source file, its words would be added to the master.txt file.
The code I have so far is:
import java.io.*;

public class Program_one
{
    public static void main (String[] args) throws IOException
    {
        final int MAX = 10;
        int value;
        String Rfile = "webpage.txt";  // name of input file
        String Ofile = "database.txt"; // name of output file
        String inputLine;              // one line of input

        FileReader fr = new FileReader (Rfile);
        BufferedReader inFile = new BufferedReader (fr);

        inputLine = inFile.readLine(); // reads only the first line of the input file

        FileWriter fw = new FileWriter (Ofile); // creates (or overwrites) the output file
        BufferedWriter bw = new BufferedWriter (fw);
        PrintWriter outFile = new PrintWriter (bw);

        outFile.print (inputLine + " " + Rfile); // writes the whole line plus the file name
        outFile.println (); // new line

        outFile.close();

        System.out.println ("Output file has been created: " + Ofile);
    }
}
Any help or advice would be much appreciated.
Many thanks,
Frank
 
Michael Morris
Ranch Hand
Posts: 3451
Hi Frank,
Welcome to JavaRanch. Sounds like a job for a StringTokenizer to me. I took the liberty of rearranging your code and adding a StringTokenizer to break each line down into words:
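A minimal sketch of what such a rearrangement could look like. This is not the listing originally posted: the class name WordIndexer and the helper splitWords are made up for illustration, passing true to FileWriter is just one way to get the "add to the file" behaviour, and the syntax (generics, try-with-resources) is newer than the 1.4 SDK mentioned later in the thread.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

public class WordIndexer {

    // Hypothetical helper: splits one line into words on white space.
    static List<String> splitWords(String line) {
        List<String> words = new ArrayList<>();
        // With no delimiter argument, StringTokenizer splits on white space.
        StringTokenizer st = new StringTokenizer(line);
        while (st.hasMoreTokens()) {
            words.add(st.nextToken());
        }
        return words;
    }

    public static void main(String[] args) throws IOException {
        String inName = "webpage.txt";   // input file, as in the question
        String outName = "database.txt"; // output file, as in the question

        // The second FileWriter argument (true) opens the file in append mode,
        // so a second run adds words instead of overwriting the file.
        try (BufferedReader in = new BufferedReader(new FileReader(inName));
             PrintWriter out = new PrintWriter(
                     new BufferedWriter(new FileWriter(outName, true)))) {
            String line;
            while ((line = in.readLine()) != null) { // every line, not just the first
                for (String word : splitWords(line)) {
                    out.println(word + " " + inName);
                }
            }
        }
        System.out.println("Words appended to: " + outName);
    }
}
```

Reading in a loop until readLine() returns null is what picks up every line of the input rather than only the first one.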

See if that does what you want.
 
Frank Mills
Greenhorn
Posts: 4
That's fantastic, thanks a million. With this StringTokenizer, is it possible to use some sort of wildcard? If I wanted to remove HTML tags from the text, could I use some sort of "if" statement, i.e.
if (word starts with "<" and ends with ">")
{
ignore;
}
else
println (word + file name);
 
Layne Lund
Ranch Hand
Posts: 3061
Well, that doesn't have much to do with StringTokenizer. You can simply use the appropriate methods of the String class to check the first and last char of a "word." Of course, if you are using white space to separate words, there are many cases where the ">" won't appear until several words after the "<" because of other parameters within the HTML tag.
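For illustration, a minimal sketch of that check using String.startsWith and String.endsWith. The class name TagCheck and the method looksLikeTag are made up for this example, and the sample words are invented to show the limitation just described.

```java
public class TagCheck {

    // Hypothetical helper: true when a token looks like a complete tag, e.g. "<h1>".
    static boolean looksLikeTag(String word) {
        return word.length() >= 2 && word.startsWith("<") && word.endsWith(">");
    }

    public static void main(String[] args) {
        // The last two entries are one tag that white space has split in two.
        String[] words = {"<h1>", "Frank", "</h1>", "<a", "href=\"x.html\">"};
        for (String word : words) {
            if (!looksLikeTag(word)) {
                System.out.println(word);
            }
        }
    }
}
```

Only the simple tags are filtered out here; the two halves of the split tag are still printed, which is exactly the case where this check falls short.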
If you look at the API docs, you will see that you can specify any delimiters you wish for StringTokenizer. You can use this feature to tokenize a String with "<" and ">" as delimiters. The tokens that fall outside these brackets can then be tokenized by a second StringTokenizer delimited by white space.
This is only one of several different approaches which I can think of. Again, you should check out the API docs for more information about how to use StringTokenizer.
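A rough sketch of the two-stage idea, under a few assumptions: the class and method names are invented, and the returnDelims flag (the third constructor argument) is my addition so the code can tell which chunks sit between "<" and ">".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

public class TagSkippingTokenizer {

    // Hypothetical helper: returns the words of a line that are not inside <...>.
    static List<String> wordsOutsideTags(String line) {
        List<String> words = new ArrayList<>();
        // returnDelims = true makes "<" and ">" come back as tokens of their own.
        StringTokenizer outer = new StringTokenizer(line, "<>", true);
        boolean insideTag = false;
        while (outer.hasMoreTokens()) {
            String chunk = outer.nextToken();
            if (chunk.equals("<")) {
                insideTag = true;
            } else if (chunk.equals(">")) {
                insideTag = false;
            } else if (!insideTag) {
                // Second stage: split the text outside the brackets on white space.
                StringTokenizer inner = new StringTokenizer(chunk);
                while (inner.hasMoreTokens()) {
                    words.add(inner.nextToken());
                }
            }
        }
        return words;
    }

    public static void main(String[] args) {
        String line = "<h1>Frank likes</h1> <a href=\"x.html\">music</a>";
        for (String word : wordsOutsideTags(line)) {
            System.out.println(word + " webpage.txt");
        }
    }
}
```

As written, insideTag is reset for every line, so a tag that opens on one line and closes on a later one would need that flag kept outside the method.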
 
Frank Mills
Greenhorn
Posts: 4
I'm pretty much a novice at Java. Could you show me a link to the API documentation (I've never seen it before)? Or, if you could suggest an easier way of dealing with the problem, I would greatly appreciate it.
As you've pointed out, I don't think I really have a problem with simple tags such as < h1 > or any of the standard tags. My biggest problem is with tags that contain white space, or non-standard tags, where the closing > may be halfway down the page. I need to find a way of using some kind of wildcard, if that makes sense.
 
Michael Morris
Ranch Hand
Posts: 3451
Probably a better approach for removing HTML tags would be regular expressions, provided you are using the 1.4 SDK. Take a look at this:
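A minimal sketch of the regular-expression approach, applied to one line at a time. It is not the listing originally posted: the pattern "<[^>]*>" and the names TagStripper and stripTags are assumptions chosen for illustration.

```java
import java.util.regex.Pattern;

public class TagStripper {

    // "<", then any run of characters that are not ">", then ">".
    private static final Pattern TAG = Pattern.compile("<[^>]*>");

    // Hypothetical helper: replaces every tag in the line with a single space.
    static String stripTags(String line) {
        return TAG.matcher(line).replaceAll(" ");
    }

    public static void main(String[] args) {
        String line = "<h1>Frank likes</h1> <a href=\"x.html\">music</a>";
        // Split what is left on runs of white space and print one word per line.
        for (String word : stripTags(line).trim().split("\\s+")) {
            System.out.println(word + " webpage.txt");
        }
    }
}
```

Replacing each tag with a space rather than an empty string keeps words on either side of a tag from running together.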

One problem with this is that it doesn't span multiple lines, so you will have to make adjustments for that.
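One possible adjustment, sketched under the assumption that the file is small enough to hold in memory: read the whole input into a single String first, then strip the tags from that. The names MultiLineTagStripper, readAll and stripTags are invented for this example.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

public class MultiLineTagStripper {

    // Hypothetical helper: reads everything from the reader into one String.
    static String readAll(Reader source) throws IOException {
        StringBuilder text = new StringBuilder();
        try (BufferedReader in = new BufferedReader(source)) {
            String line;
            while ((line = in.readLine()) != null) {
                text.append(line).append('\n'); // keep the line breaks as white space
            }
        }
        return text.toString();
    }

    // "[^>]" also matches line breaks, so a tag split across lines is still removed.
    static String stripTags(String text) {
        return text.replaceAll("<[^>]*>", " ");
    }

    public static void main(String[] args) throws IOException {
        // A StringReader stands in for a FileReader so the example runs on its own.
        String html = "<a\n  href=\"x.html\">music</a>\n";
        System.out.println(stripTags(readAll(new StringReader(html))).trim());
    }
}
```

For a real file, new FileReader("webpage.txt") would take the place of the StringReader.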
 
Layne Lund
Ranch Hand
Posts: 3061
I thought of regular expressions, but I didn't suggest them because I thought they might be too complex for Frank. I'm definitely no novice, and I still have to think really hard to understand them well. Of course, maybe that's just me.
As for the API docs, you can find them here for Java 2 SDK 1.4.1. If you are using an older version, go to the Official Java Website to find the appropriate documentation.
HTH
Layne
 