
Java reading input from a file and splitting it into words.

 
Saad Mushtaq
Greenhorn
Posts: 21
I am working through a past sample final exam in which the question asks us to read input from a file and process it into words. The end of a sentence is marked by any word that ends with one of the three characters . ? !
I was able to write code for this, but I can only split the input into sentences, using the Scanner class and its useDelimiter method. Any help would be appreciated, as I am learning this on my own and this is what I came up with. My code is here.


What I am doing in this code is splitting the input into sentences using the useDelimiter method and then counting the words and letters of the entire file.
If I want to split this into words, how can I do that without using the Scanner class?

Some of the input from the file that i have to process is here:
Text that follows is based on the
Wikipedia page on cryptography!
Cryptography is the practice and study of hiding information. In modern times,
cryptography is considered to be a branch of both mathematics and computer
science, and is affiliated closely with information theory, computer security, and
engineering. Cryptography is used in applications present in technologically
advanced societies; examples include the security of ATM cards, computer
passwords, and electronic commerce, which all depend on cryptography.....
 
Campbell Ritchie
Marshal
Posts: 56570
172
I think you are confusing yourself by giving the name "line" to what you read from the file with the Scanner. You are not reading a line at all, but a token. If you change the delimiter to [.?!] you will not get correct splitting because at least two of those characters are meta‑characters and that regular expression is incorrect. Read more about regular expressions in the Java™ Tutorials.
Don't use StringTokenizer. It is legacy code which ought no longer to be used in new code. Use String#split instead.
What is the isDigit test meant to do? That looks like something separate which ought to be in a separate method. There are also Character class methods which will identify letters, so you will not count commas as letters.
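
A minimal sketch of the letter-counting idea, assuming the text is already in a String. The class and method names here are hypothetical, and the helper is static only to keep the sketch short. It relies on Character.isLetter, so digits, spaces and punctuation such as commas are not counted.

```java
public class LetterCounter {

    // Counts only the characters that Character.isLetter accepts;
    // digits, whitespace and punctuation are skipped.
    static int countLetters(String text) {
        int count = 0;
        for (int i = 0; i < text.length(); i++) {
            if (Character.isLetter(text.charAt(i))) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Sample text taken from the input quoted in the first post.
        System.out.println(countLetters("ATM cards, computer passwords."));
    }
}
```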

You appear to be doing four things:-
  • Reading from the file.
  • Dividing the text into sentences.
  • Dividing it into words.
  • Counting letters.
Those four things should be done separately.
     
    Campbell Ritchie
    Marshal
    Posts: 56570
    172
    I tried your exercise using a Scanner object to read the text and appending it to a StringBuilder object with single spaces " " between successive tokens. Your regex seemed to work, so I must have been mistaken about meta‑characters inside []. I tried String#split to divide the text into sentences and a loop with a method of the Character class to count letters. Copying the whole of your first post into a text file gave a letter count of 1500. I forgot about counting words, so I shall try that later.

    Beware of the ellipsis
    ...
    That divides the text into three zero‑length Strings as if you had
    .""."".""
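
A small sketch of the ellipsis problem, with hypothetical class and method names and static helpers used only for brevity. With the plain [.?!] pattern, every dot of an ellipsis is a separate delimiter, so String#split returns empty strings for the gaps between adjacent dots. One possible workaround, offered as an assumption rather than as what was done above, is to treat a run of terminators as a single delimiter with [.?!]+ and discard blank pieces.

```java
import java.util.ArrayList;
import java.util.List;

public class SentenceSplitter {

    // Splits on runs of '.', '?' or '!' so that "..." counts as one delimiter,
    // then trims each piece and drops any that are empty.
    static List<String> splitSentences(String text) {
        List<String> sentences = new ArrayList<>();
        for (String piece : text.split("[.?!]+")) {
            String trimmed = piece.trim();
            if (!trimmed.isEmpty()) {
                sentences.add(trimmed);
            }
        }
        return sentences;
    }

    public static void main(String[] args) {
        String text = "Wait... what? Yes!";
        // Plain character class: the adjacent dots produce empty strings.
        for (String piece : text.split("[.?!]")) {
            System.out.println("[" + piece + "]");
        }
        // Runs of terminators treated as one delimiter.
        System.out.println(splitSentences(text));
    }
}
```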
     
    Campbell Ritchie
    Marshal
    Posts: 56570
    172
    Norm Radder wrote:Also posted at: . . .
    Well, he got a better answer here

    Remember to tell people on both websites about the crossposting.
     
    Saad Mushtaq
    Greenhorn
    Posts: 21
    Campbell Ritchie wrote:I tried your exercise using a Scanner object to read the text and appending it to a StringBuilder object with single spaces " " between successive tokens. Your regex seemed to work, so I must have been mistaken about meta‑characters inside []. I tried String#split to divide the text into sentences and a loop with a method of the Character class to count letters. Copying the whole of your first post into a text file gave a letter count of 1500. I forgot about counting words, so I shall try that later.

    Beware of the ellipsis
    ...
    That divides the text into three zero‑length Strings as if you had
    .""."".""

    The ellipsis only means that there is more text; there are no ellipses in the main file. Sorry about that. My program runs fine, but now I want to split the text into words. I just know that I should split it using string.split("") and then go through each word, checking whether its last character is one of ? . !
    If it is, we stop adding words to the sentence class. You said you were able to split my code into words, how did you do it? Can you please share it with me?
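
The approach described above might be sketched as follows. This is an illustration under assumptions, not the exam's model answer: words are taken to be separated by runs of whitespace (so the split uses "\\s+", not the empty string), a sentence is represented as a plain list of words, and all names are hypothetical. Static helpers are used only to keep the sketch short.

```java
import java.util.ArrayList;
import java.util.List;

public class SentenceBuilder {

    // A word ends a sentence if its last character is '.', '?' or '!'.
    static boolean endsSentence(String word) {
        if (word.isEmpty()) {
            return false;
        }
        char last = word.charAt(word.length() - 1);
        return last == '.' || last == '?' || last == '!';
    }

    // Splits the text into words on whitespace and starts a new sentence
    // after every word that ends with a terminator.
    static List<List<String>> groupIntoSentences(String text) {
        List<List<String>> sentences = new ArrayList<>();
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return sentences;
        }
        List<String> current = new ArrayList<>();
        for (String word : trimmed.split("\\s+")) {
            current.add(word);
            if (endsSentence(word)) {
                sentences.add(current);
                current = new ArrayList<>();
            }
        }
        // Keep any trailing words that were not closed by a terminator.
        if (!current.isEmpty()) {
            sentences.add(current);
        }
        return sentences;
    }

    public static void main(String[] args) {
        System.out.println(groupIntoSentences("Hi there! How are you? Fine"));
    }
}
```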
     
    Saad Mushtaq
    Greenhorn
    Posts: 21

    Sorry about that.
    Guys, I have also posted this on the other site. The link is above.
     
    Knute Snortum
    Sheriff
    Posts: 4281
    127
    You said you were able to split my code into words, how did you do it? Can you please share it with me?

    Well, what character or characters are between the words?
     
    Saad Mushtaq
    Greenhorn
    Posts: 21
    Well, what character or characters are between the words?

    White space? If it is then how did you do that? At least give me the pseudocode. I have been trying with whitespace for a long time now.
     
    Campbell Ritchie
    Marshal
    Posts: 56570
    172
    Saad Mushtaq wrote:. . . i should split them using string.split("") . . .
    No, you must not use text.split("") because that will split on the empty String and the following sentence will come out as in the quote:-
    This is a sentence.
    T
    h
    i
    s

    a

    s
    e
    n
    t
    e
    n
    c
    e
    .
    You must split the String with a non‑empty String, a regular expression. But I think you are doing things in the wrong order. I think you should start by reading the file. Have a readFile method which creates a String representing the entire contents of the file and assigns that to a field. Make sure the keyword static appears nowhere other than in public static void main(String[] args).
    I suggest you use the hasNextLine and nextLine methods of Scanner for reading, and use the append and toString methods of StringBuilder to create the String. As an alternative to StringBuilder, you might try a StringJoiner (Java 8 and later).
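
A rough sketch of that suggestion, under a few assumptions: the class name TextFile, the field name contents and the helper read(Scanner) are hypothetical, lines are joined with a single space, and the file name is taken from the first command-line argument. Only main is static.

```java
import java.io.File;
import java.io.FileNotFoundException;
import java.util.Scanner;

public class TextFile {

    // Holds the whole text once it has been read.
    private String contents = "";

    // Reads every remaining line from the Scanner and joins the lines
    // with single spaces, so words on adjacent lines stay separated.
    void read(Scanner in) {
        StringBuilder builder = new StringBuilder();
        while (in.hasNextLine()) {
            if (builder.length() > 0) {
                builder.append(' ');
            }
            builder.append(in.nextLine());
        }
        contents = builder.toString();
    }

    // Opens the named file, reads it, and closes the Scanner afterwards.
    void readFile(String fileName) throws FileNotFoundException {
        try (Scanner in = new Scanner(new File(fileName))) {
            read(in);
        }
    }

    String getContents() {
        return contents;
    }

    public static void main(String[] args) throws FileNotFoundException {
        TextFile file = new TextFile();
        if (args.length > 0) {
            file.readFile(args[0]);
        } else {
            // No file name given: fall back to a small in-memory sample.
            file.read(new Scanner("Text that follows is based on the\nWikipedia page on cryptography!"));
        }
        System.out.println(file.getContents());
    }
}
```

Passing a Scanner into read keeps the joining logic independent of where the text comes from, which also makes it easy to try out with a short String before pointing it at the real file.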
    There are some things I have missed out, but I think it will not take you at all long to implement that. Once we have that lot working well, we can consider how to get the chars out of the String and count them, how to count words, and how to split it into sentences. None of those things is difficult.
    You should not have any memory problems unless the file contains many millions of letters.
     
    Knute Snortum
    Sheriff
    Posts: 4281
    127
    Saad Mushtaq wrote:
    Well, what character or characters are between the words?

    White space? If it is then how did you do that? At least give me the pseudocode. I have been trying with whitespace for a long time now.

    You are splitting the stream into sentences already, right?  Then it's just as easy as:
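
A minimal sketch of that step, assuming each sentence is already held in a String and that words are separated by runs of whitespace. The class and method names are hypothetical, and the helper is static only to keep the sketch short.

```java
public class WordSplitter {

    // Splits one sentence into words on runs of whitespace.
    // Trimming first avoids an empty leading element when the
    // sentence starts with a space.
    static String[] splitWords(String sentence) {
        String trimmed = sentence.trim();
        if (trimmed.isEmpty()) {
            return new String[0];
        }
        return trimmed.split("\\s+");
    }

    public static void main(String[] args) {
        String sentence = " Cryptography is the practice and study of hiding information";
        String[] words = splitWords(sentence);
        System.out.println(words.length + " words");
        for (String word : words) {
            System.out.println(word);
        }
    }
}
```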
     