Text file contains elements that return null

 
Nico Fish
Greenhorn
Posts: 20



My text file contains elements that are returning null, so it breaks out of the do-while loop, adds one to i, and omits an entire portion of the text. How can I get past this?

Thank you so much. You have all been very helpful. I have been reading and coding for hours a day to get past my deficiencies and finish this project.


I cannot attach the files for some reason; they are a .txt file and a FASTA file.

Here is a link to the file (I annotated it to show where the breaks and omissions were occurring):
https://docs.google.com/document/d/1Bg5R0Vb4smFi1fVOA5EvnsPODJU8sRjYnKo-E1BxFus/edit?usp=sharing

original file here https://docs.google.com/document/d/10_rgWtgiZ-HoFI79BpvSkYHDzOM_JFS1bgaDeb3MW4o/edit?usp=sharing


 
Carey Brown
Saloon Keeper
Posts: 3312
Try


fileStringArr should be fileStringList
 
Nico Fish
Greenhorn
Posts: 20
Thank you for the reply. I had tried this before, and basically it just returns the opposite of the prior configuration (the one with the do-while).


It will just print the nucleotide sequences and omit the other information.

I think this is a characteristic of the file type:

Description line
The description line (defline) or header line, which begins with '>', gives a name and/or a unique identifier for the sequence, and may also contain additional information. In a deprecated practice, the header line sometimes contained more than one header, separated by a ^A (Control-A) character.
In the original Pearson FASTA format, one or more comments, distinguished by a semi-colon at the beginning of the line, may occur after the header. Some databases and bioinformatics applications do not recognize these comments and follow the NCBI FASTA specification. An example of a multiple sequence FASTA file follows:
>SEQUENCE_1
MTEITAAMVKELRESTGAGMMDCKNALSETNGDFDKAVQLLREKGLGKAAKKADRLAAEG
LVSVKVSDDFTIAAMRPSYLSYEDLDMTFVENEYKALVAELEKENEERRRLKDPNKPEHK
IPQFASRKQLSDAILKEAEEKIKEELKAQGKPEKIWDNIIPGKMNSFIADNSQLDSKLTL
MGQFYVMDDKKTVEQVIAEKEKEFGGKIKIVEFICFEVGEGLEKKTEDFAAEVAAQL
>SEQUENCE_2
SATVSEINSETDFVAKNDQFIALTKDTTAHIQSNSLQSVEELHSSTINGVKFEEYLKSQI
ATIGENLVVRRFATLKAGANGVVNGYIHTNGRVGVVIAAACDSAEVASKSRDLLRQICMH



http://en.wikipedia.org/wiki/FASTA_format


This would be great if I had one species per file, but it seems whoever put together the database haphazardly lumped a bunch of things into each file.


Hmm, I think when I get this figured out I am going to have to store it as a List<HashMap<>>, with the list index corresponding to the file path, the key being the species name, and the mapped value being the sequence. O lordey lord! I am going crazy (on the upside, I learned a lot over these past few days).
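
A minimal sketch of that header-to-sequence idea for a single file, assuming the whole text after ">" is used as the key (pulling a species name out of it depends on how the database formats its headers). The class and method names are made up for illustration, and a repeated header would overwrite the earlier entry.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: groups FASTA lines into header -> concatenated sequence.
class FastaMapSketch {

    static Map<String, String> toMap(List<String> lines) {
        Map<String, StringBuilder> building = new LinkedHashMap<>();
        StringBuilder current = null;
        for (String line : lines) {
            if (line.isEmpty() || line.startsWith(";")) {
                continue; // skip blank lines and Pearson-style comment lines
            }
            if (line.startsWith(">")) {
                // A header starts a new record; drop the leading '>'.
                current = new StringBuilder();
                building.put(line.substring(1).trim(), current);
            } else if (current != null) {
                current.append(line.trim()); // sequence data may span several lines
            }
        }
        Map<String, String> result = new LinkedHashMap<>();
        building.forEach((header, sequence) -> result.put(header, sequence.toString()));
        return result;
    }

    public static void main(String[] args) {
        List<String> lines = List.of(
                ">SEQUENCE_1", "MTEITAAMVKEL", "RESTGAGMMDCK",
                ">SEQUENCE_2", "SATVSEINSETD");
        System.out.println(toMap(lines));
    }
}
```

One such map per file could then go into a List, or into an outer Map keyed by file path.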


Actually the extra string is unnecessary; I can just do


and get the same result

The problem is that the ">" character in the FASTA files still seems to produce a null, which is causing my iterator to increment.
 
Carey Brown
Saloon Keeper
Posts: 3312
readLine() only works for text; if your file includes any non-text information, you'll have to use read().
 
Campbell Ritchie
Marshal
Posts: 56536
I would suggest that a Scanner is easier to use for a text file. And use try‑with‑resources to close the Scanner. I don't like read() which only reads one character at a time as an int.
I do not know why you are getting lines which are null, but you can escape that with a Scanner. There is an example for a text file in the Scanner documentation.
If you look in the FASTA link you provided, you see that the actual sequences (those appear to be single‑letter codes for amino acids) start with a letter and the header lines with the description start with > or ;.
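
A rough sketch of the Scanner approach with try-with-resources, not the listing from the Scanner documentation. It reads from an in-memory StringReader so that it runs without a file; the class name, method name, and sample text are assumptions made for the example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

class ScannerLinesSketch {

    // Reads every line exactly once; hasNextLine() guards each nextLine() call.
    static List<String> readLines(Readable source) {
        List<String> lines = new ArrayList<>();
        try (Scanner scanner = new Scanner(source)) { // closed automatically
            while (scanner.hasNextLine()) {
                lines.add(scanner.nextLine());
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        // In-memory stand-in for a file. For a real file, new Scanner(path) with a
        // java.nio.file.Path could be used instead (that constructor throws IOException).
        String text = ">SEQUENCE_1\nMTEITAAM\n\n>SEQUENCE_2\nSATVSEIN\n";
        for (String line : readLines(new StringReader(text))) {
            boolean header = line.startsWith(">") || line.startsWith(";");
            System.out.println((header ? "header: " : "sequence: ") + line);
        }
    }
}
```

Because the loop only calls nextLine() after hasNextLine() has returned true, there is no null to test for.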

I would suggest that you should be creating a ProteinSequence class (and a NucleicAcidSequence class too; I think I can see an inheritance tree here) rather than simply putting all your Strings into a List. I would suggest your Sequence classes have Nucleotide[] or AminoAcid[] (or List) fields rather than storing all the info as Strings.

The 1000 in the StringBuilder constructor will give you slightly faster execution if your proteins contain up to 1000 amino acids, but that bit is not important and it will run without that enhancement.
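
On the Sequence class suggestion, one possible shape is sketched below. Everything in it is hypothetical: the BioSequence base class, the reduced five-letter Nucleotide alphabet, and the constructor that parses a String are assumptions for illustration, and a ProteinSequence with an AminoAcid enum would follow the same pattern.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class SequenceClassesSketch {
    public static void main(String[] args) {
        NucleicAcidSequence seq = new NucleicAcidSequence("SEQUENCE_1", "ACGTN");
        System.out.println(seq.description() + " has " + seq.length() + " bases");
    }
}

// Assumed, simplified alphabet; real FASTA files allow more codes.
enum Nucleotide { A, C, G, T, N }

// Common parent: every sequence has a description line and a length.
abstract class BioSequence {
    private final String description;

    BioSequence(String description) {
        this.description = description;
    }

    String description() {
        return description;
    }

    abstract int length();
}

final class NucleicAcidSequence extends BioSequence {
    private final List<Nucleotide> bases;

    NucleicAcidSequence(String description, String letters) {
        super(description);
        List<Nucleotide> parsed = new ArrayList<>(letters.length());
        for (char c : letters.toUpperCase().toCharArray()) {
            // valueOf throws IllegalArgumentException for letters outside the enum.
            parsed.add(Nucleotide.valueOf(String.valueOf(c)));
        }
        this.bases = Collections.unmodifiableList(parsed);
    }

    List<Nucleotide> bases() {
        return bases;
    }

    @Override
    int length() {
        return bases.size();
    }
}
```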

There are various other ways to iterate the file. What you are doing is reading the next line, which is done with the = operator in code line No 10, and testing whether it starts with > or ;. The = is surrounded by round brackets () so as to increase its precedence, so the reading from the file is done before there is any attempt to test for the first > or ;
This is one of the few places where it is respectable to have = in the middle of an expression.
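
A sketch of that read-and-test idiom, written as an illustration and not as a copy of the listing discussed above, so its line numbers will not match. It assumes a null test in the loop condition and treats any run of lines starting with > or ; as one record boundary; the class and method names are invented.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

class AssignInConditionSketch {

    // Returns the concatenated sequence text of each record, in input order.
    static List<String> readSequences(Reader source) {
        List<String> sequences = new ArrayList<>();
        StringBuilder current = new StringBuilder(1000); // initial capacity only
        String line;
        try (BufferedReader reader = new BufferedReader(source)) {
            // The inner parentheses make the assignment happen before the null test,
            // so readLine() is called exactly once per iteration.
            while ((line = reader.readLine()) != null) {
                if (line.startsWith(">") || line.startsWith(";")) {
                    if (current.length() > 0) { // a header closes the previous record
                        sequences.add(current.toString());
                        current.setLength(0);
                    }
                } else {
                    current.append(line.trim());
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        if (current.length() > 0) {
            sequences.add(current.toString()); // last record has no header after it
        }
        return sequences;
    }

    public static void main(String[] args) {
        String text = ">SEQUENCE_1\n;a comment\nMTEIT\nAAMVK\n>SEQUENCE_2\nSATVS\n";
        System.out.println(readSequences(new StringReader(text)));
    }
}
```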

This is the sort of thing you wrote, and it ain't going to work.

You have two readLine() calls, so you are only using alternate lines. You will use the even‑numbered lines (assuming 1=first line), and if your file has an odd number of lines, the line read inside that loop will end up being null. You cannot read a line that exists in a file and get null. The only way you can get null is by going beyond the last line of the file.
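
The alternate-line effect described above can be reproduced with a few lines. This is a guess at the shape of the problem loop, assuming one readLine() in the condition and another in the body; the names are hypothetical.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

class AlternateLinesSketch {

    // Deliberately buggy: two readLine() calls per iteration.
    static List<String> readBuggy(Reader source) {
        List<String> kept = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(source)) {
            while (reader.readLine() != null) { // reads a line and throws it away
                kept.add(reader.readLine());    // reads the NEXT line, possibly null
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return kept;
    }

    public static void main(String[] args) {
        // Three lines: "one" and "three" are consumed by the test, "two" is kept,
        // and the final readLine() in the body runs past the end of the input.
        System.out.println(readBuggy(new StringReader("one\ntwo\nthree\n")));
    }
}
```

With three input lines the list ends up holding "two" followed by a null, which is the odd-line-count case.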
 
Campbell Ritchie
Marshal
Posts: 56536
You may need to add a != null test to my line 10. Otherwise you may get a NullPointerException.

Or try this
 
Campbell Ritchie
Marshal
Posts: 56536
You will have to change that for cases where there are multiple lines starting with ; or >.
 
Liutauras Vilda
Sheriff
Posts: 4917
Which character encoding is your file using?
The issue could be caused by this, if you are reading the file with a different character encoding from the one it was written in.

If that is the case, a FileInputStream wrapped in an InputStreamReader might be better to use, as you could then specify the character encoding.
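
If encoding did turn out to matter, the charset is named on the reader that wraps the byte stream, not on the FileInputStream itself. A small sketch, using a ByteArrayInputStream as a stand-in for the file so it runs anywhere; the class and method names are assumptions.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CharsetReaderSketch {

    // Decodes the byte stream with an explicit charset and returns its first line.
    static String firstLine(InputStream in, Charset charset) {
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(in, charset))) {
            return reader.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // For a real file the stream would be a FileInputStream; alternatively
        // java.nio.file.Files.newBufferedReader(path, charset) does the wrapping.
        byte[] bytes = ">SEQUENCE_1\nMTEITAAM\n".getBytes(StandardCharsets.US_ASCII);
        System.out.println(firstLine(new ByteArrayInputStream(bytes), StandardCharsets.US_ASCII));
    }
}
```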
 
Campbell Ritchie
Marshal
Posts: 56536
Cannot see how encoding problems could cause nulls to be returned. If you look up the FASTA files they appear to contain ASCII characters only, so encoding problems are unlikely.
Since file input streams read bytes they are not suitable for reading text files. Java® Tutorials link. OP says clearly it is a text file he is reading.
 
Liutauras Vilda
Sheriff
Posts: 4917
Campbell Ritchie wrote:Cannot see how encoding problems could cause nulls to be returned. If you look up the FASTA files they appear to contain ASCII characters only, so encoding problems are unlikely.
Since file input streams read bytes they are not suitable for reading text files. Java® Tutorials link. OP says clearly it is a text file he is reading.


Maybe the nulls are being returned because those omitted parts are written on only 5 lines in the raw file, and the length is about 3000 characters per line (this might be the case).

There are also 2 empty lines at the end of the file, each a carriage return followed by a line feed (that would explain why he receives null at the end).
 
Paul Clapham
Sheriff
Posts: 22827
Nobody yet has pointed out that this code:



adds only the odd-numbered lines from the input to your fileStringArr thing. This therefore works differently for input with an even number of lines versus input with an odd number of lines.

However I see that several people have posted code which correctly reads all lines, so perhaps this is now an obsolete observation.
 
Campbell Ritchie
Marshal
Posts: 56536
Liutauras Vilda wrote: . . .
Maybe the nulls is being returned because those omitted parts are written only in 5 lines in a raw file, and length is ~ 3000 symbols per line (might this is a case)
Sounds unlikely to me. It is because he is only reading every other line.


And 2 empty lines at the end of file with "carriage return followed by a line feed" tags (it explains why at the end he gets null received)
That will not give a null line but an empty String like "".
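
A quick way to check this: call readLine() a fixed number of times on text that ends with two empty CR-LF lines and record every result. The helper name and the sample text are made up for the demonstration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

class BlankLineSketch {

    // Calls readLine() exactly 'calls' times and records each result, null included.
    static List<String> readNTimes(Reader source, int calls) {
        List<String> results = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(source)) {
            for (int i = 0; i < calls; i++) {
                results.add(reader.readLine());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return results;
    }

    public static void main(String[] args) {
        // One sequence line followed by two empty lines, all terminated by CR-LF.
        List<String> results = readNTimes(new StringReader("ACGT\r\n\r\n\r\n"), 4);
        for (String r : results) {
            System.out.println(r == null ? "null (end of input)" : "\"" + r + "\"");
        }
    }
}
```

The two blank lines come back as empty Strings; only the fourth call, past the end of the input, returns null.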
 