See the Java Tutorial for how to obtain content from URLs. To parse content, you'll have to deal with each type differently. The Java API includes an HTML parser (javax.swing.text.html), but for other types you'll need third-party libraries.
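For HTML, a minimal sketch using the JDK's built-in ParserDelegator might look like this (untested; the URL is a placeholder):

    import java.io.InputStreamReader;
    import java.io.Reader;
    import java.net.URL;
    import javax.swing.text.html.HTMLEditorKit;
    import javax.swing.text.html.parser.ParserDelegator;

    public class PageText {
        public static void main(String[] args) throws Exception {
            URL url = new URL("http://example.com/"); // placeholder URL
            try (Reader reader = new InputStreamReader(url.openStream())) {
                // Print the text content of the page as the parser encounters it
                HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
                    @Override
                    public void handleText(char[] data, int pos) {
                        System.out.println(new String(data));
                    }
                };
                new ParserDelegator().parse(reader, callback, true);
            }
        }
    }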
Before downloading it? I don't know of any API to do that. HTTP has the "Content-Type" header (see the HTTP specification, section 14.17), which allows a server to tell a client the MIME type of a resource. That's probably your best bet.
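For example, you could issue a HEAD request, which transfers only the headers and not the body, and inspect the Content-Type before deciding whether to download. A rough sketch with java.net.HttpURLConnection (the URL is a placeholder):

    import java.net.HttpURLConnection;
    import java.net.URL;

    public class ContentTypeCheck {
        // HEAD returns the same headers as GET without the body,
        // so you learn the MIME type without downloading the resource.
        static String contentTypeOf(String address) throws Exception {
            HttpURLConnection conn =
                    (HttpURLConnection) new URL(address).openConnection();
            conn.setRequestMethod("HEAD");
            conn.connect();
            try {
                return conn.getContentType(); // e.g. "text/html; charset=UTF-8"
            } finally {
                conn.disconnect();
            }
        }

        public static void main(String[] args) throws Exception {
            System.out.println(contentTypeOf("http://example.com/"));
        }
    }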
What is the error with Nutch? It may be something that changing crawlers won't fix (e.g. files protected by authentication, or connectivity problems). I've had good results using the aforementioned HTML parser to locate hyperlinks and feed them back into a queue for visiting with java.net.URL, as a homegrown web crawler, along the lines of the sketch below.
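Something like this (an untested sketch; the seed URL and the 100-page cap are arbitrary placeholders):

    import java.io.InputStreamReader;
    import java.io.Reader;
    import java.net.URL;
    import java.util.ArrayDeque;
    import java.util.HashSet;
    import java.util.Queue;
    import java.util.Set;
    import javax.swing.text.MutableAttributeSet;
    import javax.swing.text.html.HTML;
    import javax.swing.text.html.HTMLEditorKit;
    import javax.swing.text.html.parser.ParserDelegator;

    public class MiniCrawler {
        public static void main(String[] args) throws Exception {
            final Queue<URL> queue = new ArrayDeque<>();
            Set<String> seen = new HashSet<>();
            queue.add(new URL("http://example.com/")); // seed URL (placeholder)

            while (!queue.isEmpty() && seen.size() < 100) { // cap for the sketch
                final URL page = queue.poll();
                if (!seen.add(page.toString())) continue; // skip already-visited pages
                System.out.println("Visiting " + page);
                try (Reader reader = new InputStreamReader(page.openStream())) {
                    new ParserDelegator().parse(reader, new HTMLEditorKit.ParserCallback() {
                        @Override
                        public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
                            if (t == HTML.Tag.A) {
                                Object href = a.getAttribute(HTML.Attribute.HREF);
                                if (href != null) {
                                    try {
                                        // Resolve relative links against the current page
                                        queue.add(new URL(page, href.toString()));
                                    } catch (Exception ignored) { }
                                }
                            }
                        }
                    }, true);
                } catch (Exception e) {
                    System.err.println("Skipping " + page + ": " + e);
                }
            }
        }
    }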