scraping XML

Ciri Bhoy

Joined: Oct 20, 2011
Posts: 16
Hi all,

I'm writing a small app that reads an XML file from a website as an InputStream, and I want to parse that stream so that only certain results are displayed. The relevant data looks like this:

<td class="first">

<img id="ctl00_Content_ctl00_rptInfo_ctl16_Image2" alt="Inactive" src="../../images/t2.jpg" style="border-width:0px;" />
<td >
Aer Lingus
12 Mar 21:50
<td class="last">
Arrived 21:39

<td class="first">
<img id="ctl00_Content_ctl00_rptInfo_ctl17_Image1" alt="Active" src="../../images/t1.jpg" style="border-width:0px;" />

<td >

I'm currently doing this by reading the stream line by line with readLine() and pulling out the relevant data, and it's working fine. The problem is that this seems far too easy. It's only a small project, so performance isn't really an issue; I'm just looking for the 'right' way of doing it, or a few 'right' ways. I hope I'm making myself clear enough; I'm afraid I'm not too well up on the jargon yet.

Any advice is very welcome and appreciated.
Ulf Dittmer

Joined: Mar 22, 2005
Posts: 41134
That doesn't look like XML; it looks like HTML. My first weapon of choice would be a library that can handle HTML, such as HtmlUnit, which also handles downloading the page.
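
To illustrate the general idea of parsing the markup as tags rather than as lines, here is a rough sketch. It does not use HtmlUnit (which needs an extra dependency); it swaps in the HTML parser that ships with the JDK in javax.swing.text.html.parser, which is older and less forgiving than a dedicated library. The class name CellScraper and the method cellsWithClass are made up for this example, and it assumes the cells of interest can be picked out by their class attribute, as with class="last" in the snippet above.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper, not part of any library.
final class CellScraper {

    /** Returns the text of every <td> whose class attribute equals cssClass. */
    static List<String> cellsWithClass(Reader html, String cssClass) {
        List<String> cells = new ArrayList<>();

        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            // Non-null only while the parser is inside a matching <td>.
            private StringBuilder current;

            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag == HTML.Tag.TD) {
                    flush(); // a new cell closes any cell that was left open
                    Object cls = attrs.getAttribute(HTML.Attribute.CLASS);
                    if (cssClass.equals(cls)) {
                        current = new StringBuilder();
                    }
                }
            }

            @Override
            public void handleText(char[] data, int pos) {
                if (current != null) {
                    current.append(data).append(' ');
                }
            }

            @Override
            public void handleEndTag(HTML.Tag tag, int pos) {
                if (tag == HTML.Tag.TD) {
                    flush();
                }
            }

            // Stores the collected text with whitespace collapsed.
            private void flush() {
                if (current != null) {
                    cells.add(current.toString().replaceAll("\\s+", " ").trim());
                    current = null;
                }
            }
        };

        try {
            // The third argument tells the parser to ignore any charset
            // declared in the page, since the Reader already decodes it.
            new ParserDelegator().parse(html, callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return cells;
    }
}
```

With an InputStream called in, this could be invoked as CellScraper.cellsWithClass(new InputStreamReader(in, StandardCharsets.UTF_8), "last"), assuming the page is UTF-8 encoded. The advantage over readLine() is that the result no longer depends on where the line breaks happen to fall in the page.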
