• Post Reply Bookmark Topic Watch Topic
  • New Topic
programming forums Java Mobile Certification Databases Caching Books Engineering Micro Controllers OS Languages Paradigms IDEs Build Tools Frameworks Application Servers Open Source This Site Careers Other Pie Elite all forums
this forum made possible by our volunteer staff, including ...
Marshals:
  • Campbell Ritchie
  • Jeanne Boyarsky
  • Ron McLeod
  • Paul Clapham
  • Liutauras Vilda
Sheriffs:
  • paul wheaton
  • Rob Spoor
  • Devaka Cooray
Saloon Keepers:
  • Stephan van Hulst
  • Tim Holloway
  • Carey Brown
  • Frits Walraven
  • Tim Moores
Bartenders:
  • Mikalai Zaikin

String Compare with a bit of a twist

 
Ranch Hand
Posts: 30
  • Mark post as helpful
  • send pies
    Number of slices to send:
    Optional 'thank-you' note:
  • Quote
  • Report post to moderator
My title probably does not accurately reflect my problem, so hopefully I can explain it without causing anybody's eyes to glaze over, or drool to drip from the corner of their mouths.

I am comparing two text files, which seems simple enough, but after messing with it for a couple of weeks, I am finding it harder to accomplish that I thought at first.

Essentially all I am trying to accomplish... both text files are loaded into ArrayLists (after being trimmed and split). The first file is the "good" file, the second file is the "changed" file.

I load the first String, and then loop string by string through the second file until a match is found. If a match is found, I want to remove the line from both lists (to avoid the overhead of having to 'search' it again)

If I get all of the way through the second file without a match, then a line is missing, and it's corresponding match needs to be added to the ArrayList labeled "missing".

Whatever is left over in the second file when all is said and done would be "added" lines of text.

In a perfect world, my code would work....



I am guessing that one of the problems that i am having is that by removing elements from 'clean', it is throwing off my count?? At any rate, if both files match, then I should have nothing left over at the end. For the purposes in which I intend to use this, I know for sure that there will be elements in at least two of the arrays...

And while I am thinking about it, it would probably be better to start the comparisons from the end of the ArrayLists to avoid the overhead when elements are removed.

I'm pretty sure I am missing something simple, but I can no longer see the trees for the forest...

EDIT: Or maybe I should be trying to do this in two passes??
[ January 04, 2005: Message edited by: C. Alan ]
 
Sheriff
Posts: 11343
Mac Safari Java
  • Mark post as helpful
  • send pies
    Number of slices to send:
    Optional 'thank-you' note:
  • Quote
  • Report post to moderator
If you don't need random access and you're going to be removing elements from the middle, then you might consider LinkedList instead of ArrayList. Then consider some built-in Collection functionality:
  • Since you're just iterating through the "good" list, why not use an Iterator?
  • Rather than iterating through the "infected" list looking for a match, why not just use indexOf(Object)?
  • When you do find a match, remove it from the "good" list using Iterator.remove(), and from the "infected" list using LinkedList.remove(int index).



  • [ January 05, 2005: Message edited by: marc weber ]
     
    C. Alan
    Ranch Hand
    Posts: 30
    • Mark post as helpful
    • send pies
      Number of slices to send:
      Optional 'thank-you' note:
    • Quote
    • Report post to moderator
    Thank you. I've never used Iterators before, and only dabbled with linked lists. I'll see how that works out.

    EDIT: Well, I have to admit that I am a bit humbled...indeed the solution was quite simple, and IMHO, elegant.

    Thanks again Marc. Happy New Year!
    [ January 05, 2005: Message edited by: C. Alan ]
     
    marc weber
    Sheriff
    Posts: 11343
    Mac Safari Java
    • Mark post as helpful
    • send pies
      Number of slices to send:
      Optional 'thank-you' note:
    • Quote
    • Report post to moderator

    Originally posted by C. Alan:
    ...the solution was quite simple, and IMHO, elegant...


    Thank you! Sometimes it just takes a fresh perspective -- as you mentioned above, with the forest and the trees.

    Actually, I see now that the try/catch block in my code is not required (since the exceptions thrown by Iterator methods are unchecked), so it would be more elegant without that.
     
    (instanceof Sidekick)
    Posts: 8791
    • Mark post as helpful
    • send pies
      Number of slices to send:
      Optional 'thank-you' note:
    • Quote
    • Report post to moderator
    Are all the lines unique? If not you may have to remove all occurrences of a line from both sides instead of just the one.

    Here's a compare algorithm for text files like Java source. I extracted the algorithm from some really smelly code I found on the net and did a new implementation on my Wiki.

    ---
    For each each unique line of text create a symbol. The symbol state is: OldOnly, NewOnly, UniqueMatch (both files exactly once), or Other.

    For each line, create a LineInfo object. Set state = symbol state and establish bidirectional links between UniqueMatch lines in the two files.

    For each UniqueMatch in old create a "match block". Stretch match blocks forward and backward to include matching lines with any state, including other match blocks.

    Build a Report of edit commands that can be used to tranform Old into New.
    Matching blocks generate match or move commands. Non-matching blocks generate insert, append, delete or change commands.

    Iterate the commands to generate a report.
    ---
    And here's an old favorite if you work with sorted lists. It's really the "master-update" pattern from the days of processing cards and tapes.

    [ January 05, 2005: Message edited by: Stan James ]
     
    C. Alan
    Ranch Hand
    Posts: 30
    • Mark post as helpful
    • send pies
      Number of slices to send:
      Optional 'thank-you' note:
    • Quote
    • Report post to moderator
    Thanks Stan. That's sort of what I was trying to do originally. For this app, all the lines will be unique. The text I am comparing consists of simply file names, lengths, and MD5 hashes. I guess a loose comparison would be similar to what TripWire does, merely checking to see what files have been added and changed after a system has intentionally been infected with adware, spyware, etc.. With the code Marc provided, that works flawlessly, and works much faster than how I was trying to do it before. I realize I am reinventing the wheel, so to speak, but it is as much for me to learn, as it is my reluctance to pay for any application that doesn't do precisely what I want it to do.

    Now that I have an understanding of how to do it, I can try and do what I was initially trying to do in the first place, and that is to be able to compare registry files from before infection, and after infection. The goal is to see what has been changed, and create a .reg file our infected users can run to repair their registry files.

    All keys will be unique, so I think that the smae method will work as far as comparisons go, but I am definatley going to be running into memory management issues that I will need to figure out how to address. My current methods of reading each text file into a Linked List will not work, as there will be upwards of 75,000 lines of text in each list. I suspect I will have to read one line at a time, and do it that way, unless anybody has any other clever solutions.

    The actual hives themselves, in Hexadecimal form, are quite short, but for the moment, using them in that form is a little beyond my abilities. Given enough time, and the continuing kind help that I have received here so far, I'll figure it our eventually.
     
    Ranch Hand
    Posts: 1646
    • Mark post as helpful
    • send pies
      Number of slices to send:
      Optional 'thank-you' note:
    • Quote
    • Report post to moderator
    I'm not all that familiar with how the registry is implemented.* Is it ordered in any way, for example as it is viewed in the tree editor? If so, then you could implement the second method Stan mentioned (the card stack). Basically you scan down each file, advancing either one pointer or the other or both based on equality or ordering.

    If you care to, can you describe in more detail the format of the files? Things like "it's in depth-first tree order" or "the full noode path is included on each line" would give you some great opportunities to optimize the task using a different algorithm.

    * I just realized that you're probably just exporting the entire registry to a file rather than accessing it directly. If so, then I'd bet there is a well-specified ordering that you could capitalize on.
     
    C. Alan
    Ranch Hand
    Posts: 30
    • Mark post as helpful
    • send pies
      Number of slices to send:
      Optional 'thank-you' note:
    • Quote
    • Report post to moderator
    Yes David, that is all I am doing (exporting the registry as a text file)... that's what makes it huge. I am curious as to what your recommendations would be. Mine always seem to be use the hardest, most complicated means possible.

    My next project after this one (lol..there is always one more little project) is to learn how to access the registry directly.
     
    Stan James
    (instanceof Sidekick)
    Posts: 8791
    • Mark post as helpful
    • send pies
      Number of slices to send:
      Optional 'thank-you' note:
    • Quote
    • Report post to moderator
    I'd probably try to convert the tree into fully qualified names so I could sort them and use that master-update algorithm, but I'm just old fashioned.
    reply
      Bookmark Topic Watch Topic
    • New Topic