Not sure of the best location for this question, but since it is likely an intermediate-level question, I'll post it here.
I need to build a track-changes system for HTML documents. The documents are loaded into our web system and then routed for review and editing. At the review location, the user looks at the document and then uses FCKeditor to make any deletions or insertions they want.
I have already built the parsing system for the HTML documents with the "<div>" or "<p>" tags using the org.htmlparser.* package (that is an outstanding HTML parser BTW and very easy to use). I then build an XML document of the original text and other details. Where I am stuck, in both design and API functionality, is on the actual tracking of changes in the document.
Here is what I think I should do but I am hoping that some of you may have suggestions:
1. Count the paragraphs of the original document and put that count in the XML. No problem here; very simple to do.
2. Count the paragraphs of the new document and put that count in the XML. If the numbers are the same, new paragraphs were not inserted nor were original paragraphs deleted.
3. Give an algorithmic number to the values contained in the original paragraphs. The number would be based (here is where I need your thoughts) on the text value (converted to float or int, I suppose) of the paragraph in question as well as the position of that text. Do you think that I could simply create a method that looks at the text in position one of paragraph one and, if it is an "a", gives it a value of 1 (for a) plus 1 for the first position, making my value in that spot 2? I doubt that this can be relied upon to be accurate down the paragraph, given the rest of the design (to follow). My problem is that there are lots of ways to sum numbers and get the same value. So I need some way to reliably tell what the numeric value of the paragraph is and to make sure that value is unique to any paragraph with a specific series of characters. Basically, I need a unique numeric ID for any paragraph.
4. Put the total value of all numbers (chars) in the paragraph in the XML document (if the algorithm is too simple, it seems to me that I could get false positives during the compare). I need a guaranteed and unique way to identify the characters in each location of the paragraph without making unnecessary strings and without having to compare paragraphs that don't need to be compared (the original compared to the new).
5. When an edit of the document occurs, parse it and get the paragraph values in the edited document (I also have to count the number of paragraphs and compare that value to the XML document).
6. If the paragraph identity (the paragraph number in the document, i.e. p1) of the original document and the new document have the same algorithmic value, based on the XML report of the original parse and the value of the characters in the new document, then ignore the paragraph and move on to the next comparison.
7. If the paragraph positions of the new and original documents are the same and the algorithmic values of the original and new paragraphs are different, then compare the new and old paragraphs starting at the first position, looking for where the text changes.
8. In the case of an insertion, as soon as the text changes, look through the rest of the original paragraph and compare each value to the new paragraph, subtracting from the position of the text to where it might originally have been. Once the values match again (I have to test past the first match to make sure that a single character did not match while the word is not the same), everything from the starting position to the ending position of the new doc where the values were != is an insertion. Mark the insertion with <font color="red">new text</font>.
9. In the case of a deletion, subtract from the original paragraph position by position and compare each value to the new paragraph until they match. Again, I have to test past the first match to be sure there is no false positive. Once the deleted text is uncovered, mark it in the HTML with <strike>the deleted text</strike> and return that new document for review.
10. Put all changes in the XML document and ask the user to accept or decline the changes. Simple logic here, so this part won't be challenging.
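The worry in step 3, that many different paragraphs can sum to the same number, is easy to demonstrate, and a message digest is one common way around it. The sketch below is only an illustration: the class and method names are made up for this example, and it uses the raw character code instead of a = 1, b = 2. A SHA-256 digest is not mathematically guaranteed to be unique, but two different paragraphs producing the same digest is so unlikely that it is usually treated as an ID in practice.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper class, not part of org.htmlparser or any other library.
class ParagraphFingerprint {

    // The additive idea from step 3: character code plus 1-based position, summed.
    static int additiveChecksum(String text) {
        int sum = 0;
        for (int i = 0; i < text.length(); i++) {
            sum += text.charAt(i) + (i + 1);
        }
        return sum;
    }

    // SHA-256 digest of the paragraph text, as 64 lowercase hex characters.
    static String sha256Hex(String text) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] digest = md.digest(text.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is one of the algorithms every Java platform must provide.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Swapping two characters leaves the additive sum unchanged...
        System.out.println(additiveChecksum("ab") == additiveChecksum("ba")); // true
        // ...but produces a different digest.
        System.out.println(sha256Hex("ab").equals(sha256Hex("ba")));          // false
    }
}
```

The hex string is what would go into the XML for each paragraph in step 4. Note that whitespace and markup differences also change the digest, so the text would need to be normalized the same way on both parses.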
Incidentally, the reason that I am attempting to build some kind of algorithm rather than comparing characters of the old paragraph to the new paragraph is that there is no reason to open up the old text and load it into memory if the values are the same. A 60-page document could be a big deal to read through while still giving quick responses, because the comparison really involves 120 pages of text. Without a mathematical operation, I would have to load both documents into memory and compare them char by char. I want to compare only the text where the value changed.
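The "only look at what changed" idea can be sketched as follows, assuming the XML already holds one fingerprint string per original paragraph, in document order. Everything here (class name, method names, the Base64-encoded SHA-256 fingerprint) is an assumption made for illustration. Only the positions it returns would need the original text loaded for a detailed comparison.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Base64;
import java.util.List;

// Hypothetical helper class for illustration only.
class ChangedParagraphs {

    // Assumed fingerprint: SHA-256 of the paragraph text, Base64-encoded.
    static String fingerprint(String text) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256")
                    .digest(text.getBytes(StandardCharsets.UTF_8));
            return Base64.getEncoder().encodeToString(digest);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    // Returns the 0-based positions whose fingerprint differs from the stored one.
    // Positions that exist in only one of the two documents are reported as changed.
    static List<Integer> changedPositions(List<String> storedFingerprints,
                                          List<String> newParagraphs) {
        List<Integer> changed = new ArrayList<>();
        int max = Math.max(storedFingerprints.size(), newParagraphs.size());
        for (int i = 0; i < max; i++) {
            boolean inBoth = i < storedFingerprints.size() && i < newParagraphs.size();
            if (!inBoth || !storedFingerprints.get(i).equals(fingerprint(newParagraphs.get(i)))) {
                changed.add(i);
            }
        }
        return changed;
    }

    public static void main(String[] args) {
        // Stand-in for the fingerprints read back from the XML report.
        List<String> stored = Arrays.asList(
                fingerprint("Term of one year."),
                fingerprint("Payment is due in 30 days."),
                fingerprint("Either party may terminate."));
        List<String> edited = Arrays.asList(
                "Term of one year.",
                "Payment is due in 60 days.",
                "Either party may terminate.");
        System.out.println(changedPositions(stored, edited)); // [1]
    }
}
```

One limitation of comparing strictly by position: a paragraph inserted near the top shifts everything after it, so every later position is reported as changed even though the text is the same. Matching fingerprints regardless of position would be needed to handle that case.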
Based on your suggestion, I am now looking through the scores of diff programs that my search turned up.
If text is the only (or best) way to do this, then I will just build a StringBuilder for each of the two paragraphs, pass those in, and see what the diff program (which I am sure I will find) produces. The problem is that I won't have just one instance of these documents. I don't know how many I will have at any given time. And they are contracts (which are even more verbose than I am).
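For a sense of what a diff program does with two paragraph strings, here is a minimal word-level sketch built on a longest-common-subsequence table. It is not any particular diff tool's algorithm or API. The class and method names are invented, the <font color="red"> and <strike> markup simply follows the plan earlier in the thread, and the code assumes plain paragraph text (no nested tags, no HTML escaping, runs of whitespace collapsed to single spaces).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class for illustration only.
class WordDiffMarkup {

    // Splits on whitespace; an empty or blank paragraph yields no words.
    static String[] words(String text) {
        String trimmed = text.trim();
        return trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
    }

    // Returns the new paragraph with insertions and deletions marked up.
    static String markUp(String oldText, String newText) {
        String[] a = words(oldText);
        String[] b = words(newText);

        // lcs[i][j] = length of the longest common subsequence of a[i..] and b[j..].
        int[][] lcs = new int[a.length + 1][b.length + 1];
        for (int i = a.length - 1; i >= 0; i--) {
            for (int j = b.length - 1; j >= 0; j--) {
                lcs[i][j] = a[i].equals(b[j])
                        ? lcs[i + 1][j + 1] + 1
                        : Math.max(lcs[i + 1][j], lcs[i][j + 1]);
            }
        }

        // Walk both word lists, keeping matches and marking everything else.
        List<String> out = new ArrayList<>();
        int i = 0, j = 0;
        while (i < a.length || j < b.length) {
            if (i < a.length && j < b.length && a[i].equals(b[j])) {
                out.add(a[i]);                                   // unchanged word
                i++;
                j++;
            } else if (i < a.length && (j == b.length || lcs[i + 1][j] >= lcs[i][j + 1])) {
                out.add("<strike>" + a[i] + "</strike>");        // only in the original
                i++;
            } else {
                out.add("<font color=\"red\">" + b[j] + "</font>"); // only in the edit
                j++;
            }
        }
        return String.join(" ", out);
    }

    public static void main(String[] args) {
        System.out.println(markUp("Payment is due in 30 days", "Payment is due in 60 days"));
        // Payment is due in <strike>30</strike> <font color="red">60</font> days
    }
}
```

Working on whole words sidesteps the "a character matched but the word did not" false positives from steps 8 and 9. The table needs memory proportional to the product of the two word counts, which is fine per paragraph but is another reason to diff only the paragraphs that actually changed.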
Thanks so much for the suggestion. I am sure that I can solve this problem much more quickly and simply now.