
Maintaining string width

 
peter m hayward
Ranch Hand
Posts: 94
2
I am trying to solve a string width issue where the object is to have a string that is no more than 80 characters in length and that must include some basic information.
First, here is a typical string:
Tess Gerritsen The Apprentice The second book in the Jane Rizzoli and Maura Isles series 9780553814323
The basic information required is the author's name, the basic title and the ISBN, so in this case:
Tess Gerritsen The Apprentice 9780553814323
That is 43 characters in length, leaving 37 to fit the remaining info, "The second book in the Jane Rizzoli and Maura Isles series", which is 58 characters in length.
So I successively remove superfluous words as follows. When this process completes, if the string is still too long, I then replace given words with similar, shorter ones until the target width is achieved or declared not possible.
Here is the action as it happens:
The second book in the Jane Rizzoli and Maura Isles series: too many chars to fit, 58
second book in the Jane Rizzoli and Maura Isles series: removed “The”, chars remaining = 54
second book in Jane Rizzoli and Maura Isles series: removed “the”, chars remaining = 50
second book Jane Rizzoli and Maura Isles series: removed “in”, chars remaining = 47
second book Jane Rizzoli and Maura Isles: removed “series”, chars remaining = 41
Once I reach this stage there are no more superfluous words to remove, so now I replace words with similar ones that have fewer characters; "second book" becomes "book 2":
book 2 Jane Rizzoli and Maura Isles: chars remaining = 36, which now fits
Had it not fitted, it would have been changed to "2nd Jane Rizzoli and Maura Isles".
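The two-stage process described above (drop filler words one at a time, then swap phrases for shorter ones) can be sketched roughly as follows. This is only an illustration, not the poster's code: the class name, the `FILLERS` list and the contents of `REPLACEMENTS` are assumptions taken from the single example in this post.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class DescriptionShortener {
    // Filler words, dropped one at a time in this order (assumed from the trace above).
    static final List<String> FILLERS = Arrays.asList("The", "the", "in", "series");

    // Each row is a chain of progressively shorter alternatives (sample data only).
    static final String[][] REPLACEMENTS = {
        {"second book", "book 2", "2nd"},
    };

    /** Returns the description cut down to at most maxLen characters, or null if it cannot be done. */
    static String shorten(String description, int maxLen) {
        List<String> words = new ArrayList<>(Arrays.asList(description.trim().split("\\s+")));
        // Stage 1: remove filler words one by one, stopping as soon as the text fits.
        for (String filler : FILLERS) {
            if (String.join(" ", words).length() <= maxLen) {
                break;
            }
            words.remove(filler); // removes the first exact match only
        }
        String text = String.join(" ", words);
        // Stage 2: walk each chain, replacing a phrase with its next shorter form.
        for (String[] chain : REPLACEMENTS) {
            for (int i = 1; i < chain.length && text.length() > maxLen; i++) {
                text = text.replace(chain[i - 1], chain[i]);
            }
        }
        return text.length() <= maxLen ? text : null;
    }

    public static void main(String[] args) {
        String extra = "The second book in the Jane Rizzoli and Maura Isles series";
        System.out.println(shorten(extra, 37));
    }
}
```

Returning `null` is one way to signal "declared not possible"; the caller still has to decide what to do with such a record.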
I am doing this using a two-dimensional array as follows:
replacements[1][0] = "second book";
replacements[1][1] = "book 2";
replacements[1][2] = "second";
replacements[1][3] = "2nd";
replacements[1][4] = "2";
I am aware that the array data needs work, as some items are not shorter than others (e.g. "second" and "book 2" are both 6 characters in length), but I am keeping the rows of the array the same shape so that it can be scanned easily. In addition, the starting point for this process may vary: e.g. "The second book in" could be "A book in", which of course means that the number of items in the array increases each time a different format is encountered and needs to be actioned.
Being a person of the old school, I am using things from my days of C: first putting the string into an array, then working on it, replacing it, and so on. I am looking for help in using Java and its newer methods to achieve this. Here is some of the code; I am still working on the substitute method that uses the two-dimensional array, so at this point I do not have a working version of that part.

 
Carey Brown
Bartender
Posts: 2993
46
Eclipse IDE Firefox Browser Java MySQL Database VI Editor Windows
Two things, without getting into the details:
  • Variable and method names should always begin with a lower case letter: BasicTitle should have been basicTitle.
  • This kind of parsing is very complex. I would suggest setting up a testing framework right away. Otherwise you are going to think it works when it doesn't.

  • A testing framework would be something like:
    This is a very terse example but I hope you get the idea. Also a similar list and test should be put together for failureCases.
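Since the code from this post did not survive, here is a minimal sketch of the kind of table-driven test being suggested, written in plain Java with no test library. The `shorten` method is a stand-in for whatever method is really under test, and the sample cases are invented for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class ShortenTests {
    // Stand-in for the real method under test (hypothetical).
    static String shorten(String s) {
        return s.replace("second book", "book 2");
    }

    /** Runs every input/expected pair and returns the number of failures. */
    static int run(Map<String, String> cases) {
        int failures = 0;
        for (Map.Entry<String, String> c : cases.entrySet()) {
            String actual = shorten(c.getKey());
            if (!c.getValue().equals(actual)) {
                failures++;
                System.out.println("FAIL: \"" + c.getKey() + "\" gave \"" + actual
                        + "\", expected \"" + c.getValue() + "\"");
            }
        }
        return failures;
    }

    public static void main(String[] args) {
        // Input -> expected output; add a new line every time a new format turns up.
        Map<String, String> successCases = new LinkedHashMap<>();
        successCases.put("second book Jane Rizzoli", "book 2 Jane Rizzoli");
        successCases.put("Jane Rizzoli", "Jane Rizzoli");
        System.out.println(run(successCases) + " failure(s)");
    }
}
```

Re-running the whole table after every change is what catches the case where fixing one format silently breaks another.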
     
    Carey Brown
    Bartender
    Posts: 2993
    46

    This is too verbose and could have been written as:
    Or you could have written it using Java's variable arguments, which would have made it useful in more cases.
    In which case you could now do something like:
     
    peter m hayward
    Ranch Hand
    Posts: 94
    2
    Thanks, something to think about.
     
    Campbell Ritchie
    Marshal
    Posts: 55698
    163
    Is this a task capable of fitting into a context-free grammar, or anything like that? Book titles are written in natural languages and may use irregular grammars (= free grammars). I suspect you may actually have an impossible task. What are you going to do with an Ernest Hemingway book whose title contains only short words?
     
    Dave Tolls
    Rancher
    Posts: 2914
    36
    It's not the title, it's the description that's being abbreviated.
     
    Campbell Ritchie
    Marshal
    Posts: 55698
    163
    But how, without knowing a free grammar, are you going to tell where the title ends and the description begins?
     
    Dave Tolls
    Rancher
    Posts: 2914
    36
    Ah, well...magic?

    For some reason I thought there might be tabs in the original line.  I see there's nothing actually stated to that effect.
     
    Campbell Ritchie
    Marshal
    Posts: 55698
    163
    Dave Tolls wrote:. . . tabs in the original line. . . .
    Full stops, commas, tabs, double spaces, or anything like that, can be used to change the text into a regular grammar, which can easily be parsed with regexes.

    Otherwise, it is really easy to do by hand, because we are used to free grammars; we speak in them all the time.
     
    Liutauras Vilda
    Marshal
    Posts: 4638
    316
    BSD
    1. Consider converting embeddedAuthor to upper case (or lower case) and only then looking for "BY", so you wouldn't need to check "By", "by" and "bY".

    2. I'd follow right away Carey's advice to write tests first, otherwise you won't notice how you'll break something while fixing something else.

    3. Array probably isn't the right data structure for that task. Consider using Map (look for HashMap implementation). Might think of a structure to achieve mapping as Map<String, List<String>>. Might not, need to think carefully.

    4. Since you are going to use System.out.print... a lot, create a simple method with a short name for debugging purposes, so your code is less cluttered, as:
    So you could write:
    debug("Author", i);
    debug("Book title", i);
    ...


    Will join discussion again most likely later..
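Point 3 above could look something like the sketch below, where each phrase maps to a list of shorter alternatives. This is only one possible shape for the data; the class name, the `fit` method and the sample entries are assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class ReplacementTable {
    // Phrase -> shorter alternatives, ordered from longest to shortest (sample data).
    static final Map<String, List<String>> SHORTER = new LinkedHashMap<>();
    static {
        SHORTER.put("second book", Arrays.asList("book 2", "2nd"));
        SHORTER.put("third book", Arrays.asList("book 3", "3rd"));
    }

    /**
     * Tries the alternatives in order and returns the first result that fits.
     * If nothing fits, the shortest attempt is returned and the caller must check its length.
     */
    static String fit(String text, int maxLen) {
        String current = text;
        for (Map.Entry<String, List<String>> e : SHORTER.entrySet()) {
            if (current.length() <= maxLen) {
                break;
            }
            String phrase = e.getKey();
            if (!current.contains(phrase)) {
                continue;
            }
            String candidate = current;
            for (String alt : e.getValue()) {
                candidate = current.replace(phrase, alt);
                if (candidate.length() <= maxLen) {
                    return candidate;
                }
            }
            current = candidate; // keep the shortest alternative and try the next phrase
        }
        return current;
    }
}
```

Unlike a fixed-width two-dimensional array, each list can have a different number of alternatives, so adding a new phrase does not force padding on the other rows.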
     
    peter m hayward
    Ranch Hand
    Posts: 94
    2
    Dave Tolls wrote:It's not the title, it's the description that's being abbreviated.


    Very true, my error; it's the extra data pertaining to the book.
     
    Campbell Ritchie
    Marshal
    Posts: 55698
    163
    I can't see the word “by” in the example in the first post.
     
    peter m hayward
    Ranch Hand
    Posts: 94
    2
    Liutauras Vilda wrote:1. Consider converting embeddedAuthor to upper case (or lower case) and only then looking for "BY", so you wouldn't need to check "By", "by" and "bY".

    2. I'd follow right away Carey's advice to write tests first, otherwise you won't notice how you'll break something while fixing something else.

    3. Array probably isn't the right data structure for that task. Consider using Map (look for HashMap implementation). Might think of a structure to achieve mapping as Map<String, List<String>>. Might not, need to think carefully.

    4. Since you are going to use System.out.print... a lot, create a simple method with a short name for debugging purposes, so your code is less cluttered, as:
    So you could write:
    debug("Author", i);
    debug("Book title", i);
    ...


    Will join discussion again most likely later..

    Thanks, there are some things to read up on here, so Google, here I come.
     
    Carey Brown
    Bartender
    Posts: 2993
    46
    Campbell Ritchie wrote:I can't see the word “by” in the example in the first post.

    How would you differentiate "by" with "bystander"?
     
    peter m hayward
    Ranch Hand
    Posts: 94
    2
    Campbell Ritchie wrote:I can't see the word “by” in the example in the first post.

    That is in the incoming data, but I take your point: converting it all to one case makes sense, so I shall do so. I am not sure which case yet; I need to think about it.
     
    Liutauras Vilda
    Marshal
    Posts: 4638
    316
    Carey Brown wrote:
    Campbell Ritchie wrote:I can't see the word “by” in the example in the first post.

    How would you differentiate "by" with "bystander"?

    " by ", wouldn't work?
     
    Carey Brown
    Bartender
    Posts: 2993
    46
    Liutauras Vilda wrote:
    Carey Brown wrote:
    Campbell Ritchie wrote:I can't see the word “by” in the example in the first post.

    How would you differentiate "by" with "bystander"?

    " by ", wouldn't work?

    Yes, that should work, but that's exactly my point. The devil's in the details.
     
    peter m hayward
    Ranch Hand
    Posts: 94
    2
    Carey Brown wrote:
    Campbell Ritchie wrote:I can't see the word “by” in the example in the first post.

    How would you differentiate "by" with "bystander"?

    I find it with Index = embeddedAuthor.indexOf("by"); in the string, hence no need to worry about words that contain "by", such as "baby", as it only looks for "by". But, as has been pointed out, converting to a given case removes the need to check for typos, e.g. "By", "bY" or "BY" will all be covered, so I will be converting everything to one case.
     
    Carey Brown
    Bartender
    Posts: 2993
    46
    peter m hayward wrote:
    Carey Brown wrote:
    Campbell Ritchie wrote:I can't see the word “by” in the example in the first post.

    How would you differentiate "by" with "bystander"?

    I find it with Index = embeddedAuthor.indexOf("by"); in the string, hence no need to worry about words that contain "by", such as "baby", as it only looks for "by". But, as has been pointed out, converting to a given case removes the need to check for typos, e.g. "By", "bY" or "BY" will all be covered, so I will be converting everything to one case.

    In your code you have
    Index = embeddedAuthor.indexOf("by");
    this will find "by" embedded in "byabc", "abcby", and "abcbyxyz". indexOf() doesn't look for whole words.
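To make the point concrete: `String.indexOf` matches a sequence of characters wherever it occurs, so `"baby".indexOf("by")` returns 2. One way to look for a standalone word without regular expressions is to check the characters on either side of each hit, as in this sketch. The class and method names are made up for illustration.

```java
class ByFinder {
    /** Index of the first standalone "by" (any letter case), or -1 if there is none. */
    static int indexOfWordBy(String text) {
        for (int i = 0; i + 2 <= text.length(); i++) {
            // Case-insensitive comparison of the two characters starting at position i.
            if (!text.regionMatches(true, i, "by", 0, 2)) {
                continue;
            }
            boolean startOk = i == 0 || !Character.isLetterOrDigit(text.charAt(i - 1));
            boolean endOk = i + 2 == text.length() || !Character.isLetterOrDigit(text.charAt(i + 2));
            if (startOk && endOk) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println("baby".indexOf("by"));                        // plain substring search
        System.out.println(indexOfWordBy("baby"));                       // whole-word search
        System.out.println(indexOfWordBy("A novel by James Patterson"));
    }
}
```

Because the check also accepts the start and end of the string as word edges, it copes with a line that begins with "By", which a search for `" by "` with surrounding spaces would miss.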
     
    peter m hayward
    Ranch Hand
    Posts: 94
    2
    Carey Brown wrote:
    Liutauras Vilda wrote:
    Carey Brown wrote:
    Campbell Ritchie wrote:I can't see the word “by” in the example in the first post.

    How would you differentiate "by" with "bystander"?

    " by ", wouldn't work?

    Yes, that should work, but that's exactly my point. The devil's in the details.

    here is the code

    I tested this by placing 1000 different strings containing the author into an array, then iterating through it, calling the method above for each one and printing the result with System.out.println.
    I then copied the results back into the next column in Excel, where I already had the author in a column, and used the Excel function EXACT to check both columns. So I conclude that it does work; I am not quite sure why you think it would not. Maybe I am missing something important, as I am new to Java?
     
    peter m hayward
    Ranch Hand
    Posts: 94
    2
    Campbell Ritchie wrote:
    Dave Tolls wrote:. . . tabs in the original line. . . .
    Full stops, commas, tabs, double spaces, or anything like that, can be used to change the text into a regular grammar, which can easily be parsed with regexes.

    Otherwise, it is really easy to do by hand, because we are used to free grammars; we speak in them all the time.


    I have avoided regexes, but maybe it's time to dive in. As for the original data, it is on individual lines, so yes, I can be sure which line has the author, which the title and which the additional data,
    e.g.
    Mary Mary (2005)
    (Book 11 in the Alex Cross series)
    A novel by James Patterson

    thanks
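If the incoming data really does arrive as three lines in that order, a small parser along these lines might be enough for the simple cases. It is a sketch built from the one example shown; the class name, field names and patterns are assumptions, and real data will almost certainly need more variations.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BookRecord {
    // "Title (2005)" -> title and four-digit year.
    private static final Pattern TITLE_YEAR = Pattern.compile("(.+?)\\s*\\((\\d{4})\\)");
    // Text following a standalone "by", in any letter case.
    private static final Pattern BY_AUTHOR = Pattern.compile("(?i)\\bby\\b\\s+(.+)");

    final String title;
    final String year;
    final String note;
    final String author;

    private BookRecord(String title, String year, String note, String author) {
        this.title = title;
        this.year = year;
        this.note = note;
        this.author = author;
    }

    /** Parses a block of exactly three lines: title (year), (series note), "... by author". */
    static BookRecord parse(String block) {
        String[] lines = block.trim().split("\\R");
        if (lines.length != 3) {
            throw new IllegalArgumentException("expected 3 lines, got " + lines.length);
        }
        Matcher t = TITLE_YEAR.matcher(lines[0].trim());
        boolean hasYear = t.matches();
        String title = hasYear ? t.group(1) : lines[0].trim();
        String year = hasYear ? t.group(2) : "";
        // Strip one opening and one closing parenthesis around the series note.
        String note = lines[1].trim().replaceAll("^\\(|\\)$", "");
        Matcher a = BY_AUTHOR.matcher(lines[2]);
        String author = a.find() ? a.group(1).trim() : "";
        return new BookRecord(title, year, note, author);
    }
}
```

Keeping the fields separate means only `note` has to be shortened; the title and author can be left untouched when the final string is assembled.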
     
    peter m hayward
    Ranch Hand
    Posts: 94
    2
    Carey Brown wrote:
    peter m hayward wrote:
    Carey Brown wrote:
    Campbell Ritchie wrote:I can't see the word “by” in the example in the first post.

    How would you differentiate "by" with "bystander"?

    I find it with Index = embeddedAuthor.indexOf("by"); in the string, hence no need to worry about words that contain "by", such as "baby", as it only looks for "by". But, as has been pointed out, converting to a given case removes the need to check for typos, e.g. "By", "bY" or "BY" will all be covered, so I will be converting everything to one case.

    In your code you have
    Index = embeddedAuthor.indexOf("by");
    this will find "by" embedded in "byabc", "abcby", and "abcbyxyz". indexOf() doesn't look for whole words.

    Ha! I just realised I posted an early version of the code, and you are 100% correct, which is why I changed it to " by " in the later version, as I noticed it found "by" in "baby".
     
    Carey Brown
    Bartender
    Posts: 2993
    46
    gives this output
     
    Carey Brown
    Bartender
    Posts: 2993
    46
    peter m hayward wrote:
    Ha! I just realised I posted an early version of the code, and you are 100% correct, which is why I changed it to " by " in the later version, as I noticed it found "by" in "baby".

    So what if you have "by peter m hayward" ?
     
    peter m hayward
    Ranch Hand
    Posts: 94
    2
    Dave Tolls wrote:Ah, well...magic?

    For some reason I thought there might tabs in the original line.  I see there's nothing actually stated to that effect.

    Sorry, I should have pointed out that the data is on separate lines:

    Mary Mary (2005)
    (Book 11 in the Alex Cross series)
    A novel by James Patterson


     
    Carey Brown
    Bartender
    Posts: 2993
    46
    Carey Brown wrote:
    peter m hayward wrote:
    Ha! I just realised I posted an early version of the code, and you are 100% correct, which is why I changed it to " by " in the later version, as I noticed it found "by" in "baby".

    So what if you have "by peter m hayward" ?

    You are starting to get to a place where regular expressions would be useful.
    Here's an example of what it would take to find the word 'by'
    Here's the output

    Note that regular expressions can get complicated. This example has a somewhat difficult regular expression but it is what is needed to match all the variations of the use of "by". This is also an example of why a test framework is critical.
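The code from this post is missing, so what follows is not the original expression, just a simpler sketch of the same idea using `java.util.regex`. The word-boundary anchor `\b` keeps "baby" and "bystander" from matching, and `(?i)` makes the match case-insensitive. The class and method names are invented.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AuthorExtractor {
    // Standalone "by" in any case, followed by whitespace, then the rest of the line.
    private static final Pattern BY = Pattern.compile("(?i)\\bby\\b\\s+(.+)");

    /** Returns the text after the first standalone "by", or null when there is none. */
    static String authorAfterBy(String line) {
        Matcher m = BY.matcher(line);
        return m.find() ? m.group(1).trim() : null;
    }

    public static void main(String[] args) {
        System.out.println(authorAfterBy("A novel by James Patterson"));
        System.out.println(authorAfterBy("by peter m hayward"));
        System.out.println(authorAfterBy("Baby bystander"));
    }
}
```

One limitation to be aware of: the first standalone "by" wins, so a line such as "Stand by me by Stephen King" would yield "me by Stephen King". Cases like that are exactly what a table of test inputs is for.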
     
    Liutauras Vilda
    Marshal
    Posts: 4638
    316
    @OP

    I'm curious, if not a secret, is it some kind of industrial application or something else?

    So, what is your next plan?
     
    peter m hayward
    Ranch Hand
    Posts: 94
    2
    Liutauras Vilda wrote:@OP

    I'm curious, if not a secret, is it some kind of industrial application or something else?

    So, what is your next plan?


    This is an application I am building for my own use. A few family members and I are running an online book store, and the data entry is the killer, so I analyse all my actions and attempt to get the data entry system to mimic them, hence the complication.

    It goes far beyond what I have posted: there is MySQL stuff, JPEG loaders and postage evaluation, all in the application. This latest idea is to put the data that we have found generates sales into the listing. Often people are not aware that a given book is part of a set, so when they see "book n of the series" it encourages them to purchase the ones they do not have.

    I will take a good look at the idea you have shown me, thanks.
     
    Campbell Ritchie
    Marshal
    Posts: 55698
    163
    peter m hayward wrote:. . . mysql stuff . . .
    In which case, don't you have the description, author's name, book title and everything else in the database? So who needs to dissect such Strings? You can get those details from the database and work out their total length, and then you have the description by itself to shorten. Who needs to look for by?
     
    peter m hayward
    Ranch Hand
    Posts: 94
    2
    Campbell Ritchie wrote:
    peter m hayward wrote:. . . mysql stuff . . .
    In which case, don't you have the description, author's name, book title and everything else in the database? So who needs to dissect such Strings? You can get those details from the database and work out their total length, and then you have the description by itself to shorten. Who needs to look for by?

    Unfortunately the database does not contain every book published, and more become available each day. Sometimes a quick Google gets what we need, but the format is human-readable rather than machine-readable, hence the conversion. Also, as the data is entered by tens of thousands of people, it is not consistent, as you probably suspect. So extracting what we need is the job at hand, and the more that can be automated the better.
     
    Campbell Ritchie
    Marshal
    Posts: 55698
    163
    Yesterday, I wrote:. . . I suspect you may actually have an impossible task. . . .
    Maybe the publishers' websites will have details in a form you can scrape; otherwise I still think this might be a task that is impossible to automate. Sorry.
     
    Liutauras Vilda
    Marshal
    Posts: 4638
    316
    I guess I'm still missing an understanding about the root problem as there are quite a few unknowns to me.

    peter m hayward wrote:that is in the incoming data

    Where is the data coming from?

    peter m hayward wrote:and the isbn

    Do you always have ISBNs? In many of the examples you provided I don't see them, or maybe I can't read your very first post properly, as it is slightly unclear.
    There is an ISBN database where all the info about a book can be found in a nice format. They have an API to pull it.

    peter m hayward wrote:me a few family member are running an online book store

    peter m hayward wrote:I am trying to solve a string width issue were the object is to have a string that is no more than 80 characters in length and must include basic information

    If that is something you are running from scratch, why do you have such limitations in the first place on the string length, and why the "must"?

    And is the problem you are trying to solve right now to disassemble a messy string and put its pieces into the right fields of a database table? Or do you have just one column in a table, with a limit of 80 characters?
     
    Dave Tolls
    Rancher
    Posts: 2914
    36
    Do you have a serious data size limitation for your database?
    If not then I'm not sure why you are worrying about abbreviating a description.
    Just use a VARCHAR of suitable length (say a couple thousand characters or even more, you have 65k to spare in a row).
     
    Dave Tolls
    Rancher
    Posts: 2914
    36
    Liutauras Vilda wrote:
    There is an ISBN database where all the info about a book can be found in a nice format. They have an API to pull it.


    And now I can see an app that pulls from the local database and, if not found, does a search on that one.
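A rough sketch of that "local first, then remote" lookup is shown below. Nothing here models a real ISBN service or its API: the `Map` stands in for the local database, the `Function` stands in for whatever remote search is used, and all names are hypothetical.

```java
import java.util.Map;
import java.util.Optional;
import java.util.function.Function;

class BookLookup {
    private final Map<String, String> local;                  // stand-in for the local database
    private final Function<String, Optional<String>> remote;  // stand-in for an external ISBN search

    BookLookup(Map<String, String> local, Function<String, Optional<String>> remote) {
        this.local = local;
        this.remote = remote;
    }

    /** Looks in the local store first; on a miss, asks the remote source and caches any hit. */
    Optional<String> find(String isbn) {
        String hit = local.get(isbn);
        if (hit != null) {
            return Optional.of(hit);
        }
        Optional<String> fetched = remote.apply(isbn);
        fetched.ifPresent(details -> local.put(isbn, details)); // remember it for next time
        return fetched;
    }
}
```

Caching remote hits locally means each book only has to be fetched once, and an empty result tells the caller that the record still needs manual entry.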
     
    Konstantinos Fotiadis
    Greenhorn
    Posts: 1
    http://isbndb.com/search/all?query=Mary+Mary+%282005%29++%28Book+11+in+the+Alex+Cross+series%29++A+novel+by+James+Patterson+

    Use the first result...

    questions...

    1) Why the limitation on the length of the data?
    2) How many books in this system?

     
    Campbell Ritchie
    Marshal
    Posts: 55698
    163
    Welcome to the Ranch

    I am not sure whether peter m hayward still needs the information however.
     