
Random URL generator -> Critique appreciated

 
Peter Hoppe
Greenhorn
Posts: 4
I am working on a web project and needed some test URLs to exercise a URL input validator for a web form. Sadly, I found no test data set of real-world URLs, so I wrote a random URL generator which creates HTTP(S) links. I'd like the generator to create valid links so I can run my tests against those randomly generated URLs. For reference I delved into RFC 1738 (ouch, 20 years old), even though I deviated slightly (e.g. I'm creating https URLs, which aren't mentioned in RFC 1738). See below for the generator's source code. Could you kindly comment on the code? It's a bit of a quick job; I know there are imperfections (e.g. not many comments, method names start with a capital letter, some unused constants...), but I'd be interested in constructive criticism regarding fundamental mistakes I made:

1. Can you spot any mistakes I made?
2. How can I fix them?


Thank you so much for your consideration!

P

 
Winston Gutkowski
Bartender
Posts: 10575
Peter Hoppe wrote:I am working on a web project and needed some test URLs to exercise a URL input validator for a web form. Sadly, I found no test data set of real-world URLs, so I wrote a random URL generator which creates HTTP(S) links. I'd like the generator to create valid links so I can run my tests against those randomly generated URLs. For reference I delved into RFC 1738 (ouch, 20 years old), even though I deviated slightly (e.g. I'm creating https URLs, which aren't mentioned in RFC 1738). See below for the generator's source code. Could you kindly comment on the code?

Well, my first one is that it's rather long.

We're all volunteers here, and asking people to plough through over 400 lines of code, however nicely documented and formatted (and it certainly appears to be that), isn't likely to produce a lot of responses. Is there any way you could shorten it? - eg, perhaps only include the parts you think might be causing problems.

Second: I'm not quite sure what a "random" URL generator will give you - even if it can spew out valid URLs for you - because unless the URL actually exists, you won't be able to connect to it. Perhaps you could explain how you intend to use it.

About the only thing I could imagine it might be useful for is checking whether a URL "validator" actually works; and it would seem to me that you then have a bit of a "chicken and egg" situation:
  • You can't validate a URL without knowing what a "valid" one looks like.
  • You can't write a program to generate valid URLs without knowing what one looks like.
  • and, unless you're very careful, you could easily end up creating a validator that simply reverse-engineers your generator - including any mistakes it makes.

Third: Even assuming that there are uses for such an animal, a truly random generator is only likely to be good for "smoke tests", and may never (or only very rarely) produce "corner cases" - ie, URLs that are particularly long, problematic, or obscure.

Fourth: Have you tried looking for an existing library to do this? My Google for "valid URL generator" produced a slew of results; although I have to admit that none leap out at me as a "solution" for what you appear to want. I do know that there are any number of solutions (most involving regular expressions) for validating a URL, though.

HIH

Winston

PS: Welcome to JavaRanch, Peter!
     
Peter Hoppe
Hi Winston!

Winston Gutkowski wrote:Well, my first one is that it's rather long.

We're all volunteers here, and asking people to plough through over 400 lines of code, however nicely documented and formatted (and it certainly appears to be that), isn't likely to produce a lot of responses. Is there any way you could shorten it? - eg, perhaps only include the parts you think might be causing problems.

Oops - good point - sorry! The helpfulness here is exactly why I appreciate your answer so much!

Winston Gutkowski wrote:however nicely documented and formatted (and it certainly appears to be that)

I try to write well-formatted and documented code and use the Taligent naming conventions; I have always found them very useful. Thank you for the encouragement!

Winston Gutkowski wrote:Second: I'm not quite sure what a "random" URL generator will give you - even if it can spew out valid URLs for you - because unless the URL actually exists, you won't be able to connect to it. Perhaps you could explain how you intend to use it.

I'm working on a website which runs inside a small intranet (only accessible from inside). The website has an "external links" section; links can be edited by someone with admin rights. I'd like to provide basic validation when a link is submitted: first by the respective web page before submission (for visual feedback if the link is wrong), and then by the server.

I need to test the validator, and the URL generator is part of the test framework. During a test the generator creates many URLs which should all be syntactically valid; if the validator rejects any of them, it fails the test. The validator uses regular expression(s), so I am basically testing those regular expressions. I'll probably have to create syntactically invalid URLs as well and let them loose on my validator; if any of those passes, then the validator fails the test too. So far the URL generator generates syntactically valid URLs only. I don't think the validator can be tested exhaustively, but at least I have more confidence that it is able to validate URLs for correct syntax.
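The test loop described above could be sketched like this. To be clear: both the generator and the regex below are simplified, hypothetical stand-ins, not the project's actual code; the idea is only that every generated URL must pass the validator or the run fails.

```java
import java.util.Random;
import java.util.regex.Pattern;

// Illustrative only: a toy generator and a toy validator pattern,
// standing in for the real ones described in the post.
public class ValidatorSmokeTest {

    private static final Random RND = new Random();
    private static final String HOST_CHARS = "abcdefghijklmnopqrstuvwxyz0123456789";

    // Hypothetical validator regex -- far simpler than an RFC-grade one.
    private static final Pattern URL_PATTERN =
            Pattern.compile("https?://[a-z0-9.-]+(:\\d+)?(/\\S*)?");

    static String randomUrl() {
        StringBuilder sb = new StringBuilder(RND.nextBoolean() ? "https://" : "http://");
        for (int i = 0; i < 8; i++) {
            sb.append(HOST_CHARS.charAt(RND.nextInt(HOST_CHARS.length())));
        }
        return sb.append(".example.com").toString();
    }

    static boolean isValid(String url) {
        return URL_PATTERN.matcher(url).matches();
    }

    public static void main(String[] args) {
        // The harness: every generated URL must pass, or the test fails.
        for (int i = 0; i < 1000; i++) {
            String url = randomUrl();
            if (!isValid(url)) {
                throw new AssertionError("Validator rejected generated URL: " + url);
            }
        }
        System.out.println("All generated URLs passed the validator.");
    }
}
```

The same loop, run against a list of deliberately invalid URLs with the pass/fail condition inverted, would cover the negative half of the test.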

Winston Gutkowski wrote:because unless the URL actually exists, you won't be able to connect to it.

Excellent point! My definition of "valid" was "syntactically correct". But the definition scales upwards - I could add more validation tests, e.g. whether a URL actually connects to an existing resource, and whether a URL is malicious. The connection is easy to test, but whether it's malicious - oh dear. I don't think I could design that myself; maybe there are online services that allow automated queries on URLs and return a verdict on whether a URL is malicious. Thank you for sparking some ideas!

Winston Gutkowski wrote:About the only thing I could imagine it might be useful for is checking whether a URL "validator" actually works; and it would seem to me that you then have a bit of a "chicken and egg" situation:
• You can't validate a URL without knowing what a "valid" one looks like.
• You can't write a program to generate valid URLs without knowing what one looks like.
• and, unless you're very careful, you could easily end up creating a validator that simply reverse-engineers your generator - including any mistakes it makes.

Another excellent point! For the 'knowing' part I went to RFC 1738 and practically reverse-engineered the BNF grammar in section 5. I wouldn't use the test specimen (i.e. reverse-engineer my validator) to create the URL generator; that would defeat the tests. Thinking about it, I might as well have transformed the grammar into something readable by the Dada engine and used that engine to create a huge test data corpus. For now, the URL generator creates links such as



They are completely nonsensical, but should be syntactically correct (I need to cross-check against the RFC's BNF grammar).
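The grammar-to-code transformation described above might look roughly like this. This is a loose, hypothetical rendering of a few RFC 1738 productions, not the actual 400-line generator: the method names mirror the BNF rule names, but the character sets and repetition ranges are trimmed for illustration.

```java
import java.util.Random;

// Sketch: turning a few RFC 1738 productions into code. Method names
// mirror the BNF rules; character sets are simplified for illustration.
public class Rfc1738Sketch {

    private final Random rnd = new Random();

    private static final String SAFE = "$-_.+";
    private static final String ALPHA_DIGIT = "abcdefghijklmnopqrstuvwxyz0123456789";

    // Emit between min and max characters drawn from the given set.
    private String run(String set, int min, int max) {
        int len = min + rnd.nextInt(max - min + 1);
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < len; i++) {
            sb.append(set.charAt(rnd.nextInt(set.length())));
        }
        return sb.toString();
    }

    // hostname = *[ domainlabel "." ] toplabel   (simplified)
    String hostname() {
        return run(ALPHA_DIGIT, 2, 10) + "." + run("abcdefghijklmnopqrstuvwxyz", 2, 4);
    }

    // hpath = hsegment *[ "/" hsegment ]   (simplified to one or two segments)
    String hpath() {
        String path = run(ALPHA_DIGIT + SAFE, 1, 8);
        if (rnd.nextBoolean()) path += "/" + run(ALPHA_DIGIT + SAFE, 1, 8);
        return path;
    }

    // httpurl = "http://" hostport [ "/" hpath [ "?" search ]]
    // (https added on top of the RFC, as in the post)
    String httpUrl() {
        String url = (rnd.nextBoolean() ? "https://" : "http://") + hostname() + "/" + hpath();
        if (rnd.nextBoolean()) url += "?" + run(ALPHA_DIGIT, 1, 12);
        return url;
    }

    public static void main(String[] args) {
        Rfc1738Sketch g = new Rfc1738Sketch();
        for (int i = 0; i < 5; i++) System.out.println(g.httpUrl());
    }
}
```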

Winston Gutkowski wrote:Third: Even assuming that there are uses for such an animal, a truly random generator is only likely to be good for "smoke tests", and may never (or only very rarely) produce "corner cases" - ie, URLs that are particularly long, problematic, or obscure.

Good observation, thank you! I think a simple smoke test is already quite effective, given the use case. If the validator wrongly complains about corner cases, then I'd ask the admins to shout at me. Come to think of it, I could set up a report feature ("you dipstick, this editor fails my perfectly good URL") and have the server write any reported URL into a log file. Then the admins would shout at me by means of the log file, and I could fix the validator!

Winston Gutkowski wrote:Fourth: Have you tried looking for an existing library to do this? My Google for "valid URL generator" produced a slew of results; although I have to admit that none leap out at me as a "solution" for what you appear to want. I do know that there are any number of solutions (most involving regular expressions) for validating a URL though.

I did have an extensive look for existing test data. All I found was a 3 GB monster of web server logs (http://www.archive.org/download/httparchive_downloads_Nov_15_2014/httparchive_Nov_15_2014_requests.csv.gz); at the time I thought that would be overkill. I also found a few other data sources, but they were all from particular web servers. I need a data corpus which gives me a random selection of URLs, but I found it hard to locate anything (I gave up after about two hours of looking). Maybe I will download those 3 GB after all and clean them up so I just have a list of URLs.

Another idea I had was to use wget in recursive spider mode on a Google search for the word "the", and have wget write all visited URLs into a text file (I'd have hit Ctrl-C after a few hours). I tried several search engines that way, but in the end I didn't feel good about hammering them with thousands of web requests, so I hit Ctrl-C quite quickly on each one of them and threw that approach away.

I did try your web search, but it hasn't come up with anything that creates a random selection of URLs. I did look up random URL generators, but most of them were Javascript snippets which simply return a random URL from a fixed set, which isn't very helpful in my case either. But thank you so much for taking the trouble to look it up on Google!

In summa - you have given me some very valuable ideas on how to take the validator further. I also value your cautions about not reverse-engineering the validator and about poor coverage of corner cases! I know my post was insanely long and could have been a bit better prepared... I did it under a bit of time pressure. My excuse: I have sometimes found that asking stupid questions yields intelligent answers!

Your answer was certainly very intelligent! Thank you for your thoughts and for taking the time to write!

P

P.S. - Thank you for your welcome!
     
Winston Gutkowski
Peter Hoppe wrote:Thank you for your thoughts and taking the time to write!

You're most welcome.

One final point: that "chicken and egg" thing could be a real bugger. If you're writing a generator to create test data for someone else's "validator", I suspect you'll be on reasonable ground; but if it's to test your own, then you should tread very carefully.

If your generator only produces valid URLs, you could be lulled into a false sense of security if the data always comes back 'valid'. If I were writing something like this, I think I'd try to write one that also produces deliberately "invalid" ones from time to time - especially ones that look like they "ought" to be right.

That takes quite a lot of thought (and ingenuity), though. And it most certainly won't be "random".
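One hypothetical way to get such "almost right" invalid URLs is to take a known-good URL and apply exactly one deliberate defect. The mutation list and the sample URL below are made up for illustration; a real suite would want many more defect kinds.

```java
// Sketch: derive "plausible but invalid" URLs by applying one
// deliberate defect to a known-good URL. The mutations are illustrative;
// each produces something that looks like it "ought" to be right.
public class InvalidUrlMutator {

    static String mutate(String validUrl, int mutation) {
        switch (mutation) {
            case 0:  return validUrl.replaceFirst("://", ":/");     // lost a slash
            case 1:  return validUrl.replaceFirst("http", "htttp"); // scheme typo
            case 2:  return validUrl + " ";                         // trailing space
            case 3:  return validUrl.replaceFirst("\\.", "..");     // doubled dot in host
            default: return validUrl.replace("/", "\\");            // wrong separators
        }
    }

    public static void main(String[] args) {
        String good = "https://intranet.example.com/links?id=42"; // hypothetical URL
        for (int i = 0; i < 5; i++) {
            System.out.println(mutate(good, i));
        }
    }
}
```

A validator that accepts any of these mutants has a hole that purely random valid URLs would never reveal.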

HIH

Winston
     
Junilu Lacar
Sheriff
Peter Hoppe wrote:
I try to write well formatted and documented code and use the Taligent naming conventions. I have always found them very useful.

Those conventions were written for C++ programs, and most Java coding conventions that I know of discourage a lot of the things they prescribe. The book "Clean Code" by Robert Martin has a whole chapter on choosing good names; you might want to look at that. Also check out the Google Java Style Guide for the more common Java naming conventions.
     
Junilu Lacar
Another approach to consider: give the person entering URLs a way to test the URL before adding it. The test tries to connect to the given URL; if a positive response is received, you can tell the user the link looks good. Otherwise, you tell the user that the link might be broken or contain a typo. You can still add checks for malicious or suspicious URLs with this method. This is like how my database utility lets me test a DB connection and tells me whether the connection string I gave is good. You can do much the same thing with the "Preview" feature in these forums; I use it all the time to make sure any links I give are good before I submit a reply for posting.
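A minimal sketch of such a "test this link" check, assuming a plain java.net.HttpURLConnection with a HEAD request. The timeouts and the status-code policy (any 2xx/3xx counts as "looks good") are illustrative choices, not a prescription.

```java
import java.net.HttpURLConnection;
import java.net.URL;

// Sketch of a "test this link before saving it" check: send a HEAD
// request and treat any 2xx/3xx status as "looks good". The timeouts
// keep the UI from hanging on dead hosts. Purely illustrative.
public class LinkChecker {

    static boolean looksReachable(String url) {
        try {
            HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
            conn.setRequestMethod("HEAD");
            conn.setConnectTimeout(3000);
            conn.setReadTimeout(3000);
            int status = conn.getResponseCode();
            return status >= 200 && status < 400;
        } catch (Exception e) {   // malformed URL, DNS failure, timeout...
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(looksReachable("not a url"));   // false: malformed
    }
}
```

Note that this only says the link answered at the moment of the test; it says nothing about the URL being safe, which is where the malicious-link check would come in.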
     
Junilu Lacar
Peter Hoppe wrote:The validator uses regular expression(s), so I am basically testing those regular expressions.

I have a quote saved just for occasions like this:
Jamie Zawinski, in a Tue 12 Aug 1997 Usenet post, wrote: Some people, when confronted with a problem, think “I know, I’ll use regular expressions.” Now they have two problems.
     
Peter Hoppe
Wow, thanks everyone! It's dawning on me that a regex isn't the best solution for validating URLs (as Junilu points out, and Winston hints at). Thanks for the quote - that made me chuckle! Only my poor regular expression engine wasn't so amused... she is now very depressed, feeling rejected, abandoned, sliced'n'diced... Now I have a very depressed regular expression engine sitting in the corner, crying... awwwww

With the URL test, I'll implement the main test on the server. For each submitted link, the server will:

  • test whether the link starts with 'http://' or 'https://', as I don't want to accept other schemes, e.g. ftp links.
  • submit the link to a malicious-link tester and await the results. For example, VirusTotal offer a public API for that, but it is highly asynchronous: since they test each URL with multiple malware checkers, the verdict arrives some time later and has to be fetched with a separate request.
  • for each good link, do a short connection test.


  • If a link fails the server's tests it will be marked 'disabled' in the backend database, so it can be corrected by the admin. I'll need some notification mechanism for those cases (send a mail?).
  • Any good link will make it to the External Links section.

I think it's important to test each link for badness before doing a connection test - I wouldn't want to connect to a link that's been marked malicious.

There will still be some very basic testing on the GUI side (web page), checking whether the URL field starts with http:// or https:// and flagging links that fail this very basic test (visual feedback). This should alert the editor in case (s)he tries to submit an empty URL field, an ftp link, etc. Client-side checking is just for convenience, really; the data has to be properly checked on the server. Lo and behold, work for my poor regex engine! Come on, Regex, cheer up
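For what it's worth, the scheme-prefix test in the first step needs no regular expression at all; a plain string check suffices. A sketch, with only the two schemes from the post allowed:

```java
// Sketch of the "only http:// or https:// allowed" check described
// above -- a simple prefix test, no regex required.
public class SchemeCheck {

    static boolean hasAllowedScheme(String url) {
        if (url == null) return false;
        String s = url.trim().toLowerCase();
        return s.startsWith("http://") || s.startsWith("https://");
    }

    public static void main(String[] args) {
        System.out.println(hasAllowedScheme("https://wiki.example.com")); // true
        System.out.println(hasAllowedScheme("ftp://files.example.com"));  // false
        System.out.println(hasAllowedScheme(""));                         // false
    }
}
```

The same check works verbatim on the client side (in its Javascript equivalent) and on the server, which keeps the two in sync for this particular rule.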

With respect to programming style - oh dear, now I've stepped into a hornet's nest... I looked at the Google guidelines; thank you for pointing them out to me. I think keeping to agreed conventions is important when collaborating with people, and I'd be the last one to contest established rules when working with others. Thankfully, I have good formatting facilities in my IDE (Eclipse). I don't want to religiously defend the use of Taligent's guidelines; they have just worked very well for my personal projects. But no problem, I can always make my constants uppercase...

And now, just for some nefarious amusement: I got curious, thought again "What the heck", and recoded the URL generator as a grammar script for the Dada engine (yes, I am dingdong). I used the grammar in RFC 1738 as a template. The engine produces beautifully nonsensical URLs:



These don't even make sense to me; for example, the query string in the last URL smells very strange! But no matter, it was just an exercise out of interest, no more. Here's the Dada engine script:



Interestingly, the Dada script looks very similar to the BNF grammar in RFC 1738, section 5. It's interesting to note that the rules are actually quite loose! For example, according to this grammar, the search rule (line 55) doesn't have the key/value structure we normally find in practice (?key1=value1&key2=value2...), but allows a random jumble of characters from a fixed set! This explains the strange search fragment in the above URL list. The rules for the IP address and port are similarly loose, but I just couldn't take it and made those a bit more restrictive.

Please note, should you find yourself actually trying this out: the Dada engine seems to reseed its random generator each second, so it produces unique URLs only once per second. Just a small curiosity with that engine.

Once again, thanks very much for all the comments and the effort to help!

Peter
     