Hi Winston!
Well, my first one is that it's rather long.
We're all volunteers here, and asking people to plough through over 400 lines of code, however nicely documented and formatted (and it certainly appears to be that), isn't likely to produce a lot of responses. Is there any way you could shorten it? - eg, perhaps only include the parts you think might be causing problems.
Oops - good point - sorry! I appreciate the helpfulness - that's why I value your answer so much!
however nicely documented and formatted (and it certainly appears to be that)
I try to write well-formatted and documented code and use the Taligent naming conventions. I have always found them very useful. Thank you for the encouragement!
Second: I'm not quite sure what a "random" URL generator will give you - even if it can spew out valid URLs for you - because unless the URL actually exists, you won't be able to connect to it. Perhaps you could explain how you intend to use it.
I'm working on a website which runs inside a small intranet (only accessible from inside). The website has an "external links" section; links can be edited by someone with admin rights. I'd like to provide basic validation when a link is submitted: first on the web page itself before submission (for visual feedback if something is wrong) and then on the server.

I need to test the validator, and the URL generator is part of the test framework. During a test the generator creates many URLs which should all be syntactically valid; if the validator rejects any of them, it fails the test. The validator uses regular expressions, so I am basically testing those regular expressions. I'll probably have to create syntactically invalid URLs as well and let them loose on my validator; if any of those passes, the validator fails the test too. So far the URL generator generates syntactically valid URLs only. I don't think the validator could be tested exhaustively, but at least I have more confidence that it's able to validate URLs for correct syntax.
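To make the test idea concrete, here is a minimal Python sketch of the loop I have in mind. Both the regex and the generator below are toy placeholders (mine follow RFC 1738 much more closely), so treat the names and patterns as illustrative only:

```python
import random
import re

# Toy stand-in for the real validator regex -- NOT RFC 1738 complete.
URL_RE = re.compile(r'^https?://[0-9A-Za-z.-]+(?::\d{1,5})?(?:/\S*)?$')

def is_valid_url(url):
    """Return True if url matches the (simplified) validator pattern."""
    return URL_RE.match(url) is not None

def random_url(rng):
    """Toy generator: builds a syntactically valid URL from random parts."""
    scheme = rng.choice(["http", "https"])
    host = ".".join(str(rng.randint(1, 254)) for _ in range(4))
    port = ":%d" % rng.randint(1, 65535) if rng.random() < 0.5 else ""
    return "%s://%s%s" % (scheme, host, port)

def test_validator(n=1000, seed=42):
    """Generate n valid URLs; return the ones the validator wrongly rejects."""
    rng = random.Random(seed)
    return [u for u in (random_url(rng) for _ in range(n))
            if not is_valid_url(u)]
```

The test passes when `test_validator()` returns an empty list; the same harness can be flipped around to feed in deliberately invalid URLs and insist they are all rejected.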
because unless the URL actually exists, you won't be able to connect to it.
Excellent point! My definition of "valid" was "syntactically correct". But the definition scales upwards - I could add more validation tests, e.g. whether a URL actually connects to an existing resource, and whether it is malicious. The connection is easy to test, but whether it's malicious - oh dear. I don't think I could design that myself; maybe there are online services that allow automated queries on URLs and return a verdict on whether a URL is malicious. Thank you for sparking some ideas!
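The connection check could be as simple as this sketch (standard library only; it treats any HTTP response, even an error status like 404, as "the resource's server exists"):

```python
import urllib.error
import urllib.request

def url_connects(url, timeout=5):
    """Return True if the server behind url answers at all."""
    try:
        req = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(req, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The server responded (e.g. 404), so it at least exists.
        return True
    except (urllib.error.URLError, OSError, ValueError):
        # DNS failure, refused connection, timeout, or malformed URL.
        return False
```

Whether "connects" should mean "returns 2xx" or merely "answers" is a policy choice; the sketch takes the looser interpretation.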
About the only thing I could imagine it might be useful for is checking whether a URL "validator" actually works; and it would seem to me that you then have a bit of a "chicken and egg" situation:
You can't validate a URL without knowing what a "valid" one looks like.
You can't write a program to generate valid URLs without knowing what one looks like.
and, unless you're very careful, you could easily end up creating a validator that simply reverse-engineers your generator - including any mistakes it makes.
Another excellent point! For the 'knowing' part I went to RFC 1738 and practically reverse-engineered the BNF grammar in section 5. I wouldn't use the test specimen (i.e. reverse-engineer my validator) to create the URL generator; that would defeat the tests. Thinking about it, I might as well have transformed the grammar into something readable by the Dada engine and used that engine to create a huge test data corpus. For now, the URL generator creates links such as
https://211.34.83.162
http://130.17.174.40/(/%C2%13K/%82xy%6DT#%06%0D%E7%ACu%91gO%FD
https://31.102.186.43:11645
http://5OW.us:12813/D%E5/1%5C%3D%32%EE%3CqVw/%B8%D0*w%F7H#2%80n%CE%4Aw
http://LF6-h.info/T?%3C%6BE%8Fxw%7E%BC#%40k
They are completely nonsensical, but should be syntactically correct (I still need to cross-check them against the RFC's BNF grammar).
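As an illustration of the grammar-driven approach, a random generator can expand a grammar directly: each rule maps to a list of alternative productions, and a terminal is just a literal string. The fragment below is heavily simplified and the rule names are mine, not RFC 1738's:

```python
import random

# Tiny, illustrative fragment of an HTTP URL grammar:
# each rule is a list of alternatives; each alternative is a sequence.
GRAMMAR = {
    "url":    [["scheme", "://", "host", "port", "path"]],
    "scheme": [["http"], ["https"]],
    "host":   [["octet", ".", "octet", ".", "octet", ".", "octet"]],
    "port":   [[""], [":", "number"]],
    "path":   [[""], ["/", "segment"]],
}

def expand(symbol, rng):
    """Recursively expand a grammar symbol into a random string."""
    if symbol == "octet":
        return str(rng.randint(0, 255))
    if symbol == "number":
        return str(rng.randint(1, 65535))
    if symbol == "segment":
        return "".join(rng.choice("abcdefghij")
                       for _ in range(rng.randint(1, 8)))
    if symbol not in GRAMMAR:
        return symbol                      # terminal: literal text
    production = rng.choice(GRAMMAR[symbol])
    return "".join(expand(s, rng) for s in production)

rng = random.Random(7)
print(expand("url", rng))
```

The nice property is that the generator stays a direct transcription of the RFC's BNF rather than a mirror of the validator's regexes, which avoids the chicken-and-egg trap.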
Third: Even assuming that there are uses for such an animal, a truly random generator is only likely to be good for "smoke tests", and may never (or only very rarely) produce "corner cases" - ie, URLs that are particularly long, problematic, or obscure.
Good observation, thank you! I think a simple smoke test is already quite effective, given the use case. If the validator wrongly complains about corner cases, then I'd ask the admins to shout at me. Come to think of it, I could set up a report feature ("you dipstick, this editor fails my perfectly good URL") and have the server write any reported URL into a log file. Then the admins would shout at me by means of the log file, and I could fix the validator!
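Server-side, the report feature could be as trivial as appending to a file; a sketch (file name and record format made up):

```python
import datetime

def log_rejected_url(url, logfile="rejected_urls.log"):
    """Append a reported URL with a timestamp, so the admins' shouting
    ends up in a file instead of my inbox."""
    stamp = datetime.datetime.now().isoformat(timespec="seconds")
    with open(logfile, "a", encoding="utf-8") as fh:
        fh.write("%s\t%s\n" % (stamp, url))
```

Each logged entry then doubles as a new regression-test case for the validator.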
Fourth: Have you tried looking for an existing library to do this? My Google for "valid URL generator" produced a slew of results; although I have to admit that none leap out at me as a "solution" for what you appear to want. I do know that there are any number of solutions (most involving regular expressions) for validating a URL though.
I did have an extensive look for existing test data - all I found was a 3 GB monster of web server logs (http://www.archive.org/download/httparchive_downloads_Nov_15_2014/httparchive_Nov_15_2014_requests.csv.gz). At the time I thought that would be overkill. I also found a few other data sources, but they were all from particular web servers. I need a data corpus which gives me a random selection of URLs, but I found it hard to locate one (I gave up after about two hours of looking). Maybe I will download those 3 GB after all and clean them up so I just have a list of URLs. Another idea I had was to use wget in recursive spider mode on a Google search for the word "the" and have wget write all visited URLs into a text file (I'd have hit Ctrl-C after a few hours). I tried several search engines that way, but in the end I didn't feel good about hammering them with thousands of web requests, so I hit Ctrl-C quite quickly on each of them and threw that approach away.
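If I do download the archive, cleaning it up could be a streaming job roughly like this sketch, so the 3 GB never has to fit in memory. The column name "url" is a guess - I'd check the CSV header first:

```python
import csv
import gzip

def extract_urls(csv_gz_path, out_path, url_column="url", limit=None):
    """Stream a gzipped CSV and write only its URL column, one per line.
    'url_column' is an assumed header name -- verify against the file."""
    with gzip.open(csv_gz_path, "rt", encoding="utf-8",
                   errors="replace") as src, \
         open(out_path, "w", encoding="utf-8") as dst:
        reader = csv.DictReader(src)
        for i, row in enumerate(reader):
            if limit is not None and i >= limit:
                break
            url = (row.get(url_column) or "").strip()
            if url.startswith(("http://", "https://")):
                dst.write(url + "\n")
```

The `limit` parameter makes it easy to pull out, say, the first 100,000 URLs as a manageable corpus instead of all 3 GB.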
I did try your web search, but it didn't come up with anything that creates a random selection of URLs. I did look up random URL generators, but most of them were JavaScript snippets which simply return a random URL from a fixed set, which isn't very helpful in my case either. But thank you so much for taking the trouble to look it up on Google!
In sum - you have given me some very valuable ideas for how to take the validator further. I also value your cautions about not reverse-engineering the validator and about poor coverage of corner cases! I know my post was insanely long and could have been a bit better prepared... I did it under a bit of time pressure. My excuse: I have sometimes found that asking stupid questions yields intelligent answers!
Your answer was certainly very intelligent! Thank you for your thoughts and for taking the time to write!
P
P.S. - Thank you for your welcome!