
Regular expression to find comments

 
Ruslan Salimovich
Greenhorn
Posts: 25
Java
Could you please say how to create a regex for the following task:

"Write a program that reads a Java source-code file (you provide the file name on the command line) and displays all the comments."

For now I have created this one: regex = "^[^\"]*(//.*)";

But:
1) it does not find comments like these /*....*/
2) it does not find comments that are located after " signs, like this: String s = "this is me"; //Comment
 
fred rosenberger
lowercase baba
Bartender
Posts: 12565
49
Chrome Java Linux
Regular expressions are not easy things to begin with. They are even more difficult when you are trying to search for something that spans multiple lines.

Have you considered that they may NOT be the way to solve this particular problem?
 
Junilu Lacar
Sheriff
Posts: 11494
180
Android Debian Eclipse IDE IntelliJ IDE Java Linux Mac Spring Ubuntu
Ruslan Salimych wrote: Could you please say how to create a regex for the following task:

Well, that would amount to us doing your homework for you, which we don't do around here.

Here's something to consider though:

You probably want to use more than one regex because there are quite a few ways comments can be written:
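
As a rough illustration of that variety, here is a small sketch of the forms a comment can take in Java source, plus two lines that only look like comments. The class name CommentForms and its fields are made up for this example.

```java
/**
 * Javadoc comment: a block comment that starts with two asterisks.
 */
final class CommentForms {

    /* Block comment on a single line. */
    static final int A = 1;

    /*
     * Block comment that spans
     * several lines.
     */
    static final int B = 2; // End-of-line comment after code.

    // End-of-line comment on its own line.
    static final int C = /* block comment in the middle of a statement */ 3;

    // Not comments: the markers below sit inside string literals.
    static final String D = "// still part of the string";
    static final String E = "/* also part of the string */";

    public static void main(String[] args) {
        System.out.println(A + B + C);  // prints 6
        System.out.println(D);
        System.out.println(E);
    }
}
```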


Of course, you may want to start with just a few simple cases first and then work your way up to handling the more complicated ones later.

EDIT: And then there's the excellent point that Fred makes: Do you really have to / should you use regex for this?
 
Knute Snortum
Sheriff
Posts: 4288
127
Chrome Eclipse IDE Java Postgres Database VI Editor
it does not find comments like these /*....*/

I believe this is one of the situations where a regex cannot be found to do the job.

it does not find comments that are located after " signs, like this: String s = "this is me"; //Comment

I would go at it like this: find a regex that matches the comment at the beginning of the line (you've done this), find a regex that matches at the end of a line, then combine them with an alternation character (|).
 
Stephan van Hulst
Saloon Keeper
Posts: 7993
143
For single line comments, you need to match "//" and everything that follows it, until the next end-of-line.

For multi-line comments, you need to match "/*" and everything that follows it, until the next "*/".

You can combine the two patterns into a single find operation.

Despite the general overuse of regular expressions, I think this is a prime example of a task for which regular expressions are perfect.
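
The two patterns described above could be combined roughly as follows. This is a minimal sketch, not a complete solution: the class name NaiveCommentFinder is made up, and the pattern knows nothing about string or character literals, so it will report a "//" or "/*" that appears inside quotes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class NaiveCommentFinder {

    // Either "//" up to the end of the line, or "/*" up to the next "*/".
    // DOTALL lets '.' cross line breaks inside a block comment; the
    // line-comment branch uses a negated class so it still stops at end-of-line.
    private static final Pattern COMMENT =
            Pattern.compile("//[^\\r\\n]*|/\\*.*?\\*/", Pattern.DOTALL);

    // Collects every match of COMMENT, in source order.
    static List<String> findComments(String source) {
        List<String> comments = new ArrayList<>();
        Matcher m = COMMENT.matcher(source);
        while (m.find()) {
            comments.add(m.group());
        }
        return comments;
    }

    public static void main(String[] args) {
        String source = "int a = 1; // first\n/* second\n   line */\nint b = 2;\n";
        findComments(source).forEach(System.out::println);
    }
}
```

Because the matcher scans left to right, whichever marker comes first decides the kind of comment: a "/*" inside a "//" comment, or a "//" inside a block comment, is simply swallowed by the match already in progress.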
 
Knute Snortum
Sheriff
Posts: 4288
127
Chrome Eclipse IDE Java Postgres Database VI Editor
For single line comments, you need to match "//" and everything that follows it, until the next end-of-line.

Hmm, what about this line?
 
Winston Gutkowski
Bartender
Posts: 10575
66
Eclipse IDE Hibernate Ubuntu
Ruslan Salimych wrote: Could you please say how to create a regex for the following task:
"Write a program that reads a Java source-code file (you provide the file name on the command line) and displays all the comments."

Well, the very first thing I'd do - before I write my first line of code or start worrying about regular expressions - is find out exactly what a "comment" in a Java source file is.

it does not find comments like these /*....*/

And this is just the type of thing I'm talking about. Can you describe the "rules" of a /*....*/ comment?
One that is plainly causing you a problem is that they are generally used for "multi-line" comments, but what about:
  /*.... /* .... */ .... */
or:
  /*.... /* .... */
or:
  // .... /* ....
  */
Are they valid "comments"?

Programming is NOT about coding; it is about thinking...and it often involves research too.
I don't know the answers to my question, but you can be darn sure I'd find out if I had to write this.

It's probably also worth mentioning that regexes are not good for some types of pattern-matching, and if '/*....*/' comments can be "nested", you'll have just run into one of them.
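
For reference, the Java Language Specification states that comments do not nest: a block comment ends at the first "*/", and "/*" has no special meaning inside a comment. A non-greedy regex happens to follow the same rule, as this small sketch shows (NestingDemo and firstBlockComment are made-up names).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class NestingDemo {

    // Non-greedy ".*?" stops at the first "*/", which mirrors Java's rule
    // that an inner "/*" does not open a second comment.
    private static final Pattern BLOCK =
            Pattern.compile("/\\*.*?\\*/", Pattern.DOTALL);

    // Returns the first block comment in the text, or null if there is none.
    static String firstBlockComment(String text) {
        Matcher m = BLOCK.matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // The comment ends at the first "*/"; the trailing " c */" is left over as code.
        System.out.println(firstBlockComment("/* a /* b */ c */"));
        // One complete comment; the inner "/*" is just comment text.
        System.out.println(firstBlockComment("/* a /* b */"));
    }
}
```

So the first example above would not compile (the leftover text is not valid code), the second is a single valid comment, and in the third the "//" comment ends at the line break, leaving a stray "*/" on the next line.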

HIH

Winston
 
Stephan van Hulst
Saloon Keeper
Posts: 7993
143
  • Likes 1
Knute Snortum wrote: Hmm, what about this line?

What about it? I can easily write a regex that will find the comment within that line.
 
Knute Snortum
Sheriff
Posts: 4288
127
Chrome Eclipse IDE Java Postgres Database VI Editor
Well, my point is that there is no comment in that line of code. The quote escapes the //. Or am I misunderstanding you?
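
The pitfall can be reproduced with a short sketch. The line with the URL is a hypothetical stand-in, not the line from the post above; the second pattern is the one from the first post, and the class and method names are made up.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class StringLiteralPitfall {

    // Naive pattern: "//" and everything after it on the line.
    private static final Pattern NAIVE = Pattern.compile("//.*");

    // The pattern from the first post: "//" counts only if no double quote
    // appears anywhere before it on the line.
    private static final Pattern NO_QUOTE_BEFORE = Pattern.compile("^[^\"]*(//.*)");

    // Returns what NAIVE reports as a comment, or null.
    static String naive(String line) {
        Matcher m = NAIVE.matcher(line);
        return m.find() ? m.group() : null;
    }

    // Returns what NO_QUOTE_BEFORE reports as a comment, or null.
    static String noQuoteBefore(String line) {
        Matcher m = NO_QUOTE_BEFORE.matcher(line);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Hypothetical line that contains "//" but no comment.
        String literalOnly = "String url = \"http://example.com\";";
        // Line from the first post: a literal followed by a real comment.
        String literalThenComment = "String s = \"this is me\"; //Comment";

        System.out.println(naive(literalOnly));                // false positive
        System.out.println(noQuoteBefore(literalOnly));        // null, correctly
        System.out.println(noQuoteBefore(literalThenComment)); // null, misses a real comment
    }
}
```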
 
Ruslan Salimovich
Greenhorn
Posts: 25
Java
Stephan van Hulst wrote:
Knute Snortum wrote: Hmm, what about this line?

What about it? I can easily write a regex that will find the comment within that line.


The thing is, there is no comment here; this is just a single string. But my aforementioned regex works fine in this situation.
 
Stephan van Hulst
Saloon Keeper
Posts: 7993
143
Ahh I'm sorry, I thought you were just talking about the String literal, I didn't even consider the entire statement XD

Well, the same is true for comments within comments. You can solve this problem by including an alternative in your regex that matches character and string literals; the find algorithm will then match whichever construct starts first. All you have to do then is discard the matches that start with a single quote or a double quote.
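
That idea might look like the following sketch. The names LiteralAwareFinder and TOKEN are made up, and several things are deliberately left out: text blocks ("""), unterminated literals, and Unicode escapes are not handled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LiteralAwareFinder {

    // Four alternatives; each literal branch accepts backslash escapes so an
    // escaped quote does not end the literal early.
    private static final Pattern TOKEN = Pattern.compile(
            "\"(?:\\\\.|[^\"\\\\\\r\\n])*\""   // string literal
          + "|'(?:\\\\.|[^'\\\\\\r\\n])*'"     // character literal
          + "|//[^\\r\\n]*"                     // end-of-line comment
          + "|/\\*.*?\\*/",                     // block comment
            Pattern.DOTALL);

    static List<String> findComments(String source) {
        List<String> comments = new ArrayList<>();
        Matcher m = TOKEN.matcher(source);
        while (m.find()) {
            String token = m.group();
            char first = token.charAt(0);
            // Literals are matched only so the scan skips over them.
            if (first != '"' && first != '\'') {
                comments.add(token);
            }
        }
        return comments;
    }

    public static void main(String[] args) {
        String source = String.join("\n",
                "String url = \"http://example.com\"; // real comment",
                "System.out.println(\"Use \\\"//\\\" to comment your code\");",
                "char slash = '/'; /* block */");
        findComments(source).forEach(System.out::println);
    }
}
```

The sample input includes a string with escaped quotes around "//"; the string branch consumes it whole, so only the two real comments are reported.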
 
Paul Clapham
Sheriff
Posts: 22841
43
Eclipse IDE Firefox Browser MySQL Database
  • Likes 1


Don't forget about the possibility of Unicode escapes in the source code.
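
To make that concrete: the compiler translates Unicode escapes before it looks for comments, so \u002F\u002F in a source file starts a comment just as // does. One way to cope is to decode the escapes first and search afterwards. The sketch below assumes that approach; UnicodeEscapes and decode are made-up names, and the decoding is simplified.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class UnicodeEscapes {

    // A backslash, one or more 'u', then exactly four hex digits.
    private static final Pattern ESCAPE =
            Pattern.compile("\\\\u+([0-9a-fA-F]{4})");

    // Replaces each escape with the character it denotes.
    // Simplification: the real rule also requires the backslash to be
    // preceded by an even number of backslashes; that check is omitted here.
    static String decode(String source) {
        Matcher m = ESCAPE.matcher(source);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            char decoded = (char) Integer.parseInt(m.group(1), 16);
            // quoteReplacement guards against a decoded '$' or backslash.
            m.appendReplacement(out, Matcher.quoteReplacement(String.valueOf(decoded)));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // The doubled backslashes keep the compiler from translating the
        // escapes in this file; at run time the string holds single ones.
        String raw = "int x = 1; \\u002F\\u002F hidden comment";
        System.out.println(decode(raw));
    }
}
```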
 
Winston Gutkowski
Bartender
Posts: 10575
66
Eclipse IDE Hibernate Ubuntu
Paul Clapham wrote: Don't forget about the possibility of Unicode escapes in the source code.

Good point mah man. Have a cow.

@Ruslan: See? More "rules" you need to know about if you want an industrial-strength solution.

Winston
 
Ruslan Salimovich
Greenhorn
Posts: 25
Java
Paul Clapham wrote:

Don't forget about the possibility of Unicode escapes in the source code.


Thank you!
So I see that I first have to define what a comment is, and only then start solving the problem, since there are a lot of variants of comments.
 
Winston Gutkowski
Bartender
Posts: 10575
66
Eclipse IDE Hibernate Ubuntu
Stephan van Hulst wrote: What about it? I can easily write a regex that will find the comment within that line.

Hmmm, really? And what about
  System.out.println("Use \"//\" to comment your code");
or indeed
  System.out.println("Use ""//"" to comment your code");
? (I forget if Java allows the last one, but several languages do)

The fact is that these are semantic rules, and regexes are NOT (generally) very good for dealing with them.

Winston
 
Winston Gutkowski
Bartender
Posts: 10575
66
Eclipse IDE Hibernate Ubuntu
  • Likes 1
Ruslan Salimych wrote: So I see that I first have to define what a comment is, and only then start solving the problem, since there are a lot of variants of comments.

Not just variants. Knute's post shows that there is also "context" to this problem - and that's where regexes run into problems.

1. A "start of comment" is ONLY the start of a comment if it is not in quotes.
2. Conversely, "quotes" (for the purposes of this problem) only exist outside a comment.
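
Those two rules are exactly what a small hand-written scanner keeps track of. Here is a minimal sketch of one, with made-up names (CommentScanner, State); it ignores Unicode escapes and text blocks, and it silently drops a block comment that is never closed.

```java
import java.util.ArrayList;
import java.util.List;

final class CommentScanner {

    private enum State { CODE, STRING, CHAR, LINE_COMMENT, BLOCK_COMMENT }

    static List<String> findComments(String src) {
        List<String> comments = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        State state = State.CODE;
        for (int i = 0; i < src.length(); i++) {
            char c = src.charAt(i);
            char next = i + 1 < src.length() ? src.charAt(i + 1) : '\0';
            switch (state) {
                case CODE:
                    // Rule 1: comment markers count only outside quotes.
                    if (c == '/' && next == '/') {
                        state = State.LINE_COMMENT;
                        current.append("//");
                        i++;
                    } else if (c == '/' && next == '*') {
                        state = State.BLOCK_COMMENT;
                        current.append("/*");
                        i++;
                    } else if (c == '"') {
                        state = State.STRING;
                    } else if (c == '\'') {
                        state = State.CHAR;
                    }
                    break;
                case STRING:
                case CHAR:
                    if (c == '\\') {
                        i++; // skip the escaped character, e.g. an escaped quote
                    } else if (c == (state == State.STRING ? '"' : '\'')) {
                        state = State.CODE;
                    }
                    break;
                case LINE_COMMENT:
                    // Rule 2: quotes inside a comment are just comment text.
                    if (c == '\n' || c == '\r') {
                        comments.add(current.toString());
                        current.setLength(0);
                        state = State.CODE;
                    } else {
                        current.append(c);
                    }
                    break;
                case BLOCK_COMMENT:
                    current.append(c);
                    if (c == '*' && next == '/') {
                        current.append('/');
                        i++;
                        comments.add(current.toString());
                        current.setLength(0);
                        state = State.CODE;
                    }
                    break;
            }
        }
        if (state == State.LINE_COMMENT) {
            comments.add(current.toString()); // comment ended at end of input
        }
        return comments;
    }

    public static void main(String[] args) {
        String src = "String s = \"// no\"; // yes\n/* a \"b\" */ char c = '\"';";
        findComments(src).forEach(System.out::println);
    }
}
```

Reading the file named on the command line and passing its contents to findComments would complete the exercise; that part is left out here.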

And let me remind you of the requirements:
"Write a program that reads a Java source-code file (you provide the file name on the command line) and displays all the comments."

No mention of regexes. You have assumed that they are the road to Nirvana - and I suspect you're wrong.

Winston
 
Stephan van Hulst
Saloon Keeper
Posts: 7993
143
Winston Gutkowski wrote: Hmmm, really? And what about [...]

Okay, I take back "easy". However, I would still use regular expressions in parts of my solution.
 