
Regular Expressions

 
colin shuker
Ranch Hand
Posts: 750
Hi, I'm not sure which section this comes under, so I've put it here.

Basically, I want to read PGN files (chess game files). A PGN file looks something like this...



First you have the tags [...], then the list of moves, then the result.
This is repeated for each following game.

I am able to read this data into one big String.
I need to be able to split this String into individual games,
e.g. ...

and

I had a Regular Expression that looked for the correct match, see below:

This works by locating a [ (the opening bracket of the first tag), then any characters, then a ] (the closing bracket of the last tag), then a run of characters that are not square brackets (the moves and result).

Since *? is not greedy, it won't collect everything between the first [ and the last ], and since + is greedy, it will grab the moves and result.
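
To make the greedy versus non-greedy (reluctant) difference concrete, here is a minimal Java sketch. It is not the pattern from the post above; the class name, the helper method, and the sample tag text are all made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyVsReluctant {
    // Returns the first match of regex in input, or null if there is none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Hypothetical sample: two tags on one line.
        String tags = "[Event \"Test\"] [Site \"Here\"]";

        // Greedy .* runs to the LAST ']' it can reach.
        System.out.println(firstMatch("\\[.*\\]", tags));

        // Reluctant .*? stops at the FIRST ']' it can reach.
        System.out.println(firstMatch("\\[.*?\\]", tags));
    }
}
```

The greedy version returns both tags as one match, while the reluctant version returns only the first tag.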

HOWEVER... I came across a PGN file with nested [] tags, which breaks this.
I was wondering: would it perform better to scan through the string myself to collect the tags, moves, and result, or to use a regular expression?

I think a regular expression is generally the wise choice, but the text files I'm using may be very large (1000 games or more), and since nested [] are possible, I would need a more complex regular expression to handle them.

Would it be faster to just scan (loop through the string) to get the required data?

Thanks for any advice.
[ August 06, 2007: Message edited by: colin shuker ]
 
Anand Hariharan
Rancher
Posts: 272
How many levels of nesting can you have? If this number is arbitrary, then you have to resort to something that is grammar based (I'd love to be proved wrong, of course).

Is there anything else you can take advantage of? E.g., your samples seem to suggest that newlines separate the moves from the headers of the next game.

Here is an UNTESTED hack that takes one (1) level of nesting into account.
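
As an illustration only (a sketch under my own assumptions, not necessarily the pattern Anand had in mind), a single tag with at most one level of nested brackets can be expressed in Java roughly like this. The class name `NestedTagPattern` and the method `isTag` are hypothetical.

```java
import java.util.regex.Pattern;

class NestedTagPattern {
    // '[' , then any mix of
    //   - single characters that are not brackets, or
    //   - an inner "[...]" pair that itself contains no brackets,
    // then the closing ']'.
    static final Pattern TAG =
            Pattern.compile("\\[(?:[^\\[\\]]|\\[[^\\[\\]]*\\])*\\]");

    // True if the whole string is one tag with at most one nesting level.
    static boolean isTag(String s) {
        return TAG.matcher(s).matches();
    }
}
```

A tag nested two levels deep is rejected by this pattern, which is the limitation mentioned above: each extra level needs another layer in the expression.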



- Anand

PS: Sorry if you are going through any of these emotions:

 
colin shuker
Ranch Hand
Posts: 750
Well, usually there are no nested [], but once I saw...
[something [ something ] something [ something ] ]

So I'm fairly sure 1 level of nesting (as above) will be enough.
But my question is: would it be better to write my own code to loop through the string, checking each character (notably '[' and ']' versus everything else), OR to use a regular expression,
so that I can split the String into its individual games?

I'm guessing a regular expression does this kind of scanning anyway, but the details are hidden from the user, so it may use more processing power than necessary on long strings.

So perhaps if I write a specific piece of code to do this specific job, it will perform better.

What do you think?
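
For comparison, the hand-written loop described above could be sketched as follows. This is only one possible approach, and it assumes the move text itself never contains '[' or ']' (bracket characters inside the moves, for example in comments, would break it). The names `PgnScanner` and `splitGames` are made up for the example.

```java
import java.util.ArrayList;
import java.util.List;

class PgnScanner {
    // Splits one big PGN string into games with a single pass.
    // A new game starts at a '[' found outside any tag (depth 0)
    // after some move text has been seen for the current game.
    static List<String> splitGames(String pgn) {
        List<String> games = new ArrayList<>();
        int depth = 0;             // current bracket nesting depth
        int start = -1;            // index where the current game begins
        boolean seenMoves = false; // non-blank text seen outside brackets
        for (int i = 0; i < pgn.length(); i++) {
            char c = pgn.charAt(i);
            if (c == '[') {
                if (depth == 0 && seenMoves) {
                    // First tag of the next game: close the current one.
                    games.add(pgn.substring(start, i).trim());
                    start = -1;
                    seenMoves = false;
                }
                if (start < 0) {
                    start = i;
                }
                depth++;
            } else if (c == ']') {
                if (depth > 0) {
                    depth--;
                }
            } else if (depth == 0 && start >= 0 && !Character.isWhitespace(c)) {
                seenMoves = true;
            }
        }
        if (start >= 0) {
            games.add(pgn.substring(start).trim());
        }
        return games;
    }
}
```

Because it only counts depth, this loop copes with any level of nesting inside a tag. Whether it is actually faster than a regular expression on a 1000-game file is something only a measurement on real data can settle.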
 
Henry Wong
author
Sheriff
Posts: 23283
I agree... since you are not going to extract the individual tags, much less the nested tags, why do you need to match the tags themselves? Why can't you find a pattern that fits the description for a game, which is what you do want to extract?

To do that, you can use the carriage return as part of your pattern. Try...



Note: There is probably an easier pattern. I tend to overuse look-aheads.
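
To show the general idea of splitting on line breaks with a look-ahead, here is a minimal Java sketch. It is an illustration under stated assumptions, not necessarily the pattern suggested in this post: it assumes games are separated by at least one blank line and that the first line of every game starts with '['. The names `GameSplitter` and `BETWEEN_GAMES` are hypothetical.

```java
import java.util.regex.Pattern;

class GameSplitter {
    // Two or more line breaks (each optionally followed by spaces or tabs),
    // but only where the next character is '['. The look-ahead is
    // zero-width, so the '[' stays with the game that follows.
    static final Pattern BETWEEN_GAMES =
            Pattern.compile("(?:\\r?\\n[ \\t]*){2,}(?=\\[)");

    // Splits the whole file contents into one string per game.
    static String[] splitGames(String pgn) {
        return BETWEEN_GAMES.split(pgn.trim());
    }
}
```

Since the pattern never looks inside the tags, nested brackets within a tag do not matter here. The blank line between the tags and the moves is not a split point, because the character after it is not '['.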

Henry
 