
Regex - Delimiter question

 
Nakataa Kokuyo
Ranch Hand
Posts: 189
Good day,

I need a delimiter to parse the records in this file. Can anyone guide me? Thanks in advance!

The file contains many records in the format below:


The result should be:

I am trying the code below, but I'm not sure how to skip the tab split when there is "()" in between.
 
Winston Gutkowski
Bartender
Posts: 10575
Nakataa Kokuyo wrote: I am trying the code below, but I'm not sure how to skip the tab split when there is "()" in between.

I don't understand what that last sentence means. Could you provide a precise example of each situation?

Thanks.

Winston

BTW: I'm pretty sure you don't have to escape '|' when it's inside square brackets; but you do need to escape the TAB, so
Pattern.compile("[|\\t]")
would be what you want for "'|' or TAB".
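
As a rough illustration of that character class (a minimal sketch; the sample strings are made up and are not the poster's data), splitting on "'|' or TAB" behaves like this. Note that it also splits on a TAB that sits between parentheses, which is the case the original question is about.

```java
import java.util.regex.Pattern;

class SplitDemo {
    // '|' needs no escape inside a character class; "\\t" in Java source
    // reaches the regex engine as \t, which matches a TAB character.
    static final Pattern DELIM = Pattern.compile("[|\\t]");

    static String[] fields(String line) {
        return DELIM.split(line);
    }

    public static void main(String[] args) {
        // Hypothetical sample lines, for illustration only.
        System.out.println(String.join(",", fields("a|b\tc")));
        // The TAB between the parentheses is treated as a delimiter too.
        System.out.println(String.join(",", fields("x(y\tz)|w")));
    }
}
```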
 
Nakataa Kokuyo
Ranch Hand
Posts: 189
Hey Winston,

Maybe it is easier to explain with the result; please read the first post, which I have updated with the expected result.

Sorry for the poor explanation.
 
Nakataa Kokuyo
Ranch Hand
Posts: 189
It should not split on a tab when the words are surrounded by ().
 
Winston Gutkowski
Bartender
Posts: 10575
Nakataa Kokuyo wrote: Maybe it is easier to explain with the result; please read the first post, which I have updated with the expected result.

Better, but I'm still a bit mystified.

In both cases, is the TAB immediately before the number? Or are you saying that if there are brackets, the delimiter will be a TAB rather than a space?

Winston
 
Winston Gutkowski
Bartender
Posts: 10575
Nakataa Kokuyo wrote: It should not split on a tab when the words are surrounded by ().

It seems an odd way to do things. Why not just always use a TAB? That way, you don't have to worry whether the brackets are there or not. It's the standard method used for many Unix delimited files.

Winston
 
Nakataa Kokuyo
Ranch Hand
Posts: 189
From the input:


if I use just TAB, my values will be:




What I need is:

 
Winston Gutkowski
Bartender
Posts: 10575
Nakataa Kokuyo wrote: From the input, if I use just TAB, my values will be

Not if you do it properly. In such a situation, I would make the input

where '{TAB}' denotes the '\t' character.

And THEN, your split regex will do exactly what you want.

Winston
 
Saloon Keeper
Posts: 7994
You shouldn't be using delimiters here.

Make a regex that describes the entire record, and use capturing groups to get individual parts of the record. Then use the findWithinHorizon() method to read all the records.
I have not compiled this, so it might be completely off. The point is that it describes the records, and then finds those records within the file, regardless of delimiters.

The pattern consists of three capturing groups: the text before the pipe, the text after the pipe, and the final number. The text before and after the pipe may consist of any number of x characters, where x is any character except whitespace or pipes, but including tabs. The three groups may be separated by any amount of whitespace.
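
For readers following along, here is a minimal sketch in the spirit of that description. It is not the code from the post. The record shape (text, pipe, text, whitespace, number), the class name `RecordScanner`, and the sample input are all assumptions made for illustration. Unlike the description above, this variant only allows a TAB inside a parenthesised group, which keeps the boundary before the final number unambiguous.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;
import java.util.regex.MatchResult;
import java.util.regex.Pattern;

class RecordScanner {
    // Hypothetical "text" part: characters that are not whitespace, pipes or
    // parentheses, or a whole "(...)" group, which may contain TABs or spaces.
    static final String TEXT = "(?:[^\\s|()]|\\([^)]*\\))+";

    // The same sub-pattern appears twice, so it is spliced in with String.format().
    // Groups: 1 = text before the pipe, 2 = text after the pipe, 3 = final number.
    static final Pattern RECORD = Pattern.compile(
            String.format("(%1$s)\\s*\\|\\s*(%1$s)\\s+(\\d+)", TEXT));

    static List<String[]> parse(String input) {
        List<String[]> records = new ArrayList<>();
        try (Scanner scanner = new Scanner(input)) {
            // A horizon of 0 means "search the rest of the input without bound".
            while (scanner.findWithinHorizon(RECORD, 0) != null) {
                MatchResult m = scanner.match();
                records.add(new String[] {m.group(1), m.group(2), m.group(3)});
            }
        }
        return records;
    }

    public static void main(String[] args) {
        // Made-up sample input: two records, each with a TAB inside "()".
        for (String[] r : parse("foo(a\tb)|bar\t12\nbaz|qux(c\td)\t7\n")) {
            System.out.println(String.join(" / ", r));
        }
    }
}
```

Because the pattern describes a whole record, the TAB inside the parentheses never has to be treated as a delimiter at all.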
 
Nakataa Kokuyo
Ranch Hand
Posts: 189
Hi Stephan,

I'm confused by the regex you use in the sample below. Any chance you could explain what the code below is trying to achieve?


 
Winston Gutkowski
Bartender
Posts: 10575
Nakataa Kokuyo wrote: I'm confused by the regex you use in the sample below. Any chance you could explain what the code below is trying to achieve?

Basically, it's a regex that contains a duplicate substring, which he defined separately. I suggest you look at the String.format() method documentation for more details.

I guess my question is: why is your input so confusing? Why not just TAB-delimit (or pipe-delimit) the whole darn thing?
It looks suspiciously to me like there might be "layers" to this input, in which case regexes may not be the best solution anyway; but if not, and it's just columnar data, pick ONE delimiter and just use it everywhere.

Winston
 
Nakataa Kokuyo
Ranch Hand
Posts: 189
Thanks Winston for the reply!

The input file was generated by a Hadoop application in the format I mentioned, and I was wondering whether there is a way to use a delimiter to handle the whole format ...

Help me understand better: are you suggesting using a tab as the delimiter and then splitting the remaining part? It would probably look like the following steps:

My input:


1. Split on tab, with the result:


2. Split again on the delimiter "|", with the result:


3. Then merge to get the result:

 
Nakataa Kokuyo
Ranch Hand
Posts: 189
I think the above should work, but I'm afraid there is a performance cost, as there are many records (200k) in the given input text file.
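
One way to keep the per-record cost low is to skip the split-then-merge steps and cut each line at two fixed points instead. The sketch below is only an assumption-laden illustration: it supposes each record is a single line of the hypothetical form `left|right{TAB}number`, so the first pipe and the last TAB mark the boundaries, and any TAB inside "()" is left alone. The class and method names are invented for the example.

```java
class LineCutter {
    /**
     * Cuts one line into {left, right, number} using plain index lookups.
     * Returns null if the line does not fit the assumed shape.
     */
    static String[] cut(String line) {
        int pipe = line.indexOf('|');          // first pipe separates left from right
        int lastTab = line.lastIndexOf('\t');  // last TAB precedes the final number
        if (pipe < 0 || lastTab < pipe) {
            return null;                       // no pipe, or no TAB after the pipe
        }
        return new String[] {
            line.substring(0, pipe),
            line.substring(pipe + 1, lastTab),
            line.substring(lastTab + 1).trim()
        };
    }

    public static void main(String[] args) {
        // Made-up sample line with a TAB inside the parentheses.
        String[] parts = cut("foo(a\tb)|bar\t12");
        System.out.println(String.join(" / ", parts));
    }
}
```

This does two linear scans per line and no regex work, so 200k records should not be a concern, though measuring on the real file is the only way to be sure.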
 