Stephan van Hulst wrote:
Writing a library that performs this task automatically is practically impossible if you don't already have quite some knowledge in the area of artificial intelligence and signal processing. And if you did, you probably wouldn't have to ask this question here.
Also a LOT of work. Most people attempting such a task would look towards existing signal-processing libraries first, not try to whip up something from scratch. And at that, the libraries in question would likely enlist specialized hardware such as a good GPU to do the work.
You could make a crude attempt by doing frequency analysis, since male and female voices tend to occupy different parts of the spectrum. Better, if the voices were on separate (stereo) channels, the job would be simplified. Finally, if you simply wanted to break apart a set of distinct phrases and translations, such as a language-learning recording, it should be easy to split at the silent points between them.
But those last two aren't really what I'd do in Java. There's a program called Audacity that's much better suited to that sort of stuff. And probably - with suitable plugins - the first option as well.
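That said, if you did want to try the split-at-silence idea in Java, here's a rough sketch of the core loop. It assumes you've already decoded the recording into 16-bit mono PCM samples in a short[] (for example by reading from javax.sound.sampled.AudioSystem.getAudioInputStream, which I've left out). The class name, method name, and the threshold/minSilence parameters are all made up for illustration, and the values you'd actually need depend on the recording.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not from any library: finds the non-silent
// stretches in a buffer of 16-bit PCM samples.
public class SilenceSplitter {

    /**
     * Returns {start, end} sample-index pairs (end exclusive), one per
     * non-silent segment.
     *
     * threshold  - absolute amplitude at or below which a sample counts as quiet
     * minSilence - number of consecutive quiet samples that ends a segment
     */
    public static List<int[]> findSegments(short[] samples, int threshold, int minSilence) {
        List<int[]> segments = new ArrayList<>();
        int start = -1;    // start of the open segment, or -1 if none is open
        int quietRun = 0;  // consecutive quiet samples seen inside the open segment

        for (int i = 0; i < samples.length; i++) {
            boolean loud = Math.abs(samples[i]) > threshold;
            if (loud) {
                if (start < 0) {
                    start = i;          // a new segment begins here
                }
                quietRun = 0;
            } else if (start >= 0 && ++quietRun >= minSilence) {
                // Long enough pause: close the segment where the quiet run began.
                segments.add(new int[] { start, i - quietRun + 1 });
                start = -1;
                quietRun = 0;
            }
        }
        // Close a segment that runs to the end of the buffer,
        // trimming any short quiet tail.
        if (start >= 0) {
            segments.add(new int[] { start, samples.length - quietRun });
        }
        return segments;
    }
}
```

You'd call it as something like SilenceSplitter.findSegments(pcm, 500, sampleRate / 2) to treat half a second of quiet as a break, then write each index range out as its own clip. A plain amplitude threshold like this falls over on noisy recordings, which is one more reason I'd reach for an audio editor first.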