Matt Wright wrote:I am implementing the hash join algorithm for a project with a hard coded hash function.
Not quite sure what you mean here. Do you have a hashCode() method?
Also: You talk about "tuples", which is a database concept. In Java, you're likely to be dealing with collections of some kind.
What you're attempting is something that is usually done with Sets (java.util.Set) in Java (indeed, the method retainAll() specifically mimics it), so if I was trying to do this I would probably create a Relation class that encapsulates the "where" clause of a SELECT statement.
Once you have that, and can calculate its hash, populate two HashSet<Relation>s with the relations for each of your datasets (don't worry about duplicates for now) and run retailAll(). That will give you the Relations common to both, and then with another pass through your datasets you can add something that equates to row ids for them (eg, List indexes). It's possible you could do this during your initial build, but the two-pass approach is likely to be a lot simpler to code and doesn't change the O(n) characteristic of the operation.
It seems to me that you may be concentrating too much on the mechanics of the operation rather than looking at the design.
Winston Gutkowski wrote:It's possible you could do this during your initial build...
Actually, thinking about it, you could probably do that quite easily if you initially load into HashMap<Relation, List<Integer>>s (where the Integer is the index of your "tuple"); and you could then do the retainAll() operation on their keySet()s.
I would expect this to have similar speed to a HashSet (but I don't know), and it would save you having to scan one of the datasets a second time (in fact you could just scan one of the Maps).
In order to combine them though, you'll need some way of distinguishing indexes of "table" A from those of table B, and one easy method would be to make the latter negative (perhaps something like: negativeIndex = -(originalIndex + 1), as Arrays.binarySearch() does for its "insertion" indexes). Obviously, this won't work for any joins involving more than two tables though (if you need it).
Winston Gutkowski wrote:if I was trying to do this I would probably create a Relation class that encapsulates the "where" clause of a SELECT statement.
Actually, since this is for joins, it would only have to include those elements that are actually part of the join. What you do about any other "filtering" logic would be up to you. One simple way to combine hashes for multiple join fields would be to XOR them.