I agree with DW - if
u_sip_bist.u_sip_core.dec_1.s_llp1.sip_bdp_llp1.i_21978.i_0
is one symbol, it may be easier to consider it instead as a series of symbols:
u_sip_bist
u_sip_core
dec_1
s_llp1
sip_bdp_llp1
i_21978
i_0
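To make the idea concrete, here is a rough sketch of storing each component once and representing a full name as a short array of integer IDs. The class and method names (SegmentTable, encode, decode) are made up for illustration, and it assumes '.' only ever appears as a hierarchy separator, which may not hold for every name in your file.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: stores each distinct path component once and
// represents a full hierarchical name as an array of small integer IDs.
class SegmentTable {
    private final Map<String, Integer> idsBySegment = new HashMap<>();
    private final List<String> segmentsById = new ArrayList<>();

    // Returns the ID for one component, assigning a new ID on first sight.
    int idOf(String segment) {
        Integer id = idsBySegment.get(segment);
        if (id == null) {
            id = segmentsById.size();
            idsBySegment.put(segment, id);
            segmentsById.add(segment);
        }
        return id;
    }

    // Splits a dotted name into components and encodes each one as an ID.
    // Assumption: '.' is only ever a separator, never part of a component.
    int[] encode(String dottedName) {
        String[] parts = dottedName.split("\\.");
        int[] ids = new int[parts.length];
        for (int i = 0; i < parts.length; i++) {
            ids[i] = idOf(parts[i]);
        }
        return ids;
    }

    // Rebuilds the dotted name from its component IDs.
    String decode(int[] ids) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < ids.length; i++) {
            if (i > 0) {
                sb.append('.');
            }
            sb.append(segmentsById.get(ids[i]));
        }
        return sb.toString();
    }

    // Number of distinct components seen so far.
    int distinctSegments() {
        return segmentsById.size();
    }
}
```

If many names share long prefixes like u_sip_bist.u_sip_core, the table of distinct components should stay far smaller than the number of full names, but whether that pays off depends on your data.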
Hard to say though; I have limited understanding of what all you're doing, and whether this approach would be worthwhile.
"Hashmaps are useless for holding a hundred million plus keys. I found that TreeMaps and Hashtables had the same run time speed."
That's a little surprising to me - I'd have thought that the HashMap would do better as N increases. HashMap lookup should be O(1), while for TreeMap it's O(log N). Not necessarily a huge difference, but for the amount of data you're talking about I'd have expected to see one. This could indicate that hashCode() is not distributing your keys as evenly as we'd like.
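One way to check the distribution is to run a sample of your keys through something like the sketch below. HashSpread and its methods are hypothetical names, and the bucket calculation is a plain modulo, which only approximates what HashMap or Hashtable do internally, so treat the numbers as a rough indication.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical diagnostic for how well String.hashCode() spreads a key sample.
class HashSpread {
    // Counts how many distinct hashCode() values the keys produce.
    // If this is much smaller than the number of distinct keys, the hash is weak.
    static int distinctHashes(List<String> keys) {
        Set<Integer> seen = new HashSet<>();
        for (String key : keys) {
            seen.add(key.hashCode());
        }
        return seen.size();
    }

    // Returns the size of the fullest bucket if the keys were placed into
    // bucketCount buckets by hashCode modulo bucketCount (an approximation
    // of a real hash table, not the exact JDK indexing scheme).
    static int largestBucket(List<String> keys, int bucketCount) {
        int[] buckets = new int[bucketCount];
        int max = 0;
        for (String key : keys) {
            int index = Math.floorMod(key.hashCode(), bucketCount);
            buckets[index]++;
            max = Math.max(max, buckets[index]);
        }
        return max;
    }
}
```

If the fullest bucket is far larger than keys.size() / bucketCount, lookups in that bucket degrade toward a linear scan, which could explain a hash table performing no better than a TreeMap.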
Say, you're not using an ancient JDK, are you? Prior to JDK 1.2, String's hashCode() didn't look at every character of a long string; as I recall, it examined at most 16 chars, sampled at intervals across the string. Something like that would lead to really crappy performance in your case, as you've probably got a lot of long strings that differ in only a few characters.
"One concern I have is that even if I store keys in a special symbol table or dictionary efficiently I might lose time because every lookup will be starting from a String which must first be converted to the internal byte array form for comparison."
My gut feeling is this isn't a big deal, but I may be wrong. One simple test would be to just read in the whole file and write it to another file using a different encoding, and see how long that takes. If it is an issue, one option is to never bother converting the symbol to a String at all. The symbol in the file was encoded in bytes to begin with - if you just keep treating everything as bytes rather than chars, you may be able to save on performance here. But I suspect it won't make much difference either way, and chars are usually easier to understand.
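If you do go the bytes route, note that a raw byte[] can't be used directly as a map key, because arrays compare by identity. Something like the wrapper below would be needed; ByteKey is a made-up name and this is just a minimal sketch, not a tuned implementation.

```java
import java.util.Arrays;

// Hypothetical key type: wraps a slice of raw bytes so it can be used as a
// HashMap key without ever being decoded into a String.
final class ByteKey {
    private final byte[] bytes;
    private final int hash;

    // Copies length bytes starting at offset, so the key stays valid even if
    // the caller reuses the source buffer for the next read.
    ByteKey(byte[] source, int offset, int length) {
        this.bytes = Arrays.copyOfRange(source, offset, offset + length);
        this.hash = Arrays.hashCode(this.bytes); // computed once, uses every byte
    }

    // Two keys are equal when their byte contents are equal.
    @Override
    public boolean equals(Object other) {
        return other instanceof ByteKey
                && Arrays.equals(bytes, ((ByteKey) other).bytes);
    }

    @Override
    public int hashCode() {
        return hash;
    }
}
```

A lookup then looks like map.get(new ByteKey(buffer, start, len)). That still allocates a small object per lookup, so it is worth measuring against plain Strings before committing to it.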
"The keys are strictly 8 bit ASCII per the Verilog IEEE spec 1394-1995."
I think there's some ambiguity in the term "8 bit ASCII", as not everyone agrees on how to interpret values 128-255. I don't have the Verilog spec (is it online somewhere?). You will probably want to find a more exact name for the encoding. For example, here is a discussion of the differences between ISO-8859-1 and common variants such as Microsoft's Cp-1252.
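To see the ambiguity for yourself, the sketch below decodes the bytes 0x80-0x9F under both encodings and prints the resulting code points. HighByteDecoding is a hypothetical name, and it assumes your JDK ships the "windows-1252" charset (most do, but only ISO-8859-1 is guaranteed).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo: shows where ISO-8859-1 and windows-1252 disagree.
class HighByteDecoding {
    // Decodes a single byte using the given charset.
    static String decode(byte value, Charset charset) {
        return new String(new byte[] { value }, charset);
    }

    public static void main(String[] args) {
        Charset latin1 = StandardCharsets.ISO_8859_1;
        // Assumption: this charset is available on the running JDK.
        Charset cp1252 = Charset.forName("windows-1252");

        // The two encodings differ only in the range 0x80-0x9F.
        for (int v = 0x80; v <= 0x9F; v++) {
            String a = decode((byte) v, latin1);
            String b = decode((byte) v, cp1252);
            System.out.printf("0x%02X  ISO-8859-1: U+%04X  windows-1252: U+%04X%n",
                    v, (int) a.charAt(0), (int) b.charAt(0));
        }
    }
}
```

If your keys never contain bytes above 127, the distinction doesn't matter and any ASCII-compatible single-byte encoding will give the same result.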