Java Strings use too much memory, since they store 16 bit chars rather than bytes. For a TreeMap or any table with tens of millions of String keys it is possible to run out of RAM. Does anyone know of a memory-efficient String or key implementation? See http://www.javaworld.com/javatips/jw-javatip130_p.html for a good discussion of the String size problem. The particular use I have is symbol tables in a Verilog language compiler, where hundreds of millions of symbols are common. I have searched the web and all my books but have found nothing much on this topic.
The most efficient way to store 128 bit characters is of course in a 128 bit data structure. A raw byte array with two characters packed per byte is as close as we can get. But before we get too specific though, just a couple of random questions:

* Are there a lot of duplicate Strings appearing as values in the table? (i.e. will String.intern() save enough memory to be sufficient?)
* How big, on average, is each individual string?
* How is the table constructed and accessed? (e.g. everything is added up front, then everything is accessed as necessary, with no deletions or modifications.) How frequently is each operation called, relatively speaking?
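To illustrate the String.intern() question above: interned duplicates all resolve to one pooled instance, so a table full of repeated values only needs references to a single copy. A minimal sketch (class name is mine):

```java
// Quick check of String.intern(): two distinct String objects with the same
// contents collapse to one canonical instance after interning.
public class InternDemo {
    public static void main(String[] args) {
        String a = new String("u_sip_core");   // forces a fresh heap object
        String b = new String("u_sip_core");   // another distinct object
        System.out.println(a == b);            // false: separate objects
        System.out.println(a.intern() == b.intern()); // true: one pooled copy
    }
}
```

Whether this actually saves memory depends entirely on how much duplication the data has, which is why the question matters.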
The most efficient way to store 128 bit characters is of course in a 128 bit data structure.

True if you need to use 128 bit characters. But Java normally just uses 16, and most of us only need 7 or 8 most of the time. I assume you meant 16? Since Charlie mentions ASCII in the title I'm guessing that Verilog symbols use only 7 or 8 bits in their charset. Charlie, be very sure this is actually the case before proceeding. Is there definitive documentation somewhere for what's allowed?

If indeed the charset is limited to what can fit in 7 or 8 bits, then probably the simplest thing is to convert each string to an array of bytes using whatever encoding is most appropriate to Verilog - I'll guess ISO-8859-1 as most likely, but who knows? UTF-8 may be good if you occasionally get "exotic" chars (outside the latin-1 range). Anyway, it's pretty easy to cut memory usage roughly in half doing something like this. You'd also need to create a custom wrapper class which provides equals() and hashCode() methods for a sequence of bytes - the default methods of an array won't suffice; they're just inherited from Object.

And as DW notes, if there's going to be significant duplication of values in memory, it may be worthwhile to do something similar to String's intern pool. This is fairly easy to do with a HashMap, but of course that has additional overhead of its own, so it's only worthwhile if there's enough duplication in the data to begin with. To be fair, all DW's previous questions are still important to address - I'm just anticipating what I think are likely answers.

[ July 07, 2003: Message edited by: Jim Yingst ]
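The encoding idea above can be sketched in a few lines - a String of latin-1 characters re-encoded as ISO-8859-1 bytes takes one byte per character instead of two, and round-trips losslessly. (Class and method names here are my own, just for illustration.)

```java
// Sketch: encode a String as ISO-8859-1 bytes, halving storage, provided
// every char actually fits in 8 bits.
public class Encode {
    static byte[] toBytes(String s) {
        try {
            return s.getBytes("ISO-8859-1");    // one byte per latin-1 char
        } catch (java.io.UnsupportedEncodingException e) {
            throw new RuntimeException(e);      // ISO-8859-1 is always present
        }
    }

    static String toString(byte[] b) {
        try {
            return new String(b, "ISO-8859-1"); // lossless for latin-1 data
        } catch (java.io.UnsupportedEncodingException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        String sym = "u_sip_bist.u_sip_core.dec_1";
        byte[] packed = toBytes(sym);
        System.out.println(packed.length == sym.length()); // true: 1 byte/char
        System.out.println(toString(packed).equals(sym));  // true: round-trips
    }
}
```

Note the byte[] itself still can't be a Map key as-is; that's what the wrapper class with equals() and hashCode() is for.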
Well, no wonder that number sounded familiar. Guess we're in agreement then. Here are a couple classes that might be useful. I quickly adapted them from something else and haven't really tested them, so beware.
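The classes Jim posted aren't reproduced here; a rough reconstruction of the kind of thing he describes - a wrapper that gives a byte[] usable equals() and hashCode() so it can serve as a Map key - might look like this (class name and details are my guesses, not his actual code):

```java
// A byte[] only inherits equals()/hashCode() from Object, so wrap it.
public final class ByteArrayKey {
    private final byte[] bytes;
    private final int hash;

    public ByteArrayKey(byte[] bytes) {
        this.bytes = bytes;
        // Same recipe as String.hashCode(): h = 31*h + element.
        int h = 0;
        for (int i = 0; i < bytes.length; i++) {
            h = 31 * h + bytes[i];
        }
        this.hash = h;  // cache it; the key is treated as immutable
    }

    public boolean equals(Object o) {
        if (!(o instanceof ByteArrayKey)) return false;
        return java.util.Arrays.equals(bytes, ((ByteArrayKey) o).bytes);
    }

    public int hashCode() {
        return hash;
    }
}
```

Two keys built from equal byte arrays then compare equal and hash identically, which is all a HashMap or a hand-rolled intern pool needs.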
And if there's a lot of duplication, use this instead:
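The duplication-handling variant isn't shown here either; one way to sketch a do-it-yourself intern pool along the lines Jim describes (naming is mine) is a HashMap that maps each value to its canonical copy:

```java
import java.util.HashMap;
import java.util.Map;

// Canonicalize values so duplicates share one object. Only pays off if the
// data repeats a lot, since the pool's HashMap has overhead of its own.
public class InternPool {
    private final Map pool = new HashMap();

    // Returns the canonical copy of the value, storing it on first sight.
    public Object intern(Object value) {
        Object canonical = pool.get(value);
        if (canonical == null) {
            pool.put(value, value);
            canonical = value;
        }
        return canonical;
    }
}
```

The pooled type just needs sensible equals() and hashCode() - which is exactly what the byte-array wrapper provides.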
You still need to choose the most appropriate way to encode Strings into byte, based on the type of data you're dealing with. It may even be worthwhile to use more complex data compression schemes to encode strings, especially if they are long. [ July 07, 2003: Message edited by: Jim Yingst ]
Thanks to all for the replies so far. You have confirmed my thoughts: there seems to be no published solution to this problem (not in the Java Collections book, not in the commercial JDSL from Brown University, not on the web, etc.). I have been sketching out some ideas along the lines of the code snippet posted by Jim Yingst. One concern I have is that even if I store keys efficiently in a special symbol table or dictionary, I might lose time because every lookup starts from a String which must first be converted to the internal byte array form for comparison. I guess only an experiment will resolve this.

The keys are strictly 8 bit ASCII per the Verilog IEEE spec 1364-1995. The keys are all unique in the table; they represent instances of a logic cell. E.g.:

u_sip_bist.u_sip_core.dec_1.s_llp1.sip_bdp_llp1.pm_pipe_reg_reg_6_0.Q sip_bdp_2
D0 on u_sip_bist.u_sip_core.dec_1.s_llp1.sip_bdp_llp1.ll_node4_pipe2_reg_6
A on u_sip_bist.u_sip_core.dec_1.s_llp1.sip_bdp_llp1.i_21978.i_0
A on u_sip_bist.u_sip_core.dec_1.s_llp1.sip_bdp_llp1.i_21980.i_44
D0 on u_sip_bist.u_sip_core.dec_1.s_llp1.sip_bdp_llp1.i_30333719

The above says that pin Q, an output pin of the first instance, is connected to input pins A and D0 of the other named cell instances. Keys are about 60 characters long for a small test circuit with a 300,000 line netlist.

Crude timing of the compiler's run time shows that most of the time is spent looking up symbols in the table. The table I use is a TreeMap from the Java Collections package. HashMaps are useless for holding a hundred million plus keys. I found that TreeMaps and Hashtables had the same run time speed. The compiler needs about 400 MB of RAM for the small test circuit, and about 3 GB on a Sun Sparc for a more realistic circuit. The gating items for the compiler seem to be the inefficient key storage using standard Strings, and the lookup time for symbols.
I am thinking it may be necessary to break up the one massive symbol table into hierarchical tables, since the circuits use modules and, within a module, all connections in the netlist share a common prefix. However, the low hanging fruit may be to just use a byte array storage class for the key. I'll be giving this a try and will post back the results later.
I agree with DW - if u_sip_bist.u_sip_core.dec_1.s_llp1.sip_bdp_llp1.i_21978.i_0 is one symbol, it may be easier to consider it instead as a series of symbols: u_sip_bist, u_sip_core, dec_1, s_llp1, sip_bdp_llp1, i_21978, i_0. Hard to say though; I have limited understanding of what all you're doing, and whether this approach would be worthwhile.

HashMaps are useless for holding a hundred million plus keys. I found that TreeMaps and Hashtables had the same run time speed.

That's a little surprising to me - I'd have thought that the HashMap would be better as N increases. HashMap lookup should be O(1) while for TreeMap it's O(log N). Not a huge difference necessarily, but for the amount of data you're talking about I'd have expected a difference. This could indicate that the hashCode() is not distributing things as evenly as we'd like. Say, you're not using an ancient JDK, are you? Prior to JDK 1.2, String's hashCode() only looked at the first 16 chars. Something like that would lead to really crappy performance in your case, as you've probably got a lot of strings with the same initial 16 chars.

One concern I have is that even if I store keys in a special symbol table or dictionary efficiently I might lose time because every lookup will be starting from a String which must first be converted to the internal byte array form for comparison.

My gut feeling is this isn't a big deal, but I may be wrong. One simple test would be to just read in the whole file and write it to another file using a different encoding, and see how long that takes. If it is an issue, one option is to never bother converting the symbol to a String in the first place. The symbol in the file was encoded in bytes to begin with - if you just keep treating everything as bytes rather than chars, you may be able to save on performance here. But I suspect it won't make much difference either way, and chars are usually easier to understand.
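One way to sketch the "series of symbols" idea (and the hierarchical tables Charlie mentioned) is a trie of nested maps keyed by path segment, so sibling symbols share their common prefix instead of each storing it in full. All names below are mine, purely illustrative:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.StringTokenizer;

// Store dotted symbols as a tree of segments: "a.b.c" becomes three nested
// map lookups, and every symbol under "a.b" shares those two nodes.
public class SymbolTree {
    private final Map children = new HashMap(); // segment -> SymbolTree
    private Object value;                       // payload for this node, if any

    public void put(String dottedName, Object val) {
        SymbolTree cur = this;
        StringTokenizer st = new StringTokenizer(dottedName, ".");
        while (st.hasMoreTokens()) {
            String seg = st.nextToken();
            SymbolTree next = (SymbolTree) cur.children.get(seg);
            if (next == null) {                 // create missing path nodes
                next = new SymbolTree();
                cur.children.put(seg, next);
            }
            cur = next;
        }
        cur.value = val;
    }

    public Object get(String dottedName) {
        SymbolTree cur = this;
        StringTokenizer st = new StringTokenizer(dottedName, ".");
        while (st.hasMoreTokens()) {
            cur = (SymbolTree) cur.children.get(st.nextToken());
            if (cur == null) return null;       // path doesn't exist
        }
        return cur.value;
    }
}
```

Whether this wins depends on the data: it trades one big lookup for several small ones, but the long shared prefixes in the netlist example suggest the memory savings could be substantial.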
The keys are strictly 8 bit ASCII per the Verilog IEEE spec 1364-1995.

I think there's some ambiguity in the term "8 bit ASCII", as not everyone agrees on how to interpret values 128-255. I don't have the Verilog spec (is it online somewhere?). You will probably want to find a more exact name for the encoding. For example, here is a discussion of the differences between ISO-8859-1 and common variants such as Microsoft's Cp1252.