posted 10 years ago
Thanks Dieter,
I'll have to dig into JPA at this point I barely understand the acronym.
I'll toy with possibly null named parameters but the number of combinations is going to be large.
Maybe I am just going about this all wrong. So allow me to describe the problem I'm trying to solve instead of the programming corner I've painted myself into.
The rows in this table represent a data stream stored on disk or tape in one or more computing clusters. We call the data stream a Channel which may have been acquired by a real time system or calculated from one or more acquired channels.
A channel has a name auto generated from a model that has several fields for example "L1:IMC-X_MC3M3_OUT_DQ.mean". It also has a sampling frequency that runs from about 1/60-65536Hz. These are the fields that are giving me headaches. There are others but we do exact matches on them.
For the name field the user specifies a set of mandatory substrings it must contain or an set of alternative substrings. In the above example IMC and MC3 might be required and and one of X, or Y or Z. To complicate matters the order these substrings occur is not always easy to infer and the delimiters (:, -, _, and .) do not always break the name into fields that have to match exactly.
The match string a user might enter could look like "L1 mc3 imc x | L1 mc3 imc z" and that would get translated into a where clause like:
For the sample frequency they may select a value and compare operator ( <=, <, =, >, >=) which would then make the where clause look like:
The web app that uses this as Step 1 is meant to find and retrieve the data for a specific time interval and plot or transform and plot it. It turns out that many of our users just use it to do this search to figure out which channels to use in their off-line analysis. I seem to be the only one who cares how $#@%! UGLY and inefficient the DB queries are.
Some numbers:
If we ignore how many calculated channels are derived from an acquired channel, and how many different data formats and sample rates were used over the years there are about 1 million channels.
Add those variants and it's about 10 million
Add in exact duplicates available from different clusters and it's about 25 million.
Just for the shock and awe factor: the data kept on line at main cluster is a bit over 2 PB (2,000 TB) and when the instruments comes on line after a major upgrade at the end of this year or beginning of 2015 new data will be acquired at a rate of about 1 PB/year. At that point hopefully the size of the channel list will stop growing or at least slow to a crawl.
Right now we are searching the 25 million row table but soon I'll have the search working with the 1 million row table. There are multiple indexes and alternate name matching strategies (like glob and regex) that work much more efficiently. However they require a priori knowledge of the order of the substrings which is not always available.
I've thought about using the full text search feature, which is a bit weird on char(64) fields but the big problem is the strings can't be reliably broken into "words" to search on.
Perhaps there is a way to write a custom string comparison function that I don't know about. Something like String.containsAllSubstrings(String name, String[] substrings) would deal with the ugly syntax and increase efficiency a bit.
I do appreciate the Ranch as the best place to describe these kinds of problems. I always seem to get a good discussion of alternatives.
Best,
Joe
It's not what your program can do, it's what your users do with the program.