We have a web services platform where we deploy Java-based web services inside Docker containers. Until now, we set the maximum Metaspace allocation to 256MB by default. However, we found that for roughly 95% of the services the maximum used Metaspace never crosses 150MB. On top of that, we have over 5000 service instances running, and that number keeps growing as more teams move from legacy systems to this platform.
The platform works like this: we have a customized runtime which runs on top of Dropwizard. We take the WAR, read its contents and pass them on to the underlying Jetty web server, which is embedded in Dropwizard. This allows us to maintain and improve the runtime independently of the WAR, which is developed by the consumer teams. We also have a monitoring system in place which monitors all the JVM metrics and reports them.
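For reference, here is a minimal sketch of how the peak used Metaspace can be read from inside a JVM with the standard java.lang.management API. It is only an illustration, not the monitoring system's actual code. The pool name "Metaspace" is what HotSpot-based JVMs report and is an assumption here, and the class name MetaspaceProbe is made up for this example.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.util.OptionalLong;

// Hypothetical helper, named for illustration only.
final class MetaspaceProbe {

    // Assumption: HotSpot exposes a memory pool literally named "Metaspace".
    // Other JVMs may use a different name, in which case the result is empty.
    static OptionalLong peakMetaspaceBytes() {
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if ("Metaspace".equals(pool.getName())) {
                // Peak usage since JVM start (or since the last resetPeakUsage()).
                return OptionalLong.of(pool.getPeakUsage().getUsed());
            }
        }
        return OptionalLong.empty();
    }

    public static void main(String[] args) {
        OptionalLong peak = peakMetaspaceBytes();
        if (peak.isPresent()) {
            System.out.println("Peak Metaspace used (bytes): " + peak.getAsLong());
        } else {
            System.out.println("No pool named Metaspace found on this JVM");
        }
    }
}
```

The peak value is the relevant one for sizing, because class loading during startup and warm-up can push usage above what a later point-in-time sample shows.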
Optimization, first attempt:
We used the monitoring system to identify the maximum used Metaspace for every service. We added some buffer on top of that and set the result as the max Metaspace for the services in our internal environments. However, we quickly realized this only partially worked, because we started receiving alerts for some of the services (~5-6%). This is where we started investigating further. We found that for some of the services the WAR version didn't change, but the runtime version did, which may have contributed to an increase in the Metaspace those services require. This is where we need help.
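To make the first attempt concrete, here is a minimal sketch of the sizing rule: take the highest observed peak, add a percentage buffer, round up to whole MiB and turn it into the standard HotSpot flag -XX:MaxMetaspaceSize. The 20% buffer, the sample numbers and the class and method names are assumptions for illustration only, not our real values.

```java
import java.util.List;

// Hypothetical helper, named for illustration only.
final class MetaspaceSizing {

    static final long MIB = 1024L * 1024L;

    // Highest observed peak plus headroomPercent, rounded up to whole MiB.
    // Integer arithmetic is used so the rounding is exact.
    static long recommendedLimitMiB(List<Long> observedPeakBytes, int headroomPercent) {
        if (observedPeakBytes.isEmpty()) {
            throw new IllegalArgumentException("no observations");
        }
        long maxPeak = 0;
        for (long peak : observedPeakBytes) {
            maxPeak = Math.max(maxPeak, peak);
        }
        // Ceiling of maxPeak * (100 + headroomPercent) / 100.
        long withHeadroom = (maxPeak * (100 + headroomPercent) + 99) / 100;
        // Ceiling of withHeadroom / MIB.
        return (withHeadroom + MIB - 1) / MIB;
    }

    // -XX:MaxMetaspaceSize is the standard HotSpot option for the Metaspace cap.
    static String jvmFlag(long limitMiB) {
        return "-XX:MaxMetaspaceSize=" + limitMiB + "m";
    }

    public static void main(String[] args) {
        // Made-up sample peaks for one service, in bytes.
        List<Long> peaks = List.of(120 * MIB, 150 * MIB, 138 * MIB);
        System.out.println(jvmFlag(recommendedLimitMiB(peaks, 20)));
        // prints -XX:MaxMetaspaceSize=180m
    }
}
```

The weakness we ran into is visible here: the observations only describe the WAR and runtime versions that were actually running when they were collected, so a new runtime version can exceed a limit derived from older samples.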
We have two completely different components which are developed independently, but one relies on the other in order to run. So how do we estimate the required Metaspace upfront, so that we can set it dynamically for our services during deployment?