Re: [jakartaee-tck-dev] [glassfish-dev] Tracking usage data for EE4J working group CI cloud systems
I wasn't intending to point any fingers at the stability observations 
you made -- only to observe that we want to focus on improving the 
reliability so that we don't need to rely on re-runs, waits, or other 
fixes that only treat the symptoms. One day, I'd like to see that we can 
initiate several test runs simultaneously -- and they all reliably 
complete -- perhaps taking more clock-time, but reliably and 
consistently repeating.
Ideally, we could fill this compute pipeline with running and waiting 
tasks and be confident that the system will reliably produce consistent 
results. It sounds like, for some unknown reason, we aren't there yet. 
Getting to the root cause of this would be my priority (and we don't 
know if that's a GlassFish issue, an infrastructure issue or even 
something else). But I'm not actually doing the work, so it's just my 
opinion.
-- Ed
On 9/30/2020 7:54 PM, Scott Marlow wrote:
On 9/30/20 8:04 PM, Ed Bratt wrote:
The stability issues mentioned below -- my paraphrase: GlassFish 
won't start properly if multiple tests are being run in parallel -- 
are concerning. We really need to get to the bottom of this. The 
whole point of moving these into K8s containers was to provide 
isolation and stability.
We are finally seeing usage data which is a great step forward in 
understanding!  :-)
We cannot really point fingers at a cause yet, but we have some suspects 
in mind that are within our control.  We haven't seen any 
container/system level failures that I recall.  The closest to a 
system level failure was when we were seeing `git` failures due to our 
JNLP memory size being too small (increasing the JNLP memory size 
solved the `git` failures).  I think it was around April 2020 when we 
fixed the `git` OOM (out-of-memory) failures.
One stability change that is within our control is to set a trap 
handler or a `try {} finally { cleanup() }` handler to 
ensure that all started test processes are terminated during each test 
run.
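A minimal sketch of the `try {} finally { cleanup() }` idea, written in 
Python purely for illustration (the real CI scripts might use a shell 
trap or a Jenkinsfile block instead). The name `run_with_cleanup` and 
its parameters are hypothetical and not taken from the TCK sources.

```python
import subprocess
import sys


def run_with_cleanup(commands, test_body):
    """Start each command, run test_body, and always stop what was started.

    `commands` is a list of argv lists; `test_body` is a callable that
    receives the list of started processes. Both are illustrative names.
    """
    started = []
    try:
        for cmd in commands:
            started.append(subprocess.Popen(cmd))
        return test_body(started)
    finally:
        # Runs on success, on test failure, and on exceptions alike.
        for proc in reversed(started):
            if proc.poll() is None:          # process is still running
                proc.terminate()             # polite request first
                try:
                    proc.wait(timeout=10)
                except subprocess.TimeoutExpired:
                    proc.kill()              # escalate if it does not exit
                    proc.wait()
```

The point of the `finally` block is that a failed or aborted test run 
cannot leave a server process behind for the next run to trip over.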
We might also consider whether, during cleanup, we need to wait for 
listening ports left in the TIME_WAIT state to actually be closed (and 
simply not wait if there are none).
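One way to sketch the port wait in Python, assuming that "closed" can 
be approximated by "we are able to bind the port ourselves". The 
function names, the timeout, and the polling interval are all 
hypothetical choices, not values from the CI scripts.

```python
import socket
import time


def port_is_free(port, host="127.0.0.1"):
    """Return True if the port can be bound, i.e. nothing is holding it."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind((host, port))
            return True
        except OSError:
            return False


def wait_for_ports_free(ports, timeout=60.0, interval=1.0):
    """Poll until every port is bindable; return the ports still busy.

    An empty `ports` list returns immediately, so there is no wait
    when there is nothing to wait for.
    """
    deadline = time.monotonic() + timeout
    busy = [p for p in ports if not port_is_free(p)]
    while busy and time.monotonic() < deadline:
        time.sleep(interval)
        busy = [p for p in busy if not port_is_free(p)]
    return busy
```

A cleanup step could call this with whatever ports the server under 
test was configured to use and fail fast if the returned list is not 
empty.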
One symptom that we have seen in our CI environment when running two 
concurrent Platform TCK test runs is reported on 
https://github.com/eclipse-ee4j/glassfish/issues/23191.  We do need to 
solve issues/23191 at some point.  I can recreate a related failure 
locally that could be the same issue (hard to know for sure).  We 
mostly avoid this by rerunning the TCK tests if we see it and by not 
starting multiple concurrent Platform TCK test runs.
Another approach to handling glassfish/issues/23191 in CI is to 
assume that problems like that can happen and handle them with a loop 
that terminates GlassFish, sleeps for the right amount of time, and 
tries again, up to a few times.
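The retry loop described above might look roughly like this in Python. 
The callables `start_server`, `run_tests`, and `stop_server`, as well 
as the attempt count and delay, are placeholders for illustration; the 
"right amount of time" would still have to be found by experiment.

```python
import time


def run_with_retries(start_server, run_tests, stop_server,
                     attempts=3, delay_seconds=30.0, sleep=time.sleep):
    """Try a test run a few times, stopping the server after every attempt."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            start_server()
            return run_tests()           # success: result is returned
        except Exception as exc:         # start-up or test-run failure
            last_error = exc
        finally:
            stop_server()                # always terminate before retrying
        if attempt < attempts:
            sleep(delay_seconds)         # give ports/processes time to settle
    raise RuntimeError(f"gave up after {attempts} attempts") from last_error
```

Note that this only works around the symptom; it does not replace 
finding the root cause of the start-up failure.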
We will need to consider possible expansion of resources as well -- 
so independence and reliability need to be addressed. As we move 
forward -- it seems entirely plausible we might be doing work in 
multiple feature branches -- e.g. maybe Jakarta EE 9.1 w/JDK11+ 
support only and Jakarta EE 10 with 9.1 changes AND new features for 
Jakarta EE 10. We may be expanding the test matrix requirements as 
well -- I do not know but we should try to consider this as we start 
investigating optimizations and/or resource allocation changes.
Based on your numbers, is it possible that the upper limit is on 
Memory, not CPU? (I make this comment in relation to your observation 
that Jakarta EE TCK should have up to 100 vCPUs but never gets more 
than 76. I also don't understand the fractional Max value, but that's 
for a later date.)
Yes, I agree and reached the same conclusion: we first hit an upper 
limit on memory, not CPUs.  IMO, we should be able to reduce the 
memory used per container/VM, in which case it may be useful to use one 
CPU per container/VM to get more (testing) bang for our buck out of 
the system.
One comment I saw today mentioned that the high max JVM memory setting 
is there to work around memory issues with the EJB tests.  IMO, we 
should create more separate test groups for the EJB tests to see if 
that helps reduce the need for 10 GB per test container/VM.
Please be careful that we don't get too distracted about trying to 
optimize any of this right now. We can spend some time with that once 
the TCKs are all finalized -- put another way, if you had spare 
cycles, I'd rather you put time into moving the Jakarta EE TCKs to 
their final status, before running experiments to find out how these 
change Resource Pack usage rates. I'll also note that weekends are a 
frequent working time for committers who contribute as a sideline to 
their regular job, so we should be a bit careful with that as well.
I agree, I did get excited and wanted to push a little more on coming 
up with a way to help us get further answers about our CI environment 
+ testing some time this year, if that helps us make a decision for 
next year.  So, I think we have a path that we could follow forward, 
time permitting.
I am totally not surprised that this is a rather burst-like usage 
pattern. 
The data usage numbers are not that useful in that they don't show how 
the resources are used exactly (building Platform TCK, running 
Platform TCK, building Standalone TCKs, running Standalone TCKs). We 
know it takes longer to run Platform TCKs.
Perhaps some day we will be able to correlate each running TCK job 
with the usage report to allow more detailed usage reporting per type 
of TCK job.
Of course, that makes planning more tricky because you'd ideally like 
the utilization to be consistent and robust. 
Agreed.
The parallel operation model for the TCKs is deliberate: both running 
multiple stand-alone TCKs and running the Jakarta EE Platform TCK are 
designed to work this way. They are optimized to get completed test 
results as quickly as possible, not to sequence the tests one after 
another. Having delivered this for a long time, I can definitely say 
I prefer the results sooner, rather than waiting days and days.
Agreed, we just need to run more correctly/defensively and I think we 
will get there.
Scott
-- Ed
On 9/30/2020 11:11 AM, Scott Marlow wrote:
Here are the average + max Memory/#CpuCores:
avg memory.limit    Max Memory    average cpu limits    Max CPU
================    ==========    ==================    =========
61.58 Gi            378.00 Gi     12.1 vCPU             74.7 vCPU
There are some CPU/memory limits in the Jenkinsfile 
(https://github.com/eclipse-ee4j/jakartaee-tck/blob/master/Jenkinsfile#L147). 
Each memory limit specifies the container/VM memory size (since we 
didn't specify the initial memory request setting), so the 
calculation is something like:
memory usage = 10Gi per VM * number of test groups
CPU core = 2 * number of test groups
The data-capture does give us a high level view of what the 
container level memory/CPU core usage has been. Quoting from a 
previous TCK ml conversation (from David Blevins with subject: 
"Resource Pack Allocations & Maximizing Use"):
"
Over all of EE4J we have 105 resource packs paid for that give us a 
total of 210 cpu cores and 840 GB RAM.  These resource packs are 
dedicated, not elastic.  The actual allocation of 105 resource 
packs is by project.  The biggest allocation is 50 resource packs 
to ee4j.jakartaee-tck (this project), the second biggest is 15 
resource packs to ee4j.glassfish.
The most critical takeaway from the above is we have 50 resource 
packs dedicated to this project giving us a total of 100 cores and 
400GB ram at our disposal 24x7.  These 50 are bought and paid for 
-- we do not save money if we don't use them.
"
So, the Platform TCK is budgeted to use 100 cores and 400 GB RAM; 
however, we haven't used more than 75 CPU cores and 378 GB of memory 
(as per the max memory/CPU numbers pasted above).
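To make the arithmetic concrete, here is a small Python sketch of the 
formula above applied to the 50-resource-pack budget. It treats GB and 
Gi as interchangeable, which is a simplification, and the "5 GB, 1 CPU" 
setting in the second call is a purely hypothetical tuning target, not 
a measured or proposed value.

```python
def max_test_groups(budget_mem_gb, budget_cpus, mem_per_group_gb, cpus_per_group):
    """Return (groups that fit, limit from memory, limit from CPU).

    Each test group is assumed to run in one container/VM with a fixed
    memory limit and a fixed CPU limit.
    """
    by_memory = budget_mem_gb // mem_per_group_gb
    by_cpu = budget_cpus // cpus_per_group
    return min(by_memory, by_cpu), by_memory, by_cpu


# Current limits from the thread: 10 Gi and 2 CPUs per test group.
print(max_test_groups(400, 100, 10, 2))   # (40, 40, 50)

# Hypothetical tuned limits: 5 GB and 1 CPU per test group.
print(max_test_groups(400, 100, 5, 1))    # (80, 80, 100)
```

With the current limits, memory runs out at 40 groups, which 
corresponds to 80 cores, so the 100-core budget cannot be reached. 
That is in line with the observed maximum of roughly 75 cores at 
378 Gi.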
I think the fundamental question is: can we manage this resource, 
hence the cost, based on these data?
IMO, there is memory/CPU tuning that we could do if there is time 
to experiment before answers are needed regarding current usage 
versus what usage could be.
Alwin helped me to create a Platform TCK runner job that can run 
against my github repository.  Thanks Alwin!
I created https://github.com/scottmarlow/jakartaee-tck/tree/tuning 
to represent changes to improve our memory/cpu tuning.
When we have time to try memory/cpu tuning improvements, we can run 
tests with 
https://ci.eclipse.org/jakartaee-tck/job/jakartaee-tck-scottmarlow 
against the `tuning` branch.  Pull requests are welcome! :-)
So, I think this identifies how we can try making improvements to 
our usage.  I'm also hoping that reducing our memory/CPU usage can 
translate into being able to run more tests concurrently.
Currently, we also have to avoid starting multiple Platform TCK test 
runs at the same time or we hit test stability problems (GlassFish 
won't start correctly for some tests).
You are also welcome to review any of the commentary and ask 
questions directly via the issue.
I asked on https://bugs.eclipse.org/bugs/show_bug.cgi?id=565098 
about measuring usage for a weekend or over a few days.
The answer is that the measuring is always on and can be observed as 
per the links mentioned in the Bugzilla issue.  This will require some 
dancing as we need to ensure that no other tests are run the same 
day (until after we have noted the usage for the `tuning` test 
run).  This is important so that we have a way to compare use of 
different settings.
I'm not sure yet when we will have time to do this testing, but it 
would be nice to fit it in.
Scott