Reorganizing docs as recommended in:
https://www.divio.com/blog/documentation/
This is simply a reorganization of the existing documents and changes
no content EXCEPT to correct the location of Sphinx doc references.
Expect follow-up changes to rename documents (to reflect the new
structure) and to move content out of existing guides (e.g., moving
the pipeline/project/job structure definitions out of the "Project
Configuration" reference guide into their own reference documents so
they are easier to find).
All documents are now located in either the "overview", "tutorials",
"discussions", or "references" subdirectories to reflect the new structure
presented to the user. Code examples and images are moved to "examples" and
"images" root-level directories.
Developer-specific documents are located in the "references/developer"
directory.
Change-Id: I538ffd7409941c53bf42fe64b7acbc146023c1e3
We currently lack a means of resource accounting for tenants or
projects. Together with an addition to nodepool that adds resource
metadata to nodes, we can emit statsd statistics per tenant and per
project.
The following statistics are emitted:
* zuul.nodepool.resources.tenant.{tenant}.{resource}.current
Gauge with the currently used resources by tenant
* zuul.nodepool.resources.project.{project}.{resource}.current
Gauge with the currently used resources by project
* zuul.nodepool.resources.tenant.{tenant}.{resource}.counter
Counter with the summed usage by tenant (e.g., CPU seconds)
* zuul.nodepool.resources.project.{project}.{resource}.counter
Counter with the summed usage by project (e.g., CPU seconds)
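As a rough sketch, emitting these with the Python statsd client might
look like the following (the client setup, the emit function, and the
resources dict are illustrative assumptions, not the actual Zuul code):

  # Hypothetical sketch: emit per-tenant resource stats with the
  # "statsd" Python package; metric keys follow the list above.
  import statsd

  client = statsd.StatsClient('localhost', 8125, prefix='zuul')

  def emit_tenant_resources(tenant, resources, duration):
      # resources: e.g. {'cores': 8, 'ram': 16384}; duration in seconds
      for resource, amount in resources.items():
          key = 'nodepool.resources.tenant.%s.%s' % (tenant, resource)
          client.gauge(key + '.current', amount)
          # the counter accumulates usage over time, e.g. CPU seconds
          client.incr(key + '.counter', amount * duration)

The per-project variants would be emitted the same way under the
project key.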
Depends-On: https://review.openstack.org/616262
Change-Id: I68ea68128287bf52d107959e1c343dfce98f1fc8
The stats zuul.executor.<name>.pause and
zuul.executor.<name>.paused_builds are undocumented; add them. While
at it, fix the indentation of this section.
Change-Id: I5d5bdc1fe748ec2c545c8b7e8ec2674d50208f9f
We currently have a gauge for starting_builds but no insight into how
long jobs remain in the starting state. This adds a metric for that
duration so we can see changes in job startup time after changes to
the system.
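For illustration, the measurement could be taken with a statsd timer
roughly like this (the metric name and the surrounding hooks are
assumptions, not the actual implementation):

  # Illustrative sketch: time how long a build spends starting.
  import time
  import statsd

  client = statsd.StatsClient('localhost', 8125, prefix='zuul.executor')

  start = time.monotonic()
  # ... startup phase: clone repos, establish ansible connections ...
  elapsed_ms = (time.monotonic() - start) * 1000
  client.timing('starting_builds', elapsed_ms)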
Change-Id: I261f8bdc8de336967b9c8ecd6eafc68f0bfe6b78
When running within k8s, the system memory statistics are useless as
soon as limits are configured (which is strongly advised). In this
case we additionally need to check the cgroups.
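For example, with cgroup v1 the limit and usage can be read from the
standard sysfs files (a minimal sketch; error handling and cgroup v2
paths are omitted):

  # Sketch: read memory limit/usage from the cgroup v1 sysfs files.
  # An extremely large limit value means no limit is configured.
  def read_cgroup_memory(base='/sys/fs/cgroup/memory'):
      with open(base + '/memory.limit_in_bytes') as f:
          limit = int(f.read())
      with open(base + '/memory.usage_in_bytes') as f:
          usage = int(f.read())
      return limit, usage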
Change-Id: Idebe5d7e60dc862e89d012594ab362a19f18708d
The current code sets a counter zuul.nodepool.<status> but then tries
to set more counters like zuul.nodepool.<status>.label.
This doesn't work because zuul.nodepool.<status> is already a counter
value; it can't also be an intermediate key. Note this *does* work
with the timer values, but that's because statsd is turning the timer
into individual values
(e.g. zuul.nodepool.<status>.<mean|count|std...>) as it flushes each
interval.
Thus we need to rethink these stats. This puts them under a new
intermediate key "requests" and adds a "total" count; thus
zuul.nodepool.<status> == zuul.nodepool.requests.<status>.total
The other stats, showing requests by-label and by-size will now live
under the zuul.nodepool.requests parent.
While we're here, use a statsd pipeline to send the status updates,
since that works better when sending many stats quickly over UDP.
Pipelined stats aren't handled by the current debug log below; move
that logging into the test-case framework.
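With the Python statsd client a pipeline buffers the individual stats
and flushes them in as few UDP packets as possible, roughly like this
(a sketch; 'label' and 'size' stand in for the request's node label
and node count):

  # Sketch: batch several request stats into fewer UDP packets.
  import statsd

  client = statsd.StatsClient('localhost', 8125, prefix='zuul.nodepool')

  with client.pipeline() as pipe:
      pipe.incr('requests.fulfilled.total')
      pipe.incr('requests.fulfilled.label.%s' % label)
      pipe.incr('requests.fulfilled.size.%s' % size)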
The documentation has been clarified to match the code.
Change-Id: I127e8b6d08ab86e0f24018fd4b33c626682c76c7
The stats emitted under zuul.geard are currently undocumented. Add
them to the monitoring guide and add some more details to the geard
troubleshooting guide for what to do if the stats look wrong.
Change-Id: I831def2f7c22d8ffff62569cc7d657033a85ed19
The load_average and pct_used_ram metrics are indented incorrectly
which placed them under zuul.executor.<executor>.phase.* in the docs.
Change-Id: Id613ce57a679d1ab4bf9f71bf4d5a6bde72e2d50
Folks tend to misread this metric as used RAM rather than available
RAM, since that's how memory is typically graphed, so go ahead and
invert it. Admins will need to mentally invert it again to determine
whether the executor is approaching the available RAM threshold.
Change-Id: I60cde8bf2fd04926cd2ac1bb733bf9c72fda8daf
This adds the following counters:
- zuul.executor.*.phase.setup.<result> (setup task)
- zuul.executor.*.phase.reset.<result> (reset connection task)
- zuul.executor.*.phase.<phase>.<result> (pre/run/post playbooks)
- zuul.executor.*.merger.['SUCCESS','FAILURE'] (merger status)
The data provided by these counters is not entirely reliable, in the
sense that a failure may not be related to the executor itself but may
instead be a legitimate issue with the patch or the job being run.
However, when averaged out, these counters should help us identify
whether a particular executor is exhibiting irregular behavior
compared to its usual patterns or to other executors.
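Bumping one of these counters might look roughly like the following
(a hypothetical sketch; only the metric layout follows the list above):

  # Hypothetical sketch: increment a per-phase result counter.
  import statsd

  client = statsd.StatsClient('localhost', 8125, prefix='zuul.executor')

  def record_phase_result(hostname, phase, result):
      # e.g. phase='run', result='SUCCESS' emits
      # zuul.executor.<hostname>.phase.run.SUCCESS
      client.incr('%s.phase.%s.%s' % (hostname, phase, result))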
Change-Id: Ie430f9935dce94f4b90cffee33695e1eb4d1ca7d
The metrics that we use to govern load on the executors are all
trailing indicators. The executors are capable of accepting a
large number of jobs in a batch and then, only after they begin
to run, will the load indicators increase. To avoid the thundering
herd problem, reduce the rate at which we accept jobs past a certain
point.
That point is a number of jobs equal to twice the target load average.
In practice that seems to be a fairly conservative but reasonable
number of jobs for the executor to run, so, to facilitate a quick
start, allow the executor to start up to that number all at once.
Once the number of jobs running is beyond that number, subsequent
jobs will only be accepted one at a time, after each one completes
its startup phase (cloning repos, establishing ansible connections),
which is to say, at the point where the job begins running its first
pre-playbook.
We will also wait until the next regular interval of the governor
to accept the next job. That's currently 30 seconds, but to make
the system a bit more responsive, it's lowered to 10 seconds in this
change.
To summarize: once a bunch[1] of jobs are running, we wait after each
new job until that job has started running playbooks, plus up to an
additional 10 seconds, before accepting another job.
This is implemented by adding a 'starting jobs' metric to the governor
so that we register or de-register the execute function based on
whether too many jobs are in the startup phase. We add a forced
call to the governor routine after each job starts so that we can
unregister if necessary before picking up the next job, and wrap that
routine in a lock since it is now called from multiple threads and
its logic may not be entirely thread-safe.
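In pseudocode, the accept decision described above amounts to
something like this (a simplified sketch, not the actual manageLoad
implementation):

  # Simplified sketch of the slow-start logic described above.
  def should_accept_more_jobs(running, starting, target_load_avg):
      limit = 2 * target_load_avg   # the "bunch" from [1]
      if running < limit:
          return True               # quick start: accept freely
      # past the limit, accept only when no job is still starting
      return starting == 0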
Also, add tests for all three inputs to manageLoad.
[1] 2*target load average
Change-Id: I066bc539e70eb475ca2b871fb90644264d8d5bf4
The events are landing in statsd as zuul.event.<driver>.<type>, not
as zuul.event.<driver>.event.<type>.
Change-Id: I9c4901a9c02d4d833fdc3e1b7617a4bbba15c94d
Also, change the interval to 30s rather than 10s. There is some
cost to the gear server to calculate the status report, especially
if the queue is long.
Change-Id: Icfe4c6496e45847cdf884f23a06d7186aafdf8e2
The automatic statsd configuration based on env variables has
proven cumbersome and counter-intuitive. Move its configuration
into zuul.conf in preparation for other components emitting stats.
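The resulting zuul.conf section looks roughly like this (based on
Zuul's documented statsd options; the values are examples):

  [statsd]
  server=127.0.0.1
  port=8125
  prefix=zuul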
Change-Id: I3f6b5010d31c05e295f3d70925cac8460d334283
Refresh the user and admin guide for v3 changes, and reorganize into
a narrative structure which makes more sense for v3.
Change-Id: I4ac3b18d5ed33b0fea4e2ef0318b19bfc3447ccc