Commit Graph

27 Commits

Author SHA1 Message Date
David Shrewsbury e6d8b210cc Documentation reorg
Reorganizing docs as recommended in:

https://www.divio.com/blog/documentation/

This is simply a reorganization of the existing documents and changes
no content EXCEPT to correct the location of sphinx doc references.
Expect followup changes to change document names (to reflect the new
structure) and to move content from existing guides (e.g., to move the
pipeline/project/job structure definitions out of the "Project Configuration"
reference guide into their own reference documents for easier locatability).

All documents are now located in either the "overview", "tutorials",
"discussions", or "references" subdirectories to reflect the new structure
presented to the user. Code examples and images are moved to "examples" and
"images" root-level directories.

Developer specific documents are located in the "references/developer"
directory.

Change-Id: I538ffd7409941c53bf42fe64b7acbc146023c1e3
2020-01-14 12:47:23 -05:00
Andy Ladjadj addf6ccf37 [doc][monitoring] Fix the wait_time parent attribute
- the documentation differed from the source code

Change-Id: I54ebfdd4dd04684651226656dd8175cd00b735b3
2019-06-25 14:04:24 +02:00
Tobias Henkel a455e0bff8
Fix typo in docs
It's project, not tenant there.

Change-Id: I148e2b8615e85ce726b592f4025f2ade7fdf3463
2019-05-29 06:13:22 +02:00
Tobias Henkel e90fe41bfe Report tenant and project specific resource usage stats
We currently lack means to support resource accounting of tenants or
projects. Together with an addition to nodepool that adds resource
metadata to nodes we can emit statsd statistics per tenant and per
project.

The following statistics are emitted:
* zuul.nodepool.resources.tenant.{tenant}.{resource}.current
  Gauge with the currently used resources by tenant

* zuul.nodepool.resources.project.{project}.{resource}.current
  Gauge with the currently used resources by project

* zuul.nodepool.resources.tenant.{tenant}.{resource}.counter
  Counter with the summed usage by tenant, e.g. CPU seconds

* zuul.nodepool.resources.project.{project}.{resource}.counter
  Counter with the summed usage by project, e.g. CPU seconds
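
The key layout above could be sketched like this (illustrative Python, not Zuul's actual code; the helper name is hypothetical):

```python
# Hypothetical helper that builds the per-tenant/per-project statsd key
# names described in this commit message. The key layout matches the
# message; the function itself is a sketch, not Zuul's real code.

def resource_stat_keys(tenant, project, resource):
    """Return the four statsd keys emitted for one resource type."""
    base = "zuul.nodepool.resources"
    return {
        "tenant_gauge": f"{base}.tenant.{tenant}.{resource}.current",
        "project_gauge": f"{base}.project.{project}.{resource}.current",
        "tenant_counter": f"{base}.tenant.{tenant}.{resource}.counter",
        "project_counter": f"{base}.project.{project}.{resource}.counter",
    }
```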

Depends-On: https://review.openstack.org/616262
Change-Id: I68ea68128287bf52d107959e1c343dfce98f1fc8
2019-05-29 04:10:08 +00:00
Zuul 97da909bd8 Merge "Add cgroup support to ram sensor" 2019-01-09 19:32:24 +00:00
Zuul 863705c334 Merge "Document missing executor stats" 2019-01-09 15:41:05 +00:00
Tobias Henkel 1f6e001c06
Document missing executor stats
The stats zuul.executor.<name>.pause and
zuul.executor.<name>.paused_builds are undocumented. While at it, fix
the indentation of this section.

Change-Id: I5d5bdc1fe748ec2c545c8b7e8ec2674d50208f9f
2018-12-20 22:13:06 +01:00
Tobias Henkel d4f75ffac8
Add timer for starting_builds
We currently have a gauge for starting_builds but actually have no
knowledge about how long jobs are in the starting state. This adds a
metric for this so we can see changes in the job startup time after
changes in the system.

Change-Id: I261f8bdc8de336967b9c8ecd6eafc68f0bfe6b78
2018-12-20 07:58:40 +01:00
Tobias Henkel 145e62b568
Add cgroup support to ram sensor
When running within k8s, the system memory statistics are useless as
soon as memory limits are configured (which is strongly advised). In
this case we additionally need to check the cgroups.
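
A minimal sketch of the idea, assuming cgroup v1 paths; the function name and fallback behaviour are illustrative, not the sensor's real implementation:

```python
# Sketch: prefer cgroup memory numbers over host-wide statistics when a
# memory limit is configured (e.g. under k8s). Cgroup v1 reports an
# effectively infinite limit when none is set, so treat huge values as
# "no limit". This is an illustration, not Zuul's actual sensor code.

def cgroup_ram_pct(limit_path="/sys/fs/cgroup/memory/memory.limit_in_bytes",
                   usage_path="/sys/fs/cgroup/memory/memory.usage_in_bytes"):
    """Return percent of the cgroup memory limit in use, or None if
    no meaningful limit applies (fall back to system stats then)."""
    try:
        with open(limit_path) as f:
            limit = int(f.read())
        with open(usage_path) as f:
            usage = int(f.read())
    except OSError:
        return None  # not running inside a memory cgroup
    if limit >= 2 ** 60:  # cgroup v1's "no limit" sentinel is huge
        return None
    return 100.0 * usage / limit
```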

Change-Id: Idebe5d7e60dc862e89d012594ab362a19f18708d
2018-12-18 22:25:27 +01:00
gaobin 5b3ca17c05 Modify some file content errors
Fix the following errors:
exectuor to executor
formated to formatted
overidden to overridden

Change-Id: Ie80e1632624c65adaf6aad86a2c7aae93da688ff
2018-12-11 06:11:07 +00:00
Ian Wienand c6fe6459f2 Rework zuul nodepool stats reporting
The current stats set a counter zuul.nodepool.<status> but then tries
to set more counters like zuul.nodepool.<status>.label.

This doesn't work because zuul.nodepool.<status> is already a counter
value; it can't also be an intermediate key.  Note this *does* work
with the timer values, but that's because statsd is turning the timer
into individual values
(e.g. zuul.nodepool.<status>.<mean|count|std...>) as it flushes each
interval.

Thus we need to rethink these stats.  This puts them under a new
intermediate key "requests" and adds a "total" count; thus
zuul.nodepool.<status> == zuul.nodepool.requests.<status>.total

The other stats, showing requests by-label and by-size will now live
under the zuul.nodepool.requests parent.

While we're here, use a statsd pipeline to send the status updates, as
it works better when sending lots of stats quickly over UDP.  This
isn't handled by the current debug log below; move that into the
test-case framework.
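
The pipeline idea, sketched with the stdlib only (the real code uses the statsd library's pipeline support; this class is illustrative):

```python
# Sketch of statsd "pipelining": buffer many metric lines and send them
# in one UDP datagram instead of one packet per stat. Illustrative only;
# Zuul uses the statsd library's built-in pipeline() for this.
import socket


class StatsdPipeline:
    def __init__(self, host="127.0.0.1", port=8125):
        self.addr = (host, port)
        self.buffer = []

    def incr(self, key, value=1):
        # statsd counter wire format: <key>:<value>|c
        self.buffer.append(f"{key}:{value}|c")

    def gauge(self, key, value):
        # statsd gauge wire format: <key>:<value>|g
        self.buffer.append(f"{key}:{value}|g")

    def send(self):
        # One datagram for the whole batch, newline-separated.
        payload = "\n".join(self.buffer).encode()
        self.buffer = []
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        sock.sendto(payload, self.addr)
        sock.close()
```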

The documentation has been clarified to match the code.

Change-Id: I127e8b6d08ab86e0f24018fd4b33c626682c76c7
2018-12-10 14:56:36 +11:00
Ian Wienand 18fb9ec37e Add gearman stats reference
The stats emitted under zuul.geard are currently undocumented.  Add
them to the monitoring guide and add some more details to the geard
troubleshooting guide for what to do if the stats look wrong.

Change-Id: I831def2f7c22d8ffff62569cc7d657033a85ed19
2018-11-27 20:25:04 +11:00
Tobias Henkel 40a895b03c
Fix indentation of executor monitoring docs
The load_average and pct_used_ram metrics are indented incorrectly
which placed them under zuul.executor.<executor>.phase.* in the docs.

Change-Id: Id613ce57a679d1ab4bf9f71bf4d5a6bde72e2d50
2018-07-20 19:13:23 +02:00
James E. Blair a4f94a14d7 Invert executor ram statsd metric
Folks tend to misread this metric as used ram, rather than available,
since that's how memory is typically graphed, so go ahead and invert
it.  Admins will need to mentally invert it again to determine whether
the executor is approaching the available RAM threshold.

Change-Id: I60cde8bf2fd04926cd2ac1bb733bf9c72fda8daf
2018-02-14 15:39:57 -08:00
David Moreau Simard 1267144b19
Add Executor Merger and Ansible execution statsd counters
This adds the following counters:
- zuul.executor.*.phase.setup.<result> (setup task)
- zuul.executor.*.phase.reset.<result> (reset connection task)
- zuul.executor.*.phase.<phase>.<result> (pre/run/post playbooks)
- zuul.executor.*.merger.['SUCCESS','FAILURE'] (merger status)
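
The key patterns above could be built like this (hypothetical helper names; only the key layout comes from the commit message):

```python
# Sketch of the counter key names added by this change. The functions
# are illustrative, not Zuul's actual code; the key patterns match the
# list in the commit message.

def phase_stat_key(executor, phase, result):
    """e.g. zuul.executor.ze01.phase.run.SUCCESS"""
    return f"zuul.executor.{executor}.phase.{phase}.{result}"


def merger_stat_key(executor, result):
    """e.g. zuul.executor.ze01.merger.FAILURE"""
    return f"zuul.executor.{executor}.merger.{result}"
```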

The data provided by these counters is not very reliable, in the sense
that failures may not be related to the executor itself and may
instead be legitimate issues with the patch or the job being run.

However, when averaged out, these counters should help us identify if
a particular executor is exhibiting irregular behavior when compared
to regular patterns or other executors.

Change-Id: Ie430f9935dce94f4b90cffee33695e1eb4d1ca7d
2018-02-07 13:54:03 -05:00
James E. Blair df37ad2ce7 Executor: Don't start too many jobs at once
The metrics that we use to govern load on the executors are all
trailing indicators.  The executors are capable of accepting a
large number of jobs in a batch and then, only after they begin
to run, will the load indicators increase.  To avoid the thundering
herd problem, reduce the rate at which we accept jobs past a certain
point.

That point is twice the number of jobs as the target load average.
In practice that seems to be a fairly conservative but reasonable
number of jobs for the executor to run, so, to facilitate a quick
start, allow the executor to start up to that number all at once.

Once the number of jobs running is beyond that number, subsequent
jobs will only be accepted one at a time, after each one completes
its startup phase (cloning repos, establishing ansible connections),
which is to say, at the point where the job begins running its first
pre-playbook.

We will also wait until the next regular interval of the governor
to accept the next job.  That's currently 30 seconds, but to make
the system a bit more responsive, it's lowered to 10 seconds in this
change.

To summarize: after a bunch[1] of jobs are running, after each new
job, we wait until that job has started running playbooks, plus up
to an additional 10 seconds, before accepting a new job.

This is implemented by adding a 'starting jobs' metric to the governor
so that we register or de-register the execute function based on
whether too many jobs are in the startup phase.  We add a forced
call to the governor routine after each job starts so that we can
unregister if necessary before picking up the next job, and wrap that
routine in a lock since it is now called from multiple threads and
its logic may not be entirely thread-safe.

Also, add tests for all three inputs to manageLoad.

[1] 2*target load average
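
The acceptance rule summarized above could be sketched as follows (a simplification under stated assumptions; the function and parameter names are illustrative, not the executor's real attributes):

```python
# Sketch of the governor's acceptance rule from this commit message:
# accept jobs freely up to 2x the target load average, then only one at
# a time, waiting until no job is still in its startup phase.
# Illustrative only, not Zuul's actual implementation.

def should_accept_job(running_builds, starting_builds, load_target):
    """Decide whether the executor should register for more work."""
    if running_builds < 2 * load_target:
        return True  # quick start: take a batch up front
    # Past that point, accept only after the previous job has finished
    # its startup phase (cloning repos, establishing ansible connections).
    return starting_builds == 0
```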

Change-Id: I066bc539e70eb475ca2b871fb90644264d8d5bf4
2018-02-02 11:36:49 -08:00
Zuul 59092227e3 Merge "Add available RAM to statsd" 2018-02-01 15:54:50 +00:00
James E. Blair 40ca3791fb Add available RAM to statsd
If the executor is using it to decide whether to accept jobs, we
should graph it.

Change-Id: If34e81f953df4ed0a2c2c287e7d00d4977267fef
2018-01-31 14:22:07 -08:00
Tobias Henkel 60a8547ffb
Fix statsd documentation about events
The events are landing in statsd as zuul.event.<driver>.<type> and not
as zuul.event.<driver>.event.<type>

Change-Id: I9c4901a9c02d4d833fdc3e1b7617a4bbba15c94d
2018-01-31 11:16:15 +01:00
James E. Blair 4dd5f4b6cb Document executor/merger stats
Also, change the interval to 30s rather than 10s.  There is some
cost to the gear server to calculate the status report, especially
if the queue is long.

Change-Id: Icfe4c6496e45847cdf884f23a06d7186aafdf8e2
2017-10-23 13:08:06 -07:00
James E. Blair 4f1731ba86 Emit some nodepool stats
Change-Id: I7bc3914e8b8d64afee061c002dcc9cca5dd1ef4d
2017-10-13 15:56:59 -07:00
James E. Blair faf8198f2a Emit some stats from executor
Emit the load average, a counter for builds, and a gauge for
running builds.

Change-Id: I8541724f1322b8257b623b3b2cfd8f3e6b95574d
2017-10-13 15:56:25 -07:00
James E. Blair ded241e598 Switch statsd config to zuul.conf
The automatic statsd configuration based on env variables has
proven cumbersome and counter-intuitive.  Move its configuration
into zuul.conf in preparation for other components emitting stats.

Change-Id: I3f6b5010d31c05e295f3d70925cac8460d334283
2017-10-13 14:04:42 -07:00
James E. Blair 80ac158acd Update statsd output for tenants
Update the statsd output to account for tenants and other v3 changes.

Change-Id: I984e1930ab63d9a551cf33be922bac447ad0df9d
2017-10-09 07:02:40 -07:00
David Shrewsbury 1c61c71c9c Fix documentation nits
Just minor spelling and grammar fixes.

Change-Id: I2dc98e4b68ac2df35fe1647cd4af3402cd55d77d
2017-08-16 16:04:54 -04:00
James E. Blair 91c9dde0cb Docs: reformat metrics docs
Adds a new directive/role for stats (zuul:stat).

Change-Id: If292c393811eaffd955c98589088adf4881a61e3
2017-08-04 11:10:24 -07:00
James E. Blair eff5a9d8d7 Reorganize docs into user/admin guide
Refresh the user and admin guide for v3 changes, and reorganize into
a narrative structure which makes more sense for v3.

Change-Id: I4ac3b18d5ed33b0fea4e2ef0318b19bfc3447ccc
2017-07-05 14:35:22 -07:00