Commit Graph

27 Commits

Author SHA1 Message Date
David Shrewsbury e6d8b210cc Documentation reorg
Reorganizing docs as recommended in:

https://www.divio.com/blog/documentation/

This is simply a reorganization of the existing documents and changes
no content EXCEPT to correct the location of sphinx doc references.
Expect followup changes to change document names (to reflect the new
structure) and to move content from existing guides (e.g., to move the
pipeline/project/job structure definitions out of the "Project Configuration"
reference guide into their own reference documents for easier locatability).

All documents are now located in either the "overview", "tutorials",
"discussions", or "references" subdirectories to reflect the new structure
presented to the user. Code examples and images are moved to "examples" and
"images" root-level directories.

Developer specific documents are located in the "references/developer"
directory.

Change-Id: I538ffd7409941c53bf42fe64b7acbc146023c1e3
2020-01-14 12:47:23 -05:00
Andy Ladjadj addf6ccf37 [doc][monitoring] Fix the wait_time parent attribute
- the documentation differed from the source code

Change-Id: I54ebfdd4dd04684651226656dd8175cd00b735b3
2019-06-25 14:04:24 +02:00
Tobias Henkel a455e0bff8
Fix typo in docs
It's project, not tenant there.

Change-Id: I148e2b8615e85ce726b592f4025f2ade7fdf3463
2019-05-29 06:13:22 +02:00
Tobias Henkel e90fe41bfe Report tenant and project specific resource usage stats
We currently lack means to support resource accounting of tenants or
projects. Together with an addition to nodepool that adds resource
metadata to nodes we can emit statsd statistics per tenant and per
project.

The following statistics are emitted:
* zuul.nodepool.resources.tenant.{tenant}.{resource}.current
  Gauge with the currently used resources by tenant

* zuul.nodepool.resources.project.{project}.{resource}.current
  Gauge with the currently used resources by project

* zuul.nodepool.resources.tenant.{tenant}.{resource}.counter
  Counter with the summed usage by tenant, e.g. CPU seconds

* zuul.nodepool.resources.project.{project}.{resource}.counter
  Counter with the summed usage by project, e.g. CPU seconds
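
The key layout above could be sketched like this (illustrative Python, not Zuul's actual code; the helper name is hypothetical):

```python
# Hypothetical helper that builds the per-tenant/per-project statsd key
# names described in this commit message. The key layout matches the
# message; the function itself is a sketch, not Zuul's real code.

def resource_stat_keys(tenant, project, resource):
    """Return the four statsd keys emitted for one resource type."""
    base = "zuul.nodepool.resources"
    return {
        "tenant_gauge": f"{base}.tenant.{tenant}.{resource}.current",
        "project_gauge": f"{base}.project.{project}.{resource}.current",
        "tenant_counter": f"{base}.tenant.{tenant}.{resource}.counter",
        "project_counter": f"{base}.project.{project}.{resource}.counter",
    }
```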

Depends-On: https://review.openstack.org/616262
Change-Id: I68ea68128287bf52d107959e1c343dfce98f1fc8
2019-05-29 04:10:08 +00:00
Zuul 97da909bd8 Merge "Add cgroup support to ram sensor" 2019-01-09 19:32:24 +00:00
Zuul 863705c334 Merge "Document missing executor stats" 2019-01-09 15:41:05 +00:00
Tobias Henkel 1f6e001c06
Document missing executor stats
The stats zuul.executor.<name>.pause and
zuul.executor.<name>.paused_builds are undocumented. While at it, fix
the indentation of this section.

Change-Id: I5d5bdc1fe748ec2c545c8b7e8ec2674d50208f9f
2018-12-20 22:13:06 +01:00
Tobias Henkel d4f75ffac8
Add timer for starting_builds
We currently have a gauge for starting_builds but actually have no
knowledge about how long jobs are in the starting state. This adds a
metric for this so we can see changes in the job startup time after
changes in the system.

Change-Id: I261f8bdc8de336967b9c8ecd6eafc68f0bfe6b78
2018-12-20 07:58:40 +01:00
Tobias Henkel 145e62b568
Add cgroup support to ram sensor
When running within k8s, the system memory statistics are useless as
soon as memory limits are configured (which is strongly advised). In
this case we additionally need to check the cgroups.
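
A minimal sketch of the idea, assuming cgroup v1 paths; the function name and fallback behaviour are illustrative, not the sensor's real implementation:

```python
# Sketch: prefer cgroup memory numbers over host-wide statistics when a
# memory limit is configured (e.g. under k8s). Cgroup v1 reports an
# effectively infinite limit when none is set, so treat huge values as
# "no limit". This is an illustration, not Zuul's actual sensor code.

def cgroup_ram_pct(limit_path="/sys/fs/cgroup/memory/memory.limit_in_bytes",
                   usage_path="/sys/fs/cgroup/memory/memory.usage_in_bytes"):
    """Return percent of the cgroup memory limit in use, or None if
    no meaningful limit applies (fall back to system stats then)."""
    try:
        with open(limit_path) as f:
            limit = int(f.read())
        with open(usage_path) as f:
            usage = int(f.read())
    except OSError:
        return None  # not running inside a memory cgroup
    if limit >= 2 ** 60:  # cgroup v1's "no limit" sentinel is huge
        return None
    return 100.0 * usage / limit
```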

Change-Id: Idebe5d7e60dc862e89d012594ab362a19f18708d
2018-12-18 22:25:27 +01:00
gaobin 5b3ca17c05 Modify some file content errors
Fix the following errors:
exectuor to executor
formated to formatted
overidden to overridden

Change-Id: Ie80e1632624c65adaf6aad86a2c7aae93da688ff
2018-12-11 06:11:07 +00:00
Ian Wienand c6fe6459f2 Rework zuul nodepool stats reporting
The current stats set a counter zuul.nodepool.<status> but then tries
to set more counters like zuul.nodepool.<status>.label.

This doesn't work because zuul.nodepool.<status> is already a counter
value; it can't also be an intermediate key.  Note this *does* work
with the timer values, but that's because statsd is turning the timer
into individual values
(e.g. zuul.nodepool.<status>.<mean|count|std...>) as it flushes each
interval.

Thus we need to rethink these stats.  This puts them under a new
intermediate key "requests" and adds a "total" count; thus
zuul.nodepool.<status> == zuul.nodepool.requests.<status>.total

The other stats, showing requests by-label and by-size will now live
under the zuul.nodepool.requests parent.

While we're here, use a statsd pipeline to send the status updates, as
it works better when sending lots of stats quickly over UDP.  This
isn't handled by the current debug log below; move that into the
test-case framework.
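
The pipeline idea, sketched with the stdlib only (the real code uses the statsd library's pipeline support; this class is illustrative):

```python
# Sketch of statsd "pipelining": buffer many metric lines and send them
# in one UDP datagram instead of one packet per stat. Illustrative only;
# Zuul uses the statsd library's built-in pipeline() for this.
import socket


class StatsdPipeline:
    def __init__(self, host="127.0.0.1", port=8125):
        self.addr = (host, port)
        self.buffer = []

    def incr(self, key, value=1):
        # statsd counter wire format: <key>:<value>|c
        self.buffer.append(f"{key}:{value}|c")

    def gauge(self, key, value):
        # statsd gauge wire format: <key>:<value>|g
        self.buffer.append(f"{key}:{value}|g")

    def send(self):
        # One datagram for the whole batch, newline-separated.
        payload = "\n".join(self.buffer).encode()
        self.buffer = []
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        sock.sendto(payload, self.addr)
        sock.close()
```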

The documentation has been clarified to match the code.

Change-Id: I127e8b6d08ab86e0f24018fd4b33c626682c76c7
2018-12-10 14:56:36 +11:00
Ian Wienand 18fb9ec37e Add gearman stats reference
The stats emitted under zuul.geard are currently undocumented.  Add
them to the monitoring guide and add some more details to the geard
troubleshooting guide for what to do if the stats look wrong.

Change-Id: I831def2f7c22d8ffff62569cc7d657033a85ed19
2018-11-27 20:25:04 +11:00
Tobias Henkel 40a895b03c
Fix indentation of executor monitoring docs
The load_average and pct_used_ram metrics are indented incorrectly
which placed them under zuul.executor.<executor>.phase.* in the docs.

Change-Id: Id613ce57a679d1ab4bf9f71bf4d5a6bde72e2d50
2018-07-20 19:13:23 +02:00
James E. Blair a4f94a14d7 Invert executor ram statsd metric
Folks tend to misread this metric as used ram, rather than available,
since that's how memory is typically graphed, so go ahead and invert
it.  Admins will need to mentally invert it again to determine whether
the executor is approaching the available RAM threshold.

Change-Id: I60cde8bf2fd04926cd2ac1bb733bf9c72fda8daf
2018-02-14 15:39:57 -08:00
David Moreau Simard 1267144b19
Add Executor Merger and Ansible execution statsd counters
This adds the following counters:
- zuul.executor.*.phase.setup.<result> (setup task)
- zuul.executor.*.phase.reset.<result> (reset connection task)
- zuul.executor.*.phase.<phase>.<result> (pre/run/post playbooks)
- zuul.executor.*.merger.['SUCCESS','FAILURE'] (merger status)
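
The key patterns above could be built like this (hypothetical helper names; only the key layout comes from the commit message):

```python
# Sketch of the counter key names added by this change. The functions
# are illustrative, not Zuul's actual code; the key patterns match the
# list in the commit message.

def phase_stat_key(executor, phase, result):
    """e.g. zuul.executor.ze01.phase.run.SUCCESS"""
    return f"zuul.executor.{executor}.phase.{phase}.{result}"


def merger_stat_key(executor, result):
    """e.g. zuul.executor.ze01.merger.FAILURE"""
    return f"zuul.executor.{executor}.merger.{result}"
```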

The data provided by these counters is not very reliable, in the sense
that failures may not be related to the executor itself and may
instead be legitimate issues with the patch or the job being run.

However, when averaged out, these counters should help us identify if
a particular executor is exhibiting irregular behavior when compared
to regular patterns or other executors.

Change-Id: Ie430f9935dce94f4b90cffee33695e1eb4d1ca7d
2018-02-07 13:54:03 -05:00
James E. Blair df37ad2ce7 Executor: Don't start too many jobs at once
The metrics that we use to govern load on the executors are all
trailing indicators.  The executors are capable of accepting a
large number of jobs in a batch and then, only after they begin
to run, will the load indicators increase.  To avoid the thundering
herd problem, reduce the rate at which we accept jobs past a certain
point.

That point is twice the number of jobs as the target load average.
In practice that seems to be a fairly conservative but reasonable
number of jobs for the executor to run, so, to facilitate a quick
start, allow the executor to start up to that number all at once.

Once the number of jobs running is beyond that number, subsequent
jobs will only be accepted one at a time, after each one completes
its startup phase (cloning repos, establishing ansible connections),
which is to say, at the point where the job begins running its first
pre-playbook.

We will also wait until the next regular interval of the governor
to accept the next job.  That's currently 30 seconds, but to make
the system a bit more responsive, it's lowered to 10 seconds in this
change.

To summarize: after a bunch[1] of jobs are running, after each new
job, we wait until that job has started running playbooks, plus up
to an additional 10 seconds, before accepting a new job.

This is implemented by adding a 'starting jobs' metric to the governor
so that we register or de-register the execute function based on
whether too many jobs are in the startup phase.  We add a forced
call to the governor routine after each job starts so that we can
unregister if necessary before picking up the next job, and wrap that
routine in a lock since it is now called from multiple threads and
its logic may not be entirely thread-safe.

Also, add tests for all three inputs to manageLoad.

[1] 2*target load average
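
The acceptance rule summarized above could be sketched as follows (a simplification under stated assumptions; the function and parameter names are illustrative, not the executor's real attributes):

```python
# Sketch of the governor's acceptance rule from this commit message:
# accept jobs freely up to 2x the target load average, then only one at
# a time, waiting until no job is still in its startup phase.
# Illustrative only, not Zuul's actual implementation.

def should_accept_job(running_builds, starting_builds, load_target):
    """Decide whether the executor should register for more work."""
    if running_builds < 2 * load_target:
        return True  # quick start: take a batch up front
    # Past that point, accept only after the previous job has finished
    # its startup phase (cloning repos, establishing ansible connections).
    return starting_builds == 0
```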

Change-Id: I066bc539e70eb475ca2b871fb90644264d8d5bf4
2018-02-02 11:36:49 -08:00
Zuul 59092227e3 Merge "Add available RAM to statsd" 2018-02-01 15:54:50 +00:00
James E. Blair 40ca3791fb Add available RAM to statsd
If the executor is using it to decide whether to accept jobs, we
should graph it.

Change-Id: If34e81f953df4ed0a2c2c287e7d00d4977267fef
2018-01-31 14:22:07 -08:00
Tobias Henkel 60a8547ffb
Fix statsd documentation about events
The events are landing in statsd as zuul.event.<driver>.<type> and not
as zuul.event.<driver>.event.<type>

Change-Id: I9c4901a9c02d4d833fdc3e1b7617a4bbba15c94d
2018-01-31 11:16:15 +01:00
James E. Blair 4dd5f4b6cb Document executor/merger stats
Also, change the interval to 30s rather than 10s.  There is some
cost to the gear server to calculate the status report, especially
if the queue is long.

Change-Id: Icfe4c6496e45847cdf884f23a06d7186aafdf8e2
2017-10-23 13:08:06 -07:00
James E. Blair 4f1731ba86 Emit some nodepool stats
Change-Id: I7bc3914e8b8d64afee061c002dcc9cca5dd1ef4d
2017-10-13 15:56:59 -07:00
James E. Blair faf8198f2a Emit some stats from executor
Emit the load average, a counter for builds, and a gauge for
running builds.

Change-Id: I8541724f1322b8257b623b3b2cfd8f3e6b95574d
2017-10-13 15:56:25 -07:00
James E. Blair ded241e598 Switch statsd config to zuul.conf
The automatic statsd configuration based on env variables has
proven cumbersome and counter-intuitive.  Move its configuration
into zuul.conf in preparation for other components emitting stats.

Change-Id: I3f6b5010d31c05e295f3d70925cac8460d334283
2017-10-13 14:04:42 -07:00
James E. Blair 80ac158acd Update statsd output for tenants
Update the statsd output to account for tenants and other v3 changes.

Change-Id: I984e1930ab63d9a551cf33be922bac447ad0df9d
2017-10-09 07:02:40 -07:00
David Shrewsbury 1c61c71c9c Fix documentation nits
Just minor spelling and grammar fixes.

Change-Id: I2dc98e4b68ac2df35fe1647cd4af3402cd55d77d
2017-08-16 16:04:54 -04:00
James E. Blair 91c9dde0cb Docs: reformat metrics docs
Adds a new directive/role for stats (zuul:stat).

Change-Id: If292c393811eaffd955c98589088adf4881a61e3
2017-08-04 11:10:24 -07:00
James E. Blair eff5a9d8d7 Reorganize docs into user/admin guide
Refresh the user and admin guide for v3 changes, and reorganize into
a narrative structure which makes more sense for v3.

Change-Id: I4ac3b18d5ed33b0fea4e2ef0318b19bfc3447ccc
2017-07-05 14:35:22 -07:00