Qron User Manual

This web page is the user manual of the qron free scheduler.

It can be read both in the qron web user interface (embedded in the qron daemon) and on the Internet here: http://qron.eu/doc/master/user-manual.html.

You are currently reading this version of the documentation: v1.15.5

Concepts

Qron is a task scheduler, that is, software that controls the execution of non-interactive application components such as large batches planned once a day, automatic processing of incoming messages every minute, or any processing triggered by an event not directly issued by a human being.

This implies defining tasks, the way they are triggered, and the properties of the infrastructure on which they run, along with centralized logs and alerting means to notify about anything wrong or suspicious.

Often, scheduling tasks will also need a powerful parameters system with hierarchical inheritance and evaluation language, handling scheduling constraints such as resources and calendars, or a customizable event-driven model.

Monitoring and operating the scheduler can be done through the responsive-design web user interface and (coming soon) desktop and phone/tablet apps.

Integrating with other IT tools and custom extensions is possible through the HTTP API and standard operating system integration.

Tasks Hierarchy

Tasks are grouped within taskgroups in a tree hierarchy.

Sample trigger diagram with 5 tasks within 2 task groups:

Task Groups

Tasks are grouped within taskgroups, which make the configuration clearer from a documentation point of view (when used to group tasks e.g. by application or module) and/or factorize some task properties (such as parameters and events).

A taskgroup consists of its id (its only mandatory field), plus optional fields: a human readable label and task properties that are inherited by any task belonging to the group. The task properties that can be defined at the taskgroup level are params, vars and event subscriptions.

There is only one level of taskgroups (a taskgroup cannot belong to another taskgroup). However, dots in taskgroup ids are processed as hierarchical separators by qron user interfaces to display a multi-level tree.

Sample configuration file fragments:

(taskgroup app1.biz.batch
  (label business batches for application app1)
)

(taskgroup app2.tech
  (label technical tasks for application app2)
  (param db_password mysecret)
  (var ORACLE_SID ORCL)
)

(taskgroup minimalgroup)
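
The way dots in taskgroup ids can be turned into a multi-level tree for display can be sketched as follows. This is only an illustration of the idea, not qron's actual code: the function name build_taskgroup_tree and the nested dict representation are assumptions made for the example.

```python
def build_taskgroup_tree(taskgroup_ids):
    """Nest taskgroup ids into a dict tree, splitting each id on dots."""
    tree = {}
    for group_id in taskgroup_ids:
        node = tree
        for part in group_id.split("."):
            # create the intermediate level if it does not exist yet
            node = node.setdefault(part, {})
    return tree


# ids taken from the sample configuration fragments above
tree = build_taskgroup_tree(["app1.biz.batch", "app2.tech", "minimalgroup"])
print(tree)
```

The taskgroups themselves stay flat in the configuration, only the display is hierarchical.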

Tasks

The task is qron's most important configuration element. It carries almost all the information needed to schedule and execute a task, and some of the event and alerting configuration is often done task by task.

Tasks Description

A task is described by its id and its parent taskgroup. It should also have a human readable label and some additional documentation info.

Sample configuration file fragments:

(task build-reports-customers
  (taskgroup app1.biz.batch)
  (label "build PDF reports about customers")
  (info "see http://intranet/app1/ReportingBatches")
)

(task oracle-statistics
  (taskgroup app2.tech)
  (label recompute statistics for db ORCL)
)

(task minimaltask(taskgroup minimalgroup))

Tasks Execution Properties

Task execution is defined by 3 properties:

mean: describes the means used to execute the task, among the following:
- local: spawn a process local to the qron daemon, on the same host
- ssh: establish an SSH connection to the target host to execute the task
- docker: run a docker container to execute the task, on the same host
- http: perform an HTTP request
- donothing: do not execute anything, but still trigger events, pretending the execution to be successful (i.e. onsuccess actions will be triggered, not onfailure ones)
target: defines which host or cluster will be chosen to execute the task, defaults to localhost, see infrastructure section for more details
command: interpreted depending on the execution mean:
- local, ssh and docker means: command line, spaces being interpreted as command argument separators unless they are protected by backslashes
- http mean: path and query string
- donothing mean: not applicable, should be omitted

Sample configuration file fragments:

(task build-reports-customers
  (taskgroup app1.biz.batch)
  (mean ssh)
  (target server1)
  (command /opt/app1/bin/build_reports.sh --customers)
)

(task process-last-orders
  (taskgroup app1.biz.queuing)
  (mean http)
  (target server4)
  (command /pollers/last-orders/?from=-5minutes)
  (param method POST) # defaults to GET
  (param user scheduler) # set up a "Authorization: Basic" header
  (param password mysecret)
)

(task remove-old-files
  (taskgroup app1.tech)
  (mean ssh)
  (target server1)
  (command bash -c '
# shell script right in qron config file
set -e
for DIR in %directories; do
  find $DIR -mtime +60 -delete
done
')
  (param directories /opt/app1/files /opt/app1/backup /tmp/app1)
)

Tasks Triggers and Constraints

Tasks are queued for execution when triggered, and one can define one or several triggers among:

cron: time trigger in the form of a Unix cron pattern with precision down to the second, e.g.
- * * * * * *: every second
- 13 6 7 * * 1: every Monday (1) at 07:06:13 in the morning
- /15 * 8-20 * * *: every 15 seconds during hours 8 to 20, i.e. from 8 a.m. until just before 9 p.m.
notice: event trigger that happens when a given notice event occurs
filechange: not yet implemented, but will be able to trigger task execution on local or remote file change or creation
HTTP API: a task can be triggered by HTTP API requesttask call
requesttask event: a task can be triggered by occurrence of a requesttask event
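
To make the six-field cron patterns more concrete, here is a minimal sketch of how such a pattern could be matched against a point in time. It is not qron's implementation. It assumes the field order second, minute, hour, day of month, month, day of week (0 being Sunday), inclusive ranges, and that all six fields must match; the names expand_field and cron_matches are made up for the example, and only the *, n, a-b and /step syntaxes are handled.

```python
from datetime import datetime

# Allowed values for each field, in pattern order: second, minute, hour,
# day of month, month, day of week (0 = Sunday, 1 = Monday).
FIELD_RANGES = [(0, 59), (0, 59), (0, 23), (1, 31), (1, 12), (0, 6)]


def expand_field(field, low, high):
    """Return the set of values accepted by one field of the pattern."""
    base, _, step = field.partition("/")
    if base in ("", "*"):
        start, end = low, high
    elif "-" in base:
        start, end = (int(part) for part in base.split("-"))
    else:
        start = int(base)
        # assumption: "5/15" is read as "from 5 up to the maximum, every 15"
        end = high if step else start
    return set(range(start, end + 1, int(step) if step else 1))


def cron_matches(pattern, moment):
    """Tell whether a datetime matches a six-field cron pattern."""
    fields = pattern.split()
    if len(fields) != len(FIELD_RANGES):
        raise ValueError("expected 6 fields: " + pattern)
    actual = [moment.second, moment.minute, moment.hour, moment.day,
              moment.month, (moment.weekday() + 1) % 7]
    return all(
        value in expand_field(field, low, high)
        for field, value, (low, high) in zip(fields, actual, FIELD_RANGES)
    )


monday_morning = datetime(2016, 4, 4, 7, 6, 13)  # 2016-04-04 was a Monday
print(cron_matches("13 6 7 * * 1", monday_morning))
print(cron_matches("/15 * 8-20 * * *", monday_morning))
```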

TODO: calendars

Sample configuration file fragments:

(task build-reports-customers
  (taskgroup app1.biz.batch)
  (trigger(cron 0 0 23 * * 1-6)) # 23:00 mon-sat
)

(task process-last-orders
  (taskgroup app1.biz.queuing)
  (trigger
    (cron /15 * 8-20 * * *) # every 15 seconds daily
    (cron 0 0 22 * * * # once in the evening
      (param mode nightly) # overriding parameter 'mode'
    )
    (notice process-last-orders-now) # on event
  )
)

When a task is queued for execution, it will be executed only when no constraint prevents it from running. Constraints are of several types:

maxinstances: a given task is only allowed to be run maxinstances times at a time, which defaults to 1
resources: if a given task is declared to need some resources, it can only run on its target host if the host has enough resources available. Resources can be seen as semaphores and can be used for mutual exclusion across tasks, see resources section below.

Sample configuration file fragments:

(task process-customer-request
  (taskgroup app1.biz.queuing)
  (maxinstances 16) # code is multiexec-safe and is faster when parallelized
)

(task build-reports-customers
  (taskgroup app1.biz.batch)
  (target report-server)
  (resource memory 2048) # avoid too many memory consumers on the server
  (resource reports-semaphore 1) # avoid two report batches running at a time
)
(host report-server
  (hostname server4.acme.com)
  (resource memory 8192)
  (resource reports-semaphore 1)
)
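
The idea of resources behaving like counting semaphores can be sketched as below. This is a simplified model under assumptions, not qron's code: the Host class and its try_acquire and release methods are hypothetical names, and the quantities are those of the sample fragment above.

```python
class Host:
    """Toy model of a host whose resources are consumed by running tasks."""

    def __init__(self, host_id, resources):
        self.host_id = host_id
        self.available = dict(resources)

    def try_acquire(self, needed):
        """Consume the needed resources, or change nothing and return False."""
        if any(self.available.get(kind, 0) < amount
               for kind, amount in needed.items()):
            return False
        for kind, amount in needed.items():
            self.available[kind] -= amount
        return True

    def release(self, needed):
        """Give resources back when the task finishes."""
        for kind, amount in needed.items():
            self.available[kind] += amount


report_server = Host("report-server", {"memory": 8192, "reports-semaphore": 1})
needs = {"memory": 2048, "reports-semaphore": 1}

first = report_server.try_acquire(needs)   # enough of both resources
second = report_server.try_acquire(needs)  # the semaphore is already taken
report_server.release(needs)               # first task finishes
third = report_server.try_acquire(needs)   # the semaphore is free again
print(first, second, third)
```

With a quantity of 1 on the host, reports-semaphore acts as a mutual exclusion lock between all the tasks that declare it.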

In addition, tasks can be rejected when submitted for queueing, and queued tasks can even be automatically canceled, according to enqueuepolicy, which is defined task by task and can have the following values/behaviors:

enqueueanddiscardqueued: when a task is queued, every other queued task with the same id is canceled, avoiding stacking tasks in big numbers when they are slower than expected, this is the default
enqueueall: allow boundless stacking of this task in the queue, provided the queue limit is not reached
warning: when the queue limit is reached, other tasks may be rejected, so this setting is dangerous for the whole tasks schedule
enqueueuntilmaxinstances: reject task requests when task maxinstances is already reached by running instances + queued ones, this can be convenient for user-requested tasks since it's no longer possible to request execution of a task that is already running and is set to maxinstances = 1 (which is the default)

Sample configuration file fragments:

(task build-reports-customers
  (taskgroup app1.biz.batch)
  (enqueuepolicy enqueueuntilmaxinstances)
)
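
The three enqueue policies can be compared with the following sketch. It is a simplified model, not qron's implementation: the enqueue function and its list-based queue are assumptions made for the example, and the global queue limit is deliberately not modeled.

```python
def enqueue(queue, running, task_id, policy, maxinstances=1):
    """Apply an enqueue policy and return True if the request was queued.

    queue is the list of queued task ids (modified in place) and running
    is the list of task ids currently being executed.
    """
    if policy == "enqueueanddiscardqueued":
        # cancel every already queued request for the same task
        queue[:] = [queued for queued in queue if queued != task_id]
    elif policy == "enqueueuntilmaxinstances":
        # reject when running + queued instances already reach maxinstances
        if running.count(task_id) + queue.count(task_id) >= maxinstances:
            return False
    elif policy != "enqueueall":
        raise ValueError("unknown enqueue policy: " + policy)
    queue.append(task_id)
    return True


queue = ["a", "b", "a"]
enqueue(queue, [], "a", "enqueueanddiscardqueued")
print(queue)  # older requests for "a" were discarded, the new one is last
```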

Tasks Parameters

Tasks have free text parameters defined by the param configuration element. They can be used for any custom parameters and are evaluated with the % evaluation character in many other configuration items.

The whole system for parameter evaluation and inheritance is discussed in depth in the parameters section below, alongside special parameter names that affect qron behavior (parameters such as ssh.options when using the ssh execution mean).

Parameters can be overridden at task request time, whether in a trigger definition, through the HTTP API or in the web user interface using request forms.

Sample configuration file fragments:

(task build-reports-customers
  (taskgroup app1.biz.batch)
  (mean ssh)
  (target server1)
  (command /opt/app1/bin/build_reports.sh --customers --from %from)
  (param from -2days) # regular param value
  (trigger
    (cron 0 0 23 * * 1-6)
    (cron 0 0 23 * * 0 (param from -8days)) # different param on sunday
  )
  (requestform
    (field from # can be overridden when the task is started from the web ui
      (label Data depth) # ui label
      (placeholder -2days) # ui placeholder/hint
      (format "-[0-9]+days") # server side validation regexp
    )
  )
)

Tasks Vars

Depending on the execution mean, there can be out of band parameters, transmitted by means other than the command itself. For local and ssh means this is possible through environment variables (hence the name). For the http mean this is possible through custom HTTP headers.

These out of band parameters are defined using the var configuration element.

TODO: explain how default environment is set, according to execution mean.

Sample configuration file fragments:

(task build-reports-customers
  (taskgroup app1.biz.batch)
  (mean ssh)
  (target server1)
  (command /opt/app1/bin/build_reports.sh --customers)
  (var ORACLE_SID ORCL)
)

(task process-last-orders
  (taskgroup app1.biz.queuing)
  (mean http)
  (target server4)
  (command /pollers/last-orders/?from=-5minutes)
  (var X-TaskInstanceId %!taskinstanceid)
  (var Authorization Basic AhjsqduYez=)
  #in fact it's often easier to set Authorization header this way:
  #(param user scheduler) # set up a "Authorization: Basic" header
  #(param password mysecret)
)

Tasks Events Subscriptions

Several task-level events can be used to define actions at task (or taskgroup) level. See Task-Level Events section below for more details.

Sample configuration file fragments:

(task build-reports-customers
  (taskgroup app1.biz.batch)
  (onsuccess
    # sending UDP packets to statsd server, see https://github.com/etsy/statsd
    (requesturl "task.%!taskid.ok:1|c" (address udp://127.0.0.1:8125))
    (requesturl (address udp://127.0.0.1:8125) "task.%!taskid.time:%!durationms|ms")
    # add a custom debug log for task success
    (log(severity debug) task success! *%!tasklocalid*)
  )
  (onfailure
    # sending UDP packets to statsd
    (requesturl(address udp://127.0.0.1:8125) "task.%!taskid.ko:1|c")
  )
)

Tasks Monitoring

Whenever a task behaves suspiciously, the scheduler emits an automatic alert, for instance when a task finishes with a failure status.

It's often a good idea to send the automatic alerts by e-mail (or any other means) for all or most of the tasks in order to be notified of an issue. See alerts subscriptions section below.

More automatic alerts are available when setting the following task parameters:

minexpectedduration: minimum expected task running duration in seconds. If the task runs for less, an alert with the pattern task.tooshort.%!taskid is raised when the task finishes.
maxexpectedduration: maximum expected task total duration in seconds. If the task takes longer, an alert with the pattern task.toolong.%!taskid is raised as soon as the task has been running longer than maxexpectedduration (the detection may take up to 1 minute, but it does not wait for the task to finish).
maxdurationbeforeabort: total duration above which the scheduler will abort a running task. This is useful for certain kinds of tasks that are better killed than left running too long (a nightly purge that can disrupt normal daily processing, a periodic task running every minute that should be killed when frozen for 3 minutes rather than block next executions forever, etc.) but is dangerous in many other cases since aborting is quite harsh (kill for local and ssh execution means, close for http execution mean).

Task duration is the time elapsed between the queuing of a task request and the finishing of its execution.

Sample configuration file fragments:

(task build-reports-customers
  (taskgroup app1.biz.batch)
  (minexpectedduration 10) # below 10" there is an issue with data selected
  (maxexpectedduration 3600) # 1 hour
)

(task process-last-orders
  (taskgroup app1.biz.queuing)
  (trigger(cron /30 * 8-20 * * *)) # twice a minute
  (maxexpectedduration 20)
  (maxdurationbeforeabort 300) # 5 minutes
)
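
The naming of the duration alerts can be illustrated with this small sketch, which only looks at a task once it has finished. It is an assumption-laden illustration rather than qron's logic: the function duration_alerts is a made-up name, and in qron the toolong alert can be raised while the task is still running.

```python
def duration_alerts(task_id, running_seconds, total_seconds,
                    minexpectedduration=None, maxexpectedduration=None):
    """Return the automatic alert names a finished task would trigger."""
    alerts = []
    if minexpectedduration is not None and running_seconds < minexpectedduration:
        alerts.append(f"task.tooshort.{task_id}")
    if maxexpectedduration is not None and total_seconds > maxexpectedduration:
        alerts.append(f"task.toolong.{task_id}")
    return alerts


# a batch expected to run between 10 seconds and 1 hour, finished in 4 seconds
print(duration_alerts("app1.biz.batch.build-reports-customers", 4, 5, 10, 3600))
```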

Tasks Request Forms

When a task execution request is submitted interactively by a user through the web user interface (or, when available, the desktop/phone user interface), it is possible to display a custom form asking the user which parameters they would like to override.

A request form is described in the configuration file using a requestform element within a task declaration. It contains one field element per form field, each of which can contain the following optional elements:

label: user interface label (field title)
placeholder: user interface hint (placeholder attribute of input HTML element)
suggestion: default value
format: validation regular expression (for the web user interface, the validation is enforced server-side)

Sample configuration file fragment:

(task build-reports-customers
  (taskgroup app1.biz.batch)
  (command "/opt/app1/bin/build_reports.sh --customers --from %from
     %{=switch:%dryrun:true:--dryrun:}")
  (param from -2days) # regular param value
  (requestform
    (field from # can be overridden when the task is started from the web ui
      (label Data depth) # ui label
      (placeholder -2days) # ui placeholder/hint
      (format "-[0-9]+days") # server side validation regexp
    )
    (field dryrun
      (label Dry run)
      (suggestion false) # default value in ui form
      (format "true|false")
    )
  )
)
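
Server-side validation of submitted values against the format regular expressions could look like the sketch below. This is only an illustration: the function invalid_fields and the dict layout are assumptions, and so is the choice of matching the whole value (re.fullmatch), since this manual does not say whether qron anchors the expression.

```python
import re


def invalid_fields(form_fields, submitted):
    """Return the names of submitted fields whose value violates 'format'."""
    invalid = []
    for name, value in submitted.items():
        pattern = form_fields.get(name, {}).get("format")
        # assumption: the whole value must match the regular expression
        if pattern is not None and re.fullmatch(pattern, value) is None:
            invalid.append(name)
    return invalid


# formats taken from the request form of the fragment above
form = {
    "from": {"format": "-[0-9]+days"},
    "dryrun": {"format": "true|false"},
}
print(invalid_fields(form, {"from": "-8days", "dryrun": "maybe"}))
```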

Infrastructure

Tasks are executed on hosts, optionally grouped by clusters.

Sample deployment diagram with 5 tasks deployed on 1 cluster and 3 hosts:

Hosts

A host is an execution target for tasks. It must have an id and, for most execution means, a hostname (whose default value is the id). A host can also carry an optional human readable label and resources that are consumed by tasks that need them, see resources section below.

The same host can be used by several execution means, for instance one can define http and ssh tasks with the same host as target, provided both means can use the same hostname to reach the host.

There must always be a host named localhost. If it's not declared in the configuration file, qron will pretend it is declared with no other property than its id.

Sample configuration file fragments:

(host server1
  (hostname server1.mycompany.com)
  (resource memory 8192)
)

(host localhost)

Clusters

A cluster defines an execution target consisting of a list of hosts and a balancing method to choose the actual execution host among them. It must have an id, which cannot be identical to any host id, and can have a human readable label.

The following balancing methods are supported:

first: choose first host (in configuration order) with enough resources
random: choose a random host among those with enough resources
roundrobin: choose cluster hosts one after another, skipping hosts without enough resources, this is the default balancing method
each: when requesting the task once, it is actually executed on every host within the cluster

Sample configuration file fragments:

(cluster app-servers
  (hosts server1 server2 server3)
  (balancing roundrobin) # this can be omitted since this is the default
)
(task process-last-orders
  (taskgroup app1.biz.queuing)
  (mean http)
  (target app-servers)
  (command /pollers/last-orders/?from=-5minutes)
)

(cluster unix-servers
  (hosts server4 server5 server6)
  (balancing each)
)
(task remove-old-files-on-every-unix-server
  (taskgroup app1.tech)
  (mean ssh)
  (target unix-servers)
  (command find /opt/app1/files -mtime +60 -delete)
  (maxinstances 3) # allow parallel execution on 3 hosts
)
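
The four balancing methods can be compared with the following sketch. It is a toy model under assumptions, not qron's code: the Cluster class, its choose method and the has_enough_resources callback are names invented for the example, and how the each method interacts with resources is not modeled.

```python
import random


class Cluster:
    """Toy model of a cluster choosing execution hosts for one task request."""

    def __init__(self, hosts, balancing="roundrobin"):
        self.hosts = list(hosts)
        self.balancing = balancing
        self._next = 0  # index of the next round-robin candidate

    def choose(self, has_enough_resources=lambda host: True):
        """Return the list of hosts on which the task should be executed."""
        if self.balancing == "each":
            return list(self.hosts)
        eligible = [host for host in self.hosts if has_enough_resources(host)]
        if not eligible:
            return []
        if self.balancing == "first":
            return [eligible[0]]
        if self.balancing == "random":
            return [random.choice(eligible)]
        if self.balancing == "roundrobin":
            for offset in range(len(self.hosts)):
                index = (self._next + offset) % len(self.hosts)
                if has_enough_resources(self.hosts[index]):
                    self._next = (index + 1) % len(self.hosts)
                    return [self.hosts[index]]
        raise ValueError("unknown balancing method: " + self.balancing)


app_servers = Cluster(["server1", "server2", "server3"])
picks = [app_servers.choose()[0] for _ in range(4)]
print(picks)
```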

Logs

Qron logs information related to task scheduling and execution, configuration changes, web ui user interaction, etc. In addition, it also logs error messages from tasks themselves when the execution mean makes it possible, which is a simple way to centralize task logs.

Logs Configuration

Several log files can be defined in the configuration file with the log element, which must have the following properties:

file: path to the log file, which is %-evaluated and thus can vary depending on the date
level: minimal severity level to log, among debug, info, warning, error and fatal

The log element can also have an unbuffered flag to disable write buffers.

Sample configuration file fragments:

(log(level debug)(file "/var/log/qron/debug-%{=date:yyyyMMdd}.log"))
(log(level info)(file "/var/log/qron/info-%{=date:yyyyMMdd}.log"))

(log(level debug)(file "/tmp/qron-debug.log")(unbuffered))

Centralizing Logging of Tasks Stderr

Qron automatically records in its logs the standard error stream of tasks executed using local mean, and (by default) both the standard error and standard output streams of tasks executed using ssh mean.

This feature enables centralizing task error streams in the qron log. However, not all task logs should be recorded that way. For instance, if a huge 4-hour Java batch logging millions of debug lines sends them to its stderr, qron logs will no longer be readable since they will be flooded with noise. One should take care of that when designing batches or their wrapper shell scripts. Alternatively, logging can be disabled for a given task or taskgroup using the parameters described below.

Logging of task output is modified by several special task parameters:

stderrfilter: regular expression to filter out task output, for instance:
- .* will drop any data
- ^Connection to [^ ]* closed.$ will drop an annoying ssh log line
ssh.disablepty: when disabling pty allocation of a task executed through ssh mean, standard output will no longer be merged with standard error and only the latter will be recorded in qron log (this is not the only effect of disabling pty allocation, see special parameters section for more information)
ssh.options: this more general parameter can also, among other things, disable pty allocation

Logs Format

Qron log files are formatted with lines ended with the newline character (0x0a), each line complying to either of these two formats:

First record lines immediately start with a timestamp (in other words, their first character is a digit, likely to be a 2):
```
timestamp taskid/taskinstanceid sourcefile:sourceline severity message
```
Continuation lines start with two spaces (0x20):
```
  message_continuation
```

Log records fields are defined this way:

timestamp: ISO 8601 timestamp, currently from year to milliseconds, with or without timezone (currently the timestamp is recorded using local timezone but without explicit timezone specification, this may change in the future)
taskid: either the taskid if the record is related to a specific task, or the name of the logging thread (e.g. SchedulerThread or AlerterThread)
taskinstanceid: either the taskinstanceid if the record is related to a specific task instance execution, or 0
sourcefile: source code file name if known/relevant, otherwise an empty string
sourceline: source code line number or label if known/relevant, otherwise an empty string
severity: record severity, currently among: DEBUG INFO WARNING ERROR FATAL (this may change in the future but will stay an alphanumeric identifier without spaces)
message: any text, encoded as UTF-8. If the message contains newlines, it is continued on the next lines, with two extra space characters (0x20) at the beginning of each continuation line

For instance:

2016-04-05T13:57:42,003 gridboard.failing/201604051357420004 : INFO starting task 'gridboard.failing' through mean 'local' after 1 ms in queue
2016-04-05T13:57:40,421 SchedulerThread/0 : INFO starting scheduler
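
A log consumer relying on this format could be sketched as follows. This is not a tool shipped with qron: the regular expression and the parse_log function are assumptions derived only from the field descriptions above, and lines matching neither format are silently skipped.

```python
import re

# One group per field of a first record line, as described above.
RECORD = re.compile(
    r"^(?P<timestamp>\d\S*) (?P<taskid>[^/\s]+)/(?P<taskinstanceid>\S+) "
    r"(?P<sourcefile>[^:\s]*):(?P<sourceline>\S*) "
    r"(?P<severity>\S+) (?P<message>.*)$"
)


def parse_log(lines):
    """Turn log lines into a list of dicts, merging continuation lines."""
    records = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("  ") and records:
            # continuation line: append to the previous record's message
            records[-1]["message"] += "\n" + line[2:]
            continue
        match = RECORD.match(line)
        if match:
            records.append(match.groupdict())
    return records


sample = [
    "2016-04-05T13:57:40,421 SchedulerThread/0 : INFO starting scheduler\n",
    "2016-04-05T13:57:42,003 gridboard.failing/201604051357420004 : INFO "
    "starting task 'gridboard.failing' through mean 'local' after 1 ms in queue\n",
    "  hypothetical second line of the same message\n",
]
records = parse_log(sample)
for record in records:
    print(record["timestamp"], record["severity"], record["taskid"])
```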

Logs Rotation

Qron does not rotate, compress or purge its log files; however, it reopens them periodically (every minute), which makes several log rotation patterns easy to implement. One of the following should be chosen, or even a mix of them.

Timestamped Logs File Names

Defining log files with a timestamp in the name ensures that qron will stop writing to a log file within 1 minute after the timestamp changes. Therefore old log files can be compressed or archived by any external means (e.g. logrotate daemon or a task scheduled by qron itself) without needing to signal qron to reopen its log files.

Sample configuration file fragment:

(log(level debug)(file "/var/log/qron/debug-%{=date:yyyyMMdd}.log"))
(log(level info)(file "/var/log/qron/info-%{=date:yyyyMMdd}.log"))
(taskgroup qron.tech)
(task logs-compress
  (taskgroup qron.tech)
  (mean local)
  (command 'find /var/log/qron -name "*.log" -mtime +1 -exec bzip2 {} \\;')
  (trigger(cron 0 2 0 * * *))
)
(task logs-purge
  (taskgroup qron.tech)
  (mean local)
  (command 'find /var/log/qron -mtime +365 -delete')
  (trigger(cron 0 2 0 * * *))
)

Numbered Log Files Names

When scheduling a small number of tasks, one may prefer a numbered log file scheme. The example below shows it with a weekly rotation and compression processed by logrotate.

Sample configuration file fragment:

(log(level info)(file /var/log/qron.log))

Along with a logrotate configuration file like this one:

/var/log/qron.log {
    weekly
    rotate 52
    missingok
    notifempty
    compress
    delaycompress
}

Alerts

There are two kinds of alerts:

stateful alerts, which are emitted when raised and canceled but also periodically reminded if still raised
one-shot alerts, which are emitted immediately each time they occur: they are de-duplicated to avoid spam but are not subject to raise/cancel/remind mechanisms.

Whenever possible, stateful alerts should be preferred. However, there are cases where one wants to be alerted of the occurrence of an abnormal or rare event rather than of an abnormal condition. In these cases one-shot alerts are well suited.

Both kinds of alerts are UTF-8 character strings. There are no technical constraints on their format; however, the subscription mechanisms will be easier to use if the alert strings follow a hierarchical dot naming convention, e.g. task.failure.$TASKGROUP.$TASK or myapp.mycustomalert.

Alerts are notified through alert channels and can be displayed on web ui, especially using gridboards.

Stateful Alerts

At first glance, stateful alerts have 3 statuses: nonexistent, raised and canceled. However, if an alert is constantly raised and canceled, without protection it would be a huge source of spam/noise and there would be no benefit to handle alerts as stateful. Therefore the alert engine implements grace delays before actually raising or canceling an alert, and there are a few more statuses and transitions than the most intuitive ones. They are described below.

Alerts State Diagram:

Alerts statuses:

nonexistent: the alert string is not known to the alert engine
rising: raising of the alert was requested less than the rise delay ago
mayrise: cancellation of the alert was requested while it was rising, and the mayrise delay has not yet been reached
raised: the alert was rising and the rise delay has been reached, or immediate raising of the alert was requested
dropping: cancellation of the alert was requested while it was raised, and the drop delay has not yet been reached
canceled: the alert was dropping and the drop delay has been reached, or immediate cancellation of the alert was requested

Alerts status transitions:

raiseAlert(): turn a dropping alert into raised and any other alert into rising
cancelAlert(): turn a rising alert into mayrise and a raised alert into dropping
raiseImmediatlyAlert(): turn any alert into raised
cancelImmediatlyAlert(): turn rising and mayrise alerts into nonexistent, and turn raised and dropping alerts into canceled
rise delay timeout: turn rising and mayrise alerts into raised
mayrise delay timeout: turn mayrise alerts into nonexistent
drop delay timeout: turn dropping alerts into canceled
canceled alerts immediately become nonexistent again, as soon as cancel notifications are emitted

Induced effects:

When an alert becomes raised and was neither raised nor dropping before, an alert is emitted for all matching subscriptions.
When an alert becomes canceled, an alert cancellation is emitted for all matching subscriptions.
When an alert stays raised or dropping for a long time, some alert channels (most notably the mail channel) will notify subscribers with alert reminders.
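
The statuses, transitions and induced effects listed above can be summarized as a small transition table. This is a sketch, not the alert engine's code: the event names are hypothetical labels for the transitions, delays are not modeled (timeouts are plain events), and it assumes that an already raised alert stays raised on raiseAlert(), which the list above does not state explicitly.

```python
STATUSES = ["nonexistent", "rising", "mayrise", "raised", "dropping", "canceled"]

# event -> {current status: new status}; a missing status means "no change"
TRANSITIONS = {
    "raise": {"nonexistent": "rising", "mayrise": "rising",
              "canceled": "rising", "dropping": "raised"},
    "cancel": {"rising": "mayrise", "raised": "dropping"},
    "raise_immediately": dict.fromkeys(STATUSES, "raised"),
    "cancel_immediately": {"rising": "nonexistent", "mayrise": "nonexistent",
                           "raised": "canceled", "dropping": "canceled"},
    "rise_delay_timeout": {"rising": "raised", "mayrise": "raised"},
    "mayrise_delay_timeout": {"mayrise": "nonexistent"},
    "drop_delay_timeout": {"dropping": "canceled"},
    "cancel_notified": {"canceled": "nonexistent"},
}


def step(status, event):
    """Return (new status, emission) where emission is None or a string."""
    new_status = TRANSITIONS[event].get(status, status)
    emission = None
    if new_status == "raised" and status not in ("raised", "dropping"):
        emission = "alert"
    elif new_status == "canceled" and status != "canceled":
        emission = "cancellation"
    return new_status, emission


# a flapping alert: raised, canceled, then raised again before the drop delay
status = "nonexistent"
for event in ["raise", "rise_delay_timeout", "cancel", "raise"]:
    status, emission = step(status, event)
    print(event, "->", status, emission)
```

In this flapping scenario only one alert is emitted, which is the purpose of the dropping status.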

See alerts fine tuning section below for delays default values and advice on their tuning when needed.

One-Shot Alerts

One-shot alerts have no state. They are (almost) immediately emitted each time they are required to.

To avoid noise/spam, only the first emission request during duplicate emit delay interval triggers an immediate alert emission, the following ones are counted and only one other alert will be actually emitted after the delay expires.

The emission counter is displayed by alert channels in an appropriate way, by default they will suffix the alert string with an x and the counter value, e.g.: scheduler.config.load x 3.
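
The de-duplication of one-shot alerts can be sketched as below. This is a simplified model, not qron's implementation: the OneShotDeduplicator class is a made-up name, time is passed explicitly as a number of seconds, the 600 second delay is an arbitrary value chosen for the example, and suffixing the counter even when it equals 1 is an assumption.

```python
class OneShotDeduplicator:
    """Emit the first request at once, then one grouped alert per delay."""

    def __init__(self, duplicate_emit_delay):
        self.delay = duplicate_emit_delay
        self.pending = {}  # alert string -> [window start, suppressed count]

    def request(self, alert, now):
        """Handle an emission request, return the alerts to emit right now."""
        if alert not in self.pending:
            self.pending[alert] = [now, 0]
            return [alert]
        self.pending[alert][1] += 1  # count it, emit nothing for now
        return []

    def flush(self, now):
        """To be called periodically, return grouped alerts of expired delays."""
        emitted = []
        for alert, (start, count) in list(self.pending.items()):
            if now - start >= self.delay:
                del self.pending[alert]
                if count > 0:
                    emitted.append(f"{alert} x {count}")
        return emitted


dedup = OneShotDeduplicator(duplicate_emit_delay=600)
print(dedup.request("scheduler.config.load", now=0))   # emitted immediately
for moment in (10, 20, 30):
    dedup.request("scheduler.config.load", now=moment)  # only counted
print(dedup.flush(now=600))                              # one grouped alert
```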

See alerts fine tuning section below for delays default values and advice on their tuning when needed.

Alert Channels

Subscriptions

Automatic Alerts

Custom Alerts

TODO: custom alerts through events and actions, but also from outside qron using the HTTP API

Gridboards

Provided they follow a name convention that can be expressed using regular expression capture groups, some alerts can be arranged as tables of statuses named gridboards.

A gridboard is described by its id and several properties among the following:

pattern: alert pattern to be processed by this gridboard, must be a regular expression with named capture groups, e.g. ^task\.(?<status>[^\.]+)\.(?<taskid>.+)$
dimension: declare a dimension, first one being displayed as rows, second one as columns, using one of the following syntaxes:
- just the capture group name declared in the pattern, e.g. (dimension taskid)
- a dimension name and a %-expression, e.g. (dimension TaskId %taskid) (note that the regular expression capture groups are visible in the %-evaluation context which make it possible to use a capture group name or number as a % key)
- the second dimension is not mandatory, if omitted, the gridboard will be built as if there was a second dimension declared with (dimension status status), i.e. only one column, named "status".
label: human readable label
info: additional documentation info
warningdelay: defaults to 0
param: gridboard parameters, used in %-evaluation performed within the context of this gridboard; in addition some parameters can be used to customize HTML rendering of the gridboard:
- gridboard.tableclass: HTML table class, defaults to table table-condensed table-hover
- gridboard.divclass: HTML div class for whole gridboard, defaults to row gridboard-status
- gridboard.componentclass: HTML div class for each gridboard component, defaults to gridboard-component
- gridboard.tdclass.ok: HTML table cell class for canceled alerts, defaults to an empty string
- gridboard.tdclass.warning: HTML table cell class for canceled alerts not canceled for longer than the warningdelay, defaults to warning
- gridboard.tdclass.error: HTML table cell class for raised alerts, defaults to danger
- gridboard.tdclass.unknown: HTML table cell class in unknown status (most likely never raised nor canceled), defaults to an empty string
- gridboard.rowformat: custom format for row headers (i.e. first dimension values), this is a %-expression where %1 is replaced with the actual dimension value, defaults to %1, example below
- gridboard.columnformat: same as gridboard.rowformat for the second dimension

For instance, it is possible to set up a gridboard relying on automatic alerts to display one task per row and one column per alert kind, showing for each task whether it has failed recently, whether it has been too long or too short recently, etc., using this configuration:

(gridboard tasks
  (label Tasks Alerts)
  # match qron's automatic alerts about tasks: failure, toolong, tooshort...
  (pattern "^task\.(?<status>[^\.]+)\.(?<taskid>.+)$")
  (dimension taskid)
  (dimension status)
  # add an HTML link to the task page to the taskid in row headers
  (param gridboard.rowformat '<a href="../tasks/%1">%1</a>')
)
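
How a pattern with named capture groups arranges alerts into rows and columns can be sketched as follows. It is an illustration only: build_gridboard is a hypothetical function, alert statuses (raised, canceled) are ignored, and the pattern is rewritten with Python's (?P<name>...) syntax, which differs from the (?<name>...) syntax used in qron configuration.

```python
import re


def build_gridboard(pattern, alerts, row_group="taskid", column_group="status"):
    """Map each row value to the set of column values seen in matching alerts."""
    regex = re.compile(pattern)
    grid = {}
    for alert in alerts:
        match = regex.match(alert)
        if match is None:
            continue  # this alert is not handled by this gridboard
        grid.setdefault(match.group(row_group), set()).add(
            match.group(column_group))
    return grid


# same pattern as in the 'tasks' gridboard above, in Python syntax
PATTERN = r"^task\.(?P<status>[^\.]+)\.(?P<taskid>.+)$"
grid = build_gridboard(PATTERN, [
    "task.failure.app1.biz.batch.build-reports-customers",
    "task.toolong.app1.biz.batch.build-reports-customers",
    "host.down.ping.server1",
])
print(grid)
```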

An example of the HTML rendering of this gridboard would be: Rendering sample of 'Tasks Alerts' gridboard

Apart from setting up a gridboard to display default automatic alerts raised by qron, it is also possible to use this feature with custom alerts, whether they are raised using the alert-related actions or from any external source using the HTTP API.

Here are some examples of dashboards associated with host and application server supervision probes:

(gridboard ping
  (label Hosts Ping Statuses)
  # match alerts of the form host.down.ping.$HOST
  (pattern "^host\.down\.ping\.(?<host>[^\.]+)")
  # first and only dimension is the host
  (dimension host)
  # implicitly there is a second dimension, always equal to "status"
  (warningdelay 120) # shown as warning after 2 minutes w/o ping status
)

An example of the HTML rendering of this gridboard would be: Rendering sample of 'Hosts Ping Statuses' gridboard

(gridboard service_x_instance
  (label Application Services x Deployed Instances)
  # match alerts of the form host.down.http.$HOST.$PORT.$PATH
  (pattern "^host\.down\.http\.(?<host>[^\.]+)\.(?<port>[0-9]+)\.(?<path>.+)$")
  # first dimension is the service, identified by its http path
  (dimension service %path)
  # second dimension is the instance, identified by %host:%port, with special
  # processing on %host to remove everything after first dash
  (dimension instance %{=sub:%host:/-.*//}:%port)
  (warningdelay 300) # shown as warning after 5 minutes w/o http status
)

An example of the HTML rendering of this gridboard would be: Rendering sample of 'Application Services x Deployed Instances' gridboard

Fine Tuning

TODO

Events and Actions

Some events can be subscribed at task level or scheduler level to make the scheduler perform one or several actions. Events and actions are described below.

Task-Level Events

At task (and taskgroup) level, the following events can be subscribed:

onstart: occurs just before a task begins to run
onsuccess: occurs just after a task finished with a successful status
onfailure: occurs just after a task finished with a failure status
onfinish: short for onsuccess and onfailure

Sample configuration file fragment:

(task build-reports-customers
  (taskgroup app1.biz.batch)
  (onsuccess
    # sending UDP packets to statsd server, see https://github.com/etsy/statsd
    (requesturl "task.%!taskid.ok:1|c" (address udp://127.0.0.1:8125))
    (requesturl (address udp://127.0.0.1:8125) "task.%!taskid.time:%!durationms|ms")
    # add a custom debug log for task success
    (log(severity debug) task success! *%!tasklocalid*)
  )
  (onfailure
    # sending UDP packets to statsd
    (requesturl(address udp://127.0.0.1:8125) "task.%!taskid.ko:1|c")
  )
)

Scheduler-Level Events

At scheduler (global) level, the following events can be subscribed:

onschedulerstart: occurs when qron daemon starts
onconfigload: occurs whenever qron daemon activates a new configuration, including at startup
onnotice: occurs whenever a notice is posted
onlog: occurs whenever a log entry is recorded; this event is only intended for debugging purposes, and using it on real-world live systems is not advisable
all task-level events can also be subscribed at scheduler level, in which case they apply to every task

Sample configuration file fragment:

(config
  # emit an alert to warn of a configuration (re)load
  (onconfigload (emitalert config.reload))
  # send UDP packets to statsd server on every task success or failure,
  # see https://github.com/etsy/statsd
  (onsuccess
    (requesturl "task.%!taskid.ok:1|c" (address udp://127.0.0.1:8125))
    (requesturl (address udp://127.0.0.1:8125) "task.%!taskid.time:%!durationms|ms")
  )
  (onfailure
    (requesturl(address udp://127.0.0.1:8125) "task.%!taskid.ko:1|c")
  )
)

Actions

Here are the actions that can be performed when an event occurs:

postnotice: post a notice
- takes notice name as its main parameter
- allows param to set notice parameters
- e.g. (postnotice file_has_arrived (param path /tmp/foobar) (param urgent_file true))
raisealert: raise a stateful alert
- takes alert name as its main parameter
- e.g. (raisealert app.alert.too_many_waiting_customers)
cancelalert: cancel a raisable alert
- takes alert name as its main parameter
- e.g. (cancelalert app.alert.too_many_waiting_customers)
emitalert: emit a one-shot alert
- takes alert name as its main parameter
- e.g. (emitalert app.alert.a_box_has_been_broken)
requesttask: request task execution
- takes task id as its main parameter; if no such task is found, the id is prefixed with the current context's taskgroup, which allows shorter ids when requesting a task from another one within the same group
- allows param to override task parameters
- e.g. (requesttask import_customers_file (param path /tmp/foobar))
requesturl: request a network action by URL; currently only UDP and HTTP are supported
- takes payload as its main parameter
- needs an address param containing a URL
- http supports the following optional params: method (default: GET), user and password, port (default: 80), content-type, follow-redirect (default: false), redirect-max.
- udp supports the following optional params: connecttimeout (in seconds, default: 2), disconnecttimeout (default: 0.2)
- e.g. (requesturl(address udp://127.0.0.1:8125) "task.%!taskid.ok:1|c")
- e.g. (requesturl(address http://monitoring-server/alert) (method POST) "%!taskid failed")
log: add an entry to centralized log
- takes log message as its main parameter
- supports an optional severity param, which defaults to info
- e.g. (log this is a debug message(severity debug))
writefile: write data to a custom file
- takes data to write as its main parameter
- needs a path param containing the file path
- supports the following optional params: append (default: true), truncate (default: false), unique (to create a unique file name using a suffix or replacing XXXXXX pattern, default: false)
- e.g. (writefile(path /var/log/custom.log) "my message\n")
- e.g. (writefile(path /opt/file_transfer/in/file_XXXXXX.data) (unique true) "my message\n")
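
The task id lookup rule of requesttask can be sketched in a few lines of Python. This is a hypothetical model of the rule as described above, not qron's implementation: the function name is made up, and it assumes that a fully qualified task id is the taskgroup id and the local id joined by a dot.

```python
from typing import Iterable, Optional


def resolve_task_id(requested_id: str,
                    known_task_ids: Iterable[str],
                    current_taskgroup: str) -> Optional[str]:
    """Return the id of the task that would be requested, or None if no task matches."""
    known = set(known_task_ids)
    # First try the id exactly as written in the action.
    if requested_id in known:
        return requested_id
    # Otherwise retry with the taskgroup of the current context as a prefix
    # (assumption: ids are of the form <taskgroup>.<local id>).
    prefixed = f"{current_taskgroup}.{requested_id}"
    if prefixed in known:
        return prefixed
    return None
```

With tasks group1.afterreload and group1.afterreload_failed defined, resolve_task_id("afterreload_failed", [...], "group1") yields group1.afterreload_failed, which is why a short id is enough when both tasks belong to the same group.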

Sample configuration file fragments:

(config
  # emit an alert to warn of a configuration (re)load
  (onconfigload (emitalert config.reload))
  # send UDP packets to statsd server on every task success or failure,
  # see https://github.com/etsy/statsd
  (onsuccess
    (requesturl "task.%!taskid.ok:1|c" (address udp://127.0.0.1:8125))
    (requesturl (address udp://127.0.0.1:8125) "task.%!taskid.time:%!durationms|ms")
  )
  (onfailure
    (requesturl(address udp://127.0.0.1:8125) "task.%!taskid.ko:1|c")
  )
  # record misc execution data in /var/log/qrond_tasks_statistics_yyyyMMdd.log
  (onfinish
    (writefile "%=date task.%!taskid %!status %!returncode %!queueds %!runnings %!durations\n"
       (path "/var/log/qrond_tasks_statistics_%{=date!yyyyMMdd}.log"))
  )
)

(config
  (task process1 (taskgroup group1)
    # raise and cancel an alert shared with both process1 and process2
    (onfailure (raisealert app1.process_failed))
    (onsuccess (cancelalert app1.process_failed))
    (command /bin/process1)(mean local)
  )
  (task process2 (taskgroup group1)
    (onfailure (raisealert app1.process_failed))
    (onsuccess (cancelalert app1.process_failed))
    # write "success" or "failure" in /opt/flags/process2_status
    (onfinish (writefile "%!status\n" (path /opt/flags/process2_status)
                 (truncate true)(append false)))
    (command /bin/process2)(mean local)
  )
)

# this example is a kind of small workflow where 'afterreload' is triggered
# by config reload (which could trigger one 'afterreload' per application if
# needed) and then 'afterreload_failed' is triggered only if 'afterreload'
# failed (hence its name)
(config
  # posting a notice makes it possible to bind it to several task triggers
  # without specifying the links explicitly
  (onconfigload (postnotice config_load))
  (task afterreload (taskgroup group1)
     (trigger (notice config_load))
     (command /bin/whatever_script)(mean local)
     (onfailure (requesttask afterreload_failed))
  )
  (task afterreload_failed (taskgroup group1)
     # there is no need for a trigger, since (requesttask) will request task
     # execution instead
     (command /bin/another_script)(mean local)
  )
)

Notices

A notice is a named event that occurs when posted by a postnotice action or a /do/v1/notices/post/ API call. Notice triggers subscribe to notices to execute tasks. The only link between the post event and the task trigger is the notice name.

This mechanism provides loose coupling between a detected event and the tasks it triggers: the event or external process which posts the notice knows nothing about the tasks triggered. For instance, an external file transfer software can notify the scheduler of a file being ready with a logical name (the notice name) which will be linked to a given task in the scheduler configuration, not elsewhere.

A notice can carry parameters, which are used as evaluation context when evaluating notice trigger overriding params. Using them, the post events can provide some contextual information to the triggered tasks, such as a path to a file, a customer orders batch id, etc.
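
As an illustration of how an external process could post a notice with parameters, here is a minimal Python sketch that builds the URL of the /do/v1/notices/post/%notice RPC call listed in the HTTP API section. The function name and the base URL used in the usage note are hypothetical; only the path layout comes from this manual.

```python
from typing import Mapping, Optional
from urllib.parse import quote, urlencode


def notice_post_url(base_url: str, notice: str,
                    params: Optional[Mapping[str, str]] = None) -> str:
    """Build the URL that posts `notice`, passing `params` as notice params."""
    path = f"/do/v1/notices/post/{quote(notice, safe='')}"
    # HTTP params are sent in the query string and become notice params.
    query = f"?{urlencode(params)}" if params else ""
    return base_url.rstrip("/") + path + query
```

A file transfer tool could then issue a plain GET on notice_post_url("http://localhost:8086", "file_has_arrived", {"path": "/tmp/foobar"}) without knowing which tasks are bound to file_has_arrived.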

Resources

Parameters

Hierarchical Inheritance of Parameters Sets

TODO

Parameters Sets Hierarchy:

Caption: Parameters Sets Hierarchy Caption

%-Evaluation

Special Parameters

Configuration Management

TODO

Operations

Task Instances Lifecycle

Task Instances Statuses

Task Instances State Diagram:

Task instances statuses:

queued: task execution has been requested and queued but has not started yet, likely because some constraints still prevent the task from running (not enough resources available, maxinstances already reached by currently running instances, etc.)
running: task instance has been started and has not ended yet
success: task instance has ended and success conditions were met
failure: task instance has ended and success conditions were not met
canceled: task request was canceled before being started

Task instances transitions:

requestTask(): explicit task request from API or UI
request by configuration (trigger, action, etc.): implicit task request deduced from configuration
start: when constraints are met, a queued task is started and reaches running status
end: when the underlying process finishes, the task instance is set to either success or failure status depending on success conditions
cancelTask(): a queued task instance can be canceled from API or UI, and then reaches canceled status; a running task can no longer be canceled
implicit cancellation: task instances can be implicitly canceled when several requests are enqueued for the same task, depending on enqueuepolicy
abortTask(): a running task instance can be aborted from API or UI, triggering the end of the underlying process in a way that depends on the execution mean (sending a kill signal for local execution mean, closing the socket for http execution mean, etc.), leading to a fast end of the task in conditions that are interpreted as a failure by default;
not all task instances can be aborted (for instance an ssh task cannot be aborted if pty allocation is explicitly disabled), but most of them can;
it is possible to define custom success conditions in a way that makes an aborted task be interpreted as a success, but default success conditions will always interpret an abort as a failure
failure on start with some configuration errors: when configuration errors are detected in the start process and before the actual task is started, a task instance status can be set from queued to failure without even reaching running, for instance if the target is not set or is invalid and the execution mean requires a target
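
The statuses and transitions above can be summarized as a small state machine. The following Python sketch is only a reading aid: the transition names are hypothetical labels for the transitions listed above, and the code does not reflect how qron is implemented.

```python
from enum import Enum


class Status(Enum):
    QUEUED = "queued"
    RUNNING = "running"
    SUCCESS = "success"
    FAILURE = "failure"
    CANCELED = "canceled"


# (current status, transition) -> next status
TRANSITIONS = {
    (Status.QUEUED, "start"): Status.RUNNING,
    (Status.QUEUED, "cancel"): Status.CANCELED,       # explicit or implicit cancellation
    (Status.QUEUED, "start_error"): Status.FAILURE,   # configuration error detected on start
    (Status.RUNNING, "end_success"): Status.SUCCESS,
    (Status.RUNNING, "end_failure"): Status.FAILURE,
}


def next_status(current: Status, transition: str) -> Status:
    """Return the status reached from `current`, or raise if the transition is not allowed."""
    try:
        return TRANSITIONS[(current, transition)]
    except KeyError:
        raise ValueError(
            f"transition {transition!r} is not allowed from {current.value}"
        ) from None
```

An abort is deliberately not a transition of its own here: it only provokes the end of the underlying process, and the resulting status (failure by default) is then decided by the success conditions.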

Task Instances Timestamps and Durations

A task instance bears several timestamps, set when it reaches different steps of its lifecycle. Any of them can be null since the corresponding step may not have been reached.

requestdate: set when the task instance is created (and immediately reaches queued status)
startdate: set when the task reaches running status, which may never happen if the task never starts (if canceled or if an error occurs during start process)
enddate: set when the task reaches success, failure or canceled status; this should always happen eventually but can take a long time, and may even never happen in some abnormal cases (server crash, etc.)

Those timestamps can be read from API, UI and used in configuration through %-expressions, e.g.:

  %!startdate # this task instance start date
  %{!enddate:yyyy-MM-dd} # end date with format specification
  %!requestdate # task instance request date
  %{!requestdate:yyyyMMdd:-1days} # eve of task request date
  %{!startdate:ms1970} # this task start date in milliseconds since 1970
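
For readers who want to check what such expressions should produce, here are approximate Python equivalents of three of them, as understood from their comments. This is a sketch under assumptions: the function names are hypothetical, and timestamps are handled as timezone-aware UTC values, whereas the time zone actually used by qron is not specified here.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def iso_day(ts: datetime) -> str:
    """Rough equivalent of a yyyy-MM-dd format specification."""
    return ts.strftime("%Y-%m-%d")


def eve_yyyymmdd(ts: datetime) -> str:
    """Rough equivalent of yyyyMMdd with a -1days shift: the day before `ts`."""
    return (ts - timedelta(days=1)).strftime("%Y%m%d")


def ms_since_1970(ts: datetime) -> int:
    """Rough equivalent of ms1970: whole milliseconds elapsed since 1970-01-01 UTC."""
    return (ts - EPOCH) // timedelta(milliseconds=1)
```

For a request date of 2024-03-01 00:30 UTC, eve_yyyymmdd gives 20240229, which shows that the day shift is a calendar computation and not a string manipulation.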

For instance this can be used to set a task param or var according to the request or start timestamp (but obviously not the end timestamp):

(task build-reports-customers
  (taskgroup app1.biz.batch)
  (trigger(cron 0 0 23 * * 1-6))
  (param filename /tmp/%{!startdate:yyyyMMdd}.out)
  (mean local)
  (command /opt/myapp/bin/build-reports-customers.sh -o %filename)
)
(task build-reports-customers-alternative
  (taskgroup app1.biz.batch)
  (trigger(cron 0 0 23 * * 1-6))
  (var FILENAME /tmp/%{!startdate:yyyyMMdd}.out)
  (mean local)
  (command /opt/myapp/bin/build-reports-customers.sh) # the script uses $FILENAME
)

Several durations are computed from these timestamps. Any of them is null if its start or end timestamp is null.

queued: startdate - requestdate
running: enddate - startdate
duration: enddate - requestdate
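
The null-propagation rule can be illustrated with a short Python sketch that derives the three durations from the request, start and end timestamps of a task instance. The function names are hypothetical, and None stands for a null timestamp.

```python
from datetime import datetime, timedelta
from typing import Dict, Optional


def _diff(begin: Optional[datetime], end: Optional[datetime]) -> Optional[timedelta]:
    """Return end - begin, or None as soon as one of the timestamps is null."""
    if begin is None or end is None:
        return None
    return end - begin


def durations(requestdate: Optional[datetime],
              startdate: Optional[datetime],
              enddate: Optional[datetime]) -> Dict[str, Optional[timedelta]]:
    """Compute the queued, running and total durations of a task instance."""
    return {
        "queued": _diff(requestdate, startdate),
        "running": _diff(startdate, enddate),
        "duration": _diff(requestdate, enddate),
    }
```

For a task instance canceled while still queued, the start timestamp is null, so the queued and running durations are null while the total duration is still defined.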

Task monitoring features always use the worst-case duration to trigger alerts, e.g. total duration for maxexpectedduration and running duration for minexpectedduration. Like timestamps, durations can also be read from the API and UI and used in configuration through %-expressions, e.g.:

  %!durationms # total duration, in milliseconds
  %!queueds # time spent in queue, in seconds

For instance this can be used to feed reporting or monitoring systems with metrics about tasks durations:

(onsuccess
  # sending UDP packets to statsd server, see https://github.com/etsy/statsd
  (requesturl (address udp://127.0.0.1:8125) "task.%!taskid.time:%!durationms|ms")
  (requesturl (address udp://127.0.0.1:8125) "task.%!taskid.delay:%!queuedms|ms")
)

Task Instances Success Conditions

Actions Performed by an Operator

Reloading Configuration

Requesting a Task Execution

Canceling a Task Request

Aborting a Task Execution

Disabling a Task

Web User Interface

Description

TODO incl. responsive-design and bootstrap-based

Views

TODO concept and description of every view

Customization

TODO through parameters

HTTP API (REST API)

Qron's HTTP API is split into a REST API (paths beginning with /rest/) and an RPC API (paths beginning with /do/). The RPC API is as simple as the REST API but action-oriented rather than data-oriented, and it does not rely on HTTP methods to determine the action as the REST API does (therefore RPC calls can be made by HTTP clients that are not method-aware, such as an HTML link or form).

Any HTTP URI whose path does not begin with /do/ or /rest/ is not part of the HTTP API; it is only UI. Backward compatibility of UI URIs can be broken between Qron versions without notice (they should only be linked from within the Qron web UI itself, not called or referenced by any third-party tool).

REST API

REST calls are described in the table below. They comply with the following rules:

HTTP methods follow widespread REST semantics: GET is read-only and has no side effect, POST means create, PUT means update
response content-type is defined in the request path (e.g. ending with .html to ask for HTML) not in content-type HTTP headers, to have a more natural explicit choice and avoid any kind of out-of-band parameters
paths always start with /rest/%version/%collection_name/, most of them are followed by a collection-wide view (e.g. list.csv) or the object id within the collection (e.g. 2783794.html).
reply HTTP statuses can be trusted with their standard meaning, at least for the first digit (2xx: success, 4xx: input error, 5xx: server-side error, 401 and 403 used for authentication)
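
A client can rely on these rules with very little code. The following Python sketch builds a REST path from its components and interprets a reply status according to the last rule; the function names and the "other" fallback category are hypothetical and only meant as an illustration.

```python
def rest_path(collection: str, item: str, version: str = "v1") -> str:
    """Build a path such as /rest/v1/tasks/list.csv.

    `item` is either a collection-wide view (e.g. list.csv) or an object id
    followed by the extension that selects the response content type.
    """
    return f"/rest/{version}/{collection}/{item}"


def classify_status(code: int) -> str:
    """Map an HTTP reply status to the categories given in the rules above."""
    if code in (401, 403):
        return "authentication"
    categories = {2: "success", 4: "input error", 5: "server-side error"}
    # Statuses outside 2xx, 4xx and 5xx are not covered by the rules above.
    return categories.get(code // 100, "other")
```

For example, rest_path("taskgroups", "list.html") asks for the HTML view of task groups, and a 404 reply to it would be classified as an input error.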

The following table describes REST calls:

Call	Description
`GET /rest/v1/taskgroups/list.csv` `GET /rest/v1/taskgroups/list.html`	list of task groups
`GET /rest/v1/tasks/%taskid/config.pf`	task %taskid's configuration, in config file format
`GET /rest/v1/tasks/%taskid/list.csv`	task %taskid's configuration, in /rest/v1/tasks/list.csv format
`GET /rest/v1/tasks/list.csv` `GET /rest/v1/tasks/list.html`	list of tasks
`GET /rest/v1/tasks/deployment_diagram.svg` `GET /rest/v1/tasks/deployment_diagram.dot`	deployment diagram (taskgroup to tasks to either cluster or host graph), in SVG or Graphviz dot format
`GET /rest/v1/tasks/trigger_diagram.svg` `GET /rest/v1/tasks/trigger_diagram.dot`	trigger diagram (taskgroup to tasks to trigger graph), in SVG or Graphviz dot format
`GET /rest/v1/hosts/list.csv` `GET /rest/v1/hosts/list.html`	list of hosts
`GET /rest/v1/clusters/list.csv` `GET /rest/v1/clusters/list.html`	list of clusters
`GET /rest/v1/global_params/list.csv` `GET /rest/v1/global_params/list.html`	list of global params
`GET /rest/v1/global_vars/list.csv` `GET /rest/v1/global_vars/list.html`	list of global vars
`GET /rest/v1/calendars/list.csv` `GET /rest/v1/calendars/list.html`	list of named calendars
`GET /rest/v1/taskinstances/list.csv` `GET /rest/v1/taskinstances/list.html` `GET /rest/v1/taskinstances/current/list.csv` `GET /rest/v1/taskinstances/current/list.html`	list of task instances, "current" paths give the subset of unfinished or very soon finished task instances
`GET /rest/v1/scheduler_events/list.csv` `GET /rest/v1/scheduler_events/list.html`	list of scheduler events subscriptions
`GET /rest/v1/notices/lastposted.csv` `GET /rest/v1/notices/lastposted.html`	list of last posted notices, in reverse chronological order
`GET /rest/v1/resources/free_resources_by_host.csv` `GET /rest/v1/resources/free_resources_by_host.html`	free resources x host matrix
`GET /rest/v1/resources/lwm_resources_by_host.csv` `GET /rest/v1/resources/lwm_resources_by_host.html`	low water mark resources x host matrix
`GET /rest/v1/resources/consumption_matrix.csv` `GET /rest/v1/resources/consumption_matrix.html`	task × host × theoretical lowest possible resources availability matrix
`GET /rest/v1/alert_params/list.csv` `GET /rest/v1/alert_params/list.html`	list of alert params
`GET /rest/v1/alerts/stateful_list.csv` `GET /rest/v1/alerts/stateful_list.html`	list of current stateful alerts, with their state and timestamps
`GET /rest/v1/alerts/last_emitted.csv` `GET /rest/v1/alerts/last_emitted.html`	list of recently emitted alerts, last one first
`GET /rest/v1/alerts_subscriptions/list.csv` `GET /rest/v1/alerts_subscriptions/list.html`	list of alerts subscriptions
`GET /rest/v1/alerts_settings/list.csv` `GET /rest/v1/alerts_settings/list.html`	list of alerts settings
`GET /rest/v1/gridboards/list.csv` `GET /rest/v1/gridboards/list.html`	list of gridboards
`GET /rest/v1/gridboards/%1.html`	gridboard %1, rendered as one or several html tables
`GET /rest/v1/configs/list.csv` `GET /rest/v1/configs/list.html`	list of loaded configurations
`GET /rest/v1/configs/history.csv` `GET /rest/v1/configs/history.html`	list of active configurations history, current one first
`GET /rest/v1/configs/%1.pf`	config %1, in config file format
`POST /rest/v1/configs/`	upload a config, in config file format; the uploaded config is not activated, only loaded (see /do/v1/configs/activate/); the reply body is of the form `(id %1)`, %1 being replaced by the uploaded config id; the reply also has the `X-Qron-ConfigId` HTTP header set to the uploaded config id, for clients that find headers easier to read than the body
`GET /rest/v1/logs/logfiles.csv` `GET /rest/v1/logs/logfiles.html`	list of logfiles
`GET /rest/v1/logs/entries.txt`	last log entries, last one last; optional parameters: `files`: if set to `current`, only fetch entries from current log files, ignoring older files (useful with e.g. daily-rotated log files); `filter`: plain text filter string; `regexp`: regular expression filter
`GET /rest/v1/logs/last_info_entries.csv` `GET /rest/v1/logs/last_info_entries.html`	last log entries with at least INFO severity level, in csv or html table format, last one last

RPC API

RPC calls are described in the table below. They comply with the following rules:

calls are handled with the same meaning regardless of which HTTP method is used (GET or POST); the meaning is given by the path
paths always start with /do/%version/%domain/%action/, and in most cases the domain is a collection_name (e.g. tasks).
for calls using HTTP params, both URI request string params (GET) and body x-www-form-urlencoded params (POST) are supported, behavior when using both at a time is unspecified (so don't mix them, this may work in a given qron version and not in the next one, or not with the same priority/overriding implicit rules)
reply HTTP statuses can be trusted with their standard meaning, at least for the first digit (2xx: success, 4xx: input error, 5xx: server-side error, 401 and 403 used for authentication)

The following table describes RPC calls:

Call	Description
`POST|GET /do/v1/tasks/request/%taskid`	request execution of a task, HTTP params are used as task instance overriding params
`POST|GET /do/v1/tasks/abort_instances/%taskid`	abort all running instances of a task
`POST|GET /do/v1/tasks/cancel_requests/%taskid`	cancel all queued requests of a task
`POST|GET /do/v1/tasks/cancel_requests_and_abort_instances/%taskid`	cancel all queued requests and abort all running instances of a task
`POST|GET /do/v1/tasks/disable/%taskid`	disable a task from being queued and (if already queued) from running
`POST|GET /do/v1/tasks/enable/%taskid`	(re-)enable a task
`POST|GET /do/v1/tasks/disable_all`	disable all tasks at once
`POST|GET /do/v1/tasks/enable_all`	enable all tasks at once
`POST|GET /do/v1/taskinstances/cancel/%taskinstanceid`	cancel a queued task instance (a.k.a. a task request)
`POST|GET /do/v1/taskinstances/abort/%taskinstanceid`	abort a running task instance
`POST|GET /do/v1/taskinstances/cancel_or_abort/%taskinstanceid`	cancel a task instance if it is still queued or abort it if it is already running
`POST|GET /do/v1/alerts/emit/%alertid`	emit a one-shot alert
`POST|GET /do/v1/alerts/raise/%alertid`	raise a stateful alert
`POST|GET /do/v1/alerts/raise_immediately/%alertid`	raise immediately (without waiting for rise delay) a stateful alert
`POST|GET /do/v1/alerts/cancel/%alertid`	cancel a stateful alert
`POST|GET /do/v1/alerts/cancel_immediately/%alertid`	cancel immediately (without waiting for cancel delay) a stateful alert
`POST|GET /do/v1/gridboards/clear/%gridboardid`	clear a gridboard
`POST|GET /do/v1/notices/post/%notice`	post a notice, HTTP params are used as notice params
`POST|GET /do/v1/configs/reload_config_file`	reload configuration file and apply new configuration (if a configuration file is defined, and its content is valid)
`POST|GET /do/v1/configs/activate/%configid`	activate a configuration from repository
`POST|GET /do/v1/configs/remove/%configid`	remove a configuration from repository
`POST|GET /do/v1/scheduler/shutdown`	request scheduler shutdown
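
The rule about not mixing query string and body params can be enforced on the client side. Here is a minimal Python sketch that prepares an RPC call with the standard urllib module, putting the params either in an x-www-form-urlencoded body (POST) or in the query string (GET), never both. The function name and the base URL in the usage note are hypothetical; the paths are those of the table above.

```python
from typing import Mapping, Optional
from urllib.parse import quote, urlencode
from urllib.request import Request


def rpc_request(base_url: str, domain: str, action: str, target: str = "",
                params: Optional[Mapping[str, str]] = None,
                use_post: bool = True) -> Request:
    """Prepare (without sending) a call to /do/v1/<domain>/<action>[/<target>]."""
    path = f"/do/v1/{domain}/{action}"
    if target:
        path += f"/{quote(target, safe='')}"
    url = base_url.rstrip("/") + path
    encoded = urlencode(params or {})
    if use_post:
        # All params go in the body; the query string stays empty.
        return Request(
            url,
            data=encoded.encode("ascii"),
            headers={"Content-Type": "application/x-www-form-urlencoded"},
            method="POST",
        )
    # All params go in the query string; there is no body.
    return Request(f"{url}?{encoded}" if encoded else url, method="GET")
```

The prepared request can then be sent with urllib.request.urlopen, e.g. urlopen(rpc_request("http://localhost:8086", "tasks", "request", "app1.biz.batch.build-reports-customers", {"filename": "/tmp/out"})) to request a task with an overriding param.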

References about HTTP APIs

The following documents have been a strong source of inspiration for designing the REST and RPC API principles:
