Qron User Manual

This web page is the user manual of the qron free scheduler.

It can be read both in the qron web user interface (embedded in the qron daemon) and on the Internet here: http://qron.eu/doc/master/user-manual.html.

You are currently reading this version of the documentation: v1.9.9-19-g7bc32a4

Concepts

Qron is a task scheduler, that is, software that controls the execution of non-interactive application components such as large batches planned once a day, automatic processing of incoming messages every minute, or any processing triggered by an event not directly issued by a human being.

This implies defining tasks, the way they are triggered, and the properties of the infrastructure on which they run, along with centralized logs and alert means to notify about anything wrong or suspect.

Often, scheduling tasks also requires a powerful parameter system with hierarchical inheritance and an evaluation language, handling of scheduling constraints such as resources and calendars, or a customizable event-driven model.

Monitoring and operating the scheduler can be done through the responsive-design web user interface and (coming soon) desktop and phone/tablet apps.

Integrating with other IT tools and custom extensions is possible through HTTP API and standard operating system integration.

Tasks Hierarchy

Tasks are grouped within taskgroups in a tree hierarchy. In addition, workflow tasks have subtasks.

Sample trigger diagram with 5 tasks within 2 task groups:

Task Groups

Tasks are grouped within taskgroups, which make the configuration clearer from a documentation point of view (when used to group tasks e.g. by application or module) and/or factorize some task properties (such as parameters and events).

A taskgroup consists of its id (its only mandatory field), plus optional fields: a human readable label and task properties that are inherited by any task belonging to the group. The task properties that can be defined at the taskgroup level are params, setenv, unsetenv and events subscriptions.

There is only one level of taskgroups (a taskgroup cannot belong to another taskgroup). However, dots in taskgroup ids are processed as a hierarchical separator by qron user interfaces to display a multi-level tree.

Sample configuration file fragments:

(taskgroup app1.biz.batch
  (label business batches for application app1)
)
(taskgroup app2.tech
  (label technical tasks for application app2)
  (param db_password mysecret)
  (setenv ORACLE_SID ORCL)
)
(taskgroup minimalgroup)
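
The way a user interface could derive a multi-level tree from dotted taskgroup ids can be sketched in Python. This is only an illustration under the assumption that each dot-separated part becomes one tree level; the function name build_taskgroup_tree is hypothetical and is not part of qron.

```python
def build_taskgroup_tree(taskgroup_ids):
    """Nest dotted taskgroup ids into a dict-of-dicts display tree."""
    tree = {}
    for taskgroup_id in taskgroup_ids:
        node = tree
        # assumption: every dot is a hierarchical separator
        for part in taskgroup_id.split("."):
            node = node.setdefault(part, {})
    return tree


# ids taken from the sample configuration above
tree = build_taskgroup_tree(["app1.biz.batch", "app2.tech", "minimalgroup"])
```

With the three sample taskgroups, app1 and app2 become top-level nodes with nested children, while minimalgroup is a top-level leaf.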

Tasks

The task is qron's most important configuration element. It carries almost all the information needed to schedule and execute a task, and often some of the event and alerting configuration is done task by task.

Tasks Description

A task is described by its id and its parent taskgroup. It should also have a human readable label and some additional documentation info.

Sample configuration file fragments:

(task build-reports-customers
  (taskgroup app1.biz.batch)
  (label "build PDF reports about customers")
  (info "see http://intranet/app1/ReportingBatches")
)
(task oracle-statistics
  (taskgroup app2.tech)
  (label recompute statistics for db ORCL)
)
(task minimaltask(taskgroup minimalgroup))

Tasks Execution Properties

Task execution is defined by 3 properties:

  • mean: describes the mean used to execute the task, among the following:
    • local: spawn a process local to the qron daemon, on the same host
    • ssh: establish an SSH connection to the target host to execute the task
    • http: perform an HTTP request
    • workflow: start a workflow consisting of one or several subtasks conditionally linked together with transitions, each of them having their own execution properties, see workflows section below
    • donothing: do not execute anything, but still trigger events, pretending the execution to be successful (i.e. onsuccess actions will be triggered, not onfailure ones)
  • target: defines which host or cluster will be chosen to execute the task, defaults to localhost, see infrastructure section for more details
  • command: interpreted depending on the execution mean:
    • local and ssh means: command line, spaces being interpreted as command argument separators unless they are protected by backslashes
    • http mean: path and query string
    • workflow and donothing means: not applicable, should be omitted
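
The splitting rule for local and ssh commands can be pictured with the following Python sketch. It is not qron's actual parser: it assumes that a backslash protects the next character and is then dropped, and that consecutive unprotected spaces count as a single separator. The function name split_command is hypothetical.

```python
def split_command(command):
    """Split on spaces, except spaces protected by a backslash."""
    args, current, escaped = [], [], False
    for char in command:
        if escaped:
            # protected character: keep it literally, even a space
            current.append(char)
            escaped = False
        elif char == "\\":
            escaped = True
        elif char == " ":
            # unprotected space: end of the current argument, if any
            if current:
                args.append("".join(current))
                current = []
        else:
            current.append(char)
    if escaped:
        current.append("\\")  # assumption: keep a trailing lone backslash
    if current:
        args.append("".join(current))
    return args
```

For instance, split_command(r"cat my\ file.txt") yields the two arguments cat and my file.txt.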

Sample configuration file fragments:

(task build-reports-customers
  (taskgroup app1.biz.batch)
  (mean ssh)
  (target server1)
  (command /opt/app1/bin/build_reports.sh --customers)
)
(task process-last-orders
  (taskgroup app1.biz.queuing)
  (mean http)
  (target server4)
  (command /pollers/last-orders/?from=-5minutes)
  (param method POST) # defaults to GET
  (param user scheduler) # set up a "Authorization: Basic" header
  (param password mysecret)
)
(task remove-old-files
  (taskgroup app1.tech)
  (mean ssh)
  (target server1)
  (command bash -c '
# shell script right in qron config file
set -e
for DIR in %directories; do
  find $DIR -mtime +60 -delete
done
')
  (param directories /opt/app1/files /opt/app1/backup /tmp/app1)
)

Tasks Triggers and Constraints

Tasks are queued for execution when triggered, and one can define one or several triggers among:

  • cron: time trigger in the form of a Unix cron pattern with precision down to the second, e.g.
    • * * * * * *: every second
    • 13 6 7 * * 1: every Monday (1) at 07:06:13 in the morning
    • /15 * 8-20 * * *: every 15 seconds from 8 a.m. to 8 p.m.
  • notice: event trigger that happens when a given notice event occurs
  • filechange: not yet implemented, but will be able to trigger task execution on local or remote file change or creation
  • HTTP API: a task can be triggered by HTTP API requesttask call
  • requesttask event: a task can be triggered by occurrence of a requesttask event
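
To make the six-field cron patterns above more concrete, here is a minimal Python sketch that checks whether a given instant matches a pattern. It is not qron's implementation. It assumes the field order second, minute, hour, day of month, month, day of week (as suggested by the examples above), that 0 stands for Sunday and 1 for Monday, and that all six fields must match together. The names field_matches and cron_matches are hypothetical.

```python
from datetime import datetime

# Allowed range of each field: second, minute, hour, day of month,
# month, day of week (assumed order, deduced from the examples).
FIELD_RANGES = [(0, 59), (0, 59), (0, 23), (1, 31), (1, 12), (0, 6)]


def field_matches(expr, value, low, high):
    """Check a value against '*', 'N', 'A-B', '/S', 'A-B/S' or a comma list."""
    for item in expr.split(","):
        rng, _, step = item.partition("/")
        if rng in ("", "*"):
            start, end = low, high
        elif "-" in rng:
            start, end = (int(x) for x in rng.split("-"))
        else:
            start = end = int(rng)
        if start <= value <= end and (value - start) % int(step or 1) == 0:
            return True
    return False


def cron_matches(pattern, when):
    """Return True if `when` satisfies every field of a six-field pattern."""
    fields = pattern.split()
    if len(fields) != 6:
        raise ValueError("expected 6 fields: sec min hour day month weekday")
    values = [when.second, when.minute, when.hour, when.day, when.month,
              when.isoweekday() % 7]  # 0 = Sunday, 1 = Monday
    return all(field_matches(f, v, lo, hi)
               for f, v, (lo, hi) in zip(fields, values, FIELD_RANGES))
```

For example, cron_matches("13 6 7 * * 1", datetime(2016, 4, 4, 7, 6, 13)) is true because 4 April 2016 was a Monday.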

TODO: calendars

Sample configuration file fragments:

(task build-reports-customers
  (taskgroup app1.biz.batch)
  (trigger(cron 0 0 23 * * 1-6)) # 23:00 mon-sat
)
(task process-last-orders
  (taskgroup app1.biz.queuing)
  (trigger
    (cron /15 * 8-20 * * *) # every 15 seconds daily
    (cron 0 0 22 * * * # once in the evening
      (param mode nightly) # overriding parameter 'mode'
    )
    (notice process-last-orders-now) # on event
  )
)

When a task is queued for execution, it is executed only when no constraint prevents it from running. Constraints are of several types:

  • maxinstances: at most maxinstances instances of a given task are allowed to run at the same time, which defaults to 1
  • resources: if a given task is declared to need some resources, it can only run on its target host if that host has enough resources available; resources can be seen as semaphores and can be used for mutual exclusion across tasks, see resources section below.

Sample configuration file fragments:

(task process-customer-request
  (taskgroup app1.biz.queuing)
  (maxinstances 16) # code is multiexec-safe and is faster when parallelized
)
(task build-reports-customers
  (taskgroup app1.biz.batch)
  (target report-server)
  (resource memory 2048) # avoid too many memory consumers on the server
  (resource reports-semaphore 1) # avoid two report batches running at a time
)
(host report-server
  (hostname server4.acme.com)
  (resource memory 8192)
  (resource reports-semaphore 1)
)

In addition, tasks can be rejected when submitted for queueing, and queued tasks can even be automatically canceled, according to enqueuepolicy, which is defined task by task and can have the following values/behaviors:

  • enqueueanddiscardqueued: when a task is queued, every other queued task with the same id is canceled, avoiding stacking tasks in big numbers when they are slower than expected, this is the default
  • enqueueall: allow boundless stacking of this task in the queue, provided the queue limit is not reached
    warning: when the queue limit is reached, other tasks may be rejected, so this setting is dangerous for the whole tasks schedule
  • enqueueuntilmaxinstances: reject task requests when task maxinstances is already reached by running instances + queued ones, this can be convenient for user-requested tasks since it's no longer possible to request execution of a task that is already running and is set to maxinstances = 1 (which is the default)

Sample configuration file fragments:

(task build-reports-customers
  (taskgroup app1.biz.batch)
  (enqueuepolicy enqueueuntilmaxinstances)
)
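
The three enqueuepolicy behaviors described above can be compared with a small Python sketch. It is a simplified model, not qron's code: the queue and the running set are plain lists of task ids, the global queue limit is ignored, and the function name enqueue is hypothetical.

```python
def enqueue(queue, running, task_id, policy, maxinstances=1):
    """Return (new_queue, accepted) after requesting `task_id`.

    `queue` lists queued task ids and `running` lists running ones;
    both input lists are left unmodified.
    """
    if policy == "enqueueanddiscardqueued":
        # cancel every queued request of the same task, then queue the new one
        return [t for t in queue if t != task_id] + [task_id], True
    if policy == "enqueueall":
        # stack requests without limit (the queue limit is not modeled here)
        return queue + [task_id], True
    if policy == "enqueueuntilmaxinstances":
        # reject when running + queued instances already reach maxinstances
        if running.count(task_id) + queue.count(task_id) >= maxinstances:
            return list(queue), False
        return queue + [task_id], True
    raise ValueError(f"unknown enqueuepolicy: {policy}")
```

With enqueueuntilmaxinstances and the default maxinstances of 1, a request for a task that is already running is rejected, which matches the user-requested tasks use case mentioned above.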

Tasks Parameters

Tasks have free text parameters defined by the param configuration element. They can be used for any custom parameters and are evaluated with the % evaluation character in many other configuration items.

The whole system for parameter evaluation and inheritance is discussed in depth in the parameters section below, alongside special parameter names that have an effect on qron behavior (parameters such as ssh.options when using the ssh execution mean).

Parameters can be overridden at task request, be it in a trigger definition, in the HTTP API or in the web user interface using request forms.

Sample configuration file fragments:

(task build-reports-customers
  (taskgroup app1.biz.batch)
  (mean ssh)
  (target server1)
  (command /opt/app1/bin/build_reports.sh --customers --from %from)
  (param from -2days) # regular param value
  (trigger
    (cron 0 0 23 * * 1-6)
    (cron 0 0 23 * * 0 (param from -8days)) # different param on sunday
  )
  (requestform
    (field from # can be overridden when the task is started from the web ui
      (label Data depth) # ui label
      (placeholder -2days) # ui placeholder/hint
      (format "-[0-9]+days") # server side validation regexp
    )
  )
)

Tasks Setenvs

Depending on the execution mean, there can be out of band parameters, transmitted by a way other than the command itself. For local and ssh means this is possible through environment variables (hence the name setenv). For http mean this is possible through custom HTTP headers.

These out of band parameters are defined and removed using two configuration elements: setenv and unsetenv.

TODO: explain how default environment is set, according to execution mean.

Sample configuration file fragments:

(task build-reports-customers
  (taskgroup app1.biz.batch)
  (mean ssh)
  (target server1)
  (command /opt/app1/bin/build_reports.sh --customers)
  (setenv ORACLE_SID ORCL)
)
(task process-last-orders
  (taskgroup app1.biz.queuing)
  (mean http)
  (target server4)
  (command /pollers/last-orders/?from=-5minutes)
  (setenv X-TaskInstanceId %!taskinstanceid)
  (setenv Authorization Basic AhjsqduYez=)
  #in fact it's often easier to set Authorization header this way:
  #(param user scheduler) # set up a "Authorization: Basic" header
  #(param password mysecret)
)

Tasks Events Subscriptions

Several task-level events can be used to define actions at task (or taskgroup) level. See Task-Level Events section below for more details.

Sample configuration file fragments:

(task build-reports-customers
  (taskgroup app1.biz.batch)
  (onsuccess
    # sending UDP packets to statsd server, see https://github.com/etsy/statsd
    (requesturl "task.%!taskid.ok:1|c" (address udp://127.0.0.1:8125))
    (requesturl (address udp://127.0.0.1:8125) "task.%!taskid.time:%!totalms|ms")
    # add a custom debug log for task success
    (log(severity debug) task success! *%!tasklocalid*)
  )
  (onfailure
    # sending UDP packets to statsd
    (requesturl(address udp://127.0.0.1:8125) "task.%!taskid.ko:1|c")
  )
)

Tasks Monitoring

Whenever a task behaves suspiciously, the scheduler emits an automatic alert, for instance when a task finishes with a failure status.

It's often a good idea to send the automatic alerts by e-mail (or any other means) for all or most of the tasks in order to be notified of an issue. See alerts subscriptions section below.

More automatic alerts are available when setting the following task parameters:

  • minexpectedduration: minimum expected task running duration in seconds, an alert will be raised otherwise, with the pattern task.tooshort.%!taskid, the alert is raised when the task finishes.
  • maxexpectedduration: maximum expected task total duration in seconds, an alert will be raised otherwise, with the pattern task.toolong.%!taskid, the alert is raised as soon as the task is running longer than maxexpectedduration (although the detection may take up to 1 minute it won't wait for the task finishing).
  • maxdurationbeforeabort: total duration above which the scheduler will abort a running task. This is useful for certain kinds of tasks that are better killed than left running too long (a nightly purge that can disrupt normal daily processing, a periodic task running every minute that should be killed when frozen for 3 minutes rather than keep blocking the next executions forever, etc.) but is dangerous in many other cases since aborting is quite abrupt (kill for local and ssh execution means, close for http execution mean).

Task duration is the time elapsed between the queuing of a task request and the finishing of its execution.

Sample configuration file fragments:

(task build-reports-customers
  (taskgroup app1.biz.batch)
  (minexpectedduration 10) # below 10 s there is an issue with the data selected
  (maxexpectedduration 3600) # 1 hour
)
(task process-last-orders
  (taskgroup app1.biz.queuing)
  (trigger(cron /30 * 8-20 * * *)) # twice a minute
  (maxexpectedduration 20)
  (maxdurationbeforeabort 300) # 5 minutes
)

Tasks Request Forms

When a user task execution request is submitted interactively through the web user interface (or, when it becomes available, the desktop/phone user interface), it is possible to display a custom form asking the user which parameters they would like to override.

A request form is described in the configuration file using a requestform element within a task declaration. It contains one field element per form field, each of which can contain the following optional elements:

  • label: user interface label (field title)
  • placeholder: user interface hint (placeholder attribute of input HTML element)
  • suggestion: default value
  • format: validation regular expression (for the web user interface, the validation is enforced server-side)

Sample configuration file fragment:

(task build-reports-customers
  (taskgroup app1.biz.batch)
  (command "/opt/app1/bin/build_reports.sh --customers --from %from
     %{=switch:%dryrun:true:--dryrun:}")
  (param from -2days) # regular param value
  (requestform
    (field from # can be overridden when the task is started from the web ui
      (label Data depth) # ui label
      (placeholder -2days) # ui placeholder/hint
      (format "-[0-9]+days") # server side validation regexp
    )
    (field dryrun
      (label Dry run)
      (suggestion false) # default value in ui form
      (format "true|false")
    )
  )
)
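
The server-side validation driven by the format element can be sketched in Python as follows. This is only an illustration: it assumes that the regular expression must match the whole submitted value (whether qron anchors the expression this way is not stated here), and the dict layout and the function name validate_request are hypothetical.

```python
import re


def validate_request(form_fields, submitted):
    """Return the ids of the fields whose submitted value is rejected."""
    rejected = []
    for field_id, spec in form_fields.items():
        if field_id not in submitted:
            continue  # nothing overridden for this field
        fmt = spec.get("format")
        # assumption: the format must match the whole value
        if fmt is not None and re.fullmatch(fmt, submitted[field_id]) is None:
            rejected.append(field_id)
    return rejected


# formats taken from the sample configuration above
form = {"from": {"format": "-[0-9]+days"}, "dryrun": {"format": "true|false"}}
```

With this form, a request overriding from with -8days is accepted whereas 8days is rejected.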

Workflows and Subtasks

TODO

Infrastructure

Tasks are executed on hosts, optionally grouped by clusters.

Sample deployment diagram with 5 tasks deployed on 1 cluster and 3 hosts:

Hosts

A host is an execution target for tasks. It must have an id and, for most execution means, a hostname (which defaults to the id). A host can also carry an optional human readable label and resources that are consumed by tasks that need them, see resources section below.

The same host can be used by several execution means: for instance, one can define http and ssh tasks with the same host as target, provided both means can use the same hostname to reach the host.

There must always be a host named localhost. If it's not declared in the configuration file, qron will pretend it is declared with no other property than its id.

Sample configuration file fragments:

(host server1
  (hostname server1.mycompany.com)
  (resource memory 8192)
)
(host localhost)

Clusters

A cluster defines an execution target consisting of a list of hosts and a balancing method to choose the actual execution host among them. It must have an id, which cannot be identical to any host id, and can have a human readable label.

The following balancing methods are supported:

  • first: choose first host (in configuration order) with enough resources
  • random: choose a random host among those with enough resources
  • roundrobin: choose cluster hosts one after another, skipping hosts without enough resources, this is the default balancing method
  • each: when requesting the task once, it is actually executed on every host within the cluster

Sample configuration file fragments:

(cluster app-servers
  (hosts server1 server2 server3)
  (balancing roundrobin) # this can be omitted since this is the default
)
(task process-last-orders
  (taskgroup app1.biz.queuing)
  (mean http)
  (target app-servers)
  (command /pollers/last-orders/?from=-5minutes)
)
(cluster unix-servers
  (hosts server4 server5 server6)
  (balancing each)
)
(task remove-old-files-on-every-unix-server
  (taskgroup app1.tech)
  (mean ssh)
  (target unix-servers)
  (command find /opt/app1/files -mtime +60 -delete)
  (maxinstances 3) # allow parallel execution on 3 hosts
)
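
The four balancing methods can be compared with the following Python sketch. It is a simplified model rather than qron's scheduler code: has_enough_resources is a hypothetical callable telling whether a host can currently run the task, last_index is the position of the host chosen by the previous roundrobin request, and the each method is assumed here to target every host without checking resources.

```python
import random


def choose_hosts(method, hosts, has_enough_resources, last_index=-1):
    """Return (chosen_hosts, new_last_index) for one task request."""
    if method == "each":
        # assumption: resources are not checked for this method
        return list(hosts), last_index
    if method == "first":
        eligible = [h for h in hosts if has_enough_resources(h)]
        return eligible[:1], last_index
    if method == "random":
        eligible = [h for h in hosts if has_enough_resources(h)]
        return ([random.choice(eligible)] if eligible else []), last_index
    if method == "roundrobin":
        # start right after the previously chosen host and skip busy hosts
        for offset in range(1, len(hosts) + 1):
            index = (last_index + offset) % len(hosts)
            if has_enough_resources(hosts[index]):
                return [hosts[index]], index
        return [], last_index
    raise ValueError(f"unknown balancing method: {method}")
```

An empty list of chosen hosts means that no host currently has enough resources, in which case the task would have to wait in the queue.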

Logs

Qron logs information related to task scheduling and execution, configuration changes, web ui user interaction, etc. In addition, it also logs error messages from the tasks themselves when the execution mean makes it possible, which is a simple way to centralize task logs.

Logs Configuration

Several log files can be defined in the configuration file with the log element, which must have the following properties:

  • file: path to the log file, which is %-evaluated and thus can vary depending on the date
  • level: minimal severity level to log, among debug, info, warning, error and fatal

The log element can also have an unbuffered flag to disable write buffers.

Sample configuration file fragments:

(log(level debug)(file "/var/log/qron/debug-%{=date:yyyyMMdd}.log"))
(log(level info)(file "/var/log/qron/info-%{=date:yyyyMMdd}.log"))
(log(level debug)(file "/tmp/qron-debug.log")(unbuffered))

Centralizing Logging of Tasks Stderr

Qron automatically records in its logs the standard error stream of tasks executed using local mean, and (by default) both the standard error and standard output streams of tasks executed using ssh mean.

This feature enables centralizing task error streams in the qron log. However, not all task logs should be recorded that way. For instance, if a huge 4-hour-long Java batch logging millions of debug lines sends them to its stderr, qron logs will no longer be readable since they will be flooded with noise. One should take care of that when designing batches or their wrapper shell scripts. Alternatively, logging can be disabled for a given task or taskgroup using the parameters described below.

Logging of task output is modified by several special task parameters:

  • stderrfilter: regular expression to filter out task output, for instance:
    • .* will drop any data
    • ^Connection to [^ ]* closed.$ will drop an annoying ssh log line
  • ssh.disablepty: when disabling pty allocation of a task executed through ssh mean, standard output will no longer be merged with standard error and only the latter will be recorded in qron log (this is not the only effect of disabling pty allocation, see special parameters section for more information)
  • ssh.options: this more general parameter can also, among other things, disable pty allocation

Logs Format

Qron log files are formatted with lines ended with the newline character (0x0a), each line complying with either of these two formats:

  • First record lines immediately start with a timestamp (in other words, their first character is a digit, likely to be a 2):
    timestamp taskid/taskinstanceid sourcefile:sourceline severity message
  • Continuation lines start with two spaces (0x20):
      message_continuation

Log records fields are defined this way:

  • timestamp: ISO 8601 timestamp, currently from year to milliseconds, with or without timezone (currently the timestamp is recorded using local timezone but without explicit timezone specification, this may change in the future)
  • taskid: either the taskid if the record is related to a specific task, or the name of the logging thread (e.g. SchedulerThread or AlerterThread)
  • taskinstanceid: either the taskinstanceid if the record is related to a specific task instance execution, or 0
  • sourcefile: source code file name if known/relevant, otherwise an empty string
  • sourceline: source code line number or label if known/relevant, otherwise an empty string
  • severity: record severity, currently among: DEBUG INFO WARNING ERROR FATAL (this may change in the future but will stay an alphanumeric identifier without spaces)
  • message: any text, encoded as UTF-8, if newlines are contained in the message, it is continued on the next lines, with two extra space characters (0x20) at the beginning of the continuation lines

For instance:

2016-04-05T13:57:42,003 gridboard.failing/201604051357420004 : INFO starting task 'gridboard.failing' through mean 'local' after 1 ms in queue
2016-04-05T13:57:40,421 SchedulerThread/0 : INFO starting scheduler
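
The record and continuation line formats described above can be parsed with a short Python sketch like the one below. It is not an official parser: the regular expression is an assumption derived from the field list and the two sample lines, and the names RECORD and parse_log are hypothetical.

```python
import re

# One named group per field of a first record line.
RECORD = re.compile(
    r"^(?P<timestamp>\d\S*) "
    r"(?P<taskid>[^/ ]+)/(?P<taskinstanceid>\d+) "
    r"(?P<sourcefile>[^: ]*):(?P<sourceline>\S*) "
    r"(?P<severity>\w+) "
    r"(?P<message>.*)$"
)


def parse_log(lines):
    """Turn log lines into a list of dicts, merging continuation lines."""
    records = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("  ") and records:
            # continuation line: strip the two leading spaces and append
            records[-1]["message"] += "\n" + line[2:]
            continue
        match = RECORD.match(line)
        if match:
            records.append(match.groupdict())
    return records
```

Lines matching neither format are silently skipped in this sketch; a stricter tool might prefer to report them.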

Logs Rotation

Qron does not rotate, compress or purge its log files. However, it reopens them periodically (every minute), which makes several log rotation patterns easy to implement. One of the following should be chosen, or even a mix of them.

Timestamped Logs File Names

Defining log files with a timestamp in the name ensures that qron will stop writing to a log file 1 minute after the timestamp changes. Therefore old log files can be compressed or archived by any external means (e.g. logrotate daemon or a task scheduled by qron itself) without the need to signal qron to reopen its log files.

Sample configuration file fragment:

(log(level debug)(file "/var/log/qron/debug-%{=date:yyyyMMdd}.log"))
(log(level info)(file "/var/log/qron/info-%{=date:yyyyMMdd}.log"))
(taskgroup qron.tech)
(task logs-compress
  (taskgroup qron.tech)
  (mean local)
  (command 'find /var/log/qron -name "*.log" -mtime +1 -exec bzip2 {} \\;')
  (trigger(cron 0 2 0 * * *))
)
(task logs-purge
  (taskgroup qron.tech)
  (mean local)
  (command 'find /var/log/qron -mtime +365 -delete')
  (trigger(cron 0 2 0 * * *))
)

Numbered Log Files Names

When scheduling a small number of tasks, one may prefer a numbered log file scheme. The example below shows it with a weekly rotation and compression performed by logrotate.

Sample configuration file fragment:

(log(level info)(file /var/log/qron.log))

Along with a logrotate configuration file like this one:

/var/log/qron.log {
    weekly
    rotate 52
    missingok
    notifempty
    compress
    delaycompress
}

Alerts

There are two kinds of alerts:

  • stateful alerts, which are emitted when raised and canceled but also periodically reminded if still raised
  • one-shot alerts, which are emitted immediately each time they occur: they are de-duplicated to avoid spam but are not subject to raise/cancel/remind mechanisms.

Whenever possible, stateful alerts should be preferred. However, there are cases where one wants to be alerted of the occurrence of an abnormal or rare event rather than of an abnormal condition. In these cases one-shot alerts are well suited.

Both kinds of alerts are UTF-8 character strings. There are no technical constraints on their format; however, the subscription mechanisms will be easier to use if the alert strings follow a hierarchical dot naming convention, e.g. task.failure.$TASKGROUP.$TASK or myapp.mycustomalert.

Alerts are notified through alert channels and can be displayed on web ui, especially using gridboards.

Stateful Alerts

At first glance, stateful alerts have 3 statuses: nonexistent, raised and canceled. However, if an alert is constantly raised and canceled, without protection it would be a huge source of spam/noise and there would be no benefit to handle alerts as stateful. Therefore the alert engine implements grace delays before actually raising or canceling an alert, and there are a few more statuses and transitions than the most intuitive ones. They are described below.

Alerts State Diagram:

Alerts statuses:

  • nonexistent: the alert string is not known to the alert engine
  • rising: the alert has been asked to rise, and the rise delay has not yet been reached
  • mayrise: the alert has been asked for cancellation while it was rising, and mayrise delay has not yet been reached
  • raised: the alert was rising and rising delay has been reached, or the alert has been asked for immediate rising
  • dropping: the alert has been asked for cancellation while it was raised, and drop delay has not yet been reached
  • canceled: the alert was dropping and drop delay has been reached, or the alert has been asked for immediate cancellation

Alerts status transitions:

  • raiseAlert(): turn a dropping alert into raised and any other alert into rising
  • cancelAlert(): turn a rising alert into mayrise and a raised alert into dropping
  • raiseImmediatlyAlert(): turn any alert into raised
  • cancelImmediatlyAlert(): turn rising and mayrise alerts into nonexistent, and turn raised and dropping alerts into canceled
  • rise delay timeout: turn rising and mayrise alerts into raised
  • mayrise delay timeout: turn mayrise alerts into nonexistent
  • drop delay timeout: turn dropping alerts into canceled
  • canceled alerts will immediately become nonexistent again, as soon as cancel notifications are emitted

Induced effects:

  • When an alert becomes raised and was neither raised nor dropping before, an alert is emitted for all matching subscriptions.
  • When an alert becomes canceled, an alert cancellation is emitted for all matching subscriptions.
  • When an alert stays raised or dropping for a long time, some alert channels (most notably the mail channel) will notify subscribers with alert reminders.

See alerts fine tuning section below for delays default values and advice on their tuning when needed.
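
The statuses and transitions listed above can be summarized as a small state machine. The Python sketch below is an illustration, not qron's alert engine: event names such as rise_timeout are hypothetical labels for the transitions, and it assumes that a raise request on an already raised alert leaves it raised.

```python
STATUSES = {"nonexistent", "rising", "mayrise", "raised", "dropping", "canceled"}


def next_status(status, event):
    """Return the status reached from `status` when `event` occurs."""
    if status not in STATUSES:
        raise ValueError(f"unknown status: {status}")
    if event == "raise":
        # assumption: an already raised alert stays raised
        return "raised" if status in ("dropping", "raised") else "rising"
    if event == "cancel":
        return {"rising": "mayrise", "raised": "dropping"}.get(status, status)
    if event == "raise_immediately":
        return "raised"
    if event == "cancel_immediately":
        return {"rising": "nonexistent", "mayrise": "nonexistent",
                "raised": "canceled", "dropping": "canceled"}.get(status, status)
    if event == "rise_timeout":
        return "raised" if status in ("rising", "mayrise") else status
    if event == "mayrise_timeout":
        return "nonexistent" if status == "mayrise" else status
    if event == "drop_timeout":
        return "canceled" if status == "dropping" else status
    if event == "cancel_notified":
        return "nonexistent" if status == "canceled" else status
    raise ValueError(f"unknown event: {event}")


def emits_alert(old_status, new_status):
    """An alert is emitted when becoming raised from neither raised nor dropping."""
    return new_status == "raised" and old_status not in ("raised", "dropping")
```

For instance, an alert that is raised, canceled and raised again before the drop delay expires goes through raised, dropping and raised without any new emission, which is the purpose of the grace delays.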

One-Shot Alerts

One-shot alerts have no state. They are (almost) immediately emitted each time they are required to.

To avoid noise/spam, only the first emission request during duplicate emit delay interval triggers an immediate alert emission, the following ones are counted and only one other alert will be actually emitted after the delay expires.

The emission counter is displayed by alert channels in an appropriate way, by default they will suffix the alert string with an x and the counter value, e.g.: scheduler.config.load x 3.

See alerts fine tuning section below for delays default values and advice on their tuning when needed.
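
The de-duplication of one-shot alerts can be pictured with the Python sketch below. It is a simplified model under several assumptions: time is passed explicitly as a number of seconds, the counter shown in the suffix is the number of requests held back during the delay (what qron exactly counts is not specified here), and the class name OneShotDeduplicator is hypothetical.

```python
class OneShotDeduplicator:
    """Emit the first request at once, then one summary after the delay."""

    def __init__(self, delay):
        self.delay = delay
        self._windows = {}  # alert -> [window start, held back count]

    def request(self, alert, now):
        """Register an emission request, return the strings emitted now."""
        emitted = self.expire(now)
        if alert in self._windows:
            self._windows[alert][1] += 1  # counted, not emitted yet
        else:
            self._windows[alert] = [now, 0]
            emitted.append(alert)  # first request: immediate emission
        return emitted

    def expire(self, now):
        """Close the windows older than the delay, return their summaries."""
        emitted = []
        for alert, (start, count) in list(self._windows.items()):
            if now - start >= self.delay:
                del self._windows[alert]
                if count:
                    emitted.append(f"{alert} x {count}")
        return emitted
```

In this model, three requests within the delay lead to two emissions only: the alert itself right away, then a single summary suffixed with x 2 once the delay has expired.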

Alert Channels

Subscriptions

Automatic Alerts

Custom Alerts

TODO: custom alerts through events and actions, but also from outside qron using the HTTP API

Gridboards

Provided they follow a name convention that can be expressed using regular expression capture groups, some alerts can be arranged as tables of statuses named gridboards.

A gridboard is described by its id and several properties among the following:

  • pattern: alert pattern to be processed by this gridboard, must be a regular expression with named capture groups, e.g. ^task\.(?<status>[^\.]+)\.(?<taskid>.+)$
  • dimension: declare a dimension, first one being displayed as rows, second one as columns, using one of the following syntaxes:
    • just the capture group name declared in the pattern, e.g. (dimension taskid)
    • a dimension name and a %-expression, e.g. (dimension TaskId %taskid) (note that the regular expression capture groups are visible in the %-evaluation context, which makes it possible to use a capture group name or number as a % key)
    • the second dimension is not mandatory, if omitted, the gridboard will be built as if there was a second dimension declared with (dimension status status), i.e. only one column, named "status".
  • label
  • info
  • warningdelay default 0
  • param gridboard parameters, used in %-evaluation performed within the context of this gridboard; in addition some parameters can be used to customize HTML rendering of the gridboard:
    • gridboard.tableclass: HTML table class, defaults to table table-condensed table-hover
    • gridboard.divclass: HTML div class for whole gridboard, defaults to row gridboard-status
    • gridboard.componentclass: HTML div class for each gridboard component, defaults to gridboard-component
    • gridboard.tdclass.ok: HTML table cell class for canceled alerts, defaults to an empty string
    • gridboard.tdclass.warning: HTML table cell class for canceled alerts not canceled for longer than the warningdelay, defaults to warning
    • gridboard.tdclass.error: HTML table cell class for raised alerts, defaults to danger
    • gridboard.tdclass.unknown: HTML table cell class in unknown status (most likely never raised nor canceled), defaults to an empty string
    • gridboard.rowformat: custom format for row headers (i.e. first dimension values), this is a %-expression where %1 is replaced with the actual dimension value, defaults to %1, example below
    • gridboard.columnformat: same as gridboard.rowformat for the second dimension
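
The role of the pattern and dimension properties can be illustrated with the Python sketch below, which arranges alert strings into a table keyed by row and column. It is not qron's gridboard code: alert statuses are given as plain strings, the function name build_grid is hypothetical, and named capture groups are written (?P<name>...) because this is the spelling required by Python's re module.

```python
import re


def build_grid(pattern, alerts, row_group, column_group=None):
    """Map (row, column) to the status of each alert matching `pattern`.

    `alerts` maps alert strings to a status such as "raised".
    Without a second dimension, the only column is named "status".
    """
    regex = re.compile(pattern)
    grid = {}
    for alert, status in alerts.items():
        match = regex.search(alert)
        if match:
            row = match.group(row_group)
            column = match.group(column_group) if column_group else "status"
            grid[(row, column)] = status
    return grid


# same pattern as the example in the list above, in Python spelling
PATTERN = r"^task\.(?P<status>[^\.]+)\.(?P<taskid>.+)$"
alerts = {
    "task.failure.gridboard.failing": "raised",
    "task.toolong.gridboard.failing": "canceled",
    "host.down.ping.server1": "raised",
}
grid = build_grid(PATTERN, alerts, "taskid", "status")
```

Here the two task alerts end up on the same row (gridboard.failing) in the failure and toolong columns, while the host alert is ignored because it does not match the pattern.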

For instance, it is possible to set up a gridboard relying on automatic alerts to display one task per row and one column per status, showing for each task whether it has failed recently, whether it has been too long or too short recently, etc., using this configuration:

(gridboard tasks
  (label Tasks Alerts)
  # match qron's automatic alerts about tasks: failure, toolong, tooshort...
  (pattern "^task\.(?<status>[^\.]+)\.(?<taskid>.+)$")
  (dimension taskid)
  (dimension status)
  # add an HTML link to the task page to the taskid in row headers
  (param gridboard.rowformat '<a href="../tasks/%1">%1</a>')
)

An example of the HTML rendering of this gridboard would be:

Apart from setting up a gridboard to display default automatic alerts raised by qron, it is also possible to use this feature with some custom alerts, be they raised using the alert-related actions or from any external source using the HTTP API.

Here are some examples of dashboards associated to host and application servers supervision probes:

(gridboard ping
  (label Hosts Ping Statuses)
  # match alerts of the form host.down.ping.$HOST
  (pattern "^host\.down\.ping\.(?<host>[^\.]+)")
  # first and only dimension is the host
  (dimension host)
  # implicitly there is a second dimension, always equal to "status"
  (warningdelay 120) # shown as warning after 2 minutes w/o ping status
)

An example of the HTML rendering of this gridboard would be:

(gridboard service_x_instance
  (label Application Services x Deployed Instances)
  # match alerts of the form host.down.http.$HOST.$PORT.$PATH
  (pattern "^host\.down\.http\.(?<host>[^\.]+)\.(?<port>[0-9]+)\.(?<path>.+)$")
  # first dimension is the service, identified by its http path
  (dimension service %path)
  # second dimension is the instance, identified by %host:%port, with special
  # processing on %host to remove everything after first dash
  (dimension instance %{=sub:%host:/-.*//}:%port)
  (warningdelay 300) # shown as warning after 5 minutes w/o http status
)

An example of the HTML rendering of this gridboard would be:

Fine Tuning

TODO

Events and Actions

Some events can be subscribed at task level or scheduler level to make the scheduler perform one or several actions. Events and actions are described below.

Task-Level Events

At task (and taskgroup) level, the following events can be subscribed:

  • onstart: occurs just before a task begins to run
  • onsuccess: occurs just after a task finished with a successful status
  • onfailure: occurs just after a task finished with a failure status
  • onfinish: short for onsuccess and onfailure

Sample configuration file fragment:

(task build-reports-customers
  (taskgroup app1.biz.batch)
  (onsuccess
    # sending UDP packets to statsd server, see https://github.com/etsy/statsd
    (requesturl "task.%!taskid.ok:1|c" (address udp://127.0.0.1:8125))
    (requesturl (address udp://127.0.0.1:8125) "task.%!taskid.time:%!totalms|ms")
    # add a custom debug log for task success
    (log(severity debug) task success! *%!tasklocalid*)
  )
  (onfailure
    # sending UDP packets to statsd
    (requesturl(address udp://127.0.0.1:8125) "task.%!taskid.ko:1|c")
  )
)
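
When debugging such subscriptions, it can help to see what actually reaches the statsd port. The following Python sketch is a hypothetical stand-in for a statsd server, not part of qron: it listens on a UDP port and decodes payloads of the form name:value|type, as used in the fragment above. The function names are assumptions, and so is the task id in the last line, which supposes that %!taskid expands to the taskgroup id followed by the task name.

```python
import socket

def parse_statsd_line(line):
    """Split 'name:value|type' into a (name, value, type) tuple."""
    name, _, rest = line.partition(":")
    value, _, kind = rest.partition("|")
    return name, value, kind

def listen(host="127.0.0.1", port=8125, max_packets=10):
    """Print metrics received over UDP, then stop after max_packets."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.bind((host, port))
        for _ in range(max_packets):
            data, _addr = sock.recvfrom(4096)
            for line in data.decode("utf-8", "replace").splitlines():
                print(parse_statsd_line(line))

# listen()  # uncomment to receive packets on 127.0.0.1:8125
print(parse_statsd_line("task.app1.biz.batch.build-reports-customers.ok:1|c"))
```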

Scheduler-Level Events

At scheduler (global) level, the following events can be subscribed:

  • onschedulerstart: occurs when qron daemon starts
  • onconfigload: occurs whenever qron daemon activates a new configuration, including at startup
  • onnotice: occurs whenever a notice is posted
  • onlog: occurs whenever a log entry is recorded; this event is only intended for debugging purposes, using it on real-world live systems is not advisable
  • all task-level events can also be subscribed at scheduler level, in which case they apply to every task

Sample configuration file fragment:

(config
  # emit an alert to warn of a configuration (re)load
  (onconfigload (emitalert config.reload))
  # send UDP packets to statsd server on every task success or failure,
  # see https://github.com/etsy/statsd
  (onsuccess
    (requesturl "task.%!taskid.ok:1|c" (address udp://127.0.0.1:8125))
    (requesturl (address udp://127.0.0.1:8125) "task.%!taskid.time:%!totalms|ms")
  )
  (onfailure
    (requesturl(address udp://127.0.0.1:8125) "task.%!taskid.ko:1|c")
  )
)

Actions

Here are the actions that can be performed when an event occurs:

  • postnotice: post a notice
    • takes notice name as its main parameter
    • allows param to set notice parameters
    • e.g. (postnotice file_has_arrived (param path /tmp/foobar) (param urgent_file true))
  • raisealert: raise a stateful alert
    • takes alert name as its main parameter
    • e.g. (raisealert app.alert.too_many_waiting_customers)
  • cancelalert: cancel a raisable alert
    • takes alert name as its main parameter
    • e.g. (cancelalert app.alert.too_many_waiting_customers)
  • emitalert: emit a one-shot alert
    • takes alert name as its main parameter
    • e.g. (emitalert app.alert.a_box_has_been_broken)
  • requesttask: request task execution
    • takes task id as its main parameter; if no such task is found, the id is prefixed with the current context's taskgroup, which allows shorter ids when requesting a task from another one within the same group
    • allows param to override task parameters
    • e.g. (requesttask import_customers_file (param path /tmp/foobar))
  • requesturl: request a network action by url, currently only UDP and HTTP are supported
    • takes payload as its main parameter
    • needs an address param containing an url
    • http supports the following optional params: method (default: GET), user and password, port (default: 80), content-type, follow-redirect (default: false), redirect-max.
    • udp supports the following optional params: connecttimeout (in seconds, default: 2), disconnecttimeout (default: 0.2)
    • e.g. (requesturl(address udp://127.0.0.1:8125) "task.%!taskid.ok:1|c")
    • e.g. (requesturl(address http://monitoring-server/alert) (method POST) "%!taskid failed")
  • log: add an entry to centralized log
    • takes log message as its main parameter
    • supports an optional severity param, which defaults to info
    • e.g. (log this is a debug message(severity debug))
  • writefile: write data to a custom file
    • takes data to write as its main parameter
    • needs a path param containing the file path
    • supports the following optional params: append (default: true), truncate (default: false), unique (to create a unique file name using a suffix or replacing XXXXXX pattern, default: false)
    • e.g. (writefile(path /var/log/custom.log) "my message\n")
    • e.g. (writefile(path /opt/file_transfer/in/file_XXXXXX.data) (unique true) "my message\n")
  • step: activate another step, only available within a workflow task
    • takes next step id as its main parameter
    • e.g. (step 2)
  • end: ends the workflow, only available within a workflow task

Sample configuration file fragments:

(config
  # emit an alert to warn of a configuration (re)load
  (onconfigload (emitalert config.reload))
  # send UDP packets to statsd server on every task success or failure,
  # see https://github.com/etsy/statsd
  (onsuccess
    (requesturl "task.%!taskid.ok:1|c" (address udp://127.0.0.1:8125))
    (requesturl (address udp://127.0.0.1:8125) "task.%!taskid.time:%!totalms|ms")
  )
  (onfailure
    (requesturl(address udp://127.0.0.1:8125) "task.%!taskid.ko:1|c")
  )
  # record misc execution data in /var/log/qrond_tasks_statistics_yyyyMMdd.log
  (onfinish
    (writefile "%=date task.%!taskid %!status %!returncode %!queueds %!runnings %!totals\n"
       (path /var/log/qrond_tasks_statistics_%{=date!yyyyMMdd}.log))
  )
)
(config
  (task process1 (taskgroup group1)
    # raise and cancel an alert shared with both process1 and process2
    (onfailure (raisealert app1.process_failed))
    (onsuccess (cancelalert app1.process_failed))
    (command /bin/process1)(mean local)
  )
  (task process2 (taskgroup group1)
    (onfailure (raisealert app1.process_failed))
    (onsuccess (cancelalert app1.process_failed))
    # write "success" or "failure" in /opt/flags/process2_status
    (onfinish (writefile "%!status\n" (path /opt/flags/process2_status)
                 (truncate true)(append false)))
    (command /bin/process2)(mean local)
  )
)
# this example is a kind of small workflow where 'afterreload' is triggered
# by config reload (which could trigger one 'afterreload' per application if
# needed) and then 'afterreload_failed' is triggered only if 'afterreload'
# failed (hence its name)
(config
  # posting a notice makes it possible to bind it to several task triggers
  # without specifying the links explicitly
  (onconfigload (postnotice config_load))
  (task afterreload (taskgroup group1)
     (trigger (notice config_load))
     (command /bin/whatever_script)(mean local)
     (onfailure (requesttask afterreload_failed))
  )
  (task afterreload_failed (taskgroup group1)
     # there is no need for a trigger, since (requesttask) will request task
     # execution instead
     (command /bin/another_script)(mean local)
  )
)
TODO workflow example with (step) and (end) actions

Notices

A notice is a named event that occurs when posted by a postnotice action or a /notices/post API call. Notice triggers subscribe to notices to execute tasks. The only link between the post event and the task trigger is the notice name.

This mechanism provides loose coupling between a detected event and the tasks it triggers, the event or external process which posts the notice knowing nothing about the tasks triggered. For instance an external file transfer software can notify the scheduler of a file being ready with a logical name (the notice name) which will be linked to a given task in the scheduler configuration, not elsewhere.

A notice can carry parameters, which are used as evaluation context when evaluating notice trigger overriding params. Using them the post events can provide some contextual information to the triggered tasks, such as a path to a file, a customer orders batch id, etc.
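
As an illustration of how an external process could post a notice with parameters, here is a minimal Python sketch that builds the URL of the /do/v1/notices/post/%notice RPC call listed in the HTTP API section. The base URL (host and port), the function name and the parameter values are assumptions made for the example.

```python
from urllib.parse import quote, urlencode

def notice_post_url(base_url, notice, params=None):
    """Build the URL posting a notice, HTTP params becoming notice params."""
    path = f"{base_url.rstrip('/')}/do/v1/notices/post/{quote(notice, safe='')}"
    query = urlencode(params or {})
    return f"{path}?{query}" if query else path

# hypothetical qron address and notice parameters
url = notice_post_url("http://localhost:8086", "file_has_arrived",
                      {"path": "/tmp/foobar", "urgent_file": "true"})
print(url)
# the call itself could then be issued with urllib.request.urlopen(url)
```

The posting process only needs to know the notice name, which is the loose coupling described above.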

Resources

Parameters

Hierarchical Inheritance of Parameters Sets

TODO

Parameters Sets Hierarchy:

Caption:

%-Evaluation

Special Parameters

Configuration Management

TODO

Operations

Task Instances Lifecycle

Task Instances Statuses

Task Instances State Diagram:

Task instances statuses:

  • queued: task execution has been requested and queued but has not yet started, likely because some constraints still prevent the task from running (not enough resources available, maxinstances already reached by currently running instances, etc.)
  • running: task instance has been started and has not yet ended
  • success: task instance has ended and success conditions were met
  • failure: task instance has ended and success conditions were not met
  • canceled: task request was canceled before it started

Task instances transitions:

  • requestTask(): explicit task request from API or UI
  • request by configuration (trigger, action, workflow transition, etc.): implicit task request deduced from configuration
  • start: when constraints are met, a queued task is started and reaches running status
  • end: when the underlying process finishes, the task instance is set to either success or failure status depending on success conditions
  • cancelTask(): a queued task instance can be canceled from API or UI, and then reaches canceled status, a running task can no longer be canceled
  • implicit cancellation: task instances can be implicitly canceled when several requests are enqueued for the same task, depending on enqueuepolicy
  • abortTask(): a running task instance can be aborted from API or UI, triggering the end of the underlying process, depending on execution mean (sending kill signal for local execution mean, closing socket for http execution mean, etc.), leading to a fast end of the task in conditions that are interpreted as a failure by default;
    not all task instances can be aborted (for instance a ssh task cannot be aborted if pty allocation is explicitly disabled), but most of them can;
    it is possible to define custom success conditions in a way that make an aborted task be interpreted as a success, but default success conditions will always interpret an abort as a failure
  • failure on start with some configuration errors: when configuration errors are detected in the start process and before the actual task is started, a task instance status can be set from queued to failure without even reaching running, for instance if the target is not set or is invalid and the execution mean requires a target
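
The statuses and transitions above can be summarized as a small state machine. The Python sketch below is a reading aid under stated assumptions, not qron's internal model: the transition names (start, cancel, start_error, end, abort) are shorthand chosen for the example.

```python
# Allowed status changes, keyed by (current status, transition).
TRANSITIONS = {
    ("queued", "start"): {"running"},
    ("queued", "cancel"): {"canceled"},       # explicit or implicit cancellation
    ("queued", "start_error"): {"failure"},   # configuration error detected on start
    ("running", "end"): {"success", "failure"},
    # an abort is a failure by default, a success only with custom conditions
    ("running", "abort"): {"success", "failure"},
}

def next_statuses(status, transition):
    """Return the set of statuses reachable from status through transition."""
    return TRANSITIONS.get((status, transition), set())

print(next_statuses("queued", "cancel"))
print(next_statuses("running", "cancel"))  # empty: a running task cannot be canceled
```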

Task Instances Timestamps and Durations

A task instance bears several timestamps, set when it reaches different steps of its lifecycle. Any of them can be null since the corresponding step may not have been reached.
  • requestdate: set when the task instance is created (and immediately reaches queued status)
  • startdate: set when the task instance reaches running status, which may never happen if it never starts (if canceled or if an error occurs during the start process)
  • enddate: set when the task instance reaches success, failure or canceled status; this should always happen eventually but can take a long time, and may even never happen in some abnormal cases (server crash, etc.)
Those timestamps can be read from API, UI and used in configuration through %-expressions, e.g.:
  %!startdate # this task instance start date
  %{!enddate:yyyy-MM-dd} # end date with format specification
  %!workflowrequestdate # main task instance (if subtask) request date
  %{!workflowrequestdate:yyyyMMdd:-1days} # eve of main task request date
  %{!startdate:ms1970} # this (sub)task start date in milliseconds since 1970
For instance this can be used to set a task param or setenv according to the request or start timestamp (but obviously not the end timestamp):
(task build-reports-customers
  (taskgroup app1.biz.batch)
  (trigger(cron 0 0 23 * * 1-6))
  (param filename /tmp/%{!workflowstartdate:yyyyMMdd}.out)
  (mean local)
  (command /opt/myapp/bin/build-reports-customers.sh -o %filename)
)
(task build-reports-customers-alternative
  (taskgroup app1.biz.batch)
  (trigger(cron 0 0 23 * * 1-6))
  (setenv FILENAME /tmp/%{!workflowstartdate:yyyyMMdd}.out)
  (mean local)
  (command /opt/myapp/bin/build-reports-customers.sh) # the script uses $FILENAME
)
Several durations are computed from these timestamps. Any of them is null if its start or end timestamp is null.
  • queued: startdate - requestdate
  • running: enddate - startdate
  • total: enddate - requestdate
Task monitoring features always use the worst case duration to trigger alerts, e.g. total duration for maxexpectedduration and running duration for minexpectedduration. Like timestamps, durations can also be read from API, UI and used in configuration through %-expressions, e.g.:
  %!totalms # total duration, in milliseconds
  %!queueds # time spent in queue, in seconds
For instance this can be used to feed reporting or monitoring systems with metrics about tasks durations:
(onsuccess
  # sending UDP packets to statsd server, see https://github.com/etsy/statsd
  (requesturl (address udp://127.0.0.1:8125) "task.%!taskid.time:%!totalms|ms")
  (requesturl (address udp://127.0.0.1:8125) "task.%!taskid.delay:%!queuedms|ms")
)
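
The way durations derive from timestamps, including the null cases, can be sketched in a few lines of Python. This is an illustration of the definitions given earlier in this section, with a hypothetical function name and made-up dates, not the scheduler's own code.

```python
from datetime import datetime

def instance_durations(requestdate, startdate, enddate):
    """Return queued, running and total durations in seconds (None if unknown)."""
    def seconds_between(begin, end):
        # a duration is null as soon as one of its two timestamps is null
        if begin is None or end is None:
            return None
        return (end - begin).total_seconds()
    return {
        "queued": seconds_between(requestdate, startdate),
        "running": seconds_between(startdate, enddate),
        "total": seconds_between(requestdate, enddate),
    }

# a task instance that has started but not yet ended
requested = datetime(2024, 1, 1, 23, 0, 0)
started = datetime(2024, 1, 1, 23, 0, 5)
print(instance_durations(requested, started, None))
```

For an instance that is still running, only the queued duration is known; running and total stay null until enddate is set.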

Task Instances Success Conditions

Actions Performed by an Operator

Reloading Configuration

Requesting a Task Execution

Canceling a Task Request

Aborting a Task Execution

Disabling a Task

Web User Interface

Description

TODO incl. responsive-design and bootstrap-based

Views

TODO concept and description of every view

Customization

TODO through parameters

HTTP API (REST API)

Qron's HTTP API is split into a REST API (paths beginning with /rest/) and an RPC API (paths beginning with /do/). The RPC API is as simple as the REST API but is action-oriented rather than data-oriented, and does not rely on HTTP methods to determine the action as the REST API does (therefore RPC calls can be issued by HTTP clients that are not method-aware, such as an HTML link or form).

Any HTTP URI whose path does not begin with /do/ or /rest/ is not part of the HTTP API, it is only UI. Backward compatibility of UI URIs can be broken between Qron versions without notice (they should only be linked from within Qron web UI itself, not called or referenced by any third party tool).

REST API

REST calls are described in the table below. They comply with the following rules:

  • HTTP methods follow widespread REST semantics: GET is read-only and has no side effect, POST means create, PUT means update
  • response content-type is defined in the request path (e.g. ending with .html to ask for HTML) not in content-type HTTP headers, to have a more natural explicit choice and avoid any kind of out-of-band parameters
  • paths always start with /rest/%version/%collection_name/, most of them are followed by a collection-wide view (e.g. list.csv) or the object id within the collection (e.g. 2783794.html).
  • reply HTTP statuses can be trusted with their standard meaning, at least for the first digit (2xx: success, 4xx: input error, 5xx: server-side error, 401 and 403 used for authentication)

The following table describes REST calls:

Call | Description

GET /rest/v1/taskgroups/list.csv

GET /rest/v1/taskgroups/list.html

list of task groups

GET /rest/v1/tasks/%taskid/config.pf

task %taskid's configuration, in config file format

GET /rest/v1/tasks/%taskid/workflow.svg

GET /rest/v1/tasks/%taskid/workflow.dot

task %taskid's workflow diagram, in SVG or Graphviz dot format

GET /rest/v1/tasks/list.csv

GET /rest/v1/tasks/list.html

list of tasks

GET /rest/v1/tasks/deployment_diagram.svg

GET /rest/v1/tasks/deployment_diagram.dot

deployment diagram (taskgroup to tasks to either cluster or host graph), in SVG or Graphviz dot format

GET /rest/v1/tasks/trigger_diagram.svg

GET /rest/v1/tasks/trigger_diagram.dot

trigger diagram (taskgroup to tasks to trigger graph), in SVG or Graphviz dot format

GET /rest/v1/steps/list.csv

GET /rest/v1/steps/list.html

list of workflow tasks steps

GET /rest/v1/hosts/list.csv

GET /rest/v1/hosts/list.html

list of hosts

GET /rest/v1/clusters/list.csv

GET /rest/v1/clusters/list.html

list of clusters

GET /rest/v1/global_params/list.csv

GET /rest/v1/global_params/list.html

list of global params

GET /rest/v1/global_setenvs/list.csv

GET /rest/v1/global_setenvs/list.html

list of global setenvs

GET /rest/v1/global_unsetenvs/list.csv

GET /rest/v1/global_unsetenvs/list.html

list of global unsetenvs

GET /rest/v1/calendars/list.csv

GET /rest/v1/calendars/list.html

list of named calendars

GET /rest/v1/taskinstances/list.csv

GET /rest/v1/taskinstances/list.html

GET /rest/v1/taskinstances/list.csv?status=%list

GET /rest/v1/taskinstances/list.html?status=%list

list of task instances; if the status parameter is set, the following values are supported as %list:
  • queued
  • running
  • queued,running

GET /rest/v1/scheduler_events/list.csv

GET /rest/v1/scheduler_events/list.html

list of scheduler events subscriptions

GET /rest/v1/notices/lastposted.csv

GET /rest/v1/notices/lastposted.html

list of last posted notices, in reverse chronological order

GET /rest/v1/resources/free_resources_by_host.csv

GET /rest/v1/resources/free_resources_by_host.html

free resources x host matrix

GET /rest/v1/resources/lwm_resources_by_host.csv

GET /rest/v1/resources/lwm_resources_by_host.html

low water mark resources x host matrix

GET /rest/v1/resources/consumption_matrix.csv

GET /rest/v1/resources/consumption_matrix.html

task × host × theoretical lowest possible resources availability matrix

GET /rest/v1/alert_params/list.csv

GET /rest/v1/alert_params/list.html

list of alert params

GET /rest/v1/alerts/stateful_list.csv

GET /rest/v1/alerts/stateful_list.html

list of current stateful alerts, with their state and timestamps

GET /rest/v1/alerts/last_emitted.csv

GET /rest/v1/alerts/last_emitted.html

list of recently emitted alerts, last one first

GET /rest/v1/alerts_subscriptions/list.csv

GET /rest/v1/alerts_subscriptions/list.html

list of alerts subscriptions

GET /rest/v1/alerts_settings/list.csv

GET /rest/v1/alerts_settings/list.html

list of alerts settings

GET /rest/v1/gridboards/list.csv

GET /rest/v1/gridboards/list.html

list of gridboards

GET /rest/v1/gridboards/%1.html

gridboard %1, rendered as one or several html tables

GET /rest/v1/configs/list.csv

GET /rest/v1/configs/list.html

list of loaded configurations

GET /rest/v1/configs/history.csv

GET /rest/v1/configs/history.html

list of active configurations history, current one first

GET /rest/v1/configs/%1.pf

config %1, in config file format

POST /rest/v1/configs/

upload a config, in config file format; uploaded config is not activated, it is only loaded (see /do/v1/configs/activate/)

reply body will be of the form (id %1), %1 being replaced by uploaded config id

reply has X-Qron-ConfigId http header set to uploaded config id, for clients that find it easier to read headers than body

GET /rest/v1/logs/logfiles.csv

GET /rest/v1/logs/logfiles.html

list of logfiles

GET /rest/v1/logs/entries.txt

last log entries, last one last

optional parameters:

  • files: if set to current only fetch current log files entries, ignoring older files (useful with e.g. daily-rotated log files)
  • filter: plain text filter string
  • regexp: regular expression filter

GET /rest/v1/logs/last_info_entries.csv

GET /rest/v1/logs/last_info_entries.html

last log entries with at least INFO severity level, in csv or html table format, last one last
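
As an example of consuming a REST reply, the config upload call (POST /rest/v1/configs/) returns the new config id both in the X-Qron-ConfigId header and in a body of the form (id %1). The Python sketch below shows one way a client might extract it, then build the path of the activation call. The function names are hypothetical, headers are assumed to be given as a plain dict with the exact header spelling, and the config id value is made up.

```python
import re

def uploaded_config_id(headers, body):
    """Extract the config id from an upload reply, preferring the header."""
    # Assumption: headers is a plain dict using the exact header spelling.
    config_id = headers.get("X-Qron-ConfigId")
    if config_id:
        return config_id.strip()
    # fall back to the "(id %1)" reply body
    match = re.search(r"\(id\s+([^\s()]+)\)", body)
    return match.group(1) if match else None

def activate_path(config_id):
    """Path of the RPC call activating an already uploaded config."""
    return f"/do/v1/configs/activate/{config_id}"

# hypothetical reply of a config upload
config_id = uploaded_config_id({}, "(id 3f2a9c)")
print(config_id, activate_path(config_id))
```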

RPC API

RPC calls are described in the table below. They comply with the following rules:

  • calls are handled with the same meaning regardless of which HTTP method is used (GET or POST), the meaning is given by the path
  • paths always start with /do/%version/%domain/%action/, and in most cases the domain is a collection_name (e.g. tasks).
  • for calls using HTTP params, both URI request string params (GET) and body x-www-form-urlencoded params (POST) are supported, behavior when using both at a time is unspecified (so don't mix them, this may work in a given qron version and not in the next one, or not with the same priority/overriding implicit rules)
  • reply HTTP statuses can be trusted with their standard meaning, at least for the first digit (2xx: success, 4xx: input error, 5xx: server-side error, 401 and 403 used for authentication)

The following table describes RPC calls:

Call | Description

POST|GET /do/v1/tasks/request/%taskid

request execution of a task, HTTP params are used as task instance overriding params

POST|GET /do/v1/tasks/abort_instances/%taskid

abort all running instances of a task

POST|GET /do/v1/tasks/cancel_requests/%taskid

cancel all queued requests of a task

POST|GET /do/v1/tasks/cancel_requests_and_abort_instances/%taskid

cancel all queued requests and abort all running instances of a task

POST|GET /do/v1/tasks/disable/%taskid

disable a task from being queued and (if already queued) from running

POST|GET /do/v1/tasks/enable/%taskid

(re-)enable a task

POST|GET /do/v1/tasks/disable_all

disable all tasks at once

POST|GET /do/v1/tasks/enable_all

enable all tasks at once

POST|GET /do/v1/taskinstances/cancel/%taskinstanceid

cancel a queued task instance (a.k.a. a task request)

POST|GET /do/v1/taskinstances/abort/%taskinstanceid

abort a running task instance

POST|GET /do/v1/taskinstances/cancel_or_abort/%taskinstanceid

cancel a task instance if it is still queued or abort it if it is already running

POST|GET /do/v1/alerts/emit/%alertid

emit a one-shot alert

POST|GET /do/v1/alerts/raise/%alertid

raise a stateful alert

POST|GET /do/v1/alerts/raise_immediately/%alertid

raise immediately (without waiting for rise delay) a stateful alert

POST|GET /do/v1/alerts/cancel/%alertid

cancel a stateful alert

POST|GET /do/v1/alerts/cancel_immediately/%alertid

cancel immediately (without waiting for cancel delay) a stateful alert

POST|GET /do/v1/gridboards/clear/%gridboardid

clear a gridboard

POST|GET /do/v1/notices/post/%notice

post a notice, HTTP params are used as notice params

POST|GET /do/v1/configs/reload_config_file

reload configuration file and apply new configuration (if a configuration file is defined, and its content is valid)

POST|GET /do/v1/configs/activate/%configid

activate a configuration from repository

POST|GET /do/v1/configs/remove/%configid

remove a configuration from repository
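
To illustrate the RPC rules, here is a minimal Python sketch that prepares a task request as a POST with x-www-form-urlencoded body params only, so that query string and body params are never mixed. The base URL, the function name, the task id and the param values are assumptions made for the example; only the /do/v1/tasks/request/%taskid path comes from the table above.

```python
from urllib.parse import quote, urlencode
from urllib.request import Request

def task_request(base_url, task_id, params=None):
    """Prepare a POST requesting a task, params overriding task params."""
    url = f"{base_url.rstrip('/')}/do/v1/tasks/request/{quote(task_id, safe='')}"
    # all params go in the body, none in the query string
    body = urlencode(params or {}).encode("ascii")
    return Request(url, data=body, method="POST",
                   headers={"Content-Type": "application/x-www-form-urlencoded"})

# hypothetical qron address, task id and overriding param
req = task_request("http://localhost:8086", "group1.import_customers_file",
                   {"path": "/tmp/foobar"})
print(req.get_method(), req.full_url, req.data)
# the call itself could then be issued with urllib.request.urlopen(req)
```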

References about HTTP APIs

The following documents have been a strong source of inspiration for designing the REST and RPC API principles:

Appendices