===================
dc-tool data mining
===================
dc-tool provides several data mining methods, all controlled through a
configuration file.
The workflow is:
1. Read this documentation and create a configuration file to test.
2. Run ``dc-tool --mine=mysource.conf`` to perform data mining and print
results to standard output.
3. When you are satisfied with the results, run ``dc-tool --mine=mysource.conf --post``
to post data to contributors.debian.org. Run that command from cron and you have a
fully working data source.
-------------------------
Configuration file syntax
-------------------------
The configuration file follows the usual Debian RFC822/YAML-like syntax.
If the first group of options does not have a "contribution:" field, it is used
for general configuration of the data source. All other sections define methods
of mining the data you want.
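The grouping rule above can be sketched in Python. This is only an illustration of the rule, not dc-tool's actual parser: the function names ``parse_sections`` and ``split_config`` are hypothetical, and the sketch assumes that groups are separated by blank lines, that lines starting with ``#`` are comments, and that indented lines continue the previous field.

```python
def parse_sections(text):
    """Split RFC822-style text into a list of dicts, one per group of options."""
    sections = []
    current = {}
    last_key = None
    for line in text.splitlines():
        if line.lstrip().startswith("#"):
            continue  # comment line (assumed syntax)
        if not line.strip():
            # A blank line closes the current group of options
            if current:
                sections.append(current)
            current, last_key = {}, None
        elif line[0].isspace() and last_key is not None:
            # Indented line: continuation of the previous field
            current[last_key] = (current[last_key] + "\n" + line.strip()).strip()
        else:
            key, _, value = line.partition(":")
            last_key = key.strip()
            current[last_key] = value.strip()
    if current:
        sections.append(current)
    return sections


def split_config(text):
    """Return (general_options, mining_sections)."""
    sections = parse_sections(text)
    # The first group is general configuration only if it has no "contribution" field
    if sections and "contribution" not in sections[0]:
        return sections[0], sections[1:]
    return {}, sections
```

With this sketch, a file that starts directly with a ``contribution:`` group simply yields an empty set of general options.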
The data source configuration section
=====================================
Example::
# You don't need this option if you call this file nm.debian.org.conf
#source: nm.debian.org
# Authentication token used to post data. Use a leading '@' as in '@filename'
# to use the contents of another file as auth token. Do not make this file
# world-readable!
auth_token: @secrets/auth_token.txt
The general configuration section has three configurable keywords:
``source``
Data source name, as configured in contributors.debian.org. If omitted,
dc-tool will use the configuration file name. If the file name ends in ``.ini``,
``.conf`` or ``.cfg``, the extension will be removed.
``auth_token``
The authentication token used for posting data to the site.
Anyone with this authentication token can post data for this data source, so
be careful not to give this file world-readable permissions.
``baseurl``
You never need this unless you want to test a local copy of the
contributors.debian.org codebase: it defaults to ``https://contributors.debian.org/``
but you can change it to submit data to your local development version.
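The ``source`` and ``auth_token`` rules can be sketched as follows. The helper names are hypothetical and this is not dc-tool's own code. In particular, resolving ``@filename`` relative to a ``config_dir`` argument is an assumption, since the documentation does not say which directory relative paths are resolved against.

```python
import os
import stat


def source_name(config_path, options):
    """Use the explicit ``source`` option, else derive the name from the file name."""
    if options.get("source"):
        return options["source"]
    name = os.path.basename(config_path)
    for ext in (".ini", ".conf", ".cfg"):
        if name.endswith(ext):
            return name[:-len(ext)]
    return name


def load_auth_token(value, config_dir="."):
    """Return the token itself, or the contents of the file named after a leading '@'."""
    if value.startswith("@"):
        # Assumption: relative paths are resolved against config_dir
        with open(os.path.join(config_dir, value[1:])) as fd:
            return fd.read().strip()
    return value


def is_world_readable(path):
    """True if "others" have read permission on the file."""
    return bool(os.stat(path).st_mode & stat.S_IROTH)
```

A check like ``is_world_readable`` is one way to catch a token file with overly permissive mode bits before posting.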
Data mining sections
====================
Example::
contribution: committer
# Data mining method
method: gitdirs
# Configuration specific to this method
dirs: /srv/git.debian.org/git/collab-maint/*.git
url: https://alioth.debian.org/users/{user}/
Each data mining section has at least two configurable keywords:
``contribution``
Contribution type for this data source, as configured in contributors.debian.org.
You can have many sections with the same contribution type, and the results
of their data mining will all be merged.
``method``
The mining method. There are several mining methods available, each with its
own configuration options, documented below.
The rest of the options are specific to each data mining method. They are
fully documented below.
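As an illustration of merging results from several sections with the same contribution type, here is a minimal sketch. The exact merge rule is not documented in this file, so the rule used here (keep the earliest begin date and the latest end date for each identifier) is an assumption, and ``merge_results`` is a hypothetical name.

```python
def merge_results(*results):
    """Merge dicts mapping (contribution, identifier) -> (begin, end) date ranges."""
    merged = {}
    for result in results:
        for key, (begin, end) in result.items():
            if key in merged:
                old_begin, old_end = merged[key]
                # Assumed rule: widen the range to cover both sections
                merged[key] = (min(old_begin, begin), max(old_end, end))
            else:
                merged[key] = (begin, end)
    return merged
```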
Data mining methods
===================
bts
---
Parses the debbugs spool directories collecting contributions from mail
headers.
Example::
contribution: correspondant
method: bts
dirs: /srv/bugs.debian.org/spool/db-h/ /srv/bugs.debian.org/spool/archive/
url: https://bugs.debian.org/cgi-bin/pkgreport.cgi?correspondent={email}
Configuration options
`````````````````````
``dirs`` : Glob, required, default: None.
debbugs spool directories to scan. You can give
one or more, and even use shell-style globbing.
``threshold`` : Integer, optional, default: 5.
Minimum number of mails that need to exist
in the BTS for an email address to be
considered.
``url`` : Char, optional, default: None.
Template used to build URLs to link to people's contributions.
``{email}`` will be replaced with the email address.
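The effect of ``threshold`` and of the ``url`` template can be sketched like this. The function names are hypothetical and the sketch assumes the mail headers have already been reduced to a flat list of email addresses, one entry per mail. It also assumes the template is expanded with plain string formatting, without extra URL quoting.

```python
from collections import Counter


def apply_threshold(emails, threshold=5):
    """Keep only addresses that appear in at least ``threshold`` mails."""
    counts = Counter(emails)
    return {email: n for email, n in counts.items() if n >= threshold}


def contribution_url(template, email):
    """Expand ``{email}`` in the URL template (assumed: no extra quoting)."""
    return template.format(email=email)
```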
Option types
````````````
``Char``
A string value. Can be any UTF-8 string.
``Glob``
A string with one or more filenames. Globbing is supported. Arguments can
be quoted to deal with whitespace, but glob characters will always be
expanded.
``Integer``
An integer value.
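A ``Glob`` value as described above can be handled with shell-style splitting followed by glob expansion. This is a sketch using the Python standard library, not necessarily how dc-tool implements it, and ``expand_globs`` is a hypothetical name.

```python
import glob
import shlex


def expand_globs(value):
    """Split a Glob option on whitespace (honouring quotes), then expand each pattern."""
    paths = []
    for pattern in shlex.split(value):
        # Quoting only protects whitespace: glob characters are still expanded
        paths.extend(sorted(glob.glob(pattern)))
    return paths
```

Patterns that match nothing contribute no paths, so a mistyped directory silently yields an empty list in this sketch.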
files
-----
Recursively scan directories using file attributes to detect contributions.
Generates `login` types of identifiers, using the usernames of the system
where it is run.
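A scan of this kind can be sketched as follows. Which file attributes dc-tool actually inspects is not stated here, so this sketch assumes the owner uid and the modification time. ``scan_files`` is a hypothetical name, and the ``pwd`` module is only available on Unix-like systems.

```python
import os
import pwd
from datetime import date


def scan_files(root):
    """Map each file owner under ``root`` to the (earliest, latest) mtime date seen."""
    ranges = {}
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            st = os.lstat(os.path.join(dirpath, name))
            try:
                # Resolve the uid to a login name on the local system
                user = pwd.getpwuid(st.st_uid).pw_name
            except KeyError:
                user = str(st.st_uid)  # uid without a passwd entry
            day = date.fromtimestamp(st.st_mtime)
            begin, end = ranges.get(user, (day, day))
            ranges[user] = (min(begin, day), max(end, day))
    return ranges
```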
Example::
contribution: committer
method: files
dirs: /srv/cvs.debian.org/cvs/webwml
url: https://alioth.debian.org/users/{user}/
Configuration options
`````````````````````
``dirs`` : Glob, required, default: None.
directories to scan. You can give one or more, and
even use shell-style globbing.
``url`` : Char, optional, default: None.
template used to build URLs to link to people's contributions.
``{user}`` will be replaced with the username.
Option types
````````````
``Char``
A string value. Can be any UTF-8 string.
``Glob``
A string with one or more filenames. Globbing is supported. Arguments can
be quoted to deal with whitespace, but glob characters will always be
expanded.
gitdirs
-------
Scan git directories using file attributes to detect contributions.
Generates `login` types of identifiers, using the usernames of the system
where it is run.
Example::
contribution: committer
method: gitdirs
dirs: /srv/git.debian.org/git/collab-maint/*.git
url: https://alioth.debian.org/users/{user}/
Configuration options
`````````````````````
``dirs`` : Glob, required, default: None.
``.git`` directories to scan. You can give one or more,
and even use shell-style globbing.
``url`` : Char, optional, default: None.
template used to build URLs to link to people's contributions.
``{user}`` will be replaced with the username.
Option types
````````````
``Char``
A string value. Can be any UTF-8 string.
``Glob``
A string with one or more filenames. Globbing is supported. Arguments can
be quoted to deal with whitespace, but glob characters will always be
expanded.
gitlogs
-------
Scan git logs, taking note of committer and author activity.
Generates `email` types of identifiers, trusting whatever is in the git
log.
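To illustrate collecting author and committer activity, the sketch below parses the output of ``git log --format='%ae%x09%at%x09%ce%x09%ct'`` (author email, author timestamp, committer email, committer timestamp, separated by tabs). The choice of this format string and the name ``parse_git_log`` are assumptions made for the example, not a description of dc-tool's internals.

```python
from datetime import datetime, timezone


def parse_git_log(output):
    """Map each author/committer email to its (first, last) activity time."""
    ranges = {}
    for line in output.splitlines():
        if not line.strip():
            continue
        author, a_ts, committer, c_ts = line.split("\t")
        for email, ts in ((author, a_ts), (committer, c_ts)):
            when = datetime.fromtimestamp(int(ts), tz=timezone.utc)
            begin, end = ranges.get(email, (when, when))
            ranges[email] = (min(begin, when), max(end, when))
    return ranges
```

Because the emails come straight from the log, anyone able to commit can put an arbitrary address there, which is why this method is described as trusting the git log.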
Example::
contribution: committer
method: gitlogs
dirs: /srv/git.debian.org/git/collab-maint/*.git
Configuration options
`````````````````````
``author_map`` : IdentMap, optional, default: None.
Convert author emails using the given expressions.
``dirs`` : Glob, required, default: None.
``.git`` directories to scan. You can give one or more,
and even use shell-style globbing.
``subdir`` : Char, optional, default: None.
Limit the scan to subdirectories in the repository.
``url`` : Char, optional, default: None.
template used to build URLs to link to people's contributions.
``{email}`` will be replaced with the email address.
Option types
````````````
``Char``
A string value. Can be any UTF-8 string.
``Glob``
A string with one or more filenames. Globbing is supported. Arguments can
be quoted to deal with whitespace, but glob characters will always be
expanded.
``IdentMap``
A string with one or more identifier mapping expressions.
Each expression is on a line of its own. Leading and trailing spaces do not
matter.
Lines can be in one of two forms:
regexp replace
regexp replace flags
If regexp, replace or flags contain spaces, they can be shell-quoted.
Regexp and replace use the same syntax as Python's ``re.sub``. Flags are as found in
re.X.
For each mapping line, ``re.sub`` is called on each value found.
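An ``IdentMap`` value can be processed roughly as follows. This is a sketch under stated assumptions rather than dc-tool's implementation: the function names are hypothetical, and the flags field is assumed to be a string of letters, each naming an attribute of the ``re`` module (for example ``I`` for ``re.I``).

```python
import re
import shlex


def parse_ident_map(value):
    """Parse 'regexp replace [flags]' lines into (compiled_regexp, replace) pairs."""
    rules = []
    for line in value.splitlines():
        parts = shlex.split(line.strip())
        if not parts:
            continue
        if len(parts) not in (2, 3):
            raise ValueError("expected 'regexp replace [flags]': %r" % line)
        flags = 0
        for letter in (parts[2] if len(parts) == 3 else ""):
            flags |= getattr(re, letter)  # assumed notation: "I" -> re.I
        rules.append((re.compile(parts[0], flags), parts[1]))
    return rules


def apply_ident_map(rules, value):
    """Run re.sub for every mapping line, in order, on one value."""
    for regexp, replace in rules:
        value = regexp.sub(replace, value)
    return value
```

With shell-style splitting, an unquoted backslash is consumed as an escape character, so expressions containing ``\.`` or ``\1`` are safest inside single quotes.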
mailfrom
--------
Scan email addresses from ``From:`` headers in mailboxes.
Example::
contribution: developer
method: mailfrom
folders: /home/debian/lists/debian-devel-announce/*
url: http://www.example.com/{email}
Configuration options
`````````````````````
``blacklist`` : Emails, optional, default: None.
if present, emails from this list will not be
considered as contributors.
``folders`` : Glob, required, default: None.
mail folders to scan. You can give one or more,
and even use shell-style globbing. Mailbox,
mailbox.gz and Maildir folders are supported.
``url`` : Char, optional, default: None.
template used to build URLs to link to people's contributions.
``{email}`` will be replaced with the email address.
``whitelist`` : Emails, optional, default: None.
if present, only emails from this list will be
considered as contributors.
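The interaction of ``whitelist`` and ``blacklist`` can be sketched with the standard library's header parsing. The function names are hypothetical, and comparing addresses case-insensitively is an assumption of this example.

```python
from email.utils import getaddresses


def parse_emails(value):
    """Parse an address list the way To: or Cc: headers are parsed."""
    return {addr.lower() for _name, addr in getaddresses([value]) if addr}


def select_senders(from_headers, whitelist=None, blacklist=None):
    """Return the sender addresses that pass the whitelist and blacklist."""
    allowed = parse_emails(whitelist) if whitelist else None
    denied = parse_emails(blacklist) if blacklist else set()
    senders = set()
    for header in from_headers:
        for _name, addr in getaddresses([header]):
            addr = addr.lower()  # assumed: case-insensitive comparison
            if not addr or addr in denied:
                continue
            if allowed is not None and addr not in allowed:
                continue
            senders.add(addr)
    return senders
```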
Option types
````````````
``Char``
A string value. Can be any UTF-8 string.
``Emails``
A list of email addresses, like in email To: or Cc: headers.
``Glob``
A string with one or more filenames. Globbing is supported. Arguments can
be quoted to deal with whitespace, but glob characters will always be
expanded.
mock
----
Generate random contributions for random people.
Example::
identifier_type: email
method: mock
count: 10000
Configuration options
`````````````````````
``count`` : Integer, optional, default: 1000.
Number of contributions to generate.
``identifier_type`` : IdentifierType, optional, default: None.
Type of identifier to generate.
``url`` : Char, optional, default: None.
template used to build URLs to link to people's
contributions. ``{email}`` will be replaced with
the email address, ``{user}`` will be replaced with
the user name, ``{fpr}`` will be replaced with
the user key fingerprint.
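A generator of mock data in this spirit could look like the sketch below. The address pattern, the date ranges and the ``seed`` parameter are all invented for the example and do not reflect what the ``mock`` method really produces.

```python
import random
from datetime import date, timedelta


def mock_contributions(count=1000, seed=None):
    """Yield ``count`` random (email, begin, end) contributions."""
    rng = random.Random(seed)
    today = date.today()
    for _ in range(count):
        # Hypothetical identifiers in a reserved example domain
        email = "user%d@example.org" % rng.randrange(100)
        end = today - timedelta(days=rng.randrange(365))
        begin = end - timedelta(days=rng.randrange(365))
        yield email, begin, end
```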
Option types
````````````
``Char``
A string value. Can be any UTF-8 string.
``IdentifierType``
An identifier type. Can be one of:
``auto``
Autodetect. "ident" or "Name <ident>" are accepted, and ident can be any
email, login or OpenPGP fingerprint.
``login``
debian.org or Alioth login name.
``email``
email address.
``fpr``
OpenPGP key fingerprint.
``Integer``
An integer value.
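The ``auto`` identifier type can be illustrated with a small detection function. The detection rules used here are assumptions for the sake of the example: anything containing ``@`` is treated as an email, 40 hexadecimal digits (spaces ignored) as an OpenPGP fingerprint, and everything else as a login. ``detect_identifier`` is a hypothetical name.

```python
import re


def detect_identifier(value):
    """Return (type, ident, name) for "ident" or "Name <ident>"."""
    match = re.match(r"^\s*(.*?)\s*<([^>]+)>\s*$", value)
    if match:
        name, ident = match.group(1) or None, match.group(2).strip()
    else:
        name, ident = None, value.strip()
    compact = ident.replace(" ", "")
    if "@" in ident:
        return "email", ident, name
    if re.fullmatch(r"[0-9A-Fa-f]{40}", compact):
        # Assumed: 40 hex digits, normalised to upper case without spaces
        return "fpr", compact.upper(), name
    return "login", ident, name
```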
postgres
--------
Perform data mining using a SQL query on a Postgres database.
This requires python-psycopg2 to be installed.
Example::
contribution: uploader
method: postgres
db: service=projectb
identifier: login
query:
SELECT s.install_date as date,
u.uid as id,
u.name as desc
FROM source s
JOIN fingerprint f ON s.sig_fpr = f.id
JOIN uid u ON f.uid = u.id
url: http://qa.debian.org/developer.php?login={id}&comaint=yes
Configuration options
`````````````````````
``db`` : Char, required, default: None.
database connection string. See `psycopg2.connect
<http://initd.org/psycopg/docs/module.html#psycopg2.connect>`_
for details.
``identifier`` : IdentifierType, optional, default: 'auto'.
type of identifier that is found by this SQL query.
``query`` : Char, required, default: None.
SQL query used to list contributions. SELECT column field names are
significant: ``id`` is the contributor name, email, or fingerprint,
depending on how ``identifier`` is configured. ``date`` is the
contribution date, as a date or datetime. ``desc`` (optional) is a
human-readable description for this ``id``, like a person's name.
All other SELECT columns are ignored, but can be useful to provide
values for the ``url`` template.
``url`` : Char, optional, default: None.
template used to build URLs to link to people's contributions.
Words in curly braces (like ``{id}``) will be
expanded with the SELECT column of the same name.
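How the SELECT columns are used can be sketched without a database, by working on rows represented as dictionaries keyed by column name (with psycopg2, a cursor such as ``psycopg2.extras.RealDictCursor`` yields rows of that shape). The function name and the shape of the result are hypothetical, and the template is assumed to be expanded with plain string formatting.

```python
def contributions_from_rows(rows, url_template=None):
    """Collapse query rows into one entry per ``id`` with a begin/end date range."""
    result = {}
    for row in rows:
        entry = result.setdefault(row["id"], {
            "desc": row.get("desc"),  # optional column
            "begin": row["date"],
            "end": row["date"],
            # Every SELECT column is available to the template by name
            "url": url_template.format(**row) if url_template else None,
        })
        entry["begin"] = min(entry["begin"], row["date"])
        entry["end"] = max(entry["end"], row["date"])
    return result
```

Columns that the template does not mention are simply ignored during formatting, which matches the note that extra SELECT columns are harmless.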
Option types
````````````
``Char``
A string value. Can be any UTF-8 string.
``IdentifierType``
An identifier type. Can be one of:
``auto``
Autodetect. "ident" or "Name <ident>" are accepted, and ident can be any
email, login or OpenPGP fingerprint.
``login``
debian.org or Alioth login name.
``email``
email address.
``fpr``
OpenPGP key fingerprint.
svndirs
-------
Scan subversion directories using file attributes to detect contributions.
Generates `login` types of identifiers, using the usernames of the system
where it is run.
Example::
contribution: committer
method: svndirs
dirs: /srv/svn.debian.org/svn/collab-maint
url: https://alioth.debian.org/users/{user}/
Configuration options
`````````````````````
``dirs`` : Glob, required, default: None.
subversion directories to scan. You can give one or more,
and even use shell-style globbing.
``url`` : Char, optional, default: None.
template used to build URLs to link to people's contributions.
``{user}`` will be replaced with the username.
Option types
````````````
``Char``
A string value. Can be any UTF-8 string.
``Glob``
A string with one or more filenames. Globbing is supported. Arguments can
be quoted to deal with whitespace, but glob characters will always be
expanded.