GRASS mailing list community evolution

From GRASS-Wiki
Revision as of 22:03, 18 March 2009 by Pibinko

Watching how grass-dev develops (and grass-user is used)

During the 10th GRASS GFOSS User meeting in Cagliari, Italy, a summary of the activities of the Italian GFOSS community was presented. Alongside basic indicators of the Italian community's activity, some simple yet intriguing statistics derived from an analysis of the main discussion mailing lists were shown. (GIVE TWO EXAMPLES OF THIS).

In the typical brainstorming atmosphere which permeates events such as software user meetings, we considered the idea of replicating the same analysis on two other mailing lists with a much longer history, namely the grass developer and the grass user mailing lists.

The outcome of the analysis provides a unique insight into the dynamics of the user and developer communities over an extremely long time span, from 1991 through 2008.

SHORT STORY ABOUT LONG CONVERSATION (uh, could be better)


- US Army mailing lists launched 12/1991
- interfaced with Deja News (http://en.wikipedia.org/wiki/Deja_News) in (check MN)
- Deja News forum only (I should verify this, but I have the mbox files of the lists, so it is quick to do with "mutt")
- 1995 (?): email spam (http://en.wikipedia.org/wiki/E-mail_spam) starts to appear through Deja News; it was later carefully removed from the list by hand
- new mailing lists created in 1999 (check MN) at the University of Hannover, as Deja News was not usable or practical
- lists migrated to Italy with MN and the server migration in 2001
- email recovered from Deja News and merged into the original lists' mbox files (which MN received from the US Army, don't remember precisely)
  [we need to be vague about this because perhaps the message copyright was with Deja News when using their system. Deja News was then bought by Google].
  The email headers for many years of messages had to be reconstructed since the format was broken.
- complete archive restored and online (check date MN)
- in 2007, lists migrated to OSGeo infrastructure

Analysis Methodology

The information extraction approach leans on the KISS side: the core of the parsing is handled by a Perl script, while the remaining post-processing is carried out via standard queries and no-nonsense charting tools.

TIME AND SPACE

The first core set of information extracted was the time zone reference of the messages, considering that the time zone may be used to provide an approximate indication of longitude.
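The original parsing was done with a Perl script that is not reproduced here. As a rough illustration of the idea only, the following Python sketch reads the UTC offset from a message's Date header and converts it to a nominal longitude. The function names `utc_offset_hours` and `nominal_longitude` are hypothetical, and the 15 degrees per hour conversion is an assumption (360 degrees divided by 24 hours) that ignores daylight saving time and political time zone boundaries.

```python
from email.utils import parsedate_to_datetime
from typing import Optional


def utc_offset_hours(date_header: str) -> Optional[float]:
    """Return the UTC offset of a Date header in hours, or None if unavailable."""
    try:
        parsed = parsedate_to_datetime(date_header)
    except (TypeError, ValueError):
        # Unparseable header: treated as "time zone not identified".
        return None
    offset = parsed.utcoffset()  # None when the header carries no usable zone
    if offset is None:
        return None
    return offset.total_seconds() / 3600.0


def nominal_longitude(offset_hours: float) -> float:
    """Assumption: 15 degrees of longitude per hour of UTC offset (360 / 24)."""
    return offset_hours * 15.0


if __name__ == "__main__":
    header = "Wed, 18 Mar 2009 22:03:00 +0100"
    hours = utc_offset_hours(header)
    if hours is not None:
        print(hours, nominal_longitude(hours))
```

Headers for which the function returns `None` correspond to the messages whose time zone a first pass would fail to identify.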

For the grass-dev list, a first pass with the tools correctly parsed over 99% of the messages. Greater completeness could be obtained by refining the parsing algorithm to handle the exceptions encountered, but we considered the level of approximation obtained in extracting the time zone reference to be adequate for the quality objectives of our analysis.

For the grass-user mailing list, the number of messages with time zone not identified by the first pass of the parsing algorithm is higher (some 3%), but still considered satisfactory within the scope of the current analysis.


WHAT DO TIME AND TIME ZONES TELL

The charts (include numbers) show:

  • the absolute number of message postings by time zone and year
  • the relative proportion of messages posted each year from a given time zone
  • the cumulated proportion of messages deriving from different time zones, calculated assuming 100% to be the e-mail traffic generated from the beginning of the mailing list records through 2008
(add various observations here)
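The three quantities behind the charts can be sketched in Python as follows. This is not the authors' query set: the function name `posting_statistics` and the input layout (one `(year, utc_offset)` pair per message) are hypothetical, and the reading of the cumulated proportion (running count per time zone divided by the total traffic over the whole period) is an assumption about what the third chart shows.

```python
from collections import Counter


def posting_statistics(records):
    """Compute chart inputs from (year, utc_offset) pairs, one pair per message.

    Returns three dicts keyed by (year, utc_offset):
    absolute counts, per-year shares, and cumulated shares of all traffic.
    """
    records = list(records)
    if not records:
        return {}, {}, {}

    # Absolute number of postings by year and time zone.
    absolute = Counter(records)

    # Share of each year's messages that came from a given time zone.
    per_year_total = Counter(year for year, _ in records)
    relative = {
        (year, zone): count / per_year_total[year]
        for (year, zone), count in absolute.items()
    }

    # Running count per zone, as a fraction of all traffic in the records
    # (assumed interpretation of the "cumulated proportion").
    grand_total = len(records)
    years = sorted(per_year_total)
    zones = sorted({zone for _, zone in records})
    running = Counter()
    cumulated = {}
    for year in years:
        for zone in zones:
            running[zone] += absolute.get((year, zone), 0)
            cumulated[(year, zone)] = running[zone] / grand_total

    return dict(absolute), relative, cumulated
```

With this layout, the cumulated shares of all zones in the final year add up to 1, which matches the "100% is the total traffic through 2008" convention described above.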


Another interesting analysis is the extraction of specific keywords from the message body. In the case of the GRASS lists, we decided to focus on GRASS commands. Matrices with the occurrence of GRASS commands by year were generated for both mailing lists.

The clear limitation of this type of analysis is that the use of a term is not associated with its context. A reference to a specific command does not indicate whether it relates to a coding problem, to issues in use, or to working examples.

Another element neglected in the analysis is quotation: the occurrence of a term is counted as long as it appears in the body of a message, even when it sits in text quoted from an earlier message.
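A command-by-year matrix of this kind can be sketched in Python as below. This is an illustration, not the original Perl parser: the function name `command_matrix`, the input layout (`(year, body)` pairs) and the choice to count every occurrence rather than one hit per message are all assumptions. Quoted lines are deliberately not filtered out, mirroring the limitation described above.

```python
import re
from collections import Counter


def command_matrix(messages, commands):
    """Count occurrences of each command name per year.

    messages: iterable of (year, body) pairs (hypothetical layout).
    commands: iterable of command names to look for, e.g. "g.region".
    Returns a Counter keyed by (command, year).
    """
    # The lookarounds keep "r.in.gdal" from also counting as a shorter name
    # that happens to be its prefix, while still matching at sentence ends.
    patterns = {
        cmd: re.compile(r"(?<![\w.])" + re.escape(cmd) + r"(?!\.?\w)")
        for cmd in commands
    }
    matrix = Counter()
    for year, body in messages:
        for cmd, pattern in patterns.items():
            hits = len(pattern.findall(body))
            if hits:
                matrix[(cmd, year)] += hits
    return matrix


if __name__ == "__main__":
    sample = [(1999, "Run g.region first.\n> g.region failed here")]
    print(command_matrix(sample, ["g.region", "r.in.gdal"]))
```

In the sample, the quoted line starting with `>` contributes to the count just like the new text, which is exactly the effect the analysis does not correct for.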
The review of the entries reported by the parser (WHERE DOES THIS LEAD US?)

yay... This needs to grow a bit. (AT WORST WE LIMIT OURSELVES TO EXPLAINING THAT WE ARE HAPPY TO HAVE DONE A FIRST EXTRACTION...AND THAT )

Then

- RELEASES AND EMAIL HYPE (I'll do this one)
- THE 90s: depression and renewal
- ...