Type of Document: Dissertation
Author: Xu, Jin
URN: etd-04182007-083425
Title: Mining and Modeling the Open Source Software Community
Degree: Doctor of Philosophy
Department: Computer Science and Engineering
Advisory Committee:
Advisor Name: Gregory Madey (Title: Committee Member)
Keywords:
- OPEN SOURCE
- SOCIAL NETWORK
- POWER LAW
- SMALL WORLD
- DATA MINING

Date of Defense: 2007-03-19
Availability: unrestricted
Abstract:
The success of Open Source Software (OSS) has attracted increased interest in many research areas. Unlike proprietary closed software, OSS projects are developed in a distributed and decentralized way.
The OSS community is largely composed of part-time developers. These developers have produced a substantial
number of outstanding technical achievements. A study of how OSS developers interact with each other
and how projects are developed will help researchers understand the success and failure of OSS projects. OSS developers can also benefit from this research by being able to make more
informed decisions about participating in OSS projects.
In this dissertation, we address the challenge of efficiently mining data from OSS web repositories and building models
to study OSS community features. Data collection for OSS studies is nontrivial, since most OSS projects are developed by
distributed developers using web tools. Most previous studies collect data from OSS web sites with a hand-built web crawler
tailored to specific research goals.
We design a mining process that combines web mining and database mining to identify, extract, filter, and
analyze data. We also analyze the difficulties of mining OSS data. Our work provides a general solution for
researchers to apply advanced techniques, such as web mining, data mining, statistics, and algorithms, to collect
and analyze web repository data.
Based on our mining results, we model the OSS community as a social network, one which can be further modeled as
a project network and a developer network, and study the properties of these networks. Our goal is to find intrinsic
mechanisms in OSS networks that explain some OSS-specific features, such as the roles of developers, communication,
and the reliability of the OSS community. We construct four social networks for the OSS development community at
SourceForge. Each social network is created by expanding the number of people with different roles
in the network, moving from the core project leaders, to the core developers, to the co-developers, and finally out
to active users. Social network properties such as degree distribution, diameter, cluster size, and clustering coefficient
are calculated and compared for each of the expanding social networks. We elaborate on how the changing topological
characteristics of the social networks may signify important capabilities for the diffusion of information, the ability
to find collaborations, and the overall robustness of the OSS development community. We further find that all the social
networks have scale-free properties, and that the inclusion of the co-developers and active users triggers the emergence of
the small-world phenomenon in the social network. We examine how these topological network properties may
explain the success and efficiency of OSS development practices.
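The network properties named above can be illustrated with a minimal sketch. It assumes a developer network in which two developers are linked whenever they share a project; the membership table, developer names, and function names are hypothetical and are not taken from the dissertation's SourceForge data or code.

```python
from collections import Counter, deque
from itertools import combinations


def developer_network(memberships):
    """Link two developers whenever they share at least one project."""
    adj = {}
    for devs in memberships.values():
        for d in devs:
            adj.setdefault(d, set())
        for a, b in combinations(sorted(set(devs)), 2):
            adj[a].add(b)
            adj[b].add(a)
    return adj


def degree_distribution(adj):
    """Map each degree k to the number of nodes with that degree."""
    return Counter(len(nbrs) for nbrs in adj.values())


def local_clustering(adj, node):
    """Fraction of a node's neighbor pairs that are themselves linked."""
    nbrs = adj[node]
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
    return 2.0 * links / (k * (k - 1))


def average_clustering(adj):
    """Mean of the local clustering coefficients over all nodes."""
    return sum(local_clustering(adj, n) for n in adj) / len(adj)


def bfs_distances(adj, source):
    """Shortest-path lengths from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def components(adj):
    """Connected components (clusters) as a list of node sets."""
    seen, comps = set(), []
    for n in adj:
        if n not in seen:
            comp = set(bfs_distances(adj, n))
            seen |= comp
            comps.append(comp)
    return comps


def diameter(adj, comp):
    """Longest shortest path within one connected component."""
    return max(max(bfs_distances(adj, n).values()) for n in comp)


# Hypothetical project -> developers table (illustration only).
memberships = {
    "proj_a": ["ann", "bob", "cy"],
    "proj_b": ["cy", "dee"],
    "proj_c": ["dee", "eve"],
    "proj_d": ["fay"],
}
adj = developer_network(memberships)
largest = max(components(adj), key=len)
print("degree distribution:", dict(degree_distribution(adj)))
print("average clustering:", average_clustering(adj))
print("largest cluster size:", len(largest))
print("diameter of largest cluster:", diameter(adj, largest))
```

The diameter is computed on the largest connected cluster only, since shortest paths are undefined between disconnected clusters. This is a common convention, not necessarily the one used in the dissertation.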
To study the organization and backbones of the OSS community, we identify the community structure
of the SourceForge project network. We find that groups exist in the SourceForge project network. Furthermore, we explore
possible reasons for the formation of those groups by examining assortative mixing coefficients for project categories.
We find that projects with the same programming languages, operating systems, and topics are more likely to be grouped
together. Our research provides useful information for studying the interaction between projects and the communication and
information flow in OSS virtual organizations.
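An assortative mixing coefficient for a categorical attribute can be sketched as follows. The abstract does not say which formulation is used, so this sketch assumes Newman's coefficient for discrete attributes, r = (Σ e_ii − Σ a_i²) / (1 − Σ a_i²), where e_ij is the fraction of edge ends joining category i to category j and a_i = Σ_j e_ij. The project names and language labels are hypothetical.

```python
from collections import defaultdict


def assortativity(edges, attribute):
    """Assortativity coefficient for a categorical node attribute.

    Assumes Newman's definition for discrete attributes on an undirected
    network: r = 1 for perfectly assortative mixing, r = 0 for random mixing.
    """
    e = defaultdict(float)
    for u, v in edges:
        cu, cv = attribute[u], attribute[v]
        # Count each undirected edge in both directions so e is symmetric.
        e[(cu, cv)] += 1
        e[(cv, cu)] += 1
    total = sum(e.values())
    a = defaultdict(float)
    trace = 0.0
    for (ci, cj), count in e.items():
        a[ci] += count / total
        if ci == cj:
            trace += count / total
    sum_a2 = sum(x * x for x in a.values())
    if abs(1.0 - sum_a2) < 1e-12:
        raise ValueError("undefined when every edge joins a single category")
    return (trace - sum_a2) / (1.0 - sum_a2)


# Hypothetical projects labeled by programming language (illustration only).
language = {"p1": "Java", "p2": "Java", "p3": "C", "p4": "C"}
same_language_links = [("p1", "p2"), ("p3", "p4")]
cross_language_links = [("p1", "p3"), ("p2", "p4")]
print(assortativity(same_language_links, language))
print(assortativity(cross_language_links, language))
print(assortativity(same_language_links + cross_language_links, language))
```

A positive value for an attribute such as programming language would be consistent with the finding that projects sharing that attribute tend to be grouped together.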
We simulate the OSS community based on four social network models: random graphs, preferential attachment,
preferential attachment with constant fitness, and preferential attachment with dynamic fitness, using two
tools, Repast and Swarm. Our simulation models are fit to data from year two in the history of SourceForge. To verify the
correctness of our simulations, we perform docking experiments on the Repast simulation and the Java/Swarm simulation. Our models
simulate developers' actions and the growth of the OSS community. We compare social network properties such as degree distribution,
diameter, and clustering coefficient to dock the Repast and Swarm simulations of the four social network models.
Our practice demonstrates the importance of verification in scientific simulations. The simulation models
we build can be used to forecast the future development of the OSS community.
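A generic growth model with preferential attachment and constant fitness can be sketched as below. This is not the Repast or Swarm model from the dissertation: the attachment rule (probability proportional to fitness times degree), the fully connected starting core, and all names and parameters are assumptions for illustration. Dynamic fitness is not shown because the abstract does not specify how fitness changes over time.

```python
import random


def grow_network(n_nodes, m=2, fitness=None, seed=0):
    """Grow an undirected network by fitness-weighted preferential attachment.

    Each new node links to m distinct existing nodes, chosen with probability
    proportional to fitness[i] * degree[i]. With fitness=None every node has
    fitness 1, which reduces to plain preferential attachment. The fitness
    callable must return strictly positive values.
    """
    rng = random.Random(seed)
    core = m + 1  # start from a small fully connected core
    adj = {i: {j for j in range(core) if j != i} for i in range(core)}
    fit = {i: (fitness(rng) if fitness else 1.0) for i in range(core)}
    for new in range(core, n_nodes):
        nodes = list(adj)
        weights = [fit[i] * len(adj[i]) for i in nodes]
        targets = set()
        while len(targets) < m:
            targets.add(rng.choices(nodes, weights=weights, k=1)[0])
        adj[new] = set()
        fit[new] = fitness(rng) if fitness else 1.0
        for t in targets:
            adj[new].add(t)
            adj[t].add(new)
    return adj


# Plain preferential attachment versus constant (fixed at creation) fitness.
plain = grow_network(200, m=2, seed=1)
fitted = grow_network(200, m=2, fitness=lambda r: r.uniform(0.1, 1.0), seed=1)
print("max degree, plain:", max(len(v) for v in plain.values()))
print("max degree, with fitness:", max(len(v) for v in fitted.values()))
```

Fixing the seed makes each run repeatable. Docking two independent implementations of the same model would then compare summary properties such as degree distribution, diameter, and clustering coefficient across their outputs.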
Files

Approximate download time is given as Hours:Minutes:Seconds.

| Filename | Size | 28.8 Modem | 56K Modem | ISDN (64 Kb) | ISDN (128 Kb) | Higher-speed Access |
| --- | --- | --- | --- | --- | --- | --- |
| XuJ042007.pdf | 22.01 Mb | 01:41:54 | 00:52:24 | 00:45:51 | 00:22:55 | 00:01:57 |