Data Sets
Data sets collected and used in various research projects. All data available here is in the public domain.
- Power-law distributions (part 1): 24 univariate quantities that exhibit a
heavy-tailed pattern. The quantities are drawn from language, cellular biology,
communication networks, the World Wide Web, human conflict, ecology, human
infrastructure, natural disasters, and various social or economic phenomena.
Citation: Varies by data set.
- Power-law distributions (part 2): 12 univariate quantities that exhibit a heavy-tailed
pattern, most of them reported as binned data. The quantities are drawn from human
conflict, plant physiology, biomedicine, natural disasters, glaciology, and
human infrastructure.
Citation: Varies by data set.
- Food web of grassland species (network and vertex labels), where vertices are
species (herbivores, parasites, etc.) and edges indicate predation.
Citation: H.A. Dawah, B.A. Hawkins and M.F. Claridge, "Structure of the parasitoid communities of grass-feeding chalcid wasps." Journal of Animal Ecology 64, 708-720 (1995).
- Terrorist associations for 9/11 attacks (network, vertex labels, and vertex names), where vertices are individuals associated with the 9/11 terrorist
attacks, and edges indicate social associations.
Citation: V. Krebs, "Mapping networks of terrorist cells." Connections 24, 43-52 (2002).
- NFL 2009 league network (weighted network and vertex labels), where vertices
are NFL teams in 2009, the presence of an edge indicates that a game was
played, and edges are weighted by the mean score difference across all such
games.
Citation: C. Aicher, A.Z. Jacobs and A. Clauset, "Learning latent block structure in weighted networks." Journal of Complex Networks 3(2), 221-248 (2015).
- Body masses of extant whale species (table, xlsx), where each row is a
measurement, with taxonomic information and a source reference.
Citation: A. Clauset, "How large should whales be?" PLOS ONE 8(1), e53967 (2013).
- Sizes of terrorist events worldwide, 1968-2008 (events list), where each line
is an event with its date, severity, and a few covariates.
Citation: A. Clauset and R. Woodard, "Estimating the historical and future probabilities of large terrorist events." Annals of Applied Statistics 7(4), 1838-1865 (2013).
- Zachary Karate Club 77 (network and vertex metadata), for the Zachary Karate Club
social network. In Zachary's original paper, the adjacency matrix contains an
ambiguous link. The 78-edge version includes this link, while the 77-edge
version here omits it.
Citation: W. W. Zachary, "An Information Flow Model for Conflict and Fission in Small Groups." J. Anthro. Research 33(4), 452-473 (1977).
- Faculty hiring networks (networks and vertex metadata), for 205
Computer Science departments in North America, 112 Business schools in the US,
and 144 History departments in the US, representing about 19,000 faculty.
Citation: A. Clauset, S. Arbesman, and D. B. Larremore, "Systematic inequality and hierarchy in faculty hiring networks." Science Advances 1(1), e1400005 (2015).
- Golden Age of Hollywood actor collaborations (networks and vertex names),
for 55 actors who were particularly active from 1930 to 1959. This is a directed,
weighted, temporal network spanning 1909 to 2009, aggregated at the level of
decades.
Citation: D. Taylor, S. A. Myers, A. Clauset, M. A. Porter, and P. J. Mucha, "Eigenvector-Based Centrality Measures for Temporal Networks." Multiscale Modeling and Simulation 15(1), 537-574 (2017).
- CommunityFitNet corpus of 406 structurally diverse networks, drawn from the Index of Complex Networks, which represent a stable and realistic benchmark for evaluating and comparing community detection algorithms.
Citation: A. Ghasemian, H. Hosseinmardi, and A. Clauset, "Evaluating Overfit and Underfit in Models of Network Community Structure." Preprint, arXiv:1802.10582 (2018).
- Degree sequences for 927 structurally diverse networks, drawn from the Index of Complex Networks, which were used to evaluate the status of the scale-free networks hypothesis.
Citation: A. D. Broido and A. Clauset, "Scale-free networks are rare." Preprint, arXiv:1801.03400 (2018).
- Parental leave policy data for 205 universities in the U.S. and Canada.
Citation: A. C. Morgan, S. F. Way, M. Galesic, D. B. Larremore, and A. Clauset, "Paid Parental Leave at US and Canadian Universities." /parental-leave/ (2018).
- Travel distances for 18th century New Spain, from the 'Plano del arzobispado de Mexico' painted map.
Citation: Anonymous, "Plano del Arzobispado de México." Instituto Nacional de Antropología e Historia, Mexico. Accessed 9 June 2018.
- LinkPrediction Corpus of 548 real-world networks, spanning social, economic, biological, technological, information, and transportation domains.
Citation: A. Ghasemian, H. Hosseinmardi, A. Galstyan, E. M. Airoldi, and A. Clauset, "Stacking Models for Nearly Optimal Link Prediction in Complex Networks." Preprint, arXiv:1909.07578 (2019).
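
As an illustration of how a weighted network like the NFL 2009 entry above could be assembled, here is a minimal Python sketch that turns a list of game results into edges weighted by the mean score difference. The input format, the function name, and the choice of a signed difference (taken from the perspective of the alphabetically first team) are assumptions made for this example; the data set itself may use a different convention.

```python
from collections import defaultdict


def mean_score_difference_network(games):
    """Build an undirected weighted edge list from game results.

    `games` is an iterable of (team_a, team_b, score_a, score_b) tuples.
    Returns a dict mapping (team_1, team_2), with team_1 <= team_2, to the
    mean of (team_1 score - team_2 score) over all games between the pair.
    The signed, alphabetically oriented difference is an assumption of
    this sketch, not a documented property of the data set.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for team_a, team_b, score_a, score_b in games:
        # Orient every pair the same way so repeat games share one edge.
        if team_a <= team_b:
            key, diff = (team_a, team_b), score_a - score_b
        else:
            key, diff = (team_b, team_a), score_b - score_a
        totals[key] += diff
        counts[key] += 1
    return {key: totals[key] / counts[key] for key in totals}


# Hypothetical results, not taken from the 2009 season.
example_games = [
    ("A", "B", 21, 14),
    ("B", "A", 10, 20),
    ("A", "C", 7, 17),
]
weights = mean_score_difference_network(example_games)
for (team_1, team_2), weight in sorted(weights.items()):
    print(team_1, team_2, weight)
```

An edge exists only for pairs that played at least one game, so pairs that never met are simply absent from the returned dictionary.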
Code
Open-source implementations of algorithms and models developed by our research group and close collaborators.
- Stacked topological model for link prediction, using 42 topological features within a trained random forest model, which produces (as best we can ascertain) nearly optimal predictions. Python code (via Amir Ghasemian).
Citation: A. Ghasemian, H. Hosseinmardi, A. Galstyan, E. M. Airoldi, and A. Clauset, "Stacking Models for Nearly Optimal Link Prediction in Complex Networks." Preprint, arXiv:1909.07578 (2019).
- Blockmodel Entropy Significance Test (BESTest) and the neoSBM, for characterizing and exploring the relationship between node metadata and network structure. In Matlab and Python, respectively (via Dan Larremore and Leto Peel).
Citation: L. Peel, D. B. Larremore, and A. Clauset, "The ground truth about metadata and community detection in networks." Science Advances 3(5), e1602548 (2017).
- Generalized hierarchical random graph (GHRG) model and change-point detection toolkit for time-evolving
networks. Python code (via Leto Peel).
Citation: L. Peel and A. Clauset, "Detecting change points in the large-scale structure of evolving networks." Proc. AAAI, 2914-2920 (2015).
- Bipartite stochastic block model (biSBM) for extracting the communities within a bipartite network, from 2014. Matlab code (via Dan Larremore).
Citation: D.B. Larremore, A. Clauset and A.Z. Jacobs, "Efficiently inferring community structure in bipartite networks." Phys. Rev. E 90, 012805 (2014).
- Weighted stochastic block model (WSBM) for extracting the communities within a weighted network, from 2014. Matlab code.
Citation: C. Aicher, A.Z. Jacobs and A. Clauset, "Learning latent block structure in weighted networks." Journal of Complex Networks 3(2), 221-248 (2015).
- Toolkit for estimating the probability of rare events in heavy-tailed distributions, from 2013. Matlab code.
Citation: A. Clauset and R. Woodard, "Estimating the historical and future probabilities of large terrorist events." Annals of Applied Statistics 7(4), 1838-1865 (2013).
- Toolkit for estimating the
rugged shape of the modularity function for a particular network, via simulated
annealing and a low-dimensional projection, from 2010. Python code.
Citation: B.H. Good, Y.-A. de Montjoye and A. Clauset, "The performance of modularity maximization in practical contexts." Physical Review E 81, 046106 (2010).
- Toolkit for fitting, testing, and comparing power-law distributions in empirical data, from 2009. Matlab and R code.
Citation: A. Clauset, C. R. Shalizi and M.E.J. Newman, "Power-law distributions in empirical data." SIAM Review 51(4), 661-703 (2009).
- Hierarchical random graphs (HRG) model for extracting hierarchical group structure from networks, from 2008. Can also generate networks with hierarchical structure, and use a fitted model to predict missing links. C/C++ code.
Citation: A. Clauset, C. Moore and M.E.J. Newman, "Hierarchical structure and the prediction of missing links in networks." Nature 453, 98-101 (2008).
- Local community detection algorithm based on optimizing local modularity, from 2005. C/C++ code.
Citation: A. Clauset, "Finding local community structure in networks." Phys. Rev. E 72, 026132 (2005).
- Clauset-Newman-Moore (CNM) "fast modularity" community detection algorithm, from 2004. C/C++ code.
Citation: A. Clauset, M.E.J. Newman and C. Moore, "Finding community structure in very large networks." Phys. Rev. E 70, 066111 (2004).
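
To give a sense of what the power-law fitting toolkit listed above estimates, here is a minimal Python sketch of the standard maximum-likelihood estimator for the exponent of a continuous power law, alpha = 1 + n / sum(ln(x_i / xmin)), with approximate standard error (alpha - 1) / sqrt(n). This is not the toolkit's code: the function name is hypothetical, xmin is assumed to be known in advance, and the sketch omits the selection of xmin and the goodness-of-fit and model-comparison steps.

```python
import math
import random


def fit_continuous_power_law(values, xmin):
    """Estimate the exponent of a continuous power law on the tail x >= xmin.

    Returns (alpha_hat, standard_error, n_tail). Assumes xmin is given;
    choosing xmin from the data is outside the scope of this sketch.
    """
    tail = [x for x in values if x >= xmin]
    if not tail:
        raise ValueError("no observations at or above xmin")
    log_sum = sum(math.log(x / xmin) for x in tail)
    if log_sum <= 0.0:
        raise ValueError("all tail observations equal xmin; exponent undefined")
    n_tail = len(tail)
    alpha_hat = 1.0 + n_tail / log_sum
    standard_error = (alpha_hat - 1.0) / math.sqrt(n_tail)
    return alpha_hat, standard_error, n_tail


# Synthetic check: draw from a power law with a known exponent using
# inverse transform sampling, then recover the exponent.
rng = random.Random(0)
true_alpha, xmin = 2.5, 1.0
sample = [
    xmin * (1.0 - rng.random()) ** (-1.0 / (true_alpha - 1.0))
    for _ in range(10000)
]
alpha_hat, sigma, n_tail = fit_continuous_power_law(sample, xmin)
print(f"alpha_hat = {alpha_hat:.3f} +/- {sigma:.3f} (n = {n_tail})")
```

With 10,000 synthetic samples the estimate should land close to the true exponent. On real data, a good-looking estimate alone does not show that a power law is a plausible model, which is why the toolkit also covers testing and comparison.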