Data Sets

Data sets collected and used in various research projects. All data available here is in the public domain.

  1. Power-law distributions (part 1): 24 univariate quantities that exhibit a heavy-tailed pattern. Quantities drawn from language, cellular biology, communication networks, the World Wide Web, human conflict, ecology, human infrastructure, natural disasters, and various social or economic phenomena.

    Citation: Varies by data set.

  2. Power-law distributions (part 2): 12 univariate quantities that exhibit a heavy-tailed pattern, most of which are reported as binned data. Quantities drawn from human conflict, plant physiology, biomedicine, natural disasters, glaciology, and human infrastructure.

    Citation: Varies by data set.

  3. Food web of grassland species (network and vertex labels), where vertices are species (herbivores, parasites, etc.) and edges indicate predation.

    Citation: H.A. Dawah, B.A. Hawkins and M.F. Claridge, "Structure of the parasitoid communities of grass-feeding chalcid wasps." Journal of Animal Ecology 64, 708-720 (1995).

  4. Terrorist associations for 9/11 attacks (network, vertex labels, and vertex names), where vertices are individuals associated with the 9/11 terrorist attacks, and edges indicate social associations.

    Citation: V. Krebs, "Mapping networks of terrorist cells." Connections 24, 43-52 (2002).

  5. NFL 2009 league network (weighted network and vertex labels), where vertices are NFL teams in 2009, the presence of an edge indicates that a game was played, and edges are weighted by the mean score difference across all such games.

    Citation: C. Aicher, A.Z. Jacobs and A. Clauset, "Learning latent block structure in weighted networks." Journal of Complex Networks 3(2), 221-248 (2015).

  6. Body masses of extant whale species (table, xlsx), where each line is a measurement, with taxonomic information and source reference.

    Citation: A. Clauset, "How large should whales be?" PLOS ONE 8(1), e53967 (2013).

  7. Sizes of terrorist events worldwide, 1968-2008 (events list), where each line is an event with its date, severity, and a few covariates.

    Citation: A. Clauset and R. Woodard, "Estimating the historical and future probabilities of large terrorist events." Annals of Applied Statistics 7(4), 1838-1865 (2013).

  8. Zachary Karate Club 77 (network and vertex metadata), for the Zachary Karate Club social network. In Zachary's original paper, the adjacency matrix contains an ambiguous link. The 78-edge version includes this link, while the 77-edge version here omits it.

    Citation: W. W. Zachary, "An Information Flow Model for Conflict and Fission in Small Groups." J. Anthro. Research 33(4), 452-473 (1977).

  9. Faculty hiring networks (networks and vertex metadata), for 205 Computer Science departments in North America, 112 Business schools in the US, and 144 History departments in the US, representing about 19,000 faculty.

    Citation: A. Clauset, S. Arbesman, and D. B. Larremore, "Systematic inequality and hierarchy in faculty hiring networks." Science Advances 1(1), e1400005 (2015).

  10. Golden Age of Hollywood actor collaborations (networks and vertex names), for 55 actors who were particularly active from 1930 to 1959. This is a directed, weighted, temporal network spanning 1909 to 2009, aggregated at the level of decades.

    Citation: D. Taylor, S. A. Myers, A. Clauset, M. A. Porter, and P. J. Mucha, "Eigenvector-Based Centrality Measures for Temporal Networks." Multiscale Modeling and Simulation 15(1), 537-574 (2017).

  11. CommunityFitNet corpus of 406 structurally diverse networks, drawn from the Index of Complex Networks, which represent a stable and realistic benchmark for evaluating and comparing community detection algorithms.

    Citation: A. Ghasemian, H. Hosseinmardi, and A. Clauset, "Evaluating Overfit and Underfit in Models of Network Community Structure." Preprint, arXiv:1802.10582 (2018).

  12. Degree sequences for 927 structurally diverse networks, drawn from the Index of Complex Networks, which were used to evaluate the status of the scale-free networks hypothesis.

    Citation: A. D. Broido and A. Clauset, "Scale-free networks are rare." Preprint, arXiv:1801.03400 (2018).

  13. Parental leave policy data for 205 universities in the U.S. and Canada.

    Citation: A. C. Morgan, S. F. Way, M. Galesic, D. B. Larremore, and A. Clauset, "Paid Parental Leave at US and Canadian Universities." /parental-leave/ (2018).

  14. Travel distances for 18th-century New Spain, from the 'Plano del Arzobispado de México' painted map.

    Citation: Anonymous, "Plano del Arzobispado de México." Instituto Nacional de Antropología e Historia, Mexico. Accessed 9 June 2018.

  15. LinkPrediction Corpus of 548 real-world networks, spanning social, economic, biological, technological, information, and transportation domains.

    Citation: A. Ghasemian, H. Hosseinmardi, A. Galstyan, E. M. Airoldi, and A. Clauset, "Stacking Models for Nearly Optimal Link Prediction in Complex Networks." Preprint, arXiv:1909.07578 (2019).
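
The NFL 2009 entry in the list above describes a weighted network in which an edge means two teams played each other and the weight is the mean score difference across their games. A minimal Python sketch of that kind of aggregation follows. The record layout, the function name, the team labels, and the scores are all hypothetical and are not taken from the data set. The sketch also assumes an undirected edge weighted by the absolute score difference, which the description does not specify.

```python
from collections import defaultdict


def mean_score_difference_network(games):
    """Aggregate game records into an undirected weighted edge list.

    Each game is (team_a, team_b, score_a, score_b). An edge exists between
    two teams if they played at least once; its weight is the mean absolute
    score difference over all of their games (an assumption, see lead-in).
    """
    totals = defaultdict(float)  # team pair -> summed score difference
    counts = defaultdict(int)    # team pair -> number of games played
    for team_a, team_b, score_a, score_b in games:
        pair = tuple(sorted((team_a, team_b)))  # order-free key for the pair
        totals[pair] += abs(score_a - score_b)
        counts[pair] += 1
    return {pair: totals[pair] / counts[pair] for pair in totals}


# Hypothetical records for illustration only, not real game results.
games = [
    ("A", "B", 21, 14),
    ("B", "A", 10, 13),
    ("A", "C", 30, 20),
]
edges = mean_score_difference_network(games)
for (u, v), weight in sorted(edges.items()):
    print(u, v, weight)
```

Teams B and C never meet in the toy records, so no edge is created between them. A signed or directed variant would need a convention for which team's score is subtracted from which.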

Code

Open-source implementations of algorithms and models developed by our research group and close collaborators.

  1. Stacked topological model for link prediction, using 42 topological features within a trained random forest model, which produces (as best we can ascertain) nearly optimal predictions. Python code (via Amir Ghasemian).

    Citation: A. Ghasemian, H. Hosseinmardi, A. Galstyan, E. M. Airoldi, and A. Clauset, "Stacking Models for Nearly Optimal Link Prediction in Complex Networks." Preprint, arXiv:1909.07578 (2019).

  2. Blockmodel Entropy Significance Test (BESTest) and the neoSBM, for characterizing and exploring the relationship between node metadata and network structure. In Matlab and Python, respectively (via Dan Larremore and Leto Peel).

    Citation: L. Peel, D. B. Larremore, and A. Clauset, "The ground truth about metadata and community detection in networks." Science Advances 3(5), e1602548 (2017).

  3. Generalized hierarchical random graph (GHRG) model and change-point detection toolkit for time-evolving networks. Python code (via Leto Peel).

    Citation: L. Peel and A. Clauset, "Detecting change points in the large-scale structure of evolving networks." Proc. AAAI, 2914-2920 (2015).

  4. Bipartite stochastic block model (biSBM) for extracting the communities within a bipartite network, from 2014. Matlab code (via Dan Larremore).

    Citation: D.B. Larremore, A. Clauset and A.Z. Jacobs, "Efficiently inferring community structure in bipartite networks." Phys. Rev. E 90, 012805 (2014).

  5. Weighted stochastic block model (WSBM) for extracting the communities within a weighted network, from 2014. Matlab code.

    Citation: C. Aicher, A.Z. Jacobs and A. Clauset, "Learning latent block structure in weighted networks." Journal of Complex Networks 3(2), 221-248 (2015).

  6. Toolkit for estimating the probability of rare events in heavy-tailed distributions, from 2013. Matlab code.

    Citation: A. Clauset and R. Woodard, "Estimating the historical and future probabilities of large terrorist events." Annals of Applied Statistics 7(4), 1838-1865 (2013).

  7. Toolkit for estimating the rugged shape of the modularity function for a particular network, via simulated annealing and a low-dimensional projection, from 2010. Python code.

    Citation: B.H. Good, Y.-A. de Montjoye and A. Clauset, "The performance of modularity maximization in practical contexts." Physical Review E 81, 046106 (2010).

  8. Toolkit for fitting, testing, and comparing power-law distributions in empirical data, from 2009. Matlab and R code.

    Citation: A. Clauset, C. R. Shalizi and M.E.J. Newman, "Power-law distributions in empirical data." SIAM Review 51(4), 661-703 (2009).

  9. Hierarchical random graphs (HRG) model for extracting hierarchical group structure from networks, from 2008. Can also generate networks with hierarchical structure, and use a fitted model to predict missing links. C/C++ code.

    Citation: A. Clauset, C. Moore and M.E.J. Newman, "Hierarchical structure and the prediction of missing links in networks." Nature 453, 98-101 (2008).

  10. Local community detection algorithm, which works by optimizing local modularity, from 2005. C/C++ code.

    Citation: A. Clauset, "Finding local community structure in networks." Phys. Rev. E 72, 026132 (2005).

  11. Clauset-Newman-Moore (CNM) "fast modularity" community detection algorithm, from 2004. C/C++ code.

    Citation: A. Clauset, M.E.J. Newman and C. Moore, "Finding community structure in very large networks." Phys. Rev. E 70, 066111 (2004).
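
Several entries in the list above (the modularity-function toolkit, local community detection, and the CNM algorithm) revolve around modularity. As a rough illustration of the quantity itself, and not of any of the listed implementations, the following Python sketch computes the standard modularity Q of a given partition of an undirected, unweighted graph. The function name, the input format, and the toy graph are assumptions made for the example.

```python
from collections import defaultdict


def modularity(edges, community):
    """Modularity Q of a partition of an undirected, unweighted graph.

    edges: iterable of (u, v) pairs, each undirected edge listed once.
    community: dict mapping every vertex to a community label.
    Q = sum over communities c of [ e_c / m - (d_c / (2 m))^2 ], where m is
    the number of edges, e_c the number of edges inside c, and d_c the
    total degree of the vertices in c.
    """
    edges = list(edges)
    m = len(edges)
    internal = defaultdict(int)    # community -> edges with both ends inside
    degree_sum = defaultdict(int)  # community -> summed degree of members
    for u, v in edges:
        degree_sum[community[u]] += 1
        degree_sum[community[v]] += 1
        if community[u] == community[v]:
            internal[community[u]] += 1
    return sum(internal[c] / m - (degree_sum[c] / (2 * m)) ** 2
               for c in degree_sum)


# Toy graph (hypothetical): two triangles joined by the single edge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
split = {0: "left", 1: "left", 2: "left", 3: "right", 4: "right", 5: "right"}
q_split = modularity(edges, split)
q_single = modularity(edges, {v: "all" for v in range(6)})
print(q_split, q_single)
```

Splitting the toy graph into its two triangles gives Q = 5/14, while placing every vertex in one community gives Q = 0. Algorithms such as those listed above search for partitions that make this score large rather than evaluating a single fixed partition.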