Inference, Models and Simulation for Complex Systems
CSCI 7000/4830, Fall 2011
Time: Tuesday and Thursday, 11:00am - 12:15pm
Room: ECCR 131
Instructor: Aaron Clauset
Office: ECOT 743
Office hours: Tuesday, 1:30-3:00pm
Email: look it up
Syllabus
Description
Course work and grading
Schedule and lecture notes
Problem sets
Supplemental readings
Description
Computers are fundamentally changing how science is done by allowing us to
record enormous amounts of data on social, biological, technological and
physical systems. This data glut should allow us both to address old questions
about complex systems in new ways and to answer fundamentally new kinds of
questions. But, our ability to do this relies on the development of new
algorithms to automatically extract information about patterns in these huge
data sets and to reliably test complex scientific hypotheses. Examples include
understanding the evolution of the Internet (at various levels), social
behavior on the Web (Facebook, Wikipedia, Twitter, etc.), the role of networks
in structuring complex social behaviors and biological functions, the
emergence of population-scale patterns from seemingly random individual-level
behavior, and forecasting rare but important events in the future.
This graduate-level topics course will cover a selection of recent
developments in computational approaches to doing science with complex
systems. It is not a scientific computing course. Topics will include
statistical inference, the structure of complex networks, macro-phenomena in
biological evolution and in wars and terrorism, simple mathematical models,
and simulation techniques for more complicated models. The focus will be on
using computational tools (algorithms) to do science (work with data; test
hypotheses; build understanding; make predictions).
Prerequisites
Required courses: CSCI 3104 (undergraduate algorithms), two semesters of
calculus and basic statistics and probability theory
Recommended courses: CSCI 5454 (graduate algorithms), APPM 3050
(Scientific Computing in Matlab), APPM 3570 (Applied Probability) or APPM 4570
(Statistical Methods)
Note: No biological or social science background is necessary. A quantitative
background is mandatory. The concepts and techniques covered in this course
depend heavily on basic statistics, scientific programming (Matlab and
Mathematica) and calculus (integration and differentiation). Students without
sufficient preparation in these areas will struggle to keep up with the
lectures and especially with the assignments, each of which has a mathematical
and a programming component. Students concerned about their preparation are
encouraged to audit the course.
Text
Required (available at the CU Bookstore):
1. Networks: An Introduction by M.E.J. Newman
2. The Sciences of the Artificial (3rd ed.) by H.A. Simon
Optional:
3. All of Statistics by L. Wasserman
4. Numerical Recipes are recommended.
the first two of these are being held on reserve for this course at the CU Library.
Course work and grading
Attendance to the lectures is required.
The first half of the course will be lecture and problem-set driven. The
second half of the class will revolve around crowd-source (student) lectures
on advanced topics and a small independent research project. The lecture and
project are the major deliverables for the class, and students are expected to
spend considerable time outside of class preparing them. There are no written
examinations in this class; the lecture and project should be viewed as an
oral and a final examination. Many lectures will have an accompanying reading
assignment from the required texts or a reading from the primary literature.
I'll post the latter readings on the class website. Additional detail on the
crowd-sourced lectures is given in the syllabus.
Problem sets: The four problem sets will each have some mathematical
and some programming problems. Programming problems can be done in any
reasonable imperative language, but I highly recommend using
Matlab, or something
similar, which has data analysis and visualization capabilities built-in.
Familiarity with
Mathematica will be useful for many of the mathematical problems.
Problem sets will be due roughly two weeks after they are assigned (schedule
given below). Solutions must be in PDF format (e.g., typeset using
LaTeX), should include all
necessary details for me to follow the logic, and must be submitted via email
by 11:59pm the day they are due. No late assignments will be accepted.
Collaboration is allowed on the problem sets, but you may not copy (in any
way) from your collaborators and you must respect University academic policies
at all times. To be clear: you may discuss the problems verbally, but you must
write up your solutions separately. If you do discuss the problems with
someone (and you are encouraged to!), you must then list and describe the
extent of your collaboration in your solutions (a footnote is fine). Copying
from any source, in any way, including the Web but especially from another
student (past or present), is strictly forbidden. If you are unsure about
whether something is permitted under these rules, ask me well before the
deadline.
Crowd-source lecture: Students will self-organize into teams of two.
Each team will then give a full-length (75 minute) technical lecture that (i)
frames a scientific question and motivates an algorithm or model to address
it, (ii) describes the algorithm or model in appropriate technical detail (at
least at the pseudocode level), (iii) describes its performance or results,
and (iv) describes interesting extensions, open question, problems, etc.
Students should notify me via email of their team composition by September 13.
Each team should then notify me via email by September 15 of their preferences
on lecture topics in the form of a complete ranking of the topics listed
below. I will match teams to topics using a matching algorithm. (The number of
topics and the precise schedule may change slightly during the semester
depending on the number of students in the class, etc.)
Independent project: The assignment here is to develop and complete a small research-style project based around the tools and concepts developed in the course. This independent project should focus on either inference, modeling or simulation. Students will work independently, but are strongly encouraged to get feedback from each other on their ideas. The deliverables here are a short (20 minute) in-class presentation of their results and a 10-page writeup due via email to me by 11:00am on December 15.
A short project proposal is due via email on October 13. The proposal should
be two paragraphs: one explaining any background material, including necessary
references to the scientific literature, and a second describing what you're
going to do. When choosing a topic, students should be cognizant of the need
to make reasonable progress in the 9 weeks remaining in the semester. To help
facilitate this, students are strongly encouraged to meet with me outside of
class prior to October 13 to discuss potential project topics.
Grading Grades will be assigned based on a 20% attendance, 30% problem
sets, 20% lecture and 30% project division.
Undergraduates taking CSCI 4830: Instead of doing both the lecture and
the project, undergraduate students may choose to do one or the other. Grades
will be assigned based on a 30% attendance, 30% problem sets, and 40%
lecture/project division.
Tentative Schedule
Week 1: Overview, probability distributions, likelihood functions, the Poisson process (Lecture 0 and Lecture 1)
Week 2: Power-law distributions and power-law tails (Lecture 2 and Lecture 3)
Week 3: Model plausibility, model comparison, and two fallacies (Lecture 4 and Lecture 5)
Week 4: Time series analysis, random walks (Lecture 6 and Lecture 7)
Week 5: Random walk models of macroevolution (Lectures 8 and 9)
Week 6: Probabilistic models of terrorism and wars (Lecture 10)
Week 7: Measures of network structure, algorithms (Lecture 12a and Lecture 12b)
Week 8: Random graph models, citation networks (Lecture 13 and Lecture 14)
Week 9: Modular and hierarchical structure, missing information, etc. (Lecture 15 and Lecture 16)
Week 10: (CSL 1) The bootstrap and (2) duplication-mutation models
Week 11: (3) Privacy and de-anonymizing human data and (4) Levy flights
Week 12: (5) Regression and (6) metabolic networks
Week 13: (7) Gaussian mixture models and (8) People are not? particles
Week 14: Fall break
Week 15: Project presentations
Week 16: Project presentations
Problem Sets
Problem set 1 (assigned August 23; due September 13)
-
data set 1A:
the frequencies of occurrence of US family names in the 1990 US Census.
(From A. Clauset, C.R. Shalizi and M.E.J. Newman (2009) SIAM Review 51, 661.) -
data set 1B:
the intensity of wars from 1816-1980 measured as the number of battle deaths per
10,000 of the combined populations of the warring nations.
(From M. Small and J.D. Singer (1982) Sage Publications). -
data set 1C:
the aggregate net worth in US dollars of the richest individuals in the United States
in October 2003.
(From M.E.J. Newman (2005) Contemporary Physics 46, 323). -
data set 1D:
the numbers of adherents of religious denominations, bodies, and sects, as compiled and
published on the web site adherents.com.
(From A. Clauset, C.R. Shalizi and M.E.J. Newman (2009) SIAM Review 51, 661.)
Problem set 2 (assigned September 16; due September 27)
- data set 2A
- data set 2B
- data set 2C
- data set 2D: DJIA time series from 1928-2011
Problem set 3 (assigned September 30; due October 11)
Problem set 4 (assigned October 11; due October 25)
Week 1:
- L. Breiman, "Statistical Modeling: The Two Cultures." Statistical Science 16, 199-231 (2001).
- R.D. Malmgren, D.B. Stouffer, A.E. Motter and L.A.N. Amaral, "A Poissonian explanation for heavy tails in e-mail communication." PNAS 105, 18153-18158 (2008).
Week 2:
- M.E.J. Newman, "Power laws, Pareto distributions and Zipf's law." Contemporary Physics 46(5), 323-351 (2005).
- M. Mitzenmacher, "A Brief History of Generative Models for Power Law and Lognormal Distributions." Internet Mathematics 1(2), 226-251 (2004).
- A. Clauset, C.R. Shalizi and M.E.J. Newman, "Power-law distributions in empirical data." SIAM Review 51(4), 661-703 (2009).
- S. Goel, A. Broder, E. Gabrilovich and B. Pang, "Anatomy of the Long Tail: Ordinary People with Extraordinary Tastes." Proc. 3rd ACM Workshop on Web Search and Data Mining, New York City (2010).
Week 3:
- A. Gelman and C.R. Shalizi, "Philosophy and the practice of Bayesian statistics." Preprint (2010).
- D.G. Mayo, Error and the Growth of Experimental Knowledge. Chicago Press (1996).
Week 4:
- S. Redner, "Random multiplicative processes: An elementary tutorial." American Journal of Physics 58(3), 267-273 (1990).
- A. Gabel and S. Redner, "Random Walk Picture of Basketball ." arxiv:1109.2825 (2011).
Week 5:
- J. Alroy, "Understanding the dynamics of trends within evolving lineages." Paleobiology 26(3), 319-329 (2000).
- G. Hunt, "Evolution in Fossil Lineages: Paleontology and The Origin of Species." The American Naturalist 176, Supplement S61-S75 (2010).
- A. Clauset and D.H. Erwin, "The Evolution and Distribution of Species Body Size." Science 321, 399-401 (2008).
Week 6:
- A. Clauset, M. Young and K.S. Gleditsch, "On the Frequency of Severe Terrorist Events." Journal of Conflict Resolution 51, 58-87 (2007).
- A. Clauset and K.S. Gleditsch, "The developmental dynamics of terrorist organizations." Preprint, arxiv:0906.3287 (2011).
Weeks 7 and 8:
- M.E.J. Newman, "The structure and function of complex networks." SIAM Review 45, 167-256 (2003).
- S.N. Dorogovtsev and J.F.F. Mendes, "Evolution of networks." Adv. Phys. 51, 1079 (2002).
- M.E.J. Newman, S.H. Strogatz and D.J. Watts, "Random graphs with arbitrary degree distributions and their applications." Physical Review E 64, 026118 (2001).
- D.S. Callaway, J.E. Hopcroft, J.M. Kleinberg, M.E.J. Newman and S.H. Strogatz, "Are randomly grown graphs really random?." Physical Review E 64, 041902 (2001).
- M.E.J. Newman, "The first-mover advantage in scientific publication." European Physics Letters 86, 68001 (2009).
- S. Redner, "Citation statistics from 110 years of Physical Review." Physics Today 58, 49-54 (2005).
Resources
LaTeX (general) and TeXShop (Mac)
Matlab license for CU staff (includes student employees)
Mathematica license for CU students
NumPy/SciPy libraries for Python (similar to Matlab)
GNU Octave (similar to Matlab)
Wolfram Alpha (Web interface for simple integration and differentiation)
Machine Learning, Statistical Inference and Induction Notebook (by Cosma Shalizi)
Power Law distributions, etc. Notebook (by Cosma Shalizi)
Some Advice on Process for
[Research Projects]
Cytoscape, network
visualization software
yEd Graph Editor, network
visualization software
Graphviz,
network visualization software