Rutgers University Department of Physics and Astronomy

Special Topics Seminar on Data Science in Astrophysics
Physics 689, Fall 2016

This is a graduate-level seminar on astrophysical data analysis. Class time will be split between lecture introducing key concepts and hacks using the AstroML Python package provided by the textbook authors.

Instructor: Prof. Eric Gawiser, Serin 303W, 848-445-8874
Seminar: Wednesday, 12-3PM
Location: Serin 401, Busch Campus
Office Hours: email to arrange a meeting - or just stop by

Textbook: "Statistics, Data Mining, and Machine Learning in Astronomy: A Practical Python Guide for the Analysis of Survey Data" by Zeljko Ivezic, Andrew J. Connolly, Jacob T. VanderPlas & Alexander Gray, ISBN 978-0-691-15168-7
I will lecture on roughly one textbook chapter per week (see Syllabus below).
(Astro)physicists are welcome to sit in on any seminar whose topics appear of interest and to participate in the hacks and discussions. The two Friday dates below (Dec 2 and Dec 9) refer to 10:20-11:40AM meetings (still in room 401); all other dates are Wednesdays. For textbook chapters, #A refers to the first half of Chapter # and #B refers to the second half.

Lecture | Date | Topic | Text Chapter | Hacks & Discussions
1 | Sep 7 | Intro | 1, Appendices | Syllabus preferences and term project design
 | | | | How to determine if a set of points lies above a correlation
2 | Sep 14 | Algorithms and computational efficiency | 2 | Input both versions of the SDSS Stripe 82 standard star catalog; reproduce textbook Fig. 1.6 for both versions of that catalog; note and explain the differences between the versions
 | | | | Produce Figs. 1.9 and 1.10; tune the contours/colorbars to best represent the data; can you develop an even better visualization of these data?
3 | Sep 21 | Review of probability and statistics | 3 | Choose your favorite type of tree and use it to find the nearest neighbor to the star at the lowest-right location in Fig. 1.6; compare the run-time to brute force
 | | | | Use Bayes' rule to solve the N=3 version of the Monty Hall problem for rules of Type I (host never reveals a car) and Type II (host chooses randomly which door to open); perform a Monte Carlo simulation to check your derivation for each type of game
4 | Sep 28 | Frequentist vs. Bayesian approaches | 4A | Choose a panel of Fig. 3.23; fit a bivariate Gaussian distribution; plot the residuals perpendicular to the major axis of the ellipse; find a distribution that provides a decent fit to these residuals and report its best-fit parameters
5 | Oct 5 | Classical statistical inference | 4B | Term project pitches
 | | | | Reproduce one of the panels of Fig. 3.24; determine how many resamplings are needed to see the difference between the no-outlier and outlier cases; now use 10% outliers - how many resamplings are needed?
6 | Oct 12 | Bayesian statistical inference | 5A | Discussion of term project roles for advisors and students
 | | | | Figure out what went wrong for \sigma_G^* in the left panel of Fig. 4.4; what could be done to avoid this? Produce the right panel of Fig. 4.4, and see if you can improve the \sigma_G behavior.
7 | Oct 19 | Markov Chain Monte Carlo | 5B | Reproduce Fig. 5.6; simplify the error model to one you think is realistic for astronomical data; describe how the distribution and Gaussian fits change; can you develop an error model that makes the two Gaussian fits agree well?
8 | Oct 26 | Density estimation | 6 | Reproduce Fig. 5.26; note if your MCMC parameter contours match precisely, and explain why or why not; now conduct a test for convergence
9 | Nov 2 | Dimensional Reduction | 7A | Mid-course evaluations
 | | | | Ch. 6 features visualizations of the SDSS Great Wall in Figs. 6.3, 6.4, 6.7, 6.15; can you improve on these? Options include: Epanechnikov & cosine kernels, varying h, varying K, Kth-nearest vs. all-K-nearest neighbors, color vs. B/W display
10 | Nov 9 ("Panic Day") | Data Mining | 7B | Reproduce the right panel of Fig. 6.17. How do run-time and performance vary if you switch from L-S estimator (6.45) to naive estimator (6.44)? How about if you sub-sample the galaxies? Do the error bars change as you'd expect if you increase the number of bootstrap samples?
11 | Nov 16 | Regression | 8A | Use the SDSS spectra of Fig. 7.1 to reproduce Fig. 7.4. Now pick your favorite method (column) and see how the results change as you toggle each of normalization, whitening, and mean-subtraction (where possible). Can you do anything to make the results more physically meaningful?
12 | Nov 30 | Model Fitting | 8B, Hogg, Bovy & Lang 2010 | Determine if the LAEs from Vargas+14 and Hagen+14 lie above the z=2 Star Formation Rate-Stellar Mass correlation reported by Kurczynski+16
13.1 | Dec 2 | Classification | 9A | Plot the simulated supernova data of Figure 8.2. Choose a regression method for which cross-validation is applicable. Use cross-validation to optimize the "hyperparameters" of the method. To the extent possible, predict the distribution of future data.
13.2 | Dec 9 | Classification (continued) | 9B | Presentations I
14 | Dec 14 | Time series | 10 | Presentations II
 | | | | Discussion of modifying this course for future offerings
 | | | | Several figures in Chapter 9 (starting with Fig. 9.3) illustrate the multi-color classification of RR Lyrae vs. main sequence stars. Develop a rough-but-realistic figure-of-merit for the S/N of some measurement that includes terms for contamination and incompleteness. Use this to select a preferred classification method for RR Lyrae vs. main sequence stars, and use training and cross-validation as appropriate to predict the value of the figure-of-merit that will be achieved with similar future data.
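
As an illustration of the Monte Carlo check requested in the Lecture 3 hack (Monty Hall, N=3), here is a minimal sketch, not an official solution. The function name simulate_monty_hall and its arguments are hypothetical, and it assumes that Type II games in which the host accidentally reveals the car are discarded, so the result is conditioned on a goat being shown. Under that assumption, Bayes' rule gives a switching win probability of 2/3 for Type I and 1/2 for Type II, which the simulation should approach.

```python
import numpy as np


def simulate_monty_hall(n_games, host_knows, seed=0):
    """Fraction of valid N=3 games that are won by switching doors.

    host_knows=True  -> Type I: the host never reveals the car.
    host_knows=False -> Type II: the host opens one of the two unpicked
                        doors at random; games where the car is revealed
                        are discarded (an assumption of this sketch).
    """
    rng = np.random.default_rng(seed)
    car = rng.integers(0, 3, size=n_games)   # door hiding the car
    pick = rng.integers(0, 3, size=n_games)  # contestant's first choice

    # The two doors the contestant did not pick.
    other_a = (pick + 1) % 3
    other_b = (pick + 2) % 3

    # A fair coin selects which of those two doors a random host would open.
    coin = rng.integers(0, 2, size=n_games).astype(bool)
    random_open = np.where(coin, other_a, other_b)

    if host_knows:
        # If the random choice would expose the car, open the other door.
        alternative = np.where(coin, other_b, other_a)
        opened = np.where(random_open == car, alternative, random_open)
    else:
        opened = random_open

    valid = opened != car            # games in which a goat was revealed
    switched = 3 - pick - opened     # door labels 0, 1, 2 sum to 3
    wins = (switched == car) & valid
    return wins.sum() / valid.sum()


if __name__ == "__main__":
    for label, knows in [("Type I", True), ("Type II", False)]:
        frac = simulate_monty_hall(200_000, host_knows=knows)
        print(f"{label}: switching wins {frac:.3f} of valid games")
```

Comparing the two printed fractions against your Bayes' rule derivation is the point of the exercise; changing the seed or n_games gives a feel for the Monte Carlo scatter.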

Term Projects
Each registered student will participate in two term projects, one as the "advisor" and the other as the "student". Advisors will consult the instructor on their choice of term project by September 30 and offer a brief (2 minute) pitch for their projects in class on October 5. Students will note their top several project choices, and the instructor will match students with advisors on this basis. (All students were assigned to one of their top three project choices.) Advisors and students are encouraged to meet weekly at the beginning and to then arrive at a mutually agreeable meeting schedule thereafter. Term projects are expected to require a solid week of work (40-60 hours of effort) from the student spread over the semester; our discussion determined that it would be reasonable to split the total credit for each project roughly 2:1 between student and advisor.
Student Advisor Topic
Jack Prasiddha Supernova remnant spectral analysis
Kartheik Elaad Star cluster mass functions vs. the IMF
Elaad Conan Astronomical image compression
Kyle Humna Correlation function speed-up
Prasiddha Catie Which galaxies are in the cluster?
Conan Jack "Simple" microlensing model
Sourabh Kyle SN Ia parameter correlations
Humna Kartheik Gaussian process SED fitting
Catie Sourabh Strong gravitational lensing

Final Presentation
Final presentations will occupy the last two seminars, December 9 (Jack, Sourabh, Kartheik, Humna) and December 14 (Prasiddha, Conan, Elaad, Kyle, Catie). Final presentation drafts (in PowerPoint, Keynote, or PDF) are due to me at least two days before the presentation so that I can give feedback.

Students will be graded on a uniform weighting of term project achievement, term project presentation quality, term project advising effort, and class participation. Note that this awards roughly 2/3 of the credit for each term project to the student and the other 1/3 to the advisor, but this is not zero-sum since the term project itself could be evaluated as great or mediocre.
Auditors are expected to participate in class discussions but are not allowed to do a term project without special permission.

Students with Disabilities
Information is available here.

Last revised October 19, 2016