How to do Machine Learning on Massive Astronomical Datasets: Alexander Gray


 

How to do Machine Learning on Massive Astronomical Datasets

Alexander Gray
Georgia Institute of Technology
Computational Science and Engineering, College of Computing
FASTlab: Fundamental Algorithmic and Statistical Tools Laboratory

 

The FASTlab: Fundamental Algorithmic and Statistical Tools Laboratory

1. Arkadas Ozakin: Research scientist, PhD Theoretical Physics
2. Dong Ryeol Lee: PhD student, CS + Math
3. Ryan Riegel: PhD student, CS + Math
4. Parikshit Ram: PhD student, CS + Math
5. William March: PhD student, Math + CS
6. James Waters: PhD student, Physics + CS
7. Hua Ouyang: PhD student, CS
8. Sooraj Bhat: PhD student, CS
9. Ravi Sastry: PhD student, CS
10. Long Tran: PhD student, CS
11. Michael Holmes: PhD student, CS + Physics (co-supervised)
12. Nikolaos Vasiloglou: PhD student, EE (co-supervised)
13. Wei Guan: PhD student, CS (co-supervised)
14. Nishant Mehta: PhD student, CS (co-supervised)
15. Wee Chin Wong: PhD student, ChemE (co-supervised)
16. Abhimanyu Aditya: MS student, CS
17. Yatin Kanetkar: MS student, CS
18. Praveen Krishnaiah: MS student, CS
19. Devika Karnik: MS student, CS
20. Prasad Jakka: MS student, CS

 

Exponential growth in dataset sizes: Instruments [Science, Szalay & J. Gray, 2001]

Data: CMB Maps
1990 COBE 1,000
2000 Boomerang 10,000
2002 CBI 50,000
2003 WMAP 1 Million
2008 Planck 10 Million

Data: Local Redshift Surveys
1986 CfA 3,500
1996 LCRS 23,000
2003 2dF 250,000
2005 SDSS 800,000

Data: Angular Surveys
1970 Lick 1M
1990 APM 2M
2005 SDSS 200M
2010 LSST 2B

(The original slide plotted these counts on a log scale against year.)

 

1993-1999: DPOSS
1999-2008: SDSS
Coming: Pan-STARRS, LSST

 

Happening everywhere!
• Molecular biology (cancer): microarray chips
• Network traffic (spam): fiber optics, 300M/day
• Simulations (Millennium): microprocessors, 1B
• Particle events (LHC): particle colliders, 1M/sec

 

Astrophysicist:
1. How did galaxies evolve?
2. What was the early universe like?
3. Does dark energy exist?
4. Is our model (GR+inflation) right?

R. Nichol, Inst. Cosmol. Gravitation
A. Connolly, U. Pitt Physics
C. Miller, NOAO
R. Brunner, NCSA
G. Djorgovsky, Caltech
G. Kulkarni, Inst. Cosmol. Gravitation
D. Wake, Inst. Cosmol. Gravitation
R. Scranton, U. Pitt Physics
M. Balogh, U. Waterloo Physics
I. Szapudi, U. Hawaii Inst. Astronomy
G. Richards, Princeton Physics
A. Szalay, Johns Hopkins Physics

Machine learning / statistics guy

 

Astrophysicist:
1. How did galaxies evolve?
2. What was the early universe like?
3. Does dark energy exist?
4. Is our model (GR+inflation) right?

Machine learning / statistics guy:
• Kernel density estimator O(N²)
• n-point spatial statistics O(Nⁿ)
• Nonparametric Bayes classifier O(N²)
• Support vector machine O(N³)
• Nearest-neighbor statistics O(N²)
• Gaussian process regression O(N³)
• Hierarchical clustering O(N³)

But I have 1 million points!

 

The challenge

State-of-the-art statistical methods…
• Best accuracy with fewest assumptions
• Large N (#data), D (#features), M (#models)
…with orders-of-magnitude more efficiency.

Reduce data? Use a simpler model? Approximation with poor/no error bounds? → Poor results.

 

How to do Machine Learning on Massive Astronomical Datasets?

1. Choose the appropriate statistical task and method for the scientific question
2. Use the fastest algorithm and data structure for the statistical method
3. Put it in software

 

10 data analysis problems, and scalable tools we'd like for them

1. Querying (e.g. characterizing a region of space):
   – spherical range-search O(N)
   – orthogonal range-search O(N)
   – k-nearest-neighbors O(N)
   – all-k-nearest-neighbors O(N²)

2. Density estimation (e.g. comparing galaxy types; see the brute-force KDE sketch after this list):
   – mixture of Gaussians
   – kernel density estimation O(N²)
   – L2 density tree [Ram and Gray, in prep]
   – manifold kernel density estimation O(N³) [Ozakin and Gray 2008, to be submitted]
   – hyper-kernel density estimation O(N⁴) [Sastry and Gray 2008, submitted]
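To make the quadratic baseline concrete, here is a minimal brute-force Gaussian KDE sketch in Python/NumPy (the function and variable names are illustrative, not from the talk): every query point touches every data point, which is exactly the O(N²) cost the tree-based estimators attack.

```python
import numpy as np

def kde_brute_force(queries, data, h):
    """Gaussian kernel density estimate at each query point.
    Cost is O(N_queries * N_data) kernel evaluations -- the quadratic
    bottleneck that multi-tree methods are designed to remove."""
    d = data.shape[1]
    # All pairwise squared distances between queries and data points.
    sq = ((queries[:, None, :] - data[None, :, :]) ** 2).sum(axis=-1)
    norm = (2.0 * np.pi * h * h) ** (d / 2.0)
    return np.exp(-sq / (2.0 * h * h)).sum(axis=1) / (len(data) * norm)

data = np.random.rand(2000, 2)          # toy 2-D "sky positions"
print(kde_brute_force(data[:5], data, h=0.1))
```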

 

10 data analysis problems, and scalable tools we'd like for them

3. Regression (e.g. photometric redshifts):
   – linear regression O(D²)
   – kernel regression O(N²)
   – Gaussian process regression/kriging O(N³)

4. Classification (e.g. quasar detection, star-galaxy separation):
   – k-nearest-neighbor classifier O(N²)
   – nonparametric Bayes classifier O(N²)
   – support vector machine (SVM) O(N³)
   – non-negative SVM O(N³) [Guan and Gray, in prep]
   – false-positive-limiting SVM O(N³) [Sastry and Gray, in prep]
   – separation map O(N³) [Vasiloglou, Gray, and Anderson 2008, submitted]

 

10 data analysis problems, and scalable tools we'd like for them

5. Dimension reduction (e.g. galaxy or spectra characterization):
   – principal component analysis O(D²)
   – non-negative matrix factorization
   – kernel PCA O(N³)
   – maximum variance unfolding O(N³)
   – co-occurrence embedding O(N³) [Ozakin and Gray, in prep]
   – rank-based manifolds O(N³) [Ouyang and Gray 2008, ICML]
   – isometric non-negative matrix factorization O(N³) [Vasiloglou, Gray, and Anderson 2008, submitted]

6. Outlier detection (e.g. new object types, data cleaning):
   – by density estimation, by dimension reduction
   – by robust Lp estimation [Ram, Riegel and Gray, in prep]

 

10 data analysis problems, and scalable tools we'd like for them

7. Clustering (e.g. automatic Hubble sequence):
   – by dimension reduction, by density estimation
   – k-means
   – mean-shift segmentation O(N²)
   – hierarchical clustering ("friends-of-friends") O(N³)

8. Time series analysis (e.g. asteroid tracking, variable objects):
   – Kalman filter O(D²)
   – hidden Markov model O(D²)
   – trajectory tracking O(Nⁿ)
   – Markov matrix factorization [Tran, Wong, and Gray 2008, submitted]
   – functional independent component analysis [Mehta and Gray 2008, submitted]

9. Feature selection and causality (e.g. which features predict star/galaxy):
   – LASSO regression
   – L1 SVM
   – Gaussian graphical model inference and structure search
   – discrete graphical model inference and structure search
   – 0-1 feature-selecting SVM [Guan and Gray, in prep]
   – L1 Gaussian graphical model inference and structure search [Tran, Lee, Holmes, and Gray, in prep]

10. 2-sample testing and matching (e.g. cosmological validation, multiple surveys):
   – minimum spanning tree O(N³)
   – n-point correlation O(Nⁿ)
   – bipartite matching/Gaussian graphical model inference O(N³) [Waters and Gray, in prep]

 

How to do Machine Learning on Massive Astronomical Datasets?

1. Choose the appropriate statistical task and method for the scientific question
2. Use the fastest algorithm and data structure for the statistical method
3. Put it in software

 

Core computational problems

What are the basic mathematical operations, or bottleneck subroutines, that we can focus on developing fast algorithms for?

 

Core computational problems
• Aggregations
• Generalized N-body problems
• Graphical model inference
• Linear algebra
• Optimization

 

Core computational problems: aggregations, GNPs, graphical models, linear algebra, optimization

• Querying: nearest-neighbor, sph range-search, ortho range-search, all-nn
• Density estimation: kernel density estimation, mixture of Gaussians
• Regression: linear regression, kernel regression, Gaussian process regression
• Classification: nearest-neighbor classifier, nonparametric Bayes classifier, support vector machine
• Dimension reduction: principal component analysis, non-negative matrix factorization, kernel PCA, maximum variance unfolding
• Outlier detection: by robust L2 estimation, by density estimation, by dimension reduction
• Clustering: k-means, mean-shift, hierarchical clustering ("friends-of-friends"), by dimension reduction, by density estimation
• Time series analysis: Kalman filter, hidden Markov model, trajectory tracking
• Feature selection and causality: LASSO regression, L1 support vector machine, Gaussian graphical models, discrete graphical models
• 2-sample testing and matching: n-point correlation, bipartite matching

 

Aggregations
• How it appears: nearest-neighbor, sph range-search, ortho range-search
• Common methods: locality-sensitive hashing, kd-trees, metric trees, disk-based trees
• Mathematical challenges: high dimensions, provable runtime, distribution-dependent analysis, parallel indexing
• Mathematical topics: computational geometry, randomized algorithms

 

How can we compute this efficiently?

kd-trees: the most widely-used space-partitioning tree [Bentley 1975], [Friedman, Bentley & Finkel 1977], [Moore & Lee 1995]

 

A kd-tree: levels 1 through 6 (a sequence of figures showing the point set split recursively into bounding boxes)
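A minimal sketch of the kd-tree build shown in those figures, in Python/NumPy: split on the widest dimension at the median until nodes are small. The names (KDNode, leaf_size) are ours, not from the talk; real implementations add many refinements.

```python
import numpy as np

class KDNode:
    """One node of a kd-tree: a bounding box plus two children,
    built by splitting the widest dimension at the median."""
    def __init__(self, points, leaf_size=16):
        self.points = points
        self.lo = points.min(axis=0)   # bounding-box corners,
        self.hi = points.max(axis=0)   # used later for pruning
        self.left = self.right = None
        if len(points) > leaf_size:
            d = int(np.argmax(self.hi - self.lo))   # widest dimension
            order = np.argsort(points[:, d])
            mid = len(points) // 2
            self.left = KDNode(points[order[:mid]], leaf_size)
            self.right = KDNode(points[order[mid:]], leaf_size)

root = KDNode(np.random.rand(10000, 2))  # e.g. 10,000 sky positions
```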

 

Range-count recursive algorithm (a sequence of figures stepping through the recursion: a node whose bounding box lies entirely inside the query radius is pruned by inclusion, a node entirely outside is pruned by exclusion, and all other nodes are split and recursed on)

fastest practical algorithm [Bentley 1975]

our algorithms can use any tree
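A sketch of that range-count recursion, assuming the KDNode class (and the root built from it) in the previous sketch: whole nodes are counted or discarded using bounds on the distance from the query to the node's bounding box.

```python
import numpy as np

def range_count(node, q, r):
    """Count points within distance r of query q, pruning whole nodes."""
    # Closest possible distance from q to the node's bounding box.
    nearest = np.linalg.norm(np.maximum(node.lo - q, np.maximum(q - node.hi, 0.0)))
    # Farthest possible distance from q to the box (to its far corner).
    farthest = np.linalg.norm(np.maximum(np.abs(q - node.lo), np.abs(q - node.hi)))
    if nearest > r:                    # Pruned! (exclusion)
        return 0
    if farthest <= r:                  # Pruned! (inclusion)
        return len(node.points)
    if node.left is None:              # leaf: test points individually
        return int((np.linalg.norm(node.points - q, axis=1) <= r).sum())
    return range_count(node.left, q, r) + range_count(node.right, q, r)

print(range_count(root, np.array([0.5, 0.5]), 0.1))
```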

 

Aggregations
• Interesting approach: Cover-trees [Beygelzimer et al 2004]
  – Provable runtime
  – Consistently good performance, even in higher dimensions
• Interesting approach: Learning trees [Cayton et al 2007]
  – Learning data-optimal data structures
  – Improves performance over kd-trees
• Interesting approach: MapReduce [Dean and Ghemawat 2004]
  – Brute-force
  – But makes HPC automatic for a certain problem form
• Interesting approach: approximation in rank [Ram, Ouyang and Gray]
  – Approximate NN in terms of distance conflicts with known theoretical results
  – Is approximation in rank feasible?

 

Generalized N-body Problems
• How it appears: kernel density estimation, mixture of Gaussians, kernel regression, Gaussian process regression, nearest-neighbor classifier, nonparametric Bayes classifier, support vector machine, hierarchical clustering, trajectory tracking, kernel PCA, n-point correlation
• Common methods: FFT, Fast Gauss Transform, Well-Separated Pair Decomposition
• Mathematical challenges: high dimensions, query-dependent relative error guarantee, parallel, beyond pairwise potentials
• Mathematical topics: approximation theory, computational physics, computational geometry

 

Generalized N-body Problems
• Interesting approach: Generalized Fast Multipole Method, aka multi-tree methods [Gray and Moore 2001, NIPS; Riegel, Boyer and Gray]
  – Fastest practical algorithms for the problems to which it has been applied
  – Hard query-dependent relative error bounds
  – Automatic parallelization (THOR: Tree-based Higher-Order Reduce) [Boyer, Riegel and Gray, to be submitted]

 

Characterization of an entire distribution? The 2-point correlation function.

"How many pairs have distance < r?"

$$\sum_{i}^{N} \sum_{j \neq i}^{N} I(\|x_i - x_j\| < r)$$
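For reference, the sum above computed by brute force in NumPy (our illustrative code, O(N²) by construction); the dual-tree algorithm returns the same exact count while pruning whole pairs of nodes.

```python
import numpy as np

def two_point_count(x, r):
    """sum_i sum_{j != i} I(||x_i - x_j|| < r), by brute force."""
    dist = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)
    within = dist < r
    np.fill_diagonal(within, False)    # drop the i == j terms
    return int(within.sum())

x = np.random.rand(1000, 3)
print(two_point_count(x, r=0.1))
```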

The n-point correlation functions
• Spatial inferences: filaments, clusters, voids, homogeneity, isotropy, 2-sample testing, …
• Foundation for the theory of point processes [Daley, Vere-Jones 1972]; unifies spatial statistics [Ripley 1976]
• Used heavily in biostatistics, cosmology, particle physics, statistical physics

2pcf definition:
$$dP = \lambda^2 \,[1 + \xi(r)]\; dV_1\, dV_2$$

3pcf definition:
$$dP = \lambda^3 \,[1 + \xi(r_{12}) + \xi(r_{23}) + \xi(r_{13}) + \zeta(r_{12}, r_{23}, r_{13})]\; dV_1\, dV_2\, dV_3$$

Standard model: n>0 terms should be zero!

(Figure: a triangle of points with side lengths r₁, r₂, r₃.)

3-point correlation

"How many triples have pairwise distances < r?"

$$\sum_{i}^{N} \sum_{j \neq i}^{N} \sum_{k \neq i,j}^{N} I(\delta_{ij} < r_1)\; I(\delta_{jk} < r_2)\; I(\delta_{ki} < r_3)$$

How can we count n-tuples efficiently?

"How many triples have pairwise distances < r?"

Use n trees! [Gray & Moore, NIPS 2000]

 

"How many valid triangles a-b-c (where a ∈ A, b ∈ B, c ∈ C) could there be?"

Split one node at a time and recurse:

count{A,B,C} = count{A,B,C.left} + count{A,B,C.right}

Exclusion: if some pair of nodes is too far apart for any cross-pair to lie within r, then count{A,B,C} = 0.

Inclusion: if every pair of nodes is close enough that all cross-pairs lie within r, then count{A,B,C} = |A| × |B| × |C|.
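A minimal sketch of this inclusion/exclusion recursion for triples, assuming the KDNode class from the kd-tree sketch and a single threshold r. It counts ordered triples drawn from three (possibly identical) nodes and omits the i ≠ j ≠ k bookkeeping of the exact statistic; it is meant to show the pruning logic, not to be the production algorithm.

```python
import numpy as np

def min_dist(a, b):
    """Lower bound on the distance between any a-point and any b-point."""
    return np.linalg.norm(np.maximum(a.lo - b.hi, np.maximum(b.lo - a.hi, 0.0)))

def max_dist(a, b):
    """Upper bound on the distance between any a-point and any b-point."""
    return np.linalg.norm(np.maximum(a.hi - b.lo, b.hi - a.lo))

def count_triples(A, B, C, r):
    pairs = [(A, B), (B, C), (A, C)]
    if any(min_dist(x, y) >= r for x, y in pairs):
        return 0                                   # Exclusion: count is 0
    if all(max_dist(x, y) < r for x, y in pairs):
        return len(A.points) * len(B.points) * len(C.points)   # Inclusion
    internal = [n for n in (A, B, C) if n.left is not None]
    if not internal:                               # all leaves: brute force
        return sum(1 for a in A.points for b in B.points for c in C.points
                   if np.linalg.norm(a - b) < r
                   and np.linalg.norm(b - c) < r
                   and np.linalg.norm(c - a) < r)
    n = max(internal, key=lambda n: len(n.points)) # split the largest node
    if n is A:
        return count_triples(A.left, B, C, r) + count_triples(A.right, B, C, r)
    if n is B:
        return count_triples(A, B.left, C, r) + count_triples(A, B.right, C, r)
    return count_triples(A, B, C.left, r) + count_triples(A, B, C.right, r)

print(count_triples(root, root, root, 0.05))
```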

 

3-point runtime (biggest previous: 20K)

• n=2: O(N)
• n=3: O(N^{log 3})
• n=4: O(N²)

VIRGO simulation data, N = 75,000,000
naïve: 5×10⁹ sec. (~150 years); multi-tree: 55 sec. (exact)

Generalized N-body Problems
• Interesting approach (for n-point): n-tree algorithms [Gray and Moore 2001, NIPS; Moore et al. 2001, Mining the Sky]
  – First efficient exact algorithm for n-point correlations
• Interesting approach (for n-point): Monte Carlo n-tree [Waters, Riegel and Gray]
  – Orders of magnitude faster

 

Generalized N-body Problems
• Interesting approach (for EMST): dual-tree Boruvka algorithm [March and Gray]
  – Note this is a cubic problem
• Interesting approach (N-body decision problems): dual-tree bounding with hybrid tree expansion [Liu, Moore, and Gray 2004; Gray and Riegel 2004, CompStat; Riegel and Gray 2007, SDM]
  – An exact classification algorithm

 

Generalized N-body Problems
• Interesting approach (Gaussian kernel): dual-tree with multipole/Hermite expansions [Lee, Gray and Moore 2005, NIPS; Lee and Gray 2006, UAI]
  – Ultra-accurate fast kernel summations
• Interesting approach (arbitrary kernel): automatic derivation of hierarchical series expansions [Lee and Gray]
  – For a large class of kernel functions

 

Generalized N-body Problems
• Interesting approach (summative forms): multi-scale Monte Carlo [Holmes, Gray, Isbell 2006, NIPS; Holmes, Gray, Isbell 2007, UAI]
  – Very fast bandwidth learning
• Interesting approach (summative forms): Monte Carlo multipole methods [Lee and Gray 2008, NIPS]
  – Uses SVD tree

 

Generalized N-body Problems
• Interesting approach (for multi-body potentials in physics): higher-order multipole methods [Lee, Waters, Ozakin, and Gray, et al.]
  – First fast algorithm for higher-order potentials
• Interesting approach (for quantum-level simulation): 4-body treatment of Hartree-Fock [March and Gray, et al.]

 

Graphical model inference
• How it appears: hidden Markov models, bipartite matching, Gaussian and discrete graphical models
• Common methods: belief propagation, expectation propagation
• Mathematical challenges: large cliques, upper and lower bounds, graphs with loops, parallel
• Mathematical topics: variational methods, statistical physics, turbo codes

 

Graphical model inference
• Interesting method (for discrete models): survey propagation [Mezard et al 2002]
  – Good results for combinatorial optimization
  – Based on statistical physics ideas
• Interesting method (for discrete models): expectation propagation [Minka 2001]
  – Variational method based on a moment-matching idea
• Interesting method (for Gaussian models): Lp structure search, solve a linear system for inference [Tran, Lee, Holmes, and Gray]
(A minimal belief-propagation sketch on a chain follows below.)

 

Linear algebra
• How it appears: linear regression, Gaussian process regression, PCA, kernel PCA, Kalman filter
• Common methods: QR, Krylov, …
• Mathematical challenges: numerical stability, sparsity preservation, …
• Mathematical topics: linear algebra, randomized algorithms, Monte Carlo

Linear algebra
• Interesting method (for probably-approximate k-rank SVD): Monte Carlo k-rank SVD [Frieze, Drineas, et al. 1998-2008]
  – Sample either columns or rows, from the squared-length distribution (see the sketch below)
  – For rank-k matrix approximation; must know k
• Interesting method (for probably-approximate full SVD): QUIC-SVD [Holmes, Gray, Isbell 2008, NIPS]; QUIK-SVD [Holmes and Gray]
  – Sample using cosine trees and stratification
  – Builds the tree as needed
  – Full SVD: automatically sets rank based on desired error
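A sketch of the squared-length column-sampling idea behind the Monte Carlo k-rank SVD, in the Frieze/Drineas style; the function name and rescaling convention are our illustration. QUIC-SVD replaces this fixed sample with cosine-tree-guided, error-driven sampling.

```python
import numpy as np

def ls_sample_svd(A, c, k, rng=np.random.default_rng(0)):
    """Approximate the top-k left singular vectors of A by sampling
    c columns with probability proportional to squared column length."""
    norms = (A * A).sum(axis=0)
    p = norms / norms.sum()
    idx = rng.choice(A.shape[1], size=c, p=p)
    # Rescale sampled columns so C @ C.T is an unbiased estimate of A @ A.T.
    C = A[:, idx] / np.sqrt(c * p[idx])
    U, s, _ = np.linalg.svd(C, full_matrices=False)
    return U[:, :k], s[:k]      # approximate singular vectors/values

A = np.random.rand(500, 2000)
U, s = ls_sample_svd(A, c=200, k=10)
```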

 

QUIC-SVD speedup
• 38 days → 1.4 hrs, 10% rel. error
• 40 days → 2 min, 10% rel. error

 

Optimization
• How it appears: support vector machine, maximum variance unfolding, robust L2 estimation
• Common methods: interior point, Newton's method
• Mathematical challenges: ML-specific objective functions, large number of variables/constraints, global optimization, parallel
• Mathematical topics: optimization theory, linear algebra, convex analysis

Optimization
• Interesting method: Sequential minimal optimization (SMO) [Platt 1999]
  – Much more efficient than interior-point, for SVM QPs
• Interesting method: Stochastic quasi-Newton [Schraudolph 2007]
  – Does not require a scan of the entire data
• Interesting method: Sub-gradient methods [Vishwanathan and Smola 2006]
  – Handles kinks in regularized risk functionals (see the sketch below)
• Interesting method: Approximate inverse preconditioning using QUIC-SVD for energy minimization and interior-point [March, Vasiloglou, Holmes, Gray]
  – Could potentially treat a large number of optimization problems

 

Now fast!
Legend: very fast | as fast as possible (conjecture)

• Querying: nearest-neighbor, sph range-search, ortho range-search, all-nn
• Density estimation: kernel density estimation, mixture of Gaussians
• Regression: linear regression, kernel regression, Gaussian process regression
• Classification: nearest-neighbor classifier, nonparametric Bayes classifier, support vector machine
• Dimension reduction: principal component analysis, non-negative matrix factorization, kernel PCA, maximum variance unfolding
• Outlier detection: by robust L2 estimation
• Clustering: k-means, mean-shift, hierarchical clustering ("friends-of-friends")
• Time series analysis: Kalman filter, hidden Markov model, trajectory tracking
• Feature selection and causality: LASSO regression, L1 support vector machine, Gaussian graphical models, discrete graphical models
• 2-sample testing and matching: n-point correlation, bipartite matching

 

Astronomical applications
• All-k-nearest-neighbors: O(N²) → O(N), exact. Used in [Budavari et al., in prep]
• Kernel density estimation: O(N²) → O(N), rel. err. Used in [Balogh et al. 2004]
• Nonparametric Bayes classifier (KDA): O(N²) → O(N), exact. Used in [Richards et al. 2004, 2009], [Scranton et al. 2005]
• n-point correlations: O(Nⁿ) → O(N^{log n}), exact. Used in [Wake et al. 2004], [Giannantonio et al. 2006], [Kulkarni et al. 2007]

 

Astronomical highlights
• Dark energy evidence, Science 2003, Top Scientific Breakthrough of the Year (n-point)
  – 2007: biggest 3-point calculation ever
• Cosmic magnification verification, Nature 2005 (nonparametric Bayes classifier)
  – 2008: largest quasar catalog ever

 

A few others to note…
Legend: very fast | as fast as possible (conjecture)

• Querying: nearest-neighbor, sph range-search, ortho range-search, all-nn
• Density estimation: kernel density estimation, mixture of Gaussians
• Regression: linear regression, kernel regression, Gaussian process regression
• Classification: nearest-neighbor classifier, nonparametric Bayes classifier, support vector machine
• Dimension reduction: principal component analysis, non-negative matrix factorization, kernel PCA, maximum variance unfolding
• Outlier detection: by robust L2 estimation
• Clustering: k-means, mean-shift, hierarchical clustering ("friends-of-friends")
• Time series analysis: Kalman filter, hidden Markov model, trajectory tracking
• Feature selection and causality: LASSO regression, L1 support vector machine, Gaussian graphical models, discrete graphical models
• 2-sample testing and matching: n-point correlation, bipartite matching

 

How to do Machine Learning on Massive Astronomical Datasets?

1. Choose the appropriate statistical task and method for the scientific question
2. Use the fastest algorithm and data structure for the statistical method
3. Put it in software

 

Keep in mind the machine
• Memory hierarchy: cache, RAM, out-of-core
• Dataset bigger than one machine: parallel/distributed
• Everything is becoming multicore
• Cloud computing: software as a service

Keep in mind the overall system
• Databases can be more useful than ASCII files (e.g. CAS)
• Workflows can be more useful than brittle perl scripts
• Visual analytics connects visualization/HCI with data analysis (e.g. In-SPIRE)

Our upcoming products
• MLPACK: "the LAPACK of machine learning" – Dec. 2008 [FASTlab]
• THOR: "the MapReduce of Generalized N-body Problems" – Apr. 2009 [Boyer, Riegel, Gray]
• CAS Analytics: fast data analysis in CAS (SQL Server) – Apr. 2009 [Riegel, Aditya, Krishnaiah, Jakka, Karnik, Gray]
• LogicBlox: all-in-one business intelligence [Kanetkar, Riegel, Gray]

Keep in mind the software complexity
• Automatic code generation (e.g. MapReduce)
• Automatic tuning (e.g. OSKI)
• Automatic algorithm derivation (e.g. SPIRAL, AutoBayes) [Gray et al. 2004; Bhat, Riegel, Gray, Agarwal]

The end
• We have/will have fast algorithms for most data analysis methods in MLPACK
• Many opportunities for applied math and computer science in large-scale data analysis
• Caveat: must treat the right problem
• Computational astronomy workshop and large-scale data analysis workshop coming soon
