

NEURAL NETWORKS
Lecturer: Primož Potočnik University of Ljubljana Faculty of Mechanical Engineering Laboratory of Synergetics www.neural.si [email protected] +386-1-4771-167

© 2012 Primož Potočnik

NEURAL NETWORKS (0) Organization of the Study

#1

TABLE OF CONTENTS
0. Organization of the Study
1. Introduction to Neural Networks
2. Neuron Model – Network Architectures – Learning
3. Perceptrons and Linear Filters
4. Backpropagation
5. Dynamic Networks
6. Radial Basis Function Networks
7. Self-Organizing Maps
8. Practical Considerations


0. Organization of the Study
0.1 Objectives of the study
0.2 Teaching methods
0.3 Assessment
0.4 Lecture plan
0.5 Books
0.6 SLO books
0.7 E-Books
0.8 Online resources
0.9 Simulations
0.10 Homeworks

1. Objectives of the study
• Objectives
– Introduce the principles and methods of neural networks (NN)
– Present the principal NN models
– Demonstrate the process of applying NN

• Learning outcomes
– Understand the concept of nonparametric modelling by NN
– Explain the most common NN architectures
  • Feedforward networks
  • Dynamic networks
  • Radial Basis Function Networks
  • Self-organized networks

– Develop the ability to construct NN for solving real-world problems
• Design a proper NN architecture
• Achieve good training and generalization performance
• Implement a neural network solution


2. Teaching methods
• Teaching methods:
1. Lectures: 4 hours weekly, classical & practical (MATLAB)
   • Tuesday 9:15 - 10:45
   • Friday 9:15 - 10:45
2. Homeworks: home projects
3. Consultations with the lecturer

• Organization of the study
– Nov – Dec: lectures
– Jan: homework presentations
– Jan: exam

• Location
– Institute for Sustainable Innovative Technologies (Pot za Brdom 104, Ljubljana)


3. Assessment
• ECTS credits:
– EURHEO (II): 6 ECTS

• Final mark:
– Homework: 50% of final mark
– Written exam: 50% of final mark

• Important dates
– Homework presentations: Tue, 8 Jan 2013 and Fri, 11 Jan 2013
– Written exam: Fri, 18 Jan 2013


4. Lecture plan (1/5)
1. Introduction to Neural Networks
1.1 What is a neural network?
1.2 Biological neural networks
1.3 Human nervous system
1.4 Artificial neural networks
1.5 Benefits of neural networks
1.6 Brief history of neural networks
1.7 Applications of neural networks

2. Neuron Model, Network Architectures and Learning
2.1 Neuron model
2.2 Activation functions
2.3 Network architectures
2.4 Learning algorithms
2.5 Learning paradigms
2.6 Learning tasks
2.7 Knowledge representation
2.8 Neural networks vs. statistical methods


4. Lecture plan (2/5)
3. Perceptrons and Linear Filters
3.1 Perceptron neuron
3.2 Perceptron learning rule
3.3 Adaline
3.4 LMS learning rule
3.5 Adaptive filtering
3.6 XOR problem

4. Backpropagation
4.1 Multilayer feedforward networks
4.2 Backpropagation algorithm
4.3 Working with backpropagation
4.4 Advanced algorithms
4.5 Performance of multilayer perceptrons


4. Lecture plan (3/5)
5. Dynamic Networks
5.1 Historical dynamic networks
5.2 Focused time-delay neural network
5.3 Distributed time-delay neural network
5.4 NARX network
5.5 Layer recurrent network
5.6 Computational power of dynamic networks
5.7 Learning algorithms
5.8 System identification
5.9 Model reference adaptive control


4. Lecture plan (4/5)
6. Radial Basis Function Networks
6.1 RBFN structure
6.2 Exact interpolation
6.3 Commonly used radial basis functions
6.4 Radial Basis Function Networks
6.5 RBFN training
6.6 RBFN for pattern recognition
6.7 Comparison with multilayer perceptron
6.8 RBFN in Matlab notation
6.9 Probabilistic networks
6.10 Generalized regression networks


4. Lecture plan (5/5)
7. Self-Organizing Maps
7.1 Self-organization
7.2 Self-organizing maps
7.3 SOM algorithm
7.4 Properties of the feature map
7.5 Learning vector quantization

8. Practical considerations
8.1 Designing the training data
8.2 Preparing data
8.3 Selection of inputs
8.4 Data encoding
8.5 Principal component analysis
8.6 Invariances and prior knowledge
8.7 Generalization


5. Books
1. Neural Networks and Learning Machines, 3/E, Simon Haykin (Pearson Education, 2009)
2. Neural Networks: A Comprehensive Foundation, 2/E, Simon Haykin (Pearson Education, 1999)
3. Neural Networks for Pattern Recognition, Chris M. Bishop (Oxford University Press, 1995)
4. Practical Neural Network Recipes in C++, Timothy Masters (Academic Press, 1993)
5. Advanced Algorithms for Neural Networks, Timothy Masters (John Wiley and Sons, 1995)
6. Signal and Image Processing with Neural Networks, Timothy Masters (John Wiley and Sons, 1994)


6. SLO Books
1. Nevronske mreže, Andrej Dobnikar (Didakta, 1990)
2. Modeliranje dinamičnih sistemov z umetnimi nevronskimi mrežami in sorodnimi metodami, Juš Kocijan (Založba Univerze v Novi Gorici, 2007)


7. E-Books (1/2)
List of links at www.neural.si
– An Introduction to Neural Networks, Ben Krose & Patrick van der Smagt, 1996
  (recommended as an easy introduction)

– Neural Networks - Methodology and Applications, Gerard Dreyfus, 2005
– Metaheuristic Procedures for Training Neural Networks, Enrique Alba & Rafael Marti (Eds.), 2006
– FPGA Implementations of Neural Networks, Amos R. Omondi & Jagath C. Rajapakse (Eds.), 2006
– Trends in Neural Computation, Ke Chen & Lipo Wang (Eds.), 2007

7. E-Books (2/2)
– Neural Preprocessing and Control of Reactive Walking Machines, Poramate Manoonpong, 2007
– Artificial Neural Networks for the Modelling and Fault Diagnosis of Technical Processes, Krzysztof Patan, 2008
– Speech, Audio, Image and Biomedical Signal Processing using Neural Networks [only two chapters], Bhanu Prasad & S.R. Mahadeva Prasanna (Eds.), 2008

– MATLAB Neural Networks Toolbox 7 User's Guide, 2010


8. Online resources
List of links at www.neural.si
• Neural FAQ – by Warren Sarle, 2002
• How to measure importance of inputs – by Warren Sarle, 2000
• MATLAB Neural Networks Toolbox (User's Guide) – latest version
• Artificial Neural Networks on Wikipedia.org
• Neural Networks – online book by StatSoft
• Radial Basis Function Networks – by Mark Orr
• Principal components analysis on Wikipedia.org
• libsvm – Support Vector Machines library


9. Simulations
• Recommended computing platform
– MATLAB R2010b (or later) & Neural Network Toolbox 7
  http://www.mathworks.com/products/neuralnet/
– Acceptable older MATLAB release: MATLAB 7.5 & Neural Network Toolbox 5.1 (Release 2007b)

• Introduction to Matlab
– Get familiar with MATLAB M-file programming
– Online documentation: Getting Started with MATLAB

• Freeware computing platform
– Stuttgart Neural Network Simulator http://www.ra.cs.uni-tuebingen.de/SNNS/


10. Homeworks
• EURHEO students (II)
1. Practically oriented projects
2. Based on UC Irvine Machine Learning Repository data: http://archive.ics.uci.edu/ml/
3. Select a data set and discuss it with the lecturer
4. Formulate the problem
5. Develop your solution (concept & Matlab code)
6. Describe the solution in a short report
7. Submit results (report & Matlab source code)
8. Present results and demonstrate the solution
   • Presentation (~10 min)
   • Demonstration (~20 min)


Video links
• Robots with Biological Brains: Issues and Consequences
  Kevin Warwick, University of Reading
  http://videolectures.net/icannga2011_warwick_rbbi/
• Computational Neurogenetic Modelling: Methods, Systems, Applications
  Nikola Kasabov, University of Auckland
  http://videolectures.net/icannga2011_kasabov_cnm/




1. Introduction to Neural Networks
1.1 What is a neural network?
1.2 Biological neural networks
1.3 Human nervous system
1.4 Artificial neural networks
1.5 Benefits of neural networks
1.6 Brief history of neural networks
1.7 Applications of neural networks
1.8 List of symbols


1.1 What is a neural network? (1/2)
• Neural network
– Network of biological neurons
– Biological neural networks are made up of real biological neurons that are connected or functionally related in the peripheral nervous system or the central nervous system

• Artificial neurons
– Simple mathematical approximations of biological neurons


What is a neural network? (2/2)
• Artificial neural networks
– Networks of artificial neurons
– Very crude approximations of small parts of the biological brain
– Implemented as software or hardware
– By “Neural Networks” we usually mean Artificial Neural Networks
– Also known as: neurocomputers, connectionist networks, parallel distributed processors, ...


Neural network definitions
• Haykin (1999)
– A neural network is a massively parallel distributed processor that has a natural propensity for storing experiential knowledge and making it available for use. It resembles the brain in two respects:
  1. Knowledge is acquired by the network through a learning process.
  2. Interneuron connection strengths known as synaptic weights are used to store the knowledge.

• Zurada (1992)
– Artificial neural systems, or neural networks, are physical cellular systems which can acquire, store, and utilize experiential knowledge.

• Pinkus (1999)
– The question 'What is a neural network?' is ill-posed.


1.2 Biological neural networks
Cortical neurons (nerve cells) growing in culture:
– Neurons have a large cell body with several long processes extending from it, usually one thick axon and several thinner dendrites
– Dendrites receive information from other neurons
– The axon carries nerve impulses away from the neuron; its branching ends make contacts with other neurons and with muscles or glands


This complex network forms the nervous system, which relays information through the body

Biological neuron


Interaction of neurons
• Action potentials arriving at the synapses stimulate currents in the neuron's dendrites
• These currents depolarize the membrane at its axon, provoking an action potential
• The action potential propagates down the axon to its synaptic knobs, releasing neurotransmitter and stimulating the post-synaptic neuron






Synapses
• Elementary structural and functional units that mediate the interaction between neurons
• Chemical synapse: pre-synaptic electric signal → chemical neurotransmitter → post-synaptic electrical signal


Action potential
• Spikes or action potential
– Neurons encode their outputs as a series of voltage pulses
– The axon is very long, with high resistance & high capacitance
– Frequency modulation → improved signal/noise ratio


1.3 Human nervous system
• Human nervous system can be represented by three stages:

Stimulus → Receptors → Neural net (Brain) → Effectors → Response

• Receptors
  – collect information from the environment (photons on retina, tactile info, ...)
• Effectors
  – generate interactions with the environment (muscle activation, ...)
• Flow of information
  – feedforward & feedback


Human brain
Human activity is regulated by the nervous system:
• Central nervous system
  – Brain
  – Spinal cord
• Peripheral nervous system

≈ 10^10 neurons in the brain
≈ 10^4 synapses per neuron
≈ 1 ms processing speed of a neuron
→ Slow rate of operation
→ Extreme number of processing units & interconnections
→ Massive parallelism


Structural organization of brain
Molecules & ions .................. transmitters
Synapses .......................... fundamental organizational level
Neural microcircuits .............. assemblies of synapses organized into patterns of connectivity to produce desired functions
Dendritic trees ................... subunits of individual neurons
Neurons ........................... basic processing unit, size: 100 μm
Local circuits .................... localized regions in the brain, size: 1 mm
Interregional circuits ............ pathways, topographic maps
Central nervous system ............ final level of complexity


1.4 Artificial neural networks
• Neuron model

• Network of neurons


What can NNs do?
• In principle
– NN can compute any computable function (everything a normal digital computer can do)

• In practice
– NNs are especially useful for classification and function approximation problems which are tolerant of some imprecision
– Almost any finite-dimensional vector function on a compact set can be approximated to arbitrary precision by a feedforward NN
– They need a lot of training data
– It is difficult to apply hard rules (such as those used in an expert system)

• Problems difficult for NN
– Predicting random or pseudo-random numbers
– Factoring large integers
– Determining whether a large integer is prime or composite
– Decrypting anything encrypted by a good algorithm


1.5 Benefits of neural networks (1/3)
1. Ability to learn from examples
• Train the neural network on training data
• The neural network will generalize on new data
• Noise tolerant
• Many learning paradigms
  – Supervised (with a teacher)
  – Unsupervised (no teacher, self-organized)
  – Reinforcement learning

2. Adaptivity
• Neural networks have a natural capability to adapt to a changing environment
• Train the neural network, then retrain
• Continuous adaptation in a nonstationary environment


Benefits of neural networks (2/3)
3. Nonlinearity
• An artificial neuron can be linear or nonlinear
• A network of nonlinear neurons has nonlinearity distributed throughout the network
• Important for modelling inherently nonlinear signals

4. Fault tolerance
• Capable of robust computation
• Graceful degradation rather than catastrophic failure


Benefits of neural networks (3/3)
5. Massively parallel distributed structure
• Well suited for VLSI implementation
• Very fast hardware operation

6. Neurobiological analogy
• NN design is motivated by analogy with the brain
• NNs are a research tool for neurobiologists
• Neurobiology inspires further development of artificial NNs

7. Uniformity of analysis & design
• Neurons represent the building blocks of all neural networks
• Similar NN architectures serve various tasks: pattern recognition, regression, time series forecasting, control applications, ...


www.stanford.edu/group/brainsinsilicon/


1.6 Brief history of neural networks (1/2)
-1940   von Helmholtz, Mach, Pavlov, etc.
        – General theories of learning, vision, conditioning
        – No specific mathematical models of neuron operation
1943    McCulloch and Pitts
        – Proposed the neuron model
1949    Hebb
        – Published his book The Organization of Behavior
        – Introduced the Hebbian learning rule
1958    Rosenblatt, Widrow and Hoff
        – Perceptron, ADALINE
        – First practical networks and learning rules
1969    Minsky and Papert
        – Published the book Perceptrons, generalised the limitations of single-layer perceptrons to multilayered systems
        – The neural network field went into hibernation


Brief history of neural networks (2/2)
1974    Werbos
        – Developed the back-propagation learning method in his PhD thesis
        – Several years passed before this approach was popularized
1982    Hopfield
        – Published a series of papers on Hopfield networks
1982    Kohonen
        – Developed the Self-Organising Maps
1980s   Rumelhart and McClelland
        – Backpropagation rediscovered, re-emergence of the neural networks field
        – Books, conferences, courses, funding in USA, Europe, Japan
1990s   Radial Basis Function Networks were developed
2000s   The power of Ensembles of Neural Networks and Support Vector Machines becomes apparent


Current NN research
Topics for the 2013 International Joint Conference on NN
– Neural network theory and models
– Computational neuroscience
– Cognitive models
– Brain-machine interfaces
– Embodied robotics
– Evolutionary neural systems
– Self-monitoring neural systems
– Learning neural networks
– Neurodynamics
– Neuroinformatics
– Neuroengineering
– Neural hardware
– Neural network applications
– Pattern recognition
– Machine vision
– Collective intelligence
– Hybrid systems
– Self-aware systems
– Data mining
– Sensor networks
– Agent-based systems
– Computational biology
– Bioinformatics
– Artificial life


1.7 Applications of neural networks (1/3)
• Aerospace
– High performance aircraft autopilots, flight path simulations, aircraft control systems, autopilot enhancements, aircraft component simulations, aircraft component fault detectors

• Automotive
– Automobile automatic guidance systems, warranty activity analyzers

• Banking
– Check and other document readers, credit application evaluators

• Defense
– Weapon steering, target tracking, object discrimination, facial recognition, new kinds of sensors, sonar, radar and image signal processing including data compression, feature extraction and noise suppression, signal/image identification

• Electronics
– Code sequence prediction, integrated circuit chip layout, process control, chip failure analysis, machine vision, voice synthesis, nonlinear modeling


Applications of neural networks (2/3)
• Financial
– Real estate appraisal, loan advisor, corporate bond rating, credit line use analysis, portfolio trading program, corporate financial analysis, currency price prediction

• Manufacturing
– Manufacturing process control, product design and analysis, process and machine diagnosis, real-time particle identification, visual quality inspection systems, welding quality analysis, paper quality prediction, computer chip quality analysis, analysis of grinding operations, chemical product design analysis, machine maintenance analysis, project planning and management, dynamic modelling of chemical process systems

• Medical
– Breast cancer cell analysis, EEG and ECG analysis, prosthesis design, optimization of transplant times, hospital expense reduction, hospital quality improvement, emergency room test advisement


Applications of neural networks (3/3)
• Robotics
– Trajectory control, forklift robot, manipulator controllers, vision systems

• Speech
– Speech recognition, speech compression, vowel classification, text to speech synthesis

• Securities
– Market analysis, automatic bond rating, stock trading advisory systems

• Telecommunications
– Image and data compression, automated information services, real-time translation of spoken language, customer payment processing systems

• Transportation
– Truck brake diagnosis systems, vehicle scheduling, routing systems


1.8 List of symbols
THIS PRESENTATION | MATLAB
n – iteration, time step
t – time
x – input .................................. p
y – network output ......................... a
d – desired (target) output ................ t
f – activation function
v – induced local field .................... n
w – synaptic weight
b – bias
e – error


2. Neuron Model – Network Architectures – Learning
2.1 2.2 2.3 2.4 2.5 2.6 2.7 2.8 Neuron model Activation functions Network architectures Learning algorithms Learning paradigms Learning tasks Knowledge representation Neural networks vs. statistical methods


2.1 Neuron model
• Neuron
– information processing unit that is fundamental to the operation of a neural network

• Single input neuron
– scalar input x
– synaptic weight w
– bias b
– adder or linear combiner Σ
– activation potential v
– activation function f
– neuron output y


• Adjustable parameters
– synaptic weight w
– bias b

y = f(wx + b)


Neuron with vector input
• Input vector
x = [x1, x2, ... xR ], R = number of elements in input vector

• Weight vector
w = [w1, w2, ... wR ]

• Activation potential
v = wx + b ... inner product of the weight vector and the input vector, plus the bias


y = f(wx + b) = f(w1x1 + w2x2 + ... + wRxR + b)
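The weighted-sum-plus-bias neuron above can be written in a few lines. A minimal Python sketch (illustrative only, not the course's MATLAB notation; the weights, input, and bias values are made up):

```python
import math

def neuron(x, w, b, f):
    """Neuron with vector input: y = f(w1*x1 + ... + wR*xR + b)."""
    v = sum(wi * xi for wi, xi in zip(w, x)) + b  # activation potential v = wx + b
    return f(v)

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

# Hypothetical values: v = 0.5*1.0 + (-0.25)*2.0 + 0.0 = 0, and sigmoid(0) = 0.5
y = neuron(x=[1.0, 2.0], w=[0.5, -0.25], b=0.0, f=sigmoid)
```

The adjustable parameters w and b are what a learning algorithm later modifies; the activation function f is fixed by the chosen architecture.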


2.2 Activation functions (1/2)
• Activation function defines the output of a neuron
• Types of activation functions
  – Threshold function:  y(v) = 1 if v ≥ 0, 0 if v < 0
  – Linear function:     y(v) = v
  – Sigmoid function:    y(v) = 1 / (1 + exp(−v))
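The three activation functions follow directly from their definitions. An illustrative Python sketch:

```python
import math

def threshold(v):
    """Threshold function: y(v) = 1 if v >= 0, else 0."""
    return 1.0 if v >= 0 else 0.0

def linear(v):
    """Linear function: y(v) = v."""
    return v

def sigmoid(v):
    """Sigmoid function: y(v) = 1 / (1 + exp(-v))."""
    return 1.0 / (1.0 + math.exp(-v))

# At v = 0 the three functions give 1.0, 0.0, and 0.5 respectively
values_at_zero = (threshold(0.0), linear(0.0), sigmoid(0.0))
```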


Activation functions (2/2)


McCulloch-Pitts Neuron (1943)
• Vector input, threshold activation function

  y = f(wx + b) = sgn(wx + b)

• The output is binary, depending on whether the input meets a specified threshold b

  y = 1 if wx ≥ b
  y = 0 if wx < b



• Extremely simplified model of real biological neurons
  – Missing features: non-binary outputs, non-linear summation, smooth thresholding, stochasticity, temporal information processing



• Nevertheless, computationally very powerful
  – A network of McCulloch-Pitts neurons is capable of universal computation
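As a small illustration of that computational power, a McCulloch-Pitts neuron with hand-picked parameters realizes a logical AND gate. A Python sketch (the weights [1, 1] and threshold b = 2 are hypothetical choices, not from the slides):

```python
def mcp_neuron(x, w, b):
    """McCulloch-Pitts neuron: y = 1 if wx >= b, else 0 (binary threshold)."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) >= b else 0

# AND gate: the output fires only when both binary inputs are 1
truth_table = [mcp_neuron([a, c], [1, 1], 2) for a in (0, 1) for c in (0, 1)]
# truth_table == [0, 0, 0, 1]
```

OR follows from the same neuron with b = 1; composing such gates is one way to see that networks of these neurons can compute any Boolean function.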


Matlab notation
• Presentation of more complex neurons and networks
– The input vector p is represented by the solid dark vertical bar
– The weight vector is shown as a single-row, R-column matrix W
– W [1 x R] and p [R x 1] multiply into the scalar Wp


Matlab Demos
• nnd2n1 – One-input neuron
• nnd2n2 – Two-input neuron


2.3 Network architectures
About network architectures
– Two or more neurons can be combined in a layer
– A neural network can contain one or more layers
– Strong link between network architecture and learning algorithm

1. Single-layer feedforward networks
• Input layer of source nodes projects onto an output layer of neurons
• "Single-layer" refers to the output layer (the only computation layer)

2. Multi-layer feedforward networks
• One or more hidden layers
• Can extract higher-order statistics

3. Recurrent networks
• Contains at least one feedback loop
• Powerful temporal learning capabilities
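A multi-layer feedforward network (architecture 2 above) reduces to repeated weighted sums and activations, with each layer's output feeding the next. A minimal Python sketch with made-up weights:

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def layer(x, W, b, f):
    """One layer of neurons; each row of W holds one neuron's weight vector."""
    return [f(sum(w * xi for w, xi in zip(row, x)) + bi) for row, bi in zip(W, b)]

def feedforward(x, layers):
    """Data flows strictly forward: each layer's output is the next layer's input."""
    for W, b, f in layers:
        x = layer(x, W, b, f)
    return x

# Hypothetical 2-input -> 2-hidden -> 1-output network
net = [([[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0], sigmoid),
       ([[1.0, 1.0]], [0.0], sigmoid)]
out = feedforward([1.0, 1.0], net)  # a single sigmoid output in (0, 1)
```

Because there is no feedback, the output is a pure function of the current input, which is what makes learning in static networks comparatively easy.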


Single-layer feedforward networks


Multi-layer feedforward networks (1/2)


Multi-layer feedforward networks (2/2)
• Data flow strictly feedforward: input → output
• No feedback → static network, easy learning


Recurrent networks (1/2)
• Also called “Dynamic networks” • Output depends on
– current input to the network (as in static networks) – and also on current or previous inputs, outputs, or states of the network

• Simple recurrent network: the output is fed back to the input through a delay element
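The delayed feedback loop can be sketched as one update per time step: the output depends on the current input and on the previous output. An illustrative Python sketch with hypothetical weights:

```python
def recurrent_step(x_t, y_prev, w_in, w_fb):
    """One step of a one-neuron linear recurrent net:
    output depends on the current input AND the delayed previous output."""
    return w_in * x_t + w_fb * y_prev

# Feeding a constant input: the feedback makes the output depend on history
y, outputs = 0.0, []
for x in [1.0, 1.0, 1.0]:
    y = recurrent_step(x, y, w_in=0.5, w_fb=0.5)
    outputs.append(y)
# outputs == [0.5, 0.75, 0.875], approaching 1.0
```

The same input (1.0) produces a different output at every step, which is exactly the state-dependence that distinguishes dynamic networks from static ones.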


Recurrent networks (2/2)
• Layered Recurrent Dynamic Network – example


2.4 Learning algorithms
• Important ability of neural networks
– To learn from its environment
– To improve its performance through learning

• Learning process
  1. The neural network is stimulated by an environment
  2. The neural network undergoes changes in its free parameters as a result of this stimulation
  3. The neural network responds in a new way to the environment because of its changed internal structure

• Learning algorithm
  – Prescribed set of defined rules for the solution of a learning problem
    1. Error-correction learning
    2. Memory-based learning
    3. Hebbian learning
    4. Competitive learning


Error-correction learning (1/2)
1. The neural network is driven by input x(t) and responds with output y(t)
2. The network output y(t) is compared with the target output d(t)

Error signal = difference of network output and target output:

  e(t) = y(t) − d(t)


Error-correction learning (2/2)
• Error signal → control mechanism to correct synaptic weights
• Corrective adjustments → designed to make network output y(t) closer to target d(t)
• Learning achieved by minimizing the instantaneous error energy

  ε(t) = ½ e²(t)

• Delta learning rule (Widrow-Hoff rule)
  – Adjustment to a synaptic weight of a neuron is proportional to the product of the error signal and the input signal of the synapse

  Δw(t) = −η e(t) x(t)

• Comments
  – Error signal must be directly measurable
  – Key parameter: learning rate η
  – Closed-loop feedback system → stability determined by learning rate η
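The delta rule is enough to fit a single linear neuron. A minimal Python sketch (the training data, target function d = 2x + 1, and learning rate are made up for illustration; with e = y − d as above, gradient descent on ε = ½e² gives Δw = −η e x):

```python
# Fit a linear neuron y = w*x + b to targets d = 2x + 1 with the delta rule.
eta = 0.1                               # learning rate (hypothetical choice)
w, b = 0.0, 0.0
data = [(x, 2.0 * x + 1.0) for x in (-1.0, -0.5, 0.0, 0.5, 1.0)]

for _ in range(200):                    # epochs over the training set
    for x, d in data:
        y = w * x + b                   # network output
        e = y - d                       # error signal e(t) = y(t) - d(t)
        w -= eta * e * x                # delta rule for the weight
        b -= eta * e                    # bias = weight on a constant input of 1

# w converges toward 2, b toward 1
```

A learning rate that is too large makes this closed loop unstable; too small and convergence is slow, which is the stability trade-off the slide mentions.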


Memory-based learning
• All (or most) past experiences are stored in a memory of input-output pairs (inputs and target classes):

  {(x_i, y_i)}, i = 1, ..., N

• Two essential ingredients of memory-based learning
  1. Define a local neighborhood of a new input x_new
  2. Apply a learning rule to the stored examples in the local neighborhood of x_new

• Examples of memory-based learning
  – Nearest neighbor rule: local neighborhood defined by the nearest training example (Euclidean distance)
  – K-nearest neighbor classifier: local neighborhood defined by the k nearest training examples → robust against outliers
  – Radial basis function network: selecting the centers of basis functions
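The k-nearest-neighbor classifier above can be sketched directly: store everything, then classify by majority vote in the local neighborhood. An illustrative Python sketch (the stored memory contents are made up):

```python
from collections import Counter

def knn_classify(x_new, memory, k=3):
    """Classify x_new by majority vote among its k nearest stored
    examples (Euclidean distance)."""
    dist = lambda a: sum((ai - bi) ** 2 for ai, bi in zip(a, x_new)) ** 0.5
    neighbors = sorted(memory, key=lambda pair: dist(pair[0]))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Hypothetical stored experience: (input, class) pairs
memory = [((0.0, 0.0), "A"), ((0.1, 0.2), "A"), ((0.2, 0.1), "A"),
          ((0.9, 1.0), "B"), ((1.0, 0.8), "B")]
label = knn_classify((0.15, 0.15), memory, k=3)  # -> "A"
```

Note there is no training phase at all: the "learning" is the storage itself, and the vote over k > 1 neighbors is what provides the robustness to outliers mentioned above.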


Hebbian learning
• The oldest and most famous learning rule (Hebb, 1949)
– Formulated as associative learning in a neurobiological context
“When an axon of cell A is near enough to excite a cell B and repeatedly or persistently takes part in firing it, some growth process or metabolic changes take place in one or both cells such that A's efficiency, as one of the cells firing B, is increased.”

– Strong physiological evidence for Hebbian learning in hippocampus, important for long term memory and spatial navigation



Hebbian learning (Hebbian synapse)
– Time dependent, highly local, and strongly interactive mechanism to increase synaptic efficiency as a function of the correlation between the presynaptic and postsynaptic activities.
1. If two neurons on either side of a synapse are activated simultaneously, then the strength of that synapse is selectively increased
2. If two neurons on either side of a synapse are activated asynchronously, then that synapse is selectively weakened or eliminated

– Simplest form of Hebbian learning:

  Δw(t) = η y(t) x(t)

Competitive learning



Competitive learning network architecture
1. A set of inputs, connected to a layer of outputs
2. Each output neuron receives excitation from all inputs
3. Output neurons compete to become active by exchanging lateral inhibitory connections
4. Only a single neuron is active at any time



Competitive learning rule
– The neuron with the largest induced local field becomes the winning neuron
– The winning neuron shifts its synaptic weights toward the input

Individual neurons specialize on ensembles of similar patterns → feature detectors for different classes of input patterns
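One winner-take-all step can be sketched as follows. An illustrative Python sketch (the initial weights and input are made up; the winner is picked by smallest Euclidean distance to the input, which for normalized weight vectors is equivalent to the largest induced local field mentioned above):

```python
def competitive_step(x, weights, eta=0.5):
    """Winner-take-all: the best-matching output neuron shifts its
    weight vector a fraction eta of the way toward the input."""
    dist = lambda w: sum((wi - xi) ** 2 for wi, xi in zip(w, x))
    winner = min(range(len(weights)), key=lambda i: dist(weights[i]))
    weights[winner] = [wi + eta * (xi - wi) for wi, xi in zip(weights[winner], x)]
    return winner

weights = [[0.0, 0.0], [1.0, 1.0]]      # two output neurons (hypothetical init)
winner = competitive_step([0.9, 1.1], weights)
# neuron 1 wins and moves halfway toward the input (~[0.95, 1.05])
```

Repeating this over many inputs makes each neuron's weight vector settle near the center of one cluster of similar patterns, which is the feature-detector behavior described above.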


2.5 Learning paradigms
• Learning algorithm
– Prescribed set of defined rules for the solution of a learning problem
  1. Error-correction learning
  2. Memory-based learning
  3. Hebbian learning
  4. Competitive learning



Learning paradigm
– Manner in which a neural network relates to its environment
  1. Supervised learning
  2. Unsupervised learning
  3. Reinforcement learning


Supervised learning
• Learning with a teacher
– The teacher has knowledge of the environment
– Knowledge is represented by a set of input-output examples
– The error signal (difference between the teacher's target response, i.e. the optimal action, and the learning system's response) drives the parameter adjustments

• Learning algorithm
– Error-correction learning
– Memory-based learning


Unsupervised learning
• Unsupervised or self-organized learning
– No external teacher to oversee the learning process
– Only a set of input examples is available, no output examples
– Unsupervised NNs usually perform some kind of data compression, such as dimensionality reduction or clustering

• Learning algorithms
– Hebbian learning
– Competitive learning


Reinforcement learning
– No teacher; environment only offers a primary reinforcement signal
– System learns under delayed reinforcement
• Temporal sequence of inputs which result in the generation of a reinforcement signal
– Goal is to minimize the expectation of the cumulative cost of actions taken over a sequence of steps
– RL is realized through two neural networks: Critic and Learning system
[Figure: the environment sends a primary reinforcement signal to the critic and receives actions from the learning system; the critic sends a heuristic reinforcement signal to the learning system]
– Critic network converts the primary reinforcement signal (obtained directly from the environment) into a higher-quality heuristic reinforcement signal, which solves the temporal credit assignment problem


2.6 Learning tasks (1/7)
1. Pattern Association
– Associative memory is brain-like distributed memory that learns by association
– Two phases in the operation of associative memory:
1. Storage
2. Recall

– Autoassociation
• Neural network stores a set of patterns by repeatedly presenting them to the network
• Then, when presented a distorted pattern, the neural network is able to recall the original pattern
• Unsupervised learning algorithms

– Heteroassociation
• Set of input patterns is paired with an arbitrary set of output patterns
• Supervised learning algorithms


Learning tasks (2/7)
2. Pattern Recognition
– In pattern recognition, input signals are assigned to categories (classes)
– Two phases of pattern recognition:
1. Learning (supervised)
2. Classification

– Statistical nature of pattern recognition
• Patterns are represented in a multidimensional decision space
• Decision space is divided into separate regions for each class
• Decision boundaries are determined by a learning process
[Figure: Support-Vector-Machine example]

Learning tasks (3/7)
3. Function Approximation
– Arbitrary nonlinear input-output mapping y = f(x) can be approximated by a neural network, given a set of labeled examples {xi, yi}, i=1,...,N
– The task is to approximate the mapping f(x) by a neural network F(x) so that f(x) and F(x) are close enough:

||F(x) – f(x)|| < ε for all x (ε is a small positive number)

– Neural network mapping F(x) can be realized by supervised learning (error-correction learning algorithm)
– Important function approximation tasks:
• System identification
• Inverse system

Learning tasks (4/7)
• System identification
[Figure: the neural network is trained in parallel with the unknown system; the error signal Σ is the unknown system response minus the network output]

• Inverse system
[Figure: the neural network is cascaded with the system and trained so that its output matches the inputs from the environment; the error signal Σ drives the learning]


Learning tasks (5/7)
4. Control
• Neural networks can be used to control a plant (a process)
• Brain is the best example of a parallel distributed generalized controller:
• Operates thousands of actuators (muscles)
• Can handle nonlinearity and noise
• Can handle invariances
• Can optimize over a long-range planning horizon

– Feedback control system (Model reference control)
• NN controller has to supply inputs that will drive a plant according to a reference


– Model predictive control
• NN model provides multi-step ahead predictions for optimizer


Learning tasks (6/7)
5. Filtering
• Filter – a device or algorithm used to extract information about a prescribed quantity of interest from a noisy data set
• Filters can be used for three basic information processing tasks:

1. Filtering
• Extraction of information at discrete time n by using measured data up to and including time n
• Examples: Cocktail party problem, Blind source separation

2. Smoothing
• Differs from filtering in:
a) Data need not be available at time n
b) Data measured later than n can be used to obtain the information

3. Prediction
• Deriving information about the quantity in the future, at time n+h, h>0, by using data measured up to and including time n
• Examples: Forecasting of energy consumption, stock market prediction

Learning tasks (7/7)
6. Beamforming
– Spatial form of filtering, used to distinguish between the spatial properties of a target signal and background noise
– Device is called a beamformer
– Beamforming is used in human auditory response and by echo-locating bats  the task is suitable for neural network application
– Common beamforming tasks: radar and sonar systems
• Task is to detect a target in the presence of receiver noise and interfering signals
• Target signal originates from an unknown direction
• No a priori information available on interfering signals

– Neural beamformer, neuro-beamformer, attentional neurocomputers


Adaptation
Learning has spatio-temporal nature
– Space and time are fundamental dimensions of learning (control, beamforming)

1. Stationary environment
– Learning under the supervision of a teacher, weights then frozen
– Neural network then relies on memory to exploit past experiences

2. Nonstationary environment
– Statistical properties of environment change with time
– Neural network should continuously adapt its weights in real-time
– Adaptive system  continuous learning

3. Pseudostationary environment
– Changes are slow over a short temporal window
• Speech – stationary in intervals of 10-30 ms
• Ocean radar – stationary in intervals of several seconds


2.7 Knowledge representation
• What is knowledge?
– Stored information or models used by a person or machine to interpret, predict, and appropriately respond to the outside world (Fischler & Firschein, 1987)

• Knowledge representation
– A good solution depends on a good representation of knowledge
– Knowledge of the world consists of:
1. Prior information – facts about what is and what has been known
2. Observations of the world – measurements, obtained through sensors designed to probe the environment

Observations can be:
1. Labeled – input signals are paired with desired response
2. Unlabeled – input signals only


Knowledge representation in NN
• Design of neural networks based directly on real-life data
– Examples to train the neural network are taken from observations

• Examples to train neural network can be
– Positive examples ... input and correct target output
• e.g. sonar data + echoes from submarines

– Negative examples ... input and false output
• e.g. sonar data + echoes from marine life

• Knowledge representation in neural networks
– Defined by the values of free parameters (synaptic weights and biases)
– Knowledge is embedded in the design of a neural network
– Interpretation problem – neural networks suffer from an inability to explain how a result (decision / prediction / classification) was obtained
• Serious limitation for safety-critical applications (medical diagnosis, air traffic)
• Explanation capability by integration of NN and other artificial intelligence methods


Knowledge representation rules for NN
Rule 1: Similar inputs from similar classes should produce similar representations inside the network, and should be classified to the same category
Rule 2: Items to be categorized as separate classes should be given widely different representations in the network
Rule 3: If a particular feature is important, then there should be a large number of neurons involved in the representation of that item in the network
Rule 4: Prior information and invariances should be built into the design of a neural network, thereby simplifying the network design by not having to learn them

Prior information and invariances (Rule 4)
• Application of Rule 4 results in neural networks with specialized structure
– Biological visual and auditory networks are highly specialized
– Specialized network has a smaller number of parameters:
• needs less training data
• faster learning
• faster network throughput
• cheaper because of its smaller size

• How to build prior information into neural network design
– Currently no well-defined rules, but useful ad-hoc procedures
– We may use a combination of two techniques:
1. Receptive fields  restricting the network architecture by using local connections
2. Weight-sharing  several neurons share the same synaptic weights


How to build invariances into NN
Character recognition example
Transformations  Pattern recognition system should be invariant to them

[Figure: character transformations – original, size, rotation, shift, incomplete image]
Techniques
1. Invariance by neural network structure
2. Invariance by training
3. Invariant feature space


Invariant feature space
• Neural net classifier with invariant feature extractor
Input → Invariant feature extractor → Neural network classifier → Class estimate

• Features
– Characterize the essential information content of the input data
– Should be invariant to transformations of the input

• Benefits
1. Dimensionality reduction – number of features is small compared to the original input space
2. Relaxed design requirements for a neural network
3. Invariances for all objects can be assured (for known transformations)  Prior knowledge is required!


Example 2A (1/4) Invariant character recognition
• Problem: distinguishing handwritten characters ‘a’ and ‘b’

• Classifier design
Invariant feature extractor → Neural network classifier → Class estimate: ‘A’ or ‘B’

• Image representation
– Grid of pixels (typically 256x256) with gray level [0..1] (typically 8-bit coding)


Example 2A (2/4) Problems with image representation
1. Invariance problem (various transformations) 2. High dimensionality problem
– Image size 256x256  65536 inputs

Curse of dimensionality – increasing input dimensionality leads to sparse data and this provides very poor representation of the mapping  problems with correct classification and generalization

Possible solution
– Combining inputs into features  Goal is to obtain just a few features instead of 65536 inputs

Ideas for feature extraction (for character recognition)
F1 = character height / character width

Example 2A (3/4) Feature extraction
• Extracted feature: F1 = character height / character width

• Distribution for various samples from class ‘A’ and ‘B’
[Figure: distributions of F1 for samples from class ‘A’ and class ‘B’, with a decision threshold between them]

• Overlapping distributions: need for additional features
– F1, F2, F3, ...
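The F1 feature can be computed from a binary character image with a few lines of Python (illustrative sketch; the tiny 0/1 pixel grids below stand in for real 256x256 images):

```python
def feature_F1(image):
    """F1 = character height / character width, measured on the
    bounding box of the non-zero pixels of a binary image."""
    rows = [r for r, row in enumerate(image) if any(row)]
    cols = [c for c in range(len(image[0])) if any(row[c] for row in image)]
    height = rows[-1] - rows[0] + 1
    width = cols[-1] - cols[0] + 1
    return height / width

# A tall narrow character (like 'b') gives a large F1,
# a round character (like 'a') gives F1 close to 1
tall = [[0, 1, 0],
        [0, 1, 0],
        [0, 1, 0],
        [0, 1, 1]]
round_char = [[1, 1],
              [1, 1]]
print(feature_F1(tall))        # 2.0  (height 4 / width 2)
print(feature_F1(round_char))  # 1.0
```

Because F1 is a ratio, it is invariant to uniform scaling of the character, which is exactly the kind of invariance the feature space should provide.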

Example 2A (4/4) Classification in multi feature space
• Classification in the space of two features (F1, F2)
[Figure: samples from classes ‘A’ and ‘B’ in the (F1, F2) feature plane, separated by a decision boundary]

• Neural network can be used for classification in the feature space (F1, F2)
– 2 inputs instead of 65536 original inputs
– Improved generalization and classification ability


Generalization and model complexity
• What is the optimal decision boundary?

[Figure: three decision boundaries – a linear classifier (insufficient, false classifications), an optimal classifier (?), and an over-fitted classifier (correct classification but poor generalization)]

– Best generalization is achieved by a model whose complexity is neither too small nor too large
– Occam’s razor principle: we should prefer simpler models to more complex models
– Tradeoff: modeling simplicity vs. modeling capacity


2.8 Neural networks vs. stat. methods (1/3)
• Considerable overlap between neural nets and statistics
– Statistical inference means learning to generalize from noisy data
– Feedforward nets are a subset of the class of nonlinear regression and discrimination models
– Application of statistical theory to neural networks: Bishop (1995), Ripley (1996)

• Most NN that can learn to generalize effectively from noisy data are similar or identical to statistical methods
– Single-layered feedforward nets are basically generalized linear models
– Two-layer feedforward nets are closely related to projection pursuit regression
– Probabilistic neural nets are identical to kernel discriminant analysis
– Kohonen nets for adaptive vector quantization are similar to k-means cluster analysis
– Kohonen self-organizing maps are discrete approximations to principal curves and surfaces
– Hebbian learning is closely related to principal component analysis

• Some neural network areas have no relation to statistics
– Reinforcement learning
– Stopped training (similar to shrinkage estimation, but the method is quite different)

Neural networks vs. statistical methods (2/3)
• Many statistical methods can be used for flexible nonlinear modeling
• Polynomial regression, Fourier series regression
• K-nearest neighbor regression and discriminant analysis
• Kernel regression and discriminant analysis
• Wavelet smoothing, Local polynomial smoothing
• Smoothing splines, B-splines
• Tree-based models (CART, AID, etc.)
• Multivariate adaptive regression splines (MARS)
• Projection pursuit regression, various Bayesian methods

• Why use neural nets rather than statistical methods?
– Multilayer perceptron (MLP) tends to be useful in similar situations as projection pursuit regression, i.e.:
• the number of inputs is fairly large,
• many of the inputs are relevant, but
• most of the predictive information lies in a low-dimensional subspace
– Some advantages of MLPs over projection pursuit regression:
• computing predicted values from MLPs is simpler and faster
• MLPs are better at learning moderately pathological functions than are many other methods with stronger smoothness assumptions

Neural networks vs. statistical methods (3/3)
Neural Network Jargon .................................................. Statistical Jargon
– Generalizing from noisy data .................................... Statistical inference
– Neuron, unit, node .................................................... A simple linear or nonlinear computing element that accepts one or more inputs and computes a function thereof
– Neural networks ....................................................... A class of flexible nonlinear regression and discriminant models, data reduction models, and nonlinear dynamical systems
– Architecture .............................................................. Model
– Training, Learning, Adaptation ................................. Estimation, Model fitting, Optimization
– Classification ............................................................ Discriminant analysis
– Mapping, Function approximation ............................ Regression
– Competitive learning ................................................. Cluster analysis
– Hebbian learning ...................................................... Principal components
– Training set ............................................................... Sample, Construction sample
– Input ......................................................................... Independent variables, Predictors, Regressors, Explanatory variables, Carriers
– Output ....................................................................... Predicted values
– Generalization .......................................................... Interpolation, Extrapolation, Prediction
– Prediction ................................................................. Forecasting

MATLAB example
• nn02_neuron_output


MATLAB example
• nn02_custom_nn


MATLAB example
• nnstart


3. Perceptrons and Linear Filters
3.1 Perceptron neuron
3.2 Perceptron learning rule
3.3 Perceptron network
3.4 Adaline
3.5 LMS learning rule
3.6 Adaline network
3.7 ADALINE vs. Perceptron
3.8 Adaptive filtering
3.9 XOR problem

© 2012 Primož Potočnik

NEURAL NETWORKS (3) Perceptrons and Linear Filters

#99

Introduction
• Pioneering neural network contributions
– McCulloch & Pitts (1943) – the idea of neural networks as computing machines
– Rosenblatt (1958) – proposed the perceptron as the first supervised learning model
– Widrow and Hoff (1961) – least-mean-square learning as an important generalization of perceptron learning

• Perceptron
– Layer of McCulloch-Pitts neurons with adjustable synaptic weights
– Simplest form of a neural network for classification of linearly separable patterns
– Perceptron convergence theorem for two linearly separable classes

• Adaline
– Similar to perceptron, trained with LMS learning
– Used for linear adaptive filters


3.1 Perceptron neuron
• Perceptron neuron (McCulloch-Pitts neuron): hard-limit (threshold) activation function

y(v) = 1 if v ≥ 0
y(v) = 0 if v < 0

• Perceptron output: 0 or 1  useful for classification
If y=0  pattern belongs to class A
If y=1  pattern belongs to class B

Linear discriminant function
• Perceptron with two inputs

y = f(wx + b) = f(w1x1 + w2x2 + b)

– Separation between the two classes is a straight line, given by

w1x1 + w2x2 + b = 0

– Geometric representation: in the (x1, x2) plane the boundary is the line

x2 = −(w1/w2) x1 − b/w2

– Perceptron represents a linear discriminant function
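The boundary line and the classification it induces can be checked numerically (a small sketch; the weight values are arbitrary examples):

```python
def perceptron(x1, x2, w1, w2, b):
    # Hard-limit activation: class 1 if w1*x1 + w2*x2 + b >= 0, else class 0
    return 1 if w1 * x1 + w2 * x2 + b >= 0 else 0

def boundary_x2(x1, w1, w2, b):
    # From w1*x1 + w2*x2 + b = 0:  x2 = -(w1/w2)*x1 - b/w2
    return -(w1 / w2) * x1 - b / w2

w1, w2, b = 1.0, 2.0, -2.0                 # example weights
x2_line = boundary_x2(1.0, w1, w2, b)      # the point (1.0, 0.5) lies on the boundary
print(perceptron(1.0, x2_line + 0.1, w1, w2, b))  # 1 (above the line)
print(perceptron(1.0, x2_line - 0.1, w1, w2, b))  # 0 (below the line)
```

Points on one side of the line are assigned class 1, points on the other side class 0, which is exactly the linear discriminant behaviour described above.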


Matlab Demos (Perceptron)
• nnd2n2 – Two input perceptron • nnd4db – Decision boundaries


How to train a perceptron?
• How to train weights and bias?
– Perceptron learning rule
– Least-mean-square learning rule or “delta rule”

• Both are iterative learning procedures:
1. A learning sample is presented to the network
2. For each network parameter, the new value is computed by adding a correction:

wj(n+1) = wj(n) + Δwj(n)
b(n+1) = b(n) + Δb(n)

• Formulation of the learning problem
– How do we compute Δw(t) and Δb(t) in order to classify the learning patterns correctly?


3.2 Perceptron learning rule
• A set of learning samples (inputs and target classes)

{(xi, di)}, i = 1, ..., N, di ∈ {0, 1}

• Objective:
Reduce error e between target class d and neuron response y (error-correction learning)

e = d − y

• Learning procedure
1. Start with random weights for the connections
2. Present an input vector xi from the set of training samples
3. If perceptron response is wrong (y≠d, e≠0), modify all connections w
4. Go back to 2


Three conditions for a neuron
• After the presentation of input x, the neuron can be in three conditions:
– CASE 1: If neuron output is correct, weights w are not altered
– CASE 2: Neuron output is 0 instead of 1 (y=0, d=1, e=d−y=1)  input x is added to weight vector w
• This makes the weight vector point closer to the input vector, increasing the chance that the input vector will be classified as 1 in the future.
– CASE 3: Neuron output is 1 instead of 0 (y=1, d=0, e=d−y=−1)  input x is subtracted from weight vector w
• This makes the weight vector point farther away from the input vector, increasing the chance that the input vector will be classified as 0 in the future.


Three conditions rewritten
• Three conditions for a neuron rewritten
– CASE 1: e = 0  Δw = 0
– CASE 2: e = 1  Δw = x
– CASE 3: e = −1  Δw = −x

• Three conditions in a single expression
Δw = (d-y)x = ex

• Similar for the bias
Δb = (d-y)(1) = e

• Perceptron learning rule
wj(n+1) = wj(n) + e(n) xj(n)
b(n+1) = b(n) + e(n)


Convergence theorem
• For the perceptron learning rule there exists a convergence theorem:
Theorem 1: If there exists a set of connection weights w which is able to perform the transformation d=y(x), the perceptron learning rule will converge to some solution in a finite number of steps for any initial choice of the weights.

• Comments
– Theorem is only valid for linearly separable classes
– Outliers can cause long training times
– If classes are linearly separable, the perceptron offers a powerful pattern recognition tool


Perceptron learning rule summary
1. Start with random weights for the connections w
2. Select an input vector x from the set of training samples
3. If perceptron response is wrong (y≠d), modify all connections according to the learning rule:

Δw = ex, Δb = e

4. Go back to 2 (until all input vectors are correctly classified)
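The four steps can be written directly in Python; the sketch below trains on the AND function as an illustrative linearly separable problem (the data, seed, and epoch count are my own choices):

```python
import random

random.seed(1)

def train_perceptron(samples, epochs=20):
    # samples: list of (input vector x, target class d)
    n = len(samples[0][0])
    w = [random.uniform(-0.5, 0.5) for _ in range(n)]  # 1. random weights
    b = random.uniform(-0.5, 0.5)
    for _ in range(epochs):
        for x, d in samples:                           # 2. select a sample
            v = sum(wi * xi for wi, xi in zip(w, x)) + b
            y = 1 if v >= 0 else 0                     # hard-limit output
            e = d - y                                  # 3. error-correction
            w = [wi + e * xi for wi, xi in zip(w, x)]  #    Δw = e·x
            b = b + e                                  #    Δb = e
    return w, b                                        # 4. repeat

def predict(w, b, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else 0

# AND function: linearly separable, so the rule converges (convergence theorem)
samples = [([0, 0], 0), ([0, 1], 0), ([1, 0], 0), ([1, 1], 1)]
w, b = train_perceptron(samples)
print([predict(w, b, x) for x, _ in samples])  # [0, 0, 0, 1]
```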


Matlab demo (Perceptron learning rule)
• nnd4pr – Two input perceptron


MATLAB example nn03_perceptron
• Classification of linearly separable data with a perceptron


Matlab demo (Presence of an outlier)
• demop4 – Slow learning with the presence of an outlier


Matlab demo (Linearly non-separable classes)
• demop6 – Perceptron attempts to classify linearly nonseparable classes


Matlab demo (Classification application)
• nnd3pc – Perceptron classification fruit example


3.3 Perceptron network
• Single layer of perceptron neurons

• Classification into more than two linearly separable classes

MATLAB example nn03_perceptron_network
• Classification of 4-class problem with a 2-neuron perceptron


3.4 Adaline
• ADALINE = Adaptive Linear Element
• Widrow and Hoff, 1961: LMS learning (Least Mean Square) or Delta rule
• Important generalization of the perceptron learning rule
• Main difference with perceptron  activation function
– Perceptron: Threshold activation function
– ADALINE: Linear activation function

• Both Perceptron and ADALINE can only solve linearly separable problems

Linear neuron
• Basic ADALINE element: linear transfer function

y = wx + b, y(v) = v


Simple ADALINE
• Simple ADALINE with two inputs

y = f(wx + b) = w1x1 + w2x2 + b

• Like a perceptron, ADALINE has a decision boundary
– defined by network inputs for which network output is zero

w1x1 + w2x2 + b = 0
– see Perceptron decision boundary

• ADALINE can be used to classify objects into categories

3.5 LMS learning rule
• LMS = Least-Mean-Square learning rule
• A set of learning samples (inputs and target classes)

{(xi, di)}, i = 1, ..., N
• Objective: reduce error e between target class d and neuron response y (error-correction learning)
e=d–y

• Goal is to minimize the average sum of squared errors
mse = (1/N) Σn=1..N [d(n) − y(n)]²


LMS algorithm (1/3)
• LMS algorithm is based on an approximate steepest descent procedure
– Widrow & Hoff introduced the idea to estimate the mean-square-error

mse = (1/N) Σn=1..N [d(n) − y(n)]²

– by using the square-error at each iteration

e²(n) = [d(n) − y(n)]²

– and to change the network weights proportionally to the negative derivative of the error

Δwj(n) = −η ∂e²(n)/∂wj

– with some learning constant η


LMS algorithm (2/3)
– Now we expand the expression for the weight change ...

Δwj(n) = −η ∂e²(n)/∂wj = −2η e(n) ∂e(n)/∂wj = −2η e(n) ∂[d(n) − y(n)]/∂wj = 2η e(n) ∂y(n)/∂wj

– Expanding the neuron activation y(n)

y(n) = Wx(n) = w1x1(n) + ... + wjxj(n) + ... + wRxR(n)

– gives ∂y(n)/∂wj = xj(n); using the cosmetic correction η → η/2, we finally obtain the weight change at step n

Δwj(n) = η e(n) xj(n)


LMS algorithm (3/3)
• Final form of LMS learning rule
wj(n+1) = wj(n) + η e(n) xj(n)
b(n+1) = b(n) + η e(n)

– Learning is regulated by a learning rate η
– Stable learning  learning rate η must be less than the reciprocal of the largest eigenvalue of the correlation matrix xTx of the input vectors

• Limitations
– Linear network can only learn linear input-output mappings
– Proper selection of the learning rate η is required
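A minimal single-input ADALINE trained with the LMS rule (illustrative sketch; the noise-free data d = 2x + 1 and η = 0.1 are my own choices):

```python
def train_lms(samples, eta=0.1, epochs=100):
    # ADALINE with one input: y = w*x + b, trained by the LMS rule
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for x, d in samples:
            y = w * x + b
            e = d - y          # error-correction learning
            w += eta * e * x   # w(n+1) = w(n) + η·e(n)·x(n)
            b += eta * e       # b(n+1) = b(n) + η·e(n)
    return w, b

# Noise-free samples of d = 2x + 1; LMS should recover w ≈ 2, b ≈ 1
samples = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (-1.0, -1.0)]
w, b = train_lms(samples)
print(round(w, 2), round(b, 2))  # close to 2.0 and 1.0
```

With η = 0.1 the updates stay well inside the stability bound mentioned above, so the weights converge to the underlying linear mapping.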


Matlab demo (LMS learning)
• pp02 – Gradient descent learning by LMS learning rule


3.6 Adaline network
• ADALINE network = MADALINE (single layer of ADALINE neurons)


3.7 ADALINE vs. Perceptron
• Architectures
ADALINE (linear activation) vs. PERCEPTRON (threshold activation)

• Learning rules

LMS learning:
wj(n+1) = wj(n) + η e(n) xj(n)
b(n+1) = b(n) + η e(n)

Perceptron learning:
wj(n+1) = wj(n) + e(n) xj(n)
b(n+1) = b(n) + e(n)


ADALINE and Perceptron summary
• Single-layer networks can be built based on ADALINE or Perceptron neurons
• Both architectures are suitable to learn only linear input-output relationships
• Perceptron with threshold activation function is suitable for classification problems
• ADALINE with linear output is more suitable for regression & filtering
• ADALINE is suitable for continuous learning

3.8 Adaptive filtering
• ADALINE is one of the most widely used neural networks in practical applications
• Adaptive filtering is one of its major application areas
• We introduce a new element: Tapped delay line
– Input signal enters from the left and passes through N−1 delays
– Output of the tapped delay line (TDL) is an N-dimensional vector, composed from current and past inputs


Adaptive filter
• Adaptive filter = ADALINE combined with TDL

a(k) = Wp + b = Σi wi p(k−i+1) + b

Simple adaptive filter example
• Adaptive filter with three delayed inputs

a(t) = w1 p(t) + w2 p(t−1) + w3 p(t−2) + b
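The same three-tap filter written as plain Python over a tapped delay line (sketch; the signal and weight values are arbitrary examples):

```python
def adaptive_filter_output(p, w, b):
    """Output a(t) = w1*p(t) + w2*p(t-1) + w3*p(t-2) + b, computed for
    every t for which the tapped delay line is full (t >= 2)."""
    a = []
    for t in range(2, len(p)):
        taps = [p[t], p[t - 1], p[t - 2]]  # tapped delay line contents
        a.append(sum(wi * pi for wi, pi in zip(w, taps)) + b)
    return a

signal = [1.0, 2.0, 3.0, 4.0]
out = adaptive_filter_output(signal, w=[0.5, 0.3, 0.2], b=0.1)
print(out)  # two outputs, for t = 2 and t = 3
```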


Adaptive filter for prediction
• Adaptive filter can be used to predict the next value of a time series p(t+1)
[Figure: timeline with samples p(t−2), p(t−1), p(t), p(t+1) – during learning the filter is trained on past values; in operation it predicts p(t+1) from the current and past values]


Noise cancellation example
• Adaptive filter can be used to cancel engine noise in pilot’s voice in an airplane
– The goal is to obtain a signal that contains the pilot’s voice, but not the engine noise.
– Linear neural net is adaptively trained to predict the combined pilot/engine signal m from an engine signal n.
– Only engine noise n is available to the network, so it only learns to predict the engine’s contribution to the pilot/engine signal m. The network error e becomes equal to the pilot’s voice.
– The linear adaptive network adaptively learns to cancel the engine noise. Such adaptive noise canceling generally does a better job than a classical filter, because the noise here is subtracted from rather than filtered out of the signal m.

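The scheme can be sketched with a one-weight LMS neuron (an entirely synthetic example: the sinusoidal "engine noise", the slow "voice", the 0.8 coupling gain, and η are my own choices; the bias term is omitted for brevity):

```python
import math

def lms_noise_canceller(noise, measured, eta=0.05):
    # Linear neuron predicts the measured pilot/engine signal m from the
    # engine noise n; the prediction error e is the recovered voice
    w = 0.0
    recovered = []
    for n_t, m_t in zip(noise, measured):
        y = w * n_t         # network's estimate of the engine part of m
        e = m_t - y         # error = estimate of the pilot's voice
        w += eta * e * n_t  # LMS weight update
        recovered.append(e)
    return recovered

# Engine noise is a fast sinusoid; the cabin microphone hears voice + 0.8*noise
noise = [math.sin(0.3 * t) for t in range(2000)]
voice = [0.2 * math.sin(0.011 * t) for t in range(2000)]
measured = [v + 0.8 * n for v, n in zip(voice, noise)]
recovered = lms_noise_canceller(noise, measured)
```

The weight converges towards the 0.8 coupling gain, so after the initial transient the error signal is dominated by the voice, exactly as described above.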

Single-layer adaptive filter network
• If more than one output neuron is required, a tapped delay line can be connected to a layer of neurons


Matlab demos (ADALINE)
• nnd10eeg – ADALINE for noise filtering of EEG signals
• nnd10nc – Adaptive noise cancellation


MATLAB example nn_03_adaline
• ADALINE time series prediction with adaptive linear filter


3.9 XOR problem
• Single layer perceptron cannot represent XOR function
– One of Minsky and Papert’s most discouraging results
– Example: perceptron with two inputs

Discriminant function: x2 = −(w1/w2) x1 − b/w2

– Only AND and OR functions can be represented by Perceptron


XOR solution
• Extending single-layer perceptron to multi-layer perceptron by introducing hidden units
[Figure: two-layer perceptron solving XOR – inputs x1 and x2 feed a hidden unit (weights w1,1, w1,2, bias b1), whose output together with the direct inputs feeds the output unit (weights w2,1, w2,2, w2,3, bias b2)]

• XOR problem can be solved but we no longer have a learning rule to train the network
• Multilayer perceptrons can do everything  How to train them?
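With hand-picked weights (my own illustrative values, not necessarily the ones in the lecture figure) a two-layer threshold network indeed computes XOR: the hidden unit detects AND, and the output unit subtracts it from an OR of the inputs:

```python
def step(v):
    # Hard-limit (threshold) activation, as in the perceptron neuron
    return 1 if v >= 0 else 0

def xor_net(x1, x2):
    h = step(x1 + x2 - 1.5)             # hidden unit: AND(x1, x2)
    return step(x1 + x2 - 2 * h - 0.5)  # OR(x1, x2) minus twice the AND -> XOR

print([xor_net(a, b) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]])  # [0, 1, 1, 0]
```

The weights are fixed by hand here; finding them automatically is exactly the training problem that backpropagation (next chapter) solves.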


Homework
• Create a two-layer perceptron to solve XOR problem
– Create a custom network
– Demonstrate the solution


4. Backpropagation

4.1 4.2 4.3 4.4 4.5

Multilayer feedforward networks Backpropagation algorithm Working with backpropagation Advanced algorithms Performance of multilayer perceptrons

© 2012 Primož Potočnik

NEURAL NETWORKS (4) Backpropagation

#139

Introduction
• Single-layer networks have severe restrictions
– Only linearly separable tasks can be solved

• Minsky and Papert (1969)
– Showed the power of a two-layer feed-forward network
– But did not show how to train such a network

• Werbos (1974)
– Parker (1985), Le Cun (1985), Rumelhart (1986)
– Solved the problem of training multilayer networks by back-propagating the output errors through the hidden layers of the network

• Backpropagation learning rule

4.1 Multilayer feedforward networks
• Important class of neural networks
– Input layer (only distributes the inputs, without processing)
– One or more hidden layers
– Output layer

• Commonly referred to as multilayer perceptron

Properties of multilayer perceptrons
1. Neurons include nonlinear activation function
– Without nonlinearity, the capacity of the network is reduced to that of a single-layer perceptron
– The nonlinearity must be smooth (differentiable everywhere), not hard-limiting as in the original perceptron
– Often the logistic function is used:

  y = 1 / (1 + exp(−v))

2. One or more layers of hidden neurons
– Enable learning of complex tasks by extracting features from the input patterns

3. Massive connectivity
– Neurons in successive layers are fully interconnected


Matlab demo
• nnd11nf – Response of the feedforward network with one hidden layer


About backpropagation
• Multilayer perceptrons can be trained by backpropagation learning rule
– Based on the error-correction learning rule
– A generalization of the LMS learning rule (used to train ADALINE)

• Backpropagation consists of two passes through the network
1. Forward pass
– Input is applied to the network and propagated to the output – Synaptic weights stay frozen – Based on the desired response, error signal is calculated

2. Backward pass
– Error signal is propagated backwards from output to input – Synaptic weights are adjusted according to the error gradient


4.2 Backpropagation algorithm (1/9)
• A set of learning samples (inputs and target outputs):

  {(x_n, d_n)}, n = 1 … N,   x_n ∈ R^M,  d_n ∈ R^R

• Error signal at the output layer, neuron j, learning iteration n:

  e_j(n) = d_j(n) − y_j(n)

• Instantaneous error energy of the output layer with R neurons:

  E(n) = (1/2) Σ_{j=1..R} e_j(n)²

• Average error energy over the whole learning set:

  E_av = (1/N) Σ_{n=1..N} E(n)


Backpropagation algorithm (2/9)
• The average error energy E_av represents a cost function, a measure of learning performance
• E_av is a function of the free network parameters:
– synaptic weights
– bias levels

• The learning objective is to minimize the average error energy E_av by adjusting the free network parameters
• We use an approximation: pattern-by-pattern learning instead of epoch learning
– Parameter adjustments are made for each pattern presented to the network
– The instantaneous error energy is minimized at each step instead of the average error energy


Backpropagation algorithm (3/9)
• As in the LMS algorithm, backpropagation applies a correction to each weight proportional to the partial derivative of the instantaneous error energy:

  Δw_ji(n) = −η · ∂E(n)/∂w_ji(n)

• Expressing this gradient by the chain rule:

  ∂E(n)/∂w_ji(n) = ∂E(n)/∂e_j(n) · ∂e_j(n)/∂y_j(n) · ∂y_j(n)/∂v_j(n) · ∂v_j(n)/∂w_ji(n)

• Signal-flow path: y_i → w_ji → v_j → y_j → e_j → E
  (synaptic weight, induced local field, network output, output error), with

  e_j = d_j − y_j,   E = (1/2) Σ_{j=1..R} e_j²

Backpropagation algorithm (4/9)
1. Gradient on the output error:

   ∂E(n)/∂e_j(n) = e_j(n)

2. Gradient on the network output (from e_j(n) = d_j(n) − y_j(n)):

   ∂e_j(n)/∂y_j(n) = −1

3. Gradient on the induced local field:

   ∂y_j(n)/∂v_j(n) = f′(v_j(n))

4. Gradient on the synaptic weight (from v_j(n) = Σ_{i=0..M} w_ji(n) y_i(n)):

   ∂v_j(n)/∂w_ji(n) = y_i(n)

Backpropagation algorithm (5/9)
• Putting the gradients together:

  ∂E(n)/∂w_ji(n) = e_j(n) · (−1) · f′(v_j(n)) · y_i(n) = −e_j(n) f′(v_j(n)) y_i(n)

• The correction of a synaptic weight is defined by the delta rule:

  Δw_ji(n) = −η · ∂E(n)/∂w_ji(n) = η e_j(n) f′(v_j(n)) y_i(n) = η δ_j(n) y_i(n)

  with learning rate η and local gradient δ_j(n) = e_j(n) f′(v_j(n))

Backpropagation algorithm (6/9)
CASE 1: Neuron j is an output node
– The output error e_j(n) is available, so the computation of the local gradient is straightforward:

  δ_j(n) = e_j(n) f′(v_j(n))

– For the logistic activation f(v) = 1 / (1 + exp(−a v)):

  f′(v_j(n)) = a exp(−a v_j(n)) / [1 + exp(−a v_j(n))]²

CASE 2: Neuron j is a hidden node
– The hidden error is not available → credit-assignment problem
– The local gradient is obtained by backpropagating errors through the network:

  δ_j(n) = −∂E(n)/∂y_j(n) · ∂y_j(n)/∂v_j(n) = −∂E(n)/∂y_j(n) · f′(v_j(n))

– What remains is the derivative of the output error energy E with respect to the hidden-layer output y_j


Backpropagation algorithm (7/9)
CASE 2: Neuron j is a hidden node (continued)
– Instantaneous error energy of the output layer with R neurons:

  E(n) = (1/2) Σ_{k=1..R} e_k(n)²

– Expressing the gradient of the output error energy E on the hidden-layer output y_j:

  ∂E(n)/∂y_j(n) = Σ_k e_k(n) · ∂e_k(n)/∂y_j(n) = Σ_k e_k(n) · ∂e_k(n)/∂v_k(n) · ∂v_k(n)/∂y_j(n)

– With e_k(n) = d_k(n) − y_k(n) = d_k(n) − f(v_k(n)):

  ∂e_k(n)/∂v_k(n) = −f′(v_k(n))

– and with v_k(n) = Σ_{j=0..M} w_kj(n) y_j(n):

  ∂v_k(n)/∂y_j(n) = w_kj(n)

– Therefore:

  ∂E(n)/∂y_j(n) = −Σ_k e_k(n) f′(v_k(n)) w_kj(n) = −Σ_k δ_k(n) w_kj(n)


Backpropagation algorithm (8/9)
CASE 2: Neuron j is a hidden node (continued)
– Finally, combining the ansatz for the hidden-layer local gradient

  δ_j(n) = −∂E(n)/∂y_j(n) · f′(v_j(n))

– with the gradient of the output error energy on the hidden-layer output

  ∂E(n)/∂y_j(n) = −Σ_k δ_k(n) w_kj(n)

– gives the final result for the hidden-layer local gradient:

  δ_j(n) = f′(v_j(n)) Σ_k δ_k(n) w_kj(n)


Backpropagation algorithm (9/9)
• Backpropagation summary:

  Δw_ji(n) = η · δ_j(n) · y_i(n)
  (weight correction = learning rate × local gradient × input of neuron j)

1. Local gradient of an output node:

   δ_k(n) = e_k(n) f′(v_k(n))

2. Local gradient of a hidden node:

   δ_j(n) = f′(v_j(n)) Σ_k δ_k(n) w_kj(n)

  (signal path: x_i → w_ji → v_j → y_j → w_kj → v_k → y_k)


Two passes of computation
1. Forward pass
– Input is applied to the network and propagated to the output:
  inputs → hidden-layer outputs → output-layer outputs → output errors

  y_j = f(Σ_i w_ji x_i),   y_k = f(Σ_j w_kj y_j),   e_k(n) = d_k(n) − y_k(n)

2. Backward pass
– Recursive computation of local gradients:
  output local gradients → hidden-layer local gradients

  δ_k(n) = e_k(n) f′(v_k(n)),   δ_j(n) = f′(v_j(n)) Σ_k δ_k(n) w_kj(n)

– Synaptic weights are adjusted according to the local gradients:

  Δw_kj(n) = η δ_k(n) y_j(n),   Δw_ji(n) = η δ_j(n) x_i(n)
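The two passes can be sketched in a few lines of NumPy (an illustrative sketch with one hidden layer, logistic activations with a = 1, and assumed layer sizes; not the MATLAB toolbox implementation):

```python
import numpy as np

def f(v):
    """Logistic activation (a = 1)."""
    return 1.0 / (1.0 + np.exp(-v))

def forward(x, W1, W2):
    """Forward pass: input -> hidden outputs y1 -> network outputs y2."""
    y1 = f(W1 @ x)
    y2 = f(W2 @ y1)
    return y1, y2

def local_gradients(x, d, W1, W2):
    """Backward pass: output deltas first, then hidden deltas."""
    y1, y2 = forward(x, W1, W2)
    e = d - y2                                  # output errors
    delta2 = e * y2 * (1.0 - y2)                # delta_k = e_k * f'(v_k)
    delta1 = (W2.T @ delta2) * y1 * (1.0 - y1)  # delta_j = f'(v_j) * sum_k delta_k w_kj
    return delta1, delta2, y1

def backprop_step(x, d, W1, W2, eta=0.1):
    """One sequential update: delta w = eta * (local gradient) * (input of the neuron)."""
    delta1, delta2, y1 = local_gradients(x, d, W1, W2)
    return W1 + eta * np.outer(delta1, x), W2 + eta * np.outer(delta2, y1)
```

Note that `delta1` reuses `delta2` through the transposed output weights, which is exactly the error backpropagation of the previous slides.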


Summary of backpropagation algorithm
1. Initialization
– Pick weights and biases from a uniform distribution with zero mean and a variance that places the induced local fields between the linear and saturated parts of the logistic function

2. Presentation of training samples
– For each sample from the epoch, perform forward pass and backward pass

3. Forward pass
– Propagate training sample from network input to the output – Calculate the error signal

4. Backward pass
– Recursive computation of local gradients from output layer toward input layer – Adaptation of synaptic weights according to generalized delta rule

5. Iteration
– Iterate steps 2-4 until stopping criterion is met


Matlab demo
• nnd11bc – Backpropagation calculation


Matlab demo
• nnd12sd1 – Steepest descent


Backpropagation for ADALINE
• Using backpropagation learning for ADALINE
– No hidden layers, one output neuron
– Linear activation function:

  f(v(n)) = v(n),   f′(v(n)) = 1

• Backpropagation rule:

  Δw_i(n) = η δ(n) y_i(n),   y_i(n) = x_i(n)
  δ(n) = e(n) f′(v(n)) = e(n)
  ⇒ Δw_i(n) = η e(n) x_i(n)

• Original delta rule:

  Δw_i(n) = η e(n) x_i(n)

• Backpropagation is a generalization of the delta rule
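This special case can be sketched directly: with f(v) = v and f′(v) = 1 the backpropagation update collapses to the delta rule Δw_i = η e x_i (illustrative data; the network and training values below are assumptions):

```python
import numpy as np

def lms_train(X, d, eta=0.05, epochs=50):
    """Sequential LMS / delta-rule training of a linear neuron y = w.x.
    With f(v) = v and f'(v) = 1, the local gradient is just the error e."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x, target in zip(X, d):
            e = target - w @ x      # output error of the linear neuron
            w = w + eta * e * x     # delta rule: dw_i = eta * e * x_i
    return w

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
w_true = np.array([1.0, -2.0, 0.5])
w_est = lms_train(X, X @ w_true)    # noiseless targets -> w_est approaches w_true
```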

4.3 Working with backpropagation
• Efficient application of backpropagation requires some “fine-tuning”
• Various parameters, functions and methods must be selected:
– Training mode (sequential / batch)
– Activation function
– Learning rate
– Momentum
– Stopping criterion
– Heuristics for efficient backpropagation
– Methods for improving generalization


Sequential and batch training
• Learning results from many presentations of training examples
– Epoch = presentation of the entire training set

• Batch training
– Weight updating after the presentation of a complete epoch

• Sequential training
– Weight updating after the presentation of each training example
– Stochastic nature of learning, faster convergence
– Important practical reasons for sequential learning:
• The algorithm is easy to implement
• It provides effective solutions to large and difficult problems

– Therefore sequential training is the preferred training mode
– Good practice is a random order of presentation of the training examples


Activation function
• The derivative of the activation function, f′(v_j(n)), is required for the computation of the local gradients
– The only requirement for the activation function is differentiability
– Commonly used: the logistic function

  f(v_j(n)) = 1 / (1 + exp(−a v_j(n))),   a > 0

– Derivative of the logistic function:

  f′(v_j(n)) = a exp(−a v_j(n)) / [1 + exp(−a v_j(n))]²

– With y_j(n) = f(v_j(n)) this simplifies to:

  f′(v_j(n)) = a y_j(n) [1 − y_j(n)]

– The local gradient can thus be calculated from the neuron output alone, without explicit knowledge of the activation function
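The identity f′(v) = a·y·(1 − y) is easy to verify numerically (a small illustrative sketch):

```python
import math

def logistic(v, a=1.0):
    """Logistic activation f(v) = 1 / (1 + exp(-a v))."""
    return 1.0 / (1.0 + math.exp(-a * v))

def logistic_deriv(v, a=1.0):
    """Derivative computed from the neuron output alone: f'(v) = a y (1 - y)."""
    y = logistic(v, a)
    return a * y * (1.0 - y)
```

This is why implementations store the neuron outputs from the forward pass: the backward pass never has to re-evaluate the exponential.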


Other activation functions
• Using sin() activation functions:

  f(x) = a + Σ_{k≥1} c_k sin(k x + φ_k)

– Equivalent to traditional Fourier analysis
– A network with sin() activation functions can be trained by backpropagation
– Example: approximating a periodic function

  (Figure: approximation with 8 sigmoid hidden neurons vs. 4 sin hidden neurons)


Learning rate
• The learning procedure requires:
– the change in weight space to be proportional to the error gradient
– true gradient descent requires infinitesimal steps

• Learning in practice:
– The factor of proportionality in Δw_ji(n) = η δ_j(n) y_i(n) is the learning rate η
– Choose a learning rate as large as possible without leading to oscillations

  (Figure: learning trajectories for η = 0.010, 0.035, 0.040)


Stopping criteria
• Generally, backpropagation cannot be shown to converge
– There are no well-defined criteria for stopping its operation

• Possible stopping criteria:
1. Gradient vector – the Euclidean norm of the gradient vector reaches a sufficiently small value
2. Output error – the output error is small enough, or the rate of change of the average squared error per epoch is sufficiently small
3. Generalization performance – generalization performance has peaked or is adequate
4. Maximum number of iterations – we are out of time ...


Heuristics for efficient backpropagation (1/3)
1. Maximizing information content
– General rule: every training example presented to the backpropagation algorithm should be chosen so that its information content is the largest possible for the task at hand
– Simple technique: randomize the order in which examples are presented from one epoch to the next

2. Activation function
– Faster learning with antisymmetric sigmoid activation functions
– A popular choice is:

  f(v) = a tanh(b v),   a = 1.72,  b = 0.67

  which gives f(1) = 1, f(−1) = −1, an effective gain f′(0) ≈ 1, and the maximum of the second derivative at v = 1


Heuristics for efficient backpropagation (2/3)
3. Target values
– Must be within the range of the activation function
– An offset is recommended, otherwise learning is driven into saturation
• Example: max(target) = 0.9 · max(f)

4. Preprocessing inputs
a) Normalizing the mean to zero
b) Decorrelating the input variables (e.g. by principal component analysis)
c) Scaling the input variables (variances should be approximately equal)

  (Figure: original data → a) zero mean → b) decorrelated → c) equalized variance)


Heuristics for efficient backpropagation (3/3)
5. Initialization
– The choice of initial weights is important for a successful network design
• Large initial values → saturation
• Small initial values → slow learning due to operation only near the origin (saddle point)
• A good choice lies between these extreme values: the standard deviation of the induced local fields should lie between the linear and saturated parts of the sigmoid function
• tanh activation example (a = 1.72, b = 0.67): synaptic weights should be drawn from a uniform distribution with zero mean and standard deviation σ_w = m^(−1/2), where m is the number of synaptic connections of the neuron

6. Learning from hints
– Prior information about the unknown mapping can be included in the learning process:
• Initialization
• Possible invariance properties, symmetries, ...
• Choice of activation functions


Generalization
• Neural network is able to generalize:
– Input-output mapping computed by the network is correct for test data
• Test data were not used during training
• Test data are from the same population as the training data

– Correct response even if the input is slightly different from the training examples

  (Figure: overfitting vs. good generalization)


Improving generalization
• Methods to improve generalization:
1. Keeping the network small
2. Early stopping
3. Regularization

• Early stopping
– The available data are divided into three sets:
1. Training set – used to train the network
2. Validation set – used for early stopping: training stops when the validation error starts to increase
3. Test set – used for the final estimate of network performance and for comparison of various models

  (Figure: early stopping)


Regularization
• Improving generalization by regularization
– Modifying the performance function

  mse = (1/N) Σ_{n=1..N} (d(n) − y(n))²

– with the mean sum of squares of the network weights and biases

  msw = (1/M) Σ_{m=1..M} w_m²

– thus obtaining a new performance function

  msreg = γ · mse + (1 − γ) · msw

– With this performance function the network will have smaller weights and biases, which forces the network response to be smoother and less likely to overfit
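The regularized performance index is straightforward to compute (illustrative sketch; the performance ratio γ = 0.9 used as the default here is an assumption):

```python
import numpy as np

def msreg(errors, weights, gamma=0.9):
    """Regularized performance index: gamma * mse + (1 - gamma) * msw."""
    mse = float(np.mean(np.square(errors)))   # mean squared network error
    msw = float(np.mean(np.square(weights)))  # mean squared weight magnitude
    return gamma * mse + (1.0 - gamma) * msw
```

Lowering γ penalizes large weights more strongly, trading training accuracy for a smoother network response.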


Deficiencies of backpropagation
Some properties of backpropagation do not guarantee the algorithm to be universally useful:
1. Long training process
– Possibly due to a non-optimal learning rate (advanced algorithms address this problem)

2. Network paralysis
– The combination of sigmoidal activation and very large weights can decrease gradients almost to zero → training nearly stops

3. Local minima
– The error surface of a complex network can be very complex, with many hills and valleys
– Gradient methods can get trapped in local minima
– Solutions: probabilistic learning methods (simulated annealing, ...)


4.4 Advanced algorithms
• Basic backpropagation is slow
– It adjusts the weights in the steepest-descent direction (the negative of the gradient), in which the performance function decreases most rapidly
– It turns out that, although the function decreases most rapidly along the negative of the gradient, this does not necessarily produce the fastest convergence

1. Advanced algorithms based on heuristics
– Developed from an analysis of the performance of the standard steepest-descent algorithm:
• Momentum technique
• Variable learning rate backpropagation
• Resilient backpropagation

2. Numerical optimization techniques
– Application of standard numerical optimization techniques to network training:
• Quasi-Newton algorithms
• Conjugate gradient algorithms
• Levenberg-Marquardt


Momentum
• A simple method of increasing the learning rate while avoiding the danger of instability
• Modified delta rule with an added momentum term:

  Δw_ji(n) = η δ_j(n) y_i(n) + α Δw_ji(n−1),   0 ≤ α < 1

– α is the momentum constant
– Momentum accelerates backpropagation in steady downhill directions

  (Figure: trajectories for a small learning rate, a large learning rate (oscillations), and learning with momentum)
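For a scalar weight the modified delta rule can be sketched as follows (illustrative values; writing the gradient as g(n) = −δ(n)·y(n), the rule reads Δw(n) = −η g(n) + α Δw(n−1)):

```python
def momentum_descent(grad_fn, w, eta=0.05, alpha=0.9, steps=200):
    """Gradient descent with momentum: dw(n) = -eta * g(n) + alpha * dw(n-1)."""
    dw = 0.0
    for _ in range(steps):
        dw = -eta * grad_fn(w) + alpha * dw  # momentum term reuses the last update
        w = w + dw
    return w
```

In a steady downhill direction the updates accumulate up to a factor 1/(1 − α) larger than plain gradient descent, which is where the acceleration comes from.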


Variable learning rate η(t)
• Another method of manipulating the learning rate and momentum to accelerate backpropagation:

1. If the error decreases after a weight update:
• the weight update is accepted
• the learning rate is increased: η(t+1) = σ·η(t), σ > 1
• if the momentum has previously been reset to 0, it is restored to its original value

2. If the error increases by less than ζ after a weight update:
• the weight update is accepted
• the learning rate is unchanged: η(t+1) = η(t)
• if the momentum has previously been reset to 0, it is restored to its original value

3. If the error increases by more than ζ after a weight update:
• the weight update is discarded
• the learning rate is decreased: η(t+1) = ρ·η(t), 0 < ρ < 1
• the momentum is reset to 0

Possible parameter values: ζ = 4%, ρ = 0.7, σ = 1.05
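The three rules can be sketched as a single decision function (momentum handling omitted for brevity; ζ, ρ, σ default to the slide’s values):

```python
def update_learning_rate(loss_new, loss_old, eta,
                         zeta=0.04, rho=0.7, sigma=1.05):
    """Decide whether to accept a weight update and how to adapt eta.
    Returns (accept, new_eta)."""
    if loss_new < loss_old:                  # rule 1: error decreased
        return True, eta * sigma             # accept, grow the learning rate
    if loss_new <= (1.0 + zeta) * loss_old:  # rule 2: small error increase
        return True, eta                     # accept, keep the learning rate
    return False, eta * rho                  # rule 3: discard, shrink the rate
```

The caller evaluates the error once per tentative epoch, then either commits the new weights with the returned learning rate or retries the step with the reduced one.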


Resilient backpropagation
• The slope of a sigmoid function approaches zero as the input gets large
– This causes a problem when using steepest descent to train the network
– The gradient can have a very small magnitude → the changes in the weights are small, even though the weights are far from their optimal values

• Resilient backpropagation
– Eliminates these harmful effects of the magnitudes of the partial derivatives
– Only the sign of the derivative determines the direction of the weight update; the size of the weight change is determined by a separate update value
– Resilient backpropagation rules:
1. The update value for each weight and bias is increased by a factor δinc if the derivative of the performance function with respect to that weight has the same sign on two successive iterations
2. The update value is decreased by a factor δdec if the derivative with respect to that weight changes sign from the previous iteration
3. If the derivative is zero, the update value remains the same
4. If the weights are oscillating, the weight change is reduced
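A per-weight sketch of rules 1–3 (δinc = 1.2, δdec = 0.5 and the step-size limits are assumed values; rule 4 and the usual sign-change bookkeeping are omitted for brevity):

```python
def rprop_step(grad, prev_grad, step, d_inc=1.2, d_dec=0.5,
               step_max=50.0, step_min=1e-6):
    """One Rprop update for a single weight: returns (dw, new step size)."""
    s = grad * prev_grad
    if s > 0:                               # same sign twice: grow the update value
        step = min(step * d_inc, step_max)
    elif s < 0:                             # sign change: shrink the update value
        step = max(step * d_dec, step_min)
    if grad > 0:                            # move opposite to the gradient's sign
        return -step, step
    if grad < 0:
        return step, step
    return 0.0, step                        # zero derivative: no move
```

Because only the sign of the gradient is used, a tiny gradient in a saturated sigmoid region no longer stalls learning.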


Numerical optimization (1/3)
• Supervised learning as an optimization problem
– The error surface of a multilayer perceptron, expressed by the instantaneous error energy E(n), is a highly nonlinear function of the synaptic weight vector w(n):

  E(n) = E(w(n))

  (Figure: error surface E(w1, w2) over the weight plane)


Numerical optimization (2/3)
• Expanding the error energy in a Taylor series:

  E(w(n) + Δw(n)) ≈ E(w(n)) + gᵀ(n) Δw(n) + (1/2) Δwᵀ(n) H(n) Δw(n)

– Local gradient:

  g(n) = ∂E(w)/∂w, evaluated at w = w(n)

– Hessian matrix:

  H(n) = ∂²E(w)/∂w², evaluated at w = w(n)


Numerical optimization (3/3)
• Steepest descent method (backpropagation):
– Weight adjustment proportional to the gradient
– Simple implementation, but slow convergence

  Δw(n) = −η g(n)

• Significant improvement by using higher-order information:
– Adding a momentum term → a crude approximation to using second-order information about the error surface
– Quadratic approximation of the error surface → the essence of Newton’s method:

  Δw(n) = −H⁻¹(n) g(n)

– H⁻¹ is the inverse of the Hessian matrix

  (Figure: gradient descent vs. Newton’s method trajectories)
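The difference between the two update rules can be seen on a small quadratic error surface (illustrative example with an assumed Hessian): Newton’s method reaches the minimum in one step, while steepest descent needs many.

```python
import numpy as np

H = np.array([[4.0, 0.0],
              [0.0, 1.0]])              # Hessian of a quadratic error surface
b = np.array([4.0, 2.0])
w_star = np.linalg.solve(H, b)          # minimiser of E(w) = 0.5 w'Hw - b'w

def grad(w):
    return H @ w - b                    # gradient g(w)

# Newton's method: dw = -H^{-1} g  ->  exact minimiser in ONE step on a quadratic
w_newton = np.zeros(2) - np.linalg.solve(H, grad(np.zeros(2)))

# Steepest descent: dw = -eta g  ->  many small steps, limited by the largest
# stable learning rate for the stiffest curvature direction
w_gd = np.zeros(2)
for _ in range(100):
    w_gd = w_gd - 0.1 * grad(w_gd)
```

The mismatch of curvatures (4 vs. 1) is exactly what makes the steepest-descent trajectory zig-zag, while the Hessian rescales both directions at once.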


Quasi-Newton algorithms
• Problems with the calculation of the Hessian matrix:
– The inverse Hessian H⁻¹ is required, which is computationally expensive
– The Hessian has to be nonsingular, which is not guaranteed
– The Hessian of a neural network can be rank-deficient
– There is no convergence guarantee for a non-quadratic error surface

• Quasi-Newton methods:
– Require only the calculation of the gradient vector g(n)
– Estimate the inverse Hessian directly, without matrix inversion
– Quasi-Newton variants:
• Davidon-Fletcher-Powell algorithm
• Broyden-Fletcher-Goldfarb-Shanno algorithm ... the best form of quasi-Newton algorithm!

• Application to neural networks:
– The method is fast for small neural networks


Conjugate gradient algorithms
• Conjugate gradient algorithms:
– Second-order methods that avoid the computational problems with the inverse Hessian
– The search is performed along conjugate directions, which generally produces faster convergence than steepest-descent directions
1. In most conjugate gradient algorithms, the step size is adjusted at each iteration
2. A search is made along the conjugate gradient direction to determine the step size that minimizes the performance function along that line

– Many variants of conjugate gradient algorithms:
• Fletcher-Reeves update
• Polak-Ribiére update
• Powell-Beale restarts
• Scaled conjugate gradient

  (Figure: gradient descent vs. conjugate gradient trajectories)

• Application to neural networks:
– Perhaps the only method suitable for large-scale problems (hundreds or thousands of adjustable parameters) → well suited for multilayer perceptrons


Levenberg-Marquardt algorithm
• Levenberg-Marquardt algorithm (LM):
– Like the quasi-Newton methods, the LM algorithm was designed to approach second-order training speed without having to compute the Hessian matrix
– When the performance function has the form of a sum of squares (typical in neural network training), the Hessian matrix H can be approximated by the Jacobian matrix J:

  H ≈ Jᵀ J

– where the Jacobian matrix contains the first derivatives of the network errors with respect to the weights
– The Jacobian can be computed through a standard backpropagation technique that is much less complex than computing the Hessian matrix

• Application to neural networks:
– The algorithm appears to be the fastest method for training moderate-sized feedforward neural networks (up to several hundred weights)
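A sketch of the H ≈ JᵀJ approximation on the simplest possible case, a linear neuron, where the Jacobian of the errors is available in closed form (illustrative data; the damping parameter μ and its fixed value are assumptions, since a full LM implementation adapts μ between steps):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(30, 3))                 # input patterns
w_true = np.array([0.5, -1.0, 2.0])
d = X @ w_true                               # noiseless targets

def lm_step(w, mu=1e-3):
    """One Levenberg-Marquardt step for a linear neuron y = X w."""
    e = d - X @ w                 # network errors
    J = -X                        # Jacobian of the errors w.r.t. the weights
    g = J.T @ e                   # gradient of E = 0.5 * ||e||^2
    H_approx = J.T @ J            # Gauss-Newton approximation H ~ J'J
    return w - np.linalg.solve(H_approx + mu * np.eye(len(w)), g)

w_lm = np.zeros(3)
for _ in range(20):
    w_lm = lm_step(w_lm)
```

The added μI term blends the step between Gauss-Newton (μ → 0) and small-step gradient descent (large μ), which is what makes LM robust far from the minimum.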


Advanced algorithms summary
• Practical hints (Matlab related)
– The variable learning rate algorithm is usually much slower than the other methods
– The resilient backpropagation method is very well suited to pattern recognition problems
– For function approximation problems with networks of up to a few hundred weights, the Levenberg-Marquardt algorithm has the fastest convergence and very accurate training
– Conjugate gradient algorithms perform well over a wide variety of problems, particularly for networks with a large number of weights (modest memory requirements)


Training algorithms in MATLAB


4.5 Performance of multilayer perceptrons
• The approximation error is influenced by:
– The learning algorithm used (discussed in the last section)
• This determines how well the error on the training set is minimized

– The number and distribution of learning samples
• This determines how well the training samples represent the actual function

– The number of hidden units
• This determines the expressive power of the network. For smooth functions only a few hidden units are needed; for wildly fluctuating functions more hidden units are needed


Number of learning samples
• Function approximation example y=f(x)
  (Figure: approximations learned from 4 and from 20 learning samples)

– A learning set with 4 samples gives a small training error but very poor generalization
– A learning set with 20 samples gives a higher training error but generalizes well
– A low training error is no guarantee of good network performance!


Number of hidden units
• Function approximation example y=f(x)
  (Figure: approximations with 5 and with 20 hidden units)

– A large number of hidden units leads to a small training error, but not necessarily to a small test error
– Adding hidden units always reduces the training error
– However, adding hidden units first reduces the test error but then increases it (peaking effect; early stopping can be applied)


Size effect summary

  (Figure, left: error rate vs. number of training samples – the training-set and test-set
   curves approach each other as the number of training samples grows.
   Figure, right: error rate vs. number of hidden units – the training-set error keeps
   decreasing, while the test-set error has a minimum at the optimal number of hidden neurons.)


Matlab demo
• nnd11fa – Function approximation, variable number of hidden units


Matlab demo
• nnd11gn – Generalization, variable number of hidden units

