This article covers: bias measures a single model's learning ability, while variance measures how stable the same model is across different datasets.

Improve NN

train/dev/test set

0.7/0.3 (train/test) or 0.6/0.2/0.2 (train/dev/test) -> datasets of roughly 100-10,000 examples

0.98/0.01/0.01 … -> big data

Bias/Variance

Bias measures a single model's learning ability; variance measures how stable the same model is across different datasets.


high variance -> high dev set error

high bias -> high train set error

basic recipe

high bias -> bigger network / train longer / more advanced optimization algorithms / NN architectures

high variance -> more data / regularization / NN architecture

Regularization

Logistic Regression

L2 regularization: $\min_{w,b} J(w,b)$, where

$$J(w,b)=\frac{1}{m}\sum_{i=1}^{m}\mathcal{L}\left(\hat y^{(i)},y^{(i)}\right)+\frac{\lambda}{2m}\Vert w\Vert_2^2$$
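As a minimal numpy sketch of this cost (the function and variable names here are my own, not from the original notes):

```python
import numpy as np

def l2_regularized_cost(w, b, X, Y, lambd):
    """Cross-entropy cost plus the L2 penalty (lambda / 2m) * ||w||_2^2.

    X: (n_x, m) inputs, Y: (1, m) labels in {0, 1}, lambd: regularization strength.
    """
    m = X.shape[1]
    y_hat = 1 / (1 + np.exp(-(np.dot(w.T, X) + b)))   # sigmoid(w^T x + b)
    cross_entropy = -np.mean(Y * np.log(y_hat) + (1 - Y) * np.log(1 - y_hat))
    l2_penalty = (lambd / (2 * m)) * np.sum(np.square(w))
    return cross_entropy + l2_penalty
```

On the backward pass, the penalty adds (λ/m)·w to dw, which is why L2 regularization is also called weight decay.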

Neural network

Frobenius norm:

$$\Vert w^{[l]}\Vert_F^2=\sum_{i=1}^{n^{[l]}}\sum_{j=1}^{n^{[l-1]}}\left(w_{i,j}^{[l]}\right)^2$$

Dropout regularization (inverted dropout, shown for layer 3):

d3 = np.random.rand(a3.shape[0], a3.shape[1]) < keep_prob
a3 = np.multiply(a3, d3)
a3 /= keep_prob
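Putting the same three dropout lines into a small self-contained sketch (the helper name and the keep_prob value are illustrative):

```python
import numpy as np

def inverted_dropout(a3, keep_prob=0.8, seed=0):
    """Apply inverted dropout to the activations a3 of layer 3."""
    rng = np.random.default_rng(seed)
    d3 = rng.random(a3.shape) < keep_prob   # keep each unit with probability keep_prob
    a3 = np.multiply(a3, d3)                # zero out the dropped units
    a3 = a3 / keep_prob                     # scale up so E[a3] is unchanged (the "inverted" part)
    return a3, d3                           # d3 is reused to mask the same units in backprop
```

At test time no units are dropped and no extra scaling is needed, precisely because of the division by keep_prob during training.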

other ways

  • early stopping (a short sketch follows this list)
  • data augmentation
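As a minimal sketch of the early-stopping idea, over a made-up sequence of dev-set losses (the loss values and the patience setting are not from the notes):

```python
# Stop training once the dev-set loss has not improved for `patience` epochs in a row.
dev_losses = [0.90, 0.70, 0.60, 0.55, 0.56, 0.58, 0.60]   # toy values, one per epoch
patience, best, bad, best_epoch = 2, float("inf"), 0, -1

for epoch, loss in enumerate(dev_losses):
    if loss < best:
        best, best_epoch, bad = loss, epoch, 0   # improvement: remember it, reset the counter
    else:
        bad += 1
        if bad >= patience:                      # no improvement for `patience` epochs: stop
            break

print(f"stopped after epoch {epoch}; best dev loss {best} at epoch {best_epoch}")
```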

optimization problem

speed up the training of your neural network

Normalizing inputs

  1. subtract mean

$$\mu=\frac{1}{m}\sum_{i=1}^{m}x^{(i)},\qquad x:=x-\mu$$

  2. normalize variance

$$\sigma^2=\frac{1}{m}\sum_{i=1}^{m}\left(x^{(i)}\right)^2,\qquad x:=x/\sigma$$
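A small numpy sketch of both normalization steps (the feature-rows/example-columns layout and the toy data are assumptions):

```python
import numpy as np

X = np.random.randn(3, 5) * 4 + 10                 # toy inputs, shape (n_x, m)

mu = np.mean(X, axis=1, keepdims=True)             # mu = (1/m) * sum_i x^(i)
X = X - mu                                         # x := x - mu
sigma2 = np.mean(X ** 2, axis=1, keepdims=True)    # sigma^2 = (1/m) * sum_i (x^(i))^2
X = X / np.sqrt(sigma2)                            # x := x / sigma
```

The same μ and σ computed on the training set should be reused for the dev and test sets, so every split sees the same transformation.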

vanishing/exploding gradients

$$y=w^{[L]}w^{[L-1]}\cdots w^{[2]}w^{[1]}x$$

$$w^{[l]}>I\ \Rightarrow\ \left(w^{[l]}\right)^{L}\rightarrow\infty,\qquad w^{[l]}<I\ \Rightarrow\ \left(w^{[l]}\right)^{L}\rightarrow 0$$
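A quick numeric illustration of why this matters (the depth of 150 and the scalar weights 1.1 and 0.9 are made-up values):

```python
# A scalar stand-in for the product w^[L] ... w^[1]: values just above or below 1
# grow or shrink exponentially with the number of layers.
for w in (1.1, 0.9):
    print(w, w ** 150)   # 1.1 -> ~1.6e6 (explodes), 0.9 -> ~1.4e-7 (vanishes)
```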

weight initialization

$$\mathrm{var}(w)=\frac{1}{n^{[l-1]}}$$

w^[l] = np.random.randn(shape) * np.sqrt(1 / n^[l-1])
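A sketch of this initialization for a whole network (the dictionary layout and the function name are my own):

```python
import numpy as np

def initialize_parameters(layer_dims, seed=0):
    """W^[l] drawn with variance 1/n^[l-1]; layer_dims = [n_x, n_1, ..., n_L]."""
    rng = np.random.default_rng(seed)
    params = {}
    for l in range(1, len(layer_dims)):
        params["W" + str(l)] = rng.standard_normal(
            (layer_dims[l], layer_dims[l - 1])
        ) * np.sqrt(1.0 / layer_dims[l - 1])        # randn(shape) * sqrt(1 / n^[l-1])
        params["b" + str(l)] = np.zeros((layer_dims[l], 1))
    return params
```

For ReLU layers a commonly used variant scales by sqrt(2 / n^[l-1]) instead (He initialization).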

gradient check

Numerical approximation

$$f(\theta)=\theta^3,\qquad f'(\theta)\approx\frac{f(\theta+\varepsilon)-f(\theta-\varepsilon)}{2\varepsilon}$$
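Checking the formula numerically on f(θ) = θ³ (the values of θ and ε are arbitrary):

```python
def f(theta):
    return theta ** 3

theta, eps = 1.0, 1e-2
approx = (f(theta + eps) - f(theta - eps)) / (2 * eps)   # two-sided difference
print(approx)   # ~3.0001, close to the true derivative 3*theta^2 = 3; error is O(eps^2)
```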

grad check

$$d\theta_{\text{approx}}[i]=\frac{J(\theta_1,\ldots,\theta_i+\varepsilon,\ldots)-J(\theta_1,\ldots,\theta_i-\varepsilon,\ldots)}{2\varepsilon}\approx d\theta[i]$$

$$\text{check:}\quad\frac{\Vert d\theta_{\text{approx}}-d\theta\Vert_2}{\Vert d\theta_{\text{approx}}\Vert_2+\Vert d\theta\Vert_2}<10^{-7}$$
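A sketch of the check itself, on flattened gradient vectors (the function name is my own; the extra thresholds in the comment are commonly quoted rules of thumb, not from these notes):

```python
import numpy as np

def grad_check_ratio(dtheta_approx, dtheta):
    """Relative difference between the numerical and backprop gradients."""
    numerator = np.linalg.norm(dtheta_approx - dtheta)
    denominator = np.linalg.norm(dtheta_approx) + np.linalg.norm(dtheta)
    return numerator / denominator   # < 1e-7 is great; ~1e-5 deserves a closer look; ~1e-3 is likely a bug
```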

Original article: https://blog.csdn.net/Star__01/article/details/136063994
