
My eyes have felt tired all year, and then a temple arm on my glasses snapped, so I decided to get a proper new pair.

Buying glasses turned out to involve a surprising amount of homework, so I'm writing it down.

how to buy

The first step was deciding how to buy. Based on suggestions from various online forums, I narrowed it down to three options.

  1. Taobao lenses + Taobao frame + Hong Kong eye exam

    • Pros: cheap and transparent. A full Mingyue set for 250 + exam 100 + travel to Shenzhen 50
    • Cons: slow, and the trip to Shenzhen is a hassle
  2. Shenzhen glasses

    • Pros: cheap, since it's a wholesale market
    • Cons: opaque pricing, easy to get ripped off, eye exams are not standardized
  3. Hong Kong glasses

    • Pros: standardized eye exams, convenient, brand-name goods are cheap
    • Cons: ordinary goods are expensive, easy to get ripped off

Since the temple arm was broken, I needed glasses urgently, and the other two options had clear drawbacks (Taobao too slow; Shenzhen too far, plus the risk of being ripped off), so Hong Kong it was.

where to buy

Once I'd decided to buy glasses in Hong Kong, it turned out things weren't so simple. Hong Kong optical shops, for my purposes, fall into three categories.

  • Chain stores such as Optical 88 and EGG
  • Upstairs shops in office buildings
  • Zoff, a Japanese budget chain

What else could I do but compare them one by one?

I also dug through Hong Kong forums, and happened to discover that Xiaomi sells frames (249) and blue-light glasses (99). Both turned out to be tungsten steel; I don't really know what that means, but I do trust Xiaomi's quality, so I bought the blue-light glasses and popped out the lenses to use them as a frame. (Customer service said the 99 pair and the 249 frame are identical in material, weight, and design = =; the only difference is that one is made in Korea and the other domestically...)

With the frame sorted, I went shop to shop asking about lenses.

One nice thing about Hong Kong: saying "I'll go look around a bit more" isn't awkward at all; the staff are perfectly easygoing about it.

In total I asked at six shops.

Office building no. 1: Paris Optical

Recommended by a friend: Swisscoat 1.61 aspheric 300 + blue-light coating 200 = 500. It was there that I learned... the glasses I'd been wearing for three years... were! assembled! backwards!

No! wonder! I! was! always! dizzy!

No! wonder! the! lenses! kept! falling! out!

Office building no. 2: a no-name shop

Run by a single Cantonese-speaking old man; the kind of place that looks trustworthy.

Apollo 1.61 aspheric 290 + blue-light coating 200 = 490

Office building no. 3: a chain glasses warehouse?

Decorated a bit like mainland shops. I heard Essilor for 880? Not sure I heard right.

Optical 88

DULUX 1.61 blue-light = 700

Essilor 1.61 blue-light = 1300

EGG

HOYA? Not sure whether it's an EGG-exclusive line.

1.61 aspheric 600 + blue-light 100 = 700

1.67 aspheric 900 + blue-light 0 = 900

ZOFF

A Japanese budget eyewear chain, very popular in Hong Kong.

Since its only store for now is in Taikoo Shing, I just called.

  1. They don't fit lenses into your own frame
  2. The lenses are Zoff's own brand...?
  3. 1.61 aspheric 480 + blue-light 280 = 760 for a complete pair

Looking at it all, the going rate is roughly 500 for an ordinary brand, and HOYA at 700 is worth considering. Then I thought: HOYA at that price, what more could I want? Lenses alone on Taobao cost more than that. By then my phone was dead and I didn't want to come back, so I went with the EGG HOYA.

Worth mentioning: the Hong Kong eye exam really did include steps I'd never experienced before; overall it felt more professional.

conclusion

Regrets:

The final purchase was a bit rushed, and I failed to consider a few things:

  • Whether the lenses are an EGG-exclusive HOYA line; if so...
  • I never asked about many parameters, such as the base curve, Abbe number, and lens material
  • The exact model? Only now do I know that even genuine HOYA (VP, IP, etc.) comes in many models

Still, on the whole, 700 for the exam plus HOYA feels like a good deal. I'll know exactly which HOYA it is the day after tomorrow (I specifically asked them to keep the lens packaging for me).

Update

The glasses arrived. After half a day of wear they feel great; long stretches at the computer really don't make me dizzy anymore! Although that might just be because the old pair was assembled backwards.

They also kept the lens packaging for me as promised.


The packaging carried no HOYA information at all. I phoned HOYA, who confirmed that EGG does use their lenses, then went back to the store staff, who added me on WeChat and said they could issue a HOYA certificate for my own lenses, which would also specify exactly which HOYA model they are.

All in all, a really great experience.

Update

In the noble spirit of trying everything once, this time (2019.10) I took yet another route: ordering from the Japanese JINS online store, which is popular on the forums.

I shop overseas a lot anyway, and the value is genuinely good: for 500+ I got a pair of HOYA 1.74 high-index glasses where the rims are titanium (the rest of the frame is plastic...). Quite satisfied.

Next time I might go to a JINS store in Japan in person, and get a trip out of it, or try one of the Shenzhen shops that are popular on Zhihu.

Life is only fun if you keep tinkering, haha~

CUHK CSCI5030

instructor: XU Lei

A big-name professor from Jiaotong University.

True to form, he lectures directly in Chinese.

My first Chinese-language major course in four years; exciting.

Preface

Probability distributions

  • Uniform
  • Poisson
    • Discrete
    • Parameter: $\lambda$
    • $P(X = k) = \frac {e^{-\lambda}\lambda^k}{k!}$
  • Normal
  • t-distribution
  • Chi-square
  • Cauchy
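The Poisson pmf above is easy to sanity-check in code; a quick sketch (plain Python, stdlib only):

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    # P(X = k) = e^(-lambda) * lambda^k / k!
    return exp(-lam) * lam ** k / factorial(k)

# the pmf over a wide range of k should sum to ~1
total = sum(poisson_pmf(k, 3.0) for k in range(100))
```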

Benford’s Law

$P(d) = \log_{10}(d+1)-\log_{10}(d)$, where $P(d)$ is the probability that the first digit of a number is $d$
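A one-line check of the law (for example, $P(1) = \log_{10} 2 \approx 0.301$, and the probabilities over $d = 1..9$ telescope to 1):

```python
from math import log10

def benford(d):
    # probability that the leading digit is d, for d in 1..9
    return log10(d + 1) - log10(d)

# the telescoping sum over d = 1..9 is log10(10) - log10(1) = 1
probs = [benford(d) for d in range(1, 10)]
```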

Ch 1

Sample space: $S$

event:

  • subset of $S$
  • measurable (can be assigned a probability)
  • not every subset of $S$ is an event

Conditional Probability:

$P(A|B) = \frac{P(A\cap B)}{P(B)}$

Bayes theorem

$P(A) = \sum_i P(A\cap B_i)$ (the law of total probability, for a partition $B_i$ of $S$)

Now we ask: given that $A$ happened, what is the probability of each $B_j$?

$P(B_j|A) = \frac{P(A\cap B_j)}{P(A)}$

$P(A\cap B_j) = P(A|B_j)P(B_j)$

Bayes theorem definition:

Consider a partition $B_j$ of $S$ and event $A$:

$P(B_j|A) = \frac{P(A\cap B_j)}{P(A)}= \frac{P(A|B_j)P(B_j)}{\sum_i P(A|B_i)P(B_i)}$

Monty Hall Problem

Take $B_j = C_2$ and $A = H_3$, both conditioned on $X_1$ (we pick door 1, the host opens door 3, and we ask whether the car is behind door 2).
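The counter-intuitive answer (switching wins with probability $2/3$) can be verified by simulation; a minimal sketch:

```python
import random

def monty_hall(trials=100_000, switch=True):
    # Estimate the contestant's win probability with or without switching.
    wins = 0
    for _ in range(trials):
        car = random.randrange(3)
        pick = random.randrange(3)
        # the host opens a door that hides a goat and is not the pick
        opened = next(d for d in range(3) if d != pick and d != car)
        if switch:
            pick = next(d for d in range(3) if d != pick and d != opened)
        wins += (pick == car)
    return wins / trials
```

Switching should win about 2/3 of the time, staying only about 1/3.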

Ch 2 Discrete Random Variables

Mean: the quantity $a$ that minimizes $E(X-a)^2$

Variance: $E\left[(X-E(X))^2\right]$

Moment generating function (mgf): a summary of the overall random behavior

Median: the quantity $b$ that minimizes $E|X-b|$
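Both minimization claims can be checked numerically on a toy sample (a sketch; simple grid search over candidate values):

```python
import numpy as np

x = np.array([1.0, 2.0, 2.0, 3.0, 10.0])   # mean 3.6, median 2.0
candidates = np.linspace(0.0, 12.0, 1201)  # step 0.01

# E(X - a)^2 is minimized at the mean
sq_loss = [np.mean((x - a) ** 2) for a in candidates]
best_sq = candidates[int(np.argmin(sq_loss))]

# E|X - b| is minimized at the median
abs_loss = [np.mean(np.abs(x - b)) for b in candidates]
best_abs = candidates[int(np.argmin(abs_loss))]
```

Note how the outlier 10 pulls the mean but not the median.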

A new year

May this heart hold no resentment.

Tensor to numpy array

In TensorFlow, kernels (tf.Variable) can be accessed easily via slim.get_model_variables() (which returns a list of tf.Variable) when using the slim package. From each variable we can then get its

  • name: simply via .name
  • value: via sess.run
weights = slim.get_model_variables()
# structure: list of variable names
structure = []
for i in range(len(weights)):
    structure.append(weights[i].name)
# weights: list of variable values (numpy arrays)
weights = self.sess.run(weights, feed_dict={})

Numpy array to file

BAD TRY: numpy.savetxt

The extracted numpy arrays are usually multi-dimensional, so numpy.savetxt is not applicable: it requires a 1-d or 2-d array.

BAD TRY: numpy.save

Pretty easy to use, but only applicable when the file will be loaded back from Python, whereas our project needs to read the weights from C code.

SUCCESSFUL: scipy.io.savemat

numpy to .mat file. It needs a dict to generate the file, which is why we collected the tf.Variable names above.

import numpy as np
import scipy.io

class extractor:
    def __init__(self, filepath):
        self.filepath = filepath

    def tensor_to_file(self, tensor_list, structure):
        mat_dict = {}
        for i in range(len(tensor_list)):
            mat_dict.update({structure[i]: tensor_list[i]})
        scipy.io.savemat(self.filepath, mdict=mat_dict)
        print('.mat file has been saved.')

    def file_to_tensor(self):
        tensor_dic = scipy.io.loadmat(self.filepath)
        print('.mat file has been loaded.')
        return tensor_dic
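A hypothetical round trip with toy arrays, standing in for the extracted kernels. One caveat worth noting: MATLAB variable names must be valid identifiers, so the ':' and '/' in TensorFlow variable names should be sanitized before saving (the names and path below are illustrative):

```python
import os
import tempfile

import numpy as np
import scipy.io

# toy stand-ins for extracted kernels; names mimic tf.Variable names
names = ['conv1/weights:0', 'fc/weights:0']
values = [np.ones((3, 3)), np.arange(6.0).reshape(2, 3)]

# sanitize: .mat variable names cannot contain '/' or ':'
safe = [n.replace('/', '_').replace(':', '_') for n in names]

path = os.path.join(tempfile.gettempdir(), 'toy_weights.mat')
scipy.io.savemat(path, mdict=dict(zip(safe, values)))
loaded = scipy.io.loadmat(path)
```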

  • speeds up training:
    • gradient computation -> 0 or 1
    • the ReLU forward computation is also cheap: 0 or the original value
  • alleviates the vanishing gradient problem
  • sparse output, which reduces overfitting (like L1 regularization)
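These properties are visible in a few lines of numpy (a sketch):

```python
import numpy as np

def relu(z):
    # forward pass: 0 for negative inputs, the input itself otherwise
    return np.maximum(0.0, z)

def relu_grad(z):
    # (sub)gradient: exactly 0 or 1, which is why backprop is cheap
    return (z > 0).astype(float)

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
out = relu(z)        # sparse: negative activations are zeroed out
grad = relu_grad(z)  # every entry is 0 or 1
```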

1. Introduction

We should think more about transformation than about information.
e.g. You expected to be top 3, but came 8th.
Chip away the excess stone, and David emerges.

Two keys to being extraordinarily successful:
1. Believe in yourself.
2. Keep asking questions. Don't be satisfied with an offer.

2. Why positive psychology

Most papers and researchers focus on the negative, such as depression;
few study what works, such as positive feelings.

Why deeper? not fatter?

  • Analogy: using multiple layers of neurons (like logic gates) to represent some functions is much simpler

  • deeper layers -> modularization (more kinds of modules, each needing less training data)

    low layers = basic modules = enough training data

Bad training results?

  • choosing proper loss

    softmax output layer -> cross entropy

    The partial derivative of the cross-entropy cost with respect to $w_j$ is:

    $\begin{eqnarray} \frac{\partial C}{\partial w_j} & = & \frac{1}{n} \sum_x \frac{\sigma'(z) x_j}{\sigma(z) (1-\sigma(z))} (\sigma(z)-y). \tag{60}\end{eqnarray}$

    Because $\sigma'(z) = \sigma(z)(1-\sigma(z))$ for the sigmoid, the factors cancel, leaving:

    $\begin{eqnarray} \frac{\partial C}{\partial w_j} = \frac{1}{n} \sum_x x_j(\sigma(z)-y). \tag{61}\end{eqnarray}$

  • mini-batch

    • for each mini-batch (e.g. n = 1000 examples), update the parameters using only that mini-batch's loss (the total loss is not computed over the whole training set, just the mini-batch).
    • If there are 20 mini-batches, the parameters are updated 20 times per epoch. (Shuffle the training examples before each epoch.)
  • Vanishing Gradient Problem

    lower layers = smaller gradients = slow learning = still almost random

    higher layers = larger gradients = fast learning = already converged

    • ReLU

      1. Fast to compute
      2. Biological motivation
      3. Equivalent to infinitely many sigmoids with different biases
      4. Mitigates the vanishing gradient problem

      $\dfrac{\partial{C}}{\partial{b_{1}}}=\sigma^{\prime}(z_{1})\omega_{2}\sigma^{\prime}(z_{2})\omega_{3}\sigma^{\prime}(z_{3})\omega_{4}\sigma^{\prime}(z_{4})\dfrac{\partial{C}}{\partial{a_{4}}}$

      where each $\sigma^{\prime}(z_i)$ is the derivative of the activation at layer $i$. For the sigmoid $\sigma' \le 0.25$, so the product shrinks with depth; for ReLU, $\sigma'(z) = 1$ in the active region, so it does not.

  • Adaptive Learning Rate

    If the learning rate is too large, the total loss may not decrease after each update

    If the learning rate is too small, training will be too slow

    Reduce the learning rate by some factor every few epochs:

    • Adagrad

      The learning rate shrinks over time for every parameter

      Parameters with smaller past derivatives get a larger learning rate, and vice versa

  • Momentum
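The Adagrad rule above can be sketched in a few lines of numpy (a minimal sketch; the function name, learning rate, and toy objective are illustrative):

```python
import numpy as np

def adagrad_step(w, grad, cache, lr=0.1, eps=1e-8):
    # accumulate the squared gradients seen so far
    cache = cache + grad ** 2
    # divide the step by their root: larger history -> smaller step
    w = w - lr * grad / (np.sqrt(cache) + eps)
    return w, cache

# minimize f(w) = 0.5 * ||w||^2, whose gradient is w itself
w = np.array([3.0, -2.0])
cache = np.zeros_like(w)
for _ in range(500):
    w, cache = adagrad_step(w, w, cache)
```

Each parameter keeps its own `cache`, which is what makes the learning rate per-parameter and ever-shrinking.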

Bad test results? Overfitting?

  • Early Stopping

  • L2 regularization

    Take L2 regularization as an example. Learning is a process of reducing the error by adjusting the parameters $\theta$, but the more nonlinear a parameter is, e.g. the $\theta_4$ attached to $x^3$, the more it tends to get adjusted, because strongly nonlinear parameters make the curve more wiggly and thus fit the scattered data points better. $\theta_4$ says: look how capable I am, let me reshape the model and fit all the data! But this attitude draws a firm rebuke from the error function: no no no, we are a team; however capable you are, we can't rely on you alone. If you're wrong, the whole team's performance collapses, so I have to hold back the show-offs. That is the core idea of regularization. So how do L1 and L2 differ?

    L1:

    encourages sparsity, similar in effect to dropout

  • Weight Decay

    $w \leftarrow 0.99\,w - \eta \cdot \text{gradient}$

  • Dropout

    • Each time before updating the parameters: Each neuron has p% to dropout

    • The structure of the network is changed

    • Using the new network for training

    • For each mini-batch, we resample the dropout neurons

    • If the dropout rate at training time is p%, multiply all the weights by (1-p)% at test time

    • Reason

      • When people team up, if everyone expects their partner to do the work, nothing gets done in the end.

        However, if you know your partner may drop out, you will work harder yourself.

        At test time no one actually drops out, so the results end up good.

      • Dropout is a kind of ensemble

        Train a bunch of networks with different structures
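The train/test scaling rule above can be checked in numpy (a sketch; `p` is the dropout probability from the notes):

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.4  # dropout probability

def dropout_train(x):
    # each neuron is dropped (zeroed) with probability p during training
    mask = rng.random(x.shape) >= p
    return x * mask

def dropout_test(w):
    # at test time nothing is dropped; weights are scaled by (1 - p)
    return w * (1.0 - p)

x = np.ones(10_000)
train_mean = dropout_train(x).mean()  # approximately 1 - p
test_value = dropout_test(1.0)        # exactly 1 - p
```

The scaling keeps the expected activation at test time equal to its training-time average.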

  • $\Delta$ : \Delta
  • $\leftrightarrow$ : \leftrightarrow
  • $\Rightarrow$: \Rightarrow
  • $\neg$: \neg
  • $\wedge$: \wedge
  • $\vee$: \vee
  • $\sum_{n = 0}^{\infty}$: \sum_{n = 0}^{\infty}
  • $\lambda$: \lambda

CSCI3230 HW3 Li Wei 1155062148


1. Information Theory and Logic

a.

Define $P_{front}$ and $P_{back}$ as the probabilities of front and back respectively. Then:

$P_{front}+P_{back} = 1$

$I(V) = 0$: the flip's outcome is certain (always front or always back), i.e. one of the probabilities is 1. With $P_{front}\ or\ P_{back} = 1$, $I(V) = - 0\times \log_2 0 - 1\times \log_2 1 = 0$ (taking $0 \log_2 0 = 0$).

$I(V) = \log_2 n$: front and back are equally likely, 0.5 each. With $P_{front} = P_{back} = 0.5$, $I(V) = - 0.5\times \log_2 0.5 - 0.5\times \log_2 0.5 = \log_2 2 = 1$.
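Both cases can be checked with a small entropy function (a sketch; the $0\log_2 0 = 0$ convention is handled explicitly):

```python
from math import log2

def info(probs):
    # I(V) = -sum p * log2(p), taking 0 * log2(0) = 0
    return -sum(p * log2(p) for p in probs if p > 0)

certain_coin = info([1.0, 0.0])  # deterministic outcome
fair_coin = info([0.5, 0.5])     # uniform over n = 2 outcomes
```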

b.

R1:

$Pass(x,computer) \wedge Win(x,prize)\rightarrow Happy(x)\Rightarrow $

$\neg(Pass(x,computer) \wedge Win(x,prize)) \vee Happy(x) \Rightarrow$

$\neg Pass(x,computer) \vee \neg Win(x,prize) \vee Happy(x) $

R2:

$Study(y) \vee Lucky(y) \rightarrow Pass(y,z) \Rightarrow$

$\neg(Study(y) \vee Lucky(y)) \vee Pass(y,z) \Rightarrow$

$(\neg Study(y) \wedge \neg Lucky(y)) \vee Pass(y,z) \Rightarrow$

$(\neg Study(y) \vee Pass(y,z)) \wedge(\neg Lucky(y)\vee Pass(y,z))$

R3:

$Lucky(w) \rightarrow Win(w,prize) \Rightarrow$

$\neg Lucky(w) \vee Win(w,prize) $

(figure: hw3_1)

2. Neural Network

a.

$O = f(\sum_{j=1}^{n} (w_jI_j)+ w_o)$

b.

$h_{i,k} = f(\sum_{j = 1}^{H_{i-1}}(w_{i-1,j,k}h_{i-1,j}))$

$O_m =f(\sum_{j = 1}^{H_{K}}(w_{K,j,m}h_{K,j})) $

c.

$f'(z) = (\frac{1}{1+e^{-z}})' = \frac{e^{-z}}{(1+e^{-z})^2} = \frac{1}{1+e^{-z}}\times \frac{e^{-z}}{1+e^{-z}}= f(z)(1-f(z))$
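The closed form can be verified against a finite difference (a sketch):

```python
from math import exp

def f(z):
    # the logistic sigmoid
    return 1.0 / (1.0 + exp(-z))

def f_prime(z):
    # closed form derived above: f'(z) = f(z) * (1 - f(z))
    return f(z) * (1.0 - f(z))

def f_prime_numeric(z, h=1e-6):
    # central finite difference for comparison
    return (f(z + h) - f(z - h)) / (2.0 * h)
```

At $z = 0$ both give $0.25$, the maximum of $f'$.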

d.

The learning rate controls the size of the correction applied at each training step: the bigger the learning rate, the more drastic the change. With a learning rate of 1 (i.e. no damping at all) we would most likely keep overshooting and diverge away from the minimum. We therefore need a learning rate to control how fast the weights change, and a smaller one is usually safer.

e.
  1. $\frac{\delta E}{\delta w_{K,j,k}} = \frac{\delta E}{\delta O_k}\cdot \frac{\delta O_k}{\delta w_{K,j,k}}= (O_k - T_k)\cdot O_k\cdot(1-O_k)\cdot h_{K,j}$ (because $E = 0.5\sum_{m=1}^{H_{K+1}}(O_m - T_m)^2\ and\ O_k = f(\sum_{j = 1}^{H_{K}}(w_{K,j,k}h_{K,j}))$)

  2. Considering $E=F(h_{i+2,1}, \dots, h_{i+2,H_{i+2}})$ and $h_{i+2,j} = G(h_{i+1,k})$, by the multivariate chain rule:

    $\frac{\delta E}{\delta h_{i+1,k}} = \sum_{j=1}^{H_{i+2}}(\frac{\delta E}{\delta h_{i+2,j}} \cdot \frac{\delta h_{i+2,j}}{\delta h_{i+1,k}})$

    $=\sum_{j=1}^{H_{i+2}}(\frac{\delta E}{\delta h_{i+2,j}}\cdot h_{i+2,j}\cdot (1-h_{i+2,j})\cdot w_{i+1,k,j}) = \sum_{j=1}^{H_{i+2}}\Delta_{i+2,j} \cdot w_{i+1,k,j}$

  3. By the chain rule, we have

    $\frac{\delta E}{\delta w_{i,j,k}} = \frac{\delta E}{\delta h_{i+1,k}} \cdot \frac{\delta h_{i+1,k}}{\delta w_{i,j,k}}$

    And $\frac{\delta h_{i+1,k}}{\delta w_{i,j,k}} = f'(\sum_{j = 1}^{H_{i}}(w_{i,j,k}h_{i,j}))\cdot h_{i,j}= h_{i+1,k}\cdot (1-h_{i+1,k})\cdot h_{i,j}$

    Therefore, the result is:

    $\frac{\delta E}{\delta w_{i,j,k}} = (\sum_{j=1}^{H_{i+2}}\Delta_{i+2,j} \cdot w_{i+1,k,j})\cdot h_{i+1,k}\cdot (1-h_{i+1,k})\cdot h_{i,j}$


  4. def BP(network, examples, a) returns a modified network:

    INPUTS:
    network, a multilayer network
    examples, a set of data/label pairs
    a, learning rate

    Repeat:

    for each example e in examples:

      $O \leftarrow Run(network, e)$

      for each neuron k in the output layer:

        $\Delta_{K+1,k} \leftarrow (O_k - T_k)\cdot O_k\cdot(1-O_k)$

      for each weight connected to that neuron:

        $W_{K,j,k} \leftarrow W_{K,j,k} - a\cdot \Delta_{K+1,k} \cdot h_{K,j}$

      for each sub-layer i except the first layer, from top to bottom:

        for each neuron k in the sub-layer:

          $\Delta_{i,k} \leftarrow (\sum_{j=1}^{H_{i+1}}\Delta_{i+1,j} \cdot w_{i,k,j})\cdot h_{i,k}\cdot (1-h_{i,k})$

        for each weight connected to the corresponding neuron in the current sub-layer:

          $W_{i,j,k} \leftarrow W_{i,j,k} - a\cdot \Delta_{i+1,k} \cdot h_{i,j}$

    Until the network has converged

    return network
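The delta formulas in steps 1-3 can be gradient-checked numerically on a tiny network (a sketch in numpy; the sizes, inputs, and function names are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss_and_grads(W1, W2, x, t):
    # forward pass through a 2-3-1 network
    h = sigmoid(W1 @ x)
    O = sigmoid(W2 @ h)
    E = 0.5 * np.sum((O - t) ** 2)
    # backward pass using the deltas derived above
    d_out = (O - t) * O * (1 - O)          # output-layer delta
    d_hid = (W2.T @ d_out) * h * (1 - h)   # back-propagated hidden delta
    return E, np.outer(d_hid, x), np.outer(d_out, h)

x = np.array([0.3, -0.7])
t = np.array([1.0])
W1 = rng.normal(size=(3, 2))
W2 = rng.normal(size=(1, 3))

E, gW1, gW2 = loss_and_grads(W1, W2, x, t)

# compare one analytic entry against a central finite difference
eps = 1e-6
W1p = W1.copy(); W1p[0, 0] += eps
W1m = W1.copy(); W1m[0, 0] -= eps
numeric = (loss_and_grads(W1p, W2, x, t)[0]
           - loss_and_grads(W1m, W2, x, t)[0]) / (2 * eps)
```

If the deltas are right, `numeric` and the analytic entry agree to several decimal places.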