<p align="justify">Fractal (feed generated 2020-05-25, http://www.fractalai.org/feed.xml) is an organization that provides guides for topics related to computer science and artificial intelligence. We also support computer science education by combining programming with other subjects.</p>
<h2>Python Basics (2020-05-22, http://www.fractalai.org/cs/2020/05/22/python-basics)</h2>
<h3 id="introduction">Introduction</h3>
<p align="justify">Python is a versatile programming language used in everything from simple scripts to entire applications. It was created in the late 80s by Guido van Rossum and has seen increased adoption in recent years, particularly in the data science and machine learning communities. We are using Python because it is syntactically easy to read compared to other languages, is beginner friendly, and has many packages that we can use for anything we want to build.</p>
<p align="justify">Python files have the ".py" file extension and can be run in the terminal (Command Prompt for Windows users) by running</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>python filename.py
</code></pre></div></div>
<p align="justify">where "filename" is replaced with the name of your python file, while in the directory (folder) that the file is in. You can interactively test python code by simply running</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>python
</code></pre></div></div>
<p align="justify">in the terminal. This will open a python shell where you can type and run code line by line.</p>
<h3 id="primitive-types">Primitive Types</h3>
<p align="justify">The most basic data types in python are the primitive types, which include:</p>
<ul>
<li>int (integers, example: 1)</li>
<li>float (floating point numbers, example: 0.25)</li>
<li>str (words and characters, example: ‘a’)</li>
<li>bool (true and false, example: True)</li>
</ul>
<p align="justify">We can assign values that have these types to variables in our program using the = operator (the assignment operator). Anything following "#" in a python program is a code comment, meaning it will not be considered code in the program. Comments help the programmer and others who are reading the code understand what the code does.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">x</span> <span class="o">=</span> <span class="mi">1</span> <span class="c1"># the variable x is of type int
</span><span class="n">y</span> <span class="o">=</span> <span class="mf">1.0</span> <span class="c1"># the variable y is of type float
</span>
<span class="n">word</span> <span class="o">=</span> <span class="s">'this is a string'</span> <span class="c1"># the variable word is of type str
</span><span class="n">word</span> <span class="o">=</span> <span class="s">"this is a string"</span> <span class="c1"># we can use either '' or "" to define a string
</span><span class="n">program_running</span> <span class="o">=</span> <span class="bp">True</span> <span class="c1"># the variable program_running is of type bool
</span>
<span class="c1"># We can perform math on floats and ints
</span><span class="n">z</span> <span class="o">=</span> <span class="n">x</span> <span class="o">+</span> <span class="n">y</span>
<span class="n">z</span> <span class="o">=</span> <span class="p">(</span><span class="n">z</span> <span class="o">+</span> <span class="mi">45</span><span class="p">)</span> <span class="o">/</span> <span class="n">y</span> <span class="o">-</span> <span class="p">(</span><span class="mf">171.4256</span> <span class="o">*</span> <span class="mf">0.5</span><span class="p">)</span>
<span class="c1"># We cannot combine ints and float with strings
</span><span class="n">bad_code</span> <span class="o">=</span> <span class="n">x</span> <span class="o">+</span> <span class="n">word</span> <span class="c1"># this will error
</span></code></pre></div></div>
<p align="justify">The above code defines some variables of different types and manipulates them. Note how some types cannot be combined. However, we can append strings together, and we can convert an int to a string and add it to another string.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">num_apples</span> <span class="o">=</span> <span class="mi">5</span>
<span class="n">sentence_begin</span> <span class="o">=</span> <span class="s">'I have '</span>
<span class="n">sentence_end</span> <span class="o">=</span> <span class="s">' apples'</span>
<span class="c1"># The str() function converts primitives of other types into a string
</span><span class="n">sentence</span> <span class="o">=</span> <span class="n">sentence_begin</span> <span class="o">+</span> <span class="nb">str</span><span class="p">(</span><span class="n">num_apples</span><span class="p">)</span> <span class="o">+</span> <span class="n">sentence_end</span>
<span class="k">print</span><span class="p">(</span><span class="n">sentence</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>output:
I have 5 apples
</code></pre></div></div>
<p align="justify">Just like the str() function converts other types to strings, there are int() and float() functions that convert to int and float respectively. They don't work with every input though; float('hello') will give you an error.</p>
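<p align="justify">A quick sketch of how these conversion functions behave (the values here are arbitrary examples):</p>

```python
count = int("42")      # string to int
price = float("3.14")  # string to float
whole = int(9.99)      # int() truncates the decimal part, giving 9
print(count, price, whole)
# A string that doesn't look like a number raises an error:
# float("hello")  # ValueError
```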
<h3 id="data-structures">Data Structures</h3>
<p align="justify">Data structures are more advanced ways to store data that prove very useful in a variety of scenarios. We will cover two commonly used data structures, lists and dictionaries, but there are many more. Just like primitive types, lists and dictionaries are their own types in python, but they are not primitives.</p>
<h4 id="lists">Lists</h4>
<p align="justify">A list in python is a way to store sequential elements of information. They are created using square brackets. You can retrieve a particular element from a list by doing something called indexing, which means you pass in the position of the element you want in the list (see code below).</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">mylist</span> <span class="o">=</span> <span class="p">[</span><span class="mi">4</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">7</span><span class="p">,</span> <span class="mi">6</span><span class="p">,</span> <span class="mi">10</span><span class="p">,</span> <span class="mi">245</span><span class="p">]</span> <span class="c1"># a list of ints
</span>
<span class="n">element</span> <span class="o">=</span> <span class="n">mylist</span><span class="p">[</span><span class="mi">3</span><span class="p">]</span> <span class="c1"># use brackets to index into the list
</span><span class="n">another_element</span> <span class="o">=</span> <span class="n">mylist</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="c1"># indices begin at 0 in python!!! <-- IMPORTANT
</span>
<span class="c1"># Accessing the last element in a list
</span><span class="n">last_element</span> <span class="o">=</span> <span class="n">mylist</span><span class="p">[</span><span class="nb">len</span><span class="p">(</span><span class="n">mylist</span><span class="p">)</span> <span class="o">-</span> <span class="mi">1</span><span class="p">]</span>
<span class="c1"># Lists can hold elements of differing types
</span><span class="n">another_list</span> <span class="o">=</span> <span class="p">[</span><span class="mi">1</span><span class="p">,</span> <span class="mf">24.6</span><span class="p">,</span> <span class="s">"an element"</span><span class="p">,</span> <span class="bp">True</span><span class="p">]</span>
<span class="c1"># You can also add elements to an existing list later on
</span><span class="n">mylist</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="mi">27</span><span class="p">)</span>
</code></pre></div></div>
<p align="justify">Note that indices begin at 0 in python. There is a very good theoretical reason for this that I won't go into here, but just remember the first element of a list is always at index 0. This means the final index of a list is the length of the list minus 1. The length of a python list can be accessed using the len() function. To add elements to the end of an existing list, use the append() method.</p>
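<p align="justify">As a side note, python also supports negative indices, which count backwards from the end of a list. This gives a shorter way to access the last element:</p>

```python
mylist = [4, 2, 7, 6, 10, 245]
last_element = mylist[-1]    # 245, same as mylist[len(mylist) - 1]
second_to_last = mylist[-2]  # 10
print(last_element, second_to_last)
```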
<h4 id="dictionaries">Dictionaries</h4>
<p align="justify">A dictionary in python is a data structure that acts as a mapping from keys to values. A dictionary is created using curly braces. Values can be accessed using the name of the key inside square brackets.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Define the dictionary
</span><span class="n">fruits</span> <span class="o">=</span> <span class="p">{}</span>
<span class="c1"># Add some key-value pairs
</span><span class="n">fruits</span><span class="p">[</span><span class="s">'apples'</span><span class="p">]</span> <span class="o">=</span> <span class="mi">10</span>
<span class="n">fruits</span><span class="p">[</span><span class="s">'oranges'</span><span class="p">]</span> <span class="o">=</span> <span class="mi">6</span>
<span class="n">fruits</span><span class="p">[</span><span class="s">'bananas'</span><span class="p">]</span> <span class="o">=</span> <span class="mi">2400</span>
<span class="n">num_bananas</span> <span class="o">=</span> <span class="n">fruits</span><span class="p">[</span><span class="s">'bananas'</span><span class="p">]</span>
<span class="k">print</span><span class="p">(</span><span class="n">num_bananas</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="n">fruits</span><span class="p">[</span><span class="s">'oranges'</span><span class="p">]</span> <span class="o"><</span> <span class="mi">7</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>output:
2400
True
</code></pre></div></div>
<p align="justify">One important property of dictionaries is that keys must be unique, while values do not have to be. This means we cannot define two entries in the dictionary whose keys are both 'apples'. Some useful methods (functions that are specific to a particular type) that can be called on dictionaries are keys() and values().</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">keys</span> <span class="o">=</span> <span class="n">fruits</span><span class="o">.</span><span class="n">keys</span><span class="p">()</span> <span class="c1"># the keys method returns a list of the keys in the dictionary
</span><span class="k">print</span><span class="p">(</span><span class="n">keys</span><span class="p">)</span>
<span class="n">vals</span> <span class="o">=</span> <span class="n">fruits</span><span class="o">.</span><span class="n">values</span><span class="p">()</span> <span class="c1"># the values method returns a list of the values in the dictionary
</span><span class="k">print</span><span class="p">(</span><span class="n">vals</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>output:
['apples', 'oranges', 'bananas']
[10, 6, 2400]
</code></pre></div></div>
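<p align="justify">One more useful operation: the "in" keyword checks whether a key exists in a dictionary, which lets you avoid errors when looking up keys that might be missing:</p>

```python
fruits = {'apples': 10, 'oranges': 6, 'bananas': 2400}
print('apples' in fruits)  # True
print('grapes' in fruits)  # False
```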
<h3 id="logical-statements">Logical Statements</h3>
<p align="justify">Just like any programming language, python has some basic logical statements that can be used to encode logic into your program. The main ones we will use are if statements, for loops, and while loops.</p>
<h4 id="if">If</h4>
<p align="justify">If statements can be used to separate code that executes based on a condition. If blocks are broken up into if, elif, and else, which checks a condition, checks a condition if the previous condition was false, and provides some code to default to if all the previous conditions were false, respectively.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">x</span> <span class="o">=</span> <span class="mi">5</span>
<span class="k">if</span> <span class="n">x</span> <span class="o">==</span> <span class="mi">4</span><span class="p">:</span>
    <span class="k">print</span><span class="p">(</span><span class="s">"x is 4!"</span><span class="p">)</span>
<span class="k">elif</span> <span class="n">x</span> <span class="o">==</span> <span class="mi">5</span><span class="p">:</span>
    <span class="k">print</span><span class="p">(</span><span class="s">"x is 5!"</span><span class="p">)</span>
<span class="k">else</span><span class="p">:</span>
    <span class="k">print</span><span class="p">(</span><span class="s">"x is neither 4 nor 5"</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>output:
x is 5!
</code></pre></div></div>
<p align="justify">In the if statement above, the condition is x == 4. Note the difference between the assignment operator "=", which assigns a value to a variable, and the equality operator "==", which checks whether two values are equal. Since x is not 4, the code inside that if block does not execute. The program moves on to the next statement, the elif, which checks if x is equal to 5. Since x is 5, the output of this program is "x is 5!". The program then skips past the rest of the block; there is no need to look at the else statement because a branch has already run. Note that you are not required to include an elif or an else statement; an if statement can stand on its own, as the next example shows.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">if</span> <span class="bp">True</span> <span class="o">==</span> <span class="bp">True</span><span class="p">:</span>
    <span class="k">print</span><span class="p">(</span><span class="s">"This is a tautology!"</span><span class="p">)</span>
<span class="k">if</span> <span class="bp">True</span> <span class="o">==</span> <span class="bp">False</span><span class="p">:</span>
    <span class="k">print</span><span class="p">(</span><span class="s">"This is a fallacy!"</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>output:
This is a tautology!
</code></pre></div></div>
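<p align="justify">Conditions can also be combined using the "and", "or", and "not" keywords. A small sketch:</p>

```python
x = 5
if x > 0 and x < 10:
    print("x is between 0 and 10")
if x == 4 or x == 5:
    print("x is 4 or 5")
if not x == 4:
    print("x is not 4")
```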
<h4 id="for-and-while-loops">For and While Loops</h4>
<p align="justify">For loops are a way to loop over the same block of code a predefined number of times. The canonical way to construct a for loop in python is to use the range() function, which generates a sequence of numbers to iterate over.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">3</span><span class="p">):</span>
    <span class="k">print</span><span class="p">(</span><span class="n">i</span><span class="p">)</span>
<span class="n">mylist</span> <span class="o">=</span> <span class="p">[</span><span class="mi">10</span><span class="p">,</span> <span class="mi">12</span><span class="p">,</span> <span class="mi">14</span><span class="p">]</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">mylist</span><span class="p">)):</span>
    <span class="k">print</span><span class="p">(</span><span class="n">mylist</span><span class="p">[</span><span class="n">i</span><span class="p">])</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>output:
0
1
2
10
12
14
</code></pre></div></div>
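<p align="justify">Python also lets you loop directly over the elements of a list without using range() and indices; this loop prints the same values as the indexed version above:</p>

```python
mylist = [10, 12, 14]
for element in mylist:
    print(element)
```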
<p align="justify">While loops, on the other hand, execute a block of code repeatedly as long as a condition remains true.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">i</span> <span class="o">=</span> <span class="mi">0</span>
<span class="k">while</span> <span class="n">i</span> <span class="o"><</span> <span class="mi">4</span><span class="p">:</span>
    <span class="k">print</span><span class="p">(</span><span class="n">i</span><span class="p">)</span>
    <span class="n">i</span> <span class="o">=</span> <span class="n">i</span> <span class="o">+</span> <span class="mi">1</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>output:
0
1
2
3
</code></pre></div></div>
<p align="justify">In the above code, the block inside the while loop executed until i was no longer less than 4. If we never incremented i at every iteration of the loop, this loop would have continued forever!</p>
<h3 id="functions">Functions</h3>
<p align="justify">We have already seen some built-in functions that come with python, namely len(), range(), and print(). But we can also define our own functions that take input parameters and do some computation. This is useful when we need to reuse the same code several times, so we don't have to keep rewriting it. To define a function, we use the "def" keyword; to return a value from a function, we use the "return" keyword.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">add</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">):</span>
    <span class="k">return</span> <span class="n">x</span> <span class="o">+</span> <span class="n">y</span>
<span class="n">result</span> <span class="o">=</span> <span class="n">add</span><span class="p">(</span><span class="mi">5</span><span class="p">,</span> <span class="mi">7</span><span class="p">)</span>
</code></pre></div></div>
<p align="justify">The above function is very simple, but we can do all sorts of crazy computation using functions, and we can also compose functions together.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">get_max_if_below_x</span><span class="p">(</span><span class="n">list_</span><span class="p">,</span> <span class="n">x</span><span class="p">):</span>
    <span class="n">maximum</span> <span class="o">=</span> <span class="nb">max</span><span class="p">(</span><span class="n">list_</span><span class="p">)</span>
    <span class="k">if</span> <span class="n">maximum</span> <span class="o"><</span> <span class="n">x</span><span class="p">:</span>
        <span class="k">return</span> <span class="n">maximum</span>
    <span class="k">else</span><span class="p">:</span>
        <span class="k">return</span> <span class="mi">0</span>
<span class="n">mylist</span> <span class="o">=</span> <span class="p">[</span><span class="mi">1</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mi">5</span><span class="p">,</span> <span class="mi">7</span><span class="p">,</span> <span class="mi">9</span><span class="p">]</span>
<span class="k">print</span><span class="p">(</span><span class="n">get_max_if_below_x</span><span class="p">(</span><span class="n">mylist</span><span class="p">,</span> <span class="mi">10</span><span class="p">))</span>
<span class="k">print</span><span class="p">(</span><span class="n">get_max_if_below_x</span><span class="p">(</span><span class="n">mylist</span><span class="p">,</span> <span class="mi">9</span><span class="p">))</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>output:
9
0
</code></pre></div></div>
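<p align="justify">To illustrate composing functions, here is a small sketch where the result of one function call is passed directly into another (the helper names here are made up for the example):</p>

```python
def add(x, y):
    return x + y

def double(x):
    return add(x, x)

# add(2, 3) evaluates to 5, then double(5) evaluates to 10
result = double(add(2, 3))
print(result)
```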
<hr />
<p align="justify">This guide introduced some basic types in python, a couple useful data structures, logical statements, and custom functions. These are the basic steps needed to create more complex programs using python. I encourage you to get creative with writing custom programs and if you get stuck, be proactive about searching for answers online. I'll leave you with some resources that helped me when I was learning python - good luck!</p>
<ul>
<li><a href="https://automatetheboringstuff.com/">A great online python textbook</a></li>
<li><a href="https://stackoverflow.com/questions/tagged/python-3.x">StackOverflow</a></li>
</ul>
<h2>Innovation and Trade (2020-05-18, http://www.fractalai.org/proj/2020/05/18/innovation-and-trade)</h2>
<h3 id="introduction">Introduction</h3>
<p align="justify">This project allows students to gain hands-on programming experience while they learn topics from world history. Specifically, students simulate trade on the Silk Road and observe the effects innovation had on trade at the time. Students will then be able to plot and analyze their results, gaining programming and data analysis practice while learning history concepts. This assignment is targeted at AP World History students.</p>
<h4 id="prerequisites">Prerequisites</h4>
<p align="justify">This assignment requires the students to have an extremely basic understanding of python and logical thinking. If the student has no programming experience, they can likely still complete the assignment by referencing the <a href="http://www.fractalai.org/guides">Python Basics</a> guide. In addition to this, you must have python and jupyter installed on your computer, which is covered in the <a href="http://www.fractalai.org/guides">Getting Started with Python</a> guide.</p>
<h3 id="getting-started">Getting Started</h3>
<p align="justify">First, you have to download the files for the assignment, which are all neatly zipped up in a folder. Click <a href="http://www.fractalai.org/myprojects/innovation-and-trade.zip">here</a> to download the zipped folder. After you download it, the folder will be in your downloads folder on your computer and you can unzip it from there and move the unzipped folder to whatever location you like. Now you'll need to navigate to that folder on your computer via your terminal (on macOS) or command prompt (on Windows).</p>
<h4 id="for-mac-users">For Mac Users</h4>
<p align="justify">Open the Finder application and navigate to the folder you just unzipped. Right click the folder in Finder and select "New Terminal at Folder"</p>
<div class="container" style="padding: 10px;">
<center>
<div class="col-lg-10 col-md-10 col-sm-12 col-xs-12">
<img src="/assets/Common/mac-terminal.png" />
</div>
</center>
</div>
<p align="justify">When the terminal opens, type</p>
<p><code class="language-plaintext highlighter-rouge">jupyter notebook</code></p>
<p align="justify">This will start a jupyter notebook in the browser.</p>
<h4 id="for-windows-users">For Windows Users</h4>
<p align="justify">Navigate to the folder you downloaded, then right click the folder while holding shift. You should see an option to open a "Command Prompt window here"; click on that. In the photo below, Command Prompt is replaced with PowerShell on my computer, but for our purposes it's the same thing.</p>
<div class="container" style="padding: 10px;">
<center>
<div class="col-lg-10 col-md-10 col-sm-12 col-xs-12">
<img src="/assets/Common/windows-terminal.png" />
</div>
</center>
</div>
<p align="justify">When the command prompt opens, type</p>
<p><code class="language-plaintext highlighter-rouge">jupyter notebook</code></p>
<p align="justify">This will start a jupyter notebook in the browser.</p>
<h3 id="using-the-jupyter-notebook">Using the Jupyter Notebook</h3>
<p align="justify">After the notebook starts in your browser, click on the "innovation-and-trade.ipynb" file to open the interactive notebook. This notebook has a description of the assignment for the students and the code is grouped together in blocks. Each block can be run sequentially by clicking the "Run" button in the top menu.</p>
<h3 id="the-project">The Project</h3>
<p align="justify">Now that the students are set up with the interactive notebook, they can step through the code blocks and simulate the trade of different empires involved with the Silk Road. Encourage them to add new empires with different goods based on their research about the time period, as well as to explore different ways of modifying the innovation index to see how that affects trade.</p>
<p align="justify">The deliverable for this project will be a two or three page paper where students explain the main players in Silk Road trade and how innovation affected trade. Students should include the graphs from their code and interpret their meaning.</p>
<h2>Getting Started with Python (2020-05-03, http://www.fractalai.org/cs/2020/05/03/getting-started-with-python)</h2>
<p>Placeholder</p>
<h2>Neural Networks (2020-04-12, http://www.fractalai.org/ml/2020/04/12/neural-networks)</h2>
<p>$\newcommand{\norm}[1]{\left\lVert#1\right\rVert}$</p>
<p>Before we begin, I should preface this by saying this will be a lot of information, and I assume you are at least somewhat familiar with how classifiers and loss functions work, but I will provide a brief recap. This guide will be structured in the following order.</p>
<ol>
<li>A recap of classifiers, loss functions, and regularization</li>
<li>Introduction to neural network structure and the forward pass</li>
<li>Training neural networks and backpropagation</li>
</ol>
<p>This guide is key for understanding later concepts, particularly those in deep learning. With that said, let’s get started.</p>
<h3 id="recap">Recap</h3>
<h4 id="classifiers">Classifiers</h4>
<p>Recall training classifiers in the context of a supervised learning problem. We have a set of data $\{(x_{1}, y_{1}), …, (x_{m}, y_{m})\}$, where each $x_{i} \in X$ is the input to our classifier (e.g. an image or a block of text or simply a set of abstract features that represent something else) and each $y_{i} \in Y$ is the label associated with that input (e.g. cat, dog, ship). We are trying to learn a function $f : X \rightarrow Y$ that is a good approximation of the relationship between our input and the associated output. We also have a hypothesis space $H = \{h | h : X \rightarrow Y\}$ which is the set of all functions that map $X$ to $Y$ in a certain way. For example, we might only be considering linear classifiers, which are functions of the form $h(x; \mathbf{W, b}) = \mathbf{W}x + \mathbf{b}$ (linear mappings of x onto the output space).</p>
<p>For the simplistic example of a linear classifier trained to take in images and classify them as an image of a dog or not an image of a dog, we can visualize the classifier like below</p>
<center>
<div class="col-lg-8 col-md-8 col-sm-12 col-xs-12">
<img src="/assets/Neural_Networks/linear-dog.png" />
</div>
</center>
<p>where the bias term has been left out for simplicity.</p>
<h4 id="loss-and-regularization">Loss and Regularization</h4>
<p>Classifiers usually output either a score for each of its possible classes or a probability mass function over the possible classes. For example, a linear classifier trained to classify across 10 different classes would output a 10-dimensional vector where each entry is the score for that class (i.e. how strongly the classifier feels that the input belongs to each class). We need some way to measure the performance of our classifier, which is where the loss function comes in. The loss function answers the question “how good is my classifier with respect to my data?”. You might imagine we have some function $\ell_{i}(x_{i}, y_{i})$ that takes in the input and its corresponding label and tells us how good our classifier does at predicting $y_{i}$. Then we can sum up the loss for each data point to get our total loss.</p>
<script type="math/tex; mode=display">L(X, Y) = \frac{1}{M}\sum_{i=1}^{M}\ell_{i}(x_{i}, y_{i})</script>
<p>There are many different choices for $\ell(x, y)$ depending on the context of our problem and the type of classifier we have. The loss function is a design choice. The loss function I will introduce explicitly is the one we will use as I continue to introduce neural networks because it works well with the problem of image classification; it’s called the softmax loss or cross-entropy loss.</p>
<h5 id="softmax-loss">Softmax loss</h5>
<p>Recall the softmax function which takes the output vector of scores $\mathbf{s}$ from our classifier and tells us the probability of each class $k$. $P(y = k | \mathbf{s}) = \frac{e^{\mathbf{s_{k}}}}{\sum_{j}e^{\mathbf{s_{j}}}}$</p>
<p>The softmax loss tries to maximize the log-likelihood of the correct class, or equivalently minimize the negative log-likelihood of the correct class</p>
<script type="math/tex; mode=display">\ell_{i} = -\log P(y = y_{i} | x = x_{i}) = -\log \frac{e^{s_{y_{i}}}}{\sum_{j}e^{s_{j}}}</script>
<p>where $\mathbf{s}$ is again the scores given by the classifier which of course depends on $x_{i}$.</p>
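<p>As an illustrative sketch (not part of the derivation above), the softmax probabilities and the loss for a single example can be computed with NumPy as follows; the score vector here is a made-up example:</p>

```python
import numpy as np

scores = np.array([2.0, 1.0, 0.1])  # example score vector s for 3 classes
correct_class = 0                   # suppose the true label y_i is class 0

# Subtracting the max is a standard trick for numerical stability;
# it does not change the resulting probabilities
shifted = scores - np.max(scores)
probs = np.exp(shifted) / np.sum(np.exp(shifted))  # P(y = k | s) for each k

loss = -np.log(probs[correct_class])  # negative log-likelihood of the true class
print(probs, loss)
```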
<h5 id="regularization">Regularization</h5>
<p>Typically we add a regularization term to our loss to prevent our model overfitting the training data too much. The traditional way to do this is by adding an extra term to the loss function.</p>
<p>$L = \frac{1}{M}\sum_{i=1}^{M}\ell_{i}(x_{i}, y_{i}) + \lambda R(\mathbf{W})$</p>
<p>where $R$ is the regularization function and $\lambda$ is the regularization strength, a hyperparameter. A common choice for $R$ is $L_{2}$ regularization</p>
<script type="math/tex; mode=display">R(\mathbf{W}) = \sum_{i=0}^{K-1}\sum_{j=0}^{D-1}(W[i, j])^{2}</script>
<p>where $\mathbf{W} \in \mathbb{R}^{K \times D}$.</p>
<h3 id="neural-networks-and-forward-propagation">Neural Networks and Forward Propagation</h3>
<p>Our simple linear classifier we mentioned before is a function that looks like $f(X) = WX$; it’s a linear function. But some data cannot be perfectly classified with a linear function. Consider the example below with input data $x \in \mathbb{R}^{2}$ and the output is one of two classes (i.e. $y \in \{+, -\}$).</p>
<center>
<div class="col-lg-8 col-md-8 col-sm-12 col-xs-12">
<img src="/assets/Neural_Networks/inseparable.png" />
</div>
</center>
<p>We can’t draw a line anywhere in the input space that perfectly classifies this dataset; such a dataset is called linearly inseparable. It’s clear that we need some sort of non-linear function to achieve perfect classification. One such function can be depicted by the separating boundary below, which is clearly nonlinear.</p>
<center>
<div class="col-lg-8 col-md-8 col-sm-12 col-xs-12">
<img src="/assets/Neural_Networks/separable.png" />
</div>
</center>
<p>So we have to change our classifier from something that looks like $f(X) = WX$ to something that looks like $f(X) = \text{Nonlinearity}(WX)$. This nonlinearity is called the activation function in neural networks. A simple two-layer neural network might look something like $f(X) = W_{2}\max(0, W_{1}X)$, where the max function acts as the nonlinearity. Pictographically, this function/network can be visualized like below.</p>
<center>
<div class="col-lg-8 col-md-8 col-sm-12 col-xs-12">
<img src="/assets/Neural_Networks/classic-view.png" />
</div>
</center>
<p>The input is flattened into a tall vector, multiplied by $W_{1}$, and run through the activation function to produce the output of the first layer. That output is then multiplied by $W_{2}$ to produce the output of the second layer. One “layer” typically consists of a matrix multiplication followed by the activation function. This type of layer is also referred to as a “fully connected (FC) layer”. In the image above, the middle layer is sometimes called the “hidden layer”.</p>
<p>The flattened input vector is dotted with each row of the weight matrix to produce an output vector which serves as the input to the next layer, after being run through the activation function of course. More concretely, the hidden layer output $h_{1} = \text{ReLU}(W_{1}X)$, where ReLU is the activation function.</p>
<p>As a concrete example, let’s say we have a 28 x 28 image for the input and we want to classify these images across 10 different classes. We will use a hidden layer of size 100, which is chosen somewhat arbitrarily. First we flatten the image into a vector of size 784. Since we want the hidden layer to be output size 100, we are projecting a vector in $\mathbb{R}^{784}$ to $\mathbb{R}^{100}$, and hence $W_{1} \in \mathbb{R}^{100 \times 784}$. The same logic is applied for projecting the output of the hidden layer to the output size which is 10, hence $W_{2} \in \mathbb{R}^{10 \times 100}$.</p>
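<p>The shape bookkeeping above can be checked with a short NumPy sketch (the weights here are random placeholders, not trained values):</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder weights for the 784 -> 100 -> 10 network described above.
W1 = rng.standard_normal((100, 784)) * 0.01
W2 = rng.standard_normal((10, 100)) * 0.01

x = rng.standard_normal(784)       # a flattened 28 x 28 image

h1 = np.maximum(0, W1 @ x)         # hidden layer: ReLU(W1 x), shape (100,)
scores = W2 @ h1                   # output layer: W2 h1, shape (10,)

print(h1.shape, scores.shape)      # (100,) (10,)
```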
<h5 id="how-does-it-work">How Does it Work?</h5>
<p>Consider the case where we want to classify images of hand-written digits into 10 different classes (0-9). The output layer is a 10-dimensional vector where each entry is the score corresponding to each digit. The maximum score’s index is the predicted digit. How does the network determine this score? Consider an input image of a 7. The last layer predicts the digit. The hidden layers learn to predict pieces of the digit in certain positions (i.e. it would learn that a 7 is made up of a horizontal edge toward the top of the image and a slanted edge going down the middle).</p>
<center>
<div class="col-lg-10 col-md-10 col-sm-12 col-xs-12">
<img src="/assets/Neural_Networks/7-classification.png" />
</div>
</center>
<p>In order for the network to detect a horizontal edge at the top of the image, the activations corresponding to those pixels in the input need to be larger, while the surrounding pixels should have low activations. In other words, each neuron applies the activation function to a weighted sum of the previous layer’s outputs, and a large result signals that the feature it detects is present.</p>
<p>Common activation functions include $\text{ReLU}(x) = \max(0, x)$, $\sigma(x) = \frac{1}{1 + e^{-x}}$ (pronounced “sigmoid of x”), and $\tanh(x)$ to name a few.</p>
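<p>Each of these activation functions is a one-liner in NumPy; a quick sketch:</p>

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)          # max(0, x), elementwise

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))  # squashes to (0, 1)

# tanh is available directly as np.tanh and squashes to (-1, 1).
print(relu(-2.0), sigmoid(0.0), np.tanh(0.0))   # 0.0 0.5 0.0
```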
<h3 id="training-neural-networks-and-backpropagation">Training Neural Networks and Backpropagation</h3>
<p>In order to update our model parameters we use the gradient update rule</p>
<script type="math/tex; mode=display">W \leftarrow W - \alpha \nabla_{W} L(X, Y)</script>
<p>which means we need to compute the gradient of the weight matrix for every layer. We can do this in a modular fashion called backpropagation or “backprop”.</p>
<p>In forward propagation, the computation takes in an input and a weight matrix and produces an output. In backprop, we assume we have access to the gradient of the loss w.r.t. the layer output $h_{i}$.</p>
<center>
<div class="col-lg-12 col-md-12 col-sm-12 col-xs-12">
<img src="/assets/Neural_Networks/forward-back.png" />
</div>
</center>
<p>We can use this gradient to compute the gradient of the loss w.r.t. the weight matrix $W_{i}$ using the chain rule. In order to compute the gradients for other layers, we also need to compute the gradient of the loss w.r.t. the input. After we have these gradients, we can simply apply the update rule.</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
\frac{\partial L}{\partial W_{i}} &= \frac{\partial L}{\partial h_{i}}\frac{\partial h_{i}}{\partial W_{i}} \\
\frac{\partial L}{\partial h_{i-1}} &= \frac{\partial L}{\partial h_{i}}\frac{\partial h_{i}}{\partial h_{i-1}}
\end{align*} %]]></script>
<p>For a typical FC layer $h_{i} = \max(0, W_{i}h_{i-1})$, we can step through the computations of the necessary gradients. Let’s define another variable $z = W_{i}h_{i-1}$ so that $h_{i} = \max(0, z)$.</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
\frac{\partial h_{i}}{\partial W_{i}} &= \frac{\partial h_{i}}{\partial z}\frac{\partial z}{\partial W_{i}} \\
&= \mathbb{1}[z \geq 0] \cdot h_{i-1}
\end{align*} %]]></script>
<p>where $\mathbb{1}[\cdot]$ is the indicator function. Similarly we have</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
\frac{\partial h_{i}}{\partial h_{i-1}} &= \frac{\partial h_{i}}{\partial z}\frac{\partial z}{\partial h_{i-1}} \\
&= \mathbb{1}[z \geq 0] \cdot W_{i}
\end{align*} %]]></script>
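<p>These two gradients are all we need for the backward pass of an FC layer. A minimal NumPy sketch (the function names are illustrative, not a library API):</p>

```python
import numpy as np

def fc_forward(W, h_prev):
    z = W @ h_prev
    return np.maximum(0, z), z          # output h_i and cached pre-activation z

def fc_backward(dLdh, z, W, h_prev):
    dLdz = dLdh * (z >= 0)              # ReLU gate: indicator(z >= 0)
    dLdW = np.outer(dLdz, h_prev)       # dL/dW_i, same shape as W
    dLdh_prev = W.T @ dLdz              # dL/dh_{i-1}, passed to the previous layer
    return dLdW, dLdh_prev

rng = np.random.default_rng(1)
W = rng.standard_normal((3, 4))
h_prev = rng.standard_normal(4)
h, z = fc_forward(W, h_prev)
dLdW, dLdh_prev = fc_backward(np.ones(3), z, W, h_prev)
print(dLdW.shape, dLdh_prev.shape)      # (3, 4) (4,)
```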
<p>Now that we know how to compute the necessary gradients for an FC layer, we have everything we need to train our network. Pseudocode for a typical training loop would look like this.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>initialize W1 randomly
initialize W2 randomly
for epoch in range(num_epochs):
    for batch in range(num_batches):
        output = forward_pass(batch)
        loss = loss_function(output, labels)
        dLdW1, dLdW2 = backward_pass(loss)
        W1 = W1 - alpha * dLdW1
        W2 = W2 - alpha * dLdW2
</code></pre></div></div>$\newcommand{\norm}[1]{\left\lVert#1\right\rVert}$Convolutional Neural Networks2020-02-29T08:06:43+00:002020-02-29T08:06:43+00:00http://www.fractalai.org/dl/2020/02/29/cnns<h3 id="convolutional-layers">Convolutional Layers</h3>
<p>Recall fully-connected (FC) neural networks, in which each feature in the input is connected to every neuron in the first layer. For images, this means that every pixel has a weight corresponding to each neuron in the next layer.</p>
<script type="math/tex; mode=display">f_{\text{FC}}(X; \mathbf{W_{1}, W_{2}}) = \mathbf{W_{2}}\max(0, \mathbf{W_{1}}X)</script>
<p>As you might imagine, this requires us to maintain an extremely large number of parameters. For example, consider a 28 by 28 image being input into a two-layer FC network with 100 neurons in the first layer and 10 neurons in the output layer. First we flatten our image into a 28 x 28 = 784 dimensional vector. This vector is being projected into 100 dimensions, so $\mathbf{W_{1}}$ will be of shape $100 \times 784$, or equivalently, $\mathbf{W_{1}}\in \mathbb{R}^{100 \times 784}$. That 100 dimensional vector then has to pass through the output layer and be turned into a 10 dimensional vector, so $\mathbf{W_{2}}\in \mathbb{R}^{10 \times 100}$. So just with this simple network on a small image, we are using $100 \cdot 784 + 10 \cdot 100 = 79400$ parameters!</p>
<p>What if instead we considered groups of locally-connected elements in the input? In other words, imagine a filter that gathers local parts of the input, performs some operation on them, and outputs the result. In this way, each element of the output would only result from a small locally-connected group in the input. Certainly this would result in fewer parameters, since every element of the input is not connected to each element in the output; rather, a subset of the elements in the input is connected to each element of the output.</p>
<center>
<div class="col-lg-6 col-md-6 col-sm-12 col-xs-12">
<img src="/assets/CNNs/cnn-visual-1.png" />
</div>
</center>
<p>We can accomplish this using convolutions. In mathematics, a convolution is an operation on two functions that produces another function. We can borrow this idea and create a discrete form of convolution to implement this filter idea. We will say a filter $\mathbf{W}$, which is just a matrix, is convolved with the input image $\mathbf{X}$, producing the output $\mathbf{Y}$, where each element of $\mathbf{Y}$ is determined in the following way</p>
<script type="math/tex; mode=display">\mathbf{Y}[r, c] = \sum_{i=0}^{k_{1}-1}\sum_{j=0}^{k_{2}-1}\mathbf{X}[r + i, c + j]\mathbf{W}[i, j]</script>
<p>where $k_{1}$ and $k_{2}$ are the dimensions of the filter (also called the kernel). So each element of the output is a weighted sum of a local subset of input elements. This is a 2D discrete convolution (actually it’s a cross-correlation, but that’s a technical detail and I’ll just say convolution), but convolutions using filters of differing numbers of dimensions are equally valid.</p>
<center>
<div class="col-lg-8 col-md-8 col-sm-12 col-xs-12">
<img src="/assets/CNNs/cnn-visual-2.png" />
</div>
</center>
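<p>The discrete 2D convolution above can be written directly as nested loops; here is a minimal (and deliberately unoptimized) NumPy sketch:</p>

```python
import numpy as np

def conv2d(X, W):
    """Valid 2D convolution (cross-correlation) with stride 1."""
    k1, k2 = W.shape
    h_out, w_out = X.shape[0] - k1 + 1, X.shape[1] - k2 + 1
    Y = np.zeros((h_out, w_out))
    for r in range(h_out):
        for c in range(w_out):
            # weighted sum of the local k1 x k2 patch of the input
            Y[r, c] = np.sum(X[r:r + k1, c:c + k2] * W)
    return Y

X = np.arange(16.0).reshape(4, 4)
W = np.array([[1.0, 0.0],
              [0.0, -1.0]])      # a toy 2 x 2 filter
print(conv2d(X, W))              # every entry is X[r, c] - X[r+1, c+1] = -5
```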
<p>Each element of the output is created by this convolution, and the filter “slides” across the input until the output is filled up. The idea is to learn good weights for this filter in the same way we learned the parameters of FC networks. Also, we could have multiple filters, each of which slides across the whole input and produces an output channel. These output channels are concatenated together so that if we have $n$ filters, the output will have $n$ channels, or activation maps.</p>
<center>
<div class="col-lg-8 col-md-8 col-sm-12 col-xs-12">
<img src="/assets/CNNs/cnn-visual-3.png" />
</div>
</center>
<p>Each filter slides across the input. The amount by which the filter slides each time is called the stride. The output dimensions are determined by the input size, the filter size, and the stride in the following way. Let the input and output heights and widths be denoted by $h_{\text{in}}, w_{\text{in}}, h_{\text{out}}, w_{\text{out}}$ respectively, and let the filter dimensions be $k_{1}, k_{2}$.</p>
<script type="math/tex; mode=display">h_{\text{out}} = \frac{h_{\text{in}} - k_{1}}{\text{stride}} + 1</script>
<p>A similar formula holds for the widths. If this doesn’t produce a whole number, it is common to pad the outsides of the input with enough zeros so that the output dimensions will be whole numbers. Now if we consider our example from earlier, a $28 \times 28$ image, we can see the reduction in parameters afforded by a convolutional layer. Let’s say we have the same network as earlier, but this time the first layer is a convolutional layer and not an FC layer. Let’s also say we have 16 $4 \times 4$ filters with stride 2. Then the output size for our convolutional layer will be</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
h_{\text{out}} &= \frac{28 - 4}{2} + 1 = 13 \\
w_{\text{out}} &= \frac{28 - 4}{2} + 1 = 13
\end{align*} %]]></script>
<p>The number of parameters for this layer is just the combined filter size: $16 \cdot 4 \cdot 4 = 256$. Now if we take the convolutional layer output, flatten it to size $16 \cdot 13 \cdot 13 = 2704$, and feed it to the output layer of size 10, this takes $2704 \cdot 10 = 27040$ parameters. So the total number of parameters in our network has been reduced from 79400 to 27296.</p>
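<p>These size and parameter counts can be checked with a few lines, whose numbers match the example above:</p>

```python
def conv_output_size(h_in, w_in, k1, k2, stride):
    # h_out = (h_in - k1) / stride + 1, and similarly for the width
    return (h_in - k1) // stride + 1, (w_in - k2) // stride + 1

h_out, w_out = conv_output_size(28, 28, 4, 4, stride=2)
conv_params = 16 * 4 * 4                 # 16 filters of size 4 x 4
fc_params = 16 * h_out * w_out * 10      # flattened activation maps -> 10 classes
print(h_out, w_out, conv_params + fc_params)   # 13 13 27296
```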
<hr />
<script async="" src="https://pagead2.googlesyndication.com/pagead/js/adsbygoogle.js"></script>
<!-- horizontal -->
<p><ins class="adsbygoogle" style="display:block" data-ad-client="ca-pub-8495937332177101" data-ad-slot="8539861386" data-ad-format="auto" data-full-width-responsive="true"></ins>
<script>
(adsbygoogle = window.adsbygoogle || []).push({});
</script></p>
<hr />
<h3 id="convolutional-networks">Convolutional Networks</h3>
<h4 id="pooling-layers">Pooling layers</h4>
<p>A Convolutional Neural Network, or CNN, is a combination of convolutional layers with interleaved activation functions and/or pooling layers. Pooling layers usually go hand-in-hand with convolutional layers, and are used to improve the robustness of convolutional layers with respect to the exact location of certain features in the input. For example, if the input is a photo of a face, the nose might be in the center of the image, or it might be slightly to the left of the center, or it might be somewhere else in the image entirely. We can be more robust to this spatial noise by using a special type of filter that takes the maximum over a group of activations, these activations being the ones output by the convolutional layer followed by some activation function. By doing this, even if the strongest activation coming out of the convolutional layer is slightly offset, the max filter will still capture it. This is what a pooling layer does. Assuming a $2 \times 2$ max-pooling filter with stride 1, the pooling layer will look something like this.</p>
<center>
<div class="col-lg-8 col-md-8 col-sm-12 col-xs-12">
<img src="/assets/CNNs/pooling-layer-visual.png" />
</div>
</center>
<p>Pooling makes the activation maps smaller and is usually done over each activation map (i.e. each channel of the output) separately. There are different types of pooling layers such as max pooling, average pooling, and several others.</p>
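<p>A single-channel max-pooling layer is a short loop; a sketch assuming the $2 \times 2$, stride-1 setup from the figure above:</p>

```python
import numpy as np

def max_pool(X, k=2, stride=1):
    """Max pooling over one activation map (one channel)."""
    h_out = (X.shape[0] - k) // stride + 1
    w_out = (X.shape[1] - k) // stride + 1
    Y = np.zeros((h_out, w_out))
    for r in range(h_out):
        for c in range(w_out):
            # maximum over the local k x k window
            Y[r, c] = X[r * stride:r * stride + k,
                        c * stride:c * stride + k].max()
    return Y

X = np.array([[1.0, 2.0, 0.0],
              [4.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])
print(max_pool(X))     # [[4. 3.]
                       #  [4. 3.]]
```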
<h4 id="the-full-network">The Full Network</h4>
<p>After a series of convolutional layers with intermixed pooling layers and activations, CNNs usually have a FC layer or a series of FC layers referred to as the “classifier” of the network. The purpose of the convolutional and pooling layers is to extract useful features for eventual input into the classifier, which will give us the actual predictive output of the network. The full network can be visualized like below.</p>
<center>
<div class="col-lg-12 col-md-12 col-sm-12 col-xs-12">
<img src="/assets/CNNs/full-conv-net.png" />
</div>
</center>
<hr />
<script async="" src="https://pagead2.googlesyndication.com/pagead/js/adsbygoogle.js"></script>
<!-- horizontal -->
<p><ins class="adsbygoogle" style="display:block" data-ad-client="ca-pub-8495937332177101" data-ad-slot="8539861386" data-ad-format="auto" data-full-width-responsive="true"></ins>
<script>
(adsbygoogle = window.adsbygoogle || []).push({});
</script></p>
<hr />
<h4 id="backprop-through-conv-layers">Backprop Through Conv Layers</h4>
<p>Just like FC layers, we have to figure out how to propagate gradients through convolutional layers. Assume we have input $\mathbf{X}$, output $\mathbf{Y}$, of size $h_{1} \times w_{1}$ and $h_{2} \times w_{2}$ respectively, and kernel $\mathbf{W}$ of size $k_{1} \times k_{2}$. We assume we have access to the upstream gradient $\frac{\partial L}{\partial \mathbf{Y}}$, where $L$ is the loss function. We need to calculate $\frac{\partial L}{\partial \mathbf{X}}$ and $\frac{\partial L}{\partial \mathbf{W}}$. We also assume that the stride is 1 in all dimensions for the kernel to simplify indexing.</p>
<center>
<div class="col-lg-10 col-md-10 col-sm-12 col-xs-12">
<img src="/assets/CNNs/backprop-cnn.png" />
</div>
</center>
<p>Recall that for an output at index $r, c$ of $\mathbf{Y}$, we have</p>
<script type="math/tex; mode=display">\mathbf{Y}[r, c] = \sum_{i=0}^{k_{1}-1}\sum_{j=0}^{k_{2}-1}\mathbf{X}[r + i, c + j]\mathbf{W}[i, j]</script>
<p>Now we will see how to calculate the unknown gradients.</p>
<h5 id="fracpartial-lpartial-w">$\frac{\partial L}{\partial W}$:</h5>
<p>We will consider the gradient one pixel at a time. I.e. let’s consider $\frac{\partial L}{\partial \mathbf{W}[a, b]}$. This kernel weight affects everything in the output, and we’ll sum all its contributions to compute the gradient. Below we can see this visually.</p>
<center>
<div class="col-lg-10 col-md-10 col-sm-12 col-xs-12">
<img src="/assets/CNNs/backprop-cnn-1.png" />
</div>
</center>
<p>Note that because each output pixel is just a weighted sum of some elements of the input with the kernel weights, we have the following.</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
\mathbf{Y}[r, c] &= \sum_{i=0}^{k_{1}-1}\sum_{j=0}^{k_{2}-1}\mathbf{X}[r + i, c + j]\mathbf{W}[i, j] \\
\Rightarrow \frac{\partial \mathbf{Y}[r, c]}{\partial \mathbf{W}[a, b]} &= \sum_{i=0}^{k_{1}-1}\sum_{j=0}^{k_{2}-1} \frac{\partial}{\partial \mathbf{W}[a, b]}\mathbf{X}[r + i, c + j]\mathbf{W}[i, j] \\
&= \mathbf{X}[r + a, c + b]
\end{align*} %]]></script>
<p>We can accumulate the contribution of this kernel weight by considering every pixel in the output as follows.</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
\frac{\partial L}{\partial \mathbf{W}[a, b]} &= \sum_{r=0}^{h_{2}-1}\sum_{c=0}^{w_{2}-1}\frac{\partial L}{\partial \mathbf{Y}[r, c]}\frac{\partial \mathbf{Y}[r, c]}{\partial \mathbf{W}[a, b]} \\
&= \sum_{r=0}^{h_{2}-1}\sum_{c=0}^{w_{2}-1}\frac{\partial L}{\partial \mathbf{Y}[r, c]} \mathbf{X}[r + a, c + b]
\end{align*} %]]></script>
<p>Note that this is exactly a convolution between $\mathbf{X}$ and $\frac{\partial L}{\partial \mathbf{Y}}$ but clipped to be the dimensions of the kernel.</p>
<h5 id="fracpartial-lpartial-x">$\frac{\partial L}{\partial X}$:</h5>
<p>Note $\frac{\partial L}{\partial \mathbf{X}}$ is the same size as $\mathbf{X}$, and we will also compute it pixel by pixel. Consider $\frac{\partial L}{\partial \mathbf{X}[r', c']}$. This pixel only affects elements of the output that are produced when the kernel is over that pixel of the input. That is, $\mathbf{X}[r', c']$ only affects a region of $\mathbf{Y}$.</p>
<center>
<div class="col-lg-10 col-md-10 col-sm-12 col-xs-12">
<img src="/assets/CNNs/backprop-cnn-2.png" />
</div>
</center>
<p>Thinking about this in the general sense reveals that $\mathbf{X}[r', c']$ affects a box of output pixels whose upper left corner is $\mathbf{Y}[r' - (k_{1} - 1), c' - (k_{2} - 1)]$ and whose lower right pixel is $\mathbf{Y}[r', c']$. Therefore we can write</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
\frac{\partial L}{\partial \mathbf{X}[r', c']} &= \sum_{i=0}^{k_{1} - 1}\sum_{j=0}^{k_{2} - 1} \frac{\partial L}{\partial \mathbf{Y}[r' - i, c' - j]} \frac{\partial \mathbf{Y}[r' - i, c' - j]}{\partial \mathbf{X}[r', c']}
\end{align*} %]]></script>
<p>Using the equation for the discrete convolution we already know, we can see that</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
\mathbf{Y}[r' - i, c' - j] &= \sum_{n=0}^{k_{1}-1}\sum_{m=0}^{k_{2}-1} \mathbf{X}[r' - i + n, c' - j + m]\mathbf{W}[n, m] \\
\Rightarrow \frac{\partial \mathbf{Y}[r' - i, c' - j]}{\partial \mathbf{X}[r', c']} &= \sum_{n=0}^{k_{1}-1}\sum_{m=0}^{k_{2}-1} \frac{\partial}{\partial \mathbf{X}[r', c']}\mathbf{X}[r' - i + n, c' - j + m]\mathbf{W}[n, m] \\
&= \mathbf{W}[i, j]
\end{align*} %]]></script>
<p>This is because $\mathbf{X}[r', c']$ only appears once in this sum, when $i=n$ and $j=m$. So finally we can write</p>
<script type="math/tex; mode=display">\frac{\partial L}{\partial \mathbf{X}[r', c']} = \sum_{i=0}^{k_{1} - 1}\sum_{j=0}^{k_{2} - 1} \frac{\partial L}{\partial \mathbf{Y}[r' - i, c' - j]}\mathbf{W}[i, j]</script>
<p>Now we know how to compute the gradients needed for backprop through convolutional layers. We have shown that given $\frac{\partial L}{\partial \mathbf{Y}}$, we can compute $\frac{\partial L}{\partial \mathbf{W}}$ and $\frac{\partial L}{\partial \mathbf{X}}$.</p>
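<p>Both gradient formulas can be implemented, and sanity-checked numerically, in a few lines of NumPy. This sketch accumulates the contributions window by window, which is equivalent to the sums above:</p>

```python
import numpy as np

def conv2d(X, W):
    k1, k2 = W.shape
    Y = np.zeros((X.shape[0] - k1 + 1, X.shape[1] - k2 + 1))
    for r in range(Y.shape[0]):
        for c in range(Y.shape[1]):
            Y[r, c] = np.sum(X[r:r + k1, c:c + k2] * W)
    return Y

def conv_backward(dLdY, X, W):
    k1, k2 = W.shape
    dLdW = np.zeros_like(W)
    dLdX = np.zeros_like(X)
    for r in range(dLdY.shape[0]):
        for c in range(dLdY.shape[1]):
            # dL/dW accumulates dL/dY[r, c] times the input patch it saw
            dLdW += dLdY[r, c] * X[r:r + k1, c:c + k2]
            # dL/dX scatters dL/dY[r, c] * W back onto that patch
            dLdX[r:r + k1, c:c + k2] += dLdY[r, c] * W
    return dLdW, dLdX

# Numerical check of dL/dW[0, 0] with the simple loss L = sum(Y)
rng = np.random.default_rng(0)
X = rng.standard_normal((5, 5))
W = rng.standard_normal((3, 3))
dLdW, dLdX = conv_backward(np.ones((3, 3)), X, W)

eps = 1e-6
W_pert = W.copy()
W_pert[0, 0] += eps
numeric = (conv2d(X, W_pert).sum() - conv2d(X, W).sum()) / eps
print(np.isclose(numeric, dLdW[0, 0], atol=1e-4))   # True
```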
<hr />
<script async="" src="https://pagead2.googlesyndication.com/pagead/js/adsbygoogle.js"></script>
<!-- horizontal -->
<p><ins class="adsbygoogle" style="display:block" data-ad-client="ca-pub-8495937332177101" data-ad-slot="8539861386" data-ad-format="auto" data-full-width-responsive="true"></ins>
<script>
(adsbygoogle = window.adsbygoogle || []).push({});
</script></p>
<hr />Convolutional LayersLinear Approximation2020-02-21T08:06:43+00:002020-02-21T08:06:43+00:00http://www.fractalai.org/mfml/2020/02/21/linear-approximation<p>$\newcommand{\norm}[1]{\left\lVert#1\right\rVert}$</p>
<h3 id="linear-approximation">Linear Approximation</h3>
<p>Linear approximation is a fundamental problem in machine learning, and one that has a surprising amount of mathematical structure built around it for such a seemingly simple problem. Consider the following problem: We have a Hilbert space $\mathbf{S}$ and a subspace $\mathbf{T} \subseteq \mathbf{S}$. We also have an element $\mathbf{x} \in \mathbf{S}$. What is the closest element $\mathbf{\hat{x}} \in \mathbf{T}$ to $\mathbf{x}$?</p>
<center>
<div class="col-lg-6 col-md-6 col-sm-12 col-xs-12">
<img src="/assets/Linear_Approx/linear-approx-problem.png" />
</div>
</center>
<p>This Hilbert space $\mathbf{S}$ has an inner product $\langle\cdot, \cdot\rangle$ and induced norm $\norm{\cdot}$. So we can frame the problem as finding the point $\mathbf{\hat{x}} \in \mathbf{T}$ such that $\norm{\mathbf{\hat{x} - x}}$ is minimized.</p>
<script type="math/tex; mode=display">\begin{equation} \tag{1}
\text{minimize}_{\mathbf{y\in T}} \norm{\mathbf{y - x}}
\end{equation}</script>
<p>We can find a unique minimizer by exploiting orthogonality. In fact, $\mathbf{\hat{x} \in T}$ is the closest point to $\mathbf{x \in S}$ if $\mathbf{\hat{x} - x}$ is orthogonal to every point $\mathbf{y \in T}$. This means that $\langle \mathbf{\hat{x} - x}, \mathbf{y}\rangle = 0$ for all $\mathbf{y \in T}$.</p>
<p>Let’s show that if $\langle\mathbf{\hat{x} - x}, \mathbf{y}\rangle = 0$ for all $\mathbf{y \neq \hat{x} \in T}$ then $\mathbf{\hat{x}}$ is the minimizer of $(1)$.</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
\norm{\mathbf{x - y}}^{2} &= \norm{(\mathbf{x - \hat{x}}) - (\mathbf{y} - \mathbf{\hat{x}})}^{2} \\
&= \norm{\mathbf{x - \hat{x}}}^{2} + \norm{\mathbf{y} - \mathbf{\hat{x}}}^{2}
\end{align*} %]]></script>
<p>The last equality follows from the Pythagorean theorem. This is valid because we required that $\mathbf{x - \hat{x}}$ was orthogonal to all points in $\mathbf{T}$, and $\mathbf{y} - \mathbf{\hat{x}}$ is certainly in $\mathbf{T}$!</p>
<center>
<div class="col-lg-8 col-md-8 col-sm-12 col-xs-12">
<img src="/assets/Linear_Approx/closest-point.png" />
</div>
</center>
<p>Therefore, if $\norm{\mathbf{y} - \mathbf{\hat{x}}}^{2} \neq 0$ (i.e. $\mathbf{y} \neq \mathbf{\hat{x}}$), then</p>
<script type="math/tex; mode=display">\norm{\mathbf{x} - \mathbf{y}}^{2} > \norm{\mathbf{x - \hat{x}}}^{2}</script>
<p>where equality is achieved only when $\mathbf{y} = \mathbf{\hat{x}}$. This implies that $\mathbf{\hat{x}}$ is a unique minimizer of $(1)$. This is a pretty intuitive result: If $\mathbf{x - y}$ is not orthogonal to $\mathbf{T}$, then there is some other point $\mathbf{\hat{x}}$ that comes closer to $\mathbf{x}$ while still remaining inside $\mathbf{T}$. This can be seen visually in the image above.</p>
<hr />
<script async="" src="https://pagead2.googlesyndication.com/pagead/js/adsbygoogle.js"></script>
<!-- horizontal -->
<p><ins class="adsbygoogle" style="display:block" data-ad-client="ca-pub-8495937332177101" data-ad-slot="8539861386" data-ad-format="auto" data-full-width-responsive="true"></ins>
<script>
(adsbygoogle = window.adsbygoogle || []).push({});
</script></p>
<hr />
<h4 id="computing-the-closest-point">Computing the closest point</h4>
<p>So we know that $\mathbf{\hat{x}}$ is a unique minimizer of $(1)$ if $\langle\mathbf{x - \hat{x}}, y\rangle = 0$ for all $\mathbf{y} \neq \mathbf{\hat{x}}$ in $\mathbf{T}$, but how do we actually compute $\mathbf{\hat{x}}$? If $\mathbf{T}$ is an $N$-dimensional subspace, that means we can represent any point in the space by a linear combination of $N$ basis vectors - call them $\mathbf{v_{1}}, \mathbf{v_{2}}, …, \mathbf{v_{N}}$.</p>
<script type="math/tex; mode=display">\mathbf{\hat{x}} = \alpha_{1}\mathbf{v_{1}} + \alpha_{2}\mathbf{v_{2}} + ... + \alpha_{N}\mathbf{v_{N}} = \sum_{n=1}^{N}\alpha_{n}\mathbf{v_{n}}</script>
<p>for some constants $\{\alpha_{n}\}_{n=1}^{N}$. Orthogonality also tells us</p>
<script type="math/tex; mode=display">\langle\mathbf{x - \hat{x}}, \mathbf{v_{k}}\rangle = 0</script>
<p>If we take the inner product of $\mathbf{x - \hat{x}}$ with one of the basis vectors we generate a linear equation.</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
\langle\mathbf{x - \hat{x}}, \mathbf{v_{k}}\rangle &= \big\langle\mathbf{x} - \sum_{n=1}^{N}\alpha_{n}\mathbf{v_{n}}, \mathbf{v_{k}}\big\rangle \\
&= \langle\mathbf{x}, \mathbf{v_{k}}\rangle - \alpha_{1}\langle\mathbf{v_{1}}, \mathbf{v_{k}}\rangle - ... - \alpha_{N}\langle\mathbf{v_{N}}, \mathbf{v_{k}}\rangle \\
\Rightarrow \langle\mathbf{x}, \mathbf{v_{k}}\rangle &= \alpha_{1}\langle\mathbf{v_{1}}, \mathbf{v_{k}}\rangle + ... + \alpha_{N}\langle\mathbf{v_{N}}, \mathbf{v_{k}}\rangle
\end{align*} %]]></script>
<p>In fact, we can generate $N$ different linear equations by taking the inner product with each of the basis vectors separately. That means we can solve this linear system of equations for $\mathbf{\alpha}$, the vector of coefficients!</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
\begin{bmatrix}\langle\mathbf{x}, \mathbf{v_{1}}\rangle \\ \vdots \\ \langle\mathbf{x}, \mathbf{v_{N}}\rangle\end{bmatrix} &=
\begin{bmatrix}
\langle\mathbf{v_{1}}, \mathbf{v_{1}}\rangle & ... & \langle\mathbf{v_{N}}, \mathbf{v_{1}}\rangle \\
\vdots & \ddots & \vdots \\
\langle\mathbf{v_{1}}, \mathbf{v_{N}}\rangle & ... & \langle\mathbf{v_{N}}, \mathbf{v_{N}}\rangle
\end{bmatrix}\begin{bmatrix}\alpha_{1} \\ \vdots \\ \alpha_{N}\end{bmatrix} \\
\\
\mathbf{b} &= \mathbf{G\alpha} \\
\Rightarrow \mathbf{\alpha} &= \mathbf{G^{-1}b}
\end{align*} %]]></script>
<p>Where $\mathbf{G}$ is the matrix of inner products and is called the Gram Matrix or Grammian of the basis $\{\mathbf{v}\}_{n=1}^{N}$. After we solve for our coefficients, we can easily reconstruct the closest point in $\mathbf{T}$ to $\mathbf{x}$ by</p>
<script type="math/tex; mode=display">\mathbf{\hat{x}} = \alpha_{1}\mathbf{v_{1}} + ... + \alpha_{N}\mathbf{v_{N}}</script>
<p>Take a second to appreciate what we did. We took a minimization problem, converted it to a finite dimensional linear algebra problem by exploiting our basis to ask the question “what basis coefficients will create a $\mathbf{\hat{x}}$ that minimizes the objective?”. This idea is central to many more topics we will cover.</p>
<p>$\mathbf{G}$ is invertible because the basis vectors are linearly independent. Also, since the inner product is a symmetric function, the Gram Matrix is also symmetric. Because the Gram matrix is square and invertible, $\mathbf{b} = \mathbf{G\alpha}$ always has a solution. Further, if we have an orthonormal basis, then the Gram Matrix is exactly the identity matrix, and the coefficients can be calculated by simply taking inner products of $\mathbf{x}$ with each basis vector.</p>
<h5 id="example">Example</h5>
<p>We will close with an example to drive this idea home. Let our Hilbert space $\mathbf{S} = \mathbb{R}^{3}$ with the standard inner product and</p>
<script type="math/tex; mode=display">\mathbf{T} = \text{Span}\Bigg(\begin{bmatrix}1 \\ 0 \\ 1\end{bmatrix}, \begin{bmatrix}-1 \\ 0 \\ 1\end{bmatrix}\Bigg), \mathbf{x} = \begin{bmatrix}2 \\ 1 \\ 0\end{bmatrix}</script>
<p>The vectors we defined $\mathbf{T}$ with form a basis for the subspace. What is the closest point in $\mathbf{T}$ to $\mathbf{x}$? We can write $\mathbf{\hat{x}}$ as</p>
<script type="math/tex; mode=display">\mathbf{\hat{x}} = \alpha_{1}\mathbf{v_{1}} + \alpha_{2}\mathbf{v_{2}}</script>
<p>and our Gram Matrix and $\mathbf{b}$ are</p>
<script type="math/tex; mode=display">% <![CDATA[
\mathbf{G} = \begin{bmatrix}
2 & 0 \\
0 & 2
\end{bmatrix}, \mathbf{b} = \begin{bmatrix}2 \\ -2\end{bmatrix} %]]></script>
<p>The inverse Gram Matrix is</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{bmatrix}
\frac{1}{2} & 0 \\
0 & \frac{1}{2}
\end{bmatrix} %]]></script>
<p>Finally, $\mathbf{\alpha} = \begin{bmatrix}1 & -1\end{bmatrix}^{T}$. We reconstruct our solution using the coefficients: $\mathbf{\hat{x}} = \mathbf{v_{1}} - \mathbf{v_{2}} = \begin{bmatrix}2 & 0 & 0\end{bmatrix}^{T}$</p>
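<p>This example is easy to verify numerically; a short NumPy check:</p>

```python
import numpy as np

v1 = np.array([1.0, 0.0, 1.0])
v2 = np.array([-1.0, 0.0, 1.0])
x = np.array([2.0, 1.0, 0.0])

V = np.column_stack([v1, v2])    # basis vectors as columns
G = V.T @ V                      # Gram matrix of inner products
b = V.T @ x                      # b_k = <x, v_k>
alpha = np.linalg.solve(G, b)    # solve G alpha = b
x_hat = V @ alpha                # reconstruct the closest point

print(alpha, x_hat)              # [ 1. -1.] [2. 0. 0.]
```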
<hr />
<script async="" src="https://pagead2.googlesyndication.com/pagead/js/adsbygoogle.js"></script>
<!-- horizontal -->
<p><ins class="adsbygoogle" style="display:block" data-ad-client="ca-pub-8495937332177101" data-ad-slot="8539861386" data-ad-format="auto" data-full-width-responsive="true"></ins>
<script>
(adsbygoogle = window.adsbygoogle || []).push({});
</script></p>
<hr />$\newcommand{\norm}[1]{\left\lVert#1\right\rVert}$Policy Gradient and Actor Critic2020-02-10T08:06:43+00:002020-02-10T08:06:43+00:00http://www.fractalai.org/dl/2020/02/10/policy-gradient-actor-critic<h3 id="policy-gradient">Policy Gradient</h3>
<p>What if we could learn the policy parameters directly? We can approach this problem by thinking of policies abstractly. Let’s consider a class of policies parameterized by $\theta$ and refer to such a policy as $\pi_{\theta}(a|s)$, which is a probability distribution over the action space conditioned on the state $s$. These parameters $\theta$ could be the parameters of a neural network or a simple polynomial or anything really.</p>
<p>Let’s now define a metric $J$ that can be used to evaluate the quality of a policy $\pi_{\theta}$. What we really want to do is maximize the expected future reward, so naturally we can write</p>
<script type="math/tex; mode=display">J(\pi_{\theta}) = \mathbb{E}\bigg[\sum_{t=1}^{T}R(s_{t}, a_{t})\bigg]</script>
<p>where $R(s_{t}, a_{t})$ is the reward given by taking action $a_{t}$ in state $s_{t}$ at time $t$. The optimal set of parameters for the policy can then be written as</p>
<script type="math/tex; mode=display">\theta^{\ast} = \arg\max_{\theta}\mathbb{E}\bigg[\sum_{t=1}^{T}R(s_{t}, a_{t})\bigg]</script>
<p>Now consider a trajectory $\tau = (s_{1}, a_{1}, s_{2}, a_{2}, …, s_{T})$ which is a sequence of state-action pairs until the terminal state. We are trying to learn $\theta$ that maximizes the reward of some trajectory. So in the spirit of gradient descent, we are going to take actions within our environment to sample a trajectory and then use the rewards gained from that trajectory to adjust our parameters. We can write our objective function as</p>
<script type="math/tex; mode=display">J(\theta) = \mathbb{E}_{\tau \sim \pi_{\theta}(\tau)}[R(\tau)]</script>
<p>where $R(\tau)$ is the cumulative reward gained by our trajectory. Our objective is to take the gradient of this function with respect to $\theta$ so that we can use the gradient update rule to adjust our parameters. The reward function, however, is not known and may not even be differentiable, but with a few clever tricks we can estimate the gradient. Recall that for a random variable $x$ with density $p(x)$ and a continuous function $f$, $\mathbb{E}[f(x)] = \int_{-\infty}^{\infty}p(x)f(x)dx$. So we have</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
J(\theta) &= \mathbb{E}_{\tau \sim \pi_{\theta}(\tau)}[R(\tau)] \\
&= \int p(\tau)R(\tau)d\tau \\
&= \int \pi_{\theta}(\tau)R(\tau)d\tau
\end{align*} %]]></script>
<p>and</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
\nabla_{\theta}J(\theta) &= \nabla_{\theta} \int \pi_{\theta}(\tau)R(\tau)d\tau \\
&= \int \nabla_{\theta}\pi_{\theta}(\tau)R(\tau)d\tau \\
&= \int \pi_{\theta}(\tau)\nabla_{\theta}\log(\pi_{\theta}(\tau))R(\tau)d\tau \\
&= \mathbb{E}_{\tau \sim \pi_{\theta}(\tau)}[\nabla_{\theta}\log(\pi_{\theta}(\tau))R(\tau)]
\end{align*} %]]></script>
<p>Where the third line follows from the fact that $\nabla_{x}f(x) = f(x)\nabla_{x}\log(f(x))$. The fact that we have turned the gradient of our cost function $J$ into an expectation is good because that means we can estimate it by sampling data. The last piece of the puzzle is to figure out how to calculate $\nabla_{\theta}\log(\pi_{\theta}(\tau))$. Note that we can rewrite $\pi_{\theta}(\tau)$ as</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
\pi_{\theta}(\tau) = \pi_{\theta}(s_{1}, a_{1}, s_{2}, a_{2}, ..., s_{T}) &= p(s_{1}) \prod_{t=1}^{T} p(a_{t}|s_{t})p(s_{t+1}|a_{t}, s_{t}) \\
&= p(s_{1}) \prod_{t=1}^{T} \pi_{\theta}(a_{t}|s_{t})p(s_{t+1}|a_{t}, s_{t})
\end{align*} %]]></script>
<p>Convince yourself that the above relation is true. $\pi_{\theta}(\tau)$ is the probability of trajectory $\tau$ occurring: the probability of starting in $s_{1}$, then taking action $a_{1}$ given $s_{1}$, then transitioning to state $s_{2}$ given $a_{1}$ in $s_{1}$, and so on. This joint probability factors as shown. The last step is to recognize that $p(a_{t}|s_{t})$ is, by definition, $\pi_{\theta}(a_{t}|s_{t})$. Now</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
\nabla_{\theta} \log(\pi_{\theta}(\tau)) &= \nabla_{\theta}\log\bigg[p(s_{1}) \prod_{t=1}^{T} \pi_{\theta}(a_{t}|s_{t})p(s_{t+1}|a_{t}, s_{t})\bigg] \\
&= \nabla_{\theta}\bigg[\log(p(s_{1})) + \sum_{t=1}^{T}\log(\pi_{\theta}(a_{t}|s_{t})) + \sum_{t=1}^{T}\log(p(s_{t+1}|a_{t}, s_{t}))\bigg] \\
&= 0 + \nabla_{\theta}\sum_{t=1}^{T}\log(\pi_{\theta}(a_{t}|s_{t})) + 0
\end{align*} %]]></script>
<p>This simplification is enough for us to complete our estimate of the policy gradient $\nabla_{\theta}J(\theta)$.</p>
<script type="math/tex; mode=display">\nabla_{\theta}J(\theta) \approx \frac{1}{N}\sum_{n=1}^{N}\Bigg[\bigg(\sum_{t=1}^{T} \nabla_{\theta}\log(\pi_{\theta}(a_{n,t}|s_{n,t}))\bigg)\bigg(\sum_{t=1}^{T}r(s_{n,t},a_{n,t})\bigg)\Bigg]</script>
<p>where $N$ is the number of sampled trajectories (episodes, analogous to epochs). Averaging the policy gradient estimate over a set of $N$ trajectories makes the estimate more robust. Now that we can estimate the policy gradient, we update our parameters with a gradient ascent step (ascent rather than descent, because we are maximizing expected reward)</p>
<script type="math/tex; mode=display">\theta \leftarrow \theta + \alpha\nabla_{\theta}J(\theta)</script>
<p>One interpretation of this result is that we are increasing the log likelihood of trajectories that yield good rewards and decreasing the log likelihood of those that don’t. This is the idea behind the REINFORCE algorithm:</p>
<ol>
<li>sample $N$ trajectories by running the policy</li>
<li>estimate the policy gradient like above</li>
<li>update the parameters $\theta$</li>
<li>repeat until converged</li>
</ol>
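To make the steps above concrete, here is a minimal NumPy sketch of the REINFORCE gradient estimate for a linear-softmax policy over discrete actions. Everything here (the toy policy parameterization, states, actions, and rewards) is made up for illustration, not a specification of the algorithm.

```python
import numpy as np

def softmax(z):
    z = z - z.max()          # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def grad_log_pi(theta, s, a):
    # For a linear-softmax policy pi(a|s) = softmax(theta^T s),
    # grad_theta log pi(a|s) = outer(s, onehot(a) - pi(.|s)).
    probs = softmax(theta.T @ s)
    onehot = np.zeros_like(probs)
    onehot[a] = 1.0
    return np.outer(s, onehot - probs)

def reinforce_gradient(theta, trajectories):
    """Estimate grad J(theta) from N sampled trajectories.

    Each trajectory is a list of (state, action, reward) tuples.
    """
    grads = []
    for traj in trajectories:
        sum_grad_log = sum(grad_log_pi(theta, s, a) for s, a, _ in traj)
        total_reward = sum(r for _, _, r in traj)
        grads.append(sum_grad_log * total_reward)
    return np.mean(grads, axis=0)

# Toy example: 2-dim states, 3 actions, one hand-made trajectory.
rng = np.random.default_rng(0)
theta = rng.normal(size=(2, 3))
traj = [(np.array([1.0, 0.0]), 0, 1.0), (np.array([0.0, 1.0]), 2, 0.5)]
g = reinforce_gradient(theta, [traj])
theta = theta + 0.01 * g   # step in the direction of increasing expected reward
```

In practice the policy would be a neural network and the trajectories would come from actually running it in the environment; the structure of the estimate is the same.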
<hr />
<h3 id="actor-critic">Actor Critic</h3>
<p>One issue with vanilla policy gradients is that it’s very hard to assign credit to the state-action pairs that produced good reward, because we only consider the total reward $\sum_{t=1}^{T}r(s_{t}, a_{t})$; the trajectories are noisy. But if we had the $Q$ function, we would know which state-action pairs were good. In other words, we would estimate the gradient of $J$ as</p>
<script type="math/tex; mode=display">\nabla_{\theta}J(\theta) = \mathbb{E}[\nabla_{\theta}\log(\pi_{\theta}(\tau))Q_{\pi_{\theta}}(\tau)]</script>
<p>The idea of actor-critic is that we have an actor that samples trajectories using the policy, and a critic that critiques the policy using the $Q$ function. Since we don’t have the optimal $Q$ function, we estimate it like we did in deep Q-learning. So we have a policy network that takes in a state and returns a probability distribution over the action space (i.e. $\pi_{\theta}(a|s)$), and a $Q$ network that takes in a state-action pair and returns its Q-value estimate, parameterized by a generic variable $\beta$. Note that these don’t have to be neural networks, but for the sake of this guide I’ll just say “network”. So we have networks $\pi_{\theta}$ and $Q_{\beta}$. The general actor-critic algorithm goes as follows.</p>
<ol>
<li>Initialize $s, \theta, \beta$</li>
<li>Repeat until converged:
<ul>
<li>Sample action $a$ from $\pi_{\theta}(\cdot|s)$</li>
<li>Receive reward $r$ and sample next state $s' \sim p(s'|s, a)$</li>
<li>Use the critic to evaluate the actor and update the policy, similar to what we did in policy gradients (an ascent step, since we are maximizing reward):
<script type="math/tex">\theta \leftarrow \theta + \alpha\nabla_{\theta}\log(\pi_{\theta}(a|s))Q_{\beta}(s, a)</script></li>
<li>Update the critic according to some loss metric, e.g. $\text{MSE Loss} = (Q_{t+1}(s, a) - (r + \gamma\max_{a'}Q_{t}(s', a')))^{2}$</li>
<li>Update $\beta$ using backprop or whatever update rule</li>
</ul>
</li>
</ol>
<p>Of course, you can sample whole trajectories instead of one state-action pair at a time. Different types of actor-critic result from changing the “critic”. In REINFORCE, the critic was simply the total reward of the trajectory. In actor-critic, the critic is the Q function. Another popular choice, called advantage actor-critic, uses the advantage function as the critic:</p>
<script type="math/tex; mode=display">A_{\pi_{\theta}}(s, a) = Q_{\pi_{\theta}}(s, a) - V_{\pi_{\theta}}(s)</script>
<p>where $V$ is the value function (recall value iteration). The advantage function $A$ tells us how much better taking action $a$ in state $s$ is than the expected cumulative reward of simply being in state $s$.</p>
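As a quick illustrative sketch, with made-up numbers standing in for critic outputs, the advantage is just an elementwise difference:

```python
import numpy as np

# Hypothetical critic outputs for a handful of sampled (s, a) pairs.
q_values = np.array([1.2, 0.4, -0.3, 2.0])   # Q(s, a) estimates
v_values = np.array([0.9, 0.9, 0.1, 1.5])    # V(s) estimates (the baseline)

# A(s, a) = Q(s, a) - V(s): how much better action a is than average.
advantages = q_values - v_values             # approx [0.3, -0.5, -0.4, 0.5]

# The actor update weights grad log pi(a|s) by the advantage, pushing up
# actions that beat the baseline and pushing down those that don't.
```

Subtracting the baseline $V(s)$ leaves the gradient estimate unbiased while reducing its variance, which is the main practical reason to prefer it over raw returns.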
<p>This concludes our discussion of RL for the Deep Learning section. In the future I will make more RL-related guides that focus on more advanced topics and current research. Feel free to reach out with any questions or if you notice something you think is inaccurate and I’ll do my best to respond!</p>
<hr />Policy GradientDeep Q-Learning2020-02-05T08:06:43+00:002020-02-05T08:06:43+00:00http://www.fractalai.org/dl/2020/02/05/deep-q-learning<h3 id="learning-based-methods">Learning-Based Methods</h3>
<p>Policy and Value Iteration gave us a solid way to find the optimal policy when we have perfect information about the environment (i.e. we know the distributions of state transitions and rewards), but when this information is not known, we have to get clever about how we determine good policies. One way is to learn by trial and error - taking actions in the environment and observing what states we transition to under different actions and what rewards we obtain for doing so. Doing this gives us data of the form $(s, a, r, s')$: if we take action $a$ in state $s$, we receive reward $r$ and transition to state $s'$. From this data we can try to approximate the unknown distributions.</p>
<p>Another issue we face is large state spaces. Policy and Value iteration worked fine for Gridworld (small state space), but when the total number of states becomes large, these algorithms become intractable - they contain a $|\mathbf{S}|^{3}$ and a $|\mathbf{S}|^{2}$ term respectively in their time complexities! Our solution to this issue is to learn a lower-dimensional representation of the state using neural networks. This is known as deep reinforcement learning, and the type we will be exploring in this guide is called deep Q-Learning.</p>
<h3 id="deep-q-learning">Deep Q-Learning</h3>
<p>Essentially what we are trying to do is approximate $Q^{\ast}(s, a)$ with a neural network. If we can get a good approximation of $Q^{\ast}$, we can extract a good policy. This neural network is parameterized by a generic term $\theta$; it takes the state $s$ as input and outputs a value for each possible action, over which we can perform a max operation to get the best action to take.</p>
<p>In order to learn such a function, we need to define a loss function so that our network knows what it’s optimizing for. Recall that the optimal Q function satisfies</p>
<script type="math/tex; mode=display">Q^{\ast}(s,a) = \mathbb{E}_{s' \sim p(s'|s, a)}\bigg[r(s, a) + \gamma \max_{a'}Q^{\ast}(s', a')\bigg]</script>
<p>Assume we have a collection of data $\{(s, a, r, s')\}_{i=1}^{N}$. Then for one of the data points, we can measure how closely our Q network approximates the optimal Q function with the following loss.</p>
<script type="math/tex; mode=display">\text{MSE Loss} = (Q_{t}(s, a) - [r + \gamma\max_{a'} Q_{t-1}(s', a')])^{2}</script>
<p>where $Q_{t}$ and $Q_{t-1}$ denote the network output after and before a single weight update, respectively. Notice how the first term in the square is our network’s current output and the second term is the target Q-value we want, computed from the old network weights. The training pipeline looks something like the figure below. First we collect a batch of data of size $B$ (i.e. the agent takes actions in the environment), then we feed that data into the network, compute the loss, and update the network weights. Below, $D$ is the dimensionality of the state representation (e.g. the number of pixels in an image).</p>
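The TD target and MSE loss can be sketched for a small batch as follows; the batch values are fabricated for illustration, and the discount $\gamma$ appears in the target just as in the Bellman equation.

```python
import numpy as np

gamma = 0.99

def td_targets(rewards, next_q, done):
    """Targets r + gamma * max_a' Q_old(s', a') for a batch of transitions.

    next_q is the old network's (B, |A|) output for the next states;
    terminal transitions (done=True) have no bootstrap term.
    """
    return rewards + gamma * next_q.max(axis=1) * (~done)

def mse_loss(q_pred, targets):
    # Mean squared error between the current network's Q(s, a) and the
    # fixed targets computed from the old weights.
    return np.mean((q_pred - targets) ** 2)

# A tiny fabricated batch of 3 transitions with 2 actions.
rewards = np.array([1.0, 0.0, -1.0])
next_q = np.array([[0.5, 1.0], [0.2, 0.1], [0.0, 0.0]])
done = np.array([False, False, True])
targets = td_targets(rewards, next_q, done)   # [1.99, 0.198, -1.0]
loss = mse_loss(np.array([1.5, 0.3, -0.8]), targets)
```

In a real implementation the same computation happens on network outputs, and only `q_pred` carries gradients; the targets are held fixed during the update.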
<center>
<div class="col-lg-10 col-md-10 col-sm-12 col-xs-12">
<img src="/assets/Deep_Q_Learning/q-network-training.png" />
</div>
</center>
<hr />
<h3 id="epsilon-greedy-and-experience-replay">Epsilon Greedy and Experience Replay</h3>
<p>This framework gives us a good way to approximate the optimal Q function, but there remains the question of how we actually collect the data. What policy should we use for that? To explain the problem, consider an example. Say we have some sub-optimal policy $\pi_{0}$ that we will use to collect experience data $(s, a, r, s')$ in the environment. If we simply choose the best action for each state according to this sub-optimal policy, we may never discover that some actions not chosen by $\pi_{0}$ lead to good rewards. Essentially, we will be stuck in a local optimum. One way around this is to occasionally take random actions so that we have a chance of seeing new experiences and hopefully finding better actions to take. This exploration strategy is known as epsilon-greedy. It says that at time $t$ the action should be chosen according to the following rule.</p>
<script type="math/tex; mode=display">% <![CDATA[
a_{t} =
\begin{cases}
\arg\max_{a}Q_{t}(s, a) & \text{with probability } 1 - \epsilon \\
\text{random action} & \text{with probability } \epsilon
\end{cases} %]]></script>
<p>This will allow our agent to do some exploring to find good state-action combinations. Typically, it is good to do a lot of exploration when the network first starts training by using a high value for epsilon, reducing epsilon gradually as training progresses.</p>
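A minimal sketch of epsilon-greedy action selection, with a simple linear decay schedule (the particular schedule and its constants are illustrative choices, not part of the method's definition):

```python
import numpy as np

def epsilon_greedy(q_values, epsilon, rng):
    """Pick argmax_a Q(s, a) with probability 1 - epsilon, else a random action."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

def decayed_epsilon(step, start=1.0, end=0.05, decay_steps=10_000):
    # An illustrative linear schedule: explore a lot early, less later.
    frac = min(step / decay_steps, 1.0)
    return start + frac * (end - start)

rng = np.random.default_rng(0)
action = epsilon_greedy(np.array([0.1, 0.9, 0.3]), epsilon=0.0, rng=rng)  # greedy: 1
```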
<p>The next issue we run into is that consecutive data points are highly correlated, which can lead to feedback loops or just very slow training. For example, if we are gathering data under a policy that tells the agent to move down, then data representing this type of action will be overrepresented in the next iteration of training, even though the better option might be to go right and we just haven’t explored it yet. One solution is to maintain a buffer of data $(s, a, r, s')$ that we continually update as the agent moves through the environment, removing old data as the buffer fills. When it comes time to sample a batch of data for training, we sample randomly from this buffer rather than taking a run of consecutive data like before. This approach is called experience replay.</p>
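A replay buffer can be sketched in a few lines with a deque; the capacity and the toy transitions below are arbitrary:

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size buffer of (s, a, r, s') transitions.

    Old transitions are evicted automatically once the buffer is full, and
    training batches are drawn uniformly at random, which breaks the
    correlation between consecutive transitions.
    """

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def push(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)

buf = ReplayBuffer(capacity=3)
for i in range(5):               # pushing 5 transitions evicts the oldest 2
    buf.push(i, 0, 0.0, i + 1)
batch = buf.sample(2)            # a random training batch
```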
<p>Armed with the knowledge of these common problems and some solid ways to address them, we present the full Deep Q-Learning algorithm with Experience Replay.</p>
<center>
<figure>
<div class="col-lg-12 col-md-12 col-sm-12 col-xs-12">
<img src="/assets/Deep_Q_Learning/DQN-algorithm.png" />
<figcaption>Credit: Fei-Fei Li, Justin Johnson, Serena Yeung: CS231n</figcaption>
</div>
</figure>
</center>
<p>The function $\phi$ is just a preprocessing step before inputting the data into the neural network and can be ignored for our purposes. The curious reader can explore the full paper from DeepMind: <a href="https://www.cs.toronto.edu/~vmnih/docs/dqn.pdf">Playing Atari with Deep Reinforcement Learning</a>.</p>
<p>We have seen a method for approximating the Q function with neural networks by gathering experience data from the environment and using it to train the network, as well as some problems that arise from this approach and reasonable ways to deal with them. Next, we will learn methods for estimating the optimal policy directly, without going through the middle-man of estimating a Q function.</p>
<hr />Learning-Based MethodsReinforcement Learning Background2020-02-02T08:06:43+00:002020-02-02T08:06:43+00:00http://www.fractalai.org/dl/2020/02/02/reinforcement-learning<h3 id="reinforcement-learning">Reinforcement Learning</h3>
<p>Reinforcement learning (RL) is different from supervised and unsupervised learning. In supervised learning, we have ground-truth data (labels) that we check our model’s output against, correcting mistakes accordingly. In unsupervised learning, we are learning some structure in the data. In RL we don’t necessarily have data; instead, we have an environment and a set of rules. An agent lives in this environment, and its objective is to take actions that eventually lead to reward. Whereas supervised learning tries to match data to its corresponding label, in RL we try to maximize reward. In other words, we are learning to make the agent take a good sequence of actions.</p>
<center>
<div class="col-lg-6 col-md-6 col-sm-12 col-xs-12">
<img src="/assets/RL_Intro/rl-schema.png" />
</div>
</center>
<h3 id="framing-an-rl-problem">Framing an RL Problem</h3>
<p>We will frame an RL problem as a Markov Decision Process (MDP), which is a fancy-sounding way of formulating decision making under uncertainty. We will define the following ideas that will guide us in formulating the problem:</p>
<ul>
<li>$\mathbf{S}$: The set of possible states</li>
<li>$\mathbf{A}$: The set of possible actions the agent can take</li>
<li>$R(s, a, s’)$: A probability distribution of the reward given for being in state $s$, taking action $a$ and ending up in a new state $s’$</li>
<li>$\mathbb{T}(s, a, s’)$: A probability distribution of state transitions</li>
<li>$\gamma \in [0, 1)$: A scalar discount factor (will come in handy later)</li>
</ul>
<p>Some literature also uses $\mathbf{O}$, the set of possible observations given to the agent by the environment. This is sometimes the same as $\mathbf{S}$ and sometimes not. In a fully observable MDP, the agent has all information about the state of the environment, so when the agent receives an observation $o_{i} \in \mathbf{O}$, it contains the same information as the state of the environment $s_{i} \in \mathbf{S}$. An example of this is chess - each player (agent) knows exactly what the state of the game is at any time. In a partially observable MDP this is not the case: the agent does not have access to the full state of the environment, so an observation does not carry the same information as the state, and the two are genuinely different concepts. An example of this is poker - each player does not know the cards of the other players and therefore does not have access to the full state of the game.</p>
<p>The last concept is a policy, a function $\pi: \mathbf{S} \rightarrow \mathbf{A}$ that tells us which action to take in a given state. The whole idea of RL is to learn a good policy: one that tells us good actions to take in each state of the environment. A policy can be interpreted deterministically as $\pi(s)$ (the action taken when we are in state $s$), or stochastically as $\pi(a|s)$ (the probability of taking action $a$ in state $s$).</p>
<p>Most of the time in RL, we do not have access to the true distributions $R(s, a, s')$ and $\mathbb{T}(s, a, s')$. If we had these distributions, we could calculate the optimal policy directly; without them, we have to estimate the distributions by trying out actions in our environment and seeing whether we get reward.</p>
<h3 id="grid-world">Grid World</h3>
<p>For now, we will assume we have access to the distributions $R(s, a, s')$ and $\mathbb{T}(s, a, s')$ so that we can really drive home the point that if we have the true distributions at hand, we can calculate the optimal policy. Imagine we have the following problem.</p>
<ul>
<li>The agent lives in a grid, where each square is a state. This is the state space.</li>
<li>The agent can move North, South, East, or West (N, S, E, W). This is the action space.</li>
<li>80% of the time, the action the agent takes does as intended. 10% of the time the agent slips and moves to one side, and 10% of the time it slips to the other side. For example, if the agent chooses to move north, there is an 80% chance it will do so, a 10% chance it will move west, and a 10% chance it will move east. This is the transition probability distribution.</li>
<li>There is a destination state that deterministically gives the agent a reward of +1 for reaching it and a terminal state that deterministically gives the agent a reward of -1 for reaching it.</li>
</ul>
<center>
<div class="col-lg-8 col-md-8 col-sm-12 col-xs-12">
<img src="/assets/RL_Intro/gridworld-example.png" />
</div>
</center>
<hr />
<h3 id="finding-optimal-policies">Finding Optimal Policies</h3>
<p>Now that we have a concrete example of a problem, we can discuss what it means to find an optimal policy for it. Some questions that come up when determining what a “good” policy is are “does it maximize the reward right now?” and “does it maximize the future reward?”. Typically, we maximize the discounted future reward; the idea being that we want policies that take future states into consideration, but we also don’t want the policy to focus so much on future rewards that it never takes actions that put the agent in a good state now. Therefore we define the optimal policy $\pi^{\ast}$ in the following way.</p>
<script type="math/tex; mode=display">\pi^{\ast} = \arg \max_{\pi} \mathbb{E}\bigg[\sum_{t \geq 0} \gamma^{t}r_{t}|\pi\bigg]</script>
<p>Here, time is indexed by $t$. This means we want the policy that maximizes the expected discounted reward. Notice that since $\gamma$ is between 0 and 1, rewards closer in time count more than those further in the future.</p>
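A quick numerical illustration of the discounted sum, with a made-up reward sequence, shows how a smaller $\gamma$ downweights a delayed reward:

```python
# The discounted return sum_t gamma^t * r_t for a made-up reward sequence.
def discounted_return(rewards, gamma):
    return sum(gamma ** t * r for t, r in enumerate(rewards))

rewards = [1.0, 0.0, 0.0, 10.0]    # a big reward arrives 3 steps late
early = discounted_return(rewards, gamma=0.9)   # 1 + 0.9**3 * 10, approx 8.29
late = discounted_return(rewards, gamma=0.5)    # 1 + 0.5**3 * 10 = 2.25
# A smaller gamma discounts the delayed reward much more heavily.
```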
<h4 id="value-function-and-q-function">Value Function and Q-Function</h4>
<p>We have a notion of what a “good” policy is, but how do we actually find it? This is where the Value function and Q function come in. The value function is a prediction of future reward and basically answers the question “how good is the current state $s$ that I’m in?”. We denote $V^{\pi}(s)$ as the expected cumulative reward of being in state $s$ and then following policy $\pi$ thereafter.</p>
<script type="math/tex; mode=display">V^{\pi}(s) = \mathbb{E}\bigg[\sum_{t \geq 0} \gamma^{t}r_{t}|s_{0}=s, \pi\bigg]</script>
<p>We also have the notion of an optimal value function $V^{\ast}(s)$, which is the expected cumulative reward of being in state $s$ and then following the optimal policy $\pi^{\ast}$ thereafter. The Q function represents a similar idea - $Q^{\pi}(s, a)$ is the expected cumulative reward for taking action $a$ in state $s$ and then following policy $\pi$ thereafter. Similarly $Q^{\ast}(s, a)$ is the expected cumulative reward of taking action $a$ in state $s$ and following the optimal policy thereafter.</p>
<script type="math/tex; mode=display">Q^{\pi}(s, a) = \mathbb{E}\bigg[\sum_{t \geq 0} \gamma^{t}r_{t}|s_{0}=s, a_{0}=a, \pi\bigg]</script>
<p>Remember, the value function only deals with states, and the Q function deals with state-action pairs! Now we can go about defining the optimal value and policy from the Q function values. It is clear that the optimal value and policy for a state can be defined in terms of the Q function as follows.</p>
<script type="math/tex; mode=display">V^{\ast}(s) = \max_{a}Q^{\ast}(s, a)</script>
<script type="math/tex; mode=display">\pi^{\ast}(s) = \arg \max_{a}Q^{\ast}(s, a)</script>
<p>These optimal values can be calculated recursively using what are called the Bellman equations, defined below. Notice how the calculation of these values requires access to the true distributions $\mathbb{T}(s, a, s')$ (denoted with $p(\cdot)$ below) and $R(s, a, s')$ (denoted with $r(\cdot)$ below).</p>
<script type="math/tex; mode=display">V^{\ast}(s) = \max_{a}\sum_{s'}p(s'|s, a)[r(s, a) + \gamma V^{\ast}(s')]</script>
<script type="math/tex; mode=display">Q^{\ast}(s, a) = \sum_{s'}p(s'|s, a)[r(s, a) + \gamma V^{\ast}(s')]</script>
<p>The summation over all possible next states $s'$ of $p(s'|s, a)$ comes from the definition of expectation in probability, $\mathbb{E}[f(\cdot)] = \sum_{x}p(x) \cdot f(x)$. We are summing, over all subsequent states, the probability of reaching that state given the current state and action, multiplied by the value of ending up there. It should be clear that the expected return of being in state $s$, taking action $a$, and ending up in state $s'$ is exactly $r(s, a) + \gamma V^{\ast}(s')$.</p>
<p>To reiterate, if we know the distributions $\mathbb{T}$ and $R$, we have a recursive way of calculating the optimal Q value of any state-action pair, and hence we can extract the optimal policy. Now we will go over two algorithms for doing so.</p>
<h3 id="value-iteration">Value Iteration</h3>
<p>The idea of value iteration pretty much exactly follows the logic we described above. The algorithm is as follows.</p>
<center>
<div class="col-lg-10 col-md-10 col-sm-12 col-xs-12">
<img src="/assets/RL_Intro/VI-algorithm.png" />
</div>
</center>
<p>Each iteration of Value Iteration costs $O(|\mathbf{S}|^{2}|\mathbf{A}|)$ time and is very expensive for large state spaces. Recall our grid world game with values for each state initialized to 0.</p>
<center>
<div class="col-lg-6 col-md-6 col-sm-12 col-xs-12">
<img src="/assets/RL_Intro/gridworld-VI-step1.png" />
</div>
</center>
<p>Let’s do an example calculation of one iteration of Value Iteration on the state (3, 3) (where the agent is pictured).</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
V^{2}((3, 3)) &= \max_{a}\sum_{s'}p(s'|(3, 3), a)[r((3, 3), a) + \gamma V^{1}(s')] \\
&= \sum_{s'} p(s'|(3, 3), \text{right})[r((3, 3), \text{right}) + \gamma V^{1}(s')] \\
&= (0.8 \cdot (0 + \gamma \cdot 1)) + (0.1 \cdot (0 + \gamma \cdot 0)) + (0.1 \cdot (0 + \gamma \cdot 0)) \\
&= 0.8\gamma
\end{align*} %]]></script>
<p>Note that the above calculation omits the other actions for brevity, since we already know the max operation would select “right” as the optimal action. Now state (3, 3) has value $0.8\gamma$, and we continue recursing to calculate the values of all the other states; doing so completes one iteration of Value Iteration. We repeat this process until the values converge.</p>
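Here is a minimal sketch of Value Iteration on a hand-made 3-state, 2-action MDP (not the gridworld above; the transition and reward tables are invented purely for illustration):

```python
import numpy as np

# A hand-made toy MDP: P[a][s][s'] are transition probabilities,
# R[s][a] are expected rewards, and state 2 is terminal (self-loop).
P = np.array([
    [[0.9, 0.1, 0.0],   # action 0
     [0.1, 0.8, 0.1],
     [0.0, 0.0, 1.0]],
    [[0.1, 0.9, 0.0],   # action 1
     [0.0, 0.1, 0.9],
     [0.0, 0.0, 1.0]],
])
R = np.array([
    [0.0, 0.0],
    [0.0, 1.0],   # action 1 from state 1 usually reaches the goal
    [0.0, 0.0],
])
gamma = 0.9

V = np.zeros(3)
for _ in range(200):
    # Bellman backup: V(s) <- max_a sum_s' p(s'|s,a)[r(s,a) + gamma V(s')]
    Q = R + gamma * np.einsum("ast,t->sa", P, V)
    V_new = Q.max(axis=1)
    if np.max(np.abs(V_new - V)) < 1e-8:   # values have converged
        V = V_new
        break
    V = V_new

greedy_policy = Q.argmax(axis=1)   # extract the optimal policy from Q
```

The inner `einsum` computes $\sum_{s'} p(s'|s,a) V(s')$ for every $(s, a)$ pair at once; the rest is exactly the backup from the algorithm above.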
<h3 id="policy-iteration">Policy Iteration</h3>
<p>The next algorithm we will discuss is called Policy Iteration. The idea is that we start with some policy $\pi_{0}$ and iteratively refine it until the policy does not change anymore (i.e. it has converged). The algorithm involves two steps: computing the value of a policy, then using those values to greedily change the actions chosen by the previous policy to create a new policy.</p>
<center>
<div class="col-lg-10 col-md-10 col-sm-12 col-xs-12">
<img src="/assets/RL_Intro/PI-algorithm.png" />
</div>
</center>
<p>Policy Iteration has time complexity $O(|\mathbf{S}|^{3})$ for each iteration because of the linear system of equations, but in practice it often converges faster than Value Iteration because the policy becomes locked in place faster than the values in Value Iteration.</p>
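A sketch of Policy Iteration on a similarly hand-made toy MDP (the tables are invented for illustration); the linear solve in the evaluation step is where the $O(|\mathbf{S}|^{3})$ per-iteration cost comes from:

```python
import numpy as np

# A hand-made 3-state, 2-action MDP: P[a][s][s'] are transition
# probabilities, R[s][a] expected rewards, state 2 terminal.
P = np.array([
    [[0.9, 0.1, 0.0],   # action 0
     [0.1, 0.8, 0.1],
     [0.0, 0.0, 1.0]],
    [[0.1, 0.9, 0.0],   # action 1
     [0.0, 0.1, 0.9],
     [0.0, 0.0, 1.0]],
])
R = np.array([
    [0.0, 0.0],
    [0.0, 1.0],
    [0.0, 0.0],
])
gamma = 0.9

def evaluate(policy):
    # Policy evaluation: solve V = R_pi + gamma * P_pi V exactly.
    # This |S| x |S| linear solve is the O(|S|^3) per-iteration term.
    P_pi = P[policy, np.arange(3)]    # (|S|, |S|) transitions under pi
    R_pi = R[np.arange(3), policy]    # (|S|,) rewards under pi
    return np.linalg.solve(np.eye(3) - gamma * P_pi, R_pi)

policy = np.zeros(3, dtype=int)
while True:
    V = evaluate(policy)
    # Policy improvement: act greedily with respect to the evaluated V.
    Q = R + gamma * np.einsum("ast,t->sa", P, V)
    new_policy = Q.argmax(axis=1)
    if np.array_equal(new_policy, policy):   # policy is stable: converged
        break
    policy = new_policy
```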
<p>Next time we will discuss how to find good policies even when the distributions $\mathbb{T}$ and $R$ are not known. This will largely amount to taking exploratory actions in the environment to collect data about what sequences of actions give good rewards and what sequences don’t. This opens up the door to the field of RL which we will soon begin exploring.</p>
<hr />Reinforcement LearningVector Spaces, Norms, and Inner Products2020-01-23T08:06:43+00:002020-01-23T08:06:43+00:00http://www.fractalai.org/mfml/2020/01/23/vector-spaces-norms-and-inner-products<p>$\newcommand{\norm}[1]{\left\lVert#1\right\rVert}$</p>
<h3 id="vector-spaces">Vector Spaces</h3>
<p>We will begin our study of the mathematical foundations of machine learning by considering the idea of a vector space. A vector space $\mathbf{S}$ is a set of elements called vectors that obey the following</p>
<ul>
<li>For $\mathbf{x, y, z} \in \mathbf{S}$:
<ul>
<li>$\mathbf{x} + \mathbf{y} = \mathbf{y} + \mathbf{x}$ (commutative)</li>
<li>$\mathbf{x} + (\mathbf{y} + \mathbf{z}) = (\mathbf{x} + \mathbf{y}) + \mathbf{z}$ (associative)</li>
<li>$\mathbf{x} + 0 = \mathbf{x}$</li>
</ul>
</li>
<li>Scalar multiplication is distributive and associative</li>
<li>$\mathbf{S}$ is closed under scalar multiplication and vector addition. i.e.
$\mathbf{x}, \mathbf{y} \in \mathbf{S} \implies a\mathbf{x} + b\mathbf{y} \in S \quad \forall a, b \in \mathbb{R}$</li>
</ul>
<p>The last bullet is arguably the most important, and is what earns the structure the more descriptive name “linear vector space”.</p>
<p>A couple of examples of linear vector spaces:</p>
<ol>
<li>
<p>$\mathbb{R}^{N}$</p>
<script type="math/tex; mode=display">\mathbf{x} = \begin{bmatrix}x_{1} \\ \vdots \\ x_{N}\end{bmatrix}</script>
<p>Note that the addition of any two vectors in $\mathbb{R}^{N}$ is also a vector in $\mathbb{R}^{N}$.</p>
</li>
<li>
<p>The set of all polynomials of degree $N$</p>
<p>Note that for polynomials $p(x) = \alpha_{N}x^{N} + … + \alpha_{1}x + \alpha_{0}$ and $t(x) = \beta_{N}x^{N} + … + \beta_{1}x + \beta_{0}$, $ap(x) + bt(x)$ is still a polynomial of degree $N$ for any choice of $a$ and $b$, therefore the space of all degree $N$ polynomials is a linear vector space.</p>
</li>
</ol>
<p>Thinking of functions as elements of a vector space might seem strange, but we will soon see that functions can be represented as discrete sets of numbers (i.e. vectors) and manipulated the same way that we normally think about manipulating vectors in $\mathbb{R}^{N}$.</p>
<h4 id="linear-subspaces">Linear Subspaces</h4>
<p>Now that we have the notion of a vector space, we can introduce the idea of a linear subspace, which is a mathematical tool that will soon become useful. A linear subspace is a subset $\mathbf{T}$ of a vector space $\mathbf{S}$ that contains the zero vector (i.e. $\mathbf{0} \in \mathbf{T}$) and is closed under vector addition and scalar multiplication.</p>
<script type="math/tex; mode=display">\mathbf{x}, \mathbf{y} \in \mathbf{T} \implies a\mathbf{x} + b\mathbf{y} \in T \quad \forall a, b \in \mathbb{R}</script>
<div class="container">
<div class="row">
<div class="col-lg-6 col-md-6 col-sm-12 col-xs-12">
<figure class="figure">
<img src="/assets/Vector_Spaces/subspace_counterexample.png" />
<figcaption class="figure-caption text-center">T is not a linear subspace</figcaption>
</figure>
</div>
<div class="col-lg-6 col-md-6 col-sm-12 col-xs-12">
<figure class="figure">
<img src="/assets/Vector_Spaces/subspace_example.png" />
<figcaption class="figure-caption text-center">T is a linear subspace</figcaption>
</figure>
</div>
</div>
</div>
<p>In the figure above we see on the left a counter example of a linear subspace. It is a counter example because it does not contain the zero vector and also because it is easy to see we could take a linear combination of two vectors in $\mathbf{T}$ to get a vector outside $\mathbf{T}$, so both conditions are violated. This is not the case for the subspace on the right, and it is in fact a linear subspace of $\mathbf{S} = \mathbb{R}^{2}$.</p>
<hr />
<h3 id="norms">Norms</h3>
<p>A vector space is a set of elements that obey certain properties. By introducing a norm on a particular vector space, we give it a sense of distance. A norm $\norm{\cdot}$ is a mapping from a vector space $\mathbf{S}$ to $\mathbb{R}$ such that for all $\mathbf{x, y} \in \mathbf{S}$,</p>
<ol>
<li>$\norm{\mathbf{x}} \geq 0$ and $\norm{\mathbf{x}} = 0 \iff \mathbf{x} = \mathbf{0}$</li>
<li>$\norm{\mathbf{x} + \mathbf{y}} \leq \norm{\mathbf{x}} + \norm{\mathbf{y}}$ (triangle inequality)</li>
<li>$\norm{a\mathbf{x}} = |a|\norm{\mathbf{x}}$ (homogeneity)</li>
</ol>
<p>This definition should feel familiar. The norm of a vector $\norm{\mathbf{x}}$ is its distance from the origin and the norm of the difference of two vectors $\norm{\mathbf{x - y}}$ is the distance between the two vectors. Here are some examples of norms that we will be using later on.</p>
<ol>
<li>
<p>The standard euclidean norm (aka the $\ell_{2}$ norm): $\mathbf{S} = \mathbb{R}^{N}$</p>
<script type="math/tex; mode=display">\norm{\mathbf{x}}_{2} = \sqrt{\sum_{n=1}^{N}|x_{n}|^{2}}</script>
</li>
<li>
<p>$\mathbf{S} = $ the set of continuous functions on $\mathbb{R}$ ($\mathbf{x}$ is a function)</p>
<script type="math/tex; mode=display">\norm{\mathbf{x}}_{2} = \sqrt{\int_{-\infty}^{\infty}|x(t)|^{2}dt}</script>
</li>
</ol>
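The norm properties above are easy to check numerically; here is a small NumPy sketch with arbitrary vectors:

```python
import numpy as np

x = np.array([3.0, 4.0])
y = np.array([1.0, -2.0])

# The l2 norm: sqrt of the sum of squared entries.
assert np.linalg.norm(x) == 5.0    # sqrt(9 + 16)

# Triangle inequality: ||x + y|| <= ||x|| + ||y||.
assert np.linalg.norm(x + y) <= np.linalg.norm(x) + np.linalg.norm(y)

# Homogeneity: ||a x|| = |a| ||x||.
a = -3.0
assert np.isclose(np.linalg.norm(a * x), abs(a) * np.linalg.norm(x))
```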
<h3 id="inner-products">Inner Products</h3>
<p>By now we have introduced vector spaces and normed vector spaces, the latter being a subset of the former. Now we will introduce the inner product. The inner product $\langle\cdot, \cdot\rangle$ is a function that takes two vectors in a vector space and produces a real number (or complex number, but we will ignore this for now).</p>
<script type="math/tex; mode=display">\langle\cdot,\cdot\rangle: \mathbf{S}\times\mathbf{S}\rightarrow \mathbb{R}</script>
<p>A valid inner product obeys three rules for $\mathbf{x, y, z}\in\mathbf{S}$:</p>
<ol>
<li>
<p>$\langle\mathbf{x},\mathbf{y}\rangle = \langle\mathbf{y},\mathbf{x}\rangle$ (symmetry)</p>
</li>
<li>
<p>For $a, b \in \mathbb{R}$</p>
<p><script type="math/tex">\langle a\mathbf{x} + b\mathbf{y}, \mathbf{z}\rangle = a\langle\mathbf{x}, \mathbf{z}\rangle + b\langle\mathbf{y}, \mathbf{z}\rangle</script> (linearity)</p>
</li>
<li>
<p>$\langle\mathbf{x}, \mathbf{x}\rangle \geq 0$ and $\langle\mathbf{x}, \mathbf{x}\rangle = 0 \iff \mathbf{x} = \mathbf{0}$</p>
</li>
</ol>
<p>Two important examples of inner products are</p>
<ol>
<li>
<p>The standard inner product (aka the dot product): $\mathbf{S} = \mathbb{R}^{N}$</p>
<script type="math/tex; mode=display">\langle\mathbf{x},\mathbf{y}\rangle = \sum_{n=1}^{N}x_{n}y_{n} = \mathbf{y}^{T}\mathbf{x}</script>
</li>
<li>
<p>The standard inner product for continuous functions on $\mathbb{R}$. If $\mathbf{x, y}$ are two such functions</p>
<script type="math/tex; mode=display">\langle\mathbf{x}, \mathbf{y}\rangle = \int_{-\infty}^{\infty}x(t)y(t)dt</script>
</li>
</ol>
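<p>Both examples can be computed directly, or in the function case approximated on a grid. A sketch assuming NumPy; the vectors and the two Gaussian functions are arbitrary choices for illustration.</p>

```python
import numpy as np

# Standard inner product (dot product) on R^N
x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, -1.0, 0.5])
ip = np.sum(x * y)  # equivalent to y.T @ x
print(ip)  # 3.5

# Inner product of two continuous functions, approximated by a Riemann
# sum on a finite grid. The true integral runs over all of R, but these
# Gaussians decay fast enough that [-10, 10] captures essentially all of it.
t = np.linspace(-10.0, 10.0, 100001)
f = np.exp(-t ** 2)
g = np.exp(-(t - 1.0) ** 2)
ip_fg = np.sum(f * g) * (t[1] - t[0])
print(ip_fg)  # ~0.7602; the exact value is sqrt(pi/2) * exp(-1/2)
```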
<p>The last concept I want to introduce is the idea of an induced norm. It is a fact that every valid inner product induces a valid norm (but not the other way around). This induced norm has very useful properties that not all other norms have. For some inner product $\langle\cdot,\cdot\rangle_{\mathbf{S}}$ on a vector space $\mathbf{S}$, the induced norm is defined as</p>
<script type="math/tex; mode=display">\norm{\mathbf{x}}_{\mathbf{S}} = \sqrt{\langle\mathbf{x},\mathbf{x}\rangle_{\mathbf{S}}}</script>
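<p>In code, the induced norm is a one-liner on top of any valid inner product. A sketch assuming NumPy; <code>dot</code> here is the standard inner product from above.</p>

```python
import numpy as np

def induced_norm(x, inner):
    """Norm induced by an inner product: ||x|| = sqrt(<x, x>)."""
    return np.sqrt(inner(x, x))

# Standard inner product on R^N
dot = lambda u, v: np.sum(u * v)

x = np.array([3.0, 4.0])
print(induced_norm(x, dot))  # 5.0 -- the standard euclidean norm of x
```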
<p>The standard inner product induces the standard Euclidean norm. Two important properties of induced norms (not all norms!) are</p>
<ol>
<li>
<p>The Cauchy-Schwarz Inequality:</p>
<script type="math/tex; mode=display">|\langle\mathbf{x},\mathbf{y}\rangle| \leq \norm{\mathbf{x}}\norm{\mathbf{y}}</script>
</li>
<li>
<p>Pythagorean Theorem:</p>
<p>If $\langle\mathbf{x},\mathbf{y}\rangle = 0$ then $\mathbf{x}$ and $\mathbf{y}$ are orthogonal and $\norm{\mathbf{x} + \mathbf{y}}^{2} = \norm{\mathbf{x}}^{2} + \norm{\mathbf{y}}^{2}$</p>
</li>
</ol>
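<p>Both properties are easy to verify numerically for the standard inner product and its induced norm. An illustrative NumPy sketch; the vectors are arbitrary examples.</p>

```python
import numpy as np

# Cauchy-Schwarz: |<x, y>| <= ||x|| ||y||
x = np.array([2.0, -1.0, 3.0])
y = np.array([0.5, 4.0, -2.0])
lhs = abs(np.dot(x, y))
rhs = np.linalg.norm(x) * np.linalg.norm(y)
print(lhs <= rhs)  # True

# Pythagorean theorem: if <u, v> = 0, then ||u + v||^2 = ||u||^2 + ||v||^2
u = np.array([1.0, 0.0])
v = np.array([0.0, 2.0])
assert np.dot(u, v) == 0  # u and v are orthogonal
print(np.isclose(np.linalg.norm(u + v) ** 2,
                 np.linalg.norm(u) ** 2 + np.linalg.norm(v) ** 2))  # True
```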
<p>A Hilbert space is an inner product space that is also complete, meaning that every infinite sequence of elements $\mathbf{x_{1}}, \mathbf{x_{2}}, … $ whose elements get arbitrarily close to one another also converges to some precise element in the space. In more rigorous terms, every Cauchy sequence is convergent. The spaces we discuss in these guides will have this completeness property unless otherwise stated, so I will use Hilbert space and inner product space more or less interchangeably; just keep the completeness requirement in the back of your mind.</p>
<p>All the ideas presented in these notes are important foundational mathematical concepts that we will make use of in later notes. You should become very familiar with them and know how to determine whether an inner product or a norm is valid. Now that we have some mathematical tools, next time we will discuss a foundational problem in machine learning - linear approximation.</p>
<hr />
<hr />$\newcommand{\norm}[1]{\left\lVert#1\right\rVert}$