[MUSIC] Stanford University. >> Okay everyone. We're ready. Okay well welcome to
CS224N / Linguistics 284. This is kind of amazing. Thank you to everyone who's here that's
involved and also the people who don't fit in here and the people who
are seeing it online on SCPD. Yeah it's totally amazing the number of
people who've signed up to do this class and so in some sense it seems like
you don't need any advertisements for why the combination of
natural language processing and deep learning is a good
thing to learn about.
But nonetheless today,
this class is really going to give some of that advertisement,
so I'm Christopher Manning. So what we're gonna do is I'm
gonna start off by saying a bit of stuff about what natural language
processing is and what deep learning is, and then after that we'll spend a few
minutes on the course logistics. And a word from my co-instructor, Richard. And then, get through some more
material on why is language understanding difficult, and then starting
to do an intro to deep learning for NLP.
So we've gotten off to
a rocky start today, cause I guess we started about ten minutes
late because of that fire alarm going off. Fortunately, there's actually not a lot
of hard content in this first lecture. This first lecture is really to
explain what an NLP class is and say some motivational content about how and
why deep learning is changing the world. That's going to change immediately
in the Thursday lecture, because in the Thursday lecture we're
gonna start with sort of vectors and derivatives and chain rules and
all of that stuff.
So you should get mentally prepared for that change of level
between the two lectures. Okay, so first of all what is
natural language processing? So natural language processing, that's
the sort of computer scientist's name for the field. Essentially synonymous with
computational linguistics which is sort of the linguist's name of the field. And so it's in this intersection of
computer science and linguistics and artificial intelligence. Where what we're trying
to do is get computers to do clever things with human languages
to be able to understand and express themselves in human languages
the way that human beings do. So natural language processing counts
as a part of artificial intelligence. And there are obviously other important
parts of artificial intelligence, like computer vision, and robotics, and knowledge representation,
reasoning and so on. But language has had a very special
part in artificial intelligence, and that's because language has
been this very distinctive property of human beings, and we think and go about
the world largely in terms of language. So lots of creatures around the planet
have pretty good vision systems, but human beings are alone for language.
And when we think about how we express
our ideas and go about doing things that language is largely our tool for
thinking and our tool for communication. So it's been one of the key
technologies that people have thought about in artificial intelligence and it's
the one that we're going to look at today. So our goal is how can we
get computers to process or understand human languages in order
to perform tasks that are useful. So that could be things like making
appointments, or buying things, or it could be more highfalutin goals of sort
of, understanding the state of the world. And so this is a space in which there's
starting to be a huge amount of commercial activity in various directions,
some of things like making appointments. A lot of it in the direction
of question answering. So, luckily for people who do language,
the arrival of mobile has just been super, super friendly in terms of the importance
of language has gone way way higher. And so now really all of the huge
tech firms, whether it's Siri, Google Assistant, Facebook or Cortana,
what they're furiously doing is putting out products that use natural
language to communicate with users. And that's an extremely
compelling thing to do. It's extremely compelling on phones
because phones have these dinky little keyboards that are really
hard to type things on. And a lot of you guys are very
fast at texting, I know that, but really a lot of those problems are much
worse for a lot of other people. So it's a lot harder to put in Chinese
characters than it is to put in English letters. It's a lot harder if you're elderly. It's a lot harder if you've
got low levels of literacy.
But then there are also
new vistas opening up. So Amazon has had this amazing success
with Alexa, which has really shown the utility of having devices that
are just ambient in the environment, and that again you can communicate
with by talking to them. As a quick shout-out for
Apple, I mean, really, we do have Apple to thank for
launching Siri. It was, essentially,
Apple taking the bet on saying we can turn human language into
consumer technology that really did set off this arms race that every
other company is now engaged in. Okay, I just sort of loosely said meaning.
One of the things that we'll talk about
more is that meaning is kind of a complex, hard thing and it's hard to know what
it means to fully understand meaning. At any rate that's certainly a very tough
goal which people refer to as AI-complete and it involves all forms of
our understanding of the world. So a lot of the time when we
say understand the meaning, we might be happy if we sort of
half understood the meaning. And we'll talk about different
ways that we can hope to do that. Okay, so one of the other things
that we hope that you'll get in this class is sort of a bit of
appreciation for human language and what its levels are and
how it's processed.
Now obviously we're not gonna do a huge
amount of that. If you really wanna learn a lot about that, there are lots of classes that you can
take in the linguistics department and learn much more about it. But I really hope you can at least
sort of get a bit of a high-level understanding. So this is kind of the picture that
people traditionally have given for levels of language. So at the beginning there's input. So input would commonly be speech. And then you're doing phonetic and phonological analysis to
understand that speech. Though commonly it is also text. And then there's some processing
that's done there which has sort of been a bit marginal from
a linguistics point of view, OCR, working out the tokenization of the words.
But then what we do is go through
a series of processing steps where we work out complex
words like 'incomprehensible', which has the 'in' in front and
the 'ible' at the end. And that's sort of morphological analysis,
the parts of words. And then we try and understand the structure of sentences,
that's syntactic analysis. So if I have a sentence
like 'I sat on the bench', that 'I' is the subject of the verb 'sat',
and the 'on the bench' is the location. Then after that we attempt to
do semantic understanding. And that semantic interpretation is
working out the meaning of sentences. But simply knowing the meaning
of the words of a sentence isn't sufficient to actually really
understand human language. A lot is conveyed by the context
in which language is used. And so that then leads into areas like
pragmatics and discourse processing. So in this class, where we're gonna
spend most of our time is in that middle piece of syntactic analysis and
semantic interpretation. And that's sort of the bulk of our
natural language processing class.
We will say a little bit about, right at
the top left, this speech signal analysis. And interestingly, that was actually
the first place where deep learning really proved itself as super, super
useful for tasks involving human language. Okay, so applications of
Natural Language Processing are now really spreading out thick and fast. And every day you're variously
using applications of Natural Language Processing. And they vary on a spectrum. So they vary from very simple
ones to much more complex ones. So at the low level,
there are things like spell checking, or doing the kind of
autocomplete on your phone. So that's a sort of a primitive
language understanding task. Variously, when you're doing web searches, your search engine is considering
synonyms, and things like that for you. And, well,
that's also a language understanding task. But what we are gonna be more
interested in is trying to push our language understanding
computers up to more complex tasks. So for some of the next-level-up kind of tasks,
we're actually gonna want to have computers look at text information,
be it websites, newspapers or whatever,
and get the information out of it, to
actually understand the text well enough that they know what it's talking
about to at least some extent. And so that could be things like extracting
particular kinds of information, like products and their prices or people and
what jobs they have and things like that. Or it could be doing other related
tasks to understanding the document, such as working out the reading level or
intended audience of the document. Or whether this tweet is
saying something positive or negative about this person,
company, band or whatever. And then going even a higher level than
that, what we'd like our computers to be able to do is complete whole
level language understanding tasks. And some of the prominent tasks of that
kind that we're going to talk about. Machine translation, going from one human
language to another human language. Building spoken dialogue systems,
so you can chat to a computer and have a natural conversation,
just as you do with human beings.
Or having computers that can actually
exploit the knowledge of the world that's available on things like
Wikipedia and other sources. And so it could actually just
intelligently answer questions for you, like a know-everything
human being could. Okay, and we're starting to see a lot
of those things actually being used regularly in industry. So every time you're doing a search,
in little places, there are bits of natural language processing and
natural language understanding happening. So if you're putting in
forms of words with endings, your search engine's
considering taking them off. If there are spelling errors,
they're being corrected. Synonyms are being considered,
and things like that. Similarly, when you're being matched for
advertisements. But what's really exciting is that
we're now starting to see much bigger applications of natural language
processing being commercially successful.
So in the last few years,
there's just been amazing, amazing advances in machine translation
that I'll come back to later. There have been amazing advances in
speech recognition so that we just now get hugely good performance in speech
recognition even on our cell phones. Products like sentiment analysis
have become hugely commercially important, right? It depends on your favorite industries but
there are lots of Wall Street firms that every hour of the day
are scanning news articles looking for sentiment about companies to make buy and
sell decisions. And just recently,
really over the last 12 months, there's been this huge growth of
interest in how to build chatbots and dialog agents for
all sorts of interface tasks. And that sort of seems like it's
growing to become a huge new industry.
Okay, see I'm getting behind already. So in just a couple of minutes, I want to say the corresponding
things about deep learning. But before getting into that, let me just say a minute about
what's special about human language. Maybe we'll come back to this, but I think it's interesting to have
a sense of this right at the beginning. So there's an important
difference between language and most other kinds of things that people
think of when they do signal processing and data mining and
all of those kinds of things.
So for most things, there's just sort of
data that's either the world out there, and you
pick it up with some kind of visual system. Or someone's sort of buying
products at the local Safeway. And then someone else is picking
up the sales log and saying, let me analyze this and
see what I can find, right? So it's just sort of all
this random data and then someone's trying
to make sense of it.
So fundamentally,
human language isn't like that. Human language isn't just sort of a
massive data exhaust that you're trying to process into something useful. Human language,
almost all of it is that there's some human being who actually had some
information they wanted to communicate. And they constructed
a message to communicate that information to other human beings. So it's actually a deliberate
form of sending a particular message to other people. Okay, and an amazing fact about human
language is it's this very complex system that somehow two, three, four year old kids amazingly
can start to pick it up and use it. So there's something good going on there. Another interesting property of
language is that language is actually what you could variously call a discrete,
symbolic, or categorical signaling system. So we have words for
concepts like rocket or violin. And basically, we're communicating
with other people via symbols. There are some tiny exceptions for
expressive signaling, so you can distinguish saying,
I love it versus I LOVE it. And that sounds stronger. But 99% of the time it's using these
symbols to communicate meaning. And presumably, that came about in
a sort of EE information theory sense.
Because by having symbols, they're very reliable units that can
be signaled reliably over a distance. And so that's an important
thing to be aware of, right? Language is symbols. So symbols aren't just some
invention of logic or classical AI. But then, when we move beyond that, there's actually something
interesting going on. So when human beings
communicate with language, although what they're wanting
to communicate involves symbols, the way they communicate those
symbols is using a continuous substrate. And a really interesting
thing about language is you can convey exactly the same message by
using different continuous substrates.
So commonly, we use voice and
so there are audio waves. You can put stuff on a piece of paper and
then you have a vision problem. You can also use sign
language to communicate. And that's a different kind
of continuous substrate. So all of those can be used. But there's sort of a symbol underlying
all of those different encodings. Okay, so the picture we have is that
the communication medium is continuous. Human languages are a symbol system. And then the interesting part
is what happens after that. So the dominant idea in most of
the history of philosophy and science and artificial intelligence
was to sort of project the symbol system of
language into our brains.
And think of brains as
symbolic processors. But that doesn't actually seem to have
any basis in what brains are like. Everything that we know about
brains is that they're completely continuous systems as well. And so the interesting idea that's
been emerging out of this work in deep learning is to say, no, what we should
be doing is also thinking of our brains as having continuous
patterns of activation. And so then the picture we have is that
we're going from continuous to symbolic, back to continuous every
time that we use language. So that's interesting. It also points out one of the problems of
doing language understanding that we'll come back to a lot of times. So in languages we have huge vocabularies. So languages have tens of
thousands of words minimum. And really, languages like English
with a huge scientific vocabulary, have hundreds of thousands
of words in them. It depends how you count.
If you start counting up all of the
morphological forms, you can argue some languages have an infinite number of words
cuz they have productive morphology. But however you count, it means we've
got this huge problem of sparsity and that's one of the big problems that
we're gonna have to deal with. Okay, now I'll change gears and say
a little bit of an intro to deep learning. So deep learning has been this area that
has erupted over sort of this decade.
And I mean, it's just been enormously, enormously exciting how deep learning
has succeeded and how it has expanded. So really, at the moment it seems like
every month you see in the tech news that there's just amazing new improvements
that are coming out from deep learning. So one month it's super human
computer vision systems, the next month it's machine
translation that's vastly improved. The month after that people are working
out how to get computers to produce their own artistry
that's incredibly realistic.
Then the month after that, people are producing new text-to-speech
systems that sound amazingly lifelike. I mean, there's just been this
sort of huge dynamic of progress. So what is underlying all of that? So, well, as a starting point, deep
learning, it's part of machine learning. So in general, it's this idea of how
can we get computers to learn stuff automatically, rather than just us having
to tell them things and coding by hand in the kind of traditional write computer
program to tell it what you want it to do. But deep learning is also profoundly
different to the vast majority of what happened in machine learning
in the 80s, 90s, and 00s. And the central difference is that for
most of traditional machine learning, if I call it that. So this is all of the stuff like
decision trees, logistic regressions, naive Bayes, support vector machines,
and any of those sort of things.
Essentially the way
that we did things was, what we did was have a human being
who looked carefully at a particular problem and worked out what
was important in that problem. And then designed features that
would be useful features for handling the problem that they
would then encode by hand. Normally by writing little
bits of Python code or something like that to
recognize those features. They're probably a little bit small to
read, but over on the right-hand side, these are showing some features for
an entity recognition system. Finding person names,
company names, and so on in text. And this is just the kind of
system I've written myself. So, well, if you want to know whether
a word is a company, you'd wanna look whether it was capitalized, so
you have a feature like that.
It turns out that looking at
the words to the left and right would be useful to have features for
that. It turns out that looking
at substrings of words is useful cause there are kind
of common patterns of letter sequences that indicate names
of people versus names of companies. So you put in features for substrings. If you see hyphens and things,
that's an indicator of some things. You put in a feature for that. So you keep on putting in features and
commonly these kind of systems would end up with millions of
hand-designed features. And that was essentially how Google search
was done until about 2015 as well, right? They liked the word signal
rather than feature. But the way you improved Google
search was every month some bunch of engineers came
up with some new signal. That they could show with an experiment
that if you added in these extra features, Google search got a bit better. And [INAUDIBLE] a degree and
that would get thrown in, and things would get a bit better.
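[Editor's sketch] The hand-designed entity-recognition features described above (capitalization, the words to the left and right, substrings, hyphens) could be written along these lines. This is a minimal illustration and not the system shown on the slide: the function name `token_features`, the feature-name strings, and the choice of 2- and 3-character substrings are all assumptions made for this example.

```python
def token_features(tokens, i):
    """Return a set of hand-designed binary features for tokens[i].

    Illustrative only: the feature names and the substring lengths
    are assumptions, not taken from the lecture's actual system.
    """
    word = tokens[i]
    feats = set()

    # Is the word capitalized?
    if word[:1].isupper():
        feats.add("is_capitalized")

    # Words immediately to the left and right (with boundary markers).
    prev_word = tokens[i - 1].lower() if i > 0 else "<s>"
    next_word = tokens[i + 1].lower() if i + 1 < len(tokens) else "</s>"
    feats.add("prev=" + prev_word)
    feats.add("next=" + next_word)

    # Character substrings of length 2 and 3, with start/end markers.
    padded = "^" + word.lower() + "$"
    for n in (2, 3):
        for k in range(len(padded) - n + 1):
            feats.add("sub=" + padded[k:k + n])

    # Does the word contain a hyphen?
    if "-" in word:
        feats.add("has_hyphen")

    return feats


tokens = "Shares of Hewlett-Packard rose".split()
feats = token_features(tokens, 2)
print(sorted(feats))
```

Even this toy version produces dozens of features for a single token, which gives a feel for how such systems ended up with millions of them.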
But the thing to think about is, well,
this was advertised as machine learning, but what was the machine
actually learning? It turns out that the machine
was learning almost nothing. So the human being was learning
a lot about the problem, right? They were looking at the problem hard,
doing lots of data analysis, developing theories, and learning a lot about
what was important for this property. What was the machine doing? It turns out that the only
thing the machine was doing was numeric optimization. So once you had all these signals, what you're then going to be doing
was building a linear classifier. Which meant that you were putting a
parameter weight in front of each feature. And the machine learning system's
job was to adjust those numbers so as to optimize performance. And that's actually something that
computers are really good at. Computers are really good at
doing numeric optimization and it's something that human beings
are actually less good at. Cuz humans, if you say,
here are 100 features, put a real number in front of
each one to maximize performance.
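[Editor's sketch] The step of putting a weight in front of each feature and letting the machine adjust those numbers can be illustrated as follows. The lecture does not say which linear classifier or optimizer was used, so this assumes logistic regression trained by plain stochastic gradient ascent on the log-likelihood; the toy data, feature layout, learning rate, and function names are all made up for the example.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_linear_classifier(examples, n_features, lr=0.5, epochs=200):
    """Fit one weight per feature (assumed: logistic regression + SGD)."""
    weights = [0.0] * n_features
    for _ in range(epochs):
        for x, y in examples:
            # Predicted probability of the positive class.
            p = sigmoid(sum(w * xi for w, xi in zip(weights, x)))
            # Gradient step on the log-likelihood for this one example.
            for j in range(n_features):
                weights[j] += lr * (y - p) * x[j]
    return weights


def predict(weights, x):
    return 1 if sum(w * xi for w, xi in zip(weights, x)) > 0 else 0


# Hypothetical toy data. Features: [bias, is_capitalized, has_hyphen];
# label 1 means "this token is part of a name".
examples = [
    ([1, 1, 0], 1),
    ([1, 1, 1], 1),
    ([1, 0, 0], 0),
    ([1, 0, 1], 0),
]
weights = train_linear_classifier(examples, n_features=3)
print(weights)
print([predict(weights, x) for x, _ in examples])
```

The human supplies the features; the only thing learned here is the list `weights`.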
Well, they've got sort of a vague idea but they certainly can't do that
as well as a computer can. So that was useful but
is doing numeric optimization, is that what machine learning means? It doesn't seem like it should be. Okay, so what we found was that in practice
machine learning was sort of 90% human beings working out how to describe
data and work out important features. And only sort of 10% the computer
running this numerical optimization learning algorithm. Okay, so
how does that differ with deep learning? So deep learning is part of this field that's
called representation learning. And the idea of representation learning is
to say, we can just feed to our computers raw signals from the world, whether that's
visual signals or language signals. And then the computer can automatically,
by itself, come up with good intermediate representations
that will allow it to do tasks well. So in some sense,
it's gonna be inventing its own features in the same way that in the past the human
being was inventing the features. So precisely deep learning, the real meaning of the word deep
learning is the argument that you could actually have multiple layers
of learned representations.
And that you'd be able to outperform
other methods of learning by having multiple layers
of learned representations. That was where the term
deep learning came from. Nowadays, half the time, deep learning
just means you're using neural networks. And the other half of the time it means
there's some tech reporter writing a story and it's vaguely got to do
with intelligent computers and all other bets are off.
Okay, [LAUGH] yeah. So with the kind of coincidence
where sort of deep learning really means neural networks a lot of
the time, we're gonna be part of that. So what we're gonna focus on in this class
is different kinds of neural networks. So at the moment,
they're clearly the dominant family of ways in which people have reached
success in doing deep learning. But it's not the only possible way
that you could do it. People have certainly looked at trying to use various
other kinds of probabilistic models and other things in deep architectures. And I think there may well be
more of that work in the future. What are these neural networks
that we are talking about? That's something we'll come back to and
talk a lot about both on Thursday and next week.
I mean you'll notice a lot of
this neural terminology. I mean in some sense if you're kind of
coming from a background of statistics or something like that,
you could sort of say neural networks,
they're kind of nothing really more
than stacked logistic regressions or perhaps more generally kinda
stacked generalized linear models. And in some sense that's true. There are some connections to
neuroscience in some cases, but that's not a big focus
on this class at all. But on the other hand, there's
something very qualitatively different, that by the kind of architectures
that people are building now for these complex stacking of
neural unit architectures, you end up with a behavior and a way of
thinking and a way of doing things that's just hugely different, than anything that
was coming before in earlier statistics. We're not really gonna take
a historical approach, we're gonna concentrate on
methods that work well right now. If you'd like to read a long
history of deep learning, though I'll warn you it's a pretty dry and
boring history, there's this very long arXiv paper by
Jürgen Schmidhuber that you could look at.
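[Editor's sketch] The earlier remark that neural networks can be seen as stacked logistic regressions can be made concrete with a small forward pass. This is only an illustration under assumptions: sigmoid units throughout, two layers, and arbitrary hand-picked weights; the function names are hypothetical, and no training is shown.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def logistic_layer(inputs, weights, biases):
    """One layer: several logistic regressions over the same inputs.

    weights holds one row of input weights per unit in the layer.
    """
    return [
        sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def two_layer_network(inputs, hidden_w, hidden_b, output_w, output_b):
    # The outputs of the first stack of logistic regressions become
    # the inputs (learned features) of the next one.
    hidden = logistic_layer(inputs, hidden_w, hidden_b)
    return logistic_layer(hidden, output_w, output_b)


# Arbitrary illustrative weights: 3 inputs -> 2 hidden units -> 1 output.
hidden_w = [[0.5, -1.0, 0.25], [-0.75, 0.5, 1.0]]
hidden_b = [0.0, 0.1]
output_w = [[1.5, -2.0]]
output_b = [0.2]

output = two_layer_network([1.0, 0.0, 1.0], hidden_w, hidden_b, output_w, output_b)
print(output)
```

The hidden layer plays the role that hand-designed features played before: its values feed the final logistic regression, but they are computed from weights rather than written by a person.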
Okay, so why is deep learning exciting? So in general our manually designed
features tend to be overspecified, incomplete, take a long time to design and
validate, and only get you to a certain level of
performance at the end of the day. Whereas the learned features are easy
to adapt, fast to train, and they can keep on learning so
that they get to a better level of performance than we've been
able to achieve previously.
So, deep learning ends up providing this
sort of very flexible, almost universal learning framework which is just great for
representing all kinds of information. Linguistic information but also world
information or visual information. It can be used in both supervised
fashions and unsupervised fashions. The real reason why deep learning
is exciting to most people is it has been working. So starting from approximately 2010,
there were initial successes where deep learning was shown to work far
better than any of the traditional machine learning methods that had been used for
the last 30 years. But going even beyond that, what has just been totally stunning
is over the last six or seven years, there's just been this amazing ramp in
which deep learning methods have been keeping on being improved and
getting better at just an amazing speed. Which is actually sort of being,
maybe I'm biased, but in the length of my lifetime,
I'd actually just say it's unprecedented, in terms of seeing a field that has
been progressing quite so quickly in its ability to be sort of rolling out better
methods of doing things, month on month.
And that's why you're sort of seeing
all of this huge industry excitement, new products, and you're all here today. So why has deep learning succeeded so
brilliantly? And I mean this is actually
a slightly more subtle and in some sense not quite so
uplifting a tale. Because when you look, a lot of
the key techniques that we use for deep learning were actually
invented in the 80s or 90s. They're not new. We're using a lot of stuff that
was done in the 80s and 90s. And somehow,
they didn't really take off then. So what is the difference? Well it turns out that actually
some of the difference, actually maybe quite a lot of
the difference, is just that technological advances have happened
that make this all possible.
So we now have vastly greater amounts
of data available because of our online society where just about
everything is available as data. And having vast amounts of data
really favors deep learning models. In the 80s and 90s, there sort of wasn't really enough
compute power to do deep learning well. So having sort of several
more decades of compute power has just made it that we can
now build systems that work. I mean in particular there's
been this amazing confluence that deep learning has proven to be just
super well suited to the kind of parallel vector processing that's available now for
very little money in GPUs. So there's been this sort of
marriage between deep learning and GPUs, which has enabled a lot
of stuff to have happened. So that's actually quite
a lot of what's going on. But it's not the only thing that's going
on and it's not the thing that's leading to this sort of things keeping on getting
better and better month by month. I mean, people have also come up with better ways of learning
intermediate representations. They've come up with much better ways of
doing end-to-end joint system learning.
They've come up with much better ways of transferring information between domains
and between contexts and things. So there are also a lot of new algorithms
and algorithmic advances and they're sort of in some sense the more exciting
stuff that we're gonna focus on for more of the time. Okay, so really the first big breakthrough in
deep learning was in speech recognition. It wasn't as widely heralded as the second
big breakthrough in deep learning. But this was really
the big one that started it. At the University of Toronto,
George Dahl working with Geoff Hinton started showing on tiny datasets, that they could do exciting things with deep
neural networks for speech recognition. So George Dahl then went off to Microsoft
and then fairly shortly after that, another student from Toronto
went to Google and they started building big speech recognition systems
that use deep learning networks.
And speech recognition's a problem
that's been worked on for decades by hundreds of people. And there are big companies. And there was this sort of fairly
standardized technology of using Gaussian mixture models for
the acoustic analysis and hidden Markov models and blah blah blah. Which people have been honing for decades
trying to improve a few percent a year. And what they were able to
show was by changing from that to using deep learning models for
doing speech recognition, that they were immediately able to get just these
enormous decreases in word error rate. About a 30% decrease in word error rate. Then the second huge example of
the success of deep learning, which ended up being a much bigger thing
in terms of everybody noticing it, was in the ImageNet computer
vision competition.
So in 2012 again students of Geoff Hinton
at Toronto set about building a computer vision system for doing the ImageNet task
of classifying objects into categories. And that was again a task that
had been run for several years. And performance seemed fairly stalled with
traditional computer vision methods, and by running deep neural networks on GPUs
they were able to get an over one-third error reduction
in one fell swoop. And that progress has continued
through the years, but we won't say a lot on that here. Okay, that's taken me a fair way. So let's stop for a moment and
do the logistics, and I'll say more about deep learning and NLP.
Okay, so
this class is gonna have two instructors. I'm Chris Manning and I'm Stanford
faculty, then the other one is Richard, who's the chief scientist
of Salesforce, and so I'll let him say hello for a minute or two. >> Hi there, great to be here. I guess,
just a brief little bit about myself. In 2014, I graduated,
I got my PhD here with Chris and Andrew Ng in deep learning for NLP. And then almost became a professor,
but then started a little company, built an ad platform, did some research. And then earlier last year, we got acquired by Salesforce,
which is how I ended up there. I've been teaching CS224D
the last two years and super excited to merge the two classes. >> Okay. >> I think next week, I'll do the two
lectures, so you'll see a lot of me. >> [LAUGH]
>> I'll do all the boring equations.
>> [LAUGH] Okay, and then TAs,
we've got many really wonderful, competent, great TAs for this class. Yeah, so normally I go through all
the TAs, but there are sort of so many, both of them and you, that maybe
I won't go through them all, but maybe they could all just sort of stand up
for a minute if you're a TA in the class. They're all in that corner, okay,
[LAUGH] and they're clustered. [LAUGH] Okay, right,
yeah, so at this point, I mean, apologies about the room capacity. So the fact of the matter is if this class
is being kind of videoed and broadcast, this is sort of the largest SCPD
classroom that they record in. So, there's no real choice for this, this is the same reason that this is
where 221 is, and this is where 229 is. But it's a shame that there aren't enough
seats for everybody, sorry about that. It will be available shortly after
each class, also as a video. In general for the other information,
look at the website, but there's a couple things that I do just wanna say a little
bit about, prerequisites and work to do.
So, when it comes down to it, these are the things that you
sort of really need to know. And we'll expect you to know, and if you
don't know, you should start working out what you don't know and
what to do about it very quickly. So the first one is we're gonna
do the assignments in Python, so proficiency in Python,
there's a tutorial on the website, not hard to learn if
you do something else. Essentially, Python has just become
the lingua franca of nearly all the deep learning toolkits, so
that seems the thing to use. We're gonna do a lot of stuff
with calculus and vectors and matrices, so multivariate calculus,
linear algebra.
It'll start turning up on Thursday and
even more next week. Sort of basic probability and statistics,
you don't need to know anything fancy about martingales or
something, I don't either. But you should know
the elements of that stuff. And then we're gonna assume you know
some fundamentals of machine learning. So if you've done 221 or 229, that's fine. Again, you don't need to know
all of that content, but we sort of assume that you've seen loss
functions, and you have some idea about how you do optimization with gradient
descent and things like that. Okay, so in terms of what we hope to
teach, the first thing is an understanding of and ability to use effective
modern methods for deep learning. So we'll be covering all the basics, but especially an emphasis on the main
methods that are being used in NLP, which is things like recurrent networks,
attention, and things like that.
Some big picture understanding
of human languages and the difficulties in understanding and
producing them. And then the third one is essentially
the intersection of those two things. So the ability to build systems for
important NLP problems. And you guys will be building some of
those for the various assignments. So in terms of the work to be done,
this is it. So there's gonna be three assignments.
There's gonna be a midterm exam. And then at the end, there's this
bigger thing where you sort of have a choice between either you can
come up with your own exciting world shattering final project and
propose it to us. And we gotta make sure every final project
has a mentor, which can either be Richard or me, one of the TAs, or someone else
who knows stuff about deep learning. Or else, we can give you
an exciting project, and so there'll be sort of
a default final project, otherwise known as Assignment 4. There's gonna be a final poster session.
So every team for the final project,
you're gonna have teams up to three for the final project,
has to be at the final poster session. Now we thought about having it
in our official exam slot, but that was on Friday afternoon, and so
we decided people might not like that. So we're gonna have it in
the Tuesday early afternoon session, which is when the language
class exams are done. So no offense to languages, but we're assuming that none of you are doing
first year intensive language classes.
Or at least,
you better find a teammate who isn't. >> [LAUGH]
>> Okay, yeah, so we've got some late days. Note that each assignment has to be handed
in within three days so we can grade it. Yeah, okay, yeah, so Assignment 1,
we're gonna hand out on Thursday, so for that assignment,
it's gonna be pure Python, except for using the NumPy library, which is kinda
the basic vector and matrices library.
And people are gonna do things
from scratch, because I think it's a really important educational skill
that you've actually done things and gotten it to work from scratch. And you really know for yourself what the derivatives
are because you've calculated them. And because you've implemented them,
and you've found that you can calculate derivatives and implement them, and
the thing does actually learn and work. If you've never done this, the whole thing's gonna seem
like black magic ever after. So it's really important to actually
work through it by yourself. But nevertheless, one of the things
that's been transforming deep learning is that there are now these
very good software packages, which actually make it crazily easy
to build deep learning models.
That you can literally take one of these
libraries and sort of write 60 lines of Python, and you can be training
a state-of-the-art deep learning system that will work super well, providing
you've got the data to train it on. And that's sort of actually been
an amazing development over the last year or two. And so for Assignments 2 and
3, we're gonna be doing that. In particular, we're gonna be using
TensorFlow, which is the Google deep learning library, which is sort of,
well, Google's very close to us. But it's also very well engineered and has sort of taken off as
the most used library now.
But there really are a whole bunch of
other good libraries for deep learning. And I mentioned some of them below. Okay, do people have any
questions on class organization? Or anything else up until now,
or do I just power on? >> [INAUDIBLE]
>> Yeah. Okay, so, something I'm gonna do is repeat all questions,
so they'll actually work on the video. So, the question is, how are our
assignments gonna be submitted? They're gonna be submitted
electronically online, instructions will be on
the first assignment. But yeah, everything has to be electronic,
what we use in Gradescope for the grading. For written stuff, if you wanna hand
write it, you have to scan it for yourself, and submit it online. Any other questions? >> [INAUDIBLE]
>> Yeah. So, the question was,
are the slides on the website? Yes, they are. The slides were on the website before
the class began, and we're gonna try and keep that up all quarter.
So, you should just be able to find them,
cs224n.stanford.edu. Any other questions, yeah? Yeah, so that was on the logistics,
if you're doing assignment four. It's partly different, and partly
the same, so if you're doing the default assignment four, and we'll talk all about
final projects in a couple of weeks. You don't have to write a final
project proposal, or talk to a mentor, because we've designed the project for you
as a starting off point of the project.
But on the other hand,
otherwise, it's the same. So, it's gonna be an open ended project, in which there are lots of things that you
can try to make the system better, and we want you to try, and we want you to be
able to report on what are the different exciting things you've tried, whether they
did, or didn't make your system better. And so, we will be expecting people doing
assignment four to also write up and present a poster on what they've done. Any other questions? Yes, so their question was on
whether we're using Piazza. Yes, we're using Piazza for communication. So, we've already setup the Piazza, and
we attempted to enroll all the enrolled students, so hopefully if you're
an enrolled student, there's somewhere in your junk mailbox, or in one of those
places, a copy of a Piazza announcement. Any other questions? Okay, 20 some minutes to go. I'll power ahead. Very quickly, why is NLP hard? I think most people,
maybe especially computer scientists, going into this just don't
understand why NLP is hard.
It's just a sequence of words, and they've
been dealing with programming languages. And you're just gonna read
the sequence of words. Why is this hard? It turns out it's hard for
a bunch of reasons, because human languages aren't
like programming languages. So, human languages
are just all ambiguous. Programming languages
are constructed to be unambiguous, that's why they have rules like you can. And else goes with the nearest 'if' and you have to get the indentation
right in Python. Human languages aren't like that,
so human languages are when there's an 'else' just interpret it with whatever
'if' makes most sense to the hearer. And when we do reference
in programming language, we use variable names like x and
y, and this variable. Whereas, in human languages, we say
things like this and that and she, and you're just meant to be able to figure out
from context who's being talked about.
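The point about programming languages being constructed to be unambiguous can be seen directly in Python, where indentation alone decides which 'if' an 'else' belongs to. This is a minimal sketch; the function names and return strings are made up purely for illustration.

```python
def else_on_inner_if(a, b):
    # The else is indented to match the inner `if b`, so it pairs with it.
    if a:
        if b:
            return "a and b"
        else:
            return "a but not b"
    return "not a"


def else_on_outer_if(a, b):
    # Same keywords, but the else is indented to match the outer `if a`.
    if a:
        if b:
            return "a and b"
    else:
        return "not a (else branch)"
    return "a but not b (fell through)"


print(else_on_inner_if(True, False))
print(else_on_outer_if(True, False))
print(else_on_outer_if(False, False))
```

There is exactly one legal reading of each function, which is the property human language lacks.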
But that's a big problem, but it's
perhaps, not even the biggest problem. The biggest problem is that humans use language as an efficient
communication system. And the way they do that is by
not saying most things, right? When you write a program, we say
everything that's needed to get it to run. Where in a human language, you leave out
most of the program, because you think that your listener will be able to work
out which code should be there, right? So, it's sorta more a code
snippet on StackOverflow, and the listener is meant to be able to
fill in the rest of the program. So, human language gets its efficiency.
We can't actually communicate very
fast by human language, right? The rate at which we can speak is not 5G communication speeds, right? It's a slow communication channel. But the reason why it works efficiently
is we can say minimal messages. And our listener fills in all
the rest with their world knowledge, common sense knowledge, and
contextual knowledge of the situation. And that's the biggest reason
why natural language is hard. So, as sort of a profound version
of why natural language is hard: I really like this XKCD cartoon, which
you definitely can't read, and which I can barely read on
the computer in front of me.
>> [LAUGH]
>> But I think if you think about it, it says actually a lot about why
natural language understanding is hard. So, the two women speaking to each other. One says, 'anyway, I could care less,' and
the other one says, 'I think you mean you couldn't care less,
saying you could care less implies you care to some extent,' and
the other one says, 'I don't know,' and then continues. We're these unbelievably complicated
beings drifting through a void, trying in vain to connect
with one another by blindly flinging words
out into the darkness. Every choice of phrasing,
and spelling and tone and timing carries countless signals and
contexts and subtexts and more. And every listener interprets
these signals in their own way. Language isn't a formal system;
language is glorious chaos. You can never know for
sure what any words will mean to anyone. All you can do is try to get better at
guessing how your words affect people.
So, you have a chance of finding
the ones that will make them feel something like you want them to feel. Everything else is pointless. I assume you're giving me tips
on how you interpret words, because you want me to feel less alone. If so, then thank you, that means a lot. But if you're just running my sentences
past some mental checklist, so you can show off how well you know it,
then I could care less. >> [LAUGH]
>> And I think if you reflect on this XKCD comic, there's actually a lot of
profound content there as to what human language understanding is like, and
what the difficulties of it are. But that's probably a bit
hard to do in detail, so I'm just gonna show you some
simple examples for a minute. You get lots of ambiguities, including
funny ambiguities, in natural language. So, here are a couple of, here's one of my favorites that came
out recently from TIME magazine.
'The Pope's baby steps on gays.' No, that's
not how you're meant to interpret this. You're meant to interpret this as
the Pope's baby steps on gays. >> [LAUGH]
>> Okay. So a question, I mean,
why do you get those two interpretations? What is it about human language,
and English here, about English that allows you to
have these two interpretations? What are the different things going on? Is anyone game to give
an explanation of how we... Okay, yeah, right. I'll repeat the explanation as I go. You started off with saying it
was idiomatic, and in some sense, baby steps is
sort of a metaphor, an idiom where baby steps is meaning
little steps like a baby would take, but I mean, before you even get to that,
you can kind of just think a large part of this is just a structural ambiguity,
which then governs the rest of it.
So, one choice is that you have this
noun phrase of the Pope's baby, and then you start interpreting
it as a real baby. And then steps is being
interpreted as a verb. So, something we find in a lot
of languages, including English, is the same word can have
fundamentally different roles. Here, in the verbal interpretation,
'steps' would be being used as a verb. But the other reading is, as you
said it's a noun compound, so you can put nouns together, and make
noun compounds very freely in English. Computer people do it all the time, right? As soon as you've got something like disk
drive enclosure, or network interface hub, or something like that, you're just
nailing nouns together to make big nouns.
So, you can put together baby and
steps as two nouns, and make baby steps as a noun phrase. And then you can make the Pope's
baby steps as a larger noun phrase. And then you're getting this
very different interpretation. But simultaneously, at the same time,
you're also changing the meaning of baby. So in one case, the baby was this
metaphorical baby, and then in the other one, perhaps counterfactually,
it's a literal baby. Let's do at least one more of those. Here's another good fun one. Boy paralyzed after tumor
fights back to gain black belt. >> [LAUGH]
>> Which is, again, not how you're meant to read it. You're meant to read it as boy, paralyzed after tumor,
fights back to gain black belt. So, how could we characterize
the ambiguity in that one? [LAUGH] So,
someone suggested missing punctuation, and, to some extent, that's true. And to some extent,
you can use commas to try and make readings clearer in some cases. But there are lots of places where
there are ambiguities in language, where it's just not usual or standard to
put in punctuation, to disambiguate.
And indeed, if you're the kind of computer
scientist who feels like you want to start putting matching parentheses around pieces
of human language to make the unclear interpretation much clearer, you're not
then a typical language user anymore. [LAUGH]
>> Okay, anyone else gonna have a go, yeah? Yeah, so, this is sort of the ambiguities
are in the syntax of the sentence. So, when you have this 'paralyzed'
that could either be the main verb of the sentence, so the boy is paralyzed, and then all of
'after tumor fights back to gain black belt' is then this sort of subordinate
clause of saying when it happened. And so then the 'tumor' is
the subject of 'fights back', or you can have this
alternative where 'paralyzed' can also be what's called
a passive participle.
So, it's introducing a participial
phrase of 'paralyzed after tumor'. And so that can then be a modifier of
the boy in the same way an adjective can, young boy fights back to gain black belt. It could be boy paralyzed after tumor
fights back to gain black belt. And then it's the boy that's
the subject of fights. Okay, I have on this slide a couple more
examples, but I think I won't go through them in detail, since I'm sort
of behind as things are going.
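The two readings of the black belt headline can be written down as two different bracketings of the same word sequence. This is only an illustrative sketch: the nested-tuple encoding, the informal node labels, and both helper functions are assumptions made here, not a standard parser output format.

```python
WORDS = "boy paralyzed after tumor fights back to gain black belt".split()

# Intended reading: the boy, who was paralyzed after a tumor, fights back.
boy_fights = (
    "S",
    ("NP", "boy", ("PartP", "paralyzed", "after", "tumor")),
    ("VP", "fights", "back", "to", "gain", "black", "belt"),
)

# Funny reading: the boy is paralyzed after the tumor fights back.
tumor_fights = (
    "S",
    ("NP", "boy"),
    ("VP", "paralyzed",
        ("SubClause", "after",
            ("S", ("NP", "tumor"),
                  ("VP", "fights", "back", "to", "gain", "black", "belt")))),
)


def leaves(tree):
    """Return the words of a tree in order, dropping the node labels."""
    if isinstance(tree, str):
        return [tree]
    words = []
    for child in tree[1:]:  # tree[0] is the node label
        words.extend(leaves(child))
    return words


def subject_of_fights(tree):
    """Return the first word of the phrase immediately before the 'fights' VP."""
    if isinstance(tree, str):
        return None
    children = tree[1:]
    for i, child in enumerate(children):
        if isinstance(child, tuple) and child[0] == "VP" and child[1] == "fights":
            return leaves(children[i - 1])[0] if i > 0 else None
    for child in children:
        found = subject_of_fights(child)
        if found:
            return found
    return None


print(leaves(boy_fights) == leaves(tumor_fights))
print(subject_of_fights(boy_fights), subject_of_fights(tumor_fights))
```

Both trees cover exactly the same words in the same order; only the structure, and therefore who is doing the fighting, differs.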
Okay, so what I wanted to
get into a little bit of for the last bit of class
until my time runs out is to introduce this idea
of deep learning and NLP. And so, I mean essentially,
this is combining the two things that we've been talking
about so far, deep learning and NLP. So, we're going to use the ideas
of deep learning, neural networks, representation learning, and
we're going to apply them to problems in language understanding,
natural language processing. And so, in the last couple of years, especially this is just an area that's
sorta really starting to take off, and just for the rest of today's
class we'll say a little bit about some of the stuff that's happening
there at a very high level, and that'll sort of prepare for Thursday,
starting to dive right into the specifics.
And so there are several
different classifications you can look at. So on the one hand, deep learning is being
applied to lots of different levels of language: things like speech, words,
syntax, semantics. It's been applied to lots of different
sort of tools, algorithms that we use for natural language processing. So, that's things like labeling words for
part-of-speech, finding person and organization names, or coming up with
syntactic structures of sentences. And then it's been applied to lots of language applications that
put a lot of this together. So things that I've mentioned before, like
machine translation, sentiment analysis, dialogue agents. And one of the really, really interesting
things is that deep learning models have been giving a very unifying method
of using the same tools and technologies to understand
a lot of these problems. So yes, there are some specifics
of different problems. But something that's been quite stunning
in the development of deep learning is that there's actually been a very
small toolbox of key techniques, which have turned out to
be just vastly applicable with enormous accuracy to just many,
many problems.
Which actually includes not only many,
many language problems, but also, most of the rest of what
happens in deep learning, whether it's looking at vision problems,
or applying deep learning to any other kind of signal analysis,
knowledge representation, or anything: you see these few key tools
being used to solve all the problems. And what is somewhat embarrassing for
human beings is that typically, they're sort of working super well, much better than the techniques that
human beings had previously slaved on for decades developing, without very much
customization for different tasks. Okay, so deep learning and language: it
all starts off with word meaning, and so this is a very central idea we're gonna develop
starting off with the second class. So, what we're gonna do with words is
say we're going to represent a word, in particular we're going to
represent the meaning of the word, as a vector of real numbers. So here's my vector for the word expect.
And so I made that, whatever it is,
an 8-dimensional vector, I think, since that was good for my slide. But really,
we don't much use vectors that small. So minimally, we might use something
like 25-dimensional vectors. Commonly, we might be using something
like 300-dimensional vectors. And if we're really going to town because we wanna have the best
ever system doing something, we might be using a 1000-dimensional
vector or something like that. So when we have vectors for words, that means we're placing words in
a high-dimensional vector space. And what we find out is,
when we have these methods for learning word vectors from deep
learning and place words into these high-dimensional vector spaces,
these act as wonderful semantic spaces. So, words with similar meanings will
cluster together in the vector space, but actually more than that.
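The idea that words with similar meanings sit close together in the vector space can be made concrete with cosine similarity. This is a minimal sketch: the three 4-dimensional vectors are made up by hand for illustration, whereas real word vectors are learned from data and are much larger.

```python
import numpy as np

# Hand-made toy vectors (an assumption for illustration, not learned values).
vectors = {
    "think":  np.array([0.9, 0.8, 0.1, 0.0]),
    "expect": np.array([0.8, 0.9, 0.2, 0.1]),
    "france": np.array([0.0, 0.1, 0.9, 0.8]),
}


def cosine_similarity(u, v):
    """Cosine of the angle between u and v: near 1.0 means same direction."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


sim_verbs = cosine_similarity(vectors["think"], vectors["expect"])
sim_mixed = cosine_similarity(vectors["think"], vectors["france"])
print(f"think vs expect: {sim_verbs:.3f}")
print(f"think vs france: {sim_mixed:.3f}")
```

With these toy values the two verbs come out far more similar to each other than either is to the country word, which is the clustering behaviour described above.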
We'll find out that there
are directions in the vector space that actually tell you about
components of meaning. So, one of the problems of
human beings is that they're not very good at looking at
high-dimensional spaces. So, for the human beings,
we always have to project down onto two or three dimensions. And so, in the background,
you can see a little bit of a word cloud of a 2D projection of a word vector space,
which you can't read at all. But we could sort of
start to zoom in on it. And then you get something
that's just about readable. So in one part of the space, this is
where country words are clustering. And in another part of the space, this
is where you're seeing verbs clustering. And you're seeing kind of it's grouping
together verbs that mean most similarly. So 'come' and 'go' are very similar,
'say' and 'think' are similar, 'think' and 'expect' are similar.
'Expecting' and 'thinking' are actually
similar to 'seeing things' a lot of the time, because people often
use see as an analogy for think. Yes? Okay, so the question is, what do
the axes in these vector spaces mean? And, in some sense,
the glib answer is nothing. So when we learn these vector spaces,
well actually we have these 300 D vectors. And they have these axes
corresponding to those vectors. And often in practice, we do sort of
look at some of those elements along the axes and see if we can
interpret them because it's easy to do. But really, there's no particular
reason to think that elements of meaning should follow those vector axes. They could be at any other angle
in the vector space, and so they don't necessarily mean anything.
When we wanna do a 2D projection
like this, what we're then using is some method to try and
most faithfully get out some of the main meaning from the high dimensional
vector space so we can show it to you. So the simplest method that many of you
might have seen before in other places, is doing PCA,
doing a principal components analysis. There's another method that we'll get
to called t-SNE, which is kind of a non-linear dimensionality
reduction which is commonly used.
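A PCA projection down to two dimensions can be sketched with NumPy's singular value decomposition. The function name `pca_project` is hypothetical, and the input here is random numbers standing in for 50 learned 300-dimensional word vectors, so the resulting picture carries no meaning; only the mechanics are shown.

```python
import numpy as np


def pca_project(X, n_components=2):
    """Project the rows of X onto their top principal components."""
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by decreasing variance.
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ Vt[:n_components].T


rng = np.random.default_rng(0)
word_vectors = rng.normal(size=(50, 300))  # random stand-in for learned vectors
points_2d = pca_project(word_vectors)
print(points_2d.shape)  # (50, 2)
```

Everything in the other 298 directions is thrown away by this projection, which is why such plots need to be read with caution.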
But these are just to try and give human
beings some sense of what's going on. And it's important to realize that any
of these low dimensional projections can be extremely,
extremely misleading, right? Because they are just leaving out
a huge amount of the information that's actually in the vector space. Here, I'm just looking at the closest words
to the word frog. I'm using the GloVe embeddings that we did
at Stanford, which we'll talk about more in the next couple of lectures. So frogs and toad are the nearest words,
which looks good. But if we then look at these other
words that we don't understand, it turns out that they're also names for
other pretty kinds of frogs. So these word meaning vectors are a great
basis of starting to do things.
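Looking up the closest words to a query word, as in the frog example, amounts to ranking the whole vocabulary by similarity to one vector. This is a rough sketch with a made-up five-word vocabulary and hand-picked 3-dimensional vectors; it does not load or reproduce the actual GloVe embeddings, and `nearest_words` is a hypothetical helper name.

```python
import numpy as np


def nearest_words(query, vocab, matrix, k=2):
    """Return the k words whose vectors have the highest cosine similarity to `query`."""
    unit = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    scores = unit @ unit[vocab.index(query)]  # cosine similarity to every word
    order = np.argsort(-scores)               # best match first
    return [vocab[i] for i in order if vocab[i] != query][:k]


# Toy data, chosen by hand so that the animal words point in a similar direction.
vocab = ["frog", "toad", "lizard", "car", "truck"]
matrix = np.array([
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.0],
    [0.6, 0.4, 0.1],
    [0.0, 0.1, 0.9],
    [0.1, 0.0, 0.9],
])

print(nearest_words("frog", vocab, matrix))  # ['toad', 'lizard']
```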
But I just wanna give you a sense, for the last few minutes,
that we can do a lot beyond that. And the surprising thing is we're gonna
keep using some of these vectors. So traditionally, if we're looking at
complex words like uninterested, we might just think of them as being made up of
morphemes, sort of smaller symbols. But what we're gonna do is say, well no. We can also think of parts of words as vectors that represent
the meaning of those parts of words. And then what we'll wanna do is build
a neural network which can compose the meaning of larger units
out of these smaller pieces. That was work that Minh-Thang Luong and
Richard did a few years ago at Stanford. Going beyond that, we want to
understand the structure of sentences. And so another tool we'll use
deep learning for is to make syntactic parsers that find out
the structure of sentences.
So Danqi Chen who's over there,
is one of the TAs for the class. So something that she worked on
a couple of years ago was doing neural network methods for dependency parsing. And that was hugely successful. And essentially, if you've seen any
of the recent Google announcements with their Parsey McParseface and
SyntaxNet, essentially what that's
using is a more honed and larger version of the technique
that Danqi introduced. So once we've got some of
the structure of sentences, we then might want to understand
the meaning of sentences. And people have worked on the meaning
of sentences for decades. And I certainly don't wanna
belittle other ways of working out the meaning of sentences. But in the terms of doing
deep learning for NLP, in this class I also wanna give a sense
of how we'll do things differently. So the traditional way of doing things,
which is commonly lambda calculus-based semantic theories, is that you're giving meaning functions for
individual words by hand. And then there's a careful,
logical algebra for how you combine together the meanings of
words to get kind of semantic expressions.
Which have also sometimes been used for
programming languages where people worked on denotational semantics for
programming languages. But that's not what we're gonna do here. What we're gonna do is say, well,
if we start off with the meaning of words being vectors, we'll make meanings for
phrases which are also vectors. And then we have bigger phrases and sentences also have their
meaning being a vector.
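One simple way to get a phrase vector out of two smaller vectors is to concatenate them, multiply by a weight matrix, and squash the result, so that the parent has the same size as its children and can be combined again. This is a sketch of that general idea under assumed toy sizes and random, untrained parameters; it is not claimed to be the exact model used in any particular system.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 4  # toy vector size, chosen for illustration

# Randomly initialised parameters; in a real system these would be learned.
W = rng.normal(scale=0.1, size=(DIM, 2 * DIM))
b = np.zeros(DIM)


def compose(left, right):
    """Combine two child vectors into one parent vector of the same size."""
    return np.tanh(W @ np.concatenate([left, right]) + b)


# Random stand-ins for learned word vectors.
word = {w: rng.normal(size=DIM) for w in ["very", "good", "movie"]}

very_good = compose(word["very"], word["good"])  # phrase vector
phrase = compose(very_good, word["movie"])       # bigger phrase vector
print(phrase.shape)  # (4,)
```

Because the output of `compose` lives in the same space as its inputs, the same function can be applied all the way up from word parts to whole sentences.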
And if we wanna know what
the relationships are between meanings of sentences, or between sentences and
the world, such as a visual scene, the way we'll do that is we'll try
to learn a neural network that can make those decisions for us. Yeah, let's see. So we can use it for
all kinds of semantics. This was actually one of the pieces
of work that Richard did while he was a PhD student,
was doing sentiment analysis. And so
this was trying to do a much better, careful, real meaning representation and understanding of the positive and
negative sentiments of sentences by actually working out which parts
of sentences have different meanings. So the sentence is, 'This movie
doesn't care about cleverness, wit, or any other kind of intelligent humor,'
and the system is actually very accurately able to work out, well there's all of
this positive stuff down here, right? There's cleverness,
wit, intelligent humor.
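To see why those positive words are a trap, here is a minimal sketch of a scorer that only counts sentiment words. The lexicons and the function `count_based_sentiment` are made up for illustration and are far smaller and cruder than any real system.

```python
import re

# Tiny hand-made sentiment lexicons (assumptions for illustration only).
POSITIVE = {"cleverness", "wit", "intelligent", "humor"}
NEGATIVE = {"boring", "dull", "awful"}


def count_based_sentiment(sentence):
    """Label a sentence by counting positive versus negative lexicon words."""
    tokens = re.findall(r"[a-z']+", sentence.lower())
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


sentence = ("This movie doesn't care about cleverness, wit, "
            "or any other kind of intelligent humor")
print(count_based_sentiment(sentence))  # positive
```

The counter sees four positive words and nothing negative, because "doesn't care about" is invisible to it.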
It's all very positive, and that's
the kind of thing a traditional sentiment analysis system would fall apart on, and
just say this is a positive sentence. But our neural network system
is noticing that there's this movie doesn't care
at the beginning and is accurately deciding the overall
sentiment for the sentence is negative. Okay, I'm gonna run out of time, so
I'll skip a couple of things, but let me just mention two other
things that've been super exciting. So there's this enormous excitement
now about trying to build chat bots, dialogue agents: having speech and
language understanding interfaces so that humans can interact
with mobile computers.
There's Alexa and
other things like that, and I think it's fair to say that the state
of the technology at the moment is that speech recognition has
made humongous advances, right? So I mean, speech recognition
has been going on for decades, and as someone involved with language
technology, I'd been claiming to people, from the 1990s, no,
speech recognition is really good. We've worked out really good
speech recognition systems. But the fact of the matter is they were
sorta not very good and real human beings would not use them if they had any choice
because the accuracy was just so low.
Whereas, in the last few years
neural network-based deep learning speech recognition systems
have become amazingly good. I think, I mean maybe this isn't true
of the young people in this room apart from me. But I think a lot of people don't actually
realize how good they've gotten. Because I think that there are a lot of
people that tried things out in 2012 and decided they're pretty reasonable,
but not fantastic, and haven't really used it since. So I encourage all of you, if you don't
regularly use speech recognition to go home and
try saying some things to your phone. And, I think it's now just amazing how
well the speech recognition works. But there's a problem. The speech recognition works flawlessly. And then your phone has no idea
what you're saying, and so it says, would you like me to Google that for you? So the big problem, and the centerpiece of the kind of stuff that
we're working on in this class, is well how can we actually make the natural
language understanding equally good? And so that's a big concentration
of what we're going to work on.
One place that's actually, have any of you played with
Google's Inbox program on cell phones? Any of you tried that out? A few of you have. So one cool but
very simple example of a deployed deep learning dialogue agent is
Google Inbox's Suggested Replies. So you have a recurrent neural network
that's going through the message and is then suggesting three replies to your
message to send back to the other person. And you know although there are lots
of concerns in that program of sort of privacy and other things, and
they're careful how they're doing it. Actually often the replies it comes
up with are really rather good. If you're looking to cut down on your
email load, give Google Inbox a try and you might find that actually you can reply
to quite a bit of your email using it. Okay, the one other example I
wanted to mention before finishing was Machine Translation. So Machine Translation, this is actually
when natural language processing started. It didn't actually start with
language understanding in general. Where natural language processing started
was, it was the beginning of the Cold War.
Americans and Russians were alarmed that each
other knew too much about something, and they couldn't understand what
people were saying. And coming off of the successes
of code breaking in World War II, people thought, we can just get our
computers to do language translation. And in the early days it
worked really terribly, and things started to get a bit better in
the 2000s, and I presume you've all seen kind of classic Google Translate,
and that sort of half worked. You could sorta get the gist of what it's
saying, but it still worked very terribly. Whereas just in the last couple of
years really only starting in 2014, there's then started to be use of
end-to-end trained deep learning systems to do machine translation which is
then called neural machine translation.
And it's certainly not the case that
all the problems in MT are solved, there's still lots of work to do
to improve machine translation. But again,
this is a case in which just overnight replacing the 200 person years
of work on Google Translate with a new deep learning based machine
translation system has overnight produced a huge improvement
in translation quality. And there was a big
long article about that in the New York Times magazine a few
weeks ago that you might've seen. And so rather than traditional
approaches to translation, we're again just running a big, deep,
recurrent neural network where it starts off reading through
a source sentence generating vector internal representations that
represent the sentence so far.
And then once it's gone to
the end of the sentence, it then starts to generate
out words in the translation. So generating words in
sequence in the translation is then what's referred to as
kind of neural language models, and that is also a key technology that
we use in a lot of things that we do. So that's both what's used in
the kind of Google Inbox, recurrent neural network, and in the generation side
of a neural machine translation system.
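The read-then-generate data flow described above can be sketched with a tiny recurrent network in NumPy. Everything here is an assumption for illustration: the vocabulary sizes, the single-layer tanh recurrence, the end-of-sentence id, and the random untrained weights, so the generated token ids are meaningless. Real systems are trained end to end and are far larger.

```python
import numpy as np

rng = np.random.default_rng(0)
SRC_VOCAB, TGT_VOCAB, HIDDEN = 6, 5, 8
EOS = 0  # hypothetical end-of-sentence id on the target side

# Random, untrained parameters: only the data flow is being illustrated.
E_src = rng.normal(scale=0.5, size=(SRC_VOCAB, HIDDEN))  # source word vectors
E_tgt = rng.normal(scale=0.5, size=(TGT_VOCAB, HIDDEN))  # target word vectors
W_enc = rng.normal(scale=0.5, size=(HIDDEN, HIDDEN))
W_dec = rng.normal(scale=0.5, size=(HIDDEN, HIDDEN))
W_out = rng.normal(scale=0.5, size=(TGT_VOCAB, HIDDEN))


def encode(source_ids):
    """Read the source left to right, updating one vector for the sentence so far."""
    h = np.zeros(HIDDEN)
    for token in source_ids:
        h = np.tanh(W_enc @ h + E_src[token])
    return h


def decode(h, max_len=10):
    """Generate target ids one at a time, feeding each choice back in."""
    output, prev = [], EOS  # the EOS id doubles as the start symbol here
    for _ in range(max_len):
        h = np.tanh(W_dec @ h + E_tgt[prev])
        prev = int(np.argmax(W_out @ h))  # greedy choice of the next word
        if prev == EOS:
            break
        output.append(prev)
    return output


sentence_vector = encode([3, 1, 4, 2])
translation = decode(sentence_vector)
print(sentence_vector.shape, translation)
```

The decoder half on its own, predicting the next word from the words so far, is the language-model part.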
Okay, so we've gotten to,
I just have one more minute and try and get us out of here not too
late even though we started late. I mean, the final thing I want to
say is just sort of to emphasize the fact that the amazing thing that's
happening here is it's all vectors, right? We're using this for
all representations of language, whether it's sounds,
parts of words, words, sentences, conversations, they're all getting
turned into these real-valued vectors.
And that's something that
we'll talk about a lot more. I'll talk about it for
word vectors on Thursday and Richard will talk a lot more
about the vectors next time. I mean, that's something that appalls many
people, but I think it's important to realize it's actually something a lot
more subtle than many people realize. You could think that there's no structure
in this big long vector of numbers. But equally you could say,
well I could reshape that vector and I could turn it into a matrix or a higher
order array which we call a tensor.
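The reshaping point is easy to try in NumPy: the same flat run of numbers can be viewed as a matrix or as a higher-order array. The length of 24 and the particular shapes are arbitrary choices for illustration.

```python
import numpy as np

v = np.arange(24, dtype=float)  # one flat vector of 24 numbers

as_matrix = v.reshape(4, 6)     # same numbers, viewed as a 4x6 matrix
as_tensor = v.reshape(2, 3, 4)  # same numbers, viewed as a 2x3x4 array

print(as_matrix.shape, as_tensor.shape)  # (4, 6) (2, 3, 4)
print(v[13], as_matrix[2, 1], as_tensor[1, 0, 1])  # the same element three ways
```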
Or I could say different parts of it or directions of it represent
different kinds of information. It's actually a very
flexible data structure with huge representational capacity and that's what deep learning systems really
take advantage of in all that they do. Okay, thanks a lot. >> [APPLAUSE].