Statistics and probability lecture
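For a random variable $X$ on a sample space $\Omega$ and a set $A \subset \mathbb{R}$, the notation is $\{X \in A\}$, and we will also call these sets events. The notation becomes useful and elegant when we combine it with the probability measure $P$: we write $P(X \in A)$ for $P(\{\omega \in \Omega : X(\omega) \in A\})$. In words, it is just the sum of the probabilities of the individual outcomes whose value under $X$ lands in $A$. We will also use $\{X = k\}$ or $X = k$ as shorthand for $\{X \in \{k\}\}$.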
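To make the notation concrete, here is a minimal sketch in Python. The two-coin-flip space, the variable `X`, and the helper `prob` are our own illustrative choices, not something fixed by the text above.

```python
from fractions import Fraction

# Hypothetical finite probability space: two fair coin flips.
P = {"HH": Fraction(1, 4), "HT": Fraction(1, 4),
     "TH": Fraction(1, 4), "TT": Fraction(1, 4)}

def X(outcome):
    """Random variable: the number of heads in the outcome."""
    return outcome.count("H")

def prob(X, A, P):
    """P(X in A): sum P(w) over the outcomes w whose value X(w) lies in A."""
    return sum(p for w, p in P.items() if X(w) in A)

print(prob(X, {1, 2}, P))  # probability of at least one head
print(prob(X, {1}, P))     # the shorthand P(X = 1)
```

Using `Fraction` keeps the probabilities exact, so equalities between them can be tested without worrying about floating-point rounding.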
Oftentimes the event $\{X \in A\}$ will be smaller than $\Omega$ itself, even if $A$ is large, and an event such as $\{X = k\}$ can be very small indeed. We should also note that because our probability spaces are finite, the image of the random variable, $\operatorname{im}(X)$, is a finite subset of real numbers. In other words, the events of the form $\{X = x_i\}$, where $x_i \in \operatorname{im}(X)$, form a partition of $\Omega$. As such, we get the following immediate identity:

$$\sum_{x_i \in \operatorname{im}(X)} P(X = x_i) = 1.$$

The final definition we will give in this section is that of independence. There are two separate but nearly identical notions of independence here. The first is that of two events. There are multiple ways to realize this formally, but without the aid of conditional probability (more on that next time) this is the easiest way: two events $E, F$ are independent if $P(E \cap F) = P(E) P(F)$.
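Both the partition identity and the product rule for independent events can be checked by direct enumeration. The following is a sketch only: the single-die space, the variable `X`, and the events `E` and `F` are hypothetical examples of ours.

```python
from fractions import Fraction

# Hypothetical space: one roll of a fair six-sided die.
omega = range(1, 7)
P = {w: Fraction(1, 6) for w in omega}

def prob(event):
    """Probability of an event, given as a set of outcomes."""
    return sum(P[w] for w in event)

# The events {X = x} partition the space, so their probabilities sum to 1.
X = lambda w: w % 3
image = {X(w) for w in omega}
total = sum(prob({w for w in omega if X(w) == x}) for x in image)

# Independence of two events: P(E and F) == P(E) * P(F).
E = {2, 4, 6}     # the roll is even
F = {1, 2, 3, 4}  # the roll is at most 4
independent = prob(E & F) == prob(E) * prob(F)

print(total, independent)
```

Here $E$ and $F$ overlap in two outcomes, which is a reminder that independent events are generally not disjoint.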
One should note that this is distinct from being disjoint as sets, because there may be a zero-probability outcome in both sets. The second notion of independence is that of random variables. The definition is the same idea, but implemented using events of random variables instead of regular events: two random variables $X, Y$ are independent if $P(X = x, Y = y) = P(X = x) P(Y = y)$ for all values $x, y$.
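For random variables on a small finite space, independence can be tested by brute force over all pairs of values. This is a sketch under our own assumptions: the two-flip space and the names `first`, `second`, and `total` are hypothetical.

```python
from fractions import Fraction
from itertools import product

# Hypothetical space: two fair coin flips, encoded as pairs of 0/1.
P = {w: Fraction(1, 4) for w in product([0, 1], repeat=2)}

def prob(pred):
    """Probability of the event {w : pred(w) is true}."""
    return sum(p for w, p in P.items() if pred(w))

def independent(X, Y):
    """Check P(X=x, Y=y) == P(X=x) P(Y=y) for all x, y in the images."""
    xs = {X(w) for w in P}
    ys = {Y(w) for w in P}
    return all(
        prob(lambda w: X(w) == x and Y(w) == y)
        == prob(lambda w: X(w) == x) * prob(lambda w: Y(w) == y)
        for x in xs for y in ys
    )

first = lambda w: w[0]         # result of the first flip
second = lambda w: w[1]        # result of the second flip
total = lambda w: w[0] + w[1]  # number of heads overall

print(independent(first, second))  # the two flips do not influence each other
print(independent(first, total))   # the total depends on the first flip
```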
An Intuitive Introduction to Probability
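We now turn to notions of expected value and variation, which form the cornerstone of the applications of probability theory. The expected value of a random variable $X$ is $E(X) = \sum_{\omega \in \Omega} X(\omega) P(\omega)$. Note that if we label the image of $X$ by $x_1, \dots, x_n$ then this is equivalent to $E(X) = \sum_{i=1}^n x_i P(X = x_i)$. The most important fact about expectation is that it is a linear functional on random variables: for random variables $X, Y$ and real numbers $a, b$ we have $E(aX + bY) = aE(X) + bE(Y)$.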
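The agreement of the two ways of computing an expected value, and the linearity property, can be sketched numerically. The two-dice space, the variables `X` and `Y`, and the coefficients `a`, `b` below are illustrative assumptions of ours.

```python
from fractions import Fraction
from itertools import product

# Hypothetical space: two fair dice; X is the sum of the faces.
P = {w: Fraction(1, 36) for w in product(range(1, 7), repeat=2)}
X = lambda w: w[0] + w[1]

# Sum over outcomes: E(X) = sum of X(w) P(w).
by_outcome = sum(X(w) * p for w, p in P.items())

# Sum over the image: E(X) = sum of x_i P(X = x_i).
image = {X(w) for w in P}
by_value = sum(x * sum(p for w, p in P.items() if X(w) == x) for x in image)

def E(Z):
    """Expected value of a random variable Z on the space P."""
    return sum(Z(w) * p for w, p in P.items())

# Linearity: E(aX + bY) == a E(X) + b E(Y), with Y the first die.
Y = lambda w: w[0]
a, b = 3, -2
lhs = E(lambda w: a * X(w) + b * Y(w))
rhs = a * E(X) + b * E(Y)

print(by_outcome, by_value, lhs, rhs)
```

Note that `X` and `Y` here are far from independent (the sum contains the first die), and linearity holds all the same.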
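The only real step in the proof is to note that for each possible pair of values $x, y$ in the images of $X, Y$ resp., the events $E_{x,y} = \{X = x, Y = y\}$ form a partition of the sample space $\Omega$. That is, because $aX + bY$ has the constant value $ax + by$ on $E_{x,y}$, the second definition of expected value gives

$$E(aX + bY) = \sum_x \sum_y (ax + by) P(X = x, Y = y),$$

and a little algebra reduces this to $aE(X) + bE(Y)$. We leave this as an exercise to the reader, with the additional note that the sum $\sum_y P(X = x, Y = y)$ is identical to $P(X = x)$. If $X$ and $Y$ are additionally independent, expectation is multiplicative as well: in this case $E(XY) = E(X)E(Y)$.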
We leave the proof as an exercise to the reader. It is important, however, to note that the expected value need not be a value assumed by the random variable itself; that is, it might not be true that $E(X) \in \operatorname{im}(X)$. For instance, in an experiment where we pick a number uniformly at random between 1 and 4 (the random variable is the identity function), the expected value would be:

$$1 \cdot \frac{1}{4} + 2 \cdot \frac{1}{4} + 3 \cdot \frac{1}{4} + 4 \cdot \frac{1}{4} = \frac{5}{2}.$$

But the random variable never achieves this value.

Linearity of expectation shows its strength in the next example, which counts Hamiltonian paths in a random tournament. The power of this example lies in the method: after a shrewd decomposition of a random variable $X$ into simpler (usually indicator) random variables, the computation of $E(X)$ becomes trivial.
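Let $T$ be a tournament on $n$ vertices, that is, a directed graph with exactly one directed edge between each pair of vertices, and consider a Hamiltonian path in $T$ (a directed path that visits every vertex exactly once). The datum of such a path is a list of numbers $(v_1, \dots, v_n)$, where we visit vertex $v_i$ at stage $i$ of the traversal. The condition for this to be a valid Hamiltonian path is that $(v_i, v_{i+1})$ is an edge in $T$ for all $i$. Now generate $T$ at random by orienting each edge independently, with probability $1/2$ for each direction. That is, $X$ is the random variable giving the number of Hamiltonian paths in such a randomly generated tournament, and we are interested in $E(X)$. To compute this, simply note that we can break $X = \sum_{\vec{v}} X_{\vec{v}}$, where $\vec{v}$ ranges over all possible lists of the vertices and $X_{\vec{v}}$ is the indicator random variable that is 1 when $\vec{v}$ is a Hamiltonian path and 0 otherwise. Then $E(X) = \sum_{\vec{v}} E(X_{\vec{v}})$, and it suffices to compute the number of possible paths and the expected value of any given path.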
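As a sanity check on the decomposition, small cases can be brute-forced. The sketch below assumes each edge of the tournament is oriented by an independent fair coin flip, so that all tournaments on $n$ vertices are equally likely. Under that assumption each of the $n!$ vertex lists is a Hamiltonian path with probability $2^{-(n-1)}$, and the indicator sum gives $n! / 2^{n-1}$. The function name `expected_hamiltonian_paths` is ours.

```python
from fractions import Fraction
from itertools import combinations, permutations, product
from math import factorial

def expected_hamiltonian_paths(n):
    """Average the Hamiltonian path count over all tournaments on n vertices."""
    pairs = list(combinations(range(n), 2))
    total = 0
    # Each tournament orients every pair (u, v) as u->v (True) or v->u (False).
    for orientation in product([True, False], repeat=len(pairs)):
        edges = {(u, v) if fwd else (v, u)
                 for (u, v), fwd in zip(pairs, orientation)}
        # Count the vertex orderings whose consecutive pairs are all edges.
        total += sum(
            all((p[i], p[i + 1]) in edges for i in range(n - 1))
            for p in permutations(range(n))
        )
    # Assumption: all 2^(n choose 2) tournaments are equally likely.
    return Fraction(total, 2 ** len(pairs))

for n in range(2, 5):
    print(n, expected_hamiltonian_paths(n), Fraction(factorial(n), 2 ** (n - 1)))
```

The enumeration grows far too quickly to be useful beyond tiny $n$, which is exactly why the indicator decomposition is the better tool.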
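That is, the expected number of Hamiltonian paths is $n! \, 2^{-(n-1)}$: there are $n!$ lists of the vertices, and a given list is a Hamiltonian path exactly when all $n-1$ of its consecutive edges point the right way, which happens with probability $2^{-(n-1)}$.

Just as expectation is a measure of center, variance is a measure of spread. That is, variance measures how thinly distributed the values of a random variable are throughout the real line. The variance of a random variable $X$ is $\operatorname{Var}(X) = E((X - E(X))^2)$. That is, $E(X)$ is a number, and so $X - E(X)$ is the random variable defined by $(X - E(X))(\omega) = X(\omega) - E(X)$. It is the expectation of the square of the deviation of $X$ from its expected value. One often denotes the variance by $\operatorname{Var}(X)$ or $\sigma^2$. The variance operator has a few properties that make it quite different from expectation, but nonetheless fall out directly from the definition.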
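We encourage the reader to prove a few: $\operatorname{Var}(X) = E(X^2) - E(X)^2$; $\operatorname{Var}(aX) = a^2 \operatorname{Var}(X)$ for any real $a$; and $\operatorname{Var}(X + Y) = \operatorname{Var}(X) + \operatorname{Var}(Y)$ when $X$ and $Y$ are independent. In addition, the quantity $\operatorname{Var}(aX + bY)$ is more complicated than one might first expect. In fact, to fully understand this quantity one must create a notion of correlation between two random variables: the covariance of $X$ and $Y$ is $\operatorname{Cov}(X, Y) = E((X - E(X))(Y - E(Y)))$. Note the similarities between the variance definition and this one: if $X = Y$ then the two quantities coincide. To make the link to correlation rigorous, we need to derive a special property of the covariance.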
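Before proving the variance properties, it can help to watch them hold on a small example. This is a sketch only; it uses the standard definitions $\operatorname{Var}(X) = E((X - E(X))^2)$ and $\operatorname{Cov}(X, Y) = E((X - E(X))(Y - E(Y)))$, and the two-dice space and the helper names `E`, `var`, `cov` are our own.

```python
from fractions import Fraction
from itertools import product

# Hypothetical space: two fair dice.
P = {w: Fraction(1, 36) for w in product(range(1, 7), repeat=2)}

def E(Z):
    """Expected value of Z."""
    return sum(Z(w) * p for w, p in P.items())

def var(Z):
    """Variance: expected squared deviation from the mean."""
    mu = E(Z)
    return E(lambda w: (Z(w) - mu) ** 2)

def cov(X, Y):
    """Covariance: expected product of the two deviations."""
    mx, my = E(X), E(Y)
    return E(lambda w: (X(w) - mx) * (Y(w) - my))

X = lambda w: w[0]         # first die
Y = lambda w: w[1]         # second die, independent of X
S = lambda w: w[0] + w[1]  # their sum

print(var(X) == E(lambda w: X(w) ** 2) - E(X) ** 2)  # Var(X) = E(X^2) - E(X)^2
print(var(lambda w: 3 * X(w)) == 9 * var(X))         # Var(aX) = a^2 Var(X)
print(var(S) == var(X) + var(Y))                     # additive when independent
print(cov(X, X) == var(X))                           # Cov(X, X) = Var(X)
```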
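Theorem (Cauchy-Schwarz inequality): Let $X, Y$ be random variables with standard deviations $\sigma_X = \sqrt{\operatorname{Var}(X)}$ and $\sigma_Y = \sqrt{\operatorname{Var}(Y)}$. Then their covariance is at most the product of the standard deviations in magnitude:

$$|\operatorname{Cov}(X, Y)| \leq \sigma_X \sigma_Y.$$

Proof. Take any two non-constant random variables $U$ and $V$ (we will replace these later with $X - E(X)$ and $Y - E(Y)$). Construct a new random variable $(tU + V)^2$ where $t$ is a real variable and inspect its expected value. Because the function is squared, its values are all nonnegative, and hence its expected value is nonnegative. Expanding this and using linearity gives

$$f(t) = t^2 E(U^2) + 2t E(UV) + E(V^2) \geq 0.$$

This is a quadratic function of a single variable $t$ which is nonnegative. From elementary algebra this means the discriminant is at most zero:

$$4 E(UV)^2 - 4 E(U^2) E(V^2) \leq 0.$$

Replacing $U$ by $X - E(X)$ and $V$ by $Y - E(Y)$ turns this into $\operatorname{Cov}(X, Y)^2 \leq \operatorname{Var}(X) \operatorname{Var}(Y)$, and taking square roots gives the theorem. Note that equality holds in the discriminant formula precisely when the discriminant is zero, that is, when $f$ has a real root $t$ at which $E((tU + V)^2) = 0$ and hence $V = -tU$ on every outcome of positive probability. After the replacement this translates to $Y - E(Y) = -t(X - E(X))$ for some fixed value of $t$.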
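The inequality and its equality case can be observed numerically. To stay in exact arithmetic, this sketch checks the squared form $\operatorname{Cov}(X, Y)^2 \leq \operatorname{Var}(X) \operatorname{Var}(Y)$ rather than taking square roots. The space and the variables `X`, `M`, `L` are hypothetical examples of ours.

```python
from fractions import Fraction
from itertools import product

# Hypothetical space: two fair dice.
P = {w: Fraction(1, 36) for w in product(range(1, 7), repeat=2)}

def E(Z):
    """Expected value of Z."""
    return sum(Z(w) * p for w, p in P.items())

def cov(X, Y):
    """Covariance; cov(X, X) is the variance of X."""
    mx, my = E(X), E(Y)
    return E(lambda w: (X(w) - mx) * (Y(w) - my))

def gap(X, Y):
    """Var(X) Var(Y) - Cov(X, Y)^2, which the inequality says is >= 0."""
    return cov(X, X) * cov(Y, Y) - cov(X, Y) ** 2

X = lambda w: w[0]          # first die
M = lambda w: max(w)        # related to X, but not linearly
L = lambda w: 5 - 2 * w[0]  # an exact linear function of X

print(gap(X, M) > 0)   # strict inequality for a non-linear relationship
print(gap(X, L) == 0)  # equality when one variable is linear in the other
```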
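In other words, for some real numbers $a, b$ we have $Y = aX + b$. This has important consequences even in English: the covariance is maximized in magnitude when $Y$ is a linear function of $X$, and otherwise is bounded from above and below. By dividing both sides of the inequality by $\sigma_X \sigma_Y$ we get the following definition: the Pearson correlation coefficient of $X$ and $Y$ is $\rho(X, Y) = \operatorname{Cov}(X, Y) / (\sigma_X \sigma_Y)$, which always lies between -1 and 1. If $\rho$ is close to 1 we call $X$ and $Y$ positively correlated, if $\rho$ is close to -1 we call them negatively correlated, and if $\rho$ is close to zero we call them uncorrelated.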
The idea is that if two random variables are positively correlated, then a higher value for one variable with respect to its expected value corresponds to a higher value for the other. Likewise, negatively correlated variables have an inverse correspondence: a higher value for one correlates to a lower value for the other.
The picture is as follows: [figure not available]. The linear correspondence is clear.
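The positive, negative, and zero cases can be sketched on a tiny example. We assume the usual formula for the Pearson correlation coefficient, $\operatorname{Cov}(X, Y) / (\sigma_X \sigma_Y)$. The five-point space and the name `pearson` are our own choices.

```python
from fractions import Fraction
from math import sqrt

# Hypothetical space: the outcome w is uniform on {-2, -1, 0, 1, 2}.
P = {w: Fraction(1, 5) for w in range(-2, 3)}

def E(Z):
    """Expected value of Z."""
    return sum(Z(w) * p for w, p in P.items())

def cov(X, Y):
    """Covariance; cov(X, X) is the variance of X."""
    mx, my = E(X), E(Y)
    return E(lambda w: (X(w) - mx) * (Y(w) - my))

def pearson(X, Y):
    """Cov(X, Y) / (sigma_X * sigma_Y); both variables must be non-constant."""
    return float(cov(X, Y)) / sqrt(cov(X, X) * cov(Y, Y))

X = lambda w: w
print(pearson(X, lambda w: 3 * w + 1))  # increasing linear function of X
print(pearson(X, lambda w: -w))         # decreasing linear function of X
print(pearson(X, lambda w: w * w))      # Y = X^2
```

The last case is worth a second look: $Y = X^2$ is completely determined by $X$, yet the coefficient is zero, because the relationship is not linear.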
There are plenty of interesting examples of random variables with non-linear correlation, and the Pearson correlation coefficient fails miserably at detecting them. Here are some more examples of Pearson correlation coefficients applied to samples drawn from the sample spaces of various probability distributions (continuous, but the issue still applies to the finite case):
Various examples of the Pearson correlation coefficient, credit Wikipedia.

Though we will not discuss it here, there is still a nice precedent for using the Pearson correlation coefficient. But this strays a bit far from our original point: we still want to find a formula for $\operatorname{Var}(aX + bY)$.
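Expanding the definition, it is not hard to see that this amounts to the following proposition:

$$\operatorname{Var}(aX + bY) = a^2 \operatorname{Var}(X) + b^2 \operatorname{Var}(Y) + 2ab \operatorname{Cov}(X, Y).$$

Note that in the general sum $\operatorname{Var}(\sum_i a_i X_i)$, we get a bunch of terms involving $\operatorname{Cov}(X_i, X_j)$ for $i \neq j$.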
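The standard identity $\operatorname{Var}(aX + bY) = a^2 \operatorname{Var}(X) + b^2 \operatorname{Var}(Y) + 2ab \operatorname{Cov}(X, Y)$ can be spot-checked on dependent variables. As before, this is only a sketch: the space, the variables, and the coefficients are hypothetical.

```python
from fractions import Fraction
from itertools import product

# Hypothetical space: two fair dice.
P = {w: Fraction(1, 36) for w in product(range(1, 7), repeat=2)}

def E(Z):
    """Expected value of Z."""
    return sum(Z(w) * p for w, p in P.items())

def cov(X, Y):
    """Covariance of X and Y."""
    mx, my = E(X), E(Y)
    return E(lambda w: (X(w) - mx) * (Y(w) - my))

def var(Z):
    """Variance, computed as Cov(Z, Z)."""
    return cov(Z, Z)

X = lambda w: w[0]    # first die
Y = lambda w: max(w)  # larger of the two dice, correlated with X
a, b = 2, -3

lhs = var(lambda w: a * X(w) + b * Y(w))
rhs = a * a * var(X) + b * b * var(Y) + 2 * a * b * cov(X, Y)
print(lhs == rhs)
```

Because `X` and `Y` are positively correlated here, dropping the covariance term would give the wrong answer.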
Another way to look at the linear relationships between a collection of random variables $X_1, \dots, X_n$ is via a covariance matrix, whose $(i, j)$ entry is $\operatorname{Cov}(X_i, X_j)$. As we have already seen on this blog in our post on eigenfaces, one can manipulate this matrix in interesting ways. In particular (and we may be busting out an unhealthy dose of new terminology here), the covariance matrix is symmetric and positive semidefinite, and so by the spectral theorem it has an orthonormal basis of eigenvectors, which allows us to diagonalize it.
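Here is a minimal sketch of a covariance matrix for three random variables, assuming the $(i, j)$ entry is $\operatorname{Cov}(X_i, X_j)$. It does not diagonalize anything (that would call for a linear algebra library). Instead it checks symmetry and evaluates the quadratic form $v^T C v$, which equals $\operatorname{Var}(\sum_i v_i X_i)$ and therefore can never be negative. The variables and test vectors are our own.

```python
from fractions import Fraction
from itertools import product

# Hypothetical space: two fair dice.
P = {w: Fraction(1, 36) for w in product(range(1, 7), repeat=2)}

def E(Z):
    """Expected value of Z."""
    return sum(Z(w) * p for w, p in P.items())

def cov(X, Y):
    """Covariance of X and Y."""
    mx, my = E(X), E(Y)
    return E(lambda w: (X(w) - mx) * (Y(w) - my))

# Three random variables: each die, and their sum.
variables = [lambda w: w[0], lambda w: w[1], lambda w: w[0] + w[1]]
n = len(variables)

# Covariance matrix: entry (i, j) is Cov(X_i, X_j).
C = [[cov(Xi, Xj) for Xj in variables] for Xi in variables]

symmetric = all(C[i][j] == C[j][i] for i in range(n) for j in range(n))

def quadratic_form(v):
    """v^T C v, which equals Var(sum_i v_i X_i) and so is never negative."""
    return sum(v[i] * C[i][j] * v[j] for i in range(n) for j in range(n))

print(symmetric)
print(quadratic_form((1, -2, 3)))  # a positive value
print(quadratic_form((1, 1, -1)))  # zero: X_1 + X_2 - X_3 is constant
```

The zero in the last line shows the form can vanish on a nonzero vector, so in general the matrix is only nonnegative in this sense, not strictly positive.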
This will get our toes wet with some basic measure theory, but as every mathematician knows: analysis builds character.
Hello Jeremy, would you consider using MathJax or something similar on your blog? It would help tremendously with content consumption on mobile devices like my smartphone. Right now, all the images are centered on their own separate line.

Maybe I can file a bug.

A primer so clear and intuitive, even for a non-mathematician. Often with data sets the standard deviation is given along with the mean, mode, etc.
What meaningful information does a std dev give us? It gives you even more power if you know more information about the distribution of your random variable. Coupling this with more beefy theorems like the central limit theorem, you can start saying interesting things about collections of non-normally distributed random variables.
The central limit theorem only tells you about means of i.i.d. random variables. If your distribution is very skewed, measuring deviations can be a very bad approximation to posterior probability mass, which is why people use exact tests rather than normal approximations where possible. For normal distributions we get these nice interval bounds, but in general the Chernoff bounds always use variance.
Maybe this is my mathematical training inhibiting me.

I believe I mentioned that (if perhaps only briefly) in a figure with some examples of nonlinear relationships giving zero correlation.

I think you mean , not.

I have no idea why math people think that expressing things in symbols not found in the typical programming language or keyboard is in any way clearer to a programmer.
Like regular expressions, such notation is not designed for either clarity or readability. It is a concise way to represent things — but a very bad way indeed to explain things. As all web developers know, all things being the same the simplest option is the best. Mathematical notation is the simplest and most widely understood, and making up my own programmer-friendly notation would be far worse: then all of my readers would have to puzzle over it instead of just my readers who are afraid of math, which is a small fraction to be sure. Indeed, the whole point of this blog is that I believe programmers stand to learn a lot by getting involved with some real mathematics.
In my mind, the programming world has too much of a focus on vapid web development and useless mobile apps, and not enough on doing anything interesting. This blog is a testament to the fact that many of those interesting things require or stand to benefit from mathematics.

The complement of an event $E$, written $E^C$, is the set of all outcomes in the sample space that are not in $E$.