Sums of normal random variables
Consider a sample of $n$ independent normal random variables. I would like a systematic way to calculate the probability that the sum of a subset of them is larger than the sum of the rest.
An example case:
Population of fish. Mean: 10 kg, standard deviation: 3 kg.
I catch five fish ($n=5$). What is the probability that two of the fish weigh more than the other three combined?
One approach is to calculate the probability for every combination of fish and then apply the inclusion-exclusion formula to their union. Is there anything smarter?
Note: if four fish were considered, the probability of having two of them heavier than the other two should be one. How could this be computed immediately?
Thanks for the answers.
normal-distribution independence
edited Apr 11 at 10:12 by Tim♦
asked Apr 11 at 9:45 by Manos
You could certainly do simulation.
– Peter Flom♦ Apr 11 at 13:15
@whuber - You give a great answer assuming that we have a specific two in mind (or randomly choose two). On my first reading I thought the question asked whether there is any subset of 2 whose sum is greater than the sum of the remaining ones (as evidenced by their claim that with 4 fish the probability would be 1). In that case we would want to compare the distribution of the biggest two with the distribution of the rest, and would have to dive into order statistics. Simulation suggests that in this situation the probability is roughly .464.
– Dason Apr 11 at 14:36
@Dason Thank you for pointing that out: it is a very plausible interpretation and one I had not conceived of. It also explains why Peter was suggesting simulation, because that's a much trickier problem. I think you're correct about order statistics, because we can reframe the problem as asking "what is the chance that the sum of the $k$ largest of $n$ values exceeds the sum of the $n-k$ smallest ones?" Although we can write down the value as an integral, in general it requires numerical evaluation and rapidly gets onerous as $n$ grows.
– whuber♦ Apr 11 at 14:40
@Manos - If the 1st and 3rd largest summed were larger than the 2nd, 4th, and 5th... then the 1st and 2nd summed would be larger than the 3rd, 4th, and 5th and would also meet your criteria. So in terms of checking whether any subset meets the criteria, we only need to check whether the top k sum to something larger than the bottom n-k.
– Dason Apr 11 at 16:09
They could. But as whuber mentions, it's not an easy problem. Simulation would get you a result much more easily for any specific situation.
– Dason Apr 11 at 16:19
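The "any subset" reading discussed in these comments can be explored by simulation, using Dason's observation that only the $k$ largest values need to be compared with the rest. What follows is a minimal R sketch under the fish example's parameters; the variable names (`s`, `top`, `rest`, `p.any`) and the simulation size are illustrative assumptions, not taken from the comments.

```r
n <- 5; k <- 2; mu <- 10; sigma <- 3
n.sim <- 1e5                      # simulation size (an arbitrary choice)
set.seed(17)                      # for reproducible results

# One simulated sample of n fish per row
x <- matrix(rnorm(n * n.sim, mu, sigma), ncol = n)

# Sort each sample in decreasing order; apply() returns an n x n.sim matrix
s <- apply(x, 1, sort, decreasing = TRUE)

top  <- colSums(s[1:k, , drop = FALSE])     # sum of the k largest values
rest <- colSums(s[-(1:k), , drop = FALSE])  # sum of the n - k smallest values

p.any <- mean(top >= rest)  # estimated chance that some k-subset outweighs the rest
p.any
```

The estimate can be compared with the value of roughly .464 that Dason reports; it will vary with the seed and the simulation size.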
1 Answer
Your example suggests that not only are the $n$ variables $X_1,X_2,\ldots,X_n$ independent, they also have the same Normal distribution. Let its parameters be $\mu$ (the mean) and $\sigma^2$ (the variance) and suppose the subset consists of $k$ of these variables. We might as well index the variables so that $X_1,\ldots, X_k$ are this subset.
The question asks to compute the chance that the sum of the first $k$ variables equals or exceeds the sum of the rest:
$$p_{n,k}(\mu,\sigma) = \Pr(X_1+\cdots+X_k \ge X_{k+1}+\cdots+X_n ) = \Pr(Y \le 0)$$
where
$$Y = -(X_1+\cdots+X_k) + (X_{k+1}+\cdots+X_n).$$
$Y$ is a linear combination of independent Normal variables and therefore has a Normal distribution, but which one? The laws of expectation and variance immediately tell us
$$E[Y] = -k\mu + (n-k)\mu = (n-2k)\mu$$
and
$$\operatorname{Var}(Y) = k \sigma^2 + (n-k)\sigma^2 = n\sigma^2.$$
Therefore $$Z=\frac{Y - (n-2k)\mu}{\sigma\sqrt{n}}$$ has a standard Normal distribution with distribution function $\Phi,$ whence the answer is
$$p_{n,k}(\mu,\sigma) = \Pr(Y \le 0) = \Pr\left(Z \le -\frac{(n-2k)\mu}{\sigma\sqrt{n}}\right) = \Phi\left(-\frac{(n-2k)\mu}{\sigma\sqrt{n}}\right).$$
In the question, $n=5,k=2,\mu=10,$ and $\sigma=3,$ whence
$$p_{5,2}(10,3) = \Phi\left(-\frac{(5-2(2))10}{3\sqrt{5}}\right)\approx 0.0680186.$$
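The closed form above is a one-liner in R. This is a small sketch; the function name `p_subset` is a hypothetical helper introduced here for illustration, and it relies only on base R's `pnorm`.

```r
# Chance that k specified i.i.d. Normal(mu, sigma^2) variables sum to at
# least as much as the remaining n - k of them: Phi(-(n - 2k) mu / (sigma sqrt(n)))
p_subset <- function(n, k, mu, sigma) {
  pnorm(-(n - 2 * k) * mu / (sigma * sqrt(n)))
}

p_subset(5, 2, 10, 3)  # the fish example: two specified fish versus the other three
p_subset(4, 2, 10, 3)  # two specified fish out of four, where n = 2k
```

When $n=2k$ the argument of $\Phi$ is zero, so the value is $\Phi(0)=1/2$ whatever $\mu$ and $\sigma$ are. The probability of one mentioned in the question therefore belongs to the other reading, in which the two fish are chosen after the weights are seen.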
Generalization
Little needs to change in this analysis even when the $X_i$ have different Normal distributions or are even correlated: you only need to assume they have an $n$-variate Normal distribution to ensure their linear combination still has a Normal distribution. The calculations are carried out in the same way and result in a similar formula.
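To make the generalization concrete, write $Y = a^\prime X$ where the coefficient vector $a$ has $-1$ for the variables in the subset and $+1$ elsewhere. If $X$ has mean vector $m$ and covariance matrix $\Sigma$, then $E[Y] = a^\prime m$ and $\operatorname{Var}(Y) = a^\prime \Sigma a$. The sketch below assumes a multivariate Normal $X$ with $a^\prime \Sigma a > 0$; the function name `p_subset_mvn` and its arguments are hypothetical, introduced only for illustration.

```r
# Pr(sum of X[subset] >= sum of the remaining X) for multivariate Normal X
# with mean vector mu and covariance matrix Sigma (assumes a' Sigma a > 0).
p_subset_mvn <- function(subset, mu, Sigma) {
  a <- rep(1, length(mu))               # +1 for variables outside the subset
  a[subset] <- -1                       # -1 for variables in the subset
  m <- sum(a * mu)                      # E[Y] = a' mu
  v <- drop(crossprod(a, Sigma %*% a))  # Var(Y) = a' Sigma a
  pnorm(0, mean = m, sd = sqrt(v))      # Pr(Y <= 0)
}

# The i.i.d. fish example as a special case: common mean 10, variance 3^2
p_subset_mvn(1:2, rep(10, 5), diag(9, 5))

# An equicorrelated variant (correlation 0.5), purely as an illustration
Sigma <- 9 * (0.5 * diag(5) + 0.5)
p_subset_mvn(1:2, rep(10, 5), Sigma)
```

With independent, identically distributed inputs this reduces to the formula derived above, which gives a quick consistency check.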
Check
A commenter suggested solving this with simulation. Although that wouldn't be a solution, it's a decent way to check a solution quickly. Thus, in R we might establish the inputs of the simulation in some arbitrary way as
    n <- 5
    k <- 2
    mu <- 10
    sigma <- 3
    n.sim <- 1e6 # Simulation size
    set.seed(17) # For reproducible results
and simulate such data and compare the sums with these two lines:
    x <- matrix(rnorm(n*n.sim, mu, sigma), ncol=n) # One simulated sample per row
    # drop=FALSE keeps a matrix even when a side has a single column (e.g. k = 1)
    p.hat <- mean(rowSums(x[, 1:k, drop=FALSE]) >= rowSums(x[, -(1:k), drop=FALSE]))
The post-processing consists of finding the fraction of simulated datasets in which one sum exceeds the other and comparing that to the theoretical solution:
    se <- sqrt(p.hat * (1-p.hat) / n.sim) # Standard error of the simulated proportion
    p <- pnorm(-(n-2*k)*mu / (sigma * sqrt(n))) # Theoretical value
    signif(c(Simulation=p.hat, Theory=p, `Z-score`=(p.hat-p)/se), 3)
The output in this case is
    Simulation     Theory    Z-score
        0.0677     0.0680    -1.1900
The agreement is close, and the small absolute z-score allows us to attribute the discrepancy to random fluctuations rather than to any error in the theoretical derivation.
edited Apr 11 at 14:05
answered Apr 11 at 13:50
whuber♦
We can also assume without loss of generality that $\sigma=1$; intuitively, we can calculate everything in terms of $\frac{\mu}{\sigma}$.
– Acccumulation
Apr 11 at 17:33
@Acccumulation That's correct and it's a good way to proceed. Indeed, this fact follows immediately from observing that one can arbitrarily set the unit of measurement so that $\sigma=1$ without changing the problem. I found it convenient not to have to explain this because it didn't appreciably simplify the analysis.
– whuber♦
Apr 11 at 20:15
You could certainly do simulation.
– Peter Flom♦
Apr 11 at 13:15
@whuber - You give a great answer assuming that we have a specific two in mind (or randomly choose two). My initial pass at reading thought it was asking about if there were any subsets of 2 such that the sum was greater than the remaining (as evidenced by their claim that if there were 4 fish then the probability would be 1) in which case we would want to look at the distribution of the biggest two vs the distribution of the remaining and would have to dive into the order statistics. Simulation suggests in this situation the probability is roughly .464.
– Dason
Apr 11 at 14:36
@Dason Thank you for pointing that out: it is a very plausible interpretation and one I had not conceived of. It also explains why Peter was suggesting simulation, because that's a much trickier problem. I think you're correct about order statistics, because we can reframe the problem as asking "what is the chance that the sum of the $k$ largest of $n$ values exceeds the sum of the $n-k$ smallest ones?" Although we can write down the value as an integral, in general it requires numerical evaluation and rapidly gets onerous as $n$ grows.
– whuber♦
Apr 11 at 14:40
@Manos - If the 1st and 3rd summed were larger than the 2nd, 4th, and 5th... then the 1st and 2nd summed would be larger than the 3rd, 4th, and 5th and would also meet your criteria. So in terms of checking if any subsets meet the criteria we only need to check if the top k sum to something larger than the bottom n-k.
– Dason
Apr 11 at 16:09
They could. But as whuber mentions it's not an easy problem. Simulation would get you a result much more easily for any specific situation.
– Dason
Apr 11 at 16:19
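The alternative reading raised in these comments (does any subset of $k$ values sum to at least as much as the rest?) can be explored by simulation. The sketch below relies on the reduction described above: it is enough to compare the $k$ largest values with the $n-k$ smallest. It reuses the parameter values from the question; the names `x.sorted` and `p.any` and the simulation size are illustrative choices, not anything taken from the thread.

```r
n <- 5
k <- 2
mu <- 10
sigma <- 3
n.sim <- 1e5   # Simulation size (an arbitrary choice)
set.seed(17)   # For reproducible results

# One simulated dataset per row.
x <- matrix(rnorm(n * n.sim, mu, sigma), ncol = n)

# Sort each row in decreasing order. apply() returns the sorted rows
# as columns, so transpose to get one dataset per row again.
x.sorted <- t(apply(x, 1, sort, decreasing = TRUE))

# Fraction of datasets in which the k largest values sum to at least
# as much as the n-k smallest ones.
top <- rowSums(x.sorted[, 1:k, drop = FALSE])
bottom <- rowSums(x.sorted[, -(1:k), drop = FALSE])
p.any <- mean(top >= bottom)
p.any
```

Because the event for a fixed subset implies the event for the $k$ largest values, `p.any` can never be smaller than the fixed-subset estimate computed from the same simulated data.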