Where is this function increasing? Is it an increasing function?

Looking at various recent examination papers, it has become clear to me that there is significant confusion between these two questions. This post is intended to bring some clarity to the situation.

At the start of this post, I will give an example of the confusion as it appears in exam questions (and probably elsewhere), and clarify what the two different phrases mean using the above example. I will then delve more deeply into the mathematics of these two things, going beyond A-level content, and use some undergraduate analysis to find equivalent conditions for them in terms of the derivatives of the functions. It is fine to skip over the technical stuff and just look at the results (theorems)!

(Exactly the same applies to the use of the term “decreasing”, but for simplicity we will focus on increasing functions in this post.)

Here is an example of a question (based on a real exam question) which typifies the confusion.

The equation of a curve is $y=x^3+4x^2-5x$.

Find the set of values of $x$ for which $y$ is an increasing function of $x$.

If we replace “increasing function” by another familiar A-level term describing functions, “one-to-one function”, the question becomes:

A function is given by $f(x)=x^3+4x^2-5x$.

Find the set of values of $x$ for which $f(x)$ is a one-to-one function of $x$.

This is clearly nonsensical, because whether a function is one-to-one
or not is a property of the function *as a whole*, not a property of
the function values at any particular input value.

Likewise, a function either is or is not an *increasing function*; it
is a property of the function *as a whole*.

Informally (and not quite correctly), we can describe the difference as follows:

- A function is an *increasing function* if larger input values give larger output values.
- A function is *increasing at a point* if at that point, the function has a positive gradient.

An example which shows that these are not the same is the function $f(x)=-\dfrac{1}{x}$ for $x\ne0$ shown above. This function is increasing at every value of $x\ne0$, as the gradient is always positive. However, it is not an increasing function, because $f(1)<f(-1)$. If, though, we restricted the domain of the function to $x>0$, then it would be an increasing function.

So the above-quoted exam question does not make any sense, just as the modified version did not: either $y$ is an increasing function of $x$ or it is not. If the question had instead asked “Find the set of values of $x$ at which $y$ is increasing,” it would have been fine.
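As a rough illustration of the reworded question, the sketch below finds where the gradient of $y=x^3+4x^2-5x$ is positive. The names `dydx`, `lower` and `upper` are my own, and the code only checks the sign of the derivative; it says nothing yet about the subtleties at stationary points discussed later.

```python
import math

def dydx(x):
    """Derivative of y = x^3 + 4x^2 - 5x, i.e. 3x^2 + 8x - 5."""
    return 3 * x**2 + 8 * x - 5

# Roots of 3x^2 + 8x - 5 = 0 via the quadratic formula.
a, b, c = 3, 8, -5
disc = b * b - 4 * a * c          # 64 + 60 = 124
lower = (-b - math.sqrt(disc)) / (2 * a)
upper = (-b + math.sqrt(disc)) / (2 * a)

print(f"stationary points at x = {lower:.4f} and x = {upper:.4f}")

# Sample the gradient to the left of, between, and to the right of the roots.
for x in (lower - 1, (lower + upper) / 2, upper + 1):
    print(f"x = {x:+.4f}: dy/dx = {dydx(x):+.4f}")
```

The gradient is positive for $x<\frac{-4-\sqrt{31}}{3}$ and for $x>\frac{-4+\sqrt{31}}{3}$, which is the answer an examiner would presumably expect.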

Incidentally, the idea of increasing and decreasing functions connects very well with the issue of rearranging inequalities (increasing the depth of connections within the subject): a function can be applied to both sides of an inequality without changing the direction of the inequality if the function is (strictly) increasing; it can be applied but with a change in the direction of the inequality if the function is (strictly) decreasing; and if the function is neither, then the function cannot be applied to the inequality. So we cannot square both sides of an inequality unless we are restricted to non-negative values, and we cannot take the reciprocal of both sides unless both are positive (and in that case, we must also reverse the direction of the inequality).
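A few numerical cases make the point concrete. This is only a sketch: `compare_after`, `square` and `reciprocal` are illustrative names of my own, and the specific numbers are arbitrary choices.

```python
def compare_after(f, a, b):
    """For a < b, return '<', '>' or '=' describing f(a) versus f(b)."""
    fa, fb = f(a), f(b)
    return "<" if fa < fb else ">" if fa > fb else "="

square = lambda x: x * x
reciprocal = lambda x: 1 / x

print(compare_after(square, 2, 3))       # '<': both non-negative, direction kept
print(compare_after(square, -3, 2))      # '>': -3 < 2 but 9 > 4
print(compare_after(reciprocal, 2, 4))   # '>': both positive, direction reversed
print(compare_after(reciprocal, -2, 4))  # '<': mixed signs, direction not reversed
```

Squaring is strictly increasing only on the non-negative numbers, and taking reciprocals is strictly decreasing on the positive numbers, which is why the behaviour changes once a negative value is involved.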

It seems reasonable to assert that if a function is an increasing function, then it will be increasing at every point. There turns out to be some subtlety to this, which we now explore a little more deeply.

We can give a formal definition of an increasing function. For
example, this definition is from Apostol, *Mathematical Analysis*, 2nd
ed, p94, and identical definitions appear on the internet:

Definition 1: Let $f$ be a real-valued function whose domain is a subset $S$ of $\mathbb{R}$. Then $f$ is said to be an *increasing* (or *nondecreasing*) function if for every pair of points $x$ and $y$ in $S$, $x<y$ implies $f(x)\le f(y)$. If $x<y$ implies $f(x)<f(y)$, then $f$ is said to be a *strictly increasing* function. (Decreasing functions are similarly defined.)

Note the distinction between increasing and strictly increasing here: a constant function such as $f(x)=0$ for $x\in\mathbb{R}$ is both an increasing and decreasing function, though it is not a strictly increasing function.
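Definition 1 can be tested directly on a finite sample of the domain. The sketch below is illustrative only: `is_increasing_on` and the sample points are my own choices, and a finite check can refute the property but never prove it.

```python
from itertools import combinations

def is_increasing_on(f, points, strict=False):
    """Check Definition 1 on a finite sample of the domain.

    Passing this check does not prove f is increasing (only finitely many
    pairs are tested), but failing it proves that f is not.
    """
    for x, y in combinations(sorted(points), 2):   # every pair with x < y
        if strict and not f(x) < f(y):
            return False
        if not strict and not f(x) <= f(y):
            return False
    return True

def neg_recip(x):
    return -1 / x

whole_domain = [-2, -1, -0.5, 0.5, 1, 2]   # sample of x != 0
positive_part = [0.5, 1, 2]                # sample of x > 0

print(is_increasing_on(neg_recip, whole_domain))                 # False
print(is_increasing_on(neg_recip, positive_part, strict=True))   # True
print(is_increasing_on(lambda x: 0, whole_domain))               # True
print(is_increasing_on(lambda x: 0, whole_domain, strict=True))  # False
```

The first two results mirror the earlier example of $f(x)=-\dfrac{1}{x}$: the pair $-2<0.5$ already breaks the definition on the whole domain, while no pair in the sample of $x>0$ does.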

We could also try to come up with a definition of increasing at a point. There are no standard definitions of this idea, and the following proposed definition is certainly beyond A-level in its formality. It is based on the definition of continuity, which is about the behaviour of a function “near” to a point.

Definition 2: Let $f$ be a real-valued function whose domain is a subset $S$ of $\mathbb{R}$. Then $f$ is said to be *increasing at the point* $x$ in $S$ if there is some $\delta>0$ such that:

- for every $y$ in $S$ with $x<y<x+\delta$, $f(x)\le f(y)$, and
- for every $y$ in $S$ with $x-\delta<y<x$, $f(y)\le f(x)$.

If the $\le$ signs are replaced by $<$ signs in these two inequalities, then $f$ is said to be *strictly increasing at* $x$.

With this definition, the above exam question (reworded) makes sense, and the correct final answer is what the examiner would expect. (One might wonder whether one could make such a local definition of one-to-one, and indeed, this is done when considering the Inverse Function and Implicit Function theorems. But that is a story for another day.)

So far, no calculus has appeared, yet we typically teach our students to determine whether a function is an increasing function or to find where it is increasing by differentiating the function. So let us now consider how we could use calculus to help us.

For us to be able to use calculus, we need to assume that our function is differentiable throughout $S$. We could then propose the following theorem:

Theorem 1 (incorrect attempt): Let $f$ be a real-valued continuous function whose domain is a subset $S$ of $\mathbb{R}$ and is differentiable at every (interior) point of $S$. Then $f$ is an increasing function if and only if $f’(x)>0$ for all $x$ in (the interior of) $S$.

(The use of “interior” is to avoid certain technical complications.)

Unfortunately this fails immediately: the constant function $f(x)=0$ for $x\in\mathbb{R}$ is increasing, yet $f’(x)=0$.

We could try changing this to say that $f$ is a *strictly* increasing
function, but that fails if the function has a point of inflection.
For example, $f(x)=x^3$ is a strictly increasing function, even though
its derivative is zero at $x=0$.

We could also try changing the condition to say that $f’(x)\ge0$ for all $x$ in $S$. However, this also fails: if the graph has a discontinuity, such as the function $f(x)=-\dfrac{1}{x}$ for $x\ne0$ that we looked at before, then it might have $f’(x)>0$ for all $x$ in $S$, yet not be an increasing function.

This feels more hopeful, though: after all, the only problem now is the “hole” in the domain $S$. And it turns out that if we restrict the domain to be an interval (that is, a subset of the reals with no “holes”), then it will work:

Theorem 1 (correct version): Let $f$ be a real-valued continuous function whose domain is an interval $I$ of $\mathbb{R}$ and is differentiable at every point in (the interior of) $I$. Then $f$ is an increasing function if and only if $f’(x)\ge 0$ for all $x$ in (the interior of) $I$.

The formal proof of this is found below, and though it is quite technical, the theorem itself seems clearly true, and school students could probably be convinced to believe it (at least once it is written in more student-friendly language).

What can we say, though, about whether a (differentiable) function is increasing at a point? Using Definition 2 above, we get the corresponding theorem:

Theorem 2: Let $f$ be a real-valued continuous function whose domain is an interval $I$ of $\mathbb{R}$ and which is differentiable at every (interior) point of $I$. Then $f$ is increasing at the point $x$ in $I$ if and only if there is some $\delta>0$ for which $f’(y)\ge0$ for all $y$ in (the interior of) $I$ with $x-\delta<y<x+\delta$.

Why is it not sufficient to just require $f’(x)\ge0$? Well, consider the functions $f(x)=x^3$ and $f(x)=-x^3$. They both have $f’(0)=0$, yet the first is increasing (indeed, even strictly increasing) at $x=0$, while the second is decreasing at $x=0$. And a function such as $f(x)=x^2$ is neither increasing nor decreasing at $x=0$. So we really do need to consider a small interval around the point of interest.
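These three examples can be probed numerically against Definition 2. The sketch below is an illustration under stated assumptions: `increasing_at`, the fixed `delta` and the number of samples are my own choices, the function is assumed to be defined on the whole neighbourhood, and a finite sample with a single $\delta$ can only suggest the answer.

```python
def increasing_at(f, x, delta=1e-3, samples=100, strict=False):
    """Sample-based check of Definition 2 at the point x.

    Compares f(x) with f(y) for y on either side of x within distance delta.
    """
    ok = (lambda a, b: a < b) if strict else (lambda a, b: a <= b)
    for k in range(1, samples + 1):
        h = delta * k / samples
        if not ok(f(x), f(x + h)):      # points just to the right of x
            return False
        if not ok(f(x - h), f(x)):      # points just to the left of x
            return False
    return True

print(increasing_at(lambda x: x**3, 0, strict=True))  # True
print(increasing_at(lambda x: -x**3, 0))              # False
print(increasing_at(lambda x: x**2, 0))               # False
```

All three functions have zero derivative at the origin, yet only the first passes the check.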

(Theorem 2 could be extended, with care, to more general subsets of $\mathbb{R}$, as we are only discussing a local property of the function. But it is not particularly interesting to do so.)

So the question of determining at which points a function is increasing (or decreasing) is more subtle than it appears: not only does one have to find where the function has derivative $\ge0$ (and not just $>0$), but one also has to determine what is happening at those points where the derivative is zero, as there are different types of stationary points. (At those points where the derivative is strictly positive, the function is certainly strictly increasing, which follows from Theorem 4 below.)

Things get more complicated if we now wish to consider strictly increasing (or decreasing) functions. There is a relatively weak theorem which will suffice much of the time:

Theorem 3: Let $f$ be a continuous real-valued function whose domain is an interval $I$ of $\mathbb{R}$ and which is differentiable at every (interior) point of $I$. Then if $f’(x)>0$ throughout $I$, $f$ is a strictly increasing function.

Note that this is a one-directional theorem; $f(x)=x^3$ for $x\in\mathbb{R}$ is our standard example of a strictly increasing function which does not have $f’(x)>0$ throughout the domain because of the point of inflection at the origin. The proof of Theorem 3 follows exactly as that of Theorem 1.

An easy corollary of this is the following (local) theorem:

Theorem 4: Let $f$ be a continuous real-valued function whose domain is a subset $S$ of $\mathbb{R}$. If $f$ is differentiable at the point $x$ in the interior of $S$ and $f’(x)>0$, then $f$ is strictly increasing at $x$.

This is the theorem which is typically used when answering A-level exam questions such as the one above. Unfortunately, as we see from our example of $f(x)=x^3$, this too is a one-directional theorem: every point at which $f’(x)>0$ is a point at which the function is strictly increasing, but there may be other points where this is the case but where $f’(x)=0$. (If $f’(x)<0$, then the function is strictly decreasing at this point, so it cannot be increasing.) The question of using calculus to determine where a function is increasing, rather than strictly increasing, is somewhat more complicated, as we see from Theorem 2 above. But at A-level, the functions are always nice enough that the only difficulties will be at the stationary points.

There is actually a necessary and sufficient condition for a function to be strictly increasing, but this is more subtle:

Theorem 5: Let $f$ be a continuous real-valued function whose domain is an interval $I$ of $\mathbb{R}$ and which is differentiable at every interior point of $I$. Then $f$ is strictly increasing on $I$ if and only if $f’(x)\ge0$ throughout $I$ and there is no non-trivial subinterval $J$ of $I$ with $f’(x)=0$ for all $x$ in the interior of $J$.

The proof can be found below.

Putting this all together, we see that Theorem 4 is the crucial theorem for school use. Teaching the meaning of the term “increasing function” (Definition 1) and a simplified explanation of “increasing at a point” (Definition 2), along with Theorem 4 should give a good grounding. It would also be wise to caution that it is a one-way theorem by comparing and contrasting examples such as $f(x)=x^2$ and $f(x)=x^3$.

This technical appendix uses tools from undergraduate analysis. The proofs of the other three theorems are very similar to these or they follow immediately from these.

Theorem 1: Let $f$ be a real-valued continuous function whose domain is an interval $I$ of $\mathbb{R}$ and is differentiable at every point in the interior of $I$. Then $f$ is an increasing function if and only if $f’(x)\ge 0$ for all $x$ in the interior of $I$.

**Proof**

We show first that if $f$ is an increasing function, then $f’(x)\ge0$ for all $x$ in the interior of $I$, and we argue by contradiction. Assume that $f’(x_0)<0$ for some $x_0$ in the interior of $I$. Using the definition of derivative, this means that $\lim\limits_{\substack{x\to x_0\\ x\in I}}\dfrac{f(x)-f(x_0)}{x-x_0}<0$. So there is some $x_1\in I$ (where $x_1\ne x_0$) with $\dfrac{f(x_1)-f(x_0)}{x_1-x_0}<0$ (otherwise the limit would be $\ge0$). If $x_1>x_0$, then multiplying by $x_1-x_0$ gives $f(x_1)-f(x_0)<0$, so $f(x_1)<f(x_0)$. If $x_1<x_0$, then multiplying by $x_1-x_0$ gives $f(x_1)-f(x_0)>0$, so $f(x_1)>f(x_0)$. Either way, this shows that the function is not increasing on $I$, and we have our desired contradiction. Thus if $f$ is an increasing function, we must have $f’(x)\ge0$ for all $x$ in the interior of $I$.

Conversely, if $f’(x)\ge0$ for all $x$ in the interior of $I$, then let $x<y$ be any two points in $I$. Then by the mean-value theorem, there is some $z$ with $x<z<y$ for which $f(y)-f(x)=f’(z)(y-x)$ (and note that $z$ lies in the interior of $I$ as $I$ is an interval). Since $f’(z)\ge0$ by assumption, and $y-x>0$, it follows that $f(y)-f(x)\ge0$, so $f(x)\le f(y)$. Therefore $f$ is an increasing function.

Theorem 5: Let $f$ be a continuous real-valued function whose domain is an interval $I$ of $\mathbb{R}$ and which is differentiable at every interior point of $I$. Then $f$ is strictly increasing on $I$ if and only if $f’(x)\ge0$ throughout $I$ and there is no non-trivial subinterval $J$ of $I$ with $f’(x)=0$ for all $x$ in the interior of $J$.

**Proof**

We first prove that if the derivative condition is not met, then $f$ is not strictly increasing on $I$. If $f’(x)<0$ at any point in $I$, then $f$ is not increasing (by Theorem 1), so it is certainly not strictly increasing. If $f’(x)\ge0$ throughout $I$ but there is a non-trivial subinterval $J$ of $I$ with $f’(x)=0$ for all $x$ in the interior of $J$, then $f$ is constant throughout $J$ (by the mean-value theorem). In particular, there are $y<z$ in $J$ with $f(y)=f(z)$, showing that $f$ is not strictly increasing.

Conversely, if $f’(x)\ge0$ throughout $I$, then $f$ is increasing by
Theorem 1. Assume now that there is no non-trivial subinterval $J$ of
$I$ with $f’(x)=0$ for all $x$ in the interior of $J$. But if $f$
were *not* strictly increasing, then there would be $y<z$ in $I$ with
$f(y)=f(z)$, so $f(x)$ is constant on the interval $y<x<z$. (For if
$f(y)<f(x)$ for some $x$ in this interval, we would have $f(x)>f(z)$,
contradicting $f$ increasing.) Therefore $f’(x)=0$ throughout this
interval, contradicting our assumption. So $f$ must be strictly
increasing.

Though my main interest was the maths teaching, I was fascinated by the whole experience, so that is what I will focus most of my attention on here. I used to teach in a school (“W”) with a broadly similar type of intake: it was in an area with many students from ethnic minorities and many students on free school meals; that school was also in an area in which there was a grammar school system, so many of the highest-attaining students in the catchment area attended the more selective local schools. This gave me an interesting basis for comparison.

The most obvious thing which struck me was the atmosphere that Katharine and her staff have established in the school. It was very purposeful, and the students I met generally seemed happy and to like the school. They were polite to me, and some were genuinely interested in talking to me. (Or at least they gave the convincing impression that they were!) Some students were immensely proud of what they were doing and showed off their work to me (without my even asking).

Many have written about the very strictly enforced behaviour policies. But what I had not expected was the huge warmth pouring forth from the staff to the students in their lessons, and the humanity pervading the school. Whilst demerits were regularly given for infringements of the school’s very strict behaviour policies - generally accompanied by just a few seconds’ calm explanation of the positive benefits of doing what was expected or the negative impact the behaviour was having on others - merits were given even more liberally (and fairly consistently between lessons) for behaviours the school wants to encourage, such as good vocal projection when answering a question, asking good questions and giving good explanations. And these were always accompanied by brief warm words. This contrasts so dramatically with my experience at “W”, where though some teachers managed their classes well, there wasn’t anything close to a consistent school-wide system at this level of detail. There is clearly a benefit to be gained from having such a consistently enforced system throughout the school, though it is tough for teachers. (Mind you, it is not as tough as teaching in a school where students throw things at teachers on a semi-regular basis.)

The most challenging class I saw was a small bottom-set year 10 class, several of whom had already been permanently excluded from one or more other schools. Yet there they were, behaving and mostly participating in the lesson, learning and targeting a grade 4 or 5 at GCSE Mathematics. Wow. At “W”, lower-middle sets were only targeting a grade D (on the old system, the equivalent of a grade 3 on the new system), and most of them did not achieve even that. The contrast could not be greater.

A few things struck me immediately during the day, without even entering a classroom. The first was the immaculate state of the building: not a speck of litter to be seen during the course of the day. This is in stark contrast to most of the schools I’ve taught in and visited over the years, and vastly different from “W”. The students have clearly been taught to respect their environment.

Is the school’s approach a good thing? This is a difficult question for me to answer. I certainly had a sense that the school was infusing students with British culture (whatever that means), and yet for students living in the UK, is this not a good thing? It will give them significant (British) cultural capital on which they will be able to draw in later life, and which they might well otherwise not gain.

On the other hand, students are constantly being watched, as are staff: for example, visitors (with DBS certificates) are permitted to just walk into any lesson, yet the teachers and students generally didn’t bat an eyelid when I quietly walked in. Yet this significantly reduces the chances of bullying and destructive behaviour: there are no “safe spaces” within the school for bullying or other damaging behaviour to take place without a teacher seeing.

I observed parts of about eight maths lessons during the day (as well as a smattering of other subjects). My concern was that they would be very procedural in nature, given the rigidity of the school system. However, I was pleasantly surprised: while they were clearly teacher-led, the questioning did include a good mix of knowledge and deeper understanding questions. For example, in a lesson on Pythagoras’s Theorem, there were early questions designed to ensure that the students knew which side was the hypotenuse, and later questions which required more thought, such as “If I have a triangle with side lengths 6, 7, 8, can I draw a right-angle here?” (I am not overly concerned with Year 8 students not clearly distinguishing between Pythagoras’s Theorem and its converse. Students were spending enough effort getting to grips with what the question meant.) Some time was spent working on questions from their workbooks, but this was far from the majority of the time.

During one of the lessons, students were asked to read out from their workbooks. I was surprised - though I probably should not have been - at how difficult they found it to read technical vocabulary; how often do we ask our students to read a piece of technical material?

There were also opportunities for discussion in pairs; these were short and effective, and the students were continually encouraged to use the time productively, as anyone could be picked on to answer a question after the discussion time.

Dani spoke much more about the planning process and lesson structure at Michaela in her podcast, which was fascinating, so I won’t say more about it here.

Returning home and reflecting, I have two big questions about this model, and in particular with regard to maths. The first (somewhat mathematics-specific) question is whether students get enough opportunity to think about any (mathematics) problem for a protracted period of time. Hearing about other countries’ approaches, I wonder whether this is a potentially missed opportunity, especially once behaviour is so well-managed that there is a good learning atmosphere.

The second, more pervasive, question is about the use of streaming within the school. (“Streaming” means that students are put in groups which are dependent upon their academic performance across a number of subjects. They remain in these groups for all of their subjects. As far as I could tell, it is used in Years 7-9 and possibly in Year 10 as well.) It is very effective for behaviour management, as the entire class is together the whole time, including at lesson changeover time. However, I am very unconvinced that it is good for equity, which is part of the school’s mission. Hearing of the experiences of primary and secondary schools which have moved away from ability (or better: attainment) grouping to mixed-attainment grouping, one has to ask whether this would be better for the majority of the students within the school, certainly at Key Stage 3 (11-14 year olds) and possibly older too. Teachers’ academic expectations of students are lower when they are teaching lower-attaining groups, and I strongly doubt that Michaela’s excellent teachers are any less affected by this.

And would I consider teaching at Michaela? I’m not sure it would be the “right” school for me, but I would take it over “W” any day.

Finally, the “family lunch”. The initial poetry reading was like being at a summer youth camp: the energy, enthusiasm and fun were palpable. There was quite a buzz in the room during this! And the discussion over lunch - this time about volunteering, in light of the outstanding work of volunteers in the Thailand cave rescue - was fascinating.

It was a pleasure to visit the school, and I thank the staff for being so open and welcoming. I look forward to hearing of their results, both academic and beyond, in the years to come.

The question is: what is the period of a pendulum?

We can model the pendulum as a thin rod (inextensible and rigid) of length $L$, freely pivoted about a point $O$, with a single point mass $P$ of mass $m$ on the end of the rod, as shown here (where $T$ is the tension in the rod):

The velocity and acceleration of $P$ are as follows, where $\dot\theta$ means $\dfrac{d\theta}{dt}$ and $\ddot\theta$ means $\dfrac{d^2\theta}{dt^2}$; a derivation of these can be found at the end:

We can now apply Newton’s second law (“$F=ma$”) to the situation: working perpendicular to the rod, this gives $-mg\sin\theta=mL\ddot\theta$ (the minus sign is because the component of the force $mg$ is in the opposite direction to the $L\ddot\theta$ on our diagram). Rearranging this, we get the differential equation:

$$\ddot\theta=-\frac{g}{L}\sin\theta$$

Unfortunately, this equation is impossible to solve in terms of simple
functions. But if we **assume that the swing of the pendulum is
small**, so that $\theta$ is small, then we can approximate
$\sin\theta$ by $\theta$, and our differential equation becomes

$$\ddot\theta=-\frac{g}{L}\theta$$

This differential equation (an example of simple harmonic motion) has a solution

(which is easy to check), where $A$ is the amplitude (maximum angle) of the swing. The period of this swing is $2\pi\sqrt{\dfrac{L}{g}}$, which is independent of the amplitude and the mass at the end of the rod! So as long as the swing is relatively small, the period is only dependent upon the length of the pendulum (and the acceleration due to gravity), which is likely to be a surprising result the first time it is met. This would have had great significance for clock-makers in times gone by.
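The claim that the period barely depends on the amplitude for small swings can be checked numerically. The sketch below integrates the full equation $\ddot\theta=-\frac{g}{L}\sin\theta$ (the rearrangement of $-mg\sin\theta=mL\ddot\theta$) with a classical Runge-Kutta step and compares the result with $2\pi\sqrt{L/g}$. The function name `pendulum_period`, the value $g=9.81$, the step size and the test amplitudes are all assumptions of mine, not part of the original argument.

```python
import math

def pendulum_period(L, amplitude, g=9.81, dt=1e-4):
    """Estimate the period of theta'' = -(g/L) sin(theta) by RK4 integration.

    The pendulum is released from rest at angle `amplitude` (radians, > 0).
    It first reaches theta = 0 after a quarter of a period.
    """
    def accel(theta):
        return -(g / L) * math.sin(theta)

    theta, omega, t = amplitude, 0.0, 0.0
    while True:
        # One classical Runge-Kutta step for the system (theta, omega).
        k1t, k1w = omega, accel(theta)
        k2t, k2w = omega + 0.5 * dt * k1w, accel(theta + 0.5 * dt * k1t)
        k3t, k3w = omega + 0.5 * dt * k2w, accel(theta + 0.5 * dt * k2t)
        k4t, k4w = omega + dt * k3w, accel(theta + dt * k3t)
        new_theta = theta + dt * (k1t + 2 * k2t + 2 * k3t + k4t) / 6
        new_omega = omega + dt * (k1w + 2 * k2w + 2 * k3w + k4w) / 6
        if new_theta <= 0:
            # Linear interpolation to locate the zero crossing.
            frac = theta / (theta - new_theta)
            return 4 * (t + frac * dt)
        theta, omega, t = new_theta, new_omega, t + dt

L = 1.0
small_angle = 2 * math.pi * math.sqrt(L / 9.81)
for amp in (0.05, 0.2, 1.0):
    print(f"amplitude {amp} rad: simulated {pendulum_period(L, amp):.4f} s, "
          f"formula {small_angle:.4f} s")
```

For the smallest amplitude the simulated period should agree closely with the formula, while for a swing of one radian it is noticeably longer, which shows where the small-angle assumption starts to matter. The mass never appears in the code, in line with the observation above.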

We can work out the velocity and acceleration of $P$ in several different ways. One way is to use coordinates, where $O$ is the origin, and the vertical line is the $y$-axis. Then when $P$ is at an angle of $\theta$, it has a position vector of

A unit vector in the direction of $\overrightarrow{OP}$ is

and a unit vector perpendicular to this in the direction of increasing $\theta$ is

as shown in this diagram:

The velocity of $P$ can be found by differentiating $\mathbf{r}$ with respect to time, giving:

Then the acceleration can be found by differentiating again (using the product rule on both of the components of $\dot{\mathbf{r}}$) to obtain:

These are the components of the velocity and acceleration shown above.
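For reference, here is one way the calculation can be written out. This is a sketch under assumptions of my own: $\theta$ is measured from the downward vertical, the $y$-axis points upwards, and $\hat{\mathbf{r}}$ and $\hat{\boldsymbol{\theta}}$ are my labels for the two unit vectors described above. A different sign convention would change some signs but not the conclusion.

```latex
\begin{align*}
\mathbf{r} &= L\begin{pmatrix}\sin\theta\\ -\cos\theta\end{pmatrix},
\qquad
\hat{\mathbf{r}} = \begin{pmatrix}\sin\theta\\ -\cos\theta\end{pmatrix},
\qquad
\hat{\boldsymbol{\theta}} = \begin{pmatrix}\cos\theta\\ \sin\theta\end{pmatrix},\\[4pt]
\dot{\mathbf{r}} &= L\dot\theta\begin{pmatrix}\cos\theta\\ \sin\theta\end{pmatrix}
 = L\dot\theta\,\hat{\boldsymbol{\theta}},\\[4pt]
\ddot{\mathbf{r}} &= L\ddot\theta\begin{pmatrix}\cos\theta\\ \sin\theta\end{pmatrix}
 + L\dot\theta^{2}\begin{pmatrix}-\sin\theta\\ \cos\theta\end{pmatrix}
 = L\ddot\theta\,\hat{\boldsymbol{\theta}} - L\dot\theta^{2}\,\hat{\mathbf{r}}.
\end{align*}
```

So the velocity is $L\dot\theta$ perpendicular to the rod, and the acceleration has a component $L\ddot\theta$ perpendicular to the rod together with a component $L\dot\theta^2$ directed along the rod towards $O$.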

Without as much rigour, one could observe that the distance of $P$ along the circumference of the circle is given by $L\theta$, so it is reasonable to suggest that its speed is $L\dot\theta$ (as $L$ is a constant). Then the acceleration in this direction is plausibly $L\ddot\theta$, while the radial acceleration - which we are not interested in for this application - is a result of the velocity changing direction.

During the day and on my journey home, I thought about this and some of the connections between it and other areas of the syllabus. So here are a few quick thoughts on ways we could think about them. I hope that this post offers some different perspectives on the topic.

This is a diagram probably familiar from most A-level textbooks (I don’t have one to hand, unfortunately). We have our familiar unit circle, and draw a right-angled triangle with angle $\theta$, opposite $\sin\theta$ and adjacent $\cos\theta$. We also see that the arc length subtended by the angle $\theta$ is $r\theta=\theta$ as the radius is 1. (We must be working in radians for this to be correct!) Already in this diagram, $\sin\theta$ and $\theta$ do not look very different, so $\sin\theta\approx\theta$. On the other hand, $\cos\theta$ looks pretty close to $1$, so we have $\cos\theta\approx1$. Visually, say using GeoGebra, we see that these approximations get better as $\theta$ gets smaller: the arc and the half-chord become closer and closer to each other. We can then work out $\tan\theta=\dfrac{\sin\theta}{\cos\theta}\approx\dfrac{\theta}{1}=\theta$.

Another way of seeing this approximation to $\tan\theta$ is to draw the triangle with adjacent equal to $1$:

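If we take $\sin\theta\approx\theta$, then we can work out a better approximation for $\cos\theta$ using the binomial theorem. We have, for small $\theta$ (positive or negative):

$$\begin{aligned}\cos\theta&=\sqrt{1-\sin^2\theta}\\&\approx\sqrt{1-\theta^2}\\&=(1-\theta^2)^{1/2}\\&\approx1-\tfrac{1}{2}\theta^2\end{aligned}$$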

where we have used the first two terms of the binomial expansion on the last line. So $\cos\theta\approx 1-\frac{1}{2}\theta^2$.

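Another way of obtaining the approximation for $\cos\theta$ is to relate cos and sin using a double-angle formula:

$$\cos2A=1-2\sin^2A$$

so

$$\begin{aligned}\cos\theta&=1-2\sin^2\tfrac{1}{2}\theta\\&\approx1-2\left(\tfrac{1}{2}\theta\right)^2\\&=1-\tfrac{1}{2}\theta^2\end{aligned}$$

where we have used $\sin\tfrac{1}{2}\theta\approx\tfrac{1}{2}\theta$ on the second line.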
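To get a feel for how good these approximations are, the sketch below tabulates their absolute errors for a few angles in radians. The function name `small_angle_errors` and the chosen angles are illustrative assumptions of mine.

```python
import math

def small_angle_errors(theta):
    """Absolute errors of the small-angle approximations at theta (radians)."""
    return {
        "sin": abs(math.sin(theta) - theta),
        "tan": abs(math.tan(theta) - theta),
        "cos ~ 1": abs(math.cos(theta) - 1),
        "cos ~ 1 - theta^2/2": abs(math.cos(theta) - (1 - theta**2 / 2)),
    }

for theta in (0.5, 0.1, 0.01):
    errors = small_angle_errors(theta)
    summary = ", ".join(f"{name}: {err:.2e}" for name, err in errors.items())
    print(f"theta = {theta}: {summary}")
```

The errors shrink rapidly as $\theta$ decreases, and the quadratic approximation $1-\frac{1}{2}\theta^2$ is markedly closer to $\cos\theta$ than the constant $1$ is.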

The approximations for $\sin\theta$ and $\tan\theta$ are also closely related to the shape of their graphs near the origin (though there is potentially some circular reasoning here - no pun intended!):

We have drawn the graphs of $y=x$ (red), $y=\sin x$ (green) and $y=\tan x$ (blue). Near the origin, the three graphs look very similar, so for small $x$, $\sin x\approx x \approx \tan x$.

This also tells us that at the origin, $\frac{d}{dx}(\sin x)$ and $\frac{d}{dx}(\tan x)$ equal $1$.

We can also argue in the opposite direction. If we have already convinced ourselves why the derivative of $\sin x$ is $\cos x$ using a different approach (for example, by using Rotating derivatives), then we can say that for small values of $x$, the graph of $y=\sin x$ is approximated by the tangent to the graph at $x=0$ (see A tangent is… for more on this point). We can calculate the tangent: since $\frac{d}{dx}(\sin x)=\cos x$ giving $\cos 0 = 1$, and $\sin 0 = 0$, the tangent has equation $y=x$. So for small $x$, $\sin x\approx x$.

or as the rule that students are frequently taught: “turn the second fraction upside-down and multiply”?

I’ve been inspired to revisit this question after listening to Ed Southall talking on Mr Barton’s Maths Podcast, where he mentioned this question.

In this post I suggest a teaching sequence which might lead to an understanding of the rule above, as well as a procedural knowledge of how to perform the rule.

I have seen textbooks and websites explain the rule for division of fractions by talking about how many times we can fit $\frac{1}{3}$ into $\frac{4}{5}$, say, but that seems to me to be quite challenging: students have to hold on to several ideas at once, and make sense of diagrammatic representations at the same time as trying to think about what division means. It also becomes very hard as the fractions become more complicated. In my experience, few students develop a solid understanding through this approach: they either get lost in the reasoning or they resort to following a rule.

This problem ties in quite neatly with some things I have recently read, in particular:

- James Tanton’s post The Unreasonableness of K-12 Mathematics, in which he gives an idealised description of the development of the concept of number.
- Liping Ma’s book “Knowing and teaching elementary mathematics”, in which US and Chinese teachers’ understanding of this rule is compared.
- John Mighton, the founder of JUMP Math, wrote The end of ignorance; he observes there that *meaningful* symbolic manipulation can precede both an attempt to explain an idea or technique in everyday terms, and the development of understanding; moreover, understanding can emerge *from* the manipulations if examples are well-chosen and students are given the opportunity to reflect.

The calculation $8-5$ means “what number $\square$ makes $\square+5=8$ true?” Similarly, when we write $12\div 3$, we mean “what number $\square$ makes $\square\times3=12$ true?” This says that division is the inverse of multiplication. (More precisely, for each non-zero number $c$, dividing by $c$ is the inverse of multiplying by $c$.) The same applies to division of fractions: $\frac{3}{5}\div\frac{2}{3}$ means “what number $\square$ makes $\square\times\frac{2}{3}=\frac{3}{5}$ true?”

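Once we notice that $\frac{3}{2}\times\frac{2}{3}=1$, we can then multiply both sides of this equation by $\frac{3}{5}$ to obtain

$$\left(\frac{3}{5}\times\frac{3}{2}\right)\times\frac{2}{3}=\frac{3}{5}.$$

Therefore $\square$ must be $\frac{3}{5}\times\frac{3}{2}$, or

$$\frac{3}{5}\div\frac{2}{3}=\frac{3}{5}\times\frac{3}{2}=\frac{9}{10}.$$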

This method will work for any fraction division question, and so these steps give us our familiar rule: “turn the divisor upside-down and multiply”.
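The reasoning can be checked with exact arithmetic using Python's standard `fractions` module. The helper name `divide_via_reciprocal` is my own, and the sketch assumes a non-zero divisor.

```python
from fractions import Fraction

def divide_via_reciprocal(dividend, divisor):
    """Divide by multiplying by the reciprocal of the (non-zero) divisor."""
    reciprocal = Fraction(divisor.denominator, divisor.numerator)
    return dividend * reciprocal

dividend, divisor = Fraction(3, 5), Fraction(2, 3)
box = divide_via_reciprocal(dividend, divisor)

print(box)                         # 9/10
print(box * divisor == dividend)   # True: the box satisfies box x 2/3 = 3/5
print(box == dividend / divisor)   # True: agrees with built-in division
```

The second check is the important one: it confirms that the number found really does fill the box in $\square\times\frac{2}{3}=\frac{3}{5}$, which is what the division means.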

What follows is a suggestion for how these ideas could be introduced over a sequence of lessons, which could span several months or even years. This offers students the chance to revisit the ideas again and again, thereby reinforcing them, as well as building up stronger connections and a deeper understanding. In the later steps, I assume that students can multiply fractions.

All of the questions below are available in this Word document.

We begin by asking students what other number statements they can deduce from $3+5=8$. There are many possible answers (such as $30+50=80$), and here we highlight those obtained by rearranging the numbers. (These could be encouraged by a question such as “Using only the numbers 3, 5 and 8, what other number statements can you get from $3+5=8$?”) Three key statements are:

$$8-5=3,\qquad 8-3=5,\qquad 5+3=8,$$

as well as the same statements written the other way round, such as $5=8-3$; we won’t mention these reversed statements again here.

The last of these three statements says that addition is *commutative*: the order of adding does not matter. The other two say that subtraction is the inverse of addition: the three problems

$$8-5=\square,\qquad \square+5=8,\qquad 5+\square=8$$

are equivalent, as are similar problems about $8-3=\square$. Making this connection explicit would be beneficial, especially in relation to the later parts of this sequence of steps.

Students could then be asked to write statements equivalent to statements such as $10-3=\square$ to reinforce this idea.

This idea may well have already been introduced via a bar model approach or using Cuisenaire rods or suchlike.

It is useful to recognise that it doesn’t matter whether we are working with whole numbers, directed numbers, fractions or whatever: subtraction always has this meaning, so returning to this idea periodically will benefit students’ understanding.

This is the parallel of Step 1 for multiplication and division. What can be deduced from $3\times4=12$? This again leads to interesting points such as why $30\times40=120$ is an incorrect statement, whereas $30+50=80$ is correct. But for our current purposes, the key deductions are again those obtained by rearrangement:

$$4\times3=12,\qquad 12\div3=4,\qquad 12\div4=3.$$

As before, we see that multiplication is commutative and that division is the inverse of multiplication. In particular, this means that answering the question $12\div4=\square$ is the same as filling in the missing number in $4\times\square=12$; asking students to make deductions from $12\div4=\square$, as above, will reinforce this idea.

A key part of this approach is to learn about reciprocals of fractions. We start with the reciprocals of unit fractions.

For this missing-number problem, I would suggest asking students to work on this themselves rather than showing them how to do the first one. (I am assuming that they already know enough about fractions to work out the answers to these questions.)

Students should spot the pattern. Following this by asking questions such as $\frac{1}{82}\times\square = 1$ can help them to realise that they can now do some very complicated-sounding questions, even if they can’t imagine what $\frac{1}{82}$ of a cake might look like. (I was reminded of this approach by John Mighton’s book.)

Students should then be asked to connect this back to the earlier steps by rearranging $\frac{1}{2}\times2=1$. This will allow them to (re)discover that $1\div 2=\frac{1}{2}$ (and similarly for the other statements); this can also be used to reinforce the idea that a fraction such as $\frac{1}{2}$ just means “1 divided by 2”. (The division symbol itself suggests this: $\div$ is just a fraction with dots in place of actual numbers.) Another way of rearranging the number statement gives $1\div\frac{1}{2}=2$, which could be related to the “practical” meaning of division: there are 2 halves in a whole.

It might be too big a jump for some students to go straight to finding the reciprocal of a general fraction, so this step provides a structured intermediate stage, once they are developing some confidence with the above idea.

Here is a second sequence of missing-number problems:

Once students have worked out answers to these (and perhaps to a few more similar examples), either ask them to generalise by making up their own similar examples, or ask superficially harder questions such as $\frac{74}{133} \times \square = 74$, so that the structure becomes clear.

Asking students to rearrange these statements once again results in statements like $2\div3 = \frac{2}{3}$ (further reinforcing the division idea) and $2\div \frac{2}{3} = 3$.

A useful preparatory question before this step would be something like: “If you know that $96\times 48=4608$, then what is the missing number in $96\times \square = 2304$?” This recalls the idea that we can divide the product by 2 by dividing the multiplicand (or multiplier) by 2. (The use of two-digit numbers is designed to discourage students from doing a division!)

In this step, we replace the integers on the right-hand sides of the previous set of questions with 1:

If students cannot work out how to answer the first question, it would be helpful to remind them of their answer to $\frac{2}{3}\times \square = 2$. Tying this to the preparatory question above should help them get to the answer.

Again, students can be invited to generalise at this point, or to answer a question like the one in the previous step: $\frac{74}{133} \times \square = 1$. Also, it is helpful to then rearrange these results; we have $1\div \frac{2}{3} = \frac{3}{2}$, and we are seeing the first clear case of turning fractions upside-down.

After these, it could be interesting to also revisit unit fractions: following the same pattern that we have seen, how else could the answer to $\frac{1}{3}\times \square = 1$ be written, besides as $3$?

Before working on the full-blown division of fractions, it would be useful to preface it by another relevant rearranging activity: how can the number statement $2\times 3\times 4=24$ be rearranged, while keeping all of the numbers involved the same? This gives rise to a number of statements, such as:

This may cause some difficulty and lead to some interesting class discussions.

And now we can build on the ideas developed in Step 5. How could we complete the following statements?

A prompting question, if needed, is “What is $\frac{2}{3}\times \frac{3}{2}$?”

And then what about these, where the two squares should be filled in following the pattern we have just seen?

Once students feel competent at these, ask how they can use these to work out:

And with this, students have reached a point where the rule for dividing by a fraction will make some sense: we multiply the reciprocal of the divisor (so as to get 1 when it is multiplied by the divisor itself) by the dividend, which is our well-known rule.

Stuart Price noted that the answer to the last part can be obtained as the answer to part (iii) divided by the answer to part (iv), by the definition of conditional probability.

But if we think about what’s going on a little further, we will be able to understand the structure of this problem more and see further connections.

The first thing to do to make our life a little simpler is to replace the specific numbers 5.1 and 3.6 with variables, so that the algebraic structure becomes clearer. So let’s call the means of the two independent Poisson distributions $\lambda$ and $\mu$. We will stick with the 5 and 7 for the time being, and generalise those later.

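Therefore our problem says that the number of lorry drivers is distributed as $\mathrm{Po}(\lambda)$, the number of car drivers is distributed as $\mathrm{Po}(\mu)$ and the total number of drivers is distributed as $\mathrm{Po}(\lambda+\mu)$. The relevant probabilities are then as follows:

$$\mathrm{P}(\text{5 lorry drivers and 7 drivers in total})=\mathrm{P}(\text{5 lorry drivers})\times\mathrm{P}(\text{2 car drivers})=\frac{e^{-\lambda}\lambda^5}{5!}\times\frac{e^{-\mu}\mu^2}{2!},$$

$$\mathrm{P}(\text{7 drivers in total})=\frac{e^{-(\lambda+\mu)}(\lambda+\mu)^7}{7!},$$

and so, dividing the first by the second,

$$\mathrm{P}(\text{5 lorry drivers}\mid\text{7 drivers in total})=\frac{7!}{5!\,2!}\cdot\frac{\lambda^5\mu^2}{(\lambda+\mu)^7}=\binom{7}{5}\left(\frac{\lambda}{\lambda+\mu}\right)^5\left(\frac{\mu}{\lambda+\mu}\right)^2.$$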

But this is just a binomial probability! It is the probability of 5 successes from 7, where the probability of success is $\frac{\lambda}{\lambda+\mu}$, which equals the mean number of lorry drivers divided by the mean number of drivers. It is clear that we could replace 5 and 7 by any numbers $r$ and $n$ in the above calculation, to deduce that given that there are $n$ drivers in total, the probability that $r$ of them are lorry drivers is

$$\binom{n}{r}\left(\frac{\lambda}{\lambda+\mu}\right)^r\left(\frac{\mu}{\lambda+\mu}\right)^{n-r}.$$

If we had assumed that the probability that a visiting driver picked at random is a lorry driver is $\frac{\lambda}{\lambda+\mu}$, then we would have got the same answer without having to calculate any Poisson probabilities at all.
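This agreement can be checked numerically. The sketch below computes the conditional probability directly from the Poisson probabilities and compares it with the binomial shortcut. It assumes that 5.1 is the mean for lorry drivers and 3.6 the mean for car drivers (the text above does not say which is which), and the function names are made up for this illustration.

```python
from math import comb, exp, factorial


def poisson_pmf(k, mean):
    """P(X = k) for X ~ Po(mean)."""
    return exp(-mean) * mean**k / factorial(k)


def binomial_pmf(r, n, p):
    """P(X = r) for X ~ B(n, p)."""
    return comb(n, r) * p**r * (1 - p) ** (n - r)


def conditional_lorry_prob(r, n, lam, mu):
    """P(r lorry drivers | n drivers in total), straight from the Poisson probabilities."""
    joint = poisson_pmf(r, lam) * poisson_pmf(n - r, mu)  # independence
    total = poisson_pmf(n, lam + mu)  # sum of independent Poissons is Poisson
    return joint / total


lam, mu = 5.1, 3.6  # assumed assignment of the two means
print(conditional_lorry_prob(5, 7, lam, mu))
print(binomial_pmf(5, 7, lam / (lam + mu)))
```

The two printed values agree up to floating-point rounding, and the same holds for any other choice of $r\le n$.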

This seems like a reasonable suggestion, but how can we justify it?

One technical way is to say that the binomial probabilities we have found above prove that this is the case. But this gives little insight into the reason for it.

A better way is to simply observe that the ratio of the rate of lorry drivers arriving to the rate of car drivers arriving is $\lambda:\mu$, so the probability that a particular driver is a lorry driver is indeed $\frac{\lambda}{\lambda+\mu}$. This might feel a little problematic, though, as it seems to ignore the probabilistic aspects involved.

A more careful way of doing this is to think about the behaviour and meaning of Poisson distributions. The means $\lambda$ and $\mu$ are for the unit time period of 1 hour. If we had a time period of $t$ hours, with the same uniform random driver arrivals over the whole period, then the mean number of lorry and car drivers would be $t\lambda$ and $t\mu$ respectively, with the distribution of the number of drivers still being Poisson. A standard thing to do at this point is to take $t$ to be very small. In this case, the probability of there being more than one driver arriving in the period is negligible, so the probabilities become:

$$\mathrm{P}(\text{one lorry driver arrives})\approx t\lambda,\qquad \mathrm{P}(\text{one car driver arrives})\approx t\mu.$$

Two ways of deriving these probabilities are: (a) calculate the Poisson probabilities, expanding $e^{-t\lambda}$ and ignoring all terms involving $t^2$; (b) assume that the number of lorry drivers arriving is zero or one, then calculate what the probability of one lorry driver arriving would have to be so that the expected number of lorry drivers is $t\lambda$. Note that we also ignore the negligible probability that both a lorry driver and car driver arrive.

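Therefore, in this very short time period of $t$ hours, we have

$$\mathrm{P}(\text{a lorry driver arrives}\mid\text{a driver arrives})\approx\frac{t\lambda}{t\lambda+t\mu}=\frac{\lambda}{\lambda+\mu}.$$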

This means that whenever a driver arrives, the probability that this driver is a lorry driver is indeed $\frac{\lambda}{\lambda+\mu}$, exactly as we wanted.

Incidentally, the small-time-slice thinking also shows why the Poisson distribution is a good approximation to the binomial distribution: imagine we are dealing with a time interval and our Poisson distribution has mean $\lambda$. Divide the whole time interval into $N$ equal slices, and assume that no slices can have more than one event. Then each slice has a probability of $\lambda/N$ of having an event, and the number of events is distributed as $\mathrm{B}(N,\lambda/N)$. The larger $N$ is, the better the assumption that no slice can have more than one event becomes, and so the more closely $\mathrm{B}(N,\lambda/N)$ approximates $\mathrm{Po}(\lambda)$.
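The convergence described here can be seen numerically. The following sketch measures the largest gap between the $\mathrm{B}(N,\lambda/N)$ and $\mathrm{Po}(\lambda)$ probabilities for a range of slice counts $N$; the value $\lambda=2$, the cut-off at 10 events and the function names are arbitrary choices made for this illustration.

```python
from math import comb, exp, factorial


def poisson_pmf(k, mean):
    """P(X = k) for X ~ Po(mean)."""
    return exp(-mean) * mean**k / factorial(k)


def binomial_pmf(k, n, p):
    """P(X = k) for X ~ B(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def max_gap(lam, n_slices, k_max=10):
    """Largest difference between B(N, lam/N) and Po(lam) probabilities for k = 0..k_max."""
    p = lam / n_slices  # chance that any one slice contains an event
    return max(
        abs(binomial_pmf(k, n_slices, p) - poisson_pmf(k, lam))
        for k in range(k_max + 1)
    )


for n_slices in (10, 100, 1000, 10000):
    print(n_slices, max_gap(2.0, n_slices))
```

The printed gap shrinks as the number of slices grows, in line with the informal argument above.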

This also reminds me of a lovely and surprising probability question on this topic that I saw on an undergraduate problem set (question 6(ii) on Examples sheet 2 here):

The number of misprints on a page has a Poisson distribution with parameter $\lambda$, and the numbers on different pages are independent. A proofreader studies a single page looking for misprints. She catches each misprint (independently of others) with probability 1/2. Let $X$ be the number of misprints she catches. Find $\mathrm{P}(X=k)$. Given that she has found $X=10$ misprints, what is the distribution of $Y$, the number of misprints she has not caught? How useful is $X$ in predicting $Y$?

Our aim in this note is to prove the equivalence of “ordinary” induction and strong induction. For concreteness, let us assume that we are trying to prove that the statement $P(n)$ (which is a statement about the integer $n$) is true for all $n\ge n_0$, where $n_0$ is some integer. (Typically we will have $n_0=0$ or $n_0=1$, but not necessarily.)

For example, we might be trying to prove that the sum of the first $n$ positive integers is $\frac12 n(n+1)$, in which case we could take $P(n)$ to be the statement $1+2+\cdots+n=\frac12 n(n+1)$ and $n_0=1$. Or we might be trying to prove some statement about all finite graphs, in which case $P(n)$ might be “blah is true for all graphs with $n$ vertices” and $n_0=1$ again.

**The principle of mathematical induction**

$P(n)$ is true for all $n\ge n_0$ if the following two conditions hold:

(a) $P(n_0)$ is true (the *base case*), and

(b) if $k\ge n_0$ and $P(k)$ is true, then $P(k+1)$ is true (the
*induction step*).

The principle of strong (mathematical) induction can be useful when the proof of $P(k)$ depends on more than one smaller case.

**The principle of strong induction**

$P(n)$ is true for all $n\ge n_0$ if the following conditions hold:

(a) $P(n_0)$ is true (the *base case*), and

(b) if $k>n_0$ and $P(j)$ is true for all $n_0\le j<k$, then $P(k)$ is
true (the *induction step*).

For example, if we are trying to prove a result about Fibonacci numbers, we might use the definition $F_n=F_{n-1}+F_{n-2}$ and have to make use of properties of two smaller numbers. Or we might be arguing about graphs with $n$ vertices, and split a graph up into two smaller graphs with $m$ and $n-m$ vertices; in this case, we may need to assume that whatever result we are trying to show holds not just for graphs with $n-1$ vertices but also for graphs with $m$ and $n-m$ vertices for any $1\le m<n$. In cases such as these, this “stronger” version of induction is very useful.
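As an illustration of why the stronger hypothesis helps, here is a sketch of one such argument. The specific claim ($F_n<2^n$) and the convention $F_0=0$, $F_1=1$ are chosen here purely as an example. Take $n_0=0$, so that the base case is $F_0=0<2^0$. For the induction step, let $k>0$ and suppose that $F_j<2^j$ for all $0\le j<k$. If $k=1$, then $F_1=1<2^1$ directly; otherwise $k\ge2$ and

$$F_k=F_{k-1}+F_{k-2}<2^{k-1}+2^{k-2}=3\cdot2^{k-2}<2^k.$$

The step uses the hypothesis for both $k-1$ and $k-2$, which an assumption about $P(k-1)$ alone would not supply.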

It turns out that we can actually combine these two conditions into the single condition:

- if $k\ge n_0$ and $P(j)$ is true for all $n_0\le j<k$, then $P(k)$ is true.

The induction step where $k>n_0$ is exactly as before, and the base case is where $k=n_0$. In this case, this condition becomes “if $P(j)$ is true for all $n_0\le j<n_0$, then $P(n_0)$ is true”. But there is no $j$ with $n_0\le j<n_0$, so it is vacuously true that $P(j)$ is true for all such $j$, and hence $P(n_0)$ is true. It is easy to overlook this special vacuous case, though, or to argue about it incorrectly within a general argument, so it is often wise, in practice, to handle the base case separately as above.

We are now in a position to prove the equivalence of these two formulations of induction. We first need to be clear what we mean by these being equivalent. What we mean is as follows: if we assume that the principle of mathematical induction is true, then the principle of strong induction follows from this, and vice versa.

**Theorem**

The principle of mathematical induction and the principle of strong induction are equivalent to each other.

**Proof**

Let us assume first that the principle of strong induction is true, and aim to prove that the principle of mathematical induction follows from this.

So let $P(n)$ be a statement, $n_0$ an integer, and assume that $P(n)$ satisfies the conditions for mathematical induction, namely:

(i) $P(n_0)$ is true, and

(ii) if $k\ge n_0$ and $P(k)$ is true, then $P(k+1)$ is true.

We wish to show that $P(n)$ is true for all $n\ge n_0$, and we do this by showing that it also satisfies the conditions for strong induction. Now, the base case (i) is the same as the base case (a) for strong induction on $P(n)$. Furthermore, $P(n)$ satisfies condition (b) for strong induction, for if $k>n_0$ and $P(j)$ is true for all $n_0\le j<k$, then in particular $P(k-1)$ is true, so by (ii), it follows that $P(k)$ is true. (And note that $k-1\ge n_0$.) Thus the induction step for strong induction also holds, and so by strong induction, $P(n)$ is true for all $n\ge n_0$, as we required.

We now prove the converse: we assume that the principle of mathematical induction is true, and aim to prove that the principle of strong induction follows from this.

So let $P(n)$ be a statement, $n_0$ an integer, and assume that $P(n)$ satisfies the conditions for strong induction, namely:

(i) $P(n_0)$ is true, and

(ii) if $k>n_0$ and $P(j)$ is true for all $n_0\le j<k$, then $P(k)$ is true.

We wish to show that $P(n)$ is true for all $n\ge n_0$. We define a new statement $Q(n)$ for $n\ge n_0$, which states “$P(k)$ is true for all $n_0\le k\le n$”. Then (i) is equivalent to stating that $Q(n_0)$ is true, and we can rewrite (ii) as: if $k>n_0$ and $Q(k-1)$ is true, then $P(k)$ is true. But if $Q(k-1)$ is true and $P(k)$ is true, then $Q(k)$ is true (as now $P(j)$ is true for all $n_0\le j\le k$). So (ii) becomes: if $k>n_0$ and $Q(k-1)$ is true, then $Q(k)$ is true. If we now replace $k-1$ by $k$, we get: if $k\ge n_0$ and $Q(k)$ is true, then $Q(k+1)$ is true.

These are now the base case and induction step for the principle of mathematical induction, and so it follows that $Q(n)$ is true for all $n\ge n_0$. But if $Q(n)$ is true, then $P(n)$ is true (by the definition of $Q(n)$), and so $P(n)$ is true for all $n\ge n_0$, as we required.

This argument shows that the principle of mathematical induction and the principle of strong induction are equivalent and can be used interchangeably.

It is also worth noting that these principles are axioms of arithmetic: it is impossible to “prove” the principle of mathematical induction or the principle of strong induction, though we have proven them to be equivalent to each other. More about them can be found in articles on Peano arithmetic or books on mathematical logic.

\begin{equation} \frac{dy}{dx}=1\biggm/\frac{dx}{dy}. \label{eq:recip} \end{equation}

Some questions raised by this include:

(a) What does this equation mean?

(b) How can we explain to students what it means and why it is true?

(c) Where would this result be useful to them (besides in artificial exam questions)?

In this post, I will offer some thoughts on (a) and (b), but I’m still fairly stuck on (c).

A typical textbook explanation of the formula begins as follows: “Suppose that $x$ is given as a function of $y$” and then goes on to give a reasonable-looking explanation involving $\delta x$ and $\delta y$. Some books draw a sketch to illustrate this, while others just use algebra.

In a particular commonly-used textbook, a few examples then show how this can be used when we have $x=f(y)$ for some function $f$. One of them is $x=y^2$. Here we have $\frac{dx}{dy}=2y$, so $\frac{dy}{dx}=\frac{1}{2y}$. The textbook notes that although this could be written as $\frac{dy}{dx}=\frac{1}{2\sqrt{x}}$, it is more common to leave it as a function of $y$, matching the form of the original relation.

But if we sketch the graph of $x=y^2$, it becomes clear that this note is simply incorrect.

Here, if we regard the derivative as $\frac{1}{2\sqrt{x}}$, then at both $A(4,2)$ and $B(4,-2)$, we would obtain the derivative $\frac{dy}{dx}=\frac{1}{4}$, which is clearly wrong. However, the original version $\frac{1}{2y}$ would give derivatives of $\frac{1}{4}$ at $A$ but $-\frac{1}{4}$ at $B$. (And we can’t fix things by saying, “Well, the derivative is $\pm\frac{1}{2\sqrt{x}}$”, because how do we decide which sign to take at any particular value of $x$?)

So there is something inherently different about the two offered forms of the derivative: one is given as a function of $y$ and “works”, while the other is given as a function of $x$ and fails, and it is clearly because we are given $x$ as a function of $y$, so $\frac{dx}{dy}$ is a meaningful function of $y$.

Another point to note is that when we write $\frac{dy}{dx}$, we are thinking of $y$ as a function of $x$, and then asking how the function $y$ changes as $x$ changes. Likewise, when we write $\frac{dx}{dy}$, we are thinking of $x$ as a function of $y$ (as it is in our case), and then asking how $x$ changes as $y$ changes. So the original equation \eqref{eq:recip} is actually relating the behaviour of $x$ as a function of $y$ to the behaviour of $y$ as a function of $x$. It is not even obvious that this makes sense, as we have seen that $y$ may not be a function of $x$!

There is a theorem from analysis called “The Inverse Function Theorem” which sheds light on this. I’ll briefly describe that later, but in our context, it (roughly) tells us the following:

Consider the function $x=f(y)$, and assume that at $(x_0, y_0)$ (where $x_0=f(y_0)$), the derivative $f'(y_0)$ is non-zero. Then we can restrict the domain of $f$ to an interval containing $y_0$ so that it becomes invertible with inverse $y=g(x)$, say. Then $g(x)$ is differentiable and we have

$$g'(x)=\frac{1}{f'(y)},$$

where $x=f(y)$ and $y$ lies in this restricted domain. In other notation, this equation reads

$$\frac{dy}{dx}=1\biggm/\frac{dx}{dy}.$$

So in our case of $x=y^2$, when looking at the point $A(4,2)$, we could restrict the domain of the function to $1<y<3$ as shown here:

(We could alternatively have restricted to $y>0$, but it makes no difference to the derivative at $A$.) Then the function is one-to-one on the domain $1<y<3$, so it has an inverse $y=+\sqrt{x}$ there, and we have $\frac{dy}{dx}=1\bigm/\frac{dx}{dy}$ as required. And if we wish, we could write the derivative in terms of $x$ as $\frac{dy}{dx}=\frac{1}{2\sqrt{x}}$. If, on the other hand, we looked at the point $B(4,-2)$, then we could restrict the domain to $-3<y<-1$ and find that the inverse function is $y=-\sqrt{x}$. In this case, then, $\frac{dy}{dx}=-\frac{1}{2\sqrt{x}}$. Finally, at the origin, we have $\frac{dx}{dy}=0$: the function does not have a local inverse there, and we do not have a value for $\frac{dy}{dx}$. (There is some sense in which it is infinite at the origin.)
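The two branches can be checked numerically. The sketch below estimates $\frac{dx}{dy}$ from $x=y^2$ and $\frac{dy}{dx}$ from the appropriate local inverse at $A$ and $B$, using central differences; the step size and the function names are arbitrary choices for this illustration.

```python
def f(y):
    """The relation from the text: x = f(y) = y^2."""
    return y * y


def central_difference(func, at, h=1e-6):
    """Numerical estimate of the derivative of func at the given point."""
    return (func(at + h) - func(at - h)) / (2 * h)


def slopes_at(x0, y0):
    """Return (dy/dx from the local inverse, 1 / (dx/dy)) at the point (x0, y0)."""
    sign = 1.0 if y0 > 0 else -1.0  # which branch of the square root is the local inverse

    def g(x):
        return sign * x**0.5

    dx_dy = central_difference(f, y0)
    dy_dx = central_difference(g, x0)
    return dy_dx, 1.0 / dx_dy


for label, point in (("A", (4.0, 2.0)), ("B", (4.0, -2.0))):
    dy_dx, reciprocal = slopes_at(*point)
    print(label, dy_dx, reciprocal)
```

At $A$ both estimates come out close to $\frac{1}{4}$, and at $B$ both are close to $-\frac{1}{4}$; the sign is determined by which local inverse is used, not by the value of $x$.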

How can we explain this subtlety to students?

One way may just be to offer them examples such as the above, and ask how we can write the derivative $\frac{dy}{dx}$.

A visual argument for the relationship between $\frac{dy}{dx}$ and $\frac{dx}{dy}$ is the approach the textbook offered, once we understand that we are talking about functions and their inverses.

An alternative argument, which is more algebraic, is to use the chain rule: if $y=g(x)$ is the (local) inverse of $x=f(y)$, then we have $g(f(y))=y$. If we differentiate both sides with respect to $y$, we obtain

$$g'(f(y))\cdot f'(y)=1.$$

If we write $x=f(y)$, then this becomes our familiar $g'(x)\cdot f'(y)=1$, or $g'(x)=1/f'(y)$.

(It may also be worth noting that $x=f(y)$ may have an inverse even if $f'(y_0)=0$, for example $x=y^3$ has the inverse $y=\sqrt[3]{x}$, but this is not differentiable at the origin.)

This still doesn’t give a reason for why students might want to use this result! And of course, any time that we want to find $\frac{dy}{dx}$ and we are given $x$ as a function of $y$, we can differentiate both sides with respect to $x$, using implicit differentiation. And that renders this result somewhat pointless for school calculus. So any thoughts on why students might find a need for this would be welcomed!

I mentioned the Inverse Function Theorem earlier. Here’s a statement of the theorem from Tom Apostol’s “Mathematical Analysis” (2nd edition).

**Theorem 13.6 (The Inverse Function Theorem)**

Assume $\mathbf{f}=(f_1,\dots,f_n)\in C'$ (i.e., continuously differentiable) on an open set $S$ in $\mathbb{R}^n$, and let $T=\mathbf{f}(S)$. If the Jacobian determinant $J_{\mathbf{f}}(\mathbf{a})\ne 0$ for some point $\mathbf{a}$ in $S$, then there are two open sets $X\subseteq S$ and $Y\subseteq T$ and a uniquely determined function $\mathbf{g}$ such that

(a) $\mathbf{a}\in X$ and $\mathbf{f}(\mathbf{a})\in Y$,

(b) $Y=\mathbf{f}(X)$,

(c) $\mathbf{f}$ is one-to-one on $X$,

(d) $\mathbf{g}$ is defined on $Y$, $\mathbf{g}(Y)=X$, and $\mathbf{g}[\mathbf{f}(\mathbf{x})]=\mathbf{x}$ for every $\mathbf{x}$ in $X$,

(e) $\mathbf{g}\in C'$ on $Y$.

(I won’t attempt to explain the technical terms here, as this post is too long already; the internet has much on these for the interested reader.)

We can apply this theorem to our context. We are dealing initially with a function $x=f(y)$, so we take $n=1$ and let $\mathbf{f}=(f_1)=(f)$. Our functions at high school level are almost all well-behaved (that is, smooth), except perhaps at an occasional point, so we will just ignore the $C'$ issue and take $S$ to be the domain of the function $f$ and $T$ to be its range.

The Jacobian determinant for our one-dimensional function $f$ is just $f'(y)$, so then this theorem simplifies to the (less precisely stated) result we gave above, noting though that the $\mathbf{x}$ of the theorem is our $y$, and $\mathbf{a}$ is our $y_0$. The relationship between the derivatives follows from (d) using the chain rule, as we described above.

The paper is beautifully written, and amazingly needs only relatively elementary undergraduate algebra. (It is generalised to the Galois field $\mathbb{F}_q$, but if we take $q$ to be prime, then even that is unnecessary to understand the argument.)

I was somewhat stuck on two small points at the start of the proof of Proposition 4, and thought I would share my realisation of the argument here for others’ benefit.

The first is the assertion in the first paragraph that “The space $V$ of polynomials in $S_n^d$ vanishing on the complement of $-\gamma A$ has dimension at least $m_d-q^n+|A|$”. For simplicity, write $B$ for the complement of $-\gamma A$, so $|B|=q^n-|A|$ (assuming that $\gamma\ne0$). Considering now the evaluation function $e:S_n\to \mathbb{F}_q^{\mathbb{F}_q^n}$ described before Proposition 2, we can look at the restriction $e_d$ of $e$ to $S_n^d$, and then take the restriction of the image of $e_d$ to $B$. In other words, if $p\in S_n^d$, then $e_d(p)$ is a function $\mathbb{F}_q^n\to\mathbb{F}_q$; we then take the restriction of this: $e_d(p)|_B$. This composition $e_d|_B$ therefore gives us a linear map $S_n^d\to\mathbb{F}_q^B$, from a vector space of dimension $m_d$ to one of dimension $|B|$. The required space $V$ vanishing on $B$ is the kernel of this linear map, which therefore has dimension at least $m_d-|B|$, as required.
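Spelled out, the last step is the rank-nullity theorem applied to the linear map $e_d|_B$, whose rank cannot exceed the dimension of its codomain $\mathbb{F}_q^B$:

$$\dim V=\dim S_n^d-\operatorname{rank}\bigl(e_d|_B\bigr)\ge m_d-\dim\mathbb{F}_q^B=m_d-|B|=m_d-q^n+|A|.$$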

The second point is the assertion in the next paragraph that if $|\Sigma|<\dim V$, then there is a non-zero $Q\in V$ vanishing on $\Sigma$. The argument for this is fairly similar. Let $p_1$, $p_2$, …, $p_k$ be a basis for $V$, where $k>|\Sigma|$. Then under the linear isomorphism $e$, the functions on $\mathbb{F}_q^n$ given by $e(p_1)$, …, $e(p_k)$ are linearly independent. But now restricting them to functions on $\Sigma$, which form a space of dimension $|\Sigma|$, necessarily gives a linear dependence between the restricted functions (as $k>|\Sigma|$). So this gives a non-trivial linear combination of these functions which is zero on $\Sigma$ but is not the zero function on the whole of $\mathbb{F}_q^n$, as the unrestricted functions are linearly independent. The same linear combination of $p_1$, …, $p_k$ is then the required non-zero $Q\in V$ vanishing on $\Sigma$.
