The Other Markov’s Inequality

If a polynomial function is trapped in a box, how much can it wiggle? This question is answered by Markov’s inequality, which states that for a degree-n polynomial p that maps [-1,1] into [-1,1], it holds that

(1)   \[\max_{x \in [-1,1]} |p'(x)| \le n^2. \]

That is, if a polynomial p is trapped within a square box [-1,1] \times [-1,1], the fastest it can wiggle—as measured by its first derivative—is the square of its degree.

How tight is this inequality? Do polynomials we know and love come close to saturating it, or is this bound very loose for them? A first polynomial which is natural to investigate is the power p(x) = x^n. This function maps [-1,1] into [-1,1], and the maximum value of its derivative is

    \[\max_{x \in [-1,1]} |p'(x)| = \max_{x \in [-1,1]} |nx^{n-1}| = n\ll n^2.\]

Markov’s inequality is quite loose for the power function.
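
For a quick numerical confirmation, here is a small Python sketch (assuming NumPy is available) that evaluates the derivative of x^n on a fine grid of [-1,1]; the maximum is attained at the endpoints and equals n, far below n^2.

    import numpy as np

    x = np.linspace(-1, 1, 100001)
    for n in [3, 7, 15]:
        deriv_max = np.max(np.abs(n * x ** (n - 1)))  # derivative of x^n on the grid
        print(f"n = {n:2d}: max |(x^n)'| = {deriv_max:5.1f}  vs  n^2 = {n**2}")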

To saturate the inequality, we need something wigglier. The heavyweight champions for polynomial wiggliness are the Chebyshev polynomials T_n, which are motivated and described at length in this previous post. For our purposes, what’s important is that Chebyshev polynomials really know how to wiggle. Just look at how much more rapidly the degree-7 Chebyshev polynomial T_7 (blue solid line) moves around than the power x^7 (orange dashed line).

In addition to seeming a lot more wiggly to the eye, the degree-n Chebyshev polynomials have much larger derivatives, saturating Markov’s inequality:

    \[\max_{x \in [-1,1]} |T_n'(x)| = n^2.\]
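
Here is the analogous Python check for the Chebyshev polynomials (again a sketch, assuming NumPy and its numpy.polynomial.chebyshev module):

    import numpy as np
    from numpy.polynomial import chebyshev as C

    x = np.linspace(-1, 1, 100001)
    for n in [3, 7, 15]:
        Tn_deriv = C.Chebyshev.basis(n).deriv()  # derivative of the degree-n Chebyshev polynomial
        print(f"n = {n:2d}: max |T_n'| = {np.max(np.abs(Tn_deriv(x))):6.1f}  vs  n^2 = {n**2}")

The maximum is attained at the endpoints x = \pm 1, where |T_n'(\pm 1)| = n^2, so Markov’s inequality (1) is an equality for T_n.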

These two examples illustrate a good rule of thumb. The derivative of a polynomial which is “simple” or “power-like” will be of size about n, whereas the derivative of a “clever” or “Chebyshev-like” polynomial will be much higher at about n^2.

The inequality (1) was proven by Andrey Markov. It is much less well-known than Andrey Markov’s famous inequality in probability theory. A generalization of (1) is due to Andrey’s brother Vladimir Markov, who proved a bound on the kth derivative of any polynomial p mapping [-1,1] into [-1,1]:

(2)   \[\max_{x \in [-1,1]} |p^{(k)}(x)| \le \max_{x \in [-1,1]} |T_n^{(k)}(x)| = \frac{n^2(n^2-1^2)\cdots(n^2-(k-1)^2)}{1\times 3 \times \cdots \times (2k-1)}. \]

The inequality (1) is a special case of (2) with k=1. The inequality (2) is often called the Markov brothers’ inequality, and this name is sometimes also attached to the special case (1)—to help distinguish it from the probabilistic Markov inequality.
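
As a numerical sanity check of (2), the following Python sketch (assuming NumPy) compares the maximum of |T_n^{(k)}| on [-1,1] against the product formula, here for n = 9:

    import numpy as np
    from numpy.polynomial import chebyshev as C

    def markov_brothers_bound(n, k):
        # n^2 (n^2 - 1^2) ... (n^2 - (k-1)^2) / (1 * 3 * ... * (2k-1))
        numerator = np.prod([n**2 - j**2 for j in range(k)])
        denominator = np.prod([2 * j + 1 for j in range(k)])
        return numerator / denominator

    x = np.linspace(-1, 1, 200001)
    n = 9
    for k in range(1, 5):
        deriv_max = np.max(np.abs(C.Chebyshev.basis(n).deriv(k)(x)))  # max of |T_9^(k)| on the grid
        print(f"k = {k}: max |T_9^({k})| = {deriv_max:9.1f},  bound = {markov_brothers_bound(n, k):9.1f}")

For each k, the maximum of |T_9^{(k)}| is attained at the endpoints x = \pm 1 and matches the right-hand side of (2) exactly.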

For the rest of this post, we will focus on the basic Markov inequality (1). This inequality is easily extended to polynomials trapped in a box [a,b] \times [c,d] of general sidelengths. Indeed, any polynomial p mapping [a,b]\to[c,d] can be transmuted to a polynomial \tilde{p} mapping [-1,1] to [-1,1] by an affine change of variables:

    \[\tilde{p}(x) = -1 + 2\cdot \frac{p(\ell(x)) - c}{d-c} \quad \text{for } \ell(x) = a + (b-a) \cdot \frac{x+1}{2}.\]

The precise form of this change of variables is not important. What’s important is that \tilde p maps [-1,1] to [-1,1] and, by the chain rule, the maximum value of its derivative is

    \[\max_{t \in [a,b]} |p'(t)| = \frac{d-c}{b-a} \cdot \max_{x \in [-1,1]} |\tilde p'(x)|.\]
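
To spell out the chain rule step: differentiating the definition of \tilde p gives

    \[\tilde{p}'(x) = \frac{2}{d-c} \cdot p'(\ell(x)) \cdot \ell'(x) = \frac{b-a}{d-c} \cdot p'(\ell(x)),\]

and since \ell maps [-1,1] onto [a,b], taking maximums over x \in [-1,1] yields the identity above.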

Therefore, we obtain a general version of Markov’s inequality (1).

Markov’s inequality (general domain/codomain). Let p be a degree-n polynomial that maps [a,b] to [c,d]. Then

(3)   \[\max_{t \in [a,b]} |p'(t)| \le \frac{d-c}{b-a} \cdot n^2. \]
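
Here is a quick numerical check of (3) in Python (a sketch assuming NumPy; the polynomial q below is just T_n rescaled to fill an arbitrary box, a choice that makes (3) an equality):

    import numpy as np
    from numpy.polynomial import chebyshev as C

    a, b, c, d = 0.0, 5.0, -2.0, 3.0      # an arbitrary box [a, b] x [c, d]
    n = 6

    # T_n precomposed with the affine map taking [a, b] to [-1, 1] ...
    Tn = C.Chebyshev.basis(n, domain=[a, b])
    # ... then rescaled so that its range fills [c, d]: q maps [a, b] onto [c, d].
    q = c + (d - c) * 0.5 * (Tn + 1)

    t = np.linspace(a, b, 100001)
    lhs = np.max(np.abs(q.deriv()(t)))
    rhs = (d - c) / (b - a) * n**2
    print(f"max |q'| on [a, b] = {lhs:.3f},  (d - c)/(b - a) * n^2 = {rhs:.3f}")

The two printed values should coincide, since a rescaled Chebyshev polynomial saturates the general inequality just as T_n saturates (1).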

What’s Markov’s inequality good for? A lot, actually. In this post, we’ll see one application of this inequality: proving polynomial inapproximability. There are lots of times in computational math where it’s valuable to approximate some function like \mathrm{e}^{-t}, \sqrt{t}, or t^{-1} by a polynomial. There are lots of techniques for producing polynomial approximations and for understanding the rate of convergence of the best-possible polynomial approximation. But sometimes we just want a quick estimate of the form, say, “a polynomial needs to have degree at least n to approximate that function up to additive error 0.1”. That is, we seek lower bounds on the degree of polynomial needed to approximate a given function to a specified level of accuracy.

The general Markov’s inequality (3) provides a direct way of doing this. Our treatment follows Chapter 5 of Faster Algorithms via Approximation Theory by Sushant Sachdeva and Nisheeth K. Vishnoi. The argument consists of two steps. First, we trap the polynomial in a box. Then, we show it wiggles a lot (i.e., there is a point at which its derivative must be large). Therefore, by Markov’s inequality, we conclude that the degree of the polynomial must be large.

Let’s start with the function \mathrm{e}^{-t} and ask the question:

What polynomial degree n do we need to approximate this function to error 0.1 on the interval [0,b]?

We are interested in the case where the interval is large, so we assume b\ge 2. To address this question, suppose that p is a polynomial that satisfies

    \[|p(t) - \mathrm{e}^{-t}| \le 0.1 \quad \text{for all }t \in [0,b].\]

We will use this information to trap p in a box and show it wiggles.

  • Trap it in a box. Since 0 \le \mathrm{e}^{-t} \le 1 for t \ge 0, it must hold that -0.1 \le p(t) \le 1.1 for all t \in [0,b]. Therefore, the polynomial p is trapped in the box [0,b] \times [-0.1,1.1].
  • Show it wiggles. The function \mathrm{e}^{-t} decreases quite rapidly. At zero, it is \mathrm{e}^{-0} = 1 and at two, it is \mathrm{e}^{-2} = 0.135\ldots < 0.14. Therefore, since p(t) is within 0.1 of \mathrm{e}^{-t} for all t \in [0,b], it must hold that p(0) is at least 0.9 and p(2) is at most 0.24. Hence, by the mean value theorem, there is some t^* between 0 and 2 for which

        \[|p'(t^*)| = \frac{|p(2) - p(0)|}{2-0} \ge \frac{0.9 - 0.24}{2}=0.33.\]

We are ready for the conclusion. Combine the two bullet points above with Markov’s inequality (3) to obtain

    \[0.33 \le |p'(t^*)| \le \max_{t \in [0,b]} |p'(t)| \le \frac{1.1-(-0.1)}{b-0} \cdot n^2.\]

Rearrange to obtain

    \[n \ge \sqrt{\frac{0.33}{1.2}} \cdot \sqrt{b} > 0.5 \sqrt{b}.\]

We conclude that we need a polynomial of degree at least 0.5\sqrt{b} to approximate \mathrm{e}^{-t} to accuracy 0.1 on [0,b]. This rough estimate actually turns out to be pretty sharp. By using an appropriate polynomial approximation technique (say, interpolation at the Chebyshev points), a polynomial of degree \mathcal{O}(\sqrt{b}) also suffices to approximate \mathrm{e}^{-t} to this level of accuracy on [0,b].
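
Both halves of this story are easy to probe numerically. The following Python sketch (assuming NumPy; the helper names markov_degree_lower_bound and cheb_interp_degree are mine, not from the monograph) computes, for several values of b, the degree lower bound produced by the trap-and-wiggle argument over a grid of points, together with the smallest degree at which interpolation of \mathrm{e}^{-t} at Chebyshev points achieves uniform error 0.1 on [0,b]:

    import numpy as np
    from numpy.polynomial import chebyshev as C

    def markov_degree_lower_bound(f, a, b, eps, samples=500):
        """Degree lower bound for approximating f on [a, b] to uniform error eps."""
        t = np.linspace(a, b, samples)
        y = f(t)
        # Trap it in a box: any eps-approximation p lies in [min f - eps, max f + eps].
        height = (y.max() + eps) - (y.min() - eps)
        # Show it wiggles: between grid points s < u, p must change by at least
        # |f(s) - f(u)| - 2*eps, so the mean value theorem gives a point where
        # |p'| >= (|f(s) - f(u)| - 2*eps) / (u - s).
        df = np.abs(y[:, None] - y[None, :]) - 2 * eps
        dt = np.abs(t[:, None] - t[None, :])
        np.fill_diagonal(dt, np.inf)          # avoid dividing by zero when s = u
        slope = max(np.max(df / dt), 0.0)
        # Markov's inequality (3): slope <= height / (b - a) * n^2.
        return np.sqrt(slope * (b - a) / height)

    def cheb_interp_degree(f, a, b, eps, max_degree=200):
        """Smallest degree at which interpolation of f at Chebyshev points
        (mapped to [a, b]) achieves uniform error eps on a fine test grid."""
        xx = np.linspace(-1, 1, 4001)
        tt = a + (b - a) * (xx + 1) / 2
        for n in range(1, max_degree + 1):
            k = np.arange(n + 1)
            x = np.cos((2 * k + 1) * np.pi / (2 * (n + 1)))   # Chebyshev nodes in [-1, 1]
            coef = C.chebfit(x, f(a + (b - a) * (x + 1) / 2), n)
            if np.max(np.abs(C.chebval(xx, coef) - f(tt))) <= eps:
                return n
        return None

    def f(t):
        return np.exp(-t)

    for b in [16, 64, 256, 1024]:
        lower = markov_degree_lower_bound(f, 0.0, b, 0.1)
        upper = cheb_interp_degree(f, 0.0, b, 0.1)
        print(f"b = {b:5d}:  Markov lower bound {lower:5.2f},  "
              f"interpolation succeeds at degree {upper},  sqrt(b) = {np.sqrt(b):.1f}")

Both numbers should grow proportionally to \sqrt{b}: the first is a valid lower bound on the degree of any 0.1-accurate approximation (the grid search sometimes finds a slightly steeper secant than the one between 0 and 2, improving the constant a little), and the second certifies that a degree of this order indeed suffices.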

We’ve illustrated with just a single example, but this same technique can also be used to give (sharp) inapproximability results for other functions we know and love like 1/x and \sqrt{x}. For a bit of fun, see if you can get results for approximating a power x^s on [-1,1] by a polynomial of degree n \ll s. You may be surprised by what you find!

I find the Markov inequality technique for proving polynomial inapproximability to be pretty cool. As we saw in a previous post, we usually understand the difficulty of approximating a function in terms of the rate of convergence of the best (polynomial) approximation, which is tied to fine properties of the function and its smoothness. The Markov inequality approach answers a different question and uses entirely different information about the function. Rather than asking about the asymptotic rate of convergence, the Markov inequality approach tells you at what degree a polynomial starts to approximate the function at all. And rather than using information about smoothness, the Markov approach shows that polynomial approximation is hard for any function that changes a lot over a small interval. As an exercise, you can show that the same argument used for \mathrm{e}^{-t} also shows a polynomial of degree \Omega(\sqrt{b}) is necessary for approximating the ramp function

    \[f_{\rm ramp}(t) = \begin{cases} 1-t/2, & 0 \le t \le 2, \\ 0, & 2 < t \le b.\end{cases}\]

From the perspective of rate of convergence, these two functions could not be more different from one another, as polynomial approximations to \mathrm{e}^{-t} converge at an exponential rate, whereas polynomial approximations to f_{\rm ramp} converge at the rate \mathcal{O}(1/n). But in terms of the polynomial degree needed to start approximating these functions, both require the same degree \Theta(\sqrt{b}). Pretty neat, I think.

Reference. This blog post is my take on an argument presented in Chapter 5 of Faster Algorithms via Approximation Theory by Sushant Sachdeva and Nisheeth K. Vishnoi. It’s a very nice monograph, and I highly recommend you check it out!
