
I thought it was known for a long time that any function can be represented with a neural network with a single hidden layer. It's an almost trivial finding if you think about it: imagine you have a steep step function that looks something like this: __/^^ with non-zero derivative only in a small range, e.g. 0.000-0.001 (or ϵ if you like). Let's call this f. You can piece together any function from these tiny pieces as ∑ᵢ cᵢf(aᵢx+bᵢ)+dᵢ. You just need an infinite number of them. This sum is a neural network with a single hidden layer.
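For intuition, here's a minimal numpy sketch of that idea (my own toy construction, not from any paper): rebuild x² on [0, 1] as a sum of shifted copies of a steep ramp f, one term per knot, and watch the error shrink as you add more terms.

```python
import numpy as np

def f(x, eps=1e-3):
    # steep ramp __/^^ : 0 below 0, 1 above eps, linear in between
    return np.clip(x / eps, 0.0, 1.0)

def approx(x, n, g=lambda t: t ** 2):
    # sum_i c_i * f(x - b_i): each term adds g's increment at knot b_i
    knots = np.linspace(0.0, 1.0, n + 1)
    out = np.full_like(x, g(knots[0]))
    for i in range(n):
        out += (g(knots[i + 1]) - g(knots[i])) * f(x - knots[i])
    return out

xs = np.linspace(0.0, 1.0, 1001)
coarse = np.max(np.abs(approx(xs, 10) - xs ** 2))
fine = np.max(np.abs(approx(xs, 100) - xs ** 2))
# more terms -> smaller sup error; "infinitely many" is only needed for exact equality
```

Each term cᵢf(x + bᵢ) is one hidden unit, so the whole sum really is a one-hidden-layer network.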


The result in the paper is not an approximation result. The shallow network is of bounded width and is exactly equal to the deep one.


Only for relu networks though, which are already piecewise linear functions.


> Based on this proof, we provide an algorithm that, given a deep ReLU network, finds the explicit weights of the corresponding shallow network.

I’m out of date on the research, but I suspect the real value here is the algorithm. It sounds like some version of this could eventually help reduce inference time.


I haven't read the paper, but most of the time, in order to achieve matching outputs with fewer layers, you need exponentially more neurons per layer.

If you have infinite parallelism, I believe the shallow network would be faster, but deep networks use fewer total operations and will be faster in practice.
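As toy arithmetic (the widths here are hypothetical, just to illustrate the trade-off): count multiply-adds for dense layers, and compare a deep narrow net against a shallow one whose hidden width has blown up.

```python
def flops(widths):
    # ~2 * w_in * w_out multiply-adds per dense layer
    return sum(2 * a * b for a, b in zip(widths, widths[1:]))

deep = [64] * 10                 # 9 matmuls of 64x64
shallow = [64, 64 * 2 ** 8, 64]  # hypothetical exponential width blow-up, 1 hidden layer

deep_total = flops(deep)          # total work: small
shallow_total = flops(shallow)    # total work: much larger
deep_depth = len(deep) - 1        # 9 sequential matmuls
shallow_depth = len(shallow) - 1  # 2 sequential matmuls: wins given enough parallelism
```

With unbounded parallelism, latency scales with depth (the shallow net wins); with bounded hardware, throughput scales with total operations (the deep net wins).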


+1

This is how it is for logic design using gates, and I believe that's not a coincidence.

Finding the right optimum between parallel and deep for logic design is computationally intensive itself, hence it often comes down to experiential learning. The latter implies a potential use of machine learning for that optimization. And it continues ... lived happily ever after. :-)


There may be some utility to optimizing a given neural network to exploit the features of a given computational architecture. The point would be to expand the width of each layer to hit the limits of your hardware's parallelism. I.e., partition the network into n parts and shallow-ize each one to produce a new neural network with n layers, each roughly as wide as your p parallel computational channels.


I suspect the trade-off for a shallow network is an exponential explosion in weights and computation.


From the paper:

    Corollary 1. Given a ReLU network N : Rn → Rm, the shallow 
    network S : Rn → Rm of depth L = 3, as specified in the proof 
    of Theorem 1, has width bounded by 
    max{2n + k, 2n + p, 2 · p · m}
I think p might be the number of layers, but I'm not sure, and I'm not sure about k at all.

So they might be claiming a pretty tight bound on the weights.


They (suspiciously, though not necessarily intentionally) avoid calling attention to it, but:

> Let ({fω}ω∈Ω,Ω) be the decomposition that corresponds to N. Label the parts 1,...,p.

p is the number of regions the deep network divides R^n into, and is in general exponential in the network depth. (E.g. consider R^1 = [-1,+1], L[k](x) = 2*abs(x)-1 ∀k (where abs(x) = x + 2*ReLU(-x)), which divides R^1 into 2^D equal segments using only D individual ReLUs.)
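Here's a quick numpy check of that 2^D claim (my own sketch, using abs(x) = x + 2*ReLU(-x)): compose the tent layer D times and count the linear pieces on a dyadic grid, where every kink lands exactly on a sample.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def layer(x):
    # L(x) = 2*|x| - 1, written with a single ReLU via |x| = x + 2*ReLU(-x)
    return 2.0 * (x + 2.0 * relu(-x)) - 1.0

def regions(depth):
    xs = np.linspace(-1.0, 1.0, 2 ** 12 + 1)  # dyadic grid, spacing 2^-11
    y = xs.copy()
    for _ in range(depth):
        y = layer(y)
    s = np.sign(np.diff(y))
    # a new linear region starts wherever the slope sign flips
    return 1 + int(np.sum(s[1:] != s[:-1]))
```

regions(d) comes out as 2**d for small d: exponentially many pieces from only d ReLUs, which is exactly why a width bound involving p can hide an exponential blow-up.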

So for an arbitrary deep network, the shallow network size is definitely exponential in the depth, and while it's not clear whether that's true for typical 'trained' deep networks, I would be surprised if it wasn't.


I would assume the distinction between trained and untrained networks is a constant factor at best.


I meant 'trained' networks in contrast to totally-synthetic intentionally-pathological networks like my 2abs-1 example. (It could be that only pathological cases give bad results - similar to how (Hindley-Milner) type inference is EXPTIME-complete but works fine for ~all real world programs - but that seems unlikely.)


I guess people could use it as a form of post-training "compression"


The best use case would be if you can efficiently switch between representations. Shallow nets don't suffer from the vanishing gradient problem, so you could have very, very deep RNNs, convert to a shallow net for the gradient update, and then convert back to an RNN for the next rollout. Of course, most RNNs track an internal state, which isn't considered in the paper, but overcoming that limitation could make RNN training effective for long enough rollouts that transformers are no longer needed, especially given that RNNs scale better for long sequences than transformers do.


I'm not sure it's exponential, but it is dramatic.


Yep


If you can go back and forth between representations with better than quadratic scaling, it would mean RNNs with infinite context lengths and no vanishing gradient. You wouldn't need transformers.


This is the "universal approximation theorem".

https://cognitivemedium.com/magic_paper/assets/Hornik.pdf

> infinite number of [activations]

I don't think you need an infinite number of them, there is a relationship between "how close you want to get" and "how many of them you need".


I would just add the caveat that it is a single hidden layer. If you were to write this as a sequential model, you would have something like:

    nn.Sequential(
        nn.Linear(1, many_neurons),
        nn.ReLU(),
        nn.Linear(many_neurons, 1),
    )

From this it's pretty easy to see it is "two" layers, and from your equation, c_i and a_i denote two separate matrix multiplications.


This is like how you can write any boolean function as a (large) sum of minterms.
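For concreteness, a tiny Python sketch of that analogy (hypothetical helper, same construction): rebuild XOR as an OR over one AND-term (minterm) per truth-table row that outputs 1.

```python
from itertools import product

def sum_of_minterms(truth_table, n):
    # collect the input combinations where the function is 1
    ones = [bits for bits in product((0, 1), repeat=n) if truth_table[bits]]
    def f(*args):
        def minterm(bits):
            # AND of literals: x_i if bit is 1, NOT x_i if bit is 0
            return all(a == b for a, b in zip(args, bits))
        return int(any(minterm(bits) for bits in ones))  # OR of minterms
    return f

xor = sum_of_minterms({(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}, n=2)
```

Like the shallow-network construction, it always works but can need exponentially many terms (up to 2^n minterms for n inputs).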


Thanks for the explanation. I've heard this before but think I get it now!


Isn't that just like a Taylor series?


Yes - or Fourier series or other decomposition into a sum of orthogonal basis functions.


I don't think orthogonality is important here. Taylor series basis functions are not orthogonal, and nor are ReLUs.


I think you're being a little too quick to jump to the answer. There are plenty of functions where this isn't true. Two easy examples are jump discontinuities (a function that looks like this ____----- but more wiggly; these are called piecewise continuous) and functions with holes. These cannot be approximated to arbitrary accuracy, even with an infinite number of them.
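A quick numeric illustration of that (my own sketch): no matter how steep you make a continuous sigmoid, its worst-case error against a step with a jump at 0 stays pinned near half the jump height.

```python
import numpy as np

def sigmoid(x):
    # numerically stable logistic: 1/(1 + e^-x) = (1 + tanh(x/2)) / 2
    return 0.5 * (1.0 + np.tanh(0.5 * x))

xs = np.linspace(-1.0, 1.0, 200001)  # grid includes x = 0 exactly
target = (xs > 0).astype(float)      # jump discontinuity at 0

# cranking up the steepness never pushes the sup-norm error below ~0.5
errors = [np.max(np.abs(sigmoid(k * xs) - target)) for k in (10.0, 1e3, 1e5)]
```

The continuous approximant has to pass through 0.5 somewhere near the jump, so the sup-norm error can't vanish; approximation in weaker senses (e.g. almost everywhere) is a different story.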

You need to pay close attention to the wording in the approximation theorems.

Hornik[0] which does the proof you're discussing says

> This paper rigorously establishes that standard multilayer feedforward networks with as few as one _hidden layer_ using arbitrary squashing functions are capable of approximating any __Borel measurable function__ from one __finite__ dimensional space to another to any desired degree of accuracy

The confusion is probably in the Borel sigma-algebra. Borel means that it includes all finite open intervals in the real numbers. So (0,1) but not [0,1]. Open means boundary is not included! The interval also needs to be continuous, so our discontinuities violate the assumptions. Funahashi's[1] and Cybenko's[2] also require continuous functions.

This is actually a really important thing that gets ignored because it "seems obvious." Or maybe we just see it too often. But it is __critical__ to pay attention to assumptions and limitations. There is a lot of that going off the rails lately with ML and it's going to give us some roadblocks (we're already seeing some[side note]). EVERYTHING (and I mean literally everything) has assumptions, and therefore biases[3]. Every evaluation method you use has a bias. Every learning method you use has a bias. Every dataset. Every architecture. Everything. This is because everything has a certain number of assumptions baked in. We try to reduce these and make them as sane as possible, but we should be aware of them. And in the case of the universal approximation theorem, the limitation is fairly meaningful. Data is not guaranteed to lie upon a smooth continuous manifold.

[0] https://cognitivemedium.com/magic_paper/assets/Hornik.pdf

[1] https://dx.doi.org/10.1016/0893-6080%2889%2990003-8

[2] https://link.springer.com/article/10.1007/BF02551274

[3] https://en.wikipedia.org/wiki/Bias_(statistics)

[side note] We actually see this a lot in evaluation, and this is why benchmarkism is so problematic. It causes us to look at results as hard signals instead of guides. Evaluation is fucking hard. No empirical results will ever give you the full answer: the map is not the territory. We saw a Hugging Face blog today[4] about LLM results differing, and the reason is that the different evaluation methods biased towards different models. This means the metrics can be hacked, even specifically by the RLHF tuning. Or even the tokenization. It's hard to make really good judgements on these things if you don't know the limits of the evaluation method (i.e. the assumptions it makes).

[4] https://huggingface.co/blog/evaluating-mmlu-leaderboard


> Borel means that it includes all finite open intervals in the real numbers. So (0,1) but not [0,1].

The Borel algebra is generated by open sets, but it includes complements (and therefore closed sets) as well. In fact it's also generated by closed sets. The Borel algebra of R also contains sets like the rationals and the irrationals. The types of sets that aren't included in the Borel algebra (but are in the Lebesgue) are nasty things like (some) subsets of the cantor set.


This is such a bullshit HN comment, lmao.


Could you please stop posting unsubstantive comments and flamebait? You've unfortunately been doing it repeatedly. It's not what this site is for, and destroys what it is for.

If you know more than others, that's great—please share some of what you know, so the rest of us can learn something, or else don't post. Sneers and putdowns only make everything worse.

If you wouldn't mind reviewing https://news.ycombinator.com/newsguidelines.html and taking the intended spirit of the site more to heart, we'd be grateful.



