The first time I used L1 regularization to recover a sparsely sampled signal’s DCT coefficients my mind was blown. If you’ve never tried it out, you should, it’s pretty crazy! The power of priors and choosing the appropriate inductive bias…
The classic learning example is to take a couple of sine waves at high frequencies, add them together, and sample the sum at a very low rate (so low that the signal frequencies sit well above the Nyquist rate for your sampling), or better yet sample just a handful of random points. Then turn it into an optimization problem: your model parameters are the DCT coefficients of the signal. Run the coefficients through the inverse transform to recover a time domain signal, compute a reconstruction error at the locations in the time domain representation where you have observed data, and add an L1 penalty term to push the coefficients toward sparsity.
It’s quite beautiful! There are all sorts of fun, more advanced applications, but it’s quite awesome to see it happen before your own eyes!
My personal signals & systems knowledge is pretty novice-level, but the number of times that trick has come in handy for me is remarkable… even just as a slightly more intelligent data imputation technique, it can be pretty awesome! The prof who taught it to me worked at Bell Labs for a while, so it felt a bit like a guru sharing secrets, even though it’s a well-documented technique.
Incredibly succinct way of describing it -- any official sources, arXiv or otherwise, that articulate it similarly, re: mix, sample, optimize, regularize?
I believe he's describing Orthogonal Matching Pursuit. It's a dictionary learning algorithm that can be used to recover sparse dictionaries using L1 regularization.
Not quite, though very related and I believe both should end up with essentially the same result.
If I recall correctly, matching pursuit is essentially a greedy algorithm - please do correct me if I am wrong - where you conceptually find the component that explains the most data at each iteration, remove its contribution, and then repeat the process on the residual. Pardon if that isn’t quite the right explanation, but it’s what my intuition is recalling right now…
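If it helps, here's a rough sketch of plain matching pursuit in that spirit (NumPy; the function name and the unit-norm-columns assumption on the dictionary D are mine, not anything from the thread). OMP additionally re-solves a least-squares fit over all atoms selected so far at each step, rather than just subtracting:

    import numpy as np

    def matching_pursuit(D, y, n_iters=10):
        """Plain matching pursuit: greedily pick the dictionary atom (a column
        of D, assumed unit-norm) most correlated with the current residual,
        subtract its contribution, and repeat on what's left."""
        coeffs = np.zeros(D.shape[1])
        residual = y.astype(float).copy()
        for _ in range(n_iters):
            corr = D.T @ residual             # correlation of each atom with the residual
            k = int(np.argmax(np.abs(corr)))  # the atom that explains the most data
            coeffs[k] += corr[k]              # record its weight
            residual -= corr[k] * D[:, k]     # remove it; the loop repeats on the residual
        return coeffs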
What I was describing was a simpler algorithm that can be done with gradient descent or any other vanilla optimizer.
Your model parameters are the coefficients a_i over all basis functions in the frequency domain representation. Run them through the synthesis function to get a time domain signal, and select the values of the time domain signal where your target data is known. Compute the squared error at each of those locations, and take the mean. This is your reconstruction error, and it is trivially differentiable with respect to the coefficients a_i.
Compute an additional error term which is a standard L1 regularization, i.e. sum(|a_i|). This can be added to the reconstruction error term with some weight λ (λ=1 is even fine here, at least for simple problems), and is also trivially differentiable (provided you haven’t initialized any of the coefficients to 0).
As with any L1 regularization term, the resulting solution should be sparse in the L1-regularized parameters (look up visualizations of problems with only 2 model parameters to see how this emerges from the L1 contour lines of equal loss forming “diamonds” with their points on the axes).
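Here's a minimal runnable sketch of that recipe (NumPy/SciPy; the constants, variable names, and the choice of λ and step size are mine, not the parent's). Two deviations worth flagging: I build the ground truth directly from two DCT basis tones so the target is exactly sparse (a sum of arbitrary sines is only approximately sparse in a DCT basis), and I use a sum rather than a mean of squared errors, which just rescales the effective λ:

    import numpy as np
    from scipy.fft import dct, idct

    rng = np.random.default_rng(0)
    n = 256

    # Ground truth: exactly two nonzero DCT coefficients -> two cosine "tones".
    a_true = np.zeros(n)
    a_true[30], a_true[90] = 5.0, -3.0
    signal = idct(a_true, norm="ortho")

    # Observe the signal at only 60 random time-domain locations.
    observed = rng.choice(n, size=60, replace=False)
    y = signal[observed]

    # Model parameters: all n coefficients, initialized away from exactly 0.
    a = 0.01 * rng.standard_normal(n)
    lam, step = 0.5, 0.1

    for _ in range(3000):
        x = idct(a, norm="ortho")        # synthesis: coefficients -> time domain
        r = np.zeros(n)
        r[observed] = x[observed] - y    # residual only where data is known
        # With norm="ortho" the transform is orthogonal, so dct() is the adjoint
        # of idct(), giving the exact gradient of the squared-error term.
        grad = 2.0 * dct(r, norm="ortho") + lam * np.sign(a)
        a -= step * grad

    support = np.flatnonzero(np.abs(a) > 0.1)
    print("recovered support:", support)                 # expect the true indices, 30 and 90
    print("recovered values:", np.round(a[support], 2))  # magnitudes shrunk a bit by the L1 term
    rmse = np.sqrt(np.mean((idct(a, norm="ortho") - signal) ** 2))
    print("time-domain RMSE:", round(float(rmse), 3))

One note on the plain sign() subgradient: it makes small coefficients jitter around zero rather than landing exactly on it, hence the 0.1 reporting threshold. Swapping that update for soft-thresholding (ISTA / proximal gradient) yields exact zeros, but the vanilla version already shows the sparsity emerging.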
ah got it -- thank you!
Not quite - see my response to GP for what I was describing.
ah, interesting
so residual error derivatives in a sense?
The diamond construct also feels evocative of dimer and/or branched manifold / lattice methods, be that Viterbi or otherwise. Section 2.2 in the OP's post is reminiscent of that, e.g., if we view the DCT reconstruction as an implicit matched filter.
Yes, in theory they should converge on a similar result, though it may quickly get into alternating conic optimization, especially depending on how the signal pair is constructed, e.g., if one signal is an ECC and/or if the L1 regularization is operating as an alternating error-squashing op.
definitely good stuff
> so residual error derivatives in a sense?
Yep, loss = mean(residuals**2) + lambda*sum(|a_i|), and we take the gradient of that with respect to the a_i to guide our gradient descent steps.
I implemented the AAN 8-point DCT algorithm some years ago: https://www.nayuki.io/page/fast-discrete-cosine-transform-al...
libjpeg-turbo uses this one for its fast variant of the DCT.
https://github.com/libjpeg-turbo/libjpeg-turbo/blob/36ac5b84...
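For reference, the transform AAN speeds up is the 8-point DCT-II; a direct O(n²) version with orthonormal scaling looks roughly like this (my sketch, not Nayuki's or libjpeg-turbo's code; AAN computes the same outputs, up to per-coefficient scale factors that get folded into quantization, with far fewer multiplies):

    import math

    def dct2_8pt(x):
        """Direct 8-point DCT-II with orthonormal scaling, the textbook
        definition that fast factorizations like AAN are equivalent to."""
        out = []
        for k in range(8):
            s = sum(x[t] * math.cos(math.pi * (2 * t + 1) * k / 16) for t in range(8))
            out.append(math.sqrt(0.125 if k == 0 else 0.25) * s)
        return out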