szvsw a day ago

The first time I used L1 regularization to recover a sparsely sampled signal’s DCT coefficients, my mind was blown. If you’ve never tried it out, you should - it’s pretty crazy! The power of priors and choosing the appropriate inductive bias…

The classic learning example is to take a couple of sine waves at high frequencies, add them together, and sample the sum at a very low rate (so low that the signal frequencies sit well above the Nyquist frequency of your sampling) - or better yet, just sample it randomly at a handful of points. Then turn it into an optimization problem where your model parameters are the coefficients of the DCT of the signal: run the coefficients through the inverse transform to recover a time-domain signal, compute a reconstruction error at the locations in the time-domain representation for which you have observed data, and add an L1 penalty term to request that the coefficients be sparse.
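
For anyone who wants to try it, here’s a rough Python sketch of that setup (the frequencies, sample count, and penalty weight are arbitrary picks for illustration, and I’m leaning on scikit-learn’s Lasso as an off-the-shelf solver for the L1-penalized least squares; any optimizer that handles the L1 term would do):

    # Sketch: mix two sine waves, keep only a handful of random samples,
    # then recover sparse DCT coefficients with an L1-penalized fit.
    import numpy as np
    from scipy.fft import idct
    from sklearn.linear_model import Lasso

    rng = np.random.default_rng(0)
    n = 512
    t = np.arange(n)
    signal = np.sin(2 * np.pi * 23 * t / n) + np.sin(2 * np.pi * 71 * t / n)

    m = 60                                       # far fewer samples than Nyquist would ask for
    idx = np.sort(rng.choice(n, size=m, replace=False))
    observed = signal[idx]

    # Synthesis matrix: column k is the k-th inverse-DCT basis vector,
    # so time_domain = Phi @ coeffs.
    Phi = idct(np.eye(n), axis=0, norm='ortho')
    A = Phi[idx, :]                              # keep only the rows where the signal was observed

    # L1-regularized least squares over the DCT coefficients.
    lasso = Lasso(alpha=0.01, fit_intercept=False, max_iter=50_000)
    lasso.fit(A, observed)
    coeffs = lasso.coef_                         # should end up dominated by a handful of coefficients

    recovered = Phi @ coeffs                     # full time-domain reconstruction
    print(np.count_nonzero(np.abs(coeffs) > 1e-3), "significant coefficients")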

It’s quite beautiful! There are all sorts of fun, more advanced applications, but it’s pretty awesome just to see it happen before your own eyes!

My personal signals & systems knowledge is pretty novice-level, but the number of times that trick has come in handy for me is remarkable… even just as a slightly more intelligent data imputation technique, it can be pretty awesome! The prof who taught it to me worked at Bell Labs for a while, so it felt a bit like a guru sharing secrets, even though it’s a well-documented technique.

  • aghilmort 10 hours ago

    incredibly succinct way of describing it -- any official sources, arXiv or otherwise, that articulate it similarly, re: mix, sample, optimize, regularize?

    • muragekibicho 9 hours ago

      I believe he's describing Orthogonal Matching Pursuit. It's a sparse approximation algorithm, used in dictionary learning, that recovers a sparse representation of a signal over a dictionary.

      • szvsw 8 hours ago

        Not quite, though very related and I believe both should end up with essentially the same result.

        Matching pursuit is essentially a greedy algorithm, if I recall correctly - please do correct me if I am wrong - where at each iteration you find the component that explains the most of the remaining data, remove its contribution, and then repeat the process on the residual. Pardon if that isn’t quite the right explanation, but it’s what my intuition is recalling right now…
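
        (Roughly, my mental model of that greedy loop looks like the sketch below: plain matching pursuit over a generic dictionary D whose columns are unit-norm atoms. OMP also refits the coefficients of all previously selected atoms by least squares at each step, which I’m omitting here.)

            # Plain matching pursuit (greedy): repeatedly pick the dictionary
            # atom most correlated with the current residual, subtract its
            # contribution, and repeat on what is left.
            import numpy as np

            def matching_pursuit(y, D, n_iters=20):
                """y: observed signal; D: dictionary with unit-norm columns (atoms)."""
                residual = y.astype(float).copy()
                coeffs = np.zeros(D.shape[1])
                for _ in range(n_iters):
                    correlations = D.T @ residual          # how well each atom explains the residual
                    k = np.argmax(np.abs(correlations))    # greedy choice of the best atom
                    coeffs[k] += correlations[k]
                    residual -= correlations[k] * D[:, k]  # remove that atom's contribution
                return coeffs, residual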

        What I was describing was a simpler algorithm that can be done with gradient descent or any other vanilla optimizer.

        Your model parameters are the coefficients a_i over all of the basis functions in the frequency-domain representation. Run them through the synthesis function to get a time-domain signal, and select the values of that signal at the locations where your target data is known. Compute the squared error at each of those locations and take the mean. This is your reconstruction error, and it is trivially differentiable with respect to the coefficients a_i.

        Then compute an additional error term which is a standard L1 regularization, i.e. sum(|a_i|), added to the reconstruction error with some weight λ (λ=1 is even fine here, at least for simple problems). This term is also trivially differentiable, provided you haven’t initialized any of the coefficients to 0.

        As with any L1 regularization term, the resulting solution should be sparse in the L1-regularized parameters. (Look up visualizations of problems with only two model parameters to see how this emerges: the L1 penalty’s contour lines of equal loss form “diamonds” whose corners sit on the axes.)
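
        In code, the loop I have in mind looks roughly like this (a bare-bones numpy sketch; the step size, λ, and iteration count are just illustrative, and sign(a_i) is the usual subgradient of |a_i|):

            # (Sub)gradient descent on: mean squared reconstruction error at the
            # observed points + lam * sum(|a_i|), with the DCT coefficients a
            # as the model parameters.
            import numpy as np
            from scipy.fft import dct, idct

            def recover(observed, idx, n, lam=0.01, lr=0.1, n_steps=5000, seed=0):
                rng = np.random.default_rng(seed)
                a = rng.normal(size=n) * 0.01              # coefficients, initialized away from 0
                for _ in range(n_steps):
                    recon = idct(a, norm='ortho')          # synthesis: coefficients -> time domain
                    err = np.zeros(n)
                    err[idx] = recon[idx] - observed       # residuals only where data is known
                    # Gradient of the mean squared error: the adjoint of the synthesis
                    # operator applied to the residuals (the adjoint of an orthonormal
                    # idct is the matching dct).
                    grad_mse = dct(err, norm='ortho') * 2.0 / len(idx)
                    grad_l1 = lam * np.sign(a)             # subgradient of the L1 penalty
                    a -= lr * (grad_mse + grad_l1)
                return a

        One caveat: plain subgradient steps won’t push coefficients exactly to zero, they just hover near it; a proximal / soft-thresholding update (ISTA-style) gets you truly sparse solutions, but the idea is the same.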

      • aghilmort 9 hours ago

        ah got it -- thank you!

        • szvsw 8 hours ago

          Not quite - see my response to the GP for what I was describing.

          • aghilmort 8 hours ago

            ah, interesting

            so residual error derivatives in a sense?

            the diamond construct also feels evocative of dimer &/or branched manifold / lattice methods, be that Viterbi or otherwise; section 2.2 in the OP's post is reminiscent of that, e.g., if we view the DCT reconstruction as an implicit matched filter

            yes, in theory it should converge on a similar result; it may quickly get into alternating conic optimization, especially depending on how the signal pair is constructed, e.g., if one signal is an ECC &/or if the L1 regularization is operating as an error-squashing alternating op

            definitely good stuff

            • szvsw 5 hours ago

              > so residual error derivatives in a sense?

              Yep, loss = mean(residuals^2) + lambda*sum(|a_i|), and we take the gradient of that with respect to the a_i to guide our gradient descent steps.