# Reviewer Comments:
---
## Reviewer: 1
1.	Theorem 2 – it seems that for a nearly diagonal covariance matrix \Sigma with the S matrix correspondingly becoming vanishingly small the constant C1 becomes very large and the bound on \epsilon diverges. Can you comment on this, as it seems that in this case the SNR function should approach a modular set function with \epsilon=0

 **That was an artifact of the property we chose to highlight in the bound, in this case the condition number. For a different choice of $\hat{h}(\mathcal{A})$ in (43), the bound might reflect the behaviour as $\Sigma$ becomes diagonal. Our bound shows the effect of $\kappa$, which does not necessarily force $\epsilon = 0$ when $\Sigma$ approaches a diagonal matrix (with distinct entries). In addition, since the decomposition is arbitrary, we need only consider $a \to 0$, in which case $S$ itself becomes a diagonal matrix. Therefore, $S$ might take the form $S = \Sigma = \alpha I$, for arbitrary $\alpha$, without affecting (22).**
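To make the reviewer's point concrete, here is a minimal numerical sketch (our own, not from the manuscript; the helper `snr` and the random data are illustrative assumptions): for a diagonal $\Sigma$, the SNR set function $s(\mathcal{A}) = m_{\mathcal{A}}^T \Sigma_{\mathcal{A}}^{-1} m_{\mathcal{A}}$ splits into per-sensor terms and is therefore modular, i.e., $\epsilon = 0$.

```python
import numpy as np

# Sketch (our own, not from the manuscript): for a diagonal covariance, the
# SNR set function s(A) = m_A^T Sigma_A^{-1} m_A reduces to a sum of
# per-sensor terms, i.e., a modular set function with epsilon = 0.
rng = np.random.default_rng(0)
M = 8
m = rng.standard_normal(M)                  # mean-difference vector (illustrative)
Sigma = np.diag(rng.uniform(0.5, 2.0, M))   # diagonal covariance (illustrative)

def snr(A):
    """SNR of the Gaussian detector restricted to sensor subset A."""
    A = list(A)
    mA = m[A]
    SA = Sigma[np.ix_(A, A)]
    return mA @ np.linalg.solve(SA, mA)

A, B = {0, 2, 5}, {1, 4}                    # disjoint subsets
# Modularity: the value of a union of disjoint sets is the sum of the values.
assert np.isclose(snr(A | B), snr(A) + snr(B))
```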


2.	Eq. (29) suggests that maximizing the surrogate f(A) is equivalent to maximizing s(A)\gamma(A) where s(A) is the SNR. I find the study of the term \gamma(A) somewhat lacking depth – how do we know that \gamma(A) does not behave in a completely erratic way or opposite to S(A) (as the authors suggest). A discussion would be well warranted.

  **In general, the only thing that can be said with certainty about $\gamma(\mathcal{A})$ is that this set function is monotone in the set size. Depending on which sensor is selected, the projection of the corresponding standard basis vector, i.e., $e_i = [0\ 0 \ldots 1 \ldots 0\ 0]^T$, onto the basis of the matrix $S$ defines how it behaves. Therefore, the optimization will try to align the selection with the basis of $S$ in order to improve the positive definiteness of the matrix $[S^{-1} + a^{-1}\mathrm{diag}(\mathbb{1}_{\mathcal{A}})]$ and thus reduce the SNR loss.**



3.	Computational complexity analysis following Equation (34) – the authors find the computational complexity of their approach as O(MK^3) compared to cubic complexity in M for the convex approach. However, as we have K>>M it seems like the computational complexity of this approach is in fact larger as there is no fixed number of possible choices K in the convex formulation. Can the authors resolve this?

  >> **In the setting in which this work is posed we always assume $K \ll M$, never $K \gg M$: $K$ must be strictly (and substantially) smaller than $M$ for sparse sensing (sensor selection) to make sense. In the case $K \approx M$, the complementary set, which is then a "small" set, should be considered. That is, instead of adding informative sensors, the removal of the "less informative" sensors should be done.**

4.	Section 4D – can the authors report algorithm running time when comparing their approach to the convex optimization to support their claim of improved running time?

  > **If we consider instances with $M > 10^3$, the SDP formulation is no match for the recursive formulation when selecting a small subset of sensors, as even loading the full SDP into a solver is already a challenge.**

5.	It would be helpful if the authors can calculate and discuss the submodularity bound \epsilon as determined according to their theorem 2 for all the specific settings for which they report numerical results.

  >> **The computation of the particular $\epsilon$ is intractable in general, as it requires factorially many comparisons. Here, it is used to indicate when the greedy heuristic might lead to a good result.**

6.	Can the authors note on the possible extension of their method for applications where both means and covariance matrices are different for the two hypotheses?

  > **REVIEWER 3 ANSWER**

## Reviewer: 2

Recommendation: RQ - Review Again After Major Changes

Major issues
- The approximation guarantee for the SNR based on $\epsilon$-submodularity is additive. As noted in the paper, the guarantee therefore depends on $\epsilon$ being small compared to the optimal value. However, it is not completely clear when Theorem 2 yields good guarantees. The text notes that it works better for well-conditioned $\Sigma$ (small $\kappa$), i.e., when the data are not too correlated. This case, however, could already be addressed by assuming the data are white (the mismatch in this situation is small). As the measurements become more correlated, the condition number increases and the guarantee worsens. A quick computation for the covariance (35) used in the simulations, for example, gives $\epsilon$ in the order of thousands, whereas f(Aopt) is most likely on the order of tens. I would encourage a more extensive discussion of the results and plots with the value of the bound or the resulting guarantee for different K. As it is, one cannot judge the usefulness of the derived near-optimal result.

  **The provided bound is an informative upper bound on the $\epsilon$ constant: it characterizes the "loss" of submodularity of the set function in terms of the condition number. It basically indicates that well-conditioned matrices have a small "deviation" from submodularity. However, as remarked in the text, ill-conditioned matrices with appropriate structure, e.g., diagonal matrices, can even lead to modular functions. Therefore, this bound should be taken as informative, saying that for well-conditioned matrices the application of the greedy heuristic leads to results not far from the original near-optimal guarantee. Computing the true $\epsilon$ for a set of simulations requires comparing factorially many pairs of sets, and is therefore intractable for meaningful examples.**



- Algorithm 2 used for the different covariances case has no performance guarantee. Results in [42] only show that each step improves the performance and that it converges to a local optimum. However, it gives no approximation certificate. This should be made clear in the text.

  **Added in the manuscript.**

- The cost function obtained from (25) appears to display a trade-off between numerical stability and accuracy. If $a \approx \lambda_{min}$, then $S$ is almost rank deficient and $M$ is very ill-conditioned. Moreover, $S$ and $\Sigma$ are no longer similar and the lower diagonal block of $M$ is therefore not a good approximation of the SNR. On the other hand, if $\beta \approx 0$, then $S \approx \Sigma$ and the cost function approximates the SNR well, but $1/a \to \infty$. Again $M$ is ill-conditioned. These numerical issues should show up more in larger problems, so it would be important to show numerical results with larger ground sets (e.g., 1000). Whether this is a serious limitation or not should be noted in the text.

  **Since the cost function is not evaluated using the original expression involving $M$, but via the recursion in (34), these problems can be avoided. In addition, as $a$ can be chosen arbitrarily, a proper selection, e.g., $a = 0.5\lambda_{\min}$, leads to numerically stable updates (provided the original matrix $\Sigma$ is well-conditioned).**
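A small numerical illustration of this choice (our own sketch; the random SPD covariance `Sigma` and the helper `cond_S` are illustrative assumptions): the condition number of $S = \Sigma - aI$ blows up as $a \to \lambda_{\min}$, but stays moderate for $a = 0.5\lambda_{\min}$.

```python
import numpy as np

# Our own illustration (not from the manuscript): conditioning of S = Sigma - a*I
# as a function of the free parameter a.
rng = np.random.default_rng(3)
Q = rng.standard_normal((8, 8))
Sigma = Q @ Q.T + np.eye(8)             # random SPD covariance (illustrative)
lmin = np.linalg.eigvalsh(Sigma)[0]     # smallest eigenvalue, lambda_min

def cond_S(a):
    """Condition number of S = Sigma - a*I; equals (lmax - a) / (lmin - a)."""
    return np.linalg.cond(Sigma - a * np.eye(8))

# a close to lambda_min makes S nearly singular; a = 0.5*lambda_min keeps S stable.
assert cond_S(0.5 * lmin) < cond_S(0.999 * lmin)
```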

- In (29) and forward, $s(A)$ is not the SNR unless $a = 0$. It is important to make this clear in the text and maybe use a different symbol to avoid confusion. Although this is not seriously detrimental to the contribution of the work, it appears in comments throughout the text.

  >> **Here there was possibly a misunderstanding: $s(\mathcal{A})$ is always the SNR, even for $a \neq 0$. The decomposition holds for any value of $a$ that leads to an invertible matrix $S$. For any proper value of $a$, it is possible to go from (24) back to the original (20) using the matrix inversion lemma.**
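This can be checked numerically. The following sketch uses our own notation (the random covariance, the chosen subset, and all variable names are illustrative assumptions): by the matrix inversion lemma, $\Sigma_{\mathcal{A}}^{-1} = a^{-1}I - a^{-2}\Phi\,(S^{-1} + a^{-1}\mathrm{diag}(\mathbb{1}_{\mathcal{A}}))^{-1}\Phi^T$ with $S = \Sigma - aI$, so the SNR computed through the decomposition matches the direct computation even for $a \neq 0$.

```python
import numpy as np

rng = np.random.default_rng(1)
M = 6
m = rng.standard_normal(M)              # mean-difference vector (illustrative)
Q = rng.standard_normal((M, M))
Sigma = Q @ Q.T + M * np.eye(M)         # SPD covariance (illustrative)

a = 0.5 * np.linalg.eigvalsh(Sigma)[0]  # any 0 < a < lambda_min keeps S invertible
S = Sigma - a * np.eye(M)               # decomposition Sigma = S + a*I

A = [0, 2, 3]
d = np.zeros(M); d[A] = 1.0             # indicator vector, diag(1_A)
mA = m[A]
SigmaA = Sigma[np.ix_(A, A)]
snr_direct = mA @ np.linalg.solve(SigmaA, mA)   # s(A) = m_A^T Sigma_A^{-1} m_A

# Matrix inversion lemma: Sigma_A^{-1} = (1/a) I - (1/a^2) Phi (S^-1 + a^-1 diag(1_A))^-1 Phi^T
Minner = np.linalg.inv(S) + np.diag(d) / a
snr_dec = (mA @ mA) / a - (m * d) @ np.linalg.solve(Minner, m * d) / a**2

assert np.isclose(snr_direct, snr_dec)  # s(A) is the SNR even though a != 0
```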

Minor comments
- p.2, l. 50: "non-monotone separable constraints". Is this the case in this paper? It doesn't look like it, but I may be missing something. I cannot see why this is mentioned here at all.

  **Here we wanted to state where the limitations of the approach lie. That is, only certain kinds of constraints fit the submodular machinery, i.e., combinatorial constraints such as knapsack constraints, matroid constraints, etc. In the manuscript this phrase has been changed to indicate examples of convex constraints that cannot be handled directly by this approach.**

- The work repeatedly refers to the complexity of greedy as "linear in the size of the input set". The number of queries required by greedy is K*M. It therefore depends polynomially on both the ground set and the number of sensors selected. Also, this is the query complexity of the algorithm. The complexity of each query is also a factor to be considered (which is the justification given for Section IV-C).

  **The distinction between "the number of cost function evaluations scales linearly" and "linear time complexity" has been clarified in the manuscript.**
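For completeness, the reviewer's query count can be reproduced with a toy greedy loop (our own sketch; the modular toy objective and all variable names are illustrative): plain greedy over a ground set of size $M$ selecting $K$ elements issues $\sum_{k=0}^{K-1}(M-k) = KM - K(K-1)/2 = O(KM)$ oracle calls.

```python
import numpy as np

# Our own illustration (not from the manuscript): counting the oracle calls of
# plain greedy selection on a toy modular objective f(A) = sum_{i in A} w_i.
M, K = 20, 5
w = np.random.default_rng(2).uniform(size=M)
queries = 0
selected = []
for _ in range(K):
    best, best_gain = None, -np.inf
    for i in range(M):
        if i in selected:
            continue
        queries += 1                 # one cost-function query per candidate
        gain = w[i]                  # marginal gain of adding sensor i
        if gain > best_gain:
            best, best_gain = i, gain
    selected.append(best)

# Total queries: M + (M-1) + ... + (M-K+1) = K*M - K*(K-1)/2, i.e., O(K*M).
assert queries == sum(M - k for k in range(K))
```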

- p.3, l. 40, 2nd column: "the methods presented in this work can be easily extended to budget functions [...]" I understand this from an algorithmic point of view, but then what happens to the guarantees? What about complexity? Matroids can be hard to express efficiently.

  **Good point. As you may have assumed, this comes with a trade-off. From the complexity point of view the greedy heuristic does not change; however, an oracle for checking the independence of the set is now required (we need to answer the question: is this new set in the matroid or not?). This effectively requires another function evaluation, which might or might not increase the complexity (depending on how efficiently the matroid can be described). In terms of guarantees, these get worse. That is, the guarantees degrade with the "hardness" of the constraints: the harder the constraints, the worse the guarantee. However, at least there is a (non-trivial) guarantee. A short comment on this issue has been added.**

- Definition 2: should it be $\epsilon > 0$?

  ** Added **

- Theorem 2: $\beta$ appears to be an artifact of the proof. If so, why not minimize over $\beta$ for the final result? Moreover, as it is, the statement and (22) are confusing, since $a$ and $C_1$ depend on both $\beta$ and $\lambda_{min}$. In fact, $C_1$ is not a universal constant. I suggest writing out the result in a more direct form.

  **$\beta$ is introduced to state that $a$ is a fraction of $\lambda_{\min}$. The proof and the result have been restructured to involve only universal constants.**

- p.7, l. 25: the (1-1/e) guarantee for stochastic greedy holds "in expectation" for submodular functions.

  ** True, a note about this has been added. **

- What is the value of $a$ used in the simulations of Section IV-D?

  **As $\Sigma$ is generated as a superposition of $M$ unit-power Gaussian sources in the array signal processing model, the value of $a$ is arbitrary because the decomposition (23) is not unique. Therefore $a$ can be, for example, $a = \lambda_{\min}(\Sigma) - b$ for $0 < b < \lambda_{\min}(\Sigma)$, since $S$ is then invertible.**

- The example in Section IV-E is very interesting and illustrative. It is also in line with the result of Theorem 2. Indeed, as $\rho \to 1$, $\epsilon \to \infty$ and the performance guarantee for greedy search is no longer meaningful. This is indeed a situation where you would expect things to not work.

  ** Indeed. **
- Also in Section IV-E, Fig. 3 appears to show that for a small number of sensors, greedy SNR outperforms (or performs as well as) the submodular surrogate. Placing more than 25% of the ground set seems unlikely to happen, especially for large-scale problems. It would be interesting to see a detail of this region in the figure.

  **The figure was meant to show that the submodular surrogate reaches the maximal function value with fewer sensors; but yes, for a small number of sensors the two selections lead to similar values.**

- I would suggest rewriting the proof of Theorem 2 to make the goals of each step clearer. As it is, the presentation is confusing.

  **The proof has been restructured.**

## Reviewer: 3
### TO DO
* Stiefel Comparison (R3)
* Zoom  Fig. 3 (not sure)
* Time Comparison (not sure)
