Huilin Ge, Zhiyu Zhu, Biao Wang, et al., “View synthesis in tidal flat environments with spherical harmonics and neighboring views integration,” Chinese Journal of Electronics, vol. x, no. x, pp. 1–10, xxxx. DOI: 10.23919/cje.2024.00.158

View Synthesis in Tidal Flat Environments with Spherical Harmonics and Neighboring Views Integration

More Information
  • Author Bio:

    Ge Huilin: Huilin Ge was born in Jiangsu. He received the Ph.D. degree in Naval Architecture and Ocean Engineering from Jiangsu University of Science and Technology, where he is now a Professor. His interests include deep learning and digital image processing. (Email: ghl1989@just.edu.cn)

    Zhu Zhiyu: Zhiyu Zhu was born in Jiangsu. He is now a Professor at Jiangsu University of Science and Technology. His interests include ship system control, radar signal processing, and intelligent control. (Email: zzydzz@163.com)

  • Corresponding author:

    Zhu Zhiyu, Email: zzydzz@163.com

  • Available Online: December 01, 2024
  • We present a novel view synthesis method that introduces a radiance field representation of density and tidal flat appearance in neural rendering. Our method aims to generate realistic images from new viewpoints by utilizing continuous scene information generated from different sampling points on a set of identical rays. This approach significantly improves rendering quality and reduces blurring and aliasing artifacts compared to existing techniques such as Nerfacto. Our model employs spherical harmonic functions to efficiently encode viewpoint orientation information and integrates image features from neighboring viewpoints for enhanced fusion. This results in an accurate and detailed reconstruction of the scene’s geometry and appearance. We evaluate our approach on publicly available datasets containing a variety of indoor and outdoor scenes, as well as on a customized tidal flats dataset. The results show that our algorithm outperforms Nerfacto in terms of PSNR, SSIM, and LPIPS metrics, demonstrating superior performance in both complex and simple environments. This study highlights the potential of our approach for advancing view synthesis techniques and provides a powerful tool for environmental research and conservation efforts in dynamic ecosystems such as tidal flats. Future work will focus on further optimizations and extensions to improve the efficiency and quality of the rendering process.

  • Given a set of sequential images of a scene, the goal of novel view synthesis is to generate realistic images of the same scene from new viewpoints. Early research in this area focused on geometric and image-based rendering techniques: the 3D structure of the scene is reconstructed from multiple images captured at different viewpoints, and the reconstructed 3D model is then used to render images at new viewpoints [1]–[4].

    With further technological developments, light field rendering [5], [6], view-dependent texturing [7], and more modern learning-based approaches have been proposed, where image rendering methods typically operate by deforming, resampling, and/or blending the source view to the target viewpoint. Such approaches can achieve high-resolution rendering but usually require very dense input views or explicit proxy geometries, which are difficult to estimate with high quality, leading to artifacts in the rendering.

    Recently, NeRF has achieved remarkable results in novel view synthesis, producing highly realistic images of new viewpoints in complex real-world scenes by encoding scene content in the weights of a multi-layer perceptron (MLP) that takes high-dimensional position-encoded 5D inputs. Although NeRF has achieved impressive results in view fusion, applying it directly to tidal flats produces overly blurred renderings. NeRF operates by sampling a set of points along each ray and approximating the piecewise continuous integral as an accumulation of volumetric features estimated at the sampling intervals. However, the MLP can only be queried at a fixed set of discrete positions, and the same point-sampled features are used to represent opacity throughout the interval between samples, leading to ambiguity in the neural rendering. The features of the samples can vary significantly across rays projected from different viewing directions, so supervising these rays simultaneously to produce isotropic volumetric features may result in artifacts such as blurred surfaces and blurred textures.

    To overcome this limitation of NeRF and to improve rendering quality, some works [8]–[10] have introduced shape-based sampling into the scene representation. By embedding new spatial primitives such as Gaussian ellipsoids, conical frustums, and spheres, these models reduce representation ambiguity, improve rendering quality, and reduce blurring and aliasing artifacts.

    In this paper, we present a new radiance field representation of the density and appearance of tidal flats in neural rendering, better reconstructing the geometry and appearance of the scene and generating more realistic images from new viewpoints. Viewpoint orientation information is encoded with spherical harmonic functions and transformed into a set of coefficients that compactly and efficiently represent the complex orientation distribution. During ray rendering, image features from adjacent viewpoints are fused, performing density, occlusion, and visibility inference as well as color blending. This allows our system to operate without any scene-specific optimization or precomputed proxy geometry. The core of our approach is to go beyond the information available in the current viewpoint alone, aggregating information from adjacent views along a given ray to compute its final color. For 3D positions sampled along the ray, latent 2D features from nearby source views are acquired and then aggregated at each sampled position via a transformer to produce density features. The final density value is then computed by the ray transformer module, which considers these density features along the entire ray, enabling visibility inference across larger spatial scales.

    Constructing radiance fields from the image features of input views to render new viewpoints has become a popular research topic, using neural networks to represent the shape and appearance of a scene. Earlier research [11]–[15] has shown that MLPs can serve as an implicit shape representation to approximate highly complex shapes: exploiting the MLP’s ability to model nonlinear functions, any given 3D point coordinate is mapped to the corresponding signed distance value, ensuring a smooth transition of the SDF (Signed Distance Function) values throughout space.

    With the proliferation of differentiable rendering methods [16]–[19], which eliminate the need for hard-to-obtain 3D ground truth and instead supervise with two-dimensional images, networks can be trained end to end, making it possible to learn 3D representations directly from 2D image data. Some scholars focus on extracting the geometric features of a scene from simple geometry and multiple views of diffuse materials [20]–[23]. NeRF [24] achieved impressive results in novel view synthesis by optimizing a 5D neural radiance field of the scene, an achievement that has attracted the attention of many researchers. However, traditional NeRF techniques are difficult to apply directly to large-scale, high-resolution UAV imagery.

    To address this, Jonathan T. Barron et al. proposed mip-NeRF [8], which replaces point sampling along rays with conical frustums to mitigate aliasing during reconstruction. Matthew Tancik et al. presented Block-NeRF, which divides a large, unbounded scene into blocks and rendered an entire neighborhood of San Francisco [25]. Turki et al. proposed Mega-NeRF, applying a weighted averaging strategy to resolve differences in overlapping regions and successfully training NeRF on UAV images [26]. Many researchers are now validating the potential of NeRF in urban environments [27]–[29], as well as in remote sensing and satellite imagery [30]–[32].

    However, fewer studies have applied similar techniques to tidal flat conservation, which differs significantly from urban environments and satellite imagery because of the complexity and dynamics of tidal flats’ natural ecology.

    Environmental conservation through image acquisition by UAVs has been a noteworthy research focus and currently plays an important role in forests [33], artificial wetlands [34], and species conservation [34]–[36]. In this paper, UAVs collect images of tidal flats for perspective fusion to monitor the health of these areas.

    Current research largely focuses on monitoring the health of tidal flats from a globally vast macroscopic perspective using time series images [37]–[39]. However, relatively few studies integrate multi-scale perspectives on tidal flats at smaller scales. Small-scale 3D reconstruction can capture subtle changes on the surface of tidal flats, such as erosion, sedimentation, and vegetation changes, which are often not accurately reflected in large-scale monitoring. Monitoring key areas of the tidal flat ecosystem at a smaller scale, such as salt marshes, mudflats, and intertidal zones, can more accurately capture these subtle changes in critical areas.

    Early work in image-based rendering synthesized a new view by weighted blending of reference pixels from a set of reference images [6], [40], where the blending weights are computed from ray-space proximity or with the help of proxy geometries [5], [40], [41]. There has been increasing research interest in proxy geometries [42], [43], optical flow corrections [44]–[46], and soft blending [47], [48], which improve the quality of the generated images. For example, multi-view stereo methods [49], [50] are used to generate a mesh surface associated with a view, and a CNN then computes the blending weights. Alternatively, a radiance field is constructed from a mesh surface [51] or a point cloud [52], [53].

    While these methods achieve good results in some cases and can handle sparser views than other methods, they are fundamentally limited by the performance of 3D reconstruction algorithms [49], [50]. 3D reconstruction tends to fail in low-textured or reflective regions and is not well-suited for handling partially translucent surfaces, whereas tidal flats have many low-textured or reflective regions. In our method, continuous volume density learning is used to optimize the quality of synthetic new view images for better performance in challenging scenes like tidal flats.

    Neural Radiance Fields (NeRF) were introduced as a groundbreaking method for synthesizing 3D scenes through deep learning, marking a significant advancement in computer graphics and 3D modeling. NeRF uses a fully connected deep neural network to map 5D coordinates (a 3D spatial position $x, y, z$ and a 2D viewing direction) to color and volume density.

    The radiance field can be conceptualized as a function whose input is a ray in space, $\mathbf{r}(t)=\mathbf{o}+t\mathbf{d}$ (where $\mathbf{r}\in R$, the set of rays). This function allows us to query the density $\sigma$ at each point $(x,y,z)$ along the ray $\mathbf{r}(t)$, as well as the color $C(\mathbf{r})$ rendered in the ray direction $\mathbf{d}$. The density $\sigma$ also signifies the probability that the ray terminates at this point in space and controls the amount of radiation absorbed by other rays as they pass through the point.

    When rendering an image for a given position o and direction d, the radiation from all points along a given ray r(t) is accumulated to compute the color value C(r) of the corresponding point in the image. Formally, this is represented as:

    $$C(\mathbf{r})=\int_{t_n}^{t_f} T(\mathbf{r}(t))\,\sigma(\mathbf{r}(t))\,\mathbf{c}(\mathbf{r}(t),\mathbf{d})\,dt \tag{1}$$

    where $T(\mathbf{r}(t))$ represents the cumulative transmittance along the ray up to $\mathbf{r}(t)$ and is defined by

    $$T(\mathbf{r}(t))=\exp\left(-\int_{t_n}^{t}\sigma(\mathbf{r}(s))\,ds\right) \tag{2}$$

    In this formulation, $t$ parameterizes the ray, with $t_n$ and $t_f$ as the near and far bounds. The rendering outcome results from the interaction of three critical factors: the cumulative transmittance $T(\mathbf{r}(t))$, the density $\sigma(\mathbf{r}(t))$, and the color $\mathbf{c}(\mathbf{r}(t),\mathbf{d})$. The product of $T(\mathbf{r}(t))$ and $\sigma(\mathbf{r}(t))$ serves as a “color weight”, quantifying the remaining light intensity at a specific point together with its density. As Equation (2) shows, this relationship follows an inverse exponential pattern: a higher density at a given point leaves less light to penetrate beyond that point.

    In the actual rendering process, the discrete forms of Equation (1) and Equation (2) are represented as follows:

    $$\hat{C}(\mathbf{r})=\sum_{i=1}^{N} T(\mathbf{r}(i))\left(1-\exp\left(-\sigma(\mathbf{r}(i))\,\delta_i\right)\right)\mathbf{c}(\mathbf{r}(i),\mathbf{d}) \tag{3}$$
    $$T(\mathbf{r}(i))=\exp\left(-\sum_{j=1}^{i-1}\sigma(\mathbf{r}(j))\,\delta_j\right) \tag{4}$$

    In Equation (3), $\delta_i=t_{i+1}-t_i$ represents the distance between consecutive sampling points along the ray. The relation between $\sigma(\mathbf{r}(t))$ and $1-\exp\left(-\sigma(\mathbf{r}(i))\,\delta_i\right)$ has been demonstrated in previous work [54].

    The optimization of NeRF involves minimizing the mean squared error (MSE) loss between the predicted image $\hat{C}(\mathbf{r})$ and the ground truth image $C(\mathbf{r})$, specifically:

    $$L_{\mathrm{mse}}=\sum_{\mathbf{r}\in R}\left\|\hat{C}(\mathbf{r})-C(\mathbf{r})\right\|^{2} \tag{5}$$
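    To make the discrete rendering concrete, the following is a minimal PyTorch sketch of Equations (3)–(5); the tensor shapes, function names, and the padding of the last sampling interval are illustrative assumptions rather than the authors’ exact implementation.

```python
import torch

def render_rays(sigma, color, t_vals):
    """Composite per-sample densities and colors along each ray.

    sigma:  (R, N)    non-negative densities at the N samples of R rays
    color:  (R, N, 3) RGB predicted at each sample
    t_vals: (R, N)    sample distances along each ray (increasing)
    """
    # delta_i = t_{i+1} - t_i; pad the last interval with a large value
    deltas = t_vals[:, 1:] - t_vals[:, :-1]
    deltas = torch.cat([deltas, 1e10 * torch.ones_like(deltas[:, :1])], dim=-1)

    # alpha_i = 1 - exp(-sigma_i * delta_i)
    alpha = 1.0 - torch.exp(-sigma * deltas)

    # T_i = exp(-sum_{j<i} sigma_j * delta_j), as a shifted cumulative product
    trans = torch.cumprod(1.0 - alpha + 1e-10, dim=-1)
    trans = torch.cat([torch.ones_like(trans[:, :1]), trans[:, :-1]], dim=-1)

    weights = trans * alpha                       # per-sample color weights
    rgb = (weights.unsqueeze(-1) * color).sum(1)  # Eq. (3): accumulated color
    return rgb

def mse_loss(pred_rgb, gt_rgb):
    # Eq. (5), averaged here over the ray batch
    return ((pred_rgb - gt_rgb) ** 2).sum(-1).mean()
```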

    As shown in Figure 1, at a given location of the generated image, we collect images from adjacent viewpoints and use these adjacent viewpoint images in the rendering process of the new viewpoint. This approach leverages the colors and densities in the images from neighboring viewpoints to generate a more realistic rendering at a specified viewpoint. Additionally, a spherical harmonic function is introduced to process the viewpoint information, ultimately providing the density σ and color of the new viewpoint. The loss function is then calculated to complete the rendering of the new viewpoint image.

    Figure  1.  System overview.

    To render the target view, we first select, from the acquired views, the N views whose directions are closest to that of the target viewpoint; N is chosen based on GPU capacity, and these views form the set of neighboring views used for rendering the new viewpoint. To prepare for rendering, features are extracted from these neighboring viewpoint images using a shared-weight network.

    The image features are extracted using the network architecture outlined in Table 1. We implement the feature extraction network based on ResNet34 [56] in PyTorch [55]. We replace all batch normalization [57] with instance normalization [58], following [59], and remove the max pooling layer, substituting a strided convolution. Our network is fully convolutional and can accept input images of variable size; Table 1 uses a single 640×480×3 image as an example input, and a minimal sketch of this extractor is given after the table.

    Table  1.  Image feature extraction network structure
    Input (id: dimension) | Layer | Output (id: dimension)
    0: 640 × 480 × 3 | 7 × 7 Conv, 64, stride 2 | 1: 320 × 240 × 64
    1: 320 × 240 × 64 | Residual Block 1 | 2: 160 × 120 × 64
    2: 160 × 120 × 64 | Residual Block 2 | 3: 80 × 60 × 128
    3: 80 × 60 × 128 | Residual Block 3 | 4: 40 × 30 × 256
    5: 40 × 30 × 256 | 3 × 3 Upconv, 128, factor 2 | 6: 80 × 60 × 128
    [3,6]: 80 × 60 × 256 | 3 × 3 Conv, 128 | 7: 80 × 60 × 128
    7: 80 × 60 × 128 | 3 × 3 Upconv, 64, factor 2 | 8: 160 × 120 × 64
    [2,8]: 160 × 120 × 128 | 3 × 3 Conv, 64 | 9: 160 × 120 × 64
    9: 160 × 120 × 64 | 1 × 1 Conv, 64 | Out: 160 × 120 × 64
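    The following is a minimal PyTorch sketch of the Table 1 extractor. The residual block internals, the use of bilinear upsampling followed by a 3 × 3 convolution for the “Upconv” layers, and all module names are assumptions; the paper bases its blocks on ResNet34 [56] with instance normalization [58].

```python
import torch
import torch.nn as nn

def conv_in_relu(cin, cout, k=3, stride=1):
    return nn.Sequential(
        nn.Conv2d(cin, cout, k, stride, padding=k // 2),
        nn.InstanceNorm2d(cout), nn.ReLU(inplace=True))

class ResBlock(nn.Module):
    """Strided residual block (a strided conv replaces max pooling)."""
    def __init__(self, cin, cout, stride=2):
        super().__init__()
        self.body = nn.Sequential(conv_in_relu(cin, cout, stride=stride),
                                  conv_in_relu(cout, cout))
        self.skip = nn.Conv2d(cin, cout, 1, stride)
    def forward(self, x):
        return self.body(x) + self.skip(x)

class FeatureNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.stem = conv_in_relu(3, 64, k=7, stride=2)    # 1: H/2 x W/2 x 64
        self.block1 = ResBlock(64, 64)                    # 2: H/4 x W/4 x 64
        self.block2 = ResBlock(64, 128)                   # 3: H/8 x W/8 x 128
        self.block3 = ResBlock(128, 256)                  # 4: H/16 x W/16 x 256
        self.up1 = nn.Sequential(
            nn.Upsample(scale_factor=2, mode='bilinear', align_corners=False),
            conv_in_relu(256, 128))                       # 6
        self.fuse1 = conv_in_relu(256, 128)               # 7 (after concat with 3)
        self.up2 = nn.Sequential(
            nn.Upsample(scale_factor=2, mode='bilinear', align_corners=False),
            conv_in_relu(128, 64))                        # 8
        self.fuse2 = conv_in_relu(128, 64)                # 9 (after concat with 2)
        self.out = nn.Conv2d(64, 64, 1)                   # Out: H/4 x W/4 x 64
    def forward(self, x):
        x1 = self.stem(x)
        x2 = self.block1(x1)
        x3 = self.block2(x2)
        x4 = self.block3(x3)
        y = self.fuse1(torch.cat([x3, self.up1(x4)], dim=1))
        y = self.fuse2(torch.cat([x2, self.up2(y)], dim=1))
        return self.out(y)   # 64-dimensional features at quarter resolution
```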

    Let $I_i\in[0,1]^{H_i\times W_i\times 3}$ denote the $i$-th neighbouring view of the target perspective, and $P_i\in\mathbb{R}^{3\times 4}$ the corresponding camera projection matrix. The image features $F_i$ are obtained with the shared-weight network described above, and for each new viewpoint rendering, the input tuple $\{(I_i,P_i,F_i)\}_{i=1}^{N}$ is constructed as the input to the network.
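    As an illustration, the snippet below sketches how the N neighbouring views closest in viewing direction to the target view might be selected and packed into the input tuples $\{(I_i,P_i,F_i)\}_{i=1}^{N}$; the function and variable names (select_neighbours, feat_net, n_views) are hypothetical.

```python
import torch

def select_neighbours(target_dir, source_dirs, images, proj_mats, feat_net, n_views=8):
    """target_dir:  (3,)  unit viewing direction of the target view
       source_dirs: (S, 3) unit viewing directions of the candidate views
       images:      (S, 3, H, W) source images, proj_mats: (S, 3, 4)"""
    # cosine similarity between the target direction and each source direction
    sim = source_dirs @ target_dir                # (S,)
    idx = torch.topk(sim, k=n_views).indices      # the N closest views

    I = images[idx]                               # (N, 3, H, W)
    P = proj_mats[idx]                            # (N, 3, 4)
    F = feat_net(I)                               # (N, 64, H/4, W/4) image features
    return list(zip(I, P, F))                     # {(I_i, P_i, F_i)}_{i=1}^N
```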

    Our proposed method renders the final 2D image by accumulating colours and densities along rays, using the classical neural radiance field as a baseline for generating the new view image. The difference is that our method aggregates the volume densities and colours of the adjacent views used to generate the new viewpoint image.

    The image features are extracted using a network with shared parameters in order to aggregate the features of the views under the target viewing direction $\mathbf{d}$, and the result is then rendered as a 2D image. To encourage 3D points on a surface to have a consistent local appearance across multiple views, which is an effective way to maintain the consistency of $F_i$, we use a PointNet-style architecture [60], as shown in Figure 2. Multi-view features are employed, with the variance serving as the global pooling operator. First, the mean $\mu$ and variance $v$ of the feature vectors $F_i$ extracted from the images are calculated, so that the features are considered from a global perspective and a global link between image features is constructed. Subsequently, the feature vector $F_i$, the mean $\mu$, and the variance $v$ are concatenated and passed through an MLP to obtain a new feature vector $F_N$ and a weight parameter $w_n$, and these two are then mapped to the density feature $F_\sigma$ using another MLP.
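    A minimal sketch of this mean/variance aggregation for the features of one 3D sample is given below; the hidden sizes, the softmax-weighted pooling, and the module names are assumptions, since the paper specifies only the overall structure.

```python
import torch
import torch.nn as nn

class DensityFeatureAggregator(nn.Module):
    def __init__(self, feat_dim=64, hidden=64):
        super().__init__()
        # per-view MLP: [F_i, mu, v] -> (new feature F_N, scalar weight w_n)
        self.per_view = nn.Sequential(nn.Linear(3 * feat_dim, hidden), nn.ReLU(),
                                      nn.Linear(hidden, hidden + 1))
        # maps the pooled feature to the density feature F_sigma
        self.to_density = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                        nn.Linear(hidden, hidden))

    def forward(self, feats):
        # feats: (N, D) features of one sample point projected into the N views
        mu = feats.mean(0, keepdim=True).expand_as(feats)                   # global mean
        var = feats.var(0, unbiased=False, keepdim=True).expand_as(feats)   # global variance
        x = self.per_view(torch.cat([feats, mu, var], dim=-1))
        f_n, w_n = x[:, :-1], x[:, -1:]                   # new features F_N and weights w_n
        pooled = (f_n * torch.softmax(w_n, dim=0)).sum(0) # weighted global pooling
        return self.to_density(pooled)                    # density feature F_sigma
```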

    Figure  2.  Rendering a new perspective picture flowchart using adjacent views.

    Our network takes neighbouring view information as input, but we do not expect all neighbouring views to carry the same weight in the final density and colour generation; neighbouring views close to the target viewpoint should carry more weight in the rendering process. We use the pooling technique proposed by Sun et al. [61]. Our weighting function is defined as follows:

    $$\tilde{w}^{f}_{i}(\mathbf{d},\mathbf{d}_i)=\max\left(0,\; e^{s(\mathbf{d}\cdot\mathbf{d}_i-1)}-\min_{j=1,\ldots,N} e^{s(\mathbf{d}\cdot\mathbf{d}_j-1)}\right),\qquad w^{f}_{i}(\mathbf{d},\mathbf{d}_i)=\frac{\tilde{w}^{f}_{i}(\mathbf{d},\mathbf{d}_i)}{\sum_{j}\tilde{w}^{f}_{j}(\mathbf{d},\mathbf{d}_j)} \tag{6}$$

    Here $s$ is a learnable parameter, allowing the network to further control the weight assigned to neighbouring views with different viewpoint gaps during rendering. The density feature $F_\sigma$ could be converted directly into the $\sigma$ needed for rendering with a neural network, but direct conversion produces new-view images with blurring and artefacts. We believe further use of global information is needed, so we employ a ray transformer module that makes full use of the information along the ray.
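    A minimal sketch of the weighting in Equation (6) follows, assuming the exponent uses the dot product between unit view directions and that $s$ is a learnable scalar; names are illustrative.

```python
import torch

def direction_weights(d, d_i, s):
    """d: (3,) target direction, d_i: (N, 3) neighbour directions, s: scalar."""
    dots = d_i @ d                                   # d . d_i, in [-1, 1]
    e = torch.exp(s * (dots - 1.0))                  # e^{s(d . d_i - 1)}
    w_tilde = torch.clamp(e - e.min(), min=0.0)      # max(0, e - min_j e_j)
    return w_tilde / (w_tilde.sum() + 1e-8)          # normalised weights w_i^f
```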

    The ray transformer consists of two key components from the classical transformer [62]: positional encoding and self-attention. Given $M$ samples along a ray, the ray transformer processes them as a sequence ordered from near to far, applying positional encoding and multi-head self-attention to the sequence of density features $(F_\sigma(X_1),\ldots,F_\sigma(X_M))$. It then predicts the final density value $\sigma$ from the features of each sample.
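    The following is a minimal PyTorch sketch of such a ray transformer; it uses a learned positional embedding and PyTorch’s built-in transformer encoder, and the head count, depth, and feature width are assumptions.

```python
import torch
import torch.nn as nn

class RayTransformer(nn.Module):
    def __init__(self, feat_dim=64, n_heads=4, n_layers=2, max_samples=256):
        super().__init__()
        # learned positional embedding for the near-to-far sample sequence
        self.pos = nn.Parameter(torch.randn(max_samples, feat_dim) * 0.02)
        layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=n_heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.to_sigma = nn.Linear(feat_dim, 1)

    def forward(self, f_sigma):
        # f_sigma: (R, M, D) density features for M samples on each of R rays
        m = f_sigma.shape[1]
        x = self.encoder(f_sigma + self.pos[:m])          # self-attention along the ray
        return torch.relu(self.to_sigma(x)).squeeze(-1)   # (R, M) densities sigma
```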

    For the set of selected neighboring views, we consider that a smaller angular distance between the target view direction $\mathbf{d}$ and a neighbouring view direction $\mathbf{d}_i$ implies a greater likelihood that the colour of the target view is similar to the corresponding colour of view $i$, and vice versa. To predict the weight of each neighbouring view for the target view, as shown in Figure 2, we combine $F_N$ with the viewpoint difference $\mathbf{d}_{\text{target}}-\mathbf{d}_i$ and mix the weights with the following weighting function:

    $$\mathbf{c}=\sum_{i=1}^{N}\mathbf{C}_i\,\frac{\exp(w^{c}_{i})}{\sum_{j=1}^{N}\exp(w^{c}_{j})} \tag{7}$$

    Similarly, we concatenate the feature vector of the target view into the training and obtain the weight value for the colour. The colour contributed by each adjacent view to the current feature point is obtained by multiplying its weight and colour values, and these contributions are summed to obtain the final colour. The colour in the current view is encoded with the same spherical harmonic function as in instant-ngp [63] to obtain $C_{\text{target}}$, as follows:
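    A minimal sketch of the colour blending in Equation (7) is shown below; the tensor shapes and names are illustrative.

```python
import torch

def blend_colors(colors, w_c):
    """colors: (N, 3) colour contributed by each neighbouring view at a sample,
       w_c:    (N,)   learned blending logits for those views."""
    weights = torch.softmax(w_c, dim=0)             # exp(w_i^c) / sum_j exp(w_j^c)
    return (weights.unsqueeze(-1) * colors).sum(0)  # final blended colour c
```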

    $$C(\theta,\phi)=\sum_{j=0}^{J}\sum_{m=-j}^{j}c^{m}_{j}\,Y^{m}_{j}(\theta,\phi) \tag{8}$$
    $$Y^{m}_{j}(\theta,\phi)=\begin{cases}\sqrt{2}\,K^{m}_{j}\cos(m\phi)\,P^{m}_{j}(\cos\theta) & \text{if } m>0\\[2pt] \sqrt{2}\,K^{m}_{j}\sin(-m\phi)\,P^{-m}_{j}(\cos\theta) & \text{if } m<0\\[2pt] K^{0}_{j}\,P^{0}_{j}(\cos\theta) & \text{if } m=0\end{cases} \tag{9}$$

    where $\mathbf{d}(\theta,\phi)$ indicates the viewing direction, $\theta$ is the pitch angle, and $\phi$ is the yaw angle. The parameter $j$ describes the “order” of the spherical harmonic, representing the number of ripples of the function on the sphere, and takes non-negative integer values $j=0,1,2,\ldots,J$. The parameter $m$ describes the variation of the function in azimuth for a given order and takes integer values between $-j$ and $j$, i.e., $m=-j,-j+1,\ldots,0,\ldots,j-1,j$. $P^{m}_{j}$ is the associated Legendre function of degree $j$ and order $m$, defined as follows:

    $$P^{m}_{j}(x)=(-1)^{m}\left(1-x^{2}\right)^{m/2}\frac{d^{m}}{dx^{m}}P_{j}(x),\qquad P_{n}(x)=\frac{1}{2^{n}n!}\frac{d^{n}}{dx^{n}}\left[\left(x^{2}-1\right)^{n}\right],\qquad K^{m}_{j}=\sqrt{\frac{(2j+1)\,(j-|m|)!}{4\pi\,(j+|m|)!}} \tag{10}$$
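    For illustration, the snippet below evaluates the real spherical harmonic basis of Equations (8)–(10) up to second order ($J=2$) for a viewing direction $(\theta,\phi)$, using the standard hard-coded constants; a full implementation would generate $K_j^m$ and $P_j^m$ for arbitrary $J$, and the function names are placeholders.

```python
import numpy as np

def sh_basis_deg2(theta, phi):
    """Return the 9 real SH basis values Y_j^m(theta, phi) for j = 0, 1, 2."""
    x = np.sin(theta) * np.cos(phi)
    y = np.sin(theta) * np.sin(phi)
    z = np.cos(theta)
    return np.array([
        0.28209479,                    # Y_0^0
        0.48860251 * y,                # Y_1^{-1}
        0.48860251 * z,                # Y_1^{0}
        0.48860251 * x,                # Y_1^{1}
        1.09254843 * x * y,            # Y_2^{-2}
        1.09254843 * y * z,            # Y_2^{-1}
        0.31539157 * (3 * z * z - 1),  # Y_2^{0}
        1.09254843 * x * z,            # Y_2^{1}
        0.54627421 * (x * x - y * y),  # Y_2^{2}
    ])

def sh_color(coeffs, theta, phi):
    """Eq. (8): C(theta, phi) = sum_j sum_m c_j^m Y_j^m(theta, phi).
       coeffs: (9, 3) learned SH coefficients for the RGB channels."""
    return sh_basis_deg2(theta, phi) @ coeffs
```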

    We first evaluated our model on a publicly available dataset of 7 scenes (3 outdoor and 4 indoor), each containing a complicated central object or region and a detailed background, with roughly 200–300 images per scene. We then validated our algorithm on the tidal flats dataset from our earlier work, with Nerfacto [64] as the main comparison algorithm. In this study, we used UAV imagery acquired from a flyover of a tidal flat environment for perspective fusion. The imagery was acquired along the Australian coastline, at an altitude of 1000 feet between Smithton and Woolnorth in northwestern Tasmania, an area with a diverse intertidal ecosystem that includes a variety of tidal flat types and ecological landforms. As shown in Figure 3, this data enables a comprehensive assessment of the effectiveness of our algorithm in a tidal flat environment. We further classified the acquired images by intertidal ecosystem, including “tidal trees”, river mouths, ground textures, vegetation, and deep-water areas, with each scenario containing 30–90 images that were finely processed and standardized for resolution and color fidelity. A series of standardization steps was used to ensure the consistency and quality of the image data. First, we calibrated the images to remove color deviations and geometric distortions due to lighting and camera settings. Next, we segmented and filtered the images to exclude low-quality or unsuitable images from the analysis.

    Figure  3.  Location map of the survey sample area.


    To assess the accuracy and quality of the resulting synthesized images and new views generated, we utilize three key metrics: PSNR, SSIM, and LPIPS. These metrics allow for a comprehensive assessment of the structural similarity, luminance contrast, and perceptual differences between the synthesized and real images, ensuring a thorough evaluation of the new views’ quality and similarity.

    Peak Signal-to-Noise Ratio (PSNR) is a conventional metric for measuring image quality. The formula is as follows:

    $$\mathrm{MSE}=\frac{1}{mn}\sum_{i=0}^{m-1}\sum_{j=0}^{n-1}\left|I(i,j)-K(i,j)\right|^{2} \tag{11}$$
    $$\mathrm{PSNR}=10\log_{10}\!\left(\frac{\left(2^{n}-1\right)^{2}}{\mathrm{MSE}}\right) \tag{12}$$

    Here $m\times n$ in Equation (11) is the image size, $I$ and $K$ are the two images being compared, and the $n$ in Equation (12) denotes the pixel bit depth. The PSNR value is positively correlated with image quality and is usually used to quantify the similarity of the resulting image.
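    A minimal sketch of Equations (11)–(12) for 8-bit images follows, assuming pixel values in [0, 255] so that $2^{n}-1=255$; the function name is illustrative.

```python
import numpy as np

def psnr(img, ref, max_val=255.0):
    """Peak signal-to-noise ratio between two images of the same shape."""
    mse = np.mean((img.astype(np.float64) - ref.astype(np.float64)) ** 2)
    if mse == 0:
        return float('inf')                    # identical images
    return 10.0 * np.log10(max_val ** 2 / mse)
```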

    The Structural Similarity Index (SSIM) indicates the structural likeness between the resulting image and the original, including contributions from luminance, contrast, and texture. By calculating the SSIM value, the degree of resemblance between the generated image and the real image can be evaluated:

    $$\mathrm{SSIM}(x,y)=\frac{\left(2\mu_{x}\mu_{y}+c_{1}\right)\left(2\sigma_{xy}+c_{2}\right)}{\left(\mu_{x}^{2}+\mu_{y}^{2}+c_{1}\right)\left(\sigma_{x}^{2}+\sigma_{y}^{2}+c_{2}\right)} \tag{13}$$
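    For illustration, the sketch below computes Equation (13) globally over a pair of grayscale images, without the sliding Gaussian window used in the standard local SSIM; the $k_1=0.01$ and $k_2=0.03$ constants are the conventional choices, assumed here.

```python
import numpy as np

def global_ssim(x, y, data_range=255.0):
    """Simplified, whole-image SSIM of Eq. (13) for two grayscale images."""
    x = x.astype(np.float64)
    y = y.astype(np.float64)
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()
    return ((2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)) / \
           ((mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))
```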

    LPIPS (Learned Perceptual Image Patch Similarity) [65] is a metric for assessing image quality and similarity. Proposed by Richard Zhang et al. in 2018, it is widely used in the fields of image processing and computer vision. LPIPS aims to assess the perceptual similarity between two images, focusing on how the human visual system perceives image quality and similarity rather than pixel-wise differences. LPIPS uses a trained deep learning model (typically a convolutional neural network, CNN) to extract image features. These features capture complex patterns and structures in an image, reflecting human perception more accurately. The metric assesses similarity by comparing localized image chunks (patches) of an image, allowing for a detailed analysis of the local details and textures.
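    A minimal usage sketch with the lpips Python package released by the authors of [65] is shown below; the input tensors are assumed to be RGB images scaled to [-1, 1] with shape (1, 3, H, W), as the library expects, and the image contents here are random placeholders.

```python
import torch
import lpips

loss_fn = lpips.LPIPS(net='alex')           # AlexNet-based perceptual features
img0 = torch.rand(1, 3, 256, 256) * 2 - 1   # synthesized view (placeholder)
img1 = torch.rand(1, 3, 256, 256) * 2 - 1   # ground-truth view (placeholder)
distance = loss_fn(img0, img1)              # lower = more perceptually similar
```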

    As indicated in Table 2, we compared our method with Nerfacto on the public dataset; our method outperforms Nerfacto in PSNR and SSIM, and is slightly inferior in LPIPS only on a few scenes. As illustrated in Figure 4, we also compare the methods on actually generated images.

    Figure  4.  Public dataset comparison results.

    We likewise validated our algorithm on our custom tidal flats dataset, as shown in Figure 5 and Table 3, where it can be seen that our algorithm achieves superior results.

    Figure  5.  Comparison of renderings of tidal flats environments.

    In these comparative experiments, our approach consistently outperforms the conventional NeRF algorithm in both image quality and accuracy. This result demonstrates the higher reliability of our method in modeling and rendering tidal flat environments. Our method captures scene details more accurately and provides a powerful tool for tidal flat research and conservation.

    Across the evaluation metrics used, our method shows a clear advantage.

    Table  2.  Comparison results for the borderless scene dataset
    Scene | PSNR (Ours / Nerfacto) | SSIM (Ours / Nerfacto) | LPIPS (Ours / Nerfacto)
    Bicycle | 22.74 / 19.62 | 0.649 / 0.497 | 0.384 / 0.385
    Bonsai | 27.33 / 21.40 | 0.872 / 0.693 | 0.252 / 0.302
    Counter | 25.51 / 24.01 | 0.798 / 0.744 | 0.328 / 0.331
    Garden | 25.01 / 23.06 | 0.745 / 0.627 | 0.282 / 0.284
    Kitchen | 25.92 / 26.00 | 0.789 / 0.791 | 0.302 / 0.299
    Room | 28.54 / 22.32 | 0.891 / 0.777 | 0.259 / 0.303
    Stump | 15.97 / 15.96 | 0.369 / 0.345 | 0.736 / 0.727
    Average | 24.43 / 21.77 | 0.730 / 0.630 | 0.363 / 0.376
    Table  3.  Comparison results for the tidal flats dataset
    Scene | PSNR (Nerfacto / Ours) | SSIM (Nerfacto / Ours) | LPIPS (Nerfacto / Ours)
    Tidal Trees | 20.46 / 21.77 | 0.685 / 0.736 | 0.231 / 0.171
    River Mouths | 25.52 / 24.90 | 0.418 / 0.463 | 0.189 / 0.145
    Ground Textures | 21.65 / 22.18 | 0.464 / 0.411 | 0.351 / 0.209
    Vegetation | 23.39 / 23.27 | 0.771 / 0.787 | 0.240 / 0.229
    Deep-Water Areas | 20.49 / 20.34 | 0.429 / 0.443 | 0.384 / 0.383
    Average | 22.30 / 22.49 | 0.553 / 0.568 | 0.279 / 0.227

    In this article, we have compared the proposed algorithm with the state-of-the-art Nerfacto algorithm. First, we performed comparison trials on a public dataset. The results show that our proposed algorithm is comparable to the state-of-the-art algorithm in most indoor and outdoor scenarios. However, in the Stump scene it is difficult to clearly distinguish the foreground from the background because they are very similar; this may be because, during scene formation, the values sampled along each ray are composited into a single color. Similarly, in the tidal flats environment, although our method is able to generate corresponding images at new viewpoints, some subtle features are still lost in certain views. Our method performs well across the training and test images, and also when viewing the scene content from an approximately constant range; however, significant aliasing artifacts appear at some feature points, resulting in overly blurred images at specific viewpoints.

    In follow-up work, we will draw on current solutions to related problems. For example, to improve rendering we could cast multiple rays through each pixel for supersampling. However, this is very costly for a neural volume representation such as NeRF, since hundreds of MLP evaluations are required to render a single ray and reconstructing a scene can take days. A more efficient approach is to use cone-based sampling techniques, which encode the shape and size of the ray footprint and enable multi-scale modeling of the scene during training.

    In this paper, we have presented a novel view synthesis approach tailored to tidal flat environments, introducing a new radiance field representation for density and appearance in neural rendering. Our method significantly improves rendering quality and reduces blurring and aliasing artifacts compared to existing techniques such as Nerfacto.

    Through extensive experiments on both public datasets and our custom tidal flats dataset, our algorithm demonstrated superior performance in generating realistic new viewpoint images. Despite some challenges in distinguishing foreground from background in highly similar scenes, our method effectively captures the complex geometry and appearance of various tidal flat features.

    Future work will explore further optimizations and extensions of our method, such as incorporating more advanced sampling techniques to address remaining challenges and enhance the overall efficiency and quality of the rendering process. By continuing to refine our approach, we aim to provide robust tools for environmental research and conservation efforts, particularly in dynamic and complex ecosystems like tidal flats.

    This work was supported by the National Natural Science Foundation of China (Grant No. 62476113), the Jiangsu Province Key Research and Development Plan—Social Development Project (Grant No. BE2022783), the Zhenjiang Key Research and Development Plan—Social Development Project (Grant No. SH2022013), and the Ministry of Science and Technology’s Xiong’an New Area Science and Technology Innovation Special Project (Grant No. 2022XAGG0126).

  • [1]
    R. Chen, S. F. Han, J. Xu, et al., “Point-based multi-view stereo network,” in Proceedings of 2019 IEEE/CVF International Conference on Computer Vision, Seoul, Korea (South), pp. 1538–1547, 2019.
    [2]
    Y. Yao, Z. X. Luo, S. W. Li, et al., “Recurrent MVSNet for high-resolution multi-view stereo depth inference,” in Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, pp. 5520–5529, 2019.
    [3]
    M. Q. Ji, J. Gall, H. T. Zheng, et al., “SurfaceNet: An end-to-end 3D neural network for multiview stereopsis,” in Proceedings of 2017 IEEE International Conference on Computer Vision, Venice, Italy, pp. 2326–2334, 2017.
    [4]
    H. W. Yi, Z. Z. Wei, M. Y. Ding, et al., “Pyramid multi-view stereo net with self-adaptive view aggregation,” in Proceedings of the 16th European Conference on Computer Vision, Glasgow, UK, pp. 766–782, 2020.
    [5]
    C. Buehler, M. Bosse, L. McMillan, et al., “Unstructured lumigraph rendering,” in Proceedings of the 28th Annual Conference on COMPUTER Graphics and Interactive Techniques, New York, NY, USA, pp. 425–432, 2001.
    [6]
    M. Levoy and P. Hanrahan, “Light field rendering,” Seminal Graphics Papers: Pushing the Boundaries, vol. 2, article no. 47, 2023 DOI: 10.1145/3596711.3596759
    [7]
    P. Debevec, Y. Z. Yu, and G. Borshukov, “Efficient view-dependent image-based rendering with projective texture-mapping,” in Rendering Techniques’98, G. Drettakis and N. Max, Eds. Springer, Vienna, Austria, pp. 105–116, 1998.
    [8]
    J. T. Barron, B. Mildenhall, M. Tancik, et al., “Mip-NeRF: A multiscale representation for anti-aliasing neural radiance fields,” in Proceedings of 2021 IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, pp. 5835–5844, 2021.
    [9]
    J. T. Barron, B. Mildenhall, D. Verbin, et al., “Mip-NeRF 360: Unbounded anti-aliased neural radiance fields,” in Proceedings of 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, pp. 5460–5469, 2022.
    [10]
    J. T. Barron, B. Mildenhall, D. Verbin, et al., “Zip-NeRF: Anti-aliased grid-based neural radiance fields,” in Proceedings of 2023 IEEE/CVF International Conference on Computer Vision, Paris, France, pp. 19640–19648, 2023.
    [11]
    M. Atzmon, N. Haim, L. Yariv, et al., “Controlling neural level sets,” in Proceedings of the 33rd Conference on Neural Information Processing Systems, Vancouver, Canada, 2019.
    [12]
    K. Genova, F. Cole, A. Sud, et al., “Local deep implicit functions for 3D shape,” in Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, pp. 4856–4865, 2020.
    [13]
    C. Y. Jiang, A. Sud, A. Makadia, et al., “Local implicit grid representations for 3D scenes,” in Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, pp. 6000–6009, 2020.
    [14]
    L. Mescheder, M. Oechsle, M. Niemeyer, et al., “Occupancy networks: Learning 3D reconstruction in function space,” in Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, pp. 4455–4465, 2019.
    [15]
    J. J. Park, P. Florence, J. Straub, et al., “DeepSDF: Learning continuous signed distance functions for shape representation,” in Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, pp. 165–174, 2019.
    [16]
    W. Z. Chen, J. Gao, H. Ling, et al., “Learning to predict 3D objects with an interpolation-based differentiable renderer,” in Proceedings of the 33rd International Conference on Neural Information Processing Systems, Vancouver, BC, Canada, article no. 862, 2019.
    [17]
    Y. Jiang, D. T. Ji, Z. Z. Han, et al., “SDFDiff: Differentiable rendering of signed distance fields for 3D shape optimization,” in Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, pp. 1248–1258, 2020.
    [18]
    M. Niemeyer, L. Mescheder, M. Oechsle, et al., “Differentiable volumetric rendering: Learning implicit 3D representations without 3D supervision,” in Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, pp. 3501–3512, 2020.
    [19]
    S. H. Liu, Y. D. Zhang, S. Y. Peng, et al., “DIST: Rendering deep implicit signed distance function with differentiable sphere tracing,” in Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, pp. 2016–2025, 2020.
    [20]
    L. J. Liu, J. T. Gu, K. Z. Lin, et al., “Neural sparse voxel fields,” in Proceedings of the 34th Conference on Neural Information Processing Systems, Vancouver, Canada, pp. 15651–15663, 2020.
    [21]
    S. Lombardi, T. Simon, J. Saragih, et al., “Neural volumes: Learning dynamic renderable volumes from images,” ACM Transactions on Graphics (TOG), vol. 38, no. 4, article no. 65, 2019 DOI: 10.1145/3306346.3323020
    [22]
    S. Saito, Z. Huang, R. Natsume, et al., “PIFu: Pixel-aligned implicit function for high-resolution clothed human digitization,” in Proceedings of 2019 IEEE/CVF International Conference on Computer Vision, Seoul, Korea (South), pp. 2304–2314, 2019.
    [23]
    K. Schwarz, Y. Y. Liao, M. Niemeyer, et al., “GRAF: Generative radiance fields for 3D-aware image synthesis,” in Proceedings of the 34th Conference on Neural Information Processing Systems, Vancouver, Canada, pp. 20154–20166, 2020.
    [24]
    B. Mildenhall, P. P. Srinivasan, M. Tancik, et al., “NeRF: Representing scenes as neural radiance fields for view synthesis,” in Proceedings of the 16th European Conference on Computer Vision, Glasgow, UK, pp. 405–421, 2020.
    [25]
    M. Tancik, V. Casser, X. C. Yan, et al., “Block-NeRF: Scalable large scene neural view synthesis,” in Proceedings of 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, pp. 8238–8248, 2022.
    [26]
    H. Turki, D. Ramanan, and M. Satyanarayanan, “Mega-NeRF: Scalable construction of large-scale NERFs for virtual fly-throughs,” in Proceedings of 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, pp. 12912–12921, 2022.
    [27]
    H. Turki, J. Y. Zhang, F. Ferroni, et al., “SUDS: Scalable urban dynamic scenes,” in Proceedings of 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, pp. 12375–12385, 2023.
    [28]
    P. Tu, X. Zhou, M. M. Wang, et al., “NeRF2Points: Large-scale point cloud generation from street views’ radiance field optimization,” arXiv preprint, arXiv: 2404.04875, 2024.
    [29]
    J. F. Guo, N. C. Deng, X. Y. Li, et al., “StreetSurf: Extending multi-view implicit surface reconstruction to street views,” arXiv preprint, arXiv: 2306.04988, 2023.
    [30]
    S. L. Xie, L. Zhang, G. Jeon, et al., “Remote sensing neural radiance fields for multi-view satellite photogrammetry,” Remote Sensing, vol. 15, no. 15, article no. 3808, 2023 DOI: 10.3390/rs15153808
    [31]
    L. L. Zhang and E. Rupnik, “SparseSat-NeRF: Dense depth supervised neural radiance fields for sparse satellite images,” arXiv preprint, arXiv: 2309.00277, 2023.
    [32]
    M. Gableman and A. Kak, “Incorporating season and solar specificity into renderings made by a NeRF architecture using satellite images,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 46, no. 6, pp. 4348–4365, 2024 DOI: 10.1109/TPAMI.2024.3355069
    [33]
    S. Getzin, R. S. Nuske, and K. Wiegand, “Using unmanned aerial vehicles (UAV) to quantify spatial gap patterns in forests,” Remote Sensing, vol. 6, no. 8, pp. 6988–7004, 2014 DOI: 10.3390/rs6086988
    [34]
    D. Chabot, V. Carignan, and D. M. Bird, “Measuring habitat quality for least bitterns in a created wetland with use of a small unmanned aircraft,” Wetlands, vol. 34, no. 3, pp. 527–533, 2014 DOI: 10.1007/s13157-014-0518-1
    [35]
    M. Mulero-Pázmány, R. Stolper, L. D. van Essen, et al., “Remotely piloted aircraft systems as a rhinoceros anti-poaching tool in Africa,” PLoS One, vol. 9, no. 1, article no. e83873, 2014 DOI: 10.1371/journal.pone.0083873
    [36]
    S. K. Valicharla, X. Li, J. Greenleaf, et al., “Precision detection and assessment of ash death and decline caused by the emerald ash borer using drones and deep learning,” Plants, vol. 12, no. 4, article no. 798, 2023 DOI: 10.3390/plants12040798
    [37]
    M. X. Chang, P. Li, Z. H. Li, et al., “Mapping tidal flats of the Bohai and Yellow Seas using time series sentinel-2 images and Google Earth Engine,” Remote Sensing, vol. 14, no. 8, article no. 1789, 2022 DOI: 10.3390/rs14081789
    [38]
    M. M. Jia, Z. M. Wang, D. H. Mao, et al., “Rapid, robust, and automated mapping of tidal flats in China using time series sentinel-2 images and Google Earth Engine,” Remote Sensing of Environment, vol. 255, article no. 112285, 2021 DOI: 10.1016/j.rse.2021.112285
    [39]
    X. X. Wang, X. M. Xiao, Z. H. Zou, et al., “Mapping coastal wetlands of China using time series Landsat images in 2018 and Google Earth Engine,” ISPRS Journal of Photogrammetry and Remote Sensing, vol. 163, pp. 312–326, 2020 DOI: 10.1016/j.isprsjprs.2020.03.014
    [40]
    P. E. Debevec, C. J. Taylor, and J. Malik, “Modeling and rendering architecture from photographs: A hybrid geometry- and image-based approach,” Seminal Graphics Papers: Pushing the Boundaries, vol. 2, pp. 465–474, 2023 DOI: 10.1145/3596711.3596761
    [41]
    B. Heigl, R. Koch, M. Pollefeys, et al., “Plenoptic modeling and rendering from image sequences taken by a hand-held camera,” in Mustererkennung 1999, W. Förstner, J. M. Buhmann, A. Faber, et al., Eds. Springer, Berlin Heidelberg, Germany, pp. 94–101, 1999.
    [42]
    G. Chaurasia, S. Duchene, O. Sorkine-Hornung, et al., “Depth synthesis and local warps for plausible image-based navigation,” ACM Transactions on Graphics (TOG), vol. 32, no. 3, article no. 30, 2013 DOI: 10.1145/2487228.2487238
    [43]
    P. Hedman, T. Ritschel, G. Drettakis, et al., “Scalable inside-out image-based rendering,” ACM Transactions on Graphics (TOG), vol. 35, no. 6, article no. 231, 2016 DOI: 10.1145/2980179.2982420
    [44]
    D. Casas, C. Richardt, J. Collomosse, et al., “4D model flow: Precomputed appearance alignment for real-time 4D video interpolation,” Computer Graphics Forum, vol. 34, no. 7, pp. 173–182, 2015 DOI: 10.1111/cgf.12756
    [45]
    R. F. Du, M. Chuang, W. Chang, et al., “Montage4D: Interactive seamless fusion of multiview video textures,” in Proceedings of the ACM SIGGRAPH Symposium on Interactive 3D Graphics and Games, Montreal, QC, Canada, article no. 5, 2018.
    [46]
    M. Eisemann, B. De Decker, M. Magnor, et al., “Floating textures,” Computer Graphics Forum, vol. 27, no. 2, pp. 409–418, 2008 DOI: 10.1111/j.1467-8659.2008.01138.x
    [47]
    E. Penner and L. Zhang, “Soft 3D reconstruction for view synthesis,” ACM Transactions on Graphics (TOG), vol. 36, no. 6, article no. 235, 2017 DOI: 10.1145/3130800.3130855
    [48]
    G. Riegler and V. Koltun, “Free view synthesis,” in Proceedings of the 16th European Conference on Computer Vision, Glasgow, UK, pp. 623–640, 2020.
    [49]
    M. Jancosek and T. Pajdla, “Multi-view reconstruction preserving weakly-supported surfaces,” in Proceedings of CVPR 2011, Colorado Springs, CO, USA, pp. 3121–3128, 2011.
    [50]
    J. L. Schönberger and J. M. Frahm, “Structure-from-motion revisited,” in Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, pp. 4104–4113, 2016.
    [51]
    J. W. Huang, J. Thies, A. Dai, et al., “Adversarial texture optimization from RGB-D scans,” in Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, pp. 1556–1565, 2020.
    [52]
    M. Meshry, D. B. Goldman, S. Khamis, et al., “Neural rerendering in the wild,” in Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, pp. 6871–6880, 2019.
    [53]
    F. Pittaluga, S. J. Koppal, S. B. Kang, et al., “Revealing scenes by inverting structure from motion reconstructions,” in Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, pp. 145–154, 2019.
    [54]
    N. Max, “Optical models for direct volume rendering,” IEEE Transactions on Visualization and Computer Graphics, vol. 1, no. 2, pp. 99–108, 1995 DOI: 10.1109/2945.468400
    [55]
    A. Paszke, S. Gross, S. Chintala, et al., “Automatic differentiation in PyTorch,” in Proceedings of the 31st Conference on Neural Information Processing Systems, Long Beach, CA, USA, 2017.
    [56]
    K. M. He, X. Y. Zhang, S. Q. Ren, et al., “Deep residual learning for image recognition,” in Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, pp. 770–778, 2016.
    [57]
    S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in Proceedings of the 32nd International Conference on Machine Learning, Lille, France, pp. 448–456, 2015.
    [58]
    D. Ulyanov, A. Vedaldi, and V. Lempitsky, “Instance normalization: The missing ingredient for fast stylization,” arXiv preprint, arXiv: 1607.08022, 2017.
    [59]
    K. Y. Luo, T. Guan, L. L. Ju, et al., “Attention-aware multi-view stereo,” in Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, pp. 1587–1596, 2020.
    [60]
    C. R. Qi, H. Su, K. C. Mo, et al., “PointNet: Deep learning on point sets for 3D classification and segmentation,” in Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, pp. 77–85, 2017.
    [61]
    T. C. Sun, Z. X. Xu, X. M. Zhang, et al., “Light stage super-resolution: Continuous high-frequency relighting,” ACM Transactions on Graphics (TOG), vol. 39, no. 6, article no. 260, 2020 DOI: 10.1145/3414685.3417821
    [62]
    A. Vaswani, N. Shazeer, N. Parmar, et al., “Attention is all you need,” in Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, CA, USA, pp. 6000–6010, 2017.
    [63]
    T. Müller, A. Evans, C. Schied, et al., “Instant neural graphics primitives with a multiresolution hash encoding,” ACM Transactions on Graphics (TOG), vol. 41, no. 4, article no. 102, 2022 DOI: 10.1145/3528223.3530127
    [64]
    M. Tancik, E. Weber, E. Ng, et al., “Nerfstudio: A modular framework for neural radiance field development,” in ACM SIGGRAPH 2023 Conference Proceedings, Los Angeles, CA, USA, article no. 72, 2023.
    [65]
    R. Zhang, P. Isola, A. A. Efros, et al., “The unreasonable effectiveness of deep features as a perceptual metric,” in Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, pp. 586–595, 2018.