Sri Vinayak

Aaruush Chennai | Earth Data Analysis Developers Wanted!

Earth Data Analysis

High-Level Project Summary

Importance: Earth data analysis helps us understand the Earth's changing environments and the natural distribution of its mineral, water, biota, and energy resources, and provides methods for predicting and mitigating the effects of geologic hazards such as earthquakes, volcanic eruptions, and floods.

Developing: Data provide a wealth of information to aid in our understanding of Earth's interrelated processes, in developing innovative solutions for real-world challenges, and in making data-based decisions.

Solving: Earth data scientists use programming languages like R and Python to analyze Earth and environmental data from data sources including satellites, drones, social media, field studies, and surveys.

Link to Final Project

https://doi.org/10.1080/20964471.2019.1611175

Link to Project "Demo"

https://doi.org/10.1080/20964471.2019.1611175

Detailed Project Description

Infrastructural support:

Most big Earth data analytical systems have already been, or are being, migrated to a cloud

computing environment for rapid prototyping, result sharing, and reproducible research

(Peng, 2011). Some choose the private cloud as it allows for full control (Doelitzscher,

Sulistio, Reich, Kuijs, & Wolf, 2011), but most adopt the public cloud where a third-party

cloud provider performs the updates and maintenance of computing resources (Varia &

Mathew, 2014). For example, Mapbox uses Landsat on Amazon Web Services to power

Landsat-live, a browser-based map that is constantly refreshed with the latest imagery

from the Landsat 8 satellite (Yang, Yu, Hu, Jiang, & Li, 2017a).

Cloud computing can support sustainable archiving, access to different computing node types, virtual desktops, and collaboration on data analytics. But for large-scale, tightly coupled big data analytics or modeling, high-performance computing is still the solution for modeling, colocation of computing and data, data assimilation, and inverse problems (Huang et al., 2013). For example, NASA has been planning to support up to 1.6 exabytes of climate data with a 0.75 km resolution and global coverage (Lee, 2018). This means integrating datasets from the global Goddard Earth Observing System Model (GEOS), the Global Modeling and Assimilation Office (GMAO), and other sources with sufficient computing and storage capacity to a) provide data/analytical/knowledge services, b) support artificial intelligence/machine learning/deep learning for inference, and c) engage PB-level data to support comprehensive analytics and data fusion.

Graphics processing unit (GPU) computing has boosted the simulation and analytics of Earth and space phenomena, demonstrating significant speedups over conventional central processing units (CPUs) (Madhukar, 2019). For example, the calculation of aerosol optical depth from Moderate Resolution Imaging Spectroradiometer (MODIS) satellite data can be 43 times faster on a GPU than on a CPU (Liu et al., 2016). Numerical simulation can also be accelerated using GPU computing. A large-scale simulation of seismic wave propagation on GPU was 45-fold faster than on CPU whilst maintaining accuracy (Okamoto, Takenaka, Nakamura, & Aoki, 2013).

Recent computing advancements also distribute some computing tasks to the edge of the infrastructure, for example, the smart things at the edge of the Internet of Things and the mobile devices of the mobile Internet. These approaches, termed mobile computing and edge computing, conduct early processing or preprocessing of data collected at the sensor side, provide end visualization, and facilitate user interaction.
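As a rough illustration of sensor-side preprocessing, the sketch below reduces a stream of raw readings to one summary record per fixed-size window before anything is sent upstream. The function name `summarize_readings`, the window size, and the sample temperatures are all hypothetical and are not taken from any system described here.

```python
from statistics import mean


def summarize_readings(readings, window_size):
    """Reduce raw sensor readings to one summary record per fixed-size window."""
    if window_size <= 0:
        raise ValueError("window_size must be positive")
    summaries = []
    for start in range(0, len(readings), window_size):
        window = readings[start:start + window_size]
        # Only these four numbers would leave the device, not the raw samples.
        summaries.append({
            "count": len(window),
            "min": min(window),
            "max": max(window),
            "mean": mean(window),
        })
    return summaries


# Hypothetical temperature samples collected on an edge device.
raw = [21.0, 21.5, 22.0, 30.0, 30.5, 31.0, 25.0]
edge_summaries = summarize_readings(raw, window_size=3)
```

Seven raw samples become three summary records here; the last window is simply shorter than the others.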

While the computing infrastructure powers big data analytics, the network and security infrastructure, as well as the monitoring, scheduling, managing, and integration infrastructure, enable the computing and analytics to be operated in a smooth, dynamic, safe, and easy-to-use fashion.

Data sources, ingestion, and store:

Another important module of the system architecture is the data store, which is responsible for archiving Earth data and providing access to it. Traditionally, Earth science data can be categorized into atmosphere, ocean, land, hydrology, and socio-economic data according to their disciplines (Acker & Leptoukh, 2007). New data sources in the Big Data era have expanded to include real-time location tracking, observations of the urban environment, and social media data from citizens (Mayer-Schönberger & Cukier, 2013).

Depending on the nature and usage of Earth data, they are traditionally stored in

a file system, a relational database, or a NoSQL database. For example, real-time location tracking

data are usually stored in a Relational Database Management System (RDBMS) (Tian,

Jiang, Chen, Li, & Mu, 2014). Several efforts have been made to store geospatial

coverages when structured as arrays with an array-based database as the coverages

are not well suited to traditional RDBMSs (Baumann, 2014).
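To make the relational case concrete, here is a minimal sketch of storing location tracking points in an RDBMS, using Python's built-in `sqlite3` module with an in-memory database. The table name `track_points`, its columns, the device identifier, and the coordinates are assumptions made for illustration only; a production system would more likely use a server database with spatial extensions.

```python
import sqlite3

# In-memory database so the sketch leaves no files behind.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE track_points (
        device_id   TEXT NOT NULL,
        observed_at TEXT NOT NULL,  -- ISO 8601 UTC, so text order equals time order
        lat         REAL NOT NULL,
        lon         REAL NOT NULL
    )
    """
)
# Index matching the typical query: one device, ordered by time.
conn.execute("CREATE INDEX idx_device_time ON track_points (device_id, observed_at)")

# Hypothetical observations for one tracked vehicle.
rows = [
    ("bus-7", "2019-05-01T10:00:00Z", 13.0827, 80.2707),
    ("bus-7", "2019-05-01T10:05:00Z", 13.0830, 80.2710),
    ("bus-9", "2019-05-01T10:01:00Z", 13.0500, 80.2500),
]
conn.executemany("INSERT INTO track_points VALUES (?, ?, ?, ?)", rows)
conn.commit()


def latest_position(connection, device_id):
    """Return (observed_at, lat, lon) of the newest point for a device, or None."""
    cursor = connection.execute(
        "SELECT observed_at, lat, lon FROM track_points "
        "WHERE device_id = ? ORDER BY observed_at DESC LIMIT 1",
        (device_id,),
    )
    return cursor.fetchone()
```

Calling `latest_position(conn, "bus-7")` returns the most recent row for that device. Gridded coverages do not fit this row-per-observation layout well, which is the motivation for the array databases mentioned above.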

Data discovery and analytics:

As a prior step to performing any data analytical tasks, traditional data discovery relies on open source technologies such as Solr and Elasticsearch (Nogueras-Iso, Zarazaga-Soria, Béjar, Álvarez, & Muro-Medrano, 2005). Metadata of these data are often stored in a full-text search engine (e.g. Apache Lucene) (Jiang, Yang, Xia, & Liu, 2016), which can be queried much like a web search engine such as Google. Recent endeavors have started to integrate smart capabilities, e.g. query understanding, ranking, and recommendation, based on artificial intelligence advancements (Jiang et al., 2018, 2017; Li, Goodchild, & Raskin, 2014; Wiegand & García, 2007). Common Earth data analytical functions range in complexity from simple numerical functions to raster and vector operations, visualization and exploration, and machine learning. More details of analytical functions will be reviewed in the next section.
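The idea behind full-text metadata search can be sketched with a toy inverted index. This is not the Solr, Elasticsearch, or Lucene API; it is a simplified stand-in that ranks records by how many query terms they contain. The dataset identifiers and descriptions are hypothetical.

```python
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it on anything that is not a letter or digit."""
    cleaned = "".join(ch.lower() if ch.isalnum() else " " for ch in text)
    return cleaned.split()


def build_index(records):
    """Map each token to the set of record ids whose metadata contains it."""
    index = defaultdict(set)
    for record_id, text in records.items():
        for token in tokenize(text):
            index[token].add(record_id)
    return index


def search(index, query):
    """Return record ids ranked by number of matching query terms, ties by id."""
    scores = defaultdict(int)
    for token in set(tokenize(query)):
        for record_id in index.get(token, ()):
            scores[record_id] += 1
    return sorted(scores, key=lambda rid: (-scores[rid], rid))


# Hypothetical metadata records.
metadata = {
    "ds1": "MODIS aerosol optical depth, daily, global",
    "ds2": "Landsat 8 surface reflectance imagery",
    "ds3": "MODIS land surface temperature",
}
metadata_index = build_index(metadata)
results = search(metadata_index, "MODIS surface")
```

Real search engines add relevance weighting, stemming, and the query understanding and recommendation features mentioned above; the sketch only shows the index-then-lookup structure they share.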

Distributed computing technologies are widely adopted across different existing

systems (Agrawal, Das, & El Abbadi, 2011) for big data analytics. Apache Spark and

Hadoop MapReduce are two typical open source distributed solutions for big data

analytics. The former is usually much faster as the latter reads and writes from disk

more often (Zaharia et al., 2012). For example, Li et al. (2016) proposed a workflow to

accelerate the Weblog mining process using Spark.
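The map, shuffle, and reduce phases that both frameworks build on can be sketched in plain Python on a single machine. This is not Spark or Hadoop code, and it is not the workflow of Li et al. (2016); the three-field log format and the sample lines are assumptions made for illustration.

```python
from collections import defaultdict


def map_phase(line):
    """Emit (path, 1) for a well-formed log line, or nothing for a malformed one."""
    parts = line.split()
    # Assumed log format: "<client> <path> <status>"
    if len(parts) != 3:
        return []
    return [(parts[1], 1)]


def shuffle(pairs):
    """Group all emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine the values of one key into a single count."""
    return key, sum(values)


def count_requests(lines):
    """Run map, shuffle, and reduce in sequence over an iterable of log lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


# Hypothetical web log lines; the third one is malformed and gets skipped.
log_lines = [
    "10.0.0.1 /data/modis 200",
    "10.0.0.2 /data/landsat 200",
    "garbled line",
    "10.0.0.1 /data/modis 404",
]
request_counts = count_requests(log_lines)
```

In a distributed setting the map and reduce calls run on many nodes in parallel. The speed difference noted above comes from where intermediate results such as `pairs` are kept: largely in memory for Spark, and on disk between stages for Hadoop MapReduce.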

Big Earth data analytics:

Big Earth data analytics include the analytical lifecycle of preparing, reducing, analyzing,

mining, and visualizing large amounts of spatiotemporal and spectral data, encompassing

a variety of data types (Kempler & Mathews, 2017). The volume, velocity, variety, and veracity

in the acquired data pose grand challenges in data processing for value (Yang et al., 2017a).

The analytical process enables the discovery of patterns, correlations, principles, knowledge

and other information for better understanding our Earth system and responding to problems

induced by global and regional changes (Bhattacharyya & Ivanova, 2017). The following

sections summarize the literature from different aspects of big Earth data analytics.

Data analytical methods:

After preprocessing, the main focus of data analytics is to reveal hidden patterns,

unknown correlations, and other useful information from a large volume of heterogeneous

data to facilitate Earth science study. Big Earth data analytics support all aspects

of Earth science research, such as hypothesis and data discovery-driven methods,

dynamical models, and goal driven decisions (Kempler & Mathews, 2017). The involved

methods can be categorized into model simulation and prediction, statistics, machine

learning, and deep learning.

Space Agency Data

Machine learning methods:

Evolving from artificial intelligence, machine learning methods develop models that are

based on characteristics and features learned from empirical data and can infer unknown

problems and discover unknown patterns (Sellars et al., 2013). Machine learning methods

generally have the advantage over traditional statistical methods in non-linear relationship

understanding, and this advantage can be leveraged to model high-dimensional and non-

linear data with complex interactions and missing values, which is particularly the case for

big Earth data (Thessen, 2016). Derived from statistical methods, regression, classification, and

clustering can also be used as machine learning methods; thus the exact division between

machine learning and statistical methods is not always clear. For example, Artificial Neural

Networks can produce regression on approximating and predicting ecological conditions

(Franceschini et al., 2019). Machine learning classifiers including Random Forest, Support

Vector Machines, and Bayesian Classifiers can produce the probability of an observation

belonging to a specific class of Earth process, such as landslide (Hong et al., 2016).

Clustering can group observations based on similarity, which is useful in detecting rare

events such as fire (Chakraborty & Paul, 2010; Khatami et al., 2017). Fuzzy inference and

some tree-based machine learning methods (e.g. Decision Tree) can extract a set of rules

from the observation to make predictions, such as forest cover and change (Sexton et al.,

2016).
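As one concrete instance of the clustering idea, the following is a minimal k-means sketch in plain Python. It is not the hybrid algorithm of the cited fire-detection work. The two-dimensional feature vectors are hypothetical, and taking the first k points as starting centroids is a simplification chosen to keep the example deterministic.

```python
import math


def kmeans(points, k, iterations=20):
    """Cluster points into k groups; returns (centroids, labels)."""
    if not 0 < k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    # Simplifying assumption: the first k points serve as initial centroids.
    centroids = [tuple(points[i]) for i in range(k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(point, centroids[c]))
            for point in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return centroids, labels


# Hypothetical observations: three near the origin, three near (10, 10).
observations = [
    (0.0, 0.0), (10.0, 10.0), (0.0, 1.0),
    (1.0, 0.0), (10.0, 11.0), (11.0, 10.0),
]
cluster_centroids, cluster_labels = kmeans(observations, k=2)
```

Observations that end up in a small or distant cluster are natural candidates for closer inspection as rare events. Practical work would normally rely on a tested library implementation with better initialization.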

Deep learning methods:

Deep learning methods, evolving from machine learning, offer unique capabilities in

extracting and presenting features at different and detailed levels from the Earth data

(Manning, 2015; LeCun, Bengio, & Hinton, 2015). These features and characteristics are

extremely important in Earth data classification and segmentation tasks. Due to its more

powerful expression and parameter optimization capability, deep learning has achieved

great performance in computer vision, natural language processing, recommendation

systems, and others (Collobert & Weston, 2008; Krizhevsky et al., 2012; Schmidhuber,

2015). For example, the deep convolutional neural networks (CNNs), e.g. AlexNet

(Krizhevsky et al., 2012), VGGNet (Chatfield et al., 2014), and PlacesNet (Zhou et al.,

2014), can achieve satisfactory results in classifying scenes from high resolution remote

sensing imagery into categories such as airport, bridge, desert, forest, and so on. Beyond

image classification, objects can be detected and segmented from Earth datasets using

deep learning techniques (Cimpoi et al., 2015; Girshick et al., 2014). Deep learning

methods can also help increase the computational efficiency of numerical simulations

(e.g. weather prediction) whilst maintaining reasonable accuracy (Wang et al., 2018).

We selected popular tools to analyze how they support different big Earth data

analytics and compared them (Table 3) from aspects of scalability, analytical methods,

programming languages, and graphical user interface (GUI).

Natural resources & environment:

Natural resources have been over-exploited by humankind, causing loss and degradation of habitats and depleting biological diversity (Smil, 2013). Human beings, especially marginalized and vulnerable communities, need to adapt to the rapidly changing environment and its corresponding adverse circumstances, which has drawn attention to natural resource conservation and the sustainable use of biological diversity (Collen et al., 2013). The capability to monitor the impact of biological diversity and global environmental change is crucial to designing effective adaptation and mitigation strategies to prevent further loss of natural resources (Pettorelli et al., 2014). This requires the scientific community to obtain datasets and assess the spatiotemporal changes in the distribution of atmospheric, ocean, and land surface conditions, and the distribution and function of natural resources. Big Earth data are the source for mapping the distribution of natural resources, especially over large areas, including forest cover change (Hansen et al., 2013), vegetation cover (Karnieli et al., 2013), and biodiversity dynamics (Jeltsch et al., 2013; Kuenzer et al., 2014).

Environmental pollution requires long-term monitoring and assessment with big Earth data. Satellite observations, for example, have been used to analyze European nighttime lights over 15 years, showing complex patterns of light pollution (Bennie et al., 2014), and provide insight into global long-term changes in air, water, and soil pollution (Fingas & Brown, 2014; Lehmann et al., 2015; Lin et al., 2015; Schmidt et al., 2015; Van Donkelaar et al., 2015).

Hackathon Journey

It was amazing to work with you. I know a lot of space-oriented theory, so thank you for this opportunity.

With regards

Sri Vinayak N.

References

Acker, J. G., & Leptoukh, G. (2007). Online analysis enhances use of NASA earth science data. Eos,

Transactions American Geophysical Union, 88(2), 14–17.

Agrawal, D., Das, S., & El Abbadi, A. (2011, March). Big data and cloud computing: Current state and

future opportunities. Proceedings of the 14th International Conference on Extending Database

Technology (pp. 530–533). Uppsala, Sweden: ACM.

Ahmad, A., Paul, A., Rathore, M., & Chang, H. (2016). An efficient multidimensional big data fusion

approach in machine-to-machine communication. ACM Transactions on Embedded Computing

Systems (TECS), 15(2), 39.

Alpaydin, E. (2014). Introduction to machine learning. Cambridge, MA: MIT press.

Apache (2017). The Science Data Analytics Platform (SDAP) proposal [online]. Retrieved from

https://wiki.apache.org/incubator/SDAPProposal

Asner, G. P., Knapp, D. E., Boardman, J., Green, R. O., Kennedy-Bowdoin, T., Eastwood, M., . . .

Field, C. B. (2012). Carnegie Airborne observatory-2: Increasing science data dimensionality via

high-fidelity multi-sensor fusion. Remote Sensing of Environment, 124, 454–465.

Bambacus, M., Yang, C. P., Leung, R. Y., Barbee, B., Nuth, J. A., Seery, B., . . . Xu, M. (2017). A Planetary

Defense Gateway for Smart Discovery of relevant Information for Decision Support.

Batty, M. (2007). Cities and complexity: Understanding cities with cellular automata, agent-based

models, and fractals. Cambridge, MA: The MIT press.

Baumann, P. (2014). Rasdaman: Array databases boost spatio-temporal analytics. Computing for

Geospatial Research and Application (COM. Geo), 2014 Fifth International Conference (p. 54).

Washington, DC.

Bendig, J., Bolten, A., & Bareth, G. (2012). Introducing a low-cost mini-UAV for thermal-and

multispectral-imaging. International Archives of the Photogrammetry, Remote Sensing and

Spatial Information Sciences, 39(B1), 345–349.

Bennie, J., Davies, T. W., Duffy, J. P., Inger, R., & Gaston, K. J. (2014). Contrasting trends in light

pollution across Europe based on satellite observed night time lights. Scientific Reports, 4, 3789.

Bernhardt, K. (2007). Agent-based modeling in transportation. Artificial Intelligence in

Transportation, 72(E-C113).

Bhattacharyya, S., & Ivanova, D. (2017). Scientific computing and big data analytics: Application in

climate science. In S. Mazumder, R. S. Bhadoria & G. C. Deka (Eds.), Distributed computing in big

data analytics (pp. 95–106). Cham: Springer.

Binkowski, F. S., & Roselle, S. J. (2003). Models-3 Community Multiscale Air Quality (CMAQ) model

aerosol component 1. Model description. Journal of Geophysical Research: Atmospheres, 108, D6.

Borradaile, G. J. (2013). Statistics of earth science data: Their distribution in time, space and orientation.

Berlin, Germany: Springer Science & Business Media.

Caldwell, P. M., Bretherton, C. S., Zelinka, M. D., Klein, S. A., Santer, B. D., & Sanderson, B. M. (2014).

Statistical significance of climate sensitivity predictors obtained by data mining. Geophysical

Research Letters, 41(5), 1803–1808.

Camara, G., Assis, L. F., Ribeiro, G., Ferreira, K. R., Llapa, E., & Vinhas, L. (2016, October). Big earth

observation data analytics: Matching requirements to system architectures. Proceedings of the

5th ACM SIGSPATIAL International Workshop on Analytics For Big Geospatial Data (pp. 1–6).

Burlingame, CA: ACM.

Candiago, S., Remondino, F., De Giglio, M., Dubbini, M., & Gattelli, M. (2015). Evaluating multispectral

images and vegetation indices for precision farming applications from UAV images.

Remote Sensing, 7(4), 4026–4047.

Chakraborty, I., & Paul, T. K. (2010, June). A hybrid clustering algorithm for fire detection in video

and analysis with color based thresholding method. In 2010 International Conference on

Advances in Computer Engineering (pp. 277–280). Bangalore, India: IEEE.

Chatfield, K., Simonyan, K., Vedaldi, A., & Zisserman, A. (2014, September 1–5). Return of the devil

in the details: Delving deep into convolutional nets. Proceedings of the British Machine Vision

Conference, Nottingham, UK.

Chini, M., Piscini, A., Cinti, F. R., Amici, S., Nappi, R., & DeMartini, P. M. (2013). The 2011 Tohoku

(Japan) Tsunami inundation and liquefaction investigated through optical, thermal, and SAR

data. IEEE Geoscience and Remote Sensing Letters, 10(2), 347–351.

Chun, B., & Guldmann, J. M. (2014). Spatial statistical analysis and simulation of the urban heat

island in high-density central cities. Landscape and Urban Planning, 125, 76–88.

Cimpoi, M., Maji, S., & Vedaldi, A. (2015, June 7–12). Deep filter banks for texture recognition and

segmentation. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition,

Boston, MA, USA. (pp. 3828–3836).

Collen, B., Pettorelli, N., Baillie, J. E., & Durant, S. M. (Eds.) (2013). Biodiversity monitoring and

conservation: Bridging the gap between global commitment and local action. Cambridge, UK:

John Wiley & Sons, Wiley-Blackwell.

Collobert, R., & Weston, J. (2008, July). A unified architecture for natural language processing: Deep

neural networks with multitask learning. In Proceedings of the 25th international conference on

Machine learning (pp. 160–167). Helsinki, Finland: ACM.

Courtier, P., Thépaut, J. N., & Hollingsworth, A. (1994). A strategy for operational implementation of

4D-Var, using an incremental approach. Quarterly Journal of the Royal Meteorological Society,

120(519), 1367–1387.

Cressie, N. (2015). Statistics for spatial data. Hoboken, NJ: John Wiley & Sons.

de Jong, R., de Bruin, S., Schaepman, M., & Dent, D. (2011). Quantitative mapping of global land

degradation using Earth observations. International Journal of Remote Sensing, 32(21),

6823–6853.

De Lannoy, G. J., Reichle, R. H., Arsenault, K. R., Houser, P. R., Kumar, S., Verhoest, N. E., &

Pauwels, V. R. (2012). Multiscale assimilation of advanced microwave scanning radiometer–