Deep Solar ☀️

High-Level Project Summary

The project consists of an autoencoder (artificial neural network) pipeline packed into an interactive application that fetches data for a specific period, in CDF format, from NASA's Open Data portal, slightly adjusts the model's parameters, and identifies anomalies in the solar wind. More details about the model and the overall solution are provided in the final project link. Demo: https://rodolfoferro-deep-solar-main-gpclxz.streamlitapp.com/

Detailed Project Description

Welcome to Deep Solar! 👋

To see the application in action, please visit our deployed beta ➡️ https://rodolfoferro-deep-solar-main-gpclxz.streamlitapp.com/

The Challenge

If a major space weather event like the Carrington Event of 1859 were to occur today, the impacts to society could be devastating. The challenge is to develop a machine learning algorithm or neural network pipeline to correctly track changes in the peak solar wind speed and provide an early warning of the next potential Carrington-like event.

The Solution

The project consists of an autoencoder (artificial neural network) pipeline packed into an interactive application that fetches data for a specific period, in CDF format, from NASA's Open Data portal, slightly adjusts the model's parameters, and identifies anomalies in the solar wind. More details about the model used are provided in the corresponding section.

Check out our working model in the beta application section.

Anomaly Detection (Srivignesh, 2021)

Anomaly detection is the process of finding abnormalities in data. Abnormal data (anomalies) are data points that deviate significantly from the general behavior of the dataset. Applications of anomaly detection include fraud detection, fault detection, and intrusion detection. Anomaly detection is also referred to as outlier detection.

What is an anomaly? (Gavrilova, 2021)

Generally speaking, an anomaly is something that differs from a norm: a deviation, an exception. In software engineering, an anomaly is a rare occurrence or event that doesn't fit the pattern and therefore seems suspicious.

Common reasons for outliers are: data preprocessing errors, noise, fraud, and attacks.

Types of anomalies

Global outliers

When a data point takes a value far outside the range of all other data points in the dataset, it can be considered a global anomaly. In other words, it's a rare event.

Contextual outliers

An outlier is called contextual when its value doesn't correspond with what we expect to observe for a similar data point in the same context. Contexts are usually temporal, and the same value observed at a different time may not be an outlier.

Collective outliers

Collective outliers are a subset of data points that, taken together, deviate from the normal behavior of the data.
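As a quick toy illustration (not project code), a global outlier can be flagged with a simple z-score rule, which checks how many standard deviations a point lies from the mean:

```python
import numpy as np

rng = np.random.default_rng(42)
series = rng.normal(loc=400.0, scale=20.0, size=500)  # e.g., solar wind speeds in km/s
series[100] = 900.0  # inject a global outlier

# Global outliers: points more than 3 standard deviations from the mean.
z_scores = (series - series.mean()) / series.std()
global_outliers = np.where(np.abs(z_scores) > 3)[0]
print(global_outliers)  # expected: [100]
```

Contextual and collective outliers need more than a single global threshold, which is one reason a learned model such as an autoencoder is useful.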

What are anomaly detection methods?

Anomaly detection methods are used to identify outliers in a dataset. Depending on whether the data is fully labeled, partially labeled, or unlabeled, a supervised, semi-supervised, or unsupervised approach can be used.

Some of the most common anomaly detection methods are:


  • Local outlier factor (LOF)
  • K-nearest neighbors
  • Support vector machines
  • DBSCAN
  • Bayesian networks
  • Autoencoders

In our case, we will use an autoencoder to detect anomalies in the solar wind data.

Autoencoders

What are autoencoders? (Canziani, 2022)

Autoencoders are artificial neural networks, trained in an unsupervised manner, that aim to first learn encoded representations of our data and then generate the input data (as closely as possible) from the learned encoded representations. Thus, the output of an autoencoder is its prediction for the input.

The previous figure shows the architecture of a basic autoencoder. We start from the bottom with the input, which is passed through the encoder (an affine transformation). This produces the intermediate hidden layer, which is passed through the decoder (another affine transformation). This produces the output, which is our model's prediction/reconstruction of the input. We can represent the above network mathematically with the following equations:

$$\mathbf{h} = f(\mathbf{W_h}\,\mathbf{x} + \mathbf{b_h})$$

$$\hat{\mathbf{x}} = g(\mathbf{W_x}\,\mathbf{h} + \mathbf{b_x})$$

Where:

  • $\mathbf{x}$ is the input and $\hat{\mathbf{x}}$ is its reconstruction,
  • $\mathbf{h}$ is the hidden (encoded) representation,
  • $\mathbf{W_h}, \mathbf{b_h}$ and $\mathbf{W_x}, \mathbf{b_x}$ are the weights and biases of the encoder and decoder, respectively,
  • $f$ and $g$ are the activation functions applied after each affine transformation.
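As a minimal sketch of those two equations (illustrative NumPy only, not the project's trained model):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy dimensions: an 8-dimensional input compressed to a 3-dimensional hidden code.
d_input, d_hidden = 8, 3
rng = np.random.default_rng(0)
W_h, b_h = rng.normal(size=(d_hidden, d_input)), np.zeros(d_hidden)  # encoder
W_x, b_x = rng.normal(size=(d_input, d_hidden)), np.zeros(d_input)   # decoder

x = rng.normal(size=d_input)
h = sigmoid(W_h @ x + b_h)  # h = f(W_h x + b_h): encoded representation
x_hat = W_x @ h + b_x       # x_hat = g(W_x h + b_x), with g = identity here

reconstruction_error = float(np.mean((x - x_hat) ** 2))
```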

Why do we use autoencoders?

The primary applications of an autoencoder are anomaly detection and denoising. An autoencoder's task is to reconstruct data that lives on the manifold, i.e., given a data manifold, we want our autoencoder to reconstruct only inputs that exist on that manifold. We thus constrain the model to reconstruct things that were observed during training, so any variation present in new inputs tends to be removed, because the model is insensitive to those kinds of perturbations.

This is why autoencoders are used for anomaly detection in our project.

Results

Current method

According to Space Apps Challenge (2022), at present, the raw data are converted to physical units using a calibration table, summed over the sensor segments, and recast as phase space distribution functions. A variety of ad hoc operations are invoked in order to tweak the background subtraction, isolate the physically relevant sub-range in voltage, and remove transients. A Gaussian peak fit is then performed. The parameters of the Gaussian map directly to _(n, w, |v|)_. Finally, the ratios between the signal peak values are fed into a table lookup to estimate the flow angle and complete the velocity vector.
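To make the Gaussian peak fitting step concrete, here is a hedged sketch using `scipy.optimize.curve_fit` on synthetic data (the actual calibration and background-subtraction steps of the pipeline are not reproduced here):

```python
import numpy as np
from scipy.optimize import curve_fit

def gaussian(v, amplitude, v_peak, width):
    """Gaussian peak whose parameters map to (n, w, |v|) as described above."""
    return amplitude * np.exp(-((v - v_peak) ** 2) / (2.0 * width ** 2))

# Synthetic stand-in for a background-subtracted distribution over voltage steps.
v_grid = np.linspace(200.0, 800.0, 120)
rng = np.random.default_rng(1)
counts = gaussian(v_grid, 1e5, 450.0, 40.0) + rng.normal(0.0, 500.0, v_grid.size)

# Fit the peak; p0 is an initial guess for (amplitude, v_peak, width).
(amplitude, v_peak, width), _ = curve_fit(gaussian, v_grid, counts,
                                          p0=(counts.max(), 400.0, 50.0))
print(f"peak speed ~ {v_peak:.1f}, thermal width ~ {width:.1f}")
```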



Our approach

As mentioned in the previous sections, we are using autoencoders for:


  1. Denoising the data
  2. Detecting anomalies in the denoised data

Models

We trained two different models for our project. Both are denoising autoencoders.
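A denoising autoencoder is trained to map a corrupted input back to its clean version. A minimal sketch of how such training pairs can be built (the exact noise model used in the project is an assumption here):

```python
import numpy as np

def make_denoising_pairs(clean_data: np.ndarray, noise_std: float = 0.05, seed: int = 0):
    """Return (noisy_input, clean_target) pairs for denoising-autoencoder training."""
    rng = np.random.default_rng(seed)
    noisy_input = clean_data + rng.normal(0.0, noise_std, size=clean_data.shape)
    return noisy_input, clean_data  # the model learns to undo the corruption
```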


The idea was to explore different data features and train a model to reconstruct the original data. The reconstructed data is then compared with the original data, and the difference between them is used to find the anomalies in the data.
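In code, that comparison boils down to thresholding the reconstruction error. A hedged sketch (the `model` object stands for a trained Keras-style autoencoder, and the percentile threshold is an assumption, not the project's exact value):

```python
import numpy as np

def find_anomalies(model, data: np.ndarray, percentile: float = 99.5) -> np.ndarray:
    """Flag samples whose reconstruction error is unusually large."""
    reconstruction = model.predict(data)
    errors = np.mean((data - reconstruction) ** 2, axis=1)  # per-sample error
    threshold = np.percentile(errors, percentile)           # data-driven cutoff
    return errors > threshold                               # boolean anomaly mask
```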

Model 1 (Full features)

The first model was trained on the raw data. Since the data was not completely pre-processed, this model is omitted from the final report.


Further exploration of this model will be added in the future.

Model 2 (Compressed data representation)

The second model was trained on a compressed representation of the data. The raw data corresponding to coordinates in `BGSE` format was converted into a single real-valued number using the norm of the vector.


Notes:
  • BGSE: Geocentric Solar Ecliptic system. This has its X-axis pointing from the Earth toward the Sun and its Y-axis chosen to be in the ecliptic plane pointing towards dusk (thus opposing planetary motion). Its Z-axis is parallel to the ecliptic pole. Relative to an inertial system, this system has a yearly rotation.
  • BGSM: Geocentric Solar Magnetospheric system. This has its X-axis from the Earth to the Sun. The Y-axis is defined to be perpendicular to the Earth's magnetic dipole so that the X-Z plane contains the dipole axis. The positive Z-axis is chosen to be in the same sense as the northern magnetic pole. The difference between the GSM and GSE systems is simply a rotation about the X-axis.

This led us to a two-column dataset, with the first column corresponding to the mentioned norm of the coordinates and the second column corresponding to the BF1 values from the Wind dataset.
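A sketch of that compression step (the column names are assumptions based on the description above, not the dataset's actual field names):

```python
import numpy as np
import pandas as pd

def compress_bgse(df: pd.DataFrame) -> pd.DataFrame:
    """Collapse the BGSE vector components into their Euclidean norm,
    keeping BF1 as the second feature."""
    bgse_norm = np.sqrt(df["BGSE_x"] ** 2 + df["BGSE_y"] ** 2 + df["BGSE_z"] ** 2)
    return pd.DataFrame({"BGSE_norm": bgse_norm, "BF1": df["BF1"]})
```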

The autoencoder architecture can be seen as follows:

This model was trained with ~29.7 million datapoints covering a time window of one month, with a batch size of 64 for 20 epochs, using the MSLE loss. The training history can be seen as follows:
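For reference, a minimal Keras sketch of a dense autoencoder with the training settings just mentioned (batch size 64, 20 epochs, MSLE loss); the layer sizes are assumptions, since the exact architecture is the one shown in the figure above:

```python
import tensorflow as tf

# Two input features: the BGSE norm and BF1.
inputs = tf.keras.Input(shape=(2,))
encoded = tf.keras.layers.Dense(8, activation="relu")(inputs)      # encoder (sizes assumed)
bottleneck = tf.keras.layers.Dense(2, activation="relu")(encoded)  # compressed code
decoded = tf.keras.layers.Dense(8, activation="relu")(bottleneck)  # decoder
outputs = tf.keras.layers.Dense(2, activation="linear")(decoded)   # reconstruction

autoencoder = tf.keras.Model(inputs, outputs)
autoencoder.compile(optimizer="adam", loss="msle")  # mean squared logarithmic error

# x_noisy are the corrupted inputs and x_clean the targets, as described above:
# history = autoencoder.fit(x_noisy, x_clean, batch_size=64, epochs=20,
#                           validation_split=0.1)
```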

As seen in the training history, the model learned the data representation and was able to reconstruct the data with a low loss. The loss on the validation dataset after training was 5.7263e-08.


Notes:
  • The loss is calculated using the mean squared logarithmic error (see the formula below).
  • Despite the short training run, the model turned out to be efficient; it was trained for only 20 epochs, since 100 epochs would have taken around 18 hours.
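For completeness, the mean squared logarithmic error over $n$ samples is:

$$\mathrm{MSLE} = \frac{1}{n}\sum_{i=1}^{n}\left(\log(1 + y_i) - \log(1 + \hat{y}_i)\right)^2$$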

A test in which the trained model reconstructs a small time window of data from the most recent reported day (2022-09-17) can be seen in the following figure:

This trained model has been exported and is being used in this project's application, which can be found in the corresponding section.

Space Agency Data

The data was mainly provided by the Challenge.

In particular, we focused on using:


More details of the overall usable data can be seen here.

Hackathon Journey

Even though the time seemed insufficient, we developed an awesome project. Working with NASA's open data has been a great experience, and achieving great results made it even better.

If possible, we would like to keep exploring the data in a deeper and more robust way, and also extend the project's functionalities in order to help NASA's scientists anticipate another Carrington-like event!

Our team got the inspiration to work on this project primarily from the idea of using NASA's data and deep learning models to solve the challenge. We learned about the physical interpretation of the data gathered by the DSCOVR mission and how it could be integrated into a functional application.

We would like to thank @mike_from_smithsonian (his Discord tag) for all the support and help provided throughout; it is a titanic task to help thousands of people around the world!

References

All references were separated according to the project's sections.

About anomaly detection:


About autoencoders, implementations, and results:


Tags

#python, #data-science, #deep-learning, #autoencoders, #solar-data