Predicting solar ion density, thermal speed and solar wind velocity vector.

High-Level Project Summary

We have completed several stages of data preparation and built baseline models for forecasting the solar wind. The algorithms used are already able to capture patterns in the data, which suggests there is potential for further, more in-depth work with the satellite data. Our work is a contribution to the ongoing attempts to find ways to predict solar wind peaks. We are sure that NASA has already approached this task in different ways, but so far these attempts have not been crowned with great success. The importance of our work lies in the fact that its results may prompt other researchers and enthusiasts to find a new and better approach to solving the problem. After all, that's how science works!

Detailed Project Description

The machine learning model we have built predicts important solar wind parameters from historical data obtained from two NASA satellites, DSCOVR and Wind. The forecast horizon is one hour: each time new data arrives, the model predicts the key indicators 1 hour ahead.
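In code, a one-hour forecast horizon amounts to pairing each hour's features with the *next* hour's target values. A minimal sketch with pandas (the column name and values are illustrative, not the actual dataset variables):

```python
import pandas as pd

# Hypothetical hourly-aggregated frame: one row per hour.
rng = pd.date_range("2022-01-01", periods=6, freq="h")
df = pd.DataFrame(
    {"proton_speed": [400.0, 410.0, 430.0, 500.0, 480.0, 450.0]},
    index=rng,
)

# To forecast 1 hour ahead, shift the target back by one step so each
# row holds the current features and the value observed one hour later.
df["proton_speed_t+1h"] = df["proton_speed"].shift(-1)

# The last row has no known future value and is dropped before training.
supervised = df.dropna()
print(supervised)
```

With this framing, any regression model trained on `supervised` is automatically a 1-hour-ahead forecaster.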

For clarity of understanding, we will describe step by step how exactly the pipeline we created works:


  1. Obtaining satellite data (in NASA's CDF format).
  2. Extracting the parameters needed for analysis and forecasting (for example, magnetic field magnitude and its standard deviation; see the final code for the full list).
  3. Bringing the data into tabular form (dask or pandas data frames) for further processing with more familiar tools.
  4. Aggregating the data by averaging within one hour. The raw measurements come at cadences ranging from milliseconds to several seconds or minutes, so the indicators in all data frames had to be brought to a single time scale. We did this by computing the mean and standard deviation of every indicator within each hour of data.
  5. We also tried Dynamic Time Warping (DTW). In theory, it should be performed on the raw (per-second) data, but those datasets were very large (more than 200 million rows), which our laptops could not handle. There was an attempt to perform this step on the aggregated data, but in the end we took a different approach (correlation; see below). The DTW stage is important because the data received from the two satellites is subject to a time offset (one satellite records the same structure later than the other), and this shift needed to be smoothed out.
  6. We decided to solve this problem in a different way: leave the data as it is and check its correlation with the predicted variables (velocity, temperature, density). From the full set of variables, we kept only those with the greatest predictive power across all targets.
  7. To enhance the predictive power of the data, we created additional time variables: day, month, week, and others. Testing showed that they, too, have a small correlation with the targets.
  8. Splitting the data into training and test sets, which is necessary to evaluate the performance of our model.
  9. Building pipelines for data processing: filling in missing values (since they usually degrade forecast quality and can often be imputed well) and scaling (since the features have different units, which is bad for regression algorithms).
  10. These were then combined with the forecasting model pipelines. As part of this hackathon, we used 3 models: MultiOutputRegressor, its combination with LGBMRegressor, and a TensorFlow network (whose results could not be fully obtained for technical reasons).
  11. The code also includes several graphs showing that the models are quite capable of capturing patterns, even though we have not yet done serious parameter tuning or feature engineering. These models, as we said, are baselines: they estimate a lower bound on the potential of the data.
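Steps 6 and 8-10 above can be sketched end-to-end on toy data. This is only an illustration of the approach: the feature names, targets, and correlation threshold are invented, and a Ridge regressor stands in for the LGBMRegressor used in the project to keep the example dependency-free:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.multioutput import MultiOutputRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
n = 400

# Toy stand-in for the hourly-aggregated satellite features
# (names are illustrative, not the actual CDF variable names).
X = pd.DataFrame({
    "b_mag_mean":  rng.normal(5.0, 1.0, n),
    "b_mag_std":   rng.normal(0.5, 0.1, n),
    "hour_of_day": np.tile(np.arange(24), n // 24 + 1)[:n].astype(float),
})
# Inject some missing values, as in the real data.
X.loc[X.sample(frac=0.05, random_state=0).index, "b_mag_std"] = np.nan

# Three synthetic targets, loosely driven by the features plus noise.
y = pd.DataFrame({
    "speed":   400 + 20 * X["b_mag_mean"] + rng.normal(0, 5, n),
    "density": 5 + 2 * X["b_mag_std"].fillna(0.5) + rng.normal(0, 0.5, n),
    "temp":    1e5 + 1e4 * X["b_mag_mean"] + rng.normal(0, 1e3, n),
})

# Step 6: keep only features correlated with at least one target.
corr = pd.DataFrame({t: X.corrwith(y[t]) for t in y.columns}).abs()
selected = corr.max(axis=1)[corr.max(axis=1) > 0.1].index.tolist()

# Step 8: chronological split (no shuffling, since this is a time series).
X_train, X_test, y_train, y_test = train_test_split(
    X[selected], y, test_size=0.2, shuffle=False)

# Steps 9-10: imputation + scaling + multi-output regression in one pipeline.
model = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale",  StandardScaler()),
    ("reg",    MultiOutputRegressor(Ridge())),
])
model.fit(X_train, y_train)
print("Average R^2 on held-out data:", model.score(X_test, y_test))
```

Swapping `Ridge()` for `LGBMRegressor()` inside the `MultiOutputRegressor` reproduces the gradient-boosted variant described above without changing anything else in the pipeline.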

Such a model, or rather a perfected version of it, could do earthlings a great service: warning us 1 hour in advance about an unusual increase in solar wind speed. People would then be able to prepare for a large solar wave and protect electronic devices on Earth and in orbit. Specifically, our project is another contribution (attempt) toward solving this complex problem. Having studied some of the scientific literature, we came to the conclusion that many of the authors predicted exactly those parameters that were proposed within the framework of this hackathon. Moreover, in general, they all used the same parameters in an attempt to improve existing approaches, rather than to find completely new ones (especially given the variety of data that NASA offers).

We evaluate our small contribution as one (or more :D) grains of sand in the sandcastle of science. We are very interested in continuing to explore this project and hope that other young enthusiasts will be able to learn something from it.

In the process of work, a large number of tools were used:


  1. Programming language - Python.
  2. Computing platform - Jupyter Notebook.
  3. Library for working with NASA data: spacepy.
  4. Extraction of data from the NASA website: os, urllib, BeautifulSoup, glob.
  5. Main packages for data manipulation and visualization: pandas, numpy, itertools, seaborn, matplotlib.
  6. Package for accelerated data manipulation: dask.
  7. Trying Dynamic Time Warping: tslearn, dtaidistance.
  8. Data processing, feature engineering, building pipelines: scikit-learn.
  9. Model building: lightgbm, scikit-learn, tensorflow (keras).
  10. Hardware: MacBook Pro (13-inch, 2020, two Thunderbolt 3 ports), 1.4 GHz quad-core Intel Core i5, 16 GB 2133 MHz LPDDR3 memory.


Space Agency Data

As part of the project, we used magnetic field data collected by DSCOVR and Wind satellites in 2022:


The datasets consist of 261, 174, and 238 files respectively, one file per day.

The first two datasets are similar to each other in many ways and contain a number of variables that we used to predict the speed, density, and temperature of solar protons. However, as noted in the Supplementary Materials for the assignment, the DSCOVR data is somewhat distorted (due to wear and tear of the equipment), which should be compensated for by the data from the Wind satellite, which is of better quality. We also needed to resolve the issue of comparability of the data over time, since the two satellites can sit at different distances from each other and record the same structure at different times. We tried to solve this problem with Dynamic Time Warping, but we were limited by the technical capabilities of our equipment and could not complete this part of the job at full resolution.
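For readers curious about the alignment step our hardware could not handle at full resolution, here is a from-scratch sketch of the Dynamic Time Warping idea on two tiny, artificially shifted series. The project itself used tslearn and dtaidistance; this toy version only illustrates how DTW absorbs a time offset between two satellites observing the same structure:

```python
import numpy as np

def dtw_path(a, b):
    """Classic dynamic-programming DTW between two 1-D series.

    Returns the accumulated-cost matrix and the optimal warping path
    as a list of (i, j) index pairs.
    """
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j],      # insertion
                                 cost[i, j - 1],      # deletion
                                 cost[i - 1, j - 1])  # match
    # Backtrack from (n, m) to (1, 1) to recover the alignment.
    path, i, j = [], n, m
    while (i, j) != (1, 1):
        path.append((i - 1, j - 1))
        steps = {(i - 1, j): cost[i - 1, j],
                 (i, j - 1): cost[i, j - 1],
                 (i - 1, j - 1): cost[i - 1, j - 1]}
        i, j = min(steps, key=steps.get)
    path.append((0, 0))
    return cost[1:, 1:], path[::-1]

# Toy example: "Wind" sees the same peak as "DSCOVR", two steps later.
dscovr = np.array([0.0, 0.0, 1.0, 3.0, 1.0, 0.0, 0.0, 0.0])
wind   = np.array([0.0, 0.0, 0.0, 0.0, 1.0, 3.0, 1.0, 0.0])
cost, path = dtw_path(dscovr, wind)
print("DTW distance:", cost[-1, -1])  # 0.0: the shift is fully absorbed
```

The quadratic cost matrix is exactly why this was infeasible on 200-million-row series: memory and time grow as the product of the two lengths, which motivated the correlation-based workaround described above.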

We extracted this data, studied it, divided it into training and test data, processed it, and then passed it on to the algorithm. All our work has been to derive value from this data.

Hackathon Journey

It is known that for many people, including us, thinking about space makes the NASA logo loom at the edges of consciousness. The association between these two concepts is so strong that any kind of contact with this organization evokes the same feeling of delight we have when we think about space or look at the starry sky. It was this feeling of delight that we experienced during these two days, as if we had come into more direct contact with space!

Our team consists of two machine learning enthusiasts, so our choice was automatically reduced to a few projects, among which the project we chose seemed both very interesting and feasible for our level of knowledge. However, everything turned out to be not so simple, especially for us, people without a serious mathematical and technical background! But we did not despair and turned it into a real research project with the study of all possible scientific materials, with the decoding of mathematical and physical symbols and formulas, debates in the middle of the night, converting our solutions into code, and much more!

We have learned a lot about space weather: how it is formed, what effect it has on our planet, how it is measured, what instruments are used, how these instruments work, where they are located, how they receive data, the complexities that arise for scientists in processing data from satellites and forecasting weather using these data. For two people with a humanitarian background, this experience was a complete shake-up!

The main feature of our approach was that we wanted, first of all, to understand the problem as deeply as possible: the whole process of obtaining the data, possible gaps in that process, what approaches already exist in this area, and to what extent they resemble what we did. The difficulty was that we had no previous experience in a similar field, so we had to study a huge amount of information while simultaneously working on the project. Nevertheless, it was very exciting!

We would like to thank NASA for not limiting its activities to the USA alone, but involving people from all over the world! This is especially important for a small country like ours, where the space sciences are not developed at all. We also admire how many opportunities NASA is opening up for exploring space and making our own contribution to this common cause. This is evidenced by the huge array of well-organized data that we got acquainted with in the course of our work.

We would also like to express our gratitude to the organizers of the hackathon in our country for the work done! Everything was at the highest level!

References

Instruments:


Exploring context and existing approaches:



Tags

#solar #solarwind #CarringtonEvent #ML #DSCOVR #Wind