About PhD in Data Science
Program Description
The field of data science is emerging as a critical discipline with high relevance to economic growth and development. This doctoral training program established by AIMS will provide emerging African scientists the opportunity to conduct research at the forefront of data science, and work towards a PhD degree within a high-quality training program in Africa, in cooperation with institutions internationally.
The program will focus on theoretical foundations of data science as well as applications of data science to improve the daily lives of Africans. It is built on the understanding that modern approaches in data science require a combination of expertise spanning the areas of mathematics, statistics, computer science, and the applied sciences.
AIMS will be offering up to seven fully-funded PhD positions in this prestigious new doctoral program. The recruited students will be based in Rwanda at AIMS Rwanda, or any of the other AIMS centers, in partnership with universities and research institutions across Africa and globally. The program aims to train future change-makers, who will have an impact across academia, industry, education, and government.
Candidates can choose from a list of proposed research topics, and AIMS will assist in building a supervision team around these topics. Alternatively, candidates can suggest their own research topics, together with a proposed supervision team. Selected students will start in October.
Eligibility criteria
- Master’s degree (completed by Sept) in mathematics, statistics, computer science, engineering, physics or other relevant fields;
- Sufficient theoretical foundations evidenced by prior work (courses/thesis/other training);
- Qualification for pursuing research on the chosen topic, including relevant programming expertise;
- Research potential evidenced by academic performance and involvement in relevant academic activities;
- Motivation for pursuing a PhD by research in the suggested topic;
- Being an African national.
Summary
- Length of program: 4 years
- Fully funded (stipend, equipment, health insurance, relocation costs, conference attendance, direct cost to graduating institution such as tuition fees and registration fees)
- International supervision teams from well-known research institutions
- Research topics that push the boundaries of data science (focused on AI/ML and/or health)
- Program start: Oct
Application Information
Selection Process
The selection process is competitive, and conducted in two phases:
Phase 1:
The application must be submitted through the online submission form by midnight (Central Africa Time) on the deadline, 14 April 2023.
You will be asked to submit the following documents:
- Application Form
- CV (you can use your own format, but please make sure to cover the content mentioned in this template that applies to your case. The template.tex file also contains general tips on writing CVs at the top.)
- Transcripts (please submit both your undergraduate and your master's-level transcripts; additional transcripts can be submitted if relevant)
As part of the application form, you will be able to
- Select or suggest a research topic
- Name two academics (ideally senior researchers), who are willing to write a letter of support
- Indicate whether you are already admitted at a university or otherwise have future plans that you consider to combine with this program
Additionally, the application form will allow you to tell us more about your research interests, motivation for pursuing a PhD and future plans. Your answers to the following questions will be central to the selection process. We advise that you prepare your answers offline with care.
- What is your motivation to pursue a PhD? Here, you can also mention plans for your future career. (1500 characters)
- Which research directions are you most interested in and why? Justify why you are qualified to pursue research in this area. Here, you can also comment on your reason for choosing the research topics selected above (2500 characters).
As part of this application process, it is our goal to support your individual situation and work together to design a PhD program that fits your circumstances.
Remarks on submitting your application
- You will be asked to select one main research topic (or suggest your own); there is the possibility to indicate a second topic choice (or suggest your own).
- If you propose your own topic, make sure you prepare the topic description with care (max 2000 characters). Have a look at the proposed TOPIC DESCRIPTION as a guidance for style and detail. Most importantly, the topic description should explain what exactly you will do as part of your PhD, including the methods used, the technical skills you need, and if data is needed, which data sets you are planning to use and how you will get access to them. From your description, it should be possible for us to assess if the proposed research topic is
- novel and interesting to the international research community,
- realistically achievable in the context of a PhD,
- matching with your skill background and with the skill background of your proposed supervisors,
- equipping you with the expertise to proceed into a successful future career,
- fitting to the scope of our PhD program.
- Make sure that you satisfy the required background and skills when selecting or proposing a research topic.
- Arrangements are flexible, and we can work with selected candidates to adapt to individual circumstances.
- No reference letters are required in Phase 1 of the application process. Please notify your referees that you are submitting their names as part of this application, as they may be contacted for letters or additional information as part of the selection process.
- Make sure that your CV covers, as a minimum, all items provided in this template (if they apply to your situation).
Phase 2:
A small number of shortlisted candidates will be invited to round 2 of the selection process. In Phase 2, supervision teams are formed and applicants discuss with potential supervisors more details on their research plans. As part of Phase 2, applicants are asked to submit a detailed research proposal.
Women applicants are encouraged to apply.
PhD in Data Science at QLA
The new Doctoral Training Program in Data Science (DTP-DS) is established by Quantum Leap Africa (QLA) at the African Institute for Mathematical Sciences (AIMS) Rwanda in collaboration with top researchers from across the globe. Here, you can find more information on the following:
Ph.D. streams
The second cohort of the DTP-DS (start 2022) will offer two streams:
- General Data Science,
- Data Science for Health in Rwanda.
General Data Science: We have several Ph.D. positions available for topics in mathematics, statistics, computer science, and the applied sciences that broadly fit under the umbrella of data science. These QLA positions are funded in partnership with the African Institute for Mathematical Sciences, The Government of Rwanda, The Carnegie Corporation of New York, and DeepMind Technologies Limited. Two (2) fully-funded Ph.D. positions offered in partnership with DeepMind Technologies Limited are putting a particular focus on data science topics in the fields of artificial intelligence and machine learning.
Data Science for Health in Rwanda: Four (4) fully funded PhD positions are offered in partnership with University of Rwanda (UR, Rwanda), AIMS Rwanda (Rwanda) and Washington University in St. Louis (State of Missouri, United States of America), as part of the grant Harnessing Data Science for Health Discovery and Innovation in Africa Research Training Program. This program focuses on the following three major scientific areas:
- computer science & informatics
- statistics & mathematics
- biomedical sciences & public health
A high prevalence of communicable diseases, coupled with a rapidly expanding epidemic of non-communicable diseases (NCDs), forecasts a perfect public health storm in Rwanda, providing impetus and rationale to leverage data science to address health care gaps. A structured program design will help develop trainee research careers in data science with particular focus on health care topics relevant to Rwanda, including communicable diseases (e.g., HIV, malaria, COVID-19) and chronic NCDs (e.g., hypertension, heart disease, diabetes).
Enrollment & Graduation
General Data Science: Candidates that are accepted into the program will be enrolled in two institutions:
- One of the five AIMS Centers of Excellence (Rwanda, Ghana, Cameroon, Senegal, South Africa)
- A higher education institution (generally in Africa) partnering with AIMS
Candidates will need to be ordinarily resident in an African country, and satisfy the degree requirements for a PhD in Research of their graduating institution, as well as the program requirements of the AIMS DTP-DS. The PhD degree will be conferred by the partnering institution upon successful completion; QLA is facilitating the creation of international co-supervision partnerships, is providing funding through global partnerships, and offers additional training in research skills and transferable skills to the PhD candidates.
Data Science for Health in Rwanda: This stream is restricted to Rwandan nationals. Successful candidates will be enrolled at the University of Rwanda and work on topics relevant to Rwanda.
Supervision
General Data Science: Candidates are mentored by a supervision team of 2-4 supervisors, forming a partnership between AIMS and higher education institutions in Africa and internationally. Each supervision team should consist of at least one supervisor affiliated with AIMS, and one supervisor affiliated with the graduating institution. These rules are flexible and details can be discussed and adjusted on a case-by-case basis.
Data Science for Health in Rwanda: Candidates are mentored by a supervision team where one supervisor is affiliated with the University of Rwanda. Additional supervisors can be affiliated with AIMS, Washington University in St. Louis or other institutions in Africa and internationally as suitable.
The supervision team will be formed during Phase 2 of the application process in communication with shortlisted candidates, the DTP-DS management board, and potential supervisors. Candidates have the possibility to suggest their own supervision team.
Research topics
Applicants can select from a list of research topics suggested by leading researchers in their field. Alternatively, applicants are welcome to suggest their own research topics. Shortlisted candidates will be put in touch with the supervision teams that proposed their selected topics for discussions on more concrete research ideas in Phase 2 of the application process. For more details on the selection of research topics, click Research topics.
Training Components
All candidates are invited to participate in an intensive training school in the first year of the program, organized by QLA. Here, candidates will acquire skills relevant to their research and broaden their subject knowledge in data science through a small number of intensive core courses taught by top international researchers.
The program plans to provide continuous training opportunities virtually and/or in person. Additional training components may include (but are not limited to):
- Guided seminars and reading groups
- Participation in transferable skills courses (academic writing, presentation skills, research methodology)
- Group projects / mini dissertations
- Designing and delivering a mini-course (senior PhD students)
- 3 Minute Thesis Competition (senior PhD students)
- Tutoring in the AIMS structured master's program (senior PhD students)
PhD candidates are encouraged to pursue internships in industry or external institutions towards the end of their PhD in a field related to their research topic, depending on sufficient progress on their dissertation.
2023 RESEARCH TOPICS
To become a PhD student as part of this doctoral training program, you have two options:
- Work on one of the research topics suggested by the program,
- Suggest your own research topic.
Working on a proposed topic
For the first option, please consult the research topics posted below. Each research topic already comes with a supervision team. The top five applicants for each topic will be shortlisted and then put in touch with the supervision team directly. In the next step, supervision teams will select those candidates with whom they would like to move into Phase 2. In Phase 2, the supervision team and the selected shortlisted candidates together prepare a detailed research proposal. This will also give you the opportunity to discuss the precise focus of your chosen research topic with your supervision team, discuss whether you and your supervision team want to add further supervisors, and decide at which university you plan to register and graduate for the PhD.
When shortlisting candidates for the proposed topics, we will evaluate how well your academic background and skills fit the topic you are applying to. Please read the description and required background of the topic you are applying to in detail, and make sure your application matches the research direction.
Suggesting your own research topic
To suggest your own research topic, simply select ‘Other’ in the application form. A box will open where you can specify a title and a topic description as well as your supervisors. You will also have to classify in which stream your topic best fits.
When suggesting your own research topic, make sure that you have put detailed thought into the research direction you are suggesting. In the end, this is the problem you may spend 3-4 years working on full-time. Your research plans should be advanced enough that it is clear that your research can start immediately and has a high chance of success. Prepare your topic description with care; note that the application form restricts the topic description to a maximum of 2000 characters. Most importantly, the topic description should explain what exactly you will do as part of your PhD, including the methods used, the technical skills you need, the novel research contribution it provides, and, if data is needed, which data sets you are planning to use and how you will get access to them. Be precise and to the point; do not spend too many characters on general introduction and motivation. From your description, it should be possible for us to assess if the proposed research topic is
- novel and interesting to the international research community,
- realistically achievable in the context of a PhD,
- matching with your skill background and with the skill background of your proposed supervisors,
- equipping you with the expertise to proceed into a successful future career,
- fitting to the scope of our PhD program.
When proposing your own topic, you will also have to build your own supervision team. Make sure your supervisors are aware you are planning to work with them and agree to supervise you before mentioning their names. If you are shortlisted, we may also be able to help you find additional supervisors if needed. At the end of Phase 2, your application will be evaluated as a package, including how well your supervisors and your background are suited to advance the research you propose overall.
Proposed topics are listed below. Click on a topic to view its description.
Topics:
- Network-Based Differential Equations for Public Health Applications
- Anomaly hunting: search for the unexpected in the largest ever physics dataset
- Deep learning and Bioacoustics
- Efficient machine learning techniques for log determinant estimation
- Genetic informatics in oncology
- Deep Generative Models in non-Euclidean Spaces
- Mathematical and AI/ML approaches in Oncology
- Exploring the Use of Differential Equations for Representation Learning in Knowledge Graphs
- Detection of Early Warning Signs in Autism Spectrum Disorders in young children
- An integrated AI and Simulation Framework towards Digital Twin Driven Engineering of Industrial Systems
- Artificial Intelligence for Student Monitoring and Shaping Online Learning
- Computational Modelling and Machine Learning for Materials Discovery
- Dynamic time-to-event models for life risk quantification.
- Constrained Machine Learning Models for risk quantification
Topic description and required background:
Genetic informatics in oncology
Topic Description
This Ph.D. research plan involves exploratory, discovery-driven research as part of a medical informatics research group. The research will involve developing informatics algorithms that interrogate -omics data from laboratory models and cancer patient samples. In collaboration with medical researchers at the University of Minnesota, the student will identify a research question that is clinically significant and obtain relevant datasets. The particular mathematical tools used will depend on the data and research question. Information about the research group may be found here: https://sites.google.com/umn.edu/hwang-lab-website/home
Examples of types of research questions:
- Are there relations between gene expression levels and cancer outcomes? Do these relations involve single genes, or groups of genes?
- Are there measurement or other biases that are corrupting the data?
- Can data across different cancers and different measurement techniques be combined?
As an example of similar research in a related area, see: Ekaterina Smirnova, Snehalata Huzurbazar, Farhad Jafari, PERFect: PERmutation Filtering test for microbiome data, Biostatistics, Volume 20, Issue 4, October 2019, Pages 615–631, https://doi.org/10.1093/biostatistics/kxy020
Background & Skills
Graduate-level linear algebra, statistics, and probability. Programming experience in R and Python.
Network-Based Differential Equations for Public Health Applications
Topic Description
Spatiotemporal network data has many public health and medical applications across geographic regions and patient social networks. Forman-Ricci curvature/flow and other stochastic differential equation models (such as SIR models) have improved recent understanding of epidemic evolution and spread, as well as identified vulnerable areas of networks both in industry and in academic settings. Our student will compare existing methods, such as SIR and Forman-Ricci flow models, to network-based models that our student will develop (such as Hodgkin-Huxley models to identify epidemic waves/breaking points or compartmentalized SIR models based on network hubs). This will both add to the understanding of how and when these tools are most useful in real-world settings and allow for the comparison of methods to see which will work best for spotting transition points in new epidemics. Datasets will include open-source Humanitarian Exchange data on recent epidemics (such as the 2018 Ebola outbreak) and data collected from Dr. Corey Amann’s technical platform. Our student would also generate a theoretical outbreak scenario with the tools developed and a network constructed from Rwandan travel networks between major cities.
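The SIR baseline mentioned above can be made concrete with a minimal sketch. It assumes the classic deterministic, well-mixed SIR model (not the network-based or stochastic variants the project will develop); the function name `simulate_sir`, the parameter values, and the forward-Euler integrator are illustrative assumptions and are not taken from the topic description.

```python
import numpy as np


def simulate_sir(beta, gamma, s0, i0, r0, days, dt=0.1):
    """Integrate the classic SIR equations with forward-Euler steps.

    dS/dt = -beta * S * I / N
    dI/dt =  beta * S * I / N - gamma * I
    dR/dt =  gamma * I
    Returns an array of shape (steps + 1, 3) with columns S, I, R.
    """
    n = s0 + i0 + r0                      # total population, constant over time
    steps = int(round(days / dt))
    s, i, r = float(s0), float(i0), float(r0)
    history = np.empty((steps + 1, 3))
    history[0] = (s, i, r)
    for k in range(1, steps + 1):
        new_infections = beta * s * i / n * dt
        new_recoveries = gamma * i * dt
        s -= new_infections
        i += new_infections - new_recoveries
        r += new_recoveries
        history[k] = (s, i, r)
    return history


# Hypothetical parameters: basic reproduction number beta / gamma = 3.
trajectory = simulate_sir(beta=0.3, gamma=0.1, s0=990, i0=10, r0=0, days=160)
peak_step = int(np.argmax(trajectory[:, 1]))   # index of the infection peak
```

The infection peak located by `peak_step` is one simple notion of a transition point; a network-based model would replace the single well-mixed population with per-node compartments coupled along edges.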
Background & Skills
Strong undergraduate/masters background in differential equations and/or network science (both would be great). In particular, familiarity with network geometry and epidemic-based differential equation models (such as SIR models) would be preferable. The candidate should have at least a publication or conference presentation on these topics and should be eager to continue their research on our topic. Some background knowledge or experience with public health or health behavior data is a plus. A social science background would also be an asset, as our applications will focus on behavioral and public health topics. Because our team is split between academia and industry, an entrepreneurial mindset and prior industry work would be an asset. Above all, we want a student eager to make a difference in his/her community with these new tools.
Anomaly hunting: search for the unexpected in the largest ever physics dataset
Topic Description
High energy physics (HEP) studies the fundamental building blocks of nature. HEP was one of the first scientific disciplines to adopt machine learning (ML), and these techniques played an instrumental role in the discovery of the Higgs boson, which has been the most prominent success of the Large Hadron Collider (LHC) so far. A key requirement on the way to a new physics discovery is handling the huge amount of complex experimental data collected at the LHC. Today, the rapid increase in available computing power has unlocked new ML algorithms beyond the Boosted Decision Trees traditionally used in HEP. For example, deep neural networks, adversarial training schemes, and anomaly detection methods are becoming increasingly common, not just in the rejection of experimental backgrounds, but also in calibration, simulation, and re-interpretation of results. The advantage of anomaly detection methods is that they are not model dependent, but can instead use the collider data directly. This project will use state-of-the-art anomaly detection methods to dig through the unprecedented amount of data collected at the LHC to look for signs of dark matter, a mysterious and as yet undiscovered component of our universe.
However, the big data problem is hardly unique to particle physics. There are extensive use cases in financial transactions, fleet and traffic management, predictive maintenance, and medical diagnosis, to name just a few. The synergy between the LHC and industry provides a perfect ground to hone and develop these skills, which are portable across scientific disciplines and beyond. We will model this project on the successful SMARTHEP (https://www.smarthep.org/) programme in Europe, of which the co-supervisor, Prof. Caterina Doglioni, is a founding member.
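To illustrate what "not model dependent" means in practice, here is a minimal sketch of an unsupervised anomaly score. It is a generic stand-in, not the method used at the LHC or proposed by the supervisors: it assumes the bulk of the data is roughly Gaussian and flags events by squared Mahalanobis distance. The helper names `fit_reference` and `anomaly_scores`, the synthetic data, and the 99% threshold are all assumptions made for illustration.

```python
import numpy as np


def fit_reference(data):
    """Estimate the mean and inverse covariance of background-dominated data."""
    mean = data.mean(axis=0)
    cov = np.cov(data, rowvar=False)
    return mean, np.linalg.inv(cov)


def anomaly_scores(points, mean, inv_cov):
    """Squared Mahalanobis distance of each row of `points` from the reference."""
    diff = points - mean
    return np.einsum("ij,jk,ik->i", diff, inv_cov, diff)


# Synthetic stand-in for background events with three features.
rng = np.random.default_rng(1)
background = rng.normal(0.0, 1.0, size=(5000, 3))

mean, inv_cov = fit_reference(background)
# Flag anything scoring above the 99th percentile of the background itself.
threshold = np.quantile(anomaly_scores(background, mean, inv_cov), 0.99)

candidates = np.array([[0.1, -0.2, 0.3],    # typical event
                       [6.0, 6.0, 6.0]])    # far from the bulk
flags = anomaly_scores(candidates, mean, inv_cov) > threshold
```

No signal hypothesis enters the score: only the data define what counts as typical. Deep-learning approaches such as autoencoders follow the same pattern, with a learned reconstruction error in place of the distance.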
Background & Skills
A master's qualification in physics and good knowledge of Python/C++ are preferred. Experience in particle physics, data analysis, and machine learning algorithms will be beneficial.
Deep learning and Bioacoustics
Topic Description
A recent report issued by the WWF states that there has been a catastrophic decline in wildlife populations in recent years. A large number of species are threatened with extinction due to a number of factors. Certain species have been on the IUCN Red List for several years, but further conservation efforts are still urgently required to ensure the survival of the remaining individuals. Studying endangered species while limiting disturbance generally requires long-term remote monitoring of an environment to identify the presence of the species of interest. To this end, Passive Acoustic Monitoring (PAM), the study of animal vocalisations by deploying acoustic recorders, has been widely used to study animal population dynamics in a minimally intrusive manner. In a field survey, ecologists typically deploy several acoustic recording devices over long periods of time. A survey thus generates an enormous amount of audio data, much of which does not contain sounds of interest. This is particularly the case when monitoring endangered animals, for which the population size might be very small and, consequently, few vocalisation events may have been recorded. A large percentage of the audio will instead either contain sounds that are not particularly useful or contain no sound at all (e.g., silence). In recent years, machine learning has been applied to PAM for various species, including a number of endangered ones. The ecology community is adopting machine learning techniques, but at a slow pace. Machine learning, on the other hand, is developing rapidly, with new algorithms being proposed at a fast pace. This thesis will investigate how advances in deep learning can be applied to bioacoustics problems, in particular how the sounds of animals can be automatically detected.
It will explore and adapt recent advancements in deep learning, create new techniques where they are lacking, and take inspiration from recent advances in computer vision and speech recognition, with the goal of advancing PAM. The outcome of the thesis will contribute to, and disrupt, how ecologists analyse data for endangered animals. The outcomes will thus enable large strides in how machine learning is applied to PAM, which in turn will have practical and societal outcomes related to the Sustainable Development Goals and conservation.
Relevant literature:
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8944344/
https://www.sciencedirect.com/science/article/pii/S1574954122001388
Background & Skills
In short: the candidate should have a background in computer science and/or algorithm design, programming skills, and existing knowledge of machine learning, more specifically projects (non-tutorial projects) in deep neural networks.
For more details, see here. The candidate should have a strong foundation in Python programming: the ability to write functions and classes, to write clean, well-documented code, and to read existing code (e.g., from GitHub) comfortably. The candidate should be well versed in NumPy and have experience with machine learning libraries such as scikit-learn. The candidate should be able to write code independently and to design algorithms on paper that are then translated into Python code. The candidate should be able to switch comfortably between programming languages should the project require a change in language. An ideal candidate should be comfortable with designing and proposing new algorithms from scratch, or be willing to extend existing algorithms. The candidate should be comfortable with synthesising literature.
An ideal candidate should have an interest in science communication and facilitating the communication of their research publicly via seminars, journal club, or social media such as Twitter, Youtube and blog posts. The ability to take initiative and overall desire to “use machine learning for good” is the attitude which we are seeking. The candidate should be comfortable with communication in general, be it via spoken language or written English.
Optional but advantageous skills include: knowledge of object-oriented programming, algorithm design, experience with TensorFlow or PyTorch, a proven GitHub record of existing projects, and existing blog posts about the student's work that demonstrate their skills in science communication. A candidate who exhibits a lot of energy and passion for advancing machine learning for ecological outcomes and for science communication would be advantageous. Experience with LaTeX could be advantageous. Experience in deep learning (TensorFlow 2 preferably, other libraries welcome too) would be advantageous.
The candidate will be joining the machine learning for ecology research group at AIMS South Africa and will thus attend journal clubs and present their research there, engage with existing researchers, be driven to establish a small network of their own by reaching out to other scientists globally, and be excited to contribute new ideas to the research group and the community as a whole. The candidate will sit in the group's room and thus will be in an exciting environment with like-minded scientists. The candidate should be willing and excited to go out in the field and capture data (typically acoustic data) to support their work. This would primarily be within or around Cape Town, South Africa. The ideas proposed and developed by the student will become open access and should be easy to use by non-programmers (potentially ecologists), so that the outputs are usable and of high impact.
Efficient machine learning techniques for log determinant estimation
Topic Description
In subsurface modelling, uncertainties usually appear in seismic data and permeability. It is common to assume that the permeability k and the seismic AVA reflection are lognormally distributed. To quantify the uncertainties, it is usual to seek the optimum permeability or the optimum seismic AVA, which leads to estimating hyperparameters that maximize the marginal likelihood. In the Bayesian inference (machine learning) setting, this requires computing the log determinant of the precision matrix. The method of choice depends on the dimension of the problem as well as the structure of the covariance. In low dimensions, the method of choice is based on Cholesky factorization. In high dimensions, however, computing the Cholesky factor may be prohibitive due to memory limitations. Developing methods that are comparable in efficiency to low-dimensional Cholesky factorization, both for sampling Gaussian distributions and for easing the computation of the log determinant of the precision matrix, is currently a hot research topic. Iterative methods adapting ideas from numerical linear algebra are currently the most promising route. Although Krylov subspace techniques have been shown to be efficient, their efficiency strongly depends on the structure of the precision matrix, and the orthogonalisation processes in Krylov subspace techniques may be computationally very costly. Our goal will be to propose an alternative technique based on Leja points, where the implementation relies only on matrix multiplications and a rough estimate of bounds on the eigenvalues of the precision matrix.
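For reference, the low-dimensional Cholesky baseline described above can be sketched in a few lines. This is only the standard dense approach the project aims to move beyond, not the proposed Leja-point technique; the function name `logdet_cholesky` and the random test matrix are illustrative assumptions.

```python
import numpy as np


def logdet_cholesky(precision):
    """Return log det(Q) for a symmetric positive-definite matrix Q.

    With Q = L L^T and L lower-triangular, det(Q) = prod(diag(L))**2,
    so log det(Q) = 2 * sum(log(diag(L))).
    """
    chol = np.linalg.cholesky(precision)
    return 2.0 * np.sum(np.log(np.diag(chol)))


# Hypothetical precision matrix, symmetric positive-definite by construction.
rng = np.random.default_rng(0)
a = rng.standard_normal((50, 50))
precision = a @ a.T + 50.0 * np.eye(50)

value = logdet_cholesky(precision)
# Cross-check against NumPy's LU-based log determinant.
sign, reference = np.linalg.slogdet(precision)
```

Working with the logarithms of the diagonal avoids the overflow or underflow that forming the determinant directly would cause. The dense factor needs memory quadratic in the dimension, which is the limitation in high dimensions noted above.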
Background & Skills
Statistics, numerical analysis, linear algebra
Deep Generative Models in non-Euclidean Spaces
Topic Description
Most of the successful deep learning methods, such as Deep Convolutional Neural Networks (DCNNs), are based on classical image processing models, which limits their applicability to data with an underlying Euclidean grid-like structure, e.g., 2D/3D images or audio signals. Non-Euclidean (graph- or manifold-structured) data are becoming increasingly abundant; prominent examples include social networks, molecular graphs, interactomes, and 3D meshes and point clouds in computer graphics. Several Geometric Deep Learning solutions have been developed in the last few years with the idea of extending convolution-like operations to non-Euclidean domains represented by point clouds, meshes, or graphs. Most of these solutions aim at generalizing convolutions so as to replicate operations that proved successful in the image domain. However, extending the DCNN architecture to graphs, meshes, and point clouds remains a complicated task. Indeed, the group of transformations acting on 3D data is more complex than for 2D images, as 3D entities are often transformed by arbitrary 3D translations and rotations when observed. In this project, we will develop an equivariant network for such data. In addition, little attention has been paid to the choice of loss function, which is typically based on a distance evaluated in the Euclidean domain. We believe that embedding new metrics based on the induced surface metrics can result in more natural measures on non-linear domains and therefore in better training of learning models. Finally, we will develop generative models in non-Euclidean domains such as meshes and graphs.
Background & Skills
Master’s in Mathematics/Applied Mathematics/Computer Science. Skills: Partial Differential Equations; Optimization; Differential Geometry; Statistical Learning; Numerical Analysis; Python/Matlab programming skills; knowledge of Deep Learning and Generative Models will be beneficial.
Topic Description
Current AML (acute myeloid leukemia) treatment protocols have not significantly changed over the past five decades and depend largely on combination chemotherapy and subsequent hematopoietic stem cell transplantation (HSCT). In more recent years, great hopes have been placed on novel AML therapies specifically targeting recurrent genetic and phenotypic abnormalities, but the spectrum of their longer-term efficacy on the various molecular subtypes of leukemia is still unclear, and so far they have fallen short of expectations.
Despite the efficacy of current treatments, the vast majority of AML patients will experience a relapse due to the persistence of chemoresistant leukemic cells (MRD, minimal residual disease). Collaborators of the PhD team have recently shown that ORAI calcium channels (calcium release-activated calcium channel proteins) could play a role in the development of this chemoresistance. Furthermore, studies have shown that residual malignant cells can persist under the control of a specific T cell immune response, and that the immunological synapse (IS) between T cells and their autologous blasts can be defective. However, no studies have been performed to decipher the calcium signature at the IS level in a clinical or tumor dormancy context.
The thesis falls within mathematical and computational modeling in oncology. Specifically, the project aims to develop statistical and machine learning models and strategies able to identify new AML treatment protocols.
The first objective is to develop AI/ML strategies to predict and prevent relapse by performing an unbiased ex vivo drug screen of standard and targeted therapeutic strategies in primary patient AML cells, and to gain a better understanding of intra-AML evolution under therapy.
The data available for this objective are ex vivo sensitivity data for n diagnostic AML patients, tested for at least d molecules (the maximum number is 17). For each molecule, AML single-cell responses are recorded at p different increasing drug percentages. For each AML, we then have a data matrix of dimension p × d. Each AML experiment is repeated twice. The number of tested molecules varies from one patient to another. Our data architecture consists of a three-dimensional panel where d items (molecules) are measured for each of n individuals (patients) over p drug percentages.
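One possible way to hold such a panel in memory is sketched below: a nested mapping from patient to molecule to replicate curves, from which a per-patient p × d matrix is built. The patient and molecule names, the response values, and the choice to average the two replicates are all illustrative assumptions, not a description of the actual dataset or preprocessing.

```python
from statistics import mean

# Hypothetical toy panel: patient -> molecule -> two replicates,
# each replicate being the responses at p = 3 increasing drug percentages.
panel = {
    "patient_1": {
        "molecule_A": [[0.95, 0.80, 0.40], [0.93, 0.78, 0.44]],
        "molecule_B": [[0.99, 0.97, 0.90], [0.97, 0.95, 0.92]],
    },
    # The number of tested molecules may differ between patients.
    "patient_2": {
        "molecule_A": [[0.90, 0.60, 0.20], [0.92, 0.64, 0.18]],
    },
}

def patient_matrix(panel, patient):
    """Return (molecule names, p x d matrix) for one patient.

    Replicates are averaged (an assumption made for this sketch); rows index
    drug percentages and columns index molecules.
    """
    molecules = sorted(panel[patient])
    curves = [[mean(vals) for vals in zip(*panel[patient][m])] for m in molecules]
    return molecules, [list(row) for row in zip(*curves)]

for patient in panel:
    molecules, matrix = patient_matrix(panel, patient)
    print(patient, molecules, len(matrix), "x", len(matrix[0]))
```

Because d varies across patients, the per-patient matrices do not stack directly into a regular n × p × d array; handling this heterogeneity is part of the modeling work described below.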
We will then develop statistical/machine learning models and algorithms, for large or moderate-sized data, adapted to multi-protein, heterogeneous, repeated-measurement data. First, an unsupervised learning analysis of the multidimensional, function-valued individual drug responses will be carried out with ML approaches. This part consists of proposing relevant mathematical representations of the data, whose drug response measurements differ in dimension, in a suitably rich space of functions. Once this is done, features will be extracted in a subspace of reduced dimension to represent and explain the major information in the original data. These features permit the visualization of the data on a factorial map, and clustering approaches will be used to identify subsets of patients or drug responses in an agnostic way. Specifically, we will perform outlier detection and dimension reduction (principal component analysis (PCA), correspondence analysis (CA)). Second, a supervised analysis (SVM, kernel methods, neural networks, etc.) will rely on the moderate-dimensional feature data extracted by unsupervised learning. Based on the constructed ex vivo sensitivity data for several diagnostic AML samples, tested for at least 17 molecules, and on benchmark datasets available in the literature, a prediction rule will be learnt and validated using a pipeline of original multivariate mathematical models. Once an optimal model is selected from this pipeline, a simple score will be derived to classify/predict individuals’ behavior based on the input drug response characteristics.
The second objective of this project is to identify the calcium signature associated with the progression of AML toward MRD and relapse. To do so, we will monitor in patients’ cells the expression levels (RNA-seq) and activity (calcium imaging, patch-clamp) of the main calcium entry pathways, as well as the associated signaling pathways.
An innovative mathematical analysis of the data obtained on patient samples and on treated cells (chemotherapies, targeted therapies) will allow us to identify the molecular actors responsible for persistence and relapse, and to propose combinatorial therapies targeting calcium signaling.
Image data are already available on resting cytosolic calcium (Ca2+) signaling in leukemic cells and T cells. Calcium signaling in these cells, either alone, after chemotherapeutic treatments, or during IS formation, will be monitored using real-time calcium imaging or confocal microscopy. Additional data will be collected by collaborating team members specialized in biology. Cells will be captured with a microfluidic device routinely used by the expert collaborators in Bio-MEMS (biomedical micro-electro-mechanical systems) prior to all calcium imaging experiments.
Mathematical models will be investigated to analyze these images, building AI and statistical/machine learning methods in order to (i) automatically detect regions where cells are present, and track calcium responses in both leukemic cells and T cells either alone or during IS formation; (ii) classify patients at different stages by using feature selection on calcium measurements. Implementation of these methods by team 4 will allow the processing of thousands of cells in parallel, and help to compare results between patients at similar or different stages of the disease.
Background & Skills
Highly motivated candidate with outstanding potential, a strong background in Statistics/Data Science/Machine Learning, and a deep interest in applications and package development (with Python or R). Mastery of Python and/or R.
Topic Description
Deep learning approaches have been used very successfully to automatically find appropriate representations of input data in order to solve machine learning tasks. One particularly relevant, but also challenging, type of input data is the knowledge graph (KG). Knowledge graphs encode human knowledge, which in turn is often structured according to potentially complex underlying patterns. Currently, most deep learning approaches for representation learning in knowledge graphs are empirically driven. There is a lack of a clear mathematical understanding of how deep learning approaches can capture the complexity of human knowledge. Furthermore, the connection between practical performance and mathematical properties of representation learning approaches is not yet well researched. Building and understanding mathematical foundations for representation learning in knowledge graphs can help to advance Artificial Intelligence in general. In this dissertation, we particularly focus on representation learning over knowledge graphs. The candidate is expected to design a comprehensive mathematical framework in order to obtain a better understanding of deep learning for representation learning in knowledge graphs, with a particular focus on employing different geometries, ODEs, and PDEs based on the needs of the underlying tasks (e.g., link prediction, classification) and the characteristics of the data (e.g., dynamicity, hierarchies, cyclic structures, implication patterns). We believe that (partial) differential equations provide an appropriate mathematical foundation for such a framework, as they are both sufficiently flexible and generic to capture the above requirements; in particular, they allow modelling the structural and semantic patterns inside a knowledge graph from a local perspective.
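For readers new to the link prediction task mentioned above, the sketch below scores candidate triples (head, relation, tail) with a simple translation-style rule, in the spirit of well-known embedding baselines: a triple is plausible when head + relation lies close to tail. The 2D embeddings are hand-picked, hypothetical values rather than learned ones, and this baseline is not the differential-equation-based framework the project proposes.

```python
import math

# Hypothetical hand-picked 2D embeddings (in practice these are learned).
entity = {
    "Kigali": (1.0, 0.0),
    "Rwanda": (1.0, 1.0),
    "Nairobi": (3.0, 0.0),
    "Kenya": (3.0, 1.0),
}
relation = {"capital_of": (0.0, 1.0)}

def score(head, rel, tail):
    """Translation-style score: negative distance between head + relation and tail."""
    h, r, t = entity[head], relation[rel], entity[tail]
    translated = tuple(a + b for a, b in zip(h, r))
    return -math.dist(translated, t)

def predict_tail(head, rel):
    """Link prediction: return the entity that best completes (head, rel, ?)."""
    candidates = [e for e in entity if e != head]
    return max(candidates, key=lambda e: score(head, rel, e))

print(predict_tail("Kigali", "capital_of"))   # Rwanda
print(predict_tail("Nairobi", "capital_of"))  # Kenya
```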
Background & Skills
The candidate is expected to have: a strong passion for mathematics and machine learning; specialized and strong knowledge of differential equations; experience in implementing and training machine learning approaches; good coding skills in Python (PyTorch is a bonus); basic knowledge of graph theory.
Topic Description
One of the promising directions for improving the detection of Autism Spectrum Disorder (ASD) is to build classification systems using intelligent digital technologies. In this regard, the African Institute for Mathematical Sciences (AIMS), Autisme Rwanda (AR), Rwanda Biomedical Centre/Ministry of Health, and School of Public Health and Health Sciences at the University of Massachusetts are establishing a digital health program to support people with learning disabilities and autism. Through this program, a Community Health Worker (CHW)-administered screening tool “IGAJU app” has been recently developed and validated for early detection of autistic traits in young Rwandan children. This PhD project is building on the IGAJU app initiative, as well as recent successes in implementing digital technologies in the Rwandan healthcare sector. The selected candidate will work with the team on:
– Incorporating the following features into the existing app to allow remote assessment and diagnosis by trained neuropsychiatrists without having to wait for an appointment: (a) a real-time method, e.g. video sharing, that enables autism experts to consult with the families and to assess the child/adult in real time; (b) a store-and-forward method to upload videos of the child’s behaviours to a web portal that enables autism clinicians to make an assessment remotely; (c) the ability to collect information on important modifiable risk factors for autism in Rwanda (e.g. folic acid supplementation, environmental factors).
– Conducting an evaluation study on the usability, user acceptance, and efficacy of the developed app to allow for early detection of cases of autism and facilitate intervention.
– Designing AI algorithms to train machine-learned classifiers for the app to automatically detect/diagnose autism through the different stages of child development, without putting a heavy burden on the already limited neuropsychiatric resources in the country.
Background & Skills
Statistical analysis and computing, Machine Learning, Deep Learning, Processing large data sets, Data Visualization, Data Wrangling, Programming, Data science in psychiatry
Topic Description
A Digital Twin (DT) is a model of a real system, which automatically learns and continually updates, in order to represent the dynamics of the system in near real time, using sensor data reflecting various aspects of its operating conditions, human experts with relevant domain knowledge, and its environment. DTs are so relevant today that the European Commission considers that they have become one of the main drivers for addressing industrial performance.
Self-updating capability is a key feature of DT technology. It relates to the ability of the DT to detect changes in the system of interest and to update the structure of the model accordingly. This raises the challenge of model inference from data collected on the system, and requires that a formal framework be defined in which a system representation can be coupled with learning methods to achieve automatic model updating. As such, a better synergy between AI and simulation, with the support of IoT, enables the design of high-fidelity simulations, which can be used to better explore various alternatives in decision-making.
This thesis will explore the potential to propose such a framework, based on a hierarchy of inference levels, and the formalization of an approach to address each level of this hierarchy. To that end, the DEVS modeling and simulation approach will be hybridized with different learning algorithms, each adapted to a level of inference (from self-calibration and self-initialization to self-restructuration). The framework will be showcased with a real-world case study.
References:
Liu, Y.K., Ong, S.K. & Nee, A.Y.C. State-of-the-art survey on digital twin implementations. Adv. Manuf. 10, 1–23 (2022). https://doi.org/10.1007/s40436-021-00375-w
Traoré, M.K. Unifying Digital Twin Framework: Simulation-Based Proof-of-Concept. IFAC-PapersOnLine 54(1), 886–893 (2021).
Vohra, M. Digital Twin Technology: Fundamentals and Applications. Wiley (2023).
Background & Skills
Computer science (Programming, Network). Optimization. Modelling.
Topic Description
Students are the main stakeholders of educational institutions, and student monitoring is vital for teaching and learning. Continuous monitoring and timely feedback are often limited in online learning environments compared to traditional classroom settings. A fully automated, non-invasive way of estimating students’ attention and actions, in order to modify and adapt the dynamics of a lecture and/or teaching material, is crucial for enhancing the student experience and the success of online learning. This work proposes using multimodal machine learning mechanisms to integrate student data, producing correlations and insights for better estimating student attention and providing interventions in the form of feedback in real time.
Background & Skills
Subject Knowledge: Deep Learning (DL), Artificial Intelligence (AI), (Basic) Computer Vision. Academic background: Computer Science or related discipline. Skills: Good computing and analytical skills, Exposure to existing Deep Learning and Computer Vision libraries is highly recommended.
Topic Description
This project focuses on the development, validation, and application of numerical models and machine-learning methods to accelerate the discovery of materials for energy applications. Iron-based metal oxides with structural formula AFeO2, AFeO3, AFeO4, and AFe2O4 (where A is an element of the alkali, alkaline-earth, (post-)transition-metal, or lanthanoid series) are of broad technological interest as photoelectrode materials for photoelectrochemical energy storage due to their stability, abundance and moderate cost, and to the tunability of their solar absorption. The project aims to establish models to guide efficient materials design and development by exploring materials properties and their interdependent relationships, and by fine-tuning their performance based on a combination of Monte Carlo techniques and machine-learning statistical inference.
Background & Skills
Mathematics, Statistics and Machine learning
Topic Description
Quantifying risks is crucial across various aspects of life. For example, the quantification of longevity risk or the measurement of life expectancy plays a significant role not only in public health and life insurance but also in the planning of public services at regional and national levels. The data available for this quantification often pose a number of modelling challenges, some of which are discussed in the literature. The aim of this project is to develop a class of statistical and machine learning models for dynamic risk quantification and the underlying uncertainty in the Bayesian paradigm. This class of models will (i) address many shortcomings in existing time-to-event models and (ii) have a wide range of applications including medicine, longevity risk, life insurance and demography. Along with completion of a PhD, it is anticipated that this project will lead to (i) publications in good peer-reviewed journals and (ii) an R package for practitioners.
Background & Skills
Solid background (MSc) in Mathematics and Statistics; good programming skills in R or C++; highly competent, with the ability to think independently and good interpersonal skills.
Topic Description
Major institutions undertake stress tests to determine whether they have enough capital to withstand negative shocks. Prominent methodologies in this area are built on a range of prior assumptions, including assumptions about the future derived from various types of forecasting models. However, many forecasting models yield forecasts with undesirable properties, especially when forecasting far into the future. The aim of this project is to tackle this type of problem by developing Machine Learning models with information constraints. For example, the student will explore how prospective desired properties can be expressed as sets of deterministic and non-deterministic constraints, and then build them into machine learning models to derive improved forecasts and quantify the underlying uncertainty. Along with completion of a PhD, it is anticipated that this project will lead to publications in good peer-reviewed journals.
Background & Skills
Solid background (MSc) in Mathematics and Statistics, or strong knowledge of the mathematics of Machine Learning; good programming skills in C++, R, or Python; highly competent, with the ability to think independently and good interpersonal skills.