Hi Emma, nice poster, have a great summer! Do you think that using codellama vs regular llama affects the output of the summaries? Have you done comparisons?
Hi Vincent, in my project I didn't perform any quantitative comparison between the CodeLlama and Llama 2 models. However, when running both models on the same prompts, I didn't observe any significant differences in their performance. While the official CodeLlama paper doesn't provide direct data on the models' capabilities for general language tasks, it does include a relevant graph, Figure 5(c), which examines the "Helpfulness" score of the CodeLlama models in comparison to Llama 2. Notably, the "Helpfulness" metric is intended to capture a broader measure of the model's language understanding and general ability to be helpful, beyond coding-specific tasks. Interestingly, CodeLlama-13B was found to have a slightly higher Helpfulness score than the standard Llama 2-13B model, which suggests that CodeLlama may have retained strong general language abilities even with its specialized training for coding tasks.
This is a really cool project that leverages NLP models! What are the different models and prompts that were assessed, and how were they scored/compared? How did you pick the prompt for CodeLlama-13B? In the Connectome Data Extraction step of the workflow, almost 31,000 gene IDs had empty results. Why is this, and what can be done to decrease this number?
Hi Dien, this is a great question. I compared the Llama 2-7B, Llama 2-13B, CodeLlama-7B, and CodeLlama-13B models. I also tried CodeLlama-70B, but I wasn't able to get it working on Compute Canada, as the computing resources it requires are quite high. For the Llama versus CodeLlama comparison, CodeLlama accepts a longer prompt, so I chose it over Llama 2; the 7B models were also less effective than the 13B models, both theoretically and when tested on a small sample. For the prompts, I compared a version that introduces the task using only the system role, without giving an example of it. The aim there was to save prompt length and reduce the complexity of the task; however, with that approach the PubMed ID was missing from approximately 50% of the generated summaries. One limitation of the project is that not many quantitative analyses of the different prompts and models were done.
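For concreteness, here is a minimal sketch (in Python, not the project's actual code) of how a system-role-only prompt and a prompt with a worked example can be built in the Llama 2 / CodeLlama chat format, along with a rough check for the missing-PubMed-ID failure mode mentioned above. The system text, the example, and the `has_pubmed_id` helper are illustrative placeholders, not the exact prompts used in the project.

```python
import re

# Illustrative system instruction; not the project's actual prompt text.
SYSTEM = (
    "You summarize gene-interaction text from the Plant Connectome. "
    "Always cite the PubMed ID of the supporting paper in your summary."
)

# Hypothetical in-context example used only to anchor the output format.
EXAMPLE_INPUT = "AT1G01010 interacts with AT5G67300 ... (PubMed: 12345678)"
EXAMPLE_OUTPUT = "AT1G01010 interacts with AT5G67300 [PubMed 12345678]."

def build_prompt(task_text: str, with_example: bool) -> str:
    """Build a single-turn prompt in the Llama 2 chat style:
    the system text sits in <<SYS>> tags inside the first [INST] block."""
    sys_block = f"<<SYS>>\n{SYSTEM}\n<</SYS>>\n\n"
    if with_example:
        # Prepending a worked example costs prompt length but anchors the format.
        sys_block += (
            f"Example input:\n{EXAMPLE_INPUT}\n"
            f"Example summary:\n{EXAMPLE_OUTPUT}\n\n"
        )
    return f"[INST] {sys_block}{task_text} [/INST]"

def has_pubmed_id(summary: str) -> bool:
    """Rough check for the failure mode described above: no PubMed ID present."""
    return re.search(r"\b\d{7,8}\b", summary) is not None
```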
Regarding your second question, the reason the connectome data had empty results for roughly 31,000 gene IDs is that the Plant Connectome database is still developing and currently lacks many genes. They will be making a new release soon, and we're planning to rerun the program after that release to reduce the number of empty results.
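As a simple illustration of that planned rerun (the file names and layout here are hypothetical, assuming the query results are stored as a gene-ID-to-results JSON mapping), the gene IDs that came back empty can be collected and re-queried once the new release is out:

```python
import json

# Hypothetical layout: results.json maps each gene ID to whatever the
# Plant Connectome query returned; an empty value means no hit yet.
with open("results.json") as f:
    results = json.load(f)

missing = [gene_id for gene_id, hits in results.items() if not hits]
print(f"{len(missing)} gene IDs still have empty results")

# Save them so only these IDs need to be re-queried after the new release.
with open("rerun_gene_ids.txt", "w") as f:
    f.write("\n".join(missing))
```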
Hi Emma, excellent poster and presentation! I'm curious about how the raw gene ID set is constructed, especially considering the 40% duplication rate. Could you elaborate on why duplication is so prevalent in the raw gene ID set? Additionally, I'm interested to know if the entire process is computationally intensive. Can you shed some light on the computational requirements? Finally, could you share how long it took to process the 66,014 raw gene IDs?
Hi Fangyi, thanks for your interest in my work!
For the raw gene ID list, I obtained the data from the file at https://www.arabidopsis.org/download_files/Genes/Araport11_genome_release/Araport11_TAIRAccessionID_AGI_mapping.txt. This file contains the complete set of gene IDs for the Arabidopsis genome, including alternate transcript variants. To use this list, I had to remove the “.1”, “.2”, “.3”, etc. suffixes from the gene IDs. This process of string manipulation was quite fast, taking only a couple of seconds to run on the 66,014 gene IDs.
The 40% duplication rate in the raw gene ID set comes from the way the data is structured in the source file. The Arabidopsis genome contains many genes with multiple transcript variants, and the file includes a distinct ID for each of these variants. Once the version numbers are removed, the transcript variants collapse to the same gene ID, which produces the duplicates observed before deduplication.
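A minimal sketch of both steps described above, the suffix stripping and the duplication-rate calculation, assuming the Araport11 mapping file has been downloaded locally and that the AGI identifiers sit in its second tab-separated column (the column index is an assumption and should be checked against the actual file):

```python
import re

# Assumption: the downloaded mapping file is tab-separated and the AGI
# identifier (e.g. AT1G01010.1) is in the second column.
raw_ids = []
with open("Araport11_TAIRAccessionID_AGI_mapping.txt") as f:
    for line in f:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 2:
            raw_ids.append(fields[1])

# Strip the transcript-variant suffix (".1", ".2", ...) to get the locus ID.
locus_ids = [re.sub(r"\.\d+$", "", gene_id) for gene_id in raw_ids]

# Collapsing variants to loci is what produces the duplicates in the raw list.
unique_ids = set(locus_ids)
duplication_rate = 1 - len(unique_ids) / len(locus_ids)
print(f"{len(locus_ids)} raw IDs, {len(unique_ids)} unique loci, "
      f"duplication rate = {duplication_rate:.0%}")
```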
The overall process is computationally intensive. It requires a fair amount of code to perform the task in the desired way and to run the queries in parallel, both to save time and to avoid issues like IP bans. The queries took days to run with Llama 2 on Compute Canada because the model is very large, with billions of parameters, and storing and processing a model of that size requires significant computational resources and memory.
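As a rough sketch of the parallel querying with throttling mentioned above (the `fetch_connectome` function is a hypothetical stand-in, since the actual query code is not shown here), a small thread pool with a per-request delay hides network latency while keeping the request rate modest enough to avoid IP bans:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fetch_connectome(gene_id: str) -> str:
    """Hypothetical stand-in for the real Plant Connectome query.
    The short sleep throttles each worker so the combined request
    rate stays polite to the server."""
    time.sleep(0.5)
    return ""  # placeholder result

gene_ids = ["AT1G01010", "AT1G01020", "AT1G01030"]  # example IDs

# A handful of workers is usually enough: it overlaps network waits
# without hammering the server the way an unbounded thread count would.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = dict(zip(gene_ids, pool.map(fetch_connectome, gene_ids)))

print(results)
```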
Similarly, running BERT, a much smaller model than Llama 2, also took days when using only a CPU. However, like Llama 2, BERT can benefit from GPU-accelerated computing resources such as those available through Compute Canada, which could potentially reduce the processing time from days to a few hours for both models.
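To illustrate the GPU point (this is not the project's actual script, and the `bert-base-uncased` checkpoint and feature-extraction task are only assumed examples), a Hugging Face Transformers pipeline can be placed on a GPU when the job has one allocated and falls back to the CPU otherwise:

```python
import torch
from transformers import pipeline

# Use the GPU if the job was allocated one (e.g. by the cluster scheduler);
# device=-1 tells the pipeline to stay on the CPU.
device = 0 if torch.cuda.is_available() else -1

# Illustrative checkpoint only; not necessarily the one used in the project.
extractor = pipeline("feature-extraction", model="bert-base-uncased", device=device)

features = extractor("AT1G01010 encodes a NAC domain transcription factor.")
print(len(features[0]))  # number of token embeddings for the input sentence
```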