4-7: PHP programming language basics – more on strings and biological sequences manipulation with predefined functions

In the previous section we started looking at how to manipulate strings and biological sequences in PHP with predefined functions. In this section we go further and explore a few more useful PHP built-in tools (predefined functions).

Splitting a biological sequence into single nucleotides, codons or amino-acids with the str_split() function

str_split() (string split) takes a string and, optionally, a number (an integer); when the number is omitted it defaults to 1. As the name suggests, it splits the string into pieces made of that number of characters and returns them as an array. If the number exceeds the string length, the array contains a single element holding the entire string. If the string length is not an exact multiple of the number, the array contains substrings of the requested length followed by a last, shorter substring with what remains.

Let’s make this clearer with a few examples.
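The listing below is a minimal sketch of such examples, not the original script. The 9-character string and the variable names are assumptions chosen for illustration.

```php
<?php
// A 9-character example string (hypothetical, chosen for illustration).
$sequence = 'ATGGCTAAT';

// Pieces of 3 characters: the length divides evenly.
$by_three = str_split($sequence, 3);   // ['ATG', 'GCT', 'AAT']

// Pieces of 2 characters: four full pieces plus a 1-character remainder.
$by_two = str_split($sequence, 2);     // ['AT', 'GG', 'CT', 'AA', 'T']

// Pieces of 10 characters: longer than the string itself,
// so a single element holds the whole string.
$by_ten = str_split($sequence, 10);    // ['ATGGCTAAT']

print_r($by_three);
print_r($by_two);
print_r($by_ten);
```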

In the example above, please note that when we split a 9-character string into substrings of 2 characters, str_split() generates an array of 5 elements. The first 4 are substrings of 2 characters (what we asked for by passing 2 as the second argument) and the last is a single character: the remainder of the string once all the possible 2-character substrings have been taken out.

Also note that when we attempt to subdivide our string into substrings longer than the string itself (in this example we try to subdivide a 9-character string into substrings of 10), the whole string is returned as the single element of the str_split() output array.

In the example that follows we use str_split() to subdivide a coding sequence into the codons (triplets) that compose it. You can see how this could be a first step toward translating a DNA coding sequence into a protein sequence.
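A minimal sketch of such a script is given here. The variable names and the HTML markup are assumptions; only the sequence and the use of str_split() come from the text.

```php
<?php
// DNA coding sequence to subdivide into codons.
$coding_sequence = 'ATGGCTAATGATAGA';

// Split into pieces of 3 nucleotides each.
$codons = str_split($coding_sequence, 3);

// Print the codons as an HTML bullet list.
echo "<p>Here are the codons composing $coding_sequence</p>\n";
echo "<ul>\n";
foreach ($codons as $codon) {
    echo "  <li>$codon</li>\n";
}
echo "</ul>\n";
```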

This will generate the following output:

Here are the codons composing ATGGCTAATGATAGA

  • ATG
  • GCT
  • AAT
  • GAT
  • AGA

If we want to split the same DNA sequence used in the previous example into single nucleotides instead of codons, all we have to do is use 1 instead of 3 as the second argument in the str_split() call. Let’s also change the variable names so that they make sense for the new script.
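Sketched under the same assumptions as before (hypothetical variable names and markup), the script could look like this:

```php
<?php
// DNA sequence to subdivide into single nucleotides.
$dna_sequence = 'ATGGCTAATGATAGA';

// A length of 1 yields one array element per character.
$nucleotides = str_split($dna_sequence, 1);

// Print the nucleotides as an HTML bullet list.
echo "<p>Here are the nucleotides composing $dna_sequence</p>\n";
echo "<ul>\n";
foreach ($nucleotides as $nucleotide) {
    echo "  <li>$nucleotide</li>\n";
}
echo "</ul>\n";
```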

Here is the output of the script above:

Here are the nucleotides composing ATGGCTAATGATAGA

  • A
  • T
  • G
  • G
  • C
  • T
  • A
  • A
  • T
  • G
  • A
  • T
  • A
  • G
  • A

How to reverse-complement a DNA sequence in PHP

We now have enough knowledge of PHP to perform a simple yet often essential operation on DNA sequences: from one strand, derive the other. You are surely familiar with the concept that DNA is a double helix and that the two strands of the helix are complementary to each other: if A is on one strand, T is on the other (and vice versa), and if C is on one strand, G is on the other (and vice versa).

The DNA double helix
Figure 4-7-1: DNA is a double helix in which the two strands are complementary: when an A is present on a strand, a T is present on the other strand. When a C is present on a strand, a G is present on the other strand. From the sequence of one strand you can easily compute the sequence of the other by performing a “reverse complement” operation. Image credits: Zephyris, Wikipedia

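Given the sequence of one DNA strand, you can easily obtain the sequence of the other by performing a so-called “reverse complement” operation: reverse the order of the nucleotides, then replace each one with its complement (A with T, T with A, C with G, G with C). The reversal is needed because the two strands run in opposite directions and sequences are conventionally written 5′ to 3′.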

Here is some PHP code that allows you to do just that. We will explore this much further in the web applications chapter.
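The following is a minimal sketch rather than the original listing. The function name reverse_complement() is hypothetical, and combining strrev(), str_split() and a lookup array is just one possible approach. It assumes an uppercase sequence and copies any character other than A, C, G or T unchanged.

```php
<?php
// Hypothetical helper: returns the reverse complement of a DNA sequence.
function reverse_complement(string $sequence): string
{
    // Complement dictionary (uppercase, unambiguous bases only).
    $complement = ['A' => 'T', 'T' => 'A', 'C' => 'G', 'G' => 'C'];

    $result = '';
    // Read the strand backwards, one nucleotide at a time.
    foreach (str_split(strrev($sequence)) as $nucleotide) {
        // Characters missing from the dictionary (for example N) are kept as they are.
        $result .= $complement[$nucleotide] ?? $nucleotide;
    }
    return $result;
}

$input_sequence = 'ATGGTGAAGCAGATCGA';
$reverse_complement = reverse_complement($input_sequence);

echo "<h3>Input Sequence</h3>\n<p>$input_sequence</p>\n";
echo "<h3>Reverse Complement</h3>\n<p>$reverse_complement</p>\n";
```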

Here is the output of this script:

Input Sequence
ATGGTGAAGCAGATCGA

Reverse Complement
TCGATCTGCTTCACCAT

Translating a DNA coding sequence to an amino-acids sequence with PHP

Let’s now take the splitting of a DNA sequence into codons shown above one step further and actually translate the DNA coding sequence into an amino-acids sequence. Any translation from one language to another needs a dictionary in which we can look up a word and get the corresponding word in the language we are interested in. In the case of DNA, we need a dictionary that translates triplets of nucleotides (codons) into the corresponding amino-acids. We can easily generate such a dictionary in PHP from the genetic code.

The genetic code
Figure 4-7-2: The Genetic Code – Source: Wikipedia

As you know at this point, a dictionary can be created in PHP as an array in which each key is associated with a value. Each key is the word to translate, while the corresponding value is the translation itself.

Let’s get into it. Here’s the genetic code as a PHP array dictionary, derived from the figure above:
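A sketch of such a dictionary is shown below. It encodes the standard genetic code with DNA codons as keys and one-letter amino-acid codes as values. The variable name $genetic_code and the use of the string 'Stop' for the three stop codons are assumptions (the latter matches the script output shown further down).

```php
<?php
// The standard genetic code: DNA codon => one-letter amino-acid code (or 'Stop').
$genetic_code = [
    'TTT' => 'F', 'TTC' => 'F', 'TTA' => 'L', 'TTG' => 'L',
    'TCT' => 'S', 'TCC' => 'S', 'TCA' => 'S', 'TCG' => 'S',
    'TAT' => 'Y', 'TAC' => 'Y', 'TAA' => 'Stop', 'TAG' => 'Stop',
    'TGT' => 'C', 'TGC' => 'C', 'TGA' => 'Stop', 'TGG' => 'W',
    'CTT' => 'L', 'CTC' => 'L', 'CTA' => 'L', 'CTG' => 'L',
    'CCT' => 'P', 'CCC' => 'P', 'CCA' => 'P', 'CCG' => 'P',
    'CAT' => 'H', 'CAC' => 'H', 'CAA' => 'Q', 'CAG' => 'Q',
    'CGT' => 'R', 'CGC' => 'R', 'CGA' => 'R', 'CGG' => 'R',
    'ATT' => 'I', 'ATC' => 'I', 'ATA' => 'I', 'ATG' => 'M',
    'ACT' => 'T', 'ACC' => 'T', 'ACA' => 'T', 'ACG' => 'T',
    'AAT' => 'N', 'AAC' => 'N', 'AAA' => 'K', 'AAG' => 'K',
    'AGT' => 'S', 'AGC' => 'S', 'AGA' => 'R', 'AGG' => 'R',
    'GTT' => 'V', 'GTC' => 'V', 'GTA' => 'V', 'GTG' => 'V',
    'GCT' => 'A', 'GCC' => 'A', 'GCA' => 'A', 'GCG' => 'A',
    'GAT' => 'D', 'GAC' => 'D', 'GAA' => 'E', 'GAG' => 'E',
    'GGT' => 'G', 'GGC' => 'G', 'GGA' => 'G', 'GGG' => 'G',
];

// Looking up a codon returns its translation.
echo $genetic_code['ATG'];
```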

We will use this genetic code PHP dictionary to translate an actual DNA coding sequence. Let’s take the Human Thioredoxin (UniProt P10599) as an example. The coding sequence can be found here.
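A minimal sketch of a translation script follows; it is not the original listing. To stay self-contained it rebuilds the codon dictionary in a compact way (an assumption of this sketch: the 64 codons of the standard genetic code are enumerated in T, C, A, G order, with '*' marking the stop codons). The variable names and HTML markup are likewise assumptions, as is the choice to stop at any codon missing from the dictionary.

```php
<?php
// Build the codon => amino-acid dictionary (standard genetic code).
$bases = ['T', 'C', 'A', 'G'];
$amino_acids = 'FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG';
$genetic_code = [];
$i = 0;
foreach ($bases as $first) {
    foreach ($bases as $second) {
        foreach ($bases as $third) {
            $letter = $amino_acids[$i++];
            $genetic_code[$first . $second . $third] = ($letter === '*') ? 'Stop' : $letter;
        }
    }
}

// Human Thioredoxin coding sequence, split over several lines for readability.
$coding_sequence =
    'ATGGTGAAGCAGATCGAGAGCAAGACTGCTTTTCAGGAAGCCTTGGACGCTGCAGGTGAT' .
    'AAACTTGTAGTAGTTGACTTCTCAGCCACGTGGTGTGGGCCTTGCAAAATGATCAAGCCT' .
    'TTCTTTCATTCCCTCTCTGAAAAGTATTCCAACGTGATATTCCTTGAAGTAGATGTGGAT' .
    'GACTGTCAGGATGTTGCTTCAGAGTGTGAAGTCAAATGCATGCCAACATTCCAGTTTTTT' .
    'AAGAAGGGACAAAAGGTGGGTGAATTTTCTGGAGCCAATAAGGAAAAGCTTGAAGCCACC' .
    'ATTAATGAATTAGTCTAA';

echo "<h3>The Human Thioredoxin DNA coding sequence</h3>\n<p>$coding_sequence</p>\n";
echo "<h3>The translation of the individual codons</h3>\n";

// Start with an empty string and fill it during the loop.
$translated_sequence = '';
foreach (str_split($coding_sequence, 3) as $codon) {
    // Assumption: stop at anything not in the dictionary (e.g. an incomplete last codon).
    if (!isset($genetic_code[$codon])) {
        break;
    }
    $amino_acid = $genetic_code[$codon];
    echo "$codon translates to $amino_acid<br>\n";
    if ($amino_acid === 'Stop') {
        break;   // leave the loop without adding 'Stop' to the protein
    }
    $translated_sequence .= $amino_acid;
}

echo "<h3>The Human Thioredoxin protein sequence</h3>\n<p>$translated_sequence</p>\n";
```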

This is the full output of the code above:

The Human Thioredoxin DNA coding sequence
ATGGTGAAGCAGATCGAGAGCAAGACTGCTTTTCAGGAAGCCTTGGACGCTGCAGGTGATAAACTTGTAGTAGTTGACTTCTCAGCCACGTGGTGTGGGCCTTGCAAAATGATCAAGCCTTTCTTTCATTCCCTCTCTGAAAAGTATTCCAACGTGATATTCCTTGAAGTAGATGTGGATGACTGTCAGGATGTTGCTTCAGAGTGTGAAGTCAAATGCATGCCAACATTCCAGTTTTTTAAGAAGGGACAAAAGGTGGGTGAATTTTCTGGAGCCAATAAGGAAAAGCTTGAAGCCACCATTAATGAATTAGTCTAA

The translation of the individual codons
ATG translates to M
GTG translates to V
AAG translates to K
CAG translates to Q
ATC translates to I
GAG translates to E
AGC translates to S
AAG translates to K
ACT translates to T
GCT translates to A
TTT translates to F
CAG translates to Q
GAA translates to E
GCC translates to A
TTG translates to L
GAC translates to D
GCT translates to A
GCA translates to A
GGT translates to G
GAT translates to D
AAA translates to K
CTT translates to L
GTA translates to V
GTA translates to V
GTT translates to V
GAC translates to D
TTC translates to F
TCA translates to S
GCC translates to A
ACG translates to T
TGG translates to W
TGT translates to C
GGG translates to G
CCT translates to P
TGC translates to C
AAA translates to K
ATG translates to M
ATC translates to I
AAG translates to K
CCT translates to P
TTC translates to F
TTT translates to F
CAT translates to H
TCC translates to S
CTC translates to L
TCT translates to S
GAA translates to E
AAG translates to K
TAT translates to Y
TCC translates to S
AAC translates to N
GTG translates to V
ATA translates to I
TTC translates to F
CTT translates to L
GAA translates to E
GTA translates to V
GAT translates to D
GTG translates to V
GAT translates to D
GAC translates to D
TGT translates to C
CAG translates to Q
GAT translates to D
GTT translates to V
GCT translates to A
TCA translates to S
GAG translates to E
TGT translates to C
GAA translates to E
GTC translates to V
AAA translates to K
TGC translates to C
ATG translates to M
CCA translates to P
ACA translates to T
TTC translates to F
CAG translates to Q
TTT translates to F
TTT translates to F
AAG translates to K
AAG translates to K
GGA translates to G
CAA translates to Q
AAG translates to K
GTG translates to V
GGT translates to G
GAA translates to E
TTT translates to F
TCT translates to S
GGA translates to G
GCC translates to A
AAT translates to N
AAG translates to K
GAA translates to E
AAG translates to K
CTT translates to L
GAA translates to E
GCC translates to A
ACC translates to T
ATT translates to I
AAT translates to N
GAA translates to E
TTA translates to L
GTC translates to V
TAA translates to Stop

The Human Thioredoxin protein sequence
MVKQIESKTAFQEALDAAGDKLVVVDFSATWCGPCKMIKPFFHSLSEKYSNVIFLEVDVDDCQDVASECEVKCMPTFQFFKKGQKVGEFSGANKEKLEATINELV

A few points are worth noting about our DNA sequence to protein sequence translation script above:

  • We use a “break” statement inside the foreach loop that cycles through the codons. It is executed when we find a stop codon, because we do not want to add a “Stop” string to our translated sequence but rather end the translation and leave the loop. Executing “break” within a loop stops it, and execution continues with the code that follows the loop. This is in contrast with die(), which terminates the script entirely, as we have seen in the PHP conditional statements section earlier in this chapter.
  • The output of the script has a known issue. The initial DNA sequence is one very long uninterrupted string, which normally forces horizontal scrolling in the web page. This does not happen on this very page, because the WordPress template used prevents it, but if you execute the translation script in a standalone page you will see a horizontal scroll bar, which is not nice. You can format the sequence before sending it to the page, inserting break tags (for example every 80 characters), to get a clean display without horizontal scrolling. This was not implemented in this specific example.
  • The $translated_sequence variable is declared as an empty string before the foreach loop and then filled with translation results during the loop. This is a classic pattern: declare an empty string or empty array before a loop starts, then fill it inside the loop. Take note.
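The formatting mentioned in the second point could be sketched with str_split() itself. The helper name wrap_sequence() and the example sequence are hypothetical; the idea is simply to cut the sequence into fixed-width pieces and join them with break tags.

```php
<?php
// Hypothetical helper: insert a <br> tag after every $width characters.
function wrap_sequence(string $sequence, int $width = 80): string
{
    // Cut into fixed-width pieces, then glue them back with break tags.
    return implode("<br>\n", str_split($sequence, $width));
}

// A made-up 200-nucleotide sequence, used only to show the wrapping.
$long_sequence = str_repeat('ATGC', 50);

echo '<p style="font-family: monospace">' . wrap_sequence($long_sequence) . "</p>\n";
```

With a width of 80, the 200-character example is displayed on three lines of 80, 80 and 40 characters.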

Classifying amino-acids in a peptide or protein sequence according to their nature (nonpolar, polar, basic, acidic)

Now that we know how to split a sequence into individual amino-acids or nucleotides, let us classify all the amino-acids of a peptide or protein sequence according to their nature, expanding on the example given at the end of the previous section.
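The script below is a minimal sketch of this idea, not the original listing. The grouping of the 20 amino-acids into four classes follows a common textbook convention and is an assumption where the output below does not show a residue; the variable names and the 'unknown' fallback are also assumptions.

```php
<?php
// One-letter amino-acid code => class (a common textbook grouping, assumed here).
$amino_acid_classes = [
    'G' => 'nonpolar', 'A' => 'nonpolar', 'V' => 'nonpolar', 'L' => 'nonpolar',
    'I' => 'nonpolar', 'M' => 'nonpolar', 'F' => 'nonpolar', 'W' => 'nonpolar',
    'P' => 'nonpolar',
    'S' => 'polar', 'T' => 'polar', 'C' => 'polar', 'Y' => 'polar',
    'N' => 'polar', 'Q' => 'polar',
    'K' => 'basic', 'R' => 'basic', 'H' => 'basic',
    'D' => 'acidic', 'E' => 'acidic',
];

// The first 27 amino-acids of Human Thioredoxin.
$protein_sequence = 'MVKQIESKTAFQEALDAAGDKLVVVDF';

echo "<p>Here is a full listing of the amino-acids in our sequence</p>\n";
echo "<ol>\n";

// Collect the classes in an array that is filled during the loop.
$classified = [];
foreach (str_split($protein_sequence) as $amino_acid) {
    // Letters missing from the dictionary are reported as 'unknown'.
    $class = $amino_acid_classes[$amino_acid] ?? 'unknown';
    $classified[] = $class;
    echo "  <li>$amino_acid =&gt; $class</li>\n";
}
echo "</ol>\n";
```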

This is the output of the script:

Here is a full listing of the amino-acids in our sequence

  1. M => nonpolar
  2. V => nonpolar
  3. K => basic
  4. Q => polar
  5. I => nonpolar
  6. E => acidic
  7. S => polar
  8. K => basic
  9. T => polar
  10. A => nonpolar
  11. F => nonpolar
  12. Q => polar
  13. E => acidic
  14. A => nonpolar
  15. L => nonpolar
  16. D => acidic
  17. A => nonpolar
  18. A => nonpolar
  19. G => nonpolar
  20. D => acidic
  21. K => basic
  22. L => nonpolar
  23. V => nonpolar
  24. V => nonpolar
  25. V => nonpolar
  26. D => acidic
  27. F => nonpolar
