Cleaning sequence headers

8 replies [Last post]
Quinn
User offline. Last seen 3 years 31 weeks ago. Offline
Joined: 03/01/2014

Hi Steve and Casey thanks for great book.
I used information in chapter 2 to clean up sequence headers from sequences I got from Protein Blast. For example:
From
>gi|565467582|ref|XP_006291671.1| hypothetical protein CARUB_v10017829mg [Capsella rubella] >gi|482560378|gb|EOA24569.1| hypothetical protein CARUB_v10017829mg [Capsella rubella]
MATTLHCLSTIHLLPRSHYPKTLNSLKPITKSQPCKIPDISSNPNALQLLKSSSLPLAVVALPFFIDAQDAAAAGGEFGILEGRTFALIHPIVMGGLFAYTLWAGYLGWQWRRVRTIQSEISDLKKQLKPTPVSPDGSSAVDSSSPPSATELQIQRLTEERKELIKGSYRDKHFDAGSVLLGFGVLEAVFGGLNTFLRTGKLFPGPHLYAGAGITVLWAAAAALVPAMQKGNETARSLHIALNAVNVLLFVWQIPTGLDIVLKVFEFTKWP

TO
>gi|565467582|Capsella rubella
MATTLHCLSTIHLLPRSHYPKTLNSLKPITKSQPCKIPDISSNPNALQLLKSSSLPLAVVALPFFIDAQDAAAAGGEFGILEGRTFALIHPIVMGGLFAYTLWAGYLGWQWRRVRTIQSEISDLKKQLKPTPVSPDGSSAVDSSSPPSATELQIQRLTEERKELIKGSYRDKHFDAGSVLLGFGVLEAVFGGLNTFLRTGKLFPGPHLYAGAGITVLWAAAAALVPAMQKGNETARSLHIALNAVNVLLFVWQIPTGLDIVLKVFEFTKWP

using this regex in TextWrangler on OSX10.8
(>\w+)(\|\d+\|).+\[(.+)\]

My question is how do I use such regex in Python or Bash shell to do the same thing?

pcfb
User offline. Last seen 43 min 23 sec ago. Offline
Joined: 08/04/2010
Script for Python regex

Hi. I think python will be the most flexible way to do this and have it adaptable in the long term.

Below is a short script based on filestoXYYY.py from the book, mixed with a bit of latlon5.py... I tried to put some comments in there to explain what is going on, and to point out some of the other ways are to do things. (It is indented with 3 spaces instead of tabs, to fit better in the small window.)

It would be relatively easy to nest this into a loop that converts several files (to merge into one output file) or to have the program write to a file name instead of capturing the output with >.

Another thing that might be nice to add is a check of whether the search matched or not... As it is currently written, if it doesn't match it just prints out the original line, but that doesn't help with trouble-shooting your regex against variations in the database.

#!/usr/bin/env python
"""
fastanamefix.py - version 1.0
Applies a regex to the name fields in a fasta file
Prints the lines, so to save, use > to capture.
 
Usage:
  fastanamefix.py infile.fta > converted.fta
"""
 
import sys
import re
 
SearchStr = r'(>\w+)(\|\d+\|).+\[(.+)\]'
ReplaceStr = r'\1\2\3'
 
if len(sys.argv)<2:
   # __doc__ is a special name for the multi-line doc string above
   sys.stderr.write(__doc__) 
   sys.stderr.write("Current search string: '{}'\n".format(SearchStr))
else:
   InFileName= sys.argv[1]  # Just works on one file, give at run time
 
   Infile = open(InFileName,'rU')
 
   RegSub = re.compile(SearchStr) # re.compile makes things go faster
   # option would be to use it all in one line as in the comment below
 
   for Line in Infile: 
      Line=Line.strip('\n')
#     if Line and Line[0]=='>':    # optional way
      if Line.startswith('>'):
         NewLine = RegSub.sub(ReplaceStr,Line)
         # could also use this format, without compiling:
         # NewLine = re.sub(SearchStr, ReplaceStr, Line)
         print NewLine
      else:
         print Line
   Infile.close()

Quinn
User offline. Last seen 3 years 31 weeks ago. Offline
Joined: 03/01/2014
Nice script

Steve
Thanks for taking the time. I recently learned to use ipython notebook. I ran a test file with 20 sequences. It worked and I actually understood the code. I will run the geeky way (argv stuff) when I get a chance.

Also thanks for the suggestions before the code. One thing I do not get is this sentence:
"As it is currently written, if it doesn't match it just prints out the original line, but that doesn't help with trouble-shooting your regex against variations in the database."
What are you suggesting? A message or ...

Steve
User offline. Last seen 3 years 29 weeks ago. Offline
Joined: 08/15/2010
Trouble shooting code

Hi. The current script, by just printing the original text if the regex doesn't match, doesn't notify you when your regex might have failed. In my experience, GenBank names can sometimes be inconsistent, so you might have to account for more variability in what fields occur where. I just thought it would be nice to put this into the code. You could use if re.search() to make sure that a match is actually found for each name, and then sys.stderr.write() to print a message with the non-matching names. In general, it is a good idea to build in a safety mechanism if you are going to be deploying this for a large number of sequences, where you might not catch errors just by eyeballing the results.

Quinn
User offline. Last seen 3 years 31 weeks ago. Offline
Joined: 03/01/2014
Trouble shooting code

Oh I see. That is important. > is required but the rest can change. I could modify the code a bit like this:

for Line in Infile:
    Line=Line.strip('\n')
    if re.search(SearchStr, Line):
        NewLine = re.sub(SearchStr, ReplaceStr, Line)
        print NewLine
    else:
        print Line    
Infile.close()

But I don't see how to print non-matching names. Do I need another SearchStr for that? Now the for-loop replaces the first line (the header) and then prints the sequence.

Steve
User offline. Last seen 3 years 29 weeks ago. Offline
Joined: 08/15/2010
Providing feedback within code

Hi Quinn—

Check out page 205 and the code on 210 in the book, where we talk about sys.stderr.write().

In short, there are two "routes" through which you can send (display or pipe) information: The standard output includes the text that is displayed on the screen when you use a print statement, and importantly that is also what gets captured into a file when you put the redirect symbol > after the command that runs the program (see page 70 and 206). (This approach probably won't work the same in IPythonNotebook -- you would have to write to a file.)

If you run the original program at a shell prompt like this (below), you won't see any output because all the printed lines are saved to the converted.fta file instead:

fastanamefix.py infile.fta > converted.fta

However, the second way to present output is through the standard error route. This doesn't have to be an actual error, but can be used for system notification, etc, as described briefly in Chapter 11. Text that you output using the command sys.stderr.write() will show up on the screen interspersed with yourprint statements, but it will not get captured into the file that you save with >. This gives you a way to save the modified fasta file, but also send yourself notifications, like listing the names that don't match. One thing to note, sys.stderr.write() does not automatically append a newline at the end of each output, so if you use it several times in the program, it can generate a single line of output. You need to specify \n wherever you want a break.

So, in practice, you would modify the script above adding a couple lines after the .startswith()line:

      if Line.startswith('>'):
         # add the following two lines to alert the user of non-matching names
         if not re.search(SearchStr,Line):
            sys.stderr.write("Mismatch: %s\n" % (Line))
 
         NewLine = RegSub.sub(ReplaceStr,Line)
         print NewLine
      else:
         print Line

You don't need to do an else: statement -- just test if there is not a match and print the error notification. Then let the replacement do its thing (it will output the original line to the file as long as you are capturing that output).

You could also add some counters to say how many sequences were found, and how many mismatches, and send this output using sys.stderr.write() instead of print.

Quinn
User offline. Last seen 3 years 31 weeks ago. Offline
Joined: 03/01/2014
Providing feedback within code

I read chapter 11 and your comments above. For now I used this line:
sys.stderr.write("Mismatch: %s\n %d" % (Line, count))
to capture notification and count.

The way the script is written I do need an else statement to print the sequence as the lines of sequence do not begin with >.

Thank you very much for your help. This kind of stuff is impossible without any help. Have you guys considered teaching a 4-week or 6-week course at MOOC?
www.class-central.com

Steve
User offline. Last seen 3 years 29 weeks ago. Offline
Joined: 08/15/2010
Moving from regex to python

Great that you are thinking along these lines. That is just what we were hoping for. How far have you read? The explanation will depend on what you already are comfortable doing...

Also, what is the replacement string -- what do you want the final to be?

Quinn
User offline. Last seen 3 years 31 weeks ago. Offline
Joined: 03/01/2014
Moving from regex to python

Thank you.
Sorry forgot to give you the replace string. It was: \1\2\3. That worked in cleaning up the header as shown above.
I read the first 14 chapters in the first round. The second reading will depend on what I need. It is good to learn regex in text editor but then to move on. I will struggle thru whatever you suggest in shell and python.