Using conditions to match multiple patterns within a line

I have a fasta file like this:
myfasta.fasta

>1_CDS
AAAAATTTCTGGGCCCCGGGGG
AAATTATTA
>2_CDS
TTAAAAATTTCTGGGCCCCGGGAAAAAA
>3_CDS
TTTGGGAATTAAACCCT
>4_CDS
TTTGGGAATTAAACCCT
>5_rRNA
TTAAAAATTTCTGGGCCCCGGGAAAAAA
>6_tRNA
TTAAAAATTTCTGGGCCCCGGGAAAAAA

I have code that I want to use to separate sequences based on IDs that match patterns such as 'CDS' or 'tRNA'. In the code below, I try to combine startswith with a substring match on the line, but it doesn't seem to work. Can someone please show me how to test two conditions on a line in Python?

Code, run as python mycode.py myfasta.fasta:

#!/usr/bin/env python
import sys
import os
myfasta = sys.argv[1]
fasta = open(myfasta)

for line in fasta:
    if line.startswith('>') and 'CDS' in line:
        print(line)
    else:
        print(line)

Expected output (if I use CDS):

>1_CDS
AAAAATTTCTGGGCCCCGGGGG
AAATTATTA
>2_CDS
TTAAAAATTTCTGGGCCCCGGGAAAAAA
>3_CDS
TTTGGGAATTAAACCCT
>4_CDS
TTTGGGAATTAAACCCT

edited Mar 27 at 22:24

asked Mar 27 at 21:16

MAPK


Mmmhh, seems like you're printing the line either way... How do you differentiate between positives and negatives?

– Jacques Gaudin
Mar 27 at 21:21

Yeah, the code seems to do what you think, except that you're going to get blank lines: each line already ends with an EOL, and then print() adds another one.

– Steve
Mar 27 at 21:25

You're still just printing every line unconditionally! You have to do something different depending on whether your test succeeds or fails, or what's the point?

– Steve
Mar 27 at 21:26

You have to treat lines with '>' at the front differently from those without, regardless of whether they match 'CDS'. When you see '>' at the front of a line, check whether 'CDS' is in it to know if you should print it. But you ALSO have to set a flag, so that the next time you see a line without a '>' at the front, you'll know whether to print that line. That's the secret here. Create a boolean variable that records whether you printed the last '>' line you saw, and then use that variable to decide what to do with non-'>' lines.

– Steve
Mar 27 at 21:30

Easiest with Biopython: from Bio import SeqIO; SeqIO.write((r for r in SeqIO.parse('in.fa', 'fasta') if 'CDS' in r.id), 'out.fa', 'fasta')

– Chris_Rands
Mar 28 at 14:00

python bioinformatics fasta


3 Answers

Here is code that works for you. If a header line contains CDS, it prints that line and the sequence lines that follow it. strip() removes the trailing newline before the line is printed.

#!/usr/bin/env python
import sys

myfasta = sys.argv[1]

flag = False  # True while we are inside a record whose header contains 'CDS'
with open(myfasta) as fasta:
    for line in fasta:
        if line.startswith('>') and 'CDS' in line:
            flag = True
        elif line.startswith('>'):
            flag = False
        if flag:
            print(line.strip())

Edit: You can drop the elif branch, as in the following code:

#!/usr/bin/env python
import sys

myfasta = sys.argv[1]

flag = False
with open(myfasta) as fasta:
    for line in fasta:
        if line.startswith('>'):
            flag = 'CDS' in line  # re-evaluated at every header line
        if flag:
            print(line.strip())
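
To cover the other ID types mentioned in the question ('tRNA', 'rRNA' and so on), the same flag idea can be extended to several patterns at once. The following is only a sketch: filter_fasta and its patterns parameter are names made up for this illustration, and it matches each pattern as a plain substring of the header line, which may be too loose for real IDs.

```python
def filter_fasta(lines, patterns):
    """Yield header and sequence lines of records whose header contains any pattern.

    `lines` is any iterable of strings (for example an open file);
    `patterns` is a collection of substrings such as ('CDS', 'tRNA').
    """
    keep = False  # True while inside a record whose header matched
    for line in lines:
        if line.startswith('>'):
            # Re-evaluate the flag at every header line
            keep = any(pattern in line for pattern in patterns)
        if keep:
            yield line.rstrip('\n')


# Small in-memory sample standing in for the lines of myfasta.fasta
sample = [
    ">1_CDS\n", "AAAAATTTCTGGGCCCCGGGGG\n", "AAATTATTA\n",
    ">5_rRNA\n", "TTAAAAATTTCTGGGCCCCGGGAAAAAA\n",
    ">6_tRNA\n", "TTAAAAATTTCTGGGCCCCGGGAAAAAA\n",
]
selected = list(filter_fasta(sample, ("CDS", "tRNA")))
for line in selected:
    print(line)
```

With a real file, the open file object can be passed in place of sample, since the function only needs an iterable of lines and never loads the whole file.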

edited Mar 30 at 9:56

answered Mar 27 at 21:48

maanijou


OK, I changed the code. The first version was based on your current inputs.

– maanijou
Mar 27 at 22:02

Are you sure? I'm getting multiple lines with your inputs.

– maanijou
Mar 27 at 22:09

You should probably close the file at the end, or use a with context manager.

– bli
Mar 28 at 17:50

Good point. Edited.

– maanijou
Mar 28 at 18:53

I mean exactly what I wrote: flag is still a boolean, but you can remove your elif clause.

– Chris_Rands
Mar 30 at 9:48

Maanijou's answer is fine.

Also, consider an alternative that uses an iterator instead.

EDIT: Updated the code based on your comments

#!/usr/bin/env python
import sys

myfasta = sys.argv[1]
fasta = open(myfasta, "r")

file_contents = iter(fasta)

try:
    print_flag = False  # nothing is printed until a CDS header is seen
    while True:
        # The built-in next() works on Python 2.6+ and 3;
        # Python 3 file objects have no .next() method.
        line = next(file_contents)
        if line.startswith('>'):
            if "CDS" in line:
                print(line.strip())
                print_flag = True
            else:
                print_flag = False
        else:
            if print_flag:
                print(line.strip())

except StopIteration:
    print("Done")
    fasta.close()

Explanation

file_contents = iter(fasta) gives you an iterator over the lines of the file (a file object is already its own iterator), on which you can keep calling next() until you run out of lines to read.

I do not recommend calling readlines, as some other answers do, because FASTA files can be large and readlines loads the whole file into memory.

If a header line satisfies your search requirement, you print it and the sequence lines that follow it; if not, you read on to the next line and print nothing.
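
As a small illustration of the iterator protocol described above, the sketch below uses io.StringIO as a stand-in for an open file (an assumption made so the example runs without a file on disk) and the two-argument form of the built-in next(), which returns a default value instead of raising StopIteration.

```python
import io

# io.StringIO stands in for an open FASTA file in this illustration
fake_file = io.StringIO(">1_CDS\nAAAT\n>5_rRNA\nTTAA\n")

lines = iter(fake_file)       # a file-like object is its own iterator
same_object = lines is fake_file

collected = []
while True:
    line = next(lines, None)  # None is returned once the input is exhausted
    if line is None:
        break
    collected.append(line.strip())

print(same_object)
print(collected)
```

Only one line is held in memory at a time here, which is the reason for preferring this style over readlines on large files.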

Explanation for Update

The AttributeError is raised because Python 3 file objects have no .next() method (it exists only in Python 2), so I could not reproduce it locally. Calling the built-in next(file_contents) instead works on both versions; the file mode is not the cause.

You have now said there can be more than one sequence line per CDS record, so I updated the code to print all the sequence lines that follow each CDS header in the file.

I tested it with a modified fasta file like so

>1_CDS
AAAAATTTCTGGGCCCCGGGGG
AAAAATTTCTGGGCCCCGGGGG
AAAAATTTCTGGGCCCCGGGGG
AAAAATTTCTGGGCCCCGGGCG
>2_CDS
TTAAAAATTTCTGGGCCCCGGGAAAAAA
>3_CDS
TTTGGGAATTAAACCCT
>4_CDS
TTTGGGAATTAAACCCT
>5_rRNA
TTAAAAATTTCTGGGCCCCGGGAAAAAA
>6_tRNA
TTAAAAATTTCTGGGCCCCGGGAAAAAA

And this output

python fasta.py fasta.fasta
>1_CDS
AAAAATTTCTGGGCCCCGGGGG
AAAAATTTCTGGGCCCCGGGGG
AAAAATTTCTGGGCCCCGGGGG
AAAAATTTCTGGGCCCCGGGCG
>2_CDS
TTAAAAATTTCTGGGCCCCGGGAAAAAA
>3_CDS
TTTGGGAATTAAACCCT
>4_CDS
TTTGGGAATTAAACCCT
Done

edited Mar 27 at 22:06

answered Mar 27 at 21:54

Srini


getting this error: AttributeError: '_io.TextIOWrapper' object has no attribute 'next'

– MAPK
Mar 27 at 22:01

I think the error was because of file modes, I could not reproduce it locally, but I have added a fix which I think will fix it for you

– Srini
Mar 27 at 22:08


Is this what you want?

#!/usr/bin/env python
import sys
from collections import defaultdict

myfasta = sys.argv[1]
with open(myfasta) as fasta:
    data = fasta.read().splitlines()

# Maps the ID suffix after the first '_' (e.g. 'CDS') to a list of sequences
pattern_data = defaultdict(list)
index = 0
while index < len(data):
    if data[index].startswith('>'):
        start = data[index].index('_') + 1
        key = data[index][start:]
        pattern_data[key].append(data[index + 1])
    index += 2  # step over the header and its single sequence line

At this point you are free to do whatever you please with the sorted data.

The above assumes that the whole file follows one exact format: a single line starting with ">" that identifies the single sequence line after it. If several sequence lines can follow a header, the code needs a minor modification.

EDIT:
I just read up on FASTA files and now know that a sequence may span more than one line after its identifier. So the above code does need to be modified to handle multi-line sequences. A more general approach is as follows:

#!/usr/bin/env python
import sys
from collections import defaultdict

myfasta = sys.argv[1]
with open(myfasta) as fasta:
    data = fasta.read().splitlines()

id_line_indices = [index for index, line in enumerate(data) if line.startswith('>')]
id_line_indices.append(len(data))  # sentinel marking the end of the last record
pattern_buckets = defaultdict(list)

i = 0
while i < len(id_line_indices) - 1:
    start = data[id_line_indices[i]].index('_') + 1
    key = data[id_line_indices[i]][start:]

    # Join every line between this header and the next into one sequence
    sequence = [data[index] for index in range(id_line_indices[i] + 1, id_line_indices[i + 1])]
    sequence = ''.join(sequence)

    pattern_buckets[key].append(sequence)
    i += 1

This still achieves the same results for the above data set. For example,

print(pattern_buckets['CDS'])
print(pattern_buckets['rRNA'])

Will get you:

['AAAAATTTCTGGGCCCCGGGGG', 'TTAAAAATTTCTGGGCCCCGGGAAAAAA', 'TTTGGGAATTAAACCCT', 'TTTGGGAATTAAACCCT']
['TTAAAAATTTCTGGGCCCCGGGAAAAAA']
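
The same grouping can also be done in a single pass, without first collecting the header indices. This is a sketch rather than a drop-in replacement: read_records and group_by_suffix are hypothetical helper names, and, like the code above, it assumes every header has the form >number_type.

```python
from collections import defaultdict


def read_records(lines):
    """Yield (header, sequence) pairs, joining multi-line sequences."""
    header, parts = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith('>'):
            if header is not None:
                yield header, ''.join(parts)
            header, parts = line[1:], []
        elif header is not None:
            parts.append(line)
    if header is not None:
        yield header, ''.join(parts)  # flush the last record


def group_by_suffix(lines):
    """Bucket sequences by the text after the first '_' in each header."""
    buckets = defaultdict(list)
    for header, sequence in read_records(lines):
        key = header.split('_', 1)[1]  # assumes the header contains '_'
        buckets[key].append(sequence)
    return buckets


# Shortened sample records, not the full data set from the question
sample = [">1_CDS", "AAAAATTT", "AAATTATTA", ">2_CDS", "TTAAAA", ">5_rRNA", "TTTGGG"]
buckets = group_by_suffix(sample)
print(buckets["CDS"])
print(buckets["rRNA"])
```

Because read_records is a generator, it can be fed an open file directly and keeps only one record in memory at a time; only the buckets themselves grow with the input.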

edited Mar 27 at 22:11

answered Mar 27 at 21:49

Perplexabot


add a comment |

Your Answer

StackExchange.ifUsing("editor", function ()
StackExchange.using("externalEditor", function ()
StackExchange.using("snippets", function ()
StackExchange.snippets.init();
);
);
, "code-snippets");

StackExchange.ready(function()
var channelOptions =
tags: "".split(" "),
id: "1"
;
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function()
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled)
StackExchange.using("snippets", function()
createEditor();
);

else
createEditor();

);

function createEditor()
StackExchange.prepareEditor(
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader:
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
,
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
);

);

draft saved

draft discarded

StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f55386592%2fusing-conditions-to-match-multiple-patterns-within-a-line%23new-answer', 'question_page');

);

Post as a guest

Name

Required, but never shown

3 Answers
3

active

oldest

votes

3 Answers
3

active

oldest

votes

Here is a code that works for you. If a line has CDS, it prints the line and the next lines. strip() removes the endline character while printing the line.

#!/usr/bin/env python
import sys
import os
myfasta = sys.argv[1]

flag = False
with open(myfasta) as fasta:
 for line in fasta:
 if line.startswith('>') and 'CDS' in line:
 flag = True
 elif line.startswith('>'):
 flag = False
 if flag:
 print(line.strip())

Edit: You can remove the elif part as the following code:

#!/usr/bin/env python
import sys
import os
myfasta = sys.argv[1]

flag = False
with open(myfasta) as fasta:
 for line in fasta:
 if line.startswith('>'):
 flag = 'CDS' in line
 if flag:
 print(line.strip())

edited Mar 30 at 9:56

answered Mar 27 at 21:48

maanijou

6641 gold badge6 silver badges19 bronze badges

Ok i changed the code. the first one was based on your current inputs

– maanijou
Mar 27 at 22:02

Are you sure? I'm getting multiple lines with your inputs.

– maanijou
Mar 27 at 22:09

2

You should probably close the file at the end, or use a with context manager.

– bli
Mar 28 at 17:50

Good point. Edited.

– maanijou
Mar 28 at 18:53

1

i mean exactly what i wrote, flag is still a boolean, but you can remove your elif clause

– Chris_Rands
Mar 30 at 9:48

|
show 3 more comments

Here is a code that works for you. If a line has CDS, it prints the line and the next lines. strip() removes the endline character while printing the line.

#!/usr/bin/env python
import sys
import os
myfasta = sys.argv[1]

flag = False
with open(myfasta) as fasta:
 for line in fasta:
 if line.startswith('>') and 'CDS' in line:
 flag = True
 elif line.startswith('>'):
 flag = False
 if flag:
 print(line.strip())

Edit: You can remove the elif part as the following code:

#!/usr/bin/env python
import sys
import os
myfasta = sys.argv[1]

flag = False
with open(myfasta) as fasta:
 for line in fasta:
 if line.startswith('>'):
 flag = 'CDS' in line
 if flag:
 print(line.strip())

edited Mar 30 at 9:56

answered Mar 27 at 21:48

maanijou

6641 gold badge6 silver badges19 bronze badges

Ok i changed the code. the first one was based on your current inputs

– maanijou
Mar 27 at 22:02

Are you sure? I'm getting multiple lines with your inputs.

– maanijou
Mar 27 at 22:09

2

You should probably close the file at the end, or use a with context manager.

– bli
Mar 28 at 17:50

Good point. Edited.

– maanijou
Mar 28 at 18:53

1

i mean exactly what i wrote, flag is still a boolean, but you can remove your elif clause

– Chris_Rands
Mar 30 at 9:48

|
show 3 more comments

Here is a code that works for you. If a line has CDS, it prints the line and the next lines. strip() removes the endline character while printing the line.

#!/usr/bin/env python
import sys
import os
myfasta = sys.argv[1]

flag = False
with open(myfasta) as fasta:
 for line in fasta:
 if line.startswith('>') and 'CDS' in line:
 flag = True
 elif line.startswith('>'):
 flag = False
 if flag:
 print(line.strip())

Edit: You can remove the elif part as the following code:

#!/usr/bin/env python
import sys
import os
myfasta = sys.argv[1]

flag = False
with open(myfasta) as fasta:
 for line in fasta:
 if line.startswith('>'):
 flag = 'CDS' in line
 if flag:
 print(line.strip())

edited Mar 30 at 9:56

answered Mar 27 at 21:48

maanijou

6641 gold badge6 silver badges19 bronze badges

Here is a code that works for you. If a line has CDS, it prints the line and the next lines. strip() removes the endline character while printing the line.

#!/usr/bin/env python
import sys
import os
myfasta = sys.argv[1]

flag = False
with open(myfasta) as fasta:
 for line in fasta:
 if line.startswith('>') and 'CDS' in line:
 flag = True
 elif line.startswith('>'):
 flag = False
 if flag:
 print(line.strip())

Edit: You can remove the elif part as the following code:

#!/usr/bin/env python
import sys
import os
myfasta = sys.argv[1]

flag = False
with open(myfasta) as fasta:
 for line in fasta:
 if line.startswith('>'):
 flag = 'CDS' in line
 if flag:
 print(line.strip())

edited Mar 30 at 9:56

answered Mar 27 at 21:48

maanijou

6641 gold badge6 silver badges19 bronze badges

edited Mar 30 at 9:56

answered Mar 27 at 21:48

maanijou

6641 gold badge6 silver badges19 bronze badges

answered Mar 27 at 21:48

maanijou

6641 gold badge6 silver badges19 bronze badges

answered Mar 27 at 21:48

maanijou

6641 gold badge6 silver badges19 bronze badges

Ok i changed the code. the first one was based on your current inputs

– maanijou
Mar 27 at 22:02

Are you sure? I'm getting multiple lines with your inputs.

– maanijou
Mar 27 at 22:09

2

You should probably close the file at the end, or use a with context manager.

– bli
Mar 28 at 17:50

Good point. Edited.

– maanijou
Mar 28 at 18:53

1

i mean exactly what i wrote, flag is still a boolean, but you can remove your elif clause

– Chris_Rands
Mar 30 at 9:48

|
show 3 more comments

Ok i changed the code. the first one was based on your current inputs

– maanijou
Mar 27 at 22:02

Are you sure? I'm getting multiple lines with your inputs.

– maanijou
Mar 27 at 22:09

2

You should probably close the file at the end, or use a with context manager.

– bli
Mar 28 at 17:50

Good point. Edited.

– maanijou
Mar 28 at 18:53

1

i mean exactly what i wrote, flag is still a boolean, but you can remove your elif clause

– Chris_Rands
Mar 30 at 9:48

Ok i changed the code. the first one was based on your current inputs

– maanijou
Mar 27 at 22:02

Are you sure? I'm getting multiple lines with your inputs.

– maanijou
Mar 27 at 22:09

You should probably close the file at the end, or use a with context manager.

– bli
Mar 28 at 17:50

Good point. Edited.

– maanijou
Mar 28 at 18:53

i mean exactly what i wrote, flag is still a boolean, but you can remove your elif clause

– Chris_Rands
Mar 30 at 9:48

|
show 3 more comments

Maanijou's answer is fine.

Also, consider an alternative with a iterator instead.

EDIT: Updated the code based on your comments

#!/usr/bin/env python
import sys
import os
myfasta = sys.argv[1]
fasta = open(myfasta, "r+")

file_contents = iter(fasta)

try:
 print_flag = True
 while True:
 line = file_contents.next()
 if line.startswith('>'):
 if "CDS" in line:
 print (line.strip())
 print_flag = True
 else:
 print_flag = False
 else:
 if print_flag:
 print (line.strip())

except StopIteration:
 print ("Done")
 fasta.close()

Explanation

file_contents = iter(fasta) converts the iterable file object into an iterator on which you can simply keep calling next() till you run out of things to read

Why I do not recommend calling readlines as some other answers have is that sometimes fasta files can be big and calling readlines consumes significant memory.

if a line satisfies your search req you simply print it and the next line, if not you simply read the next line and do nothing,

Explanation for Update

You got the Attribute error because of file modes, I could not reproduce it locally but I think opening the file with the right mode should fix it

You now said there could be more than 1 genome sequence for CDS updated the code to print all the genome sequences for 1 CDS header in the file

I tested it with a modified fasta file like so

>1_CDS
AAAAATTTCTGGGCCCCGGGGG
AAAAATTTCTGGGCCCCGGGGG
AAAAATTTCTGGGCCCCGGGGG
AAAAATTTCTGGGCCCCGGGCG
>2_CDS
TTAAAAATTTCTGGGCCCCGGGAAAAAA
>3_CDS
TTTGGGAATTAAACCCT
>4_CDS
TTTGGGAATTAAACCCT
>5_rRNA
TTAAAAATTTCTGGGCCCCGGGAAAAAA
>6_tRNA
TTAAAAATTTCTGGGCCCCGGGAAAAAA

And this output

python fasta.py fasta.fasta
>1_CDS
AAAAATTTCTGGGCCCCGGGGG
AAAAATTTCTGGGCCCCGGGGG
AAAAATTTCTGGGCCCCGGGGG
AAAAATTTCTGGGCCCCGGGCG
>2_CDS
TTAAAAATTTCTGGGCCCCGGGAAAAAA
>3_CDS
TTTGGGAATTAAACCCT
>4_CDS
TTTGGGAATTAAACCCT
Done

edited Mar 27 at 22:06

answered Mar 27 at 21:54

Srini

1,2591 gold badge15 silver badges30 bronze badges

getting this error: AttributeError: '_io.TextIOWrapper' object has no attribute 'next'

– MAPK
Mar 27 at 22:01

I think the error was because of file modes, I could not reproduce it locally, but I have added a fix which I think will fix it for you

– Srini
Mar 27 at 22:08

add a comment |

Maanijou's answer is fine.

Also, consider an alternative with a iterator instead.

EDIT: Updated the code based on your comments

#!/usr/bin/env python
import sys
import os
myfasta = sys.argv[1]
fasta = open(myfasta, "r+")

file_contents = iter(fasta)

try:
 print_flag = True
 while True:
 line = file_contents.next()
 if line.startswith('>'):
 if "CDS" in line:
 print (line.strip())
 print_flag = True
 else:
 print_flag = False
 else:
 if print_flag:
 print (line.strip())

except StopIteration:
 print ("Done")
 fasta.close()

Explanation

file_contents = iter(fasta) converts the iterable file object into an iterator on which you can simply keep calling next() till you run out of things to read

Why I do not recommend calling readlines as some other answers have is that sometimes fasta files can be big and calling readlines consumes significant memory.

if a line satisfies your search req you simply print it and the next line, if not you simply read the next line and do nothing,

Explanation for Update

You got the Attribute error because of file modes, I could not reproduce it locally but I think opening the file with the right mode should fix it

You now said there could be more than 1 genome sequence for CDS updated the code to print all the genome sequences for 1 CDS header in the file

I tested it with a modified fasta file like so

>1_CDS
AAAAATTTCTGGGCCCCGGGGG
AAAAATTTCTGGGCCCCGGGGG
AAAAATTTCTGGGCCCCGGGGG
AAAAATTTCTGGGCCCCGGGCG
>2_CDS
TTAAAAATTTCTGGGCCCCGGGAAAAAA
>3_CDS
TTTGGGAATTAAACCCT
>4_CDS
TTTGGGAATTAAACCCT
>5_rRNA
TTAAAAATTTCTGGGCCCCGGGAAAAAA
>6_tRNA
TTAAAAATTTCTGGGCCCCGGGAAAAAA

And this output

python fasta.py fasta.fasta
>1_CDS
AAAAATTTCTGGGCCCCGGGGG
AAAAATTTCTGGGCCCCGGGGG
AAAAATTTCTGGGCCCCGGGGG
AAAAATTTCTGGGCCCCGGGCG
>2_CDS
TTAAAAATTTCTGGGCCCCGGGAAAAAA
>3_CDS
TTTGGGAATTAAACCCT
>4_CDS
TTTGGGAATTAAACCCT
Done

edited Mar 27 at 22:06

answered Mar 27 at 21:54

Srini

1,2591 gold badge15 silver badges30 bronze badges

getting this error: AttributeError: '_io.TextIOWrapper' object has no attribute 'next'

– MAPK
Mar 27 at 22:01

I think the error was because of file modes, I could not reproduce it locally, but I have added a fix which I think will fix it for you

– Srini
Mar 27 at 22:08



Is this what you want?

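#!/usr/bin/env python
import sys
import os
from collections import defaultdict

myfasta = sys.argv[1]
with open(myfasta) as fasta:
    data = fasta.read().splitlines()

# Group sequences by the part of the header after the first underscore,
# e.g. ">1_CDS" -> "CDS". Assumes each header is followed by exactly one
# sequence line, so the file is walked two lines at a time.
pattern_data = defaultdict(list)
index = 0
while index < len(data):
    if data[index].startswith('>'):
        start = data[index].index('_') + 1
        key = data[index][start:]
        pattern_data[key].append(data[index + 1])
    index += 2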

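At this point you are free to do whatever you please with the grouped data. If a sequence can span several lines, this version joins the lines between one header and the next: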

#!/usr/bin/env python
import sys
import os
from collections import defaultdict

myfasta = sys.argv[1]
with open(myfasta) as fasta:
    data = fasta.read().splitlines()

# Positions of all header lines, plus a sentinel marking the end of the file.
id_line_indices = [index for index, line in enumerate(data) if line.startswith('>')]
id_line_indices.append(len(data))
pattern_buckets = defaultdict(list)

i = 0
while i < len(id_line_indices) - 1:
    # Key is the part of the header after the first underscore.
    start = data[id_line_indices[i]].index('_') + 1
    key = data[id_line_indices[i]][start:]

    # Join every line between this header and the next one.
    sequence = [data[index] for index in range(id_line_indices[i] + 1, id_line_indices[i + 1])]
    sequence = ''.join(sequence)

    pattern_buckets[key].append(sequence)
    i += 1

This still achieves the same results for the above data set. For example,

print(pattern_buckets['CDS'])
print(pattern_buckets['rRNA'])

Will get you:

['AAAAATTTCTGGGCCCCGGGGG', 'TTAAAAATTTCTGGGCCCCGGGAAAAAA', 'TTTGGGAATTAAACCCT', 'TTTGGGAATTAAACCCT']
['TTAAAAATTTCTGGGCCCCGGGAAAAAA']
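
If the file is too large to read in one go, the same buckets can be built in a single pass over the lines. This is only a sketch: the name bucket_sequences and the sample data are invented for illustration, and it assumes headers look like ">1_CDS". Unlike the index('_') call above, partition leaves the key empty instead of raising an error when a header has no underscore.

```python
import io
from collections import defaultdict


def bucket_sequences(handle):
    """Group joined sequences by the header text after the first underscore."""
    buckets = defaultdict(list)
    key, parts = None, []
    for raw in handle:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if key is not None:
                buckets[key].append("".join(parts))
            key = line.partition("_")[2]
            parts = []
        elif key is not None:
            parts.append(line)
    # Flush the last record once the input is exhausted.
    if key is not None:
        buckets[key].append("".join(parts))
    return buckets


# Hypothetical sample data standing in for a real FASTA file.
sample = io.StringIO(">1_CDS\nAAAA\nCCCC\n>2_rRNA\nGGGG\n>3_CDS\nTTTT\n")
buckets = bucket_sequences(sample)
print(buckets["CDS"])
print(buckets["rRNA"])
```

Only the current record is held as a list of lines, although the buckets themselves still grow with the number of sequences kept.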

edited Mar 27 at 22:11

answered Mar 27 at 21:49

Perplexabot

1,133 reputation; 1 gold badge, 9 silver badges, 16 bronze badges

