Using conditions to match multiple patterns within a line
I have a fasta file like this (myfasta.fasta):
>1_CDS
AAAAATTTCTGGGCCCCGGGGG
AAATTATTA
>2_CDS
TTAAAAATTTCTGGGCCCCGGGAAAAAA
>3_CDS
TTTGGGAATTAAACCCT
>4_CDS
TTTGGGAATTAAACCCT
>5_rRNA
TTAAAAATTTCTGGGCCCCGGGAAAAAA
>6_tRNA
TTAAAAATTTCTGGGCCCCGGGAAAAAA
I have code that I want to use to separate sequences based on IDs that match patterns such as 'CDS' or 'tRNA'. In the code below, I try to combine startswith with a substring test on the line, but it doesn't seem to work. Can someone please show me how to check for two conditions on a line in Python?
Code (run as python mycode.py myfasta.fasta):
#!/usr/bin/env python
import sys
import os
myfasta = sys.argv[1]
fasta = open(myfasta)
for line in fasta:
    if line.startswith('>') and 'CDS' in line:
        print(line)
    else:
        print(line)
Expected output (if I use CDS):
>1_CDS
AAAAATTTCTGGGCCCCGGGGG
AAATTATTA
>2_CDS
TTAAAAATTTCTGGGCCCCGGGAAAAAA
>3_CDS
TTTGGGAATTAAACCCT
>4_CDS
TTTGGGAATTAAACCCT
python bioinformatics fasta
asked Mar 27 at 21:16 by MAPK, edited Mar 27 at 22:24
Mmmhh, seems like you're printing line anyway... How do you differentiate between positives and negatives?
– Jacques Gaudin
Mar 27 at 21:21
Yeah... the code seems to do what you think, except that you're going to get blank lines, because each of your lines already ends with an EOL and print() adds another one.
– Steve
Mar 27 at 21:25
You're still just printing every line unconditionally!!! You have to do something different depending on whether your test succeeds or fails, or what's the point?
– Steve
Mar 27 at 21:26
You have to treat lines with '>' at the front differently than those without, regardless of whether they match 'CDS' or not. When you see '>' at the front of the line, you need to check whether 'CDS' is in it to know if you should print it. But you ALSO have to set a flag, so that the next time you see a line without a '>' at the front, you'll know whether you should print that line or not. That's the secret here. Create a boolean variable to keep track of whether you printed the last '>' line you saw, and then use that variable to decide what to do with non-'>' lines.
– Steve
Mar 27 at 21:30
Easiest with Biopython: from Bio import SeqIO; SeqIO.write((r for r in SeqIO.parse('in.fa', 'fasta') if 'CDS' in r.id), 'out.fa', 'fasta')
– Chris_Rands
Mar 28 at 14:00
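The comments above point at two routes: tracking a flag while reading line by line, or working with whole records as Biopython does. A minimal standard-library sketch of the record-level idea might look like the following. The function read_fasta and the sample data are hypothetical names made up for illustration; they are not part of Biopython or of the original post.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines.

    Hypothetical helper: multi-line sequences are joined into one string.
    """
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith('>'):
            if header is not None:
                yield header, ''.join(chunks)  # emit the previous record
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, ''.join(chunks)  # emit the last record


sample = """>1_CDS
AAAAATTTCTGGGCCCCGGGGG
AAATTATTA
>5_rRNA
TTAAAAATTTCTGGGCCCCGGGAAAAAA
""".splitlines()

# Keep only the records whose header contains 'CDS'.
cds_records = [(h, s) for h, s in read_fasta(sample) if 'CDS' in h]
print(cds_records)
```

Because the filter runs on complete records, the sequence lines never need a separate flag; the trade-off is that each record's sequence is held in memory while it is assembled.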
3 Answers
Here is code that works for you. If a header line contains CDS, it prints that line and the sequence lines that follow it. strip() removes the trailing newline when printing each line.
#!/usr/bin/env python
import sys

myfasta = sys.argv[1]
flag = False  # True while we are inside a record whose header contains CDS
with open(myfasta) as fasta:
    for line in fasta:
        if line.startswith('>') and 'CDS' in line:
            flag = True
        elif line.startswith('>'):
            flag = False
        if flag:
            print(line.strip())
Edit: You can drop the elif branch, as in the following code:
#!/usr/bin/env python
import sys

myfasta = sys.argv[1]
flag = False
with open(myfasta) as fasta:
    for line in fasta:
        if line.startswith('>'):
            # Re-evaluate the flag on every header line.
            flag = 'CDS' in line
        if flag:
            print(line.strip())
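The question also mentions other patterns such as 'tRNA'. As a rough sketch, the same flag idea can be extended to several patterns with any(). The function name filter_fasta and the sample lines are assumptions introduced here for illustration only.

```python
def filter_fasta(lines, patterns):
    """Yield header and sequence lines of records whose header contains any pattern.

    Hypothetical helper: patterns are matched as plain substrings of the header.
    """
    keep = False
    for line in lines:
        line = line.rstrip('\n')
        if line.startswith('>'):
            # Decide once per record, on its header line.
            keep = any(p in line for p in patterns)
        if keep:
            yield line


sample = ['>1_CDS\n', 'AAAA\n', '>5_rRNA\n', 'TTAA\n', '>6_tRNA\n', 'GGGG\n']
result = list(filter_fasta(sample, ('CDS', 'tRNA')))
print(result)
```

Since this is substring matching, a pattern such as 'RNA' would select both the rRNA and the tRNA records; a stricter test (for example header.endswith('_tRNA')) may be preferable if the IDs follow a fixed layout.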
OK, I changed the code. The first one was based on your current inputs.
– maanijou
Mar 27 at 22:02
Are you sure? I'm getting multiple lines with your inputs.
– maanijou
Mar 27 at 22:09
You should probably close the file at the end, or use a with context manager.
– bli
Mar 28 at 17:50
Good point. Edited.
– maanijou
Mar 28 at 18:53
I mean exactly what I wrote: flag is still a boolean, but you can remove your elif clause.
– Chris_Rands
Mar 30 at 9:48
Maanijou's answer is fine.
Also, consider an alternative with an iterator instead.
EDIT: Updated the code based on your comments
#!/usr/bin/env python
import sys

myfasta = sys.argv[1]
fasta = open(myfasta, "r")
file_contents = iter(fasta)
try:
    print_flag = False  # nothing is printed until a CDS header is seen
    while True:
        # Built-in next() works in Python 2 and 3; the .next() method is Python 2 only.
        line = next(file_contents)
        if line.startswith('>'):
            if "CDS" in line:
                print(line.strip())
                print_flag = True
            else:
                print_flag = False
        else:
            if print_flag:
                print(line.strip())
except StopIteration:
    print("Done")
finally:
    fasta.close()
Explanation
file_contents = iter(fasta) gives you an iterator over the lines of the file, on which you can keep calling next() until you run out of things to read.
I do not recommend calling readlines, as some other answers do, because fasta files can be big and readlines loads every line into memory at once.
If a header line satisfies your search requirement, you print it and the sequence lines that follow it; if not, you keep reading and print nothing until the next matching header.
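To make the iterator idea concrete, here is a small sketch using the built-in next() with a default value, which avoids the try/except around StopIteration. The in-memory io.StringIO handle stands in for a real file and is only an assumption to keep the example self-contained.

```python
import io

# Stand-in for an opened fasta file (assumption for illustration).
handle = io.StringIO(">1_CDS\nAAAA\n>5_rRNA\nTTTT\n")
lines = iter(handle)

kept = []
keep = False
while True:
    # next(iterator, default) returns the default once the iterator is exhausted.
    line = next(lines, None)
    if line is None:
        break
    if line.startswith('>'):
        keep = 'CDS' in line
    if keep:
        kept.append(line.strip())

print(kept)
```

Only one line is held at a time, so memory use stays flat regardless of file size. In Python 3 the iterator protocol method is __next__, which is why the built-in next() is the portable spelling.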
Explanation for Update
- The AttributeError ('_io.TextIOWrapper' object has no attribute 'next') appears under Python 3, where iterators have no .next() method; calling the built-in next(file_contents) instead avoids it. I could not reproduce the error locally and first suspected the file mode.
- You now said there could be more than one sequence line under a CDS header, so I updated the code to print all the sequence lines for each CDS header in the file.
I tested it with a modified fasta file like so:
>1_CDS
AAAAATTTCTGGGCCCCGGGGG
AAAAATTTCTGGGCCCCGGGGG
AAAAATTTCTGGGCCCCGGGGG
AAAAATTTCTGGGCCCCGGGCG
>2_CDS
TTAAAAATTTCTGGGCCCCGGGAAAAAA
>3_CDS
TTTGGGAATTAAACCCT
>4_CDS
TTTGGGAATTAAACCCT
>5_rRNA
TTAAAAATTTCTGGGCCCCGGGAAAAAA
>6_tRNA
TTAAAAATTTCTGGGCCCCGGGAAAAAA
And got this output:
python fasta.py fasta.fasta
>1_CDS
AAAAATTTCTGGGCCCCGGGGG
AAAAATTTCTGGGCCCCGGGGG
AAAAATTTCTGGGCCCCGGGGG
AAAAATTTCTGGGCCCCGGGCG
>2_CDS
TTAAAAATTTCTGGGCCCCGGGAAAAAA
>3_CDS
TTTGGGAATTAAACCCT
>4_CDS
TTTGGGAATTAAACCCT
Done
Getting this error: AttributeError: '_io.TextIOWrapper' object has no attribute 'next'
– MAPK
Mar 27 at 22:01
I think the error was because of file modes. I could not reproduce it locally, but I have added a fix which I think will fix it for you.
– Srini
Mar 27 at 22:08
Is this what you want?
#!/usr/bin/env python
import sys
from collections import defaultdict

myfasta = sys.argv[1]
with open(myfasta) as fasta:
    data = fasta.read().splitlines()

# Map each pattern (the text after '_' in the header) to its sequences.
pattern_data = defaultdict(list)
index = 0
while index < len(data):
    if data[index].startswith('>'):
        start = data[index].index('_') + 1
        key = data[index][start:]
        pattern_data[key].append(data[index + 1])
    index += 2  # assumes exactly one sequence line per header
At this point you are free to do whatever you please with the grouped data.
The above assumes that the whole file follows the exact format shown above: one line starting with ">" that identifies the single sequence line that follows. If multiple sequence lines can follow a header, the code needs a minor modification.
EDIT:
I just read up on fasta files. I now know that they actually may have sequences that are longer than one line after they are identified. So the above code does need to be modified to account for multiline sequences. A more generalized approach is as follows:
#!/usr/bin/env python
import sys
from collections import defaultdict

myfasta = sys.argv[1]
with open(myfasta) as fasta:
    data = fasta.read().splitlines()

# Indices of all header lines, plus a sentinel marking the end of the file.
id_line_indices = [index for index, line in enumerate(data) if line.startswith('>')]
id_line_indices.append(len(data))

pattern_buckets = defaultdict(list)
i = 0
while i < len(id_line_indices) - 1:
    start = data[id_line_indices[i]].index('_') + 1
    key = data[id_line_indices[i]][start:]
    # Join every line between this header and the next into one sequence.
    sequence = [data[index] for index in range(id_line_indices[i] + 1, id_line_indices[i + 1])]
    sequence = ''.join(sequence)
    pattern_buckets[key].append(sequence)
    i += 1
This still achieves the same results for the above data set. For example,
print(pattern_buckets['CDS'])
print(pattern_buckets['rRNA'])
Will get you:
['AAAAATTTCTGGGCCCCGGGGG', 'TTAAAAATTTCTGGGCCCCGGGAAAAAA', 'TTTGGGAATTAAACCCT', 'TTTGGGAATTAAACCCT']
['TTAAAAATTTCTGGGCCCCGGGAAAAAA']
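If the goal is to separate the records into one file per type, the grouping above can also be done while streaming, without holding the whole file in memory. The sketch below is only one possible way; split_fasta_by_type, the output naming scheme (<type>.fasta), and the assumption that headers look like '>N_TYPE' are all introduced here for illustration.

```python
import tempfile
from pathlib import Path


def split_fasta_by_type(fasta_path, out_dir):
    """Write each record to <out_dir>/<type>.fasta.

    Hypothetical helper: the type is the text after the last '_' in the header.
    """
    out_dir = Path(out_dir)
    handles = {}    # type -> open output file
    current = None  # output file for the record being read
    try:
        with open(fasta_path) as fasta:
            for line in fasta:
                if line.startswith('>'):
                    key = line[1:].strip().rsplit('_', 1)[-1]
                    if key not in handles:
                        handles[key] = open(out_dir / f"{key}.fasta", "w")
                    current = handles[key]
                if current is not None:
                    current.write(line)  # header or sequence line
    finally:
        for handle in handles.values():
            handle.close()
    return sorted(handles)


# Small demonstration on a temporary copy of made-up data.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "myfasta.fasta"
    src.write_text(">1_CDS\nAAAA\nCCCC\n>5_rRNA\nTTTT\n>2_CDS\nGGGG\n")
    types = split_fasta_by_type(src, tmp)
    cds_text = (Path(tmp) / "CDS.fasta").read_text()

print(types)
print(cds_text)
```

One output file stays open per type, which is fine for a handful of types such as CDS, rRNA and tRNA but would need rethinking if the headers produced thousands of distinct keys.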
First answer: answered Mar 27 at 21:48 by maanijou, edited Mar 30 at 9:56
Ok i changed the code. the first one was based on your current inputs
– maanijou
Mar 27 at 22:02
Are you sure? I'm getting multiple lines with your inputs.
– maanijou
Mar 27 at 22:09
2
You should probably close the file at the end, or use awith
context manager.
– bli
Mar 28 at 17:50
Good point. Edited.
– maanijou
Mar 28 at 18:53
1
i mean exactly what i wrote,flag
is still a boolean, but you can remove yourelif
clause
– Chris_Rands
Mar 30 at 9:48
|
show 3 more comments
Ok i changed the code. the first one was based on your current inputs
– maanijou
Mar 27 at 22:02
Are you sure? I'm getting multiple lines with your inputs.
– maanijou
Mar 27 at 22:09
2
You should probably close the file at the end, or use awith
context manager.
– bli
Mar 28 at 17:50
Good point. Edited.
– maanijou
Mar 28 at 18:53
1
i mean exactly what i wrote,flag
is still a boolean, but you can remove yourelif
clause
– Chris_Rands
Mar 30 at 9:48
Ok i changed the code. the first one was based on your current inputs
– maanijou
Mar 27 at 22:02
Ok i changed the code. the first one was based on your current inputs
– maanijou
Mar 27 at 22:02
Are you sure? I'm getting multiple lines with your inputs.
– maanijou
Mar 27 at 22:09
Are you sure? I'm getting multiple lines with your inputs.
– maanijou
Mar 27 at 22:09
2
2
You should probably close the file at the end, or use a
with
context manager.– bli
Mar 28 at 17:50
You should probably close the file at the end, or use a
with
context manager.– bli
Mar 28 at 17:50
Good point. Edited.
– maanijou
Mar 28 at 18:53
Good point. Edited.
– maanijou
Mar 28 at 18:53
1
1
i mean exactly what i wrote,
flag
is still a boolean, but you can remove your elif
clause– Chris_Rands
Mar 30 at 9:48
i mean exactly what i wrote,
flag
is still a boolean, but you can remove your elif
clause– Chris_Rands
Mar 30 at 9:48
|
show 3 more comments
Maanijou's answer is fine.
Also, consider an alternative with a iterator instead.
EDIT: Updated the code based on your comments
#!/usr/bin/env python
import sys
import os
myfasta = sys.argv[1]
fasta = open(myfasta, "r+")
file_contents = iter(fasta)
try:
print_flag = True
while True:
line = file_contents.next()
if line.startswith('>'):
if "CDS" in line:
print (line.strip())
print_flag = True
else:
print_flag = False
else:
if print_flag:
print (line.strip())
except StopIteration:
print ("Done")
fasta.close()
Explanation
file_contents = iter(fasta)
converts the iterable file object into an iterator on which you can simply keep calling next()
till you run out of things to read
Why I do not recommend calling readlines
as some other answers have is that sometimes fasta files can be big and calling readlines
consumes significant memory.
if a line satisfies your search req you simply print it and the next line, if not you simply read the next line and do nothing,
Explanation for Update
- You got the Attribute error because of file modes, I could not reproduce it locally but I think opening the file with the right mode should fix it
- You now said there could be more than 1 genome sequence for
CDS
updated the code to print all the genome sequences for 1CDS
header in the file
I tested it with a modified fasta file like so
>1_CDS
AAAAATTTCTGGGCCCCGGGGG
AAAAATTTCTGGGCCCCGGGGG
AAAAATTTCTGGGCCCCGGGGG
AAAAATTTCTGGGCCCCGGGCG
>2_CDS
TTAAAAATTTCTGGGCCCCGGGAAAAAA
>3_CDS
TTTGGGAATTAAACCCT
>4_CDS
TTTGGGAATTAAACCCT
>5_rRNA
TTAAAAATTTCTGGGCCCCGGGAAAAAA
>6_tRNA
TTAAAAATTTCTGGGCCCCGGGAAAAAA
And this output
python fasta.py fasta.fasta
>1_CDS
AAAAATTTCTGGGCCCCGGGGG
AAAAATTTCTGGGCCCCGGGGG
AAAAATTTCTGGGCCCCGGGGG
AAAAATTTCTGGGCCCCGGGCG
>2_CDS
TTAAAAATTTCTGGGCCCCGGGAAAAAA
>3_CDS
TTTGGGAATTAAACCCT
>4_CDS
TTTGGGAATTAAACCCT
Done
getting this error:AttributeError: '_io.TextIOWrapper' object has no attribute 'next'
– MAPK
Mar 27 at 22:01
I think the error was because of file modes, I could not reproduce it locally, but I have added a fix which I think will fix it for you
– Srini
Mar 27 at 22:08
add a comment |
Maanijou's answer is fine.
Also, consider an alternative with a iterator instead.
EDIT: Updated the code based on your comments
#!/usr/bin/env python
import sys
import os
myfasta = sys.argv[1]
fasta = open(myfasta, "r+")
file_contents = iter(fasta)
try:
print_flag = True
while True:
line = file_contents.next()
if line.startswith('>'):
if "CDS" in line:
print (line.strip())
print_flag = True
else:
print_flag = False
else:
if print_flag:
print (line.strip())
except StopIteration:
print ("Done")
fasta.close()
Explanation
file_contents = iter(fasta)
converts the iterable file object into an iterator on which you can simply keep calling next()
till you run out of things to read
Why I do not recommend calling readlines
as some other answers have is that sometimes fasta files can be big and calling readlines
consumes significant memory.
if a line satisfies your search req you simply print it and the next line, if not you simply read the next line and do nothing,
Explanation for Update
- You got the Attribute error because of file modes, I could not reproduce it locally but I think opening the file with the right mode should fix it
- You now said there could be more than 1 genome sequence for
CDS
updated the code to print all the genome sequences for 1CDS
header in the file
I tested it with a modified fasta file like so
>1_CDS
AAAAATTTCTGGGCCCCGGGGG
AAAAATTTCTGGGCCCCGGGGG
AAAAATTTCTGGGCCCCGGGGG
AAAAATTTCTGGGCCCCGGGCG
>2_CDS
TTAAAAATTTCTGGGCCCCGGGAAAAAA
>3_CDS
TTTGGGAATTAAACCCT
>4_CDS
TTTGGGAATTAAACCCT
>5_rRNA
TTAAAAATTTCTGGGCCCCGGGAAAAAA
>6_tRNA
TTAAAAATTTCTGGGCCCCGGGAAAAAA
And this output
python fasta.py fasta.fasta
>1_CDS
AAAAATTTCTGGGCCCCGGGGG
AAAAATTTCTGGGCCCCGGGGG
AAAAATTTCTGGGCCCCGGGGG
AAAAATTTCTGGGCCCCGGGCG
>2_CDS
TTAAAAATTTCTGGGCCCCGGGAAAAAA
>3_CDS
TTTGGGAATTAAACCCT
>4_CDS
TTTGGGAATTAAACCCT
Done
getting this error:AttributeError: '_io.TextIOWrapper' object has no attribute 'next'
– MAPK
Mar 27 at 22:01
I think the error was because of file modes, I could not reproduce it locally, but I have added a fix which I think will fix it for you
– Srini
Mar 27 at 22:08
add a comment |
Maanijou's answer is fine.
Also, consider an alternative with a iterator instead.
EDIT: Updated the code based on your comments
#!/usr/bin/env python
import sys
import os
myfasta = sys.argv[1]
fasta = open(myfasta, "r+")
file_contents = iter(fasta)
try:
print_flag = True
while True:
line = file_contents.next()
if line.startswith('>'):
if "CDS" in line:
print (line.strip())
print_flag = True
else:
print_flag = False
else:
if print_flag:
print (line.strip())
except StopIteration:
print ("Done")
fasta.close()
Explanation
file_contents = iter(fasta)
converts the iterable file object into an iterator on which you can simply keep calling next()
till you run out of things to read
Why I do not recommend calling readlines
as some other answers have is that sometimes fasta files can be big and calling readlines
consumes significant memory.
if a line satisfies your search req you simply print it and the next line, if not you simply read the next line and do nothing,
Explanation for Update
- You got the Attribute error because of file modes, I could not reproduce it locally but I think opening the file with the right mode should fix it
- You now said there could be more than 1 genome sequence for
CDS
updated the code to print all the genome sequences for 1CDS
header in the file
I tested it with a modified fasta file like so
>1_CDS
AAAAATTTCTGGGCCCCGGGGG
AAAAATTTCTGGGCCCCGGGGG
AAAAATTTCTGGGCCCCGGGGG
AAAAATTTCTGGGCCCCGGGCG
>2_CDS
TTAAAAATTTCTGGGCCCCGGGAAAAAA
>3_CDS
TTTGGGAATTAAACCCT
>4_CDS
TTTGGGAATTAAACCCT
>5_rRNA
TTAAAAATTTCTGGGCCCCGGGAAAAAA
>6_tRNA
TTAAAAATTTCTGGGCCCCGGGAAAAAA
And this output
python fasta.py fasta.fasta
>1_CDS
AAAAATTTCTGGGCCCCGGGGG
AAAAATTTCTGGGCCCCGGGGG
AAAAATTTCTGGGCCCCGGGGG
AAAAATTTCTGGGCCCCGGGCG
>2_CDS
TTAAAAATTTCTGGGCCCCGGGAAAAAA
>3_CDS
TTTGGGAATTAAACCCT
>4_CDS
TTTGGGAATTAAACCCT
Done
edited Mar 27 at 22:06
answered Mar 27 at 21:54
Srini
Is this what you want?
#!/usr/bin/env python
import sys
from collections import defaultdict

myfasta = sys.argv[1]
with open(myfasta) as fasta:
    data = fasta.read().splitlines()

# Maps the text after the first '_' in a header (e.g. 'CDS') to its sequences.
pattern_data = defaultdict(list)
index = 0
while index < len(data):
    if data[index].startswith('>'):
        start = data[index].index('_') + 1
        key = data[index][start:]
        pattern_data[key].append(data[index + 1])
    # Records are assumed to be exactly two lines: header, then sequence.
    index += 2
At this point you are free to do whatever you please with the grouped data.
The above assumes that the whole file follows the exact format shown above: one line starting with ">" that identifies the single sequence line that follows it. If several sequence lines follow a header, the code needs a minor modification.
EDIT:
I just read up on fasta files. I now know that they actually may have sequences that are longer than one line after they are identified. So the above code does need to be modified to account for multiline sequences. A more generalized approach is as follows:
#!/usr/bin/env python
import sys
from collections import defaultdict

myfasta = sys.argv[1]
with open(myfasta) as fasta:
    data = fasta.read().splitlines()

# Positions of all header lines, plus an end marker so the last record has an upper bound.
id_line_indices = [index for index, line in enumerate(data) if line.startswith('>')]
id_line_indices.append(len(data))

pattern_buckets = defaultdict(list)
i = 0
while i < len(id_line_indices) - 1:
    start = data[id_line_indices[i]].index('_') + 1
    key = data[id_line_indices[i]][start:]
    # Join every line between this header and the next one into a single sequence.
    sequence = [data[index] for index in range(id_line_indices[i] + 1, id_line_indices[i + 1])]
    sequence = ''.join(sequence)
    pattern_buckets[key].append(sequence)
    i += 1
This still achieves the same results for the above data set. For example,
print(pattern_buckets['CDS'])
print(pattern_buckets['rRNA'])
Will get you:
['AAAAATTTCTGGGCCCCGGGGG', 'TTAAAAATTTCTGGGCCCCGGGAAAAAA', 'TTTGGGAATTAAACCCT', 'TTTGGGAATTAAACCCT']
['TTAAAAATTTCTGGGCCCCGGGAAAAAA']
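If the file is too large to read in one go, the same grouping could be done in a single pass over the lines. The following is only a sketch under a few assumptions: the function name bucket_sequences is invented for this example, every header is assumed to contain an underscore (as in the data above), and lines before the first header are ignored.

```python
from collections import defaultdict


def bucket_sequences(lines):
    """Group FASTA sequences by the text after the first '_' in each header."""
    buckets = defaultdict(list)
    key = None   # bucket of the record currently being read
    parts = []   # sequence lines collected so far for that record

    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if key is not None:
                # A new header closes the previous record.
                buckets[key].append("".join(parts))
            # ">1_CDS" -> "CDS"; assumes the header contains an underscore
            key = line.split("_", 1)[1]
            parts = []
        elif key is not None:
            parts.append(line)

    if key is not None:
        # Flush the final record, which has no header after it.
        buckets[key].append("".join(parts))
    return dict(buckets)
```

Passing an open file object as `lines` keeps only the current record's lines and the finished buckets in memory, rather than a second full copy of the file.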
edited Mar 27 at 22:11
answered Mar 27 at 21:49
Perplexabot
1
Mmmhh, seems like you're printing line anyway... How do you differentiate between positives and negatives?
– Jacques Gaudin
Mar 27 at 21:21
1
yeah...code seems to do what you think, except that you're going to get blank lines because you have EOL at the end of each of your lines, and then print() is adding another one.
– Steve
Mar 27 at 21:25
1
You're still just printing every line unconditionally!!! You have to do something different depending on if your test succeeds or fails, or what's the point?
– Steve
Mar 27 at 21:26
1
You have to treat lines with '>' at the front differently than those without, regardless of whether they match 'CDS' or not. When you see '>' at the front of the line, you need to check whether 'CDS' is in it to know if you should print it. But you ALSO have to set a flag, so that the next time you see a line without a '>' at the front, you'll know whether you should print that line or not. That's the secret here. Create a boolean variable to keep track of whether you printed the last '>' line you saw, and then use that variable to decide what to do with non-'>' lines.
– Steve
Mar 27 at 21:30
2
easiest with Biopython:
from Bio import SeqIO; SeqIO.write((r for r in SeqIO.parse('in.fa', 'fasta') if 'CDS' in r.id), 'out.fa', 'fasta')
– Chris_Rands
Mar 28 at 14:00