Correctly splitting a CSV file after repetition in pandas
I have a CSV file containing 5000 rows; every few hundred lines there is a repeating section.
What is the most efficient way to divide this file into several separate files?
The file looks like this:
Header1
number of Samples1
Content1
a1, aa1, aaa1
b1, bb1, bbb1
....
Header2
number of Samples2
Content2
a2, aa2, aaa2
b2, bb2, bbb2
....
Header3
number of Samples3
Content3
a3, aa3, aaa3
b3, bb3, bbb3
I need to split it into several files, one per Header, and I have no idea how to do that. I am writing a script to process some biological data, but one of the file types (shown above) causes problems because it is really several files in one, and the script cannot handle it.
I've read a lot about splitting files, but I've found nothing about splitting on repeating values in pandas.
In this case, it would be 3 files (but the number of sections per file varies).
python pandas csv split
Is the content size constant? If yes, you can use a count variable to track when a new section begins. If not, you can read content lines in an endless loop with a break condition on a non-content line (which will be the new header).
– vurmux
Mar 25 at 10:41
No, the size isn't constant. But the break statement is good advice, thank you! I will try it later.
– Artur
Mar 25 at 11:06
edited Mar 25 at 4:51 by Matěj Štágl
asked Mar 25 at 0:20 by Artur
1 Answer
I found a somewhat better solution than the break statements I suggested in the comment:
You can create a result list and store each chunk of data in a separate element of the list (in a dict, for example). If you read a non-Header line, you know that the line belongs to the current chunk of data. The current chunk is the last element of the result list, so you can just modify it. If you read a Header line, you append a new element to result and start writing the new chunk's data into it.
If the size of the content is constant, you can use the itertools.cycle iterator to "codify" your parsing process:
from itertools import cycle

text1 = """Header1
number of Samples1
Content1
a1, aa1, aaa1
b1, bb1, bbb1
Header2
number of Samples2
Content2
a2, aa2, aaa2
b2, bb2, bbb2"""

size = 5  # number of lines in every section
iterator = cycle(range(size))  # position of the current line within its section
result = []
for line in text1.split('\n'):
    i = next(iterator)
    if i == 0:
        result.append({'header': line})  # start a new chunk
    elif i == 1:
        result[-1]['num_of_samples'] = line
    elif i == 2:
        result[-1]['content_header'] = line
    elif i == 3:
        result[-1]['content'] = [line.split(', ')]
    else:
        result[-1]['content'].append(line.split(', '))
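To make the shape of the output concrete, here is a minimal sketch that wraps the same cycle-based loop in a function and prints what it returns for a small input. The function name `parse_fixed_sections` and the sample text are illustrative assumptions, not part of the original answer, and the sketch only works if every section really has exactly `size` lines.

```python
from itertools import cycle


def parse_fixed_sections(text, size):
    """Parse sections that are always exactly `size` lines long (assumption)."""
    position = cycle(range(size))  # yields 0, 1, ..., size-1, 0, 1, ...
    sections = []
    for line in text.split('\n'):
        i = next(position)
        if i == 0:
            # First line of a section: open a new chunk.
            sections.append({'header': line, 'content': []})
        elif i == 1:
            sections[-1]['num_of_samples'] = line
        elif i == 2:
            sections[-1]['content_header'] = line
        else:
            # Every remaining line of the section is a data row.
            sections[-1]['content'].append(line.split(', '))
    return sections


sample = "Header1\nnumber of Samples1\nContent1\na1, aa1, aaa1\nb1, bb1, bbb1"
parsed = parse_fixed_sections(sample, size=5)
print(parsed[0]['header'])   # Header1
print(parsed[0]['content'])  # [['a1', 'aa1', 'aaa1'], ['b1', 'bb1', 'bbb1']]
```

Creating the empty `content` list together with the header removes the need for a separate branch for the first data row.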
If you don't know the size of the content, you should parse each line, check its type and build your data manually:
text2 = """Header1
number of Samples1
Content1
a1, aa1, aaa1
b1, bb1, bbb1
b1, bb1, bbb1
Header2
number of Samples2
Content2
b2, bb2, bbb2
Header3
number of Samples3
Content3
a3, aa3, aaa3
b3, bb3, bbb3"""

result = []
for line in text2.split('\n'):
    if line.startswith('Header'):  # Your condition for headers
        result.append({'header': line})
    elif line.startswith('number'):  # Your condition for number of samples
        result[-1]['num_of_samples'] = line
    elif line.startswith('Content'):  # Your condition for content headers
        result[-1]['content_header'] = line
    else:
        if 'content' not in result[-1]:  # The content list may not exist yet
            result[-1]['content'] = [line.split(', ')]
        else:
            result[-1]['content'].append(line.split(', '))
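Since the question asks for separate files, the parsed chunks can then be written out one file per section. The following is only a sketch under stated assumptions: the helper names `split_sections` and `write_sections` are hypothetical, header lines are assumed to start with `Header` and to be usable as file names, and only the data rows (not the "number of Samples" or "Content" lines) are written.

```python
import csv
import os
import tempfile


def split_sections(lines):
    """Group lines into sections; a line starting with 'Header' opens a new one."""
    sections = []
    for line in lines:
        line = line.rstrip('\n')
        if line.startswith('Header'):  # assumed condition for headers
            sections.append({'header': line, 'meta': [], 'rows': []})
        elif not sections:
            continue  # ignore anything before the first header
        elif line.startswith(('number', 'Content')):  # assumed metadata lines
            sections[-1]['meta'].append(line)
        elif line.strip():  # skip blank lines, keep data rows
            sections[-1]['rows'].append([cell.strip() for cell in line.split(',')])
    return sections


def write_sections(sections, out_dir):
    """Write the data rows of each section to <out_dir>/<header>.csv."""
    paths = []
    for section in sections:
        path = os.path.join(out_dir, section['header'] + '.csv')
        with open(path, 'w', newline='') as handle:
            csv.writer(handle).writerows(section['rows'])
        paths.append(path)
    return paths


text = """Header1
number of Samples1
Content1
a1, aa1, aaa1
b1, bb1, bbb1
Header2
number of Samples2
Content2
a2, aa2, aaa2"""

out_dir = tempfile.mkdtemp()  # temporary directory, used here only for the demo
written = write_sections(split_sections(text.split('\n')), out_dir)
print([os.path.basename(p) for p in written])  # ['Header1.csv', 'Header2.csv']
```

For a real file, passing the open file object to `split_sections` instead of `text.split('\n')` should work the same way, because the function strips the trailing newline from each line itself.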
Awesome! Works great. Thank you so much!
– Artur
Mar 25 at 13:28
answered Mar 25 at 11:42 by vurmux