Correctly splitting a CSV file after repetition in pandas


I have a CSV file containing 5000 rows; every few hundred lines, a new section begins with a repeating header block.

What is the most efficient way to split this file into several separate files?

The file looks like this:



Header1
number of Samples1
Content1
a1, aa1, aaa1
b1, bb1, bbb1
....
Header2
number of Samples2
Content2
a2, aa2, aaa2
b2, bb2, bbb2
....
Header3
number of Samples3
Content3
a3, aa3, aaa3
b3, bb3, bbb3


I need to split it into several files, one per header, and I have no idea how to do that. I am writing a script to process some biological data, but one of the file types (shown above) causes problems because it is really several files in one, and the script cannot handle it.

I've read a lot about splitting files, but I've found nothing about splitting on repeating values in pandas.

In this case, the result would be 3 files (but the number of sections per file varies).










    Is the content size constant? If so, you can use a counter variable to track where each new section begins. If not, you can read content lines in a loop and break on the first non-content line (which will be the next header).

    – vurmux
    Mar 25 at 10:41











    No, the size isn't constant. But the break statement is good advice, thank you! I will try it later.

    – Artur
    Mar 25 at 11:06


















python pandas csv split






edited Mar 25 at 4:51









Matěj Štágl

asked Mar 25 at 0:20









Artur

1 Answer
I found a somewhat better solution than the break statements I suggested in my comment:

You can create a result list and store each chunk of data as a separate element of that list (as a dict, for example). When you read a non-header line, you know it belongs to the current chunk, and the current chunk is always the last element of the result list, so you can simply modify it. When you read a header line, you append a new element to the result and start writing the new chunk's data into it.

If the size of each section is constant, you can use an itertools.cycle iterator to tell you which kind of line you are reading:



from itertools import cycle

text1 = """Header1
number of Samples1
Content1
a1, aa1, aaa1
b1, bb1, bbb1
Header2
number of Samples2
Content2
a2, aa2, aaa2
b2, bb2, bbb2"""
size = 5  # number of lines in every section (constant in this example)
iterator = cycle(range(size))  # yields 0, 1, 2, 3, 4, 0, 1, ...
result = []
for line in text1.split('\n'):
    i = next(iterator)  # position of this line within its section
    if i == 0:
        result.append({'header': line})  # start a new section
    elif i == 1:
        result[-1]['num_of_samples'] = line
    elif i == 2:
        result[-1]['content_header'] = line
    elif i == 3:
        result[-1]['content'] = [line.split(', ')]  # first content row
    else:
        result[-1]['content'].append(line.split(', '))


If you don't know the size of each section, you should inspect each line, determine its type, and build the data structure yourself:



text2 = """Header1
number of Samples1
Content1
a1, aa1, aaa1
b1, bb1, bbb1
b1, bb1, bbb1
Header2
number of Samples2
Content2
b2, bb2, bbb2
Header3
number of Samples3
Content3
a3, aa3, aaa3
b3, bb3, bbb3"""
result = []
for line in text2.split('\n'):
    if line.startswith('Header'):  # Your condition for headers
        result.append({'header': line})  # start a new section
    elif line.startswith('number'):  # Your condition for number of samples
        result[-1]['num_of_samples'] = line
    elif line.startswith('Content'):  # Your condition for content headers
        result[-1]['content_header'] = line
    else:
        if 'content' not in result[-1]:  # The content list may not exist yet
            result[-1]['content'] = [line.split(', ')]
        else:
            result[-1]['content'].append(line.split(', '))
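The parsing above keeps everything in memory, while the question asks for separate files. A minimal sketch of that last step follows, under a few assumptions that are not in the original post: sections start at lines beginning with 'Header', the header text is safe to use as a file name, and only the content rows are written out. The function names parse_sections and write_sections are hypothetical.

```python
import csv
from pathlib import Path


def parse_sections(lines):
    """Group lines into one dict per section; a new section starts at each header."""
    sections = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith('Header'):  # assumed header marker, as in the answer
            sections.append({'header': line, 'meta': [], 'content': []})
        elif not sections:
            continue  # ignore anything that precedes the first header
        elif line.startswith(('number', 'Content')):
            sections[-1]['meta'].append(line)  # sample count and content header
        else:
            row = [cell.strip() for cell in line.split(',')]
            sections[-1]['content'].append(row)
    return sections


def write_sections(sections, out_dir):
    """Write each section's content rows to <out_dir>/<header>.csv."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for section in sections:
        # Assumption: the header text contains no characters invalid in file names.
        path = out_dir / f"{section['header']}.csv"
        with path.open('w', newline='') as handle:
            csv.writer(handle).writerows(section['content'])
        paths.append(path)
    return paths
```

With a real file, something like write_sections(parse_sections(open('input.csv')), 'out') would be the usage, where both names are placeholders. Each resulting file could then be read on its own, for example with pandas.read_csv(path, header=None).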





  • Awesome! Works great. Thank you so much!

    – Artur
    Mar 25 at 13:28











answered Mar 25 at 11:42









vurmux
