Correctly splitting a CSV file after repetition in pandas


I have a CSV file containing 5000 rows; every few hundred lines there is a repeating section.

What is the most efficient way to divide this file into several separate files?

The file looks like this:

Header1
number of Samples1
Content1
a1, aa1, aaa1
b1, bb1, bbb1
....
Header2
number of Samples2
Content2
a2, aa2, aaa2
b2, bb2, bbb2
....
Header3
number of Samples3
Content3
a3, aa3, aaa3
b3, bb3, bbb3

I need to split it into a few files by Header, and I have no idea how to do that. I wrote a whole script to process some biological data, but one of the file types (above) causes problems because it is really several files in one, and the script cannot work with it.

I've read a lot about splitting files, but I've found nothing about splitting on repeating values in pandas.

In this case, it would be 3 files (but the number of sections per file varies).

edited Mar 25 at 4:51 by Matěj Štágl

asked Mar 25 at 0:20 by Artur

Is the content size constant? If yes, you can use a counter variable to track when a new section begins. If not, you can read content lines in an endless loop and break on the first non-content line (which will be the next header).

– vurmux, Mar 25 at 10:41

No, the size isn't constant. But a break statement is good advice, thank you! I will try it later.

– Artur, Mar 25 at 11:06

python pandas csv split


1 Answer

I found a somewhat better solution than the break statements I suggested in the comment:

You can create a result list and store each chunk of data as a separate element of that list (as a dict, for example). When you read a non-Header line, you can be sure that the line belongs to the current chunk of data. The current chunk is always the last element of the result list, so you can simply modify it. When you read a Header line, you append a new element to the result and start writing the new chunk's data into it.

If the size of the content is constant, you can use the itertools.cycle iterator to "codify" your parsing process:

from itertools import cycle

text1 = """Header1
number of Samples1
Content1
a1, aa1, aaa1
b1, bb1, bbb1
Header2
number of Samples2
Content2
a2, aa2, aaa2
b2, bb2, bbb2"""
size = 5
iterator = cycle(range(size))
result = []
for line in text1.split('\n'):
    i = next(iterator)  # position of this line within its section (0..size-1)
    if i == 0:
        result.append({'header': line})  # a header starts a new section
    elif i == 1:
        result[-1]['num_of_samples'] = line
    elif i == 2:
        result[-1]['content_header'] = line
    elif i == 3:
        result[-1]['content'] = [line.split(', ')]  # first data row
    else:
        result[-1]['content'].append(line.split(', '))

If you don't know the size of the content, you should parse each line, check its type, and construct your data manually:

text2 = """Header1
number of Samples1
Content1
a1, aa1, aaa1
b1, bb1, bbb1
b1, bb1, bbb1
Header2
number of Samples2
Content2
b2, bb2, bbb2
Header3
number of Samples3
Content3
a3, aa3, aaa3
b3, bb3, bbb3"""
result = []
for line in text2.split('\n'):
    if line.startswith('Header'):  # Your condition for headers
        result.append({'header': line})
    elif line.startswith('number'):  # Your condition for number of samples
        result[-1]['num_of_samples'] = line
    elif line.startswith('Content'):  # Your condition for content headers
        result[-1]['content_header'] = line
    else:
        if 'content' not in result[-1]:  # The content list may not exist yet
            result[-1]['content'] = [line.split(', ')]
        else:
            result[-1]['content'].append(line.split(', '))
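
The question asks for separate files rather than an in-memory list, so here is one way the same idea could be carried through to disk. This is only a sketch under assumptions: the helper names `split_sections` and `write_sections`, the `section_<n>.csv` file naming, and the choice to skip the three leading lines of each section (header, number of samples, content header) are all hypothetical and should be adapted to the real file.

```python
import csv
import tempfile
from pathlib import Path


def split_sections(lines, is_header=lambda s: s.startswith("Header")):
    """Group lines into sections; a line matching is_header starts a new one."""
    sections = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # ignore blank lines
        if is_header(line):
            sections.append([line])
        elif sections:
            sections[-1].append(line)
        # lines that appear before the first header are dropped
    return sections


def write_sections(sections, out_dir, skip_rows=3):
    """Write the data rows of each section to section_<n>.csv (assumed naming)."""
    out_dir = Path(out_dir)
    paths = []
    for n, section in enumerate(sections, start=1):
        path = out_dir / f"section_{n}.csv"
        with path.open("w", newline="") as fh:
            writer = csv.writer(fh)
            # skip_rows assumes header, sample count and content header lines
            for row in section[skip_rows:]:
                writer.writerow([cell.strip() for cell in row.split(",")])
        paths.append(path)
    return paths


sample = """Header1
number of Samples1
Content1
a1, aa1, aaa1
b1, bb1, bbb1
Header2
number of Samples2
Content2
a2, aa2, aaa2"""

sections = split_sections(sample.splitlines())
out_dir = tempfile.mkdtemp()  # temporary directory, used only for the demo
paths = write_sections(sections, out_dir)
```

For a real file, an open file object can be passed to `split_sections` instead of `sample.splitlines()`. Each resulting file should then be loadable on its own, for example with `pandas.read_csv(path, header=None)`.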

answered Mar 25 at 11:42 by vurmux

Awesome! Works great. Thank you so much!

– Artur, Mar 25 at 13:28
