Pandas: skip lines containing a certain string when reading a file

I have a big text file (300,000,000 rows), but it is full of undesired data that I would like to remove. The unwanted rows are the ones containing the string "0000e".

I tried:

f = pd.read_csv('File.txt', skiprows=139, header=None, index_col=False)
f = f.iloc[:, 0]
f1 = f[f.str.contains("0000e") == False]

and

f = pd.read_csv('file.txt', skiprows=139, header=None, index_col=False, chunksize=50)
dfs = pd.concat([x[x[0].str.endswith('000e') == False] for x in f])

but it takes rather long. Is there a faster way to skip lines containing a certain string? Perhaps with na_values?







python pandas






asked Mar 28 at 18:25
JRLCAR















  • What are your intentions with these rows? You could also use the memory_map option to improve performance in the loading stage and possibly other stages.

    – Jab
    Mar 28 at 18:33

  • To speed things up, have you tried reading the .txt file into a plain file object (not via pandas in this case) and saving each line to a list of strings, excluding the lines that contain '000e' as you add them to the list? It could even be a tuple.

    – RockAndRoleCoder
    Mar 28 at 18:34

  • Then afterwards, if your file needs more analysis, you can load the result into a DataFrame and proceed.

    – RockAndRoleCoder
    Mar 28 at 18:34
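
The approach suggested in the comments above could be sketched roughly as follows. This is only an illustration: the function name read_filtered_lines and its parameters are made up here, and the default of 139 skipped lines simply mirrors skiprows=139 from the question. It assumes the filtered result fits in memory.

```python
from itertools import islice


def read_filtered_lines(path, needle="0000e", skip=139):
    """Return the lines of `path` that do not contain `needle`.

    The first `skip` lines are dropped, mirroring skiprows=139 in the question.
    (Hypothetical helper, not part of pandas.)
    """
    with open(path, "r", encoding="utf-8") as handle:
        # islice steps past the leading lines without storing them.
        body = islice(handle, skip, None)
        # Strip the trailing newline and keep only lines without the needle.
        return [line.rstrip("\n") for line in body if needle not in line]
```

The returned list can then be handed to pandas, for example with pd.Series(read_filtered_lines('File.txt')), if further analysis is needed.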





























2 Answers
I prefer your first attempt, as it is definitely more readable; on top of that, your second attempt uses x's and I don't know what they refer to.

That said, using memory_map=True can improve performance, as noted in the docs. You can also gain a little by removing the second line and selecting the column in the same statement that creates the df. Lastly, replacing the check ... == False with ~... may provide some benefit, since ~ is the element-wise logical NOT for a boolean mask. The mask must not contain NaN values, though, or you get an error. Luckily, Series.str.contains accepts an na parameter that sets the result to use for missing values, so na=False treats NaN rows as non-matches.

import pandas as pd

df = pd.read_csv('File.txt', memory_map=True, header=None, index_col=False).iloc[:, 0]
# keep every row that does not contain the unwanted string (NaN rows are kept)
df1 = df[~df.str.contains("0000e", na=False)]
# if you also want to skip NaN rows, use the statement below instead
df1 = df[~df.str.contains("0000e", na=False)].dropna()

Alternatively, doing this with the csv module is much faster, even if you decide to load the result into pandas afterwards. I don't know what your data looks like, but I tested both with a CSV file containing 3 columns and 100 rows and got roughly 9x better performance. This probably won't carry over exactly to your results, but it is definitely the method I would choose if I were you.

from csv import reader

import pandas as pd

needle = '0000e'  # substring marking the rows to drop
with open('File.txt', 'r', newline='') as f:
    # Keep the first field of each row unless it contains the unwanted string.
    # `if row` guards against blank lines, which csv.reader yields as [].
    df = pd.DataFrame([row[0] for row in reader(f) if row and needle not in row[0]])

# To also skip rows whose first field is empty, test the field itself.
# csv.reader yields '' for an empty field, never NaN or None:
#     [row[0] for row in reader(f) if row and row[0] and needle not in row[0]]
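
To make the role of na=False concrete, here is a small sketch on made-up data. The values in col are invented for illustration and stand in for the first column of File.txt; regex=False is an extra choice made here so that the pattern is treated as a plain substring.

```python
import pandas as pd

# Toy column standing in for the first column of File.txt (made-up values).
col = pd.Series(["1.25", "3.0000e", float("nan"), "7.5"])

# na=False makes missing rows count as "no match", so the mask is purely
# boolean and can safely be negated with ~ and used for indexing.
mask = col.str.contains("0000e", regex=False, na=False)

kept = col[~mask]            # drops only the "0000e" row, keeps the NaN row
kept_no_nan = kept.dropna()  # additionally drops the NaN row

print(kept.tolist())
print(kept_no_nan.tolist())
```

Whether leaving out na=False actually raises depends on the pandas version and the column dtype, so passing it explicitly is the safer habit.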














edited Mar 28 at 21:26
answered Mar 28 at 21:15
Jab










  • Thank you very much! I tried method 2 and it works faster. I now have a DataFrame of 1999048 rows and am facing a new speed issue when processing my data. I'll post a new question. Thank you again for your answer!

    – JRLCAR
    Mar 29 at 10:19












If you have access to a Linux or macOS system, you can do this in a pre-processing step that is probably much faster, using grep -v, which returns all lines that do not match:

grep -v 0000e File.txt > small_file.txt

On Windows (I think) it's findstr /v:

findstr /v "0000e" File.txt > small_file.txt

You can call the OS command from inside your Python code; see "Calling an external command in Python".

If you want it to handle multiple operating systems, see "Python: What OS am I running on?".
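
If shelling out is not an option, the same pre-processing step can be sketched in plain Python. Note that this is a stand-in for grep -v, not a wrapper around it, and it will likely be slower than grep on a file this size. The function name drop_matching_lines is made up for this example.

```python
def drop_matching_lines(src, dst, needle=b"0000e"):
    """Copy src to dst, leaving out every line that contains needle.

    Works on raw bytes, so no decoding is done. Returns the number of
    lines written. (Hypothetical helper, shown only as an illustration.)
    """
    kept = 0
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        for line in fin:  # streams line by line, so memory use stays small
            if needle not in line:
                fout.write(line)
                kept += 1
    return kept
```

Calling drop_matching_lines('File.txt', 'small_file.txt') would produce the same kind of reduced file as the commands above, which can then be read with pd.read_csv as usual.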

















answered Mar 28 at 21:29
philshem






























