Title Extraction/Identification from PDFs


I have a large number of PDFs in different formats. Among other things, I need to extract their titles (not the document name, but a title in the text). Because of the range of formats, the titles are not in the same locations in the PDFs. Further, some of the PDFs are actually scanned images, so I need to use OCR (optical character recognition) on them. The titles are sometimes one line, sometimes two, and they do not tend to share the same set of words. In the range of physical locations where the titles usually show up, there are often other words (i.e., if doc 1 has title 1 at (x1, y1), doc 2 might have title 2 at (x2, y2) but have other non-title text at (x1, y1)). There are also some very rare cases where the PDFs don't have a title.

So far I can use pdftotext to extract the text within a given bounding box and write it to a text file. If there's a title, this lets me capture it, but often with extraneous words included. This also only works on non-image PDFs. I'm wondering: a) whether there's a good way to identify the title from among all the words I extract for a document (because there are often extraneous words), ideally with a good way to detect that no title exists, and b) whether there are any tools equivalent to pdftotext that also work on scanned images (I do have an OCR script working, but it runs OCR over an entire image rather than a section of one).

One method that somewhat addresses the title dilemma is to extract the words in the bounding box, use the rest of the document to identify which of those words are keywords for the document, and construct a title from the keywords. This wouldn't extract the actual title, but it may give words that could form a reasonable alternative. I'm already extracting keywords for other parts of the project, but I would definitely prefer to extract the actual title, as people may be using the verbatim title for lookup purposes.

A further note, in case it wasn't clear: I'm trying to do this programmatically with open-source/free tools, ideally in Python, and I will have a large number of documents (10,000+).
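
For readers unfamiliar with the bounding-box extraction mentioned above, here is a minimal sketch of what such a call might look like. It assumes the Poppler build of pdftotext is installed and on the PATH; the function names, the example coordinates, and the file name are hypothetical and not taken from the question.

```python
import subprocess


def pdftotext_region_cmd(pdf_path, x, y, width, height, page=1, out_path="-"):
    """Build a pdftotext command that extracts text from one rectangular region."""
    return [
        "pdftotext",
        "-f", str(page), "-l", str(page),      # restrict extraction to a single page
        "-x", str(x), "-y", str(y),            # top-left corner of the crop area
        "-W", str(width), "-H", str(height),   # width and height of the crop area
        pdf_path,
        out_path,                              # "-" sends the text to stdout
    ]


def extract_region_text(pdf_path, x, y, width, height, page=1):
    """Run pdftotext on the region and return the extracted text (needs Poppler)."""
    cmd = pdftotext_region_cmd(pdf_path, x, y, width, height, page)
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout


# Hypothetical example: a 400x120 box near the top of page 1.
print(pdftotext_region_cmd("paper.pdf", 50, 40, 400, 120))
```

Keeping the command construction separate from the subprocess call makes it easy to log or inspect the exact invocation across a large batch of documents.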







python pdf nlp ocr pdf-scraping














asked Mar 22 at 17:23 by Evan Mata
edited Mar 22 at 18:21







  • 1





    This sounds like a difficult task, not just the OCR but the identification of where the title is. I would be quite interested to know if there were a way to do this programmatically. I would suggest considering something like Amazon Mechanical Turk to accomplish this task. It wouldn't be free, but neither is your time, and it could be accomplished this way.

    – Nathaniel
    Mar 22 at 17:34


























2 Answers
You can use the words' font-size information to extract the title words. From what I understand of your question, here is what I propose:

Convert the PDF documents to images using any open-source module, say pdf2image, then use Tesseract for OCR. The OCR output gives you the text along with its dimension information, i.e. the width and height of each individual word.

Do some statistical analysis (e.g. a histogram plot) of the word heights and see whether the height distribution can be used to recognize the title words. You can either use a fixed threshold chosen heuristically or an adaptive threshold derived from the height distribution, and then use that threshold to pick out the title words.
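
The adaptive-threshold idea above can be sketched as follows. This is only an illustration under stated assumptions: the function name `pick_title_words`, the `ratio` value of 1.3, and the sample heights are all hypothetical, and the input is assumed to be a list of (text, height) pairs, which could be built from the per-word `text` and `height` fields that pytesseract's `image_to_data` returns.

```python
from statistics import median


def pick_title_words(words, ratio=1.3):
    """Return the words whose height exceeds `ratio` times the median word height.

    `words` is a list of (text, height) pairs, e.g. assembled from OCR output.
    An empty result can be read as "no title detected by this heuristic".
    """
    heights = [h for text, h in words if text.strip()]
    if not heights:
        return []
    # Adaptive threshold: scales with the typical (body text) word height on the page.
    threshold = ratio * median(heights)
    return [text for text, h in words if text.strip() and h > threshold]


# Hypothetical word heights in pixels: two tall title words, then body text.
sample = [("Annual", 40), ("Report", 42), ("This", 20), ("document", 21),
          ("describes", 20), ("the", 19), ("results", 20)]
print(pick_title_words(sample))  # ['Annual', 'Report']
```

Note that OCR word boxes vary in height with ascenders and descenders, so the ratio probably needs tuning per corpus, and the heuristic cannot separate titles set in the same size as the body text.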

















answered Mar 27 at 13:30 by flamelite












  • Thank you for the recommendation. I think it would be useful to other people who see this, but my corpus unfortunately uses the same font size for titles and body text. Do you happen to know whether Tesseract's OCR can recognize bold text as well? I do have other bold text in my PDFs, but it's one feature that could help a fair amount.

    – Evan Mata
    Mar 27 at 13:37

  • I am not sure about bold text, but Tesseract's TessBaseAPI gives word- and symbol-level font type information, which might be useful in this case. My guess is that bold characters would have a different width compared to non-title text characters, so this information might be useful.

    – flamelite
    Mar 27 at 13:39


















For people who come across this question later, I'll provide a quick update on what I've decided to do (although I haven't tested accuracy yet, so I don't know whether this approach is actually any good).

The overall approach I'll be using is machine learning via a neural net (I'll report back on accuracy once I have it). I'm essentially taking the first 200 words of a document and generating n-grams of 4-20 sequential words (so roughly 16*200 n-grams; 4 because none of my titles are shorter, and 20 because none are longer). I then generate a unique feature vector from each n-gram. The features I decided to use are partially dependent on my text, but some are more general, like "Is the first letter of the first word in the n-gram capitalized?". Knowing the correct titles, I can turn them into equivalent vectors. So if vec(n_gram) = vec(correct_title), the label is 1; otherwise it is 0. I'm using this to train an ML model.

Currently this does not solve my issue with scanned-image PDFs, unless they're first converted into text documents. It also assumes that word order is preserved among the title words when the PDF is turned into n-grams. I have noticed that the order of non-title words isn't always preserved by the conversion, but that's quite a rare problem and only seems to occur when there are line breaks and an entire line ends up out of place (so hopefully it shouldn't affect the titles).







        share|improve this answer






















        edited Apr 3 at 18:10

























        answered Mar 22 at 18:07









        Evan Mata

































