* Please note that many major news sites cannot be scraped because of site protection, cookies, etc.,
so if a link is not working, please pick one from the examples below.
Here are a few example URLs you can copy and paste into the form above!
Previously, I built a simple News Scraper app on the web that used Python,
Beautiful Soup, and Flask to scrape the latest news from a specific news site.
This time, I built a slightly more advanced version of the app that scrapes news data from a news
article using the Python package newspaper3k, then deployed the app using Flask on
Google App Engine.
First of all, when the URL form above captures the link to a news article, the
newspaper3k package extracts and parses the data of the article with its natural language processing.
For form input handling and validation, I used the WTForms and
requests libraries to grab the URL entered in the form.
Then, from the extracted data, I pull the following fields to render in the first part of my result
page:
Title
Published date
Author
Top image (source link)
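The extraction step above can be sketched roughly as follows. This is a minimal illustration, not the app's actual code: the function names `fetch_article` and `article_metadata` and the `"unknown"` fallback are assumptions made for this example, while `Article`, `download()`, `parse()`, `title`, `publish_date`, `authors`, and `top_image` come from newspaper3k.

```python
def fetch_article(url):
    """Download and parse one article with newspaper3k (needs network access)."""
    from newspaper import Article  # provided by the newspaper3k package

    article = Article(url)
    article.download()  # fetch the raw HTML
    article.parse()     # extract title, authors, text, images, ...
    return article


def article_metadata(article):
    """Collect the four fields shown in the first part of the result page.

    Hypothetical helper: the dictionary keys and the "unknown" fallback
    are choices made for this sketch only.
    """
    published = article.publish_date  # a datetime, or None when not detected
    return {
        "title": article.title,
        "published": published.strftime("%Y-%m-%d") if published else "unknown",
        "authors": ", ".join(article.authors) or "unknown",
        "top_image": article.top_image,  # URL of the lead image
    }
```

In a Flask view, the dictionary returned by `article_metadata(fetch_article(url))` could be passed straight to the result template.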
At the same time, using the full text extracted from the article, my app also generates a
WordCloud for the news article. The WordCloud on the result page displays the
words that appear most frequently in the extracted news text. The io library is used
to keep the WordCloud image in memory, and base64 converts the resulting bytes to
base64 so the image can be returned as part of our HTML response and rendered.
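As a rough sketch of the in-memory approach described above, assuming the `wordcloud` package and Pillow are installed: the function names, image size, and background color here are illustrative choices, not the app's actual settings.

```python
import base64
import io


def render_wordcloud_png(text):
    """Render a word cloud for `text` and return PNG bytes without touching disk."""
    from wordcloud import WordCloud  # third-party package: wordcloud

    # Width, height, and background color are assumed values for this sketch.
    cloud = WordCloud(width=800, height=400, background_color="white").generate(text)
    buffer = io.BytesIO()
    cloud.to_image().save(buffer, format="PNG")  # to_image() returns a PIL image
    return buffer.getvalue()


def png_to_data_uri(png_bytes):
    """Encode PNG bytes as a data URI usable in an <img src="..."> attribute."""
    encoded = base64.b64encode(png_bytes).decode("ascii")
    return "data:image/png;base64," + encoded
```

The string returned by `png_to_data_uri(render_wordcloud_png(article_text))` can be placed directly in an `<img>` tag's `src` attribute, so no image file has to be stored on the server.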
* Please note that the WordCloud is currently disabled due to an image storage issue.
Lastly, newspaper3k can also run its simple natural language processing to extract keywords from the news article and produce a summary of the article text.
Keywords (WordCloud image)
Summary
The keywords (WordCloud) image and the summary of the news text are displayed in the second part
of the result page.
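The keyword and summary step might look something like this. It is a sketch only: `keywords_and_summary` is a hypothetical helper name, and `article` is assumed to be a newspaper3k `Article` that has already been downloaded and parsed. newspaper3k's `nlp()` also relies on NLTK's `punkt` tokenizer data being available.

```python
def keywords_and_summary(article):
    """Run newspaper3k's NLP step on a parsed article and return its results."""
    article.nlp()  # populates article.keywords and article.summary
    return {
        "keywords": list(article.keywords),
        "summary": article.summary,
    }
```

Calling `nlp()` before reading `keywords` or `summary` matters, since those attributes are only filled in by that step.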