Python3 Read Multiple Xml Files in a Directory Utf-8
This article focuses on how one can parse a given XML file and extract some useful information out of it in a structured way.
XML: XML stands for eXtensible Markup Language. It was designed to store and send data, and to be both human- and machine-readable. That is why the design goals of XML emphasize simplicity, generality, and usability across the Internet.
The XML file to be parsed in this tutorial is actually an RSS feed.
RSS: RSS (Rich Site Summary, often called Really Simple Syndication) uses a family of standard web feed formats to publish frequently updated information like blog entries, news headlines, audio and video. RSS is XML-formatted plain text.
- The RSS format itself is relatively easy to read, both by automated processes and by humans alike.
- The RSS processed in this tutorial is the RSS feed of top news stories from a popular news website. You can check it out here. Our goal is to process this RSS feed (or XML file) and save it in another format for future use.
Python module used: This article will focus on using the inbuilt xml module in Python for parsing XML, and the main focus will be on the ElementTree XML API of this module.
Implementation:
import csv
import requests
import xml.etree.ElementTree as ET

def loadRSS():
    # url of the rss feed
    url = 'http://www.hindustantimes.com/rss/topnews/rssfeed.xml'
    # creating HTTP response object from the given url
    resp = requests.get(url)
    # saving the xml file
    with open('topnewsfeed.xml', 'wb') as f:
        f.write(resp.content)

def parseXML(xmlfile):
    # create element tree object
    tree = ET.parse(xmlfile)
    # get root element
    root = tree.getroot()
    # create empty list for news items
    newsitems = []
    # iterate news items
    for item in root.findall('./channel/item'):
        # empty news dictionary
        news = {}
        # iterate child elements of item
        for child in item:
            # special checking for namespace object content:media
            if child.tag == '{http://search.yahoo.com/mrss/}content':
                news['media'] = child.attrib['url']
            else:
                news[child.tag] = child.text.encode('utf8')
        # append news dictionary to news items list
        newsitems.append(news)
    # return news items list
    return newsitems

def savetoCSV(newsitems, filename):
    # specifying the fields for the csv file
    fields = ['guid', 'title', 'pubDate', 'description', 'link', 'media']
    # writing to the csv file
    with open(filename, 'w') as csvfile:
        # creating a csv dict writer object
        writer = csv.DictWriter(csvfile, fieldnames=fields)
        # writing headers (field names)
        writer.writeheader()
        # writing data rows
        writer.writerows(newsitems)

def main():
    # load rss from the web to update the local xml file
    loadRSS()
    # parse the xml file
    newsitems = parseXML('topnewsfeed.xml')
    # store news items in a csv file
    savetoCSV(newsitems, 'topnews.csv')

if __name__ == "__main__":
    # calling main function
    main()
The above code will:
- Load the RSS feed from the specified URL and save it as an XML file.
- Parse the XML file to save news as a list of dictionaries, where each dictionary is a single news item.
- Save the news items into a CSV file.
Let us try to understand the code in pieces:
- Loading and saving RSS feed
def loadRSS():
    # url of the rss feed
    url = 'http://www.hindustantimes.com/rss/topnews/rssfeed.xml'
    # creating HTTP response object from the given url
    resp = requests.get(url)
    # saving the xml file
    with open('topnewsfeed.xml', 'wb') as f:
        f.write(resp.content)
Here, we first created an HTTP response object by sending an HTTP request to the URL of the RSS feed. The content of the response now contains the XML file data, which we save as topnewsfeed.xml in our local directory.
For more insight on how the requests module works, follow this article:
GET and POST requests using Python
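As a small variation (a sketch, not part of the original article), loadRSS() could also check the HTTP status before writing the file, so that a failed download does not silently overwrite topnewsfeed.xml; the timeout value here is an arbitrary, assumed choice:

import requests

def loadRSS():
    # url of the rss feed (same URL as used in the article)
    url = 'http://www.hindustantimes.com/rss/topnews/rssfeed.xml'
    # send the GET request; timeout is an assumed, arbitrary value
    resp = requests.get(url, timeout=10)
    # raise_for_status() raises an HTTPError for 4xx/5xx responses
    resp.raise_for_status()
    # write the raw bytes of the feed to a local file
    with open('topnewsfeed.xml', 'wb') as f:
        f.write(resp.content)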
- Parsing XML
We have created the parseXML() function to parse the XML file. We know that XML is an inherently hierarchical data format, and the most natural way to represent it is with a tree. Look at the image below for an example:
Here, we are using the xml.etree.ElementTree module (call it ET, in short). ElementTree has two classes for this purpose: ElementTree represents the whole XML document as a tree, and Element represents a single node in this tree. Interactions with the whole document (reading and writing to/from files) are usually done on the ElementTree level. Interactions with a single XML element and its sub-elements are done on the Element level. OK, so let's go through the parseXML() function now:
tree = ET.parse(xmlfile)
Here, we create an ElementTree object by parsing the passed xmlfile.
root = tree.getroot()
The getroot() function returns the root of the tree as an Element object.
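For intuition, here is a minimal, self-contained sketch (using a made-up inline XML string, not the actual feed) that shows the difference between working at the Element level and at the ElementTree level:

import xml.etree.ElementTree as ET

# a tiny hypothetical document, just for illustration
xml_data = '<channel><item><title>Hello</title></item></channel>'

# fromstring() returns an Element (the root node) directly
root = ET.fromstring(xml_data)
print(root.tag)         # channel
print(root[0][0].text)  # Hello

# wrapping the root in an ElementTree lets us work at the whole-document level,
# for example writing it back out to a file
tree = ET.ElementTree(root)
tree.write('sample.xml', encoding='utf-8', xml_declaration=True)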
for item in root.findall('./channel/item'):
Now, once you have taken a look at the structure of your XML file, you will notice that we are interested only in the item element.
./channel/item is actually XPath syntax (XPath is a language for addressing parts of an XML document). Here, we want to find all item grandchildren of channel children of the root (denoted by '.') element.
You can read more about supported XPath syntax here.
for item in root.findall('./channel/item'):
    # empty news dictionary
    news = {}
    # iterate child elements of item
    for child in item:
        # special checking for namespace object content:media
        if child.tag == '{http://search.yahoo.com/mrss/}content':
            news['media'] = child.attrib['url']
        else:
            news[child.tag] = child.text.encode('utf8')
    # append news dictionary to news items list
    newsitems.append(news)
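As an aside, ElementTree supports only a limited subset of XPath. A short sketch (run against the same topnewsfeed.xml, with element names assumed to follow the usual RSS layout) of a few other queries it accepts:

import xml.etree.ElementTree as ET

tree = ET.parse('topnewsfeed.xml')
root = tree.getroot()

# direct path relative to the root: the channel's title element
title = root.find('channel/title')
print(title.text if title is not None else 'no title found')

# './/item' matches item elements at any depth below the root
all_items = root.findall('.//item')
print(len(all_items), 'items found')

# first matching link inside the first item, if present
first_link = root.find('./channel/item/link')
if first_link is not None:
    print(first_link.text)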
Now, we know that we are iterating through item elements, where each item element contains one news item. So, we create an empty news dictionary in which we will store all the available information about the news item. To iterate through each child element of an element, we simply iterate over it, like this:
for child in item:
Now, notice a sample item element here:
We will have to handle namespace tags separately, as they get expanded to their original value when parsed. So, we do something like this:
if child.tag == '{http://search.yahoo.com/mrss/}content':
    news['media'] = child.attrib['url']
child.attrib is a dictionary of all the attributes related to an element. Here, we are interested in the url attribute of the media:content namespace tag.
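Alternatively (a sketch, not how the code above does it), find() and findall() accept a namespaces mapping, so the prefixed form media:content can be used instead of the fully expanded tag; this assumes the media prefix maps to the Yahoo MRSS namespace shown above:

import xml.etree.ElementTree as ET

# map the 'media' prefix to its namespace URI
ns = {'media': 'http://search.yahoo.com/mrss/'}

tree = ET.parse('topnewsfeed.xml')
root = tree.getroot()
for item in root.findall('./channel/item'):
    # look up the media:content child via the prefix and read its url attribute
    content = item.find('media:content', ns)
    if content is not None:
        print(content.attrib.get('url', ''))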
Now, for all other children, we simply do:
news[child.tag] = child.text.encode('utf8')
child.tag contains the name of the child element. child.text stores all the text inside that child element. So, finally, a sample item element is converted to a dictionary and looks like this:
{'description': 'Ignis has a tough competition already, from Hyun.... ,
 'guid': 'http://www.hindustantimes.com/autos/maruti-ignis-launch.... ,
 'link': 'http://www.hindustantimes.com/autos/maruti-ignis-launch.... ,
 'media': 'http://www.hindustantimes.com/rf/image_size_630x354/HT/... ,
 'pubDate': 'Thu, 12 Jan 2017 12:33:04 GMT ',
 'title': 'Maruti Ignis launches on Jan 13: Five cars that threa..... }
Then, we just append this dict element to the list newsitems.
Finally, this list is returned.
- Saving data to a CSV file
Now, we just save the list of news items to a CSV file, so that it can be used or modified easily in the future, using the savetoCSV() function. To know more about writing dictionary elements to a CSV file, go through this article:
Working with CSV files in Python
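In Python 3 specifically, an alternative (a sketch, under the assumption that parseXML() keeps the text values as str instead of calling .encode('utf8')) is to open the CSV file with an explicit UTF-8 encoding and newline='', which avoids bytes literals like b'...' ending up in the output:

import csv

def savetoCSV(newsitems, filename):
    # field names for the csv header (same fields as in the article)
    fields = ['guid', 'title', 'pubDate', 'description', 'link', 'media']
    # newline='' avoids blank lines on Windows; encoding='utf-8' handles non-ASCII text
    with open(filename, 'w', newline='', encoding='utf-8') as csvfile:
        # ignore any keys that are not in the field list, just in case
        writer = csv.DictWriter(csvfile, fieldnames=fields, extrasaction='ignore')
        writer.writeheader()
        writer.writerows(newsitems)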
So, here is how our formatted data looks now:
As you can see, the hierarchical XML file data has been converted to a simple CSV file, so that all news stories are stored in the form of a table. This also makes it easier to extend the database.
Likewise, one can use the JSON-like data directly in their applications! This is the best alternative for extracting data from websites which do not provide a public API but do provide RSS feeds.
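For example, a minimal sketch (assuming the newsitems list produced by parseXML() above, and decoding any bytes values it contains) that dumps the same data to a JSON file could look like this:

import json

# reuse the parser defined in the article
newsitems = parseXML('topnewsfeed.xml')

# decode any bytes values left over from .encode('utf8') so json can serialize them
decoded = [
    {k: (v.decode('utf8') if isinstance(v, bytes) else v) for k, v in news.items()}
    for news in newsitems
]

with open('topnews.json', 'w', encoding='utf-8') as f:
    json.dump(decoded, f, ensure_ascii=False, indent=2)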
All the code and files used in the above article can be found here.
What next?
- You can have a look at more RSS feeds of the news website used in the above example. You can try to create an extended version of the above example by parsing other RSS feeds too.
- Are you a cricket fan? Then this RSS feed must be of interest to you! You can parse this XML file to scrape data about live cricket matches and use it to make a desktop notifier!
Quiz of HTML and XML
This article is contributed by Nikhil Kumar. If you like GeeksforGeeks and would like to contribute, you can also write an article and mail your article to review-team@geeksforgeeks.org. See your article appearing on the GeeksforGeeks main page and help other Geeks.
Please write comments if you find anything incorrect, or if you want to share more information about the topic discussed above.
Source: https://www.geeksforgeeks.org/xml-parsing-python/