<!DOCTYPE html>
<!-- [if IE 8]><html class="no-js lt-ie9" lang="en" > <![endif] -->
<!-- [if gt IE 8]><! --> < html class = "no-js" lang = "en" > <!-- <![endif] -->
< head >
< meta charset = "utf-8" >
< meta http-equiv = "X-UA-Compatible" content = "IE=edge" >
< meta name = "viewport" content = "width=device-width, initial-scale=1.0" >
< link rel = "shortcut icon" href = "../img/favicon.ico" >
< title > Concept of the Extractor - NewPipe Documentation< / title >
<!-- local fonts -->
< link rel = "stylesheet" href = "../css/local_fonts.css" type = "text/css" / >
< link rel = "stylesheet" href = "../css/theme.css" type = "text/css" / >
< link rel = "stylesheet" href = "../css/theme_extra.css" type = "text/css" / >
<!-- local code syntax highlighting -->
< link rel = "stylesheet" href = "../css/github.min.css" type = "text/css" / >
< link rel = "stylesheet" href = "../css/highlight.css" type = "text/css" / >
< script >
// Current page data
var mkdocs_page_name = "Concept of the Extractor";
var mkdocs_page_input_path = "01_Concept_of_the_extractor.md";
var mkdocs_page_url = null;
< / script >
< script src = "../js/jquery-2.1.1.min.js" defer > < / script >
< script src = "../js/modernizr-2.8.3.min.js" defer > < / script >
< script src = "../js/highlight.min.js" > < / script >
< script > hljs.initHighlightingOnLoad(); < / script >
< / head >
< body class = "wy-body-for-nav" role = "document" >
< div class = "wy-grid-for-nav" >
< nav data-toggle = "wy-nav-shift" class = "wy-nav-side stickynav" >
< div class = "wy-side-nav-search" >
< a href = ".." class = "icon icon-home" > NewPipe Documentation< / a >
< div role = "search" >
< form id = "rtd-search-form" class = "wy-form" action = "../search.html" method = "get" >
< input type = "text" name = "q" placeholder = "Search docs" title = "Type search term here" / >
< / form >
< / div >
< / div >
< div class = "wy-menu wy-menu-vertical" data-spy = "affix" role = "navigation" aria-label = "main navigation" >
< ul class = "current" >
< li class = "toctree-l1" >
< a class = "" href = ".." > Welcome to the NewPipe Documentation.< / a >
< / li >
< li class = "toctree-l1" >
< a class = "" href = "../00_Prepare_everything/" > Prepare everything< / a >
< / li >
< li class = "toctree-l1 current" >
< a class = "current" href = "./" > Concept of the Extractor< / a >
< ul class = "subnav" >
< li class = "toctree-l2" > < a href = "#concept-of-the-extractor" > Concept of the Extractor< / a > < / li >
< ul >
< li > < a class = "toctree-l3" href = "#collectorextractor-pattern" > Collector/Extractor pattern< / a > < / li >
< li > < a class = "toctree-l3" href = "#collectorextractor-pattern-for-lists" > Collector/Extractor pattern for lists< / a > < / li >
< li > < a class = "toctree-l3" href = "#infoitems-encapsulated-in-pages" > InfoItems encapsulated in pages< / a > < / li >
< / ul >
< / ul >
< / li >
< li class = "toctree-l1" >
< a class = "" href = "../02_Concept_of_LinkHandler/" > Concept of LinkHandler< / a >
< / li >
< li class = "toctree-l1" >
< a class = "" href = "../03_Implement_a_service/" > Implement a service< / a >
< / li >
< li class = "toctree-l1" >
< a class = "" href = "../04_Run_changes_in_App/" > Run the changes in the App< / a >
< / li >
< / ul >
< / div >
< / nav >
< section data-toggle = "wy-nav-shift" class = "wy-nav-content-wrap" >
< nav class = "wy-nav-top" role = "navigation" aria-label = "top navigation" >
< i data-toggle = "wy-nav-top" class = "fa fa-bars" > < / i >
< a href = ".." > NewPipe Documentation< / a >
< / nav >
< div class = "wy-nav-content" >
< div class = "rst-content" >
< div role = "navigation" aria-label = "breadcrumbs navigation" >
< ul class = "wy-breadcrumbs" >
< li > < a href = ".." > Docs< / a > » < / li >
< li > Concept of the Extractor< / li >
< li class = "wy-breadcrumbs-aside" >
< / li >
< / ul >
< hr / >
< / div >
< div role = "main" >
< div class = "section" >
< h1 id = "concept-of-the-extractor" > Concept of the Extractor< / h1 >
< h2 id = "collectorextractor-pattern" > Collector/Extractor pattern< / h2 >
< p > Before we can start coding our own service, we need to understand the basic concept of the extractor. There is a pattern
you will find all over the code. It is called the < strong > extractor/collector< / strong > pattern. The idea behind it is that
the < a href = "https://teamnewpipe.github.io/NewPipeExtractor/javadoc/org/schabi/newpipe/extractor/Extractor.html" > extractor< / a >
produces single pieces of data, and the collector takes them and forms usable data for the front end out of them.
The collector also controls the parsing process and takes care of error handling. So if the extractor fails at any
point, the collector decides whether it should continue parsing or not. This requires the extractor to be made out of
many small methods, one method for every data field the collector wants to have. The collectors are provided by NewPipe.
You need to take care of the extractors.< / p >
< h3 id = "usage-in-the-front-end" > Usage in the front end< / h3 >
< p > A typical call for retrieving data from a website would look like this:< / p >
< pre > < code class = "java" > Info info;
try {
    // Create a new Extractor with a given context provided as parameter.
    Extractor extractor = new Extractor(some_meta_info);
    // Retrieve the data from the extractor and build the info package.
    info = Info.getInfo(extractor);
} catch(Exception e) {
    // Handle errors when the collector decided to abort the extraction.
}
< / code > < / pre >
< h3 id = "typical-implementation-of-a-single-data-extractor" > Typical implementation of a single data extractor< / h3 >
< p > The typical implementation of a single data extractor on the other hand would look like this:< / p >
< pre > < code class = "java" > class MyExtractor extends FutureExtractor {
    public MyExtractor(RequiredInfo requiredInfo, ForExtraction forExtraction) {
        super(requiredInfo, forExtraction);
    }
    public void fetch() {
        // Actually fetch the page data here
    }
    public String someDataField()
        throws ExtractionException { // The exception needs to be thrown if something failed
        // get piece of information and return it
    }
    ... // More data fields
}
< / code > < / pre >
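< p > To make the division of labour more concrete, here is a minimal sketch of how a collector might drive such an extractor. All names (< code > SketchExtractor< / code > , < code > SketchInfo< / code > , < code > SketchCollector< / code > , < code > SketchExtractionException< / code > ) are hypothetical stand-ins and not the real NewPipeExtractor classes. Which fields count as mandatory or optional is also an assumption made only for illustration.< / p >

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins; these are NOT the real NewPipeExtractor classes.
class SketchExtractionException extends Exception {
    SketchExtractionException(String message) {
        super(message);
    }
}

// One small method per data field, as the pattern requires.
interface SketchExtractor {
    void fetch() throws SketchExtractionException;
    String getName() throws SketchExtractionException;        // treated as mandatory below
    String getDescription() throws SketchExtractionException; // treated as optional below
}

// The usable data package handed to the front end.
class SketchInfo {
    String name;
    String description;
    final List<Throwable> errors = new ArrayList<>();
}

class SketchCollector {
    // The collector controls the parsing process and decides how to react to failures.
    static SketchInfo getInfo(SketchExtractor extractor) throws SketchExtractionException {
        extractor.fetch();
        SketchInfo info = new SketchInfo();

        // Mandatory field: a failure propagates and aborts the whole extraction.
        info.name = extractor.getName();

        // Optional field: the failure is recorded and parsing continues.
        try {
            info.description = extractor.getDescription();
        } catch (SketchExtractionException e) {
            info.errors.add(e);
        }
        return info;
    }
}
```

< p > Because every data field has its own method, the collector in this sketch can treat each failure individually instead of losing the whole result when one field cannot be parsed.< / p >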
< h2 id = "collectorextractor-pattern-for-lists" > Collector/Extractor pattern for lists< / h2 >
< p > Sometimes information can be represented as a list. In NewPipe a list is represented by an
< a href = "https://teamnewpipe.github.io/NewPipeExtractor/javadoc/org/schabi/newpipe/extractor/InfoItemsCollector.html" > InfoItemsCollector< / a > .
An InfoItemsCollector will collect and assemble a list of < a href = "https://teamnewpipe.github.io/NewPipeExtractor/javadoc/org/schabi/newpipe/extractor/InfoItem.html" > InfoItem< / a > s.
For each item that should be extracted, a new Extractor must be created and given to the InfoItemsCollector via < a href = "https://teamnewpipe.github.io/NewPipeExtractor/javadoc/org/schabi/newpipe/extractor/InfoItemsCollector.html#commit-E-" > commit()< / a > .< / p >
< p > < img alt = "InfoItemsCollector_objectdiagram.svg" src = "../img/InfoItemsCollector_objectdiagram.svg" / > < / p >
< p > If you are implementing a list for your service you need to extend InfoItem containing the extracted information,
and implement an < a href = "https://teamnewpipe.github.io/NewPipeExtractor/javadoc/org/schabi/newpipe/extractor/Extractor.html" > InfoItemExtractor< / a >
that will return the data of one InfoItem.< / p >
< p > A common implementation would look like this:< / p >
< pre > < code > private MyInfoItemCollector collectInfoItemsFromElement(Element e) {
    MyInfoItemCollector collector = new MyInfoItemCollector(getServiceId());
    for(final Element li : e.children()) {
        collector.commit(new InfoItemExtractor() {
            public String getName() throws ParsingException {
                // extract the name of this item from li and return it
            }
            public String getUrl() throws ParsingException {
                // extract the URL of this item from li and return it
            }
        });
    }
    return collector;
}
< / code > < / pre >
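< p > What happens inside < code > commit()< / code > is not shown above. The following is a rough sketch of the idea under simplified assumptions: the collector calls the small methods of each committed extractor, builds an item from the results, and records an error instead of aborting when a single entry cannot be parsed. The names < code > DemoItemsCollector< / code > , < code > DemoItemExtractor< / code > , < code > DemoItem< / code > and < code > DemoParsingException< / code > are made up for this illustration and do not mirror the actual NewPipeExtractor code.< / p >

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins for illustration only.
class DemoParsingException extends Exception {
    DemoParsingException(String message) {
        super(message);
    }
}

// One extractor instance describes exactly one entry of the list.
interface DemoItemExtractor {
    String getName() throws DemoParsingException;
    String getUrl() throws DemoParsingException;
}

class DemoItem {
    final String name;
    final String url;

    DemoItem(String name, String url) {
        this.name = name;
        this.url = url;
    }
}

class DemoItemsCollector {
    private final List<DemoItem> items = new ArrayList<>();
    private final List<Throwable> errors = new ArrayList<>();

    // Called once per list entry; a broken entry is recorded and skipped,
    // so the remaining entries of the list can still be collected.
    void commit(DemoItemExtractor extractor) {
        try {
            items.add(new DemoItem(extractor.getName(), extractor.getUrl()));
        } catch (DemoParsingException e) {
            errors.add(e);
        }
    }

    List<DemoItem> getItems() {
        return items;
    }

    List<Throwable> getErrors() {
        return errors;
    }
}
```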
< h2 id = "infoitems-encapsulated-in-pages" > InfoItems encapsulated in pages< / h2 >
< p > When a streaming site shows a list of items, it usually offers some additional information about that list, such as its title, a thumbnail,
or its creator. Such info can be called the < strong > list header< / strong > .< / p >
< p > When a website shows a long list of items it usually does not load the whole list, but only a part of it. In order to get more items you may have to click on a next page button, or scroll down. < / p >
< p > This is why lists in NewPipe are chopped down into smaller lists called < a href = "https://teamnewpipe.github.io/NewPipeExtractor/javadoc/org/schabi/newpipe/extractor/ListExtractor.InfoItemsPage.html" > InfoItemsPage< / a > s. Each page has its own URL and needs to be extracted separately.< / p >
< p > Additional meta information about the list, such as its title, a thumbnail,
or its creator, as well as the extraction of multiple pages, can be handled by a
< a href = "https://teamnewpipe.github.io/NewPipeExtractor/javadoc/org/schabi/newpipe/extractor/ListExtractor.html" > ListExtractor< / a >
and its < a href = "https://teamnewpipe.github.io/NewPipeExtractor/javadoc/org/schabi/newpipe/extractor/ListExtractor.InfoItemsPage.html" > ListExtractor.InfoItemsPage< / a > .< / p >
< p > For extracting list header information it behaves like a regular extractor. For handling < code > InfoItemsPages< / code > it adds methods
such as:< / p >
< ul >
< li > < a href = "https://teamnewpipe.github.io/NewPipeExtractor/javadoc/org/schabi/newpipe/extractor/ListExtractor.html#getInitialPage--" > getInitialPage()< / a > ,
which returns the first page of InfoItems.< / li >
< li > < a href = "https://teamnewpipe.github.io/NewPipeExtractor/javadoc/org/schabi/newpipe/extractor/ListExtractor.html#getNextPageUrl--" > getNextPageUrl()< / a > ,
which returns the URL pointing to the next page of InfoItems, if one is available.< / li >
< li > < a href = "https://teamnewpipe.github.io/NewPipeExtractor/javadoc/org/schabi/newpipe/extractor/ListExtractor.html#getPage-java.lang.String-" > getPage()< / a > ,
which returns a ListExtractor.InfoItemsPage by its URL, as retrieved by the < code > getNextPageUrl()< / code > method of the previous page.< / li >
< / ul >
< p > The reason why the first page is handled specially is that many websites such as YouTube load the first page of
items like a regular web page, but all the others as AJAX requests.< / p >
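< p > The paging methods described above can be combined into a simple loop. The sketch below uses hypothetical, heavily simplified stand-ins (< code > SketchPage< / code > , < code > SketchListExtractor< / code > , < code > SketchPager< / code > ), represents items as plain strings, and assumes that a missing next page is signalled by a < code > null< / code > or empty URL. The real ListExtractor API may differ in these details.< / p >

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified stand-in for ListExtractor.InfoItemsPage.
class SketchPage {
    private final List<String> items;
    private final String nextPageUrl; // assumption: null or empty means "no further page"

    SketchPage(List<String> items, String nextPageUrl) {
        this.items = items;
        this.nextPageUrl = nextPageUrl;
    }

    List<String> getItems() {
        return items;
    }

    String getNextPageUrl() {
        return nextPageUrl;
    }
}

// Hypothetical subset of a list extractor, reduced to the paging methods.
interface SketchListExtractor {
    SketchPage getInitialPage();
    SketchPage getPage(String pageUrl);
}

class SketchPager {
    // Collects the items of at most maxPages pages, starting with the initial page.
    static List<String> collectAll(SketchListExtractor extractor, int maxPages) {
        List<String> all = new ArrayList<>();

        // The first page is requested through its own method.
        SketchPage page = extractor.getInitialPage();
        all.addAll(page.getItems());
        int fetched = 1;

        // Every further page is requested via the URL announced by the previous page.
        while (fetched < maxPages
                && page.getNextPageUrl() != null
                && !page.getNextPageUrl().isEmpty()) {
            page = extractor.getPage(page.getNextPageUrl());
            all.addAll(page.getItems());
            fetched++;
        }
        return all;
    }
}
```

< p > The < code > maxPages< / code > limit is an addition of this sketch. It keeps the loop bounded for lists that are very long or effectively endless.< / p >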
< / div >
< / div >
< footer >
< div class = "rst-footer-buttons" role = "navigation" aria-label = "footer navigation" >
< a href = "../02_Concept_of_LinkHandler/" class = "btn btn-neutral float-right" title = "Concept of LinkHandler" > Next < span class = "icon icon-circle-arrow-right" > < / span > < / a >
< a href = "../00_Prepare_everything/" class = "btn btn-neutral" title = "Prepare everything" > < span class = "icon icon-circle-arrow-left" > < / span > Previous< / a >
< / div >
< hr / >
< div role = "contentinfo" >
<!-- Copyright etc -->
< / div >
Built with < a href = "http://www.mkdocs.org" > MkDocs< / a > using a < a href = "https://github.com/snide/sphinx_rtd_theme" > theme< / a > provided by < a href = "https://readthedocs.org" > Read the Docs< / a > .
< / footer >
< / div >
< / div >
< / section >
< / div >
< div class = "rst-versions" role = "note" style = "cursor: pointer" >
< span class = "rst-current-version" data-toggle = "rst-current-version" >
< span > < a href = "../00_Prepare_everything/" style = "color: #fcfcfc;" > « Previous< / a > < / span >
< span style = "margin-left: 15px" > < a href = "../02_Concept_of_LinkHandler/" style = "color: #fcfcfc" > Next » < / a > < / span >
< / span >
< / div >
< script > var base_url = '..'; < / script >
< script src = "../js/theme.js" defer > < / script >
< script src = "../search/main.js" defer > < / script >
< / body >
< / html >