<!DOCTYPE html>
<!-- [if IE 8]><html class="no-js lt-ie9" lang="en" > <![endif] -->
<!-- [if gt IE 8]><! --> < html class = "no-js" lang = "en" > <!-- <![endif] -->
< head >
< meta charset = "utf-8" >
< meta http-equiv = "X-UA-Compatible" content = "IE=edge" >
< meta name = "viewport" content = "width=device-width, initial-scale=1.0" >
< link rel = "shortcut icon" href = "../img/favicon.ico" >
< title > Concept of the Extractor - NewPipe Documentation< / title >
<!-- local fonts -->
< link rel = "stylesheet" href = "../css/local_fonts.css" type = "text/css" / >
< link rel = "stylesheet" href = "../css/theme.css" type = "text/css" / >
< link rel = "stylesheet" href = "../css/theme_extra.css" type = "text/css" / >
<!-- local code syntax highlighting -->
< link rel = "stylesheet" href = "../css/github.min.css" type = "text/css" / >
< link rel = "stylesheet" href = "../css/highlight.css" type = "text/css" / >
< script >
// Current page data
var mkdocs_page_name = "Concept of the Extractor";
var mkdocs_page_input_path = "01_Concept_of_the_extractor.md";
var mkdocs_page_url = null;
< / script >
< script src = "../js/jquery-2.1.1.min.js" defer > < / script >
< script src = "../js/modernizr-2.8.3.min.js" defer > < / script >
< script src = "../js/highlight.min.js" > < / script >
< script > hljs . initHighlightingOnLoad ( ) ; < / script >
< / head >
< body class = "wy-body-for-nav" role = "document" >
< div class = "wy-grid-for-nav" >
< nav data-toggle = "wy-nav-shift" class = "wy-nav-side stickynav" >
< div class = "wy-side-nav-search" >
< a href = ".." class = "icon icon-home" > NewPipe Documentation< / a >
< div role = "search" >
< form id = "rtd-search-form" class = "wy-form" action = "../search.html" method = "get" >
< input type = "text" name = "q" placeholder = "Search docs" title = "Type search term here" / >
< / form >
< / div >
< / div >
< div class = "wy-menu wy-menu-vertical" data-spy = "affix" role = "navigation" aria-label = "main navigation" >
< ul class = "current" >
< li class = "toctree-l1" >
< a class = "" href = ".." > Welcome to the NewPipe Documentation.< / a >
< / li >
< li class = "toctree-l1" >
< a class = "" href = "../00_Prepare_everything/" > Prepare everything< / a >
< / li >
< li class = "toctree-l1 current" >
< a class = "current" href = "./" > Concept of the Extractor< / a >
< ul class = "subnav" >
< li class = "toctree-l2" > < a href = "#concept-of-the-extractor" > Concept of the Extractor< / a > < / li >
< ul >
< li > < a class = "toctree-l3" href = "#collectorextractor-pattern" > Collector/Extractor pattern< / a > < / li >
< li > < a class = "toctree-l3" href = "#collectorextractor-pattern-for-lists" > Collector/Extractor pattern for lists< / a > < / li >
< li > < a class = "toctree-l3" href = "#infoitems-encapsulated-in-pages" > InfoItems encapsulated in pages< / a > < / li >
< / ul >
< / ul >
< / li >
< li class = "toctree-l1" >
< a class = "" href = "../02_Concept_of_LinkHandler/" > Concept of LinkHandler< / a >
< / li >
< li class = "toctree-l1" >
< a class = "" href = "../03_Implement_a_service/" > Implement a service< / a >
< / li >
< li class = "toctree-l1" >
< a class = "" href = "../04_Run_changes_in_App/" > Run the changes in the App< / a >
< / li >
< / ul >
< / div >
< / nav >
< section data-toggle = "wy-nav-shift" class = "wy-nav-content-wrap" >
< nav class = "wy-nav-top" role = "navigation" aria-label = "top navigation" >
< i data-toggle = "wy-nav-top" class = "fa fa-bars" > < / i >
< a href = ".." > NewPipe Documentation< / a >
< / nav >
< div class = "wy-nav-content" >
< div class = "rst-content" >
< div role = "navigation" aria-label = "breadcrumbs navigation" >
< ul class = "wy-breadcrumbs" >
< li > < a href = ".." > Docs< / a > » < / li >
< li > Concept of the Extractor< / li >
< li class = "wy-breadcrumbs-aside" >
< / li >
< / ul >
< hr / >
< / div >
< div role = "main" >
< div class = "section" >
< h1 id = "concept-of-the-extractor" > Concept of the Extractor< / h1 >
< h2 id = "collectorextractor-pattern" > Collector/Extractor pattern< / h2 >
< p > Before we can start coding our own service, we need to understand the basic concept of the extractor. There is a pattern
you will find all over the code, called the < strong > extractor/collector< / strong > pattern. The idea behind it is that
the < a href = "https://teamnewpipe.github.io/NewPipeExtractor/javadoc/org/schabi/newpipe/extractor/Extractor.html" > extractor< / a >
produces single pieces of data, and the collector gathers them to form usable data for the front end.
The collector also controls the parsing process and takes care of error handling: if the extractor fails at any
point, the collector decides whether or not parsing should continue. This requires the extractor to be made up of
many small methods, one for every data field the collector wants to have. The collectors are provided by NewPipe;
you need to take care of the extractors.< / p >
< h3 id = "usage-in-the-front-end" > Usage in the front end< / h3 >
< p > A typical call for retrieving data from a website looks like this:< / p >
< pre > < code class = "java" > Info info;
try {
    // Create a new Extractor with a given context provided as parameter.
    Extractor extractor = new Extractor(some_meta_info);
    // Retrieve the data from the extractor and build the info package.
    info = Info.getInfo(extractor);
} catch (Exception e) {
    // Handle errors when the collector decided to abort the extraction.
}
< / code > < / pre >
< h3 id = "typical-implementation-of-a-single-data-extractor" > Typical implementation of a single data extractor< / h3 >
< p > A single data extractor, on the other hand, is typically implemented like this:< / p >
< pre > < code class = "java" > class MyExtractor extends FutureExtractor {
    public MyExtractor(RequiredInfo requiredInfo, ForExtraction forExtraction) {
        super(requiredInfo, forExtraction);
        ...
    }
    @Override
    public void fetch() {
        // Actually fetch the page data here
    }
    @Override
    public String someDataField()
            throws ExtractionException { // The exception needs to be thrown if something failed
        // Get the piece of information and return it
    }
    ... // More data fields
}
< / code > < / pre >
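< p > To make the division of labour concrete, here is a minimal, self-contained sketch of the pattern. All names in it (< code > SketchExtractor< / code > , < code > SketchInfo< / code > , < code > SketchExtractionException< / code > ) are hypothetical stand-ins and not the actual NewPipeExtractor classes. It assumes that the name is a mandatory field and the description an optional one, so a failure on the former aborts the extraction while a failure on the latter is only recorded.< / p >

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical exception type standing in for the extractor's ExtractionException.
class SketchExtractionException extends Exception {
    SketchExtractionException(String message) {
        super(message);
    }
}

// Hypothetical extractor contract: one small method per data field.
interface SketchExtractor {
    String getName() throws SketchExtractionException;        // assumed mandatory
    String getDescription() throws SketchExtractionException; // assumed optional
}

// Hypothetical info object: the assembled data handed to the front end.
class SketchInfo {
    String name;
    String description;
    final List<Throwable> errors = new ArrayList<>();

    // Plays the role of the collector: it drives the extractor and decides
    // which failures are fatal and which are merely recorded.
    static SketchInfo getInfo(SketchExtractor extractor) throws SketchExtractionException {
        SketchInfo info = new SketchInfo();
        // A failure here propagates: without a name the result is unusable.
        info.name = extractor.getName();
        try {
            info.description = extractor.getDescription();
        } catch (SketchExtractionException e) {
            // Optional field: remember the error and continue parsing.
            info.errors.add(e);
        }
        return info;
    }
}

class CollectorPatternDemo {
    // Runs the collector against an extractor whose description lookup fails.
    static SketchInfo run() {
        try {
            return SketchInfo.getInfo(new SketchExtractor() {
                @Override
                public String getName() {
                    return "Some video";
                }

                @Override
                public String getDescription() throws SketchExtractionException {
                    throw new SketchExtractionException("description not found");
                }
            });
        } catch (SketchExtractionException e) {
            // Only reached when a mandatory field could not be extracted.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        SketchInfo info = run();
        System.out.println(info.name + " (" + info.errors.size() + " error(s) recorded)");
    }
}
```

< p > Because every field has its own method, the collector can wrap each call individually. Which fields count as mandatory is a decision of the collector, not of the extractor.< / p >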
< h2 id = "collectorextractor-pattern-for-lists" > Collector/Extractor pattern for lists< / h2 >
< p > Sometimes information can be represented as a list. In NewPipe, a list is represented by an
< a href = "https://teamnewpipe.github.io/NewPipeExtractor/javadoc/org/schabi/newpipe/extractor/InfoItemsCollector.html" > InfoItemsCollector< / a > .
An InfoItemsCollector collects and assembles a list of < a href = "https://teamnewpipe.github.io/NewPipeExtractor/javadoc/org/schabi/newpipe/extractor/InfoItem.html" > InfoItem< / a > s.
For each item that should be extracted, a new Extractor must be created and given to the InfoItemsCollector via < a href = "https://teamnewpipe.github.io/NewPipeExtractor/javadoc/org/schabi/newpipe/extractor/InfoItemsCollector.html#commit-E-" > commit()< / a > .< / p >
< p > < img alt = "InfoItemsCollector_objectdiagram.svg" src = "../img/InfoItemsCollector_objectdiagram.svg" / > < / p >
< p > If you are implementing a list for your service, you need to extend InfoItem so that it holds the extracted information,
and implement an < a href = "https://teamnewpipe.github.io/NewPipeExtractor/javadoc/org/schabi/newpipe/extractor/Extractor.html" > InfoItemExtractor< / a >
that returns the data of one InfoItem.< / p >
< p > A common implementation would look like this:< / p >
< pre > < code > private MyInfoItemCollector collectInfoItemsFromElement(Element element) {
    MyInfoItemCollector collector = new MyInfoItemCollector(getServiceId());
    for (final Element li : element.children()) {
        // Commit one anonymous extractor per list entry.
        collector.commit(new InfoItemExtractor() {
            @Override
            public String getName() throws ParsingException {
                ...
            }
            @Override
            public String getUrl() throws ParsingException {
                ...
            }
            ...
        });
    }
    return collector;
}
< / code > < / pre >
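< p > The following self-contained sketch shows what a collector might do with the extractors it receives. The classes (< code > SketchItemsCollector< / code > , < code > SketchItemExtractor< / code > , < code > SketchItem< / code > , < code > SketchParsingException< / code > ) are hypothetical and only imitate the idea; the behaviour of skipping a broken item while keeping the rest of the list is an assumption made for illustration, not a description of the real InfoItemsCollector.< / p >

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins; these are not the NewPipeExtractor classes.
class SketchParsingException extends Exception {
    SketchParsingException(String message) {
        super(message);
    }
}

// One extractor describes exactly one list entry.
interface SketchItemExtractor {
    String getName() throws SketchParsingException;
    String getUrl() throws SketchParsingException;
}

// The assembled result for one list entry.
class SketchItem {
    final String name;
    final String url;

    SketchItem(String name, String url) {
        this.name = name;
        this.url = url;
    }
}

class SketchItemsCollector {
    private final List<SketchItem> items = new ArrayList<>();
    private final List<Throwable> errors = new ArrayList<>();

    // Builds one item from one extractor. A broken entry is skipped and its
    // error recorded, so the remaining entries still reach the front end.
    void commit(SketchItemExtractor extractor) {
        try {
            items.add(new SketchItem(extractor.getName(), extractor.getUrl()));
        } catch (SketchParsingException e) {
            errors.add(e);
        }
    }

    List<SketchItem> getItems() {
        return items;
    }

    List<Throwable> getErrors() {
        return errors;
    }
}

class ListCollectorDemo {
    // Each row is {name, url}; it stands in for one parsed list element.
    static SketchItemsCollector collect(List<String[]> rows) {
        SketchItemsCollector collector = new SketchItemsCollector();
        for (final String[] row : rows) {
            collector.commit(new SketchItemExtractor() {
                @Override
                public String getName() throws SketchParsingException {
                    if (row[0] == null) {
                        throw new SketchParsingException("missing name");
                    }
                    return row[0];
                }

                @Override
                public String getUrl() {
                    return row[1];
                }
            });
        }
        return collector;
    }
}
```

< p > The service code only supplies one small extractor per entry; the policy for dealing with failures lives entirely in the collector.< / p >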
< h2 id = "infoitems-encapsulated-in-pages" > InfoItems encapsulated in pages< / h2 >
< p > When a streaming site shows a list of items, it usually offers some additional information about that list, like its title, a thumbnail,
or its creator. Such information can be called the < strong > list header< / strong > .< / p >
< p > When a website shows a long list of items it usually does not load the whole list, but only a part of it. In order to get more items you may have to click on a next page button, or scroll down. < / p >
< p > This is why lists in NewPipe are chopped into smaller lists called < a href = "https://teamnewpipe.github.io/NewPipeExtractor/javadoc/org/schabi/newpipe/extractor/ListExtractor.InfoItemsPage.html" > InfoItemsPage< / a > s. Each page has its own URL and needs to be extracted separately.< / p >
< p > Additional meta-information about the list, such as its title, a thumbnail,
or its creator, as well as the extraction of multiple pages, can be handled by a
< a href = "https://teamnewpipe.github.io/NewPipeExtractor/javadoc/org/schabi/newpipe/extractor/ListExtractor.html" > ListExtractor< / a >
and its < a href = "https://teamnewpipe.github.io/NewPipeExtractor/javadoc/org/schabi/newpipe/extractor/ListExtractor.InfoItemsPage.html" > ListExtractor.InfoItemsPage< / a > .< / p >
< p > For extracting list header information, a ListExtractor behaves like a regular extractor. For handling < code > InfoItemsPages< / code > it adds methods
such as:< / p >
< ul >
< li > < a href = "https://teamnewpipe.github.io/NewPipeExtractor/javadoc/org/schabi/newpipe/extractor/ListExtractor.html#getInitialPage--" > getInitialPage()< / a > ,
which returns the first page of InfoItems.< / li >
< li > < a href = "https://teamnewpipe.github.io/NewPipeExtractor/javadoc/org/schabi/newpipe/extractor/ListExtractor.html#getNextPageUrl--" > getNextPageUrl()< / a > ,
which returns the URL pointing to the second page of InfoItems, if one is available.< / li >
< li > < a href = "https://teamnewpipe.github.io/NewPipeExtractor/javadoc/org/schabi/newpipe/extractor/ListExtractor.html#getPage-java.lang.String-" > getPage()< / a > ,
which returns a ListExtractor.InfoItemsPage for a given URL. That URL is the one retrieved by the < code > getNextPageUrl()< / code > method of the previous page.< / li >
< / ul >
< p > The first page is handled specially because many websites, such as YouTube, load the first page of
items like a regular web page, but all the others via an AJAX request.< / p >
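< p > A front end could walk through all pages roughly as sketched below. The types < code > SketchPage< / code > and < code > SketchListExtractor< / code > are hypothetical simplifications: items are plain strings, the next page URL is stored on the page itself, and a < code > null< / code > URL is assumed to mean that no further page exists. The fake extractor serves three in-memory pages instead of doing network requests.< / p >

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for an items page: some items plus the next page URL.
class SketchPage {
    final List<String> items;
    final String nextPageUrl; // assumed to be null when there is no further page

    SketchPage(List<String> items, String nextPageUrl) {
        this.items = items;
        this.nextPageUrl = nextPageUrl;
    }
}

// Hypothetical stand-in for the paging part of a list extractor.
interface SketchListExtractor {
    SketchPage getInitialPage();

    SketchPage getPage(String pageUrl);
}

class PagingDemo {
    // A fake in-memory "site" with three pages, used instead of real requests.
    static SketchListExtractor fakeExtractor() {
        return new SketchListExtractor() {
            @Override
            public SketchPage getInitialPage() {
                return new SketchPage(Arrays.asList("a", "b"), "page2");
            }

            @Override
            public SketchPage getPage(String pageUrl) {
                if ("page2".equals(pageUrl)) {
                    return new SketchPage(Arrays.asList("c", "d"), "page3");
                }
                // Last page: no next page URL.
                return new SketchPage(Arrays.asList("e"), null);
            }
        };
    }

    // Fetches the first page, then follows next page URLs until none is left.
    static List<String> collectAll(SketchListExtractor extractor) {
        List<String> all = new ArrayList<>();
        SketchPage page = extractor.getInitialPage();
        all.addAll(page.items);
        while (page.nextPageUrl != null) {
            page = extractor.getPage(page.nextPageUrl);
            all.addAll(page.items);
        }
        return all;
    }

    public static void main(String[] args) {
        System.out.println(collectAll(fakeExtractor()));
    }
}
```

< p > In a real app the loop would usually not run eagerly; the next page would be requested only when the user scrolls to the end of the list.< / p >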
< / div >
< / div >
< footer >
< div class = "rst-footer-buttons" role = "navigation" aria-label = "footer navigation" >
< a href = "../02_Concept_of_LinkHandler/" class = "btn btn-neutral float-right" title = "Concept of LinkHandler" > Next < span class = "icon icon-circle-arrow-right" > < / span > < / a >
< a href = "../00_Prepare_everything/" class = "btn btn-neutral" title = "Prepare everything" > < span class = "icon icon-circle-arrow-left" > < / span > Previous< / a >
< / div >
< hr / >
< div role = "contentinfo" >
<!-- Copyright etc -->
< / div >
Built with < a href = "http://www.mkdocs.org" > MkDocs< / a > using a < a href = "https://github.com/snide/sphinx_rtd_theme" > theme< / a > provided by < a href = "https://readthedocs.org" > Read the Docs< / a > .
< / footer >
< / div >
< / div >
< / section >
< / div >
< div class = "rst-versions" role = "note" style = "cursor: pointer" >
< span class = "rst-current-version" data-toggle = "rst-current-version" >
< span > < a href = "../00_Prepare_everything/" style = "color: #fcfcfc;" > « Previous< / a > < / span >
< span style = "margin-left: 15px" > < a href = "../02_Concept_of_LinkHandler/" style = "color: #fcfcfc" > Next » < / a > < / span >
< / span >
< / div >
< script > var base_url = '..'; < / script >
< script src = "../js/theme.js" defer > < / script >
< script src = "../search/main.js" defer > < / script >
< / body >
< / html >