HTML-Tag-Extractor-with-DIFF

Example of a built tag tree for a full website (based on a Springer test website), stored in the file springerBuiltTree.txt:

                ┌meta
           ┌head┤
           │    ├script
           │    ├title
           │    ├meta
           │    └meta
       html┤
           │    ┌script
           │    ├script
           │    ├span
           │    ├noscript┐
           │    │        └iframe
           │    ├nav┐
           │    │   └a
           └body┤
                │   ┌div

[...] (the fragment above is truncated; the main file is more than 1000 lines)
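
To illustrate how a tag tree like the one above can be produced, here is a minimal sketch that uses only the standard library's html.parser rather than the project's own lxml-based code. The names TagNode, TagTreeBuilder, build_tag_tree and render are hypothetical and do not come from the repository, and the output uses a simpler top-down layout than the centered one shown above.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they are not pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class TagNode:
    """One element of the tag tree: a tag name plus its child elements."""

    def __init__(self, name):
        self.name = name
        self.children = []


class TagTreeBuilder(HTMLParser):
    """Collects opening and closing tags into a tree of TagNode objects."""

    def __init__(self):
        super().__init__()
        self.root = TagNode("[document]")  # artificial root above <html>
        self._stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = TagNode(tag)
        self._stack[-1].children.append(node)
        if tag not in VOID_TAGS:
            self._stack.append(node)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; stray closing tags are ignored.
        for i in range(len(self._stack) - 1, 0, -1):
            if self._stack[i].name == tag:
                del self._stack[i:]
                break


def build_tag_tree(html_text):
    """Parse html_text and return the artificial root node of the tag tree."""
    builder = TagTreeBuilder()
    builder.feed(html_text)
    builder.close()
    return builder.root


def render(node, prefix=""):
    """Return the subtree below node as a list of lines, one tag per line."""
    lines = []
    for i, child in enumerate(node.children):
        last = i == len(node.children) - 1
        lines.append(prefix + ("└" if last else "├") + child.name)
        lines.extend(render(child, prefix + ("  " if last else "│ ")))
    return lines


# Made-up sample document, not one of the Springer test pages.
sample = ("<html><head><meta><title>t</title></head>"
          "<body><nav><a>x</a></nav></body></html>")
print("\n".join(render(build_tag_tree(sample))))
```

A stack of currently open elements is enough here because each new tag becomes a child of whatever element is on top of the stack.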

The main file also contains a working HTML differencer, again based on the Springer test websites.

The project is based on lxml and difflib.
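
As a rough illustration of how difflib can compare two pages, the sketch below produces a unified diff of two HTML strings. It is not the project's differencer: the function name diff_html is hypothetical, and the file name 2.html is an assumed placeholder for a second test page.

```python
import difflib


def diff_html(old_html, new_html, old_name="1.html", new_name="2.html"):
    """Return a unified diff of two HTML strings as a list of lines."""
    old_lines = old_html.splitlines()
    new_lines = new_html.splitlines()
    # lineterm="" keeps the header lines free of trailing newlines,
    # matching the newline-free content lines from splitlines().
    return list(difflib.unified_diff(old_lines, new_lines,
                                     fromfile=old_name, tofile=new_name,
                                     lineterm=""))


# Made-up sample input, not taken from the Springer test pages.
old_page = "<html>\n<title>A</title>\n</html>"
new_page = "<html>\n<title>B</title>\n</html>"
print("\n".join(diff_html(old_page, new_page)))
```

Lines prefixed with "-" exist only in the first page and lines prefixed with "+" only in the second; identical inputs yield an empty list.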

After creating a Website object, we can access each tag via its name (class/id) or relative path. Examples:

Website1.getTagContent("//title").lstrip().splitlines()
Website1.getTagContent("//html/head/meta[@name='citation_author_institution']")  # for 1.html
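
The following sketch shows one way such an XPath lookup could work with lxml. It is a hypothetical stand-in for the project's Website class, not its actual code: the constructor argument, the choice to return the serialized markup of the first match, and the sample HTML are all assumptions.

```python
from lxml import html


class Website:
    """Hypothetical stand-in for the project's Website object."""

    def __init__(self, html_text):
        # Parse the whole document; the returned element is the <html> root.
        self.tree = html.fromstring(html_text)

    def getTagContent(self, xpath):
        """Return the markup of the first element matching xpath, or None."""
        matches = self.tree.xpath(xpath)
        if not matches:
            return None
        # with_tail=False drops text that follows the element's closing tag.
        return html.tostring(matches[0], encoding="unicode", with_tail=False)


# Made-up sample page, not one of the Springer test pages.
page = """<html><head>
<title>Sample page</title>
<meta name="citation_author_institution" content="Example University">
</head><body><p>Hello</p></body></html>"""

Website1 = Website(page)
print(Website1.getTagContent("//title"))
print(Website1.getTagContent(
    "//html/head/meta[@name='citation_author_institution']"))
```

Returning None for a missing tag is a design choice of this sketch; the real method may behave differently.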

PatrykOlewniak/HTML-Tag-Extractor-with-DIFF
