Linux html to json

Convert HTML to Json via XPath

pldmgg/HTMLToJson

Use XPath to specify how to parse a particular website and return your desired JSON output. The module leverages OpenScraping, dotnet-script, and ScrapingHub’s Splash Server in order to fully and faithfully render JavaScript.

All functions in the HTMLToJson Module except Install-Docker and Deploy-SplashServer are compatible with Windows PowerShell 5.1 and PowerShell Core 6.X (Windows and Linux). The Install-Docker and Deploy-SplashServer functions work on PowerShell Core 6.X on Linux (specifically Ubuntu 18.04/16.04/14.04, Debian 9/8, CentOS/RHEL 7, OpenSUSE 42).

In order to fully and faithfully render sites, the HTMLToJson Module relies on ScrapingHub’s Splash Server. If you do not already have Splash deployed to your environment, ssh to a VM running your preferred compatible Linux distro, launch PowerShell Core (using sudo), and install the HTMLToJson Module:

sudo pwsh
Install-Module HTMLToJson
exit

Next, launch pwsh (without sudo), import the HTMLToJson Module, and install Docker (you will receive a sudo prompt unless you have password-less sudo configured on your system).

pwsh
Import-Module HTMLToJson
Install-Docker

Finally, deploy ScrapingHub’s Splash Server Docker container using the module's Deploy-SplashServer function.

At this point, you can continue on the same Linux VM running your Splash Docker container, or you can hop back onto your local workstation (Windows or Linux) and install/import the module there. Either way, the following steps will be the same.

Next, we need to install the .NET Core SDK as well as dotnet-script. These provide the dotnet and dotnet-script binaries:

Install-DotNetSDK
Install-DotNetScript

Parsing A Website Using XPath

PS C:\Users\zeroadmin> $JsonXPathConfigString = @"
{
    "title": "//*/h1",
    "VisibleAPIs": {
        "_xpath": "//a[(@class=\"list-group-item\")]",
        "APIName": ".//h3",
        "APIVersion": ".//p//code//span[normalize-space()][2]",
        "APIDescription": ".//p[(@class=\"list-group-item-text\")]"
    }
}
"@

PS C:\Users\zeroadmin> Get-SiteAsJson -Url 'http://dotnetapis.com/' -XPathJsonConfigString $JsonXPathConfigString -SplashServerUri 'http://192.168.2.50:8050'
{
    "title": "DotNetApis (BETA)",
    "VisibleAPIs": [
        {
            "APIName": "NUnit",
            "APIVersion": "3.11.0",
            "APIDescription": "NUnit is a unit-testing framework for all .NET languages with a strong TDD focus."
        },
        {
            "APIName": "Json.NET",
            "APIVersion": "12.0.1",
            "APIDescription": "Json.NET is a popular high-performance JSON framework for .NET"
        },
        {
            "APIName": "EntityFramework",
            "APIVersion": "6.2.0",
            "APIDescription": "Entity Framework is Microsoft's recommended data access technology for new applications."
        },
        {
            "APIName": "MySql.Data",
            "APIVersion": "8.0.13",
            "APIDescription": "MySql.Data.MySqlClient .Net Core Class Library"
        },
        {
            "APIName": "NuGet.Core",
            "APIVersion": "2.14.0",
            "APIDescription": "NuGet.Core is the core framework assembly for NuGet that the rest of NuGet builds upon."
        }
    ]
}

Converting html code to a JavaScript object in node.js with html-to-json

So I wanted to make a simple tool to run through all of my blog posts that have been parsed into html, and find certain values such as the word count of each post. In other words I want to create a collection of objects, one for each html file, or have a way to convert from HTML to a JSON format. So there should be some kind of dependency in the npmjs ecosystem that I can use to quickly turn html into an object type form that I can work with in a node environment, similar to what I can work with in a browser or client side JavaScript environment.

With that being said I took a little time to see what there is to work with, and after doing so I found a few projects that work nicely. However in this post I will mostly be writing about an npm package called html-to-json. This package has a method that I can feed an html string, and what is returned is a workable object.

1 — Basic example of html-to-json in node.

So of course as always the first thing is to install the package into a node project folder. I assume that you know the basics of setting up a new node project folder. If not, this is not the post to start out with, as it does not cover the basics of using node and its default package manager, npm.

$ npm install html-to-json --save

After that I wanted to make my typical hello world example of how to get started with html to json. As such I put together a simple example to just test out how it works, by seeing if I can just pull the text from a single paragraph element just for starters.

So with that said there is the parse method of this project, where the first argument is an html string, and the second argument is an object that serves as a filter. A third argument can be given that is a callback, but the method also returns a promise. The resulting object that is given via the callback, or via the resolved promise, contains what I would expect.

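var htmlToJson = require('html-to-json');

// parse an html string, each key of the filter object becomes a key of the result
htmlToJson.parse('<p>This is only an example</p>', {
    p: function (doc) {
        // get the text of the paragraph element
        return doc.find('p').text();
    }
}).then(function (result) {
    console.log(result.p); // 'This is only an example'
});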

So far so good, it looks like this project is more or less what I had in mind, but let's look at a few more examples just for the hell of it.

2 — Converting many files to javaScript objects

To do this I used another JavaScript dependency called node-dir, which comes in handy when I want to grab the content of many files that exist in a complex file structure. I wrote a post on it if you want to learn more about how to loop over files recursively with it.

There are other options that can be used to walk the contents of a file system, and in fact I am not sure I can say node-dir is the best option when it comes to file system walkers. I wrote a post on the subject of file system walking that you might want to check out for other options. In any case I just need a way to loop over the contents of a file system recursively, open each html file, and then use this project to parse the html into a workable object.

Anyway, using node-dir with html-to-json I was able to quickly build the json report that I wanted.

var htmlToJson = require('html-to-json'),
fs = require('fs'),
dir = require('node-dir'),
results = [],
source = './html',
jsonFN = './report.json';

// using the readFiles method in node-dir
dir.readFiles(source,
    // a function to call for each file in the path
    function (err, content, fileName, next) {
        // if an error happens log it
        if (err) {
            console.log(err);
        }
        // log current filename
        console.log(fileName);
        // using html-to-json's parse method
        htmlToJson.parse(content, {
            // include the filename
            fn: fileName,
            // get the text of the title element of my post
            title: function (doc) {
                return doc.find('title').text().replace(/\n/g, '').trim();
            },
            // getting word count
            wc: function (doc) {
                // finding word count by getting text of all p elements
                return doc.find('p').text().split(' ').length;
            }
        }).then(function (result) {
            // store the result, and only then move on to the next file, so
            // that every result is in place before the report is written
            results.push(result);
            next();
        }, function (e) {
            // log a failed parse and keep going
            console.log(e);
            next();
        });
    },
    // called once all files have been read
    function () {
        // write out a json file (fs.writeFile needs a callback)
        fs.writeFile(jsonFN, JSON.stringify(results), 'utf-8', function (e) {
            if (e) {
                console.log(e);
            }
        });
    });
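
One thing to keep in mind with the wc filter is that splitting on a single space treats a newline between two words as part of one word, and counts every extra space as an empty word. Here is a minimal sketch of a more forgiving counter. The countWords name is my own hypothetical helper and not part of html-to-json or node-dir, and it only assumes that it is given a plain string.

```javascript
// Hypothetical helper, not part of html-to-json or node-dir.
// Counts words by splitting on any run of whitespace (spaces, tabs, newlines).
function countWords(text) {
    var trimmed = text.trim();
    // an empty string would otherwise split into [''] and count as one word
    if (trimmed === '') {
        return 0;
    }
    return trimmed.split(/\s+/).length;
}

console.log(countWords('This is  only\nan example')); // 5
console.log(countWords('   ')); // 0
```

If that seems useful, the body of the wc filter could return countWords(doc.find('p').text()) in place of the split on a single space.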

3 — Conclusion

So I might want to work on my content analysis tool some more, as a great deal more comes to mind than just the word count of my posts. It seems that targeting the right string of keywords that people are searching for is a lot more important than a high word count. Anyway this is a great solution for helping me with the task of converting html to json, and I hope this post helped you.

Linux Mint Forums

convert html file to json

Post by kost » Mon Oct 13, 2014 11:01 am

I have exported bookmarks from Firefox on Windows as an html file. I know that Firefox on Linux Mint supports json files. How can I import the html bookmark file into Firefox on Linux Mint? Can I convert the html file to json?

Last edited by LockBot on Wed Dec 28, 2022 7:16 am, edited 2 times in total.
Reason: Topic automatically closed 6 months after creation. New replies are no longer allowed.

Re: convert html file to json

Post by karlchen » Mon Oct 13, 2014 11:11 am

I have got no idea how to convert an html file into a json file or vice versa. But I can tell you that you can import Firefox bookmarks, saved in an html file, in Firefox on Linux Mint as well.
Press Ctrl+Shift+O. This will open the «Library».
Inside the «Library», click on «Import and Export». Click on «Import bookmarks from html...».

Re: convert html file to json

Post by kost » Mon Oct 13, 2014 1:47 pm

karlchen wrote: Hello, kost.

I have got no idea how to convert an html file into a json file or vice versa. But I can tell you that you can import Firefox bookmarks, saved in an html file, in Firefox on Linux Mint as well.
Press Ctrl+Shift+O. This will open the «Library».
Inside the «Library», click on «Import and Export». Click on «Import bookmarks from html...».
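
For anyone who still wants the links from an exported bookmarks file in json form, a rough sketch in node might look like the one below. It assumes the export contains plain A elements with a double-quoted HREF attribute. The bookmarksHtmlToJson name and the url/title output shape are made up for illustration, and the result is not the format Firefox uses for its own json bookmark backups, so it cannot be imported back into Firefox.

```javascript
// Sketch only: pulls <A HREF="...">Title</A> entries out of a bookmarks export.
// The function name and output shape are hypothetical, and the output is NOT
// the format of a Firefox json bookmark backup.
function bookmarksHtmlToJson(html) {
    // case-insensitive, so both <A HREF=...> and <a href=...> are matched
    var linkPattern = /<a\s[^>]*href="([^"]*)"[^>]*>([\s\S]*?)<\/a>/gi;
    var bookmarks = [];
    var match;
    while ((match = linkPattern.exec(html)) !== null) {
        bookmarks.push({
            url: match[1],
            // drop any nested tags inside the link text
            title: match[2].replace(/<[^>]*>/g, '').trim()
        });
    }
    return JSON.stringify(bookmarks, null, 2);
}

// a tiny made-up sample in the style of an exported bookmarks file
var sample = '<DL><p>\n' +
    '<DT><A HREF="https://linuxmint.com/" ADD_DATE="1413194460">Linux Mint</A>\n' +
    '<DT><A HREF="https://forums.linuxmint.com/">Forums</A>\n' +
    '</DL><p>';

console.log(bookmarksHtmlToJson(sample));
```

A regular expression is good enough for a flat list of links like this, but it ignores folders and does not decode html entities in titles, so a real html parser would be the safer choice for anything beyond a quick one-off.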
